Causal Thinking in Observational Studies: Matching, Propensity Scores, and IPTW Explained

Mayta
Jun 27

When we want to know whether a treatment truly causes a better outcome, especially in observational studies where treatment is not randomly assigned, a statistical association alone is not enough. We need causal thinking, and we need the right methods. This guide walks you through model-based adjustment, standardisation, matching, balancing scores, propensity scores, and IPTW, all in one place, explained simply.

🔹 Model-Based Adjustment

What it is: You build a regression model to estimate the treatment effect while adjusting for covariates.

How it works:

“Let's model outcome = treatment + age + BP + diabetes...”
Unless you add interaction terms, this assumes the treatment effect is the same across all subgroups (no effect modification).

When to use: You trust your model, have mostly continuous variables, and assume no big effect variation between subgroups.
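
To make the "outcome = treatment + covariates" idea concrete, here is a minimal sketch using only the Python standard library. The data are made up, the helper names (`solve_linear_system`, `ols`) are hypothetical, and a real analysis would use a statistics package; the point is only that the coefficient on treatment is the adjusted effect.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix so row operations update both sides at once.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        known = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - known) / m[i][i]
    return x


def ols(rows, y):
    """Least-squares coefficients via the normal equations (X'X) beta = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Toy data (made up): outcome = 2.0 - 1.5 * treatment + 0.1 * age, no noise.
data = [(0, 50), (1, 55), (0, 60), (1, 65), (0, 70), (1, 52), (1, 71), (0, 58)]
rows = [[1.0, float(t), float(age)] for t, age in data]  # [intercept, treatment, age]
y = [2.0 - 1.5 * t + 0.1 * age for t, age in data]

intercept, treatment_effect, age_slope = ols(rows, y)
print("adjusted treatment effect:", round(treatment_effect, 3))
```

Because the model has a single treatment coefficient, it reports one effect for everyone, which is exactly the "no effect modification" assumption.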

🔹 Standardisation

What it is: You split people into groups (e.g., age 60–70, 70–80), calculate the treatment effect in each group, then take a weighted average, with each group weighted by its share of the population you want the answer for.

Key point: It allows for effect modification (e.g., statins might help older patients more).

Limitations:

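Only works well with categorical covariates; continuous ones have to be cut into bins first.
Too many strata → few people per stratum → some strata end up with only treated or only untreated people, which is a positivity violation.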
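
The stratify-then-average idea can be sketched in a few lines. This is an illustration on made-up records, assuming a binary outcome (1 = event) and weighting each stratum by its share of the whole sample; the function name `standardised_effect` is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def standardised_effect(records):
    """records: (stratum, treated, outcome) tuples.

    Returns the stratum-size-weighted average of the within-stratum
    difference in mean outcome (treated minus untreated).
    """
    groups = defaultdict(lambda: {0: [], 1: []})
    for stratum, treated, outcome in records:
        groups[stratum][treated].append(outcome)

    total = len(records)
    effect = 0.0
    for stratum, arms in groups.items():
        # A stratum with only one arm cannot be compared: positivity fails.
        if not arms[0] or not arms[1]:
            raise ValueError(f"positivity violated in stratum {stratum!r}")
        diff = mean(arms[1]) - mean(arms[0])
        weight = (len(arms[0]) + len(arms[1])) / total
        effect += weight * diff
    return effect


# Made-up data: the treatment helps more in the older stratum.
records = [
    ("60-70", 1, 1), ("60-70", 1, 0), ("60-70", 0, 1), ("60-70", 0, 1),
    ("70-80", 1, 0), ("70-80", 1, 0),
    ("70-80", 0, 1), ("70-80", 0, 1), ("70-80", 0, 1), ("70-80", 0, 1),
]
print(standardised_effect(records))
```

Here the within-stratum differences are -0.5 and -1.0, so the two age groups are allowed to have different effects before they are averaged.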

🔹 Matching

What it is: For every treated person, you find one (or more) untreated person(s) who look very similar in covariates.

Benefit: Doesn’t rely on an outcome model. Instead, it mimics a randomized trial by design, at least for the covariates you matched on.

Challenge: Matching on many variables is hard—especially when they’re a mix of continuous and categorical.
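
As a rough illustration of covariate matching, the sketch below does greedy 1:1 nearest-neighbour matching without replacement. Several things here are assumptions rather than part of the method described above: the covariates are taken to be already scaled to comparable units, distance is plain Euclidean, and the `caliper` (maximum allowed distance) is a hypothetical tuning parameter.

```python
import math


def greedy_match(treated, controls, caliper):
    """Pair each treated id with its nearest unused control.

    treated / controls: dict mapping id -> tuple of scaled covariates.
    Treated people with no control within `caliper` stay unmatched.
    """
    available = dict(controls)  # copy, so the caller's dict is untouched
    pairs = []
    for t_id, t_x in treated.items():
        if not available:
            break
        c_id = min(available, key=lambda c: math.dist(t_x, available[c]))
        if math.dist(t_x, available[c_id]) <= caliper:
            pairs.append((t_id, c_id))
            del available[c_id]  # without replacement
    return pairs


# Made-up, already-scaled covariates: (age_z, diabetes)
treated = {"T1": (0.0, 1.0), "T2": (2.0, 0.0), "T3": (9.0, 9.0)}
controls = {"C1": (0.1, 1.0), "C2": (1.9, 0.0), "C3": (5.0, 1.0)}

pairs = greedy_match(treated, controls, caliper=0.5)
print(pairs)
```

T3 has no close control and is dropped, which is the "sample may shrink" cost of matching. Greedy matching also depends on the order in which treated people are processed; optimal matching algorithms avoid that.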

🔹 Balancing Score

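What it is: A score b(X) that summarizes a patient’s covariates so that, among people with the same score, the covariates are distributed the same way in the treated and untreated groups. In that sense the groups are “balanced.”

Example:

Patients A and B have the same balancing score → any remaining differences between them in age, LDL, or diabetes are unrelated to who got treated.
Makes them comparable for treatment effect estimation.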

🔹 Propensity Score (p(X))

What it is: A special type of balancing score—it’s the probability of receiving treatment, given the person’s covariates.

Example:

A patient with p(X) = 0.85 → they had an 85% chance of getting statins based on their age, LDL, etc.

Use:

Match people with similar p(X)
Stratify by p(X)
Weight them using p(X) → leads to IPTW
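
Propensity scores are usually estimated with logistic regression of treatment on the covariates. The sketch below fits one by plain gradient ascent so that it needs nothing beyond the standard library; the data, learning rate, and step count are made up for illustration, and in practice you would use a statistics package instead.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def propensity(beta, x):
    """p(X): modelled probability of treatment for covariate vector x."""
    return sigmoid(beta[0] + sum(b * v for b, v in zip(beta[1:], x)))


def fit_logistic(xs, treated, lr=0.1, steps=5000):
    """Fit intercept + slopes by gradient ascent on the mean log-likelihood."""
    n, k = len(xs), len(xs[0])
    beta = [0.0] * (k + 1)
    for _ in range(steps):
        grad = [0.0] * (k + 1)
        for x, t in zip(xs, treated):
            err = t - propensity(beta, x)
            grad[0] += err
            for j, v in enumerate(x):
                grad[j + 1] += err * v
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Made-up data: one covariate (standardised age); older people are
# treated more often, but the groups overlap.
xs = [[-1.5], [-1.2], [-1.0], [-0.5], [0.0], [0.2], [0.5], [1.0], [1.2], [1.5]]
treated = [0, 0, 0, 1, 0, 1, 1, 1, 0, 1]

beta = fit_logistic(xs, treated)
scores = [propensity(beta, x) for x in xs]
for x, t, p in zip(xs, treated, scores):
    print(f"age_z={x[0]:+.1f} treated={t} p(X)={p:.2f}")
```

Each person gets a single number between 0 and 1, which is what you then match on, stratify by, or turn into a weight.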

🔹 Why We Use Bell Curve Plots for Propensity Score

After calculating propensity scores, we plot the distribution of p(X) separately for the treated and untreated groups, as histograms or density curves. These often look like “bell curves,” although they don’t have to.

What we check:

Do the curves overlap a lot? Good! We can compare.
Do they barely touch? Bad. We can’t make fair comparisons.

We only analyze people in the region of common support, where treated and untreated groups have overlapping p(X). This improves fairness but may reduce sample size, and the result then applies only to the kind of patients who remain.
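
One simple way to define the region of common support is the min/max rule: keep the range of p(X) that both groups cover. That rule is an assumption of this sketch (other trimming rules exist), and the scores and function names are made up for illustration.

```python
def common_support(ps_treated, ps_untreated):
    """Return (low, high): the range of p(X) covered by both groups."""
    low = max(min(ps_treated), min(ps_untreated))
    high = min(max(ps_treated), max(ps_untreated))
    if low > high:
        raise ValueError("treated and untreated scores do not overlap")
    return low, high


def trim(scores, low, high):
    """Keep only the people whose score lies inside [low, high]."""
    return {pid: p for pid, p in scores.items() if low <= p <= high}


# Made-up propensity scores keyed by patient id.
treated_ps = {"T1": 0.35, "T2": 0.60, "T3": 0.92}
untreated_ps = {"U1": 0.08, "U2": 0.40, "U3": 0.75}

low, high = common_support(treated_ps.values(), untreated_ps.values())
kept_treated = trim(treated_ps, low, high)
kept_untreated = trim(untreated_ps, low, high)
print(low, high, sorted(kept_treated), sorted(kept_untreated))
```

T3 (almost certain to be treated) and U1 (almost certain not to be) have no counterpart in the other group, so they are set aside.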

🔹 Checking Balance After PS

After matching, stratifying, or weighting by PS, we must check covariate balance, one covariate at a time, using standardized differences (stddiff): the difference in group means divided by the pooled standard deviation.

Rule of thumb:

|stddiff| < 0.1 → balanced
|stddiff| ≥ 0.1 → meaningful imbalance remains; revise the PS model or the matching and check again

This is essential before estimating any treatment effect.
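
For a continuous covariate, the standardized difference is commonly computed as the difference in means divided by the square root of the average of the two group variances. The sketch below uses that form on made-up ages; the function name `std_diff` is hypothetical, and binary covariates or weighted samples need a slightly different formula.

```python
import math
from statistics import mean, variance


def std_diff(treated_values, untreated_values):
    """Standardized mean difference for one continuous covariate."""
    pooled_sd = math.sqrt(
        (variance(treated_values) + variance(untreated_values)) / 2
    )
    diff = mean(treated_values) - mean(untreated_values)
    if pooled_sd == 0:
        # No spread in either group: balanced only if the means agree.
        return 0.0 if diff == 0 else math.copysign(math.inf, diff)
    return diff / pooled_sd


# Made-up ages before any matching: treated patients are clearly older.
treated_age = [70, 72, 68, 74]
untreated_age = [60, 62, 58, 64]

d = std_diff(treated_age, untreated_age)
print(f"stddiff = {d:.2f} ->", "balanced" if abs(d) < 0.1 else "imbalanced")
```

You would run this for every covariate, before and after applying the propensity score, and expect the values to move toward zero.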

🔹 IPTW (Inverse Probability of Treatment Weighting)

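What it is: A technique that creates a “pseudo-population” in which treatment no longer depends on the measured covariates, as if it had been randomly assigned, by weighting each person by the inverse of the probability of the treatment they actually received.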

Weights:

For treated → 1 / p(X)
For untreated → 1 / (1 - p(X))

Why it's powerful:

Uses the whole dataset (unlike matching, which may discard cases).
Balances groups on the measured covariates so you can compare outcomes as if they were randomized; unmeasured confounders are not fixed by weighting.
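
The weights above translate directly into code. This sketch computes them and then compares weighted mean outcomes between the groups; the propensity scores and outcomes are made up, and the weighted difference in means is just one simple estimator among several.

```python
def iptw_weights(treated, ps):
    """1 / p(X) for treated people, 1 / (1 - p(X)) for untreated people."""
    return [1.0 / p if t == 1 else 1.0 / (1.0 - p) for t, p in zip(treated, ps)]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def iptw_effect(treated, ps, outcome):
    """Weighted mean outcome in the treated minus that in the untreated."""
    w = iptw_weights(treated, ps)
    t_idx = [i for i, t in enumerate(treated) if t == 1]
    u_idx = [i for i, t in enumerate(treated) if t == 0]
    mean_treated = weighted_mean([outcome[i] for i in t_idx], [w[i] for i in t_idx])
    mean_untreated = weighted_mean([outcome[i] for i in u_idx], [w[i] for i in u_idx])
    return mean_treated - mean_untreated


# Made-up example: four people, binary outcome (1 = event).
treated = [1, 1, 0, 0]
ps = [0.8, 0.5, 0.5, 0.2]
outcome = [1.0, 0.0, 1.0, 0.0]

weights = iptw_weights(treated, ps)
effect = iptw_effect(treated, ps, outcome)
print(weights, round(effect, 3))
```

People who received the treatment they were unlikely to get carry larger weights, so a score very close to 0 or 1 can produce a huge weight; it is worth looking at the largest weights before trusting the estimate.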

🔹 Final Workflow (Putting It All Together)

1. Estimate propensity scores using covariates that influence both treatment and outcome.
2. Check overlap using bell-curve plots (region of common support).
3. Choose a method:
- Match on p(X)
- Stratify on p(X)
- Weight using IPTW
4. Check balance using standardized differences.
5. Estimate causal effect using outcome models (with or without weights).

✅ Summary Table

Method	Core Idea	Best For
Model-Based	Regression + adjustment	Simple structure, no effect modification
Standardisation	Grouping + averaging	Allows effect modification
Matching	Pair similar individuals	Precise but sample may shrink
Propensity Score	Chance of treatment	Enables match/stratify/weight
IPTW	Weighting to mimic randomization	Full-sample causal estimation