
From Chaos to Causal Clarity: Balancing Scores and Propensity Scores Explained Simply

  • Writer: Mayta
  • 4 hours ago

Introduction: Why This Matters

Imagine you're trying to compare two treatments, but your patients weren't randomized. You're worried that the two groups are different in important ways—older vs. younger, sicker vs. healthier. You want to know: Is it fair to compare them?

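This is where balancing scores, especially the propensity score, come in. These tools let you adjust observational data so that treated and untreated patients look alike on the covariates you measured, mimicking the logic of randomization. Unlike randomization, they cannot balance traits you did not measure.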

Let’s walk through it—step by step—with simple analogies and clinical examples.

🧠 Step 1: Balancing Score – “How Similar Are They?”

💡 Concept:

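A balancing score b(X) is a summary of a patient's background traits (covariates, X) with a special property: among patients who share the same score, the covariates are distributed the same way in the treated group (D = 1) and the untreated group (D = 0). Two individual patients with equal scores need not have identical traits, but treated and untreated groups with the same score are alike on average. In symbols, treatment is independent of the covariates given the score: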

$$D \perp\!\!\!\perp X \mid b(X)$$

When the score is a single number, it lets you say: "I don't need to match on 10+ variables, just on this one score."
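A small worked example makes the balancing property concrete. The Python sketch below uses a made-up population with two covariates (age group and diabetes) and, as its balancing score, the true propensity score that Step 2 introduces. It computes exact shares with fractions instead of simulating patients; every number in it is hypothetical.

```python
from fractions import Fraction as F

# Hypothetical population (all numbers invented for illustration).
# Covariate pattern (age_group, has_diabetes) -> (share of population, true p(X))
POPULATION = {
    ("young", False): (F(4, 10), F(2, 10)),
    ("young", True): (F(1, 10), F(6, 10)),
    ("old", False): (F(2, 10), F(6, 10)),
    ("old", True): (F(3, 10), F(9, 10)),
}


def share_old(treated, stratum=None):
    """Exact P(age_group == 'old' | D = treated), optionally within p(X) == stratum."""
    num, den = F(0), F(0)
    for (age, _diabetes), (weight, p) in POPULATION.items():
        if stratum is not None and p != stratum:
            continue  # outside the requested propensity-score stratum
        # P(this covariate pattern and this treatment status)
        joint = weight * (p if treated else 1 - p)
        den += joint
        if age == "old":
            num += joint
    return num / den


# Ignoring the score, treated patients are much more often "old"...
print(share_old(True), share_old(False))  # 39/53 11/47
# ...but within the stratum p(X) = 0.6 the age mix is identical in both groups.
print(share_old(True, F(6, 10)), share_old(False, F(6, 10)))  # 2/3 2/3
```

In the p(X) = 0.6 stratum, young patients with diabetes and older patients without it are mixed in the same 1 : 2 ratio among treated and untreated alike, which is what $D \perp\!\!\!\perp X \mid b(X)$ says.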

🏥 Clinical Example:

  • Patient A: Age 68, LDL 180, diabetes → looks like a typical statin patient.

  • Patient B: Age 67, LDL 185, diabetes → also looks like a statin patient.

Even though only one of them got a statin, their balancing scores are similar, so they are fair to compare.

🍪 Everyday Analogy:

You're comparing people who eat cookies vs. those who don’t.

  • Cookie eaters = younger, less stressed, and better rested

  • Balancing score = “Cookie-likeness”

Among people with the same balancing score, cookie eaters and non-eaters have the same mix of background traits.

🎯 Step 2: Propensity Score – “How Likely Is Treatment?”

💡 Concept:

The propensity score is the probability that a patient receives a treatment, given their covariates.

$$p(X) = \Pr(D = 1 \mid X)$$

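It is the coarsest balancing score: it compresses all the covariates into a single number and still balances them. If there are no unmeasured confounders, treated and untreated patients with the same p(X) are exchangeable, even though only one of them got treated.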
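In real data p(X) is unknown and has to be estimated, most commonly with a logistic regression of treatment on the covariates. The sketch below assumes that model and fits it by plain gradient ascent in standard-library Python so it runs without extra packages. The ten patients, the age rescaling, and the learning-rate and iteration settings are illustrative assumptions; in a real analysis you would use a statistics package's logistic regression.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def propensity(beta, x):
    """p(X) = Pr(D = 1 | X) under a logistic model; beta[0] is the intercept."""
    return sigmoid(beta[0] + sum(b * xj for b, xj in zip(beta[1:], x)))


def fit_propensity(X, d, lr=0.5, n_iter=5000):
    """Fit the logistic model by batch gradient ascent on the mean log-likelihood."""
    n = len(X)
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x, di in zip(X, d):
            err = di - propensity(beta, x)  # observed minus predicted treatment
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        beta = [b + lr * g / n for b, g in zip(beta, grad)]
    return beta


# Hypothetical patients: (age in years, diabetes 0/1) and statin use (1 = treated).
PATIENTS = [(72, 1), (68, 1), (65, 0), (70, 0), (55, 1),
            (50, 0), (48, 0), (62, 1), (58, 0), (45, 1)]
STATIN = [1, 1, 1, 0, 1, 0, 0, 0, 1, 0]

# Rescale age to decades around 60 so the optimiser is well behaved.
X = [[(age - 60) / 10, diab] for age, diab in PATIENTS]
beta = fit_propensity(X, STATIN)
ps = [propensity(beta, x) for x in X]
for (age, diab), d, p in zip(PATIENTS, STATIN, ps):
    print(f"age={age} diabetes={diab} statin={d} p(X)={p:.2f}")
```

Because the model has an intercept, the fitted probabilities add up to the number of treated patients once the fit has converged, which is a quick sanity check.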

🏥 Clinical Example:

  • Patient A: p(X) = 0.88 → likely to get statin

  • Patient B: p(X) = 0.86 → likely too, but didn’t

These two are a good pair to compare: their chances of being treated were almost the same.

🍪 Everyday Analogy:

Propensity Score = “How likely are you to be a cookie eater?”

  • Based on traits (age, stress, sleep)

  • If p(X) = 0.9 → you probably eat cookies

  • Match you to someone with p(X) = 0.9 who doesn’t → fair test of cookie impact
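Matching on p(X) can be sketched as greedy 1:1 nearest-neighbour matching without replacement, one of several possible matching schemes. The caliper (the largest p(X) gap allowed within a pair) of 0.05 is an arbitrary illustrative choice, and the IDs and scores are hypothetical; A and B echo the statin patients above.

```python
def match_nearest(treated, controls, caliper=0.05):
    """Greedy 1:1 nearest-neighbour matching on p(X), without replacement.

    treated, controls: lists of (patient_id, p_x) pairs.
    Returns (pairs, unmatched_treated_ids).
    """
    available = dict(controls)  # control id -> p(X); removed once used
    pairs, unmatched = [], []
    for t_id, t_ps in treated:
        if available:
            # Closest remaining control on the propensity-score scale
            c_id = min(available, key=lambda c: abs(available[c] - t_ps))
            if abs(available[c_id] - t_ps) <= caliper:
                pairs.append((t_id, c_id))
                del available[c_id]
                continue
        unmatched.append(t_id)  # nobody close enough: leave this patient out
    return pairs, unmatched


# Hypothetical scores; A and B mirror the statin example above.
treated = [("A", 0.88), ("C", 0.40), ("E", 0.97)]
controls = [("B", 0.86), ("D", 0.43), ("F", 0.10), ("G", 0.30)]
pairs, unmatched = match_nearest(treated, controls)
print(pairs)      # [('A', 'B'), ('C', 'D')]
print(unmatched)  # ['E']
```

Patient E has no untreated counterpart within the caliper and is left unmatched. Greedy matching depends on the order in which treated patients are processed; optimal matching avoids that at extra computational cost.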

🪟 Step 3: Region of Common Support – “Who Can We Fairly Compare?”

💡 Concept:

Not everyone has a match. Some patients are very likely to get treatment and others very unlikely, so the opposite group may contain nobody comparable to them. These extreme patients can't be fairly compared.

So, we trim the extremes and keep only the overlapping zone, called the Region of Common Support (RCS).
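One common way to define the region of common support is the min-max rule: keep patients whose p(X) lies between the larger of the two groups' minimum scores and the smaller of their maximum scores. The Python sketch below assumes that rule (other trimming rules exist) and uses made-up scores.

```python
def common_support(ps, treated):
    """Min-max rule: the p(X) range that both treated and untreated patients reach."""
    ps_t = [p for p, d in zip(ps, treated) if d == 1]
    ps_c = [p for p, d in zip(ps, treated) if d == 0]
    low = max(min(ps_t), min(ps_c))
    high = min(max(ps_t), max(ps_c))
    return low, high  # if low > high, there is no overlap at all


def trim(ps, low, high):
    """Indices of patients whose p(X) lies inside [low, high]."""
    return [i for i, p in enumerate(ps) if low <= p <= high]


# Hypothetical estimated propensity scores (1 = got a statin).
PS = [0.99, 0.95, 0.70, 0.40, 0.25, 0.80, 0.50, 0.20, 0.05, 0.01]
TREATED = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

low, high = common_support(PS, TREATED)
kept = trim(PS, low, high)
print(low, high)  # 0.25 0.8
print(kept)       # [2, 3, 4, 5, 6]
```

If low ends up above high, the groups share no p(X) range at all and trim returns an empty list.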

🏥 Clinical Example:

  • Some patients: p(X) = 0.99 → almost always get a statin

  • Others: p(X) = 0.01 → almost never get one

  • No overlap = can’t compare

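Keep only patients whose p(X) falls in the range where both groups appear (say 0.2–0.9). Within that range every patient has plausible counterparts in the other group, so comparisons are fair, provided there are no important unmeasured confounders. The resulting estimate applies to this overlapping population.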

🍪 Everyday Analogy:

If all cookie eaters have p(X) = 0.9–1.0 and all non-cookie eaters have p(X) = 0–0.1 → no overlap.

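You can’t compare someone who almost always eats cookies with someone who almost never would. With zero overlap, as here, there is nothing left to compare; in practice the overlap is usually partial, and you keep only the people in the shared middle range.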

📏 Step 4: Balance Diagnostics – “Did Matching Work?”

💡 Concept:

After matching or stratifying by propensity score, we need to check balance: are the groups truly similar?

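We use standardized differences (stddiff). For each covariate, the stddiff is the difference in means between treated and untreated patients divided by a pooled standard deviation, so covariates on different scales (years, mg/dL, yes/no) can be judged on one yardstick.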

  • Good balance: |stddiff| < 0.1 (a common rule of thumb, not a strict cutoff)

  • Poor balance: |stddiff| ≥ 0.1 → re-specify the propensity model or refine the matching
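For a continuous covariate, the standardized difference is usually computed as $d = (\bar{x}_T - \bar{x}_C) / \sqrt{(s_T^2 + s_C^2)/2}$, where $\bar{x}$ and $s^2$ are the sample mean and variance in the treated (T) and control (C) groups. Here is a minimal Python sketch with invented LDL values; for 0/1 covariates, many authors plug in $p(1-p)$ instead of the sample variance.

```python
import math
from statistics import mean, variance


def stddiff(x_treated, x_control):
    """Standardized difference: mean difference over the pooled SD (sample variances)."""
    pooled_sd = math.sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


def is_balanced(d, threshold=0.1):
    """Rule-of-thumb check: |stddiff| below the threshold counts as balanced."""
    return abs(d) < threshold


# Hypothetical LDL values (mg/dL) in a matched sample.
ldl_statin = [180, 185, 170, 175]
ldl_no_statin = [178, 172, 182, 168]
d = stddiff(ldl_statin, ldl_no_statin)
print(round(d, 2), is_balanced(d))  # 0.39 False
```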

🏥 Clinical Example:

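In Stata, a balance check with the user-written pbalchk command might look like this (statin is a placeholder name for your treatment indicator, and comsup==1 keeps only patients inside the region of common support):

xi: pbalchk statin age LDL diabetes ... if comsup==1, strata(PS_strata) graph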

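The graph option plots each covariate's standardized difference, so you can see at a glance whether any covariate remains imbalanced once the propensity-score strata are taken into account.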
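Outside Stata, the idea behind a stratified balance check can be sketched by hand: sort patients into strata by p(X), then compute the standardized difference within each stratum. This is an illustrative approach, not a description of what pbalchk computes internally; the twelve patients are hypothetical, and two strata are used only to keep the example small (five strata, i.e. quintiles, are a common choice).

```python
import math
from statistics import mean, variance


def stddiff(x_treated, x_control):
    """Mean difference divided by the pooled SD (sample variances)."""
    pooled_sd = math.sqrt((variance(x_treated) + variance(x_control)) / 2)
    return (mean(x_treated) - mean(x_control)) / pooled_sd


def ps_strata(ps, n_strata):
    """Assign each patient to one of n_strata roughly equal-sized groups by rank of p(X)."""
    order = sorted(range(len(ps)), key=lambda i: ps[i])
    strata = [0] * len(ps)
    for rank, i in enumerate(order):
        strata[i] = rank * n_strata // len(ps)
    return strata


def stddiff_by_stratum(x, treated, strata):
    """Standardized difference of covariate x within each propensity-score stratum."""
    result = {}
    for s in sorted(set(strata)):
        x_t = [xi for xi, d, si in zip(x, treated, strata) if si == s and d == 1]
        x_c = [xi for xi, d, si in zip(x, treated, strata) if si == s and d == 0]
        if len(x_t) >= 2 and len(x_c) >= 2:  # a sample variance needs 2+ values
            result[s] = stddiff(x_t, x_c)
    return result


# Hypothetical patients: estimated p(X), treatment (1 = statin) and age in years.
PS = [0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.65, 0.70, 0.75, 0.80, 0.85]
TREATED = [0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1]
AGE = [49, 50, 55, 51, 54, 53, 68, 69, 72, 70, 71, 70]

strata = ps_strata(PS, n_strata=2)
overall = stddiff([a for a, d in zip(AGE, TREATED) if d == 1],
                  [a for a, d in zip(AGE, TREATED) if d == 0])
print(round(overall, 2))                          # 0.63
print(stddiff_by_stratum(AGE, TREATED, strata))   # {0: 0.0, 1: 0.0}
```

Overall, treated patients are older (stddiff ≈ 0.63) because more of them sit in the high-p(X) stratum, yet within each stratum age is perfectly balanced. Real data will not produce exact zeros.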

🍪 Everyday Analogy:

You’ve grouped cookie and non-cookie people by cookie-likeness (p(X)).

Now check: Do they still differ in sleep, age, stress?

If yes → something’s off. If no → you’re ready to compare the outcome (say, happiness) fairly.

🔬 Recap with Realistic Numbers

| Step | Example | Insight |
|---|---|---|
| Balancing Score | A & B look similar | Use one score instead of 10 variables |
| Propensity Score (p(X)) | p = 0.88 vs. 0.86 | Close enough to match |
| Region of Common Support | p = 0.2–0.9 range kept | Outliers dropped, fair zone only |
| Stddiff < 0.1 | LDL: 0.07, Age: 0.03 | Good: matching worked |

