Outcome-Driven Regression: Choosing the Right Model Based on Y in Clinical Research
- Mayta
- Jun 25
Why start with Y?
Because the distribution and scale of the outcome (not the exposure) dictate:
The likelihood function behind the model.
Which effect measure is naturally produced (mean diff, OR, RR, HR, IRR …).
The mathematical assumptions we must check to keep inference valid.
Below is a tiered roadmap, moving from the simplest outcomes to the most complex, with compact Stata code, interpretation cues, and the key diagnostic you should run every single time.
1. Continuous Outcomes (e.g., blood pressure, length‑of‑stay)
Model | Default effect | Core Stata syntax |
Gaussian (linear) regression | β = mean difference | regress Y X1 X2 |
Robust / Huber‑White covariance | Same β, heteroskedasticity‑robust SE and CI | regress Y X1 X2, vce(robust) |
Linear mixed‑effects (cluster / repeated) | Subject‑specific β | mixed Y X1 X2 || id: |
Assumptions & Checks
Linearity – overlay a lowess smoother and the straight‑line fit on the scatter: twoway (scatter Y X1) (lowess Y X1) (lfit Y X1).
Homoskedasticity – rvfplot, look for a “megaphone” (fan‑shaped) spread of residuals.
Normal residuals – predict r, resid followed by qnorm r, name(res).
Independence – if violated, use vce(cluster id) SEs, GEE, or mixed.
Interpretation nugget: lincom 10*X1 immediately gives the change in mean Y (with its CI) per 10‑unit increase in X1, which is far more clinically intuitive than the raw per‑unit coefficient.
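As a minimal sketch of the continuous‑outcome workflow above, the block below uses lbw, one of Stata's shipped teaching datasets (an assumption made purely for illustration; it is not data from this article). Birth weight in grams (bwt) plays the role of Y and maternal weight in pounds (lwt) plays the role of X1.

```stata
* Illustrative only: lbw is a Stata example dataset, not the article's data
webuse lbw, clear

* Fit the plain linear model first so the regress-only diagnostics are available
regress bwt lwt i.smoke
rvfplot, yline(0)          // residuals vs fitted: look for a fan shape
predict r, resid
qnorm r                    // normal quantile plot of the residuals

* Refit with Huber-White standard errors and rescale the coefficient
regress bwt lwt i.smoke, vce(robust)
lincom 10*lwt              // mean change in bwt (g) per 10-lb increase in lwt
```

The point estimates are identical in the two fits; only the standard errors and confidence intervals change.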
2. Binary Outcomes (e.g., stroke yes/no)
Model (choose by estimand) | Effect | Stata |
Logistic regression | Odds Ratio (OR) | logistic Y i.exposure controls |
Modified Poisson (Poisson + robust SE; an alternative when log‑binomial fails to converge) | Risk Ratio (RR) | glm Y i.exposure, fam(poisson) link(log) vce(robust) eform |
Binomial identity‑link (linear probability) model | Risk Difference (RD) | glm Y i.exposure, fam(bin) link(id) vce(robust) |
Assumptions
Correct link (logit/log).
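No complete separation – watch for a note such as “note: 1.exposure != 0 predicts success perfectly” (or “predicts failure perfectly”); if seen, consider Firth's penalised likelihood via the user‑written firthlogit (ssc install firthlogit).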
Independence; if matched, switch to clogit.
Rare outcome? OR ≈ RR. If common (>10 %), prefer RR/RD models.
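To see why the rare‑outcome rule of thumb matters, here is a small arithmetic sketch. The baseline risks (1 % and 30 %) and the risk ratio of 1.5 are hypothetical values chosen for illustration, not results from any study.

```stata
* Hypothetical risks, for illustration only
foreach p0 in 0.01 0.30 {
    local p1 = 1.5*`p0'                              // exposed risk, true RR = 1.5
    local or = (`p1'/(1-`p1'))/(`p0'/(1-`p0'))       // odds ratio implied by p0 and p1
    display "baseline risk = `p0'   RR = 1.5   OR = " %5.3f `or'
}
```

With a 1 % baseline risk the odds ratio is about 1.51, essentially the risk ratio. With a 30 % baseline risk the same risk ratio corresponds to an odds ratio of about 1.91, which overstates the effect if it is read as a relative risk.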
Post‑estimation essentials
estat gof, group(10) // Hosmer‑Lemeshow calibration
lroc // Discrimination (AUC)
3. Count Outcomes (Events without person‑time)
Model | Effect | When to choose |
Poisson | Incidence Rate Ratio (IRR) | Variance ≈ mean |
Negative Binomial | IRR | Over‑dispersion (α > 0) |
poisson events i.treatment, irr
estat gof // if Pearson χ² / df >> 1 → over‑dispersed
nbreg events i.treatment, irr
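The Poisson versus negative binomial choice can be rehearsed on simulated data. The sketch below is only an illustration: the sample size, seed, coefficients, and gamma variance are all assumed values, and the variable names mirror the placeholders used above.

```stata
* Simulated data, for illustration only (every parameter value is an assumption)
clear
set seed 12345
set obs 1000
generate treatment = runiform() < 0.5

* Gamma-mixed Poisson counts are negative binomial; this gamma has mean 1, variance 0.8
generate mu = exp(0.5 + 0.3*treatment) * rgamma(1/0.8, 0.8)
generate events = rpoisson(mu)

poisson events i.treatment, irr
estat gof                  // compare the Pearson chi2 with its degrees of freedom
nbreg events i.treatment, irr
* The nbreg output ends with an LR test of alpha = 0 (Poisson vs negative binomial)
```

Because the counts were generated with extra‑Poisson variation, the Pearson statistic should be far larger than its degrees of freedom and the test of alpha = 0 should reject, pointing to nbreg.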
4. Rates (Events per person‑time)
Just add the exposure (offset) term:
gen log_pt = ln(pyrs)
poisson events i.treatment, offset(log_pt) irr vce(robust)
Key assumption – correct person‑time; mismeasured denominators will bias the IRR, whatever model you run.
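A short sketch of the offset idea, assuming Stata's shipped dollhill3 example dataset (deaths and person‑years by smoking status and age band; used here for illustration only, not data from this article). It shows that exposure() and a hand‑made log offset are two spellings of the same model.

```stata
* Illustrative only: dollhill3 is a Stata example dataset
webuse dollhill3, clear

* Option A: let Stata take the log of person-years
poisson deaths i.smokes i.agecat, exposure(pyears) irr

* Option B: supply log person-years yourself as an offset
generate log_pt = ln(pyears)
poisson deaths i.smokes i.agecat, offset(log_pt) irr
```

Both commands return the same IRRs; exposure() simply saves the generate step.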
5. Time‑to‑Event Outcomes
Model | Effect | Checks |
Cox proportional hazards | Hazard Ratio (HR) | PH assumption (estat phtest, stphplot) |
Flexible parametric (stpm2) | Time‑varying HR, RMST | Use when PH fails |
Fine & Gray competing‑risk (stcrreg) | Sub‑distribution HR | For competing events |
Bare‑bones Cox workflow
stset time, fail(death==1)
stcox i.exposure confounders, vce(robust)
estat phtest
stphplot, by(exposure)
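If PH is violated for exposure, fit a flexible parametric model (stpm2 is user‑written: ssc install stpm2; it does not accept factor‑variable notation, so exposure is assumed to be coded 0/1):
stpm2 exposure, df(4) scale(hazard) tvc(exposure) dftvc(2)
After stcox, plot adjusted survival: stcurve, survival at1(exposure=0) at2(exposure=1)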
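The bare‑bones Cox workflow can be tried end to end on drugtr, a small drug‑trial dataset shipped with Stata (assumed here purely for illustration). In this sketch drug stands in for the exposure and age for the confounders.

```stata
* Illustrative only: drugtr is a Stata example dataset
webuse drugtr, clear
stset studytime, failure(died)

stcox drug age
estat phtest, detail       // Schoenfeld-residual test per covariate, plus a global test
stphplot, by(drug)         // log-log plot: roughly parallel curves support PH

* Adjusted survival curves by arm, with age held at its mean
stcurve, survival at1(drug=0) at2(drug=1)
```

A small p‑value for a covariate in estat phtest, or clearly crossing log‑log curves, is the cue to move to a model that allows a time‑varying effect.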
6. Ordered Categorical Outcomes (e.g., mild / mod / severe)
Model | Assumption | Test |
Ordinal (ologit) | Proportional odds (PO) | oparallel |
Generalised ordered (gologit2) | — | Works when PO fails |
ologit stage i.exposure otherVars, or
oparallel // user‑written: ssc install oparallel
If the proportional‑odds test fails, install gologit2 (ssc install gologit2) and refit:
gologit2 stage i.exposure, autofit or
7. Unordered Categorical Outcomes (>2 classes, no natural order)
mlogit outcome i.exposure covs, baseoutcome(0) rrr
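Interpret each RRR (relative‑risk ratio) as the multiplicative change in the risk of that outcome category relative to the base category, per one‑unit change in the covariate.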
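A hedged sketch of a multinomial fit, assuming Stata's shipped sysdsn1 example dataset, where insure is a three‑level insurance type coded 1 to 3 (illustration only; the variable choices are not from this article). Because RRRs are relative to the base category, predicted probabilities from margins are often easier to communicate.

```stata
* Illustrative only: sysdsn1 is a Stata example dataset
webuse sysdsn1, clear

* Category 1 is the base outcome; rrr reports relative-risk ratios
mlogit insure age i.male i.nonwhite, baseoutcome(1) rrr

* Average predicted probability of category 2, by sex
margins male, predict(outcome(2))
```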
8. Repeated Measures / Clusters
Population‑averaged (GEE)
xtgee Y i.exposure time, i(id) corr(exchangeable) ///
family(gaussian) link(identity) vce(robust)
Subject‑specific (Mixed)
mixed Y i.exposure##c.time || id: time, cov(un)
Choose GEE when you care about the population‑averaged effect; choose mixed when you need subject‑specific effects or individual trajectories.
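Both approaches can be compared on the same data. The sketch below assumes Stata's shipped pig example dataset (weekly weights of pigs, identified by id; illustration only), with week standing in for time.

```stata
* Illustrative only: pig is a Stata example dataset
webuse pig, clear
xtset id week

* Population-averaged: GEE with exchangeable working correlation
xtgee weight week, family(gaussian) link(identity) corr(exchangeable) vce(robust)

* Subject-specific: random intercept and random slope for week
mixed weight week || id: week, cov(unstructured)
```

With an identity link the two slopes for week should be very close, because marginal and conditional coefficients coincide in linear models. The distinction starts to matter with non‑linear links such as logit, where subject‑specific coefficients are typically larger in magnitude than population‑averaged ones.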
9. Putting It All Together – Decision Flow
Identify Y (scale, distribution, censoring, clustering).
Pick the default model from the matching table above.
Run diagnostics listed for that model.
If an assumption breaks, upgrade (robust SE, NB, stpm2, gologit2, mixed).
Report effect + 95 % CI + assumption checks so readers trust your estimates.
Copy‑and‑Save “Assumption Check‑Block” Template
/* 1. Fit chosen model */
<model command here>
/* 2. Linearity & LOESS */
twoway (scatter Y X1) (lowess Y X1)
/* 3. Influential points (Cook's distance; available after regress only) */
predict cooksd, cooksd
graph box cooksd
/* 4. Residual diagnostics (rvfplot is also regress-only) */
predict resid, residuals
hist resid, normal
rvfplot
/* 5. Robustness */
<same model>, vce(robust)
Final Take‑aways
Outcome first, always – let Y tell you the family/link.
Think of assumptions as contracts; violate them and the effect you quote can’t be trusted.
When in doubt, add vce(robust) to protect the standard errors, or move to a semiparametric / non‑parametric alternative; robust SEs do not repair a mis‑specified mean model.
Document every decision (model choice, diagnostics, fixes) – it’s the best prophylaxis against reviewer infection.
Happy modelling — and may your residuals forever be well‑behaved!