
Outcome-Driven Regression: Choosing the Right Model Based on Y in Clinical Research


Why start with Y?

Because the distribution and scale of the outcome (not the exposure) dictate:

  1. The likelihood function behind the model.
  2. Which effect measure is naturally produced (mean diff, OR, RR, HR, IRR …).
  3. The mathematical assumptions we must check to keep inference valid.

Below is a tiered roadmap, moving from the simplest outcomes to the most complex, with compact Stata code, interpretation cues, and the key diagnostic you should run every single time.


1. Continuous Outcomes  (e.g., blood pressure, length‑of‑stay)

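| Model | Default effect | Core Stata syntax |
| --- | --- | --- |
| Gaussian (linear) regression | β = mean difference | regress Y X1 X2 |
| Robust / Huber‑White covariance | Same β; heteroskedasticity‑robust SE (the CI may widen or narrow) | regress Y X1 X2, vce(robust) |
| Linear mixed‑effects (cluster / repeated) | Subject‑specific β | mixed Y X1 X2 (plus the random‑effects part; see Section 8) |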

Assumptions & Checks

  1. Linearity – twoway (scatter Y X1) (lowess Y X1); add (lfit Y X1) to compare against the straight‑line fit.
  2. Homoskedasticity – rvfplot after regress; a “megaphone” (fan) shape signals non‑constant variance.
  3. Normal residuals – predict r, resid, then qnorm r.
  4. Independence – if violated, use cluster‑robust SEs (vce(cluster id)), GEE, or mixed.

Interpretation nugget: lincom _b[X1]*10 immediately gives the change in Y per 10‑unit increase in X1, together with its CI. This is usually far more clinically intuitive than the raw per‑unit coefficient.
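As a minimal sketch of the rescaling idea, assuming Stata's bundled auto dataset stands in for clinical data (price plays the role of Y and weight the role of X1; both are illustrative choices, not from the text above), the following shows that lincom on a multiple of the coefficient matches a refit with the predictor rescaled:

```stata
* Illustrative only: auto.dta ships with Stata; price and weight stand in for Y and X1
sysuse auto, clear

regress price weight i.foreign, vce(robust)
lincom 100*weight                      // change in price per 100-lb increase, with 95% CI
scalar per100_lincom = r(estimate)

* Same quantity obtained by rescaling the predictor before fitting
generate double weight100 = weight/100
regress price weight100 i.foreign, vce(robust)
scalar per100_refit = _b[weight100]

display "lincom: " per100_lincom "   refit: " per100_refit
```

The lincom route avoids creating extra variables and keeps the model on its original scale, which is convenient when several clinically meaningful increments need reporting.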


2. Binary Outcomes  (e.g., stroke yes/no)

| Model (choose by estimand) | Effect | Stata |
| --- | --- | --- |
| Logistic regression | Odds Ratio (OR) | logistic Y i.exposure controls |
| Modified Poisson (Poisson + robust SE; an alternative to log‑binomial) | Risk Ratio (RR) | glm Y i.exposure, fam(poisson) link(log) vce(robust) eform |
| Binomial model with identity link (linear‑probability type) | Risk Difference (RD) | glm Y i.exposure, fam(bin) link(id) vce(robust) |

Assumptions

Post‑estimation essentials

estat gof, group(10)     // Hosmer‑Lemeshow calibration
lroc                    // Discrimination (AUC)
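To see how the three estimands differ on the same data, here is a small simulated sketch. The variable names, the seed, and the risks of 0.30 (unexposed) and 0.45 (exposed) are assumptions chosen purely for illustration. With an outcome this common, the OR sits further from 1 than the RR:

```stata
* Simulated data; all numbers below are illustrative assumptions
clear
set seed 2024
set obs 2000
generate byte exposure = runiform() < 0.5
generate byte Y = runiform() < cond(exposure, 0.45, 0.30)

logistic Y i.exposure                                          // odds ratio
scalar or_hat = exp(_b[1.exposure])

glm Y i.exposure, family(poisson) link(log) vce(robust) eform  // risk ratio (modified Poisson)
scalar rr_hat = exp(_b[1.exposure])

glm Y i.exposure, family(binomial) link(identity) vce(robust)  // risk difference
scalar rd_hat = _b[1.exposure]

display "OR = " %5.2f or_hat "   RR = " %5.2f rr_hat "   RD = " %5.3f rd_hat
```

Reporting the OR as if it were an RR would overstate the effect here, which is the practical reason for choosing the model by estimand.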


3. Count Outcomes  (Events without person‑time)

| Model | Effect | When to choose |
| --- | --- | --- |
| Poisson | Incidence Rate Ratio (IRR) | Variance ≈ mean |
| Negative Binomial | IRR | Over‑dispersion (α > 0) |
poisson events i.treatment, irr
estat gof           // if Pearson χ² / df >> 1 → over‑dispersed
nbreg events i.treatment, irr
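The over‑dispersion check can be made explicit by computing the Pearson χ² / df ratio by hand. The sketch below simulates gamma‑mixed Poisson counts (so over‑dispersion is built in); the seed, sample size, and coefficients are illustrative assumptions:

```stata
* Simulated over-dispersed counts; all parameter values are illustrative assumptions
clear
set seed 12345
set obs 500
generate byte treatment = runiform() < 0.5
generate double frailty = rgamma(2, 0.5)                 // mean 1, variance 0.5
generate events = rpoisson(exp(0.5 + 0.3*treatment) * frailty)

poisson events i.treatment, irr
estat gof                                                // deviance and Pearson goodness-of-fit tests

* Pearson dispersion computed by hand: sum((y - mu)^2 / mu) / residual df
predict double mu, n
generate double pearson_i = (events - mu)^2 / mu
quietly summarize pearson_i
scalar dispersion = r(sum) / (e(N) - e(rank))
display "Pearson chi2 / df = " %5.2f dispersion

nbreg events i.treatment, irr                            // output includes an LR test of alpha = 0
```

A ratio well above 1 points towards nbreg; a ratio near 1 suggests the Poisson variance assumption is reasonable.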


4. Rates  (Events per person‑time)

Just add the exposure (offset) term:

gen log_pt = ln(pyrs)
poisson events i.treatment, offset(log_pt) irr vce(robust)

Key assumption – correct person‑time; mismeasured denominators will bias the IRR, whatever model you run.
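Stata also accepts the person‑time variable directly through exposure(), which takes the log internally. The following sketch, on simulated data with assumed rates and follow‑up times, fits the model both ways to show that the two specifications give the same estimate:

```stata
* Simulated cohort; the baseline rate, effect size and follow-up range are assumptions
clear
set seed 4321
set obs 1000
generate byte treatment = runiform() < 0.5
generate double pyrs = 1 + 9*runiform()                  // 1 to 10 person-years each
generate events = rpoisson(0.10 * exp(-0.4*treatment) * pyrs)

* Route 1: explicit log person-time offset
generate double log_pt = ln(pyrs)
poisson events i.treatment, offset(log_pt) irr vce(robust)
scalar b_offset = _b[1.treatment]

* Route 2: exposure() applies ln() for you
poisson events i.treatment, exposure(pyrs) irr vce(robust)
scalar b_exposure = _b[1.treatment]

display "IRR via offset():   " %6.3f exp(b_offset)
display "IRR via exposure(): " %6.3f exp(b_exposure)
```

Using exposure() removes one opportunity for error, such as forgetting to log‑transform the denominator.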


5. Time‑to‑Event Outcomes

| Model | Effect | Checks / when to use |
| --- | --- | --- |
| Cox proportional hazards | Hazard Ratio (HR) | PH assumption (estat phtest, stphplot) |
| Flexible parametric (stpm2) | Time‑varying HR, RMST | Use when PH fails |
| Fine & Gray competing‑risk | Sub‑distribution HR | For competing events |

Bare‑bones Cox workflow

stset time, fail(death==1)
stcox i.exposure confounders, vce(robust)
estat phtest
stphplot, by(exposure)

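If PH is violated for exposure, fit a flexible parametric model. stpm2 is user‑written (ssc install stpm2) and does not accept factor‑variable notation, so enter exposure as a 0/1 indicator rather than i.exposure:

stpm2 exposure, df(4) scale(hazard) tvc(exposure) dftvc(2)

Plot adjusted survival after stcox: stcurve, survival at1(exposure=0) at2(exposure=1)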
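For a runnable version of the bare‑bones Cox workflow, the sketch below simulates survival times from an exponential model, so proportional hazards holds by construction. The variable names mirror the snippets above; the seed, hazard, and censoring scheme are assumptions made for illustration:

```stata
* Simulated survival data; PH holds by construction (exponential hazards)
clear
set seed 99
set obs 500
generate byte exposure = runiform() < 0.5
generate double age = rnormal(60, 10)
generate double t_event = -ln(runiform()) / (0.05 * exp(0.7*exposure + 0.02*(age - 60)))
generate double t_cens = 30*runiform()                   // administrative censoring
generate byte death = t_event <= t_cens
generate double time = min(t_event, t_cens)

stset time, failure(death==1)
stcox i.exposure age
scalar hr_exposure = exp(_b[1.exposure])

estat phtest, detail                                     // per-covariate and global Schoenfeld tests
display "HR for exposure = " %5.2f hr_exposure
```

Because the data were generated under PH, a small p‑value from estat phtest here would be a chance finding; on real data, pair the test with stphplot rather than relying on the p‑value alone.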


6. Ordered Categorical Outcomes  (e.g., mild / mod / severe)

| Model | Assumption | Test |
| --- | --- | --- |
| Ordinal (ologit) | Proportional odds (PO) | oparallel |
| Generalised ordered (gologit2) | Works when PO fails | |
ologit stage i.exposure otherVars, or
oparallel           // user-written: ssc install oparallel

If the test rejects PO, install gologit2 (ssc install gologit2) and refit:

gologit2 stage i.exposure, autofit or
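When the user‑written tests are not installed, an informal look at the PO assumption is possible with built‑in commands only: dichotomise the outcome at each threshold and compare the ORs. This is a swapped‑in heuristic, not a replacement for oparallel. The sketch uses simulated data in which PO holds by construction; the cut‑points, effect size, and seed are assumptions:

```stata
* Simulated ordinal outcome from a latent logistic variable (PO holds by construction)
clear
set seed 777
set obs 2000
generate byte exposure = runiform() < 0.5
generate double u = runiform()
generate double latent = 0.8*exposure + ln(u/(1 - u))    // logistic error term
generate byte stage = 1 + (latent > -0.5) + (latent > 1) // 1 = mild, 2 = moderate, 3 = severe

ologit stage i.exposure, or
scalar b_ologit = _b[1.exposure]

* Informal PO check: one binary logistic model per threshold
generate byte ge2 = stage >= 2
generate byte ge3 = stage == 3
logistic ge2 i.exposure                                  // OR for moderate-or-worse vs mild
logistic ge3 i.exposure                                  // OR for severe vs mild-or-moderate
```

If the two threshold‑specific ORs are similar to each other and to the ologit OR, a single summary OR is a fair description; clearly different ORs are the pattern that gologit2 is designed to accommodate.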


7. Unordered Categorical Outcomes  (>2 classes, no natural order)

mlogit outcome i.exposure covs, baseoutcome(0) rrr

Interpret each RRR (relative‑risk ratio) as the multiplicative change in the risk of that category relative to the base category.


8. Repeated Measures / Clusters

Population‑averaged (GEE)

xtgee Y i.exposure time, i(id) corr(exchangeable) ///
      family(gaussian) link(identity) vce(robust)

Subject‑specific (Mixed)

mixed Y i.exposure##c.time || id: time, cov(un)

Choose GEE when you care about the population‑averaged effect (the average difference between exposed and unexposed groups); choose mixed when you need subject‑specific effects or individual trajectories.
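For a continuous outcome with an identity link, the population‑averaged and subject‑specific coefficients estimate the same quantity, so the two approaches should agree closely; the distinction matters most with non‑linear links such as logit. A minimal sketch on simulated repeated measures (random intercept only; all parameter values and the seed are assumptions):

```stata
* Simulated longitudinal data: 200 subjects, 4 visits each, random intercept
clear
set seed 2468
set obs 200
generate int id = _n
generate byte exposure = runiform() < 0.5
generate double u_i = rnormal(0, 2)                      // subject-level random intercept
expand 4
bysort id: generate byte time = _n - 1
generate double Y = 120 - 5*exposure + 1.5*time + u_i + rnormal(0, 3)

xtset id time
xtgee Y i.exposure time, family(gaussian) link(identity) ///
      corr(exchangeable) vce(robust)
scalar b_gee = _b[1.exposure]

mixed Y i.exposure time || id:
scalar b_mixed = _b[Y:1.exposure]

display "GEE: " %6.3f b_gee "   mixed: " %6.3f b_mixed
```

With a binary outcome the same comparison would typically show the subject‑specific OR further from 1 than the population‑averaged OR, which is where the choice of estimand becomes consequential.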


9. Putting It All Together – Decision Flow

  1. Identify Y (scale, distribution, censoring, clustering).
  2. Pick the default model from the relevant table above.
  3. Run diagnostics listed for that model.
  4. If an assumption breaks, upgrade (robust SE, NB, stpm2, gologit2, mixed).
  5. Report effect + 95 % CI + assumption checks so readers trust your estimates.

Copy‑and‑Save “Assumption Check‑Block” Template

/* 1. Fit chosen model */
<model command here>

/* 2. Linearity & LOESS */
twoway (scatter Y X1) (lowess Y X1)

/* 3. Influential points (Cook's distance is available after regress) */
predict cooksd, cooksd
graph box cooksd

/* 4. Residual diagnostics (rvfplot also requires regress) */
predict resid, residuals
hist resid, normal
rvfplot

/* 5. Robustness */
<same model>, vce(robust)


Final Take‑aways

Happy modelling — and may your residuals forever be well‑behaved!
