Model Selection Algorithm for Clinical Regression
- Mayta
- Jun 25
Step 1. Identify the Outcome Variable (Y)
Continuous – e.g., SBP, HbA1c, time, cost
Binary – e.g., death, readmission (yes/no)
Count – e.g., seizures, falls, hospitalizations
Rate – e.g., events per person-time
Time-to-event – e.g., survival, time-to-discharge
Ordinal – e.g., NYHA class, pain severity
Nominal (unordered) – e.g., cancer types
Repeated / Clustered – any above, but longitudinal or hierarchical
Recurrent events / competing risks – multiple times/events per subject
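Before assigning Y to one of these categories, it can help to look at how the variable is stored and distributed. A minimal sketch, using Stata's bundled auto dataset purely as a stand-in (price, foreign, and rep78 are illustrative substitutes for a clinical outcome, not variables from this post):

```stata
* Illustrative only: the auto dataset stands in for a clinical dataset
sysuse auto, clear

* Storage type and labels for the candidate outcomes
describe price foreign rep78

* Continuous outcome: check the range, skewness, and outliers
summarize price, detail

* Binary outcome: confirm there are exactly two levels and note the prevalence
tabulate foreign

* Ordinal outcome: confirm the ordered categories and look for sparse levels
tabulate rep78, missing
```

A variable with two levels points to the binary row, a handful of ordered levels to the ordinal row, and a skewed non-negative integer to the count row.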
Step 2. Apply Y-Type Logic
Outcome Y | Model | Stata Command | Assumption | Upgrade if Violated |
Continuous | regress | regress Y X | Linearity, normal residuals | robust, glm, mixed |
Binary | logit / glm | logit Y X; glm Y X, fam(bin) link(log) | No complete separation | firthlogit, clogit |
Count | poisson / nbreg | poisson Y X; nbreg Y X | Mean = variance (poisson) | use nbreg |
Rate (event/time) | poisson, offset(log_time) | poisson Y X, offset(log_time) | Correct exposure | vce(robust) |
Time-to-event | stcox, streg, stpm2 | stcox X; streg X, dist(...) (both after stset) | PH assumption | stpm2, aft, frailty |
Ordinal | ologit | ologit Y X | Proportional odds | gologit2 |
Nominal | mlogit | mlogit Y X | Independence of irrelevant alternatives (IIA) | — |
Repeated / Cluster | xtgee, mixed | xtgee Y X, i(id); `mixed Y X || id:` | | |
Recurrent | stcox + shared(id) or strata(order) | stcox X, strata(event); stcox X, shared(id) | Order or frailty matters | PWP, AG, frailty |
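Two rows of the table depend on a setup step that is easy to miss: the rate model needs a log-transformed exposure variable, and the survival commands need the data declared with stset, after which the outcome is no longer named in the command. The sketch below illustrates both on simulated data; every variable name and parameter value in it is an assumption made for illustration.

```stata
* Simulated data; all names and parameter values are illustrative
clear
set seed 12345
set obs 500
generate x = rnormal()

* Rate outcome: event counts over varying follow-up time
generate time = runiform(0.5, 5)
generate events = rpoisson(time * exp(-1 + 0.3*x))
generate log_time = ln(time)
poisson events x, offset(log_time) irr

* Same model: exposure() takes the raw time and applies the log itself
poisson events x, exposure(time) irr

* Time-to-event outcome: exponential event times, censored at t = 10
generate t = -ln(runiform()) / (0.1 * exp(0.5*x))
generate died = t < 10
replace t = 10 if t > 10

* Declare the survival structure, then list covariates only
stset t, failure(died)
stcox x
estat phtest
```

The two poisson calls should return the same incidence-rate ratios, which is a quick way to confirm that the offset was built correctly.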
Step 3. Decision Flow

Is Y a time-to-event outcome (e.g., time to death or recurrence)?
→ Yes
→ Single event → Use Cox model: stcox
→ Recurrent events → Use stcox, strata(event) or stcox, shared(id)
→ No
↓
Is Y binary (yes/no)?
→ Yes
→ Is the outcome rare (≤10%)?
→ Yes → Use logistic regression: logit (the odds ratio then approximates the risk ratio)
→ No → Use Poisson with robust SE to estimate risk ratios directly: glm Y X, fam(poisson) link(log) vce(robust)
→ No
↓
Is Y a count (e.g., # seizures)?
→ Yes
→ Is variance ≈ mean?
→ Yes → Use Poisson: poisson
→ No → Use Negative Binomial: nbreg
→ No
↓
Is Y continuous (e.g., SBP)?
→ Yes
→ Are the observations independent?
→ Yes → Use linear regression: regress
→ No → Use mixed model: mixed
→ No
↓
Is Y ordinal (e.g., mild/mod/severe)?
→ Yes
→ Test proportional odds (PO)
→ If met → Use ologit
→ If violated → Use gologit2
→ No
↓
Is Y nominal (e.g., cancer type)?
→ Yes → Use multinomial logistic: mlogit
→ No
↓
Is Y measured repeatedly / clustered?
→ Yes
→ Want population-average effect? → xtgee
→ Want subject-specific effect? → mixed
→ No
↓
Is Y recurrent / composite time-based?
→ Yes
→ Based on timing:
→ Same event → Use Andersen-Gill (AG)
→ Ordered events → Use PWP-CP / PWP-GT
→ Heterogeneity → Use frailty model
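Two branches of the flow above turn on a number that can be read off the data: the prevalence of a binary outcome and the mean-to-variance relationship of a count. A minimal sketch of both checks, written as a do-file on simulated data (variable names, the seed, and the generating parameters are assumptions for illustration; the 10% cut-off is the one used in the flow):

```stata
* Simulated data; all names and parameter values are illustrative
clear
set seed 2025
set obs 1000
generate x = rnormal()

* Binary branch: pick the model from the observed prevalence
generate y = rbinomial(1, invlogit(-0.5 + 0.4*x))
quietly summarize y
display "Prevalence of y = " r(mean)
if r(mean) <= 0.10 {
    * Rare outcome: odds ratios from logistic regression
    logit y x, or
}
else {
    * Common outcome: risk ratios from Poisson with robust SE
    glm y x, family(poisson) link(log) vce(robust) eform
}

* Count branch: compare the mean and variance before choosing
generate count = rnbinomial(2, 0.4)
quietly summarize count
display "mean = " r(mean) "   variance = " r(Var)

* nbreg reports a likelihood-ratio test of alpha = 0;
* a small p-value favors negative binomial over Poisson
nbreg count x
```

The raw mean and variance are only a first look, since the Poisson assumption concerns the variance conditional on the covariates; the test of alpha reported by nbreg addresses that more directly.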
📌 Built-in Quality Checks
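After selecting a model, run the checks that match it:
predict r, resid – after regress: store the residuals
hist r, normal – residual histogram with a normal overlay
rvfplot – after regress: residuals vs. fitted values
estat gof – after logit or poisson: goodness of fit
estat phtest – after stcox: test of the proportional-hazards assumption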
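Each of these checks only works after the estimation command it belongs to, so the order matters. A short sketch of the sequence for the linear and logistic cases, again using the bundled auto dataset as a stand-in (the variables and the choice of 10 groups are illustrative assumptions):

```stata
* Illustrative only: the auto dataset stands in for a clinical dataset
sysuse auto, clear

* Linear regression, then residual diagnostics
regress price mpg weight
predict r, resid
histogram r, normal
rvfplot, yline(0)

* Formal test for heteroskedasticity, complementing the plot
estat hettest

* Logistic regression, then the Hosmer-Lemeshow goodness-of-fit test
logit foreign mpg weight
estat gof, group(10)
```

If the residual plot fans out or estat hettest rejects constant variance, the table's upgrade for the continuous row, robust standard errors, is the usual first step.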