How to Build a Clinical Prediction Model (CPM) in Stata: Step-by-Step with Stata Code
- Mayta

- 3 days ago
Updated: 2 days ago
Steps to Developing a Clinical Prediction Model (CPM)
Choose predictors & run forward/backward selection
Fit the final logistic model cleanly
Read α (intercept) and β’s from Stata
Write the prediction equation
Generate LP and risk in Stata
Everything below is done in Stata only.
Step 1 – Start with development data and candidate predictors
clear
use "your_development_data.dta", clear
* Inspect variables
describe
summarize
Assume you want to predict death (0/1) from a set of candidate predictors:
age
sex
map
bun
alb
ettyn
etc.
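Before any selection, it can help to confirm how the outcome is coded and how much data is missing, because logit silently drops rows with a missing value in any model variable. A minimal sketch, assuming the variable names listed above exist in your dataset:

```stata
* Sketch only: variable names follow the tutorial's example dataset
* Confirm the outcome is coded 0/1 and see how many values are missing
tabulate death, missing

* Count missing values in each candidate predictor
misstable summarize age sex map bun alb ettyn

* Number of complete cases, i.e. rows a model with all candidates can use
count if !missing(death, age, sex, map, bun, alb, ettyn)
```

If the complete-case count is much smaller than the dataset, the selection results may depend heavily on which rows were dropped.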
Step 2 – (Optional) Forward/backward selection
2.1 Forward selection example
* Forward selection: add terms with p < 0.05
stepwise, pe(0.05): logit death age sex map bun alb ettyn
2.2 Backward selection example
* Backward selection: remove terms with p >= 0.10
* (pr() on its own requests backward selection; stepwise has no "backward" option)
stepwise, pr(0.10): logit death age sex map bun alb ettyn
What happens here?
Stata runs many logistic models, adding or removing predictors.
At the end, it prints a final model with a subset of predictors.
The coefficient table of that final model lists only the predictors that survived selection.
👉 Important practice: After using stepwise, you should refit the final model directly, without stepwise, so you have a clean, reproducible model.
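One possible way to do that refit without retyping the variable list is sketched below. The names step_sample and kept are hypothetical, and the sketch assumes it is run as one block in a do-file (local macros do not survive between separately executed lines). Restricting the refit to the rows stepwise used is a design choice, not a requirement: it keeps the coefficients identical to the stepwise result, whereas dropping the restriction may bring back rows that were excluded only because of missing values in discarded predictors.

```stata
* Sketch only: backward selection as in 2.2, followed by a clean refit
stepwise, pr(0.10): logit death age sex map bun alb ettyn

* Flag the rows used in the stepwise fit
generate byte step_sample = e(sample)

* Names of the retained terms, minus the intercept
local kept : colnames e(b)
local cons _cons
local kept : list kept - cons
display "Retained predictors: `kept'"

* Clean refit on the same rows, without the stepwise prefix
logit death `kept' if step_sample
```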
Step 3 – Fit the final logistic model (clean run)
Suppose stepwise ended with: age, bun, alb, ettyn as final predictors.
Now fit that model directly:
logit death age bun alb ettyn
Stata reports something like:
------------------------------------------------------------------------------
       death | Coefficient  Std. err.      z    P>|z|     [95% conf. interval]
-------------+----------------------------------------------------------------
         age |      0.0150  ...
         bun |      0.0300  ...
         alb |     -0.9000  ...
       ettyn |      2.1000  ...
       _cons |     -5.0000  ...
------------------------------------------------------------------------------
Now you have:
α (intercept) = _b[_cons] = −5.0000
β_age = _b[age] = 0.0150
β_bun = _b[bun] = 0.0300
β_alb = _b[alb] = −0.9000
β_ettyn = _b[ettyn] = 2.1000
These are exactly the numbers that will appear in your prediction equation.
You can also explicitly grab them:
display _b[_cons]
display _b[age]
display _b[bun]
display _b[alb]
display _b[ettyn]
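If you are unsure what name to put inside _b[], Stata can show it for you. A short sketch, assuming the logit command from Step 3 is still the active estimation result:

```stata
* Sketch only: run after the logit command from Step 3
* Replay the model showing the _b[] name of every coefficient
logit, coeflegend

* List the full coefficient vector in one go
matrix list e(b)

* Show more decimals than the default output table prints
display %12.8f _b[_cons]
display %12.8f _b[age]
```

Copying coefficients with extra decimals reduces rounding error when the equation is later typed by hand.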
Step 4 – Write down the model mathematically
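With the illustrative coefficients from Step 3, the linear predictor (LP, the log-odds of death) and the predicted risk can be written as follows. The symbol p for the predicted probability is notation assumed here.

```latex
\begin{aligned}
\mathrm{LP} &= \operatorname{logit}(p) = \ln\frac{p}{1-p} \\
            &= -5.0000 + 0.0150\,\mathrm{age} + 0.0300\,\mathrm{bun} - 0.9000\,\mathrm{alb} + 2.1000\,\mathrm{ettyn} \\
p           &= \frac{1}{1 + e^{-\mathrm{LP}}}
\end{aligned}
```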
Coded as:
gen logodds = -5.0000 ///
+ 0.0150*age ///
+ 0.0300*bun ///
- 0.9000*alb ///
+ 2.1000*ettyn
gen prob = invlogit(logodds)
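To sanity-check the equation, it can be evaluated by hand for a single patient. The patient below is hypothetical (age 70, BUN 18, albumin 3.8, ettyn = 1) and the coefficients are the illustrative ones from Step 3:

```stata
* Sketch only: hypothetical patient, illustrative coefficients
scalar lp_example = -5.0000 + 0.0150*70 + 0.0300*18 - 0.9000*3.8 + 2.1000*1
display "LP   = " lp_example              // -4.73
display "risk = " invlogit(lp_example)    // roughly 0.009
```

The arithmetic is −5 + 1.05 + 0.54 − 3.42 + 2.1 = −4.73, which invlogit() converts to a predicted risk of a little under 1%.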
Step 5 – Centering (optional but common in prediction models)
Many clinical prediction models (like your MAGENTA) use centered predictors:
* In development data
summ age
local mean_age = r(mean)
gen cage = age - `mean_age'
summ bun
local mean_bun = r(mean)
gen cbun = bun - `mean_bun'
summ alb
local mean_alb = r(mean)
gen calb = alb - `mean_alb'
* Now fit using centered variables
logit death cage cbun calb ettyn
Suppose output is:
------------------------------------------------------------------------------
death | Coefficient
-------------+------------
cage | 0.0150
cbun | 0.0300
calb | -0.9000
ettyn | 2.1000
_cons | -4.2000
------------------------------------------------------------------------------
Then your equation is:
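Written out with the illustrative coefficients above, where an overline denotes the development-sample mean (notation assumed here):

```latex
\operatorname{logit}(p) = -4.2000
  + 0.0150\,(\mathrm{age} - \overline{\mathrm{age}})
  + 0.0300\,(\mathrm{bun} - \overline{\mathrm{bun}})
  - 0.9000\,(\mathrm{alb} - \overline{\mathrm{alb}})
  + 2.1000\,\mathrm{ettyn}
```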
Why center?
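Makes the intercept the log-odds for a “typical” patient: one with mean values of the centered predictors (all centered vars = 0) and ettyn = 0.
Centering does not change the predicted risks: the full LP is identical and only the intercept shifts. Each centered term has mean 0 in the development data, which gives a fixed reference point for the Debray Step 1 & Step 2 comparisons (LP mean and SD).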
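The link between the uncentered and centered intercepts can be made explicit. A short derivation, using α_c for the centered intercept (a symbol assumed here): expanding the centered equation and collecting the constant terms shows that the slopes are unchanged and only the intercept absorbs the means.

```latex
\begin{aligned}
\alpha_c + \sum_j \beta_j\,(x_j - \bar{x}_j) + \beta_{\mathrm{ettyn}}\,\mathrm{ettyn}
  &= \Bigl(\alpha_c - \sum_j \beta_j\,\bar{x}_j\Bigr) + \sum_j \beta_j\,x_j + \beta_{\mathrm{ettyn}}\,\mathrm{ettyn} \\
\Rightarrow\quad \alpha_c &= \alpha + \sum_j \beta_j\,\bar{x}_j \\
\text{e.g.}\quad \alpha_c &= -5.0000 + 0.0150(70.0) + 0.0300(18.0) - 0.9000(3.8) = -6.83
\end{aligned}
```

The numeric line combines the Step 3 coefficients with the illustrative means used in Step 6 (70.0, 18.0, 3.8). It gives −6.83, so the −4.2000 in the example output is best read as a separate placeholder rather than a value derived from the Step 3 table.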
Step 6 – From development model to prediction code (dev or validation)
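Once you have α and β’s from logit, the model is frozen: do not re-estimate the coefficients, and if you centered, keep subtracting the development-sample means rather than the means of the new data. To compute predictions (in development or validation):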
* Example using centered form, like MAGENTA
gen logodds = -4.2000 ///
+ 0.0150*(age - 70.0) ///
+ 0.0300*(bun - 18.0) ///
- 0.9000*(alb - 3.8) ///
+ 2.1000*ettyn
gen prob = invlogit(logodds)
summarize logodds prob
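Typing coefficients and means by hand invites transcription errors. One possible alternative, sketched below, stores them as scalars in the development data and reuses them after loading the validation data. The validation file name and all scalar names (m_age, a, b_age, and so on) are hypothetical, and the sketch assumes the validation file uses the same variable names and units.

```stata
* Sketch only: file names are placeholders, scalar names are hypothetical
use "your_development_data.dta", clear

* Store development means as scalars (scalars survive loading new data)
foreach v in age bun alb {
    summarize `v', meanonly
    scalar m_`v' = r(mean)
    generate double c`v' = `v' - scalar(m_`v')
}

* Fit once in development and freeze the coefficients
logit death cage cbun calb ettyn
scalar a       = _b[_cons]
scalar b_age   = _b[cage]
scalar b_bun   = _b[cbun]
scalar b_alb   = _b[calb]
scalar b_ettyn = _b[ettyn]

* Apply the frozen model to new patients without re-estimating anything
use "your_validation_data.dta", clear
generate double logodds = scalar(a)              ///
    + scalar(b_age)   * (age - scalar(m_age))    ///
    + scalar(b_bun)   * (bun - scalar(m_bun))    ///
    + scalar(b_alb)   * (alb - scalar(m_alb))    ///
    + scalar(b_ettyn) * ettyn
generate double prob = invlogit(logodds)

* LP mean and SD in the validation sample
summarize logodds prob
```

Wrapping each name in scalar() avoids a clash if the validation data happens to contain a variable with the same name. Note that clear all would erase the stored scalars, whereas use ..., clear does not.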
Key mapping:
logit death ... → Stata estimates α and β’s.
_b[_cons] → α (intercept)
_b[predictor] → β coefficients
Those values go into your handwritten formula for log-odds and probabilities.
Summary
* 1. Do your selection (forward/backward/clinical)
stepwise, pe(0.05): logit death age bun alb ettyn map /* etc */
* 2. Identify final set of predictors from stepwise output, then refit:
logit death age bun alb ettyn
* 3. Read coefficients:
display _b[_cons] // alpha
display _b[age] // beta_age
display _b[bun] // beta_bun
display _b[alb] // beta_alb
display _b[ettyn] // beta_ettyn
* 4. Write prediction equation and implement:
gen logodds = _b[_cons] ///
+ _b[age] * age ///
+ _b[bun] * bun ///
+ _b[alb] * alb ///
+ _b[ettyn]* ettyn
gen prob = invlogit(logodds)
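A quick way to confirm that the handwritten equation matches what Stata itself computes is to compare it with predict. A minimal sketch, assuming it runs right after the Summary code while the logit fit is still active; the names lp_check and p_check are hypothetical:

```stata
* Sketch only: cross-check the hand-coded LP and risk against predict
predict double lp_check, xb    // linear predictor computed by Stata
predict double p_check, pr     // predicted probability computed by Stata

* Hand-coded values should agree up to float rounding
assert reldif(logodds, lp_check) < 1e-5
assert reldif(prob, p_check)     < 1e-5
```

If either assert fails, the usual culprits are a mistyped coefficient, a wrong sign, or a predictor left out of the gen statement.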




