How to Build a Clinical Prediction Model (CPM) in Stata: Step-by-Step with Stata Code

Mayta

Steps to Developing a Clinical Prediction Model (CPM)

Choose predictors & run forward/backward selection
Fit the final logistic model cleanly
Read α (intercept) and β’s from Stata
Write the prediction equation
Generate LP and risk in Stata

Everything below is done in Stata only.

Step 1 – Start with development data and candidate predictors

clear
use "your_development_data.dta", clear

* Inspect variables
describe
summarize

Assume you want to predict death (0/1) from a set of candidate predictors:

age
sex
map
bun
alb
ettyn
etc.
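
Before any modelling, it can help to confirm how the outcome is coded and how much data is missing. The lines below are a minimal sketch, assuming the development data are already in memory, that death is coded 0/1, and that the six candidates listed above are numeric variables; the events-per-predictor figure is only a rough gauge of how much selection the data can support.

```stata
* Sketch only: assumes the development data are in memory,
* death is coded 0/1, and all six candidates are numeric.
tabulate death, missing                         // outcome coding, incl. missing outcomes
misstable summarize age sex map bun alb ettyn   // missing values per candidate predictor

* Number of events relative to the number of candidate predictors (6 here)
quietly count if death == 1
display "Events: " r(N) "   Events per candidate predictor: " %5.1f r(N)/6
```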

Step 2 – (Optional) Forward/backward selection

2.1 Forward selection example

* Forward selection: a predictor enters when its p-value is below 0.05
stepwise, pe(0.05): logit death age sex map bun alb ettyn

2.2 Backward selection example

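* Backward selection: a predictor is removed when its p-value is 0.10 or higher
* (stepwise has no "backward" option; specifying pr() alone requests backward selection)
stepwise, pr(0.10): logit death age sex map bun alb ettyn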

What happens here?

Stata runs many logistic models, adding or removing predictors.
At the end, it prints a final model with a subset of predictors.
The coefficient table of that final model shows you which predictors survived.

👉 Important practice: After using stepwise, you should refit the final model directly, without stepwise, so you have a clean, reproducible model.
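
One way to carry the selected predictors into the clean refit without retyping them is sketched below. It assumes that stepwise leaves the final model's results in e(), as estimation commands normally do, and the local macro name kept is only illustrative. Because local macros vanish between separate runs, the lines need to be executed together in one do-file.

```stata
* Sketch only: run these lines together in a single do-file.
stepwise, pr(0.10): logit death age sex map bun alb ettyn

* Column names of e(b) are the retained predictors plus _cons
local kept : colnames e(b)
local kept : subinstr local kept "_cons" "", all word
display "Retained predictors: `kept'"

* Clean refit of the selected model, without the stepwise prefix
logit death `kept'
```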

Step 3 – Fit the final logistic model (clean run)

Suppose stepwise ended with: age, bun, alb, ettyn as final predictors.

Now fit that model directly:

logit death age bun alb ettyn

Stata reports something like:

------------------------------------------------------------------------------
      death | Coefficient  Std. err.    z    P>|z|    [95% conf. interval]
-------------+----------------------------------------------------------------
        age |   0.0150     ...
        bun |   0.0300     ...
        alb |  -0.9000     ...
      ettyn |   2.1000     ...
       _cons|  -5.0000     ...
------------------------------------------------------------------------------

Now you have:

α (intercept) = _b[_cons] = −5.0000
β_age = _b[age] = 0.0150
β_bun = _b[bun] = 0.0300
β_alb = _b[alb] = −0.9000
β_ettyn = _b[ettyn] = 2.1000

These are exactly the numbers that will appear in your prediction equation.

You can also explicitly grab them:

display _b[_cons]
display _b[age]
display _b[bun]
display _b[alb]
display _b[ettyn]
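
If you prefer to see all coefficients at once or keep them for later use, the following sketch shows a few standard options. It assumes the final model from this step; the scalar names alpha and b_age are made up for illustration.

```stata
* Sketch only: assumes the variables from Step 3 are in memory.
quietly logit death age bun alb ettyn

matrix list e(b)                 // full coefficient vector, intercept last

* Keep selected coefficients as named scalars (names are illustrative)
scalar alpha = _b[_cons]
scalar b_age = _b[age]
display "alpha = " %7.4f alpha "   beta_age = " %7.4f b_age

logit, or                        // replay the same model as odds ratios
```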

Step 4 – Write down the model mathematically

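With the coefficients from Step 3, the prediction equation is:

log-odds(death) = α + β_age×age + β_bun×bun + β_alb×alb + β_ettyn×ettyn
                = −5.0000 + 0.0150×age + 0.0300×bun − 0.9000×alb + 2.1000×ettyn

risk = P(death = 1) = 1 / (1 + exp(−log-odds))

Coded as: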

gen logodds = -5.0000 ///
              + 0.0150*age ///
              + 0.0300*bun ///
              - 0.9000*alb ///
              + 2.1000*ettyn

gen prob = invlogit(logodds)
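
A quick way to catch typing mistakes in a handwritten equation is to compare it with Stata's own predictions. The sketch below assumes logit death age bun alb ettyn was the last estimation command in the session and that logodds and prob were generated as above; the variable names ending in _check and _diff are illustrative. The differences should be close to zero, with any remainder reflecting how many decimals of each coefficient were typed.

```stata
* Sketch only: requires the Step 3 logit results to be the active estimates.
predict double xb_check, xb      // linear predictor computed by Stata
predict double pr_check, pr      // predicted probability computed by Stata

gen double lp_diff = logodds - xb_check
gen double pr_diff = prob - pr_check

* Both differences should be near zero apart from coefficient rounding
summarize lp_diff pr_diff
```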

Step 5 – Centering (optional but common in prediction models)

Many clinical prediction models (MAGENTA, for example) use centered predictors, i.e. each continuous predictor minus its mean in the development data:

* In development data
summ age
local mean_age = r(mean)
gen cage = age - `mean_age'

summ bun
local mean_bun = r(mean)
gen cbun = bun - `mean_bun'

summ alb
local mean_alb = r(mean)
gen calb = alb - `mean_alb'

* Now fit using centered variables
logit death cage cbun calb ettyn

Suppose output is:

------------------------------------------------------------------------------
      death | Coefficient
-------------+------------
       cage  |   0.0150
       cbun  |   0.0300
       calb  |  -0.9000
      ettyn  |   2.1000
       _cons |  -4.2000
------------------------------------------------------------------------------

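Then your equation is:

log-odds(death) = −4.2000 + 0.0150×(age − mean_age) + 0.0300×(bun − mean_bun) − 0.9000×(alb − mean_alb) + 2.1000×ettyn

where mean_age, mean_bun and mean_alb are the development-data means computed above.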

Why center?

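Makes the intercept the log-odds for a “typical” patient: one whose centered variables all equal 0, i.e. who sits at the development means (and has ettyn = 0 in this example).
Each centered term averages 0 in the development data, so the mean LP there is simply the intercept plus the average contribution of uncentered predictors such as ettyn. That stable reference point makes Debray Step 1 & Step 2 comparisons easier (LP mean and SD).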
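
Centering changes only the intercept; the slopes stay the same. The sketch below puts the two fits side by side, assuming cage, cbun and calb were created as shown earlier in this step; the stored-estimate names raw and centered are illustrative.

```stata
* Sketch only: assumes cage, cbun and calb already exist in memory.
quietly logit death age bun alb ettyn
estimates store raw

quietly logit death cage cbun calb ettyn
estimates store centered

* Slopes for age/cage, bun/cbun and alb/calb should match; only _cons differs
estimates table raw centered, b(%9.4f)
```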

Step 6 – From development model to prediction code (dev or validation)

Once you have α and β’s from logit, the model is frozen. To compute predictions (in development or validation):

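* Example using the centered form, like MAGENTA
* 70.0, 18.0 and 3.8 stand for the development means of age, bun and alb;
* keep these same constants when scoring validation data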
gen logodds = -4.2000 ///
    + 0.0150*(age - 70.0) ///
    + 0.0300*(bun - 18.0) ///
    - 0.9000*(alb - 3.8) ///
    + 2.1000*ettyn

gen prob = invlogit(logodds)
summarize logodds prob
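
The constants subtracted in the centered form are presumably the development means, so they have to be recorded before moving to another dataset. A minimal sketch for listing them, assuming the development data are still in memory:

```stata
* Sketch only: run on the development data, before loading validation data.
foreach v of varlist age bun alb {
    quietly summarize `v'
    display "`v': development mean = " %9.4f r(mean)
}
```

The displayed values are the ones to hard-code in the prediction equation, in place of the illustrative 70.0, 18.0 and 3.8.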

Key mapping:

logit death ... → Stata estimates α and β’s.
_b[_cons] → α (intercept)
_b[predictor] → β coefficients
Those values go into your handwritten formula for log-odds and probabilities.

Summary

* 1. Do your selection (forward/backward/clinical)
stepwise, pe(0.05): logit death age bun alb ettyn map /* etc */

* 2. Identify final set of predictors from stepwise output, then refit:
logit death age bun alb ettyn

* 3. Read coefficients:
display _b[_cons]     // alpha
display _b[age]       // beta_age
display _b[bun]       // beta_bun
display _b[alb]       // beta_alb
display _b[ettyn]     // beta_ettyn

* 4. Write prediction equation and implement:
gen logodds = _b[_cons] ///
              + _b[age]  * age ///
              + _b[bun]  * bun ///
              + _b[alb]  * alb ///
              + _b[ettyn]* ettyn

gen prob = invlogit(logodds)