
How to Build a Clinical Prediction Model (CPM) in Stata: Step-by-Step with Stata Code

  • Writer: Mayta
  • 3 days ago
  • 3 min read

Updated: 2 days ago

Steps to Developing a Clinical Prediction Model (CPM)

  1. Choose candidate predictors & run forward/backward selection

  2. Fit the final logistic model cleanly

  3. Read α (intercept) and β’s from Stata

  4. Write the prediction equation

  5. Generate LP and risk in Stata

Everything below uses Stata only.

Step 1 – Start with development data and candidate predictors

clear
use "your_development_data.dta", clear

* Inspect variables
describe
summarize

Assume you want to predict death (0/1) from a set of candidate predictors:

  • age

  • sex

  • map

  • bun

  • alb

  • ettyn

  • etc.
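
Before modelling, it can help to confirm that the outcome really is coded 0/1 and to see how much data is missing, since logit silently drops incomplete rows. The lines below are a minimal sketch, assuming the variable names listed above exist in your dataset; n_missing is a hypothetical helper variable.

```stata
* Assumes the development data from Step 1 are in memory
* The outcome must be coded 0/1 for the equations used in this post
assert inlist(death, 0, 1) if !missing(death)

* Events vs non-events (and any missing outcomes)
tabulate death, missing

* Missing-value counts for each candidate predictor
misstable summarize age sex map bun alb ettyn

* n_missing (hypothetical name): missing values per row
egen byte n_missing = rowmiss(death age sex map bun alb ettyn)

* Number of complete cases available to the full candidate model
count if n_missing == 0
```

If the first assert fails, recode the outcome before going further.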


Step 2 – (Optional) Forward/backward selection

2.1 Forward selection example

* Forward selection: a predictor enters when its p < 0.05 (pe() on its own)
stepwise, pe(0.05): logit death age sex map bun alb ettyn

2.2 Backward selection example

* Backward selection: a predictor is removed when its p >= 0.10
* (pr() on its own requests backward selection; stepwise has no backward option)
stepwise, pr(0.10): logit death age sex map bun alb ettyn

What happens here?

  • Stata runs many logistic models, adding or removing predictors.

  • At the end, it prints a final model with a subset of predictors.

  • The final estimation table (headed Logistic regression in current Stata versions, Logit estimates in older ones) shows which predictors survived.

👉 Important practice: After using stepwise, you should refit the final model directly, without stepwise, so you have a clean, reproducible model.
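
One way to make that refit less error-prone is to read the surviving predictor names from the stored results instead of retyping them. This is a sketch under two assumptions: the block is run in one go (locals do not survive between separate runs), and you want the refit on the same rows stepwise used. The names kept, drop and dev_sample are hypothetical.

```stata
* Run the selection (backward selection shown here)
stepwise, pr(0.10): logit death age sex map bun alb ettyn

* Column names of e(b) are the surviving predictors plus _cons
local kept : colnames e(b)
local drop "_cons"
local kept : list kept - drop
display "Selected predictors: `kept'"

* Flag the rows that stepwise actually used
generate byte dev_sample = e(sample)

* Clean refit on those same rows, without the stepwise prefix
logit death `kept' if dev_sample
```

The if dev_sample restriction matters when candidates have missing values: stepwise works on rows complete for every candidate, so an unrestricted refit may use more rows and give slightly different coefficients. Which sample to use for the final model is your decision; the sketch only makes the choice explicit.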

Step 3 – Fit the final logistic model (clean run)

Suppose stepwise ended with: age, bun, alb, ettyn as final predictors.

Now fit that model directly:

logit death age bun alb ettyn

Stata reports something like:

------------------------------------------------------------------------------
      death | Coefficient  Std. err.    z    P>|z|    [95% conf. interval]
-------------+----------------------------------------------------------------
        age |   0.0150     ...
        bun |   0.0300     ...
        alb |  -0.9000     ...
      ettyn |   2.1000     ...
       _cons|  -5.0000     ...
------------------------------------------------------------------------------

Now you have:

  • α (intercept) = _b[_cons] = −5.0000

  • β_age        = _b[age]     = 0.0150

  • β_bun        = _b[bun]     = 0.0300

  • β_alb        = _b[alb]     = −0.9000

  • β_ettyn      = _b[ettyn]   = 2.1000

These are exactly the numbers that will appear in your prediction equation.

You can also explicitly grab them:

display _b[_cons]
display _b[age]
display _b[bun]
display _b[alb]
display _b[ettyn]
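
The default table shows rounded values, while _b[] holds full precision. If you plan to reuse the coefficients, a sketch like the following may help; it assumes the logit command from this step was the last estimation command, and alpha, beta_age and cpm_dev are hypothetical names.

```stata
* Assumes: logit death age bun alb ettyn was just run
* Show coefficients with more decimals than the default table
display %12.8f _b[_cons]
display %12.8f _b[age]

* Whole coefficient vector as a matrix (column names = predictors)
matrix b = e(b)
matrix list b, format(%12.8f)

* Named copies for reuse later in the same session
scalar alpha    = _b[_cons]
scalar beta_age = _b[age]

* Keep the fitted model in memory under a name
estimates store cpm_dev
```

Rounded coefficients are fine for a published equation, but checks against Stata's own predictions are easier with the unrounded values.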

Step 4 – Write down the model mathematically


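The fitted model on the log-odds scale is:

logit(p) = ln(p / (1 − p)) = α + β_age×age + β_bun×bun + β_alb×alb + β_ettyn×ettyn
         = −5.0000 + 0.0150×age + 0.0300×bun − 0.9000×alb + 2.1000×ettyn

and the predicted probability of death is:

p = 1 / (1 + exp(−logit(p)))

Coded as: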

gen logodds = -5.0000 ///
              + 0.0150*age ///
              + 0.0300*bun ///
              - 0.9000*alb ///
              + 2.1000*ettyn

gen prob = invlogit(logodds)
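
To see the arithmetic for a single case, here is a worked example with the illustrative coefficients from Step 3. The patient values (age 70, bun 18, alb 3.8) are made up for illustration, and lp0 and lp1 are hypothetical scalar names.

```stata
* Hypothetical patient: age 70, bun 18, alb 3.8, ettyn 0
* LP = -5 + 0.0150*70 + 0.0300*18 - 0.9000*3.8 + 2.1000*0
*    = -5 + 1.05 + 0.54 - 3.42 + 0 = -6.83
scalar lp0 = -5.0000 + 0.0150*70 + 0.0300*18 - 0.9000*3.8 + 2.1000*0
display "LP   = " lp0
display "risk = " invlogit(lp0)      // about 0.0011

* Same patient with ettyn = 1: LP = -6.83 + 2.10 = -4.73
scalar lp1 = lp0 + 2.1000
display "LP   = " lp1
display "risk = " invlogit(lp1)      // about 0.0087
```

No dataset is needed for this check, so it is a quick way to confirm that the signs and units in the equation behave as expected.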

Step 5 – Centering (optional but common in prediction models)

Many clinical prediction models (MAGENTA, for example) use centered predictors:

* In development data
summ age
local mean_age = r(mean)
gen cage = age - `mean_age'

summ bun
local mean_bun = r(mean)
gen cbun = bun - `mean_bun'

summ alb
local mean_alb = r(mean)
gen calb = alb - `mean_alb'

* Now fit using centered variables
logit death cage cbun calb ettyn

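Suppose the output is (illustrative values, not derived from the Step 3 example; on the same data, centering would leave the slopes unchanged and shift only the intercept):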

------------------------------------------------------------------------------
      death | Coefficient
-------------+------------
       cage  |   0.0150
       cbun  |   0.0300
       calb  |  -0.9000
      ettyn  |   2.1000
       _cons |  -4.2000
------------------------------------------------------------------------------

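Then your equation is:

logit(p) = −4.2000 + 0.0150×(age − mean_age) + 0.0300×(bun − mean_bun) − 0.9000×(alb − mean_alb) + 2.1000×ettyn

where mean_age, mean_bun and mean_alb are the development-sample means computed above.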


Why center?

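  • Makes the intercept the log-odds for a “typical” patient: one with mean age, bun and alb, and with any uncentered predictor (here ettyn) equal to 0.

  • The centered terms contribute 0 on average in the development data, so the mean LP there is α plus the average contribution of the uncentered predictors. That gives a simple reference point for the Debray Step 1 & Step 2 comparisons (LP mean and SD).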
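
Centering is only a re-expression of the same model. Since α + β×x = (α + β×mean) + β×(x − mean), the slopes stay put and the intercept absorbs β×mean for each centered predictor. The sketch below checks this on your own data; it assumes death, age, bun, alb and ettyn are in memory, and every new name (a_raw, b_age, m_age, c_age, fit_sample and so on) is hypothetical.

```stata
* Uncentered fit: keep intercept and slopes
quietly logit death age bun alb ettyn
scalar a_raw = _b[_cons]
scalar b_age = _b[age]
scalar b_bun = _b[bun]
scalar b_alb = _b[alb]
generate byte fit_sample = e(sample)

* Center at the means of the rows used in the fit
foreach v in age bun alb {
    quietly summarize `v' if fit_sample
    scalar m_`v' = r(mean)
    generate double c_`v' = `v' - m_`v'
}

* Centered fit: slopes should match, intercept should shift
quietly logit death c_age c_bun c_alb ettyn
display "centered intercept (fitted)  = " _b[_cons]
display "centered intercept (by hand) = " ///
    a_raw + b_age*m_age + b_bun*m_bun + b_alb*m_alb
```

The two displayed values should agree up to the optimizer's convergence tolerance.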


Step 6 – From development model to prediction code (dev or validation)

Once you have α and β’s from logit, the model is frozen. To compute predictions (in development or validation):

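* Example using centered form, like MAGENTA
* 70.0, 18.0 and 3.8 stand for the development-sample means;
* keep them fixed and do not recompute them in validation data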
gen logodds = -4.2000 ///
    + 0.0150*(age - 70.0) ///
    + 0.0300*(bun - 18.0) ///
    - 0.9000*(alb - 3.8) ///
    + 2.1000*ettyn

gen prob = invlogit(logodds)
summarize logodds prob
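
When the same equation is applied to a validation cohort, keeping the development means in named scalars can make it harder to recompute them by accident. This is a sketch only: the file name is hypothetical, the numbers are the illustrative ones used in this post, and dev_mean_age, lp_val and risk_val are made-up names.

```stata
* Development-sample means, frozen (illustrative values)
scalar dev_mean_age = 70.0
scalar dev_mean_bun = 18.0
scalar dev_mean_alb = 3.8

* Hypothetical file name for the validation cohort
use "your_validation_data.dta", clear

* Apply the frozen equation; nothing is re-estimated here
generate double lp_val = -4.2000               ///
    + 0.0150*(age - dev_mean_age)              ///
    + 0.0300*(bun - dev_mean_bun)              ///
    - 0.9000*(alb - dev_mean_alb)              ///
    + 2.1000*ettyn
generate double risk_val = invlogit(lp_val)

* Predicted risks must lie strictly between 0 and 1
assert risk_val > 0 & risk_val < 1 if !missing(risk_val)
summarize lp_val risk_val
```

Scalars survive use ..., clear, so the development means are still available after the validation data are loaded.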

Key mapping:

  • logit death ... → Stata estimates α and β’s.

  • _b[_cons]        → α (intercept)

  • _b[predictor]    → β coefficients

  • Those values go into your handwritten formula for log-odds and probabilities.


Summary

* 1. Do your selection (forward/backward/clinical)
stepwise, pe(0.05): logit death age bun alb ettyn map /* etc */

* 2. Identify final set of predictors from stepwise output, then refit:
logit death age bun alb ettyn

* 3. Read coefficients:
display _b[_cons]     // alpha
display _b[age]       // beta_age
display _b[bun]       // beta_bun
display _b[alb]       // beta_alb
display _b[ettyn]     // beta_ettyn

* 4. Write prediction equation and implement:
gen logodds = _b[_cons] ///
              + _b[age]  * age ///
              + _b[bun]  * bun ///
              + _b[alb]  * alb ///
              + _b[ettyn]* ettyn

gen prob = invlogit(logodds)
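
A hand-written equation can be checked against Stata's own predictions. The sketch below assumes the final model uses the predictors shown above; lp_check, p_check, lp_hand and p_hand are hypothetical variable names.

```stata
* Refit the final model so its results are the active estimates
quietly logit death age bun alb ettyn

* Stata's own linear predictor and predicted probability
predict double lp_check, xb
predict double p_check, pr

* Hand-built versions from the stored coefficients
generate double lp_hand = _b[_cons] + _b[age]*age + _b[bun]*bun ///
                          + _b[alb]*alb + _b[ettyn]*ettyn
generate double p_hand  = invlogit(lp_hand)

* Both routes should agree to rounding error
assert reldif(lp_hand, lp_check) < 1e-8 if e(sample)
assert reldif(p_hand,  p_check)  < 1e-8 if e(sample)
```

If either assert fails, the usual suspects are a mistyped coefficient, a wrong sign, or a predictor coded differently from the fitted model.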
