Reporting Performance and Stability in TRIPOD+AI & Riley Framework Clinical Prediction Models: A Stata-Centered Code+Framework
- Mayta

This article integrates the TRIPOD+AI reporting standards with the latest Riley/Collins/Ensor stability framework. It shifts the focus from just "average" performance to "individual" reliability using the pm-suite in Stata.
Introduction
In modern clinical prediction, showing that a model is "accurate on average" is no longer enough. Under TRIPOD+AI, you must report both Performance (how well the model works for the population) and Stability (how much an individual’s risk estimate changes if the training data were slightly different).
1. The Core Distinction
Performance (Average): "On average, how close are we to the truth?" (AUC, Calibration, Net Benefit).
Stability (Reliability): "If I re-ran this study with a different sample, would this specific patient get the same risk score?"
2. The 8 Required Outputs
To be fully compliant with the Riley/Collins logic, your results section should include the following eight outputs, organized into three pillars:
Pillar A: Performance (Population-Average)
ROC Curve / C-statistic: Can the model separate cases from non-cases?
Calibration Plot: Does the predicted risk match observed risk across the spectrum?
Decision Curve Analysis (DCA): Does the model provide higher Net Benefit than "treat all" or "treat none" at clinical thresholds? (Net Benefit is defined in the formula below.)
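For reference, Net Benefit at a risk threshold $p_t$ weighs true positives against false positives using the standard decision-curve definition, where $TP$ and $FP$ are counts among the $n$ patients evaluated:

$$\text{Net Benefit}(p_t) = \frac{TP}{n} - \frac{FP}{n} \times \frac{p_t}{1 - p_t}$$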
Pillar B: Stability (Individual & Decision Reliability)
Prediction Instability Plot: Visualizes the "wiggle" of individual risks across bootstrap re-fits.
Average MAPE (Stability Index): The mean absolute difference between the original and bootstrap risks, averaged over individuals (see the formulas after this list). Target: $< 0.02$ (context-dependent).
95% Uncertainty Interval (UI): The range (2.5th to 97.5th percentile) of risk for a single patient across re-fits.
Classification Instability Plot: Shows "threshold flipping"—how often a patient moves from "low risk" to "high risk" across model developments.
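To make these two quantities explicit, write $\hat{p}_i$ for the original model's predicted risk for individual $i$ and $\hat{p}_i^{(b)}$ for the prediction from the $b$-th of $B$ bootstrap re-fits (notation added here, following the Riley/Collins definitions):

$$\text{MAPE}_i = \frac{1}{B}\sum_{b=1}^{B}\left|\hat{p}_i^{(b)} - \hat{p}_i\right|, \qquad \overline{\text{MAPE}} = \frac{1}{n}\sum_{i=1}^{n}\text{MAPE}_i$$

The 95% UI for individual $i$ is the 2.5th to 97.5th percentile of $\hat{p}_i^{(1)}, \dots, \hat{p}_i^{(B)}$.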
Pillar C: Stability (Population-Level)
Calibration Instability Plot: A "spaghetti plot" of calibration curves from bootstrap re-fits to show if the model's reliability is volatile.
3. The Stata Toolchain: pm-suite
The authoritative tools for this workflow are maintained by Joie Ensor and the Riley/Collins team.
Installation
Stata
* Performance & Utilities
ssc install pmcalplot, replace
net install dca, from("https://raw.github.com/ddsjoberg/dca.stata/master/") replace
* The Stability Suite (Riley/Ensor)
net from https://joieensor.github.io/pm-suite/
net install pmstabilityplots, replace
net install pmstabilityss, replace // For sample size planning
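A quick optional check that the commands are on your ado-path (assuming each package installs an ado-file of the same name):
Stata
which pmcalplot
which dca
which pmstabilityplots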
Mapping Requirements to Commands
Requirement | Stata Command | Key Output
--- | --- | ---
Calibration | pmcalplot | Observed vs. Predicted
Clinical Utility | dca | Net Benefit
Individual Stability | pmstabilityplots | Prediction Instability Plot & MAPE
Decision Stability | pmstabilityplots | Classification Instability
Calibration Stability | pmstabilityplots | Spaghetti Calibration Curves
4. Implementation Workflow
Step 1: Fit and Assess Performance
Stata
* Fit the development model and obtain apparent predicted risks
logistic outcome x1 x2 x3
predict p_app, pr
* Standard performance: discrimination (ROC), calibration, clinical utility
lroc
pmcalplot p_app outcome, count
dca outcome p_app, xstop(0.5)
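Alongside the calibration plot, the calibration slope and calibration-in-the-large are commonly reported numerically. A minimal sketch using the linear predictor (the variable name lp_app is illustrative):
Stata
* Calibration slope = coefficient on the linear predictor
gen double lp_app = logit(p_app)
logit outcome lp_app
* Calibration-in-the-large = intercept with the linear predictor as an offset
logit outcome, offset(lp_app)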
Step 2: Assess Stability (The Riley Method)
The pmstabilityplots command automates the bootstrap re-development process: it re-estimates the model parameters many times and records how much each individual's prediction changes.
Stata
* Stability Assessment (e.g., 200 bootstrap reps)
* 'threshold' defines the point for Classification Instability
pmstabilityplots outcome x1 x2 x3, reps(200) threshold(0.2)
This command generates the three critical figures:
Prediction Instability Plot: Highlighting the MAPE and 95% UIs.
Classification Instability: Visualizing how many patients cross the 20% risk threshold.
Calibration Instability: Showing the variation in the calibration intercept and slope.
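To make the mechanism concrete, here is a simplified manual sketch of the bootstrap re-development loop that the command automates (illustration only, not the package's internal code; the variable names p_orig, p_b*, ad*, and mape_i are placeholders):
Stata
* Simplified bootstrap re-development loop (illustration only)
set seed 12345
logit outcome x1 x2 x3
predict double p_orig, pr                 // original-model risks
forvalues b = 1/200 {
    preserve
    bsample                               // bootstrap sample of the data
    quietly logit outcome x1 x2 x3        // re-fit the model
    restore
    predict double p_b`b', pr             // apply the re-fit to the ORIGINAL patients
    gen double ad`b' = abs(p_b`b' - p_orig)
}
egen double mape_i = rowmean(ad*)         // individual-level MAPE
summarize mape_i                          // the mean is the average MAPE (stability index)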
5. Reporting Template (Methods Section)
"Model performance was evaluated via discrimination (C-statistic), calibration (calibration plots), and clinical utility (Decision Curve Analysis). To ensure the reliability of individual-level predictions, we performed a stability analysis according to the Riley/Collins framework. We quantified prediction instability using Mean Absolute Prediction Error (MAPE) and individual 95% uncertainty intervals (UI). Decision stability was assessed via classification instability plots at a clinical threshold of [X%]. All stability analyses were performed in Stata using the pmstabilityplots package (pm-suite), involving [200] bootstrap re-development cycles."
6. R Crosswalk: pminternal
If you are collaborating with R users, the pminternal package implements the same framework:
Item | R Function (pminternal)
--- | ---
Prediction Instability | prediction_stability()
Stability Index (MAPE) | mape_stability()
Decision Stability | dcurve_stability()
Calibration Stability | calibration_stability()