Reporting Performance and Stability in TRIPOD+AI & Riley Framework Clinical Prediction Models: A Stata-Centered Code+Framework
- Mayta

This article integrates the TRIPOD+AI reporting standards with the latest Riley/Collins/Ensor stability framework. It shifts the focus from just "average" performance to "individual" reliability using the pm-suite in Stata.
Introduction
In modern clinical prediction, showing that a model is "accurate on average" is no longer enough. Under TRIPOD+AI, you must report both Performance (how well the model works for the population) and Stability (how much an individual’s risk estimate changes if the training data were slightly different).
1. The Core Distinction
Performance (Average): "On average, how close are we to the truth?" (AUC, Calibration, Net Benefit).
Stability (Reliability): "If I re-ran this study with a different sample, would this specific patient get the same risk score?"
2. The 8 Required Outputs
To be fully compliant with the Riley/Collins logic, your results section should include the following eight outputs, organized into three pillars:
Pillar A: Performance (Population-Average)
ROC Curve / C-statistic: Can the model separate cases from non-cases?
Calibration Plot: Does the predicted risk match observed risk across the spectrum?
Decision Curve Analysis (DCA): Does the model provide higher Net Benefit than "treat all" or "treat none" at clinical thresholds? (Net Benefit is defined in the formula below.)
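For reference, Net Benefit at a risk threshold $p_t$ weighs true positives against false positives using the standard decision-curve definition, where $TP$ and $FP$ are counts among the $n$ patients evaluated:

$$\text{Net Benefit}(p_t) = \frac{TP}{n} - \frac{FP}{n} \times \frac{p_t}{1 - p_t}$$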
Pillar B: Stability (Individual & Decision Reliability)
Prediction Instability Plot: Visualizes the "wiggle" of individual risks across bootstrap re-fits.
Average MAPE (Stability Index): The mean absolute difference between the original and bootstrap risks, averaged over individuals (see the formulas after this list). Target: $< 0.02$ (context-dependent).
95% Uncertainty Interval (UI): The range (2.5th to 97.5th percentile) of risk for a single patient across re-fits.
Classification Instability Plot: Shows "threshold flipping"—how often a patient moves from "low risk" to "high risk" across model developments.
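To make these two quantities explicit, write $\hat{p}_i$ for the original model's predicted risk for individual $i$ and $\hat{p}_i^{(b)}$ for the prediction from the $b$-th of $B$ bootstrap re-fits (notation added here, following the Riley/Collins definitions):

$$\text{MAPE}_i = \frac{1}{B}\sum_{b=1}^{B}\left|\hat{p}_i^{(b)} - \hat{p}_i\right|, \qquad \overline{\text{MAPE}} = \frac{1}{n}\sum_{i=1}^{n}\text{MAPE}_i$$

The 95% UI for individual $i$ is the 2.5th to 97.5th percentile of $\hat{p}_i^{(1)}, \dots, \hat{p}_i^{(B)}$.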
Pillar C: Stability (Population-Level)
Calibration Instability Plot: A "spaghetti plot" of calibration curves from bootstrap re-fits to show if the model's reliability is volatile.
3. The Stata Toolchain: pm-suite
The authoritative tools for this workflow are maintained by Joie Ensor and the Riley/Collins team.
Installation
Stata
* Performance & Utilities
ssc install pmcalplot, replace
net install dca, from("https://raw.github.com/ddsjoberg/dca.stata/master/") replace
* The Stability Suite (Riley/Ensor)
net from https://joieensor.github.io/pm-suite/
net install pmstabilityplots, replace
net install pmstabilityss, replace // For sample size planning
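A quick optional check that the commands are on your ado-path (assuming each package installs an ado-file of the same name):
Stata
which pmcalplot
which dca
which pmstabilityplots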
Mapping Requirements to Commands
Requirement | Stata Command | Key Output
--- | --- | ---
Calibration | pmcalplot | Observed vs. Predicted
Clinical Utility | dca | Net Benefit
Individual Stability | pmstabilityplots | Prediction Instability Plot & MAPE
Decision Stability | pmstabilityplots | Classification Instability
Calibration Stability | pmstabilityplots | Spaghetti Calibration Curves
4. Implementation Workflow
Step 1: Fit and Assess Performance
Stata
* Fit the development model and obtain apparent predicted risks
logistic outcome x1 x2 x3
predict p_app, pr
* Standard performance: discrimination (ROC), calibration, clinical utility
lroc
pmcalplot p_app outcome, count
dca outcome p_app, xstop(0.5)
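Alongside the calibration plot, the calibration slope and calibration-in-the-large are commonly reported numerically. A minimal sketch using the linear predictor (the variable name lp_app is illustrative):
Stata
* Calibration slope = coefficient on the linear predictor
gen double lp_app = logit(p_app)
logit outcome lp_app
* Calibration-in-the-large = intercept with the linear predictor as an offset
logit outcome, offset(lp_app)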
Step 2: Assess Stability (The Riley Method)
The pmstabilityplots command automates the bootstrap re-development process: it re-estimates the model parameters many times and records how much each individual's prediction changes.
Stata
* Stability Assessment (e.g., 200 bootstrap reps)
* 'threshold' defines the point for Classification Instability
pmstabilityplots outcome x1 x2 x3, reps(200) threshold(0.2)
This command generates the three critical figures:
Prediction Instability Plot: Highlighting the MAPE and 95% UIs.
Classification Instability: Visualizing how many patients cross the 20% risk threshold.
Calibration Instability: Showing the variation in the calibration intercept and slope.
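To make the mechanism concrete, here is a simplified manual sketch of the bootstrap re-development loop that the command automates (illustration only, not the package's internal code; the variable names p_orig, p_b*, ad*, and mape_i are placeholders):
Stata
* Simplified bootstrap re-development loop (illustration only)
set seed 12345
logit outcome x1 x2 x3
predict double p_orig, pr                 // original-model risks
forvalues b = 1/200 {
    preserve
    bsample                               // bootstrap sample of the data
    quietly logit outcome x1 x2 x3        // re-fit the model
    restore
    predict double p_b`b', pr             // apply the re-fit to the ORIGINAL patients
    gen double ad`b' = abs(p_b`b' - p_orig)
}
egen double mape_i = rowmean(ad*)         // individual-level MAPE
summarize mape_i                          // the mean is the average MAPE (stability index)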
5. Reporting Template (Methods Section)
"Model performance was evaluated via discrimination (C-statistic), calibration (calibration plots), and clinical utility (Decision Curve Analysis). To ensure the reliability of individual-level predictions, we performed a stability analysis according to the Riley/Collins framework. We quantified prediction instability using Mean Absolute Prediction Error (MAPE) and individual 95% uncertainty intervals (UI). Decision stability was assessed via classification instability plots at a clinical threshold of [X%]. All stability analyses were performed in Stata using the pmstabilityplots package (pm-suite), involving [200] bootstrap re-development cycles."
6. R Crosswalk: pminternal
If you are collaborating with R users, the pminternal package implements the same framework:
Item | R Function (pminternal)
--- | ---
Prediction Instability | prediction_stability()
Stability Index (MAPE) | mape_stability()
Decision Stability | dcurve_stability()
Calibration Stability | calibration_stability()