Beyond Performance: Prediction Stability and Uncertainty in TRIPOD-Ready Clinical Prediction Models
- Mayta

Introduction
Once you’ve shown discrimination, calibration, and clinical utility, a sophisticated reader asks a deeper question:
“How stable are these predictions if the data were slightly different?”
This is where many CPM papers stop too early.
Two models can have identical AUROC, calibration, and DCA—yet one is fragile, overly dependent on idiosyncrasies of the development sample, while the other is robust. Stability analysis exposes this difference.
In the CECS CPM framework, this fourth layer is explicitly taught as prediction stability and uncertainty, evaluated using internal resampling methods (typically bootstrapping).
The goal is not to impress—but to be honest about how much trust each individual prediction deserves.
1) Prediction Instability Plot — How much do predictions move?
What it shows: Individual-level prediction stability
A prediction instability plot visualizes how much each patient’s predicted risk varies across bootstrap samples compared with the original model.
Conceptually, it plots:
- Original predicted risk (x-axis)
- Bootstrapped predicted risk (y-axis)
- One point per patient per bootstrap draw
In CECS-aligned workflows, this is typically produced using internal validation tools (e.g., pminternal, type = "instability").
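As a concrete illustration, here is a minimal base-R sketch of how such a plot can be built by hand. It assumes a data frame `dat` with a binary outcome `y` and predictors `x1` and `x2` (all object names are illustrative, not from the original workflow); dedicated tools such as pminternal automate and extend this.

```r
# Minimal sketch of a prediction instability plot (illustrative names:
# `dat` with binary outcome `y` and predictors `x1`, `x2`).
set.seed(2024)
n <- nrow(dat)
B <- 200                                    # number of bootstrap refits

# Original ("apparent") model and its predicted risks
fit_orig <- glm(y ~ x1 + x2, data = dat, family = binomial)
p_orig   <- predict(fit_orig, newdata = dat, type = "response")

# Refit on bootstrap resamples and predict for the *original* patients
boot_preds <- replicate(B, {
  idx   <- sample.int(n, n, replace = TRUE)
  fit_b <- glm(y ~ x1 + x2, data = dat[idx, ], family = binomial)
  predict(fit_b, newdata = dat, type = "response")
})

# One point per patient per bootstrap draw; a tight cloud = stable model
plot(rep(p_orig, times = B), as.vector(boot_preds),
     pch = ".", col = rgb(0, 0, 0, 0.05),
     xlab = "Original predicted risk",
     ylab = "Bootstrap predicted risk",
     main = "Prediction instability")
abline(0, 1, col = "red", lwd = 2)          # identity line
```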
Clinical translation
“If I refit this model on slightly different versions of the same dataset, do patients keep roughly the same risk estimates—or do they jump around?”
Stable models show tight clouds around the identity line. Unstable models show wide vertical scatter—especially concerning at clinically actionable risk thresholds.
Common rookie mistake
Reporting only average performance metrics and ignoring whether individual patients’ predictions are reliable.
Scatter of Bootstrap vs Original Predictions — Seeing overfitting directly
A complementary visualization overlays:
- Original model predictions
- Bootstrap-refitted predictions
Each point represents a patient. The distance from the diagonal line reflects prediction instability.
Why this matters:
- Overfitted models often show excellent apparent AUROC
- But large bootstrap scatter reveals that the model's predictions are sample-dependent
This plot is often more persuasive to reviewers than abstract statements about “internal validation.”
🔍 Secret Insight: Reviewers may forgive a modest AUROC, but they rarely forgive unstable predictions once shown.
2) Average MAPE — How wrong are predictions on average?
What it shows: Numerical prediction error
Mean Absolute Prediction Error (MAPE) summarizes how far each patient's predicted risk from the bootstrap-refitted models deviates, on average, from the original model's prediction.
Formally, for patient i across B bootstrap models:
MAPEᵢ = (1/B) Σ_b | p̂ᵢ⁽ᵇ⁾ − p̂ᵢ |
Where:
- p̂ᵢ = predicted probability from the original model
- p̂ᵢ⁽ᵇ⁾ = predicted probability from the model refitted on bootstrap sample b
The average MAPE reported for a model is the mean of MAPEᵢ across all patients.
In CECS teaching, MAPE < 0.02 is often used as a heuristic for excellent stability in well-calibrated binary outcome models, recognizing that acceptable thresholds depend on context.
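As a minimal sketch (reusing the illustrative `p_orig` and `boot_preds` objects defined in the instability example above), the per-patient and average MAPE reduce to a couple of lines:

```r
# Per-patient MAPE: mean absolute difference between each bootstrap model's
# prediction and the original model's prediction
# (reuses `boot_preds` and `p_orig` from the instability sketch).
mape_i   <- rowMeans(abs(boot_preds - p_orig))
mape_avg <- mean(mape_i)                  # average MAPE across all patients
mape_avg
```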
Clinical translation
“On average, how much would this patient’s predicted risk change if the model had been developed on a slightly different sample?”
MAPE complements calibration plots:
- Calibration shows systematic bias against observed outcomes
- MAPE shows the average absolute instability of each individual prediction
Common rookie mistake
Reporting only AUROC and ignoring absolute prediction error, which clinicians care about far more.
3) 95% Uncertainty Intervals — Admitting what we don’t know
What it shows: Prediction uncertainty, not just point estimates
Bootstrap-based uncertainty intervals (UIs) reflect how much each patient’s predicted risk varies across resampled models.
For each patient, you can derive:
- Median predicted risk
- 2.5th–97.5th percentile range across bootstrap fits
These intervals are often visualized directly in instability plots or as vertical error bars.
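A minimal sketch, again reusing the illustrative `boot_preds` matrix from above:

```r
# 95% bootstrap uncertainty interval (and median) for each patient,
# computed as row-wise percentiles of the bootstrap predictions.
ui <- t(apply(boot_preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(ui) <- c("lower", "median", "upper")

# Example report for one patient (values depend on your data)
i <- 1
sprintf("Estimated risk %.0f%% (95%% UI %.0f%% to %.0f%%)",
        100 * ui[i, "median"], 100 * ui[i, "lower"], 100 * ui[i, "upper"])
```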
Clinical translation
“This patient’s risk is estimated at 18%—but realistically, it could be anywhere from 12% to 25%.”
This framing is crucial for shared decision-making and aligns naturally with DCA-based threshold reasoning.
Common rookie mistake
Presenting single-number risk predictions without acknowledging uncertainty, which falsely implies precision.
4) Calibration Instability — Does calibration survive resampling?
Calibration should not only look good once.
Calibration instability analysis overlays calibration curves from multiple bootstrap samples on top of the apparent calibration curve.
What you want to see:
- Bootstrap curves clustering tightly around the optimism-corrected curve
- Minimal divergence in clinically relevant risk ranges
What signals trouble:
- Wide spread of calibration curves
- Slope flattening or exaggeration across resamples
In CECS CPM logic, this is a direct test of whether calibration is structural or merely accidental.
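A rough sketch of such an overlay, reusing the illustrative objects from the earlier examples and using `lowess` as a quick calibration smoother (dedicated calibration tools are more refined):

```r
# Overlay smoothed calibration curves from each bootstrap model on top of the
# apparent (original-model) curve; wide spread signals calibration instability.
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted risk", ylab = "Observed proportion",
     main = "Calibration instability")
for (b in seq_len(B)) {
  lines(lowess(boot_preds[, b], dat$y), col = rgb(0, 0, 1, 0.05))
}
lines(lowess(p_orig, dat$y), col = "red", lwd = 2)  # apparent calibration curve
abline(0, 1, lty = 2)                               # perfect calibration
```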
How stability completes the TRIPOD performance story
At this point, your CPM performance narrative answers five distinct questions:
| Layer | Question answered |
| --- | --- |
| Discrimination (ROC) | Can the model rank patients correctly? |
| Calibration | Are the predicted probabilities numerically correct? |
| Usefulness (DCA) | Does using the model improve decisions? |
| Stability | Are predictions robust to data perturbation? |
| Uncertainty | How precise are individual predictions? |
TRIPOD emphasizes transparent reporting and validation; stability analysis operationalizes that principle at the individual patient level, where clinical decisions actually occur.
Final takeaway: Stable models earn trust; unstable models borrow it
A TRIPOD-ready CPM doesn’t just look good—it behaves well under stress.
By adding:
- Prediction instability plots
- Bootstrap scatter visualizations
- Average MAPE
- 95% uncertainty intervals
- Calibration instability overlays
You move from “This model performs well” to:
“This model performs well, consistently, and honestly.”
That distinction is exactly what separates publishable CPMs from clinically credible ones in the CECS framework.
Key takeaways
- Performance metrics alone do not guarantee reliable predictions.
- Stability analysis reveals overfitting that AUROC and calibration can hide.
- Bootstrap-based instability and uncertainty should be reported whenever possible.
- Stable calibration across resamples is a powerful signal of model robustness.
- TRIPOD-aligned CPMs benefit enormously from explicit stability reporting.




