Beyond Performance: Prediction Stability and Uncertainty in TRIPOD-Ready Clinical Prediction Models
- Mayta

Introduction
Once you’ve shown discrimination, calibration, and clinical utility, a sophisticated reader asks a deeper question:
“How stable are these predictions if the data were slightly different?”
This is where many CPM papers stop too early.
Two models can have identical AUROC, calibration, and DCA—yet one is fragile, overly dependent on idiosyncrasies of the development sample, while the other is robust. Stability analysis exposes this difference.
In the CECS CPM framework, this fourth layer is explicitly taught as prediction stability and uncertainty, evaluated using internal resampling methods (typically bootstrapping).
The goal is not to impress—but to be honest about how much trust each individual prediction deserves.
1) Prediction Instability Plot — How much do predictions move?
What it shows: Individual-level prediction stability
A prediction instability plot visualizes how much each patient’s predicted risk varies across bootstrap samples compared with the original model.
Conceptually, it plots:
- Original predicted risk (x-axis)
- Bootstrapped predicted risk (y-axis)
- One point per patient per bootstrap draw
In CECS-aligned workflows, this is typically produced using internal validation tools (e.g., pminternal, type = "instability").
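As a concrete illustration, here is a minimal base-R sketch of how such a plot can be built by hand. It assumes a data frame `dat` with a binary outcome `y` and predictors `x1` and `x2` (all object names are illustrative, not from the original workflow); dedicated tools such as pminternal automate and extend this.

```r
# Minimal sketch of a prediction instability plot (illustrative names:
# `dat` with binary outcome `y` and predictors `x1`, `x2`).
set.seed(2024)
n <- nrow(dat)
B <- 200                                    # number of bootstrap refits

# Original ("apparent") model and its predicted risks
fit_orig <- glm(y ~ x1 + x2, data = dat, family = binomial)
p_orig   <- predict(fit_orig, newdata = dat, type = "response")

# Refit on bootstrap resamples and predict for the *original* patients
boot_preds <- replicate(B, {
  idx   <- sample.int(n, n, replace = TRUE)
  fit_b <- glm(y ~ x1 + x2, data = dat[idx, ], family = binomial)
  predict(fit_b, newdata = dat, type = "response")
})

# One point per patient per bootstrap draw; a tight cloud = stable model
plot(rep(p_orig, times = B), as.vector(boot_preds),
     pch = ".", col = rgb(0, 0, 0, 0.05),
     xlab = "Original predicted risk",
     ylab = "Bootstrap predicted risk",
     main = "Prediction instability")
abline(0, 1, col = "red", lwd = 2)          # identity line
```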
Clinical translation
“If I refit this model on slightly different versions of the same dataset, do patients keep roughly the same risk estimates—or do they jump around?”
Stable models show tight clouds around the identity line. Unstable models show wide vertical scatter—especially concerning at clinically actionable risk thresholds.
Common rookie mistake
Reporting only average performance metrics and ignoring whether individual patients’ predictions are reliable.
Scatter of Bootstrap vs Original Predictions — Seeing overfitting directly
A complementary visualization overlays:
- Original model predictions
- Bootstrap-refitted predictions
Each point represents a patient. The distance from the diagonal line reflects prediction instability.
Why this matters:
- Overfitted models often show excellent apparent AUROC
- But large bootstrap scatter reveals that the model's predictions are sample-dependent
This plot is often more persuasive to reviewers than abstract statements about “internal validation.”
🔍 Secret Insight: Reviewers may forgive a modest AUROC, but they rarely forgive unstable predictions once shown.
2) Average MAPE — How wrong are predictions on average?
What it shows: Numerical prediction error
Mean Absolute Prediction Error (MAPE) summarizes how far each patient's predicted risk from the bootstrap-refitted models deviates, on average, from the original model's prediction.
Formally, for patient i across B bootstrap models:
MAPEᵢ = (1/B) Σ_b | p̂ᵢ⁽ᵇ⁾ − p̂ᵢ |
Where:
- p̂ᵢ = predicted probability from the original model
- p̂ᵢ⁽ᵇ⁾ = predicted probability from the model refitted on bootstrap sample b
The average MAPE reported for a model is the mean of MAPEᵢ across all patients.
In CECS teaching, MAPE < 0.02 is often used as a heuristic for excellent stability in well-calibrated binary outcome models, recognizing that acceptable thresholds depend on context.
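As a minimal sketch (reusing the illustrative `p_orig` and `boot_preds` objects defined in the instability example above), the per-patient and average MAPE reduce to a couple of lines:

```r
# Per-patient MAPE: mean absolute difference between each bootstrap model's
# prediction and the original model's prediction
# (reuses `boot_preds` and `p_orig` from the instability sketch).
mape_i   <- rowMeans(abs(boot_preds - p_orig))
mape_avg <- mean(mape_i)                  # average MAPE across all patients
mape_avg
```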
Clinical translation
“On average, how much would this patient’s predicted risk change if the model had been developed on a slightly different sample?”
MAPE complements calibration plots:
- Calibration shows systematic bias against observed outcomes
- MAPE shows the average absolute instability of each individual prediction
Common rookie mistake
Reporting only AUROC and ignoring absolute prediction error, which clinicians care about far more.
3) 95% Uncertainty Intervals — Admitting what we don’t know
What it shows: Prediction uncertainty, not just point estimates
Bootstrap-based uncertainty intervals (UIs) reflect how much each patient’s predicted risk varies across resampled models.
For each patient, you can derive:
- Median predicted risk
- 2.5th–97.5th percentile range across bootstrap fits
These intervals are often visualized directly in instability plots or as vertical error bars.
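A minimal sketch, again reusing the illustrative `boot_preds` matrix from above:

```r
# 95% bootstrap uncertainty interval (and median) for each patient,
# computed as row-wise percentiles of the bootstrap predictions.
ui <- t(apply(boot_preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(ui) <- c("lower", "median", "upper")

# Example report for one patient (values depend on your data)
i <- 1
sprintf("Estimated risk %.0f%% (95%% UI %.0f%% to %.0f%%)",
        100 * ui[i, "median"], 100 * ui[i, "lower"], 100 * ui[i, "upper"])
```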
Clinical translation
“This patient’s risk is estimated at 18%—but realistically, it could be anywhere from 12% to 25%.”
This framing is crucial for shared decision-making and aligns naturally with DCA-based threshold reasoning.
Common rookie mistake
Presenting single-number risk predictions without acknowledging uncertainty, which falsely implies precision.
4) Calibration Instability — Does calibration survive resampling?
Calibration should not only look good once.
Calibration instability analysis overlays calibration curves from multiple bootstrap samples on top of the apparent calibration curve.
What you want to see:
- Bootstrap curves clustering tightly around the optimism-corrected curve
- Minimal divergence in clinically relevant risk ranges
What signals trouble:
- Wide spread of calibration curves
- Slope flattening or exaggeration across resamples
In CECS CPM logic, this is a direct test of whether calibration is structural or merely accidental.
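A rough sketch of such an overlay, reusing the illustrative objects from the earlier examples and using `lowess` as a quick calibration smoother (dedicated calibration tools are more refined):

```r
# Overlay smoothed calibration curves from each bootstrap model on top of the
# apparent (original-model) curve; wide spread signals calibration instability.
plot(NULL, xlim = c(0, 1), ylim = c(0, 1),
     xlab = "Predicted risk", ylab = "Observed proportion",
     main = "Calibration instability")
for (b in seq_len(B)) {
  lines(lowess(boot_preds[, b], dat$y), col = rgb(0, 0, 1, 0.05))
}
lines(lowess(p_orig, dat$y), col = "red", lwd = 2)  # apparent calibration curve
abline(0, 1, lty = 2)                               # perfect calibration
```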
How stability completes the TRIPOD performance story
At this point, your CPM performance narrative answers five distinct questions:
| Layer | Question answered |
| --- | --- |
| Discrimination (ROC) | Can the model rank patients correctly? |
| Calibration | Are the predicted probabilities numerically correct? |
| Usefulness (DCA) | Does using the model improve decisions? |
| Stability | Are predictions robust to data perturbation? |
| Uncertainty | How precise are individual predictions? |
TRIPOD emphasizes transparent reporting and validation; stability analysis operationalizes that principle at the individual patient level, where clinical decisions actually occur.
Final takeaway: Stable models earn trust; unstable models borrow it
A TRIPOD-ready CPM doesn’t just look good—it behaves well under stress.
By adding:
- Prediction instability plots
- Bootstrap scatter visualizations
- Average MAPE
- 95% uncertainty intervals
- Calibration instability overlays
You move from “This model performs well” to:
“This model performs well, consistently, and honestly.”
That distinction is exactly what separates publishable CPMs from clinically credible ones in the CECS framework.
Key takeaways
- Performance metrics alone do not guarantee reliable predictions.
- Stability analysis reveals overfitting that AUROC and calibration can hide.
- Bootstrap-based instability and uncertainty should be reported whenever possible.
- Stable calibration across resamples is a powerful signal of model robustness.
- TRIPOD-aligned CPMs benefit enormously from explicit stability reporting.




