
Beyond Performance: Prediction Stability and Uncertainty in TRIPOD-Ready Clinical Prediction Models


Introduction

Once you’ve shown discrimination, calibration, and clinical utility, a sophisticated reader asks a deeper question:

“How stable are these predictions if the data were slightly different?”

This is where many clinical prediction model (CPM) papers stop too early.

Two models can have identical AUROC, calibration, and decision curve analysis (DCA) results, yet one is fragile and overly dependent on idiosyncrasies of the development sample, while the other is robust. Stability analysis exposes this difference.

In the CECS CPM framework, this fourth layer is explicitly taught as prediction stability and uncertainty, evaluated using internal resampling methods (typically bootstrapping).

The goal is not to impress—but to be honest about how much trust each individual prediction deserves.


1) Prediction Instability Plot — How much do predictions move?

What it shows: Individual-level prediction stability

A prediction instability plot visualizes how much each patient’s predicted risk varies across bootstrap samples compared with the original model.

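Conceptually, it plots each patient's predicted risk from the original model (x-axis) against the same patient's predicted risk from each bootstrap model (y-axis).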

In CECS-aligned workflows, this is typically produced using internal validation tools (e.g., pminternal, type = "instability").

Clinical translation

“If I refit this model on slightly different versions of the same dataset, do patients keep roughly the same risk estimates—or do they jump around?”

Stable models show tight clouds around the identity line. Unstable models show wide vertical scatter—especially concerning at clinically actionable risk thresholds.
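The resampling idea behind the plot can be sketched in a few lines. This is a minimal illustration only, not the pminternal implementation: the one-predictor logistic model, the crude gradient-ascent fitter, the simulated data, and every function name (`fit_logistic`, `bootstrap_predictions`, and so on) are assumptions made for the example. The key step is that each bootstrap model is refit on a resample but always predicts for the original patients.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(x, y, n_iter=300, lr=0.5):
    """Crude one-predictor logistic fit by gradient ascent (illustrative only)."""
    b0 = b1 = 0.0
    n = len(x)
    for _ in range(n_iter):
        g0 = g1 = 0.0
        for xi, yi in zip(x, y):
            resid = yi - sigmoid(b0 + b1 * xi)
            g0 += resid
            g1 += resid * xi
        b0 += lr * g0 / n
        b1 += lr * g1 / n
    return b0, b1


def predict(coefs, x):
    b0, b1 = coefs
    return [sigmoid(b0 + b1 * xi) for xi in x]


def bootstrap_predictions(x, y, n_boot=30, seed=0):
    """Refit on bootstrap resamples; always predict for the original patients."""
    rng = random.Random(seed)
    n = len(x)
    p_orig = predict(fit_logistic(x, y), x)
    p_boot = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        coefs = fit_logistic([x[i] for i in idx], [y[i] for i in idx])
        p_boot.append(predict(coefs, x))
    return p_orig, p_boot


# Simulated development data (hypothetical: one standardized predictor).
data_rng = random.Random(1)
x = [data_rng.gauss(0.0, 1.0) for _ in range(150)]
y = [1 if data_rng.random() < sigmoid(-1.0 + 1.0 * xi) else 0 for xi in x]

p_orig, p_boot = bootstrap_predictions(x, y)

# Points for the plot: x = original prediction, y = bootstrap prediction.
points = [(p_orig[i], row[i]) for row in p_boot for i in range(len(x))]

# Vertical scatter per patient: range of that patient's bootstrap predictions.
spread = [max(row[i] for row in p_boot) - min(row[i] for row in p_boot)
          for i in range(len(x))]
print(f"largest per-patient spread: {max(spread):.3f}")
```

Plotting `points` with any plotting tool gives the cloud around the identity line; `spread` is one simple numeric summary of the vertical scatter. A real analysis would use the full model-building procedure inside each resample and many more bootstrap replicates than the 30 used here.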

Common rookie mistake

Reporting only average performance metrics and ignoring whether individual patients’ predictions are reliable.

Scatter of Bootstrap vs Original Predictions — Seeing overfitting directly

A complementary visualization overlays:

Each point represents a patient. The distance from the diagonal line reflects prediction instability.

Why this matters:

This plot is often more persuasive to reviewers than abstract statements about “internal validation.”

🔍 Secret Insight: Reviewers may forgive a modest AUROC, but they rarely forgive unstable predictions once shown.


2) Average MAPE — How wrong are predictions on average?

What it shows: Numerical prediction error

Mean Absolute Prediction Error (MAPE) summarizes how far, on average, predictions from the bootstrap-refitted models deviate from the original (reference) model's predictions.

Formally:

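$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{\hat{p}_{\text{boot},i} - \hat{p}_{\text{ref},i}}{\hat{p}_{\text{ref},i}} \right|$$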

Where:

In CECS teaching, MAPE < 0.02 is often used as a heuristic for excellent stability in well-calibrated binary outcome models, recognizing that acceptable thresholds depend on context.
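As a rough numeric illustration, the sketch below computes MAPE from a reference prediction vector and a small set of bootstrap prediction vectors. It assumes one reading of the formula above, in which the difference is divided by the reference prediction; because some tools report the plain absolute difference instead, a `relative=False` option is included. The function names and the numbers are made up for the example.

```python
def mape_per_patient(p_ref, p_boot, relative=True):
    """Mean absolute deviation of bootstrap predictions from the reference,
    averaged over bootstrap models, returned per patient."""
    n_boot = len(p_boot)
    result = []
    for i, ref in enumerate(p_ref):
        total = 0.0
        for row in p_boot:
            err = abs(row[i] - ref)
            # Relative form divides by the reference prediction (assumed reading).
            total += err / ref if relative else err
        result.append(total / n_boot)
    return result


def average_mape(p_ref, p_boot, relative=True):
    """Average of the per-patient values across all patients."""
    per_patient = mape_per_patient(p_ref, p_boot, relative)
    return sum(per_patient) / len(per_patient)


# Made-up numbers: 3 patients, 2 bootstrap models.
p_ref = [0.10, 0.20, 0.50]
p_boot = [
    [0.12, 0.18, 0.50],
    [0.08, 0.22, 0.55],
]

relative_mape = average_mape(p_ref, p_boot, relative=True)
absolute_mape = average_mape(p_ref, p_boot, relative=False)
print(f"relative: {relative_mape:.4f}, absolute: {absolute_mape:.4f}")
```

The two versions can differ a lot at low predicted risks, where a small absolute shift is a large relative one, so it is worth stating which definition a reported threshold such as 0.02 refers to.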

Clinical translation

“On average, how far does a patient’s predicted risk move when the model is refit on a resampled dataset?”

MAPE complements calibration plots:

Common rookie mistake

Reporting only AUROC and ignoring absolute prediction error, which clinicians care about far more.


3) 95% Uncertainty Intervals — Admitting what we don’t know

What it shows: Prediction uncertainty, not just point estimates

Bootstrap-based uncertainty intervals (UIs) reflect how much each patient’s predicted risk varies across resampled models.

For each patient, you can derive:

These intervals are often visualized directly in instability plots or as vertical error bars.
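One common way to obtain such intervals is the percentile method: for each patient, take the 2.5th and 97.5th percentiles of that patient's predictions across the bootstrap models. The sketch below assumes that approach; the helper names and the tiny made-up prediction matrix are for illustration only, and a real analysis would use hundreds of bootstrap replicates rather than five.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def uncertainty_intervals(p_boot, level=0.95):
    """Percentile interval of bootstrap predictions for each patient."""
    alpha = (1.0 - level) / 2.0
    n_patients = len(p_boot[0])
    intervals = []
    for i in range(n_patients):
        preds_i = [row[i] for row in p_boot]  # one patient, all bootstrap models
        intervals.append((quantile(preds_i, alpha), quantile(preds_i, 1.0 - alpha)))
    return intervals


# Made-up predictions: 5 bootstrap models (rows) x 2 patients (columns).
p_boot = [
    [0.15, 0.40],
    [0.18, 0.42],
    [0.20, 0.38],
    [0.17, 0.45],
    [0.22, 0.41],
]

intervals = uncertainty_intervals(p_boot)
for i, (low, high) in enumerate(intervals, start=1):
    print(f"patient {i}: {low:.3f} to {high:.3f}")
```

These intervals describe how much the prediction moves under resampling of the development data; they are not intervals for the patient's eventual outcome.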

Clinical translation

“This patient’s risk is estimated at 18%—but realistically, it could be anywhere from 12% to 25%.”

This framing is crucial for shared decision-making and aligns naturally with DCA-based threshold reasoning.

Common rookie mistake

Presenting single-number risk predictions without acknowledging uncertainty, which falsely implies precision.


4) Calibration Instability — Does calibration survive resampling?

Calibration should not only look good once.

Calibration instability analysis overlays calibration curves from multiple bootstrap samples on top of the apparent calibration curve.

What you want to see:

What signals trouble:

In CECS CPM logic, this is a direct test of whether calibration is structural or merely accidental.
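A simplified version of this check can be sketched with grouped calibration: compute (mean predicted risk, observed event rate) per risk group for the original predictions and for each bootstrap model's predictions, always against the original outcomes, and then look at how much the curves differ. Everything here is an assumption for illustration: the data are simulated, the function names are hypothetical, and the "bootstrap" predictions are produced by randomly perturbing the reference predictions on the logit scale as a stand-in for genuinely refitted models.

```python
import math
import random


def binned_calibration(pred, y, n_bins=5):
    """(mean predicted risk, observed event rate) for equal-size risk groups."""
    order = sorted(range(len(pred)), key=lambda i: pred[i])
    n = len(order)
    curve = []
    for b in range(n_bins):
        group = order[b * n // n_bins:(b + 1) * n // n_bins]
        mean_pred = sum(pred[i] for i in group) / len(group)
        obs_rate = sum(y[i] for i in group) / len(group)
        curve.append((mean_pred, obs_rate))
    return curve


def logit(p):
    return math.log(p / (1.0 - p))


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


# Simulated reference predictions and outcomes (hypothetical).
rng = random.Random(42)
p_ref = [rng.uniform(0.05, 0.60) for _ in range(300)]
y = [1 if rng.random() < p else 0 for p in p_ref]

# Stand-in for refitted bootstrap models: perturb intercept and slope on the logit scale.
p_boot = []
for _ in range(20):
    shift, slope = rng.gauss(0.0, 0.2), rng.gauss(1.0, 0.1)
    p_boot.append([sigmoid(shift + slope * logit(p)) for p in p_ref])

apparent_curve = binned_calibration(p_ref, y)
boot_curves = [binned_calibration(pb, y) for pb in p_boot]

# Spread of the calibration gap (observed minus predicted) in each risk group.
for b in range(5):
    gaps = [curve[b][1] - curve[b][0] for curve in boot_curves]
    print(f"group {b + 1}: gap from {min(gaps):+.3f} to {max(gaps):+.3f}")
```

Overlaying `boot_curves` on `apparent_curve` gives a rough calibration instability picture. Smoothed calibration curves are usually preferred over five coarse groups; the grouping is used here only to keep the example short.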


How stability completes the TRIPOD performance story

At this point, your CPM performance narrative answers five distinct questions:

| Layer | Question answered |
| --- | --- |
| Discrimination (ROC) | Can the model rank patients correctly? |
| Calibration | Are the predicted probabilities numerically correct? |
| Usefulness (DCA) | Does using the model improve decisions? |
| Stability | Are predictions robust to data perturbation? |
| Uncertainty | How precise are individual predictions? |

TRIPOD emphasizes transparent reporting and validation; stability analysis operationalizes that principle at the individual patient level, where clinical decisions actually occur.


Final takeaway: Stable models earn trust; unstable models borrow it

A TRIPOD-ready CPM doesn’t just look good—it behaves well under stress.

By adding:

You move from “This model performs well” to:

“This model performs well, consistently, and honestly.”

That distinction is exactly what separates publishable CPMs from clinically credible ones in the CECS framework.


Key takeaways
