
Prediction Stability in TRIPOD-Ready Clinical Prediction Models: The Multiverse Perspective

Clinical Epidemiology Research, Uniqcret doctor knowledges, Methodology and Research Design, Diagnosis [Methodology], Prognosis [Methodology]

Background

Internal validation is not only about how good your model looks in the development dataset, but also about how dependable the model is when the dataset changes slightly. In practice, we need two complementary bootstrap stories:

  1. Optimism bootstrap (overfitting / performance bias): “Is the apparent performance too good because the model is overfit?”
  2. Stability bootstrap (instability / model sensitivity): “If the data change a little, does the model change a lot?” This is the “multiverse” concept: many plausible models could have been developed from the same target population, and a patient’s predicted risk can vary across those models.

In Thai clinical-stat language: stability = เสถียร (stable). The opposite is instability (ไม่เสถียร).

Conceptual Link: Why You Need Both Bootstraps

A) Optimism Bootstrap (Performance-focused Internal Validation)

B) Stability Bootstrap (Reliability-focused Internal Validation)

Key bridge: A model can have “good” optimism-corrected performance and still be clinically unsafe if individual predictions are unstable.


METHODS

Study Context and Model Development

Describe your development exactly once, because stability bootstrap will repeat it:

Critical rule:

In each bootstrap sample, you must redevelop the model using the identical method used in the original development (same candidate predictors, same MFP process, same selection rules, same penalty/tuning strategy).


Internal Validation Part 1: Optimism Bootstrap (Overfitting of Apparent Performance)

Aim

To quantify how much apparent performance is exaggerated due to overfitting.

Procedure (typical wording)

(Keep this short; most readers already know the purpose of this bootstrap.)
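For orientation, here is a minimal sketch of one common form of the optimism bootstrap, applied to the C-statistic. It is an illustration under assumptions, not the procedure of any particular paper: the data are simulated, and `fit_logistic` (a plain logistic regression fitted by Newton-Raphson with a tiny ridge term for numerical stability) is a hypothetical stand-in for your real development pipeline.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Hypothetical stand-in for the development pipeline: logistic regression."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - p) - ridge * beta
        hess = (Xd * (p * (1.0 - p))[:, None]).T @ Xd + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_risk(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def c_statistic(y, p):
    """Share of event/non-event pairs ranked correctly (ties count 0.5)."""
    diff = p[y == 1][:, None] - p[y == 0][None, :]
    return (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size

def optimism_corrected_c(X, y, B=200, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    apparent = c_statistic(y, predict_risk(fit_logistic(X, y), X))
    optimism = np.empty(B)
    for b in range(B):
        idx = rng.integers(0, n, size=n)           # bootstrap sample
        beta_b = fit_logistic(X[idx], y[idx])      # redevelop in the sample
        c_boot = c_statistic(y[idx], predict_risk(beta_b, X[idx]))  # in-sample
        c_test = c_statistic(y, predict_risk(beta_b, X))            # on original
        optimism[b] = c_boot - c_test
    mean_optimism = optimism.mean()
    return apparent, mean_optimism, apparent - mean_optimism

# Simulated development data (assumption): 4 candidate predictors, 1 with signal.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.0 + 0.8 * X[:, 0]))))

apparent, optimism, corrected = optimism_corrected_c(X, y, B=200)
print(f"apparent C = {apparent:.3f}, optimism = {optimism:.3f}, corrected = {corrected:.3f}")
```

The same loop can return other measures (calibration slope, Brier score) by swapping `c_statistic` for the relevant function.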


Internal Validation Part 2: Stability Bootstrap (Instability / Multiverse Assessment)

Aim

To evaluate whether small perturbations in data lead to large changes in:

  1. model structure
  2. model parameters
  3. individual predicted risks (clinically most important)

Bootstrap size

Use B ≥ 200 bootstrap samples, which is in line with published recommendations.

Algorithm (write as steps in Methods)

For each bootstrap iteration b = 1, …, B:

  1. Sample n individuals with replacement from the development dataset.
  2. Redevelop the model in the bootstrap sample using the same development pipeline as the original model.
  3. Apply the bootstrap-developed model back to the original development dataset to obtain predicted risks p̂_bi for each individual i.
  4. Store p̂_bi alongside the original model prediction p̂_i.
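The four steps above can be sketched as follows. This is a minimal illustration under assumptions: the data are simulated, and `fit_logistic` / `predict_risk` are hypothetical stand-ins for your own development pipeline. In a real analysis, `develop` must wrap every step of the original strategy (functional form selection, variable selection, tuning), not just the final fit.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25, ridge=1e-6):
    """Hypothetical stand-in pipeline: logistic regression via Newton-Raphson."""
    Xd = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xd @ beta))
        grad = Xd.T @ (y - p) - ridge * beta
        hess = (Xd * (p * (1.0 - p))[:, None]).T @ Xd + ridge * np.eye(Xd.shape[1])
        beta = beta + np.linalg.solve(hess, grad)
    return beta

def predict_risk(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1.0 / (1.0 + np.exp(-Xd @ beta))

def stability_bootstrap(X, y, develop, predict, B=200, seed=0):
    """Return original predictions (n,) and bootstrap predictions (B, n)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    p_orig = predict(develop(X, y), X)        # original model prediction
    p_boot = np.empty((B, n))
    for b in range(B):
        idx = rng.integers(0, n, size=n)      # step 1: sample n with replacement
        model_b = develop(X[idx], y[idx])     # step 2: redevelop, same pipeline
        p_boot[b] = predict(model_b, X)       # step 3: apply to original data
    return p_orig, p_boot                     # step 4: stored side by side

# Simulated development data (assumption, for illustration only).
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 2))
true_lp = -1.0 + 0.8 * X[:, 0] - 0.5 * X[:, 1]
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-true_lp)))

p_orig, p_boot = stability_bootstrap(X, y, fit_logistic, predict_risk, B=200)
print(p_orig.shape, p_boot.shape)
```

Passing the pipeline in as the `develop` function is a deliberate design choice: it makes it harder to accidentally redevelop the bootstrap models with a different strategy from the original.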

Resulting prediction set

This yields, for each patient i:

p̂_i, p̂_1i, p̂_2i, …, p̂_Bi

This is the multiverse of plausible predictions.


What to Measure (Pre-specify these outputs)

1) Model Structure Instability (What changes in the model?)

Report:

Results formats:

2) Coefficient Instability (β drift)

Report:

Results formats:

3) Prediction Instability (Clinically most important)

This is your “same patient, different universe” question:

“Does the same patient get 10% risk in one bootstrap model and 30% in another?”

Core metric: Individual-level MAPE

For each patient:

$$\mathrm{MAPE}_i = \frac{1}{B} \sum_{b=1}^{B} \left| \hat{p}_{bi} - \hat{p}_i \right|$$

Then summarise across patients (mean/median, max, percentiles).
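A small worked example may help make the calculation concrete. The numbers below are made up for illustration (3 patients, 4 bootstrap models), and `individual_mape` is a hypothetical helper name; `p_boot` is assumed to hold one row per bootstrap model and one column per patient.

```python
import numpy as np

def individual_mape(p_orig, p_boot):
    """Mean absolute difference between bootstrap and original predictions,
    computed separately for each patient (column)."""
    return np.mean(np.abs(p_boot - p_orig[None, :]), axis=0)

# Toy values (assumption): original predictions for 3 patients ...
p_orig = np.array([0.10, 0.30, 0.50])
# ... and predictions for the same patients from B = 4 bootstrap models.
p_boot = np.array([
    [0.12, 0.25, 0.50],
    [0.08, 0.35, 0.40],
    [0.10, 0.20, 0.60],
    [0.14, 0.40, 0.50],
])

mape = individual_mape(p_orig, p_boot)
print("MAPE per patient:", np.round(mape, 3))   # 0.02, 0.075, 0.05

# Summaries across patients, as suggested in the text.
print("mean:", round(float(mape.mean()), 4))
print("median:", round(float(np.median(mape)), 4))
print("max:", round(float(mape.max()), 4))
print("95th percentile:", round(float(np.percentile(mape, 95)), 4))
```

For patient 1, for example, the absolute differences are 0.02, 0.02, 0.00 and 0.04, which average to 0.02.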

Core figure: Prediction Instability Plot

This plot is the visual proof of stability (เสถียร) vs instability.
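A prediction instability plot is usually drawn as a scatter of bootstrap-model predictions (y-axis) against original-model predictions (x-axis), often with 2.5th and 97.5th percentile lines. The sketch below only prepares the data for such a plot, so that it can be passed to any plotting library. The predictions are simulated, and the function name and percentile limits are assumptions.

```python
import numpy as np

def instability_plot_data(p_orig, p_boot, lower=2.5, upper=97.5):
    """Per-patient percentile band of bootstrap predictions,
    sorted by the original prediction for plotting."""
    order = np.argsort(p_orig)
    band_lo, band_hi = np.percentile(p_boot, [lower, upper], axis=0)
    return p_orig[order], band_lo[order], band_hi[order]

# Simulated predictions (assumption): bootstrap predictions are the original
# predictions perturbed on the logit scale.
rng = np.random.default_rng(7)
n, B = 100, 200
p_orig = rng.uniform(0.02, 0.80, size=n)
logit = np.log(p_orig / (1.0 - p_orig))
p_boot = 1.0 / (1.0 + np.exp(-(logit[None, :] + rng.normal(0, 0.4, size=(B, n)))))

x, band_lo, band_hi = instability_plot_data(p_orig, p_boot)
width = band_hi - band_lo
print("widest 95% range of predicted risk:", round(float(width.max()), 3))
```

A narrow band around the diagonal suggests stable predictions; a wide band means the same patient could receive quite different risks from equally plausible models.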


Optional but Recommended Extensions (Separate subsections)

A) Classification Stability (Decision instability at a threshold)

If your model is used with a treatment threshold (e.g., 10%):
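One way to quantify this is a per-patient classification instability index: the proportion of bootstrap models that place the patient on the other side of the threshold from the original model. The sketch below uses made-up numbers; the function name and the choice to treat a risk equal to the threshold as "above" are assumptions.

```python
import numpy as np

def classification_instability(p_orig, p_boot, threshold=0.10):
    """Share of bootstrap models whose classification at `threshold`
    disagrees with the original model, per patient."""
    orig_class = p_orig >= threshold            # assumption: >= counts as "treat"
    boot_class = p_boot >= threshold
    return np.mean(boot_class != orig_class[None, :], axis=0)

# Toy values (assumption): 4 patients, B = 4 bootstrap models.
p_orig = np.array([0.05, 0.09, 0.11, 0.40])
p_boot = np.array([
    [0.04, 0.11, 0.12, 0.35],
    [0.06, 0.08, 0.09, 0.45],
    [0.03, 0.12, 0.13, 0.38],
    [0.07, 0.07, 0.08, 0.42],
])

cii = classification_instability(p_orig, p_boot, threshold=0.10)
print(cii)   # patients near the threshold flip; those far from it do not
```

Here the two patients with original risks of 9% and 11% are reclassified by half of the bootstrap models, while the patients at 5% and 40% are never reclassified.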

B) Calibration Instability (Population-level stability)

Overlay calibration curves from bootstrap models applied to the original dataset:
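As a rough sketch of the idea, the code below computes one calibration curve per bootstrap model so the curves can be overlaid. It swaps in a simple grouped (quantile-bin) curve rather than a smoothed curve, and it uses simulated outcomes and simulated bootstrap predictions; all names and settings are assumptions for illustration.

```python
import numpy as np

def binned_calibration(y, p, n_bins=5):
    """Grouped calibration curve: mean predicted risk and observed event
    rate within equal-sized groups ordered by predicted risk."""
    groups = np.array_split(np.argsort(p), n_bins)
    mean_pred = np.array([p[g].mean() for g in groups])
    obs_rate = np.array([y[g].mean() for g in groups])
    return mean_pred, obs_rate

# Simulated original dataset and bootstrap-model predictions (assumption).
rng = np.random.default_rng(42)
n, B = 500, 20
true_risk = rng.uniform(0.05, 0.60, size=n)
y = rng.binomial(1, true_risk)
logit = np.log(true_risk / (1.0 - true_risk))
p_boot = 1.0 / (1.0 + np.exp(-(logit[None, :] + rng.normal(0, 0.3, size=(B, n)))))

# One curve per bootstrap model, each evaluated on the original outcomes y.
curves = [binned_calibration(y, p_boot[b]) for b in range(B)]

# Spread of observed event rates across models within each risk group.
obs = np.array([c[1] for c in curves])          # shape (B, n_bins)
print("range of observed rate per group:", np.round(obs.max(axis=0) - obs.min(axis=0), 3))
```

Plotting each `(mean_pred, obs_rate)` pair as a faint line on the same axes gives the overlay; a tight bundle of lines suggests stable calibration, a wide fan suggests instability.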


RESULTS (How to Lay It Out Correctly)

Results Section 1: Optimism-Corrected Performance (Internal Validation)

Report in 1–2 paragraphs + a small table:

Table example: “Apparent vs Optimism-corrected performance”


Results Section 2: Stability (เสถียร) Assessment (Bootstrap Multiverse)

2.1 Model Structure Stability

2.2 Parameter Stability

2.3 Individual Prediction Stability (Primary clinical focus)

2.4 Decision Stability (if threshold-based use)

2.5 Calibration Stability


Recommended Figure Placement (Simple Rule)

Main manuscript (minimum strong set)

Prediction instability plot (must have if your paper emphasizes reliability)

Figure 2 from the BMC Medicine article (image credit: Springer Nature, BMC Medicine, Figure 2).

Supplement (common but acceptable)

Figure 3 from the BMC Medicine article (image credit: Springer Nature, BMC Medicine, Figure 3).
Figure 4 from the BMC Medicine article (image credit: Springer Nature, BMC Medicine, Figure 4).

Discussion: How to Interpret and What to Do If It’s Unstable

If instability is high:


Methods Summary

“We performed internal validation using bootstrapping for two purposes: (i) optimism correction to quantify overfitting in apparent performance, and (ii) a stability bootstrap to examine model instability across a multiverse of plausible development samples. For the stability assessment, we generated B ≥ 200 bootstrap samples of size n with replacement. In each bootstrap sample, we redeveloped the model using the same development strategy as the original model (same candidate predictors, same functional form strategy such as MFP/linear terms, and same selection/tuning procedure), and applied each bootstrap-developed model to the original dataset to obtain patient-level predicted risks. We quantified individual-level prediction instability using the mean absolute prediction error (MAPE) and visualised it using a prediction instability plot. We additionally assessed decision instability (classification instability) at clinically relevant thresholds, and calibration instability by overlaying bootstrap calibration curves.”


Key takeaways
