
Internal Validation vs Instability

  • Writer: Mayta
  • 6 hours ago
  • 5 min read

Pocket note

“The concept depends on which dataset you compare the model against (i.e., where you evaluate it).”

Why it feels like “same data but different view”

Think of data as wearing different hats:

  • Training hat: data used to fit (estimate) the model

  • Testing hat: data used to evaluate (score) the model

In bootstrap workflows, the original dataset can wear the testing hat even though it’s still the same file of patients.

So it’s not “different data”; it’s the same data playing a different role in the comparison.

The key comparison that defines the concept

1) Internal validation (optimism correction) = “two-hat comparison”

Key idea: compare performance in the data used to fit vs a separate evaluation set

In bootstrap:

  • Fit in bootstrap sample (training hat)

  • Evaluate in:

    • bootstrap sample (same data as training → “apparent” performance)

    • original dataset (plays the role of “new-ish” data → “test” performance)

Optimism = apparent − test. This only exists because you evaluated on two different datasets/roles.
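
For example (hypothetical numbers): if the apparent C-statistic in the bootstrap sample is 0.82 and the same model scores 0.78 when evaluated in the original dataset, the optimism for that resample is 0.04.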

2) Instability (parameter stability) = “model-to-model comparison”

Key idea: compare models fitted on many resamples

  • Fit many bootstrap models

  • Look at how coefficients / variable selection change across those models

Here the key “comparison” is model vs model, not performance-in-two-datasets. [6]

3) Predictive uncertainty = “prediction distribution for the same person”

Key idea: compare predicted risk for the same individual across many bootstrap models

  • Fit bootstrap models

  • Predict each person in the original dataset → many predicted risks for the same person

  • Summarize spread (median, 2.5–97.5 percentile)

Comparison is prediction vs prediction, not performance correction.
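
For example (hypothetical numbers): one patient’s predicted risk across 500 bootstrap models might have a median of 0.16 with a 2.5–97.5 percentile interval of 0.12 to 0.21.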

The single keyword that will un-confuse you

When you’re stuck, ask yourself:

“What is my evaluation dataset (or evaluation role)?”

That question tells you whether you’re doing:

  • internal validation (needs apparent vs test performance),

  • parameter instability (needs beta distributions),

  • or prediction uncertainty (needs p-hat distributions).

“The interpretation depends on which dataset is used for evaluation (bootstrap sample vs original dataset), even if the original dataset appears to be the ‘same data’.”

1) Internal Validation (Optimism Correction)

Scientific question

How much does model performance look “too good” because the model was evaluated on the same data used for development?

What is being validated

Model performance (not coefficients), including:

  • discrimination (e.g., AUROC / C-statistic)

  • calibration (calibration slope/intercept)

  • overall accuracy (e.g., Brier score)

Conceptual target

Overfitting / optimism due to development-data reuse

Bootstrap logic (core steps)

For each bootstrap sample (b):

  1. Fit the model in the bootstrap sample

  2. Evaluate performance in:

    • bootstrap sample → Perf_boot(b) (apparent in bootstrap)

    • original dataset → Perf_orig(b) (test in original)

  3. Optimism(b) = Perf_boot(b) − Perf_orig(b)

Final:

  • Mean optimism = average of Optimism(b) across all bootstrap samples

  • Optimism-corrected performance = Apparent performance in original data − Mean optimism
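
The loop can be sketched in a few lines of base R. This is a minimal illustration, assuming a logistic regression with a binary outcome y and two predictors x1 and x2 in a data frame dat, and using the C-statistic as the performance measure; the object names (dat, cstat, B) are illustrative, and this is not pminternal output.

  # Minimal sketch of bootstrap optimism correction (base R, logistic regression).
  # Assumption: `dat` has a 0/1 outcome `y` and predictors `x1`, `x2` (illustrative names).

  cstat <- function(p, y) {                      # rank-based C-statistic (Mann-Whitney)
    r  <- rank(p)
    n1 <- sum(y == 1); n0 <- sum(y == 0)
    (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  }

  full_fit <- glm(y ~ x1 + x2, data = dat, family = binomial)
  apparent <- cstat(predict(full_fit, type = "response"), dat$y)

  B <- 500
  optimism <- numeric(B)
  set.seed(1)
  for (b in seq_len(B)) {
    boot_dat    <- dat[sample(nrow(dat), replace = TRUE), ]   # resample with replacement
    fit_b       <- glm(y ~ x1 + x2, data = boot_dat, family = binomial)
    perf_boot   <- cstat(predict(fit_b, type = "response"), boot_dat$y)            # apparent in bootstrap
    perf_orig   <- cstat(predict(fit_b, newdata = dat, type = "response"), dat$y)  # test in original
    optimism[b] <- perf_boot - perf_orig
  }

  corrected <- apparent - mean(optimism)   # optimism-corrected C-statistic

The same loop extends to calibration slope or the Brier score by swapping in a different performance function.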

Unit of inference

Model-level performance

What you are allowed to claim

“Model performance was internally validated using bootstrap optimism correction.”

What you must NOT claim

  • Model is stable (that is a different claim)

  • Individual predicted risks have quantified uncertainty (different aim)

pminternal: This is a core purpose of pminternal (internal validation via bootstrap), so this claim is aligned.

2) Instability (Stability Analysis)

“Instability” is not one thing. For articles, separate it into (A) model/parameter instability and (B) predictive uncertainty. pminternal mainly addresses (A), and partially addresses performance variability.

2A) Model (Parameter) Instability

Scientific question

If we re-sampled from the same underlying population, would we obtain a meaningfully different model?

What is being assessed

  • regression coefficients (betas)

  • variable selection frequency (if selection was used)

  • model structure (sign changes, inclusion/exclusion patterns)

Bootstrap logic (core steps)

For each bootstrap sample:

  1. Fit a new model in the bootstrap sample

  2. Store coefficients and (if applicable) selected variables

  3. Summarize:

    • coefficient distributions (spread, sign stability)

    • selection frequency for each predictor
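
A minimal base-R sketch of this loop, reusing the illustrative dat / y / x1 / x2 setup from the sketch in section 1; the selection-frequency part only applies if your workflow includes a selection step, which is not shown here.

  # Minimal sketch of bootstrap parameter instability (base R).
  B <- 500
  coefs <- matrix(NA_real_, nrow = B, ncol = 3,
                  dimnames = list(NULL, c("(Intercept)", "x1", "x2")))

  set.seed(1)
  for (b in seq_len(B)) {
    boot_dat   <- dat[sample(nrow(dat), replace = TRUE), ]
    fit_b      <- glm(y ~ x1 + x2, data = boot_dat, family = binomial)
    coefs[b, ] <- coef(fit_b)                     # store the fitted betas
  }

  apply(coefs, 2, quantile, probs = c(0.025, 0.5, 0.975))  # spread of each coefficient
  colMeans(coefs > 0)                                      # sign stability (share of positive estimates)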

Unit of inference

Model structure (parameters)

What you are allowed to claim

“Model parameters showed limited/substantial instability across bootstrap samples.”

What you must NOT claim

  • Corrected performance (that belongs to internal validation)

  • Individual predicted-risk intervals (that is predictive uncertainty)

pminternal: This is the main meaning of “instability” you should emphasize when your workflow relies on pminternal.

2B) Predictive Stability (Uncertainty of Predicted Probability)

Scientific question

For each individual, how uncertain is the predicted risk due to sampling variation?

What is being assessed

Predicted probability (risk) for the same individual, not coefficients.

Bootstrap logic (core steps)

For each bootstrap sample (b):

  1. Fit model in bootstrap sample

  2. Predict risk for individuals in the original dataset → p_hat_i(b)

For each individual (i):

  • summarize distribution of p_hat_i(1…B)

  • report median and 2.5–97.5 percentiles as a bootstrap interval for predicted risk
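
A minimal base-R sketch, again assuming the illustrative dat / y / x1 / x2 setup; each row of p_hat collects one individual’s predicted risks across the bootstrap models.

  # Minimal sketch of individual-level predicted-risk intervals (base R).
  B <- 500
  p_hat <- matrix(NA_real_, nrow = nrow(dat), ncol = B)   # rows = individuals, columns = bootstrap models

  set.seed(1)
  for (b in seq_len(B)) {
    boot_dat   <- dat[sample(nrow(dat), replace = TRUE), ]
    fit_b      <- glm(y ~ x1 + x2, data = boot_dat, family = binomial)
    p_hat[, b] <- predict(fit_b, newdata = dat, type = "response")  # risk for each original individual
  }

  risk_summary <- t(apply(p_hat, 1, quantile, probs = c(0.025, 0.5, 0.975)))
  head(risk_summary)   # per-person median and 2.5–97.5 percentile bootstrap interval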

Unit of inference

Individual-level prediction

What you are allowed to claim

“Individual risk predictions showed acceptable uncertainty.”

What you must NOT claim

  • Internal validation (this does not correct optimism)

  • Parameter stability (this does not directly describe coefficient behavior)

⚠️ pminternal: This is not a core output/aim of pminternal. If you report this, present it as an additional analysis (separate from pminternal internal validation).

3) Performance instability (optional but commonly reported)

This is different from optimism correction. Here you describe variability (spread) of performance estimates across bootstrap samples (e.g., distribution of AUROC). This is descriptive stability of performance.
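
A minimal base-R sketch of this descriptive summary, reusing the illustrative cstat() helper and dat from the sketch in section 1; here each bootstrap model is scored in the original dataset, and the output is a spread of performance values, not a corrected estimate.

  # Minimal sketch of performance variability across bootstrap models (base R).
  B <- 500
  auc_boot <- numeric(B)

  set.seed(1)
  for (b in seq_len(B)) {
    boot_dat    <- dat[sample(nrow(dat), replace = TRUE), ]
    fit_b       <- glm(y ~ x1 + x2, data = boot_dat, family = binomial)
    auc_boot[b] <- cstat(predict(fit_b, newdata = dat, type = "response"), dat$y)
  }

  quantile(auc_boot, probs = c(0.025, 0.5, 0.975))  # descriptive spread of the C-statistic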

pminternal: Often supports or facilitates this via bootstrap results, but your primary “validation” claim should still be optimism correction.

4) Reviewer-proof one-table summary

Aspect | Internal validation (optimism correction) | Model/parameter instability | Predictive uncertainty (risk intervals)
Main question | Is performance overly optimistic? | Is model structure stable? | Is individual predicted risk stable?
Unit | Performance | Parameters | Individual prediction
Uses original data as test set | Yes (required) | Optional | Yes (required)
Corrects overfitting | Yes | No | No
Produces individual risk intervals | No | No | Yes
pminternal aligned? | Yes (core) | Yes (core) | Not core


5) Article-ready Methods paragraph (merged)

Internal validation of model performance was conducted using bootstrap optimism correction. For each bootstrap resample, the model was refitted and performance was evaluated in both the bootstrap sample and the original dataset; the average difference (optimism) was used to correct the apparent performance in the original data. Model instability was assessed by examining variability of regression coefficients (and, where applicable, predictor selection frequencies) across bootstrap samples. If individual-level uncertainty in predicted risk was reported, it was derived from the distribution of predicted probabilities generated by refitting the model in each bootstrap sample and predicting risks for individuals in the original dataset; this analysis quantifies uncertainty in individual predictions and should be interpreted separately from internal validation.

6) How to word “instability” if you used pminternal (recommended)

“Model instability was assessed using bootstrap resampling as implemented in the pminternal workflow, focusing on variability of regression coefficients and performance measures across bootstrap samples.”

(Notice: no mention of individual risk intervals unless you truly did that separately.)

Key takeaways

  • Internal validation = bootstrap optimism correction (performance-focused).

  • Instability (pminternal meaning) = parameter instability + (optionally) performance variability.

  • Individual predicted-risk intervals = predictive uncertainty, separate claim, not the default pminternal “instability.”

  • In the paper, always state what question you asked and what unit you are inferring.
