Internal Validation vs Instability
- Mayta

Pocket note
“The concept depends on which dataset you compare the model against (i.e., where you evaluate it).”
Why it feels like “same data but different view”
Think of data as wearing different hats:
Training hat: data used to fit (estimate) the model
Testing hat: data used to evaluate (score) the model
In bootstrap workflows, the original dataset can be used as a testing hat even though it’s still the same file of patients.
So it’s not “different data”; it’s the same data playing a different role in the comparison.
The key comparison that defines each concept
1) Internal validation (optimism correction) = “two-hat comparison”
Key idea: compare performance in the data used to fit vs a separate evaluation set
In bootstrap:
Fit in bootstrap sample (training hat)
Evaluate in:
bootstrap sample (same data as training → “apparent” performance)
original dataset (plays the role of “new-ish” data → “test” performance)
Optimism = apparent − test
This quantity only exists because you evaluated the model on two different datasets/roles. For example, if a bootstrap model scores an AUROC of 0.82 in its own bootstrap sample and 0.78 in the original dataset, its optimism is 0.04.
2) Instability (parameter stability) = “model-to-model comparison”
Key idea: compare models fitted on many resamples
Fit many bootstrap models
Look at how coefficients / variable selection change across those models
Here the key “comparison” is model vs model, not performance-in-two-datasets. [6]
3) Predictive uncertainty = “prediction distribution for the same person”
Key idea: compare predicted risk for the same individual across many bootstrap models
Fit bootstrap models
Predict each person in the original dataset → many predicted risks for the same person
Summarize spread (median, 2.5–97.5 percentile)
Comparison is prediction vs prediction, not performance correction.
The single question that will un-confuse you
When you’re stuck, ask yourself:
“What is my evaluation dataset (or evaluation role)?”
That question tells you whether you’re doing:
internal validation (needs apparent vs test performance),
parameter instability (needs beta distributions),
or prediction uncertainty (needs p-hat distributions).
“The interpretation depends on which dataset is used for evaluation (bootstrap sample vs original dataset), even if the original dataset appears to be the ‘same data’.”
1) Internal Validation (Optimism Correction)
Scientific question
How much does model performance look “too good” because the model was evaluated on the same data used for development?
What is being validated
Model performance (not coefficients), including:
discrimination (e.g., AUROC / C-statistic)
calibration (calibration slope/intercept)
overall accuracy (e.g., Brier score)
Conceptual target
Overfitting / optimism due to development-data reuse
Bootstrap logic (core steps; a minimal base-R sketch follows this list)
For each bootstrap sample (b):
Fit the model in the bootstrap sample
Evaluate performance in:
bootstrap sample → Perf_boot(b) (apparent in bootstrap)
original dataset → Perf_orig(b) (test in original)
Optimism(b) = Perf_boot(b) − Perf_orig(b)
Final:
Mean optimism = average of Optimism(b) across all bootstrap samples
Optimism-corrected performance = Apparent performance in original data − Mean optimism
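For concreteness, here is a minimal base-R sketch of this loop. It does not use pminternal’s own API; the data frame `d`, the binary outcome `y`, and the predictors `x1` and `x2` are hypothetical placeholders, and only the C-statistic is shown.

```r
## Minimal sketch of bootstrap optimism correction for the C-statistic.
## Assumes a data frame `d` with a 0/1 outcome `y` and predictors `x1`, `x2` (hypothetical).
set.seed(2024)

cstat <- function(y, p) {                     # rank-based (Mann-Whitney) AUROC
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B        <- 500
fit_orig <- glm(y ~ x1 + x2, family = binomial, data = d)
apparent <- cstat(d$y, predict(fit_orig, type = "response"))   # apparent performance in original data

optimism <- replicate(B, {
  db        <- d[sample(nrow(d), replace = TRUE), ]            # bootstrap sample
  fit_b     <- glm(y ~ x1 + x2, family = binomial, data = db)  # fit in bootstrap sample
  perf_boot <- cstat(db$y, predict(fit_b, type = "response"))                # Perf_boot(b): "apparent" in bootstrap
  perf_orig <- cstat(d$y,  predict(fit_b, newdata = d, type = "response"))   # Perf_orig(b): "test" in original
  perf_boot - perf_orig                                        # Optimism(b)
})

corrected <- apparent - mean(optimism)   # optimism-corrected C-statistic
```

In practice you would compute the calibration slope/intercept and the Brier score inside the same loop; the pminternal workflow described in this note is built around exactly this logic.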
Unit of inference
Model-level performance
What you are allowed to claim
“Model performance was internally validated using bootstrap optimism correction.”
What you must NOT claim
Model is stable (that is a different claim)
Individual predicted risks have quantified uncertainty (different aim)
✅ pminternal: This is a core purpose of pminternal (internal validation via bootstrap), so this claim is aligned.
2) Instability (Stability Analysis)
“Instability” is not one thing. For articles, separate it into (A) model/parameter instability and (B) predictive uncertainty. pminternal mainly addresses (A), and partially addresses performance variability.
2A) Model (Parameter) Instability
Scientific question
If we re-sampled from the same underlying population, would we obtain a meaningfully different model?
What is being assessed
regression coefficients (betas)
variable selection frequency (if selection was used)
model structure (sign changes, inclusion/exclusion patterns)
Bootstrap logic (core steps; see the sketch after this list)
For each bootstrap sample:
Fit a new model in the bootstrap sample
Store coefficients and (if applicable) selected variables
Summarize:
coefficient distributions (spread, sign stability)
selection frequency for each predictor
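As a sketch of that loop (again plain base R rather than pminternal output; `d`, `y`, `x1`, `x2` remain hypothetical placeholders):

```r
## Coefficient (parameter) instability across bootstrap refits -- illustrative sketch.
set.seed(2024)
B <- 500

betas <- t(replicate(B, {
  db <- d[sample(nrow(d), replace = TRUE), ]                 # bootstrap sample
  coef(glm(y ~ x1 + x2, family = binomial, data = db))       # refit and keep the coefficients
}))

## Spread and sign stability of each coefficient
apply(betas, 2, quantile, probs = c(0.025, 0.5, 0.975))      # percentile spread per coefficient
colMeans(betas > 0)                                          # proportion of resamples with a positive sign

## If variable selection were applied inside the loop, you would instead record which
## predictors each refit retained and report their selection frequencies.
```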
Unit of inference
Model structure (parameters)
What you are allowed to claim
“Model parameters showed limited/substantial instability across bootstrap samples.”
What you must NOT claim
Corrected performance (that belongs to internal validation)
Individual predicted-risk intervals (that is predictive uncertainty)
✅ pminternal: This is the main meaning of “instability” you should emphasize when your workflow relies on pminternal.
2B) Predictive Stability (Uncertainty of Predicted Probability)
Scientific question
For each individual, how uncertain is the predicted risk due to sampling variation?
What is being assessed
Predicted probability (risk) for the same individual, not coefficients.
Bootstrap logic (core steps; a sketch follows this list)
For each bootstrap sample (b):
Fit model in bootstrap sample
Predict risk for individuals in the original dataset → p_hat_i(b)
For each individual (i):
summarize distribution of p_hat_i(1…B)
report median and 2.5–97.5 percentiles as a bootstrap interval for predicted risk
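A minimal base-R sketch of those steps, kept separate from pminternal because this is not its core aim (`d`, `y`, `x1`, `x2` are the same hypothetical placeholders):

```r
## Bootstrap uncertainty of each individual's predicted risk -- illustrative sketch.
set.seed(2024)
B <- 500

## Columns = bootstrap models, rows = individuals in the original dataset
p_hat <- replicate(B, {
  db    <- d[sample(nrow(d), replace = TRUE), ]
  fit_b <- glm(y ~ x1 + x2, family = binomial, data = db)
  predict(fit_b, newdata = d, type = "response")   # p_hat_i(b) for everyone in the original data
})

## Per-person summary: median and 2.5-97.5 percentile interval of predicted risk
risk_summary <- t(apply(p_hat, 1, quantile, probs = c(0.025, 0.5, 0.975)))
head(risk_summary)
```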
Unit of inference
Individual-level prediction
What you are allowed to claim
“Individual risk predictions showed limited/substantial uncertainty across bootstrap models.”
What you must NOT claim
Internal validation (this does not correct optimism)
Parameter stability (this does not directly describe coefficient behavior)
⚠️ pminternal: This is not a core output/aim of pminternal. If you report this, present it as an additional analysis (separate from pminternal internal validation).
3) Performance instability (optional but commonly reported)
This is different from optimism correction. Here you describe the variability (spread) of performance estimates across bootstrap samples (e.g., the distribution of AUROC); a short sketch follows below. This is descriptive stability of performance, not a correction.
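If the optimism loop in section 1 is changed to also return Perf_orig(b) for each resample, this descriptive summary takes only a couple of lines (`perf_orig_b` below is a hypothetical vector holding those values):

```r
## Descriptive spread of test performance across bootstrap models -- illustrative sketch.
## `perf_orig_b` is assumed to hold Perf_orig(b) for b = 1..B from the loop in section 1.
quantile(perf_orig_b, probs = c(0.025, 0.5, 0.975))   # median and percentile spread of the C-statistic
hist(perf_orig_b, main = "C-statistic across bootstrap models", xlab = "C-statistic")
```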
✅ pminternal: Often supports or facilitates this via bootstrap results, but your primary “validation” claim should still be optimism correction.
4) Reviewer-proof one-table summary
| Aspect | Internal validation (optimism correction) | Model/parameter instability | Predictive uncertainty (risk intervals) |
| --- | --- | --- | --- |
| Main question | Is performance overly optimistic? | Is model structure stable? | Is individual predicted risk stable? |
| Unit | Performance | Parameters | Individual prediction |
| Uses original data as test set | Yes (required) | Optional | Yes (required) |
| Corrects overfitting | Yes | No | No |
| Produces individual risk intervals | No | No | Yes |
| pminternal aligned? | Yes (core) | Yes (core) | Not core |
5) Article-ready Methods paragraph (merged)
Internal validation of model performance was conducted using bootstrap optimism correction. For each bootstrap resample, the model was refitted and performance was evaluated in both the bootstrap sample and the original dataset; the average difference (optimism) was used to correct the apparent performance in the original data. Model instability was assessed by examining variability of regression coefficients (and, where applicable, predictor selection frequencies) across bootstrap samples. If individual-level uncertainty in predicted risk was reported, it was derived from the distribution of predicted probabilities generated by refitting the model in each bootstrap sample and predicting risks for individuals in the original dataset; this analysis quantifies uncertainty in individual predictions and should be interpreted separately from internal validation.
6) How to word “instability” if you used pminternal (recommended)
“Model instability was assessed using bootstrap resampling as implemented in the pminternal workflow, focusing on variability of regression coefficients and performance measures across bootstrap samples.”
(Notice: no mention of individual risk intervals unless you truly did that separately.)
Key takeaways
Internal validation = bootstrap optimism correction (performance-focused).
Instability (pminternal meaning) = parameter instability + (optionally) performance variability.
Individual predicted-risk intervals = predictive uncertainty, separate claim, not the default pminternal “instability.”
In the paper, always state what question you asked and what unit of inference you are targeting.





