Internal Validation vs Instability
- Mayta

- Jan 28
- 6 min read
Updated: Jan 29
Pocket note: Core concept [Eng]
1) Internal validation = optimism correction (performance-focused)
Internal validation via bootstrap optimism correction compares the performance of each bootstrap-fitted model when it is evaluated in the bootstrap sample versus in the original dataset. For each bootstrap replicate, we compute optimism as: optimism = apparent performance in the bootstrap sample − test performance in the original dataset. The mean optimism across replicates is then subtracted from the apparent performance in the original dataset to obtain the optimism-corrected performance.
Key sentence:
Optimism correction requires evaluating performance in two datasets (bootstrap and original) to quantify and remove optimism.
2) Instability = model changes across bootstraps (parameter/predictor-focused)
Model instability is assessed by examining how the fitted model itself changes across bootstrap samples. We refit the model in many bootstrap samples and then evaluate instability by summarizing: variability of regression coefficients across bootstrap models, and/or the frequency with which predictors are selected (if variable selection is used).
Key sentence:
Instability is primarily a model-to-model comparison across bootstrap replicates; evaluating in the original dataset is optional and does not define instability.
(If you’re using pminternal, this “parameter/predictor instability + performance variability” is the instability concept you should emphasize.)
3) Uncertainty in predicted probability (optional, prediction-focused)
If the goal is uncertainty in individual predicted risk, we refit the model in each bootstrap sample and then generate predicted probabilities for individuals in the original dataset. For each individual, the distribution of predicted probabilities across bootstrap models can be summarized (e.g., median and 2.5–97.5 percentiles) to describe uncertainty in predicted risk.
Key sentence:
This quantifies uncertainty of individual predictions and should be reported separately from internal validation (optimism correction) and from parameter instability.
One-line takeaway
“We build the model the same way in each bootstrap; what changes—and what determines the interpretation—is which dataset we use to test/evaluate the model.”

Pocket note: Core concept [Thai, translated] Certainly. The core sentence you want is exactly right:
✅ "We build the model the same way in every bootstrap; the difference is which dataset we take it back to test/evaluate on."
Here is a version you can explain in 30 seconds, with the wording on instability tightened up.
1) Internal validation = optimism correction
We do the same thing in every bootstrap: fit the model in the bootstrap sample.
But "the difference is the dataset we take it back to test on," which is what lets us compute optimism:
For each bootstrap round b:
Apparent_boot(b) = performance when "tested in the bootstrap sample" (the same data used to train)
Test_boot(b) = performance when "the bootstrap model is taken back and tested in the original dataset"
Optimism(b) = Apparent_boot(b) − Test_boot(b)
Then:
mean optimism = average of Optimism(b) over all bootstraps
corrected performance = Apparent_original − mean optimism
✅ Key message: optimism must come from comparing performance across 2 datasets (bootstrap vs original)
2) Instability = model changes across bootstraps (not just a matter of testing)
Here is the point that needs sharpening:
You asked,
"instability test in original to find model different?"
The point is that instability does not arise from testing in the original dataset; it arises from having many bootstrap models whose fitted models differ from one another.
Parameter / predictor instability (the kind pminternal emphasizes)
For each bootstrap round b:
fit model → obtain beta(b), predictors(b)
Summarize:
Are the betas widely spread? (coefficient variability)
Do variables enter and leave the model often? (selection frequency)
Signs of overfitting / whether shrinkage is needed
✅ Key message: instability = a "model vs model" comparison (across bootstraps). Taking the models back to test in the original dataset is optional; it helps gauge the impact on performance or predictions, but it is not what defines instability.
3) Predicted probability uncertainty (if you do this)
Here too we "build the model the same way in every bootstrap" and "differ in which dataset we take it back to evaluate on," but the goal is different:
For each bootstrap round b:
fit the model in the bootstrap sample
predict in the original dataset → obtain p_hat_i(b)
Summarize per individual:
look at the distribution of p_hat_i(1…B) → this is the uncertainty of the predicted risk
✅ Key message: this is uncertainty of the prediction; it is not optimism correction and it is not parameter instability (though they are related)
Summary sentence
We build the model the same way in every bootstrap; the interpretation depends on which dataset we take it back to evaluate on: evaluate in both the bootstrap and the original → optimism and corrected performance (internal validation); examine how much the model from each bootstrap changes → instability of coefficients/predictors; evaluate as predicted probabilities in the original, many times over → uncertainty of predicted risk.
Key takeaway
✅ The difference is indeed the dataset/role you take the model back to test in, but remember two more things:
optimism = requires 2 evaluation sets to compute the difference
instability = requires comparing models across bootstraps (testing in the original is optional)
1) Internal Validation (Optimism Correction)
Scientific question
How much does model performance look “too good” because the model was evaluated on the same data used for development?
What is being validated
Model performance (not coefficients), including:
discrimination (e.g., AUROC / C-statistic)
calibration (calibration slope/intercept)
overall accuracy (e.g., Brier score)
Conceptual target
Overfitting / optimism due to development-data reuse
Bootstrap logic (core steps)
For each bootstrap sample (b):
Fit the model in the bootstrap sample
Evaluate performance in:
bootstrap sample → Perf_boot(b) (apparent in bootstrap)
original dataset → Perf_orig(b) (test in original)
Optimism(b) = Perf_boot(b) − Perf_orig(b)
Final:
Mean optimism = average of Optimism(b) across all bootstrap samples
Optimism-corrected performance = Apparent performance in original data − Mean optimism
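The loop above can be sketched in a few lines of Python. Everything here is illustrative: the data are simulated, and an ordinary least-squares model scored with R² stands in for a clinical prediction model (pminternal itself is an R package), but the bootstrap bookkeeping follows exactly the steps listed.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit(X, y):
    # ordinary least squares with intercept (stand-in for the real model)
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def r2(beta, X, y):
    # stand-in performance measure (use AUROC/calibration in practice)
    Xd = np.column_stack([np.ones(len(X)), X])
    pred = Xd @ beta
    return 1 - ((y - pred) ** 2).sum() / ((y - y.mean()) ** 2).sum()

# simulated development data: 100 obs, 10 predictors, mostly noise
n, p = 100, 10
X = rng.normal(size=(n, p))
y = 0.5 * X[:, 0] + rng.normal(size=n)

apparent = r2(fit(X, y), X, y)          # apparent performance in original data

B = 200
optimism = []
for b in range(B):
    idx = rng.integers(0, n, n)          # bootstrap resample with replacement
    Xb, yb = X[idx], y[idx]
    beta_b = fit(Xb, yb)                 # fit in the bootstrap sample
    perf_boot = r2(beta_b, Xb, yb)       # Perf_boot(b): apparent in bootstrap
    perf_orig = r2(beta_b, X, y)         # Perf_orig(b): test in original
    optimism.append(perf_boot - perf_orig)

corrected = apparent - np.mean(optimism)
print(f"apparent R2 = {apparent:.3f}, corrected R2 = {corrected:.3f}")
```

With mostly-noise predictors the mean optimism is positive, so the corrected estimate comes out below the apparent one, which is the whole point of the procedure.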
Unit of inference
Model-level performance
What you are allowed to claim
“Model performance was internally validated using bootstrap optimism correction.”
What you must NOT claim
Model is stable (that is a different claim)
Individual predicted risks have quantified uncertainty (different aim)
✅ pminternal: This is a core purpose of pminternal (internal validation via bootstrap), so this claim is aligned.
2) Instability (Stability Analysis)
“Instability” is not one thing. For articles, separate it into (A) model/parameter instability and (B) predictive uncertainty. pminternal mainly addresses (A), and partially addresses performance variability.
2A) Model (Parameter) Instability
Scientific question
If we re-sampled from the same underlying population, would we obtain a meaningfully different model?
What is being assessed
regression coefficients (betas)
variable selection frequency (if selection was used)
model structure (sign changes, inclusion/exclusion patterns)
Bootstrap logic (core steps)
For each bootstrap sample:
Fit a new model in the bootstrap sample
Store coefficients and (if applicable) selected variables
Summarize:
coefficient distributions (spread, sign stability)
selection frequency for each predictor
Unit of inference
Model structure (parameters)
What you are allowed to claim
“Model parameters showed limited/substantial instability across bootstrap samples.”
What you must NOT claim
Corrected performance (that belongs to internal validation)
Individual predicted-risk intervals (that is predictive uncertainty)
✅ pminternal: This is the main meaning of “instability” you should emphasize when your workflow relies on pminternal.
2B) Predictive Stability (Uncertainty of Predicted Probability)
Scientific question
For each individual, how uncertain is the predicted risk due to sampling variation?
What is being assessed
Predicted probability (risk) for the same individual, not coefficients.
Bootstrap logic (core steps)
For each bootstrap sample (b):
Fit model in bootstrap sample
Predict risk for individuals in the original dataset → p_hat_i(b)
For each individual (i):
summarize distribution of p_hat_i(1…B)
report median and 2.5–97.5 percentiles as a bootstrap interval for predicted risk
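A sketch of the per-individual summary. A clipped linear probability model stands in for the logistic model you would normally refit (an assumption made only to keep the example dependency-free beyond NumPy); the key line is that every bootstrap model predicts in the original data.

```python
import numpy as np

rng = np.random.default_rng(2)

# simulated binary-outcome data
n, p = 120, 3
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(float)

def fit(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

def predict_risk(beta, X):
    # linear probability model clipped to [0, 1]; a dependency-free
    # stand-in for the logistic model you would normally use
    Xd = np.column_stack([np.ones(len(X)), X])
    return np.clip(Xd @ beta, 0.0, 1.0)

B = 400
preds = np.empty((B, n))
for b in range(B):
    idx = rng.integers(0, n, n)
    beta_b = fit(X[idx], y[idx])         # refit in the bootstrap sample
    preds[b] = predict_risk(beta_b, X)   # ALWAYS predict in the ORIGINAL data

# per-individual summary of p_hat_i(1..B)
med = np.median(preds, axis=0)
lo, hi = np.percentile(preds, [2.5, 97.5], axis=0)
print(f"individual 0: median risk {med[0]:.2f}, "
      f"95% bootstrap interval [{lo[0]:.2f}, {hi[0]:.2f}]")
```

Each column of `preds` is one individual's distribution of predicted risks across bootstrap models, which is exactly what the median and percentile interval summarize.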
Unit of inference
Individual-level prediction
What you are allowed to claim
“Individual risk predictions showed acceptable uncertainty.”
What you must NOT claim
Internal validation (this does not correct optimism)
Parameter stability (this does not directly describe coefficient behavior)
⚠️ pminternal: This is not a core output/aim of pminternal. If you report this, present it as an additional analysis (separate from pminternal internal validation).
3) Performance instability (optional but commonly reported)
This is different from optimism correction. Here you describe variability (spread) of performance estimates across bootstrap samples (e.g., distribution of AUROC). This is descriptive stability of performance.
✅ pminternal: Often supports or facilitates this via bootstrap results, but your primary “validation” claim should still be optimism correction.
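As a sketch, the spread of a performance measure across bootstrap models tested in the original data can be described like this (simulated data; the rank-based AUROC is the standard Mann–Whitney form, and the linear score is again a stand-in for a fitted risk model):

```python
import numpy as np

rng = np.random.default_rng(3)

def auroc(y, score):
    # rank-based (Mann–Whitney) AUROC; assumes continuous, tie-free scores
    ranks = np.empty(len(score))
    ranks[score.argsort()] = np.arange(1, len(score) + 1)
    n1 = int(y.sum()); n0 = len(y) - n1
    return (ranks[y == 1].sum() - n1 * (n1 + 1) / 2) / (n0 * n1)

# simulated binary-outcome data
n, p = 150, 4
X = rng.normal(size=(n, p))
y = (X[:, 0] + rng.normal(size=n) > 0).astype(float)

def fit(X, y):
    Xd = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return beta

B = 200
aucs = np.empty(B)
for b in range(B):
    idx = rng.integers(0, n, n)
    beta_b = fit(X[idx], y[idx])                      # fit in bootstrap
    score = np.column_stack([np.ones(n), X]) @ beta_b
    aucs[b] = auroc(y, score)                         # test in original

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUROC across bootstrap models: median {np.median(aucs):.3f}, "
      f"spread [{lo:.3f}, {hi:.3f}]")
```

Reporting this spread describes how stable the performance estimate is; it does not replace the optimism-corrected point estimate as the headline validation result.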
4) Reviewer-proof one-table summary
| Aspect | Internal validation (optimism correction) | Model/parameter instability | Predictive uncertainty (risk intervals) |
| --- | --- | --- | --- |
| Main question | Is performance overly optimistic? | Is model structure stable? | Is individual predicted risk stable? |
| Unit | Performance | Parameters | Individual prediction |
| Uses original data as test set | Yes (required) | Optional | Yes (required) |
| Corrects overfitting | Yes | No | No |
| Produces individual risk intervals | No | No | Yes |
| pminternal aligned? | Yes (core) | Yes (core) | Not core |
5) Article-ready Methods paragraph (merged)
Internal validation of model performance was conducted using bootstrap optimism correction. For each bootstrap resample, the model was refitted and performance was evaluated in both the bootstrap sample and the original dataset; the average difference (optimism) was used to correct the apparent performance in the original data. Model instability was assessed by examining variability of regression coefficients (and, where applicable, predictor selection frequencies) across bootstrap samples. If individual-level uncertainty in predicted risk was reported, it was derived from the distribution of predicted probabilities generated by refitting the model in each bootstrap sample and predicting risks for individuals in the original dataset; this analysis quantifies uncertainty in individual predictions and should be interpreted separately from internal validation.
6) How to word “instability” if you used pminternal (recommended)
“Model instability was assessed using bootstrap resampling as implemented in the pminternal workflow, focusing on variability of regression coefficients and performance measures across bootstrap samples.”
(Notice: no mention of individual risk intervals unless you truly did that separately.)
Key takeaways
Internal validation = bootstrap optimism correction (performance-focused).
Instability (pminternal meaning) = parameter instability + (optionally) performance variability.
Individual predicted-risk intervals = predictive uncertainty, separate claim, not the default pminternal “instability.”
In the paper, always state: what question you asked and what unit you are inferring.


