Internal Validation vs Instability
- Mayta

Pocket note
“The concept depends on which dataset you compare the model against (i.e., where you evaluate it).”
Why it feels like “same data but different view”
Think of data as wearing different hats:
Training hat: data used to fit (estimate) the model
Testing hat: data used to evaluate (score) the model
In bootstrap workflows, the original dataset can be used as a testing hat even though it’s still the same file of patients.
So it’s not “different data”; it’s the same data playing a different role in the comparison.
The key comparison that defines each concept
1) Internal validation (optimism correction) = “two-hat comparison”
Key idea: compare performance in the data used to fit vs a separate evaluation set
In bootstrap:
Fit in bootstrap sample (training hat)
Evaluate in:
bootstrap sample (same data as training → “apparent” performance)
original dataset (plays the role of “new-ish” data → “test” performance)
Optimism = apparent − test
This quantity only exists because you evaluated the model on two different datasets/roles. For example, if a bootstrap model scores an AUROC of 0.82 in its own bootstrap sample and 0.78 in the original dataset, its optimism is 0.04.
2) Instability (parameter stability) = “model-to-model comparison”
Key idea: compare models fitted on many resamples
Fit many bootstrap models
Look at how coefficients / variable selection change across those models
Here the key “comparison” is model vs model, not performance-in-two-datasets. [6]
3) Predictive uncertainty = “prediction distribution for the same person”
Key idea: compare predicted risk for the same individual across many bootstrap models
Fit bootstrap models
Predict each person in the original dataset → many predicted risks for the same person
Summarize spread (median, 2.5–97.5 percentile)
Comparison is prediction vs prediction, not performance correction.
The single question that will un-confuse you
When you’re stuck, ask yourself:
“What is my evaluation dataset (or evaluation role)?”
That question tells you whether you’re doing:
internal validation (needs apparent vs test performance),
parameter instability (needs beta distributions),
or prediction uncertainty (needs p-hat distributions).
“The interpretation depends on which dataset is used for evaluation (bootstrap sample vs original dataset), even if the original dataset appears to be the ‘same data’.”
1) Internal Validation (Optimism Correction)
Scientific question
How much does model performance look “too good” because the model was evaluated on the same data used for development?
What is being validated
Model performance (not coefficients), including:
discrimination (e.g., AUROC / C-statistic)
calibration (calibration slope/intercept)
overall accuracy (e.g., Brier score)
Conceptual target
Overfitting / optimism due to development-data reuse
Bootstrap logic (core steps; a minimal base-R sketch follows this list)
For each bootstrap sample (b):
Fit the model in the bootstrap sample
Evaluate performance in:
bootstrap sample → Perf_boot(b) (apparent in bootstrap)
original dataset → Perf_orig(b) (test in original)
Optimism(b) = Perf_boot(b) − Perf_orig(b)
Final:
Mean optimism = average of Optimism(b) across all bootstrap samples
Optimism-corrected performance = Apparent performance in original data − Mean optimism
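For concreteness, here is a minimal base-R sketch of this loop. It does not use pminternal’s own API; the data frame `d`, the binary outcome `y`, and the predictors `x1` and `x2` are hypothetical placeholders, and only the C-statistic is shown.

```r
## Minimal sketch of bootstrap optimism correction for the C-statistic.
## Assumes a data frame `d` with a 0/1 outcome `y` and predictors `x1`, `x2` (hypothetical).
set.seed(2024)

cstat <- function(y, p) {                     # rank-based (Mann-Whitney) AUROC
  r  <- rank(p)
  n1 <- sum(y == 1); n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

B        <- 500
fit_orig <- glm(y ~ x1 + x2, family = binomial, data = d)
apparent <- cstat(d$y, predict(fit_orig, type = "response"))   # apparent performance in original data

optimism <- replicate(B, {
  db        <- d[sample(nrow(d), replace = TRUE), ]            # bootstrap sample
  fit_b     <- glm(y ~ x1 + x2, family = binomial, data = db)  # fit in bootstrap sample
  perf_boot <- cstat(db$y, predict(fit_b, type = "response"))                # Perf_boot(b): "apparent" in bootstrap
  perf_orig <- cstat(d$y,  predict(fit_b, newdata = d, type = "response"))   # Perf_orig(b): "test" in original
  perf_boot - perf_orig                                        # Optimism(b)
})

corrected <- apparent - mean(optimism)   # optimism-corrected C-statistic
```

In practice you would compute the calibration slope/intercept and the Brier score inside the same loop; the pminternal workflow described in this note is built around exactly this logic.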
Unit of inference
Model-level performance
What you are allowed to claim
“Model performance was internally validated using bootstrap optimism correction.”
What you must NOT claim
Model is stable (that is a different claim)
Individual predicted risks have quantified uncertainty (different aim)
✅ pminternal: This is a core purpose of pminternal (internal validation via bootstrap), so this claim is aligned.
2) Instability (Stability Analysis)
“Instability” is not one thing. For articles, separate it into (A) model/parameter instability and (B) predictive uncertainty. pminternal mainly addresses (A), and partially addresses performance variability.
2A) Model (Parameter) Instability
Scientific question
If we re-sampled from the same underlying population, would we obtain a meaningfully different model?
What is being assessed
regression coefficients (betas)
variable selection frequency (if selection was used)
model structure (sign changes, inclusion/exclusion patterns)
Bootstrap logic (core steps; see the sketch after this list)
For each bootstrap sample:
Fit a new model in the bootstrap sample
Store coefficients and (if applicable) selected variables
Summarize:
coefficient distributions (spread, sign stability)
selection frequency for each predictor
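As a sketch of that loop (again plain base R rather than pminternal output; `d`, `y`, `x1`, `x2` remain hypothetical placeholders):

```r
## Coefficient (parameter) instability across bootstrap refits -- illustrative sketch.
set.seed(2024)
B <- 500

betas <- t(replicate(B, {
  db <- d[sample(nrow(d), replace = TRUE), ]                 # bootstrap sample
  coef(glm(y ~ x1 + x2, family = binomial, data = db))       # refit and keep the coefficients
}))

## Spread and sign stability of each coefficient
apply(betas, 2, quantile, probs = c(0.025, 0.5, 0.975))      # percentile spread per coefficient
colMeans(betas > 0)                                          # proportion of resamples with a positive sign

## If variable selection were applied inside the loop, you would instead record which
## predictors each refit retained and report their selection frequencies.
```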
Unit of inference
Model structure (parameters)
What you are allowed to claim
“Model parameters showed limited/substantial instability across bootstrap samples.”
What you must NOT claim
Corrected performance (that belongs to internal validation)
Individual predicted-risk intervals (that is predictive uncertainty)
✅ pminternal: This is the main meaning of “instability” you should emphasize when your workflow relies on pminternal.
2B) Predictive Stability (Uncertainty of Predicted Probability)
Scientific question
For each individual, how uncertain is the predicted risk due to sampling variation?
What is being assessed
Predicted probability (risk) for the same individual, not coefficients.
Bootstrap logic (core steps; a sketch follows this list)
For each bootstrap sample (b):
Fit model in bootstrap sample
Predict risk for individuals in the original dataset → p_hat_i(b)
For each individual (i):
summarize distribution of p_hat_i(1…B)
report median and 2.5–97.5 percentiles as a bootstrap interval for predicted risk
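A minimal base-R sketch of those steps, kept separate from pminternal because this is not its core aim (`d`, `y`, `x1`, `x2` are the same hypothetical placeholders):

```r
## Bootstrap uncertainty of each individual's predicted risk -- illustrative sketch.
set.seed(2024)
B <- 500

## Columns = bootstrap models, rows = individuals in the original dataset
p_hat <- replicate(B, {
  db    <- d[sample(nrow(d), replace = TRUE), ]
  fit_b <- glm(y ~ x1 + x2, family = binomial, data = db)
  predict(fit_b, newdata = d, type = "response")   # p_hat_i(b) for everyone in the original data
})

## Per-person summary: median and 2.5-97.5 percentile interval of predicted risk
risk_summary <- t(apply(p_hat, 1, quantile, probs = c(0.025, 0.5, 0.975)))
head(risk_summary)
```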
Unit of inference
Individual-level prediction
What you are allowed to claim
“Individual risk predictions showed limited/substantial uncertainty across bootstrap models.”
What you must NOT claim
Internal validation (this does not correct optimism)
Parameter stability (this does not directly describe coefficient behavior)
⚠️ pminternal: This is not a core output/aim of pminternal. If you report this, present it as an additional analysis (separate from pminternal internal validation).
3) Performance instability (optional but commonly reported)
This is different from optimism correction. Here you describe the variability (spread) of performance estimates across bootstrap samples (e.g., the distribution of AUROC); a short sketch follows below. This is descriptive stability of performance, not a correction.
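If the optimism loop in section 1 is changed to also return Perf_orig(b) for each resample, this descriptive summary takes only a couple of lines (`perf_orig_b` below is a hypothetical vector holding those values):

```r
## Descriptive spread of test performance across bootstrap models -- illustrative sketch.
## `perf_orig_b` is assumed to hold Perf_orig(b) for b = 1..B from the loop in section 1.
quantile(perf_orig_b, probs = c(0.025, 0.5, 0.975))   # median and percentile spread of the C-statistic
hist(perf_orig_b, main = "C-statistic across bootstrap models", xlab = "C-statistic")
```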
✅ pminternal: Often supports or facilitates this via bootstrap results, but your primary “validation” claim should still be optimism correction.
4) Reviewer-proof one-table summary
| Aspect | Internal validation (optimism correction) | Model/parameter instability | Predictive uncertainty (risk intervals) |
| --- | --- | --- | --- |
| Main question | Is performance overly optimistic? | Is model structure stable? | Is individual predicted risk stable? |
| Unit | Performance | Parameters | Individual prediction |
| Uses original data as test set | Yes (required) | Optional | Yes (required) |
| Corrects overfitting | Yes | No | No |
| Produces individual risk intervals | No | No | Yes |
| pminternal aligned? | Yes (core) | Yes (core) | Not core |
5) Article-ready Methods paragraph (merged)
Internal validation of model performance was conducted using bootstrap optimism correction. For each bootstrap resample, the model was refitted and performance was evaluated in both the bootstrap sample and the original dataset; the average difference (optimism) was used to correct the apparent performance in the original data. Model instability was assessed by examining variability of regression coefficients (and, where applicable, predictor selection frequencies) across bootstrap samples. If individual-level uncertainty in predicted risk was reported, it was derived from the distribution of predicted probabilities generated by refitting the model in each bootstrap sample and predicting risks for individuals in the original dataset; this analysis quantifies uncertainty in individual predictions and should be interpreted separately from internal validation.
6) How to word “instability” if you used pminternal (recommended)
“Model instability was assessed using bootstrap resampling as implemented in the pminternal workflow, focusing on variability of regression coefficients and performance measures across bootstrap samples.”
(Notice: no mention of individual risk intervals unless you truly did that separately.)
Key takeaways
Internal validation = bootstrap optimism correction (performance-focused).
Instability (pminternal meaning) = parameter instability + (optionally) performance variability.
Individual predicted-risk intervals = predictive uncertainty, separate claim, not the default pminternal “instability.”
In the paper, always state what question you asked and what unit of inference you are targeting.





