
Why Missing Data Requires Both Imputation [Bootstrap then kNN] and Bootstrap Internal Validation [Again]

  • Writer: Mayta

First: fix one wrong mental picture

I am thinking:

“If complete data can do bootstrap once, missing data must do bootstrap twice → this feels like cheating / overkill.”

This feeling comes from counting datasets, but internal validation is not about counting datasets.

👉 Internal validation is about ONE comparison:

Was the model evaluated on data it was NOT trained on?

Everything else is bookkeeping.

Case 1 COMPLETE DATA (no missing values)

❓ What is the correct internal validation?

Correct bootstrap internal validation (textbook):

  1. Fit model on original data → apparent performance

  2. For b = 1…B:

    • draw a bootstrap sample

    • fit the model on the bootstrap sample → performance on that sample = boot_b

    • test the same model on the original data → test_b

  3. Optimism = mean over b of (boot_b − test_b)

  4. Corrected performance = apparent − optimism

✅ This is accepted
✅ This uses B bootstrap datasets
❌ You do NOT fit the model on 100 bootstraps and call that “development” (see the sketch below)
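To make steps 1–4 concrete, here is a minimal sketch in Python. It assumes a binary outcome, a logistic model, and AUC as the performance measure; these, along with X, y, B, and the function name bootstrap_validate, are illustrative placeholders, not part of the original post:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_validate(X, y, B=200, seed=0):
    """Optimism-corrected bootstrap validation on ONE complete dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)

    # Step 1: apparent performance (fit on original, test on original)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, model.predict_proba(X)[:, 1])

    # Step 2: bootstrap loop; each refit only measures self-flattery
    # and is never the final model
    optimism_b = []
    for _ in range(B):
        idx = rng.integers(0, n, size=n)              # bootstrap sample
        m_b = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        boot = roc_auc_score(y[idx], m_b.predict_proba(X[idx])[:, 1])
        test = roc_auc_score(y, m_b.predict_proba(X)[:, 1])  # test on original
        optimism_b.append(boot - test)    # (sketch: ignores the rare
                                          #  single-class bootstrap edge case)

    # Steps 3–4: average the optimism, subtract it from apparent
    optimism = float(np.mean(optimism_b))
    return apparent, optimism, apparent - optimism
```

The model you report is still the one fitted once on the original data; the B refits exist only to estimate how optimistic its apparent performance is.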

🚫 What is NOT correct (my first sentence)

“Develop the model from 100 bootstraps of the complete dataset, then use each bootstrap to check internal validity.”

This is not standard internal validation

Because:

  • You are changing the development data

  • There is no single “final model”

  • Internal validation assumes one development dataset

📌 Bootstrap datasets are not development datasets. They are tools to test the overfitting of one model.

Case 2 MISSING DATA (kNN replicated datasets)

Here is where symmetry breaks — and this is the key insight.

🔑 Core principle (memorize this)

Imputation is not validation. Validation is not imputation.

They address different problems, so they stack, not replace.

What missing data forces me to do (extra step)

With missing data you must first answer:

“What is my development dataset?”

I cannot use the raw incomplete data.

So I do:

➡️ kNN imputation replicated M times
➡️ This creates M versions of the same development dataset

📌 These M datasets are all development datasets
📌 None of them is a test dataset

This is why you feel it “explodes”.

Correct structure with missing data

Step A — Development (imputation problem)

  • Use kNN replicated datasets (e.g. M = 100–500)

  • Decide:

    • predictors

    • fractional polynomial (FP) powers

  • Freeze model specification

⛔ No validation yet
⛔ No optimism yet
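One way to generate the M replicates (the “bootstrap then kNN” idea from the title) is to resample the donor pool before each kNN imputation. A hypothetical sketch using scikit-learn's KNNImputer; the resampling scheme, n_neighbors=5, and the X_incomplete placeholder are my assumptions for illustration:

```python
import numpy as np
from sklearn.impute import KNNImputer

def impute_knn_replicate(X_incomplete, seed, n_neighbors=5):
    """One kNN-imputed replicate of the development dataset.

    Bootstrap then kNN: resample rows to perturb the donor pool, then
    fill the holes in the ORIGINAL rows from that pool, so each seed
    yields a different plausible completed dataset.
    """
    rng = np.random.default_rng(seed)
    n = X_incomplete.shape[0]
    donors = X_incomplete[rng.integers(0, n, size=n)]  # resampled donor pool
    imputer = KNNImputer(n_neighbors=n_neighbors).fit(donors)
    return imputer.transform(X_incomplete)

# M completed versions of the SAME development dataset; none is a test set
M = 100
imputations = [impute_knn_replicate(X_incomplete, seed=m) for m in range(M)]
```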

Step B — Internal validation (overfitting problem)

Now pick ONE development dataset at a time (each imputation):

For imputation m:

  1. Fit final model → apparentₘ

  2. Bootstrap b = 1…B within that dataset

    • fit on bootstrap sample

    • test on original imputation m

  3. Get optimismₘ

Repeat for m = 1…M

Then summarize optimism across m.

📌 This is exactly equivalent to the complete-data bootstrap
📌 Just repeated across imputations
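Putting Step A and Step B together: a sketch of the full loop, reusing the illustrative helpers bootstrap_validate and impute_knn_replicate defined above (M and B here are design choices, not prescriptions):

```python
import numpy as np

M, B = 100, 50   # e.g. 100 imputations, 50 bootstraps each

apparent, optimism, corrected = [], [], []
for m in range(M):
    X_m = impute_knn_replicate(X_incomplete, seed=m)  # ONE development dataset
    a_m, o_m, c_m = bootstrap_validate(X_m, y, B=B, seed=m)
    apparent.append(a_m)
    optimism.append(o_m)
    corrected.append(c_m)

# Summarize across imputations: the point estimate is the simple mean;
# a full analysis would also carry the between-imputation variance
print("apparent:  ", np.mean(apparent))
print("optimism:  ", np.mean(optimism))
print("corrected: ", np.mean(corrected))
```

Each inner call is exactly the Case 1 procedure; the outer loop only repeats it across the M plausible development datasets.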

Why this is NOT cheating (this is the key answer to my fear)

I said:

“Bootstrap kNN → get 100 datasets, develop, then bootstrap 50 for each dataset → it jumps to 5,000, so complete data looks like cheating.”

This is the wrong comparison ❌

I am comparing:

  Complete data → 1 dataset, 1 bootstrap loop
  Missing data → M datasets, M bootstrap loops

But the correct comparison is:

  Problem → what uncertainty exists
  Complete data → overfitting only
  Missing data → overfitting + missingness

So I must pay two uncertainty taxes, not one.

📌 More uncertainty → more resampling
📌 That is not cheating; that is honesty

One decisive simplification (this usually resolves confusion)

Imagine this analogy

  • Complete data = one blurry photo

  • Missing data = 100 blurry photos

Internal validation asks:

“If I slightly perturb the photo, does my model still work?”

With missing data you must ask that question for each plausible photo.

That is why loops multiply.

Final YES / NO answers (pin these)

❓ Complete data:

“Develop model from 100 bootstrap datasets?”

NO. Develop from the original data only.

❓ Complete data:

“Use bootstrap to test internal validity?”

YES. One bootstrap loop.

❓ Missing data:

“Develop model from raw incomplete data?”

NO

❓ Missing data:

“Use kNN replicated datasets as development data?”

YES

❓ Missing data:

“Need bootstrap again after kNN?”

YES. Because kNN imputation ≠ validation.

The one sentence you were missing (this resolves everything)

Bootstrap datasets are never development datasets. Imputed datasets are never test datasets.
