Why Missing Data Requires Both Imputation [kNN] and Bootstrap [again] for Internal Validation
- Mayta

First: fix one wrong mental picture
My thinking was:
“If complete data only needs the bootstrap once, missing data must run it twice → this feels like cheating / overkill.”
This feeling comes from counting datasets, but internal validation is not about counting datasets.
👉 Internal validation is about ONE comparison:
Was the model evaluated on data it was NOT trained on?
Everything else is bookkeeping.
Case 1: COMPLETE DATA (no missing values)
❓ What is the correct internal validation?
Correct bootstrap internal validation (textbook):
Fit model on original data → apparent performance
For b = 1…B:
bootstrap sample
fit model on bootstrap sample
test on original data
optimism(b) = performance on bootstrap sample − performance on original data
Optimism = mean over b of optimism(b)
Corrected performance = apparent − optimism
✅ This is accepted
✅ This uses B bootstrap datasets
❌ You do NOT fit the model on 100 bootstraps and call that “development”
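The loop above can be sketched in plain Python. This is a minimal illustration with a toy least-squares model and R² as the performance measure; the helper names (`fit_ols`, `r_squared`, `bootstrap_validate`) are mine, not from any particular package.

```python
import random
import statistics

def fit_ols(data):
    """Toy model: least-squares line y = a + b*x."""
    xs = [x for x, _ in data]
    ys = [y for _, y in data]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(model, data):
    """Performance of the model on a dataset (higher is better)."""
    a, b = model
    my = statistics.fmean(y for _, y in data)
    sse = sum((y - (a + b * x)) ** 2 for x, y in data)
    sst = sum((y - my) ** 2 for _, y in data)
    return 1 - sse / sst

def bootstrap_validate(data, B=200, seed=42):
    """Optimism-corrected internal validation on ONE development dataset."""
    rng = random.Random(seed)
    apparent = r_squared(fit_ols(data), data)       # fit and test on same data
    optimisms = []
    for _ in range(B):
        boot = [rng.choice(data) for _ in data]     # resample with replacement
        m = fit_ols(boot)                           # fit on bootstrap sample
        # optimism(b) = performance on bootstrap sample - performance on original
        optimisms.append(r_squared(m, boot) - r_squared(m, data))
    optimism = statistics.fmean(optimisms)
    return apparent, optimism, apparent - optimism  # corrected = apparent - optimism

rng = random.Random(0)
data = [(x, 2 * x + rng.gauss(0, 5)) for x in range(40)]
apparent, optimism, corrected = bootstrap_validate(data)
```

Note that the development model is fit once, on the original data; the B bootstrap models exist only to estimate how optimistic the apparent performance is.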
🚫 What is NOT correct (my first sentence)
“develop the model from 100 bootstraps of the complete dataset, then use each bootstrap to check internal validity”
❌ This is not standard internal validation
Because:
You are changing the development data
There is no single “final model”
Internal validation assumes one development dataset
📌 Bootstrap datasets are not development datasets.
They are tools to test overfitting of one model.
Case 2: MISSING DATA (kNN replicated datasets)
Here is where symmetry breaks — and this is the key insight.
🔑 Core principle (memorize this)
Imputation is not validation. Validation is not imputation.
They address different problems, so they stack, not replace.
What missing data forces me to do (extra step)
With missing data you must first answer:
“What is my development dataset?”
I cannot use the raw incomplete data.
So I do:
➡️ kNN imputation replicated M times
➡️ This creates M versions of the same development dataset
📌 These M datasets are all development datasets
📌 None of them is a test dataset
This is why you feel it “explodes”.
Correct structure with missing data
Step A — Development (imputation problem)
Use kNN replicated datasets (e.g. M = 100–500)
Decide:
predictors
fractional polynomial (FP) powers
Freeze model specification
⛔ No validation yet
⛔ No optimism yet
Step B — Internal validation (overfitting problem)
Now pick ONE development dataset at a time (each imputation):
For imputation m:
Fit final model → apparentₘ
Bootstrap b = 1…B within that dataset
fit on bootstrap sample
test on original imputation m
Get optimismₘ
Repeat for m = 1…M
Then summarize optimism across m.
📌 This is exactly equivalent to complete-data bootstrap
📌 Just repeated across imputations
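Steps A and B can be wired together as follows: for each imputed dataset, run one ordinary bootstrap-optimism loop, then summarize across imputations. The toy model and helpers (a least-squares line, R²) are my own placeholders, not a specific package's API.

```python
import random
import statistics

def fit_ols(data):                          # toy frozen model specification
    xs = [x for x, _ in data]
    mx = statistics.fmean(xs)
    my = statistics.fmean(y for _, y in data)
    b = sum((x - mx) * (y - my) for x, y in data) / sum((x - mx) ** 2 for x in xs)
    return my - b * mx, b

def r_squared(model, data):                 # performance measure (higher is better)
    a, b = model
    my = statistics.fmean(y for _, y in data)
    sse = sum((y - (a + b * x)) ** 2 for x, y in data)
    return 1 - sse / sum((y - my) ** 2 for _, y in data)

def validate_one_imputation(dataset, B, rng):
    """Ordinary complete-data bootstrap validation, run INSIDE imputation m."""
    apparent = r_squared(fit_ols(dataset), dataset)
    opt = []
    for _ in range(B):
        boot = [rng.choice(dataset) for _ in dataset]
        m = fit_ols(boot)
        opt.append(r_squared(m, boot) - r_squared(m, dataset))
    return apparent, statistics.fmean(opt)

def validate_across_imputations(imputations, B=100, seed=0):
    """Repeat the SAME loop for each of the M imputed development datasets,
    then summarize apparent performance and optimism across m."""
    per_m = [validate_one_imputation(d, B, random.Random(seed + i))
             for i, d in enumerate(imputations)]
    apparent = statistics.fmean(a for a, _ in per_m)
    optimism = statistics.fmean(o for _, o in per_m)
    return apparent - optimism              # pooled optimism-corrected performance

# five stand-in "imputed" copies of one development dataset
base = random.Random(0)
imputations = [[(x, 2 * x + base.gauss(0, 5)) for x in range(40)] for _ in range(5)]
corrected = validate_across_imputations(imputations)
```

Each inner loop is exactly the complete-data procedure; the only new ingredient is the outer loop over m.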
Why this is NOT cheating (this is the key answer to my fear)
I said:
“bootstrapping after kNN gives 100 datasets; developing and then running 50 bootstraps for each jumps to 5,000 fits, so complete data looks like cheating”
This is the wrong comparison ❌
I am comparing:
| Complete data | Missing data |
| --- | --- |
| 1 dataset | M datasets |
| 1 bootstrap loop | M bootstrap loops |
But the correct comparison is:
| Problem | What uncertainty exists |
| --- | --- |
| Complete data | Overfitting only |
| Missing data | Overfitting + missingness |
So I must pay two uncertainty taxes, not one.
📌 More uncertainty → more resampling
📌 That is not cheating — that is honesty
One decisive simplification (this usually resolves confusion)
Imagine this analogy
Complete data = one blurry photo
Missing data = 100 blurry photos
Internal validation asks:
“If I slightly perturb the photo, does my model still work?”
With missing data you must ask that question for each plausible photo.
That is why loops multiply.
Final YES / NO answers (pin these)
❓ Complete data:
“Develop model from 100 bootstrap datasets?”
❌ NO
Develop from original data only.
❓ Complete data:
“Use bootstrap to test internal validity?”
✅ YES
One bootstrap loop.
❓ Missing data:
“Develop model from raw incomplete data?”
❌ NO
❓ Missing data:
“Use kNN replicated datasets as development data?”
✅ YES
❓ Missing data:
“Need bootstrap again after kNN?”
✅ YES
Because kNN ≠ validation.
The one sentence you were missing (this resolves everything)
Bootstrap datasets are never development datasets.
Imputed datasets are never test datasets.





