Bootstrap Before kNN Is Not Internal Validation: Clarifying Imputation Variability vs Model Optimism

  • Writer: Mayta
  • 6 min read

We do not have to bootstrap before kNN because it is not a rule; it is an optional way to reflect imputation variability, not internal validation. “Bootstrap before kNN” cannot be claimed as internal validation, because it does not involve fitting and testing the model on different data and therefore does not estimate optimism. Internal validation requires bootstrapping of the final model; bootstrapping the imputation step alone is insufficient, because the model is fit on the imputed datasets, not on the dataset before imputation.

What I want my professor to accept

  • Generating 500 completed datasets via “bootstrap → kNN” is primarily a missing-data strategy (it propagates imputation variability).

  • Internal validation is different: it quantifies overfitting/optimism of a specified final model and requires a train–test contrast created by resampling (bootstrap/CV). TRIPOD explicitly recommends internal validation (bootstrapping/cross-validation) to quantify optimism and adjust for overfitting. (limbic-cenc.org)

  • If you fit and evaluate the model on each of the 500 datasets and average performance, you are still estimating repeated apparent performance (train=test each time), not optimism-corrected performance. The Steyerberg & Harrell JCE paper formalizes internal validation and shows bootstrap internal validation provides stable, low-bias estimates compared with split sample methods. (ScienceDirect)

  • “Bootstrap → kNN” can be enough only if it is implemented as a validation algorithm (train model on bootstrap-imputed sample and test on the original dataset), not merely as a way to manufacture many completed datasets.


1) What internal validation answers (and what it is not)

The question

In prediction modeling, the key issue is not “How good is the model on the data I used?” but:

How much optimism is present due to overfitting, and what performance should I expect in similar future patients?

TRIPOD’s Explanation & Elaboration states that prediction model development studies should include internal validation (e.g., bootstrapping or cross-validation) to quantify optimism and adjust for overfitting. (limbic-cenc.org)

What internal validation is not

Internal validation is not simply creating many versions of a completed dataset. Creating 500 completed datasets can help characterize uncertainty due to missingness/imputation, but it does not automatically estimate the generalization loss from model fitting choices (overfitting).

2) Two “bootstraps” that people confuse

2.1 Bootstrap used to generate completed datasets (bootstrap → kNN → 500)

This workflow:

bootstrap incomplete dataset → kNN impute → repeat → 500 completed datasets

primarily injects variability related to:

  • which rows are resampled,

  • kNN neighborhoods,

  • imputed values.

That addresses missing-data uncertainty.
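
For concreteness, here is a minimal Python sketch of that generation step, assuming a pandas DataFrame df of numeric predictors with missing values and scikit-learn's KNNImputer. It only manufactures completed datasets; no model is fit or tested, so nothing here estimates optimism.

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

def bootstrap_knn_datasets(df: pd.DataFrame, m: int = 500, k: int = 5,
                           seed: int = 2024) -> list:
    """Resample the incomplete data with replacement, then kNN-impute each
    bootstrap sample. This propagates imputation variability only; it is
    not internal validation."""
    rng = np.random.default_rng(seed)
    completed = []
    for _ in range(m):
        idx = rng.integers(0, len(df), len(df))       # rows resampled with replacement
        boot = df.iloc[idx].reset_index(drop=True)
        imputed = KNNImputer(n_neighbors=k).fit_transform(boot)  # numeric columns assumed
        completed.append(pd.DataFrame(imputed, columns=df.columns))
    return completed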

2.2 Bootstrap used for internal validation (optimism correction)

Bootstrap internal validation (the “optimism bootstrap”) is defined by a model-centric train–test structure: fit the model in resampled data and evaluate performance in data not used for that fit (typically the original sample), then estimate optimism and correct apparent performance. TRIPOD describes this logic (apparent vs optimism-adjusted performance) and recommends resampling methods for internal validation. (limbic-cenc.org)
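
In symbols, a standard way to write this optimism bootstrap (consistent with the Steyerberg & Harrell and Harrell references below; Perf can be AUC, calibration slope, or CITL, and \hat f_b denotes the model refit in bootstrap sample b):

\[
\text{optimism} \;=\; \frac{1}{B}\sum_{b=1}^{B}\Big[\operatorname{Perf}\big(\hat f_b,\ \text{boot}_b\big) - \operatorname{Perf}\big(\hat f_b,\ \text{original}\big)\Big],
\qquad
\text{corrected} \;=\; \text{apparent} - \text{optimism}
\]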

3) Why “500 kNN-imputed datasets” do not automatically give internal validation

The core requirement you cannot skip

To estimate optimism, you must create a situation where:

  • The model is trained on one dataset and

  • evaluated on a different dataset not used for that training.

If you do this instead:

for each of 500 completed datasets: fit the model and evaluate it on the same dataset, then average AUC/slope/CITL

You are repeatedly computing:

Perf(train_b, test_b)

That is apparent performance, repeated 500 times. Averaging apparent performance does not produce optimism-corrected performance.
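
To make the pitfall concrete, here is a hedged Python sketch of that loop, assuming a list of completed datasets from the bootstrap → kNN step, a binary outcome column, and a logistic model standing in for the final model. It returns nothing more than the average apparent AUC.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def mean_apparent_auc(completed_datasets, predictors, outcome):
    """The flawed loop: fit and evaluate on the SAME completed dataset, then
    average. This is repeated apparent performance, not optimism-corrected
    performance."""
    aucs = []
    for data in completed_datasets:                   # e.g., 500 completed datasets
        model = LogisticRegression(max_iter=1000).fit(data[predictors], data[outcome])
        pred = model.predict_proba(data[predictors])[:, 1]
        aucs.append(roc_auc_score(data[outcome], pred))   # train == test every time
    return float(np.mean(aucs))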

This is exactly why the Steyerberg & Harrell (JCE 2001) paper exists: performance is overestimated when evaluated on the same sample used to construct the model; internal validation procedures aim to estimate performance in new subjects from the same population, and bootstrapping is recommended for internal validity estimation. (ScienceDirect)

“But bootstrapping doesn’t create new people either.”

Correct: the bootstrap does not create new individuals. Internal validation is not about “new people”; it is about new estimation problems: refitting the model in perturbed samples and assessing how performance drops when the model is evaluated outside the training sample. TRIPOD frames internal validation exactly as resampling-based optimism adjustment using the development dataset. (limbic-cenc.org)

“But imputation creates new data, so it’s like a new population.”

Imputation creates new values for missing predictors; it does not create the “new sample” structure required to estimate optimism unless the model is explicitly tested on data not used for fitting. In short:

New values ≠ a test set.

4) When “bootstrap → kNN” could be sufficient

“Bootstrap → kNN” becomes internal validation only if you embed the validation contrast inside the loop.

Valid “bootstrap → kNN” internal validation algorithm

For each bootstrap replicate (b):

  1. draw a bootstrap sample from the incomplete dataset

  2. kNN-impute that bootstrap sample → completed bootstrap training data

  3. fit the final fixed model on that bootstrap-imputed training data

  4. test predictions on the original (non-bootstrap) completed dataset (or a consistent reference dataset)

  5. compute test performance and estimate optimism
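
A minimal Python sketch of this loop, under stated assumptions: a pandas DataFrame df of numeric predictors with missing values, a complete binary outcome column, scikit-learn's KNNImputer, a logistic model standing in for the frozen final model, and AUC as the performance measure (calibration slope and CITL would be added analogously).

import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def knn_complete(df, predictors, k=5):
    """kNN-impute the predictor columns only (the outcome is assumed complete)."""
    out = df.copy()
    out[predictors] = KNNImputer(n_neighbors=k).fit_transform(df[predictors])
    return out

def optimism_bootstrap_knn(df, predictors, outcome, B=100, k=5, seed=2024):
    """Bootstrap -> kNN internal validation: refit the fixed model in each
    imputed bootstrap sample, test it on the completed original data, and
    return the optimism-corrected AUC."""
    rng = np.random.default_rng(seed)
    ref = knn_complete(df, predictors, k)                       # reference (test) dataset
    fit = LogisticRegression(max_iter=1000).fit(ref[predictors], ref[outcome])
    apparent = roc_auc_score(ref[outcome], fit.predict_proba(ref[predictors])[:, 1])
    optimisms = []
    for _ in range(B):
        idx = rng.integers(0, len(df), len(df))                 # step 1: bootstrap incomplete data
        boot = knn_complete(df.iloc[idx].reset_index(drop=True), predictors, k)  # step 2
        m = LogisticRegression(max_iter=1000).fit(boot[predictors], boot[outcome])  # step 3
        auc_boot = roc_auc_score(boot[outcome], m.predict_proba(boot[predictors])[:, 1])
        auc_test = roc_auc_score(ref[outcome], m.predict_proba(ref[predictors])[:, 1])  # step 4
        optimisms.append(auc_boot - auc_test)                   # step 5: optimism in replicate b
    return apparent - np.mean(optimisms)                        # optimism-corrected AUC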

This is consistent with broader literature distinguishing two approaches when combining imputation and bootstrap:

  • impute first then bootstrap, or

  • bootstrap first then impute within bootstrap samples—but either way the key is that bootstrap is tied to model fitting and performance evaluation, not only to data completion. (arXiv)

5) What the recent literature says about combining imputation with internal validation

A recent 2025 J Clin Epidemiol review explicitly notes that clinical prediction modeling studies combine multiple imputation with internal model validation using different strategies (MI-prior-IMV vs MI-during-IMV), reflecting a trade-off between the methodological ideal and computational feasibility. (ScienceDirect)

A concrete example in prediction modeling literature (e.g., lasso model validation with multiply imputed data) compares strategies such as resampling completed datasets vs resampling incomplete datasets and imputing within each bootstrap, again emphasizing that optimism evaluation depends on the model being trained in one sample and assessed in another. (Springer)

6) Practical recommendation for your kNN-replicated setting (500 datasets, FP terms, Stata validation)

Development (choose the model specification)

  • Use your kNN-replicated datasets (500, or a stable subset like 140) to stabilize development decisions:

    • predictor list (pre-test only, avoid leakage)

    • FP powers / functional forms (MFP allowed here)

Once the specification is decided:

  • freeze predictors + FP powers + link function.

Internal validation (quantify optimism of that final model)

Do bootstrap internal validation (a lean B of 50–100 replicates) after freezing the model. TRIPOD’s internal validation concept is exactly an optimism adjustment based on repeated resampling. (limbic-cenc.org)

Two defendable implementations:

Option A (what you coded; efficient and publishable in your context)
For each completed dataset m (e.g., 140 chosen from 500):

  • fit the final fixed model → apparent AUC/slope/CITL

  • bootstrap within that dataset: train on the bootstrap sample, test on the original dataset m → mean test performance

  • optimism_m = apparent_m − mean(test_m)

  • summarize across m (median/IQR; mean as secondary)

This aligns with “imputation prior to internal validation” patterns discussed in the MI+IMV literature, while maintaining the essential optimism logic. (ScienceDirect)
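
A compact sketch of Option A under the same kind of assumptions (completed numeric datasets, a binary outcome column, a logistic stand-in for the frozen model); the function and variable names here are illustrative, not fixed:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_within_dataset(data, predictors, outcome, B=100, seed=2024):
    """Option A: bootstrap inside one completed dataset m. Refit the fixed model
    on each bootstrap sample, test it on the full completed dataset, and return
    optimism_m = apparent_m - mean test AUC."""
    rng = np.random.default_rng(seed)
    X, y = data[predictors], data[outcome]
    fit = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, fit.predict_proba(X)[:, 1])
    test_aucs = []
    for _ in range(B):
        idx = rng.integers(0, len(data), len(data))   # bootstrap within dataset m
        m = LogisticRegression(max_iter=1000).fit(X.iloc[idx], y.iloc[idx])
        test_aucs.append(roc_auc_score(y, m.predict_proba(X)[:, 1]))
    return apparent - np.mean(test_aucs)

# Summarize across the completed datasets (e.g., 140 of the 500):
# optimisms = [optimism_within_dataset(d, PREDICTORS, "y") for d in completed_datasets]
# print(np.median(optimisms), np.percentile(optimisms, [25, 75]))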

Option B (nested; heavier)
Bootstrap the incomplete data, impute inside each bootstrap sample, fit the model, and test on the reference dataset. This is conceptually clean but often computationally intense. (arXiv)

Fractional polynomials: why re-centering is allowed

You should keep FP powers fixed (model identity) and recompute centering constants within each sample (parameterization). This does not define a new functional form; it stabilizes estimation and calibration quantities like CITL. (Your Stata implementation reflects this correctly.)
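
As a minimal illustration (not your Stata code, and ignoring mfp scaling and repeated-power conventions), keeping the FP powers fixed while re-centering within each sample could look like this:

import numpy as np

def fp_terms(x, powers, center=True):
    """Build fractional-polynomial terms with FIXED powers (power 0 means ln(x));
    x is assumed positive/scaled as in the development data. Centering constants
    are recomputed within each sample: this changes the parameterization only,
    not the functional form. (Repeated powers, i.e. x^p * ln(x) terms, are not
    handled in this sketch.)"""
    x = np.asarray(x, dtype=float)
    cols = []
    for p in powers:
        t = np.log(x) if p == 0 else x ** p
        cols.append(t - t.mean() if center else t)
    return np.column_stack(cols)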

7) Methods paragraph

Missing predictor values were handled using repeated k-nearest neighbors (kNN) single imputation to create multiple completed datasets (M=500). The final model specification (predictor set and functional form) was determined during model development and then held fixed. Internal validation was performed using bootstrap resampling to quantify optimism in discrimination (AUC) and calibration (calibration slope and calibration-in-the-large). In each bootstrap replicate, the fixed model was refit in the bootstrap sample and evaluated in the original dataset to estimate optimism, which was subtracted from apparent performance to obtain optimism-corrected estimates, consistent with TRIPOD recommendations for internal validation. (limbic-cenc.org)


References (key sources)

  • TRIPOD Explanation & Elaboration (Ann Intern Med 2015): internal validation, optimism, resampling. (limbic-cenc.org)

  • TRIPOD Statement (Ann Intern Med 2015): internal validation as part of model development. (OHDSI)

  • Steyerberg & Harrell et al. (J Clin Epidemiol 2001): efficiency of internal validation procedures; recommends bootstrap. (ScienceDirect)

  • 2025 J Clin Epidemiol review on combining MI with internal model validation (MI-prior vs MI-during). (ScienceDirect)

  • Bootstrap inference with multiple imputation (distinguishes “impute-then-bootstrap” vs “bootstrap-then-impute”). (arXiv)

  • Example comparing MI+bootstrap strategies in prediction modeling (BMC Med Res Methodol 2014). (Springer)

  • Harrell’s discussion of optimism bootstrap (conceptual explanation of estimating optimism and subtracting from apparent). (Statistical Thinking)
