kNN Imputation and Bootstrap Internal Validation in Clinical Prediction Modeling: A Two-Phase Workflow Explained
- Mayta

A practical two‑phase workflow (MI or kNN for development, bootstrap for internal validation) — and why “bootstrap before kNN” can matter when kNN is deterministic

Clinical prediction modeling (CPM) workflows often feel confusing because two separate problems get solved with tools that look similar:
Missing‑data handling (so you can fit a model despite missing predictors)
Internal validation (so you can quantify overfitting/optimism and report the performance you can expect in similar future patients)
TRIPOD is explicit that internal validation is part of model development and is typically done using resampling methods such as bootstrapping or cross‑validation to estimate optimism in predictive performance. (OHDSI)
This article is written to match your updated understanding:
Phase 1 (Development): use MI or kNN to obtain analyzable data and specify the final model.
Phase 2 (Internal validation): use bootstrap to estimate optimism of that final model; and if your imputation is deterministic (kNN), your institution’s preference for bootstrap → kNN is best understood as “repeat preprocessing inside resampling so the whole pipeline is validated.”
0) A fast glossary (you’ll use these words with your professor)
Development sample: The dataset you use to build your model.
Apparent performance: Performance measured on the same data used to fit the model (typically optimistic).
Internal validation: A resampling procedure (e.g., bootstrap) that estimates how much performance is overestimated due to overfitting (“optimism”). (OHDSI)
Optimism: For a performance metric M, optimism is the average train-minus-test difference across bootstrap replicates:
optimism_b = M evaluated in the bootstrap sample (train_b) − M evaluated in the original sample (test_b)
mean optimism = average of optimism_b over b = 1...B
and then:
corrected performance = apparent performance − mean optimism
This is the standard bootstrap‑optimism logic described in TRIPOD and classic internal‑validation work.
1) What you’re measuring (AUC, calibration slope, CITL)
Most CPM papers report:
Discrimination: AUC (c‑statistic)
How well the model separates cases from non‑cases.
Calibration‑in‑the‑large (CITL) = calibration intercept
A “global” calibration check: do predicted risks match the overall observed risk? Steyerberg & Vergouwe describe calibration‑in‑the‑large as the intercept (alpha, their “A”) and the calibration slope as the slope (beta, their “B”) of the recalibration model in validation data.
Calibration slope
A “spread” check: are predictions too extreme (slope < 1) or not extreme enough (slope > 1)?
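To make these three metrics concrete, here is a minimal sketch (Python, scikit-learn plus statsmodels) of how they are commonly computed from predicted probabilities and observed binary outcomes; the function name and the clipping constant are illustrative. The calibration slope is the coefficient from a logistic regression of the outcome on the linear predictor, and CITL is the intercept of a logistic regression that enters the linear predictor as an offset.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def discrimination_and_calibration(y, p):
    """Return AUC, calibration-in-the-large (CITL), and calibration slope."""
    y = np.asarray(y)
    p = np.clip(np.asarray(p, dtype=float), 1e-12, 1 - 1e-12)
    lp = np.log(p / (1 - p))  # linear predictor = logit of the predicted risk

    auc = roc_auc_score(y, p)

    # Calibration slope: coefficient of the linear predictor in a logistic recalibration model
    slope_fit = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit()
    slope = slope_fit.params[1]

    # CITL: intercept of an intercept-only logistic model with the linear predictor as an offset
    citl_fit = sm.GLM(y, np.ones((len(y), 1)), family=sm.families.Binomial(), offset=lp).fit()
    citl = citl_fit.params[0]

    return auc, citl, slope
```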
Why you often saw slope ≈ 1 and CITL ≈ 0
If you assess calibration on the same dataset used to fit the model, then:
The model is calibrated to its own training sample by construction, so CITL “makes no sense” as a diagnostic at development (Steyerberg & Vergouwe explicitly note this point).
Slope and intercept also tend to look “ideal” when you reuse the fitted linear predictor in the same training data (that’s why internal validation is needed).
2) The two phases (your “Eureka” framed in TRIPOD language)
Phase 1 — Model development (specify and fit the final model)
You decide:
predictors (avoid leakage),
coding/functional forms (linear / FP / spline),
interactions if any,
and fit the model.
If predictors are missing, you choose an imputation strategy to allow modeling.
Phase 2 — Internal validation (estimate optimism)
You refit the final model repeatedly in resamples and compare train vs test performance to estimate optimism, then correct apparent performance. This is the bootstrap internal‑validation logic TRIPOD describes (Box F) and that classic work recommends for stability/low bias.
3) Phase 1 choices for missing predictors: MI vs kNN
Option A: Multiple imputation (MI)
MI creates M completed datasets by drawing missing values from a predictive distribution; you analyze each dataset and combine results (Rubin’s rules). The imputed values differ across the M datasets because MI is stochastic.
How many imputations (M)?
Stata’s MI manual notes that while older advice often suggested small M, some analyses may require M = 50+, and Stata recommends at least 20 imputations to reduce Monte‑Carlo error. (Stata)
von Hippel provides a two‑stage method (implemented in the how_many_imputations command) and explains why stable standard errors may require a large M (sometimes >200).
Your practical takeaway (MI): MI does not “decide” M for you automatically; you choose M (and tools like how_many_imputations help justify it).
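As a concrete illustration of what “M completed datasets from a stochastic imputer” means, here is a minimal Python sketch using scikit-learn’s IterativeImputer with sample_posterior=True. This is only one of several MI implementations (mice in R or mi in Stata are the more conventional choices in CPM work), and the helper name is made up:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required before the import below
from sklearn.impute import IterativeImputer

def multiply_impute(X, M=20, seed=0):
    """Return M completed copies of X (np.nan = missing); copies differ because imputation is stochastic."""
    completed = []
    for m in range(M):
        imputer = IterativeImputer(sample_posterior=True, random_state=seed + m)
        completed.append(imputer.fit_transform(X))
    return completed  # fit the model in each copy, then pool results with Rubin's rules
```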
Option B: Deterministic kNN imputation
Plain kNN imputation is typically single and deterministic once you fix:
predictors used for distance,
k,
scaling/standardization rules,
tie handling.
So “kNN alone” yields one completed dataset for a given configuration.
There is also research on nearest‑neighbor‑based multiple imputation methods (i.e., nearest neighbor concepts used in a multiple‑imputation framework). (ScienceDirect)
Your practical takeaway (kNN): If your kNN implementation is deterministic, you don’t automatically get MI‑style “between‑imputation” variability unless you add a mechanism (e.g., stochasticity, multiple‑imputation NN method, or a resampling wrapper).
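By contrast, here is a minimal sketch of deterministic kNN imputation with scikit-learn’s KNNImputer: once the predictors, k, and scaling rule are fixed, repeated calls return the identical completed dataset (the class has no random_state because nothing is stochastic). The standardize-then-impute choice below is illustrative, not mandatory.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

def knn_impute(X, k=5):
    """Deterministic kNN imputation: standardize, impute from the k nearest donors, back-transform."""
    scaler = StandardScaler()                    # NaNs are ignored when the scaler is fitted
    Z = scaler.fit_transform(X)                  # distances are computed on standardized predictors
    Z_complete = KNNImputer(n_neighbors=k).fit_transform(Z)
    return scaler.inverse_transform(Z_complete)  # back to the original scale
```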
4) Phase 2 internal validation: the bootstrap optimism algorithm (core idea)
The classic bootstrap internal validation (optimism correction) is:
Fit the model in the full development sample → compute apparent performance.
For each bootstrap replicate (b = 1...B):
sample with replacement from development data → bootstrap sample
fit the model in the bootstrap sample
evaluate performance in:
the bootstrap sample (train / “apparent within bootstrap”), and
the original development sample (test)
compute optimism_b = train_b − test_b
Average optimism across B replicates
Corrected = apparent − mean optimism
TRIPOD’s internal validation box emphasizes that resampling estimates optimism and can be used for adjustment/shrinkage.
Steyerberg & Harrell’s JCE paper supports bootstrapping as efficient/stable for internal validity estimation compared with split‑sample methods. (ScienceDirect)
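Here is a minimal sketch of that algorithm for a single metric (the AUC), assuming a complete dataset and a fixed logistic regression model; names and the choice of B are illustrative, and in practice you would track CITL and the calibration slope in the same loop:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def optimism_corrected_auc(X, y, B=200, seed=0):
    """Bootstrap optimism correction for the AUC of a fixed logistic model (complete data)."""
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    n = len(y)

    apparent_model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = roc_auc_score(y, apparent_model.predict_proba(X)[:, 1])

    optimism = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                                           # sample rows with replacement
        model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
        train_auc = roc_auc_score(y[idx], model.predict_proba(X[idx])[:, 1])  # performance in bootstrap sample
        test_auc = roc_auc_score(y, model.predict_proba(X)[:, 1])             # performance in original sample
        optimism.append(train_auc - test_auc)

    return apparent - np.mean(optimism)  # corrected = apparent minus mean optimism
```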
5) Where your school’s point fits: kNN then bootstrap vs bootstrap then kNN
This is the part that caused your earlier confusion, and it’s worth framing carefully:
The general principle (TRIPOD‑consistent)
TRIPOD explains that all aspects of model fitting that can cause overfitting should be incorporated in each bootstrap sample (e.g., predictor selection, transformations, interaction checks).
Your institution is effectively extending this principle to preprocessing, arguing:
If kNN imputation is part of the pipeline that will be used when the model is applied, then internal validation should repeat that imputation step within each resample.
That is a defensible pipeline validation viewpoint, especially when imputation is deterministic.
Why “kNN then bootstrap” can be considered incomplete (for your school’s goal)
If you do:
kNN impute once → then bootstrap the completed dataset
Then the bootstrap sees a fixed imputed dataset. You are internally validating:
the modeling step, conditional on that one imputation, not
the full pipeline that includes the imputation procedure.
This is not “wrong” in all contexts — it just answers a narrower question.
Why “bootstrap → kNN” changes the meaning (and may match your school’s requirement)
If you do:
bootstrap the incomplete dataset → run deterministic kNN inside each bootstrap sample
Then the imputed values can change across bootstrap samples because:
The resampled dataset changes the neighborhood structure,
The available donors change,
The imputation mapping is effectively re‑estimated per resample (even if deterministic within a resample).
This is consistent with the idea that bootstrap resampling plus deterministic imputation can reduce bias in prediction‑model validation settings with missing predictors (recent methodological work explicitly discusses bootstrapping followed by deterministic imputation for internal validation). (arXiv)
Also, in the broader “bootstrap + imputation” literature, the order of nesting (bootstrap within MI vs MI within bootstrap) is known to matter; it’s not automatically valid to mix them casually. (Stef van Buuren)
Your practical takeaway: For deterministic kNN, bootstrap → kNN is a reasonable way to ensure the imputation step is “inside” the resampling loop (pipeline repeated), which is exactly the TRIPOD spirit for repeating modeling decisions inside validation.
6) How MI fits with internal validation (and why you felt it was “fragile”)
Combining MI and internal model validation is not one single recipe; there are recognized strategies.
A 2025 J Clin Epidemiol methodological review describes two broad approaches:
MI‑prior‑IMV: impute first, then do internal validation
MI‑during‑IMV: internal validation resampling is done and MI is performed within each resample (Jclinepi)
This explains why:
some workflows “impute first, then bootstrap models within each imputed dataset” (common in practice),
while others prefer “bootstrap first, then impute within bootstrap” (pipeline‑faithful but heavy).
Your practical takeaway (MI): You can bootstrap with MI, but it’s computationally and methodologically more complex than plain MI or plain bootstrap, and the literature explicitly discusses the tradeoffs and nesting order. (Jclinepi)
7) A clean, blog‑ready algorithm description (your updated workflow)
Here’s a crisp way to describe what you’re doing (and why):
Phase 1: Development (missing data handled first)
Handle missing predictors using either:
MI (produce M completed datasets), or
deterministic kNN (produce 1 completed dataset, with fixed settings)
Fit the model and finalize the specification (predictors + functional form).
Phase 2: Internal validation (bootstrap optimism correction; kNN done inside bootstrap)
For each bootstrap replicate b = 1...B:
Draw a bootstrap sample from the original incomplete dataset (preserving missingness patterns through resampling).
Apply deterministic kNN imputation within that bootstrap sample to obtain a complete bootstrap training dataset.
Fit the fixed final model in that bootstrap‑imputed training dataset.
Evaluate performance in a test dataset not used for that fit (commonly the original development dataset, completed using a consistent imputation procedure) and compute AUC, slope, CITL.
Compute optimism for each metric as train − test. Average optimism across B replicates, then subtract from the apparent performance.
This is exactly the logic TRIPOD describes for internal validation and optimism correction; you are simply ensuring the deterministic imputation step is included inside the resampling loop as part of the pipeline.
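Putting Sections 4 and 7 together, here is a minimal end-to-end sketch that resamples the incomplete data, re-runs deterministic kNN imputation inside every bootstrap sample, and tracks optimism for the AUC only (the same loop would also record CITL and slope). One design choice is hedged explicitly: the “test” copy of the development data is completed once with the same kNN settings, which is one reasonable option rather than the only one.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def impute(X, k):
    """Deterministic kNN imputation with fixed settings (np.nan marks missing values)."""
    return KNNImputer(n_neighbors=k).fit_transform(X)

def bootstrap_then_knn_auc(X_incomplete, y, B=200, k=5, seed=0):
    """Bootstrap-then-kNN optimism correction for the AUC of a fixed logistic model."""
    rng = np.random.default_rng(seed)
    X_incomplete, y = np.asarray(X_incomplete, dtype=float), np.asarray(y)
    n = len(y)

    # Test data: the original development sample, completed once with the same kNN settings
    X_test = impute(X_incomplete, k)
    apparent_model = LogisticRegression(max_iter=1000).fit(X_test, y)
    apparent = roc_auc_score(y, apparent_model.predict_proba(X_test)[:, 1])

    optimism = []
    for _ in range(B):
        idx = rng.integers(0, n, n)                  # resample the incomplete rows with replacement
        X_boot = impute(X_incomplete[idx], k)        # imputation re-estimated inside the resample
        model = LogisticRegression(max_iter=1000).fit(X_boot, y[idx])
        train_auc = roc_auc_score(y[idx], model.predict_proba(X_boot)[:, 1])  # bootstrap sample
        test_auc = roc_auc_score(y, model.predict_proba(X_test)[:, 1])        # original sample
        optimism.append(train_auc - test_auc)

    return apparent - np.mean(optimism)              # optimism-corrected AUC
```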
8) A Methods paragraph you can paste (kNN + bootstrap internal validation)
Use this as your “default” CPM wording when you develop logistic models with missing predictors and deterministic kNN:
Methods (template): “Missing predictor values were handled using deterministic k‑nearest neighbors (kNN) imputation. Model development was performed using the completed dataset, with the final model specification fixed after development. Internal validation was performed using bootstrap resampling to estimate optimism in discrimination (AUC) and calibration (calibration‑in‑the‑large and calibration slope). In each bootstrap replicate, a bootstrap sample was drawn from the original dataset and predictors were imputed within the bootstrap sample using deterministic kNN; the fixed model was refit in the bootstrap‑imputed sample and its performance was evaluated in the original development sample. Optimism was estimated as the mean difference between bootstrap‑sample performance and test performance across replicates, and optimism‑corrected performance was obtained by subtracting mean optimism from the apparent performance.”
If you need the single‑sentence version (your style):
“Optimism was estimated as the mean difference between apparent and test performance across bootstrap samples; optimism‑corrected performance was obtained by subtracting the mean optimism from apparent performance.”
9) A Methods paragraph you can paste (MI + internal validation)
If in the future you do MI:
“Missing predictor values were handled using multiple imputation, creating M completed datasets. Internal validation was performed using bootstrap resampling to estimate optimism in discrimination and calibration; the combination of imputation and internal validation followed a prespecified strategy (e.g., MI performed prior to internal model validation, or MI performed within each bootstrap replicate).” (Stata)
And for “how many imputations” justification, you can cite:
Stata’s MI manual guidance and references to the literature (Stata)
von Hippel’s paper and the how_many_imputations approach
10) What to show in Results (so it reads like a strong CPM paper)
A simple reporting structure (you can copy):
Apparent performance on the full development dataset (AUC, CITL, slope)
Mean optimism from bootstrap
Optimism‑corrected performance = apparent − optimism
Optional: uniform shrinkage factor ≈ optimism‑corrected slope (if you plan coefficient shrinkage; TRIPOD discusses shrinkage/adjustment in the internal‑validation context). (OHDSI)
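A hypothetical worked example (numbers invented purely for illustration): if the apparent AUC is 0.82 and the mean bootstrap optimism for the AUC is 0.04, you would report an optimism-corrected AUC of 0.82 − 0.04 = 0.78; the same subtraction (apparent minus mean optimism) is applied to the calibration slope and CITL.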
References you can cite (professor‑friendly)
Below are the “anchor” references that cover each key claim:
TRIPOD Statement (internal validation is necessary; uses bootstrap/CV) (OHDSI)
TRIPOD Explanation & Elaboration, Box F (optimism, internal validation, repeat modeling steps)
Steyerberg & Harrell, J Clin Epidemiol 2001 (bootstrap internal validation is efficient/stable) (ScienceDirect)
Steyerberg & Vergouwe tutorial (ABCD validation; CITL and slope definitions; CITL in development “makes no sense”)
Awounvo et al., J Clin Epidemiol 2025 (how MI is combined with internal validation in CPM literature; MI‑prior vs MI‑during) (Jclinepi)
Mi et al., arXiv 2024 (methodological discussion supporting bootstrap followed by deterministic imputation for internal validation under missing predictors) (arXiv)
Stata MI manual (discussion of number of imputations; practical recommendation ≥20; references) (Stata)
von Hippel 2018 (how many imputations; two‑stage procedure; how_many_imputations)
Schomaker & Heumann (order of bootstrap and MI matters; different strategies) (arXiv)
Brand et al. 2019 (combining bootstrap and imputation; nesting order matters) (Stef van Buuren)
Faisal & Tutz 2021 (nearest‑neighbor concepts in multiple imputation) (ScienceDirect)




