Handling Missing Data in Clinical Prediction Models: Bootstrap kNN vs Multiple Imputation
- Mayta
Overview
Missing data are common in clinical datasets and must be handled carefully to avoid biased estimates, inflated performance, and invalid inference. In this study, missing values were addressed using k-nearest neighbor (kNN) imputation as an alternative to multiple imputation (MI), followed by bootstrap-based internal validation of a diagnostic prediction model for gastrointestinal malignancy.
This section describes the theoretical justification, practical implementation, and validation strategy for each approach.
Missing Data: Conceptual Framework
Missing Data Mechanisms
Missingness was assumed to be predominantly Missing At Random (MAR), where the probability of missingness depends on observed variables (e.g., symptoms, laboratory values), rather than on the unobserved value itself.
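Formally, letting R denote the missingness indicator and splitting the data into observed and unobserved parts, MAR means the probability of missingness depends only on what is observed:

```latex
\Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R \mid Y_{\mathrm{obs}})
```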
Key variables with substantial missingness included:
Fecal immunochemical test (FIT) (~41%)
Menopause-related variables (~42%); much of this missingness is structural rather than random, since menopause status is undefined for men
Selected laboratory indices (≤5%)
Outcome data (group_gimalig) were fully observed and never imputed, consistent with best practices in prediction modeling.
k-Nearest Neighbor (kNN) Imputation
Conceptual Rationale
kNN imputation is a single imputation method that replaces missing values by borrowing information from the k most similar individuals, defined by distance in the predictor space.
For each observation with missing data:
Continuous variables are imputed using the median of the nearest neighbors
Categorical variables are imputed using the mode
kNN is attractive when:
Missingness is moderate
Predictor relationships are complex and nonlinear
A deterministic imputation is preferred for downstream resampling
However, kNN does not propagate imputation uncertainty, unlike MI.
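As an illustration, here is a minimal sketch using scikit-learn's KNNImputer. Note the assumptions: KNNImputer handles numeric features only and averages neighbor values (mean, not the median/mode rule described above, which is what, e.g., R's VIM::kNN implements).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric predictor matrix; np.nan marks missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Rows are compared with a nan-aware Euclidean distance computed on the
# coordinates both rows have observed; each missing cell is then filled
# from the k most similar rows.
imputer = KNNImputer(n_neighbors=2)
X_complete = imputer.fit_transform(X)
```

Categorical predictors would need to be encoded first, or imputed with a mode-based kNN variant.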
Bootstrap Before kNN Imputation
Why Bootstrap First?
Because kNN is a single, deterministic imputation, uncertainty must be introduced before imputation to reflect sampling variability.
The correct sequence is therefore:
Bootstrap → kNN Imputation → Model Fitting
Procedure
Bootstrap sampling (sampling with replacement) was performed on the original dataset
Each bootstrap sample may:
Include some individuals more than once
Omit other individuals entirely
kNN imputation was applied separately within each bootstrap sample
This resulted in 500 distinct imputed datasets, each reflecting:
Sampling variability
Imputation variability induced by resampling
Each dataset contained the same variables but differed in:
Which patients were duplicated
Which patients were excluded
Which neighbors were used for imputation
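A minimal sketch of this sequence under stated assumptions (X is a numeric predictor matrix with np.nan for missing values; k = 5 and the seed are arbitrary illustration choices, not values from the study):

```python
import numpy as np
from sklearn.impute import KNNImputer

def bootstrap_knn_datasets(X, n_boot=500, k=5, seed=0):
    """Bootstrap first, then impute within each resample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    datasets = []
    for _ in range(n_boot):
        # Sampling with replacement: some rows appear several times,
        # others not at all.
        idx = rng.integers(0, n, size=n)
        # The imputer is refit inside each resample, so the neighbor
        # pool (and hence the imputed values) varies across replicates.
        datasets.append(KNNImputer(n_neighbors=k).fit_transform(X[idx]))
    return datasets
```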
Interpretation
This approach treats bootstrap resampling as the source of uncertainty, while kNN serves as a deterministic missing-data repair mechanism within each resample.
This is statistically coherent and analogous in spirit to the Bayesian bootstrap, which treats resampling as draws from a nonparametric posterior.
Multiple Imputation (MI)
Conceptual Rationale
Multiple Imputation addresses missing data by:
Creating M completed datasets
Introducing stochastic (random) variation into the imputed values (see the sketch below)
Combining results using Rubin’s rules
MI is the preferred approach when:
Missingness is MAR
Inference (coefficients, standard errors) is the primary goal
Model form is stable across imputations
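For contrast, a minimal sketch of the "M completed datasets" step, using scikit-learn's experimental IterativeImputer with sample_posterior=True to draw stochastic imputations. This is a chained-equations approximation; R's mice is the canonical implementation, and M = 20 here is an arbitrary choice.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute(X, m=20, seed=0):
    """Draw m stochastic completions of X for MI-style analysis."""
    completed = []
    for i in range(m):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so each completed dataset differs at random.
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        completed.append(imp.fit_transform(X))
    # Fit the analysis model on each dataset, then pool via Rubin's rules.
    return completed
```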
Do We Bootstrap Before MI?
Short Answer: No
Correct MI Workflow
MI → Model Fitting → (Optional) Bootstrap
MI already incorporates uncertainty due to missing data. Bootstrapping before MI would:
Double-count uncertainty
Violate Rubin’s combining rules
Produce invalid variance estimates
Rubin’s Rules Replace the Pre-Imputation Bootstrap
Rubin’s rules account for:
Within-imputation variance
Between-imputation variance
Thus, MI does not require bootstrapping for inference.
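For reference, writing Q̂ₘ and Ûₘ for the estimate and its variance from the m-th of M imputed datasets, the pooled estimate and total variance are:

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m,
\qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M}\hat{U}_m,
\qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2,
\qquad
T = \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B
```

The between-imputation variance B already carries the missing-data uncertainty that a pre-imputation bootstrap would duplicate.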
MI + Internal Validation (Advanced Case)
If internal validation is required after MI (e.g., optimism-corrected AUC):
Two valid but complex options exist:
Option 1 (Theoretically Correct, Rarely Used)
Bootstrap within each imputed dataset, then pool optimism
This is computationally expensive and rarely feasible in practice.
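A minimal sketch of Option 1 under stated assumptions (logistic regression and AUC are placeholders for the study's actual model and metric; imputed_sets is a list of M completed predictor matrices and y the observed outcome):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def pooled_optimism(imputed_sets, y, n_boot=200, seed=0):
    """Harrell-style optimism, bootstrapped inside each imputed dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    opt = []
    for X in imputed_sets:                      # M completed datasets ...
        for _ in range(n_boot):                 # ... times B bootstraps each
            idx = rng.integers(0, n, size=n)
            model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            auc_boot = roc_auc_score(y[idx], model.predict_proba(X[idx])[:, 1])
            auc_orig = roc_auc_score(y, model.predict_proba(X)[:, 1])
            opt.append(auc_boot - auc_orig)     # optimism of this replicate
    return float(np.mean(opt))                  # subtract from apparent AUC
```

The M × B model fits are exactly the computational burden noted above.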
Option 2 (Recommended in Practice)
Use MI for model estimation
Report apparent performance
Acknowledge optimism risk
Avoid full bootstrap correction unless essential
In practice, many prediction studies do not combine MI and bootstrap optimism correction due to instability and software limitations.
Comparison of Strategies
| Aspect | Bootstrap + kNN | MI Only |
| --- | --- | --- |
| Imputation uncertainty | ❌ (via resampling only) | ✅ (explicit) |
| Sampling variability | ✅ | ❌ |
| Rubin’s rules | ❌ | ✅ |
| Internal validation | Straightforward | Complex |
| Computational stability | High | Moderate |
| Software burden | Low | High |
| Reviewer familiarity | Moderate | High |
Internal Validation Strategy Used
Given:
High missingness in key predictors
Instability of variable selection under MI
Software limitations for MI + bootstrap
We adopted:
Bootstrap before kNN imputation, generating 500 imputed datasets
For each dataset:
Models were refit
Discrimination (AUC) and calibration were assessed
Performance distributions were summarized
This approach provides empirical internal validation, reflecting both:
Sampling variability
Imputation variability induced by resampling
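An end-to-end sketch of this strategy, with hedged assumptions: logistic regression stands in for the actual diagnostic model, X holds the predictors with np.nan for missing values, and y is the fully observed outcome, which is never passed to the imputer.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_knn_validate(X, y, n_boot=500, k=5, seed=0):
    """Bootstrap -> kNN imputation -> refit -> AUC, per replicate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                        # resample
        X_b = KNNImputer(n_neighbors=k).fit_transform(X[idx])   # impute within
        model = LogisticRegression(max_iter=1000).fit(X_b, y[idx])
        aucs.append(roc_auc_score(y[idx], model.predict_proba(X_b)[:, 1]))
    # Summarize the empirical performance distribution.
    return np.percentile(aucs, [2.5, 50, 97.5])
```

Calibration measures (e.g., slope and intercept) can be collected inside the same loop.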
Key Reporting Statements (Ready to Paste)
“Missing predictor data were handled using k-nearest neighbor imputation. To reflect uncertainty due to sampling and missing data, nonparametric bootstrap resampling was performed prior to imputation, generating 500 bootstrap datasets. kNN imputation was applied independently within each bootstrap sample. The outcome variable was not imputed. Model performance was evaluated across all bootstrap-imputed datasets.”
“Multiple imputation was considered but not used due to instability of variable selection and practical limitations in combining MI with bootstrap-based internal validation.”
Final Take-Home Messages
kNN is single imputation → bootstrap must come first
MI already models uncertainty → do NOT bootstrap before MI
Bootstrap + kNN is defensible for prediction modeling
MI + bootstrap is theoretically elegant but practically fragile
Never impute outcomes in prediction models




