Handling Missing Data in Clinical Prediction Models: Bootstrap kNN vs Multiple Imputation
- Mayta
Overview
Missing data are common in clinical datasets and must be handled carefully to avoid biased estimates, inflated performance, and invalid inference. In this study, missing values were addressed using k-nearest neighbor (kNN) imputation as an alternative to multiple imputation (MI), followed by bootstrap-based internal validation of a diagnostic prediction model for gastrointestinal malignancy.
This section describes the theoretical justification, practical implementation, and validation strategy for each approach.
Missing Data: Conceptual Framework
Missing Data Mechanisms
Missingness was assumed to be predominantly Missing At Random (MAR), where the probability of missingness depends on observed variables (e.g., symptoms, laboratory values), rather than on the unobserved value itself.
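Formally, letting R denote the missingness indicator and splitting the data into observed and unobserved parts, MAR means the probability of missingness depends only on what is observed:

```latex
\Pr(R \mid Y_{\mathrm{obs}}, Y_{\mathrm{mis}}) = \Pr(R \mid Y_{\mathrm{obs}})
```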
Key variables with substantial missingness included:
Fecal immunochemical test (FIT) (~41%)
Menopause-related variables (~42%); much of this missingness is structural rather than random, since menopause status is undefined for men
Selected laboratory indices (≤5%)
Outcome data (group_gimalig) were fully observed and never imputed, consistent with best practices in prediction modeling.
k-Nearest Neighbor (kNN) Imputation
Conceptual Rationale
kNN imputation is a single imputation method that replaces missing values by borrowing information from the k most similar individuals, defined by distance in the predictor space.
For each observation with missing data:
Continuous variables are imputed using the median of the nearest neighbors
Categorical variables are imputed using the mode
kNN is attractive when:
Missingness is moderate
Predictor relationships are complex and nonlinear
A deterministic imputation is preferred for downstream resampling
However, kNN does not propagate imputation uncertainty, unlike MI.
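As an illustration, here is a minimal sketch using scikit-learn's KNNImputer. Note the assumptions: KNNImputer handles numeric features only and averages neighbor values (mean, not the median/mode rule described above, which is what, e.g., R's VIM::kNN implements).

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy numeric predictor matrix; np.nan marks missing entries.
X = np.array([
    [1.0, 2.0, np.nan],
    [3.0, 4.0, 3.0],
    [np.nan, 6.0, 5.0],
    [8.0, 8.0, 7.0],
])

# Rows are compared with a nan-aware Euclidean distance computed on the
# coordinates both rows have observed; each missing cell is then filled
# from the k most similar rows.
imputer = KNNImputer(n_neighbors=2)
X_complete = imputer.fit_transform(X)
```

Categorical predictors would need to be encoded first, or imputed with a mode-based kNN variant.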
Bootstrap Before kNN Imputation
Why Bootstrap First?
Because kNN is a single, deterministic imputation, uncertainty must be introduced before imputation to reflect sampling variability.
The correct sequence is therefore:
Bootstrap → kNN Imputation → Model Fitting
Procedure
Bootstrap sampling (sampling with replacement) was performed on the original dataset
Each bootstrap sample may:
Include some individuals more than once
Omit other individuals entirely
kNN imputation was applied separately within each bootstrap sample
This resulted in 500 distinct imputed datasets, each reflecting:
Sampling variability
Imputation variability induced by resampling
Each dataset contained the same variables but differed in:
Which patients were duplicated
Which patients were excluded
Which neighbors were used for imputation
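A minimal sketch of this sequence under stated assumptions (X is a numeric predictor matrix with np.nan for missing values; k = 5 and the seed are arbitrary illustration choices, not values from the study):

```python
import numpy as np
from sklearn.impute import KNNImputer

def bootstrap_knn_datasets(X, n_boot=500, k=5, seed=0):
    """Bootstrap first, then impute within each resample."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    datasets = []
    for _ in range(n_boot):
        # Sampling with replacement: some rows appear several times,
        # others not at all.
        idx = rng.integers(0, n, size=n)
        # The imputer is refit inside each resample, so the neighbor
        # pool (and hence the imputed values) varies across replicates.
        datasets.append(KNNImputer(n_neighbors=k).fit_transform(X[idx]))
    return datasets
```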
Interpretation
This approach treats bootstrap resampling as the source of uncertainty, while kNN serves as a deterministic missing-data repair mechanism within each resample.
This is statistically coherent and analogous in spirit to the Bayesian bootstrap, which treats resampling as draws from a nonparametric posterior.
Multiple Imputation (MI)
Conceptual Rationale
Multiple Imputation addresses missing data by:
Creating M completed datasets
Introducing stochastic (random) variation into the imputed values (see the sketch below)
Combining results using Rubin’s rules
MI is the preferred approach when:
Missingness is MAR
Inference (coefficients, standard errors) is the primary goal
Model form is stable across imputations
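For contrast, a minimal sketch of the "M completed datasets" step, using scikit-learn's experimental IterativeImputer with sample_posterior=True to draw stochastic imputations. This is a chained-equations approximation; R's mice is the canonical implementation, and M = 20 here is an arbitrary choice.

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

def multiply_impute(X, m=20, seed=0):
    """Draw m stochastic completions of X for MI-style analysis."""
    completed = []
    for i in range(m):
        # sample_posterior=True draws imputations from the predictive
        # distribution, so each completed dataset differs at random.
        imp = IterativeImputer(sample_posterior=True, random_state=seed + i)
        completed.append(imp.fit_transform(X))
    # Fit the analysis model on each dataset, then pool via Rubin's rules.
    return completed
```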
Do We Bootstrap Before MI?
Short Answer: No
Correct MI Workflow
MI → Model Fitting → (Optional) Bootstrap
MI already incorporates uncertainty due to missing data. Bootstrapping before MI would:
Double-count uncertainty
Violate Rubin’s combining rules
Produce invalid variance estimates
Rubin’s Rules Replace the Pre-Imputation Bootstrap
Rubin’s rules account for:
Within-imputation variance
Between-imputation variance
Thus, MI does not require bootstrapping for inference.
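For reference, writing Q̂ₘ and Ûₘ for the estimate and its variance from the m-th of M imputed datasets, the pooled estimate and total variance are:

```latex
\bar{Q} = \frac{1}{M}\sum_{m=1}^{M}\hat{Q}_m,
\qquad
\bar{U} = \frac{1}{M}\sum_{m=1}^{M}\hat{U}_m,
\qquad
B = \frac{1}{M-1}\sum_{m=1}^{M}\bigl(\hat{Q}_m-\bar{Q}\bigr)^2,
\qquad
T = \bar{U} + \Bigl(1+\frac{1}{M}\Bigr)B
```

The between-imputation variance B already carries the missing-data uncertainty that a pre-imputation bootstrap would duplicate.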
MI + Internal Validation (Advanced Case)
If internal validation is required after MI (e.g., optimism-corrected AUC):
Two valid but complex options exist:
Option 1 (Theoretically Correct, Rarely Used)
Bootstrap within each imputed dataset, then pool optimism
This is computationally expensive and rarely feasible in practice.
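A minimal sketch of Option 1 under stated assumptions (logistic regression and AUC are placeholders for the study's actual model and metric; imputed_sets is a list of M completed predictor matrices and y the observed outcome):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def pooled_optimism(imputed_sets, y, n_boot=200, seed=0):
    """Harrell-style optimism, bootstrapped inside each imputed dataset."""
    rng = np.random.default_rng(seed)
    n = len(y)
    opt = []
    for X in imputed_sets:                      # M completed datasets ...
        for _ in range(n_boot):                 # ... times B bootstraps each
            idx = rng.integers(0, n, size=n)
            model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
            auc_boot = roc_auc_score(y[idx], model.predict_proba(X[idx])[:, 1])
            auc_orig = roc_auc_score(y, model.predict_proba(X)[:, 1])
            opt.append(auc_boot - auc_orig)     # optimism of this replicate
    return float(np.mean(opt))                  # subtract from apparent AUC
```

The M × B model fits are exactly the computational burden noted above.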
Option 2 (Recommended in Practice)
Use MI for model estimation
Report apparent performance
Acknowledge optimism risk
Avoid full bootstrap correction unless essential
In practice, many prediction studies do not combine MI and bootstrap optimism correction due to instability and software limitations.
Comparison of Strategies
| Aspect | Bootstrap + kNN | MI Only |
| --- | --- | --- |
| Imputation uncertainty | ❌ (via resampling only) | ✅ (explicit) |
| Sampling variability | ✅ | ❌ |
| Rubin’s rules | ❌ | ✅ |
| Internal validation | Straightforward | Complex |
| Computational stability | High | Moderate |
| Software burden | Low | High |
| Reviewer familiarity | Moderate | High |
Internal Validation Strategy Used
Given:
High missingness in key predictors
Instability of variable selection under MI
Software limitations for MI + bootstrap
We adopted:
Bootstrap before kNN imputation, generating 500 imputed datasets
For each dataset:
Models were refit
Discrimination (AUC) and calibration were assessed
Performance distributions were summarized
This approach provides empirical internal validation, reflecting both:
Sampling variability
Imputation variability induced by resampling
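An end-to-end sketch of this strategy, with hedged assumptions: logistic regression stands in for the actual diagnostic model, X holds the predictors with np.nan for missing values, and y is the fully observed outcome, which is never passed to the imputer.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

def bootstrap_knn_validate(X, y, n_boot=500, k=5, seed=0):
    """Bootstrap -> kNN imputation -> refit -> AUC, per replicate."""
    rng = np.random.default_rng(seed)
    n = len(y)
    aucs = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)                        # resample
        X_b = KNNImputer(n_neighbors=k).fit_transform(X[idx])   # impute within
        model = LogisticRegression(max_iter=1000).fit(X_b, y[idx])
        aucs.append(roc_auc_score(y[idx], model.predict_proba(X_b)[:, 1]))
    # Summarize the empirical performance distribution.
    return np.percentile(aucs, [2.5, 50, 97.5])
```

Calibration measures (e.g., slope and intercept) can be collected inside the same loop.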
Key Reporting Statements (Ready to Paste)
“Missing predictor data were handled using k-nearest neighbor imputation. To reflect uncertainty due to sampling and missing data, nonparametric bootstrap resampling was performed prior to imputation, generating 500 bootstrap datasets. kNN imputation was applied independently within each bootstrap sample. The outcome variable was not imputed. Model performance was evaluated across all bootstrap-imputed datasets.”
“Multiple imputation was considered but not used due to instability of variable selection and practical limitations in combining MI with bootstrap-based internal validation.”
Final Take-Home Messages
kNN is single imputation → bootstrap must come first
MI already models uncertainty → do NOT bootstrap before MI
Bootstrap + kNN is defensible for prediction modeling
MI + bootstrap is theoretically elegant but practically fragile
Never impute outcomes in prediction models




