
Handling Missing Data in Clinical Prediction Models: Bootstrap kNN vs Multiple Imputation

Overview

Missing data are common in clinical datasets and must be handled carefully to avoid biased estimates, inflated performance, and invalid inference. In this study, missing values were addressed using k-nearest neighbor (kNN) imputation as an alternative to multiple imputation (MI), followed by bootstrap-based internal validation of a diagnostic prediction model for gastrointestinal malignancy.

This section describes the theoretical justification, practical implementation, and validation strategy for each approach.

Missing Data: Conceptual Framework

Missing Data Mechanisms

Missingness was assumed to be predominantly Missing At Random (MAR), where the probability of missingness depends on observed variables (e.g., symptoms, laboratory values), rather than on the unobserved value itself.

Key variables with substantial missingness included:

  • Fecal immunochemical test (FIT) (~41%)

  • Menopause-related variables (~42%); much of this missingness is structural rather than random, since these variables do not apply to men

  • Selected laboratory indices (≤5%)

Outcome data (group_gimalig) were fully observed and never imputed, consistent with best practices in prediction modeling.

k-Nearest Neighbor (kNN) Imputation

Conceptual Rationale

kNN imputation is a single imputation method that replaces missing values by borrowing information from the k most similar individuals, defined by distance in the predictor space.

For each observation with missing data:

  • Continuous variables are imputed using the median of the nearest neighbors

  • Categorical variables are imputed using the mode
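The neighbor-borrowing logic above can be sketched in a few lines. This is a minimal illustration, not the imputer actually used in the study: it assumes rows are dictionaries, encodes missingness as `None`, and measures similarity by Euclidean distance over the continuous predictors observed in the target row. All names are illustrative.

```python
from statistics import median, mode

def knn_impute(rows, k=3):
    """Impute missing entries (None) using the k most similar complete rows.

    Each row is a dict of predictor -> value; continuous values are floats,
    categorical values are strings. Distance is computed over the continuous
    predictors observed in the target row (a simplification of real kNN
    imputers, which also handle mixed-type distances and scaling).
    """
    complete = [r for r in rows if None not in r.values()]
    for r in rows:
        missing = [v for v in r if r[v] is None]
        if not missing:
            continue
        # Rank complete rows by Euclidean distance on observed continuous fields
        def dist(c):
            shared = [v for v in r if isinstance(r[v], float)]
            return sum((r[v] - c[v]) ** 2 for v in shared) ** 0.5
        neighbors = sorted(complete, key=dist)[:k]
        for v in missing:
            vals = [n[v] for n in neighbors]
            # Median for continuous neighbors, mode for categorical
            r[v] = median(vals) if isinstance(vals[0], float) else mode(vals)
    return rows
```

Production imputers (e.g. scikit-learn's `KNNImputer` for continuous data) additionally standardize predictors before computing distances, which matters when variables are on very different scales.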

kNN is attractive when:

  • Missingness is moderate

  • Predictor relationships are complex and nonlinear

  • A deterministic imputation is preferred for downstream resampling

However, kNN does not propagate imputation uncertainty, unlike MI.

Bootstrap Before kNN Imputation

Why Bootstrap First?

Because kNN is a single, deterministic imputation, uncertainty must be introduced before imputation to reflect sampling variability.

The correct sequence is therefore:

Bootstrap → kNN Imputation → Model Fitting

Procedure

  1. Bootstrap sampling (sampling with replacement) was performed on the original dataset

  2. Each bootstrap sample may contain:

    • Duplicated individuals

    • Omitted individuals

  3. kNN imputation was applied separately within each bootstrap sample

  4. This resulted in 500 distinct, imputed datasets, each reflecting:

    • Sampling variability

    • Imputation variability induced by resampling

Each dataset contained the same variables but differed in:

  • Which patients were duplicated

  • Which patients were excluded

  • Which neighbors were used for imputation
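The four-step procedure above amounts to a simple loop. The sketch below is schematic: `impute` and `fit` are placeholders for the kNN imputer and the model-fitting routine, and the function names are illustrative rather than the study's actual code.

```python
import random

def bootstrap_knn_pipeline(data, impute, fit, n_boot=500, seed=1):
    """Sketch of the Bootstrap -> kNN imputation -> model fitting sequence.

    `impute` repairs missing values within one resample; `fit` refits the
    model (or returns a performance metric). Imputation happens strictly
    inside each bootstrap sample, so the available neighbors, and hence
    the imputed values, differ from resample to resample.
    """
    rng = random.Random(seed)
    results = []
    for _ in range(n_boot):
        # Sample n patients with replacement: some duplicated, some omitted
        sample = [dict(row) for row in rng.choices(data, k=len(data))]
        completed = impute(sample)       # deterministic repair within resample
        results.append(fit(completed))   # refit on this completed dataset
    return results
```

Copying each row (`dict(row)`) before imputing keeps duplicated patients independent and leaves the original dataset untouched.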

Interpretation

This approach treats bootstrap resampling as the source of uncertainty, while kNN serves as a deterministic missing-data repair mechanism within each resample.

This is statistically coherent and analogous in spirit to nonparametric Bayesian uncertainty quantification.

Multiple Imputation (MI)

Conceptual Rationale

Multiple Imputation addresses missing data by:

  1. Creating M completed datasets

  2. Introducing stochastic variation into the imputations, so the M datasets differ

  3. Combining results using Rubin’s rules

MI is the preferred approach when:

  • Missingness is MAR

  • Inference (coefficients, standard errors) is the primary goal

  • Model form is stable across imputations


Do We Bootstrap Before MI?

Short Answer: No

Correct MI Workflow

MI → Model Fitting → (Optional) Bootstrap

MI already incorporates uncertainty due to missing data. Bootstrapping before MI would:

  • Double-count uncertainty

  • Violate Rubin’s combining rules

  • Produce invalid variance estimates

Rubin’s Rules Replace Pre-MI Bootstrapping

Rubin’s rules account for:

  • Within-imputation variance

  • Between-imputation variance

Thus, MI does not require bootstrapping for inference.
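Rubin's combining rules are simple enough to state in code. The sketch below pools M per-imputation point estimates and their variances into a single estimate with total variance T = W + (1 + 1/M)·B, where W is the mean within-imputation variance and B the between-imputation variance. The function name is illustrative.

```python
from statistics import mean, variance

def rubins_rules(estimates, variances):
    """Pool M per-imputation estimates and variances via Rubin's rules.

    Returns (pooled_estimate, total_variance) with
    T = W + (1 + 1/M) * B.
    """
    m = len(estimates)
    pooled = mean(estimates)       # pooled point estimate
    w = mean(variances)            # within-imputation variance W
    b = variance(estimates)        # between-imputation variance B (sample var)
    total = w + (1 + 1 / m) * b
    return pooled, total
```

The (1 + 1/M) factor is the finite-M correction: with few imputations, the between-imputation component is inflated to account for simulation error.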

MI + Internal Validation (Advanced Case)

If internal validation is required after MI (e.g., optimism-corrected AUC):

Two valid but complex options exist:

Option 1 (Theoretically Correct, Rarely Used)

Bootstrap within each imputed dataset, then pool optimism

This is computationally expensive and rarely feasible in practice.

Option 2 (Recommended in Practice)

  • Use MI for model estimation

  • Report apparent performance

  • Acknowledge optimism risk

  • Avoid full bootstrap correction unless essential

In practice, many prediction studies do not combine MI and bootstrap optimism correction due to instability and software limitations.

Comparison of Strategies

| Aspect                  | Bootstrap + kNN          | MI Only       |
|-------------------------|--------------------------|---------------|
| Imputation uncertainty  | ❌ (via resampling only) | ✅ (explicit) |
| Sampling variability    | ✅ (bootstrap)           | Rubin’s rules |
| Internal validation     | Straightforward          | Complex       |
| Computational stability | High                     | Moderate      |
| Software burden         | Low                      | High          |
| Reviewer familiarity    | Moderate                 | High          |


Internal Validation Strategy Used

Given:

  • High missingness in key predictors

  • Instability of variable selection under MI

  • Software limitations for MI + bootstrap

We adopted:

Bootstrap before kNN imputation, generating 500 imputed datasets

For each dataset:

  • Models can be refit

  • Discrimination (AUC) and calibration assessed

  • Performance distributions summarized

This approach provides empirical internal validation, reflecting both:

  • Sampling variability

  • Imputation variability induced by resampling
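Summarizing the resulting performance distribution is straightforward: across the 500 bootstrap-imputed datasets, one can report the mean AUC and an empirical 95% percentile interval. A minimal sketch (function name illustrative):

```python
def summarize_performance(aucs):
    """Summarize a performance distribution (e.g. AUC) across
    bootstrap-imputed datasets: mean and empirical 2.5th/97.5th percentiles."""
    s = sorted(aucs)
    n = len(s)
    lo = s[max(0, int(0.025 * n))]       # 2.5th percentile (simple index rule)
    hi = s[min(n - 1, int(0.975 * n))]   # 97.5th percentile
    return {"mean": sum(s) / n, "ci95": (lo, hi)}
```

The same summary applies to calibration metrics (e.g. calibration slope) collected over the same 500 datasets.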


Key Reporting Statements (Ready to Paste)

“Missing predictor data were handled using k-nearest neighbor imputation. To reflect uncertainty due to sampling and missing data, nonparametric bootstrap resampling was performed prior to imputation, generating 500 bootstrap datasets. kNN imputation was applied independently within each bootstrap sample. The outcome variable was not imputed. Model performance was evaluated across all bootstrap-imputed datasets.”

“Multiple imputation was considered but not used due to instability of variable selection and practical limitations in combining MI with bootstrap-based internal validation.”

Final Take-Home Messages

  1. kNN is single imputation → bootstrap must come first

  2. MI already models uncertainty → do NOT bootstrap before MI

  3. Bootstrap + kNN is defensible for prediction modeling

  4. MI + bootstrap is theoretically elegant but practically fragile

  5. Never impute outcomes in prediction models

