← All posts

k-Nearest Neighbours (kNN): From Clinical Similarity to Safe Missing-Data Imputation

Clinical Epidemiology ResearchUniqcret doctor knowledgesData Analytics or StatisticsR [Data Analytics]
k-Nearest Neighbours (kNN): From Clinical Similarity to Safe Missing-Data Imputation

Introduction

k‑Nearest Neighbours (kNN) is one of those rare methods that shows up everywhere in machine learning and clinical data science—not because it is “fancy”, but because the core idea is extremely reusable:

If two patients look similar on the variables you can see, they are likely to be similar on the variable you can’t see.

That “variable you can’t see” might be a class label (disease vs no disease), a numeric outcome (e.g., risk score), or simply a missing lab value that needs imputation.

This article explains what kNN is, how it connects to unsupervised learning / clustering, and how to use it safely for missing‑data imputation (including why we do not impute the outcome Y).


1) The core concept: neighbours in feature space

Every observation (patient) is a point in a multidimensional space:

kNN is not one single algorithm; it’s a neighbourhood idea:

  1. Pick a distance function.
  2. For a given patient, find the k nearest neighbours.
  3. Borrow information from those neighbours.

That’s it.


2) kNN in machine learning: supervised vs “unlabelled” uses

A) Supervised kNN (needs labels)

This is the classic “kNN classifier / regressor”.

Here you must have labels (Y known in training data). This is supervised learning.

B) Unsupervised / “unlabelled” neighbour methods (no labels needed)

Your professor’s point about “unlabelled / clustering” is also valid—the neighbour idea is used inside unsupervised learning, even if the classic kNN classifier is supervised.

Examples:

So yes: the kNN neighbourhood idea is often “unlabelled”.But be careful: kNN ≠ k‑means.


3) kNN for missing‑data imputation: why it works

What is kNN imputation?

kNN imputation fills in missing values by looking at patients who are similar on observed variables.

For each missing cell:

  1. Identify neighbours using the other observed predictors.
  2. Impute:
    • continuous → median (robust) or mean
    • categorical/binary → mode (most common category)

This is conceptually similar to “hot‑deck” imputation in epidemiology: replace missing with values from similar donors.

Why kNN imputation is attractive

But: kNN imputation is still “single imputation”

This means it creates one completed dataset. It does not automatically reflect uncertainty the way multiple imputation does. So:

(If your goal is prediction modeling, this is usually acceptable with clear reporting + internal validation.)


4) Why we do NOT impute Y (the outcome)

In prediction modeling, there are two common “don’ts”:

✅ Don’t impute the outcome (Y)

If Y is missing, imputing it creates “fake truth”. It can:

✅ Don’t use outcome information to fill predictors in a way that won’t exist at deployment

If your goal is to build a model that will be used on new patients (where Y is unknown at prediction time), then imputing predictors using Y can be seen as information leakage.

In our clinical prediction workflow guidance, the rule is:

Prediction models: impute X, not Y; avoid using outcome Y for imputation.

That’s the clean “deployment-safe” principle: the imputation step should not rely on information that won’t be available when the model is used.


5) Why use RAW predictors (no transformations) before kNN imputation?

When people say “raw predictors only”, they usually mean:

This is correct and important for clinical prediction modeling:

One technical note: many kNN implementations internally standardize variables for distance calculation (so labs with large scales don’t dominate distance). That scaling is not “feature engineering” in the clinical sense; it’s just making the distance computation fair.


6) Is k = 5 a “magic number”?

The honest answer: k = 5 is a strong default, not magic.

k is a bias–variance trade‑off:

Why k = 5 is commonly used

Because it often balances:

That’s why many teams standardize on k = 5 for baseline kNN imputation in clinical datasets.

How to choose k properly (blog‑friendly methods)

Method 1: Sensitivity analysis (fast and practical)

Run imputation with a few k values and compare:

Typical set: k = 3, 5, 7, 10

If results are stable, k = 5 is defensible.

Method 2: “Mask-and-recover” tuning (more technical, more convincing)

You can simulate missingness:

  1. randomly hide a small fraction of observed values
  2. impute them using different k
  3. compare imputed vs true values (RMSE for continuous, accuracy for categorical)
  4. pick the k with smallest error

This is one of the cleanest ways to justify k.


7) Practical R template: kNN imputation for real clinical data

Below is a general template (works for many projects).Key rules included: exclude ID, exclude outcome, impute only predictors.

library(haven)
library(dplyr)
library(VIM)

df <- read_dta("your_data.dta")

id_var <- "ID"
y_var  <- "outcome"

X_vars <- c("age","sex","hb","wbc","plt","fit")   # <- choose your predictors

X <- df %>%
  select(all_of(X_vars)) %>%
  mutate(across(where(is.labelled), as.factor))   # Stata labelled → factor

set.seed(215)
X_imp <- kNN(
  X,
  k = 5,
  imp_var = FALSE,
  numFun = median,
  catFun = maxCat
)

df_imp <- bind_cols(
  df %>% select(all_of(c(id_var, y_var))),
  X_imp
)

8) Code to “tune k” using mask‑and‑recover (blog‑ready)

This gives you a defendable reason for k = 5 (or shows another k is better).

library(VIM)
library(dplyr)

score_knn_k <- function(X, k, p_mask = 0.10, seed = 1) {
  set.seed(seed)

  X0 <- X
  # locate observed cells
  obs <- which(!is.na(as.matrix(X0)), arr.ind = TRUE)

  # mask a subset
  n_mask <- floor(nrow(obs) * p_mask)
  idx <- obs[sample(seq_len(nrow(obs)), n_mask), , drop = FALSE]

  X_mask <- X0
  true_val <- vector("list", n_mask)

  for (i in seq_len(n_mask)) {
    r <- idx[i, 1]; c <- idx[i, 2]
    true_val[[i]] <- X_mask[r, c]
    X_mask[r, c] <- NA
  }

  # impute
  X_imp <- kNN(
    X_mask,
    k = k,
    imp_var = FALSE,
    numFun = median,
    catFun = maxCat
  )

  # evaluate
  num_cols <- which(sapply(X0, is.numeric))
  cat_cols <- which(sapply(X0, function(z) is.factor(z) || is.character(z)))

  # collect imputed vs true
  rmse <- NA
  acc  <- NA

  # numeric RMSE
  if (length(num_cols) > 0) {
    num_idx <- idx[idx[,2] %in% num_cols, , drop = FALSE]
    if (nrow(num_idx) > 0) {
      true_num <- sapply(seq_len(nrow(num_idx)), function(i){
        r <- num_idx[i,1]; c <- num_idx[i,2]
        X0[r, c]
      })
      imp_num <- sapply(seq_len(nrow(num_idx)), function(i){
        r <- num_idx[i,1]; c <- num_idx[i,2]
        X_imp[r, c]
      })
      rmse <- sqrt(mean((as.numeric(imp_num) - as.numeric(true_num))^2, na.rm = TRUE))
    }
  }

  # categorical accuracy
  if (length(cat_cols) > 0) {
    cat_idx <- idx[idx[,2] %in% cat_cols, , drop = FALSE]
    if (nrow(cat_idx) > 0) {
      true_cat <- sapply(seq_len(nrow(cat_idx)), function(i){
        r <- cat_idx[i,1]; c <- cat_idx[i,2]
        as.character(X0[r, c])
      })
      imp_cat <- sapply(seq_len(nrow(cat_idx)), function(i){
        r <- cat_idx[i,1]; c <- cat_idx[i,2]
        as.character(X_imp[r, c])
      })
      acc <- mean(imp_cat == true_cat, na.rm = TRUE)
    }
  }

  tibble(k = k, rmse = rmse, acc = acc)
}

# Example usage:
# X must be your predictor-only data frame (no ID, no outcome)
# ks <- c(3,5,7,10)
# results <- bind_rows(lapply(ks, function(k) score_knn_k(X, k, p_mask = 0.10, seed = 215)))
# results

How to interpret:


9) Common pitfalls (the things that break kNN)

  1. Including post‑diagnostic variables in X (endoscopy results, pathology, final diagnosis)→ leakage (predicting with information from the future)
  2. Including the outcome (Y) for imputation in prediction modeling→ breaks deployment logic and inflates performance
  3. Too many predictors with weak signal→ distances become meaningless (curse of dimensionality)
  4. High missingness in many predictors→ neighbour search becomes unreliable(kNN is strongest when most predictors are observed)
  5. Not checking pre/post distributionsAlways check that imputation did not create impossible values or distort prevalence.


10) A clean blog conclusion you can reuse

kNN is not only a “classifier”. It is a general neighbourhood strategy used across machine learning:

Using k = 5 is a strong practical default, but the best workflow is: