
Imputation in Clinical Research: Prediction vs Causal Thinking Explained

  • Writer: Mayta
  • 20 hours ago

A Practical Guide to Prediction, Causal Inference, and Longitudinal Data

1. Why Imputation Needs Careful Thinking

Missing data occur in almost all clinical datasets:

  • Patients miss visits

  • Laboratory tests are not ordered

  • Follow-up is incomplete

Imputation is used to replace missing values so that:

  • Statistical power is preserved

  • Bias from complete-case analysis is reduced

However, imputation is not a neutral technical step. It directly affects:

  • Validity of estimates

  • Model performance

  • Clinical usefulness of results

The same imputation strategy can be correct in one study and wrong in another.

2. First Rule: Define Your Research Goal

Before choosing any imputation method, you must clearly define:

What question am I trying to answer?

There are two fundamentally different goals:

  1. Prediction → “Can we predict an outcome for a future patient?”

  2. Causal inference → “What is the true effect of an exposure or treatment?”

These goals follow different logic and require different imputation rules.

3. Prediction Models (CPMs): Why Y Must Not Be Imputed

3.1 What Is a Prediction Model?

Clinical Prediction Models (CPMs) aim to:

  • Estimate the probability of an outcome (Y)

  • Using information available before Y occurs

Examples:

  • Mortality risk score

  • ICU admission prediction

  • Disease risk calculators

  • Machine-learning classifiers

In real clinical use:

  • Y is unknown

  • Only predictors (X) are available

3.2 Do Not Impute the Outcome (Y)

In prediction studies:

Never impute Y. Never use Y to help impute X.

This rule is absolute.

3.3 Data Leakage Explained Simply

Data leakage happens when information that would not be available in real life is used during model development.

If Y is used during imputation:

  • The model indirectly “sees the answer”

  • The training process becomes unrealistic

This creates a false sense of accuracy.

Think of it like this:

It is like giving students the exam answers and then claiming they are excellent test-takers.

3.4 Why Performance Becomes Overestimated

Using Y in imputation causes:

  • Artificially high AUROC

  • Excellent calibration in development data

  • Strong internal validation results

But when applied to new patients:

  • Performance drops sharply

  • External validation fails

  • Clinical trust is lost

This is one of the most common reasons prediction models fail in practice.
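The inflation described in this section can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not a realistic imputation pipeline: the data are synthetic, the single predictor `x` is only weakly related to `y`, 70% of `x` is deleted completely at random, and the Y-informed imputer is a deliberately crude class-mean fill. All names (`auroc`, `leakage_demo`) are hypothetical.

```python
import random

def auroc(scores, labels):
    """Chance that a random positive case scores above a random negative case (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def mean(values):
    return sum(values) / len(values)

def leakage_demo(n=2000, missing_rate=0.7, seed=1):
    rng = random.Random(seed)
    # Synthetic data: binary outcome y, one weakly predictive predictor x.
    y = [1 if rng.random() < 0.5 else 0 for _ in range(n)]
    x = [0.5 * yi + rng.gauss(0, 1) for yi in y]
    is_missing = [rng.random() < missing_rate for _ in range(n)]

    observed = [(xi, yi) for xi, yi, m in zip(x, y, is_missing) if not m]
    # Imputer A uses predictors only: the overall mean of observed x.
    overall_mean = mean([xi for xi, _ in observed])
    # Imputer B uses the outcome: the mean of observed x within each outcome class.
    class_mean = {c: mean([xi for xi, yi in observed if yi == c]) for c in (0, 1)}

    x_without_y = [overall_mean if m else xi for xi, m in zip(x, is_missing)]
    x_with_y = [class_mean[yi] if m else xi for xi, yi, m in zip(x, y, is_missing)]

    return {
        "complete_data": auroc(x, y),
        "imputed_without_y": auroc(x_without_y, y),
        "imputed_with_y": auroc(x_with_y, y),
    }

result = leakage_demo()
for name, value in result.items():
    print(f"{name}: {value:.3f}")
```

In this toy setting the Y-informed fill produces a higher apparent AUROC than even the complete data would support, because every imputed value encodes the patient's outcome. The same imputer cannot be run for a new patient, whose Y is unknown, so that apparent performance does not carry over.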

3.5 Conceptual Error: Prediction vs Explanation

Prediction asks:

“Given what we know now, what will happen?”

Using Y during imputation answers a different question:

“Given the outcome, can we reconstruct the predictors?”

That is a retrospective explanation, not prediction.

4. Causal Inference: Why the Rules Are Different

4.1 What Is Causal Analysis?

Causal (etiologic or therapeutic) research asks:

“What would have happened if the exposure or treatment were different?”

The goal is to estimate:

  • Risk difference

  • Risk ratio

  • Hazard ratio

  • Causal effect

Here, the focus is:

  • Bias control

  • Valid estimation of the true effect

Prediction performance is secondary.

4.2 Using Y in Imputation Can Be Acceptable

In causal analysis, using Y in multiple imputation can be appropriate if done carefully.

Why?

Because:

  • Y is part of the data-generating process

  • Excluding Y can bias covariate imputation

  • The goal is not future prediction

When missingness is Missing At Random (MAR):

  • Including Y makes the MAR assumption more plausible and preserves the covariate-outcome association in the imputed values

  • Bias is reduced

  • Estimates move closer to the causal truth

This aligns with:

  • Causal diagrams (DAGs)

  • Counterfactual reasoning
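The point that excluding Y can bias covariate imputation can be sketched with a small simulation. The assumptions are all illustrative: synthetic data with a true slope of 1.0 for Y on X, half of X deleted completely at random (a special case of MAR), and a single stochastic regression imputation rather than full multiple imputation. The function names are hypothetical.

```python
import random

def ols_slope(xs, ys):
    """Least-squares slope from regressing ys on xs."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    return sxy / sxx

def compare_imputation_models(n=5000, true_slope=1.0, seed=0):
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y = [true_slope * xi + rng.gauss(0, 1) for xi in x]
    is_missing = [rng.random() < 0.5 for _ in range(n)]

    x_obs = [xi for xi, m in zip(x, is_missing) if not m]
    y_obs = [yi for yi, m in zip(y, is_missing) if not m]
    k = len(x_obs)

    # Model A ignores Y: draw missing X from the observed marginal distribution.
    mean_x = sum(x_obs) / k
    sd_x = (sum((xi - mean_x) ** 2 for xi in x_obs) / (k - 1)) ** 0.5

    # Model B includes Y: regress X on Y in observed rows, then add residual noise.
    b = ols_slope(y_obs, x_obs)
    a = mean_x - b * (sum(y_obs) / k)
    residuals = [xi - (a + b * yi) for xi, yi in zip(x_obs, y_obs)]
    sd_res = (sum(r * r for r in residuals) / (k - 2)) ** 0.5

    x_without_y = [rng.gauss(mean_x, sd_x) if m else xi
                   for xi, m in zip(x, is_missing)]
    x_with_y = [a + b * yi + rng.gauss(0, sd_res) if m else xi
                for xi, yi, m in zip(x, y, is_missing)]

    # Estimated effect of X on Y after each imputation.
    return ols_slope(x_without_y, y), ols_slope(x_with_y, y)

slope_without_y, slope_with_y = compare_imputation_models()
print(f"slope, imputation without Y: {slope_without_y:.2f}")
print(f"slope, imputation with Y:    {slope_with_y:.2f}")
```

Under these assumptions the imputation that ignores Y pulls the estimated effect toward zero, because the imputed X values carry no relationship with the outcome, while the imputation that includes Y stays close to the true slope.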

4.3 Key Difference in Mindset

Aspect | Prediction | Causal Inference
Goal | Future accuracy | Truth of effect
Use of Y | Forbidden | Often acceptable
Performance metrics | Central | Secondary
Bias control | Limited | Primary focus

5. Longitudinal Data: Why Y Can Be Imputed

5.1 What Makes Longitudinal Data Special?

Longitudinal data include:

  • Repeated measurements over time

  • Within-patient correlation

  • Natural temporal ordering

Examples:

  • HbA1c every 3 months

  • Blood pressure across visits

  • Functional scores during follow-up

Missingness often occurs:

  • Between visits

  • Due to missed appointments or delayed tests

5.2 Imputing Y Is Often Reasonable

In longitudinal settings:

  • Past values of Y are strong predictors of future Y

  • The outcome is not a single final event

  • The model reflects biological continuity

Therefore: ✅ Imputing missing Y values is often appropriate

5.3 The Critical Rule: Respect Time Order

You may use:

  • Baseline values

  • Past measurements

  • Concurrent covariates

You must NOT use:

  • Future outcomes

  • Information that occurs after the missing time point

Violating time order creates temporal leakage, a form of data leakage that is just as harmful as using Y in a prediction model.
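The time-order rule can be made concrete with a small sketch. It uses last observation carried forward only because that method, by construction, looks at past values and never at future ones; it is not offered as a recommended imputation method, and the function name and HbA1c values are made up for illustration.

```python
def impute_from_past(values):
    """Fill each missing visit (None) with the most recent earlier observation.

    Only information from before the missing time point is used.
    A gap with no earlier observation stays missing.
    """
    filled = []
    last_seen = None
    for value in values:
        if value is None:
            filled.append(last_seen)
        else:
            last_seen = value
            filled.append(value)
    return filled

# Hypothetical HbA1c series for one patient, one entry per 3-month visit.
hba1c = [7.9, None, 7.4, None, None, 6.8]
print(impute_from_past(hba1c))  # [7.9, 7.9, 7.4, 7.4, 7.4, 6.8]
```

Note that the two gaps before the final visit are filled with 7.4, not with the later value 6.8. A method that interpolated toward 6.8 would be using information from after the missing time point.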

6. Practical Summary Table

Study Type | Impute Y? | Use Y to Impute X? | Key Reason
Clinical prediction models | ❌ No | ❌ No | Prevent data leakage
Machine learning prediction | ❌ No | ❌ No | Real-world realism
Causal/etiologic studies | ✅ Sometimes | ✅ Sometimes | Reduce bias
Therapeutic effect estimation | ✅ Sometimes | ✅ Sometimes | Causal validity
Longitudinal outcomes | ✅ Yes (with rules) | ✅ Yes | Temporal structure

7. Common Mistakes to Avoid

  • Mixing prediction and causal goals in the same model

  • Reporting high performance without external validation

  • Using “standard” multiple imputation (MI) pipelines without checking that they fit the research goal

  • Treating imputation as a purely statistical task

Key Insight

Imputation is part of study design, not just data cleaning.

Good imputation cannot fix a poorly defined research question.

Final Takeaways

  • Always decide first whether the goal is prediction or causal inference

  • Prediction models must reflect real-world data availability

  • Using Y incorrectly causes data leakage and false confidence

  • Causal inference allows broader imputation strategies

  • Longitudinal data require respect for time

