Imputation in Clinical Research: Prediction vs Causal Thinking Explained
- Mayta

A Practical Guide to Prediction, Causal Inference, and Longitudinal Data
1. Why Imputation Needs Careful Thinking
Missing data occur in almost all clinical datasets:
Patients miss visits
Laboratory tests are not ordered
Follow-up is incomplete
Imputation is used to replace missing values so that:
Statistical power is preserved
Bias from complete-case analysis is reduced
However, the way imputation is done also affects:
Validity of estimates
Model performance
Clinical usefulness of results
The same imputation strategy can be correct in one study and wrong in another.
2. First Rule: Define Your Research Goal
Before choosing any imputation method, you must clearly define:
What question am I trying to answer?
There are two fundamentally different goals:
Prediction → “Can we predict an outcome for a future patient?”
Causal inference → “What is the true effect of an exposure or treatment?”
These goals follow different logic and require different imputation rules.
3. Prediction Models (CPMs): Why Y Must Not Be Imputed
3.1 What Is a Prediction Model?
Clinical Prediction Models (CPMs) aim to:
Estimate the probability of an outcome (Y)
Using information available before Y occurs
Examples:
Mortality risk score
ICU admission prediction
Disease risk calculators
Machine-learning classifiers
In real clinical use:
Y is unknown
Only predictors (X) are available
3.2 Do Not Impute the Outcome (Y)
In prediction studies:
❌ Never impute Y
❌ Never use Y to help impute X
This rule is absolute.
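As a minimal sketch of what this rule can look like in code, the example below fits a simple mean imputer on the predictors only and then applies it to a new patient whose outcome is not yet known. Mean imputation is used here only because it is short. The function names (`fit_mean_imputer`, `apply_imputer`) and the variables `age`, `lactate`, and `died` are hypothetical and do not come from any specific library or dataset.

```python
from statistics import mean

def fit_mean_imputer(rows, predictors):
    """Learn one fill value per predictor from development data.

    Only the listed predictors are read; the outcome column is never touched.
    Missing values are represented as None.
    """
    return {
        p: mean(r[p] for r in rows if r[p] is not None)
        for p in predictors
    }

def apply_imputer(row, fill_values):
    """Fill missing predictors in one patient record using stored fill values."""
    filled = dict(row)
    for p, value in fill_values.items():
        if filled.get(p) is None:
            filled[p] = value
    return filled

# Hypothetical development data: two predictors and an outcome (died).
development = [
    {"age": 60, "lactate": 2.0, "died": 0},
    {"age": 70, "lactate": None, "died": 1},
    {"age": None, "lactate": 4.0, "died": 1},
]

fill_values = fit_mean_imputer(development, predictors=["age", "lactate"])

# A future patient: the outcome does not exist yet, so it cannot be used.
new_patient = {"age": 80, "lactate": None}
print(apply_imputer(new_patient, fill_values))
```

Because the fill values are learned from X alone, the same step can be repeated unchanged at the bedside, where Y is unavailable.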
3.3 Data Leakage Explained Simply
Data leakage happens when information that would not be available in real life is used during model development.
If Y is used during imputation:
The model indirectly “sees the answer”
The training process becomes unrealistic
This creates a false sense of accuracy.
Think of it like this:
It is like giving students the exam answers and then claiming they are excellent test-takers.
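A toy illustration of how the leak arises, under stated assumptions: the sketch below fills a missing predictor with the mean observed among patients who had the same outcome. The function name `impute_using_outcome`, the variables `lactate` and `died`, and the six records are all made up for illustration.

```python
from statistics import mean

def impute_using_outcome(rows, predictor, outcome):
    """Leaky approach: fill a missing predictor with the mean observed
    among patients who had the same outcome."""
    group_means = {
        y: mean(
            r[predictor]
            for r in rows
            if r[outcome] == y and r[predictor] is not None
        )
        for y in {r[outcome] for r in rows}
    }
    return [
        {**r, predictor: group_means[r[outcome]]} if r[predictor] is None else dict(r)
        for r in rows
    ]

# Hypothetical records; None marks a missing lactate value.
records = [
    {"lactate": 6.0, "died": 1},
    {"lactate": 8.0, "died": 1},
    {"lactate": None, "died": 1},
    {"lactate": 1.0, "died": 0},
    {"lactate": 3.0, "died": 0},
    {"lactate": None, "died": 0},
]

for r in impute_using_outcome(records, "lactate", "died"):
    print(r)
```

The two patients with missing lactate receive different values purely because their outcomes differ, so the filled-in predictor now encodes the answer. For a new patient the outcome is unknown, and this step cannot be repeated.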
3.4 Why Performance Becomes Overestimated
Using Y in imputation causes:
Artificially high AUROC
Excellent calibration in development data
Strong internal validation results
But when applied to new patients:
Performance drops sharply
External validation fails
Clinical trust is lost
This is one of the most common reasons prediction models fail in practice.
3.5 Conceptual Error: Prediction vs Explanation
Prediction asks:
“Given what we know now, what will happen?”
Using Y during imputation answers a different question:
“Given the outcome, can we reconstruct the predictors?”
That is a retrospective explanation, not prediction.
4. Causal Inference: Why the Rules Are Different
4.1 What Is Causal Analysis?
Causal (etiologic or therapeutic) research asks:
“What would have happened if the exposure or treatment were different?”
The goal is to estimate:
Risk difference
Risk ratio
Hazard ratio
Causal effect
Here, the focus is:
Bias control
Valid estimation of the true effect
Predictive performance is not the primary concern.
4.2 Using Y in Imputation Can Be Acceptable
In causal analysis, using Y in multiple imputation can be appropriate if done carefully.
Why?
Because:
Y is part of the data-generating process
Excluding Y can bias covariate imputation
The goal is not future prediction
When missingness is Missing At Random (MAR):
Including Y in the imputation model preserves the relationship between the covariates and the outcome
Bias is reduced
Estimates move closer to the causal truth
This aligns with:
Causal diagrams (DAGs)
Counterfactual reasoning
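One way to picture an imputation model that includes Y is a within-cell hot-deck draw: a missing confounder is filled with an observed value borrowed from a patient who had the same exposure and the same outcome, and the draw is repeated to give several completed datasets. This is a deliberately simplified sketch, not a full multiple imputation procedure. The function name `hot_deck_impute` and the variables `treated`, `event`, and `bmi` are hypothetical.

```python
import random
from collections import defaultdict

def hot_deck_impute(rows, target, strata, m=5, seed=0):
    """Return m completed copies of rows.

    Missing `target` values (None) are filled by drawing an observed value
    from the same stratum. Assumes every stratum that contains a missing
    value also contains at least one observed donor.
    """
    rng = random.Random(seed)

    # Collect observed donor values per stratum (here: exposure x outcome).
    donors = defaultdict(list)
    for r in rows:
        if r[target] is not None:
            donors[tuple(r[s] for s in strata)].append(r[target])

    completed = []
    for _ in range(m):
        dataset = []
        for r in rows:
            if r[target] is None:
                key = tuple(r[s] for s in strata)
                r = {**r, target: rng.choice(donors[key])}
            dataset.append(r)
        completed.append(dataset)
    return completed

# Hypothetical cohort: exposure (treated), outcome (event), confounder (bmi).
cohort = [
    {"treated": 1, "event": 1, "bmi": 31.0},
    {"treated": 1, "event": 1, "bmi": 33.0},
    {"treated": 0, "event": 0, "bmi": 24.0},
    {"treated": 0, "event": 0, "bmi": 26.0},
    {"treated": 1, "event": 1, "bmi": None},
    {"treated": 0, "event": 0, "bmi": None},
]

# The outcome is part of the strata, so it informs the imputed confounder.
datasets = hot_deck_impute(cohort, "bmi", strata=["treated", "event"], m=3)
```

In a real analysis the effect estimate would be computed in each completed dataset and the results combined; that step is left out here.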
4.3 Key Difference in Mindset
Aspect | Prediction | Causal Inference |
Goal | Future accuracy | Truth of effect |
Use of Y | Forbidden | Often acceptable |
Performance metrics | Central | Secondary |
Bias control | Limited | Primary focus |
5. Longitudinal Data: Why Y Can Be Imputed
5.1 What Makes Longitudinal Data Special?
Longitudinal data include:
Repeated measurements over time
Within-patient correlation
Natural temporal ordering
Examples:
HbA1c every 3 months
Blood pressure across visits
Functional scores during follow-up
Missingness often occurs:
Between visits
Due to missed appointments or delayed tests
5.2 Imputing Y Is Often Reasonable
In longitudinal settings:
Past values of Y are strong predictors of future Y
The outcome is not a single final event
The model reflects biological continuity
Therefore: ✅ Imputing missing Y values is often appropriate
5.3 The Critical Rule: Respect Time Order
You may use:
Baseline values
Past measurements
Concurrent covariates
You must NOT use:
Future outcomes
Information that occurs after the missing time point
Violating time order creates temporal leakage, which is as harmful as data leakage.
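The time-order rule can be sketched with last observation carried forward, chosen here only because it makes the rule easy to see, not because it is a recommended default. The function name `carry_forward` and the fields `month` and `hba1c` are hypothetical.

```python
def carry_forward(visits):
    """Fill a missing HbA1c with the most recent earlier measurement.

    Visits are processed in time order, so a missing value can only be
    filled from the past. If no earlier measurement exists, the value
    stays missing rather than being borrowed from a later visit.
    """
    ordered = sorted(visits, key=lambda v: v["month"])
    last_seen = None
    filled = []
    for v in ordered:
        value = v["hba1c"]
        if value is None:
            value = last_seen
        else:
            last_seen = value
        filled.append({**v, "hba1c": value})
    return filled

# Hypothetical patient with HbA1c scheduled every 3 months.
visits = [
    {"month": 0, "hba1c": 8.1},
    {"month": 3, "hba1c": None},
    {"month": 6, "hba1c": 7.4},
    {"month": 9, "hba1c": None},
]

for v in carry_forward(visits):
    print(v["month"], v["hba1c"])
```

The month 3 gap is filled from month 0, never from month 6, which is what keeps the imputation free of temporal leakage.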
6. Practical Summary Table
Study Type | Impute Y? | Use Y to Impute X? | Key Reason |
Clinical prediction models | ❌ No | ❌ No | Prevent data leakage |
Machine learning prediction | ❌ No | ❌ No | Real-world realism |
Causal/etiologic studies | ✅ Sometimes | ✅ Sometimes | Reduce bias |
Therapeutic effect estimation | ✅ Sometimes | ✅ Sometimes | Causal validity |
Longitudinal outcomes | ✅ Yes (with rules) | ✅ Yes | Temporal structure |
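The table can also be written as a small guard that decides which variables may inform an imputation model. This is a rough sketch of the table's logic only: the function `imputation_inputs` and its study-type labels are hypothetical, and the "Sometimes" entries still require judgment that code cannot supply.

```python
def imputation_inputs(study_type, target, covariates, outcome):
    """Return the variables that may inform imputation of `target`."""
    if study_type == "prediction":
        if target == outcome:
            raise ValueError("prediction: the outcome must not be imputed")
        # The outcome is excluded so that it cannot leak into the predictors.
        return [c for c in covariates if c != target]
    if study_type in ("causal", "longitudinal"):
        # The outcome may be included. For longitudinal data this means
        # past outcome values only, never future ones.
        return [v for v in [*covariates, outcome] if v != target]
    raise ValueError(f"unknown study type: {study_type!r}")

# Hypothetical variable names.
print(imputation_inputs("prediction", "bmi", ["age", "bmi"], "death"))
print(imputation_inputs("causal", "bmi", ["age", "bmi"], "death"))
```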
7. Common Mistakes to Avoid
Mixing prediction and causal goals in the same model
Reporting high performance without external validation
Running a “standard” multiple imputation (MI) pipeline without checking that it fits the research goal
Treating imputation as a purely statistical task
Key Insight
Imputation is part of study design, not just data cleaning.
Good imputation cannot fix a poorly defined research question.
Final Takeaways
Always decide first whether the goal is prediction or causal inference
Prediction models must reflect real-world data availability
Using Y incorrectly causes data leakage and false confidence
Causal inference allows broader imputation strategies
Longitudinal data require respect for time




