Multiple Imputation in Clinical Research: Univariate and Multivariate Approaches [Multiple imputation, MI]
- Mayta
- May 31
Introduction
Missing data is a recurrent challenge in clinical and epidemiological studies. Multiple Imputation (MI) provides a statistically rigorous approach to restoring analytical completeness while accounting for uncertainty. As datasets grow more complex, with multiple variables missing and interdependent, understanding how to perform univariate and multivariate imputations becomes essential.
This article provides a step-by-step overview of how Multiple Imputation (MI) is performed under both simple and complex patterns of missingness, with a focus on practical implementation and decision-making logic.
Multiple Imputation: A Recap of the Process
MI is not a one-time fill-in-the-blank operation. It consists of three critical phases:
Imputation: Generate several plausible datasets where missing values are filled using model-based predictions and injected variability.
Analysis: Perform the desired statistical model on each imputed dataset independently.
Pooling: Combine results across datasets to produce a final estimate that reflects both within- and between-imputation variability.
This ensures that imputed values are not treated as known truths but as approximations that respect statistical uncertainty.
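The pooling phase is conventionally carried out with Rubin's rules, which the text above describes without naming. The sketch below is a minimal illustration, not a replacement for a statistical package: the function name pool_rubin and the three estimates and variances are hypothetical values chosen only to make the arithmetic easy to follow.

```python
import math
import statistics


def pool_rubin(estimates, variances):
    """Pool m point estimates and their squared standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two imputations and matching lengths")
    pooled = statistics.fmean(estimates)        # pooled point estimate
    within = statistics.fmean(variances)        # mean within-imputation variance
    between = statistics.variance(estimates)    # between-imputation variance (m - 1 denominator)
    total = within + (1 + 1 / m) * between      # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Hypothetical results from m = 3 imputed datasets (illustrative numbers only)
estimates = [1.0, 2.0, 3.0]   # e.g. a regression coefficient from each dataset
variances = [0.5, 0.5, 0.5]   # its squared standard error in each dataset

pooled, pooled_se = pool_rubin(estimates, variances)
print(pooled, pooled_se)
```

The between-imputation term is what separates MI from single imputation: if all imputed datasets gave the same estimate, that term would vanish and the pooled standard error would reduce to the ordinary within-dataset value.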
Univariate Imputation: One Variable, One Problem
When to Use
This approach is suitable when only one variable has missing data, and all others are fully observed.
How It Works
The missing variable (e.g., a biomarker) is modeled as a function of fully observed predictors.
The model predicts values for the missing entries, with added noise to account for uncertainty.
Example
Imagine a dataset where a continuous lab value is missing for a subset of patients. You might predict it using age, sex, and clinical scores—all of which are complete.
Imputation model: Missing_lab = f(age, sex, score)
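To make "prediction plus added noise" concrete, here is a simplified sketch of stochastic regression imputation. It is an illustration under stated assumptions: the data are invented, only age is used as a predictor (the example above also uses sex and score), and the helper names fit_line and impute_once are hypothetical. A full MI routine would additionally draw the regression coefficients from their estimated distribution before predicting, which this sketch omits.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual_sd)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    residual_sd = (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5
    return intercept, slope, residual_sd


def impute_once(age, lab, rng):
    """Return a copy of lab where each None is replaced by prediction + random noise."""
    observed = [(a, v) for a, v in zip(age, lab) if v is not None]
    intercept, slope, sd = fit_line([a for a, _ in observed], [v for _, v in observed])
    return [v if v is not None else intercept + slope * a + rng.gauss(0, sd)
            for a, v in zip(age, lab)]


# Hypothetical toy data: lab value missing (None) for two patients
age = [30, 40, 50, 60, 70, 80]
lab = [1.1, 1.9, None, 4.2, None, 5.8]

rng = random.Random(42)                                          # fixed seed for reproducibility
imputed_sets = [impute_once(age, lab, rng) for _ in range(5)]    # m = 5 imputed datasets
```

Because the noise is drawn afresh each time, the five datasets agree on every observed value and differ only in the imputed ones, which is exactly the variability the pooling step later measures.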
Implementation in Stata
The mi impute command family allows various model types:
mi impute regress: Linear regression for continuous variables
mi impute logit: Logistic regression for binary variables
Other forms include truncated regression, ordinal logistic, Poisson, and predictive mean matching.
Multivariate Imputation: More Variables, More Complexity
When to Use
Multivariate imputation is necessary when multiple variables are missing, often in overlapping and interdependent ways.
Key Distinction: Monotone vs Non-Monotone Patterns
Monotone Missingness
Definition: The variables can be ordered so that missingness is nested: if X3 is missing for a patient, then X4 and every later variable are missing too.
Example: In longitudinal studies, later follow-up data are often missing for patients lost to follow-up.
Method: Perform a sequence of univariate imputations. Begin with the most complete variable, then impute each subsequent variable using the variables before it (observed or already imputed) as predictors. No iteration is required.
Non-Monotone Missingness
Definition: The missingness pattern is arbitrary. Any variable may be missing regardless of whether the others are observed, so no ordering of the variables makes the pattern nested.
Example: A patient may have missing baseline lab data and missing outcome data, while other covariates are intact.
Challenge: Cannot rely on a simple univariate chain. Requires iterative modeling that accommodates feedback loops.
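The distinction between the two patterns can be checked mechanically: a pattern is monotone when the sets of rows missing each variable are nested once the variables are sorted from most to least complete. The sketch below shows one way to test this; the function name is_monotone and both toy datasets are hypothetical.

```python
def is_monotone(rows):
    """Check whether a missingness pattern is monotone.

    rows: list of dicts mapping variable name -> value, with None for missing.
    Returns (monotone, ordering), where ordering lists variables from most
    to least complete.
    """
    names = list(rows[0])
    # For each variable, the set of row indices where it is missing
    missing = {v: {i for i, r in enumerate(rows) if r[v] is None} for v in names}
    ordering = sorted(names, key=lambda v: len(missing[v]))
    # Monotone means each missing-set is contained in the next one
    nested = all(missing[a] <= missing[b] for a, b in zip(ordering, ordering[1:]))
    return nested, ordering


# Hypothetical dropout pattern: once a visit is missing, all later visits are too
dropout = [
    {"x1": 1, "x2": 2, "x3": 3},
    {"x1": 1, "x2": 2, "x3": None},
    {"x1": 1, "x2": None, "x3": None},
]

# Hypothetical arbitrary pattern: different variables missing in different rows
arbitrary = [
    {"x1": None, "x2": 2, "x3": 3},
    {"x1": 1, "x2": None, "x3": 3},
    {"x1": 1, "x2": 2, "x3": 3},
]

print(is_monotone(dropout))
print(is_monotone(arbitrary))
```

When the check succeeds, the returned ordering is also the order in which a sequential (monotone) imputation would process the variables.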
Advanced Solutions for Multivariate Missingness
To tackle the complexity of multivariate patterns—especially non-monotone—modern MI techniques rely on iterative algorithms.
1. Multiple Imputation by Chained Equations (MICE)
Also known as Fully Conditional Specification (FCS).
Models each variable with missing data conditionally, using all other variables as predictors.
Iteratively cycles through each variable, updating imputations with every pass.
2. Monotone Method
Only usable when missingness follows a clear sequence.
Faster, but less flexible.
3. Multivariate Normal Imputation (MVN)
Assumes that all variables follow a multivariate normal distribution.
Best used when this assumption approximately holds (e.g., many continuous variables).
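The iterative cycle described for MICE above can be sketched for two variables that are each missing in different rows. This is a deliberately simplified illustration: the data and the names fit_line, redraw, and chained_imputation are hypothetical, both conditional models are simple linear regressions, and regression coefficients are not redrawn at each step as a full implementation would do.

```python
import random
import statistics


def fit_line(x, y):
    """Ordinary least squares for y = a + b * x; returns (a, b, residual_sd)."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    return intercept, slope, (sum(r * r for r in residuals) / (len(x) - 2)) ** 0.5


def redraw(target, predictor, missing_idx, rng):
    """Re-impute target at missing_idx from a regression fitted where target is observed."""
    observed = [i for i in range(len(target)) if i not in missing_idx]
    a, b, sd = fit_line([predictor[i] for i in observed], [target[i] for i in observed])
    for i in missing_idx:
        target[i] = a + b * predictor[i] + rng.gauss(0, sd)


def chained_imputation(x, y, cycles=10, seed=0):
    """Return one completed (x, y) pair after a fixed number of chained cycles."""
    rng = random.Random(seed)
    miss_x = {i for i, v in enumerate(x) if v is None}
    miss_y = {i for i, v in enumerate(y) if v is None}
    # Starting values: fill each gap with the observed mean of its variable
    mean_x = statistics.fmean([v for v in x if v is not None])
    mean_y = statistics.fmean([v for v in y if v is not None])
    x = [mean_x if v is None else v for v in x]
    y = [mean_y if v is None else v for v in y]
    for _ in range(cycles):
        redraw(x, y, miss_x, rng)   # update x given the current y
        redraw(y, x, miss_y, rng)   # update y given the freshly updated x
    return x, y


# Hypothetical non-monotone toy data: x and y are missing in different rows
x_obs = [1.0, 2.0, None, 4.0, 5.0, 6.0, None, 8.0]
y_obs = [2.1, None, 6.2, 7.9, None, 12.1, 13.8, 16.2]

completed = [chained_imputation(x_obs, y_obs, seed=s) for s in range(5)]  # m = 5
```

Each variable's imputations feed into the other variable's model on the next pass, which is the feedback loop that a single sequential pass cannot provide for non-monotone data.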
Stata Commands
mi impute chained: For non-monotone data, most commonly used.
mi impute monotone: For sequential missing patterns.
mi impute mvn: For continuous data under multivariate normality.
Tip: If the pattern appears monotone but you have not verified it, mi impute chained is still a valid choice, because it does not require any particular missingness pattern.
Conclusion
Univariate and multivariate imputations represent two sides of the same methodological coin—simple vs complex—but both serve the goal of reducing bias and improving precision in the face of incomplete data. Mastering the distinctions between them, and applying the right tool based on your data’s missingness structure, is essential for reliable clinical inference.
Key Takeaways
Univariate MI is suitable when only one variable is incomplete. Regression-based or model-specific imputation is used.
Multivariate MI handles datasets with multiple incomplete variables. The appropriate method depends on whether the missingness pattern is monotone or non-monotone.
MICE (or FCS) is the most flexible and widely used method for complex missingness structures.
In Stata, mi impute chained is the safest general-purpose tool.
Use the pattern of missingness to choose between sequential univariate imputation (monotone patterns) and iterative chained models (non-monotone patterns).