Choosing the Right Model for Correlated Data: A Step-by-Step Guide [Repeated Measures, Multilevel Modeling, Mixed Effects, GEE Model, Random Effects, Robust SE, Longitudinal Data, Clustered Data]
- Mayta
- Jul 2
Updated: Jul 7
STEP 1: 🔍 Describe Your Data Structure
Ask:
Are measurements repeated over time on the same subject?
Are observations nested in groups (e.g., patients in clinics, students in classes)?
Are there multiple levels of clustering (e.g., patients → doctors → hospitals)?
Is time spacing regular (every 3 months) or irregular?
STEP 2: 🧩 Identify the Correlation Source
Data Type | What Makes It Correlated? | Example |
Repeated measures | Same subject over time | BP at 1, 3, 6 months |
Clustered data | Subjects in same unit | Patients in ward |
Hierarchical data | Nested clusters | Patient → doctor → hospital |
Paired/matched data | Shared factors | Left eye vs right eye |
STEP 3: 🎯 Choose the Analytical Goal
Goal | Use Model That Estimates... | Best For |
Population-level average | Marginal model (GEE) | Guidelines, public health |
Subject-specific trends | Conditional model (Mixed Effects) | Individual prediction, growth curves |
Group-specific effects | Multilevel model | School/ward/hospital policy |
Only need corrected SEs | Empirical variance correction | Regression w/ cluster SEs |
STEP 4: 📊 Map Your Scenario to the Model
Scenario | Best Model | Correlation Correction |
Single-level repeated measures (time) | GEE or Mixed Effects | AR(1) or Exchangeable |
Nested groups (patients in clinics) | Multilevel Mixed Model | Random Intercepts |
Grouped + repeated (patients in hospitals over time) | Mixed Model with Random Intercepts + Slopes | Hierarchical + Time correlation |
No strong structure but correlated outcomes | GEE with Robust SE | Empirical correction |
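Many subjects, few fixed visit times, unknown pattern | Unstructured correlation | Data-defined estimation |
Very small sample or irregular visits | Mixed Effects with time as a continuous covariate | Random Intercepts (+ Slopes if the data allow) |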
STEP 5: 🧪 Select the Variance & Correlation Strategy
Component | Option | When to Use |
Variance Estimation | Empirical (Robust SE) | When model fit isn’t your focus, but SE correction is critical |
Variance Estimation | Model-based | When you trust your model structure & correlation |
Correlation Structure | Exchangeable | Equal correlation (common in clusters) |
Correlation Structure | AR(1) | Time-spaced data where closer = more correlated |
Correlation Structure | Unstructured | Large sample, unknown pattern |
Random Effect Specification | Random Intercepts | Different starting levels |
Random Effect Specification | Random Slopes | Different trends over time or treatment |
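To make the two most common working correlation structures concrete, here is a minimal Python sketch that builds them by hand. The function names and the value 0.6 are illustrative assumptions, not output from any fitted model.

```python
def exchangeable(n_obs, rho):
    """Every pair of observations in a cluster shares the same correlation."""
    return [[1.0 if i == j else rho for j in range(n_obs)] for i in range(n_obs)]


def ar1(n_obs, rho):
    """Correlation decays as rho**lag, so closer visits are more alike."""
    return [[rho ** abs(i - j) for j in range(n_obs)] for i in range(n_obs)]


# Print both structures for 4 observations and an assumed rho of 0.6.
for name, build in (("Exchangeable", exchangeable), ("AR(1)", ar1)):
    print(name)
    for row in build(4, 0.6):
        print("  ", [round(v, 3) for v in row])
```

Each structure costs a single correlation parameter. What differs is the assumption: a flat correlation within the cluster versus one that fades with the number of visits between two measurements.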
🔑 SECRET INSIGHT BOX
GEE = population average, no prediction per patient
Mixed-effects = subject-level prediction
Random slope ≠ time interaction: it reflects natural heterogeneity in trends
Always visualize trajectories before choosing slope models
🧭 Decision Map Summary
IF repeated measures → Is it over time?
├─ Yes: AR(1) or Random Slopes
└─ No: Exchangeable or Random Intercepts
IF nested in groups:
├─ Just one level? → Random Intercepts
└─ Multiple levels? → Multilevel model with nested random effects
IF unsure of pattern → Try Unstructured (if sample size supports it)
1 Describe Your Data Structure 🔍
Deep‑dive
Ask four questions before touching software:
Repeated over time? Same person measured >1 time ⇒ temporal correlation.
Nested groups? Patients in wards, eyes within patients, etc. ⇒ clustering.
Multiple levels? Patient → Doctor → Hospital ⇒ hierarchical nesting.
Timing pattern? Equally spaced visits favour AR(1); irregular timing often requires mixed‑effects with random slopes.
Why it matters – each “Yes” adds parameters to the covariance matrix. Getting the specification wrong can inflate the Type I error rate or waste power.
Plain‑speak cheat
Write down “TIME?” and “NESTING?” first; the answers drive everything that follows.
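The four questions can be answered mechanically from a long-format dataset. The sketch below is one hedged way to do it in plain Python. The tuple layout `(subject_id, cluster_id, time)`, the function name and the toy rows are assumptions made for illustration.

```python
from collections import defaultdict


def describe_structure(rows):
    """rows: iterable of (subject_id, cluster_id, time); cluster_id may be None."""
    times = defaultdict(list)
    cluster_members = defaultdict(set)
    for subject, cluster, time in rows:
        times[subject].append(time)
        if cluster is not None:
            cluster_members[cluster].add(subject)

    repeated = any(len(t) > 1 for t in times.values())

    # Collect the gaps between consecutive visits across all subjects.
    gaps = set()
    for visits in times.values():
        visits = sorted(visits)
        gaps.update(round(b - a, 6) for a, b in zip(visits, visits[1:]))

    return {
        "repeated_over_time": repeated,
        # Nesting matters once some cluster holds more than one subject.
        "nested_in_groups": any(len(m) > 1 for m in cluster_members.values()),
        # Regular spacing means one common gap between consecutive visits.
        "regular_spacing": repeated and len(gaps) == 1,
    }


# Hypothetical data: three patients, two clinics, visits at 0, 3 and 6 months.
toy_rows = [(p, c, t) for p, c in (("p1", "c1"), ("p2", "c1"), ("p3", "c2"))
            for t in (0, 3, 6)]
print(describe_structure(toy_rows))
```

Visits at 1, 3 and 6 months would come back with `regular_spacing` set to False, because the gaps (2 and 3 months) differ.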
2 Identify the Correlation Source 🧩
Deep‑dive
Data type | Mechanism | Typical example | Quick diagnostic plot |
Repeated measures | Within‑subject memory | BP at 1, 3, 6 months | Spaghetti plot of each patient |
Clustered | Shared care environment | Patients in same ICU | Box‑and‑whisker by cluster |
Hierarchical | Stacked clustering | Patient → Doctor | Variance‐components plot |
Paired | Biological pairing | Left vs Right eye | Bland‑Altman / scatter of pair |
Plain‑speak cheat
If two rows share a patient‑ID, clinic‑ID, or visit‑date, they’re probably correlated.
3 Clarify Your Analytical Goal 🎯
Deep‑dive
Goal | Key question | Proper model family | Interpretation focus |
Population average | “What is the mean effect across everyone?” | GEE / Marginal | Guidelines & policy |
Subject‑specific | “How does this patient change?” | Mixed‑effects (conditional) | Precision medicine |
Group effects | “Do hospitals differ?” | Multilevel mixed | Quality benchmarking |
Just robust SEs | “I only need standard errors that account for clustering.” | OLS/GLM + Sandwich | Simple regressions |
Plain‑speak cheat
Pick GEE for public‑health answers; mixed‑effects for patient‑level answers.
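The gap between a population-average and a subject-specific answer is easiest to see with a binary outcome. The sketch below assumes a random-intercept logistic model with made-up coefficients. It averages the subject-specific probabilities over the random intercept by simple numerical integration, then reads off the marginal log odds ratio. All names and numbers are hypothetical.

```python
import math


def expit(z):
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    return math.log(p / (1.0 - p))


def marginal_prob(linear_predictor, sd_intercept, n_grid=2001, width=8.0):
    """Average expit(linear_predictor + u) over u ~ Normal(0, sd_intercept**2)."""
    if sd_intercept == 0:
        return expit(linear_predictor)
    step = 2 * width * sd_intercept / (n_grid - 1)
    total = weight_sum = 0.0
    for k in range(n_grid):
        u = -width * sd_intercept + k * step
        weight = math.exp(-0.5 * (u / sd_intercept) ** 2)  # unnormalised density
        total += weight * expit(linear_predictor + u)
        weight_sum += weight
    return total / weight_sum


# Assumed subject-specific (conditional) model: logit P = beta0 + beta1 * x + u_i
beta0, beta1, sd_u = -1.0, 1.0, 2.0

marginal_log_or = (logit(marginal_prob(beta0 + beta1, sd_u))
                   - logit(marginal_prob(beta0, sd_u)))
print("conditional log OR:", beta1)
print("marginal log OR   :", round(marginal_log_or, 3))
```

The marginal log odds ratio comes out smaller than the conditional one, which is why GEE and mixed-effects coefficients from the same binary data need not match. With an identity link (ordinary linear models) the two interpretations coincide.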
4 Map Scenario → Model 📊
Deep‑dive
Scenario | Best model | Working correlation / random structure | R / Stata hint |
6 time‑points per patient, no sites | GEE | AR(1) or exchangeable | geeglm(..., corstr="ar1") |
Patients clustered in 12 clinics, one outcome | Mixed (random intercept) | Random intercept only | `lmer(Y ~ X + (1 |
Patients in 12 clinics, 5 visits each | Mixed (int + slope) | Random int + slope by patient; random int by clinic | `lmer(Y ~ time + (time |
4000 patients, 2 exams each, need quick answer | GLM + cluster‑robust SE | Sandwich (empirical) | glm(...); sandwich() |
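30 patients, 10 uneven visits | Mixed (int + slope, time as continuous); unstructured only if visits share common time points and the fit converges | Random int + slope by patient; UN needs 45 correlations at 10 visits | lme(..., correlation=corSymm()) for UN |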
Plain‑speak cheat
One level = random intercept; time + one level = add random slope; many levels = stack random intercepts.
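For the cluster-robust row in the table above, this sketch shows what the sandwich correction does in the simplest possible case: an intercept-only regression, i.e. estimating a mean. It is plain Python rather than the R `sandwich` package. The function name, the toy data and the lack of any small-sample adjustment are assumptions made for illustration.

```python
import math
from collections import defaultdict


def naive_and_cluster_se(values, cluster_ids):
    """Return (mean, naive SE, cluster-robust SE) for an intercept-only model."""
    n = len(values)
    mean = sum(values) / n
    residuals = [v - mean for v in values]

    # Naive SE: treats all n rows as independent.
    naive_var = sum(r * r for r in residuals) / (n - 1) / n

    # Sandwich "meat": sum residuals within each cluster, then square.
    cluster_sums = defaultdict(float)
    for r, g in zip(residuals, cluster_ids):
        cluster_sums[g] += r
    robust_var = sum(s * s for s in cluster_sums.values()) / n ** 2

    return mean, math.sqrt(naive_var), math.sqrt(robust_var)


# Hypothetical outcome values for 4 patients in each of 3 clinics.
y = [10, 11, 10, 11, 20, 21, 20, 21, 30, 31, 30, 31]
clinic = ["A"] * 4 + ["B"] * 4 + ["C"] * 4

mean, naive_se, robust_se = naive_and_cluster_se(y, clinic)
print(f"mean={mean:.2f}  naive SE={naive_se:.2f}  cluster-robust SE={robust_se:.2f}")
```

Because patients within a clinic resemble each other, the naive SE is too small and the cluster-robust SE is larger. With only three clusters the robust estimate is itself unreliable, so treat the numbers as a demonstration of the mechanics only.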
5 Pick Variance & Correlation Strategy 🧪
Deep‑dive
Choice point | Options | Use when | Caveat |
Variance estimator | Empirical (robust) | Large N, misspecified corr. OK | SEs biased downward with few clusters (roughly < 30) |
Variance estimator | Model‑based | Correlation well‑specified | Sensitive to wrong pattern |
Correlation pattern | Exchangeable | All pairs equally related (wards) | Over‑simplifies time data |
Correlation pattern | AR(1) | Equal spacing & decay | Fails if visits irregular |
Correlation pattern | Unstructured | Plenty of rows/cluster | Parameter‑hungry |
Random effect | Intercept | Baseline shifts only | Assumes parallel trajectories |
Random effect | Intercept + Slope | Heterogeneous change rates | Needs ≥3 time points/subject |
Plain‑speak cheat
Short panels → AR(1). Big uncertain panels → UN. Parallel lines? random‑intercept; diverging lines? random‑slope.
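A quick count shows why the unstructured pattern is called parameter-hungry. The helper below is a hypothetical illustration, and it counts correlation parameters only, not variances.

```python
def n_correlation_params(structure, n_timepoints):
    """Number of correlation parameters for a given working structure."""
    if structure in ("exchangeable", "ar1"):
        return 1  # one shared rho, however many visits there are
    if structure == "unstructured":
        # one free correlation for every distinct pair of time points
        return n_timepoints * (n_timepoints - 1) // 2
    raise ValueError(f"unknown structure: {structure}")


for t in (3, 5, 10):
    counts = {s: n_correlation_params(s, t)
              for s in ("exchangeable", "ar1", "unstructured")}
    print(t, "time points:", counts)
```

The unstructured count grows with the square of the number of visits, so it only pays off when there are many subjects relative to time points.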
6 Reality Checks & Visuals 🔑
Spaghetti plot before modelling – are patient lines parallel?
Intraclass correlation (ICC) – if ≈0, maybe clustering is harmless.
Residual autocorrelation (ACF) plot – checks whether the AR(1) decay is plausible.
Plain‑speak cheat
Plot first; if lines aren’t parallel your model shouldn’t be either.
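The ICC check can be done by hand for balanced clusters. The sketch below uses the one-way ANOVA estimator and the usual design effect, 1 + (m − 1) × ICC for clusters of size m. The function names and toy data are assumptions, and unbalanced clusters would need a different estimator.

```python
def icc_oneway(groups):
    """One-way ANOVA ICC for equal-sized clusters (list of lists of outcomes)."""
    k = len(groups)
    m = len(groups[0])
    if any(len(g) != m for g in groups):
        raise ValueError("this sketch assumes equal cluster sizes")
    grand_mean = sum(sum(g) for g in groups) / (k * m)
    means = [sum(g) / m for g in groups]
    # Between-cluster and within-cluster mean squares.
    msb = m * sum((mu - grand_mean) ** 2 for mu in means) / (k - 1)
    msw = sum((v - mu) ** 2 for g, mu in zip(groups, means) for v in g) / (k * (m - 1))
    return (msb - msw) / (msb + (m - 1) * msw)


def design_effect(icc, cluster_size):
    """Variance inflation relative to independent observations."""
    return 1 + (cluster_size - 1) * icc


# Hypothetical outcomes: 4 patients in each of 3 wards.
wards = [[10, 11, 10, 11], [20, 21, 20, 21], [30, 31, 30, 31]]
icc = icc_oneway(wards)
print("ICC:", round(icc, 3))
print("design effect at ICC = 0.05, 21 per cluster:", design_effect(0.05, 21))
```

An ICC near zero is only harmless when clusters are small. With 21 subjects per cluster, an ICC of 0.05 already doubles the variance of a mean.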
Decision Flow (ASCII) 🧭
Start
├─ Repeated over time?
│ ├─ Yes → Equal spacing?
│ │ ├─ Yes → AR(1) or Random Slopes
│ │ └─ No → Random Slopes + Time-as-variable
│ └─ No → Clustered?
│ ├─ One level → Random Intercepts
│ └─ ≥Two levels → Hierarchical Mixed
└─ Unsure pattern → Try Unstructured (data permitting)
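The flow above can be written as a small function, which makes the order of the questions explicit. This is a hedged sketch: the function and argument names are invented for illustration, and the final fallback is not part of the original flow.

```python
def suggest_model(repeated_over_time, equal_spacing=False,
                  cluster_levels=0, unsure_pattern=False):
    """Return the branch of the decision flow that matches the answers."""
    if unsure_pattern:
        return "Unstructured (data permitting)"
    if repeated_over_time:
        if equal_spacing:
            return "AR(1) or random slopes"
        return "Random slopes + time as a variable"
    if cluster_levels == 1:
        return "Random intercepts"
    if cluster_levels >= 2:
        return "Hierarchical mixed model"
    # Not a branch of the flow: added so the function always returns a value.
    return "No correlation source identified"


print(suggest_model(repeated_over_time=True, equal_spacing=True))
print(suggest_model(repeated_over_time=False, cluster_levels=2))
```

Like the diagram, the function checks time first and clustering second. Data that are both repeated and nested need the combined random-effects structure rather than a single branch.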
Worked Micro‑Example
Design: 150 COPD patients across 7 clinics, FEV1 at baseline, 3 m, 6 m.
Goal: Predict individual recovery curves.
Structure → repeated + nested (patients in clinics).
Goal → subject‑specific.
Model → Mixed‑effects with random intercept (clinic) & intercept + slope (patient).
Spec (R):
lmer(FEV1 ~ time + (time|patient_id) + (1|clinic_id), data = copd)
Working correlation → implied by the random effects; with only three visits an extra AR(1) term is usually unnecessary, because the random slope already lets within‑patient correlation change with time.
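To see what "implied by the random effects" means, the sketch below computes the within-patient covariance that a random-intercept, random-slope model induces at baseline, 3 and 6 months. The variance components are made-up numbers, not estimates from any COPD data, and the function names are hypothetical.

```python
import math


def implied_covariance(times, var_intercept, var_slope, cov_int_slope, var_residual):
    """Covariance of one patient's measurements: Z G Z' + sigma^2 I, Z rows (1, t)."""
    cov = []
    for i, s in enumerate(times):
        row = []
        for j, t in enumerate(times):
            value = var_intercept + (s + t) * cov_int_slope + s * t * var_slope
            if i == j:
                value += var_residual  # measurement noise sits on the diagonal only
            row.append(value)
        cov.append(row)
    return cov


def to_correlation(cov):
    sd = [math.sqrt(cov[i][i]) for i in range(len(cov))]
    return [[cov[i][j] / (sd[i] * sd[j]) for j in range(len(cov))]
            for i in range(len(cov))]


visit_months = [0, 3, 6]
cov = implied_covariance(visit_months, var_intercept=0.25, var_slope=0.0025,
                         cov_int_slope=0.0, var_residual=0.04)
corr = to_correlation(cov)
for row in corr:
    print([round(v, 3) for v in row])
```

The resulting correlations are neither exchangeable nor AR(1): they differ by pair and depend on the actual visit times, which is the structure the `(time|patient_id)` term contributes.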
Quick Reference Card 📋 (TL;DR)
Ask | If “Yes” → | Model |
Time repeated? | Equal gaps | GEE/Mixed + AR(1) |
Time repeated? | Unequal gaps | Mixed + Random Slopes |
Single clustering level? | — | Mixed + Random Intercept |
Multiple levels? | — | Multilevel Mixed |
Need only robust SE? | — | Sandwich |