Choosing the Right Model for Correlated Data: A Step-by-Step Guide [Repeated Measures, Multilevel Modeling, Mixed Effects, GEE Model, Random Effects, Robust SE, Longitudinal Data, Clustered Data]
- Mayta 
- Jul 2
Updated: Jul 7
STEP 1: 🔍 Describe Your Data Structure
Ask:
- Are measurements repeated over time on the same subject? 
- Are observations nested in groups (e.g., patients in clinics, students in classes)? 
- Are there multiple levels of clustering (e.g., patients → doctors → hospitals)? 
- Is time spacing regular (every 3 months) or irregular? 
STEP 2: 🧩 Identify the Correlation Source
| Data Type | What Makes It Correlated? | Example | 
| Repeated measures | Same subject over time | BP at 1, 3, 6 months | 
| Clustered data | Subjects in same unit | Patients in ward | 
| Hierarchical data | Nested clusters | Patient → doctor → hospital | 
| Paired/matched data | Shared factors | Left eye vs right eye | 
STEP 3: 🎯 Choose the Analytical Goal
| Goal | Use Model That Estimates... | Best For | 
| Population-level average | Marginal model (GEE) | Guidelines, public health | 
| Subject-specific trends | Conditional model (Mixed Effects) | Individual prediction, growth curves | 
| Group-specific effects | Multilevel model | School/ward/hospital policy | 
| Only need corrected SEs | Empirical variance correction | Regression w/ cluster SEs | 
STEP 4: 📊 Map Your Scenario to the Model
| Scenario | Best Model | Correlation Correction | 
| Single-level repeated measures (time) | GEE or Mixed Effects | AR(1) or Exchangeable | 
| Nested groups (patients in clinics) | Multilevel Mixed Model | Random Intercepts | 
| Grouped + repeated (patients in hospitals over time) | Mixed Model with Random Intercepts + Slopes | Hierarchical + Time correlation | 
| No strong structure but correlated outcomes | GEE with Robust SE | Empirical correction | 
| Unknown correlation pattern, few fixed time points, ample sample | Unstructured correlation | Data-defined estimation |
STEP 5: 🧪 Select the Variance & Correlation Strategy
| Component | Option | When to Use | 
| Variance Estimation | Empirical (Robust SE) | When model fit isn’t your focus, but SE correction is critical |
| Variance Estimation | Model-based | When you trust your model structure & correlation |
| Correlation Structure | Exchangeable | Equal correlation (common in clusters) |
| Correlation Structure | AR(1) | Time-spaced data where closer = more correlated |
| Correlation Structure | Unstructured | Large sample, unknown pattern |
| Random Effect Specification | Random Intercepts | Different starting levels |
| Random Effect Specification | Random Slopes | Different trends over time or treatment |
🔑 SECRET INSIGHT BOX
- GEE = population average, no prediction per patient 
- Mixed-effects = subject-level prediction 
- Random slope ≠ time interaction: it reflects natural heterogeneity in trends 
- Always visualize trajectories before choosing slope models 
🧭 Decision Map Summary
IF repeated measures → Is it over time?
     ├─ Yes: AR(1) or Random Slopes
     └─ No: Exchangeable or Random Intercepts
IF nested in groups:
     ├─ Just one level? → Random Intercepts
     └─ Multiple levels? → Multilevel model with nested random effects
IF unsure of pattern → Try Unstructured (if sample size supports it)
1 Describe Your Data Structure 🔍
Deep‑dive
Ask four questions before touching software:
- Repeated over time? Same person measured >1 time ⇒ temporal correlation. 
- Nested groups? Patients in wards, eyes within patients, etc. ⇒ clustering. 
- Multiple levels? Patient → Doctor → Hospital ⇒ hierarchical nesting. 
- Timing pattern? Equally spaced visits favour AR(1); irregular timing often requires mixed‑effects with random slopes. 
Why it matters: each “Yes” adds parameters to the covariance matrix. Ignoring or misspecifying that structure can inflate the Type I error rate or cost you power.
Plain‑speak cheat
Write down “TIME?” and “NESTING?” first; the answers drive everything that follows.
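The four questions can be checked directly on a long-format data set before any model is fitted. The base R sketch below uses a made-up toy table; the column names (patient_id, clinic_id, month) are assumptions for illustration, not names from any real study.

```r
# Toy long-format data; column names are illustrative assumptions.
visits <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 2, 3, 3),
  clinic_id  = c("A", "A", "A", "A", "A", "A", "B", "B"),
  month      = c(0, 3, 6, 0, 3, 6, 0, 5)
)

# TIME? More than one row per patient means repeated measures.
rows_per_patient <- table(visits$patient_id)
has_repeats <- any(rows_per_patient > 1)

# NESTING? Here: does every patient belong to exactly one clinic?
clinics_per_patient <- tapply(visits$clinic_id, visits$patient_id,
                              function(x) length(unique(x)))
is_nested <- all(clinics_per_patient == 1)

# Timing pattern: are the gaps between consecutive visits identical everywhere?
# (Assumes rows are already sorted by month within patient.)
gaps <- unlist(tapply(visits$month, visits$patient_id, diff))
equal_spacing <- length(unique(gaps)) == 1

print(c(repeated = has_repeats, nested = is_nested, equal_spacing = equal_spacing))
```

In this toy table patient 3 returns at month 5 rather than month 3, so the spacing check comes back FALSE even though the first two patients follow a regular schedule.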
2 Identify the Correlation Source 🧩
Deep‑dive
| Data type | Mechanism | Typical example | Quick diagnostic plot | 
| Repeated measures | Within‑subject memory | BP at 1, 3, 6 months | Spaghetti plot of each patient | 
| Clustered | Shared care environment | Patients in same ICU | Box‑and‑whisker by cluster | 
| Hierarchical | Stacked clustering | Patient → Doctor | Variance‐components plot | 
| Paired | Biological pairing | Left vs Right eye | Bland‑Altman / scatter of pair | 
Plain‑speak cheat
If two rows share a patient‑ID, clinic‑ID, or visit‑date, they’re probably correlated.
3 Clarify Your Analytical Goal 🎯
Deep‑dive
| Goal | Key question | Proper model family | Interpretation focus | 
| Population average | “What is the mean effect across everyone?” | GEE / Marginal | Guidelines & policy | 
| Subject‑specific | “How does this patient change?” | Mixed‑effects (conditional) | Precision medicine | 
| Group effects | “Do hospitals differ?” | Multilevel mixed | Quality benchmarking | 
| Just robust SEs | “My only worry is that clustering distorts my SEs.” | OLS/GLM + Sandwich | Simple regressions |
Plain‑speak cheat
Pick GEE for public‑health answers; mixed‑effects for patient‑level answers.
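For the “just robust SEs” row, it can help to see what the sandwich correction actually computes. The sketch below builds a basic cluster-robust variance by hand in base R on simulated data; the data-generating values are arbitrary assumptions, and no small-sample correction is applied.

```r
set.seed(1)
n_clusters <- 40
per_cluster <- 5
cluster <- rep(seq_len(n_clusters), each = per_cluster)
x <- rep(rnorm(n_clusters), each = per_cluster)   # cluster-level exposure
u <- rep(rnorm(n_clusters), each = per_cluster)   # shared cluster effect -> correlation
y <- 1 + 0.5 * x + u + rnorm(n_clusters * per_cluster)

fit <- lm(y ~ x)
X <- model.matrix(fit)
e <- residuals(fit)

# "Bread": (X'X)^-1
bread <- solve(crossprod(X))

# "Meat": sum over clusters g of (X_g' e_g)(X_g' e_g)'
scores <- rowsum(X * e, group = cluster)   # one row of X_g' e_g per cluster
meat <- crossprod(scores)

# Sandwich estimator without small-sample correction
vcov_cluster <- bread %*% meat %*% bread

se_naive   <- sqrt(diag(vcov(fit)))
se_cluster <- sqrt(diag(vcov_cluster))
print(rbind(naive = se_naive, cluster_robust = se_cluster))
```

The point estimates are untouched; only the standard errors change. In practice a packaged routine such as vcovCL() from the sandwich package is the more convenient route and offers small-sample adjustments.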
4 Map Scenario → Model 📊
Deep‑dive
| Scenario | Best model | Working correlation / random structure | R / Stata hint | 
| 6 time‑points per patient, no sites | GEE | AR(1) or exchangeable | geeglm(..., corstr="ar1") | 
| Patients clustered in 12 clinics, one outcome | Mixed (random intercept) | Random intercept only | `lmer(Y ~ X + (1 | 
| Patients in 12 clinics, 5 visits each | Mixed (int + slope) | Random int + slope by patient; random int by clinic | `lmer(Y ~ time + (time | 
| 4000 patients, 2 exams each, need quick answer | GLM + cluster‑robust SE | Sandwich (empirical) | glm(...); sandwich::vcovCL(fit, cluster = ~id) |
| 30 patients, 10 uneven visits | Unstructured mixed only if it converges (10 visits imply 45 correlations); otherwise random slopes with time as a variable | UN covariance | lme(..., correlation=corSymm()) |
Plain‑speak cheat
One level = random intercept; time + one level = add random slope; many levels = stack random intercepts.
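As a rough illustration of how the first three scenarios in the table might be specified, the sketch below simulates a small data set and fits each model. It assumes the lme4 and geepack packages are installed; every column name (Y, X, visit, patient_id, clinic_id) and every simulation value is a placeholder chosen for the example.

```r
library(lme4)      # lmer()
library(geepack)   # geeglm()

# Simulated long-format data: 120 patients, 5 visits each, 12 clinics.
set.seed(42)
d <- expand.grid(visit = 0:4, patient_id = 1:120)     # rows sorted by patient
d$clinic_id <- (d$patient_id - 1) %/% 10 + 1          # 10 patients per clinic
d$X <- rnorm(120)[d$patient_id]                       # patient-level covariate
b0 <- rnorm(120, sd = 1)[d$patient_id]                # patient-specific intercepts
b1 <- rnorm(120, sd = 0.3)[d$patient_id]              # patient-specific slopes
c0 <- rnorm(12, sd = 0.5)[d$clinic_id]                # clinic-specific intercepts
d$Y <- 10 + 0.2 * d$X + c0 + b0 + (0.5 + b1) * d$visit +
  rnorm(nrow(d), sd = 0.5)

# Repeated measures, no sites: marginal model with AR(1) working correlation.
# geeglm() expects each id's rows to be contiguous and in time order.
fit_gee <- geeglm(Y ~ visit, id = patient_id, data = d, corstr = "ar1")

# Patients in clinics, one outcome each: random intercept per clinic.
baseline <- subset(d, visit == 0)
fit_ri <- lmer(Y ~ X + (1 | clinic_id), data = baseline)

# Patients in clinics with repeated visits: random intercept + slope per
# patient, random intercept per clinic.
fit_rs <- lmer(Y ~ visit + (visit | patient_id) + (1 | clinic_id), data = d)

print(fixef(fit_rs))
```

The formulas are the part worth copying; the simulated numbers only exist so the calls run end to end.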
5 Pick Variance & Correlation Strategy 🧪
Deep‑dive
| Choice point | Options | Use when | Caveat | 
| Variance estimator | Empirical (robust) | Many clusters; tolerates a misspecified correlation | SEs tend to be too small with few clusters (roughly < 30) |
| Variance estimator | Model‑based | Correlation well‑specified | Sensitive to wrong pattern |
| Correlation pattern | Exchangeable | All pairs equally related (wards) | Over‑simplifies time data |
| Correlation pattern | AR(1) | Equal spacing & decay | Fails if visits irregular |
| Correlation pattern | Unstructured | Plenty of rows/cluster | Parameter‑hungry |
| Random effect | Intercept | Baseline shifts only | Assumes parallel trajectories |
| Random effect | Intercept + Slope | Heterogeneous change rates | Needs ≥3 time points/subject |
Plain‑speak cheat
Short panels → AR(1). Big uncertain panels → UN. Parallel lines? random‑intercept; diverging lines? random‑slope.
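The three correlation patterns are easier to compare once written out as matrices. The base R sketch below builds them for four visits with an illustrative correlation of 0.6 (an arbitrary value) and counts how many correlation parameters each pattern needs.

```r
n_times <- 4
rho <- 0.6   # illustrative within-subject correlation

# Exchangeable: every pair of visits shares the same correlation.
exch <- matrix(rho, n_times, n_times)
diag(exch) <- 1

# AR(1): correlation decays as rho^|lag| between visits j and k.
lag <- abs(outer(seq_len(n_times), seq_len(n_times), "-"))
ar1 <- rho^lag

# Unstructured: one free correlation for every pair of visits.
n_params <- c(exchangeable = 1,
              ar1          = 1,
              unstructured = n_times * (n_times - 1) / 2)

print(round(ar1, 3))
print(n_params)
```

With four visits the unstructured pattern needs 6 correlations; with ten visits it needs 45, which is why it is described as parameter-hungry.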
6 Reality Checks & Visuals 🔑
- Spaghetti plot before modelling – are patient lines parallel? 
- Intraclass correlation (ICC) – if ≈0, maybe clustering is harmless. 
- Residual autocorrelation (ACF) plot: checks whether an AR(1) choice is plausible. 
Plain‑speak cheat
Plot first; if lines aren’t parallel your model shouldn’t be either.
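The first two checks need nothing beyond base R. The sketch below simulates blood pressure for 30 patients at four visits (all values are made up), draws a spaghetti plot, and computes a one-way ANOVA ICC, which is a simple estimate that is valid for balanced data and ignores the time trend.

```r
set.seed(7)
n_patients <- 30
n_visits <- 4
months <- c(0, 3, 6, 9)
d <- data.frame(
  patient_id = rep(seq_len(n_patients), each = n_visits),
  month      = rep(months, times = n_patients)
)
intercepts <- rnorm(n_patients, mean = 120, sd = 10)   # patient baselines
slopes     <- rnorm(n_patients, mean = -1,  sd = 0.8)  # patient-specific trends
d$bp <- intercepts[d$patient_id] + slopes[d$patient_id] * d$month +
  rnorm(nrow(d), sd = 4)

# Spaghetti plot: one line per patient. Lines that fan out hint at random slopes.
wide <- matrix(d$bp, nrow = n_visits)   # rows = visits, columns = patients
matplot(months, wide, type = "l", lty = 1, col = "grey40",
        xlab = "Month", ylab = "Systolic BP")

# One-way ANOVA ICC: share of total variance that lies between patients.
aov_tab <- anova(lm(bp ~ factor(patient_id), data = d))
msb <- aov_tab[["Mean Sq"]][1]   # between-patient mean square
msw <- aov_tab[["Mean Sq"]][2]   # within-patient mean square
icc <- (msb - msw) / (msb + (n_visits - 1) * msw)
print(round(icc, 2))
```

An ICC from a fitted mixed model (between-group variance divided by total variance) is the more usual choice once covariates are involved; the ANOVA version here is only a quick screen.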
Decision Flow (ASCII) 🧭
Start
 ├─ Repeated over time?
 │     ├─ Yes → Equal spacing?
 │     │      ├─ Yes → AR(1) or Random Slopes
 │     │      └─ No  → Random Slopes + Time-as-variable
 │     └─ No → Clustered?
 │            ├─ One level → Random Intercepts
 │            └─ ≥Two levels → Hierarchical Mixed
 └─ Unsure pattern → Try Unstructured (data permitting)
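The flow above can also be written as a small helper, which makes the branching explicit. This is a literal transcription of the diagram, not a validated selection rule; the function name and arguments are invented for the example.

```r
# Hypothetical helper mirroring the decision flow; all names are illustrative.
suggest_model <- function(repeated, equal_spacing = TRUE, n_levels = 1,
                          pattern_known = TRUE) {
  if (!pattern_known) {
    return("Try unstructured (data permitting)")
  }
  if (repeated) {
    if (equal_spacing) {
      return("AR(1) or random slopes")
    }
    return("Random slopes + time as a variable")
  }
  if (n_levels >= 2) {
    return("Hierarchical mixed model")
  }
  "Random intercepts"
}

print(suggest_model(repeated = TRUE, equal_spacing = FALSE))
print(suggest_model(repeated = FALSE, n_levels = 3))
```

Treat the returned string as a starting point for the plots and checks described earlier, not as a final answer.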
Worked Micro‑Example
Design: 150 COPD patients across 7 clinics, FEV1 at baseline, 3 m, 6 m.
Goal: Predict individual recovery curves.
- Structure → repeated + nested (patients in clinics). 
- Goal → subject‑specific. 
- Model → Mixed‑effects with random intercept (clinic) & intercept + slope (patient). 
- Spec (R): 
lmer(FEV1 ~ time + (time|patient_id) + (1|clinic_id), data = copd)
- Within‑patient correlation → implied by the random effects; with only three visits, an extra AR(1) term is usually unnecessary because the random slopes already capture each patient's trajectory. 
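To see the specification run end to end, the sketch below simulates data shaped like this design and extracts patient-level curves. It assumes lme4 is installed; the simulated effect sizes are arbitrary placeholders, not clinical estimates, and time is coded in months.

```r
library(lme4)

# Simulated stand-in for the COPD data: 150 patients, 7 clinics, 3 visits.
set.seed(2024)
n_patients <- 150
copd <- expand.grid(time = c(0, 3, 6), patient_id = seq_len(n_patients))
copd$clinic_id <- (copd$patient_id - 1) %% 7 + 1
clinic_eff <- rnorm(7, sd = 0.10)            # clinic-level shifts
pat_int    <- rnorm(n_patients, sd = 0.30)   # patient baselines
pat_slope  <- rnorm(n_patients, sd = 0.03)   # patient recovery rates
copd$FEV1 <- 1.8 + clinic_eff[copd$clinic_id] + pat_int[copd$patient_id] +
  (0.03 + pat_slope[copd$patient_id]) * copd$time +
  rnorm(nrow(copd), sd = 0.10)

fit <- lmer(FEV1 ~ time + (time | patient_id) + (1 | clinic_id), data = copd)

# Per-patient intercept and slope: fixed effects plus that patient's deviations
# (the clinic effect is not included in these coefficients).
patient_coefs <- coef(fit)$patient_id
print(head(patient_coefs, 3))

# Predicted FEV1 at 6 months for three patients, using all random effects.
newd <- subset(copd, patient_id %in% 1:3 & time == 6)
pred_6m <- predict(fit, newdata = newd)
print(pred_6m)
```

With only three visits per patient the slope variance is weakly identified, so a singular-fit or convergence warning on real data would be a reason to fall back to random intercepts.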
Quick Reference Card 📋 (TL;DR)
| Ask | If “Yes” → | Model | 
| Time repeated? | Equal gaps | GEE/Mixed + AR(1) | 
| Time repeated? | Unequal gaps | Mixed + Random Slopes |
| Single clustering level? | — | Mixed + Random Intercept | 
| Multiple levels? | — | Multilevel Mixed | 
| Need only robust SE? | — | Sandwich | 





