Choosing the Right Model for Correlated Data: A Step-by-Step Guide [Repeated Measures, Multilevel Modeling, Mixed Effects, GEE Model, Random Effects, Robust SE, Longitudinal Data, Clustered Data]

Mayta
Jul 2
Updated: Jul 7

STEP 1: 🔍 Describe Your Data Structure

Ask:

Are measurements repeated over time on the same subject?
Are observations nested in groups (e.g., patients in clinics, students in classes)?
Are there multiple levels of clustering (e.g., patients → doctors → hospitals)?
Is time spacing regular (every 3 months) or irregular?

STEP 2: 🧩 Identify the Correlation Source

Data Type	What Makes It Correlated?	Example
Repeated measures	Same subject over time	BP at 1, 3, 6 months
Clustered data	Subjects in same unit	Patients in ward
Hierarchical data	Nested clusters	Patient → doctor → hospital
Paired/matched data	Shared factors	Left eye vs right eye

STEP 3: 🎯 Choose the Analytical Goal

Goal	Use Model That Estimates...	Best For
Population-level average	Marginal model (GEE)	Guidelines, public health
Subject-specific trends	Conditional model (Mixed Effects)	Individual prediction, growth curves
Group-specific effects	Multilevel model	School/ward/hospital policy
Only need corrected SEs	Empirical variance correction	Regression w/ cluster SEs
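
For the last row, here is a minimal base-R sketch of what an empirical ("sandwich") correction does. The simulated data, the variable names (`cluster`, `x`, `y`) and the uncorrected CR0 form are assumptions for illustration; a dedicated package would normally add a small-sample adjustment.

```r
# Simulated clustered data: 40 clusters of 10, with a cluster-level exposure x.
set.seed(1)
n_clusters <- 40
m <- 10
cluster <- rep(seq_len(n_clusters), each = m)
x <- rep(rep(c(0, 1), length.out = n_clusters), each = m)  # varies between clusters only
u <- rep(rnorm(n_clusters, sd = 1), each = m)              # shared cluster effect
y <- 1 + 0.5 * x + u + rnorm(n_clusters * m, sd = 1)

fit <- lm(y ~ x)

# Naive (model-based) SEs assume independent rows.
naive_se <- sqrt(diag(vcov(fit)))

# Cluster-robust (CR0) sandwich: bread %*% meat %*% bread.
X <- model.matrix(fit)
e <- residuals(fit)
bread <- solve(crossprod(X))
scores <- rowsum(X * e, group = cluster)  # per-cluster sum of x_i * e_i
meat <- crossprod(scores)
robust_se <- sqrt(diag(bread %*% meat %*% bread))

round(rbind(naive = naive_se, robust = robust_se), 3)
```

The point estimates are unchanged; only the standard errors move. Because `x` is constant within a cluster here, the naive SE for `x` should come out clearly too small.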

STEP 4: 📊 Map Your Scenario to the Model

Scenario	Best Model	Correlation Correction
Single-level repeated measures (time)	GEE or Mixed Effects	AR(1) or Exchangeable
Nested groups (patients in clinics)	Multilevel Mixed Model	Random Intercepts
Grouped + repeated (patients in hospitals over time)	Mixed Model with Random Intercepts + Slopes	Hierarchical + Time correlation
No strong structure but correlated outcomes	GEE with Robust SE	Empirical correction
Very small sample or irregular visits	Mixed Effects with Random Slopes (time as continuous)	Random-effects structure (Unstructured needs a large sample and shared visit times)

STEP 5: 🧪 Select the Variance & Correlation Strategy

Component	Option	When to Use
Variance Estimation	Empirical (Robust SE)	When model fit isn’t your focus, but SE correction is critical
	Model-based	When you trust your model structure & correlation
Correlation Structure	Exchangeable	Equal correlation (common in clusters)
	AR(1)	Time-spaced data where closer = more correlated
	Unstructured	Large sample, unknown pattern
Random Effect Specification	Random Intercepts	Different starting levels
	Random Slopes	Different trends over time or treatment

🔑 SECRET INSIGHT BOX

GEE = population average, no prediction per patient
Mixed-effects = subject-level prediction
Random slope ≠ time interaction: it reflects natural heterogeneity in trends
Always visualize trajectories before choosing slope models
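
A minimal sketch of that trajectory check in base R. The blood-pressure values are simulated and every name (`bp`, `months`, `intercepts`, `slopes`) is hypothetical.

```r
# Hypothetical data: 12 patients, 4 visits, each with their own intercept and slope.
set.seed(2)
months <- c(0, 1, 3, 6)
n <- 12
intercepts <- rnorm(n, mean = 130, sd = 8)
slopes <- rnorm(n, mean = -1, sd = 0.8)

# One row per visit, one column per patient.
bp <- sapply(seq_len(n), function(i) {
  intercepts[i] + slopes[i] * months + rnorm(length(months), sd = 2)
})

# Spaghetti plot: one line per patient.
matplot(months, bp, type = "l", lty = 1, col = "grey40",
        xlab = "Months since baseline", ylab = "Systolic BP (simulated)")
lines(months, rowMeans(bp), lwd = 3)  # overlay the mean trajectory
```

Lines that fan out or cross suggest a random slope is worth considering; roughly parallel lines point toward a random intercept only.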

🧭 Decision Map Summary

IF repeated measures → Is it over time?
     ├─ Yes: AR(1) or Random Slopes
     └─ No: Exchangeable or Random Intercepts

IF nested in groups:
     ├─ Just one level? → Random Intercepts
     └─ Multiple levels? → Multilevel model with nested random effects

IF unsure of pattern → Try Unstructured (if sample size supports it)

1  Describe Your Data Structure 🔍

Deep‑dive

Ask four questions before touching software:

Repeated over time? Same person measured >1 time ⇒ temporal correlation.
Nested groups? Patients in wards, eyes within patients, etc. ⇒ clustering.
Multiple levels? Patient → Doctor → Hospital ⇒ hierarchical nesting.
Timing pattern? Equally spaced visits favour AR(1); irregular timing often requires mixed‑effects with random slopes.

Why it matters – each “Yes” adds parameters to the covariance structure. Getting that structure wrong can inflate the Type I error rate or cost you power.

Plain‑speak cheat

Write down “TIME?” and “NESTING?” first; the answers drive everything that follows.
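
The same questions can be put to a long-format data set before any modelling. This is a sketch on a tiny made-up table; the column names (`patient_id`, `clinic_id`, `month`) are assumptions, and the spacing check assumes rows are sorted by time within patient.

```r
# Hypothetical long-format data: one row per patient visit.
visits <- data.frame(
  patient_id = c(1, 1, 1, 2, 2, 2, 3, 3),
  clinic_id  = c("A", "A", "A", "A", "A", "A", "B", "B"),
  month      = c(0, 3, 6, 0, 3, 6, 0, 5)
)

# TIME? More than one row per patient means repeated measures.
rows_per_patient <- table(visits$patient_id)
repeated <- any(rows_per_patient > 1)

# NESTING? Several patients per clinic means clustering above the patient level.
patients_per_clinic <- tapply(visits$patient_id, visits$clinic_id,
                              function(id) length(unique(id)))
nested <- any(patients_per_clinic > 1)

# Spacing: are the gaps between visits the same for everyone?
gaps <- unlist(tapply(visits$month, visits$patient_id, diff))
regular <- length(unique(gaps)) == 1

c(repeated = repeated, nested = nested, regular = regular)
```

Here patient 3 returns after 5 months instead of 3, so the spacing check comes back `FALSE` even though most visits are on schedule.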

2  Identify the Correlation Source 🧩

Deep‑dive

Data type	Mechanism	Typical example	Quick diagnostic plot
Repeated measures	Within‑subject memory	BP at 1, 3, 6 months	Spaghetti plot of each patient
Clustered	Shared care environment	Patients in same ICU	Box‑and‑whisker by cluster
Hierarchical	Stacked clustering	Patient → Doctor	Variance‐components plot
Paired	Biological pairing	Left vs Right eye	Bland‑Altman / scatter of pair

Plain‑speak cheat

If two rows share a patient‑ID, clinic‑ID, or visit‑date, they’re probably correlated.

3  Clarify Your Analytical Goal 🎯

Deep‑dive

Goal	Key question	Proper model family	Interpretation focus
Population average	“What is the mean effect across everyone?”	GEE / Marginal	Guidelines & policy
Subject‑specific	“How does this patient change?”	Mixed‑effects (conditional)	Precision medicine
Group effects	“Do hospitals differ?”	Multilevel mixed	Quality benchmarking
Just robust SEs	“I only need SEs that account for clustering.”	OLS/GLM + Sandwich	Simple regressions

Plain‑speak cheat

Pick GEE for public‑health answers; mixed‑effects for patient‑level answers.
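
With a non-linear link such as logistic regression, the two families give numerically different coefficients. The sketch below assumes a random-intercept logistic model with made-up parameter values (`b0`, `b1`, `sigma_u`) and averages over subjects by numerical integration; it is an illustration, not a fitted model.

```r
# Assumed conditional (subject-specific) logistic model:
#   logit P(Y = 1 | x, u) = b0 + b1 * x + u,  u ~ Normal(0, sigma_u^2)
b0 <- -1
b1 <- 1          # subject-specific log odds ratio
sigma_u <- 2     # between-subject SD of the random intercept

# Population-averaged probability: average the subject-level curve over u.
marginal_prob <- function(x) {
  integrate(function(u) plogis(b0 + b1 * x + u) * dnorm(u, sd = sigma_u),
            lower = -Inf, upper = Inf)$value
}

# Log odds ratio as a marginal (GEE-type) model would target it.
marginal_log_or <- qlogis(marginal_prob(1)) - qlogis(marginal_prob(0))

c(conditional = b1, marginal = marginal_log_or)
```

The marginal log odds ratio comes out smaller than the conditional one, and the gap grows with `sigma_u`. With an identity link (ordinary linear models) the two coincide, so the choice there is mostly about interpretation and the variance structure.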

4  Map Scenario → Model 📊

Deep‑dive

Scenario	Best model	Working correlation / random structure	R / Stata hint
6 time‑points per patient, no sites	GEE	AR(1) or exchangeable	geeglm(..., corstr="ar1")
Patients clustered in 12 clinics, one outcome	Mixed (random intercept)	Random intercept only	`lmer(Y ~ X + (1
Patients in 12 clinics, 5 visits each	Mixed (int + slope)	Random int + slope by patient; random int by clinic	`lmer(Y ~ time + (time
4000 patients, 2 exams each, need quick answer	GLM + cluster‑robust SE	Sandwich (empirical)	glm(...); sandwich::vcovCL(..., cluster = ...)
30 patients, 10 uneven visits	Mixed (int + slope), time as continuous	Random int + slope by patient; UN needs a shared visit schedule and far more patients	lme(..., random = ~ time | patient_id)

Plain‑speak cheat

One level = random intercept; time + one level = add random slope; many levels = stack random intercepts.
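
As a runnable sketch of the first rows, the block below fits a random-intercept model and a marginal-style AR(1) model to simulated data. It uses only `nlme`, which ships with most R installations. Note the swap: `gls()` is a likelihood-based stand-in for the GEE hint in the table, not the same estimator. All data and names (`d`, `patient_id`, `time`, `y`) are hypothetical.

```r
library(nlme)

# Simulated data (hypothetical): 60 patients, 6 equally spaced visits.
set.seed(3)
n <- 60
visit <- 0:5
d <- data.frame(
  patient_id = factor(rep(seq_len(n), each = length(visit))),
  time = rep(visit, times = n)
)
b <- rnorm(n, sd = 1.5)  # patient-specific baseline shifts
d$y <- 10 + 0.5 * d$time + b[as.integer(d$patient_id)] + rnorm(nrow(d), sd = 1)

# Conditional model: random intercept per patient.
fit_ri <- lme(y ~ time, random = ~ 1 | patient_id, data = d)

# Marginal-style model: no random effects, AR(1) correlation within patient.
# Rows must be in visit order within each patient for corAR1 to make sense.
fit_ar1 <- gls(y ~ time, correlation = corAR1(form = ~ 1 | patient_id), data = d)

fixef(fit_ri)   # fixed effects from the mixed model
coef(fit_ar1)   # coefficients from the GLS fit
```

Both fits should recover a time slope close to the simulated 0.5; what differs is how the within-patient correlation is described, which drives the standard errors.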

5  Pick Variance & Correlation Strategy 🧪

Deep‑dive

Choice point	Options	Use when	Caveat
Variance estimator	Empirical (robust)	Large N, misspecified corr. OK	SEs tend to be too small with few clusters (roughly < 30)
	Model‑based	Correlation well‑specified	Sensitive to wrong pattern
Correlation pattern	Exchangeable	All pairs equally related (wards)	Over‑simplifies time data
	AR(1)	Equal spacing & decay	Fails if visits irregular
	Unstructured	Plenty of rows/cluster	Parameter‑hungry
Random effect	Intercept	Baseline shifts only	Assumes parallel trajectories
	Intercept + Slope	Heterogeneous change rates	Needs ≥3 time points/subject

Plain‑speak cheat

Short panels → AR(1). Big uncertain panels → UN. Parallel lines? random‑intercept; diverging lines? random‑slope.
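
To make the three correlation patterns concrete, here is a small base-R sketch that builds the exchangeable and AR(1) matrices for four visits and counts how many correlation parameters each pattern needs. The values of `n_times` and `rho` are arbitrary.

```r
n_times <- 4
rho <- 0.6

# Exchangeable: every pair of measurements shares the same correlation.
exch <- matrix(rho, n_times, n_times)
diag(exch) <- 1

# AR(1): correlation decays as rho^|lag| for equally spaced visits.
lag <- abs(outer(seq_len(n_times), seq_len(n_times), "-"))
ar1 <- rho^lag

# Number of correlation parameters each pattern has to estimate.
n_params <- c(exchangeable = 1, ar1 = 1,
              unstructured = n_times * (n_times - 1) / 2)

exch
ar1
n_params
```

Unstructured needs 6 correlations for 4 visits and 45 for 10 visits, which is why it is labelled parameter-hungry above.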

6  Reality Checks & Visuals 🔑

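Spaghetti plot before modelling – are patient lines parallel?
Intraclass correlation (ICC) – if ≈0, clustering may be harmless, but check cluster size too: a small ICC still matters when clusters are large.
Residual autocorrelation (ACF) plot – checks whether an AR(1)-style decay is plausible.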

Plain‑speak cheat

Plot first; if lines aren’t parallel your model shouldn’t be either.
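
For the ICC check, a minimal sketch using the one-way ANOVA estimator on simulated, balanced data. The ward structure and all names (`ward`, `y`, `k`, `m`) are assumptions for illustration.

```r
# Simulated data (hypothetical): 20 wards of 10 patients, true ICC about 0.31.
set.seed(4)
k <- 20   # clusters
m <- 10   # patients per cluster
ward <- factor(rep(seq_len(k), each = m))
y <- rep(rnorm(k, sd = 1), each = m) + rnorm(k * m, sd = 1.5)

# One-way ANOVA estimator of the ICC (balanced clusters).
ms <- summary(aov(y ~ ward))[[1]][["Mean Sq"]]
msb <- ms[1]  # between-ward mean square
msw <- ms[2]  # within-ward mean square
icc <- (msb - msw) / (msb + (m - 1) * msw)

# Design effect: how much clustering inflates the variance of a mean.
design_effect <- 1 + (m - 1) * icc

c(icc = icc, design_effect = design_effect)
```

The design effect is the reason a "small" ICC is not automatically harmless: with 50 patients per cluster, an ICC of 0.05 already gives 1 + 49 × 0.05 = 3.45.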

Decision Flow (ASCII) 🧭

Start
 ├─ Repeated over time?
 │     ├─ Yes → Equal spacing?
 │     │      ├─ Yes → AR(1) or Random Slopes
 │     │      └─ No  → Random Slopes + Time-as-variable
 │     └─ No → Clustered?
 │            ├─ One level → Random Intercepts
 │            └─ ≥Two levels → Hierarchical Mixed
 └─ Unsure pattern → Try Unstructured (data permitting)
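
The flow above can also be written as a small lookup function. `suggest_model()` and its arguments are hypothetical names, and treating a missing answer (`NA`) as "unsure of the pattern" is one reading of the last branch, not the only one.

```r
# Hypothetical helper that mirrors the ASCII flow; labels are copied from it.
suggest_model <- function(repeated_over_time, equal_spacing = NA, cluster_levels = 0) {
  if (is.na(repeated_over_time)) {
    return("Try Unstructured (data permitting)")  # unsure of the pattern
  }
  if (repeated_over_time) {
    if (is.na(equal_spacing)) {
      return("Try Unstructured (data permitting)")
    }
    if (equal_spacing) {
      return("AR(1) or Random Slopes")
    }
    return("Random Slopes + Time-as-variable")
  }
  if (cluster_levels >= 2) return("Hierarchical Mixed")
  if (cluster_levels == 1) return("Random Intercepts")
  NA_character_  # the flow has no branch for independent, unclustered data
}

suggest_model(TRUE, equal_spacing = FALSE)
suggest_model(FALSE, cluster_levels = 2)
```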

Worked Micro‑Example

Design: 150 COPD patients across 7 clinics, FEV1 at baseline, 3 m, 6 m.
Goal: Predict individual recovery curves.

Structure → repeated + nested (patients in clinics).
Goal → subject‑specific.
Model → Mixed‑effects with random intercept (clinic) & intercept + slope (patient).
Spec (R):

lmer(FEV1 ~ time + (time|patient_id) + (1|clinic_id), data = copd)

Working correlation → implied by the random effects; with only three visits an extra AR(1) term is usually unnecessary (and hard to estimate), because the random slopes already capture each patient's trajectory.
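
To see what "implied by the random effects" means, the sketch below computes the within-patient covariance Z G Z' + σ²I for the three visits. The variance components are made up, not estimates from any COPD data, and the clinic level is ignored for simplicity.

```r
# Assumed (made-up) variance components for one patient, ignoring the clinic level.
times <- c(0, 3, 6)                 # months: baseline, 3 m, 6 m
G <- matrix(c(0.25, 0.01,           # var(intercept), cov(intercept, slope)
              0.01, 0.004),         # cov(intercept, slope), var(slope)
            nrow = 2, byrow = TRUE)
sigma2 <- 0.04                      # residual variance

Z <- cbind(1, times)                # random-effects design: intercept and time

# Marginal covariance of the three measurements: Z G Z' + sigma^2 I.
V <- Z %*% G %*% t(Z) + sigma2 * diag(length(times))

round(V, 3)
round(cov2cor(V), 3)
```

With a random slope, the variances grow over time and the correlation differs from one pair of visits to the next, so the model already allows a non-exchangeable pattern without a separate correlation term.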

Quick Reference Card 📋 (TL;DR)

Ask	If “Yes” →	Model
Time repeated?	Equal gaps	GEE/Mixed + AR(1)
	Unequal gaps	Mixed + Random Slopes
Single clustering level?	—	Mixed + Random Intercept
Multiple levels?	—	Multilevel Mixed
Need only robust SE?	—	Sandwich
