Choosing the Right Model for Correlated Data: A Step-by-Step Guide [Repeated Measures, Multilevel Modeling, Mixed Effects, GEE Model, Random Effects, Robust SE, Longitudinal Data, Clustered Data]

STEP 1: 🔍 Describe Your Data Structure

Ask:

  1. Are measurements repeated over time on the same subject?
  2. Are observations nested in groups (e.g., patients in clinics, students in classes)?
  3. Are there multiple levels of clustering (e.g., patients → doctors → hospitals)?
  4. Is time spacing regular (every 3 months) or irregular?

STEP 2: 🧩 Identify the Correlation Source

Data Type | What Makes It Correlated? | Example
Repeated measures | Same subject over time | BP at 1, 3, 6 months
Clustered data | Subjects in same unit | Patients in ward
Hierarchical data | Nested clusters | Patient → doctor → hospital
Paired/matched data | Shared factors | Left eye vs right eye


STEP 3: 🎯 Choose the Analytical Goal

Goal | Model to Use | Best For
Population-level average | Marginal model (GEE) | Guidelines, public health
Subject-specific trends | Conditional model (Mixed Effects) | Individual prediction, growth curves
Group-specific effects | Multilevel model | School/ward/hospital policy
Only need corrected SEs | Empirical variance correction | Regression w/ cluster SEs


STEP 4: 📊 Map Your Scenario to the Model

Scenario | Best Model | Correlation Correction
Single-level repeated measures (time) | GEE or Mixed Effects | AR(1) or Exchangeable
Nested groups (patients in clinics) | Multilevel Mixed Model | Random Intercepts
Grouped + repeated (patients in hospitals over time) | Mixed Model with Random Intercepts + Slopes | Hierarchical + Time correlation
No strong structure but correlated outcomes | GEE with Robust SE | Empirical correction
Many subjects, few shared visit times, unknown pattern | Unstructured correlation | Data-defined estimation


STEP 5: 🧪 Select the Variance & Correlation Strategy

Component | Option | When to Use
Variance Estimation | Empirical (Robust SE) | When model fit isn’t your focus, but SE correction is critical
Variance Estimation | Model-based | When you trust your model structure & correlation
Correlation Structure | Exchangeable | Equal correlation (common in clusters)
Correlation Structure | AR(1) | Time-spaced data where closer = more correlated
Correlation Structure | Unstructured | Large sample, unknown pattern
Random Effect Specification | Random Intercepts | Different starting levels
Random Effect Specification | Random Slopes | Different trends over time or treatment

🧭 Decision Map Summary

IF repeated measures → Is it over time?
     ├─ Yes: AR(1) or Random Slopes
     └─ No: Exchangeable or Random Intercepts

IF nested in groups:
     ├─ Just one level? → Random Intercepts
     └─ Multiple levels? → Multilevel model with nested random effects

IF unsure of pattern → Try Unstructured (if sample size supports it)

1  Describe Your Data Structure  🔍

Deep‑dive

Ask four questions before touching software:

  1. Repeated over time? Same person measured more than once ⇒ temporal correlation.
  2. Nested groups? Patients in wards, eyes within patients, etc. ⇒ clustering.
  3. Multiple levels? Patient → Doctor → Hospital ⇒ hierarchical nesting.
  4. Timing pattern? Equally spaced visits favour AR(1); irregular timing usually calls for a mixed-effects model with time as a continuous covariate and random slopes.

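Why it matters – each “Yes” adds parameters to the covariance matrix. Ignoring or misspecifying the correlation usually makes standard errors too small for cluster-level effects (inflating Type I error) or too large for within-subject effects (wasting power).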
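The four questions can be answered mechanically from a long-format dataset (one row per measurement). Here is a minimal sketch in plain Python, standard library only. The column names (`patient_id`, `clinic_id`, `month`) and the toy rows are hypothetical, and "regular spacing" is taken to mean that every gap between consecutive visits is identical.

```python
from collections import defaultdict

def describe_structure(rows, subject_key, time_key, cluster_key=None):
    """Summarise repeated measures, nesting and visit spacing in long-format rows."""
    times_by_subject = defaultdict(list)
    subjects_by_cluster = defaultdict(set)
    for row in rows:
        times_by_subject[row[subject_key]].append(row[time_key])
        if cluster_key is not None:
            subjects_by_cluster[row[cluster_key]].add(row[subject_key])

    # Repeated measures: at least one subject contributes more than one row.
    repeated = any(len(t) > 1 for t in times_by_subject.values())

    # Nesting: at least one cluster holds more than one subject.
    nested = any(len(s) > 1 for s in subjects_by_cluster.values())

    # Regular spacing: all gaps between consecutive visits are the same.
    gaps = set()
    for times in times_by_subject.values():
        ordered = sorted(times)
        gaps.update(b - a for a, b in zip(ordered, ordered[1:]))
    regular = repeated and len(gaps) == 1

    return {"repeated": repeated, "nested": nested, "regular_spacing": regular}

# Hypothetical toy data: months since baseline for three patients in two clinics.
rows = [
    {"patient_id": 1, "clinic_id": "A", "month": 0},
    {"patient_id": 1, "clinic_id": "A", "month": 3},
    {"patient_id": 1, "clinic_id": "A", "month": 6},
    {"patient_id": 2, "clinic_id": "A", "month": 0},
    {"patient_id": 2, "clinic_id": "A", "month": 3},
    {"patient_id": 3, "clinic_id": "B", "month": 0},
    {"patient_id": 3, "clinic_id": "B", "month": 3},
]
summary = describe_structure(rows, "patient_id", "month", "clinic_id")
print(summary)
```

Note that a schedule such as 1, 3, 6 months counts as irregular here (gaps of 2 and 3 months), which would already argue against a plain AR(1) structure.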

Plain‑speak cheat

Write down “TIME?” and “NESTING?” first; the answers drive everything that follows.


2  Identify the Correlation Source  🧩

Deep‑dive

Data type | Mechanism | Typical example | Quick diagnostic plot
Repeated measures | Within-subject memory | BP at 1, 3, 6 months | Spaghetti plot of each patient
Clustered | Shared care environment | Patients in same ICU | Box-and-whisker by cluster
Hierarchical | Stacked clustering | Patient → Doctor | Variance-components plot
Paired | Biological pairing | Left vs Right eye | Bland-Altman / scatter of pair

Plain‑speak cheat

If two rows share a patient‑ID, clinic‑ID, or visit‑date, they’re probably correlated.


3  Clarify Your Analytical Goal  🎯

Deep‑dive

Goal | Key question | Proper model family | Interpretation focus
Population average | “What is the mean effect across everyone?” | GEE / Marginal | Guidelines & policy
Subject-specific | “How does this patient change?” | Mixed-effects (conditional) | Precision medicine
Group effects | “Do hospitals differ?” | Multilevel mixed | Quality benchmarking
Just robust SEs | “I only need standard errors that account for clustering.” | OLS/GLM + Sandwich | Simple regressions

Plain‑speak cheat

Pick GEE for public‑health answers; mixed‑effects for patient‑level answers.
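The GEE-versus-mixed choice changes what the coefficient means once the outcome is binary. In a logistic model with a random intercept, the subject-specific (conditional) log odds ratio is larger in magnitude than the population-averaged (marginal) one; in a linear model the two coincide. The sketch below illustrates this numerically in plain Python by averaging over the random intercept on a grid. The effect size `1.0` and random-intercept SD `2.0` are assumed values chosen only for illustration.

```python
import math

def expit(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    return math.log(p / (1.0 - p))

def marginal_probability(linear_predictor, sd_intercept, n_grid=4001, width=8.0):
    """Average expit(linear_predictor + b) over b ~ Normal(0, sd_intercept^2).

    Uses an equally spaced grid over +/- `width` standard deviations and
    normalises by the sum of weights, so the grid spacing cancels out.
    """
    total, weight_sum = 0.0, 0.0
    for i in range(n_grid):
        z = -width + 2.0 * width * i / (n_grid - 1)  # standard-normal grid point
        w = math.exp(-0.5 * z * z)                   # unnormalised normal density
        total += w * expit(linear_predictor + sd_intercept * z)
        weight_sum += w
    return total / weight_sum

beta_subject_specific = 1.0  # assumed conditional log odds ratio for a binary exposure
sd_intercept = 2.0           # assumed SD of the patient-level random intercept

p_unexposed = marginal_probability(0.0, sd_intercept)
p_exposed = marginal_probability(beta_subject_specific, sd_intercept)
beta_population_averaged = logit(p_exposed) - logit(p_unexposed)

print("subject-specific log OR:   ", beta_subject_specific)
print("population-averaged log OR:", round(beta_population_averaged, 3))
```

Neither number is wrong. They answer different questions, which is why the goal should be fixed before the model is.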


4  Map Scenario → Model  📊

Deep‑dive

Scenario | Best model | Working correlation / random structure | R / Stata hint
6 time-points per patient, no sites | GEE | AR(1) or exchangeable | geeglm(..., corstr="ar1")
Patients clustered in 12 clinics, one outcome | Mixed (random intercept) | Random intercept only | `lmer(Y ~ X + (1
Patients in 12 clinics, 5 visits each | Mixed (int + slope) | Random int + slope by patient; random int by clinic | `lmer(Y ~ time + (time
4000 patients, 2 exams each, need quick answer | GLM + cluster-robust SE | Sandwich (empirical) | glm(...); sandwich::vcovCL(..., cluster = ~id)
30 patients, 10 uneven visits | Unstructured mixed (only if it converges; 10 visits imply 45 correlations) | UN covariance | lme(..., correlation=corSymm())

Plain‑speak cheat

One level = random intercept; time + one level = add random slope; many levels = stack random intercepts.
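To see what the "GLM + cluster-robust SE" row does, here is a minimal sketch of the sandwich idea for the simplest possible model, an intercept-only regression (the sample mean). It is plain Python with made-up data, and it applies no small-sample correction, so treat it as an illustration of the mechanics rather than a replacement for a tested package.

```python
import math
from collections import defaultdict

def mean_with_standard_errors(values, cluster_ids):
    """Return (mean, naive SE, cluster-robust SE) for an intercept-only model."""
    n = len(values)
    mean = sum(values) / n
    residuals = [v - mean for v in values]

    # Naive SE treats all n rows as independent.
    naive_se = math.sqrt(sum(r * r for r in residuals) / (n - 1) / n)

    # Sandwich "meat": squared sum of residuals within each cluster.
    cluster_sums = defaultdict(float)
    for r, cid in zip(residuals, cluster_ids):
        cluster_sums[cid] += r
    meat = sum(s * s for s in cluster_sums.values())

    # Sandwich = bread * meat * bread, with bread = 1/n for the mean.
    robust_se = math.sqrt(meat) / n
    return mean, naive_se, robust_se

# Hypothetical data: 4 patients with 2 exams each; exams within a patient agree closely.
values = [10.0, 10.2, 14.0, 13.8, 9.0, 9.2, 15.0, 14.8]
cluster_ids = [1, 1, 2, 2, 3, 3, 4, 4]
mean, naive_se, robust_se = mean_with_standard_errors(values, cluster_ids)
print(round(mean, 2), round(naive_se, 3), round(robust_se, 3))
```

Because the two exams per patient are nearly identical, the rows carry less independent information than their count suggests, and the robust SE comes out larger than the naive one. With only four clusters the robust SE is itself unreliable; the example is small purely so the arithmetic can be followed by hand.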


5  Pick Variance & Correlation Strategy  🧪

Deep‑dive

Choice point | Options | Use when | Caveat
Variance estimator | Empirical (robust) | Large N; tolerates a misspecified correlation | SEs biased downward with fewer than about 30 clusters
Variance estimator | Model-based | Correlation well-specified | Sensitive to wrong pattern
Correlation pattern | Exchangeable | All pairs equally related (wards) | Over-simplifies time data
Correlation pattern | AR(1) | Equal spacing & decay | Fails if visits irregular
Correlation pattern | Unstructured | Many subjects, few shared time points | Parameter-hungry
Random effect | Intercept | Baseline shifts only | Assumes parallel trajectories
Random effect | Intercept + Slope | Heterogeneous change rates | Needs ≥3 time points/subject

Plain‑speak cheat

Short panels → AR(1). Big uncertain panels → UN. Parallel lines? random‑intercept; diverging lines? random‑slope.
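It can help to see what each working correlation structure actually assumes. The sketch below builds the exchangeable and AR(1) matrices in plain Python and counts how many correlation parameters each structure has to estimate. The function names are illustrative, not taken from any package.

```python
def exchangeable(n_times, rho):
    """Every pair of measurements shares the same correlation rho."""
    return [[1.0 if i == j else rho for j in range(n_times)] for i in range(n_times)]

def ar1(n_times, rho):
    """Correlation decays as rho ** lag, which presumes equally spaced visits."""
    return [[rho ** abs(i - j) for j in range(n_times)] for i in range(n_times)]

def n_correlation_parameters(structure, n_times):
    """Number of correlation parameters the structure has to estimate."""
    if structure in ("exchangeable", "ar1"):
        return 1
    if structure == "unstructured":
        return n_times * (n_times - 1) // 2  # one per distinct pair of visits
    raise ValueError(f"unknown structure: {structure}")

print("Exchangeable, rho = 0.5:")
for row in exchangeable(4, 0.5):
    print(row)

print("AR(1), rho = 0.5:")
for row in ar1(4, 0.5):
    print(row)

for n_times in (3, 5, 10):
    print(n_times, "visits ->", n_correlation_parameters("unstructured", n_times),
          "unstructured correlations")
```

The count grows quadratically with the number of visits, which is why unstructured is described as parameter-hungry: three visits need 3 correlations, ten visits need 45.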


6  Reality Checks & Visuals  🔑

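  1. Spaghetti plot before modelling – are patient lines parallel?
  2. Intraclass correlation (ICC) – if ≈0, clustering may be harmless, but check the design effect 1 + (m − 1) × ICC (m = cluster size), because even a small ICC matters when clusters are large.
  3. Residual autocorrelation (ACF) plot – checks whether correlation decays with lag, as AR(1) assumes.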

Plain‑speak cheat

Plot first; if lines aren’t parallel your model shouldn’t be either.
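For the ICC check, a rough hand calculation is often enough before fitting anything. The sketch below uses the one-way ANOVA estimator for equal-sized clusters and the usual design effect, 1 + (m − 1) × ICC, where m is the cluster size. It is plain Python with toy numbers; for unbalanced data, a variance-components estimate from a fitted mixed model would be the more usual route.

```python
def icc_oneway(clusters):
    """One-way ANOVA estimator of the ICC for equal-sized clusters (list of lists)."""
    k = len(clusters)
    m = len(clusters[0])
    if any(len(c) != m for c in clusters):
        raise ValueError("this sketch assumes equal cluster sizes")
    grand_mean = sum(sum(c) for c in clusters) / (k * m)
    cluster_means = [sum(c) / m for c in clusters]

    # Mean squares between and within clusters.
    ms_between = m * sum((cm - grand_mean) ** 2 for cm in cluster_means) / (k - 1)
    ms_within = sum(
        (y - cm) ** 2 for c, cm in zip(clusters, cluster_means) for y in c
    ) / (k * (m - 1))

    return (ms_between - ms_within) / (ms_between + (m - 1) * ms_within)

def design_effect(icc, cluster_size):
    """Variance inflation relative to independent observations."""
    return 1.0 + (cluster_size - 1) * icc

# Toy data: three clusters whose members resemble each other far more than outsiders.
toy_clusters = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
toy_icc = icc_oneway(toy_clusters)
print("ICC:", round(toy_icc, 3))

# A seemingly negligible ICC still matters when clusters are large.
print("Design effect, ICC = 0.02, 101 per cluster:", round(design_effect(0.02, 101), 2))
```

An ICC of 0.02 looks harmless, yet with about 100 patients per cluster it triples the variance of a cluster-level estimate, so "ICC ≈ 0" should always be read together with the cluster size.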


Decision Flow (ASCII)  🧭

Start
 ├─ Repeated over time?
 │     ├─ Yes → Equal spacing?
 │     │      ├─ Yes → AR(1) or Random Slopes
 │     │      └─ No  → Random Slopes + Time-as-variable
 │     └─ No → Clustered?
 │            ├─ One level → Random Intercepts
 │            └─ ≥Two levels → Hierarchical Mixed
 └─ Unsure pattern → Try Unstructured (data permitting)
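The flow above can also be written as a small function, which makes the branching explicit and easy to reuse in an analysis checklist. This is a sketch in plain Python: the argument names are made up, checking the "unsure pattern" branch first is an assumption about ordering, and the final fallback string is not part of the original flow.

```python
def recommend_model(repeated_over_time, equal_spacing=None, cluster_levels=0,
                    pattern_known=True):
    """Return the suggestion from the decision flow for one set of answers."""
    # Assumption: an unknown pattern short-circuits the other questions.
    if not pattern_known:
        return "Try Unstructured (data permitting)"
    if repeated_over_time:
        if equal_spacing:
            return "AR(1) or Random Slopes"
        return "Random Slopes + Time-as-variable"
    if cluster_levels == 1:
        return "Random Intercepts"
    if cluster_levels >= 2:
        return "Hierarchical Mixed"
    # Fallback added for completeness; not in the original flow.
    return "No correlation structure identified"

print(recommend_model(repeated_over_time=True, equal_spacing=True))
print(recommend_model(repeated_over_time=True, equal_spacing=False))
print(recommend_model(repeated_over_time=False, cluster_levels=1))
print(recommend_model(repeated_over_time=False, cluster_levels=3))
print(recommend_model(repeated_over_time=True, pattern_known=False))
```

The function only returns the label from the tree. Data that are both repeated and nested, as in the worked example that follows, still need the combined specification.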

Worked Micro‑Example

Design: 150 COPD patients across 7 clinics, FEV1 at baseline, 3 m, 6 m.
Goal: Predict individual recovery curves.

  1. Structure → repeated + nested (patients in clinics).
  2. Goal → subject‑specific.
  3. Model → Mixed‑effects with random intercept (clinic) & intercept + slope (patient).
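  4. Spec (R):
library(lme4)  # provides lmer()
lmer(FEV1 ~ time + (time|patient_id) + (1|clinic_id), data = copd)
  5. Working correlation → implied by the random effects; no extra AR(1) term is needed because the random slopes capture the within-patient trajectory, and three visits per patient leave little information to estimate one anyway.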

Quick Reference Card  📋 (TL;DR)

Ask | If “Yes” → | Model
Time repeated? | Equal gaps | GEE/Mixed + AR(1)
Time repeated? | Unequal gaps | Mixed + Random Slopes
Single clustering level? | Yes | Mixed + Random Intercept
Multiple levels? | Yes | Multilevel Mixed
Need only robust SE? | Yes | Sandwich
