Choosing a GEE Working Correlation Structure for Repeated Measures: ind = Independence, exc = Exchangeable, ar1 = Autoregressive Order 1, sta1 = Stationary m-dependent (m = 1), uns = Unstructured

Mayta
Jul 8

Updated: Jul 8

Marginal Model-Based Correlation Structures in GEE (Generalized Estimating Equations)

Marginal models (via GEE) assume correlated outcomes within subjects/clusters and require specification of a "working correlation structure".

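While coefficient estimates (β) remain consistent regardless of the working structure (provided the mean model is correctly specified), their efficiency, and the validity of model-based standard errors, depends on choosing a correlation form close to the truth. Robust (sandwich) standard errors remain valid under a misspecified structure when the number of subjects/clusters is reasonably large.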

🚫 ind — Independent

Assumes: No within-subject correlation (ρ = 0)
🔥 Poor fit for repeated measures: it ignores the within-subject dependence that motivates using GEE
❌ As a description of the data, plausible only if responses within a subject are truly uncorrelated
✅ Reasonable for clustered cross-sectional data, NOT for longitudinal/repeated measures. With vce(robust), standard errors stay valid even when independence is wrong, but estimates can be less efficient

Conclusion: Avoid ind if you're modeling intra-subject dependency; it works against the purpose of GEE in this context.

✅ exchangeable — Compound Symmetry (CS)

Assumes: All pairwise correlations are equal
Use when:
- Time spacing is not relevant
- Repeated measures are irregular or few
- Clinical example: BP at admission, discharge, follow-up (no decay expected)

xtgee y x1 x2, family(gaussian) link(identity) corr(exchangeable) i(id) vce(robust)

Advantage: Efficient if the assumption holds; with vce(robust), standard errors stay valid even if it does not

✅ ar1 — Autoregressive Order 1

Assumes: Correlation decays exponentially with time lag
$\text{Corr}(y_t, y_{t+k}) = \rho^{k}$
Use when:
- Equally spaced time points (e.g., every week/month)
- There's temporal ordering
- Clinical example: Daily symptom ratings, serial biomarkers

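xtgee y x1 x2, family(gaussian) link(identity) corr(ar1) i(id) t(time) vce(robust)

Warning: Requires a time variable and equal spacing, so it is not suited to irregular intervals. Here time is a placeholder name for your visit-index variable; declaring the panel with xtset id time beforehand works as well.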
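Before choosing ar1, it can help to confirm that each subject's visit times really are equally spaced. The following is a minimal sketch in plain Python, outside Stata; the function name and the two visit schedules are hypothetical examples, not part of any package.

```python
def is_equally_spaced(times, tol=1e-9):
    """Return True if the sorted visit times have a constant gap (within tol)."""
    ordered = sorted(times)
    if len(ordered) < 3:
        return True  # zero or one gap is trivially "equal"
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    return all(abs(g - gaps[0]) <= tol for g in gaps)


# Hypothetical visit schedules (in weeks) for two subjects
weekly_visits = [0, 1, 2, 3, 4]
clinic_visits = [0, 1, 4, 12, 26]

print(is_equally_spaced(weekly_visits))  # True
print(is_equally_spaced(clinic_visits))  # False
```

A subject who skips one visit on an otherwise regular grid will also fail this check, which is the situation where ar1 becomes questionable.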

✅ sta1 — Stationary m-dependent (m=1)

Assumes:
- Observations ≤ m steps apart have non-zero correlation
- All others = 0
sta1: Correlation only among immediate neighbors
Use when:
- You expect correlation only between adjacent time points
- Clinical example: Hourly measurements where only consecutive readings are related
Syntax:

xtgee y x1 x2, family(gaussian) link(identity) corr(sta1) i(id) t(time) vce(robust)

Note: Underused but useful in short, dense, equally spaced panels. Like ar1, it needs a time variable (time is a placeholder name here)

✅ uns — Unstructured

Allows: Each pair of observations to have its own unique correlation
Most flexible — no assumptions
BUT: Requires large sample sizes and many observations per subject
Use when:
- N is large (hundreds of subjects)
- Each subject has many (≥5–6) repeated measures
- You want to let the data dictate the correlation

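xtgee y x1 x2, family(gaussian) link(identity) corr(uns) i(id) t(time) vce(robust)

Caution: Can fail to converge or overfit with sparse or unbalanced data. With T time points it estimates T(T−1)/2 separate correlations, so the cost grows quickly. It also needs a time variable (time is a placeholder name here).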
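To see why uns is so data-hungry, it helps to count how many correlation parameters each structure has to estimate. This is a small sketch in plain Python; the function name and the short structure labels are assumptions made for illustration.

```python
def n_corr_params(structure, n_times, m=1):
    """Number of free working-correlation parameters for n_times time points."""
    if structure == "ind":
        return 0                              # nothing to estimate
    if structure in ("exc", "ar1"):
        return 1                              # a single rho
    if structure == "sta":
        return m                              # one correlation per lag up to m
    if structure == "uns":
        return n_times * (n_times - 1) // 2   # one per pair of time points
    raise ValueError(f"unknown structure: {structure}")


for t in (3, 5, 10):
    print(t, "time points -> uns needs", n_corr_params("uns", t), "correlations")
```

With 5 time points uns already needs 10 correlations, and with 10 time points it needs 45, while exc and ar1 stay at 1.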

🧭 Summary Table: GEE Correlation Structures for Clinical Repeated Measures

Structure	Meaning	When to Use	Key Assumptions	Stata corr()	Caution
ind	Independence	Never (for repeated measures)	No correlation	independent	Violates longitudinal logic
exchangeable	Equal correlation	Irregular, short, balanced reps	Constant ρ across all timepoints	exchangeable	Assumption fails with decaying correlation
ar1	Decaying (lag-based)	Equally spaced, ordered timepoints	Corr decays exponentially by lag	ar1	Invalid for irregular timepoints
sta1	Neighbor correlation only	Very short panel, e.g., hourly or closely spaced	Corr only among adjacent timepoints	sta1	Rarely used, underdocumented
uns	Fully unstructured	Very large N + many repeated measures	No assumption	uns	Overfits unless data is massive

💡 Clinical Decision Logic

Sparse, irregular follow-ups → Use exchangeable
Time-ordered, equally spaced → Use ar1
Immediate-neighbor dependence only → Use sta1
Dense + high-frequency + large N → Use uns
Never use ind unless modeling cross-sectional clusters (not repeated measures)
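The rules above can be written as a small lookup. This is only a sketch in plain Python: the argument names and the order in which the rules are checked are assumptions, and the result is a starting point rather than a substitute for inspecting the data.

```python
def suggest_structure(repeated_measures, equally_spaced=False,
                      neighbors_only=False, dense_large_n=False):
    """Map the decision rules to a Stata corr() keyword.

    The rule priority below is an assumption made for this sketch.
    """
    if not repeated_measures:
        return "independent"   # cross-sectional clusters only
    if dense_large_n:
        return "uns"           # dense, high-frequency data with large N
    if neighbors_only:
        return "sta1"          # immediate-neighbor dependence only
    if equally_spaced:
        return "ar1"           # time-ordered, equally spaced
    return "exchangeable"      # sparse or irregular follow-ups


print(suggest_structure(repeated_measures=True, equally_spaced=True))
print(suggest_structure(repeated_measures=True))
```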

Correlation Matrix Simulation — GEE Working Correlation Structures

Data: Repeated measures on 5 time points (T1–T5)

🔴 corr(independent)

Assumes no within-subject correlation (ρ = 0) — each time point is independent.

	T1	T2	T3	T4	T5
T1	1.00	0.00	0.00	0.00	0.00
T2	0.00	1.00	0.00	0.00	0.00
T3	0.00	0.00	1.00	0.00	0.00
T4	0.00	0.00	0.00	1.00	0.00
T5	0.00	0.00	0.00	0.00	1.00

🟢 corr(exchangeable)

Assumes constant correlation between all pairs (ρ = 0.80)

	T1	T2	T3	T4	T5
T1	1.00	0.80	0.80	0.80	0.80
T2	0.80	1.00	0.80	0.80	0.80
T3	0.80	0.80	1.00	0.80	0.80
T4	0.80	0.80	0.80	1.00	0.80
T5	0.80	0.80	0.80	0.80	1.00
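For readers who want to generate this matrix for a different number of time points or a different ρ, here is a minimal sketch in plain Python (the function name is hypothetical):

```python
def exchangeable_corr(n, rho):
    """n x n matrix with 1 on the diagonal and the same rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n)] for i in range(n)]


for row in exchangeable_corr(5, 0.80):
    print("  ".join(f"{value:.2f}" for value in row))
```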

🟡 corr(ar1)

Assumes correlation decays by lag (ρ = 0.90, then ρ² = 0.81, ρ³ = 0.73…)

	T1	T2	T3	T4	T5
T1	1.00	0.90	0.81	0.73	0.66
T2	0.90	1.00	0.90	0.81	0.73
T3	0.81	0.90	1.00	0.90	0.81
T4	0.73	0.81	0.90	1.00	0.90
T5	0.66	0.73	0.81	0.90	1.00
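The entries follow directly from ρ raised to the lag between two time points. A minimal sketch in plain Python (the function name is hypothetical):

```python
def ar1_corr(n, rho):
    """n x n AR(1) matrix: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n)] for i in range(n)]


for row in ar1_corr(5, 0.90):
    print("  ".join(f"{value:.2f}" for value in row))
```

With ρ = 0.90 the lag-4 correlation is 0.90⁴ ≈ 0.66, so even distant time points remain strongly related; a smaller ρ decays much faster (0.50⁴ ≈ 0.06).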

🔵 corr(sta1)

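Only adjacent timepoints are correlated (lag-1), ρ = 0.50. With 5 time points, a lag-1-only matrix is a valid (positive definite) correlation matrix only when |ρ| is below about 0.58, so a value such as 0.80 is not possible here.

	T1	T2	T3	T4	T5
T1	1.00	0.50	0.00	0.00	0.00
T2	0.50	1.00	0.50	0.00	0.00
T3	0.00	0.50	1.00	0.50	0.00
T4	0.00	0.00	0.50	1.00	0.50
T5	0.00	0.00	0.00	0.50	1.00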
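A lag-1-only structure cannot carry an arbitrarily large ρ: if adjacent points are strongly correlated, non-adjacent points cannot be exactly uncorrelated. The sketch below, in plain Python with hypothetical function names, builds the matrix and checks validity by attempting a Cholesky factorization, which succeeds only for positive definite matrices. The bound uses the known eigenvalues of a tridiagonal Toeplitz matrix.

```python
import math


def sta1_corr(n, rho):
    """n x n matrix with rho at lag 1 and 0 at every longer lag."""
    return [[1.0 if i == j else (rho if abs(i - j) == 1 else 0.0)
             for j in range(n)] for i in range(n)]


def is_positive_definite(matrix):
    """Try a Cholesky factorization; return False as soon as a pivot is <= 0."""
    n = len(matrix)
    lower = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            partial = sum(lower[i][k] * lower[j][k] for k in range(j))
            if i == j:
                pivot = matrix[i][i] - partial
                if pivot <= 0:
                    return False
                lower[i][j] = math.sqrt(pivot)
            else:
                lower[i][j] = (matrix[i][j] - partial) / lower[j][j]
    return True


def max_valid_rho(n):
    """Upper bound on |rho| for an n-point lag-1-only correlation matrix."""
    return 1 / (2 * math.cos(math.pi / (n + 1)))


print(round(max_valid_rho(5), 3))                 # 0.577
print(is_positive_definite(sta1_corr(5, 0.50)))   # True
print(is_positive_definite(sta1_corr(5, 0.80)))   # False
```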

🟣 corr(uns)

No assumptions — each pair has its own unique correlation

	T1	T2	T3	T4	T5
T1	1.00	0.86	0.74	0.55	0.33
T2	0.86	1.00	0.71	0.49	0.40
T3	0.74	0.71	1.00	0.69	0.50
T4	0.55	0.49	0.69	1.00	0.66
T5	0.33	0.40	0.50	0.66	1.00
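One way to read an unstructured matrix is to average its entries by lag and see whether a simpler pattern would describe it. The sketch below, in plain Python with a hypothetical function name, does this for the example matrix above; it is an informal diagnostic, not a formal test.

```python
# The example unstructured matrix from the table above
uns = [
    [1.00, 0.86, 0.74, 0.55, 0.33],
    [0.86, 1.00, 0.71, 0.49, 0.40],
    [0.74, 0.71, 1.00, 0.69, 0.50],
    [0.55, 0.49, 0.69, 1.00, 0.66],
    [0.33, 0.40, 0.50, 0.66, 1.00],
]


def mean_corr_by_lag(matrix):
    """Average the correlations on each off-diagonal (lag 1, lag 2, ...)."""
    n = len(matrix)
    means = {}
    for lag in range(1, n):
        values = [matrix[i][i + lag] for i in range(n - lag)]
        means[lag] = sum(values) / len(values)
    return means


for lag, value in mean_corr_by_lag(uns).items():
    print(f"lag {lag}: mean correlation {value:.2f}")
```

Here the averages fall steadily as the lag grows (from about 0.73 at lag 1 to 0.33 at lag 4), a decaying pattern that is closer to ar1 than to exchangeable.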