Choosing a GEE Working Correlation Structure for Repeated Measures: ind = Independence, exc = Exchangeable, ar1 = Autoregressive Order 1, sta1 = Stationary m-dependent (m=1), uns = Unstructured
- Mayta
- Jul 8
Updated: Jul 8
Marginal Model-Based Correlation Structures in GEE (Generalized Estimating Equations)
Marginal models (via GEE) assume correlated outcomes within subjects/clusters and require specification of a "working correlation structure".
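While coefficient estimates (β) remain consistent regardless of the working structure (provided the mean model is correctly specified), their efficiency, and therefore the precision of the resulting standard errors, depends on choosing an appropriate correlation form.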
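For orientation, the working correlation enters GEE through the working covariance of each subject's outcome vector. A sketch in the usual Liang and Zeger notation, where the symbols V_i, A_i, R(α) and φ are standard textbook notation assumed here rather than taken from this post:

```latex
V_i = \phi \, A_i^{1/2} \, R(\alpha) \, A_i^{1/2}
```

Here A_i is the diagonal matrix of variance functions for subject i, φ is the scale parameter, and R(α) is the working correlation matrix. The structures in this post differ only in how R(α) is parameterized.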
🚫 ind — Independent
Assumes: No within-subject correlation (ρ = 0)
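🔥 Rarely appropriate for repeated measures: it ignores the within-subject dependence that GEE is meant to model, so estimates are typically less efficient even though they stay consistent with vce(robust)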
❌ Use only if you're absolutely sure responses are uncorrelated (which defeats the purpose of GEE)
✅ Use in clustered cross-section, NOT in longitudinal/repeated measures
Conclusion: Do not use ind if you're modeling intra-subject dependency — it contradicts the purpose of GEE in this context.
✅ exchangeable — Compound Symmetry (CS)
Assumes: All pairwise correlations are equal
Use when:
Time spacing is not relevant
Repeated measures are irregular or few
Clinical example: BP at admission, discharge, follow-up (no decay expected)
xtgee y x1 x2, family(gaussian) link(identity) corr(exchangeable) i(id) vce(robust)
Advantage: Efficient if assumption holds; still robust if not (with vce(robust))
✅ ar1 — Autoregressive Order 1
Assumes: Correlation decays exponentially with time lag
$\text{Corr}(y_t, y_{t+k}) = \rho^{k}$
Use when:
Equally spaced time points (e.g., every week/month)
There's temporal ordering
Clinical example: Daily symptom ratings, serial biomarkers
xtgee y x1 x2, family(gaussian) link(identity) corr(ar1) i(id) vce(robust)
Warning: Requires a declared time variable (e.g., xtset id time, where time stands for your visit index) and equal spacing; not for irregular intervals
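Because ar1 ties the correlation to the lag, it can help to confirm that the visit times really are equally spaced before fitting. A minimal Python sketch of that check; the helper name is_equally_spaced and the example visit times are hypothetical and used only for illustration (this is not a Stata function):

```python
def is_equally_spaced(times, tol=1e-9):
    """Return True if the sorted time points have a constant gap."""
    ordered = sorted(times)
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    # Zero or one time point has no gaps and is trivially equally spaced.
    return all(abs(gap - gaps[0]) <= tol for gap in gaps)


weekly_visits = [0, 7, 14, 21]       # days since baseline (hypothetical)
irregular_visits = [0, 7, 30, 90]    # e.g. admission, discharge, follow-ups

print(is_equally_spaced(weekly_visits))     # True
print(is_equally_spaced(irregular_visits))  # False
```

If the check fails, the lag between two rows no longer means the same thing for every pair, which is the situation the warning describes.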
✅ sta1 — Stationary m-dependent (m=1)
Assumes:
Observations ≤ m steps apart have non-zero correlation
All others = 0
sta1: Correlation only among immediate neighbors
Use when:
You expect correlation only between adjacent time points
Clinical example: Hourly measurements where only consecutive readings are related
Syntax:
xtgee y x1 x2, family(gaussian) link(identity) corr(sta1) i(id) vce(robust)
Note: Underused but powerful in short, dense, equally spaced panels
✅ uns — Unstructured
Allows: Each pair of observations to have its own unique correlation
Most flexible — no assumptions
BUT: Requires many subjects relative to the number of time points, because every pair of time points gets its own correlation parameter
Use when:
N is large (hundreds of subjects)
Each subject has many (≥5–6) repeated measures
You want to let the data dictate the correlation
xtgee y x1 x2, family(gaussian) link(identity) corr(uns) i(id) vce(robust)
Caution: Fails or overfits with sparse/imbalanced data. Huge computational cost.
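The sample-size demand follows from how many correlation parameters each structure has to estimate. A small Python sketch, assuming every subject is measured at the same T time points; the function name n_corr_params is hypothetical:

```python
def n_corr_params(structure, n_times):
    """Number of distinct working-correlation parameters to estimate."""
    counts = {
        "ind": 0,                              # nothing to estimate
        "exc": 1,                              # one common rho
        "ar1": 1,                              # one rho, raised to the lag
        "sta1": 1,                             # one lag-1 correlation
        "uns": n_times * (n_times - 1) // 2,   # one per pair of time points
    }
    return counts[structure]


for t in (3, 5, 10):
    print(t, n_corr_params("uns", t))
# 3 3
# 5 10
# 10 45
```

The count for uns grows quadratically with the number of time points, while the other structures stay at one parameter or none.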
🧭 Summary Table: GEE Correlation Structures for Clinical Repeated Measures
Structure | Meaning | When to Use | Key Assumptions | Stata corr() | Caution |
ind | Independence | Never (for repeated measures) | No correlation | independent | Violates longitudinal logic |
exchangeable | Equal correlation | Irregular, short, balanced reps | Constant ρ across all timepoints | exchangeable | Assumption fails with decaying correlation |
ar1 | Decaying (lag-based) | Equally spaced, ordered timepoints | Corr decays exponentially by lag | ar1 | Invalid for irregular timepoints |
sta1 | Neighbor correlation only | Very short panel, e.g., hourly or closely spaced | Corr only among adjacent timepoints | sta1 | Rarely used, underdocumented |
uns | Fully unstructured | Very large N + many repeated measures | No assumption | uns | Overfits unless data is massive |
💡 Clinical Decision Logic
Sparse, irregular follow-ups → Use exchangeable
Time-ordered, equally spaced → Use ar1
Immediate-neighbor dependence only → Use sta1
Dense + high-frequency + large N → Use uns
Never use ind unless modeling cross-sectional clusters (not repeated measures)
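The rules above can be written as a small lookup. A Python sketch in which the function name, the argument names, and the order in which the rules are checked are all assumptions made for illustration:

```python
def suggest_structure(repeated_measures, equally_spaced,
                      adjacent_only=False, dense_large_n=False):
    """Map the decision rules to a corr() keyword (illustrative only)."""
    if not repeated_measures:
        return "ind"      # cross-sectional clusters only
    if dense_large_n:
        return "uns"      # dense, high-frequency data with large N
    if adjacent_only:
        return "sta1"     # immediate-neighbor dependence only
    if equally_spaced:
        return "ar1"      # time-ordered, equally spaced
    return "exc"          # sparse or irregular follow-ups


print(suggest_structure(repeated_measures=True, equally_spaced=False))  # exc
print(suggest_structure(repeated_measures=True, equally_spaced=True))   # ar1
```

The suggestion is only a starting point; the rules in the list remain the authority when two conditions apply at once.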
Correlation Matrix Simulation — GEE Working Correlation Structures
Data: Repeated measures on 5 time points (T1–T5)
🔴 corr(independent)
Assumes no within-subject correlation (ρ = 0) — each time point is independent.
Time | T1 | T2 | T3 | T4 | T5 |
T1 | 1.00 | 0.00 | 0.00 | 0.00 | 0.00 |
T2 | 0.00 | 1.00 | 0.00 | 0.00 | 0.00 |
T3 | 0.00 | 0.00 | 1.00 | 0.00 | 0.00 |
T4 | 0.00 | 0.00 | 0.00 | 1.00 | 0.00 |
T5 | 0.00 | 0.00 | 0.00 | 0.00 | 1.00 |
🟢 corr(exchangeable)
Assumes constant correlation between all pairs (ρ = 0.80)
Time | T1 | T2 | T3 | T4 | T5 |
T1 | 1.00 | 0.80 | 0.80 | 0.80 | 0.80 |
T2 | 0.80 | 1.00 | 0.80 | 0.80 | 0.80 |
T3 | 0.80 | 0.80 | 1.00 | 0.80 | 0.80 |
T4 | 0.80 | 0.80 | 0.80 | 1.00 | 0.80 |
T5 | 0.80 | 0.80 | 0.80 | 0.80 | 1.00 |
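The exchangeable pattern is easy to generate for any number of time points. A minimal Python sketch that rebuilds the matrix above; the function name exchangeable_corr is hypothetical, and Stata builds this matrix internally rather than through code like this:

```python
def exchangeable_corr(n_times, rho):
    """Compound-symmetry matrix: 1 on the diagonal, rho everywhere else."""
    return [[1.0 if i == j else rho for j in range(n_times)]
            for i in range(n_times)]


R_exc = exchangeable_corr(5, 0.80)
for row in R_exc:
    print(" | ".join(f"{value:.2f}" for value in row))
```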
🟡 corr(ar1)
Assumes correlation decays by lag (ρ = 0.90, then ρ² = 0.81, ρ³ = 0.73…)
Time | T1 | T2 | T3 | T4 | T5 |
T1 | 1.00 | 0.90 | 0.81 | 0.73 | 0.66 |
T2 | 0.90 | 1.00 | 0.90 | 0.81 | 0.73 |
T3 | 0.81 | 0.90 | 1.00 | 0.90 | 0.81 |
T4 | 0.73 | 0.81 | 0.90 | 1.00 | 0.90 |
T5 | 0.66 | 0.73 | 0.81 | 0.90 | 1.00 |
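Each entry in the AR(1) matrix is ρ raised to the absolute lag between the two time points. A minimal Python sketch reproducing the values above; the function name ar1_corr is hypothetical:

```python
def ar1_corr(n_times, rho):
    """AR(1) matrix: entry (i, j) equals rho ** |i - j|."""
    return [[rho ** abs(i - j) for j in range(n_times)]
            for i in range(n_times)]


R_ar1 = ar1_corr(5, 0.90)
for row in R_ar1:
    print(" | ".join(f"{value:.2f}" for value in row))
# first row: 1.00 | 0.90 | 0.81 | 0.73 | 0.66
```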
🔵 corr(sta1)
Only adjacent timepoints are correlated (lag-1), ρ = 0.50 (with five time points, a lag-1-only matrix is a valid correlation matrix only for |ρ| up to about 0.58)
Time | T1 | T2 | T3 | T4 | T5 |
T1 | 1.00 | 0.50 | 0.00 | 0.00 | 0.00 |
T2 | 0.50 | 1.00 | 0.50 | 0.00 | 0.00 |
T3 | 0.00 | 0.50 | 1.00 | 0.50 | 0.00 |
T4 | 0.00 | 0.00 | 0.50 | 1.00 | 0.50 |
T5 | 0.00 | 0.00 | 0.00 | 0.50 | 1.00 |
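A minimal Python sketch of this banded pattern. It also computes the largest |ρ| for which a lag-1-only matrix is still a valid (positive semi-definite) correlation matrix, using the known eigenvalues of a tridiagonal Toeplitz matrix. The function names and the value ρ = 0.50 are assumptions chosen for illustration:

```python
import math


def stationary1_corr(n_times, rho):
    """Lag-1 (stationary, m = 1) working correlation matrix."""
    def entry(i, j):
        if i == j:
            return 1.0
        return rho if abs(i - j) == 1 else 0.0
    return [[entry(i, j) for j in range(n_times)] for i in range(n_times)]


def max_abs_rho_lag1(n_times):
    """Largest |rho| keeping the lag-1 matrix positive semi-definite.

    The eigenvalues are 1 + 2 * rho * cos(k * pi / (n_times + 1)) for
    k = 1..n_times, so the smallest one is non-negative up to this bound.
    """
    return 1.0 / (2.0 * math.cos(math.pi / (n_times + 1)))


R_sta1 = stationary1_corr(5, 0.50)   # 0.50 is an illustrative value
for row in R_sta1:
    print(" | ".join(f"{value:.2f}" for value in row))
print(round(max_abs_rho_lag1(5), 3))  # 0.577
```

The bound approaches 0.5 as the number of time points grows, which is why lag-1-only structures pair with moderate correlations.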
🟣 corr(uns)
No assumptions — each pair has its own unique correlation
Time | T1 | T2 | T3 | T4 | T5 |
T1 | 1.00 | 0.86 | 0.74 | 0.55 | 0.33 |
T2 | 0.86 | 1.00 | 0.71 | 0.49 | 0.40 |
T3 | 0.74 | 0.71 | 1.00 | 0.69 | 0.50 |
T4 | 0.55 | 0.49 | 0.69 | 1.00 | 0.66 |
T5 | 0.33 | 0.40 | 0.50 | 0.66 | 1.00 |