Types of ICC (intraclass correlation coefficient): One-Way, Two-Way Random, and Mixed Effects Explained

1 One-way Random Effects: ICC(1) / ICC(1,1) / ICC(1,k)
Concept
- Subjects are measured by raters who are treated as random noise.
- Raters are not modeled explicitly (no rater factor in the model).
- Often used when not all subjects are rated by the same raters, or the design is naturally unbalanced.
When to use
- Different raters may rate different subjects; raters are interchangeable and loosely organized.
- You want reliability across this general rater pool, but you cannot guarantee all raters see all subjects.
Typical examples
- Different nurses measuring BP on different patients in a ward.
- Rotating lab techs performing lab tests, with no strict “all raters rate all samples” structure.
What problem it solves
- Handles unbalanced and messy rating patterns.
- Gives you a reliability estimate generalizable to the rater population, even when design is not fully crossed.
Stata code (built-in icc)
* One-way random: list only the outcome and the subject id (no rater variable).
* icc has no oneway, single, or average options; one call reports both
* the individual ICC(1,1) and the average ICC(1,k).
icc measure id
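To make the one-way estimate less of a black box, here is a minimal Python sketch (Python rather than Stata, purely as a cross-check) of the standard one-way ANOVA formulas ICC(1,1) = (MSB - MSW) / (MSB + (k - 1) * MSW) and ICC(1,k) = (MSB - MSW) / MSB. The function name `icc_oneway` is hypothetical, the sketch assumes a balanced table with k ratings per subject, and the data are the six-subject, four-rating example published by Shrout and Fleiss (1979).

```python
def icc_oneway(ratings):
    """One-way random ICCs from a balanced table.

    ratings: list of rows, one per subject, each holding k ratings.
    Column position carries no meaning here: raters are not modeled.
    Returns (ICC(1,1), ICC(1,k)).
    """
    n = len(ratings)       # number of subjects
    k = len(ratings[0])    # ratings per subject
    grand_mean = sum(sum(row) for row in ratings) / (n * k)
    subject_means = [sum(row) / k for row in ratings]

    # Between-subject and within-subject mean squares (one-way ANOVA).
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for row, m in zip(ratings, subject_means) for x in row
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    average = (ms_between - ms_within) / ms_between
    return single, average


# Shrout & Fleiss (1979) example: 6 subjects, 4 ratings each.
SF_DATA = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc_1_1, icc_1_k = icc_oneway(SF_DATA)
print(f"ICC(1,1) = {icc_1_1:.2f}, ICC(1,k) = {icc_1_k:.2f}")
# prints "ICC(1,1) = 0.17, ICC(1,k) = 0.44"
```

Because the rater columns are never used as a factor, any systematic rater difference ends up inside the within-subject mean square, which is why the one-way values are the lowest of the three models for these data.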
2 Two-way Random Effects: ICC(2,1) / ICC(2,k)
Concept
- All subjects are rated by all raters.
- Raters are treated as a random sample from a larger population of similar raters.
- You want interchangeability: any trained clinician could, in principle, replace the current raters.
When to use
- Fully crossed design: each subject rated by every rater.
- You want to generalize reliability to other raters not in your study (e.g., GPs, other radiologists).
Typical examples
- 4 radiologists all reading the same CT scans; you want a reliability estimate that holds for other radiologists or GPs in future.
- Multiple lab technicians all running the same specimens; any tech in future may run the test.
What problem it solves
- Explicitly models:
- Between-subject variance
- Between-rater variance
- Residual (subject×rater) interaction / error
- With absolute agreement, systematic differences between raters (e.g., rater A always scores higher) are treated as error, which is important if raters are supposed to be interchangeable.
Interpretation
“If a different clinician/GP/radiologist were to rate these patients, how reliable would the scores be?”
Stata code (built-in icc)
* Two-way random, absolute agreement: listing a rater variable without
* the mixed option fits the two-way random model. One call reports both
* the individual ICC(2,1) and the average ICC(2,k).
icc measure id rater, absolute
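The three variance sources listed above map onto three mean squares from a two-way ANOVA without replication. The following Python sketch (not Stata; function names are hypothetical) computes them and applies the usual absolute-agreement formulas, ICC(2,1) = (MSR - MSE) / (MSR + (k - 1) * MSE + k * (MSC - MSE) / n) and ICC(2,k) = (MSR - MSE) / (MSR + (MSC - MSE) / n). It assumes a fully crossed table with no missing cells and reuses the Shrout and Fleiss (1979) example data.

```python
def two_way_mean_squares(ratings):
    """Mean squares for a fully crossed subjects x raters table.

    ratings: list of rows (subjects); each row lists the k raters' scores
    in the same rater order. Returns (ms_subjects, ms_raters, ms_error).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_subjects = k * sum((m - grand) ** 2 for m in row_means)
    ss_raters = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_subjects - ss_raters

    return (
        ss_subjects / (n - 1),
        ss_raters / (k - 1),
        ss_error / ((n - 1) * (k - 1)),
    )


def icc_two_way_random_absolute(ratings):
    """Return (ICC(2,1), ICC(2,k)) under absolute agreement."""
    n, k = len(ratings), len(ratings[0])
    msr, msc, mse = two_way_mean_squares(ratings)
    single = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    average = (msr - mse) / (msr + (msc - mse) / n)
    return single, average


# Shrout & Fleiss (1979) example: 6 subjects rated by the same 4 raters.
SF_DATA = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

icc_2_1, icc_2_k = icc_two_way_random_absolute(SF_DATA)
print(f"ICC(2,1) = {icc_2_1:.2f}, ICC(2,k) = {icc_2_k:.2f}")
# prints "ICC(2,1) = 0.29, ICC(2,k) = 0.62"
```

The rater mean square enters the denominator, so large systematic rater differences pull the absolute-agreement estimate down.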
3 Two-way Mixed Effects: ICC(3,1) / ICC(3,k)
Concept
- All subjects are rated by all raters, but raters are treated as fixed (not random).
- You care only about these specific raters, not about others in the future.
When to use
- You will not generalize to other raters; the “unit” being validated is the specific rater team.
- Typical for expert panels or named specialists.
Typical examples
- Only 2 expert cardiologists will read all echoes in the trial and in clinical practice.
- A central adjudication committee rating all events, and you only care about this exact committee.
What problem it solves
- Treats raters as fixed rather than sampled, so the estimate describes these raters only and does not generalize to a wider rater population.
- With consistency, systematic differences between raters (one is always higher) are not counted as error; you only care about whether patients are ranked consistently.
Interpretation
“Given these exact raters, how reproducible are their scores?” Not: “What happens if a different rater reads the images?”
Stata code (built-in icc)
* Two-way mixed, absolute agreement: the mixed option selects the model.
* One call reports both the individual ICC(3,1) and the average ICC(3,k).
icc measure id rater, mixed absolute
* Two-way mixed, consistency version (individual and average in one output)
icc measure id rater, mixed consistency
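A small Python sketch (not Stata) can show what "systematic differences are not counted as error" means in practice. It computes the single-rating consistency and absolute-agreement ICCs from two-way mean squares, then adds a constant 3 points to one rater. The function name and the scores are hypothetical, and a fully crossed table is assumed.

```python
def two_way_icc_single(ratings):
    """Single-rating ICCs for a fully crossed subjects x raters table.

    Returns (consistency, absolute_agreement).
    """
    n, k = len(ratings), len(ratings[0])
    grand = sum(map(sum, ratings)) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)

    msr = ss_rows / (n - 1)   # between subjects
    msc = ss_cols / (k - 1)   # between raters
    mse = (ss_total - ss_rows - ss_cols) / ((n - 1) * (k - 1))  # residual

    consistency = (msr - mse) / (msr + (k - 1) * mse)
    absolute = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
    return consistency, absolute


# Hypothetical scores: 5 patients, 2 raters.
scores = [[10, 11], [12, 12], [14, 15], [16, 15], [18, 19]]
# Same scores, but the second rater now reads 3 points higher throughout.
shifted = [[a, b + 3] for a, b in scores]

cons_before, abs_before = two_way_icc_single(scores)
cons_after, abs_after = two_way_icc_single(shifted)
print(f"consistency: {cons_before:.2f} -> {cons_after:.2f}")
print(f"absolute:    {abs_before:.2f} -> {abs_after:.2f}")
# prints "consistency: 0.96 -> 0.96" then "absolute:    0.96 -> 0.61"
```

The constant offset changes only the rater mean square, which the consistency formula never uses, so the ranking-based estimate is untouched while the absolute-agreement estimate drops.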
🔁 Single vs Average: (1,1) vs (1,k); (2,1) vs (2,k); (3,1) vs (3,k)
- (·,1) → reliability of one rating from one rater.
- (·,k) → reliability of the mean of k raters; this is higher than the single-rating value whenever that value is positive, because averaging reduces error.
Example:
- ICC(2,1) = reliability of a single reader (anyone from the rater pool).
- ICC(2,k) = reliability of the average score of all readers.
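The size of the gain from averaging follows the Spearman-Brown relation, ICC_k = k * ICC_1 / (1 + (k - 1) * ICC_1), under the usual variance-components definitions. Here is a minimal Python sketch; the function name is hypothetical and the starting value 0.29 is only an illustrative single-rating ICC.

```python
def spearman_brown(icc_single, k):
    """Reliability of the mean of k ratings, given the single-rating ICC."""
    return k * icc_single / (1 + (k - 1) * icc_single)


# Illustrative single-rating ICC of 0.29, averaged over more and more raters.
for k in (1, 2, 4, 8):
    print(f"k={k}: {spearman_brown(0.29, k):.2f}")
# prints:
# k=1: 0.29
# k=2: 0.45
# k=4: 0.62
# k=8: 0.77
```

The average-rating value is only meaningful if the mean of k raters is what will actually be used in practice; reporting ICC(·,k) for a protocol that relies on one rater overstates the reliability.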
📌 Absolute Agreement vs Consistency
These are about what counts as “error” in the denominator:
🔹 Absolute Agreement (aka ICC_agreement, like in de Vet’s paper)
- Counts systematic differences between raters as part of error.
- Question: “Do raters give exactly the same scores?”
- Appropriate when raters are supposed to be interchangeable and absolute scale matters (e.g., BP, ROM in degrees).
🔹 Consistency (aka ICC_consistency)
- Ignores systematic mean differences between raters; focuses on relative ordering.
- Question: “Do raters rank patients similarly, even if one tends to score higher overall?”
- Useful if only ranking/ordering matters, not exact values.
In de Vet et al., the formulas are:
\[\text{ICC}_{\text{agreement}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pt}^2 + \sigma_{\text{residual}}^2}\]
\[\text{ICC}_{\text{consistency}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{\text{residual}}^2}\]
where:
- \(\sigma_p^2\): variance between persons
- \(\sigma_{pt}^2\): systematic variance between raters
- \(\sigma_{\text{residual}}^2\): residual (error) variance
Absolute agreement includes \(\sigma_{pt}^2\) as error; consistency does not.
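As a quick numeric check of the two formulas, the Python sketch below plugs in made-up variance components (8 between persons, 2 between raters, 2 residual). The function name and the numbers are hypothetical and are not taken from de Vet et al.

```python
def icc_from_components(var_p, var_pt, var_residual):
    """Return (ICC_agreement, ICC_consistency) from variance components.

    var_p: variance between persons
    var_pt: systematic variance between raters
    var_residual: residual (error) variance
    """
    agreement = var_p / (var_p + var_pt + var_residual)
    consistency = var_p / (var_p + var_residual)
    return agreement, consistency


# Hypothetical variance components, chosen only for easy arithmetic.
agreement, consistency = icc_from_components(8.0, 2.0, 2.0)
print(f"ICC_agreement = {agreement:.2f}")      # 8 / 12
print(f"ICC_consistency = {consistency:.2f}")  # 8 / 10
# prints "ICC_agreement = 0.67" then "ICC_consistency = 0.80"
```

The two values coincide only when the rater variance is zero; otherwise the agreement version is the smaller of the two.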
🎯 ICC vs Agreement: What Question Are You Answering?
From de Vet et al.:
- Reliability (ICC) answers: “Can we distinguish between subjects despite measurement error?” It relates measurement error to between-subject variance, so it depends heavily on sample heterogeneity.
- Agreement (SEM, Limits of Agreement, % within ±X, kappa) answers: “How close are repeated measurements to each other?” It focuses directly on measurement error, independent of how diverse the subjects are.
Key implications:
- ICC can be high in a very heterogeneous sample (big between-person variance), even with substantial error.
- Agreement indices (e.g., SEM, LoA) are more stable across samples and better for monitoring change over time or detecting small clinical changes.
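The dependence on sample heterogeneity is easy to see numerically. In the Python sketch below, two hypothetical samples share the same error variance (4) but differ in between-subject variance (36 versus 4); the function name and all numbers are illustrative assumptions.

```python
import math


def icc_and_sem(var_between, var_error):
    """Return (ICC, SEM) for given between-subject and error variances."""
    icc = var_between / (var_between + var_error)
    sem = math.sqrt(var_error)
    return icc, sem


# Same instrument and same noise, two hypothetical patient samples.
icc_wide, sem_wide = icc_and_sem(var_between=36.0, var_error=4.0)
icc_narrow, sem_narrow = icc_and_sem(var_between=4.0, var_error=4.0)

print(f"heterogeneous sample: ICC = {icc_wide:.2f}, SEM = {sem_wide:.1f}")
print(f"homogeneous sample:   ICC = {icc_narrow:.2f}, SEM = {sem_narrow:.1f}")
# prints "heterogeneous sample: ICC = 0.90, SEM = 2.0"
# then   "homogeneous sample:   ICC = 0.50, SEM = 2.0"
```

The measurement error is identical in both samples, yet the ICC moves from 0.90 to 0.50, which is why an ICC should be reported together with the spread of the sample it came from.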
🧮 Agreement Side: SEM, SDC, Limits of Agreement
These are not ICC but frequently paired with it:
- SEM (Standard Error of Measurement): \[\text{SEM} = \sqrt{\text{error variance}}\] Expressed on the original scale (e.g., kg, degrees). Good for saying: “Typical noise around a single measurement is about X units.”
- SDC (Smallest Detectable Change): \[\text{SDC} = 1.96 \times \sqrt{2} \times \text{SEM}\] Smallest change that reliably exceeds measurement error for a single patient.
- Limits of Agreement (Bland–Altman): \[\text{Mean difference} \pm 1.96 \times SD_{\text{difference}}\] Shows the range where 95% of differences between methods/raters fall.
These are agreement measures, not reliability per se, and are recommended when the research focus is evaluating change (e.g., pre–post treatment).
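The three quantities can be computed in a few lines. This Python sketch uses a hypothetical error variance of 4 and made-up paired readings from two raters; the function names are assumptions, and the sample standard deviation (n - 1 denominator) is used for the differences.

```python
import math
import statistics


def sdc_from_sem(sem):
    """Smallest detectable change for one patient: 1.96 * sqrt(2) * SEM."""
    return 1.96 * math.sqrt(2) * sem


def limits_of_agreement(first, second):
    """Bland-Altman 95% limits for paired readings (first minus second)."""
    diffs = [a - b for a, b in zip(first, second)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample SD of the differences
    return mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff


# Hypothetical error variance (for example, from an ANOVA or mixed model).
sem = math.sqrt(4.0)
sdc = sdc_from_sem(sem)
print(f"SEM = {sem:.2f}, SDC = {sdc:.2f}")
# prints "SEM = 2.00, SDC = 5.54"

# Made-up systolic BP readings (mmHg) from two raters on the same 5 patients.
rater_a = [120, 132, 128, 141, 150]
rater_b = [118, 135, 126, 140, 153]
lower, upper = limits_of_agreement(rater_a, rater_b)
print(f"Limits of agreement: {lower:.2f} to {upper:.2f}")
# prints "Limits of agreement: -5.27 to 4.87"
```

All three results are in the units of the measurement itself, which is what makes them directly comparable with a clinically relevant change.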
🛠 Stata Implementation Summary
Built-in icc (for continuous outcomes). The model is set by whether a rater variable is listed and by the mixed option; icc has no oneway, random, single, or average options, and each call reports both the individual (single) and the average ICC.
* One-way random: ICC(1,1) and ICC(1,k)
icc measure id
* Two-way random, absolute agreement: ICC(2,1) and ICC(2,k)
icc measure id rater, absolute
* Two-way mixed, absolute agreement: ICC(3,1) and ICC(3,k)
icc measure id rater, mixed absolute
* Consistency versions (exclude systematic rater differences)
icc measure id rater, consistency
icc measure id rater, mixed consistency
🔍 kappaetc vs icc
- icc (built-in Stata)
- Can compute ICC(1), ICC(2), ICC(3).
- Supports one-way, two-way random, two-way mixed, with absolute and consistency, single or average.
- kappaetc (user-written)
- Excellent for kappa, AC1/AC2, specific agreement, and sometimes one-way ICC (ICC(1,k)) depending on options.
- May not compute the full set of two-way structures, such as ICC(3,1) under absolute agreement or other ICC(3,·) / ICC(2,·) variants, depending on the installed version; an unsupported request produces errors like: option icc() invalid r(198)
So:
- Use icc for full ICC modeling (1-way, 2-way random, 2-way mixed).
- Use kappaetc when your focus is on categorical agreement (kappa family) or simple one-way ICC structures.
Summary Table with Rationale + Fixes
| ICC Model | Use When | Why Use (Statistical Issue It Fixes) | Example |
| ICC(1,1) One-way random | Different raters rate different subjects; raters random | Handles unbalanced ratings; generalizable to population of raters | Rotating lab techs |
| ICC(2,1) Two-way random, absolute agreement | All raters rate all subjects; raters random & interchangeable | Counts rater mean differences as error, so systematic rater bias lowers the estimate; reliability generalizes to other raters | Multiple radiologists |
| ICC(3,1) Two-way mixed, consistency | All raters rate all subjects; raters fixed | Leaves rater mean differences out of the error term; describes consistency for these specific raters only | Two cardiologists in RCT |
What Each ICC Model Fixes
ICC(1) — Fixes problems with:
- Random, inconsistent assignment of raters
- Unbalanced datasets
- Raters who cannot be identified or matched across subjects
ICC(2) — Fixes problems with:
- Systematic rater bias (differences in scoring levels)
- Need to generalize reliability to any future rater
- Reliability estimates that would otherwise leave rater bias out of the measurement error
ICC(3) — Fixes problems with:
- Rater mean differences being counted as error when only these raters matter (consistency form)
- Focused studies where only specific raters matter
- Need to isolate reproducibility of a rater set
ICC Note
1 One-way Random Effects → ICC(1) / ICC(1,1)
- Use when the raters are not the same for every subject, or not every rater reads every case.
- rater = random → any rater may be picked; raters are not paired so that everyone reads every subject.
- Suited to field data, rotating lab techs, or multiple nurses.
- Does not address rater bias, because raters are treated as random noise.
2 Two-way Random Effects → ICC(2,1)
- Use when every rater reads every subject.
- rater = random, and you want to generalize to other raters.
- E.g., radiologists A and B read today; tomorrow a general physician, a GP, or another radiologist must be able to read as well.
- Addresses systematic bias between raters (absolute agreement).
Suited to clinical measurements where “anyone can read” and the result must still be reliable.
3 Two-way Mixed Effects → ICC(3,1)
- Use when raters are fixed → it must be “these people only”.
- Cannot be generalized; does not apply to other radiologists.
- Use when only this specific set of raters will do the assessment, e.g., expert panel, 2 cardiologists.
- Does not include rater bias in the denominator (consistency).
Suited to work that uses only this set of raters, both in the study and in actual practice.
Clinical Rule of Thumb (from COSMIN + de Vet)
✔ Inter-rater reliability (generalizable) → ICC(2)
✔ Test–retest reliability (same assessor) → ICC(3)
✔ Multi-center clinical measurement studies → ICC(2)
✔ When only designated raters will ever perform scoring → ICC(3)
✔ When raters vary randomly → ICC(1)

🧭 Step 1 – What kind of reliability?
- Test–retest / Intra-rater reliability
- Same rater (or fixed team) measuring the same subjects at different times. ➜ Go to Step 2A (Two-way mixed).
- Inter-rater reliability
- Different raters assessing the same subjects (usually at one time). ➜ Go to Step 2B.
🧭 Step 2A – Test–retest / Intra-rater → usually Two-way mixed
Here the rater is specific, not a random sample.
2A-1. Model
- Use Two-way mixed effects → this leads to ICC(3,·)
2A-2. Intended measurement protocol?
- If you will use a single measurement in practice → choose ICC(3,1)
- If you will use the mean of k repeated measurements (e.g. average of 2 readings) → choose ICC(3,k)
2A-3. Definition
- For test–retest / intra-rater, we almost always care about absolute agreement → ICC(3,1) absolute or ICC(3,k) absolute
Stata examples
* Test–retest: here the rater variable identifies the repeated occasion
* (or the fixed rater). One call reports both the single-measurement
* ICC(3,1) and the mean-of-k ICC(3,k).
icc measure id rater, mixed absolute
🧭 Step 2B – Inter-rater reliability
2B-1. Did the same set of raters rate all subjects?
✅ YES → Go to 2B-2
❌ NO → One-way random effects → ICC(1,·)
If NO (raters differ across subjects; unbalanced design):
- Model: One-way random effects → ICC(1,·)
- Protocol:
- Single rating per subject → ICC(1,1)
- Mean of k ratings per subject → ICC(1,k)
- Definition: absolute agreement (the one-way model has no separate rater term, so a consistency version is not defined)
Stata
* One-way random: no rater variable is listed.
* One call reports both ICC(1,1) and ICC(1,k).
icc measure id
2B-2. If YES: same raters for all subjects
Ask: How do you conceptualize the raters?
- Random / interchangeable raters? (You want to generalize to other similar raters, e.g. any GP or radiologist.) ➜ Two-way random effects → ICC(2,·)
- Specific raters only? (Only this panel or these named experts will ever rate.) ➜ Two-way mixed effects → ICC(3,·)
🧭 Step 3 – Choose Random vs Mixed branch
🔹 Two-way random effects → ICC(2,·)
Use when:
- Same raters rate all subjects, and
- Raters are a random sample of a larger population (you want generalizability).
Protocol:
- Single rater used in practice → ICC(2,1)
- Mean of k raters used in practice → ICC(2,k)
Agreement vs consistency?
- If exact values must match → absolute agreement
- If only ranking matters → consistency
Stata
* Two-way random, absolute agreement: reports both ICC(2,1) and ICC(2,k)
icc measure id rater, absolute
* Consistency version (if ranking matters more than exact values)
icc measure id rater, consistency
🔹 Two-way mixed effects → ICC(3,·)
Use when:
- Same raters rate all subjects, and
- Raters are fixed (only these raters matter; no generalization to others).
Protocol:
- Single rater used → ICC(3,1)
- Mean of k raters used → ICC(3,k)
Agreement vs consistency?
- Absolute agreement → penalizes systematic differences between raters
- Consistency → ignores mean differences; focuses on ranking only
Stata
* Two-way mixed, absolute agreement: reports both ICC(3,1) and ICC(3,k)
icc measure id rater, mixed absolute
* Consistency version
icc measure id rater, mixed consistency
🔚 Ultra-Short Cheat Table
| Situation | Model | ICC Type |
| Test–retest / intra-rater | Two-way mixed | ICC(3,1) or ICC(3,k), usually absolute |
| Inter-rater, raters differ across subjects | One-way random | ICC(1,1) or ICC(1,k), absolute |
| Inter-rater, same raters, want to generalize to other raters | Two-way random | ICC(2,1) or ICC(2,k), absolute or consistency |
| Inter-rater, same fixed raters only | Two-way mixed | ICC(3,1) or ICC(3,k), absolute or consistency |
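The decision steps and the cheat table can be condensed into a small lookup. The Python sketch below is one possible encoding, with a hypothetical function name and argument names; it only restates the rules given in these notes and does not replace judgment about the study design.

```python
def choose_icc(reliability, same_raters_all_subjects=True,
               raters_random=True, use_mean=False):
    """Suggest an ICC form following the decision steps in these notes.

    reliability: "test-retest" (also covers intra-rater) or "inter-rater".
    same_raters_all_subjects: True if the design is fully crossed.
    raters_random: True to generalize to other raters, False if fixed.
    use_mean: True if the mean of k ratings is used in practice.
    """
    form = "k" if use_mean else "1"
    if reliability == "test-retest":
        return f"ICC(3,{form}), two-way mixed, absolute agreement"
    if reliability != "inter-rater":
        raise ValueError("reliability must be 'test-retest' or 'inter-rater'")
    if not same_raters_all_subjects:
        return f"ICC(1,{form}), one-way random, absolute agreement"
    if raters_random:
        return f"ICC(2,{form}), two-way random, absolute or consistency"
    return f"ICC(3,{form}), two-way mixed, absolute or consistency"


print(choose_icc("inter-rater", same_raters_all_subjects=False))
# prints "ICC(1,1), one-way random, absolute agreement"
print(choose_icc("inter-rater", raters_random=True, use_mean=True))
# prints "ICC(2,k), two-way random, absolute or consistency"
print(choose_icc("test-retest"))
# prints "ICC(3,1), two-way mixed, absolute agreement"
```

Where the result says "absolute or consistency", the choice still depends on whether exact values or only the ranking of subjects must match.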