Types of ICC (intraclass correlation coefficient): One-Way, Two-Way Random, and Mixed Effects Explained
- Mayta
1 One-way Random Effects: ICC(1) / ICC(1,1) / ICC(1,k)
Concept
Subjects are measured by raters whose effects are folded into random error (noise).
Raters are not modeled explicitly (no rater factor in the model).
Often used when not all subjects are rated by the same raters, or the design is naturally unbalanced.
When to use
Different raters may rate different subjects; raters are interchangeable and loosely organized.
You want reliability across this general rater pool, but you cannot guarantee all raters see all subjects.
Typical examples
Different nurses measuring BP on different patients in a ward.
Rotating lab techs performing lab tests, with no strict “all raters rate all samples” structure.
What problem it solves
Handles unbalanced and messy rating patterns.
Gives you a reliability estimate that generalizes to the rater population, even when the design is not fully crossed (see the variance-component sketch below).
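In variance-component terms, the one-way model cannot separate rater variance from residual error, so the two are pooled in the denominator. A minimal sketch of the single-measurement form (notation follows the de Vet-style formulas later in this post):
\[
\text{ICC}(1,1) = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{within}^2},
\qquad
\sigma_{within}^2 = \sigma_{rater}^2 + \sigma_{residual}^2 \ \text{(confounded in the one-way model)}
\]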
Stata code (built-in icc)
* One-way random effects: omit the rater variable; the output reports both
* the individual ICC(1,1) and the average-measurement ICC(1,k)
icc measure id
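The command assumes long-format data with one row per rating. A minimal runnable sketch with hypothetical values (the variable names measure and id match the commands above):
* Hypothetical data: long layout, one row per rating, two ratings per subject;
* no rater identifier is needed for the one-way model
clear
input id measure
1 10.2
1 10.8
2 14.1
2 13.5
3  9.0
3  9.6
4 12.4
4 11.9
end
icc measure id   // reports ICC(1,1) and ICC(1,k)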
2 Two-way Random Effects: ICC(2,1) / ICC(2,k)
Concept
All subjects are rated by all raters.
Raters are treated as a random sample from a larger population of similar raters.
You want interchangeability: any trained clinician could, in principle, replace the current raters.
When to use
Fully crossed design: each subject rated by every rater.
You want to generalize reliability to other raters not in your study (e.g., GPs, other radiologists).
Typical examples
4 radiologists all reading the same CT scans; you want a reliability estimate that holds for other radiologists or GPs in future.
Multiple lab technicians all running the same specimens; any tech in future may run the test.
What problem it solves
Explicitly models:
Between-subject variance
Between-rater variance
Residual (subject×rater) interaction / error
With absolute agreement, systematic differences between raters (e.g., rater A always scores higher) are treated as error, which is important if raters are supposed to be interchangeable.
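In variance-component form (the same components used in the de Vet formulas later in this post), the single-measurement, absolute-agreement version keeps rater variance in the denominator:
\[
\text{ICC}(2,1)_{\text{agreement}} = \frac{\sigma_{subject}^2}{\sigma_{subject}^2 + \sigma_{rater}^2 + \sigma_{residual}^2}
\]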
Interpretation
“If a different clinician/GP/radiologist were to rate these patients, how reliable would the scores be?”
Stata code (built-in icc)
* Two-way random effects (the default when a rater variable is supplied),
* absolute agreement; one run reports both ICC(2,1) and ICC(2,k)
icc measure id rater, absolute
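Ratings from a fully crossed design are often entered wide (one column per rater). A minimal sketch, with hypothetical values, of getting from that layout to the long form the icc command expects:
* Hypothetical wide layout: one row per subject, one column per rater
clear
input id measure1 measure2 measure3
1 10.2 10.8 10.5
2 14.1 13.5 13.9
3  9.0  9.6  9.3
4 12.4 11.9 12.6
end
* Reshape to long form: one row per subject-rater pair
reshape long measure, i(id) j(rater)
* Two-way random effects, absolute agreement: ICC(2,1) and ICC(2,k)
icc measure id rater, absolute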
3 Two-way Mixed Effects: ICC(3,1) / ICC(3,k)
Concept
All subjects are rated by all raters, but raters are treated as fixed (not random).
You care only about these specific raters, not about others in the future.
When to use
You will not generalize to other raters; the “unit” being validated is the specific rater team.
Typical for expert panels or named specialists.
Typical examples
Only 2 expert cardiologists will read all echoes in the trial and in clinical practice.
A central adjudication committee rating all events, and you only care about this exact committee.
What problem it solves
Excludes rater population variance from the denominator (raters are fixed, not sampled).
With consistency, systematic differences between raters (one is always higher) are not counted as error; you only care about whether patients are ranked consistently.
Interpretation
“Given these exact raters, how reproducible are their scores?” Not: “What happens if a different rater reads the images?”
Stata code (built-in icc)
* Two-way mixed effects, absolute agreement; one run reports both
* ICC(3,1) and ICC(3,k)
icc measure id rater, mixed absolute
* Consistency version (systematic rater differences not counted as error)
icc measure id rater, mixed consistency
🔁 Single vs Average: (1,1) vs (1,k); (2,1) vs (2,k); (3,1) vs (3,k)
(·,1) → reliability of one rating from one rater.
(·,k) → reliability of the mean of k ratings; this is higher, because averaging reduces random error (see the Spearman–Brown sketch below).
Example:
ICC(2,1) = reliability of a single reader (anyone from the rater pool).
ICC(2,k) = reliability of the average score of all readers.
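A minimal sketch of the Spearman–Brown (stepped-up) relationship that links the single-rating and average-rating forms for a given model and definition:
\[
\text{ICC}(\cdot,k) = \frac{k\,\text{ICC}(\cdot,1)}{1 + (k-1)\,\text{ICC}(\cdot,1)}
\qquad\text{e.g. } \text{ICC}(\cdot,1)=0.60,\ k=3 \;\Rightarrow\; \text{ICC}(\cdot,3)=\frac{1.8}{2.2}\approx 0.82
\]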
📌 Absolute Agreement vs Consistency
These are about what counts as “error” in the denominator:
🔹 Absolute Agreement (aka ICC_agreement, like in de Vet’s paper)
Counts systematic differences between raters as part of error.
Question:
“Do raters give exactly the same scores?”
Appropriate when raters are supposed to be interchangeable and absolute scale matters (e.g., BP, ROM in degrees).
🔹 Consistency (aka ICC_consistency)
Ignores systematic mean differences between raters; focuses on relative ordering.
Question:
“Do raters rank patients similarly, even if one tends to score higher overall?”
Useful if only ranking/ordering matters, not exact values.
In de Vet et al., the formulas are:
\[
\text{ICC}_{\text{agreement}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pt}^2 + \sigma_{residual}^2}
\]
\[
\text{ICC}_{\text{consistency}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{residual}^2}
\]
where:
\(\sigma_p^2\): variance between persons
\(\sigma_{pt}^2\): systematic variance between raters
\(\sigma_{residual}^2\): residual (error) variance
Absolute agreement includes \(\sigma_{pt}^2\) as error; consistency does not.
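A worked example with assumed variance components, just to show the direction of the difference:
\[
\sigma_p^2 = 8,\ \sigma_{pt}^2 = 1,\ \sigma_{residual}^2 = 1:\qquad
\text{ICC}_{\text{agreement}} = \frac{8}{8+1+1} = 0.80,\qquad
\text{ICC}_{\text{consistency}} = \frac{8}{8+1} \approx 0.89
\]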
🎯 ICC vs Agreement: What Question Are You Answering?
From de Vet et al.:
Reliability (ICC) answers:
“Can we distinguish between subjects despite measurement error?” It relates measurement error to between-subject variance, so it depends heavily on sample heterogeneity.
Agreement (SEM, Limits of Agreement, %within ±X, kappa) answers:
“How close are repeated measurements to each other?” It focuses directly on measurement error, independent of how diverse the subjects are.
Key implications:
ICC can be high in a very heterogeneous sample (big between-person variance), even with substantial error.
Agreement indices (e.g., SEM, LoA) are more stable across samples and better for monitoring change over time or detecting small clinical changes.
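A small worked illustration with assumed numbers: hold the error variance fixed at 25 and change only how heterogeneous the sample is:
\[
\sigma_{error}^2 = 25:\qquad
\sigma_p^2 = 100 \Rightarrow \text{ICC} = \frac{100}{125} = 0.80;\qquad
\sigma_p^2 = 25 \Rightarrow \text{ICC} = \frac{25}{50} = 0.50
\]
The measurement error is identical in both samples, yet the ICC drops sharply because the second sample is more homogeneous; the SEM (\(\sqrt{25} = 5\) units) is unchanged.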
🧮 Agreement Side: SEM, SDC, Limits of Agreement
These are not ICC but frequently paired with it:
SEM (Standard Error of Measurement): \(SEM = \sqrt{\text{error variance}}\). Expressed on the original scale (e.g., kg, degrees). Good for saying: “Typical noise around a single measurement is about X units.”
SDC (Smallest Detectable Change): \(SDC = 1.96 \times \sqrt{2} \times SEM\). The smallest change that reliably exceeds measurement error for a single patient.
Limits of Agreement (Bland–Altman): \(\text{mean difference} \pm 1.96 \times SD_{\text{difference}}\). The range within which 95% of differences between methods/raters fall.
These are agreement measures, not reliability per se, and are recommended when the research focus is evaluating change (e.g., pre–post treatment).
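A worked example with assumed numbers, using the common shortcut \(SEM = SD \times \sqrt{1 - \text{ICC}}\), where SD is the between-subject standard deviation of the measurements:
\[
SD = 10,\ \text{ICC} = 0.90:\qquad
SEM = 10\sqrt{0.10} \approx 3.2,\qquad
SDC = 1.96 \times \sqrt{2} \times 3.2 \approx 8.8
\]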
🛠 Stata Implementation Summary
Built-in icc (for continuous outcomes):
* One-way random (ICC(1,1) and ICC(1,k)): omit the rater variable;
* the output reports both the individual and the average-measurement ICC
icc measure id
* Two-way random (ICC(2,1) and ICC(2,k)), absolute agreement
icc measure id rater, absolute
* Two-way mixed (ICC(3,1) and ICC(3,k)), absolute agreement
icc measure id rater, mixed absolute
* Consistency versions (exclude systematic rater differences)
icc measure id rater, consistency
icc measure id rater, mixed consistency
🔍 kappaetc vs icc
icc (built-in Stata)
Can compute ICC(1), ICC(2), ICC(3).
Supports one-way, two-way random, two-way mixed, with absolute and consistency, single or average.
kappaetc (user-written)
Excellent for kappa, AC1/AC2, specific agreement, and sometimes one-way ICC (ICC(1,k)) depending on options.
Cannot compute full two-way mixed absolute models like ICC(3,1) or other advanced ICC(3,·) / ICC(2,·) structures; hence errors like:
option icc() invalid r(198)
So:
Use icc for full ICC modeling (1-way, 2-way random, 2-way mixed).
Use kappaetc when your focus is on categorical agreement (kappa family) or simple one-way ICC structures.
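A minimal setup sketch, assuming (as is usual for user-written Stata commands) that kappaetc is distributed via SSC:
* Check for the user-written kappaetc and install it from SSC if missing
capture which kappaetc
if _rc ssc install kappaetc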
Summary Table with Rationale + Fixes
| ICC Model | Use When | Why Use (Statistical Issue It Fixes) | Example |
|---|---|---|---|
| ICC(1,1) One-way random | Different raters rate different subjects; raters random | Handles unbalanced rating patterns; generalizes to the population of raters | Rotating lab techs |
| ICC(2,1) Two-way random, absolute agreement | All raters rate all subjects; raters random and interchangeable | Counts systematic rater differences as error; reliability generalizes to other raters | Multiple radiologists |
| ICC(3,1) Two-way mixed, consistency | All raters rate all subjects; raters fixed | Ignores systematic differences among these specific raters; asks whether they rank subjects consistently | Two cardiologists in an RCT |
What Each ICC Model Fixes
ICC(1) — Fixes problems with:
Random, inconsistent assignment of raters
Unbalanced datasets
Designs where raters cannot be identified or fully crossed with subjects
ICC(2) — Fixes problems with:
Systematic rater bias (differences in scoring levels)
Need to generalize reliability to any future rater
Ensures the reliability estimate reflects all relevant sources of measurement error, including rater differences
ICC(3) — Fixes problems with:
Reliability being penalized for systematic differences among raters who are fixed by design
Focused studies where only specific raters matter
Need to isolate reproducibility of a rater set
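One way to see all three models at once is to compare their denominators under the usual variance-component decomposition (a sketch; notation as in the de Vet formulas above):
\[
\text{ICC}(1) = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{within}^2},\qquad
\text{ICC}(2)_{\text{agreement}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{rater}^2 + \sigma_{residual}^2},\qquad
\text{ICC}(3)_{\text{consistency}} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{residual}^2}
\]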
ICC Note (quick recap)
1 One-way Random Effects → ICC(1) / ICC(1,1)
Use when the raters are not the same for every subject / do not rate every case.
Rater = random → any rater may be drawn; raters are not paired so that everyone rates every subject.
Suited to field data, rotating lab techs, or many different nurses.
Does not address rater bias, because rater effects are treated as random noise.
2 Two-way Random Effects → ICC(2,1)
Use when every rater rates every subject.
Rater = random, and you want to generalize to other raters.
For example, radiologists A and B read today; tomorrow a GP or another radiologist must be able to read the scans too.
Addresses systematic bias between raters (absolute agreement).
Suited to clinical measurements where “anyone can do the reading” and the result must still be reliable.
3 Two-way Mixed Effects → ICC(3,1)
Use when raters are fixed → it must be “these raters only”.
Cannot generalize; results do not transfer to other radiologists.
Used when only this specific set of raters will do the assessment, e.g., an expert panel or 2 cardiologists.
Does not include rater bias in the denominator (consistency).
Suited to studies where only this set of raters is used, both in the study and in actual practice.
Clinical Rule of Thumb (from COSMIN + de Vet)
✔ Inter-rater reliability (generalizable) → ICC(2)
✔ Test–retest reliability (same assessor) → ICC(3)
✔ Multi-center clinical measurement studies → ICC(2)
✔ When only designated raters will ever perform scoring → ICC(3)
✔ When raters vary randomly → ICC(1)

🧭 Step 1 – What kind of reliability?
Test–retest / Intra-rater reliability
Same rater (or fixed team) measuring the same subjects at different times. ➜ Go to Step 2A (Two-way mixed).
Inter-rater reliability
Different raters assessing the same subjects (usually at one time). ➜ Go to Step 2B.
🧭 Step 2A – Test–retest / Intra-rater → usually Two-way mixed
Here the rater is specific, not a random sample.
2A-1. Model
Use two-way mixed effects → this leads to ICC(3,·)
2A-2. Intended measurement protocol?
If you will use a single measurement in practice → choose ICC(3,1)
If you will use the mean of k repeated measurements (e.g., the average of 2 readings) → choose ICC(3,k)
2A-3. Definition
For test–retest / intra-rater, we almost always care about absolute agreement → ICC(3,1) absolute or ICC(3,k) absolute
Stata examples
* Test–retest with a fixed rater/team (the rater variable indexes the occasion):
* two-way mixed, absolute agreement; the output reports both ICC(3,1) and ICC(3,k)
icc measure id rater, mixed absolute
🧭 Step 2B – Inter-rater reliability
2B-1. Did the same set of raters rate all subjects?
✅ YES → Go to 2B-2
❌ NO → One-way random effects → ICC(1,·)
If NO (raters differ across subjects; unbalanced design):
Model: One-way random effects → ICC(1,·)
Protocol:
Single rating per subject → ICC(1,1)
Mean of k ratings per subject → ICC(1,k)
Definition: typically absolute agreement
Stata
* One-way random effects: ICC(1,1) (single rating) and ICC(1,k) (average of k)
* are reported together in the output
icc measure id
2B-2. If YES: same raters for all subjects
Ask: How do you conceptualize the raters?
Randomized / interchangeable raters? (You want to generalize to other similar raters, e.g., any GP or radiologist.) ➜ Two-way random effects → ICC(2,·)
Specific raters only? (Only this panel or these named experts will ever rate.) ➜ Two-way mixed effects → ICC(3,·)
🧭 Step 3 – Choose Random vs Mixed branch
🔹 Two-way random effects → ICC(2,·)
Use when:
Same raters rate all subjects, and
Raters are a random sample of a larger population (you want generalizability).
Protocol:
Single rater used in practice → ICC(2,1)
Mean of k raters used in practice → ICC(2,k)
Agreement vs consistency?
If exact values must match → absolute agreement
If only ranking matters → consistency
Stata
* Two-way random effects, absolute agreement; the output reports both
* ICC(2,1) (single rating) and ICC(2,k) (mean of k raters)
icc measure id rater, absolute
* Consistency version (if ranking matters more than exact values)
icc measure id rater, consistency
🔹 Two-way mixed effects → ICC(3,·)
Use when:
Same raters rate all subjects, and
Raters are fixed (only these raters matter; no generalization to others).
Protocol:
Single rater used → ICC(3,1)
Mean of k raters used → ICC(3,k)
Agreement vs consistency?
Absolute agreement → penalizes systematic differences between raters
Consistency → ignores mean differences; focuses on ranking only
Stata
* Two-way mixed effects, absolute agreement; the output reports both
* ICC(3,1) (single rating) and ICC(3,k) (mean of k raters)
icc measure id rater, mixed absolute
* Consistency version
icc measure id rater, mixed consistency
🔚 Ultra-Short Cheat Table
| Situation | Model | ICC Type |
|---|---|---|
| Test–retest / intra-rater | Two-way mixed | ICC(3,1) or ICC(3,k), usually absolute |
| Inter-rater, raters differ across subjects | One-way random | ICC(1,1) or ICC(1,k), absolute |
| Inter-rater, same raters, want to generalize to other raters | Two-way random | ICC(2,1) or ICC(2,k), absolute or consistency |
| Inter-rater, same fixed raters only | Two-way mixed | ICC(3,1) or ICC(3,k), absolute or consistency |




