
Types of ICC (intraclass correlation coefficient): One-Way, Two-Way Random, and Mixed Effects Explained

  • Writer: Mayta

1 One-way Random Effects: ICC(1) / ICC(1,1) / ICC(1,k)

Concept

  • Subjects are measured by raters who are treated as random noise.

  • Raters are not modeled explicitly (no rater factor in the model).

  • Often used when not all subjects are rated by the same raters, or the design is naturally unbalanced.

When to use

  • Different raters may rate different subjects; raters are interchangeable and loosely organized.

  • You want reliability across this general rater pool, but you cannot guarantee all raters see all subjects.

Typical examples

  • Different nurses measuring BP on different patients in a ward.

  • Rotating lab techs performing lab tests, with no strict “all raters rate all samples” structure.

What problem it solves

  • Handles unbalanced and messy rating patterns.

  • Gives you a reliability estimate generalizable to the rater population, even when design is not fully crossed.

Stata code (built-in icc)

* One-way random-effects model: omit the rater variable.
* Stata's icc reports both ICC(1,1) (single measurement) and
* ICC(1,k) (average of k ratings per subject) in a single run.
icc measure id

2 Two-way Random Effects: ICC(2,1) / ICC(2,k)

Concept

  • All subjects are rated by all raters.

  • Raters are treated as a random sample from a larger population of similar raters.

  • You want interchangeability: any trained clinician could, in principle, replace the current raters.

When to use

  • Fully crossed design: each subject rated by every rater.

  • You want to generalize reliability to other raters not in your study (e.g., GPs, other radiologists).

Typical examples

  • 4 radiologists all reading the same CT scans; you want a reliability estimate that holds for other radiologists or GPs in future.

  • Multiple lab technicians all running the same specimens; any tech in future may run the test.

What problem it solves

  • Explicitly models:

    • Between-subject variance

    • Between-rater variance

    • Residual (subject×rater) interaction / error

  • With absolute agreement, systematic differences between raters (e.g., rater A always scores higher) are treated as error, which is important if raters are supposed to be interchangeable.
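In variance-component terms this is the standard Shrout–Fleiss form for a single rating under a two-way random-effects model (the symbols below are generic, not taken from the post itself):

\[\text{ICC}(2,1) = \frac{\sigma_{subject}^2}{\sigma_{subject}^2 + \sigma_{rater}^2 + \sigma_{error}^2}\]

Because \(\sigma_{rater}^2\) sits in the denominator, any systematic rater offset lowers the ICC; this is the same structure as the ICC_agreement formula of de Vet et al. given further below.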

Interpretation

“If a different clinician/GP/radiologist were to rate these patients, how reliable would the scores be?”

Stata code (built-in icc)

* Two-way random-effects model (the default when a rater variable is supplied),
* absolute agreement (the default definition).
* The output reports both ICC(2,1) (single rating) and ICC(2,k) (average of k raters).
icc measure id rater

3 Two-way Mixed Effects: ICC(3,1) / ICC(3,k)

Concept

  • All subjects are rated by all raters, but raters are treated as fixed (not random).

  • You care only about these specific raters, not about others in the future.

When to use

  • You will not generalize to other raters; the “unit” being validated is the specific rater team.

  • Typical for expert panels or named specialists.

Typical examples

  • Only 2 expert cardiologists will read all echoes in the trial and in clinical practice.

  • A central adjudication committee rating all events, and you only care about this exact committee.

What problem it solves

  • Excludes rater population variance from the denominator (raters are fixed, not sampled).

  • With consistency, systematic differences between raters (one is always higher) are not counted as error; you only care about whether patients are ranked consistently.

Interpretation

“Given these exact raters, how reproducible are their scores?”
Not: “What happens if a different rater reads the images?”

Stata code (built-in icc)

* Two-way mixed-effects model, absolute agreement.
* The output reports both ICC(3,1) (single rating) and ICC(3,k) (mean of k raters).
icc measure id rater, mixed absolute

* Consistency version (the default for the mixed model)
icc measure id rater, mixed

🔁 Single vs Average: (1,1) vs (1,k); (2,1) vs (2,k); (3,1) vs (3,k)

  • (·,1) → reliability of one rating from one rater.

  • (·,k) → reliability of the mean of k ratings; this is higher for any positive ICC, because averaging reduces random error (see the Spearman–Brown formula below).

Example:

  • ICC(2,1) = reliability of a single reader (anyone from the rater pool).

  • ICC(2,k) = reliability of the average score of all readers.
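The single-rating and average-rating forms are linked by the Spearman–Brown formula, which shows why averaging k ratings raises reliability:

\[\text{ICC}(\cdot,k) = \frac{k \cdot \text{ICC}(\cdot,1)}{1 + (k-1)\,\text{ICC}(\cdot,1)}\]

For example, ICC(2,1) = 0.60 with k = 3 raters gives 1.80 / 2.20 ≈ 0.82.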

📌 Absolute Agreement vs Consistency

These are about what counts as “error” in the denominator:

🔹 Absolute Agreement (aka ICC_agreement, like in de Vet’s paper)

  • Counts systematic differences between raters as part of error.

  • Question:

    “Do raters give exactly the same scores?”

  • Appropriate when raters are supposed to be interchangeable and absolute scale matters (e.g., BP, ROM in degrees).

🔹 Consistency (aka ICC_consistency)

  • Ignores systematic mean differences between raters; focuses on relative ordering.

  • Question:

    “Do raters rank patients similarly, even if one tends to score higher overall?”

  • Useful if only ranking/ordering matters, not exact values.

In de Vet et al., the formulas are:

\[\text{ICC}_{agreement} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{pt}^2 + \sigma_{residual}^2}\]

\[\text{ICC}_{consistency} = \frac{\sigma_p^2}{\sigma_p^2 + \sigma_{residual}^2}\]

where:

  • \(\sigma_p^2\): variance between persons

  • \(\sigma_{pt}^2\): systematic variance between raters

  • \(\sigma_{residual}^2\): residual (error) variance

Absolute agreement includes \(\sigma_{pt}^2\) as error; consistency does not.
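As a rough illustration, both ICCs can be computed from variance components estimated with Stata's mixed command. This is a minimal sketch, not part of the original workflow: the variable names (measure, id, rater) follow the icc examples above, the crossed-effects specification assumes long-format data, and the plugged-in variance values are purely illustrative.

* A minimal sketch, assuming long-format data with variables measure, id, rater.
* Subjects (id) and raters are modeled as crossed random effects.
mixed measure || _all: R.rater || id:, reml

* Read the three variance components off the mixed output, then plug them in
* (the numbers below are illustrative, not real estimates):
scalar var_p   = 4.0   // between-person variance, sigma_p^2
scalar var_pt  = 0.5   // systematic rater variance, sigma_pt^2
scalar var_res = 1.0   // residual variance, sigma_residual^2

display "ICC_agreement   = " var_p/(var_p + var_pt + var_res)   // = 0.73
display "ICC_consistency = " var_p/(var_p + var_res)            // = 0.80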

🎯 ICC vs Agreement: What Question Are You Answering?

From de Vet et al.:

  • Reliability (ICC) answers:

    “Can we distinguish between subjects despite measurement error?” It relates measurement error to between-subject variance, so it depends heavily on sample heterogeneity.

  • Agreement (SEM, Limits of Agreement, %within ±X, kappa) answers:

    “How close are repeated measurements to each other?” It focuses directly on measurement error, independent of how diverse the subjects are.

Key implications:

  • ICC can be high in a very heterogeneous sample (big between-person variance), even with substantial error.

  • Agreement indices (e.g., SEM, LoA) are more stable across samples and better for monitoring change over time or detecting small clinical changes.

🧮 Agreement Side: SEM, SDC, Limits of Agreement

These are not ICC but frequently paired with it:

  • SEM (Standard Error of Measurement): \(SEM = \sqrt{\text{error variance}}\), expressed on the original scale (e.g., kg, degrees). Good for saying: “Typical noise around a single measurement is about X units.”

  • SDC (Smallest Detectable Change): \(SDC = 1.96 \times \sqrt{2} \times SEM\), the smallest change that reliably exceeds measurement error for a single patient.

  • Limits of Agreement (Bland–Altman): \(\text{mean difference} \pm 1.96 \times SD_{\text{difference}}\), the range within which 95% of differences between methods/raters fall.

These are agreement measures, not reliability per se, and are recommended when the research focus is evaluating change (e.g., pre–post treatment).
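A short Stata sketch of these calculations; the SEM value of 2 units and the wide-format rater1/rater2 variables are hypothetical, used only to show the arithmetic.

* SDC from an assumed SEM of 2 units
scalar sem = 2
scalar sdc = 1.96*sqrt(2)*sem
display "SDC = " %4.1f sdc     // about 5.5 units

* Bland–Altman limits of agreement, assuming one score per rater in wide format
generate diff = rater1 - rater2
summarize diff
display "LoA: " r(mean) - 1.96*r(sd) " to " r(mean) + 1.96*r(sd)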

🛠 Stata Implementation Summary

Built-in icc (for continuous outcomes):

* One-way random: ICC(1,1) and ICC(1,k) are reported together
icc measure id

* Two-way random (ICC(2,1) and ICC(2,k)), absolute agreement (the default)
icc measure id rater

* Two-way mixed (ICC(3,1) and ICC(3,k)), absolute agreement
icc measure id rater, mixed absolute

* Consistency versions (exclude systematic rater differences)
icc measure id rater, consistency
icc measure id rater, mixed

🔍 kappaetc vs icc

  • icc (built-in Stata)

    • Can compute ICC(1), ICC(2), ICC(3).

    • Supports one-way, two-way random, two-way mixed, with absolute and consistency, single or average.

  • kappaetc (user-written)

    • Excellent for kappa, AC1/AC2, specific agreement, and sometimes one-way ICC (ICC(1,k)) depending on options.

    • Cannot compute full two-way mixed absolute models like ICC(3,1) or other advanced ICC(3,·) / ICC(2,·) structures; hence errors like:

      option icc() invalid r(198)

So:

  • Use icc for full ICC modeling (1-way, 2-way random, 2-way mixed).

  • Use kappaetc when your focus is on categorical agreement (kappa family) or simple one-way ICC structures.
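The two commands also expect different data layouts. A hedged sketch (the rater1-rater3 variable names are hypothetical, and kappaetc must be installed from SSC first):

* Built-in icc: long format, one row per rating
icc measure id rater, mixed

* User-written kappaetc: wide format, one variable per rater
* ssc install kappaetc
kappaetc rater1 rater2 rater3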


Summary Table with Rationale + Fixes

ICC Model | Use When | Why Use (Statistical Issue It Fixes) | Example
ICC(1,1) One-way random | Different raters rate different subjects; raters random | Handles unbalanced ratings; generalizable to the population of raters | Rotating lab techs
ICC(2,1) Two-way random, absolute agreement | All raters rate all subjects; raters random & interchangeable | Counts rater mean differences as error, so reliability generalizes to other raters | Multiple radiologists
ICC(3,1) Two-way mixed, consistency | All raters rate all subjects; raters fixed | Ignores rater mean differences; focuses on consistency within this specific rater set | Two cardiologists in an RCT


What Each ICC Model Fixes

ICC(1) — Fixes problems with:

  • Random, inconsistent assignment of raters

  • Unbalanced datasets

  • Designs where the same raters do not rate every subject (no fully crossed design)

ICC(2) — Fixes problems with:

  • Systematic rater bias (differences in scoring levels)

  • Need to generalize reliability to any future rater

  • The need for an estimate that counts systematic rater differences as part of measurement error

ICC(3) — Fixes problems with:

  • Overestimation of reliability when you ignore rater effects

  • Focused studies where only specific raters matter

  • Need to isolate reproducibility of a rater set


ICC Note (quick recap)

1 One-way Random Effects → ICC(1) / ICC(1,1)

  • Use when the raters are not all the same / do not read every case.

  • Rater = random → any rater can be drawn; raters are not paired so that everyone reads every subject.

  • Suited to field data / rotating lab techs / multiple nurses.

  • Does not fix rater bias, because raters are treated as random noise.

2 Two-way Random Effects → ICC(2,1)

  • Use when every rater reads every subject.

  • Rater = random, and you want to generalize to other raters.

  • For example, radiologists A and B read today; tomorrow a GP or another radiologist must be able to read as well.

  • Fixes systematic bias between raters (absolute agreement).

Suited to clinical measurements where “anyone can read it” and the result must still be reliable.

3 Two-way Mixed Effects → ICC(3,1)

  • Use when raters are fixed → it must be “these specific people only”.

  • Cannot be generalized; does not carry over to other radiologists.

  • Used in studies where only this specific set of raters will do the assessment, e.g., an expert panel or 2 cardiologists.

  • Rater bias is not included in the denominator (consistency).

Suited to work where only this set of raters is used, both in the study and in actual practice.

Clinical Rule of Thumb (from COSMIN + de Vet)

  • Inter-rater reliability (generalizable) → ICC(2)

  • Test–retest reliability (same assessor) → ICC(3)

  • Multi-center clinical measurement studies → ICC(2)

  • When only designated raters will ever perform scoring → ICC(3)

  • When raters vary randomly → ICC(1)

Reference: Koo TK, Li MY. A guideline of selecting and reporting intraclass correlation coefficients for reliability research. J Chiropr Med. 2016;15(2):155-163. doi:10.1016/j.jcm.2016.02.012.

🧭 Step 1 – What kind of reliability?

  1. Test–retest / Intra-rater reliability

    • Same rater (or fixed team) measuring the same subjects at different times. ➜ Go to Step 2A (Two-way mixed).

  2. Inter-rater reliability

    • Different raters assessing the same subjects (usually at one time). ➜ Go to Step 2B.

🧭 Step 2A – Test–retest / Intra-rater → usually Two-way mixed

Here the rater is specific, not a random sample.

2A-1. Model

  • Use Two-way mixed effects → this leads to ICC(3,·)

2A-2. Intended measurement protocol?

  • If you will use a single measurement in practice → choose ICC(3,1)

  • If you will use the mean of k repeated measurements (e.g., the average of 2 readings) → choose ICC(3,k)

2A-3. Definition

  • For test–retest / intra-rater, we almost always care about absolute agreement → ICC(3,1) absolute or ICC(3,k) absolute

Stata examples

* Test–retest / intra-rater: two-way mixed, absolute agreement.
* Stata reports ICC(3,1) (single measurement) and ICC(3,k)
* (mean of k measurements) together.
icc measure id rater, mixed absolute

🧭 Step 2B – Inter-rater reliability

2B-1. Did the same set of raters rate all subjects?

✅ YES → Go to 2B-2

❌ NO → One-way random effects → ICC(1,·)

If NO (raters differ across subjects; unbalanced design):

  • Model: One-way random effects → ICC(1,·)

  • Protocol:

    • Single rating per subject → ICC(1,1)

    • Mean of k ratings per subject → ICC(1,k)

  • Definition: typically absolute agreement

Stata

* One-way random: ICC(1,1) (single rating) and ICC(1,k)
* (average of k ratings) are reported together.
icc measure id

2B-2. If YES: same raters for all subjects

Ask: How do you conceptualize the raters?

  1. Randomized / interchangeable raters? (You want to generalize to other similar raters, e.g., any GP or radiologist.) ➜ Two-way random effects → ICC(2,·)

  2. Specific raters only? (Only this panel or these named experts will ever rate.) ➜ Two-way mixed effects → ICC(3,·)

🧭 Step 3 – Choose Random vs Mixed branch

🔹 Two-way random effects → ICC(2,·)

Use when:

  • Same raters rate all subjects, and

  • Raters are a random sample of a larger population (you want generalizability).

Protocol:

  • Single rater used in practice → ICC(2,1)

  • Mean of k raters used in practice → ICC(2,k)

Agreement vs consistency?

  • If exact values must match → absolute agreement

  • If only ranking matters → consistency

Stata

* Two-way random, absolute agreement (the default);
* reports ICC(2,1) (single rating) and ICC(2,k) (mean of k raters)
icc measure id rater

* Consistency version (if ranking matters more than exact values)
icc measure id rater, consistency

🔹 Two-way mixed effects → ICC(3,·)

Use when:

  • Same raters rate all subjects, and

  • Raters are fixed (only these raters matter; no generalization to others).

Protocol:

  • Single rater used → ICC(3,1)

  • Mean of k raters used → ICC(3,k)

Agreement vs consistency?

  • Absolute agreement → penalizes systematic differences between raters

  • Consistency → ignores mean differences; focuses on ranking only

Stata

* Two-way mixed, absolute agreement;
* reports ICC(3,1) (single rating) and ICC(3,k) (mean of k raters)
icc measure id rater, mixed absolute

* Consistency version (the default for the mixed model)
icc measure id rater, mixed

🔚 Ultra-Short Cheat Table

Situation | Model | ICC Type
Test–retest / intra-rater | Two-way mixed | ICC(3,1) or ICC(3,k), usually absolute
Inter-rater, raters differ across subjects | One-way random | ICC(1,1) or ICC(1,k), absolute
Inter-rater, same raters, want to generalize to other raters | Two-way random | ICC(2,1) or ICC(2,k), absolute or consistency
Inter-rater, same fixed raters only | Two-way mixed | ICC(3,1) or ICC(3,k), absolute or consistency

