Continuous Agreement: ICC, SEM, SDC, and Bland–Altman

Abstract
When clinicians evaluate measurement instruments that yield continuous outcomes, they cannot rely on categorical statistics like Cohen's Kappa; they need a distinct statistical toolbox that quantifies both true patient variation and measurement error. The Intraclass Correlation Coefficient serves as the primary reliability metric, expressing the proportion of total variance that reflects true between-subject differences on a scale from zero to one. However, investigators must explicitly specify their chosen model, definition, and type, because changing any of these can substantially change the resulting reliability score on identical data. Furthermore, because this coefficient is unitless, a seemingly high score can mask clinically substantial measurement error. Researchers should therefore never report this coefficient alone, and should always pair it with agreement statistics expressed in the instrument's original units, such as the Standard Error of Measurement, the Smallest Detectable Change, or Limits of Agreement shown on a Bland–Altman plot. This article details the toolbox for continuous data, showing researchers how to select the appropriate Intraclass Correlation Coefficient and calculate the corresponding agreement statistics to separate genuine clinical change from measurement noise.
Introduction
When your outcome is continuous — a peak-flow reading in L/min, a goniometer angle in degrees, a questionnaire score — you cannot describe agreement with a 2×2 table and you cannot use Cohen's Kappa. You need a different toolbox. This article teaches that toolbox step by step: the Intraclass Correlation Coefficient (ICC) as the reliability statistic, and a family of agreement statistics — SEM, SDC, CV, Limits of Agreement (LoA), and the Bland–Altman plot — as the way to express measurement error in the same units your clinicians actually use.
The single biggest mistake in continuous reliability is reporting "ICC = 0.85" and stopping there. An ICC is meaningless until you say which ICC. There are six (Shrout–Fleiss) to ten (McGraw–Wong) named forms, and they can differ by a lot on the same data. So the discipline here is: choose the right ICC on purpose, then complement it with an agreement statistic in real units so a reader knows whether a change of 7 points in a patient is signal or noise.
The key question of this article: for continuous measurements, which ICC answers my design — and what is the measurement error expressed in the units of my instrument?
This is part 5 of the series. Parts 1–4 built the conceptual scaffold (performance vs agreement; agreement vs reliability; reliability designs; categorical kappa). Here we land it in continuous data.
ICC: the proportion of true variance
The ICC, like every reliability statistic, asks one thing: of all the variability I observe, what proportion is real (genuine differences between subjects) rather than measurement error? It is built on the analysis of variance (ANOVA), which is what lets us decompose the total spread into components and read off the variance due to subjects versus the variance due to error.
In words:
\[ \text{ICC} = \frac{\text{True difference between study subjects}}{\text{Total variability}} = \frac{\text{Between-subject variance} - \text{Residual error variance}}{\text{Total variance}} \]
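The numerator is the "real signal": how much subjects genuinely differ from one another. The subtraction is there because the between-subject term estimated by the ANOVA still contains a share of measurement error, and removing the residual error isolates the true subject-to-subject differences. The denominator is everything you see, signal plus noise. The ratio runs from 0 to 1: an ICC of 0 means the instrument carries no reliable information at all, and an ICC of 1 means the instrument is free of error variance (perfectly reliable).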
The trouble — and the entire reason this article is long — is that there is no single "the ICC." Different research questions change what counts as "error" and what counts as "total variance," and therefore give different numbers on the same dataset. You must make three explicit choices, and each one changes how residual error variance and total variance are computed.
The three subsections below each pin down one of these choices: MODEL, DEFINITION, and TYPE.
Choice 1 — the MODEL (one-way random / two-way random / two-way mixed)
Walk through the three:
- One-way random-effect model. Used when each subject is rated by a different set of raters, e.g. patient A is measured by raters 1 and 2, while patient B is measured by raters 3 and 4. This is rarely used clinically. Because the raters differ from subject to subject, the model cannot separate out the variability caused by which rater did the measuring when it computes residual error variance. Rater differences are folded into the error term, which makes residual error variance larger, so the ICC comes out lower than in the other models.
- Two-way random-effect model. Used when the raters in the study were chosen to represent the general population of raters, i.e. you want to generalise your result to other raters with similar experience and characteristics. This fits instruments used in routine clinical practice by many different clinicians (e.g. passive range-of-motion). Inter-rater reliability is the design that most often uses this model.
- Two-way mixed-effect model. Used when you care only about consistency within this set of raters and do not want to generalise to a wider rater population. You rarely see it for inter-rater reliability; it is the natural choice for intra-rater reliability and test–retest reliability, because both designs care about variability within the same person or the same instrument across time, so generalising to other raters would be inappropriate.
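As a rough sketch of how the three models map onto Stata's icc command, the lines below use the built-in judges dataset that the worked example later in this article relies on. The defaults noted in the comments (absolute agreement for the random-effects model, consistency for the mixed-effects model) are worth confirming in help icc for your Stata version.

```stata
* Sketch: the three ICC models in Stata's icc command (judges data)
webuse judges, clear

* One-way random effects: only the target (subject) is named
icc rating target

* Two-way random effects: target and rater are named;
* absolute agreement is the default definition here
icc rating target judge

* Two-way mixed effects: raters treated as fixed;
* consistency is the default definition here
icc rating target judge, mixed
```

Each call reports both the individual (single) and the average ICC, so the TYPE choice is made when you read the output rather than when you type the command.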
Choice 2 — the DEFINITION (absolute agreement vs consistency)
Choosing between absolute and consistency depends on the level of agreement you actually need:
- If it matters that both raters give the same value (\( X_1 = X_2 \)), choose absolute agreement.
- If you only care that the two raters' scores rank together / move together, even when one rater reads systematically higher (\( X_1 = X_2 + c \), with \( c \) a constant offset), choose consistency.
The algebra makes the difference concrete. Absolute agreement adds an adjusted systematic-error variance term into the denominator (the penalty); consistency does not:
\[ \text{ICC}_{\text{absolute}} = \frac{\text{Between-subject variance} - \text{Residual error variance}}{\text{Between-subject variance} + \text{Residual error variance} + \text{adjusted Systematic-error variance}} \]
\[ \text{ICC}_{\text{consistency}} = \frac{\text{Between-subject variance} - \text{Residual error variance}}{\text{Between-subject variance} + \text{Residual error variance}} \]
Because the absolute formula has a larger denominator, absolute agreement gives a lower (more conservative) ICC whenever a systematic difference exists.
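To see the penalty in numbers, consider a hypothetical toy case (not from the source data): five subjects scored 10, 20, 30, 40, 50 by rater 1, with rater 2 always reading exactly 5 units higher. A two-way ANOVA on these ten values gives \( MS_B = 500 \), \( MS_R = 62.5 \), \( MS_E = 0 \), with \( n = 5 \) subjects and \( k = 2 \) raters. Using the mean-square forms of the single-rater ICC that appear in the Stata section of this article:

\[ \text{ICC}_{\text{consistency}} = \frac{500 - 0}{500 + (2-1)\times 0} = 1 \]
\[ \text{ICC}_{\text{absolute}} = \frac{500 - 0}{500 + (2-1)\times 0 + \frac{2}{5}(62.5 - 0)} = \frac{500}{525} \approx 0.952 \]

The two raters rank the subjects identically, so consistency is perfect, but the constant 5-unit offset alone pulls the absolute-agreement ICC below 1.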
Choice 3 — the TYPE (single vs average; average ≥ single)
You must decide how the tested instrument will be used in routine practice:
- If you plan to base results on the average of (say) 3 raters, report the Average ICC (the value that reflects averaging across all raters in the study).
- If in reality only one rater will be used, report the single/individual ICC.
These differ in total variance. The single/individual ICC keeps residual error variance in the denominator; the average ICC does not. Hence:
\[ \text{ICC}_{\text{single}} = \frac{\text{Between-subject variance} - \text{Residual error variance}}{\text{Between-subject variance} + \text{Residual error variance}} \]
\[ \text{ICC}_{\text{average}} = \frac{\text{Between-subject variance} - \text{Residual error variance}}{\text{Between-subject variance}} \]
So the average ICC is never smaller than the single/individual ICC, and it is strictly larger whenever the ICC lies between 0 and 1, because averaging k measurements averages away part of the error.
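The link between the two types can be checked numerically. The sketch below uses the Spearman–Brown relation, ICC_avg = k × ICC_single / (1 + (k − 1) × ICC_single), which is not stated in the text above but is the standard algebraic bridge between the single and average forms. The single-rater value 0.1657 and k = 4 are borrowed from the one-way worked example later in this article.

```stata
* Spearman-Brown step-up from a single-rater ICC to the average of k raters
* (0.1657 is the one-way ICC(1,1) from the judges example; k = 4 raters)
scalar icc_single = 0.1657
scalar krater = 4
scalar icc_avg = krater * icc_single / (1 + (krater - 1) * icc_single)
display "ICC for the average of " krater " raters = " %6.4f icc_avg
```

The result should land at roughly 0.443, in line with the ICC(1,k) of 0.4428 reported in the worked example (the small gap comes from rounding the input).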
The Shrout–Fleiss forms and the objective → model map
Shrout and Fleiss (1979) divided the ICC into six forms, written ICC(model, type): ICC(1,1), ICC(1,k), ICC(2,1), ICC(2,k), ICC(3,1), ICC(3,k). In their scheme the difference between the two-way random-effect model (ICC 2,1 and ICC 2,k) and the two-way mixed-effect model (ICC 3,1 and ICC 3,k) lies in whether the penalised systematic difference enters total variance. In effect, Shrout and Fleiss tied absolute agreement to the two-way random model and consistency to the two-way mixed model.
Later, McGraw and Wong (1996) expanded the ICC into ten forms by considering model, definition, and type as independent axes. Crucially, they showed that the two-way random-effect and two-way mixed-effect models actually share the same algebraic formula — the choice of the word "random" versus "mixed" is therefore driven by your study design and objective, not by a different computation.
For intra-rater and test–retest designs the model is essentially fixed: two-way mixed-effect, absolute, single rater. The reasoning: neither design aims to generalise to a larger rater population (they measure within the same rater or rater group), and both are repeated measurements in which any systematic shift between occasions counts as error, so you should always choose absolute agreement.
Here is the objective → model map, read row by row:
| Objective | Measurement tool | Type of reliability | Protocol / intended use | Appropriate model |
|---|---|---|---|---|
| Reliability between raters | Imaging technique (with unit) | Inter-rater | Single rater as the basis; widespread use among GPs | Two-way random-effect, absolute, single rater |
| Reliability between raters | Imaging technique (with unit) | Inter-rater | Single rater as the basis; specific use only by specialists | Two-way mixed-effect, absolute, single rater |
| Reliability between raters | Score-based questionnaire (no unit) | Inter-rater | Mean value of raters as the basis; specific use only by specialists | Two-way mixed-effect, consistency, average rater |
| Reliability within rater | Score-based questionnaire (no unit) | Intra-rater | Single rater as the basis; widespread use among GPs | Two-way mixed-effect, absolute, single rater |
| Reliability within rater | Imaging technique (with unit) | Intra-rater | Single rater as the basis; specific use only by specialists | Two-way mixed-effect, absolute, single rater |
| Reliability between time points | Imaging technique (with unit) | Test–retest | Single rater as the basis; specific use only by specialists | Two-way mixed-effect, absolute, single rater |
"ICC is good but needs to specify model assumptions." Report the triplet — model, definition, type — every single time.
Worked ICC in Stata, step by step
The cleanest way to see the choices bite is to compute them by hand from the ANOVA table and then confirm with the icc command. We use Stata's built-in judges dataset: 6 subjects (targets) rated by 4 raters (judges), an inter-rater reliability design.
One-way model — ICC(1,1) and ICC(1,k)
The one-way formulas use only the between-subjects and within-subjects mean squares:
\[ \text{ICC}_{1,1} = \frac{MS_B - MS_W}{MS_B + (k-1)\,MS_W} \]
\[ \text{ICC}_{1,k} = \frac{MS_B - MS_W}{MS_B} \]
where \( MS_B \) is the Mean Square Between subjects and \( MS_W \) is the Mean Square Within subjects (error), equivalent to the residual error variance.
webuse judges
anova rating target
From the ANOVA, \( MS_B = 11.241667 \) and \( MS_W = 6.2638889 \), with \( k = 4 \) raters. Plugging in:
\[ \text{ICC}_{1,1} = \frac{11.241667 - 6.2638889}{11.241667 + (4-1)\times 6.2638889} = 0.1657 \]
\[ \text{ICC}_{1,k} = \frac{11.241667 - 6.2638889}{11.241667} = 0.4428 \]
Confirm with the command:
icc rating target
Notice how low the single-rater value is (0.1657) and how much higher the average-of-4 value is (0.4428) — exactly the "average ≥ single" rule, and exactly the one-way penalty that buries rater effects in the error term.
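The hand calculation can also be scripted. This is a minimal sketch that assumes the stored results e(mss), e(df_m), e(rss) and e(df_r) left behind by anova; with a single factor in the model, the model mean square is the between-subjects mean square.

```stata
* One-way ICCs by hand from the ANOVA stored results (judges data)
webuse judges, clear
quietly anova rating target

* Mean squares: model term = between subjects, residual = within subjects
scalar MSB = e(mss) / e(df_m)
scalar MSW = e(rss) / e(df_r)
scalar krater = 4

scalar icc11 = (MSB - MSW) / (MSB + (krater - 1) * MSW)
scalar icc1k = (MSB - MSW) / MSB
display "ICC(1,1) = " %6.4f icc11 "   ICC(1,k) = " %6.4f icc1k
```

If the script is right, the two displayed values should match the 0.1657 and 0.4428 obtained above.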
Two-way model — ICC(2,1) and ICC(2,k), absolute
Now we add the rater (judge) as a second factor. The absolute-agreement two-way formulas are:
\[ \text{ICC}_{2,1} = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E + \frac{k}{n}(MS_R - MS_E)} \]
\[ \text{ICC}_{2,k} = \frac{MS_B - MS_E}{MS_B + \frac{MS_R - MS_E}{n}} \]
where \( MS_B \) = Mean Square Between subjects, \( MS_E \) = Mean Square error, and \( MS_R \) = Mean Square for rater.
anova rating target judge
Adding the judge (rater) variable changes the variance partition: the variances from this two-way ANOVA are not the same as the one-way model's. The within-subjects error has now been split into a rater part and a pure-error part.
With \( MS_B = 11.241667 \), \( MS_E = 1.0194444 \), \( MS_R = 32.486111 \), \( n = 6 \) subjects and \( k = 4 \) raters:
\[ \text{ICC}_{2,1} = \frac{11.241667 - 1.0194444}{11.241667 + (4-1)\times 1.0194444 + \frac{4}{6}(32.486111 - 1.0194444)} = 0.2897 \]
\[ \text{ICC}_{2,k} = \frac{11.241667 - 1.0194444}{11.241667 + \frac{(32.486111 - 1.0194444)}{6}} = 0.6201 \]
Confirm with:
icc rating target judge, absolute
Compare the two analyses on the same data: the one-way ICC(1,1) was 0.1657, but the two-way ICC(2,1) is 0.2897 — pulling the systematic rater effect out of the error term raised reliability. This is the practical payoff of choosing the model on purpose.
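The two-way calculation can be scripted in the same spirit. This sketch assumes that anova stores per-term sums of squares and degrees of freedom as e(ss_1), e(df_1), e(ss_2), e(df_2), numbered in the order the terms are typed; it is worth checking ereturn list after the anova call to confirm the numbering.

```stata
* Two-way absolute-agreement ICCs by hand (judges data)
webuse judges, clear
quietly anova rating target judge

* Assumed numbering: term 1 = target (subjects), term 2 = judge (raters)
scalar MSB = e(ss_1) / e(df_1)
scalar MSR = e(ss_2) / e(df_2)
scalar MSE = e(rss) / e(df_r)
scalar nsubj = e(df_1) + 1
scalar krater = e(df_2) + 1

scalar icc21 = (MSB - MSE) / (MSB + (krater - 1) * MSE + (krater / nsubj) * (MSR - MSE))
scalar icc2k = (MSB - MSE) / (MSB + (MSR - MSE) / nsubj)
display "ICC(2,1) = " %6.4f icc21 "   ICC(2,k) = " %6.4f icc2k
```

The displayed values should agree with the icc output above to within rounding in the fourth decimal place.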
For completeness, the consistency two-way forms drop the rater term from the denominator:
\[ \text{ICC}_{3,1} = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E} \]
\[ \text{ICC}_{3,k} = \frac{MS_B - MS_E}{MS_B} \]
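Plugging the judges mean squares from the two-way ANOVA above into the consistency forms gives a useful contrast. This sketch takes the average-rater form to be \( (MS_B - MS_E)/MS_B \), the Shrout–Fleiss ICC(3,k), with no rater term in the denominator:

\[ \text{ICC}_{3,1} = \frac{11.241667 - 1.0194444}{11.241667 + (4-1)\times 1.0194444} \approx 0.7148 \]
\[ \text{ICC}_{3,k} = \frac{11.241667 - 1.0194444}{11.241667} \approx 0.9093 \]

The gap between the consistency value (about 0.71) and the absolute value (about 0.29) for a single rater shows how large the systematic difference between judges is in this dataset. The command icc rating target judge, mixed consistency should confirm these figures.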
Intra-device (intra-rater) — ICC(2,1) for the PEFR data
For an intra-device design we use the pefr data: 17 subjects, each measured twice with the Mini Wright flow meter (wm1, wm2) and twice with the Wright peak-flow meter (wp1, wp2). The design (same device, two occasions) is fixed to two-way mixed-effect, absolute, single rater. First reshape to long form so each measurement is one row:
reshape long wm wp, i(id) j(occasion)
Here the second factor is occasion, so \( MS_R \) is replaced by \( MS_T \) (Mean Square for repeated time) and \( k \) is the number of occasions:
\[ \text{ICC}_{2,1} = \frac{MS_B - MS_E}{MS_B + (k-1)\,MS_E + \frac{k}{n}(MS_T - MS_E)} \]
Objective 1 — Mini Wright flow meter:
anova wm id occasion
icc wm id occasion, mixed absolute
\[ \text{ICC}_{2,1} = \frac{24771.452 - 416.80515}{24771.452 + (2-1)\times 416.80515 + \frac{2}{17}(70.617647 - 416.80515)} = 0.96847076 \]
Objective 2 — Wright peak-flow meter:
anova wp id occasion
icc wp id occasion, mixed absolute
\[ \text{ICC}_{2,1} = \frac{27599.908 - 235.96691}{27599.908 + (2-1)\times 235.96691 + \frac{2}{17}(207.52941 - 235.96691)} = 0.98316401 \]
Both devices show excellent intra-device reliability (≈ 0.9685 and ≈ 0.9832) — as expected for a physical meter measured twice in quick succession.
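To avoid copying numbers from the output by hand, the estimates can be pulled from the results that icc leaves behind. This sketch assumes the reshaped pefr data are in memory and that the single-rater estimate and its confidence limits are stored as r(icc_i), r(icc_i_lb) and r(icc_i_ub); return list after an icc call shows what is actually available.

```stata
* Single-rater ICC with its confidence interval for both meters
* Assumes the long-form pefr data (id, occasion, wm, wp) are in memory
foreach v of varlist wm wp {
    quietly icc `v' id occasion, mixed absolute
    display "`v': ICC = " %6.4f r(icc_i) ///
        "  (95% CI " %6.4f r(icc_i_lb) " to " %6.4f r(icc_i_ub) ")"
}
```

Reporting the interval alongside the point estimate matters here because 17 subjects is a small sample.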
Agreement statistics: error in real units
The ICC tells you the proportion of variance that is real, but it is unitless and population-dependent — a high ICC can hide a clinically large error if subjects vary a lot. So COSMIN asks you to also report agreement statistics, which express measurement error in the original units of the instrument. The figure below is the decision tree for picking among them.
Mean difference and SD of differences (NOT recommended alone)
The mean difference and SD of the differences between two raters/devices describe the centre and spread of the systematic error. But they have real limitations: they ignore residual error, and they assume the systematic error is normally distributed. For these reasons the COSMIN guideline does not recommend reporting these statistics by themselves in a manuscript.
Standard Error of Measurement (SEM)
The SEM is the size of the variability that is purely error — it incorporates both residual error and systematic error in the units of the instrument. COSMIN gives two equivalent ways to compute it.
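Method 1: directly from the ANOVA, using variance components rather than raw mean squares, and including only the components that match your design. The residual component is \( \sigma^2_E = MS_E \), and the systematic component for raters is \( \sigma^2_R = (MS_R - MS_E)/n \); in a test–retest design the time term \( MS_T \) takes the place of \( MS_R \). A negative component estimate is conventionally set to zero.
\[ \text{SEM} = \sqrt{\sigma^2_R + \sigma^2_E} = \sqrt{\frac{MS_R - MS_E}{n} + MS_E} \]
Method 2: from the SD and the ICC (more commonly used). Take the pooled SD (the observed SD across both raters/devices) and scale it by the square root of the error proportion \( (1 - \text{ICC}) \):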
\[ \text{SEM} = SD\sqrt{1 - \text{ICC}} \]
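A quick numerical check of the SEM on the judges inter-rater example is sketched below. It converts the mean squares reported earlier into variance components with the usual expected-mean-squares rules, which is an assumption of this sketch rather than something spelled out in the text, and it takes the total SD from those same components instead of from the raw ratings.

```stata
* SEM two ways for the judges example (n = 6 subjects, k = 4 raters)
scalar MSB = 11.241667
scalar MSE = 1.0194444
scalar MSR = 32.486111
scalar nsubj = 6
scalar krater = 4

* Variance components from expected mean squares (assumed conversion)
scalar var_e = MSE
scalar var_r = (MSR - MSE) / nsubj
scalar var_s = (MSB - MSE) / krater

* Error components only: rater variance plus residual variance
scalar sem1 = sqrt(var_r + var_e)

* Total SD scaled by sqrt(1 - ICC), using the absolute-agreement ICC
scalar icc_abs = var_s / (var_s + var_r + var_e)
scalar sem2 = sqrt(var_s + var_r + var_e) * sqrt(1 - icc_abs)

display "SEM from components = " %6.3f sem1 "   SEM from SD and ICC = " %6.3f sem2
```

Both routes give about 2.5 rating points, and icc_abs comes out at the same 0.29 as the two-way absolute ICC computed earlier, which is a useful consistency check.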
Smallest Detectable Change (SDC)
Once you have the SEM, the SDC tells you the smallest true change that you can be 95% confident reflects a genuine change in the subject rather than measurement error. It is the threshold you use when you follow a patient over time against their own baseline:
\[ \text{SDC} = 1.96\sqrt{2}\;\text{SEM} \]
Worked example: suppose instrument T has \( \text{SEM} = 4.5 \). Then
\[ \text{SDC} = 1.96 \times \sqrt{2} \times 4.5 = 12.5 \]
So if a patient's value at follow-up differs from baseline by 7, that change is smaller than the SDC of 12.5, and you cannot claim it reflects real patient change — it is within the band of measurement error of instrument T.
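The same arithmetic as a short Stata sketch, using the SEM of 4.5 and the observed change of 7 from the example above (the scalar names are only illustrative):

```stata
* Smallest Detectable Change for instrument T and a check of one observed change
scalar sem = 4.5
scalar sdc = 1.96 * sqrt(2) * sem
scalar change = 7

display "SDC = " %5.1f sdc
if abs(scalar(change)) > scalar(sdc) {
    display "Change exceeds the SDC: unlikely to be measurement error alone"
}
else {
    display "Change is within the SDC: cannot be separated from measurement error"
}
```

The factor sqrt(2) appears because a change score is the difference of two measurements, each carrying its own error.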
Coefficient of Variation (CV)
The CV expresses the measurement error (SEM) as a percentage of the mean of the two raters/devices:
\[ \text{CV} = 100 \times \frac{\text{SEM}}{\bar{x}} \]
This is useful because error often grows as the measured value grows; the CV shows the SD of error as a fraction of the typical reading. For example, \( \text{CV} = 2\% \) means the measurement error is 2% of the value measured. Its drawback is that it assumes the ratio is constant across the whole range: if the mean is 10, 100, or 1000, a fixed CV implies an SD of 0.2, 2, and 20 respectively.
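The constant-ratio assumption is easy to make concrete. This small sketch simply rearranges the CV formula to SEM = CV × mean / 100 for the three means mentioned above:

```stata
* Error SD implied by a fixed CV of 2% at different mean readings
local cv = 2
foreach m of numlist 10 100 1000 {
    display "mean = " %5.0f `m' "   implied SEM = " %5.1f (`cv' / 100 * `m')
}
```

If the error in your instrument does not grow in this proportional way, a single CV will understate error at one end of the range and overstate it at the other.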
Limits of Agreement and the Bland–Altman plot
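The Limits of Agreement (LoA) describe the boundaries of the systematic error and, with the Bland–Altman plot, its pattern. The Bland–Altman LoA are not a confidence interval for the mean difference; they give the range within which about 95% of the differences between the two raters/devices are expected to fall, assuming the differences are roughly normally distributed: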
\[ \text{LoA} = \bar{d} \pm 1.96\, s_d \]
where \( \bar{d} \) is the mean of the differences (mean error/bias) and \( s_d \) is the SD of the differences.
The Bland–Altman plot puts the mean of the two raters/devices on the X-axis, used as the best available estimate of the true value (Bland and Altman's 1986 plot already used this mean rather than the reading from a reference instrument), and the difference between the two raters/devices on the Y-axis (the systematic error). The plot does not declare whether agreement is sufficient. That is a clinical judgement: the researcher decides, ideally by pre-specifying a maximal acceptable difference, whether the LoA range is too wide.
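A minimal sketch of the bias, the LoA and the plot in Stata follows. It assumes the long-form pefr data from the intra-device section are in memory (variables id, occasion, wm, wp) and compares the two meters at the first occasion only; the variable and scalar names it creates are illustrative.

```stata
* Sketch: Bland-Altman statistics and plot for the two meters at occasion 1
* Assumes the long-form pefr data (id, occasion, wm, wp) are in memory
preserve
keep if occasion == 1

generate diff = wm - wp
generate avg = (wm + wp) / 2

* Bias (mean difference) and 95% limits of agreement
quietly summarize diff
scalar dbar = r(mean)
scalar sd_d = r(sd)
scalar loa_lo = dbar - 1.96 * sd_d
scalar loa_hi = dbar + 1.96 * sd_d
display "bias = " %6.2f dbar "   LoA = " %6.2f loa_lo " to " %6.2f loa_hi

* Difference against mean, with the bias and the limits drawn as reference lines
twoway scatter diff avg, ///
    yline(`=scalar(dbar)') ///
    yline(`=scalar(loa_lo)' `=scalar(loa_hi)', lpattern(dash)) ///
    ytitle("Difference (wm - wp)") xtitle("Mean of the two meters")
restore
```

Using one occasion per subject keeps the differences independent; pooling both occasions would need a repeated-measures version of the limits.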
Read a Bland–Altman plot in 3 steps:
- How close is the mean error/bias to the line of equality (zero error)? This tells you the size of the systematic error.
- What is the pattern of the scatter of dots? Look at how the spread behaves across the range.
- Compare absolute and percentage scales to classify the pattern (below).
The patterns to recognise:
| Pattern | On the absolute scale | On the percentage scale |
|---|---|---|
| Random difference | Scatter looks roughly symmetrical; e.g. LoA from 46.4 to −60.5. Such a wide error limit may matter when the true value is 200–300 but be negligible when it is 3,000–4,000. | Percentage error decreases as concentration rises (reveals a non-proportional pattern). |
| Constant difference (absence of variability) | Systematic error is the same size across the whole range; scatter is symmetrical across the mean. | Percentage of error shrinks as concentration rises (similar to random difference). |
| Proportional difference | Scatter gets wider as concentration rises, in proportion (a constant coefficient of variation). | Scatter becomes a symmetrical distribution. |
| Proportional constant difference | A slope of the difference together with constant variability. | — |
Key takeaways
- ICC = proportion of true (between-subject) variance in total variance, built on ANOVA; it runs 0 (no reliability) to 1 (error-free).
- Always specify the three choices: MODEL (one-way random / two-way random / two-way mixed), DEFINITION (absolute vs consistency), TYPE (single vs average). Average ICC ≥ single ICC, and absolute ≤ consistency when a systematic difference exists.
- Shrout–Fleiss define six forms ICC(1,1)…ICC(3,k); McGraw–Wong show two-way random and two-way mixed share the same formula, so "random vs mixed" is a statement of intent to generalise, not a different computation.
- Intra-rater and test–retest designs default to two-way mixed-effect, absolute, single rater.
- Worked judges data: one-way ICC(1,1) = 0.1657, ICC(1,k) = 0.4428; two-way absolute ICC(2,1) = 0.2897, ICC(2,k) = 0.6201. Intra-device pefr: ICC(2,1) ≈ 0.9685 (Mini Wright) and ≈ 0.9832 (Wright peak-flow).
- Complement the ICC with agreement statistics in real units: \( \text{SEM} = SD\sqrt{1-\text{ICC}} \) (absolute ICC), or the square root of the summed ANOVA error variance components for your design; \( \text{SDC} = 1.96\sqrt{2}\,\text{SEM} \) (SEM = 4.5 → SDC = 12.5, so a change of 7 is within measurement error); \( \text{CV} = 100\,\text{SEM}/\bar{x} \); \( \text{LoA} = \bar{d} \pm 1.96\, s_d \).
- COSMIN does not recommend mean/SD of differences alone. Read the Bland–Altman plot in 3 steps and classify the pattern as random, constant, or proportional difference.
References
- de Vet HCW, Terwee CB, Bouter LM. Current challenges in clinimetrics. J Clin Epidemiol. 2003;56:1137–41.
- Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist. Qual Life Res. 2010;19:539–49.
- Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
- Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
- Shrout PE, Fleiss JL. Intraclass correlations. Psychol Bull. 1979;86:420–28.
- McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
- Koo TK, Li MY. A guideline of selecting and reporting ICC. J Chiropr Med. 2016;15:155–63.
- Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;1:307–10.
- Gwet KL. Computing inter-rater reliability in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29–48.
- Parmar M, Naqvi SAA, et al. Collaborative large language models for screening in systematic reviews. medRxiv. 2026.
From Sensitivity to Kappa (5-part series): (1) Performance vs Agreement [01_performance_vs_agreement] · (2) Agreement vs Reliability [02_agreement_vs_reliability] · (3) Reliability designs [03_reliability_designs] · (4) Categorical — kappa [04_categorical_kappa] · (5) Continuous — ICC & agreement [05_continuous_icc_agreement]