Agreement vs Reliability: One Variance, Two Questions

Abstract

Almost every decision in clinical medicine depends on clinimetric instruments, making the evaluation of their measurement quality foundational to patient care. When determining if a clinical tool is sufficient, researchers must evaluate its validity, reproducibility, and reliability. However, evaluating an instrument's consistency requires strictly separating two concepts that analyze the same observed variance: agreement and reliability. Every observed measurement is the sum of a true value and measurement error. Agreement evaluates the absolute magnitude of this error and retains specific clinical units, making it essential when exact values dictate treatment, such as comparing a bedside glucose test to a laboratory standard. Conversely, reliability measures the relative proportion of genuine signal within the total observed variance. It provides a unitless ratio between zero and one that assesses the consistency of ordering or classification, which is crucial when evaluating independent raters on a radiographic sign. This article explains the fundamental distinctions between agreement and reliability, teaching clinicians how to correctly identify their measurement goals and select the appropriate statistics.

Introduction

Almost every decision we make in clinical medicine rests on a measurement. Does this patient have the disease or not? Is the disease getting worse or better on treatment? Is it progressing or regressing over time? None of these questions can be answered by intuition alone — each one depends on a clinimetric instrument: a defined, repeatable procedure for turning a clinical phenomenon into a number or a category. Because so much hangs on these tools, the quality of the instrument is not a technical afterthought; it is the foundation of the decision built on top of it.

This article is the second in a five-part series, and its single job is to separate two ideas that are constantly confused in practice: agreement and reliability. They are not synonyms, they answer different questions, and — as we will see — they are two different ways of slicing up the same variance. By the end you should be able to look at any measurement problem and say, immediately, "this is an agreement question" or "this is a reliability question," and choose the right statistic accordingly.

The key insight of this whole article: one variance, two questions. Agreement asks "how big is the error?" Reliability asks "how much of what I see is real signal?"

Why clinimetrics matters: three quality dimensions

Before we can compare agreement and reliability we have to place them inside the bigger picture of instrument quality. When you ask "is this clinical tool good enough to use?" you are really asking three separate questions, and it pays to keep them apart.

Validity: Does the instrument truly measure the thing it claims to measure? This is about the target. A tool can be perfectly consistent and still measure the wrong construct.
Reproducibility: Is the procedure clear and transparent enough that someone else, in their own setting, can apply it the same way? This is about the protocol, not the result: are the rating criteria explicit, unambiguous, and portable to a new context?
Reliability: When the same instrument is used across different situations (different raters, different occasions), do the results stay consistent and is the error low? This is about the behaviour of the measurements in the real world.

The classic teaching example is the Medical Research Council (MRC) scale for muscle strength, scored from 0 (no visible muscle contraction at all) to 5 (normal power against full resistance, scaled to the patient). Look at how the same instrument is interrogated by each of the three questions:

To judge its validity, you ask whether the 0–5 score genuinely captures weakness — does it really measure the construct of muscle strength?
To judge its reproducibility, you scrutinise the grading criteria themselves: are they clear and transparent enough that a clinician elsewhere can apply the same anchors and reproduce the procedure?
To judge its reliability, you ask whether repeated grading is consistent and low-error. Here is the crucial subtlety: imagine an examiner who is far stronger than the patient. Even when the patient's muscle is genuinely normal, that examiner may overpower the limb and record a 4, whereas an examiner whose strength is equal to or less than the patient's records a 5. Same patient, same true strength, but the measurement moved because of who held the limb. That gap is exactly what reliability assessment is built to detect.

Instrument types: who (or what) does the measuring?

Reliability does not look the same for every kind of instrument, so the first practical step is to classify what kind of tool you are dealing with. The COSMIN framework recognises four broad families, distinguished by who or what produces the score.

Instrument type	Abbreviation	Who/what reports it	Clinical examples
Clinician-reported outcome measure	ClinROM	A health professional	Hamilton Anxiety Rating Scale; physical exam findings (joint swelling, muscle weakness on MRC); device-assisted reads such as ultrasound Doppler for cardiac stricture
Patient-reported outcome measure	PROM	The patient directly	Quality-of-life questionnaires, symptom diaries, pain scales
Performance-based outcome measure	PerFOM	The patient's measured performance on a task	Timed walk, grip-strength dynamometry, functional capacity tests
Biomarker / laboratory value	Biomarker / lab	A laboratory or assay	Blood glucose, HbA1c, serum biomarkers

Why does the family matter? Because the source of error differs. A ClinROM carries rater error (the MRC examiner above). A PROM carries patient-recall and interpretation error. A lab value carries assay and instrument error. Knowing which family you are in tells you which reliability design you will eventually need — the subject of part 3 of this series.

COSMIN: nine properties in three domains

To bring order to all of this, the COSMIN initiative (COnsensus-based Standards for the selection of health Measurement INstruments) organises the qualities of a good instrument into nine measurement properties grouped under three domains:

Validity — does it measure the right construct?
Reliability — is the measurement consistent and low in error?
Responsiveness — can it detect genuine change over time?

This series lives almost entirely inside the second domain.

The core decomposition: observed = true + error

Everything that follows hangs on one simple model. Any value you actually record — the observed value — is the sum of the genuine quantity you are trying to capture (the true value) plus a measurement error:

\[ \text{Observed} = \text{True} + \text{Error} \]

If there were no error, the observed value would always equal the true value exactly. Error is the gap between what you wrote down and what was really there.

Now scale this up from one measurement to a group of subjects. If you measure many subjects, the observed values vary: some patients really are heavier, weaker, or sicker than others. We can quantify that spread as the observed variance, \( \sigma^2_{obs} \). And because every observed value decomposes into true + error, the variance decomposes the same way, provided the error is unrelated to the true value (otherwise a covariance term appears):

\[ \sigma^2_{obs} = \sigma^2_{true} + \sigma^2_{error} \]

In words: the total spread you see is made of two parts — the real spread between subjects (true variance) and the noise added by imperfect measurement (error variance). This single equation is the hinge of the entire article. Agreement and reliability are simply two different things you can ask about its right-hand side.
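The decomposition can be checked numerically. The sketch below is a simulation under stated assumptions, not data from any study: the subject count, the 70 kg mean, and the two standard deviations are made-up values, and the error is drawn independently of the true value.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

N_SUBJECTS = 20_000
TRUE_SD = 10.0   # assumed spread of true weights between subjects (kg)
ERROR_SD = 3.0   # assumed spread of the measurement error (kg)

# Each observed value is a true value plus an independent error.
true_values = [random.gauss(70.0, TRUE_SD) for _ in range(N_SUBJECTS)]
errors = [random.gauss(0.0, ERROR_SD) for _ in range(N_SUBJECTS)]
observed = [t + e for t, e in zip(true_values, errors)]

var_true = statistics.pvariance(true_values)
var_error = statistics.pvariance(errors)
var_obs = statistics.pvariance(observed)

print(f"true variance          : {var_true:.1f}")
print(f"error variance         : {var_error:.1f}")
print(f"true + error           : {var_true + var_error:.1f}")
print(f"observed variance      : {var_obs:.1f}")
```

In a finite sample the two sides differ slightly because the sample covariance between true values and errors is not exactly zero; the gap shrinks as the number of subjects grows.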


Figure. Agreement vs reliability — one variance, two questions. Any observed measurement decomposes as observed = true + error, so the observed variance splits into true variance (real differences between subjects) and error variance (measurement error). Reliability asks what share of that variance is true — a unitless ratio between 0 and 1 (ICC, Cohen's / weighted kappa); agreement asks how large the error is on the original, absolute scale, with units (SEM, SDC, limits of agreement, % agreement). The bottom row shows when to reach for each.

Agreement vs reliability: same variance, two questions

Here is the conceptual heart of the matter. Both ideas live on the right-hand side of \( \sigma^2_{obs} = \sigma^2_{true} + \sigma^2_{error} \), but they interrogate it differently.

Agreement is concerned only with the size of the error. It asks the absolute question:

"How far does the measured value stray from the true value?"

Agreement studies the magnitude of the measurement error itself. If repeated measurements scatter only slightly, the error is small, and we say the instrument has good agreement. Crucially, agreement is absolute and carries units — kilograms, mmol/L, mmHg. It cares about \( \sigma^2_{error} \) on its own terms.

Worked example — the new scale (measuring agreement). Suppose you are testing a new weighing scale on two patients, weighing each one four times. Each individual weigh-in misses the true weight by some amount. If you take the mean and the spread of those errors, you can state directly how accurate the scale is in kilograms compared with the truth. That number — the typical size of the error — is the agreement.
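As a rough sketch of that calculation, the snippet below uses invented numbers: the true weights and the eight readings are hypothetical, chosen only to illustrate the arithmetic. It summarises the error as a mean (the systematic offset) and a standard deviation (the scatter), both in kilograms.

```python
import statistics

# Hypothetical data: true weights and four repeated readings per patient (kg).
true_weight = {"blue": 62.0, "green": 63.0}
readings = {
    "blue": [61.6, 62.5, 63.4, 61.9],
    "green": [63.2, 62.9, 62.7, 63.5],
}

# Error of each weigh-in = observed value minus true value.
errors = [
    obs - true_weight[patient]
    for patient, values in readings.items()
    for obs in values
]

mean_error = statistics.mean(errors)  # systematic offset, in kg
sd_error = statistics.stdev(errors)   # scatter of the error, in kg

print(f"mean error : {mean_error:+.2f} kg")
print(f"SD of error: {sd_error:.2f} kg")
```

Both outputs stay on the original scale, which is what makes them agreement statistics. In real studies the true weight is rarely known, so a reference method or repeated measurements stand in for it.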

Reliability is concerned with how much of the observed spread is real. It asks the relative question:

"Can the instrument consistently tell different things apart and put them in the right order, regardless of how big the error is?"

Reliability studies the consistency of ordering / classification under repeated measurement. Formally it is the proportion of the observed variance that is true variance:

\[ \text{Reliability} = \frac{\sigma^2_{true}}{\sigma^2_{obs}} = \frac{\sigma^2_{true}}{\sigma^2_{true} + \sigma^2_{error}} \]

This makes reliability relative, unitless, and bounded between 0 and 1. A reliability near 1 means almost all of the spread you observe is genuine difference between subjects; a reliability near 0 means most of it is noise.
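One consequence of the ratio is worth seeing in numbers: with the error held fixed, reliability still changes when the true spread between subjects changes. The sketch below is illustrative only; the function name `reliability` and the variance values are assumptions made for the example.

```python
def reliability(var_true: float, var_error: float) -> float:
    """Share of the observed variance that is true variance."""
    var_obs = var_true + var_error
    if var_obs <= 0:
        raise ValueError("observed variance must be positive")
    return var_true / var_obs


ERROR_VAR = 4.0  # assumed error variance (kg^2), i.e. an error SD of 2 kg

# Same instrument, same error, two hypothetical populations.
mixed_adults = reliability(var_true=100.0, var_error=ERROR_VAR)  # true SD 10 kg
similar_group = reliability(var_true=1.0, var_error=ERROR_VAR)   # true SD 1 kg

print(f"heterogeneous sample: {mixed_adults:.2f}")   # 0.96
print(f"homogeneous sample  : {similar_group:.2f}")  # 0.20
```

The error is identical in both calls, so agreement is unchanged; only the share of the spread that is real has moved. This is why a reliability coefficient should be read together with the population it was estimated in.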

Worked example: ordering two patients (measuring reliability). Test the same scale on two patients where the blue patient is genuinely lighter than the green patient. Weigh each four times. On repeats 1, 2, and 4 the scale correctly shows blue < green; on repeat 3 the order flips. The instrument ranked them correctly 3 out of 4 times = 75%. Notice we never asked by how many kilograms it was wrong, only whether it kept the order straight. That is the intuition behind reliability; the 75% illustrates consistent ordering and is not the variance ratio itself, which has to be estimated from a sample of subjects.
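The count can be reproduced with a few lines. The readings below are hypothetical, picked so that only the third repeat reverses the order.

```python
# Hypothetical repeated readings (kg); blue is truly lighter than green.
blue = [61.6, 62.5, 63.4, 61.9]
green = [63.2, 62.9, 62.7, 63.5]

# For each repeat, did the scale keep the true order (blue < green)?
order_kept = [b < g for b, g in zip(blue, green)]
share_correct = sum(order_kept) / len(order_kept)

print(order_kept)              # [True, True, False, True]
print(f"{share_correct:.0%}")  # 75%
```

No kilogram value appears in the result: the question was about ordering alone.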

Here is the same contrast laid out side by side:

Aspect	Agreement	Reliability
Core question	How big is the measurement error?	What share of the variance is true signal?
Targets which term	\( \sigma^2_{error} \) (size of error)	\( \sigma^2_{true} / \sigma^2_{obs} \) (ratio)
Absolute or relative	Absolute	Relative
Units	Has units (kg, mmol/L, …)	Unitless
Range	0 to ∞ (in measurement units)	0 to 1
What "good" looks like	Small, tightly clustered errors	High proportion of true variance; consistent ordering/classification
Typical statistics	Standard error of measurement, limits of agreement (Bland–Altman)	Cohen's Kappa, ICC
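The two columns are linked through the decomposition. As a sketch, write \( R \) for reliability and take the standard error of measurement (SEM) to be the error standard deviation \( \sigma_{error} \), which is its usual definition but an assumption here since the article does not define it. Then:

\[ 1 - R = \frac{\sigma^2_{obs} - \sigma^2_{true}}{\sigma^2_{obs}} = \frac{\sigma^2_{error}}{\sigma^2_{obs}} \quad\Longrightarrow\quad \text{SEM} = \sigma_{error} = \sigma_{obs}\sqrt{1 - R} \]

Read in either direction: given the observed standard deviation and a reliability coefficient, the error can be recovered in original units, and given the error and the observed spread, the reliability follows.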

When to use which: two worked clinical decisions

The two examples below come straight from clinical practice and show how the purpose of the measurement — not the instrument — decides which statistic you need. Importantly, you do not have to assess every component at once; the right choice depends on what you will do with the result.

Example 1: DTX vs laboratory glucose → Agreement. You want to know whether a bedside Dextrostix (DTX) glucose reading can stand in for the laboratory glucose value. The clinically relevant question is "by how much does the DTX value differ, in mmol/L, from the laboratory reference?", because a DTX that reads 2 mmol/L too high could send a patient down the wrong treatment path. That is an absolute, units-based question about the size of the error, so assess agreement. For this purpose, agreement alone is sufficient; reliability does not need to be assessed as well.
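One common way to answer this kind of question is the Bland–Altman approach listed in the table above: take the paired differences, then report their mean (the bias) and the 95% limits of agreement, bias ± 1.96 SD. The sketch below uses invented paired readings for eight hypothetical patients, purely to show the calculation.

```python
import statistics

# Hypothetical paired glucose readings (mmol/L) on the same eight patients.
dtx = [5.4, 7.9, 10.6, 4.2, 13.1, 6.6, 9.0, 11.8]
lab = [5.1, 7.4, 10.9, 4.0, 12.3, 6.5, 8.4, 11.5]

# Difference per patient: bedside value minus laboratory value.
differences = [d - ref for d, ref in zip(dtx, lab)]

bias = statistics.mean(differences)      # average over- or under-reading
sd_diff = statistics.stdev(differences)  # scatter of the differences

# 95% limits of agreement, assuming roughly normal differences.
lower = bias - 1.96 * sd_diff
upper = bias + 1.96 * sd_diff

print(f"bias: {bias:+.2f} mmol/L")
print(f"limits of agreement: {lower:+.2f} to {upper:+.2f} mmol/L")
```

Whether limits of that width are acceptable is a clinical judgement made before looking at the data, not something the statistic decides.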

Example 2 — Two raters on a radiographic sign → Reliability. You want to test the reliability of a radiographic sign used for diagnosis. Two readers, A and B, independently judge 10 images as sign present or sign absent. What you care about is how often A and B give the same answer (both "present" or both "absent") — and you want that consistency measured beyond what chance alone would produce. The size of any "error" is meaningless here; there are no kilograms, only agree/disagree. Assess reliability (and, because chance agreement must be removed, this is exactly where Cohen's Kappa enters — the subject of part 4).
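A minimal sketch of the chance correction for this two-rater, present/absent setting follows. The ten ratings are hypothetical and the helper name `cohen_kappa` is chosen for illustration. Kappa is computed as (observed agreement − chance agreement) / (1 − chance agreement), with chance agreement derived from each rater's own rate of calling the sign present.

```python
def cohen_kappa(a, b):
    """Cohen's kappa for two raters giving binary ratings (1 = present, 0 = absent)."""
    if len(a) != len(b) or not a:
        raise ValueError("need two equal-length, non-empty rating lists")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a = sum(a) / n  # share of images rater A called "present"
    p_b = sum(b) / n  # share of images rater B called "present"
    # Agreement expected if the two raters were independent.
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Hypothetical readings of 10 images by readers A and B.
rater_a = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
rater_b = [1, 1, 1, 0, 0, 0, 0, 1, 1, 0]

raw_agreement = sum(x == y for x, y in zip(rater_a, rater_b)) / len(rater_a)
kappa = cohen_kappa(rater_a, rater_b)

print(f"raw agreement: {raw_agreement:.0%}")  # 80%
print(f"kappa        : {kappa:.2f}")          # 0.60
```

With these made-up ratings the readers match on 8 of 10 images, but half of that agreement would be expected by chance, so kappa is noticeably lower than the raw percentage.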

A clean rule of thumb: if your question has units, you want agreement; if your question is about ordering or classifying, you want reliability.

Key takeaways

Clinical decisions rest on measurements, so instrument quality is foundational; it splits into validity (right construct), reproducibility (clear, portable protocol), and reliability (consistent, low-error in use) — illustrated by the MRC 0–5 muscle-strength scale.
Instruments come in four families — ClinROM, PROM, PerFOM, and biomarker/lab — and the family determines the dominant source of error and the reliability design you will need.
COSMIN organises instrument quality into nine properties across three domains (validity, reliability, responsiveness); this series focuses on reliability.
The core model is \( \text{Observed} = \text{True} + \text{Error} \), which scales to \( \sigma^2_{obs} = \sigma^2_{true} + \sigma^2_{error} \).
Agreement = the size of the error (absolute, has units); Reliability = \( \sigma^2_{true} / \sigma^2_{obs} \), the share of true variance (relative, unitless, 0–1). One variance, two questions.
Choose by purpose: DTX vs lab glucose → agreement (units matter); two raters on a radiographic sign → reliability (ordering/classification beyond chance). In the worked example, the scale ranked two patients correctly 3 of 4 times = 75%.

References

de Vet HCW, Terwee CB, Bouter LM. Current challenges in clinimetrics. J Clin Epidemiol. 2003;56:1137–41.
Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist. Qual Life Res. 2010;19:539–49.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
Shrout PE, Fleiss JL. Intraclass correlations. Psychol Bull. 1979;86:420–28.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
Koo TK, Li MY. A guideline of selecting and reporting ICC. J Chiropr Med. 2016;15:155–63.
Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;1:307–10.
Gwet KL. Computing inter-rater reliability in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29–48.
Parmar M, Naqvi SAA, et al. Collaborative large language models for screening in systematic reviews. medRxiv. 2026.

From Sensitivity to Kappa (5-part series): (1) Performance vs Agreement [01_performance_vs_agreement] · (2) Agreement vs Reliability [02_agreement_vs_reliability] · (3) Reliability designs [03_reliability_designs] · (4) Categorical — kappa [04_categorical_kappa] · (5) Continuous — ICC & agreement [05_continuous_icc_agreement]