Categorical Agreement: % Agreement, Specific Agreement, and Kappa

Abstract

When clinicians and clinical researchers classify outcomes into discrete categories, evaluating true inter-rater consensus requires statistical tools that separate genuine diagnostic skill from sheer chance. While raw percent agreement offers an intuitive baseline, it fundamentally overestimates instrument reliability by failing to discount the probability of raters agreeing purely by luck. To properly diagnose weaknesses within specific classifications, percent specific agreement isolates consensus category by category. For rigorous evaluation of nominal data, Cohen's Kappa formally subtracts expected chance agreement from observed agreement, revealing true consensus and demonstrating how uncorrected metrics mask measurement error. Furthermore, when clinical classifications follow an ordinal scale where errors vary in severity, weighted Kappa applies linear or quadratic weighting to appropriately credit partial agreement for near-misses. Finally, the Intraclass Correlation Coefficient provides a robust alternative for longer ordinal scales to avoid arbitrary weighting schemes. This article details the categorical agreement toolkit, teaching clinicians how to calculate and interpret chance-corrected metrics to properly validate their diagnostic tools.

Introduction

When two raters classify the same things into categories — two radiologists calling chest films normal vs abnormal, two reviewers calling abstracts include vs exclude, two nurses scoring pain as mild / moderate / severe — we need a number that says how well they agree. The intuitive answer is "count how often they gave the same label and divide by the total." That is percent agreement, and it is a real, useful statistic. But it has a famous blind spot: even two people answering at random will sometimes land on the same label by sheer luck. A statistic that does not subtract that luck will flatter a bad instrument.

This article — part 4 of the From Sensitivity to Kappa series — teaches the categorical-agreement toolkit step by step: % agreement, % specific agreement, Cohen's Kappa (and why we correct for chance), and weighted Kappa for ordinal scales. Every number below is computed by hand so you can reproduce it.

The key question: when two raters give the same label, how much of that agreement is real, and how much is just chance?

First, a word on which statistic to reach for. Agreement and reliability are related but conceptually distinct. Agreement asks about the amount of measurement error and is expressed on an absolute scale (it needs a unit). Reliability asks only whether the instrument consistently separates different subjects, and is expressed as a relative proportion of true variance to total observed variance (no unit needed). For categorical data this distinction blurs, because categorical labels usually have no unit and classification matters more than measured distance — so most studies report agreement and reliability statistics side by side, which is why the two words are so often used interchangeably here.

Figure. Choosing an agreement/reliability statistic by data type.

The selection map is short enough to read directly:

Type of outcome	Agreement (measurement error)	Reliability
Binary / Nominal	% Agreement, % Specific agreement	Cohen's Kappa
Ordinal	% Agreement, % Specific agreement	Weighted Kappa; ICC (ordinal outcome with more than 2 raters)

The COSMIN guideline files % Agreement and % Specific agreement under Agreement, and Cohen's Kappa / Weighted Kappa under Reliability — even though, as we will see, Kappa's formula and interpretation are spoken about almost entirely in the language of agreement.

Binary / nominal data: percent agreement

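% Agreement is a descriptive statistic for the proportion of cases in which the two raters gave the same answer, with no attempt to discount chance. For a 2×2 table with cells \(a, b, c, d\) (where \(a\) and \(d\) are the two agreement cells, \(b\) and \(c\) are the two disagreement cells, and \(n = a + b + c + d\) is the total), the observed proportion of agreement is \(p_o = (a + d)/n\), and % agreement is that proportion expressed as a percentage:

\[ \%\text{ agreement} = p_o \times 100 = \frac{a + d}{n} \times 100 \]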

That is it. It is honest, transparent, and exactly the number a clinician's intuition reaches for. Its weakness — agreement by chance is baked in — is what Cohen's Kappa exists to fix.
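
As a minimal sketch, assuming the four cell counts are passed as plain integers, the formula translates directly into Python. The function name `percent_agreement` is illustrative and does not come from any library.

```python
def percent_agreement(a: int, b: int, c: int, d: int) -> float:
    """Return raw percent agreement (0-100) for a 2x2 table.

    a and d are the agreement cells; b and c are the disagreement cells.
    No correction for chance is applied.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    return (a + d) / n * 100
```

For a made-up table with a = 30, b = 10, c = 10, d = 50, the call `percent_agreement(30, 10, 10, 50)` evaluates to 80 percent.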

Binary / nominal data: percent specific agreement

% Agreement collapses both categories into one number, which can hide a problem in one specific category. % Specific agreement asks, within a single category, what proportion of the times that category was used did the raters agree on it. It still ignores chance, but it is diagnostic: if one category's specific agreement is markedly below the overall % agreement, that category may be unreliable — perhaps it is too hard to judge, or the subjects chosen were not well suited to evaluating that category.

For the "Yes" category and the "No" category:

\[ p_{\text{yes}} = \frac{2a}{(a+c) + (a+b)} \times 100 \]

\[ p_{\text{no}} = \frac{2d}{(d+c) + (d+b)} \times 100 \]

The logic is observed agreement in that category, divided by the total number of times either rater assigned that category. The numerator counts the agreeing cell twice (once for each rater); the denominator sums each rater's marginal total for that category. This is why specific agreement, unlike % agreement, can differ between the two categories.
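
The two formulas can be sketched in a few lines of Python. This assumes the cell layout used in the formulas above (\(a\) = both Yes, \(d\) = both No, \(b\) and \(c\) the two disagreement cells); `specific_agreement` is a hypothetical helper name.

```python
def specific_agreement(a: int, b: int, c: int, d: int) -> tuple[float, float]:
    """Return (% specific agreement for Yes, % specific agreement for No)."""
    yes_uses = (a + c) + (a + b)  # both raters' marginal totals for Yes
    no_uses = (d + c) + (d + b)   # both raters' marginal totals for No
    if yes_uses == 0 or no_uses == 0:
        raise ValueError("a category that was never used has no specific agreement")
    p_yes = 2 * a / yes_uses * 100
    p_no = 2 * d / no_uses * 100
    return p_yes, p_no
```

With an invented, lopsided table such as a = 5, b = 5, c = 5, d = 85, overall agreement is 90% but the function returns 50% for Yes and roughly 94% for No, which is the kind of hidden weak category this statistic is meant to expose.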

Cohen's Kappa

Cohen's Kappa (also called the Kappa statistic or K index) reports the proportion of same-answer cases after accounting for the probability that the raters happened to match by chance — chance agreement that can arise when an instrument is too easy or too hard. COSMIN classifies it as a reliability statistic, but its formula and interpretation are spoken about almost entirely in terms of agreement. The definition is:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} \]

where \(p_o\) is the observed agreement \((a+d)/n\) and \(p_e\) is the agreement expected by chance:

\[ p_e = \frac{(a+b)(a+c)}{n^2} + \frac{(b+d)(c+d)}{n^2} \]

Read the formula structurally. The numerator \(p_o - p_e\) strips the chance ("error") agreement out of the observed agreement, leaving true agreement. The denominator \(1 - p_e\) is the most agreement that is still possible once chance is removed — the total probability not due to chance — which keeps \(\kappa\) capped at 1. And note the consequence at the bottom end: Kappa can be negative, which happens whenever observed agreement is less than chance agreement.
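
A short Python sketch makes the structure concrete, including the negative case. It assumes a 2×2 table given as four counts; `cohen_kappa_2x2` is an illustrative name, not a library function.

```python
def cohen_kappa_2x2(a: int, b: int, c: int, d: int) -> float:
    """Cohen's Kappa for a 2x2 table with agreement cells a and d."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    p_o = (a + d) / n                                        # observed agreement
    p_e = ((a + b) * (a + c) + (b + d) * (c + d)) / n ** 2   # chance agreement
    if p_e == 1:
        # Both raters used one category for every case: no room beyond chance.
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1 - p_e)
```

For an invented table where the raters mostly contradict each other (a = 10, b = 40, c = 40, d = 10), observed agreement is 0.20 against a chance agreement of 0.50, so the function returns about -0.6.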

Worked example 1: two radiologists, 100 X-rays

Two radiologists read all 100 X-ray images and report each as normal or abnormal. Their cross-tabulation:

	B: Normal	B: Abnormal	Row totals
A: Normal	40	10	50
A: Abnormal	5	45	50
Totals	45	55	100

Here \(a = 40\), \(b = 10\), \(c = 5\), \(d = 45\), \(n = 100\). Now compute each statistic step by step.

Step 1 — % Agreement. They both said normal on 40 images and both said abnormal on 45 images:

\[ p_o = \frac{40 + 45}{100} = \frac{85}{100} = 0.85 = 85\% \]

Step 2 — % Normal specific agreement. Rater A called normal 50 times, rater B called normal 45 times; they agreed on normal 40 times:

\[ p_{\text{normal}} = \frac{40 + 40}{50 + 45} = \frac{80}{95} = 0.84 = 84\% \]

Step 3 — % Abnormal specific agreement. Rater A called abnormal 50 times, rater B called abnormal 55 times; they agreed on abnormal 45 times:

\[ p_{\text{abnormal}} = \frac{45 + 45}{50 + 55} = \frac{90}{105} = 0.86 = 86\% \]

The two specific-agreement values (84% and 86%) sit close to the overall 85%, telling us neither category is much harder than the other.

Step 4 — Expected (chance) agreement. The chance that both call an image normal is (A's normal rate)×(B's normal rate); the chance both call it abnormal is (A's abnormal rate)×(B's abnormal rate). Sum them:

\[ p_e = \frac{50}{100} \times \frac{45}{100} + \frac{50}{100} \times \frac{55}{100} = 0.225 + 0.275 = 0.5 \]

Step 5 — Cohen's Kappa. Plug into the definition:

\[ \kappa = \frac{p_o - p_e}{1 - p_e} = \frac{0.85 - 0.5}{1 - 0.5} = \frac{0.35}{0.5} = 0.70 \]

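Reading the result. Although the raw % agreement is 85%, and the two categories are about equally easy to judge, two raters with these marginal rates would already agree on 50% of images by chance alone (\(p_e = 0.50\)). Kappa reports how much of the remaining room for agreement the radiologists actually captured: 0.35 out of a possible 0.50, or \(\kappa = 0.70\). Kappa is a chance-corrected proportion, not a probability, and the gap between 0.85 and 0.70 reflects the luck that % agreement quietly counted as skill.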
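
The five steps can be checked with a short script. This is a sketch under stated assumptions: `agreement_summary` is a hypothetical helper, the table is a list of rows with rater A in rows and rater B in columns, and every category is assumed to have been used at least once.

```python
def agreement_summary(table):
    """Return (p_o, specific agreement per category, p_e, kappa) for a square table."""
    k = len(table)
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]                              # rater A marginals
    col_tot = [sum(table[i][j] for i in range(k)) for j in range(k)]   # rater B marginals
    p_o = sum(table[i][i] for i in range(k)) / n
    specific = [2 * table[i][i] / (row_tot[i] + col_tot[i]) for i in range(k)]
    p_e = sum(row_tot[i] * col_tot[i] for i in range(k)) / n ** 2
    kappa = (p_o - p_e) / (1 - p_e)
    return p_o, specific, p_e, kappa


xray = [[40, 10],   # A: Normal
        [5, 45]]    # A: Abnormal
p_o, specific, p_e, kappa = agreement_summary(xray)
print(f"p_o = {p_o:.2f}, p_e = {p_e:.2f}, kappa = {kappa:.2f}")
# p_o = 0.85, p_e = 0.50, kappa = 0.70
print("specific:", ", ".join(f"{s:.2f}" for s in specific))
# specific: 0.84, 0.86
```

Because the helper works from row and column totals rather than named cells, the same sketch also accepts tables with more than two categories.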

Interpreting Kappa — and a caution

A widely cited verbal scale (Landis & Koch) maps Kappa to labels such as slight, fair, moderate, substantial, and almost perfect. Use it carefully: these interpretation bands are arbitrary cut-points, not laws of nature, and a "moderate" Kappa in one clinical context may be unacceptable in another. Report the number, the table it came from, and the context — never the verbal label alone.

Kappa	Common verbal label (Landis–Koch)
< 0.00	Poor (worse than chance)
0.00–0.20	Slight
0.21–0.40	Fair
0.41–0.60	Moderate
0.61–0.80	Substantial
0.81–1.00	Almost perfect
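
If the verbal label is reported alongside the number, the lookup can be automated. The sketch below is one possible encoding of the table above; `landis_koch_label` is a hypothetical name, and assigning values that fall between two printed bands (for example 0.205) to the higher band is an assumption made here, not part of the original scale.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a Kappa value to the Landis and Koch verbal label.

    The cut-points are conventional, not statistical laws; report the
    numeric Kappa and its context together with any label.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "Poor (worse than chance)"
    bands = [
        (0.20, "Slight"),
        (0.40, "Fair"),
        (0.60, "Moderate"),
        (0.80, "Substantial"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"  # unreachable after the range check; kept for completeness
```

A call such as `landis_koch_label(0.70)` returns "Substantial", which is only meaningful when quoted next to the value 0.70 and the table it came from.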

Ordinal data: weighted Kappa

For ordinal categories (normal / borderline / abnormal, or mild / moderate / severe), not all disagreements are equal. Two raters who differ by one step (normal vs borderline) are closer than two who differ by two steps (normal vs abnormal). Plain Cohen's Kappa treats every disagreement as a total miss. Weighted Kappa uses the same chance-corrected machinery but credits partial agreement: a one-step disagreement earns partial weight; a maximum disagreement earns zero. The weights are applied to the chance term as well as to the observed term, so partial credit raises expected agreement too.

The weight \(w\) for a cell whose two labels differ by \(i\) categories, on a scale with \(k\) total categories, comes in two popular forms:

\[ \text{Linear weight: } w = 1 - \frac{i}{k-1} \]

\[ \text{Quadratic weight: } w = 1 - \frac{i^2}{(k-1)^2} \]

For a 3-level scale (\(k = 3\)), the weights work out as:

Disagreement	\(i\)	Linear \(1 - \frac{i}{k-1}\)	Quadratic \(1 - \frac{i^2}{(k-1)^2}\)
Agree (0 steps)	0	1.0	1.0
Off by 1 step	1	\(1-\frac{1}{2}=0.5\)	\(1-\frac{1}{4}=0.75\)
Off by 2 steps	2	\(1-\frac{2}{2}=0.0\)	\(1-\frac{4}{4}=0.0\)
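
For scales with more levels, the weights are easier to generate than to tabulate by hand. The following sketch builds the full \(k \times k\) weight matrix from the two formulas; `agreement_weights` and its `scheme` parameter are illustrative names, not a library API.

```python
def agreement_weights(k: int, scheme: str = "linear") -> list[list[float]]:
    """Return the k x k matrix of agreement weights.

    Entry [i][j] is the credit given when one rater chooses category i
    and the other chooses category j (categories in ordinal order).
    """
    if k < 2:
        raise ValueError("need at least two categories")
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")
    power = 1 if scheme == "linear" else 2
    return [
        [1 - (abs(i - j) / (k - 1)) ** power for j in range(k)]
        for i in range(k)
    ]
```

For \(k = 3\), the first row is 1, 0.5, 0 under the linear scheme and 1, 0.75, 0 under the quadratic scheme, since the one-step quadratic weight is \(1 - 1/4 = 0.75\).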

How to choose linear vs quadratic

The weight scheme encodes how the harm of a mistake grows with its size.

Linear weight: choose this when a two-step error is worth roughly twice a one-step error (harm grows in proportion to the distance). Example: when scoring pain as mild / moderate / severe, a nurse may judge that misreading mild as severe is about twice as bad as misreading mild as moderate.
Quadratic weight: choose this when small slips barely matter but the harm escalates disproportionately (more than linearly) as the gap widens. On a 3-level scale, a one-step error keeps most of its credit (weight \(1 - \frac{1}{4} = 0.75\), a penalty of only 0.25), while a two-step error collapses to zero.

In the example that follows, suppose the management of borderline and normal differs only slightly, while a true abnormal triggers many downstream medical steps. The cost of a big miss is disproportionately large, so quadratic weighting is the better fit, and the per-cell weights are 1 → 0.75 → 0 for 0-, 1-, and 2-step disagreements respectively.

Note on the quadratic weights used below: for \(k = 3\), the formula \(w = 1 - i^2/(k-1)^2\) gives weights of 1, 0.75, and 0 for 0-, 1-, and 2-step disagreements. A one-step slip therefore retains three-quarters of full credit, heavily forgiving small errors while zeroing out the worst miss.

Worked example 2: three-level scale, 30 X-rays

Two radiologists read 30 images and report each as normal, borderline, or abnormal:

	B = normal	B = borderline	B = abnormal	Row totals
A = normal	8	1	1	10
A = borderline	2	9	3	14
A = abnormal	0	2	4	6
Totals	10	12	8	30

Unweighted % Agreement (the diagonal, where they matched exactly):

\[ p_o = \frac{8 + 9 + 4}{30} = \frac{21}{30} = 0.70 = 70\% \]

% Specific agreement per category:

\[ p_{\text{normal}} = \frac{8+8}{10+10} = \frac{16}{20} = 0.80 = 80\% \]

\[ p_{\text{borderline}} = \frac{9+9}{14+12} = \frac{18}{26} = 0.69 = 69\% \]

\[ p_{\text{abnormal}} = \frac{4+4}{6+8} = \frac{8}{14} = 0.57 = 57\% \]

The abnormal category (57%) is the weak spot — far below the 70% overall — flagging that the two raters struggle most where the stakes are highest.

Step 1 — Weighted observed agreement. With quadratic weights 1 / 0.75 / 0, group the 30 cases by how far apart the raters were:

Exact matches (0 steps): \(8 + 9 + 4 = 21\) cases, weight 1 → \((21 \times 1)/30 = 0.7\)
Off by 1 step: \(1 + 2 + 3 + 2 = 8\) cases, weight 0.75 → \((8 \times 0.75)/30 = 0.2\)
Off by 2 steps: \(1 + 0 = 1\) case, weight 0 → \((1 \times 0)/30 = 0\)

\[ p_{o(w)} = 0.7 + 0.2 + 0 = 0.90 = 90\% \]

Notice the weighted observed agreement (90%) is higher than the unweighted 70%, because partial credit is now given for near-misses — which is exactly why choosing the right weight, in the right clinical context, materially changes the conclusion.

Step 2 — Weighted chance agreement. First the marginal probabilities for each rater:

Probability	Rater A	Rater B
Normal	10/30 = 0.3333	10/30 = 0.3333
Borderline	14/30 = 0.4667	12/30 = 0.4000
Abnormal	6/30 = 0.2000	8/30 = 0.2667

For every one of the nine A×B combinations, multiply A's probability by B's probability to get the chance joint probability, then multiply by that cell's weight:

A	B	Weight	P(A)	P(B)	Joint P(A)·P(B)	Weighted probability
Normal	Normal	1.00	0.3333	0.3333	0.11111	0.11111
Normal	Borderline	0.75	0.3333	0.4000	0.13333	0.10000
Normal	Abnormal	0.00	0.3333	0.2667	0.08889	0.00000
Borderline	Normal	0.75	0.4667	0.3333	0.15556	0.11667
Borderline	Borderline	1.00	0.4667	0.4000	0.18667	0.18667
Borderline	Abnormal	0.75	0.4667	0.2667	0.12444	0.09333
Abnormal	Normal	0.00	0.2000	0.3333	0.06667	0.00000
Abnormal	Borderline	0.75	0.2000	0.4000	0.08000	0.06000
Abnormal	Abnormal	1.00	0.2000	0.2667	0.05333	0.05333

Summing the final column:

\[ p_{e(w)} = 0.11111 + 0.10000 + 0 + 0.11667 + 0.18667 + 0.09333 + 0 + 0.06000 + 0.05333 = 0.72111 \approx 0.72 \]

Step 3 — Quadratic weighted Kappa. Same chance-correction formula, now with the weighted terms:

\[ \kappa_w = \frac{p_{o(w)} - p_{e(w)}}{1 - p_{e(w)}} = \frac{0.90 - 0.72}{1 - 0.72} = \frac{0.18}{0.28} = 0.64 \]

So even with generous partial credit lifting observed agreement to 90%, the chance-corrected weighted Kappa is 0.64 — the honest figure once we remove the luck that a forgiving weight scheme also amplifies in the chance term.
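
The three steps can be reproduced in a few lines of Python. This is a sketch under stated assumptions: `weighted_kappa` is a hypothetical helper, the table has rater A in rows and rater B in columns, and the categories are listed in their ordinal order.

```python
def weighted_kappa(table, scheme="quadratic"):
    """Return (weighted p_o, weighted p_e, weighted kappa) for a square table."""
    if scheme not in ("linear", "quadratic"):
        raise ValueError("scheme must be 'linear' or 'quadratic'")
    k = len(table)
    n = sum(sum(row) for row in table)
    power = 2 if scheme == "quadratic" else 1
    # Agreement weights: 1 on the diagonal, falling to 0 at maximum distance.
    w = [[1 - (abs(i - j) / (k - 1)) ** power for j in range(k)] for i in range(k)]
    row_p = [sum(row) / n for row in table]                               # rater A
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]    # rater B
    p_o_w = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_e_w = sum(w[i][j] * row_p[i] * col_p[j] for i in range(k) for j in range(k))
    return p_o_w, p_e_w, (p_o_w - p_e_w) / (1 - p_e_w)


xray3 = [[8, 1, 1],   # A = normal
         [2, 9, 3],   # A = borderline
         [0, 2, 4]]   # A = abnormal
p_o_w, p_e_w, kappa_w = weighted_kappa(xray3)
print(f"p_o(w) = {p_o_w:.2f}, p_e(w) = {p_e_w:.4f}, kappa_w = {kappa_w:.2f}")
# p_o(w) = 0.90, p_e(w) = 0.7211, kappa_w = 0.64
```

Calling the same helper with `scheme="linear"` on this table gives a lower value, about 0.59, which illustrates how much the choice of weights can move the result.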

When ICC replaces weighted Kappa

The Intraclass Correlation Coefficient (ICC) — covered fully in part 5 of this series (continuous data) — is frequently borrowed for ordinal data with many ordered levels. Its appeal here is practical: ICC sidesteps the whole problem of choosing the right weight scheme, which becomes increasingly awkward as the number of levels grows. For a short ordinal scale, weighted Kappa with a deliberately chosen weight is transparent and defensible; for a long ordinal scale (and especially with more than two raters), reach for ICC instead.

Key takeaways

% Agreement \(= (a+d)/n\) is honest and intuitive but counts chance agreement as if it were skill — it always overstates how good the raters are.
% Specific agreement (Yes: \(2a/((a+c)+(a+b))\); No: \(2d/((d+c)+(d+b))\)) exposes whether one particular category is the weak link, even when overall agreement looks fine.
Cohen's Kappa \(= (p_o - p_e)/(1 - p_e)\) corrects for chance; in the X-ray example 85% raw agreement became \(\kappa = 0.70\). Kappa can be negative (observed < chance), and the Landis–Koch verbal bands are arbitrary — report the number and its context.
Weighted Kappa credits partial agreement on ordinal scales. Choose linear weights when harm grows proportionally with the error size, quadratic when small slips barely matter but big misses are disproportionately costly. In the 3-level example, quadratic weighting gave \(p_{o(w)} = 0.90\), \(p_{e(w)} = 0.72\), so \(\kappa_w = 0.64\).
ICC can replace weighted Kappa for ordinal outcomes with many levels (or more than two raters), avoiding the need to choose a weight scheme.

References

de Vet HCW, Terwee CB, Bouter LM. Current challenges in clinimetrics. J Clin Epidemiol. 2003;56:1137–41.
Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist. Qual Life Res. 2010;19:539–49.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
Shrout PE, Fleiss JL. Intraclass correlations. Psychol Bull. 1979;86:420–28.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
Koo TK, Li MY. A guideline of selecting and reporting ICC. J Chiropr Med. 2016;15:155–63.
Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;1:307–10.
Gwet KL. Computing inter-rater reliability in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29–48.
Parmar M, Naqvi SAA, et al. Collaborative large language models for screening in systematic reviews. medRxiv. 2026.

From Sensitivity to Kappa (5-part series): (1) Performance vs Agreement [01_performance_vs_agreement] · (2) Agreement vs Reliability [02_agreement_vs_reliability] · (3) Reliability designs [03_reliability_designs] · (4) Categorical — kappa [04_categorical_kappa] · (5) Continuous — ICC & agreement [05_continuous_icc_agreement]