Cohen’s Kappa vs Weighted Kappa: Measuring Agreement Beyond Chance
- Mayta

How to measure agreement beyond chance
1. Why do we need Kappa?
When two raters (or two methods) classify patients into categories—for example:
Fracture: yes / no
CT severity: mild / moderate / severe
ECG finding: normal / abnormal
we want to know:
Do they really agree, or are they just “lucky” to match by chance?
Simple percent agreement (% of cases where both give the same category) is easy to understand but has a big limitation:
If one category is very common (e.g., “no disease” = 90%), raters can agree “most of the time” even without real skill.
So percent agreement can overestimate the true reliability.
Cohen’s Kappa (κ) and Weighted Kappa solve this by adjusting for chance agreement.
2. Cohen’s Kappa (κ) – for binary/nominal categories
2.1 The basic idea
Kappa answers:
“How much better is the observed agreement than what we would expect by chance alone?”
Formally:
Let Po = observed proportion of agreement
Let Pe = expected proportion of agreement by chance
Then:
κ = (Po − Pe) / (1 − Pe)
κ = 1 → perfect agreement (beyond chance)
κ = 0 → no better than chance
κ < 0 → worse than chance (systematic disagreement)
2.2 The 2×2 table setup
Imagine two raters classify patients as disease / no disease:
| | Rater B: Disease | Rater B: No disease | Total |
|---|---|---|---|
| Rater A: Disease | a (both say disease) | b (A disease, B no) | a + b |
| Rater A: No disease | c (A no, B disease) | d (both say no) | c + d |
| Total | a + c | b + d | N |
Observed agreement: Po = (a + d) / N
Chance agreement (based on marginals): Pe = [(a + b)(a + c) + (c + d)(b + d)] / N²
Then plug Po and Pe into the κ formula.
You rarely calculate this by hand in practice (software will do it), but the logic is important.
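To see that logic in numbers, here is a minimal Python sketch that computes κ directly from the four cells of a 2×2 table; the counts are made up purely for illustration.

```python
# Hypothetical 2x2 counts: both "disease" (a), A only (b), B only (c), both "no disease" (d)
a, b, c, d = 40, 5, 10, 45
N = a + b + c + d

p_o = (a + d) / N                                      # observed agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2   # chance agreement from the marginals

kappa = (p_o - p_e) / (1 - p_e)
print(f"Po = {p_o:.2f}, Pe = {p_e:.2f}, kappa = {kappa:.2f}")
# -> Po = 0.85, Pe = 0.50, kappa = 0.70
```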
2.3 Interpreting Kappa (with caution)
People often use rough “guides” like:
< 0.00 → poor
0.00–0.20 → slight
0.21–0.40 → fair
0.41–0.60 → moderate
0.61–0.80 → substantial
0.81–1.00 → almost perfect
But these cut-offs are arbitrary. Better to interpret κ in context:
How important are errors clinically?
What is the prevalence of each category?
How many categories are there?
2.4 Prevalence and bias paradox
Kappa is famous for a few “paradoxes”:
High percent agreement but low κ
If almost everyone is “negative,” raters can agree a lot by always saying “negative.”
Po is high, but Pe is also high → κ becomes small.
Imbalanced use of categories
If one rater tends to label “positive” much more than the other, κ may drop even with decent agreement.
Practical tip: Always report:
% agreement
κ
Prevalence of categories (marginals)
Together, they give a more honest picture.
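As a small numeric illustration of the paradox above (all counts are made up), a rare "positive" category drives Pe up and κ down even when raw agreement looks excellent:

```python
# Hypothetical counts: both positive (a), A only (b), B only (c), both negative (d)
a, b, c, d = 1, 3, 3, 93
N = a + b + c + d

p_o = (a + d) / N
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"% agreement = {p_o:.0%}, kappa = {kappa:.2f}")
# -> % agreement = 94%, kappa = 0.22  (high raw agreement, only "fair" kappa)
```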
3. Weighted Kappa – when categories are ordered (ordinal)
For ordinal scales, not all disagreements are equally serious.
Example: pain scale (0–3)
Disagreeing between 0 vs 1 (no pain vs mild) is less severe than
Disagreeing between 0 vs 3 (no pain vs severe)
Unweighted κ treats both disagreements as the same → not clinically realistic.
→ Weighted Kappa fixes this by giving partial credit when raters are “close.”
3.1 How weighting works (conceptually)
Suppose we have K ordered categories (e.g., 0, 1, 2, 3).
We create a weight matrix w_ij:
i = category from rater A
j = category from rater B
w_ij = 1 if i = j (perfect agreement)
0 < w_ij < 1 for intermediate disagreements (partial credit)
w_ij = 0 for maximal disagreement (e.g., 0 vs 3)
The formula becomes:
Replace Po with a weighted observed agreement Po_w (sum of weights × observed proportions)
Replace Pe with a weighted expected agreement Pe_w under chance
Plug into:
κ_w = (Po_w − Pe_w) / (1 − Pe_w)
Again, software does the math, but the logic is:
“Closer disagreements are less serious and should not be punished as heavily.”
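To make that concrete, here is a NumPy sketch that builds a linear weight matrix for a 4-level scale and computes weighted κ from a cross-tabulation of the two raters; the confusion matrix is invented for illustration.

```python
import numpy as np

K = 4  # ordinal categories 0-3

# Hypothetical cross-tabulation of rater A (rows) vs rater B (columns)
O = np.array([[20,  4,  1,  0],
              [ 3, 15,  5,  1],
              [ 0,  4, 18,  3],
              [ 0,  1,  2, 23]], dtype=float)
N = O.sum()

# Agreement weights: 1 on the diagonal, shrinking linearly with the distance |i - j|
i, j = np.indices((K, K))
w = 1 - np.abs(i - j) / (K - 1)

E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N   # expected counts under chance (from marginals)
po_w = (w * O).sum() / N                         # weighted observed agreement
pe_w = (w * E).sum() / N                         # weighted expected agreement

kappa_w = (po_w - pe_w) / (1 - pe_w)
print(f"linear weighted kappa = {kappa_w:.2f}")
```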
3.2 Common types of weights
Linear weights
Penalty increases in a straight line as categories get further apart: w_ij = 1 − |i − j| / (K − 1)
Example for a 4-level scale (0–3): agreement weights of 1, 2/3, 1/3, and 0 for disagreements of 0, 1, 2, and 3 categories
Quadratic weights
Larger penalty for big disagreements (like squared distance): w_ij = 1 − (i − j)² / (K − 1)²
Often used in practice
Example for the same 4-level scale: agreement weights of 1, 8/9, 5/9, and 0 for disagreements of 0, 1, 2, and 3 categories
Quadratic weighted κ is often quite close to the intraclass correlation coefficient (ICC) for such scales.
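In practice a library call does the arithmetic; for example, scikit-learn's cohen_kappa_score accepts a weights argument. A minimal sketch with made-up ordinal ratings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical severity scores (0-3) from two raters on the same 16 cases
rater_a = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 1, 2, 3, 2])
rater_b = np.array([0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 0, 1, 2, 3, 1])

print("unweighted kappa  :", cohen_kappa_score(rater_a, rater_b))
print("linear weighted   :", cohen_kappa_score(rater_a, rater_b, weights="linear"))
print("quadratic weighted:", cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```

In these made-up data every disagreement is only one category apart, so both weighted values come out higher than the unweighted κ, with the quadratic version highest.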
3.3 When to use Weighted Kappa
Use weighted κ when:
Outcome scale is ordinal
You care about how far apart the ratings are
Example:
Disease severity: mild / moderate / severe
CT score: 0–5
Likert responses: strongly disagree → strongly agree
Use unweighted κ when:
Categories are nominal (no meaningful order)
Example:
Pathogen type
Mechanism of injury
Blood group
4. Kappa vs other reliability measures
4.1 Kappa vs percent agreement
Percent agreement
Simple, intuitive
Overestimates reliability in skewed distributions
Kappa / weighted Kappa
Adjust for chance agreement
More honest, but can look “low” even when % agreement is high (prevalence paradox)
Best practice: report both.
4.2 Kappa vs correlation
Correlation (Pearson / Spearman):
Measures association, not agreement
Can be very high even when raters are systematically different (e.g., one always scoring +2 higher than the other)
Should not be used alone as a reliability index for categorical or rating data
Kappa (and weighted κ):
Designed specifically for agreement beyond chance
Correct choice for categorical / ordinal ratings
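A quick made-up illustration of why correlation is not agreement: below, rater B always scores exactly 2 points higher than rater A, so the Pearson correlation is perfect even though the two never give the same score.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings on a 0-7 scale; B is systematically 2 points higher than A
rater_a = np.array([0, 1, 2, 3, 4, 5, 0, 2, 3, 5])
rater_b = rater_a + 2

r, _ = pearsonr(rater_a, rater_b)
exact = np.mean(rater_a == rater_b)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"Pearson r       = {r:.2f}")      # 1.00 -- "perfect" association
print(f"exact agreement = {exact:.0%}")  # 0%   -- the raters never give the same score
print(f"Cohen's kappa   = {kappa:.2f}")  # negative -- worse than chance agreement
```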
4.3 Kappa vs ICC
ICC (Intraclass Correlation Coefficient) is usually preferred when:
Outcome is continuous, or
Ordinal scale with many levels treated as approximately continuous
There are more than 2 raters
Kappa / weighted κ is typically used for:
Two raters
Categorical (binary, nominal, or ordinal) outcomes
5. How to report Kappa in a clinical paper
A good Methods + Results reporting might look like this:
Methods example:
“Inter-rater agreement for the binary outcome (fracture present/absent) was evaluated using Cohen’s kappa. For ordinal severity scores (0–3), we used quadratic weighted kappa to account for the ordered nature of the categories. For all outcomes, we additionally reported the overall percentage agreement to aid interpretation.”
Results example:
“For presence of fracture, the overall agreement was 92%, with a kappa of 0.78 (95% CI 0.69–0.86), indicating substantial agreement. For the 4-level severity score, quadratic weighted kappa was 0.82 (95% CI 0.74–0.90), reflecting almost perfect agreement.”
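Most statistical software reports a confidence interval for κ; if yours does not, a percentile bootstrap is one simple option. A minimal sketch with simulated binary ratings (the data-generating step is purely illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated binary ratings (1 = fracture present) from two readers of 100 images,
# constructed so that rater B matches rater A about 85% of the time
rater_a = rng.integers(0, 2, size=100)
rater_b = np.where(rng.random(100) < 0.85, rater_a, 1 - rater_a)

kappa = cohen_kappa_score(rater_a, rater_b)

# Percentile bootstrap for an approximate 95% CI
boot = []
n = len(rater_a)
for _ in range(2000):
    idx = rng.integers(0, n, size=n)          # resample cases with replacement
    boot.append(cohen_kappa_score(rater_a[idx], rater_b[idx]))
low, high = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {kappa:.2f} (95% CI {low:.2f} to {high:.2f})")
```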
6. Practical take-home points
Use Kappa for binary/nominal categorical outcomes
Inter-rater or test–retest
Always consider prevalence and marginal distributions.
Use Weighted Kappa for ordinal outcomes
Choose linear or quadratic weights; quadratic is common and close to ICC.
Always show % agreement alongside Kappa
Helps clinicians understand what κ “means” in real terms.
Do not rely on correlation or t-tests / chi-square to claim “good reliability”
Those test association or group differences, not agreement.
Interpret Kappa in context
Consider number of categories, prevalence, and clinical consequences of misclassification.





