Cohen’s Kappa vs Weighted Kappa: Measuring Agreement Beyond Chance
- Mayta

How to measure agreement beyond chance
1. Why do we need Kappa?
When two raters (or two methods) classify patients into categories—for example:
Fracture: yes / no
CT severity: mild / moderate / severe
ECG finding: normal / abnormal
we want to know:
Do they really agree, or are they just “lucky” to match by chance?
Simple percent agreement (% of cases where both give the same category) is easy to understand but has a big limitation:
If one category is very common (e.g., “no disease” = 90%), raters can agree “most of the time” even without real skill.
So percent agreement can overestimate the true reliability.
Cohen’s Kappa (κ) and Weighted Kappa solve this by adjusting for chance agreement.
2. Cohen’s Kappa (κ) – for binary/nominal categories
2.1 The basic idea
Kappa answers:
“How much better is the observed agreement than what we would expect by chance alone?”
Formally:
Let Po = observed proportion of agreement
Let Pe = expected proportion of agreement by chance
Then:
κ = (Po − Pe) / (1 − Pe)
κ = 1 → perfect agreement (beyond chance)
κ = 0 → no better than chance
κ < 0 → worse than chance (systematic disagreement)
2.2 The 2×2 table setup
Imagine two raters classify patients as disease / no disease:
| | Rater B: Disease | Rater B: No disease | Total |
|---|---|---|---|
| Rater A: Disease | a (both say disease) | b (A disease, B no) | a + b |
| Rater A: No disease | c (A no, B disease) | d (both say no) | c + d |
| Total | a + c | b + d | N |
Observed agreement: Po = (a + d) / N
Chance agreement (based on marginals): Pe = [(a + b)(a + c) + (c + d)(b + d)] / N²
Then plug Po and Pe into the κ formula.
You rarely calculate this by hand in practice (software will do it), but the logic is important.
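To see that logic in numbers, here is a minimal Python sketch that computes κ directly from the four cells of a 2×2 table; the counts are made up purely for illustration.

```python
# Hypothetical 2x2 counts: both "disease" (a), A only (b), B only (c), both "no disease" (d)
a, b, c, d = 40, 5, 10, 45
N = a + b + c + d

p_o = (a + d) / N                                      # observed agreement
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2   # chance agreement from the marginals

kappa = (p_o - p_e) / (1 - p_e)
print(f"Po = {p_o:.2f}, Pe = {p_e:.2f}, kappa = {kappa:.2f}")
# -> Po = 0.85, Pe = 0.50, kappa = 0.70
```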
2.3 Interpreting Kappa (with caution)
People often use rough “guides” like:
< 0.00 → poor
0.00–0.20 → slight
0.21–0.40 → fair
0.41–0.60 → moderate
0.61–0.80 → substantial
0.81–1.00 → almost perfect
But these cut-offs are arbitrary. Better to interpret κ in context:
How important are errors clinically?
What is the prevalence of each category?
How many categories are there?
2.4 Prevalence and bias paradox
Kappa is famous for a few “paradoxes”:
High percent agreement but low κ
If almost everyone is “negative,” raters can agree a lot by always saying “negative.”
Po is high, but Pe is also high → κ becomes small.
Imbalanced use of categories
If one rater tends to label “positive” much more than the other, κ may drop even with decent agreement.
Practical tip: Always report:
% agreement
κ
Prevalence of categories (marginals)
Together, they give a more honest picture.
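As a small numeric illustration of the paradox above (all counts are made up), a rare "positive" category drives Pe up and κ down even when raw agreement looks excellent:

```python
# Hypothetical counts: both positive (a), A only (b), B only (c), both negative (d)
a, b, c, d = 1, 3, 3, 93
N = a + b + c + d

p_o = (a + d) / N
p_e = ((a + b) * (a + c) + (c + d) * (b + d)) / N**2
kappa = (p_o - p_e) / (1 - p_e)

print(f"% agreement = {p_o:.0%}, kappa = {kappa:.2f}")
# -> % agreement = 94%, kappa = 0.22  (high raw agreement, only "fair" kappa)
```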
3. Weighted Kappa – when categories are ordered (ordinal)
For ordinal scales, not all disagreements are equally serious.
Example: pain scale (0–3)
Disagreeing between 0 vs 1 (no pain vs mild) is less severe than
Disagreeing between 0 vs 3 (no pain vs severe)
Unweighted κ treats both disagreements as the same → not clinically realistic.
→ Weighted Kappa fixes this by giving partial credit when raters are “close.”
3.1 How weighting works (conceptually)
Suppose we have K ordered categories (e.g., 0, 1, 2, 3).
We create a weight matrix w_ij:
i = category from rater A
j = category from rater B
w_ij = 1 if i = j (perfect agreement)
0 < w_ij < 1 for intermediate disagreements (partial credit)
w_ij = 0 for maximal disagreement (e.g., 0 vs 3)
The formula becomes:
Replace Po with a weighted observed agreement Po_w (sum of weights × observed proportions)
Replace Pe with a weighted expected agreement Pe_w under chance
Plug into:
κ_w = (Po_w − Pe_w) / (1 − Pe_w)
Again, software does the math, but the logic is:
“Closer disagreements are less serious and should not be punished as heavily.”
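To make that concrete, here is a NumPy sketch that builds a linear weight matrix for a 4-level scale and computes weighted κ from a cross-tabulation of the two raters; the confusion matrix is invented for illustration.

```python
import numpy as np

K = 4  # ordinal categories 0-3

# Hypothetical cross-tabulation of rater A (rows) vs rater B (columns)
O = np.array([[20,  4,  1,  0],
              [ 3, 15,  5,  1],
              [ 0,  4, 18,  3],
              [ 0,  1,  2, 23]], dtype=float)
N = O.sum()

# Agreement weights: 1 on the diagonal, shrinking linearly with the distance |i - j|
i, j = np.indices((K, K))
w = 1 - np.abs(i - j) / (K - 1)

E = np.outer(O.sum(axis=1), O.sum(axis=0)) / N   # expected counts under chance (from marginals)
po_w = (w * O).sum() / N                         # weighted observed agreement
pe_w = (w * E).sum() / N                         # weighted expected agreement

kappa_w = (po_w - pe_w) / (1 - pe_w)
print(f"linear weighted kappa = {kappa_w:.2f}")
```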
3.2 Common types of weights
Linear weights
Penalty increases in a straight line as categories get further apart: w_ij = 1 − |i − j| / (K − 1)
Example for a 4-level scale (0–3): agreement weights of 1, 2/3, 1/3, and 0 for disagreements of 0, 1, 2, and 3 categories
Quadratic weights
Larger penalty for big disagreements (like squared distance): w_ij = 1 − (i − j)² / (K − 1)²
Often used in practice
Example for the same 4-level scale: agreement weights of 1, 8/9, 5/9, and 0 for disagreements of 0, 1, 2, and 3 categories
Quadratic weighted κ is often quite close to the intraclass correlation coefficient (ICC) for such scales.
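In practice a library call does the arithmetic; for example, scikit-learn's cohen_kappa_score accepts a weights argument. A minimal sketch with made-up ordinal ratings:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical severity scores (0-3) from two raters on the same 16 cases
rater_a = np.array([0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 0, 1, 2, 3, 2])
rater_b = np.array([0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 2, 0, 1, 2, 3, 1])

print("unweighted kappa  :", cohen_kappa_score(rater_a, rater_b))
print("linear weighted   :", cohen_kappa_score(rater_a, rater_b, weights="linear"))
print("quadratic weighted:", cohen_kappa_score(rater_a, rater_b, weights="quadratic"))
```

In these made-up data every disagreement is only one category apart, so both weighted values come out higher than the unweighted κ, with the quadratic version highest.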
3.3 When to use Weighted Kappa
Use weighted κ when:
Outcome scale is ordinal
You care about how far apart the ratings are
Example:
Disease severity: mild / moderate / severe
CT score: 0–5
Likert responses: strongly disagree → strongly agree
Use unweighted κ when:
Categories are nominal (no meaningful order)
Example:
Pathogen type
Mechanism of injury
Blood group
4. Kappa vs other reliability measures
4.1 Kappa vs percent agreement
Percent agreement
Simple, intuitive
Overestimates reliability in skewed distributions
Kappa / weighted Kappa
Adjust for chance agreement
More honest, but can look “low” even when % agreement is high (prevalence paradox)
Best practice: report both.
4.2 Kappa vs correlation
Correlation (Pearson / Spearman):
Measures association, not agreement
Can be very high even when raters are systematically different (e.g., one always scoring +2 higher than the other)
Should not be used alone as a reliability index for categorical or rating data
Kappa (and weighted κ):
Designed specifically for agreement beyond chance
Correct choice for categorical / ordinal ratings
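A quick made-up illustration of why correlation is not agreement: below, rater B always scores exactly 2 points higher than rater A, so the Pearson correlation is perfect even though the two never give the same score.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score

# Hypothetical ratings on a 0-7 scale; B is systematically 2 points higher than A
rater_a = np.array([0, 1, 2, 3, 4, 5, 0, 2, 3, 5])
rater_b = rater_a + 2

r, _ = pearsonr(rater_a, rater_b)
exact = np.mean(rater_a == rater_b)
kappa = cohen_kappa_score(rater_a, rater_b)

print(f"Pearson r       = {r:.2f}")      # 1.00 -- "perfect" association
print(f"exact agreement = {exact:.0%}")  # 0%   -- the raters never give the same score
print(f"Cohen's kappa   = {kappa:.2f}")  # negative -- worse than chance agreement
```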
4.3 Kappa vs ICC
ICC (Intraclass Correlation Coefficient) is usually preferred when:
Outcome is continuous, or
Ordinal scale with many levels treated as approximately continuous
There are more than 2 raters
Kappa / weighted κ is typically used for:
Two raters
Categorical (binary, nominal, or ordinal) outcomes
5. How to report Kappa in a clinical paper
A good Methods + Results reporting might look like this:
Methods example:
“Inter-rater agreement for the binary outcome (fracture present/absent) was evaluated using Cohen’s kappa. For ordinal severity scores (0–3), we used quadratic weighted kappa to account for the ordered nature of the categories. For all outcomes, we additionally reported the overall percentage agreement to aid interpretation.”
Results example:
“For presence of fracture, the overall agreement was 92%, with a kappa of 0.78 (95% CI 0.69–0.86), indicating substantial agreement. For the 4-level severity score, quadratic weighted kappa was 0.82 (95% CI 0.74–0.90), reflecting almost perfect agreement.”
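Most statistical software reports a confidence interval for κ; if yours does not, a percentile bootstrap is one simple option. A minimal sketch with simulated binary ratings (the data-generating step is purely illustrative):

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

# Simulated binary ratings (1 = fracture present) from two readers of 100 images,
# constructed so that rater B matches rater A about 85% of the time
rater_a = rng.integers(0, 2, size=100)
rater_b = np.where(rng.random(100) < 0.85, rater_a, 1 - rater_a)

kappa = cohen_kappa_score(rater_a, rater_b)

# Percentile bootstrap for an approximate 95% CI
boot = []
n = len(rater_a)
for _ in range(2000):
    idx = rng.integers(0, n, size=n)          # resample cases with replacement
    boot.append(cohen_kappa_score(rater_a[idx], rater_b[idx]))
low, high = np.percentile(boot, [2.5, 97.5])

print(f"kappa = {kappa:.2f} (95% CI {low:.2f} to {high:.2f})")
```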
6. Practical take-home points
Use Kappa for binary/nominal categorical outcomes
Inter-rater or test–retest
Always consider prevalence and marginal distributions.
Use Weighted Kappa for ordinal outcomes
Choose linear or quadratic weights; quadratic is common and close to ICC.
Always show % agreement alongside Kappa
Helps clinicians understand what κ “means” in real terms.
Do not rely on correlation or t-tests / chi-square to claim “good reliability”
Those test association or group differences, not agreement.
Interpret Kappa in context
Consider number of categories, prevalence, and clinical consequences of misclassification.





