
Cohen’s Kappa vs Weighted Kappa: Measuring Agreement Beyond Chance


How to measure agreement beyond chance

1. Why do we need Kappa?

When two raters (or two methods) classify patients into categories—for example:

we want to know:

Do they really agree, or are they just “lucky” to match by chance?

Simple percent agreement (% of cases where both give the same category) is easy to understand but has a big limitation: it does not account for the agreement that would occur by chance alone.

Cohen’s Kappa (κ) and Weighted Kappa solve this by adjusting for chance agreement.


2. Cohen’s Kappa (κ) – for binary/nominal categories

2.1 The basic idea

Kappa answers:

“How much better is the observed agreement than what we would expect by chance alone?”

Formally:

$$\kappa = \frac{P_o - P_e}{1 - P_e}$$

2.2 The 2×2 table setup

Imagine two raters classify patients as disease / no disease:

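| | Rater B: Disease | Rater B: No disease | Total |
|---|---|---|---|
| Rater A: Disease | a (both say disease) | b (A disease, B no) | a + b |
| Rater A: No disease | c (A no, B disease) | d (both say no) | c + d |
| Total | a + c | b + d | N |

$$P_o = \frac{a + d}{N}$$

$$P_e = \left(\frac{a+b}{N} \times \frac{a+c}{N}\right) + \left(\frac{c+d}{N} \times \frac{b+d}{N}\right)$$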

Then plug Po and Pe into the κ formula.

You rarely calculate this by hand in practice (software will do it), but the logic is important.
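To make the logic concrete, here is a minimal Python sketch of the 2×2 calculation, using the cell names a, b, c, d from the table above. The function name and the counts are hypothetical and chosen only for illustration; for real analyses, an established statistics package is the safer choice.

```python
def cohen_kappa_2x2(a, b, c, d):
    """Return (Po, Pe, kappa) from the four cells of a 2x2 agreement table.

    a = both say disease, b = A disease / B no,
    c = A no / B disease, d = both say no disease.
    """
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table is empty")
    po = (a + d) / n  # observed agreement
    # chance agreement: product of marginal proportions, summed over categories
    pe = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    if abs(1 - pe) < 1e-12:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return po, pe, (po - pe) / (1 - pe)


# Hypothetical counts, for illustration only
po, pe, kappa = cohen_kappa_2x2(a=40, b=10, c=5, d=45)
print(f"Po={po:.2f}  Pe={pe:.2f}  kappa={kappa:.2f}")
# prints: Po=0.85  Pe=0.50  kappa=0.70
```

Here the raters agree on 85% of patients, but half of that agreement would be expected by chance, so κ is 0.70 and not 0.85.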

2.3 Interpreting Kappa (with caution)

People often use rough “guides” like:
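One widely cited guide of this kind is the Landis and Koch (1977) scale. Assuming that is the scale meant here (the results example later in this post uses its labels "substantial" and "almost perfect"), a minimal sketch of the mapping could look like this. The function name is hypothetical.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive label.

    Assumption: these are the bands the post has in mind; other authors
    propose different cut-offs.
    """
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie between -1 and 1")
    if kappa < 0.0:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"


for value in (0.15, 0.55, 0.78, 0.82):
    print(value, landis_koch_label(value))
```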

But these cut-offs are arbitrary. Better to interpret κ in context:

2.4 Prevalence and bias paradox

Kappa is famous for a few “paradoxes”:

  1. High percent agreement but low κ
    • If almost everyone is “negative,” raters can agree a lot by always saying “negative.”
    • Po is high, but Pe is also high → κ becomes small.
  2. Imbalanced use of categories
    • If one rater tends to label “positive” much more than the other, κ may drop even with decent agreement.
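The first paradox is easy to reproduce numerically. The sketch below compares two hypothetical 2×2 tables (cells a, b, c, d as in section 2.2) that have the same 92% observed agreement but very different prevalence. The counts and the helper name are made up for illustration.

```python
def kappa_from_cells(a, b, c, d):
    """Return (Po, Pe, kappa) for a 2x2 agreement table."""
    n = a + b + c + d
    po = (a + d) / n
    pe = ((a + b) / n) * ((a + c) / n) + ((c + d) / n) * ((b + d) / n)
    return po, pe, (po - pe) / (1 - pe)


# Hypothetical tables: identical observed agreement, different prevalence
balanced = kappa_from_cells(a=46, b=4, c=4, d=46)  # about half are positive
skewed = kappa_from_cells(a=2, b=4, c=4, d=90)     # almost everyone is negative

for name, (po, pe, kappa) in (("balanced", balanced), ("skewed", skewed)):
    print(f"{name:9s} Po={po:.2f}  Pe={pe:.2f}  kappa={kappa:.2f}")
# prints:
# balanced  Po=0.92  Pe=0.50  kappa=0.84
# skewed    Po=0.92  Pe=0.89  kappa=0.29
```

In the skewed table, most of the 92% agreement is already expected by chance, so κ falls to about 0.29 even though the raters disagree on the same number of patients.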

Practical tip: Always report:

Together, they give a more honest picture.


3. Weighted Kappa – when categories are ordered (ordinal)

For ordinal scales, not all disagreements are equally serious.

Example: pain scale (0–3)

Unweighted κ treats both disagreements as the same → not clinically realistic.

Weighted Kappa fixes this by giving partial credit when raters are “close.”

3.1 How weighting works (conceptually)

Suppose we have K ordered categories (e.g., 0, 1, 2, 3).

We create a weight matrix $w_{ij}$:

The formula becomes:

$$\kappa_w = \frac{P_{o,w} - P_{e,w}}{1 - P_{e,w}}$$

Again, software does the math, but the logic is:

“Closer disagreements are less serious and should not be punished as heavily.”
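As a rough sketch of that logic, the function below computes weighted κ from a K×K table of counts and a weight matrix. It assumes the usual definitions, in which the weighted observed agreement is the sum of weight × observed proportion over all cells and the weighted chance agreement is the sum of weight × (row proportion × column proportion). The table, the "half credit for one step apart" weights, and all names are hypothetical.

```python
def weighted_kappa(table, weights):
    """Weighted kappa for a KxK table of counts (rows: rater A, columns: rater B).

    weights[i][j] is the credit (between 0 and 1) given when rater A
    chooses category i and rater B chooses category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    cells = [(i, j) for i in range(k) for j in range(k)]
    po_w = sum(weights[i][j] * table[i][j] / n for i, j in cells)
    pe_w = sum(weights[i][j] * row_p[i] * col_p[j] for i, j in cells)
    return (po_w - pe_w) / (1 - pe_w)


# Hypothetical counts for a 0-3 scale, for illustration only
table = [
    [20, 5, 0, 0],
    [4, 18, 6, 0],
    [0, 5, 17, 5],
    [0, 0, 4, 16],
]
# Identity weights: only exact matches count (this is unweighted kappa)
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
# Illustrative partial credit: half credit when the ratings are one step apart
partial = [[1.0 if i == j else 0.5 if abs(i - j) == 1 else 0.0
            for j in range(4)] for i in range(4)]

unweighted = weighted_kappa(table, identity)
with_credit = weighted_kappa(table, partial)
print(f"unweighted={unweighted:.3f}  partial credit={with_credit:.3f}")
```

Because every disagreement in this table is only one step apart, giving partial credit raises the coefficient above the unweighted value.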

3.2 Common types of weights

Linear weights:

$$w_{ij} = 1 - \frac{|i - j|}{K - 1}$$

Quadratic weights:

$$w_{ij} = 1 - \left(\frac{i - j}{K - 1}\right)^2$$

Quadratic weighted κ is often quite close to the intraclass correlation coefficient (ICC) for such scales.
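To see how the two schemes differ, the sketch below tabulates linear and quadratic weights for K = 4 categories (for example, a 0–3 scale). The function names are hypothetical.

```python
def linear_weights(k):
    """w_ij = 1 - |i - j| / (K - 1)"""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def quadratic_weights(k):
    """w_ij = 1 - ((i - j) / (K - 1)) ** 2"""
    return [[1 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]


linear = linear_weights(4)
quadratic = quadratic_weights(4)

# Credit given when rater A says category 0 and rater B says 0, 1, 2, 3
print("linear:   ", [round(w, 3) for w in linear[0]])
print("quadratic:", [round(w, 3) for w in quadratic[0]])
# prints:
# linear:    [1.0, 0.667, 0.333, 0.0]
# quadratic: [1.0, 0.889, 0.556, 0.0]
```

Quadratic weights are more forgiving of near misses (0.889 versus 0.667 credit for a one-step disagreement), while both schemes give no credit for the largest possible disagreement.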

3.3 When to use Weighted Kappa

Use weighted κ when:

Use unweighted κ when:


4. Kappa vs other reliability measures

4.1 Kappa vs percent agreement

Best practice: report both.

4.2 Kappa vs correlation

Correlation (Pearson / Spearman):

Kappa (and weighted κ):

4.3 Kappa vs ICC

ICC (Intraclass Correlation Coefficient) is usually preferred when:

Kappa / weighted κ is typically used for:


5. How to report Kappa in a clinical paper

Good reporting in the Methods and Results sections might look like this:

Methods example:

“Inter-rater agreement for the binary outcome (fracture present/absent) was evaluated using Cohen’s kappa. For ordinal severity scores (0–3), we used quadratic weighted kappa to account for the ordered nature of the categories. For all outcomes, we additionally reported the overall percentage agreement to aid interpretation.”

Results example:

“For presence of fracture, the overall agreement was 92%, with a kappa of 0.78 (95% CI 0.69–0.86), indicating substantial agreement. For the 4-level severity score, quadratic weighted kappa was 0.82 (95% CI 0.74–0.90), reflecting almost perfect agreement.”


6. Practical take-home points

  1. Use Kappa for binary/nominal categorical outcomes
    • Inter-rater or test–retest
    • Always consider prevalence and marginal distributions.
  2. Use Weighted Kappa for ordinal outcomes
    • Choose linear or quadratic weights; quadratic is common and close to ICC.
  3. Always show % agreement alongside Kappa
    • Helps clinicians understand what κ “means” in real terms.
  4. Do not rely on correlation or t-tests / chi-square to claim “good reliability”
    • Those test association or group differences, not agreement.
  5. Interpret Kappa in context
    • Consider number of categories, prevalence, and clinical consequences of misclassification.
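Point 4 can be checked with a small, made-up example: if rater B always scores exactly one category higher than rater A, the two sets of scores are perfectly correlated, yet the raters never agree. The sketch below uses only the standard library; the data and function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohen_kappa(x, y):
    """Unweighted Cohen's kappa from two lists of category labels."""
    n = len(x)
    po = sum(a == b for a, b in zip(x, y)) / n
    cx, cy = Counter(x), Counter(y)
    pe = sum(cx[c] * cy[c] for c in set(x) | set(y)) / n ** 2
    return (po - pe) / (1 - pe)


# Hypothetical scores: rater B is always one category higher than rater A
rater_a = [0, 1, 2, 0, 1, 2, 0, 1, 2, 1]
rater_b = [score + 1 for score in rater_a]

r = pearson_r(rater_a, rater_b)
kappa = cohen_kappa(rater_a, rater_b)
print(f"Pearson r = {r:.2f}, kappa = {kappa:.2f}")
# prints: Pearson r = 1.00, kappa = -0.32
```

A correlation of 1.00 here reflects a perfectly consistent shift, not agreement; κ is negative because the raters match less often than chance would predict.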
