Cohen’s Kappa Explained: Weighted Agreement in Clinical Research

Mayta
Oct 20, 2025

Updated: Nov 11, 2025

Introduction: Why Use Cohen’s Kappa in Diagnostic Research?

Cohen’s Kappa (κ) is a widely used statistical measure for assessing agreement between two raters or measurement methods that classify items into categorical outcomes. It adjusts for the agreement that could occur by chance, providing a more realistic and conservative estimate of concordance.

Purpose and Applications

Measuring Inter-rater or Inter-method Agreement
- Kappa is primarily used to quantify the level of agreement between observers or diagnostic methods.
- Example: Two hepatologists classify the same patient cohort for liver fibrosis stage based on MRI and biopsy results.
Adjusting for Chance Agreement
- Unlike raw percentage agreement, Kappa accounts for agreement that could occur randomly, ensuring the result reflects true concordance between raters or methods.
Analyzing Categorical or Ordinal Data
- Kappa applies to categorical variables, both nominal (unordered) and ordinal (ordered).
- Weighted Kappa, in particular, is appropriate when the categories have an inherent order, such as staging or severity grading.

🔬 Importance in Diagnostic Accuracy Research

In diagnostic accuracy research, Cohen’s Kappa serves as a reliability metric to evaluate:

The agreement between an index test and a reference standard, such as comparing MRE-derived fibrosis stage to liver biopsy results.
The consistency of diagnostic decisions or interpretations among observers.
The reliability of a diagnostic method, which is a key component of overall validity in diagnostic evaluation studies.

📌 Diagnostic accuracy alone is not sufficient unless the test or observer demonstrates consistent and reproducible results across repeated assessments or raters.

1. What is Cohen’s Kappa (κ)?

Cohen’s kappa is a statistical measure of agreement between two raters (or methods) who classify items into categorical outcomes. It adjusts for agreement that would occur by chance, providing a more realistic measure of concordance.

Unweighted (simple) κ is used for nominal variables (e.g., “positive / negative”).
Weighted κ is used for ordinal variables (e.g., F0–F4 fibrosis stages, disease severity scales).

The kappa coefficient ranges from –1 to +1:

κ value	Interpretation
< 0	Less than chance agreement
0.00–0.20	Slight
0.21–0.40	Fair
0.41–0.60	Moderate
0.61–0.80	Substantial
0.81–1.00	Almost perfect
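
The definition above corresponds to the formula κ = (p_o − p_e) / (1 − p_e), where p_o is the observed proportion of agreement and p_e is the proportion expected by chance from each rater's marginal totals. The Python sketch below walks through that arithmetic and maps the result to the bands in the table. The function names and the 2×2 counts are illustrative assumptions, not part of Stata's kap command.

```python
def cohens_kappa(table):
    """Unweighted Cohen's kappa from a square table of counts.

    table[i][j] = number of items rater A placed in category i
    and rater B placed in category j.
    """
    k = len(table)
    n = sum(sum(row) for row in table)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance if the two raters were independent.
    p_e = sum(row_p[i] * col_p[i] for i in range(k))
    # Note: undefined (division by zero) if p_e equals 1.
    return (p_o - p_e) / (1 - p_e)


def interpret(kappa):
    """Map a kappa value to the descriptive bands listed above."""
    if kappa < 0:
        return "Less than chance agreement"
    bands = [(0.20, "Slight"), (0.40, "Fair"), (0.60, "Moderate"),
             (0.80, "Substantial"), (1.00, "Almost perfect")]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "Almost perfect"  # guards against rounding slightly above 1


# Hypothetical counts: rows = rater A (negative, positive), columns = rater B.
example = [[40, 10],
           [5, 45]]
kappa = cohens_kappa(example)
print(round(kappa, 2), interpret(kappa))  # 0.7 Substantial
```

Here raw agreement is 85%, but half of that is expected by chance alone, which is why κ comes out at 0.70 rather than 0.85.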

2. Why “weighted” kappa?

When categories have a natural order (e.g., F0 < F1 < F2 < F3 < F4), not all disagreements are equal.

A disagreement between F2 and F3 is mild.
A disagreement between F0 and F4 is severe.

Weighted kappa assigns partial credit for “near” agreement, reflecting the degree of difference.

In essence, weighting makes κ sensitive not only to whether disagreement exists, but also how big that disagreement is.

3. Weight types in Stata (kap command)

In Stata, the kap command allows you to specify different weight schemes using the wgt() option:

kap variable1 variable2, wgt(w)   // linear weights
kap variable1 variable2, wgt(w2)  // quadratic weights

3.1. Linear Weights (wgt(w))

Penalize disagreement proportionally to how far apart the categories are.
If one category difference counts as 1 unit, a 2-category difference counts as 2, etc.
Suitable when the categories are evenly spaced (e.g., 0–1–2–3–4 with equal clinical importance).

🔹 Example: If MRE = F2 and biopsy = F3, penalty = 1 step (small). If MRE = F0 and biopsy = F4, penalty = 4 steps (large).
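
As a rough illustration, the linear scheme can be written out as a matrix of agreement weights, 1 − |i − j| / (k − 1), where i and j are category positions and k is the number of categories. The Python sketch below (function name hypothetical) builds that matrix for the five fibrosis stages. It mirrors the weight formula for wgt(w), not Stata's internal code.

```python
def linear_weights(k):
    """Agreement weights 1 - |i - j| / (k - 1) for k ordered categories.

    A weight of 1 means full credit (exact agreement); 0 means no credit.
    """
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


stages = ["F0", "F1", "F2", "F3", "F4"]
w = linear_weights(len(stages))

# Print the full 5x5 weight matrix, one row per stage.
for label, row in zip(stages, w):
    print(label, row)

print(w[2][3])  # F2 vs F3, one step apart  -> 0.75
print(w[0][4])  # F0 vs F4, four steps apart -> 0.0
```

Each extra step of disagreement removes the same amount of credit (0.25 with five categories), which is what "proportional" means here.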

3.2. Quadratic Weights (wgt(w2))

Penalize larger disagreements much more heavily.
Weight decreases with the square of the difference between categories.
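Used when categories are ordinal and large disagreements should count disproportionately more than small ones, e.g., clinical stages, fibrosis scores, tumor grades. Like linear weights, quadratic weights are based on category position, so they do not themselves model unequal spacing between stages.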

🔹 In medical research (especially fibrosis staging), quadratic weighting is a common choice because:

A one-stage disagreement (e.g., F0 vs F1) is usually clinically minor,
But a multi-stage disagreement (e.g., F1 vs F4) is major (cirrhosis, decompensation).
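
To make the contrast with linear weights concrete, the Python sketch below tabulates both sets of agreement weights by distance between categories, using 1 − |i − j| / (k − 1) for linear and 1 − ((i − j) / (k − 1))² for quadratic. The function names are hypothetical, and the code only reproduces the weight formulas, not Stata's implementation.

```python
def linear_weights(k):
    """Linear agreement weights: 1 - |i - j| / (k - 1)."""
    return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]


def quadratic_weights(k):
    """Quadratic agreement weights: 1 - ((i - j) / (k - 1)) ** 2."""
    return [[1 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]


k = 5  # F0..F4
lin = linear_weights(k)
quad = quadratic_weights(k)

# Credit given for a disagreement of d steps (read off the first row).
print("steps  linear  quadratic")
for d in range(k):
    print(d, lin[0][d], quad[0][d])
# steps  linear  quadratic
# 0      1.0     1.0
# 1      0.75    0.9375
# 2      0.5     0.75
# 3      0.25    0.4375
# 4      0.0     0.0

# Both schemes depend only on the distance between positions, so
# F0 vs F1 and F3 vs F4 receive the same credit.
print(quad[0][1] == quad[3][4])  # True
```

Quadratic weights forgive near misses more generously (0.9375 vs 0.75 for one step), so nearly all of the penalty falls on large disagreements.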

4. When to choose each weighting

Weight type	Use when…	Typical example
Unweighted (default)	Data are nominal (no inherent order)	Male/Female; Positive/Negative
Linear (wgt(w))	Categories are ordered and evenly spaced	Pain scores (1–10), Likert scales
Quadratic (wgt(w2))	Categories are ordered; large errors matter disproportionately more	Liver fibrosis (F0–F4), cancer grades, disease severity scales

5. How to compare weighting results

Run both commands:

kap LiverBx_FCHFS_5stage MRE_stage, wgt(w)
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)

Interpret the difference (a rough rule of thumb, not a formal test):

Δκ = κ(w2) – κ(w)	Meaning	Suggested action
< 0.05	Differences mostly adjacent; linear is acceptable	Either
0.05–0.10	Moderate nonlinearity; prefer quadratic	Prefer wgt(w2)
> 0.10	Many large disagreements; quadratic weighting clearly appropriate	Use wgt(w2)
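
The comparison can also be traced by hand. The Python sketch below computes weighted kappa as κ_w = (p_o − p_e) / (1 − p_e), with both p_o and p_e summed over all cells using the agreement weights, and then applies the thresholds from the table above. The 5×5 counts are invented for illustration, and the function names are hypothetical; this is a sketch of the arithmetic, not a substitute for kap.

```python
def weights(k, kind):
    """Agreement weights for k ordered categories ('linear' or 'quadratic')."""
    if kind == "linear":
        return [[1 - abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    if kind == "quadratic":
        return [[1 - ((i - j) / (k - 1)) ** 2 for j in range(k)] for i in range(k)]
    raise ValueError("kind must be 'linear' or 'quadratic'")


def weighted_kappa(table, kind):
    """Weighted Cohen's kappa from a square table of counts."""
    k = len(table)
    n = sum(sum(row) for row in table)
    w = weights(k, kind)
    row_p = [sum(table[i]) / n for i in range(k)]
    col_p = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Weighted observed and chance-expected agreement over all cells.
    p_o = sum(w[i][j] * table[i][j] for i in range(k) for j in range(k)) / n
    p_e = sum(w[i][j] * row_p[i] * col_p[j] for i in range(k) for j in range(k))
    return (p_o - p_e) / (1 - p_e)


def suggested_action(delta):
    """Apply the rule-of-thumb thresholds from the table above."""
    if delta < 0.05:
        return "Either"
    if delta <= 0.10:
        return "Prefer wgt(w2)"
    return "Use wgt(w2)"


# Invented counts: rows = biopsy stage F0..F4, columns = MRE stage F0..F4.
table = [
    [18, 4, 1, 0, 0],
    [5, 12, 4, 1, 0],
    [1, 4, 10, 4, 1],
    [0, 1, 4, 11, 3],
    [0, 0, 1, 3, 14],
]
k_lin = weighted_kappa(table, "linear")
k_quad = weighted_kappa(table, "quadratic")
delta = k_quad - k_lin
print(f"linear={k_lin:.3f} quadratic={k_quad:.3f} delta={delta:.3f}")
print(suggested_action(delta))
```

Because the weight scheme changes what κ measures, it is usually safer to fix the scheme before looking at the data and to treat Δκ as a description of where the disagreements fall.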

6. Reporting weighted kappa in a study

When publishing:

Weighted Cohen’s kappa was used to evaluate agreement between MRE-derived and biopsy-derived fibrosis stage. Quadratic weighting was applied to penalize larger staging discrepancies more heavily, given the ordinal and clinically non-linear nature of fibrosis stages.

Example result:

Agreement between MRE and biopsy staging was substantial (quadratic weighted κ = 0.74, 95% CI 0.61–0.87).

7. Applicability — Is weighting for all κ?

No — weighted kappa is only for ordinal data.

Data type	Kappa type	Example
Nominal	Simple (unweighted)	Gender, infection present/absent
Ordinal	Weighted	Liver fibrosis stage, NYHA class
Continuous	Not kappa — use ICC (Intraclass Correlation Coefficient)	Lab values, test results

8. Practical Example in Hepatology

// Step 1: Compare biopsy vs MRE stage
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)

// Step 2: Compare biopsy vs Ultrasound stage
kap LiverBx_FCHFS_5stage US_stage, wgt(w2)

Interpretation:

“Quadratic weighted κ was used due to ordinal staging. κ = 0.68 indicated substantial agreement between imaging and histology.”

9. Summary Table

Type of kappa	Data type	Weight formula	When to use	Stata code
Simple κ	Nominal	None	Categorical (unordered)	kap x y
Linear weighted κ	Ordinal (equal spacing)	1 – |i–j| / (k–1)	Evenly spaced ordinal scales (e.g., Likert scales)	kap x y, wgt(w)
Quadratic weighted κ	Ordinal (large errors weigh more)	1 – ((i–j)/(k–1))²	Clinical staging, fibrosis	kap x y, wgt(w2)

10. Key takeaway

Weighted kappa refines the measurement of agreement by considering how far apart disagreements are. Use linear weights when each additional step of disagreement should count equally. Use quadratic weights when large disagreements are clinically much worse than small ones, which is the more common situation in medical research. In fibrosis staging, quadratic weighting (wgt(w2)) is usually the appropriate choice.

Example in Hepatology Context

In studies assessing liver fibrosis staging (F0–F4), each rater or method is represented as a variable:

LiverBx_FCHFS_5stage → fibrosis stage by liver biopsy (reference standard)
MRE_stage → fibrosis stage by magnetic resonance elastography (index test)

In Stata, this comparison is typically analyzed using weighted Kappa:

kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)

Here, quadratic weighting (wgt(w2)) is applied to account for the ordinal nature of fibrosis staging, where large discrepancies (e.g., F0 vs F4) are penalized more heavily than minor ones (e.g., F2 vs F3).

🧭 Summary

✅ Variables like LiverBx_FCHFS_5stage and MRE_stage can represent “raters” in Kappa analysis.
⚖️ Ensure both variables use identical coding schemes and represent comparable categories.
🧪 Weighted Kappa provides a more nuanced measure of agreement than simple percent agreement, especially for ordinal clinical scales.
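
The coding check can be done before any kappa is computed. The Python sketch below is one way to do it, assuming the ratings are available as two equal-length lists. It rejects any code outside the shared scheme and builds the cross-table with every category present, so that position-based weights line up even when a stage never occurs in the data. The function name and the sample ratings are hypothetical.

```python
def cross_tabulate(ratings_a, ratings_b, categories):
    """Build a k x k table of counts after checking both raters' codes.

    categories lists the shared coding scheme in its natural order,
    e.g. [0, 1, 2, 3, 4] for F0..F4.
    """
    if len(ratings_a) != len(ratings_b):
        raise ValueError("both raters must rate the same number of items")
    unknown = (set(ratings_a) | set(ratings_b)) - set(categories)
    if unknown:
        raise ValueError(
            f"codes outside the shared scheme: {sorted(map(repr, unknown))}"
        )
    position = {c: i for i, c in enumerate(categories)}
    k = len(categories)
    # Unused categories stay in the table as all-zero rows or columns.
    table = [[0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        table[position[a]][position[b]] += 1
    return table


# Invented ratings for seven patients, both coded 0..4 (F0..F4).
biopsy = [0, 1, 2, 2, 3, 4, 4]
mre = [0, 1, 1, 2, 3, 3, 4]
for row in cross_tabulate(biopsy, mre, categories=[0, 1, 2, 3, 4]):
    print(row)

# A rater coded 1..5 instead of 0..4 is caught before any kappa is computed.
try:
    cross_tabulate(biopsy, [1, 2, 2, 3, 4, 4, 5], categories=[0, 1, 2, 3, 4])
except ValueError as err:
    print("coding mismatch:", err)
```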