Cohen’s Kappa Explained: Weighted Agreement in Clinical Research
- Mayta

- Oct 20
Introduction: Why Use Cohen’s Kappa in Diagnostic Research?
Cohen’s Kappa (κ) is a widely used statistical measure for assessing agreement between two raters or measurement methods that classify items into categorical outcomes. It adjusts for the agreement that could occur by chance, providing a more realistic and conservative estimate of concordance.
Purpose and Applications
Measuring Inter-rater or Inter-method Agreement
Kappa is primarily used to quantify the level of agreement between observers or diagnostic methods.
Example: Two hepatologists independently assign a liver fibrosis stage to the same patient cohort based on MRI and biopsy findings.
Adjusting for Chance Agreement
Unlike raw percentage agreement, Kappa accounts for agreement that could occur randomly, ensuring the result reflects true concordance between raters or methods.
Analyzing Categorical or Ordinal Data
Kappa applies to categorical variables, both nominal (unordered) and ordinal (ordered).
Weighted Kappa, in particular, is appropriate when the categories have an inherent order, such as staging or severity grading.
🔬 Importance in Diagnostic Accuracy Research
In diagnostic accuracy research, Cohen’s Kappa serves as a reliability metric to evaluate:
The agreement between an index test and a reference standard, such as comparing MRE-derived fibrosis stage to liver biopsy results.
The consistency of diagnostic decisions or interpretations among observers.
The reliability of a diagnostic method is a key component of overall validity in diagnostic evaluation studies.
📌 Diagnostic accuracy alone is not sufficient unless the test or observer demonstrates consistent and reproducible results across repeated assessments or raters.
1. What is Cohen’s Kappa (κ)?
Cohen’s kappa is a statistical measure of agreement between two raters (or methods) who classify items into categorical outcomes. It adjusts for agreement that would occur by chance, providing a more realistic measure of concordance.
Unweighted (simple) κ is used for nominal variables (e.g., “positive / negative”).
Weighted κ is used for ordinal variables (e.g., F0–F4 fibrosis stages, disease severity scales).
The kappa coefficient ranges from –1 to +1: a value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance. By the commonly cited Landis and Koch benchmarks, 0.61–0.80 is read as substantial agreement and 0.81–1.00 as almost perfect agreement.
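To see what the chance correction does, here is a minimal Stata sketch of the unweighted κ arithmetic, using made-up counts for two raters and a binary outcome (the numbers are purely illustrative):
// Hypothetical 2x2 table: 40 rated positive by both, 30 negative by both,
// 10 positive only by rater A, 20 positive only by rater B (n = 100)
display "Observed agreement p_o = " ((40 + 30)/100)
display "Chance agreement p_e = " ((50/100)*(60/100) + (50/100)*(40/100))
display "Kappa = (p_o - p_e)/(1 - p_e) = " ((0.70 - 0.50)/(1 - 0.50))
Raw agreement looks like 70%, but once the 50% expected by chance is removed, κ is only 0.40.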
2. Why “weighted” kappa?
When categories have a natural order (e.g., F0 < F1 < F2 < F3 < F4), not all disagreements are equal.
A disagreement between F2 and F3 is mild.
A disagreement between F0 and F4 is severe.
Weighted kappa assigns partial credit for “near” agreement, reflecting the degree of difference.
In essence, weighting makes κ sensitive not only to whether disagreement exists, but also to how large that disagreement is.
3. Weight types in Stata (kap command)
In Stata, the kap command allows you to specify different weight schemes using the wgt() option:
kap variable1 variable2, wgt(w) // linear weights
kap variable1 variable2, wgt(w2) // quadratic weights
3.1. Linear Weights (wgt(w))
Penalize disagreement proportionally to how far apart the categories are.
If one category difference counts as 1 unit, a 2-category difference counts as 2, etc.
Suitable when the categories are evenly spaced (e.g., 0–1–2–3–4 with equal clinical importance).
🔹 Example: If MRE = F2 and biopsy = F3, penalty = 1 step (small). If MRE = F0 and biopsy = F4, penalty = 4 steps (large).
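To make the penalty concrete, here is a small sketch of the usual linear agreement weights (the scheme behind wgt(w)), which give credit 1 - d/(k - 1) for a d-step disagreement on a k-category scale:
// Linear agreement credit on a 5-stage scale (k = 5): 1 - d/(k - 1)
display "F2 vs F3 (1 step): credit = " (1 - 1/4)
display "F0 vs F4 (4 steps): credit = " (1 - 4/4)
A one-step miss keeps 75% of the credit; the widest possible miss keeps none.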
3.2. Quadratic Weights (wgt(w2))
Penalize larger disagreements much more heavily.
The agreement credit decreases with the square of the difference between categories, so the penalty grows quadratically with the size of the disagreement.
Used when categories are ordinal but not equally spaced — e.g., clinical stages, fibrosis scores, tumor grades.
🔹 In medical research (especially fibrosis staging), quadratic weighting is the standard choice because:
The difference between F0 and F1 is clinically minor,
But the difference between F3 and F4 is major (cirrhosis, decompensation).
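As a sketch of how the quadratic scheme (wgt(w2)) changes the picture, the agreement credit becomes 1 - (d/(k - 1))^2, so near misses keep almost full credit while large discrepancies dominate the penalty:
// Quadratic agreement credit on a 5-stage scale: 1 - (d/(k - 1))^2
display "F2 vs F3 (1 step): credit = " (1 - (1/4)^2)
display "F0 vs F4 (4 steps): credit = " (1 - (4/4)^2)
// Relative penalty: a 4-step error costs 16 times a 1-step error here,
// versus only 4 times under linear weighting
display ((4/4)^2 / (1/4)^2)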
4. When to choose each weighting
Linear weights (wgt(w)): the categories are evenly spaced and each one-step disagreement matters about equally.
Quadratic weights (wgt(w2)): the categories are ordered but not clinically equidistant, so large discrepancies should be penalized far more than near misses; this is the usual choice for fibrosis staging and similar severity scales.
5. How to compare weighting results
Run both commands:
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w)
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)
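To compare the two results numerically rather than by eye, the estimates can be pulled from the stored results; this is a sketch assuming kap leaves its estimate in r(kappa) (check return list in your Stata version) and reusing the variable names above:
// Run both weightings quietly and keep the returned kappa estimates side by side
quietly kap LiverBx_FCHFS_5stage MRE_stage, wgt(w)
local k_linear = r(kappa)
quietly kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)
local k_quadratic = r(kappa)
display "Linear kappa = " %5.3f `k_linear' "   Quadratic kappa = " %5.3f `k_quadratic'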
Interpret the difference: if the quadratic κ is noticeably higher than the linear κ, most disagreements involve adjacent categories (near misses); if the two values are similar, disagreements are spread more evenly across the scale and the choice of weights matters less.
6. Reporting weighted kappa in a study
When publishing:
Weighted Cohen’s kappa was used to evaluate agreement between MRE-derived and biopsy-derived fibrosis stage. Quadratic weighting was applied to penalize larger staging discrepancies more heavily, given the ordinal and clinically non-linear nature of fibrosis stages.
Example result:
Agreement between MRE and biopsy staging was substantial (quadratic weighted κ = 0.74, 95% CI 0.61–0.87).
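One way to obtain a confidence interval like the one above is to bootstrap the stored estimate; this is only a sketch, assuming kap returns r(kappa) and reusing the variable names from this post:
// Bootstrap the weighted kappa to get a 95% CI for reporting.
// kap's own standard error is mainly geared to testing kappa = 0,
// so a bootstrap (or another purpose-built CI method) is a common reporting choice.
bootstrap kappa = r(kappa), reps(1000) seed(20251020): ///
    kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)
estat bootstrap, percentile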
7. Applicability — Is weighting for all κ?
No. Weighted kappa applies only to ordinal data; for purely nominal categories (e.g., positive/negative), use unweighted (simple) kappa.
8. Practical Example in Hepatology
// Step 1: Compare biopsy vs MRE stage
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)
// Step 2: Compare biopsy vs Ultrasound stage
kap LiverBx_FCHFS_5stage US_stage, wgt(w2)
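The same two comparisons can be kept side by side with a short loop; a sketch assuming both index tests are coded on the same 0–4 scale as the biopsy variable:
// Quadratic weighted kappa for each index test against the biopsy reference
foreach index in MRE_stage US_stage {
    quietly kap LiverBx_FCHFS_5stage `index', wgt(w2)
    display "`index' vs biopsy: weighted kappa = " %5.3f r(kappa)
}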
Interpretation:
“Quadratic weighted κ was used because staging is ordinal. κ = 0.68 indicated substantial agreement between imaging and histology.”
9. Summary Table
Unweighted κ: nominal categories; every disagreement counts the same.
Linear weighted κ (wgt(w)): ordinal, evenly spaced categories; penalty proportional to the number of steps apart.
Quadratic weighted κ (wgt(w2)): ordinal, clinically non-linear categories; penalty grows with the square of the steps apart; the standard choice for fibrosis staging.
10. Key takeaway
Weighted kappa refines the measurement of agreement by considering how far apart disagreements are. Use linear weights when all categories are evenly spaced. Use quadratic weights for clinically ordered but non-linear categories — the most common in medical research. In fibrosis staging, quadratic weighting (wgt(w2)) is almost always the correct choice.
Example in Hepatology Context
In studies assessing liver fibrosis staging (F0–F4), each rater or method can be represented as a variable:
LiverBx_FCHFS_5stage → fibrosis stage by liver biopsy (reference standard)
MRE_stage → fibrosis stage by magnetic resonance elastography (index test)
In Stata, this comparison is typically analyzed using weighted Kappa:
kap LiverBx_FCHFS_5stage MRE_stage, wgt(w2)
Here, quadratic weighting (wgt(w2)) is applied to account for the ordinal nature of fibrosis staging, where large discrepancies (e.g., F0 vs F4) are penalized more heavily than minor ones (e.g., F2 vs F3).
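Because kap compares the two variables’ codes literally, it is worth confirming that both use the same numeric coding before running the command (see the checklist below); a minimal sketch, assuming stages are stored as 0–4:
// Check both staging variables share the same 0-4 coding and inspect missingness
tab1 LiverBx_FCHFS_5stage MRE_stage, missing
assert inlist(LiverBx_FCHFS_5stage, 0, 1, 2, 3, 4) if !missing(LiverBx_FCHFS_5stage)
assert inlist(MRE_stage, 0, 1, 2, 3, 4) if !missing(MRE_stage)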
🧭 Summary
✅ Yes — variables like LiverBx_FCHFS_5stage and MRE_stage can represent “raters” in Kappa analysis.
⚖️ Ensure both variables use identical coding schemes and represent comparable categories.
🧪 Weighted Kappa provides a more nuanced measure of agreement than simple percent agreement, especially for ordinal clinical scales.