Agreement vs Reliability in Categorical Data: A Practical Guide for Clinical Researchers
- Mayta

1. Why do we care about reliability?
When you design a clinical tool (score, scale, questionnaire, diagnostic classification), you usually have:
Raters (people or systems making the judgment)
Repeated measurements (same patient, same test, measured twice or more)
Categorical outcomes (e.g. “present/absent”, “stage I/II/III”, “mild/moderate/severe”)
You want to know:
Agreement – Do raters give the same category?
Reliability – Does the measurement reflect true differences between patients rather than random error?
Those are related but not identical concepts, and they require different statistics.
2. First step: identify your type of categorical outcome
2.1 Binary data
Two categories only.
Examples:
Positive / Negative
Yes / No
Disease / No disease
This is still a type of categorical data.
2.2 Nominal data
More than two categories, with no natural order.
Examples:
Blood group: A, B, AB, O
Type of arrhythmia: AF, VT, SVT, other
Mechanism of injury: fall, MVC, sports, other
You cannot say “A > B > O” in a meaningful numeric way → no order.
2.3 Ordinal data
Categories that have a natural order, but are still discrete.
Examples:
Pain: none, mild, moderate, severe
Disease stage: I, II, III, IV
Likert scale: strongly disagree → strongly agree
Here “severe” is “more” than “moderate”, but you can’t guarantee equal distances between categories.
3. Agreement vs. Reliability: what’s the difference?
3.1 Agreement
Agreement asks:
“How often do raters give exactly the same category?”
% Agreement = (number of exact matches) / (number of cases rated) × 100
% Specific agreement = agreement calculated separately for each category (e.g., positive agreement and negative agreement reported as separate figures)
This is simple but crude. It does not adjust for agreement that would happen by chance alone.
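To make the arithmetic concrete, here is a minimal sketch in Python; the two rating lists are invented for illustration, and the specific-agreement line uses the standard 2a / (2a + b + c) form for positive agreement:

```python
# Minimal sketch: percent agreement and positive specific agreement for two
# raters on a binary outcome. The example ratings are invented for illustration.

rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

n = len(rater_a)
matches = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = matches / n * 100

# Positive specific agreement: 2 * (both "pos") / (times A said "pos" + times B said "pos")
both_pos = sum(a == b == "pos" for a, b in zip(rater_a, rater_b))
positive_agreement = 2 * both_pos / (rater_a.count("pos") + rater_b.count("pos")) * 100

print(f"% agreement: {percent_agreement:.1f}")
print(f"% positive (specific) agreement: {positive_agreement:.1f}")
```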
3.2 Reliability
Reliability asks:
“How much of the variability in scores is due to true differences between patients, not measurement error?”
Statistically, this is like a signal-to-noise ratio:
High reliability → most variability is true subject difference
Low reliability → most variability is just noise / random error
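Written as a formula, in the same spirit as the percent-agreement formula above (this is the standard variance-ratio definition, not something specific to this guide):
Reliability = (variance from true differences between patients) / (variance from true differences between patients + variance from measurement error)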
To measure this properly, we use statistics that:
Adjust for chance agreement
Consider variability between subjects and between raters
4. How to choose the right reliability statistic
4.1 For binary or nominal outcomes
Goal 1: Describe agreement. Use:
Percent agreement
Percent specific agreement (especially useful if prevalence is very high or very low)
Goal 2: Quantify reliability. Use:
Cohen’s Kappa (κ) for two raters
Kappa answers:
“How much better is the observed agreement than chance agreement?”
κ = 1 → perfect reliability
κ = 0 → no better than chance
κ < 0 → worse than chance (systematic disagreement)
Kappa is sensitive to prevalence (very rare or very common categories can give paradoxical values), so it is often reported together with % agreement and % specific agreement, not alone.
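For reference, kappa is computed as κ = (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch, assuming scikit-learn is available and reusing the invented ratings from the example above:

```python
# Minimal sketch: Cohen's kappa for two raters on a binary outcome.
# Assumes scikit-learn is installed; ratings are the invented example from above.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

# kappa = (P_observed - P_chance) / (1 - P_chance)
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```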
4.2 For ordinal outcomes
You still want:
% Agreement
% Specific agreement
But for reliability, simple Kappa (unweighted) is not ideal, because it treats all disagreement as equally bad.
Example:
Rater A: “mild”, Rater B: “moderate”
Rater A: “mild”, Rater B: “severe”
Clinically, the second discrepancy is much worse, but unweighted Kappa treats them the same.
So we use: ✅ Weighted Kappa
Assigns weights to disagreements:
Small disagreements (mild vs moderate) → lighter penalty
Big disagreements (mild vs severe) → heavier penalty
Common weights:
Linear (penalty increases linearly with distance)
Quadratic (penalty increases more for bigger gaps)
Weighted Kappa is therefore preferred for:
Symptom severity scores
Clinical staging
Likert-type scales
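A minimal sketch of weighted kappa, assuming scikit-learn and coding the ordinal categories as integers (0 = none … 3 = severe) so the distance between categories is well defined; the ratings are invented:

```python
# Minimal sketch: linear and quadratic weighted kappa for an ordinal scale.
# Assumes scikit-learn; categories are coded 0=none, 1=mild, 2=moderate, 3=severe
# so that the "distance" between categories is meaningful. Ratings are invented.
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]
rater_b = [0, 2, 2, 1, 3, 1, 1, 3, 3, 1]

print(f"Linear weighted kappa:    {cohen_kappa_score(rater_a, rater_b, weights='linear'):.2f}")
print(f"Quadratic weighted kappa: {cohen_kappa_score(rater_a, rater_b, weights='quadratic'):.2f}")
```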
✅ ICC (Intraclass Correlation Coefficient) for ordinal with many raters
When:
You have ≥ 2 raters (often >2)
The scale is ordinal but can reasonably be treated as approximately continuous
Then you can use ICC as a reliability index.
ICC is widely used for continuous data, but can be used for ordinal scores that behave like continuous measurements (e.g., 0–10 scales).
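A minimal sketch of an ICC calculation, assuming the pingouin package and a long-format table with one row per patient-rater pair; the data are invented, and the ICC form you report (e.g., ICC2 for a two-way random-effects design) still has to match your study design:

```python
# Minimal sketch: ICC for three raters scoring a 0-10 scale treated as roughly
# continuous. Assumes pandas and pingouin are available; the data are invented.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "rater":   ["A", "B", "C"] * 5,
    "score":   [7, 8, 7, 2, 3, 2, 5, 5, 6, 9, 9, 8, 4, 3, 4],
})

icc = pg.intraclass_corr(data=data, targets="patient", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # pick the ICC form that matches your design (e.g., ICC2)
```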
5. Statistics that are often misused for reliability
These are important “red flags” because they appear in many papers.
5.1 Correlation coefficients (Pearson, Spearman) ❌ not reliability
Correlation answers:
“Do scores move in the same direction?”
It does not ask:
“Do raters give the same value?”
Example:
Rater A always gives 2 points higher than Rater B (e.g. A = B + 2)
Correlation (r) will be very high (close to 1.0) → they are perfectly linearly associated
But agreement is poor, because they systematically disagree by 2 points
So:
High correlation ≠ high agreement
Correlation is OK for association, not for reliability of ratings
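A minimal sketch of that example in Python (invented scores, with Rater A fixed at exactly 2 points above Rater B):

```python
# Minimal sketch: perfect correlation with zero agreement.
# Rater A always scores exactly 2 points higher than Rater B (invented data).
import numpy as np

rater_b = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rater_a = rater_b + 2  # systematic offset of +2

r = np.corrcoef(rater_a, rater_b)[0, 1]
exact_matches = np.mean(rater_a == rater_b) * 100

print(f"Pearson r: {r:.2f}")                      # 1.00 -> perfectly correlated
print(f"% exact agreement: {exact_matches:.0f}")  # 0 -> they never give the same value
```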
5.2 Comparison tests: t-test, chi-square, Fisher exact ❌ not reliability
These tests answer questions like:
“Are the average scores of Rater A and Rater B statistically different?” (t-test)
“Are the proportions of categories different between raters?” (chi-square / Fisher)
They are designed for group comparison, not for agreement.
You can have no significant difference between raters (p > 0.05) → this does not mean they agree on individual patients
Conversely, you can have a significant difference even when agreement is high, especially with a large sample size
Therefore:
t-tests, χ² tests, Fisher exact tests are not reliability statistics
They tell you about systematic differences, not about case-by-case agreement
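A minimal sketch of the first point: two raters who each call exactly half the patients “positive” (so a chi-square comparison of proportions finds no difference at all), yet agree no better than chance case by case. The data are invented, and scipy and scikit-learn are assumed to be available:

```python
# Minimal sketch: identical marginal proportions ("no significant difference"
# between raters) can coexist with chance-level agreement. Data are invented.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# 40 patients: both raters call exactly 20 "positive", but only half overlap.
rater_a = np.array([1] * 20 + [0] * 20)
rater_b = np.array([0] * 10 + [1] * 20 + [0] * 10)

# "Are the proportions of positives different between raters?" (the misused question)
counts = np.array([[rater_a.sum(), len(rater_a) - rater_a.sum()],
                   [rater_b.sum(), len(rater_b) - rater_b.sum()]])
chi2, p, _, _ = chi2_contingency(counts)
print(f"Chi-square p-value comparing proportions: {p:.2f}")  # ~1.0, "no difference"

# Case-by-case agreement tells a different story.
print(f"% exact agreement: {np.mean(rater_a == rater_b) * 100:.0f}")  # 50
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")    # 0.00, no better than chance
```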
6. Decision guide
Step 1 – Identify outcome type
Binary / Nominal → think Kappa
Ordinal → think Weighted Kappa, plus ICC if there are more than two raters and the score behaves like a continuous measurement
Step 2 – Decide what to report
For binary/nominal:
Agreement:
% Agreement
% Specific agreement (especially for “positive” cases)
Reliability:
Cohen’s Kappa
For ordinal:
Agreement:
% Agreement
% Specific agreement by level (if helpful)
Reliability:
Weighted Kappa
ICC if:
more than 2 raters and
scale can be treated as continuous
Step 3 – Avoid the common traps
Don’t use correlation as your main reliability index
Don’t claim “no significant difference (p > 0.05)” = good reliability
Always make it clear if you’re reporting:
Agreement (% agreement)
Reliability (Kappa, weighted Kappa, ICC)
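If you want Steps 1 and 2 in a reusable form, here is a small hypothetical helper; the function name and its rules are mine and simply mirror the guide above:

```python
# Hypothetical helper mirroring the decision guide above; the function name and
# its rules are illustrative, not a standard API.
def suggest_statistics(outcome_type: str, n_raters: int,
                       behaves_like_continuous: bool = False) -> list[str]:
    """Return the agreement and reliability statistics suggested by the guide."""
    stats = ["% agreement", "% specific agreement"]   # always report agreement
    if outcome_type in ("binary", "nominal"):
        stats.append("Cohen's kappa")
    elif outcome_type == "ordinal":
        stats.append("weighted kappa")
        if n_raters > 2 and behaves_like_continuous:
            stats.append("ICC")
    return stats

print(suggest_statistics("ordinal", n_raters=3, behaves_like_continuous=True))
# ['% agreement', '% specific agreement', 'weighted kappa', 'ICC']
```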
7. One-paragraph template for your paper
You can adapt this wording in your Methods section:
“For categorical outcomes, we assessed agreement using the percentage of overall and category-specific agreement. Reliability beyond chance was quantified using Cohen’s kappa for binary/nominal ratings and weighted kappa for ordinal scales. In settings with more than two raters and approximately continuous ordinal scores, we additionally calculated the intraclass correlation coefficient (ICC). Correlation coefficients and group comparison tests (t-test, chi-square, Fisher’s exact) were not used as reliability indices because they do not directly measure agreement between raters.”





