Agreement vs Reliability in Categorical Data: A Practical Guide for Clinical Researchers
- Mayta

1. Why do we care about reliability?
When you design a clinical tool (score, scale, questionnaire, diagnostic classification), you usually have:
Raters (people or systems making the judgment)
Repeated measurements (same patient, same test, measured twice or more)
Categorical outcomes (e.g. “present/absent”, “stage I/II/III”, “mild/moderate/severe”)
You want to know:
Agreement – Do raters give the same category?
Reliability – Does the measurement reflect true differences between patients rather than random error?
Those are related but not identical concepts, and they require different statistics.
2. First step: identify your type of categorical outcome
2.1 Binary data
Two categories only.
Examples:
Positive / Negative
Yes / No
Disease / No disease
This is still a type of categorical data.
2.2 Nominal data
More than two categories, with no natural order.
Examples:
Blood group: A, B, AB, O
Type of arrhythmia: AF, VT, SVT, other
Mechanism of injury: fall, MVC, sports, other
You cannot say “A > B > O” in a meaningful numeric way → no order.
2.3 Ordinal data
Categories that have a natural order, but are still discrete.
Examples:
Pain: none, mild, moderate, severe
Disease stage: I, II, III, IV
Likert scale: strongly disagree → strongly agree
Here “severe” is “more” than “moderate”, but you can’t guarantee equal distances between categories.
3. Agreement vs. Reliability: what’s the difference?
3.1 Agreement
Agreement asks:
“How often do raters give exactly the same category?”
% Agreement = (number of exact matches) / (number of cases rated) × 100
% Specific agreement = agreement calculated separately for each category (e.g., positive agreement and negative agreement reported as separate figures)
This is simple but crude. It does not adjust for agreement that would happen by chance alone.
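To make the arithmetic concrete, here is a minimal sketch in Python; the two rating lists are invented for illustration, and the specific-agreement line uses the standard 2a / (2a + b + c) form for positive agreement:

```python
# Minimal sketch: percent agreement and positive specific agreement for two
# raters on a binary outcome. The example ratings are invented for illustration.

rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

n = len(rater_a)
matches = sum(a == b for a, b in zip(rater_a, rater_b))
percent_agreement = matches / n * 100

# Positive specific agreement: 2 * (both "pos") / (times A said "pos" + times B said "pos")
both_pos = sum(a == b == "pos" for a, b in zip(rater_a, rater_b))
positive_agreement = 2 * both_pos / (rater_a.count("pos") + rater_b.count("pos")) * 100

print(f"% agreement: {percent_agreement:.1f}")
print(f"% positive (specific) agreement: {positive_agreement:.1f}")
```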
3.2 Reliability
Reliability asks:
“How much of the variability in scores is due to true differences between patients, not measurement error?”
Statistically, this is like a signal-to-noise ratio:
High reliability → most variability is true subject difference
Low reliability → most variability is just noise / random error
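Written as a formula, in the same spirit as the percent-agreement formula above (this is the standard variance-ratio definition, not something specific to this guide):
Reliability = (variance from true differences between patients) / (variance from true differences between patients + variance from measurement error)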
To measure this properly, we use statistics that:
Adjust for chance agreement
Consider variability between subjects and between raters
4. How to choose the right reliability statistic
4.1 For binary or nominal outcomes
Goal 1: Describe agreement. Use:
Percent agreement
Percent specific agreement (especially useful if prevalence is very high or very low)
Goal 2: Quantify reliability. Use:
Cohen’s Kappa (κ) for two raters
Kappa answers:
“How much better is the observed agreement than chance agreement?”
κ = 1 → perfect reliability
κ = 0 → no better than chance
κ < 0 → worse than chance (systematic disagreement)
Kappa is sensitive to prevalence (very rare or very common categories can give paradoxical values), so it is often reported together with % agreement and % specific agreement, not alone.
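For reference, kappa is computed as κ = (observed agreement − chance agreement) / (1 − chance agreement). A minimal sketch, assuming scikit-learn is available and reusing the invented ratings from the example above:

```python
# Minimal sketch: Cohen's kappa for two raters on a binary outcome.
# Assumes scikit-learn is installed; ratings are the invented example from above.
from sklearn.metrics import cohen_kappa_score

rater_a = ["pos", "pos", "neg", "neg", "pos", "neg", "pos", "neg"]
rater_b = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos"]

# kappa = (P_observed - P_chance) / (1 - P_chance)
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")
```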
4.2 For ordinal outcomes
You still want:
% Agreement
% Specific agreement
But for reliability, simple Kappa (unweighted) is not ideal, because it treats all disagreement as equally bad.
Example:
Rater A: “mild”, Rater B: “moderate”
Rater A: “mild”, Rater B: “severe”
Clinically, the second discrepancy is much worse, but unweighted Kappa treats them the same.
So we use: ✅ Weighted Kappa
Assigns weights to disagreements:
Small disagreements (mild vs moderate) → lighter penalty
Big disagreements (mild vs severe) → heavier penalty
Common weights:
Linear (penalty increases linearly with distance)
Quadratic (penalty increases more for bigger gaps)
Weighted Kappa is therefore preferred for:
Symptom severity scores
Clinical staging
Likert-type scales
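A minimal sketch of weighted kappa, assuming scikit-learn and coding the ordinal categories as integers (0 = none … 3 = severe) so the distance between categories is well defined; the ratings are invented:

```python
# Minimal sketch: linear and quadratic weighted kappa for an ordinal scale.
# Assumes scikit-learn; categories are coded 0=none, 1=mild, 2=moderate, 3=severe
# so that the "distance" between categories is meaningful. Ratings are invented.
from sklearn.metrics import cohen_kappa_score

rater_a = [0, 1, 2, 2, 3, 1, 0, 2, 3, 1]
rater_b = [0, 2, 2, 1, 3, 1, 1, 3, 3, 1]

print(f"Linear weighted kappa:    {cohen_kappa_score(rater_a, rater_b, weights='linear'):.2f}")
print(f"Quadratic weighted kappa: {cohen_kappa_score(rater_a, rater_b, weights='quadratic'):.2f}")
```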
✅ ICC (Intraclass Correlation Coefficient) for ordinal with many raters
When:
You have ≥ 2 raters (often >2)
The scale is ordinal but can reasonably be treated as approximately continuous
Then you can use ICC as a reliability index.
ICC is widely used for continuous data, but can be used for ordinal scores that behave like continuous measurements (e.g., 0–10 scales).
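A minimal sketch of an ICC calculation, assuming the pingouin package and a long-format table with one row per patient-rater pair; the data are invented, and the ICC form you report (e.g., ICC2 for a two-way random-effects design) still has to match your study design:

```python
# Minimal sketch: ICC for three raters scoring a 0-10 scale treated as roughly
# continuous. Assumes pandas and pingouin are available; the data are invented.
import pandas as pd
import pingouin as pg

data = pd.DataFrame({
    "patient": [1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5],
    "rater":   ["A", "B", "C"] * 5,
    "score":   [7, 8, 7, 2, 3, 2, 5, 5, 6, 9, 9, 8, 4, 3, 4],
})

icc = pg.intraclass_corr(data=data, targets="patient", raters="rater", ratings="score")
print(icc[["Type", "ICC", "CI95%"]])  # pick the ICC form that matches your design (e.g., ICC2)
```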
5. Statistics that are often misused for reliability
These are important “red flags” because they appear in many papers.
5.1 Correlation coefficients (Pearson, Spearman) ❌ not reliability
Correlation answers:
“Do scores move in the same direction?”
It does not ask:
“Do raters give the same value?”
Example:
Rater A always gives 2 points higher than Rater B (e.g. A = B + 2)
Correlation (r) will be very high (close to 1.0) → they are perfectly linearly associated
But agreement is poor, because they systematically disagree by 2 points
So:
High correlation ≠ high agreement
Correlation is OK for association, not for reliability of ratings
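A minimal sketch of that example in Python (invented scores, with Rater A fixed at exactly 2 points above Rater B):

```python
# Minimal sketch: perfect correlation with zero agreement.
# Rater A always scores exactly 2 points higher than Rater B (invented data).
import numpy as np

rater_b = np.array([1, 2, 3, 4, 5, 6, 7, 8])
rater_a = rater_b + 2  # systematic offset of +2

r = np.corrcoef(rater_a, rater_b)[0, 1]
exact_matches = np.mean(rater_a == rater_b) * 100

print(f"Pearson r: {r:.2f}")                      # 1.00 -> perfectly correlated
print(f"% exact agreement: {exact_matches:.0f}")  # 0 -> they never give the same value
```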
5.2 Comparison tests: t-test, chi-square, Fisher exact ❌ not reliability
These tests answer questions like:
“Are the average scores of Rater A and Rater B statistically different?” (t-test)
“Are the proportions of categories different between raters?” (chi-square / Fisher)
They are designed for group comparison, not for agreement.
You can have no significant difference between raters (p > 0.05) → this does not mean they agree on individual patients
Conversely, you can have a significant difference even when agreement is high, especially with a large sample size
Therefore:
t-tests, χ² tests, Fisher exact tests are not reliability statistics
They tell you about systematic differences, not about case-by-case agreement
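A minimal sketch of the first point: two raters who each call exactly half the patients “positive” (so a chi-square comparison of proportions finds no difference at all), yet agree no better than chance case by case. The data are invented, and scipy and scikit-learn are assumed to be available:

```python
# Minimal sketch: identical marginal proportions ("no significant difference"
# between raters) can coexist with chance-level agreement. Data are invented.
import numpy as np
from scipy.stats import chi2_contingency
from sklearn.metrics import cohen_kappa_score

# 40 patients: both raters call exactly 20 "positive", but only half overlap.
rater_a = np.array([1] * 20 + [0] * 20)
rater_b = np.array([0] * 10 + [1] * 20 + [0] * 10)

# "Are the proportions of positives different between raters?" (the misused question)
counts = np.array([[rater_a.sum(), len(rater_a) - rater_a.sum()],
                   [rater_b.sum(), len(rater_b) - rater_b.sum()]])
chi2, p, _, _ = chi2_contingency(counts)
print(f"Chi-square p-value comparing proportions: {p:.2f}")  # ~1.0, "no difference"

# Case-by-case agreement tells a different story.
print(f"% exact agreement: {np.mean(rater_a == rater_b) * 100:.0f}")  # 50
print(f"Cohen's kappa: {cohen_kappa_score(rater_a, rater_b):.2f}")    # 0.00, no better than chance
```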
6. Decision guide
Step 1 – Identify outcome type
Binary / Nominal → think Kappa
Ordinal → think Weighted Kappa, plus ICC if there are more than two raters and the score behaves like a continuous measurement
Step 2 – Decide what to report
For binary/nominal:
Agreement:
% Agreement
% Specific agreement (especially for “positive” cases)
Reliability:
Cohen’s Kappa
For ordinal:
Agreement:
% Agreement
% Specific agreement by level (if helpful)
Reliability:
Weighted Kappa
ICC if:
more than 2 raters and
scale can be treated as continuous
Step 3 – Avoid the common traps
Don’t use correlation as your main reliability index
Don’t claim “no significant difference (p > 0.05)” = good reliability
Always make it clear if you’re reporting:
Agreement (% agreement)
Reliability (Kappa, weighted Kappa, ICC)
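If you want Steps 1 and 2 in a reusable form, here is a small hypothetical helper; the function name and its rules are mine and simply mirror the guide above:

```python
# Hypothetical helper mirroring the decision guide above; the function name and
# its rules are illustrative, not a standard API.
def suggest_statistics(outcome_type: str, n_raters: int,
                       behaves_like_continuous: bool = False) -> list[str]:
    """Return the agreement and reliability statistics suggested by the guide."""
    stats = ["% agreement", "% specific agreement"]   # always report agreement
    if outcome_type in ("binary", "nominal"):
        stats.append("Cohen's kappa")
    elif outcome_type == "ordinal":
        stats.append("weighted kappa")
        if n_raters > 2 and behaves_like_continuous:
            stats.append("ICC")
    return stats

print(suggest_statistics("ordinal", n_raters=3, behaves_like_continuous=True))
# ['% agreement', '% specific agreement', 'weighted kappa', 'ICC']
```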
7. One-paragraph template for your paper
You can adapt this wording in your Methods section:
“For categorical outcomes, we assessed agreement using the percentage of overall and category-specific agreement. Reliability beyond chance was quantified using Cohen’s kappa for binary/nominal ratings and weighted kappa for ordinal scales. In settings with more than two raters and approximately continuous ordinal scores, we additionally calculated the intraclass correlation coefficient (ICC). Correlation coefficients and group comparison tests (t-test, chi-square, Fisher’s exact) were not used as reliability indices because they do not directly measure agreement between raters.”





