Agreement vs Reliability in Categorical Data: A Practical Guide for Clinical Researchers

Clinical Epidemiology Research, Uniqcret doctor knowledges, Methodology and Research Design

Reliability Flow Chart – Two-Arm Decision

For categorical outcomes: choose the correct agreement & reliability statistics
Start → Identify outcome type → Categorical data
Arm 1
Outcome type: Binary / Nominal (2-category or unordered categories)
→ think Kappa
  • % Agreement
  • % Specific agreement (especially for “positive” category)
  • Cohen’s Kappa (chance-corrected reliability)
Arm 2
Outcome type: Ordinal (ordered categories)
→ think Weighted Kappa, and if more than 2 raters, think ICC
  • % Agreement
  • % Specific agreement by level (if useful clinically)
  • Weighted Kappa (penalizes “far” disagreements more)
  • Add ICC (Intraclass Correlation Coefficient) if:
    • > 2 raters
    • scale behaves roughly like continuous

1. Why do we care about reliability?

When you design a clinical tool (score, scale, questionnaire, diagnostic classification), you usually have several raters (or repeated ratings) scoring the same patients.

You want to know:

  1. Agreement – Do raters give the same category?
  2. Reliability – Does the measurement reflect true differences between patients rather than random error?

Those are related but not identical concepts, and they require different statistics.


2. First step: identify your type of categorical outcome

2.1 Binary data

Two categories only, for example positive / negative.

This is still a type of categorical data.

2.2 Nominal data

More than two categories, with no natural order.

Blood group is the classic case: you cannot say “A > B > O” in any meaningful numeric way, so there is no order.

2.3 Ordinal data

Categories that have a natural order, but are still discrete.

In a grading such as mild / moderate / severe, “severe” is “more” than “moderate”, but you can’t guarantee equal distances between categories.


3. Agreement vs. Reliability: what’s the difference?

3.1 Agreement

Agreement asks:

“How often do raters give exactly the same category?”

This is simple but crude. It does not adjust for agreement that would happen by chance alone.
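As a rough illustration of what "agreement" measures, the sketch below computes overall percent agreement and category-specific agreement for two raters. The ratings are made up, and the function names `percent_agreement` and `specific_agreement` are chosen here for illustration only.

```python
def percent_agreement(rater_a, rater_b):
    """Proportion of subjects that both raters put in the same category."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(rater_a, rater_b))
    return matches / len(rater_a)


def specific_agreement(rater_a, rater_b, category):
    """Agreement restricted to one category: 2 * both / (count_a + count_b)."""
    both = sum(a == category and b == category for a, b in zip(rater_a, rater_b))
    total = sum(a == category for a in rater_a) + sum(b == category for b in rater_b)
    return 2 * both / total if total else float("nan")


# Made-up ratings for 10 patients (1 = positive, 0 = negative).
rater_a = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
rater_b = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

overall = percent_agreement(rater_a, rater_b)       # 8 of 10 patients match
positive = specific_agreement(rater_a, rater_b, 1)  # agreement on "positive"
negative = specific_agreement(rater_a, rater_b, 0)  # agreement on "negative"
print(f"overall={overall:.2f} positive={positive:.2f} negative={negative:.2f}")
```

In this toy data the raters match on 80% of patients overall, but agreement on the "positive" category is noticeably lower than on the "negative" one, which is why the category-specific figures are worth reporting next to the overall percentage.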

3.2 Reliability

Reliability asks:

“How much of the variability in scores is due to true differences between patients, not measurement error?”

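Statistically, this is like a signal-to-noise ratio, expressed as the share of total variance that comes from true differences between patients:

reliability = true between-patient variance / (true between-patient variance + error variance)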

To measure this properly, we use statistics that:

  1. Adjust for chance agreement
  2. Consider variability between subjects and between raters

4. How to choose the right reliability statistic

4.1 For binary or nominal outcomes

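Goal 1: Describe agreement
Use:

  • % Agreement
  • % Specific agreement (especially for the “positive” category)

Goal 2: Quantify reliability
Use:

  • Cohen’s Kappa (chance-corrected reliability)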

Kappa answers:

“How much better is the observed agreement than chance agreement?”

Kappa is sensitive to prevalence (very rare or very common categories can give paradoxical values), so it is often reported together with % agreement and % specific agreement, not alone.
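The prevalence effect can be seen with a small calculation. The sketch below implements the usual formula kappa = (p_observed - p_expected) / (1 - p_expected) and applies it to two made-up 2×2 tables that have the same 90% agreement but very different prevalence. The helper `expand_table` and both tables are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters: (p_obs - p_exp) / (1 - p_exp)."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("ratings must be non-empty and of equal length")
    p_obs = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of the two raters' marginal proportions.
    categories = set(count_a) | set(count_b)
    p_exp = sum(count_a[c] * count_b[c] for c in categories) / n ** 2
    if p_exp == 1:
        return float("nan")  # both raters used a single category only
    return (p_obs - p_exp) / (1 - p_exp)


def expand_table(both_pos, a_only, b_only, both_neg):
    """Turn 2x2 cell counts into two rating lists (1 = positive)."""
    pairs = ([(1, 1)] * both_pos + [(1, 0)] * a_only
             + [(0, 1)] * b_only + [(0, 0)] * both_neg)
    return [a for a, _ in pairs], [b for _, b in pairs]


# Two made-up tables of 100 patients, both with 90% observed agreement.
kappa_balanced = cohens_kappa(*expand_table(45, 5, 5, 45))  # ~50% positive
kappa_rare = cohens_kappa(*expand_table(2, 5, 5, 88))       # ~7% positive
print(f"balanced: {kappa_balanced:.2f}, rare category: {kappa_rare:.2f}")
```

With balanced categories kappa comes out at about 0.80; with the rare category it drops to about 0.23, although the raters agree on 90 of 100 patients in both tables.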

4.2 For ordinal outcomes

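You still want:

  • % Agreement
  • % Specific agreement by level (if useful clinically)

But for reliability, simple (unweighted) Kappa is not ideal, because it treats all disagreements as equally bad.

Example:

  • Rater A says “mild”, Rater B says “moderate” (one level apart)
  • Rater A says “mild”, Rater B says “severe” (two levels apart)

Clinically, the second discrepancy is much worse, but unweighted Kappa treats them the same.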

So we use:

✅ Weighted Kappa

Weighted Kappa penalizes “far” disagreements more than adjacent ones. It is therefore preferred for ordinal scales.
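To make the effect of weighting concrete, here is a minimal sketch of weighted kappa computed from disagreement weights, assuming ratings are coded 0, 1, 2 (for example mild, moderate, severe). The function names, the `scheme` parameter and the ratings are illustrative assumptions. Linear and quadratic weights are both common choices, and the source does not say which one to use.

```python
def disagreement_weight(i, j, n_levels, scheme):
    """Penalty for rating one subject as level i and level j."""
    if scheme == "unweighted":
        return 0.0 if i == j else 1.0
    distance = abs(i - j) / (n_levels - 1)
    if scheme == "linear":
        return distance
    if scheme == "quadratic":
        return distance ** 2
    raise ValueError(f"unknown scheme: {scheme}")


def weighted_kappa(rater_a, rater_b, n_levels, scheme="linear"):
    """1 - observed weighted disagreement / chance-expected disagreement."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b) or n_levels < 2:
        raise ValueError("need equal-length ratings and at least 2 levels")
    observed = sum(disagreement_weight(a, b, n_levels, scheme)
                   for a, b in zip(rater_a, rater_b)) / n
    share_a = [rater_a.count(level) / n for level in range(n_levels)]
    share_b = [rater_b.count(level) / n for level in range(n_levels)]
    expected = sum(disagreement_weight(i, j, n_levels, scheme)
                   * share_a[i] * share_b[j]
                   for i in range(n_levels) for j in range(n_levels))
    return 1 - observed / expected if expected else float("nan")


# Made-up ratings for 9 patients; each rater B differs from A on one patient.
rater_a = [0, 0, 0, 1, 1, 1, 2, 2, 2]
near_b = [0, 0, 1, 1, 1, 1, 2, 2, 2]  # one "mild vs moderate" disagreement
far_b = [0, 0, 2, 1, 1, 1, 2, 2, 2]   # one "mild vs severe" disagreement

unweighted_near = weighted_kappa(rater_a, near_b, 3, "unweighted")
unweighted_far = weighted_kappa(rater_a, far_b, 3, "unweighted")
linear_near = weighted_kappa(rater_a, near_b, 3, "linear")
linear_far = weighted_kappa(rater_a, far_b, 3, "linear")
print(unweighted_near, unweighted_far, linear_near, linear_far)
```

In this toy data the unweighted values are identical for the near and the far disagreement, while the linearly weighted value is lower when the single disagreement is two levels apart.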

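✅ ICC (Intraclass Correlation Coefficient) for ordinal scales with more than two raters

When:

  • there are more than 2 raters
  • the scale behaves roughly like a continuous measurement

Then you can use ICC as a reliability index.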

ICC is widely used for continuous data, but can be used for ordinal scores that behave like continuous measurements (e.g., 0–10 scales).
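ICC comes in several forms, and the text does not say which one is intended. As one possible sketch, the code below computes the two-way random-effects, absolute-agreement, single-rater form (often written ICC(2,1)) from the mean squares of a subjects × raters table. The choice of this form, the function name `icc_2_1` and the scores are assumptions made for illustration.

```python
def icc_2_1(scores):
    """ICC(2,1) from a table with one row per subject, one column per rater."""
    n = len(scores)      # number of subjects
    k = len(scores[0])   # number of raters
    if n < 2 or k < 2 or any(len(row) != k for row in scores):
        raise ValueError("need a complete table with >= 2 subjects and raters")
    grand = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]
    rater_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_subjects = k * sum((m - grand) ** 2 for m in subject_means)
    ss_raters = n * sum((m - grand) ** 2 for m in rater_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_subjects - ss_raters

    ms_subjects = ss_subjects / (n - 1)        # between-patient "signal"
    ms_raters = ss_raters / (k - 1)            # systematic rater differences
    ms_error = ss_error / ((n - 1) * (k - 1))  # residual noise
    return (ms_subjects - ms_error) / (
        ms_subjects + (k - 1) * ms_error + k * (ms_raters - ms_error) / n
    )


# Illustrative scores: 6 patients (rows) rated by 4 raters (columns).
scores = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
icc = icc_2_1(scores)
print(f"ICC(2,1) = {icc:.2f}")
```

Here the ICC is only about 0.29. The raters rank the patients similarly, but the second rater scores everyone much lower, and an absolute-agreement ICC counts that systematic shift as disagreement.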


5. Statistics that are often misused for reliability

These are important “red flags” because they appear in many papers.

5.1 Correlation coefficients (Pearson, Spearman) ❌ not reliability

Correlation answers:

“Do scores move in the same direction?”

It does not ask:

“Do raters give the same value?”

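Example: if Rater B always scores one level higher than Rater A, the correlation is perfect, yet the two raters never give the same value.

So: a high correlation does not show that raters agree, and correlation coefficients should not be reported as reliability indices.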
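The gap between "moving together" and "giving the same value" is easy to check numerically. In this minimal sketch with made-up scores, a hypothetical Rater B always scores exactly one point higher than Rater A. Pearson's r is computed by hand to keep the example dependency-free.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


# Made-up ordinal scores; Rater B is always one point above Rater A.
rater_a = [0, 1, 2, 3, 1, 2, 0, 3]
rater_b = [a + 1 for a in rater_a]

r = pearson_r(rater_a, rater_b)
exact_agreement = sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)
print(f"Pearson r = {r:.2f}, exact agreement = {exact_agreement:.0%}")
```

The correlation is 1 while the two raters never assign the same score to any patient.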

5.2 Comparison tests: t-test, chi-square, Fisher exact ❌ not reliability

These tests answer questions like:

“Are the average scores of Rater A and Rater B statistically different?” (t-test)

“Are the proportions of categories different between raters?” (chi-square / Fisher)

They are designed for group comparison, not for agreement.

Therefore: do not use them as reliability indices. They do not directly measure agreement between raters.
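A comparison test only looks at each rater's overall distribution, not at what happens patient by patient. In the made-up data below, both raters call exactly half of the patients positive, so a test comparing the two proportions has no difference to detect, yet the raters disagree on every single patient.

```python
def proportion_positive(ratings):
    """Share of subjects rated positive (coded 1)."""
    return sum(r == 1 for r in ratings) / len(ratings)


def percent_agreement(rater_a, rater_b):
    """Share of subjects given the same category by both raters."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


# Made-up ratings for 10 patients (1 = positive, 0 = negative).
rater_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
rater_b = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

prop_a = proportion_positive(rater_a)
prop_b = proportion_positive(rater_b)
agreement = percent_agreement(rater_a, rater_b)
print(f"positive: A={prop_a:.0%}, B={prop_b:.0%}; agreement={agreement:.0%}")
```

Identical proportions (and therefore a non-significant comparison test) sit alongside 0% agreement.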


6. Decision guide

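Step 1 – Identify outcome type

  • Binary: two categories
  • Nominal: more than two categories, no natural order
  • Ordinal: ordered categories

Step 2 – Decide what to report

For binary/nominal:

  • % Agreement
  • % Specific agreement
  • Cohen’s Kappa

For ordinal:

  • % Agreement
  • % Specific agreement by level (if useful clinically)
  • Weighted Kappa
  • ICC, if there are more than 2 raters and the scale behaves roughly like continuous

Step 3 – Avoid the common traps

  • Correlation coefficients (Pearson, Spearman) are not reliability indices
  • Comparison tests (t-test, chi-square, Fisher exact) are not reliability indices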
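The decision guide can also be written down as a small lookup. This is only a sketch of the flow chart at the top of the post. The function name `recommended_statistics` and its parameters are hypothetical, and it encodes nothing beyond the rules stated in this guide.

```python
def recommended_statistics(outcome_type, n_raters=2, behaves_continuous=False):
    """Return the statistics this guide suggests for a categorical outcome."""
    kind = outcome_type.strip().lower()
    if kind in ("binary", "nominal"):
        return ["% agreement", "% specific agreement", "Cohen's kappa"]
    if kind == "ordinal":
        stats = ["% agreement", "% specific agreement by level",
                 "weighted kappa"]
        # ICC is added only when both conditions from the flow chart hold.
        if n_raters > 2 and behaves_continuous:
            stats.append("ICC")
        return stats
    raise ValueError("outcome_type must be 'binary', 'nominal' or 'ordinal'")


binary_plan = recommended_statistics("binary")
ordinal_plan = recommended_statistics("ordinal", n_raters=3,
                                      behaves_continuous=True)
print(binary_plan)
print(ordinal_plan)
```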


7. One-paragraph template for your paper

You can adapt this wording in your Methods section:

“For categorical outcomes, we assessed agreement using the percentage of overall and category-specific agreement. Reliability beyond chance was quantified using Cohen’s kappa for binary/nominal ratings and weighted kappa for ordinal scales. In settings with more than two raters and approximately continuous ordinal scores, we additionally calculated the intraclass correlation coefficient (ICC). Correlation coefficients and group comparison tests (t-test, chi-square, Fisher’s exact) were not used as reliability indices because they do not directly measure agreement between raters.”
