How to Choose Statistical Coefficients for Each Type of Reliability

Mayta
Dec 11, 2025
Updated: Dec 12, 2025

Summary Table: Types of Reliability & Statistical Coefficients

| Reliability Type | Purpose | Data Type | Statistical Coefficients (Named Statistics) |
|---|---|---|---|
| 1. Test–Retest Reliability | Measures stability over time (same test, two occasions) | Continuous | Pearson r; Spearman ρ (ordinal or non-normal); ICC (Intraclass Correlation Coefficient); CCC (Concordance Correlation Coefficient) |
| | | Ordinal | Spearman ρ; Weighted Cohen’s kappa |
| | | Nominal / Dichotomous | Cohen’s kappa (κ) |
| | | n/a | Coefficient of Stability (concept label) |
| 2. Inter-Rater Reliability | Measures agreement between different raters | Continuous | ICC; CCC; (Bland–Altman plot for visual agreement) |
| | | Ordinal | Weighted Cohen’s kappa; Krippendorff’s alpha (multi-rater) |
| | | Nominal / Dichotomous | Cohen’s kappa (2 raters); Fleiss’ kappa (≥3 raters); Scott’s Pi; Brennan–Prediger kappa; Krippendorff’s alpha |
| 3. Intra-Rater Reliability | Measures consistency of one rater over time | Continuous | ICC; CCC |
| | | Ordinal | Weighted Cohen’s kappa |
| | | Nominal / Dichotomous | Cohen’s kappa |
| 4. Parallel-Forms Reliability | Measures equivalence between Form A and Form B | Continuous | Pearson r; Spearman ρ; ICC |
| | | Ordinal | Spearman ρ |
| | | Nominal / Dichotomous | Cohen’s kappa (rare use) |
| | | n/a | Coefficient of Equivalence (concept label) |
| 5. Internal Consistency Reliability | Measures how well items in a scale measure the same construct | Multi-item scale | Cronbach’s alpha (α) (most used); McDonald’s omega (ω) (better when factor loadings differ); Guttman’s lambda (λ2–λ6); KR-20 (dichotomous items only); KR-21; Split-half reliability + Spearman–Brown formula; Coefficient H (latent variable reliability) |

Introduction

In measurement and psychometrics, reliability describes how consistently an instrument measures whatever it is supposed to measure. Different types of reliability focus on different sources of variation: time, raters, forms, and items.

Below are the main reliability types, with their key named statistics.

1. Test–Retest Reliability

What it is

Test–retest reliability asks:

If we measure the same person with the same instrument at two different times (and the trait has not truly changed), do we get similar scores?

It reflects the stability over time of a measurement.

When it is used

Cognitive or knowledge tests
Personality traits (e.g., trait anxiety)
Stable clinical scales (e.g., long-term disability, not acute pain)

Main statistics (named coefficients)

For continuous scores:

Pearson correlation coefficient (r)
- Simple correlation between Time 1 and Time 2 scores.
Spearman rank correlation (ρ / rho)
- Used when data are ordinal or not normally distributed.
Intraclass Correlation Coefficient (ICC)
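- Preferred over Pearson r when you care about agreement, not just correlation: Pearson r stays high even if every Time 2 score shifts by a constant, whereas an absolute-agreement ICC is lowered by such systematic differences.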
Concordance Correlation Coefficient (CCC; Lin’s CCC)
- Combines precision (correlation) and accuracy (closeness to identity line) to quantify agreement over time.
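
To make the difference between correlation and agreement concrete, here is a minimal sketch in plain Python. The scores are hypothetical (every person scores exactly 5 points higher at Time 2), and the function names `pearson_r` and `lins_ccc` are illustrative, not taken from any particular library.

```python
from statistics import fmean


def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def lins_ccc(x, y):
    """Lin's concordance correlation coefficient (variances divided by n)."""
    n = len(x)
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    sxx = sum((a - mx) ** 2 for a in x) / n
    syy = sum((b - my) ** 2 for b in y) / n
    # The squared mean difference in the denominator penalizes systematic shifts.
    return 2 * sxy / (sxx + syy + (mx - my) ** 2)


# Hypothetical data: a constant +5 shift between the two occasions.
time1 = [10, 12, 15, 18, 20, 23]
time2 = [score + 5 for score in time1]

print(f"Pearson r  = {pearson_r(time1, time2):.3f}")
print(f"Lin's CCC  = {lins_ccc(time1, time2):.3f}")
```

With these made-up scores the ranking and spacing of people is perfectly preserved, so Pearson r is essentially 1, while the CCC is clearly lower because the scores no longer sit on the identity line.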

For categorical scores:

Cohen’s kappa (κ)
- Agreement between Time 1 and Time 2 when responses are categorical (e.g., positive/negative).
Weighted kappa
- For ordered categories (e.g., mild/moderate/severe).
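
As a rough illustration, kappa can be written as one minus the ratio of observed to chance-expected disagreement, which covers both the unweighted and the weighted version. The sketch below assumes paired labels from two occasions; the data and the function name `cohens_kappa` are hypothetical.

```python
from collections import Counter


def cohens_kappa(a, b, categories, weights=None):
    """Cohen's kappa for two paired lists of category labels.

    weights=None gives unweighted kappa. "linear" or "quadratic" gives
    weighted kappa and treats `categories` as an ordered scale.
    """
    if weights not in (None, "linear", "quadratic"):
        raise ValueError("weights must be None, 'linear' or 'quadratic'")
    n = len(a)
    k = len(categories)
    idx = {c: i for i, c in enumerate(categories)}

    def disagreement(i, j):
        if weights is None:
            return 0.0 if i == j else 1.0
        d = abs(i - j) / (k - 1)
        return d if weights == "linear" else d ** 2

    pairs = Counter((idx[x], idx[y]) for x, y in zip(a, b))
    row = Counter(idx[x] for x in a)  # marginal counts, occasion 1
    col = Counter(idx[y] for y in b)  # marginal counts, occasion 2

    observed = sum(disagreement(i, j) * c / n for (i, j), c in pairs.items())
    expected = sum(
        disagreement(i, j) * row[i] * col[j] / n ** 2
        for i in range(k)
        for j in range(k)
    )
    return 1.0 - observed / expected


# Hypothetical severity ratings of ten patients on two occasions.
scale = ["mild", "moderate", "severe"]
time1 = ["mild", "mild", "moderate", "moderate", "severe",
         "severe", "mild", "moderate", "severe", "moderate"]
time2 = ["mild", "moderate", "moderate", "moderate", "severe",
         "moderate", "mild", "severe", "severe", "moderate"]

kappa = cohens_kappa(time1, time2, scale)
kappa_w = cohens_kappa(time1, time2, scale, weights="linear")
print(f"kappa = {kappa:.3f}, linear weighted kappa = {kappa_w:.3f}")
```

In this example all disagreements are between adjacent categories, so the weighted version comes out higher than the unweighted one: near misses are penalized less than they would be if mild were confused with severe.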

General term:

Coefficient of Stability
- A generic term for the correlation between two time points in test–retest designs.

2. Inter-Rater Reliability

What it is

Inter-rater reliability asks:

If two or more raters assess the same subjects, how consistently do they agree?

It focuses on agreement between different observers.

When it is used

Two clinicians rating severity of a symptom
Multiple radiologists reading the same scan
Several examiners grading OSCE or written essays

Main statistics (named coefficients)

For categorical ratings

Cohen’s kappa (κ)
- Agreement beyond chance between two raters.
Weighted Cohen’s kappa
- For ordinal categories where bigger disagreements should be penalized more.
Fleiss’ kappa
- Chance-corrected agreement for three or more raters. Strictly speaking, it generalizes Scott’s Pi rather than Cohen’s kappa.
Scott’s Pi (π)
- Similar to Cohen’s kappa, but computes chance agreement from category proportions pooled across both raters rather than from each rater’s own marginals.
Brennan–Prediger kappa
- Adjusted kappa that fixes chance agreement at 1/q for q categories, which reduces the “prevalence paradox” (low kappa despite high raw agreement when categories are very imbalanced).
Krippendorff’s alpha (α)
- Very flexible: works with any number of raters, missing data, and different measurement levels (nominal, ordinal, interval).
Goodman–Kruskal’s gamma
- A measure of association for ordinal ratings; sometimes used in rater agreement contexts.
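
For the multi-rater case, a minimal sketch of Fleiss’ kappa is shown here. It assumes the ratings have already been tallied into a subjects × categories table of counts and that every subject was rated by the same number of raters; the function name `fleiss_kappa` and the data are hypothetical.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects x categories table of rating counts.

    counts[i][j] = number of raters who assigned subject i to category j.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("each subject needs the same number of ratings")

    # Per-subject agreement: proportion of rater pairs that agree.
    p_i = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_i) / n_subjects

    # Chance agreement from category proportions pooled over all raters.
    n_categories = len(counts[0])
    p_j = [
        sum(row[j] for row in counts) / (n_subjects * n_raters)
        for j in range(n_categories)
    ]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)


# Hypothetical example: 4 scans, 3 radiologists, categories (normal, abnormal).
table = [
    [3, 0],  # all three say "normal"
    [0, 3],  # all three say "abnormal"
    [2, 1],  # one radiologist disagrees
    [3, 0],
]
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```

Because the chance term uses pooled category proportions, this is the same chance model as Scott’s Pi, extended to more than two raters.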

For continuous ratings

Intraclass Correlation Coefficient (ICC)
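- The primary statistic for agreement among raters scoring continuous variables (e.g., joint movement in mm). Several forms exist (one-way or two-way model, consistency or absolute agreement, single or average ratings), so state which form is reported.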
Concordance Correlation Coefficient (CCC)
- Measures both correlation and how close scores are to the line of perfect agreement.
Bland–Altman Limits of Agreement
- Not a single coefficient but a standard method/plot to evaluate agreement between raters or instruments.
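
The following sketch shows one common ICC form and the Bland–Altman limits of agreement in plain Python. It assumes a complete subjects × raters table with no missing values and picks ICC(2,1) (two-way random effects, absolute agreement, single rater) purely as an example; other forms use different formulas. The function names and ratings are illustrative only.

```python
from statistics import fmean, stdev


def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    ratings[i][j] = score given to subject i by rater j.
    """
    n, k = len(ratings), len(ratings[0])
    grand = fmean(v for row in ratings for v in row)
    row_means = [fmean(row) for row in ratings]
    col_means = [fmean(row[j] for row in ratings) for j in range(k)]

    # Two-way ANOVA sums of squares: subjects, raters, residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


def bland_altman_limits(a, b):
    """Mean difference (bias) and approximate 95% limits of agreement."""
    diffs = [x - y for x, y in zip(a, b)]
    bias = fmean(diffs)
    sd = stdev(diffs)  # sample standard deviation of the differences
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


# Illustrative ratings: 6 subjects (rows) scored by 4 raters (columns).
ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]
print(f"ICC(2,1) = {icc_2_1(ratings):.3f}")

rater_a = [row[0] for row in ratings]
rater_d = [row[3] for row in ratings]
bias, lower, upper = bland_altman_limits(rater_a, rater_d)
print(f"bias = {bias:.2f}, limits of agreement = ({lower:.2f}, {upper:.2f})")
```

The raters in this table differ a lot in their average level, which is why an absolute-agreement ICC comes out low here. The 1.96 multiplier in the limits of agreement assumes the differences are roughly normally distributed.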

3. Intra-Rater Reliability

What it is

Intra-rater reliability asks:

If the same rater assesses the same subjects on different occasions, are their own ratings consistent?

This is about the self-consistency of one observer.

When it is used

One physiotherapist measuring joint range of motion twice
One pathologist rating biopsy slides on two occasions
One examiner rescoring the same OSCE performance

Main statistics (named coefficients)

For continuous data:

Intraclass Correlation Coefficient (ICC)
- Typically used for repeated measurements by the same rater.

For categorical data:

Cohen’s kappa (κ)
Weighted kappa (for ordinal categories)

For agreement emphasis:

Concordance Correlation Coefficient (CCC)

Conceptually, intra-rater uses the same coefficients as inter-rater reliability, but all ratings come from one person at multiple times instead of multiple people.

4. Parallel-Forms (Alternate-Forms) Reliability

What it is

Parallel-forms reliability asks:

If we create two different but equivalent versions of a test, do they give similar results for the same people?

This focuses on consistency between two forms of the same instrument.

When it is used

High-stakes exams (e.g., medical school tests) with Form A and Form B
Repeated assessments where you want to reduce memory/practice effects
Large test banks where exams are assembled from item pools

Main statistics (named coefficients)

For continuous test scores:

Pearson correlation coefficient (r) between Form A and Form B
Spearman correlation (ρ) for ordinal or non-normal scores
Intraclass Correlation Coefficient (ICC)
- When each form is treated like a “rater” producing a score.

Generic term:

Coefficient of Equivalence
- General label for reliability across parallel/alternate forms.

When alternate forms are combined with a test–retest design (Form A at Time 1, Form B at Time 2), the resulting correlation is sometimes called the coefficient of stability and equivalence. This is a conceptual label rather than a distinct formula.

5. Internal Consistency Reliability

What it is

Internal consistency asks:

Do the items within a single test or questionnaire all work together to measure the same underlying construct?

This is particularly important for multi-item scales (e.g., depression scales, quality of life instruments, satisfaction surveys).

When it is used

Symptom questionnaires (e.g., depression, anxiety, fatigue)
Quality of life scales
Knowledge tests with multiple items
Attitude or Likert-type surveys

Main statistics (named coefficients)

Cronbach’s alpha (α)
- The most widely used index of internal consistency.
- Increases with both the average inter-item correlation and the number of items, so a long scale can reach a high alpha even when its items are only weakly related.
McDonald’s Omega (ω)
- Often a better estimator than alpha, which assumes that all items load equally on the construct (tau-equivalence); omega allows items to have different factor loadings.
Kuder–Richardson Formula 20 (KR-20)
- Internal consistency index specifically for dichotomous items (e.g., right/wrong = 0/1).
Kuder–Richardson Formula 21 (KR-21)
- Simpler variant of KR-20 that assumes all items are equally difficult, so it needs only the number of items and the mean and variance of the total scores.
Guttman’s Lambda coefficients (λ2–λ6)
- A family of lower-bound reliability estimates. λ3 is identical to Cronbach’s alpha and λ2 is never smaller than alpha, so some members of the family can be less biased than alpha.
Split-Half Reliability
- The test is split into two halves (e.g., odd vs even items). Total scores of each half are correlated.
Spearman–Brown Prophecy Formula
- Used to adjust the split-half coefficient to estimate reliability for the full-length test or a hypothetical longer/shorter test.
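
Several of the coefficients above reduce to a few lines of arithmetic. The sketch below, using only the Python standard library, computes Cronbach’s alpha, an odd/even split-half correlation, and the Spearman–Brown correction. The respondent data and function names are hypothetical. Applied to 0/1 items, the same alpha formula gives KR-20.

```python
from statistics import fmean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; items[i][j] = score of respondent i on item j."""
    k = len(items[0])
    item_vars = [variance([row[j] for row in items]) for j in range(k)]
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r, length_factor=2.0):
    """Predicted reliability when test length is multiplied by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def split_half(items):
    """Odd/even split-half reliability, stepped up to full test length."""
    first = [sum(row[0::2]) for row in items]   # items 1, 3, 5, ...
    second = [sum(row[1::2]) for row in items]  # items 2, 4, 6, ...
    return spearman_brown(pearson(first, second))


# Hypothetical responses: 5 respondents x 4 Likert items (1 to 5).
responses = [
    [4, 5, 4, 5],
    [2, 3, 2, 2],
    [3, 3, 4, 3],
    [5, 4, 5, 5],
    [1, 2, 1, 2],
]
print(f"Cronbach's alpha       = {cronbach_alpha(responses):.3f}")
print(f"Split-half (corrected) = {split_half(responses):.3f}")
```

The split-half value depends on how the items are divided, which is one reason alpha is usually reported instead. Setting `length_factor` below 1 in `spearman_brown` estimates the reliability of a shortened test.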

Other related indices:

Average inter-item correlation
- Simple average of correlations between all pairs of items.
Item–total correlation & “alpha if item deleted”
- Diagnostics to identify weak or problematic items in a scale.
Coefficient H (Hancock & Mueller’s H)
- Reliability index for a latent factor (useful in structural equation modeling).
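
As a sketch of how the item-level diagnostics work, the code below computes a corrected item–total correlation (each item against the sum of the remaining items) and “alpha if item deleted”. The data are hypothetical: four items that move together plus a fifth that does not. The helper names are illustrative, not a library API.

```python
from statistics import fmean, variance


def cronbach_alpha(items):
    """Cronbach's alpha; items[i][j] = score of respondent i on item j."""
    k = len(items[0])
    item_vars = [variance([row[j] for row in items]) for j in range(k)]
    total_var = variance([sum(row) for row in items])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mx, my = fmean(x), fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def item_diagnostics(items):
    """Return (corrected item-total r, alpha if item deleted) for each item."""
    k = len(items[0])
    results = []
    for j in range(k):
        item = [row[j] for row in items]
        rest = [row[:j] + row[j + 1:] for row in items]  # scale without item j
        rest_total = [sum(r) for r in rest]
        results.append((pearson(item, rest_total), cronbach_alpha(rest)))
    return results


# Hypothetical responses: 5 respondents x 5 items; item 5 is the odd one out.
responses = [
    [4, 5, 4, 5, 2],
    [2, 3, 2, 2, 4],
    [3, 3, 4, 3, 1],
    [5, 4, 5, 5, 3],
    [1, 2, 1, 2, 3],
]
print(f"alpha (all items) = {cronbach_alpha(responses):.3f}")
for number, (r_it, alpha_del) in enumerate(item_diagnostics(responses), start=1):
    print(f"item {number}: item-total r = {r_it:+.3f}, alpha if deleted = {alpha_del:.3f}")
```

An item whose corrected item–total correlation is low or negative, and whose removal raises alpha, is a candidate for revision. With so few respondents these numbers are only illustrative.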

Summary

Each type of reliability focuses on a different question:

Test–Retest Reliability → Is the instrument stable over time?
- Key names: Pearson r, Spearman ρ, ICC, CCC, Cohen’s kappa
Inter-Rater Reliability → Do different raters agree?
- Key names: Cohen’s kappa, Fleiss’ kappa, Krippendorff’s alpha, ICC, CCC
Intra-Rater Reliability → Is one rater consistent with themselves?
- Key names: ICC, Cohen’s kappa, weighted kappa, CCC
Parallel-Forms Reliability → Are different versions of a test equivalent?
- Key names: Pearson r, Spearman ρ, ICC, coefficient of equivalence
Internal Consistency Reliability → Do items within a scale hang together?
- Key names: Cronbach’s alpha, McDonald’s omega, KR-20, KR-21, Guttman’s lambda, split-half, Spearman–Brown
