
How to Choose Statistical Coefficients for Each Type of Reliability

  • Writer: Mayta


Summary Table: Types of Reliability & Statistical Coefficients

| Reliability Type | Purpose | Data Type | Statistical Coefficients (Named Statistics) |
| --- | --- | --- | --- |
| 1. Test–Retest Reliability | Measures stability over time (same test, two occasions) | Continuous | Pearson r; Spearman ρ (ordinal or non-normal); ICC (Intraclass Correlation Coefficient); CCC (Concordance Correlation Coefficient) |
| | | Ordinal | Spearman ρ; weighted Cohen’s kappa |
| | | Nominal / Dichotomous | Cohen’s kappa (κ) |
| | | | Coefficient of Stability (concept label) |
| 2. Inter-Rater Reliability | Measures agreement between different raters | Continuous | ICC; CCC; Bland–Altman plot (visual agreement) |
| | | Ordinal | Weighted Cohen’s kappa; Krippendorff’s alpha (multi-rater) |
| | | Nominal / Dichotomous | Cohen’s kappa (2 raters); Fleiss’ kappa (≥3 raters); Scott’s Pi; Brennan–Prediger kappa; Krippendorff’s alpha |
| 3. Intra-Rater Reliability | Measures consistency of one rater over time | Continuous | ICC; CCC |
| | | Ordinal | Weighted Cohen’s kappa |
| | | Nominal / Dichotomous | Cohen’s kappa |
| 4. Parallel-Forms Reliability | Measures equivalence between Form A and Form B | Continuous | Pearson r; Spearman ρ; ICC |
| | | Ordinal | Spearman ρ |
| | | Nominal / Dichotomous | Cohen’s kappa (rare use) |
| | | | Coefficient of Equivalence (concept label) |
| 5. Internal Consistency Reliability | Measures how well items in a scale measure the same construct | Multi-item scale | Cronbach’s alpha (α, most used); McDonald’s omega (ω, better when factor loadings differ); Guttman’s lambda (λ2–λ6); KR-20 (dichotomous items only); KR-21; split-half reliability + Spearman–Brown formula; Coefficient H (latent-variable reliability) |

Introduction

In measurement and psychometrics, reliability describes how consistently an instrument measures whatever it is supposed to measure. Different types of reliability focus on different sources of variation: time, raters, forms, and items.

Below are the main reliability types, with their key named statistics.

1. Test–Retest Reliability

What it is

Test–retest reliability asks:

If we measure the same person with the same instrument at two different times (and the trait has not truly changed), do we get similar scores?

It reflects the stability of a measurement over time.

When it is used

  • Cognitive or knowledge tests

  • Personality traits (e.g., trait anxiety)

  • Stable clinical scales (e.g., long-term disability, not acute pain)

Main statistics (named coefficients)

For continuous scores:

  • Pearson correlation coefficient (r)

    • Simple correlation between Time 1 and Time 2 scores.

  • Spearman rank correlation (ρ / rho)

    • Used when data are ordinal or not normally distributed.

  • Intraclass Correlation Coefficient (ICC)

    • More appropriate than Pearson r when you care about absolute agreement between the two occasions, not just linear association.

  • Concordance Correlation Coefficient (CCC; Lin’s CCC)

    • Combines precision (correlation) and accuracy (closeness to identity line) to quantify agreement over time.
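
As a quick illustration, here is a minimal Python sketch of these continuous-data coefficients, using scipy for the two correlations and computing Lin’s CCC directly from its formula (all scores are hypothetical):

```python
import numpy as np
from scipy import stats

# Hypothetical scores for 6 subjects measured at two time points
time1 = np.array([12.0, 15.0, 11.0, 18.0, 14.0, 16.0])
time2 = np.array([13.0, 14.0, 12.0, 17.0, 15.0, 16.0])

r, _ = stats.pearsonr(time1, time2)     # Pearson r: linear association
rho, _ = stats.spearmanr(time1, time2)  # Spearman rho: rank-based

# Lin's CCC = 2*cov / (var1 + var2 + (mean1 - mean2)^2), using population moments
cov = np.cov(time1, time2, ddof=0)[0, 1]
ccc = 2 * cov / (time1.var() + time2.var() + (time1.mean() - time2.mean()) ** 2)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, CCC = {ccc:.3f}")
```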

For categorical scores:

  • Cohen’s kappa (κ)

    • Agreement between Time 1 and Time 2 when responses are categorical (e.g., positive/negative).

  • Weighted kappa

    • For ordered categories (e.g., mild/moderate/severe).
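
A minimal sketch of both kappas with scikit-learn’s cohen_kappa_score (the ratings and the mild/moderate/severe coding are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical severity ratings of 6 subjects at two time points
time1 = ["mild", "moderate", "severe", "mild", "moderate", "mild"]
time2 = ["mild", "severe", "severe", "mild", "mild", "mild"]

kappa = cohen_kappa_score(time1, time2)  # unweighted agreement beyond chance

# Weighted kappa needs a numeric ordering so bigger disagreements cost more
codes = {"mild": 0, "moderate": 1, "severe": 2}
t1 = [codes[x] for x in time1]
t2 = [codes[x] for x in time2]
wkappa = cohen_kappa_score(t1, t2, weights="quadratic")

print(f"kappa = {kappa:.3f}, weighted kappa = {wkappa:.3f}")
```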

General term:

  • Coefficient of Stability

    • A generic term for the correlation between two time points in test–retest designs.


2. Inter-Rater Reliability

What it is

Inter-rater reliability asks:

If two or more raters assess the same subjects, how consistently do they agree?

It focuses on agreement between different observers.

When it is used

  • Two clinicians rating severity of a symptom

  • Multiple radiologists reading the same scan

  • Several examiners grading OSCE or written essays

Main statistics (named coefficients)

For categorical ratings

  • Cohen’s kappa (κ)

    • Agreement beyond chance between two raters.

  • Weighted Cohen’s kappa

    • For ordinal categories where bigger disagreements should be penalized more.

  • Fleiss’ kappa

    • Extension of kappa for three or more raters.

  • Scott’s Pi (π)

    • Similar to Cohen’s kappa, with a different assumption about chance agreement.

  • Brennan–Prediger kappa

    • Adjusted kappa that reduces the “prevalence paradox” (when categories are very imbalanced).

  • Krippendorff’s alpha (α)

    • Very flexible: works with any number of raters, missing data, and different measurement levels (nominal, ordinal, interval).

  • Goodman–Kruskal’s gamma

    • A measure of association for ordinal ratings; sometimes used in rater agreement contexts.
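
To illustrate the multi-rater case, here is a minimal sketch of Fleiss’ kappa using statsmodels (the ratings are hypothetical):

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical data: 8 subjects (rows) rated by 4 raters (columns),
# coded 0 = negative, 1 = positive
ratings = np.array([
    [1, 1, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
])

# aggregate_raters turns raw ratings into a subjects x categories count table
table, categories = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss' kappa = {kappa:.3f}")
```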

For continuous ratings

  • Intraclass Correlation Coefficient (ICC)

    • The primary statistic for agreement among raters scoring continuous variables (e.g., mm of joint movement).

  • Concordance Correlation Coefficient (CCC)

    • Measures both correlation and how close scores are to the line of perfect agreement.

  • Bland–Altman Limits of Agreement

    • Not a single coefficient but a standard method/plot to evaluate agreement between raters or instruments.
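
A minimal sketch of ICC plus Bland–Altman limits of agreement; the ICC call assumes the third-party pingouin package, and the limits are computed directly as bias ± 1.96 SD of the paired differences (hypothetical data):

```python
import pandas as pd
import pingouin as pg  # third-party: pip install pingouin

# Hypothetical long-format data: each subject scored by two raters
df = pd.DataFrame({
    "subject": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5],
    "rater":   ["A", "B"] * 5,
    "score":   [10.0, 11.0, 14.0, 13.5, 9.0, 9.5, 16.0, 15.0, 12.0, 12.5],
})

# pingouin reports ICC1-ICC3 (single rater) and ICC1k-ICC3k (average of raters)
icc = pg.intraclass_corr(data=df, targets="subject", raters="rater", ratings="score")
print(icc[["Type", "ICC"]])

# Bland-Altman limits of agreement, computed directly
a = df[df["rater"] == "A"]["score"].to_numpy()
b = df[df["rater"] == "B"]["score"].to_numpy()
diff = a - b
lo = diff.mean() - 1.96 * diff.std(ddof=1)
hi = diff.mean() + 1.96 * diff.std(ddof=1)
print(f"bias = {diff.mean():.2f}, 95% limits of agreement = ({lo:.2f}, {hi:.2f})")
```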


3. Intra-Rater Reliability

What it is

Intra-rater reliability asks:

If the same rater assesses the same subjects on different occasions, are their own ratings consistent?

This is about the self-consistency of one observer.

When it is used

  • One physiotherapist measuring joint range of motion twice

  • One pathologist rating biopsy slides on two occasions

  • One examiner rescoring the same OSCE performance

Main statistics (named coefficients)

For continuous data:

  • Intraclass Correlation Coefficient (ICC)

    • Typically used for repeated measurements by the same rater.

For categorical data:

  • Cohen’s kappa (κ)

  • Weighted kappa (for ordinal categories)

For agreement emphasis:

  • Concordance Correlation Coefficient (CCC)

Conceptually, intra-rater uses the same coefficients as inter-rater reliability, but all ratings come from one person at multiple times instead of multiple people.
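
To make that concrete, the inter-rater ICC sketch above carries over unchanged; only the meaning of the "raters" column shifts from different people to different occasions of the same rater (hypothetical data, pingouin assumed):

```python
import pandas as pd
import pingouin as pg  # third-party: pip install pingouin

# Hypothetical: one rater measures each subject on two occasions
df = pd.DataFrame({
    "subject":  [1, 1, 2, 2, 3, 3, 4, 4],
    "occasion": ["t1", "t2"] * 4,
    "score":    [30.0, 31.0, 42.0, 41.0, 35.5, 36.0, 28.0, 29.5],
})

# The "raters" argument now holds the two occasions of the same rater
icc = pg.intraclass_corr(data=df, targets="subject", raters="occasion", ratings="score")
print(icc[["Type", "ICC"]])
```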

4. Parallel-Forms (Alternate-Forms) Reliability

What it is

Parallel-forms reliability asks:

If we create two different but equivalent versions of a test, do they give similar results for the same people?

This focuses on consistency between two forms of the same instrument.

When it is used

  • High-stakes exams (e.g., medical school tests) with Form A and Form B

  • Repeated assessments where you want to reduce memory/practice effects

  • Large test banks where exams are assembled from item pools

Main statistics (named coefficients)

For continuous test scores:

  • Pearson correlation coefficient (r) between Form A and Form B

  • Spearman correlation (ρ) for ordinal or non-normal scores

  • Intraclass Correlation Coefficient (ICC)

    • When each form is treated like a “rater” producing a score.
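
A minimal sketch: the coefficient of equivalence here is simply the Pearson r between the two forms’ total scores (hypothetical scores):

```python
from scipy import stats

# Hypothetical total scores of the same 6 examinees on Form A and Form B
form_a = [72, 85, 64, 90, 78, 69]
form_b = [75, 83, 61, 92, 80, 66]

r, _ = stats.pearsonr(form_a, form_b)  # coefficient of equivalence
print(f"Form A vs Form B: r = {r:.3f}")
```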

Generic term:

  • Coefficient of Equivalence

    • General label for reliability across parallel/alternate forms.

When alternate forms are also administered on separate occasions (alternate-forms + test–retest), the resulting correlation is sometimes called the coefficient of stability and equivalence; this is a conceptual label rather than a distinct formula.

5. Internal Consistency Reliability

What it is

Internal consistency asks:

Do the items within a single test or questionnaire all work together to measure the same underlying construct?

This is particularly important for multi-item scales (e.g., depression scales, quality of life instruments, satisfaction surveys).

When it is used

  • Symptom questionnaires (e.g., depression, anxiety, fatigue)

  • Quality of life scales

  • Knowledge tests with multiple items

  • Attitude or Likert-type surveys

Main statistics (named coefficients)

  • Cronbach’s alpha (α)

    • The most widely used index of internal consistency.

    • Influenced by the average inter-item correlation and the number of items.

  • McDonald’s Omega (ω)

    • Often a better estimator than alpha, especially when items have different factor loadings.

  • Kuder–Richardson Formula 20 (KR-20)

    • Internal consistency index specifically for dichotomous items (e.g., right/wrong = 0/1).

  • Kuder–Richardson Formula 21 (KR-21)

    • Simpler variant of KR-20 under more restrictive assumptions.

  • Guttman’s Lambda coefficients (λ2–λ6)

    • A set of reliability estimates that can, in some cases, be more accurate or less biased than alpha.

  • Split-Half Reliability

    • The test is split into two halves (e.g., odd vs. even items), and the total scores of the two halves are correlated.

  • Spearman–Brown Prophecy Formula

    • Used to adjust the split-half coefficient to estimate reliability for the full-length test or a hypothetical longer/shorter test.
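
As a worked example, here is a minimal NumPy sketch of Cronbach’s alpha (which reduces to KR-20 for 0/1 items) and a split-half estimate corrected with the Spearman–Brown formula, r_sb = 2r / (1 + r) (hypothetical responses):

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents x k_items) score matrix.
    alpha = k/(k-1) * (1 - sum(item variances) / variance(total score)).
    With 0/1 items this is numerically equal to KR-20."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

# Hypothetical 5-item Likert responses from 6 respondents
X = np.array([
    [4, 4, 3, 4, 5],
    [2, 3, 2, 2, 3],
    [5, 5, 4, 5, 4],
    [3, 3, 3, 2, 3],
    [4, 5, 4, 4, 4],
    [1, 2, 2, 1, 2],
])
print(f"alpha = {cronbach_alpha(X):.3f}")

# Split-half (odd vs. even items), then Spearman-Brown correction
odd, even = X[:, ::2].sum(axis=1), X[:, 1::2].sum(axis=1)
r = np.corrcoef(odd, even)[0, 1]
print(f"split-half r = {r:.3f}, Spearman-Brown corrected = {2 * r / (1 + r):.3f}")
```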

Other related indices:

  • Average inter-item correlation

    • Simple average of correlations between all pairs of items.

  • Item–total correlation & “alpha if item deleted”

    • Diagnostics to identify weak or problematic items in a scale.

  • Coefficient H (Hancock & Mueller’s H)

    • Reliability index for a latent factor (useful in structural equation modeling).
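
Continuing the sketch above (reusing the hypothetical matrix X and the cronbach_alpha helper), the item-level diagnostics are a short loop:

```python
import numpy as np

# Assumes X and cronbach_alpha() from the previous sketch are in scope
for i in range(X.shape[1]):
    rest = np.delete(X, i, axis=1)  # all items except item i
    # Corrected item-total correlation: item i vs. the sum of the other items
    item_total = np.corrcoef(X[:, i], rest.sum(axis=1))[0, 1]
    alpha_wo = cronbach_alpha(rest)  # "alpha if item deleted"
    print(f"item {i + 1}: item-total r = {item_total:.3f}, "
          f"alpha if deleted = {alpha_wo:.3f}")
```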


Summary

Each type of reliability focuses on a different question:

  • Test–Retest Reliability → Is the instrument stable over time?

    • Key names: Pearson r, Spearman ρ, ICC, CCC, Cohen’s kappa

  • Inter-Rater Reliability → Do different raters agree?

    • Key names: Cohen’s kappa, Fleiss’ kappa, Krippendorff’s alpha, ICC, CCC

  • Intra-Rater Reliability → Is one rater consistent with themselves?

    • Key names: ICC, Cohen’s kappa, weighted kappa, CCC

  • Parallel-Forms Reliability → Are different versions of a test equivalent?

    • Key names: Pearson r, Spearman ρ, ICC, coefficient of equivalence

  • Internal Consistency Reliability → Do items within a scale hang together?

    • Key names: Cronbach’s alpha, McDonald’s omega, KR-20, KR-21, Guttman’s lambda, split-half, Spearman–Brown
