How to Choose Statistical Coefficients for Each Type of Reliability
- Mayta
Summary Table: Types of Reliability & Statistical Coefficients
| Reliability Type | Purpose | Data Type | Statistical Coefficients (Named Statistics) |
| --- | --- | --- | --- |
| 1. Test–Retest Reliability | Measures stability over time (same test, two occasions) | Continuous | • Pearson r • Spearman ρ (ordinal or non-normal) • ICC (Intraclass Correlation Coefficient) • CCC (Concordance Correlation Coefficient) |
| | | Ordinal | • Spearman ρ • Weighted Cohen’s kappa |
| | | Nominal / Dichotomous | • Cohen’s kappa (κ) |
| | | — | Coefficient of Stability (concept label) |
| 2. Inter-Rater Reliability | Measures agreement between different raters | Continuous | • ICC • CCC • (Bland–Altman plot for visual agreement) |
| | | Ordinal | • Weighted Cohen’s kappa • Krippendorff’s alpha (multi-rater) |
| | | Nominal / Dichotomous | • Cohen’s kappa (2 raters) • Fleiss’ kappa (≥3 raters) • Scott’s Pi • Brennan–Prediger kappa • Krippendorff’s alpha |
| 3. Intra-Rater Reliability | Measures consistency of one rater over time | Continuous | • ICC • CCC |
| | | Ordinal | • Weighted Cohen’s kappa |
| | | Nominal / Dichotomous | • Cohen’s kappa |
| 4. Parallel-Forms Reliability | Measures equivalence between Form A and Form B | Continuous | • Pearson r • Spearman ρ • ICC |
| | | Ordinal | • Spearman ρ |
| | | Nominal / Dichotomous | • Cohen’s kappa (rare use) |
| | | — | Coefficient of Equivalence (concept label) |
| 5. Internal Consistency Reliability | Measures how well items in a scale measure the same construct | Multi-item scale | • Cronbach’s alpha (α) (most used) • McDonald’s omega (ω) (better when factor loadings differ) • Guttman’s lambda (λ2–λ6) • KR-20 (dichotomous items only) • KR-21 • Split-half reliability + Spearman–Brown formula • Coefficient H (latent variable reliability) |
Introduction
In measurement and psychometrics, reliability describes how consistently an instrument measures whatever it is supposed to measure. Different types of reliability focus on different sources of variation: time, raters, forms, and items.
Below are the main reliability types, with their key named statistics.
1. Test–Retest Reliability
What it is
Test–retest reliability asks:
If we measure the same person with the same instrument at two different times (and the trait has not truly changed), do we get similar scores?
It reflects the stability over time of a measurement.
When it is used
Cognitive or knowledge tests
Personality traits (e.g., trait anxiety)
Stable clinical scales (e.g., long-term disability, not acute pain)
Main statistics (named coefficients)
For continuous scores:
Pearson correlation coefficient (r)
Simple correlation between Time 1 and Time 2 scores.
Spearman rank correlation (ρ / rho)
Used when data are ordinal or not normally distributed.
Intraclass Correlation Coefficient (ICC)
More appropriate than Pearson r when you care about absolute agreement between the two occasions (so a systematic shift in scores counts against reliability), not just linear association.
Concordance Correlation Coefficient (CCC; Lin’s CCC)
Combines precision (correlation) and accuracy (closeness to identity line) to quantify agreement over time.
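To make this concrete, here is a minimal Python sketch for the continuous case, using made-up scores for eight people measured twice and assuming NumPy and SciPy are available. Lin's CCC is hand-rolled from its definition (the sample, n−1 version; Lin's original formulation uses n in the denominators).

```python
# Minimal sketch: test-retest agreement for continuous scores.
# Scores for the same 8 people at Time 1 and Time 2 (made-up data).
import numpy as np
from scipy.stats import pearsonr, spearmanr

time1 = np.array([12.0, 15.5, 9.0, 20.1, 14.2, 18.0, 11.3, 16.8])
time2 = np.array([12.5, 14.9, 9.8, 19.5, 15.0, 17.2, 11.0, 17.1])

r, _ = pearsonr(time1, time2)       # linear association only
rho, _ = spearmanr(time1, time2)    # rank-based association

def lins_ccc(x, y):
    """Lin's CCC: 2*cov(x, y) / (var(x) + var(y) + (mean(x) - mean(y))**2)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    cov = np.cov(x, y, ddof=1)[0, 1]
    return 2 * cov / (x.var(ddof=1) + y.var(ddof=1) + (x.mean() - y.mean()) ** 2)

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, "
      f"Lin's CCC = {lins_ccc(time1, time2):.3f}")
```

Note how the CCC penalizes a mean shift between Time 1 and Time 2 (the (mean(x) − mean(y))² term), which Pearson r ignores entirely.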
For categorical scores:
Cohen’s kappa (κ)
Agreement between Time 1 and Time 2 when responses are categorical (e.g., positive/negative).
Weighted kappa
For ordered categories (e.g., mild/moderate/severe).
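For the categorical case, a minimal sketch assuming scikit-learn is installed (the positive/negative labels below are made up):

```python
# Minimal sketch: test-retest agreement for a dichotomous result.
from sklearn.metrics import cohen_kappa_score

time1 = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos", "neg", "pos"]
time2 = ["pos", "neg", "pos", "pos", "pos", "neg", "neg", "neg", "neg", "pos"]

# Chance-corrected agreement between the two occasions.
kappa = cohen_kappa_score(time1, time2)
print(f"Cohen's kappa (Time 1 vs Time 2) = {kappa:.3f}")
```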
General term:
Coefficient of Stability
A generic term for the correlation between two time points in test–retest designs.
2. Inter-Rater Reliability
What it is
Inter-rater reliability asks:
If two or more raters assess the same subjects, how consistently do they agree?
It focuses on agreement between different observers.
When it is used
Two clinicians rating severity of a symptom
Multiple radiologists reading the same scan
Several examiners grading OSCE or written essays
Main statistics (named coefficients)
For categorical ratings
Cohen’s kappa (κ)
Agreement beyond chance between two raters.
Weighted Cohen’s kappa
For ordinal categories where bigger disagreements should be penalized more.
Fleiss’ kappa
Extension of kappa for three or more raters (a worked sketch follows this list).
Scott’s Pi (π)
Similar to Cohen’s kappa, but chance agreement is computed from the raters’ pooled category distribution rather than from each rater’s own marginals.
Brennan–Prediger kappa
Adjusted kappa that reduces the “prevalence paradox” (when categories are very imbalanced).
Krippendorff’s alpha (α)
Very flexible: works with any number of raters, missing data, and different measurement levels (nominal, ordinal, interval).
Goodman–Kruskal’s gamma
A measure of association for ordinal ratings; sometimes used in rater agreement contexts.
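As a worked sketch of Fleiss’ kappa, assuming the statsmodels package is available and using made-up category codes (0/1/2) for six subjects rated by three raters:

```python
# Minimal sketch: Fleiss' kappa for 3 raters.
# Rows are subjects, columns are raters; category codes are made up.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

ratings = np.array([
    [0, 0, 1],
    [1, 1, 1],
    [2, 2, 1],
    [0, 0, 0],
    [1, 2, 2],
    [0, 1, 0],
])

# aggregate_raters turns the subject-by-rater codes into the
# subject-by-category count table that fleiss_kappa expects.
table, _categories = aggregate_raters(ratings)
print(f"Fleiss' kappa = {fleiss_kappa(table):.3f}")
```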
For continuous ratings
Intraclass Correlation Coefficient (ICC)
The primary statistic for agreement among raters scoring continuous variables (e.g., mm of joint movement).
Concordance Correlation Coefficient (CCC)
Measures both correlation and how close scores are to the line of perfect agreement.
Bland–Altman Limits of Agreement
Not a single coefficient but a standard method/plot to evaluate agreement between raters or instruments.
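A minimal sketch of the continuous case: a hand-rolled ICC(2,1) (two-way random effects, absolute agreement, single rater, following Shrout & Fleiss) plus Bland–Altman limits of agreement, computed with NumPy on a made-up subjects-by-raters matrix.

```python
# Minimal sketch: continuous inter-rater agreement.
import numpy as np

# 6 subjects scored by 3 raters (made-up data).
scores = np.array([
    [10.0, 11.0, 10.5],
    [14.0, 13.5, 14.5],
    [ 9.0,  9.5,  9.0],
    [20.0, 19.0, 21.0],
    [12.0, 12.5, 11.5],
    [16.0, 15.0, 16.5],
])
n, k = scores.shape

# Mean squares from the two-way (subjects x raters) decomposition.
grand = scores.mean()
ms_rows = k * ((scores.mean(axis=1) - grand) ** 2).sum() / (n - 1)   # subjects
ms_cols = n * ((scores.mean(axis=0) - grand) ** 2).sum() / (k - 1)   # raters
ss_err = ((scores - scores.mean(axis=1, keepdims=True)
                  - scores.mean(axis=0, keepdims=True) + grand) ** 2).sum()
ms_err = ss_err / ((n - 1) * (k - 1))

# Shrout & Fleiss ICC(2,1): absolute agreement, single rating.
icc_2_1 = (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
print(f"ICC(2,1) = {icc_2_1:.3f}")

# Bland-Altman limits of agreement for rater 1 vs rater 2.
diff = scores[:, 0] - scores[:, 1]
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"Bland-Altman: bias = {bias:.2f}, "
      f"limits of agreement = [{bias - half_width:.2f}, {bias + half_width:.2f}]")
```

In practice you would usually plot the Bland–Altman differences against the pairwise means rather than only report the limits.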
3. Intra-Rater Reliability
What it is
Intra-rater reliability asks:
If the same rater assesses the same subjects on different occasions, are their own ratings consistent?
This is about the self-consistency of one observer.
When it is used
One physiotherapist measuring joint range of motion twice
One pathologist rating biopsy slides on two occasions
One examiner rescoring the same OSCE performance
Main statistics (named coefficients)
For continuous data:
Intraclass Correlation Coefficient (ICC)
Typically used for repeated measurements by the same rater.
For categorical data:
Cohen’s kappa (κ)
Weighted kappa (for ordinal categories)
For agreement emphasis:
Concordance Correlation Coefficient (CCC)
Conceptually, intra-rater reliability uses the same coefficients as inter-rater reliability; the only difference is that all ratings come from one person at multiple time points rather than from multiple people, as in the sketch below.
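For example, a minimal sketch assuming scikit-learn, with made-up ordinal severity ratings (0 = mild, 1 = moderate, 2 = severe) from one rater scoring the same ten cases in two sessions:

```python
# Minimal sketch: weighted kappa for one rater across two sessions.
from sklearn.metrics import cohen_kappa_score

session1 = [0, 1, 2, 1, 0, 2, 2, 1, 0, 1]
session2 = [0, 1, 1, 1, 0, 2, 2, 2, 0, 1]

# Quadratic weights penalize mild-vs-severe disagreements more heavily
# than mild-vs-moderate ones.
kappa_w = cohen_kappa_score(session1, session2, weights="quadratic")
print(f"Weighted kappa (intra-rater) = {kappa_w:.3f}")
```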
4. Parallel-Forms (Alternate-Forms) Reliability
What it is
Parallel-forms reliability asks:
If we create two different but equivalent versions of a test, do they give similar results for the same people?
This focuses on consistency between two forms of the same instrument.
When it is used
High-stakes exams (e.g., medical school tests) with Form A and Form B
Repeated assessments where you want to reduce memory/practice effects
Large test banks where exams are assembled from item pools
Main statistics (named coefficients)
For continuous test scores:
Pearson correlation coefficient (r) between Form A and Form B
Spearman correlation (ρ) for ordinal or non-normal scores
Intraclass Correlation Coefficient (ICC)
When each form is treated like a “rater” producing a score.
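A minimal sketch, assuming SciPy plus the third-party pandas and pingouin packages and using made-up Form A/Form B scores; the reshape into long format is what lets each form be treated as a “rater”:

```python
# Minimal sketch: parallel-forms reliability between Form A and Form B.
import pandas as pd
from scipy.stats import pearsonr
import pingouin as pg

form_a = [72, 65, 80, 58, 90, 77, 69, 84]
form_b = [70, 68, 78, 60, 88, 75, 72, 81]

r, _ = pearsonr(form_a, form_b)
print(f"Pearson r (Form A vs Form B) = {r:.3f}")

# Long format: one row per (examinee, form) pair, so each form acts as a rater.
long = pd.DataFrame({
    "examinee": list(range(len(form_a))) * 2,
    "form": ["A"] * len(form_a) + ["B"] * len(form_b),
    "score": form_a + form_b,
})
icc = pg.intraclass_corr(data=long, targets="examinee", raters="form", ratings="score")
print(icc[["Type", "ICC"]])
```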
Generic term:
Coefficient of Equivalence
General label for reliability across parallel/alternate forms.
Sometimes, when alternate forms are administered on two separate occasions (combining the parallel-forms and test–retest designs), the resulting correlation is called the coefficient of stability and equivalence; this is a descriptive label rather than a distinct formula.
5. Internal Consistency Reliability
What it is
Internal consistency asks:
Do the items within a single test or questionnaire all work together to measure the same underlying construct?
This is particularly important for multi-item scales (e.g., depression scales, quality of life instruments, satisfaction surveys).
When it is used
Symptom questionnaires (e.g., depression, anxiety, fatigue)
Quality of life scales
Knowledge tests with multiple items
Attitude or Likert-type surveys
Main statistics (named coefficients)
Cronbach’s alpha (α)
The most widely used index of internal consistency.
Influenced by average inter-item correlation and number of items.
McDonald’s Omega (ω)
Often a better estimator than alpha, especially when items have different factor loadings.
Kuder–Richardson Formula 20 (KR-20)
Internal consistency index specifically for dichotomous items (e.g., right/wrong = 0/1).
Kuder–Richardson Formula 21 (KR-21)
Simpler variant of KR-20 that assumes all items are of equal difficulty.
Guttman’s Lambda coefficients (λ2–λ6)
A set of reliability estimates that can, in some cases, be more accurate or less biased than alpha.
Split-Half Reliability
The test is split into two halves (e.g., odd vs. even items), and the total scores of the two halves are correlated.
Spearman–Brown Prophecy Formula
Used to adjust the split-half coefficient to estimate reliability for the full-length test or a hypothetical longer/shorter test.
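Here is a minimal NumPy sketch of three of these estimates, Cronbach’s alpha, KR-20, and split-half reliability stepped up with Spearman–Brown, on a small made-up matrix of dichotomous items (rows = respondents, columns = items scored 0/1):

```python
# Minimal sketch: internal consistency on a made-up 8 x 6 item matrix.
import numpy as np

items = np.array([
    [1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0, 0],
    [1, 0, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 1, 0, 1, 1],
    [0, 0, 0, 0, 1, 0],
    [1, 1, 1, 1, 0, 1],
])
n, k = items.shape
total = items.sum(axis=1)

# Cronbach's alpha: k/(k-1) * (1 - sum of item variances / variance of total).
alpha = k / (k - 1) * (1 - items.var(axis=0, ddof=1).sum() / total.var(ddof=1))

# KR-20 (dichotomous items): k/(k-1) * (1 - sum(p*q) / variance of total);
# the sample (n-1) variance of the total is used here.
p = items.mean(axis=0)
kr20 = k / (k - 1) * (1 - (p * (1 - p)).sum() / total.var(ddof=1))

# Split-half: correlate odd-item and even-item half scores,
# then step up with the Spearman-Brown formula 2r / (1 + r).
odd = items[:, 0::2].sum(axis=1)
even = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd, even)[0, 1]
split_half = 2 * r_half / (1 + r_half)

print(f"alpha = {alpha:.3f}, KR-20 = {kr20:.3f}, "
      f"split-half (Spearman-Brown) = {split_half:.3f}")
```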
Other related indices:
Average inter-item correlation
Simple average of correlations between all pairs of items.
Item–total correlation & “alpha if item deleted”
Diagnostics to identify weak or problematic items in a scale (see the sketch after this list).
Coefficient H (Hancock & Mueller’s H)
Reliability index for a latent factor (useful in structural equation modeling).
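A minimal sketch of those item-level diagnostics, corrected item–total correlation and “alpha if item deleted”, on a small made-up Likert-type item matrix:

```python
# Minimal sketch: item-level diagnostics for a respondents-by-items matrix
# (rows = respondents, columns = items scored 1-5; made-up data).
import numpy as np

items = np.array([
    [4, 5, 4, 5],
    [2, 3, 3, 2],
    [5, 5, 4, 4],
    [1, 2, 2, 1],
    [3, 3, 4, 3],
    [4, 4, 5, 5],
], dtype=float)

def cronbach_alpha(x):
    """Cronbach's alpha for a respondents-by-items matrix."""
    k = x.shape[1]
    return k / (k - 1) * (1 - x.var(axis=0, ddof=1).sum() / x.sum(axis=1).var(ddof=1))

for j in range(items.shape[1]):
    rest = np.delete(items, j, axis=1)                          # scale without item j
    r_it = np.corrcoef(items[:, j], rest.sum(axis=1))[0, 1]     # corrected item-total r
    print(f"item {j + 1}: corrected item-total r = {r_it:.2f}, "
          f"alpha if deleted = {cronbach_alpha(rest):.2f}")
```

Items with a low corrected item–total correlation, or whose removal raises alpha, are the usual candidates for revision or deletion.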
Summary
Each type of reliability focuses on a different question:
Test–Retest Reliability → Is the instrument stable over time?
Key names: Pearson r, Spearman ρ, ICC, CCC, Cohen’s kappa
Inter-Rater Reliability → Do different raters agree?
Key names: Cohen’s kappa, Fleiss’ kappa, Krippendorff’s alpha, ICC, CCC
Intra-Rater Reliability → Is one rater consistent with themselves?
Key names: ICC, Cohen’s kappa, weighted kappa, CCC
Parallel-Forms Reliability → Are different versions of a test equivalent?
Key names: Pearson r, Spearman ρ, ICC, coefficient of equivalence
Internal Consistency Reliability → Do items within a scale hang together?
Key names: Cronbach’s alpha, McDonald’s omega, KR-20, KR-21, Guttman’s lambda, split-half, Spearman–Brown