Inter-rater, Intra-rater, Test–retest: Designing a Reliability Study

Abstract

When clinicians and clinical researchers evaluate measurement instruments, they must determine reliability, defined as the proportion of total observed variance reflecting genuine subject differences rather than noise. Because routine measurements only reveal total variation, investigators must use repeated-measurement designs to isolate true clinical signals from measurement error. Crucially, every reliability protocol requires the measured subject to remain stable throughout the study window, as any true clinical change will falsely inflate the apparent instrument error. By systematically repeating measurements, investigators can evaluate three distinct dimensions: inter-rater reliability isolates the error of swapping assessors, intra-rater reliability examines a single rater over time while balancing recall bias against subject change, and test–retest reliability assesses the independent effect of passing time. Researchers can efficiently capture all three dimensions simultaneously using a single fully crossed protocol, such as deploying two raters across two days. This article explains the three fundamental reliability designs, demonstrates how to execute them simultaneously, and teaches researchers how to pinpoint where human operators introduce systematic error into the measurement process.

Introduction

When you say a measurement instrument is reliable, you are making a very specific claim: that the instrument consistently ranks or classifies different subjects the same way when the measurement is repeated. Reliability is the proportion of the variation in your data that reflects real differences between subjects rather than noise. But here is the practical problem — in any ordinary measurement you can only observe the total variation. You can never directly see how much of it is "true" and how much is "error." To split that mixture apart, you have to design a study that deliberately repeats the measurement under controlled conditions.

This article is about that design step. We will work through the three classic reliability designs (inter-rater, intra-rater, and test–retest): what each one repeats, which slice of error variance each one isolates, and the trap that comes with each. Then we will walk through a single real study, van Lummel et al.'s reliability study of the instrumented Timed Up and Go (iTUG) test in Parkinson's disease, which captures all three at once. Finally we will open up the "error" box: where systematic error comes from, and the four components of the measurement process through which it can enter.

The central design question: If two measurements of the same subject disagree, can I be sure the disagreement came from the instrument or the rater — and not from the subject changing underneath me?

Figure. The three reliability designs — inter-rater, intra-rater, and test–retest — and how a single two-rater, two-day protocol captures all three simultaneously.

The one requirement every design shares: the subject must stay STABLE

Before we separate the three designs, we have to state the assumption they all stand on. No matter which dimension of reliability you study, the thing being measured must be stable (unchanging) over the study window. This is non-negotiable, and violating it is one of the most common ways reliability studies go wrong.

The logic is simple. Reliability asks: "When the result of two measurements differs, where did the difference come from?" We want the answer to be "the instrument or the rater," because that is what we are trying to characterise. But if the subject itself changed between measurements — a tumour grew, a patient fatigued, a wound healed — then the difference is real change in the subject, not measurement error. Our statistics cannot tell those two apart automatically; they will simply blame the instrument.

Once we grant stability, repeating the measurement lets us pin down which source the variation came from. That is exactly what each design is engineered to do.
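The effect of an unstable subject can be illustrated with a small simulation. This is only a sketch under assumed numbers (true values with SD 10, measurement error with SD 3, real change with SD 8, none of which come from the iTUG study), and it uses the plain correlation between two occasions as a rough stand-in for a reliability coefficient.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def simulate_repeat_correlation(change_sd, n_subjects=2000, seed=1):
    """Measure every subject twice; change_sd is the SD of real change
    in the subject between the two occasions (0 means perfectly stable)."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        true_value = rng.gauss(50, 10)    # assumed between-subject spread
        change = rng.gauss(0, change_sd)  # real change, not instrument error
        first.append(true_value + rng.gauss(0, 3))
        second.append(true_value + change + rng.gauss(0, 3))
    return pearson(first, second)

stable = simulate_repeat_correlation(change_sd=0)
unstable = simulate_repeat_correlation(change_sd=8)
print(f"stable subjects:   r = {stable:.2f}")
print(f"unstable subjects: r = {unstable:.2f}")
```

The instrument error is identical in both runs (SD 3), yet the run with changing subjects shows a clearly lower repeat correlation. With these assumed variances the expected values are about 100/109 ≈ 0.92 when subjects are stable and about 0.73 when they are not, so the instrument is blamed for variation it did not cause.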

The three designs, side by side

All three designs repeat the measurement — but they differ in what they repeat, and therefore in which slice of error variance they manage to isolate. Read this table as the heart of the article; the rest is commentary on it.

Reliability dimension	What is repeated	Error variance isolated	Main caution	Question it answers	The "ideal" instrument
1. Inter-rater / device reliability	Several raters measure the same subject	Variation between raters	The subject must not change	Is the instrument reliable when you swap the assessor?	Every rater understands the instrument the same way
2. Intra-rater / device reliability	The same rater measures the same subject at different times	Variation within a rater	The subject must not change over time	Is it reliable over time, with the same rater?	The same rater interprets the instrument identically every time
3. Test–retest reliability	The same subject is measured at different times, while accounting for the between-rater difference within each time point	Variation between time points	The subject must not change over time	Is it reliable when time passes?	Time itself does not affect the instrument's reading

Let us walk each row, because the differences are subtle.

1. Inter-rater reliability — swapping the assessor

Inter-rater reliability asks whether several raters (or several devices) produce close to the same value when they measure the same thing. The error it isolates is the between-rater error: the disagreement that appears purely because you changed who is holding the instrument.

There is a convenient property here. If the thing being measured does not change over time — think of a chest radiograph or a pathology slide, which is literally a fixed object — then the different raters do not have to read it at the same moment. Rater A can read all the slides on Monday and rater B on Friday, and the design is still clean, because the object is identical on both days. The "ideal" instrument in this dimension is one that every rater interprets the same way.

2. Intra-rater reliability — same rater, time passes

Intra-rater reliability asks whether the same rater (or the same device), measuring the same thing again, gets a consistent value. The error isolated is the within-rater error.

This design carries the most delicate timing trade-off in all of clinimetrics:

The gap between the two measurements must be long enough to avoid recall bias — if the rater still remembers the first reading, the second reading is not independent, and you will overestimate reliability.
But the gap must be short enough that the subject has not changed — otherwise true change leaks in as error, and you underestimate reliability.
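The recall side of this trade-off can be sketched in the same way. The model below is an assumption made for illustration: the rater's second reading is a weighted blend of a fresh, independent reading and the remembered first reading, with a hypothetical `recall_weight` controlling how strongly memory anchors the second value.

```python
import random

def pearson(x, y):
    """Plain Pearson correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def repeat_correlation(recall_weight, n_subjects=2000, seed=7):
    """recall_weight = 0: second reading is fully independent.
    recall_weight = 1: the rater simply repeats the first reading."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_subjects):
        true_value = rng.gauss(50, 10)              # stable subject
        reading_1 = true_value + rng.gauss(0, 6)    # assumed rater error SD
        fresh_2 = true_value + rng.gauss(0, 6)      # what an unbiased re-read gives
        reading_2 = recall_weight * reading_1 + (1 - recall_weight) * fresh_2
        first.append(reading_1)
        second.append(reading_2)
    return pearson(first, second)

independent = repeat_correlation(recall_weight=0.0)
with_recall = repeat_correlation(recall_weight=0.5)
print(f"independent re-read: r = {independent:.2f}")
print(f"re-read with recall: r = {with_recall:.2f}")
```

The rater's true error is the same in both runs, but the remembered first reading drags the second reading toward it, so the two readings agree more than two independent readings would. That is the sense in which a too-short interval overestimates reliability.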

3. Test–retest reliability — does time itself matter?

Test–retest reliability looks almost identical to intra-rater reliability: the same rater or device measures the same subject again, with the same long-enough-but-short-enough timing caution. The conceptual difference is in the analysis, not the protocol.

In test–retest, we deliberately account for and adjust out the between-rater (or between-device) difference within a single time point, so that what remains is the variance attributable to time: how much the reading shifts between occasions once rater differences are set aside. In other words, test–retest asks the cleanest possible version of "does the passage of time, by itself, change the answer?", having first removed the contamination of rater differences. The ideal instrument here is one for which time has no effect on the reading.
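A toy calculation shows why the rater difference has to be handled before time is examined. The numbers are made up (rater B reads 1.0 s higher than rater A, and day 2 runs 0.2 s slower than day 1), and averaging over raters within each day is just one simple way to balance the rater difference out; a formal analysis would typically fit a variance-components model instead.

```python
# Made-up durations in seconds, keyed by (day, rater) for each patient.
readings = {
    "P1": {(1, "A"): 10.0, (1, "B"): 11.0, (2, "A"): 10.2, (2, "B"): 11.2},
    "P2": {(1, "A"): 12.0, (1, "B"): 13.0, (2, "A"): 12.2, (2, "B"): 13.2},
    "P3": {(1, "A"): 14.0, (1, "B"): 15.0, (2, "A"): 14.2, (2, "B"): 15.2},
    "P4": {(1, "A"): 16.0, (1, "B"): 17.0, (2, "A"): 16.2, (2, "B"): 17.2},
}

def naive_day_shift(r):
    """Rater A on day 1 versus rater B on day 2: rater and time are confounded."""
    return r[(2, "B")] - r[(1, "A")]

def rater_balanced_day_shift(r):
    """Average over both raters within each day, then compare the days.
    The rater offset appears in both day means and cancels in the difference."""
    day_1 = (r[(1, "A")] + r[(1, "B")]) / 2
    day_2 = (r[(2, "A")] + r[(2, "B")]) / 2
    return day_2 - day_1

naive = [round(naive_day_shift(r), 2) for r in readings.values()]
balanced = [round(rater_balanced_day_shift(r), 2) for r in readings.values()]
print("naive day shift:         ", naive)
print("rater-balanced day shift:", balanced)
```

The naive comparison reports a shift of 1.2 s per patient, most of which is the rater offset. The rater-balanced comparison recovers the 0.2 s that is actually due to the day.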

A worked design: capturing all three at once (the iTUG study)

Here is the elegant part. You do not need three separate studies. Researchers typically design one protocol that estimates all three dimensions simultaneously. The logic, read off the figure, is:

To study inter-rater, you need more than one rater.
To study intra-rater, you need the same rater to measure again at a different time.
If you collect data satisfying both of the above, test–retest reliability can be computed from the same dataset at no extra cost.
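The logic of these three points can be checked by enumerating the measurement cells of a two-rater, two-day layout and classifying every pair of cells. This is a minimal sketch; the labels and the `classify` helper are illustrative names, not part of any published protocol.

```python
from itertools import combinations

DAYS = (1, 2)
RATERS = ("A", "B")

def classify(cell_1, cell_2):
    """Label what a comparison between two (day, rater) cells can tell us."""
    (day_1, rater_1), (day_2, rater_2) = cell_1, cell_2
    if day_1 == day_2 and rater_1 != rater_2:
        return "inter-rater"   # same day, different rater
    if rater_1 == rater_2 and day_1 != day_2:
        return "intra-rater"   # same rater, different day
    return "confounded"        # rater and day both differ

cells = [(day, rater) for day in DAYS for rater in RATERS]
pairs_by_type = {"inter-rater": [], "intra-rater": [], "confounded": []}
for cell_1, cell_2 in combinations(cells, 2):
    pairs_by_type[classify(cell_1, cell_2)].append((cell_1, cell_2))

for kind, pairs in pairs_by_type.items():
    print(f"{kind}: {pairs}")
```

Of the six possible pairs, two are clean inter-rater comparisons and two are clean intra-rater comparisons. Test–retest does not need additional cells: it contrasts day 1 with day 2 using all four, with both raters present on each day.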

The worked example is "Intra-Rater, Inter-Rater and Test–Retest Reliability of an Instrumented Timed Up and Go (iTUG) Test in Patients with Parkinson's Disease" by Rob C. van Lummel et al., which set out to assess the reliability of the iTUG test in patients with Parkinson's disease. Procedurally, a rater fits the patient with an elastic belt carrying an electronic sensor, positions the patient, gives the start signal, and times the patient with a stopwatch.

The design — two raters (A and B) measuring each patient across two days (Day 1 and Day 2) — yields all three dimensions at once:

Reliability dimension	How the iTUG study does it	Note / caution
Inter-rater / device	Each patient is assessed by two raters, A and B	Because the same patient is measured 5 times and may fatigue on later trials, the sequence of raters is alternated across patients in both time periods, so fatigue does not systematically favour one rater
Intra-rater / device	Each rater (A and B) assesses every patient across two time points (Day 1 and Day 2)	The researchers must assume Parkinson's patients are stable over the study window. If a patient's true performance changed, the error variance would be inflated above its true value
Test–retest	All patients are re-measured across two days, each day measured by two raters	Because the design already supports intra-rater and inter-rater estimation, test–retest reliability falls out automatically
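The alternation of rater order mentioned in the inter-rater row can be sketched as a simple schedule. The scheme below is a hypothetical one (alternate the leading rater by enrolment order and keep the same order on both days); the source says only that the sequence was alternated across patients, so the exact rule here is an assumption.

```python
def assign_rater_order(patient_ids, days=(1, 2)):
    """Alternate which rater goes first from one patient to the next,
    so that late-trial fatigue does not always fall on the same rater."""
    schedule = {}
    for index, patient in enumerate(patient_ids):
        order = ("A", "B") if index % 2 == 0 else ("B", "A")
        schedule[patient] = {day: order for day in days}
    return schedule

schedule = assign_rater_order(["P01", "P02", "P03", "P04"])
a_first = sum(1 for plan in schedule.values() if plan[1][0] == "A")

for patient, plan in schedule.items():
    print(patient, plan)
print("patients with rater A first:", a_first, "of", len(schedule))
```

With an even number of patients, each rater goes first for exactly half of them, so any fatigue effect on later trials is spread evenly rather than being absorbed into one rater's readings.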

Notice how the stability caution becomes concrete here. Parkinson's disease genuinely fluctuates (on/off states, fatigue across repeated trials). The authors must assume the patients are stable across the protocol; if that assumption fails, the design cannot distinguish real clinical fluctuation from instrument error, and reliability is understated. This is the general rule made specific.

Sources of variation: True vs Error

To understand what these designs are estimating, open up the variation in your data. From the basic measurement model, every observed value is the sum of a true value and a measurement error:

\[ \text{Observed value} = \text{True value} + \text{Measurement error} \]

Measure many subjects and the same decomposition holds for the variances:

\[ \sigma^2_{\text{observed}} = \sigma^2_{\text{true}} + \sigma^2_{\text{error}} \]

This is the crux of the whole field. Agreement cares only about the size of \( \sigma^2_{\text{error}} \) (in real units). Reliability cares about the proportion of true variance within the total — conceptually:

\[ \text{Reliability} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{true}} + \sigma^2_{\text{error}}} = \frac{\sigma^2_{\text{true}}}{\sigma^2_{\text{observed}}} \]

In ordinary measurement you can compute \( \sigma^2_{\text{observed}} \), but not \( \sigma^2_{\text{true}} \) or \( \sigma^2_{\text{error}} \) separately — which is precisely why you need a repeated-measurement design to recover them.
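As a minimal sketch of how repetition recovers the two components, the code below applies a one-way random-effects ANOVA to made-up data in which four subjects are each measured twice. This estimator is one standard choice (it corresponds to the one-way ICC of Shrout and Fleiss) and is an assumption here, not necessarily the analysis used in the iTUG study.

```python
def one_way_components(scores):
    """scores[i] holds the k repeated measurements of subject i.
    Returns (var_true, var_error) from one-way ANOVA mean squares."""
    n = len(scores)       # number of subjects
    k = len(scores[0])    # repeats per subject
    grand_mean = sum(sum(row) for row in scores) / (n * k)
    subject_means = [sum(row) / k for row in scores]

    ss_between = k * sum((m - grand_mean) ** 2 for m in subject_means)
    ss_within = sum((x - m) ** 2
                    for row, m in zip(scores, subject_means) for x in row)

    ms_between = ss_between / (n - 1)
    ms_within = ss_within / (n * (k - 1))

    var_error = ms_within                                # within-subject noise
    var_true = max((ms_between - ms_within) / k, 0.0)    # between-subject signal
    return var_true, var_error

scores = [[10, 12], [14, 16], [20, 18], [24, 26]]  # made-up data
var_true, var_error = one_way_components(scores)
reliability = var_true / (var_true + var_error)
print(f"var_true = {var_true:.2f}, var_error = {var_error:.2f}")
print(f"reliability = {reliability:.3f}")
```

For these numbers the within-subject mean square is 2.0 and the between-subject mean square is 214/3, giving an estimated true variance of about 34.67 and a reliability of 104/110 ≈ 0.945. A single measurement per subject would have given only the total spread, with no way to split it.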

Now split the error itself. Error variance is not one thing — it has two flavours:

Systematic error — error that can be explained. It typically arises from repeating the measurement with a different rater, or at a different time. It is structured: it tracks with an identifiable factor (who measured, when).
Unexplained / residual error — error that cannot be explained. For example, uncontrollable environmental fluctuations. This is the leftover noise after the systematic part is accounted for.
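The split between the two flavours can be made concrete with a two-way layout of subjects by raters. The sketch below uses the usual mean-square estimators for a crossed design with one reading per cell; the data are made up, and the choice of a two-way random-effects model is an assumption for illustration.

```python
def two_way_components(scores):
    """scores[i][j] is the reading of subject i by rater j.
    Returns variance components for subject, rater, and residual."""
    n = len(scores)       # subjects
    k = len(scores[0])    # raters
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_subject = k * sum((m - grand) ** 2 for m in row_means)
    ss_rater = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_resid = ss_total - ss_subject - ss_rater

    ms_subject = ss_subject / (n - 1)
    ms_rater = ss_rater / (k - 1)
    ms_resid = ss_resid / ((n - 1) * (k - 1))

    return {
        "subject": max((ms_subject - ms_resid) / k, 0.0),  # true variance
        "rater": max((ms_rater - ms_resid) / n, 0.0),      # systematic error
        "residual": ms_resid,                              # unexplained error
    }

# Made-up data: rater B (second column) tends to read about 1 unit higher.
scores = [[10.0, 11.5], [12.0, 12.5], [14.0, 15.5], [16.0, 16.5]]
components = two_way_components(scores)
for name, value in components.items():
    print(f"{name:>8}: {value:.3f}")
```

Here the rater component (about 0.458) is the explainable part of the error, tied to who took the reading, while the residual (about 0.167) is what remains after the rater effect is accounted for.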

The pay-off is practical: if you can identify which step of the measurement process the repetition occurred in, you can point to the source of the explainable (systematic) error — and fix the instrument. That brings us to the four-step anatomy of measurement.

Where systematic error enters: the four measurement-process components

Any measurement process can be broken into four components. Systematic error can be introduced at any of them — and crucially, in some steps it can only enter through the human operator, so if the operator is not involved, that step contributes no systematic error at all.

Equipment and preparation — Equipment is everything needed to set up, run, and report the instrument. Preparation is everything needed to make the instrument ready: general preparation (the expertise or training staff need) and per-measurement preparation (readying the equipment, the environment, the storage, and positioning the patient).
Collecting raw data — everything the patient and staff do to capture data before any processing.
Data processing — all operations performed on the raw data (signals, images) to put it in a usable form (e.g. electronic) for later use.
Assignment of the score / value — computing or converting the processed data into the final score or value that is the result.

Here is how each component maps onto the iTUG study, and where variance enters:

Process component	iTUG implementation	Repeated by	Variance
Equipment & preparation	Equipment: inertial sensor system (DynaPort Hybrid, McRoberts), elastic belt, stopwatch, remote control, MoveTest software. Preparation: patient seated on a 43–46 cm chair without armrests, feet 43 cm apart, back against the chair, sensor strapped to the lower back, start signal from the rater, 3-metre walk marked by a cone	Two raters (A, B), twice in quick succession (inter-rater); repeated the same way across two days (intra-rater / test–retest)	True variance: the patient (e.g. body geometry in the ready position). Systematic error variance: the rater (e.g. sensor placement via the elastic belt, positioning the patient into the ready stance)
Collecting raw data	Patient rises from the chair, walks 3 m around the cone, returns and sits; sensor records acceleration and angular velocity in 3 directions at 100 samples/s plus the time per step; rater times with a stopwatch	Same two-rater, two-day repetition	True variance: the patient (reaction speed to the start signal, muscle fatigue on later trials, Parkinson fluctuation). Systematic error variance: the rater (reaction time in starting/stopping the stopwatch at each landmark)
Data processing	MoveTest software analyses the sensor data and computes time, angles, angular velocity, and movement events	— (no human input)	Does not vary as long as the rater is not involved in this step. Any error is unexplained (residual) variance from the conversion software
Assignment of score / value	Compute each phase time (sit-to-stand, walk, turn, sit), flexion–extension angles, angular velocity around the cone, and event segmentation from the signal to mark phase start/end	— (no human input)	Does not vary as long as the rater is not involved. Any error is unexplained (residual) variance from the computation software

The pattern is illuminating. In equipment/preparation and raw-data collection, the rater is involved, so those steps carry systematic (rater) error — and that is exactly where you would intervene to make the instrument more reliable (e.g. standardise belt placement, train the stopwatch technique). In data processing and scoring, the rater is not involved, so those steps carry no systematic rater error — only residual software error. Identifying which component the repetition lands in tells you both where the systematic error lives and what you can do about it.
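The same mapping can be written down as a small data structure, which is sometimes handy when auditing a protocol. The class and field names below are hypothetical, and the rule that rater involvement implies systematic rater error simply restates the reasoning of this section.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ProcessComponent:
    name: str
    rater_involved: bool

    @property
    def error_type(self):
        # Rule taken from the text: systematic rater error can only enter
        # a step in which the rater actually takes part.
        return "systematic (rater)" if self.rater_involved else "residual only"

ITUG_PROCESS = [
    ProcessComponent("Equipment & preparation", rater_involved=True),
    ProcessComponent("Collecting raw data", rater_involved=True),
    ProcessComponent("Data processing", rater_involved=False),
    ProcessComponent("Assignment of score / value", rater_involved=False),
]

intervention_targets = [c.name for c in ITUG_PROCESS if c.rater_involved]

for component in ITUG_PROCESS:
    print(f"{component.name}: {component.error_type}")
print("standardise or train here:", intervention_targets)
```

Filtering on `rater_involved` returns the two components where standardisation or training can reduce systematic error.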

Key takeaways

Reliability is the proportion of observed variance that is true variance: \( \sigma^2_{\text{true}} / \sigma^2_{\text{observed}} \). You cannot see true vs error variance directly — a repeated-measurement design is what recovers them.
Every reliability design rests on one assumption: the subject must be stable over the study window. If the subject changes, true change is mislabelled as error and reliability is biased downward.
The three designs isolate three different slices of error: inter-rater (between raters — swap the assessor), intra-rater (within a rater over time), and test–retest (between time points, after removing the rater difference).
Intra-rater carries the timing dilemma: the interval must be long enough to avoid recall bias yet short enough to keep the subject stable.
A single two-rater × two-day protocol — exactly the iTUG / van Lummel design — captures all three dimensions at once; test–retest comes out automatically.
Error variance splits into systematic (explainable: rater, time) and residual/unexplained.
Systematic error enters through the four measurement-process components — equipment & preparation, collecting raw data, data processing, assignment of score. Where the rater is involved (preparation, raw-data collection) you get systematic error you can fix; where the rater is absent (processing, scoring) you get only residual software error.

From Sensitivity to Kappa (5-part series): (1) Performance vs Agreement · (2) Agreement vs Reliability · (3) Reliability designs · (4) Categorical: kappa · (5) Continuous: ICC and agreement