Why Sensitivity Is Not Kappa: Performance vs Agreement

Abstract

Large language models are increasingly used to screen abstracts for systematic reviews, raising questions about how best to evaluate their accuracy. Clinicians and clinical researchers must strictly differentiate between performance measured against a reference standard and agreement between independent raters. Because the prevalence of relevant studies is extremely low, relying on raw agreement introduces a dangerous prevalence trap. For instance, an artificial intelligence that blindly excludes all one thousand abstracts in a pool containing only twenty true inclusions achieves 98% raw agreement with a careful human reviewer while returning 0% recall. Since missing a critical study is an unacceptable error, screening tools must be evaluated primarily on performance metrics like recall, where collaborative models currently reach approximately 99%. While inter-rater agreement statistics serve as secondary reliability measures, researchers should prioritize Gwet's AC1 over Cohen's kappa to avoid paradoxically low scores from rare events. This article explains the fundamental distinction between performance and agreement metrics, demonstrating why recall must take precedence when evaluating automated screening tools.

Introduction

Imagine you are presenting a study at a clinical-epidemiology seminar. You have built a pipeline where a large language model (LLM) screens abstracts for a systematic review, side by side with a human reviewer. Both of them do the same visible job: they look at each abstract and stamp it Include or Exclude. So a sharp question lands in the room — the question a real clinical epidemiologist will ask you:

If a human and an AI both just label studies the same way, why not simply report inter-rater agreement (a kappa) between them and be done with it?

It feels almost obvious. Two raters, the same labels — surely the natural summary is "how often do they agree?" This article is the careful answer to why that intuition, though reasonable, is answering the wrong question for screening. We will separate two ideas that look identical on the surface but are conceptually different: performance (how good a rater is, measured against the truth) and agreement / reliability (how much two raters resemble each other, with no truth involved). Getting this distinction right is the whole foundation of the five-part series, so we will build it slowly and with worked numbers.

This matters in practice because the landmark LLM-screening papers — Parmar and colleagues' collaborative-LLM work, and Laignelot's parallel evaluation — deliberately report performance (sensitivity/recall, specificity, precision), not a kappa. By the end you will be able to explain, to an examiner's satisfaction, exactly why.


Figure. Agreement and reliability (how alike two raters are, with no external truth) versus performance (how a rater compares to a reference standard): the conceptual bridge for the whole series.

The opening puzzle: two questions hiding behind one task

Here is the trap. The task — labelling studies Include/Exclude — is identical whether you are measuring performance or agreement. But the question you ask about that task is what determines which statistic is correct. Parmar and Laignelot are not primarily asking an agreement question at all. They are asking a performance question, and that changes everything.

Let us make the two questions explicit.

Question A — the performance question. Use sensitivity / PPA (positive percent agreement) when you want to know:

"Can the AI replace one human reviewer?"

Concretely: suppose the original human reviewer included 100 papers. The AI includes 95 of those truly-includable papers.

\[ \text{Recall} = \frac{95}{100} = 95\% \]

That is a statement about performance: how well the AI recovers the studies it should recover. It is measured against a reference standard (here, the 100 papers the original human reviewer included, which we treat as the truly includable studies). It is not a statement about agreement.
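The arithmetic of Question A can be sketched in a few lines of Python. This is a minimal illustration, not code from any of the cited papers: the function name `recall` and the study identifiers are hypothetical, and the reference standard is assumed to be a plain set of includable study IDs.

```python
def recall(reference_includes: set[str], rater_includes: set[str]) -> float:
    """Fraction of the reference-standard includes that the rater also included."""
    if not reference_includes:
        raise ValueError("recall is undefined when the reference has no includes")
    recovered = reference_includes & rater_includes
    return len(recovered) / len(reference_includes)


# Hypothetical study IDs: the reference standard holds 100 includes,
# and the AI recovers 95 of them (study_1 .. study_95).
reference = {f"study_{i}" for i in range(1, 101)}
ai = {f"study_{i}" for i in range(1, 96)}

print(f"Recall = {recall(reference, ai):.0%}")  # Recall = 95%
```

Note that the function takes the reference set as its first argument and cannot be computed without it; that dependence on a truth is what makes recall a performance measure.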

Question B — the agreement question. Use inter-rater statistics (Cohen's kappa, Fleiss' kappa, Gwet's AC1) when you want to know:

"How often do the AI and the human reach the same decision?"

Here you build a cross-tabulation of the two raters' raw labels, for example:

Study	Human	AI
1	Include	Include
2	Include	Exclude
3	Exclude	Exclude

…and then you compute a kappa from how often the cells line up. Notice that nothing in Question B refers to a truth. There is no reference standard — only two opinions being compared to each other.
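To make Question B concrete, here is a small Python sketch that builds the cross-tabulation for the three studies in the table and computes observed agreement and Cohen's kappa from it. The helper names (`cross_tab`, `observed_agreement`, `cohen_kappa`) are illustrative assumptions rather than an established API, and three studies are far too few for a meaningful estimate; the point is only that no truth column appears anywhere in the calculation.

```python
from collections import Counter

# The two raters' raw labels for the three studies in the table above.
human = ["Include", "Include", "Exclude"]
ai = ["Include", "Exclude", "Exclude"]


def cross_tab(rater_a, rater_b):
    """Count each (rater_a label, rater_b label) pair."""
    return Counter(zip(rater_a, rater_b))


def observed_agreement(rater_a, rater_b):
    """Proportion of items on which the two raters give the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa: (p_o - p_e) / (1 - p_e), with p_e from the raters' marginals."""
    n = len(rater_a)
    p_o = observed_agreement(rater_a, rater_b)
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum((freq_a[label] / n) * (freq_b[label] / n) for label in labels)
    if p_e == 1:
        return float("nan")  # both raters used a single identical label
    return (p_o - p_e) / (1 - p_e)


print(cross_tab(human, ai))
print(f"Observed agreement: {observed_agreement(human, ai):.2f}")  # 0.67
print(f"Cohen's kappa:      {cohen_kappa(human, ai):.2f}")         # 0.40
```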

A vocabulary table you must internalise

Half the confusion in this area is vocabulary: the same idea wears many names across diagnostic-testing, machine-learning, and clinimetrics traditions. Pin them down once.

Term	Aliases / synonyms	What it measures	Needs a reference (truth)?	Category
Sensitivity	Recall, PPA (positive percent agreement), true-positive rate	Of the studies that should be included, what fraction did the rater include?	Yes	Performance
Specificity	NPA (negative percent agreement), true-negative rate	Of the studies that should be excluded, what fraction did the rater exclude?	Yes	Performance
Precision	PPV (positive predictive value)	Of the studies the rater called include, what fraction truly should be included?	Yes	Performance
NPV	Negative predictive value	Of the studies the rater called exclude, what fraction truly should be excluded?	Yes	Performance
Cohen's kappa	κ (two raters)	Agreement between two raters, corrected for chance	No	Agreement / reliability
Fleiss' kappa	(≥ 3 raters)	Chance-corrected agreement among many raters	No	Agreement / reliability
Gwet's AC1	AC1	Chance-corrected agreement, robust to rare categories	No	Agreement / reliability
ICC	Intraclass correlation coefficient	Reliability of continuous/ordinal ratings	No	Reliability / agreement

Read the last two columns together. Everything in the performance family is measured against a truth. Everything in the agreement/reliability family compares raters to each other with no truth at all. That single column, "Needs a reference?", is the cleanest way to keep the two worlds apart.
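The four performance rows of the table all come from the same 2×2 confusion matrix of rater decisions against the reference. The sketch below shows the definitions side by side; the function names and the example counts are hypothetical and chosen only for illustration.

```python
import math


def safe_ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else math.nan


def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Performance metrics of one rater against a reference standard.

    tp: reference include, rater include    fn: reference include, rater exclude
    fp: reference exclude, rater include    tn: reference exclude, rater exclude
    """
    return {
        "sensitivity (recall, PPA)": safe_ratio(tp, tp + fn),
        "specificity (NPA)": safe_ratio(tn, tn + fp),
        "precision (PPV)": safe_ratio(tp, tp + fp),
        "NPV": safe_ratio(tn, tn + fn),
    }


# Hypothetical counts: 1,000 abstracts, 20 of them includable per the reference.
for name, value in screening_metrics(tp=18, fp=40, fn=2, tn=940).items():
    print(f"{name}: {value:.1%}")
```

Every denominator here is defined by the reference (rows) or checked against it (columns), which is why none of these numbers can be computed from two raters' labels alone.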

The prevalence trap: a fully worked example

Now we can show why an agreement statistic can be dangerously misleading for screening. The villain is low prevalence — the fact that, in a real systematic review, the vast majority of abstracts are exclusions and only a tiny minority are true includes.

Work through this step by step.

Set up the screening pool. You have 1,000 abstracts. The truth is that only 20 of them are truly includable; the other 980 should be excluded. This is realistic: include-rates of 1–3% are typical at title/abstract screening.
Introduce a lazy AI. Picture an AI with a degenerate strategy: it labels everything "Exclude". It never includes anything.
Score the human against the AI (the agreement view). The human, doing the job properly, includes the 20 true includes and excludes the 980 that should be excluded. The AI excludes all 1,000. So the two agree on every one of the 980 true exclusions and disagree only on the 20 true includes. Raw observed agreement is:

\[ \text{Agreement} = \frac{980}{1000} = 98\% \]

A 98% agreement looks spectacular. You could put it on a slide and feel proud.

Now score the AI against the truth (the performance view). How many of the 20 truly-includable studies did the AI include? Zero. Therefore:

\[ \text{Recall} = \frac{0}{20} = 0\% \]

The AI caught none of the studies that mattered. Its 98% "agreement" is an illusion manufactured entirely by the 980 easy exclusions.

Metric	Value	What it hides / reveals
Observed agreement (Human vs AI)	98%	Inflated by the 980 easy true-exclusions
Recall / sensitivity (AI vs truth)	0%	The AI missed every includable study

A statistic that reads 98% agreement while the recall is 0% is not measuring competence — it is measuring how rare the positives are. High agreement does not mean "good".
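The whole worked example fits in a short Python sketch. It assumes, as the example does, that the human's labels coincide with the truth; the variable names are illustrative.

```python
# 1,000 abstracts: 20 truly includable, 980 truly excludable.
truth = ["Include"] * 20 + ["Exclude"] * 980
human = list(truth)            # assumption: the human reviewer matches the truth
lazy_ai = ["Exclude"] * 1000   # degenerate strategy: exclude everything

# Agreement view: compare the two raters to each other.
agreement = sum(h == a for h, a in zip(human, lazy_ai)) / len(human)

# Performance view: compare the AI to the truth, on the true includes only.
ai_on_true_includes = [a for t, a in zip(truth, lazy_ai) if t == "Include"]
recall = sum(a == "Include" for a in ai_on_true_includes) / len(ai_on_true_includes)

print(f"Observed agreement (human vs AI): {agreement:.0%}")  # 98%
print(f"Recall (AI vs truth):             {recall:.0%}")     # 0%
```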

Why screening is recall-first

The prevalence trap explains the cultural rule of systematic-review screening: recall comes first. The two possible errors are not symmetric.

A false exclude (you drop a study that should have been included) is potentially fatal to the review: a missed trial can change the pooled effect, and it is essentially invisible — you never learn what you threw away.
A false include (you wave through a study that should have been excluded) is merely annoying: it gets caught and removed at full-text screening, costing a little time.

Hence the maxim:

Missing a study is the unacceptable error.

Because the costs of the two errors are so lopsided, we tune and judge screening tools on their ability to not miss — that is, on recall (sensitivity) — before we worry about anything else. An agreement statistic does not encode this asymmetry at all: a kappa treats a missed include and a wrongly-kept exclude as the same kind of disagreement. That is the second reason agreement is the wrong headline number for screening.
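A small sketch makes the asymmetry visible. The two AIs below are hypothetical: each disagrees with the reference on exactly 10 abstracts out of 1,000, so their raw agreement is identical, but one makes only false excludes and the other only false includes.

```python
def agreement_and_recall(truth, predicted):
    """Return (raw agreement with the reference, recall on the true includes)."""
    agreement = sum(t == p for t, p in zip(truth, predicted)) / len(truth)
    on_includes = [p for t, p in zip(truth, predicted) if t == "Include"]
    recall = sum(p == "Include" for p in on_includes) / len(on_includes)
    return agreement, recall


# Reference: the first 20 abstracts are includable, the remaining 980 are not.
truth = ["Include"] * 20 + ["Exclude"] * 980

# Hypothetical AI A: misses 10 of the 20 includes, makes no false includes.
ai_false_excludes = ["Include"] * 10 + ["Exclude"] * 990
# Hypothetical AI B: finds all 20 includes, wrongly includes 10 exclusions.
ai_false_includes = ["Include"] * 30 + ["Exclude"] * 970

for name, labels in [("A (10 false excludes)", ai_false_excludes),
                     ("B (10 false includes)", ai_false_includes)]:
    agreement, recall = agreement_and_recall(truth, labels)
    print(f"AI {name}: agreement {agreement:.0%}, recall {recall:.0%}")
# AI A (10 false excludes): agreement 99%, recall 50%
# AI B (10 false includes): agreement 99%, recall 100%
```

Both AIs score 99% raw agreement, yet A has lost half the evidence base while B has only added ten abstracts to the full-text pile.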

This is also why the LLM-screening literature reports the numbers it does. Individual models already achieve high recall-for-inclusion: in Parmar and colleagues, GPT-4 reached 95.5%, Claude-3 Sonnet 96.6%, and Gemini 85.7%. When models are combined in a collaborative ensemble, recall rises further — up to roughly 99% — while precision-for-exclusion sits around 99.7%. Every one of those is a performance number, measured against a human-defined reference of true includes and excludes. None of them is a kappa, and that is deliberate.

Where inter-rater agreement legitimately fits

None of this means agreement statistics are useless — only that they answer a different, secondary question. There are perfectly good places for them, and a thoughtful reviewer of the Parmar paper might even ask for them. Consider these legitimate agreement questions:

AI vs AI. "What is the agreement between GPT-4 and Claude?" Here there is genuinely no reference standard you privilege — you simply want to know how interchangeable two models are. A kappa or AC1 is exactly right.
AI vs human, as reliability rather than performance. "What is the agreement between the AI reviewers and the human reviewers?" If your scientific interest is reliability — could the AI stand in for a second human reviewer in a dual-screening workflow? — then an inter-rater statistic is appropriate, as a secondary analysis alongside (not instead of) recall.

These are valuable secondary analyses. But here the kappa paradox returns to bite us. Under the same low prevalence that created the prevalence trap, Cohen's kappa can collapse to a low value even when raw agreement is very high. When one category dominates, the agreement expected by chance is itself close to 100%, so kappa has very little room above chance and a handful of disagreements on the rare category drags it down. You can have 98% raw agreement and a kappa that looks embarrassingly poor, the mirror image of the inflation problem. This is the well-documented behaviour Gwet set out to fix.
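The paradox can be reproduced with a hypothetical 2×2 table. The sketch below computes Cohen's kappa and Gwet's AC1 for two raters and two categories, using the standard chance-agreement terms (product of marginals for kappa; 2π(1 − π) with π the mean include proportion for AC1). The function name and the example counts are assumptions made for illustration, not figures from Parmar or Laignelot.

```python
import math


def kappa_and_ac1(both_include: int, a_only: int, b_only: int, both_exclude: int):
    """Return (raw agreement, Cohen's kappa, Gwet's AC1) for a 2x2 table.

    a_only: rater A includes, rater B excludes; b_only: the reverse.
    """
    n = both_include + a_only + b_only + both_exclude
    if n == 0:
        raise ValueError("empty table")
    p_o = (both_include + both_exclude) / n
    p_a = (both_include + a_only) / n   # rater A's include proportion
    p_b = (both_include + b_only) / n   # rater B's include proportion

    # Cohen: chance agreement from the product of each rater's marginals.
    pe_kappa = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (p_o - pe_kappa) / (1 - pe_kappa) if pe_kappa < 1 else math.nan

    # Gwet: chance agreement from the mean include proportion pi.
    pi = (p_a + p_b) / 2
    pe_ac1 = 2 * pi * (1 - pi)          # at most 0.5, so no division by zero
    ac1 = (p_o - pe_ac1) / (1 - pe_ac1)
    return p_o, kappa, ac1


# Hypothetical table: 1,000 abstracts, each rater includes 20, they share 10.
p_o, kappa, ac1 = kappa_and_ac1(both_include=10, a_only=10, b_only=10, both_exclude=970)
print(f"raw agreement {p_o:.0%}, kappa {kappa:.2f}, AC1 {ac1:.2f}")
# raw agreement 98%, kappa 0.49, AC1 0.98
```

One caution when reading these numbers: feeding the exclude-everything rater from the prevalence-trap example into the same function (`kappa_and_ac1(0, 20, 0, 980)`) gives a kappa of about 0 but an AC1 of about 0.98. AC1 repairs kappa's behaviour under rare categories; it is still an agreement statistic and says nothing about recall.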

So the honest position is layered:
- Primary question = performance → report recall/sensitivity (PPA) and specificity (NPA), against the reference. This is what Parmar and Laignelot do.
- Secondary question = reliability → report agreement, but prefer Gwet's AC1 over Cohen's kappa under low prevalence to avoid the paradox.

How to answer an examiner

Put it all together into one fluent paragraph you can deliver under questioning:

"Parmar and Laignelot are interested in the performance of the LLM as a reviewer replacement, so they report PPA/NPA — that is, recall and specificity — against a human-defined reference standard. Reporting a kappa instead would answer a different question and, under the very low include-prevalence of screening, would be distorted: raw agreement is inflated by the dominant 'Exclude' category, while Cohen's kappa can paradoxically collapse. If the question were instead about reliability — between the AI and the human, or between two AI models — then an inter-rater analysis such as Cohen's kappa or, better under rare events, Gwet's AC1 would be the appropriate secondary analysis."

That answer earns the nod because it does the one thing that matters: it cleanly separates performance from agreement, names the prevalence trap and the kappa paradox, and places each statistic on the right question.

Key takeaways

Performance (sensitivity/recall = PPA, specificity = NPA, precision/PPV, NPV) is measured against a reference standard; agreement / reliability (Cohen's & Fleiss' kappa, Gwet's AC1, ICC) compares raters to each other with no truth. The "needs a reference?" column is the dividing line.
The prevalence trap: with 1,000 abstracts and only 20 true includes, an AI that excludes everything scores 98% agreement but 0% recall — high agreement is not competence.
Screening is recall-first because missing a study is the unacceptable error; a false exclude is invisible and potentially fatal, a false include is caught later.
LLM-screening papers report performance, not kappa: GPT-4 recall 95.5%, Claude-3 Sonnet 96.6%, Gemini 85.7%, collaborative recall up to ~99%, precision-for-exclusion ~99.7%.
Inter-rater agreement is a legitimate secondary analysis (AI vs AI, AI vs human as reliability); under low prevalence prefer Gwet's AC1 to avoid the kappa paradox.
The examiner-grade answer separates performance from agreement and assigns each statistic to its proper question.

References

de Vet HCW, Terwee CB, Bouter LM. Current challenges in clinimetrics. J Clin Epidemiol. 2003;56:1137–41.
Mokkink LB, Terwee CB, Patrick DL, et al. The COSMIN checklist. Qual Life Res. 2010;19:539–49.
Cohen J. A coefficient of agreement for nominal scales. Educ Psychol Meas. 1960;20:37–46.
Landis JR, Koch GG. The measurement of observer agreement for categorical data. Biometrics. 1977;33:159–74.
Shrout PE, Fleiss JL. Intraclass correlations. Psychol Bull. 1979;86:420–28.
McGraw KO, Wong SP. Forming inferences about some intraclass correlation coefficients. Psychol Methods. 1996;1:30–46.
Koo TK, Li MY. A guideline of selecting and reporting ICC. J Chiropr Med. 2016;15:155–63.
Bland JM, Altman DG. Statistical methods for assessing agreement. Lancet. 1986;1:307–10.
Gwet KL. Computing inter-rater reliability in the presence of high agreement. Br J Math Stat Psychol. 2008;61:29–48.
Parmar M, Naqvi SAA, et al. Collaborative large language models for screening in systematic reviews. medRxiv. 2026.

From Sensitivity to Kappa (5-part series): (1) Performance vs Agreement [01_performance_vs_agreement] · (2) Agreement vs Reliability [02_agreement_vs_reliability] · (3) Reliability designs [03_reliability_designs] · (4) Categorical — kappa [04_categorical_kappa] · (5) Continuous — ICC & agreement [05_continuous_icc_agreement]