MAPE in Clinical Prediction Models: What Mean Absolute Prediction Error Really Tells You
By Mayta

Introduction
MAPE (Mean Absolute Prediction Error) is often introduced as “average |prediction − truth|”, but in clinical prediction models (CPMs) it deserves a bit more respect—because what counts as “truth” depends on the outcome type, and because MAPE can look “good” even when a model is clinically misleading.
Below is the deep, CPM-focused way to think about it.
What MAPE really measures (conceptually)
For a binary outcome (event vs no event):
p̂ᵢ is the model’s predicted probability for patient i
yᵢ is the observed outcome (0 or 1)
The absolute prediction error for patient i is:
|p̂ᵢ − yᵢ|
Then, MAPE is the mean of those absolute errors across all patients.
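To make the computation concrete, here is a minimal Python sketch; the arrays p_hat and y are hypothetical predicted risks and observed 0/1 outcomes, not output from any real model.

```python
# Minimal sketch: MAPE for a binary-outcome CPM.
# p_hat = predicted probabilities, y = observed outcomes (0/1); values are made up.
import numpy as np

p_hat = np.array([0.10, 0.35, 0.80, 0.05])  # hypothetical predicted risks
y     = np.array([0,    1,    1,    0])     # observed outcomes

mape = np.mean(np.abs(p_hat - y))  # mean of |p_hat_i - y_i| across patients
print(f"MAPE = {mape:.3f}")        # 0.250 for these toy numbers
```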
Intuition at the bedside
MAPE answers:
“On average, how far are my predicted risks from what actually happened (0/1) for individual patients?”
So MAPE is an individual-level probability error metric—not a ranking metric (that’s AUROC), and not a clinical-utility metric (that’s DCA). It lives in the same “numerical accuracy” family as calibration, but it summarizes error differently.
What MAPE looks like for events vs non-events
Because yᵢ is either 0 or 1:
If the patient had no event (yᵢ = 0)
Error = |p̂ᵢ − 0| = p̂ᵢ, so the model is “penalized” for the risk it assigned to someone who did not have the event.
If the patient had an event (yᵢ = 1)
Error = |p̂ᵢ − 1| = 1 − p̂ᵢ, so the model is “penalized” for underpredicting risk in someone who did have the event.
That means MAPE is basically averaging two clinically familiar failures:
“How much unnecessary risk did we assign to those who stayed well?”
“How much did we miss the patients who deteriorated?”
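A short sketch of that split, reusing the same hypothetical p_hat and y arrays as above:

```python
# Decompose MAPE into the two clinically familiar failures described above.
import numpy as np

p_hat = np.array([0.10, 0.35, 0.80, 0.05])  # hypothetical predicted risks
y     = np.array([0,    1,    1,    0])     # observed outcomes (0/1)

err_non_events = p_hat[y == 0]      # "unnecessary" risk assigned to those who stayed well
err_events     = 1 - p_hat[y == 1]  # risk missed in those who had the event

print("mean error, no event:", err_non_events.mean())  # 0.075
print("mean error, event   :", err_events.mean())      # 0.425
# Overall MAPE is the event-rate-weighted average of these two means (0.25 here).
```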
Why MAPE is not the same as calibration (and why you need both)
Calibration asks a group-level question
“Among patients predicted at 20%, do ~20% actually have the event?”
That’s about risk accuracy in aggregates (calibration-in-the-large, slope, calibration curve).
MAPE asks an individual-level question
“How far off were the predictions for each patient (in absolute probability terms)?”
A model can be:
Well-calibrated but high MAPE (good on average; individuals still noisy)
Low MAPE but poorly calibrated in key regions (overall average looks fine; dangerous at decision thresholds)
This is why CECS teaching keeps calibration plots/parameters as “core outputs”, then adds stability/error summaries as an extra layer—not a replacement.
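A toy simulation (assumed 30% event rate, simulated outcomes) illustrates the first case: predicting the overall prevalence for every patient is perfectly calibrated-in-the-large, yet the individual-level error stays large.

```python
# Hypothetical illustration: well-calibrated on average, but high MAPE.
import numpy as np

rng = np.random.default_rng(0)
y = rng.binomial(1, 0.30, size=10_000)   # simulated outcomes, ~30% event rate
p_flat = np.full(y.shape, y.mean())      # everyone predicted at the prevalence

print("mean(p_hat) - mean(y):", p_flat.mean() - y.mean())  # 0 by construction
print("MAPE:", np.mean(np.abs(p_flat - y)))                # ~0.42, large at the individual level
```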
The biggest trap: MAPE depends strongly on event rate (baseline risk)
MAPE can look artificially “good” in rare outcomes.
Example logic (no numbers needed):
If the event is rare, a model that predicts low risk for everyone will have small errors for the many non-events.
It will still be clinically useless for identifying events, but the average absolute error may not look terrible.
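Made-up numbers show the same thing: below, a simulated 2% event rate and a model that assigns everyone the same low risk yield a small MAPE even though the predictions cannot separate events from non-events at all.

```python
# Hypothetical rare-outcome example: tiny MAPE, zero discrimination.
import numpy as np

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.02, size=50_000)   # rare outcome (~2% event rate), simulated
p_useless = np.full(y.shape, 0.02)       # same low predicted risk for every patient

print("MAPE:", np.mean(np.abs(p_useless - y)))  # ~0.04, looks "good"
# Every patient gets the same prediction, so ranking is uninformative (AUROC = 0.5).
```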
So you should interpret MAPE alongside:
AUROC (ranking)
Calibration (risk scale correctness)
DCA (clinical net benefit)
and preferably internal validation/optimism correction so you’re not reporting an overfit MAPE from the development sample.
This is the same “don’t trust apparent performance” logic you already wrote about—just applied to MAPE.
Q: Is “MAPE < 0.02” a universal target?
A: Not really.
A threshold like “< 0.02” is sometimes used as a heuristic in teaching because it feels interpretable (“average probability error < 2%”), but its reasonableness depends on:
event rate (rare vs common outcomes)
the risk range you care about clinically (0–5% vs 10–30%)
whether predictions are optimism-corrected or externally validated
the clinical action threshold (e.g., a 2% error is huge if you treat at 3%, but small if you treat at 25%)
So a better way to report MAPE in a TRIPOD-ish mindset is:
MAPE overall
MAPE in the clinically relevant risk band (e.g., within ±5% of your action threshold), because that’s where decisions flip
That connects MAPE back to clinical stakes, consistent with “metrics must match the clinical question.”
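One way to operationalize that is a small helper like the sketch below; the 15% action threshold and the ±5% band are illustrative assumptions, not recommendations.

```python
# Band-restricted MAPE around a (hypothetical) clinical action threshold.
import numpy as np

def mape_in_band(p_hat, y, threshold=0.15, half_width=0.05):
    """MAPE among patients whose predicted risk lies within +/- half_width of the threshold."""
    p_hat, y = np.asarray(p_hat, dtype=float), np.asarray(y)
    in_band = np.abs(p_hat - threshold) <= half_width   # e.g., predicted risks of 10-20%
    if not in_band.any():
        return np.nan                                    # no patients near the threshold
    return np.mean(np.abs(p_hat[in_band] - y[in_band]))
```

Reported next to the overall MAPE, this makes explicit whether the average error comes from the risk region where decisions actually change.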
How MAPE relates to other “probability accuracy” scores
MAPE is absolute-error based, which means:
it’s robust to outliers compared with squared error
it treats a 0.10 error the same, whether it’s from 0.05→0.15 or 0.45→0.55 (sometimes desirable, sometimes not)
Two common alternatives (often reported in CPMs) are:
Brier score (mean squared error of probabilities): penalizes large mistakes more heavily
Log loss / deviance: heavily penalizes being extremely confident and wrong
MAPE is nice because it’s directly interpretable in probability units (“average absolute % error”), which is why it fits naturally into your “stability” section.
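For a side-by-side feel, here are the three scores on the same hypothetical predictions used earlier (log loss clipped to avoid log(0)):

```python
# MAPE vs Brier score vs log loss on the same toy predictions.
import numpy as np

p_hat = np.array([0.10, 0.35, 0.80, 0.05])
y     = np.array([0,    1,    1,    0])

mape  = np.mean(np.abs(p_hat - y))        # absolute error, in probability units
brier = np.mean((p_hat - y) ** 2)         # squared error: penalizes big misses more
p_c   = np.clip(p_hat, 1e-15, 1 - 1e-15)  # avoid log(0)
logloss = -np.mean(y * np.log(p_c) + (1 - y) * np.log(1 - p_c))  # punishes confident-and-wrong

print(f"MAPE = {mape:.3f}, Brier = {brier:.3f}, log loss = {logloss:.3f}")
```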
How to use MAPE properly in your stability section
If you’re positioning MAPE as “stability / reliability,” the reviewer-proof framing is:
Report apparent MAPE in development
Report optimism-corrected MAPE via bootstrap internal validation
Report external-validation MAPE if you have it
If possible, add MAPE stratified by risk bands (low / intermediate / high), because instability is often worse at extremes
This aligns with the CECS roadmap: internal validation → external validation → recalibration if needed, and always emphasize honest (not apparent) performance.
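A minimal sketch of the optimism-corrected step, assuming a logistic regression model and the usual bootstrap-optimism recipe; the model choice, 200 replicates, and the function names are illustrative, not a fixed protocol.

```python
# Bootstrap optimism-corrected MAPE (sketch). X is assumed to be a NumPy array of
# predictors and y the 0/1 outcome vector.
import numpy as np
from sklearn.linear_model import LogisticRegression

def mape(p_hat, y):
    """Mean absolute prediction error for a binary outcome."""
    return np.mean(np.abs(p_hat - y))

def optimism_corrected_mape(X, y, n_boot=200, seed=0):
    """Apparent MAPE in the development data plus the mean bootstrap optimism."""
    rng = np.random.default_rng(seed)
    model = LogisticRegression(max_iter=1000).fit(X, y)
    apparent = mape(model.predict_proba(X)[:, 1], y)

    optimism = []
    n = len(y)
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)             # resample patients with replacement
        Xb, yb = X[idx], y[idx]
        if yb.min() == yb.max():                     # skip resamples containing one class only
            continue
        m_b = LogisticRegression(max_iter=1000).fit(Xb, yb)
        boot = mape(m_b.predict_proba(Xb)[:, 1], yb) # bootstrap model on its own sample
        test = mape(m_b.predict_proba(X)[:, 1], y)   # same model on the original sample
        optimism.append(test - boot)                 # usually > 0: apparent performance is too rosy

    return apparent + np.mean(optimism)              # honest (optimism-corrected) MAPE
```

The same resampling loop can also return MAPE stratified by risk band, which covers the last item in the list above.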
Key takeaways
MAPE measures average absolute probability error at the individual level (not ranking, not utility).
In binary outcomes, errors are p̂ᵢ for non-events and (1 − p̂ᵢ) for events—so it captures both “false alarm risk” and “missed risk.”
MAPE can look deceptively good in rare outcomes; interpret it with AUROC, calibration, and DCA.
A single “target” (e.g., < 0.02) is context-dependent; anchor interpretation to clinical threshold ranges.