
Calibration Plot in Clinical Prediction Models [Calibration-in-the-Large (CITL), Calibration Slope]

  • Writer: Mayta

Abstract

Calibration is a fundamental property of clinical prediction models (CPMs), reflecting how well predicted probabilities agree with actual observed outcomes. Unlike discrimination—how well a model distinguishes between individuals with and without an event—calibration evaluates absolute accuracy. Poor calibration can mislead clinical decision-making even when discrimination appears acceptable. This article explains the conceptual foundation, metrics, and practical interpretation of calibration, including calibration-in-the-large, calibration slope, calibration plots, and recalibration strategies.

1. Introduction

Clinical prediction models are increasingly used to estimate individual risks of outcomes such as mortality, sepsis, stroke, or readmission. For a CPM to be clinically trustworthy, two performance domains must be demonstrated:

  1. Discrimination – Can the model distinguish high-risk from low-risk patients?

  2. Calibration – Are the predicted probabilities numerically correct?

Discrimination often receives more attention, but calibration is equally essential because even a highly discriminative model can give incorrect absolute risks that misguide treatment thresholds, triage, and counselling.

2. What is Calibration?

Calibration refers to the agreement between predicted probabilities and observed outcome rates.

  • If a model predicts 10% mortality for a group of patients, then roughly 10% should die.

  • Perfect calibration means that, at every level of predicted risk, the observed event rate equals the predicted probability: P(Y = 1 | p̂) = p̂.

Calibration answers the question “Are the numbers correct?”, not just “Is the ranking correct?”
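
As a quick numeric check of this definition, here is a minimal Stata sketch on simulated data (all variable names are illustrative): every patient is given a 10% predicted risk, and we compare the observed event rate under two possible true risks.

clear
set obs 10000
set seed 42
gen outcome = runiform() < 0.10   // true risk matches the 10% prediction
summarize outcome                 // mean near 0.10: well calibrated
replace outcome = runiform() < 0.20
summarize outcome                 // mean near 0.20: a 10% prediction underpredicts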

3. Two Fundamental Calibration Metrics

3.1 Calibration-in-the-Large (CITL)

CITL assesses whether predictions are systematically too high or too low on average.

This is estimated as the intercept α of a logistic regression in which the model's linear predictor L̂ (the predicted log-odds) is entered as a fixed offset:

logit P(Y = 1) = α + L̂, with the coefficient of L̂ fixed at 1

  • Ideal: CITL = 0

  • CITL < 0 → model overpredicts risk

  • CITL > 0 → model underpredicts risk

CITL reflects overall bias. It does not describe how predictions behave across the entire risk spectrum—that is the role of the slope.

3.2 Calibration Slope

The calibration slope β quantifies whether predictions are too extreme or too moderate.

Estimated from a logistic regression of the outcome on the model's linear predictor:

logit P(Y = 1) = α + β · L̂

  • Slope = 1 → perfect spread

  • Slope < 1 → predictions too extreme (model overfitting)

  • Slope > 1 → predictions too modest (model underfitting)

Key insight:

  • CITL = mean shift

  • Slope = spread distortion

These two metrics diagnose different forms of miscalibration.
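
To make the distinction concrete, here is a minimal Stata sketch on simulated data (all names illustrative) in which the model's log-odds are deliberately too extreme, so the slope should fall below 1 and the CITL intercept should move away from 0:

clear
set obs 5000
set seed 123
gen true_lp = -2 + rnormal()                  // true log-odds of the event
gen outcome = runiform() < invlogit(true_lp)  // observed binary outcome
gen model_lp = 1.5*true_lp                    // overconfident model predictions
* Calibration slope: coefficient on model_lp (expect about 1/1.5 = 0.67)
logit outcome model_lp
* CITL: intercept with the predicted log-odds held as a fixed offset
* (expect > 0 here, because the extreme log-odds understate risk on average)
logit outcome, offset(model_lp)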

4. Visual Assessment: The Calibration Plot

Calibration plots display observed vs predicted risks, commonly across deciles of predicted probability. The ideal line is a 45° diagonal.

A typical plot shows:

  • Vertical shift → CITL problem

  • Flattened or steep curve → slope problem

  • Nonlinear deviations → model misspecification (e.g., missing interactions, wrong functional form)

Modern tools also overlay smoothed (LOESS) curves to visualize agreement across the full continuous range of predicted risk.
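
A decile-based plot can also be assembled by hand; the following minimal Stata sketch (simulated data, illustrative names) groups patients into ten bands of predicted risk and plots observed against expected rates:

clear
set obs 5000
set seed 123
gen true_lp = -2 + rnormal()
gen outcome = runiform() < invlogit(true_lp)
gen pred = invlogit(1.5*true_lp)              // miscalibrated predicted risks
xtile decile = pred, nq(10)                   // ten groups of predicted risk
collapse (mean) observed=outcome expected=pred, by(decile)
twoway (scatter observed expected) (function y = x, range(0 1)), ///
    xtitle("Mean predicted risk") ytitle("Observed event rate") legend(off)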

5. Why Does Calibration Fail?

  1. Overfitting in model development

  2. Case-mix differences between development and validation settings

  3. Changes in disease prevalence or treatment patterns

  4. Incorrect functional form (e.g., missing nonlinear terms)

  5. Measurement differences (e.g., lab assay changes, coding differences)

  6. Small sample size

Thus, calibration must be checked not only during development but whenever a model is applied to a new population.

6. Recalibration Approaches

If a model is miscalibrated but discriminatory ability is preserved, recalibration can correct it.

6.1 Intercept-Only Recalibration

Re-estimate only the intercept, keeping every model coefficient fixed (slope held at 1). This shifts all predictions up or down on the log-odds scale and corrects CITL; it is often enough when the new setting differs mainly in baseline risk.

6.2 Intercept + Slope Recalibration

Re-estimate both the intercept and a single slope on the linear predictor (logistic recalibration). This corrects both the average level and the spread of predictions without altering how patients are ranked.
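
A minimal Stata sketch of both approaches, assuming the data already contain the observed binary outcome (outcome) and the original model's linear predictor (xb); both names are illustrative:

* 6.1 Intercept-only: slope fixed at 1 by using the offset
logit outcome, offset(xb)
gen lp1 = _b[_cons] + xb              // shifted linear predictor
* 6.2 Intercept + slope (logistic recalibration)
logit outcome xb
gen lp2 = _b[_cons] + _b[xb]*xb
gen p2 = invlogit(lp2)                // recalibrated probabilities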

6.3 Full Model Revision

Rebuild or update the model (e.g., by adding predictors, nonlinearities, or interactions) when calibration problems reflect structural issues.

7. Calibration vs Discrimination: Complementary but Independent

A model can:

  • Have excellent discrimination but poor calibration

  • Be well calibrated yet poorly discriminative

Thus, calibration always needs explicit evaluation. Clinical risk thresholds (e.g., >15% for treatment) depend entirely on calibration, not discrimination.

8. Practical Stata Implementation

Calculate CITL

* Obtain the linear predictor (log-odds) from the fitted or published model
predict xb, xb
* The intercept of this offset model is the CITL estimate (ideal: 0)
logit outcome, offset(xb)

Calculate Calibration Slope

* The coefficient on xb is the calibration slope (ideal: 1)
logit outcome xb

Plot Calibration (pmcalplot command)

* pmcalplot expects predicted probabilities, so convert from log-odds
gen predicted = invlogit(xb)
ssc install pmcalplot
pmcalplot predicted outcome


9. Conclusion

Calibration is essential for the clinical reliability of prediction models. It ensures that risk estimates are numerically accurate and clinically actionable. Whereas discrimination ranks patients by risk, calibration confirms whether the actual predicted risks are trustworthy.

A well-functioning CPM must demonstrate:

  • CITL ≈ 0 (average accuracy)

  • Slope ≈ 1 (correct spread)

  • Strong calibration plot alignment

Without calibration, a prediction model—even one with strong discrimination—may be unsafe for clinical use.

