
Calibration Plot in Clinical Prediction Models [Calibration-in-the-Large (CITL), Calibration Slope]


Abstract

Calibration is a fundamental property of clinical prediction models (CPMs), reflecting how well predicted probabilities agree with actual observed outcomes. Unlike discrimination—how well a model distinguishes between individuals with and without an event—calibration evaluates absolute accuracy. Poor calibration can mislead clinical decision-making even when discrimination appears acceptable. This article explains the conceptual foundation, metrics, and practical interpretation of calibration, including calibration-in-the-large, calibration slope, calibration plots, and recalibration strategies.


1. Introduction

Clinical prediction models are increasingly used to estimate individual risks of outcomes such as mortality, sepsis, stroke, or readmission. For a CPM to be clinically trustworthy, two performance domains must be demonstrated:

  1. Discrimination – Can the model distinguish high-risk from low-risk patients?
  2. Calibration – Are the predicted probabilities numerically correct?

Discrimination often receives more attention, but calibration is equally essential because even a highly discriminative model can give incorrect absolute risks that misguide treatment thresholds, triage, and counselling.


2. What is Calibration?

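Calibration refers to the agreement between predicted probabilities and observed outcome rates. For example, among patients assigned a predicted risk of 20%, about 20 in 100 should experience the outcome. Formally, a model is perfectly calibrated when:

$$\Pr(Y = 1 \mid \hat{p}) = \hat{p}$$

Calibration answers the question “Are the numbers correct?”, not just “Is the ranking correct?”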


3. Two Fundamental Calibration Metrics

3.1 Calibration-in-the-Large (CITL)

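CITL assesses whether predictions are systematically too high or too low on average.

It is estimated as the intercept α of a logistic regression in which logit(p̂) enters as an offset, so that its coefficient is fixed at 1:

$$\operatorname{logit}\{\Pr(Y = 1)\} = \alpha + 1 \cdot \operatorname{logit}(\hat{p})$$

The ideal value is α = 0. A value of α > 0 means risk is underestimated on average, and α < 0 means it is overestimated.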

CITL reflects overall bias. It does not describe how predictions behave across the entire risk spectrum—that is the role of the slope.
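
For intuition about the scale of α, consider a deliberately simplified, hypothetical case in which every patient receives the same predicted risk of 10% while 15% experience the event. In that special case the CITL is simply the difference between the observed and predicted log-odds, which Stata's built-in logit() function can compute. The numbers are illustrative assumptions, not results from any study.

```stata
* Hypothetical numbers: every patient predicted at 10%, observed event rate 15%
display logit(0.15) - logit(0.10)
* Result is about 0.46: positive, so risk is underpredicted on average
```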

3.2 Calibration Slope

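The calibration slope β quantifies whether predictions are too extreme (β < 1, typical of overfitting) or too moderate (β > 1). The ideal value is β = 1.

It is estimated from:

$$\operatorname{logit}\{\Pr(Y = 1)\} = \alpha + \beta \, \operatorname{logit}(\hat{p})$$

Note that the intercept α in this model is not the CITL, which requires the slope to be fixed at 1.

Key insight: these two metrics diagnose different forms of miscalibration. CITL captures a systematic shift in average risk, whereas the slope captures predictions that are too spread out or too compressed.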
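
To see what a slope below 1 does to individual predictions, the sketch below assumes a hypothetical calibration slope of 0.7 with an intercept of 0 and converts two predicted risks into the risks actually implied by the calibration model. It uses Stata's built-in logit() and invlogit() functions; the slope and the two risks are made-up values for illustration.

```stata
* Hypothetical calibration slope of 0.7, intercept 0
display invlogit(0.7 * logit(0.80))
* About 0.73: a predicted risk of 80% corresponds to roughly 73% observed
display invlogit(0.7 * logit(0.05))
* About 0.11: a predicted risk of 5% corresponds to roughly 11% observed
```

High predictions are too high and low predictions are too low, which is what "too extreme" means in practice.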


4. Visual Assessment: The Calibration Plot

Calibration plots display observed vs predicted risks, commonly across deciles of predicted probability. The ideal line is a 45° diagonal.

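A typical plot shows the observed event proportion in each risk group plotted against that group's mean predicted risk, with the diagonal as the reference for perfect agreement. Points below the diagonal indicate overprediction; points above it indicate underprediction.

Modern tools also overlay a smooth loess curve to show agreement continuously rather than by group.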
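
A grouped plot of this kind can also be assembled by hand, which makes explicit what each point represents. The following is a minimal sketch, assuming the dataset contains a 0/1 variable named outcome and a predicted probability named predicted (both names are assumptions); the other variable names are invented for the example.

```stata
* Assumes: outcome (0/1) and predicted (predicted probability) exist
xtile risk_decile = predicted, nq(10)

preserve
* One row per decile: observed proportion and mean predicted risk
collapse (mean) observed=outcome expected=predicted, by(risk_decile)
twoway (function y = x, range(0 1) lpattern(dash)) ///
       (scatter observed expected), ///
       xtitle("Mean predicted risk") ytitle("Observed proportion") legend(off)
restore

* Smoothed curve on the individual-level data
twoway (function y = x, range(0 1) lpattern(dash)) ///
       (lowess outcome predicted), ///
       xtitle("Predicted risk") ytitle("Observed proportion (lowess)") legend(off)
```

The decile version is easy to read but depends on the grouping; the lowess version avoids arbitrary cut points at the cost of sensitivity to the smoothing bandwidth.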


5. Why Does Calibration Fail?

  1. Overfitting in model development
  2. Case-mix differences between development and validation settings
  3. Changes in disease prevalence or treatment patterns
  4. Incorrect functional form (e.g., missing nonlinear terms)
  5. Measurement differences (e.g., lab assay changes, coding differences)
  6. Small sample size

Thus, calibration must be checked not only during development but whenever a model is applied to a new population.


6. Recalibration Approaches

If a model is miscalibrated but still discriminates well, recalibration can correct its predicted risks.

6.1 Intercept-Only Recalibration

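Fix the slope at 1 and estimate a new intercept:

$$\operatorname{logit}\{\Pr(Y = 1)\} = \alpha_{\text{new}} + \operatorname{logit}(\hat{p})$$

This corrects CITL only; it shifts every prediction by the same amount on the log-odds scale.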
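
In Stata, the updated predictions could be generated along the following lines. This is a sketch, assuming xb already holds the original model's linear predictor, logit(p̂), and outcome is the 0/1 event indicator; p_recal_int is a name made up for the example.

```stata
* Assumes xb = logit(p-hat) from the original model
logit outcome, offset(xb)
* Add the new intercept to the original linear predictor
gen p_recal_int = invlogit(_b[_cons] + xb)
* After this step the mean predicted risk matches the observed event rate
summarize outcome p_recal_int
```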

6.2 Intercept + Slope Recalibration

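Re-estimate both the intercept and the slope:

$$\operatorname{logit}\{\Pr(Y = 1)\} = \alpha_{\text{new}} + \beta_{\text{new}} \, \operatorname{logit}(\hat{p})$$

This corrects both CITL and slope.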
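
A corresponding Stata sketch, under the same assumptions (xb holds the original linear predictor, outcome is the 0/1 event indicator, and p_recal_both is an illustrative name):

```stata
* Assumes xb = logit(p-hat) from the original model
logit outcome xb
* Apply the re-estimated intercept and slope to the original linear predictor
gen p_recal_both = invlogit(_b[_cons] + _b[xb] * xb)
```

As long as the estimated slope is positive, this transformation does not change the ordering of patients, so discrimination is unaffected.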

6.3 Full Model Revision

Rebuild or update the model (e.g., by adding predictors, nonlinearities, or interactions) when calibration problems reflect structural issues.


7. Calibration vs Discrimination: Complementary but Independent

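A model can:

  1. Discriminate well yet be poorly calibrated (patients are ranked correctly, but the absolute risks are wrong).
  2. Be well calibrated on average yet discriminate poorly (for example, by assigning every patient the overall event rate).

Thus, calibration always needs explicit evaluation. Clinical risk thresholds (e.g., >15% for treatment) depend on the absolute predicted risk being correct, which discrimination alone cannot guarantee.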
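
One way to convince yourself of this independence is to distort the predictions with a transformation that preserves their order. The sketch below assumes variables outcome (0/1) and predicted (predicted probability) exist; p_half is a hypothetical variable created only for the demonstration.

```stata
* Assumes: outcome (0/1) and predicted (predicted probability) exist
gen p_half = predicted / 2

* Halving preserves the ranking, so both commands report the same ROC area
roctab outcome predicted
roctab outcome p_half

* The mean of p_half is half the mean of predicted, so the two cannot
* both agree with the observed event rate
summarize outcome predicted p_half
```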


8. Practical Stata Implementation

Calculate CITL

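* Linear predictor (log-odds) from the fitted prediction model
predict xb, xb
* With xb as an offset (slope fixed at 1), the intercept _cons is the CITL
logit outcome, offset(xb)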

Calculate Calibration Slope

* The coefficient on xb is the calibration slope
logit outcome xb
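
If a formal comparison against the ideal value is wanted, Stata's Wald test command can be run after the slope model. This is a sketch rather than a recommended workflow; in practice the confidence intervals from the logit output are usually more informative than a p-value.

```stata
logit outcome xb
* Wald test of the calibration slope against its ideal value of 1
test xb = 1
* Joint Wald test of intercept 0 and slope 1
test (_cons = 0) (xb = 1)
```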

Plot Calibration (pmcalplot command)

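ssc install pmcalplot
* predicted must hold predicted probabilities; here derived from xb above
gen predicted = invlogit(xb)
pmcalplot predicted outcome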
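
To try these commands without real data, a small simulation can help. The sketch below is entirely hypothetical: it generates outcomes from a known model and then evaluates a deliberately miscalibrated "existing model" whose coefficients are too large. All variable names, coefficients, and the seed are assumptions chosen for illustration.

```stata
* Warning: clear removes any data currently in memory
clear
set seed 2024
set obs 5000

* True data-generating model: logit(p) = -1 + x
gen x = rnormal()
gen outcome = rbinomial(1, invlogit(-1 + x))

* Hypothetical existing model with an inflated intercept and coefficient
gen xb = 0.5 + 1.5 * x
gen predicted = invlogit(xb)

* CITL: expected to be clearly negative (risk overpredicted on average)
logit outcome, offset(xb)

* Calibration slope: expected to be near 1/1.5, i.e. about 0.67
logit outcome xb
```

Because the existing model multiplies the true coefficient by 1.5, its predictions are too extreme, and the estimated slope should land close to 0.67 up to sampling error.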


9. Conclusion

Calibration is essential for the clinical reliability of prediction models. It ensures that risk estimates are numerically accurate and clinically actionable. Whereas discrimination ranks patients by risk, calibration confirms whether the actual predicted risks are trustworthy.

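A well-functioning CPM must demonstrate:

  1. Adequate discrimination
  2. Adequate calibration, with CITL close to 0 and a calibration slope close to 1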

Without calibration, a prediction model—even one with strong discrimination—may be unsafe for clinical use.