
Step 2 of the Debray Framework: Evaluating Calibration and Discrimination in External Validation


Introduction

Once the relatedness between the development and validation populations has been established (Step 1), the next task in the Debray framework is to rigorously assess how well the original prediction model performs in the new validation sample. This step focuses on core predictive performance metrics—calibration and discrimination—accompanied by essential visual assessments. Together, these provide a comprehensive picture of predictive accuracy and potential model misfit in the validation setting.


1. Calibration: Does the Model Predict the Right Absolute Risks?

Calibration examines the agreement between predicted probabilities and observed outcomes. A well-calibrated model produces predictions that closely match the true event rate among patients with similar predicted risk.

Debray et al. emphasize two summary metrics:

A. Calibration-in-the-Large (Intercept α)

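Estimated using the recalibration model in which the slope is fixed at 1, so that the logit of the predicted risk enters as an offset:

$$\operatorname{logit}(Y) = \alpha + \operatorname{logit}(\hat{Y})$$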

Since calibration-in-the-large reflects differences in overall outcome incidence between the development and validation datasets, it often mirrors the differences in the mean of the linear predictor (LP) identified in Step 1.
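As a rough illustration of how the intercept α could be estimated, the sketch below fits the intercept-only recalibration model by Newton-Raphson, using the logit of each predicted risk as a fixed offset. The function name `calibration_in_the_large` and the toy data are assumptions made for this example; they are not taken from Debray et al., and in practice a standard logistic regression routine with an offset term would normally be used.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def calibration_in_the_large(y, p_hat, tol=1e-10, max_iter=50):
    """Estimate alpha in logit(Y) = alpha + logit(p_hat), slope fixed at 1.

    y     : observed outcomes coded 0/1
    p_hat : predicted probabilities from the original model
    """
    offsets = [logit(p) for p in p_hat]
    alpha = 0.0
    for _ in range(max_iter):
        fitted = [expit(alpha + o) for o in offsets]
        # Score and information for the single parameter alpha.
        score = sum(yi - fi for yi, fi in zip(y, fitted))
        info = sum(fi * (1.0 - fi) for fi in fitted)
        step = score / info
        alpha += step
        if abs(step) < tol:
            break
    return alpha


# Hypothetical toy data: 2 events in 8 patients (25%),
# while the predictions average 40%.
y = [0, 0, 0, 1, 0, 0, 1, 0]
p_hat = [0.2, 0.3, 0.3, 0.6, 0.4, 0.5, 0.5, 0.4]
alpha = calibration_in_the_large(y, p_hat)
print(f"calibration-in-the-large: {alpha:.3f}")
```

Because the observed event rate in the toy data is lower than the average predicted risk, the estimated intercept comes out negative.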

B. Calibration Slope (β)

Estimated from:

$$\operatorname{logit}(Y) = \alpha + \beta \operatorname{logit}(\hat{Y})$$
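The slope model above is an ordinary logistic regression of the outcome on the logit of the predicted risk. A minimal sketch, assuming patient-level outcomes and predictions are available, is shown below; it uses a hand-written Newton-Raphson fit so that it needs only the standard library. The function name `recalibration_fit` and the toy data are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def expit(x):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-x))


def recalibration_fit(y, p_hat, tol=1e-10, max_iter=100):
    """Fit logit(Y) = alpha + beta * logit(p_hat) by Newton-Raphson.

    Returns (alpha, beta); beta is the calibration slope.
    """
    lp = [logit(p) for p in p_hat]
    alpha, beta = 0.0, 1.0
    for _ in range(max_iter):
        mu = [expit(alpha + beta * x) for x in lp]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * x for yi, mi, x in zip(y, mu, lp))
        # Information matrix (2 x 2, symmetric).
        w = [mi * (1.0 - mi) for mi in mu]
        h00 = sum(w)
        h01 = sum(wi * x for wi, x in zip(w, lp))
        h11 = sum(wi * x * x for wi, x in zip(w, lp))
        det = h00 * h11 - h01 * h01
        d_alpha = (h11 * g0 - h01 * g1) / det
        d_beta = (h00 * g1 - h01 * g0) / det
        alpha += d_alpha
        beta += d_beta
        if max(abs(d_alpha), abs(d_beta)) < tol:
            break
    return alpha, beta


# Hypothetical toy data: three groups of 10 patients with predicted risks
# of 10%, 50% and 90%, but observed event rates of 30%, 50% and 70%.
p_hat = [0.1] * 10 + [0.5] * 10 + [0.9] * 10
y = [1] * 3 + [0] * 7 + [1] * 5 + [0] * 5 + [1] * 7 + [0] * 3
alpha, beta = recalibration_fit(y, p_hat)
print(f"intercept: {alpha:.3f}, calibration slope: {beta:.3f}")
```

In this toy sample the predictions are more extreme than the observed rates, so the fitted slope is below 1.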

Calibration slope is central for determining whether the strength of predictor-outcome relationships is stable across populations. Key insight:

These two metrics diagnose different forms of miscalibration. The four clinical examples below show each form in turn.

1. Overestimation (CITL < 0)

Scenario

A prediction model for 30-day mortality in sepsis reports:

  • Average predicted mortality = 40%
  • Actual observed mortality = 25%

Interpretation

The model overestimates risk (predicts higher than reality). It “thinks” everyone is sicker than they truly are.

CITL

  • CITL < 0 (negative intercept shift)

Clinical consequence

Some patients may be labeled high-risk unnecessarily → overtreatment.

2. Underestimation (CITL > 0)

Scenario

A heart failure risk model outputs:

  • Predicted risk = 8%
  • Actual observed risk = 15%

Interpretation

The model underestimates true risk (predictions too low).

CITL

  • CITL > 0 (positive intercept shift)

Clinical consequence

High-risk patients may be falsely reassured or undertreated.
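The sign of CITL in the two scenarios above can be checked with a back-of-the-envelope calculation: the difference between the logit of the observed risk and the logit of the average predicted risk. This is only an approximation, exact when every patient has the same predicted risk; the actual CITL is estimated from patient-level data. The helper `approx_citl` is a hypothetical name used for this sketch.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def approx_citl(mean_predicted, observed):
    """Crude CITL: logit(observed risk) minus logit(mean predicted risk)."""
    return logit(observed) - logit(mean_predicted)


# Sepsis example: predicted 40%, observed 25%.
sepsis = approx_citl(mean_predicted=0.40, observed=0.25)
# Heart failure example: predicted 8%, observed 15%.
heart_failure = approx_citl(mean_predicted=0.08, observed=0.15)

print(f"sepsis:        {sepsis:+.3f}  (negative: overestimation)")
print(f"heart failure: {heart_failure:+.3f}  (positive: underestimation)")
```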

3. Overfitting (Slope < 1)

Scenario

A model predicts readmission risk:

Patient      True risk    Model prediction
Low-risk     10%          3% (too low)
Mid-risk     20%          20%
High-risk    40%          70% (too high)

Interpretation

Predictions are too extreme → the model exaggerates differences between patients.

This happens when the model fits noise in the original dataset.

Slope

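  • Slope < 1 (predictions too spread out; on a plot of observed against predicted risk, the calibration curve is flatter than the diagonal)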

Clinical consequence

High-risk patients seem terrifyingly high-risk, while low-risk patients appear safer than they really are.

4. Underfitting (Slope > 1)

Scenario

A model for predicting AKI after surgery gives:

Patient      True risk    Model prediction
Low-risk     5%           10% (too high)
Mid-risk     20%          18% (too close)
High-risk    40%          25% (too low)

Interpretation

Differences between low-, mid-, and high-risk patients are flattened. Model predictions are too close to each other → too “soft”.

This means the predictor effects in the model are too weak for the validation population, as can happen when coefficients have been shrunk too strongly.

Slope

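  • Slope > 1 (predictions too compressed; on a plot of observed against predicted risk, the calibration curve is steeper than the diagonal)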

Clinical consequence

The model does not separate low- and high-risk patients enough → poor triage ability.
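To see why the readmission example corresponds to a slope below 1 and the AKI example to a slope above 1, the three rows of each table can be placed on the logit scale and a straight line fitted through them. This is a crude illustration only: the real calibration slope is a maximum-likelihood estimate from patient-level data, and the least-squares shortcut and the name `logit_scale_slope` are assumptions of this sketch.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def logit_scale_slope(true_risk, predicted_risk):
    """Least-squares slope of logit(true risk) on logit(predicted risk)."""
    x = [logit(p) for p in predicted_risk]
    y = [logit(p) for p in true_risk]
    x_bar = sum(x) / len(x)
    y_bar = sum(y) / len(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    return sxy / sxx


# Readmission example: predictions too extreme.
readmission = logit_scale_slope(
    true_risk=[0.10, 0.20, 0.40], predicted_risk=[0.03, 0.20, 0.70]
)
# AKI example: predictions too close together.
aki = logit_scale_slope(
    true_risk=[0.05, 0.20, 0.40], predicted_risk=[0.10, 0.18, 0.25]
)

print(f"readmission slope: {readmission:.2f}  (below 1)")
print(f"AKI slope:         {aki:.2f}  (above 1)")
```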


2. Discrimination: Can the Model Rank-Order Patients Correctly?

Discrimination describes the model’s ability to distinguish individuals with the outcome from those without it.

The main metric is:

Concordance (c) Statistic / AUROC

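The c statistic depends on the case mix as well as on the model itself: when the validation population is more homogeneous, with a narrower spread of the linear predictor, the c statistic tends to be lower even if the predictor effects are correct. Thus, discrimination must always be interpreted in light of Step 1.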
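For readers who want to see what the c statistic measures, here is a minimal sketch that computes it directly from its definition: the proportion of event/non-event pairs in which the patient with the event received the higher predicted risk, with ties counted as one half. The function name `c_statistic` and the toy data are hypothetical, and the pairwise loop is meant for clarity rather than speed.

```python
def c_statistic(y, p_hat):
    """Concordance statistic for binary outcomes (0/1) and predicted risks."""
    events = [p for yi, p in zip(y, p_hat) if yi == 1]
    non_events = [p for yi, p in zip(y, p_hat) if yi == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")
    concordant = 0.0
    for pe in events:
        for pn in non_events:
            if pe > pn:
                concordant += 1.0   # event ranked above non-event
            elif pe == pn:
                concordant += 0.5   # tie counts as half
    return concordant / (len(events) * len(non_events))


# Hypothetical toy data: 3 events and 3 non-events.
y = [0, 0, 1, 0, 1, 1]
p_hat = [0.10, 0.30, 0.35, 0.60, 0.70, 0.90]
c = c_statistic(y, p_hat)
print(f"c statistic: {c:.3f}")

# Halving every prediction changes calibration but not the ranking.
c_rescaled = c_statistic(y, [p / 2 for p in p_hat])
print(f"c statistic after rescaling: {c_rescaled:.3f}")
```

Because only the ranking of predictions matters, the rescaled predictions give the same c statistic, which is why discrimination and calibration have to be assessed separately.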


3. Visual Calibration Assessment: Essential for Detecting Local Miscalibration

Debray et al. strongly advocate calibration plots, as summary metrics may hide region-specific errors.

Calibration Plot

Visual inspection identifies:

The DVT empirical example used seven quantile groups, showing not only global miscalibration but also localized deviations across risk intervals.
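The numbers behind a grouped calibration plot can be sketched as follows: patients are sorted by predicted risk, split into quantile groups, and the mean predicted risk in each group is compared with the observed event proportion. The function name `calibration_groups`, the default of seven groups (chosen here only to match the example above), and the toy data are assumptions of this sketch; plotting is left out so that the code needs only the standard library.

```python
def calibration_groups(y, p_hat, n_groups=7):
    """Group patients by quantiles of predicted risk.

    Returns one (mean predicted risk, observed proportion, group size)
    tuple per group, ordered from lowest to highest predicted risk.
    """
    if len(y) != len(p_hat):
        raise ValueError("y and p_hat must have the same length")
    if n_groups < 1 or n_groups > len(y):
        raise ValueError("n_groups must be between 1 and the sample size")
    pairs = sorted(zip(p_hat, y))
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        # Integer arithmetic keeps group sizes as equal as possible.
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        observed = sum(o for _, o in chunk) / len(chunk)
        rows.append((mean_pred, observed, len(chunk)))
    return rows


# Hypothetical toy data: 14 patients, giving 7 groups of 2.
p_hat = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35,
         0.40, 0.45, 0.50, 0.55, 0.60, 0.65, 0.70]
y = [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1]
rows = calibration_groups(y, p_hat)
for mean_pred, observed, size in rows:
    print(f"predicted {mean_pred:.3f}  observed {observed:.3f}  n={size}")
```

Plotting observed against predicted for each row, together with the 45-degree line, gives the grouped calibration plot; groups that fall far from the line point to local miscalibration.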


4. Empirical Findings from Debray’s DVT Validation Studies

Using four datasets assessing a diagnostic model for deep venous thrombosis (DVT), Debray’s Step 2 results demonstrated how calibration and discrimination behaviors differ across populations with varying case mix.

Validation Study 1

Validation Study 2

Validation Study 3


Summary of Step 2 Insights