Step 2 of the Debray Framework: Evaluating Calibration and Discrimination in External Validation
- Mayta

Introduction
Once the relatedness between the development and validation populations has been established (Step 1), the next task in the Debray framework is to rigorously assess how well the original prediction model performs in the new validation sample. This step focuses on core predictive performance metrics—calibration and discrimination—accompanied by essential visual assessments. Together, these provide a comprehensive picture of predictive accuracy and potential model misfit in the validation setting.
1. Calibration: Does the Model Predict the Right Absolute Risks?
Calibration examines the agreement between predicted probabilities and observed outcomes. A well-calibrated model produces predictions that closely match the true event rate among patients with similar predicted risk.
Debray et al. emphasize two summary metrics:
A. Calibration-in-the-Large (Intercept α)
Estimated using the logistic recalibration model fitted in the validation sample, with the original linear predictor (LP) included as a fixed offset:
logit(P(Y = 1)) = α + LP
Ideal value: α = 0
Interpretation:
α < 0 → the model overestimates risk (predicted risks run higher than the observed event rate)
α > 0 → the model underestimates risk (predicted risks run lower than the observed event rate)
Since calibration-in-the-large reflects differences in overall outcome incidence between development and validation datasets, it often mirrors differences in LP mean identified in Step 1.
B. Calibration Slope (β)
Estimated from the logistic recalibration model that regresses the observed outcome on the linear predictor:
logit(P(Y = 1)) = α + β·LP
Ideal value: β = 1 → the spread of predicted risks matches the observed risk gradient
β < 1 → predictions are too extreme ("overfitting"), typically because the model fitted noise during development
β > 1 → predictions are too modest ("underfitting"), indicating stronger predictor effects in the validation sample than the model reflects
Calibration slope is central for determining whether the strength of predictor-outcome relationships is stable across populations. Key insight:
CITL = mean shift
Slope = spread distortion
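As a minimal sketch of how both quantities are typically estimated, here is a logistic recalibration in Python (statsmodels and scipy), using simulated stand-in data in place of a real validation cohort; the arrays lp and y, and the chosen miscalibration, are hypothetical:

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

# Hypothetical validation data: lp is the original model's linear predictor
# (log-odds of predicted risk), y the observed 0/1 outcomes.
rng = np.random.default_rng(1)
lp = rng.normal(-1.0, 1.0, size=5_000)
y = rng.binomial(1, expit(0.3 + 0.8 * lp))   # deliberately miscalibrated "truth"

# Calibration-in-the-large: intercept-only logistic model with the LP as a fixed offset
citl = sm.GLM(y, np.ones_like(lp), family=sm.families.Binomial(), offset=lp).fit().params[0]

# Calibration slope: logistic regression of the observed outcome on the LP
slope = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit().params[1]

print(f"CITL ≈ {citl:.2f} (ideal 0), slope ≈ {slope:.2f} (ideal 1)")
```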
These two metrics diagnose different forms of miscalibration. Below are four clinical examples showing how each pattern arises and why it matters clinically.
1. Overestimation (CITL < 0)
Scenario
A prediction model for 30-day mortality in sepsis reports:
Average predicted mortality = 40%
Actual observed mortality = 25%
Interpretation
The model overestimates risk (predicts higher than reality). It “thinks” patients are sicker than they truly are.
CITL
CITL < 0 (negative intercept shift)
Clinical consequence
Some patients may be labeled high-risk unnecessarily → overtreatment.
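A quick back-of-the-envelope check of the direction, comparing the two average risks on the log-odds scale (a crude approximation that ignores case-mix spread, not a fitted intercept):

```python
from scipy.special import logit

mean_predicted = 0.40   # average predicted 30-day mortality
observed_rate = 0.25    # observed 30-day mortality

# Rough intercept shift on the log-odds scale: negative → overestimation
print(round(logit(observed_rate) - logit(mean_predicted), 2))  # ≈ -0.69
```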
2. Underestimation (CITL > 0)
Scenario
A heart failure risk model outputs:
Predicted risk = 8%
Actual observed risk = 15%
Interpretation
The model underestimates true risk (predictions too low).
CITL
CITL > 0 (positive intercept shift)
Clinical consequence
High-risk patients may be falsely reassured or undertreated.
3. Overfitting (Slope < 1)
Scenario
A model predicts readmission risk:
| Patient | True risk | Model prediction |
| --- | --- | --- |
| Low-risk | 10% | 3% (too low) |
| Mid-risk | 20% | 20% |
| High-risk | 40% | 70% (too high) |
Interpretation
Predictions are too extreme → the model exaggerates differences between patients.
This happens when the model fits noise in the original dataset.
Slope
Slope < 1 (predicted risks are spread too widely; the calibration curve lies flatter than the 45° line)
Clinical consequence
High-risk patients appear far riskier than they really are, while low-risk patients appear safer than reality.
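A rough check using the table above: comparing how widely the true risks and the predictions are spread on the log-odds scale (a crude ratio between the extreme patients, not a fitted slope) already signals a slope well below 1:

```python
from scipy.special import logit

# Spread of risks from the low-risk to the high-risk patient in the table
true_spread = logit(0.40) - logit(0.10)        # ≈ 1.79
predicted_spread = logit(0.70) - logit(0.03)   # ≈ 4.32

# Predictions span a much wider log-odds range than the truth
print(round(true_spread / predicted_spread, 2))  # ≈ 0.41, i.e. slope < 1
```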
4. Underfitting (Slope > 1)
Scenario
A model for predicting AKI after surgery gives:
| Patient | True risk | Model prediction |
| --- | --- | --- |
| Low-risk | 5% | 10% (too high) |
| Mid-risk | 20% | 18% (close) |
| High-risk | 40% | 25% (too low) |
Interpretation
Differences between low-, mid-, and high-risk patients are flattened. The model's predictions cluster too closely together → too “soft”.
This underfitting pattern indicates that predictor-outcome relationships in the validation sample are stronger than the model reflects.
Slope
Slope > 1 (predicted risks are spread too narrowly; the calibration curve runs steeper than the 45° line)
Clinical consequence
The model does not separate low- and high-risk patients enough → poor triage ability.
2. Discrimination: Can the Model Rank-Order Patients Correctly?
Discrimination describes the model’s ability to distinguish individuals with the outcome from those without it.
The main metric is:
Concordance (c) Statistic / AUROC
Range: 0.5 (no discrimination) to 1.0 (perfect discrimination)
Reflects: How well the model orders predicted risks across patients
Influenced by: Case-mix heterogeneity
Validation samples with wider LP spread (higher LP SD) naturally yield higher c-statistics
Thus, discrimination must always be interpreted in light of Step 1.
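A minimal simulation sketch of this case-mix effect (hypothetical data, not Debray's DVT sets; uses scipy and scikit-learn): the same correctly specified model yields a higher c-statistic simply because the validation sample has a wider spread of linear predictors.

```python
import numpy as np
from scipy.special import expit
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def c_statistic(lp_sd, n=20_000):
    lp = rng.normal(loc=-1.0, scale=lp_sd, size=n)   # linear predictor (log-odds)
    y = rng.binomial(1, expit(lp))                   # outcomes generated by the model itself
    return roc_auc_score(y, lp)

print(round(c_statistic(lp_sd=0.8), 2))   # narrower case mix → lower c-statistic
print(round(c_statistic(lp_sd=1.6), 2))   # wider case mix   → higher c-statistic
```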
3. Visual Calibration Assessment: Essential for Detecting Local Miscalibration
Debray et al. strongly advocate calibration plots, as summary metrics may hide region-specific errors.
Calibration Plot
Patients are grouped into quantiles of predicted risk
Observed outcome proportions per group are plotted against predicted probabilities
Perfect performance lies along the 45° diagonal
Visual inspection identifies:
Nonuniform miscalibration (e.g., correct at low risk but wrong at high risk)
Threshold-specific problems important for clinical decision-making
Prediction drift across the risk spectrum
The DVT empirical example used seven quantile groups, showing not only global miscalibration but also localized deviations across risk intervals.
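A sketch of such a quantile-based calibration plot with matplotlib (the arrays y and p_hat of observed outcomes and predicted probabilities are hypothetical; seven groups here to mirror the DVT example, though any number of quantiles can be used):

```python
import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(y, p_hat, n_groups=7):
    y, p_hat = np.asarray(y), np.asarray(p_hat)

    # Group patients into quantiles of predicted risk
    # (assumes enough distinct predicted risks to fill each group)
    edges = np.quantile(p_hat, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, p_hat, side="right") - 1, 0, n_groups - 1)

    # Observed event proportion vs mean predicted risk per group
    mean_pred = [p_hat[groups == g].mean() for g in range(n_groups)]
    obs_rate = [y[groups == g].mean() for g in range(n_groups)]

    plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration (45° line)")
    plt.plot(mean_pred, obs_rate, "o-", label="Observed vs predicted")
    plt.xlabel("Mean predicted risk per group")
    plt.ylabel("Observed event proportion")
    plt.legend()
    plt.show()
```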
4. Empirical Findings from Debray’s DVT Validation Studies
Using four datasets assessing a diagnostic model for deep venous thrombosis (DVT), Debray’s Step 2 results demonstrated how calibration and discrimination behaviors differ across populations with varying case mix.
Validation Study 1
Calibration slope ≈ 0.90 → predictions slightly too extreme
Calibration-in-the-large: −0.52 → model systematically overpredicted risk
c ≈ 0.76 → small decrease from development sample
Interpretation:
Sample very similar to development (Step 1) → assessing reproducibility
Minor miscalibration correctable via intercept update
Validation Study 2
Improved c-statistic (≈ 0.82)
Reflects greater case-mix heterogeneity (larger LP SD from Step 1)
Calibration slope ≈ 0.88; calibration-in-the-large nearly optimal
Interpretation:
Despite population differences, discrimination improves due to diverse risk profiles
Model remains acceptably calibrated
Validation Study 3
Strong discrimination (c ≈ 0.85)
Calibration slope >1 (≈ 1.12)
Indicates predicted risks are not extreme enough
Systematic underprediction at higher risks is visible in the calibration plot
Interpretation:
Represents transportability assessment with pronounced case-mix differences
Miscalibration requires slope adjustment + intercept update
Summary of Step 2 Insights
Predictive performance must be evaluated using both numeric metrics and visual calibration.
Calibration-in-the-large diagnoses baseline risk differences.
Calibration slope diagnoses scaling or overfitting issues.
Discrimination depends strongly on LP variability, not just model quality.
Calibration plots reveal miscalibration patterns that summary metrics cannot capture.
Interpretation is inseparable from Step 1 (relatedness of samples).





