Step 2 of the Debray Framework: Evaluating Calibration and Discrimination in External Validation
- Mayta

Introduction
Once the relatedness between the development and validation populations has been established (Step 1), the next task in the Debray framework is to rigorously assess how well the original prediction model performs in the new validation sample. This step focuses on core predictive performance metrics—calibration and discrimination—accompanied by essential visual assessments. Together, these provide a comprehensive picture of predictive accuracy and potential model misfit in the validation setting.
1. Calibration: Does the Model Predict the Right Absolute Risks?
Calibration examines the agreement between predicted probabilities and observed outcomes. A well-calibrated model produces predictions that closely match the true event rate among patients with similar predicted risk.
Debray et al. emphasize two summary metrics:
A. Calibration-in-the-Large (Intercept α)
Estimated using the logistic recalibration model fitted in the validation sample, with the original linear predictor (LP) included as a fixed offset:
logit(P(Y = 1)) = α + LP
Ideal value: α = 0
Interpretation:
α < 0 → the model overestimates risk (predicted risks run higher than the observed event rate)
α > 0 → the model underestimates risk (predicted risks run lower than the observed event rate)
Since calibration-in-the-large reflects differences in overall outcome incidence between development and validation datasets, it often mirrors differences in LP mean identified in Step 1.
B. Calibration Slope (β)
Estimated from the logistic recalibration model that regresses the observed outcome on the linear predictor:
logit(P(Y = 1)) = α + β·LP
Ideal value: β = 1 → the spread of predicted risks matches the observed risk gradient
β < 1 → predictions are too extreme ("overfitting"), typically because the model fitted noise during development
β > 1 → predictions are too modest ("underfitting"), indicating stronger predictor effects in the validation sample than the model reflects
Calibration slope is central for determining whether the strength of predictor-outcome relationships is stable across populations. Key insight:
CITL = mean shift
Slope = spread distortion
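As a minimal sketch of how both quantities are typically estimated, here is a logistic recalibration in Python (statsmodels and scipy), using simulated stand-in data in place of a real validation cohort; the arrays lp and y, and the chosen miscalibration, are hypothetical:

```python
import numpy as np
import statsmodels.api as sm
from scipy.special import expit

# Hypothetical validation data: lp is the original model's linear predictor
# (log-odds of predicted risk), y the observed 0/1 outcomes.
rng = np.random.default_rng(1)
lp = rng.normal(-1.0, 1.0, size=5_000)
y = rng.binomial(1, expit(0.3 + 0.8 * lp))   # deliberately miscalibrated "truth"

# Calibration-in-the-large: intercept-only logistic model with the LP as a fixed offset
citl = sm.GLM(y, np.ones_like(lp), family=sm.families.Binomial(), offset=lp).fit().params[0]

# Calibration slope: logistic regression of the observed outcome on the LP
slope = sm.GLM(y, sm.add_constant(lp), family=sm.families.Binomial()).fit().params[1]

print(f"CITL ≈ {citl:.2f} (ideal 0), slope ≈ {slope:.2f} (ideal 1)")
```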
These two metrics diagnose different forms of miscalibration. Below are four clinical examples showing how each pattern arises and why it matters clinically.
1. Overestimation (CITL < 0)
Scenario
A prediction model for 30-day mortality in sepsis reports:
Average predicted mortality = 40%
Actual observed mortality = 25%
Interpretation
The model overestimates risk (predicts higher than reality). It “thinks” patients are sicker than they truly are.
CITL
CITL < 0 (negative intercept shift)
Clinical consequence
Some patients may be labeled high-risk unnecessarily → overtreatment.
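A quick back-of-the-envelope check of the direction, comparing the two average risks on the log-odds scale (a crude approximation that ignores case-mix spread, not a fitted intercept):

```python
from scipy.special import logit

mean_predicted = 0.40   # average predicted 30-day mortality
observed_rate = 0.25    # observed 30-day mortality

# Rough intercept shift on the log-odds scale: negative → overestimation
print(round(logit(observed_rate) - logit(mean_predicted), 2))  # ≈ -0.69
```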
2. Underestimation (CITL > 0)
Scenario
A heart failure risk model outputs:
Predicted risk = 8%
Actual observed risk = 15%
Interpretation
The model underestimates true risk (predictions too low).
CITL
CITL > 0 (positive intercept shift)
Clinical consequence
High-risk patients may be falsely reassured or undertreated.
3. Overfitting (Slope < 1)
Scenario
A model predicts readmission risk:
| Patient | True risk | Model prediction |
| --- | --- | --- |
| Low-risk | 10% | 3% (too low) |
| Mid-risk | 20% | 20% |
| High-risk | 40% | 70% (too high) |
Interpretation
Predictions are too extreme → the model exaggerates differences between patients.
This happens when the model fits noise in the original dataset.
Slope
Slope < 1 (predicted risks are spread too widely; the calibration curve lies flatter than the 45° line)
Clinical consequence
High-risk patients appear far riskier than they really are, while low-risk patients appear safer than reality.
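A rough check using the table above: comparing how widely the true risks and the predictions are spread on the log-odds scale (a crude ratio between the extreme patients, not a fitted slope) already signals a slope well below 1:

```python
from scipy.special import logit

# Spread of risks from the low-risk to the high-risk patient in the table
true_spread = logit(0.40) - logit(0.10)        # ≈ 1.79
predicted_spread = logit(0.70) - logit(0.03)   # ≈ 4.32

# Predictions span a much wider log-odds range than the truth
print(round(true_spread / predicted_spread, 2))  # ≈ 0.41, i.e. slope < 1
```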
4. Underfitting (Slope > 1)
Scenario
A model for predicting AKI after surgery gives:
| Patient | True risk | Model prediction |
| --- | --- | --- |
| Low-risk | 5% | 10% (too high) |
| Mid-risk | 20% | 18% (close) |
| High-risk | 40% | 25% (too low) |
Interpretation
Differences between low-, mid-, and high-risk patients are flattened. The model's predictions cluster too closely together → too “soft”.
This underfitting pattern indicates that predictor-outcome relationships in the validation sample are stronger than the model reflects.
Slope
Slope > 1 (predicted risks are spread too narrowly; the calibration curve runs steeper than the 45° line)
Clinical consequence
The model does not separate low- and high-risk patients enough → poor triage ability.
2. Discrimination: Can the Model Rank-Order Patients Correctly?
Discrimination describes the model’s ability to distinguish individuals with the outcome from those without it.
The main metric is:
Concordance (c) Statistic / AUROC
Range: 0.5 (no discrimination) to 1.0 (perfect discrimination)
Reflects: How well the model orders predicted risks across patients
Influenced by: Case-mix heterogeneity
Validation samples with wider LP spread (higher LP SD) naturally yield higher c-statistics
Thus, discrimination must always be interpreted in light of Step 1.
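A minimal simulation sketch of this case-mix effect (hypothetical data, not Debray's DVT sets; uses scipy and scikit-learn): the same correctly specified model yields a higher c-statistic simply because the validation sample has a wider spread of linear predictors.

```python
import numpy as np
from scipy.special import expit
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def c_statistic(lp_sd, n=20_000):
    lp = rng.normal(loc=-1.0, scale=lp_sd, size=n)   # linear predictor (log-odds)
    y = rng.binomial(1, expit(lp))                   # outcomes generated by the model itself
    return roc_auc_score(y, lp)

print(round(c_statistic(lp_sd=0.8), 2))   # narrower case mix → lower c-statistic
print(round(c_statistic(lp_sd=1.6), 2))   # wider case mix   → higher c-statistic
```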
3. Visual Calibration Assessment: Essential for Detecting Local Miscalibration
Debray et al. strongly advocate calibration plots, as summary metrics may hide region-specific errors.
Calibration Plot
Patients are grouped into quantiles of predicted risk
Observed outcome proportions per group are plotted against predicted probabilities
Perfect performance lies along the 45° diagonal
Visual inspection identifies:
Nonuniform miscalibration (e.g., correct at low risk but wrong at high risk)
Threshold-specific problems important for clinical decision-making
Prediction drift across the risk spectrum
The DVT empirical example used seven quantile groups, showing not only global miscalibration but also localized deviations across risk intervals.
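A sketch of such a quantile-based calibration plot with matplotlib (the arrays y and p_hat of observed outcomes and predicted probabilities are hypothetical; seven groups here to mirror the DVT example, though any number of quantiles can be used):

```python
import numpy as np
import matplotlib.pyplot as plt

def calibration_plot(y, p_hat, n_groups=7):
    y, p_hat = np.asarray(y), np.asarray(p_hat)

    # Group patients into quantiles of predicted risk
    # (assumes enough distinct predicted risks to fill each group)
    edges = np.quantile(p_hat, np.linspace(0, 1, n_groups + 1))
    groups = np.clip(np.searchsorted(edges, p_hat, side="right") - 1, 0, n_groups - 1)

    # Observed event proportion vs mean predicted risk per group
    mean_pred = [p_hat[groups == g].mean() for g in range(n_groups)]
    obs_rate = [y[groups == g].mean() for g in range(n_groups)]

    plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration (45° line)")
    plt.plot(mean_pred, obs_rate, "o-", label="Observed vs predicted")
    plt.xlabel("Mean predicted risk per group")
    plt.ylabel("Observed event proportion")
    plt.legend()
    plt.show()
```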
4. Empirical Findings from Debray’s DVT Validation Studies
Using four datasets assessing a diagnostic model for deep venous thrombosis (DVT), Debray’s Step 2 results demonstrated how calibration and discrimination behaviors differ across populations with varying case mix.
Validation Study 1
Calibration slope ≈ 0.90 → predictions slightly too extreme
Calibration-in-the-large: −0.52 → model systematically overpredicted risk
c ≈ 0.76 → small decrease from development sample
Interpretation:
Sample very similar to development (Step 1) → assessing reproducibility
Minor miscalibration correctable via intercept update
Validation Study 2
Improved c-statistic (≈ 0.82)
Reflects greater case-mix heterogeneity (larger LP SD from Step 1)
Calibration slope ≈ 0.88; calibration-in-the-large nearly optimal
Interpretation:
Despite population differences, discrimination improves due to diverse risk profiles
Model remains acceptably calibrated
Validation Study 3
Strong discrimination (c ≈ 0.85)
Calibration slope >1 (≈ 1.12)
Indicates predicted risks are not extreme enough
Systematic underprediction at higher risks is visible in the calibration plot
Interpretation:
Represents transportability assessment with pronounced case-mix differences
Miscalibration requires slope adjustment + intercept update
Summary of Step 2 Insights
Predictive performance must be evaluated using both numeric metrics and visual calibration.
Calibration-in-the-large diagnoses baseline risk differences.
Calibration slope diagnoses scaling or overfitting issues.
Discrimination depends strongly on LP variability, not just model quality.
Calibration plots reveal miscalibration patterns that summary metrics cannot capture.
Interpretation is inseparable from Step 1 (relatedness of samples).





