How to Report Model Updating (or Not) After Debray's External Validation: Examples, Tables & Templates
- Mayta

- Nov 27
1. How they present external validation (text + tables)
1.1 Paper A – IDIOM model with updating
Structure of the text
Introduction
Clinical background: iron deficiency anemia, GI malignancy.
Introduces IDIOM model, original performance.
States two aims:
Describe prevalence/clinical characteristics of GI malignancy in Thai IDA patients
Externally validate IDIOM and update it if needed.
Methods
Study design & patients (retrospective, single center, inclusion criteria, definition of IDA).
IDIOM model: gives the exact formula (β coefficients).
External validation:
States clearly they follow Debray’s three-step framework.
Mentions they will examine relatedness (development vs Chiang Mai), evaluate discrimination and calibration, then interpret.
Model updating (separate subsection):
Describes three methods:
Recalibration-in-the-large (intercept only)
Logistic recalibration (intercept + slope)
Model refitting (all coefficients)
Explains when each method is applied (miscalibration only vs miscalibration + poor discrimination).
Model performance assessment:
Lists metrics: AUC, calibration plot, CITL, calibration slope, E/O ratio, and decision curve analysis (a code sketch for computing most of these appears at the end of this subsection).
Results
Prevalence & etiologies (Table 1).
Clinical characteristics (Table 2).
Comparison to development datasets (Table 3).
External validation results (text + Figure 1 + decision curve).
Model updating results (Table 4).
Subgroup analysis (Table 5).
How they use tables
Table 1 – Etiologies of IDA (descriptive clinical context).
Table 2 – Baseline characteristics comparing GI malignancy vs non-malignancy.
Table 3 – Key Debray Step 1–2 table:
Columns: Dorset, Oxford, Sheffield, Chiang Mai.
Contains prevalence, median age/Hb/MCV and model performance (AUC, CITL, slope).
This table shows case-mix differences & performance drop together.
Table 4 – Update table:
Rows: original model, recalibration-in-the-large, logistic recalibration, refitting.
Columns: coefficients, AUC, CITL, slope.
This is the most important table for showing how updating changes metrics.
Table 5 – Subgroup (thalassemia vs non-thalassemia) discrimination and calibration.
🔑 Takeaway: When you update a model, show a dedicated table with rows = each update strategy and columns = performance metrics.
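For reference, here is a minimal sketch (in Python, with numpy, statsmodels, and scikit-learn as assumed tooling, not anything taken from Paper A) of how the validation metrics listed above (AUC, CITL, calibration slope, E/O) could be computed when applying an existing logistic model to a validation cohort. Variable and function names are illustrative.

```python
# Minimal sketch: Debray-style performance of an existing logistic model in a
# validation cohort. All names below are illustrative assumptions.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

def validate_original_model(X_val, y_val, beta_orig, intercept_orig):
    """Return AUC, CITL, calibration slope and E/O for the original model."""
    X_val, y_val = np.asarray(X_val, float), np.asarray(y_val, int)
    lp = intercept_orig + X_val @ np.asarray(beta_orig, float)   # original linear predictor
    p_hat = 1.0 / (1.0 + np.exp(-lp))                            # predicted probabilities

    auc = roc_auc_score(y_val, p_hat)                            # discrimination

    # Calibration-in-the-large: intercept of y ~ 1 with the linear predictor as offset
    citl = sm.GLM(y_val, np.ones((len(y_val), 1)),
                  family=sm.families.Binomial(), offset=lp).fit().params[0]

    # Calibration slope: slope of y ~ linear predictor
    slope = sm.GLM(y_val, sm.add_constant(lp),
                   family=sm.families.Binomial()).fit().params[1]

    eo = p_hat.mean() / y_val.mean()                             # expected/observed ratio
    return {"AUC": auc, "CITL": citl, "slope": slope, "E/O": eo}
```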
1.2 Paper B – CAC-prob with “no update needed”
Structure of the text
Introduction
Background on CAC scoring, Thai context.
Introduces CAC-prob (derived earlier).
Single aim: externally validate CAC-prob and confirm robustness.
Methods
The CAC-prob model: predictors, outcome categories, recommended cut-offs.
Study design & patients: retrospective temporal validation, inclusion/exclusion.
External validation:
Follows the Gehringer framework for validating models with polytomous outcomes.
Important: they state that if performance deteriorated, they would perform recalibration/refitting, so updating is pre-planned but conditional.
Metrics: ordinal C-index, generalized C-index, calibration slope, diagnostic indices, etc.
Results
Flow diagram (Figure 2).
Baseline characteristics comparison (Table 1).
Predictor-outcome associations (Table 2).
Model performance (Table 3 for C-indices, Table 4 for sensitivity/specificity).
Calibration plots (Figure 3).
There is no “model updating” section because they concluded that no updating was needed.
Discussion
Emphasizes:
Discrimination still strong.
Calibration slopes close to 1 (mild underfitting).
Diagnostic indices stable or improved.
Explicit sentence: “Therefore, model updating may not be necessary in this study.” (page 6).
How they use tables
Table 1 – Development vs validation characteristics, with standardized differences. Very good to show case-mix shift.
Table 2 – β coefficients / ORs in validation vs development (shows predictor effects are similar).
Table 3 – Discrimination metrics for each model.
Table 4 – Diagnostic indices at pre-specified decision cut-offs (sensitivity, specificity, PPV, NPV).
🔑 Takeaway: When you don’t update the model, focus your tables on showing: baseline similarity/difference (Table 1); stable predictor effects (Table 2); acceptable discrimination and calibration slopes (Table 3 & Figure 3); and classification performance at clinical thresholds (Table 4). A sketch for computing the standardized differences used in a Table 1-style comparison follows.
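As a small illustration, the standardized differences reported in a Table 1-style comparison can be computed with the usual pooled-SD formulas, as in this sketch. The example numbers reuse the dummy age values from Table X in Section 2.1, not data from Paper B.

```python
# Minimal sketch: standardized differences between development and validation
# cohorts for a Table 1-style case-mix comparison. Inputs are illustrative.
import numpy as np

def smd_continuous(mean_dev, sd_dev, mean_val, sd_val):
    """Standardized mean difference for a continuous characteristic."""
    pooled_sd = np.sqrt((sd_dev ** 2 + sd_val ** 2) / 2)
    return (mean_val - mean_dev) / pooled_sd

def smd_binary(p_dev, p_val):
    """Standardized difference for a binary characteristic (two proportions)."""
    pooled_var = (p_dev * (1 - p_dev) + p_val * (1 - p_val)) / 2
    return (p_val - p_dev) / np.sqrt(pooled_var)

# Example: age 65.2 (SD 11.3) in development vs 60.3 (SD 14.7) in Validation A
print(round(smd_continuous(65.2, 11.3, 60.3, 14.7), 2))  # about -0.37
```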
1.3 How you can copy the style
If your external validation shows the model is OK (no update):
Methods:
“If model performance deteriorated (substantial miscalibration or drop in discrimination), we pre-specified that we would perform recalibration and/or refitting.”
Results:
Show tables like Paper B: dev vs val characteristics, β/OR comparison, performance metrics, maybe diagnostic indices.
In Discussion:
Argue why updating is not needed (good discrimination, slope ≈ 1, clinically acceptable classification).
If your external validation shows poor performance (update):
Methods:
Add a “Model updating” subsection like Paper A.
Results:
Show a table like Paper A’s Table 4: each update method in rows.
Explain in text how AUC, CITL, slope changed after updating.
Conclude whether:
recalibration is enough, or
refitting/new model is needed.
2. Dummy example tables for “update vs non-update”
You can copy these into Word and just replace numbers & text.
2.1 Example table – Baseline and performance (works for both)
Table X. Comparison of development and validation cohorts and model performance
Dataset | N | Outcome prevalence, % | Mean age (SD) | Predictor 1 median (IQR) | AUC (95% CI) | CITL (95% CI) | Calibration slope (95% CI) |
Development | 1,000 | 10.0 | 65.2 (11.3) | 5.4 (3.1–7.8) | 0.78 (0.75–0.81) | 0 (ref) | 1.00 (ref) |
Validation A | 500 | 22.0 | 60.3 (14.7) | 7.0 (4.5–9.6) | 0.63 (0.58–0.68) | 2.80 (2.50–3.10) | 0.35 (0.20–0.50) |
Validation B | 400 | 30.0 | 62.1 (10.2) | 6.1 (4.0–8.0) | 0.77 (0.72–0.82) | 0.10 (−0.15 to 0.35) | 1.05 (0.85–1.25) |
Validation A looks like your “needs updating” situation.
Validation B looks like “no update needed”.
2.2 Example table – Model updating (like Paper A’s Table 4)
Table Y. Performance of original and updated models in the validation cohort
Model version | Intercept | β (Predictor 1) | β (Predictor 2) | AUC (95% CI) | CITL (95% CI) | Calibration slope (95% CI) |
Original model (development-based) | −2.00 | 0.80 | 0.50 | 0.63 (0.58–0.68) | 2.80 (2.50–3.10) | 0.35 (0.20–0.50) |
Recalibration-in-the-large | 0.85 | 0.80 | 0.50 | 0.63 (0.58–0.68) | 0.00 (−0.25 to 0.25) | 0.35 (0.20–0.50) |
Logistic recalibration | −0.10 | 0.56 | 0.35 | 0.63 (0.58–0.68) | 0.00 (−0.20 to 0.20) | 1.00 (0.60–1.40) |
Refitted model | −0.25 | 0.95 | 0.40 | 0.68 (0.62–0.73) | 0.00 (−0.22 to 0.22) | 1.00 (0.70–1.30) |
You can adapt columns depending on how many predictors you want to show. The key is:
Rows 1–3: show how recalibration improves calibration (first CITL, then also the slope) while leaving the AUC unchanged.
Row 4: refitting modestly improves the AUC and also fixes calibration.
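The rows of Table Y can be produced with the three standard updating fits. Below is a minimal sketch, assuming a binary outcome, a predictor matrix, and the original coefficients are available; statsmodels is an assumed choice and the function name is illustrative.

```python
# Minimal sketch: the three updating strategies behind Table Y.
# y: 0/1 outcome, X: predictor matrix, beta_orig/intercept_orig: original model.
import numpy as np
import statsmodels.api as sm

def update_model(X, y, beta_orig, intercept_orig):
    X, y, beta_orig = np.asarray(X, float), np.asarray(y, int), np.asarray(beta_orig, float)
    fam = sm.families.Binomial()
    lp = intercept_orig + X @ beta_orig                      # original linear predictor

    # 1. Recalibration-in-the-large: new intercept only, original coefficients kept via offset
    delta = sm.GLM(y, np.ones((len(y), 1)), family=fam, offset=lp).fit().params[0]
    recal_large = {"intercept": intercept_orig + delta, "beta": beta_orig}

    # 2. Logistic recalibration: intercept + common slope on the linear predictor,
    #    which rescales every original coefficient by the same factor
    a, b = sm.GLM(y, sm.add_constant(lp), family=fam).fit().params
    recal_logistic = {"intercept": a + b * intercept_orig, "beta": b * beta_orig}

    # 3. Refitting: all coefficients re-estimated in the validation data
    refit = sm.GLM(y, sm.add_constant(X), family=fam).fit().params
    refitted = {"intercept": refit[0], "beta": refit[1:]}

    return recal_large, recal_logistic, refitted
```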
2.3 Example table – “No update” performance summary (like Paper B’s Tables 3–4)
Table Z. Discrimination and classification performance of the validated model
Metric / Cut-off | Development cohort | Validation cohort |
Ordinal C-index | 0.81 | 0.78 |
Average C-index (Outcome A vs others) | 0.79 | 0.82 |
Average C-index (Outcome B vs others) | 0.73 | 0.77 |
Calibration slope | 1.00 | 1.10 |
Cut-off 1 (screening recommended) – Sens. | 88.4% | 90.0% |
Cut-off 1 – Spec. | 57.4% | 55.0% |
Cut-off 2 (high risk) – Sens. | 62.0% | 79.0% |
Cut-off 2 – Spec. | 64.0% | 64.0% |
This is a good format when you want to say: “Performance is similar or better; we did not update the model.”
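A small helper like the following sketch (plain NumPy, illustrative names) produces the sensitivity, specificity, PPV, and NPV cells of Table Z at a pre-specified probability cut-off.

```python
# Minimal sketch: classification indices at a pre-specified probability cut-off.
import numpy as np

def diagnostic_indices(p_hat, y, cutoff):
    """Sensitivity, specificity, PPV and NPV for predictions p_hat >= cutoff."""
    pred = (np.asarray(p_hat) >= cutoff).astype(int)
    y = np.asarray(y).astype(int)
    tp = np.sum((pred == 1) & (y == 1))
    fp = np.sum((pred == 1) & (y == 0))
    tn = np.sum((pred == 0) & (y == 0))
    fn = np.sum((pred == 0) & (y == 1))
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "PPV": tp / (tp + fp),
        "NPV": tn / (tn + fn),
    }
```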
3. Example “article-style” template (English)
Below is a generic mini-article you can adapt. Imagine we are reporting one validation where we updated the model and another where we did not. You can delete the part you don’t need.
Title
External validation of two clinical prediction models with and without model updating: a template report
Abstract
Background: Clinical prediction models should be validated in new populations before clinical use. When performance deteriorates, model updating may be necessary.
Methods: We performed external validation of two previously developed models in independent cohorts. Model A was evaluated in a population with a markedly higher outcome prevalence. Model B was evaluated in a cohort with a similar clinical spectrum but higher risk factor burden. Discrimination, calibration, and clinical utility were assessed. Model updating (recalibration-in-the-large, logistic recalibration, and refitting) was performed for Model A if poor performance was observed. For Model B, updating was pre-specified but only undertaken if clinically relevant deterioration was present.
Results: Model A showed reduced discrimination (AUC 0.63) and poor calibration (CITL 2.8, slope 0.35), indicating underestimation of risk and overfitting. After recalibration and refitting, calibration improved (CITL 0.0, slope 1.0) and discrimination increased modestly (AUC 0.68). Model B retained good discrimination (ordinal C-index 0.78) and only mild miscalibration (slope 1.10). Classification performance at clinically relevant cut-offs remained similar to or better than the development study, so no updating was performed.
Conclusion: This template illustrates how to report external validation with and without model updating. When calibration and discrimination are unacceptable, transparent updating and re-reporting of coefficients are required. When performance remains adequate, it is equally important to report that no updating was necessary, supported by tables and calibration plots.
Introduction
Clinical prediction models are increasingly used to support diagnostic and prognostic decision making. Their clinical value depends on performance in populations that differ from the development cohort. External validation studies therefore play a crucial role in establishing reproducibility and transportability of a model.
If external validation shows poor calibration or discrimination, model updating—such as intercept adjustment, logistic recalibration, or refitting of coefficients—may be warranted. Conversely, if a model performs well, it is important to document that no updating was necessary and to justify this conclusion using pre-specified metrics.
Here we provide a structured example of how to report external validation for two models: one requiring updating (Model A) and one that could be used without changes (Model B).
Methods
Study populations
Model A was validated in a retrospective cohort of adult patients attending [clinic/setting], recruited between [years]. The outcome of interest was [outcome], confirmed by [reference standard]. Model B was validated in a separate cohort of patients undergoing [test/assessment] at [institution] over [years].
In both cohorts, we collected the predictors required by the original models at the same time point as in the development studies.
Index models
Model A is a multivariable logistic regression model including [list of predictors]. The original regression equation and coefficients were taken from [reference].
Model B is [type of model; e.g. partial proportional odds model] predicting [outcome categories] using age, sex, and [list of other predictors].
External validation and performance measures
Following Debray et al., we first examined the relatedness between development and validation samples by comparing key baseline characteristics and outcome prevalence.
We then applied each model’s original linear predictor to the validation data and assessed:
Discrimination: AUC or ordinal C-index with 95% confidence intervals.
Calibration:
Calibration-in-the-large (CITL)
Calibration slope
Calibration plots (loess-smoothed observed vs predicted probabilities; a plotting sketch follows this list)
Classification performance: sensitivity, specificity, PPV, NPV at prespecified cut-offs when relevant.
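A minimal sketch of the loess/lowess-smoothed calibration plot mentioned above, assuming predicted probabilities and a binary outcome are already available; the plotting choices are illustrative.

```python
# Minimal sketch: lowess-smoothed calibration plot (observed vs predicted).
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

def calibration_plot(p_hat, y, frac=0.6):
    p_hat, y = np.asarray(p_hat, float), np.asarray(y, int)
    smoothed = lowess(y, p_hat, frac=frac)   # columns: sorted p_hat, smoothed observed
    plt.plot([0, 1], [0, 1], "--", color="grey", label="Ideal")
    plt.plot(smoothed[:, 0], smoothed[:, 1], label="Lowess-smoothed calibration")
    plt.xlabel("Predicted probability")
    plt.ylabel("Observed proportion")
    plt.legend()
    plt.show()
```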
Model updating strategy
For Model A, we pre-specified a hierarchical updating strategy:
Recalibration-in-the-large if the CITL differed substantially from 0.
Logistic recalibration (intercept + slope) if the calibration slope differed from 1.
Model refitting of all coefficients if both calibration and discrimination were unsatisfactory.
For Model B, we planned to perform recalibration only if we observed a large drop in discrimination (e.g. AUC decrease >0.05) or significant miscalibration (CITL outside ±0.5 or slope outside 0.7–1.3). If performance remained acceptable, we would retain the original model without updating.
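Expressed as code, the pre-specified criteria for Model B amount to a simple rule such as the sketch below (the thresholds mirror the text; the function name is illustrative).

```python
# Minimal sketch: pre-specified decision rule for conditional updating of Model B.
def updating_needed(auc_dev, auc_val, citl, slope):
    """True if the pre-specified criteria for recalibration are met."""
    large_auc_drop = (auc_dev - auc_val) > 0.05
    miscalibrated = abs(citl) > 0.5 or not (0.7 <= slope <= 1.3)
    return large_auc_drop or miscalibrated

# Example with Model B-style numbers: no updating needed
print(updating_needed(auc_dev=0.81, auc_val=0.78, citl=0.05, slope=1.10))  # False
```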
Results
Baseline characteristics and relatedness
Model A’s validation cohort included N₁ patients, with an outcome prevalence of 22%, compared with 10% in the development study. Patients were younger, with lower [relevant marker], and had [other characteristic differences]. These findings suggested a substantial case-mix shift.
Model B’s validation cohort included N₂ patients, with similar age distribution but higher prevalence of [risk factors] and a higher proportion of high-risk outcomes. Although the risk profile was higher, the spectrum of disease remained comparable to the development cohort.
Table X summarises the main characteristics and model performance in the development and validation cohorts.
External validation of Model A (update required)
Applying Model A to the validation data yielded an AUC of 0.63 (95% CI 0.58–0.68). Calibration was poor, with CITL 2.80 (95% CI 2.50–3.10) and a calibration slope of 0.35 (95% CI 0.20–0.50). The calibration plot showed systematic underestimation across the risk range.
We then applied the pre-specified updating steps (Table Y):
Recalibration-in-the-large corrected CITL to 0.0 but did not change the slope or AUC.
Logistic recalibration further adjusted the slope to approximately 1.0, improving calibration while leaving discrimination unchanged.
Refitting all coefficients improved the AUC to 0.68 (95% CI 0.62–0.73), with CITL 0.0 and slope 1.0.
Despite these improvements, the model’s discrimination remained only moderate, and we concluded that a locally tailored model may ultimately be needed.
External validation of Model B (no update)
For Model B, discrimination remained strong, with an ordinal C-index of 0.78, comparable to 0.81 in the development study. Calibration slopes for the two binary contrasts were 1.10 and 1.05, indicating mild underfitting but acceptable overall calibration.
Diagnostic performance at the pre-specified decision thresholds was also similar or superior to the development study (Table Z). Sensitivity for identifying high-risk patients improved, while specificity remained stable.
Given the preserved discrimination, near-ideal calibration slopes, and clinically acceptable classification metrics, we did not perform any model updating for Model B.
Discussion
This example illustrates two contrasting scenarios that investigators commonly face when externally validating clinical prediction models.
For Model A, large case-mix differences and a substantial shift in outcome prevalence led to poor calibration and modest discrimination. Stepwise updating improved calibration and slightly improved discrimination, but residual limitations suggested that regional redevelopment of the model may be preferable.
For Model B, although the validation cohort had a higher-risk profile, predictor–outcome relationships remained stable and model performance was robust. In this context, the model could be implemented without updating, but it was still essential to report the evaluation results—including calibration plots and classification indices—to justify this decision.
When you design your own validation study, you can follow the same reporting pattern:
Always present baseline comparisons and performance metrics for development vs validation data.
Pre-specify clear criteria for when to update the model.
If you update, provide a table of updated coefficients and metrics.
If you do not update, provide a transparent explanation of why the original model is adequate.
Source Information
Paper A — Model Updating
Title: Clinical Characteristics of Gastrointestinal Malignancy and Validation of the IDIOM Model Among Patients With Iron Deficiency Anemia in Thailand
Authors: Tantraworasin A, Phinyo P, Suwannasom P, et al.
Journal: JGH Open (Wiley)
Year: 2023
DOI: https://doi.org/10.1002/jgh3.70221
Publisher Link: https://onlinelibrary.wiley.com/doi/10.1002/jgh3.70221
Usage Credit: This article was used as the primary example of a clinical prediction model requiring updating. The study demonstrates recalibration-in-the-large, logistic recalibration, and full model refitting when external validation shows significant miscalibration and reduced discrimination. All summaries, comparisons, and interpretive analyses in this document referencing “Paper A” derive from this publication.
Paper B — No Model Updating
Title: Performance of CAC-prob in Predicting Coronary Artery Calcium Score: An External Validation Study in a High-CAC Burden Population
Authors: Wongyikul P, Phinyo P, Suwannasom P, et al.
Journal: BMC Medical Informatics and Decision Making (Springer Nature)
Year: 2025
DOI: https://doi.org/10.1186/s12911-025-03128-y
Publisher Link: https://link.springer.com/article/10.1186/s12911-025-03128-y
Usage Credit: This article was used as the example of a clinical prediction model that does not require updating. The study illustrates how preserved discrimination, stable coefficients, and acceptable calibration support the decision to retain the original model without adjustments. All summaries and comparisons referring to “Paper B” were derived from this publication.




