Why Calibration Confidence Intervals Differ: Wald vs Wilson and Parametric GLM vs LOESS
- Mayta

- Mar 29

Abstract
Calibration analysis is central to evaluating clinical prediction models, yet confidence intervals (CIs) in calibration plots often differ substantially depending on the statistical method used. These differences are not cosmetic but arise from distinct assumptions about uncertainty. This article clarifies two major sources of variation: (1) binomial CI estimation using Wald versus Wilson methods, and (2) calibration curve estimation using parametric logistic regression versus non-parametric LOESS smoothing. Practical guidance is provided on when each method is appropriate.
1. Introduction
Calibration assesses the agreement between predicted probabilities and observed outcomes. It is typically visualized using:
Grouped observed risks (e.g., deciles) with error bars
A smoothed calibration curve with a confidence band
However, these components are not methodologically neutral. Different statistical choices produce systematically different confidence intervals, which may lead to overconfident or misleading interpretations.
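To make the first component concrete, here is a minimal sketch of how grouped observed risks are usually formed before any interval is attached. The function name `grouped_observed_risk` and the toy data are hypothetical and not taken from the article; the grouping rule (sort by predicted risk, split into near-equal groups) is one common convention among several.

```python
def grouped_observed_risk(preds, outcomes, n_groups=10):
    """Sort by predicted risk, split into near-equal groups, summarise each.

    Returns a list of (mean_predicted_risk, events, group_size) tuples.
    The events and group size are what a binomial CI method needs as input.
    """
    pairs = sorted(zip(preds, outcomes))
    n = len(pairs)
    groups = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:  # more groups than observations
            continue
        mean_pred = sum(p for p, _ in chunk) / len(chunk)
        events = sum(y for _, y in chunk)
        groups.append((mean_pred, events, len(chunk)))
    return groups


# Made-up example: 20 predictions, events concentrated at higher predicted risk.
toy_preds = [i / 20 for i in range(1, 21)]
toy_outcomes = [0] * 10 + [1] * 10
for mean_pred, events, size in grouped_observed_risk(toy_preds, toy_outcomes, 4):
    print(f"mean predicted {mean_pred:.2f}: {events}/{size} events")
```

Each group contributes only a handful of observations, which is why the choice of interval method for the error bars matters.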
From a clinimetric perspective, calibration is an estimation problem:
Thus, the validity of calibration depends critically on how uncertainty is quantified.

2. Binomial Confidence Intervals: Wald vs Wilson
2.1 Wald Interval
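The Wald interval is based on a normal approximation to the sampling distribution of the observed proportion:
$$\hat{p} \pm z_{1-\alpha/2}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
where $\hat{p}$ is the observed proportion of events among $n$ observations and $z_{1-\alpha/2} \approx 1.96$ for a 95% interval.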
Assumptions
Large sample size
Approximately normal, and therefore symmetric, sampling distribution of the estimated proportion
Limitations
Poor performance when sample size is small
Unreliable when the proportion is near 0 or 1
Can produce impossible values (<0 or >1)
Tends to understate uncertainty: actual coverage often falls below the nominal 95%
These limitations arise because the binomial distribution is not well approximated by a normal distribution under many practical conditions.
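The two boundary problems listed above are easy to reproduce. The following sketch uses the standard Wald formula; the function name `wald_interval` and the counts are made up for illustration.

```python
import math
from statistics import NormalDist


def wald_interval(events, n, level=0.95):
    """Wald interval: p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = events / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


# Illustrative counts only.
print(wald_interval(1, 20))   # lower bound falls below 0
print(wald_interval(0, 10))   # zero-width interval at exactly 0
```

With 1 event in 20 the lower limit is negative, and with 0 events in 10 the interval collapses to a single point, which implies no uncertainty at all.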
2.2 Wilson Interval
The Wilson interval is obtained by inverting the score test for a binomial proportion. It is asymmetric around the observed proportion and respects the 0 and 1 boundaries.
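For reference, the standard form of the Wilson score interval is given below, writing $\hat{p}$ for the observed proportion, $n$ for the group size, and $z$ for the normal quantile (about 1.96 for a 95% interval). The notation is chosen here for illustration.

```latex
\[
\frac{\hat{p} + \dfrac{z^{2}}{2n}}{1 + \dfrac{z^{2}}{n}}
\;\pm\;
\frac{z}{1 + \dfrac{z^{2}}{n}}
\sqrt{\frac{\hat{p}(1-\hat{p})}{n} + \frac{z^{2}}{4n^{2}}}
\]
```

The centre of the interval is pulled from $\hat{p}$ toward 0.5, and the extra $z^{2}/(4n^{2})$ term keeps the width positive even when $\hat{p}$ is 0 or 1.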
Properties
Better coverage probability (closer to true 95%)
Naturally constrained within [0,1]
More accurate for small samples and extreme probabilities
Interpretation
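Because the Wilson interval evaluates the standard error at each candidate value of the true proportion rather than at the observed estimate, it tracks the binomial variance more faithfully than the Wald interval. It is therefore preferred for the error bars on observed risks in calibration plots.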
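A minimal sketch of the Wilson score interval, using the standard closed-form expression, shows these properties on small groups. The function name `wilson_interval` and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def wilson_interval(events, n, level=0.95):
    """Wilson score interval for a binomial proportion."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    p_hat = events / n
    denom = 1 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z * z / (4 * n * n)
    )
    # Clamping only guards against floating-point rounding at the boundaries;
    # the exact interval already lies within [0, 1].
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Illustrative counts only.
print(wilson_interval(1, 20))   # asymmetric around 0.05, lower bound above 0
print(wilson_interval(0, 10))   # starts at 0 but has a non-zero upper bound
```

Even with zero observed events the upper limit is well above 0, so the error bar still conveys that the group is small.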

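2.3 When to Use Each
The Wald interval is adequate only when groups are large and observed proportions are well away from 0 and 1. The Wilson interval is the safer default for grouped observed risks, particularly with small groups or extreme proportions.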
3. Calibration Curve Estimation: Parametric GLM vs LOESS
3.1 Parametric Logistic Regression (GLM)
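A parametric calibration model is typically specified as:
$$\operatorname{logit} P(Y = 1) = \alpha + \beta \cdot \mathrm{LP}$$
where LP is the linear predictor from the model (the log-odds of the predicted probability), $\alpha$ is the calibration intercept, and $\beta$ is the calibration slope.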
Properties
Strong structural assumption (linear relationship in log-odds)
Only two parameters (intercept and slope)
Smooth, stable curve
Confidence Interval Behavior
Narrow confidence bands
Reflect only uncertainty in estimating the regression coefficients
Do not capture local deviations from the assumed model form
Limitation
If the true calibration relationship is non-linear on the log-odds scale, the confidence band understates uncertainty and can mask miscalibration.
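The sketch below illustrates why the band depends only on the two coefficients. It fits the logistic recalibration model by Newton-Raphson on grouped data and builds a pointwise band on the log-odds scale before back-transforming. The function names and the grouped counts are invented for illustration; in practice a standard GLM routine would be used instead of a hand-written fit.

```python
import math
from statistics import NormalDist


def expit(t):
    return 1.0 / (1.0 + math.exp(-t))


def fit_recalibration(groups, n_iter=25):
    """Fit logit P(Y=1) = a + b*LP by Newton-Raphson.

    groups: list of (lp, n, events). Returns (a, b, cov), where cov is the
    2x2 inverse Fisher information, evaluated at the last iterate before the
    final (negligible) update.
    """
    a, b = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = i00 = i01 = i11 = 0.0
        for lp, n, events in groups:
            p = expit(a + b * lp)
            resid = events - n * p          # score contribution
            w = n * p * (1.0 - p)           # information weight
            g0 += resid
            g1 += lp * resid
            i00 += w
            i01 += w * lp
            i11 += w * lp * lp
        det = i00 * i11 - i01 * i01
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    cov = ((i11 / det, -i01 / det), (-i01 / det, i00 / det))
    return a, b, cov


def calibration_band(lp0, a, b, cov, level=0.95):
    """Pointwise band at LP = lp0: (lower, fitted, upper) probabilities."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    eta = a + b * lp0
    var = cov[0][0] + 2 * lp0 * cov[0][1] + lp0 ** 2 * cov[1][1]
    se = math.sqrt(var)
    return expit(eta - z * se), expit(eta), expit(eta + z * se)


# Made-up grouped data: (linear predictor, group size, events).
toy = [(-2, 50, 6), (-1, 50, 13), (0, 50, 25), (1, 50, 37), (2, 50, 44)]
a, b, cov = fit_recalibration(toy)
print(a, b)
print(calibration_band(1.0, a, b, cov))
```

The variance at any point is a function of the 2x2 coefficient covariance alone, so the band cannot widen in response to local departures from the assumed line.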
3.2 LOESS (Locally Weighted Smoothing)
LOESS is a non-parametric method that fits local regressions.
Properties
Flexible, data-driven
Adapts to local structure
Does not impose a global functional form
Confidence Interval Behavior
Wider intervals in sparse regions
Reflect both sampling variability and local instability
More sensitive to irregular patterns
Limitation
Less stable at extremes (very low/high predicted risk)
Requires sufficient data density
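The instability at the extremes can be seen in a stripped-down local linear smoother. The sketch below is not a full LOESS implementation (no robustness iterations, and the name `loess_point` is hypothetical). It uses tricube weights with a nearest-neighbour span and returns, alongside the fitted value, the sum of squared smoother weights, which is the factor multiplying the residual variance in the variance of the fit.

```python
def loess_point(x0, xs, ys, span=0.5):
    """Local linear fit at x0 with tricube weights.

    Returns (fit, var_factor). The fit is sum(l_i * y_i) for smoother
    weights l_i, and var_factor = sum(l_i ** 2), so that
    Var(fit) = sigma^2 * var_factor under independent errors.
    """
    n = len(xs)
    k = min(n, max(3, int(round(span * n))))      # neighbours in the window
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    s0 = sum(w)
    s1 = sum(wi * (x - x0) for wi, x in zip(w, xs))
    s2 = sum(wi * (x - x0) ** 2 for wi, x in zip(w, xs))
    det = s0 * s2 - s1 * s1
    if det <= 0:
        raise ValueError("not enough distinct x values near x0")
    l = [wi * (s2 - s1 * (x - x0)) / det for wi, x in zip(w, xs)]
    fit = sum(li * yi for li, yi in zip(l, ys))
    var_factor = sum(li * li for li in l)
    return fit, var_factor


# Made-up, evenly spaced predicted risks with a perfectly linear relationship.
xs = [i / 40 for i in range(41)]
ys = [0.1 + 0.8 * x for x in xs]
print(loess_point(0.5, xs, ys))   # interior point
print(loess_point(1.0, xs, ys))   # boundary point: larger variance factor
```

At the boundary the window is one-sided, so the variance factor is noticeably larger than in the interior even though the same number of neighbours is used. This is one reason LOESS bands flare at very low and very high predicted risks.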
3.3 Conceptual Difference
Parametric GLM estimates: “What is the best-fitting global calibration line?”
LOESS estimates: “What does the data actually look like locally?”
These represent fundamentally different inferential targets.

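3.4 When to Use Each
Use the parametric GLM when the goal is to summarize calibration with an intercept and slope. Use LOESS, or another flexible smoother, when the goal is to inspect the shape of the calibration curve, provided the data are dense enough to support local fitting.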
4. Combined Effects on Calibration Plots
When methods are combined, their effects compound. Using Wald intervals together with a parametric GLM results in systematically narrow confidence intervals, which may create an illusion of strong calibration.

5. Clinical and Methodological Implications
Calibration is directly linked to clinical decision-making. Overly narrow confidence intervals imply unjustified certainty in predicted risks, which can lead to:
Misclassification of patient risk
Inappropriate treatment decisions
Overconfidence in prediction models
In clinical prediction modeling, uncertainty must be accurately represented to ensure safe and reliable application [6].
6. Recommended Practice
For rigorous calibration analysis:
Observed Risk (Grouped Data)
Use Wilson confidence intervals
Calibration Curve
Prefer LOESS or flexible methods (e.g., splines) for visualization
Use a parametric GLM for reporting calibration intercept and slope
General Principle
Choose methods that reflect the true uncertainty of the data rather than those that produce visually appealing results.

7. Conclusion
Differences in calibration confidence intervals arise from methodological choices, not from the underlying model alone. The Wald method tends to understate uncertainty, as does a parametric GLM when its linearity assumption does not hold, whereas Wilson intervals and LOESS smoothing provide more reliable representations of variability.
Careful selection of methods is essential to avoid misleading conclusions about model performance and to maintain validity in clinical decision-making.
Key Takeaways
Wald intervals underestimate uncertainty, especially in small samples or extreme probabilities
Wilson intervals give coverage closer to the nominal level and stay within [0,1]
Parametric GLM produces narrow, assumption-driven confidence bands
LOESS reflects data-driven variability and reveals local miscalibration
Method choice directly impacts clinical interpretation of model performance


