Why Calibration Confidence Intervals Differ: Wald vs Wilson and Parametric GLM vs LOESS

Abstract

Calibration analysis is central to evaluating clinical prediction models, yet confidence intervals (CIs) in calibration plots often differ substantially depending on the statistical method used. These differences are not cosmetic but arise from distinct assumptions about uncertainty. This article clarifies two major sources of variation: (1) binomial CI estimation using Wald versus Wilson methods, and (2) calibration curve estimation using parametric logistic regression versus non-parametric LOESS smoothing. Practical guidance is provided on when each method is appropriate.

1. Introduction

Calibration assesses the agreement between predicted probabilities and observed outcomes. It is typically visualized using:

Grouped observed risks (e.g., deciles) with error bars
A smoothed calibration curve with a confidence band

However, these components are not methodologically neutral. Different statistical choices produce systematically different confidence intervals, which may lead to overconfident or misleading interpretations.

From a clinimetric perspective, calibration is an estimation problem:

Observed Risk = f(Predicted Risk | sampling error + model assumptions)

Thus, the validity of calibration depends critically on how uncertainty is quantified.

2. Binomial Confidence Intervals: Wald vs Wilson

2.1 Wald Interval

The Wald interval is based on a normal approximation:

\hat{p} \pm 1.96 \sqrt{\frac{\hat{p} (1 - \hat{p})}{n}}

Assumptions

Large sample size
Symmetric distribution around the estimate

Limitations

Poor performance when sample size is small
Unreliable when the proportion is near 0 or 1
Can produce impossible values (<0 or >1)
Actual coverage often falls below the nominal 95%, so uncertainty is understated

These limitations arise because the binomial distribution is skewed and bounded, so a symmetric normal approximation fits it poorly when n is small or the proportion is close to 0 or 1.
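The boundary problem is easy to reproduce. The following is a minimal sketch of the Wald formula above; the function name and the event counts are hypothetical and chosen only for illustration.

```python
import math

def wald_interval(events, n, z=1.96):
    """Wald interval for a binomial proportion (normal approximation)."""
    p_hat = events / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

# Hypothetical risk group: 1 event among 20 patients.
low, high = wald_interval(1, 20)
print(f"1/20 events: ({low:.3f}, {high:.3f})")  # the lower limit is negative

# With zero events the estimated variance is zero,
# so the interval collapses to the single point 0.
print(wald_interval(0, 20))
```

Both outputs are implausible as statements of uncertainty: a risk cannot be negative, and observing no events among 20 patients does not establish that the true risk is exactly zero.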

2.2 Wilson Interval

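The Wilson (score) interval is obtained by inverting the score test for a binomial proportion: the standard error is evaluated at the hypothesized proportion rather than at the estimate. The result is an interval whose centre is pulled from the estimate toward 0.5 and whose limits respect the boundaries of the probability scale.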
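For reference, the standard closed form of the Wilson score interval, written in the same notation as the Wald formula (with z = 1.96 for a 95% interval), is:

\frac{\hat{p} + \frac{z^{2}}{2n}}{1 + \frac{z^{2}}{n}} \pm \frac{z}{1 + \frac{z^{2}}{n}} \sqrt{\frac{\hat{p} (1 - \hat{p})}{n} + \frac{z^{2}}{4n^{2}}}

As n grows, the terms in z^{2}/n vanish and the expression reduces to the Wald interval.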

Properties

Better coverage probability (closer to the nominal 95%)
Naturally constrained within [0,1]
More accurate for small samples and extreme probabilities

Interpretation

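The Wilson method still relies on a normal approximation, but it applies that approximation in a way that respects the binomial variance structure and the [0, 1] bounds. It is therefore preferred for estimating observed risks in calibration plots.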
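A minimal sketch of the Wilson score interval follows, using the standard closed-form expression. The function name and the example counts are hypothetical; the final clamping step is an added safeguard against floating-point round-off and is not part of the formula itself.

```python
import math

def wilson_interval(events, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    # Clamp only to absorb round-off; the exact limits already lie in [0, 1].
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

# Hypothetical risk groups of 20 patients with 0, 1 and 10 events.
for events in (0, 1, 10):
    low, high = wilson_interval(events, 20)
    print(f"{events}/20 events: ({low:.3f}, {high:.3f})")
```

With zero events the upper limit remains clearly above zero, and with one event the lower limit stays positive, so the interval remains interpretable in the sparse groups that are common at the extremes of a calibration plot.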

2.3 When to Use Each

Situation	Recommended Method
Small sample size per group	Wilson
Proportions near 0 or 1	Wilson
Calibration plots (publication standard)	Wilson
Very large samples, mid-range probabilities	Wald acceptable but not preferred
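The recommendations in this table can be checked numerically. The sketch below computes the exact coverage of each interval by summing binomial probabilities over every possible event count. The scenario (20 patients per group, true risk 0.05) and all function names are assumptions made for illustration.

```python
import math

def wald_interval(events, n, z=1.96):
    p_hat = events / n
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width

def wilson_interval(events, n, z=1.96):
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

def exact_coverage(interval, n, p):
    """Probability that the interval contains the true p, over all outcomes."""
    covered = 0.0
    for k in range(n + 1):
        low, high = interval(k, n)
        if low <= p <= high:
            covered += math.comb(n, k) * p**k * (1 - p) ** (n - k)
    return covered

n, p = 20, 0.05  # hypothetical small group with a low true risk
print("Wald coverage:  ", round(exact_coverage(wald_interval, n, p), 3))
print("Wilson coverage:", round(exact_coverage(wilson_interval, n, p), 3))
```

In this scenario the Wald interval covers the true risk only about 64% of the time, largely because the zero-event outcome (probability 0.95^20, roughly 0.36) yields a degenerate interval that excludes 0.05. The Wilson interval covers it about 92% of the time, which is not exactly nominal but far closer.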

3. Calibration Curve Estimation: Parametric GLM vs LOESS

3.1 Parametric Logistic Regression (GLM)

A parametric calibration model is typically specified as:

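\operatorname{logit} P(Y = 1) = \beta_{0} + \beta_{1} \cdot \mathrm{LP}

where LP is the linear predictor of the model under evaluation (the logit of its predicted probability), β_0 is the calibration intercept, and β_1 is the calibration slope.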
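Any logistic regression routine can fit this recalibration model. To keep the example dependency-free, the sketch below uses a hand-written Newton-Raphson loop for the two parameters. The function name and the toy data are hypothetical, and in practice a statistics package would normally be used instead.

```python
import math

def fit_logistic_recalibration(lp, y, n_iter=25, tol=1e-10):
    """Fit logit P(Y=1) = b0 + b1 * LP by Newton-Raphson; returns (b0, b1)."""
    b0, b1 = 0.0, 1.0  # start from perfect calibration
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, yi in zip(lp, y):
            mu = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = mu * (1.0 - mu)
            g0 += yi - mu          # score for the intercept
            g1 += (yi - mu) * x    # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

# Toy data: predicted log-odds of -2 and +2 (risks of about 12% and 88%),
# but observed event rates of only 25% and 75%.
lp = [-2.0] * 8 + [2.0] * 8
y = [1, 1, 0, 0, 0, 0, 0, 0] + [1, 1, 1, 1, 1, 1, 0, 0]
intercept, slope = fit_logistic_recalibration(lp, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

In this toy example the fitted slope is below 1, the usual signature of predictions that are too extreme relative to the observed risks.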

Properties

Strong structural assumption (linear relationship in log-odds)
Only two parameters (intercept and slope)
Smooth, stable curve

Confidence Interval Behavior

Narrow confidence bands
Reflect only uncertainty in estimating the regression coefficients
Do not capture local deviations from the assumed model form
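To make this concrete, one common construction (assumed here; software implementations vary) is a pointwise Wald band computed on the logit scale from the estimated coefficients and their covariance matrix:

\hat{\beta}_{0} + \hat{\beta}_{1} \cdot \mathrm{LP} \pm 1.96 \sqrt{\mathrm{Var}(\hat{\beta}_{0}) + 2 \cdot \mathrm{LP} \cdot \mathrm{Cov}(\hat{\beta}_{0}, \hat{\beta}_{1}) + \mathrm{LP}^{2} \cdot \mathrm{Var}(\hat{\beta}_{1})}

The limits are then mapped back to the probability scale with the inverse logit. Only the variances and covariance of the two coefficients enter the expression, so the band narrows as the total sample grows, regardless of how well the linear form fits in any particular risk range.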

Limitation

If the true calibration relationship is non-linear, this approach underestimates uncertainty and masks miscalibration.

3.2 LOESS (Locally Weighted Smoothing)

LOESS is a non-parametric method that fits local regressions.

Properties

Flexible, data-driven
Adapts to local structure
Does not impose a global functional form
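The local-fitting idea can be sketched in a few lines. The code below is a simplified illustration, not a full LOESS implementation: it uses degree-1 local fits with tricube weights and omits the extra options found in statistical packages. The function names, the span value, and the toy data are all hypothetical.

```python
import math

def tricube(u):
    """Tricube kernel weight for a scaled distance u."""
    u = abs(u)
    return (1 - u**3) ** 3 if u < 1 else 0.0

def loess_point(x0, xs, ys, span=0.75):
    """Weighted local linear fit evaluated at x0."""
    n = len(xs)
    k = min(n, max(2, math.ceil(span * n)))
    # Bandwidth: distance from x0 to its k-th nearest neighbour.
    bandwidth = sorted(abs(x - x0) for x in xs)[k - 1]
    if bandwidth == 0.0:
        same = [y for x, y in zip(xs, ys) if x == x0]
        return sum(same) / len(same)
    w = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(w)
    if sw == 0.0:
        raise ValueError("span too small: no point receives positive weight")
    swx = sum(wi * x for wi, x in zip(w, xs))
    swy = sum(wi * y for wi, y in zip(w, ys))
    swxx = sum(wi * x * x for wi, x in zip(w, xs))
    swxy = sum(wi * x * y for wi, x, y in zip(w, xs, ys))
    denom = sw * swxx - swx**2
    if denom <= 1e-12 * sw * sw:
        return swy / sw  # no spread in x locally: fall back to weighted mean
    slope = (sw * swxy - swx * swy) / denom
    intercept = (swy - slope * swx) / sw
    return intercept + slope * x0

# Toy calibration data: predicted risks and binary outcomes.
pred = [0.05, 0.10, 0.15, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 0.95]
outcome = [0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1]
for x0 in (0.1, 0.5, 0.9):
    print(f"predicted {x0:.1f} -> smoothed observed {loess_point(x0, pred, outcome):.3f}")
```

Because each fitted value depends only on nearby observations, regions with few data points yield unstable estimates, which is why the corresponding confidence bands widen there.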

Confidence Interval Behavior

Wider intervals in sparse regions
Reflect both sampling variability and local instability
More sensitive to irregular patterns

Limitations

Less stable at extremes (very low/high predicted risk)
Requires sufficient data density

3.3 Conceptual Difference

Parametric GLM estimates: “What is the best-fitting global calibration line?”
LOESS estimates: “What does the data actually look like locally?”

These represent fundamentally different inferential targets.

3.4 When to Use Each

Situation	Recommended Method
Model evaluation (honest calibration assessment)	LOESS
Detecting local miscalibration	LOESS
Small datasets with sparse regions	GLM (with caution)
Reporting calibration intercept and slope	GLM
Publication-quality visualization	LOESS (or splines)

4. Combined Effects on Calibration Plots

When methods are combined:

Component	Conservative (typically wider CI)	Optimistic (typically narrower CI)
Decile error bars	Wilson	Wald
Curve estimation	LOESS	Parametric GLM

Combining Wald error bars with a parametric GLM curve tends to produce the narrowest confidence intervals, which may create an illusion of strong calibration.

5. Clinical and Methodological Implications

Calibration is directly linked to clinical decision-making. Overly narrow confidence intervals imply unjustified certainty in predicted risks, which can lead to:

Misclassification of patient risk
Inappropriate treatment decisions
Overconfidence in prediction models

In clinical prediction modeling, uncertainty must be accurately represented to ensure safe and reliable application [6].

6. Recommended Practice

For rigorous calibration analysis:

Observed Risk (Grouped Data)

Use Wilson confidence intervals
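A minimal sketch of this step follows, assuming equal-size groups formed by sorting on predicted risk. The function names and the toy data are hypothetical; the clamping in the interval is a round-off safeguard only.

```python
import math

def wilson_interval(events, n, z=1.96):
    p_hat = events / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)
    )
    return max(0.0, centre - half_width), min(1.0, centre + half_width)

def grouped_calibration(pred, outcome, n_groups=10):
    """Return (mean predicted, observed risk, CI low, CI high) per group."""
    pairs = sorted(zip(pred, outcome))  # order patients by predicted risk
    n = len(pairs)
    rows = []
    for g in range(n_groups):
        chunk = pairs[g * n // n_groups:(g + 1) * n // n_groups]
        if not chunk:
            continue  # fewer patients than groups
        size = len(chunk)
        events = sum(y for _, y in chunk)
        mean_pred = sum(p for p, _ in chunk) / size
        low, high = wilson_interval(events, size)
        rows.append((mean_pred, events / size, low, high))
    return rows

# Toy data: 100 patients with evenly spaced predicted risks.
pred = [(i + 0.5) / 100 for i in range(100)]
outcome = [1 if i % 4 == 0 else 0 for i in range(100)]
for mean_pred, obs, low, high in grouped_calibration(pred, outcome):
    print(f"predicted {mean_pred:.2f}  observed {obs:.2f}  95% CI ({low:.2f}, {high:.2f})")
```

Each row corresponds to one point and error bar in the grouped part of a calibration plot.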

Calibration Curve

Prefer LOESS or flexible methods (e.g., splines) for visualization
Use a parametric GLM for reporting calibration intercept and slope

General Principle

Choose methods that reflect the true uncertainty of the data rather than those that produce visually appealing results.

7. Conclusion

Differences in calibration confidence intervals arise from methodological choices, not from the underlying model alone. The Wald method and parametric GLM tend to underestimate uncertainty, whereas Wilson intervals and LOESS smoothing provide more reliable representations of variability.

Careful selection of methods is essential to avoid misleading conclusions about model performance and to maintain validity in clinical decision-making.

Key Takeaways

Wald intervals underestimate uncertainty, especially in small samples or extreme probabilities
Wilson intervals achieve coverage closer to the nominal level and always stay within [0, 1]
Parametric GLM produces narrow, assumption-driven confidence bands
LOESS reflects data-driven variability and reveals local miscalibration
Method choice directly impacts clinical interpretation of model performance
