Assessment of Calibration in Logistic Regression Models Using the Hosmer–Lemeshow Test and Calibration Plots

  • Writer: Mayta
  • 5 min read

Introduction

In the development of clinical prediction models, particularly those based on logistic regression, evaluating model performance is essential before applying the model in clinical practice. Model performance is typically assessed in two main dimensions: discrimination and calibration.

Discrimination refers to the ability of a model to correctly distinguish between individuals who experience an event and those who do not. This is commonly measured using metrics such as the Area Under the Receiver Operating Characteristic Curve (AUROC).

In contrast, calibration refers to the agreement between predicted probabilities generated by the model and the actual observed outcomes. Calibration answers a critical clinical question:

Does the predicted risk correspond to the true probability of the outcome occurring?

For example, if a prediction model estimates that a group of patients has a 20% probability of experiencing a clinical event, approximately 20 out of 100 such patients should experience the event if the model is well calibrated.
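This idea can be illustrated with a small simulation (a hypothetical cohort, not real patient data): if outcomes truly follow the predicted risks, the observed event rate in a group assigned a 20% risk should come out close to 0.20.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical cohort: 100 patients, all assigned a predicted risk of 20%.
predicted_risk = np.full(100, 0.20)

# If the model is well calibrated, outcomes behave like Bernoulli(0.20) draws.
outcomes = rng.binomial(1, predicted_risk)

observed_rate = outcomes.mean()
print(f"Mean predicted risk:  {predicted_risk.mean():.2f}")
print(f"Observed event rate:  {observed_rate:.2f}")
```

With only 100 patients the observed rate will fluctuate around 0.20 by sampling variation alone, which is one reason calibration is assessed over groups rather than individual patients.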

Two commonly used approaches for evaluating calibration in logistic regression models are the Hosmer–Lemeshow goodness-of-fit test and the calibration plot. Although both methods assess calibration, they differ in their methodological approach and interpretability.


Concept of Calibration

Calibration reflects the agreement between predicted probabilities and observed event rates across the spectrum of predicted risk.

A well-calibrated model produces predicted probabilities that closely match the true outcome frequencies. Poor calibration occurs when the model systematically overestimates or underestimates risk.

For instance:

  • If patients predicted to have a 30% risk actually experience the event in about 30% of cases, the model demonstrates good calibration.

  • If only about 10% of those patients experience the event, the model overestimates risk.

  • If about 50% of them experience the event, the model underestimates risk.

Calibration is particularly important in clinical decision-making, because predicted probabilities are often used to guide treatment thresholds, risk communication, and preventive interventions.


The Hosmer–Lemeshow Test

The Hosmer–Lemeshow test is a statistical goodness-of-fit test designed to evaluate calibration in logistic regression models.

The test compares the observed number of events with the expected number of events predicted by the model across groups of individuals with similar predicted risks.

The general procedure includes the following steps:

  1. Calculate predicted probabilities for all individuals using the logistic regression model.

  2. Rank individuals according to predicted probability.

  3. Divide the sample into groups (typically deciles of risk, i.e., 10 groups).

  4. Within each group, compare the observed number of events and the expected number of events predicted by the model.

  5. Compute a chi-square statistic summarizing the discrepancy between observed and expected values.
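The steps above can be sketched in Python. This is a minimal illustration on simulated data, not a reference implementation from any particular package; `hosmer_lemeshow`, `p_hat`, and `y_obs` are hypothetical names introduced here.

```python
import numpy as np
from scipy.stats import chi2

def hosmer_lemeshow(y, p, n_groups=10):
    """Hosmer-Lemeshow chi-square statistic and p-value (minimal sketch).

    y        : 0/1 observed outcomes
    p        : predicted probabilities from the fitted logistic model
    n_groups : number of risk groups (deciles by default)
    """
    y = np.asarray(y, dtype=float)
    p = np.asarray(p, dtype=float)

    # Steps 1-3: rank subjects by predicted probability, split into groups.
    order = np.argsort(p)
    groups = np.array_split(order, n_groups)

    stat = 0.0
    for g in groups:
        n = len(g)
        obs_events = y[g].sum()        # observed events in the group
        exp_events = p[g].sum()        # expected events = sum of predictions
        exp_nonevents = n - exp_events
        # Step 5: chi-square contributions for events and non-events.
        stat += (obs_events - exp_events) ** 2 / exp_events
        stat += ((n - obs_events) - exp_nonevents) ** 2 / exp_nonevents

    # The statistic is referred to a chi-square distribution with g - 2 df.
    df = n_groups - 2
    p_value = chi2.sf(stat, df)
    return stat, p_value

# Simulated example: outcomes are drawn from the model's own probabilities,
# so the test should usually not reject the null of good fit.
rng = np.random.default_rng(1)
p_hat = rng.uniform(0.05, 0.6, size=500)
y_obs = rng.binomial(1, p_hat)
stat, p_value = hosmer_lemeshow(y_obs, p_hat)
print(f"HL statistic = {stat:.2f}, p = {p_value:.3f}")
```

Because the simulated outcomes are generated from the predicted probabilities themselves, a small statistic and a non-significant p-value are the expected behaviour here.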

The null hypothesis of the Hosmer–Lemeshow test is:

The model fits the data well; that is, there is no significant difference between predicted and observed event rates.

Interpretation of the p-value is typically as follows:

  • p > 0.05 — No evidence of lack of fit; the model is considered adequately calibrated.

  • p < 0.05 — Evidence of poor model fit; predicted probabilities differ significantly from observed outcomes.

While the Hosmer–Lemeshow test is widely used due to its simplicity, it has several important limitations that must be considered when interpreting the results.


Limitations of the Hosmer–Lemeshow Test

Sensitivity to Sample Size

One of the most important limitations of the Hosmer–Lemeshow test is its dependence on sample size.

When the sample size is large, even very small differences between predicted and observed probabilities—differences that may have little or no clinical importance—can produce a statistically significant result (p < 0.05). In such situations, the test may incorrectly suggest that the model has poor calibration even though the discrepancy is practically negligible.

Conversely, when the sample size is small, the test may lack sufficient statistical power to detect meaningful calibration errors. A poorly calibrated model may therefore produce a non-significant p-value (p > 0.05), falsely suggesting adequate model fit.

In summary:

  • Large sample size → increased likelihood of detecting trivial miscalibration.

  • Small sample size → reduced ability to detect real calibration problems.

Therefore, relying solely on the Hosmer–Lemeshow test can lead to misleading conclusions regarding model performance.

Dependence on Grouping Strategy

The test relies on dividing predicted probabilities into groups (commonly deciles). However, the results can vary depending on the number of groups used and how observations are distributed across them. This grouping process introduces an element of arbitrariness that may affect the stability of the test.

Lack of Diagnostic Insight

Another limitation is that the Hosmer–Lemeshow test provides only a single p-value, which does not indicate where the model fails to fit the data. It does not reveal whether miscalibration occurs in low-risk patients, high-risk patients, or across the entire range of predicted probabilities.


Calibration Plots

A calibration plot provides a graphical assessment of calibration by comparing predicted probabilities with observed event rates.

In a calibration plot:

  • The x-axis represents predicted probabilities.

  • The y-axis represents observed event frequencies.

A perfectly calibrated model would produce points that lie along the 45-degree diagonal line, where:

observed event rate = predicted probability

This diagonal line is often referred to as the line of perfect calibration.

Calibration plots allow researchers to visually identify patterns of miscalibration, such as:

  • systematic overestimation of risk at low predicted probabilities,

  • underestimation of risk at high predicted probabilities,

  • deviations occurring only within certain ranges of predicted risk.

Because calibration plots provide detailed visual information about model performance, they are considered more informative than a single goodness-of-fit test.
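The binned points behind such a plot can be computed directly. The sketch below uses simulated data and decile grouping; plotting the resulting points against the 45-degree line (for example with matplotlib) yields the calibration plot.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated predictions and outcomes; outcomes are drawn from the
# predictions, so the points should fall near the 45-degree line.
p_hat = rng.uniform(0.05, 0.9, size=2000)
y_obs = rng.binomial(1, p_hat)

# Group subjects into deciles of predicted risk.
order = np.argsort(p_hat)
deciles = np.array_split(order, 10)

# Each decile contributes one point: (mean predicted, observed event rate).
mean_predicted = np.array([p_hat[d].mean() for d in deciles])
observed_rate = np.array([y_obs[d].mean() for d in deciles])

for mp, obs in zip(mean_predicted, observed_rate):
    print(f"predicted {mp:.2f} -> observed {obs:.2f}")
# Plotting mean_predicted (x-axis) against observed_rate (y-axis), together
# with the reference line y = x, gives the calibration plot.
```

A systematic gap between the points and the diagonal in a particular region of predicted risk is exactly the kind of localized miscalibration that a single Hosmer-Lemeshow p-value cannot reveal.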


Relationship Between the Hosmer–Lemeshow Test and Calibration Plots

Both the Hosmer–Lemeshow test and calibration plots evaluate the same underlying concept, model calibration, but they serve different purposes.

The Hosmer–Lemeshow test provides a formal statistical test of model fit, summarized by a p-value.

Calibration plots, on the other hand, provide a visual representation of model calibration, allowing researchers to identify specific patterns and locations of miscalibration.

Thus:

  • The Hosmer–Lemeshow test answers the question: Is there statistical evidence that the model does not fit the data?

  • The calibration plot answers the question: How and where does the model deviate from perfect calibration?

For this reason, calibration plots are often preferred when interpreting model performance.


Why the Hosmer–Lemeshow Test Should Not Be Used Alone

Modern methodological guidance for evaluating clinical prediction models recommends that the Hosmer–Lemeshow test should not be used as the sole measure of calibration.

This recommendation arises from several concerns:

  1. The test is highly influenced by sample size.

  2. The results depend on an arbitrary grouping of predicted probabilities.

  3. The test does not indicate the direction or magnitude of miscalibration.

  4. A single p-value cannot adequately summarize the calibration of a prediction model.

Therefore, model evaluation should incorporate multiple complementary measures, including:

  • Calibration plots

  • Calibration intercept

  • Calibration slope

  • Discrimination metrics such as the AUROC (also called the c-statistic).

Using these measures together provides a more comprehensive assessment of model performance.
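The calibration intercept and slope can be estimated by refitting a logistic model of the observed outcome on the logit of the predicted probabilities (logistic recalibration). Below is a minimal numpy sketch with a hand-rolled Newton-Raphson fit, shown for transparency rather than any particular statistics package; perfect calibration corresponds to an intercept of 0 and a slope of 1.

```python
import numpy as np

def calibration_intercept_slope(y, p, n_iter=25):
    """Estimate calibration intercept and slope (minimal sketch).

    Fits y ~ a + b * logit(p) by Newton-Raphson maximum likelihood.
    Perfect calibration corresponds to a = 0 and b = 1.
    """
    y = np.asarray(y, dtype=float)
    lp = np.log(p / (1 - p))                 # logit of predicted probabilities
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(n_iter):
        mu = 1 / (1 + np.exp(-(X @ beta)))   # fitted probabilities
        W = mu * (1 - mu)                    # logistic variance weights
        grad = X.T @ (y - mu)
        hess = X.T @ (X * W[:, None])
        beta += np.linalg.solve(hess, grad)  # Newton-Raphson update
    return beta[0], beta[1]                  # (intercept, slope)

# Simulated example: outcomes drawn from the predictions themselves,
# so the estimates should land near intercept 0 and slope 1.
rng = np.random.default_rng(3)
p_hat = rng.uniform(0.05, 0.9, size=2000)
y_obs = rng.binomial(1, p_hat)
a, b = calibration_intercept_slope(y_obs, p_hat)
print(f"calibration intercept = {a:.2f}, slope = {b:.2f}")
```

In applied work, a slope below 1 typically indicates predictions that are too extreme (overfitting), while a nonzero intercept indicates systematic over- or underestimation of average risk.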


Clinical Importance of Calibration

In clinical applications, calibration is often as important as—or sometimes even more important than—discrimination.

A model may rank patients correctly according to risk (good discrimination) but still produce inaccurate probability estimates (poor calibration). When predicted probabilities are used to guide treatment decisions or risk communication, poor calibration can lead to inappropriate clinical actions.

For example:

  • Overestimation of risk may lead to unnecessary treatments or diagnostic procedures.

  • Underestimation of risk may delay interventions in high-risk patients.

Therefore, careful evaluation of calibration is essential before implementing prediction models in real-world clinical settings.


Conclusion

The Hosmer–Lemeshow test is a commonly used goodness-of-fit test for assessing calibration in logistic regression models. However, its interpretation requires caution due to its sensitivity to sample size and dependence on the arbitrary grouping of predicted probabilities.

While the test can provide useful information regarding overall model fit, it should not be relied upon as the sole measure of calibration. Calibration plots, along with additional metrics such as calibration intercept, calibration slope, and discrimination measures like AUROC, provide a more comprehensive evaluation of prediction model performance.

A thorough assessment of both calibration and discrimination is essential to ensure that clinical prediction models provide accurate and reliable risk estimates for use in medical decision-making.
