Risk Score Calibration Plot vs Decile Calibration Plot (pmcalplot)
- Mayta

When reading clinical prediction papers, many people recognize the word calibration but still feel confused when the graphs look very different. Some papers show the x-axis as a clinical score such as 0–5, while others show the x-axis as predicted probability and divide the data into 10 groups. Both are calibration plots, but they are not exactly the same type.
This article explains the difference between a risk score calibration plot and a decile calibration plot, why they are used, how they are constructed, and how to interpret them correctly.

What is calibration?
Calibration asks a simple question:
If a model predicts a certain risk, does that predicted risk match what actually happened?
This is different from discrimination.
Discrimination asks whether the model can separate people with disease from people without disease.
Calibration asks whether the predicted probabilities are numerically accurate.
A model can rank patients correctly and still be poorly calibrated. For example, it may correctly label some patients as higher risk than others, but predict 80% when the real observed risk is only 50%.
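The distinction can be made concrete with a toy calculation using made-up numbers (the data below are illustrative, not from any real study): a model whose predictions rank every event above every non-event, yet systematically over-predict risk.

```python
import numpy as np

# Toy data: the model ranks patients perfectly (every event scores higher
# than every non-event), but all its predictions are inflated.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0.50, 0.55, 0.60, 0.65, 0.70,   # non-events
                   0.75, 0.80, 0.85, 0.90, 0.95])  # events

# Discrimination: fraction of (event, non-event) pairs ranked correctly,
# which equals the AUC.
events, nonevents = y_pred[y_true == 1], y_pred[y_true == 0]
auc = (events[:, None] > nonevents[None, :]).mean()

# Calibration: average predicted risk versus observed event rate.
print(f"AUC = {auc:.2f}")  # perfect ranking
print(f"mean predicted = {y_pred.mean():.2f}, observed = {y_true.mean():.2f}")
```

Here discrimination is perfect (AUC = 1.0), yet the model predicts roughly 73% risk on average when only 50% of patients had the event: good ranking, poor calibration.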

1. Risk Score Calibration Plot
Definition
A risk score calibration plot is a calibration graph used when a prediction model has been simplified into a point-based clinical score.
Examples of this type of model include point systems such as:
CURB-65
CHA₂DS₂-VASc
Wells score
Other bedside scoring tools converted from regression models
Instead of plotting raw predicted probabilities for each patient, the plot uses the total score categories as the x-axis.

Typical structure
X-axis
Total score, such as 0, 1, 2, 3, 4, 5
Y-axis
Observed proportion of the outcome, such as disease, death, or complication
Display
Usually the figure contains:
a line for predicted risk at each score
points or circles for the observed risk in the actual data
What it means
If the observed points sit close to the predicted line, the score is well calibrated.
For example:
score 0 → predicted risk 5%, observed risk 4%
score 1 → predicted risk 20%, observed risk 22%
score 2 → predicted risk 60%, observed risk 58%
That means the score performs well because the estimated probability attached to each score is close to reality.
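The check above is easy to reproduce. The following sketch uses hypothetical patient-level data constructed to match the example numbers (the predicted risks per score are assumed values a score's developers might publish):

```python
import pandas as pd

# Hypothetical data: each row is a patient with a total score and an
# observed outcome (1 = event). Counts are chosen to match the example.
df = pd.DataFrame({
    "score":   [0] * 50 + [1] * 50 + [2] * 50,
    "outcome": [1] * 2 + [0] * 48    # score 0: 2/50 events
             + [1] * 11 + [0] * 39   # score 1: 11/50 events
             + [1] * 29 + [0] * 21,  # score 2: 29/50 events
})
# Assumed predicted risk attached to each score category.
predicted = {0: 0.05, 1: 0.20, 2: 0.60}

# Observed event rate within each score category.
observed = df.groupby("score")["outcome"].mean()
for s, obs in observed.items():
    print(f"score {s}: predicted {predicted[s]:.0%}, observed {obs:.0%}")
```

Plotting `predicted` as a line and `observed` as points over the score categories gives exactly the figure described above.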

Figure 1. Example of a risk score calibration plot showing predicted and observed event rates across total score categories.
Why this type is used
This plot is especially useful when the model has already been translated into a clinical bedside tool. Doctors often find integer scores easier to use than raw model equations.
Instead of saying:
predicted probability = 0.73
the bedside tool says:
total score = 4
and then maps that score to a clinical risk.
So in this setting, calibration is naturally assessed by score category.
Strengths
very intuitive for clinicians
easy to read in bedside-score papers
directly matches how the tool is used in practice
each score is already a natural group
Limitations
only works well when the model is actually expressed as a point score
can hide variation within a score group
provides less detail than patient-level probability calibration
may be unstable if some score groups contain very few patients
2. Decile Calibration Plot
Definition
A decile calibration plot is a calibration graph that uses predicted probabilities from a model and groups patients into 10 equal-sized groups according to their predicted risk.
This is one of the most traditional forms of calibration display in logistic regression and prediction-model papers.

Why it is called “decile”
The word decile means one-tenth.
The dataset is sorted by predicted probability, then divided into 10 groups, each containing about 10% of the patients.
For each group, the researcher calculates:
the mean predicted probability
the observed event rate
These values are then plotted against each other.
Typical structure
X-axis
Predicted probability
Usually the average predicted risk within each decile
Y-axis
Observed probability
Usually the actual event rate within that decile
Display
Usually the figure contains:
a 45-degree reference line, representing perfect calibration
10 points, one for each decile
sometimes a connecting line between the points
How it is built
The process is usually:
Step 1
Run a model and obtain the predicted probability for every patient.
Step 2
Sort all patients from lowest predicted risk to highest predicted risk.
Step 3
Divide them into 10 equally sized groups.
Step 4
For each group, calculate:
mean predicted probability
observed proportion of the outcome
Step 5
Plot observed versus predicted risk.
If the points follow the 45-degree line, calibration is good.
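The five steps above can be sketched in a few lines. This is a minimal illustration on simulated data (not the pmcalplot implementation), where outcomes are generated to match the predictions so the deciles should fall near the diagonal:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
p = rng.uniform(0.05, 0.95, n)   # Step 1: predicted probability per patient
y = rng.binomial(1, p)           # simulated outcomes consistent with p

df = pd.DataFrame({"pred": p, "obs": y})
# Steps 2-3: sort by predicted risk and cut into 10 equal-sized groups.
df["decile"] = pd.qcut(df["pred"], 10, labels=False)
# Step 4: mean predicted probability and observed event rate per decile.
cal = df.groupby("decile").agg(mean_pred=("pred", "mean"),
                               obs_rate=("obs", "mean"))
# Step 5: plot obs_rate against mean_pred; points near the 45-degree
# line indicate good calibration.
print(cal.round(3))
```

In a real analysis, `p` would come from a fitted model on validation data rather than from a simulation.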

Figure 2. Example of a decile calibration plot comparing observed and predicted event rates across deciles of predicted risk.

Figure 3. Example of a decile calibration plot produced with pmcalplot in Stata, comparing observed and predicted event rates across deciles of predicted risk.
Why use 10 groups?
Because raw patient-level predictions can be noisy. Grouping into deciles creates a cleaner summary.
Ten groups became common because it is a practical balance:
fewer groups may oversimplify
too many groups may become unstable and noisy
So deciles are a traditional compromise between readability and detail.
Strengths
widely recognized in prediction-model literature
easy to compare predicted versus observed risk
useful for models with continuous predicted probabilities
closely related to the logic of the Hosmer–Lemeshow goodness-of-fit test
Limitations
the appearance depends on how the grouping is done
two models with similar decile plots may behave differently at the patient level
grouping can hide miscalibration inside a decile
not as informative as a smooth calibration curve in modern modeling work
3. Key difference between the two
The main difference is the meaning of the x-axis.
Risk score calibration plot
The x-axis is clinical score categories.
Each value on the x-axis is a score group that already exists in the scoring system.
Decile calibration plot
The x-axis is predicted probability, grouped into 10 bins.
The groups do not come from a clinical score. They are created statistically after the model produces patient-level probabilities.
Simple comparison table

Aspect | Risk score plot | Decile plot
X-axis | Clinical score categories | Predicted probability, in 10 bins
Groups | Defined by the scoring system itself | Created statistically from patient-level predictions
Best for | Bedside point-based tools | Models that output continuous probabilities
4. Why do the two plots look different?
They look different because they answer calibration at two different levels.
Risk score plot
Focuses on the question:
For each score value, what was the actual observed risk?
This fits clinical tools where the final output is a score.
Decile plot
Focuses on the question:
Across increasing predicted risk groups, how close were predicted and observed probabilities?
This fits statistical models that produce probabilities directly.
5. How to interpret each one correctly
Interpreting a risk score calibration plot
Look at each score category and compare:
predicted risk
observed risk
If the observed points are close to the predicted curve, calibration is good.
Example
If score 3 is predicted to have 85% risk and the observed rate is 83%, that is good calibration.
If score 3 is predicted 85% but observed only 50%, that is poor calibration.
Interpreting a decile calibration plot
Look at the relationship between the decile points and the diagonal reference line.
points on the line = well calibrated
points above the line = model underestimates risk
points below the line = model overestimates risk
Example
If a decile has predicted risk 0.40 but observed risk 0.60, that means the model underestimated risk in that range.
6. Relationship to modern calibration curves
Today, many statisticians prefer smooth calibration curves rather than only decile plots.
A smooth calibration curve often uses flexible methods such as LOESS or spline-based fitting to show calibration across the full probability range.
Compared with decile plots:
decile plots are simpler and traditional
smooth curves are often more informative
smooth curves reduce the information loss caused by grouping
But decile plots are still commonly shown because they are easy to understand.
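To show the idea without a LOESS library, the sketch below uses a simple sliding-window smoother as a stand-in for LOESS or spline fitting: for each point, the observed event rate among its nearest neighbours by predicted risk approximates the smooth calibration curve there. The data are simulated and well calibrated by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.sort(rng.uniform(0.02, 0.98, 2000))  # predicted risks, sorted
y = rng.binomial(1, p)                      # simulated outcomes

# Sliding-window smoother (illustrative stand-in for LOESS/splines):
# observed event rate and mean predicted risk in each 200-patient window.
window = 200
smooth = np.convolve(y, np.ones(window) / window, mode="valid")
centers = np.convolve(p, np.ones(window) / window, mode="valid")

# With well-calibrated predictions, the smoothed observed rate tracks
# the predicted risk across the whole probability range.
print(f"max |observed - predicted| along the curve: "
      f"{np.abs(smooth - centers).max():.3f}")
```

Unlike ten decile points, this produces a curve across the full probability range, which is why smooth methods lose less information to grouping.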
7. Common mistakes in interpretation
Mistake 1: confusing calibration with discrimination
A high AUC does not guarantee good calibration.
Mistake 2: assuming grouped plots show everything
Both score-group plots and decile plots can hide problems inside groups.
Mistake 3: overinterpreting sparse groups
If a score category or decile has few patients, the observed risk may be unstable.
Mistake 4: treating all calibration plots as the same
A risk score calibration plot and a decile calibration plot are related, but not identical. Their x-axes represent different things.
8. When should each plot be used?
Use a risk score calibration plot when:
the final tool is a bedside score
users make decisions based on integer score groups
the model has already been simplified into points
Use a decile calibration plot when:
the model gives continuous predicted probabilities
you want a traditional grouped assessment of calibration
you are presenting logistic regression or machine learning outputs
9. Practical takeaway
A risk score calibration plot is best thought of as:
Calibration assessed across score categories
A decile calibration plot is best thought of as:
Calibration assessed across tenth-based groups of predicted probability
They are both valid ways to examine whether predicted risk matches observed risk. The difference is not that one is right and the other is wrong. The difference is that they are designed for different model outputs.
If the model output is a score, use score-based calibration.
If the model output is a probability, use decile-based calibration or a smooth calibration curve.
Conclusion
Calibration is about whether predicted risk matches reality. A risk score calibration plot uses score categories on the x-axis and is ideal for point-based clinical tools. A decile calibration plot uses predicted probabilities divided into ten groups and is common in regression and prediction-model studies. Both aim to compare predicted and observed risk, but they do so at different levels of model representation.
One-line memory aid
Score plot = calibration by clinical points
Decile plot = calibration by grouped predicted probabilities


