Risk Score Calibration Plot vs Decile Calibration Plot (pmcalplot)
- Mayta

When reading clinical prediction papers, many people recognize the word calibration but still feel confused when the graphs look very different. Some papers show the x-axis as a clinical score such as 0–5, while others show the x-axis as predicted probability and divide the data into 10 groups. Both are calibration plots, but they are not exactly the same type.
This article explains the difference between a risk score calibration plot and a decile calibration plot, why they are used, how they are constructed, and how to interpret them correctly.

What is calibration?
Calibration asks a simple question:
If a model predicts a certain risk, does that predicted risk match what actually happened?
This is different from discrimination.
Discrimination asks whether the model can separate people with disease from people without disease.
Calibration asks whether the predicted probabilities are numerically accurate.
A model can rank patients correctly and still be poorly calibrated. For example, it may correctly label some patients as higher risk than others, but predict 80% when the real observed risk is only 50%.
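The distinction can be made concrete with a toy calculation using made-up numbers (the data below are illustrative, not from any real study): a model whose predictions rank every event above every non-event, yet systematically over-predict risk.

```python
import numpy as np

# Toy data: the model ranks patients perfectly (every event scores higher
# than every non-event), but all its predictions are inflated.
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
y_pred = np.array([0.50, 0.55, 0.60, 0.65, 0.70,   # non-events
                   0.75, 0.80, 0.85, 0.90, 0.95])  # events

# Discrimination: fraction of (event, non-event) pairs ranked correctly,
# which equals the AUC.
events, nonevents = y_pred[y_true == 1], y_pred[y_true == 0]
auc = (events[:, None] > nonevents[None, :]).mean()

# Calibration: average predicted risk versus observed event rate.
print(f"AUC = {auc:.2f}")  # perfect ranking
print(f"mean predicted = {y_pred.mean():.2f}, observed = {y_true.mean():.2f}")
```

Here discrimination is perfect (AUC = 1.0), yet the model predicts roughly 73% risk on average when only 50% of patients had the event: good ranking, poor calibration.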

1. Risk Score Calibration Plot
Definition
A risk score calibration plot is a calibration graph used when a prediction model has been simplified into a point-based clinical score.
Examples of this type of model include point systems such as:
CURB-65
CHA₂DS₂-VASc
Wells score
Other bedside scoring tools converted from regression models
Instead of plotting raw predicted probabilities for each patient, the plot uses the total score categories as the x-axis.

Typical structure
X-axis
Total score, such as 0, 1, 2, 3, 4, 5
Y-axis
Observed proportion of the outcome, such as disease, death, or complication
Display
Usually the figure contains:
a line for predicted risk at each score
points or circles for the observed risk in the actual data
What it means
If the observed points sit close to the predicted line, the score is well calibrated.
For example:
score 0 → predicted risk 5%, observed risk 4%
score 1 → predicted risk 20%, observed risk 22%
score 2 → predicted risk 60%, observed risk 58%
That means the score performs well because the estimated probability attached to each score is close to reality.
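The check above is easy to reproduce. The following sketch uses hypothetical patient-level data constructed to match the example numbers (the predicted risks per score are assumed values a score's developers might publish):

```python
import pandas as pd

# Hypothetical data: each row is a patient with a total score and an
# observed outcome (1 = event). Counts are chosen to match the example.
df = pd.DataFrame({
    "score":   [0] * 50 + [1] * 50 + [2] * 50,
    "outcome": [1] * 2 + [0] * 48    # score 0: 2/50 events
             + [1] * 11 + [0] * 39   # score 1: 11/50 events
             + [1] * 29 + [0] * 21,  # score 2: 29/50 events
})
# Assumed predicted risk attached to each score category.
predicted = {0: 0.05, 1: 0.20, 2: 0.60}

# Observed event rate within each score category.
observed = df.groupby("score")["outcome"].mean()
for s, obs in observed.items():
    print(f"score {s}: predicted {predicted[s]:.0%}, observed {obs:.0%}")
```

Plotting `predicted` as a line and `observed` as points over the score categories gives exactly the figure described above.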

Figure 1. Example of a risk score calibration plot showing predicted and observed event rates across total score categories.
Why this type is used
This plot is especially useful when the model has already been translated into a clinical bedside tool. Doctors often find integer scores easier to use than raw model equations.
Instead of saying:
predicted probability = 0.73
the bedside tool says:
total score = 4
and then maps that score to a clinical risk.
So in this setting, calibration is naturally assessed by score category.
Strengths
very intuitive for clinicians
easy to read in bedside-score papers
directly matches how the tool is used in practice
each score is already a natural group
Limitations
only works well when the model is actually expressed as a point score
can hide variation within a score group
provides less detail than patient-level probability calibration
may be unstable if some score groups contain very few patients
2. Decile Calibration Plot
Definition
A decile calibration plot is a calibration graph that uses predicted probabilities from a model and groups patients into 10 equal-sized groups according to their predicted risk.
This is one of the most traditional forms of calibration display in logistic regression and prediction-model papers.

Why it is called “decile”
The word decile means one-tenth.
The dataset is sorted by predicted probability, then divided into 10 groups, each containing about 10% of the patients.
For each group, the researcher calculates:
the mean predicted probability
the observed event rate
These values are then plotted against each other.
Typical structure
X-axis
Predicted probability
Usually the average predicted risk within each decile
Y-axis
Observed probability
Usually the actual event rate within that decile
Display
Usually the figure contains:
a 45-degree reference line, representing perfect calibration
10 points, one for each decile
sometimes a connecting line between the points
How it is built
The process is usually:
Step 1
Run a model and obtain the predicted probability for every patient.
Step 2
Sort all patients from lowest predicted risk to highest predicted risk.
Step 3
Divide them into 10 equally sized groups.
Step 4
For each group, calculate:
mean predicted probability
observed proportion of the outcome
Step 5
Plot observed versus predicted risk.
If the points follow the 45-degree line, calibration is good.
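The five steps above can be sketched in a few lines. This is a minimal illustration on simulated data (not the pmcalplot implementation), where outcomes are generated to match the predictions so the deciles should fall near the diagonal:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 1000
p = rng.uniform(0.05, 0.95, n)   # Step 1: predicted probability per patient
y = rng.binomial(1, p)           # simulated outcomes consistent with p

df = pd.DataFrame({"pred": p, "obs": y})
# Steps 2-3: sort by predicted risk and cut into 10 equal-sized groups.
df["decile"] = pd.qcut(df["pred"], 10, labels=False)
# Step 4: mean predicted probability and observed event rate per decile.
cal = df.groupby("decile").agg(mean_pred=("pred", "mean"),
                               obs_rate=("obs", "mean"))
# Step 5: plot obs_rate against mean_pred; points near the 45-degree
# line indicate good calibration.
print(cal.round(3))
```

In a real analysis, `p` would come from a fitted model on validation data rather than from a simulation.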

Figure 2. Example of a decile calibration plot comparing observed and predicted event rates across deciles of predicted risk.

Figure 3. Example of a decile calibration plot produced with pmcalplot in Stata, comparing observed and predicted event rates across deciles of predicted risk.
Why use 10 groups?
Because raw patient-level predictions can be noisy. Grouping into deciles creates a cleaner summary.
Ten groups became common because it is a practical balance:
fewer groups may oversimplify
too many groups may become unstable and noisy
So deciles are a traditional compromise between readability and detail.
Strengths
widely recognized in prediction-model literature
easy to compare predicted versus observed risk
useful for models with continuous predicted probabilities
closely related to the logic of the Hosmer–Lemeshow goodness-of-fit test
Limitations
the appearance depends on how the grouping is done
two models with similar decile plots may behave differently at the patient level
grouping can hide miscalibration inside a decile
not as informative as a smooth calibration curve in modern modeling work
3. Key difference between the two
The main difference is the meaning of the x-axis.
Risk score calibration plot
The x-axis is clinical score categories.
Each value on the x-axis is a score group that already exists in the scoring system.
Decile calibration plot
The x-axis is predicted probability, grouped into 10 bins.
The groups do not come from a clinical score. They are created statistically after the model produces patient-level probabilities.
Simple comparison table

Aspect | Risk score plot | Decile plot
X-axis | Clinical score categories | Predicted probability, in 10 bins
Groups | Defined by the scoring system itself | Created statistically from patient-level predictions
Best for | Bedside point-based tools | Models that output continuous probabilities
4. Why do the two plots look different?
They look different because they answer calibration at two different levels.
Risk score plot
Focuses on the question:
For each score value, what was the actual observed risk?
This fits clinical tools where the final output is a score.
Decile plot
Focuses on the question:
Across increasing predicted risk groups, how close were predicted and observed probabilities?
This fits statistical models that produce probabilities directly.
5. How to interpret each one correctly
Interpreting a risk score calibration plot
Look at each score category and compare:
predicted risk
observed risk
If the observed points are close to the predicted curve, calibration is good.
Example
If score 3 is predicted to have 85% risk and the observed rate is 83%, that is good calibration.
If score 3 is predicted 85% but observed only 50%, that is poor calibration.
Interpreting a decile calibration plot
Look at the relationship between the decile points and the diagonal reference line.
points on the line = well calibrated
points above the line = model underestimates risk
points below the line = model overestimates risk
Example
If a decile has predicted risk 0.40 but observed risk 0.60, that means the model underestimated risk in that range.
6. Relationship to modern calibration curves
Today, many statisticians prefer smooth calibration curves rather than only decile plots.
A smooth calibration curve often uses flexible methods such as LOESS or spline-based fitting to show calibration across the full probability range.
Compared with decile plots:
decile plots are simpler and traditional
smooth curves are often more informative
smooth curves reduce the information loss caused by grouping
But decile plots are still commonly shown because they are easy to understand.
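To show the idea without a LOESS library, the sketch below uses a simple sliding-window smoother as a stand-in for LOESS or spline fitting: for each point, the observed event rate among its nearest neighbours by predicted risk approximates the smooth calibration curve there. The data are simulated and well calibrated by construction.

```python
import numpy as np

rng = np.random.default_rng(1)
p = np.sort(rng.uniform(0.02, 0.98, 2000))  # predicted risks, sorted
y = rng.binomial(1, p)                      # simulated outcomes

# Sliding-window smoother (illustrative stand-in for LOESS/splines):
# observed event rate and mean predicted risk in each 200-patient window.
window = 200
smooth = np.convolve(y, np.ones(window) / window, mode="valid")
centers = np.convolve(p, np.ones(window) / window, mode="valid")

# With well-calibrated predictions, the smoothed observed rate tracks
# the predicted risk across the whole probability range.
print(f"max |observed - predicted| along the curve: "
      f"{np.abs(smooth - centers).max():.3f}")
```

Unlike ten decile points, this produces a curve across the full probability range, which is why smooth methods lose less information to grouping.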
7. Common mistakes in interpretation
Mistake 1: confusing calibration with discrimination
A high AUC does not guarantee good calibration.
Mistake 2: assuming grouped plots show everything
Both score-group plots and decile plots can hide problems inside groups.
Mistake 3: overinterpreting sparse groups
If a score category or decile has few patients, the observed risk may be unstable.
Mistake 4: treating all calibration plots as the same
A risk score calibration plot and a decile calibration plot are related, but not identical. Their x-axes represent different things.
8. When should each plot be used?
Use a risk score calibration plot when:
the final tool is a bedside score
users make decisions based on integer score groups
the model has already been simplified into points
Use a decile calibration plot when:
the model gives continuous predicted probabilities
you want a traditional grouped assessment of calibration
you are presenting logistic regression or machine learning outputs
9. Practical takeaway
A risk score calibration plot is best thought of as:
Calibration assessed across score categories
A decile calibration plot is best thought of as:
Calibration assessed across tenth-based groups of predicted probability
They are both valid ways to examine whether predicted risk matches observed risk. The difference is not that one is right and the other is wrong. The difference is that they are designed for different model outputs.
If the model output is a score, use score-based calibration.
If the model output is a probability, use decile-based calibration or a smooth calibration curve.
Conclusion
Calibration is about whether predicted risk matches reality. A risk score calibration plot uses score categories on the x-axis and is ideal for point-based clinical tools. A decile calibration plot uses predicted probabilities divided into ten groups and is common in regression and prediction-model studies. Both aim to compare predicted and observed risk, but they do so at different levels of model representation.
One-line memory aid
Score plot = calibration by clinical points
Decile plot = calibration by grouped predicted probabilities


