
AUPRC: Why AUROC Alone Is Not Enough for Imbalanced Data

  • Writer: Mayta
  • 16 hours ago
  • 3 min read

Introduction

In clinical epidemiology, evaluating the performance of prediction models is essential for determining their usefulness in practice. The Area Under the Receiver Operating Characteristic Curve (AUROC) is widely used for assessing discrimination. However, in situations where the outcome is rare, AUROC may provide an overly optimistic assessment of model performance.

In such settings, the Area Under the Precision–Recall Curve (AUPRC) offers a more informative evaluation because it focuses specifically on the model’s ability to correctly identify individuals with the outcome. This makes AUPRC particularly relevant for imbalanced datasets, such as predicting rare complications in emergency or critical care populations.


Conceptual Framework

The precision–recall curve is defined by two key components:

  • Recall (Sensitivity): the proportion of true cases correctly identified

  • Precision (Positive Predictive Value): the proportion of predicted positive cases that are truly positive

AUPRC summarizes the trade-off between precision and recall across all decision thresholds, emphasizing the model’s performance in identifying true positive cases.
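As a sketch of how this summary is computed, average precision accumulates the precision at each rank where a true case appears in the score-ordered list, weighted by the recall gained at that rank. The labels and scores below are invented for illustration (scikit-learn's `average_precision_score` implements the same summary):

```python
def average_precision(y_true, y_score):
    """AUPRC as average precision: sum over ranked positives of
    (precision at that rank) * (recall step, i.e. 1 / number of positives)."""
    ranked = sorted(zip(y_score, y_true), reverse=True)  # highest risk first
    n_pos = sum(y_true)
    tp, ap = 0, 0.0
    for rank, (_score, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            ap += (tp / rank) / n_pos  # precision here, weighted by recall gain
    return ap

y_true  = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]                        # 1 = outcome present
y_score = [0.1, 0.2, 0.9, 0.3, 0.7, 0.2, 0.75, 0.8, 0.1, 0.3]  # model risk scores
print(f"AUPRC = {average_precision(y_true, y_score):.3f}")      # 0.917
```

Here one non-case (score 0.75) outranks the lowest-scoring case, so precision at that case's rank drops to 3/4 and the average falls just below 1.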


Baseline Interpretation

A key property of AUPRC is that its baseline is equal to the prevalence of the outcome.

Unlike AUROC, which has a fixed baseline of 0.5, the AUPRC of a non-informative model is approximately equal to the outcome prevalence. Therefore, interpretation must always consider how much the AUPRC exceeds this baseline.
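A quick simulation illustrates this baseline property. Here 10,000 hypothetical patients (1% prevalence, mirroring the rare-outcome setting) are scored with pure random noise; the precision among the top-ranked patients hovers near the prevalence, no matter where the threshold is drawn:

```python
import random

random.seed(42)

n_pos, n_neg = 100, 9_900               # 1% prevalence
y = [1] * n_pos + [0] * n_neg           # outcome labels
s = [random.random() for _ in y]        # uninformative random risk scores

# Rank patients by score, then measure precision among the top-k "high risk"
top = sorted(range(len(y)), key=lambda i: s[i], reverse=True)
for k in (1_000, 5_000):
    precision = sum(y[i] for i in top[:k]) / k
    print(f"precision in top {k}: {precision:.3f}")  # stays near 0.01
```

Because every precision-recall point of a random model sits near the prevalence, the area under its curve does too.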


Why AUROC Can Be Misleading in Imbalanced Data

To illustrate this limitation, consider a clinical scenario:

Clinical Scenario

A model is developed to predict septic shock in the emergency department.

  • Total patients: 10,000

  • True cases: 100 (1% prevalence)

This represents a highly imbalanced dataset.


Model Performance Example

Model A

  • AUROC = 0.92

  • AUPRC = 0.08

At first glance, the AUROC suggests excellent discrimination. However, this can be misleading.


Understanding the Discrepancy

AUROC evaluates the model’s ability to separate cases from non-cases across all thresholds. In this dataset, the vast majority of patients (9,900) do not have the condition. As a result, correctly classifying non-cases contributes heavily to the AUROC, inflating its value.

However, clinical decision-making depends primarily on correctly identifying the relatively small number of true cases.
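The discrepancy can be reproduced with simulated scores (invented Gaussian data, not real patients): cases score higher on average, so AUROC is high, yet the 100 cases are so outnumbered that false positives swamp the ranked list and the AUPRC stays far lower.

```python
import random

random.seed(1)

# Simulated risk scores: cases shifted upward, but outnumbered 99 to 1
cases    = [random.gauss(2.0, 1.0) for _ in range(100)]
controls = [random.gauss(0.0, 1.0) for _ in range(9_900)]

# AUROC = probability that a random case outranks a random control
auroc = sum(c > k for c in cases for k in controls) / (len(cases) * len(controls))

# AUPRC (average precision) over the pooled, score-ranked cohort
ranked = sorted([(s, 1) for s in cases] + [(s, 0) for s in controls], reverse=True)
tp, auprc = 0, 0.0
for rank, (_score, label) in enumerate(ranked, start=1):
    if label:
        tp += 1
        auprc += (tp / rank) / len(cases)

print(f"AUROC ≈ {auroc:.2f}, AUPRC ≈ {auprc:.2f}")  # AUROC high, AUPRC much lower
```

Correctly ranking thousands of controls below the cases is enough to push AUROC above 0.9, while precision among flagged patients remains modest.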


Detailed Prediction Example

At a chosen decision threshold:

  • Predicted high-risk patients: 200

  • True positives: 20

  • False positives: 180

From this:

  • Recall = 20 / 100 = 0.20

  • Precision = 20 / 200 = 0.10

These results indicate that the model detects only 20% of true cases and that only 10% of predicted positives are correct.


AUPRC Interpretation

The baseline AUPRC equals the prevalence: 0.01.

The observed AUPRC for Model A is 0.08.

Although this is higher than the baseline, it remains low, indicating limited ability to identify true cases effectively. In practical terms, the model generates many false positives while missing most true cases.


Improved Model Comparison

Model B

  • AUROC = 0.91

  • AUPRC = 0.40

Despite having a similar AUROC, Model B demonstrates substantially improved AUPRC.

With a baseline of 0.01, an AUPRC of 0.40 indicates a forty-fold improvement over random prediction. This reflects a model that more accurately identifies true cases while reducing false positives, making it more clinically useful.
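Dividing each model's AUPRC by the prevalence baseline expresses this gain directly and makes values comparable across outcomes of different rarity:

```python
# Lift over the prevalence baseline for the two models in the example
prevalence = 0.01
for name, auprc in [("Model A", 0.08), ("Model B", 0.40)]:
    lift = auprc / prevalence
    print(f"{name}: {lift:.0f}-fold improvement over random")  # 8x vs 40x
```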


Comparison of Metrics

  • AUROC assesses overall discrimination and includes both true positives and true negatives

  • AUPRC focuses on the accuracy of positive predictions

In imbalanced datasets, AUROC may remain high even when the model performs poorly in identifying true cases. In contrast, AUPRC directly reflects performance in the clinically relevant positive class.


Clinical Interpretation

AUROC answers the question:

  • Can the model distinguish between patients with and without the outcome?

AUPRC answers a different and more clinically relevant question:

  • When the model predicts a patient is high risk, how often is that prediction correct?

In rare outcomes, this distinction is critical for avoiding unnecessary interventions and optimizing resource use.


Conclusion

AUPRC provides a more clinically meaningful evaluation of prediction models in imbalanced datasets by focusing on precision and recall. A high AUROC does not guarantee good performance in identifying true cases, whereas AUPRC directly reflects the reliability of positive predictions.

Therefore, in studies involving rare outcomes, AUPRC should be reported alongside AUROC, with careful interpretation relative to outcome prevalence.


Key Takeaways

  • AUPRC focuses on performance in the positive class

  • Baseline AUPRC equals outcome prevalence

  • AUROC can be misleading in imbalanced datasets

  • A low AUPRC despite high AUROC indicates poor clinical usefulness

  • Interpretation should emphasize improvement over baseline
