What Is a Calibration Plot and How Do You Interpret One?

A calibration plot is a graphical tool used to evaluate the trustworthiness of a predictive model’s probability estimates. It provides a visual answer to a direct question: when a model forecasts an event with a certain probability, does that event actually happen at that frequency? For example, if a well-calibrated model assigns a 40% risk of a patient developing a condition, the condition should materialize in approximately 40% of all patients given that same risk score. The plot is a standardized way to diagnose whether a prediction system systematically overestimates or underestimates outcomes across its entire range of forecasts.

The Purpose and Visual Elements of a Calibration Plot

The primary purpose of the calibration plot, also known as a reliability diagram, is to visualize the agreement between predicted probabilities and observed outcomes. A perfectly calibrated model is one whose predicted probabilities exactly match the observed event frequencies. This comparison is displayed using two axes that range from zero to one, covering the full spectrum of probability.

The X-axis represents the predicted probability or risk score assigned by the model (0% to 100%). The Y-axis displays the observed frequency or the true outcome rate for those predictions. This value is the proportion of times the event actually happened among all instances where the model gave a particular prediction score.

A diagonal line running from the bottom-left (0,0) to the top-right (1,1) serves as the reference for perfect calibration. If a model’s plotted points fall directly onto this line, the model is perfectly reliable. The data points plotted on the graph do not represent individual predictions; rather, they represent groups of predictions that have been averaged together.
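As a concrete illustration, the sketch below draws these visual elements with scikit-learn and matplotlib. The arrays y_true and y_prob are placeholders for real outcomes and model scores; the synthetic data are generated only so the snippet runs on its own.

```python
# A minimal sketch of a reliability diagram, assuming scikit-learn and
# matplotlib are available. y_true and y_prob stand in for real outcomes
# and model scores.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(0, 1, size=5000)   # predicted probabilities (X-axis)
y_true = rng.binomial(1, y_prob)        # binary outcomes drawn to match them

# Returns one (observed frequency, mean predicted probability) pair per bin
prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=10)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")  # reference diagonal
plt.plot(prob_pred, prob_true, "o-", label="Model")           # one point per bin
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency")
plt.legend()
plt.show()
```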

Understanding Miscalibration

Interpreting the calibration plot involves assessing how far the plotted curve deviates from the ideal diagonal line. Any significant departure from this line indicates miscalibration, meaning the model’s stated confidence does not align with reality. These deviations help diagnose specific types of unreliability.

When the plotted curve falls consistently below the diagonal line, it signifies “over-prediction” or overconfidence. In this scenario, the model is too optimistic, assigning a higher probability to an event than the event’s true rate of occurrence. For example, a model predicting a 90% chance of an event, where the event only occurs 75% of the time, is over-predicting.

Conversely, if the curve rises above the diagonal line, it illustrates “under-prediction” or underconfidence. This means the model is too conservative, assigning a lower probability than the actual observed rate. If the model predicts a 20% risk, but the event actually happens 35% of the time, the model is under-predicting.

The overall slope of the plotted curve also indicates the nature and severity of the miscalibration. A slope less than one suggests the model separates high and low risks too aggressively: its predictions are too extreme, so observed frequencies rise more gently than the predicted probabilities. A slope greater than one indicates the opposite problem: the predictions are too modest and clustered near the average outcome rate, so the model does not spread its estimates far enough apart to reflect the true differences in risk.
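One common way to quantify this, drawn from the clinical prediction literature, is to estimate a calibration slope by regressing the observed outcomes on the logit of the predicted probabilities. The sketch below is a rough illustration of that idea using statsmodels; the synthetic predictions are deliberately over-separated so the fitted slope should come out below one.

```python
# A rough sketch of estimating a calibration slope: regress the observed
# 0/1 outcomes on the logit of the predicted probabilities with statsmodels.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
true_p = rng.uniform(0.05, 0.95, size=5000)        # underlying event probabilities
y_true = rng.binomial(1, true_p)                   # observed outcomes
y_prob = np.clip(1.8 * true_p - 0.4, 0.01, 0.99)   # deliberately over-separated predictions

logit = np.log(y_prob / (1 - y_prob))              # predictions on the log-odds scale
fit = sm.Logit(y_true, sm.add_constant(logit)).fit(disp=0)
intercept, slope = fit.params

# slope ~ 1: spread about right; slope < 1: predictions too extreme;
# slope > 1: predictions too modest, clustered near the average rate.
print(f"calibration intercept = {intercept:.2f}, slope = {slope:.2f}")
```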

How Calibration Plots Are Constructed

Constructing a calibration plot requires a specific methodology to transform thousands of individual predictions into a few meaningful data points. The first step is “binning,” which involves grouping all the model’s predictions into defined ranges based on their predicted probability. Standard practice often involves dividing the probability scale (0 to 1) into a set number of bins, such as ten equal-width segments (deciles).

Grouping is necessary because each individual outcome is simply a 0 or a 1; an observed frequency only emerges once many similar predictions are pooled, and plotting every single prediction would produce a noisy, unreadable scatter. By grouping predictions, the noise from individual data points is averaged out, allowing the underlying relationship between predicted and observed probabilities to become clear. For each bin, two calculations define the coordinate pair plotted on the graph.

The X-coordinate is the average predicted probability across all data points in that bin. The Y-coordinate is the true observed frequency of the event within that same bin. This process generates a set of discrete points that form the basis of the calibration curve.
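A from-scratch sketch of this binning step might look like the following. The helper name calibration_points is hypothetical, and y_true and y_prob are again synthetic stand-ins for real outcomes and model scores.

```python
# A minimal sketch of the binning step behind a calibration plot.
import numpy as np

def calibration_points(y_true, y_prob, n_bins=10):
    """Return the (mean predicted, observed frequency) pair for each equal-width bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to one of n_bins equal-width bins (deciles by default)
    bin_ids = np.clip(np.digitize(y_prob, edges[1:-1]), 0, n_bins - 1)
    xs, ys = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():                        # skip bins with no predictions
            xs.append(y_prob[mask].mean())    # X: average predicted probability
            ys.append(y_true[mask].mean())    # Y: observed event frequency
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=2000)
y_true = rng.binomial(1, y_prob)
xs, ys = calibration_points(y_true, y_prob)   # coordinates of the plotted points
```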

While simple binning is common, more sophisticated methods are often used to produce a smoother, more continuous curve. Techniques such as isotonic regression or locally weighted scatterplot smoothing (LOESS) fit a flexible curve to the predictions and observed outcomes rather than relying on a handful of discrete bins. These smoothing methods are particularly helpful with large datasets, as they can reveal subtle miscalibrations that coarse bins would hide.
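As one sketch of the smoothing approach, the snippet below uses the LOWESS smoother from statsmodels to trace a continuous calibration curve; the data are synthetic placeholders, and the smoothing fraction is an arbitrary choice.

```python
# A rough sketch of a LOESS/LOWESS-smoothed calibration curve.
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=5000)          # predicted probabilities
y_true = rng.binomial(1, y_prob)         # binary outcomes

# Fit the 0/1 outcomes as a smooth function of the predicted probabilities;
# the result has one row per point, sorted by x, with columns (x, smoothed y).
smoothed = lowess(y_true, y_prob, frac=0.3)

plt.plot([0, 1], [0, 1], "k--", label="Perfect calibration")
plt.plot(smoothed[:, 0], smoothed[:, 1], label="LOWESS-smoothed curve")
plt.xlabel("Predicted probability")
plt.ylabel("Observed frequency (smoothed)")
plt.legend()
plt.show()
```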

Common Use Cases

Calibration plots are widely used across diverse fields where the precision of probability estimates directly impacts real-world decisions. In machine learning, calibration ensures that the model’s confidence scores are reliable, especially for classification models used in tasks like fraud detection or customer churn prediction. This is important because models like complex neural networks often do not natively produce well-calibrated probabilities.

In clinical risk prediction, calibration plots validate models that forecast health outcomes, such as the 10-year risk of heart disease. If a physician recommends aggressive treatment based on a high-risk score, the model’s stated probability must accurately reflect the true incidence rate for patients in that risk category. Poor calibration in medicine can lead to inappropriate treatment decisions.

Weather forecasting also relies on calibration to assess the reliability of probability forecasts: when a forecaster announces an 80% chance of rain, it should rain on roughly 80% of the days that receive that forecast. Calibration is distinct from a model’s ability to discriminate, which is its capacity to correctly rank risks and separate high-risk cases from low-risk cases. Calibration concerns only the accuracy of the probability values assigned to those ranks.
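To make that distinction concrete, the sketch below halves every predicted probability: the ranking, and therefore a rank-based discrimination measure such as ROC AUC, is unchanged, while the calibration curve shifts well above the diagonal. The data are synthetic stand-ins, as in the earlier snippets.

```python
# A small sketch contrasting discrimination and calibration: halving every
# predicted probability leaves the ROC AUC untouched but miscalibrates the scores.
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.calibration import calibration_curve

rng = np.random.default_rng(0)
y_prob = rng.uniform(size=5000)
y_true = rng.binomial(1, y_prob)
y_shrunk = y_prob / 2                     # same ordering, systematically too low

# Identical discrimination: the rank order of the scores has not changed
print(roc_auc_score(y_true, y_prob), roc_auc_score(y_true, y_shrunk))

# Poor calibration: observed frequencies run roughly twice the predictions
prob_true, prob_pred = calibration_curve(y_true, y_shrunk, n_bins=10)
print(np.round(prob_pred, 2))
print(np.round(prob_true, 2))
```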