Goodness of fit is a statistical concept that measures how well a model, distribution, or set of expected values matches the data you actually observed. It shows up across many areas of statistics, from testing whether survey responses follow a predicted pattern to evaluating whether a regression line accurately captures a trend. The core idea is always the same: you have a theoretical expectation, you have real-world data, and you want to know how close the two are.
The Core Idea Behind Goodness of Fit
Every goodness of fit test starts with a comparison between what you expected to see and what you actually found. Suppose you roll a die 600 times. If the die is fair, you’d expect each face to come up about 100 times. But your actual results will never be exactly 100 across the board. Goodness of fit testing gives you a formal way to ask: is the gap between my expected and observed results small enough to chalk up to random chance, or is something else going on?
This logic extends well beyond dice. In genetics, researchers use goodness of fit tests to check whether offspring traits follow predicted inheritance ratios. In market research, analysts test whether customer preferences are evenly split across product categories or skew toward certain ones. In quality control, manufacturers check whether defect rates match historical norms. The applications vary, but the structure is identical: set up an expected pattern, collect data, and measure the discrepancy.
The Chi-Squared Goodness of Fit Test
The most common goodness of fit test is the chi-squared test, used when you’re working with categorical data that falls into distinct groups or levels. You calculate a test statistic by comparing observed counts in each category to the counts you’d expect if your hypothesis were true. The bigger the gap between observed and expected, the larger the test statistic, and the stronger the evidence that your hypothesis doesn’t match reality.
The test works through a formal hypothesis framework. The null hypothesis states that the data follows the expected distribution, meaning all the proportions are as specified. The alternative hypothesis states that at least one proportion differs from what was predicted. If the resulting p-value falls below a threshold (typically 0.05), you reject the null hypothesis. That means the data provides convincing evidence that at least one category’s proportion is not what you assumed.
One technical detail worth knowing: the degrees of freedom for this test equal the number of categories minus the number of parameters you estimated from the data, minus one. So if you have five categories and estimated one parameter from your sample, you’d have three degrees of freedom. This number determines which chi-squared distribution you compare your test statistic against.
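SciPy exposes this adjustment through the `ddof` argument of `chisquare`, which subtracts extra degrees of freedom for parameters estimated from the data. The counts and expected values below are illustrative, not from a real model:

```python
from scipy.stats import chisquare

# Hypothetical: five categories, expected counts taken from a model
# that estimated one parameter from this same sample.
observed = [18, 25, 21, 20, 16]
expected = [20, 20, 20, 20, 20]

# ddof=1 reduces the degrees of freedom by one extra, giving
# df = 5 categories - 1 - 1 estimated parameter = 3.
stat, p = chisquare(observed, f_exp=expected, ddof=1)

print(f"chi-squared = {stat:.2f}, p = {p:.3f}")
```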
Goodness of Fit in Regression: R-Squared
When people talk about goodness of fit in regression analysis, they’re usually talking about R-squared. This value tells you what proportion of the variation in your outcome is explained by the variables in your model. For an ordinary least-squares model with an intercept, it ranges from 0 to 1. An R-squared of 0 means the model explains none of the variation in the outcome. An R-squared of 1 means every data point falls exactly on the predicted line, a perfect fit.
In practice, you’ll rarely see either extreme. An R-squared of 0.70 means the model’s variables account for 70% of the fluctuation in the outcome, with the remaining 30% unexplained. As the value moves closer to 1, the model’s predictions align more tightly with the actual data points. As it drops toward 0, the relationship between your predictors and the outcome weakens.
Why Adjusted R-Squared Matters
Standard R-squared has a well-known flaw: it always increases when you add more variables to a model, even if those variables are irrelevant. You could toss in completely random data as a predictor and R-squared would still tick upward. This makes it unreliable for comparing models with different numbers of variables.
Adjusted R-squared solves this by penalizing the addition of predictors that don’t genuinely improve the model. It accounts for the number of variables relative to the number of data points, so it only increases when a new predictor adds real explanatory power. If you’re choosing between a simpler model and a more complex one, adjusted R-squared gives you a fairer comparison by balancing fit quality against model complexity.
Tests for Continuous Distributions
The chi-squared test works well for categorical data, but when you need to check whether continuous data (like heights, incomes, or measurement errors) follows a specific distribution such as a normal curve, other tests are more appropriate.
The Kolmogorov-Smirnov (KS) test compares your data’s cumulative distribution to the theoretical one you’re testing against. It’s widely used, but it has a notable weakness: it’s most sensitive to differences near the center of the distribution and tends to miss discrepancies in the tails, where extreme values live.
The Anderson-Darling test was developed in the 1950s specifically to address this limitation. It applies greater weight to the tails of the distribution, making it more sensitive to deviations at the extremes. Comparative studies generally find it more powerful than the KS test at detecting these tail departures. If you’re particularly concerned about whether your data’s tail behavior matches expectations (common in finance and engineering), the Anderson-Darling test is the stronger choice.
Comparing Models: AIC and BIC
Sometimes the question isn’t whether a single model fits the data, but which of several competing models fits best. This is where information criteria like AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) come in. Both measure goodness of fit while simultaneously penalizing model complexity.
The logic behind them is intuitive. Imagine plotting a curve through a set of data points. A polynomial with enough terms can pass through every single point, achieving a perfect fit. But that model would be useless for prediction because it’s tailored to the noise and quirks of that specific dataset. Apply it to new data and the fit collapses. This is overfitting, and it’s the central problem these criteria address.
Both AIC and BIC start with the model’s likelihood (how well it fits the data) and then subtract a penalty based on the number of parameters. The more parameters, the steeper the penalty. The key difference between them is that BIC’s penalty grows with sample size, making it more aggressive about favoring simpler models in large datasets. AIC uses a fixed penalty of 2 per parameter regardless of sample size. Lower values indicate a better balance of fit and simplicity for both measures.
Good Fit Doesn’t Mean Good Predictions
One of the most important distinctions in statistics is between goodness of fit and predictive performance. A model can fit the data it was built on extremely well and still perform poorly when applied to new data. Goodness of fit is typically evaluated on the same dataset used to build the model. Predictive performance requires either new data or cross-validation techniques that simulate new data by holding portions of the original dataset aside.
This gap exists because of overfitting. A complex model can capture not just the real patterns in your data but also the random noise. When you then apply that model to a fresh dataset, the noise is different, and the fit shrinks. This is why tools like adjusted R-squared, AIC, and BIC exist: they all push back against complexity to help you find models that generalize well, not just models that look good on paper.
In practical terms, a high R-squared or a small chi-squared statistic is encouraging, but it’s only part of the picture. The real test of any model is whether it holds up when it encounters data it hasn’t seen before.