A chi-square test tells you whether there’s a meaningful relationship between two categorical variables, or whether the pattern you see in your data is just due to chance. It works by comparing what you actually observed against what you’d expect to see if no relationship existed. If the gap between the two is large enough, the test suggests that something beyond chance is driving the pattern.
This is one of the most common statistical tests in research, used everywhere from clinical trials to marketing surveys. It only works with categorical data, meaning variables that fall into groups (like yes/no, male/female, or treatment A/treatment B) rather than continuous measurements like weight or blood pressure.
The Two Types of Chi-Square Test
There are actually two versions, and they answer slightly different questions.
The test of independence is the one most people mean. It takes two categorical variables and asks whether they’re related. For example: is there a relationship between gender and likelihood of getting in trouble at school? You’d collect data on both variables, build a table showing the counts, and run the test. If the result is statistically significant, you can say the two variables are related. If not, they appear to be independent of each other.
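In Python, a test of independence is typically run with `scipy.stats.chi2_contingency`. A minimal sketch for the school example, with entirely made-up counts:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts: rows = gender, columns = got in trouble (yes / no).
observed = np.array([[46, 71],   # boys:  in trouble / not
                     [37, 83]])  # girls: in trouble / not

# Returns the statistic, the p-value, the degrees of freedom,
# and the table of expected counts under independence.
chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi2 = {chi2:.3f}, p = {p:.3f}, dof = {dof}")
```

If `p` falls below your threshold, you would conclude the two variables are related; otherwise they appear independent.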
The goodness-of-fit test works differently. It takes a single variable and asks whether its distribution matches what you’d expect. Say you roll a die 600 times. You’d expect each face to come up about 100 times. A goodness-of-fit test checks whether your actual results are close enough to that expectation, or whether the die might be loaded. This version is useful for testing whether a population is evenly distributed or whether it matches some known pattern.
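The die example translates directly to `scipy.stats.chisquare`, here with hypothetical roll counts that sum to 600:

```python
from scipy.stats import chisquare

# Hypothetical results of 600 die rolls (counts per face, summing to 600).
observed = [95, 108, 92, 110, 97, 98]
expected = [100] * 6  # what a fair die should produce on average

stat, p = chisquare(observed, f_exp=expected)
print(f"chi2 = {stat:.3f}, p = {p:.3f}")
```

A large p-value here would mean there’s no evidence the die is loaded; a small one would suggest it is.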
How the Calculation Works
The core logic is straightforward, even if the math looks intimidating at first. The chi-square statistic is calculated by taking each cell in your data table, finding the difference between the observed count and the expected count, squaring that difference, dividing by the expected count, then adding all those values together.
The “expected” values represent what the data would look like if there were no relationship at all. In a vaccine trial, for instance, the expected values estimate how cases would be distributed if the vaccine had zero effect. The further your observed data drifts from those expected values, the larger the chi-square number gets, and the stronger the evidence that something other than chance is driving the pattern.
A small chi-square value means your observed data closely matches what you’d expect under no relationship. A large one means there’s a notable gap, suggesting the variables are connected.
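The sum described above can be computed by hand. A sketch with a hypothetical 2×2 table, using the standard formula for expected counts (each cell’s row total times its column total, divided by the grand total):

```python
import numpy as np

# Hypothetical observed counts for a 2x2 table.
observed = np.array([[30, 20],
                     [25, 45]])

# Expected counts under "no relationship":
# row total * column total / grand total, for each cell.
row_totals = observed.sum(axis=1, keepdims=True)  # shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # shape (1, 2)
grand_total = observed.sum()
expected = row_totals * col_totals / grand_total  # broadcasts to (2, 2)

# Sum of (observed - expected)^2 / expected over all cells.
chi2_stat = ((observed - expected) ** 2 / expected).sum()
print(f"chi2 = {chi2_stat:.3f}")
```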
Reading the P-Value
The chi-square statistic alone doesn’t give you a final answer. What matters is the p-value that comes with it. The p-value tells you the probability of seeing results at least this extreme if there really were no relationship between your variables.
The conventional cutoff is 0.05, meaning a 5% probability. If your p-value falls below 0.05, you’d typically conclude that the relationship is statistically significant. A p-value below 0.01 provides stronger evidence, and below 0.001 is stronger still. These thresholds aren’t magic lines. A p-value of 0.04 isn’t fundamentally different from 0.06. The evidence against no relationship simply gets stronger as the p-value gets smaller.
One thing worth keeping in mind: with a 0.05 threshold, 1 in 20 tests will produce a “significant” result purely by chance, even when no real relationship exists. This is why researchers look at the full picture rather than treating any single p-value as proof.
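Given a chi-square statistic and its degrees of freedom, the p-value is the upper-tail probability of the corresponding chi-square distribution. A sketch using `scipy.stats.chi2`, with an assumed statistic of 6.93 from a 2×2 table:

```python
from scipy.stats import chi2

# Assumed values for illustration: a statistic of 6.93 with 1 degree of freedom.
stat, dof = 6.93, 1

# Survival function = P(X >= stat) under the null hypothesis.
p = chi2.sf(stat, dof)
print(f"p = {p:.4f}")
```

Software reports this automatically, but computing it directly makes clear that the p-value is just a tail area under the chi-square curve.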
What It Doesn’t Tell You
A significant chi-square result tells you a relationship likely exists, but it doesn’t tell you how strong that relationship is. A massive sample can produce a statistically significant result even when the actual association is tiny and practically meaningless.
To measure strength, researchers use a follow-up calculation called Cramér’s V, which ranges from 0 to 1. Values below 0.10 indicate a negligible association, 0.10 to 0.20 is weak, 0.20 to 0.40 is moderate, and values above 0.40 indicate a relatively strong association. If you get a significant chi-square result, checking the effect size tells you whether the relationship actually matters in practical terms.
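Cramér’s V is the square root of the chi-square statistic divided by the sample size times one less than the smaller table dimension. A sketch with a hypothetical 2×2 table:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical 2x2 table of counts.
observed = np.array([[30, 20],
                     [25, 45]])

chi2_stat, p, dof, _ = chi2_contingency(observed, correction=False)

# Cramér's V = sqrt(chi2 / (n * (min(rows, cols) - 1))), bounded in [0, 1].
n = observed.sum()
k = min(observed.shape) - 1
cramers_v = np.sqrt(chi2_stat / (n * k))
print(f"Cramér's V = {cramers_v:.3f}")
```

Here the result lands in the moderate range even though the p-value alone would just say “significant.”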
The test also can’t tell you anything about causation. If you find a significant relationship between smoking and lung disease, the chi-square confirms those two things are linked in your data, but it can’t prove one causes the other.
Degrees of Freedom
When you look up your chi-square result to find a p-value, you need the degrees of freedom. For a contingency table (the grid of counts used in a test of independence), this is calculated by taking the number of rows minus one, multiplied by the number of columns minus one. A simple 2×2 table has just one degree of freedom. A 3×4 table has six. The degrees of freedom determine which chi-square distribution your result is compared against, which in turn determines the p-value.
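The formula reduces to a one-liner; a trivial sketch with the two table sizes mentioned above:

```python
def chi_square_dof(n_rows: int, n_cols: int) -> int:
    """Degrees of freedom for a chi-square test of independence."""
    return (n_rows - 1) * (n_cols - 1)

print(chi_square_dof(2, 2))  # 1
print(chi_square_dof(3, 4))  # 6
```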
When It Works and When It Doesn’t
The chi-square test has a few requirements. First, your data must be counts of things in categories, not averages or measurements. Second, each observation should be independent, meaning one person’s data doesn’t influence another’s. Third, and this is the one people most often run into, your expected cell counts need to be large enough. The common rule of thumb is that every expected count should be at least 5; when cells fall below that, the chi-square approximation becomes unreliable. Statistical software will usually flag this with a warning message.
When you have small expected counts, the alternative is Fisher’s exact test. It calculates the exact probability rather than relying on an approximation, making it reliable even with small samples. Most software runs it automatically when needed.
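Fisher’s exact test is available as `scipy.stats.fisher_exact`. A sketch with a hypothetical small-sample 2×2 table where the chi-square approximation would be questionable:

```python
from scipy.stats import fisher_exact

# Hypothetical 2x2 table with small counts (several expected counts below 5).
table = [[3, 9],
         [8, 2]]

# Returns the sample odds ratio and an exact two-sided p-value.
odds_ratio, p = fisher_exact(table)
print(f"odds ratio = {odds_ratio:.3f}, p = {p:.4f}")
```

Because it computes the probability exactly, no minimum-count requirement applies.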
A Practical Example
Suppose a clinical trial randomly assigns 55 patients to two groups. One group receives a standard treatment and the other receives a new treatment. After the study, you count how many people in each group improved and how many didn’t. You now have two categorical variables (treatment group and outcome) and a 2×2 table of counts.
The null hypothesis is that there’s no relationship between which treatment someone received and whether they improved. The alternative hypothesis is that there is a relationship. Running the chi-square test compares the actual recovery rates against what you’d expect if both treatments were equally effective. If the p-value comes back below 0.05, you’d conclude that the treatments differ in effectiveness. If it’s above 0.05, you don’t have enough evidence to say they’re different.
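The whole workflow fits in a few lines. A sketch with made-up outcome counts for the 55-patient trial (the split between groups and the recovery numbers are hypothetical):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts:      improved  not improved
trial = np.array([[20,  8],   # standard treatment (28 patients)
                  [12, 15]])  # new treatment      (27 patients)

chi2_stat, p, dof, expected = chi2_contingency(trial)

if p < 0.05:
    print(f"p = {p:.3f}: the treatments appear to differ in effectiveness")
else:
    print(f"p = {p:.3f}: not enough evidence that the treatments differ")
```

Swapping in different row and column labels adapts the same code to any of the group-comparison scenarios described below.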
This same logic applies to any situation where you’re comparing groups on a categorical outcome: whether different demographics prefer different products, whether students from different schools pass at different rates, or whether a risk factor is associated with a disease.