What Does Statistically Significant Mean?

A result is statistically significant when it is unlikely to have occurred by chance alone. More specifically, researchers calculate a number called a p-value, and if that number falls below a pre-set threshold (almost always 0.05, or 5%), the result is declared statistically significant. This means there’s less than a 5% probability of seeing a result at least this extreme if nothing real were going on.

How the P-Value Works

Every statistical test starts with a baseline assumption: that whatever you’re studying has no real effect. This is called the null hypothesis. If you’re testing whether a new drug lowers blood pressure, the null hypothesis says the drug does nothing and any difference you observe is just random noise.

The p-value tells you how likely your observed result (or something more extreme) would be if that “no effect” assumption were true. A p-value of 0.03, for instance, means there’s only a 3% chance of seeing data at least this dramatic in a world where the drug truly does nothing. Because 3% is below the 5% cutoff, you’d call the result statistically significant and reject the null hypothesis.
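To make that concrete, here is a minimal sketch in Python of the drug example, using SciPy’s two-sample t-test on invented numbers. The group sizes, means, and spreads below are assumptions chosen for illustration, not data from any real trial.

```python
# Minimal sketch: a two-sample t-test on made-up blood pressure changes.
# All numbers here are invented for illustration only.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
placebo = rng.normal(loc=0.0, scale=8.0, size=50)   # mean change: 0 mmHg
drug = rng.normal(loc=-4.0, scale=8.0, size=50)     # mean change: -4 mmHg

t_stat, p_value = stats.ttest_ind(drug, placebo)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, the convention is to reject the null hypothesis of "no effect".
print("statistically significant" if p_value < 0.05 else "not significant")
```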

One critical point that trips people up: a p-value of 0.03 does not mean there’s a 3% chance the drug doesn’t work. It’s not telling you the probability that any hypothesis is true or false. It only tells you how surprising your data would be under the assumption of no effect. That distinction matters, and confusing the two is one of the most common mistakes in interpreting research.

Why 0.05?

The 5% threshold is a convention, not a law of nature. Researchers can and do set stricter cutoffs when the stakes are higher. Genetics studies routinely use thresholds of 0.00000005 (5 × 10⁻⁸) because they’re testing millions of comparisons at once. Some social science research uses 0.10. A review of published papers found that 9 out of 10 studies that explicitly stated their threshold used 0.05, with the rest using 0.01. The 5% level has simply become the default in most fields.
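The genetics cutoff follows from simple arithmetic: divide the usual 5% by the number of comparisons, a common adjustment known as the Bonferroni correction. A minimal sketch of that logic:

```python
# Sketch: dividing alpha by the number of tests (the Bonferroni correction)
# keeps the overall false-positive risk near 5% when many comparisons run at once.
alpha = 0.05
for n_tests in (1, 20, 1_000_000):
    print(f"{n_tests:>9,} tests -> per-test threshold {alpha / n_tests:g}")
# 1,000,000 tests -> 5e-08, the genome-wide threshold mentioned above.
```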

The threshold you choose is called the alpha level, and it directly controls your tolerance for a specific kind of mistake: concluding something is real when it isn’t.

Two Kinds of Errors

When you draw a line and declare results on one side “significant” and the other side “not significant,” you open the door to two types of mistakes. A false positive (called a Type I error) happens when you reject the null hypothesis even though it’s actually true. You conclude the drug works, but it doesn’t. The alpha level is the probability of making this mistake when the null hypothesis really is true: at 0.05, you’re accepting a 5% chance of a false positive in any study where there’s genuinely no effect.
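A quick simulation shows what that 5% means in practice: run many studies in which the null hypothesis really is true, and roughly one in twenty will clear the 0.05 threshold anyway. The group sizes and number of simulated studies below are arbitrary choices for illustration.

```python
# Sketch: simulate many studies where the null hypothesis is true (both groups
# drawn from the same distribution) and count how often p falls below 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n_studies, n_per_group = 10_000, 30
false_positives = 0

for _ in range(n_studies):
    a = rng.normal(size=n_per_group)
    b = rng.normal(size=n_per_group)   # same distribution: no real effect
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"false positive rate = {false_positives / n_studies:.3f}")  # close to 0.05
```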

A false negative (Type II error) is the opposite: the drug genuinely works, but your data aren’t dramatic enough to clear the significance threshold, so you miss it. This is more likely when your study is small or when the real effect is subtle. Researchers balance these two errors when designing a study, choosing sample sizes large enough to catch real effects without being so large that trivial ones look important.

Sample Size Changes Everything

The number of people or observations in a study has an enormous influence on whether a result crosses the significance line. Small studies may fail to detect real effects simply because they lack the statistical power to distinguish a genuine pattern from random variation. Large studies have the opposite problem: they can make tiny, meaningless differences look statistically significant.

Consider a clinical trial with 10,000 participants where the treatment group loses 0.5 kg more than the control group. With that many people, even half a kilogram can produce a p-value below 0.05. The result is statistically significant, but losing half a kilogram is not going to change anyone’s health. Similarly, if you wanted to detect a difference as small as 0.1 degrees in a dental measurement, you’d need thousands of patients. Bump that target up to 1 degree, and the required sample size drops drastically. The smaller the difference you’re looking for, the more data you need to find it.
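The same relationship can be sketched with a standard power calculation, here using statsmodels. The effect sizes are standardized (Cohen’s d) and the 80% power target is a common convention; both are chosen for illustration rather than taken from any particular study.

```python
# Sketch: how required sample size grows as the effect you want to detect shrinks.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for effect_size in (0.8, 0.5, 0.2, 0.1):   # large, medium, small, tiny effects
    n = analysis.solve_power(effect_size=effect_size, alpha=0.05, power=0.8)
    print(f"d = {effect_size:>4}: about {n:.0f} participants per group")
# A tiny effect (d = 0.1) needs roughly 25x more people than a medium one (d = 0.5).
```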

Statistical Significance vs. Practical Significance

This is the distinction that matters most for anyone reading health news or research summaries. A statistically significant result only tells you the finding probably isn’t due to chance. It says nothing about whether the finding is large enough to matter in real life.

A study might show that a new blood pressure drug lowers readings by an average of 3.5 mmHg with a p-value well below 0.05. That’s statistically significant. But whether a 3.5 mmHg drop actually improves patient outcomes, reduces heart attacks, or changes quality of life requires entirely different evidence. Clinical significance asks: does this effect make a meaningful difference to a real person? A result can be statistically significant without being clinically significant, and this happens more often than you might expect, especially in large studies.
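A sketch of the weight-loss example above shows both numbers side by side: with 5,000 people per group, a half-kilogram difference is highly significant, yet the standardized effect size is tiny. The 5 kg spread is an assumed value for illustration.

```python
# Sketch: a large study can make a tiny effect "significant". Two groups of
# 5,000, with weight change differing by only 0.5 kg on a 5 kg standard deviation.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
control = rng.normal(loc=0.0, scale=5.0, size=5_000)
treatment = rng.normal(loc=-0.5, scale=5.0, size=5_000)

p = stats.ttest_ind(treatment, control).pvalue
pooled_sd = np.sqrt((control.var(ddof=1) + treatment.var(ddof=1)) / 2)
d = (control.mean() - treatment.mean()) / pooled_sd
print(f"p = {p:.2e} (significant), Cohen's d = {d:.2f} (tiny effect)")
```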

The reverse is also possible. A small pilot study might find a large, meaningful improvement but lack enough participants to reach statistical significance. The effect is real and important, but the data can’t yet rule out chance.

What P-Values Don’t Tell You

The American Statistical Association took the unusual step of issuing a formal statement on p-values, laying out six principles for their proper use. The core message: p-values are useful but widely misunderstood, and no scientific conclusion should rest on a p-value alone.

A p-value does not measure the probability that your hypothesis is true. It does not tell you the size of an effect. It does not tell you whether the result matters. And a result that fails to reach significance (say, p = 0.07) is not proof that nothing is going on. The difference between p = 0.04 and p = 0.06 is often trivial, yet one gets labeled “significant” and the other doesn’t. Many statisticians and journals have argued that this binary classification is unnecessary and sometimes actively misleading.

Confidence intervals offer a more complete picture. Instead of reducing your result to “significant” or “not significant,” a confidence interval gives you a range of plausible effect sizes. A 95% confidence interval of 2.0 to 8.5 tells you the true effect probably falls somewhere in that range, which is far more informative than a single p-value. A growing number of journals now require confidence intervals alongside or instead of p-values for exactly this reason.
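As a sketch, here is how a 95% confidence interval for a difference in means can be computed with the standard pooled two-sample formula. The blood-pressure-change numbers are invented for illustration.

```python
# Sketch: 95% confidence interval for a difference in means (pooled two-sample
# t interval). The blood pressure changes below are invented for illustration.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
drug = rng.normal(loc=-5.0, scale=8.0, size=60)      # mmHg change on the drug
placebo = rng.normal(loc=0.0, scale=8.0, size=60)    # mmHg change on placebo

reduction = placebo.mean() - drug.mean()             # how much more the drug lowered BP
dof = len(drug) + len(placebo) - 2
pooled_var = ((len(drug) - 1) * drug.var(ddof=1)
              + (len(placebo) - 1) * placebo.var(ddof=1)) / dof
se = np.sqrt(pooled_var * (1 / len(drug) + 1 / len(placebo)))
t_crit = stats.t.ppf(0.975, dof)                     # two-sided 95% critical value
print(f"reduction = {reduction:.1f} mmHg, "
      f"95% CI [{reduction - t_crit * se:.1f}, {reduction + t_crit * se:.1f}]")
```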

P-Hacking: Gaming the System

Because careers and publication depend heavily on producing statistically significant results, some researchers (consciously or not) manipulate their analyses until they find a p-value below 0.05. This is called p-hacking, and it takes many forms: testing dozens of variables but only reporting the ones that reach significance, removing outliers after seeing the results, stopping data collection the moment a significant p-value appears, or splitting and combining groups in different ways until something works.
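A small simulation illustrates why the first tactic, testing many variables and reporting only the winners, inflates false positives: with 20 independent tests and no real effects anywhere, the chance that at least one comes out “significant” is about 64%. The numbers of variables, experiments, and participants below are arbitrary choices for illustration.

```python
# Sketch: test 20 unrelated variables with no real effects and see how often
# at least one clears p < 0.05 purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)
n_experiments, n_variables, n_per_group = 2_000, 20, 30
at_least_one_hit = 0

for _ in range(n_experiments):
    pvals = [
        stats.ttest_ind(rng.normal(size=n_per_group),
                        rng.normal(size=n_per_group)).pvalue
        for _ in range(n_variables)
    ]
    if min(pvals) < 0.05:
        at_least_one_hit += 1

print(f"share of experiments with a 'significant' finding: "
      f"{at_least_one_hit / n_experiments:.2f}")
# Theory predicts 1 - 0.95**20, about 0.64, even though nothing real is going on.
```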

The result is a published literature with an inflated number of false positives. When researchers analyzed the distribution of p-values across large bodies of published work, they found a suspicious cluster of values just below 0.05, exactly the pattern you’d expect if results were being nudged over the line. This is one of the driving forces behind the “replication crisis,” where many published findings fail to hold up when other scientists repeat the experiments.

Transparency helps counteract this. Pre-registering studies (publicly declaring your analysis plan before collecting data) and reporting all analyses, not just the flattering ones, make p-hacking much harder to pull off undetected.

How to Read Significance Claims

When you encounter a headline saying researchers found a “statistically significant” result, a few questions will help you evaluate it. First, how large was the effect? A significant p-value with a tiny effect size is often meaningless in practice. Second, how big was the study? Very large studies can make small differences significant, while very small studies may not be powered to detect important ones. Third, was the analysis planned in advance, or does it look like the researchers went fishing for a result?

Statistical significance is a useful filter for separating signal from noise, but it was never designed to be the final word on whether something is true or important. It’s one piece of evidence, best interpreted alongside the size of the effect, the quality of the study design, and whether other independent studies found the same thing.