Effect size is the number you plug into a power analysis to represent how large a difference or relationship you expect to find. It’s the bridge between your research question and the sample size you’ll need: a larger expected effect means fewer participants, while a smaller expected effect demands a much bigger sample. Without specifying an effect size, a power analysis can’t be performed at all, because the calculation has no way to know what signal you’re trying to detect.
What Effect Size Actually Measures
Effect size quantifies the magnitude of a difference or relationship in standardized units. Where a p-value only tells you how unlikely your data would be if there were no real effect, effect size tells you how big that effect is. Think of it this way: a medication might produce a statistically significant improvement (small p-value), but the actual improvement could be tiny or substantial. Effect size captures that distinction.
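To make the distinction concrete, here is a minimal Python sketch (using numpy and scipy, both assumed available) in which a trivially small difference between two hypothetical groups still comes out statistically significant once the sample is large enough.

```python
# Minimal sketch: a tiny effect can still be "statistically significant"
# when the sample is huge. Group names and numbers are hypothetical.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
control = rng.normal(loc=100.0, scale=20.0, size=50_000)
treated = rng.normal(loc=100.6, scale=20.0, size=50_000)  # true d = 0.03

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d from the observed data (pooled standard deviation)
pooled_sd = np.sqrt((control.var(ddof=1) + treated.var(ddof=1)) / 2)
d = (treated.mean() - control.mean()) / pooled_sd

print(f"p = {p_value:.4f}, Cohen's d = {d:.3f}")
# With this much data the p-value is typically well below 0.05,
# yet d is only about 0.03 -- significant, but far too small to matter.
```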
The most common version is Cohen’s d, which expresses the difference between two group averages in terms of standard deviations. If a treatment group scores 10 points higher than a control group and the pooled standard deviation is 20, Cohen’s d is 0.5. That means the groups are separated by half a standard deviation. Other types include Pearson’s r (for correlations), Cohen’s f (for comparing three or more groups in ANOVA), and Hedges’ g, which applies a small correction to Cohen’s d that reduces bias in small samples.
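As an illustration, the following sketch computes Cohen's d from summary statistics and applies the Hedges' g correction. The 10-point difference and pooled standard deviation of 20 come from the example above; the group sizes of 30 are hypothetical.

```python
# Sketch: Cohen's d from summary statistics, plus the Hedges' g
# small-sample correction. Group sizes are illustrative.
import math

def cohens_d(mean1, mean2, sd1, sd2, n1, n2):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_sd = math.sqrt(((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

def hedges_g(d, n1, n2):
    """Apply the small-sample bias correction to Cohen's d."""
    correction = 1 - 3 / (4 * (n1 + n2) - 9)
    return d * correction

d = cohens_d(mean1=110, mean2=100, sd1=20, sd2=20, n1=30, n2=30)
g = hedges_g(d, n1=30, n2=30)
print(f"Cohen's d = {d:.2f}, Hedges' g = {g:.2f}")  # d = 0.50, g ≈ 0.49
```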
How Effect Size Drives Sample Size
In a power analysis, four values are linked together: effect size, sample size, significance level (alpha, typically 0.05), and statistical power (the probability of detecting a real effect, typically set at 80%). If you fix three of them, the fourth is determined. In practice, researchers fix alpha at 0.05, power at 80%, and their anticipated effect size, then solve for sample size.
The relationship between effect size and sample size is inverse and steep. As the expected effect gets smaller, the required sample size climbs rapidly. A study designed to detect a large effect (d = 0.8) between two groups might need around 26 participants per group. Detecting a medium effect (d = 0.5) with the same power requires roughly 64 per group. A small effect (d = 0.2) pushes the requirement to nearly 400 per group. This is why getting the effect size estimate right matters so much: underestimate the true effect and the calculation will call for far more participants than you actually need; overestimate it and you build false confidence into an underpowered design that may waste every resource it consumes.
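The steepness of that curve is easy to see with a power-analysis library. The sketch below uses statsmodels (assumed installed) to solve for the per-group sample size at Cohen's three benchmark values of d, with alpha and power at the usual defaults.

```python
# Sketch of the effect-size / sample-size tradeoff using statsmodels.
# Alpha and power follow the common 0.05 / 0.80 defaults.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in (0.8, 0.5, 0.2):
    n_per_group = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80,
                                       alternative='two-sided')
    print(f"d = {d}: about {math.ceil(n_per_group)} participants per group")
# Roughly 26, 64, and 394 per group -- the requirement climbs steeply
# as the expected effect shrinks.
```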
Cohen’s Benchmarks: Small, Medium, and Large
Jacob Cohen, the statistician who formalized much of power analysis, proposed rough benchmarks that researchers still use as starting points. For Cohen’s d (comparing two means), a small effect is 0.2, medium is 0.5, and large is 0.8. Cohen described a medium effect as “visible to the naked eye of a careful observer,” a small effect as noticeably smaller than medium but not trivial, and a large effect as the same distance above medium as small is below it.
For ANOVA (comparing three or more groups), the corresponding metric is Cohen’s f, with benchmarks of 0.1 (small), 0.25 (medium), and 0.4 (large). For correlations using Pearson’s r, the conventional cutoffs are 0.1, 0.3, and 0.5.
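For comparison, a similar sketch with statsmodels' FTestAnovaPower (assumed installed) gives the total sample size needed to detect a "medium" Cohen's f of 0.25 across three hypothetical groups.

```python
# Sketch: required total sample size for a one-way ANOVA at Cohen's
# "medium" f of 0.25, with three groups.
import math
from statsmodels.stats.power import FTestAnovaPower

anova = FTestAnovaPower()
n_total = anova.solve_power(effect_size=0.25, alpha=0.05, power=0.80, k_groups=3)
print(f"Total N ≈ {math.ceil(n_total)} ({math.ceil(n_total / 3)} per group)")
# Just under 160 in total, a little over 50 per group.
```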
These benchmarks are convenient, but they’re also generic. A “small” effect in one field might be clinically meaningful, while a “large” effect in another might be trivially obvious. Cohen himself cautioned against relying on his conventions when better information is available. They work best as a fallback when you have no prior data and no domain-specific expectations.
Where to Get Your Effect Size Estimate
The most reliable approach is to pull effect sizes from published studies that examined similar interventions, populations, or relationships. A thorough literature review gives you a realistic range. If multiple studies report effect sizes between 0.3 and 0.5, you have a reasonable basis for planning your own study.
When no prior literature exists, pilot data can help. A small preliminary study won’t give you a precise effect size, but it can narrow the range. Some researchers also work with a concept called the minimal clinically important difference, which is the smallest change that patients or practitioners would consider meaningful enough to matter. This value comes from clinical judgment and prior research rather than from statistical conventions, and it anchors the power analysis in practical significance rather than arbitrary benchmarks.
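One way to use such a value in planning is to divide it by the outcome's standard deviation to obtain a standardized effect size, then solve for sample size. The sketch below does this with hypothetical values for the minimal clinically important difference and the standard deviation; statsmodels is assumed installed.

```python
# Sketch: anchoring a power analysis to a minimal clinically important
# difference (MCID). The MCID and SD values are illustrative.
import math
from statsmodels.stats.power import TTestIndPower

mcid = 5.0         # smallest change patients would consider meaningful
outcome_sd = 12.0  # typical standard deviation of the outcome measure
d = mcid / outcome_sd  # convert to a standardized effect size (~0.42)

n_per_group = TTestIndPower().solve_power(effect_size=d, alpha=0.05, power=0.80)
print(f"d = {d:.2f}, about {math.ceil(n_per_group)} participants per group")
# Around 92 per group for these illustrative values.
```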
A third option is simply to use Cohen’s conventions, picking “small,” “medium,” or “large” based on your best judgment. This is the weakest approach because it ignores context, but it’s better than skipping the power analysis entirely. Power analysis software like G*Power makes this easy by offering the conventional values as defaults. For instance, when you set up an independent t-test in G*Power, it provides 0.2, 0.5, and 0.8 as small, medium, and large options for effect size d. For a one-way ANOVA, it offers 0.1, 0.25, and 0.4 for effect size f.
Effect Size vs. Clinical Significance
Statistical effect size and clinical significance are related but distinct. Effect size is a standardized metric based on the data’s variability. Clinical significance asks a different question: does this change matter to the patient? The minimal clinically important difference (MCID) captures this by defining the smallest improvement a patient would perceive as beneficial and that would justify changing their treatment.
The gap between these concepts is real. A study might detect a statistically significant effect that’s too small to matter in a patient’s daily life. Conversely, a clinically meaningful effect might not reach statistical significance if the study was underpowered. Using the MCID as your target effect size in a power analysis aligns your study design with what actually matters to people rather than just what the math can detect.
Calculating the MCID itself is complicated. Anchor-based methods ask patients directly whether they feel better and correlate their answers with score changes, but these are subjective and vulnerable to recall bias. Distribution-based methods use the statistical properties of the data, but they don’t incorporate the patient’s perspective at all. Neither approach is perfect, and published MCIDs for the same outcome measure can vary substantially depending on the method used.
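To show what a distribution-based estimate looks like in practice, here is a short sketch of two widely used rules of thumb, the half-standard-deviation rule and the one-SEM rule, with hypothetical values for the baseline standard deviation and test-retest reliability.

```python
# Sketch of two common distribution-based MCID estimates: the half-SD rule
# and the standard error of measurement (SEM). Values are hypothetical,
# and neither estimate reflects the patient's perspective.
import math

baseline_sd = 12.0            # SD of the outcome measure at baseline
test_retest_reliability = 0.85

half_sd_mcid = 0.5 * baseline_sd
sem_mcid = baseline_sd * math.sqrt(1 - test_retest_reliability)

print(f"Half-SD estimate: {half_sd_mcid:.1f} points")
print(f"1-SEM estimate:   {sem_mcid:.1f} points")
# The two methods often disagree, which is one reason published MCIDs
# for the same instrument vary so widely.
```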
Why Post-Hoc Power Analysis Doesn’t Work
After a study is completed, researchers sometimes calculate “observed power” using the effect size from their actual results. This is called post-hoc or retrospective power analysis, and it’s widely discouraged. A commentary in Genetics in Medicine called it “not only incorrect and uninformative, but also potentially harmful.”
The core problem is circular reasoning. Post-hoc power has a direct, one-to-one mathematical relationship with the p-value you already observed. A large p-value (non-significant result) always corresponds to low observed power, and a small p-value always corresponds to high observed power, so the calculation adds no new information. It also treats the observed effect size as if it were the true effect size, an assumption you could only justify if you already knew the answer before running the study.
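That one-to-one relationship can be demonstrated directly. The sketch below, written for a simple two-sided z-test, converts a handful of p-values into the "observed power" they imply; the mapping is mechanical, and a result sitting exactly at p = 0.05 always corresponds to observed power of roughly 50%.

```python
# Sketch: "observed power" for a two-sided z-test is just a
# transformation of the p-value you already have.
from scipy import stats

alpha = 0.05
z_crit = stats.norm.ppf(1 - alpha / 2)   # ≈ 1.96

for p in (0.50, 0.20, 0.05, 0.01):
    z_obs = stats.norm.ppf(1 - p / 2)    # z-score implied by the p-value
    observed_power = (1 - stats.norm.cdf(z_crit - z_obs)
                      + stats.norm.cdf(-z_crit - z_obs))
    print(f"p = {p:.2f} -> observed power ≈ {observed_power:.2f}")
# Observed power rises mechanically as p falls; it tells you nothing new.
```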
The most damaging misuse occurs when researchers get a non-significant result, calculate low post-hoc power, and conclude the study was “underpowered” rather than that the effect may not exist. This makes it impossible to distinguish between a study that failed because it was too small and one that failed because there was nothing to find. Power analysis belongs before data collection, during the design phase. After the study, researchers are better served by confidence intervals and careful interpretation of their findings in context.
Using Effect Size in Practice
When you sit down to run a power analysis, the effect size is typically the hardest input to determine. Alpha and power have strong defaults (0.05 and 0.80), and the statistical test is dictated by your research design. Effect size is the one value that requires genuine judgment.
Start with the literature. If prior studies consistently find Cohen’s d values around 0.4 for similar interventions, use that as your anchor. If the literature is sparse or mixed, consider what the smallest meaningful difference would be in your context. For clinical research, this often means consulting with practitioners about what change would actually alter patient management. For behavioral or social science research, it might mean thinking about what difference would have real-world implications.
If you’re uncertain, run the power analysis at multiple effect sizes. Calculate the required sample size for a small effect, a medium effect, and the effect you consider most plausible. This gives you a range rather than a single number, and it makes the tradeoffs visible: you can see exactly what it would cost in participants to detect a smaller effect. That transparency helps you, your collaborators, and funding agencies make an informed decision about whether the study is feasible.
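A sensitivity sweep like that takes only a few lines of code. This sketch, again using statsmodels with illustrative effect sizes in the 0.30 to 0.50 range, prints the per-group requirement across the whole range so the tradeoff is visible at a glance.

```python
# Sketch: the same power analysis run across a range of plausible
# effect sizes (values are illustrative) to make the tradeoff explicit.
import math
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
print("d      n per group")
for d in (0.30, 0.35, 0.40, 0.45, 0.50):
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.80)
    print(f"{d:.2f}   {math.ceil(n)}")
# Moving from d = 0.50 to d = 0.30 nearly triples the per-group
# requirement (from about 64 to about 176) -- exactly the kind of
# tradeoff worth showing collaborators and funders.
```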