How to Determine Sample Size: Formulas and Tools

Determining sample size comes down to four key inputs: how confident you want to be in your results, how precise you need them to be, how much variability exists in what you’re measuring, and how large your population is. The specific formula changes depending on whether you’re running a survey, an experiment, or a qualitative study, but the core logic is the same: larger samples give you more precision, and you can calculate exactly how many participants you need before collecting a single data point.

The Four Variables That Drive Sample Size

Every sample size calculation revolves around the same set of inputs, even if the terminology shifts between fields. Understanding what each one does lets you make informed tradeoffs rather than guessing.

Confidence level is the probability that your results reflect reality rather than random chance. A 95% confidence level, the most common standard, means that if you repeated your study 100 times, roughly 95 of the resulting intervals would contain the true value. You can also use 90% (acceptable for lower-stakes decisions) or 99% (common in medical or safety-critical research). Higher confidence requires a larger sample.

Margin of error is how much imprecision you’re willing to tolerate. A margin of error of plus or minus 3% means your true result could be 3 percentage points above or below what your sample shows. Tighter margins require more participants. A survey of 1,000 people typically yields a margin of error around 3% at 95% confidence. Doubling that sample to 2,000 only shrinks the margin to about 2%, which illustrates the diminishing returns of adding more participants.

Variability captures how spread out responses or measurements are in your population. For surveys measuring proportions (like “what percentage of customers prefer option A”), this is the expected proportion. If you have no idea what to expect, using 0.5 (50/50 split) gives you the most conservative estimate and the largest sample size. For studies measuring continuous outcomes like blood pressure or test scores, variability is expressed as the standard deviation, which you can pull from previous studies or a small pilot.

Population size matters less than most people think. For large populations (above roughly 10,000), increasing the population barely changes the required sample size. A well-designed survey of 1,000 people can represent a city of 100,000 or a country of 100 million with similar precision. Population size only becomes a meaningful factor when your total group is small.

The Core Formula for Surveys

The most widely used formula for estimating sample size in surveys and prevalence studies is straightforward. You need three numbers: the Z-score corresponding to your confidence level, the expected proportion (or standard deviation), and your desired margin of error.

The formula is: n = (Z² × p × (1 – p)) / d²

Here, n is your sample size, Z is the Z-score for your confidence level, p is the expected proportion, and d is your margin of error. The Z-scores you’ll use most often are 1.645 for 90% confidence, 1.960 for 95% confidence, and 2.576 for 99% confidence.

A quick example: you want to survey customers about satisfaction, you expect roughly 50% will be satisfied (or you’re unsure, so you default to 0.5), and you want a 5% margin of error at 95% confidence. That gives you: (1.96² × 0.5 × 0.5) / 0.05² = (3.84 × 0.25) / 0.0025 = 384 respondents. If you tighten the margin of error to 3%, the required sample jumps to about 1,067.
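This calculation is easy to script. Here is a minimal sketch in Python (the function name `survey_sample_size` is our own, not a standard library call):

```python
import math

# Z-scores for the confidence levels discussed above
Z_SCORES = {0.90: 1.645, 0.95: 1.960, 0.99: 2.576}

def survey_sample_size(margin_of_error, confidence=0.95, proportion=0.5):
    """n = Z^2 * p * (1 - p) / d^2, rounded up to a whole respondent."""
    z = Z_SCORES[confidence]
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)

print(survey_sample_size(0.05))  # 385
print(survey_sample_size(0.03))  # 1068
```

Note that using the full-precision Z-score of 1.960 gives 384.16 at a 5% margin, which rounds up to 385; the 384 in the worked example comes from the rounded 3.84. Since fractional respondents don't exist, rounding up is the safe convention.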

These benchmarks from published survey research hold up well in practice: a sample of 400 gives you roughly a 5% margin of error, 750 gets you to 4%, and 1,000 lands around 3%, all at 95% confidence.

Adjusting for Small Populations

When your total population is under about 10,000, the formula above will overestimate how many people you need. The finite population correction adjusts for this: n’ = n / (1 + (n / N)), where n is the sample size from the original formula and N is your total population.

For example, if the formula says you need 384 respondents but your entire population is only 2,000 people, the adjusted calculation gives you: 384 / (1 + 384/2000) = 384 / 1.192 = about 322. The smaller your population relative to the calculated sample, the bigger the reduction.
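The correction is a one-liner. A sketch (function name is our own):

```python
import math

def finite_population_correction(n, population):
    """Adjust a calculated sample size n for a finite population of size N,
    rounding up to a whole respondent."""
    return math.ceil(n / (1 + n / population))

print(finite_population_correction(384, 2000))  # 323
```

Rounding up gives 323 here; the "about 322" in the example above is the unrounded value (322.1).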

Sample Size for Experiments and Clinical Studies

When you’re comparing two groups, like testing whether a new treatment works better than an existing one, the calculation shifts from margin of error to something called power analysis. This approach balances four connected variables: alpha level, statistical power, effect size, and sample size. Change any one, and the others shift.

Alpha level is the probability of a false positive: concluding a difference exists when it doesn't. The standard is 0.05 (5% risk). Stricter studies use 0.01.

Statistical power is the probability that your study will detect a real difference when one exists. The accepted minimum is 0.80 (80%), meaning you have an 80% chance of catching a true effect. Some studies aim for 0.90.

Effect size is the smallest difference between groups that you consider meaningful. This is the variable most people struggle with, because it requires you to decide in advance what counts as a meaningful result. A drug that lowers blood pressure by 1 mmHg might not matter clinically, but a 10 mmHg drop would. The smaller the effect you want to detect, the larger your sample needs to be. This is one of the most important relationships in study design: detecting subtle differences requires dramatically more participants than detecting obvious ones.

The specific formula depends on the statistical test you plan to use (comparing two means, two proportions, or something more complex), but the logic is consistent. You need previous research or a pilot study to estimate the expected effect size and variability. Without those, you’re guessing.
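For the common case of comparing two group means, the standard normal-approximation formula is n per group = 2 × (z₁₋α/₂ + z₁₋β)² × (σ/Δ)². A sketch with the conventional z-values hardcoded; the blood-pressure effect of 10 mmHg echoes the example above, but the standard deviation of 20 mmHg is an invented pilot value, purely for illustration:

```python
import math

Z_ALPHA = 1.960   # two-sided alpha = 0.05
Z_BETA = 0.8416   # power = 0.80

def n_per_group_two_means(effect, sd, z_alpha=Z_ALPHA, z_beta=Z_BETA):
    """Sample size per group to detect a true difference `effect` between
    two means, assuming a common standard deviation `sd` (normal approx.)."""
    return math.ceil(2 * (z_alpha + z_beta) ** 2 * (sd / effect) ** 2)

# Detect a 10 mmHg drop, assuming SD = 20 mmHg (made-up pilot estimate)
print(n_per_group_two_means(effect=10, sd=20))  # 63 per group
# Halving the detectable effect quadruples the requirement
print(n_per_group_two_means(effect=5, sd=20))   # 252 per group
```

The second call makes the point in L43 concrete: the sample size scales with the inverse square of the effect size, so chasing a subtle effect gets expensive fast.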

Sample Size for A/B Tests

A/B testing in product development and marketing uses the same statistical foundations as clinical experiments, just with different terminology. The key inputs are your baseline conversion rate (what percentage of users currently take the desired action), the minimum detectable effect (the smallest improvement worth detecting), your significance level (typically 0.05), and your statistical power (typically 0.80).

The minimum detectable effect is where most teams need to think carefully. If your current conversion rate is 5% and you want to detect a relative improvement of 10% (bringing it to 5.5%), you’ll need far more users per variation than if you’re looking for a 50% relative improvement (bringing it to 7.5%). The smaller the change you care about, the longer your test needs to run.
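The two scenarios above can be checked with the standard two-proportion calculation at the defaults mentioned (0.05 significance, 0.80 power). This sketch uses a normal approximation with hardcoded z-values; commercial platforms may use slightly different formulas, so treat the exact numbers as ballpark:

```python
import math

def n_per_variation(p1, p2, z_alpha=1.960, z_beta=0.8416):
    """Users needed per variation to detect a shift from conversion rate
    p1 to p2 at two-sided alpha = 0.05 and 80% power (normal approx.)."""
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

# 10% relative lift on a 5% baseline vs. a 50% relative lift
print(n_per_variation(0.05, 0.055))  # roughly 31,000 per variation
print(n_per_variation(0.05, 0.075))  # under 1,500 per variation
```

The gap is stark: the small lift needs over twenty times as many users per variation as the large one.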

Most A/B testing platforms split traffic evenly between control and test groups (a 50/50 split), which is statistically optimal. One-sided tests, which only check whether the new version is better rather than just different, require smaller samples and are recommended for most product decisions where you’d only ship an improvement.

Rather than calculating this by hand, online A/B test calculators from platforms like Statsig or Optimizely let you plug in your baseline rate, minimum detectable effect, significance level, and power to get the required sample per variation instantly.

Sample Size in Qualitative Research

Qualitative studies, like interview-based or ethnographic research, don’t use formulas at all. Instead, the standard is data saturation: you keep collecting data until new interviews or observations stop producing new insights. As one foundational definition puts it, saturation is “the point in coding when you find that no new codes occur in the data.” You’re hearing the same themes repeated, and additional participants add volume but not new information.

In practice, many qualitative studies reach saturation somewhere between 12 and 30 interviews, depending on how narrow or broad the research question is. Homogeneous groups (people with very similar experiences) saturate faster. Studies exploring diverse perspectives across multiple subgroups need more participants. There’s no magic number, but experienced qualitative researchers often plan for an initial target and then assess saturation as data comes in.

Accounting for Dropouts and Non-Response

The number your formula produces is the minimum number of completed, usable responses you need. In the real world, some people never respond to surveys, others drop out of studies partway through, and some provide unusable data. You should inflate your initial recruitment target to compensate.

If you expect a 20% non-response rate, divide your required sample by 0.80. So if you need 400 completed responses, recruit 400 / 0.80 = 500 people. For mail surveys, non-response rates of 40% or higher are common, which means you might need to contact more than double your target sample. Online surveys and clinical trials each have their own typical dropout patterns, so base your adjustment on what’s realistic for your method.
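The inflation step is trivial to script (function name is our own):

```python
import math

def recruitment_target(completed_needed, nonresponse_rate):
    """Inflate a required completed-response count for expected non-response,
    rounding up to a whole participant."""
    return math.ceil(completed_needed / (1 - nonresponse_rate))

print(recruitment_target(400, 0.20))  # 500, matching the example above
print(recruitment_target(400, 0.55))  # 889, e.g. a high-non-response mail survey
```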

Tools That Do the Math for You

For simple surveys, online calculators from sites like SurveyMonkey, Qualtrics, or Raosoft let you enter your confidence level, margin of error, and population size and get an answer in seconds. These work well for straightforward proportion-based calculations.

For experiments, clinical studies, or any design involving group comparisons, G*Power is the most widely recommended tool. It’s free, handles a wide range of statistical tests (t-tests, ANOVA, regression, chi-square), includes built-in effect size calculators, and runs on both Windows and Mac. It lets you work backward too: given a fixed sample size, it will tell you how much power your study actually has, which is useful when budget or time constraints cap your recruitment.

For A/B tests, dedicated calculators built into experimentation platforms are the fastest option and already default to the settings most product teams need.

Whichever tool you use, the quality of your sample size estimate depends entirely on the quality of your inputs. An estimate of variability or effect size pulled from thin air will produce a sample size that’s equally unreliable. When possible, ground your inputs in previous research, published benchmarks, or a small pilot study.