A two-sample t-test compares the averages of two independent groups to determine whether they’re meaningfully different or if the gap is just due to random chance. You’ll need two separate sets of numerical data, a few assumptions to check, and either a calculator or software to run the math. Here’s the full process from start to finish.
When to Use a Two-Sample t-Test
This test applies whenever you’re comparing measurements from two unrelated groups. Maybe you’re testing whether a new teaching method produces higher exam scores than the traditional one, or whether plants given fertilizer A grow taller than plants given fertilizer B. The key requirements: your outcome is a number (not a category), and the two groups are independent of each other, meaning the people or items in one group have no connection to those in the other.
The two samples can come from two completely separate populations, or from a single population randomly split into two groups that each receive a different treatment. If the same individuals are measured twice (before and after, for example), that’s a paired t-test, not a two-sample test.
Check the Assumptions First
Before running the test, your data needs to meet a few conditions:
- Independence: Each observation is collected independently. One person’s result doesn’t influence another’s.
- Normal distribution: The data in each group should follow a roughly bell-shaped curve. For small samples (under 50), the Shapiro-Wilk test is generally the most powerful check. A p-value above 0.05 on that test means the data shows no significant departure from a normal distribution. For larger samples, the Kolmogorov-Smirnov test works well, or you can visually inspect a histogram or Q-Q plot.
- Equal or unequal variances: You need to know whether the two groups have similar spread. Levene’s test checks this. If the variances are roughly equal, you’ll use the pooled (equal variance) version of the test. If they’re not, you’ll use Welch’s t-test, which adjusts for the difference. When the two groups have very different sample sizes, pay extra attention to this assumption.
If your data violates normality and your samples are small, a non-parametric alternative like the Mann-Whitney U test is a better choice.
Set Up Your Hypotheses
Every t-test starts with two competing statements. The null hypothesis says there’s no difference between the two group averages. The alternative hypothesis says there is a difference. Written in notation, it looks like this:
- Null hypothesis (H₀): μ₁ = μ₂ (the two population means are equal)
- Alternative hypothesis (H₁): μ₁ ≠ μ₂ (the two population means are not equal)
That’s a two-tailed test, which is the most common version. If you have a specific directional prediction (group A is higher than group B, not just different), you’d use a one-tailed test where H₁ states μ₁ > μ₂ or μ₁ < μ₂. Two-tailed is the safer default unless you have a strong reason to test in only one direction.
You also choose a significance level, called alpha. The standard is 0.05, meaning you’re willing to accept a 5% chance of incorrectly concluding there’s a difference when there isn’t one.
Calculate the t-Statistic
The t-statistic measures how far apart the two group means are, relative to the variability in the data. A larger t-value means the groups are more clearly separated.
If Variances Are Unequal (Welch’s t-Test)
Take the difference between the two sample means, then divide by the square root of (variance₁ / n₁ + variance₂ / n₂). In plainer terms: subtract one group’s average from the other, then divide by a measure of how much uncertainty exists in both groups combined. The formula is:
t = (mean₁ – mean₂) / √(s₁²/n₁ + s₂²/n₂)
where s₁² and s₂² are the sample variances and n₁ and n₂ are the sample sizes.
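To make the arithmetic concrete, here's a minimal sketch of the Welch formula in Python using only the standard library. The function name welch_t is just for illustration, not a library routine:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t-statistic for two independent samples.

    statistics.variance() returns the sample variance
    (n - 1 denominator), which is what the formula needs.
    """
    n1, n2 = len(x), len(y)
    se = sqrt(variance(x) / n1 + variance(y) / n2)  # combined standard error
    return (mean(x) - mean(y)) / se
```

A negative result simply means the first group's mean is smaller than the second's; the sign follows the order you pass the samples in.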
If Variances Are Equal (Pooled t-Test)
When the two groups have similar variability, you first combine their variances into a single “pooled” estimate. The pooled variance is calculated as:
s²pooled = [(n₁ – 1) × s₁² + (n₂ – 1) × s₂²] / (n₁ + n₂ – 2)
Then the t-statistic becomes:
t = (mean₁ – mean₂) / (s_pooled × √(1/n₁ + 1/n₂))
This version weights each group's variance by its degrees of freedom (n – 1), so the larger sample contributes proportionally more to the pooled estimate.
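Here's the pooled version as a short Python sketch, again standard library only (pooled_t is an illustrative name):

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Pooled (equal-variance) two-sample t-statistic."""
    n1, n2 = len(x), len(y)
    # Pooled variance: each group's variance weighted by its degrees of freedom
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / (sqrt(sp2) * sqrt(1 / n1 + 1 / n2))
```

When the two sample variances are similar, this and Welch's version give nearly identical results; they diverge as the variances (or sample sizes) drift apart.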
Find the Degrees of Freedom
Degrees of freedom determine which t-distribution you compare your result against. For the equal-variance version, it’s straightforward: add both sample sizes and subtract 2. If each group has 15 observations, your degrees of freedom are 28.
For Welch’s t-test (unequal variances), the degrees of freedom use a more complex approximation called the Welch-Satterthwaite equation. The result is often not a whole number. Most software handles this calculation automatically, so you rarely need to compute it by hand.
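If you do want to compute it yourself, the Welch-Satterthwaite approximation is only a few lines. A Python sketch (welch_df is an illustrative name):

```python
from statistics import variance

def welch_df(x, y):
    """Welch-Satterthwaite degrees of freedom (usually not a whole number)."""
    n1, n2 = len(x), len(y)
    a = variance(x) / n1  # per-group variance contributions
    b = variance(y) / n2
    return (a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1))
```

The result always falls between the smaller of n₁ – 1 and n₂ – 1 and the pooled value n₁ + n₂ – 2.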
Interpret the Results
Once you have your t-statistic and degrees of freedom, you get a p-value, either from a t-distribution table or from software. The p-value tells you the probability of seeing a difference this large (or larger) if the null hypothesis were true.
If the p-value is less than your alpha (typically 0.05), you reject the null hypothesis and conclude the two group means are statistically different. If the p-value is 0.05 or greater, you fail to reject the null, meaning the data doesn’t provide strong enough evidence that the groups differ.
Statistical significance alone doesn’t tell you whether the difference matters in practical terms. A study with thousands of participants might find a statistically significant difference that’s trivially small. This is where effect size comes in.
Calculate Effect Size With Cohen’s d
Cohen’s d quantifies how large the difference between groups actually is, in standardized units. The formula is:
d = (mean₁ – mean₂) / s_pooled
The pooled standard deviation here uses the same formula as in the equal-variance t-test. The conventional benchmarks: a d of 0.2 is considered a small effect, 0.5 is medium, and 0.8 or above is large. Reporting Cohen’s d alongside your p-value gives a much fuller picture of your results.
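As a quick sketch, Cohen's d in Python using the pooled standard deviation formula given earlier (cohens_d is an illustrative name):

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(x, y):
    """Standardized mean difference using the pooled standard deviation."""
    n1, n2 = len(x), len(y)
    sp2 = ((n1 - 1) * variance(x) + (n2 - 1) * variance(y)) / (n1 + n2 - 2)
    return (mean(x) - mean(y)) / sqrt(sp2)
```

Like the t-statistic, the sign just reflects the order of the groups; report the magnitude when comparing against the 0.2/0.5/0.8 benchmarks.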
Build a Confidence Interval
A confidence interval gives you a range of plausible values for the true difference between the two population means. A 95% confidence interval is calculated as:
(mean₁ – mean₂) ± t_critical × √(s₁²/n₁ + s₂²/n₂)
The critical t-value comes from the t-distribution at your chosen confidence level and degrees of freedom. (The standard error shown here is the unequal-variance form; if you ran the pooled test, use s_pooled × √(1/n₁ + 1/n₂) instead.) If the interval doesn’t contain zero, that aligns with a statistically significant result. If it does contain zero, the true difference could plausibly be nothing.
Confidence intervals are often more informative than p-values alone because they show not just whether a difference exists, but how big it might reasonably be.
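Here's a sketch of the interval calculation in Python. Because the standard library has no t-distribution, the critical t-value must be supplied from a table or software; diff_ci and t_crit are illustrative names:

```python
from math import sqrt
from statistics import mean, variance

def diff_ci(x, y, t_crit):
    """Confidence interval for mean1 - mean2 (unequal-variance standard error).

    t_crit: critical t-value looked up for your confidence level
    and degrees of freedom (e.g., from a t-table).
    """
    n1, n2 = len(x), len(y)
    se = sqrt(variance(x) / n1 + variance(y) / n2)
    diff = mean(x) - mean(y)
    return diff - t_crit * se, diff + t_crit * se
```

Check whether the returned interval straddles zero: if it does, the data are consistent with no true difference at that confidence level.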
Running the Test in Excel
Excel’s T.TEST function (or TTEST in older versions) returns a p-value directly without requiring you to calculate the t-statistic by hand. The syntax is:
=T.TEST(array1, array2, tails, type)
- array1: the cell range for your first group’s data
- array2: the cell range for your second group’s data
- tails: enter 1 for a one-tailed test, 2 for two-tailed
- type: enter 2 for a two-sample test with equal variances, or 3 for unequal variances
So if your first group is in cells A1:A20 and your second group is in B1:B25, a two-tailed test assuming unequal variances would be: =T.TEST(A1:A20, B1:B25, 2, 3). The function returns the p-value directly. Compare it to 0.05 to determine significance.
For the full output including the t-statistic, degrees of freedom, and both one-tailed and two-tailed p-values, use Excel’s Data Analysis ToolPak instead. Go to Data > Data Analysis > t-Test: Two-Sample Assuming Equal Variances (or Unequal Variances), select your data ranges, and enter your alpha level. In R, the same test is a single line: t.test(group1, group2), which defaults to Welch’s version. Add var.equal = TRUE if you want the pooled version.
A Quick Worked Example
Suppose you’re comparing test scores between two classrooms. Classroom A (n = 12) has a mean of 78 and a standard deviation of 10. Classroom B (n = 14) has a mean of 85 and a standard deviation of 12. You’ve checked normality and confirmed roughly equal variances.
First, calculate the pooled variance: [(11 × 100) + (13 × 144)] / 24 = (1100 + 1872) / 24 ≈ 123.8. The pooled standard deviation is √123.8 ≈ 11.13. The t-statistic is (78 – 85) / (11.13 × √(1/12 + 1/14)) = -7 / (11.13 × 0.393) = -7 / 4.37 ≈ -1.60. With 24 degrees of freedom and a two-tailed test, the critical t-value at alpha = 0.05 is about 2.064. Since 1.60 is less than 2.064, you fail to reject the null hypothesis. The difference isn’t statistically significant at the 0.05 level.
Cohen’s d would be 7 / 11.13 = 0.63, a medium-to-large effect. This is a good example of why effect size matters: the groups differ by a meaningful amount, but with small samples, there isn’t enough statistical power to confirm the difference isn’t due to chance.
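You can reproduce the worked example's numbers directly from the summary statistics. A quick Python check (standard library only):

```python
from math import sqrt

# Summary statistics from the worked example
n1, m1, s1 = 12, 78, 10   # Classroom A: size, mean, standard deviation
n2, m2, s2 = 14, 85, 12   # Classroom B

# Pooled variance and standard deviation
sp2 = ((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2)
sp = sqrt(sp2)

# t-statistic and Cohen's d
t = (m1 - m2) / (sp * sqrt(1 / n1 + 1 / n2))
d = abs(m1 - m2) / sp

print(round(sp2, 1), round(t, 2), round(d, 2))  # prints: 123.8 -1.6 0.63
```

The |t| of about 1.60 falls short of the 2.064 critical value, while d ≈ 0.63 is a solidly medium effect, matching the interpretation above.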