Interpreting ANOVA Results in R: P-Values and Effect Size

The ANOVA summary table in R gives you five columns of output, and each one tells you something specific about whether your groups differ. Once you know what each column means, you can read the table in seconds, but the real interpretation goes beyond just checking the p-value. You also need to verify assumptions, measure effect size, and run follow-up tests to find out which groups actually differ.

Reading the ANOVA Summary Table

When you run summary(aov(y ~ group, data = df)), R prints a table with these columns:

  • Df: Degrees of freedom. For the group row, this is the number of groups minus one. For the residuals row, it’s the total number of observations minus the number of groups.
  • Sum Sq: Sum of squares. This measures the total variability explained by the grouping variable (top row) versus the leftover variability not explained by it (residuals row).
  • Mean Sq: Mean square, which is the sum of squares divided by its degrees of freedom. This standardizes the variability so the two rows are comparable.
  • F value: The F-statistic, calculated by dividing the group mean square by the residual mean square. A larger F means the between-group differences are large relative to the within-group noise.
  • Pr(>F): The p-value. This is the probability of seeing an F-statistic this large (or larger) if there were truly no difference between groups.
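To see those five columns in context, here is a minimal sketch with made-up data: three simulated groups (the group names and means are invented for illustration) fed through aov() and summary():

```r
# Hypothetical data: three groups drawn from normal distributions
# with slightly different means (all values made up for illustration).
set.seed(1)
df <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

model <- aov(y ~ group, data = df)
summary(model)  # prints the Df / Sum Sq / Mean Sq / F value / Pr(>F) table
```

With 3 groups and 60 observations, the group row shows Df = 2 and the residuals row Df = 57, matching the formulas above.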

The null hypothesis is that all group means are the same. A p-value below 0.05 is the standard threshold for rejecting that hypothesis, which tells you at least one group mean differs from the others. It does not tell you which groups differ, only that the overall pattern is unlikely to be pure chance.

Choosing Between aov() and lm()

R has two common functions that both run the same underlying math: aov() and lm(). The difference is in what they emphasize. Use aov() when your predictor is categorical (a factor with discrete levels) and you care about the overall effect of that factor. Use lm() when you want to see individual regression coefficients, which is more natural for continuous predictors. Calling summary() on an aov object gives you the familiar ANOVA table. Calling anova() on an lm object gives you the same table. The arithmetic is identical; it’s just a matter of which presentation fits your question.
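A quick sketch of that equivalence, again with invented data, showing that both routes produce the same F-statistic:

```r
# Same illustrative data as before; the group labels are made up.
set.seed(1)
df <- data.frame(
  y     = rnorm(60),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

fit_aov <- aov(y ~ group, data = df)
fit_lm  <- lm(y ~ group, data = df)

summary(fit_aov)  # the classic ANOVA table
anova(fit_lm)     # the same table, reached from the lm side
summary(fit_lm)   # per-level coefficients relative to the baseline group
```

The F value and p-value in the two tables are identical; only the framing of the output differs.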

Checking Assumptions Before You Trust the Results

An ANOVA result is only reliable if two key assumptions hold: the residuals are roughly normally distributed, and the variance within each group is roughly equal. R gives you straightforward ways to test both.

Normality of Residuals

Fit your model first, then pull out the standardized residuals and run a Shapiro-Wilk test:

model <- aov(y ~ group, data = df)
shapiro.test(rstandard(model))

The null hypothesis of this test is that the sample (here, the residuals) comes from a normal distribution. So a high p-value (above 0.05) means you have no evidence against normality, which is what you want. A low p-value suggests the residuals deviate from a normal distribution, and you may need a nonparametric alternative like the Kruskal-Wallis test. You can also check visually with qqnorm(rstandard(model)). If the points roughly follow the diagonal line, normality is reasonable.
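The fallback path can be sketched end to end. Here the data are deliberately drawn from a skewed distribution so the Shapiro-Wilk test flags a problem (all data are invented for illustration):

```r
# Deliberately non-normal data: exponential draws, so residuals are skewed.
set.seed(1)
df <- data.frame(
  y     = rexp(60),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

model <- aov(y ~ group, data = df)
shapiro.test(rstandard(model))      # a low p-value flags non-normal residuals

# Rank-based alternative that does not assume normality.
kruskal.test(y ~ group, data = df)

# Visual check: skewed residuals bow away from the diagonal.
qqnorm(rstandard(model)); qqline(rstandard(model))
```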

Equal Variances Across Groups

Levene’s test checks whether each group has similar spread. It requires the car package:

library(car)
leveneTest(y ~ group, data = df)

By default this uses the median, which makes it robust to outliers. If you want the classic version based on the mean, add center = "mean". A significant result (p < 0.05) means the variances are unequal, which violates the assumption. In that case, you could use Welch’s ANOVA instead (oneway.test(y ~ group, data = df, var.equal = FALSE)), which doesn’t assume equal variances.

What the P-Value Actually Tells You

The p-value in the ANOVA table gives the probability that the observed differences between group means, or more extreme differences, could have arisen through random sampling if there truly were no group effect. When it falls below 0.05, you conclude that at least one group is different. But “statistically significant” and “practically meaningful” are not the same thing. A tiny difference can be significant with a large enough sample, and a meaningful difference can be non-significant with a small sample. That’s why you need effect size.

Measuring Effect Size With Eta-Squared

Eta-squared tells you what proportion of the total variability in your data is explained by the grouping variable. You can calculate it manually by dividing the group sum of squares by the total sum of squares from the ANOVA table, or use the effectsize package:

library(effectsize)
eta_squared(model, partial = FALSE)

An eta-squared of 0.01 is generally considered small (the group explains about 1% of the variation), 0.06 is medium, and 0.14 or above is large. These are Cohen’s conventions, and they’re rough guides rather than hard rules. The partial version, which is the default when you call eta_squared(model), adjusts for other predictors in the model. For a simple one-way ANOVA with a single factor, partial and standard eta-squared are identical.
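The manual calculation mentioned above is a one-liner once you have the summary table. A sketch using the same invented three-group data:

```r
set.seed(1)
df <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
model <- aov(y ~ group, data = df)

# Eta-squared by hand: group sum of squares over total sum of squares.
ss     <- summary(model)[[1]][["Sum Sq"]]  # c(group SS, residual SS)
eta_sq <- ss[1] / sum(ss)
eta_sq
```

This should match the non-partial value reported by effectsize::eta_squared(model, partial = FALSE).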

Reporting effect size alongside your p-value gives a much more complete picture. A significant ANOVA with an eta-squared of 0.02 tells a very different story than one with an eta-squared of 0.35.

Finding Which Groups Differ With Post-Hoc Tests

ANOVA tells you something is different. Tukey's Honestly Significant Difference (HSD) test tells you what. Run it directly on your model object:

TukeyHSD(model)

The output is a list with one matrix per factor; for a one-way ANOVA, that single matrix has one row for every pairwise comparison between groups. Each row has four values:

  • diff: The difference between the two group means.
  • lwr: The lower bound of the 95% confidence interval for that difference.
  • upr: The upper bound of the 95% confidence interval.
  • p adj: The p-value adjusted for the fact that you’re making multiple comparisons simultaneously.

If a confidence interval for a particular pair does not include zero, that pair of groups differs significantly. The adjusted p-value accounts for the increased risk of false positives that comes with testing many pairs at once, so use p adj rather than running individual t-tests. You can also plot the result with plot(TukeyHSD(model)) to get a visual display of the confidence intervals. Any interval that doesn’t cross the zero line represents a significant difference.
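You can also work with the Tukey output programmatically. A sketch with the same invented data (the $group element name comes from the factor name in the formula):

```r
set.seed(1)
df <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)
model <- aov(y ~ group, data = df)

tk <- TukeyHSD(model)
tk$group  # one row per pair: diff, lwr, upr, p adj

# Keep only the pairs whose adjusted p-value clears 0.05.
sig <- tk$group[tk$group[, "p adj"] < 0.05, , drop = FALSE]
sig

plot(tk)  # intervals that do not cross zero are significant
```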

Visualizing Group Differences

A boxplot is the most common way to display the data alongside your ANOVA. With ggplot2:

library(ggplot2)
ggplot(df, aes(group, y)) + geom_boxplot()

Adding geom_jitter() overlays the individual data points, which helps readers see the sample size and spread within each group. This is especially useful when groups have different numbers of observations or when a few outliers are driving the results. The combination of the ANOVA table, an effect size measure, the Tukey post-hoc comparisons, and a boxplot gives you a complete, interpretable analysis that you can confidently report.
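Putting the two layers together (a sketch assuming ggplot2 is installed; the data are the same invented three groups):

```r
library(ggplot2)

set.seed(1)
df <- data.frame(
  y     = c(rnorm(20, mean = 10), rnorm(20, mean = 12), rnorm(20, mean = 10.5)),
  group = factor(rep(c("A", "B", "C"), each = 20))
)

# Boxplot with the raw points jittered on top, so sample size and
# within-group spread are visible alongside the summary boxes.
p <- ggplot(df, aes(group, y)) +
  geom_boxplot(outlier.shape = NA) +       # hide duplicate outlier dots
  geom_jitter(width = 0.15, alpha = 0.5)
p
```

Setting outlier.shape = NA in geom_boxplot() avoids plotting each outlier twice, once as a boxplot dot and once as a jittered point.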