Linear regression is a statistical method that finds the straight line best fitting a set of data points, letting you predict one variable based on another. If you’ve ever drawn a trend line through a scatter plot, you’ve done a simplified version of what linear regression does mathematically. It’s one of the most widely used tools in statistics, data science, and machine learning because it’s straightforward to interpret and surprisingly powerful across a range of real-world problems.
How the Equation Works
A simple linear regression model follows the equation Y = a + bX. Y is the outcome you’re trying to predict (called the dependent variable), and X is the factor you’re using to make that prediction (the independent variable). The value “b” is the slope, telling you how much Y changes for every one-unit increase in X. The value “a” is the intercept, which is simply what Y equals when X is zero.
Say you’re looking at how advertising spend affects monthly sales. X is dollars spent on ads, Y is revenue. If the slope turns out to be 0.75, that means every additional dollar in advertising is associated with 75 cents in extra sales. The intercept represents your baseline sales with zero ad spend. That single equation gives you a formula to plug in any ad budget and get a sales estimate.
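Plugged into code, the equation is a one-liner. The sketch below uses the 0.75 slope from the example plus a hypothetical $12,000 baseline for the intercept; both numbers are illustrative, not fitted from real data.

```python
# Y = a + bX for the ad-spend example. The intercept (12000) and
# slope (0.75) are illustrative values, not estimates from real data.
def predict_sales(ad_spend, intercept=12000.0, slope=0.75):
    """Baseline sales plus 75 cents per advertising dollar."""
    return intercept + slope * ad_spend

print(predict_sales(10000))  # 12000 + 0.75 * 10000 = 19500.0
```

Any ad budget can be plugged in the same way; the model's entire behavior is captured by those two numbers.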
Finding the Best-Fitting Line
With real data, no straight line passes perfectly through every point. Some points sit above the line, some below. The vertical gap between each actual data point and the line’s prediction is called a residual, or error. Linear regression uses a technique called ordinary least squares (OLS) to find the specific line that minimizes the sum of all those squared errors. Squaring the errors does two things: it treats overestimates and underestimates equally, and it penalizes large misses more heavily than small ones.
The result is the single line, out of all possible lines, that sits closest to the data overall. This is what people mean when they say “line of best fit.”
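For one predictor, OLS has a closed-form solution: the slope is the covariance of X and Y divided by the variance of X, and the intercept follows from the means. A minimal sketch (the function name `ols_fit` is just for illustration):

```python
def ols_fit(xs, ys):
    """Ordinary least squares for one predictor.

    slope b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²)
    intercept a = ȳ - b·x̄
    This (a, b) pair minimizes the sum of squared residuals.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b

# Points lying exactly on y = 2 + 3x recover those coefficients.
a, b = ols_fit([0, 1, 2, 3], [2, 5, 8, 11])
print(a, b)  # 2.0 3.0
```

With noisy data the recovered line won't pass through every point, but no other line will have a smaller sum of squared residuals.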
Multiple Linear Regression
Simple linear regression uses one predictor. But most real situations involve several factors at once. Multiple linear regression extends the same idea by adding more variables to the equation. Instead of Y = a + bX, you get Y = β₀ + β₁X₁ + β₂X₂ + … + βₚXₚ. Each coefficient (β) tells you how much the outcome changes when that particular variable increases by one unit, while holding all the other variables constant.
For example, predicting a home’s sale price might involve square footage, number of bedrooms, distance from downtown, and school district rating. Each gets its own coefficient. The model is no longer a line through two-dimensional space but a surface through higher-dimensional space, though the underlying math works the same way.
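Under the hood, the coefficients come from solving the normal equations (AᵀA)β = Aᵀy, where A is the predictor matrix with a leading column of ones for the intercept. The sketch below solves that small system with plain Gaussian elimination; a real analysis would use a statistics library, but the recovered coefficients are the same.

```python
def fit_multiple(X, y):
    """Multiple linear regression via the normal equations.

    Solves (AᵀA)β = Aᵀy, where A is X with a leading column of 1s,
    so beta[0] is the intercept β₀. Illustrative sketch only.
    """
    A = [[1.0] + list(row) for row in X]
    n, p = len(A), len(A[0])
    # Build AᵀA and Aᵀy.
    ata = [[sum(A[k][i] * A[k][j] for k in range(n)) for j in range(p)]
           for i in range(p)]
    aty = [sum(A[k][i] * y[k] for k in range(n)) for i in range(p)]
    # Gaussian elimination with partial pivoting.
    for i in range(p):
        piv = max(range(i, p), key=lambda r: abs(ata[r][i]))
        ata[i], ata[piv] = ata[piv], ata[i]
        aty[i], aty[piv] = aty[piv], aty[i]
        for r in range(i + 1, p):
            f = ata[r][i] / ata[i][i]
            for c in range(i, p):
                ata[r][c] -= f * ata[i][c]
            aty[r] -= f * aty[i]
    # Back substitution.
    beta = [0.0] * p
    for i in reversed(range(p)):
        beta[i] = (aty[i] - sum(ata[i][c] * beta[c]
                                for c in range(i + 1, p))) / ata[i][i]
    return beta

# Data generated from y = 1 + 2x₁ + 3x₂ recovers β = [1, 2, 3].
X = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 3]]
y = [1 + 2 * x1 + 3 * x2 for x1, x2 in X]
print([round(b, 6) for b in fit_multiple(X, y)])  # [1.0, 2.0, 3.0]
```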
What the Model Assumes
Linear regression produces reliable results only when certain conditions hold. Four core assumptions underpin the method:
- Linearity. The relationship between each predictor and the outcome is a straight line. If the true pattern is curved, a linear model will systematically miss it.
- Independence of errors. The error for one data point doesn’t depend on the error for another. This matters especially with time-series data, where consecutive measurements can be correlated.
- Constant variance (homoscedasticity). The spread of errors stays roughly the same across all levels of prediction. If errors fan out as predictions get larger, for instance, the model’s confidence intervals become unreliable.
- Normality of errors. The residuals follow a roughly bell-shaped distribution. This assumption matters most when you’re using the model to calculate confidence intervals or p-values rather than just making predictions.
When any of these assumptions breaks down, the model’s predictions and statistical tests can become biased or misleading. Checking these assumptions through residual plots is a standard step in any regression analysis.
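A full check means plotting residuals, but even a crude numeric comparison can flag trouble. The sketch below (an illustrative helper, not a standard diagnostic) takes an already-fitted line, computes the residuals, and compares their spread across the lower and upper halves of the predictor range; a large gap between the two hints at non-constant variance.

```python
# Crude homoscedasticity check, assuming a line has already been fitted.
# Real analyses use residual plots; this just compares residual spread
# between the lower and upper halves of the predictor range.
def residual_spread_check(xs, ys, intercept, slope):
    resids = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    paired = sorted(zip(xs, resids))          # order points by predictor
    half = len(paired) // 2
    lo = [r for _, r in paired[:half]]
    hi = [r for _, r in paired[half:]]
    spread = lambda rs: (sum(r * r for r in rs) / len(rs)) ** 0.5
    return spread(lo), spread(hi)  # similar values suggest constant variance

# Data lying exactly on y = 2x has zero residual spread everywhere.
print(residual_spread_check([1, 2, 3, 4], [2, 4, 6, 8], 0, 2))  # (0.0, 0.0)
```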
Measuring How Well the Model Fits
The most common measure of fit is R-squared, which tells you what percentage of the variation in your outcome is explained by the model. An R-squared of 0.85 means 85% of the differences in Y can be accounted for by the predictors. The remaining 15% is unexplained variation.
R-squared has a catch, though. Adding more predictor variables to a model will almost always increase R-squared, even if those variables aren’t genuinely useful. This creates a temptation to keep piling on predictors, a problem known as overfitting. Adjusted R-squared addresses this by penalizing the score for each additional variable. If a new predictor doesn’t improve the model enough to justify its inclusion, adjusted R-squared will actually drop. For any model with more than one predictor, adjusted R-squared gives a more honest picture.
Beyond R-squared, error metrics tell you how far off predictions tend to be in practical terms. Mean Absolute Error (MAE) is the average size of the mistakes in the same units as your outcome, making it easy to interpret. Root Mean Squared Error (RMSE) also uses the original units but penalizes larger errors more heavily, so it’s useful when big misses are especially costly. MAE is generally preferred when you want a straightforward, intuitive measure of typical error size.
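All four metrics fall out of the residuals directly. A minimal sketch (the function name and example numbers are illustrative):

```python
def fit_metrics(y_true, y_pred, n_predictors):
    """R², adjusted R², MAE, and RMSE for a fitted model."""
    n = len(y_true)
    mean_y = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot                      # share of variance explained
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - n_predictors - 1)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n
    rmse = (ss_res / n) ** 0.5                    # penalizes large misses more
    return r2, adj_r2, mae, rmse

r2, adj_r2, mae, rmse = fit_metrics([3, 5, 7, 9], [2.8, 5.2, 7.1, 8.9],
                                    n_predictors=1)
print(round(r2, 3), round(mae, 2))  # 0.995 0.15
```

Note that RMSE is always at least as large as MAE; the gap between them grows as the errors become more uneven.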
Common Problems to Watch For
Multicollinearity
In multiple regression, problems arise when two or more predictors are highly correlated with each other. This is called multicollinearity, and it makes individual coefficients unstable and hard to interpret. If square footage and number of rooms both go into a housing model, and those two variables move closely together, the model struggles to separate their individual effects.
A standard diagnostic is the Variance Inflation Factor (VIF). A VIF above 5, or above 10 by a more lenient convention, signals problematic multicollinearity. The fix is usually to drop one of the correlated variables or combine them into a single measure.
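In the special case of exactly two predictors, each one's VIF reduces to 1 / (1 − r²), where r is the Pearson correlation between them. A minimal sketch of that case (the general version regresses each predictor on all the others):

```python
def vif_two_predictors(x1, x2):
    """VIF for a model with exactly two predictors: 1 / (1 - r²),
    where r is the Pearson correlation between them."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(x1, x2))
    var1 = sum((a - m1) ** 2 for a in x1)
    var2 = sum((b - m2) ** 2 for b in x2)
    r = cov / (var1 * var2) ** 0.5
    return 1 / (1 - r ** 2)

# Two nearly collinear predictors (x2 ≈ 2·x1) inflate the VIF
# well past the 5-to-10 warning range.
vif = vif_two_predictors([1, 2, 3, 4], [2.1, 3.9, 6.2, 7.8])
print(vif > 10)  # True
```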
Influential Outliers
Because linear regression minimizes squared errors, a single extreme data point can pull the entire line toward it. One common tool for spotting this is Cook’s distance, which measures how much the model’s predictions would change if a specific data point were removed. A data point with a Cook’s distance much larger than the rest deserves a closer look. It may be a data entry error, or it may represent a genuinely unusual case that the model shouldn’t be forced to accommodate.
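For simple regression, Cook's distance can be computed directly from each point's residual and leverage. A sketch under that two-parameter setup (the data are made up; the last point is deliberately placed far off the trend):

```python
def cooks_distance(xs, ys):
    """Cook's distance for each point in a simple linear regression.

    D_i = (e_i² / (p·s²)) · h_ii / (1 - h_ii)², with p = 2 parameters,
    s² the residual mean square, and h_ii the point's leverage.
    """
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    b = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / sxx
    a = y_mean - b * x_mean
    resid = [y - (a + b * x) for x, y in zip(xs, ys)]
    s2 = sum(e * e for e in resid) / (n - 2)          # residual mean square
    lev = [1 / n + (x - x_mean) ** 2 / sxx for x in xs]
    return [e * e / (2 * s2) * h / (1 - h) ** 2 for e, h in zip(resid, lev)]

# Five points near y = 2x, plus one far-off point that pulls the line.
d = cooks_distance([1, 2, 3, 4, 5, 10], [2.1, 3.9, 6.0, 8.1, 9.9, 40.0])
print(max(d) == d[-1])  # True: the outlier dominates
```

Notice that the outlier's raw residual isn't necessarily the largest; its high leverage is what makes its Cook's distance stand out.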
Categorical Variables
Linear regression works with numbers, so non-numeric categories like “red/blue/green” or “small/medium/large” can’t be plugged in directly. The standard approach is dummy coding: creating a set of new variables, each representing one category with a value of 1 or 0. A “color” variable with three options becomes two new columns (say, “is_blue” and “is_green”), with the third category (red) serving as the reference level that all others are compared against. One category is always left out to avoid redundancy.
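A minimal sketch of dummy coding (the helper name `dummy_code` is hypothetical; pandas users would reach for `pd.get_dummies` with `drop_first=True` instead):

```python
def dummy_code(values, reference):
    """One 0/1 column per non-reference category.

    The reference level gets no column: it is represented by zeros
    everywhere, which avoids the redundancy mentioned above.
    """
    levels = sorted(set(values) - {reference})
    return {f"is_{lvl}": [1 if v == lvl else 0 for v in values]
            for lvl in levels}

print(dummy_code(["red", "blue", "green", "red"], reference="red"))
# {'is_blue': [0, 1, 0, 0], 'is_green': [0, 0, 1, 0]}
```

Each new column then enters the regression as an ordinary numeric predictor, and its coefficient measures the shift relative to the reference category.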
Where Linear Regression Gets Used
The method shows up across nearly every field that works with data. In marketing, companies model how advertising spend relates to sales revenue and use the results to pace their campaign budgets. In healthcare, hospitals analyze how staffing levels affect patient wait times, finding, for instance, that adding staff reduces wait times but with diminishing returns past a certain threshold. In human resources, analysts plot job satisfaction against experience and salary to understand which factors matter most for retention.
Financial analysts use regression to understand how individual stocks move relative to broader market indexes. Real estate platforms estimate home values using dozens of property features fed into multiple regression models. Even in fields where more complex machine learning methods dominate, linear regression frequently serves as a baseline, the first model you build before trying anything fancier, because its transparency makes it easy to verify and explain.
That interpretability is one of linear regression’s greatest strengths. Each coefficient has a clear meaning: “for every one-unit increase in X, Y changes by this much.” In contexts where you need to justify decisions to stakeholders, regulators, or patients, that clarity matters as much as raw predictive accuracy.