How to Do Logistic Regression, Step by Step

Logistic regression predicts the probability that an outcome falls into one of two categories, like whether a customer will buy or not, whether a tumor is malignant or benign, or whether a loan will default. Unlike linear regression, which outputs a continuous number, logistic regression outputs a probability between 0 and 1, then uses that probability to classify each observation. Here’s how to work through it from start to finish.

How Logistic Regression Actually Works

At its core, logistic regression takes a standard linear equation (the kind you’d see in linear regression) and passes it through a function that squeezes the output into a range between 0 and 1. That function is called the sigmoid function, and it produces the characteristic S-shaped curve you’ll see in every logistic regression tutorial.

The sigmoid takes any real number as input and maps it to a value that approaches but never quite reaches 0 or 1. When the input is a large positive number, the output is very close to 1. When the input is a large negative number, the output is close to 0. When the input is zero, the output is exactly 0.5. This is what makes logistic regression fundamentally a probability model: the output always represents the estimated chance that an observation belongs to the positive class.
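In symbols, the sigmoid is σ(z) = 1 / (1 + e^(−z)). A minimal sketch of the behavior just described:

```python
import math

def sigmoid(z: float) -> float:
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(0))    # exactly 0.5 at zero
print(sigmoid(6))    # close to 1 for a large positive input
print(sigmoid(-6))   # close to 0 for a large negative input
```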

The model learns by adjusting its coefficients so the predicted probabilities align as closely as possible with the actual outcomes in your training data. It does this through maximum likelihood estimation, finding the set of coefficients that makes the observed data most probable.

Preparing Your Data

Before fitting a model, you need your data in the right shape. Your outcome variable should be binary (coded as 0 and 1). Your predictor variables can be continuous, categorical, or a mix of both, but categorical variables need to be converted to numbers first.

The standard approach for categorical predictors is dummy encoding. If a variable has k unique categories (say, three colors: red, blue, green), you create k minus 1 new binary columns. One category serves as the reference group and is represented by all zeros across those columns. This avoids a mathematical redundancy called the “dummy variable trap.” One-hot encoding is similar but creates k columns instead of k minus 1. Dummy encoding is generally preferred for regression because it avoids that redundancy, while one-hot encoding is more common in machine learning pipelines that handle it automatically. Either way, if a categorical variable has many unique values, the number of new columns can balloon quickly, so keep that in mind.
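As a sketch, pandas’ get_dummies with drop_first=True produces exactly this k − 1 encoding (the color column is the made-up example from above; the dropped category, here “blue”, becomes the all-zeros reference group):

```python
import pandas as pd

# Hypothetical data: "color" has k = 3 categories.
df = pd.DataFrame({"color": ["red", "blue", "green", "red"]})

# drop_first=True drops the first category alphabetically ("blue"),
# leaving k - 1 = 2 dummy columns; "blue" rows are all zeros.
dummies = pd.get_dummies(df["color"], prefix="color", drop_first=True)
print(dummies.columns.tolist())  # ['color_green', 'color_red']
```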

For continuous predictors, check for outliers and consider whether scaling is needed. Logistic regression coefficients are sensitive to the units of your variables, so standardizing continuous features (subtracting the mean and dividing by the standard deviation) can make coefficients easier to compare and help the optimization algorithm converge faster.
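Standardizing is simple to do by hand (the values below are arbitrary):

```python
import numpy as np

x = np.array([12.0, 15.0, 20.0, 33.0, 40.0])  # hypothetical continuous feature

# Standardize: subtract the mean, divide by the standard deviation.
z = (x - x.mean()) / x.std()

print(z.mean())  # ~0 after standardizing
print(z.std())   # exactly 1 after standardizing
```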

Checking the Three Core Assumptions

Logistic regression has fewer assumptions than linear regression, but the ones it does have are non-negotiable. Your model needs to meet all three to produce unbiased, generalizable results.

Independent observations. Each row in your data should represent an independent case. If the same person appears multiple times, or if observations are clustered (patients within the same hospital, students within the same school), standard logistic regression will underestimate the true uncertainty in your estimates. Clustered data requires specialized approaches like mixed-effects logistic regression.

No perfect multicollinearity. Your predictor variables can be correlated with each other, but they can’t be perfectly correlated or direct linear combinations of one another. To check this, calculate the Variance Inflation Factor (VIF) for each predictor. A VIF above 5 to 10 signals problematic multicollinearity. If you find it, consider dropping one of the correlated variables or combining them into a single measure.
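VIF can be computed directly from the R² of regressing each predictor on the others: VIF = 1 / (1 − R²). A sketch using only NumPy, on synthetic data where the third column is built to nearly duplicate the first:

```python
import numpy as np

def vif(X: np.ndarray) -> list[float]:
    """VIF for each column: 1 / (1 - R^2) from regressing that column
    on the remaining columns (plus an intercept)."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1.0 / (1.0 - r2))
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])
print([round(v, 1) for v in vif(X)])  # first and third columns are inflated
```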

Linearity of the logit. This is the one that trips people up. Logistic regression doesn’t assume a linear relationship between predictors and the outcome directly. It assumes continuous predictors have a linear relationship with the log-odds of the outcome. You can test this with the Box-Tidwell procedure: create an interaction term between each continuous predictor and its natural log, then check whether those terms are significant. If the relationship isn’t linear, you can try transforming the variable (log, square root) or binning it into categories.

How Much Data You Need

The classic rule of thumb is 10 events per variable (EPV). “Events” means the count of whichever outcome category is less common. If you’re predicting loan defaults and 50 loans in your dataset actually defaulted, you can reliably include about 5 predictor variables. With fewer events per variable, coefficient estimates become unstable and the model may not generalize well to new data.

More recent research suggests that 20 events per variable is a safer target. At that threshold, internal validation methods like bootstrap correction produce performance estimates that closely match what you’d see on a truly independent dataset. If your sample is small, be ruthless about which predictors you include. Every unnecessary variable eats into your effective sample size.
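The EPV arithmetic is simple; using the loan example from above:

```python
# Hypothetical dataset: 1,000 loans, 50 of which defaulted.
n_events = min(50, 1000 - 50)        # "events" = the rarer outcome class
max_predictors_10 = n_events // 10   # classic 10-EPV rule of thumb
max_predictors_20 = n_events // 20   # stricter 20-EPV target
print(max_predictors_10, max_predictors_20)  # 5 2
```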

Fitting the Model

In practice, you’ll use a software package to fit the model. In Python, the two most common options are scikit-learn’s LogisticRegression and statsmodels’ Logit. In R, the built-in glm() function with family = binomial handles it. The choice depends on whether you want a machine learning workflow (scikit-learn) or a traditional statistical output with p-values and confidence intervals (statsmodels or R).

Regardless of the tool, the process is similar: specify your outcome variable and your predictors, fit the model on your training data, and examine the output. The model returns a coefficient for each predictor, an intercept, and various measures of model fit.
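A minimal fitting sketch with scikit-learn on synthetic data (the features and outcome here are simulated, purely to show the mechanics):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated binary-outcome data.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=300) > 0).astype(int)

model = LogisticRegression()
model.fit(X, y)

print(model.intercept_)             # one intercept
print(model.coef_)                  # one coefficient per predictor
print(model.predict_proba(X[:3]))   # class probabilities for the first rows
```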

Interpreting the Results

Logistic regression coefficients are expressed in log-odds, which aren’t intuitive on their own. A coefficient of 0.7 means that a one-unit increase in that predictor is associated with a 0.7 increase in the log-odds of the outcome. To make this meaningful, convert it to an odds ratio by raising Euler’s number (approximately 2.718) to the power of the coefficient. So e raised to 0.7 gives an odds ratio of about 2.01, meaning the odds of the outcome roughly double for each one-unit increase in that predictor.
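The conversion is a one-liner:

```python
import math

coef = 0.7                    # log-odds coefficient from a fitted model
odds_ratio = math.exp(coef)   # e raised to the coefficient
print(round(odds_ratio, 2))   # 2.01: odds roughly double per one-unit increase
```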

An odds ratio greater than 1 means the predictor increases the odds of the outcome. An odds ratio less than 1 means it decreases the odds. An odds ratio of exactly 1 means no effect. For categorical predictors, the odds ratio compares each category to the reference group you chose during encoding.

One common mistake: odds ratios are not the same as probability ratios. An odds ratio of 2 does not mean “twice as likely.” The relationship between odds and probability is nonlinear, so the actual change in probability depends on where you start. When probabilities are small (under 10%), odds ratios approximate risk ratios reasonably well. At higher probabilities, the distinction matters more.

Evaluating Model Performance

To classify observations, you need a probability threshold. The default is typically 0.5: predicted probabilities above 0.5 are assigned to class 1, and those below go to class 0. Once you’ve classified your predictions, you can build a confusion matrix showing true positives, true negatives, false positives, and false negatives.

From the confusion matrix, several metrics tell you different things about performance:

  • Accuracy is the percentage of all predictions that were correct. It’s easy to understand but misleading when classes are imbalanced. A model that always predicts “no disease” on a dataset where 99% of people are healthy achieves 99% accuracy while being completely useless.
  • Precision tells you what proportion of positive predictions were actually positive. High precision means few false alarms.
  • Recall (also called sensitivity) tells you what proportion of actual positives the model caught. High recall means few missed cases.
  • F1 score is the harmonic mean of precision and recall. It’s especially useful when you’re working with imbalanced data and need a single number that balances both concerns.

The ROC curve evaluates the model across all possible thresholds rather than just one. It plots the true positive rate against the false positive rate at every threshold, and the area under that curve (AUC, often written AUC-ROC) gives you an overall measure of how well the model distinguishes between classes. An AUC of 0.5 means the model is no better than flipping a coin. An AUC of 1.0 means perfect separation.

Handling Imbalanced Classes

When one outcome is much rarer than the other (fraud detection, rare diseases, equipment failure), logistic regression tends to under-predict the minority class. There are several practical fixes.

At the data level, you can oversample the minority class, undersample the majority class, or use synthetic oversampling techniques like SMOTE, which generates new synthetic examples of the minority class based on existing ones. At the algorithm level, you can apply cost-sensitive learning, which penalizes misclassification of the minority class more heavily during training. At the output level, you can adjust the classification threshold. Instead of the default 0.5, find the threshold where sensitivity and specificity intersect. In practice, this threshold can be dramatically different from 0.5: in some imbalanced datasets, optimal thresholds as low as 0.07 or as high as 0.97 have been documented.
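As a sketch of cost-sensitive learning, scikit-learn’s class_weight="balanced" option weights each class inversely to its frequency (the data here are simulated with a minority of positives; the effect to notice is that the weighted model flags the minority class far more often):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Imbalanced simulated data: positives are the rare class.
rng = np.random.default_rng(7)
X = rng.normal(size=(1000, 2))
y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 1.3).astype(int)

# Cost-sensitive learning: misclassifying the minority class costs more.
weighted = LogisticRegression(class_weight="balanced").fit(X, y)
plain = LogisticRegression().fit(X, y)

# The weighted model predicts the minority class more often.
print(plain.predict(X).sum(), weighted.predict(X).sum())
```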

Preventing Overfitting With Regularization

When you have many predictors relative to your sample size, logistic regression can overfit, producing coefficients that are too large and a model that performs well on training data but poorly on new data. Regularization adds a penalty to the model for having large coefficients, effectively shrinking them toward zero.

There are two main types. L1 regularization (also called Lasso) penalizes the sum of the absolute values of the coefficients. It tends to push some coefficients all the way to zero, effectively performing feature selection by eliminating irrelevant predictors. This is especially useful when you suspect only a handful of your variables truly matter. Research from Stanford has shown that L1 regularization handles irrelevant features efficiently: its data requirements grow only logarithmically with the number of irrelevant features.

L2 regularization (also called Ridge) penalizes the sum of the squared coefficients. It shrinks coefficients toward zero but rarely sets them exactly to zero. L2 works better when most of your predictors contribute at least some signal and you don’t want to discard any entirely. However, L2’s data requirements grow at least linearly with the number of irrelevant features, making it less efficient when many predictors are noise.

Most implementations default to L2 regularization. Scikit-learn’s LogisticRegression, for instance, applies L2 by default with a regularization strength parameter called C (where smaller values mean stronger regularization). You can switch to L1, or use Elastic Net, which combines both penalties. Tuning the regularization strength through cross-validation is standard practice.
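A sketch contrasting the two penalties in scikit-learn, on synthetic data where only the first of ten features carries any signal (the C values are illustrative, not recommendations):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: only the first feature is informative.
rng = np.random.default_rng(3)
X = rng.normal(size=(400, 10))
y = (X[:, 0] + rng.normal(scale=0.5, size=400) > 0).astype(int)

# L2 (the default) shrinks coefficients; L1 can zero out irrelevant ones.
l2 = LogisticRegression(penalty="l2", C=1.0).fit(X, y)
l1 = LogisticRegression(penalty="l1", C=0.05, solver="liblinear").fit(X, y)

print(np.count_nonzero(l1.coef_))  # L1 eliminates some coefficients entirely
print(np.count_nonzero(l2.coef_))  # L2 rarely produces exact zeros
```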

A Practical Workflow Summary

Putting it all together, a logistic regression analysis follows a consistent sequence. Start by encoding categorical variables and scaling continuous ones. Check that your observations are independent, verify there’s no severe multicollinearity using VIF, and test the linearity of the logit for continuous predictors. Confirm you have at least 10 to 20 events per variable.

Split your data into training and test sets, or use cross-validation. Fit the model on the training data, apply regularization if needed, and interpret coefficients as odds ratios. Evaluate performance on held-out data using metrics appropriate to your problem: accuracy for balanced classes, F1 or AUC-ROC for imbalanced ones. If classes are heavily skewed, adjust your threshold or resampling strategy. Validate using bootstrap correction rather than a simple train-test split, particularly when sample sizes are modest.
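Much of this sequence can be compressed into a short scikit-learn pipeline sketch (synthetic data standing in for an already-encoded dataset; the pipeline scales, fits a regularized model, and evaluates with cross-validated AUC):

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a prepared dataset (categorical encoding already done).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4))
y = (X[:, 0] - X[:, 1] + rng.normal(scale=1.0, size=500) > 0).astype(int)

# Scale, fit with (default L2) regularization, score via 5-fold cross-validated AUC.
pipe = make_pipeline(StandardScaler(), LogisticRegression(C=1.0))
scores = cross_val_score(pipe, X, y, cv=5, scoring="roc_auc")
print(scores.mean())  # well above 0.5 for this informative data
```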