A propensity score is the estimated probability that a person receives a particular treatment (or exposure) based on their background characteristics. It was introduced by statisticians Paul Rosenbaum and Donald Rubin in 1983 as a way to make fairer comparisons in studies where researchers can’t randomly assign people to groups. Instead of trying to account for dozens of differences between groups all at once, the propensity score collapses those differences into a single number between 0 and 1.
Why Propensity Scores Exist
In a randomized controlled trial, flipping a coin decides who gets the treatment and who doesn’t. That randomness means the two groups end up looking similar on average: similar ages, similar health histories, similar incomes. Any difference in outcomes can be more confidently attributed to the treatment itself.
Observational studies don’t have that luxury. When researchers look at existing health records, survey data, or insurance claims, the people who received a treatment are often systematically different from those who didn’t. Older patients might be more likely to get a surgery. Wealthier patients might be more likely to use a certain medication. These pre-existing differences, called confounders, can make a treatment look more effective or less effective than it really is. Propensity scores give researchers a tool to reduce that bias without running an experiment.
How a Propensity Score Is Calculated
The most common approach uses logistic regression, a type of statistical model that predicts a yes-or-no outcome. In this case, the outcome being predicted isn’t a health result. It’s whether someone received the treatment. Researchers feed the model all the background characteristics they have: age, sex, income, health conditions, prior medications, and so on. The model then spits out, for each person, a probability of having been in the treatment group.
An important rule governs what goes into the model. Researchers include factors that predict who gets the treatment, not factors caused by the treatment. The actual health outcome is deliberately left out of this step. This separation is one of the method’s strengths: because the propensity score is built without looking at results, researchers are less tempted to tweak their model until they get a favorable answer.
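To make the mechanics concrete, here is a minimal from-scratch sketch in Python. Everything in it is made up for illustration: the two covariates, the treatment-assignment coefficients, and the use of plain gradient descent instead of a statistics library. Note that, as described above, the health outcome appears nowhere in this step.

```python
import math
import random

random.seed(42)
n = 2000

# Hypothetical synthetic data: two standardized baseline covariates per person.
X = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(n)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Treatment is assigned with a probability that depends on the covariates.
# The coefficients (-0.5, 0.8, 0.5) are invented for this example.
T = [1 if random.random() < sigmoid(-0.5 + 0.8 * x1 + 0.5 * x2) else 0
     for x1, x2 in X]

# Fit a logistic regression of treatment (not outcome!) on the covariates
# by plain gradient descent -- a teaching sketch, not production code.
w = [0.0, 0.0, 0.0]  # intercept, coefficient for x1, coefficient for x2
for _ in range(500):
    grad = [0.0, 0.0, 0.0]
    for (x1, x2), t in zip(X, T):
        err = sigmoid(w[0] + w[1] * x1 + w[2] * x2) - t
        grad[0] += err
        grad[1] += err * x1
        grad[2] += err * x2
    for j in range(3):
        w[j] -= 0.5 * grad[j] / n

# The propensity score: each person's fitted probability of being treated.
ps = [sigmoid(w[0] + w[1] * x1 + w[2] * x2) for x1, x2 in X]
```

In a real analysis the fitting would be done with an established library (R's `glm`, Python's statsmodels or scikit-learn), but the logic is the same: predict treatment from pre-treatment characteristics, then keep each person's fitted probability.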
The Core Statistical Property
The key insight from Rosenbaum and Rubin’s original paper is that adjusting for this single score removes bias from all the observed characteristics that went into it. Among people who share the same propensity score, the treated and the untreated have the same distribution of measured background characteristics, even if those characteristics include dozens of variables. This means researchers can compare people with similar scores and get closer to an apples-to-apples comparison, without needing to match on every individual variable separately.
This property only holds for measured variables. If an important confounder wasn’t recorded in the data and therefore wasn’t included in the model, the propensity score can’t account for it. This is the method’s most significant limitation, and it applies equally to other statistical adjustments in observational research.
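A toy simulation makes the balancing property concrete. The sketch below cheats in a way only a simulation can: it uses the true assignment probability as the score, skipping the model-fitting step. Treated and untreated people differ sharply on the confounder overall, but inside a narrow score band the gap nearly vanishes. All numbers here are invented.

```python
import math
import random

random.seed(1)
n = 20000

# Simulate one measured confounder x; higher x means more likely to be treated.
people = []
for _ in range(n):
    x = random.gauss(0, 1)
    score = 1 / (1 + math.exp(-x))        # the true assignment probability
    t = 1 if random.random() < score else 0
    people.append((x, score, t))

def mean_x(group):
    return sum(p[0] for p in group) / len(group)

# In the full sample, treated people have much higher x than untreated.
overall_gap = (mean_x([p for p in people if p[2]])
               - mean_x([p for p in people if not p[2]]))

# But among people whose scores fall in the same narrow band,
# the treated and untreated look nearly identical on x.
band = [p for p in people if 0.55 <= p[1] < 0.60]
band_gap = (mean_x([p for p in band if p[2]])
            - mean_x([p for p in band if not p[2]]))
```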
Four Ways Propensity Scores Are Used
Once every person in a study has a propensity score, researchers apply it in one of four main ways.
- Matching: Each treated person is paired with an untreated person who has a similar propensity score. This creates two groups that look alike on measured characteristics, mimicking a randomized trial. Unmatched individuals are dropped from the analysis.
- Stratification: The full sample is divided into subgroups (typically five) based on propensity score ranges. Within each subgroup, treated and untreated individuals have similar characteristics. The treatment effect is estimated within each subgroup, then combined.
- Weighting: Rather than dropping anyone, each person’s data is weighted by the inverse of the probability of the group they actually ended up in: treated people by one over their propensity score, untreated people by one over one minus it. Someone who was unlikely to receive the treatment but did anyway gets a higher weight, because they represent a larger slice of the population. This approach, called inverse probability of treatment weighting, keeps the full sample intact.
- Covariate adjustment: The propensity score is added directly into a regression model as a control variable alongside the treatment indicator. This is the simplest application but is used less often than matching or weighting.
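Of the four, weighting is the easiest to show in a few lines. The sketch below uses made-up propensity scores, treatment flags, and outcomes for six people to compute inverse probability of treatment weights and a weighted difference in mean outcomes:

```python
# Hypothetical propensity scores, treatment flags (1 = treated), and outcomes.
ps = [0.8, 0.6, 0.3, 0.7, 0.2, 0.4]
T  = [1,   1,   1,   0,   0,   0]
Y  = [10.0, 9.0, 12.0, 7.0, 6.0, 8.0]

# IPTW: weight each person by the inverse of the probability of the
# group they actually landed in.
w = [1 / p if t == 1 else 1 / (1 - p) for p, t in zip(ps, T)]
# The treated person with ps = 0.3 gets the largest weight (about 3.33):
# treatment was unlikely for them, so they stand in for many similar people.

def weighted_mean(values, weights):
    return sum(v * x for v, x in zip(values, weights)) / sum(weights)

treated_mean = weighted_mean([y for y, t in zip(Y, T) if t],
                             [x for x, t in zip(w, T) if t])
control_mean = weighted_mean([y for y, t in zip(Y, T) if not t],
                             [x for x, t in zip(w, T) if not t])
effect_estimate = treated_mean - control_mean  # weighted effect estimate
```

Real analyses add refinements this sketch omits, such as trimming or stabilizing extreme weights, but the core computation is just these few lines.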
How Researchers Check If It Worked
After applying any of these methods, researchers check whether the treated and untreated groups actually look similar. The standard tool for this is the standardized mean difference, which measures how far apart the two groups are on each characteristic. A value below 0.1 is the most commonly used threshold for adequate balance, though some studies accept values up to 0.25. If a characteristic still shows a large gap after adjustment, the propensity score model likely needs more variables or a different specification.
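The standardized mean difference itself has a simple formula: the gap between the two group means, divided by the pooled standard deviation. A minimal sketch, with made-up post-matching ages:

```python
import math

def smd(treated, control):
    """Standardized mean difference: |gap in means| in pooled-SD units."""
    def mean(v):
        return sum(v) / len(v)
    def var(v):
        m = mean(v)
        return sum((x - m) ** 2 for x in v) / (len(v) - 1)
    pooled_sd = math.sqrt((var(treated) + var(control)) / 2)
    return abs(mean(treated) - mean(control)) / pooled_sd

# Hypothetical ages in two matched groups; a result below 0.1 is the
# most common threshold for calling the groups balanced on this variable.
ages_treated = [61, 58, 64, 59, 62]
ages_control = [61, 57, 64, 60, 61]
balance = smd(ages_treated, ages_control)
```

In practice this is computed for every measured characteristic, often displayed as a "Love plot" of before-and-after values.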
Advantages Over Traditional Regression
Standard regression models can also control for confounders by adding them as variables. Propensity scores offer a few practical advantages on top of that. First, the balance checks just described give researchers a transparent, visual way to verify that their adjustment actually worked. Regression models don’t offer an equivalent diagnostic. Second, building the propensity score model is entirely separate from analyzing the outcome, which reduces the risk of unconsciously cherry-picking a model that produces a desired result. Third, when the number of confounders is large relative to the number of outcome events, regression models can become unstable. Propensity scores compress all those confounders into one number, sidestepping that problem.
That said, propensity scores aren’t universally better. A 2022 simulation study in the Journal of Clinical Epidemiology found that logistic regression frequently matched or outperformed propensity score methods, especially in large datasets. Matching methods showed an edge in specific scenarios involving unmeasured confounding. The best approach depends on the dataset and the research question.
The Unmeasured Confounding Problem
No matter how carefully a propensity score is built, it can only balance characteristics that were measured and included. If a key variable is missing from the data, the score won’t account for it, and the treatment effect estimate will carry that hidden bias. For example, a study using medical claims data might lack information on smoking status, diet, or household income. If any of those factors influence both who gets treated and how they fare, the propensity score adjustment will be incomplete.
Researchers sometimes run sensitivity analyses to estimate how strong an unmeasured confounder would need to be to change their conclusions. But there is no statistical method that fully solves this problem in observational data. This is why propensity score studies, no matter how well designed, are generally considered less definitive than randomized trials.
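One widely used sensitivity measure is the E-value of VanderWeele and Ding (2017): the minimum strength of association, on the risk-ratio scale, that an unmeasured confounder would need with both the treatment and the outcome to fully explain away an observed effect. A quick sketch of the point-estimate formula:

```python
import math

def e_value(rr):
    """E-value for an observed risk ratio (point estimate only)."""
    if rr < 1:           # for protective effects, take the reciprocal first
        rr = 1 / rr
    return rr + math.sqrt(rr * (rr - 1))

# An observed risk ratio of 2.0 yields an E-value of 2 + sqrt(2), about 3.41:
# only a fairly strong unmeasured confounder could explain the result away.
e_value(2.0)
```

The same calculation is usually applied to the confidence-interval limit closest to the null, which this sketch omits.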
Where Propensity Scores Show Up
Propensity score methods are used heavily in medical research, particularly in studies of drug safety, surgical outcomes, and health policy. They’re common in epidemiology, economics, education research, and social science. Any field that relies on observational data and needs to estimate causal effects has adopted them to some degree.
For analysts doing this work, the most widely used software tools are in R. The MatchIt package, developed at Harvard, handles multiple matching algorithms and integrates with other analysis tools. Packages like twang (for weighting) and optmatch (for optimal matching) cover other approaches. Stata also has well-established commands for propensity score analysis. Python’s causalinference and DoWhy libraries offer similar functionality.