What Is a Case-Control Study in Epidemiology?

A case-control study is a type of observational research that starts with people who already have a disease or outcome (cases) and compares them to similar people who don’t (controls), then looks backward in time to figure out what exposures or risk factors might explain the difference. It’s one of the most common study designs in epidemiology, and it played a central role in establishing the link between smoking and lung cancer.

How the Design Works

The logic of a case-control study runs in reverse compared to how we usually think about cause and effect. Instead of watching a group of people over time to see who gets sick, researchers start with the outcome and work backward. They recruit one group of people who have the condition they’re studying and a second group who don’t, then investigate whether the two groups differ in their past exposures to a suspected risk factor.

No one receives a treatment or intervention. The researchers simply observe and compare what already happened. That’s what makes it “observational” rather than experimental. Because it looks back in time, a case-control study is sometimes called a retrospective study.

The key question the design answers is: were the cases more likely to have been exposed to a particular risk factor than the controls? If people with lung cancer were far more likely to have smoked than people without lung cancer, that’s a signal that smoking and lung cancer are connected.

Choosing Cases and Controls

The validity of a case-control study depends heavily on how researchers pick participants. Cases are straightforward: they’re people with a confirmed diagnosis of the disease or outcome being studied. Controls are trickier. A good control is someone who could have ended up as a case but didn’t. Their exposure history should be representative of the broader population at risk for the disease, and their exposures need to be measurable with the same accuracy as those of the cases.

Researchers typically draw controls from one of two pools. Population-based controls, pulled from general practice registries or community records, tend to reflect real-world exposure patterns well. Hospital-based controls, selected from patients admitted for unrelated conditions, make it easier to collect comparable data since both groups are being interviewed in the same setting. The trade-off is that hospitalized patients may have unusual exposure histories. To reduce that risk, researchers often use controls with a mix of different diagnoses rather than a single disease group.

Matching is another common technique. Researchers pair each case with one or more controls who share characteristics like age, sex, and geographic location. This keeps those factors from distorting the comparison between groups, provided the analysis takes the matching into account.
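As a rough illustration of matching, the sketch below pairs each case with one unused control of the same sex whose age is within a set tolerance. The records, the field names, the match_controls helper, and the five-year tolerance are all hypothetical, and real studies typically use more careful procedures than this greedy pass.

```python
def match_controls(cases, controls, max_age_gap=5):
    """Greedily pair each case with the closest-in-age unused control of the
    same sex, allowing at most max_age_gap years of difference.
    The matching rule is a hypothetical example, not a standard."""
    available = list(controls)
    pairs = []
    for case in cases:
        best = None
        for candidate in available:
            if candidate["sex"] != case["sex"]:
                continue
            gap = abs(candidate["age"] - case["age"])
            if gap > max_age_gap:
                continue
            if best is None or gap < abs(best["age"] - case["age"]):
                best = candidate
        if best is not None:
            available.remove(best)  # each control is used at most once
        pairs.append((case, best))
    return pairs


# Made-up records for illustration only.
cases = [
    {"id": "case1", "sex": "M", "age": 62},
    {"id": "case2", "sex": "F", "age": 47},
]
controls = [
    {"id": "ctrl1", "sex": "M", "age": 35},
    {"id": "ctrl2", "sex": "F", "age": 49},
    {"id": "ctrl3", "sex": "M", "age": 60},
]

pairs = match_controls(cases, controls)
for case, control in pairs:
    print(case["id"], "->", control["id"] if control else "no match")
# case1 -> ctrl3
# case2 -> ctrl2
```

A case with no eligible control is kept with a None partner so that unmatched cases stay visible rather than silently disappearing.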

The Smoking and Lung Cancer Study

One of the most famous case-control studies in history began in 1947, when researchers Richard Doll and Austin Bradford Hill set out to understand why lung cancer rates were rising sharply in Britain. They interviewed lung cancer patients in London-area hospitals about their smoking habits, then interviewed control patients admitted to the same hospitals for other, primarily non-cancerous conditions.

The final study included 1,357 male lung cancer cases and 1,357 matched controls. The results were striking. Among the lung cancer patients, only 7 were nonsmokers. Among the controls, 61 were nonsmokers. The pattern held across smoking intensity too: 340 cases smoked 25 or more cigarettes daily, compared to just 182 controls. As cigarette consumption rose, the lung cancer group became increasingly overrepresented. Preliminary results appeared in 1950 and the fuller report followed in 1952; together they became foundational evidence in the case against tobacco.

Measuring the Result: The Odds Ratio

Case-control studies can’t directly measure how often a disease occurs in an exposed group versus an unexposed group, because the researchers didn’t follow people forward in time. Instead, they use a statistic called the odds ratio, which compares the odds of exposure among cases to the odds of exposure among controls.

An odds ratio of 1.0 means the exposure was equally common in both groups, suggesting no association. An odds ratio above 1.0 means cases were more likely to have been exposed, pointing toward a higher risk. An odds ratio below 1.0 suggests the exposure may be protective. For example, an odds ratio of 2.78 would mean the cases had nearly three times the odds of having been exposed compared to controls. Researchers also calculate confidence intervals around the odds ratio. If the entire interval falls above or below 1.0, the result is considered statistically significant at the level that matches the interval (the 5% level for a 95% interval).
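To make the calculation concrete, here is a minimal sketch that applies it to the Doll and Hill counts quoted earlier: 7 nonsmokers among 1,357 cases and 61 among 1,357 controls, which leaves 1,350 and 1,296 smokers. The function name is made up for this example, the interval uses the common log-based (Woolf) approximation, and the calculation treats the groups as unmatched, which is a simplification of the original matched design.

```python
import math


def odds_ratio_with_ci(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls, z=1.96):
    """Odds ratio for a 2x2 table with an approximate confidence interval.
    z=1.96 gives a 95% interval under the log-based (Woolf) approximation.
    Assumes every cell is greater than zero."""
    odds_cases = exposed_cases / unexposed_cases
    odds_controls = exposed_controls / unexposed_controls
    ratio = odds_cases / odds_controls
    # Standard error of log(odds ratio): sqrt of the summed reciprocal cell counts.
    se_log = math.sqrt(1 / exposed_cases + 1 / unexposed_cases
                       + 1 / exposed_controls + 1 / unexposed_controls)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Smokers vs nonsmokers, derived from the counts quoted in the article.
ratio, lower, upper = odds_ratio_with_ci(
    exposed_cases=1350, unexposed_cases=7,
    exposed_controls=1296, unexposed_controls=61,
)
print(f"odds ratio = {ratio:.1f}")  # odds ratio = 9.1
print(f"approximate 95% CI = ({lower:.1f}, {upper:.1f})")
print("interval excludes 1.0:", lower > 1.0 or upper < 1.0)
```

Under these assumptions the lung cancer patients had roughly nine times the odds of having smoked, and the interval sits entirely above 1.0.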

One important nuance: odds ratios can exaggerate the true effect size, especially when the outcome isn’t rare. For common diseases, the odds ratio sits farther from 1.0 than the relative risk does, in whichever direction the association points. When the disease is rare, the two measures are nearly equal.
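A short numerical sketch shows how the two measures drift apart. The risks below are invented purely for illustration: in both scenarios the exposed group has exactly twice the risk of the unexposed group, and only the baseline frequency of the disease changes.

```python
def relative_risk(risk_exposed, risk_unexposed):
    """Ratio of disease risks (probabilities) in exposed vs unexposed people."""
    return risk_exposed / risk_unexposed


def odds_ratio(risk_exposed, risk_unexposed):
    """Ratio of disease odds, where odds = risk / (1 - risk)."""
    odds_exposed = risk_exposed / (1 - risk_exposed)
    odds_unexposed = risk_unexposed / (1 - risk_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical risks: the relative risk is 2 in both scenarios.
scenarios = [
    ("rare disease", 0.002, 0.001),
    ("common disease", 0.40, 0.20),
]
for label, risk_exposed, risk_unexposed in scenarios:
    rr = relative_risk(risk_exposed, risk_unexposed)
    oratio = odds_ratio(risk_exposed, risk_unexposed)
    print(f"{label}: relative risk = {rr:.2f}, odds ratio = {oratio:.2f}")
# rare disease: relative risk = 2.00, odds ratio = 2.00
# common disease: relative risk = 2.00, odds ratio = 2.67
```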

Common Sources of Bias

Case-control studies are vulnerable to two major forms of bias that can skew results.

Recall bias happens because participants are asked to remember past behaviors or exposures. People who are already sick tend to think harder about what might have caused their illness, which can make them report exposures more thoroughly than healthy controls. The problem gets worse when the gap between exposure and investigation is long. Accurately remembering what you ate for dinner a week ago is difficult for anyone, let alone what you ate months or years earlier.

Selection bias occurs when there’s a systematic difference in characteristics between the people chosen as cases and those chosen as controls. If controls are drawn from a population that doesn’t truly represent the group at risk of becoming cases, the comparison is flawed from the start. For instance, using hospital patients as controls could introduce bias if the conditions that brought them to the hospital are themselves linked to the exposure being studied.

Why Researchers Choose This Design

Case-control studies fill a specific niche. They’re particularly well suited for studying rare diseases. If a condition affects 1 in 10,000 people, a forward-looking study would need to enroll enormous numbers of participants and follow them for years, hoping enough cases develop to draw conclusions. A case-control study sidesteps this entirely by starting with people who already have the disease.
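The arithmetic behind that comparison is simple enough to sketch. The example assumes that "1 in 10,000" is the share of people who develop the disease during follow-up, and the target of 200 cases and the 1:1 control ratio are arbitrary choices made for illustration.

```python
def expected_cases(cohort_size, one_in):
    """Expected number of cases if 1 in `one_in` people develops the disease."""
    return cohort_size / one_in


def cohort_size_for(target_cases, one_in):
    """Cohort size at which the expected case count reaches target_cases."""
    return target_cases * one_in


ONE_IN = 10_000      # frequency from the text, read as risk over follow-up
TARGET_CASES = 200   # hypothetical number of cases wanted for analysis

print(expected_cases(50_000, ONE_IN))          # 5.0
print(cohort_size_for(TARGET_CASES, ONE_IN))   # 2000000

# A case-control study with one control per case needs far fewer participants.
controls_per_case = 1
case_control_size = TARGET_CASES * (1 + controls_per_case)
print(case_control_size)                       # 400
```

Even a cohort of 50,000 people would be expected to yield only about five cases, which is why the forward-looking approach becomes impractical for rare conditions.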

They’re also faster and cheaper than most alternatives. Because the outcomes have already occurred, there’s no waiting period. Data collection often involves interviews, questionnaires, or medical record reviews rather than years of follow-up visits and testing. This makes case-control studies a natural fit for investigating disease outbreaks, where speed matters.

The limitations are real, though. Because the design is observational and retrospective, it can identify associations but can’t prove causation on its own. It relies on participants’ memories and existing records, both of which can be incomplete or inaccurate. And it can only examine one outcome at a time, though it can look at multiple exposures related to that outcome.

How It Differs From a Cohort Study

The easiest way to understand a case-control study is to contrast it with a cohort study. In a cohort study, researchers start with a group of people defined by their exposure (smokers vs. nonsmokers, for example) and follow them forward to see who develops the disease. The direction is exposure first, outcome later. In a case-control study, the direction is flipped: outcome first, exposure investigated afterward.

Cohort studies can calculate a true relative risk because they track how often disease develops in each group over time. Case-control studies can only estimate that relationship through the odds ratio. Cohort studies also handle multiple outcomes well, since researchers can observe whatever diseases emerge during follow-up. But they cost more, take longer, and struggle with rare diseases because so few participants will develop the condition of interest.
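The sketch below illustrates why the odds ratio is the measure that survives case-control sampling. It uses an invented cohort of 50,000 people, keeps every case, and keeps exactly 1 in 50 non-cases as controls. Real sampling is random, so using exact proportions here is a simplification, and all counts are hypothetical.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_others, unexposed_others):
    """Cross-product odds ratio for a 2x2 table."""
    return (exposed_cases * unexposed_others) / (unexposed_cases * exposed_others)


def risk_ratio(exposed_cases, unexposed_cases, exposed_others, unexposed_others):
    """Risk in the exposed divided by risk in the unexposed.
    Only meaningful when the table describes a full cohort."""
    risk_exposed = exposed_cases / (exposed_cases + exposed_others)
    risk_unexposed = unexposed_cases / (unexposed_cases + unexposed_others)
    return risk_exposed / risk_unexposed


# Hypothetical full cohort: 10,000 exposed and 40,000 unexposed people.
exposed_cases, exposed_noncases = 300, 9_700
unexposed_cases, unexposed_noncases = 400, 39_600

cohort_rr = risk_ratio(exposed_cases, unexposed_cases,
                       exposed_noncases, unexposed_noncases)
cohort_or = odds_ratio(exposed_cases, unexposed_cases,
                       exposed_noncases, unexposed_noncases)

# Case-control sample: all cases, plus 1 in 50 of the non-cases as controls.
exposed_controls = exposed_noncases // 50      # 194
unexposed_controls = unexposed_noncases // 50  # 792

sample_or = odds_ratio(exposed_cases, unexposed_cases,
                       exposed_controls, unexposed_controls)
naive_rr = risk_ratio(exposed_cases, unexposed_cases,
                      exposed_controls, unexposed_controls)

print(f"cohort relative risk:          {cohort_rr:.2f}")  # 3.00
print(f"cohort odds ratio:             {cohort_or:.2f}")  # 3.06
print(f"case-control odds ratio:       {sample_or:.2f}")  # 3.06
print(f"naive 'risk ratio' in sample:  {naive_rr:.2f}")   # 1.81
```

Because cases are deliberately oversampled, the proportions of diseased people in the sample no longer reflect real risks, so a risk ratio computed from the sample is misleading, while the odds ratio matches the cohort value.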

Nested Case-Control Studies

A variation called the nested case-control study combines elements of both designs. Researchers start with a large cohort that’s already being tracked, often through electronic health records or an existing long-term study. When enough cases accumulate within that cohort, they select matched controls from the same group and analyze the data as a case-control study.

This approach solves several problems at once. Exposure data was collected before anyone got sick, which largely removes recall bias. The cases and controls come from the same well-defined population, reducing selection bias. And because researchers only need to analyze detailed data for a subset of the cohort rather than the entire group, it’s relatively inexpensive. Nested case-control studies are increasingly common in research that uses large medical databases or biobanks where blood samples and health records have already been stored.