How to Calculate Penetrance From Genotype and Family Data

Penetrance is calculated by dividing the number of people who carry a genetic variant and show symptoms by the total number of people who carry that variant. If 80 out of 100 people with a specific gene variant develop the associated condition, penetrance is 80%. The concept is straightforward, but getting accurate numbers in practice requires careful attention to how data is collected and which individuals are counted.

The Basic Formula

At its simplest, penetrance is a ratio:

Penetrance = number of affected carriers / total number of carriers

If you can identify everyone in a family who carries a pathogenic variant through molecular testing, and some of those carriers are affected while others are not, the maximum likelihood estimate is K = n₁ / (n₁ + n₂), where n₁ is the number of affected (penetrant) individuals and n₂ is the number of unaffected carriers. For a family where genetic testing reveals 12 carriers of a dominant variant and 9 of them have developed the condition, penetrance would be 9/12, or 75%.
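The arithmetic above can be sketched in a few lines of Python. The function names are made up for illustration, and the Wilson score interval is an addition not mentioned in the text; it is included only to show how wide the uncertainty is when the estimate rests on 12 carriers.

```python
from math import sqrt

def penetrance(affected_carriers: int, unaffected_carriers: int) -> float:
    """Point estimate K = n1 / (n1 + n2)."""
    total = affected_carriers + unaffected_carriers
    if total == 0:
        raise ValueError("need at least one carrier")
    return affected_carriers / total

def wilson_interval(affected_carriers: int, total_carriers: int, z: float = 1.96):
    """Approximate 95% Wilson score interval for a binomial proportion.

    Assumption for illustration: carriers are treated as independent,
    which is not strictly true of relatives in one family.
    """
    p = affected_carriers / total_carriers
    denom = 1 + z**2 / total_carriers
    centre = (p + z**2 / (2 * total_carriers)) / denom
    half = z * sqrt(
        p * (1 - p) / total_carriers + z**2 / (4 * total_carriers**2)
    ) / denom
    return centre - half, centre + half

# The family from the text: 12 carriers, 9 affected, 3 unaffected.
k = penetrance(9, 3)
low, high = wilson_interval(9, 12)
print(f"penetrance = {k:.2f}, 95% interval roughly {low:.2f} to {high:.2f}")
```

The point estimate is 0.75, but the interval around it is wide, which is one reason single-family estimates are usually pooled with other data.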

When molecular testing isn’t available for every family member, a cruder estimate uses “obligate carriers” instead. An obligate carrier is someone who must carry the variant based on the pattern of affected relatives around them, even without a genetic test. For example, an unaffected person who has both an affected parent and an affected child is an obligate carrier. To get the estimate, divide the number of affected individuals by the total number of presumed carriers, meaning the affected individuals plus the unaffected obligate carriers.
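As a rough sketch of the obligate-carrier rule, the following Python applies the "affected parent and affected child" test to a small pedigree. The pedigree, the names in it, and the function name are all hypothetical. The sketch also assumes a dominant variant and that every affected relative carries it, which real pedigrees do not guarantee.

```python
# Hypothetical pedigree: name -> (list of parents, affected?)
pedigree = {
    "grandmother": ([], True),
    "grandfather": ([], False),
    "mother": (["grandmother", "grandfather"], False),
    "father": ([], False),
    "aunt": (["grandmother", "grandfather"], True),
    "child": (["mother", "father"], True),
}

def unaffected_obligate_carriers(pedigree):
    """Unaffected people with an affected parent and an affected child."""
    found = set()
    for name, (parents, affected) in pedigree.items():
        if affected:
            continue
        has_affected_parent = any(pedigree[p][1] for p in parents)
        has_affected_child = any(
            name in child_parents and child_affected
            for child_parents, child_affected in pedigree.values()
        )
        if has_affected_parent and has_affected_child:
            found.add(name)
    return found

affected = {name for name, (_, is_affected) in pedigree.items() if is_affected}
obligate_unaffected = unaffected_obligate_carriers(pedigree)

# Affected individuals divided by all presumed carriers.
estimate = len(affected) / (len(affected) + len(obligate_unaffected))
print(sorted(obligate_unaffected), estimate)
```

Here the mother is the only unaffected obligate carrier, so the estimate is 3 affected out of 4 presumed carriers. Unaffected carriers who have no affected child, such as a hypothetical carrier sibling with no children, are invisible to this rule, which is part of why the method is crude.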

Complete Versus Incomplete Penetrance

If every single person who carries a pathogenic variant eventually develops the condition, penetrance is 100% and the variant is described as completely penetrant. If any fraction of carriers remain unaffected, the variant has incomplete (or reduced) penetrance. Most genetic conditions fall somewhere on this spectrum rather than sitting neatly at 100%.

BRCA1 and BRCA2 variants illustrate incomplete penetrance clearly. More than 60% of women who inherit a harmful BRCA1 or BRCA2 change will develop breast cancer in their lifetime, compared to about 13% of women in the general population. For ovarian cancer, the numbers are 39% to 58% for BRCA1 carriers and 13% to 29% for BRCA2 carriers, versus roughly 1.1% in the general population. These are high-risk variants, but they are not 100% penetrant. Some carriers never develop cancer.

Penetrance Versus Expressivity

These two terms often come up together and are easy to confuse. Penetrance asks a yes-or-no question: did the person develop any signs of the condition? Expressivity describes how severely or in what way the condition shows up among those who are affected.

Marfan syndrome is a textbook example of variable expressivity. Nearly everyone with a variant in the FBN1 gene shows some features, but one person might only be unusually tall and thin with long fingers, while another develops life-threatening heart and blood vessel complications. The penetrance of Marfan syndrome is high, but the expressivity varies enormously from person to person.

Age-Dependent Penetrance

For many conditions, penetrance isn’t a single fixed number. It increases as people age, because carriers who haven’t yet developed symptoms may simply not have lived long enough. Researchers handle this by reporting cumulative penetrance at specific ages rather than giving one lifetime figure.

Huntington’s disease demonstrates this clearly. In a study of over 400 carriers of reduced-penetrance alleles (those with 36 to 39 CAG repeats, rather than the fully penetrant 40+), researchers used survival analysis to track when symptoms first appeared. For carriers with 39 repeats, cumulative penetrance reached 68% by age 70 and 81% by age 75. For those with 38 repeats, the numbers were much lower: 32% by age 70 and 51% by age 75. The gene is the same, but a one-repeat difference in length produces very different risk trajectories with age.
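One common form of survival analysis is the Kaplan-Meier estimator, sketched below in plain Python. The study described above may have used a different method, and the carrier records here are invented purely to show the mechanics: each record is an age and a flag saying whether that age is an onset age (affected) or the age at last follow-up (still unaffected, so censored).

```python
def kaplan_meier_penetrance(records):
    """Return [(age, cumulative penetrance)] at each observed onset age.

    records: list of (age, affected). For affected carriers, age is the
    age at onset; for unaffected carriers, it is the age last observed.
    """
    onset_ages = sorted({age for age, affected in records if affected})
    unaffected_fraction = 1.0
    curve = []
    for t in onset_ages:
        # Carriers still under observation and symptom-free just before t.
        at_risk = sum(1 for age, _ in records if age >= t)
        events = sum(1 for age, affected in records if affected and age == t)
        unaffected_fraction *= 1 - events / at_risk
        curve.append((t, 1 - unaffected_fraction))
    return curve

def penetrance_by_age(curve, age):
    """Cumulative penetrance at the last onset age not exceeding `age`."""
    value = 0.0
    for t, cumulative in curve:
        if t <= age:
            value = cumulative
    return value

# Hypothetical carriers: (age, affected?)
carriers = [
    (45, True), (50, False), (55, True), (60, False),
    (62, True), (70, False), (72, True), (80, False),
]
curve = kaplan_meier_penetrance(carriers)
by_70 = penetrance_by_age(curve, 70)
by_75 = penetrance_by_age(curve, 75)
naive = sum(affected for _, affected in carriers) / len(carriers)
print(f"by 70: {by_70:.2f}, by 75: {by_75:.2f}, naive ratio: {naive:.2f}")
```

The naive ratio (4 of 8 affected) gives one number with no age attached. The survival estimate instead rises with age, because carriers last seen unaffected at 50 or 60 count as "at risk" only up to the age they were observed.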

This age-dependent approach is standard in cancer genetics as well. When you see a statistic like “60% of BRCA1 carriers develop breast cancer,” that figure refers to cumulative risk over a full lifetime, typically estimated by age 70 or 80. For a 30-year-old carrier, much of that cumulative risk still lies ahead.

Estimating Penetrance From Family Data

When working with multi-generational pedigrees, the calculation gets more complex. Researchers typically model each family member’s disease status (affected or unaffected) alongside their age and carrier status, then use statistical models to estimate the cumulative risk of disease by a given age, often 70.

The general approach works like this: a statistical model relates carrier status to disease risk, with each family contributing to the overall estimate. The model is constrained so that the overall disease probability across the population matches what’s already known from epidemiological data. In other words, the family-based estimate has to be consistent with how common the disease is in the broader population.
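The population constraint can be illustrated with a simple mixture: overall risk equals carrier frequency times carrier risk, plus non-carrier frequency times non-carrier risk. The sketch below is a simplified stand-in for what published models do, not a reproduction of any one of them. The function name is hypothetical, and the carrier frequency of 0.25% is an assumed value chosen only for the example; the 13% and 60% figures are the breast cancer numbers quoted earlier in this article.

```python
def noncarrier_risk(population_risk, carrier_frequency, carrier_penetrance):
    """Solve K = f * P_carrier + (1 - f) * P_noncarrier for P_noncarrier."""
    risk = (population_risk - carrier_frequency * carrier_penetrance) / (
        1 - carrier_frequency
    )
    if not 0 <= risk <= 1:
        # The proposed penetrance cannot coexist with the known population risk.
        raise ValueError("carrier penetrance is inconsistent with population risk")
    return risk

population_risk = 0.13      # lifetime risk in the general population (from the text)
carrier_frequency = 0.0025  # assumed value, for illustration only
carrier_penetrance = 0.60   # lifetime risk in carriers (from the text)

p_non = noncarrier_risk(population_risk, carrier_frequency, carrier_penetrance)

# Recombining the two groups must give back the population figure.
recombined = (
    carrier_frequency * carrier_penetrance + (1 - carrier_frequency) * p_non
)
print(f"non-carrier risk = {p_non:.4f}, recombined = {recombined:.4f}")
```

A penetrance estimate that forces the non-carrier risk below zero is a sign that the family-based number cannot be reconciled with the epidemiological data.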

Different methods handle the family data slightly differently. Some treat the outcome as a simple yes/no (did disease develop by a certain age), while others use survival-based approaches that account for age at onset and the fact that younger family members may not yet be affected. The survival-based methods tend to be more informative because they use timing, not just presence or absence of disease.

One additional complication: family members’ disease outcomes may be correlated for reasons beyond the single gene being studied. Shared environment, other inherited genes, and lifestyle factors can all cluster disease within families. More sophisticated models account for this by assuming there may be additional unmeasured genetic factors influencing disease risk within the family.

Why Ascertainment Bias Inflates Estimates

One of the most important pitfalls in penetrance calculation is ascertainment bias, which consistently pushes estimates too high. This happens because families are often selected for study precisely because they have multiple affected members. Families where a variant is present but few or no people are affected are far less likely to come to clinical attention, so they’re underrepresented in the data.

Simulation studies confirm that including ascertainment-biased studies without correction leads to overestimated penetrance values, especially at older ages like 70 and 80. This is a practical concern: early estimates of BRCA1 penetrance, for example, were based heavily on families with striking cancer histories and were later revised downward when population-based studies included carriers without strong family histories.
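A toy simulation makes the direction of the bias easy to see. This is a minimal sketch, not a reconstruction of the published simulation studies: the true penetrance of 0.4, the family size of five carriers, and the rule that a family is studied only if at least two carriers are affected are all assumptions chosen for illustration.

```python
import random

def simulate(true_penetrance=0.4, families=20000, carriers_per_family=5,
             min_affected_to_ascertain=2, seed=1):
    """Return (estimate from all families, estimate from ascertained families)."""
    rng = random.Random(seed)
    all_affected = all_carriers = 0
    sel_affected = sel_carriers = 0
    for _ in range(families):
        # Each carrier independently develops disease with the true penetrance.
        affected = sum(
            rng.random() < true_penetrance for _ in range(carriers_per_family)
        )
        all_affected += affected
        all_carriers += carriers_per_family
        # Only families with enough affected members come to clinical attention.
        if affected >= min_affected_to_ascertain:
            sel_affected += affected
            sel_carriers += carriers_per_family
    return all_affected / all_carriers, sel_affected / sel_carriers

unbiased, ascertained = simulate()
print(f"all families: {unbiased:.3f}, ascertained families only: {ascertained:.3f}")
```

Counting every family recovers a value close to the true 0.4, while counting only the ascertained families gives a noticeably higher figure, even though the variant behaves identically in both groups.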

If you’re reading penetrance figures from a clinical study, it’s worth checking whether the participants were selected based on family history. If so, and no statistical correction was applied, the true penetrance in the general carrier population is likely lower than what’s reported.

Bayesian Approaches for Variant-Specific Estimates

Traditional penetrance calculations treat all carriers of variants in the same gene as a single group. But different variants within the same gene can have very different effects. A newer approach uses Bayesian statistics to estimate penetrance for individual variants based on their specific characteristics.

The logic works in two stages. First, features of the variant itself (where it sits in the protein’s structure, how it affects the protein’s function, how it’s been classified) are used to generate a starting estimate of how likely it is to cause disease. This is the prior probability. Then, clinical data from actual carriers of that variant (how many are affected, how many are not) updates that starting estimate to produce a more refined posterior probability. The more clinical data available for a specific variant, the more the final estimate reflects real-world observations rather than theoretical predictions.
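The two-stage update can be sketched with a Beta prior and binomial carrier counts. The text does not say which distributions the published methods use, so the Beta-binomial form here is an assumption made for simplicity. The function name and the `prior_strength` parameter (how many "pseudo-carriers" the prior is worth) are likewise hypothetical.

```python
def posterior_penetrance(prior_mean, prior_strength, affected, unaffected):
    """Posterior mean penetrance under a Beta prior and binomial data.

    prior_mean:     penetrance suggested by the variant's molecular features
    prior_strength: weight of the prior, expressed as a number of carriers
    """
    alpha = prior_mean * prior_strength + affected
    beta = (1 - prior_mean) * prior_strength + unaffected
    return alpha / (alpha + beta)

# Same observed ratio (75% affected), different amounts of clinical data.
few = posterior_penetrance(prior_mean=0.5, prior_strength=10,
                           affected=3, unaffected=1)
many = posterior_penetrance(prior_mean=0.5, prior_strength=10,
                            affected=30, unaffected=10)
print(f"4 carriers: {few:.3f}, 40 carriers: {many:.3f}")
```

With 4 carriers the estimate stays close to the prior (about 0.57); with 40 carriers it moves to 0.70, much nearer the observed 75%. This is the behavior described above: more clinical data means the estimate reflects observation more than prediction.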

This approach is especially useful for rare variants where only a handful of carriers have been identified. Without the Bayesian framework, the sample size would be too small for a meaningful estimate. By borrowing information from the variant’s molecular characteristics, researchers can generate a more stable probability even with limited clinical data.

What Modifies Penetrance

Penetrance isn’t purely a property of the variant itself. Environmental exposures, other genes in the background, and even random biological variation can all influence whether a carrier develops disease. Temperature stress, for example, has been shown in animal studies to dramatically increase the penetrance of certain developmental gene variants, with both heat and cold exposure triggering defects that don’t appear at normal temperatures. Stress-response proteins that normally buffer against genetic disruption can be diverted under environmental stress, unmasking the effects of variants that would otherwise remain silent.

In humans, the principle is the same even if the specific mechanisms differ. Lifestyle factors, hormonal exposures, co-inherited genetic variants, and environmental exposures all contribute to whether a given pathogenic variant actually produces disease. This is why two siblings carrying the identical BRCA1 variant can have different cancer outcomes, and why penetrance estimates are population-level probabilities rather than individual certainties.