The Genome Aggregation Database (gnomAD) is a resource developed by an international coalition of investigators. Its purpose is to aggregate and harmonize genetic data from numerous large-scale sequencing projects. The resulting summary data is made available to the scientific community to advance research.
The Scale and Scope of gnomAD Data
The gnomAD database is notable for its scale, housing genetic information from hundreds of thousands of individuals. The most recent version, v4.1, contains data from 730,947 exome sequences and 76,215 whole-genome sequences derived from unrelated individuals. This data was compiled from numerous large-scale sequencing projects, many of which were initially focused on specific diseases.
The database includes two main types of genetic information: whole-exome sequencing (WES) and whole-genome sequencing (WGS). WES focuses on the protein-coding regions of the genome, which constitute about 1-2% of total DNA but are where most known disease-causing variants have been found. WGS provides a more comprehensive view by capturing information from the entire genome, including non-coding regions.
A feature of gnomAD is the genetic diversity of the individuals included, with data from varied ancestral backgrounds such as European, African, and East Asian. This diversity is important for its utility, though some populations, such as those of Middle Eastern or Oceanian descent, are less well represented. Improving the representation of diverse populations is an ongoing priority for the project.
Interpreting Genetic Variation
The primary function of gnomAD is to provide a baseline of human genetic variation. It catalogs genetic differences and reports how frequently each variant occurs within large populations. This allele frequency, meaning how often a specific variant (allele) appears among the chromosomes sampled from a population, is used for interpreting genetic data in both clinical and research settings.
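As a rough illustration of the arithmetic behind an allele frequency, the sketch below divides an allele count (AC) by the number of chromosomes that were successfully genotyped at that site (AN). The function name and the example numbers are invented for illustration and are not taken from gnomAD.

```python
def allele_frequency(allele_count: int, allele_number: int) -> float:
    """Return AC / AN, the fraction of genotyped chromosomes carrying the allele."""
    if allele_number <= 0:
        raise ValueError("allele_number must be positive")
    if not 0 <= allele_count <= allele_number:
        raise ValueError("allele_count must be between 0 and allele_number")
    return allele_count / allele_number


# Hypothetical site: 150 copies of the alternate allele observed among
# 100,000 diploid individuals, i.e. 200,000 genotyped chromosomes.
af = allele_frequency(150, 200_000)
print(f"{af:.5f}")
```

Because each person carries two copies of most chromosomes, the denominator is twice the number of individuals genotyped at the site, not the number of individuals.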
This resource helps scientists distinguish between common genetic variants, which are often benign, and rare variants, which are more likely to be associated with disease. If a variant identified in a patient is rare or absent in the gnomAD population, it is a stronger candidate for causing a disorder. A common variant is less likely to be the cause of a rare disease.
The human genome can be compared to a vast text, where common variations are like common words. A rare variant is like a misspelled or unusual word. By providing a dictionary of common words and their frequencies (gnomAD), scientists can more easily spot variants that are out of place and warrant investigation.
The database provides metrics to aid this interpretation, such as “constraint.” Constraint measures whether a gene carries fewer protein-altering variants than would be expected by chance. Highly constrained genes are inferred to be under stronger selective pressure, so variants within them are more likely to have functional consequences, which helps researchers prioritize them for study.
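One simple way to express this idea is as a ratio of observed to expected variant counts, where a lower ratio suggests stronger constraint. The sketch below assumes the expected counts have already been estimated by some mutation model, which is the difficult part and is not shown. The gene names and numbers are hypothetical.

```python
def observed_expected_ratio(observed: int, expected: float) -> float:
    """Ratio of observed to expected variant counts for one gene."""
    if expected <= 0:
        raise ValueError("expected must be positive")
    return observed / expected


# Hypothetical genes mapped to (observed, expected) protein-altering variants.
genes = {
    "GENE_A": (3, 40.0),   # far fewer variants than expected
    "GENE_B": (38, 41.5),  # close to expectation
}

# Rank genes from most constrained (lowest ratio) to least constrained.
ranked = sorted(genes, key=lambda name: observed_expected_ratio(*genes[name]))
for name in ranked:
    print(name, round(observed_expected_ratio(*genes[name]), 3))
```

A point estimate like this is noisy for small genes, where the expected count is low, so in practice a ratio is usually read together with some measure of its uncertainty.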
Applications in Research and Medicine
In a clinical context, gnomAD is used to help diagnose rare genetic diseases. When a patient’s genome is sequenced, clinicians cross-reference the identified variants with the database to filter out common ones. This process narrows the list of candidate variants that may be responsible for a patient’s condition, accelerating diagnosis.
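The filtering step described above can be sketched as a lookup against a table of population frequencies. Everything here is an assumption made for illustration: the variant identifiers, the frequency values, the in-memory dictionary standing in for the database, and the 0.001 cutoff, which is not a recommended clinical threshold.

```python
from typing import Dict, List


def filter_rare_variants(
    patient_variants: List[str],
    population_af: Dict[str, float],
    max_af: float = 0.001,
) -> List[str]:
    """Keep variants whose population allele frequency is at or below max_af."""
    rare = []
    for variant in patient_variants:
        # A variant missing from the reference table is treated as frequency 0.
        # In real data, absence can also reflect poor sequencing coverage.
        af = population_af.get(variant, 0.0)
        if af <= max_af:
            rare.append(variant)
    return rare


# Hypothetical reference frequencies keyed by "chrom-pos-ref-alt".
population_af = {"1-1000-A-G": 0.32, "2-5000-C-T": 0.00002}
patient = ["1-1000-A-G", "2-5000-C-T", "7-900-G-A"]

candidates = filter_rare_variants(patient, population_af)
print(candidates)
```

The common variant is dropped, while the rare variant and the one absent from the reference table remain as candidates for further review.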
The pharmaceutical industry uses gnomAD for drug discovery and development. By examining genetic variation, researchers can identify genes that are naturally inactivated in some healthy individuals. These “loss-of-function” variants offer clues about the potential effects of a drug designed to inhibit that gene’s protein, helping researchers select drug targets and anticipate potential side effects.
gnomAD also serves as a resource for population genetics research. The data from diverse ancestries allows scientists to study human history, migration patterns, and the selective pressures that have shaped different populations. It provides a detailed map of how human populations have evolved and diverged.
Data Curation and Ethical Framework
Before inclusion in the database, data undergoes a rigorous and standardized quality control process. This harmonization ensures that information from various sources is consistent and accurate, which allows for meaningful comparisons across the entire dataset.
Protecting participant privacy is a foundational principle of the gnomAD project. All data is de-identified, meaning personal information that could link genetic data to an individual is removed. The database only provides aggregated summary data, like allele frequencies, not individual-level genetic information, which safeguards participant identities.
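To make the distinction between individual-level and summary data concrete, the sketch below collapses per-person genotypes at one site into aggregate counts. It assumes a diploid autosomal site with genotypes coded as the number of alternate alleles (0, 1, or 2) and None for a missing call. The function name and data are hypothetical and do not represent the project's actual pipeline.

```python
from typing import Dict, List, Optional


def summarize_site(genotypes: List[Optional[int]]) -> Dict[str, float]:
    """Collapse individual genotypes at one site into AC, AN, and AF."""
    called = [g for g in genotypes if g is not None]  # drop missing calls
    ac = sum(called)      # total alternate alleles observed
    an = 2 * len(called)  # two chromosomes per genotyped individual
    af = ac / an if an else 0.0
    return {"AC": ac, "AN": an, "AF": af}


# Hypothetical genotypes for six individuals; one call is missing.
genotypes = [0, 1, 0, None, 2, 0]
summary = summarize_site(genotypes)
print(summary)
```

Only the three aggregate values are returned, so the output does not record which individual carried which genotype.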
The project operates under an open-science model, making its summary data freely and publicly available to the global scientific community. This commitment to open access is intended to accelerate scientific discovery. The database is managed by a team at the Broad Institute in collaboration with an international coalition of investigators.