The human genome contains billions of DNA letters, and every person possesses millions of small genetic differences, or variants, compared to a reference sequence. This immense number of differences presents the fundamental challenge in genetic diagnosis: only a tiny fraction of these variants causes disease. Pinpointing the single problematic variant among countless benign ones is difficult. Genetic variant databases provide the organized, collective knowledge needed to navigate this complex landscape and map the location of disease-causing mutations.
What Are Genetic Variant Databases
Genetic variant databases are public repositories maintained by academic consortia, government organizations, and clinical laboratories that pool data on known genetic differences. These databases house information about millions of variants observed in human populations globally. They are categorized into two primary types based on the information they store: population frequency and clinical significance.
Population databases, such as the Genome Aggregation Database (gnomAD), record how often a specific variant appears in large, diverse groups of people presumed to be healthy. This frequency data establishes whether a variant is too common in the general population to cause a rare disease. Clinical databases, like ClinVar, collect reports from diagnostic laboratories and researchers regarding the known clinical impact of a variant, often classifying it as benign, pathogenic, or a variant of uncertain significance (VUS).
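The two kinds of record can be pictured as two small data structures. The sketch below is illustrative only: the class names, field names, and the variant identifier are hypothetical and do not reproduce the actual gnomAD or ClinVar schemas. The one firm point is that an allele frequency is the number of observed copies of the variant divided by the total number of alleles examined at that position.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PopulationRecord:
    """Hypothetical population-frequency entry (not the real gnomAD schema)."""
    variant: str        # made-up identifier: chromosome-position-ref-alt
    allele_count: int   # copies of the variant allele that were observed
    allele_number: int  # total alleles examined at this position

    @property
    def allele_frequency(self) -> float:
        # Frequency = observed copies / alleles examined; guard against 0.
        if self.allele_number == 0:
            return 0.0
        return self.allele_count / self.allele_number


@dataclass(frozen=True)
class ClinicalRecord:
    """Hypothetical clinical-significance entry (not the real ClinVar schema)."""
    variant: str
    classification: str  # e.g. "benign", "pathogenic", "uncertain significance"
    condition: str


population_entry = PopulationRecord("1-12345-A-G", allele_count=30, allele_number=150_000)
clinical_entry = ClinicalRecord("1-12345-A-G", "uncertain significance", "example condition")
print(population_entry.allele_frequency)
```

Here the invented variant was seen 30 times among 150,000 alleles, a frequency of 0.02%.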
Generating the List of Patient Variants
Identifying a patient’s variants begins with high-throughput sequencing technology, such as Whole Exome Sequencing (WES). WES focuses on sequencing the exome, the roughly 1-2% of the genome that contains the protein-coding genes, where approximately 85% of known disease-causing mutations are found. The sequencing machine reads the patient’s DNA fragments, and specialized software aligns these fragments against a standardized human reference genome.
This alignment highlights every position where the patient’s DNA sequence differs from the reference, generating a raw list of thousands of variants carried by that individual. This initial list includes harmless, common variants as well as rare, potentially disease-related ones. On its own, the raw list says nothing about which variants matter; it requires database annotation and interpretation.
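The core idea of listing differences from a reference can be shown with a toy example. This is a deliberately simplified sketch: it assumes the patient sequence is already aligned and the same length as the reference, and it reports only single-letter substitutions. Real pipelines use dedicated alignment and variant-calling software that also handles insertions, deletions, read quality, and the two copies of each chromosome. The function name `call_substitutions` and both sequences are invented for illustration.

```python
def call_substitutions(reference, sample, start=1):
    """Return (position, reference_base, sample_base) for every mismatch.

    Assumes the two sequences are already aligned and equal in length.
    Positions are 1-based by default.
    """
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (start + offset, ref_base, sample_base)
        for offset, (ref_base, sample_base) in enumerate(zip(reference, sample))
        if ref_base != sample_base
    ]


reference_seq = "ACGTACGT"
patient_seq = "ACGTTCGA"
raw_variants = call_substitutions(reference_seq, patient_seq)
print(raw_variants)  # [(5, 'A', 'T'), (8, 'T', 'A')]
```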
Comparing Patient Data to Known Variants
The next step, annotation, involves matching every variant on the patient’s raw list against the records stored in genetic databases. This comparison drastically reduces the initial number of candidate variants, moving the analysis from thousands of possibilities to a manageable few. Analysts first check population frequency databases like gnomAD to see how often a variant has been observed in individuals without the patient’s disease. If a variant appears at a high frequency (e.g., in 1% or more of the healthy population), it is statistically unlikely to cause a rare, inherited disorder and is filtered out as benign.
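The frequency filter can be sketched in a few lines. The example below assumes the population frequencies have already been loaded into a plain dictionary keyed by (chromosome, position, reference, alternate). It does not call gnomAD, and the variants and frequencies are invented. The 1% cutoff is the example threshold mentioned above; laboratories choose thresholds suited to the disease in question.

```python
RARE_DISEASE_MAX_AF = 0.01  # the 1% example threshold from the text


def filter_common_variants(variants, population_af, max_af=RARE_DISEASE_MAX_AF):
    """Keep only variants whose population allele frequency is below max_af."""
    rare = []
    for variant in variants:
        # Assumption: a variant missing from the lookup was never observed,
        # so it is treated as having frequency 0 and is kept.
        frequency = population_af.get(variant, 0.0)
        if frequency < max_af:
            rare.append(variant)
    return rare


# Invented patient variants: (chromosome, position, reference, alternate).
patient_variants = [
    ("1", 1000, "A", "G"),
    ("2", 2000, "C", "T"),
    ("7", 3000, "G", "A"),
]
# Invented frequencies standing in for a population database export.
population_af = {
    ("1", 1000, "A", "G"): 0.23,    # common, filtered out
    ("2", 2000, "C", "T"): 0.0004,  # rare, kept
}

rare_variants = filter_common_variants(patient_variants, population_af)
print(rare_variants)  # [('2', 2000, 'C', 'T'), ('7', 3000, 'G', 'A')]
```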
The remaining rare variants are then checked against clinical databases like ClinVar. This step provides immediate classification for variants previously submitted and evaluated by other laboratories. A variant may receive an existing classification of “benign,” “likely pathogenic,” or “pathogenic,” which informs the diagnostic conclusion. If the variant lacks sufficient evidence for a clear classification, it is labeled a Variant of Uncertain Significance (VUS), requiring further investigation.
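The clinical lookup can be pictured as sorting the rare variants into groups according to any existing report. The sketch below is a rough illustration with invented data: `clinical_records` is a hypothetical dictionary standing in for a ClinVar export, and the group names are chosen for this example only. Variants with no record are grouped with uncertain ones here, since both need further investigation.

```python
def triage_by_clinical_record(variants, clinical_records):
    """Group variants by a previously reported classification, if any."""
    groups = {"reported_pathogenic": [], "reported_benign": [], "needs_review": []}
    for variant in variants:
        label = clinical_records.get(variant)  # None when nothing was reported
        if label in ("pathogenic", "likely pathogenic"):
            groups["reported_pathogenic"].append(variant)
        elif label in ("benign", "likely benign"):
            groups["reported_benign"].append(variant)
        else:
            # Uncertain significance, or no record at all.
            groups["needs_review"].append(variant)
    return groups


# Invented rare variants and invented clinical reports.
candidate_variants = [
    ("2", 2000, "C", "T"),
    ("7", 3000, "G", "A"),
    ("9", 4000, "T", "C"),
]
clinical_records = {
    ("2", 2000, "C", "T"): "likely pathogenic",
    ("7", 3000, "G", "A"): "uncertain significance",
}

triaged = triage_by_clinical_record(candidate_variants, clinical_records)
print(triaged["reported_pathogenic"])  # [('2', 2000, 'C', 'T')]
print(triaged["needs_review"])         # [('7', 3000, 'G', 'A'), ('9', 4000, 'T', 'C')]
```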
Filtering and Prioritizing Pathogenic Mutations
The final phase focuses on the few remaining rare and unclassified variants to determine which, if any, is truly pathogenic. This process moves beyond simple database lookup and incorporates detailed clinical and biological evidence. Analysts use structured frameworks, such as the guidelines established by the American College of Medical Genetics and Genomics (ACMG), to systematically score each variant based on multiple types of evidence.
One type of evidence is segregation analysis, which involves testing family members to see whether the variant is present only in those affected by the disease. Bioinformatic tools also predict the functional consequences of the variant on the resulting protein, such as whether it disrupts the protein’s shape or function. The ACMG criteria assign a weight to each piece of evidence (including population rarity, functional predictions, and segregation data) to reach one of five classifications: pathogenic, likely pathogenic, VUS, likely benign, or benign. The combination of database annotation and systematic filtering allows geneticists to pinpoint the causative mutation, leading to a clinical diagnosis.
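How weighted evidence turns into one of the five classes can be sketched as a set of counting rules. The example below follows the combining rules of the 2015 ACMG/AMP guidelines as commonly summarized; treat it as an illustration to be checked against the published guideline, not as a clinical tool. The short strength labels (PVS, PS, PM, PP for pathogenic evidence; BA, BS, BP for benign evidence) and the function name are simplifications chosen for this sketch, and deciding which criteria a variant actually meets is the hard, expert part that is not modeled here.

```python
from collections import Counter


def combine_evidence(strengths):
    """Classify a variant from a list of evidence-strength labels.

    Labels: PVS (very strong), PS (strong), PM (moderate), PP (supporting)
    for pathogenic evidence; BA (stand-alone), BS (strong), BP (supporting)
    for benign evidence.
    """
    n = Counter(strengths)
    pvs, ps, pm, pp = n["PVS"], n["PS"], n["PM"], n["PP"]
    ba, bs, bp = n["BA"], n["BS"], n["BP"]

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Conflicting pathogenic and benign evidence defaults to uncertain.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "VUS"
    if pathogenic:
        return "pathogenic"
    if likely_pathogenic:
        return "likely pathogenic"
    if benign:
        return "benign"
    if likely_benign:
        return "likely benign"
    return "VUS"  # not enough evidence in either direction


print(combine_evidence(["PVS", "PS"]))  # pathogenic
print(combine_evidence(["PM", "PP"]))   # VUS
```

In this sketch a single moderate and a single supporting piece of evidence are not enough to leave the VUS category, which reflects why many rare variants remain unresolved until more family or functional data arrive.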