How DAVID Pathway Analysis Finds Biological Meaning
Explore the methodology behind DAVID, a tool that translates raw gene lists into biological context by identifying the most significant underlying processes.
High-throughput experiments can generate vast lists of genes or proteins that are difficult to interpret without further analysis. These lists, often containing hundreds of entries, require a method to discern their collective biological significance. The Database for Annotation, Visualization and Integrated Discovery (DAVID) is a web-based tool developed to solve this problem by helping scientists understand the biological meaning behind a large list of genes.
The primary goal of DAVID is to identify which biological pathways are over-represented in a given gene list. By comparing the user’s list to databases of known biological information, the platform uncovers connections that are not apparent from the raw data. This allows researchers to gain a functional understanding of the underlying biology.
To determine the biological relevance of a gene list, DAVID relies on functional annotation, which is the process of assigning functional information to specific genes. The system cross-references a user’s submitted gene list against large, curated databases to identify the roles these genes play within a cell. This framework provides the necessary context for analysis.
A principal component of this is the Gene Ontology (GO) project, which provides a structured vocabulary to describe gene attributes. GO is organized into three domains: “Biological Process” describes larger objectives, “Molecular Function” details biochemical activity, and “Cellular Component” specifies where a gene product is active. For example, if the biological process is “long-distance travel,” the molecular function might be “igniting fuel vapor,” and the cellular component would be the “spark plug.” This hierarchical categorization allows for a multi-layered understanding of a gene’s purpose.
DAVID also utilizes pathway databases, which are molecular roadmaps containing diagrams of known molecular interactions. The most prominent is the Kyoto Encyclopedia of Genes and Genomes (KEGG) Pathways. KEGG provides manually drawn pathway maps for a wide range of processes. These maps serve as biological blueprints, allowing researchers to see how their genes of interest fit into established networks for processes like cell signaling or metabolism.
The central function of DAVID is enrichment analysis, a statistical method used to determine whether a biological category is over-represented in a set of genes. The process begins when a user uploads a gene list, for instance, genes that are more active in cancerous tissue. DAVID then compares this list against a background reference. By default, this is the full set of annotated genes for the organism, such as the human genome, although a custom background (for example, only the genes measured in the experiment) can be supplied instead.
The core idea is to identify “over-representation.” Imagine the human genome is a large jar of 20,000 marbles, where 200 are red, symbolizing genes in the “cell cycle” pathway. Red marbles make up 1% of the jar, so a random draw of 300 marbles would be expected to contain about 3 red ones (300 × 200 / 20,000). If a researcher’s sample of 300 marbles (their gene list) contains 30 red ones, that is ten times the expected number, which strongly suggests the selection is related to the “cell cycle” pathway.
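The marble analogy can be checked with a small simulation. The sketch below is only an illustration of the analogy, not part of DAVID: the function names and the choice of 2,000 repeated draws are assumptions made for this example. It draws 300 marbles at random from the jar many times and counts how many red ones turn up.

```python
import random

GENOME_SIZE = 20_000   # marbles in the jar (genes in the background)
PATHWAY_SIZE = 200     # red marbles (genes in the "cell cycle" pathway)
LIST_SIZE = 300        # marbles drawn (genes in the user's list)


def expected_hits(list_size, pathway_size, genome_size):
    """Number of pathway genes expected in a purely random list."""
    return list_size * pathway_size / genome_size


def simulate_hits(n_draws, seed=0):
    """Draw random gene lists and count pathway genes in each one."""
    rng = random.Random(seed)
    counts = []
    for _ in range(n_draws):
        sample = rng.sample(range(GENOME_SIZE), LIST_SIZE)
        # Marbles numbered below PATHWAY_SIZE are treated as red.
        counts.append(sum(1 for marble in sample if marble < PATHWAY_SIZE))
    return counts


draws = simulate_hits(2000)
mean_hits = sum(draws) / len(draws)
print("expected by chance:", expected_hits(LIST_SIZE, PATHWAY_SIZE, GENOME_SIZE))
print("average in simulation:", mean_hits)
print("largest count seen:", max(draws))
```

The average count should sit close to 3, and a random draw containing anything near 30 red marbles should essentially never appear, which is why observing 30 is treated as meaningful.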
DAVID calculates whether the proportion of genes for a specific GO term or KEGG pathway in a user’s list is significantly higher than the proportion in the background. This calculation uses a modified Fisher’s Exact Test, which DAVID calls the EASE score. The EASE score is more conservative than the standard test: it removes one gene from the count of list genes belonging to the category before computing the probability. This penalizes categories supported by very few genes and reduces potentially spurious findings. The analysis tests thousands of terms and pathways, resulting in a ranked list of biological themes relevant to the gene list.
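To make the difference between the two tests concrete, here is a minimal sketch of a one-sided Fisher’s Exact Test and an EASE-style variant. It assumes the EASE adjustment is simply "subtract one from the list genes in the category", and the layout of the 2x2 table, the function names, and the example counts are all hypothetical choices for illustration rather than DAVID’s own code.

```python
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    a: list genes in the term        b: list genes not in the term
    c: other genes in the term       d: other genes not in the term
    Returns the hypergeometric probability of seeing a or more hits.
    """
    list_size = a + b
    term_size = a + c
    total = a + b + c + d
    upper = min(list_size, term_size)
    favourable = sum(
        comb(term_size, x) * comb(total - term_size, list_size - x)
        for x in range(a, upper + 1)
    )
    return favourable / comb(total, list_size)


def ease_score(a, b, c, d):
    """EASE-style p-value: same test with one hit removed (assumed form)."""
    return fisher_right_tail(max(a - 1, 0), b, c, d)


# Hypothetical counts: 3 of 300 list genes fall in a 40-gene pathway,
# with a background of 30,000 genes.
fisher_p = fisher_right_tail(3, 297, 37, 29663)
ease_p = ease_score(3, 297, 37, 29663)
print("Fisher p-value:", fisher_p)
print("EASE score:   ", ease_p)
```

With so few supporting genes, the EASE-style value comes out noticeably larger (less significant) than the plain Fisher p-value, which is the conservative behavior described above.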
After a gene list is submitted, DAVID offers several tools to explore and interpret the results. These tools organize the data from the enrichment analysis into a more digestible format. Each tool provides a different perspective on the functional significance of the submitted genes.
The Functional Annotation Chart presents a detailed table listing all the GO terms and KEGG pathways found to be significantly enriched in the user’s gene list. Each row corresponds to a specific biological term and includes statistical values that help the user gauge the significance of the enrichment. This chart provides the most direct view of the analysis, offering a granular look at every enriched biological theme.
The Functional Annotation Clustering tool addresses a common issue where multiple annotation terms are redundant because they describe similar biological processes. For example, terms like “regulation of cell division” and “mitotic cell cycle control” involve many of the same genes. The clustering tool groups these related terms into annotation clusters, simplifying the results and highlighting the broader biological themes. Each cluster is given an enrichment score, making it easier to identify the most relevant functional groups.
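One way to picture the cluster enrichment score is as a summary of the p-values of the terms in the cluster. The sketch below assumes the score is the geometric mean of the member p-values expressed on a minus log10 scale, which is how the score is commonly described. The function name and the example p-values are hypothetical.

```python
from math import log10


def cluster_enrichment_score(p_values):
    """Mean of -log10(p) over the terms in one annotation cluster.

    This equals -log10 of the geometric mean of the p-values, so a
    higher score means the cluster's terms are, overall, more enriched.
    """
    if not p_values:
        raise ValueError("a cluster needs at least one term")
    return sum(-log10(p) for p in p_values) / len(p_values)


# Hypothetical cluster of two related cell-cycle terms.
score = cluster_enrichment_score([1e-3, 1e-5])
print("cluster enrichment score:", score)
```

Under this assumption, a score above roughly 1.3 corresponds to a geometric-mean p-value below 0.05, which gives a rough way to relate cluster scores back to ordinary significance levels.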
The Gene Functional Classification tool groups genes from the user’s list that share similar functional annotations. By clustering genes based on their shared roles in various biological processes or pathways, it can help identify smaller, functionally related gene groups within the larger list. This can pinpoint modules of genes that may be working together to drive a specific biological outcome, although shared annotation alone does not show that the genes are co-regulated.
After an analysis, DAVID presents a report with statistical metrics. Understanding these metrics is necessary for correctly interpreting the results and drawing biological conclusions. The Functional Annotation Chart, for example, contains several columns of values that quantify the significance of the findings.
The p-value is a fundamental metric. It represents the probability of seeing at least as many genes from a GO term or pathway in the list as were observed, if the list had been drawn at random from the background. A smaller p-value indicates that the observed enrichment is less likely to be a coincidence. A p-value less than 0.05 is conventionally considered statistically significant for a single test, suggesting the biological term is relevant to the user’s gene list, but this threshold does not account for the many terms tested at once.
The Fold Enrichment metric provides a measure of the magnitude of the enrichment. It is calculated by comparing the proportion of genes associated with a term in the user’s list to the proportion of genes for that same term in the background population. A Fold Enrichment of 2.0 means that the genes for that term are represented twice as much in the user’s list as would be expected by chance.
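The calculation is a ratio of two proportions and can be sketched in a few lines. The function and parameter names below are assumptions made for this illustration, and the counts reuse the marble example from earlier in the article.

```python
def fold_enrichment(list_hits, list_total, background_hits, background_total):
    """Ratio of the term's share in the list to its share in the background.

    Written as a single fraction, (list_hits / list_total) divided by
    (background_hits / background_total), to limit rounding error.
    """
    if list_total <= 0 or background_hits <= 0:
        raise ValueError("list_total and background_hits must be positive")
    return (list_hits * background_total) / (list_total * background_hits)


# 30 of 300 list genes versus 200 of 20,000 background genes.
fe = fold_enrichment(30, 300, 200, 20_000)
print("fold enrichment:", fe)
```

A value near 1 means the term appears about as often as chance would predict. Fold enrichment measures the size of the effect only, so it is usually read alongside the p-value rather than in place of it.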
The False Discovery Rate (FDR) is an important metric for prioritizing results. When testing thousands of terms simultaneously, it is probable that some will appear significant by luck, which is known as the multiple testing problem. The FDR corrects for this by estimating the proportion of significant results that are likely to be false positives. A lower FDR value gives greater confidence that the identified biological terms are truly significant.
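A widely used way to control the false discovery rate is the Benjamini-Hochberg procedure. The sketch below implements that standard procedure as an illustration of the idea. It is not claimed to reproduce the exact values in DAVID’s FDR column, and the function name and example p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling each by m / rank and
    # keeping the adjusted values monotone (and capped at 1).
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four annotation terms.
raw = [0.01, 0.04, 0.03, 0.005]
adj = benjamini_hochberg(raw)
for p, q in zip(raw, adj):
    print(f"raw p = {p:<6} adjusted = {q:.3f}")
```

Each adjusted value is at least as large as the raw p-value it came from, so a term that only just passes 0.05 on its own may no longer pass once the number of tests is taken into account.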