What Is Gene Ontology and How Is It Used?

In the world of biology, researchers generate vast amounts of data about genes and proteins. To make sense of this information and communicate findings effectively, a shared language is needed. Gene Ontology (GO) serves this purpose, acting as a standardized dictionary that describes the roles of gene products. It provides a common vocabulary, allowing a scientist studying a fruit fly and another studying a mouse to describe the function of a similar gene in the same way.

The Three Core Ontologies

Gene Ontology is built upon three interconnected vocabularies, known as ontologies, that describe different aspects of a gene product’s biology. Each ontology answers one fundamental question: what a gene product does, where it acts, or what process it participates in. This separation allows for a detailed and multi-faceted description of function.

The first of these is the Biological Process ontology. This category describes the larger biological goals accomplished by the coordinated activities of multiple gene products. Think of it as the overall objective, such as “DNA repair,” “cell division,” or the “immune response.” These terms describe the broader program to which a gene contributes.

Next is the Molecular Function ontology, which defines the specific biochemical activities of individual gene products. This is the “job” at the molecular level, like “protein kinase activity” or “DNA binding.” These terms describe actions, such as catalysis or transport, that can be performed by a single protein or a functional complex.

Finally, the Cellular Component ontology specifies the location where a gene product is active. This can be a subcellular structure, like the “nucleus” or “mitochondrion,” or a larger complex such as the “ribosome.” Terms in this ontology pinpoint where a gene’s function is carried out within the cellular landscape.
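To make the three-way split concrete, here is a minimal Python sketch of how one gene product might be described with a term from each ontology. The gene product, the class names, and the describe helper are hypothetical and exist only for illustration. The single-letter aspect codes follow the convention used in GO annotation files, and the GO identifiers should be checked against a current GO release before being relied on.

```python
from dataclasses import dataclass
from enum import Enum


class Aspect(Enum):
    BIOLOGICAL_PROCESS = "P"  # the larger program the product contributes to
    MOLECULAR_FUNCTION = "F"  # the activity it performs at the molecular level
    CELLULAR_COMPONENT = "C"  # the location where it is active


@dataclass(frozen=True)
class GOTerm:
    go_id: str
    name: str
    aspect: Aspect


# Hypothetical gene product described along all three aspects.
gene_x_terms = [
    GOTerm("GO:0006281", "DNA repair", Aspect.BIOLOGICAL_PROCESS),
    GOTerm("GO:0003677", "DNA binding", Aspect.MOLECULAR_FUNCTION),
    GOTerm("GO:0005634", "nucleus", Aspect.CELLULAR_COMPONENT),
]


def describe(terms):
    """Group term names by the ontology (aspect) they belong to."""
    summary = {aspect: [] for aspect in Aspect}
    for term in terms:
        summary[term.aspect].append(term.name)
    return summary


summary = describe(gene_x_terms)
for aspect, names in summary.items():
    print(aspect.name, names)
```

Keeping the aspect on every term means a single gene product can carry any number of terms from each ontology without the three kinds of statement being confused.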

Structure and Relationships

The terms within each ontology are organized into a hierarchical structure, arranging concepts from broad to specific. For example, the broad term “metabolic process” might have a more specific child term like “carbohydrate metabolic process,” which then leads to the very specific process of “glycolysis.” This structure is formally known as a Directed Acyclic Graph (DAG).

The term “acyclic” means you cannot follow a path of connections and end up back where you started. The term “directed” means every connection has a defined direction, linking a more specific child term to a broader parent term. A key feature of the DAG structure, and what distinguishes it from a simple tree, is that a specific “child” term can have multiple “parent” terms. For instance, “DNA ligation” is part of DNA replication, DNA repair, and DNA recombination, so it is linked as a child to all three parent processes.
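The graph described above can be sketched in a few lines of Python. This is a toy model built only from the examples in the text: terms are stored by name, each mapped to its list of parents, and the PARENTS table and both function names are assumptions made for illustration. The real ontology also distinguishes relationship types (such as "is a" and "part of"), which this sketch deliberately flattens into a single kind of edge.

```python
# Toy graph: each term maps to its direct parents (child -> parent edges).
PARENTS = {
    "metabolic process": [],
    "carbohydrate metabolic process": ["metabolic process"],
    "glycolysis": ["carbohydrate metabolic process"],
    "DNA replication": [],
    "DNA repair": [],
    "DNA recombination": [],
    # One child with three parents: allowed in a DAG, impossible in a tree.
    "DNA ligation": ["DNA replication", "DNA repair", "DNA recombination"],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable by following child -> parent edges."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def is_acyclic(parents=PARENTS):
    """A graph is acyclic if no term can reach itself through its parents."""
    return all(term not in ancestors(term, parents) for term in parents)


print(sorted(ancestors("glycolysis")))
print(sorted(ancestors("DNA ligation")))
print(is_acyclic())
```

Walking upward from a term collects all of its broader terms, which is why a gene annotated to a specific term is usually treated as implicitly annotated to that term's ancestors as well.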

The Gene Annotation Process

Gene annotation is the process that links GO terms to specific genes or proteins in a database. An annotation is an assertion that a gene product is associated with a GO term. Many of these connections are curated by scientists from the published literature, while others are assigned computationally. This process is how the knowledgebase is built, connecting the vocabulary to real-world biological entities.

Every annotation is supported by evidence, which is recorded using a specific “evidence code.” These codes allow users to understand the basis for a gene-function claim and assess its strength. The codes can be grouped into several categories:

Direct laboratory experiments, where a function was observed in a hands-on lab setting.
Computational predictions, where a gene’s function is predicted based on its similarity to another, better-characterized gene.
Curator inference, where a database curator uses expert knowledge to make a logical conclusion from existing data.
Author statements, where the annotation rests on a claim made by the authors of a scientific paper.
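As a rough illustration of how evidence codes let users judge an annotation, the Python sketch below attaches a code to each annotation and filters by category. The code-to-category table is a small, partial subset of the real GO evidence codes grouped under the four categories listed above. The gene names, the annotations themselves, and the filter_by_category helper are invented for the example.

```python
from dataclasses import dataclass

# Partial, illustrative subset of GO evidence codes mapped to the
# categories listed above. The real code list is longer.
EVIDENCE_CATEGORY = {
    "IDA": "experimental",   # Inferred from Direct Assay
    "IMP": "experimental",   # Inferred from Mutant Phenotype
    "ISS": "computational",  # Inferred from Sequence or structural Similarity
    "IC": "curator",         # Inferred by Curator
    "TAS": "author",         # Traceable Author Statement
    "NAS": "author",         # Non-traceable Author Statement
}


@dataclass(frozen=True)
class Annotation:
    gene: str      # hypothetical gene name
    term: str      # GO term name
    evidence: str  # evidence code supporting the assertion


# Invented annotations, for illustration only.
annotations = [
    Annotation("geneA", "DNA repair", "IDA"),
    Annotation("geneB", "DNA repair", "ISS"),
    Annotation("geneC", "cell division", "TAS"),
    Annotation("geneD", "DNA binding", "IC"),
]


def filter_by_category(annotations, category):
    """Keep only annotations whose evidence code falls in the given category."""
    return [
        a for a in annotations
        if EVIDENCE_CATEGORY.get(a.evidence) == category
    ]


experimental = filter_by_category(annotations, "experimental")
print([a.gene for a in experimental])
```

A user who wants only directly observed functions could restrict an analysis to the experimental category, at the cost of discarding the predicted annotations.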

Functional Enrichment Analysis

A powerful application of Gene Ontology is making sense of large-scale experimental data. Experiments that measure thousands of genes at once often result in a long list of genes that have changed their activity in response to a certain condition. On its own, this list offers little biological insight.

Functional enrichment analysis uses GO to ask: “Biologically speaking, what does this list of genes have in common?” The statistical method checks whether any GO terms appear in the gene list more frequently than would be expected by chance, given how often those terms occur among all the genes measured in the experiment. This transforms a simple list of genes into meaningful, testable biological hypotheses.

Imagine having a list of people in a city who have all developed a cough. Functional enrichment analysis is like discovering that a statistically significant number of them work at the same office building. This finding would strongly suggest the building is connected to the outbreak. Similarly, if a list of genes affected by a new drug is “enriched” for the GO term “cell cycle,” it provides a strong clue that the drug’s mechanism involves interfering with cell division.
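One common way to carry out this kind of check is a one-sided hypergeometric test, which is equivalent to a one-sided Fisher's exact test. The Python sketch below assumes that approach; the text does not say which test a given tool uses, and the function name, parameter names, and all of the numbers are invented for illustration.

```python
from math import comb


def enrichment_p_value(population, annotated, sample, hits):
    """Probability of seeing at least `hits` annotated genes in a random
    sample of size `sample`, drawn without replacement from `population`
    genes of which `annotated` carry the GO term (hypergeometric tail)."""
    upper = min(annotated, sample)
    total = comb(population, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


# Invented numbers: 1000 genes measured, 50 of them annotated to
# "cell cycle", 40 genes affected by the drug, 10 of those annotated.
p = enrichment_p_value(population=1000, annotated=50, sample=40, hits=10)
expected_hits = 40 * 50 / 1000  # hits expected by chance alone
print(f"expected by chance: {expected_hits}, observed: 10, p-value: {p:.2e}")
```

In this made-up case about two annotated genes would be expected by chance, so observing ten yields a very small p-value. Because an analysis normally tests many GO terms at once, the resulting p-values are usually adjusted for multiple testing before any term is reported as enriched.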