What Is a Canonical Splice Site in Gene Expression?

Genes contain the instructions for building and maintaining an organism by encoding the information needed to produce proteins. For these instructions to be executed, they must be accurately transcribed and processed. Within this process, specific signals in the genetic code act as guideposts. Canonical splice sites are highly conserved sequences that help ensure the genetic blueprint is read properly in eukaryotes, which are organisms whose cells have a nucleus. They are part of the system that converts genetic information into functional proteins.

The Role of RNA Splicing

In eukaryotic organisms, genetic information from DNA is transcribed into a preliminary molecule called pre-messenger RNA (pre-mRNA). This initial transcript contains both coding segments, known as exons, and non-coding segments called introns. The exons contain the instructions for building a protein, while the introns must be precisely removed in a process called RNA splicing.

During splicing, introns are cut out and exons are joined to create a mature messenger RNA (mRNA) molecule, which serves as the template for protein synthesis. If this process is inaccurate, the resulting protein will likely be non-functional. The accuracy of splicing depends on recognizing the exact boundaries between exons and introns, which are marked by specific nucleotide sequences known as splice sites.
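As a rough illustration of this cut-and-join step, the sketch below treats a pre-mRNA as a plain string and removes introns supplied as coordinate pairs. The function name `splice`, the coordinate convention, and the toy sequence are all assumptions made for this example; a real cell locates the boundaries through splice-site signals rather than being handed coordinates.

```python
from typing import List, Tuple


def splice(pre_mrna: str, introns: List[Tuple[int, int]]) -> str:
    """Remove introns and join the remaining exons.

    Each intron is a (start, end) pair: 0-based, end-exclusive
    (an assumed convention for this sketch).
    """
    exons = []
    position = 0
    for start, end in sorted(introns):
        exons.append(pre_mrna[position:start])  # exon before this intron
        position = end                          # skip over the intron
    exons.append(pre_mrna[position:])           # final exon
    return "".join(exons)


# Toy transcript (invented sequence): exon 1 + intron + exon 2
exon1, intron, exon2 = "AUGGCC", "GUAAGUCUAACUUUUCCUUCAG", "GGAUAA"
pre_mrna = exon1 + intron + exon2
mature = splice(pre_mrna, [(len(exon1), len(exon1) + len(intron))])
print(mature)  # AUGGCCGGAUAA
```

If either coordinate were off by even one nucleotide, the joined exons would differ from the intended message, which is why the boundaries have to be marked so precisely.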

Defining Canonical Splice Sites

The term “canonical” in biology refers to a standard, highly conserved sequence that is most commonly recognized by cellular machinery. Canonical splice sites are the predominant signals used to mark intron boundaries in the genes of most eukaryotes. Nearly 99% of all introns are flanked by these specific sequences, a pattern often referred to as the “GU-AG rule” (written GT-AG when the sequence is given as DNA).

Two distinct sites define the intron. The 5′ splice site, or donor site, is at the beginning of the intron, and the intron almost always starts with the dinucleotide GU: guanine (G) followed by uracil (U). At the opposite end is the 3′ splice site, or acceptor site, where the intron ends with the dinucleotide AG: adenine (A) followed by guanine (G).

While the GU and AG dinucleotides are the primary recognition points, they are supported by two adjacent sequence elements. One is the branch point, an adenine nucleotide located within the intron upstream of the 3′ splice site. The other is the polypyrimidine tract, a stretch rich in pyrimidine bases (cytosine and uracil) situated between the branch point and the 3′ splice site. Together, these elements form a consensus sequence that ensures the splicing machinery is recruited correctly.
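The GU-AG rule and the polypyrimidine tract can be expressed as simple string checks. The following is a minimal sketch, not a real splice-site predictor: the function names, the 12-base window, and the example sequences are assumptions chosen for illustration.

```python
def to_rna(seq: str) -> str:
    """Uppercase the sequence and write DNA-style T as RNA-style U."""
    return seq.upper().replace("T", "U")


def is_canonical_intron(intron: str) -> bool:
    """True if the intron starts with GU and ends with AG (the GU-AG rule)."""
    rna = to_rna(intron)
    return len(rna) >= 4 and rna.startswith("GU") and rna.endswith("AG")


def pyrimidine_fraction(intron: str, window: int = 12) -> float:
    """Fraction of C/U among the `window` bases just upstream of the final AG.

    The window size is an arbitrary choice for this sketch.
    """
    rna = to_rna(intron)
    tract = rna[-(window + 2):-2]  # bases before the terminal dinucleotide
    if not tract:
        return 0.0
    return sum(base in "CU" for base in tract) / len(tract)


intron = "GUAAGUCUAACUUUUCCUUCAG"               # invented toy intron
print(is_canonical_intron(intron))              # True
print(is_canonical_intron("AUAUCCUUUUAC"))      # False: not GU...AG
print(pyrimidine_fraction(intron) > 0.8)        # True: 10 of 12 bases are C or U
```

Real splice-site recognition weighs many more positions around each boundary, so a check like this only captures the two dinucleotides and a crude measure of the tract.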

The Splicing Mechanism at Canonical Sites

The process of removing introns is carried out by a large molecular machine called the spliceosome. The spliceosome is composed of small nuclear RNAs (snRNAs) and proteins, which together form complexes known as small nuclear ribonucleoproteins, or snRNPs. These snRNPs are the components that recognize and bind to the canonical splice sites on the pre-mRNA.

The splicing reaction proceeds through a precise, two-step chemical process. In the first step, one snRNP (U1) recognizes the 5′ splice site (GU), while another (U2) binds to the branch point adenine within the intron. The 2′ hydroxyl group of the branch point adenine then attacks the 5′ splice site, cutting the pre-mRNA at the boundary between the first exon and the intron. This action forms a loop-like structure called a lariat, in which the 5′ end of the intron is linked to the branch point adenine.

This initial reaction frees the 3′ end of the first exon. In the second step, this newly available end of the exon attacks the 3′ splice site (AG) at the end of the intron. This attack joins the two exons together, creating a continuous coding sequence. At the same time, it releases the intron lariat, which is subsequently degraded by the cell.
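The two steps can be traced on a toy sequence. The sketch below is a deliberately simplified model under stated assumptions: the function `two_step_splice`, its three coordinate parameters, and the example sequence are hypothetical, and the chemistry is reduced to string slicing.

```python
from typing import NamedTuple


class SplicingProducts(NamedTuple):
    mrna: str         # joined exons
    lariat_loop: str  # intron from its 5' end through the branch point A
    lariat_tail: str  # remainder of the intron, ending in AG


def two_step_splice(pre_mrna: str, donor: int, branch: int,
                    acceptor_end: int) -> SplicingProducts:
    """Toy model of the two splicing steps.

    donor:        index of the G in the intron's leading GU
    branch:       index of the branch point A
    acceptor_end: index just past the intron's trailing AG
    (All three are hypothetical, 0-based coordinates for this sketch.)
    """
    intron = pre_mrna[donor:acceptor_end]
    if not (donor < branch < acceptor_end - 2):
        raise ValueError("branch point must lie inside the intron")
    if not (intron.startswith("GU") and intron.endswith("AG")):
        raise ValueError("intron does not follow the GU-AG rule")
    if pre_mrna[branch] != "A":
        raise ValueError("branch point must be an adenine")

    # Step 1: the branch point A attacks the 5' splice site. Exon 1 is
    # released, and the intron's 5' end is joined to the branch point,
    # forming the lariat (still attached to exon 2 at this stage).
    exon1 = pre_mrna[:donor]
    lariat_loop = pre_mrna[donor:branch + 1]
    lariat_tail = pre_mrna[branch + 1:acceptor_end]

    # Step 2: the free 3' end of exon 1 attacks the 3' splice site,
    # joining the two exons and releasing the lariat.
    exon2 = pre_mrna[acceptor_end:]
    return SplicingProducts(exon1 + exon2, lariat_loop, lariat_tail)


pre_mrna = "AUGGCC" + "GUAAGUCUAACUUUUCCUUCAG" + "GGAUAA"  # invented sequence
products = two_step_splice(pre_mrna, donor=6, branch=15, acceptor_end=28)
print(products.mrna)         # AUGGCCGGAUAA
print(products.lariat_loop)  # GUAAGUCUAA
print(products.lariat_tail)  # CUUUUCCUUCAG
```

Separating the lariat into a loop and a tail mirrors its shape: the loop is the part of the intron closed by the bond to the branch point, and the tail is the stretch that trails from the branch point to the AG.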

Canonical Splice Sites and Human Disease

Disruptions to RNA splicing can have severe health consequences: mutations that alter a canonical splice site cause many genetic disorders. Such mutations prevent the spliceosome from accurately identifying intron-exon boundaries, leading to errors in the final mRNA molecule. The errors take several forms, such as exon skipping, where an entire exon is left out of the mRNA, or intron retention, where an intron is incorrectly kept. In some cases, a mutation leads the spliceosome to use a cryptic splice site, a nearby sequence that resembles a splice site but is not normally used.

Any of these errors can shift the reading frame of the genetic code whenever the number of nucleotides gained or lost is not a multiple of three, often introducing a premature stop codon that halts protein production early. The resulting protein may be truncated, unstable, or non-functional. Many inherited diseases are linked to splicing defects of this kind, including certain forms of cystic fibrosis, beta-thalassemia, and spinal muscular atrophy.
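A small example shows how exon skipping can produce a premature stop codon. This is a sketch on invented sequences: the three exons and the helper `first_stop_codon` are assumptions for illustration and do not correspond to any real gene. The middle exon is five nucleotides long, so leaving it out shifts the reading frame.

```python
from typing import Optional

STOP_CODONS = {"UAA", "UAG", "UGA"}


def first_stop_codon(mrna: str) -> Optional[int]:
    """Index of the first in-frame stop codon, reading triplets from position 0."""
    for i in range(0, len(mrna) - 2, 3):
        if mrna[i:i + 3] in STOP_CODONS:
            return i
    return None


# Invented exons; exon 2 has 5 nucleotides (not a multiple of three).
exon1, exon2, exon3 = "AUGGCUG", "CAGAU", "GCUGAUCCGUUAUAA"

normal = exon1 + exon2 + exon3   # AUG GCU GCA GAU GCU GAU CCG UUA UAA
skipped = exon1 + exon3          # AUG GCU GGC UGA ...

print(first_stop_codon(normal))   # 24 (the intended stop at the end)
print(first_stop_codon(skipped))  # 9  (premature stop after three codons)
```

In the normal message the ribosome would read eight codons before stopping; in the exon-skipped message it would stop after three, giving a truncated product.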