Long-read sequencing technologies from Pacific Biosciences and Oxford Nanopore Technologies have transformed the study of transcriptomes. These methods sequence entire transcript molecules, or isoforms, providing a clearer picture of gene expression than older, short-read technologies. However, the data generated by these platforms contain errors and artifacts that can complicate analysis.
SQANTI3 is a bioinformatics pipeline designed for the quality control of long-read transcript data. It assesses the quality of each sequenced isoform by comparing it against a reference genome and its corresponding annotation. This process allows researchers to characterize transcripts, identify potential technical artifacts, and filter out low-quality data to produce a reliable representation of the transcriptome.
The SQANTI3 Quality Control Pipeline
The SQANTI3 workflow systematically evaluates each transcript isoform, beginning with the alignment of full-length transcript sequences to a reference genome. This initial mapping step determines the genomic coordinates of each transcript, including the locations of its exons and introns. This alignment places the experimental data into its genomic context for all subsequent checks.
Once mapped, SQANTI3 compares the exon-intron structures of the isoforms to a provided reference gene annotation. The software examines each splice junction (the boundary between an exon and an intron) within every transcript. This comparison allows the tool to determine whether a sequenced isoform matches a known transcript, represents a novel variant of a known gene, or originates from an unannotated gene.
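The junction comparison described above can be illustrated with a minimal sketch. It is not SQANTI3's own code: the function names and the coordinate convention (1-based, inclusive, as in GTF) are assumptions made for this example.

```python
from typing import Iterable, List, Set, Tuple

Exon = Tuple[int, int]      # (start, end), 1-based inclusive
Junction = Tuple[int, int]  # (intron_start, intron_end), 1-based inclusive


def junctions_from_exons(exons: Iterable[Exon]) -> List[Junction]:
    """Return the introns implied by a transcript's exons, in genomic order."""
    ordered = sorted(exons)
    return [
        (left_end + 1, right_start - 1)
        for (_, left_end), (right_start, _) in zip(ordered, ordered[1:])
    ]


def flag_known_junctions(
    isoform_exons: Iterable[Exon], reference_junctions: Set[Junction]
) -> List[Tuple[Junction, bool]]:
    """Pair each isoform junction with True if the reference also contains it."""
    return [
        (junction, junction in reference_junctions)
        for junction in junctions_from_exons(isoform_exons)
    ]


# Hypothetical isoform with three exons and a reference that knows one intron.
isoform = [(100, 200), (300, 400), (500, 600)]
reference = {(201, 299)}
flags = flag_known_junctions(isoform, reference)
```

Here the first junction is found in the reference and the second is not, which is the kind of evidence that separates known from novel structures.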
Based on this comparison, the pipeline calculates an array of quality metrics for each isoform. These metrics describe characteristics like the integrity of splice junctions, the proximity of transcript ends to known genomic features, and the presence of technical artifacts. This information is synthesized to assign each isoform to a structural category, which guides the filtering process.
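As one example of such a metric, the proximity of a transcript end to annotated features can be expressed as a signed distance to the nearest reference position. The sketch below is an illustration only; the function name and the sign convention (isoform position minus reference position, ignoring strand) are assumptions and may differ from what SQANTI3 reports.

```python
from typing import Sequence


def nearest_signed_distance(position: int, reference_positions: Sequence[int]) -> int:
    """Signed distance from `position` to the closest reference position.

    Positive values mean the isoform end lies at a higher coordinate than
    the nearest reference end. Ties go to the first reference listed.
    """
    if not reference_positions:
        raise ValueError("at least one reference position is required")
    nearest = min(reference_positions, key=lambda ref: abs(position - ref))
    return position - nearest


# Hypothetical annotated start sites for one gene.
annotated_starts = [1000, 2000]
distance = nearest_signed_distance(1050, annotated_starts)
```

A large absolute distance at either end is one hint that the sequenced molecule may be a fragment of a longer transcript.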
Required Inputs for a SQANTI3 Analysis
To perform a quality control analysis, SQANTI3 requires three primary input files:
- A file containing the transcript isoforms to be analyzed, typically in GTF format. FASTA or FASTQ sequences can also be supplied, in which case they are first mapped to the genome.
- The reference genome sequence in FASTA format, which is used as a scaffold to align the transcript isoforms.
- A reference annotation file in GTF format, which contains the known gene and transcript models for the organism.
For a more robust analysis, optional files can be included. Examples are short-read sequencing data to validate splice junctions and transcription start site data to support the evaluation of transcript 5' ends.
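To make the annotation-style inputs more concrete, the following minimal sketch collects exon coordinates per transcript from GTF-formatted lines. It assumes standard nine-column, tab-separated GTF with a `transcript_id "..."` attribute; the function name is hypothetical and this is not how SQANTI3 itself parses its inputs.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

_TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def exons_by_transcript(gtf_lines: Iterable[str]) -> Dict[str, List[Tuple[int, int]]]:
    """Group exon (start, end) pairs by transcript_id, sorted by coordinate."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon features carry the structure needed here
        match = _TRANSCRIPT_ID.search(fields[8])
        if match is None:
            continue
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))
    return {tid: sorted(coords) for tid, coords in exons.items()}


# Hypothetical records; in practice the lines would come from an open file.
sample = [
    "# example annotation\n",
    'chr1\ttest\ttranscript\t100\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\ttest\texon\t300\t400\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
    'chr1\ttest\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";\n',
]
parsed = exons_by_transcript(sample)
```

The same function can be applied to an isoform file and to the reference annotation, giving two comparable sets of transcript structures.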
Isoform Classification and Filtering Metrics
A central feature of SQANTI3 is its classification of each transcript into a structural category based on its comparison to the reference annotation. The “Full-Splice Match” (FSM) category is for isoforms where all splice junctions perfectly match a known reference transcript. These are known isoforms successfully captured by the sequencing.
The “Incomplete-Splice Match” (ISM) category describes isoforms that match a consecutive subset of a reference transcript's splice junctions, typically because exons at the 5' or 3' end are missing. In contrast, “Novel In Catalog” (NIC) isoforms use known splice sites to create new combinations of exons. The “Novel Not in Catalog” (NNC) category is for isoforms of known genes that use at least one novel splice site, indicating a potentially unannotated splicing variant.
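The four categories above can be approximated with a short decision procedure. This is a deliberately simplified sketch, not SQANTI3's classifier: it assumes multi-exon isoforms of a single gene on one strand, represents each transcript as an ordered list of (intron_start, intron_end) junctions, and uses hypothetical function names.

```python
from typing import Dict, List, Tuple

Junction = Tuple[int, int]


def _is_contiguous_run(short: List[Junction], long: List[Junction]) -> bool:
    """True if `short` appears as a consecutive run inside `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def classify(isoform: List[Junction], references: Dict[str, List[Junction]]) -> str:
    """Return FSM, ISM, NIC, or NNC for a multi-exon isoform (simplified)."""
    if not isoform:
        raise ValueError("this sketch only handles multi-exon isoforms")
    if any(isoform == ref for ref in references.values()):
        return "FSM"  # every junction matches one reference transcript
    if any(_is_contiguous_run(isoform, ref) for ref in references.values()):
        return "ISM"  # consecutive subset of a reference's junctions
    known_left = {left for ref in references.values() for left, _ in ref}
    known_right = {right for ref in references.values() for _, right in ref}
    if all(l in known_left and r in known_right for l, r in isoform):
        return "NIC"  # only known splice sites, in a new combination
    return "NNC"      # at least one splice site absent from the reference


# Hypothetical reference transcripts of one gene.
refs = {
    "R1": [(201, 299), (401, 499), (601, 699)],
    "R2": [(201, 299), (401, 699)],
}
labels = [
    classify([(201, 299), (401, 499), (601, 699)], refs),
    classify([(401, 499), (601, 699)], refs),
    classify([(201, 499), (601, 699)], refs),
    classify([(201, 299), (411, 499)], refs),
]
```

The order of the checks matters: a perfect match is tested first, then a partial match, and only then are individual splice sites examined.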
SQANTI3 also provides quality metrics to help identify artifacts. It evaluates splice junctions, flagging them as “canonical” if they use the standard donor (GT) and acceptor (AG) sites, or one of the less common GC-AG and AT-AC pairs, and as “non-canonical” otherwise. Non-canonical junctions are more likely to be artifacts, though some can be biologically real. The software also measures the distance of a transcript’s start and end sites to those of reference transcripts, which helps identify fragmented molecules.
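The canonical check can be sketched as a lookup of the two bases at each end of an intron. The example below is an illustration under stated assumptions: intron coordinates are 1-based and inclusive, the chromosome is an in-memory string, the function names are hypothetical, and the default set of accepted pairs contains only GT-AG (other pairs can be passed in explicitly).

```python
from typing import FrozenSet, Tuple

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.upper().translate(_COMPLEMENT)[::-1]


def splice_dinucleotides(
    chrom_seq: str, intron_start: int, intron_end: int, strand: str
) -> Tuple[str, str]:
    """Return (donor, acceptor) dinucleotides in transcript orientation."""
    left = chrom_seq[intron_start - 1:intron_start + 1].upper()
    right = chrom_seq[intron_end - 2:intron_end].upper()
    if strand == "+":
        return left, right
    # On the minus strand the donor sits at the higher genomic coordinate.
    return reverse_complement(right), reverse_complement(left)


def is_canonical(
    chrom_seq: str,
    intron_start: int,
    intron_end: int,
    strand: str,
    canonical_pairs: FrozenSet[Tuple[str, str]] = frozenset({("GT", "AG")}),
) -> bool:
    pair = splice_dinucleotides(chrom_seq, intron_start, intron_end, strand)
    return pair in canonical_pairs


# Hypothetical 18-base chromosomes with an intron at positions 6..13.
plus_chrom = "AAAAA" + "GT" + "CCCC" + "AG" + "TTTTT"
minus_chrom = "AAAAA" + "CT" + "CCCC" + "AC" + "TTTTT"
plus_ok = is_canonical(plus_chrom, 6, 13, "+")
minus_ok = is_canonical(minus_chrom, 6, 13, "-")
```

Handling the strand explicitly is the main design point: the same intron reads GT...AG on its own strand but CT...AC when viewed from the opposite one.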
An isoform with junctions inside a “Red Zone” is flagged for closer inspection. This term refers to regions around splice junctions where technical issues, such as template switching, can create artificial exon-intron structures. By combining these metrics, researchers can distinguish between high-confidence novel isoforms and those resulting from technical errors.
Advanced Analysis and Downstream Applications
After the initial quality control, SQANTI3 offers tools to refine the dataset for biological investigation. The `sqanti3_filter.py` script allows researchers to apply rules based on the output metrics to remove low-quality transcripts. For instance, isoforms with non-canonical splice junctions or other potential artifacts can be programmatically excluded.
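The idea of rule-based filtering can be shown with a minimal sketch that operates on rows of a classification table. It does not reproduce `sqanti3_filter.py` or its options. The column names (`isoform`, `all_canonical`, `RTS_stage`) and their values are assumptions modeled on a SQANTI3-style table, and the two rules are chosen only for illustration.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]


def keep_isoform(row: Row) -> bool:
    """Apply two illustrative rules to one row of a classification table."""
    if row["all_canonical"] == "non_canonical":
        return False  # at least one junction lacks a canonical splice pair
    if row["RTS_stage"] == "TRUE":
        return False  # flagged as a possible template-switching artifact
    return True


def filter_isoforms(rows: Iterable[Row]) -> List[str]:
    """Return the identifiers of isoforms that pass every rule."""
    return [row["isoform"] for row in rows if keep_isoform(row)]


# Hypothetical rows, with string values as csv.DictReader would produce them.
rows = [
    {"isoform": "iso1", "all_canonical": "canonical", "RTS_stage": "FALSE"},
    {"isoform": "iso2", "all_canonical": "non_canonical", "RTS_stage": "FALSE"},
    {"isoform": "iso3", "all_canonical": "canonical", "RTS_stage": "TRUE"},
]
kept = filter_isoforms(rows)
```

Keeping each rule as a separate, named condition makes it easy to see why a given isoform was removed and to adjust the criteria per experiment.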
The `sqanti3_rescue.py` script helps recover known transcripts that may have been filtered out due to low-quality features in the experimental data. It uses evidence from the long-read data to support the expression of a reference transcript, even if the sequenced version had imperfections. This prevents the loss of known expressed genes from the final dataset.
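One simple way to think about rescue is to look for reference transcripts that had supporting isoforms before filtering but none afterwards. The sketch below illustrates only that idea and is not the logic of `sqanti3_rescue.py`. The column names (`isoform`, `associated_transcript`) and the placeholder values treated as "no reference" are assumptions.

```python
from typing import Dict, Iterable, List

Row = Dict[str, str]

# Assumed placeholders for isoforms without an associated reference transcript.
_NO_REFERENCE = {"", "NA", "novel"}


def rescue_candidates(rows: Iterable[Row], kept_isoforms: Iterable[str]) -> List[str]:
    """Reference transcripts whose supporting isoforms were all filtered out."""
    kept = set(kept_isoforms)
    supported, surviving = set(), set()
    for row in rows:
        reference = row["associated_transcript"]
        if reference in _NO_REFERENCE:
            continue
        supported.add(reference)
        if row["isoform"] in kept:
            surviving.add(reference)
    return sorted(supported - surviving)


# Hypothetical table: REF_B loses its only supporting isoform during filtering.
table = [
    {"isoform": "iso1", "associated_transcript": "REF_A"},
    {"isoform": "iso2", "associated_transcript": "REF_A"},
    {"isoform": "iso3", "associated_transcript": "REF_B"},
    {"isoform": "iso4", "associated_transcript": "novel"},
]
candidates = rescue_candidates(table, kept_isoforms=["iso1", "iso4"])
```

In this example REF_A still has a surviving isoform, while REF_B would be a candidate for re-inclusion in the final transcriptome.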
The curated set of high-quality isoforms serves as the foundation for many downstream analyses. This clean data is suitable for differential transcript expression studies and detailed investigations of alternative splicing. The filtered transcripts can also be used to update and improve the official gene annotation for a given species.