RNA sequencing (RNA-seq) measures gene activity in biological samples. It converts RNA into complementary DNA (cDNA) and sequences millions of fragments. Aligning these reads to a reference genome is a crucial first step. This process pinpoints where each fragment originated, which is necessary for understanding gene expression and identifying active genes.
HISAT2: A Spliced Aligner
HISAT2 is a widely used alignment program that maps next-generation sequencing reads, including RNA-seq data, to a reference genome with high speed, high sensitivity, and a small memory footprint. It is particularly effective for RNA-seq because it is designed to handle spliced alignments.
Eukaryotic genes contain introns, which are removed from the transcript by splicing; consequently, a read that spans two exons has no single contiguous match in the genome and requires specific handling for accurate mapping. HISAT2 addresses this with a hierarchical indexing scheme that combines a global index of the whole genome with many smaller local indexes, each covering a short genomic region. This allows it to efficiently identify both known and novel splice junctions, so that even reads spanning introns can be mapped accurately.
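The global-then-local idea can be illustrated with a minimal Python sketch. This is a toy under stated assumptions, not HISAT2's implementation: plain string search stands in for the real indexes, and the sequences, the function name `align_spliced`, and the `anchor_len` and `window` parameters are all hypothetical. It shows how a long anchor is placed using the whole genome, while the short remainder of the read is searched for only in a small region next to that anchor.

```python
# Toy illustration only: plain string search stands in for HISAT2's real indexes.
UPSTREAM = "CATGGAACTTGCA"          # unrelated region that contains a decoy match
EXON1 = "GATTACAGGCTTAC"
INTRON = "GTAAGTCCCCCCTTTTTCAG"     # hypothetical intron, starts with GT, ends with AG
EXON2 = "CATGGAACCTGAGT"
GENOME = UPSTREAM + EXON1 + INTRON + EXON2


def align_spliced(read, genome, anchor_len=10, window=40):
    """Return a list of (genome_start, length) segments, or None if unaligned."""
    # Step 1: place a long anchor anywhere in the genome ("global" lookup).
    anchor = read[:anchor_len]
    start = genome.find(anchor)
    if start == -1:
        return None

    # Step 2: extend the anchor base by base until the first mismatch.
    matched = len(anchor)
    while (matched < len(read) and start + matched < len(genome)
           and genome[start + matched] == read[matched]):
        matched += 1
    if matched == len(read):
        return [(start, matched)]   # contiguous match, no splice needed

    # Step 3: look for the remainder only in a small window downstream
    # of the anchor ("local" lookup).
    rest = read[matched:]
    local_start = start + matched
    hit = genome[local_start:local_start + window].find(rest)
    if hit == -1:
        return None
    return [(start, matched), (local_start + hit, len(rest))]


read = EXON1[-10:] + EXON2[:6]      # read crossing the exon1/exon2 junction
segments = align_spliced(read, GENOME)
print(segments)                     # [(17, 10), (47, 6)]
print(GENOME.find(read[10:]))       # 0: a genome-wide search hits the decoy first
```

The 6-base remainder is too short to be placed reliably across the whole toy genome, where the first hit is the decoy at position 0, but it is unambiguous inside the window next to the anchor. That is the motivation for pairing a global index with small local ones.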
STAR: The Ultrafast Aligner
STAR, or Spliced Transcripts Alignment to a Reference, stands out as another prominent RNA-seq aligner, recognized for its exceptional processing speed. Its core strategy involves building a suffix array index of the genome, which facilitates a rapid “seed-and-extend” approach for mapping reads.
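STAR’s algorithm operates in two main steps: seed searching, followed by clustering, stitching, and scoring. In the seed search, it looks for the longest prefix of a read that exactly matches one or more locations on the reference genome, then repeats the same search on the portion of the read that is still unmapped. This sequential search of the unmapped portions contributes to STAR’s efficiency and enables it to detect splice junctions, non-canonical splices, and even chimeric transcripts. The resulting seeds are then clustered, stitched together, and scored to produce the final alignment.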
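The seed search can be sketched in a few lines of Python. This is a toy under stated assumptions rather than STAR's code: a linear scan over genome positions stands in for the suffix array lookup, only exact matches are considered, and the sequences and function names (`longest_exact_prefix`, `seed_search`) are hypothetical.

```python
# Toy illustration only: a linear scan stands in for STAR's suffix array search.
EXON1 = "GATTACAGGCTTAC"
INTRON = "GTAAGTCCCCCCTTTTTCAG"     # hypothetical intron, starts with GT, ends with AG
EXON2 = "CATGGAACCTGAGT"
GENOME = EXON1 + INTRON + EXON2


def longest_exact_prefix(query, genome):
    """Return (length, positions) for the longest prefix of query found in genome."""
    positions = list(range(len(genome)))
    length = 0
    while length < len(query):
        # Keep only the start positions that still match at the next base.
        still_matching = [
            p for p in positions
            if p + length < len(genome) and genome[p + length] == query[length]
        ]
        if not still_matching:
            break
        positions = still_matching
        length += 1
    return (length, positions) if length > 0 else (0, [])


def seed_search(read, genome):
    """Repeatedly search the still-unmapped part of the read for its longest match."""
    seeds = []
    offset = 0
    while offset < len(read):
        length, positions = longest_exact_prefix(read[offset:], genome)
        if length == 0:
            offset += 1             # toy handling: skip a base with no match at all
            continue
        seeds.append((offset, length, positions))
        offset += length
    return seeds


read = EXON1[-8:] + EXON2[:8]       # read crossing the exon1/exon2 junction
seeds = seed_search(read, GENOME)
print(seeds)                        # [(0, 8, [6]), (8, 8, [34])]
```

Each seed is a tuple of (offset in the read, match length, genome positions). The first seed ends at genome position 14 and the second begins at 34, so the 20-base gap between them corresponds to the toy intron. Joining such seeds into one spliced alignment is the job of the clustering, stitching, and scoring step, which this sketch does not attempt.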
Comparing Performance and Features
STAR is generally credited with very high mapping speed on large datasets. For example, STAR can align approximately 550 million 2 × 76 base pair paired-end reads to the human genome per hour on a 12-core server. This speed makes STAR a preferred choice for high-throughput sequencing projects.
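To put that throughput figure in context, a rough back-of-the-envelope estimate can be written as follows. The project size (30 samples of 40 million reads each) is a hypothetical example, and the calculation assumes that throughput scales linearly and that the hardware is comparable to the 12-core server quoted above. Neither assumption is guaranteed in practice.

```python
STAR_READS_PER_HOUR = 550_000_000   # throughput figure quoted above (12-core server)


def estimated_alignment_hours(total_reads, reads_per_hour=STAR_READS_PER_HOUR):
    """Naive estimate: total reads divided by a constant throughput."""
    return total_reads / reads_per_hour


samples = 30                        # hypothetical project size
reads_per_sample = 40_000_000       # hypothetical sequencing depth per sample
hours = estimated_alignment_hours(samples * reads_per_sample)
print(f"{hours:.2f} hours")         # 2.18 hours
```

Real run times also depend on index loading, disk speed, read length, and output sorting, so this number is best treated as a lower bound for planning.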
In terms of memory usage, HISAT2 typically requires less RAM than STAR for both genome indexing and alignment. HISAT2 needs around 4.3 GB for the human genome index, whereas STAR can demand significantly more, sometimes up to 28 GB, for its suffix array index. Both aligners are considered highly accurate in mapping reads and detecting splice junctions, though studies suggest subtle differences in their rates of uniquely mapped reads and in how they handle mismatches. STAR often reports a higher percentage of uniquely mapped reads and tolerates more soft-clipped and mismatched bases, which leads to higher overall mapping rates.
Guidance for Choosing an Aligner
The decision between HISAT2 and STAR depends largely on the specific needs of a research project and available computational resources. If memory is a significant constraint, HISAT2 is often the more suitable choice due to its lower RAM requirements. It is also a strong contender when high sensitivity for detecting subtle or novel splice junctions is prioritized.
Conversely, STAR is the preferred aligner when processing speed is the highest priority, especially when dealing with very large datasets. Because that speed comes with a large memory footprint, it is best suited to projects with abundant computational resources. Ultimately, the “best” choice is not universal; it is a practical decision informed by the unique characteristics of the RNA-seq data and the hardware at hand.
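The guidance above can be condensed into a small rule-of-thumb helper. This is only a sketch of the reasoning in this article, not an official recommendation from either tool: the function name `suggest_aligner` and the 25% memory headroom factor are hypothetical, and the index sizes are the approximate human genome figures quoted earlier.

```python
HISAT2_INDEX_GB = 4.3    # approximate human genome index size quoted above
STAR_INDEX_GB = 28.0     # upper figure quoted above for STAR's index


def suggest_aligner(available_ram_gb, speed_is_priority, headroom=1.25):
    """Rule-of-thumb choice based on available RAM and project priority.

    headroom is a hypothetical safety factor applied to the index size.
    """
    if available_ram_gb < HISAT2_INDEX_GB * headroom:
        return "neither"             # not enough memory for either human index
    if speed_is_priority and available_ram_gb >= STAR_INDEX_GB * headroom:
        return "STAR"
    return "HISAT2"


print(suggest_aligner(16, speed_is_priority=True))    # HISAT2
print(suggest_aligner(64, speed_is_priority=True))    # STAR
print(suggest_aligner(64, speed_is_priority=False))   # HISAT2
```

A real decision would also weigh factors this helper ignores, such as read length, the organism's genome size, and which downstream tools expect which aligner's output.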