Dorado Basecaller: Technology, Applications, and Use

Dorado is a basecalling software developed by Oxford Nanopore Technologies (ONT) for its sequencing platforms. Basecalling is the process of translating raw electrical signal data from a nanopore sequencer into a sequence of nucleotide bases—A, C, G, and T for DNA, or U for RNA. As a strand of DNA or RNA passes through a microscopic pore, or nanopore, it creates characteristic disruptions in an electrical current. The sequencer measures these changes, and Dorado interprets this complex stream of information, turning raw output into usable data for biological analysis.

The Technology Behind Dorado

Dorado takes raw electrical signal data, often stored in POD5 files, and produces structured sequence data as output. It processes this signal information to generate FASTQ files, a text-based format containing the nucleotide base string and a corresponding quality score for each base. The quality score indicates the basecaller’s confidence in its accuracy.

At its core, Dorado uses deep neural networks trained on large datasets of known DNA or RNA sequences and their corresponding electrical signals. This training allows the AI to learn the patterns that correspond to specific bases, and the process involves scaling the signal and decoding the model’s output to reconstruct the final sequence. The software is built with C++ and the LibTorch library, the C++ backend for PyTorch, which allows for high-performance processing and makes the code accessible for community contributions.

Distinctive Capabilities of Dorado

Dorado is distinguished by high accuracy and processing speed for handling large data volumes. Its performance is measured using Phred quality scores (Q-scores), where a higher score indicates a lower probability of error. The software includes highly accurate models, such as those designated ‘sup’ (super accurate), and offers improvements over previous ONT basecallers like Guppy.

A primary feature is its support for hardware acceleration using NVIDIA and Apple Silicon GPUs. This allows for much faster processing than with a CPU alone, enabling researchers to keep pace with data from high-throughput instruments like the PromethION; for example, a computer with multiple GPUs can process hundreds of gigabases of data in real-time.

Dorado also offers a suite of basecalling models for specific experimental needs. Researchers can select models optimized for different nanopore chemistries, for DNA versus direct RNA sequencing, or for detecting epigenetic modifications like methylated cytosine (5mC). The software can also perform other functions, including adapter trimming and barcode demultiplexing.

Practical Applications in Research

The data produced by Dorado enables research across various biological disciplines. In genomics, accurate basecalling is used to identify genetic variants, from single nucleotide polymorphisms (SNPs) to large structural changes in a genome. This is applied in studies of genetic diseases, cancer research, and population genetics to pinpoint causative mutations or understand genetic diversity.

In the field of transcriptomics, Dorado is used to analyze RNA sequences, providing insights into gene expression. By sequencing RNA directly, researchers can quantify which genes are active, identify different versions, or isoforms, of a gene, and measure poly-A tail length to understand gene regulation.

Metagenomic studies benefit from Dorado’s ability to sequence long DNA fragments, which helps in identifying and assembling the genomes of microorganisms from complex environmental or clinical samples. This is used to characterize microbial communities in the human gut or soil. Its capacity for rapid pathogen identification is also valuable in clinical settings for diagnosing infections or tracking disease outbreaks.

Getting Started with Dorado

Dorado is an open-source, command-line application available from the Oxford Nanopore GitHub repository for Linux, macOS, and Windows. Common installation methods include using package managers like Conda, running it within a Docker container for reproducibility, or downloading pre-compiled binaries. This flexibility allows for deployment on a personal laptop, a dedicated server, or a high-performance computing cluster.

Running Dorado involves providing it with the raw signal data and specifying a basecalling model appropriate for the experiment. A basic command points the software to the input directory of raw files and an output directory for the resulting FASTQ files, along with the name of the selected model.

While a simple command can initiate basecalling, the software offers many advanced options for fine-tuning performance. For detailed instructions, guides, and the latest model recommendations, users are directed to the official documentation provided by Oxford Nanopore.