What Is De Novo Sequencing and How Does It Work?

De novo sequencing determines an organism’s complete genetic code without relying on pre-existing information. The term “de novo” means “from scratch” in Latin. This method reconstructs the entire DNA sequence solely from raw genetic data.

It is valuable for exploring the genetic makeup of unstudied species or understanding genomes with incomplete or divergent information. It provides a foundational blueprint, allowing researchers to discover novel genes, genetic variations, and structural arrangements within an organism’s DNA.

The De Novo Sequencing Process

The de novo sequencing process begins by preparing a biological sample, extracting and purifying DNA. The long DNA strands are then broken into millions of smaller fragments, as current sequencing machines cannot read an entire chromosome at once.

After fragmentation, specialized machines read the sequence of base pairs (A, T, C, G) for each fragment. This generates a vast dataset of short DNA sequences, called “reads,” which vary in length depending on the technology.

The most intricate phase is assembling these reads. Computer algorithms analyze the dataset to identify overlapping sequences. By finding overlaps, algorithms piece together fragments, initially building longer contiguous sequences called “contigs.” These are then arranged into “scaffolds,” which are ordered contigs with known gaps. Repetitive DNA regions, where identical sequences occur multiple times, pose a challenge during assembly, making it difficult to determine their correct order and copy number.

Distinguishing from Reference-Based Sequencing

De novo sequencing differs from reference-based sequencing in its approach to prior genetic knowledge. De novo aims to build a genome sequence from scratch, without a pre-existing genetic map.

In contrast, reference-based sequencing, or resequencing, relies on an established reference genome. Reads from a new sample are compared and aligned against this known sequence.

De novo sequencing is used for novel organisms with no existing genetic information, or for genomes with large structural variations. It can also identify new genes or genomic features. Reference-based sequencing is used when a reference genome is available, primarily to detect smaller genetic differences like SNPs or small insertions and deletions within a species.

Key Applications in Science and Medicine

De novo sequencing is a valuable tool across scientific and medical fields. It allows scientists to uncover the complete genetic blueprint of novel organisms, including newly discovered species of bacteria, fungi, plants, or animals. This foundational information helps understand their biology and evolutionary history.

In agriculture, de novo sequencing assembles crop genomes like wheat or rice. Characterizing these genomes helps pinpoint genes for desirable traits such as improved yield or disease resistance, guiding breeding programs for resilient food sources.

Conservation biology uses this approach to protect endangered species. Sequencing their genomes helps assess genetic diversity, identify inbreeding, and understand adaptive traits, informing conservation strategies.

In medicine, de novo sequencing aids cancer genomics. Cancer cells have highly rearranged genomes with large structural changes. De novo assembly characterizes these alterations, providing insights into tumor evolution and aiding therapeutic target identification.

De novo sequencing is also applied in metagenomics, studying genetic material from environmental samples like soil, water, or the human gut microbiome. Assembling genomes from these complex microbial mixtures helps researchers understand entire communities, their metabolic pathways, and interactions, with implications for environmental science and human health.

The Role of Sequencing Technologies

De novo genome assembly is influenced by the sequencing technology used. Two primary approaches exist: short-read and long-read sequencing. Short-read platforms produce billions of highly accurate but relatively short DNA sequences.

Short reads are cost-effective and generate immense data, making them widely used. However, assembling a complete genome from these fragments is challenging, especially with repetitive regions where many short reads may align to multiple identical locations.

Long-read sequencing technologies generate reads thousands to millions of base pairs long. While historically having higher error rates and costs, their longer reads simplify and improve genome assembly, as they can span complex repetitive regions.

Modern de novo sequencing projects often use a hybrid approach, combining short-read and long-read technologies. This strategy leverages short reads’ high accuracy and throughput for precision, while long reads resolve complex genomic structures and bridge gaps. This yields contiguous and accurate de novo genome assemblies.