What Is De Bruijn Graph Assembly and How Does It Work?
Learn how de Bruijn graphs provide a computational framework for assembling genomes from short DNA reads by navigating sequence overlaps and inherent complexities.
Learn how de Bruijn graphs provide a computational framework for assembling genomes from short DNA reads by navigating sequence overlaps and inherent complexities.
A genome’s DNA sequence is too long for current technologies to read at once. Instead, sequencing generates vast quantities of short DNA fragments, known as “reads,” which are like jumbled pieces of a puzzle. The primary objective of genome assembly is to computationally piece together this massive collection of reads by identifying their overlapping segments to reconstruct the original, continuous sequence of the genome.
To manage this complex task, one of the most prominent methods is de Bruijn graph assembly. This approach provides a structured and efficient way to handle the immense and often redundant data produced by sequencing machines, turning fragmented reads into a coherent genomic map.
A de Bruijn graph is a mathematical structure used to represent the relationships between short DNA sequences. The graph is built from “k-mers,” which are short subsequences of DNA of a specific length, denoted by ‘k’. To generate k-mers, a DNA sequence is broken down into all possible overlapping segments of that length. For example, if the DNA sequence is “GATTACA” and k=4, the 4-mers are “GATT,” “ATTA,” “TTAC,” and “TACA.”
The graph itself consists of nodes and edges. Each unique k-mer becomes a node, and a directed edge connects two nodes if they have a specific overlap. An edge connects k-mer A to k-mer B if the last k-1 letters of A are identical to the first k-1 letters of B, representing a potential link in the original genome.
Using the “GATTACA” example, an edge connects “GATT” to “ATTA” because of the “ATT” overlap. Another edge links “ATTA” to “TTAC” based on the “TTA” overlap, and “TTAC” connects to “TACA” via the “TAC” overlap. This creates a linear path of nodes that reconstructs the original sequence.
The initial input for constructing a de Bruijn graph comes from high-throughput sequencing instruments, which generate millions of short DNA reads. The first step is to process all of these reads into their constituent k-mers. This “k-merization” involves sliding a window of a chosen length ‘k’ across every read to extract all possible overlapping k-mers.
Once all k-mers have been generated, the graph is built. Each unique k-mer found across the entire dataset becomes a single node. The algorithm then establishes directed edges between these nodes based on the k-1 overlap rule, suggesting they were adjacent in the original genome.
This method effectively compresses the highly redundant sequencing data into a more manageable structure. Instead of dealing with millions of individual reads, the assembler works with a graph that represents the unique sequences and their connections.
Once the graph is constructed, it is used for genome reconstruction. The original chromosome sequence corresponds to a path through this network of nodes. The assembly algorithm’s goal is to identify these correct paths, tracing the genome’s sequence from one k-mer to the next.
In simple, linear areas of the graph with no forks or loops, the path is unambiguous. Traversing these straightforward sections allows the assembler to piece together k-mers into longer, continuous stretches of DNA called “contigs.”
The objective is to extend these contigs by following the connections from node to node. As the algorithm moves from one k-mer to an adjacent one, it adds one new DNA base to the end of the growing sequence. This traversal systematically reconstructs longer fragments of the genome.
Repetitive sequences in the genome are a significant complication. When a sequence is repeated, it generates identical k-mers from different parts of the chromosome. This causes these k-mer nodes in the graph to have multiple incoming and outgoing edges, creating branches and cycles that make it difficult for the algorithm to determine the correct path forward.
Errors from DNA sequencing technologies also create issues. A single incorrect base in a read can create a novel k-mer that does not exist in the actual genome. These erroneous k-mers lead to problematic structures in the graph, such as “bubbles” where a path diverges and then reconverges, or “spurs” which are short, dead-end paths. Assemblers mitigate these issues by removing k-mers that appear with very low frequency.
The choice of the k-mer size presents a trade-off. A small ‘k’ value increases the likelihood of finding overlaps, which is helpful in regions with low sequencing coverage. However, it also makes it harder to resolve repeats, as shorter k-mers are more likely to be identical by chance, resulting in a more tangled graph.
Conversely, a large ‘k’ value provides greater specificity and can span across shorter repetitive sequences, helping to resolve them. A larger ‘k’ requires higher sequencing coverage to ensure that all true k-mers are present in the data. Insufficient coverage with a large ‘k’ can lead to a fragmented graph with many disconnected components and a less complete assembly.
The development of de Bruijn graph assembly has had a major impact on biology and medicine, enabling researchers to decode genetic information on a large scale. Its most direct applications include: