How to Make a Phylogenetic Tree From DNA Sequences

A phylogenetic tree is a branching diagram that represents a hypothesis about the evolutionary history of a set of organisms or genes. It models how these entities are related through common ancestry and subsequent divergence. Constructing a phylogeny makes these ancestral connections explicit, providing a framework for classifying life and studying the evolution of traits. Most modern analyses rely on molecular data (DNA, RNA, or protein sequences) rather than on physical characteristics.

Data Acquisition and Selection

Building a molecular phylogenetic tree begins with acquiring the DNA or protein sequences to be compared. These sequences must be homologous, meaning they descend from a common ancestral sequence. They are typically retrieved from public repositories such as GenBank, often with the help of similarity-search tools such as BLAST. Researchers must also select the right genetic marker: a gene or sequence region that evolves at a rate appropriate for the comparison being made. Highly conserved genes, such as those coding for ribosomal RNA, suit comparisons between distantly related organisms because they accumulate substitutions slowly.

For comparing closely related species or populations, a faster-evolving marker, such as animal mitochondrial DNA or an internal transcribed spacer (ITS) region, is usually needed to capture recent changes. A crucial step in data selection is identifying and including an “outgroup” sequence. The outgroup is a species or sequence known to be more distantly related to every member of the group under study (the ingroup) than the ingroup members are to one another. Including this distant relative allows the tree to be rooted, which establishes the direction of evolutionary change and locates the common ancestor of the ingroup.

Preparing the Data: Sequence Alignment

Once selected, the sequences must undergo multiple sequence alignment before they can be compared. Specialized software, such as ClustalW or MAFFT, performs this procedure. The goal is to arrange the sequences so that every column of the alignment contains positions inferred to be homologous, that is, derived from the same position in the ancestral sequence. Rather than simply maximizing the number of identical residues in each column, the software optimizes a score that rewards matching nucleotides or amino acids and penalizes mismatches and gaps.
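The scoring idea behind alignment can be illustrated with a minimal sketch of pairwise global alignment (the Needleman-Wunsch algorithm). This is not how ClustalW or MAFFT work internally: those programs align many sequences at once and use more refined scoring. The function name and the match, mismatch, and gap scores below are illustrative assumptions.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences with a simple linear gap penalty.

    Returns (aligned_a, aligned_b, score). Scores are illustrative defaults.
    """
    n, m = len(a), len(b)
    # score[i][j] holds the best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def pair(i, j):
        # Score for placing a[i-1] and b[j-1] in the same column
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + pair(i, j),  # align the two residues
                score[i - 1][j] + gap,             # gap in b
                score[i][j - 1] + gap,             # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]
```

With these scores, aligning ACGTACGT against ACGACGT gives ACG-ACGT for the second sequence and a total score of 5 (seven matches minus one gap penalty).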

Insertions or deletions of genetic material, known as indels, complicate this process. The software accounts for indels by introducing gaps (represented by hyphens) into the sequences. Gaps keep the remaining positions in register, so that a base lost or gained in one lineage does not cause non-homologous nucleotides to be compared. Depending on the tree-building method, gaps are later treated either as missing data or as evolutionary events in their own right. An accurate alignment is paramount because misaligned columns feed directly into the tree calculation and can produce an incorrect evolutionary history.

Selecting the Calculation Method

After the sequences are aligned, a computational method must be chosen to convert the alignment into a branching tree structure. Methods fall into two broad categories, distance-based and character-based, which trade computational speed against statistical rigor. Distance-based methods, such as Neighbor-Joining (NJ), reduce the data to a single evolutionary distance between every pair of sequences, usually corrected for multiple substitutions at the same site. They then use this distance matrix to build a tree quickly, with branch lengths that reflect the estimated genetic distances.
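The distance-based approach can be sketched in two steps: compute a corrected pairwise distance, then cluster with Neighbor-Joining. The sketch below assumes the Jukes-Cantor (JC69) correction and a dict-of-dicts distance matrix; the function names, the internal node labels, and the choice to skip gapped columns are illustrative assumptions rather than what any particular program does.

```python
import math
from itertools import combinations


def jc69_distance(s1, s2):
    """Jukes-Cantor corrected distance between two aligned sequences.

    Columns with a gap in either sequence are skipped (one common convention).
    """
    pairs = [(x, y) for x, y in zip(s1, s2) if x != "-" and y != "-"]
    if not pairs:
        raise ValueError("no comparable sites")
    p = sum(x != y for x, y in pairs) / len(pairs)  # observed proportion of differences
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


def neighbor_joining(dist):
    """Neighbor-Joining on a symmetric dict-of-dicts matrix (at least two taxa).

    Returns (newick, joins); joins lists each merge as
    (left, right, left_branch_length, right_branch_length).
    """
    d = {a: dict(row) for a, row in dist.items()}  # work on a copy
    active = list(d)
    label = {a: a for a in active}  # Newick text for each active node
    joins = []
    while len(active) > 2:
        n = len(active)
        total = {a: sum(d[a][b] for b in active if b != a) for a in active}
        # Join the pair minimising Q(a, b) = (n - 2) * d(a, b) - total(a) - total(b)
        a, b = min(
            combinations(active, 2),
            key=lambda ab: (n - 2) * d[ab[0]][ab[1]] - total[ab[0]] - total[ab[1]],
        )
        len_a = 0.5 * d[a][b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a][b] - len_a
        u = f"_node{len(joins)}"  # hypothetical label for the new internal node
        d[u] = {}
        for c in active:
            if c not in (a, b):
                d[u][c] = d[c][u] = 0.5 * (d[a][c] + d[b][c] - d[a][b])
        label[u] = f"({label[a]}:{len_a:.3f},{label[b]}:{len_b:.3f})"
        joins.append((a, b, len_a, len_b))
        active = [c for c in active if c not in (a, b)] + [u]
    a, b = active
    # The last two nodes are connected by a single branch of length d(a, b)
    return f"({label[a]}:{d[a][b]:.3f},{label[b]}:0.000);", joins
```

The resulting tree is unrooted; the position of the final pair in the Newick string is an artifact of the output format, not an inferred root.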

Character-based methods analyze each position (or character) in the alignment individually, which is far more computationally intensive. Maximum Likelihood (ML) and Bayesian inference (BI) are the most widely used character-based approaches and are generally regarded as more statistically rigorous than distance-based methods. Both rely on explicit models of molecular evolution that describe the probability of each type of nucleotide or amino acid substitution occurring over time.

The Maximum Likelihood method searches possible tree topologies and branch lengths for the tree under which the observed sequence data are most probable, given the chosen evolutionary model. Bayesian inference combines that likelihood with prior probabilities to estimate the posterior probability of each tree, that is, the probability that the tree is correct given the data and the model. Because it yields a distribution over trees rather than a single best tree, it provides a direct measure of confidence for each grouping.
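The quantity that Maximum Likelihood maximizes can be sketched for a fixed tree. The example below computes the log-likelihood of an alignment under the Jukes-Cantor (JC69) model using Felsenstein's pruning algorithm. It is a minimal illustration under stated assumptions: the tree encoding (a leaf is a name, an internal node is a list of (child, branch_length) pairs), the function names, and the treatment of gaps as missing data are all choices made for this sketch, and real programs also search over topologies and optimize branch lengths.

```python
import math

BASES = "ACGT"


def jc69_prob(x, y, t):
    """Probability that base x is replaced by base y along a branch of length t."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if x == y else 0.25 - 0.25 * e


def partials(node, site, alignment):
    """Conditional likelihood of the data below `node` for each possible base."""
    if isinstance(node, str):  # leaf: look up the observed character
        observed = alignment[node][site]
        if observed == "-":
            return {b: 1.0 for b in BASES}  # gap treated as missing data
        return {b: 1.0 if b == observed else 0.0 for b in BASES}
    result = {b: 1.0 for b in BASES}
    for child, t in node:
        below = partials(child, site, alignment)
        for x in BASES:
            # Sum over every base y the child could carry
            result[x] *= sum(jc69_prob(x, y, t) * below[y] for y in BASES)
    return result


def log_likelihood(tree, alignment):
    """Log-likelihood of the alignment on a fixed tree, assuming independent sites."""
    n_sites = len(next(iter(alignment.values())))
    total = 0.0
    for site in range(n_sites):
        part = partials(tree, site, alignment)
        # JC69 assumes equal base frequencies of 1/4 at the root
        total += math.log(sum(0.25 * part[b] for b in BASES))
    return total
```

Comparing `log_likelihood` across candidate trees for the same alignment shows which topology the data favor under this model; a higher (less negative) value indicates a better fit.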

Assessing the Tree’s Reliability

The final step is to assess how much confidence the data give to the relationships shown in the tree, particularly its topology. Since the tree is a hypothesis, scientists use resampling techniques to test the stability of each internal node, which represents an ancestral split. The most common technique is bootstrapping, which creates hundreds or thousands of pseudoreplicate datasets by randomly sampling alignment columns (characters) from the original alignment with replacement, each replicate having the same number of columns as the original.
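The resampling step can be sketched in a few lines. The example below assumes the alignment is stored as a dictionary mapping sequence names to equal-length aligned strings; the function name and the use of a seeded `random.Random` for reproducibility are illustrative choices.

```python
import random


def bootstrap_alignment(alignment, rng):
    """Return one bootstrap pseudoreplicate of an alignment.

    Columns are drawn with replacement, and the same columns are used for
    every sequence so that each resampled column stays intact.
    """
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}


# Example: 100 replicates from a toy alignment
toy = {"s1": "ACGTAC", "s2": "ACGTTC", "s3": "TCGAAC"}
rng = random.Random(42)  # fixed seed so the replicates are reproducible
replicates = [bootstrap_alignment(toy, rng) for _ in range(100)]
```

Each replicate has the same sequences and the same length as the original, but some columns appear several times and others not at all.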

A phylogenetic tree is generated for each resampled dataset using the same calculation method employed for the original data. Scientists then tally how many of the resulting bootstrap trees contain each grouping found in the original tree. This count is reported as a percentage (the bootstrap support value), usually displayed at the corresponding node or branch of the final phylogeny. By a widely used rule of thumb, a value of 70% or higher is taken to indicate reasonably strong support for a grouping, while lower values mean the data do not consistently recover it.
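The tallying step can be sketched as follows, assuming a tree has already been inferred for each bootstrap replicate by whatever method was used for the original data. Trees here are nested tuples of leaf names, all rooted the same way (for example, on the outgroup); this encoding and the function names are assumptions made for the illustration. For unrooted trees, one would compare bipartitions rather than clades.

```python
def clades(tree):
    """Return the groupings in a nested-tuple tree as frozensets of leaf names.

    The set of all leaves is excluded because every tree contains it.
    """
    found = set()

    def leaves(node):
        if isinstance(node, str):
            return frozenset([node])
        members = frozenset().union(*(leaves(child) for child in node))
        found.add(members)
        return members

    found.discard(leaves(tree))
    return found


def bootstrap_support(reference_tree, replicate_trees):
    """Percentage of replicate trees that contain each clade of the reference tree."""
    replicate_clades = [clades(t) for t in replicate_trees]
    return {
        clade: 100.0 * sum(clade in rc for rc in replicate_clades) / len(replicate_trees)
        for clade in clades(reference_tree)
    }


# Example: three of four replicates recover the reference groupings
reference = (("A", "B"), ("C", "D"))
replicates = [reference, reference, reference, (("A", "C"), ("B", "D"))]
support = bootstrap_support(reference, replicates)
# support[frozenset({"A", "B"})] is 75.0
```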