What Is Long Branch Attraction in Phylogenetics?

Phylogenetics is the study of the evolutionary history and relationships among organisms. Scientists in this field construct branching diagrams, called phylogenetic trees, to visualize these connections. One persistent challenge in building accurate trees is a form of systematic error known as long branch attraction. This phenomenon causes lineages that are not close relatives to be grouped together, simply because each has accumulated a large amount of evolutionary change.

Long branch attraction can be pictured by imagining two unrelated artists who, working in isolation for decades, both develop a highly unusual painting style. A curator finding their work might mistakenly assume they were collaborators, overlooking their true artistic lineages. In the same way, long branch attraction mistakes the accumulation of many independent evolutionary changes for a shared history, creating a misleading picture of relatedness.

The Cause of Incorrect Groupings

Phylogenetic trees are composed of branches representing distinct lineages and nodes representing common ancestors. The length of a branch corresponds to the amount of evolutionary change, such as genetic mutations, that has occurred along it. A “long branch” signifies a lineage that has experienced a great deal of change, either because its evolutionary rate is high or because it has been evolving in isolation for a long time.

The issue arises from homoplasy, the independent evolution of similar traits in separate lineages. At the genetic level, DNA has only four possible nucleotides (A, T, C, G), so two distantly related species that each change at the same position have a substantial chance of arriving at the same nucleotide independently. Even two unrelated random sequences match at roughly one site in four when the four bases are equally frequent. This is convergent or parallel change at the sequence level, where separate paths lead to the same outcome.
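The size of this chance similarity can be sketched with the Jukes-Cantor model, the simplest standard model of DNA change, which assumes all four nucleotides are equally frequent and all substitutions equally likely. The function name below is illustrative, and the divergence values are arbitrary choices for the example rather than measurements.

```python
import math

def jc_match_probability(d):
    """Probability that two sequences show the same nucleotide at a site
    when they are separated by d expected substitutions per site,
    under the Jukes-Cantor model (an assumption made for illustration)."""
    return 0.25 + 0.75 * math.exp(-4.0 * d / 3.0)

# As divergence grows, identity does not fall to zero: it levels off at 1/4.
for d in (0.1, 0.5, 1.0, 3.0, 10.0):
    print(f"d = {d:>4}: P(same nucleotide) = {jc_match_probability(d):.3f}")
```

Under these assumptions, identity between two sequences never drops below one quarter, however much change has occurred. That floor of chance matches is the raw material for the false signal.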

Some methods for building phylogenetic trees are susceptible to being misled by this false signal. Maximum Parsimony operates on the principle of finding the tree that requires the fewest evolutionary changes to explain the observed data. When two long branches exist, they have each accumulated a large number of mutations, and by random chance, a notable number of these will be identical.

The parsimony algorithm sees these shared mutations and concludes the simplest explanation is that the two lineages share a recent common ancestor. This path is more “parsimonious” than the true scenario, where the same mutations would have had to occur twice, independently. This misleading statistical pull is the “attraction.”
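The effect can be made concrete with four lineages, A, B, C and D, whose true tree pairs A with B and C with D. The sketch below assumes the Jukes-Cantor model and made-up branch lengths (long branches for A and C, short ones elsewhere). It computes exactly how often each kind of parsimony-informative site is expected. With four taxa, only sites that split the lineages two against two can favor one tree over another.

```python
import itertools
import math

def jc_prob(a, b, t):
    """Jukes-Cantor probability of ending in state b after starting in a,
    along a branch of length t (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    return 0.25 + 0.75 * e if a == b else 0.25 - 0.25 * e

def pattern_probabilities(long_t, short_t, internal_t):
    """Expected frequency of each two-against-two site pattern.
    True tree is ((A,B),(C,D)); A and C are long, B and D are short."""
    probs = {"AB|CD": 0.0, "AC|BD": 0.0, "AD|BC": 0.0}
    # x and y are the two ancestral nodes; sum over every state assignment.
    for x, y, a, b, c, d in itertools.product(range(4), repeat=6):
        p = (0.25 * jc_prob(x, y, internal_t)
             * jc_prob(x, a, long_t) * jc_prob(x, b, short_t)
             * jc_prob(y, c, long_t) * jc_prob(y, d, short_t))
        if a == b and c == d and a != c:
            probs["AB|CD"] += p      # supports the true tree
        elif a == c and b == d and a != b:
            probs["AC|BD"] += p      # supports joining the long branches
        elif a == d and b == c and a != b:
            probs["AD|BC"] += p
    return probs

probs = pattern_probabilities(long_t=1.5, short_t=0.05, internal_t=0.05)
for grouping, p in probs.items():
    print(f"{grouping}: {p:.4f}")
print("Parsimony favors:", max(probs, key=probs.get))
```

With these illustrative branch lengths, sites that join the two long branches are expected more often than sites that support the true tree. Adding more data therefore makes parsimony more confident in the wrong answer, not less.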

Identifying Long Branch Attraction in a Study

A primary indicator of long branch attraction is a phylogenetic tree that shows a surprising relationship, one that contradicts well-established evidence from morphology or the fossil record. This unexpected grouping often involves two lineages whose branches on the tree are longer than the others, suggesting they have undergone rapid or extensive evolution.

A common technique to test this involves using computer simulations to assess the reliability of the grouping. An approach called parametric bootstrapping allows scientists to test whether the result is a likely artifact of the analysis. Researchers use the initial, suspect tree to estimate the parameters of evolution, such as mutation rates.

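They then use these parameters to simulate the evolution of new, artificial DNA datasets on a reference tree in which the long-branched lineages are not each other's closest relatives. By running the same phylogenetic analysis on these simulated datasets, they can see how often the suspect grouping appears. If the long-branched lineages are still grouped together in a large share of the simulated datasets, the method has been shown to produce that grouping even when it is known to be false. This strengthens the conclusion that the attraction is an artifact and not a reflection of true evolutionary history.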
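The logic of this test can be sketched in a few lines. The example below is a toy version under stated assumptions: four taxa, the Jukes-Cantor model, parsimony as the method under test, and branch lengths chosen by hand. In a real study the parameters would be estimated from the data, and the simulations would normally be run with dedicated software. All function names are hypothetical.

```python
import math
import random

def evolve(state, t, rng):
    """Evolve one nucleotide (coded 0-3) along a branch of length t (Jukes-Cantor)."""
    p_change = 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))
    if rng.random() < p_change:
        return rng.choice([s for s in range(4) if s != state])
    return state

def simulate_site(long_t, short_t, internal_t, rng):
    """One site on the reference tree ((A,B),(C,D)); A and C are the long branches."""
    x = rng.randrange(4)              # ancestor of A and B
    y = evolve(x, internal_t, rng)    # ancestor of C and D
    return (evolve(x, long_t, rng), evolve(x, short_t, rng),
            evolve(y, long_t, rng), evolve(y, short_t, rng))

def parsimony_grouping(sites):
    """Four-taxon parsimony: choose the split backed by the most informative sites.
    Ties go to the first entry, AB|CD, which is the reference tree."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in sites:
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return max(counts, key=counts.get)

def parametric_bootstrap(n_reps, n_sites, long_t, short_t, internal_t, seed=1):
    """Share of simulated datasets in which the long branches A and C are joined."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_reps):
        sites = [simulate_site(long_t, short_t, internal_t, rng)
                 for _ in range(n_sites)]
        if parsimony_grouping(sites) == "AC|BD":
            hits += 1
    return hits / n_reps

freq = parametric_bootstrap(n_reps=100, n_sites=500,
                            long_t=1.5, short_t=0.05, internal_t=0.05)
print(f"Long branches grouped together in {freq:.0%} of replicates")
```

Because every dataset here was generated on a tree where A and C are not sisters, any replicate that joins them is an error by construction. A high frequency indicates that the method, not the history, is producing the grouping.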

Analytical Methods and Model Selection

While Maximum Parsimony is known for its susceptibility, more sophisticated techniques can handle the complexities that cause this error. Model-based methods, such as Maximum Likelihood (ML) and Bayesian Inference (BI), offer a more robust framework for reconstruction. These approaches are less likely to be fooled by the superficial similarity of long branches, although they are not immune when the chosen model fits the data poorly.

ML and BI perform better because they rely on explicit statistical models of how DNA sequences change over time. Unlike parsimony, which simply counts changes, these models can account for complex evolutionary processes. They can incorporate the probability that one nucleotide will mutate into another and recognize that some types of mutations are more common.
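As one illustration of what such a model looks like, the sketch below uses the Kimura two-parameter model, a standard model that treats transitions (A↔G, C↔T) and transversions (all other changes) as having different rates. The function name and the example values of `d` and `kappa` are assumptions chosen for illustration.

```python
import math

def k80_probabilities(d, kappa):
    """Kimura two-parameter probabilities for one site after d expected
    substitutions per site. kappa is the ratio of the transition rate to
    the rate of each individual transversion."""
    beta_t = d / (kappa + 2.0)     # transversion rate multiplied by time
    alpha_t = kappa * beta_t       # transition rate multiplied by time
    e1 = math.exp(-4.0 * beta_t)
    e2 = math.exp(-2.0 * (alpha_t + beta_t))
    return {
        "same": 0.25 + 0.25 * e1 + 0.5 * e2,
        "transition": 0.25 + 0.25 * e1 - 0.5 * e2,
        "each_transversion": 0.25 - 0.25 * e1,   # two such outcomes per site
    }

# Hypothetical values: a moderately long branch with transitions favored.
for outcome, p in k80_probabilities(d=0.5, kappa=4.0).items():
    print(f"{outcome}: {p:.3f}")
```

A likelihood or Bayesian analysis combines probabilities of this kind across every branch and every site. A matching nucleotide on two long branches is therefore weighed against how easily it could have arisen by chance, rather than simply counted.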

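A feature of these models is their ability to account for differences in the amount of change across lineages. Each branch on the tree is given its own estimated length, so the analysis does not assume that all lineages have evolved at the same speed. Because the model expects many repeated and parallel changes along a “fast” branch, it can correctly interpret a large number of mutations as the product of a high evolutionary rate and treat chance similarities between two long branches as weak evidence of shared ancestry. Many models also allow rates to vary among sites within the sequence, which helps further.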

Data and Taxon Sampling Strategies

An effective data-focused strategy is to improve taxon sampling, which involves strategically adding new species to the analysis. The goal is to “break up” the long branches that are causing the problem.

Imagine a long, unmarked road between two distant towns; it is hard to know the exact path it takes. Adding more signposts along the way makes its route clearer. Similarly, by adding taxa to the analysis that are relatives of the long-branched lineages, scientists can effectively chop a single long branch into several shorter segments. These new, shorter branches have accumulated fewer mutations, which reduces the opportunity for random, convergent changes to create a false signal.

Another tactic involves filtering the genetic data itself. Since much of the problem stems from positions in the DNA that change very quickly, one solution is to remove them from the analysis. Scientists can identify and exclude the fastest-evolving genes or specific codon positions within genes, most often the third position, which tends to change fastest. This leaves a dataset composed of more slowly evolving characters, which are less prone to the homoplasy that misleads the analysis.
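Both kinds of filtering can be sketched on a toy alignment. The function names and the four short sequences below are invented for illustration. Counting the distinct nucleotides in a column is only a crude stand-in for a site's evolutionary rate; real analyses typically estimate site rates with dedicated software.

```python
def drop_third_codon_positions(alignment):
    """Keep codon positions 1 and 2. Assumes every sequence starts in frame."""
    return {name: "".join(base for i, base in enumerate(seq) if i % 3 != 2)
            for name, seq in alignment.items()}

def drop_most_variable_sites(alignment, max_states):
    """Remove columns that show more than max_states distinct nucleotides,
    ignoring gaps ('-') and ambiguous bases ('N')."""
    names = list(alignment)
    length = len(alignment[names[0]])
    keep = [i for i in range(length)
            if len({alignment[n][i] for n in names} - {"-", "N"}) <= max_states]
    return {n: "".join(alignment[n][i] for i in keep) for n in names}

# Invented toy alignment: three codons from four hypothetical taxa.
toy = {
    "taxon_A": "ATGGCACTT",
    "taxon_B": "ATGGCGCTA",
    "taxon_C": "ATGGCTCTC",
    "taxon_D": "ATGGCCTTG",
}

print(drop_third_codon_positions(toy))
print(drop_most_variable_sites(toy, max_states=2))
```

Either filter shrinks the dataset, so there is a trade-off: removing fast sites reduces misleading signal but also discards some genuine information about recent splits.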