How to Read a DNA Chromatogram and Assess Quality

A DNA chromatogram is a graphical representation of the raw data derived from Sanger sequencing, a method used to determine the precise order of nucleotides (A, T, C, G) in a DNA strand. This visual output provides researchers with a direct look at the sequence information. Since automated sequencing software can sometimes misinterpret the raw data, the chromatogram is an indispensable tool for manually verifying the accuracy of the generated sequence. It acts as the evidence supporting the textual sequence file, allowing for the validation of results before any further analysis is conducted.

Understanding the Visual Components

The chromatogram is essentially a two-dimensional graph that plots the fluorescent signals captured during the sequencing process. The horizontal axis (X-axis) represents the time of migration or the position along the DNA sequence, which correlates directly to the length of the DNA fragments being analyzed. As the sequencing fragments move through a capillary, the shorter fragments emerge first, meaning the sequence is read from the 5′ end to the 3′ end as the trace moves from left to right.

The vertical axis, or Y-axis, quantifies the signal intensity, which is measured in relative fluorescence units. This intensity indicates the strength of the signal for each detected base, which should ideally be strong enough to rise well above the baseline—the flat line representing zero or background signal. A distinct, sharp upward spike from the baseline is called a peak, and each peak signifies the detection of a single nucleotide at that specific position in the sequence.

To differentiate between the four nucleotides, the Sanger sequencing reaction uses four fluorescent dyes, each tagging one of the four bases. The dyes emit light at different wavelengths, which the software renders as four distinct colors on the chromatogram. One common convention uses green for adenine, red for thymine, blue for cytosine, and black for guanine, although the color assignment can vary between viewing programs.

The presence of a peak of a certain color at a particular position on the X-axis is the visual cue for the identity of the nucleotide at that spot. This four-color system allows the sequencing instrument’s software, and subsequently the human eye, to quickly identify the sequence of bases. A good quality chromatogram will display a clear separation between these colored peaks, with each position having only one dominant color.

Translating Peaks into the DNA Sequence

Reading the DNA sequence is a sequential process, moving across the graph from left to right. This progression mirrors the 5′ to 3′ direction of the newly synthesized DNA strand. At each distinct position along the X-axis, the reader must identify the highest peak to determine the correct base call.

The color of the single, highest peak dictates the nucleotide identity for that location. For example, a blue peak indicates cytosine, and a green peak indicates adenine. This process is repeated for every major peak, converting the visual pattern of colored spikes into the text sequence (A, T, C, G).

The automated base-calling software performs this translation and displays the assigned letter above each peak, but manual inspection is crucial for validation. In a high-quality read, the correct base peak is significantly taller and sharper than any background signal, making the base assignment unambiguous. The final sequence is generated by stringing together the letter assigned to each successive peak.
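The peak-to-letter translation described above can be sketched in a few lines of Python. This is only an illustration, not how any particular base caller works: the data layout (one dictionary of channel intensities per peak position) and the function name `call_bases` are assumptions made for the example, and real software also handles peak detection, spacing, and mobility correction.

```python
# Hypothetical layout: one dict of channel intensities (in relative
# fluorescence units) per peak position, ordered left to right (5' to 3').
def call_bases(peaks):
    """Return the base with the strongest signal at each peak position."""
    return "".join(max(channels, key=channels.get) for channels in peaks)


example_peaks = [
    {"A": 950, "C": 40, "G": 30, "T": 25},
    {"A": 35, "C": 20, "G": 45, "T": 880},
    {"A": 30, "C": 910, "G": 50, "T": 20},
    {"A": 25, "C": 30, "G": 990, "T": 40},
]

print(call_bases(example_peaks))  # ATCG
```

In each example position one channel is far stronger than the other three, which is the unambiguous situation a high-quality read should show.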

Assessing Read Quality and Identifying Issues

Evaluating the reliability of the generated sequence requires a careful look at the physical characteristics of the peaks. A high-quality sequence is characterized by peak uniformity, meaning peaks are consistently spaced and possess similar heights throughout the central trace. While slight variations (up to a threefold difference) are considered normal, a dramatic decrease in signal intensity suggests a problem with the reaction.
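As a rough illustration of the threefold guideline, the sketch below compares the tallest and shortest called peaks in a stretch of trace. The function names and the plain list of peak heights are assumptions made for the example, and the 3.0 default simply restates the figure given above.

```python
def height_ratio(heights):
    """Ratio of the tallest to the shortest called peak (1.0 = uniform)."""
    if not heights or min(heights) <= 0:
        raise ValueError("heights must be a non-empty list of positive values")
    return max(heights) / min(heights)


def is_uniform(heights, max_ratio=3.0):
    """True if peak heights vary by no more than max_ratio (assumed cutoff)."""
    return height_ratio(heights) <= max_ratio


central_trace = [900, 700, 1200, 650]  # made-up peak heights
weak_region = [900, 150, 1200]         # one peak drops sharply

print(is_uniform(central_trace))  # True
print(is_uniform(weak_region))    # False
```

A failed check does not diagnose the cause. It only marks a region that deserves a closer look in the trace viewer.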

One common quality issue is noise, which appears as small, messy, multicolored peaks near the baseline. Excessive noise interferes with the software’s ability to distinguish true signal peaks from background, potentially leading to incorrect base calls. A second issue is signal loss, which typically occurs toward the end of the sequence trace. Longer DNA fragments are resolved less efficiently, so the peaks broaden, lose definition, and decrease in height.
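One simple way to put a number on noise is to compare the called peak with the average of the other three channels at the same position. The sketch below does this under stated assumptions: the per-position dictionary of channel intensities and the name `signal_to_noise` are hypothetical, and sequencing software may define signal-to-noise differently.

```python
def signal_to_noise(channels):
    """Strongest channel divided by the mean of the remaining channels."""
    values = sorted(channels.values(), reverse=True)
    primary, background = values[0], values[1:]
    mean_background = sum(background) / len(background)
    if mean_background == 0:
        return float("inf")  # no measurable background at this position
    return primary / mean_background


clean_position = {"A": 900, "C": 30, "G": 30, "T": 30}
noisy_position = {"A": 300, "C": 150, "G": 100, "T": 50}

print(signal_to_noise(clean_position))
print(signal_to_noise(noisy_position))
```

Computed position by position along a read, a ratio like this would be expected to fall toward the end of the trace, where peaks broaden and lose height.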

A particularly important feature is the presence of double peaks, where two differently colored peaks of roughly equal height appear at the same position. This pattern usually indicates either sample contamination (two templates sequenced simultaneously) or a heterozygous position in a diploid organism. The base-calling software may label such an ambiguous position with an ‘N’.
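Candidate double peaks can be flagged by comparing the second-strongest channel with the strongest one at each position. The following is a minimal sketch: the data layout, the function name `find_double_peaks`, and the 0.5 ratio cutoff are all assumptions chosen for illustration, not values taken from any specific program.

```python
def find_double_peaks(peaks, min_ratio=0.5):
    """Return (index, primary_base, secondary_base) for positions where the
    second-strongest channel reaches min_ratio of the strongest channel."""
    flagged = []
    for index, channels in enumerate(peaks):
        ranked = sorted(channels, key=channels.get, reverse=True)
        primary, secondary = ranked[0], ranked[1]
        top = channels[primary]
        if top > 0 and channels[secondary] / top >= min_ratio:
            flagged.append((index, primary, secondary))
    return flagged


example_peaks = [
    {"A": 900, "C": 40, "G": 30, "T": 25},   # clean A
    {"A": 20, "C": 480, "G": 15, "T": 510},  # C and T of similar height
    {"A": 30, "C": 35, "G": 870, "T": 20},   # clean G
]

print(find_double_peaks(example_peaks))  # [(1, 'T', 'C')]
```

A flagged position still has to be interpreted by eye. The ratio alone cannot tell a heterozygous site from a contaminated sample.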

For a quantitative measure of confidence, sequencing software calculates numerical quality scores, known as Phred scores, displayed above the peaks. A Phred score Q is logarithmically related to the estimated probability P of an incorrect base call: Q = -10 × log10(P). A score of 20 means there is a 1 in 100 chance of error, while a score of 30 corresponds to a 1 in 1,000 chance. Higher Phred scores indicate greater confidence in the assigned nucleotide.
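The relationship between a Phred score and its error probability follows the standard definition Q = -10 × log10(P), and the conversion can be checked with a short Python sketch. The function names are chosen for this example only.

```python
import math


def phred_to_error_prob(q):
    """Convert a Phred score Q into the probability of a wrong base call."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p):
    """Convert an error probability P (0 < P <= 1) into a Phred score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} calls")
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability, so the step from Q20 to Q30 moves from 1 in 100 to 1 in 1,000.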