How to Read an Electropherogram for DNA Sequencing

An electropherogram (also called a chromatogram or trace) is the graphical output of a DNA sequencing run. The plot shows the fluorescence data collected by automated sequencing instruments, most commonly those using capillary electrophoresis. It translates the order of fluorescently tagged DNA fragments into the corresponding sequence of nucleotide bases. Reading the trace directly lets you verify the base calls made by the software instead of trusting them blindly.

Understanding the Electropherogram Display

The electropherogram is a two-dimensional graph where the axes correspond to specific measurements from the sequencing process. The horizontal X-axis represents migration time through the capillary, which corresponds to fragment length because shorter fragments reach the detector first. The vertical Y-axis measures the intensity of the fluorescent signal in Relative Fluorescence Units (RFU), which reflects the amount of labeled fragment detected.

Each peak visible on the graph corresponds to a population of fragments of the same length, all ending in the same nucleotide base, passing the detector. Modern Sanger sequencing uses four different fluorescent dyes, linked to the four dideoxynucleotides (ddNTPs) that terminate the DNA chain. This system distinguishes the bases: Adenine (A) is typically green, Cytosine (C) is blue, Guanine (G) is black (its dye is often yellow, but most viewers draw it in black for legibility), and Thymine (T) is red.

The software uses the color and position of each peak to determine the sequence, displaying the corresponding letter above the peak. For a high-quality read, these peaks should be sharp, evenly spaced, and clearly resolved. The baseline represents the background noise or zero fluorescence and should remain flat and near zero for reliable results.
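The idea of calling a base from the color of a peak can be sketched in a few lines of Python. This is only an illustration, not how any particular base caller works: real software also corrects for dye mobility and spectral overlap before calling. The `call_base` function and the `peak` values are hypothetical.

```python
# Conventional display colors described above (G is usually drawn in black).
DYE_COLORS = {"A": "green", "C": "blue", "G": "black", "T": "red"}


def call_base(intensities):
    """Return the base whose dye channel has the strongest signal.

    `intensities` is a hypothetical dict mapping each base to the RFU
    value of its channel at one peak position.
    """
    return max(intensities, key=intensities.get)


# Made-up RFU values for a single, clean peak.
peak = {"A": 40, "C": 1250, "G": 35, "T": 60}
base = call_base(peak)
print(base, DYE_COLORS[base])
```

A clean peak like this one has a single dominant channel. When two channels are close in height, the simple maximum is no longer trustworthy and the position needs manual review.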

Interpreting Quality and Reliability Scores

Assessing the overall quality of the data is necessary before interpreting the sequence. The primary metric for this assessment is the Phred Quality Score (Q-score), a logarithmic measure of the probability that a base call is incorrect. A higher Q-score signifies a lower chance of error, providing statistical confidence for each base.

A base with a Q20 score indicates a 1 in 100 chance of being called incorrectly (99% accuracy). High-quality data often requires a Q30 score, meaning the base has a 1 in 1,000 chance of error (99.9% accuracy). These scores are displayed graphically above the sequence peaks, allowing a rapid check of the data’s trustworthiness across the read length.
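The relationship behind these numbers is Q = -10 x log10(P), where P is the probability that the base call is wrong. The short Python sketch below converts in both directions; the function names are chosen here for illustration only.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred quality score to the probability of a wrong call."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability (0 < p <= 1) to a Phred quality score."""
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    p = phred_to_error_probability(q)
    print(f"Q{q}: about 1 error in {round(1 / p)} calls")
```

Because the scale is logarithmic, every increase of 10 in the Q-score means a tenfold drop in the expected error rate.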

Visually evaluating quality involves checking for uniform peak height and consistent spacing. The “read length” refers to the continuous stretch of sequence that maintains a high Q-score. Quality is usually poor in the first few dozen bases, where peaks are crowded and poorly resolved, and it typically drops off again toward the end of the trace as the signal weakens and bases become less resolved.
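One simple way to estimate the usable read length is to find the longest unbroken run of bases at or above a chosen Q-score. The sketch below assumes a threshold of Q20 and uses made-up scores; it is a simplification, and dedicated trimming tools generally use more tolerant methods such as sliding-window averages.

```python
def longest_high_quality_run(q_scores, threshold=20):
    """Return (start, end) of the longest run with Q >= threshold.

    `end` is exclusive, so the run length is end - start.
    Returns (0, 0) when no base reaches the threshold.
    """
    best_start, best_end = 0, 0
    run_start = None
    for i, q in enumerate(q_scores):
        if q >= threshold:
            if run_start is None:
                run_start = i  # a new high-quality run begins here
            if i + 1 - run_start > best_end - best_start:
                best_start, best_end = run_start, i + 1
        else:
            run_start = None  # a low-quality base breaks the run
    return best_start, best_end


# Made-up scores: poor start, good middle, declining end.
scores = [8, 12, 25, 38, 42, 45, 40, 15, 33, 31, 12, 9]
start, end = longest_high_quality_run(scores)
print(f"usable bases: positions {start} to {end - 1} ({end - start} bases)")
```

Note that a single low-scoring base splits the run in this version, which is stricter than most people are when trimming by eye.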

Step-by-Step Sequence Interpretation

Translating the visual peaks into the DNA sequence begins with identifying the base call line generated by the sequencing software. This line of letters represents the computer’s best guess for the sequence. Manual review is necessary to confirm accuracy, especially at positions of interest. Homozygous bases are the simplest case and the best place to start.

A sequence position is homozygous when the peak is sharp, well-separated, and only one color signal is present. The height of this single peak is typically greater than that of a heterozygous peak because both copies of the DNA carry the same base, so all of the signal at that position falls in one dye channel. The base caller assigns the corresponding nucleotide letter (A, T, C, or G) to this dominant peak.

Identifying heterozygous bases requires recognizing two superimposed peaks at the same position. This indicates the individual carries two different nucleotides at that location. These two peaks will be different colors and should ideally show a 50/50 ratio in their heights, with each peak roughly half as tall as neighboring homozygous peaks. If the ratio deviates significantly, the second peak may be background noise or another sequencing issue and not a true second allele, so the position requires careful inspection.
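The height comparison described above can be expressed as a ratio of the second-tallest channel to the tallest one. The sketch below is illustrative only: the `classify_position` function, the 0.3 cutoff, and the RFU values are all assumptions made for this example, and any real cutoff depends on the instrument, chemistry, and laboratory protocol.

```python
def classify_position(intensities, min_secondary_ratio=0.3):
    """Classify one position from its four channel heights.

    `intensities` is a hypothetical dict of base -> RFU at the position.
    `min_secondary_ratio` is an assumed cutoff, not a standard value.
    Returns (label, bases, ratio of second-tallest to tallest peak).
    """
    ranked = sorted(intensities.items(), key=lambda kv: kv[1], reverse=True)
    (base1, height1), (base2, height2) = ranked[0], ranked[1]
    if height1 <= 0:
        return ("no signal", "", 0.0)
    ratio = height2 / height1
    if ratio >= min_secondary_ratio:
        return ("heterozygous", "/".join(sorted([base1, base2])), ratio)
    return ("homozygous", base1, ratio)


# Made-up examples: one clean peak, one pair of superimposed peaks.
print(classify_position({"A": 1200, "C": 30, "G": 25, "T": 40}))
print(classify_position({"A": 20, "C": 610, "G": 15, "T": 580}))
```

A ratio near 1.0 matches the ideal 50/50 pattern. Ratios that sit just above or below the cutoff are exactly the positions that deserve a closer look at the trace itself.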

Recognizing Common Sequencing Artifacts

Several common visual anomalies, or artifacts, can interfere with accurate sequence interpretation and should be recognized as technical noise. One frequent artifact is the “dye blob,” a large, broad peak often occurring early in the trace (around the first 50 to 100 base pairs). Dye blobs are caused by unincorporated fluorescent dye molecules that co-migrate with shorter DNA fragments but do not correspond to a specific base call.

A noisy background, characterized by small, multi-colored peaks cluttering the baseline, can cause the software to assign an ‘N’ to the sequence. This ‘N’ indicates an unresolved base because the true signal is too weak or the background is too high for a confident call. A persistently noisy trace suggests a problem with the sample quality or the sequencing reaction.
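A quick way to locate noisy stretches in an exported sequence is to measure how many ‘N’ calls fall in each window of bases. The following is a minimal sketch; the function name, the window size, and the example sequence are assumptions for illustration.

```python
def n_fraction_by_window(sequence, window=20):
    """Return a list of (start, fraction_of_N) for consecutive windows.

    Windows do not overlap; the last one may be shorter than `window`.
    """
    results = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window].upper()
        results.append((start, chunk.count("N") / len(chunk)))
    return results


# Made-up read whose second half is mostly unresolved.
read = "ACGTTGCAAGCTTAGGCTAANNGTNNNCNNANNNTNNNNN"
for start, fraction in n_fraction_by_window(read):
    print(f"bases {start}-{start + 19}: {fraction:.0%} N")
```

Windows with a high proportion of ‘N’ calls point to the regions of the trace where the baseline should be inspected for clutter.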

Baseline drift is an artifact where the entire baseline gradually rises or falls across the trace length. Because peak heights are judged relative to the baseline, drift leads to inconsistent software scaling and makes true peak height and uniformity difficult to assess. Recognizing these artifacts alerts the reader to regions that may need to be disregarded or re-sequenced for reliable results.
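To see why drift distorts peak heights, it helps to estimate the baseline and subtract it. The sketch below uses a rolling minimum as a crude baseline estimate on a made-up single-channel signal. It is an assumption-laden illustration, not the correction used by any sequencing software, and it only works when the window is wider than a single peak.

```python
def subtract_rolling_baseline(signal, half_window=2):
    """Subtract a rolling-minimum baseline estimate from a signal.

    For each point, the baseline is taken as the lowest value within
    `half_window` samples on either side. Returns the corrected list.
    """
    corrected = []
    for i, value in enumerate(signal):
        lo = max(0, i - half_window)
        hi = min(len(signal), i + half_window + 1)
        baseline = min(signal[lo:hi])
        corrected.append(value - baseline)
    return corrected


# Made-up channel: two peaks of equal true height on a rising baseline.
raw = [10, 11, 112, 13, 14, 115, 16, 17]
print(subtract_rolling_baseline(raw))
```

In the raw values the second peak looks taller than the first only because the baseline rose beneath it. After subtraction the two peaks come out at the same height.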