A FASTQ file is the standard text-based format for storing raw data output from high-throughput DNA sequencing machines. It serves as the starting point for nearly all genomic analysis, containing the massive amounts of data generated when a sequencer “reads” millions or billions of short DNA fragments. The file stores individual sequencing reads, which are the short stretches of genetic information determined by the machine. Every base (A, T, C, or G) in a newly sequenced genome is recorded in this format before deeper analysis begins.
Why This Data Format is Necessary
Genomic analysis requires a format that goes beyond simply recording the sequence of A’s, T’s, C’s, and G’s. Sequencing instruments are not perfect, and every base call carries a probability of being incorrect. Simpler sequence formats, such as the older FASTA format, store only the sequence letters, so they cannot convey how reliable each base call is.
The FASTQ format addresses this need for quality assurance by coupling every base call with a corresponding confidence score. This score quantifies the reliability of the machine’s determination. Knowing the probability of error for every piece of data is necessary to distinguish between genuine biological variation and simple machine error.
These confidence scores make downstream filtering and trimming possible: reads with poor overall quality can be discarded, and low-quality stretches at the ends of reads can be removed before analysis. Doing so protects the integrity of the final results. This pairing of sequence and quality data makes the FASTQ format essential for modern genomics.
The Four Components of a FASTQ Entry
The structure of a FASTQ file is highly standardized. It consists of individual records, one per sequence read, and in the form written by modern sequencers every record occupies exactly four lines. This consistent structure allows bioinformatics tools to process the data efficiently.
The first line is the Read Identifier, which always begins with the `@` symbol. This line contains unique tracking information, such as the sequencing instrument, run number, and coordinates on the flow cell. This identifier is necessary for tracing the origin and context of the data.
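As an illustration of how such tracking information can be pulled apart, here is a minimal sketch in Python. It assumes the colon-separated layout used by recent Illumina software (instrument, run number, flow cell ID, lane, tile, x, y, optionally followed by a space and a comment); other platforms and repositories use different layouts. The names `IlluminaHeader` and `parse_illumina_header` and the sample identifier are hypothetical, chosen for this example.

```python
from typing import NamedTuple


class IlluminaHeader(NamedTuple):
    instrument: str
    run_number: int
    flowcell_id: str
    lane: int
    tile: int
    x: int
    y: int


def parse_illumina_header(line: str) -> IlluminaHeader:
    """Parse an identifier line, ASSUMING the Illumina-style 7-field layout."""
    if not line.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    # Drop the leading '@' and ignore any comment after the first space.
    name = line[1:].strip().split(" ", 1)[0]
    fields = name.split(":")
    if len(fields) != 7:
        raise ValueError(f"expected 7 colon-separated fields, got {len(fields)}")
    instrument, run, flowcell, lane, tile, x, y = fields
    return IlluminaHeader(
        instrument, int(run), flowcell, int(lane), int(tile), int(x), int(y)
    )


# Made-up identifier for demonstration only.
header = parse_illumina_header("@SIM:1:FCX:1:15:6329:1045 1:N:0:2")
print(header.instrument, header.run_number, header.tile, header.x, header.y)
```

Identifiers that do not follow this layout raise a `ValueError` rather than being silently misread, which is usually the safer behavior when the origin of a file is uncertain.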
The second line contains the Raw Sequence Data, which is the actual genetic sequence determined by the machine. It is a string composed of the four nucleotide bases: Adenine (A), Thymine (T), Cytosine (C), and Guanine (G). The letter ‘N’ is used to represent bases that could not be confidently identified.
The third line serves as a Separator Line and always begins with the `+` symbol. This line acts as a spacer between the sequence and the quality information. It may sometimes repeat the read identifier from the first line.
The fourth line is the Quality Score String. This string is a sequence of characters that corresponds exactly to the length of the raw sequence on the second line. Every character is an encoded representation of the confidence score for the base directly above it.
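The four-line layout described above can be read with a few lines of Python. The following is a minimal sketch, not a replacement for an established parser: it assumes strictly four lines per record (no wrapped sequences) and uncompressed text. The names `FastqRecord` and `read_fastq` and the sample reads are made up for this illustration.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    identifier: str  # line 1 without the leading '@'
    sequence: str    # line 2
    quality: str     # line 4, one character per base


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a text handle, assuming exactly four lines each."""
    while True:
        header, sequence, separator, quality = (
            handle.readline().rstrip("\r\n") for _ in range(4)
        )
        if not header:
            return  # end of file (an empty line also stops the reader)
        if not header.startswith("@"):
            raise ValueError(f"identifier line must start with '@': {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"separator line must start with '+': {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two made-up records; the second repeats its identifier on the '+' line.
example = io.StringIO(
    "@read1\nGATTACAN\n+\nIIIIHH#!\n"
    "@read2\nACGT\n+read2\nFFFF\n"
)
for record in read_fastq(example):
    print(record.identifier, record.sequence, record.quality)
```

The length check enforces the rule that every base has exactly one quality character; a mismatch usually signals a truncated or corrupted file.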
Interpreting Phred Quality Scores
The characters in the fourth line represent the Phred quality score, or Q score. This score is a logarithmic measure of the probability that the corresponding base call is incorrect. Higher Q scores indicate a lower probability of error and greater confidence in the base call.
The Phred scale translates the error probability into a simple integer score. For example, a Phred score of 10 (Q10) means a 1 in 10 chance (10%) that the base is wrong, or 90% accuracy. A Q20 score means a 1 in 100 chance of error (99% accuracy), and Q30 indicates 99.9% accuracy.
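The relationship behind these figures is `Q = -10 * log10(P)`, where `P` is the probability that the base call is wrong; equivalently, `P = 10 ** (-Q / 10)`. The short Python sketch below converts in both directions. The function names are chosen for this example only.

```python
import math


def phred_to_error_probability(q: float) -> float:
    """Return the error probability P for a Phred score Q: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p: float) -> float:
    """Return the Phred score Q for an error probability P: Q = -10*log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30):
    p = phred_to_error_probability(q)
    print(f"Q{q}: error probability {p:.4f}, accuracy {100 * (1 - p):.1f}%")
```

Because the scale is logarithmic, every increase of 10 in the score corresponds to a tenfold drop in the chance of error.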
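These numerical scores are stored as single ASCII characters, which keeps the file compact. In the most common encoding, known as Phred+33 and used by Sanger-style files and by Illumina pipelines from version 1.8 onward, the character's ASCII code equals the quality score plus 33, so `!` represents Q0 and `I` represents Q40. Because each score occupies exactly one character, the quality string always has the same length as the sequence string. Translating these characters back into numerical scores is an automatic first step for bioinformatics tools, allowing researchers to filter out unreliable data.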
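The translation from characters to scores can be sketched in Python as follows. This example assumes the Phred+33 encoding, in which the score is the character's ASCII code minus 33; some older Illumina files use an offset of 64 instead, so the offset is left as a parameter. The mean-quality filter is a deliberately simple heuristic for illustration, and the function names and the threshold of 20 are assumptions rather than a recommended standard.

```python
from typing import List

PHRED_OFFSET = 33  # assumption: Phred+33 encoding


def decode_quality(quality: str, offset: int = PHRED_OFFSET) -> List[int]:
    """Convert a quality string into a list of integer Phred scores."""
    scores = [ord(char) - offset for char in quality]
    if any(score < 0 for score in scores):
        raise ValueError("character below the offset; wrong encoding assumed?")
    return scores


def passes_mean_quality(quality: str, min_mean_q: float = 20.0) -> bool:
    """Keep a read only if its mean Phred score reaches the threshold."""
    scores = decode_quality(quality)
    return bool(scores) and sum(scores) / len(scores) >= min_mean_q


print(decode_quality("!5?I"))
print(passes_mean_quality("IIII"), passes_mean_quality("####"))
```

Averaging Phred scores is a rough summary because the scale is logarithmic; dedicated trimming tools typically apply more careful criteria, such as sliding-window checks.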