What Is a FASTA File? The Format Explained

The FASTA format is a simple, text-based standard used in bioinformatics to represent biological sequence data, such as DNA, RNA, or protein sequences. It serves as a foundational file type for storing and exchanging genetic information between researchers and computational tools. The format uses single-letter codes to represent the individual units of a sequence, like the nucleotides A, T, C, and G for DNA, or the 20 different amino acids for proteins. This straightforward design ensures that the data is readable by both humans and sophisticated bioinformatics software.

Why the FASTA Format Exists

The FASTA format originated with the sequence similarity search software written by David Lipman and William Pearson: FASTP, published in 1985, and its successor FASTA, published in 1988. The primary need at the time was a file format that could efficiently handle the growing volume of sequence data and be easily processed by early computer systems. Its design was intentionally minimalistic to prioritize efficiency and compatibility across different software platforms.

Unlike binary formats, FASTA files are plain text, so they can be read and written without specialized libraries. This simplicity makes the files easy to parse: a program only needs to read the file line by line and check whether each line starts a new record. The format rapidly became a standardized input and output method for various bioinformatics tools, promoting data exchange and reproducibility across the field.

Decoding the FASTA File Structure

A single sequence within a FASTA file, known as a FASTA record, is composed of two parts: the definition line and the sequence lines. The definition line, or header, is the first line of the record and is immediately recognizable because it must begin with a greater-than symbol (`>`). This symbol marks the start of each new record, which is how multiple sequences stored in a single file are kept apart.
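The record structure described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference parser: the function name `parse_fasta` and the two example records are made up for this sketch, and it assumes the whole file fits in memory as a string.

```python
from typing import Iterator, List, Optional, Tuple


def parse_fasta(text: str) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted text.

    The header is returned without its leading '>' and the sequence
    lines of each record are joined into a single string.
    """
    header: Optional[str] = None
    chunks: List[str] = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A '>' line closes the previous record and opens a new one.
            if header is not None:
                yield header, "".join(chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    # Emit the final record, which has no following '>' line to close it.
    if header is not None:
        yield header, "".join(chunks)


# Two made-up records, used only to exercise the parser.
example = """>seq1 hypothetical example record
ACGTACGTAC
GTACGT
>seq2 another made-up record
TTGACA
"""

records = list(parse_fasta(example))
for header, sequence in records:
    print(header, len(sequence))
```

For real work, an established library such as Biopython's `SeqIO` module is usually a better choice than a hand-written parser.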

The content following the `>` symbol provides crucial metadata for the sequence. The first word after the symbol is typically the unique identifier for the sequence, often known as the SeqID. The rest of the line is an optional, free-text description that can include details like the organism, gene name, or database accession number. For instance, a sequence downloaded from the National Center for Biotechnology Information (NCBI) database will often include a standardized set of identifiers separated by vertical bars (`|`).
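As a rough sketch of how a definition line breaks down, the following Python splits a header into its identifier and description, then splits the identifier on vertical bars. The helper name `split_header` and the header text are invented for illustration; real database headers follow conventions specific to each database.

```python
from typing import Tuple


def split_header(header_line: str) -> Tuple[str, str]:
    """Split a definition line into (identifier, description).

    The identifier is the first whitespace-delimited word after '>';
    the description is whatever follows, or an empty string.
    """
    body = header_line[1:] if header_line.startswith(">") else header_line
    parts = body.strip().split(maxsplit=1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


# A made-up header in the bar-separated style; not a real accession.
seq_id, description = split_header(
    ">db|ABC123|EXAMPLE_GENE hypothetical protein [Example organism]"
)
id_fields = seq_id.split("|")

print(seq_id)
print(description)
print(id_fields)
```

Splitting only on the first run of whitespace keeps the description intact even when it contains spaces of its own.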

Immediately following the definition line are the sequence lines, which contain the actual biological sequence data. A common practice is to wrap the sequence lines so that each line contains no more than 80 characters. This line wrapping is not strictly necessary for modern software but improves human readability when viewing the file in a text editor.
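Line wrapping is simple to reproduce when writing records. The sketch below assumes an 80-character default width; the function names `wrap_sequence` and `format_record` are hypothetical helpers, not part of any library.

```python
def wrap_sequence(sequence: str, width: int = 80) -> str:
    """Break a sequence into lines of at most `width` characters."""
    if width <= 0:
        raise ValueError("width must be a positive integer")
    return "\n".join(
        sequence[i:i + width] for i in range(0, len(sequence), width)
    )


def format_record(header: str, sequence: str, width: int = 80) -> str:
    """Format one FASTA record: a '>' definition line plus wrapped sequence."""
    return f">{header}\n{wrap_sequence(sequence, width)}\n"


# A 200-character made-up sequence wraps into lines of 80, 80, and 40.
wrapped = wrap_sequence("A" * 200)
line_lengths = [len(line) for line in wrapped.split("\n")]
print(line_lengths)

print(format_record("seq1 demo", "ACGTACGT", width=4), end="")
```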

Common file extensions include `.fasta` and `.fa` for any sequence type, `.fna` for nucleotide sequences, and `.faa` for amino acid (protein) sequences. Many parsers ignore spaces, digits, and blank lines within the sequence data itself, but this tolerance is not universal, so it is safest to keep sequence lines free of them.
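When sequence data has been copied from a source that numbers or spaces its lines, it can be normalized before being handed to other tools. The following is a minimal sketch of that cleanup; the name `clean_sequence` is made up here, and the choice to strip only whitespace and digits (leaving letter case untouched) is an assumption of this example.

```python
import re

# Matches runs of whitespace (spaces, tabs, newlines) and digits.
_NON_SEQUENCE = re.compile(r"[\s\d]+")


def clean_sequence(raw: str) -> str:
    """Remove whitespace and digits from raw sequence text.

    Letter case is preserved, since some files use lowercase
    letters to carry extra meaning.
    """
    return _NON_SEQUENCE.sub("", raw)


# Made-up input in a numbered, space-separated layout.
raw_block = "  1 ACGT ACGT\n 11 TTGA\n\n"
cleaned = clean_sequence(raw_block)
print(cleaned)
```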

Essential Uses of FASTA Files in Research

FASTA files serve as the fundamental data input for a wide array of computational biology applications. One of the most common applications is sequence alignment, where tools like BLAST (Basic Local Alignment Search Tool) use a FASTA file as a query to search vast public databases for similar sequences. This process helps researchers identify homologous genes or proteins in different species, which offers insights into function and evolutionary relationships.

The format is also integral to the process of genome assembly. When a genome is sequenced, the resulting short fragments of DNA, called reads, are assembled into longer, contiguous segments. These final assembled sequences, or contigs, are typically stored and shared in the FASTA format, representing the reconstructed genome or chromosome. This provides a reference sequence for subsequent annotation and analysis.

FASTA is also a standard exchange format for major public biological databases. GenBank and UniProt both offer sequence downloads in FASTA alongside their richer native record formats, and many sequence submission tools accept FASTA input. Furthermore, the format is used in phylogenetic analysis, where the sequences are aligned to build evolutionary trees that depict the relationships between different organisms or genes.