The Basic Local Alignment Search Tool (BLAST) is used in bioinformatics to find regions of similarity between biological sequences, such as DNA, RNA, or protein. When a researcher searches a query sequence against a database of known sequences, the tool identifies potential matches and assigns a statistical measure to them. Since random sequences can align by chance, especially over short segments, a statistical framework is necessary to distinguish true biological relationships from random noise.
Understanding the Expectation Value Concept
The E-value, or Expectation value, is the statistical measure that BLAST uses to report how significant an alignment is. It represents the number of alignments with a score equal to or better than the observed score that one would expect to find purely by chance in the database being searched. This value is not a probability, and it can exceed \(1\); rather, it is a measure of the expected background noise for a given search space.
An E-value of \(1\) means that the search is expected to turn up one match of equal or greater quality by random chance alone. A much lower E-value, such as \(0.001\), means that only one such random match would be expected in \(1{,}000\) similar searches, so the match is far less likely to be a chance occurrence. The closer the E-value is to zero, the more statistically significant the alignment is considered. Statistical significance is evidence for a genuine relationship between the query and the matched sequence, but it does not by itself establish biological meaning.
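The distinction between an expected count and a probability can be made concrete with a short sketch. It assumes the standard model in which the number of chance hits scoring at least as well as the observed one follows a Poisson distribution with mean \(E\), so that the probability of at least one such hit is \(P = 1 - e^{-E}\). The function name `evalue_to_pvalue` is chosen here for illustration and is not part of BLAST.

```python
import math


def evalue_to_pvalue(evalue: float) -> float:
    """Probability of at least one chance hit, assuming a Poisson model.

    P = 1 - exp(-E). Written with expm1 so that very small E-values
    do not lose precision to floating-point rounding.
    """
    return -math.expm1(-evalue)


# Compare the expected count (E) with the corresponding probability (P).
for e in (10.0, 1.0, 0.001):
    print(f"E = {e:g}  ->  P = {evalue_to_pvalue(e):.6f}")
```

Under this model the two quantities are nearly equal when \(E\) is well below \(1\), and they diverge as \(E\) grows, since \(P\) can never exceed \(1\).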
How the Search Database Affects Significance
The size of the sequence database directly influences the E-value calculation. The E-value is proportional to the total size of the search space, which is roughly the query length multiplied by the total length of the database, because a larger database offers more chances of finding a match by accident. Consequently, the exact same alignment receives a higher (less significant) E-value in a larger database than in a smaller one: doubling the database size approximately doubles the E-value.
For instance, an alignment with a relatively low E-value in a small, specialized database might have a much higher E-value when searched against the massive NCBI non-redundant (nr) database. This difference occurs because the sheer volume of sequences in the larger database provides more opportunities for a random pairing to achieve a high score. Therefore, interpreting the E-value must always account for the size of the database used during the search.
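The proportionality described above can be sketched as a simple rescaling. This is a rough approximation that ignores the effective-length corrections BLAST applies internally; the function name and both database sizes are hypothetical values chosen for illustration, not real database statistics.

```python
def rescale_evalue(evalue: float, old_db_size: float, new_db_size: float) -> float:
    """Approximate the E-value the same alignment would get in another database.

    Assumes E scales linearly with the total database length and
    ignores BLAST's effective-length (edge-effect) corrections.
    """
    return evalue * (new_db_size / old_db_size)


# Hypothetical sizes, in residues: a small curated set vs. a very large database.
SMALL_DB = 1_000_000
LARGE_DB = 1_000_000_000

e_small = 1e-6
e_large = rescale_evalue(e_small, SMALL_DB, LARGE_DB)
print(f"E in small database: {e_small:g}")
print(f"E in large database: {e_large:g}")
```

Because the alignment itself is unchanged, only the search-space term differs between the two searches; this is why E-values from searches against different databases should not be compared directly.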
The Role of the Bit Score in E-value Calculation
The E-value is mathematically derived from the Bit Score, a normalized measure of the alignment quality that provides the basis for statistical significance. The Bit Score is calculated from the raw alignment score, which is the sum of the substitution matrix values and gap penalties for the aligned segments. However, the raw score depends on the specific scoring system used, so raw scores obtained with different matrices or gap penalties cannot be compared directly.
The Bit Score standardizes the raw score \(S\) using the statistical parameters \(\lambda\) and \(K\) of the scoring system: \(S' = (\lambda S - \ln K) / \ln 2\). This normalization makes Bit Scores comparable across different scoring matrices and gap penalties, and the Bit Score does not depend on the size of the search database. A higher Bit Score corresponds to a better alignment under the scoring system used.
Because the E-value decreases exponentially with the Bit Score, \(E = m n \, 2^{-S'}\) for a query of length \(m\), a database of total length \(n\), and a Bit Score \(S'\), each additional bit halves the E-value, and an increase of \(10\) bits lowers it roughly a thousandfold. While the Bit Score provides a measure of alignment quality that is independent of the search, the E-value adjusts that measure for the context of the search space, providing an easily interpretable number for statistical expectation.
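A minimal sketch of this normalization, assuming the Karlin-Altschul form \(S' = (\lambda S - \ln K) / \ln 2\), where \(S\) is the raw score. The values of `LAMBDA` and `K` below are illustrative only; the parameters that apply to a given search are printed in the statistics footer of the BLAST report and depend on the matrix and gap penalties used.

```python
import math


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Convert a raw alignment score to bits: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


# Illustrative parameters only (assumption); read the real ones
# from the footer of your own BLAST output.
LAMBDA = 0.267
K = 0.041

for raw in (50, 100, 200):
    print(f"raw score {raw:>3} -> {bit_score(raw, LAMBDA, K):.1f} bits")
```

Two searches run with different scoring systems will have different \(\lambda\) and \(K\), which is what allows their Bit Scores, unlike their raw scores, to be placed on a common scale.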
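The exponential relationship can be illustrated with the standard formula \(E = m n \, 2^{-S'}\), where \(S'\) is the Bit Score, \(m\) the query length, and \(n\) the total database length. The sketch below uses hypothetical lengths and, as a simplification, raw rather than effective lengths, so its numbers will not exactly match a real BLAST report.

```python
def evalue_from_bit_score(bits: float, query_len: int, db_len: int) -> float:
    """E = m * n * 2**(-S'), with m and n measured in residues."""
    return query_len * db_len * 2.0 ** (-bits)


# Hypothetical search space (assumption, not real database statistics).
M = 300            # query length
N = 100_000_000    # total database length

for bits in (40, 41, 50, 60):
    print(f"{bits} bits -> E = {evalue_from_bit_score(bits, M, N):.3g}")
```

Each extra bit halves the E-value, so a gain of ten bits reduces it by a factor of \(2^{10} = 1024\).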
Interpreting E-values and Choosing Cutoffs
Interpreting the E-value requires the user to set a threshold that balances the desire to find all potential matches against the need for high confidence in those matches. A common, strict threshold used in many research settings is an E-value less than \(1 \times 10^{-5}\), which suggests a high likelihood of a true homologous relationship. Matches with E-values this low are generally considered trustworthy hits that are highly unlikely to be random occurrences.
Choosing a cutoff is a practical decision that involves a trade-off between sensitivity and specificity. Setting a very stringent cutoff, such as \(1 \times 10^{-50}\), gives high specificity by reporting only very high-scoring matches, which are typically long alignments between closely related sequences. Conversely, setting a more relaxed cutoff, such as \(1\) or even \(10\), increases the sensitivity of the search, allowing for the discovery of more distantly related sequences at the cost of admitting more chance matches.
Researchers performing exploratory searches for distant homology may use a higher E-value cutoff to capture those weaker, but still potentially meaningful, relationships. For very short query sequences, a relaxed cutoff may also be necessary to find any matches at all, because a short alignment cannot accumulate a high score and even a perfect match may have an E-value above \(1\). The appropriate E-value cutoff depends on the specific biological question and the level of confidence required for the research.
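Applying a cutoff after the search can be sketched as a small filter over BLAST's tabular output. This example assumes the default column order of BLAST+ tabular format (`-outfmt 6`), in which the E-value is the 11th column and the Bit Score the 12th. The function name `filter_hits` and the three sample rows are made up for illustration and are not results from a real search.

```python
def filter_hits(lines, max_evalue=1e-5):
    """Keep hits whose E-value is at or below max_evalue.

    Expects tab-separated rows in the default tabular column order:
    qseqid sseqid pident length mismatch gapopen
    qstart qend sstart send evalue bitscore
    """
    kept = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        evalue = float(fields[10])
        bitscore = float(fields[11])
        if evalue <= max_evalue:
            kept.append((fields[0], fields[1], evalue, bitscore))
    return kept


# Made-up rows for illustration only.
SAMPLE = [
    "query1\tsubjA\t98.5\t200\t3\t0\t1\t200\t1\t200\t1e-100\t370",
    "query1\tsubjB\t35.2\t180\t110\t4\t5\t180\t10\t190\t2e-06\t52.0",
    "query1\tsubjC\t28.0\t60\t40\t2\t20\t79\t33\t92\t3.1\t28.5",
]

strict = filter_hits(SAMPLE, max_evalue=1e-5)
relaxed = filter_hits(SAMPLE, max_evalue=10)
print(f"strict cutoff keeps {len(strict)} hits, relaxed cutoff keeps {len(relaxed)}")
```

Filtering afterwards only narrows the result set. Hits above the cutoff given to BLAST itself are never written to the output, so an exploratory analysis would run the search with a relaxed cutoff and tighten it at this stage.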