Core Command Structure and Inputs
Executing `bcftools mpileup` requires two fundamental inputs. The command follows the structure `bcftools mpileup [options]
The first required input is a reference genome, specified using the `-f` option. This FASTA file acts as the baseline against which the sample’s aligned sequences are compared. It provides the coordinate system for the analysis, allowing the tool to assess each genomic position for differences. Without the reference, the alignment data would have no context.
A prerequisite for both the reference genome and alignment files is that they must be indexed. The reference FASTA file requires a `.fai` index, while BAM files need a corresponding `.bai` or `.csi` index. This indexing creates a separate file that acts as a table of contents. It allows `bcftools mpileup` to perform random-access lookups, jumping to any specific region without reading the entire file from the beginning.
This indexing facilitates rapid and efficient processing. A minimal command to start the process is `bcftools mpileup -f reference.fasta sample.bam`. This instruction directs the program to begin generating pileup information for the `sample.bam` file based on the coordinates and sequences within `reference.fasta`.
Essential Parameters for Customization
Beyond the basic inputs, `bcftools mpileup` offers parameters to refine its operation. A common first step is controlling the output format with the `-O` or `–output-type` option, while the `-o` or `–output` flag specifies a name for the output file. BCF is the binary counterpart to VCF, smaller and faster for computers to process, making it ideal for pipelines. The choices for output type are:
- `V` for a human-readable VCF file
- `Z` for a compressed VCF (.vcf.gz)
- `U` for an uncompressed BCF
- `B` for a compressed BCF
Generating reliable results involves filtering the input data to remove low-confidence alignments and bases. The `-q` or `–min-MQ` parameter sets a minimum mapping quality score, a value representing the confidence that a sequence read is aligned correctly. Similarly, the `-Q` or `–min-BQ` flag establishes a minimum base quality, filtering out individual nucleotide bases identified with low certainty. Applying these filters ensures that calculations are based on higher-quality data.
For targeted analyses, the `-r` or `–regions` option focuses the tool on specific genomic locations to save computational resources. A user can specify a chromosome, a list of chromosomes, or a precise interval, such as `chr7:5,524,912-5,524,913`. When this is used, `mpileup` will ignore all data outside of the defined region, making the process faster and the output file smaller.
Annotations can be added to the output to provide more context for each position. The `-a` or `–annotate` flag includes information in the INFO and FORMAT fields of the resulting VCF/BCF file. Common annotations include `DP` (total read depth), `AD` (allelic depth per sample), and `SP` (strand bias), which provide downstream tools with richer data for variant calls.
Interpreting the Pileup Output
The primary output of `bcftools mpileup` is not a definitive list of genetic variants, but a file detailing the evidence for variation at each genomic site. This file, in VCF or BCF format, contains genotype likelihoods for each sample at every position analyzed. These likelihoods are the foundational data used by a subsequent calling tool to make a final determination. Understanding this output means focusing on the FORMAT column, which describes the data associated with each sample.
Within the FORMAT column, a primary field generated by `mpileup` is `PL`, which stands for Phred-scaled genotype likelihoods. The `PL` field contains three comma-separated integer values for each diploid sample. These numbers represent the statistical likelihoods of the three possible genotypes: homozygous for the reference allele (0/0), heterozygous (0/1), and homozygous for the alternate allele (1/1).
The values are Phred-scaled, which is a logarithmic transformation that makes them easier to work with. A value of 0 corresponds to the highest likelihood, while larger numbers indicate less likely genotypes. For example, a `PL` tag of `120,0,90` for a sample suggests that the heterozygous genotype (0/1) is the most probable, as its Phred-scaled likelihood is 0. The homozygous-reference genotype (0/0) is significantly less likely with a score of 120, and the homozygous-alternate genotype (1/1) is also unlikely with a score of 90.
Integration into a Variant Calling Pipeline
The `bcftools mpileup` command is rarely used in isolation; it is designed as the first part of a two-step process for variant discovery. Its output, containing genotype likelihoods, serves as the direct input for the `bcftools call` command. This division of labor is efficient: `mpileup` handles sifting through raw alignment data, while `call` performs the statistical inference to identify variants.
This workflow is implemented using a pipe (`|`) in the command line, which sends the output of one program directly to the input of another without creating an intermediate file. This creates an efficient pipeline. The `mpileup` command is configured to generate genotype likelihoods, and this information flows directly to `bcftools call`, which then evaluates these likelihoods to make a final judgment.
A common example of this integration is: `bcftools mpileup -Ou -f ref.fa sample.bam | bcftools call -mv -Ob -o variants.bcf`. In this command, `mpileup` processes the alignment file (`sample.bam`) against the reference (`ref.fa`) and produces uncompressed BCF output (`-Ou`). This output is piped to `bcftools call`, which uses a multiallelic calling model (`-m`), outputs only the variant sites (`-v`), and writes the final results to a compressed BCF file (`-Ob -o variants.bcf`).