A Manhattan plot is a specialized scatter plot used to summarize the results of a Genome-Wide Association Study (GWAS). A GWAS tests millions of genetic variants, most commonly Single Nucleotide Polymorphisms (SNPs), to identify those statistically associated with a specific disease or trait. The plot provides an overview of which genetic markers across the entire human genome show the strongest link to the condition. This helps researchers quickly pinpoint regions that warrant further biological investigation.
Understanding the Axes and Layout
The plot is named for its resemblance to the skyline of Manhattan, where towering peaks rise above a flat landscape of lower points. The X-axis represents the entire genome, beginning with chromosome 1 and continuing sequentially through chromosome 22, and often including the X and Y sex chromosomes. The genetic markers are plotted in their physical order along this axis, creating a continuous representation of all tested genomic locations.
Data points for adjacent chromosomes are typically presented in alternating colors to distinguish them. This coloring scheme allows easy identification of which chromosome an associated signal belongs to. Each dot on the plot represents a single genetic marker or SNP that was tested for association with the trait.
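The layout described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `cumulative_positions`, the toy coordinates, and the use of the largest observed position as a stand-in for chromosome length are all assumptions made for the example.

```python
from collections import defaultdict

def cumulative_positions(snps, chrom_order):
    """Map (chromosome, base-pair position) pairs onto one shared X-axis.

    snps: list of (chrom, pos) tuples.
    chrom_order: chromosome labels in left-to-right plotting order.
    Returns a list of (x, color_index) tuples in the same order as snps.
    """
    # Assumption: the largest observed position approximates chromosome length.
    chrom_max = defaultdict(int)
    for chrom, pos in snps:
        chrom_max[chrom] = max(chrom_max[chrom], pos)

    # Each chromosome is shifted right by the total length of those before it.
    offsets, running = {}, 0
    for chrom in chrom_order:
        offsets[chrom] = running
        running += chrom_max.get(chrom, 0)

    # Alternate between two colors (0 and 1) by chromosome rank.
    rank = {chrom: i for i, chrom in enumerate(chrom_order)}
    return [(offsets[c] + p, rank[c] % 2) for c, p in snps]

# Toy data: five markers on three hypothetical chromosomes.
toy = [("1", 100), ("1", 900), ("2", 50), ("2", 400), ("3", 10)]
layout = cumulative_positions(toy, ["1", "2", "3"])
```

Each returned pair gives the marker's horizontal coordinate and which of the two alternating colors it would receive.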
The Y-axis displays the strength of the association. This axis plots the negative logarithm (base 10) of the P-value, written as \(-\log_{10}(P)\).
The Y-Axis Scale
Using the negative logarithm means that a very small P-value is transformed into a large, positive number. For example, a P-value of \(0.01\) transforms to a Y-axis value of 2, while a much smaller P-value of \(0.00000001\) transforms to a value of 8. The higher a dot appears, the stronger the statistical evidence is for that marker’s association with the trait.
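The transformation is simple enough to check directly. The short Python sketch below reproduces the two values from the example; the helper name `neg_log10` is chosen for illustration only, and the input check is an added precaution because the logarithm of zero is undefined.

```python
import math

def neg_log10(p):
    """Return -log10(p) for a P-value in the interval (0, 1]."""
    if not 0 < p <= 1:
        raise ValueError("P-value must be greater than 0 and at most 1")
    return -math.log10(p)

# The two P-values used in the text.
weak = neg_log10(0.01)
strong = neg_log10(0.00000001)
print(round(weak, 2), round(strong, 2))
```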
Decoding the Significance Thresholds
A GWAS involves performing millions of individual statistical tests, one for every SNP across the genome. When so many tests are conducted simultaneously, the probability of finding a false positive result increases dramatically. Statistical significance must be adjusted to account for this multiple testing problem.
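To see how quickly false positives accumulate, the sketch below computes the chance of at least one false positive among \(m\) tests, \(1 - (1 - \alpha)^m\), along with the expected number of false positives, \(\alpha m\). It assumes the tests are independent, which real SNPs are not, so the numbers are an illustration rather than an exact description of a GWAS. The function names are hypothetical.

```python
def familywise_error_rate(alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

def expected_false_positives(alpha, n_tests):
    """Expected number of false positives if every null hypothesis is true."""
    return alpha * n_tests

# Compare a single test with one hundred and one million tests at alpha = 0.05.
for m in (1, 100, 1_000_000):
    fwer = familywise_error_rate(0.05, m)
    count = expected_false_positives(0.05, m)
    print(m, round(fwer, 4), round(count, 2))
```

With an unadjusted cutoff of 0.05, a million tests would be expected to yield tens of thousands of spurious hits under these assumptions.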
A horizontal line is drawn across the plot to represent the genome-wide significance threshold. This threshold is most often derived from the Bonferroni correction, which divides the acceptable overall false-positive rate by the number of independent tests. The generally accepted standard for genome-wide significance is a P-value of \(5 \times 10^{-8}\).
On the \(-\log_{10}(P)\) scale, this P-value corresponds to a Y-axis value of approximately 7.3. Any data point that rises above this horizontal threshold line is considered a genome-wide significant association. A second, lower line is sometimes included to indicate a suggestive significance level, guiding researchers toward regions that may contain weaker signals.
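The arithmetic behind the threshold line can be checked with a short Python sketch. It assumes the commonly cited figures of a 0.05 overall error rate and roughly one million independent tests; the true number of independent tests varies by study and population, so these inputs are illustrative.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test P-value cutoff under a Bonferroni correction."""
    return alpha / n_tests

# Assumed inputs: alpha = 0.05 and about one million independent tests.
threshold = bonferroni_threshold(0.05, 1_000_000)

# Height of the horizontal line on the -log10(P) axis.
line_height = -math.log10(threshold)
print(threshold, round(line_height, 1))
```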
Identifying True Genetic Associations
The most visually striking features are the clusters of points known as “peaks.” These peaks represent a strong concentration of highly significant SNPs within a defined region of the genome. Each peak indicates a genetic locus associated with the trait being studied.
The height of the peak reflects the strength of the association, with the single highest point representing the most statistically significant SNP in that region. The width of the peak depends on how many neighboring markers are correlated with that SNP and how far along the chromosome the correlation extends. Many SNPs in a small region tend to rise together due to a phenomenon called Linkage Disequilibrium (LD).
Linkage Disequilibrium is the non-random association of alleles, meaning that groups of nearby genetic markers tend to be inherited together as a block. The highest point on the peak, known as the index SNP, is often a proxy marker for the actual functional variant. The goal is to locate the peak’s position on the X-axis to identify the chromosome and genomic coordinates of the associated locus.
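Picking one index SNP per peak can be sketched with a simple distance rule: take the most significant SNP, treat every significant SNP within a fixed window on the same chromosome as part of the same peak, and repeat. This is a simplified stand-in for the LD-based clumping that real analyses typically use. The 500 kb window, the function name `lead_snps`, and the SNP identifiers are assumptions made for the example.

```python
def lead_snps(results, threshold=5e-8, window=500_000):
    """Pick one lead (index) SNP per peak using a distance-based rule.

    results: list of (snp_id, chrom, pos, p_value) tuples.
    Returns the lead SNPs, most significant first.
    """
    # Keep only genome-wide significant SNPs, sorted by ascending P-value.
    hits = sorted((r for r in results if r[3] < threshold), key=lambda r: r[3])

    leads = []
    for snp in hits:
        # A SNP starts a new locus only if no stronger lead SNP is nearby.
        nearby = any(
            snp[1] == lead[1] and abs(snp[2] - lead[2]) <= window
            for lead in leads
        )
        if not nearby:
            leads.append(snp)
    return leads

# Hypothetical summary statistics: (id, chromosome, position, P-value).
toy_results = [
    ("rsA", "1", 1_000_000, 1e-12),
    ("rsB", "1", 1_050_000, 3e-9),
    ("rsC", "1", 1_200_000, 0.2),
    ("rsD", "6", 30_000_000, 4e-10),
    ("rsE", "6", 32_000_000, 2e-8),
]
lead_ids = [snp[0] for snp in lead_snps(toy_results)]
print(lead_ids)
```

In this toy data, the second significant SNP on chromosome 1 is absorbed into the peak led by its stronger neighbor, while the two chromosome 6 signals are far enough apart to count as separate loci.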
What the Plot Does Not Show
The plot is designed to show statistical association: how often a genetic marker appears in people with the trait compared to those without it. The plot does not communicate direct biological causation, meaning the presence of a peak does not prove the associated SNP is the cause of the disease.
A significant peak only identifies a genomic region of interest, and the highest point is not necessarily the true functional variant. The true causal SNP could be any of the points within the cluster. Further investigative steps, known as fine-mapping, are required to sift through the associated variants and pinpoint the exact causal change.