GWAS Summary Statistics: What They Are and How to Use Them

Understand how aggregated data from genome-wide association studies powers genetic research, from its generation to its application and critical interpretation.

Genome-Wide Association Studies (GWAS) scan the genomes of many individuals to find genetic variants associated with a particular disease or trait. Instead of sharing sensitive individual-level data, researchers publish GWAS summary statistics, which are aggregated, anonymous results for each genetic variant. This practice allows data from millions of individuals to be combined and re-analyzed for new research without compromising participant privacy, fueling large-scale meta-analyses and helping to inform the development of new therapies.

Core Components of GWAS Summary Statistics

A GWAS summary statistics file is a text file containing a table of results where each row corresponds to a single genetic variant. The first columns identify the variant, including its chromosome, base-pair position, and an identifier (rsID). The file should also state which version of the human genome reference build was used (e.g., GRCh37 or GRCh38), as positions change between builds. To interpret the results, columns specify the two alleles of the variant: the “effect allele” (the allele whose effect is reported) and the “non-effect allele,” along with the frequency of the effect allele (EAF) in the study population.

The core of the file lies in the columns describing the statistical association. For continuous traits like height, the effect size is a beta coefficient, representing the change in the trait per copy of the effect allele. For case-control studies of a disease, the effect is usually reported as an odds ratio (OR) or as its natural logarithm, the log-odds ratio. The effect size is accompanied by its standard error (SE), a measure of its precision; for case-control results the SE refers to the log-odds scale. The p-value indicates the statistical significance of the association, and many files also report the sample size (N) used to test each variant.
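To make the relationship between these columns concrete, the sketch below parses a tiny made-up table and recomputes a z-score and two-sided p-value from each beta and SE. The column names (CHR, BP, SNP, EA, NEA, EAF, BETA, SE, N), the variant identifiers, and all numbers are assumptions for illustration; real files use varying headers, so check the study's documentation.

```python
import csv
import io
import math

# Hypothetical three-row file; header names and values are invented.
EXAMPLE = (
    "CHR\tBP\tSNP\tEA\tNEA\tEAF\tBETA\tSE\tN\n"
    "1\t1000\tvariant_1\tA\tG\t0.30\t0.050\t0.010\t50000\n"
    "2\t2000\tvariant_2\tT\tC\t0.12\t-0.020\t0.015\t50000\n"
    "3\t3000\tvariant_3\tC\tT\t0.45\t0.001\t0.008\t49000\n"
)


def two_sided_p(beta, se):
    """Wald test: z = beta / se, with a two-sided p-value from the standard normal."""
    z = beta / se
    return z, math.erfc(abs(z) / math.sqrt(2))


def odds_ratio_to_beta(odds_ratio):
    """Convert an odds ratio to the log-odds scale, where the SE is defined."""
    return math.log(odds_ratio)


rows = list(csv.DictReader(io.StringIO(EXAMPLE), delimiter="\t"))
for row in rows:
    z, p = two_sided_p(float(row["BETA"]), float(row["SE"]))
    print(f'{row["SNP"]}  z={z:+.2f}  p={p:.2e}')
```

Recomputing p-values this way is also a quick consistency check: if the recomputed values disagree strongly with a file's reported p-values, the effect column may be on a different scale than assumed.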

The Process of Generating GWAS Summary Statistics

The process begins by collecting biological samples and detailed phenotype information, such as medical history, from thousands or millions of participants. These studies are often designed as case-control, comparing individuals with a disease to those without, or as population cohorts measuring a quantitative trait.

From the samples, DNA is extracted and genotyped using microarrays that read hundreds of thousands to millions of known single-nucleotide polymorphisms (SNPs). Imputation then uses a dense reference panel of sequenced genomes to statistically infer genotypes for millions of additional variants that were not directly measured. Quality control filters out unreliable data at both the individual and variant level, typically before imputation and again before the association analysis.

With the cleaned and imputed data, a statistical test is performed for every SNP. A regression model, adjusted for potential confounding factors like age, sex, and genetic ancestry, is used to test the association between each variant and the trait. From the output of each regression, the summary statistics—effect size, standard error, and p-value—are extracted and compiled into the final file.
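As a rough illustration of the per-variant test, the sketch below fits an ordinary least-squares regression of a quantitative trait on allele dosage for a single variant. It is deliberately simplified: covariates such as age, sex, and ancestry components are left out, the function name is hypothetical, and the toy data are invented. Real pipelines use dedicated GWAS software and repeat this for every variant.

```python
import math


def single_variant_ols(dosages, phenotype):
    """Regress phenotype on allele dosage (0, 1 or 2 copies of the effect allele).

    Returns (beta, se, z). Covariates are omitted to keep the sketch short.
    """
    n = len(dosages)
    mean_g = sum(dosages) / n
    mean_y = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in dosages)
    sxy = sum((g - mean_g) * (y - mean_y) for g, y in zip(dosages, phenotype))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    # Residual variance with n - 2 degrees of freedom (intercept + slope).
    rss = sum((y - intercept - beta * g) ** 2 for g, y in zip(dosages, phenotype))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, beta / se


# Toy data, invented for illustration only.
dosages = [0, 0, 1, 1, 1, 2, 2, 0, 1, 2]
trait = [1.0, 1.2, 1.5, 1.4, 1.7, 2.1, 1.9, 0.9, 1.6, 2.0]
beta, se, z = single_variant_ols(dosages, trait)
print(f"beta={beta:.3f}  se={se:.3f}  z={z:.2f}")
```

The three returned values correspond to the effect size, standard error, and test statistic that end up in one row of the summary statistics file.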

Principal Uses and Applications

A primary application of GWAS summary statistics is meta-analysis, where results from multiple GWAS of the same trait are combined. This approach increases statistical power by boosting the total sample size, allowing researchers to discover genetic variants with small effects that are not detectable in a single study.
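One common way to combine studies is an inverse-variance weighted fixed-effect meta-analysis, sketched below for a single variant. It assumes every study reports the effect for the same effect allele and on the same scale; the function name and example numbers are hypothetical.

```python
import math


def fixed_effect_meta(betas, ses):
    """Inverse-variance weighted fixed-effect estimate for one variant.

    Each study is weighted by 1 / se^2, so more precise studies count more.
    Returns (pooled_beta, pooled_se, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total
    se = math.sqrt(1.0 / total)
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2))
    return beta, se, p


# Three hypothetical studies of the same variant.
pooled_beta, pooled_se, pooled_p = fixed_effect_meta(
    betas=[0.04, 0.05, 0.03],
    ses=[0.020, 0.025, 0.015],
)
print(f"beta={pooled_beta:.4f}  se={pooled_se:.4f}  p={pooled_p:.2e}")
```

The pooled standard error is smaller than that of any single study, which is the gain in power the paragraph above describes.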

Summary statistics are the foundation for calculating polygenic risk scores (PRS). A PRS aggregates the small effects of many variants across the genome into a single score that predicts an individual’s genetic predisposition to a disease. These scores can help identify individuals at high risk for conditions like heart disease or breast cancer, potentially enabling earlier prevention strategies.
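In its simplest form, a PRS is a weighted sum: for each variant, the number of effect alleles a person carries is multiplied by that variant's GWAS effect size. The sketch below shows only this core arithmetic. The variant identifiers, weights, and function name are hypothetical, and practical methods also account for LD between variants and select or shrink the weights.

```python
def polygenic_score(dosages, weights):
    """Sum of (effect size x effect-allele count) over variants present in both inputs.

    dosages: {variant_id: copies of the effect allele (0-2)} for one person
    weights: {variant_id: per-allele effect size from the GWAS}
    Returns (score, number_of_variants_used).
    """
    shared = dosages.keys() & weights.keys()
    score = sum(weights[v] * dosages[v] for v in shared)
    return score, len(shared)


# Hypothetical weights and one person's genotypes.
gwas_weights = {"variant_1": 0.10, "variant_2": -0.20, "variant_3": 0.05}
person = {"variant_1": 2, "variant_2": 1, "variant_4": 1}

score, n_used = polygenic_score(person, gwas_weights)
print(f"score={score:.3f} using {n_used} variants")
```

Dosages must count the same allele that the GWAS used as its effect allele; otherwise the sign of that variant's contribution is wrong.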

Mendelian Randomization (MR) uses summary statistics to investigate causal relationships between different traits. For example, MR can use genetic variants associated with cholesterol as a natural experiment to test whether lifelong higher cholesterol causes heart disease. By using genetic variants as proxies for an exposure, MR provides evidence for causality that is less prone to the confounding and reverse-causation biases of observational studies, provided the variants influence the outcome only through that exposure.
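A minimal sketch of two widely used summary-data MR estimators follows: the Wald ratio for a single variant and the inverse-variance weighted (IVW) estimate across several variants. It assumes the variants are independent, that exposure and outcome effects are aligned to the same allele, and that uncertainty in the exposure effects can be ignored (a first-order approximation). Function names and numbers are hypothetical.

```python
import math


def wald_ratio(beta_exposure, beta_outcome, se_outcome):
    """Causal estimate from one variant: outcome effect divided by exposure effect."""
    estimate = beta_outcome / beta_exposure
    se = se_outcome / abs(beta_exposure)  # first-order approximation
    return estimate, se


def ivw_mr(instruments):
    """Inverse-variance weighted combination of Wald ratios.

    instruments: list of (beta_exposure, beta_outcome, se_outcome) tuples.
    Equivalent to a weighted regression through the origin.
    """
    num = sum(bx * by / sy ** 2 for bx, by, sy in instruments)
    den = sum(bx ** 2 / sy ** 2 for bx, by, sy in instruments)
    return num / den, math.sqrt(1.0 / den)


# Hypothetical instruments: (effect on exposure, effect on outcome, SE of outcome effect).
variants = [(0.20, 0.10, 0.02), (0.10, 0.06, 0.03), (0.30, 0.14, 0.02)]
estimate, se = ivw_mr(variants)
print(f"IVW causal estimate={estimate:.3f}  se={se:.3f}")
```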

These datasets also enable other analyses, including:

  • Estimating a trait’s heritability, which is the proportion of variation in a trait attributable to genetics.
  • Using methods like LD score regression to differentiate between true genetic signals and confounding biases.
  • Identifying pleiotropy, where a single genetic variant influences multiple different traits.
  • Informing drug discovery by highlighting genes and biological pathways implicated in disease.

Accessing, Formatting, and Initial Quality Checks

GWAS summary statistics are available through public databases and consortia websites. The NHGRI-EBI GWAS Catalog is a primary repository that stores summary data and curates published associations. Large research consortia, such as the Psychiatric Genomics Consortium (PGC) and the GIANT consortium, also make their summary statistics files accessible through their own portals.

These files are provided as text files, but formats can vary between studies. A first step for any user is to read the documentation to understand the column headers and ensure the data is interpreted correctly. Before use, the file should be cleaned, standardized, and checked for quality, a step often called “munging,” to verify that the data is sound.

Researchers check that the reported allele frequencies are consistent with those in a population-matched reference panel and look for any extreme values for effect sizes and standard errors. Another check is to examine the distribution of p-values using a Quantile-Quantile (Q-Q) plot, which helps assess whether biases like uncorrected population stratification are present. Aligning variant identifiers and alleles to a consistent reference is also necessary before combining datasets.
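A numeric companion to the Q-Q plot is the genomic inflation factor, often written as lambda GC: the median observed chi-square statistic divided by the median expected under the null (about 0.455 for one degree of freedom). The sketch below computes it from p-values using only the standard library. The function name is hypothetical, and p-values must lie in (0, 1].

```python
import statistics

MEDIAN_CHI2_1DF = 0.4549364  # median of a chi-square distribution with 1 df


def genomic_inflation(p_values):
    """Lambda GC: median observed chi-square over the expected null median.

    Each two-sided p-value is converted back to a 1-df chi-square statistic
    via the squared normal quantile. Requires 0 < p <= 1.
    """
    normal = statistics.NormalDist()
    chi2 = [normal.inv_cdf(p / 2.0) ** 2 for p in p_values]
    return statistics.median(chi2) / MEDIAN_CHI2_1DF


# Evenly spaced p-values mimic a well-calibrated null distribution.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
print(f"lambda GC = {genomic_inflation(null_like):.3f}")
```

Values close to 1 suggest well-calibrated test statistics. Values clearly above 1 can reflect uncorrected stratification, but also genuine polygenic signal in large studies, so lambda GC alone does not settle the question.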

Important Considerations and Limitations

Population stratification is a challenge that occurs when ancestry differences between case and control groups create false associations. While statistical methods are used to adjust for this during the analysis, residual confounding can remain and lead to incorrect conclusions.

Linkage disequilibrium (LD) is the tendency for genetic variants that are physically close on a chromosome to be inherited together. Because of LD, a SNP found to be associated with a trait may not be the causal variant itself, but simply a marker that is inherited alongside the true functional variant. Follow-up fine-mapping analyses are required to dissect these regions and pinpoint the likely causal variant.

The “winner’s curse” is a phenomenon in which the effect sizes from an initial discovery GWAS are often inflated. The variants that pass the significance threshold are likely those whose effects were overestimated by chance in that study. When tested in independent replication cohorts, their effect sizes tend to be smaller.
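The effect can be reproduced with a small simulation: draw many noisy estimates of the same true effect, keep only those that cross a significance threshold, and compare averages. All parameter values below are invented for illustration, and selection is applied in the positive direction only to keep the sketch short.

```python
import random
import statistics


def simulate_winners_curse(true_beta, se, n_studies, z_threshold, seed=0):
    """Compare the mean of all estimates with the mean of 'significant' ones.

    Returns (mean_of_all, mean_of_winners, n_winners); mean_of_winners is
    None if no simulated study passes the threshold.
    """
    rng = random.Random(seed)
    estimates = [rng.gauss(true_beta, se) for _ in range(n_studies)]
    winners = [b for b in estimates if b / se > z_threshold]
    mean_winners = statistics.mean(winners) if winners else None
    return statistics.mean(estimates), mean_winners, len(winners)


# z > 5.45 corresponds roughly to the conventional p < 5e-8 threshold.
all_mean, winner_mean, n_winners = simulate_winners_curse(
    true_beta=0.04, se=0.01, n_studies=10000, z_threshold=5.45, seed=1
)
print(f"true=0.04  mean of all={all_mean:.4f}  "
      f"mean of winners={winner_mean:.4f}  winners={n_winners}")
```

Because only estimates above 5.45 standard errors are kept, every retained estimate exceeds the true effect in this setup, which is the inflation that replication cohorts later correct.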

When combining data from multiple studies, heterogeneity, in which a variant’s effect size differs significantly across studies, can be a concern. Harmonization, the process of correctly aligning alleles between datasets, is a necessary step to ensure results are compared consistently. Finally, the power of a GWAS is limited by its sample size, meaning many variants with real but small effects likely remain undiscovered.
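Allele harmonization can be sketched as a small function that re-expresses one study's effect relative to a reference allele pair. The version below handles simple SNPs only and makes conservative assumptions: it flips the sign of beta when the alleles are swapped, tries the complementary strand, and discards palindromic A/T and C/G variants because their strand cannot be resolved from alleles alone. The function name is hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def harmonize(beta, effect, other, ref_effect, ref_other):
    """Return beta expressed for ref_effect, or None if alleles cannot be matched."""
    effect, other = effect.upper(), other.upper()
    ref_effect, ref_other = ref_effect.upper(), ref_other.upper()

    if COMPLEMENT.get(effect) == other:
        return None  # palindromic (A/T or C/G): strand is ambiguous
    if (effect, other) == (ref_effect, ref_other):
        return beta  # already aligned
    if (effect, other) == (ref_other, ref_effect):
        return -beta  # alleles swapped: flip the sign
    # Try the opposite strand.
    flipped = (COMPLEMENT.get(effect), COMPLEMENT.get(other))
    if flipped == (ref_effect, ref_other):
        return beta
    if flipped == (ref_other, ref_effect):
        return -beta
    return None  # alleles do not match the reference at all


print(harmonize(0.1, "G", "A", "A", "G"))  # swapped alleles
print(harmonize(0.1, "T", "C", "A", "G"))  # opposite strand
```

When the sign of beta is flipped, the effect allele frequency must also be replaced by one minus its value, and an odds ratio by its reciprocal.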
