Nontargeted virus discovery, also known as viromics, uses “shotgun” metagenomics to sequence all genetic material within a sample rather than targeting specific organisms. This approach can detect the full spectrum of viruses present, including those that are novel or cannot be cultured with traditional methods. It is also the main route to understanding “viral dark matter”: the massive number of viral sequences in metagenomic data that show no detectable similarity to any known virus in existing reference databases. Depending on the environment, these unknown sequences can account for most of the data.
Viruses are the most numerous biological entities on the planet, and their influence is widespread, from driving microbial evolution to affecting human health. By characterizing previously unknown viruses, researchers can identify potential new pathogens, understand the reservoirs of zoonotic diseases, and discover novel viral enzymes or processes.
Pre-processing and Assembly of Metagenomic Data
The process begins with raw sequencing reads, which are short fragments of DNA or RNA generated by high-throughput sequencing machines. The first step is a rigorous quality control (QC) check. During QC, programs like Trimmomatic or fastp scan each sequencing read, trimming off low-quality bases and removing adapter sequences, which are synthetic DNA fragments attached to the sample fragments during library preparation.
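The core idea of read cleaning can be sketched in a few lines of Python. This is a simplified illustration, not a reimplementation of Trimmomatic or fastp: it assumes Phred+33 quality encoding, exact adapter matching, and a simple 3' end trim. The function names, the thresholds `min_q` and `min_len`, and the adapter string in the usage example are all hypothetical.

```python
def phred33_scores(quality_string):
    """Convert a Phred+33 encoded quality string into integer scores."""
    return [ord(ch) - 33 for ch in quality_string]


def clean_read(seq, quality_string, adapter, min_q=20, min_len=30):
    """Return a cleaned (sequence, quality) pair, or None if the read is too short.

    Assumed, simplified steps:
    1. cut the read at the first exact occurrence of the adapter;
    2. trim bases from the 3' end while their quality is below min_q;
    3. discard the read if fewer than min_len bases remain.
    """
    cut = seq.find(adapter)
    if cut != -1:
        seq, quality_string = seq[:cut], quality_string[:cut]

    scores = phred33_scores(quality_string)
    end = len(seq)
    while end > 0 and scores[end - 1] < min_q:
        end -= 1
    seq, quality_string = seq[:end], quality_string[:end]

    if len(seq) < min_len:
        return None
    return seq, quality_string


if __name__ == "__main__":
    # Toy read: 8 sample bases (the last two of low quality) followed by an adapter.
    read = "ACGTACGT" + "AGATCGGAAG"
    quals = "IIIIII##" + "IIIIIIIIII"
    print(clean_read(read, quals, adapter="AGATCGGAAG", min_len=4))
```

Production tools also tolerate mismatches in the adapter and use sliding-window quality statistics, which this sketch leaves out.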
Once the reads are cleaned, the next step is to remove sequences that are clearly not viral, chiefly those from the host and from common contaminants. Because samples often contain far more genetic material from the host or bacteria than from viruses, these sequences can obscure the much smaller viral signal. The reads are aligned against databases of known host and common contaminant genomes, and reads that map are discarded, which enriches the dataset for potential viral sequences.
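The filtering logic can be illustrated with a toy example. Note the substitution: real pipelines use a read aligner for this step, whereas the sketch below swaps in exact k-mer matching so that it stays self-contained. The function names and the parameters `k` and `min_fraction` are assumptions made for illustration.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def kmers(seq, k):
    """Return the set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_host_index(host_sequences, k=21):
    """Collect k-mers from both strands of every host or contaminant sequence."""
    index = set()
    for seq in host_sequences:
        seq = seq.upper()
        index |= kmers(seq, k)
        index |= kmers(reverse_complement(seq), k)
    return index


def is_host_read(read, host_index, k=21, min_fraction=0.5):
    """Flag a read when at least min_fraction of its k-mers occur in the index."""
    read_kmers = kmers(read.upper(), k)
    if not read_kmers:
        return False  # read shorter than k: nothing to compare
    shared = sum(1 for km in read_kmers if km in host_index)
    return shared / len(read_kmers) >= min_fraction


def filter_reads(reads, host_index, k=21, min_fraction=0.5):
    """Keep only the reads that are not flagged as host or contaminant."""
    return [r for r in reads if not is_host_read(r, host_index, k, min_fraction)]


if __name__ == "__main__":
    index = build_host_index(["ACGTTGCAAGGCTA"], k=4)
    print(filter_reads(["GTTGCAAG", "TTTTTTCCCCCC"], index, k=4))
```

Exact k-mer matching is stricter than alignment, so a sketch like this would miss diverged host reads that an aligner would still catch.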
The remaining pool of high-quality, non-host reads is then subjected to de novo assembly. In this process, assemblers like MEGAHIT or SPAdes piece together the short, overlapping reads into longer, continuous sequences known as contigs. Metagenomic assemblers are specifically designed to handle the complexities of environmental samples, where different viruses and microbes are present at vastly different abundances.
Identification of Viral Contigs
After assembling short sequencing reads into longer contigs, the challenge becomes distinguishing viral sequences from the remaining microbial DNA. The first method is homology-based, where contigs are compared against comprehensive databases of known viral genomes and proteins, such as NCBI RefSeq. If a contig shows significant sequence similarity to a known virus, it is a strong indicator of its viral origin.
A more targeted approach looks for specific viral “hallmark” genes. These genes are broadly conserved across many viral groups but are rare or absent in cellular genomes, apart from integrated prophages. Examples include genes for capsid proteins, which form the protective protein shell of the virus, and the terminase large subunit, an enzyme involved in packaging the viral genome.
Viruses and their hosts often have distinct genomic characteristics that can be exploited for identification. These genome feature-based methods analyze properties like GC content, codon usage patterns, and k-mer frequencies. Because viral genomes are often streamlined for rapid replication, these features can differ systematically from the host’s genome, allowing computational models to separate them.
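Two of these features, GC content and k-mer frequencies, are simple to compute. The sketch below shows one way to turn a contig into a fixed-length feature vector that a classifier could consume. The function names and the default `k=4` are illustrative assumptions, and k-mers containing ambiguous bases such as N are simply ignored.

```python
from collections import Counter
from itertools import product


def gc_content(seq):
    """Fraction of bases in seq that are G or C (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in a fixed order.

    The vector has 4**k entries, ordered lexicographically over A, C, G, T.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[km] for km in all_kmers)
    return [counts[km] / total if total else 0.0 for km in all_kmers]


if __name__ == "__main__":
    contig = "ATGCGCGATATATCGCGCTAGCTAGGATCC"
    features = [gc_content(contig)] + kmer_frequencies(contig, k=2)
    print(len(features))  # 1 GC value plus 16 dinucleotide frequencies
```

Because the vector always has the same length and ordering, contigs of different sizes can be compared directly or passed to a standard machine learning model.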
Modern virus discovery pipelines rely on machine learning classifiers and integrated scoring tools. Programs like VirSorter2, DeepVirFinder, and VIBRANT apply trained models to score each contig or call it viral. Depending on the tool, the evidence can include homology searches, the presence of viral hallmark genes, and genomic features; DeepVirFinder, for example, is alignment-free and works from the nucleotide sequence alone.
Viral Sequence Clustering and Taxonomic Classification
Once a collection of contigs has been identified as viral, the next step is to organize them into biologically relevant groups that approximate distinct viral species. This is necessary because the assembly process often produces fragmented genomes, and multiple contigs may belong to the same viral population. The goal is to generate non-redundant units, often called viral operational taxonomic units (vOTUs), which serve as proxies for viral species.
A primary method for this classification is based on Average Nucleotide Identity (ANI). This technique involves the pairwise comparison of all identified viral contigs to measure their overall genetic similarity. A widely accepted standard uses a threshold of approximately 95% ANI over at least 85% of the length of the shorter sequence (the alignment fraction) to delineate a species-level boundary. Contigs that meet this criterion are clustered together into a single vOTU.
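Given precomputed pairwise comparisons, the clustering step can be sketched as a greedy, longest-first procedure. This is a minimal illustration under stated assumptions rather than the method of any particular tool: the input format (`hits` mapping a contig pair to an ANI percentage and an aligned length in base pairs), the function name, and the choice to measure alignment fraction against the shorter contig are all assumptions.

```python
def cluster_votus(lengths, hits, min_ani=95.0, min_af=0.85):
    """Greedily cluster contigs into vOTUs.

    lengths: dict contig_id -> length in bp
    hits:    dict (contig_a, contig_b) -> (ani_percent, aligned_bp)
    Returns a dict representative_id -> list of member contig ids.
    """
    # Process longest contigs first so each vOTU is represented by its longest member.
    order = sorted(lengths, key=lambda c: (-lengths[c], c))
    clusters = {}
    for contig in order:
        for rep in clusters:
            hit = hits.get((contig, rep)) or hits.get((rep, contig))
            if hit is None:
                continue
            ani, aligned_bp = hit
            # Alignment fraction relative to the shorter of the two sequences.
            af = aligned_bp / min(lengths[contig], lengths[rep])
            if ani >= min_ani and af >= min_af:
                clusters[rep].append(contig)
                break
        else:
            # No existing representative matched: start a new vOTU.
            clusters[contig] = [contig]
    return clusters


if __name__ == "__main__":
    lengths = {"c1": 40000, "c2": 30000, "c3": 35000}
    hits = {("c1", "c2"): (97.0, 28000), ("c1", "c3"): (96.0, 10000)}
    print(cluster_votus(lengths, hits))
```

In the toy data, c2 joins c1 because both thresholds are met, while c3 stays separate: its ANI is high but the alignment covers well under 85% of its length.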
An alternative approach involves the use of gene-sharing networks. This method is particularly useful for classifying highly divergent viruses or fragmented genomes that may not overlap significantly. Tools like vConTACT2 first predict all the protein-coding genes within the viral contigs and group them into protein clusters, or families, based on amino acid similarity. The relationships between contigs are then represented as a network, where each contig is a node and the connections are weighted by the number of shared protein clusters.
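The network construction described here can be sketched with plain dictionaries and sets. This follows the simplified description in the text, with edges weighted by a raw count of shared protein clusters; it is not a reimplementation of vConTACT2. The input format, function name, and `min_shared` cutoff are assumptions for illustration.

```python
from itertools import combinations


def gene_sharing_edges(contig_pcs, min_shared=1):
    """Build weighted edges between contigs that share protein clusters.

    contig_pcs: dict contig_id -> set of protein cluster ids found on that contig
    Returns a dict (contig_a, contig_b) -> number of shared protein clusters,
    with contig_a < contig_b in sort order.
    """
    edges = {}
    for a, b in combinations(sorted(contig_pcs), 2):
        shared = len(contig_pcs[a] & contig_pcs[b])
        if shared >= min_shared:
            edges[(a, b)] = shared
    return edges


if __name__ == "__main__":
    contig_pcs = {
        "v1": {"PC1", "PC2", "PC3"},
        "v2": {"PC2", "PC3", "PC4"},
        "v3": {"PC9"},
    }
    print(gene_sharing_edges(contig_pcs))
```

A raw count favors large genomes that carry many genes, so a practical implementation would likely normalize the weight or test it for statistical significance before clustering the network.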
Within this network, dense groups of nodes represent distinct viral genera or families that share a common set of genes. To assign formal taxonomy, these newly defined clusters are then compared to a reference database of viruses with established classifications, such as the International Committee on Taxonomy of Viruses (ICTV) database. This comparison places the newly discovered viral clusters within the known landscape of viral taxonomy.
Post-Clustering Analysis and Interpretation
With a curated set of viral operational taxonomic units (vOTUs) established, the focus shifts from discovery to biological interpretation. A primary goal is to understand the prevalence and distribution of these newly identified viruses across different environments. This is achieved by mapping the original, quality-controlled sequencing reads from each sample back to the representative sequence of each vOTU. The number of reads that align to a particular vOTU, typically normalized by the length of its representative sequence and by the sample's sequencing depth, is used to calculate its relative abundance, providing a quantitative profile of the viral community.
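As a worked example, relative abundance can be computed from per-vOTU read counts. The sketch below divides each count by the length of the representative sequence before converting to proportions; this length normalization is an assumption added here so that longer genomes do not appear more abundant simply because they recruit more reads. The function name and input format are hypothetical.

```python
def relative_abundance(read_counts, lengths):
    """Compute length-normalized relative abundance for one sample.

    read_counts: dict votu_id -> number of reads mapped to that vOTU
    lengths:     dict votu_id -> length of the representative sequence in bp
    Returns a dict votu_id -> fraction of the viral community (sums to 1).
    """
    # Reads per base pair, so vOTUs of different lengths are comparable.
    density = {v: read_counts[v] / lengths[v] for v in read_counts}
    total = sum(density.values())
    if total == 0:
        return {v: 0.0 for v in density}
    return {v: d / total for v, d in density.items()}


if __name__ == "__main__":
    counts = {"vOTU_1": 100, "vOTU_2": 100}
    lengths = {"vOTU_1": 10000, "vOTU_2": 30000}
    print(relative_abundance(counts, lengths))
```

Both vOTUs recruit the same number of reads, but vOTU_1 is three times shorter, so it ends up with three times the relative abundance of vOTU_2.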
A central question in virology is which hosts a virus infects. In silico host prediction methods link viral sequences to their microbial hosts without lab cultivation. One technique involves searching for matching CRISPR spacer sequences, which are fragments of viral DNA that bacteria and archaea incorporate into their genomes as a form of adaptive immunity. Other methods include identifying viral genomes integrated into bacterial chromosomes (prophages) or comparing k-mer frequency profiles between viruses and potential hosts, as viruses often mimic their host’s genomic signature.
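The CRISPR spacer technique can be sketched as a substring search. This minimal version assumes the spacers have already been extracted from host genomes and requires an exact match on either strand; tolerating mismatches is left out. The function names and input format are hypothetical.

```python
def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def predict_hosts(spacers_by_host, viral_contigs):
    """Link viral contigs to hosts through exact CRISPR spacer matches.

    spacers_by_host: dict host_id -> list of spacer sequences
    viral_contigs:   dict contig_id -> nucleotide sequence
    Returns a sorted list of (contig_id, host_id) pairs.
    """
    links = set()
    for host, spacers in spacers_by_host.items():
        for spacer in spacers:
            forward = spacer.upper()
            reverse = reverse_complement(forward)
            for contig_id, seq in viral_contigs.items():
                seq = seq.upper()
                # A spacer can match either strand of the viral genome.
                if forward in seq or reverse in seq:
                    links.add((contig_id, host))
    return sorted(links)


if __name__ == "__main__":
    contigs = {"phage1": "TTTACGGATCCATGAAA"}
    spacers = {
        "hostA": ["ACGGATCCAT"],  # matches the forward strand
        "hostB": ["TTTCATGG"],    # matches the reverse strand
        "hostC": ["GGGGGGGG"],    # no match
    }
    print(predict_hosts(spacers, contigs))
```

Real spacers are typically a few tens of bases long, so exact matches are specific; the short toy sequences here are only for readability.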
To understand the potential role of a virus within its ecosystem, researchers perform functional annotation of its predicted genes. By comparing the protein sequences from a vOTU against databases of known protein functions, it is possible to infer their roles. This analysis can reveal genes involved in viral replication, assembly, and interaction with the host’s cellular machinery. It can also uncover auxiliary metabolic genes, which are host-derived genes that viruses carry to augment the host’s metabolism for their own benefit.