CheckV assesses the quality and completeness of metagenome-assembled viral genomes.
Journal
Nature biotechnology
ISSN: 1546-1696
Abbreviated title: Nat Biotechnol
Country: United States
NLM ID: 9604648
Publication information
Publication date:
05 2021
History:
received: 19 05 2020
accepted: 12 11 2020
pubmed: 23 12 2020
medline: 29 6 2021
entrez: 22 12 2020
Status:
ppublish
Abstract
Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.
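The completeness idea in the abstract can be illustrated with a minimal sketch. It assumes that an expected full genome length is already available for a contig (the abstract says only that completeness comes from comparison with a database of complete viral genomes, so how that length is obtained is not shown here). The function names and the `expected_genome_length` parameter are hypothetical and are not CheckV's API; only the >90% high-quality threshold comes from the text above.

```python
def estimate_completeness(contig_length: int, expected_genome_length: int) -> float:
    """Return percent completeness of a contig, capped at 100.

    expected_genome_length is an assumed input: the length of a complete
    genome that the contig is believed to be a fragment of.
    """
    if contig_length <= 0 or expected_genome_length <= 0:
        raise ValueError("lengths must be positive")
    return min(100.0, 100.0 * contig_length / expected_genome_length)


def is_high_quality(completeness: float) -> bool:
    """True when completeness exceeds 90%, the threshold quoted in the abstract."""
    return completeness > 90.0


if __name__ == "__main__":
    # Hypothetical contig: 47,500 bp fragment of an expected 50,000 bp genome.
    pct = estimate_completeness(47_500, 50_000)
    print(pct, is_high_quality(pct))
```

A contig at exactly 90% is not counted as high quality in this sketch, because the abstract states the threshold as strictly greater than 90%.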
Identifiers
pubmed: 33349699
doi: 10.1038/s41587-020-00774-7
pii: 10.1038/s41587-020-00774-7
pmc: PMC8116208
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
578-585