CheckV assesses the quality and completeness of metagenome-assembled viral genomes.


Journal

Nature biotechnology
ISSN: 1546-1696
Abbreviated title: Nat Biotechnol
Country: United States
NLM ID: 9604648

Publication information

Publication date:
May 2021
History:
received: 19 May 2020
accepted: 12 November 2020
pubmed: 23 December 2020
medline: 29 June 2021
entrez: 22 December 2020
Status: ppublish

Abstract

Millions of new viral sequences have been identified from metagenomes, but the quality and completeness of these sequences vary considerably. Here we present CheckV, an automated pipeline for identifying closed viral genomes, estimating the completeness of genome fragments and removing flanking host regions from integrated proviruses. CheckV estimates completeness by comparing sequences with a large database of complete viral genomes, including 76,262 identified from a systematic search of publicly available metagenomes, metatranscriptomes and metaviromes. After validation on mock datasets and comparison to existing methods, we applied CheckV to large and diverse collections of metagenome-assembled viral sequences, including IMG/VR and the Global Ocean Virome. This revealed 44,652 high-quality viral genomes (that is, >90% complete), although the vast majority of sequences were small fragments, which highlights the challenge of assembling viral genomes from short-read metagenomes. Additionally, we found that removal of host contamination substantially improved the accurate identification of auxiliary metabolic genes and interpretation of viral-encoded functions.
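
The completeness estimate and the ">90% complete" cutoff described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that completeness is the ratio of a contig's length to the expected length of a matched complete reference genome; the abstract says only that sequences are compared with a database of complete viral genomes, so this ratio, the function names and the example lengths are all hypothetical and are not CheckV's actual implementation.

```python
def estimate_completeness(contig_length: int, expected_genome_length: int) -> float:
    """Return estimated completeness as a percentage, capped at 100.

    Assumption: completeness is approximated as contig length divided by the
    length of a matched complete reference genome. This is an illustrative
    simplification, not the method used by CheckV itself.
    """
    if contig_length < 0:
        raise ValueError("contig_length must be non-negative")
    if expected_genome_length <= 0:
        raise ValueError("expected_genome_length must be positive")
    return min(100.0, 100.0 * contig_length / expected_genome_length)


def is_high_quality(completeness: float) -> bool:
    """Apply the abstract's cutoff: high quality means more than 90% complete."""
    return completeness > 90.0


if __name__ == "__main__":
    # Hypothetical lengths in base pairs, chosen only to exercise the functions.
    for contig, reference in [(47_500, 50_000), (12_000, 50_000)]:
        pct = estimate_completeness(contig, reference)
        print(contig, reference, pct, is_high_quality(pct))
```

The cap at 100 reflects a design choice in this sketch only: a contig longer than its matched reference is treated as complete rather than reported as more than 100% complete.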

Identifiers

pubmed: 33349699
doi: 10.1038/s41587-020-00774-7
pii: 10.1038/s41587-020-00774-7
pmc: PMC8116208

Publication types

Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

578-585

Authors

Stephen Nayfach (S)

US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. snayfach@lbl.gov.

Antonio Pedro Camargo (AP)

Department of Genetics, Evolution, Microbiology and Immunology, Institute of Biology, University of Campinas, Campinas, Brazil.

Frederik Schulz (F)

US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Emiley Eloe-Fadrosh (E)

US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Simon Roux (S)

US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.

Nikos C Kyrpides (NC)

US Department of Energy Joint Genome Institute, Lawrence Berkeley National Laboratory, Berkeley, CA, USA. nckyrpides@lbl.gov.
