Taxometer: Improving taxonomic classification of metagenomics contigs.


Journal

Nature communications
ISSN: 2041-1723
Titre abrégé: Nat Commun
Pays: England
ID NLM: 101528555

Informations de publication

Date de publication:
27 Sep 2024
Historique:
received: 26 01 2024
accepted: 20 09 2024
medline: 28 9 2024
pubmed: 28 9 2024
entrez: 27 9 2024
Statut: epublish

Résumé

For taxonomy based classification of metagenomics assembled contigs, current methods use sequence similarity to identify their most likely taxonomy. However, in the related field of metagenomic binning, contigs are routinely clustered using information from both the contig sequences and their abundance. We introduce Taxometer, a neural network based method that improves the annotations and estimates the quality of any taxonomic classifier using contig abundance profiles and tetra-nucleotide frequencies. We apply Taxometer to five short-read CAMI2 datasets and find that it increases the average share of correct species-level contig annotations of the MMSeqs2 tool from 66.6% to 86.2%. Additionally, it reduce the share of wrong species-level annotations in the CAMI2 Rhizosphere dataset by an average of two-fold for Metabuli, Centrifuge, and Kraken2. Futhermore, we use Taxometer for benchmarking taxonomic classifiers on two complex long-read metagenomics data sets where ground truth is not known. Taxometer is available as open-source software and can enhance any taxonomic annotation of metagenomic contigs.

Identifiants

pubmed: 39333501
doi: 10.1038/s41467-024-52771-y
pii: 10.1038/s41467-024-52771-y
doi:

Types de publication

Journal Article

Langues

eng

Sous-ensembles de citation

IM

Pagination

8357

Subventions

Organisme : Novo Nordisk Fonden (Novo Nordisk Foundation)
ID : NNF19SA0059348
Organisme : Novo Nordisk Fonden (Novo Nordisk Foundation)
ID : NNF14CC0001
Organisme : Novo Nordisk Fonden (Novo Nordisk Foundation)
ID : NNF21SA0072102

Informations de copyright

© 2024. The Author(s).

Références

Hyatt, D. et al. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinform. 11, 119 (2010).
doi: 10.1186/1471-2105-11-119
Wood, D. E., Lu, J. & Langmead, B. Improved metagenomic analysis with kraken 2. Genome Biol. 20, 257 (2019).
Mirdita, M., Steinegger, M., Breitwieser, F., Söding, J. & Karin, E. L. Fast and sensitive taxonomic assignment to metagenomic contigs. Bioinformatics 37, 3029–3031 (2021).
doi: 10.1093/bioinformatics/btab184 pubmed: 33734313 pmcid: 8479651
Lu, J., Breitwieser, F. P., Thielen, P. & Salzberg, S. L. Bracken: estimating species abundance in metagenomics data. Peer J. Comput. Sci. 3, e104 (2017).
doi: 10.7717/peerj-cs.104
Kim, D., Song, L., Breitwieser, F. P. & Salzberg, S. L. Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome Res. 26, 1721–1729 (2016).
doi: 10.1101/gr.210641.116 pubmed: 27852649 pmcid: 5131823
Blanco-M´ıguez, A. et al. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat. Biotechnol. 47, 1633–1644 (2023).
Milanese, A. et al. Microbial abundance, activity and population genomic profiling with mOTUs2. Nat. Commun. 10, 1014 (2019).
Portik, D. M., Brown, C. T. & Pierce-Ward, N. T. Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets. BMC Bioinform. 23, 541 (2022).
Albertsen, M. et al. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat. Biotechnol. 31, 533–538 (2013).
doi: 10.1038/nbt.2579 pubmed: 23707974
Alneberg, J. et al. Binning metagenomic contigs by coverage and composition. Nat. Methods 11, 1144–1146 (2014).
doi: 10.1038/nmeth.3103 pubmed: 25218180
Imelfort, M. et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ 2, e603 (2014).
doi: 10.7717/peerj.603 pubmed: 25289188 pmcid: 4183954
Wu, Y.-W., Simmons, B. A. & Singer, S. W. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics 32, 605–607 (2016).
doi: 10.1093/bioinformatics/btv638 pubmed: 26515820
Kang, D. D. et al. MetaBAT 2: an adaptive binning algorithm for robust and efficient genome reconstruction from metagenome assemblies. PeerJ 7, e7359 (2019).
doi: 10.7717/peerj.7359 pubmed: 31388474 pmcid: 6662567
Nissen, J. N. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat. Biotechnol. 39, 555–560 (2021).
doi: 10.1038/s41587-020-00777-4 pubmed: 33398153
Nayfach, S. et al. A genomic catalog of earth’s microbiomes. Nat. Biotechnol. 39, 499–509 (2021).
doi: 10.1038/s41587-020-0718-6 pubmed: 33169036
Nishimura, Y. & Yoshizawa, S. The OceanDNA MAG catalog contains over 50,000 prokaryotic genomes originated from various marine environments. Sci. Data 9, 305 (2022).
doi: 10.1038/s41597-022-01392-5 pubmed: 35715423 pmcid: 9205870
Almeida, A. et al. A unified catalog of 204,938 reference genomes from the human gut microbiome. Nat. Biotechnol. 39, 105–114 (2021).
doi: 10.1038/s41587-020-0603-3 pubmed: 32690973
Morin, F. & Bengio, Y. Hierarchical probabilistic neural network language model. In Proc. Tenth International Workshop on Artificial Intelligence and Statistics. 246–252 (PMLR, 2005).
Redmon, J., Divvala, S., Girshick, R. & Farhadi, A. You only look once: unified, real-time object detection. arXiv https://doi.org/10.48550/arXiv.1506.02640 (2016).
Valmadre, J. Oh, A. H., Agarwal, A., Belgrave, D. & Cho, K. (eds) Hierarchical classification at multiple operating points. Adv. Neural Inform. Process. Syst. https://doi.org/10.48550/arXiv.2210.10929 (2022).
Wang, Q., Garrity, G. M., Tiedje, J. M. & Cole, J. R. Naive bayesian classifier for rapid assignment of rrna sequences into the new bacterial taxonomy. Appl. Environ. Microbiol. 73, 5261–5267 (2007).
doi: 10.1128/AEM.00062-07 pubmed: 17586664 pmcid: 1950982
Slabbinck, B., Waegeman, W., Dawyndt, P., De Vos, P. & De Baets, B. From learning taxonomies to phylogenetic learning: integration of 16s rrna gene data into fame-based bacterial classification. BMC Bioinform. 11, 1–16 (2010).
doi: 10.1186/1471-2105-11-69
Tafintseva, V. et al. Hierarchical classification of microorganisms based on highdimensional phenotypic data. J. Biophoton. 11, e201700047 (2018).
doi: 10.1002/jbio.201700047
Udelhoven, T., Naumann, D. & Schmitt, J. Development of a hierarchical classification system with artificial neural networks and ft-ir spectra for the identification of bacteria. Appl. Spectrosc. 54, 1471–1479 (2000).
doi: 10.1366/0003702001948619
Liang, Q., Bible, P. W., Liu, Y., Zou, B. & Wei, L. DeepMicrobes: taxonomic classification for metagenomics with deep learning. NAR Genom. Bioinform 2, lqaa009 (2020).
doi: 10.1093/nargab/lqaa009 pubmed: 33575556 pmcid: 7671387
Mock, F., Kretschmer, F., Kriese, A., B¨ocker, S. & Marz, M. Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc. Natl Acad. Sci. USA. 119, e2122636119 (2022).
doi: 10.1073/pnas.2122636119 pubmed: 36018838 pmcid: 9436379
Xiao, L., Deng, L. & Liu, X. Metagenomic sequence classification based on one-dimensional convolutional neural network. In Proc. 2022 11th International Conference on Computing and Pattern Recognition. 191–196 (Association for Computing Machinery, New York, NY, USA, 2023).
Fuhl, W., Zabel, S. & Nieselt, K. Improving taxonomic classification with feature space balancing. Bioinform. Adv. 3, vbad092 (2023).
doi: 10.1093/bioadv/vbad092 pubmed: 37577265 pmcid: 10415173
Wichmann, A. et al. MetaTransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genom. Bioinform. 5, lqad082 (2023).
doi: 10.1093/nargab/lqad082 pubmed: 37705831 pmcid: 10495543
Kim, J. & Steinegger, M. Metabuli: sensitive and specific metagenomic classification via joint analysis of amino-acid and DNA. Nat. Methods 21, 971–973 (2023).
Parks, D. H. et al. GTDB: an ongoing census of bacterial and archaeal diversity through a phylogenetically consistent, rank normalized and complete genomebased taxonomy. Nucleic Acids Res. 50, D785–D794 (2022).
doi: 10.1093/nar/gkab776 pubmed: 34520557
Sayers, E. W. et al. Database resources of the national center for biotechnology information. Nucleic Acids Res. 50, D20–D26 (2022).
doi: 10.1093/nar/gkab1112 pubmed: 34850941
Dick, G. J. et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 10, R85 (2009).
doi: 10.1186/gb-2009-10-8-r85 pubmed: 19698104 pmcid: 2745766
BioSciences, P. Data Release: Human Microbiome Samples Demonstrate Advances in Hifi-Enabled Metagenomic Sequencing. https://downloads.pacbcloud.com/public/dataset/Sequel-IIe-202104/metagenomics/ (2023).
Meyer, F. et al. Critical assessment of metagenome interpretation: the second round of challenges. Nat. Methods 19, 429–440 (2022).
doi: 10.1038/s41592-022-01431-4 pubmed: 35396482 pmcid: 9007738
Li, H. Aligning sequence reads, clone sequences and assembly contigs with bwamem. arXiv Genom. https://doi.org/10.48550/arXiv.1303.3997 (2013).
Li, H. et al. The sequence alignment/Map format and SAMtools. Bioinformatics 25, 2078–2079 (2009).
doi: 10.1093/bioinformatics/btp352 pubmed: 19505943 pmcid: 2723002
Quince, C. et al. STRONG: metagenomics strain resolution on assembly graphs. Genome Biol. 22, 214 (2021).
doi: 10.1186/s13059-021-02419-7 pubmed: 34311761 pmcid: 8311964
Benoit, G. et al. Efficient High-Quality Metagenome Assembly from Long Accurate Reads using Minimizer-space de Bruijn Graphs. https://www.biorxiv.org/content/10.1101/2023.07.07.548136v1 (2023).
Li, H. Minimap2: pairwise alignment for nucleotide sequences. Bioinformatics 34, 3094–3100 (2018).
doi: 10.1093/bioinformatics/bty191 pubmed: 29750242 pmcid: 6137996
Camargo, A. Apcamargo/pycoverm: Simple Python Interface to CoverM’s Fast Coverage Estimation Functions. https://github.com/apcamargo/pycoverm/tree/main (2023).
Chaumeil, P.-A., Mussig, A. J., Hugenholtz, P. & Parks, D. H. GTDB-tk v2: memory friendly classification with the genome taxonomy database. Bioinformatics 38, 5315–5316 (2022).
doi: 10.1093/bioinformatics/btac672 pubmed: 36218463 pmcid: 9710552
Schoch, C. L. et al. Ncbi taxonomy: a comprehensive update on curation, resources and tools. Database 2020, baaa062 (2020).
doi: 10.1093/database/baaa062 pubmed: 32761142 pmcid: 7408187
Dilthey, A., Jain, C., Koren, S. & Phillippy, A. Strain-level metagenomic assignment and compositional estimation for long reads with MetaMaps. Nat. Commun. 10, 3066 (2019).
doi: 10.1038/s41467-019-10934-2 pubmed: 31296857 pmcid: 6624308
Defazio, A. & Mishchenko, K. Learning-rate-free learning by d-adaptation. In Proc. 40th International Conference on Machine Learning. 7449–7479 (PMLR, 2023).
Paszke, A. et al. PyTorch: An imperative style, high-performance deep learning library. In Proc. 33rd Conference on Neural Information Processing Systems. 8026–8037 (NeurIPS, 2019).
Kutuzova, S., Nielsen, M., Lindez Piera, P., Nybo Nissen, J. & Rasmussen, S. Taxometer: Improving taxonomic classification of metagenomics contigs. Zenodo https://doi.org/10.5281/zenodo.13379588 (2024).
doi: 10.5281/zenodo.13379588

Auteurs

Svetlana Kutuzova (S)

Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen, 2100, Denmark.
The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark.
The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark.

Mads Nielsen (M)

Department of Computer Science, University of Copenhagen, Universitetsparken 1, Copenhagen, 2100, Denmark.

Pau Piera (P)

The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark.
The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark.

Jakob Nybo Nissen (JN)

The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark. jakob.nissen@sund.ku.dk.
The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark. jakob.nissen@sund.ku.dk.

Simon Rasmussen (S)

The Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark. srasmuss@sund.ku.dk.
The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Blegdamsvej 3A, Copenhagen, 2200, Denmark. srasmuss@sund.ku.dk.
The Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, 02142, MA, USA. srasmuss@sund.ku.dk.

Articles similaires

Selecting optimal software code descriptors-The case of Java.

Yegor Bugayenko, Zamira Kholmatova, Artem Kruglov et al.
1.00
Software Algorithms Programming Languages

Exploring blood-brain barrier passage using atomic weighted vector and machine learning.

Yoan Martínez-López, Paulina Phoobane, Yanaima Jauriga et al.
1.00
Blood-Brain Barrier Machine Learning Humans Support Vector Machine Software
Coal Metagenome Phylogeny Bacteria Genome, Bacterial
1.00
Humans Magnetic Resonance Imaging Brain Infant, Newborn Infant, Premature

Classifications MeSH