Taxometer: Improving taxonomic classification of metagenomics contigs.
Journal
Nature Communications
ISSN: 2041-1723
Abbreviated title: Nat Commun
Country: England
NLM ID: 101528555
Publication information
Publication date:
27 Sep 2024
History:
received: 26 Jan 2024
accepted: 20 Sep 2024
medline: 28 Sep 2024
pubmed: 28 Sep 2024
entrez: 27 Sep 2024
Status:
epublish
Abstract
For taxonomy-based classification of metagenomics assembled contigs, current methods use sequence similarity to identify their most likely taxonomy. However, in the related field of metagenomic binning, contigs are routinely clustered using information from both the contig sequences and their abundance. We introduce Taxometer, a neural-network-based method that improves the annotations and estimates the quality of any taxonomic classifier using contig abundance profiles and tetra-nucleotide frequencies. We apply Taxometer to five short-read CAMI2 datasets and find that it increases the average share of correct species-level contig annotations of the MMSeqs2 tool from 66.6% to 86.2%. Additionally, it reduces the share of wrong species-level annotations in the CAMI2 Rhizosphere dataset by two-fold on average for Metabuli, Centrifuge, and Kraken2. Furthermore, we use Taxometer for benchmarking taxonomic classifiers on two complex long-read metagenomics datasets where ground truth is not known. Taxometer is available as open-source software and can enhance any taxonomic annotation of metagenomic contigs.
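As an illustration of one of the input features named in the abstract, the following is a minimal sketch of how tetra-nucleotide frequencies could be computed for a single contig. It is not Taxometer's implementation: the function name `tetranucleotide_frequencies`, the choice to skip windows containing ambiguous bases, and the choice not to merge reverse-complement 4-mers are all assumptions made for this example.

```python
from collections import Counter
from itertools import product

# All 256 possible 4-mers over the DNA alphabet, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetranucleotide_frequencies(contig: str) -> list[float]:
    """Return the relative frequency of each of the 256 4-mers in `contig`.

    Hypothetical helper for illustration only. Windows containing
    characters other than A, C, G, T (for example N) are skipped.
    """
    seq = contig.upper()
    # Count every overlapping window of length 4.
    counts = Counter(seq[i:i + 4] for i in range(len(seq) - 3))
    # Keep only windows made of unambiguous bases.
    valid = {k: counts.get(k, 0) for k in KMERS}
    total = sum(valid.values())
    if total == 0:
        # Contig too short or fully ambiguous: return an all-zero vector.
        return [0.0] * len(KMERS)
    return [valid[k] / total for k in KMERS]


# Toy contig; the N invalidates the four windows that overlap it.
freqs = tetranucleotide_frequencies("ACGTACGTNACGT")
```

The result is a 256-dimensional vector per contig that sums to 1 whenever at least one valid window exists. A classifier of the kind described in the abstract would combine such a vector with the contig's abundance profile across samples.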
Identifiers
pubmed: 39333501
doi: 10.1038/s41467-024-52771-y
pii: 10.1038/s41467-024-52771-y
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
8357
Grants
Agency: Novo Nordisk Fonden (Novo Nordisk Foundation)
ID: NNF19SA0059348
Agency: Novo Nordisk Fonden (Novo Nordisk Foundation)
ID: NNF14CC0001
Agency: Novo Nordisk Fonden (Novo Nordisk Foundation)
ID: NNF21SA0072102
Copyright information
© 2024. The Author(s).