Machine learning for microbiologists.
Journal
Nature reviews. Microbiology
ISSN: 1740-1534
Abbreviated title: Nat Rev Microbiol
Country: England
NLM ID: 101190261
Publication information
Publication date:
Apr 2024
History:
accepted: 2023-10-03
medline: 2024-03-18
pubmed: 2023-11-16
entrez: 2023-11-15
Status:
ppublish
Abstract
Machine learning is increasingly important in microbiology, where it is used for tasks such as predicting antibiotic resistance and associating human microbiome features with complex host diseases. The applications in microbiology are quickly expanding, and the machine learning tools frequently used in basic and clinical research range from classification and regression to clustering and dimensionality reduction. In this Review, we examine the main machine learning concepts, tasks and applications that are relevant for experimental and clinical microbiologists. We provide the minimal toolbox for a microbiologist to be able to understand, interpret and use machine learning in their experimental and translational activities.
Identifiers
pubmed: 37968359
doi: 10.1038/s41579-023-00984-1
pii: 10.1038/s41579-023-00984-1
Publication types
Journal Article
Review
Languages
eng
Citation subsets
IM
Pagination
191-205
Grants
Agency: NCI NIH HHS
ID: R01 CA230551
Country: United States
Copyright information
© 2023. Springer Nature Limited.