Machine learning approaches in microbiome research: challenges and best practices.

Keywords: AutoML; colorectal cancer; feature selection; machine learning methods; microbiome data analysis; model selection; predictive modeling; preprocessing

Journal

Frontiers in microbiology
ISSN: 1664-302X
Abbreviated title: Front Microbiol
Country: Switzerland
NLM ID: 101548977

Publication information

Publication date:
2023
History:
received: 19 07 2023
accepted: 04 09 2023
medline: 9 10 2023
pubmed: 9 10 2023
entrez: 9 10 2023
Status: epublish

Abstract

Predictive analysis of microbiome data within a machine learning (ML) workflow presents numerous domain-specific challenges involving preprocessing, feature selection, predictive modeling, performance estimation, model interpretation, and the extraction of biological information from the results. To assist decision-making, we offer a set of recommendations on algorithm selection and on pipeline creation and evaluation, stemming from the COST Action ML4Microbiome. We compared the suggested approaches on a multi-cohort shotgun metagenomics dataset of colorectal cancer patients, focusing on their performance in disease diagnosis and biomarker discovery. We demonstrate that using compositional transformations and filtering methods as part of data preprocessing does not always improve the predictive performance of a model. In contrast, multivariate feature selection, such as the Statistically Equivalent Signatures algorithm, was effective in reducing the classification error. When validated on a separate test dataset, this algorithm, in combination with random forest modeling, provided the most accurate performance estimates. Lastly, we showed how linear modeling by logistic regression, coupled with visualization techniques such as Individual Conditional Expectation (ICE) plots, can yield interpretable results and offer biological insights. These findings are significant for clinicians and non-experts alike in translational applications.
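To make the preprocessing steps named in the abstract concrete, the following is a minimal sketch of a prevalence filter followed by a compositional transformation. It is not the authors' pipeline: the abstract does not say which transformation or filter was used, so the centered log-ratio (CLR) transform, the pseudocount of 0.5, the prevalence threshold, the function names, and the toy count table are all assumptions chosen for illustration.

```python
import math


def prevalence_filter(samples, min_prevalence=0.1):
    """Return indices of taxa that are non-zero in at least
    `min_prevalence` (a fraction) of the samples."""
    n_samples = len(samples)
    n_taxa = len(samples[0])
    keep = []
    for j in range(n_taxa):
        present = sum(1 for row in samples if row[j] > 0)
        if present / n_samples >= min_prevalence:
            keep.append(j)
    return keep


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts."""
    # Assumed pseudocount so that zero counts have a defined logarithm.
    logs = [math.log(c + pseudocount) for c in counts]
    # Subtracting the mean log is the same as dividing by the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


# Hypothetical toy count table: rows are samples, columns are taxa.
toy_counts = [
    [120, 0, 30, 0],
    [80, 5, 0, 0],
    [200, 0, 45, 0],
]

# Keep taxa observed in at least half of the samples, then transform.
kept = prevalence_filter(toy_counts, min_prevalence=0.5)
filtered = [[row[j] for j in kept] for row in toy_counts]
clr_table = [clr_transform(row) for row in filtered]
```

Each CLR-transformed row sums to zero by construction, which is a quick sanity check on the output. In a real workflow the filter would be fitted on the training folds only, so that the held-out data does not influence which taxa are kept.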

Identifiers

pubmed: 37808286
doi: 10.3389/fmicb.2023.1261889
pmc: PMC10556866

Publication types

Journal Article; Review

Languages

eng

Pagination

1261889

Copyright information

Copyright © 2023 Papoutsoglou, Tarazona, Lopes, Klammsteiner, Ibrahimi, Eckenberger, Novielli, Tonda, Simeon, Shigdel, Béreux, Vitali, Tangaro, Lahti, Temko, Claesson and Berland.

Conflict of interest statement

GP was directly affiliated with JADBio—Gnosis DA, S.A., which offers the JADBio service commercially. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors

Georgios Papoutsoglou (G)

Department of Computer Science, University of Crete, Heraklion, Greece.
JADBio Gnosis DA S.A., Science and Technology Park of Crete, Heraklion, Greece.

Sonia Tarazona (S)

Department of Applied Statistics and Operations Research and Quality, Polytechnic University of Valencia, Valencia, Spain.

Marta B Lopes (MB)

Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal.
Research and Development Unit for Mechanical and Industrial Engineering (UNIDEMI), Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal.

Thomas Klammsteiner (T)

Department of Ecology, Universität Innsbruck, Innsbruck, Austria.
Department of Microbiology, Universität Innsbruck, Innsbruck, Austria.

Eliana Ibrahimi (E)

Department of Biology, University of Tirana, Tirana, Albania.

Julia Eckenberger (J)

School of Microbiology, University College Cork, Cork, Ireland.
APC Microbiome Ireland, Cork, Ireland.

Pierfrancesco Novielli (P)

Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy.
National Institute for Nuclear Physics, Bari Division, Bari, Italy.

Alberto Tonda (A)

UMR 518 MIA-PS, INRAE, Paris-Saclay University, Palaiseau, France.
Complex Systems Institute of Paris Ile-de-France (ISC-PIF) - UAR 3611 CNRS, Paris, France.

Andrea Simeon (A)

BioSense Institute, University of Novi Sad, Novi Sad, Serbia.

Rajesh Shigdel (R)

Department of Clinical Science, University of Bergen, Bergen, Norway.

Stéphane Béreux (S)

MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France.
MaIAGE, INRAE, Paris-Saclay University, Jouy-en-Josas, France.

Giacomo Vitali (G)

MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France.

Sabina Tangaro (S)

Department of Soil, Plant, and Food Sciences, University of Bari Aldo Moro, Bari, Italy.
National Institute for Nuclear Physics, Bari Division, Bari, Italy.

Leo Lahti (L)

Department of Computing, University of Turku, Turku, Finland.

Andriy Temko (A)

Department of Electrical and Electronic Engineering, University College Cork, Cork, Ireland.

Marcus J Claesson (MJ)

School of Microbiology, University College Cork, Cork, Ireland.
APC Microbiome Ireland, Cork, Ireland.

Magali Berland (M)

MetaGenoPolis, INRAE, Paris-Saclay University, Jouy-en-Josas, France.

MeSH classifications