Supervised learning and model analysis with compositional data.
Journal
PLoS computational biology
ISSN: 1553-7358
Abbreviated title: PLoS Comput Biol
Country: United States
NLM ID: 101238922
Publication information
Publication date:
Jun 2023
History:
received: 2023-01-19
accepted: 2023-06-03
revised: 2023-07-13
medline: 2023-07-17
pubmed: 2023-06-30
entrez: 2023-06-30
Status:
epublish
Abstract
Supervised learning, such as regression and classification, is an essential tool for analyzing modern high-throughput sequencing data, for example in microbiome research. However, due to the compositionality and sparsity, existing techniques are often inadequate. Either they rely on extensions of the linear log-contrast model (which adjust for compositionality but cannot account for complex signals or sparsity) or they are based on black-box machine learning methods (which may capture useful signals, but lack interpretability due to the compositionality). We propose KernelBiome, a kernel-based nonparametric regression and classification framework for compositional data. It is tailored to sparse compositional data and is able to incorporate prior knowledge, such as phylogenetic structure. KernelBiome captures complex signals, including in the zero-structure, while automatically adapting model complexity. We demonstrate on par or improved predictive performance compared with state-of-the-art machine learning methods on 33 publicly available microbiome datasets. Additionally, our framework provides two key advantages: (i) We propose two novel quantities to interpret contributions of individual components and prove that they consistently estimate average perturbation effects of the conditional mean, extending the interpretability of linear log-contrast coefficients to nonparametric models. (ii) We show that the connection between kernels and distances aids interpretability and provides a data-driven embedding that can augment further analysis. KernelBiome is available as an open-source Python package on PyPI and at https://github.com/shimenghuang/KernelBiome.
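The connection between kernels and distances mentioned in the abstract can be illustrated with a minimal sketch. It is not the KernelBiome API: the function names, the pseudocount used to handle zeros, and the choice of the Aitchison kernel (the inner product of centered log-ratio vectors) are assumptions made only for illustration. Any positive semi-definite kernel k induces a distance through d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y).

```python
import math

def closure(x):
    """Rescale a non-negative vector so that its entries sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def clr(x, pseudocount=1e-6):
    """Centered log-ratio transform.

    The pseudocount is an illustrative assumption for handling zeros;
    it is not taken from the paper.
    """
    p = closure([v + pseudocount for v in closure(x)])
    logs = [math.log(v) for v in p]
    mean_log = sum(logs) / len(logs)
    return [l - mean_log for l in logs]

def aitchison_kernel(x, y, pseudocount=1e-6):
    """Inner product of the CLR-transformed compositions."""
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return sum(a * b for a, b in zip(cx, cy))

def kernel_distance(x, y, kernel=aitchison_kernel):
    """Distance induced by a kernel: sqrt(k(x,x) + k(y,y) - 2 k(x,y))."""
    d2 = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    return math.sqrt(max(d2, 0.0))  # guard against tiny negative rounding

# Example: raw counts and their rescaled version represent the same composition.
counts = [1, 2, 3]
rescaled = [10, 20, 30]
sparse = [0, 1, 1]
d_same = kernel_distance(counts, rescaled)
d_diff = kernel_distance(counts, sparse)
```

Because the transform works on relative abundances, `d_same` is zero up to rounding, while `d_diff` is strictly positive even though `sparse` contains a zero.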
Identifiers
pubmed: 37390111
doi: 10.1371/journal.pcbi.1011240
pii: PCOMPBIOL-D-23-00094
pmc: PMC10343141
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e1011240
Copyright information
Copyright: © 2023 Huang et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Conflict of interest statement
The authors have declared that no competing interests exist.