Learning sparse log-ratios for high-throughput sequencing data.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
22 December 2021
History:
received: 6 June 2021
revised: 9 August 2021
accepted: 3 September 2021
pubmed: 10 September 2021
medline: 3 February 2023
entrez: 9 September 2021
Status: ppublish

Abstract

The automatic discovery of sparse biomarkers that are associated with an outcome of interest is a central goal of bioinformatics. In the context of high-throughput sequencing (HTS) data, and compositional data (CoDa) more generally, an important class of biomarkers are the log-ratios between the input variables. However, identifying predictive log-ratio biomarkers from HTS data is a combinatorial optimization problem, which is computationally challenging. Existing methods are slow to run and scale poorly with the dimension of the input, which has limited their application to low- and moderate-dimensional metagenomic datasets. Building on recent advances from the field of deep learning, we present CoDaCoRe, a novel learning algorithm that identifies sparse, interpretable and predictive log-ratio biomarkers. Our algorithm exploits a continuous relaxation to approximate the underlying combinatorial optimization problem. This relaxation can then be optimized efficiently using the modern ML toolbox, in particular, gradient descent. As a result, CoDaCoRe runs several orders of magnitude faster than competing methods, all while achieving state-of-the-art performance in terms of predictive accuracy and sparsity. We verify the outperformance of CoDaCoRe across a wide range of microbiome, metabolite and microRNA benchmark datasets, as well as a particularly high-dimensional dataset that is outright computationally intractable for existing sparse log-ratio selection methods. The CoDaCoRe package is available at https://github.com/egr95/R-codacore. Code and instructions for reproducing our results are available at https://github.com/cunningham-lab/codacore. Supplementary data are available at Bioinformatics online.
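As a rough illustration of the idea described in the abstract, the sketch below contrasts a discrete log-ratio between two feature subsets with one possible continuous relaxation of it. This is not the CoDaCoRe implementation: the function names, the use of summed counts in the numerator and denominator, and the tanh parameterization of the soft weights are all assumptions made for illustration, and the gradient-descent fitting step is omitted.

```python
import math


def hard_log_ratio(x, numerator, denominator):
    """Log-ratio between the summed values of two disjoint feature subsets.

    x is one sample (a list of positive counts or proportions);
    numerator and denominator are lists of feature indices.
    Choosing these two subsets is the combinatorial part of the problem.
    """
    num = sum(x[j] for j in numerator)
    den = sum(x[j] for j in denominator)
    return math.log(num) - math.log(den)


def relaxed_log_ratio(x, theta):
    """A continuous stand-in for hard_log_ratio (illustrative assumption).

    Each feature j gets a real-valued parameter theta[j], mapped to a soft
    weight w_j = tanh(theta[j]) in (-1, 1). Positive weights contribute to
    the numerator, negative weights to the denominator, and weights near
    zero leave the feature out. Because the expression is differentiable
    in theta, it could be tuned with gradient-based optimizers.
    """
    w = [math.tanh(t) for t in theta]
    num = sum(xj * max(wj, 0.0) for xj, wj in zip(x, w))
    den = sum(xj * max(-wj, 0.0) for xj, wj in zip(x, w))
    return math.log(num) - math.log(den)


# Hypothetical sample with four features.
sample = [1.0, 2.0, 3.0, 4.0]

# Discrete choice: feature 0 over features 2 and 3.
print(hard_log_ratio(sample, [0], [2, 3]))

# Saturated soft weights reproduce the same discrete choice.
print(relaxed_log_ratio(sample, [20.0, 0.0, -20.0, -20.0]))
```

When the soft weights saturate at +1, 0 or -1, the relaxed expression coincides with the discrete log-ratio, which is what makes it usable as a surrogate. Both functions are also unchanged if the sample is multiplied by a constant, as one would want for compositional data.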

Identifiers

pubmed: 34498030
pii: 6366546
doi: 10.1093/bioinformatics/btab645
pmc: PMC8696089

Publication types

Journal Article
Research Support, Non-U.S. Gov't
Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

157-163

Grants

Agency: Simons Foundation
ID: 542963
Agency: Sloan Foundation
Agency: McKnight Endowment Fund
Agency: NSF
ID: DBI-1707398
Agency: Gatsby Charitable Foundation

Copyright information

© The Author(s) 2021. Published by Oxford University Press.

Authors

Elliott Gordon-Rodriguez (E)

Department of Statistics, Columbia University, New York, NY 10025, USA.

Thomas P Quinn (TP)

Applied Artificial Intelligence Institute, Deakin University, Geelong, VIC 3126, Australia.

John P Cunningham (JP)

Department of Statistics, Columbia University, New York, NY 10025, USA.
