Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.


Journal

PLoS One
ISSN: 1932-6203
Abbreviated title: PLoS One
Country: United States
NLM ID: 101285081

Publication information

Publication date:
2020
History:
received: 16 01 2020
accepted: 11 08 2020
entrez: 29 8 2020
pubmed: 29 8 2020
medline: 29 9 2020
Status: epublish

Abstract

Health advances are contingent on the continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, and multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 validation utilizes synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA in turning complex data into structural information leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case-study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source-code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.
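
The subsample-then-rank idea behind the feature mining described in the abstract can be illustrated with a small sketch. This is not the authors' CBDA implementation: the function names, the parameter defaults, the synthetic data, and the use of a plain least-squares fit as the only base-learner are all assumptions made for illustration. The sketch repeatedly draws random subsets of cases and features, fits a model on each subset, scores it on the held-out cases, and counts how often each feature appears among the best-scoring fits.

```python
import numpy as np


def subsample_feature_mining(X, y, n_iter=400, row_frac=0.3,
                             n_feats=5, top_frac=0.5, seed=0):
    """Count how often each feature appears in the best-scoring subsample fits.

    Hypothetical helper: a least-squares learner stands in for the ensemble
    of base-learners mentioned in the abstract.
    """
    rng = np.random.default_rng(seed)
    n, p = X.shape
    n_rows = max(2 * n_feats, int(row_frac * n))
    results = []
    for _ in range(n_iter):
        # Draw a random subset of cases and a random subset of features.
        rows = rng.choice(n, size=n_rows, replace=False)
        feats = rng.choice(p, size=n_feats, replace=False)
        held_out = np.setdiff1d(np.arange(n), rows)
        # Fit least squares (with intercept) on the subsample only.
        A = np.column_stack([np.ones(n_rows), X[np.ix_(rows, feats)]])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)
        # Score the fit on the cases that were not sampled.
        B = np.column_stack([np.ones(held_out.size), X[np.ix_(held_out, feats)]])
        mse = float(np.mean((B @ coef - y[held_out]) ** 2))
        results.append((mse, feats))
    # Keep the top-ranked fits and tally the features they used.
    results.sort(key=lambda r: r[0])
    n_top = max(1, int(top_frac * n_iter))
    counts = np.zeros(p, dtype=int)
    for _, feats in results[:n_top]:
        counts[feats] += 1
    return counts


def make_demo_data(n=400, p=20, seed=1):
    """Synthetic data (an assumption): only features 2 and 7 drive the outcome."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    y = 2.0 * X[:, 2] - 2.0 * X[:, 7] + rng.normal(scale=0.5, size=n)
    return X, y


if __name__ == "__main__":
    X, y = make_demo_data()
    counts = subsample_feature_mining(X, y)
    print("Selection counts per feature:", counts)
```

Features with high selection counts are candidates for the salient features; in this toy setup the two informative columns should be selected far more often than the noise columns. The fraction of fits kept (`top_frac`) and the subset sizes are tunable choices of this sketch, not values taken from the study.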

Identifiers

pubmed: 32857775
doi: 10.1371/journal.pone.0228520
pii: PONE-D-20-01487
pmc: PMC7455041

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Citation subsets

IM

Pagination

e0228520

Grants

Agency: NIBIB NIH HHS
ID: U54 EB020406
Country: United States
Agency: NIMH NIH HHS
ID: R01 MH121079
Country: United States
Agency: NINR NIH HHS
ID: P20 NR015331
Country: United States
Agency: NIDDK NIH HHS
ID: P30 DK089503
Country: United States
Agency: NCI NIH HHS
ID: R01 CA233487
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR002240
Country: United States
Agency: NINDS NIH HHS
ID: P50 NS091856
Country: United States

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Simeone Marino (S)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.
Department of Microbiology and Immunology, University of Michigan, Ann Arbor, Michigan, United States of America.

Yi Zhao (Y)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Nina Zhou (N)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Yiwang Zhou (Y)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America.
Department of Biostatistics, University of Michigan, Ann Arbor, Michigan, United States of America.

Arthur W Toga (AW)

Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, California, United States of America.

Lu Zhao (L)

Laboratory of Neuro Imaging, USC Stevens Neuroimaging and Informatics Institute, Keck School of Medicine of USC, University of Southern California, Los Angeles, California, United States of America.

Yingsi Jian (Y)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Yichen Yang (Y)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Yehu Chen (Y)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Qiucheng Wu (Q)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Jessica Wild (J)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.

Brandon Cummings (B)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.
Michigan Center for Integrative Research in Critical Care, University of Michigan, Ann Arbor, Michigan, United States of America.

Ivo D Dinov (ID)

Statistics Online Computational Resource, Department of Health Behavior and Biological Sciences, University of Michigan, Ann Arbor, Michigan, United States of America.
Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America.
Michigan Institute for Data Science, University of Michigan, Ann Arbor, Michigan, United States of America.
Neuroscience Graduate Program, University of Michigan, Ann Arbor, Michigan, United States of America.
