Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.

Algorithms; Big Data; Data Compression; Data Mining / methods; Data Science / methods; Humans; Machine Learning; Meta-Analysis as Topic; Models, Theoretical; Physical Phenomena; Prognosis; Reproducibility of Results; Software

Journal

PloS one

ISSN: 1932-6203

Abbreviated title: PLoS One

Country: United States

NLM ID: 101285081

Publication information

Publication date:
2020

History:

received: 16 01 2020

accepted: 11 08 2020

entrez: 29 8 2020

pubmed: 29 8 2020

medline: 29 9 2020

Status: epublish

Abstract

Health advances are contingent on continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, the validation of CBDA 2.0 uses synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA to transform complex data into structural information, leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.
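
The feature-mining idea summarized in the abstract (iterative subsampling followed by ranking of the features that recur in the best-performing subsamples) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `subsample_feature_ranking`, all parameter names and default values, the synthetic data, and the choice of ordinary least squares as the only base learner are assumptions made purely for illustration, and the sketch covers a continuous outcome only.

```python
import numpy as np


def subsample_feature_ranking(X, y, n_rounds=600, case_frac=0.3,
                              n_feat=5, top_frac=0.2, seed=0):
    """Rank features by how often they occur in the best random subsamples.

    Hypothetical helper: every name and default here is an assumption.
    """
    rng = np.random.default_rng(seed)
    n_cases, n_features = X.shape
    n_sub = max(n_feat + 2, int(case_frac * n_cases))
    records = []  # one (held-out MSE, feature indices) pair per round

    for _ in range(n_rounds):
        # Draw a random subset of cases and a random subset of features.
        rows = rng.choice(n_cases, size=n_sub, replace=False)
        cols = rng.choice(n_features, size=n_feat, replace=False)
        held_out = np.setdiff1d(np.arange(n_cases), rows)

        # Base learner (assumption): least squares with an intercept.
        A = np.column_stack([np.ones(n_sub), X[np.ix_(rows, cols)]])
        coef, *_ = np.linalg.lstsq(A, y[rows], rcond=None)

        # Score the fitted model on the cases left out of the subsample.
        B = np.column_stack([np.ones(held_out.size), X[np.ix_(held_out, cols)]])
        mse = float(np.mean((B @ coef - y[held_out]) ** 2))
        records.append((mse, cols))

    # Keep the best-scoring rounds and count feature occurrences in them.
    records.sort(key=lambda r: r[0])
    n_top = max(1, int(top_frac * n_rounds))
    counts = np.zeros(n_features, dtype=int)
    for _, cols in records[:n_top]:
        counts[cols] += 1
    return np.argsort(-counts), counts


# Synthetic example (assumed data): only features 2 and 7 drive the outcome.
data_rng = np.random.default_rng(1)
X = data_rng.normal(size=(400, 30))
y = 3.0 * X[:, 2] + 3.0 * X[:, 7] + data_rng.normal(scale=0.5, size=400)

ranking, counts = subsample_feature_ranking(X, y)
print("most frequently selected features:", ranking[:5])
```

In this toy setting the two informative features should dominate the selection counts, because subsamples that include them predict the held-out cases far better than those that do not. Extending the sketch toward model mining would mean fitting several base learners per subsample and comparing their held-out accuracy, which is omitted here.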

Identifiers

DOI: 10.1371/journal.pone.0228520

PMID: 32857775

PMC: PMC7455041

PII: PONE-D-20-01487

Publication types

Journal Article; Research Support, N.I.H., Extramural; Research Support, U.S. Gov't, Non-P.H.S.

Languages

eng

Pagination

e0228520

Grants

Agency: NIBIB NIH HHS

ID: U54 EB020406

Country: United States

Agency: NIMH NIH HHS

ID: R01 MH121079

Country: United States

Agency: NINR NIH HHS

ID: P20 NR015331

Country: United States

Agency: NIDDK NIH HHS

ID: P30 DK089503

Country: United States

Agency: NCI NIH HHS

ID: R01 CA233487

Country: United States

Agency: NCATS NIH HHS

ID: UL1 TR002240

Country: United States

Agency: NINDS NIH HHS

ID: P50 NS091856

Country: United States

Conflict of interest statement

The authors have declared that no competing interests exist.

Authors

Simeone Marino (S)

Yi Zhao (Y)

Nina Zhou (N)

Yiwang Zhou (Y)

Arthur W Toga (AW)

Lu Zhao (L)

Yingsi Jian (Y)

Yichen Yang (Y)

Yehu Chen (Y)

Qiucheng Wu (Q)

Jessica Wild (J)

Brandon Cummings (B)

Ivo D Dinov (ID)
