Compressive Big Data Analytics: An ensemble meta-algorithm for high-dimensional multisource datasets.
Journal
PloS one
ISSN: 1932-6203
Abbreviated title: PLoS One
Country: United States
NLM ID: 101285081
Publication information
Publication date:
2020
History:
received: 2020-01-16
accepted: 2020-08-11
entrez: 2020-08-29
pubmed: 2020-08-29
medline: 2020-09-29
Status:
epublish
Abstract
Health advances are contingent on continuous development of new methods and approaches to foster data-driven discovery in the biomedical and clinical sciences. Open-science and team-based scientific discovery offer hope for tackling some of the difficult challenges associated with managing, modeling, and interpreting large, complex, and multisource data. Translating raw observations into useful information and actionable knowledge depends on effective domain-independent reproducibility, area-specific replicability, data curation, analysis protocols, organization, management, and sharing of health-related digital objects. This study expands the functionality and utility of an ensemble semi-supervised machine learning technique called Compressive Big Data Analytics (CBDA). Applied to high-dimensional data, CBDA (1) identifies salient features and key biomarkers enabling reliable and reproducible forecasting of binary, multinomial, and continuous outcomes (i.e., feature mining); and (2) suggests the most accurate algorithms/models for predictive analytics of the observed data (i.e., model mining). The method relies on iterative subsampling, combines function optimization and statistical inference, and generates ensemble predictions for observed univariate outcomes. The novelty of this study is highlighted by a new and expanded set of CBDA features including (1) efficiently handling extremely large datasets (>100,000 cases and >1,000 features); (2) generalizing the internal and external validation steps; (3) expanding the set of base-learners for joint ensemble prediction; (4) introducing an automated selection of CBDA specifications; and (5) providing mechanisms to assess CBDA convergence, evaluate the prediction accuracy, and measure result consistency. To ground the mathematical model and the corresponding computational algorithm, CBDA 2.0 validation utilizes synthetic datasets as well as a population-wide census-like study.
Specifically, an empirical validation of the CBDA technique is based on translational health research using a large-scale clinical study (UK Biobank), which includes imaging, cognitive, and clinical assessment data. The UK Biobank archive presents several difficult challenges related to the aggregation, harmonization, modeling, and interrogation of the information. These problems are related to the complex longitudinal structure, variable heterogeneity, feature multicollinearity, incongruency, and missingness, as well as violations of classical parametric assumptions. Our results show the scalability, efficiency, and usability of CBDA to convert complex data into structural information leading to derived knowledge and translational action. Applying CBDA 2.0 to the UK Biobank case-study allows predicting various outcomes of interest, e.g., mood disorders and irritability, and suggests new and exciting avenues of evidence-based research in the context of identifying, tracking, and treating mental health and aging-related diseases. Following open-science principles, we share the entire end-to-end protocol, source-code, and results. This facilitates independent validation, result reproducibility, and team-based collaborative discovery.
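Illustrative sketch (not the authors' code): the abstract describes feature mining through iterative subsampling, and the toy Python example below shows one way that idea could look. Everything in it is an assumption made for illustration: the synthetic data, the nearest-centroid classifier standing in for the ensemble of base-learners, the rule of keeping the top 10% of subsample models, and the function names make_synthetic, centroid_accuracy, and subsample_feature_mining. It does not reproduce CBDA 2.0 or its published protocol.

```python
import random
from collections import Counter


def make_synthetic(n_cases=400, n_features=20, signal=(0, 1, 2), seed=0):
    """Toy binary outcome driven by a few 'signal' features; the rest are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n_cases):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        for j in signal:
            row[j] += 2.0 * label  # shift the signal features for class 1
        X.append(row)
        y.append(label)
    return X, y


def centroid_accuracy(X, y, train_idx, test_idx, feats):
    """Fit a nearest-centroid classifier (a stand-in base learner) and score it."""
    centroids = {}
    for label in (0, 1):
        rows = [X[i] for i in train_idx if y[i] == label]
        if not rows:
            return 0.0  # degenerate subsample: one class is missing
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in feats]
    correct = 0
    for i in test_idx:
        dist = {
            label: sum((X[i][j] - c) ** 2 for j, c in zip(feats, centroids[label]))
            for label in (0, 1)
        }
        correct += min(dist, key=dist.get) == y[i]
    return correct / len(test_idx)


def subsample_feature_mining(X, y, n_iter=600, case_frac=0.3, n_feats=4,
                             top_frac=0.1, seed=1):
    """Rank features by how often they appear in the best-scoring subsamples."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    idx = list(range(n))
    rng.shuffle(idx)
    n_val = n // 5                      # validation cases, never used for fitting
    val_idx, pool = idx[:n_val], idx[n_val:]
    results = []
    for _ in range(n_iter):
        train_idx = rng.sample(pool, int(case_frac * len(pool)))  # random cases
        feats = rng.sample(range(p), n_feats)                     # random features
        results.append((centroid_accuracy(X, y, train_idx, val_idx, feats), feats))
    results.sort(key=lambda t: t[0], reverse=True)
    top = results[: max(1, int(top_frac * n_iter))]
    counts = Counter(j for _, feats in top for j in feats)
    return counts.most_common()         # [(feature index, frequency), ...]


if __name__ == "__main__":
    X, y = make_synthetic()
    for feature, freq in subsample_feature_mining(X, y)[:5]:
        print(feature, freq)
```

In this sketch, features that carry signal should appear disproportionately often among the best-scoring subsample models, which is the intuition behind the feature-mining step; the model-mining step and the convergence and consistency checks mentioned in the abstract are not illustrated.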
Identifiers
pubmed: 32857775
doi: 10.1371/journal.pone.0228520
pii: PONE-D-20-01487
pmc: PMC7455041
Publication types
Journal Article
Research Support, N.I.H., Extramural
Research Support, U.S. Gov't, Non-P.H.S.
Languages
eng
Citation subsets
IM
Pagination
e0228520
Grants
Agency: NIBIB NIH HHS
ID: U54 EB020406
Country: United States
Agency: NIMH NIH HHS
ID: R01 MH121079
Country: United States
Agency: NINR NIH HHS
ID: P20 NR015331
Country: United States
Agency: NIDDK NIH HHS
ID: P30 DK089503
Country: United States
Agency: NCI NIH HHS
ID: R01 CA233487
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR002240
Country: United States
Agency: NINDS NIH HHS
ID: P50 NS091856
Country: United States
Conflict of interest statement
The authors have declared that no competing interests exist.