Variable selection and validation in multivariate modelling.


Journal

Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944

Publication information

Publication date:
15 03 2019
History:
received: 24 11 2017
revised: 04 07 2018
accepted: 24 08 2018
pubmed: 31 8 2018
medline: 1 1 2020
entrez: 31 8 2018
Status: ppublish

Abstract

Validating both variable selection and predictive performance is crucial for constructing robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection instead leads to selection bias, increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of the most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms that combine identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed.

We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In MUVR, minimal variable selection is achieved by recursive variable elimination within a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. On three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including a simultaneous variable selection and validation scheme and wider applicability.

Algorithms, data, scripts and a tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git. Supplementary data are available at Bioinformatics online.
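
The idea of recursive variable elimination inside repeated double cross-validation can be illustrated with a minimal Python sketch. This is not the MUVR implementation or its R API: the function names (`fit`, `predict`, `split`, `take`, `sse`, `rdcv_select`), the parameters (`n_rep`, `n_outer`, `n_inner`, `keep`) and the toy model (an average of univariate least-squares fits, ranked by absolute correlation) are assumptions chosen to keep the example dependency-free, standing in for the PLS and random forest models named in the abstract. The point it tries to show is that variables are ranked and eliminated using inner segments only, so each outer test segment is never seen during selection.

```python
import random
from statistics import mean


def fit(X, y, cols):
    """Toy stand-in model (assumption): average of univariate least-squares fits."""
    y_bar = mean(y)
    syy = sum((t - y_bar) ** 2 for t in y)
    params = []
    for j in cols:
        xj = [row[j] for row in X]
        x_bar = mean(xj)
        sxx = sum((v - x_bar) ** 2 for v in xj)
        sxy = sum((v - x_bar) * (t - y_bar) for v, t in zip(xj, y))
        slope = sxy / sxx if sxx > 0 else 0.0
        # Importance of variable j = absolute correlation with y.
        imp = abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0
        params.append((j, x_bar, slope, imp))
    return y_bar, params


def predict(model, X):
    y_bar, params = model
    return [y_bar + mean(s * (row[j] - xb) for j, xb, s, _ in params) for row in X]


def split(idx, k, rng):
    """Shuffle indices and deal them into k disjoint segments."""
    idx = list(idx)
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def take(seq, idx):
    return [seq[i] for i in idx]


def sse(model, X, y, idx):
    """Sum of squared prediction errors on the samples in idx."""
    pred = predict(model, take(X, idx))
    return sum((p - y[i]) ** 2 for p, i in zip(pred, idx))


def rdcv_select(X, y, n_rep=3, n_outer=4, n_inner=3, keep=0.75, seed=1):
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    votes, total_sse = [0] * p, 0.0
    for _ in range(n_rep):                          # repetitions
        for test in split(range(n), n_outer, rng):  # outer loop: held-out test segment
            held_out = set(test)
            calib = [i for i in range(n) if i not in held_out]
            inner = split(calib, n_inner, rng)
            cols, tried = list(range(p)), []
            while True:                             # recursive elimination on inner data only
                err, imp = 0.0, dict.fromkeys(cols, 0.0)
                for val in inner:
                    val_set = set(val)
                    train = [i for i in calib if i not in val_set]
                    model = fit(take(X, train), take(y, train), cols)
                    err += sse(model, X, y, val)
                    for j, _, _, score in model[1]:
                        imp[j] += score
                tried.append((err, len(cols), cols))
                if len(cols) == 1:
                    break
                n_keep = max(1, int(len(cols) * keep))
                cols = sorted(cols, key=lambda j: -imp[j])[:n_keep]
            _, _, best = min(tried)  # lowest inner error; ties go to fewer variables
            model = fit(take(X, calib), take(y, calib), best)
            total_sse += sse(model, X, y, test)     # test segment used for prediction only
            for j in best:
                votes[j] += 1
    return votes, total_sse / (n * n_rep)


# Synthetic demo (assumption): only variable 0 carries signal.
rng = random.Random(0)
X = [[rng.gauss(0, 1) for _ in range(6)] for _ in range(40)]
y = [3.0 * row[0] + rng.gauss(0, 0.1) for row in X]
votes, mse = rdcv_select(X, y)
print(votes, round(mse, 3))
```

Here `votes` counts how often each variable survived selection across all outer segments, and `mse` is the outer-loop prediction error, which is the quantity that remains unbiased by the selection step. The sketch picks a single best variable set per outer segment; it does not attempt to reproduce how MUVR derives its separate minimal-optimal and all-relevant sets.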

Identifiers

pubmed: 30165467
pii: 5085367
doi: 10.1093/bioinformatics/bty710
pmc: PMC6419897

Publication types

Journal Article
Research Support, Non-U.S. Gov't

Languages

eng

Citation subsets

IM

Pagination

972-980

Copyright information

© The Author(s) 2018. Published by Oxford University Press.

Authors

Lin Shi (L)

Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.

Johan A Westerhuis (JA)

Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam XH, The Netherlands.
Metabolomics Center, North-West University, X6001, Potchefstroom, South Africa.

Johan Rosén (J)

Swedish National Food Agency, Uppsala, Sweden.

Rikard Landberg (R)

Department of Molecular Sciences, Swedish University of Agricultural Sciences, Uppsala SE-750 07, Sweden.
Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.

Carl Brunius (C)

Department of Biology and Biological Engineering, Food and Nutrition Science, Chalmers University of Technology, Gothenburg SE-412 96, Sweden.
