Variable selection and validation in multivariate modelling.
Journal
Bioinformatics (Oxford, England)
ISSN: 1367-4811
Abbreviated title: Bioinformatics
Country: England
NLM ID: 9808944
Publication information
Publication date: 15 March 2019
History:
received: 24 November 2017
revised: 4 July 2018
accepted: 24 August 2018
pubmed: 31 August 2018
medline: 1 January 2020
entrez: 31 August 2018
Status: ppublish
Abstract
Validation of variable selection and predictive performance is crucial for constructing robust multivariate models that generalize well, minimize overfitting and facilitate interpretation of results. Inappropriate variable selection instead leads to selection bias, thereby increasing the risk of model overfitting and false positive discoveries. Although several algorithms exist to identify a minimal set of the most informative variables (i.e. the minimal-optimal problem), few can select all variables related to the research question (i.e. the all-relevant problem). Robust algorithms that combine identification of both minimal-optimal and all-relevant variables with proper cross-validation are urgently needed. We developed the MUVR algorithm to improve predictive performance and minimize overfitting and false positives in multivariate analysis. In the MUVR algorithm, minimal variable selection is achieved by performing recursive variable elimination in a repeated double cross-validation (rdCV) procedure. The algorithm supports partial least squares and random forest modelling, and simultaneously identifies minimal-optimal and all-relevant variable sets for regression, classification and multilevel analyses. Using three authentic omics datasets, MUVR yielded parsimonious models with minimal overfitting and improved model performance compared with state-of-the-art rdCV. Moreover, MUVR showed advantages over other variable selection algorithms, i.e. Boruta and VSURF, including a scheme for simultaneous variable selection and validation, and wider applicability. Algorithms, data, scripts and a tutorial are open source and available as an R package ('MUVR') at https://gitlab.com/CarlBrunius/MUVR.git. Supplementary data are available at Bioinformatics online.
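The idea of recursive variable elimination inside a repeated double cross-validation can be sketched roughly as below. This is not the MUVR implementation and does not use the MUVR package: it is a minimal Python illustration under stated assumptions. The function names (`k_folds`, `rank_variables`, `nn_mse`, `rdcv_select`), the toy 1-nearest-neighbour predictor, the correlation-based ranking, the rule of keeping half of the variables per step, and the made-up data are all hypothetical stand-ins for the partial least squares or random forest models and variable rankings named in the abstract. The point it tries to show is structural: variables are ranked and eliminated using inner folds only, and each outer test segment is used solely to assess the selected model.

```python
import random
from statistics import mean


def k_folds(rows, k, rng):
    """Shuffle row indices and deal them into k roughly equal folds."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def abs_corr(xs, ys):
    """Absolute Pearson correlation; 0.0 if either input is constant."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return abs(sxy) / (sxx * syy) ** 0.5 if sxx > 0 and syy > 0 else 0.0


def rank_variables(X, y, rows, variables):
    """Order variables best-first by |correlation| with y (toy ranking)."""
    ys = [y[r] for r in rows]
    score = {v: abs_corr([X[r][v] for r in rows], ys) for v in variables}
    return sorted(variables, key=lambda v: -score[v])


def nn_mse(X, y, train, test, variables):
    """Mean squared error of a 1-nearest-neighbour predictor (toy model)."""
    def dist(a, b):
        return sum((X[a][v] - X[b][v]) ** 2 for v in variables)
    errors = []
    for t in test:
        nearest = min(train, key=lambda r: dist(r, t))
        errors.append((y[t] - y[nearest]) ** 2)
    return mean(errors)


def rdcv_select(X, y, n_rep=3, n_outer=4, n_inner=3, keep=0.5, seed=0):
    """Return (selection count per variable, mean outer test MSE)."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    counts = [0] * p
    test_errors = []
    for _ in range(n_rep):                      # repetitions
        outer = k_folds(range(n), n_outer, rng)
        for i, test in enumerate(outer):        # outer loop: held-out test segment
            calib = [r for j, f in enumerate(outer) if j != i for r in f]
            inner = k_folds(calib, n_inner, rng)
            variables = list(range(p))
            val_err, kept_at = {}, {}
            while True:                         # recursive variable elimination
                errs, ranks = [], {v: 0 for v in variables}
                for j, val in enumerate(inner):  # inner loop: train / validate
                    train = [r for m, f in enumerate(inner) if m != j for r in f]
                    errs.append(nn_mse(X, y, train, val, variables))
                    ranked = rank_variables(X, y, train, variables)
                    for pos, v in enumerate(ranked):
                        ranks[v] += pos         # lower summed rank = more important
                val_err[len(variables)] = mean(errs)
                kept_at[len(variables)] = variables
                if len(variables) == 1:
                    break
                n_keep = max(1, int(len(variables) * keep))
                variables = sorted(variables, key=lambda v: ranks[v])[:n_keep]
            # Choose the variable count with the lowest inner validation error
            # (ties go to the smaller set), then assess it on the test segment.
            best_n = min(val_err, key=lambda m: (val_err[m], m))
            for v in kept_at[best_n]:
                counts[v] += 1
            test_errors.append(nn_mse(X, y, calib, test, kept_at[best_n]))
    return counts, mean(test_errors)


# Made-up data for illustration: variables 0 and 1 drive y, the rest are noise.
data_rng = random.Random(1)
X = [[data_rng.gauss(0, 1) for _ in range(8)] for _ in range(40)]
y = [row[0] + row[1] + data_rng.gauss(0, 0.1) for row in X]
counts, test_mse = rdcv_select(X, y)
print("selection count per variable:", counts)
print("mean outer test MSE:", test_mse)
```

Because the outer test rows never influence the ranking or the choice of how many variables to keep, the reported test error is not contaminated by the selection step, which is the kind of selection bias the abstract warns about. How often a variable is selected across outer segments and repetitions gives a rough stability measure in this sketch; the actual package derives its variable sets differently.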
Identifiers
pubmed: 30165467
pii: 5085367
doi: 10.1093/bioinformatics/bty710
pmc: PMC6419897
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
972-980
Copyright information
© The Author(s) 2018. Published by Oxford University Press.