Development of Supervised Learning Predictive Models for Highly Non-linear Biological, Biomedical, and General Datasets.
clustering
highly non-linear datasets
recursive binary methods
statistical techniques
supervised learning algorithms
Journal
Frontiers in molecular biosciences
ISSN: 2296-889X
Abbreviated title: Front Mol Biosci
Country: Switzerland
NLM ID: 101653173
Publication information
Publication date: 2020
History:
received: 2019-09-09
accepted: 2020-01-22
entrez: 2020-03-03
pubmed: 2020-03-03
medline: 2020-03-03
Status:
epublish
Abstract
In highly non-linear datasets, the attributes or features do not readily reveal visual patterns that identify common underlying behaviors. Therefore, it is not possible to achieve classification or regression using linear or mildly non-linear hyperspace partition functions. Hence, supervised learning models built with most existing algorithms are limited, and their performance metrics are low. Linear transformations of variables, such as principal component analysis, cannot avoid the problem, and even models based on artificial neural networks and deep learning are unable to improve the metrics. Sometimes, even when features allow classification or regression in reported cases, performance metrics of supervised learning algorithms remain unsatisfactorily low. This problem recurs in many areas of study, for example clinical research, biotechnology, and protein engineering, where many of the attributes are correlated in an unknown and highly non-linear fashion or are categorical and difficult to relate to a target response variable. In such areas, being able to create predictive models would dramatically improve the quality of outcomes, generating immediate added value for both the scientific community and the general public. In this manuscript, we present RV-Clustering, a library of unsupervised learning algorithms, and a new methodology designed to find optimum partitions within highly non-linear datasets that allow deconvoluting variables and notably improving performance metrics in supervised learning classification or regression models. The partitions obtained are statistically cross-validated, ensuring correct representativity and no over-fitting. We have successfully tested RV-Clustering in several highly non-linear datasets with different origins.
The proposed approach has generated classification and regression models with high performance metrics, which further supports its ability to generate predictive models for highly non-linear datasets. Advantageously, the method does not require significant human input, which makes it more usable by members of the biological, biomedical, and protein engineering communities who have no specific knowledge of machine learning.
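The general idea described in the abstract, partitioning a dataset first and then fitting a separate supervised model inside each partition, can be illustrated with a toy example. The sketch below is not the RV-Clustering library and does not reproduce its algorithms or API. It assumes a one-feature dataset where y = |x|, a plain 2-means clustering step, and a least-squares line per partition. All function names (`fit_line`, `mse`, `two_means`) are hypothetical, and the statistical cross-validation of partitions mentioned in the abstract is omitted.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b for a single feature."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx if sxx else 0.0
    return a, my - a * mx

def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    a, b = model
    return mean((a * x + b - y) ** 2 for x, y in zip(xs, ys))

def two_means(xs, max_iters=50):
    """Plain 1-D k-means with k=2, initialised at the min and max values."""
    c0, c1 = min(xs), max(xs)
    labels = []
    for _ in range(max_iters):
        new_labels = [0 if abs(x - c0) <= abs(x - c1) else 1 for x in xs]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        g0 = [x for x, lab in zip(xs, labels) if lab == 0]
        g1 = [x for x, lab in zip(xs, labels) if lab == 1]
        if g0 and g1:
            c0, c1 = mean(g0), mean(g1)
    return labels, (c0, c1)

# Toy dataset (an assumption for illustration): y = |x| is non-linear overall
# but exactly linear inside each half of the feature range.
xs = [i / 2 for i in range(-10, 11)]
ys = [abs(x) for x in xs]

# Baseline: one linear model for the whole dataset.
global_mse = mse(fit_line(xs, ys), xs, ys)

# Partition first, then fit one linear model per partition.
labels, centers = two_means(xs)
squared_errors = []
for k in (0, 1):
    px = [x for x, lab in zip(xs, labels) if lab == k]
    py = [y for y, lab in zip(ys, labels) if lab == k]
    a, b = fit_line(px, py)
    squared_errors.extend((a * x + b - y) ** 2 for x, y in zip(px, py))
partition_mse = mean(squared_errors)

print(f"global model MSE:      {global_mse:.4f}")
print(f"partitioned model MSE: {partition_mse:.4f}")
```

In this toy case the clustering step separates the negative and positive halves of the feature range, so each per-partition model sees a linear relationship that the single global model cannot capture. On real data, the partitions would also need to be validated, as the abstract notes, to check that they are representative and do not lead to over-fitting.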
Identifiers
pubmed: 32118039
doi: 10.3389/fmolb.2020.00013
pmc: PMC7031350
Publication types
Journal Article
Languages
eng
Pagination
13
Copyright information
Copyright © 2020 Medina-Ortiz, Contreras, Quiroz and Olivera-Nappa.