ENNGene: an Easy Neural Network model building tool for Genomics.

Genomics; Machine Learning; Neural Networks, Computer; Protein Structure, Secondary

Convolutional Neural Network; Deep Learning; Evolutionary Conservation Score; GUI; RNA Secondary Structure; Recurrent Neural Network

Journal

BMC Genomics

ISSN: 1471-2164

Abbreviated title: BMC Genomics

Country: England

NLM ID: 100965258

Publication information

Publication date: 31 Mar 2022

History:

received: 5 Nov 2021

accepted: 23 Feb 2022

entrez: 1 Apr 2022

pubmed: 2 Apr 2022

medline: 5 Apr 2022

Status: epublish

Abstract

BACKGROUND: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning (DL) as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though DL methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field.

RESULTS: Here we present ENNGene, an Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, so that simple input such as genomic coordinates is sufficient. The network architecture is selected and fully customized by the user, from the number and types of the layers to each layer's precise set-up. ENNGene then handles all steps of training and evaluation of the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of the attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein.

CONCLUSIONS: As the role of DL in big data analysis in the near future is indisputable, it is important to make it available to a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL to gain better insights into, and extract important information from, the large amounts of data available in the field.
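As a rough illustration of the Integrated Gradients attribution mentioned in the abstract, the sketch below one-hot encodes a short DNA sequence and attributes a prediction to each input position. It is not ENNGene's implementation, which this record does not describe: the toy logistic model, the random weights, the all-zero baseline, and every function name here are assumptions chosen only to keep the example self-contained.

```python
import numpy as np

ALPHABET = "ACGT"

def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = [ALPHABET.index(base) for base in seq]
    return np.eye(len(ALPHABET))[idx]

def model(x, w, b):
    """Toy stand-in for a trained network (assumption): logistic regression."""
    return 1.0 / (1.0 + np.exp(-(np.sum(x * w) + b)))

def model_grad(x, w, b):
    """Analytic gradient of the toy model with respect to its input."""
    p = model(x, w, b)
    return p * (1.0 - p) * w

def integrated_gradients(x, baseline, w, b, steps=500):
    """Approximate the path integral of gradients with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    grads = np.zeros_like(x)
    for a in alphas:
        # Gradient at a point on the straight line from baseline to input.
        grads += model_grad(baseline + a * (x - baseline), w, b)
    return (x - baseline) * grads / steps

rng = np.random.default_rng(0)
x = one_hot("ACGTTGCA")
w = rng.normal(size=x.shape)   # hypothetical weights, not a trained model
b = 0.1
baseline = np.zeros_like(x)    # all-zero baseline (assumption)

attr = integrated_gradients(x, baseline, w, b)
per_position = attr.sum(axis=1)  # one attribution score per sequence position
```

A useful sanity check is the completeness property: the attributions should sum to approximately the difference between the model output at the input and at the baseline. The `per_position` vector is the kind of quantity that could be plotted along the sequence.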

Identifiers

DOI: 10.1186/s12864-022-08414-x

PMID: 35361122

PMC: PMC8973509

pii: 10.1186/s12864-022-08414-x

Publication types

Journal Article

Languages

eng

Pagination

248

Grants

Agency: H2020 Spreading Excellence and Widening Participation

ID: 867414

Agency: Masarykova Univerzita

ID: CZ.02.2.69/0.0/0.0/18 053/0016952

Authors

Eliška Chalupová

Ondřej Vaculík

Jakub Poláček

Filip Jozefov

Tomáš Majtner

Panagiotis Alexiou

Similar articles

Exploring blood-brain barrier passage using atomic weighted vector and machine learning.

Understanding the role of machine learning in predicting progression of osteoarthritis.

Comparative genomic analysis and characterization of novel high-quality draft genomes from the coal metagenome.

Unsupervised learning for real-time and continuous gait phase detection.
