ENNGene: an Easy Neural Network model building tool for Genomics.

Keywords: Convolutional Neural Network; Deep Learning; Evolutionary Conservation Score; GUI; RNA Secondary Structure; Recurrent Neural Network

Journal

BMC Genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258

Publication information

Publication date:
31 Mar 2022
History:
received: 5 Nov 2021
accepted: 23 Feb 2022
entrez: 1 Apr 2022
pubmed: 2 Apr 2022
medline: 5 Apr 2022
Status: epublish

Abstract

The recent big data revolution in Genomics, coupled with the emergence of Deep Learning (DL) as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Even though Deep Learning methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is outside the ability of most researchers in the field.

Here we present ENNGene, an Easy Neural Network model building tool for Genomics. This tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface. ENNGene allows multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, so that simple input such as genomic coordinates is sufficient. The network architecture is selected and fully customized by the user, from the number and types of the layers to each layer's precise set-up. ENNGene then handles all steps of training and evaluation of the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of the attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein.

As the role of DL in big data analysis in the near future is indisputable, it is important to make it available to a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL to gain better insights into and extract important information from the large amounts of data available in the field.
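The abstract names two steps that a small example may make more concrete: encoding a sequence as numeric input, and attributing a prediction to input positions with Integrated Gradients. The sketch below is not ENNGene's code. It assumes a toy logistic model over a one-hot encoded four-base sequence, with hypothetical weights and bias chosen only for illustration, an all-zero baseline, and a midpoint Riemann sum in place of the exact path integral.

```python
import math

ALPHABET = "ACGT"

# Hypothetical parameters of a toy model (4 positions x 4 bases); not taken from the paper.
HYPOTHETICAL_WEIGHTS = [
    [0.9, -0.4, 0.1, 0.0],
    [-0.2, 1.2, 0.0, 0.3],
    [0.0, 0.1, -0.7, 0.5],
    [0.4, 0.0, 0.2, 0.6],
]
HYPOTHETICAL_BIAS = -0.5


def one_hot(seq):
    """Encode a DNA string as rows of length 4 (A, C, G, T); other symbols such as N become all zeros."""
    return [[1.0 if base == letter else 0.0 for letter in ALPHABET] for base in seq.upper()]


def model(x, weights, bias):
    """Toy stand-in for a trained network: sigmoid of a weighted sum over all positions and bases."""
    z = bias + sum(w * v for w_row, x_row in zip(weights, x) for w, v in zip(w_row, x_row))
    return 1.0 / (1.0 + math.exp(-z))


def gradient(x, weights, bias):
    """Analytic gradient of the toy model output with respect to every input value."""
    p = model(x, weights, bias)
    scale = p * (1.0 - p)  # derivative of the sigmoid at the current input
    return [[scale * w for w in w_row] for w_row in weights]


def integrated_gradients(x, weights, bias, steps=50):
    """Approximate Integrated Gradients along the straight path from an all-zero baseline to x."""
    total = [[0.0] * len(ALPHABET) for _ in x]
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [[alpha * v for v in row] for row in x]  # baseline + alpha * (x - baseline)
        grad = gradient(point, weights, bias)
        for i, g_row in enumerate(grad):
            for j, g in enumerate(g_row):
                total[i][j] += g
    # (x - baseline) times the average gradient, summed over bases to give one score per position
    return [
        sum(x[i][j] * total[i][j] / steps for j in range(len(ALPHABET)))
        for i in range(len(x))
    ]


if __name__ == "__main__":
    sequence = "ACGT"
    scores = integrated_gradients(one_hot(sequence), HYPOTHETICAL_WEIGHTS, HYPOTHETICAL_BIAS)
    for base, score in zip(sequence, scores):
        print(base, round(score, 4))
```

A useful sanity check on any such implementation is the completeness property: the per-position scores should sum, up to the integration error, to the difference between the model output on the input and on the baseline. Plotting one score per position gives the kind of graphical attribution view the abstract describes.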


Identifiers

pubmed: 35361122
doi: 10.1186/s12864-022-08414-x
pii: 10.1186/s12864-022-08414-x
pmc: PMC8973509

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

248

Grants

Agency: H2020 Spreading Excellence and Widening Participation
ID: 867414
Agency: Masarykova Univerzita
ID: CZ.02.2.69/0.0/0.0/18 053/0016952

Copyright information

© 2022. The Author(s).

Authors

Eliška Chalupová (E)

Faculty of Science, National Centre for Biomolecular Research, Masaryk University, Brno, Czechia.
Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.

Ondřej Vaculík (O)

Faculty of Science, National Centre for Biomolecular Research, Masaryk University, Brno, Czechia.
Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.

Jakub Poláček (J)

Faculty of Informatics, Masaryk University, Brno, Czechia.

Filip Jozefov (F)

Faculty of Informatics, Masaryk University, Brno, Czechia.

Tomáš Majtner (T)

Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia.

Panagiotis Alexiou (P)

Central European Institute of Technology (CEITEC), Masaryk University, Brno, Czechia. panagiotis.alexiou@ceitec.muni.cz.
