ENNGene: an Easy Neural Network model building tool for Genomics.
Convolutional Neural Network
Deep Learning
Evolutionary Conservation Score
GUI
RNA Secondary Structure
Recurrent Neural Network
Journal
BMC Genomics
ISSN: 1471-2164
Abbreviated title: BMC Genomics
Country: England
NLM ID: 100965258
Publication information
Publication date: 31 Mar 2022
History:
received: 05 Nov 2021
accepted: 23 Feb 2022
entrez: 01 Apr 2022
pubmed: 02 Apr 2022
medline: 05 Apr 2022
Status: epublish
Abstract
Background: The recent big data revolution in Genomics, coupled with the emergence of Deep Learning (DL) as a set of powerful machine learning methods, has shifted the standard practices of machine learning for Genomics. Although DL methods such as Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are becoming widespread in Genomics, developing and training such models is beyond the expertise of most researchers in the field.

Results: Here we present ENNGene, an Easy Neural Network model building tool for Genomics. The tool simplifies training of custom CNN or hybrid CNN-RNN models on genomic data via an easy-to-use Graphical User Interface (GUI). ENNGene supports multiple input branches, including sequence, evolutionary conservation, and secondary structure, and performs all the necessary preprocessing steps, so that simple input such as genomic coordinates is sufficient. The network architecture is selected and fully customized by the user, from the number and types of layers to each layer's precise set-up. ENNGene then handles all steps of training and evaluating the model, exporting valuable metrics such as multi-class ROC and precision-recall curve plots or TensorBoard log files. To facilitate interpretation of the predicted results, we deploy Integrated Gradients, providing the user with a graphical representation of the attribution level of each input position. To showcase the usage of ENNGene, we train multiple models on the RBP24 dataset, quickly reaching the state of the art while improving the performance on more than half of the proteins by including the evolutionary conservation score and tuning the network per protein.

Conclusions: Because the role of DL in big data analysis in the near future is indisputable, it is important to make it accessible to a broader range of researchers. We believe that an easy-to-use tool such as ENNGene can allow Genomics researchers without a background in Computational Sciences to harness the power of DL, gaining better insights into, and extracting important information from, the large amounts of data available in the field.
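
The abstract mentions Integrated Gradients as the method used to attribute a prediction to individual input positions. The following is a minimal sketch of that general technique, not ENNGene's own implementation: the toy logistic model, its random weights, the all-zero baseline, and every function name here are assumptions made purely for illustration. It one-hot encodes a short sequence, averages the model's gradient along a straight path from the baseline to the input, and scales that average by the input difference.

```python
import numpy as np

ALPHABET = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a (length, 4) one-hot matrix."""
    idx = [ALPHABET.index(base) for base in seq]
    return np.eye(len(ALPHABET))[idx]


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def toy_model(x, weights, bias):
    """Hypothetical stand-in for a trained network: logistic score of x."""
    return sigmoid(np.sum(weights * x) + bias)


def toy_gradient(x, weights, bias):
    """Analytic gradient of toy_model with respect to its input x."""
    p = toy_model(x, weights, bias)
    return p * (1.0 - p) * weights


def integrated_gradients(x, baseline, weights, bias, steps=200):
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    alphas = (np.arange(steps) + 0.5) / steps
    grad_sum = np.zeros_like(x)
    for alpha in alphas:
        point = baseline + alpha * (x - baseline)
        grad_sum += toy_gradient(point, weights, bias)
    # Average path gradient, scaled by the input difference.
    return (x - baseline) * (grad_sum / steps)


rng = np.random.default_rng(0)
x = one_hot("ACGTGCA")
weights = rng.normal(size=x.shape)  # assumed weights, not a trained model
bias = 0.1
baseline = np.zeros_like(x)         # assumed all-zero reference input

attributions = integrated_gradients(x, baseline, weights, bias)
per_position = attributions.sum(axis=1)  # one attribution score per base
print(per_position)
```

A useful sanity check on any such implementation is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output for the input and for the baseline. The per-position scores are what a tool would typically render as a plot or heat map over the sequence.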
Identifiers
pubmed: 35361122
doi: 10.1186/s12864-022-08414-x
pii: 10.1186/s12864-022-08414-x
pmc: PMC8973509
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
248
Grants
Agency: H2020 Spreading Excellence and Widening Participation
ID: 867414
Agency: Masarykova Univerzita
ID: CZ.02.2.69/0.0/0.0/18 053/0016952
Copyright information
© 2022. The Author(s).