A successful hybrid deep learning model aiming at promoter identification.
Convolutional neural networks (CNNs)
Fully connected networks
Promoter identification
Structural profiles
Journal
BMC bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194
Publication information
Publication date:
31 May 2022
History:
received: 12 May 2022
accepted: 16 May 2022
entrez: 1 June 2022
pubmed: 2 June 2022
medline: 3 June 2022
Status:
epublish
Abstract
BACKGROUND
The zone adjacent to a transcription start site (TSS), namely, the promoter, is primarily involved in the process of DNA transcription initiation and regulation. As a result, proper promoter identification is critical for further understanding the mechanism of the networks controlling genomic regulation. A number of methodologies for the identification of promoters have been proposed. Nonetheless, due to the great heterogeneity existing in promoters, the results of these procedures are still unsatisfactory. In order to establish additional discriminative characteristics and properly recognize promoters, we developed the hybrid model for promoter identification (HMPI), a hybrid deep learning model that can characterize both the native sequences of promoters and the morphological outline of promoters at the same time. We developed the HMPI to combine a method called the PSFN (promoter sequence features network), which characterizes native promoter sequences and deduces sequence features, with a technique referred to as the DSPN (deep structural profiles network), which is specially structured to model the promoters in terms of their structural profile and to deduce their structural attributes.
RESULTS
The HMPI was applied to human, plant and Escherichia coli K-12 strain datasets, and the findings showed that the HMPI was successful at extracting the features of the promoter while greatly enhancing the promoter identification performance. In addition, after the improvements of synthetic sampling, transfer learning and label smoothing regularization, the improved HMPI models achieved good results in identifying subtypes of promoters on prokaryotic promoter datasets.
CONCLUSIONS
The results showed that the HMPI was successful at extracting the features of promoters while greatly enhancing the performance of identifying promoters on both eukaryotic and prokaryotic datasets, and the improved HMPI models are good at identifying subtypes of promoters on prokaryotic promoter datasets. The HMPI is additionally adaptable to different biological functional sequences, allowing for the addition of new features or models.
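As a rough illustration of the ideas named in the abstract, the following Python sketch shows how one promoter sequence might be turned into the two views the model combines (a one-hot sequence matrix for a sequence branch and a per-dinucleotide structural profile for a structural branch), together with the standard label smoothing formula. It is not the authors' implementation. The function names, the `TOY_TABLE` lookup and its values are hypothetical placeholders; a real structural profile would use published dinucleotide property values, which this record does not specify.

```python
from typing import Dict, List

BASES = "ACGT"


def one_hot(seq: str) -> List[List[float]]:
    """Encode a DNA sequence as a len(seq) x 4 one-hot matrix (A, C, G, T order).

    Symbols outside A/C/G/T (for example N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def structural_profile(seq: str, table: Dict[str, float]) -> List[float]:
    """Map each overlapping dinucleotide to one scalar property value.

    Returns len(seq) - 1 values. Raises KeyError for dinucleotides that are
    missing from the table (for example those containing N).
    """
    seq = seq.upper()
    return [table[seq[i:i + 2]] for i in range(len(seq) - 1)]


def smooth_labels(one_hot_label: List[float], eps: float = 0.1) -> List[float]:
    """Standard label smoothing: (1 - eps) * y + eps / K for K classes."""
    k = len(one_hot_label)
    return [(1.0 - eps) * y + eps / k for y in one_hot_label]


# ASSUMPTION: placeholder numbers only, NOT measured structural properties.
# Each dinucleotide simply gets its index (AA=0, AC=1, ..., TT=15).
TOY_TABLE: Dict[str, float] = {
    a + b: float(BASES.index(a) * 4 + BASES.index(b))
    for a in BASES
    for b in BASES
}

if __name__ == "__main__":
    promoter = "TATAAT"
    print(one_hot(promoter))                        # input for a sequence branch
    print(structural_profile(promoter, TOY_TABLE))  # input for a structural branch
    print(smooth_labels([0.0, 1.0]))                # softened training target
```

In a hybrid design of this kind, each view would feed its own network and the learned features would be merged before classification; how the HMPI does this in detail is described in the paper itself, not in this record.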
Identifiers
pubmed: 35641900
doi: 10.1186/s12859-022-04735-6
pii: 10.1186/s12859-022-04735-6
pmc: PMC9158169
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
206
Grants
Agency: National Natural Science Foundation of China
ID: 61872288
Copyright information
© 2022. The Author(s).