PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest.
Deep Forest
Deep learning
Feature fusion
Feature selection
Machine learning
Promoter
Journal
Interdisciplinary sciences, computational life sciences
ISSN: 1867-1462
Abbreviated title: Interdiscip Sci
Country: Germany
NLM ID: 101515919
Publication information
Publication date:
Sep 2022
History:
received: 19 Oct 2021
accepted: 5 Apr 2022
revised: 5 Apr 2022
pubmed: 1 May 2022
medline: 3 Aug 2022
entrez: 30 Apr 2022
Status:
ppublish
Abstract
Promoters are short DNA sequences that play vital roles in initiating gene transcription. However, identifying promoters in a high-throughput manner with conventional experimental techniques remains a challenge. To this end, several computational predictors based on machine learning models have been developed, but their performance remains unsatisfactory. In this study, we proposed a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) was developed by combining deep features learned by a pre-trained deep learning network model with sequence-derived features. Feature selection based on XGBoost was applied to reduce the dimensionality of the fused features, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results of both fivefold cross-validation and independent tests demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
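The fusion-then-selection idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not state which sequence-derived features were used, so k-mer composition stands in for them; the deep features and importance scores are placeholder numbers rather than outputs of a trained network or of XGBoost; and the function names `kmer_frequencies`, `fuse`, and `select_top_k` are hypothetical. The cascade deep forest classifier itself is not shown.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequencies of a DNA sequence, in lexicographic k-mer order.

    Assumption: k-mer composition is used here only as a stand-in for the
    sequence-derived features mentioned in the abstract.
    """
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - k + 1):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing ambiguous bases such as N
            counts[window] += 1
            total += 1
    return [counts[m] / total if total else 0.0 for m in kmers]


def fuse(*feature_vectors):
    """Fuse features from several sources by simple concatenation."""
    fused = []
    for vec in feature_vectors:
        fused.extend(vec)
    return fused


def select_top_k(features, importances, k):
    """Keep the k features with the highest importance, preserving original order.

    In the described pipeline the importances would come from a fitted
    XGBoost model; here they are supplied by the caller.
    """
    if len(features) != len(importances):
        raise ValueError("features and importances must have the same length")
    ranked = sorted(range(len(features)),
                    key=lambda i: importances[i], reverse=True)[:k]
    keep = sorted(ranked)
    return [features[i] for i in keep], keep


if __name__ == "__main__":
    seq_features = kmer_frequencies("TTGACATATAAT", k=2)  # 16 dinucleotide features
    deep_features = [0.12, 0.80, 0.05]                    # placeholder deep features
    fused = fuse(seq_features, deep_features)
    print(len(fused))
```

Concatenation is the simplest form of fusion; it leaves the job of discarding redundant dimensions to the selection step, which is why the two steps appear together in this sketch.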
Identifiers
pubmed: 35488998
doi: 10.1007/s12539-022-00520-4
pii: 10.1007/s12539-022-00520-4
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
697-711
Grants
Agency: Natural Science Foundation of Shaanxi Province
ID: 2021JM-110
Agency: National Natural Science Foundation of China
ID: 61972322
Copyright information
© 2022. International Association of Scientists in the Interdisciplinary Areas.