PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest.

Keywords: Deep Forest; Deep learning; Feature fusion; Feature selection; Machine learning; Promoter

Journal

Interdisciplinary sciences, computational life sciences
ISSN: 1867-1462
Abbreviated title: Interdiscip Sci
Country: Germany
NLM ID: 101515919

Publication information

Publication date:
Sep 2022
History:
received: 19 Oct 2021
accepted: 5 Apr 2022
revised: 5 Apr 2022
pubmed: 1 May 2022
medline: 3 Aug 2022
entrez: 30 Apr 2022
Status: ppublish

Abstract

Promoters are short DNA sequences that play vital roles in initiating gene transcription. However, it remains a challenge to identify promoters in a high-throughput manner using conventional experimental techniques. To this end, several computational predictors based on machine learning models have been developed, but their performance remains unsatisfactory. In this study, we propose a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) combines deep features learned by a pre-trained deep learning network model with sequence-derived features. Feature selection based on XGBoost was applied to reduce the dimensionality of the fused features, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results of both fivefold cross-validation and independent testing demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
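
The fusion and selection steps described in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation. The k-mer encoding is only one common sequence-derived feature (the abstract does not say which encodings were used), and the function names, the "deep" feature values, and the importance scores are all hypothetical placeholders. In the paper the importance scores come from XGBoost and the reduced features feed a cascade deep forest; neither is reproduced here.

```python
from itertools import product


def kmer_frequencies(seq, k=2):
    """Normalized k-mer frequency vector over the alphabet ACGT (4**k values)."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {kmer: 0 for kmer in kmers}
    n_windows = len(seq) - k + 1
    for i in range(n_windows):
        window = seq[i:i + k]
        if window in counts:  # skip windows containing non-ACGT symbols
            counts[window] += 1
    denom = max(n_windows, 1)  # avoid division by zero for very short input
    return [counts[kmer] / denom for kmer in kmers]


def fuse(*feature_blocks):
    """Multi-source fusion by plain concatenation of feature vectors."""
    fused = []
    for block in feature_blocks:
        fused.extend(block)
    return fused


def select_top_k(matrix, importances, k):
    """Keep the k columns with the highest importance scores.

    Returns the kept column indices (ascending) and the reduced matrix.
    """
    ranked = sorted(range(len(importances)),
                    key=lambda j: importances[j], reverse=True)
    keep = sorted(ranked[:k])
    return keep, [[row[j] for j in keep] for row in matrix]


if __name__ == "__main__":
    sequences = ["ACGTACGTAC", "TTTTAAAACC"]
    # Placeholder "deep" features; a real pipeline would take these from
    # a pre-trained network.
    deep_features = [[0.12, 0.80], [0.55, 0.31]]
    fused = [fuse(kmer_frequencies(s), d)
             for s, d in zip(sequences, deep_features)]
    # Placeholder importance scores, one per fused column (16 k-mers + 2 deep).
    importances = [0.01 * j for j in range(len(fused[0]))]
    keep, reduced = select_top_k(fused, importances, k=5)
    print(keep)
    print(reduced)
```

Concatenation is the simplest fusion strategy; ranking columns by a tree-model importance score and keeping the top k is one common way to shrink the fused vector before training the final classifier.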

Identifiers

pubmed: 35488998
doi: 10.1007/s12539-022-00520-4
pii: 10.1007/s12539-022-00520-4

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

697-711

Grants

Agency: Natural Science Foundation of Shaanxi Province
ID: 2021JM-110
Agency: National Natural Science Foundation of China
ID: 61972322

Copyright information

© 2022. International Association of Scientists in the Interdisciplinary Areas.

Authors

Miao Wang (M)

College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China.

Fuyi Li (F)

Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC, 3000, Australia.

Hao Wu (H)

School of Software, Shandong University, Jinan, 250100, Shandong, China.

Quanzhong Liu (Q)

College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China. liuqzhong@nwsuaf.edu.cn.

Shuqin Li (S)

College of Information Engineering, Northwest A&F University, Yangling, 712100, Shaanxi, China. lsq_cie@nwsuaf.edu.cn.
