ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins.
DNA‐binding proteins
computational biology
deep learning
evolutionary features
machine learning
prokaryotes
Journal
Protein science : a publication of the Protein Society
ISSN: 1469-896X
Abbreviated title: Protein Sci
Country: United States
NLM ID: 9211750
Publication information
Publication date:
Jun 2024
History:
received: 2023-12-08
revised: 2024-04-18
accepted: 2024-04-21
medline: 2024-05-15
pubmed: 2024-05-15
entrez: 2024-05-15
Status:
ppublish
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in gene regulation, DNA replication, and other cellular functions. Accurate computational models for predicting prokaryotic DBPs could accelerate the discovery of novel proteins, deepen understanding of prokaryotic biology, and facilitate the development of targeted therapeutics for disease intervention. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for predicting prokaryotic DBPs. Nine shallow learning algorithms and five deep learning models were evaluated, and the shallow learning models outperformed their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via the random forest variable importance measure (RF-VIM), yielded the highest five-fold cross-validation accuracy. This model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. ProkDBP also performed well on an independent dataset, with an auROC of 0.9332 and an auPRC of 0.9371. When benchmarked against several state-of-the-art existing models, ProkDBP showed superior predictive accuracy. To promote accessibility and usability, ProkDBP is freely available as an online prediction tool (https://iasri-sg.icar.gov.in/prokdbp/). The tool adds to the resources available for accurate and efficient prediction of prokaryotic DBPs.
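The abstract reports auROC and auPRC under five-fold cross-validation. As a rough illustration of what those quantities measure, the sketch below computes both metrics and a stratified five-fold assignment in plain Python. It is not the authors' code: the function names (`au_roc`, `au_prc`, `stratified_folds`) and the toy labels and scores are assumptions made purely for illustration, and auPRC is computed here as average precision, which is one common estimator of the area under the precision-recall curve.

```python
from typing import List, Sequence


def au_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the Mann-Whitney identity.

    Equals the probability that a randomly chosen positive scores higher
    than a randomly chosen negative, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def au_prc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision: mean of the precision at each positive's rank.

    Tied scores are ranked in input order (no special tie handling).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive")
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    true_pos = 0
    total = 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            true_pos += 1
            total += true_pos / rank
    return total / n_pos


def stratified_folds(labels: Sequence[int], k: int = 5) -> List[List[int]]:
    """Assign sample indices to k folds, dealing each class out round-robin
    so every fold keeps roughly the overall class balance."""
    folds: List[List[int]] = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        for j, i in enumerate(members):
            folds[j % k].append(i)
    return folds


# Hypothetical toy data: 1 = DNA-binding, 0 = non-binding.
toy_labels = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.35, 0.3, 0.2, 0.1]

print("auROC:", au_roc(toy_labels, toy_scores))
print("auPRC:", au_prc(toy_labels, toy_scores))
print("folds:", stratified_folds(toy_labels, k=5))
```

In a real evaluation, each fold would serve once as the held-out set while a model is trained on the other four, and the metrics would be computed on the held-out predictions.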
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
e5015
Grants
Agency: ICAR-Indian Agricultural Statistics Research Institute
Copyright information
© 2024 The Protein Society.