DPI_CDF: druggable protein identifier using cascade deep forest.

Keywords: Bioinformatics; Cascade deep forest; Druggable proteins; PSSM; Physicochemical features

Journal

BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
05 Apr 2024
History:
received: 01 06 2023
accepted: 13 03 2024
medline: 6 4 2024
pubmed: 6 4 2024
entrez: 5 4 2024
Status: epublish

Structured abstract

BACKGROUND
Drug targets in living organisms play pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets is accurate but generally expensive, slow, and resource intensive. Computational methods are therefore highly desirable as an alternative for expediting the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory.
METHODS
In this study, we developed DPI_CDF, a novel deep learning-based model for predicting DPs from protein sequence alone. DPI_CDF generates features from evolutionary (histograms of oriented gradients for the position-specific scoring matrix), physicochemical (component protein sequence representation), and compositional (normalized qualitative characteristic) properties of the protein sequence. A hierarchical deep forest model then fuses these three encoding schemes to build the proposed model.
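As a rough illustration of the evolutionary encoding named above, the following minimal sketch computes a HOG-style descriptor over a PSSM treated as a small grayscale image. The function name `hog_pssm`, the 2 x 2 cell grid, the 8 orientation bins, and the edge handling are assumptions made for illustration only; they are not taken from the DPI_CDF source code, whose parameters may differ.

```python
import math


def hog_pssm(pssm, cells=(2, 2), n_bins=8):
    """Return a HOG-style feature vector for a PSSM (list of equal-length rows).

    Hypothetical helper: gradients are central differences, each position
    votes into an orientation bin weighted by gradient magnitude, and the
    concatenated cell histograms are L2-normalized.
    """
    rows, cols = len(pssm), len(pssm[0])
    cell_rows, cell_cols = cells
    hist = [[[0.0] * n_bins for _ in range(cell_cols)] for _ in range(cell_rows)]

    for i in range(rows):
        for j in range(cols):
            # Central differences; border positions reuse the edge value.
            gx = pssm[i][min(j + 1, cols - 1)] - pssm[i][max(j - 1, 0)]
            gy = pssm[min(i + 1, rows - 1)][j] - pssm[max(i - 1, 0)][j]
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Map the orientation to [0, 2*pi) and pick its histogram bin.
            angle = math.atan2(gy, gx) % (2 * math.pi)
            b = min(int(angle / (2 * math.pi) * n_bins), n_bins - 1)
            # Locate the cell that contains position (i, j).
            ci = min(i * cell_rows // rows, cell_rows - 1)
            cj = min(j * cell_cols // cols, cell_cols - 1)
            hist[ci][cj][b] += magnitude

    features = [v for row in hist for cell in row for v in cell]
    norm = math.sqrt(sum(v * v for v in features))
    return [v / norm for v in features] if norm > 0 else features


# Toy 4 x 4 matrix standing in for a real (sequence length x 20) PSSM.
toy_pssm = [[float(j) for j in range(4)] for _ in range(4)]
toy_features = hog_pssm(toy_pssm)
```

The output length is fixed at `cells[0] * cells[1] * n_bins` regardless of sequence length, which is what makes a descriptor of this kind usable as input to a classifier that expects equal-sized feature vectors.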
RESULTS
Under 10-fold cross-validation, the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. Compared with current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will help the research community identify druggable proteins and accelerate the drug discovery process.
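For readers less familiar with the two metrics reported here, the sketch below shows how accuracy and MCC are computed from the counts of a binary confusion matrix. The function names and the example counts are hypothetical and chosen only for illustration; they are not the confusion matrices behind the figures reported in this abstract.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when the denominator is zero (a common convention when
    one row or column of the confusion matrix is empty).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


# Hypothetical counts: 90 true positives, 95 true negatives,
# 5 false positives, 10 false negatives.
example_accuracy = accuracy(90, 95, 5, 10)
example_mcc = mcc(90, 95, 5, 10)
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, so it is generally considered more informative than accuracy when the two classes are imbalanced.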
AVAILABILITY
The benchmark datasets and source code are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF .

Identifiers

pubmed: 38580921
doi: 10.1186/s12859-024-05744-3
pii: 10.1186/s12859-024-05744-3

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

145

Copyright information

© 2024. The Author(s).

Authors

Muhammad Arif (M)

College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.

Ge Fang (G)

State Key Laboratory for Organic Electronics and Information Displays, Institute of Advanced Materials (IAM), Nanjing 210023, P. R. China.
Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University, Bangkok, 10700, Thailand.

Ali Ghulam (A)

Information Technology Centre, Sindh Agriculture University, Sindh, Pakistan.

Saleh Musleh (S)

College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar.

Tanvir Alam (T)

College of Science and Engineering, Hamad Bin Khalifa University, Doha, Qatar. talam@hbku.edu.qa.

MeSH classifications