DPI_CDF: druggable protein identifier using cascade deep forest.

Keywords: Bioinformatics; Cascade deep forest; Druggable proteins; PSSM; Physicochemical features

Journal

BMC bioinformatics

ISSN: 1471-2105

Abbreviated title: BMC Bioinformatics

Country: England

NLM ID: 100965194

Publication information

Publication date:
05 Apr 2024

History:

received: 01 Jun 2023

accepted: 13 Mar 2024

medline: 06 Apr 2024

pubmed: 06 Apr 2024

entrez: 05 Apr 2024

Status: epublish

Abstract

BACKGROUND

Drug targets in living organisms play pivotal roles in the discovery of potential drugs. Conventional wet-lab characterization of drug targets, although accurate, is generally expensive, slow, and resource-intensive. Computational methods are therefore highly desirable as an alternative to expedite the large-scale identification of druggable proteins (DPs); however, the performance of existing in silico predictors is still not satisfactory.

METHODS

In this study, we developed a novel deep learning-based model, DPI_CDF, for predicting DPs from protein sequence alone. DPI_CDF generates features from evolutionary (i.e., histograms of oriented gradients for the position-specific scoring matrix), physicochemical (i.e., component protein sequence representation), and compositional (i.e., normalized qualitative characteristic) properties of the protein sequence. A hierarchical deep forest model then fuses these three encoding schemes to build the proposed model.
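The evolutionary descriptor named above treats a PSSM as an image and summarizes its gradient orientations. The sketch below is a deliberately simplified illustration of that idea: it builds one global, magnitude-weighted orientation histogram, whereas full HOG pipelines use local cells and block normalization. The function name `hog_descriptor`, the 9-bin setting, and the L1 normalization are assumptions made for illustration and are not taken from the DPI_CDF source code.

```python
import math


def hog_descriptor(pssm, n_bins=9):
    """Return a magnitude-weighted histogram of unsigned gradient orientations.

    `pssm` is a list of rows (one per residue), each holding the scores
    for the amino-acid columns. Hypothetical helper, not the authors' code.
    """
    rows, cols = len(pssm), len(pssm[0])
    bin_width = 180.0 / n_bins
    hist = [0.0] * n_bins
    # Central differences are used, so border cells are skipped.
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = pssm[i][j + 1] - pssm[i][j - 1]  # change across amino-acid columns
            gy = pssm[i + 1][j] - pssm[i - 1][j]  # change along the sequence
            magnitude = math.hypot(gx, gy)
            if magnitude == 0.0:
                continue
            # Fold the direction into [0, 180) so opposite gradients share a bin.
            angle = math.degrees(math.atan2(gy, gx)) % 180.0
            hist[int(angle // bin_width) % n_bins] += magnitude
    total = sum(hist)
    # L1-normalize so proteins of different lengths give comparable vectors.
    return [h / total for h in hist] if total > 0 else hist


# Toy 4 x 5 matrix whose scores increase only along the columns.
toy = [[float(j) for j in range(5)] for _ in range(4)]
print(hog_descriptor(toy))
```

In a pipeline like the one described, a vector of this kind would be concatenated with the physicochemical and compositional descriptors before being passed to the forest layers; the exact dimensions used by the authors are not stated in this record.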

RESULTS

Under 10-fold cross-validation, the proposed model achieved 99.13% accuracy and a Matthews correlation coefficient (MCC) of 0.982 on the training dataset. The generalization power of the trained model was further examined on an independent dataset, where it achieved a maximum accuracy of 95.01% and an MCC of 0.900. Compared with current state-of-the-art methods, DPI_CDF improves accuracy by 4.27% and 4.31% on the training and testing datasets, respectively. We believe DPI_CDF will help the research community identify druggable proteins and accelerate the drug discovery process.
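For readers less familiar with the two metrics reported above, the following sketch computes accuracy and MCC from the four cells of a binary confusion matrix. The function names and the counts in the example call are illustrative assumptions and are not taken from the study's data.

```python
import math


def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient for a binary confusion matrix.

    Returns 0.0 when any marginal total is zero, a common convention
    for the otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Hypothetical counts for a balanced test set of 200 proteins.
print(accuracy(tp=95, tn=95, fp=5, fn=5), mcc(tp=95, tn=95, fp=5, fn=5))
```

MCC ranges from -1 to 1 and uses all four cells of the confusion matrix, which is why it is often reported alongside accuracy when class sizes may differ.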

AVAILABILITY

The benchmark datasets and source code are available on GitHub: http://github.com/Muhammad-Arif-NUST/DPI_CDF .

Identifiers

DOI: 10.1186/s12859-024-05744-3

PMID: 38580921

PII: 10.1186/s12859-024-05744-3

Publication types

Journal Article

Languages

eng

Pagination

145

Authors

Muhammad Arif

Ge Fang

Ali Ghulam

Saleh Musleh

Tanvir Alam

MeSH classifications