BASE: a web service for providing compound-protein binding affinity prediction datasets with reduced similarity bias.

Keywords: Dataset bias; Deep learning; Drug discovery; Drug-target affinity prediction; Protein similarity

Journal

BMC Bioinformatics
ISSN: 1471-2105
Abbreviated title: BMC Bioinformatics
Country: England
NLM ID: 100965194

Publication information

Publication date:
30 Oct 2024
History:
received: 06 Aug 2024
accepted: 23 Oct 2024
medline: 31 Oct 2024
pubmed: 31 Oct 2024
entrez: 31 Oct 2024
Status: epublish

Abstract

Deep learning-based drug-target affinity (DTA) prediction methods have shown impressive performance despite having a large number of trainable parameters relative to the available data. Previous studies have highlighted the presence of dataset bias by suggesting that models trained solely on protein or ligand structures may perform similarly to those trained on complex structures. However, these studies did not propose solutions and analyzed only complex structure-based models. Even when ligands are excluded, protein-only models trained on complex structures still incorporate some ligand information at the binding sites. It therefore remains unclear whether dataset bias allows binding affinity to be predicted accurately from compound or protein features alone. In this study, we expanded our analysis to comprehensive databases and investigated dataset bias through compound and protein feature-based methods using multilayer perceptron models. We assessed the impact of this bias on current prediction models and proposed the binding affinity similarity explorer (BASE) web service, which provides bias-reduced datasets.
By analyzing eight binding affinity databases using multilayer perceptron models, we confirmed a bias in which compound-protein binding affinity can be accurately predicted using compound features alone. This bias arises because most compounds show consistent binding affinities due to high sequence or functional similarity among their target proteins. Our Uniform Manifold Approximation and Projection (UMAP) analysis based on compound fingerprints further revealed that low- and high-variation compounds do not exhibit significant structural differences. This suggests that the primary factor driving the consistent binding affinities is protein similarity rather than compound structure.
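The compound-only bias described above can be illustrated with a minimal sketch. It is not the authors' multilayer perceptron pipeline: it assumes affinity records are available as hypothetical `(compound_id, protein_id, affinity)` tuples, measures how much each compound's affinity varies across its targets, and uses a simple per-compound mean as a stand-in for a compound-feature-only predictor. All function names and the toy values are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def per_compound_variation(records):
    """Return {compound_id: population std. dev. of its affinities}.

    `records` is an iterable of (compound_id, protein_id, affinity) tuples
    (a hypothetical layout, not the BASE file format).
    """
    by_compound = defaultdict(list)
    for compound_id, _protein_id, affinity in records:
        by_compound[compound_id].append(affinity)
    return {c: pstdev(values) for c, values in by_compound.items()}


def compound_mean_baseline_rmse(train, test):
    """RMSE of a predictor that ignores the protein entirely.

    Each test affinity is predicted as the training mean for that compound,
    falling back to the global training mean for unseen compounds.
    """
    by_compound = defaultdict(list)
    for compound_id, _protein_id, affinity in train:
        by_compound[compound_id].append(affinity)
    global_mean = mean(a for _c, _p, a in train)

    squared_errors = []
    for compound_id, _protein_id, affinity in test:
        if compound_id in by_compound:
            prediction = mean(by_compound[compound_id])
        else:
            prediction = global_mean
        squared_errors.append((affinity - prediction) ** 2)
    return (sum(squared_errors) / len(squared_errors)) ** 0.5


# Toy data (invented for illustration): each compound binds its targets
# with nearly the same affinity, so the protein adds little information.
toy_train = [("c1", "p1", 7.0), ("c1", "p2", 7.2),
             ("c2", "p1", 5.0), ("c2", "p3", 5.1)]
toy_test = [("c1", "p3", 7.1), ("c2", "p2", 5.05)]

print(per_compound_variation(toy_train))
print(compound_mean_baseline_rmse(toy_train, toy_test))
```

When per-compound variation is low, a predictor that never sees the protein can still score well, which is the kind of bias the abstract reports.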
We addressed this bias by creating datasets with progressively reduced protein similarity between the training and test sets, observing significant changes in model performance. We developed the BASE web service to allow researchers to download and utilize these datasets. Feature importance analysis revealed that previous models heavily relied on protein features. However, using bias-reduced datasets increased the importance of compound and interaction features, enabling a more balanced extraction of key features.
We propose the BASE web service, providing both the affinity prediction results of existing models and bias-reduced datasets. These resources contribute to the development of generalized and robust predictive models, enhancing the accuracy and reliability of DTA predictions in the drug discovery process. BASE is freely available online at https://synbi2024.kaist.ac.kr/base.
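The idea of progressively reducing protein similarity between training and test sets can be sketched as a filtering step. This is an assumption-laden illustration, not the procedure used to build the BASE datasets: the similarity measure here is a stand-in based on Python's `difflib.SequenceMatcher` rather than an alignment-based or functional score, and the record layout, function names, sequences, and thresholds are all hypothetical.

```python
from difflib import SequenceMatcher


def sequence_similarity(seq_a, seq_b):
    """Stand-in similarity in [0, 1]; not an alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def filter_training_set(train, test, sequences, max_similarity):
    """Keep training records whose protein is at most `max_similarity`
    similar to every protein that appears in the test set.

    `train` and `test` hold (compound_id, protein_id, affinity) tuples;
    `sequences` maps protein_id to its amino acid sequence.
    """
    test_proteins = {protein_id for _c, protein_id, _a in test}
    if not test_proteins:
        return list(train)

    highest = {}  # cache: training protein -> highest similarity to any test protein
    kept = []
    for compound_id, protein_id, affinity in train:
        if protein_id not in highest:
            highest[protein_id] = max(
                sequence_similarity(sequences[protein_id], sequences[t])
                for t in test_proteins
            )
        if highest[protein_id] <= max_similarity:
            kept.append((compound_id, protein_id, affinity))
    return kept


# Toy example (invented sequences): p2 is a near-copy of the test protein p1,
# while p3 shares no residues with it.
toy_sequences = {"p1": "MKTAYIAKQR", "p2": "MKTAYIAKQK", "p3": "GGSSGGSSGG"}
toy_train = [("c1", "p2", 7.0), ("c2", "p3", 5.0)]
toy_test = [("c3", "p1", 6.0)]

# Lowering the threshold step by step yields progressively stricter training sets.
for threshold in (1.0, 0.5):
    kept = filter_training_set(toy_train, toy_test, toy_sequences, threshold)
    print(threshold, len(kept))
```

Retraining a model on each filtered training set and evaluating it on the same test set would show how much of its performance depends on protein similarity.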

Identifiers

pubmed: 39478454
doi: 10.1186/s12859-024-05968-3
pii: 10.1186/s12859-024-05968-3

Chemical substances

Proteins 0
Ligands 0

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

340

Copyright information

© 2024. The Author(s).

Authors

Hyojin Son (H)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Sechan Lee (S)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Jaeuk Kim (J)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Haangik Park (H)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Myeong-Ha Hwang (MH)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea.

Gwan-Su Yi (GS)

Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon, Republic of Korea. gwansuyi@kaist.ac.kr.
