PharmaBench: Enhancing ADMET benchmarks with large language models.

Benchmarking; Drug Discovery; Data Mining; Pharmacokinetics; Pharmaceutical Preparations; Humans

Journal

Scientific data

ISSN: 2052-4463

Abbreviated title: Sci Data

Country: England

NLM ID: 101640192

Publication information

Publication date:
10 Sep 2024

History:

received: 08 Apr 2024

accepted: 19 Aug 2024

medline: 11 Sep 2024

pubmed: 11 Sep 2024

entrez: 10 Sep 2024

Status: epublish

Abstract

Accurately predicting ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties early in drug development is essential for selecting compounds with optimal pharmacokinetics and minimal toxicity. Existing ADMET-related benchmark sets are limited in utility due to their small dataset sizes and the lack of representation of compounds used in drug discovery projects. These shortcomings hinder their application in model building for drug discovery. To address this issue, we propose a multi-agent data mining system based on Large Language Models that effectively identifies experimental conditions within 14,401 bioassays. This approach facilitates merging entries from different sources, culminating in the creation of PharmaBench. Additionally, we have developed a data processing workflow to integrate data from various sources, resulting in 156,618 raw entries. Through this workflow, we constructed PharmaBench, a comprehensive benchmark set for ADMET properties, which comprises eleven ADMET datasets and 52,482 entries. This benchmark set is designed to serve as an open-source dataset for the development of AI models relevant to drug discovery projects.

Identifiers

DOI: 10.1038/s41597-024-03793-0 PMID: 39256394

pii: 10.1038/s41597-024-03793-0

Chemical substances

Pharmaceutical Preparations 0

Publication types

Journal Article; Dataset

Languages

eng

Pagination

985

Authors

Zhangming Niu

Xianglu Xiao

Wenfan Wu

Qiwei Cai

Yinghui Jiang

Wangzhen Jin

Minhao Wang

Guojian Yang

Lingkang Kong

Xurui Jin

Guang Yang

Hongming Chen
