PharmaBench: Enhancing ADMET benchmarks with large language models.


Journal

Scientific Data
ISSN: 2052-4463
Abbreviated title: Sci Data
Country: England
NLM ID: 101640192

Publication information

Publication date:
10 Sep 2024
History:
received: 08 04 2024
accepted: 19 08 2024
medline: 11 9 2024
pubmed: 11 9 2024
entrez: 10 9 2024
Status: epublish

Abstract

Accurately predicting ADMET (Absorption, Distribution, Metabolism, Excretion, and Toxicity) properties early in drug development is essential for selecting compounds with optimal pharmacokinetics and minimal toxicity. Existing ADMET-related benchmark sets are limited in utility due to their small dataset sizes and the lack of representation of compounds used in drug discovery projects. These shortcomings hinder their application in model building for drug discovery. To address this issue, we propose a multi-agent data mining system based on Large Language Models that effectively identifies experimental conditions within 14,401 bioassays. This approach facilitates merging entries from different sources, culminating in the creation of PharmaBench. Additionally, we have developed a data processing workflow to integrate data from various sources, resulting in 156,618 raw entries. Through this workflow, we constructed PharmaBench, a comprehensive benchmark set for ADMET properties, which comprises eleven ADMET datasets and 52,482 entries. This benchmark set is designed to serve as an open-source dataset for the development of AI models relevant to drug discovery projects.
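The merging step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' workflow: the record layout (`smiles`, `conditions`, `value`, `source`), the `merge_entries` helper, and the choice of the median as the aggregate are all assumptions made for illustration. The idea is only that measurements from different sources are combined when they refer to the same compound under the same extracted experimental conditions.

```python
from collections import defaultdict
from statistics import median


def merge_entries(entries):
    """Group entries by (compound, experimental conditions) and aggregate.

    Hypothetical helper: the median is an illustrative choice of aggregate,
    not something stated in the abstract.
    """
    groups = defaultdict(list)
    for entry in entries:
        # Sort the condition items so that dict ordering does not affect the key.
        key = (entry["smiles"], tuple(sorted(entry["conditions"].items())))
        groups[key].append(entry["value"])
    return {key: median(values) for key, values in groups.items()}


# Toy records (made up for illustration, not taken from PharmaBench).
entries = [
    {"smiles": "CCO", "conditions": {"pH": 7.4}, "value": 1.0, "source": "A"},
    {"smiles": "CCO", "conditions": {"pH": 7.4}, "value": 3.0, "source": "B"},
    {"smiles": "CCO", "conditions": {"pH": 2.0}, "value": 5.0, "source": "A"},
]

merged = merge_entries(entries)
for key, value in merged.items():
    print(key, value)
```

In this sketch the two pH 7.4 measurements collapse into one entry, while the pH 2.0 measurement stays separate because its conditions differ.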

Identifiers

pubmed: 39256394
doi: 10.1038/s41597-024-03793-0
pii: 10.1038/s41597-024-03793-0

Chemical substances

Pharmaceutical Preparations 0

Publication types

Journal Article; Dataset

Languages

eng

Citation subsets

IM

Pagination

985

Copyright information

© 2024. The Author(s).

Authors

Zhangming Niu (Z)

MindRank AI, Hangzhou, Zhejiang, China.
National Heart and Lung Institute, Imperial College London, London, SW7 2AZ, UK.

Xianglu Xiao (X)

MindRank AI, Hangzhou, Zhejiang, China.
Bioengineering Department and Imperial-X, Imperial College London, London, W12 7SL, UK.

Wenfan Wu (W)

MindRank AI, Hangzhou, Zhejiang, China.
Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology, Wuhan, Hubei, China.
Guangzhou National Laboratory, Guangzhou, 510005, China.

Qiwei Cai (Q)

MindRank AI, Hangzhou, Zhejiang, China.

Yinghui Jiang (Y)

MindRank AI, Hangzhou, Zhejiang, China.

Wangzhen Jin (W)

MindRank AI, Hangzhou, Zhejiang, China.

Minhao Wang (M)

MindRank AI, Hangzhou, Zhejiang, China.

Guojian Yang (G)

MindRank AI, Hangzhou, Zhejiang, China.

Lingkang Kong (L)

MindRank AI, Hangzhou, Zhejiang, China.

Xurui Jin (X)

MindRank AI, Hangzhou, Zhejiang, China.

Guang Yang (G)

National Heart and Lung Institute, Imperial College London, London, SW7 2AZ, UK. g.yang@imperial.ac.uk.
Bioengineering Department and Imperial-X, Imperial College London, London, W12 7SL, UK. g.yang@imperial.ac.uk.
Cardiovascular Research Centre, Royal Brompton Hospital, London, SW3 6NP, UK. g.yang@imperial.ac.uk.
School of Biomedical Engineering & Imaging Sciences, King's College London, London, UK. g.yang@imperial.ac.uk.

Hongming Chen (H)

Department of Bioinformatics and Systems Biology, Huazhong University of Science and Technology College of Life Sciences and Technology, Wuhan, Hubei, China. chen_hongming@gzlab.ac.cn.
Guangzhou National Laboratory, Guangzhou, 510005, China. chen_hongming@gzlab.ac.cn.
School of Pharmaceutical Sciences, Guangzhou Medical University, Guangzhou, 511495, China. chen_hongming@gzlab.ac.cn.
