Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling.


Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
23 October 2024
History:
received: 21 June 2024
accepted: 14 October 2024
medline: 24 October 2024
pubmed: 24 October 2024
entrez: 24 October 2024
Status: epublish

Abstract

Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood-brain barrier penetration, with ROC-AUC serving as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science.
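The atom-level tokenization the abstract contrasts with Byte Pair Encoding can be illustrated with a minimal sketch. This is not the paper's APE implementation; it is an assumed regex-based SMILES tokenizer (the pattern and function names are illustrative) showing how multi-character atoms such as "Cl" and "Br", and bracketed atoms such as "[nH]", stay intact as single tokens instead of being split at the byte level:

```python
import re

# Illustrative atom-level SMILES tokenizer (a sketch, not the paper's APE).
# Multi-character atoms ("Cl", "Br") and bracketed atoms ("[nH]") are kept
# as single tokens, preserving chemically meaningful units.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s"
    r"|=|#|-|\+|\(|\)|/|\\|@|\.|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom- and bond-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

# Aspirin: aromatic carbons, ring-closure digits, and bonds each form a token.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

A byte-level BPE vocabulary, by contrast, may learn merges that cut across such units (e.g. splitting "Cl" into "C" + "l"), which is the contextual-integrity issue the abstract attributes to BPE.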

Identifiers

pubmed: 39443676
doi: 10.1038/s41598-024-76440-8
pii: 10.1038/s41598-024-76440-8

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

25016

Grants

Agency: Fundação para a Ciência e a Tecnologia
ID: UIDB/04152/2020
Agency: Javna Agencija za Raziskovalno Dejavnost RS
ID: P2-0442

Copyright information

© 2024. The Author(s).

References

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. Adv. Neural Inf. Process. Syst. 36, 59662–59688 (2023).
Gage, P. A new algorithm for data compression. C Users J. 12, 23–38. https://doi.org/10.5555/177910.177914 (1994).
doi: 10.5555/177910.177914
Tran, K. Optimization of molecular transformers: Influence of tokenization schemes. M.Sc. Thesis, Chalmers University of Technology (2021).
Weininger, D. SMILES, a chemical language and information system: 1: Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36. https://doi.org/10.1021/ci00057a005 (1988).
doi: 10.1021/ci00057a005
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 1, 045024. https://doi.org/10.1088/2632-2153/aba947 (2020).
doi: 10.1088/2632-2153/aba947
Mater, A. C. & Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Model. 59, 2545–2559. https://doi.org/10.1021/acs.jcim.9b00266 (2019).
doi: 10.1021/acs.jcim.9b00266 pubmed: 31194543
Goh, G. B., Hodas, N. O. & Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 38, 1291–1307. https://doi.org/10.1002/jcc.24764 (2017).
doi: 10.1002/jcc.24764 pubmed: 28272810
Jiao, Z., Hu, P., Xu, H. & Wang, Q. Machine learning and deep learning in chemical health and safety: A systematic review of techniques and applications. ACS Chem. Health Saf. 27, 316–334. https://doi.org/10.1021/acs.chas.0c00075 (2020).
doi: 10.1021/acs.chas.0c00075
Cova, T. F. G. G. & Pais, A. A. C. C. Deep learning for deep chemistry: Optimizing the prediction of chemical patterns. Front. Chem. 7, 809. https://doi.org/10.3389/fchem.2019.00809 (2019).
doi: 10.3389/fchem.2019.00809 pubmed: 32039134 pmcid: 6988795
Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). arXiv:1602.06289 (2016).
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
doi: 10.1007/BF02478259
Jiang, S. et al. When SMILES smiles, practicality judgment and yield prediction of chemical reaction via deep chemical language processing. IEEE Access 9, 85071–85083. https://doi.org/10.1109/ACCESS.2021.3083838 (2021).
doi: 10.1109/ACCESS.2021.3083838
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
doi: 10.1162/neco.1997.9.8.1735
Boiko, D. A., MacKnight, R. & Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332 (2023).
Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digit. Discov. 2, 1233–1250. https://doi.org/10.1039/D3DD00113J (2023).
doi: 10.1039/D3DD00113J pubmed: 38013906 pmcid: 10561547
Xia, J., Zhu, Y., Du, Y. & Li, S. Z. A systematic survey of chemical pre-trained models. arXiv:2210.16484 (2022).
Liao, C., Yu, Y., Mei, Y. & Wei, Y. From words to molecules: A survey of large language models in chemistry. arXiv preprint arXiv:2402.01439 (2024).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
doi: 10.1038/s42256-023-00788-1
Brown, T. B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digit. Discov. 2, 368–376 (2023).
doi: 10.1039/D2DD00087C pubmed: 37065678 pmcid: 10087057
Schick, T. et al. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 36 (2024).
Shuster, K. et al. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv:2203.13224 (2022).
Bran, M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
doi: 10.1038/s42256-024-00832-8
Achiam, J. et al. GPT-4 technical report. arXiv:2303.08774 (2023).
Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).
doi: 10.1021/acs.jcim.3c00285 pubmed: 36926868
White, A. D. The future of chemistry is language. Nat. Rev. Chem. 7, 457–458 (2023).
doi: 10.1038/s41570-023-00502-0 pubmed: 37208543
Kim, S. et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395. https://doi.org/10.1093/nar/gkaa971 (2021).
doi: 10.1093/nar/gkaa971 pubmed: 33151290
Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530. https://doi.org/10.1039/c7sc02664a (2018).
doi: 10.1039/c7sc02664a pubmed: 29629118
Daller, E., Bougleux, S., Brun, L. & Lézoray, O. Local patterns and supergraph for chemical graph classification with convolutional networks. In Structural, Syntactic, and Statistical Pattern Recognition (eds Bai, X. et al.) 97–106 (Springer International Publishing, 2018).
doi: 10.1007/978-3-319-97785-0_10
Ryu, S., Lim, J., Hong, S. H. & Kim, W. Y. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint (2018).
Vaswani, A. et al. Attention is all you need. arXiv:1706.03762 (2017).
Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45, https://doi.org/10.18653/v1/2020.emnlp-demos.6 (Association for Computational Linguistics, Online, 2020).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv:2010.09885 (2020).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. https://doi.org/10.1021/acs.jcim.9b00237 (2019).
doi: 10.1021/acs.jcim.9b00237 pubmed: 31825611 pmcid: 8154261
Ahmad, W., Simon, E., Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa-2: Towards chemical foundation models. arXiv:2209.01712 (2022).
Yüksel, A., Ulusoy, E., Ünlü, A. & Doğan, T. SELFormer: Molecular representation learning via SELFIES language models. Mach. Learn.: Sci. Technol. 4, 025035. https://doi.org/10.1088/2632-2153/acdb30 (2023).
doi: 10.1088/2632-2153/acdb30
Cao, Z. et al. MOFormer: Self-supervised transformer model for metal-organic framework property prediction. J. Am. Chem. Soc. https://doi.org/10.1021/jacs.2c11420 (2023).
doi: 10.1021/jacs.2c11420 pubmed: 38150583 pmcid: 10785799
Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Erk, K. & Smith, N. A.) 1715–1725, https://doi.org/10.18653/v1/P16-1162 (Association for Computational Linguistics, Berlin, Germany, 2016).
Bader, R. F. W. Atoms in molecules. Acc. Chem. Res. 18, 9–15. https://doi.org/10.1021/ar00109a003 (1985).
doi: 10.1021/ar00109a003
Ucak, U. V., Ashyrmamatov, I. & Lee, J. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J. Cheminform. 15, 55. https://doi.org/10.1186/s13321-023-00725-9 (2023).
doi: 10.1186/s13321-023-00725-9 pubmed: 37248531 pmcid: 10228139
Li, X. & Fourches, D. SMILES pair encoding: A data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 61, 1560–1569. https://doi.org/10.1021/acs.jcim.0c01127 (2021).
doi: 10.1021/acs.jcim.0c01127 pubmed: 33715361
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019).
Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcão, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697. https://doi.org/10.1021/ci300124c (2012).
doi: 10.1021/ci300124c pubmed: 22612593
National Cancer Institute. AIDS Antiviral Screen Data (2024).
National Institutes of Health. Tox21 Challenge (2014).
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. arXiv:1711.05101 (2019).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
Heid, E. et al. Chemprop: A machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17. https://doi.org/10.1021/acs.jcim.3c01250 (2024).
doi: 10.1021/acs.jcim.3c01250 pubmed: 38147829
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60. https://doi.org/10.1214/aoms/1177730491 (1947).
doi: 10.1214/aoms/1177730491
Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 114, 494–509. https://doi.org/10.1037/0033-2909.114.3.494 (1993).
doi: 10.1037/0033-2909.114.3.494

Authors

Miguelangel Leon (M)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Yuriy Perezhohin (Y)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Fernando Peres (F)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Aleš Popovič (A)

Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.

Mauro Castelli (M)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal. mcastelli@novaims.unl.pt.
