Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling.


Journal

Scientific reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
23 October 2024
History:
received: 21 June 2024
accepted: 14 October 2024
medline: 24 October 2024
pubmed: 24 October 2024
entrez: 24 October 2024
Status: epublish

Abstract

Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep Learning is becoming increasingly prominent in chemistry, with Convolutional Graph Networks (CGN) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models in biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks using three distinct datasets for HIV, toxicology, and blood-brain barrier penetration, with ROC-AUC serving as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advancements in drug discovery and material science.
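The atom-level tokenization the abstract contrasts with Byte Pair Encoding can be illustrated with a minimal sketch. This is not the paper's APE implementation; it is an assumed regex-based SMILES tokenizer (the pattern and function names are illustrative) showing how multi-character atoms such as "Cl" and "Br", and bracketed atoms such as "[nH]", stay intact as single tokens instead of being split at the byte level:

```python
import re

# Illustrative atom-level SMILES tokenizer (a sketch, not the paper's APE).
# Multi-character atoms ("Cl", "Br") and bracketed atoms ("[nH]") are kept
# as single tokens, preserving chemically meaningful units.
SMILES_TOKEN_PATTERN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|B|C|N|O|P|S|F|I|b|c|n|o|p|s"
    r"|=|#|-|\+|\(|\)|/|\\|@|\.|%\d{2}|\d)"
)

def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom- and bond-level tokens."""
    tokens = SMILES_TOKEN_PATTERN.findall(smiles)
    # Sanity check: tokenization must be lossless.
    assert "".join(tokens) == smiles, "untokenizable characters in input"
    return tokens

# Aspirin: aromatic carbons, ring-closure digits, and bonds each form a token.
print(tokenize_smiles("CC(=O)Oc1ccccc1C(=O)O"))
```

A byte-level BPE vocabulary, by contrast, may learn merges that cut across such units (e.g. splitting "Cl" into "C" + "l"), which is the contextual-integrity issue the abstract attributes to BPE.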

Identifiers

pubmed: 39443676
doi: 10.1038/s41598-024-76440-8
pii: 10.1038/s41598-024-76440-8

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

25016

Grants

Agency: Fundação para a Ciência e a Tecnologia
ID: UIDB/04152/2020
Agency: Javna Agencija za Raziskovalno Dejavnost RS
ID: P2-0442

Copyright information

© 2024. The Author(s).

References

Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805 (2018).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. Adv. Neural Inf. Process. Syst. 36, 59662–59688 (2023).
Gage, P. A new algorithm for data compression. C Users J. 12, 23–38. https://doi.org/10.5555/177910.177914 (1994).
doi: 10.5555/177910.177914
Tran, K. Optimization of molecular transformers: Influence of tokenization schemes. M.Sc. Thesis, Chalmers University of Technology (2021).
Weininger, D. SMILES, a chemical language and information system: 1: Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36. https://doi.org/10.1021/ci00057a005 (1988).
doi: 10.1021/ci00057a005
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 1, 045024. https://doi.org/10.1088/2632-2153/aba947 (2020).
doi: 10.1088/2632-2153/aba947
Mater, A. C. & Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Model. 59, 2545–2559. https://doi.org/10.1021/acs.jcim.9b00266 (2019).
doi: 10.1021/acs.jcim.9b00266 pubmed: 31194543
Goh, G. B., Hodas, N. O. & Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 38, 1291–1307. https://doi.org/10.1002/jcc.24764 (2017).
doi: 10.1002/jcc.24764 pubmed: 28272810
Jiao, Z., Hu, P., Xu, H. & Wang, Q. Machine learning and deep learning in chemical health and safety: A systematic review of techniques and applications. ACS Chem. Health Saf. 27, 316–334. https://doi.org/10.1021/acs.chas.0c00075 (2020).
doi: 10.1021/acs.chas.0c00075
Cova, T. F. G. G. & Pais, A. A. C. C. Deep learning for deep chemistry: Optimizing the prediction of chemical patterns. Front. Chem. 7, 809. https://doi.org/10.3389/fchem.2019.00809 (2019).
doi: 10.3389/fchem.2019.00809 pubmed: 32039134 pmcid: 6988795
Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). arXiv:1602.06289 (2016).
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
doi: 10.1007/BF02478259
Jiang, S. et al. When SMILES smiles, practicality judgment and yield prediction of chemical reaction via deep chemical language processing. IEEE Access 9, 85071–85083. https://doi.org/10.1109/ACCESS.2021.3083838 (2021).
doi: 10.1109/ACCESS.2021.3083838
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780. https://doi.org/10.1162/neco.1997.9.8.1735 (1997).
doi: 10.1162/neco.1997.9.8.1735
Boiko, D. A., MacKnight, R. & Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv:2304.05332 (2023).
Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digit. Discov. 2, 1233–1250. https://doi.org/10.1039/D3DD00113J (2023).
doi: 10.1039/D3DD00113J pubmed: 38013906 pmcid: 10561547
Xia, J., Zhu, Y., Du, Y. & Li, S. Z. A systematic survey of chemical pre-trained models. arXiv:2210.16484 (2022).
Liao, C., Yu, Y., Mei, Y. & Wei, Y. From words to molecules: A survey of large language models in chemistry. arXiv preprint arXiv:2402.01439 (2024).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
doi: 10.1038/s42256-023-00788-1
Brown, T. B. et al. Language models are few-shot learners. arXiv:2005.14165 (2020).
White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digit. Discov. 2, 368–376 (2023).
doi: 10.1039/D2DD00087C pubmed: 37065678 pmcid: 10087057
Schick, T. et al. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 36 (2024).
Shuster, K. et al. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv:2203.13224 (2022).
Bran, M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
doi: 10.1038/s42256-024-00832-8
Achiam, J. et al. GPT-4 technical report. arXiv:2303.08774 (2023).
Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).
doi: 10.1021/acs.jcim.3c00285 pubmed: 36926868
White, A. D. The future of chemistry is language. Nat. Rev. Chem. 7, 457–458 (2023).
doi: 10.1038/s41570-023-00502-0 pubmed: 37208543
Kim, S. et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395. https://doi.org/10.1093/nar/gkaa971 (2021).
doi: 10.1093/nar/gkaa971 pubmed: 33151290
Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530. https://doi.org/10.1039/c7sc02664a (2018).
doi: 10.1039/c7sc02664a pubmed: 29629118
Daller, E., Bougleux, S., Brun, L. & Lézoray, O. Local patterns and supergraph for chemical graph classification with convolutional networks. In Structural, Syntactic, and Statistical Pattern Recognition (eds Bai, X. et al.) 97–106 (Springer International Publishing, 2018).
doi: 10.1007/978-3-319-97785-0_10
Ryu, S., Lim, J., Hong, S. H. & Kim, W. Y. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint (2018).
Vaswani, A. et al. Attention is all you need. arXiv:1706.03762 (2017).
Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45, https://doi.org/10.18653/v1/2020.emnlp-demos.6 (Association for Computational Linguistics, Online, 2020).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv:2010.09885 (2020).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. https://doi.org/10.1021/acs.jcim.9b00237 (2019).
doi: 10.1021/acs.jcim.9b00237 pubmed: 31825611 pmcid: 8154261
Ahmad, W., Simon, E., Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa-2: Towards chemical foundation models. arXiv:2209.01712 (2022).
Yüksel, A., Ulusoy, E., Ünlü, A. & Doğan, T. SELFormer: Molecular representation learning via SELFIES language models. Mach. Learn.: Sci. Technol. 4, 025035. https://doi.org/10.1088/2632-2153/acdb30 (2023).
doi: 10.1088/2632-2153/acdb30
Cao, Z. et al. MOFormer: Self-supervised transformer model for metal-organic framework property prediction. J. Am. Chem. Soc. https://doi.org/10.1021/jacs.2c11420 (2023).
doi: 10.1021/jacs.2c11420 pubmed: 38150583 pmcid: 10785799
Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Erk, K. & Smith, N. A.) 1715–1725, https://doi.org/10.18653/v1/P16-1162 (Association for Computational Linguistics, Berlin, Germany, 2016).
Bader, R. F. W. Atoms in molecules. Acc. Chem. Res. 18, 9–15. https://doi.org/10.1021/ar00109a003 (1985).
doi: 10.1021/ar00109a003
Ucak, U. V., Ashyrmamatov, I. & Lee, J. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J. Cheminform. 15, 55. https://doi.org/10.1186/s13321-023-00725-9 (2023).
doi: 10.1186/s13321-023-00725-9 pubmed: 37248531 pmcid: 10228139
Li, X. & Fourches, D. SMILES pair encoding: A data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 61, 1560–1569. https://doi.org/10.1021/acs.jcim.0c01127 (2021).
doi: 10.1021/acs.jcim.0c01127 pubmed: 33715361
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv:1907.11692 (2019).
Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcão, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697. https://doi.org/10.1021/ci300124c (2012).
doi: 10.1021/ci300124c pubmed: 22612593
National Cancer Institute. AIDS Antiviral Screen Data (2024).
National Institutes of Health. Tox21 Challenge (2014).
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. arXiv:1711.05101 (2019).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
Heid, E. et al. Chemprop: A machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17. https://doi.org/10.1021/acs.jcim.3c01250 (2024).
doi: 10.1021/acs.jcim.3c01250 pubmed: 38147829
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60. https://doi.org/10.1214/aoms/1177730491 (1947).
doi: 10.1214/aoms/1177730491
Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 114, 494–509. https://doi.org/10.1037/0033-2909.114.3.494 (1993).
doi: 10.1037/0033-2909.114.3.494

Authors

Miguelangel Leon (M)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Yuriy Perezhohin (Y)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Fernando Peres (F)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal.

Aleš Popovič (A)

Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia.

Mauro Castelli (M)

NOVA Information Management School (NOVA IMS), Universidade Nova de Lisboa, 1070-312, Lisbon, Portugal. mcastelli@novaims.unl.pt.
