Comparing SMILES and SELFIES tokenization for enhanced chemical language modeling.
Journal
Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288
Publication information
Publication date: 2024-10-23
History:
received: 2024-06-21
accepted: 2024-10-14
medline: 2024-10-24
pubmed: 2024-10-24
entrez: 2024-10-24
Status: epublish
Abstract
Life sciences research and experimentation are resource-intensive, requiring extensive trials and considerable time. Often, experiments do not achieve their intended objectives, but progress is made through trial and error, eventually leading to breakthroughs. Machine learning is transforming this traditional approach, providing methods to expedite processes and accelerate discoveries. Deep learning is becoming increasingly prominent in chemistry, with Graph Convolutional Networks (GCNs) being a key focus, though other approaches also show significant potential. This research explores the application of Natural Language Processing (NLP) to evaluate the effectiveness of chemical language representations, specifically SMILES and SELFIES, using tokenization methods such as Byte Pair Encoding (BPE) and a novel approach developed in this study, Atom Pair Encoding (APE), in BERT-based models. The primary objective is to assess how these tokenization techniques influence the performance of chemical language models on biophysics and physiology classification tasks. The findings reveal that APE, particularly when used with SMILES representations, significantly outperforms BPE by preserving the integrity and contextual relationships among chemical elements, thereby enhancing classification accuracy. Performance was evaluated in downstream classification tasks on three distinct datasets covering HIV activity, toxicology, and blood-brain barrier penetration, with ROC-AUC as the evaluation metric. This study highlights the critical role of tokenization in processing chemical language and suggests that refining these techniques could lead to significant advances in drug discovery and materials science.
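The record does not spell out how Atom Pair Encoding (APE) is implemented, but the contrast the abstract draws — tokenizers that preserve chemical units versus byte-level BPE, which can tear a chlorine "Cl" into meaningless "C" + "l" pieces — can be illustrated with the atom-level SMILES regex that is widely used in the chemical language modeling literature. This is a hedged sketch of that general idea only, not the paper's APE algorithm; the pattern and function names are assumptions.

    import re

    # Atom-level SMILES pattern (common in the chemical-LM literature):
    # bracket atoms ([NH3+], [C@@H]) stay whole, and two-letter elements
    # (Cl, Br) are never split into single characters.
    SMILES_TOKEN_RE = re.compile(
        r"(\[[^\]]+\]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p"
        r"|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\?|>|\*|\$|%[0-9]{2}|[0-9])"
    )

    def atom_level_tokenize(smiles: str) -> list[str]:
        """Split a SMILES string into atom-aware tokens."""
        return SMILES_TOKEN_RE.findall(smiles)

    smiles = "ClCC(=O)c1ccccc1"
    print(atom_level_tokenize(smiles))
    # ['Cl', 'C', 'C', '(', '=', 'O', ')', 'c', '1', 'c', 'c', 'c', 'c', 'c', '1']
    print(list(smiles))  # character-level baseline: 'Cl' becomes 'C', 'l'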
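SELFIES (Krenn et al., cited below) is by construction a sequence of bracketed symbols, so its surface form already suggests one token per symbol. A minimal sketch of moving between the two representations, assuming the open-source selfies Python package that accompanies the cited paper:

    import selfies as sf

    smiles = "ClCC(=O)c1ccccc1"
    s = sf.encoder(smiles)               # SMILES -> SELFIES string of [...] symbols
    tokens = list(sf.split_selfies(s))   # one token per bracketed symbol
    roundtrip = sf.decoder(s)            # SELFIES -> SMILES
    print(tokens)
    print(roundtrip)

The "100% robust" property from the cited paper means that any sequence of SELFIES symbols decodes to a syntactically valid molecule, which is one motivation for comparing SELFIES against SMILES under different tokenizers.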
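The evaluation metric named in the abstract, ROC-AUC, equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. A worked toy example using scikit-learn's roc_auc_score (the tooling choice is an assumption; the record does not describe the paper's evaluation code):

    from sklearn.metrics import roc_auc_score

    y_true  = [0, 1, 1, 0, 1]             # e.g. BBB-penetrant vs. not
    y_score = [0.2, 0.8, 0.25, 0.3, 0.9]  # model probabilities for class 1

    # 3 positives x 2 negatives = 6 pairs; 5 are correctly ordered
    # (0.25 > 0.3 fails), so AUC = 5/6 ~ 0.833.
    print(roc_auc_score(y_true, y_score))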
Identifiers
pubmed: 39443676
doi: 10.1038/s41598-024-76440-8
pii: 10.1038/s41598-024-76440-8
Publication types
Journal Article
Language
eng
Citation subset
IM
Pagination
25016
Grants
Agency: Fundação para a Ciência e a Tecnologia
ID: UIDB/04152/2020
Agency: Javna Agencija za Raziskovalno Dejavnost RS
ID: P2-0442
Copyright information
© 2024. The Author(s).
References
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
Guo, T. et al. What can large language models do in chemistry? A comprehensive benchmark on eight tasks. Adv. Neural Inf. Process. Syst. 36, 59662–59688 (2023).
Gage, P. A new algorithm for data compression. C Users J. 12, 23–38. https://doi.org/10.5555/177910.177914 (1994).
doi: 10.5555/177910.177914
Tran, K. Optimization of molecular transformers: Influence of tokenization schemes. M.Sc. thesis, Chalmers University of Technology (2021).
Weininger, D. SMILES, a chemical language and information system. 1. Introduction to methodology and encoding rules. J. Chem. Inf. Comput. Sci. 28, 31–36. https://doi.org/10.1021/ci00057a005 (1988).
doi: 10.1021/ci00057a005
Krenn, M., Häse, F., Nigam, A., Friederich, P. & Aspuru-Guzik, A. Self-referencing embedded strings (SELFIES): A 100% robust molecular string representation. Mach. Learn.: Sci. Technol. 1, 045024. https://doi.org/10.1088/2632-2153/aba947 (2020).
doi: 10.1088/2632-2153/aba947
Mater, A. C. & Coote, M. L. Deep learning in chemistry. J. Chem. Inf. Model. 59, 2545–2559. https://doi.org/10.1021/acs.jcim.9b00266 (2019).
doi: 10.1021/acs.jcim.9b00266
pubmed: 31194543
Goh, G. B., Hodas, N. O. & Vishnu, A. Deep learning for computational chemistry. J. Comput. Chem. 38, 1291–1307. https://doi.org/10.1002/jcc.24764 (2017).
doi: 10.1002/jcc.24764
pubmed: 28272810
Jiao, Z., Hu, P., Xu, H. & Wang, Q. Machine learning and deep learning in chemical health and safety: A systematic review of techniques and applications. ACS Chem. Health Saf. 27, 316–334. https://doi.org/10.1021/acs.chas.0c00075 (2020).
doi: 10.1021/acs.chas.0c00075
Cova, T. F. G. G. & Pais, A. A. C. C. Deep learning for deep chemistry: Optimizing the prediction of chemical patterns. Front. Chem. 7, 809. https://doi.org/10.3389/fchem.2019.00809 (2019).
doi: 10.3389/fchem.2019.00809
pubmed: 32039134
pmcid: 6988795
Jastrzębski, S., Leśniak, D. & Czarnecki, W. M. Learning to SMILE(S). arXiv preprint arXiv:1602.06289 (2016).
McCulloch, W. S. & Pitts, W. A logical calculus of the ideas immanent in nervous activity. Bull. Math. Biophys. 5, 115–133 (1943).
doi: 10.1007/BF02478259
Jiang, S. et al. When SMILES smiles, practicality judgment and yield prediction of chemical reaction via deep chemical language processing. IEEE Access 9, 85071–85083. https://doi.org/10.1109/ACCESS.2021.3083838 (2021).
doi: 10.1109/ACCESS.2021.3083838
Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural Comput. 9, 1735–1780 (1997).
doi: 10.1162/neco.1997.9.8.1735
Boiko, D. A., MacKnight, R. & Gomes, G. Emergent autonomous scientific research capabilities of large language models. arXiv preprint arXiv:2304.05332 (2023).
Jablonka, K. M. et al. 14 examples of how LLMs can transform materials science and chemistry: A reflection on a large language model hackathon. Digit. Discov. 2, 1233–1250. https://doi.org/10.1039/D3DD00113J (2023).
doi: 10.1039/D3DD00113J
pubmed: 38013906
pmcid: 10561547
Xia, J., Zhu, Y., Du, Y. & Li, S. Z. A systematic survey of chemical pre-trained models. arXiv preprint arXiv:2210.16484 (2022).
Liao, C., Yu, Y., Mei, Y. & Wei, Y. From words to molecules: A survey of large language models in chemistry. arXiv preprint arXiv:2402.01439 (2024).
Jablonka, K. M., Schwaller, P., Ortega-Guerrero, A. & Smit, B. Leveraging large language models for predictive chemistry. Nat. Mach. Intell. 6, 161–169 (2024).
doi: 10.1038/s42256-023-00788-1
Brown, T. B. et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165 (2020).
White, A. D. et al. Assessment of chemistry knowledge in large language models that generate code. Digit. Discov. 2, 368–376 (2023).
doi: 10.1039/D2DD00087C
pubmed: 37065678
pmcid: 10087057
Schick, T. et al. Toolformer: Language models can teach themselves to use tools. Adv. Neural Inf. Process. Syst. 36 (2024).
Shuster, K. et al. Language models that seek for knowledge: Modular search & generation for dialogue and prompt completion. arXiv preprint arXiv:2203.13224 (2022).
Bran, M. et al. Augmenting large language models with chemistry tools. Nat. Mach. Intell. 6, 525–535 (2024).
doi: 10.1038/s42256-024-00832-8
Achiam, J. et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023).
Castro Nascimento, C. M. & Pimentel, A. S. Do large language models understand chemistry? A conversation with ChatGPT. J. Chem. Inf. Model. 63, 1649–1655 (2023).
doi: 10.1021/acs.jcim.3c00285
pubmed: 36926868
White, A. D. The future of chemistry is language. Nat. Rev. Chem. 7, 457–458 (2023).
doi: 10.1038/s41570-023-00502-0
pubmed: 37208543
Kim, S. et al. PubChem in 2021: New data content and improved web interfaces. Nucleic Acids Res. 49, D1388–D1395. https://doi.org/10.1093/nar/gkaa971 (2021).
doi: 10.1093/nar/gkaa971
pubmed: 33151290
Wu, Z. et al. MoleculeNet: A benchmark for molecular machine learning. Chem. Sci. 9, 513–530. https://doi.org/10.1039/c7sc02664a (2018).
doi: 10.1039/c7sc02664a
pubmed: 29629118
Daller, E., Bougleux, S., Brun, L. & Lézoray, O. Local patterns and supergraph for chemical graph classification with convolutional networks. In Structural, Syntactic, and Statistical Pattern Recognition (eds Bai, X. et al.) 97–106 (Springer International Publishing, 2018).
doi: 10.1007/978-3-319-97785-0_10
Ryu, S., Lim, J., Hong, S. H. & Kim, W. Y. Deeply learning molecular structure-property relationships using attention- and gate-augmented graph convolutional network. arXiv preprint (2018).
Vaswani, A. et al. Attention is all you need. arXiv preprint arXiv:1706.03762 (2017).
Wolf, T. et al. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations (eds Liu, Q. & Schlangen, D.) 38–45. https://doi.org/10.18653/v1/2020.emnlp-demos.6 (Association for Computational Linguistics, Online, 2020).
Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa: Large-scale self-supervised pretraining for molecular property prediction. arXiv preprint arXiv:2010.09885 (2020).
Yang, K. et al. Analyzing learned molecular representations for property prediction. J. Chem. Inf. Model. https://doi.org/10.1021/acs.jcim.9b00237 (2019).
doi: 10.1021/acs.jcim.9b00237
pubmed: 31825611
pmcid: 8154261
Ahmad, W., Simon, E., Chithrananda, S., Grand, G. & Ramsundar, B. ChemBERTa-2: Towards chemical foundation models. arXiv preprint arXiv:2209.01712 (2022).
Yüksel, A., Ulusoy, E., Ünlü, A. & Doğan, T. SELFormer: Molecular representation learning via SELFIES language models. Mach. Learn.: Sci. Technol. 4, 025035. https://doi.org/10.1088/2632-2153/acdb30 (2023).
doi: 10.1088/2632-2153/acdb30
Cao, Z. et al. MOFormer: Self-supervised transformer model for metal-organic framework property prediction. J. Am. Chem. Soc. https://doi.org/10.1021/jacs.2c11420 (2023).
doi: 10.1021/jacs.2c11420
pubmed: 38150583
pmcid: 10785799
Sennrich, R., Haddow, B. & Birch, A. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds Erk, K. & Smith, N. A.) 1715–1725, https://doi.org/10.18653/v1/P16-1162 (Association for Computational Linguistics, Berlin, Germany, 2016).
Bader, R. F. W. Atoms in molecules. Acc. Chem. Res. 18, 9–15. https://doi.org/10.1021/ar00109a003 (1985).
doi: 10.1021/ar00109a003
Ucak, U. V., Ashyrmamatov, I. & Lee, J. Improving the quality of chemical language model outcomes with atom-in-SMILES tokenization. J. Cheminform. 15, 55. https://doi.org/10.1186/s13321-023-00725-9 (2023).
doi: 10.1186/s13321-023-00725-9
pubmed: 37248531
pmcid: 10228139
Li, X. & Fourches, D. SMILES pair encoding: A data-driven substructure tokenization algorithm for deep learning. J. Chem. Inf. Model. 61, 1560–1569. https://doi.org/10.1021/acs.jcim.0c01127 (2021).
doi: 10.1021/acs.jcim.0c01127
pubmed: 33715361
Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).
Martins, I. F., Teixeira, A. L., Pinheiro, L. & Falcão, A. O. A Bayesian approach to in silico blood–brain barrier penetration modeling. J. Chem. Inf. Model. 52, 1686–1697. https://doi.org/10.1021/ci300124c (2012).
doi: 10.1021/ci300124c
pubmed: 22612593
National Cancer Institute. AIDS Antiviral Screen Data (2024).
National Institutes of Health. Tox21 Challenge (2014).
Akiba, T., Sano, S., Yanase, T., Ohta, T. & Koyama, M. Optuna: A next-generation hyperparameter optimization framework. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2019).
Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2019).
Ramsundar, B. et al. Deep Learning for the Life Sciences (O’Reilly Media, 2019).
Heid, E. et al. Chemprop: A machine learning package for chemical property prediction. J. Chem. Inf. Model. 64, 9–17. https://doi.org/10.1021/acs.jcim.3c01250 (2024).
doi: 10.1021/acs.jcim.3c01250
pubmed: 38147829
Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60. https://doi.org/10.1214/aoms/1177730491 (1947).
doi: 10.1214/aoms/1177730491
Cliff, N. Dominance statistics: Ordinal analyses to answer ordinal questions. Psychol. Bull. 114, 494–509. https://doi.org/10.1037/0033-2909.114.3.494 (1993).
doi: 10.1037/0033-2909.114.3.494