The Two Word Test as a semantic benchmark for large language models.
Journal
Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288
Publication information
Publication date: 2024-09-16
History:
received: 2024-05-07
accepted: 2024-09-09
medline: 2024-09-17
pubmed: 2024-09-17
entrez: 2024-09-16
Status: epublish
Abstract
Large language models (LLMs) have recently shown remarkable abilities, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or "true" understanding of language, and even artificial general intelligence (AGI). Here, we provide a new open-source benchmark, the Two Word Test (TWT), that can assess the semantic abilities of LLMs using two-word phrases in a task that humans can perform relatively easily without advanced training. Combining multiple words into a single concept is a fundamental linguistic and conceptual operation routinely performed by people. The test requires meaningfulness judgments of 1768 noun-noun combinations that human raters have judged as meaningful (e.g., baby boy) or as having low meaningfulness (e.g., goat sky). This novel test differs from existing benchmarks that rely on logical reasoning, inference, puzzle-solving, or domain expertise. We provide versions of the task that probe meaningfulness ratings on a 0-4 scale as well as binary judgments. With both versions, we conducted a series of experiments using the TWT on GPT-4, GPT-3.5, Claude-3-Opus, and Gemini-1.0-Pro-001. Results demonstrated that, compared to humans, all models performed relatively poorly at rating the meaningfulness of these phrases. GPT-3.5-turbo, Gemini-1.0-Pro-001, and GPT-4-turbo were also unable to make binary discriminations between sensible and nonsense phrases, consistently judging nonsensical phrases as making sense. Claude-3-Opus showed a substantial improvement in binary discrimination of combinatorial phrases but was still significantly worse than human performance. The TWT can be used to understand and assess the limitations of current LLMs, and potentially to improve them. The test also reminds us that caution is warranted in attributing "true" or human-level understanding to LLMs based only on tests that are challenging for humans.
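As a rough illustration of the task the abstract describes, the sketch below shows how the binary version of the TWT might be posed to an LLM and how a free-text reply could be mapped to a yes/no judgment. The prompt wording, function names, and parsing logic are illustrative assumptions, not the authors' actual protocol.

```python
# Hypothetical sketch of the Two Word Test's binary judgment task.
# Prompt wording and response parsing are assumptions for illustration,
# not the exact protocol used in the paper.

def build_twt_prompt(phrase: str) -> str:
    """Build a binary meaningfulness prompt for a noun-noun phrase."""
    return (
        f'Does the two-word phrase "{phrase}" make sense as a '
        "meaningful combination in English? Answer yes or no."
    )

def parse_binary_judgment(response: str) -> bool:
    """Map a model's free-text reply to True (sensible) or False (nonsense)."""
    first = response.strip().lower().split()[0].rstrip(".,!")
    return first == "yes"

if __name__ == "__main__":
    # Example phrases from the abstract: one meaningful, one low-meaningfulness.
    for phrase in ("baby boy", "goat sky"):
        print(build_twt_prompt(phrase))
```

A 0-4 rating variant would follow the same pattern, asking for a single digit instead of yes/no and parsing it with `int()`.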
Identifiers
pubmed: 39284863
doi: 10.1038/s41598-024-72528-3
pii: 10.1038/s41598-024-72528-3
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Pagination
21593
Grants
Agency: NIH HHS
ID: 5R01DC017162-02
Country: United States
Copyright information
© 2024. The Author(s).
References
Bommasani, R. et al. On the opportunities and risks of foundation models. (2021) https://doi.org/10.48550/ARXIV.2108.07258 .
Choi, J. H., Hickman, K. E., Monahan, A. & Schwarcz, D. B. ChatGPT goes to law school. SSRN J. https://doi.org/10.2139/ssrn.4335905 (2023).
doi: 10.2139/ssrn.4335905
Terwiesch, C. Would Chat GPT get a Wharton MBA? A prediction based on its performance in the operations management course. (2023).
Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2, e0000198 (2023).
doi: 10.1371/journal.pdig.0000198
pubmed: 36812645
pmcid: 9931230
Brown, T. B. et al. Language models are few-shot learners. (2020) https://doi.org/10.48550/ARXIV.2005.14165 .
Chowdhery, A. et al. PaLM: Scaling language modeling with pathways. (2022) https://doi.org/10.48550/ARXIV.2204.02311 .
Bubeck, S. et al. Sparks of artificial general intelligence: Early experiments with GPT-4. (2023) https://doi.org/10.48550/ARXIV.2303.12712 .
Michael, J. et al. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey. (2022) https://doi.org/10.48550/ARXIV.2208.12852 .
Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s large language models. Proc. Natl. Acad. Sci. U.S.A. 120, e2215907120 (2023).
doi: 10.1073/pnas.2215907120
pubmed: 36943882
pmcid: 10068812
Lee, K., Firat, O., Agarwal, A., Fannjiang, C. & Sussillo, D. Hallucinations in neural machine translation. in (2018).
Raunak, V., Menezes, A. & Junczys-Dowmunt, M. The Curious Case of Hallucinations in Neural Machine Translation. in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1172–1183 (Association for Computational Linguistics, Online, 2021). https://doi.org/10.18653/v1/2021.naacl-main.92 .
Mahowald, K. et al. Dissociating language and thought in large language models. (2023) https://doi.org/10.48550/ARXIV.2301.06627 .
Choudhury, S. R., Rogers, A. & Augenstein, I. Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language? (2022) https://doi.org/10.48550/ARXIV.2209.07430 .
Gardner, M. et al. Competency problems: On finding and removing artifacts in language data. (2021) https://doi.org/10.48550/ARXIV.2104.08646 .
Linzen, T. How can we accelerate progress towards human-like linguistic generalization? (2020) https://doi.org/10.48550/ARXIV.2005.00955 .
Browning, J. & Lecun, Y. AI and the limits of language. Noema (2022).
Gagné, C. L. Relation and lexical priming during the interpretation of noun–noun combinations. J. Exp. Psychol. Learn. Mem. Cognit. 27, 236–254 (2001).
doi: 10.1037/0278-7393.27.1.236
Gagné, C. L. & Spalding, T. L. Constituent integration during the processing of compound words: Does it involve the use of relational structures?. J. Mem. Lang. 60, 20–35 (2009).
doi: 10.1016/j.jml.2008.07.003
Graves, W. W., Binder, J. R., Desai, R. H., Conant, L. L. & Seidenberg, M. S. Neural correlates of implicit and explicit combinatorial semantic processing. NeuroImage 53, 638–646 (2010).
doi: 10.1016/j.neuroimage.2010.06.055
pubmed: 20600969
Pylkkänen, L. The neural basis of combinatory syntax and semantics. Science 366, 62–66 (2019).
doi: 10.1126/science.aax0050
pubmed: 31604303
Graves, W. W., Binder, J. R. & Seidenberg, M. S. Noun–noun combination: Meaningfulness ratings and lexical statistics for 2,160 word pairs. Behav Res 45, 463–469 (2013).
doi: 10.3758/s13428-012-0256-3
Crawford, J. R. & Howell, D. C. Comparing an individual’s test score against norms derived from small samples. Clin. Neuropsychol. 12, 482–486 (1998).
doi: 10.1076/clin.12.4.482.7241
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space. Preprint at https://doi.org/10.48550/ARXIV.1301.3781 (2013).
Pennington, J., Socher, R. & Manning, C. Glove: Global Vectors for Word Representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, Doha, Qatar, 2014). https://doi.org/10.3115/v1/D14-1162 .
Roller, S. & Erk, K. Relations such as Hypernymy: Identifying and Exploiting Hearst Patterns in Distributional Vectors for Lexical Entailment. in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2163–2172 (Association for Computational Linguistics, Austin, Texas, 2016). https://doi.org/10.18653/v1/D16-1234 .
Gao, C., Shinkareva, S. V. & Desai, R. H. SCOPE: The South Carolina psycholinguistic metabase. Behav Res 55, 2853–2884 (2022).
doi: 10.3758/s13428-022-01934-0
Srivastava, A. et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (2022) https://doi.org/10.48550/ARXIV.2206.04615 .
Dziri, N. et al. Faith and Fate: Limits of Transformers on Compositionality. (2023) https://doi.org/10.48550/ARXIV.2305.18654 .
Bian, N. et al. ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models. (2023) https://doi.org/10.48550/ARXIV.2303.16421 .
Koralus, P. & Wang-Maścianica, V. Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure. (2023) https://doi.org/10.48550/ARXIV.2303.17276 .
Qin, C. et al. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 1339–1384 (Association for Computational Linguistics, Singapore, 2023). https://doi.org/10.18653/v1/2023.emnlp-main.85 .
Parrish, A. & Pylkkänen, L. Conceptual combination in the LATL with and without syntactic composition. Neurobiol. Lang. 3, 46–66 (2022).
doi: 10.1162/nol_a_00048