The Two Word Test as a semantic benchmark for large language models.


Journal

Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288

Publication information

Publication date:
16 09 2024
History:
received: 07 05 2024
accepted: 09 09 2024
medline: 17 09 2024
pubmed: 17 09 2024
entrez: 16 09 2024
Status: epublish

Abstract

Large language models (LLMs) have recently shown remarkable abilities, including passing advanced professional exams and demanding benchmark tests. This performance has led many to suggest that they are close to achieving humanlike or "true" understanding of language, and even artificial general intelligence (AGI). Here, we provide a new open-source benchmark, the Two Word Test (TWT), that assesses the semantic abilities of LLMs using two-word phrases in a task that humans can perform relatively easily without advanced training. Combining multiple words into a single concept is a fundamental linguistic and conceptual operation routinely performed by people. The test requires meaningfulness judgments of 1768 noun-noun combinations that human raters have judged as meaningful (e.g., baby boy) or as having low meaningfulness (e.g., goat sky). This novel test differs from existing benchmarks that rely on logical reasoning, inference, puzzle-solving, or domain expertise. We provide versions of the task that probe meaningfulness ratings on a 0-4 scale as well as binary judgments. With both versions, we conducted a series of experiments using the TWT on GPT-4, GPT-3.5, Claude-3-Opus, and Gemini-1.0-Pro-001. Results demonstrated that, compared to humans, all models performed relatively poorly at rating the meaningfulness of these phrases. GPT-3.5-turbo, Gemini-1.0-Pro-001, and GPT-4-turbo were also unable to make binary discriminations between sensible and nonsense phrases, consistently judging nonsensical phrases as making sense. Claude-3-Opus showed a substantial improvement in binary discrimination of combinatorial phrases but was still significantly worse than human performance. The TWT can be used to understand and assess the limitations of current LLMs, and potentially to improve them. The test also reminds us that caution is warranted in attributing "true" or human-level understanding to LLMs based only on tests that are challenging for humans.
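The binary version of the task can be scored by comparing model judgments against human labels. A minimal sketch of such scoring is below; the phrases and labels are illustrative examples only, not items from the published TWT stimulus set, and the scoring function is a hypothetical helper, not the authors' released code.

```python
def score_binary_twt(items):
    """Score binary Two Word Test judgments.

    items: list of (phrase, human_label, model_label) tuples,
    where each label is 'sense' or 'nonsense'.
    Returns overall accuracy and the rate at which nonsense
    phrases were judged as making sense (the failure mode
    reported for several LLMs).
    """
    correct = sum(1 for _, human, model in items if human == model)
    nonsense = [(h, m) for _, h, m in items if h == "nonsense"]
    false_sense = sum(1 for h, m in nonsense if m == "sense")
    return {
        "accuracy": correct / len(items),
        "nonsense_as_sense_rate": false_sense / len(nonsense) if nonsense else 0.0,
    }


# Toy example: a model that judges every phrase as sensible.
items = [
    ("baby boy", "sense", "sense"),
    ("goat sky", "nonsense", "sense"),
    ("lake house", "sense", "sense"),
    ("pencil rain", "nonsense", "sense"),
]
print(score_binary_twt(items))  # accuracy 0.5, nonsense_as_sense_rate 1.0
```

A model with this "everything makes sense" bias scores at chance overall while misclassifying every nonsense item, which is why the nonsense-as-sense rate is reported separately from accuracy.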

Identifiers

pubmed: 39284863
doi: 10.1038/s41598-024-72528-3
pii: 10.1038/s41598-024-72528-3

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Pagination

21593

Grants

Agency: NIH HHS
ID: 5R01DC017162-02
Country: United States

Copyright information

© 2024. The Author(s).

References

Bommasani, R. et al. On the opportunities and risks of foundation models. (2021) https://doi.org/10.48550/ARXIV.2108.07258 .
Choi, J. H., Hickman, K. E., Monahan, A. & Schwarcz, D. B. ChatGPT goes to law school. SSRN J. https://doi.org/10.2139/ssrn.4335905 (2023).
doi: 10.2139/ssrn.4335905
Terwiesch, C. Would Chat GPT get a Wharton MBA? A prediction based on its performance in the operations management course. (2023).
Kung, T. H. et al. Performance of ChatGPT on USMLE: Potential for AI-assisted medical education using large language models. PLOS Digit Health 2, e0000198 (2023).
doi: 10.1371/journal.pdig.0000198 pubmed: 36812645 pmcid: 9931230
Brown, T. B. et al. Language models are few-shot learners. (2020) https://doi.org/10.48550/ARXIV.2005.14165 .
Chowdhery, A. et al. PaLM: Scaling language modeling with pathways. (2022) https://doi.org/10.48550/ARXIV.2204.02311 .
Bubeck, S. et al. Sparks of artificial general intelligence: Early experiments with GPT-4. (2023) https://doi.org/10.48550/ARXIV.2303.12712 .
Michael, J. et al. What Do NLP Researchers Believe? Results of the NLP Community Metasurvey. (2022) https://doi.org/10.48550/ARXIV.2208.12852 .
Mitchell, M. & Krakauer, D. C. The debate over understanding in AI’s large language models. Proc. Natl. Acad. Sci. U.S.A. 120, e2215907120 (2023).
doi: 10.1073/pnas.2215907120 pubmed: 36943882 pmcid: 10068812
Lee, K., Firat, O., Agarwal, A., Fannjiang, C. & Sussillo, D. Hallucinations in neural machine translation. (2018).
Raunak, V., Menezes, A. & Junczys-Dowmunt, M. The Curious Case of Hallucinations in Neural Machine Translation. in Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies 1172–1183 (Association for Computational Linguistics, Online, 2021). https://doi.org/10.18653/v1/2021.naacl-main.92 .
Mahowald, K. et al. Dissociating language and thought in large language models. (2023) https://doi.org/10.48550/ARXIV.2301.06627 .
Choudhury, S. R., Rogers, A. & Augenstein, I. Machine Reading, Fast and Slow: When Do Models ‘Understand’ Language? (2022) https://doi.org/10.48550/ARXIV.2209.07430 .
Gardner, M. et al. Competency problems: On finding and removing artifacts in language data. (2021) https://doi.org/10.48550/ARXIV.2104.08646 .
Linzen, T. How can we accelerate progress towards human-like linguistic generalization? (2020) https://doi.org/10.48550/ARXIV.2005.00955 .
Browning, J. & Lecun, Y. AI and the limits of language. Noema (2022).
Gagné, C. L. Relation and lexical priming during the interpretation of noun–noun combinations. J. Exp. Psychol. Learn. Mem. Cognit. 27, 236–254 (2001).
doi: 10.1037/0278-7393.27.1.236
Gagné, C. L. & Spalding, T. L. Constituent integration during the processing of compound words: Does it involve the use of relational structures?. J. Mem. Lang. 60, 20–35 (2009).
doi: 10.1016/j.jml.2008.07.003
Graves, W. W., Binder, J. R., Desai, R. H., Conant, L. L. & Seidenberg, M. S. Neural correlates of implicit and explicit combinatorial semantic processing. NeuroImage 53, 638–646 (2010).
doi: 10.1016/j.neuroimage.2010.06.055 pubmed: 20600969
Pylkkänen, L. The neural basis of combinatory syntax and semantics. Science 366, 62–66 (2019).
doi: 10.1126/science.aax0050 pubmed: 31604303
Graves, W. W., Binder, J. R. & Seidenberg, M. S. Noun–noun combination: Meaningfulness ratings and lexical statistics for 2,160 word pairs. Behav Res 45, 463–469 (2013).
doi: 10.3758/s13428-012-0256-3
Crawford, J. R. & Howell, D. C. Comparing an individual’s test score against norms derived from small samples. Clin. Neuropsychol. 12, 482–486 (1998).
doi: 10.1076/clin.12.4.482.7241
Mikolov, T., Chen, K., Corrado, G. & Dean, J. Efficient Estimation of Word Representations in Vector Space. Preprint at https://doi.org/10.48550/ARXIV.1301.3781 (2013).
Pennington, J., Socher, R. & Manning, C. Glove: Global Vectors for Word Representation. in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) 1532–1543 (Association for Computational Linguistics, Doha, Qatar, 2014). https://doi.org/10.3115/v1/D14-1162 .
Roller, S. & Erk, K. Relations such as Hypernymy: Identifying and Exploiting Hearst Patterns in Distributional Vectors for Lexical Entailment. in Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing 2163–2172 (Association for Computational Linguistics, Austin, Texas, 2016). https://doi.org/10.18653/v1/D16-1234 .
Gao, C., Shinkareva, S. V. & Desai, R. H. SCOPE: The South Carolina psycholinguistic metabase. Behav Res 55, 2853–2884 (2022).
doi: 10.3758/s13428-022-01934-0
Srivastava, A. et al. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. (2022) https://doi.org/10.48550/ARXIV.2206.04615 .
Dziri, N. et al. Faith and Fate: Limits of Transformers on Compositionality. (2023) https://doi.org/10.48550/ARXIV.2305.18654 .
Bian, N. et al. ChatGPT is a Knowledgeable but Inexperienced Solver: An Investigation of Commonsense Problem in Large Language Models. (2023) https://doi.org/10.48550/ARXIV.2303.16421 .
Koralus, P. & Wang-Maścianica, V. Humans in Humans Out: On GPT Converging Toward Common Sense in both Success and Failure. (2023) https://doi.org/10.48550/ARXIV.2303.17276 .
Qin, C. et al. Is ChatGPT a General-Purpose Natural Language Processing Task Solver? in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing 1339–1384 (Association for Computational Linguistics, Singapore, 2023). https://doi.org/10.18653/v1/2023.emnlp-main.85 .
Parrish, A. & Pylkkänen, L. Conceptual combination in the LATL with and without syntactic composition. Neurobiol. Lang. 3, 46–66 (2022).
doi: 10.1162/nol_a_00048

Authors

Nicholas Riccardi (N)

Department of Communication Sciences and Disorders, University of South Carolina, Columbia, 29208, USA.

Xuan Yang (X)

Department of Psychology, University of South Carolina, Columbia, 29208, USA.

Rutvik H Desai (RH)

Department of Psychology, University of South Carolina, Columbia, 29208, USA. rutvik@sc.edu.
