Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions.

Humans; Language; Neurophysiology / methods; Natural Language Processing; Cognition / physiology

Bloom’s taxonomy; Evaluation; Large language models; Neurophysiology

Journal

Scientific reports

ISSN: 2045-2322

Abbreviated title: Sci Rep

Country: England

NLM ID: 101563288

Publication information

Publication date:
11 May 2024

History:

received: 12 09 2023

accepted: 23 04 2024

medline: 12 5 2024

pubmed: 12 5 2024

entrez: 11 5 2024

Status: epublish

Abstract

Large language models (LLMs) such as ChatGPT, Google's Bard, and Anthropic's Claude show remarkable natural language processing capabilities. Evaluating their proficiency in specialized domains such as neurophysiology is crucial to understanding their utility in research, education, and clinical applications. This study assesses and compares the effectiveness of LLMs in answering neurophysiology questions in both English and Persian (Farsi), covering a range of topics and cognitive levels. Twenty questions covering four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order) were posed to the LLMs. Physiologists scored the essay-style answers on a scale of 0 to 5 points. Statistical analysis compared the scores across model, language, topic, and cognitive level. Qualitative analysis identified reasoning gaps. Overall, the models performed well (mean score = 3.87/5), with no significant differences between languages or cognitive levels. Performance was strongest on motor system questions (mean = 4.41) and weakest on integrative topics (mean = 3.35). Detailed qualitative analysis uncovered deficiencies in reasoning, discerning priorities, and integrating knowledge. This study offers insights into LLMs' capabilities and limitations in neurophysiology. The models demonstrate proficiency in general questions but face challenges in advanced reasoning and knowledge integration. Targeted training could address gaps in knowledge and causal reasoning. As LLMs evolve, rigorous domain-specific assessments will be crucial for evaluating advances in their performance.
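The abstract reports mean scores broken down by model, language, topic, and cognitive level. As a minimal sketch of that kind of descriptive breakdown, the Python snippet below groups 0 to 5 ratings by one factor and averages them. The ratings, the model names, and the helper `mean_by` are hypothetical placeholders for illustration only; they are not the study's data or code, and the abstract does not state which statistical tests were used, so none are shown.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical ratings, NOT the study's data:
# (model, language, topic, cognitive level, score on a 0-5 scale)
ratings = [
    ("model_a", "English", "motor system", "lower-order", 5),
    ("model_a", "Persian", "motor system", "higher-order", 4),
    ("model_b", "English", "integrative", "higher-order", 3),
    ("model_b", "Persian", "integrative", "lower-order", 4),
    ("model_a", "English", "general", "lower-order", 4),
    ("model_b", "Persian", "sensory system", "higher-order", 3),
]

# Position of each grouping factor inside a rating tuple.
FIELDS = ("model", "language", "topic", "level")


def mean_by(factor, rows):
    """Return the mean score for each value of one grouping factor."""
    idx = FIELDS.index(factor)
    groups = defaultdict(list)
    for row in rows:
        groups[row[idx]].append(row[-1])  # the score is the last element
    return {key: mean(scores) for key, scores in groups.items()}


topic_means = mean_by("topic", ratings)
language_means = mean_by("language", ratings)

for topic, value in sorted(topic_means.items()):
    print(f"{topic}: {value:.2f}")
```

Calling `mean_by` with `"model"` or `"level"` gives the same breakdown for the other two factors.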

Identifiers

DOI: 10.1038/s41598-024-60405-y PMID: 38734712

pubmed: 38734712

doi: 10.1038/s41598-024-60405-y

pii: 10.1038/s41598-024-60405-y

Publication types

Journal Article; Research Support, Non-U.S. Gov't

Languages

eng

Pagination

10785

Authors

Hassan Shojaee-Mend (H)

Reza Mohebbati (R)

Mostafa Amiri (M)

Alireza Atarodi (A)
