Evaluating the strengths and weaknesses of large language models in answering neurophysiology questions.
Bloom’s taxonomy
Evaluation
Large language models
Neurophysiology
Journal
Scientific Reports
ISSN: 2045-2322
Abbreviated title: Sci Rep
Country: England
NLM ID: 101563288
Publication information
Publication date:
11 May 2024
History:
received: 12 Sep 2023
accepted: 23 Apr 2024
medline: 12 May 2024
pubmed: 12 May 2024
entrez: 11 May 2024
Status:
epublish
Abstract
Large language models (LLMs), such as ChatGPT, Google's Bard, and Anthropic's Claude, showcase remarkable natural language processing capabilities. Evaluating their proficiency in specialized domains such as neurophysiology is crucial to understanding their utility in research, education, and clinical applications. This study aims to assess and compare the effectiveness of LLMs in answering neurophysiology questions in both English and Persian (Farsi), covering a range of topics and cognitive levels. Twenty questions covering four topics (general, sensory system, motor system, and integrative) and two cognitive levels (lower-order and higher-order) were posed to the LLMs. Physiologists scored the essay-style answers on a scale of 0-5 points. Statistical analysis compared the scores across model, language, topic, and cognitive level. Qualitative analysis identified reasoning gaps. In general, the models demonstrated good performance (mean score = 3.87/5), with no significant difference between languages or cognitive levels. Performance was strongest on motor system topics (mean = 4.41) and weakest on integrative topics (mean = 3.35). Detailed qualitative analysis uncovered deficiencies in reasoning, discerning priorities, and knowledge integration. This study offers valuable insights into LLMs' capabilities and limitations in the field of neurophysiology. The models demonstrate proficiency in general questions but face challenges in advanced reasoning and knowledge integration. Targeted training could address gaps in knowledge and causal reasoning. As LLMs evolve, rigorous domain-specific assessments will be crucial for evaluating advancements in their performance.
Identifiers
pubmed: 38734712
doi: 10.1038/s41598-024-60405-y
pii: 10.1038/s41598-024-60405-y
Publication types
Journal Article
Research Support, Non-U.S. Gov't
Languages
eng
Citation subsets
IM
Pagination
10785
Copyright information
© 2024. The Author(s).