Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4.


Journal

Scandinavian journal of trauma, resuscitation and emergency medicine
ISSN: 1757-7241
Abbreviated title: Scand J Trauma Resusc Emerg Med
Country: England
NLM ID: 101477511

Publication information

Publication date:
26 Sep 2024
History:
received: 29 May 2024
accepted: 10 Sep 2024
medline: 27 Sep 2024
pubmed: 27 Sep 2024
entrez: 26 Sep 2024
Status: epublish

Abstract

Artificial intelligence (AI) chatbots are established as tools for answering medical questions worldwide. Healthcare trainees are increasingly using this technology, although its reliability and accuracy in the context of healthcare remain uncertain. This study evaluated the suitability of ChatGPT versions 3.5 and 4 for healthcare professionals seeking up-to-date evidence and recommendations for resuscitation. It did so by comparing the key messages of the resuscitation guidelines, which methodically set the gold standard of current evidence and recommendations, with the statements of the AI chatbots on this topic.

This prospective comparative content analysis compared the 2021 European Resuscitation Council (ERC) guidelines with the responses of two freely available ChatGPT versions (ChatGPT-3.5 and the Bing version of ChatGPT-4) to questions about the key messages of clinically relevant ERC guideline chapters for adults. The content analysis was performed bidirectionally by independent raters. (1) The completeness and currency of the AI output were assessed by comparing the key messages with the AI-generated statements. (2) The conformity of the AI output was evaluated by comparing the statements of the two ChatGPT versions with the content of the ERC guidelines.

In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32 statements. Of the 172 key messages of the ERC guideline chapters, ChatGPT-3.5 did not address 123 and ChatGPT-4 did not address 132. A total of 77% of the ChatGPT-3.5 statements and 84% of the ChatGPT-4 statements were fully in line with the ERC guidelines. The main reasons for nonconformity were superficial and incorrect AI statements. The interrater reliability between the two raters, measured by Cohen's kappa, was greater for ChatGPT-4 (0.56 for completeness and 0.76 for conformity analysis) than for ChatGPT-3.5 (0.48 for completeness and 0.36 for conformity).

We advise healthcare professionals not to rely solely on the tested AI-based chatbots to keep up to date with the latest evidence. The relevant texts for the task were not part of the training texts of the underlying large language models (LLMs), and the AI's lack of conceptual understanding carries a high risk of spreading misconceptions. Original publications should always be consulted for a comprehensive understanding.
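
For readers unfamiliar with Cohen's kappa, the following is a minimal Python sketch of how the statistic is computed for two raters who assign one category to each item. The function name `cohens_kappa`, the category labels, and the example ratings are hypothetical and chosen for illustration only; they are not the study's data or the authors' analysis code.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed share of
    items on which the raters agree and p_e is the agreement expected by
    chance from each rater's own category frequencies.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")

    n = len(rater_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal proportions,
    # summed over categories.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical category for every item.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten chatbot statements (illustrative only).
ratings_a = ["full", "full", "partial", "none", "full",
             "partial", "none", "full", "full", "none"]
ratings_b = ["full", "full", "partial", "none", "partial",
             "partial", "none", "full", "none", "none"]

kappa = cohens_kappa(ratings_a, ratings_b)
print(f"kappa = {kappa:.2f}")
```

Because kappa subtracts the agreement expected by chance, it is lower than the raw share of items on which the raters agree whenever some agreement would occur by chance alone.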

Identifiers

pubmed: 39327587
doi: 10.1186/s13049-024-01266-2
pii: 10.1186/s13049-024-01266-2

Publication types

Journal Article; Comparative Study

Languages

eng

Citation subsets

IM

Pagination

95

Copyright information

© 2024. The Author(s).

Authors

Stefanie Beck (S)

Department of Intensive Care Medicine, Hamburg-Eppendorf University Medical Centre, Hamburg, Germany. st.beck@uke.de.

Manuel Kuhner (M)

Department of Intensive Care Medicine, Hamburg-Eppendorf University Medical Centre, Hamburg, Germany.

Markus Haar (M)

Department of Intensive Care Medicine, Hamburg-Eppendorf University Medical Centre, Hamburg, Germany.

Anne Daubmann (A)

Department of Medical Biometry and Epidemiology, Hamburg-Eppendorf University Medical Centre, Hamburg, Germany.

Martin Semmann (M)

Hub of Computing and Data Science, University of Hamburg, Hamburg, Germany.

Stefan Kluge (S)

Department of Intensive Care Medicine, Hamburg-Eppendorf University Medical Centre, Hamburg, Germany.
