Evaluating the accuracy and reliability of AI chatbots in disseminating the content of current resuscitation guidelines: a comparative analysis between the ERC 2021 guidelines and both ChatGPTs 3.5 and 4.

MeSH terms: Humans; Prospective Studies; Resuscitation / standards; Reproducibility of Results; Artificial Intelligence; Practice Guidelines as Topic; Information Dissemination / methods

Journal

Scandinavian Journal of Trauma, Resuscitation and Emergency Medicine

ISSN: 1757-7241

Abbreviated title: Scand J Trauma Resusc Emerg Med

Country: England

NLM ID: 101477511

Publication information

Publication date:
26 Sep 2024

History:

received: 29 May 2024

accepted: 10 Sep 2024

medline: 27 Sep 2024

pubmed: 27 Sep 2024

entrez: 26 Sep 2024

Status: epublish

Abstract

Artificial intelligence (AI) chatbots are established as tools for answering medical questions worldwide. Healthcare trainees are increasingly using this technology, although its reliability and accuracy in the context of healthcare remain uncertain. This study evaluated the suitability of ChatGPT versions 3.5 and 4 for healthcare professionals seeking up-to-date evidence and recommendations for resuscitation by comparing the key messages of the resuscitation guidelines, which methodically set the gold standard of current evidence and recommendations, with the statements of the AI chatbots on this topic.

This prospective comparative content analysis compared the 2021 European Resuscitation Council (ERC) guidelines with the responses of two freely available ChatGPT versions (ChatGPT-3.5 and the Bing version of ChatGPT-4) to questions about the key messages of clinically relevant ERC guideline chapters for adults. The content analysis was performed bidirectionally by independent raters: (1) the completeness and currency of the AI output were assessed by comparing the key messages with the AI-generated statements, and (2) the conformity of the AI output was evaluated by comparing the statements of the two ChatGPT versions with the content of the ERC guidelines.

In response to inquiries about the five chapters, ChatGPT-3.5 generated a total of 60 statements, whereas ChatGPT-4 produced 32 statements. Of the 172 key messages of the ERC guideline chapters, ChatGPT-3.5 did not address 123 and ChatGPT-4 did not address 132. A total of 77% of the ChatGPT-3.5 statements and 84% of the ChatGPT-4 statements were fully in line with the ERC guidelines. The main reasons for nonconformity were superficial and incorrect AI statements. The interrater reliability between the two raters, measured by Cohen's kappa, was greater for ChatGPT-4 (0.56 for completeness and 0.76 for conformity analysis) than for ChatGPT-3.5 (0.48 for completeness and 0.36 for conformity).

We advise healthcare professionals not to rely solely on the tested AI-based chatbots to keep up to date with the latest evidence, as the relevant texts for the task were not part of the training texts of the underlying large language models (LLMs), and the AI's lack of conceptual understanding carries a high risk of spreading misconceptions. Original publications should always be consulted for a comprehensive understanding.
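The two kinds of figures reported in the abstract, key-message coverage and Cohen's kappa, can be illustrated with a minimal Python sketch. The key-message counts (172 total, 123 and 132 not addressed) are taken from the abstract. The function names and the toy rating vectors are hypothetical and are not the study's data or code. The kappa function uses the standard two-rater formula, (observed agreement - chance agreement) / (1 - chance agreement).

```python
from collections import Counter


def coverage(total_key_messages, not_addressed):
    """Return (number addressed, fraction addressed) for one chatbot."""
    addressed = total_key_messages - not_addressed
    return addressed, addressed / total_key_messages


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / n**2
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # and returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Counts reported in the abstract.
TOTAL_KEY_MESSAGES = 172
NOT_ADDRESSED = {"ChatGPT-3.5": 123, "ChatGPT-4": 132}

if __name__ == "__main__":
    for model, missed in NOT_ADDRESSED.items():
        addressed, fraction = coverage(TOTAL_KEY_MESSAGES, missed)
        print(f"{model}: {addressed}/{TOTAL_KEY_MESSAGES} addressed ({fraction:.1%})")

    # Hypothetical toy ratings (1 = conforms, 0 = does not), NOT study data.
    toy_rater_a = [1, 1, 0, 0]
    toy_rater_b = [1, 0, 0, 0]
    print(f"toy kappa: {cohens_kappa(toy_rater_a, toy_rater_b):.2f}")
```

Under these counts, ChatGPT-3.5 addressed 49 of the 172 key messages (about 28%) and ChatGPT-4 addressed 40 (about 23%).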

Identifiers

DOI: 10.1186/s13049-024-01266-2

PMID: 39327587

PII: 10.1186/s13049-024-01266-2

Publication types

Journal Article; Comparative Study

Languages

eng




Authors

Stefanie Beck

Manuel Kuhner

Markus Haar

Anne Daubmann

Martin Semmann

Stefan Kluge
