Do ChatGPT and Gemini Provide Appropriate Recommendations for Pediatric Orthopaedic Conditions?


Journal

Journal of pediatric orthopedics
ISSN: 1539-2570
Abbreviated title: J Pediatr Orthop
Country: United States
NLM ID: 8109053

Publication information

Publication date:
22 Aug 2024
History:
medline: 22 Aug 2024
pubmed: 22 Aug 2024
entrez: 22 Aug 2024
Status: aheadofprint

Abstract

BACKGROUND
Artificial intelligence (AI), and in particular large language models (LLMs) such as Chat Generative Pre-Trained Transformer (ChatGPT) and Gemini, have given patients additional resources for researching the management of healthcare conditions, both for their own edification and for advocacy in the care of their children. The accuracy of these models, however, and the sources from which they draw conclusions, have been largely unstudied in pediatric orthopaedics. This research aimed to assess the reliability of machine learning tools in providing appropriate recommendations for the care of common pediatric orthopaedic conditions.
METHODS
ChatGPT and Gemini were queried using plain language generated from the American Academy of Orthopaedic Surgeons (AAOS) Clinical Practice Guidelines (CPGs) listed on the Pediatric Orthopedic Society of North America (POSNA) web page. Two independent reviewers assessed the accuracy of the responses, and chi-square analyses were used to compare the 2 LLMs. Inter-rater reliability was calculated via Cohen's kappa coefficient. If research studies were cited, attempts were made to assess their legitimacy by searching the PubMed and Google Scholar databases.
RESULTS
ChatGPT and Gemini performed similarly, agreeing with the AAOS CPGs at rates of 67% and 69%, respectively. No significant differences were observed in performance between the 2 LLMs. ChatGPT did not reference specific studies in any response, whereas Gemini referenced a total of 16 research papers in 6 of 24 responses. Twelve of the 16 studies referenced contained errors: 7 could not be identified, and 5 contained discrepancies regarding publication year, journal, or proper attribution of authorship.
CONCLUSIONS
The LLMs investigated were frequently aligned with the AAOS CPGs; however, the rate of neutral statements or disagreement with consensus recommendations was substantial, and responses frequently contained errors in their citations of sources. These findings suggest there remains room for growth and transparency in the development of the models which power AI, and they may not yet represent the best source of up-to-date healthcare information for patients or providers.
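
The abstract names the two statistics used (Cohen's kappa for inter-rater reliability and a chi-square test to compare the 2 LLMs) but does not report the underlying counts or the software used. The following is a minimal sketch of how such calculations could be done in plain Python. All ratings and counts below are hypothetical placeholders, not data from the study, and the function names are illustrative assumptions.

```python
import math
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labeled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    row1, row2, col1, col2 = a + b, c + d, a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column needs a nonzero total")
    stat = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom the upper-tail probability is erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical reviewer ratings of six responses (not study data).
reviewer_1 = ["agree", "agree", "neutral", "disagree", "agree", "agree"]
reviewer_2 = ["agree", "agree", "neutral", "agree", "agree", "neutral"]
kappa = cohens_kappa(reviewer_1, reviewer_2)

# Hypothetical counts: rows are the two models, columns are
# (responses agreeing with the guideline, responses not agreeing).
stat, p_value = chi_square_2x2(30, 10, 28, 12)
print(f"kappa={kappa:.3f}, chi2={stat:.3f}, p={p_value:.3f}")
```

With small expected cell counts, Fisher's exact test or a continuity correction is often preferred over the uncorrected statistic sketched here; the abstract does not say which variant the authors applied.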

Identifiers

pubmed: 39171426
doi: 10.1097/BPO.0000000000002797
pii: 01241398-990000000-00639

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

Copyright © 2024 Wolters Kluwer Health, Inc. All rights reserved.

Conflict of interest statement

The authors declare no conflicts of interest.

Authors

Sean Pirkle (S)

Department of Orthopaedics and Sports Medicine, University of Washington.

JaeWon Yang (J)

Department of Orthopaedics and Sports Medicine, University of Washington.

Todd J Blumberg (TJ)

Department of Orthopaedics and Sports Medicine, University of Washington.
Department of Orthopaedics and Sports Medicine, Seattle Children's Hospital, Seattle, WA.

MeSH classifications