Evaluating ChatGPT on Orbital and Oculofacial Disorders: Accuracy and Readability Insights.
Journal
Ophthalmic plastic and reconstructive surgery
ISSN: 1537-2677
Abbreviated title: Ophthalmic Plast Reconstr Surg
Country: United States
NLM ID: 8508431
Publication information
Publication date:
16 Nov 2023
History:
medline: 22 Nov 2023
pubmed: 22 Nov 2023
entrez: 21 Nov 2023
Status:
aheadofprint
Abstract
To assess the accuracy and readability of responses generated by the artificial intelligence model, ChatGPT (version 4.0), to questions related to 10 essential domains of orbital and oculofacial disease. A set of 100 questions related to the diagnosis, treatment, and interpretation of orbital and oculofacial diseases was posed to ChatGPT 4.0. Responses were evaluated by a panel of 7 experts based on appropriateness and accuracy, with performance scores measured on a 7-item Likert scale. Inter-rater reliability was determined via the intraclass correlation coefficient. The artificial intelligence model demonstrated accurate and consistent performance across all 10 domains of orbital and oculofacial disease, with an average appropriateness score of 5.3/6.0 ("mostly appropriate" to "completely appropriate"). Domains of cavernous sinus fistula, retrobulbar hemorrhage, and blepharospasm had the highest domain scores (average scores of 5.5 to 5.6), while the proptosis domain had the lowest (average score of 5.0/6.0). The intraclass correlation coefficient was 0.64 (95% CI: 0.52 to 0.74), reflecting moderate inter-rater reliability. The responses exhibited a high reading-level complexity, representing the comprehension levels of a college or graduate education. This study demonstrates the potential of ChatGPT 4.0 to provide accurate information in the field of ophthalmology, specifically orbital and oculofacial disease. However, challenges remain in ensuring accurate and comprehensive responses across all disease domains. Future improvements should focus on refining the model's correctness and eventually expanding the scope to visual data interpretation. Our results highlight the vast potential for artificial intelligence in educational and clinical ophthalmology contexts.
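The abstract reports inter-rater reliability as an intraclass correlation coefficient but does not state which ICC form was used. As a minimal sketch, assuming the two-way random-effects, absolute-agreement, single-rater form, often written ICC(2,1), the calculation from a table of items by raters could look like the following. The function name `icc_2_1` and the small `example_ratings` table are illustrative assumptions, not the study's code or data.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows: one row per rated item (e.g. one response),
    one column per rater. This form is an assumption; the abstract does not
    say which ICC variant the authors computed.
    """
    n = len(ratings)       # number of rated items
    k = len(ratings[0])    # number of raters

    grand = sum(v for row in ratings for v in row) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Two-way ANOVA sums of squares (items x raters, one rating per cell).
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in ratings for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Illustrative values only (6 items, 4 raters); NOT the study's ratings.
example_ratings = [
    [9, 2, 5, 8],
    [6, 1, 3, 2],
    [8, 4, 6, 8],
    [7, 1, 2, 6],
    [10, 5, 6, 9],
    [6, 2, 4, 7],
]

print(round(icc_2_1(example_ratings), 2))
```

On this illustrative table the function returns about 0.29. For the study itself the input would have 100 rows (questions) and 7 columns (experts), and the reported value of 0.64 would additionally need a confidence interval, which this sketch does not compute.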
Identifiers
pubmed: 37989540
doi: 10.1097/IOP.0000000000002552
pii: 00002341-990000000-00287
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
Copyright © 2023 The American Society of Ophthalmic Plastic and Reconstructive Surgery, Inc.
Conflict of interest statement
The authors have no financial interests or conflicts of interest to disclose.