Accuracy of natural language processors for patients seeking inguinal hernia information.
Artificial intelligence
Inguinal hernia
NLP
Natural language processors
Patient education
Journal
Surgical Endoscopy
ISSN: 1432-2218
Abbreviated title: Surg Endosc
Country: Germany
NLM ID: 8806653
Publication information
Publication date: 23 Oct 2024
History:
received: 21 Apr 2024
accepted: 19 Aug 2024
medline: 24 Oct 2024
pubmed: 24 Oct 2024
entrez: 23 Oct 2024
Status: ahead of print
Abstract
BACKGROUND
NLPs such as ChatGPT are novel sources of online healthcare information that are readily accessible and integrated into internet search tools. The accuracy of NLP-generated responses to health information questions is unknown.
METHODS
We queried four NLPs (ChatGPT 3.5 and 4, Bard, and Claude 2.0) for responses to simulated patient questions about inguinal hernias and their management. Responses were graded on a Likert scale (1 poor to 5 excellent) for relevance, completeness, and accuracy. Responses were compiled and scored collectively for readability using the Flesch-Kincaid score and for educational quality using the DISCERN instrument, a validated tool for evaluating patient information materials. Responses were also compared to two gold-standard educational materials provided by the Society of American Gastrointestinal and Endoscopic Surgeons (SAGES) and the American College of Surgeons (ACS). Evaluations were performed by six hernia surgeons.
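The Flesch-Kincaid metrics used for the readability scoring are simple functions of average sentence length and syllables per word. The following minimal Python sketch (an illustration, not the authors' scoring pipeline) computes Flesch Reading Ease and Flesch-Kincaid Grade Level for a block of response text; the vowel-group syllable counter is a rough assumption, and a dedicated readability library would be more precise.

import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count groups of consecutive vowels (an assumption, not exact syllabification).
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def flesch_scores(text: str) -> tuple[float, float]:
    # Standard Flesch formulas: reading ease and Flesch-Kincaid grade level.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    reading_ease = 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word
    grade_level = 0.39 * words_per_sentence + 11.8 * syllables_per_word - 15.59
    return reading_ease, grade_level

sample = ("An inguinal hernia occurs when tissue pushes through a weak spot "
          "in the abdominal wall near the groin. Many hernias are repaired "
          "with mesh during open or laparoscopic surgery.")
ease, grade = flesch_scores(sample)
print(f"Reading ease: {ease:.1f}, grade level: {grade:.1f}")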
RESULTS
The average NLP response scores for relevance, completeness, and accuracy were 4.76 (95% CI 4.70-4.80), 4.11 (95% CI 4.02-4.20), and 4.14 (95% CI 4.03-4.24), respectively. ChatGPT 4 received higher accuracy scores (mean 4.43 [95% CI 4.37-4.50]) than Bard (mean 4.06 [95% CI 3.88-4.26]) and Claude 2.0 (mean 3.85 [95% CI 3.63-4.08]). The ACS document received the best scores for reading ease (55.2) and grade level (9.2); however, none of the documents achieved the readability thresholds recommended by the American Medical Association. The ACS document also received the highest DISCERN score of 63.5 (95% CI 57.0-70.1), which was significantly higher than the scores for ChatGPT 4 (50.8 [95% CI 46.2-55.4]) and Claude 2.0 (48 [95% CI 41.6-54.4]).
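The reported means and 95% confidence intervals can be reproduced from the underlying per-response Likert ratings. The abstract does not state the exact interval method, so the sketch below assumes a simple normal-approximation interval over hypothetical ratings; the study may have used a different approach.

import statistics
from math import sqrt

def mean_ci_95(ratings: list[float]) -> tuple[float, float, float]:
    # Mean with an approximate 95% CI using a normal approximation (an assumption);
    # a t-distribution critical value would be more exact for small samples.
    n = len(ratings)
    mean = statistics.fmean(ratings)
    sem = statistics.stdev(ratings) / sqrt(n)
    return mean, mean - 1.96 * sem, mean + 1.96 * sem

# Hypothetical accuracy ratings (1-5 Likert) pooled across questions and raters
accuracy = [5, 4, 5, 4, 4, 5, 3, 5, 4, 5, 4, 4]
m, low, high = mean_ci_95(accuracy)
print(f"Accuracy: {m:.2f} (95% CI {low:.2f}-{high:.2f})")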
CONCLUSIONS
The evaluated NLPs provided relevant responses of reasonable accuracy to questions about inguinal hernia. Compiled NLP responses received relatively low readability and DISCERN scores, although results may improve as NLPs evolve or with adjustments in question wording. As surgical patients expand their use of NLPs for healthcare information, surgeons should be aware of the benefits and limitations of NLPs as patient education tools.
Identifiers
pubmed: 39443381
doi: 10.1007/s00464-024-11221-y
pii: 10.1007/s00464-024-11221-y
Publication types
Journal Article
Languages
eng
Citation subsets
IM
Copyright information
© 2024. The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature.
References
Finney Rutten LJ, Blake KD, Greenberg-Worisek AJ, Allen SV, Moser RP, Hesse BW (2019) Online health information seeking among US adults: measuring progress toward a healthy people 2020 objective. Public Health Rep 134(6):617–625. https://doi.org/10.1177/0033354919874074
doi: 10.1177/0033354919874074
pubmed: 31513756
pmcid: 6832079
Amante DJ, Hogan TP, Pagoto SL, English TM, Lapane KL (2015) Access to care and use of the Internet to search for health information: results from the US National Health Interview Survey. J Med Internet Res 17(4):e106. https://doi.org/10.2196/jmir.4126
doi: 10.2196/jmir.4126
pubmed: 25925943
pmcid: 4430679
Fox S (2011) Health topics: 80% of internet users look for health information online. Washington, DC: Pew Internet & American Life Project, 1 Feb 2011. http://www.pewinternet.org/files/old-media/Files/Reports/2011/PIP_Health_Topics.pdf . Accessed 1 April 2024
Drees J (2019) Google receives more than 1 billion health questions every day, 11 March 2019. https://www.beckershospitalreview.com/healthcare-information-technology/google-receives-more-than-1-billion-health-questions-every-day.html . Accessed 4 April 2024
Suarez-Lledo V, Alvarez-Galvez J (2021) Prevalence of health misinformation on social media: systematic review. J Med Internet Res 23(1):e17187. https://doi.org/10.2196/17187
doi: 10.2196/17187
pubmed: 33470931
pmcid: 7857950
do Nascimento IJB, Pizarro AB, Almeida JM, Azzopardi-Muscat N, Gonçalves MA, Björklund M, Novillo-Ortiz D (2022) Infodemics and health misinformation: a systematic review of reviews. Bull World Health Organ 100(9):544–561. https://doi.org/10.2471/BLT.21.287654
doi: 10.2471/BLT.21.287654
Khullar D (2022) Social media and medical misinformation: confronting new variants of an old problem. JAMA 328(14):1393–1394. https://doi.org/10.1001/jama.2022.17191
doi: 10.1001/jama.2022.17191
pubmed: 36149664
Walker HL, Ghani S, Kuemmerli C, Nebiker CA, Müller BP, Raptis DA, Staubli SM (2023) Reliability of medical information provided by ChatGPT: assessment against clinical guidelines and patient information quality instrument. J Med Internet Res 25:e47479. https://doi.org/10.2196/47479
doi: 10.2196/47479
Wu T, He S, Liu J, Sun S, Liu K, Han Q, Tang Y (2023) A brief overview of ChatGPT: the history, status quo and potential future development. IEEE/CAA J Autom Sinica 10:1122–1136. https://doi.org/10.1109/JAS.2023.123618
doi: 10.1109/JAS.2023.123618
Alkaissi H, McFarlane SI (2023) Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 15(2):e35179. https://doi.org/10.7759/cureus.35179
doi: 10.7759/cureus.35179
pubmed: 36811129
pmcid: 9939079
Kurzer M, Kark A, Hussain T (2007) Inguinal hernia repair. J Perioper Pract. 17(7):318–321. https://doi.org/10.1177/175045890701700704
doi: 10.1177/175045890701700704
pubmed: 17702204
Pitt SC, Schwartz TA, Chu D (2021) AAPOR reporting guidelines for survey studies. JAMA Surg 156(8):785–786. https://doi.org/10.1001/jamasurg.2021.0543
doi: 10.1001/jamasurg.2021.0543
pubmed: 33825811
SAGES (2021) Inguinal hernia repair surgery patient information from SAGES, 19 April. https://www.sages.org/publications/patient-information/inguinal-hernia-repair-surgery-patient-information-from-sages/ . Accessed 2 Dec 2023.
Feliciano D, Hawn M, Heneghan K, Strand N (2022) Inguinal and femoral groin hernia repair. https://www.facs.org/media/0aihsqg0/groin_hernia.pdf . Accessed 12 Dec 2023.
Weiss BD (2003) Health literacy. American Medical Association, p 253
Charnock D, Shepperd S (2004) Learning to DISCERN online: applying an appraisal tool to health websites in a workshop setting. Health Educ Res 19:440–446
doi: 10.1093/her/cyg046
Emile SH, Horesh N, Freund M, Pellino G, Oliveira L, Wignakumar A, Wexner SD (2023) How appropriate are answers of online chat-based artificial intelligence (ChatGPT) to common questions on colon cancer? Surgery 174(5):1273–1275. https://doi.org/10.1016/j.surg.2023.06.005
doi: 10.1016/j.surg.2023.06.005
Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM (2023) Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am 105(19):1519–1526. https://doi.org/10.2106/JBJS.23.00209
doi: 10.2106/JBJS.23.00209
pubmed: 37459402
Johns WL, Kellish A, Farronato D, Ciccotti MG, Hammoud S (2024) ChatGPT can offer satisfactory responses to common patient questions regarding elbow ulnar collateral ligament reconstruction. Arthrosc Sports Med Rehabil 6(2):100893. https://doi.org/10.1016/j.asmr.2024.100893
doi: 10.1016/j.asmr.2024.100893
pubmed: 38375341
pmcid: 10875189
Vargas CR, Chuang DJ, Lee BT (2014) Online patient resources for hernia repair: analysis of readability. J Surg Res 190(1):144–150. https://doi.org/10.1016/j.jss.2014.03.045
doi: 10.1016/j.jss.2014.03.045
pubmed: 24746256
Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, Faix DJ, Goodman AM, Longhurst CA, Hogarth M, Smith DM (2023) Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med 183(6):589–596. https://doi.org/10.1001/jamainternmed.2023.1838
doi: 10.1001/jamainternmed.2023.1838
pubmed: 37115527
pmcid: 10148230
Chen S, Kann BH, Foote MB et al (2023) Use of artificial intelligence chatbots for cancer treatment information. JAMA Oncol 9(10):1459–1462. https://doi.org/10.1001/jamaoncol.2023.2954
doi: 10.1001/jamaoncol.2023.2954
pubmed: 37615976
pmcid: 10450584
Khullar D, Casalino LP, Qian Y, Lu Y, Krumholz HM, Aneja S (2022) Perspectives of patients about artificial intelligence in health care. JAMA Netw Open 5(5):e2210309. https://doi.org/10.1001/jamanetworkopen.2022.10309
doi: 10.1001/jamanetworkopen.2022.10309
Sahni NR, Carrus B (2023) Artificial intelligence in U.S. health care delivery. N Engl J Med 389(4):348–358. https://doi.org/10.1056/NEJMra2204673
doi: 10.1056/NEJMra2204673
Elias J (2023) Google launches its largest and 'most capable' AI model, Gemini, 6 Dec. https://www.cnbc.com/2023/12/06/google-launches-its-largest-and-most-capable-ai-model-gemini.html . Accessed 1 April 2024.
Anthropic Announcements (2024) Introducing the next generation of Claude, 3 Mar. https://www.anthropic.com/news/claude-3-family . Accessed 1 April 2024.