Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

Artificial Intelligence; Interventional Radiology; Large Language Model; Patient Friendliness; Structured Reporting

Journal

Academic Radiology
ISSN: 1878-4046
Abbreviated title: Acad Radiol
Country: United States
NLM ID: 9440159

Publication information

Publication date:
30 Sep 2024
History:
received: 27 Aug 2024
revised: 15 Sep 2024
accepted: 17 Sep 2024
medline: 3 Oct 2024
pubmed: 3 Oct 2024
entrez: 1 Oct 2024
Status: ahead of print

Abstract

To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology reports.

Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.

Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus on any metric evaluated (all Bonferroni-corrected p-values: p = 1), while both outperformed the other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus on all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).

GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
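The FRE and FKGL scores reported in the abstract follow fixed formulas over sentence, word, and syllable counts. A minimal sketch of how such scores are computed; the syllable counter here is a crude vowel-group heuristic for illustration only, not the validated tooling a study like this would use:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; drop a trailing silent "e".
    # Real readability tools use pronunciation dictionaries instead.
    n = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher FRE means easier text (the reported median of 69.01 for GPT-4 corresponds roughly to "plain English"), while FKGL maps the same counts onto a US school-grade level.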
With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
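The Bonferroni-corrected p-values cited in the abstract come from a simple family-wise adjustment: multiply each raw p-value by the number of comparisons and cap the result at 1 (which is why many of the reported corrected values equal exactly 1). A minimal sketch; the function name and alpha threshold are illustrative, not taken from the paper:

```python
def bonferroni(p_values: list[float], alpha: float = 0.05):
    """Bonferroni adjustment: scale each p-value by the number of
    comparisons, cap at 1.0, and flag which remain significant."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected
```

The adjustment is conservative: with many pairwise model comparisons, only strong raw effects (such as the p < 0.001 differences reported above) survive the correction.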

Identifiers

pubmed: 39353826
pii: S1076-6332(24)00690-1
doi: 10.1016/j.acra.2024.09.041

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

Copyright © 2024 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Elif Can (E)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.). Electronic address: elif.can@uniklinik-freiburg.de.

Wibke Uller (W)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Katharina Vogt (K)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Michael C Doppler (MC)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Felix Busch (F)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Nadine Bayerl (N)

Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.).

Stephan Ellmann (S)

Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.).

Avan Kader (A)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Aboelyazid Elkilany (A)

Department of Diagnostic and Interventional Radiology, University Hospital Leipzig, Leipzig, Saxony, Germany (A.E.).

Marcus R Makowski (MR)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Keno K Bressem (KK)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Lisa C Adams (LC)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

MeSH classifications