Large Language Models for Simplified Interventional Radiology Reports: A Comparative Analysis.

Artificial Intelligence; Interventional Radiology; Large Language Model; Patient Friendliness; Structured Reporting

Journal

Academic Radiology
ISSN: 1878-4046
Abbreviated title: Acad Radiol
Country: United States
NLM ID: 9440159

Publication information

Publication date:
30 Sep 2024
History:
received: 27 Aug 2024
revised: 15 Sep 2024
accepted: 17 Sep 2024
medline: 3 Oct 2024
pubmed: 3 Oct 2024
entrez: 1 Oct 2024
Status: ahead of print

Abstract

To quantitatively and qualitatively evaluate and compare the performance of leading large language models (LLMs), including proprietary models (GPT-4, GPT-3.5-Turbo, Claude-3-Opus, and Gemini Ultra) and open-source models (Mistral-7B and Mistral-8×7B), in simplifying 109 interventional radiology reports.

Qualitative performance was assessed using a five-point Likert scale for accuracy, completeness, clarity, clinical relevance, naturalness, and error rates, including trust-breaking and post-therapy misconduct errors. Quantitative readability was assessed using Flesch Reading Ease (FRE), Flesch-Kincaid Grade Level (FKGL), SMOG Index, and Dale-Chall Readability Score (DCRS). Paired t-tests and Bonferroni-corrected p-values were used for statistical analysis.

Qualitative evaluation showed no significant differences between GPT-4 and Claude-3-Opus on any metric evaluated (all Bonferroni-corrected p-values: p = 1), while both outperformed the other assessed models across five qualitative metrics (p < 0.001). GPT-4 had the fewest content and trust-breaking errors, with Claude-3-Opus second. However, all models exhibited some level of trust-breaking and post-therapy misconduct errors, with GPT-4-Turbo and GPT-3.5-Turbo with few-shot prompting showing the lowest error rates, and Mistral-7B and Mistral-8×7B the highest. Quantitatively, GPT-4 surpassed Claude-3-Opus on all readability metrics (all p < 0.001), with a median FRE score of 69.01 (IQR: 64.88-73.14) versus 59.74 (IQR: 55.47-64.01) for Claude-3-Opus. GPT-4 also outperformed GPT-3.5-Turbo and Gemini Ultra (both p < 0.001). Inter-rater reliability was strong (κ = 0.77-0.84).

GPT-4 and Claude-3-Opus demonstrated superior performance in generating simplified IR reports, but the presence of errors across all models, including trust-breaking errors, highlights the need for further refinement and validation before clinical implementation.
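The FRE and FKGL scores reported in the abstract follow fixed formulas over sentence, word, and syllable counts. A minimal sketch of how such scores are computed; the syllable counter here is a crude vowel-group heuristic for illustration only, not the validated tooling a study like this would use:

```python
import re

def count_syllables(word: str) -> int:
    # Crude heuristic: count vowel groups; drop a trailing silent "e".
    # Real readability tools use pronunciation dictionaries instead.
    n = len(re.findall(r"[aeiouy]+", word.lower()))
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (Flesch Reading Ease, Flesch-Kincaid Grade Level)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # words per sentence
    spw = syllables / len(words)        # syllables per word
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return fre, fkgl
```

Higher FRE means easier text (the reported median of 69.01 for GPT-4 corresponds roughly to "plain English"), while FKGL maps the same counts onto a US school-grade level.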
With the increasing complexity of interventional radiology (IR) procedures and the growing availability of electronic health records, simplifying IR reports is critical to improving patient understanding and clinical decision-making. This study provides insights into the performance of various LLMs in rewriting IR reports, which can help in selecting the most suitable model for clinical patient-centered applications.
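The Bonferroni-corrected p-values cited in the abstract come from a simple family-wise adjustment: multiply each raw p-value by the number of comparisons and cap the result at 1 (which is why many of the reported corrected values equal exactly 1). A minimal sketch; the function name and alpha threshold are illustrative, not taken from the paper:

```python
def bonferroni(p_values: list[float], alpha: float = 0.05):
    """Bonferroni adjustment: scale each p-value by the number of
    comparisons, cap at 1.0, and flag which remain significant."""
    m = len(p_values)
    adjusted = [min(p * m, 1.0) for p in p_values]
    rejected = [p_adj < alpha for p_adj in adjusted]
    return adjusted, rejected
```

The adjustment is conservative: with many pairwise model comparisons, only strong raw effects (such as the p < 0.001 differences reported above) survive the correction.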

Identifiers

pubmed: 39353826
pii: S1076-6332(24)00690-1
doi: 10.1016/j.acra.2024.09.041

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

Copyright © 2024 The Association of University Radiologists. Published by Elsevier Inc. All rights reserved.

Conflict of interest statement

Declaration of Competing Interest: The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors

Elif Can (E)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.). Electronic address: elif.can@uniklinik-freiburg.de.

Wibke Uller (W)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Katharina Vogt (K)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Michael C Doppler (MC)

Department of Interventional Radiology, Medical Center - University of Freiburg, Faculty of Medicine, University of Freiburg, Germany (E.C., W.U., K.V., M.C.D.).

Felix Busch (F)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Nadine Bayerl (N)

Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.).

Stephan Ellmann (S)

Institute of Radiology, Friedrich-Alexander-Universität Erlangen-Nürnberg, University Hospital Erlangen, Erlangen, Germany (N.B., S.E.).

Avan Kader (A)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Aboelyazid Elkilany (A)

Department of Diagnostic and Interventional Radiology, University Hospital Leipzig, Leipzig, Saxony, Germany (A.E.).

Marcus R Makowski (MR)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Keno K Bressem (KK)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

Lisa C Adams (LC)

Department of Radiology, Klinikum rechts der Isar, Technical University of Munich (TUM), Munich, Germany (F.B., A.K., M.R.M., K.K.B., L.C.A.).

MeSH classifications