How do large language models answer breast cancer quiz questions? A comparative study of GPT-3.5, GPT-4 and Google Gemini.

Keywords: Breast cancer; ChatGPT; Google Gemini; Large language models

Journal

La Radiologia medica

ISSN: 1826-6983

Abbreviated title: Radiol Med

Country: Italy

NLM ID: 0177625

Publication information

Publication date:
13 Aug 2024

History:

received: 16 04 2024

accepted: 01 08 2024

medline: 14 8 2024

pubmed: 14 8 2024

entrez: 14 8 2024

Status: ahead of print

Abstract

Applications of large language models (LLMs) in the healthcare field have shown promising results in processing and summarizing multidisciplinary information. This study evaluated the ability of three publicly available LLMs (GPT-3.5, GPT-4, and Google Gemini, then called Bard) to answer 60 multiple-choice questions (29 sourced from public databases, 31 newly formulated by experienced breast radiologists) about different aspects of breast cancer care: treatment and prognosis, diagnostic and interventional techniques, imaging interpretation, and pathology. Overall, the rate of correct answers differed significantly among LLMs (p = 0.010): the best performance was achieved by GPT-4 (95%, 57/60), followed by GPT-3.5 (90%, 54/60) and Google Gemini (80%, 48/60). For each LLM, the rate of correct replies did not differ significantly between questions sourced from public databases and newly formulated ones (p ≥ 0.593). These results highlight the potential benefits of LLMs in breast cancer care, although their performance will need to be further refined through in-context training.
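As a quick arithmetic check on the figures above, the following minimal sketch recomputes each model's accuracy from the reported counts and attaches a 95% Wilson score interval to each. The interval is an illustrative addition and is not taken from the study. The function names `accuracy_percent` and `wilson_interval` are hypothetical. The reported p-values cannot be recomputed from these aggregate counts alone, because the same 60 questions were posed to every model and per-question results are not given here.

```python
import math

# Correct-answer counts out of 60 questions, as reported in the abstract.
TOTAL_QUESTIONS = 60
CORRECT = {"GPT-4": 57, "GPT-3.5": 54, "Google Gemini": 48}


def accuracy_percent(successes: int, n: int) -> float:
    """Share of correct answers, as a percentage."""
    return 100 * successes / n


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion.

    Assumption: z = 1.96 (about 95% coverage). This interval is an
    illustrative choice, not the method used by the study's authors.
    """
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


for model, k in CORRECT.items():
    low, high = wilson_interval(k, TOTAL_QUESTIONS)
    print(
        f"{model}: {accuracy_percent(k, TOTAL_QUESTIONS):.0f}% "
        f"({k}/{TOTAL_QUESTIONS}), 95% Wilson CI {100 * low:.1f}-{100 * high:.1f}%"
    )
```

With only 60 questions per model, intervals of this kind are wide, which is worth keeping in mind when reading the ranking of the three models.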

Identifiers

DOI: 10.1007/s11547-024-01872-1 PMID: 39138732

pubmed: 39138732

doi: 10.1007/s11547-024-01872-1

pii: 10.1007/s11547-024-01872-1

Publication types

Journal Article

Languages

eng

Authors

Giovanni Irmici

Andrea Cozzi

Gianmarco Della Pepa

Claudia De Berardinis

Elisa D'Ascoli

Michaela Cellina

Maurizio Cè

Catherine Depretto

Gianfranco Scaperrotta

MeSH classifications