Evaluation of large language models as a diagnostic aid for complex medical cases.

Keywords: ChatGPT; clinical case solving; complex clinical cases; diagnosis; large language model (LLM)

Journal

Frontiers in medicine
ISSN: 2296-858X
Abbreviated title: Front Med (Lausanne)
Country: Switzerland
NLM ID: 101648047

Publication information

Publication date:
2024
History:
received: 1 February 2024
accepted: 10 June 2024
medline: 5 July 2024
pubmed: 5 July 2024
entrez: 5 July 2024
Status: epublish

Abstract

Background: The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals.
Objective: To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case.
Design: Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and GPT4 models.
Results: The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30, and by GPT4 15.45 (…).
Conclusions and relevance: The GPT4 model generated a differential diagnosis list containing the correct diagnosis in approximately two thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.
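The two outcomes named in the abstract (whether the correct diagnosis appears anywhere in the generated differential, and whether it is ranked first) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `CaseResult` structure, the `hit_rate` helper, the toy cases, and the use of naive string matching are hypothetical and are not taken from the study, which does not describe its scoring procedure in this record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CaseResult:
    """One evaluated case (hypothetical structure, not from the study)."""
    final_diagnosis: str
    differential: list  # model output, most likely diagnosis first


def _norm(text: str) -> str:
    # Naive normalization; real diagnosis matching needs clinical judgment.
    return " ".join(text.lower().split())


def hit_rate(cases: list, k: Optional[int] = None) -> float:
    """Fraction of cases whose final diagnosis is in the top-k differential.

    With k=None the whole differential list is searched.
    """
    if not cases:
        raise ValueError("no cases to score")
    hits = sum(
        _norm(c.final_diagnosis) in [_norm(d) for d in c.differential[:k]]
        for c in cases
    )
    return hits / len(cases)


# Toy data, invented purely to exercise the function.
cases = [
    CaseResult("Sarcoidosis", ["Tuberculosis", "Sarcoidosis", "Lymphoma"]),
    CaseResult("Lyme disease", ["Lyme disease", "Viral meningitis"]),
    CaseResult("Whipple disease", ["Celiac disease", "Crohn disease"]),
]

any_rank = hit_rate(cases)        # correct diagnosis anywhere in the list
top_one = hit_rate(cases, k=1)    # correct diagnosis ranked first
```

Exact string comparison is used only to keep the sketch short; in practice, deciding whether a listed diagnosis matches the final one would likely require review by a clinician.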


Identifiers

pubmed: 38966538
doi: 10.3389/fmed.2024.1380148
pmc: PMC11222590

Publication types

Journal Article

Languages

eng

Pagination

1380148

Copyright information

Copyright © 2024 Ríos-Hoyo, Shan, Li, Pearson, Pusztai and Howard.

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors

Alejandro Ríos-Hoyo (A)

Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States.

Naing Lin Shan (NL)

Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States.

Anran Li (A)

Department of Medicine, University of Chicago, Chicago, IL, United States.

Alexander T Pearson (AT)

Department of Medicine, University of Chicago, Chicago, IL, United States.

Lajos Pusztai (L)

Yale Cancer Center, Yale School of Medicine, New Haven, CT, United States.

Frederick M Howard (FM)

Department of Medicine, University of Chicago, Chicago, IL, United States.

MeSH classifications