Evaluation of large language models as a diagnostic aid for complex medical cases.
ChatGPT
clinical case solving
complex clinical cases
diagnosis
large language model (LLM)
Journal
Frontiers in medicine
ISSN: 2296-858X
Abbreviated title: Front Med (Lausanne)
Country: Switzerland
NLM ID: 101648047
Publication information
Publication date: 2024
History:
received: 2024-02-01
accepted: 2024-06-10
medline: 2024-07-05
pubmed: 2024-07-05
entrez: 2024-07-05
Status:
epublish
Abstract
Abstract sections
Background
The use of large language models (LLMs) has recently gained popularity in diverse areas, including answering questions posed by patients as well as medical professionals.
Objective
To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case.
Design
Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and GPT4 models.
Results
The mean number of diagnoses provided was 16.77 for the Massachusetts General Hospital case discussants, 30 for GPT3.5, and 15.45 for GPT4 (…
Conclusions and relevance
The GPT4 model was able to generate a differential diagnosis list that included the correct diagnosis in approximately two thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between a disease's incidence and its availability in the literature.
Identifiers
pubmed: 38966538
doi: 10.3389/fmed.2024.1380148
pmc: PMC11222590
Publication types
Journal Article
Languages
eng
Pagination
1380148
Copyright information
Copyright © 2024 Ríos-Hoyo, Shan, Li, Pearson, Pusztai and Howard.
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.