Evaluation of large language models as a diagnostic aid for complex medical cases.

Keywords: ChatGPT; clinical case solving; complex clinical cases; diagnosis; large language model (LLM)

Journal

Frontiers in medicine

ISSN: 2296-858X

Abbreviated title: Front Med (Lausanne)

Country: Switzerland

NLM ID: 101648047

Publication information

Publication date:
2024

History:

received: 01 02 2024

accepted: 10 06 2024

medline: 5 7 2024

pubmed: 5 7 2024

entrez: 5 7 2024

Status: epublish

Abstract

The use of large language models (LLM) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals. This study aimed to evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case. Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and GPT4 models. The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30 and by GPT4 15.45 ( The GPT4 model was able to generate a differential diagnosis list with the correct diagnosis in approximately two-thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.

Abstract sections

Background

The use of large language models (LLM) has recently gained popularity in diverse areas, including answering questions posted by patients as well as medical professionals.

Objective

To evaluate the performance and limitations of LLMs in providing the correct diagnosis for a complex clinical case.

Design

Seventy-five consecutive clinical cases were selected from the Massachusetts General Hospital Case Records, and differential diagnoses were generated by OpenAI's GPT3.5 and GPT4 models.

Results

The mean number of diagnoses provided by the Massachusetts General Hospital case discussants was 16.77, by GPT3.5 30 and by GPT4 15.45 (

Conclusions and relevance

The GPT4 model was able to generate a differential diagnosis list with the correct diagnosis in approximately two-thirds of cases, but the most likely diagnosis was often incorrect for both models. In its current state, this tool can at most be used as an aid to expand on potential diagnostic considerations for a case, and future LLMs should be trained to account for the discrepancy between disease incidence and availability in the literature.

Identifiers

DOI: 10.3389/fmed.2024.1380148 PMID: 38966538 PMC: PMC11222590

Publication types

Journal Article

Languages

eng

Pagination

1380148

Copyright information

Conflict of interest statement

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Authors

Alejandro Ríos-Hoyo (A)

Naing Lin Shan (NL)

Anran Li (A)

Alexander T Pearson (AT)

Lajos Pusztai (L)

Frederick M Howard (FM)

MeSH classifications