Optimizing GPT-4 Turbo Diagnostic Accuracy in Neuroradiology through Prompt Engineering and Confidence Thresholds.
AI diagnostic tools
GPT-4 Turbo
artificial intelligence in medicine
clinical decision support
confidence thresholds
diagnostic imaging
large language model (LLM)
misdiagnosis reduction
neuroradiology
prompt engineering
Journal
Diagnostics (Basel, Switzerland)
ISSN: 2075-4418
Abbreviated title: Diagnostics (Basel)
Country: Switzerland
NLM ID: 101658402
Publication information
Publication date:
17 Jul 2024
History:
received: 29 May 2024
revised: 02 Jul 2024
accepted: 10 Jul 2024
medline: 27 Jul 2024
pubmed: 27 Jul 2024
entrez: 27 Jul 2024
Status:
epublish
Abstract
BACKGROUND AND OBJECTIVES
Integrating large language models (LLMs) such as GPT-4 Turbo into diagnostic imaging faces a significant challenge: current misdiagnosis rates range from 30% to 50%. This study evaluates how prompt engineering and confidence thresholds can improve diagnostic accuracy in neuroradiology.
METHODS
We analyzed 751 neuroradiology cases from the American Journal of Neuroradiology using GPT-4 Turbo with customized prompts designed to improve diagnostic precision.
RESULTS
Initially, GPT-4 Turbo achieved a baseline diagnostic accuracy of 55.1%. After responses were reformatted to list five diagnostic candidates and a 90% confidence threshold was applied, the precision of the top-ranked diagnosis increased to 72.9%, and the candidate list contained the correct diagnosis in 85.9% of cases, reducing the misdiagnosis rate to 14.1%. However, this threshold reduced the number of cases for which a response was given.
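The metrics reported above can be illustrated with a minimal sketch. It is not the authors' code: the `CaseResult` record, the `evaluate` function, the single per-case confidence value, and exact string matching against the reference diagnosis are all assumptions made for illustration, and the toy cases are invented. The sketch shows how a confidence threshold trades coverage (the share of cases answered) against top-ranked precision and candidate-list hit rate, and why the misdiagnosis rate is the complement of the hit rate (100% - 85.9% = 14.1%).

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class CaseResult:
    """Hypothetical record for one case (field names are assumptions)."""
    truth: str              # reference diagnosis
    candidates: List[str]   # ranked candidate diagnoses, best first
    confidence: float       # self-reported confidence, 0 to 100


def evaluate(results: List[CaseResult],
             threshold: float = 90.0) -> Dict[str, Optional[float]]:
    """Score only the cases whose confidence meets the threshold."""
    answered = [r for r in results if r.confidence >= threshold]
    n = len(answered)
    if n == 0:
        # Nothing passed the threshold, so precision is undefined.
        return {"coverage": 0.0, "top1": None, "top5": None, "miss": None}
    # Top-ranked precision: the first candidate equals the reference.
    top1 = sum(r.candidates[:1] == [r.truth] for r in answered) / n
    # Hit rate: the reference appears among the first five candidates.
    top5 = sum(r.truth in r.candidates[:5] for r in answered) / n
    return {
        "coverage": n / len(results),  # share of cases answered
        "top1": top1,
        "top5": top5,
        "miss": 1.0 - top5,            # misdiagnosis rate
    }


# Invented toy data, for illustration only.
toy = [
    CaseResult("glioblastoma", ["glioblastoma", "metastasis", "lymphoma"], 95),
    CaseResult("meningioma", ["schwannoma", "meningioma", "metastasis"], 92),
    CaseResult("CADASIL", ["multiple sclerosis", "vasculitis"], 91),
    CaseResult("abscess", ["abscess", "glioblastoma"], 60),
]
metrics = evaluate(toy, threshold=90.0)
print(metrics)
```

Raising `threshold` in this sketch can only shrink the set of answered cases, which mirrors the trade-off noted in the results: higher precision among answered cases at the cost of fewer responses.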
CONCLUSIONS
Strategic prompt engineering and high confidence thresholds significantly reduce misdiagnoses and improve the precision of LLM-based diagnosis in neuroradiology. More research is needed to optimize these approaches for broader clinical implementation, balancing accuracy and utility.
Identifiers
pubmed: 39061677
pii: diagnostics14141541
doi: 10.3390/diagnostics14141541
Publication types
Journal Article
Languages
eng
Grants
Agency: Japan Society for the Promotion of Science
ID: 22K07674