Performance of trauma-trained large language models on surgical assessment questions: A new approach in resource identification.


Journal

Surgery
ISSN: 1532-7361
Abbreviated title: Surgery
Country: United States
NLM ID: 0417347

Publication information

Publication date:
23 Sep 2024
History:
received: 13 05 2024
revised: 18 07 2024
accepted: 16 08 2024
medline: 25 9 2024
pubmed: 25 9 2024
entrez: 24 9 2024
Status: aheadofprint

Abstract

Large language models have successfully navigated simulated medical board examination questions. However, whether and how language models can be used in surgical education is less understood. Our study evaluates the efficacy of domain-specific large language models in curating study materials for surgical board-style questions. We developed EAST-GPT and ACS-GPT, custom large language models with domain-specific knowledge from published guidelines from the Eastern Association for the Surgery of Trauma and the American College of Surgeons Trauma Quality Programs. The performance of EAST-GPT, ACS-GPT, and an untrained GPT-4 was assessed on trauma-related questions from the Surgical Education and Self-Assessment Program (18th edition). Large language models were asked to choose answers and provide answer rationales. Rationales were assessed against an educational framework with 5 domains: accuracy, relevance, comprehensiveness, evidence base, and clarity. EAST-GPT was trained on 90 guidelines and ACS-GPT on 10. All large language models were tested on 62 trauma questions. EAST-GPT correctly answered 76%, whereas ACS-GPT answered 68% correctly. Both models outperformed ChatGPT-4 (P < .05), which answered 45% correctly. For reasoning, EAST-GPT achieved the highest mean scores across all 5 educational framework metrics. ACS-GPT scored lower than ChatGPT-4 in comprehensiveness and evidence base; however, these differences were not statistically significant. Our study presents a novel methodology for identifying test-preparation resources by training a large language model to answer board-style multiple-choice questions. Both trained models outperformed ChatGPT-4, demonstrating that their answers were accurate, relevant, and evidence-based. Potential implications of such AI integration into surgical education must be explored.

Abstract sections

BACKGROUND
Large language models have successfully navigated simulated medical board examination questions. However, whether and how language models can be used in surgical education is less understood. Our study evaluates the efficacy of domain-specific large language models in curating study materials for surgical board-style questions.
METHODS
We developed EAST-GPT and ACS-GPT, custom large language models with domain-specific knowledge from published guidelines from the Eastern Association for the Surgery of Trauma and the American College of Surgeons Trauma Quality Programs. The performance of EAST-GPT, ACS-GPT, and an untrained GPT-4 was assessed on trauma-related questions from the Surgical Education and Self-Assessment Program (18th edition). Large language models were asked to choose answers and provide answer rationales. Rationales were assessed against an educational framework with 5 domains: accuracy, relevance, comprehensiveness, evidence base, and clarity.
RESULTS
EAST-GPT was trained on 90 guidelines and ACS-GPT on 10. All large language models were tested on 62 trauma questions. EAST-GPT correctly answered 76%, whereas ACS-GPT answered 68% correctly. Both models outperformed ChatGPT-4 (P < .05), which answered 45% correctly. For reasoning, EAST-GPT achieved the highest mean scores across all 5 educational framework metrics. ACS-GPT scored lower than ChatGPT-4 in comprehensiveness and evidence base; however, these differences were not statistically significant.
CONCLUSIONS
Our study presents a novel methodology for identifying test-preparation resources by training a large language model to answer board-style multiple-choice questions. Both trained models outperformed ChatGPT-4, demonstrating that their answers were accurate, relevant, and evidence-based. Potential implications of such AI integration into surgical education must be explored.
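
As a rough illustration of how the reported accuracies relate to the 62-question test set, the sketch below converts each rounded percentage back to a whole number of correct answers and compares each custom model with ChatGPT-4. Everything beyond the figures 62, 76%, 68%, and 45% is an assumption: the abstract does not report raw counts or name the statistical test used, the function names are hypothetical, and a pooled two-proportion z-test is swapped in purely for simplicity. Because all models answered the same questions, a paired test such as McNemar's may be more appropriate, but it would require per-question results that the abstract does not provide.

```python
import math

N_QUESTIONS = 62  # number of trauma questions, as stated in the abstract


def approx_correct(pct, n=N_QUESTIONS):
    """Nearest whole number of correct answers implied by a rounded percentage.

    Assumption: the abstract gives only percentages, so these counts are inferred.
    """
    return round(pct / 100 * n)


def two_proportion_z(k1, k2, n):
    """Two-sided pooled z-test for two proportions measured on n items each.

    Assumption: treats the two samples as independent, which ignores the fact
    that every model answered the same questions.
    Returns (z statistic, two-sided p-value).
    """
    p1, p2 = k1 / n, k2 / n
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (p1 - p2) / se
    # Two-sided tail probability of the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


if __name__ == "__main__":
    east = approx_correct(76)   # EAST-GPT
    acs = approx_correct(68)    # ACS-GPT
    gpt4 = approx_correct(45)   # ChatGPT-4 baseline
    for name, k in (("EAST-GPT", east), ("ACS-GPT", acs)):
        z, p = two_proportion_z(k, gpt4, N_QUESTIONS)
        print(f"{name}: {k}/{N_QUESTIONS} vs {gpt4}/{N_QUESTIONS}, z={z:.2f}, p={p:.4f}")
```

The inferred counts are only the nearest integers consistent with the rounded percentages, so any p-values from this sketch should be read as approximate and not as a reproduction of the study's analysis.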

Identifiers

pubmed: 39317517
pii: S0039-6060(24)00640-8
doi: 10.1016/j.surg.2024.08.026

Publication types

Journal Article

Languages

eng

Citation subsets

IM

Copyright information

Copyright © 2024. Published by Elsevier Inc.

Conflict of interest statement

Conflict of Interest/Disclosure: The authors have no related conflicts of interest to declare.

Authors

Arnav Mahajan (A)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH. Electronic address: https://twitter.com/arnavmahajan_.

Andrew Tran (A)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH.

Esther S Tseng (ES)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH.

John J Como (JJ)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH.

Kevin M El-Hayek (KM)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH.

Prerna Ladha (P)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH.

Vanessa P Ho (VP)

Department of Surgery, The MetroHealth System, Case Western Reserve University School of Medicine, Cleveland, OH; Department of Population and Quantitative Health Sciences, Case Western Reserve University School of Medicine, Cleveland, OH. Electronic address: vho@metrohealth.org.

MeSH classifications