How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment.

Keywords: ChatGPT; GPT; MedQA; NLP; artificial intelligence; chatbot; conversational agent; education technology; generative pre-trained transformer; machine learning; medical education; natural language processing

Journal

JMIR Medical Education
ISSN: 2369-3762
Abbreviated title: JMIR Med Educ
Country: Canada
NLM ID: 101684518

Publication information

Publication date: 08 Feb 2023
History:
Received: 23 Dec 2022
Revised: 27 Jan 2023
Accepted: 29 Jan 2023
Entrez: 8 Feb 2023
PubMed: 9 Feb 2023
MEDLINE: 9 Feb 2023
Status: epublish

Abstract

BACKGROUND
Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.
OBJECTIVE
This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination Step 1 and Step 2 exams, as well as to analyze responses for user interpretability.
METHODS
We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each containing questions pertaining to both Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and on performance relative to its user base. The second set comprised the National Board of Medical Examiners (NBME) Free 120 questions. ChatGPT's performance was compared to that of 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.
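
To make the scoring step concrete, the following is a minimal sketch of how answer selections could be tallied into per-dataset accuracies once each model response has been reduced to a single answer letter. The dataset names, field layout, and example records are illustrative assumptions, not the authors' actual evaluation pipeline.

    # Minimal sketch of the accuracy scoring step (illustrative, not the study's code).
    # Assumes each model response has already been reduced to one answer letter.
    from collections import defaultdict

    # Each record: (dataset, question_id, model_choice, correct_choice) -- placeholder data
    responses = [
        ("NBME-Free-Step1", "q001", "C", "C"),
        ("NBME-Free-Step1", "q002", "A", "D"),
        ("AMBOSS-Step2", "q101", "B", "B"),
    ]

    def accuracy_by_dataset(records):
        """Return {dataset: (n_correct, n_total, accuracy)} over scored questions."""
        tallies = defaultdict(lambda: [0, 0])  # dataset -> [n_correct, n_total]
        for dataset, _qid, chosen, key in records:
            tallies[dataset][1] += 1
            if chosen == key:
                tallies[dataset][0] += 1
        return {ds: (c, n, c / n) for ds, (c, n) in tallies.items()}

    for ds, (c, n, acc) in accuracy_by_dataset(responses).items():
        print(f"{ds}: {c}/{n} = {acc:.1%}")
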
RESULTS
Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs for the NBME data sets. Information internal to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.
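
The abstract does not specify which statistical test produced these P values; as one conventional way to compare such proportions, the sketch below runs a chi-square test of independence on a 2x2 table of correct versus incorrect responses by presence of external information. The counts are placeholders, not the study's actual tallies.

    # Illustrative two-proportion comparison (presence of external information in
    # correct vs incorrect responses) via a chi-square test of independence.
    # Placeholder counts; the abstract does not report the underlying tallies
    # or the specific test the authors used.
    from scipy.stats import chi2_contingency

    # Rows: correct, incorrect; columns: external info present, absent
    table = [
        [50, 6],    # correct responses (placeholder)
        [14, 17],   # incorrect responses (placeholder)
    ]

    chi2, p_value, dof, expected = chi2_contingency(table)
    print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
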
CONCLUSIONS
ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.

Identifiers

pubmed: 36753318
pii: v9i1e45312
doi: 10.2196/45312
pmc: PMC9947764

Publication types

Journal Article

Languages

eng

Pagination

e45312

Grants

Agency: NIDDK NIH HHS
ID: T35 DK104689
Country: United States
Agency: NCATS NIH HHS
ID: UL1 TR001863
Country: United States

Comments and corrections

Type: CommentIn

Copyright information

©Aidan Gilson, Conrad W Safranek, Thomas Huang, Vimig Socrates, Ling Chi, Richard Andrew Taylor, David Chartash. Originally published in JMIR Medical Education (https://mededu.jmir.org), 08.02.2023.

Authors

Aidan Gilson (A)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.
Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States.

Conrad W Safranek (CW)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.

Thomas Huang (T)

Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States.

Vimig Socrates (V)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.
Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States.

Ling Chi (L)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.

Richard Andrew Taylor (RA)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.
Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States.

David Chartash (D)

Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States.
School of Medicine, University College Dublin, National University of Ireland, Dublin, Dublin, Ireland.

MeSH classifications