Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input.

Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze its responses for user interpretability.

Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a question bank commonly used by medical students, which also provides statistics on question difficulty and on user performance relative to its user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared with that of 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question.

Results: Across the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, whereas GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. Logical justification for ChatGPT's answer selection was present in 100% of outputs for the NBME data sets. Information internal to the question was present in 96.8% (183/189) of outputs. The presence of information external to the question was 44.5% and 27% lower for incorrect answers than for correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively.

Conclusions: ChatGPT marks a significant improvement in natural language processing models on the task of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context for the majority of its answers. Taken together, these findings make a compelling case for the potential application of ChatGPT as an interactive medical education tool to support learning.
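The evaluation pipeline implied by the Methods (prompt the model with each question, record the selected answer, compute accuracy, and test accuracy against question difficulty) can be sketched as follows. This is a minimal illustration, not the authors' code: `ask_model` is a hypothetical stand-in for the ChatGPT interface, the answer-letter extraction is one plausible heuristic, and the chi-square test stands in for whichever test the authors used to obtain P=.01.

```python
# Minimal sketch of the evaluation loop described in the Methods section.
# `ask_model` is a hypothetical stand-in for querying ChatGPT; the letter
# extraction and statistics are illustrative, not the authors' exact pipeline.
import re
from collections import defaultdict
from scipy.stats import chi2_contingency

def ask_model(question_text: str) -> str:
    """Hypothetical call to the chat model; returns its free-text response."""
    raise NotImplementedError

def extract_choice(response: str) -> str | None:
    """Pull the first standalone answer letter (A-E) out of the response."""
    match = re.search(r"\b([A-E])\b", response)
    return match.group(1) if match else None

def evaluate(questions):
    """questions: iterable of dicts with 'stem', 'answer', and 'difficulty' keys."""
    counts = defaultdict(lambda: [0, 0])  # difficulty level -> [correct, incorrect]
    for q in questions:
        choice = extract_choice(ask_model(q["stem"]))
        counts[q["difficulty"]][0 if choice == q["answer"] else 1] += 1
    total = sum(c + i for c, i in counts.values())
    correct = sum(c for c, _ in counts.values())
    # Chi-square test of accuracy across difficulty levels (one plausible choice;
    # the abstract does not state which test produced the reported P value).
    table = [counts[d] for d in sorted(counts)]
    _, p_value, _, _ = chi2_contingency(table)
    return correct / total, p_value
```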
Objective: To derive 7 proposed core electronic health record (EHR) use metrics across 2 healthcare systems with different EHR vendor product installations and to examine factors associated with EHR time.

Materials and Methods: A cross-sectional analysis of ambulatory physicians' EHR use across the Yale-New Haven and MedStar Health systems was performed for August 2019 using 7 proposed core EHR use metrics normalized to 8 hours of patient scheduled time.

Results: Five of the 7 proposed metrics could be measured in a population of nonteaching, exclusively ambulatory physicians. Among the 573 physicians (Yale-New Haven N = 290, MedStar N = 283) in the analysis, median EHR-Time8 was 5.23 hours. On multivariable analysis, gender, additional clinical hours scheduled, and certain medical specialties were associated with EHR-Time8 after adjusting for age and health system. For every 8 hours of scheduled patient time, the model predicted the following differences in EHR time (P < .001 unless otherwise indicated): female physicians +0.58 hours; each additional clinical hour scheduled per month −0.01 hours; practicing cardiology −1.30 hours; medical subspecialties −0.89 hours (except gastroenterology, P = .002); neurology/psychiatry −2.60 hours; obstetrics/gynecology −1.88 hours; pediatrics −1.05 hours (P = .001); sports/physical medicine and rehabilitation −3.25 hours; and surgical specialties −3.65 hours.

Conclusions: For every 8 hours of scheduled patient time, ambulatory physicians spend more than 5 hours on the EHR. Physician gender, specialty, and number of clinical hours practicing are associated with differences in EHR time. While audit logs remain a powerful tool for understanding physician EHR use, additional transparency, granularity, and standardization of vendor-derived EHR use data definitions are necessary to standardize EHR use measurement.
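The EHR-Time8 metric is described as EHR time normalized to 8 hours of scheduled patient time. A minimal sketch of one plausible reading of that normalization, with an illustrative example close to the reported median, is shown below; the exact audit-log aggregation used by the authors is not specified in the abstract.

```python
def ehr_time8(ehr_hours: float, scheduled_patient_hours: float) -> float:
    """Normalize total EHR time to an 8-hour block of scheduled patient time.

    One plausible reading of the EHR-Time8 metric described in the abstract:
    total EHR time scaled to every 8 hours of scheduled patient care.
    """
    if scheduled_patient_hours <= 0:
        raise ValueError("scheduled_patient_hours must be positive")
    return ehr_hours / scheduled_patient_hours * 8

# Illustrative example: 65 EHR hours logged against 100 scheduled patient hours
# yields 5.2 hours of EHR time per 8 scheduled hours, close to the reported
# median of 5.23 hours.
print(ehr_time8(65, 100))  # 5.2
```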
We have developed Halyos (http://halyos.gehlenborglab.org), a visual EHR web application that complements the functionality of existing patient portals. Halyos is designed to integrate with existing EHR systems to help patients interpret their health data. The Halyos application utilizes the SMART on FHIR (Substitutable Medical Applications and Reusable Technologies on Fast Healthcare Interoperability Resources) platform to create an interoperable interface that provides interactive visualizations of clinically validated risk scores and longitudinal data derived from a patient's clinical measurements. These visualizations allow patients to investigate the relationships between clinical measurements and risk over time. By enabling patients to set hypothetical future values for these clinical measurements, patients can see how changes in their health will impact their risks. Using Halyos, patients are provided with the opportunity to actively improve their health based on increased understanding of longitudinal information available in EHRs and to begin a dialogue with their providers.
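To illustrate the kind of interoperability SMART on FHIR enables, the sketch below shows how an app can pull a patient's longitudinal measurements from an EHR through the standard FHIR REST API. This is not Halyos source code; the base URL, patient ID, and access token are placeholders, and a real SMART launch would obtain the token via the OAuth2 authorization flow.

```python
# Illustrative sketch (not Halyos source code) of retrieving longitudinal
# clinical measurements from an EHR via the standard FHIR REST API.
import requests

FHIR_BASE = "https://example-ehr.org/fhir"         # placeholder FHIR endpoint
ACCESS_TOKEN = "obtained-via-smart-oauth2-launch"  # placeholder token

def fetch_blood_pressure_history(patient_id: str) -> list[dict]:
    """Return Observation resources for blood pressure panels (LOINC 85354-9),
    sorted oldest to newest, suitable for a longitudinal visualization."""
    response = requests.get(
        f"{FHIR_BASE}/Observation",
        params={"patient": patient_id, "code": "85354-9", "_sort": "date"},
        headers={"Authorization": f"Bearer {ACCESS_TOKEN}",
                 "Accept": "application/fhir+json"},
        timeout=30,
    )
    response.raise_for_status()
    bundle = response.json()
    return [entry["resource"] for entry in bundle.get("entry", [])]
```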