2023
DOI: 10.1038/s41598-023-43436-9

Comparing ChatGPT and GPT-4 performance in USMLE soft skill assessments

Dana Brin,
Vera Sorin,
Akhil Vaid
et al.

Abstract: The United States Medical Licensing Examination (USMLE) has been a subject of performance study for artificial intelligence (AI) models. However, their performance on questions involving USMLE soft skills remains unexplored. This study aimed to evaluate ChatGPT and GPT-4 on USMLE questions involving communication skills, ethics, empathy, and professionalism. We used 80 USMLE-style questions involving soft skills, taken from the USMLE website and the AMBOSS question bank. A follow-up query was used to assess th…
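The abstract describes the evaluation protocol only at a high level, and the study's querying code is not reproduced here. As a rough illustration of how such a multiple-choice evaluation loop might be scripted, the sketch below assumes the OpenAI Python SDK's chat-completions endpoint; `QUESTIONS` and `ask_model` are hypothetical names, and the question shown is a placeholder, not an item from the study.

```python
# Minimal sketch of an MCQ evaluation loop (illustrative only; the study's
# actual querying code is not published). Assumes the OpenAI Python SDK
# with an OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

# Hypothetical question format; the study drew 80 USMLE-style items from
# the USMLE website and the AMBOSS question bank.
QUESTIONS = [
    {
        "stem": "A patient declines a recommended treatment after a clear "
                "explanation of the risks. Which response is most appropriate?",
        "choices": {"A": "...", "B": "...", "C": "...", "D": "..."},
        "answer": "C",
    },
]

def ask_model(model: str, q: dict) -> str:
    """Pose one multiple-choice question and return the model's letter pick."""
    options = "\n".join(f"{k}. {v}" for k, v in q["choices"].items())
    prompt = f"{q['stem']}\n{options}\nAnswer with a single letter."
    reply = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
    )
    # Take the first character of the reply as the chosen option letter.
    return reply.choices[0].message.content.strip()[0]

correct = sum(ask_model("gpt-4", q) == q["answer"] for q in QUESTIONS)
print(f"gpt-4: {correct}/{len(QUESTIONS)} correct")
```

A follow-up query (for example, asking the model to justify or revise its answer) could be issued in the same loop; the abstract is truncated before it specifies what the study's follow-up query assessed.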

Cited by 126 publications (39 citation statements)
References 14 publications
“…Previous evaluations excluded questions with images owing to the single-modality limitation of ChatGPT and GPT-4. 20,38–40 Our findings revealed that while medical students' performance decreased linearly as question difficulty increased, GPT-4V's performance stayed relatively stable. When hints were provided, GPT-4V's performance stayed almost the same across questions at all difficulty levels, as shown in Figure 2.…”
Section: Discussion
confidence: 94%
“…Previous evaluations excluded questions with images owing to the single-modality limitation of ChatGPT and GPT-4. 20,38–40…”
Section: Discussion
confidence: 99%
“…115 However, even when trained for general purposes, ChatGPT has previously been shown to pass the United States Medical Licensing Examination (USMLE), the German State Examination in Medicine, and even a radiology board-style examination without images. 116–119 Although outperformed on specific tasks by specialized medical LLMs, such as Google's MedPaLM-2, this suggests that general-purpose LLMs can comprehend complex medical literature and case scenarios to a degree that meets professional standards. 120 Furthermore, given the large amounts of data on which proprietary models such as ChatGPT are trained, it is not unlikely that they have been exposed to more medical data overall than smaller specialized models, despite being generalist models.…”
Section: Discussion
confidence: 99%
“…We also need to examine our current pedagogy: MCQs alone may no longer be enough to evaluate student or trainee understanding. 25–27 As an introduction to AI in Medicine, I also agree that we should include these LLMs during clinicopathological conferences. These may give medical students and trainees an opportunity to develop critical thinking by identifying the models' weaknesses and strengths, and to propose solutions with or without the collaboration of Computer Science colleagues.…”
Section: As a Teacher
confidence: 99%
“…It would be interesting to explore these interactions in a patient-perspective type of study. We also need to examine our current pedagogy: MCQs alone may no longer be enough to evaluate student or trainee understanding. 25–27 As an introduction to AI in Medicine, I also agree that we should include these LLMs during clinicopathological conferences.…”
Section: As a Teacher
confidence: 99%