2023
DOI: 10.6018/edumed.556511

¿Es capaz “ChatGPT” de aprobar el examen MIR de 2022? Implicaciones de la inteligencia artificial en la educación médica en España (Is ChatGPT capable of passing the 2022 MIR exam? Implications of artificial intelligence for medical education in Spain)

Abstract: Artificial intelligence and natural language processing models have entered the field of medical education. Among them, the ChatGPT model has been used to attempt various international medical examinations. However, no literature addresses this phenomenon in Europe or in other Spanish-speaking countries. The present paper aims to evaluate the ability of the ChatGPT model to answer questions from the 2022 MIR exam, which grants access to the Spanish postgraduate training system. To this end…

Cited by 21 publications (14 citation statements); References: 8 publications.

“…Another study on the Neurosurgery Oral Board Preparation Question Bank showed that GPT-4 performed with an accuracy of 82.6%, while GPT-3.5 achieved an accuracy of 62.4% [20]. However, in our study, GPT-3.5 performed better on the NLME compared to previous studies where it failed examinations, including the USMLE and Spanish, Japanese, and Chinese NLMEs [2,21-23]. This can be explained by our use of a prompt that resembles the "chain-of-thought prompting approach," in which ChatGPT decomposes multistep problems into smaller and manageable steps to enhance accuracy [24].…”
Section: Principal Findings (contrasting)
confidence: 87%
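The cited study does not reproduce its exact prompt, but a chain-of-thought style instruction for a multiple-choice item might look like the minimal Python sketch below. It assumes the OpenAI Python client (openai>=1.0); the model name, prompt wording, and sample question are illustrative assumptions, not the study's own materials.

```python
# Minimal sketch of a chain-of-thought style prompt for a multiple-choice
# exam question. The model name, prompt wording, and example question are
# illustrative assumptions, not the prompt used in the cited study.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

QUESTION = (
    "A 65-year-old patient presents with sudden-onset chest pain radiating "
    "to the back and a diastolic murmur. What is the most likely diagnosis?\n"
    "A) Acute myocardial infarction\n"
    "B) Aortic dissection\n"
    "C) Pulmonary embolism\n"
    "D) Pericarditis"
)

# Chain-of-thought framing: ask the model to reason through the findings
# step by step before committing to a single answer choice.
prompt = (
    "Answer the following multiple-choice question. First, break the problem "
    "into smaller steps and reason through each relevant finding. Then state "
    "the single best answer as a letter.\n\n"
    f"{QUESTION}"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```

Asking the model to reason before giving the final letter is what distinguishes this framing from a bare "answer with one letter" prompt, and is the mechanism the quoted authors credit for the accuracy gain.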
“…Previous studies have evaluated the efficacy of AI-based models in addressing MCQs across diverse healthcare disciplines, yielding varying outcomes (Newton and Xiromeriti, 2023). These assessments encompassed the use of ChatGPT in scenarios such as the United States Medical Licensing Examination (USMLE), as well as its performance in fields such as parasitology and ophthalmology, among others, and in various languages (e.g., Japanese, Chinese, German, Spanish) (Antaki et al., 2023; Carrasco et al., 2023; Friederichs et al., 2023; Huh, 2023; Kung et al., 2023; Takagi et al., 2023; Xiao et al., 2023). A recent scoping review examined the influence of the ChatGPT model on examination outcomes, including instances where ChatGPT demonstrated superior performance compared to students, albeit in a minority of cases (Newton and Xiromeriti, 2023).…”
Section: Introduction (mentioning)
confidence: 99%
“…A previous Spanish study submitted the same MIR examination to GPT-3.5 and obtained a lower score (54.8%) than ours [26]. One explanation for the difference could be the prompt used.…”
Section: Discussion (mentioning)
confidence: 46%