2023
DOI: 10.4174/astr.2023.104.5.269

ChatGPT goes to the operating room: evaluating GPT-4 performance and its potential in surgical education and training in the era of large language models

Abstract: Purpose This study aimed to assess the performance of ChatGPT, specifically the GPT-3.5 and GPT-4 models, in understanding complex surgical clinical information and its potential implications for surgical education and training. Methods The dataset comprised 280 questions from the Korean general surgery board exams conducted between 2020 and 2022. Both GPT-3.5 and GPT-4 models were evaluated, and their performances were compared using the McNemar test. Result…
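The paired comparison described in the abstract can be illustrated in a few lines of code. The sketch below is a minimal example, assuming per-question correctness of the two models has been tallied into a 2x2 contingency table; the counts are placeholders, not the paper's data, and the test is run with statsmodels.

```python
# Hypothetical sketch of the comparison described in the abstract: a McNemar
# test on per-question correctness of GPT-3.5 vs GPT-4 over 280 questions.
# The counts below are placeholders, NOT the paper's actual results.
from statsmodels.stats.contingency_tables import mcnemar

# Rows = GPT-3.5 (correct, wrong); columns = GPT-4 (correct, wrong).
table = [[120, 20],   # GPT-3.5 correct: GPT-4 correct / GPT-4 wrong
         [90,  50]]   # GPT-3.5 wrong:   GPT-4 correct / GPT-4 wrong

# exact=True runs the exact binomial test on the discordant pairs (20 vs 90),
# which is appropriate for paired yes/no outcomes like per-question accuracy.
result = mcnemar(table, exact=True)
print(f"statistic={result.statistic}, p-value={result.pvalue:.4f}")
```

Only the discordant cells (questions one model got right and the other got wrong) drive the test, which is why a paired design like this is more sensitive than comparing the two overall accuracy percentages.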


Cited by 112 publications (41 citation statements). References 10 publications.
“…Recent studies show GPT-4 outperformed GPT-3.5 by 24%-30% in various medical examinations. 13,14,21,23 These findings indicate a significant enhancement in the model's capabilities. However, a study using the American College of Gastroenterology Test found GPT-3.5 and GPT-4 had scores of 65% and 62%, respectively.…”
Section: Discussion (mentioning)
confidence: 73%
“…As a result, questions that involved visual elements, such as clinical images, medical photographs, and graphs, were excluded from our assessment, following the approach taken by previous studies. 7,8,11,13,14…”
Section: Methods (mentioning)
confidence: 99%
“…These comparisons highlighted the potential of ChatGPT in higher educational assessments; nevertheless, they showed the importance of ongoing refinement of these models and the dangers of the inaccuracies they pose (Lo, 2023; Sallam, 2023; Sallam et al., 2023d; Gill et al., 2024). However, making direct comparisons across variable studies can be challenging due to differences in models implemented, subject fields of the exams, test dates, and the exact approaches of prompt construction (Holmes et al., 2023; Huynh Linda et al., 2023; Meskó, 2023; Oh et al., 2023; Skalidis et al., 2023; Yaa et al., 2023).…”
Section: Discussion (mentioning)
confidence: 99%
“…Questions, along with their multiple-choice answers, were presented to the model, followed by the instruction, 'Give the number of the best answer. Start your response with "The answer is:"' The goal of this approach was to have the LLM respond with just the multiple-choice answer (1-5) and not provide a lengthy (costly) explanation.…”
Section: AI Prompting Methodology (mentioning)
confidence: 99%
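The prompt format this citing study describes is easy to reproduce. Below is a minimal sketch, assuming the OpenAI chat completions API; the helper names (build_prompt, ask) and the model string are illustrative assumptions, not code from the study.

```python
# Minimal sketch of the prompting approach quoted above; the helpers and the
# choice of the OpenAI chat API are assumptions, not the citing study's code.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def build_prompt(question: str, choices: list[str]) -> str:
    # Number the options 1..n, then append the fixed instruction so the
    # model returns only the option number instead of a long explanation.
    numbered = "\n".join(f"{i}. {c}" for i, c in enumerate(choices, start=1))
    return (f"{question}\n{numbered}\n"
            'Give the number of the best answer. '
            'Start your response with "The answer is:"')

def ask(question: str, choices: list[str], model: str = "gpt-4") -> str:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_prompt(question, choices)}],
    )
    return response.choices[0].message.content  # e.g. 'The answer is: 3'
```

Pinning the response to a fixed prefix makes automated scoring a simple string match on the returned option number, and it also keeps completion length, and therefore token cost, low.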
“…GPT 4.0 has proven a remarkable ability in assessing knowledge in specialised domains such as medicine, law, and business [2-4], areas that have historically been the exclusive purview of professionals. Particularly noteworthy is its exceptional performance on assessments like the Korean general surgery board exam, the United States Medical Licensing Exam, and the Wharton MBA final exam, each achieved without fine-tuning of the pretrained model [5-7].…”
Section: Introduction (mentioning)
confidence: 99%