2024
DOI: 10.1177/20552076241237678
Exploring the proficiency of ChatGPT-4: An evaluation of its performance in the Taiwan advanced medical licensing examination

Shih-Yi Lin,
Pak Ki Chan,
Wu-Huei Hsu
et al.

Abstract: Background Taiwan is well-known for its quality healthcare system. The country's medical licensing exams offer a way to evaluate ChatGPT's medical proficiency. Methods We analyzed exam data from February 2022, July 2022, February 2023, and July 2023. Each exam included four papers with 80 single-choice questions, grouped as descriptive or picture-based. We used ChatGPT-4 for evaluation. Incorrect answers prompted a "chain of thought" approach. Accuracy rates were calculated as percentages. Results ChatGPT-4's …
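The evaluation loop the abstract describes (single-choice questions, a chain-of-thought re-prompt after an incorrect answer, accuracy reported as a percentage) can be sketched as follows. This is a minimal sketch, not the authors' code: `ask_model` is a hypothetical stand-in for a ChatGPT-4 API call, stubbed here so the example runs without network access.

```python
# Sketch of the exam-evaluation workflow described in the abstract.
# ask_model is a hypothetical stand-in for a ChatGPT-4 API call;
# it is stubbed so the sketch runs offline.

def ask_model(question: str, use_chain_of_thought: bool = False) -> str:
    """Hypothetical model call; replace with a real API client."""
    # Stub behavior: answer "A" on a plain prompt, "B" when asked
    # to reason step by step.
    return "B" if use_chain_of_thought else "A"

def grade_exam(questions: list[dict]) -> float:
    """Return accuracy (%) over single-choice questions, retrying
    each incorrect answer once with a chain-of-thought prompt."""
    correct = 0
    for q in questions:
        answer = ask_model(q["text"])
        if answer != q["key"]:
            # Incorrect first attempt: re-prompt with step-by-step reasoning.
            answer = ask_model(q["text"], use_chain_of_thought=True)
        if answer == q["key"]:
            correct += 1
    return 100.0 * correct / len(questions)

exam = [
    {"text": "Q1 ...", "key": "A"},  # right on the first attempt
    {"text": "Q2 ...", "key": "B"},  # recovered by chain of thought
    {"text": "Q3 ...", "key": "C"},  # wrong on both attempts
]
print(grade_exam(exam))  # → 66.66666666666667
```

The retry branch is the abstract's "chain of thought" step: only questions missed on the first pass are re-asked with the reasoning prompt.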

Cited by 6 publications (4 citation statements). References 36 publications.
“…Researchers have not analyzed or elaborated on the impact of these task-understanding prompts on ChatGPT's performance. However, three studies used optimized prompts [19,26,35]. A Korean study used four kinds of optimized prompts: annotating Chinese terms in TKM, translating the instruction and question into English, providing exam-optimized instructions, and utilizing self-consistency in the prompt.…”
Section: Figure 5, Performance of ChatGPT on Passing Medical Licensing... (mentioning, confidence: 99%)
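The "self-consistency in the prompt" strategy mentioned above samples several reasoning paths for the same question and keeps the majority answer. A minimal sketch of that aggregation step, assuming the per-path answers have already been collected:

```python
from collections import Counter

def self_consistent_answer(sampled_answers: list[str]) -> str:
    """Majority vote over answers sampled from multiple reasoning
    paths -- the aggregation step of the self-consistency technique."""
    return Counter(sampled_answers).most_common(1)[0][0]

# Five hypothetical samples for one exam question:
print(self_consistent_answer(["B", "B", "C", "B", "A"]))  # → B
```

In practice each entry in the list would come from a separate model call with sampling enabled; only the voting step is shown here.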
“…"Pretend to be a junior doctor with expertise in clinical practice and exam solving and retry" or "Could you double-check the answer?". ChatGPT could correctly answer up to 88.9% and 84% of these questions, respectively [19,35]. For task-understanding prompts, we conducted a subgroup analysis and meta-regression to examine whether they affected ChatGPT's performance.…”
Section: Figure 5, Performance of ChatGPT on Passing Medical Licensing... (mentioning, confidence: 99%)
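The two retry prompts quoted in the statement above (a role prompt and a double-check prompt) amount to appending a follow-up message to the conversation after an incorrect answer. A sketch, assuming a typical chat-completion message format; `build_retry_messages` is an illustrative helper, not from the cited studies:

```python
# Sketch of the two re-prompting strategies quoted above: after an
# incorrect answer, the question is re-asked with either a role
# prompt or a double-check prompt appended to the transcript.

ROLE_RETRY = ("Pretend to be a junior doctor with expertise in "
              "clinical practice and exam solving and retry.")
DOUBLE_CHECK = "Could you double-check the answer?"

def build_retry_messages(question: str, first_answer: str,
                         strategy: str) -> list[dict]:
    """Return a chat transcript ending in the chosen retry prompt.
    The message format mirrors a typical chat-completion API."""
    follow_up = ROLE_RETRY if strategy == "role" else DOUBLE_CHECK
    return [
        {"role": "user", "content": question},
        {"role": "assistant", "content": first_answer},
        {"role": "user", "content": follow_up},
    ]

msgs = build_retry_messages("Which drug ...?", "A", "double_check")
print(msgs[-1]["content"])  # → Could you double-check the answer?
```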
“…Multiple studies have systematically evaluated ChatGPT's performance on standardized tests across various languages. Notably, it has demonstrated excellent performance on assessments such as the United States Medical Licensing Examination (USMLE) [11][12][13], the Japanese Medical Licensing Examination (JMLE) [14], the Saudi Medical Licensing Examination (SMLE) [15], the Polish medical specialization licensing exam (PES) [16] and Taiwan's medical licensing exams [17]. However, over the past five years of the Chinese National Medical Licensing Examination (NMLE), ChatGPT scores have consistently fallen below the passing threshold.…”
Section: Introduction (mentioning, confidence: 99%)