2023
DOI: 10.2196/52202
Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study

Takashi Watari,
Soshi Takagi,
Kota Sakaguchi
et al.

Abstract: Background: The reliability of GPT-4, a state-of-the-art large language model, in clinical reasoning and medical knowledge remains largely unverified in non-English languages. Objective: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). Methods: We used the GPT-4 model provided by…


Cited by 23 publications (5 citation statements)
References 25 publications
“…Nevertheless, attributing this result to language limitations alone is challenging, given the superior performance of ChatGPT-4 in Japanese compared to medical residents in the Japanese General Medicine In-Training Examination, as reported by Watari et al [44]. This study also exposed ChatGPT-4's limitations in test aspects requiring empathy, professionalism, and contextual understanding [44].…”
Section: Discussion
confidence: 71%
“…The results suggest that disparities in language performance, as evident in Arabic dialects, could potentially extend to other languages. This was shown in Japanese, French, and Polish, among others [21,22,32]. Thus, collaborative efforts should be implemented to create diverse AI training datasets, which would help to ensure the generation of equitable and accurate health information across different linguistic and cultural contexts.…”
Section: Discussion
confidence: 99%
“…Further, a previous study found that generative AI achieved a 79.9% correct response rate on the Japanese Medical Licensing Exam, notably outperforming the average examinee by 17% on hard questions (Takagi et al., 2023). Moreover, a prior study found that generative AI outperformed Japanese medical residents on the General Medicine In-Training Examination, particularly in areas requiring detailed medical knowledge and on difficult questions (Watari et al, 2023).…”
Section: Performance of AI
confidence: 99%
“…Building on a growing body of research on the performance of AI, our study addresses the first research question: Can generative AI evaluate the acceptance of generative AI as effectively as humans? Previous studies have shown that AI, including generative AI, performs impressively in various academic and professional exams, equalling or exceeding human performance across fields such as medicine, engineering, and finance (Frieder et al, 2024; Gilson et al, 2023; Kung et al, 2023; Takagi et al, 2023; Terwiesch, 2023; Watari et al, 2023; Yang & Stivers, 2024). Moreover, prior studies find that generative AI can effectively perform tasks similar to humans, such as providing writing feedback, evaluating essays, and supporting language learning (Escalante et al, 2023; Guo & Wang, 2023; Mizumoto & Eguchi, 2023).…”
Section: Evaluating Acceptance of Generative AI
confidence: 99%