BackgroundWhile large language models show potential as beneficial tools in medicine, their reliability, especially in the realm of obstetrics and gynecology (OB-GYN), is not fully comprehended. This study seeks to measure and contrast the performance of ChatGPT and HuggingChat in addressing OB-GYN-related medical examination questions, offering insights into their effectiveness in this specialized field.
MethodsChatGPT and HuggingChat were subjected to two standardized multiple-choice question banks: Test 1, developed by the National Board of Medical Examiners (NBME), and Test 2, gathered from the Association of Professors of Gynecology & Obstetrics (APGO) Web-Based Interactive Self-Evaluation (uWISE). Responses were analyzed and compared for correctness.
ResultsThe two-proportion z-test revealed no statistically significant difference in performance between ChatGPT and HuggingChat on both medical examinations. For Test 1, ChatGPT scored 90%, while HuggingChat scored 85% (p = 0.6). For Test 2, ChatGPT correctly answered 70% of questions, while HuggingChat correctly answered 62% of questions (p = 0.4).
ConclusionAwareness of the strengths and weaknesses of artificial intelligence allows for the proper and effective use of its knowledge. Our findings indicate that there is no statistically significant difference in performance between ChatGPT and HuggingChat in addressing medical inquiries. Nonetheless, both platforms demonstrate considerable promise for applications within the medical domain.