Background
Large language models (LLMs), including ChatGPT (Chat Generative Pretrained Transformer), a popular, publicly available LLM, represent an important innovation in the application of artificial intelligence. These systems generate relevant content by identifying patterns in large text datasets based on user input across various topics. We sought to evaluate the performance of ChatGPT in practice test questions designed to assess knowledge competency for pediatric emergency medicine (PEM).
Methods
We evaluated the performance of ChatGPT for PEM board certification using a popular question bank used for board certification in PEM published between 2022 and 2024. Clinicians assessed performance of ChatGPT by inputting prompts and recording the software's responses, asking each question over 3 separate iterations. We calculated correct answer percentages (defined as correct in at least 2/3 iterations) and assessed for agreement between the iterations using Fleiss' κ.
Results
We included 215 questions over the 3 study years. ChatGPT responded correctly to 161 of PREP EM questions over 3 years (74.5%; 95% confidence interval, 68.5%–80.5%), which was similar within each study year (75.0%, 71.8%, and 77.8% for study years 2022, 2023, and 2024, respectively). Among correct responses, most were answered correctly on all 3 iterations (137/161, 85.1%). Performance varied by topic, with the highest scores in research and medical specialties and lower in procedures and toxicology. Fleiss' κ across the 3 iterations was 0.71, indicating substantial agreement.
Conclusion
ChatGPT provided correct answers to PEM responses in three-quarters of cases, over the recommended minimum of 65% provided by the question publisher for passing. Responses by ChatGPT included detailed explanations, suggesting potential use for medical education. We identified limitations in specific topics and image interpretation. These results demonstrate opportunities for LLMs to enhance both the education and clinical practice of PEM.