Background: Timely recognition of an emergency and initiation of basic life support (BLS) before emergency medical services (EMS) arrive significantly improve survival rates and neurological outcomes. As health information-seeking behavior increasingly shifts toward online sources, chatbots powered by generative artificial intelligence (AI) are emerging as potential tools for immediate health-related guidance. This study investigates the reliability of four AI chatbots (GPT-3.5, GPT-4, Bard, and Bing) in responding to BLS scenarios.
Methods: A cross-sectional study was conducted using six scenarios adapted from the BLS Objective Structured Clinical Examination (OSCE) published by United Medical Education. The scenarios covered adult, pediatric, and infant emergencies and were presented to each chatbot on two occasions one week apart. Responses were evaluated against a checklist based on BLS-OSCE standards by a board-certified emergency medicine professor at Tehran University of Medical Sciences. Correctness was assessed, and test-retest reliability across the two occasions was measured using Cohen's kappa coefficient.
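For readers unfamiliar with the statistic, Cohen's kappa expresses agreement between the two response occasions corrected for the agreement expected by chance; the notation below is the standard definition rather than notation taken from the study itself:

$$\kappa = \frac{p_o - p_e}{1 - p_e}$$

where $p_o$ is the observed proportion of agreement between the two occasions and $p_e$ is the proportion of agreement expected by chance. A kappa of 1 indicates perfect agreement, and a kappa of 0 indicates agreement no better than chance.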
Results: GPT-4 demonstrated the highest correctness in adult scenarios (85% correct responses), while Bard showed 60% correctness. GPT-3.5 and Bing performed poorly across all scenarios. In pediatric scenarios, Bard achieved a correctness rate of 52.17%, and all chatbots scored below 44% in infant scenarios. Cohen's kappa indicated substantial test-retest reliability for GPT-4 (κ = 0.649) and GPT-3.5 (κ = 0.645), moderate reliability for Bing (κ = 0.503), and fair reliability for Bard (κ = 0.357), per the widely used Landis and Koch benchmarks.
Conclusion: GPT-4 showed acceptable correctness and substantial reliability in adult BLS scenarios. However, the limited overall correctness and reliability of all chatbots across scenarios indicate that current AI chatbots are unsuitable for providing life-saving instructions in critical medical emergencies.