Introduction
With the potential for artificial intelligence (AI) chatbots to serve as the primary source of glaucoma information to patients, it is essential to characterize the information that chatbots provide such that providers can tailor discussions, anticipate patient concerns, and identify misleading information. Therefore, the purpose of this study was to evaluate glaucoma information from AI chatbots, including ChatGPT-4, Bard, and Bing, by analyzing response accuracy, comprehensiveness, readability, word count, and character count in comparison to each other and glaucoma-related American Academy of Ophthalmology (AAO) patient materials.
Methods
Section headers from AAO glaucoma-related patient education brochures were adapted into question form and asked five times to each AI chatbot (ChatGPT-4, Bard, and Bing). Two sets of responses from each chatbot were used to evaluate the accuracy of AI chatbot responses and AAO brochure information, and the comprehensiveness of AI chatbot responses compared to the AAO brochure information, scored 1-5 by three independent glaucoma-trained ophthalmologists. Readability (assessed with Flesch-Kincaid Grade Level (FKGL), corresponding to the United States school grade levels), word count, and character count were determined for all chatbot responses and AAO brochure sections.
Results
Accuracy scores for AAO, ChatGPT, Bing, and Bard were 4.84, 4.26, 4.53, and 3.53, respectively. On direct comparison, AAO was more accurate than ChatGPT (p=0.002), and Bard was the least accurate (Bard versus AAO, p<0.001; Bard versus ChatGPT, p<0.002; Bard versus Bing, p=0.001). ChatGPT had the most comprehensive responses (ChatGPT versus Bing, p<0.001; ChatGPT versus Bard p=0.008), with comprehensiveness scores for ChatGPT, Bing, and Bard at 3.32, 2.16, and 2.79, respectively. AAO information and Bard responses were at the most accessible readability levels (AAO versus ChatGPT, AAO versus Bing, Bard versus ChatGPT, Bard versus Bing, all p<0.0001), with readability levels for AAO, ChatGPT, Bing, and Bard at 8.11, 13.01, 11.73, and 7.90, respectively. Bing responses had the lowest word and character count.
Conclusion
AI chatbot responses varied in accuracy, comprehensiveness, and readability. With accuracy scores and comprehensiveness below that of AAO brochures and elevated readability levels, AI chatbots require improvements to be a more useful supplementary source of glaucoma information for patients. Physicians must be aware of these limitations such that patients are asked about existing knowledge and questions and are then provided with clarifying and comprehensive information.