BACKGROUND
AI-powered chatbots built on large language models (LLMs) may effectively answer questions from patients with hypertension by providing responses that are accurate, empathetic, and easy to read.
OBJECTIVE
This study evaluates how well three such chatbots deliver accurate, empathetic, and readable responses to patient questions about hypertension.
METHODS
One hundred questions were randomly selected from the Reddit forum r/hypertension and submitted to three publicly available chatbots (ChatGPT-3.5, Microsoft Copilot, and Gemini), anonymized as A, B, and C. Two independent medical professionals rated the accuracy and empathy of each response on Likert scales. Additionally, all 300 responses (100 questions × 3 chatbots) were analyzed with the WebFX readability tool to compute a set of standard readability indices.
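To make concrete what such an index measures, the following is a minimal, self-contained sketch of the Flesch Reading Ease computation. This is an illustrative approximation only: the study used the WebFX web tool, whose exact tokenization and syllable-counting rules are not reproduced here, and the heuristic syllable counter and sample text below are assumptions for demonstration.

```python
# Illustrative approximation of the Flesch Reading Ease score.
# NOT the WebFX tool used in the study; its exact rules are not public here.

import re

def count_syllables(word: str) -> int:
    """Crude vowel-group heuristic for English syllable counting."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    # Common adjustment: a trailing silent 'e' usually adds no syllable.
    if word.endswith("e") and count > 1:
        count -= 1
    return max(count, 1)

def flesch_reading_ease(text: str) -> float:
    """FRE = 206.835 - 1.015*(words/sentences) - 84.6*(syllables/words)."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / max(len(sentences), 1))
            - 84.6 * (syllables / max(len(words), 1)))

if __name__ == "__main__":
    # Hypothetical chatbot reply, used only to exercise the function.
    reply = ("High blood pressure often has no symptoms. "
             "Regular checks and medication help keep it under control.")
    print(f"Flesch Reading Ease: {flesch_reading_ease(reply):.1f}")
```

Higher scores indicate easier text; scores in the 60-70 range correspond roughly to plain English readable by most adults.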
RESULTS
In total, 300 responses were evaluated. Chatbot A generated the longest responses, averaging 13 sentences per reply, while Chatbot B produced the shortest. Chatbot C achieved the highest score on the Flesch Reading Ease Scale, indicating the most readable output, while Chatbot A scored the lowest. Other readability metrics, including the Flesch-Kincaid Grade Level and the Gunning Fog Score, also differed significantly among the three chatbots.
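For reference, the standard published definitions of the three indices named above are as follows; these are the general formulas, not values or definitions taken from the study itself:

$$\mathrm{FRE} = 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}} - 84.6\,\frac{\text{syllables}}{\text{words}}$$

$$\mathrm{FKGL} = 0.39\,\frac{\text{words}}{\text{sentences}} + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59$$

$$\mathrm{Fog} = 0.4\left(\frac{\text{words}}{\text{sentences}} + 100\,\frac{\text{complex words}}{\text{words}}\right)$$

Longer sentences and more polysyllabic words lower the Flesch Reading Ease score and raise the two grade-level scores, which is consistent with the longest-response chatbot scoring worst on readability.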
CONCLUSIONS
The study indicates that while all three chatbots can produce professionally worded responses, their readability varies significantly. These findings underscore the potential of AI chatbots for patient education but also highlight the urgent need for further optimization to make their outputs easier to understand.