Assessing the Accuracy of Responses by the Language Model ChatGPT to Questions Regarding Bariatric Surgery

Samaan, Jamil S.; Yeo, Yee Hui; Rajeev, Nithya D.; Hawley, Lauren; Abel, Stuart; Ng, Wee Han; Srinivasan, Nitin; Park, Justin; Burch, Miguel; Watson, Rabindra R.; Liran, Omer; Samakar, Kamran

doi:10.1007/s11695-023-06603-5

Cited by 152 publications

(84 citation statements)

References 24 publications


“…The already published literature is mostly focused on the use of ChatGpt (Microsoft Bing) in generative patient education material in different specialties of medicine, such as gastroenterology (10), cardiology (11), bariatric surgery (12, 13), otolaryngology (14), ophthalmology (15), and prostate cancer (16). In our study, we have not only evaluated the Microsoft Bing and Google Bard models individually but also compared them in their ability to produce quality patient education content.…”

Section: Discussion (mentioning)

confidence: 99%

Microsoft Bing vs. Google Bard in Neurology: A Comparative Study of AI-Generated Patient Education Material

Nazir, Ahmad, Mal et al. 2023

Preprint


Background: Patient education is an essential component of healthcare, and artificial intelligence (AI) language models such as Google Bard and Microsoft Bing have the potential to improve information transmission and enhance patient care. However, it is crucial to evaluate the quality, accuracy, and understandability of the materials generated by these models before applying them in medical practice. This study aimed to assess and compare the quality of patient education materials produced by Google Bard and Microsoft Bing in response to questions related to neurological conditions.

Methods: A cross-sectional study design was used to evaluate and compare the ability of Google Bard and Microsoft Bing to generate patient education materials. The study included the top ten prevalent neurological diseases based on WHO prevalence data. Ten board-certified neurologists and four neurology residents evaluated the responses generated by the models on six quality metrics. The scores for each model were compiled and averaged across all measures, and the significance of any observed variations was assessed using an independent t-test.

Results: Google Bard performed better than Microsoft Bing in all six quality metrics, with overall mean scores of 79% and 69%, respectively. Google Bard outperformed Microsoft Bing in all measures for eight questions, while Microsoft Bing performed marginally better in terms of objectivity and clarity for the epilepsy query.

Conclusion: This study showed that Google Bard performs better than Microsoft Bing in generating patient education materials for neurological diseases. However, healthcare professionals should take into account both AI models’ advantages and disadvantages when providing support for health information requirements. Future studies can help determine the underlying causes of these variations and guide cooperative initiatives to create more user-focused AI-generated patient education materials. Finally, researchers should consider the perception of patients regarding AI-generated patient education material and its impact on implementing these solutions in healthcare settings.


“…A possible solution to this problem is to leverage artificial intelligence, including generative large language model (LLM)−based chatbots that have been trained to respond to human-generated inquiries. While relatively nascent in health care, research has shown that LLM-based chatbots can answer patient-generated questions to varying degrees of accuracy. However, the ability of LLM-based chatbots to enhance informed consent documents remains unknown.…”

Section: Introduction (mentioning)

confidence: 99%

“…22 While relatively nascent in health care, research has shown that LLM-based chatbots can answer patient-generated questions to varying degrees of accuracy. [23][24][25][26] However, the ability of LLM-based chatbots to enhance informed consent documents remains unknown. […] These questions were inputted into a new profile in the chatbot using the incognito browser mode to mitigate potential bias from other internet activity. ChatGPT is an LLM-based chatbot that responds with text to human-generated questions; it has been shown to perform well on evaluations across multiple domains.…”

Section: Introduction (mentioning)

confidence: 99%

Large Language Model−Based Chatbot vs Surgeon-Generated Informed Consent Documentation for Common Procedures

Decker, Trang, Ramirez et al. 2023

JAMA Netw Open


Importance: Informed consent is a critical component of patient care before invasive procedures, yet it is frequently inadequate. Electronic consent forms have the potential to facilitate patient comprehension if they provide information that is readable, accurate, and complete; it is not known if large language model (LLM)-based chatbots may improve informed consent documentation by generating accurate and complete information that is easily understood by patients.

Objective: To compare the readability, accuracy, and completeness of LLM-based chatbot- vs surgeon-generated information on the risks, benefits, and alternatives (RBAs) of common surgical procedures.

Design, Setting, and Participants: This cross-sectional study compared randomly selected surgeon-generated RBAs used in signed electronic consent forms at an academic referral center in San Francisco with LLM-based chatbot-generated (ChatGPT-3.5, OpenAI) RBAs for 6 surgical procedures (colectomy, coronary artery bypass graft, laparoscopic cholecystectomy, inguinal hernia repair, knee arthroplasty, and spinal fusion).

Main Outcomes and Measures: Readability was measured using previously validated scales (Flesch-Kincaid grade level, Gunning Fog index, the Simple Measure of Gobbledygook, and the Coleman-Liau index). Scores range from 0 to greater than 20 to indicate the years of education required to understand a text. Accuracy and completeness were assessed using a rubric developed with recommendations from LeapFrog, the Joint Commission, and the American College of Surgeons. Both composite and RBA subgroup scores were compared.

Results: The total sample consisted of 36 RBAs, with 1 RBA generated by the LLM-based chatbot and 5 RBAs generated by a surgeon for each of the 6 surgical procedures. The mean (SD) readability score for the LLM-based chatbot RBAs was 12.9 (2.0) vs 15.7 (4.0) for surgeon-generated RBAs (P = .10). The mean (SD) composite completeness and accuracy score was lower for surgeons’ RBAs at 1.6 (0.5) than for LLM-based chatbot RBAs at 2.2 (0.4) (P < .001). The LLM-based chatbot scores were higher than the surgeon-generated scores for descriptions of the benefits of surgery (2.3 [0.7] vs 1.4 [0.7]; P < .001) and alternatives to surgery (2.7 [0.5] vs 1.4 [0.7]; P < .001). There was no significant difference in chatbot vs surgeon RBA scores for risks of surgery (1.7 [0.5] vs 1.7 [0.4]; P = .38).

Conclusions and Relevance: The findings of this cross-sectional study suggest that despite not being perfect, LLM-based chatbots have the potential to enhance informed consent documentation. If an LLM were embedded in electronic health records in a manner compliant with the Health Insurance Portability and Accountability Act, it could be used to provide personalized risk information while easing documentation burden for physicians.
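For readers unfamiliar with the readability scales named in this abstract, the following is a minimal sketch of one of them, the Flesch-Kincaid grade level, using its standard published formula. The function names `count_syllables` and `flesch_kincaid_grade` are hypothetical, and the vowel-group syllable counter is a rough assumption for illustration; the abstract does not say which tool the authors used, and validated tools will give somewhat different scores.

```python
import re


def count_syllables(word: str) -> int:
    """Approximate syllables as runs of consecutive vowels (a heuristic assumption)."""
    lowered = word.lower()
    count = len(re.findall(r"[aeiouy]+", lowered))
    # Discount a trailing silent 'e' (but not '-le') when other vowel groups exist.
    if lowered.endswith("e") and not lowered.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_kincaid_grade(text: str) -> float:
    """Flesch-Kincaid grade level:
    0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
    """
    # Naive splitting: sentences end at '.', '!' or '?'; words are letter runs.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return 0.39 * len(words) / len(sentences) + 11.8 * syllables / len(words) - 15.59
```

Calling `flesch_kincaid_grade` on a consent paragraph returns an approximate U.S. school grade; longer sentences and longer words both push the score up, which is the sense in which a lower score indicates text that is easier to read.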


“…As ChatGPT's capabilities grow, we hypothesize patients will be more inclined to seek information from this technology regarding complex conditions such as heart failure. The model's utility in medicine is actively under investigation with prior studies having examined ChatGPT's ability to answer questions related to heart disease prevention, bariatric surgery, and cirrhosis yielding promising results [6,7,8]. In the deployment of tools such as ChatGPT within the medical field, a comprehensive examination of both strengths and limitations is essential.…”

Section: Introduction (mentioning)

confidence: 99%

Appropriateness of ChatGPT in answering heart failure related questions

King, Samaan, Yeo et al. 2023

Preprint

Self Cite


Background: Heart failure requires complex management with increased patient knowledge shown to improve outcomes. The large language model (LLM), Chat Generative Pre-Trained Transformer (ChatGPT), may be a useful supplemental resource of information for patients with heart failure.

Methods: Responses produced by GPT-3.5 and GPT-4 to 107 frequently asked heart failure-related questions were graded by two reviewers board-certified in cardiology, with differences resolved by a third reviewer. The reproducibility and accuracy between GPT-3.5 and GPT-4 were compared for questions involving basic knowledge, management, prognosis, procedures, and support.

Results: GPT-4 displayed a greater proportion of comprehensive knowledge for the categories of basic knowledge and management, while GPT-3.5 performed better in the other category (prognosis, procedures, and support) (94.1% vs 64.7%). There were 2 total responses (1.9%) graded as some correct and incorrect for GPT-3.5, while no GPT-4 responses received a grade of some correct and incorrect or completely incorrect. Both models provided highly reproducible responses, with GPT-3.5 scoring above 94% in every category and GPT-4 with 100% for all answers.

Conclusions: Both GPT-3.5 and GPT-4 answered the majority of heart failure-related questions accurately and reliably, with GPT-4 displaying superior performance overall. ChatGPT may lead to better outcomes in patients with heart failure by providing health education.
