2023
DOI: 10.2196/47479
Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument

Abstract: Background: ChatGPT-4 is the latest release of a novel artificial intelligence (AI) chatbot able to answer freely formulated and complex questions. In the near future, ChatGPT could become the new standard for health care professionals and patients to access medical information. However, little is known about the quality of medical information provided by the AI. Objective: We aimed to assess the reliability of medical information provided by ChatGPT. …

Cited by 132 publications (42 citation statements) · References 28 publications
“…When disseminating information about cancer treatment and sexual health issues faced by cancer survivors, the generated chatbots functioned without refusing to answer, with or without training sources of medical guidelines. GPT responses have been noted to be as reliable as web searches and are closer to clinical guidelines, making it a promising tool to support medical communication [7,8]. In this study, the GPT returned useful results comparable to the guidelines, not calling for excessive pessimism or optimism.…”
Section: Discussion (mentioning)
Confidence: 63%
“…Second, ChatGPT struggled to interpret all the coherent laboratory tests [60], generating superficial and incorrect responses. Indeed, ChatGPT could generate overly general answers without citing original references [20,40,42].…”
Section: Results (mentioning)
Confidence: 99%
“…This outcome underscores the imperative for exercising caution when solely relying on AI-generated medical information and the need for continuous evaluation, as others have noted [16]. However, in another study by Walker et al [17] aimed at evaluating the reliability of medical information provided by ChatGPT-4, multiple iterations of their queries executed through the model yielded a remarkable 100% internal consistency among the generated outputs [17]. Although promising, it should be noted that the queries used in their experiment consisted of direct single-sentence questions pertaining to specific hepatobiliary diagnoses.…”
Section: Discussion (mentioning)
Confidence: 99%