2023
DOI: 10.1016/j.mcpdig.2023.05.004

Learning to Fake It: Limited Responses and Fabricated References Provided by ChatGPT for Medical Questions

Cited by 71 publications (25 citation statements)
References: 29 publications
“…False responses by ChatGPT-4 were recognized and described by the developer 20 and were also reported in other research in the field of medicine. 31,32 As discussed by Lee et al in a recent report in NEJM, 'a false response by GPT-4, referred to as a "hallucination," and such errors can be particularly dangerous in medical scenarios because the errors or falsehoods can be subtle and are often stated by the chatbot in such a convincing manner that the person making the query may be convinced of its veracity'. 33 Furthermore, the AI generated supplementary criteria within its outputs, drawing attention to important aspects of deprescribing, such as the role of patient's involvement and importance of shared decision-making.…”
Section: Discussion (mentioning)
confidence: 99%
“…Others (e.g., a study by Birmaher et al) 6 did not relate to the question, others (e.g., the Schmideberg reference cited in the creativity question) were fabricated, while some (e.g., the WHO reports, 4 and the study by Merikangas et al 5 ) were inaccurate. Inaccurate and fabricated referencing is a well‐known limitation of ChatGPT, 7–9 which more often places emphasis on generating the most plausible sounding references. While ChatGPT was able to generate factual information related to bipolar disorder with reasonable accuracy (with some notable exceptions, such as the specific prevalence rates cited), such information was provided more at the level of a school essay than a scientific journal.…”
Section: Discussion (mentioning)
confidence: 99%
“…In testing, aiChat was able to adjust the response by date when given this parameter. Given that studies have established that GPT often fabricates references [21,51], we did not specifically ask aiChat to provide references as part of the prompt. Providing an example of the desired output within the prompt has also been suggested [50] and found to improve performance in some analyses [9].…”
Section: Methods (mentioning)
confidence: 99%
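The prompting approach described in that last statement (deliberately not asking for references, and supplying an example of the desired output within the prompt) can be illustrated with a short sketch. The excerpt does not say how aiChat was implemented, so the snippet below is only an assumption-laden illustration: the OpenAI Python SDK, the model name, the system instruction, and the example question/answer pair are all placeholders rather than the study's actual prompt.

```python
# Minimal sketch of the prompt design described above (assumptions, not the
# study's actual aiChat setup): the prompt deliberately does NOT request
# references, and it includes one example of the desired output format.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EXAMPLE_QUESTION = "What lifestyle changes help manage hypertension?"  # placeholder
EXAMPLE_ANSWER = (
    "- Reduce dietary sodium\n"
    "- Increase regular aerobic exercise\n"
    "- Limit alcohol intake"
)  # placeholder showing the desired output format (no citations requested)


def ask(question: str) -> str:
    """Query the model with a one-shot example and no request for references."""
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder model name
        messages=[
            {
                "role": "system",
                "content": (
                    "Answer the medical question as a concise bulleted list. "
                    "Do not include citations or reference lists."
                ),
            },
            # One-shot example of the desired output, as suggested in the excerpt
            {"role": "user", "content": EXAMPLE_QUESTION},
            {"role": "assistant", "content": EXAMPLE_ANSWER},
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content


if __name__ == "__main__":
    print(ask("What are first-line treatments for type 2 diabetes?"))  # placeholder query
```

The design choice reflects the two points raised in the citation statement: omitting a request for references avoids inviting fabricated citations, and the one-shot example constrains the output format, which some analyses have found improves performance.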