2023
DOI: 10.21203/rs.3.rs-2566942/v1
Preprint

Assessing the Accuracy and Reliability of AI-Generated Medical Responses: An Evaluation of the ChatGPT Model

Abstract: Background: Natural language processing models such as ChatGPT can generate text-based content and are poised to become a major information source in medicine and beyond. The accuracy and completeness of ChatGPT for medical queries are not known. Methods: Thirty-three physicians across 17 specialties generated 284 medical questions that they subjectively classified as easy, medium, or hard with either binary (yes/no) or descriptive answers. The physicians then graded ChatGPT-generated answers to these question…

Cited by 262 publications (208 citation statements)
References 10 publications
“…Such artificial intelligence (AI) chatbots generate responses based on a large corpus of data including articles, websites, and other publicly available text data. 1 Artificial intelligence chatbots are rapidly becoming a primary source of information for patients, and chatbot responses to medical queries may influence health-related behavior. Among AI chatbots, ChatGPT (OpenAI) has shown promise in accurately answering medical questions, even US Medical Licensing Examination-style questions.…”
Section: Results (mentioning; confidence: 99%)
“…They rated the output not “so incorrect as to cause patient harm” 91% to 93% of the time, but concordance with the results generated by a consultation service run by physicians and informatics experts analyzing aggregated electronic health record data was just 21% to 41%. Another study in which physicians evaluated ChatGPT 3.5 output on 180 clinical queries found that the mean score was 4.4 of 6 for accuracy and 2.4 for completeness, with 8% of answers scored as completely incorrect. In a third study, ChatGPT 3.5 responses to 36 clinical vignettes, compared with the clinical manual from which the vignettes were drawn, were scored as 72% accurate on average.…”
(mentioning; confidence: 99%)
“…Another study in which physicians evaluated ChatGPT 3.5 output on 180 clinical queries found that the mean score was 4.4 of 6 for accuracy and 2.4 for completeness, with 8% of answers scored as completely incorrect. 6 In a third study, ChatGPT 3.5 responses to 36 clinical vignettes, compared with the clinical manual from which the vignettes were drawn, were scored as 72% accurate on average. The researchers characterized this as "impressive accuracy," but acknowledged that even small errors can harm patients.…”
(mentioning; confidence: 99%)
“…The daunting challenge requires critical and insightful analysis of the data by expert minds to discern reliable data, which is not simply possible by text mining. In clinical medicine, preliminary studies suggest that ChatGPT could generate reasonably but not perfectly accurate responses to medical questions, write patient clinic letters, or summarize discharge notes [2][3][4][5][6]. However, all ChatGPT-generated medical documents must be reviewed and verified by experts.…”
(mentioning; confidence: 99%)