The Evaluation of Generative AI Should Include Repetition to Assess Stability

Zhu, Lingxuan; Mou, Weiming; Hong, Chenglin; Yang, Tao; Lai, Yancheng; Qi, Chang; Lin, Anqi; Zhang, Jian; Luo, Peng

doi:10.2196/57978

JMIR Mhealth Uhealth

2024


Abstract: The increasing interest in the potential applications of generative artificial intelligence (AI) models like ChatGPT in health care has prompted numerous studies to explore its performance in various medical contexts. However, evaluating ChatGPT poses unique challenges due to the inherent randomness in its responses. Unlike traditional AI models, ChatGPT generates different responses for the same input, making it imperative to assess its stability through repetition. This commentary highlights the importance o…


Cited by 3 publications

References 19 publications


Ensuring Safety and Consistency in Artificial Intelligence Chatbot Responses

Zhu, Mou, Luo

2024

JAMA Oncol

Generative artificial intelligence and social media: insights for tobacco control

Kong, Ouellette, Murthy

2024

Tob Control

A Qualitative Evaluation of ChatGPT4 and PaLM2’s Response to Patient’s Questions Regarding Age-Related Macular Degeneration

Muntean, Marginean, Groza et al.

2024

Diagnostics

Patient compliance in chronic illnesses is essential for disease management. This also applies to age-related macular degeneration (AMD), a chronic acquired retinal degeneration that needs constant monitoring and patient cooperation. Therefore, patients with AMD can benefit from being properly informed about their disease, regardless of the condition’s stage. Information is essential in keeping them compliant with lifestyle changes, regular monitoring, and treatment. Large language models have shown potential in numerous fields, including medicine, with remarkable use cases. In this paper, we assessed the capacity of two large language models (LLMs), ChatGPT4 and PaLM2, to answer questions frequently asked by patients with AMD. After searching AMD-patient-dedicated websites for frequently asked questions, we curated and selected 143 questions. The questions were then transformed into scenarios that were answered by ChatGPT4, PaLM2, and three ophthalmologists. Afterwards, the answers provided by the two LLMs to a set of 133 questions were evaluated by two ophthalmologists, who graded each answer on a five-point Likert scale. The models were evaluated based on six qualitative criteria: (C1) reflects clinical and scientific consensus, (C2) likelihood of possible harm, (C3) evidence of correct reasoning, (C4) evidence of correct comprehension, (C5) evidence of correct retrieval, and (C6) missing content. Out of 133 questions, ChatGPT4 received a score of five from both reviewers for 118 questions (88.72%) on C1, 130 (97.74%) on C2, 131 (98.50%) on C3, 133 (100%) on C4, 132 (99.25%) on C5, and 122 (91.73%) on C6, while PaLM2 did so for 81 questions (60.90%) on C1, 114 (85.71%) on C2, 115 (86.47%) on C3, 124 (93.23%) on C4, 113 (84.97%) on C5, and 93 (69.92%) on C6. Despite the overall high performance, some answers were incomplete or inaccurate, and the paper explores the types of errors produced by these LLMs. Our study reveals that ChatGPT4 and PaLM2 are valuable instruments for patient information and education; however, since these models still have some limitations, they should be used in addition to the advice provided by physicians.
