Introduction: ChatGPT has revolutionized many aspects of modern life, including scientific work. Since its release, new versions have been introduced and advertised as offering better performance. But is this true? This study aimed to assess the accuracy and consistency of six versions of ChatGPT (3.5, 4, 4o mini, 4o, o1 mini, and o1 preview). Of particular interest was the variability of the responses given when the same question was asked multiple times.

Methods: We evaluated six versions of ChatGPT based on their responses to 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of testing hearing. The questions were posed 10 times to each version of ChatGPT over two days (five times each day). The accuracy of the responses was scored against an answer key. To evaluate the consistency (repeatability) of the responses over time, percent agreement and Cohen's kappa were calculated.

Results: The overall accuracy of ChatGPT increased with each version, starting from around 53% for version 3.5 and rising to 86% for o1 preview. The greatest improvement in accuracy and repeatability came with the introduction of version 4o. Repeatability rose progressively with newer releases, with the exception of o1 mini. While the current top version, o1 preview, had repeatability similar to that of 4o, the faster o1 mini had significantly lower repeatability than the older 4o mini.

Conclusion: Newer versions of ChatGPT generally show improvements in accuracy, but not consistently in repeatability. The variability of responses is probably the main current limitation of ChatGPT for professional applications. Users should be especially careful with o1 mini.
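
To illustrate the repeatability metrics described in Methods, the following is a minimal Python sketch (not the study's actual analysis code) of how percent agreement and Cohen's kappa might be computed between two runs of the same question set. The function names and the example answer data are hypothetical.

```python
from collections import Counter

def percent_agreement(run_a, run_b):
    """Fraction of questions answered identically in two runs."""
    return sum(a == b for a, b in zip(run_a, run_b)) / len(run_a)

def cohens_kappa(run_a, run_b):
    """Cohen's kappa: observed agreement between two runs, corrected
    for the agreement expected by chance."""
    n = len(run_a)
    p_o = percent_agreement(run_a, run_b)
    freq_a, freq_b = Counter(run_a), Counter(run_b)
    # Chance agreement from each run's marginal answer frequencies
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

# Hypothetical example: answers (A-D) to 10 questions on two days
day1 = list("ABCDABCDAB")
day2 = list("ABCDABCCAB")
print(percent_agreement(day1, day2))  # 0.9
print(cohens_kappa(day1, day2))       # ~0.86
```

Kappa is lower than raw percent agreement whenever some answers agree merely by chance, which is why studies of this kind typically report both.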