The Diagnostic and Triage Accuracy of the GPT-3 Artificial Intelligence Model

Levine, David M.; Tuwani, Rudraksh; Kompa, Benjamin; Varma, Amita; Finlayson, Samuel G.; Mehrotra, Ateev; Beam, Andrew L.

doi:10.1101/2023.01.30.23285067

Cited by 60 publications

(54 citation statements)

References 37 publications

“…The diagnostic accuracy of the GPT-3 model is considerably limited. A preprint article revealed the correct diagnosis to be 88% within the three differential-diagnosis lists [ 30 ]. Therefore, the diagnostic accuracy of the differential-diagnosis lists generated by AI chatbots, including ChatGPT-3, is unknown.…”

Section: Introduction (mentioning)

confidence: 99%

Diagnostic Accuracy of Differential-Diagnosis Lists Generated by Generative Pretrained Transformer 3 Chatbot for Clinical Vignettes with Common Chief Complaints: A Pilot Study

Hirosawa, Harada, Yokose, et al. 2023. IJERPH.

The diagnostic accuracy of differential diagnoses generated by artificial intelligence (AI) chatbots, including the generative pretrained transformer 3 (GPT-3) chatbot (ChatGPT-3), is unknown. This study evaluated the accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical vignettes with common chief complaints. General internal medicine physicians created clinical cases, correct diagnoses, and five differential diagnoses for ten common chief complaints. The rate of correct diagnosis by ChatGPT-3 within the ten differential-diagnosis lists was 28/30 (93.3%). The rate of correct diagnosis by physicians was still superior to that by ChatGPT-3 within the five differential-diagnosis lists (98.3% vs. 83.3%, p = 0.03). The rate of correct diagnosis by physicians was also superior to that by ChatGPT-3 in the top diagnosis (93.3% vs. 53.3%, p < 0.001). The rate of consistent differential diagnoses among physicians within the ten differential-diagnosis lists generated by ChatGPT-3 was 62/88 (70.5%). In summary, this study demonstrates the high diagnostic accuracy of differential-diagnosis lists generated by ChatGPT-3 for clinical cases with common chief complaints. This suggests that AI chatbots such as ChatGPT-3 can generate a well-differentiated diagnosis list for common chief complaints. However, the order of these lists can be improved in the future.

“…15 Recently, there has been great interest in utilizing the nascent but powerful chatbot for clinical decision support. 16–18…”

Section: Introduction (mentioning)

confidence: 99%

Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow

Rao, Pang, Kim, et al. 2023. Preprint.

IMPORTANCE: Large language model (LLM) artificial intelligence (AI) chatbots direct the power of large training datasets towards successive, related tasks, as opposed to single-ask tasks, for which AI already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as virtual physicians, has not yet been evaluated. OBJECTIVE: To evaluate ChatGPT's capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. DESIGN: We inputted all 36 published clinical vignettes from the Merck Sharp & Dohme (MSD) Clinical Manual into ChatGPT and compared accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. SETTING: ChatGPT, a publicly available LLM. PARTICIPANTS: Clinical vignettes featured hypothetical patients with a variety of age and gender identities, and a range of Emergency Severity Indices (ESIs) based on initial clinical presentation. EXPOSURES: MSD Clinical Manual vignettes. MAIN OUTCOMES AND MEASURES: We measured the proportion of correct responses to the questions posed within the clinical vignettes tested. RESULTS: ChatGPT achieved 71.7% (95% CI, 69.3% to 74.1%) accuracy overall across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI, 67.8% to 86.1%), and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI, 54.2% to 66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=-15.8%, p<0.001) and clinical management (β=-7.4%, p=0.02) type questions. CONCLUSIONS AND RELEVANCE: ChatGPT achieves impressive accuracy in clinical decision making, with particular strengths emerging as it has more clinical information at its disposal.

“…A very encouraging triage accuracy of 87.2% in our study stands in contrast to recent results on the non-ophthalmological general medical domain published in a preprint by Levine and colleagues, who found a triage accuracy of 71% for GPT-3 and 96% for physicians. [7] Whether this contrast is due to the different testing domains, wording of the individual vignettes or an improvement between the different GPT versions remains unclear. Moreover, we must point out, that a technically high triage accuracy does not imply a great utility of the information on urgency: In our study ChatGPT frequently recommended to consult a physician "as soon as possible", which was judged to be appropriate for the urgency levels "emergency" and "same day".…”

Section: Discussion (mentioning)

confidence: 99%

“…Triage accuracy however was slightly and insignificantly lower compared to laypersons but by far and significantly lower compared to physicians. [7] For "can't-miss diagnoses", the aforementioned study from the Wills Eye emergency department showed a diagnostic accuracy of triaging ophthalmology staff to be as high as 97.2%. [11] We therefore clearly recommend contacting established providers of ophthalmological emergency services in case of acute symptoms.…”

Section: Discussion (mentioning)

confidence: 99%

“…determining the timespan in that he should be referred to an ophthalmologist) and initiation of appropriate preclinical first-aid measures. [4] Encouraging results with regards to differential diagnosis in the general medical domain have been published, [6,7] and ChatGPT has been found useful for simplifying access to information on cardiopulmonary resuscitation. [8] Despite these encouraging results, ChatGPT can also produce wrong information and has been reported to give potentially harmful advice in the ophthalmological and wider medical domains.…”

Section: Introduction (mentioning)

confidence: 99%

Assessment of ChatGPT in the preclinical management of ophthalmological emergencies – an analysis of ten fictional case vignettes

Knebel, Priglinger, Scherer, et al. 2023. Preprint.

Background/Aims: The artificial intelligence (AI) based platform ChatGPT (Chat Generative Pre-Trained Transformer, OpenAI LP, San Francisco, CA, USA) has gained impressive popularity over the past months. Its performance on case vignettes of general medical (non-ophthalmological) emergencies has previously been assessed with very encouraging results. The purpose of this study is to assess the performance of ChatGPT on ophthalmological emergency case vignettes in terms of the main outcome measures: triage accuracy, appropriateness of recommended preclinical measures, and overall potential to inflict harm on the user/patient. Methods: We wrote ten short, fictional case vignettes describing different acute ophthalmological symptoms. Each vignette was entered into ChatGPT five times with the same wording and following a standardized interaction pathway. The answers were analysed in a standardised manner. Results: We observed a triage accuracy of 87.2%. Most answers contained only appropriate recommendations for preclinical measures. However, an overall potential to inflict harm on users/patients was present in 32% of answers. Conclusion: ChatGPT should not be used as a stand-alone primary source of information about acute ophthalmological symptoms. As AI continues to evolve, its safety and efficacy in the preclinical management of ophthalmological emergencies have to be reassessed regularly.
