2023
DOI: 10.1101/2023.03.24.23287731
Preprint

Performance of ChatGPT on free-response, clinical reasoning exams

Abstract: Importance: Studies show that ChatGPT, a general-purpose large language model chatbot, could pass the multiple-choice US Medical Licensing Exams, but the model's performance on open-ended clinical reasoning is unknown. Objective: To determine if ChatGPT is capable of consistently meeting the passing threshold on free-response, case-based clinical reasoning assessments. Design: Fourteen multi-part cases were selected from clinical reasoning exams administered to pre-clerkship medical students between 2019 and 2…

Cited by 35 publications (8 citation statements) | References 2 publications
“…This could lead to faster and more accurate diagnoses, improving patient outcomes. [3][4][5] Second, ChatGPT can help synthesize and analyze vast amounts of medical literature, which could lead to the discovery of new treatments, medications, or a better understanding of diseases. The potential of ChatGPT in medical writing is currently the most studied and discussed in the literature.…”
mentioning
confidence: 99%
“…This indicates a significant degree of variability in ChatGPT's responses, even when faced with identical scenarios. 42 Another study aimed to evaluate ChatGPT's capacity for ongoing clinical decision support. The research involved inputting published clinical vignettes into ChatGPT-3.5 and assessing its accuracy in various areas such as differential diagnoses, diagnostic testing, final diagnosis, and management.…”
Section: Discussion
mentioning
confidence: 99%
“…Yet, any evaluation of these tools must be context-specific and rigorous. This assessment is particularly relevant, urgent, and novel for complex conditions, such as ODS, where optimal multidisciplinary diagnostic and therapeutic paradigms are challenging to establish due to a relatively scarce body of recently published evidence [ 11 ]. This study aimed to evaluate the reliability of two versions of the ChatGPT LLM in managing ODS.…”
Section: Discussion
mentioning
confidence: 99%