ChatGPT-3.5 and ChatGPT-4 dermatological knowledge level based on the Specialty Certificate Examination in Dermatology

Lewandowski, Miłosz; Łukowicz, Paweł; Świetlik, Dariusz; Barańska‐Rybak, Wioletta

doi:10.1093/ced/llad255

Cited by 39 publications

(10 citation statements)

References 14 publications

“…When our results are placed alongside studies from other disciplines that have examined ChatGPT's performance on single-answer and multiple-choice questions, we observe a similar trend: a roughly 20% performance improvement with ChatGPT 4 over ChatGPT 3.5 [6, 23-25]. Studies in otolaryngology, closely related to audiology, have also shown lower performance scores for ChatGPT 3.5, supporting our findings of improved outcomes with the newer version [26, 27].…”

Section: Discussion (supporting)

confidence: 83%

“…Presently, two versions of ChatGPT are accessible to the general public: the freely available ChatGPT 3.5, based on an earlier LLM, and the more advanced, subscription-based ChatGPT 4. Research has shown that ChatGPT 4 outperforms its predecessor, demonstrating superior accuracy in various fields such as dermatology, where it achieved 80-85% accuracy compared to 60-70% for ChatGPT 3.5 [6]. Similar improvements are observed in orthopedic assessments and in general medical examinations, with ChatGPT 4 consistently outperforming ChatGPT 3.5 [7,8].…”

Section: Introduction (mentioning)

confidence: 71%

“…Accuracy can vary significantly based on the topic and the complexity of the questions [28]. For instance, studies have shown that the accuracy of ChatGPT 3.5 can range from as high as 70% [6] to as low as 43% [27]. A critical question then arises: are the responses consistent across different trials at various times?…”

Section: Discussion (mentioning)

confidence: 99%

Accuracy and Repeatability of ChatGPT Based on a Set of Multiple-Choice Questions on Objective Tests of Hearing

Kochanek, Skarzynski, Jedrzejczak (2024), Cureus

Introduction: ChatGPT has been tested in many disciplines, but only a few have involved hearing diagnosis and none to physiology or audiology more generally. The consistency of the chatbot's responses to the same question posed multiple times has not been well investigated either. This study aimed to assess the accuracy and repeatability of ChatGPT 3.5 and 4 on test questions concerning objective measures of hearing. Of particular interest was the short-term repeatability of responses which was here tested on four separate days extended over one week. Methods: We used 30 single-answer, multiple-choice exam questions from a one-year course on objective methods of testing hearing. The questions were posed five times to both ChatGPT 3.5 (the free version) and ChatGPT 4 (the paid version) on each of four days (two days one week and two days the following week). The accuracy of the responses was evaluated in terms of a response key. To evaluate the repeatability of the responses over time, percent agreement and Cohen's Kappa were calculated. Results: The overall accuracy of ChatGPT 3.5 was 48-49%, while that of ChatGPT 4 was 65-69%. ChatGPT 3.5 consistently failed to pass the threshold of 50% correct responses. Within a single day, the percent agreement was 76-79% for ChatGPT 3.5 and 87-88% for ChatGPT 4 (Cohen's Kappa 0.67-0.71 and 0.81-0.84 respectively). The percent agreement between responses from different days was 75-79% for ChatGPT 3.5 and 85-88% for ChatGPT 4 (Cohen's Kappa 0.65-0.69 and 0.80-0.85 respectively). Conclusion: ChatGPT 4 outperforms ChatGPT 3.5 both in accuracy and higher repeatability over time. However, the great variability of the responses casts doubt on possible professional applications of both versions.
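
The two repeatability measures named in the abstract above, percent agreement and Cohen's Kappa, can be illustrated with a minimal Python sketch. This is not the authors' analysis code: the function names `percent_agreement` and `cohens_kappa` and the answer lists `run_1` and `run_2` are hypothetical, the answers are invented for illustration, and the sketch compares only two runs, whereas the study pooled five repetitions on each of four days.

```python
from collections import Counter


def percent_agreement(run_a, run_b):
    """Fraction of questions on which two runs gave the same answer."""
    if len(run_a) != len(run_b) or not run_a:
        raise ValueError("runs must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(run_a, run_b))
    return matches / len(run_a)


def cohens_kappa(run_a, run_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(run_a)
    p_o = percent_agreement(run_a, run_b)
    counts_a = Counter(run_a)
    counts_b = Counter(run_b)
    # Chance agreement: probability that both runs pick the same option
    # if each run chose independently with its own observed frequencies.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both runs always gave the same single answer
    return (p_o - p_e) / (1 - p_e)


# Invented answers to ten multiple-choice questions (NOT data from the study).
run_1 = list("ABCDABCDAB")
run_2 = list("ABCDABCAAC")

print(f"percent agreement: {percent_agreement(run_1, run_2):.2f}")
print(f"Cohen's kappa:     {cohens_kappa(run_1, run_2):.2f}")
```

Kappa is lower than raw agreement because some matches would occur by chance alone, which is why the abstract reports both figures side by side.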

“…While ChatGPT-4 demonstrates proficiency in conducting general conversations in multiple languages, its capacity for medical reasoning and understanding remains to be thoroughly assessed. Several studies have indicated ChatGPT’s competence in executing single medical task commands, such as answering multiple-choice questions from exams like the United States Medical Licensing Exam [5, 6] and various medical specialty exams [6]. However, ChatGPT-4 struggles with logical questions [7] and occasionally fabricates responses [8].…”

Section: Introduction (mentioning)

confidence: 99%

ChatGPT provides inconsistent risk-stratification of patients with atraumatic chest pain

Heston, Lewis (2024), PLoS ONE

Background: ChatGPT-4 is a large language model with promising healthcare applications. However, its ability to analyze complex clinical data and provide consistent results is poorly known. Compared to validated tools, this study evaluated ChatGPT-4’s risk stratification of simulated patients with acute nontraumatic chest pain. Methods: Three datasets of simulated case studies were created: one based on the TIMI score variables, another on HEART score variables, and a third comprising 44 randomized variables related to non-traumatic chest pain presentations. ChatGPT-4 independently scored each dataset five times. Its risk scores were compared to calculated TIMI and HEART scores. A model trained on 44 clinical variables was evaluated for consistency. Results: ChatGPT-4 showed a high correlation with TIMI and HEART scores (r = 0.898 and 0.928, respectively), but the distribution of individual risk assessments was broad. ChatGPT-4 gave a different risk 45–48% of the time for a fixed TIMI or HEART score. On the 44-variable model, a majority of the five ChatGPT-4 models agreed on a diagnosis category only 56% of the time, and risk scores were poorly correlated (r = 0.605). Conclusion: While ChatGPT-4 correlates closely with established risk stratification tools regarding mean scores, its inconsistency when presented with identical patient data on separate occasions raises concerns about its reliability. The findings suggest that while large language models like ChatGPT-4 hold promise for healthcare applications, further refinement and customization are necessary, particularly in the clinical risk assessment of atraumatic chest pain patients.

“…While ChatGPT can carry on general conversations in multiple languages, its ability to reason and understand the language of medicine needs further evaluation. Multiple studies show that ChatGPT does well on single medical task commands such as answering multiple-choice examination questions such as the United States Medical Licensing Exam (5,6) and medical specialty exams (6). However, ChatGPT needs help with logical questions (7) and tends to fabricate answers (8).…”

Section: Introduction (mentioning)

confidence: 99%

ChatGPT Provides Inconsistent Risk-Stratification of Patients With Atraumatic Chest Pain

Heston, Lewis (2023), Preprint

BACKGROUND: ChatGPT is a large language model with promising healthcare applications. However, its ability to analyze complex clinical data and provide consistent results is poorly known. This study evaluated ChatGPT-4’s risk stratification of simulated patients with acute nontraumatic chest pain compared to validated tools. METHODS: Three datasets of simulated case studies were created: one based on the TIMI score variables, another on HEART score variables, and a third comprising 44 randomized variables related to non-traumatic chest pain presentations. ChatGPT independently scored each dataset five times. Its risk scores were compared to calculated TIMI and HEART scores. A model trained on 44 clinical variables was evaluated for consistency. RESULTS: ChatGPT showed a high correlation with TIMI and HEART scores (r = 0.898 and 0.928, respectively), but the distribution of individual risk assessments was broad. ChatGPT gave a different risk 45-48% of the time for a fixed TIMI or HEART score. On the 44 variable model, a majority of the five ChatGPT models agreed on a diagnosis category only 56% of the time, and risk scores were poorly correlated (r = 0.605). ChatGPT assigned higher risk scores to males and African Americans. CONCLUSION: While ChatGPT correlates closely with established risk stratification tools regarding mean scores, its inconsistency when presented with identical patient data on separate occasions raises concerns about its reliability. The findings suggest that while large language models like ChatGPT hold promise for healthcare applications, further refinement and customization are necessary, particularly in the clinical risk assessment of atraumatic chest pain patients.
