Background: The field of artificial intelligence is rapidly evolving. As an easily accessible platform with vast user engagement, the Chat Generative Pre-Trained Transformer (ChatGPT) holds great promise in medicine, with the latest version, GPT-4, capable of analyzing clinical images.
Objectives: To evaluate ChatGPT as a diagnostic tool and information source in clinical dermatology.
Methods: A total of 15 clinical images were selected from the Danish web atlas Danderm, depicting various common and rare skin conditions. The images were uploaded to ChatGPT (version GPT-4), which was prompted with 'Please provide a description, a potential diagnosis, and treatment options for the following dermatological condition'. The generated responses were assessed by senior registrars in dermatology and consultant dermatologists in terms of accuracy, relevance, and depth (scale 1-5); in addition, the image quality was rated (scale 0-10). Demographic and professional information about the respondents was registered.
Results: A total of 23 physicians participated in the study. The majority of the respondents were consultant dermatologists (83%), and 48% had more than 10 years of training. The overall image quality had a median rating of 10 out of 10 [interquartile range (IQR): 9-10]. The overall median rating of the ChatGPT-generated responses was 2 (IQR: 1-4), while the overall median ratings in terms of relevance, accuracy, and depth were 2 (IQR: 1-4), 3 (IQR: 2-4), and 2 (IQR: 1-3), respectively.
Conclusions: Despite the advancements in ChatGPT, including newly added image processing capabilities, the chatbot demonstrated significant limitations in providing reliable and clinically useful responses to illustrative images of various dermatological conditions.
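The ratings in this study are summarized as medians with interquartile ranges. As a generic illustration of how such summaries can be computed (this is a minimal sketch, not the authors' analysis code, and the rating values below are invented placeholders rather than study data), the following Python snippet uses NumPy:

```python
import numpy as np

# Placeholder reviewer ratings on the study's 1-5 scale
# (invented for illustration; not the actual study data).
ratings = np.array([2, 1, 4, 2, 3, 1, 2, 4, 1, 3, 2, 2])

median = np.median(ratings)                # central tendency of the ratings
q1, q3 = np.percentile(ratings, [25, 75])  # interquartile range bounds

print(f"Median rating: {median:.0f} (IQR: {q1:.0f}-{q3:.0f})")
```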
Objective: The purpose of this study was to evaluate the performance of advanced large language models from OpenAI (GPT-3.5 and GPT-4), Google (PaLM2 and MedPaLM), and an open-source model from Meta (Llama3:70b) in answering clinical multiple-choice test questions in the field of otolaryngology-head and neck surgery.
Methods: A dataset of 4566 otolaryngology questions was used; each model was provided a standardized prompt followed by a question. One hundred questions that were answered incorrectly by all models were further interrogated to gain insight into the causes of incorrect answers.
Results: GPT-4 was the most accurate, correctly answering 3520 of 4566 questions (77.1%). MedPaLM correctly answered 3223 of 4566 (70.6%) questions, while Llama3:70b, GPT-3.5, and PaLM2 were correct on 3052 of 4566 (66.8%), 2672 of 4566 (58.5%), and 2583 of 4566 (56.5%) questions, respectively. Three hundred and sixty-nine questions were answered incorrectly by all models. Prompts to provide reasoning improved accuracy in all models: GPT-4 changed from an incorrect to a correct answer 31% of the time, while GPT-3.5, Llama3, PaLM2, and MedPaLM corrected their responses 25%, 18%, 19%, and 17% of the time, respectively.
Conclusion: Large language models vary in their understanding of otolaryngology-specific clinical knowledge. OpenAI's GPT-4 has a strong understanding of core concepts as well as detailed information in the field of otolaryngology. Its baseline understanding of this field makes it well suited to serve in roles related to head and neck surgery education, provided that appropriate precautions are taken and potential limitations are understood.
Level of Evidence: N/A. Laryngoscope, 2024.
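For readers who want to see how the accuracy percentages above follow from the raw tallies, here is a minimal Python sketch; the counts are copied from the abstract, the dictionary structure is purely illustrative, and rounding in the last decimal may differ slightly from the published figures.

```python
# Correct-answer tallies reported in the abstract (out of 4566 questions).
correct_counts = {
    "GPT-4": 3520,
    "MedPaLM": 3223,
    "Llama3:70b": 3052,
    "GPT-3.5": 2672,
    "PaLM2": 2583,
}
TOTAL_QUESTIONS = 4566

# Accuracy = correct answers / total questions, expressed as a percentage.
for model, correct in correct_counts.items():
    accuracy = 100 * correct / TOTAL_QUESTIONS
    print(f"{model}: {correct}/{TOTAL_QUESTIONS} correct ({accuracy:.1f}%)")
```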
This review aims to provide a summary of all scientific publications on the use of large language models (LLMs) in medical education over the first year of their availability. A scoping literature review was conducted in accordance with the PRISMA recommendations for scoping reviews. Five scientific literature databases were searched using predefined search terms. The search yielded 1509 initial results, of which 145 studies were ultimately included. Most studies assessed LLMs' capabilities in passing medical exams. Some studies discussed advantages, disadvantages, and potential use cases of LLMs. Very few studies conducted empirical research, and many published studies lack methodological rigor. We therefore propose a research agenda to improve the quality of studies on LLMs.