Objectives
Evaluate and compare the ability of large language models (LLMs) to diagnose various ailments in otolaryngology.

Methods
We collected all 100 clinical vignettes from the second edition of Otolaryngology Cases: The University of Cincinnati Clinical Portfolio by Pensak et al. We prompted ChatGPT‐3.5, Google Bard, and Bing‐GPT4 to provide a diagnosis for each vignette, prefacing each with the prompt “Provide a diagnosis given the following history.” Each diagnosis was compared with the one given in the portfolio, and its accuracy was recorded. All queries were run in June 2023.

Results
ChatGPT‐3.5 was the most accurate model (89% success rate), followed by Google Bard (82%) and Bing‐GPT4 (74%). A chi‐squared test revealed a significant difference between the three LLMs in providing correct diagnoses (p = 0.023). Of the 100 vignettes, seven require additional testing results (e.g., biopsy, non‐contrast CT) for accurate clinical diagnosis. When these vignettes were omitted, the revised success rates were 95.7% for ChatGPT‐3.5, 88.17% for Google Bard, and 78.72% for Bing‐GPT4 (p = 0.002).

Conclusions
ChatGPT‐3.5 offers the most accurate diagnoses when given established clinical vignettes, as compared to Google Bard and Bing‐GPT4. LLMs may accurately offer assessments for common otolaryngology conditions but currently require detailed prompt information and critical supervision from clinicians. There is vast potential in the clinical applicability of LLMs; however, practitioners should be wary of possible “hallucinations” and misinformation in responses.

Level of Evidence
3

Laryngoscope, 2024
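
The chi‐squared comparison reported in Results can be checked from the published counts alone. The sketch below is a minimal illustration, not the authors' analysis code: it assumes a Pearson chi‐squared test of independence on a 3 × 2 table (model by correct/incorrect) with no continuity correction, and the names `correct`, `n_vignettes`, `pearson_chi_squared`, and `p_value_two_dof` are hypothetical. With 2 degrees of freedom the chi‐squared survival function reduces to exp(−x/2), so no statistics library is needed.

```python
import math

# Correct diagnoses out of 100 vignettes, taken from the Results section.
correct = {"ChatGPT-3.5": 89, "Google Bard": 82, "Bing-GPT4": 74}
n_vignettes = 100  # every model was given the same 100 vignettes


def pearson_chi_squared(correct_counts, n):
    """Pearson chi-squared statistic for a k x 2 table (correct vs. incorrect).

    Assumes each of the k groups has the same number of trials, n.
    Returns (statistic, degrees_of_freedom).
    """
    k = len(correct_counts)
    # Under independence, every model shares the pooled success rate.
    expected_correct = sum(correct_counts) / k
    expected_incorrect = n - expected_correct
    statistic = 0.0
    for c in correct_counts:
        statistic += (c - expected_correct) ** 2 / expected_correct
        statistic += ((n - c) - expected_incorrect) ** 2 / expected_incorrect
    return statistic, k - 1


def p_value_two_dof(statistic):
    """Upper-tail p-value of a chi-squared variable with exactly 2 degrees of freedom."""
    return math.exp(-statistic / 2)


stat, dof = pearson_chi_squared(list(correct.values()), n_vignettes)
assert dof == 2  # the closed-form p-value below is valid only for 2 degrees of freedom
p = p_value_two_dof(stat)
print(f"chi-squared = {stat:.3f}, dof = {dof}, p = {p:.3f}")
```

Under these assumptions the statistic is about 7.53 on 2 degrees of freedom, giving p ≈ 0.023, in line with the value reported for the full set of 100 vignettes.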