2023
DOI: 10.1101/2023.10.31.23297825
Preprint

Comparison of Large Language Models in Answering Immuno-Oncology Questions: A Cross-Sectional Study

Giovanni Maria Iannantuono,
Dara Bracken-Clarke,
Fatima Karzai
et al.

Abstract: Background: The capability of large language models (LLMs) to understand and generate human-readable text has prompted the investigation of their potential as educational and management tools for cancer patients and healthcare providers. Materials and Methods: We conducted a cross-sectional study aimed at evaluating the ability of ChatGPT-4, ChatGPT-3.5, and Google Bard to answer questions related to four domains of immuno-oncology (Mechanisms, Indications, Toxicities, and Prognosis). We generated 60 open-ended que…


Cited by 5 publications (2 citation statements) | References 28 publications
“…GPT-4 achieved the highest overall score, followed by Bard and GPT-3.5. This aligns with previous findings where GPT-4 outperformed GPT-3.5 and Bard in terms of overall correct response rates [9, 11, 15]. Because detailed scoring criteria were not announced for all but the essential questions, we were unable to assess whether the LLMs met the JNDE's passing criteria.…”
Section: Discussion (supporting)
confidence: 92%
“…In English-speaking countries, GPT-4 has been reported to meet the passing criteria for both the United States Medical Licensing Examination and the United Kingdom Medical Licensing Assessment [6-8]. Comparative studies between GPT and Bard have demonstrated GPT-4's superiority in answering several professional questions [9-11].…”
Section: Introduction (mentioning)
confidence: 99%