Performance of ChatGPT on UK Standardized Admission Tests: Insights From the BMAT, TMUA, LNAT, and TSA Examinations

Giannos, Panagiotis; Delardas, Orestis

doi:10.2196/47737

Cited by 60 publications

(31 citation statements)

References 7 publications


“…Yet, the AI model struggles with more complex tasks requiring advanced comprehension, analytical abilities, and precise calculations. As indicated by a number of studies, 16,[20][21][22] ChatGPT's limitations in handling scientific and mathematical applications, particularly those demanding high-level cognitive engagement, become evident. Fluctuations in accuracy may be linked to the nature of subfield questions, even without explicit categorization.…”

Section: Discussion (mentioning)

confidence: 99%

Performance of ChatGPT on Nephrology Test Questions

Miao, Thongprayoon, Garcia Valencia et al. 2023. CJASN


Background: ChatGPT is a novel tool that allows people to engage in conversations with an advanced machine learning model. ChatGPT's performance on the United States Medical Licensing Examination is comparable to that of a successful candidate. However, its performance in the field of nephrology remains undetermined. This study assessed ChatGPT's capabilities in answering nephrology test questions. Methods: Questions were sourced from the Nephrology Self-Assessment Program and the Kidney Self-Assessment Program, both of which consist of multiple-choice, single-answer questions. Questions containing visual elements were excluded. Each question bank was run twice using GPT-3.5 and GPT-4. Performance was assessed with two measures: the total accuracy rate, defined as the percentage of correct answers obtained by ChatGPT in either the first or second run, and the total concordance, defined as the percentage of identical answers provided by ChatGPT during both runs, regardless of their correctness. Results: A total of 975 questions were assessed, comprising 508 questions from the Nephrology Self-Assessment Program and 467 from the Kidney Self-Assessment Program. GPT-3.5 achieved a total accuracy rate of 51%. Notably, accuracy was higher on Nephrology Self-Assessment Program questions than on Kidney Self-Assessment Program questions (58% vs. 44%; p<0.001). The total concordance rate across all questions was 78%, with correct answers exhibiting a higher concordance rate (84%) than incorrect answers (73%) (p<0.001). Across nephrology subfields, total accuracy rates were relatively lower in electrolyte and acid-base disorders, glomerular disease, and kidney-related bone and stone disorders. The total accuracy rate of GPT-4 was 74%, higher than that of GPT-3.5 (p<0.001), but remained below the passing threshold and the average scores of nephrology examinees (77%).
Conclusions: ChatGPT exhibited limitations regarding accuracy and repeatability when addressing nephrology-related questions. Variations in performance were evident across various subfields.
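
The two measures defined in the Methods above can be illustrated with a minimal sketch. It is not the authors' code. The function names and the four-question answer key are hypothetical, and "total accuracy rate" is read here as accuracy pooled over both runs, which is only one possible interpretation of the abstract's wording.

```python
from typing import Sequence


def accuracy_rate(answers: Sequence[str], key: Sequence[str]) -> float:
    """Percentage of answers in a single run that match the answer key."""
    if not key or len(answers) != len(key):
        raise ValueError("answers and key must be non-empty and equal in length")
    correct = sum(a == k for a, k in zip(answers, key))
    return 100.0 * correct / len(key)


def total_accuracy_rate(
    run1: Sequence[str], run2: Sequence[str], key: Sequence[str]
) -> float:
    """Accuracy pooled over both runs (assumed reading of 'total accuracy rate')."""
    return (accuracy_rate(run1, key) + accuracy_rate(run2, key)) / 2.0


def total_concordance(run1: Sequence[str], run2: Sequence[str]) -> float:
    """Percentage of questions answered identically in both runs,
    regardless of whether the answers are correct."""
    if not run1 or len(run1) != len(run2):
        raise ValueError("runs must be non-empty and equal in length")
    same = sum(a == b for a, b in zip(run1, run2))
    return 100.0 * same / len(run1)


# Hypothetical toy data, not taken from the study.
key = ["A", "B", "C", "D"]
run1 = ["A", "B", "D", "D"]  # 3 of 4 correct
run2 = ["A", "C", "D", "D"]  # 2 of 4 correct

print(total_accuracy_rate(run1, run2, key))  # 62.5
print(total_concordance(run1, run2))         # 75.0
```

In the toy data the two runs agree on question 3 even though both answers are wrong, which shows why concordance measures repeatability rather than correctness.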


“…However, reported rates of correct answers vary dramatically across different examinations and medical fields. 3,4 We aimed to conduct a meta-analysis of studies reporting ChatGPT's performance in medical examinations with multiple-choice questions.…”

Section: Objective (mentioning)

confidence: 99%

“…ChatGPT's performance in different medical knowledge examinations has been recently studied in various medical disciplines. However, reported rates of correct answers vary dramatically across different examinations and medical fields 3,4 . We aimed to conduct a meta‐analysis of studies reporting ChatGPT's performance in medical examinations with multiple‐choice questions.…”

Section: Objective (mentioning)

confidence: 99%

Performance of ChatGPT in medical examinations: A systematic review and a meta‐analysis

Levin, Horesh, Brezinov et al. 2023. BJOG


“…One prominent illustration of this is the Generative Pre-Trained Transformer (GPT), released by Open AI in 2018 [1]. GPT 4.0 has proven remarkable ability in assessing knowledge in specialised domains such as medicine, law, and business [2][3][4]-areas that have historically been the exclusive purview of professionals. Particularly noteworthy is its exceptional performance on assessments like the Korean general surgery board exam, the United States Medical Licensing Exam, and the Wharton MBA final exam, each achieved without the finetuning of the pretrained model [5][6][7].…”

Section: Introduction (mentioning)

confidence: 99%

Artificial intelligence model GPT4 narrowly fails simulated radiological protection exam

Roemer, Li, Mahmood et al. 2024. J. Radiol. Prot.


This study assesses the efficacy of Generative Pre-Trained Transformers (GPT) published by OpenAI in the specialized domains of radiological protection and health physics. Utilizing a set of 1064 surrogate questions designed to mimic a health physics certification exam, we evaluated the models' ability to accurately respond to questions across five knowledge domains. Our results indicated that neither model met the 67% passing threshold, with GPT-3.5 achieving a 45.3% weighted average and GPT-4 attaining 61.7%. Despite GPT-4's significant parameter increase and multimodal capabilities, it demonstrated superior performance in all categories yet still fell short of a passing score. The study's methodology involved a simple, standardized prompting strategy without employing prompt engineering or in-context learning, which are known to potentially enhance performance. The analysis revealed that GPT-3.5 formatted answers more correctly, despite GPT-4's higher overall accuracy. The findings suggest that while GPT-3.5 and GPT-4 show promise in handling domain-specific content, their application in the field of radiological protection should be approached with caution, emphasizing the need for human oversight and verification.
