Background:
ChatGPT is a novel tool that allows people to converse with an advanced machine learning model. Its performance on the United States Medical Licensing Examination is comparable to that of a successful candidate. However, its performance in the field of nephrology remains undetermined. This study assessed ChatGPT's ability to answer nephrology test questions.
Methods:
Questions were sourced from the Nephrology Self-Assessment Program and the Kidney Self-Assessment Program, both of which consist of multiple-choice, single-answer questions. Questions containing visual elements were excluded. Each question bank was run twice using GPT-3.5 and GPT-4. Performance was assessed with two measures: the total accuracy rate, defined as the percentage of correct answers obtained by ChatGPT in either the first or second run, and the total concordance, defined as the percentage of identical answers provided by ChatGPT in both runs, regardless of their correctness.
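The two measures defined above can be illustrated with a minimal Python sketch. This is not the study's analysis code. The function name `accuracy_and_concordance` and the sample answers are hypothetical, and the sketch assumes one reading of the accuracy definition: correct responses pooled over both runs, divided by the total number of responses. Under a different reading (a question counts as correct if either run is correct), the accuracy figure would differ.

```python
from typing import Sequence, Tuple


def accuracy_and_concordance(
    run1: Sequence[str], run2: Sequence[str], key: Sequence[str]
) -> Tuple[float, float]:
    """Return (total accuracy %, total concordance %) for two runs.

    Assumption: accuracy is pooled over both runs, i.e. the number of
    correct responses in run 1 plus run 2, divided by all responses.
    """
    if not key or not (len(run1) == len(run2) == len(key)):
        raise ValueError("runs and answer key must be non-empty and equal in length")

    n = len(key)
    # Count correct responses in each run against the answer key.
    correct_run1 = sum(a == k for a, k in zip(run1, key))
    correct_run2 = sum(b == k for b, k in zip(run2, key))
    accuracy = 100.0 * (correct_run1 + correct_run2) / (2 * n)

    # Concordance ignores correctness: only agreement between runs counts.
    concordance = 100.0 * sum(a == b for a, b in zip(run1, run2)) / n
    return accuracy, concordance


# Hypothetical four-question example (not study data).
key = ["A", "C", "B", "D"]
run1 = ["A", "C", "D", "B"]
run2 = ["A", "C", "D", "D"]
accuracy, concordance = accuracy_and_concordance(run1, run2, key)
print(f"accuracy={accuracy:.1f}%, concordance={concordance:.1f}%")
```

In the example, the third question is answered identically but incorrectly in both runs, so it raises concordance without raising accuracy.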
Results:
A total of 975 questions were assessed, comprising 508 from the Nephrology Self-Assessment Program and 467 from the Kidney Self-Assessment Program. GPT-3.5 achieved a total accuracy rate of 51%. Accuracy was higher on Nephrology Self-Assessment Program questions than on Kidney Self-Assessment Program questions (58% vs. 44%; p<0.001). The total concordance rate across all questions was 78%, with correct answers showing a higher concordance rate (84%) than incorrect answers (73%) (p<0.001). Across nephrology subfields, total accuracy rates were relatively lower in electrolyte and acid-base disorders, glomerular disease, and kidney-related bone and stone disorders. The total accuracy rate of GPT-4 was 74%, higher than that of GPT-3.5 (p<0.001) but still below the passing threshold and the average score of nephrology examinees (77%).
Conclusions:
ChatGPT exhibited limitations in accuracy and repeatability when answering nephrology-related questions. Performance varied across nephrology subfields.