Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination

Kung, Justin E.; Marshall, Christopher; Gauthier, Chase; Gonzalez, Tyler A.; Jackson, J. Benjamin

doi:10.2106/jbjs.oa.23.00056

Cited by 37 publications

(22 citation statements)

References 13 publications

Supporting

Mentioning

Contrasting

Unclassified

Order By: Relevance

“…This is particularly alarming considering that the access is free for this version, which is the most used. The substantial increase in performance of version 4 is consistent with other studies 5,[7][8][9]15 and likely stems from the foundational differences in training and algorithmic sophistication between the two versions. 1 It may partially be attributed to the integration of a rule-based reward model in version 4.…”

Section: Discussionsupporting

confidence: 87%

“…Several studies have assessed the reliability of ChatGPT, each using different methodologies and subsequently reporting varying levels of performance 5–11,15 . Studies using common questions of interest to patients and laypeople usually show high accuracy rates.…”

Section: Discussionmentioning

confidence: 99%

“…Several studies have assessed the reliability of ChatGPT, each using different methodologies and subsequently reporting varying levels of performance. [5][6][7][8][9][10][11]15 Studies using common questions of interest to patients and laypeople usually show high accuracy rates. In a recent study evaluating the chatbot recommendations regarding breast cancer prevention and screening, an accuracy rate of 88% for ChatGPT 3.5 was reported.…”

Section: Discussionmentioning

confidence: 99%

“…This tool has shown great promise in aiding healthcare professionals by providing quick and accessible information, assisting in decision‐making processes, and improving patient care 2 . The use of ChatGPT in clinical practice by both physicians and patients has been subject of several studies since its public release in 2022 2–11 . The advent of ChatGPT has marked a significant advancement in the application of AI in the medical field, where rapid access to accurate information can be critical.…”

Section: Introductionmentioning

confidence: 99%

“…2 The use of ChatGPT in clinical practice by both physicians and patients has been subject of several studies since its public release in 2022. [2][3][4][5][6][7][8][9][10][11] The advent of ChatGPT has marked a significant advancement in the application of AI in the medical field, where rapid access to accurate information can be critical.…”

Section: Introductionmentioning

confidence: 99%

See 4 more Smart Citations

Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence

Pinto,

de Azevedo,

Wroclawski

et al. 2024

Neurourology and Urodynamics

View full text Add to dashboard Cite

IntroductionArtificial intelligence (AI) shows immense potential in medicine and Chat generative pretrained transformer (ChatGPT) has been used for different purposes in the field. However, it may not match the complexity and nuance of certain medical scenarios. This study evaluates the accuracy of ChatGPT 3.5 and 4 in providing recommendations regarding the management of postprostatectomy urinary incontinence (PPUI), considering The Incontinence After Prostate Treatment: AUA/SUFU Guideline as the best practice benchmark.Materials and MethodsA set of questions based on the AUA/SUFU Guideline was prepared. Queries included 10 conceptual questions and 10 case‐based questions. All questions were open and entered into the ChatGPT with a recommendation to limit the answer to 200 words, for greater objectivity. Responses were graded as correct (1 point); partially correct (0.5 point), or incorrect (0 point). Performances of versions 3.5 and 4 of ChatGPT were analyzed overall and separately for the conceptual and the case‐based questions.ResultsChatGPT 3.5 scored 11.5 out of 20 points (57.5% accuracy), while ChatGPT 4 scored 18 (90.0%; p = 0.031). In the conceptual questions, ChatGPT 3.5 provided accurate answers to six questions along with one partially correct response and three incorrect answers, with a final score of 6.5. In contrast, ChatGPT 4 provided correct answers to eight questions and partially correct answers to two questions, scoring 9.0. In the case‐based questions, ChatGPT 3.5 scored 5.0, while ChatGPT 4 scored 9.0. The domains where ChatGPT performed worst were evaluation, treatment options, surgical complications, and special situations.ConclusionChatGPT 4 demonstrated superior performance compared to ChatGPT 3.5 in providing recommendations for the management of PPUI, using the AUA/SUFU Guideline as a benchmark. Continuous monitoring is essential for evaluating the development and precision of AI‐generated medical information.

show abstract

Section: Discussionsupporting

confidence: 87%

Section: Discussionmentioning

confidence: 99%

Section: Discussionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

Section: Introductionmentioning

confidence: 99%

See 3 more Smart Citations

Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence

Pinto,

de Azevedo,

Wroclawski

et al. 2024

Neurourology and Urodynamics

View full text Add to dashboard Cite

show abstract

GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections

Bortoli,

Fiore,

Tedeschi

et al. 2024

Musculoskelet Surg

View full text Add to dashboard Cite

Background Artificial intelligence chatbot tools responses might discern patterns and correlations that may elude human observation, leading to more accurate and timely interventions. However, their reliability to answer healthcare-related questions is still debated. This study aimed to assess the performance of the three versions of GPT-based chatbots about prosthetic joint infections (PJI). Methods Thirty questions concerning the diagnosis and treatment of hip and knee PJIs, stratified by a priori established difficulty, were generated by a team of experts, and administered to ChatGPT 3.5, BingChat, and ChatGPT 4.0. Responses were rated by three orthopedic surgeons and two infectious diseases physicians using a five-point Likert-like scale with numerical values to quantify the quality of responses. Inter-rater reliability was assessed by interclass correlation statistics. Results Responses averaged “good-to-very good” for all chatbots examined, both in diagnosis and treatment, with no significant differences according to the difficulty of the questions. However, BingChat ratings were significantly lower in the treatment setting (p = 0.025), particularly in terms of accuracy (p = 0.02) and completeness (p = 0.004). Agreement in ratings among examiners appeared to be very poor. Conclusions On average, the quality of responses is rated positively by experts, but with ratings that frequently may vary widely. This currently suggests that AI chatbot tools are still unreliable in the management of PJI.

show abstract

Use and Application of Large Language Models for Patient Questions Following Total Knee Arthroplasty

Bains,

Dubin,

Hameed

et al. 2024

The Journal of Arthroplasty

View full text Add to dashboard Cite

Evaluating ChatGPT Performance on the Orthopaedic In-Training Examination

Cited by 37 publications

References 13 publications

Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence

Conformity of ChatGPT recommendations with the AUA/SUFU guideline on postprostatectomy urinary incontinence

GPT-based chatbot tools are still unreliable in the management of prosthetic joint infections

Use and Application of Large Language Models for Patient Questions Following Total Knee Arthroplasty

Contact Info

Product

Resources

About