PurposeThe performance of vision-language models (VLMs) with image interpretation capabilities, such as GPT-4 omni (GPT-4o), GPT-4 vision (GPT-4V), and Claude-3, has not been compared and remains unexplored in specialized radiological fields, including nuclear medicine and interventional radiology. This study aimed to evaluate and compare the diagnostic accuracy of various VLMs, including GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus, using Japanese diagnostic radiology, nuclear medicine, and interventional radiology (JDR, JNM, and JIR, respectively) board certification tests.MethodsIn total, 383 questions from the JDR test (358 images), 300 from the JNM test (92 images), and 322 from the JIR test (96 images) from 2019 to 2023 were consecutively collected. The accuracy rates of the GPT-4 + GPT-4V, GPT-4o, Claude-3 Sonnet, and Claude-3 Opus were calculated for all questions or questions with images. The accuracy rates of the VLMs were compared using McNemar’s test.ResultsGPT-4o demonstrated the highest accuracy rates across all evaluations with the JDR (all questions, 49%; questions with images, 48%), JNM (all questions, 64%; questions with images, 59%), and JIR tests (all questions, 43%; questions with images, 34%), followed by Claude-3 Opus with the JDR (all questions, 40%; questions with images, 38%), JNM (all questions, 51%; questions with images, 43%), and JIR tests (all questions, 40%; questions with images, 30%). For all questions, McNemar’s test showed that GPT-4o significantly outperformed the other VLMs (allP< 0.007), except for Claude-3 Opus in the JIR test. For questions with images, GPT-4o outperformed the other VLMs in the JDR and JNM tests (allP< 0.001), except Claude-3 Opus in the JNM test.ConclusionThe GPT-4o had the highest success rates for questions with images and all questions from the JDR, JNM, and JIR board certification tests.Secondary abstractThis study compared the diagnostic accuracy of vision-language models, including the GPT-4V, GPT-4o, and Claude-3, in Japanese radiological certification tests. GPT-4o demonstrated superior performance across diagnostic radiology, nuclear medicine, and interventional radiology tests, including image-based questions, highlighting its potential for medical image interpretation.