Background: The rapid advancement of generative artificial intelligence (AI) has revolutionized the understanding and generation of human language. The integration of these models into healthcare has shown potential for improving medical diagnostics, yet a comprehensive evaluation of their diagnostic performance, and a comparison of that performance with physicians', has not been extensively explored.

Methods: In this systematic review and meta-analysis, a comprehensive search of Medline, Scopus, Web of Science, Cochrane Central, and medRxiv was conducted for studies published from June 2018 through December 2023, focusing on those that validated generative AI models for diagnostic tasks. Meta-analysis was performed to summarize model performance and to compare model accuracy with that of physicians. Study quality was assessed using the Prediction Model Study Risk of Bias Assessment Tool (PROBAST).

Results: The search yielded 54 studies for inclusion in the meta-analysis, 13 of which were also used in the comparative analysis. Eight models were evaluated across 17 medical specialties. The overall accuracy of generative AI models across the 54 studies was 57% (95% confidence interval [CI]: 51–63%). An I² statistic of 96% indicated a high degree of heterogeneity among study results. Meta-regression revealed significantly higher accuracy for GPT-4 and lower accuracy in several specialties, including Neurology, Endocrinology, Rheumatology, and Radiology. The comparative meta-analysis showed that, on average, physicians exceeded the accuracy of the models (difference in accuracy: 14% [95% CI: 8–19%], p < 0.001). In the comparison between GPT-4 and physicians, however, GPT-4 slightly outperformed non-expert physicians (difference: -4% [95% CI: -10% to 2%], p = 0.173) and slightly underperformed relative to experts (difference: 6% [95% CI: -1% to 13%], p = 0.091); neither difference was statistically significant. The quality assessment indicated a high risk of bias in the majority of studies, primarily due to small sample sizes.

Conclusions: Generative AI exhibits promising diagnostic capabilities, with accuracy varying significantly by model and medical specialty. Although these models have not reached the reliability of expert physicians, the findings suggest that they could enhance healthcare delivery and medical education, provided they are integrated with caution and their limitations are well understood. This study also highlights the need for more rigorous research standards and larger case volumes in future evaluations.
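For readers unfamiliar with the heterogeneity measure cited in the Results, the I² statistic is conventionally derived from Cochran's Q; the formulation below is the standard Higgins and Thompson definition, not a method specific to this study:

```latex
% Standard I^2 heterogeneity statistic (Higgins & Thompson),
% where Q is Cochran's Q and k is the number of pooled studies;
% the max(0, .) truncation keeps I^2 non-negative.
I^{2} = \max\!\left(0,\ \frac{Q - (k - 1)}{Q}\right) \times 100\%
```

Under this definition, an I² of 96% indicates that roughly 96% of the observed variability in accuracy across the 54 studies reflects genuine between-study differences rather than sampling error, which motivates the random-effects pooling and meta-regression by model and specialty described above.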