Retrospective studies of artificial intelligence (AI) in screening for diabetic retinopathy (DR) have shown promising results in addressing the mismatch between DR screening capacity and increasing DR incidence. This review sought to evaluate the diagnostic test accuracy (DTA) of AI in screening for referable diabetic retinopathy (RDR) in real-world settings. We searched CENTRAL, PubMed, CINAHL, Scopus, and Web of Science on 9 February 2023. We included prospective DTA studies assessing AI against trained human graders (HGs) in screening for RDR in patients with diabetes. Two reviewers independently extracted data and assessed methodological quality against QUADAS-2 criteria. We used the hierarchical summary receiver operating characteristic (HSROC) model to pool estimates of sensitivity and specificity, and forest plots and SROC plots to visually examine heterogeneity in accuracy estimates. From the 3899 studies returned by our initial search, we included 15 studies comprising 17 datasets. Meta-analyses revealed a sensitivity of 95.33% (95% CI: 90.60–100%) and a specificity of 92.01% (95% CI: 87.61–96.42%) for the patient-level analysis (10 datasets, N = 45,785); for the eye-level analysis, sensitivity was 91.24% (95% CI: 79.15–100%) and specificity was 93.90% (95% CI: 90.63–97.16%) (7 datasets, N = 15,390). Subgroup analyses showed no variation in diagnostic accuracy by country classification or DR classification criteria. However, diagnostic accuracy was moderately higher in primary-level healthcare settings (sensitivity 99.35%, 95% CI: 96.85–100%; specificity 93.72%, 95% CI: 88.83–98.61%) and minimally lower in tertiary-level healthcare settings (sensitivity 94.71%, 95% CI: 89.00–100%; specificity 90.88%, 95% CI: 83.22–98.53%). Sensitivity analyses showed no variation for studies that included diabetic macular edema in the RDR definition or for studies with ≥3 HGs.
This review provides evidence, for the first time from prospective studies, for the effectiveness of AI in screening for RDR in real-world settings. The results may serve to strengthen existing guidelines to improve current practices.