Purpose To develop a deep learning (DL) model for differentiating between benign and malignant ovarian tumors of Ovarian-Adnexal Reporting and Data System Ultrasound (O-RADS US) Category 4 lesions, and validate its diagnostic performance. Methods A retrospective analysis of 1619 US images obtained from three centers from December 2014 to March 2023. DeepLabV3 and YOLOv8 were jointly used to segment, classify, and detect ovarian tumors. Precision and recall and area under the receiver operating characteristic curve (AUC) were employed to assess the model performance. Results A total of 519 patients (including 269 benign and 250 malignant masses) were enrolled in the study. The number of women included in the training, validation, and test cohorts was 426, 46, and 47, respectively. The detection models exhibited an average precision of 98.68% (95% CI: 0.95–0.99) for benign masses and 96.23% (95% CI: 0.92–0.98) for malignant masses. Moreover, in the training set, the AUC was 0.96 (95% CI: 0.94–0.97), whereas in the validation set, the AUC was 0.93(95% CI: 0.89–0.94) and 0.95 (95% CI: 0.91–0.96) in the test set. The sensitivity, specificity, accuracy, positive predictive value, and negative predictive values for the training set were 0.943,0.957,0.951,0.966, and 0.936, respectively, whereas those for the validation set were 0.905,0.935, 0.935,0.919, and 0.931, respectively. In addition, the sensitivity, specificity, accuracy, positive predictive value, and negative predictive value for the test set were 0.925, 0.955, 0.941, 0.956, and 0.927, respectively. Conclusion The constructed DL model exhibited high diagnostic performance in distinguishing benign and malignant ovarian tumors in O-RADS US category 4 lesions.