ImportanceThe utilization of artificial intelligence for the differentiation of benign and malignant breast lesions in multiparametric MRI (mpMRI) assists radiologists to improve diagnostic performance.ObjectivesTo develop an automated deep learning model for breast lesion segmentation and characterization and to evaluate the characterization performance of AI models and radiologists.Materials and methodsFor lesion segmentation, 2,823 patients were used for the training, validation, and testing of the VNet-based segmentation models, and the average Dice similarity coefficient (DSC) between the manual segmentation by radiologists and the mask generated by VNet was calculated. For lesion characterization, 3,303 female patients with 3,607 pathologically confirmed lesions (2,213 malignant and 1,394 benign lesions) were used for the three ResNet-based characterization models (two single-input and one multi-input models). Histopathology was used as the diagnostic criterion standard to assess the characterization performance of the AI models and the BI-RADS categorized by the radiologists, in terms of sensitivity, specificity, accuracy, and the area under the receiver operating characteristic curve (AUC). An additional 123 patients with 136 lesions (81 malignant and 55 benign lesions) from another institution were available for external testing.ResultsOf the 5,811 patients included in the study, the mean age was 46.14 (range 11–89) years. In the segmentation task, a DSC of 0.860 was obtained between the VNet-generated mask and manual segmentation by radiologists. In the characterization task, the AUCs of the multi-input and the other two single-input models were 0.927, 0.821, and 0.795, respectively. Compared to the single-input DWI or DCE model, the multi-input DCE and DWI model obtained a significant increase in sensitivity, specificity, and accuracy (0.831 vs. 0.772/0.776, 0.874 vs. 0.630/0.709, 0.846 vs. 0.721/0.752). Furthermore, the specificity of the multi-input model was higher than that of the radiologists, whether using BI-RADS category 3 or 4 as a cutoff point (0.874 vs. 0.404/0.841), and the accuracy was intermediate between the two assessment methods (0.846 vs. 0.773/0.882). For the external testing, the performance of the three models remained robust with AUCs of 0.812, 0.831, and 0.885, respectively.ConclusionsCombining DCE with DWI was superior to applying a single sequence for breast lesion characterization. The deep learning computer-aided diagnosis (CADx) model we developed significantly improved specificity and achieved comparable accuracy to the radiologists with promise for clinical application to provide preliminary diagnoses.