Purpose
The purpose of this study was to explore the performance of different combinations of deep learning (DL) models (Xception, DenseNet121, MobileNet, ResNet50, and EfficientNetB0) and input image resolutions (REZs) (224 × 224, 320 × 320, and 448 × 448 pixels) for breast cancer diagnosis.

Methods
This multicenter study retrospectively collected gray-scale ultrasound breast images from two Chinese hospitals. The data were divided into training, validation, internal testing, and external testing sets. Three hundred images were randomly selected for the physician-AI comparison. The Wilcoxon test was used to compare the diagnostic errors of physicians and models at the P = 0.05 and P = 0.10 significance levels. Specificity, sensitivity, accuracy, and area under the curve (AUC) were used as the primary evaluation metrics.

Results
A total of 13,684 images of 3447 female patients were included. On the external test set, the 224 × 224 and 320 × 320 REZs achieved the best performance for MobileNet and EfficientNetB0, respectively (AUC: 0.893 and 0.907), while the 448 × 448 REZ achieved the best performance for Xception, DenseNet121, and ResNet50 (AUC: 0.900, 0.883, and 0.871, respectively). On the physician-AI test set, EfficientNetB0 at the 320 × 320 REZ (AUC: 0.896, P < 0.10) outperformed senior physicians. In addition, MobileNet at the 224 × 224 REZ (AUC: 0.878, P < 0.10) and Xception at the 448 × 448 REZ (AUC: 0.895, P < 0.10) outperformed junior physicians, whereas DenseNet121 (AUC: 0.880, P < 0.05) and ResNet50 (AUC: 0.838, P < 0.05) at the 448 × 448 REZ outperformed only entry-level physicians.

Conclusion
Based on gray-scale ultrasound breast images, we obtained the best DL model-resolution combinations, which outperformed the physicians.
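
For readers who want to reproduce the model-resolution grid, a minimal sketch follows. The abstract does not state the training framework, so Keras/TensorFlow, ImageNet initialization, a sigmoid binary head, and replication of the gray-scale inputs to three channels are all assumptions, not the authors' confirmed setup:

import tensorflow as tf

# The five backbones and three resolutions evaluated in the study.
BACKBONES = {
    "Xception": tf.keras.applications.Xception,
    "DenseNet121": tf.keras.applications.DenseNet121,
    "MobileNet": tf.keras.applications.MobileNet,
    "ResNet50": tf.keras.applications.ResNet50,
    "EfficientNetB0": tf.keras.applications.EfficientNetB0,
}
RESOLUTIONS = [224, 320, 448]

def build_model(backbone_name, rez):
    # Gray-scale ultrasound frames are assumed to be replicated to
    # 3 channels so ImageNet-pretrained stems can be reused.
    base = BACKBONES[backbone_name](
        include_top=False, weights="imagenet",
        input_shape=(rez, rez, 3), pooling="avg",
    )
    # Binary benign/malignant head (an assumed design choice).
    out = tf.keras.layers.Dense(1, activation="sigmoid")(base.output)
    return tf.keras.Model(base.input, out)

# e.g., the combination that scored highest on the external test set:
model = build_model("EfficientNetB0", 320)

Looping build_model over all 15 backbone-resolution pairs reproduces the grid whose external-test AUCs are reported above.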
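
The abstract also names the primary metrics and the paired Wilcoxon test without implementation detail. A minimal sketch assuming scikit-learn and SciPy, with per-image 0/1 misclassification as the paired "diagnostic error" (an assumption about how errors were paired across the 300 comparison images):

import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import confusion_matrix, roc_auc_score

def evaluate(y_true, y_prob, threshold=0.5):
    # Primary metrics reported in the study: sensitivity, specificity,
    # accuracy, and AUC (the 0.5 operating threshold is an assumption).
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "auc": roc_auc_score(y_true, y_prob),
    }

def compare_errors(y_true, model_pred, physician_pred):
    # Paired Wilcoxon test on per-image 0/1 diagnostic errors for a
    # model and a physician reading the same images.
    model_err = (np.asarray(model_pred) != np.asarray(y_true)).astype(int)
    doc_err = (np.asarray(physician_pred) != np.asarray(y_true)).astype(int)
    stat, p = wilcoxon(model_err, doc_err)
    # The study reports both significance levels used above.
    return {"p": p, "sig_0.05": p < 0.05, "sig_0.10": p < 0.10}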