Purpose
Accurate segmentation of the breast is required for breast density estimation and the assessment of background parenchymal enhancement, both of which have been shown to be related to breast cancer risk. The MRI breast segmentation task is challenging, and recent work has demonstrated that convolutional neural networks perform well for this task. In this study, we have investigated the performance of several two‐dimensional (2D) U‐Net and three‐dimensional (3D) U‐Net configurations using both fat‐suppressed and nonfat‐suppressed images. We have also assessed the effect of changing the number and quality of the ground truth segmentations.
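The difference between the 2D and 3D configurations lies only in the dimensionality of the convolution and normalization operators. As a minimal sketch (assuming PyTorch, which the study does not specify, and illustrative channel counts), a single encoder block of the two variants differs only in the operator classes used:

```python
# Illustrative sketch only: the abstract does not name a framework or layer
# sizes; PyTorch and the channel counts below are assumptions.
import torch
import torch.nn as nn

def conv_block(dim, in_ch, out_ch):
    """One U-Net encoder block; dim=2 uses 2D operators, dim=3 uses 3D operators."""
    Conv = nn.Conv2d if dim == 2 else nn.Conv3d
    Norm = nn.BatchNorm2d if dim == 2 else nn.BatchNorm3d
    return nn.Sequential(
        Conv(in_ch, out_ch, kernel_size=3, padding=1),
        Norm(out_ch),
        nn.ReLU(inplace=True),
        Conv(out_ch, out_ch, kernel_size=3, padding=1),
        Norm(out_ch),
        nn.ReLU(inplace=True),
    )

# A 2D U-Net processes axial slices (N, C, H, W); a 3D U-Net processes whole
# volumes (N, C, D, H, W) and can therefore exploit through-plane context.
block2d = conv_block(2, in_ch=1, out_ch=32)
block3d = conv_block(3, in_ch=1, out_ch=32)

slices = torch.randn(4, 1, 256, 256)       # batch of 2D slices
volume = torch.randn(1, 1, 64, 256, 256)   # one 3D volume
print(block2d(slices).shape)   # torch.Size([4, 32, 256, 256])
print(block3d(volume).shape)   # torch.Size([1, 32, 64, 256, 256])
```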
Materials and methods
We designed eight studies to investigate the effect of input types and the dimensionality of the U‐Net operations on breast MRI segmentation. Our training data contained 70 whole breast volumes of T1‐weighted sequences without fat suppression (WOFS) and with fat suppression (FS). For each subject, we registered the WOFS and FS volumes together before manually segmenting the breast to generate ground truth. We compared four different input types to the U‐Nets: WOFS, FS, MIXED (WOFS and FS images treated as separate samples), and MULTI (WOFS and FS images combined into a single multichannel image). We trained 2D U‐Nets and 3D U‐Nets with these data, which resulted in our eight studies (2D‐WOFS, 3D‐WOFS, 2D‐FS, 3D‐FS, 2D‐MIXED, 3D‐MIXED, 2D‐MULTI, and 3D‐MULTI). For each of these studies, we performed a systematic grid search to tune the hyperparameters of the U‐Nets. A separate validation set with 15 whole breast volumes was used for hyperparameter tuning. We performed a Kruskal–Wallis test on the results of our hyperparameter tuning and did not find a statistically significant difference among the ten top models of each study. For this reason, we chose the best model as the one with the highest mean Dice similarity coefficient (DSC) value on the validation set. The reported test results are those of the top model of each study on our test set, which contained 19 whole breast volumes annotated by three readers and fused with the STAPLE algorithm. We also investigated the effect of the quality of the training annotations and the number of training samples for this task.
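As an illustration (not the authors' code; the array names, shapes, and plain NumPy formulation are assumptions), the following sketch shows how a registered WOFS/FS pair could be arranged into the four input types and how the DSC used for model selection is computed:

```python
# Minimal sketch under assumed shapes and names, to illustrate the four input
# arrangements and the Dice similarity coefficient.
import numpy as np

def make_inputs(wofs, fs, mode):
    """Arrange a registered WOFS/FS volume pair according to the study design."""
    if mode == "WOFS":
        return [wofs[..., np.newaxis]]                        # single-channel WOFS
    if mode == "FS":
        return [fs[..., np.newaxis]]                          # single-channel FS
    if mode == "MIXED":
        return [wofs[..., np.newaxis], fs[..., np.newaxis]]   # two separate samples
    if mode == "MULTI":
        return [np.stack([wofs, fs], axis=-1)]                # one two-channel sample
    raise ValueError(mode)

def dice_similarity(pred, truth):
    """Dice similarity coefficient between two binary masks."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    intersection = np.logical_and(pred, truth).sum()
    return 2.0 * intersection / (pred.sum() + truth.sum())

# Toy volumes standing in for a registered WOFS/FS pair and its breast mask.
wofs = np.random.rand(64, 128, 128)
fs = np.random.rand(64, 128, 128)
mask = np.zeros((64, 128, 128), dtype=np.uint8)
mask[:, 32:96, 32:96] = 1

samples = make_inputs(wofs, fs, "MULTI")
print(samples[0].shape)             # (64, 128, 128, 2)
print(dice_similarity(mask, mask))  # 1.0 for a perfect prediction
```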
Results
The study with the highest average DSC was 3D‐MULTI with 0.96 ± 0.02. The second highest average was 2D‐WOFS (0.96 ± 0.03), and the third was 2D‐MULTI (0.96 ± 0.03). We performed the Kruskal–Wallis one‐way ANOVA test with Dunn's multiple comparison tests using Bonferroni P‐value correction on the results of the selected model of each study and found that 3D‐MULTI, 2D‐MULTI, 3D‐WOFS, 2D‐WOFS, 2D‐FS, and 3D‐FS were not statistically different in their distributions. This indicates that comparable results can be obtained in fat‐suppressed and nonfat‐suppressed volumes and that there is no significant difference between the 2D and 3D approaches. Our results also suggested that the networks trained on single sequence images or multiple sequence images organized in multichannel i...
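The statistical comparison described above can be reproduced in outline as follows; the abstract does not name the software used, so SciPy and the scikit-posthocs package, along with the simulated per-study DSC arrays, are assumptions in this sketch:

```python
# Sketch of the reported statistical comparison; the per-study DSC values
# below are simulated placeholders, not the study's data.
import numpy as np
from scipy.stats import kruskal
import scikit_posthocs as sp

rng = np.random.default_rng(0)
# One array of test-set DSC values per study (19 test volumes each, simulated).
studies = {
    "3D-MULTI": np.clip(rng.normal(0.96, 0.02, 19), 0, 1),
    "2D-WOFS":  np.clip(rng.normal(0.96, 0.03, 19), 0, 1),
    "2D-MULTI": np.clip(rng.normal(0.96, 0.03, 19), 0, 1),
    "3D-WOFS":  np.clip(rng.normal(0.95, 0.03, 19), 0, 1),
    "2D-FS":    np.clip(rng.normal(0.95, 0.03, 19), 0, 1),
    "3D-FS":    np.clip(rng.normal(0.95, 0.03, 19), 0, 1),
}

# Omnibus Kruskal-Wallis test across the six studies.
h_stat, p_value = kruskal(*studies.values())
print(f"Kruskal-Wallis H={h_stat:.2f}, p={p_value:.3f}")

# Dunn's pairwise post-hoc test with Bonferroni-corrected p-values.
groups = list(studies.values())
p_matrix = sp.posthoc_dunn(groups, p_adjust="bonferroni")
p_matrix.index = p_matrix.columns = list(studies.keys())
print(p_matrix.round(3))
```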