Purpose
Accurate segmentation of the breast is required for breast density estimation and the assessment of background parenchymal enhancement, both of which have been shown to be related to breast cancer risk. The MRI breast segmentation task is challenging, and recent work has demonstrated that convolutional neural networks perform well for this task. In this study, we have investigated the performance of several two-dimensional (2D) U-Net and three-dimensional (3D) U-Net configurations using both fat-suppressed and nonfat-suppressed images. We have also assessed the effect of changing the number and quality of the ground truth segmentations.
Materials and methods
We designed eight studies to investigate the effect of input types and the dimensionality of the U-Net operations for breast MRI segmentation. Our training data contained 70 whole breast volumes of T1-weighted sequences without fat suppression (WOFS) and with fat suppression (FS). For each subject, we registered the WOFS and FS volumes together before manually segmenting the breast to generate ground truth. We compared four different input types to the U-Nets: WOFS, FS, MIXED (WOFS and FS images treated as separate samples), and MULTI (WOFS and FS images combined into a single multichannel image). We trained 2D U-Nets and 3D U-Nets with these data, which resulted in our eight studies (2D-WOFS, 3D-WOFS, 2D-FS, 3D-FS, 2D-MIXED, 3D-MIXED, 2D-MULTI, and 3D-MULTI). For each of these studies, we performed a systematic grid search to tune the hyperparameters of the U-Nets. A separate validation set with 15 whole breast volumes was used for hyperparameter tuning. We performed a Kruskal-Wallis test on the results of our hyperparameter tuning and did not find a statistically significant difference among the top ten models of each study. For this reason, we chose as the best model the one with the highest mean Dice similarity coefficient (DSC) value on the validation set. The reported test results are those of the top model of each study on our test set, which contained 19 whole breast volumes annotated by three readers, with the annotations fused using the STAPLE algorithm. We also investigated the effect of the quality of the training annotations and the number of training samples for this task.
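As a rough illustration of the DSC metric and of how the MIXED and MULTI input types differ, the following minimal sketch uses NumPy. The function names, the channels-last array layout, and the convention of returning 1.0 for two empty masks are assumptions made for this example and are not taken from the study's implementation.

```python
import numpy as np


def dice_similarity(pred, truth):
    """Dice similarity coefficient between two binary masks.

    DSC = 2 * |A intersect B| / (|A| + |B|).
    Returning 1.0 when both masks are empty is an assumed convention.
    """
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0
    return 2.0 * np.logical_and(pred, truth).sum() / denom


def make_mixed(wofs, fs):
    """MIXED: WOFS and FS volumes become two separate single-channel samples.

    Input volumes of shape (D, H, W) give an array of shape (2, D, H, W, 1).
    The channels-last layout is an assumption.
    """
    return np.stack([wofs, fs], axis=0)[..., np.newaxis]


def make_multi(wofs, fs):
    """MULTI: registered WOFS and FS volumes become one two-channel sample.

    Input volumes of shape (D, H, W) give an array of shape (1, D, H, W, 2).
    """
    return np.stack([wofs, fs], axis=-1)[np.newaxis, ...]


if __name__ == "__main__":
    # Synthetic volumes stand in for a registered WOFS/FS pair.
    rng = np.random.default_rng(0)
    wofs = rng.random((4, 8, 8))
    fs = rng.random((4, 8, 8))
    print(make_mixed(wofs, fs).shape)
    print(make_multi(wofs, fs).shape)
    print(dice_similarity([1, 1, 0, 0], [1, 0, 0, 0]))
```

The MULTI arrangement only makes sense when the two volumes are voxel-aligned, which is why registration precedes it in the description above; MIXED doubles the number of training samples, while MULTI keeps the sample count and doubles the channel count.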
Results
The study with the highest average DSC result was 3D-MULTI with 0.96 ± 0.02. The second highest average was 2D-WOFS (0.96 ± 0.03), and the third was 2D-MULTI (0.96 ± 0.03). We performed the Kruskal-Wallis one-way ANOVA test with Dunn's multiple comparison tests using Bonferroni P-value correction on the results of the selected model of each study and found that 3D-MULTI, 2D-MULTI, 3D-WOFS, 2D-WOFS, 2D-FS, and 3D-FS were not statistically different in their distributions, which indicates that comparable results could be obtained in fat-suppressed and nonfat-suppressed volumes and that there is no significant difference between the 3D and 2D approaches. Our results also suggested that the networks trained on single sequence images or multiple sequence images organized in multichannel i...