Uncertainty estimation for deep learning-based pectoral muscle segmentation via Monte Carlo dropout

Objective. Deep learning models are often susceptible to failure after deployment. Knowing when a model is producing inadequate predictions is crucial. In this work, we investigate the utility of Monte Carlo (MC) dropout and the efficacy of a proposed uncertainty metric (UM) for flagging unacceptable pectoral muscle segmentations in mammograms. Approach. Segmentation of the pectoral muscle was performed with a U-Net convolutional neural network with a modified ResNet18 encoder. Dropout layers were kept active at inference time. For each mammogram, 50 pectoral muscle segmentations were generated. The pixel-wise mean was used to produce the final segmentation and the pixel-wise standard deviation was used to estimate uncertainty. From each pectoral muscle uncertainty map, an overall UM was calculated. To validate the UM, the correlation between the dice similarity coefficient (DSC) and the UM was used. The UM was first validated on a model building set (200 mammograms) and then tested on an independent dataset (300 mammograms). ROC-AUC analysis was performed to test the discriminatory power of the proposed UM for flagging unacceptable segmentations. Main results. The introduction of dropout layers in the model improved segmentation performance (DSC = 0.95 ± 0.07 versus DSC = 0.93 ± 0.10). A strong anti-correlation (r = −0.76, p < 0.001) between the proposed UM and the DSC was observed. A high AUC of 0.98 (97% specificity at 100% sensitivity) was obtained for the discrimination of unacceptable segmentations. Qualitative inspection by a radiologist revealed that images with a high UM are difficult to segment. Significance. The use of MC dropout at inference time in combination with the proposed UM enables flagging of unacceptable pectoral muscle segmentations in mammograms with excellent discriminatory power.


Introduction
Breast cancer screening programs, in which women undergo regular mammography examinations, have significantly reduced mortality rates by detecting cancers at an early stage (Maroni et al 2021, Siegel et al 2022). Computer aided diagnosis (CAD) could be beneficial in breast cancer screening programs, especially for tasks such as breast density estimation (Gastounioti et al 2020), breast cancer risk prediction (Yala et al 2019, 2021) and quality assurance of mammograms (Waade et al 2021, Picard et al 2022, Wadden and Hapgood 2022). However, to achieve the highest performance of CAD in many of these applications, the detection of auxiliary structures in mammograms, most notably the pectoral muscle in the medio-lateral oblique (MLO) view, is crucial (Gastounioti et al 2020, Waade et al 2021).
Automating the detection of the pectoral muscle in MLO images remains a challenging task that has been the subject of multiple studies using a range of different approaches (Mustra et al 2016, Moghbel et al 2020). The highest performing methods currently rely on convolutional neural networks (CNNs) (Rampun et al 2019, Soleimani and Michailovich 2020).
Despite achieving state-of-the-art performance on internal datasets, these models are susceptible to failure upon deployment (Mehrtash et al 2020). Such a scenario is unacceptable in critical domains such as healthcare. Therefore, additional measures that capture model predictive uncertainty in the absence of ground truth data are needed. Such measures have been shown to provide a better understanding of the model (Reyes et al 2020) and to flag unacceptable segmentations (Ng et al 2020, Seedat 2020). A number of methods have been applied to uncertainty estimation in CNNs (Abdar et al 2021, McCrindle et al 2021). General approaches include Monte Carlo (MC) dropout (Gal and Ghahramani 2016), which requires several forward passes through the network with dropout layers enabled at test time; Bayesian neural networks (Pinheiro Cinelli et al 2021), which directly represent network weights as probability distributions; and deep ensembles (Lakshminarayanan et al 2017), which combine the outputs of several networks to produce uncertainty estimates.
In this work, we investigate the utility of MC dropout and the efficacy of the proposed uncertainty metric (UM) for flagging unacceptable pectoral muscle segmentations in MLO mammography images.

Study design, dataset and preprocessing
The work was separated into two distinct phases (figure 1). First, a model building phase was used for model tuning, evaluation of segmentation performance, derivation of the UM and qualitative assessment of uncertainty maps. This was followed by a clinical application phase, in which the tuned model and the derived UM were tested in a real-life scenario where segmentation ground-truth data were absent.
For the model building phase (figure 1, left), 200 processed full field digital mammography (FFDM) images in MLO view were collected (all female, age range: (28, 88)), with manual delineations of the pectoral muscle performed by two specialists with a consensus reading in case of disagreement. These data were collected from the INBREAST public dataset (Breast Center in CHSJ, Porto) (Moreira et al 2012). The images were acquired between April 2008 and July 2010 using a Siemens Mammomat Novation system. Amongst all publicly available datasets with ground-truth pectoral muscle segmentations (Heath et al 1998, Moreira et al 2012, Suckling et al 2015), INBREAST (Moreira et al 2012) is the only one with FFDM images.
For the clinical application phase (figure 1, right), 300 processed FFDM images from the Slovenian National Breast Cancer Screening program (DORA) were randomly selected (all female, mean age: 57.6 ± 6.7, age range: (49, 69)). All data were acquired between January 2010 and December 2019 on a Siemens Mammomat Novation system. These data did not have corresponding ground-truth pectoral muscle masks.

Figure 1. Study design. The study was divided into two parts: a model building phase (segmentation performance evaluation, quantification of uncertainty maps, qualitative assessment of uncertainty maps) and a clinical application phase (testing the uncertainty metric (UM) for flagging of unacceptable segmentations). The UM was optimized in the model building phase and tested in the clinical application phase. Following a five-fold cross validation, the best performing model from the model building phase was selected as the final model to be tested in the clinical application phase.
All images were adjusted to left laterality with left-right image array flipping. Breast tissue with pectoral muscle was obtained by ignoring all pixels of zero intensity. Unwanted artifacts (e.g. watermarks) were removed with a largest-connected-component selection operation. Finally, all images were zero padded to attain an aspect ratio of 1:2 (horizontal:vertical) and subsequently resized to a resolution of 288 × 576 pixels. The original image matrix was 3328 × 4084 or 2560 × 3328 pixels, depending on the compression plate used in the acquisition.
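To make the pipeline concrete, the following is a minimal sketch of these preprocessing steps in Python (numpy, scipy, scikit-image); the function and variable names are illustrative and are not taken from the original implementation.

```python
# A minimal preprocessing sketch following the steps described above.
import numpy as np
from scipy import ndimage
from skimage.transform import resize

def preprocess(img: np.ndarray, is_right_breast: bool) -> np.ndarray:
    """Flip to left laterality, keep the largest nonzero component,
    pad to a 1:2 (W:H) aspect ratio and resize to 288 x 576 pixels."""
    if is_right_breast:                      # adjust laterality
        img = np.fliplr(img)
    mask = img > 0                           # ignore zero-intensity pixels
    labels, n = ndimage.label(mask)          # connected components
    if n > 1:                                # drop artifacts such as watermarks
        sizes = ndimage.sum(mask, labels, range(1, n + 1))
        mask = labels == (np.argmax(sizes) + 1)
    img = img * mask
    h, w = img.shape
    target_w = int(np.ceil(h / 2))           # width for a 1:2 aspect ratio
    if w < target_w:                         # zero pad columns on the right
        img = np.pad(img, ((0, 0), (0, target_w - w)))
    else:                                    # or zero pad rows at the bottom
        img = np.pad(img, ((0, 2 * w - h), (0, 0)))
    return resize(img, (576, 288), preserve_range=True)
```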

Segmentation and uncertainty estimation
For the segmentation of the pectoral muscle in MLO mammograms, a U-Net architectural design (Ronneberger et al 2015) with three modifications was utilized: (i) the original encoder head was replaced with the ResNet18 architecture (He et al 2015), with filter weights initialized from a ResNet18 model trained on ImageNet (Deng et al 2009). The selection of ResNet18 was based on the excellent performance reported in large-scale computer vision tasks (Deng et al 2009, He et al 2015), which enabled the use of transfer learning, an important step when training datasets are limited in size. (ii) The first convolutional layer of the ResNet18 architecture was adapted for gray-level images and the number of output classes was set to 3, representing air, breast tissue and pectoral muscle. (iii) Dropout layers with a dropout probability of 0.5 were inserted after every 2D convolutional layer in the encoder head of the network (Kendall et al 2016) (figure 2). This served two purposes: to avoid overfitting on a small training set and to allow for uncertainty estimation via MC dropout (Gal and Ghahramani 2016).
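A sketch of modifications (ii) and (iii) is shown below, assuming a torchvision ResNet18 encoder; the wrapper class and helper function are illustrative, and the authors' exact implementation may differ.

```python
# Sketch: adapt ResNet18 to gray-level input and insert dropout (p = 0.5)
# after every 2D convolutional layer in the encoder.
import torch.nn as nn
from torchvision.models import resnet18

class ConvWithDropout(nn.Module):
    def __init__(self, conv: nn.Conv2d, p: float = 0.5):
        super().__init__()
        self.conv, self.drop = conv, nn.Dropout2d(p)

    def forward(self, x):
        return self.drop(self.conv(x))

def add_dropout_after_convs(module: nn.Module, p: float = 0.5):
    for name, child in module.named_children():
        if isinstance(child, nn.Conv2d):
            setattr(module, name, ConvWithDropout(child, p))
        else:
            add_dropout_after_convs(child, p)

encoder = resnet18(weights="IMAGENET1K_V1")  # ImageNet-pretrained weights
# Modification (ii): single-channel (gray-level) input; this layer's
# pretrained weights are re-initialized.
encoder.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
add_dropout_after_convs(encoder, p=0.5)      # modification (iii)
```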
To investigate the effect of dropout on model performance, three scenarios were investigated: (i) no dropout layers, (ii) dropout engaged during training only, and (iii) dropout engaged during both training and testing. In the last scenario, segmentation masks were generated by computing the pixel-wise mean of the output model probabilities from several forward passes through the network.
Using this last scenario (iii), the predicted uncertainty of the pectoral muscle segmentation via MC dropout was generated by computing the pixel-wise standard deviation of multiple softmax probability outputs for the pectoral muscle class. These were generated from isolated forward passes of the same test sample through the network, so-called MC samples. Similarly, the final segmentation was obtained as the pixel-wise mean of these MC samples (figure 2), followed by an argmax operation.
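The MC sampling can be sketched as follows, assuming `model` is the trained segmentation network with dropout layers and `x` a preprocessed input tensor of shape (1, 1, 576, 288); the pectoral muscle class index of 2 is an assumption for illustration.

```python
# Sketch of MC dropout inference: 50 stochastic forward passes; pixel-wise
# mean + argmax for the final segmentation, pixel-wise standard deviation
# of the pectoral-class probabilities for the uncertainty map.
import torch

def mc_dropout_predict(model, x, n_samples: int = 50, pect_class: int = 2):
    model.eval()
    for m in model.modules():                     # keep dropout active at test time
        if isinstance(m, (torch.nn.Dropout, torch.nn.Dropout2d)):
            m.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=1)
                             for _ in range(n_samples)])  # (N, B, C, H, W)
    mean = probs.mean(dim=0)                      # pixel-wise mean of MC samples
    segmentation = mean.argmax(dim=1)             # final label map
    uncertainty = probs[:, :, pect_class].std(dim=0)  # std of pectoral class
    return segmentation, uncertainty
```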
For radiologist review in the clinical application phase, only the boundary between pectoral muscle and breast tissue was visualized (figure 2).

Figure 2. Segmentation and uncertainty estimation workflow. Dropout layers (red) were inserted after each 2D convolutional layer in the ResNet18 encoder head and retained at inference time. The final prediction was obtained as the pixel-wise mean of 50 isolated softmax forward passes (MC samples), followed by an argmax operation. The uncertainty of the prediction was estimated as the pixel-wise standard deviation of the 50 MC samples and is indicated by the heatmap on the output images on the right. The final prediction is visualized as the boundary between pectoral muscle and breast tissue. For visualization of the uncertainty map, a threshold of 0.02 is already applied (see quantification of uncertainty).

Training and validation
During training, data augmentation with random cropping (to between 70% and 100% of the original size) was applied. Optimal hyperparameters were selected using the Optuna framework (Akiba et al 2019). Root mean squared propagation was chosen as the optimizer, with three tunable parameters: learning rate, weight decay and momentum. The hyperspace of these parameters was limited to (1e-8, 1e-3) for learning rate, (1e-11, 1e-6) for weight decay and (0.5, 1) for momentum. Learning rate and weight decay were tuned on a logarithmic scale, and momentum was tuned on a linear scale. Furthermore, the learning rate was reduced on plateau with a patience of 5 and a factor of 0.7. The hyperspace of batch size comprised five different sizes: 1, 8, 16, 32 and 64. Gradient clipping was tuned between −1 and 1.
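A hedged sketch of this search with the Optuna framework is given below; `train_and_validate` is a placeholder for the training loop, which is not reproduced here, and the trial budget is an assumption.

```python
# Sketch of the Optuna hyperparameter search over the ranges stated above.
import optuna

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("lr", 1e-8, 1e-3, log=True)
    weight_decay = trial.suggest_float("weight_decay", 1e-11, 1e-6, log=True)
    momentum = trial.suggest_float("momentum", 0.5, 1.0)
    batch_size = trial.suggest_categorical("batch_size", [1, 8, 16, 32, 64])
    grad_clip = trial.suggest_float("grad_clip", -1.0, 1.0)
    # Inside the training loop one would use, e.g.:
    #   optimizer = torch.optim.RMSprop(model.parameters(), lr=lr,
    #       weight_decay=weight_decay, momentum=momentum)
    #   scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    #       optimizer, mode="max", factor=0.7, patience=5)
    return train_and_validate(lr, weight_decay, momentum, batch_size, grad_clip)

study = optuna.create_study(direction="maximize")  # maximize validation DSC
study.optimize(objective, n_trials=100)            # trial budget is an assumption
```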
Cross entropy was used as a loss function and the dice similarity coefficient (DSC) was used for monitoring and evaluation purposes.
Five-fold cross validation (CV) was invoked, where for each fold 80% of the overall data from the INBREAST dataset (160 scans) was used as training data and 20% (40 scans) as test data. For selection of an optimally trained model, each training fold was further split into actual training data (80%, 128 scans) and validation data (20%, 32 scans). The optimal model was determined based on the average DSC calculated on the validation data. Model training in each CV fold was bounded by a maximum of 200 epochs; however, the best DSC on the validation set was always achieved before reaching this maximum.
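The nested split can be sketched with scikit-learn as follows; the random seeds and variable names are illustrative.

```python
# Sketch of the nested five-fold CV split (160/40 outer, 128/32 inner).
import numpy as np
from sklearn.model_selection import KFold, train_test_split

indices = np.arange(200)                          # 200 INBREAST mammograms
outer = KFold(n_splits=5, shuffle=True, random_state=0)
for fold, (trainval_idx, test_idx) in enumerate(outer.split(indices)):
    # 160 scans for training + validation, 40 scans for testing per fold
    train_idx, val_idx = train_test_split(
        trainval_idx, test_size=0.2, random_state=fold)  # 128 / 32 scans
    print(f"fold {fold}: train={len(train_idx)}, "
          f"val={len(val_idx)}, test={len(test_idx)}")
```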

Quantification of uncertainty
From each pectoral muscle uncertainty map (pixel-wise standard deviations of N = 50 MC samples), the overall segmentation uncertainty was calculated as the sum of all standard deviations in the image above a threshold that is yet to be optimized. Thresholding was necessary to isolate the sum to the breast tissue-pectoral muscle boundary, where standard deviations are expected to be much higher than elsewhere in the image (figure 3). This sum was also normalized by the length of the pectoral muscle boundary, determined as the number of pixels on the border between the pectoral muscle and the breast tissue (figure 2, final prediction). Normalization was performed to account for differences in breast tissue-pectoral muscle boundary length. The UM was defined as

\[ \mathrm{UM} = \frac{1}{L} \sum_{i:\ s_i > t} s_i, \]

where $s_i$ is the standard deviation of a single pixel $i$, $L$ is the length of the breast tissue-pectoral muscle boundary, and $t$ is the optimizable standard deviation threshold.
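A direct transcription of this definition into code could look as follows; the class indices and the boundary-length computation via binary dilation are assumptions about implementation details not stated in the text.

```python
# Sketch of the UM: thresholded sum of per-pixel standard deviations,
# normalized by the pectoral muscle boundary length L.
import numpy as np
from scipy import ndimage

def uncertainty_metric(unc_map: np.ndarray, seg: np.ndarray,
                       t: float = 0.02) -> float:
    pect = seg == 2      # assumed class indices: 0 air, 1 breast, 2 pectoral
    breast = seg == 1
    # Boundary length L: pectoral pixels adjacent to breast-tissue pixels.
    boundary = pect & ndimage.binary_dilation(breast)
    L = max(int(boundary.sum()), 1)
    return float(unc_map[unc_map > t].sum()) / L
```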
It was hypothesized that images with high uncertainties will be poorly segmented (anti-correlated). Therefore, the optimal standard deviation threshold was selected by first computing the Pearson correlation between the proposed UM and the corresponding DSC across all datapoints from the model building phase (five-fold CV). The threshold yielding the most negative Pearson correlation coefficient, i.e. a strong association of a decreasing UM with an increasing DSC, was then selected as the optimal threshold for the determination of the pectoral muscle to breast tissue boundary. Thresholds were increased from 0 to 0.5 in steps of 0.005.
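The threshold search reduces to a small sweep, sketched below using the `uncertainty_metric` function from the previous sketch; `cases` is an assumed list of (uncertainty map, segmentation, DSC) triples collected over the five CV folds.

```python
# Sketch: sweep t in [0, 0.5] with step 0.005 and keep the value giving
# the most negative Pearson correlation between UM and DSC.
import numpy as np
from scipy.stats import pearsonr

def optimal_threshold(cases):
    best_t, best_r = None, np.inf
    for t in np.arange(0.0, 0.5 + 1e-9, 0.005):
        ums = [uncertainty_metric(u, s, t) for u, s, _ in cases]
        dscs = [dsc for _, _, dsc in cases]
        r, _ = pearsonr(ums, dscs)
        if r < best_r:                 # most negative correlation wins
            best_t, best_r = t, r
    return best_t, best_r
```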
Using the selected threshold, examples with the highest UM (top 10%) were qualitatively assessed by an experienced radiologist (7 years of experience). The final threshold was selected based on quantitative and qualitative observations.

Clinical application-leveraging uncertainty to flag unacceptable segmentations
The trained model from the fold with the best DSC performance was used to test the ability of the method to identify and flag unacceptable segmentations in unlabeled data. The model was deployed on an independent dataset of 300 images, from which predicted segmentation masks with associated UMs were obtained.
Two independent observers (an experienced radiologist and a medical physicist) scored the 300 segmentations based on the following scoring system: A. Unacceptable, B. Acceptable, C. Perfect.
The scoring system was adapted from (Filho et al 2013, Umbaugh 2017), where segmentations were evaluated as perfect (high quality: as good as might be desired), acceptable (good quality with minor errors), reasonable (intermediate quality with outliers), bad (only a small portion of the object of interest is segmented) or worst (does not segment the object of interest). For the specific application of pectoral muscle segmentation, the classes reasonable, bad and worst were merged into the class unacceptable.
The final image score was given after consensus of both readers. The Mann-Whitney U-test was used to test pair-wise statistical significance between segmentation acceptability groups A, B and C. The Kruskal-Wallis H-test was used for testing statistical significance across all groups together. ROC-AUC analysis between group A and the combined groups B and C was performed to test the discriminatory power of the proposed UM for flagging unacceptable segmentations. The method of DeLong et al (1988) was used for evaluation of AUC uncertainty. Specificity is reported at 100% sensitivity (no unacceptable segmentation left undetected).
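The ROC analysis can be sketched with scikit-learn as below, treating unacceptable as the positive class and the UM as the score; the array names are illustrative.

```python
# Sketch: AUC and specificity at the 100%-sensitivity operating point,
# with `um` the per-image uncertainty metrics and `labels` binary flags
# (1 = unacceptable, 0 = acceptable or perfect).
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def flagging_performance(um: np.ndarray, labels: np.ndarray):
    auc = roc_auc_score(labels, um)
    fpr, tpr, _ = roc_curve(labels, um)
    full_sens = tpr >= 1.0                     # points with 100% sensitivity
    specificity = 1.0 - fpr[full_sens].min()   # best specificity among them
    return auc, specificity
```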
To show that the qualitatively assessed segmentation quality in the group unacceptable is statistically lower than in the group acceptable, simple random sampling (Singh and Mangat 1996) was used to select 15 examples from each of the two groups (total n = 30). For these randomly selected examples, the ground-truth pectoral muscle boundary was manually delineated. The Matlab Image Viewer App was used to precisely estimate the length of the pectoral muscle boundary for the ground-truth delineations and the predictions of the model. The percentage of correctly delineated pectoral muscle boundary was calculated, and the mean and standard deviation for both groups were reported. Student's t-test was used to test statistical significance.
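This final comparison is a standard two-sample t-test, e.g. as sketched below; the arrays are random stand-ins generated from the reported group statistics, not the study data.

```python
# Sketch: Student's t-test between the two groups of boundary percentages.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
# Illustrative stand-ins (15 samples per group, as in the study).
pct_unacceptable = rng.normal(55, 20, size=15)
pct_acceptable = rng.normal(91, 5, size=15)
t_stat, p_value = ttest_ind(pct_unacceptable, pct_acceptable)
print(f"t = {t_stat:.2f}, p = {p_value:.1e}")
```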

Segmentation performance
Results for the three dropout scenarios-(i) no dropout layers, (ii) dropout engaged during training only and (iii) dropout engaged during training and testing-are shown in table 1. Excluding dropout layers resulted in the lowest performance, with an average DSC of 0.93 ± 0.10 (mean ± standard deviation). When dropout layers were inserted into the model, performance increased, with both scenarios resulting in an average DSC of 0.95 ± 0.07. The performance of models with dropout was consistently higher across all five folds. Thus, the use of dropout layers during training improves model performance, and retaining dropout layers during the test phase does not degrade it.
For all further uncertainty estimation analyses, only the model with dropout engaged during both the training and testing stages was used, as this is the only model that allows uncertainty estimation with the proposed method. The optimal hyperparameters for this model were 2.3e-05 for learning rate, 0.63 for momentum, 32 for batch size and 0.12 for gradient clipping.

Quantification of uncertainty
Firstly, CV results from the model building phase dataset (INBREAST) were used to determine the optimal threshold for quantification of uncertainty. The strongest negative correlation was observed at a threshold of 0.02 (figure 4(a)), where the Pearson correlation coefficient was −0.76 (p < 0.001) (figure 4(b)).
As can be seen in figure 3, the use of the proposed threshold removed pixels with minimal uncertainty from the calculation of the UM. The same applies to pixels lying on the boundary of breast tissue and air. As shown in the selected example in figure 3, after thresholding, only the uncertainty in the region of the pectoral muscle-breast tissue boundary was used for the calculation. Similar observations could be made for all other images.
Detailed qualitative inspection by the radiologist of the 10% of segmentations with the highest UM at the selected threshold of 0.02 revealed that high uncertainties are predominantly present for examples with: (i) an extremely small pectoral muscle (32%), (ii) a very bright mammogram (26%), (iii) superposition of a lesion on top of the pectoral boundary (9%), (iv) superposition of glandular tissue on top of the pectoral muscle (41%) or (v) questionable ground truth (5%). Some images may be classified into more than one category; therefore, the percentages do not add up to 100%. These observations indicate that the proposed UM shows high uncertainties predominantly for examples that are difficult to segment. The exception is group (i), where the proposed UM is high due to the small denominator, i.e. a small length of the breast-pectoral muscle boundary.
As a consequence of quantitative (figure 4) and qualitative (figures 3, 5) observations, a threshold of 0.02 was selected as the optimal value for the calculation of UM. This UM was then further evaluated for its potential to flag unacceptable segmentations in an independent unlabeled dataset.

Clinical application-leveraging uncertainty to flag unacceptable segmentations
The proposed UM was then tested for flagging unacceptable segmentations using the independent validation dataset, previously unseen by the model. After consensus of the two readers over the 300-image unlabeled dataset, 70.4% of segmentations were labeled as perfect, 22.3% as acceptable and 7.3% as unacceptable (figure 6). Differences in scoring occurred for 8.3% of segmentations belonging to the classes acceptable and perfect (after consensus, 51% of scorings were upgraded to perfect and 49% were downgraded to acceptable) and for 2.0% of segmentations belonging to the classes unacceptable and acceptable (after consensus, all scorings were set to acceptable).
The Mann-Whitney U-test revealed that the differences between median values are statistically significant (p < 0.05): unacceptable versus acceptable (p < 1e-4), unacceptable versus perfect (p < 1e-10), unacceptable versus acceptable and perfect combined (p < 1e-9) and acceptable versus perfect (p < 1e-13). In addition, the Kruskal-Wallis H-test across all groups yielded p < 1e-18. The discriminatory power of the proposed UM for flagging unacceptable segmentations was high, with an AUC of 0.98 (95% CI: 0.96, 1.0). A specificity of 97% was achieved at the selected operating point of 100% sensitivity.
Investigation of outliers in the UM of segmentations scored as perfect revealed that uncertainties can be high when the length of the breast tissue-pectoral muscle boundary is small (example 1 in figure 5).
Inspection of the randomly selected examples showed that the quality of segmentations in the group unacceptable is significantly lower (p < 1e-7) than in the group acceptable. In the group of unacceptable segmentations, model performance was indeed low, with a mean percentage of correctly predicted pectoral muscle boundary of 55% ± 20%. In the group of acceptable segmentations, the percentage was significantly higher at 91% ± 5%.

Discussion
In this work, we investigated a method for quantifying the uncertainty of a pectoral muscle segmentation model. We hypothesized that poorly segmented images will have a high UM and that it would therefore be possible to use the UM to flag unacceptable segmentations in the absence of ground truth. Knowing when a model fails is crucial in healthcare. We confirmed our hypothesis and found that the discriminatory power of the proposed UM for flagging unacceptable segmentations was very high, with an AUC of 0.98 (95% CI: 0.96, 1.00). For the requirement that all unacceptable segmentations are detected (100% sensitivity), 97% specificity was achieved. These findings are promising for the potential deployment of the segmentation model in a real-life scenario, since most failed segmentations can be flagged by the UM.
We showed that the proposed UM can be interpreted both quantitatively and qualitatively. Since a high anti-correlation between the UM and the DSC was obtained (r = −0.76, p < 0.001), better segmentations are associated with lower uncertainty and vice versa. Furthermore, qualitative assessment of the most uncertain segmentations revealed that high uncertainties were predominantly present for difficult examples, such as very bright mammograms or cases with superposition of lesions and/or fibroglandular tissue. Similar observations for mammograms considered difficult to segment have already been reported (Soleimani and Michailovich 2020).
We observed that careful implementation of MC dropout in a U-Net segmentation model does not reduce pectoral muscle segmentation performance, even if MC dropout is retained at inference time. Similar observations have been reported in other work in the medical imaging domain, such as whole brain segmentation.

The proposed UM is intuitive and interpretable; however, its utility is limited for cases where the pectoral muscle is extremely small (high DSC and high UM). Capturing the pectoral muscle in mammography is essential for accurate interpretation of the mammogram.

For further generalizability of the findings of this study, the model with the proposed UM should also be tested on different mammography system vendors and image processing algorithms, as well as on more racially diverse datasets. Although the segmentation model is not currently ready for translation to the clinic, key guiding principles from good machine learning practice (FDA 2021) were followed. This lays a strong foundation for possible translation in the future, mainly due to the significant contribution of the proposed UM, which enables post-deployment performance monitoring without explicit human intervention.
The main limitation associated with the use of MC dropout for uncertainty estimation lies in the additional computing cost at inference time; however, for our specific example, where images were downsampled in resolution, the increase in computation time was on the order of a few seconds per test image. In addition, it has been shown that the use of 5-10 MC samples can already produce adequate uncertainty maps (Kendall et al 2016, Lakshminarayanan et al 2017). There are further potential limitations in that MC dropout does not ideally approximate Bayesian inference and that uncertainty estimation with this method is sub-optimal for out-of-distribution (OOD) detection (Ashukha et al 2021, Liu et al 2021).
In extreme examples, such as breasts with implants or other OOD images, the prediction uncertainty may not correlate with the quality of the segmentation. Despite this, results from this study indicate that uncertainty estimation with MC dropout is possible and reliable for tasks where images are quality assured and no real OOD examples are present, as is the case for the datasets presented here. To completely avoid the problems of OOD examples, detection of these mammograms could be performed before segmentation.
Most deep learning models do not offer out-of-the-box uncertainty estimation, and a number of methods have been proposed for this purpose. Of these, we focused only on the method of test-time MC dropout (Gal and Ghahramani 2016, Kendall et al 2016). Of particular interest for our future work are layer ensembles (Kushibar et al 2022), which require only a single forward pass and have been reported to produce more accurate uncertainty estimates for breast mass and cardiac structure segmentation tasks.

Conclusion
To conclude, this study found MC dropout to be a useful technique for the estimation of uncertainties in CNN-based pectoral muscle segmentation of digital mammograms. We found that the use of MC dropout at inference time in combination with the proposed UM enabled the flagging of unacceptable segmentations with excellent discriminatory power. We envision that the methodology and results of this work will enable more reliable breast density estimation (Gastounioti et al 2020), breast cancer risk assessment (Yala et al 2019, 2021) and automatic quality assurance of mammograms (Waade et al 2021, Picard et al 2022, Wadden and Hapgood 2022). As the quality of pectoral muscle segmentation is important for calculations performed downstream in these applications, flagging unacceptable segmentations in the absence of ground truth will produce more robust estimates. In summary, this is the first work introducing an automatic quality control flagging mechanism for pectoral muscle segmentation in digital mammograms, which will likely be considered indispensable for software with clinical applications.