U2-MNet: An Improved Neural Network for Breast Tumor Segmentation

In recent years, UNet and its variants have achieved excellent performance. However, since the convolution kernel in UNet focuses solely on local pixels, these models struggle to model long-range dependencies. This issue has been addressed by recently proposed segmentation models based on the transformer architecture, whose internal self-attention mechanisms capture global contextual information to improve segmentation performance. However, the results are often not ideal without pre-training on a large-scale dataset. Therefore, we designed a model (U2-MNet) to overcome the limitations of convolutional kernels, enabling high-accuracy segmentation without pre-training. This approach adopts a multi-layer framework that effectively describes multi-level channel information by employing a window-based channel MLP (WCM) block, which uses a sliding window and an MLP to capture detailed information in local features. In addition, a multi-level channel cross mixing (MCCM) block is included in each skip connection to reduce noise after the aggregation of low-level and high-level features. The proposed model was trained from scratch and tested on the Breast UltraSound Images (BUSI) dataset, achieving 82.76%, 73.17%, and 86.24% on the Dice, IOU, and Precision indicators, respectively.


INTRODUCTION
Breast cancer, often called the "pink killer", has the highest incidence rate among malignant tumors in women. Ultrasound imaging of the breast is an auxiliary modality for breast cancer diagnosis. As a result, applying neural networks to breast tumor segmentation has become a challenging yet promising task in medical image processing. UNet [1] employs an encoder-decoder architecture that combines low-level with high-level features through skip connections. Several variants of UNet [1] have been proposed, including Attention UNet [2], UNet++ [3], and MultiResUNet [4], which have further improved segmentation performance. Transformer-based models, following their success in natural language processing (NLP), have recently been extended to image processing tasks (e.g., Vision Transformer [5]). For example, TransUNet [6] combines a vision transformer [5] with a UNet [1], integrating the advantages of both. This approach has been remarkably successful, employing a self-attention mechanism to model global contextual features. However, achieving accurate segmentation results with this network requires pre-training on extensive datasets. As such, researchers are increasingly interested in improving the performance of transformers. Inspired by the U2-Net [7] framework, we propose U2-MNet. In this model, we place a window-based channel MLP (WCM) block in the encoder and decoder of the first four layers to extract multi-level features. In the WCM block, a sliding window is used in conjunction with an MLP (Multilayer Perceptron) to model the details of local features within the window. Meanwhile, we believe that the noise inside high-level features increases when low-level features are aggregated into them, because the low-level features generated by downsampling inevitably contain a large amount of noise. The high-level features generated by upsampling, in contrast, are obtained by further processing the downsampled features, so their noise is relatively small. A simple aggregation of the two therefore increases the noise in the high-level features, and traditional skip connections, which simply aggregate them, cannot reduce the noise of the low-level features. Therefore, inspired by UCTransNet [8], we modify the skip connections by adding an MCCM block to each one. Low-level features are processed by the MCCM block before being combined with high-level features, reducing the impact of noise caused by aggregation and improving segmentation performance. This study's primary contributions can be outlined as follows:
(1) We design a WCM block that cooperates with an MLP through a sliding window, making the model focus on the content in the window in order to obtain details that are not easily captured in local features.
(2) We design an MCCM block to reduce the increase in noise in high-level features, utilizing the weight distribution across feature channels to further improve segmentation performance.

2.1. Attention Mechanism and Transformer
Previous studies have demonstrated that attention mechanisms improve the segmentation performance of UNet [1]. For example, TransUNet [6] utilizes multi-head self-attention mechanisms to model global contextual information. Attention mechanisms can also overcome the limitations of convolutional kernels: when convolutions model local information, the results are typically combined with global contextual information to improve segmentation. SwinUNet [9], for example, operates on the assumption that each pixel exhibits a high degree of correlation with surrounding local pixels and a low correlation with remote pixels. SwinUNet [9] uses windows of different sizes internally, and applying attention mechanisms within each window reduces computational complexity and improved on the segmentation results of TransUNet [6]. The disadvantage of such windows is that they cannot capture detailed information about local features, and transformer-based models implemented without pre-training are unable to achieve high-accuracy segmentation.

2.2. Skip Connections
Skip connections were first proposed in UNet [1] to combine shallow features with deep semantic features, and experimental results have confirmed the effectiveness of this technique. UNet [1] has since become a milestone in medical image segmentation and has many variants. For example, UNet++ [3] narrows the semantic gap by including several encoding and decoding steps and adding dense skip connections to acquire multi-scale semantic features. Attention UNet [2] adds an attention gate to the skip connections of UNet [1] to assign weights and constrain the model to focus on features that directly affect the results. MultiResUNet [4] considers potential semantic gaps between encoder and decoder features, implementing residual connections to further improve skip connections. UCTransNet [8] improves traditional skip connections by utilizing channels to minimize semantic gaps between the encoder and decoder. However, these skip connections do not eliminate the noise generated by the encoder, which affects segmentation performance.

3.1. Architecture Design
We design a multi-layer architecture in which the first four layers contain the WCM block to extract detailed feature information. In addition, dilated convolutions [10] with various dilation rates are used inside the structure instead of ordinary convolutions to expand the receptive fields of the network. However, the downsampled feature map becomes very small after processing at the fourth layer, so only dilated convolutions [10] are used in the fifth and sixth layers. For the hyperparameters, the input and output channel counts are set equal, the convolution kernel size is set to 3, and the padding is set equal to the dilation rate. We design the corresponding decoder using the same structure as the encoder. The MCCM block is then used to connect each encoder and its corresponding decoder to reduce the impact of the large amount of noise in low-level features on high-level features. Every decoder stage consolidates features from the symmetric encoder stage and up-sampled features from the preceding stage. The overall architecture of the model is shown in Fig. 1.
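The hyperparameter choices above can be illustrated with a minimal PyTorch sketch (the channel count and dilation rates here are our own assumptions, not the paper's configuration): a 3×3 convolution whose padding equals its dilation rate preserves the spatial resolution of the feature map while enlarging the receptive field.

```python
import torch
import torch.nn as nn

def dilated_conv(channels, dilation):
    # Input and output channel counts are equal, kernel size is 3,
    # and padding equals the dilation rate, as described in Sec. 3.1.
    return nn.Conv2d(channels, channels, kernel_size=3,
                     padding=dilation, dilation=dilation)

x = torch.randn(1, 64, 32, 32)
for d in (1, 2, 4):                 # illustrative increasing dilation rates
    x = dilated_conv(64, d)(x)
print(x.shape)                      # spatial size is unchanged
```

Because output size is H + 2·padding − dilation·(kernel − 1) − 1 + 1 = H, stacking these layers grows the receptive field without shrinking the feature map.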

3.2. WCM Block
The WCM block (Fig. 2) first scales feature maps to form a hierarchical structure, sliding a window to produce feature information over a small range (Fig. 3). Local features are modeled through depth-wise convolutions within the sliding window, followed by layer normalization (LN). The convolution kernel size is set to 3, and the padding and stride are set to 1. The result is then used as input to the MLP. Finally, a residual structure is included to address the vanishing gradient problem. The WCM block is stacked four times to obtain the result.
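One stage of this pipeline might be sketched as follows, assuming the block operates on standard (B, C, H, W) feature maps; the class name, MLP expansion ratio, and activation are our assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class WCMBlock(nn.Module):
    """Sketch of one WCM stage: depth-wise 3x3 convolution (kernel 3,
    padding 1, stride 1) models local features, LayerNorm and an MLP mix
    channel information, and a residual connection eases gradient flow."""
    def __init__(self, channels, mlp_ratio=4):
        super().__init__()
        self.dwconv = nn.Conv2d(channels, channels, kernel_size=3,
                                padding=1, stride=1, groups=channels)
        self.norm = nn.LayerNorm(channels)
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels * mlp_ratio),
            nn.GELU(),
            nn.Linear(channels * mlp_ratio, channels),
        )

    def forward(self, x):                 # x: (B, C, H, W)
        y = self.dwconv(x)
        y = y.permute(0, 2, 3, 1)         # (B, H, W, C) for LN / MLP
        y = self.mlp(self.norm(y))
        y = y.permute(0, 3, 1, 2)         # back to (B, C, H, W)
        return x + y                      # residual connection
```

Stacking four such stages, as the text describes, would simply chain `WCMBlock` modules in an `nn.Sequential`.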

3.3. MCCM Block
Employing basic skip connections to aggregate low-level and high-level features risks adding noise to the high-level features, which can ultimately degrade segmentation performance. In this study, an MCCM block (Fig. 4) is inserted into each skip connection linking the encoder and decoder. The MCCM block implements a transformer with multi-head cross mixing attention (Fig. 5), which differs from the common self-attention mechanism. It divides low-level feature maps into small blocks by channel and applies multi-head attention along the channel direction within each small block. This approach assigns different weights to each channel, allocating higher weights to the important channel components in the low-level features. Finally, the feature information is fused to model global contextual information, enhance the interaction between channels, and reduce the impact of noise on segmentation.
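The core idea of attending along the channel direction can be sketched as below. This is a simplified, assumption-laden illustration of channel-wise cross attention (queries from high-level features, keys/values from low-level features, each channel treated as a token); the paper's actual block splitting and fusion details are not reproduced here.

```python
import torch

def channel_cross_attention(low, high, num_heads=4):
    """Sketch: multi-head attention over channels. Each channel is a token
    whose feature vector is its flattened spatial map; queries come from
    the high-level feature map, keys/values from the low-level one."""
    B, C, H, W = low.shape
    q = high.reshape(B, C, -1)                 # (B, C, H*W)
    k = low.reshape(B, C, -1)
    v = low.reshape(B, C, -1)
    d = q.shape[-1] // num_heads               # split spatial dim over heads
    q = q.reshape(B, C, num_heads, d).permute(0, 2, 1, 3)   # (B, h, C, d)
    k = k.reshape(B, C, num_heads, d).permute(0, 2, 1, 3)
    v = v.reshape(B, C, num_heads, d).permute(0, 2, 1, 3)
    attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1)  # (B, h, C, C)
    out = (attn @ v).permute(0, 2, 1, 3).reshape(B, C, H * W)
    return out.reshape(B, C, H, W)
```

The (C × C) attention map is what lets the block reweight low-level channels before they are merged with the decoder features.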

4.1. Dataset
The proposed model was evaluated using the Breast UltraSound Images (BUSI) [11] dataset, which comprises ultrasound images and corresponding ground truth (mask) images. The benign and malignant images, 647 in total, were used in this study. All images, which have varying resolutions, were uniformly resized to 256 × 256 using bilinear interpolation. The samples were then split into training and testing sets at an 8:2 ratio.
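A minimal sketch of this preprocessing, assuming tensors of shape (C, H, W) and a random split (the paper does not state how the split was drawn):

```python
import torch
import torch.nn.functional as F

def resize_256(img):
    # img: (C, H, W) float tensor -> (C, 256, 256) via bilinear interpolation
    return F.interpolate(img.unsqueeze(0), size=(256, 256),
                         mode="bilinear", align_corners=False).squeeze(0)

# 8:2 train/test split of the 647 benign and malignant samples
n = 647
idx = torch.randperm(n)
split = int(0.8 * n)
train_idx, test_idx = idx[:split], idx[split:]
print(len(train_idx), len(test_idx))   # 517 training, 130 testing samples
```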

4.2. Evaluation Indicators
In order to provide an unbiased assessment of the segmentation performance achieved by our proposed model, we employed three distinct metrics: Dice, IOU, and Precision. These measures can be computed using the following formulas:

Dice = 2TP / (2TP + FP + FN) (1)

IOU = TP / (TP + FP + FN) (2)

Precision = TP / (TP + FP) (3)

where a true positive (TP) is a pixel predicted as tumor that is tumor in the ground truth, a false positive (FP) is a pixel predicted as tumor that is background in the ground truth, and a false negative (FN) is a tumor pixel predicted as background.
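These three formulas translate directly into code; the sketch below computes them from binary prediction and ground-truth masks (function name and array convention are ours):

```python
import numpy as np

def seg_metrics(pred, gt):
    """Dice, IOU, and Precision from binary masks of the same shape."""
    tp = np.logical_and(pred == 1, gt == 1).sum()  # predicted tumor, is tumor
    fp = np.logical_and(pred == 1, gt == 0).sum()  # predicted tumor, is background
    fn = np.logical_and(pred == 0, gt == 1).sum()  # predicted background, is tumor
    dice = 2 * tp / (2 * tp + fp + fn)             # Eq. (1)
    iou = tp / (tp + fp + fn)                      # Eq. (2)
    precision = tp / (tp + fp)                     # Eq. (3)
    return dice, iou, precision
```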

4.3. Implementation Details
All models were trained and tested on an Nvidia RTX3090 GPU with the same hyperparameter settings and without pre-training. Data augmentation uses random rotation and flipping, each with a threshold of 0.5: if the generated random value is greater than 0.5, the current image is rotated or flipped. The segmentation model was built in the PyCharm integrated development environment (IDE) on the PyTorch 1.8.0 deep learning framework. We used the Adam optimizer to update the weights, with an initial learning rate of 0.0001 and a weight decay of 0.00001 during training. A batch size of 8 was employed. To address the class imbalance that may arise in image datasets, the Dice (DSC) loss function was also applied.
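The augmentation rule described above might look like the following sketch, assuming the image and its mask must receive identical transforms; we use a 90-degree rotation for illustration, since the paper does not specify the rotation angle.

```python
import torch

def augment(img, mask, p=0.5):
    """Apply the same random flip / rotation to an image and its mask.
    Each transform fires when a fresh random value exceeds the threshold p."""
    if torch.rand(1).item() > p:                      # random horizontal flip
        img, mask = img.flip(-1), mask.flip(-1)
    if torch.rand(1).item() > p:                      # random 90-degree rotation
        img, mask = img.rot90(1, (-2, -1)), mask.rot90(1, (-2, -1))
    return img, mask
```

Drawing a new random value per transform matches the text's "if the generated random value is greater than 0.5" rule for rotation and flipping independently.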

4.4. Model Comparison
Our model and the comparison models were all trained and tested on the BUSI dataset [11]. We compared against baseline U-shaped architectures: three proposed before 2019 (UNet [1], UNet++ [3], and Attention UNet [2]) and two proposed in 2022 (U2-Net [7] and UNeXt [12]). A comparison of these algorithms is provided in Table I. The results indicate that our model significantly improves on Attention UNet [2] in all three indicators, including a 5.53% increase (77.23% vs 82.76%) in the Dice coefficient. The lightweight UNeXt [12] and U2-Net [7] frameworks were proposed to further enhance the prediction performance of existing models. However, our model outperformed the lightweight UNeXt [12], achieving a 2.5% increase (80.26% vs 82.76%) in the Dice coefficient, and also improved on U2-Net [7], with a 0.93% increase (81.83% vs 82.76%). The segmentation results of the various models on the BUSI [11] test set are shown in Fig. 6. From the figure, it is evident that UNet [1], UNet++ [3], and Attention UNet [2] could only identify some of the tumors; their edge segmentation is somewhat rough and the overall effect is poor. UNeXt [12] and U2-Net [7] segmented the rough outline, but their processing of edge details was insufficient. Our model not only segmented tumor contours well but also handled edge details and accurately segmented tumor regions. This comparison with the benchmark models suggests our model further improves segmentation performance by increasing the accuracy of tumor segmentation in ultrasound images.

5. DISCUSSION
To better understand the contribution of each module to the proposed algorithm's overall segmentation performance, ablation experiments were carried out by sequentially adding each module to the base architecture and evaluating its effect on segmentation of the BUSI dataset [11]. A U2-Net [7] framework was implemented first; the WCM block, and then the MCCM block in each skip connection, were subsequently added to evaluate segmentation performance. The experimental results, shown in Table II, indicate that the added blocks provide notable improvements.

6. CONCLUSION
In this paper, the U2-MNet architecture was proposed. The model includes a WCM block that represents detailed information on local features within the window. An MCCM block was also used in each skip connection to reduce the noise generated by combining low-level and high-level features, so that corresponding feature maps could be aggregated more accurately. Experimental results exceeded the performance of current baseline network models.
This work was supported by the National Natural Science Foundation of China under Grant 62102331, the Natural Science Foundation of Sichuan Province under Grant 2022NSFSC0839, and the Doctoral Research Fund Project of Southwest University of Science and Technology under Grant 22zx7110.

TABLE I .
A COMPARISON OF SEGMENTATION METHODS ON THE BUSI

TABLE II .
THE RESULTS OF ABLATION EXPERIMENTS CONDUCTED WITH THE BUSI DATASET