Ultrasound image segmentation of renal tumors based on UNet++ with fusion of multiscale residuals and dual attention

Objective. Laparoscopic renal unit-preserving resection is a routine and effective means of treating renal tumors. Image segmentation is an essential step before tumor resection. The current segmentation method relies mainly on doctors' manual delineation, which is time-consuming, labor-intensive, and influenced by their personal experience and ability. Moreover, the quality of the segmented images is low, with problems such as blurred edges and unclear size and shape, which are not conducive to clinical diagnosis. Approach. To address these problems, we propose an automated segmentation method, i.e. the UNet++ algorithm fusing multiscale residuals and dual attention (MRDA_UNet++). It replaces the two consecutive 3 × 3 convolutions in UNet++ with the 'MultiRes block' module, which incorporates coordinate attention to fuse features from different scales and suppress the impact of background noise. Furthermore, an attention gate is added at the short connections to enhance the ability of the network to extract features from the target area. Main results. The experimental results show that MRDA_UNet++ achieves 93.18%, 92.87%, 93.66%, and 92.09% on the real-world dataset for MIoU, Dice, Precision, and Recall, respectively. Compared to the baseline model UNet++ on three public datasets, the MIoU, Dice, and Recall metrics improved by 6.00%, 7.90%, and 18.09%, respectively, on BUSI, by 0.39%, 0.27%, and 1.03% on Dataset C, and by 1.37%, 1.75%, and 1.30% on DDTI. Significance. The proposed MRDA_UNet++ exhibits obvious advantages in feature extraction, which can not only significantly reduce the workload of doctors, but also further decrease the risk of misdiagnosis. It is of great value in assisting doctors' diagnosis in the clinic.


Introduction
Malignant renal tumors, also known as kidney cancer, have become one of the ten most common cancers in the world in recent years (Jasinski et al 2023). According to statistics, the incidence of kidney cancer has increased from 3% to more than 7% over the past decade. Globally, about 338 000 people are diagnosed with kidney cancer each year, accounting for about 2.4% of all cancer diagnoses (Sung et al 2021). For the diagnosis of kidney tumors, patients usually need to undergo medical imaging examinations. The most commonly used imaging method is computed tomography (CT). A CT scan can reveal the size of the tumor and whether it has invaded local veins, lymph nodes, or surrounding organs. However, the ionizing radiation and nephrotoxic iodinated contrast agents used in CT can cause certain harm to the human body. Therefore, ultrasound (US) examination or magnetic resonance imaging (MRI) may also be selected based on the patient's condition and the doctor's judgment. In the past two decades, contrast-enhanced ultrasound methods based on ultrasound technology have attracted widespread attention from clinicians. The European Association of Urology (EAU) recommends using contrast-enhanced ultrasound for diagnosing renal lesions and characterizing renal tumors that remain indeterminate on CT or MRI (Ljungberg et al 2019). Significantly, ultrasound contrast agents are safe to use, with no nephrotoxicity and a low incidence of side effects (Cantisani et al 2021). The treatment of renal tumors mainly includes radical nephrectomy and partial nephrectomy. Partial nephrectomy is generally the preferred surgical approach, in which only the tumor portion of the kidney is excised, preserving the healthy portion of the kidney. However, in cases of larger tumors, multiple tumors, renal insufficiency, or involvement of the renal pelvis or ureter by renal cancer, radical nephrectomy is needed, in which both the renal tumor and the entire kidney are removed. This approach significantly damages the patient's renal function and is not adopted unless necessary (Tan et al 2013). In a partial nephrectomy treatment plan, the kidney and renal tumor first need to be accurately outlined in medical images to obtain the morphological details of the tumor, which helps the surgeon formulate a precise surgical plan for tumor removal. As shown in figure 1, the red circle marks the outlined kidney tumor region.
Currently, the segmentation of renal tumors relies on manual outlining by experienced clinicians. However, the outlining process is not only cumbersome and time-consuming, but the accuracy of the segmentation results is also affected by the doctor's personal experience and ability. Therefore, it is of great clinical significance if deep learning-based medical image segmentation technology can be utilized to assist doctors in diagnosis. Not only can it greatly reduce the workload of doctors, but it can also further reduce the risk of misdiagnosis.

Related work
While traditional segmentation methods and machine learning methods dominated image segmentation, a great deal of research was conducted on the segmentation of kidneys and their anatomical structures. However, there are few related research papers on kidney tumor segmentation based on CT, US, or MRI images, due to the great difficulty of the task. CT scans are the preferred diagnostic modality, and the following studies are all based on CT images. Kim and Park (2004) first used a gray-scale thresholding method for kidney segmentation; based on a texture analysis of kidney tumor sample images, a region-growing method was then used to segment kidney tumors, with parts of the tumors serving as seed points. Skalski et al (2016) used a hybrid level set method based on ellipsoidal constraints to locate the kidney region, computed feature vectors containing edge, region, direction, and spatial neighborhood information, and finally used a decision tree algorithm to segment the tumors. Lee et al (2017) first determined the target region of the kidney by intensity and location thresholds, extracting a large number of candidate masses at the same time. Block-based texture and contextual feature classification was then used to reduce false positives. Finally, the remaining candidates were used as seeds for kidney tumor segmentation, taking region growing, active contours, and outlier removal by size and shape as criteria.
In recent years, deep learning has made unprecedented achievements in the field of image segmentation, and significant breakthroughs have also been made in kidney tumor segmentation based on deep learning. Yang et al (2018) proposed a method relying on a three-dimensional (3D) fully convolutional network combined with a pyramid pooling module for segmenting kidneys and renal tumors in CT angiography images. Mu et al (2019) designed a VB-Nets network based on 3D V-Net (Milletari et al 2016), using a bottleneck structure instead of the regular convolutional layers in V-Net. This network uses a multiresolution strategy to localize the region of interest of the whole kidney at coarse resolution and then segment the kidney tumor at fine resolution. Zhao and Zeng (2019) proposed a multiscale-supervised 3D U-Net for the task of segmenting CT images of kidneys and renal tumors in the KiTS19 Challenge. This method uses multiscale supervision in the decoder path combined with an exponential logarithmic loss function. Finally, a connected-component-based post-processing method is designed to improve the segmentation performance. Yu et al (2019) proposed a cascaded, iteratively trainable segmentation model, Crossbar-Net. This method captures the global and local appearance information of renal tumors from both horizontal and vertical directions using two orthogonal non-square patches. The obtained cross-patches are then used to iteratively train horizontal and vertical sub-models. These two sub-models complement each other to achieve self-improvement until convergence. Xie et al (2020) proposed a SE-ResNeXT U-Net (SERU) model that fully utilizes the advantages of SE-Net, ResNeXT, and U-Net. A coarse-to-fine approach is used to achieve renal tumor segmentation using information from contextual and key slices of the left and right kidneys. Shen et al (2021) proposed the Convolution-and-Transformer network (COTRNet) by combining the advantages of global modeling with the transformer model. This network is an encoder-decoder architecture that uses four convolution-transformer layers in the encoder to learn multiscale features. These features have local and global receptive fields that are critical for the accurate segmentation of kidneys, renal tumors, and renal cysts. In addition, pre-trained weights and deep supervision are utilized to further improve segmentation performance.
However, whether based on traditional segmentation methods, machine learning methods, or deep learning methods, research on the automatic segmentation of renal tumors in contrast-enhanced ultrasound images using convolutional neural networks is, to our knowledge, extremely limited. In this paper, starting from practical needs, we address the problem of high segmentation difficulty caused by the varying sizes and shapes of renal tumor lesions in contrast-enhanced ultrasound images. The UNet++ network fusing multiscale residuals and dual attention (MRDA_UNet++) is proposed to solve it.

Overall framework of the MRDA_UNet++ model
The UNet++ (Zhou et al 2018) network is derived from UNet (Ronneberger et al 2015). The UNet model is so named because it resembles the English capital letter 'U', with the encoder on the left for feature extraction and the decoder on the right for feature restoration. Skip connections are added between the encoder and decoder to merge low-level and high-level semantic information, which makes the model more accurate and robust (Bousias and Armenakis 2020). UNet++ can be regarded as an effective integration of UNets of different depths. These UNets partially share an encoder and use a deep supervision strategy, so that each UNet branch has a corresponding loss function. A further pruning scheme is designed to allow users to trade off the speed and accuracy of the network. In addition, considering that UNet skip connections may produce semantic gaps due to the large difference in network depth, many short connections are added to fuse the convolutional features at different stages.
In the MRDA_UNet++ model proposed in this paper, the 'MCA' module is used to replace the convolutional part of UNet++, and the attention gate (AG) (Oktay et al 2018) module is added before concatenating the feature maps at the skip connections. The 'MCA' module consists of a 'MultiRes block' (Ibtehaz and Rahman 2020) plus coordinate attention (CA) (Hou et al 2021). To enhance the generalization ability of the network and the representational power of the network structure, the MultiRes block module helps the UNet++ network harmonize the learned image features across different scales. Adding CA helps the network focus on the target region more accurately and finely, effectively suppressing the influence of background noise. The AG module then further highlights the target region to be segmented, enhancing the network's ability to extract target features. The overall framework of the proposed MRDA_UNet++ model is illustrated in figure 2.

MCA module
Due to the variation in size and diverse shapes of renal tumors, this study replaces the two consecutive 3 × 3 convolutions in the original UNet++ convolutional module with the 'MultiRes block' to harmonize the features learned from images at different scales. To enhance the expressive power of the features, a CA module is embedded after concatenating the features output by the different convolutional kernels. This aims to capture both cross-channel information and position-sensitive features, thereby improving the accuracy of tumor localization in the network. The structure of the MCA module is shown in figure 3.

'MultiRes block' module
In medical image segmentation, there is interest in segmenting objects such as cell nuclei (Coelho et al 2009), organs (Yang et al 2018), tumors (Codella et al 2018), etc, from images of various modalities. However, in most cases, these objects of interest are irregular and vary in scale. The kidney tumors segmented in this study, based on ultrasound images, exhibit the same characteristics, as shown in figure 4.
Therefore, the network should be robust enough to analyze objects at different scales. Serre et al (2007) used a series of fixed Gabor filters at different scales to handle scale variations in an image. Later, the Inception architecture proposed by Szegedy et al (2015) utilizes convolutional kernels of different sizes in parallel to extract valid information from an image. Combining the perceptions obtained at these different scales yields richer features and thus more accurate final classification judgments, but this parallel structure greatly increases the memory requirements.
To solve this problem, Ibtehaz and Rahman proposed the 'MultiRes block' module. This module fuses the outputs of 3 × 3, 5 × 5, and 7 × 7 convolutional operations to extract spatial features at different scales. To avoid the high complexity associated with large-kernel convolutional computation, the outputs of the second and third consecutive 3 × 3 convolutional blocks are taken to effectively approximate the 5 × 5 and 7 × 7 convolutional operations, respectively. Thus, unlike the Inception architecture, it uses a serial approach, which effectively reduces the memory requirement. A 1 × 1 convolutional layer was also introduced to the model to extract some additional useful features. Additionally, a residual connection was added, as it has been proven effective in biomedical image segmentation (Drozdzal et al 2016). The structure of the 'MultiRes block' module is shown in figure 5.
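The serial 3 × 3 approximation above can be checked with standard receptive-field arithmetic: each stride-1 3 × 3 convolution grows the receptive field by 2 pixels. The sketch below is ours for illustration, not code from the paper.

```python
# Receptive-field arithmetic behind the 'MultiRes block': a chain of 3x3
# convolutions grows the receptive field by (k - 1) per layer, so the
# second and third 3x3 outputs see 5x5 and 7x7 neighbourhoods respectively.
def receptive_field(kernel_sizes):
    rf = 1
    for k in kernel_sizes:
        rf += k - 1  # assumes stride 1 and dilation 1
    return rf

print(receptive_field([3]))        # 3  (first 3x3 block)
print(receptive_field([3, 3]))     # 5  (approximates a 5x5 kernel)
print(receptive_field([3, 3, 3]))  # 7  (approximates a 7x7 kernel)
```

This is why the serial design matches the scale coverage of Inception's parallel 5 × 5 and 7 × 7 branches at lower cost.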

Coordinate attention (CA) mechanisms
CA is an efficient attention mechanism that not only acquires inter-channel information but also considers orientation-related position information. With flexible and lightweight characteristics, it can simply be inserted into the core structure of a mobile network, which helps the model better localize and identify the target. CA can be regarded as a feature enhancement computational unit: for an input X, the CA module outputs an enhanced feature Y = [y_1, y_2, …, y_C] with the same dimensions as X. The CA module consists of two parts: coordinate information embedding and coordinate information generation. The structure of CA is shown in figure 6.

(i) Embedding of coordinate information
Given the input feature map X, each channel is encoded along the horizontal and vertical coordinates using the pooling kernels (H, 1) and (1, W), respectively. The outputs of the Cth channel at height h and at width w can then be expressed as:

z_c^h(h) = (1/W) Σ_{0 ≤ i < W} x_c(h, i)  (1)

z_c^w(w) = (1/H) Σ_{0 ≤ j < H} x_c(j, w)  (2)
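The two directional pooling operations of equations (1) and (2) can be sketched in a few lines of NumPy (an illustrative re-implementation; array shapes and function names are our assumptions, not the authors' code):

```python
import numpy as np

def coordinate_embedding(x):
    """Directional average pooling used in CA's embedding step.

    x: input feature map of shape (C, H, W).
    Returns z_h of shape (C, H), averaging each row over the width
    (equation (1)), and z_w of shape (C, W), averaging each column
    over the height (equation (2)).
    """
    z_h = x.mean(axis=2)  # z_c^h(h) = (1/W) * sum_i x_c(h, i)
    z_w = x.mean(axis=1)  # z_c^w(w) = (1/H) * sum_j x_c(j, w)
    return z_h, z_w

x = np.arange(24, dtype=float).reshape(2, 3, 4)  # C=2, H=3, W=4
z_h, z_w = coordinate_embedding(x)
print(z_h.shape, z_w.shape)  # (2, 3) (2, 4)
```

Unlike global average pooling, which collapses each channel to a single scalar, these two pooled vectors retain the position of responses along each axis.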

(ii) Coordinate information generation
The outputs of equations (1) and (2) are first concatenated, and the features are then transformed using a 1 × 1 convolution F_1 and a nonlinear activation function δ:

f = δ(F_1([z^h, z^w]))  (3)

where f ∈ R^((C/r)×(H+W)) is an intermediate feature containing both horizontal and vertical spatial information, and r is a reduction factor.
Subsequently, f is split into two independent features f^h ∈ R^((C/r)×H) and f^w ∈ R^((C/r)×W). The number of channels is restored using two additional 1 × 1 convolutions F_h and F_w, and the Sigmoid function σ then transforms the features so that their dimensions are consistent with the input X:

g^h = σ(F_h(f^h))  (4)

g^w = σ(F_w(f^w))  (5)

Finally, the outputs g^h and g^w are expanded into a weight matrix and multiplied with the input feature map to obtain the output of this attention module:

y_c(i, j) = x_c(i, j) × g_c^h(i) × g_c^w(j)  (6)
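The recombination step can likewise be sketched in NumPy. Here the 1 × 1 convolutions F_h and F_w are stood in for by precomputed per-position logits; this is a hedged illustration of the attention weighting only, not the authors' implementation:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def apply_coordinate_attention(x, f_h, f_w):
    """Recombination step of CA.

    x:   (C, H, W) input feature map
    f_h: (C, H) logits along height (stand-in for F_h(f^h))
    f_w: (C, W) logits along width  (stand-in for F_w(f^w))
    """
    g_h = sigmoid(f_h)[:, :, None]  # (C, H, 1) attention along height
    g_w = sigmoid(f_w)[:, None, :]  # (C, 1, W) attention along width
    # y_c(i, j) = x_c(i, j) * g_c^h(i) * g_c^w(j), via broadcasting
    return x * g_h * g_w

x = np.ones((2, 3, 4))
y = apply_coordinate_attention(x, np.zeros((2, 3)), np.zeros((2, 4)))
print(y.shape)  # (2, 3, 4); sigmoid(0) = 0.5, so every entry is 0.25
```

Because each output pixel is scaled by the product of one row weight and one column weight, responses at positions whose entire row or column is uninformative are suppressed, which is the sense in which CA is "position-sensitive".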

Attention gate (AG) module
The AG module structure is shown in figure 7. AG is related to the human visual attention mechanism, which automatically focuses on the target region and learns to suppress irrelevant feature responses in the feature map while highlighting salient feature information that is critical for a specific task (Zhang et al 2020). The steps of the AG implementation are: (i) input the feature maps X_h and X_g, and sum the two features after linear transformation with a 1 × 1 convolutional kernel to enhance the target feature region, obtaining the feature map F_h; (ii) process F_h with the ReLU activation function to enhance the network's representation and learning ability; (iii) to reduce dimensionality and computational complexity, perform a 1 × 1 convolution operation to obtain the feature map f; (iv) normalize f using Sigmoid to obtain an importance score in the range [0, 1], i.e. the attention weight coefficient α; (v) multiply α with the input feature map X_h element by element to obtain X̂_h, enhancing the target feature representation and attenuating non-target feature responses.
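Steps (i)-(v) can be sketched as follows. This is a NumPy illustration under simplifying assumptions: the 1 × 1 convolutions are modelled as per-channel linear maps, and all weights are random stand-ins rather than trained parameters.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def attention_gate(x_h, x_g, w_h, w_g, psi):
    """Sketch of the AG steps (i)-(v).

    x_h: skip-connection features (C, H, W)
    x_g: gating features          (C, H, W)
    w_h, w_g: (C', C) weights modelling the two 1x1 transforms
    psi: (C',) weights of the final single-channel 1x1 convolution
    """
    # (i) linear transforms of both inputs, then summation
    f_sum = (np.einsum('kc,chw->khw', w_h, x_h)
             + np.einsum('kc,chw->khw', w_g, x_g))
    # (ii) ReLU activation
    f_relu = np.maximum(f_sum, 0.0)
    # (iii) 1x1 convolution down to a single channel -> feature map f
    f = np.einsum('k,khw->hw', psi, f_relu)
    # (iv) sigmoid -> attention coefficients alpha in [0, 1]
    alpha = sigmoid(f)
    # (v) elementwise reweighting of the skip features
    return x_h * alpha[None, :, :]

rng = np.random.default_rng(0)
x_h = rng.normal(size=(4, 8, 8))
x_g = rng.normal(size=(4, 8, 8))
out = attention_gate(x_h, x_g, rng.normal(size=(2, 4)),
                     rng.normal(size=(2, 4)), rng.normal(size=2))
print(out.shape)  # (4, 8, 8)
```

Since α lies strictly in (0, 1), the gate can only attenuate responses, never amplify them, which is why it acts as a soft mask over the skip-connection features.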

Experimental dataset
The image data for kidney tumor segmentation in this study were collected from dynamic contrast-enhanced ultrasound videos of 100 confirmed renal cell carcinoma patients admitted to the Department of Urology at Shanghai General Hospital from 2018 to 2020. This retrospective study complied with the requirements of the hospital's ethics review committee. All patients underwent ultrasound imaging in the supine and lateral positions. The ultrasound videos were recorded for two minutes in contrast-enhanced ultrasound mode, starting from the maximum cross-sectional area of the lesion. The contrast-enhanced ultrasound phase was divided into the renal parenchymal enhancement phase and the renal parenchymal washout phase, according to the guidelines of the Chinese Society of Ultrasound in Medicine. The entire process of contrast agent perfusion, peak enhancement, and washout was recorded and stored in lossless DICOM format. The videos were read using the RadiAnt Viewer tool at a frame rate of 8 frames per second (8 frames/s), and ultrasound images were extracted for analysis starting from the renal parenchymal enhancement phase. The images were saved in JPG format using systematic sampling. If tumor position displacement was observed in a sampled image, it was excluded from the dataset. All contrast-enhanced ultrasound images were manually annotated by two experienced ultrasound radiologists, who delineated the lesions and cross-validated their annotations, serving as the reference standard for segmentation, as shown in figure 4. A total of 3355 images were eventually selected for the experiment, with 3020 images used for training and 335 images used for testing. We abbreviate this renal tumor ultrasound image dataset as RTUI.
The BUSI (Al-Dhabyani et al 2020) dataset consists of breast ultrasound images collected from 600 female patients in 2018, comprising a total of 780 images. The average image size is 500 × 500 pixels, including 487 malignant lesion images, 210 benign lesion images, and 133 normal ultrasound images. Because this study focuses on kidney tumor segmentation, only single-tumor segmentation tasks were considered, and multi-tumor images and normal images were excluded. Therefore, a final selection of 632 images was used for the experiment, with 532 images used for training and 100 images used for testing.
The Dataset C (Huang et al 2020) dataset consists of 320 breast ultrasound lesion images provided by the Sun Yat-sen University Cancer Center. The average image size is 128 × 128 pixels, and the dataset contains 160 benign lesion images and 160 malignant lesion images. Of these, 220 were used for training and 100 for testing.
The DDTI (Pedraza et al 2015) dataset is a publicly available dataset of thyroid nodule ultrasound images. This dataset includes both classification and segmentation information. For the segmentation task, images with damaged annotations and images with multiple nodules in a single scan were excluded. A total of 637 images were selected for the experiment, with 537 images used for training and 100 images used for testing.

Experimental environment and parameter settings
The proposed model in this paper is built, trained, and tested using the PyTorch deep learning framework. The specific hardware and software environments are shown in table 1.
During training, the binary cross-entropy loss function was used to compute the model's loss value. The AdamW optimizer was utilized to continuously adjust and optimize the network parameters. It is worth mentioning that the batch size had a slight impact on the experimental results, but due to device limitations, a batch size of 4 was used in this study. The training process consisted of 100 epochs, with a learning rate of 1 × 10^−3 for the first 50 epochs and 1 × 10^−5 for the remaining 50 epochs. After training, the model with the minimum loss value was saved for segmentation testing.
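For reference, the per-pixel binary cross-entropy loss mentioned above can be written out directly. The following is a plain NumPy sketch of the standard formula, not the training code itself; the example values are ours.

```python
import numpy as np

def binary_cross_entropy(pred, target, eps=1e-7):
    """Per-pixel binary cross-entropy over a predicted mask.

    pred:   predicted foreground probabilities in (0, 1), any shape
    target: ground-truth mask of 0s and 1s, same shape
    """
    pred = np.clip(pred, eps, 1.0 - eps)  # numerical safety near 0 and 1
    return float(np.mean(-(target * np.log(pred)
                           + (1 - target) * np.log(1 - pred))))

pred = np.array([0.9, 0.1, 0.8, 0.3])
mask = np.array([1.0, 0.0, 1.0, 0.0])
print(round(binary_cross_entropy(pred, mask), 4))  # 0.1976
```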
Pixel counting is used to calculate the true positive (TP), true negative (TN), false positive (FP), and false negative (FN) values of the two-dimensional confusion matrix, where k refers to the number of categories; TP denotes a positive sample that is predicted to be positive by the model; TN denotes a negative sample that is predicted to be negative; FP denotes a negative sample that is predicted to be positive; and FN denotes a positive sample that is predicted to be negative.
The segmentation results of the different methods on RTUI are presented in table 2, where the best performance indicators are highlighted in bold (the same applies to subsequent comparisons). It can be observed that MRDA_UNet++ outperforms the other networks in the MIoU, Dice, and Recall metrics. Compared to the baseline network UNet++, our model achieves improvements of 0.54% in MIoU, 0.61% in Dice, and 2.64% in Recall. Although the Precision metric is lower than that of some networks, considering all evaluation metrics, especially MIoU and Dice, which are crucial for semantic segmentation, the overall performance of our model is superior, enabling more accurate segmentation. Figure 8 displays ultrasound images of renal tumors from four patients, with the second image in each row representing the gold standard manually delineated by doctors. The subsequent nine images depict the segmentation results of the different networks, followed by the corresponding subtraction figures in the next row. In the subtraction figures, blue represents tumor regions that were wrongly predicted as background by the model, while red represents background regions that were wrongly predicted as tumor. From the figure, it can be observed that MRDA_UNet++ achieves more accurate segmentation results than the other methods when dealing with tumors of varying sizes and shapes. In the first segmented ultrasound image, it is evident that the results of the MultiResUNet method differ significantly from the ground truth. In contrast, our method incorporates a CA module to enhance the features of the tumor region, reducing spatial information loss. The AG module highlights the tumor area and suppresses the background region, enabling the model to accurately identify tumors of different sizes and significantly improve edge segmentation accuracy.
Furthermore, when facing the segmentation of larger tumor regions, as shown in the fourth row, SegNet and UNet exhibit significant deviations from the ground truth due to the lack of multiscale information interaction. Although other methods can also predict the tumor regions effectively, our proposed method, by introducing a 'dual attention' mechanism, comes closer to the ground truth in terms of details. However, for kidney tumor regions with severe ultrasound artifacts and irregular shapes, as shown in the third row, all networks face difficulty in capturing edge details and even in extracting the target region, which is a crucial factor affecting the accuracy of network segmentation.

Model generalization performance assessment
To evaluate the generalization performance of our algorithm, comparative experiments were conducted on three publicly available datasets: BUSI, Dataset C, and DDTI. Adhering to the principle of fairness, all experiments were conducted in the same environment, and the training parameters were kept consistent. Figure 9 presents the visual results of our proposed algorithm and eight comparative algorithms on the three datasets. The red regions indicate the manually delineated gold standard, while the green regions represent the results of automatic segmentation by the model.
Observing the segmentation results of the different networks on the second group of BUSI images, it can be seen that, except for DeeplabV3+ and our proposed method, the other methods mistakenly treat the black regions in the ultrasound images as target regions, resulting in significant deviations from the ground truth. Although DeeplabV3+ utilizes dilated convolutions to increase the global receptive field and fuses multiscale information with a pyramid structure, its lack of an attention mechanism leads to inferior recognition and edge feature extraction capabilities compared to our method, particularly for small targets.
On Dataset C, which consists of smaller images with larger target regions and fewer issues such as blurry edges, speckle noise, and ultrasound artifacts, the segmentation task is relatively easier. From the results, it can be observed that the segmentation results of the various networks are close to each other. However, our proposed method still outperforms the others in handling edge features by incorporating attention mechanisms.
Analyzing the first group of ultrasound images in the DDTI dataset, it is evident that the contrast between the nodular and non-nodular regions is low, posing a greater challenge for the eight compared network models than for our approach. Among them, the UNet3+ network is particularly affected by noise interference. Combined with the results from the second group of the BUSI dataset, it can be seen that our model significantly enhances the recognition and extraction of the target region by introducing multiscale information interaction and a 'dual attention' mechanism.
Tables 3-5 present the quantitative results of our proposed algorithm and eight comparative algorithms on the BUSI, Dataset C, and DDTI datasets. From the three tables, it can be seen that our algorithm performs best in terms of the MIoU, Dice, and Recall metrics. On the BUSI dataset, our algorithm improves these metrics by at least 3.13%, 4.08%, and 14.26%, respectively; compared to UNet++, the improvements are 6.00%, 7.90%, and 18.09%. On Dataset C, our algorithm improves the metrics by at least 0.18%, 0.21%, and 0.32%, respectively, and relative to UNet++ the improvements are 0.39%, 0.27%, and 1.03%. On the DDTI dataset, our algorithm improves the metrics by at least 0.17%, 0.22%, and 0.32%, respectively, while relative to UNet++ the improvements are 1.37%, 1.75%, and 1.30%.

Ablation experiments
To verify the effectiveness and soundness of each module in MRDA_UNet++, this study conducted ablation experiments on RTUI and BUSI. The ablation experiments are divided into four groups: group A uses the UNet++ network, group B replaces the convolutional modules in UNet++ with the 'MultiRes block', group C replaces the convolutional modules in UNet++ with the 'MultiRes block' combined with CA (i.e. MCA), and group D adds the AG module on top of the group C experiment. The experimental setup and training parameters are kept consistent across all groups. The specific results of the ablation experiments are presented in tables 6 and 7.
From the tables, it can be seen that with the addition of each improvement module to the base network UNet++, the MIoU, Dice, and Recall metrics show varying degrees of improvement. The addition of the CA and AG modules leads the model to focus strongly on the target region, which may cause some background regions to be mis-segmented as targets, increasing FP. As a consequence, Precision may be lower than that of the baseline model. However, tumor segmentation tasks emphasize minimizing missed detections, i.e. Recall. Therefore, considering all metrics, our proposed method outperforms UNet++, confirming the effectiveness of the proposed approach.

Conclusion
In this paper, we address the challenges of fuzzy edges, varying sizes, and diverse shapes in the segmentation of renal tumor lesions in ultrasound images. We propose a segmentation model called MRDA_UNet++, which fuses multiscale residuals and dual attention mechanisms. The model utilizes the different-sized convolution kernels in the 'MultiRes block' module to extract richer features from the target region, enabling effective handling of variations in lesion size and shape; by fusing these features, the model achieves a multiscale receptive field. Additionally, a CA module is incorporated to enhance feature localization and alleviate the impact of boundary ambiguity, reducing missed segmentations. The AG module is employed at the short connections to highlight the target region and improve the model's ability to extract target features. Experimental results on the renal tumor ultrasound image (RTUI) dataset demonstrate the superior performance of MRDA_UNet++, with MIoU and Dice reaching 93.18% and 92.87%, respectively. Furthermore, the algorithm demonstrates excellent results when tested on three public datasets, namely BUSI, Dataset C, and DDTI. This offers a novel implementation approach for delineating ultrasound renal tumor lesions. However, the algorithm has a lower Precision metric, which may lead to instances of incorrect segmentation. Although tumor segmentation tasks prioritize the Recall metric, our next research focus is on how to simultaneously improve Precision and comprehensively enhance the segmentation performance of the model.

Figure 2. The overall framework of the MRDA_UNet++ model.

Figure 4. Renal tumor ultrasound images and labels.
Here z_c^h(h) represents the output of the Cth channel with height h, z_c^w(w) represents the output of the Cth channel with width w, and x_c(i, j) represents the value of the Cth channel of the input feature map at position (i, j).

Figure 8. Comparison of different methods on RTUI.

Table 1. Experimental hardware and software environment.

In this paper, we adopt the evaluation indexes commonly used in the medical image segmentation field: mean intersection over union (MIoU), Dice, Precision, and Recall. The values of these indicators all lie between 0 and 1, and values closer to 1 mean that the model is more effective. The calculation methods are shown in the following formulas:

MIoU = (1/(k + 1)) Σ_{i=0}^{k} TP / (TP + FP + FN)

Dice = 2TP / (2TP + FP + FN)

Precision = TP / (TP + FP)

Recall = TP / (TP + FN)
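These four metrics can be computed from a pair of binary masks by pixel counting, as sketched below in NumPy (an illustrative implementation; for the binary foreground/background case, MIoU averages the IoU of the two classes):

```python
import numpy as np

def segmentation_metrics(pred, gt):
    """MIoU, Dice, Precision and Recall from binary masks via pixel counting.

    pred, gt: numpy arrays of 0/1 values with the same shape.
    """
    tp = float(np.sum((pred == 1) & (gt == 1)))
    tn = float(np.sum((pred == 0) & (gt == 0)))
    fp = float(np.sum((pred == 1) & (gt == 0)))
    fn = float(np.sum((pred == 0) & (gt == 1)))
    iou_fg = tp / (tp + fp + fn)   # IoU of the tumor class
    iou_bg = tn / (tn + fp + fn)   # IoU of the background class
    miou = (iou_fg + iou_bg) / 2   # mean over the two classes (k = 1)
    dice = 2 * tp / (2 * tp + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return miou, dice, precision, recall

pred = np.array([[1, 1, 0, 0]])
gt   = np.array([[1, 0, 1, 0]])
m, d, p, r = segmentation_metrics(pred, gt)
print(d, p, r)  # 0.5 0.5 0.5
```

Note that Dice weights true positives twice and therefore always lies at or above the foreground IoU, which is why both are reported together in the tables.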

Table 2. Segmentation results of different methods on RTUI.

Table 3. Segmentation results of different methods on BUSI.

Table 4. Segmentation results of different methods on Dataset C.

Table 5. Segmentation results of different methods on DDTI.

Table 6. Ablation experiments with different modules on RTUI.

Table 7. Ablation experiments with different modules on BUSI.