WATUNet: a deep neural network for segmentation of volumetric sweep imaging ultrasound

Abstract Limited access to breast cancer diagnosis globally leads to delayed treatment. Ultrasound, an effective yet underutilized method, requires specialized training for sonographers, which hinders its widespread use. Volume sweep imaging (VSI) is an innovative approach that enables untrained operators to capture high-quality ultrasound images. Combined with deep learning, like convolutional neural networks, it can potentially transform breast cancer diagnosis, enhancing accuracy, saving time and costs, and improving patient outcomes. The widely used UNet architecture, known for medical image segmentation, has limitations, such as vanishing gradients and a lack of multi-scale feature extraction and selective region attention. In this study, we present a novel segmentation model known as Wavelet_Attention_UNet (WATUNet). In this model, we incorporate wavelet gates and attention gates between the encoder and decoder instead of a simple connection to overcome the limitations mentioned, thereby improving model performance. Two datasets are utilized for the analysis: the public ‘Breast Ultrasound Images’ dataset of 780 images and a private VSI dataset of 3818 images, captured at the University of Rochester by the authors. Both datasets contained segmented lesions categorized into three types: no mass, benign mass, and malignant mass. Our segmentation results show superior performance compared to other deep networks. The proposed algorithm attained a Dice coefficient of 0.94 and an F1 score of 0.94 on the VSI dataset and scored 0.93 and 0.94 on the public dataset, respectively. Moreover, our model significantly outperformed other models in McNemar’s test with false discovery rate correction on a 381-image VSI set. The experimental findings demonstrate that the proposed WATUNet model achieves precise segmentation of breast lesions in both standard-of-care and VSI images, surpassing state-of-the-art models. 
Hence, the model holds considerable promise for assisting in lesion identification, an essential step in the clinical diagnosis of breast lesions.


Introduction
Breast cancer is one of the most fatal forms of cancer and has become a significant public health concern (1). It is the second most common cause of cancer-related deaths worldwide in women (2). Early detection and treatment of breast cancer are critical to increase the chances of survival and prevent metastasis.
Ultrasound imaging is a first-line diagnostic tool for detecting breast cancer. It is a safe, portable, cost-effective, and non-invasive imaging modality that uses high-frequency sound waves to create images of breast tissue. Ultrasound imaging may be the only diagnostic option available in low- and middle-income countries (LMICs). However, a significant challenge to the widespread use of ultrasound is the need for trained sonographers, who require wide-ranging skills that can take months to years to acquire (3). A potential solution to this challenge is to adopt volume sweep imaging (VSI) (4,5). VSI is a recently developed imaging technology that has the potential to revolutionize access to medical imaging for breast evaluation, especially in resource-limited settings (6,7,8). In addition to being affordable compared to other imaging modalities, including standard-of-care ultrasound, VSI enables inexperienced operators to acquire high-quality ultrasound images using a standardized imaging protocol, which requires minimal training (3,7). VSI has been clinically tested for breast, obstetrics, lung, thyroid, and right upper quadrant scanning indications, showing promising results both in the United States and Peru (4,9,10). Unlike traditional ultrasound, which demands highly skilled sonographers to operate the equipment and interpret the images, VSI utilizes a simplified approach based on external body landmarks (4). With VSI for breast ultrasound, an operator starts by positioning the ultrasound probe on the surface of the breast over a palpable breast lump, then sweeps the probe in a uniform pattern while maintaining contact with the skin. This produces a series of video clips that cover the entire target region (the palpable breast lump). These video sweeps are then sent to radiologists or computer-aided diagnostic systems for further interpretation.
The use of ultrasound imaging for detecting and diagnosing breast cancer has been significantly improved with the introduction of machine learning (ML) and deep learning (DL) techniques. Compared with other ML techniques, DL algorithms, such as convolutional neural networks (CNNs), show increased potential and reliability in diagnosing disease (5,11,12,13). In the case of segmentation, these algorithms can be trained on large datasets to find boundaries by recognizing their features, such as shape, texture, and color. The DL models can also learn to accurately distinguish the region of interest from the background and other structures in the image. In addition to improving the accuracy of detection and diagnosis, DL algorithms can also help reduce the time and cost associated with traditional methods of diagnosis. This is particularly significant in resource-constrained settings with limited access to trained medical professionals and expensive diagnostic equipment.
The task of image segmentation involves categorizing pixels in an image with labels that describe their meaning. This can be achieved through semantic segmentation, which involves labeling individual pixels with object categories (14). Another approach is instance segmentation, which involves separating individual objects in the image (15). Panoptic segmentation combines both semantic and instance segmentation (16). Semantic segmentation is typically more difficult than whole-image classification because it requires labeling each pixel with a specific category rather than predicting a single label for the entire image. This paper presents a novel framework for semantic segmentation of breast tissues in ultrasound images in both standard-of-care and VSI imaging. We propose a novel segmentation architecture known as WATUNet, in which a combination of wavelet decomposition and an attention mechanism replaces the plain connection between encoder and decoder in the UNet structure, in order to overcome the issues with plain skip connections and to extract more features from ultrasound images. The proposed framework includes two stages. First, image enhancement and preprocessing are applied to the inputs. Next, our proposed segmentation framework analyzes the images and segments the mass area. Integration of VSI and WATUNet would enable the potential for rapid automatic diagnosis of palpable breast lumps without a radiologist or a sonographer (5,17).
The manuscript is divided into several sections, with section 1 being this introduction. Section 2 outlines a brief review of previous works to develop an understanding of the earlier and current state of knowledge in the image segmentation field. In section 3, we describe the datasets used in this study, along with the various preprocessing steps employed to ensure the integrity and reliability of the data; results from these steps are discussed in Appendix C. In section 4, we introduce our novel WATUNet architecture, providing a detailed overview of the model's structure and operation. Additionally, we discuss its unique features and highlight its expected contributions to the field. In section 5, we present the results of our experiments and analyze these findings in depth. Finally, we conclude the study with a discussion of the implications of our results and reflect on the broader significance of this work within the context of the field.

Literature review
Image segmentation is a significant task in the fields of computer vision and image processing, and has wide-ranging applications such as medical image analysis, scene comprehension, robotic perception, video surveillance, image compression, and augmented reality. Numerous image segmentation methods have been developed over time, starting from the earliest approaches such as thresholding (18), region growing (19), and clustering (20), to more sophisticated methods such as active contours (21), graph cuts (22), and Markov random fields (23). However, in recent years, DL models have emerged as a new generation of image segmentation models with significant performance, often achieving the highest accuracy rates on commonly used benchmarks (24). This has led to a fundamental change in the field of image segmentation.
In their work, Noh et al. (25) presented a semantic segmentation method called DeConvNet, which consists of two main parts: an encoder that utilizes convolutional layers from the VGG16 network, and a multilayer deconvolutional network that takes in the feature vector and generates a map of class probabilities that are precise to the pixel level. One limitation of this work is that the DeConvNet model relies on the use of a pre-trained VGG16 network as the encoder, which may not be the most optimal choice for all types of images and datasets. Using a fixed pre-trained network may limit the model's ability to adapt to new types of images and features that are specific to a particular task or domain. In addition, the DeConvNet model does not incorporate any attention mechanisms or other advanced techniques for enhancing feature representation, which may limit its ability to accurately segment complex objects or scenes.
Badrinarayanan and colleagues introduced SegNet (26), which is based on an encoder-decoder structure for semantic segmentation. Similar to the deconvolution network, SegNet's core trainable segmentation engine also contains an encoder network that has the same structure as the 13 convolutional layers in the VGG16 network. Additionally, it has a decoder network and a pixel-wise classification layer.
SegNet's uniqueness lies in the way the decoder performs nonlinear upsampling of its lower-resolution input feature map(s). SegNet utilizes the pooling indices calculated during the max-pooling step of the corresponding encoder to perform this task. The utilization of max pooling for pooling indices computation, while widely used in CNNs, can have both beneficial and detrimental effects on the accuracy of segmentation output. Specifically, the use of max pooling in this context may result in loss of crucial spatial information and cause blurring or misalignment of object boundaries, thereby limiting the accuracy of segmentation (27).
One of the limitations of the encoder-decoder structure for segmentation is that the decoder might lose fine-grained spatial information during the downsampling process, which can result in lower segmentation accuracy. This limitation is addressed in the UNet architecture by introducing skip connections that directly connect the corresponding encoder and decoder layers. These connections allow the decoder to access and use the detailed spatial information from the encoder, resulting in more accurate segmentation. The skip connections provide a way to recover the lost spatial information by fusing the feature maps from the encoder and decoder. Ronneberger et al. (28) proposed the UNet for efficiently segmenting biological microscopy images. The UNet architecture includes two parts: a contracting path to capture context, and a symmetric expanding path to enable precise localization. The UNet training strategy relies on the use of data augmentation to learn effectively from very few annotated images. It was trained on 30 transmitted light microscopy images, and it won the 2015 International Symposium on Biomedical Imaging Cell Tracking Challenge. Since the introduction of the UNet architecture, numerous adaptations of the original model have been developed to address various types of images and issues, for example, UNet++ (29,30), UNet3+ (31), Weighted Res-UNet (32), Sharp UNet (33), Attention UNet (34), and Sharp Attention UNet (35).
UNet++ was designed to achieve more precise segmentation compared to the UNet model with plain skip connections by combining multiple UNets of different depths. The decoders of these UNets are interconnected using redesigned skip pathways that address two significant issues with the original UNet design: the uncertainty regarding the optimal depth of the architecture, and the overly restrictive design of the skip connections. Notwithstanding its superior performance, the complexity of the UNet++ model is relatively high, rendering it unsuitable for real-time processing applications. Furthermore, the increased complexity of the model results in a higher risk of overfitting, which can adversely affect its generalization performance on new datasets. The model also fails to capture adequate information from the entire range of scales, leaving ample room for further enhancement. To address this limitation, UNet3+ (31) was proposed. UNet3+ takes advantage of full-scale skip connections and deep supervision to incorporate low-level details with high-level semantics from feature maps at different scales. UNet3+ is especially suited for organs that appear at varying scales. Nevertheless, when confronted with small objects within the image, this model demonstrates a decline in accuracy.
The Sharp UNet architecture (33) consists of a standard UNet encoder-decoder structure, with the addition of a sharpening module between the encoder and decoder layers. This sharpening filter performs a convolution operation on the encoder feature maps before fusing them with the decoder features. It helps to make the encoder and decoder features semantically less dissimilar. We recently (35) introduced a model known as "Sharp Attention UNet", which is an extension of the UNet architecture. The Sharp Attention UNet incorporates attention mechanisms to emphasize relevant regions and sharpening filters to enhance image details for lesion segmentation within ultrasound images. This approach has been refined for VSI sweeps and is now termed WATUNet, the subject of this paper, which is explained in the following sections.

Dataset and data preprocessing

Data collection
This study utilized two ultrasound breast imaging datasets for analysis. The following section provides a detailed description of each dataset.

BUSI dataset
The first dataset used in this study is known as "Breast Ultrasound Images" (BUSI) (36). The dataset consists of 780 images from 600 female patients between the ages of 25 and 75, collected in 2018. Patients were scanned using a LOGIQ E9 ultrasound system, and lesions were segmented with manually traced masks from the radiologist's evaluation. The images were classified into three groups: (1) 133 normal images without masses, (2) 437 images with benign masses, and (3) 210 images with malignant masses.
The images are in PNG format, have varying heights and widths, and an average size of 600 × 500 pixels.
The data was preprocessed by removing non-image text and labels.

VSI dataset
The VSI dataset was obtained from a clinical study performed at the University of Rochester (4). Patients with a palpable breast lump underwent a VSI examination using the Butterfly iQ ultrasound probe (Butterfly Network, Guilford) employing the small organ preset. A medical student was trained for less than 2 hours on the VSI protocol. The first step in the protocol is to mark the palpable area with an "X", as shown in Figure 1. The patient lies supine with their arm above their head, and the marked area is scanned with eight sweeps in transverse, sagittal, radial, and anti-radial orientations to image the mass in different planes. This yields eight separate attempts to acquire a diagnostic image, greatly increasing the chance of obtaining at least one diagnostic view. The protocol takes minimal time to learn, typically 1-2 hours. Patients are scanned with a breast preset, and operators do not change any probe settings from the preset. The operator does not interpret the image, and VSI is ideally performed focusing on the sweep over the target region, not the ultrasound screen. The exam is short and can be completed within 10 minutes, including setup. Full details of the VSI protocol are included in Marini et al. (4).

Figure 1 Demonstration of the technique for conducting breast volume sweep imaging (VSI). Ultrasound probe sweeps are executed in various orientations, including transverse, sagittal, radial, and anti-radial. The picture is adapted from (5).

Each cine sweep averages 9 seconds, and the frame rate is typically 17 frames per second. The frames near the center of the sweep were selected for segmentation, as tumors or abnormalities are more likely to be detected near the center of the ultrasound sweep. The individual frames from the center of the sweep were meticulously segmented, with every other frame being processed using the Image Segmenter application in MATLAB (version 2022b, MathWorks, Natick, MA, USA). No mass or abnormality was found in 70 cases. In 52 cases, benign masses were found, and in 20 cases, malignant masses were found. Our study includes recorded biopsy data, though biopsies were not always performed.
The dataset is a three-class dataset consisting of 3818 frames: 2048 with no masses, 820 with benign findings, and 950 with malignant findings.
If a sonographic mass was absent, the pathology was deemed benign, and thus no biopsy was carried out. For patients with benign-appearing masses, follow-up imaging was sometimes conducted in lieu of biopsy. To classify pathology as benign in those cases, we relied on the standard-of-care image being assigned a BI-RADS (Breast Imaging Reporting and Data System) 2, which indicates a zero percent chance of cancer (100% chance of being benign). Pathologically, masses with a BI-RADS 3 classification hold a 98% chance of being benign, with the remaining 2% indicating potential malignancy. Therefore, although these masses were likely benign, we classified them as unknown. Notably, we did not encounter any BI-RADS 4 classifications in our study that were not biopsied. Additional information regarding pathology and its determination was previously published (4).

Data enhancement and augmentation
Applying image enhancement techniques to medical images, especially ultrasound images, before feeding them to DL models can help to improve the accuracy of the models. Ultrasound images can suffer from various artifacts, such as speckle noise, attenuation, and poor contrast, which can affect the quality of the image and make it difficult to interpret. Image enhancement techniques can be applied to remove or reduce these artifacts and improve the overall quality of the image. DL models can learn to identify and classify features in images based on patterns in the pixel values. By enhancing the image quality, the patterns become clearer, and the model can more accurately identify the features in the image.
Similar to our previous model (35), in this study we applied contrast-limited adaptive histogram equalization (CLAHE) (37), an image enhancement method that improves upon adaptive histogram equalization (AHE) (38) by addressing the issue of excessive contrast amplification. AHE can over-enhance contrast; CLAHE resolves this by clipping the histogram to limit the contrast gain before computing the equalization mapping.
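The clipping step that distinguishes CLAHE from plain AHE can be sketched in a few lines of NumPy. This is a simplified *global* version for illustration only: the real algorithm applies the same idea per local tile with bilinear interpolation between tiles, and in practice a library implementation (e.g., OpenCV's `createCLAHE`) would be used.

```python
import numpy as np

def clipped_hist_equalize(img, clip_limit=0.02, n_bins=256):
    """Global histogram equalization with a clip limit (simplified CLAHE idea).

    img: 2-D uint8 array. clip_limit: maximum fraction of pixels per bin.
    The excess above the clip is redistributed uniformly across all bins,
    which caps contrast amplification -- the key difference from plain AHE.
    """
    hist, _ = np.histogram(img, bins=n_bins, range=(0, 256))
    limit = max(1, int(clip_limit * img.size))
    excess = np.maximum(hist - limit, 0).sum()          # pixels above the clip
    hist = np.minimum(hist, limit) + excess // n_bins   # redistribute excess
    cdf = np.cumsum(hist).astype(np.float64)
    cdf = (cdf - cdf.min()) / (cdf.max() - cdf.min())   # normalize to [0, 1]
    lut = np.round(255 * cdf).astype(np.uint8)          # intensity mapping
    return lut[img]

# Example: a low-contrast image whose gray values span only 100-139
rng = np.random.default_rng(0)
low_contrast = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
enhanced = clipped_hist_equalize(low_contrast)
```

The equalized output spreads the narrow input range across the full intensity scale, while the clip limit prevents any single histogram peak from dominating the mapping.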
Data augmentation is a technique used to increase the size of a training dataset by modifying existing data samples. It helps to prevent overfitting by introducing more variability and diversity into the training data. This technique aims to create a more representative and generalized set of training data, particularly in the medical imaging field, where data scarcity is a concern. However, it is important to carefully select augmentation parameters to avoid negative impacts on diagnostic accuracy (39). For ultrasound images, extreme brightness or zoom adjustments may lead to the loss of important details or distortions that can affect model predictions. In our study, augmentation parameters include random rotation, zoom, horizontal flip, width and height shifts, shear transformation, and brightness adjustment.
These parameters configure data generators that produce augmented samples during model training. By applying these transformations randomly to the training data, the augmented dataset better represents real-world scenarios and reduces overfitting. In the results section, the performance of the segmentation model with and without data augmentation will be presented.
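A few of the transformations listed above (horizontal flip, height/width shift, brightness adjustment) can be sketched in NumPy. The parameter ranges here are illustrative assumptions, not the exact values used in the study; `np.roll` wraps pixels around rather than padding as a real augmentation pipeline would, and rotation, zoom, and shear are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(42)

def augment(img, max_shift=0.1, brightness_range=(0.9, 1.1)):
    """Random horizontal flip, height/width shift, and mild brightness change.

    img: 2-D float array with values in [0, 1]. All parameters are
    illustrative; a production pipeline would use a framework generator.
    """
    out = img.copy()
    if rng.random() < 0.5:                      # random horizontal flip
        out = out[:, ::-1]
    h, w = out.shape
    dy = rng.integers(-int(max_shift * h), int(max_shift * h) + 1)
    dx = rng.integers(-int(max_shift * w), int(max_shift * w) + 1)
    out = np.roll(out, (dy, dx), axis=(0, 1))   # height/width shift (wrapping)
    out = out * rng.uniform(*brightness_range)  # mild brightness adjustment
    return np.clip(out, 0.0, 1.0)               # keep a valid intensity range

img = rng.random((32, 32))
batch = np.stack([augment(img) for _ in range(4)])  # 4 augmented variants
```

Each call draws fresh random parameters, so repeated passes over the same image yield different training samples, which is exactly how generators inject variability during training.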

Methods
In this section, we explore model optimization techniques and network fine-tuning (Figure 2). Subsequently, the model's performance is evaluated using the designated test set.

Model optimization
Model optimization is an essential part of improving the performance of DL models. One significant instance is the backpropagation technique, which is closely intertwined with the optimization process via gradient descent. However, backpropagation can be slow, so fine-tuning and optimization can make this process faster. Another important instance is optimization's role in addressing overfitting and underfitting, which ultimately enhances the model's ability to make accurate predictions on new data.
One optimization technique is exploring different activation functions. Previously, ReLU was a widely used activation function, but given its limitations, we tried alternatives such as Leaky ReLU, swish, and mish. Swish is a smooth function that has shown promising results in certain architectures. Mish is another activation function, which is non-monotonic and has self-regularization properties. Both activation functions exhibit negative curvature and smoothness, enabling efficient optimization and faster convergence in deep neural networks. We described each of these activation functions in a previous work (35).
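Both activation functions have simple closed forms, shown here in NumPy: swish(x) = x·sigmoid(x) and mish(x) = x·tanh(softplus(x)).

```python
import numpy as np

def swish(x):
    """Swish: x * sigmoid(x). Smooth and non-monotonic near zero."""
    return x / (1.0 + np.exp(-x))

def softplus(x):
    """Numerically stable softplus, log(1 + exp(x))."""
    return np.logaddexp(0.0, x)

def mish(x):
    """Mish: x * tanh(softplus(x)). Smooth, with slight negative curvature."""
    return x * np.tanh(softplus(x))

x = np.linspace(-4.0, 4.0, 9)
# Unlike ReLU, both functions pass small negative values through
# (e.g., swish(-1) is approximately -0.27), which keeps gradients
# alive for negative pre-activations.
y_swish, y_mish = swish(x), mish(x)
```

For large positive inputs both functions approach the identity, so they behave like ReLU where it matters while avoiding its hard zero cutoff.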
Dropout is another regularization technique in neural networks that prevents overfitting by randomly deactivating neurons during training. This encourages the network to learn more generalized features and improves performance on new data. Furthermore, by randomly deactivating neurons, dropout reduces co-adaptation between them, preventing excessive specialization and overfitting.
The optimal values for dropout in neural networks vary depending on factors like dataset, network architecture, and training method. Gal et al. (40) found that higher dropout probabilities are generally better for deeper layers. The specific task and dataset influence the optimal dropout probability. In this study, the optimal dropout values were chosen as 0.1 for the encoder and 0.5 for the decoder. These values were based on the recommendations of Gal et al. (40) and fine-tuned through an iterative trial-and-error process.
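The mechanism can be illustrated with a minimal "inverted dropout" sketch in NumPy, using the encoder and decoder rates chosen above. Framework layers (e.g., Keras's `Dropout`) behave equivalently; this sketch is only to make the rescaling explicit.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, rate, training=True):
    """Inverted dropout: zero a `rate` fraction of activations during
    training and rescale survivors by 1/(1 - rate), so the expected
    activation is unchanged and no rescaling is needed at inference."""
    if not training or rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

acts = np.ones(1000)
enc = dropout(acts, rate=0.1)   # encoder rate used in this study
dec = dropout(acts, rate=0.5)   # decoder rate used in this study
# Despite zeroing half the units at rate 0.5, the mean activation
# stays close to 1.0 because survivors are scaled up by 2x.
```

At inference (`training=False`) the function is the identity, which is why inverted dropout needs no test-time correction.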

Proposed network
The UNet architecture is a CNN model that uses an encoder-decoder network with skip connections to preserve high-resolution information from the input images. UNet structures have been widely used for medical image segmentation, including breast tumor segmentation in ultrasound images, due to their unique architecture and outstanding performance (41). UNet-based models can be trained using relatively small datasets, which is particularly relevant in medical imaging applications where large, annotated datasets are often not readily available (41). The skip connections in a UNet-based model help preserve the spatial information lost during downsampling in the encoder. Concatenation of the encoder and decoder information helps preserve spatial information while increasing the depth of the feature maps, facilitating better learning of the spatial relationships between features. In a neural network, during backpropagation the gradient is computed starting from the output layer of the network and moving backwards towards the input layer (42). The gradient values are then used to update the weights of the network using an optimization algorithm such as stochastic gradient descent or Adam. The vanishing gradient problem occurs when the gradient values become very small during backpropagation, making it difficult to update the weights in the lower layers (42). This can occur in deep networks because the gradient values are multiplied by the weight matrix at each layer; if the weights are small, the gradient values can become exponentially small as they are propagated backwards through the network (43). As a result, the lower layers of the network may not learn effectively, and the overall performance of the network can be compromised.
Using a plain skip connection in UNet may lead to the problem of vanishing gradients during training (44). As noted earlier, this is because the gradients of the loss function with respect to the weights of the encoder and the decoder must be propagated through the skip connections. If the network is too deep, the gradients passing through plain skip connections may become too small to be useful during training, which can slow down or even prevent convergence.
To overcome this problem, instead of a simple connection, we applied a discrete wavelet transform and an attention map to highlight the spatial and frequency information present in the ultrasound images. Adding a wavelet gate (WG) and an attention gate (AG) to a UNet model in place of a plain skip connection can prove beneficial. A WG can help the model capture multi-scale features in the input image. Wavelet transforms decompose the image into different frequency bands, which can capture details at different scales (45). By including a WG in the UNet model, the model can selectively attend to different frequency bands, allowing it to capture both fine-grained and coarse-grained details.
Moreover, incorporating an AG mechanism can effectively enhance the model's ability to attend to the most salient features within the input image. This is particularly useful in image segmentation tasks, where the model needs to identify the boundaries of objects accurately. By incorporating an AG into the UNet model, the model can learn to selectively attend to the most relevant regions of the input image, while down-weighting less important regions. By combining these two types of information from AGs and WGs, the UNet model can achieve better performance in image segmentation tasks. This design integrates easily into standard CNN architectures, such as the UNet model, with minimal operational overhead. Additionally, our experiments and results show that incorporating WGs and AGs in the model increases sensitivity and prediction accuracy.

Attention gate (AG)
The attention mechanism is a method to highlight high-importance features and downplay low-importance features in neural network models. Oktay et al. (34) proposed a novel UNet algorithm with an attention mechanism to segment the pancreas in computed tomography (CT) images.
Their proposed network consisted of three main modules: an encoder module that receives CT input images to obtain the feature map, an attention module consisting of two parallel convolutional blocks that learn to capture both local and global contextual information from the encoder path, and a decoder module that restores the concatenated feature map from the AG to the original size of the input images. The authors reported a Dice coefficient of 79.8% for their method, compared to 74.7% for UNet without attention.
Recent studies have demonstrated that the use of AGs in DL models can lead to improved network performance (46,47). The AG architecture proposed in our study is illustrated in Figure 3. The architecture is inspired by the attention gate utilized by Oktay et al. (34).

Figure 3 Illustration of the proposed additive attention gate (AG), which employs a gating signal (g), derived by applying a transposed convolution to coarser-scale features, together with the features from the encoding path, to analyze the activations and contextual information for selecting spatial regions.
This AG unit receives two inputs, the decoder branch and the encoder branch. At the encoder branch, convolutional layers utilize a hierarchical approach to gradually extract high-level image features by processing local information layer by layer. This leads to a separation of pixels in a high-dimensional space according to their semantics. This sequential processing of local features allows the model to condition its predictions on information collected from a large receptive field. The output of layer $l$ is the feature map $x^l$, which captures the extracted high-level features at that layer. The subscript $i$ in $x_i^l$, $g_i$, and $\alpha_i^l$ denotes the spatial dimension, indexing each pixel.
First, we apply a transposed convolution on the decoding branch; the result is the gating signal $g_i$. It is used to determine focus regions, while $x_i^l$, the feature map from the encoding path, contains local information about the image. Then, we apply a convolution with a kernel size and stride of 1, with the number of channels matching the number of feature maps of $g_i$ and $x_i^l$. To combine $g_i$ and $x_i^l$, we sum them to obtain the gating coefficient. The additive attention method may require more computational resources but has been demonstrated to achieve better accuracy compared to the multiplicative attention method (48). The resulting feature map $\hat{x}_i^l$ is obtained by performing element-wise multiplication between the input feature map $x_i^l$ and the attention coefficients $\alpha_i^l$, which is expressed mathematically as:

$$\hat{x}_i^l = \alpha_i^l \cdot x_i^l$$

where $\alpha_i^l$ can be expressed as:

$$\alpha_i^l = \sigma_2\left(\psi^T\,\sigma_1\left(W_x^T x_i^l + W_g^T g_i + b_g\right) + b_\psi\right)$$

where $\sigma_1$ is the swish function (49) and $\sigma_2$ is the sigmoid function. The feature map $x_i^l$ and the gating signal vector $g_i$ are transformed linearly using a 1×1 channel-wise convolution. The parameters $W_x$, $W_g$, and $\psi$ are linear transformations with trainable parameters. The bias terms are $b_g$ and $b_\psi$. To reduce the complexity, we set the bias values to zero. Experiments have shown that setting these two values to zero does not affect the model's performance negatively.
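The attention gate computation can be sketched in NumPy at the level of a single feature map. Here the 1×1 channel-wise convolutions are replaced by plain matrix products over the channel dimension, biases are set to zero as described, and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def swish(x):
    return x / (1.0 + np.exp(-x))       # sigma1

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))     # sigma2

def attention_gate(x, g, W_x, W_g, psi):
    """Additive attention gate: alpha = sigma2(psi^T sigma1(W_x^T x + W_g^T g)),
    biases set to zero. x: encoder features, g: gating signal, both shaped
    (n_pixels, channels). Returns the gated features alpha * x and alpha."""
    q = swish(x @ W_x + g @ W_g)        # sigma1 of the summed linear terms
    alpha = sigmoid(q @ psi)            # one attention coefficient per pixel
    return alpha * x, alpha

n_pix, c, c_int = 16, 8, 4              # illustrative pixel/channel counts
x = rng.standard_normal((n_pix, c))     # encoder-path feature map
g = rng.standard_normal((n_pix, c))     # gating signal from the decoder path
W_x = rng.standard_normal((c, c_int))
W_g = rng.standard_normal((c, c_int))
psi = rng.standard_normal((c_int, 1))
x_hat, alpha = attention_gate(x, g, W_x, W_g, psi)
```

Because the sigmoid keeps every coefficient strictly between 0 and 1, the gate can only attenuate encoder features, never amplify them: pixels deemed irrelevant by the joint encoder/decoder context are down-weighted before concatenation.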

Wavelet gate (WG)
Images include both spatial and frequency information, and the wavelet transform is a valuable tool for analyzing the spatial frequency and spatial content of an image. In the wavelet transform, the image is decomposed into different frequency bands or scales, each corresponding to a different level of detail. This enables analysis of the image's frequency content at different scales and resolutions. In ultrasound imaging, the spatial frequency information is rich and can be leveraged to diagnose or detect patterns with higher accuracy. The continuous wavelet transform (CWT) of a signal $f(t)$ is given by:

$$W(a,b) = \frac{1}{\sqrt{a}} \int_{-\infty}^{\infty} f(t)\, \psi^*\!\left(\frac{t-b}{a}\right) dt$$

where $a$ and $b$ are the scale and translation parameters, respectively, and $\psi^*$ is the complex conjugate of the mother wavelet function. The discrete wavelet transform (DWT) of a signal $x[n]$ is obtained by applying a series of high-pass and low-pass filters to the signal, followed by decimation. The DWT can be represented mathematically as:

$$y_{\text{low}}[k] = \sum_n x[n]\, h[2k-n], \qquad y_{\text{high}}[k] = \sum_n x[n]\, g[2k-n]$$

where $h$ and $g$ are the low-pass and high-pass filter coefficients, respectively. Applying the DWT along the rows and columns of an image yields four sub-bands: LL, LH, HL, and HH. The HH sub-band was excluded from our model because of its primarily noise-related nature, and therefore its omission had no discernable effect on the results of our analyses. Figure 4 illustrates the wavelet gate proposed in our study. The LL sub-band serves as the input to the AG gate, whose architecture is shown in Figure 3. This sub-band corresponds to the low-frequency components of the original image, obtained via a low-pass filter applied to the image. Because the LL sub-band captures the overall structure or context of the input, it is a fitting input to the AG.
Moreover, the LL sub-band can reduce the impact of artifacts across the network layers during the initial stages of training. The HL and LH sub-bands include high-frequency components of the original image and therefore contain more detailed information compared to the LL sub-band. The concatenation of the AG output and these two sub-bands can help the model to better capture the important features and patterns in the data, while filtering out irrelevant or noisy information. Additionally, the AG may help to reduce the effects of dimensionality reduction that may have occurred due to the wavelet transformation.
Wavelet families include Biorthogonal, Coiflet, Haar, Symlet, Daubechies, and others (50). There is no universally "correct" way to choose a wavelet function; optimality depends on the specific application (51). While the Haar wavelet is simple to compute and comprehend, the Daubechies wavelet is more complex and requires more computation. However, the Daubechies wavelet can capture details that are missed by the Haar wavelet. Thus, selecting a wavelet that closely matches the signal being processed is crucial in wavelet applications (52). For our wavelet decomposition, the Daubechies wavelet with two vanishing moments (db2) provided the best results.

Network architecture
In the UNet structure, the encoder path extracts high-level features from the input image, and a decoder path then upsamples the feature maps to restore the original resolution. In this paper, the encoder path consists of four downsampling blocks; each block applies convolutional layers with 3×3 kernels and a stride of 1, followed by a batch normalization layer and a swish activation function. Downsampling is achieved by max pooling with a 2×2 kernel. The output of each block is then passed through a dropout layer to prevent overfitting.
The decoder path consists of four upsampling blocks; each block applies a transpose convolution with a 3×3 kernel and a stride of 2, followed by concatenation with the corresponding feature maps from the AG and WG, which are applied to the encoder-path features. The concatenated feature maps are then passed through a convolutional block with the same architecture as in the encoder path.
As in the encoder path, the output of each block is passed through a dropout layer to prevent overfitting. The final output of the model is a binary mask that represents the segmented regions of the input image. Figure 5 shows the proposed architecture.
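As a quick sanity check on the dimensions described above, the following sketch traces how the spatial resolution of a 128×128 input evolves through four 2×2 max-pooling steps and four stride-2 transpose convolutions. The layer counts come from the text; the helper function itself is illustrative, not part of the paper's code:

```python
def encoder_decoder_shapes(input_size=128, depth=4):
    """Trace spatial resolution through a WATUNet-style encoder/decoder.

    Each encoder block halves the resolution via 2x2 max pooling; each
    decoder block doubles it via a stride-2 transpose convolution.
    """
    shapes = [input_size]
    for _ in range(depth):             # encoder: 128 -> 64 -> 32 -> 16 -> 8
        shapes.append(shapes[-1] // 2)
    for _ in range(depth):             # decoder: 8 -> 16 -> 32 -> 64 -> 128
        shapes.append(shapes[-1] * 2)
    return shapes

print(encoder_decoder_shapes())  # [128, 64, 32, 16, 8, 16, 32, 64, 128]
```

The symmetry of the two halves is what makes the AG/WG skip connections dimensionally compatible at each level.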

Loss function
In this study, we utilized a custom loss function specifically designed for training neural networks on segmentation tasks. It combines two commonly used loss functions: binary cross-entropy (BCE) loss and Dice loss. BCE loss is typically used for binary classification problems such as image segmentation, where each pixel is classified as foreground or background. It quantifies the pixel-wise discrepancy between the predicted probability map and the ground-truth binary mask.
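A minimal NumPy sketch of such a combined loss is shown below. The unweighted sum of BCE and soft Dice loss, as well as the smoothing constant, are assumptions for illustration; the paper does not specify the exact weighting here:

```python
import numpy as np

def bce_dice_loss(y_true, y_pred, smooth=1.0, eps=1e-7):
    """Combined BCE + soft Dice loss (illustrative; the paper's exact
    weighting and smoothing are not given in this excerpt)."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)
    # Pixel-wise binary cross-entropy between prediction and ground truth
    bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
    # Soft Dice loss: 1 minus the (smoothed) Dice coefficient
    intersection = np.sum(y_true * y_pred)
    dice = (2.0 * intersection + smooth) / (np.sum(y_true) + np.sum(y_pred) + smooth)
    return bce + (1.0 - dice)
```

A near-perfect prediction drives both terms toward zero, while the Dice term keeps the loss informative even when foreground pixels are scarce.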

Quantitative training and validation results on VSI dataset
To select the optimal model for our purposes, we employed a criterion based on the Dice coefficient, which we identified as our primary metric of interest. The Dice coefficient is a robust and reliable metric for evaluating segmentation models due to its sensitivity to both true positives (TP) and false positives (FP). Accordingly, we saved the weight parameters of the best-performing model, as determined by the highest Dice coefficient. This process was integral to ensuring the selection of the best-performing model for our specific objectives.
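The checkpoint-selection rule described above amounts to keeping the epoch with the highest validation Dice. A small illustrative helper (the function name and the example curve are hypothetical):

```python
def select_best_epoch(val_dice_per_epoch):
    """Return (epoch index, Dice) of the checkpoint that would be saved
    when model selection is driven by the validation Dice coefficient."""
    best_epoch = max(range(len(val_dice_per_epoch)),
                     key=lambda e: val_dice_per_epoch[e])
    return best_epoch, val_dice_per_epoch[best_epoch]

# Hypothetical validation curve over five epochs
epoch, dice = select_best_epoch([0.71, 0.85, 0.92, 0.90, 0.91])
print(epoch, dice)  # 2 0.92
```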
Figure 6 shows the model's performance metrics during training and validation on the VSI dataset.

Comparative analysis of neural network models on VSI dataset
Table 1 presents the performance metrics obtained by five different neural network models on the test set of the VSI dataset; the backbones of these models are based on previous works, namely UNet (28), Attention UNet (34), Sharp UNet (33), Sharp Attention UNet (35), and the proposed WATUNet model. While the primary network architectures for the four former models have been previously proposed in the literature, the same backbone architecture and hyperparameters were employed for all models when reporting the results in this table. The following parameters were identical across the training of all models: image input size (128 × 128), augmentation technique, optimization technique, batch size, loss function, and learning rate. Each model underwent 300 epochs of training, after which the model exhibiting the best performance on the validation set was selected and saved for comparison on the test set.
These comparisons revealed that the WATUNet architecture demonstrated superior performance relative to the other models across all validation parameters. The results obtained from the WATUNet algorithm demonstrate an improvement in the sensitivity of lesion-area mask extraction for the VSI dataset, with a notable increase of 1.6% over the second-best model. The Dice coefficient also improved by 1.1% compared to the second-best model (35). These findings suggest that WATUNet may be a promising solution for improving the accuracy, robustness, and computational efficiency of lesion-area segmentation in ultrasound imaging.
We also utilized McNemar's test, which is non-parametric, to compare the relative performance of each segmentation model that we developed. The purpose of McNemar's test was to assess the statistical significance of differences observed between the models' segmentation results; more details are provided in Appendix B. The test was conducted on our VSI test set of 381 images (10%) with dimensions of 128×128 pixels. We compared the dichotomous segmentation results of each pair of models to calculate the number of discordant entries for each pixel in the images. These discordant values were then evaluated using the chi-squared distribution with 1 degree of freedom, which yielded a corresponding p-value.
In hypothesis testing, a p-value below 0.05 is considered significant, allowing us to reject the null hypothesis and providing evidence for the alternative hypothesis. However, when conducting multiple statistical tests simultaneously, e.g., comparing multiple model pairs, the probability of false positives increases. To mitigate this, several correction methods have been proposed, such as the Bonferroni correction, which adjusts the alpha value to uphold an appropriate level of significance (53). However, the Bonferroni correction can result in a higher rate of Type II errors (false negatives). The false discovery rate (FDR) correction (54), in contrast, provides a more balanced trade-off between controlling false positives and false negatives, which leads to more power to detect significant results.
We employed the Benjamini-Hochberg procedure to obtain adjusted p-values (q-values) by considering the rank of each p-value and the total number of comparisons. We identified statistically significant model pairs by comparing q-values to the specified FDR level. Pairs with q-values ≤ 0.05 were considered significant, indicating meaningful performance differences, while effectively controlling false positives and false negatives across multiple comparisons. Table 2 presents the results of comparing pairs of models after applying the FDR correction to each of the individual models. We calculated the p-value and q-value for each pair; in this table, m, the number of comparisons, is equal to 10.
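The two statistical steps described above, McNemar's test on discordant counts and Benjamini-Hochberg adjustment, can be sketched in plain Python as follows. The function names are ours; the chi-squared survival function for 1 degree of freedom is computed via the complementary error function:

```python
import math

def mcnemar_p(b, c):
    """Two-sided p-value for McNemar's test on discordant counts b and c,
    using the chi-squared statistic with 1 degree of freedom."""
    chi2 = (b - c) ** 2 / (b + c)
    # Survival function of chi-squared(1 df): P(X > chi2) = erfc(sqrt(chi2/2))
    return math.erfc(math.sqrt(chi2 / 2.0))

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg step-up adjustment: for the p-value of rank r,
    q = min over ranks >= r of p_(rank) * m / rank."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):       # walk from the largest p-value down
        i = order[rank - 1]
        prev = min(prev, p_values[i] * m / rank)
        q[i] = prev
    return q
```

A pair of models with, say, 30 vs. 10 discordant pixels yields a p-value well below 0.05, which is then adjusted against the full set of pairwise comparisons before declaring significance.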

Visual comparison: segmentation model output vs. ground truth data on VSI and BUSI datasets
Figures 7 and 8 illustrate the comparison between the segmentation model output (WATUNet) and the ground truth data (outlined by radiologists) on the VSI and BUSI datasets, respectively. This visual representation provides valuable insight into the efficacy and accuracy of the proposed model. The processed mask is derived by binarizing the predicted mask at a threshold of 0.4; this threshold value was determined through analysis of the receiver operating characteristic (ROC) curve. Figure 9 shows the 3D rendering of a cine sweep and compares the ground truth with the predicted results. The proposed WATUNet segmentation model was applied to the VSI sweep to extract the tumor area, followed by multiplication of the mask with the original image. Contouring was then applied to each slice.
The 3D rendering was achieved using the MATLAB Volume Viewer application. Top views are presented.
T at the top-left is an abbreviation for 'Transducer'.
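The binarization step described above (threshold 0.4, chosen from the ROC curve) can be expressed in a few lines of NumPy; the function name is illustrative:

```python
import numpy as np

def binarize_mask(prob_mask, threshold=0.4):
    """Convert a predicted probability map into a binary segmentation mask.
    The 0.4 default mirrors the ROC-derived operating point in the text."""
    return (prob_mask >= threshold).astype(np.uint8)

probs = np.array([[0.10, 0.45],
                  [0.39, 0.90]])
mask = binarize_mask(probs)  # 1 wherever the probability is >= 0.4
```

Multiplying this mask element-wise with the original frame isolates the lesion region before contouring, as done for the 3D rendering above.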

We evaluated our model using two datasets categorized into three classes: the publicly available "Breast Ultrasound Images" (BUSI) dataset of 780 images and our VSI dataset of 3818 images.
The experimental results showed that the proposed WATUNet model outperformed state-of-the-art models in terms of accuracy, loss, specificity, recall, F1, precision, Dice coefficient, and visual representation.
These findings indicate that the model holds considerable promise for clinical applications, as it has the potential to improve the accuracy of breast cancer segmentation and reduce the cost associated with traditional methods of tumor detection.
This study's innovation lies in combining the VSI scanning approach for breast imaging with deep learning algorithms to achieve highly accurate segmentation of breast lesions. This is particularly relevant to health care in remote areas where access to medical experts may be limited. Our results build on the advantages of VSI, whose simplicity allows relatively inexperienced operators to obtain high-quality ultrasound images quickly. This contrasts with traditional ultrasound, which requires highly skilled sonographers who may take months or even years to train. By overcoming the need for highly trained sonographers, VSI has the potential to significantly improve access to medical imaging for breast pathology, especially in regions where healthcare resources are limited.
By leveraging advanced machine learning techniques, this study offers the potential to improve early detection and segmentation of breast lesions, ultimately leading to improved patient outcomes.
Despite the encouraging results, this is a proof-of-concept study. Its limitations include the modest size of the two image sets employed and the use of only two commercial scanners to obtain these images.
These limitations need to be addressed in future studies, along with the extension of this work to a final dichotomous diagnosis of the lesion as probably benign or malignant, the next necessary step in deciding the path of patient care. We also plan to further explore wavelet attention-based models and their potential for applications beyond breast cancer diagnosis. Finally, the integration of VSI and WATUNet could increase imaging access by enabling rapid, automatic assessment of breast lumps in areas where a radiologist or a skilled sonographer is not present.
Figure 2 shows the model optimization and training workflow. This diagram illustrates dataset preparation, including masked-frame extraction, data enhancement, and data augmentation. We then trained the model with shuffling enabled and finally tested it on unseen data.

Figure 2 Schematic depiction of the model optimization and training workflow, showing the sequential stages of the proposed approach. In the first step, radiologists perform frame segmentation, followed by data preprocessing in the second step, encompassing data enhancement and augmentation. The preprocessed dataset is then utilized for training the proposed model.

The DWT of a signal x[n] can be written as

W^L[j, k] = Σ_n x[n] · 2^(−j/2) φ(2^(−j) n − k),
W^H[j, k] = Σ_n x[n] · 2^(−j/2) ψ(2^(−j) n − k), (5)

where j is the scale index and k is the translation index. The superscripts L and H indicate whether the wavelet coefficient corresponds to the low-frequency (approximation) coefficients or the high-frequency (detail) coefficients, respectively. φ and ψ are the scaling and wavelet functions, respectively. The scaling function is a low-pass filter that captures the coarse-scale components of a signal or image, while the wavelet function is a high-pass filter that captures the fine-scale details. Together, they provide a multi-resolution representation of the signal, in which each level of the decomposition captures a different scale of features. The present study uses the DWT function to apply the wavelet transform to input images, implemented with the PyWavelets library in Python. PyWavelets is a free, open-source library that provides convenient tools for performing wavelet analysis in Python. The DWT function performs a four-band DWT decomposition of the image, producing four sub-bands: the approximation coefficients (LL) and the horizontal (LH), vertical (HL), and diagonal (HH) detail coefficients. The output of the DWT function is a tuple consisting of the LL sub-band and a concatenated tensor of the LH and HL sub-bands. The HH sub-band was considered non-essential for our dataset due to its primarily noise-related nature, and therefore its omission had no discernible effect on the results of our analyses.
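To make the decomposition concrete, the sketch below performs a single-level 2D DWT and returns the (LL, concatenated LH/HL) pair while discarding HH, as described above. For a dependency-free illustration it uses Haar filters; the paper itself uses the db2 wavelet, e.g. pywt.dwt2(img, 'db2') in PyWavelets:

```python
import numpy as np

def wavelet_gate_dwt(img):
    """Single-level 2D DWT mirroring the paper's DWT function: return the
    LL sub-band plus the stacked (LH, HL) detail sub-bands, dropping HH.

    Haar filters are used here so the sketch needs only NumPy; the paper's
    implementation uses db2 via PyWavelets.
    """
    a = img[0::2, 0::2]              # top-left pixel of each 2x2 block
    b = img[0::2, 1::2]              # top-right
    c = img[1::2, 0::2]              # bottom-left
    d = img[1::2, 1::2]              # bottom-right
    ll = (a + b + c + d) / 2.0       # approximation (low-pass both axes)
    lh = (a + b - c - d) / 2.0       # horizontal details
    hl = (a - b + c - d) / 2.0       # vertical details
    # hh = (a - b - c + d) / 2.0 would be the diagonal band, omitted here
    return ll, np.stack([lh, hl], axis=-1)
```

Each sub-band has half the spatial resolution of the input, so a 128×128 feature map yields 64×64 sub-bands, which matches the downsampled encoder features it is concatenated with.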

Figure 4 Illustration of the proposed wavelet gate (WG), which applies db2 from the PyWavelets library to the encoder-path feature map to analyze the spatial and frequency information of the input.

Figure 5 The proposed WATUNet architecture, with AG and WG as skip connections providing complementary information to the UNet model.

Figure 6 WATUNet performance indicators for training and validation. Plots (a-h) show the performance metrics during training and validation on the VSI dataset.

Figure 7 Examples from the VSI dataset with their corresponding ground truth, predicted mask, and WATUNet output mask overlaid on the original images. Top row: malignant mass; middle row: benign mass; bottom row: no mass found in the frame.

Table 1 Comparison of the performance of the proposed WATUNet model with existing ML models using the same settings and parameters (VSI dataset). Our proposed model, which combines attention gates and wavelet gates within a UNet framework, exhibited a notable performance improvement over the other models.

Table 2 Statistical significance of model comparisons with the Benjamini-Hochberg procedure, where i and m are the rank and the total number of comparisons, respectively, used to adjust for multiple comparisons.