Segmentation of bone surface from ultrasound using a lightweight network UBS-Net

Objective. Ultrasound-assisted orthopaedic navigation held promise due to its non-ionizing nature, portability, low cost, and real-time performance. To facilitate these applications, it was critical to achieve accurate and real-time bone surface segmentation. Nevertheless, imaging artifacts and low signal-to-noise ratios in tomographical B-mode ultrasound (B-US) images created substantial challenges for bone surface detection. Approach. We presented an end-to-end lightweight US bone segmentation network (UBS-Net) for bone surface detection, using the U-Net structure as the base framework and a level set loss function for improved sensitivity to bone surface detectability. A dual attention (DA) mechanism was introduced at the end of the encoder, which considered both position and channel information to obtain the correlation between the position and channel dimensions of the feature map, where axial attention (AA) replaced the traditional self-attention (SA) mechanism in the position attention module for better computational efficiency. The position attention and channel attention (CA) were combined with a two-class fusion module to form the DA map. The decoding module finally completed the bone surface detection. Main Results. A frame rate of 21 frames per second (fps) in detection was achieved. The method outperformed the state-of-the-art methods with higher segmentation accuracy (Dice similarity coefficient: 88.76% versus 87.22%) when applied to retrospective ultrasound (US) data from 11 volunteers. Significance. The proposed UBS-Net for bone surface detection in ultrasound achieved outstanding accuracy and real-time performance, outperformed the state-of-the-art methods, and had potential in US-guided orthopaedic surgery applications.


Introduction
Ultrasound-assisted orthopedic navigation holds promise due to its non-ionizing feature, portability, low cost, and real-time performance. The utilization of intraoperative real-time three-dimensional (3D) reconstruction of bone surfaces via US imaging is instrumental in enhancing surgical safety and precision. Consequently, to minimize the incorporation of non-essential information within the 3D mapping of ultrasound images and to expedite intraoperative ultrasound 3D reconstruction, it is necessary to achieve precise and real-time segmentation of the bone surface in two-dimensional (2D) US images. Nevertheless, imaging artifacts and low signal-to-noise ratios in tomographical B-US images create substantial challenges in bone surface detection.
B-US images are generated through the acoustic reflections of intra-tissue impedance mismatching within the human body [1]. The reflections of bone surfaces appear as highly echogenic signals in the B-US image. However, bone echogenicity can be significantly reduced if the bone surface is not planar and not perpendicular to the US transducer, as the acoustic energy is not reflected back to the transducer. Acoustic shadows are also seen behind the bone surface due to the lack of penetrating acoustic energy, resulting in limited imaging information in these regions (figure 1). Some soft tissues, such as muscles, can also appear hyperechoic and confound bone imaging in US. Speckle noise in B-US images further deteriorates image quality and makes bone surface detection more challenging [2]. For freehand B-US acquisitions, altering the US probe's orientation can change the imaging appearance of the bone surfaces [3]. Thus, accurately detecting the bone surface from US images presents considerable challenges.
Early segmentation of bone surfaces from US images focused on methods using image intensity and gradient information. A combination of depth-weighted thresholding, image morphology, and connected component labeling was used by Kowal [4], resulting in an average accuracy of 0.42 mm and an average processing time of 0.8 s per US image frame. Foroughi [5] introduced a dynamic programming-based method that first iteratively enhanced US images using intensity and shadow information with a cost function for bone surface segmentation. Experimental results showed that the accuracy of the method was less than 0.3 mm and the computation time was 0.55 s. Lopez-Perez [6] segmented elongated bones using active snake models; the surface reconstruction error validated on model data was 1.16 mm.
Another approach to US bone surface segmentation was based on the phase information of the image. Hacihaliloglu [7] was the first to propose the use of local phase image information for the segmentation or enhancement of bone structures in US data.
They initially extracted image phase information using a 2D Log-Gabor filter, which was incorporated into a feature descriptor called phase symmetry to extract bone features. The method was validated in human models and in vitro experiments with an average localization error of 0.4 mm and a processing time of 0.5 s. Anas [8] proposed a framework for optimizing filter parameters using information extracted from the frequency domain; the average localization error of the phantom validation experiments was 1.08 mm.
The method for generating shadow confidence maps for US images based on the graphical random walks technique proposed by Karamalis was widely cited [9, 10]. Pandey [11] proposed and evaluated a simplified 3D bone segmentation algorithm based on the confidence maps, which used acoustic shadows and peak intensities to detect bone surfaces. The study aligned the segmented US 3D images with CT images with a target registration error (TRE) of 2.44 mm.
Convolutional neural networks (CNNs) have demonstrated advantages in US bone surface detection [12-15]. Wang et al [16] introduced a filter-layer-guided CNN that combined local phase tensor images, local bone images, enhanced bone shadow images, and B-US images. In later work, they developed [17] a network that incorporated local and global phase tensors, creating an end-to-end local phase tensor-guided CNN for bone surface segmentation in US images. However, extracting local phase image features required substantial computational time. Chen et al [3] presented an annotation-guided encoder-decoder (AGN) model that considered the temporal relationship between adjacent B-US sequences for more efficient bone surface detection. Alsinan et al [18] integrated US shadows into a multi-feature guided CNN for real-time bone segmentation, utilizing a novel generative adversarial network (GAN). Rahman's [19] work was the first systematic design exploiting the interrelation between bone and shadow segmentation by fusing a CNN and a vision transformer to leverage multi-task learning while optimizing the accuracy-efficiency trade-off. A novel hybrid CNN architecture, SIU-Net, was proposed by Banerjee to overcome the challenges of speckle noise and low contrast in US segmentation [20].
However, the complex network structures in these approaches resulted in lengthy training times and low generalization capability. In addition, the preprocessing steps in these methods also took significant computational time. In this paper, we proposed an end-to-end lightweight US bone segmentation network (UBS-Net), designed for the accurate, real-time segmentation of bone surfaces in 2D US images. This network used the U-Net structure as the base framework and a level set loss function for improved sensitivity to bone surface detectability. The new processing addressed the above-mentioned issues in the existing methods and featured: (1) extracting features using the improved DarkNet28, which had fewer training parameters so that the optimization could be efficient; this modification permitted the network to achieve near real-time extraction of bone surfaces;
(2) introducing a DA mechanism that considered the correlation between the position and channel dimensions of the feature map to improve segmentation accuracy, where, for better computational efficiency, the traditional SA in the position module was replaced by AA, and a two-class feature fusion module fused the position attention and CA to obtain the DA map; (3) introducing the level set loss function to enhance the sensitivity of the network to bone surface boundaries, thereby further improving segmentation accuracy.
UBS-Net achieved excellent bone surface segmentation accuracy compared to state-of-the-art medical segmentation networks (Dice similarity coefficient: 88.76% versus 87.22%) when applied to US bone scans of 11 volunteers. The new method achieved real-time detection at approximately 21 fps. The newly developed method attained better segmentation accuracy alongside a near real-time segmentation rate, while utilizing a small amount of memory.

Methods and materials
The new network employed the U-Net structure as its fundamental framework, comprising a DarkNet28 feature extraction module, a DA module at the end of the encoding module, a feature fusion module (FFM), and a decoding module, and the overall network architecture was shown in figure 2.

Feature extraction module
The feature extraction module was an improvement of DarkNet53, which was initially proposed for the YOLOv3 object detection network [21] and known for its robust feature extraction performance. It incorporated residual terms into the feature extraction structure, facilitating network optimization and enhancing feature extraction accuracy through increased depth. We reduced DarkNet53 to DarkNet28 (figure 2), which optimized GPU utilization, reduced training parameters, simplified optimization, and increased computational efficiency compared to the U-Net network while retaining feature extraction capabilities. We integrated DarkNet28 into the U-Net's encoding module.
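The residual structure described above can be sketched as a minimal PyTorch module. This is our own illustrative sketch of a DarkNet-style residual block (channel choices, activation, and class name are assumptions, not taken from the paper):

```python
import torch
import torch.nn as nn

class DarkResidualBlock(nn.Module):
    """DarkNet-style residual block sketch: a 1x1 bottleneck followed by a
    3x3 convolution, with a skip connection that eases optimization."""
    def __init__(self, channels: int):
        super().__init__()
        half = channels // 2
        self.block = nn.Sequential(
            nn.Conv2d(channels, half, kernel_size=1, bias=False),
            nn.BatchNorm2d(half),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Conv2d(half, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.LeakyReLU(0.1, inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # residual connection: output keeps the input's shape
        return x + self.block(x)
```

Stacking such blocks at progressively downsampled resolutions yields the DarkNet-style encoder that replaces U-Net's plain convolutional encoder.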

The DA module
The DA mechanism, which aided in understanding global feature correlations within the feature map, had been extensively employed in segmentation tasks, yielding remarkable performance [22, 23]. Comprising position and channel attention, the DA mechanism considered both position and channel information by introducing two independent attention modules in the position and channel dimensions. These two modules interacted to learn correlations in both position and channel, improving the model's performance. We replaced the traditional SA mechanism in the position attention module with AA for enhanced computational efficiency. A two-class FFM then fused the position and channel attention to produce a DA feature map.

The AA module
In the position attention module, the SA mechanism was commonly employed to capture spatial dependencies between any two positions within the feature map. However, memory and computation demands grew quadratically with sequence length, posing a computational challenge. To address this problem, we utilized AA [24-27] as a substitute for the traditional SA module. AA decomposed SA into two independent SA modules: the first performed SA along the height axis of the feature map and the second along the width axis, as shown in figure 3. AA could effectively simulate the original SA with better computational efficiency. For a given feature map x with height H, width W, and channels C, the AA layer along the width axis producing the output A was defined as
$$A_{ij} = \sum_{w=1}^{W} \mathrm{softmax}\left(q_{ij}^{\top} k_{iw} + q_{ij}^{\top} r^{q}_{w-j} + k_{iw}^{\top} r^{k}_{w-j}\right)\left(v_{iw} + r^{v}_{w-j}\right), \qquad (1)$$
where the queries $q = W_Q x$, keys $k = W_K x$, and values $v = W_V x$ were all linear projections of the input x, and $W_Q$, $W_K$, $W_V$ and the relative positional encodings $r^q$, $r^k$, $r^v$ were all learnable matrices. An AA layer operated along a specific axis, and to capture global information we employed two consecutive AA layers for the height and width axes, respectively. Equation (1) described AA applied to the tensor's width axis, and a similar equation was employed for the height axis. Both AA layers utilized a multi-headed attention mechanism.
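The axis decomposition can be illustrated with a minimal PyTorch sketch. This is our own simplification: it folds the non-attended spatial axis into the batch dimension and uses `nn.MultiheadAttention`, omitting the relative positional terms of equation (1) for brevity:

```python
import torch
import torch.nn as nn

class AxialAttention1D(nn.Module):
    """Simplified single-axis self-attention sketch (no relative
    positional encodings). Attends along one spatial axis only."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor, axis: str) -> torch.Tensor:
        # x: (B, C, H, W); fold the other spatial axis into the batch
        b, c, h, w = x.shape
        if axis == "width":
            seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)
        else:  # "height"
            seq = x.permute(0, 3, 2, 1).reshape(b * w, h, c)
        out, _ = self.attn(seq, seq, seq)
        if axis == "width":
            return out.reshape(b, h, w, c).permute(0, 3, 1, 2)
        return out.reshape(b, w, h, c).permute(0, 3, 2, 1)
```

Applying a height-axis layer followed by a width-axis layer captures global context at roughly O(HW(H + W)) cost instead of the O((HW)^2) cost of full SA.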

The CA module
For the CA module, an SA-like mechanism was used to obtain channel dependencies between any two channel feature maps and to update each channel feature map using the weighted sum of all channel feature maps (figure 4). The input feature map was first reshaped to $X \in \mathbb{R}^{C \times N}$, where $N = H \times W$. A matrix multiplication was then performed between X and the transpose of X, and the result was fed into a softmax layer to calculate the CA map E:
$$e_{ij} = \frac{\exp(X_i \cdot X_j)}{\sum_{j=1}^{C} \exp(X_i \cdot X_j)}, \qquad F_j = \beta \sum_{i=1}^{C} e_{ji} X_i + X_j.$$
The formulation of the above equation followed the CA model proposed in [22], where $e_{ij}$ represented the correlation between channel i and channel j, $\beta$ represented the weight of the learning path, and the CA map F was the weighted sum of all channels and X.
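The channel-affinity computation above maps directly onto batched matrix products. The following is a sketch of a DANet-style CA module (the class name and the zero initialization of β are common conventions, assumed here rather than stated in the paper):

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """DANet-style channel attention sketch:
    E = softmax(X X^T) over channels, output = beta * (E X) + X."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.zeros(1))  # learnable path weight

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        X = x.reshape(b, c, h * w)                # (B, C, N)
        energy = torch.bmm(X, X.transpose(1, 2))  # (B, C, C) channel affinities
        E = torch.softmax(energy, dim=-1)         # CA map e_ij
        F = torch.bmm(E, X).reshape(b, c, h, w)   # weighted sum over channels
        return self.beta * F + x
```

With β initialized to zero, the module starts as an identity mapping and gradually learns how much channel-wise context to mix in.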

The FFM
Owing to the difference between the two feature maps generated by DA, direct concatenation might not retain vital information. Hence, we employed the FFM structure from [28] to fuse the maps produced by DA, as shown in figure 5. The DA map D was calculated as
$$Z = \delta\big(\mathrm{BN}\big(W_1 (A \oplus F)\big)\big), \qquad D = Z + Z \otimes \sigma\big(W_3\, \delta\big(W_2\, \mathrm{pool}(Z)\big)\big),$$
where $W_1$, $W_2$, and $W_3$ were the weights of the convolution layers and ⊕ denoted the concatenation operation. A was the position attention map obtained after AA, F was the CA map obtained following CA, δ was the ReLU function, and σ was the sigmoid function.
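The fusion can be sketched in a few lines of PyTorch, following the BiSeNet-style FFM [28]; the channel counts and class name here are our own assumptions:

```python
import torch
import torch.nn as nn

class FeatureFusionModule(nn.Module):
    """FFM sketch after BiSeNet: concatenate the two attention maps,
    fuse with Conv-BN-ReLU, then reweight channels with a pooled
    sigmoid gate and add the reweighted features back."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.fuse = nn.Sequential(              # W1 with BN and ReLU
            nn.Conv2d(in_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.gate = nn.Sequential(              # pool -> W2 -> W3 -> sigmoid
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 1),
            nn.Sigmoid(),
        )

    def forward(self, a: torch.Tensor, f: torch.Tensor) -> torch.Tensor:
        z = self.fuse(torch.cat([a, f], dim=1))  # concatenation then conv
        return z + z * self.gate(z)              # D = Z + Z (x) sigma(...)
```

The gate lets the network learn which channels of the fused map to emphasize, rather than trusting the raw concatenation.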

Loss function
In this study, we introduced a level set (LS) loss function $L_{LS}$ that worked together with the binary cross-entropy (BCE) loss function $L_B$ to improve the segmentation performance. The new loss function was defined as
$$L(Y, \hat{Y}) = L_B(Y, \hat{Y}) + \mu\, L_{LS}(Y, \hat{Y}),$$
where Y was the ground truth, $\hat{Y}$ was the prediction result, and μ was the weighting factor used to balance the LS constraint; its value was 0.01 in this experiment.

Formulation of level set loss function
LS enabled deep networks to more readily learn salient feature information, handling various topological variations automatically. As a result, the LS approach was widely employed in active contour image segmentation [29-31]. The core concept involved defining an implicit function in a higher dimension to represent the contour as its zero LS, which then evolved based on the partial differential equation derived from the active contour model's Lagrangian formulation. Chan et al [32-34] introduced variational LS methods, which derived the evolving partial differential equation directly from the energy functional of a feature map. Consequently, the LS-based segmentation problem could be addressed through gradient descent to minimize the energy functional [35]. The evolutionary properties of LS made them suitable for combination with deep networks to solve binary segmentation problems; the LS function composition was discussed subsequently.
When applying the LS method for the binary segmentation of a two-dimensional (2D) space Ω, the curve $C \subset \Omega$ was defined as the boundary of an open subset $\omega \subset \Omega$; therefore, $C = \partial\omega$. The interface curve could be represented by the zero LS of a Lipschitz function $\phi: \Omega \to \mathbb{R}$:
$$C = \{(u, v) \in \Omega : \phi(u, v) = 0\}.$$
The length of C was given by
$$\mathrm{Length}(C) = \int_{\Omega} \left|\nabla H\big(\phi(u, v)\big)\right| du\, dv = \int_{\Omega} \delta\big(\phi(u, v)\big) \left|\nabla \phi(u, v)\right| du\, dv,$$
where $(u, v)$ were the coordinates, $H(z)$ was the Heaviside function, and $\delta(z)$ was the Dirac delta function.

The Heaviside function was defined as
$$H(z) = \begin{cases} 1, & z \geq 0, \\ 0, & z < 0. \end{cases}$$
The loss function based on the LS was defined as follows [35]:
$$L_{LS}(Y, \hat{Y}) = \int_{\Omega} \left|y(u, v) - c_1\right|^2 \hat{y}(u, v)\, du\, dv + \int_{\Omega} \left|y(u, v) - c_2\right|^2 \big(1 - \hat{y}(u, v)\big)\, du\, dv + \nu\, \mathrm{Length}(C),$$
where $y(u, v)$ was the ground truth value of the pixel at $(u, v)$, $\hat{y}(u, v)$ was the predicted value, and $c_1$ and $c_2$ were the average saliency values for inside(C) and outside(C). Keeping $\phi$ fixed and minimizing $L_{LS}$ with respect to $c_1$ and $c_2$, these two constants could be expressed as
$$c_1 = \frac{\int_{\Omega} y(u, v)\, \hat{y}(u, v)\, du\, dv}{\int_{\Omega} \hat{y}(u, v)\, du\, dv}, \qquad c_2 = \frac{\int_{\Omega} y(u, v) \big(1 - \hat{y}(u, v)\big)\, du\, dv}{\int_{\Omega} \big(1 - \hat{y}(u, v)\big)\, du\, dv}.$$
The Heaviside function might encounter local minima issues. To mitigate this, we employed the approximated Heaviside function (AHF) [32].

The BCE loss function
In order to enhance the network's classification ability for individual pixels, we utilized the BCE loss function to compute the classification error for each pixel. The BCE loss value was the negative average of the error for each pixel in the output probability map:
$$L_B(Y, \hat{Y}) = -\frac{1}{H \times W} \sum_{i=1}^{H \times W} \left[y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i)\right],$$
where $H \times W$ was the total number of pixels in the US image, Y represented the ground truth, and $\hat{Y}$ denoted the prediction result.
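The combined objective (BCE plus the weighted LS term) can be sketched as follows. This is our own illustrative implementation: the soft prediction acts as the region indicator, c1/c2 are computed as above, and the length term is approximated by the total variation of the prediction (with the weight ν folded in as 1); none of these implementation details are taken verbatim from the paper:

```python
import torch

def level_set_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    """Chan-Vese-style LS loss sketch. y_pred in [0, 1] is a soft
    region indicator; c1/c2 are mean ground-truth values inside and
    outside the predicted region."""
    eps = 1e-6
    c1 = (y_true * y_pred).sum() / (y_pred.sum() + eps)
    c2 = (y_true * (1 - y_pred)).sum() / ((1 - y_pred).sum() + eps)
    region = ((y_true - c1) ** 2 * y_pred
              + (y_true - c2) ** 2 * (1 - y_pred)).sum()
    # length term approximated by total variation of the soft prediction
    du = (y_pred[..., 1:, :] - y_pred[..., :-1, :]).abs().sum()
    dv = (y_pred[..., :, 1:] - y_pred[..., :, :-1]).abs().sum()
    return region + du + dv

def total_loss(y_true: torch.Tensor, y_pred: torch.Tensor,
               mu: float = 0.01) -> torch.Tensor:
    """L = L_B + mu * L_LS, with mu = 0.01 as in the paper."""
    bce = torch.nn.functional.binary_cross_entropy(y_pred, y_true)
    return bce + mu * level_set_loss(y_true, y_pred)
```

A perfect binary prediction drives both terms to zero, while the LS term penalizes predictions whose boundaries disagree with the intensity statistics of the ground truth.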

Datasets and processing settings
UBS-Net was implemented using Python 3.6 and PyTorch 1.10.1 (the hardware and training settings were listed with table 1). Experimental data were collected from 11 volunteers, yielding a total of 5701 US images. Among them, we used 9 volunteers with 5030 images for training and 2 volunteers with 671 images for testing. To assess the performance of the method, a five-fold cross-validation on the different subjects was performed within the dataset.

Evaluation metrics
The proposed method was compared to state-of-the-art segmentation algorithms including U-Net [36], Seg-Net [37], AU-Net [38], BiSe-Net [28], and U-Net++ [39]; the same loss functions were used for all models. Segmentation performance was quantified with the Dice similarity coefficient (DSC), volumetric overlap error (VOE), Hausdorff distance (HD), and average Hausdorff distance (AHD). The HD between the ground truth J and the prediction result $\hat{J}$ was defined as
$$\mathrm{HD}(J, \hat{J}) = \max\left\{\sup_{a \in J}\, \inf_{b \in \hat{J}} d(a, b),\ \sup_{b \in \hat{J}}\, \inf_{a \in J} d(a, b)\right\}, \qquad (18)$$
where sup (supremum) and inf (infimum) denoted the least upper and greatest lower bounds, respectively, and d was the Euclidean distance.
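Two of the metrics above can be sketched concisely; this is our own NumPy implementation of the DSC for binary masks and of the symmetric HD of equation (18) for boundary point sets (function names are assumptions):

```python
import numpy as np

def dice_coefficient(j: np.ndarray, j_hat: np.ndarray) -> float:
    """DSC = 2|J intersect J_hat| / (|J| + |J_hat|) for binary masks."""
    inter = np.logical_and(j, j_hat).sum()
    return 2.0 * inter / (j.sum() + j_hat.sum())

def hausdorff_distance(pts_a: np.ndarray, pts_b: np.ndarray) -> float:
    """Symmetric HD between point sets of shape (N, 2) and (M, 2):
    the max of the two directed sup-inf distances of equation (18)."""
    d = np.linalg.norm(pts_a[:, None, :] - pts_b[None, :, :], axis=-1)
    return max(d.min(axis=1).max(), d.min(axis=0).max())
```

The AHD replaces each directed supremum with a mean over points, which damps the HD's sensitivity to single outlier pixels.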

Results and discussion
We first assessed the accuracy of bone surface detection using UBS-Net.Subsequently, we compared UBS-Net with several state-of-the-art segmentation models, and finally, we conducted ablation experiments on different components of UBS-Net to ascertain the effectiveness of each module.

Bone surface detection
We evaluated UBS-Net by training it on 2D US images from 9 volunteers and testing it on images from 2 additional volunteers. During the evaluation, we quantitatively assessed bone surface detection results by measuring the DSC, VOE, HD, and AHD (the AHD represented shape similarity while addressing the HD's sensitivity to extreme points) between the detected bone surface structures and the expert-created bone masks. We presented the performance results with means, standard deviations (SD), and graphs. The results of the five-fold cross-validation were shown in table 1, with an average DSC of 88.76% ± 0.57%, an average HD of 2.30 ± 0.29 mm, an average AHD of 0.12 ± 0.02 mm, and an average VOE of 7.27% ± 1.58%. These results demonstrated that the proposed UBS-Net could accurately detect bone surfaces from US images. Figure 7 displayed some bone surface detection results using UBS-Net, illustrating the effectiveness of the new method in detecting bone surfaces from US images.
Figure 8(a) illustrated the result of the 3D reconstruction of the segmented US bone surface, where the straw-like lines in the mesh represented the bone surfaces contained within each US image frame. This reconstructed 3D US bone surface was overlaid with the CT bone surface image, as depicted in figure 8(b). The alignment result revealed a root-mean-square error (RMSE) of 1.85 mm, thereby substantiating the effectiveness of the novel approach in extracting bone surfaces from US images.

Comparison segmentation performance with other state-of-the-art CNN models
We compared the segmentation performance of the proposed method with state-of-the-art segmentation algorithms, including U-Net [36], Seg-Net [37], AU-Net [38], BiSe-Net [28], and U-Net++ [39], using the same loss functions across all models. UBS-Net achieved the highest DSC of 88.76% among the tested models. As shown in figure 9, UBS-Net outperformed the other models in terms of DSC (88.76%), accuracy (98.94%), precision (86.34%), and AHD (0.12 mm). U-Net was the best-performing model among the other models, with figure 9 providing a comprehensive comparison of the results. Moreover, UBS-Net achieved a DSC above 90% for 52.42% of the test samples, outperforming U-Net (44.34%), Seg-Net (49.25%), AU-Net (41.14%), U-Net++ (51.70%), and BiSe-Net (43.40%). This demonstrated the superior segmentation performance of UBS-Net. Figure 10 showcased the segmentation results for two randomly selected samples with error regions, further highlighting the superiority of UBS-Net.

Comparison of training parameters and segmentation efficiency with other state-of-the-art CNN models
We assessed the differences in the number of trainable parameters between the proposed UBS-Net and other advanced segmentation CNN models. The new network demonstrated a significantly reduced complexity with only 0.28 million (M) trainable parameters, while U-Net, which performed best among the other models, had 34.53 M trainable parameters (figure 11).
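Parameter counts like those in figure 11 are straightforward to obtain in PyTorch; the helper below is a small utility of our own, not code from the paper:

```python
import torch.nn as nn

def count_trainable_params(model: nn.Module) -> float:
    """Number of trainable parameters in millions (M),
    the unit used when comparing segmentation networks."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6
```

Applied to each candidate network, this reproduces the kind of comparison shown in figure 11 (e.g. 0.28 M for UBS-Net versus 34.53 M for U-Net, per the paper's reported figures).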
Up to now, we proposed a lightweight US bone surface segmentation network, which possessed fewer training parameters, mitigating the risk of overfitting, and facilitating easier optimization and enhanced generalization capabilities.
Furthermore, we compared the bone surface detection speed of the proposed method with previously published US bone detection methods, including the improved fine-tuned U-Net [40], the filter layer-guided CNN [16], and LPT-Net [17], as demonstrated in table 2. Although the comparison was not on the same benchmark, it was still meaningful, as the image size in the proposed method was larger and the numbers of GPU parallel cores and the clock frequencies were similar. The memory usage was about 5 GB in testing, which was below the memory capacity of both GPU models (11 GB for the NVIDIA RTX 2080 Ti and 48 GB for the NVIDIA Quadro RTX 8000). Under these conditions, the proposed method showed the highest frame rate for real-time cine, indicating superior performance.
We also implemented other neural network segmentation approaches (U-Net [36], Seg-Net [37], U-Net++ [39], AU-Net [38], and Bise-Net [28]); the computing speeds in terms of frame rate were demonstrated in table 3. The best among them, Seg-Net, gave 20.02 fps, which was still lower than the proposed method's 21.36 fps.
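A fair fps comparison of this kind requires warm-up iterations and, on GPU, explicit synchronization before reading the clock. The sketch below is our own benchmarking utility (the function name, warm-up count, and defaults are assumptions); the default input follows the paper's 512 × 288 image size:

```python
import time
import torch

@torch.no_grad()
def measure_fps(model: torch.nn.Module,
                input_size=(1, 1, 288, 512),
                n_frames: int = 50,
                device: str = "cpu") -> float:
    """Average inference frame rate over n_frames, excluding warm-up."""
    model = model.to(device).eval()
    x = torch.randn(*input_size, device=device)
    for _ in range(5):          # warm-up: JIT/cuDNN autotune, cache effects
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()  # flush queued kernels before timing
    t0 = time.perf_counter()
    for _ in range(n_frames):
        model(x)
    if device == "cuda":
        torch.cuda.synchronize()
    return n_frames / (time.perf_counter() - t0)
```

Without the synchronization calls, asynchronous CUDA execution would make the measured frame rate appear artificially high.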

Ablation experiments
To evaluate the effectiveness of each module in UBS-Net, we conducted tests on segmentation models comprising various module combinations, assessing segmentation performance using six metrics (DSC, HD, AHD, VOE, accuracy, and precision), as presented in figure 12. UBS-Net demonstrated a significant improvement in DSC (up 1.54%), HD (down 0.36 mm), AHD (down 0.01 mm), accuracy (up 0.16%), precision (up 2.13%), and VOE (down 1.67%) compared to the baseline U-Net.
The introduction of the LS loss function led to enhanced performance in DSC (up 1.85%), accuracy (up 0.20%), and precision (up 1.11%), which demonstrated the effectiveness of the level set loss function for US bone surface segmentation.
To assess DarkNet28's feature extraction capabilities, we compared segmentation results between U-Net and DarkU-Net, observing no significant difference in any evaluation index between the two models. This indicated that DarkNet28 could maintain feature extraction quality while reducing trainable parameters compared to the original U-Net. Despite the absence of a significant disparity in the efficacy of feature extraction between DarkU-Net and U-Net, by comparing DarkNet28 + CA + AA + $L_{LS}$ + $L_B$ with DarkU-Net we demonstrated that the incorporation of the DA mechanism improved the accuracy of US bone surface detection, with DSC (up 1.92%), HD (down 0.05 mm), accuracy (up 0.23%), and precision (up 0.56%) showing better performance. This suggested that the DA mechanism enhanced the network's segmentation performance.
To verify the effectiveness of AA, we compared DarkNet28 + CA + AA + $L_{LS}$ + $L_B$ to its counterpart with the traditional SA mechanism, DarkNet28 + CA + SA + $L_{LS}$ + $L_B$. The results revealed that replacing SA with AA not only prevented segmentation performance degradation but also improved it, with DSC (up 3.16%), HD (down 0.06 mm), accuracy (up 0.33%), and precision (up 2.71%) (figure 13).

Limitation and future work
Despite achieving excellent US bone segmentation results, this study had certain limitations.The primary focus was on US images around the knee joint, given the method's application in proposed navigation for knee surgery.Future work would entail collecting US images of different bone structures (femur, pelvis, vertebrae, etc) to evaluate the framework's effectiveness in general orthopedic surgery.

Conclusion
Accurate, robust, and real-time bone surface segmentation was critical for US-guided assisted orthopedic navigation. However, challenges persisted due to various imaging artifacts in US images and a low signal-to-noise ratio. An end-to-end lightweight segmentation network (UBS-Net) was proposed to detect US bone surfaces. The network leveraged the basic U-Net structure as a fundamental framework and incorporated the LS loss function to enhance US bone surface segmentation performance. Experimental results demonstrated UBS-Net's outstanding accuracy and real-time performance in segmentation, suggesting its potential application in US-guided assisted orthopedic navigation.

Figure 1 .
Figure 1. (a) An unprocessed US image of the human humerus, showing bone and surrounding soft tissues. (b) Manual annotation of the bone surface, acoustic shadow, soft tissues, and speckle in the US image.

Figure 2 .
Figure 2. Overview of the UBS-Net. The network employed the U-Net structure as its fundamental framework, comprising a DarkNet28 feature extraction module, a DA module at the end of the encoding module, an FFM, and a decoding module. DarkNet28 captured high- and low-level features of US images. DA, used at the end of the encoding module, generated enhanced features along two paths: position attention (via the AA mechanism) and CA. The two attention types were processed through the FFM to obtain DA features. The decoding module up-sampled the DA map, integrated it with the corresponding stage feature maps in the encoding module, and ultimately generated the final segmentation prediction using successive convolution layers.

Figure 3 .
Figure 3. An AA block, which consisted of two axial-attention layers operating along the height and width axes sequentially. We employed N = 4 attention heads. ⊕ denoted the additive operation.

Figure 4 .
Figure 4. The CA module, where ⊕ denoted the additive operation and ⊗ was the multiplication operation.
Inside(C) was the region ω, and outside(C) represented the region outside ω in Ω, as shown in figure 6. As illustrated in figure 6(d), the bone boundary manifested as a line with a thickness spanning multiple pixels. Figure 6(e) showed the edge extraction of the line. The level set detected the boundaries enclosed by this edge.

Figure 5 .
Figure 5. The FFM, where BN was batch normalization, ⊕ denoted the additive operation, and ⊗ was the multiplication operation.

Figure 6 .
Figure 6. An example of LS segmentation. (a) A LS function φ(u, v) on the 2D space Ω. (b) The respective segmentation in Ω; the zero LS and the corresponding segmentation boundary C were marked in blue. (c) Ultrasound image. (d) The bone boundary manifested as a line with a thickness spanning multiple pixels. (e) The edge extraction of the line in (d); the level set detected the boundaries enclosed by this edge.

Figure 7
Figure 7. Some results of bone surface detection using UBS-Net. Although there were slight differences between the experiments, the new method achieved effective bone surface detection from US images. Red was the ground truth, green was the predicted result, and white was the merge of the ground truth and the predicted result.

Figure 8 .
Figure 8. (a) 3D reconstruction of the bone surface after bone surface extraction from 2D ultrasound images. (b) Overlay of the 3D ultrasound bone surfaces with the surfaces from CT images.

Figure 10 .
Figure 10. U-Net [36], Seg-Net [37], U-Net++ [39], AU-Net [38], Bise-Net [28], and UBS-Net prediction maps based on two randomly selected test set images. Red was the ground truth, green was the prediction result, and white was the merge of the ground truth and the prediction result.

Figure 11 .
Figure 11.Number of trainable parameters in each segmentation network.Vertical axis in million (M).
Figure 13 displayed the segmentation results of the ablation experiments for four randomly selected samples, comparing models including DarkNet28 + CA + AA + $L_{LS}$ + $L_B$ and DarkNet28 + CA + SA + $L_{LS}$ + $L_B$, and illustrating that the combination DarkNet28 + CA + AA + $L_{LS}$ + $L_B$ achieved the best US bone segmentation performance.

Figure 13.
Figure 13. Prediction maps based on 4 randomly selected test set images in the ablation experiment. Red was the ground truth, green was the prediction result, and white was a merge of the ground truth and the prediction result. NO.1: U-Net + $L_{LS}$ + $L_B$; NO.2: DarkNet28 + CA + AA + $L_{LS}$ + $L_B$; NO.3: DarkNet28 + CA + AA + $L_B$; NO.4: U-Net + CA + AA + $L_{LS}$ + $L_B$; NO.5: DarkU-Net28 + $L_{LS}$ + $L_B$; NO.6: DarkNet28 + CA + SA + $L_{LS}$ + $L_B$. Among them, NO.2, proposed in this paper, achieved the best segmentation performance.

Table 1 .
Results of bone surface detection obtained by UBS-Net in all 2D US images.
All experiments were conducted on an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz with an NVIDIA Quadro RTX 8000 GPU, running Windows 10, with an initial learning rate of 0.0001, a batch size of 4, and 50 epochs. Input network images measured 512 × 288.

Table 2 .
Comparative results of the bone surface detection speed of the proposed method with previously published US bone segmentation methods.