Myocardial Segmentation Algorithm of U-Net Network Based on Cardiac Ultrasound Images

The state of the myocardium is an important basis for identifying cardiac diseases. To assist physicians in making accurate diagnoses, this paper proposes a myocardial segmentation method using a U-Net network based on cardiac ultrasound images. Firstly, we collected a large amount of clinical data and engaged professional cardiac ultrasound physicians to annotate the myocardial regions as the gold standard. Then, we built an optimized U-Net network to establish the relationship between images and semantics and extract the original image features. Finally, a newly fused loss function was created for training the network. The experiments show that the accuracy, precision, and recall of the proposed U-Net reach more than 96%, and the MIOU more than 94%, which can effectively assist doctors in making accurate diagnoses.


Introduction
Ultrasound imaging (US) obtains images by acoustic means and has the advantages of non-invasiveness, high resolution of visceral soft tissues, and high safety [1]. It has become an important basis for the diagnosis of cardiac diseases such as myocardial ischemia and myocardial infarction [2]. At present, commonly used diagnostic tools rely largely on the subjective experience of clinicians, which inevitably leads to missed and false detections, as the workload of physicians is heavy [3].
Over the last few years, fast-growing advances in artificial intelligence technology have made computer-assisted detection possible, improving the diagnostic efficiency of physicians. Deep learning algorithms have achieved excellent results in this field; typical deep learning networks include the convolutional neural network (CNN), AlexNet, VGG-Net, GoogleNet, the residual network (ResNet), the densely connected network (DenseNet), the fully convolutional neural network (FCNN), U-Net [4], InceptionNet, etc. Depending on the demands of the segmentation task, the applicable segmentation methods vary [5]. Among them, U-Net is an effective target-region segmentation network. The model size of U-Net is relatively small, and its symmetric encoder-decoder and skip-connection structure enable it to integrate low-resolution and high-resolution features and obtain good segmentation results even when the training dataset is small. These characteristics make U-Net well suited to medical cardiac image segmentation tasks, and representative algorithms have been developed on its basis. In 2018, Xiao et al. proposed retinal vascular segmentation with the weighted Res-U-Net, which achieved high-performance segmentation [6]. In 2019, Weng et al. proposed NAS U-Net, which obtained better performance with fewer parameters than U-Net and its variants when evaluated on medical image datasets [7]. In 2020, Huang et al. proposed U-Net 3+ with a hybrid loss function to reduce the network parameters and improve the segmentation accuracy, and Li et al. proposed a convolutional neural network integrated with an attention mechanism, which outperforms other mainstream approaches with a significant reduction in the number of parameters [8]. In 2021, Kushnure et al. proposed a multiscale approach to enlarge the receptive field of convolutional neural networks, which improved the segmentation performance of the network while reducing the computational complexity and network parameters [9]. In 2022, Yan et al. proposed the axial fusion transformer U-Net (AFTer U-Net), which requires fewer parameters and less GPU memory during training than previous transformer-based models [10]. In 2023, Cao et al. proposed Swin-U-Net and applied it to the medical segmentation of multiple organs and the heart; their experiments verified that this Transformer-based U-shaped encoder-decoder network outperforms both fully convolutional networks and networks combining convolution and transformers [11].
In summary, thanks to its small model size and encoder-decoder structure, the U-Net network offers high accuracy in segmenting texture information and high scale utilization, which meets the requirements of myocardial extraction from ultrasound images. Therefore, we carried out research on image segmentation based on the U-Net network.

Algorithm Framework
Firstly, we collected clinical cardiac ultrasound data and engaged a professional physician to annotate the myocardial region as the gold standard. Then, the optimized U-Net network is established, as shown in Figure 1; it mainly consists of two parts: the contracting path and the expanding path. The loss function applied to train the network is designed as a fused function.
The network is optimized as follows. Firstly, the network input size is changed to 800×800, making it better suited to the ultrasound images. Secondly, padding=1 is used in the convolutional operations to maintain the size of the feature map before and after convolution, so that the feature map does not shrink at each step; this also eliminates the need to crop and discard information during the skip connections. Finally, the input and output images of the network are kept the same size.
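The size-preserving effect of padding=1 follows from the standard convolution output-size arithmetic; the following minimal check illustrates it (the function name conv_out_size is ours, not from the paper):

```python
def conv_out_size(n, k, s=1, p=0):
    """Output side length of a convolution: floor((n + 2p - k) / s) + 1."""
    return (n + 2 * p - k) // s + 1

# A 3x3 convolution with stride 1 and padding 1 preserves the 800x800 input,
# while the same convolution without padding would shrink it by 2 pixels.
assert conv_out_size(800, k=3, s=1, p=1) == 800
assert conv_out_size(800, k=3, s=1, p=0) == 798
# 2x2 max pooling with stride 2 halves an even side length.
assert conv_out_size(800, k=2, s=2, p=0) == 400
```

This is why the skip connections need no cropping: both sides of each connection keep the full spatial resolution of their scale.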

Contracting Path
The contracting path mainly consists of a stack of 3×3 convolutional layers and 2×2 maximum pooling; each downsampling yields information at a new scale, and this network obtains information at 5 scales. By continuously compressing the size of the feature map and gradually increasing the number of feature channels, the extracted features become more abstract, richer, and more expressive of the target.
Starting from an ultrasound image with a resolution of 800×800×3, we performed two successive convolutions with kernel size [3, 3], stride=1, padding=1, and channel=64, obtaining a feature layer L1 of [800, 800, 64]. Downsampling with 2×2 maximum pooling gives a feature layer of [400, 400, 64]; two convolutions with the same kernel size, stride, and padding, but with the number of channels changed to 128, give a preliminary effective feature layer L2 of [400, 400, 128]. Downsampling again gives [200, 200, 128]; two convolutions with the number of channels changed to 256 give a preliminary effective feature layer L3 of [200, 200, 256].
Downsampling with 2×2 maximum pooling then gives [100, 100, 256]; two convolutions with the same kernel size, stride, and padding, but with the number of channels changed to 512, give a preliminary effective feature layer L4 of [100, 100, 512].
A final downsampling with 2×2 maximum pooling gives [50, 50, 512]; two convolutions with the number of channels remaining 512 give a preliminary effective feature layer L5 of [50, 50, 512]. After the downsampling process is completed, five effective feature layers have been extracted.
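The shape bookkeeping of the contracting path can be traced without any deep learning framework; the sketch below (our own illustration, with the channel plan 64, 128, 256, 512, 512 taken from the text) reproduces the five feature-layer sizes:

```python
def contracting_path_shapes(h=800, w=800, channels=(64, 128, 256, 512, 512)):
    """Trace the (height, width, channels) of the five feature layers L1..L5.

    Each stage applies two same-padded 3x3 convolutions (size-preserving)
    that set the channel count; every stage after the first is preceded
    by a 2x2 max pooling that halves the spatial size.
    """
    shapes = []
    for i, c in enumerate(channels):
        if i > 0:          # downsample before every stage except the first
            h, w = h // 2, w // 2
        shapes.append((h, w, c))
    return shapes

layers = contracting_path_shapes()
# L1..L5 as described in the text:
assert layers == [(800, 800, 64), (400, 400, 128), (200, 200, 256),
                  (100, 100, 512), (50, 50, 512)]
```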

Expanding Path
The expanding path uses upsampling to achieve enhanced feature fusion. Each convolutional stage consists of two 3×3 same-padded convolutions followed by the ReLU activation function. Upsampling of the feature map is performed with a 2×2 up-convolution (transposed convolution), and the number of feature channels is halved after each upsampling stage. The upsampled feature layers are skip-connected with the effective feature layers of the corresponding scale from the backbone extraction part, stacked and fused at the corresponding resolution, and sent to the next convolutional layer.
Layer-by-layer upsampling, stacking, and fusion recover the original image detail pixel by pixel; the last layer fuses all the effective feature-layer information, and finally a 1×1 convolutional layer classifies the pixel points and adjusts the channels of this feature layer to obtain the prediction results. The specific implementation is as follows. The effective feature layer L5 (Y5) of [50, 50, 512] obtained by downsampling is upsampled and concatenated (cat) with the effective feature layer L4 of the corresponding scale to form a feature layer of [100, 100, 1024]; two further convolutions with the same kernel size and padding, with the number of channels changed to 512, give a feature layer Y4 of [100, 100, 512]. Y4 is then upsampled and concatenated with L3 to form a feature layer of [200, 200, 768]; two further convolutions with the number of channels changed to 256 give a feature layer Y3 of [200, 200, 256]. Y3 is upsampled and concatenated with L2 to form a feature layer of [400, 400, 384]; two further convolutions with the number of channels changed to 128 give a feature layer Y2 of [400, 400, 128]. Finally, Y2 is upsampled and concatenated with L1 to form a feature layer of [800, 800, 192]; two further convolutions with the number of channels changed to 64 give a feature layer Y1 of [800, 800, 64].
In Figure 1, the blue arrows indicate convolution operations with stride=1 and padding=1, which keep the width and height of the feature map unchanged. The red arrows represent the 2×2 max-pooling operation, for which the padding strategy also applies; no information is lost if the size of the feature map is even before pooling, so the 2×2 max-pooling operator is applied to images whose width and height are even numbers of pixels. The green arrows indicate the deconvolution operation with kernel size 2×2, which enlarges the width and height of the feature map to twice those of the previous layer. The gray arrows indicate the copy operation, which splices a feature layer on the left with the feature layer of the same size on the right at the same level, making full use of the shallow features.
For the upsampling part, a total of 5 scales are likewise obtained, and at each upsampling the feature layer of the same scale from the feature-extraction part is fused. The final output layer, Y1, passes through a convolutional layer of size [1, 1] to make the number of channels consistent with the number of categories, giving the prediction results in the prediction layer. In this paper, myocardial image segmentation has only one target category, "myocardium", plus "background", so the number of categories is 2. The two output channels are foreground and background; the features are mapped to the number of classes required for segmentation, the segmented image is output, and the segmentation result is obtained.
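The decoder shapes, including the concatenated channel counts, can be traced the same way as the encoder; the sketch below (our own illustration, using the encoder shapes and decoder channel plan 512, 256, 128, 64 stated in the text) confirms the sizes quoted above:

```python
def expanding_path_shapes(encoder, out_channels=(512, 256, 128, 64)):
    """Trace decoder shapes: upsample x2, concatenate with the skip layer,
    then two same-padded convolutions set the output channel count.
    Returns (shape_after_concat, shape_after_convs) per stage."""
    h, w, c = encoder[-1]                # start from L5: (50, 50, 512)
    stages = []
    for skip, oc in zip(reversed(encoder[:-1]), out_channels):
        h, w = h * 2, w * 2              # 2x2 up-convolution doubles H and W
        cat_c = c + skip[2]              # channels after skip concatenation
        stages.append(((h, w, cat_c), (h, w, oc)))
        c = oc
    return stages

enc = [(800, 800, 64), (400, 400, 128), (200, 200, 256),
       (100, 100, 512), (50, 50, 512)]
stages = expanding_path_shapes(enc)
# Concatenated channel counts match the text: 1024, 768, 384, 192.
assert [s[0][2] for s in stages] == [1024, 768, 384, 192]
# Output shapes Y4..Y1:
assert [s[1] for s in stages] == [(100, 100, 512), (200, 200, 256),
                                  (400, 400, 128), (800, 800, 64)]
```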

Loss Function
In this paper, a fusion of cross-entropy loss and Dice loss is used as the overall loss. The cross-entropy loss is applied when the semantic segmentation platform uses Softmax to classify the pixel points; denoted by H, it penalizes the deviation of the predicted distribution from the true one at each position. The formula is

H = -\sum_{i=1}^{k} p(x_i) \log q(x_i)

where k denotes the number of true labels of the pixel, p(x_i) denotes the probability that the pixel truly belongs to category i, and q(x_i) represents the probability that the pixel is predicted to be in category i.
Cross-entropy loss is often used as the evaluation basis because it represents the degree of difference between two probability distributions; in the segmentation network it measures the difference between the true and the predicted probability distributions. The greater the value of the cross-entropy, the worse the prediction; its value is negatively correlated with the network's prediction quality, so it can partially represent the validity of the model's predictions. The Dice loss, denoted by L, is used as the loss function of semantic segmentation. Its calculation formula is

L = 1 - \frac{2|P \cap T|}{|P| + |T|}

where P refers to the network prediction result and T refers to the true result. L reflects the similarity of the two sets: it is negatively correlated with their degree of similarity and takes values in the range [0, 1]. The smaller L is, the greater the overlap between the prediction and the true result, i.e., the better the prediction [12]. The overall loss is made up of the two components, containing both the cross-entropy loss and the Dice loss:

Loss = 0.9H + 0.1L

where 0.9 and 0.1 are the weight coefficients of the two losses.
(ICAITA-2023, Journal of Physics: Conference Series 2637 (2023) 012049)
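A minimal numpy sketch of this fused loss for the binary foreground/background case, with the stated 0.9/0.1 weights, might look as follows (function and variable names are ours, not from the paper):

```python
import numpy as np

def fused_loss(pred, target, w_ce=0.9, w_dice=0.1, eps=1e-7):
    """Fused loss = 0.9 * cross-entropy + 0.1 * Dice loss (binary case).

    pred:   predicted foreground probabilities in (0, 1)
    target: ground-truth mask of 0s and 1s, same shape as pred
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    # Binary cross-entropy H, averaged over pixels.
    ce = -np.mean(target * np.log(pred) + (1 - target) * np.log(1 - pred))
    # Soft Dice loss L = 1 - 2|P ∩ T| / (|P| + |T|).
    dice = 1.0 - 2.0 * np.sum(pred * target) / (np.sum(pred) + np.sum(target) + eps)
    return w_ce * ce + w_dice * dice

good = fused_loss(np.array([0.99, 0.99, 0.01]), np.array([1, 1, 0]))
bad  = fused_loss(np.array([0.01, 0.01, 0.99]), np.array([1, 1, 0]))
assert good < bad   # better predictions give a lower fused loss
```

In a real training setup the same combination would be computed on the network's softmax output; the numpy version only makes the weighting explicit.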

Experimental Results and Analysis
The paper collected 10 groups of cardiac ultrasound image data, totaling 4,340 frames, engaged a professional imaging physician to annotate them, and constructed a deep learning framework for the experiments. The computer configuration is shown in Table 1.

Construction of Dataset
The echocardiogram dataset used for myocardial segmentation in this paper was manually annotated, as shown in Figure 2, by extracting image frames from the myocardial sequence and annotating each frame separately. The annotation mainly marks the myocardium (red shaded part), so there is only one target class, "myocardium", with the remaining information treated as background; the annotation file therefore contains the labels "myocardium" and "background". The myocardial edges of the echocardiograms are annotated using the LabelMe tool. After annotation with LabelMe, the dataset file is obtained, which includes the original images with labels in json format. The training file is in VOC format, and the RGB images are annotated as shown in Figure 3. In the label, each pixel stores a grayscale value; image segmentation classifies each pixel of the image, and each pixel is assigned the probability that it belongs to the class corresponding to the gray value of its label. After completing the annotation of the myocardial part of the images with the LabelMe tool, the desired dataset is obtained and the network can be trained on it.
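The VOC-style label convention described above (each pixel's gray value is its class index) can be illustrated with a toy mask; the class indices 0 for "background" and 1 for "myocardium" are an assumption for this sketch, and real masks would come from LabelMe's json-to-dataset conversion:

```python
import numpy as np

# Toy VOC-style label mask: each pixel stores a class index
# (0 = "background", 1 = "myocardium").
CLASSES = ("background", "myocardium")
mask = np.array([[0, 0, 1],
                 [0, 1, 1],
                 [0, 0, 0]], dtype=np.uint8)

# One-hot encode the mask so each pixel gets a per-class channel,
# matching the two-channel foreground/background output of the network.
one_hot = np.eye(len(CLASSES), dtype=np.float32)[mask]
assert one_hot.shape == (3, 3, 2)
assert one_hot[1, 1].tolist() == [0.0, 1.0]   # pixel labeled "myocardium"
```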

MIOU = \frac{1}{X+1} \sum_{i=0}^{X} \frac{TP_i}{TP_i + FN_i + FP_i}

where X+1 denotes X categories plus 1 background, TP indicates the number of pixels correctly judged as positive segmentation, TN the number correctly judged as negative segmentation, FP the number wrongly judged as positive segmentation, and FN the number wrongly judged as negative segmentation.
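These per-class counts translate directly into code; the following numpy sketch (our own illustration for the two-class case) computes pixel accuracy and MIOU from predicted and ground-truth masks:

```python
import numpy as np

def pa_and_miou(pred, target, num_classes=2):
    """Pixel accuracy and mean IoU from per-class TP / FP / FN counts.
    Assumes every class index in range(num_classes) occurs in the data."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (target == c))
        fp = np.sum((pred == c) & (target != c))
        fn = np.sum((pred != c) & (target == c))
        ious.append(tp / (tp + fp + fn))   # IoU for class c
    pa = float(np.mean(pred == target))    # fraction of correct pixels
    return pa, float(np.mean(ious))

target = np.array([[0, 0, 1], [0, 1, 1]])
pred   = np.array([[0, 0, 1], [0, 1, 0]])  # one myocardium pixel missed
pa, miou = pa_and_miou(pred, target)
assert abs(pa - 5 / 6) < 1e-9
assert abs(miou - (3 / 4 + 2 / 3) / 2) < 1e-9
```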

Training and Testing Results
With the algorithm proposed in this paper, the corresponding training results are shown in Figure 4. Figure 4(b) shows the MIOU iteration curve: the horizontal coordinate is the number of training iterations and the vertical coordinate is the MIOU. As can be seen in the figure, the MIOU reaches 90 after 8 iterations, then continues to increase slowly and basically stabilizes, finally settling steadily at 92 after 60 iterations.
In this paper, myocardial segmentation of ultrasound images is a binary problem, requiring only the distinction between "myocardium" and "background". The output segmentation image is shown in Figure 5, in which the red part is the myocardium and the rest is the background; the myocardial part is accurately identified and successfully segmented.

Conclusion
In order to provide better medical assistance for cardiac diseases, this paper segments the myocardium in cardiac ultrasound images using deep learning. Datasets are constructed by labeling data with the LabelMe tool, training and prediction are implemented with the U-Net network, a mixture of cross-entropy loss and Dice loss is selected as the overall loss, and visualization analysis is performed to evaluate the performance of the U-Net. The output achieved the expected results: the segmentation of the myocardium in the target region of the image was completed with accuracy, precision, and recall of more than 96% and an MIOU of more than 94%, confirming the reliability of the U-Net network for the segmentation of the myocardium in ultrasound images, which can efficiently assist doctors in diagnosis.

Figure 1. Structure of U-Net Network

Figure 3. Dataset file

Evaluation Metrics of Segmentation
In this paper, we introduce the following evaluation metrics for the myocardial ultrasound image segmentation experiments: pixel accuracy (PA), precision (Precision), recall (Recall), and mean intersection over union (MIOU). The pixel accuracy is calculated as

PA = \frac{TP + TN}{TP + TN + FP + FN}

Figure 4(a) shows the loss iteration curve after visualizing the loss of the training results; the horizontal and vertical coordinates represent the number of training iterations and the loss values, respectively. The red curve shows the loss on the training set and the orange curve the loss on the validation set. The training loss decreases rapidly as the number of iterations increases; the network basically converges after 20 iterations on the training set, and the model finally reaches its optimum after 80 iterations.

Figure 5. Prediction Results of Myocardial Segmentation