Cross-convolutional transformer for automated multi-organ segmentation in a variety of medical images

Objective. Even with the development of deep learning methods, segmenting multiple organs across a variety of medical images with a single consistent algorithm remains a major challenge. We therefore develop a deep learning method based on a cross-convolutional transformer for automated segmentation with better generalization and accuracy. Approach. We propose a cross-convolutional transformer network (C2Former) to solve the segmentation problem. Specifically, we first redesign a novel cross-convolutional self-attention mechanism that integrates local and global contexts and models long-distance and short-distance dependencies to enhance the semantic understanding of image features. We then propose a multi-scale feature edge fusion module that combines image edge features, effectively forming multi-scale feature streams and establishing reliable relational connections in the global context. Finally, we train and test on three datasets of different modalities, imaging three different anatomical regions, to evaluate multi-organ segmentation performance. Main results. We use the Dice similarity coefficient (DSC) and 95% Hausdorff distance (HD95) as evaluation metrics for each dataset. Experiments showed an average DSC of 83.22% and HD95 of 17.55 mm on the Synapse dataset (abdominal multi-organ CT images), an average DSC of 91.42% and HD95 of 1.06 mm on the ACDC dataset (cardiac substructure MRI) and an average DSC of 86.78% and HD95 of 16.85 mm on the ISIC 2017 dataset (skin cancer images). On each dataset, our proposed method consistently outperforms the compared networks. Significance. The proposed deep learning network provides a generalized and accurate solution for multi-organ segmentation across the three different datasets. It has the potential to be applied to a variety of medical datasets for structural segmentation.


Introduction
Recently, many networks based on convolutional neural networks (CNNs) have been proposed for medical image segmentation, such as U-Net (Ronneberger et al 2015). More recent works enhance the remote modeling capability of encoders by adding self-attention. CoTr (Xie et al 2021) introduced a deformable transformer to effectively connect the CNN encoder and decoder, thus enhancing the remote modeling capability of the features. MissFormer (Huang et al 2021) argued that the transformer lacked modeling of the local context and enhanced the hierarchical transformer by exploring both global dependence and local context capture. Relying on these promising global modeling capabilities, all of them have achieved consistent improvements. However, there are some limitations: (1) these models did not consider the complementary relationship between the global modeling ability of self-attention and the local modeling ability of convolution, and treated the convolutional operation and self-attention as two unrelated operations. They did not explore an effective complementary combination of convolution and self-attention.
(2) Most of these works rely excessively on the modeling ability of self-attention and lack exploration of multi-dimensional information.
To address these limitations, we redesign a novel attention mechanism, broadening its width so that image information is captured in both the spatial and channel dimensions synchronously.
Moreover, multi-scale information has demonstrated its necessity and importance. CoTr (Xie et al 2021) proposed a hybrid network that connects a CNN and a transformer, introducing a deformable self-attention mechanism to extract multi-scale features. TransUNet (Chen et al 2021) combined a CNN and a transformer to extract spatial and global features, modeling feature maps of different spatial resolutions via skip-connections. TransBTS (Wang et al 2021e) first adopted 3D convolution to extract local features; the generated outputs of different spatial resolutions were then fed into transformers for modeling global features. CrossFormer (Wang et al 2021c) proposed a cross-scale layer to extract multi-scale features, mixing embeddings of different sizes. U-Net (Ronneberger et al 2015) was the first work to use skip-connections, which recover fine-grained features and were proved effective in subsequent comparison experiments. However, the recovered features were still not fine enough, so the segmentation results lost edge and location information. UCTransNet (Wang et al 2021a) proposed a channel transformer, replacing the skip-connection from the perspective of the channel and the self-attention mechanism and guiding the fused multi-scale channel information to connect effectively to the decoder. Because organs are complex and their boundaries blurred and weak, organ boundaries are prone to false prediction, which degrades the overall segmentation performance of the network. Many research methods therefore consider introducing edge information. MSRF-Net (Srivastava et al 2022b) used the gated shape flow of GS-CNN (Takikawa et al 2019) to predict important shape and boundary information. SMU-Net (Ning et al 2022) proposed a saliency-guided morphology-aware U-Net, which enhances the morphological information learning ability to some extent. 3D-MTL (Irshad et al 2022) proposed a deep neural network based on a 3D boundary constraint, which enables the network to accurately predict the edges of organs. Although these works improve the fine-grained modeling ability of the model to some extent, they do not effectively address the network's lack of spatial, channel and edge information. In short, these methods either conduct feature fusion by simple skip-connection operations or ignore edge information. Therefore, they cannot effectively establish global context connections across multiple scales, and may even cause global information disorder, degrading generalization performance.
To address the above issues, this paper proposes a novel C2Former, leveraging powerful multi-scale feature edge fusion to integrate local and global contexts. In summary, the main contributions of C2Former are as follows: • We redesign a novel cross-convolutional self-attention mechanism for semantic feature representation, which consists of window self-attention, long-distance self-attention and convolutional attention. With this local-global attention capability, C2Former can better learn the characteristics of the target.
• We propose a multi-scale feature edge fusion module (MFEF). Relying on MFEF, C2Former effectively combines the edge features and enables features at different scales to interact effectively. Furthermore, a gated module is introduced to filter interference information.
• Extensive experiments on three public datasets demonstrate that our C2Former outperforms other state-of-the-art medical image segmentation methods. Moreover, ablation studies have confirmed the effectiveness of each component.

Figure 1 shows the proposed C2Former network. We utilize a U-shaped architecture with stacked C2 Transformer Blocks as the encoder. It contains four stages of such blocks, where blocks at different stages process sequences of different resolutions. This allows the network to fully extract features and capture potential hidden information. MFEF combines the image edge features to effectively form the multi-scale feature streams, and the decoder utilizes these streams to output the segmentation result. The components of C2Former are detailed in the following.

Encoder

The C2 Transformer Block is shown in figure 2; it is composed of a 2-layer MLP, LayerNorm (LN) layers, window multi-head self-attention (W-MSA), long-distance multi-head self-attention (LD-MSA) and convolutional attention. The self-attention mechanism has an irreplaceable position in modeling long-distance dependence, but it still has defects in modeling local features. Our window self-attention mechanism aims to establish effective dependencies in local regions: as shown in figure 3(a), it divides the image into several windows of size M × M for feature modeling. To compensate for fine-grained image features and effectively capture the information between tokens of different regions, we use long-distance sampling for self-attention. As shown in figure 3(b), a sampling interval I is set, and the unsampled image blocks are masked. Tokens spaced I apart over the height and width of the feature map are gathered into one group, so each group contains H/I × W/I tokens, and feature modeling is conducted within each group. Overall, the difference between W-MSA and LD-MSA, as shown in figures 3(a) and (b), is that W-MSA samples adjacent image blocks, while LD-MSA samples non-adjacent image blocks and aggregates the sampled blocks into a group to perform feature modeling within that group. The outputs of W-MSA and LD-MSA are obtained as follows:

$$\mathrm{head}_i = \mathrm{Attention}(Q_i, K_i, V_i) = \mathrm{softmax}\!\left(\frac{Q_i K_i^{T}}{\sqrt{d}}\right)V_i$$

$$\mathrm{MSA}(Z) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\,W^{O}$$
where $W^{O}$ is the learnable weight matrix and $\mathrm{Attention}(\cdot)$ follows the standard definition (Vaswani et al 2017) over query, key and value matrices. Note that window self-attention and long-distance self-attention are stacked blocks, while convolutional attention runs in parallel with these two self-attention mechanisms. Because the self-attention mechanism has limitations in modeling short-distance dependence, we adapt W-MSA to integrate the relationships between short- and long-distance dependencies. Applying this window partitioning manner, the outputs of the lth layer are obtained as follows:

$$\hat{S}^{l} = \text{W-MSA}(\mathrm{LN}(Z^{l-1})) + Z^{l-1}, \qquad S^{l} = \mathrm{MLP}(\mathrm{LN}(\hat{S}^{l})) + \hat{S}^{l}$$

$$\hat{Z}^{l+1} = \text{LD-MSA}(\mathrm{LN}(S^{l})) + S^{l}, \qquad Z^{l+1} = \mathrm{MLP}(\mathrm{LN}(\hat{Z}^{l+1})) + \hat{Z}^{l+1}$$

To explore how to better combine convolution and self-attention for medical image segmentation tasks, we design convolutional attention in parallel with self-attention to capture information from both the spatial and channel dimensions synchronously. Similar to Woo's method (Woo et al 2018), for the input $Z$ we first generate the average features and the maximum features over the spatial dimension; both features are then fed into a fully connected network. Channel and spatial attention are adopted to generate the output of the convolutional attention branch:

$$M_{c}(Z) = \sigma\big(\mathrm{MLP}(\mathrm{AvgPool}(Z)) + \mathrm{MLP}(\mathrm{MaxPool}(Z))\big)$$

$$M_{s}(Z') = \sigma\big(\mathrm{Conv}([\mathrm{AvgPool}(Z');\, \mathrm{MaxPool}(Z')])\big), \qquad Z' = M_{c}(Z) \otimes Z$$

where $\mathrm{MaxPool}$ and $\mathrm{AvgPool}$ denote max-pooling and average-pooling operations and $\sigma$ represents the sigmoid function. The final output of the C2 Transformer Block is then obtained by combining the stacked self-attention output with the parallel convolutional attention branch, where $S^{l}$ and $Z^{l+1}$ denote the outputs of the first-layer and second-layer C2 Transformer Block, respectively.
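To make the block concrete, the following PyTorch sketch shows how the two token groupings and the parallel convolutional attention branch could be implemented. It is a minimal illustration, not the authors' released code: the function and class names, the reduction ratio of 8 and the 7 × 7 spatial kernel are our own assumptions.

```python
import torch
import torch.nn as nn

def window_partition(x, M):
    """W-MSA grouping: adjacent M x M patches form one attention group.
    x: (B, H, W, C) -> (B * H/M * W/M, M*M, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

def long_distance_partition(x, I):
    """LD-MSA grouping: tokens sampled at stride I across the whole map share
    one group, so each group holds (H/I)*(W/I) spatially distant tokens.
    x: (B, H, W, C) -> (B * I * I, (H/I)*(W/I), C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // I, I, W // I, I, C)
    return x.permute(0, 2, 4, 1, 3, 5).reshape(-1, (H // I) * (W // I), C)

class ConvAttention(nn.Module):
    """Convolutional attention branch in the spirit of Woo et al (2018):
    channel attention from pooled descriptors, then spatial attention."""
    def __init__(self, channels, reduction=8, kernel_size=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, z):                      # z: (B, C, H, W)
        avg = z.mean(dim=(2, 3))               # (B, C) average descriptor
        mx = z.amax(dim=(2, 3))                # (B, C) max descriptor
        ca = torch.sigmoid(self.mlp(avg) + self.mlp(mx))[:, :, None, None]
        z = z * ca                             # channel-refined features
        sa = torch.sigmoid(self.conv(torch.cat(
            [z.mean(1, keepdim=True), z.amax(1, keepdim=True)], dim=1)))
        return z * sa                          # spatially refined features
```

Multi-head attention is then computed independently within each group returned by either partition; this is what lets LD-MSA relate spatially distant positions at the cost of a single reshape.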

Multi-scale feature edge fusion module
The edge extraction module aims to enhance the network's ability to extract boundary features and to refine faint edge segmentation information. Specifically, as shown in figure 4, the edge extraction module uses the Canny edge extraction algorithm to extract the edge features of the image, then performs feature extraction and dimension alignment through down-sampling operations. Figure 5 shows the structure of our proposed MFEF, which not only considers the edge features but also integrates multi-scale features. Furthermore, a gated module is designed to filter interference information. The 1D features generated by the encoder are transformed into 2D features. These features then enter the gated module and fuse with the edge information obtained from edge extraction. The edge-fused features flow into further gated modules to fuse with features at different spatial scales. Finally, the result is modeled efficiently through a 2-layer C2 Transformer Block.
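As a hedged illustration of the edge extraction step, the sketch below produces a Canny edge map and downsamples it to an encoder stage's resolution. The Canny thresholds and the strided 3 × 3 convolutions are illustrative assumptions, not values taken from the paper.

```python
import cv2
import torch
import torch.nn as nn

def extract_edge_map(image_u8, low=50, high=150):
    """Canny edge map of a grayscale uint8 image; thresholds are illustrative."""
    edges = cv2.Canny(image_u8, low, high)       # (H, W), values in {0, 255}
    return torch.from_numpy(edges).float().div(255.0)[None, None]  # (1, 1, H, W)

class EdgeDownsample(nn.Module):
    """Strided convolutions that align the edge map with one encoder stage,
    halving the resolution `num_halvings` times and matching `out_ch` channels."""
    def __init__(self, out_ch, num_halvings):
        super().__init__()
        layers, ch = [], 1
        for _ in range(num_halvings):
            layers += [nn.Conv2d(ch, out_ch, 3, stride=2, padding=1),
                       nn.ReLU(inplace=True)]
            ch = out_ch
        self.net = nn.Sequential(*layers)

    def forward(self, edge_map):   # (B, 1, H, W) -> (B, out_ch, H/2^k, W/2^k)
        return self.net(edge_map)
```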

Gated module
The gated module consists of two batch normalization (BN) layers, convolutional layers, a ReLU layer and a Sigmoid layer. It receives the feature maps from the upper layer and the current layer and merges the two feature maps for the next step. First, we normalize the fused feature map with a BN layer. It is worth noting that we use a (1 × 1) kernel instead of a (3 × 3) convolution kernel: since we designed a more powerful C2 attention module, we no longer need (3 × 3) convolutions for feature extraction, and instead use (1 × 1) convolutions to further integrate the information of the fused feature map across channels. Thus, the network can extract richer feature information. Figure 6 shows the details of our proposed gated module.
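The paper specifies the ingredients (two BN layers, 1 × 1 convolutions, ReLU and a sigmoid) but not the exact wiring, so the following is one plausible PyTorch reading. The additive merge of the two streams and the placement of the gate are our own assumptions.

```python
import torch
import torch.nn as nn

class GatedModule(nn.Module):
    """Fuses the upper-layer and current-layer feature maps; a sigmoid gate
    suppresses interference before the fused features move on."""
    def __init__(self, channels):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv = nn.Conv2d(channels, channels, kernel_size=1)  # 1x1, not 3x3, per the paper
        self.relu = nn.ReLU(inplace=True)
        self.bn2 = nn.BatchNorm2d(channels)
        self.gate = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, upper, current):
        fused = self.bn1(upper + current)      # merge the two streams (sum assumed)
        fused = self.relu(self.conv(fused))    # cross-channel integration at 1x1
        g = torch.sigmoid(self.gate(self.bn2(fused)))
        return fused * g                       # gated features, interference filtered
```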

Decoder
As shown in figure 1, the decoder is an incompletely symmetric counterpart of the encoder. The decoder uses a linear extension to complete the upsampling operation: during upsampling, the sequences are linearly mapped to a high-dimensional space. All of the feature information is then used to obtain the final predictions.
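The "linear extension" is not spelled out in the text; a common reading (used, e.g., by Swin-Unet's patch-expanding layer) is sketched below as an assumption: each token is linearly projected to a higher dimension and the result is rearranged into a 2×-resolution grid.

```python
import torch.nn as nn

class LinearExpand(nn.Module):
    """Doubles spatial resolution by linear projection plus pixel rearrangement."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, 2 * dim)   # 2*dim = 2x2 spatial positions x dim/2 channels
        self.norm = nn.LayerNorm(dim // 2)

    def forward(self, x, H, W):               # x: (B, H*W, C) token sequence
        B, L, C = x.shape
        x = self.proj(x).view(B, H, W, 2, 2, C // 2)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, 4 * H * W, C // 2)
        return self.norm(x)                   # tokens now form a (2H, 2W) grid
```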

Datasets
Synapse abdominal multi-organ dataset: this dataset includes 30 CT volumes of abdominal organs, with a total of 3779 slices. In our experiments, 18 cases are used for training the model and the remaining 12 cases for testing.

Implementation details
Our C2Former is implemented in the PyTorch framework on a single NVIDIA RTX 2080 Ti GPU. The input image size is set to 224 × 224. Data augmentation methods such as random flipping and random scaling are applied. In the training stage, we set an initial learning rate of 0.05, a batch size of 24 and 400 epochs. The optimizer is SGD with a momentum of 0.9 and a weight decay of 1e-4. We train the network with a joint loss function, the weighted sum of the Dice loss and the cross-entropy loss, defined as

$$\mathcal{L} = \alpha\,\mathcal{L}_{\mathrm{Dice}} + \beta\,\mathcal{L}_{\mathrm{CE}}$$

where α and β are two hyperparameters, empirically set to 0.6 and 0.4, respectively.
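The joint loss is fully specified by the text (α = 0.6, β = 0.4); a compact PyTorch version is sketched below, where the multi-class soft-Dice formulation is our own standard choice rather than one stated in the paper.

```python
import torch
import torch.nn.functional as F

def joint_loss(logits, target, alpha=0.6, beta=0.4, eps=1e-5):
    """L = alpha * Dice + beta * CrossEntropy, with the paper's alpha/beta.
    logits: (B, K, H, W) raw scores; target: (B, H, W) integer labels."""
    ce = F.cross_entropy(logits, target)
    probs = torch.softmax(logits, dim=1)
    onehot = F.one_hot(target, probs.shape[1]).permute(0, 3, 1, 2).float()
    inter = (probs * onehot).sum(dim=(0, 2, 3))          # per-class overlap
    denom = probs.sum(dim=(0, 2, 3)) + onehot.sum(dim=(0, 2, 3))
    dice = 1.0 - ((2 * inter + eps) / (denom + eps)).mean()
    return alpha * dice + beta * ce
```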

Segmentation results and performance
Comparison with state-of-the-art methods: table 1 shows the overall average DSC and HD95 segmentation results for the three different datasets. Table 2 presents the quantitative results of the proposed C2Former and other state-of-the-art methods on the Synapse dataset. As reported in table 2, C2Former obtains state-of-the-art segmentation performance in average DSC and HD. Here C2Former(224) denotes an input size of 224 × 224 and C2Former(384) an input size of 384 × 384. Both achieve promising performance and outperform R50 ViT by nearly 12 percentage points in DSC and 15 mm in average HD. Table 3 presents the quantitative results on the ACDC dataset, where our proposed method outperforms the other state-of-the-art methods. The quantitative results of C2Former against other advanced models on the ISIC 2017 dataset are reported in table 4; our method outperforms the existing methods on several metrics, including DSC, Jaccard and Accuracy. Table 5 presents the model parameters and computational complexity. As reported in table 5, C2Former obtains the best segmentation results on the Synapse dataset while having the lowest complexity among the compared methods, demonstrating that C2Former achieves advanced performance at low cost. Ablation study: to analyze the impact of each component of C2Former, we report ablation results in tables 6-8. Baseline denotes standard multi-head self-attention, and C2MSA denotes our proposed cross-convolutional multi-head self-attention. Comparing the first two rows of table 6, adding the MFEF module outperforms the baseline by nearly 12 percentage points in DSC on the Synapse dataset. Comparing the first and third rows, the C2MSA module outperforms the baseline by over 12 percentage points in DSC. Table 7 reports the ablation results on the ACDC dataset: the MFEF module outperforms the baseline by over 3 percentage points in average DSC, and the C2MSA module by nearly 4 percentage points in DSC. Table 8 reports the ablation results on the ISIC 2017 dataset: the MFEF module outperforms the baseline by nearly 3 percentage points in average DSC, and the C2MSA module by nearly 4 percentage points in DSC. In short, both C2MSA and MFEF improve the segmentation performance. Figure 7 visualizes the ablation study of the proposed MFEF module and C2 Transformer Block, where MSA denotes standard multi-head self-attention and w/o MFEF denotes without the MFEF module. The red boxes highlight some mis-segmentations; C2Former is closer to the manual annotation.
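For reference, the DSC and HD95 values reported in these tables can be computed per structure with, for example, the MedPy package; treating each label as a binary mask in this way is a common convention, not something the paper prescribes.

```python
import numpy as np
from medpy.metric.binary import dc, hd95

def evaluate_case(pred, gt, num_classes):
    """Per-class DSC (%) and HD95 (in voxel units unless spacing is passed)."""
    scores = {}
    for k in range(1, num_classes):             # skip background label 0
        p, g = (pred == k), (gt == k)
        if p.any() and g.any():
            scores[k] = (100.0 * dc(p, g), hd95(p, g))
        else:                                   # HD95 is undefined for empty masks
            scores[k] = (100.0 * dc(p, g), np.nan)
    return scores
```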
Three cardiac structures are segmented in the ACDC dataset: the right ventricle (RV), the left ventricular muscle (LVM) and the left ventricle (LV). We analyze the correlation and consistency between the C2Former and manual results to evaluate whether the DSC shows a statistical difference. Figure 8 shows that all three structures are highly correlated with the manually derived results (r² = 0.909, 0.984 and 0.991, respectively). The consistency F tests show no significant difference between the C2Former and manual results (F = 40.180, P < 0.001; F = 240.208, P < 0.001; and F = 439.810, P < 0.001).

Discussion
In previous research, deep learning methods have been applied to multi-organ segmentation in medical images and achieved excellent performance (Fang and Yan 2020, Zhang et al 2020, Lin et al 2021). However, due to the diversity and complexity of medical images, it is still a great challenge to obtain accurate boundaries (Fu et al 2021, Shi et al 2021). For medical image segmentation tasks, some relevant works did not explore the local-feature modeling ability of self-attention; others relied excessively on the modeling ability of self-attention and were limited in exploring multi-dimensional information. We use a novel attention mechanism to capture image information in both spatial and channel dimensions synchronously, and design the MFEF to achieve effective information interaction. As shown on the three datasets, our method achieves better performance. Furthermore, C2Former has great advantages in terms of parameters and computational complexity: the above experiments show that C2Former achieves advanced performance with low parameter counts and FLOPs. The diversity and complexity of medical images are mainly reflected in the variety of image types collected by different devices, such as slice images in CT, functional images in MRI, and high-resolution camera or microscope photographs. For example, slice images have continuous context, while functional images sometimes only show thermal maps or spectra (Fournel et al 2021, Zhang). A segmentation method with great compatibility and excellent performance is undoubtedly a greater challenge than previous methods have faced. Here we discuss the clinical impact of the segmentation results through visualization. First, we evaluated our method on the multi-organ CT dataset. Generally, CT images are slice images with high spatial resolution but low tissue contrast, which makes segmentation of small and complex structures extremely difficult. Figure 9 shows that, compared with other state-of-the-art methods, our attention mechanism has an excellent ability to distinguish fuzzy boundaries and identify structures within tissue. For example, in the second row, MISSFormer recognized the edge of the stomach incorrectly; SwinUnet and MISSFormer cannot accurately recognize the blood vessels located in the liver (lake blue label), and MISSFormer cannot accurately identify the boundary of the stomach. In the third row, SwinUnet and MISSFormer could not identify the gallbladder structure; these cases are well handled by our model, showing its edge recognition ability under low tissue contrast. The second dataset is the cardiac cine MRI dataset. Although most MRI images have very good tissue contrast, their spatial resolution is low. Figure 10 shows that our model also offers very high segmentation accuracy on low-contrast images in which the right ventricle and left ventricular wall muscles are difficult to distinguish. Finally, we segment high-definition photographs of skin cancer (shown in figure 11). As a kind of 2D medical image, like pathological pictures, these photographs are not sectional images and have no continuous context. Compared with the SwinUnet and MISSFormer models, our method still shows good recognition and edge segmentation ability for such images with strongly blurred edges.
Figure 9. Visual comparison of the segmentation results on the Synapse dataset. In the first row, SwinUnet and MISSFormer did not recognize the smaller gastric structure (blue label), and MISSFormer made some errors in liver edge recognition. In the second row, SwinUnet and MISSFormer cannot accurately recognize the blood vessels located in the liver (lake blue label), and MISSFormer cannot accurately identify the boundary of the stomach. In the third row, both SwinUnet and MISSFormer fail to recognize the gallbladder (green label). In the fourth row, SwinUnet incorrectly identifies the pancreas, which is not present in this slice (purple label).

We develop a general segmentation approach by combining a cross-convolutional self-attention mechanism that integrates local and global contexts with the exploration of multi-dimensional information to fuse multi-scale feature edges. Beyond the applications in this research, it can also be employed for other segmentation tasks. Although our model has achieved satisfactory segmentation results, there is still room for improvement from the perspective of the DSC of individual structures. On the other hand, we still use a 2D segmentation algorithm, which inevitably brings some information loss. In the future, we will address these deficiencies with a 3D algorithm.

Figure 10. The red boxes show that SwinUnet and MISSFormer are inaccurate in edge recognition of the right ventricle (red label) and left ventricular muscle (green label) due to the low contrast, while C2Former recognizes and segments them better.

Conclusion
In this paper, we propose a novel C2Former network. It provides a new design idea for the attention mechanism and captures information in the spatial and channel dimensions synchronously. Specifically, the multi-scale feature edge fusion module (MFEF) not only introduces edge features into the transformer, but also integrates multi-scale feature information. Furthermore, a gated convolutional module is designed to filter interference information and ensure that the extracted features are effective. Extensive experiments demonstrate that C2Former achieves advanced performance over previous state-of-the-art models, while retaining great advantages in terms of parameters and computational complexity. For clinical purposes, C2Former can not only recognize the edges of complex organs or structures more accurately, but also generalizes strongly, so it can be applied to the automatic segmentation of a variety of organs across different types of medical images.

Statement and Acknowledgments
This manuscript has not been published or presented elsewhere in part or in entirety and is not under consideration by another journal. The study design was approved by the appropriate ethics review board. We have read and understood your journal's policies, and we believe that neither the manuscript nor the study violates any of these. There are no conflicts of interest to declare.