C-SwinIR: Dental CT super-resolution reconstruction fused with SwinIR

Super-resolution (SR) reconstruction of dental computed tomography (Dental CT) images is an innovative and challenging task. To address the limitations of Dental CT in obtaining high-resolution (HR) images due to equipment constraints and noise interference, we propose a Dental CT SR method called C-SwinIR based on SwinIR. First, the self-calibrated convolution network (SCNet) is introduced to mitigate the loss of detail in shallow feature maps and to improve the ability to recover details. Next, the cross-shaped window (CSWin) self-attention Transformer structure replaces the original Transformer structure, improving the model's ability to capture contextual information. Finally, integrating the efficient channel attention (ECA-Net) module effectively realizes local cross-channel interaction, accelerates model convergence, and mitigates gradient explosion during training. Experimental results show that the proposed method is superior to the original SwinIR network, achieving a peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) that are higher by 0.287 and 0.003, respectively, and a mean squared error (MSE) that is lower by 1.108. Using this method, clinicians can obtain clearer details and textures from Dental CT images, which effectively assists diagnosis.


1. Introduction
CT is an important auxiliary tool in clinical diagnosis. In recent years, cone-beam CT (CBCT), with its higher spatial resolution, shorter scanning time, and higher ray utilization, has gradually shown its advantages in the clinical use of Dental CT [1]. In medical diagnosis, HR images are often needed for auxiliary analysis. However, due to the limitations of CT equipment, changes in the external environment, interference from network transmission media, and other factors, the collected images contain some noise. To obtain HR images directly, patients must be exposed to high-current, high-voltage X-rays for a long time, and the resulting radiation seriously harms their health. Image SR technology [2] is therefore an effective way to address these problems and can provide clearer data for subsequent medical diagnosis.
Against the background of the rapid development of computing, SR methods based on deep learning have made great progress in the image field. Convolutional neural network (CNN) operations can be parallelized, so training is fast. However, because the size of the convolution kernel is relatively fixed, a CNN effectively operates on fixed-length inputs. A Transformer accepts variable-length inputs and can adapt to inputs of different lengths, but it involves a large amount of self-attention computation and trains slowly. Combining a CNN with a Transformer therefore adapts better to inputs of different lengths, learns both local and global dependencies, and, compared with using a Transformer alone, reduces the number of network parameters. Dong et al. [3] used a CNN for image SR for the first time; its reconstruction quality exceeded that of traditional methods, and it is considered a pioneering work in the field. Gu et al. [4] created a non-local image convolution aggregation module to better reconstruct image texture information.
As an alternative to convolutional neural networks, the Transformer [5] has demonstrated good performance on several vision problems [6]. Swin Transformer [7] integrates the advantages of CNNs and Transformers, showing great promise. Liang et al. [8] used Swin Transformer for image SR for the first time, combined the Transformer with a CNN, proposed SwinIR, and opened a new direction for image SR. Christensen et al. [9] pioneered a Transformer-based reconstruction method, VSR-SIM, which uses shifted three-dimensional window multi-head attention built on a channel attention mechanism to solve the SR problem. In this paper, we improve several structures in the SwinIR network. The experiments confirm that the improved network has excellent SR performance on Dental CT images: noise in the reconstructed images is significantly reduced, and tissue-structure information is effectively recovered. The main contributions of this paper are as follows:
(1) The self-calibrated convolution proposed in SCNet [10] is used as the feature extraction network. It explicitly extends the field of view of each convolutional layer internally to enrich the output features and to address the loss of detail in shallow feature maps.
(2) The original Transformer structure is replaced with the CSWin Transformer [11] structure. The cross-shaped window self-attention mechanism and locally-enhanced positional encoding proposed by CSWin Transformer enable more efficient receptive-field interaction and better feature extraction at a lower computational cost.
(3) The ECA-Net [12] module is integrated; it involves only a few parameters yet brings a significant performance improvement. This module also effectively optimizes the weight parameters during back-propagation and alleviates the convergence difficulties of deep network training.

2. Method
2.1. Network structure of the original SwinIR
The initial SwinIR architecture comprises three essential components: a module for extracting shallow features, a module for extracting deep features, and a module dedicated to image reconstruction, as shown in Figure 1. In the shallow feature extraction module, a convolutional layer captures features from the shallow low-resolution (LR) image, focusing on preserving low-frequency information. The extracted features are then passed to the deep feature extraction module for further processing. The deep feature extraction module consists primarily of residual Swin Transformer blocks (RSTB). Each RSTB block integrates several Swin Transformer layers (STL) followed by a convolutional layer, joined by a residual connection, which enables local attention and cross-window interaction.
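To make this three-stage pipeline concrete, the following is a minimal PyTorch sketch, not the authors' implementation: the class name `SwinIRSketch`, the embedding size, and the use of `nn.Identity` as a stand-in for each RSTB block are all illustrative assumptions.

```python
# Minimal sketch of the shallow-feature / deep-feature / reconstruction
# pipeline described above. RSTB blocks are placeholders here.
import torch
import torch.nn as nn

class SwinIRSketch(nn.Module):
    def __init__(self, in_ch=1, embed_dim=60, num_rstb=4, scale=4):
        super().__init__()
        # Shallow feature extraction: one convolution preserving
        # low-frequency information.
        self.shallow = nn.Conv2d(in_ch, embed_dim, 3, padding=1)
        # Deep feature extraction: a stack of residual Swin Transformer
        # blocks (RSTB); nn.Identity stands in for each block.
        self.deep = nn.Sequential(*[nn.Identity() for _ in range(num_rstb)])
        # Reconstruction: upsample with pixel shuffle, project back.
        self.reconstruct = nn.Sequential(
            nn.Conv2d(embed_dim, embed_dim * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale),
            nn.Conv2d(embed_dim, in_ch, 3, padding=1),
        )

    def forward(self, x):
        s = self.shallow(x)        # shallow features
        d = self.deep(s) + s       # deep features with a residual link
        return self.reconstruct(d) # reconstructed HR image
```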

2.2. Introducing SCNet to enhance shallow feature extraction
When the shallow feature extraction module processes large images, using only one convolutional layer to obtain shallow image information and passing the extracted features directly to the Transformer-based deep feature extraction module loses a great deal of feature information, making it difficult for the reconstruction module to recover shallow image features and degrading network performance. Introducing SCNet solves this problem; its structure is depicted in Figure 2.
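For readers unfamiliar with SCNet, the following is a hedged PyTorch sketch of a self-calibrated convolution in the spirit of [10]; the class name `SCConv`, the even channel split, the 3×3 kernels, and the pooling ratio are illustrative choices, not the paper's exact configuration.

```python
# Sketch of a self-calibrated convolution: one branch gates its
# full-resolution response with context modeled at low resolution,
# while the other branch applies an ordinary convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCConv(nn.Module):
    def __init__(self, channels, pool_ratio=4):
        super().__init__()
        c = channels // 2
        self.k1 = nn.Conv2d(c, c, 3, padding=1)  # plain branch
        self.k2 = nn.Conv2d(c, c, 3, padding=1)  # low-res calibration
        self.k3 = nn.Conv2d(c, c, 3, padding=1)  # feature transform
        self.k4 = nn.Conv2d(c, c, 3, padding=1)  # output projection
        self.pool = nn.AvgPool2d(pool_ratio, stride=pool_ratio)

    def forward(self, x):
        x1, x2 = torch.chunk(x, 2, dim=1)
        # Calibration: model context at a lower resolution, upsample,
        # and use it to gate the full-resolution response.
        cal = F.interpolate(self.k2(self.pool(x1)), size=x1.shape[2:])
        y1 = self.k4(self.k3(x1) * torch.sigmoid(x1 + cal))
        y2 = self.k1(x2)  # ordinary convolution branch
        return torch.cat([y1, y2], dim=1)
```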

2.3. Improving the ability of the STL structure to extract context information
In Swin Transformer, local attention requires many window interactions to propagate information across the receptive field, so the receptive field expands slowly and the network converges with difficulty. CSWin Transformer addresses this problem well with its CSWin self-attention mechanism and locally-enhanced positional encoding (LePE).
CSWin Transformer divides the input features into equal-width strips along the horizontal and vertical directions and performs self-attention on them in parallel, as shown in Figure 3. The strip width is an important parameter of the cross-shaped window: it reduces computation while still yielding a large receptive field. Specifically, the strip width is adjusted according to network depth; shallow layers use a smaller width and deep layers a larger one. The larger the strip width, the stronger the ability to capture contextual information.
LePE imposes the positional information directly on the linearly projected values within each Transformer block, operating on the attention output rather than on the attention computation itself, which allows it to be applied to image reconstruction tasks at arbitrary input resolutions. The improved STL structure is composed of LayerNorm (LN), CSWin self-attention, and multilayer perceptron (MLP) modules. We call it the CSWin Transformer layer (CSTL), and its structure is shown in Figure 3.
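The following is a deliberately simplified sketch of the cross-shaped window idea: half of the channels attend within horizontal stripes and half within vertical stripes. Multi-head projections and LePE are omitted, and `stripe_attention`, `CSWinAttention`, and the strip width `sw` are hypothetical names and values for illustration only.

```python
# Simplified cross-shaped window self-attention: horizontal and
# vertical stripe attention run in parallel on channel halves.
import torch
import torch.nn as nn

def stripe_attention(x, sw, vertical):
    """x: (B, H, W, C); H (or W, if vertical) must be divisible by sw."""
    if vertical:
        x = x.transpose(1, 2)                  # swap H and W
    B, H, W, C = x.shape
    x = x.reshape(B * H // sw, sw * W, C)      # tokens within one stripe
    attn = torch.softmax(x @ x.transpose(-2, -1) / C ** 0.5, dim=-1)
    x = (attn @ x).reshape(B, H, W, C)
    if vertical:
        x = x.transpose(1, 2)
    return x

class CSWinAttention(nn.Module):
    def __init__(self, dim, sw=2):
        super().__init__()
        self.sw = sw
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                      # x: (B, H, W, C)
        h, v = torch.chunk(x, 2, dim=-1)       # split channels in half
        h = stripe_attention(h, self.sw, vertical=False)
        v = stripe_attention(v, self.sw, vertical=True)
        return self.proj(torch.cat([h, v], dim=-1))
```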

2.4. Attention fusion based on ECA-Net
The deep network contains more robust semantic information; however, the resulting feature maps have very low resolution, limiting the ability to perceive and capture fine details. As the network deepens, the reconstructed image is more likely to lose detail and to suffer from vanishing gradients. Although SwinIR merges the Transformer with a CNN, the problems of missing detail and vanishing gradients remain. We therefore turn to ECA-Net to solve this problem. Its structure is shown in Figure 4. ECA-Net applies global average pooling to the input feature maps, followed by a 1-D convolution. The convolution result is passed through a Sigmoid function to obtain channel-wise weights, which are multiplied element-wise with the original input feature maps to produce the final output feature maps. Unlike other attention mechanisms, ECA-Net avoids dimensionality reduction, enabling effective local cross-channel interaction and the extraction of inter-channel dependencies. Moreover, the module involves only a few parameters, making it efficient. Integrating ECA-Net effectively enhances information exchange between channels of different resolutions and accelerates network training.
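A minimal sketch of the ECA module as just described, assuming a fixed 1-D kernel size `k` (ECA-Net itself derives `k` adaptively from the channel count):

```python
# ECA module: global average pooling, a 1-D convolution across
# channels, and sigmoid gating of the input feature maps.
import torch
import torch.nn as nn

class ECA(nn.Module):
    def __init__(self, k=3):
        super().__init__()
        self.conv = nn.Conv1d(1, 1, kernel_size=k, padding=k // 2, bias=False)

    def forward(self, x):                         # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                    # global average pooling -> (B, C)
        w = self.conv(w.unsqueeze(1)).squeeze(1)  # local cross-channel interaction
        w = torch.sigmoid(w)                      # channel-wise weights
        return x * w[:, :, None, None]            # rescale the input features
```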
Figure 5 shows the overall structure of C-SwinIR. SCNet effectively enlarges the receptive field of the shallow feature extraction module, the STL structure is upgraded to the CSTL structure, and ECA-Net fuses the features extracted by the RSTB modules.

3. Experiment and result analysis
3.1. Experimental data and experimental settings
The dataset used in this study comes from a stomatology hospital, and all the data have been desensitized. The experiment used 2000 images as the training set, 200 images as the validation set, and 200 images as the test set. The HR images were downsampled by a factor of 4 with the classical Bicubic interpolation method [13] to obtain the LR images. During training, the Adam optimizer is used with an initial learning rate of $10^{-4}$. To ensure stable convergence, MultiStepLR is adopted to gradually reduce the learning rate during training. The experimental environment is as follows: Windows 10 operating system, PyTorch 1.12 framework, Python 3.10, CUDA Toolkit 11.3, and an NVIDIA GTX 1080 Ti GPU. We use the L1 pixel loss as the training loss function, which is also called Least Absolute Deviations (LAD) and Least Absolute Error (LAE). Its expression is shown in Equation (1):

$$L_{1} = \frac{1}{N}\sum_{i=1}^{N}\left| I_{SR}(i) - I_{HR}(i) \right| \tag{1}$$

where $I_{SR}$ is the generated SR image, $I_{HR}$ is the real HR image, and $N$ is the number of pixels. For any input value, the L1 loss yields a stable gradient, which avoids gradient explosion and gives good robustness.
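The following is a sketch of this training configuration. The scheduler milestones and decay factor are assumptions, since the paper does not list them, and the tiny model below is only a stand-in for C-SwinIR.

```python
# Sketch of the training setup described in the text: L1 pixel loss,
# Adam with initial lr 1e-4, and a MultiStepLR schedule.
import torch
import torch.nn as nn

model = nn.Sequential(                                     # placeholder for C-SwinIR;
    nn.Upsample(scale_factor=4),                           # a real SR network upsamples x4
    nn.Conv2d(1, 1, 3, padding=1),
)
criterion = nn.L1Loss()                                    # L1 pixel loss, Eq. (1)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # initial lr 10^-4
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[100_000, 200_000], gamma=0.5)   # assumed step schedule

def train_step(lr_img: torch.Tensor, hr_img: torch.Tensor) -> float:
    optimizer.zero_grad()
    loss = criterion(model(lr_img), hr_img)  # |SR - HR|, averaged over pixels
    loss.backward()
    optimizer.step()
    scheduler.step()                         # gradually reduce the learning rate
    return loss.item()
```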

3.2. Evaluation index
To provide an objective assessment of the reconstructed image quality, we employ the three metrics most commonly used for super-resolution as benchmarks: PSNR, SSIM, and MSE, computed as in Equations (2)-(4):

$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{\mathrm{MAX}^{2}}{\mathrm{MSE}}\right) \tag{2}$$

$$\mathrm{SSIM}(x, y) = \frac{(2\mu_{x}\mu_{y} + c_{1})(2\sigma_{xy} + c_{2})}{(\mu_{x}^{2} + \mu_{y}^{2} + c_{1})(\sigma_{x}^{2} + \sigma_{y}^{2} + c_{2})} \tag{3}$$

$$\mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}\left[ I_{HR}(i,j) - I_{SR}(i,j) \right]^{2} \tag{4}$$

where MAX is the maximum possible pixel value, $\mu_{x}$, $\mu_{y}$, $\sigma_{x}^{2}$, $\sigma_{y}^{2}$, and $\sigma_{xy}$ are the means, variances, and covariance of images $x$ and $y$, $c_{1}$ and $c_{2}$ are stabilizing constants, and $m \times n$ is the image size.
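For reference, all three metrics are available off the shelf in scikit-image; the following convenience wrapper is an illustration, not the authors' evaluation code.

```python
# Computing PSNR, SSIM, and MSE with scikit-image.
import numpy as np
from skimage.metrics import (peak_signal_noise_ratio,
                             structural_similarity, mean_squared_error)

def evaluate(sr: np.ndarray, hr: np.ndarray, data_range: float = 255.0):
    """sr, hr: 2-D grayscale arrays on the same intensity scale."""
    return {
        "PSNR": peak_signal_noise_ratio(hr, sr, data_range=data_range),
        "SSIM": structural_similarity(hr, sr, data_range=data_range),
        "MSE": mean_squared_error(hr, sr),
    }
```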

3.3. Experimental results and analysis
To verify the performance of the improved network, it is compared with other representative SR methods, namely Bicubic interpolation, MSRResNet [14], and RRDB [15]. All compared SR networks were retrained on the dataset in this paper to ensure a fair evaluation on Dental CT images. Figure 6 shows the loss curves of the original SwinIR network and the C-SwinIR network during training. Table 1 lists the objective evaluation indexes of the SR networks at the ×4 amplification factor. As Table 1 shows, the proposed C-SwinIR is optimal in PSNR, SSIM, and MSE: its PSNR is 0.263 higher than that of RRDB, the next-best method; its SSIM is 0.003 higher than that of the original SwinIR, the next-best method; and its MSE is 0.227 lower than that of RRDB, the next-best method. Overall, the objective evaluation results of the improved network are better than those of the compared networks. For subjective comparison, CT images containing key parts of the teeth were selected from the test set for reconstruction. The visual results are depicted in Figure 7.
For a more intuitive comparison, the reconstruction results are locally enlarged. Comparing the local magnifications of the various reconstruction methods, it is evident that Bicubic interpolation produces the least favorable results, with severe loss of texture and detail and a blurred appearance. MSRResNet, RRDB, and the original SwinIR reconstruct some details, but the image edges are unnatural. The improved network recovers the texture details around the teeth more effectively, producing natural edge transitions and markedly improved clarity.

4. Conclusion
In this paper, we propose a Dental CT SR method called C-SwinIR based on SwinIR. Our method enhances feature extraction by explicitly expanding the field of view of each convolutional layer with self-calibrated convolutions, improves the acquisition of contextual information by replacing the original Transformer structure with the CSWin Transformer, and accelerates convergence by integrating the ECA-Net module. Experimental results show that C-SwinIR outperforms the original SwinIR in PSNR, SSIM, and MSE, and that clinicians can obtain clearer details and textures from the reconstructed Dental CT images to assist diagnosis.

Figure 1. Illustration of the SwinIR architecture. The core components of the deep feature extraction module are the RSTB modules. Each RSTB module integrates several STL layers, facilitating local attention and cross-window interaction.

Figure 2. Schematic illustration of the self-calibrated convolution. In self-calibrated convolution, the original input is divided into three parts, each responsible for extracting features from different channels, and the resulting features are finally combined. Compared with traditional convolution, self-calibrated convolution enables each position of the feature map to obtain context information at different resolutions, breaking the tradition of extracting features only within small-scale regions, effectively enlarging the receptive field of the convolutional layer, and making the extracted features more diverse. The findings demonstrate that SCNet effectively extracts more detailed information, such as contours and textures, and significantly improves the feature extraction capability of the shallow feature extraction module. This structure provides sufficient image feature information for the subsequent deep feature extraction module and significantly improves the efficiency of image reconstruction.

Figure 3. The CSWin structure consists of an LN and CSWin self-attention module, another LN and MLP module, and a shortcut connection. The blue square in panel (b) is the feature map, and the two orange bands together form the CSWin self-attention.

Figure 4. Schematic diagram of ECA-Net. The aggregated features are obtained by global average pooling, and the channel weights by a convolution and Sigmoid function, finally achieving cross-channel interaction.

Figure 5. Diagram of C-SwinIR. SCNet is added to the shallow feature extraction module, the STL structure is upgraded to the CSTL structure, and ECA-Net is used to fuse the features extracted by the RSTB modules.

Figure 6. Loss curves of the original SwinIR network and the C-SwinIR network. The loss of the C-SwinIR network is significantly lower than that of the SwinIR network.

Figure 7. Visual comparison of the compared methods. The Bicubic results are blurry, and the reconstructions of MSRResNet, RRDB, and the original SwinIR are not sufficiently sharp. The method presented in this paper recovers the texture information around the teeth well and clearly improves sharpness.


Table 1. Comparison of objective evaluation indexes of the compared methods. The symbols ↑ and ↓ indicate that higher and lower values are preferred, respectively.