CTformer: Convolution-free Token2Token Dilated Vision Transformer for Low-dose CT Denoising

Low-dose computed tomography (LDCT) denoising is an important problem in CT research. Compared to normal-dose CT (NDCT), LDCT images are subject to severe noise and artifacts. In many recent studies, vision transformers have shown superior feature representation ability over convolutional neural networks (CNNs). However, unlike CNNs, the potential of vision transformers in LDCT denoising has been little explored so far. To fill this gap, we propose a Convolution-free Token2Token Dilated Vision Transformer (CTformer) for low-dose CT denoising. The CTformer uses a more powerful token rearrangement to encompass local contextual information and thus avoids convolution. It also dilates and shifts feature maps to capture longer-range interaction. We interpret the CTformer by statically inspecting patterns of its internal attention maps and dynamically tracing the hierarchical attention flow with an explanatory graph. Furthermore, an overlapped inference mechanism is introduced to effectively eliminate the boundary artifacts that are common for encoder-decoder-based denoising models. Experimental results on the Mayo LDCT dataset suggest that the CTformer outperforms the state-of-the-art denoising methods with a low computation overhead.


I. INTRODUCTION
The LDCT problem has gained much attention in the community due to its potential for reducing X-ray radiation. However, compared to NDCT images, LDCT images suffer from severe noise and artifacts [2] when they are used in clinical applications. To overcome this problem, two types of algorithms have been investigated: traditional algorithms and convolutional neural networks (CNNs) [3], [4]. i) Traditional algorithms such as iterative methods suppress the artifacts and noise by using a physical model based on a certain prior. Unfortunately, these algorithms are hard to adopt in commercial CT scanners because of hardware limitations and high computational cost [5]. ii) With the advent of deep learning, CNNs have become a prevailing approach for LDCT image denoising. Despite their superior learning ability aided by big data [6], CNNs are reported to be limited in capturing long-range contextual information in images [7]-[10], which adversely affects the retrieval of richer structural information in denoised images.
Recently, the transformer model [8] has shown excellent performance in computer vision [11]-[21]. Dosovitskiy et al. proposed the first vision transformer (ViT) by simply mapping an image into 16×16 patches (this operation is commonly referred to as tokenization) in analogy to words in a sentence in natural language processing [14]. Yuan et al. further proposed a Token2Token method to empower the transformer model with a diverse information encoding [10]. Next, Liu et al. designed the Swin transformer, which includes patch fusion and cyclic shift to enlarge the perception of contextual information in tokens [9]. Moreover, Choromanski et al. proposed the Performer transformer to reduce the computational complexity of the self-attention by approximating the inherent softmax operator [21]. Currently, the transformer model is poised to replace CNNs as the mainstream deep learning model. On the one hand, compared to CNNs, the transformer model is good at capturing global information and long-range feature interactions, resulting in the utilization of richer information. As shown in Fig. 1, the transformer has diversified and effective features, while the CNN model has many inactive features. On the other hand, the transformer model enjoys higher visual interpretability by virtue of its inherent self-attention block [22]-[24], whereas a typical CNN model contains no generic explanation modules [25].
Despite its success and great promise, the transformer has been little investigated in LDCT denoising. In our opinion, the transformer model is suitable for the LDCT denoising problem. Beyond its effectiveness, a transformer is more desirable for physicians because it is self-explanatory [26], e.g., allowing a physician to make sense of the model's logic. To the best of our knowledge, Zhang et al. pioneered the application of the transformer to LDCT denoising [27]. Although their model achieves state-of-the-art performance, it has imperfections in three aspects: i) The model uses the vanilla transformer, which cannot fully explore the potential of the transformer, as relevant studies are rapidly advancing. ii) Intensive convolutions are included in the model, making it essentially a hybrid model. Thus, the merits of using a transformer are insufficiently justified. iii) Their work neglects the interpretability that is essential for clinical applications [28].
We aim to fully explore the potential of transformers in LDCT denoising. Specifically, we propose a Convolution-free Token2Token Dilated Vision Transformer (CTformer) for low-dose CT denoising. The CTformer has the following characteristics: i) Although convolution is instrumental in capturing local features when it is combined with transformers on small datasets, it is not a necessity for the performance because the token rearrangement can also help complement the local information. Therefore, we completely exclude convolution operations in the proposed CTformer. To the best of our knowledge, the CTformer is the first pure transformer for LDCT denoising. ii) The dilation and a cyclic shift are used in the Token2Token to enlarge the receptive field, thereby gaining broader contextual information from the feature maps and reducing the computational cost. iii) We utilize an overlapped inference mechanism to address the boundary artifact that is common in the encoder-decoder denoising models. iv) We develop interpretability for the CTformer with visual attention maps and an explanatory graph that shed light on how the CTformer discriminates key structures from noise as well as on the hierarchical attention flow across layers. Experimental results suggest that the CTformer delivers superior denoising performance over other state-of-the-art methods with fewer trainable parameters and multiply-accumulate operations (MACs).
In summary, our contributions are threefold: i) This work is a forerunner in applying the vision transformer to the LDCT denoising problem. Moreover, the proposed CTformer is the first pure transformer for this task. ii) We introduce dilation and cyclic shift to enhance the tokenization process in the model, utilize a new inference mechanism to fix the boundary artifacts, and develop interpretation methods to unveil the model's denoising patterns. iii) Our experimental results demonstrate the superior denoising performance and model efficiency of the CTformer for LDCT denoising.

II. RELATED WORK
The previous studies for the LDCT denoising problem can be categorized into two classes.
Traditional algorithms. Typically, these methods incorporate a physical prior into an iterative reconstruction framework to suppress noise. For example, compressed sensing (CS) has been widely used for the LDCT problem by adopting a sparse representation [29], e.g., total variation minimization assumes that the clean image is piecewise constant, so that its gradients are sparse [30]-[33]. Xu et al. used a dictionary to construct the sparse representation [34] for LDCT denoising. In addition to the sparsity prior, Ma et al. designed a non-local mean prior to utilize the image voxels across the whole image rather than the local region [35]. However, an increasing number of studies [36]-[39] suggest that the traditional algorithms are surpassed by deep learning models driven by big data.
Convolution models. CNNs have been used for the LDCT image reconstruction. Wu et al. used a K-sparse autoencoder to learn the image features in an unsupervised fashion and minimize the distance between a normal-dose image and an iterative reconstruction result in the feature space of the autoencoder [36]. Liu et al. proposed a 3D residual convolutional network to estimate an iterative reconstruction (IR) image from an LDCT analytic reconstruction image [40]. Their method can save time because it avoids the time-consuming iterative reconstruction. He et al. proposed the 3pADMM method to address the problems of hyper-parameter optimization and prior knowledge selection in LDCT reconstruction [41].
Besides, a majority of deep LDCT denoising models focus on image post-processing. The work of Chen et al. was a pioneering study that employed convolution, deconvolution, and shortcut connections to prototype a residual encoder-decoder convolutional neural network (RED-CNN) [37]. Yang et al. used the generative adversarial network with Wasserstein distance (WGAN) aided by a perceptual loss to improve the quality of denoised images [38]. Due to the excellent performance of WGAN in generating faithful real-world CT images and the role of the perceptual loss in structural fidelity, this model alleviated the over-smoothness in the denoised images. Li [...] sharpness loss to preserve structural details and sharp boundaries [42]. Fan et al. constructed a quadratic neuron-based autoencoder for LDCT image denoising with more robustness and efficiency than conventional CNN-based methods [39]. It is the first autoencoder based on a new type of neuron. Huang et al. proposed a two-stage residual CNN [43], where the first stage uses the stationary wavelet transform for texture denoising, and the second stage enhances the image structure by combining the average of NDCT images with the denoised image from the first stage. However, CNN-based models typically lack the ability to capture global contextual information due to their limited receptive fields, and are thus less efficient at modeling the structural similarity across the whole image [1], [27], [44].

III. METHODS
In the supervised setting, with a deep learning model, the LDCT denoising task is to learn a mapping from a noisy LDCT image $x$ to its paired clean NDCT image $y$. Mathematically, a neural network can be trained by optimizing a mean square error (MSE) loss function as follows:
$$\min_{W} \; \| f(W; x) - y \|_2^2, \tag{1}$$
where $f(W; x)$ is a neural network, and $W$ denotes the collection of its parameters.
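As a minimal sketch of the MSE objective described above, the following toy example fits a one-parameter stand-in for $f(W; x)$ by gradient descent. The function `toy_denoiser`, the synthetic data, the step size, and the iteration count are all hypothetical choices for illustration; they are not the CTformer or its training setup.

```python
import numpy as np

def mse_loss(pred, target):
    """Mean squared error between a prediction and its target image."""
    return float(np.mean((pred - target) ** 2))

def toy_denoiser(w, x):
    """Hypothetical stand-in for f(W; x): a single learnable scale factor."""
    return w * x

rng = np.random.default_rng(0)
y = rng.random((8, 8))                          # synthetic "clean" image
x = y + 0.1 * rng.standard_normal((8, 8))       # synthetic "noisy" image

w = 0.0
for _ in range(200):
    # Gradient of mean((w * x - y)^2) with respect to w.
    grad = float(np.mean(2.0 * (toy_denoiser(w, x) - y) * x))
    w -= 0.5 * grad

initial_loss = mse_loss(toy_denoiser(0.0, x), y)
final_loss = mse_loss(toy_denoiser(w, x), y)
```

A real model replaces the scalar `w` with the full parameter set $W$ and the hand-written gradient with automatic differentiation, but the loss being minimized has the same form.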
A. Architecture of the CTformer
As shown in Fig. 2, the proposed CTformer takes the residual encoder-decoder structure with tokenization/detokenization blocks, four CTformer modules, and an intermediate transformer block. In the encoder, CTformer modules A and B each include a transformer block (TB) and a Token2Token Dilation block (T2TD). In the decoder, CTformer modules C and D symmetrically encompass an inverse Token2Token Dilation block (IT2TD) and a TB. The IT2TD block takes the inverse design of the corresponding T2TD block. Now let us introduce the CTformer from its macro to micro structures.
Residual encoder-decoder structure. We use a residual encoder-decoder structure as the backbone of the CTformer. The shortcuts only bridge layers at similar levels in the encoder and decoder parts. Although denoising in the encoder is accompanied by undesirable information loss, which hurts the structural recovery in the decoder, the shortcuts can supplement information from the feature maps of the encoder to retain structural details. Besides, shortcuts can mitigate the gradient vanishing problem such that a deep model can still be stably trained [45].
Tokenization block. As shown in Fig. 3, in the tokenization process, a noisy CT image is unfolded into a sequence of two-dimensional (2D) patches (also referred to as tokens): $T_0 \in \mathbb{R}^{b \times n \times d_0}$, where $b$ is the batch size, $n$ is the number of tokens, and $d_0$ is the token dimension. Throughout this manuscript, we use tokens and patches interchangeably.
Transformer block. As shown in Fig. 3, a typical transformer block contains multi-head attention (MHA), layer normalization (LN), an MLP, and residual connections to enhance the expressive power. Specifically, in the self-attention, a token sequence $T_0 \in \mathbb{R}^{b \times n \times d_0}$ is linearly mapped into three tensors, which are referred to as the query, key, and value and denoted as $Q, K, V \in \mathbb{R}^{b \times n \times d_m}$ for short, where $d_m$ is the token embedding dimension. Mathematically, we have
$$Q = W_q(T_0), \quad K = W_k(T_0), \quad V = W_v(T_0), \tag{2}$$
where $W_q$, $W_k$, and $W_v$ are linear operators. Then, the output of the self-attention is calculated as
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \tag{3}$$
where the scaling factor $1/\sqrt{d_k}$ is determined by the key dimension $d_k$. Besides the exact calculation of Eq. (3), the softmax operator can be approximated by a kernel method, which reduces the complexity of Eq. (3). The transformer using this approximation is called the Performer [21]. Here, $\mathrm{softmax}(QK^\top/\sqrt{d_k})$ is the attention map that will be used in the post-hoc interpretability analysis. Through the transformer block, the output token $T_a \in \mathbb{R}^{b \times n \times d_a}$ is
$$T' = \mathrm{MHA}(\mathrm{LN}(T_0)) + T_0, \quad T_a = \mathrm{MLP}(\mathrm{LN}(T')) + T'. \tag{4}$$

Token2Token dilation block. Previously, the simple tokenization in the vanilla transformer only included one tokenization process using either reshaping or convolutions with a fixed stride to convert an image to tokens. Thus, it tends to ignore the dependence across neighboring tokens. What is worse, it also makes the attention expressions redundant, which results in limited feature richness in each layer [10]. To overcome these problems, as shown in Fig. 3, we adopt the recently proposed T2T block, which uses cascade tokenization to replace the simple tokenization [10]. The T2T block consists of reshaping and unfolding, which can not only model the local information from the surrounding image pixels but also gain more feature representation than convolution. Furthermore, we use cyclic shift and dilation in the T2T (T2TD) to refine the contextual information fusion and leverage spatial relations across a larger region. Now, let us elaborate on these operations in detail.
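Step 1: reshaping. A sequence of tokens $T_a \in \mathbb{R}^{b \times n \times d_a}$ produced by the transformer block is first transposed to $T_a^\top \in \mathbb{R}^{b \times d_a \times n}$ and then reshaped into $F \in \mathbb{R}^{b \times d_a \times h \times w}$:
$$F = \mathrm{reshape}(T_a^\top), \tag{5}$$
where $h = w = \sqrt{n}$ are the height and width of the feature map, respectively.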
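The reshaping step can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the helper name `tokens_to_feature_map` and the token dimension of 4 are assumptions, while the token number 841 is the encoder value reported in the experiments.

```python
import math
import numpy as np

def tokens_to_feature_map(tokens):
    """Transpose tokens (b, n, d) to (b, d, n) and reshape to (b, d, h, w)."""
    b, n, d = tokens.shape
    h = w = math.isqrt(n)
    if h * w != n:
        raise ValueError("the token number n must be a perfect square")
    return tokens.transpose(0, 2, 1).reshape(b, d, h, w)

# Hypothetical batch: 2 samples, 841 tokens, token dimension 4.
tokens = np.arange(2 * 841 * 4, dtype=np.float32).reshape(2, 841, 4)
feature_map = tokens_to_feature_map(tokens)   # shape (2, 4, 29, 29)
```

Token index $k$ lands at spatial position $(\lfloor k / w \rfloor, k \bmod w)$, so neighboring tokens in a row of the sequence become neighboring pixels in the feature map.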
Step 2: cyclic shift. We employ the cyclic shift to modify the 4D feature maps in each T2TD block. Specifically, the pixel values in the feature maps are shifted in a cyclic way to utilize the information more sufficiently. Then, an inverse cyclic shift is performed in the symmetric IT2TD block in the decoder to avoid any pixel mismatch in the final denoising results. Through the cyclic shift, the tokens fed into the subsequent transformer blocks are extracted from different feature maps rather than from fixed patches. Furthermore, the tokens from the boundaries of the modified feature maps now include pixels that are not at the boundaries of the original feature maps. In practice, the CTformer shifts the image by two pixels to extract new tokens. Fig. 3 illustrates the cyclic shift module.
Step 3: dilated unfolding. The dilated unfolding uses the unfolding operation to retokenize the feature maps from the last step. To alleviate the information loss in this step, we adopt an overlapped splitting of patches. As a result, the aggregated tokens can respect the correlations among the neighboring tokens.
In this stage, the 4D feature maps $F \in \mathbb{R}^{b \times d \times h \times w}$ are converted back to 3D tokens $T_s \in \mathbb{R}^{b \times n_s \times d_s}$, where $n_s$ and $d_s$ represent the new token number and token dimension, respectively. By aggregating surrounding patches and pixels, the local information is favorably preserved, and the number of tokens is changed. Specifically, the token number decreases in the encoder and increases in the decoder. Instead of the normal unfolding, we endow the unfolding with a dilation to capture longer-range contextual information with less computational cost. Mathematically, the perceptive field $P$ of the dilation can be calculated as follows:
$$P = \prod_i \big( D_i (K_i - 1) + 1 \big),$$
where $K_i$ and $D_i$ denote the kernel size and the dilation rate in a certain dimension, respectively. After the dilated unfolding, the input feature map $F \in \mathbb{R}^{b \times d \times h \times w}$ becomes $T_{sd} \in \mathbb{R}^{b \times n_{sd} \times d_{sd}}$, where $d_{sd} = d \times \prod_i K_i$ and the total number of tokens $n_{sd}$ after the dilated unfolding operation is calculated as:
$$n_{sd} = \prod_i \left\lfloor \frac{\mathrm{spatial}(i) - \mathrm{dilation}(i) \times (\mathrm{kernel}(i) - 1) - 1}{\mathrm{stride}(i)} + 1 \right\rfloor,$$
where $\lfloor \cdot \rfloor$ is the floor function, $\mathrm{spatial}(i)$ means the corresponding size in the $i$-th dimension, $\mathrm{spatial}(0) = h$ in the height dimension, and $\mathrm{spatial}(1) = w$ in the width dimension. Here, dilation, kernel, and stride are the related parameters of the unfolding operation. Then, an MLP is applied to map the embedding dimension to a desired size. For a better understanding of our model, a flowchart of the above-discussed tensors is given in Fig. 4.
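The cyclic shift and dilated unfolding steps above can be sketched in NumPy as follows. This is a hedged illustration, not the authors' code: the helper names are hypothetical, no padding is assumed, and the kernel size of 3, dilation of 2, and stride of 1 are assumed values chosen because they reproduce the reported reduction from 841 (29 × 29) to 625 (25 × 25) tokens. The shift of two pixels follows the text.

```python
import numpy as np

def cyclic_shift(feature_map, shift):
    """Roll a (b, d, h, w) feature map by `shift` pixels along height and width."""
    return np.roll(feature_map, shift=(shift, shift), axis=(2, 3))

def unfolded_token_count(spatial, kernel, dilation, stride):
    """Tokens along one dimension: floor((spatial - dilation*(kernel-1) - 1)/stride + 1)."""
    return (spatial - dilation * (kernel - 1) - 1) // stride + 1

def dilated_unfold(feature_map, kernel, dilation, stride):
    """Unfold (b, d, h, w) into tokens of shape (b, n_sd, d * kernel * kernel)."""
    b, d, h, w = feature_map.shape
    out_h = unfolded_token_count(h, kernel, dilation, stride)
    out_w = unfolded_token_count(w, kernel, dilation, stride)
    columns = []
    for ki in range(kernel):
        for kj in range(kernel):
            top, left = ki * dilation, kj * dilation
            # All pixels that sit at kernel offset (ki, kj) of some patch.
            window = feature_map[:, :,
                                 top:top + stride * (out_h - 1) + 1:stride,
                                 left:left + stride * (out_w - 1) + 1:stride]
            columns.append(window.reshape(b, d, out_h * out_w))
    patches = np.stack(columns, axis=-1)            # (b, d, n_sd, kernel*kernel)
    return patches.transpose(0, 2, 1, 3).reshape(b, out_h * out_w, d * kernel * kernel)

# Hypothetical feature map with d = 4 channels on a 29 x 29 grid.
feature_map = np.arange(4 * 29 * 29, dtype=np.float32).reshape(1, 4, 29, 29)
shifted = cyclic_shift(feature_map, 2)
tokens = dilated_unfold(shifted, kernel=3, dilation=2, stride=1)   # (1, 625, 36)
restored = cyclic_shift(shifted, -2)                               # inverse shift
```

With these assumed settings each token spans a 5 × 5 pixel window per dimension ($D(K-1)+1 = 5$) while storing only 3 × 3 values per channel, which is how dilation widens the context without enlarging the token dimension.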
B. Inference of the CTformer.
In the inference phase, unlike a CNN, which can directly test the whole image, the transformer model can only do inference patch by patch. Because there exists information loss in the bottleneck of an encoder-decoder architecture [46], the denoised results of these patches are inconsistent at boundaries, causing boundary artifacts in the stitched image. As shown in Fig. 5, we can easily see the mosaic edges indicated by the red arrows, and the artifacts appear along all four directions. To address this problem, we propose an overlapped inference method. The core of our method is to discard the margin and only keep the center of the model output to stitch the final prediction. Suppose that the patch size is $p \times p$; we only keep the central part of a patch, of size $(p - 2\eta) \times (p - 2\eta)$, to form the final prediction image, where $\eta$ is selected to be greater than the width of the artifacts. In the overlapped inference, slightly more calculations are demanded because we discard the peripheral part of a patch. The increased cost is at the ratio of
$$\left\lceil \frac{n}{p - 2\eta} \right\rceil^{2} \Big/ \left\lceil \frac{n}{p} \right\rceil^{2},$$
where $n$ is the original image size, and $\lceil \cdot \rceil$ is the ceiling function. Therefore, we need to balance the computation cost with the artifact elimination effect.
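The overlapped inference described above can be sketched as follows. This is a minimal illustration under stated assumptions: `denoise_patch` is a hypothetical callable standing in for the trained model, reflect padding at the image border is our own choice (the text does not specify how borders are handled), and `cost_ratio` counts cost as the number of patch evaluations, which is one plausible reading of the ratio. The image size of 512 and patch size of 64 used in the comment are assumed values; the margin of 16 follows the experimental setting.

```python
import math
import numpy as np

def overlapped_inference(image, denoise_patch, p, eta):
    """Denoise a square (n, n) image patch by patch, keeping only patch centers."""
    n = image.shape[0]
    core = p - 2 * eta                     # side length kept from each patch
    tiles = math.ceil(n / core)
    # Assumed border handling: reflect-pad so every kept center has a full patch.
    pad_after = tiles * core - n + eta
    padded = np.pad(image, ((eta, pad_after), (eta, pad_after)), mode="reflect")
    output = np.zeros((tiles * core, tiles * core), dtype=image.dtype)
    for i in range(tiles):
        for j in range(tiles):
            top, left = i * core, j * core
            denoised = denoise_patch(padded[top:top + p, left:left + p])
            # Discard a margin of width eta on every side of the model output.
            output[top:top + core, left:left + core] = denoised[eta:p - eta, eta:p - eta]
    return output[:n, :n]

def cost_ratio(n, p, eta):
    """Patch evaluations with overlap divided by those without overlap."""
    return (math.ceil(n / (p - 2 * eta)) / math.ceil(n / p)) ** 2

# Sanity check with an identity "model": stitching must reproduce the input.
image = np.arange(100 * 100, dtype=np.float64).reshape(100, 100)
stitched = overlapped_inference(image, lambda patch: patch, p=64, eta=16)
ratio = cost_ratio(512, 64, 16)   # assumed n = 512 and p = 64
```

Under these assumed sizes, a margin of 16 halves the stride in each direction, so four times as many patches are evaluated; this is the cost that has to be weighed against artifact removal.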

C. Interpretability of the CTformer
In interpretability research, the saliency map is the most popular method. One can generate a saliency map for a CNN-based classification model after the model is trained [47]. However, for the image-to-image denoising task, deriving saliency maps in this way is not applicable because denoising models are essentially regression models. In contrast, even if transformer models are used for denoising, one can leverage the inherent attention modules to obtain saliency maps. Utilizing this advantage, we develop the interpretability of the CTformer by probing the patterns of the attention maps. Thus, one can decode the inner workings of the CTformer, with an emphasis on the processing of important structural and semantic information. This self-interpretability makes the CTformer unique relative to other LDCT denoising models.
Furthermore, we observe that the attention only reflects where the model attends in a static manner, which cannot convey how the attended parts flow across layers in the CTformer. To complement this with dynamic information, inspired by [48], we propose to construct an explanatory graph to describe the hierarchical flow of the attention. We take the attended parts as graph nodes and the attention flow as graph edges. Two nodes linked by an edge are usually co-activated and undergo a similar mapping (denoising). Specifically, we first recognize the attended object parts by identifying the peak activations. Then, we build the graph connections between neighboring layers by forwarding a masked feature map and monitoring the high activations.
Node: To identify the object part, we provide two pixel-based methods: TopK and local maximum (LM) selection. TopK extracts the K highest activations across the attention maps, while LM detects the local maximum activations.
Edge: To construct edges among nodes, we propose to forward a masked feature map. Specifically, given a node (an object part) in a layer, we mask the feature maps and only keep the region around the node. Then, we feed the masked feature maps to obtain the attention map of the next layer. Finally, we extract the highest activation (node) from the obtained attention map and link it to the given node. By performing the above steps recursively in two subsequent layers, the whole explanatory graph is built to inform us how the attention of the CTformer is shifted.
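The node and edge construction above can be sketched as follows. This is an illustrative sketch under assumptions, not the authors' implementation: attention and feature maps are treated as single 2D arrays, a local maximum is defined as a pixel strictly greater than its 8 neighbors, the mask is a square window of hypothetical half-width `radius`, and `next_attention` is a hypothetical callable that returns the next layer's attention map for a masked input.

```python
import numpy as np

def topk_nodes(att_map, k):
    """Return (row, col) coordinates of the k highest activations."""
    flat = np.argsort(att_map, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, att_map.shape)
    return list(zip(rows.tolist(), cols.tolist()))

def local_max_nodes(att_map):
    """Return coordinates whose activation strictly exceeds all 8 neighbors."""
    h, w = att_map.shape
    padded = np.pad(att_map, 1, mode="constant", constant_values=-np.inf)
    is_peak = np.ones((h, w), dtype=bool)
    for dr in (-1, 0, 1):
        for dc in (-1, 0, 1):
            if dr == 0 and dc == 0:
                continue
            neighbor = padded[1 + dr:1 + dr + h, 1 + dc:1 + dc + w]
            is_peak &= att_map > neighbor
    rows, cols = np.nonzero(is_peak)
    return list(zip(rows.tolist(), cols.tolist()))

def build_edges(feature_map, nodes, next_attention, radius):
    """Link each node to the peak of the next layer's attention under masking."""
    edges = []
    for r, c in nodes:
        masked = np.zeros_like(feature_map)
        r0, c0 = max(r - radius, 0), max(c - radius, 0)
        # Keep only the region around the node; everything else is zeroed.
        masked[r0:r + radius + 1, c0:c + radius + 1] = \
            feature_map[r0:r + radius + 1, c0:c + radius + 1]
        att_next = next_attention(masked)
        target = np.unravel_index(np.argmax(att_next), att_next.shape)
        edges.append(((r, c), (int(target[0]), int(target[1]))))
    return edges

# Toy attention map with two peaks; the "next layer" is an identity stand-in.
att = np.zeros((5, 5), dtype=np.float64)
att[1, 1], att[3, 3] = 3.0, 2.0
nodes_topk = topk_nodes(att, 2)
nodes_lm = local_max_nodes(att)
edges = build_edges(att, nodes_topk, lambda masked: masked, radius=1)
```

Applying `build_edges` to each pair of subsequent layers in turn yields the full graph; with a real model, `next_attention` would forward the masked feature map through the following CTformer module.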

IV. EXPERIMENTS
In this part, our model is trained and evaluated on a publicly available dataset. First, we demonstrate the superior denoising performance and the model efficiency of the CTformer over its counterparts. Then, we confirm the effectiveness of the overlapped inference mechanism. Finally, we elaborate on the model interpretability of the CTformer with the aforementioned interpretation methods.
Dataset. A publicly released dataset from the 2016 NIH-AAPM-Mayo Clinic LDCT Grand Challenge [49] is used for model training and testing. The dataset includes 2,378 low-dose (quarter) and normal-dose (full) CT images with a slice thickness of 3.0 mm from ten anonymous patients. We select the data of patient L506 for evaluation and use the remaining nine patients for model training. Data augmentation is also applied. We generate more training images by randomly rotating (90, 180, or 270 degrees) [...] here $\mathrm{spatial}(d) = \sqrt{841} = 29$ and $\mathrm{spatial}(d) = \sqrt{625} = 25$ can be calculated from the reshaping process in Eq. (5). The transformer token numbers in the decoder are symmetrically arranged as {625, 841}.
• We randomly extract 4 patches from all available slices for training through 4000 epochs with a batch size of 16. In a training batch, fewer patches from more images lead to less fluctuation and bias than more patches from fewer images, because many patches from a single image usually cannot represent the overall data distribution.
• Adam is adopted to minimize the MSE loss with an initial learning rate of $1.0 \times 10^{-5}$, which gradually decreases to $1.0 \times 10^{-6}$ with a scheduled decay rate.
• A margin size of 16 is used for overlapped inference.
Denoising performance. The performance of the CTformer is compared to other state-of-the-art methods, e.g., RED-CNN [37], WGAN-VGG [38], MAP-NN [51], and AD-NET [52]. The selected models are all popular low-dose CT or natural image denoising models that were published in flagship journals. We retrain all the models based on their officially disclosed codes. Fig. 6 shows the results of different networks on L506 with Lesion No. 575, and Fig. 7 shows the ROIs from the rectangular area marked in Fig. 6. It can be seen that all methods can alleviate noise and artifacts to some extent, but the CTformer generates the clearest and most perceptually pleasing denoised images. Specifically, per the ROIs from Fig. 7, we find that WGAN-VGG and MAP-NN seem to introduce additional shadows and tissues. While RED-CNN and AD-NET produce a smoother and clearer image relative to WGAN-VGG and MAP-NN, there still exists blotchy noise around the lesion. In contrast, the CTformer satisfactorily suppresses the noise and artifacts, maintains high-level spatial smoothness, and keeps the structural details in the restored image. Therefore, we conclude that the CTformer delivers the best visual quality among the compared methods.
Additionally, two metrics, structural similarity (SSIM) and root mean square error (RMSE), are adopted to quantitatively assess the quality of the denoised images. For fairness, we also evaluate the model complexity with the number of trainable parameters (#param.) and MACs. Table I shows the average SSIM and RMSE on all slices of L506. Among the state-of-the-art methods, only AD-NET achieves an SSIM score over 0.91, and only MAP-NN and AD-NET have an RMSE score below 10. In contrast, our CTformer has the highest SSIM of 0.9121 and the smallest RMSE of 9.0233. Concerning model complexity, MAP-NN has the highest MACs of 13.79G because it uses many repeated modules, while WGAN-VGG has the greatest number of trainable parameters, 34.07M, because it uses VGG as a feature extractor. In contrast, the CTformer has the smallest number of parameters and the lowest MACs. Compared to its competitors, our model has the best performance with the lowest computational cost.
Eliminating boundary artifacts. The overlapped inference is performed to eliminate boundary artifacts, as shown in Fig. 9(a). From the ROIs in Fig. 9(b), we can see that the boundary artifacts are obvious when $\eta$ is 0 or 1 but soon become hardly perceivable when $\eta$ further increases. It is worth noting that as $\eta$ varies, the boundary artifacts can appear in different regions because the size of the patches integrated in the final image is different.
To further confirm the effectiveness of the overlapped inference, a quantitative analysis on patient L506 is also conducted. As seen from Table II, [...]
Visual interpretation. To reveal the latent learning behavior in the CTformer, we visualize the attention maps $\mathrm{Att} = \mathrm{softmax}(QK^\top / \sqrt{d_k})$ in each layer. Specifically, we derive an attention map by averaging all grids of Att and resizing the result to the size of the original image. Then, the attention map is superimposed on the image with a transparency rate of 0.4.
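The visualization procedure above can be sketched as follows. Several details are assumptions on our part rather than statements from the text: "averaging all grids" is read as averaging Att over the query axis to obtain one score per token, resizing is done by nearest-neighbor upsampling and requires the image side to be a multiple of the token grid side, and the transparency rate of 0.4 is taken as the weight of the heat map in the blend. The function names and toy sizes are hypothetical.

```python
import math
import numpy as np

def attention_matrix(q, k):
    """Att = softmax(Q K^T / sqrt(d_k)) for Q and K of shape (n, d_k)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

def attention_overlay(att, image, alpha=0.4):
    """Average Att over queries, upsample to the image size, and blend."""
    n = att.shape[0]
    side = math.isqrt(n)
    token_scores = att.mean(axis=0).reshape(side, side)
    scale = image.shape[0] // side                 # assumes an exact multiple
    heat = np.kron(token_scores, np.ones((scale, scale)))
    heat = (heat - heat.min()) / (heat.max() - heat.min() + 1e-12)
    image01 = (image - image.min()) / (image.max() - image.min() + 1e-12)
    return alpha * heat + (1.0 - alpha) * image01

rng = np.random.default_rng(0)
q = rng.standard_normal((16, 8))      # 16 tokens on a 4 x 4 grid, d_k = 8
k = rng.standard_normal((16, 8))
image = rng.random((8, 8))            # toy "CT slice"
att = attention_matrix(q, k)
overlay = attention_overlay(att, image)
```

For display, the blended array would typically be rendered with a color map for the heat component; the grayscale blend here only illustrates the averaging, resizing, and weighting steps.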
As shown in Fig. 10, the attention map in the first layer highlights the key object parts. Specifically, there is more attention on the edges rather than the composition of key [...] The attention map in the second layer basically resembles the pattern in the first layer, but is sparser and less focused on the structures. Next, the pattern in the third layer becomes semantically implicit. Finally, the attention in the fourth layer tends to ignore the edges of objects and emphasize the content where noise is concentrated. Since the attention maps in different layers focus on different structures, we construct an explanatory graph to illustrate the flow of attention across layers. In our experiments, the object nodes are represented by the pixel coordinates of the image. We select the top 60 activations in the attention maps as nodes using TopK/LM selection and identify the highest activation under each node's influence. By applying the proposed method, the whole TopK and LM graphs are obtained, as shown in Figs. 11 and 12, respectively.
From the TopK graph in Fig. 11, it can be seen that the attention flows across different testing slices have very similar patterns. First, from the first to the second layer, the attention on the edges still favors other edges in the next layer, as indicated by the white circle in Fig. 11. Second, all high activations from the second layer move to the top area of the third layer in a latent manner. Last, all the top attentions in the third layer spread across the noisy area in the fourth layer. While the TopK graph identifies the flow of the top activations, the LM graph illustrates that of the local protuberant objects. As shown in Fig. 12, the attention graphs of different slices using LM are also analogous. Compared to the TopK graph, one principal distinction in the LM graph is that groups of local maximum activations tend to implicitly concentrate on the same point in the next layer. The white circles in Fig. 12 illustrate some concurrent points. Therefore, by inspecting the two attention graphs, the dynamic flow can be clearly followed. We can figure out how the object parts are co-activated and thus go through a similar level of noise reduction.
In summary, the latent learning behavior of the CTformer can be visually interpreted statically and dynamically. This makes the proposed model more transparent and reliable for diagnostic decisions.

V. ABLATION STUDY
In this part, comparative experiments are conducted to study the impact of the T2TD block, the cyclic shift operation and the number of the intermediate transformer blocks.
Impact of T2TD block. T2TD blocks are used in the CTformer to enhance the feature integration in the tokenization stage. Compared to fixed-region tokenization, the tokens in T2TD blocks are extracted from various regions of the original images. To verify the effectiveness of this part, a Sole-ViT model without the T2TD module is designed. We only adopt a sole convolution in the tokenization stage with a filter size of 8 and a stride of 8. Then, five transformer layers with an embedding size of 256 are applied for feature extraction and denoising. An embedding size of 256 rather than 64 is used so that the model size and MACs are close to those of our model, as shown in Table III. Finally, a detokenization with deconvolution is employed to transform the tokens back to the desired image domain. By investigating the conjunction area inside the blue circle in Fig. [...]
Impact of cyclic shift. In this work, the cyclic shift is performed in the T2TD blocks to enlarge the perceptive field of our model. Fig. 13 shows that the CTformer with cyclic shift enjoys more spatial smoothness than the CTformer without cyclic shift. The latter introduces some additional noise components. Quantitative results from Table III also confirm the effectiveness of the cyclic shift, which improves the SSIM and RMSE of the model by 0.0026 and 0.1337, respectively.
Impact of block number. In terms of the number of intermediate transformer blocks, we evaluate the CTformer with 1, 2, 4, and 8 blocks to identify the influence. When the block number grows, the network goes deeper. The computational cost increases slowly, but the actual training time climbs dramatically. However, Table III indicates that the CTformer with only one block yields better performance than the ones with more blocks.

VI. CONCLUSION
In this paper, we have proposed a novel convolution-free transformer empowered by dilated tokenization and cyclic shift for LDCT denoising, which is referred to as the CTformer. To the best of our knowledge, the proposed CTformer is the first pure transformer model for LDCT denoising. Also, we have developed interpretation methods for the proposed CTformer to decode its hidden behavior. Moreover, we have proposed the overlapped inference to address the boundary artifacts that are common in an encoder-decoder model. Experimental results have demonstrated that the CTformer outperforms its competitors in terms of denoising performance and model efficiency. In the future, more efforts can be made to extend the CTformer to other medical denoising problems.