Residual Shuffle Attention Network for Image Super-Resolution

To improve the accuracy of super-resolution networks while reducing the number of model parameters, this paper improves the RCAB module of RCAN and builds a reconstruction network, RSAN, that improves both the quality and the efficiency of image super-resolution reconstruction. Replacing the original channel attention module with the more efficient and lightweight shuffle attention mainly reduces the number of parameters, with accuracy improvement as a secondary benefit; replacing part of the ordinary convolutions in RCAN with split convolution mainly improves accuracy, with reduced feature redundancy and parameter count as secondary benefits. Experimental results show that RSAN not only obtains better subjective visual evaluation and objective quantitative evaluation, but also reduces the number of network parameters and improves network efficiency to a certain extent.


Introduction
Image super-resolution (SR) reconstruction is a technology that reconstructs low-resolution (LR) images into high-resolution (HR) images. It is widely used in surveillance, remote sensing, military, medical and other fields, and has long been one of the research hotspots in digital image processing.
At present, two main types of methods are used for image super-resolution reconstruction: reconstruction-based methods and the currently more popular deep learning-based methods. Dong et al. [1,2] pioneered deep learning-based super-resolution. Using convolutional neural networks (CNN), they proposed the super-resolution convolutional neural network (SRCNN) and the fast super-resolution convolutional neural network (FSRCNN), which accelerates model training. The SRCNN structure includes three parts, feature extraction, nonlinear mapping and reconstruction, and has had a profound influence on later deep learning-based network structures. Lai et al. [3] proposed the Laplacian super-resolution network (LapSRN), whose overall structure resembles a pyramid. The method includes a feature extraction branch and an image reconstruction branch, and reconstructs the image through staged upsampling. A new loss function, similar to the loss of holistically-nested edge detection (HED) [4], is used to guide model training. Lim et al. [5] proposed an enhanced deep super-resolution network (EDSR), which removes the BN layers of the ResNet architecture and partially improves the quality of the generated images. Some scholars [6] used generative adversarial networks (GAN) to achieve super-resolution and obtained good results in structural similarity (SSIM). However, as the number of network layers increases, deep learning-based methods become more and more complex to train, and it is difficult to achieve satisfactory results. Since Vaswani et al. [7] introduced the attention mechanism into CNN, attention has achieved good results in a series of computer vision tasks, such as the squeeze-and-excitation network (SEnet) proposed by Hu et al. [8], which is a channel attention mechanism.
The convolutional block attention module (CBAM) proposed by Woo et al. [9] includes both spatial attention and channel attention. Since then, a large number of attention-based methods have emerged in the field of image super-resolution. Zhang et al. [10] proposed the residual channel attention network (RCAN), which uses channel attention inside its residual channel attention block (RCAB). Dai et al. [11] studied channel attention in more depth: they argued that the SEnet proposed by Hu et al. [8] exploits only first-order image statistics, ignoring higher-order image features and limiting the performance of the network, and therefore proposed a second-order channel attention network (SAN).
The attention mechanisms in the above methods either consider only channel attention or consider both channel and spatial attention, but they are not efficient in fusing the channel and spatial attention weights, their parameter counts are large, and the accuracy of the models is limited. To solve these problems, this paper introduces the lightweight and efficient shuffle attention mechanism proposed by Yang [12] to replace the original channel attention (CA) module in RCAB, reducing the number of parameters while improving accuracy. In addition, to further reduce the number of model parameters and feature redundancy while improving accuracy, this paper partially replaces traditional convolution with the split convolution (SP) proposed by Zhang et al. [13]. The experimental results show that the residual shuffle attention network (RSAN) proposed in this paper can not only obtain better subjective visual evaluation and objective quantitative evaluation, but also reduce the number of network model parameters and improve the efficiency of the network to a certain extent.

Network Structure
The overall network structure of the proposed method is shown in figure 1; it is divided into three parts.

Shallow Feature Extraction.
The shallow feature extraction part contains only a single ordinary convolution layer with a 3*3 kernel, 3 input channels and 64 output channels.

Deep Feature Extraction.
Deep feature extraction is completed by the new residual-in-residual (RIR) block proposed in this paper. The RIR network is composed of 10 external residual blocks and a long skip connection; the last ordinary convolution layer of the fifth external residual block is replaced with SP. Each external residual block contains 20 residual shuffle attention blocks (RSAB). Each RSAB is composed of 2 convolution layers with equal numbers of input and output channels and 3*3 kernels, a ReLU activation function, a shuffle attention module, and a short skip connection.

Upsampling and Reconstruction.
In the image reconstruction part, the scaling factor S is 4. A convolution layer with C input channels, C*4 output channels and a kernel size of 3*3 is connected in series with a PixelShuffle layer; the PixelShuffle layer proposed by Shi et al. [14] performs the actual upsampling. See equations (1) and (2):

M = H_conv(F)    (1)
F' = PS(M)    (2)

where F represents the tensor output by the previous residual block set, H_conv represents a two-dimensional convolution operation with C input channels and C*4 output channels, and PS(·) represents the PixelShuffle operation. The dimension of M is [N, C*4, W, H], and the dimension of F' in equation (2) is [N, C, W*2, H*2]. The above process magnifies the feature map by a factor of 2. Since the scale factor in this paper is 4, the process is applied once more to obtain F'' with dimension [N, C, W*4, H*4]. Finally, F'' is passed through a convolution layer with C input channels, 3 output channels and a 3*3 kernel to obtain the final high-resolution image.
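The reshape-and-transpose view of the PixelShuffle dimension bookkeeping above can be sketched as follows (a minimal NumPy illustration, not the paper's actual network code):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a [N, C*r*r, H, W] tensor into [N, C, H*r, W*r]."""
    n, crr, h, w = x.shape
    c = crr // (r * r)
    # Split the channel axis into (C, r, r), then interleave the two r axes
    # with the spatial axes so each r*r channel block becomes an r*r patch.
    x = x.reshape(n, c, r, r, h, w)
    x = x.transpose(0, 1, 4, 2, 5, 3)      # [N, C, H, r, W, r]
    return x.reshape(n, c, h * r, w * r)

m = np.arange(16.0).reshape(1, 4, 2, 2)    # M with C=1, r=2
f2 = pixel_shuffle(m, 2)                   # F' with shape [1, 1, 4, 4]
```

Applying the convolution + PixelShuffle pair twice realizes the overall ×4 upscaling described above.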

Shuffle Attention Module
The attention mechanism has shown its advantages in many computer vision tasks. Yang et al. [12] proposed a lightweight shuffle attention mechanism. Compared with previous attention mechanisms, the biggest difference is that the feature maps are grouped, learned separately, and finally shuffled along the channel dimension, so that feature information can be exchanged between channels.
Figure 2. Shuffle attention mechanism [12].
As shown in figure 2, the entire attention mechanism consists of 4 parts: (1) Feature grouping: given an input feature map X ∈ R^(C*W*H), where C, W, H represent the channel number, width and height of the feature map respectively, X is first split into G groups along the channel dimension, namely X = [X_1, ⋯, X_G], X_k ∈ R^((C/G)*W*H); then each group is split into two branches along the channel direction, X_k1, X_k2 ∈ R^((C/(2G))*W*H). One branch uses the interrelationship between channels to generate a channel attention map, while the other branch uses the spatial relationship between features to generate a spatial attention map. Experiments show that G = 16 is an appropriate setting.
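The two-level split can be written directly as array slicing; the sketch below (NumPy, for illustration only) follows the shapes given above:

```python
import numpy as np

def group_and_split(x, g):
    """Split [C, W, H] features into G groups, each with two half-channel branches."""
    groups = np.split(x, g, axis=0)                       # each [C/G, W, H]
    return [tuple(np.split(grp, 2, axis=0)) for grp in groups]

x = np.random.rand(64, 8, 8)          # C=64 feature map
branches = group_and_split(x, 16)     # G=16: each branch has shape [2, 8, 8]
```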
(2) Channel attention: The SEnet proposed by Hu et al. [8] introduces a large number of parameters, making it difficult to trade off speed against accuracy. Wang et al. [15] proposed efficient channel attention (ECA), but ECA is not applicable when the convolution kernel is large. Yang et al. [12] therefore adopted a lighter solution, realized as a combination of global average pooling, scaling and an activation function. The specific calculation formulas are as follows:

s = F_gp(X_k1) = (1/(W*H)) Σ_{i=1..W} Σ_{j=1..H} X_k1(i, j)    (3)
X'_k1 = σ(F_c(s)) ⋅ X_k1 = σ(W_1 ⋅ s + b_1) ⋅ X_k1    (4)

where W_1, b_1 ∈ R^((C/(2G))*1*1) are two parameters trained jointly with the network, and σ represents the Sigmoid activation function.
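Equations (3) and (4) amount to global average pooling followed by an affine scaling and a sigmoid gate. A minimal NumPy sketch (in the real network, W_1 and b_1 would be learned parameters):

```python
import numpy as np

def channel_branch(xk1, w1, b1):
    """Channel attention branch: x' = sigmoid(w1 * gap(x) + b1) * x."""
    s = xk1.mean(axis=(1, 2), keepdims=True)       # eq. (3): global average pooling
    scale = 1.0 / (1.0 + np.exp(-(w1 * s + b1)))   # eq. (4): sigmoid(W1*s + b1)
    return scale * xk1

xk1 = np.ones((2, 8, 8))
out = channel_branch(xk1, np.zeros((2, 1, 1)), np.zeros((2, 1, 1)))
# With zero parameters the gate is sigmoid(0) = 0.5
```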
(3) Spatial attention: Spatial attention can be regarded as a complement to channel attention. Here, the group normalization (GN) operation proposed by Wu et al. [16] is used. The specific formula is as follows:

X'_k2 = σ(W_2 ⋅ GN(X_k2) + b_2) ⋅ X_k2

where W_2, b_2 ∈ R^((C/(2G))*1*1) are also two parameters trained jointly with the network, and σ represents the Sigmoid activation function.
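The spatial branch mirrors the channel branch, but gates each position using group-normalized features. A NumPy sketch (treating the branch as a single normalization group; W_2, b_2 are again stand-ins for learned parameters):

```python
import numpy as np

def spatial_branch(xk2, w2, b2, eps=1e-5):
    """Spatial attention branch: x' = sigmoid(w2 * GN(x) + b2) * x."""
    mu, var = xk2.mean(), xk2.var()                 # normalize over the branch
    gn = (xk2 - mu) / np.sqrt(var + eps)
    scale = 1.0 / (1.0 + np.exp(-(w2 * gn + b2)))   # sigmoid(W2*GN(x)+b2)
    return scale * xk2

xk2 = np.random.rand(2, 8, 8)
out = spatial_branch(xk2, np.zeros((2, 1, 1)), np.zeros((2, 1, 1)))
```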
(4) Aggregation: After the two kinds of attention have been learned and the features re-calibrated, the two branches are concatenated, X'_k = [X'_k1, X'_k2] ∈ R^((C/G)*W*H). Then all the sub-features are aggregated, and finally the channel shuffle operation is performed so that information can flow across groups.
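The final channel shuffle is a reshape-transpose-reshape over the channel axis; a NumPy sketch of the operation (illustration only):

```python
import numpy as np

def channel_shuffle(x, g):
    """Interleave the channels of the G groups so information flows across groups."""
    c, w, h = x.shape
    return x.reshape(g, c // g, w, h).transpose(1, 0, 2, 3).reshape(c, w, h)

x = np.arange(4.0).reshape(4, 1, 1)   # channels [0, 1, 2, 3] in two groups
y = channel_shuffle(x, 2)             # channel order becomes [0, 2, 1, 3]
```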

Basic Experiment
Experimental data set: DIV2K is selected as the training data set; it contains 800 HR training images and 100 validation images. The scaling factor selected in this paper is 4, and the evaluation indicators are peak signal-to-noise ratio (PSNR) and structural similarity measure (SSIM). The model is trained for 30 epochs.
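PSNR, one of the two evaluation indicators, is computed from the mean squared error between the ground-truth HR image and the reconstruction. A standard NumPy implementation (assuming an 8-bit pixel range):

```python
import numpy as np

def psnr(hr, sr, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two images."""
    mse = np.mean((hr.astype(np.float64) - sr.astype(np.float64)) ** 2)
    if mse == 0:
        return float('inf')     # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```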
Experimental results: We verify the performance of the proposed algorithm on the DIV2K validation set, Set5, Set14 and BSD100, with Bicubic, EDSR [5], SRGAN [6], RCAN [10] and RDN [17] as comparison methods. Table 1 shows that the method in this paper performs relatively poorly on BSD100 under the 4× scale factor; a possible reason is insufficient generalization ability of the model. The results on Set14 and Set5 are competitive. Overall, the proposed RSAN can obtain satisfactory reconstructed images.

Conclusion
This paper proposes a super-resolution reconstruction network, RSAN, which can not only obtain better subjective visual evaluation and objective quantitative evaluation, but also reduce the number of network parameters and improve network efficiency to a certain extent. However, through many training