Single Image Rain Removal Algorithm Based on U-Net and Vision Transformer

Rain significantly degrades image quality, which inevitably affects how well outdoor computer vision systems, such as autonomous driving, perform. This paper proposes a two-branch deep neural network consisting of an attention guided U-Net and a Vision Transformer, which captures intra-field details and cross-patch relationships to obtain good global and local deraining effects. To ensure that both branches contribute positively to the target deraining task, we specifically design a patch boot image module to achieve complementary feature fusion, which adaptively selects informative intra-field regions guided by a patch-wise importance map. Moreover, the Wasserstein distance between the prediction and the reference image is applied as the objective function to pursue a better image quality measure. According to qualitative and quantitative results on the public datasets Rain200L and Rain200H, the proposed method removes rain and restores background details more effectively than existing algorithms, reaching a peak signal-to-noise ratio (PSNR)/structural similarity (SSIM) of 32.55 dB/0.9476 and 26.12 dB/0.8826 on the respective datasets.


Introduction
Images captured in rainy conditions suffer substantial quality degradation, so several computer vision algorithms, including object detection, image segmentation, and depth estimation [1], which are essential components of autonomous navigation and surveillance systems, perform poorly on them. To improve the dependability of these vision systems, it is essential to remove the effects of rainy weather from the photographs. Single-image rain removal is made even more difficult by the fact that a single image is static in location and time, while the distribution of rain streaks in rainy photographs is usually variable. A single-image deraining method therefore offers a great deal of practical value. Early methods usually impose various kinds of prior knowledge based on the statistical properties of rain streaks and clean images, which limits rain removal performance, because complex and changeable rainy scenes are difficult to model with such priors [2].
Recently, many deep learning-based approaches have achieved satisfactory performance [3], but the convolutional layers of a CNN cannot directly capture correlations between distant pixels, leading to constrained receptive fields. Most existing deraining models enlarge the receptive field by stacking convolution kernels. However, the receptive field obtained this way is still limited, and the global information of the rainy image is not used effectively, so some details and structural information are lost after deraining. Transformers have also been applied to image deraining with good performance [4]. These techniques, however, cannot accurately model the local features of the image or recover its local information.
We propose a dual-branch deep neural network made up of an Attention guided U-Net and a Vision Transformer to achieve good global and local deraining effects. The former employs spatial attention to choose valuable local information from low-level features under the guidance of high-level features, while the latter uses a Transformer to gather global information. The main contributions of this study are summarized as follows: (1) To obtain good global and local deraining effects, we propose a two-branch deep neural network consisting of an attention guided U-Net and a Vision Transformer, which captures intra-field details and cross-patch relationships.
(2) To fuse the features derived from the two branches, we design a patch boot image module that adaptively selects informative intra-field regions for the target deraining task.
(3) The Wasserstein distance between the prediction and the reference image is applied as the objective function to pursue a better image quality measure.

The basic structure of U-Net
The U-Net network proposed by Ronneberger et al. [5] significantly improved medical image segmentation. Its fundamental structure has two parts. The first part is the encoder, built from several identical blocks, each consisting of two consecutive 3×3 convolutions followed by a ReLU function and one 2×2 max pooling layer. The second part is the decoder, consisting of the same number of reverse blocks: each block first uses a 2×2 up-convolution to up-sample the feature map, then the corresponding encoder feature map is cropped and concatenated with the up-sampled one, followed by two 3×3 convolutions with ReLU functions. In the final stage, one additional 1×1 convolution reduces the feature map to the required number of channels to produce the segmentation map. The network has a nearly symmetrical, U-shaped appearance.
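For illustration, the encoder and decoder blocks described above can be sketched in PyTorch as follows. This is a minimal sketch with assumed channel sizes, not the authors' implementation; padded 3×3 convolutions are used, so cropping before concatenation is unnecessary.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two consecutive 3x3 convolutions, each followed by ReLU."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class EncoderBlock(nn.Module):
    """DoubleConv followed by 2x2 max pooling; returns the skip feature
    (for the decoder) and the pooled output (for the next encoder block)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = DoubleConv(in_ch, out_ch)
        self.pool = nn.MaxPool2d(2)
    def forward(self, x):
        skip = self.conv(x)
        return skip, self.pool(skip)

class DecoderBlock(nn.Module):
    """2x2 up-convolution, concatenation with the encoder skip, then DoubleConv."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(in_ch, out_ch, 2, stride=2)
        self.conv = DoubleConv(2 * out_ch, out_ch)
    def forward(self, x, skip):
        x = self.up(x)                                  # double spatial size
        return self.conv(torch.cat([skip, x], dim=1))   # fuse with the skip
```

A final `nn.Conv2d(out_ch, n_classes, 1)` would then map the last decoder output to the required number of channels.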

Transformer
Transformer was first applied, successfully, to natural language processing (NLP). The network is made up of encoder and decoder subnetworks. In the encoder stage, the words of the sentence are first converted into word vectors, a globally attended feature map is then obtained through the self-attention module, and the encoder output is finally produced by a feedforward network. The input of a decoder includes the output of the corresponding encoder and the output of the preceding decoder. Since the parallel input lacks the positional relationships between words, Transformer uses positional encoding to preserve them, and the decoder output is the probability distribution for the corresponding position. Following this noteworthy success, Transformer is being adopted in computer vision by an increasing number of researchers. Dosovitskiy et al. [6] proposed the Vision Transformer model, the first use of Transformer to classify images. To match the encoder input, Vision Transformer divides the image into non-overlapping patches. Similar to BERT's [class] token, an extra learnable token is appended and used to predict the final label at the output of the Transformer encoder.
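The patch splitting, embedding, and [class] token described above can be sketched in PyTorch (an illustrative sketch with assumed image/patch sizes, not the original ViT code):

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into non-overlapping p x p patches, project each to a
    d-dimensional embedding, prepend a learnable [class] token, and add a
    learnable position embedding."""
    def __init__(self, img_size=32, patch=8, in_ch=3, dim=64):
        super().__init__()
        self.n = (img_size // patch) ** 2
        # A stride-p convolution is equivalent to flattening each patch
        # and applying a shared linear projection.
        self.proj = nn.Conv2d(in_ch, dim, kernel_size=patch, stride=patch)
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, self.n + 1, dim))
    def forward(self, x):
        t = self.proj(x).flatten(2).transpose(1, 2)      # (B, N, d) tokens
        cls = self.cls.expand(x.shape[0], -1, -1)        # (B, 1, d) class token
        return torch.cat([cls, t], dim=1) + self.pos     # (B, N+1, d)
```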

Proposed Method
We first explain the overall architecture of the proposed method in this section, and then introduce the specifically designed Attention guided U-Net, Vision Transformer, and Patch boot image module.

Overall architecture
L = W(O, I_gt), where W(·,·) stands for the Wasserstein distance, O for the derained prediction, and I_gt for the ground-truth image.

Attention guided U-Net
A = SA(Up(F_h)), where Up(F_h) stands for the high-level feature F_h after up-convolution and SA(·) stands for the spatial attention block.
The attention map A is used to extract the important information from the low-level features.
F̂_l = A ⊗ F_l, where F̂_l stands for the low-level feature refined by the attention map A and ⊗ is pointwise multiplication.
For the decoder, we stack the high-level features created by up-convolution with the refined low-level features. A 2D convolutional layer is then used to obtain the output feature F_o.
F_o = Conv(Cat(Up(F_h), F̂_l)), where Conv(·) stands for the convolution layer with a 1×1 kernel size and Cat(·) for the concatenation operation. Note that one SAG module is attached to each layer of the Attention guided U-Net.
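The SAG operations above can be sketched in PyTorch. This is a minimal sketch under assumptions: the internal form of the spatial attention block is not specified in the text, so a single sigmoid-gated convolution is used here as a stand-in, and channel sizes are illustrative.

```python
import torch
import torch.nn as nn

class SAG(nn.Module):
    """Spatial attention guidance: the high-level feature F_h is up-convolved,
    turned into a spatial attention map A, used to refine the low-level
    feature F_l pointwise, then the two are concatenated and fused by a
    1x1 convolution to produce the output feature F_o."""
    def __init__(self, high_ch, low_ch, out_ch):
        super().__init__()
        self.up = nn.ConvTranspose2d(high_ch, low_ch, 2, stride=2)
        # Assumed spatial attention block: one-channel map squashed to (0, 1).
        self.att = nn.Sequential(nn.Conv2d(low_ch, 1, 7, padding=3), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * low_ch, out_ch, 1)
    def forward(self, f_h, f_l):
        up = self.up(f_h)        # Up(F_h): up-convolved high-level feature
        a = self.att(up)         # attention map A
        refined = a * f_l        # F̂_l = A ⊗ F_l (pointwise)
        return self.fuse(torch.cat([up, refined], dim=1))   # F_o
```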

Vision Transformer
Figure 3 depicts the Vision Transformer's structure. The image I ∈ ℝ^{h×w×3} is divided into N flattened, uniformly sized, non-overlapping patches x_i ∈ ℝ^{p²·3}. To learn global information, we employ a stack of Transformer blocks made up of multi-head self-attention (MSA) and a multi-layer perceptron (MLP). The MSA layer consists of several parallel self-attention heads and transforms the embedded patches as z'_ℓ = MSA(LN(z_{ℓ−1})) + z_{ℓ−1}, ℓ = 1 ⋯ L. The MLP module then learns global information via z_ℓ = MLP(LN(z'_ℓ)) + z'_ℓ, ℓ = 1 ⋯ L, where LN(·) denotes layer normalisation and z_ℓ ∈ ℝ^{N×d} denotes the encoded semantic representation in d-dimensional space. Besides the encoded features, we model the global information G ∈ ℝ^{h×w×c} by reshaping z_L and applying a 1×1 convolution.
To describe the significance of the pixels in each region, we additionally define the region coefficient R, whose function is to give the Patch boot image module a supervisory signal that helps it select significant regions. The region coefficient R ∈ ℝ^{h×w} is also obtained by reshaping z_L and applying a 1×1 convolution.
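A pre-norm Transformer block matching the residual equations above, plus the reshape-and-1×1-convolution step that yields G and R, can be sketched in PyTorch (dimensions and head counts are illustrative assumptions, and `to_maps` creates fresh 1×1 convolutions purely for demonstration):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Pre-norm Transformer block: z' = MSA(LN(z)) + z; z = MLP(LN(z')) + z'."""
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))
    def forward(self, z):
        h = self.ln1(z)
        z = self.msa(h, h, h, need_weights=False)[0] + z   # MSA residual
        return self.mlp(self.ln2(z)) + z                   # MLP residual

def to_maps(z, grid, dim, out_ch):
    """Reshape the token sequence back into a spatial feature map, then apply
    1x1 convolutions to obtain the global information G and the one-channel
    region coefficient R."""
    b = z.shape[0]
    fmap = z.transpose(1, 2).reshape(b, dim, grid, grid)
    g = nn.Conv2d(dim, out_ch, 1)(fmap)   # global information G
    r = nn.Conv2d(dim, 1, 1)(fmap)        # region coefficient R
    return g, r
```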

Experiments and Analysis
This section introduces the datasets, evaluation metrics, comparison methods, implementation details, and experimental results.

Datasets
The deraining experiments are conducted on the public Rain200L and Rain200H benchmarks. Each dataset contains 1800 synthetic images for training and 200 synthetic images for testing.

Evaluation metrics
Peak Signal-to-Noise Ratio (PSNR) and the Structural Similarity Index (SSIM) are employed as the assessment metrics for the aforementioned benchmarks. In accordance with earlier deraining approaches, we calculate the PSNR and SSIM on the Y channel of the YCbCr space.
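As a concrete reference for the evaluation protocol, the Y-channel conversion (BT.601 coefficients) and PSNR can be computed as below; this is a generic sketch, not the authors' evaluation script. SSIM is computed analogously on the same Y channel, e.g. with `skimage.metrics.structural_similarity`.

```python
import numpy as np

def rgb_to_y(img):
    """Y channel of the YCbCr space (ITU-R BT.601), inputs in [0, 255]."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 16.0 + (65.481 * r + 128.553 * g + 24.966 * b) / 255.0

def psnr(ref, est, peak=255.0):
    """Peak signal-to-noise ratio in dB between two same-sized images."""
    mse = np.mean((ref.astype(np.float64) - est.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)
```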

Comparison methods
We compare the proposed method with DSC [7], GMM [8], and DDN [9]; their results are taken from [10].

Implementation details
The experiments are carried out with the PyTorch framework on a single NVIDIA RTX 3060 GPU. We use the Adam optimizer with a batch size of 4 and a learning rate of 0.0075. Moreover, if the validation accuracy does not increase for 10 successive epochs, a learning-rate scheduler halves the learning rate. The network is trained for 200 epochs. As shown in Table 1, the proposed method outperforms the comparison methods on both PSNR and SSIM. On the Rain200L dataset, our technique exceeds DSC by 5.39 dB in PSNR and 0.0813 in SSIM, and exceeds GMM by 3.89 dB in PSNR and 0.0824 in SSIM. On the Rain200H dataset, it outperforms DSC by 11.39 dB in PSNR and 0.5011 in SSIM, and GMM by 11.62 dB in PSNR and 0.4662 in SSIM. Such a large improvement demonstrates that the proposed strategy significantly enhances image deraining performance.
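The optimizer and scheduler configuration described above corresponds to the following PyTorch setup. The model is a placeholder (an assumption, standing in for the derain network); the hyper-parameters follow the text.

```python
import torch

# Placeholder for the derain network (assumption, for illustration only).
model = torch.nn.Conv2d(3, 3, 3, padding=1)
optimizer = torch.optim.Adam(model.parameters(), lr=0.0075)
# Halve the learning rate if the validation metric stalls for 10 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.5, patience=10)

def run_epochs(val_scores):
    """Feed one validation score per epoch to the scheduler and return
    the resulting learning rate (training step omitted for brevity)."""
    for score in val_scores:
        scheduler.step(score)
    return optimizer.param_groups[0]["lr"]
```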

Figure 1 depicts the overall architecture. Given the rainy image I ∈ ℝ^{h×w×3}, where h × w is the spatial resolution, I is sent to the Attention guided U-Net to obtain the local information F_l. After I is divided into several patches, the global information G and the region coefficient R are obtained through the Vision Transformer. Finally, we use the Patch boot image module to fuse the local information F_l from the U-Net with the global information G and region coefficient R from the Vision Transformer to obtain the derained output O ∈ ℝ^{h×w×3}. The model is trained by minimizing the Wasserstein distance between O and the ground-truth image.
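The paper does not reproduce the exact loss formulation here; as an illustration of the Wasserstein objective, a common tractable case is the empirical 1-D Wasserstein-1 distance between two equal-sized samples (e.g. pixel-intensity distributions), which reduces to sorting both samples and averaging the differences of their order statistics. This is a sketch of that special case, not the authors' loss implementation.

```python
import numpy as np

def wasserstein_1d(p, q):
    """Empirical 1-D Wasserstein-1 distance between two equal-sized samples:
    sort both and average the absolute differences of the order statistics."""
    p, q = np.sort(p.ravel()), np.sort(q.ravel())
    return float(np.mean(np.abs(p - q)))
```

For example, two images with identical intensity histograms have distance 0 even if the pixels are arranged differently, which is what distinguishes a distributional loss from a pixelwise MSE.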

Figure 1. Overall architecture.

Attention guided U-Net. In general, low-level features carry more detailed information, while high-level features carry more semantic information; both are essential for single-image deraining. Most U-Net-based methods directly connect features of different levels, which ignores the characteristics of each level and generates irrelevant features that degrade the network's performance. We therefore propose a spatial attention guidance (SAG) module, shown on the right side of Figure 2. The high-level feature F_h and the low-level feature F_l are passed to the SAG module to produce the cascaded feature F_o. SAG first creates the attention map A for the low-level features by applying an up-convolution to the high-level feature F_h followed by a spatial attention block.


Figure 2. Attention guided U-Net.

To prepare the input for the Transformer encoder, we first learn a 1D position embedding E_pos ∈ ℝ^{N×d}, which is added to the patch embedding to preserve the position information of each patch: z_0 = [x_1 E; x_2 E; ⋯; x_N E] + E_pos, where E ∈ ℝ^{(p²·3)×d} denotes the patch projection.

Patch boot image module

Quantitative result. The quantitative assessment on Rain200L and Rain200H is presented in Table 1.

Figure 6. Visual results on Rain200H.

Concluding Remarks
This paper proposes a dual-branch deep neural network composed of an attention guided U-Net and a Vision Transformer. To ensure that both branches have a positive impact on the target deraining task, we specifically design a patch boot image module for complementary feature fusion, which adaptively selects informative intra-field regions guided by the patch-wise importance map. Furthermore, the Wasserstein distance between the prediction and the reference image is applied as the objective function to pursue a better image quality measure.

Table 1. Quantitative comparison on Rain200L and Rain200H. The top two results are denoted by bold and underline, respectively.

Qualitative result. Figures 5 and 6 display the visual deraining results on Rain200L and Rain200H, respectively. It is evident that our approach produces derained images that are almost consistent with the ground truth.