Image Inpainting Using Double Discriminator Generative Adversarial Networks

Image inpainting algorithms based on GANs suffer from problems such as color inconsistency between the inpainted region and the surrounding area, as well as mode collapse during training. This paper proposes an inpainting algorithm based on double-discriminator generative adversarial networks (GANs). The algorithm introduces the WGAN-GP loss function into a global discriminator and a local discriminator independently, and combines the mean-square error (MSE) and a feature reconstruction loss to train the inpainting model and repair the missing area. It replaces standard convolutions with dilated convolutions to obtain a larger receptive field, and uses skip connections to enhance the structural prediction ability of the generator. Experimental results on the Labeled Faces in the Wild dataset show that our algorithm improves both the quality of image inpainting and the stability of the training process.


Introduction
With the rapid development of the Internet and the popularization of smartphones, images and videos play an increasingly important role in information transmission, and their quality has become a topic of great concern. As a result, image inpainting has attracted more and more attention from researchers. Image inpainting refers to repairing the missing area of an image or video using the known information, so that ultimately the observer cannot distinguish whether the image has been inpainted or not [1].
Traditional image inpainting methods can be divided into two categories: structure-based and texture-based. Structure-based algorithms are suited to repairing small missing areas; a representative example is the Bertalmio-Sapiro-Caselles-Ballester (BSCB) model proposed by Bertalmio et al. [2]. Texture-based algorithms are better at repairing large missing areas, for example the sample-block texture synthesis algorithm proposed by Criminisi et al. [3].
In recent years, with the rapid development of deep learning, convolutional neural networks (CNNs) have been widely used in many fields owing to their ability to extract deep features, and deep-learning-based image inpainting has gradually become the mainstream line of research. However, CNN-based image inpainting suffers from blurred results and inconsistent context semantics. The inpainting algorithm proposed in this paper is based on double-discriminator GANs. We aim to eliminate inconsistencies between the repaired area and its surroundings and to address the instability of GAN training. The main contributions of this paper are as follows: 1. The image inpainting algorithm is based on double-discriminator GANs together with the WGAN-GP loss function, which improves training stability and prevents mode collapse.
2. Dilated convolutions are used in place of standard convolutions to enlarge the receptive field without changing the number of parameters, and skip connections fuse low-level and high-level feature information to improve the prediction accuracy of the generator.
3. A feature reconstruction loss is added to the MSE loss to form a structural loss, which strengthens the structural information in the missing area.

Generative Adversarial Networks
In 2014, Goodfellow et al. proposed Generative Adversarial Networks (GANs), which consist of a generator and a discriminator [4]. The GAN model is shown in Figure 1. Arjovsky et al. proposed introducing the Wasserstein distance into the GAN model [5], and Gulrajani et al. proposed WGAN-GP, an improved version of WGAN [6].

Image inpainting based on GAN
As shown in Figure 2, an encoder-decoder is used as the generator: the encoder extracts deep features from the image to be repaired, and the decoder restores the extracted features into the repaired image.

The structure of the network
In this paper, a double-discriminator GAN is used as the image inpainting model. The overall network architecture is shown in Figure 3.

Generator network
The generator uses an encoder-decoder based on the fully convolutional network (FCN). Five layers of dilated convolutions are used in the encoder, which enlarge the receptive field and improve the authenticity of the output pixels while keeping the number of parameters unchanged. In addition, skip connections are introduced into the generator network; they merge low-level and high-level features, improve the structural prediction ability of the generator, and accelerate the convergence of the network.
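The gain from dilation can be checked with a short calculation. The sketch below computes the receptive field of a stack of stride-1 convolutions; the exponential dilation schedule 1, 2, 4, 8, 16 is an assumption for illustration, since the paper does not list the exact rates.

```python
def receptive_field(kernel_sizes, dilations, strides=None):
    """Receptive field (in input pixels) of a stack of conv layers."""
    if strides is None:
        strides = [1] * len(kernel_sizes)
    rf, jump = 1, 1
    for k, d, s in zip(kernel_sizes, dilations, strides):
        rf += d * (k - 1) * jump  # each layer widens the field by its dilated span
        jump *= s                 # stride compounds across layers
    return rf

# Five 3x3 layers: dilated vs. standard convolution
dilated = receptive_field([3] * 5, [1, 2, 4, 8, 16])
standard = receptive_field([3] * 5, [1] * 5)
print(dilated, standard)  # 63 11
```

With the same parameter count, the assumed dilated stack sees a 63-pixel context where plain convolutions see only 11, which is why dilation helps fill larger holes.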
The specific parameters of the generator network are shown in Figure 4, where Conv denotes standard convolution, Dconv dilated convolution, and Deconv deconvolution. The output layer uses Tanh as its activation function; all other layers use LeakyReLU.
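As a quick illustration, the two activations can be written in NumPy. The negative slope 0.2 for LeakyReLU is a common default and an assumption here; the paper does not state the value.

```python
import numpy as np

def leaky_relu(x, alpha=0.2):
    # pass positive values through unchanged, scale negatives by a small slope
    return np.where(x > 0, x, alpha * x)

x = np.array([-2.0, 0.5])
print(leaky_relu(x))  # [-0.4  0.5]
print(np.tanh(x))     # output layer squashes values into (-1, 1)
```

Tanh in the last layer keeps the generated pixels in a bounded range, matching images normalized to [-1, 1].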

Discriminator network
Both the global and local discriminator networks use convolutions to extract features from the input image. Figure 5 (a, b, c) shows the specific parameters of the local discriminator, the global discriminator, and the fully connected layer, respectively. Both discriminators use LeakyReLU as the activation function.

Loss function
The loss function used in this paper consists of three parts: the MSE loss between the repaired image and the real image, the WGAN-GP loss, and the feature reconstruction loss.
The MSE loss used in this paper is defined as follows:

L_MSE = ‖(1 − M) ⊙ (G(x̄) − x)‖₂²

where M is a binary mask with 0 representing the missing position, x is the real image, x̄ = x ⊙ M is the image to be repaired, and G(x̄) is the generator output. The WGAN-GP loss is defined as follows:

L_WGAN-GP = E[D(x̃)] − E[D(x)] + λ E[(‖∇ᵤD(u)‖₂ − 1)²]

where x̃ is the generated image, u is sampled uniformly along straight lines between pairs of real and generated samples, and λ is the gradient penalty coefficient. The feature reconstruction loss is defined as follows:

L_feat = ‖C(x̂) − C(x)‖₂²

where C is a feature extractor using a pre-trained VGG-16 network and x̂ is the repaired image. The feature reconstruction loss and the MSE loss form the structural loss, which is used to train the generator network separately:

L_s = L_MSE + L_feat

The structural loss and the WGAN-GP loss form a joint loss, which is used to train the generator network:

L = L_s + α L_adv

where L_adv is the generator's adversarial term and α is a hyperparameter, set to 4e-4 in this paper.

Algorithm flow
This paper uses the Adam optimizer to train the generator network with a learning rate of 2e-4, and the RMSProp optimizer with momentum 0.5 to train the discriminator networks.
An overview of the training process is given in Algorithm 1.
Algorithm 1. Training of the inpainting network
1: for each training epoch do
2:   Sample a minibatch of images x from the training set;
3:   Generate binary masks M, with 0 representing the missing area, for each image in x;
4:   x̄ = x ⊙ M, where x̄ is the image to be repaired; feed x̄ into the generator G to obtain the generated image x̃;
5:   x̂ = x̃ ⊙ (1 − M) + x ⊙ M, where x̂ is the repaired image;
6:   if epoch < 20 then
7:     Calculate the structural loss to train and update the generator G;
8:   else
9:     Randomly generate a mask that marks the local area of each image in x;
10:    Train the global and local discriminators with the WGAN-GP loss, and calculate the joint loss to train and update the generator G;
11:  end if
12: end for
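The compositing step of the training loop, which keeps the known pixels from the real image and takes only the hole region from the generator output, can be sketched as follows; the rectangular-hole helper is hypothetical, since the paper does not specify the mask shape.

```python
import numpy as np

def make_hole_mask(h, w, top, left, hh, ww):
    # binary mask: 1 = known pixel, 0 = missing (hypothetical rectangular hole)
    m = np.ones((h, w), dtype=np.float32)
    m[top:top + hh, left:left + ww] = 0.0
    return m

def composite(generated, real, mask):
    # repaired image: generator output inside the hole, real pixels elsewhere
    return generated * (1.0 - mask) + real * mask

mask = make_hole_mask(4, 4, top=1, left=1, hh=2, ww=2)
real = np.full((4, 4), 5.0)
gen = np.full((4, 4), 2.0)
repaired = composite(gen, real, mask)
```

Because the known pixels are copied straight from the input, any remaining error is confined to the hole, which is what the masked losses measure.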

Experimental details
The experiments use the LFW dataset, which consists of 13,233 face images of size 128×128×3 [10]. The experimental platform is Ubuntu 16.04.1 with an NVIDIA 2080Ti graphics card, using Python 3.6.8 and the deep learning framework TensorFlow-GPU 1.12.0.

Analysis of experimental results
The comparison between the generator trained with the MSE loss alone and the generator trained with the structural loss is shown in Figure 6. Adding the feature reconstruction loss greatly enhances the generator's ability to produce structure.

Figure 6. Effects of two loss training generators.
To analyze the repair ability of our algorithm, two representative GAN-based algorithms from recent years (references [7, 8]) were selected and compared under the same conditions. The repaired results are shown in Figure 7. In terms of visual quality, the algorithm in this paper performs better when a medium-sized area is missing. The algorithm in reference [7] has no global discriminator and uses the original GAN loss function, which leads to distortion and blurring in the repaired area. Compared with reference [7], the algorithm in reference [8] produces better-quality images, but its repaired area still contains blurry parts, owing to its choice of loss function and some shortcomings in the details of its network model. Our algorithm uses global and local discriminators to ensure that the repaired image remains semantically and structurally consistent both globally and locally, and the skip connections and feature reconstruction loss enhance the generator's structural prediction ability. The quantitative comparison is given in Table 1. SSIM is a full-reference image quality metric that measures image similarity in terms of brightness, contrast, and structure. PSNR is based on the error between corresponding pixels; the higher the PSNR, the closer the image is to the original. Combining these two indicators, for images with a medium-sized missing area, the repair quality of the proposed algorithm is higher than that of the other two algorithms.
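PSNR is straightforward to compute. The sketch below assumes 8-bit images (peak value 255), which matches common practice but is not stated in the paper.

```python
import numpy as np

def psnr(reference, repaired, max_val=255.0):
    # peak signal-to-noise ratio in dB; higher means closer to the reference
    mse = np.mean((reference.astype(np.float64) - repaired.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.zeros((8, 8))
b = a + 1.0  # off by one gray level everywhere -> MSE = 1
print(round(psnr(a, b), 2))  # 48.13
```

SSIM is more involved (local means, variances, and covariances over sliding windows); in practice a library implementation such as scikit-image's is typically used rather than hand-rolling it.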

Conclusion
In this paper, an image inpainting algorithm based on double-discriminator GANs is proposed. The WGAN-GP loss makes the training process more stable and solves the problem of mode collapse. The skip connections enhance the structural prediction ability of the generator, and the feature reconstruction loss makes the details of the repaired results more realistic. The experimental results show that our algorithm is effective for repairing small and medium missing regions. However, it still has problems in processing complex details, such as image blurring, which will be the focus of future research.