A study of optimization strategies for the performance of DCGAN

Since Deep Convolutional Generative Adversarial Networks (DCGAN) were proposed, they have been regarded as difficult to train for several reasons. Dozens of optimization strategies have been presented to address this problem, but few of them have been compared with one another. In this paper, the author chose three representative methods, namely one-sided label smoothing, the Two Time-Scale Update Rule (TTUR), and the Earth-Mover Distance (EMD), or Wasserstein-1, and compared their optimization effect on the DCGAN model. Specifically, the three approaches were applied separately while using MNIST and Fashion-MNIST as datasets. One-sided label smoothing is designed to prevent overconfidence in the model by adding a penalty term to the discriminator. TTUR is a simpler update strategy that helps the model converge to a stationary local Nash equilibrium under mild assumptions. EMD is an alternative loss function that lets the model measure the difference between the real and generated distributions even when they do not overlap. Comparative experiments were conducted both vertically and horizontally: the author applied the three methods to the same dataset and the same method to different datasets, in order to compare the time at which model collapse occurred, the trend of the loss curves, and the impact of different datasets on the results. Experimental results indicated that both one-sided label smoothing and TTUR postponed model collapse, while EMD eliminated it completely. Furthermore, generated images may lose texture information when a more complicated dataset is used.


Introduction
Generative Adversarial Networks (GANs), a kind of artificial intelligence algorithm, have contributed greatly to machine learning and achieved excellent results in computer vision. While the majority of generative models are based on direct optimization, GANs are based on game theory and do not rely on Monte Carlo estimation or Markov chains to train the network, which helps them produce better generated samples [1].
GANs have contributed to supervised, semi-supervised, and unsupervised learning alike, thanks to their rapid development into several variants over the past ten years. To control the modes of generated images, Conditional Generative Adversarial Nets (CGAN) condition both the generator and the discriminator on extra information, improving control over the types of images that are generated [2]. Cycle-consistent Adversarial Networks (CycleGAN) were designed as an important approach to image-to-image translation problems, in which paired training data are mostly unavailable, and were shown to outperform other methods [3]. Although GANs have shown great potential in computer vision, producing high-resolution photorealistic images remained difficult. To solve this problem, Stacked Generative Adversarial Networks (StackGAN) adopted a progressive idea and divided the generation of high-resolution images into two stages, bringing the generated images closer to reality at higher resolution [4]. Among these variants, Deep Convolutional Generative Adversarial Networks (DCGAN) are a milestone that narrows the gap between supervised and unsupervised learning in this field. Combining Convolutional Neural Networks (CNNs) and GANs, DCGAN introduced convolutional layers into the generative model for unsupervised training and uses the strong feature-extraction ability of the convolutional network to improve the learning effect of the generative network [5].
In practice, training DCGAN can run into some stubborn problems. The training process is unstable in most cases, and model collapse happens when the generator starts producing similar or even identical outputs; this is common during training and dramatically reduces the diversity of the generated samples. Another problem is non-convergence, caused by inappropriately designed loss functions and diminished gradients: the discriminator becomes so successful that the generator's gradient vanishes and it learns nothing. To address these problems, different methods have been proposed, ranging from feature matching and minibatch discrimination to historical averaging and alternative loss functions. Feature matching prevents the generator from being overtrained by setting a new target for it; more specifically, the generator is trained to match the expected value of features on an intermediate layer of the discriminator, which alleviates the instability to some extent. Minibatch discrimination lets the discriminator identify the similarity between a generated sample and other samples, which helps to avoid model collapse. Historical averaging has also been shown to help, since it enables finding equilibria of low-dimensional games [6]. Moreover, instead of using the Jensen-Shannon (JS) divergence to measure the difference between the real distribution and the generated distribution, alternative measures such as f-divergences and the Earth-Mover distance can be used to improve the model.
Although many approaches exist, the differences between them remain unclear, and there is no standard measure for evaluating their quality. In this regard, this paper makes both horizontal and vertical comparisons of three typical and frequently used methods: one-sided label smoothing [6], the Two Time-Scale Update Rule (TTUR) [7], and the Earth-Mover Distance (EMD), or Wasserstein-1 [8]. Specifically, the author analysed the results of identical methods on different datasets and of different methods on identical datasets.

Dataset description and preprocessing
In this research, the author applied the MNIST and Fashion-MNIST datasets, shown in Figure 1, to three different methods. MNIST is a dataset provided by the National Institute of Standards and Technology (NIST). Its training set consists of 60,000 samples of handwritten digits from 250 different people, half of whom were high school students and half of whom worked for the Census Bureau. The test set has a similar composition of handwritten digits and contains 10,000 samples in total. Every sample is a picture of 28 × 28 pixels, with pixel values between 0 and 255 indicating the lightness or darkness of each pixel. Fashion-MNIST is a dataset of grayscale article images provided by Zalando. Associated with labels from 10 classes, it includes 60,000 training samples and 10,000 test samples. As in MNIST, each image is 28 pixels high and 28 pixels wide, and each pixel value is an integer between 0 and 255. Preprocessing consisted of two parts. First, following the main structure of DCGAN, the author resized all training images from 28 × 28 to 64 × 64 so that the discriminator input matched the DCGAN model as closely as possible [5]. Second, the pixel values were normalized from [0, 255] to [-1.0, 1.0] to speed up convergence of gradient descent and improve precision.
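A minimal sketch of this preprocessing in TensorFlow is given below. The exact pipeline used by the author is not published, so the loader, function name, and shuffle buffer here are illustrative assumptions; only the 28 × 28 to 64 × 64 resize and the [-1, 1] scaling follow the description above.

```python
import tensorflow as tf

def load_and_preprocess(dataset="mnist", batch_size=100):
    # Load the raw 28x28 grayscale images; labels are not needed for unsupervised GAN training.
    loader = tf.keras.datasets.mnist if dataset == "mnist" else tf.keras.datasets.fashion_mnist
    (images, _), _ = loader.load_data()

    def _prepare(img):
        img = tf.cast(img[..., tf.newaxis], tf.float32)   # add the channel dimension
        img = tf.image.resize(img, [64, 64])               # upsample 28x28 -> 64x64 to match the DCGAN input
        return (img - 127.5) / 127.5                       # scale pixels from [0, 255] to [-1, 1]

    return (tf.data.Dataset.from_tensor_slices(images)
            .map(_prepare, num_parallel_calls=tf.data.AUTOTUNE)
            .shuffle(10_000)
            .batch(batch_size))
```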

Proposed approach
In this project, the DCGAN model was taken as a reference and modified slightly. The network resembles traditional GANs in that it has a generator that attempts to create fictitious images to deceive the discriminator, and a discriminator intended to determine whether its input is real; in other words, the two play a minimax game during training. The biggest difference between DCGAN and other variants is that it combines the advantages of convolutional neural networks and GANs. First, the noise z is fed into the generator, passed through a series of five fractionally-strided convolutions, and converted into a 64 × 64 pixel image. In the discriminator, the process is reversed, and the output value indicates the authenticity of the input. Notably, there are no fully connected hidden or pooling layers in this model [5].
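A sketch of a generator in this spirit is shown below, assuming Keras. The layer widths and the exact number of fractionally-strided (transposed) convolutions in the author's model are not specified, so the values here are illustrative and follow the DCGAN reference design (project the noise, then upsample to 64 × 64 with a tanh output matching the [-1, 1] pixel range).

```python
from tensorflow.keras import layers, Sequential

def build_generator(noise_dim=100):
    return Sequential([
        # Project and reshape the noise vector into a small spatial feature map.
        layers.Dense(4 * 4 * 512, use_bias=False, input_shape=(noise_dim,)),
        layers.BatchNormalization(), layers.ReLU(),
        layers.Reshape((4, 4, 512)),
        # Fractionally-strided (transposed) convolutions double the resolution at each step.
        layers.Conv2DTranspose(256, 4, strides=2, padding="same", use_bias=False),  # 8x8
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(128, 4, strides=2, padding="same", use_bias=False),  # 16x16
        layers.BatchNormalization(), layers.ReLU(),
        layers.Conv2DTranspose(64, 4, strides=2, padding="same", use_bias=False),   # 32x32
        layers.BatchNormalization(), layers.ReLU(),
        # Final layer outputs a 64x64x1 image; tanh matches the [-1, 1] pixel range used above.
        layers.Conv2DTranspose(1, 4, strides=2, padding="same", activation="tanh"),  # 64x64
    ])
```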
Three different optimization strategies were implemented to improve on this baseline, as described below.
One-sided label smoothing.
Deep neural networks may suffer from overconfidence. In classification, they tend to produce high-confidence outputs for the predicted class, which sometimes turn out to be extreme values. This problem is particularly serious in adversarial networks: if the discriminator depends on only a few features instead of the whole sample, the generator may learn to produce only those features to fool the discriminator and gain higher accuracy. To solve this problem, the author added a penalty term to the discriminator whenever its prediction for a real image exceeded a threshold. Label smoothing can reduce vulnerability to adversarial samples; more specifically, it helps the discriminator learn to resist the generator's attacks more effectively [6].
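A common way to realise one-sided label smoothing is to soften only the real-sample targets in the discriminator loss, as in the sketch below; whether the author's penalty term took exactly this form is not stated, so this is an assumed implementation.

```python
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(real_logits, fake_logits, smooth=0.9):
    # One-sided label smoothing: targets for real samples are softened to `smooth`
    # (e.g. 0.9) while targets for generated samples stay at exactly 0. This
    # discourages extremely confident predictions on real images without rewarding
    # confident mistakes on generated ones.
    real_loss = bce(smooth * tf.ones_like(real_logits), real_logits)
    fake_loss = bce(tf.zeros_like(fake_logits), fake_logits)
    return real_loss + fake_loss
```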

Two Time-Scale Update Rule.
Since training GANs is a zero-sum, non-cooperative game, the real goal is to find a Nash equilibrium, but gradient descent may fail to converge to one. The Two Time-Scale Update Rule, however, has been proved to converge to a stationary local Nash equilibrium under mild assumptions.
When optimizing the generator, it is assumed that the discriminator's discrimination ability is better than the current generator's generation ability, so that the discriminator can guide the generator's learning. The usual practice is to update the discriminator's parameters several times and then update the generator's parameters once. TTUR proposes a simpler update strategy: setting different learning rates for the discriminator and the generator so that the discriminator converges faster. In other words, the learning rate of the discriminator is generally set higher than that of the generator [7].
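In code, TTUR reduces to creating two optimizers with different learning rates and updating each network once per step, as in the sketch below. The rates mirror the ones reported later in this paper (0.0004 for the discriminator, 0.0001 for the generator); the beta_1 = 0.5 setting is an assumed DCGAN-style value, not one stated by the author.

```python
import tensorflow as tf

# TTUR in its simplest form: two separate optimizers whose learning rates differ,
# with the discriminator's rate set higher so it tracks the generator faster.
# Each network is then updated once per training iteration.
disc_optimizer = tf.keras.optimizers.Adam(learning_rate=4e-4, beta_1=0.5)
gen_optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4, beta_1=0.5)
```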

Earth-Mover Distance.
In DCGAN, the Jensen-Shannon (JS) divergence is used to measure the difference between the real distribution and the generated distribution. In most cases, however, these two distributions do not overlap, because both are low-dimensional manifolds in a high-dimensional space and their overlap is negligible. Moreover, since the data are only samples drawn from these distributions, it is very difficult to find any overlap between the samples even when the real and generated distributions do overlap.
If the two distributions do not overlap, the JS divergence equals log 2 [8], which means the binary classifier achieves 100% accuracy and no longer reflects how far apart the distributions are. Eventually, the gradient of the generator is approximately 0 and the gradient vanishes.
Compared with the JS divergence, EMD has the advantage that even if the two distributions do not overlap, it still reflects their distance: the KL and JS divergences change abruptly, whereas EMD changes smoothly [9].
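For reference, since the formula is not reproduced in this section, the Earth-Mover (Wasserstein-1) distance between the real distribution P_r and the generated distribution P_g is defined in [8] as

```latex
W(P_r, P_g) = \inf_{\gamma \in \Pi(P_r, P_g)} \mathbb{E}_{(x, y) \sim \gamma}\big[\, \lVert x - y \rVert \,\big]
```

where Pi(P_r, P_g) denotes the set of all joint distributions whose marginals are P_r and P_g. Intuitively, it is the minimum cost of transporting probability mass from the generated distribution onto the real one, which is why it varies smoothly even when the two distributions are disjoint.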

Implementation details
The neural network was implemented using TensorFlow, and the batch size was set to 100 to reduce running time. When adopting the different strategies, the optimizer was Adam for both one-sided label smoothing and TTUR, but RMSprop for EMD.
As for the cost function, the author used the JS-divergence-based loss in Eq. (1) when applying one-sided label smoothing and TTUR, and Eq. (2) when implementing EMD. While reducing this cost, the first term in Eq. (2) is expected to become larger and the last term smaller. Without any constraint, training the discriminator might not converge; therefore, the discriminator is required to satisfy the 1-Lipschitz constraint, and in this paper the author used weight clipping to force the parameters w to stay within a certain range [9].
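A minimal sketch of the Eq. (2)-style critic objective together with weight clipping is given below. The clipping threshold is an assumption (the author does not report the value used), and the function names are illustrative.

```python
import tensorflow as tf

CLIP_VALUE = 0.01  # assumed clipping range; the exact value used by the author is not reported

def critic_loss(real_scores, fake_scores):
    # The mean critic score on real samples should grow while the mean score on
    # generated samples shrinks; minimising this difference of means approximates
    # the Earth-Mover distance between the two distributions.
    return tf.reduce_mean(fake_scores) - tf.reduce_mean(real_scores)

def clip_critic_weights(critic):
    # Weight clipping: after every critic update, force each trainable parameter
    # into [-CLIP_VALUE, CLIP_VALUE] so the critic stays (approximately) 1-Lipschitz.
    for w in critic.trainable_variables:
        w.assign(tf.clip_by_value(w, -CLIP_VALUE, CLIP_VALUE))
```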
Moreover, the author set different learning rates for the three methods. When using label smoothing to alleviate model collapse, the learning rates of the discriminator and the generator were both 0.0002. When using TTUR, the learning rate of the discriminator was 0.0004, four times the learning rate of the generator (0.0001). As for EMD, the author tried two learning rates, 0.0001 and 0.0004, to see whether they influenced the result. To control this variable, the number of epochs was fixed at 50, which was enough to show how the loss changed as the hyper-parameters were varied.

Performance of different optimization strategies on the MNIST dataset
This section records the performance of the three methods. The earliest epoch at which model collapse appears is shown in Table 1, together with the epoch and loss of the best performance. Figures 2 to 6 show results at the epochs corresponding to those in Table 1, as well as line graphs of the generator and discriminator losses.
After trying the different optimization strategies with different parameters, it is clear that without any strategy, model collapse happens earliest and the loss of the generator is more unstable than with most of the other methods. Among one-sided label smoothing, TTUR, and the combination of the two with the same cost function, the model with one-sided label smoothing performs better and produces smaller fluctuations in Figure 5(b) than the others. It is also noted that the fluctuation amplitude in Figure 6(a) gradually decreases as training progresses. When applying EMD, there is no model collapse over all epochs because it addresses the cause of this issue. Comparing the line graphs in Figures 6(b) and 6(c), the discriminator's loss of the model with the smaller learning rate changes more smoothly and the image quality is higher. To further improve image quality, adding a gradient penalty instead of using weight clipping is worth considering [10]. Moreover, EMD needs more training time than one-sided label smoothing or TTUR to produce its best images, perhaps because EMD is a more complex way to compute the distance between the real and generated distributions.
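For completeness, the gradient-penalty alternative mentioned above can be sketched as follows. This is the WGAN-GP-style penalty from [10], not something the author implemented in this paper; the penalty weight of 10 and the function name are assumptions taken from that line of work.

```python
import tensorflow as tf

def gradient_penalty(critic, real_images, fake_images, gp_weight=10.0):
    # Instead of clipping weights, penalise the critic when the norm of its gradient
    # at points interpolated between real and generated samples deviates from 1.
    batch_size = tf.shape(real_images)[0]
    alpha = tf.random.uniform([batch_size, 1, 1, 1], 0.0, 1.0)
    interpolated = alpha * real_images + (1.0 - alpha) * fake_images

    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        scores = critic(interpolated, training=True)
    grads = tape.gradient(scores, interpolated)
    norm = tf.sqrt(tf.reduce_sum(tf.square(grads), axis=[1, 2, 3]) + 1e-12)
    return gp_weight * tf.reduce_mean(tf.square(norm - 1.0))
```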

Performance of different optimization strategies on the Fashion-MNIST dataset
Table 2 and Figures 7 to 11 present the performance of the different optimization strategies on the Fashion-MNIST dataset. After switching to this more complicated dataset, one-sided label smoothing and the combination of one-sided label smoothing with TTUR can still mitigate model collapse to a certain extent, whereas the loss of the generator fluctuates drastically before any method is applied. Comparing these two methods, the model that uses the combination of one-sided label smoothing and TTUR appears to generate pictures of higher quality. As for EMD, there is still no model collapse during training. In this case, the discriminator's loss is smoother with the larger learning rate (0.0004), but the model with the lower learning rate (0.0001) produces better images with clearer contours. As with MNIST, the EMD model needs more training epochs to produce good images than the models using one-sided label smoothing or TTUR. In Table 3, the author compares the same methods across the two datasets. When the dataset becomes more complicated, model collapse is postponed no matter which approach is adopted. In addition, nearly all models trained on Fashion-MNIST require more training time to generate high-quality images, while the models trained on MNIST need fewer than 40 epochs. A possible reason is that Fashion-MNIST consists of article images with varied appearance and large differences in size, such as shoes and dresses, which makes training more difficult. Figures 12 to 14 compare samples produced by the same methods on the two datasets; the first row of pictures shows samples taken at epoch 10, the second row at epoch 20, and so on. Overall, performance is better on MNIST, since most of the pictures on the right-hand side lose their texture information. This may be because, as images become more complicated, it is harder for the model to extract their details, so it can only form the outline.

Conclusion
In this paper, the author compares the ability of three optimization strategies, one-sided label smoothing, TTUR, and EMD with different parameters, to solve common problems such as model collapse, non-convergence, and instability in DCGAN training. Each method is applied to two different datasets, MNIST and Fashion-MNIST. The experimental results demonstrate that one-sided label smoothing can mitigate the fluctuations of the generator's loss, and that both one-sided label smoothing and TTUR can defer model collapse, with the former being more effective. Meanwhile, EMD solves this problem completely, with no model collapse during training. As the complexity of the dataset increases, the generated images may lose some details compared with the samples from the dataset, whichever approach is adopted. As for future work, the author plans to add a regularization term (a gradient penalty) or use spectral normalization when implementing EMD to improve picture quality. Besides, more optimization strategies and datasets should be tested so that more experimental data are available for analysis and comparison.

Figure 1. Visualization of the first 20 samples in the MNIST training dataset and the Fashion-MNIST dataset.

Figure 2. (a) The original performance; (b) the performance of using one-sided label smoothing.

Figure 3. (a) The performance of using one-sided label smoothing and TTUR; (b) the performance of using TTUR.

Figure 4. (a) The performance of using EMD with a learning rate of 0.0001; (b) the performance of using EMD with a learning rate of 0.0004.

Figure 5. (a) Loss of the original model; (b) loss when using one-sided label smoothing; (c) loss when using one-sided label smoothing and TTUR.

Figure 6. (a) Loss when using TTUR; (b) loss when using EMD with a learning rate of 0.0001; (c) loss when using EMD with a learning rate of 0.0004.

Figure 7. (a) The original performance; (b) the performance of using one-sided label smoothing.

Figure 8. (a) The performance of using one-sided label smoothing and TTUR; (b) the performance of using TTUR.

Figure 9. (a) The performance of using EMD with a learning rate of 0.0001; (b) the performance of using EMD with a learning rate of 0.0004.

Figure 10. (a) Loss of the original model; (b) loss when using one-sided label smoothing; (c) loss when using one-sided label smoothing and TTUR.

Figure 11. (a) Loss when using TTUR; (b) loss when using EMD with a learning rate of 0.0001; (c) loss when using EMD with a learning rate of 0.0004.

Figure 12. (a) Comparison of using one-sided label smoothing; (b) comparison of using one-sided label smoothing and TTUR.

Figure 13. (a) Comparison of using TTUR; (b) comparison of using EMD with a learning rate of 0.0001.

Figure 14. Comparison of using EMD with a learning rate of 0.0004.

Table 1. The performance of using different strategies with different parameters on the MNIST dataset.

Table 2. The performance of using different strategies with different parameters on the Fashion-MNIST dataset.

Table 3. Comparison of the same strategies with different parameters on the MNIST and Fashion-MNIST datasets.