A comprehensive study and analysis of StarGAN and StarGAN-v2 for the task of image generation

With the continued development of deep learning and the artificial intelligence industry, more and more neural networks have been proposed. One widely used example is StarGAN, which removes the need for multiple generators and discriminators when translating images across multiple domains and styles. In this article, we analyze and explore StarGAN v1 and StarGAN v2, including StarGAN's development history and its advantages and disadvantages in application. We also compare StarGAN with other neural networks such as CycleGAN and MSGAN under common evaluation criteria, and demonstrate, through quantitative and qualitative analysis, how the different networks affect the generation of image data sets. In addition, we found a representative open-source implementation that uses StarGAN for hair replacement. Because this code appears to confuse the use of StarGAN v1 and v2, we also improve the code and put forward suggestions for changes. Finally, we analyze StarGAN's loss functions based on the existing literature and offer some relevant insights. The data we gathered on StarGAN show that it can achieve clearer and more varied results in image generation tasks while using as few resources as possible.


Introduction
The generative adversarial network (GAN) is a machine learning framework designed by Ian Goodfellow in June 2014 [1]. After years of development in the field of machine learning, the GAN has become a type of neural network widely used for image generation. A generative adversarial network consists of two parts: a generator and a discriminator. The generator takes an image and a target-domain label and produces a "fake" image, while the discriminator learns to distinguish real images from generated ones and to classify images into their domains. More precisely, given a data set X and a set of labels Y, the generator models the joint probability, or simply p(X) when no labels Y are available, while the discriminator models the conditional probability p(Y|X). In other words, the generator captures the distribution of the data itself and tells us how probable a given sample is, whereas the discriminator estimates the probability that a label applies to a data instance [2].
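The adversarial interplay described above can be sketched numerically. The toy example below (an illustration, not any published implementation) computes the binary cross-entropy losses that pull the discriminator and generator in opposite directions, assuming the discriminator outputs probabilities in (0, 1):

```python
import numpy as np

def bce(p, target):
    """Binary cross-entropy of probabilities p against a 0/1 target."""
    p = np.clip(p, 1e-7, 1 - 1e-7)           # avoid log(0)
    return -np.mean(target * np.log(p) + (1 - target) * np.log(1 - p))

def discriminator_loss(d_real, d_fake):
    # D wants real samples scored 1 and generated samples scored 0.
    return bce(d_real, 1.0) + bce(d_fake, 0.0)

def generator_loss(d_fake):
    # G wants the discriminator to score its outputs as real (1).
    return bce(d_fake, 1.0)

# Toy discriminator scores: confident on real images, low on fakes.
d_real = np.array([0.9, 0.8, 0.95])
d_fake = np.array([0.3, 0.2, 0.4])
print(discriminator_loss(d_real, d_fake))  # small: D is doing well
print(generator_loss(d_fake))              # large: G must improve
```

During training these two losses are minimized alternately, which is the "cheating" dynamic the text describes.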
With the research on GANs in recent years, many variants have appeared. Among the most common are the progressive GAN, the conditional GAN, image-to-image GANs, and StarGAN. The progressive GAN's generator produces a low-resolution image in its first layer, and subsequent layers add detail [3]. The conditional GAN trains on a data set with given labels [4], while an image-to-image GAN takes an image as input and maps it to a generated output image with different properties [5]. In this article, we focus on StarGAN, a GAN variant with a unique structure and far wider application scenarios.
As a variant of the GAN, StarGAN is more powerful and efficient [6]. As the name suggests, StarGAN is a network with a star-shaped structure. Compared with a traditional GAN, StarGAN can use a single model to perform translation across multiple domains, improving the robustness and scalability of image-domain translation.
Since there are different variants of the GAN, there are also different variants of StarGAN. The original StarGAN (also known as v1) implements translation across multiple domains, but it lacks diversity of styles within each domain. StarGAN v2 was introduced to solve this problem [7]. The underlying principle is still that the generator and discriminator "cheat" each other; that is, the generator constantly produces "fake" images to train the discriminator. StarGAN was created to spare users from retraining with constantly changing styles when multiple styles are required, and it can transfer facial attributes and some facial expressions in a composite with better results than the baseline models. Based on these features of StarGAN v1 and StarGAN v2, this study examines a representative implementation of the two models [8]. In this implementation, the author may have mistakenly applied concepts from StarGAN v1 to StarGAN v2, leading to some deviation when the code actually runs, or the author may have intended to use a different StarGAN variant. Even with these problems, this implementation is a very representative example of StarGAN code. In this article, we use it as the research background for discussing the differences between the versions of StarGAN. We describe the characteristics and practical application scenarios of each version in detail in the following sections and make some modifications to the implementation to solve the problems we encountered when running the code.

Application
Different variants of the generative adversarial network have different application scenarios. The first is the traditional GAN. Since a GAN is composed of a generator and a discriminator, it has a great advantage in generating data sets, such as image data sets. In their 2014 research, Ian Goodfellow's team used a GAN to generate image data sets of handwritten digits for MNIST.
There have also been new iterations of the GAN after recent developments, such as StarGAN. As an important branch of the GAN family, StarGAN offers more powerful functionality and more advantages than the traditional GAN. As mentioned in the introduction, StarGAN is a network with a star-shaped structure, which makes it more efficient and easier to use when dealing with complex situations. The most common application scenario for StarGAN v1 is changing facial attributes [6].
We found an implementation that uses StarGAN to perform hair replacement. First, a brief introduction to this code. Its technical core is to take the StarGAN variant of the GAN, learn different hairstyles, and then generate new hairstyles based on what was learned. The authors also adopted SEAN, an image-segmentation-map model based on the generative adversarial network, to handle changes to other features [8]; SEAN, however, will not be discussed in this paper. In practical use, StarGAN v1 can therefore serve as the technical basis for implementation when multiple styles and domains are not needed.
As mentioned in the introduction, this code confuses StarGAN v1 with StarGAN v2: it uses v1 in the StarGAN section instead of v2. However, the goal of the project is to support a variety of styles, in other words a variety of styles and domains. StarGAN v1 is not sufficient for this, so in this application scenario StarGAN v2 should be used to handle translation between the various styles and domains [7]. The specific implementation methods are discussed in the method section. When training the model, the loss function requires a label vector consisting of all labels from all data sets [6]; but since there are different data sets, each generating only its own labels, this becomes a problem when building the grand pool of features. To solve it, the authors introduced the mask vector, which compensates for the labels missing when different data sets are combined.

Method
The unified label can be written as c̃ = [c1, ..., cn, m], where [·] refers to concatenation, m is the mask vector, and ci represents the label vector of the i-th dataset. Each known label ci can be represented either as a binary vector for binary attributes or as a one-hot vector for categorical attributes. In addition to this special vector, the discriminator is specifically designed so that it can minimize the classification error separately with respect to each input data set.
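Assuming the unified-label layout above (per-dataset label vectors concatenated with a one-hot mask vector, zeroing out labels for datasets the sample does not come from), the construction can be sketched as follows; the attribute counts and the helper name `unified_label` are illustrative, not from the original implementation:

```python
import numpy as np

def unified_label(labels, active_idx, n_datasets):
    """Build c~ = [c1, ..., cn, m]: per-dataset label vectors plus a
    one-hot mask vector m marking which dataset the sample comes from.
    Labels of all other datasets are zeroed out (they are unknown)."""
    parts = [c if i == active_idx else np.zeros_like(c)
             for i, c in enumerate(labels)]
    mask = np.zeros(n_datasets)
    mask[active_idx] = 1.0                     # mask vector m
    return np.concatenate(parts + [mask])

# Example: 5 CelebA-style binary attributes and 8 RaFD-style one-hot
# expression classes; the current sample comes from RaFD (index 1).
celeba = np.array([1, 0, 1, 0, 0])             # unknown for this sample
rafd   = np.eye(8)[2]                          # e.g. "happy"
c_tilde = unified_label([celeba, rafd], active_idx=1, n_datasets=2)
print(c_tilde.shape)  # (15,) = 5 + 8 + 2
```

The zeroed CelebA block plus the mask is what lets one discriminator minimize the classification error separately per dataset.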

StarGAN v2
This version of StarGAN differs from the first: although their applications are similar, the loss functions and the implementations are not quite the same. StarGAN v2 is composed of a generator, a mapping network, a style encoder, and a discriminator. Compared with the original version, the authors of StarGAN improved the aggregate performance when handling a group of combined features. Taking StarGAN as the baseline model, they changed several parts to improve performance; from Table 1 we can see that performance is largely improved by injecting latent codes and by using the style encoder.
The loss function of StarGAN v2 is more complex than that of the previous version. Besides the regular adversarial loss, there is a style reconstruction loss, which prevents the output from collapsing toward one or a few feature domains. The style diversification loss is a further extension of this idea: it also preserves the diversity of the output, but whereas the style reconstruction loss mainly focuses on conveying the style of the reference image, the diversification loss forces the generator to explore all dimensions of the given domains [7].
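The two style losses can be sketched as L1 distances; the generator `G` and style encoder `E` below are trivial stand-ins so the sketch runs (the real model uses neural networks), and the function names are our own:

```python
import numpy as np

def l1(a, b):
    return np.mean(np.abs(a - b))

def style_reconstruction_loss(generator, style_encoder, x, s):
    # The style encoder should recover the style code s from the
    # image that was generated with that style.
    return l1(s, style_encoder(generator(x, s)))

def style_diversification_loss(generator, x, s1, s2):
    # Maximised during training (entering the full objective with a
    # negative sign): different style codes should give different images.
    return l1(generator(x, s1), generator(x, s2))

# Dummy stand-ins: a "generator" that tints the image by the mean of
# its style code, and an "encoder" that reads that tint back.
G = lambda x, s: x + s.mean()
E = lambda img: np.full(4, img.mean())

x  = np.zeros((8, 8))
s1 = np.full(4, 0.5)
s2 = np.full(4, -0.5)
print(style_reconstruction_loss(G, E, x, s1))    # 0.0: style recovered
print(style_diversification_loss(G, x, s1, s2))  # 1.0: styles differ
```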

StarGAN v1
According to the findings of Yunjey et al. on StarGAN, they used two data sets, CelebA and RaFD, as the basis of their research. CelebA has 40 labels, such as hair color and gender, and RaFD has 8 labels, such as happy and angry. They used DIAT, CycleGAN, and IcGAN as control groups, and the results are shown in Figure 3. As the figure shows, the images generated by StarGAN not only have the most natural expressions but also retain the facial features of the input data (such as gender), while the images generated by the control groups are blurry, and IcGAN even generates male features, failing to preserve the personal identity of the image.
In the quantitative analysis, StarGAN also performed better than the control groups, as shown in Table 1 below.

StarGAN v2
Yunjey's team also performed qualitative and quantitative analyses of StarGAN v2, again using CelebA as the data set [7]. As mentioned above, what distinguishes StarGAN v2 is that it can perform multi-style image translation; in other words, it gets rid of the "constraint" of labels. The figure above allows a clear comparison between the outputs StarGAN v2 generates from the raw input and those of other GANs: StarGAN v2 generates images with more styles, that is, more variety, and with greater clarity, while retaining the features of the original data.
Comparison with StarGAN v1 also shows that v2 is not limited to changing the style of a single attribute.
For the quantitative analysis, the authors cited FID and LPIPS as evaluation criteria, on which StarGAN v2 also performs well. A smaller FID means the distribution of generated images is closer to that of the real images, that is, better quality [9]. A higher LPIPS means the generated pictures are perceptually more different from one another, that is, greater diversity [10].
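FID is the Fréchet distance between Gaussian fits to real and generated feature sets: FID = ||μr − μg||² + Tr(Σr + Σg − 2(ΣrΣg)^(1/2)). The sketch below illustrates its lower-is-better behaviour under a simplifying assumption of diagonal covariances (real FID uses full covariance matrices of Inception-network features):

```python
import numpy as np

def fid_diagonal(feats_real, feats_gen):
    """Frechet distance between Gaussian fits to two feature sets,
    simplified to diagonal covariances (illustrative only)."""
    mu_r, mu_g = feats_real.mean(0), feats_gen.mean(0)
    var_r, var_g = feats_real.var(0), feats_gen.var(0)
    mean_term = np.sum((mu_r - mu_g) ** 2)
    cov_term = np.sum(var_r + var_g - 2 * np.sqrt(var_r * var_g))
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(5000, 16))
good = rng.normal(0.0, 1.0, size=(5000, 16))   # matches the real distribution
bad  = rng.normal(2.0, 0.5, size=(5000, 16))   # shifted and collapsed
print(fid_diagonal(real, good))  # close to 0
print(fid_diagonal(real, bad))   # much larger
```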
It can be seen from the table above that StarGAN v2 performs the best on both FID and LPIPS among the compared methods. Comparing StarGAN v1 and StarGAN v2, StarGAN v2 has a low error rate and high image variability compared with other GANs, including DIAT, CycleGAN, and IcGAN. In real life, StarGAN is a much better choice when we want to achieve results similar to those of other GAN models trained on large data sets, but at a low cost (using a small database, one generator, and one discriminator). As an example, suppose we want to design a program that lets a user walk virtually onto the Oscar stage. We would analyze the user's avatar and expression into an input image and an expression label, and the software would apply StarGAN to swap the user's face with the winner's face in each frame of the video. Since the video is dynamic, minor imperfections and errors in a single frame are not enough to affect the fluidity and integrity of the whole video, so StarGAN lets users of the software obtain a virtual video of themselves receiving the Oscar in the shortest possible time, with low computing power and quick output.
Through our research and understanding of StarGAN v1 and StarGAN v2, we see some fundamental improvements that could help HyelinNAM, the Korean team studying hair-replacement software. We believe StarGAN brings much more value than face recognition alone. Because the shape and color of the hair need to be changed, using recognition as a vector to find the right hair for a face shape, together with a CNN for character recognition and exterior defocusing, would be a positive alternative solution. Using StarGAN twice to produce a "new" hairstyle for the user is a simpler and more efficient solution than SEAN; with SEAN, we would spend a great deal of GPU resources and time to obtain similar results.
To find the reason for such outstanding performance, we also conducted a structural analysis to identify the vital steps or components behind StarGAN v2's success. As the ablation rows in the table above indicate, the latent code injection and the style code made the major contribution to improving performance: each of them roughly halved the FID score (lower is better), and the final model also has a roughly 50% higher LPIPS score (higher is better) compared with the previous versions.

Conclusion
Our initial enthusiasm for hair-transplantation software led us to learn about StarGAN from the Korean group HyelinNAM and to study it. In this paper, we compare the differences between different GAN systems and different versions of StarGAN, as well as their performance under specific processing conditions, such as photo processing. StarGAN, with a relatively small amount of data, one generator, and one discriminator, already offers fairly good speed and accuracy in recognizing faces and modifying images. Through the analysis and comparison of StarGAN, DIAT, CycleGAN, and IcGAN in different articles, we confirmed the advantages of StarGAN in face-photo processing. The disadvantages are also obvious: without the right vector and a reasonable discriminator, the whole image output can deviate greatly from what is expected. Overall, StarGAN can process facial images efficiently, cleanly, and accurately, and it is highly useful in practice.

StarGAN
The original StarGAN algorithm aims to build a multi-domain model and get rid of the restrictions of the normal cross-domain model, moving from one-to-one pairs in the generation process to a novel, inclusive feature map spanning different classes. By applying this technique, the founders of StarGAN successfully combined the different features into a grand pool, and they also deployed a new layer to help converge the result in the direction they wanted the model to take. The loss function of this algorithm is composed of several parts: the adversarial loss, the domain classification loss, and the reconstruction loss. Besides the normal loss function of a GAN network, the domain classification loss and the reconstruction loss form the boundary of the output image.

Figure 1 .
Figure 1. (a) For a traditional cross-domain model, it is necessary to build a relationship between each pair of features so that they can mutually recognize each other, while (b) the StarGAN model builds a grand feature bank shared by all classes, saving a huge amount of work bridging between the classes [6].

Figure 1 represents the process of style translation in a traditional cGAN versus StarGAN. As shown in the figure, twelve trained generators are required for the cGAN and only one for StarGAN. In other words, with k domains, a cGAN requires k(k-1) generators, while StarGAN always requires only one.
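The generator count is just a pairing argument: a one-to-one cross-domain model needs one generator per ordered pair of distinct domains, while StarGAN conditions a single generator on the target-domain label. A two-line check in plain Python (function names are ours):

```python
def cross_domain_generators(k):
    # One generator per ordered (source, target) pair of distinct domains.
    return k * (k - 1)

def stargan_generators(k):
    # A single generator conditioned on the target-domain label.
    return 1

for k in (2, 4, 10):
    print(k, cross_domain_generators(k), stargan_generators(k))
# k = 4 reproduces the 12-vs-1 comparison of Figure 1.
```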

Figure 2 .
Figure 2. Overview of the StarGAN v2 model, which is composed of four major sections. (a) The generator, which is quite similar to the baseline model of the original StarGAN; both are built on the same principle of handling features from multiple domains. (b) The mapping network, which transforms the input information through the latent space into style codes. (c) The style encoder, which extracts from an image the features that the generator applies during training so that it can follow the style of each image. (d) The discriminator, which decides which of the generated images are fake and which qualify as "real" for the different domains [7].

Figure 3 .
Figure 3. The results of each model based on CelebA [6].

Figure 4 .
Figure 4. Images generated by different models [7]; the last column represents the images generated by StarGAN v2.

Table 1 .
Table 1. Quantitative analysis of StarGAN [6]. As shown in the table, StarGAN has the lowest classification error, and as shown in the last row, the number of parameters required by StarGAN is 14 times smaller than that of CycleGAN. This is because StarGAN requires only one generator and one discriminator.