DualGGAN: A New Facial Style Transfer Network

To address the background-penetration problem that affects most unsupervised style transfer algorithms, we propose DualGGAN, a Transformer-based style transfer network built on dual generators and relative position encoding. The network is trained with a least-squares generative adversarial objective. A convolutional neural network serves as the image feature extractor and produces feature maps from which facial features with attention weights are obtained; relative position encoding and a mask loss jointly constrain style transfer to the relevant feature regions. Experimental results show that DualGGAN effectively reduces artifacts during facial style transfer, maintains good background consistency, and generalizes well. On the cat2dog and older2adult datasets, its FID and KID scores improve significantly over competing algorithms.


Introduction
Image style comprises the color, texture, visual patterns, and other characteristics of an image at various spatial scales. Style transfer usually refers to transforming an image from one style to another [1] and is often applied in visual tasks such as image colorization [2], image editing, super-resolution, and video generation. Traditional non-parametric style transfer methods can only modify low-level features such as texture and color [3] and cannot extract high-level features such as image semantics [4]. Neural networks have excellent feature extraction capabilities and can capture rich semantic information [5]. Gatys et al. first proposed the concept of Neural Style Transfer (NST), applying the VGG19 network [6] to style transfer and using a Gram matrix to measure the style similarity between the stylized image and the reference image.
Since collecting paired images for every pair of style domains requires considerable effort, a more practical line of research targets unsupervised methods [7], in which style transfer is performed without paired images. This not only avoids the laborious process of collecting pairing information across domains but also allows representative features to be learned from large sample sets. Although current style transfer methods achieve good results, it remains difficult to fully disentangle style features from content features when encoding facial images. When the content of the original image and the target image differ significantly, over-stylization and background penetration occur; when the style of the original image and the style image differ significantly, under-stylization occurs [9].
To address over-stylization and under-stylization in facial style transfer, this paper proposes an image transfer method based on two generators. It uses the VGG19 network to extract image style features and lets the discriminator provide multiple forms of feedback on the images produced by the generators, so that the generators converge faster and produce more realistic results. To address background penetration, we use a Transformer module with relative position encoding [10] to restrict style transfer to highly correlated regions. At the same time, a mask loss and a background-consistency loss further constrain the edge regions and the background, alleviating background penetration while reducing style changes in irrelevant areas. Experiments show that the proposed method handles facial details more precisely and produces fewer artifacts.

Style transfer
Image style transfer refers to using computer algorithms to transfer the style of one image onto another, producing a new image that retains the content of the original image while adopting the artistic style of the other, such as anime, oil painting, or ink painting. In recent years, with the rise of deep learning and artificial intelligence, these techniques have been widely applied: image processing software and filter functions have attracted a massive user base, and deep-learning-based style transfer lies at the core of these applications.

The process of style transfer
Image style transfer mainly uses neural networks to decouple the semantic information of an image from its style (i.e., high-level and low-level features) and modifies only low-level features such as color and texture to achieve the transfer. We denote the original image domain (source domain) as the x domain and the style image domain (target domain) as the y domain. The purpose of style transfer is to convert images in the x domain into images in the y domain while keeping their content features unchanged. The main process is shown in Figure 1, which illustrates the cycle from source domain to target domain and back (and, symmetrically, from target domain to source domain and back): x and y are the source and target domain images, s and c are the style and content codes obtained by the encoder, x̂ and ŷ are the images generated by the generators in the target and source domains, and x̃ and ỹ are the source- and target-domain images recovered through cycle consistency, respectively.
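As a rough illustration of this pipeline, the following PyTorch sketch shows how a content code and a style code could be extracted and combined; the module names and layer sizes are illustrative assumptions, not the actual DualGGAN architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encode step in Figure 1 (illustrative layer sizes, not DualGGAN's).
content_encoder = nn.Sequential(              # keeps spatial layout -> content code c
    nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
    nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
)
style_encoder = nn.Sequential(                # global pooling -> compact style code s
    nn.Conv2d(3, 64, 7, stride=1, padding=3), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, 8),
)

x = torch.randn(1, 3, 256, 256)               # source-domain image
y = torch.randn(1, 3, 256, 256)               # target-domain (style) image
c_x, s_y = content_encoder(x), style_encoder(y)
# A generator would fuse (c_x, s_y) into the stylized image x_hat; the reverse
# mapping should reconstruct x from x_hat, giving the cycle shown in Figure 1.
```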

Transfer theory
In style transfer, the most commonly used approach is the deep learning model based on the generative adversarial network (GAN), an unsupervised learning method first proposed by Goodfellow et al. [11]. A GAN consists of two models: the generative model G and the discriminative model D. The generative model takes random noise z as input and produces an image, denoted G(z). The discriminative model judges whether an image produced by the generator is real and outputs the probability D(G(z)) that G(z) is a real image: an output of 1 means the image is judged to be certainly real, while an output of 0 means it cannot be a real image. The generative model and the discriminative model thus form an adversarial process, namely

$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))].$

The ideal outcome of training is that the generator produces images realistic enough to pass as genuine, while the discriminator assigns a probability of 0.5 to images produced by the generator, that is, D(G(z)) ≈ 0.5. The CycleGAN method [8] proposed by Zhu et al. builds on the generative adversarial network, uses unpaired images, and applies cycle-consistency constraints to achieve style conversion from the source domain to the target domain and back. Denoting the data distributions as x ~ p_data(x) and y ~ p_data(y), where x is a source-domain image, y is a target-domain image, G(x) is the y-domain image produced by generator G, and F(y) is the x-domain image produced by generator F, the cycle-consistency loss can be expressed as

$\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{\mathrm{data}}(x)}[\|F(G(x)) - x\|_1] + \mathbb{E}_{y \sim p_{\mathrm{data}}(y)}[\|G(F(y)) - y\|_1].$
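The cycle-consistency constraint translates directly into a short loss function; the sketch below assumes G and F_gen are callable generator modules and uses an L1 penalty, as in CycleGAN.

```python
import torch.nn.functional as F

def cycle_consistency_loss(x, y, G, F_gen):
    """L_cyc as in CycleGAN [8]: F(G(x)) should reconstruct x, G(F(y)) should reconstruct y."""
    loss_x = F.l1_loss(F_gen(G(x)), x)   # source -> target -> source
    loss_y = F.l1_loss(G(F_gen(y)), y)   # target -> source -> target
    return loss_x + loss_y
```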

Algorithm
In image style transfer, unsupervised methods using unpaired images have become the most widely used approach [12]. Such algorithms only need to collect similar images from each domain, without collecting a paired counterpart for every image, which greatly reduces the data-collection workload. However, when unsupervised methods are used for facial style conversion, they may over-stylize the image or fail to transfer the style of the relevant regions. To alleviate under-stylization and over-stylization in facial style transfer, this paper adopts a generative adversarial architecture with two generators. The main idea is that the two generators produce images simultaneously, and the stylized images are passed to the discriminator together with the real images. In the discriminator, the softmax function replaces the sigmoid function, and the two generator outputs are processed together with the real images; the results are fed back to the corresponding generators to accelerate their convergence. At the same time, a Transformer containing a self-attention mechanism and relative position encoding restricts stylization to a specific region, and a mask loss and background consistency provide further constraints, alleviating infiltration of facial edges into the background.

Generative adversarial structure based on dual generators
The basic structure of a generative adversarial network is a generator and a discriminator. Training reaches a stable stage when the discriminator can no longer distinguish real data from generated data. To provide feedback to the generators at every iteration, this paper uses a dual-generator structure built on the generative adversarial network.
During adversarial training, each generator uses the pre-trained VGG19 network to separate and extract the content and style of the image, tightly fuses the style of the target style image with the content of the original image, and uses upsampling to transform the result into an image in the target domain.
The discriminator in this paper uses the softmax function for normalization, judges the inputs that receive lower scores after processing as fake, and penalizes the corresponding generator. On this basis, we introduce the self-attention mechanism [13] into the neural network through a Transformer and increase the number of downsampling layers to capture the high-frequency information of the image more accurately, reducing edge penetration during style conversion. The main network structure is shown in Figure 2.
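One possible reading of this softmax-based feedback is sketched below: the discriminator scores the real image and both generator outputs, and the softmax-normalized scores indicate which generator output is judged the least realistic. This is an assumption about the mechanism for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def softmax_realness(disc, real, fake1, fake2):
    """Compare one real image with the outputs of the two generators.

    disc is assumed to return a realness score map; a lower softmax weight means
    "more fake", and the corresponding generator would be penalized more strongly.
    """
    scores = torch.stack([disc(real).mean(), disc(fake1).mean(), disc(fake2).mean()])
    return F.softmax(scores, dim=0)   # relative realness of [real, fake1, fake2]
```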

Transformer block
This paper uses a Transformer to incorporate a self-attention mechanism into the CNN. The method first uses a convolutional neural network as a feature extractor to generate feature maps; position embeddings are computed from the CNN feature maps, which allows high-resolution feature maps to be exploited. The self-attention mechanism was initially applied in natural language processing (NLP) to process sequential text data and has shown promising results. In computer vision, self-attention treats an image as a spatial sequence, allowing the network to consider not only individual pixels but also to automatically capture receptive fields and assign different weights to regions of interest; the highly correlated regions are then modified. For a feature map of height H, width W, and dimension C, the single-head attention of a self-attention layer is

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V,$

where Q, K, and V denote the query, key, and value matrices and $d_k$ is the dimension of K. To further bias the self-attention mechanism towards highly correlated regions of the image, this paper uses relative position encoding, which biases the attention map and thereby sharpens the focus on the edges of the regions of interest. In relative position encoding, the coordinate differences between each query and each key in the horizontal and vertical directions form a relative position index matrix M; the position bias B indexed from M in each direction is added as a bias term to the attention map, i.e.,

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{T}}{\sqrt{d_k}} + B\right)V.$

The VGG19 network contains five pooling layers; this paper inserts a Transformer module after each pooling layer from the second onwards, using relative position encoding to match features more accurately. Transformers with relative position encoding have been successfully applied in computer vision, so this paper obtains the attention weights of the image through such a module, as shown in Figure 3.
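The two attention formulas above can be implemented compactly; the sketch below flattens an H × W × C feature map into N = H·W tokens and adds the relative position bias B to the attention scores. It is a generic formulation of this standard mechanism, not the exact DualGGAN module.

```python
import torch
import torch.nn.functional as F

def single_head_attention(q, k, v, rel_pos_bias=None):
    """Scaled dot-product attention with an optional relative position bias B.

    q, k, v: (N, d_k) tensors, one row per spatial position of the feature map.
    rel_pos_bias: (N, N) matrix gathered from a learnable table via the relative
    position index matrix M described above.
    """
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5
    if rel_pos_bias is not None:
        scores = scores + rel_pos_bias        # bias the map towards related regions
    return F.softmax(scores, dim=-1) @ v

# Example: a 16x16 feature map with 64 channels, zero bias as a placeholder.
N, d = 16 * 16, 64
q = k = v = torch.randn(N, d)
out = single_head_attention(q, k, v, torch.zeros(N, N))   # shape (N, 64)
```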

Loss function
During image style transfer, the content of the image must remain unchanged before and after conversion while its style is converted from the source domain to the target domain. Content consistency means that, when the style is converted, the image content (semantic information such as the eyes and nose of the face) should be preserved as much as possible. This paper uses a Transformer with relative position encoding to control the region and edge information of style transfer, performing style transfer only on facial attributes while keeping their content information unchanged.
(1) Adversarial loss. Let the two generators be G1 and G2. The generators learn the target distribution, while the discriminator processes the outputs of both generators together with real images and conducts adversarial training. To make the generators converge faster, the adversarial losses follow the least-squares GAN formulation.
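As an illustration of such a least-squares objective with two generators, a minimal sketch is given below (real images pushed towards 1 and generated images towards 0 for the discriminator, and towards 1 for each generator); the exact term weighting used in DualGGAN may differ.

```python
def lsgan_losses(D, real, fake1, fake2):
    """Least-squares GAN losses for one discriminator and two generator outputs (PyTorch tensors)."""
    d_loss = 0.5 * ((D(real) - 1) ** 2).mean() \
           + 0.25 * (D(fake1.detach()) ** 2).mean() \
           + 0.25 * (D(fake2.detach()) ** 2).mean()
    g1_loss = 0.5 * ((D(fake1) - 1) ** 2).mean()
    g2_loss = 0.5 * ((D(fake2) - 1) ** 2).mean()
    return d_loss, g1_loss, g2_loss
```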
(2) Mask loss. To enable the network to focus on the key information of the image within the global view, this paper uses a Transformer module with self-attention to restrict the region of style transfer. When only the attention mechanism is used, the output correlates well with the relevant regions, but the background edge areas still change. Therefore, this paper adopts Formula (7), in which the self-attention mechanism with relative position encoding guides a mask loss that constrains the relevant regions, reducing the transformation of irrelevant regions.
The first term of the loss controls the size of the foreground area: mask(k) is the value of the fourth channel at pixel k, B is the bias term of the relative position encoding, the associated hyperparameters control the size of the extracted style-transfer region, τ_min controls the minimum proportion of the foreground mask, and W·H is the number of pixels in the image. The second term of the loss segments the transfer region, pushing the foreground pixels attended to by the attention mechanism towards one and the pixels of irrelevant regions towards zero; ε is introduced to avoid a zero denominator. In this paper we set τ_min = 0.3. (3) Background consistency. Here x denotes the source-domain image and G_s(x) the target-domain image produced by the generator; this loss keeps the style difference between the background area of the converted image and that of the original image low. The total loss is the weighted sum of the adversarial, mask, and background-consistency losses, where λ_mask and λ_BC are hyperparameters that control the weights of the different losses. The visualization of the Conv2-1 layer feature map of the VGG19 network is shown in the second row of Figure 4.
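As an illustration of the roles described above, the sketch below keeps the attention-derived foreground mask above a minimum proportion, pushes mask values towards 0 or 1, and keeps the masked-out background of the output close to the input. The exact terms of Formula (7), including the relative-position bias and the ε-stabilized denominator, are not reproduced here.

```python
import torch

def mask_and_background_losses(mask, x, gen_x, tau_min=0.3):
    """Illustrative mask loss and background-consistency loss.

    mask:  attention-derived foreground mask in [0, 1], shape (B, 1, H, W)
    x:     source-domain image; gen_x: generated target-domain image
    """
    fg_ratio = mask.mean(dim=(1, 2, 3))                        # foreground proportion per image
    size_loss = torch.clamp(tau_min - fg_ratio, min=0).mean()  # keep at least tau_min foreground
    binarize_loss = (mask * (1 - mask)).mean()                 # push mask values towards 0 or 1
    bc_loss = ((1 - mask) * (gen_x - x)).abs().mean()          # background should stay unchanged
    return size_loss + binarize_loss, bc_loss
```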

Experimental Environment and Dataset
The experiments use the PyCharm integrated development environment and Python 3.9. Given its low learning cost and flexibility, PyTorch is chosen as the framework. The hardware environment is an 11th Gen Intel(R) Core(TM) i9-11900K CPU @ 3.50 GHz and an NVIDIA GeForce RTX 3090 GPU with 24 GB of dedicated memory. To verify the effectiveness of the proposed algorithm for style transfer on facial images, this paper selects the cat2dog dataset (2035 training images and 200 test images) and the adult2older dataset, which is formed from an adult face dataset (10,000 images) and an elderly face dataset (10,000 images) generated with StyleGAN and split into 9800 training images and 200 test images per domain. For training and testing, the facial datasets are resized to 256 × 256 pixels. Training runs for 200K iterations: during the first 100K iterations the discriminator also takes real images as input, while during the last 100K iterations it no longer takes real images and only judges the authenticity of the inputs from the two generators. The learning rate is set to 0.0001, and the Adam optimizer is used with β1 = 0.9 and β2 = 0.999.
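For reference, the optimizer settings above correspond to the following PyTorch configuration; the module variables are placeholders for the actual generators and discriminator, not the paper's networks.

```python
import torch
import torch.nn as nn

g1 = nn.Conv2d(3, 3, 3, padding=1)    # placeholder for generator 1
g2 = nn.Conv2d(3, 3, 3, padding=1)    # placeholder for generator 2
disc = nn.Conv2d(3, 1, 3, padding=1)  # placeholder for the discriminator

opt_g = torch.optim.Adam(list(g1.parameters()) + list(g2.parameters()),
                         lr=1e-4, betas=(0.9, 0.999))
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4, betas=(0.9, 0.999))
```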

Qualitative results
Since this paper mainly addresses style transfer, it is compared with CycleGAN, Face Aging GAN, CUT, and U-GAT-IT. To ensure fairness, all methods use the same experimental settings and the same number of iterations. Figures 5 and 6 show comparative experiments on the cat2dog dataset. The results show that although CycleGAN can extract texture details of the target domain and rarely changes the background style, it cannot produce reasonable deformation (such as making the face smaller for cats and larger for dogs). As shown in the third row of Figure 5, CycleGAN only modifies the texture, resulting in unrealistic facial attributes. Patch-based CUT and U-GAT-IT, which include attention mechanisms, can modify texture details and complete the cat2dog (dog2cat) conversion, but there are still cases where the style infiltrates irrelevant areas (artifacts appear between the facial edge and the background edge) or the background is even generated randomly. As shown in Figure 6, although U-GAT-IT successfully transfers the dog's facial texture to the cat's face in the dog2cat conversion, the ears are rendered unrealistically and the background is inverted. In the fourth row of Figure 6, the texture of the original image is relatively rich.
In this conversion, CycleGAN and CUT produce poor results. Although U-GAT-IT successfully handles the rich texture, it inverts the background. The Transformer module with relative position encoding and the mask loss used in this paper determine the extent of the style transfer region in the image, which makes the cat2dog (dog2cat) conversion more accurate and yields good visual effects, with little edge penetration and fewer artifacts.
Figure 5. Qualitative comparison between algorithms on the cat2dog dataset (columns: Original, CycleGAN, CUT, U-GAT-IT, ours1, ours2).
Figure 6. Qualitative comparison between algorithms on the dog2cat dataset (columns: Original, CycleGAN, CUT, U-GAT-IT, ours1, ours2).
Figures 7 and 8 compare CycleGAN, Face Aging GAN, U-GAT-IT, and our algorithm on the adult2older dataset. CycleGAN performs poorly in the adult2older conversion: as shown in the second and third rows of Figure 7, it only handles part of the hair. In the older2adult direction, its results are better than in the adult2older direction. Although Face Aging GAN can fully handle the face and hair, the results show that it tends to over-stylize. U-GAT-IT produces good visual effects and handles some facial details (hair, eye wrinkles, forehead wrinkles, etc.), but in some cases the style seeps into the background; in the third row of Figure 7, its result seeps into the shoulder area. In the fourth row of Figure 7, CycleGAN, Face Aging GAN, and U-GAT-IT all modify the hat, whereas our algorithm leaves it unchanged; the hat is not a facial attribute, i.e., it lies in an irrelevant region. Figure 8 shows the results of the various methods in the older2adult direction. CycleGAN achieves facial rejuvenation of the original image but cannot edit other styles (such as lengthening female hair). Face Aging GAN produces better results than CycleGAN but in most cases penetrates the background more severely and changes the background style. U-GAT-IT generates artifacts at the edges between foreground and background. Our method greatly reduces edge penetration and produces good facial transfer results while keeping the style and content of irrelevant regions unchanged.

Quantitative results
We use FID (Fréchet Inception Distance) and KID (Kernel Inception Distance) as quantitative evaluation metrics [15]. FID uses the mean and covariance matrix to compute the distance between two high-dimensional feature distributions and is widely used to evaluate the quality of images generated by GANs; the smaller the value, the more similar the two sets of images. FID is computed on Inception-network features, as shown in Formula (10):

$\mathrm{FID} = \|\mu_x - \mu_{G(x)}\|_2^2 + \mathrm{Tr}\!\left(\Sigma_x + \Sigma_{G(x)} - 2\left(\Sigma_x \Sigma_{G(x)}\right)^{1/2}\right),$

where Tr denotes the trace of a matrix (the sum of its diagonal elements), x and G(x) denote the real and generated images, μ denotes the mean, and Σ_x and Σ_{G(x)} are the covariance matrices of the real-image and generated-image features.

KID is the squared maximum mean discrepancy between the Inception features of the generated and real images,

$\mathrm{KID} = \frac{1}{m(m-1)}\sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)}\sum_{i \neq j} k(y_i, y_j) - \frac{2}{mn}\sum_{i,j} k(x_i, y_j),$

where m is the number of generated samples, n is the number of real samples, k is the kernel, and x_i and y_j are 2048-dimensional feature vectors of generated and real images from the Inception network. KID has an unbiased estimate with a cubic kernel and tends to match human perception [16]. Table 1 shows the FID values of our results compared with other algorithms on the cat2dog dataset. The results show that our FID is lower than that of all other algorithms. Both the patch-based CUT and U-GAT-IT with attention modules can transfer style in regions with significant stylistic differences between the source and target domains and produce reasonable results. Compared with these algorithms, this paper adds a mask loss and a background-consistency constraint on top of self-attention. This framework restricts style transfer to highly correlated regions through the attention mechanism, reduces the impact of style transfer on the background, and generates fewer artifacts, and thus obtains lower FID values.
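The two metrics defined above can be computed from pre-extracted 2048-dimensional Inception features as sketched below; this is a generic NumPy/SciPy implementation of the standard definitions, not the authors' evaluation code.

```python
import numpy as np
from scipy import linalg

def fid_from_features(feat_real, feat_gen):
    """FID between two (num_samples, 2048) arrays of Inception features."""
    mu_r, mu_g = feat_real.mean(0), feat_gen.mean(0)
    sig_r = np.cov(feat_real, rowvar=False)
    sig_g = np.cov(feat_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sig_r @ sig_g, disp=False)
    covmean = covmean.real                       # discard tiny numerical imaginary parts
    return float(((mu_r - mu_g) ** 2).sum() + np.trace(sig_r + sig_g - 2 * covmean))

def kid_from_features(feat_real, feat_gen):
    """Unbiased KID (squared MMD with the cubic polynomial kernel) on Inception features."""
    d = feat_real.shape[1]
    k = lambda a, b: (a @ b.T / d + 1) ** 3
    n, m = len(feat_real), len(feat_gen)
    k_rr, k_gg, k_rg = k(feat_real, feat_real), k(feat_gen, feat_gen), k(feat_real, feat_gen)
    return float((k_rr.sum() - np.trace(k_rr)) / (n * (n - 1))
                 + (k_gg.sum() - np.trace(k_gg)) / (m * (m - 1))
                 - 2 * k_rg.mean())
```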
Table 2 shows the FID values of the various algorithms on the adult2older dataset. In the adult2older direction, the FID of U-GAT-IT is slightly lower than that of our algorithm. One reason is that facial images contain many attributes: the attention mechanism of U-GAT-IT combined with adaptive layer-instance normalization can modify most facial attributes, so its transfer results are closer to the target domain. The self-attention mechanism in this paper obtains a large receptive field and learns the image from a global perspective, and the reported FID is the average over the images produced by the two generators; the FID of some individual images may therefore be higher, but the images generated by our method are visually more consistent with human perception, and our method obtains the best KID among the compared algorithms.

Ablation experiments
To verify the effectiveness of the proposed algorithm, ablation experiments were conducted. On the cat2dog dataset, we tested the impact of the dual-generator adversarial network, the Transformer module, the mask loss, the background-consistency constraint, and the other functional modules on the FID of the style transfer results. Table 3 shows that adding the Transformer self-attention module effectively improves the quality of the generated images and significantly reduces the FID; adding the mask loss reduces background penetration; and the best facial style transfer results are obtained when all modules are used together.

Conclusion
We propose DualGGAN, a Transformer-based image style transfer network. The network uses a dual-generator adversarial structure, which accelerates the convergence of the generative adversarial network, and uses a mask loss and background consistency to further suppress background penetration and artifacts. To verify the generalization of the algorithm, experiments were conducted in situations with large style differences or a large foreground proportion, and good results were obtained. However, during facial style transfer on the adult2older dataset, because the dataset contains many more female images, a small number of male faces tend to be feminized (for example, acquiring long hair), and the self-attention mechanism can cause glasses to appear (or disappear) in the transferred facial images. Future work will address this problem without changing the dataset, so that in addition to transferring the style of facial features, more of the original non-facial content of the image is preserved unchanged.


and BC  are hyperparameters that control different losses.In this experiment, The visualization of the Conv 2-1 layer feature map in the VGG19 network is shown in the second row of Figure 4.
, CycleGAN only modifies the texture, resulting in unrealistic facial attributes.Patch based CUT and U-GAT-IT ICAITA-2023 Journal of Physics: Conference Series 2637 (2023) 012024

Table 1. Quantitative comparison of various algorithms on the cat2dog dataset.

Table 2. Quantitative comparison of various algorithms on the adult2older dataset.