A Digital Animation Generation Model Based on Cycle Adversarial Neural Network

This paper proposes a digital animation generation model based on the Cycle Adversarial Neural Network (CycleGAN). Building on the classical CycleGAN, this research presents a multi-attention approach to enhance the network's generalization. Specifically, a style enhancement module and a style cross-attention mechanism are introduced into the generator network, enabling the model to better parse the structural information of the content image and accurately match content features with style features. Furthermore, a multi-scale discriminator with a fused attention mechanism enhances the preservation of content information from the source image in the output animated image. The experiments conducted in this study demonstrate that the proposed model exhibits superior performance in generating digital animation: it not only enhances the realism and variety of the generated results but also achieves significant improvements in the coherence and stability of long animation sequences. The contribution of this paper lies in the introduction of the multi-attention mechanism, which enhances the generalization of the network and is of theoretical and practical significance for the development of digital animation generation.


Introduction
Digital animation generation is a cross-disciplinary field that draws on computer graphics, artificial intelligence, and computer animation. The rapid increase in computing power, especially the widespread adoption of GPUs, has made complex animation computations far more efficient. Large-scale digital datasets, derived from the digitization and sharing of animation, film, and video content, provide ample training data for deep learning methods, which in turn has markedly improved the quality and diversity of generated animation by exploiting patterns and features learned from the data. The rapid progress of deep learning, coupled with virtual reality technology, has opened new application scenarios that allow users to experience more realistic animated scenes in virtual environments. Animation generation technology has evolved substantially, from traditional methods based on physical laws to emerging deep learning approaches. Although traditional computer graphics techniques were once prevalent, their limitations in producing complex, highly realistic animation have driven the shift toward more advanced methodologies.
Deep learning, exemplified by models such as the Generative Adversarial Network (GAN) [1], has significantly advanced animation generation. Unlike approaches built on manual rules and physical models, a GAN learns from large amounts of real animation data to produce realistic and vivid effects. GANs are particularly effective for character animation, capturing detailed motion features so that appearances remain realistic while movements stay natural and smooth. In addition, sequence modeling techniques such as the Recurrent Neural Network (RNN) [2] and the Long Short-Term Memory network (LSTM) [3] contribute to improved animation generation: these models capture temporal information, yielding better predictions of future frames and therefore smoother, more coherent animation. Animation generation methods based on reinforcement learning are also gaining attention. Reinforcement learning allows a model to acquire optimized animation strategies through interaction with its environment, enabling more intelligent and adaptive behavior in scenarios such as game character animation in virtual environments [4].
Deep learning animation generation still faces several challenges. First, the model may produce distortions when dealing with complex scenes and subtle details. Second, the quality and diversity of the training data limit the generation results. In addition, deep learning animation generation demands substantial computational resources and time. To tackle these issues, this work concentrates on enhancing the generalization capability of animation generation models. Building on the CycleGAN model, we present a multi-attention technique to enhance the generalizability of the network. The main contributions of this paper can be summarized as follows: (1) The generator of the CycleGAN model incorporates a style enhancement module and a style cross-attention mechanism, which enables the model to effectively analyze the structural information of the content image and achieve precise alignment between content features and style features. (2) A multi-scale discriminator with a fused attention mechanism enhances the preservation of content information from the source image in the output animation. (3) Empirical results demonstrate that the proposed model has superior generalization.

Research on animation generation based on deep learning
Neural networks have a wide range of applications in data-driven motion synthesis, and their scalability and efficiency have attracted the attention of the computer animation and machine learning research communities. Taylor et al. [5] advocated the use of Conditional Restricted Boltzmann Machines (CRBMs) to predict the body's next pose during movement, allowing the model to capture the dynamic properties of motion efficiently. Fragkiadaki et al. [6] proposed an approach that combines encoder-recurrent-decoder networks (ERDs) and LSTMs in the latent space to predict future body poses. Jain et al. [7] introduced Structural Recurrent Neural Networks (SRNNs), which represent human motor interactions by constructing spatio-temporal graphs with node and edge RNNs, further improving pose prediction accuracy. Martinez et al. [8] employed a gated recurrent unit (GRU)-based encoder-decoder model and leveraged a residual structure to improve the smoothness of the generated motion. Pavllo et al. [9] introduced the QuaterNet architecture, which describes motion with quaternion rotations and computes forward kinematics within the loss function, effectively addressing the conventional difficulty of computing angular errors. These methods are autoregressive models that forecast future poses from the character's previous postures, making them well suited to real-time applications such as computer games. Holden et al. [10] proposed a phase-functioned neural network to generate character animation that adapts to the environment geometry in real time; the network maps control signals from a gamepad to the character's movements. Unlike traditional methods, this approach requires gait and phase annotation of the motion data and is only applicable to bipedal motion.

CycleGAN
CycleGAN [11] is a powerful unsupervised learning model that can translate between different animation styles without paired training data. By introducing a cycle-consistency loss, it learns bidirectional mappings for animation generation, allowing one animation style to be transformed into another while maintaining the coherence of the animation sequence. This flexibility gives CycleGAN a wide range of applications in animation and provides creators with a powerful tool for seamlessly converting between styles. The structure of the CycleGAN model is depicted in Figure 1.
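As an illustration of the cycle-consistency idea, the following minimal PyTorch sketch shows how the two mappings are constrained to reconstruct the original image. The names G_AB, G_BA, real_A, real_B, and the weighting lam are illustrative placeholders, not the paper's code.

```python
import torch
import torch.nn as nn

def cycle_consistency_loss(G_AB, G_BA, real_A, real_B, lam=10.0):
    """Hypothetical sketch of CycleGAN's cycle-consistency term.

    G_AB maps domain A (e.g. source photos) to domain B (e.g. animation style),
    G_BA maps back; lam is the usual weighting of the cycle term.
    """
    l1 = nn.L1Loss()
    # A -> B -> A should reconstruct the original A image
    rec_A = G_BA(G_AB(real_A))
    # B -> A -> B should reconstruct the original B image
    rec_B = G_AB(G_BA(real_B))
    return lam * (l1(rec_A, real_A) + l1(rec_B, real_B))
```

In practice this term is combined with the adversarial losses of the two discriminators during training.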

Animation generation model based on an optimized CycleGAN
In this paper, CycleGAN is improved to address two common problems in animation generation: feature fusion errors and poor output quality. A multi-attention mechanism is introduced, comprising a style enhancement module [12] and a style cross-attention mechanism [13], to strengthen the generator's attention to the style information of animated images while effectively parsing the structural information of the content images. This enables the generator to match content features and style features more accurately, avoiding feature fusion errors. In addition, a multi-scale discriminator incorporating an attention mechanism helps the generated image better retain the content information of the source image, improving the overall quality of the generated animation. These changes not only address the shortcomings of the traditional CycleGAN but also introduce a more refined and efficient processing mechanism for animation generation.

Generator
The proposed generator network consists of an encoder, a feature transformation module, and a decoder. The feature transformation module comprises the SRM attention mechanism and the style cross-attention mechanism, which together achieve precise integration of style and content features. Figure 2 illustrates the overall architecture of the generator network.
The content features (L_c) and style features (L_s) extracted by the encoder are fed into the dual-attention feature transfer module. In this module, the content and style features are refined through the recalibration mechanism of the SRM attention block, which can be expressed as L′_ci = SRM(L_ci) and L′_si = SRM(L_si). After this refinement, the style-domain attributes of the style images can be expressed accurately. Next, the refined content features (L′_ci) and refined style features (L′_si) are used as the inputs to two SCNet modules, and the output stylized feature can be expressed as L_csi = SCNet(L′_ci, L′_si). Finally, the feature carriers of the different layers output by the SCNet modules are decoded by the decoder to obtain the generated image P_cs.
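To make the recalibration step concrete, here is a minimal PyTorch sketch of an SRM-style block, following the common style-pooling-plus-channel-gate formulation; the paper does not specify layer sizes or implementation details, so everything below is an assumption rather than the authors' code.

```python
import torch
import torch.nn as nn

class SRM(nn.Module):
    """Sketch of a style-based recalibration (SRM-like) attention block.

    Style pooling collects per-channel mean and std, a channel-wise fully
    connected layer turns them into a gate, and the input feature map is
    recalibrated channel by channel.
    """
    def __init__(self, channels):
        super().__init__()
        # channel-wise FC over the 2 style statistics (mean, std)
        self.cfc = nn.Conv1d(channels, channels, kernel_size=2,
                             groups=channels, bias=False)
        self.bn = nn.BatchNorm1d(channels)

    def forward(self, x):                        # x: (B, C, H, W)
        b, c, _, _ = x.shape
        mean = x.mean(dim=(2, 3))                # (B, C)
        std = x.std(dim=(2, 3))                  # (B, C)
        stats = torch.stack((mean, std), dim=2)  # (B, C, 2)
        gate = torch.sigmoid(self.bn(self.cfc(stats))).view(b, c, 1, 1)
        return x * gate                          # recalibrated features
```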

Feature fusion module
The feature fusion module employs the style cross-attention module SCNet, the structure of which is shown in Figure 3.

Figure 3. Stylistic cross-attention
SCNet uses a cross-multiplication operation to compute the correlation between the content feature at position i and all the style features at positions j in the same row and column, which can be defined as f(L′_ci, L′_sj) = (W_θ L′_ci) · (W_φ L′_sj), where L′_ci and L′_sj represent the refined content features and style features, · denotes the dot product, and W_θ and W_φ are learnable weight matrices realized by 1×1 convolutions. g denotes the feedback of the refined style feature at position j, as shown in Equation (7): g(L′_sj) = W_g L′_sj, where W_g is a learnable weight matrix realized by a 1×1 convolution. In the aggregation operation, the correlated style responses are accumulated at each content position, L_cs,i = (1/C(L)) Σ_j f(L′_ci, L′_sj) g(L′_sj), where f denotes the correlation operation, g denotes the feature mapping function, and C(L) denotes the normalization factor; the similarity between the cross-multiplied fused features and the original style features is then computed, and the final stylized features are obtained through a series of convolution operations.
When transforming content and style image features, the style cross-attention mechanism processes content and style features through region-by-region matching based on their semantic relevance. This process reorganizes the content feature space by relocating locally integrated style features to semantically relevant positions. In this way, spatial consistency is maintained while the transferred style information, such as color and texture, is made more consistent with the semantic content of the content image.
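The following PyTorch sketch illustrates one plausible form of this style cross-attention: queries come from the refined content features, keys and values from the refined style features, and 1×1 convolutions play the role of the learnable weight matrices. For brevity it attends over all style positions, whereas the paper restricts attention to the same row and column; treat it as an illustrative sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class StyleCrossAttention(nn.Module):
    """Sketch of an SCNet-like style cross-attention block."""
    def __init__(self, channels):
        super().__init__()
        self.w_q = nn.Conv2d(channels, channels, kernel_size=1)  # content query
        self.w_k = nn.Conv2d(channels, channels, kernel_size=1)  # style key
        self.w_v = nn.Conv2d(channels, channels, kernel_size=1)  # style value

    def forward(self, content, style):                    # (B, C, H, W) each
        b, c, h, w = content.shape
        q = self.w_q(content).flatten(2).transpose(1, 2)   # (B, HW, C)
        k = self.w_k(style).flatten(2)                     # (B, C, HW)
        v = self.w_v(style).flatten(2).transpose(1, 2)     # (B, HW, C)
        # correlation f(i, j) followed by softmax normalization (the 1/C(L) factor)
        attn = F.softmax(torch.bmm(q, k), dim=-1)          # (B, HW, HW)
        out = torch.bmm(attn, v)                           # aggregate style at content positions
        return out.transpose(1, 2).reshape(b, c, h, w)
```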

Discriminator
The CycleGAN method tends to produce structural distortion when applied to animation generation. To ensure stable training of the discriminator, this paper applies the spectral normalization (SN) function to regularize the parameters of the convolutional layers when constructing the discriminator network. In addition, Class Activation Mapping (CAM) [14] is introduced at the end of the discriminator. This design prioritizes the distinction between the style image and the generated image, prompting the model to focus on the specific regions of difference, and it dynamically adjusts the parameters of the discriminative network at each resolution level. Together, these measures guide the optimized CycleGAN to generate animated images whose content remains structurally intact. Figure 4 depicts the discriminator architecture, and Table 1 provides a detailed breakdown of its network parameters.
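As a rough sketch of how spectral normalization can be applied to the discriminator's convolutions in PyTorch; the layer counts and channel widths are assumptions, and the CAM head and multi-scale branches described in the paper are omitted.

```python
import torch.nn as nn
from torch.nn.utils import spectral_norm

def sn_conv(in_ch, out_ch, stride=2):
    """Convolution wrapped with spectral normalization to stabilize training."""
    return nn.Sequential(
        spectral_norm(nn.Conv2d(in_ch, out_ch, kernel_size=4, stride=stride, padding=1)),
        nn.LeakyReLU(0.2, inplace=True),
    )

class PatchDiscriminator(nn.Module):
    """Minimal discriminator with spectral-normalized convolutions (sketch only)."""
    def __init__(self, in_ch=3, base=64):
        super().__init__()
        self.net = nn.Sequential(
            sn_conv(in_ch, base),
            sn_conv(base, base * 2),
            sn_conv(base * 2, base * 4),
            spectral_norm(nn.Conv2d(base * 4, 1, kernel_size=4, stride=1, padding=1)),
        )

    def forward(self, x):
        return self.net(x)   # patch-level real/fake scores
```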

Experiment-related settings
A good facial animation generation network should satisfy the following criteria. First, the generated facial animation must achieve lip synchronization, so that the speech content and lip movements are coordinated in time and the lips neither lead nor lag the speech. Second, the generated facial animation should include natural head movements rather than lip motion on a static head, which would look stiff. Finally, the generated images must be clear and of high quality. To evaluate lip synchronization, this paper adopts the Landmark Distance (LMD); to measure whether the facial animation has natural head movements, it adopts the Rotation Distance (RD); and to evaluate image quality, it adopts the Structural Similarity (SSIM) and the Peak Signal-to-Noise Ratio (PSNR), where larger SSIM and PSNR values indicate better results. The Dlib library is used to label the lip keypoints of the generated and real face images (denoted l̂ and l, respectively). First, the keypoints are calibrated by subtracting their mean value to eliminate any global lip offset. Then the Euclidean distance between each pair of corresponding keypoints is computed and normalized by the video length and the number of keypoints; smaller LMD values indicate better results, as shown in Equation (9).
LMD = (1/(N·K)) Σ_{t=1}^{N} Σ_{k=1}^{K} ||l_{t,k} − l̂_{t,k}||_2    (9)

where N represents the duration of the video in frames and K represents the total number of lip keypoints (20 keypoints) for each face image. To evaluate whether the generated face animations have natural head movements, we introduce RD as a metric. First, a standard face image is selected and its keypoints are extracted. Then, the keypoints of the generated and real face images are extracted and aligned with the standard face by an affine transformation to obtain the rotation angles ŝ and s. Finally, the Euclidean distance between these two angles is calculated and normalized over the video duration; smaller RD values indicate more natural head movements. The specific calculation is given in Equation (10):

RD = (1/N) Σ_{t=1}^{N} ||s_t − ŝ_t||_2    (10)
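A minimal NumPy sketch of the LMD and RD computations, under the assumption that lip keypoints are stored as (N, K, 2) arrays and head rotation angles as per-frame vectors:

```python
import numpy as np

def lmd(pred_lips, real_lips):
    """Sketch of the Landmark Distance metric (Eq. 9).

    pred_lips, real_lips: arrays of shape (N, K, 2) holding the K lip keypoints
    of each of the N frames. Both sets are mean-centred first to remove any
    global lip offset, then the per-keypoint Euclidean distances are averaged
    over frames and keypoints.
    """
    pred = pred_lips - pred_lips.mean(axis=1, keepdims=True)
    real = real_lips - real_lips.mean(axis=1, keepdims=True)
    dists = np.linalg.norm(pred - real, axis=-1)   # (N, K)
    return dists.mean()

def rd(pred_angles, real_angles):
    """Sketch of the Rotation Distance metric (Eq. 10): Euclidean distance
    between generated and real head rotation angles, averaged over frames."""
    pred_angles = np.asarray(pred_angles, dtype=float)
    real_angles = np.asarray(real_angles, dtype=float)
    return np.linalg.norm(pred_angles - real_angles, axis=-1).mean()
```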
PSNR is used to evaluate the quality of the generated image and is calculated as shown in Equation (11) and Equation (12):

MSE = (1/(w·h)) Σ_{i=1}^{w} Σ_{j=1}^{h} (P_ij − P̂_ij)²    (11)

PSNR = 10 · log10(MAX² / MSE)    (12)

where w and h are the image width and height, P_ij and P̂_ij are the pixel values of the real and generated images, respectively, and MAX is the maximum possible pixel value of the image. The formula for SSIM is as follows:

SSIM(P, P̂) = [(2·μ_P·μ_P̂ + c1)(2·σ_PP̂ + c2)] / [(μ_P² + μ_P̂² + c1)(σ_P² + σ_P̂² + c2)]

where μ_P and μ_P̂ are the means, σ_P² and σ_P̂² the variances, and σ_PP̂ the covariance of the real and generated images, and c1 and c2 are small constants that stabilize the division. For the face animation generation experiments, we used a server running Ubuntu 20.04 LTS with an NVIDIA Tesla V100 GPU, 32 GB of RAM, and 512 GB of SSD storage. The experiments used the TensorFlow 2.5.0 deep learning framework with CUDA Toolkit 11.0 and cuDNN 8.0, and Python 3.8.5. OpenCV 4.5.2 was used for image processing, and Scikit-learn 0.24.2 served as the machine learning toolkit. This configuration provides efficient GPU acceleration, ensuring that the training process runs smoothly and satisfactory experimental results are obtained.
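The image-quality metrics above can be computed as in the following sketch; PSNR follows Equations (11) and (12) directly, while SSIM is delegated to scikit-image's structural_similarity (the channel_axis argument assumes scikit-image ≥ 0.19 and H×W×3 images).

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(real, fake, max_val=255.0):
    """PSNR per Eqs. (11)-(12): MSE over the w x h image, then 10*log10(MAX^2 / MSE)."""
    mse = np.mean((real.astype(np.float64) - fake.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def ssim(real, fake):
    """SSIM via scikit-image; assumes uint8 H x W x 3 images."""
    return structural_similarity(real, fake, channel_axis=-1)
```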

Animation Generation Performance Experiment
To validate the animation generation performance of the proposed model, we compare it against similar models: GAN, CycleGAN, Ref. [15], Ref. [16], and Ref. [17]. The facial animation generation results of each model are presented in Table 2; each reported value is the mean of 10 runs of the corresponding model.

Table 2. Experimental results of animation generation
Index\Model | GAN | CycleGAN | Reference [15] | Reference [16] | Reference [17] | Ours

For LMD, the model in this paper exhibits better lip movement and speech synchronization by a significant margin (2.32 vs. 3.46). This is due to the multi-attention mechanism introduced in the model, including the style enhancement module and the style cross-attention mechanism, which improves the accurate parsing of the structural information of the content image.
In terms of RD, this paper's model (0.16) significantly outperforms other models, especially relative to the classical GAN model (0.27).This indicates that the model in this paper represents head movements more naturally in the generated animations, further improving the overall realism.This result is related to the multi-scale discriminator and its fusion attention mechanism introduced by this paper's model, which effectively preserves the content information of the source image.
In terms of SSIM, this paper's model (0.86) also achieves a significant advantage, showing that the generated image is structurally closer to the real face image.This can be attributed to the model's meticulous attention mechanism, which is able to better capture the structural features of the image.
Finally, in terms of PSNR, this paper's model (29.48) also presents higher values relative to the other models, reflecting the superiority in detail preservation.This is related to the multi-attention mechanism adopted in this paper's model, which strengthens the accurate matching of content and style.

Animation Generation Time Experiment
The speed of animation generation is another important indicator of model performance. Analyzing the experimental data in Table 2, we find that the best-performing reference model also generates animation well, and the advantage of the proposed model over it is not large. We therefore further compare the animation generation speed of each model. The comparison is shown in Figure 5, where FPS stands for Frames Per Second.

Figure 5. Comparison of animation generation time for each model
The FPS of the GAN model is 25.5, i.e., it generates 25.5 frames per second, while the FPS of the CycleGAN model is 23.1; GAN therefore generates faster than CycleGAN, which may reflect its relatively high efficiency in simple scenarios. The FPS of the three reference models, Reference [15], Reference [16], and Reference [17], is 23.9, 21.3, and 22.6, respectively. All of them are slower than GAN, indicating that their different model structures and algorithms can reduce generation speed. The proposed model achieves 24.5 FPS, a better overall result than the reference models. Despite the introduction of the multi-attention mechanism, the proposed model balances image quality well while maintaining an efficient generation speed.
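For reference, generation speed of this kind can be measured with a simple timing loop such as the sketch below; the generator interface and frame format are assumptions, not the paper's benchmarking code.

```python
import time
import torch

@torch.no_grad()
def measure_fps(generator, frames, device="cuda"):
    """Rough FPS measurement sketch: time how long the generator takes to
    produce all frames and divide the frame count by the elapsed time."""
    generator.eval().to(device)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for frame in frames:                 # each frame: a (1, C, H, W) tensor
        _ = generator(frame.to(device))
    if device == "cuda":
        torch.cuda.synchronize()
    return len(frames) / (time.perf_counter() - start)
```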

Conclusion
This study presents a digital animation generation model based on CycleGAN, incorporating a multi-attention strategy to improve the network's ability to generalize. Within the generator network, we incorporate a style enhancement module and a style cross-attention mechanism; these additions allow the model to effectively analyze the structural details of the content image and achieve precise alignment between content features and style features. Meanwhile, by introducing a multi-scale discriminator with a fused attention mechanism, the generated animated image better retains the content information of the source image. The experimental results demonstrate that the proposed model delivers a substantial improvement in animation generation performance compared with the classical cycle adversarial neural network. The LMD and RD metrics show that the generated animation exhibits better synchronization between lip movements and speech, as well as more natural head movements. The improvements in SSIM and PSNR indicate that the generated image is of higher quality, approaching that of the real facial image. In terms of generation time, the proposed model maintains a relatively efficient generation speed, further verifying that the performance gains brought by the multi-attention mechanism do not come at the cost of generation efficiency. However, the model also has limitations. In some complex scenarios, the generation results may still be unsatisfactory, and the current model may require more computational resources when dealing with large-scale data. Future work includes further optimizing the model structure to improve generation efficiency while maintaining high-quality output, and improving the model's adaptability to complex scenes by introducing richer data and semantic information.

Figure 2. Generator structure
First, the samples are fed into the encoder; through Equation (1) and Equation (2), the feature extraction for both images is accomplished (e.g., L_c = E_c(P_c) for the content image). The feature carriers of the different layers output by the SCNet modules are then decoded in the decoder to obtain the generated image P_cs.

Table 1. Discriminator parameter settings
This paper focuses on face generation in animation. The VoxCeleb2 dataset is a publicly available dataset containing interview videos of 6,112 individuals and 150,480 videos in total, of which 145,569 belong to the training set and 4,911 to the test set. The videos are sampled at 25 fps with a resolution of 224×224. In our experiments the videos are preprocessed as follows. First, the interview videos of 2,800 individuals in the dataset are randomly selected; because each video is long, 15% of each interviewee's video is sampled, yielding 189,230 videos for training and 28,860 for testing. Second, all frames are extracted from each video and 68 facial keypoints are labeled using the Dlib library. Finally, the obtained face data are aligned.
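The frame-extraction and landmark-labeling step can be sketched as follows with OpenCV and Dlib; the landmark model path refers to the standard 68-point predictor file and is an assumption about the local setup.

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# standard 68-point landmark model shipped with dlib (path is an assumption)
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def extract_landmarks(video_path):
    """Sketch of the preprocessing step: read every frame of an interview
    video and label the 68 facial keypoints with Dlib."""
    cap = cv2.VideoCapture(video_path)
    landmarks = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray)
        if faces:
            shape = predictor(gray, faces[0])
            landmarks.append([(p.x, p.y) for p in shape.parts()])
    cap.release()
    return landmarks
```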
In this study, we verify the significant advantages of the proposed model on several key metrics by comparing it with classical animation generation models (GAN, CycleGAN).