A Framework for False Image Detection with Sample-Oriented Intelligent Adversarial

To develop algorithms capable of automatically detecting and evaluating the authenticity of images and videos, researchers have focused on false image detection algorithms. These algorithms aim to identify the authenticity of images by distinguishing between real images and forged ones generated by false generation algorithms. The main focus of this paper is the implementation of a single-frame authenticated image detector based on the concept of transfer learning. The detector utilizes Inception ResNet v2, a target classification network pre-trained on a self-built military scene dataset. To enhance the dataset, a series of graphical enhancement algorithms are employed, compelling the classification network to learn the crucial differences between real and forged images. Additionally, Focal Loss is introduced to balance the dataset across various GAN-based image generation algorithms. The final forged target image detector achieves a classification accuracy of 0.8908 on a large-scale sample test set.


Introduction
With the exponential growth in data size, significant advancements in computing power, and continuous algorithmic innovation, deep learning has experienced a resurgence and proven successful in tackling various complex problems, including big data analysis and computer vision. However, while science propels human development and progress, it also presents a dual nature. The advancements in deep learning technology can be harnessed by malicious individuals to create algorithms that pose threats to personal privacy, social security, and military defence. Deep learning-based fake image generation algorithms, such as GAN-based techniques and models like DeepFakes, Face2Face [1], FaceSwap, and NeuralTextures [2], can generate counterfeit images and videos that are almost indistinguishable to the human eye [3]. If used irresponsibly, these technologies can have severe negative consequences and pose significant security challenges to society and emergency management. Hence, it is crucial to develop algorithms that can automatically detect and assess the authenticity of images and videos [4]. False image detection algorithms serve this purpose by distinguishing between genuine images and fake ones generated through deceptive means [5].
In this study, a single-frame authenticated image detector is implemented as the main contribution. The detector is based on the concept of transfer learning and utilizes Inception ResNet v2, a target classification network pre-trained on a self-built military scene dataset. To enhance the dataset, a series of graphical enhancement algorithms are applied, compelling the classification network to discern the essential differences between real and forged images. Additionally, to address various GAN-based image generation algorithms, Focal Loss is introduced to balance the dataset effectively. The final forged target image detector achieves a classification accuracy of 0.8908 on a large-scale sample test set.
Forgery image generation algorithms and forgery detection algorithms are prevalent in the realm of new media network data [6], such as microblogs, and present typical scenarios where AI models are employed as countermeasures. Face image tampering serves as an illustrative example, where the generation algorithm can manipulate the facial region of an image to produce a modified version, altering the identity or relevant attributes of the face [7]. Deep learning advancements have led to the emergence of highly sophisticated face tampering algorithms capable of synthesizing images of such remarkable quality that they deceive human perception. However, these manipulated images can be maliciously exploited, resulting in severe trust issues and security risks, and posing significant challenges to AI security [8]. Hence, the development of face-swapping identification algorithms becomes imperative in ensuring the checks and balances of technological advancements [9].
The primary objective of forgery image identification is to distinguish between real and forged images. In the case of face swap identification algorithms, the central challenge lies in effectively detecting and discriminating the differences between genuine and manipulated images [10]. Based on their fundamental principles and applicable scenarios, the development history of face swap identification algorithms can be roughly categorized into three types: algorithms designed to detect traditional image forgery, algorithms designed to detect deep learning-based image forgery, and more generalized face swap identification algorithms [11].
(1) In general, due to the different characteristics of hardware (sensors, lenses) or software (post-processing, compression algorithms), images are marked with their own unique markers during the acquisition process, such as specific relationships between pixels and their neighbours, which can be revealed by Noise Analysis (NA) or Error Level Analysis (ELA). Traditional image forgery or processing techniques include copy-and-paste, removal, stitching, and image steganography [12]. The forged images are relatively simple, and human vision can still distinguish some of them. The main purpose of early face replacement identification algorithms is to detect images faked by the above simple image processing techniques. The main method is to extract image features using specific hand-designed methods and then feed the extracted features into a downstream classifier, such as a Support Vector Machine (SVM), Random Forest (RF), or Multi-Layer Perceptron (MLP). Because the computational complexity of these classification algorithms is relatively low, the final classifier can also be enhanced using ensemble methods [13]. Hand-extracted image features, although already effective in detecting forged images, are often not the most meaningful and reasonable image features, and require tedious manual design. With the rise of deep learning, the solution of many practical problems began to turn to neural networks. Deep learning-based face-swapping identification algorithms use convolutional neural networks to automatically extract more robust and effective image features, which greatly improves detection performance [14].
(2) The rise of deep learning has also promoted the development of image generation algorithms, which makes false image detection and identification very challenging. The most advanced algorithms, such as DeepFakes, Face2Face, FaceSwap, and NeuralTextures, generate images (especially face images) that are so similar to real images that human vision can barely distinguish between the two, and ordinary face-swapping identification algorithms cannot effectively detect them. For forged images generated by a Convolutional Neural Network (CNN) or Generative Adversarial Network (GAN), the deep learning-based face swap identification algorithm still models the task as a binary classification problem (real image or forged image), training directly in a supervised manner on real and generated forged images while considering feature differences between real and forged images, such as blinks, facial expressions, and head movements [15]. Such algorithms train classifiers for a specific face-swapping algorithm in a supervised form, which often provides effective detection of images generated by that specific algorithm, but the discrimination performance is severely degraded once the classifier encounters images generated by an algorithm unknown to it. The face-swapping discrimination algorithm based on deep learning binary classification thus tends to suffer from overfitting and is only applicable to specific face-swapping algorithms, with poor generalization performance.
(3) Deep learning is developing rapidly, and the recognition performance of deep learning binary classification-based face replacement identification algorithms tends to drop sharply on image sets generated by new face replacement algorithms, indicating that the algorithms have overfitted. To improve the generalization ability of false image identification algorithms, some recent works have noticed this overfitting phenomenon in classification networks and tried to extract more essential differences between real and fake image features. The rapid development of deep learning will inevitably lead generative models to produce more realistic and reasonable images, and the exploration of discrimination algorithms will continue to fight against them. More and more attention and work will focus on the evolutionary process of continuous iterative confrontation between generative and discrimination algorithms; based on this, an iterative framework for adversarial image detection based on the total deviation equation is proposed.

Scene intelligence generation techniques to generate models
The proposed scenario jamming intelligence generator model is based on DeepFake [16], as shown in Figure 1, and consists of two identically structured deep encoder-decoders, ES-DS and ET-DT, where the deep autoencoder E encodes the input target image as a latent vector, which is then reconstructed by the deep decoder D into the original radar target image [17]. During training, the source scene target and fictitious interference scene target images are used to train the corresponding encoders and decoders, and the networks are trained using reconstruction loss, perceptual loss, and adversarial loss [18]. The two encoders have the same structure and share parameters, ensuring that the encoding results for different scene targets lie in the same feature space. The two decoders also have the same structure, but their training processes are independent and their parameters are not shared, which ensures that the two decoders can reconstruct different scenes [19]. The effect of swapping the source scene to the target scene is achieved at test time by combining the encoder of the target scene (the false interference scene) with the decoder of the source scene and inputting the image of the target virtual scene. The scene swap works because the encoder extracts the attributes of the target scene, while the decoder, which has the features of the source scene embedded through training, accepts these attributes and generates an image that transfers the source scene onto the target scene. Figure 2 shows the structure of the encoder-decoder network, which introduces multi-scale features, adversarial training, perceptual loss, occlusion improvement, and other techniques on top of the original DeepFake.
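The shared-encoder/separate-decoder swap described above can be sketched as follows. This is an illustrative stub with hypothetical class and variable names, not the authors' implementation; the real encoder and decoders are deep networks, and here the "rendering" is just a string so the data flow is visible.

```python
class SharedEncoder:
    """Maps an image to a latent attribute vector (stub for a CNN encoder).
    Its parameters are shared across both scenes, so latents from either
    scene land in the same feature space."""
    def encode(self, image):
        return {"attributes": image}

class SceneDecoder:
    """Reconstructs an image in one fixed scene's appearance (stub).
    Each decoder is trained only on its own scene, so that scene's
    features are baked into its parameters."""
    def __init__(self, scene_name):
        self.scene = scene_name
    def decode(self, latent):
        return f"{self.scene}-rendered({latent['attributes']})"

E = SharedEncoder()             # shared parameters
D_src = SceneDecoder("source")  # trained on source-scene images only
D_tgt = SceneDecoder("target")  # trained on target-scene images only

# Training reconstructs each scene with its own decoder:
#   x_src -> E -> D_src    and    x_tgt -> E -> D_tgt
# At test time the swap is the cross pairing: encode the target scene,
# decode with the source-scene decoder.
swapped = D_src.decode(E.encode("target_image"))
print(swapped)  # target-scene attributes rendered with source-scene features
```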

Encoder Training for Scene Generation Models
The structure of the encoder is shown in Table 1. N pictures of size 256 × 256 are input and downsampled by a series of convolution and max pooling operations, and multi-scale features are obtained at the second, fourth, and sixth layers, with sizes 128 × 64 × 64, 256 × 16 × 16, and 512 × 4 × 4, respectively. Through adaptive average pooling, the 64 × 64 and 16 × 16 features are downsampled to 4 × 4, and the three features are concatenated into a (128 + 256 + 512) × 4 × 4 multi-scale feature. This multi-scale feature is flattened into a one-dimensional vector and passed through one fully connected layer to be encoded as a 1024-dimensional latent vector. The original DeepFake does not contain multi-scale features, yet the quality of target scene attribute extraction is crucial to the quality of the generated fake scene. We therefore use multi-scale features when generating the scene attribute latent vector, enabling the encoder to extract target scene attributes at different fine-grained and semantic levels, which helps the realistic generation of virtual interference scenes.
The structure of the decoder is shown in Table 2. The 1024-dimensional scene attribute vector is expanded to 1024 × 16 dimensions through a fully connected layer and then reshaped to a size of 1024 × 4 × 4. After a series of upsampling operations, it is restored to a size of 3 × 256 × 256, the same as the input image. Here upsampling is achieved by convolutional layers: the number of filters in the convolutional layer is set to four times the desired number, and the values at the same position across four channels are rearranged to fill the four adjacent positions of a single channel, yielding a feature map with four times the original area.
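The channel-to-space rearrangement used for upsampling above is the standard pixel-shuffle (sub-pixel convolution) operation. A minimal NumPy sketch, assuming NumPy rather than the authors' framework:

```python
import numpy as np

def pixel_shuffle(x, r=2):
    """Rearrange a (C*r*r, H, W) array into (C, H*r, W*r):
    each group of r*r channels fills one r x r spatial block."""
    c_in, h, w = x.shape
    assert c_in % (r * r) == 0, "channel count must be divisible by r*r"
    c = c_in // (r * r)
    x = x.reshape(c, r, r, h, w)      # split channels into the r x r block
    x = x.transpose(0, 3, 1, 4, 2)    # interleave block with spatial dims: (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

# Four 1x1 feature maps become a single 2x2 map, quadrupling the area
# as described for the decoder's upsampling layers:
out = pixel_shuffle(np.arange(4.0).reshape(4, 1, 1))
print(out)  # [[[0. 1.]
            #   [2. 3.]]]
```

This matches the behaviour of `torch.nn.PixelShuffle` with upscale factor 2, which is why a convolution with 4x the desired filter count precedes it.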

Adversarial Training of Scene Generation Models
Generative adversarial networks significantly boost the performance of generative models. As shown in Figure 3, adversarial training is introduced into the model to improve the quality of the generated pictures, with the encoder-decoder structure regarded as the generator.

Figure 3. Model confrontation training process
An additional discriminator is introduced to achieve the performance improvement. The structure of the discriminator is shown in Table 3; it is a simple classifier. The discriminator outputs a probability between 0 and 1 indicating the authenticity of the input picture: the greater the probability, the more authentic the picture. The picture reconstructed by the decoder is denoted as x̂, and the picture input to the encoder is denoted as x. The generator loss is:

L_G = -log Dis(x̂) (1)

The discriminator loss is:

L_D = -log Dis(x) - log(1 - Dis(x̂)) (2)

The generator and discriminator are trained alternately. Considering that the training set is very small, with only a few hundred pictures, the discriminator easily overfits, causing the adversarial training to collapse. Therefore, we update the discriminator once for every 10 generator updates, and the learning rate of the discriminator is 1000 times smaller than that of the generator.
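The alternating update schedule can be sketched as below. This is an illustrative helper with assumed names and a placeholder base learning rate, not the authors' training loop; it only counts how often each network would be updated under the rule stated above.

```python
def train_schedule(total_steps, g_lr=1e-4, d_every=10, d_lr_scale=1e-3):
    """Simulate the alternating schedule: the generator updates every step,
    the discriminator once per `d_every` steps, with a learning rate
    scaled down by `d_lr_scale` to keep it from overfitting."""
    g_updates = d_updates = 0
    d_lr = g_lr * d_lr_scale  # discriminator learns 1000x slower
    for step in range(1, total_steps + 1):
        g_updates += 1                # generator: every step
        if step % d_every == 0:
            d_updates += 1            # discriminator: every 10th step
    return g_updates, d_updates, d_lr

g, d, d_lr = train_schedule(100)
print(g, d)  # 100 generator updates, 10 discriminator updates
```

Updating the discriminator rarely and slowly keeps it weak enough that the generator's adversarial gradient stays informative on a few-hundred-image training set.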

Loss Function Design for Scene Generation Models
The traditional MSE loss causes the problem of missing high-frequency information in image reconstruction, resulting in blurred images. The left picture of Figure 4 takes face data as an example: the upper and lower reconstruction results have equal MSE loss with respect to the ground truth, but the upper result is clearly better and sharper, while the lower, blurred result is easier for the network to learn. Perceptual loss compares high-level semantic information extracted by convolution rather than comparing directly at the pixel level, which effectively alleviates this problem. Therefore, we introduce a perceptual loss into the loss function: VGG19 is used to extract the features of the generated image and the ground truth, and the MSE loss is then calculated between the features. The perceptual loss and the total loss of the network are shown in formulas (3) and (4):

L_perc = MSE(φ(x̂), φ(x)) (3)

L_total = L_MSE + λ_1 L_perc + λ_2 L_adv (4)

where φ denotes the VGG19 feature extractor and λ_1, λ_2 weight the perceptual and adversarial terms.
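A minimal NumPy sketch of the perceptual loss idea, with a fixed random linear map standing in for the VGG19 feature extractor (the extractor, array shapes, and all names here are illustrative assumptions):

```python
import numpy as np

def mse(a, b):
    """Mean squared error between two arrays."""
    return float(np.mean((a - b) ** 2))

def perceptual_loss(img_a, img_b, feature_extractor):
    """MSE computed in feature space instead of pixel space;
    the paper uses VGG19 activations as the features."""
    return mse(feature_extractor(img_a), feature_extractor(img_b))

# Stand-in for VGG19: a fixed random linear projection of the pixels.
rng = np.random.default_rng(0)
W = rng.standard_normal((8, 16))
toy_features = lambda img: W @ img.ravel()

clean = rng.standard_normal((4, 4))
noisy = clean + 0.1 * rng.standard_normal((4, 4))
print(perceptual_loss(clean, noisy, toy_features))  # small but nonzero
```

With a real network, φ responds strongly to lost edges and textures, so blurry reconstructions that fool pixel-wise MSE are penalized.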

Forgery detection model
Figure 5 shows the overall framework of the face-changing detection algorithm, which mainly includes three parts: data preprocessing and data enhancement, the face-changing detection network, and the classification loss Focal Loss. The preprocessing of the dataset mainly extracts frames and crops the faces in the images for subsequent processing; data enhancement augments the face images to improve the generalization ability of the face-changing detection network; and the face-changing detection network classifies each face image as real or forged.

Total Focal Loss
Total Focal Loss is mainly designed to solve the severe imbalance between positive and negative samples, and between hard and easy samples, in one-stage target detection: a weighting factor balances positive and negative samples, and a modulating factor down-weights easy samples. In the FaceForensics++ [20] dataset, there is a huge difference between the number of real videos and fake videos.
Moreover, fake images generated by different face-changing algorithms, or different face images generated by the same face-changing algorithm, vary in difficulty: some can be identified by eye, while others cannot. Based on this, Total Focal Loss is introduced to balance the positive and negative samples and the hard and easy samples in the dataset. For the binary classification problem, the Total Focal Loss formula is as follows:

FL(p_t) = -α_t (1 - p_t)^γ log(p_t)

In this paper, the positive and negative sample balance factor is set to α = 0.25, and the hard and easy sample balance parameter is set to γ = 2.0.

Dataset
The dataset used in this paper is FaceForensics++, which contains 1000 real videos and fake videos generated in pairs by four state-of-the-art face-changing algorithms: DeepFakes, Face2Face, FaceSwap [21], and NeuralTextures. All videos are available in three qualities: Raw (original quality), C23 (high quality, compression rate 23), and C40 (low quality, compression rate 40). The quality of the faces in the videos is high: they are almost frontal, without occlusion, and can be detected and tracked. Table 4 shows the division and quantity distribution of the FaceForensics++ dataset; NeuralTextures is a face-changing algorithm added later, so no statistics are given for it. The dataset division used in this paper is consistent with the official one, training set : validation set : test set = 720 : 140 : 140, with the videos divided in pairs. Figure 7 shows the distribution of the FaceForensics++ dataset in terms of gender, video quality, and face pixels. The gender distribution is relatively balanced, with slightly more men. There are three video qualities, among which VGA is 480p, HD is 720p, and FHD is 1080p. The face sizes are concentrated between 100 and 400 pixels. This paper uses the official FaceForensics++ dataset division; 32 frames are extracted from each video, and MTCNN is used to detect faces. The optimizer is SGD with momentum, the initial learning rate is set to 0.5, and a total of 50 epochs are trained; at the 20th and 30th epochs the learning rate decays by a factor of 0.1. The momentum is set to 0.9, L2 regularization is used, and the weight decay is set to 0.0001.
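The step learning-rate schedule above (base 0.5, 10x decay at epochs 20 and 30) can be sketched as a small helper; the function name and milestone representation are illustrative, not the authors' code:

```python
def learning_rate(epoch, base_lr=0.5, milestones=(20, 30), decay=0.1):
    """Step schedule: multiply the learning rate by `decay` at each
    milestone epoch that has been reached."""
    lr = base_lr
    for m in milestones:
        if epoch >= m:
            lr *= decay
    return lr

# Epochs 0-19 train at 0.5, epochs 20-29 at 0.05, epochs 30-49 at 0.005.
print([learning_rate(e) for e in (0, 19, 20, 29, 30, 49)])
```

This is the behaviour a multi-step scheduler (e.g. milestones [20, 30], gamma 0.1) would produce over the paper's 50 training epochs.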

Conclusion
Based on the idea of transfer learning, this paper uses the face classification network Inception ResNet v2, pre-trained on the VGG Face dataset, to implement a single-frame face-changing detector. A series of graphic enhancement algorithms is used to augment the dataset, forcing the classification network to learn the more essential differences between real and fake images, and Focal Loss is introduced to balance the positive and negative samples and the hard and easy samples in the dataset. Finally, the face-changing detector achieves a classification accuracy of 0.9208 on the FaceForensics test set.

Figure 4 .
Figure 4. Perceptual loss

Figure 5 .
Figure 5. The overall framework of the face-changing detection algorithm

Forged Image Detection Network
Taking face recognition as an example, the forgery detection network uses the face classification network Inception ResNet v2, pre-trained on a mixed self-selected dataset, and applies the idea of transfer learning to better distinguish real images from forged images. The overall network structure of Inception ResNet v2 is shown in Figure 6, with the corresponding sub-modules on the right.

Figure 6 .
Figure 6. Schema of Inception ResNet v2 and its four main modules

Figure 8 shows the training results of the face-changing detection algorithm. Convergence is very fast during training: the accuracy reaches more than 0.995 on the training set and more than 0.915 on the validation set.

Figure 8 .
Figure 8. Face-changing detection algorithm training results

Table 5 .
Table 5 shows the results of the face-changing detection algorithm on the test set. The overall accuracy reaches 0.91: the classification accuracy for real images reaches 0.83, and for fake images it reaches 0.93.

Table 1 .
The encoder structure

Table 2 .
The decoder structure

Table 3 .
The discriminator structure

Table 4 .
Dataset distribution of FaceForensics++