Review: Recent advances in diffusion models

As generative modeling has grown in popularity, a large body of research has focused on the current state-of-the-art (SOTA) family of generative models: diffusion models. This paper reviews the SOTA text-to-image generation models built on diffusion since the approach emerged, including the denoising diffusion probabilistic model (DDPM), the DALL·E models, the Imagen model, the Stable Diffusion model, and the diffusion transformer (DiT) architecture. The theoretical section derives the basic principles behind the diffusion model mathematically, covering both the training process and the sampling process. The paper then examines the technical characteristics of these models and the improvements introduced across model iterations, such as optimized model structures, more efficient and accurate training methods, and the application to diffusion models of other optimization techniques widely used in deep learning. Finally, the technical trajectory of diffusion model development is summarized and some predictions are offered.


Introduction
The diffusion model is one of the most popular generative models today, and it has served as the generator for many state-of-the-art (SOTA) generative models published in recent years because of the high quality of the samples it produces and its relatively low training difficulty. The diffusion model can generate new image instances with high accuracy and efficiency, and it has a wide range of applications, such as upscaling images to obtain super-resolution results or synthesizing entirely new pictures and paintings.
In contrast to other deep learning models designed for clustering, classification, or regression, a generative model can 'create' new instances that the corpus or dataset does not contain. For generative models, the training process builds a kind of expert knowledge system with enough references that the model can generate samples from other input modalities, such as keywords or even music. However, large-scale datasets and high-efficiency computing are needed to feed the model and thoroughly train such a deep neural network [1].
It is necessary to provide a review of recent advances in the diffusion model because SOTA models with new structures and new uses of the diffusion model keep appearing this year. This article focuses on how this theory became the most popular approach in the image-generation area, from theory to application, and on the advantages and disadvantages of the model.

The diffusion model can be viewed as a single-chain Markov chain (Figure 2), with states $\{x_0, \ldots, x_{t-1}, x_t, x_{t+1}, \ldots, x_T\}$. The original image is defined as the state $x_0$, and a known amount of Gaussian noise (noise drawn from a normal distribution) is then added so that the image becomes blurred, which is equivalent to a process of increasing information entropy. Noise is added again at each step with the image of the previous state as the reference, until at state $x_T$ the entire picture becomes pure noise. Finally, the process is reversed to remove the noise. The forward recurrence for $x_t$ is

$$x_t = \sqrt{1-\beta_t}\, x_{t-1} + \sqrt{\beta_t}\, \epsilon_{t-1}^*, \qquad \epsilon_{t-1}^* \sim \mathcal{N}(0, I), \tag{1}$$

where $\beta_t$ is defined as a monotonically increasing function so that the step from $x_{t-1}$ to $x_t$ adds more noise when noise interference is already present, making the difference between the states more obvious. Because recursive calculation is inefficient, a general expression for $x_t$ can be obtained. Defining $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, the derivation proceeds as

$$x_t = \sqrt{\alpha_t}\, x_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_{t-1}^* = \sqrt{\alpha_t \alpha_{t-1}}\, x_{t-2} + \sqrt{\alpha_t - \alpha_t \alpha_{t-1}}\, \epsilon_{t-2}^* + \sqrt{1-\alpha_t}\, \epsilon_{t-1}^*. \tag{5}$$

What needs to be explained here is that the sum of two independent Gaussian noises in (6) is still Gaussian. In (5), $\sqrt{1-\alpha_t}\, \epsilon_{t-1}^*$ can be viewed as a sample from $\mathcal{N}(0, (1-\alpha_t)I)$; similarly, $\sqrt{\alpha_t - \alpha_t\alpha_{t-1}}\, \epsilon_{t-2}^*$ can be viewed as a sample from $\mathcal{N}(0, (\alpha_t - \alpha_t\alpha_{t-1})I)$, so their sum follows $\mathcal{N}(0, (1-\alpha_t + \alpha_t - \alpha_t\alpha_{t-1})I) = \mathcal{N}(0, (1-\alpha_t\alpha_{t-1})I)$, giving a merged noise $\epsilon_{t-2} \sim \mathcal{N}(0, I)$ in

$$x_t = \sqrt{\alpha_t\alpha_{t-1}}\, x_{t-2} + \sqrt{1-\alpha_t\alpha_{t-1}}\, \epsilon_{t-2}. \tag{6}$$

Iterating down to $x_0$ yields the general term formula

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon_0, \qquad \epsilon_0 \sim \mathcal{N}(0, I), \tag{7}$$

where each $\alpha_i$ can be calculated from $\beta_i$. Since $\epsilon_0$ is a random Gaussian noise, only $x_0$ is needed to generate $x_t$ directly, which is equivalent to obtaining $q(x_t \mid x_0)$ for any $t$. This process of generating $x_t$ is called the diffusion process. To train a model, the diffusion model also includes a reverse diffusion process, which predicts $x_0$ from the known $x_T$. To find $x_0$, one first needs to derive $q(x_{t-1} \mid x_t, x_0)$ from the known $q(x_t \mid x_0)$ and $q(x_{t-1} \mid x_0)$, which can be done using Bayes' formula:

$$q(x_{t-1} \mid x_t, x_0) = \frac{q(x_t \mid x_{t-1}, x_0)\, q(x_{t-1} \mid x_0)}{q(x_t \mid x_0)} \propto \exp\!\left(-\frac{1}{2}\left[\left(\frac{\alpha_t}{1-\alpha_t} + \frac{1}{1-\bar{\alpha}_{t-1}}\right)x_{t-1}^2 - 2\left(\frac{\sqrt{\alpha_t}}{1-\alpha_t}x_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1-\bar{\alpha}_{t-1}}x_0\right)x_{t-1} + C(x_t, x_0)\right]\right). \tag{17}$$

Note that $C(x_t, x_0)$ is treated as a constant in (17) because it does not contain $x_{t-1}$. This constant is not ignored but is used to complete the square in (23). From (23), the expression takes the standard form of a normal density, so the expectation and variance of $x_{t-1}$ can be read off directly:

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \mu_q(x_t, x_0),\, \sigma_q^2(t)I\right), \quad \mu_q(x_t, x_0) = \frac{\sqrt{\alpha_t}(1-\bar{\alpha}_{t-1})\,x_t + \sqrt{\bar{\alpha}_{t-1}}(1-\alpha_t)\,x_0}{1-\bar{\alpha}_t}, \quad \sigma_q^2(t) = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t}. \tag{24}$$

Then the relative entropy (KL divergence) between the two probability distributions (13) and (24) for the forward and reverse processes is calculated. One can assume that the variances of the two normal distributions are exactly equal, so that the direction of the smallest KL divergence is the direction of the smallest difference between the two means. Therefore, in theory, the neural network can easily estimate $x_0$; instead of going further into the math here, the author continues with how the parameters for training are obtained.
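To make the closed-form forward process (7) concrete, the following is a minimal PyTorch sketch; the linear beta schedule and the helper name `q_sample` are illustrative assumptions, not taken from any specific reviewed paper.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t: monotonically increasing schedule
alphas = 1.0 - betas                          # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)     # alpha_bar_t = prod_{i<=t} alpha_i

def q_sample(x0, t):
    """Draw x_t ~ q(x_t | x_0) in one step via eq. (7):
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = torch.randn_like(x0)                # eps_0 ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast over a (B, C, H, W) batch
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps, eps

# Usage: corrupt a batch of images to random timesteps.
x0 = torch.randn(8, 3, 32, 32)                # stand-in for a batch of images
t = torch.randint(0, T, (8,))
xt, eps = q_sample(x0, t)
```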
It can be seen in (24) that $x_0$ should be an unknown quantity and should not appear in the final expression, so the forward relation (12) can be rearranged to replace $x_0$ with an expression in $x_t$:

$$x_0 = \frac{x_t - \sqrt{1-\bar{\alpha}_t}\, \epsilon_t}{\sqrt{\bar{\alpha}_t}}. \tag{26}$$

Plugging (26) into $\mu_q(x_t, x_0)$ in (25) yields

$$\mu_q(x_t, x_0) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_t\right).$$

However, $\epsilon_t$ in this expression is not known at sampling time, so $x_0$ cannot be recovered directly. In the forward process, $\epsilon_t$ is a random noise with a known standard normal distribution; therefore, a simple neural network can be trained to drive the error between the estimator $\hat{\epsilon}(x_t, t)$ and $\epsilon_t$ toward zero. This gives the estimate $\hat{\mu}$ of $\mu_q(x_t, x_0)$. To make the model's estimate more accurate, the time parameter $t$ is added as an input to strengthen the model's understanding of the data. Finally, formula (32) for one sampling step is obtained:

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \hat{\epsilon}(x_t, t)\right) + \sigma_t z, \qquad z \sim \mathcal{N}(0, I), \tag{32}$$

and Table 1 shows the training process and the sampling process.
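The resulting objective is simply a mean-squared error between the added and predicted noise. A minimal sketch, assuming a generic noise-prediction network `model(xt, t)` and the `alpha_bars` schedule from the previous snippet:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(model, x0, alpha_bars):
    """One training step of the noise-prediction objective L = ||eps - eps_hat(x_t, t)||^2."""
    T = alpha_bars.shape[0]
    t = torch.randint(0, T, (x0.shape[0],), device=x0.device)
    eps = torch.randn_like(x0)                       # the noise to be predicted
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    xt = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps    # forward process, eq. (7)
    eps_hat = model(xt, t)                           # network also receives the time t
    return F.mse_loss(eps_hat, eps)
```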
In the second part of DALL·E 2, the function of the decoder is realized by the diffusion model, and the U-Net network structure is used to implement the reverse diffusion stage. As shown in Figure 3, U-Net is a fully convolutional network with an encoder-decoder structure: it takes images with added noise as input and is trained to output the noise that was added. Compared with the traditional U-Net structure, additional time information is injected so that the model produces different outputs according to the progress of the reverse process, as shown in Figure 4.
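As a rough illustration of how the extra time information can enter a U-Net, the block below adds a projected timestep embedding to the feature maps of a residual convolution block; this is a simplified sketch, not the exact block used in the reviewed models.

```python
import torch
import torch.nn as nn

class TimeConditionedBlock(nn.Module):
    """A U-Net residual block that mixes a timestep embedding into its features."""
    def __init__(self, channels: int, time_dim: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.time_proj = nn.Linear(time_dim, channels)  # project t-embedding to channels
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor, t_emb: torch.Tensor) -> torch.Tensor:
        h = self.act(self.conv1(x))
        # add the projected time embedding as a per-channel bias
        h = h + self.time_proj(t_emb)[:, :, None, None]
        h = self.act(self.conv2(h))
        return x + h                                    # residual connection
```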

3.3. Stable diffusion
Stable Diffusion is also called latent diffusion. As shown in Figure 6, Stable Diffusion uses U-Net as the model structure to implement the diffusion principle, but a cross-attention structure is added to the convolutional layers, which allows the diffusion model to take conditions from other modalities into the generation process. Stable Diffusion also uses the diffusion model to generate highly abstract images and then uses super-resolution technology to refine the details of the generated images. During inference, Stable Diffusion makes the diffusion model generate highly condensed information in latent space and then decodes it through a Variational AutoEncoder (VAE) before the super-resolution process in pixel space. This reduces the training cost of the model and improves its flexibility.
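A minimal sketch of this inference flow, with hypothetical `unet`, `vae`, and `sampler` components standing in for the actual Stable Diffusion modules:

```python
import torch

@torch.no_grad()
def stable_diffusion_infer(unet, vae, text_cond, sampler, latent_shape=(1, 4, 64, 64)):
    """Latent diffusion inference: denoise a latent, then decode it with the VAE."""
    z = torch.randn(latent_shape)          # pure noise in latent space, not pixel space
    z = sampler(unet, z, text_cond)        # reverse diffusion loop as in Table 1;
                                           # text_cond enters the U-Net via cross-attention
    return vae.decode(z)                   # VAE maps the latent back to pixel space
```

The design point is that the expensive denoising loop runs on a small latent tensor rather than on full-resolution pixels, which is where the training-cost saving comes from.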

Conclusion
This paper briefly introduced the development of diffusion models through the DALL·E, Imagen, Stable Diffusion, and DiT models. Only one year has passed since the diffusion model replaced GAN as the SOTA model, and the diffusion model is still being heavily optimized. Moreover, judging by the time other newly proposed models have needed to fully exploit their potential, the diffusion model still has a lot of room for development. At first, images were generated directly from the original U-Net of DALL·E 2; then Imagen generated only abstract small-scale images for further super-resolution operations. After that, Stable Diffusion re-adopted VAE technology to project images from pixel space to latent space, which further improved the performance of the diffusion model and reduced its training difficulty and cost. Subsequently, the current SOTA generation model DiT adopted the Transformer architecture, which has outperformed the U-Net architecture in many fields, to implement diffusion, giving the diffusion model a higher starting point. At present, there is still a lot of room for optimization of the diffusion model. Many techniques already proven in image processing and other fields, such as various transformer structures and other normalization methods, have the potential to further improve its performance and are worth trying.

Figure 1. Schematic diagram of the Markov matrix.

Table 1. The diffusion model training and inference process [3].

Training process:
1: $T-2$ intermediate states are randomly set between the original image and pure noise
2: Let $\epsilon \sim \mathcal{N}(0, I)$
3: Set the loss function $L = \|\epsilon - \hat{\epsilon}(x_t, t)\|^2$
4: Train until the loss approaches 0

Sampling process:
1: Start from pure noise $x_T \sim \mathcal{N}(0, I)$ and draw $z \sim \mathcal{N}(0, I)$ at each step
2: Use $\hat{\epsilon}(x_t, t)$ and $x_{t-1} = \mu_\theta(x_t, t) + \sigma_t z$ repeatedly to obtain $x_0$
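The sampling column of Table 1 corresponds to the following loop, shown as a minimal sketch using the `alphas`/`alpha_bars` schedule defined earlier and the common choice $\sigma_t = \sqrt{\beta_t}$ (an assumption; other variance choices exist):

```python
import torch

@torch.no_grad()
def sample(model, T, alphas, alpha_bars, shape=(1, 3, 32, 32)):
    """Reverse diffusion: start from pure noise x_T and denoise step by step to x_0."""
    x = torch.randn(shape)                          # x_T ~ N(0, I)
    for t in reversed(range(T)):
        eps_hat = model(x, torch.tensor([t]))       # predicted noise eps_hat(x_t, t)
        a, ab = alphas[t], alpha_bars[t]
        mean = (x - (1 - a) / (1 - ab).sqrt() * eps_hat) / a.sqrt()   # mu_theta(x_t, t)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)    # no noise at t = 0
        x = mean + (1 - a).sqrt() * z               # x_{t-1} = mu_theta + sigma_t * z
    return x
```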
3.1. DALL·E 2

The diffusion model was proposed in 2015 or even earlier, but the first SOTA model based on it is DALL·E 2, which was proposed in April 2022. DALL·E 2 [4] was the first model to make the diffusion model beat the GAN model and become the SOTA model. DALL·E 2 is divided into two parts. In the prior part, two encoders are used during training to extract the features of the text and the features of the corresponding image, respectively; through training, the model learns to identify the required image features from the text semantics. This part is completed with the help of the Contrastive Language-Image Pre-Training (CLIP) model, a large model proposed by OpenAI that captures the relationship between the semantics of the input text and the features of the image. Because DALL·E 2 must generate a zero-shot image from the image features provided by the prior part of the model, the prior part of DALL·E 2 is trained with the aid of CLIP. The prior section implements the function of converting the text semantics into the concatenation of multiple image features.
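Schematically, the two-stage pipeline described above can be summarized as follows; the function names are hypothetical paraphrases of the text, not OpenAI's actual interfaces:

```python
def dalle2_generate(text, clip_text_encoder, prior, decoder):
    """DALL·E 2 sketch: text -> CLIP text features -> image features (prior)
    -> generated image (diffusion decoder)."""
    text_features = clip_text_encoder(text)        # CLIP text embedding
    image_features = prior(text_features)          # prior predicts CLIP image features
    return decoder(image_features, text_features)  # diffusion decoder renders the image
```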

3.2. Imagen

Except for the text encoder, the Imagen model can be split into three parts, all of which are based on the diffusion model. The Google team froze the text encoder and trained the subsequent diffusion models step by step. One reason for this is to reduce the number of variables in the experiment; step-by-step training also saves the training cost of a large model and reduces the difficulty of training. The second reason is to keep the input of the diffusion model under control: freezing the text encoder fixes the input-output correspondence of the encoder, so the same input text always produces the same text embedding. Without this, it would be difficult to prevent the under-fitting of the subsequent diffusion model, and such a training mode would also cost the diffusion model some diversity in its generated images. The Imagen decoder is similar to that of DALL·E 2, but compared with DALL·E 2, Imagen reduces the resolution of the output image to only 32×32 and then applies super-resolution operations. The original one-step model is divided into a series of three models implementing the same function, which reduces the number of parameters in each model and thereby the difficulty and cost of training. The reason the diffusion model outputs only 32×32 pictures is to prevent the model from focusing on the fine details of the generated picture; producing low-resolution pictures and then gradually refining them is more in line with the human creative process. As it turns out, this approach allows the model to produce better-quality pictures. The super-resolution process also uses two diffusion models, as shown in Figure 5: the first increases the resolution from 32×32 to 256×256, and the second from 256×256 to 1024×1024, reducing the difficulty of model training and hyperparameter tuning.
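The cascade described above can be sketched as follows; the module names are hypothetical, and each stage is itself a diffusion model:

```python
def imagen_generate(text, frozen_text_encoder, base_model, sr_256, sr_1024):
    """Imagen sketch: frozen text encoder, a 32x32 base diffusion model,
    then two diffusion super-resolution stages (32 -> 256 -> 1024)."""
    text_emb = frozen_text_encoder(text)   # frozen: same text always gives the same embedding
    img_32 = base_model(text_emb)          # small, abstract 32x32 image
    img_256 = sr_256(img_32, text_emb)     # first super-resolution stage: 32x32 -> 256x256
    return sr_1024(img_256, text_emb)      # second stage: 256x256 -> 1024x1024
```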

Figure 6. Schematic diagram of Stable Diffusion.

3.4. The diffusion transformer architecture (DiT)

Before DiT, all diffusion models used the U-Net network structure to implement the diffusion process; DiT replaces U-Net with a Transformer to achieve better performance [8]. Similar to Stable Diffusion, DiT uses the latent space, converting images into 32×32×4 patches, which reduces the training cost of the model. Experiments found that the performance of adding a cross-attention mechanism or in-context conditioning to the transformer is inferior to that of adding an adaptive normalization layer, as shown in Figure 7 [9,10]. DiT is the current SOTA generation model.
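A rough sketch of the adaptive normalization conditioning (adaLN) that the DiT experiments favored over cross-attention and in-context conditioning; the parameter names below are illustrative, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive LayerNorm: the scale and shift are regressed from the conditioning
    vector (e.g. a timestep/class embedding) instead of being learned constants."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence; cond: (B, cond_dim) conditioning vector
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        # modulate the normalized tokens: (B, N, dim) * (B, 1, dim) + (B, 1, dim)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
```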