Research on the application of Transformer in computer vision

The Transformer is a deep neural network model that uses attention mechanisms to improve model performance. It initially gained significant attention in natural language processing. In recent years, thanks to continuous improvements and extensions of the Transformer architecture, it has also achieved important breakthroughs in computer vision (CV) tasks and attracted the interest of many researchers. However, comprehensive reviews of the application and development of the Transformer in computer vision are still lacking. This paper summarizes the Transformer's applications and advances in computer vision. It discusses the fundamental ideas and structure of the Transformer model, introduces its applications in fields such as image classification, object detection, and image generation, and highlights the advantages of Transformer + convolutional neural network (CNN) hybrid models. The paper analyzes classic models such as the Vision Transformer (ViT) and the Detection Transformer (DETR) in detail, discussing their strengths, weaknesses, and improvements. Finally, it summarizes and looks ahead to the Transformer's evolution in computer vision.


Introduction
Transformer [1], proposed by Google in 2017, was the first model to rely solely on self-attention rather than on recurrence or convolution; its structure is simpler and in some respects better than models such as CNNs [1]. The model was initially proposed for natural language processing, where it caused a great stir and was regarded at the time as a milestone in the field. The Transformer is a multilayer neural network that uses an encoder-decoder structure and a self-attention mechanism to handle sequence tasks. To help the model capture the crucial information in a sequence, the self-attention mechanism can assign a different importance to each element of the sequence at different positions. As a result, the Transformer can extract important information from the sequence more accurately.

The basic structure of the Transformer
In the Transformer, the input passes through a stack of six encoder layers, and the resulting representation is fed to every decoder layer, yielding a model structure that contains no recurrence (Figure 1) [1]. Each encoder module contains two sublayers: a multi-head attention layer and a feed-forward layer. Each decoder module consists of three sublayers: a masked multi-head attention layer, a multi-head attention layer, and a feed-forward layer. Each sublayer is followed by a residual connection and a normalization step.
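To make the layer layout concrete, the following is a minimal PyTorch-style sketch of a single encoder layer (attention sublayer, feed-forward sublayer, residual connections and normalization). The class name, hyperparameters, and simplifications are ours and not taken from the original paper.

```python
import torch
import torch.nn as nn

class EncoderLayerSketch(nn.Module):
    """Minimal sketch of one Transformer encoder layer: multi-head self-attention
    and a position-wise feed-forward sublayer, each followed by a residual
    connection and layer normalization."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        attn_out, _ = self.attn(x, x, x)       # self-attention: Q = K = V = x
        x = self.norm1(x + attn_out)           # residual connection + normalization
        x = self.norm2(x + self.ff(x))         # feed-forward sublayer
        return x

# A full encoder stacks six such layers; the decoder additionally contains a masked
# self-attention sublayer and a cross-attention sublayer over the encoder output.
encoder = nn.Sequential(*[EncoderLayerSketch() for _ in range(6)])
out = encoder(torch.randn(1, 20, 512))         # (1, 20, 512)
```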

Positional encoding
Because the Transformer model contains no recurrence or convolution, it needs a way to represent positional information in the input sequence so that it can learn the ordering relationships within the sequence. Positional encoding is a special representation that helps the model learn sequence relationships by assigning a unique vector to each position, rather than relying on recurrence or convolution. In addition, positional encoding can exploit spatial position relationships to obtain better contextual information.
In natural language processing (NLP), the input to a Transformer is a sentence. Since the attention module by itself cannot distinguish words at different positions in a sequence, a positional code is added: each word's position in the sentence is numbered and mapped to a vector, introducing position information. The positional encoding is defined through sine and cosine functions:

PE(pos, 2i) = sin(pos / 10000^(2i / d_model)),
PE(pos, 2i+1) = cos(pos / 10000^(2i / d_model)),

where pos is the position in the sequence, i indexes the embedding dimension, and d_model is the embedding dimension.
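As an illustration, a short NumPy sketch of the sinusoidal positional encoding above (assuming an even embedding dimension d_model) could look as follows:

```python
import numpy as np

def sinusoidal_position_encoding(max_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal position encodings: PE(pos, 2i) = sin(pos / 10000^(2i/d_model)),
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))."""
    positions = np.arange(max_len)[:, None]                  # (max_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]                 # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)   # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                             # even dimensions use sine
    pe[:, 1::2] = np.cos(angles)                             # odd dimensions use cosine
    return pe                                                # added to the token embeddings

# Example: encodings for a 10-token sequence with 512-dimensional embeddings.
pe = sinusoidal_position_encoding(10, 512)
```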

Self-attention mechanism
The attention mechanism imitates the process of biological observation, which combines internal perception with external experience to increase the accuracy of observation in specific regions. Because it can quickly pick out the key elements from sparse input, the attention mechanism is widely used in speech recognition, computer vision, and natural language processing. It is a key idea in neural networks, a powerful technique for handling multiple tasks, and is frequently used to improve the interpretability of neural networks.
By assigning a weight to each element in the sequence, the self-attention mechanism can more accurately reflect the relationships between them. Self-attention mechanisms derived from Transformer models have been developed and applied in a variety of disciplines, including computer vision and natural language processing. By weighting the input data, self-attention can better capture the intrinsic relationships in the data, reduce computation, and improve algorithm performance. In addition, when the model or the training data is not large enough, the self-attention mechanism can help improve generalization so that the data is interpreted more properly.
The self-attention mechanism, an improvement on the general attention mechanism, reduces the reliance on external information and improves the ability to capture internal correlations within data or features. Three learnable weight matrices {W_Q, W_K, W_V} are defined, and the input sequence X is projected onto them to obtain the triple Q = XW_Q, K = XW_K, V = XW_V. Self-attention is then computed as

Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V,

where d_k is the dimension of the K matrix.
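The computation above can be sketched in a few lines of Python; the dimensions chosen for the example are arbitrary.

```python
import torch
import torch.nn.functional as F

def self_attention(X, W_Q, W_K, W_V):
    """Scaled dot-product self-attention: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V            # project the input onto the three weight matrices
    d_k = K.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # pairwise relevance between positions
    return F.softmax(scores, dim=-1) @ V           # weighted sum of the values

# Example: a sequence of 5 tokens with 64-dimensional embeddings.
X = torch.randn(5, 64)
W_Q, W_K, W_V = (torch.randn(64, 64) for _ in range(3))
out = self_attention(X, W_Q, W_K, W_V)             # (5, 64)
```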

Encoder and decoder
Encoders and decoders are a widely used neural network design in natural language processing and speech recognition. The encoder transforms data such as text and audio into a higher-level representation. RNNs, CNNs, other neural networks, or the Transformer model can all serve as the encoder. In the machine translation setting of natural language processing, it is usually necessary to convert the source sentence into a "context vector" that encodes the source-language sentence.
Decoders are typically used to produce output sequences, such as sentences in the target language, from the context vectors generated by the encoder. Any kind of neural network, including a recurrent neural network, a convolutional neural network, or a Transformer model, can serve as the decoder. In machine translation, the decoder usually takes the context vector produced by the encoder as input and generates target-language sentences according to the context of that language. During this process, the decoder must accurately estimate each word of the output sequence and its role in the target language. In addition, the decoder can infer the semantic relationships between words from their dependencies, so that the target language is translated more accurately.
Because the encoder and decoder are trained jointly, the network can effectively convert input sequences into output sequences. This approach has been used successfully in a variety of domains, including speech recognition and natural language processing.
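As a rough illustration of this encoder-decoder interface, the sketch below wires PyTorch's built-in Transformer encoder and decoder together; the embedded inputs and the causal mask on the target side stand in for a real translation pipeline.

```python
import torch
import torch.nn as nn

# Minimal sketch of the encoder-decoder interface (e.g. machine translation):
# the encoder turns the source sentence into context vectors ("memory"), and the
# decoder attends to that memory while generating the target sequence.
d_model = 512
model = nn.Transformer(d_model=d_model, num_encoder_layers=6, num_decoder_layers=6,
                       batch_first=True)

src = torch.randn(1, 10, d_model)   # embedded source sequence (batch, src_len, d_model)
tgt = torch.randn(1, 7, d_model)    # embedded target sequence generated so far

memory = model.encoder(src)                                  # context vectors from the encoder
causal_mask = model.generate_square_subsequent_mask(7)       # each position sees only earlier ones
out = model.decoder(tgt, memory, tgt_mask=causal_mask)       # (1, 7, d_model)
```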

Feedforward neural network (FNN)
In addition to the attention sublayers, each encoder and decoder layer of the Transformer contains a fully connected feed-forward network consisting of two linear transformations with a Rectified Linear Unit (ReLU) activation between them. The feed-forward network treats every position in the sequence in the same way: the linear transformations are identical across positions, but the parameters differ from layer to layer. Using X to denote the input signal, W_i the corresponding connection weights, and b_i the biases, the computation is

FFN(X) = max(0, XW_1 + b_1)W_2 + b_2.
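A direct transcription of this formula into Python (with the dimensions d_model = 512 and d_ff = 2048 used in the original Transformer) might look as follows:

```python
import torch

def feed_forward(X, W1, b1, W2, b2):
    """FFN(X) = max(0, X W1 + b1) W2 + b2, applied identically at every position."""
    return torch.clamp(X @ W1 + b1, min=0) @ W2 + b2   # ReLU is max(0, .)

# Example with d_model = 512 and d_ff = 2048.
X = torch.randn(10, 512)
W1, b1 = torch.randn(512, 2048), torch.zeros(2048)
W2, b2 = torch.randn(2048, 512), torch.zeros(512)
out = feed_forward(X, W1, b1, W2, b2)                   # (10, 512)
```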

Transformer for image classification

Vision transformer
One of the core tasks in computer vision is image classification, which is also the task on which practically all benchmark models are compared. It is a technique that separates images into object categories based on the features represented in the image data. Since the Transformer has demonstrated excellent performance in natural language processing, researchers have applied it to image classification in computer vision and achieved good results, the typical algorithm being the Vision Transformer (ViT) model. ViT was proposed by Dosovitskiy et al., who were motivated by the successful scaling of Transformers in NLP and used a pure Transformer-based structure on sequences of image patches, without relying on CNNs [2]. The ViT model is based entirely on self-attention mechanisms and performs well in image classification tasks; its structure is shown in Figure 2 [3]. First, a two-dimensional image is split into individual image blocks, or patches, which are flattened into one-dimensional vectors. Each vector then undergoes a linear projection and is added to a position encoding. Following Bidirectional Encoder Representations from Transformers (BERT), ViT also inserts a [class] token for classification, which better reflects the feature information of the image and completes the classification task; a minimal sketch of this input pipeline is given after the list below.
ViT is typically pre-trained on large datasets and then transferred to smaller downstream tasks [4]. It achieved a Top-1 accuracy of 88.55% on the ImageNet dataset, surpassing the ResNet series and successfully replacing traditional convolutional neural networks on large-scale datasets, breaking the CNN monopoly in the field of vision and showing stronger generalization than traditional CNNs [4]. Despite this breakthrough, it still has significant shortcomings in computer vision:
(1) The ViT model processes images as one-dimensional sequences of patches, ignoring their two-dimensional structure.
(2) The ViT model requires a large amount of data for pre-training, and the inductive bias of self-attention is weaker than that of CNNs.
(3) The computational cost of the ViT model is high; its complexity grows with the square of the number of tokens, which makes it unsuitable for high-resolution images.
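For illustration, the sketch below follows the ViT input pipeline described above: splitting the image into patches, flattening and linearly projecting them, prepending a learnable [class] token, and adding position embeddings. The class name and default sizes are illustrative, not taken from the original implementation.

```python
import torch
import torch.nn as nn

class ViTEmbeddingSketch(nn.Module):
    """Sketch of the ViT input pipeline: split the image into patches, flatten them,
    apply a linear projection, prepend a learnable [class] token, and add position embeddings."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, d_model=768):
        super().__init__()
        self.patch_size = patch_size
        n_patches = (img_size // patch_size) ** 2
        self.proj = nn.Linear(patch_size * patch_size * in_chans, d_model)  # linear projection of flattened patches
        self.cls_token = nn.Parameter(torch.zeros(1, 1, d_model))           # [class] token used for classification
        self.pos_embed = nn.Parameter(torch.zeros(1, n_patches + 1, d_model))

    def forward(self, images):                     # images: (B, 3, H, W), H and W divisible by patch_size
        B, C, H, W = images.shape
        p = self.patch_size
        # (B, C, H/p, p, W/p, p) -> (B, H/p * W/p, p*p*C): one flattened vector per patch
        patches = images.reshape(B, C, H // p, p, W // p, p)
        patches = patches.permute(0, 2, 4, 3, 5, 1).reshape(B, -1, p * p * C)
        tokens = self.proj(patches)
        cls = self.cls_token.expand(B, -1, -1)
        return torch.cat([cls, tokens], dim=1) + self.pos_embed   # fed to a standard Transformer encoder

embed = ViTEmbeddingSketch()
tokens = embed(torch.randn(2, 3, 224, 224))        # (2, 197, 768): 196 patch tokens + 1 class token
```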

Improved algorithms of ViT
Currently, many models combine CNNs and Transformers, such as CeiT, the Swin Transformer, and ViTc; they exploit the spatial features of CNNs and the sequence-modeling ability of Transformers, considering both global and local information in images and effectively improving model quality [5][6]. Experiments on these models show that CNN + Transformer achieves better performance on multiple datasets than CNNs or Transformers alone, demonstrating the superiority of this approach.
For example, Yuan et al. proposed the Convolution-enhanced image Transformer (CeiT), which combines the strength of Transformers in modeling long-range dependencies with that of CNNs in extracting low-level features and enhancing locality [5]. To encourage correlation between neighbouring tokens in the spatial dimension, they designed an Image-to-Tokens (I2T) module that extracts patches from generated low-level features, and they replaced the feed-forward network in every encoder block with a Locally-enhanced Feed-Forward (LeFF) layer. They also introduced Layer-wise Class-token Attention (LCA), which takes the class tokens collected across different layers as input to produce a multi-level representation of the Transformer. Experimental results on ImageNet show that CeiT has excellent generalization ability, converges faster, and does not demand a large amount of training data, significantly reducing training costs.
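The locality-enhancing idea behind the LeFF layer can be sketched roughly as below: tokens are expanded, folded back onto their 2D patch grid, mixed with a depth-wise convolution, and projected down again. This is a simplified reading of the design; normalization, class-token handling, and other details of CeiT are omitted.

```python
import torch
import torch.nn as nn

class LeFFSketch(nn.Module):
    """Rough sketch of a Locally-enhanced Feed-Forward layer: patch tokens are expanded,
    rearranged onto their 2D grid, mixed locally with a depth-wise convolution, and projected back."""
    def __init__(self, d_model=192, expand=4, kernel_size=3):
        super().__init__()
        d_hidden = d_model * expand
        self.up = nn.Linear(d_model, d_hidden)
        self.dwconv = nn.Conv2d(d_hidden, d_hidden, kernel_size, padding=kernel_size // 2,
                                groups=d_hidden)          # depth-wise conv over the patch grid
        self.down = nn.Linear(d_hidden, d_model)
        self.act = nn.GELU()

    def forward(self, tokens, grid_hw):                    # tokens: (B, N, d_model), N = H * W patches
        B, N, _ = tokens.shape
        H, W = grid_hw
        x = self.act(self.up(tokens))                      # (B, N, d_hidden)
        x = x.transpose(1, 2).reshape(B, -1, H, W)         # restore the 2D patch layout
        x = self.act(self.dwconv(x))                       # local interaction between neighbouring tokens
        x = x.reshape(B, -1, N).transpose(1, 2)            # back to a token sequence
        return self.down(x)

leff = LeFFSketch(d_model=192)
out = leff(torch.randn(2, 14 * 14, 192), grid_hw=(14, 14))   # (2, 196, 192)
```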
Liu et al. proposed the Swin Transformer, which processes images with a hierarchical structure similar to CNNs, making the Transformer flexible in handling images of different scales [6]. The Swin Transformer employs a shifted-window mechanism that computes attention only within each local window, introducing the locality of CNN convolution operations. In addition, by incorporating downsampling layers, the model can handle higher-resolution images, reducing computational cost while attending to both local and global information.
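A minimal sketch of the window partitioning and window-restricted attention underlying this design is shown below; the shifted windows and relative position bias of the actual Swin Transformer are omitted for brevity.

```python
import torch
import torch.nn as nn

def window_partition(x, window_size):
    """Split a feature map (B, H, W, C) into non-overlapping windows of shape
    (num_windows * B, window_size * window_size, C)."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size, W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size * window_size, C)

# Window-based attention: self-attention is computed only inside each window, so the
# cost grows with the number of windows instead of quadratically with all image tokens.
B, H, W, C, window_size = 2, 56, 56, 96, 7
features = torch.randn(B, H, W, C)
windows = window_partition(features, window_size)          # (B * 64, 49, 96)
attn = nn.MultiheadAttention(embed_dim=C, num_heads=3, batch_first=True)
out, _ = attn(windows, windows, windows)                    # attention restricted to each 7x7 window
```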
Han et al. proposed an innovative Transformer-in-Transformer (TNT) model that models representations at both the pixel level and the patch level [7]. An outer Transformer processes the patch embeddings, while an inner Transformer extracts local features from pixel embeddings, which are then added to the patch embeddings through a linear projection. Multiple TNT blocks are stacked to build the TNT model. Experimental results showed that TNT outperformed the most advanced ViT models by around 1.7% at comparable computational cost, achieving a Top-1 accuracy of 81.5% on ImageNet.
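The two-level design can be sketched as follows, under the simplifying assumption that standard PyTorch encoder layers stand in for the inner and outer Transformers; the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TNTBlockSketch(nn.Module):
    """Rough sketch of one Transformer-in-Transformer block: an inner transformer refines
    pixel-level (sub-patch) embeddings, whose projection is added to the patch-level
    embeddings processed by the outer transformer."""
    def __init__(self, pixel_dim=24, patch_dim=384, n_pixels=16, heads=4):
        super().__init__()
        self.inner = nn.TransformerEncoderLayer(pixel_dim, heads, dim_feedforward=4 * pixel_dim,
                                                batch_first=True)
        self.proj = nn.Linear(n_pixels * pixel_dim, patch_dim)   # pixel embeddings -> patch embedding
        self.outer = nn.TransformerEncoderLayer(patch_dim, heads, dim_feedforward=4 * patch_dim,
                                                batch_first=True)

    def forward(self, pixel_tokens, patch_tokens):
        # pixel_tokens: (B * num_patches, n_pixels, pixel_dim); patch_tokens: (B, num_patches, patch_dim)
        B, num_patches, _ = patch_tokens.shape
        pixel_tokens = self.inner(pixel_tokens)                       # local, pixel-level attention
        fused = self.proj(pixel_tokens.flatten(1)).view(B, num_patches, -1)
        patch_tokens = self.outer(patch_tokens + fused)               # global, patch-level attention
        return pixel_tokens, patch_tokens

block = TNTBlockSketch()
pix = torch.randn(2 * 196, 16, 24)     # 2 images, 14x14 patches, 16 sub-patch "pixels" each
pat = torch.randn(2, 196, 384)
pix, pat = block(pix, pat)
```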
Overall, CNNs extract features with shared convolution kernels; local connections and weight sharing reduce the number of parameters and the training time and improve computational efficiency. Transformers, on the other hand, rely on more flexible self-attention layers and outperform CNNs in extracting global semantic information, achieving a higher performance upper bound [4]. Combining the CNN and Transformer frameworks therefore enhances algorithm performance.

Detection transformer
The Detection Transformer (DETR) treats object detection as a set prediction problem within an end-to-end framework [8]. It simplifies the whole object detection pipeline and directly outputs the final prediction set in parallel, based on the relationships between objects and the global image context (Figure 3). DETR predicts all targets at once, pairs the predicted targets with the ground-truth targets, and trains the model end to end with a set loss function. By removing a large number of hand-designed modules, DETR requires no custom network layers, so it can be reproduced in any framework that supports CNNs and Transformers. Object detection plays an important role in many computer vision scenarios. In medical image detection, a DenseNet-41 customized CornerNet network has been introduced to extract deep features and to classify and localize brain tumors. In remote sensing image detection, DETR has been improved into a sparse representation model based on the Sparse Transformer and the K-means algorithm, which learns sparse feature clustering and adapts to the various shapes and distributions of rotated targets in remote sensing images. For aerial image detection, the Transformer has been introduced into the backbone network, with weights assigned to multi-scale feature maps through improved spatial-channel attention, focusing on areas where small targets aggregate and improving the fusion of small targets. Object detection is an important direction of computer vision research; because its scenes and objects are complex, it has promising applications in daily life.
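The set-prediction view can be illustrated with a toy version of the bipartite matching step: a cost matrix between predicted queries and ground-truth objects is built from classification and box terms and solved with the Hungarian algorithm. The real DETR loss also includes a generalized IoU term; the weights and shapes below are illustrative.

```python
import torch
from scipy.optimize import linear_sum_assignment

def match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """Toy version of DETR's set matching for one image: build a cost matrix between
    the N predicted queries and the M ground-truth objects, then solve the bipartite
    assignment with the Hungarian algorithm."""
    prob = pred_logits.softmax(-1)                          # (N, num_classes)
    cls_cost = -prob[:, gt_labels]                          # (N, M): higher class prob -> lower cost
    box_cost = torch.cdist(pred_boxes, gt_boxes, p=1)       # (N, M): L1 distance between boxes
    cost = cls_cost + box_weight * box_cost
    pred_idx, gt_idx = linear_sum_assignment(cost.detach().numpy())
    return pred_idx, gt_idx                                 # matched pairs used to compute the set loss

# Example: 100 object queries, 3 ground-truth objects with (cx, cy, w, h) boxes.
pred_logits, pred_boxes = torch.randn(100, 92), torch.rand(100, 4)
gt_labels, gt_boxes = torch.tensor([3, 17, 42]), torch.rand(3, 4)
pred_idx, gt_idx = match_predictions(pred_logits, pred_boxes, gt_labels, gt_boxes)
```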

Improved algorithms of DETR
Unlike many detection algorithms, DETR does not require a dedicated library. Tested on the COCO dataset, DETR reaches an average precision of 42%, exceeding Faster R-CNN in both speed and accuracy. However, its performance on small targets is relatively low, with an average precision of only 20.5%, and its computational complexity on high-resolution images is extremely high, which makes DETR impractical in those settings. Zhang et al. proposed the ACT model, an adaptive clustering attention mechanism that uses locality-sensitive hashing to cluster query features adaptively and approximates query-key interactions with prototype-key interactions [8]. ACT is an embeddable module that requires no additional training: it replaces the self-attention module in DETR, reduces its cost, and is fully compatible with the original Transformer. To address the shortcomings of DETR, Zhu et al. proposed Deformable DETR, which combines the sparse spatial sampling of deformable convolution with the strong relation-modeling capability of the Transformer [9]. In Deformable DETR, a deformable attention module replaces the attention module in the Transformer for processing feature maps: each query attends to a small set of sampled points regardless of the feature map size, which greatly reduces computational complexity. Numerous experiments on the Common Objects in Context (COCO) dataset show that Deformable DETR achieves better performance than DETR, especially on small objects, while reducing training time by a factor of 10 [9]. As shown in Table 1, DETR has lower detection performance than Faster R-CNN on small targets and requires a longer training time to converge, whereas it outperforms Faster R-CNN on large targets.
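The core idea of deformable attention, letting each query attend to a few sampled points instead of the whole feature map, can be sketched as follows. This is a simplified single-head, single-scale version; the offset scaling and multi-scale aggregation of the actual Deformable DETR are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Single-head, single-scale sketch of deformable attention: each query attends
    to n_points sampled locations instead of the full feature map."""
    def __init__(self, dim, n_points=4):
        super().__init__()
        self.n_points = n_points
        self.offset_proj = nn.Linear(dim, n_points * 2)   # predicted (x, y) offsets per query
        self.weight_proj = nn.Linear(dim, n_points)       # attention weight per sampled point
        self.value_proj = nn.Linear(dim, dim)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries: (B, Nq, C); ref_points: (B, Nq, 2) in [0, 1]; feat_map: (B, C, H, W)
        B, Nq, C = queries.shape
        value = self.value_proj(feat_map.flatten(2).transpose(1, 2))      # (B, H*W, C)
        value = value.transpose(1, 2).reshape(B, C, *feat_map.shape[2:])  # back to (B, C, H, W)
        offsets = self.offset_proj(queries).view(B, Nq, self.n_points, 2)
        weights = self.weight_proj(queries).softmax(-1)                   # (B, Nq, n_points)
        # Sampling locations in normalized [-1, 1] coordinates for grid_sample.
        loc = (ref_points.unsqueeze(2) + offsets).clamp(0, 1) * 2 - 1     # (B, Nq, n_points, 2)
        sampled = F.grid_sample(value, loc, align_corners=False)          # (B, C, Nq, n_points)
        out = (sampled * weights.unsqueeze(1)).sum(-1).transpose(1, 2)    # (B, Nq, C)
        return self.out_proj(out)

attn = SimpleDeformableAttention(dim=256)
out = attn(torch.randn(2, 100, 256), torch.rand(2, 100, 2), torch.randn(2, 256, 32, 32))
```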

Transformer for image generation
Image generation is the process of generating a target image from an input vector, which can be random noise or a user-specified conditional vector. The mainstream methods for image generation include Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). A GAN trains a generator and a discriminator through adversarial optimization and produces high-quality images, but its training process is unstable and prone to mode collapse. A VAE is a generative model built from an encoder and a decoder; its generation process is more stable and its images smoother, but the diversity and fidelity of the generated images fall short of GANs.

Parmar et al. proposed the Image Transformer, an image generation model influenced by both CNNs and Transformers [10]. It predicts the value of each pixel of the output image sequentially, conditioned on the previously generated pixels, as shown in Figure 5 [11]. The image is divided into a spatial grid of blocks, called query blocks. In each generation step, every pixel in the self-attention query block attends to all pixels in the memory block. Their method factorizes the joint distribution of image pixels into a product of per-pixel conditional distributions and enlarges the receptive field using the self-attention mechanism [11]. However, the approach has obvious drawbacks: it requires a large amount of memory and is relatively hard to parallelize, because each pixel prediction depends on the previous ones, which can lead to long training times.
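The sequential factorization described above can be summarized by the following schematic loop, where predict_next stands in for a trained autoregressive model (a hypothetical callable, not an API of the cited work):

```python
import torch

def autoregressive_pixels(predict_next, height, width, levels=256):
    """Schematic of autoregressive image generation: pixels are produced one at a time,
    each conditioned on all previously generated pixels. `predict_next` is assumed to map
    the sequence generated so far (possibly empty) to a distribution over pixel values."""
    pixels = []
    for _ in range(height * width):
        context = torch.tensor(pixels, dtype=torch.long)          # everything generated so far
        probs = predict_next(context)                             # (levels,) distribution for the next pixel
        pixels.append(torch.multinomial(probs, 1).item())         # sample the next pixel value
    return torch.tensor(pixels).view(height, width)               # sequential, hence hard to parallelize
```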
Chang et al. proposed a novel image synthesis model, the Masked Generative Image Transformer (MaskGIT), which uses a bidirectional Transformer decoder to generate images with a masking technique, mimicking the way a human painter gradually fills in a picture [12]. For tokenization, MaskGIT follows the Vector Quantized Generative Adversarial Network (VQGAN) idea and learns an encoder-decoder pair [13]. In the generation stage, it starts from a masked token map and makes predictions for the masked positions, randomly selecting values to simulate the generation process. Through Masked Visual Token Modeling (MVTM), a bidirectional Transformer is learned to generate images consistent with the mask, completing image creation. Experiments on the ImageNet dataset show that MaskGIT significantly outperforms state-of-the-art Transformer models and speeds up autoregressive decoding by up to 64 times. Moreover, MaskGIT can easily be extended to further image editing operations such as inpainting, extrapolation, and image manipulation.
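The contrast with autoregressive decoding can be sketched as below: MaskGIT-style decoding fills in all masked tokens in parallel and re-masks the least confident ones according to a schedule. The transformer argument is a hypothetical callable, and the cosine schedule and greedy sampling here are simplifications of the published method.

```python
import math
import torch

def maskgit_decode_sketch(transformer, codebook_size, seq_len, steps=8):
    """Schematic of MaskGIT-style parallel decoding: start from an all-masked token grid
    and, over a few steps, keep the most confident predictions while re-masking the rest.
    `transformer` is assumed to map token ids (with a special MASK id) to per-position logits."""
    MASK = codebook_size                                    # reserve one extra id for the mask token
    tokens = torch.full((1, seq_len), MASK, dtype=torch.long)
    for t in range(steps):
        logits = transformer(tokens)                        # (1, seq_len, codebook_size)
        probs = logits.softmax(-1)
        confidence, sampled = probs.max(-1)                 # greedy choice and its confidence
        # Positions already decided are given a confidence > 1 so they are never re-masked.
        confidence = torch.where(tokens == MASK, confidence, torch.full_like(confidence, 2.0))
        # Cosine schedule: the fraction of tokens still masked shrinks each step.
        n_mask = int(seq_len * math.cos(math.pi / 2 * (t + 1) / steps))
        tokens = torch.where(tokens == MASK, sampled, tokens)    # fill in current predictions
        if n_mask > 0:
            low_conf = confidence.topk(n_mask, largest=False).indices
            tokens[0, low_conf[0]] = MASK                    # re-mask the least confident positions
    return tokens                                            # decoded visual tokens for the VQGAN decoder
```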

Transformer for image super-resolution
The goal of image super-resolution (SR) is to recover natural and realistic textures from low-resolution, degraded images. Successful super-resolution can significantly raise the quality of media content and improve the user experience, and it is widely used in satellite imaging, medical imaging, digital zoom, and ultra-high-definition television.
Current research falls into two main subfields: single image super-resolution (SISR) and reference-based image super-resolution (RefSR). Deep-learning-based SISR treats the problem as a challenging image regression task that maps low-resolution inputs to high-resolution outputs; typical methods include the Super-Resolution Convolutional Network (SRCNN), Very Deep Super-Resolution (VDSR), and the Deeply-Recursive Convolutional Network (DRCN). However, traditional SISR often fails because the high-resolution texture is too heavily damaged during degradation to be recovered, resulting in blurred images. GAN-based super-resolution has been used to alleviate this problem, but the hallucinations and artifacts introduced by GANs pose serious challenges for the super-resolution task. More recently, reference-based super-resolution (RefSR), which transfers high-resolution (HR) textures from an existing reference (Ref) image, has been developed to obtain better visual results. However, state-of-the-art (SOTA) algorithms often transfer texture information in a direct manner, which leads to suboptimal super-resolution quality.
To address this issue, Yang et al. proposed a Texture Transformer Network for Image Super-Resolution (TTSR) [14]. Figure 6 compares the results of TTSR with the most recent RefSR approach under 4x magnification: TTSR learns to search the Ref image (in green) for textures relevant to the target LR region (in yellow), avoiding incorrect texture transfer (in red). TTSR contains four modules. First, a learnable texture extractor is proposed, whose parameters are updated in real time during end-to-end training; this enables joint feature embedding of the Ref and LR images. Second, a relevance embedding module is designed to compute the correlation between the reference image and the low-resolution image. On this basis, the LR and Ref features are formulated as the query and key of the transformer. Finally, a hard attention module and a soft attention module transfer the high-resolution texture features of the Ref image. As a result, TTSR provides a more accurate way to find and transfer the relevant textures in Ref images. Yang et al. also proposed a cross-scale feature integration module that stacks texture transformers, in which features are learned across several scales (for example, from 1x to 4x) to enhance the feature representation. The overall design, shown in Figure 7, enables TTSR to detect and transfer textures in Ref images (illustrated in green) and to obtain better visual effects than the SOTA methods.
At this stage, Chen et al. found that some Transformer-based SR models are still less effective than Very Deep Residual Channel Attention Networks (RCAN) because of the limited range of information they exploit [15]. This shows that the Transformer is adept at modeling local data, but it also highlights the need to broaden the use of information within the Transformer. To address this issue, X. Chen et al. introduced the Hybrid Attention Transformer (HAT). The authors' HAT mixes channel attention and self-attention, benefiting from the former's ability to use global information and the latter's strong representational power.
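The hard-attention texture transfer at the heart of this family of models can be sketched as follows: LR↑ features serve as queries, Ref↓↑ features as keys, and Ref features as values, and each LR position copies the best-matching reference texture feature. In TTSR the features are extracted patch-wise by the learnable texture extractor; the flat feature matrices below are a simplification.

```python
import torch
import torch.nn.functional as F

def hard_attention_transfer(lr_up_feat, ref_downup_feat, ref_feat):
    """Toy sketch of hard-attention texture transfer (TTSR-like): features of the up-sampled
    LR image act as queries, features of the degraded reference (Ref downsampled then upsampled)
    act as keys, and features of the original Ref act as values."""
    q = F.normalize(lr_up_feat, dim=1)                       # (N_lr, C), unit-norm feature vectors
    k = F.normalize(ref_downup_feat, dim=1)                  # (N_ref, C)
    relevance = q @ k.t()                                     # cosine similarity between every Q/K pair
    soft, hard = relevance.max(dim=1)                         # best-matching Ref position per LR position
    transferred = ref_feat[hard]                              # hard attention: copy the matched texture feature
    return transferred, soft                                  # soft score later weights the fusion with LR features

# Example with 1024 LR positions, 4096 reference positions, and 64-dimensional features.
lr_up = torch.randn(1024, 64)
ref_downup, ref = torch.randn(4096, 64), torch.randn(4096, 64)
transferred, soft = hard_attention_transfer(lr_up, ref_downup, ref)   # (1024, 64), (1024,)
```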

Conclusion
This paper introduced the basic principles and structure of the Transformer model, illustrated the function of each component in terms of internal structure and performance, and summarized the key research problems. In computer vision systems, the attention mechanism plays a key role in the efficiency and performance gains of the Transformer. This paper introduced the applications of the Transformer in visual tasks, including object detection, image classification, image super-resolution, and image generation, and systematically highlighted the main advantages and limitations of existing methods. It also found that improving the efficiency and performance of Transformer models combined with CNNs remains one of the main research directions for the future. The long-range modeling capability and dynamic response characteristics of the Transformer give it strong feature-learning ability, and applications based on the Transformer model remain a hot area in computer vision.

Figure 1. The 6-layer encoder-decoder stack and the Transformer model structure.


Figure 3. Object detection structure. DETR also introduced a new objective function that forces the model to output a unique set of predictions through bipartite matching with the ground truth. DETR first extracts image features using a convolutional neural network, then adds position encodings to the extracted features and feeds them into the Transformer encoder (Figure 4 [4]). A set of object queries is then fed into the Transformer decoder together with the encoder output. Each Transformer decoder output is sent to a feed-forward network for processing and decoding, which finally produces a number of prediction boxes. The predicted boxes are then matched one-to-one with the ground-truth boxes and the detection loss is computed based on the Hungarian algorithm, which effectively removes the need for non-maximum suppression (NMS) post-processing.

Figure 4. DETR structure.

Figure 5. Image Transformer structure. In (a), the self-attention block of the Image Transformer produces each new pixel value by considering only the previously generated pixel values, and then feeds the result into a feed-forward subnetwork (Figure 5 [11]). Panel (b) shows the procedure of local self-attention in 2D.

Figure 6. Comparison of the results of different models.

Figure 7. The structure of the texture transformer. LR↑, LR, and Ref denote the 4x bicubic-upsampled input image, the input image, and the reference image, respectively. Applying bicubic down-sampling and up-sampling with the same factor of 4 to Ref in sequence yields Ref↓↑, which is domain-consistent with LR↑. The texture transformer takes Ref, Ref↓↑, LR↑, and the LR features produced by the backbone as input and outputs a feature map of the same size, which is then used to produce the HR prediction.

Figure 8. The overall architecture of HAT and the structure of RHAG and HAB. As shown above, each residual hybrid attention group (RHAG) consists of three components: a 3x3 convolution layer, an overlapping cross-attention block (OCAB), and M hybrid attention blocks (HAB). The hybrid attention block (HAB) activates more pixels when channel attention is used, because computing the channel attention weights involves global information. As numerous studies have shown, convolution can also help the Transformer achieve better visual representations or easier optimization. To further improve the network's representational capability, the authors insert a convolution block based on channel attention into the standard Transformer block. According to Figure 8, the window-based multi-head self-attention (W-MSA) module and the channel attention block (CAB) are both placed after the first LayerNorm (LN) layer of the standard Swin Transformer block. Note that shifted-window self-attention (SW-MSA) is employed at regular intervals in consecutive HABs. To avoid possible conflicts between CAB and MSA in optimization and visual representation, the output of the CAB is multiplied by a small constant alpha.

Table 1. Performance comparison of object detection algorithms.