Study on a visual coder acceleration algorithm for image classification applying dynamic scaling training techniques

Image classification techniques are now widely used in the field of autonomous vehicles. Convolutional Neural Networks (CNNs) have been used extensively, but Vision Transformer (ViT) networks are increasingly used in place of deep convolutional networks in order to compress the network size and improve model accuracy. Since training ViT requires a very large dataset to reach sufficient accuracy, this paper uses a ViT variant, Data-Efficient Image Transformers (DeiT). In addition, in order to greatly reduce computing memory and shorten computing time in practical use, the network is flexibly scaled in size and training speed through both adaptive width and adaptive depth. This paper introduces DeiT, width-adaptive techniques, and depth-adaptive techniques, and combines them for application to image classification. Experiments conducted on the CIFAR-100 dataset demonstrate the advantages of the algorithm in image classification scenarios.


Introduction
For a long time, CNN networks have dominated image processing tasks; however, their accuracy is still not always sufficient, and in pursuit of higher accuracy it was proposed to apply the Transformer [1] to image classification [2]. Three-dimensional image data must be transformed into serialized data to be processed by a Transformer, and this Transformer architecture for image data is called ViT. However, a very large dataset is required in the ViT pre-training phase, which also consumes a great deal of computational resources and time. DeiT [3] was therefore proposed: its network structure is the same Transformer architecture, but it differs in the training strategy, using knowledge distillation so that far less pre-training data is needed to match the accuracy of ViT.
The common model compression methods are quantization, weight sharing, pruning, and distillation. Quantization replaces model parameters with FP16 or INT8 representations, but INT8 is not yet widespread because it incurs a larger accuracy loss. Weight sharing lets the neurons in each layer share the same parameters. Both methods are applied in ALBERT [4], but the speedup is not significant. Pruning is the most direct method: weights, neurons, or whole structures of the model are removed to reduce the number of operations. It is divided into weight pruning, neuron pruning, and structural pruning. Weight pruning sets the weight of a connection to 0 to obtain a sparse matrix; neuron pruning [5] removes an entire vector from the weight matrix; structural pruning operates at the model level and can remove a whole layer. The disadvantage of pruning is that we do not know in advance which weights, neurons, or layers are unimportant, and accuracy often drops after pruning, so a mechanism is needed to determine what exactly should be pruned.
Distillation, on the other hand, lets a small model learn the knowledge of a large model during training; only the small model is used at prediction time, and distillation has worked well in practice.
In practice, reducing the number of network layers by pruning combined with distillation is a common and effective speedup technique. However, since different downstream tasks require different prediction speedups, it is very inconvenient to spend a long time fine-tuning and distilling separately for each downstream task [6]. A large number of image classification techniques are applied in autonomous driving, but the vehicle's control center has to process a very large amount of information, and few computational resources are allocated to image classification. Deploying a network with few parameters leads to insufficient classification accuracy, while deploying a large, deep network with many parameters consumes too many computational resources. We therefore choose a ViT network instead of a CNN; because pre-training ViT requires a huge dataset, we use the DeiT network, and we apply width- and depth-adaptive techniques [8] to compress and distill multiple sub-networks simultaneously during pre-training so that they better match different downstream tasks.

Vision Transformer
Transformer Encoder. ViT is the model obtained by stacking multiple layers of Transformer Encoder. Let $a_i$ denote the $i$-th input vector to the self-attention layer. Three learnable matrices $W^q$, $W^k$, $W^v$, shared across all positions, map each input to a query, key, and value (a step called mapping):
$$q_i = W^q a_i, \quad k_i = W^k a_i, \quad v_i = W^v a_i.$$
Each query $q_i$ is then matched against every key $k_j$ (and each key is matched by every query), so $q_i$, $k_i$, $v_i$ represent the information extracted from $a_i$. Since the computation of the Transformer is parallel across positions, it can be written in matrix form: the vectors $q_i$, $k_i$, $v_i$ are stacked as the columns of matrices $Q$, $K$, $V$ respectively, and the attention weights become
$$\alpha = \mathrm{softmax}(K^{\top} Q),$$
where the softmax is applied to each column of $K^{\top} Q$, so that the weights attending to a single query sum to 1. The final output is
$$B = V \alpha,$$
a matrix that combines information from all the inputs; the effect is similar to that of a convolutional network, which is why the CNN can be replaced by a Transformer architecture.
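The matrix formulation above can be sketched directly in NumPy. This is a minimal single-head illustration of the equations $Q = W^q A$, $\alpha = \mathrm{softmax}(K^{\top}Q)$, $B = V\alpha$, with inputs stacked as columns; the function name and shapes are illustrative, not from the paper.

```python
import numpy as np

def self_attention(A, Wq, Wk, Wv):
    """Single-head self-attention in the column-vector formulation:
    each column of A is one input vector a_i."""
    Q = Wq @ A           # queries, one column per input
    K = Wk @ A           # keys
    V = Wv @ A           # values
    scores = K.T @ Q     # alpha before normalisation
    # softmax over each column so the weights for one query sum to 1
    scores = scores - scores.max(axis=0, keepdims=True)
    alpha = np.exp(scores)
    alpha = alpha / alpha.sum(axis=0, keepdims=True)
    return V @ alpha     # B = V * alpha, one output column per input
```

With all-zero query and key matrices the attention weights become uniform, so every output column is the mean of the value columns, which is a handy sanity check.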

Multi-head self-attention.
The first steps of multi-head self-attention (MSA) are the same as in single-head self-attention: each input $a_i$ is mapped to $q_i$, $k_i$, $v_i$. These vectors are then split along the feature dimension into groups, one per head (indicated by a second subscript), and each head performs the same operation as single-head self-attention on its own group. The $j$-th head thus produces its own output matrix; the outputs of all heads are concatenated and multiplied by a learnable matrix $W$, giving the output of the MSA.
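The head-splitting described above can be sketched as follows. This is a hypothetical NumPy illustration (the function name and the even split of feature rows across heads are assumptions): each head attends within its own slice of the query/key/value rows, and the concatenated results are mixed by a learnable output matrix `Wo`.

```python
import numpy as np

def multi_head_attention(A, Wq, Wk, Wv, Wo, num_heads):
    """MSA sketch: q/k/v feature dimensions are split into num_heads
    groups; each group runs self-attention independently; the
    concatenated result is projected by the learnable matrix Wo."""
    d = Wq.shape[0]
    assert d % num_heads == 0, "feature dim must divide evenly"
    h = d // num_heads
    Q, K, V = Wq @ A, Wk @ A, Wv @ A
    outs = []
    for j in range(num_heads):            # head j uses rows j*h:(j+1)*h
        q, k, v = Q[j*h:(j+1)*h], K[j*h:(j+1)*h], V[j*h:(j+1)*h]
        s = k.T @ q
        s = s - s.max(axis=0, keepdims=True)
        a = np.exp(s)
        a = a / a.sum(axis=0, keepdims=True)
        outs.append(v @ a)                # per-head self-attention output
    return Wo @ np.vstack(outs)           # concatenate heads, then project
```

With `num_heads=1` and `Wo` set to the identity, this reduces exactly to the single-head formulation.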

The special features of ViT
The input to a Transformer must be one-dimensional vectors, but the dimension of a whole image is far too large, so a large image is first cut into multiple patches of the same size, each of which is then flattened into a vector before being fed to the network. Two hyperparameters are defined: the patch-size, i.e. how big the blocks are that the image is divided into, and the stride, which has the same meaning as the stride in a CNN. After the image is divided, each patch is stretched into a one-dimensional vector (for example with the rearrange function of the einops package), because the Transformer architecture directly accepts one-dimensional vector input. ViT uses position encoding because the input vectors are related by their positions: if each vector were simply fed into the network directly, permuting the image blocks within an image would leave the corresponding outputs unchanged, which is equivalent to ignoring the position information of each vector. To use this information, a position encoding is added to each input vector. In ViT, a learnable cls-token, written as the [CLS] symbol, is prepended as the first input vector. The reason for using this symbol is that after passing through the ViT network it becomes a vector that is fed into an MLP classifier to output the class probabilities. The final output thus depends on the [CLS] symbol rather than on the mapped vector of any particular patch, which avoids biasing the output towards one input. Although ViT outputs multiple vectors, only one is useful to us: the output vector corresponding to the [CLS] symbol. This output is the final classification result of the network.
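The patching and [CLS]-prepending steps above can be sketched without any framework. This is a minimal illustration assuming non-overlapping patches (stride equal to patch size); the function names are illustrative, and the reshape/transpose pair mimics what an einops rearrange would do.

```python
import numpy as np

def patchify(img, patch):
    """Cut an H x W x C image into non-overlapping patch x patch
    blocks and flatten each block into a one-dimensional vector."""
    H, W, C = img.shape
    assert H % patch == 0 and W % patch == 0
    ph, pw = H // patch, W // patch
    x = img.reshape(ph, patch, pw, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)          # group pixels by patch position
    return x.reshape(ph * pw, patch * patch * C)

def prepend_cls(tokens, cls_token):
    """Prepend the learnable [CLS] vector as the first input token."""
    return np.vstack([cls_token[None, :], tokens])
```

A 4x4x3 image with patch size 2 yields four 12-dimensional tokens, and five tokens once the [CLS] vector is prepended.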
During training, this output is compared with the ground-truth label of the image using cross-entropy; the gradient of each parameter is then computed using the chain rule and the back-propagation algorithm, and the trainable parameters are updated.

Commonly used modules of deep neural networks: MLP, Add & Norm layers, Layer Norm
Here the MLP [9] is simply a fully connected layer + GELU activation function [10] + Dropout [11]; its effect is to enhance the fitting capacity of the model, and deep neural networks generally include MLP layers. The Add & Norm structure is borrowed from the residual connection of the well-known ResNet [12]: the original input vectors are summed with the vectors after MSA, preserving the original information and helping the model converge. It also prevents the degradation of deep networks. We do not know the optimal number of layers in advance, and if more layers than optimal are stacked, the extra layers should realise an identity mapping, i.e. input equal to output; learning an identity directly is difficult, but with a residual connection an extra layer only has to drive its own output to 0 to achieve the same effect, so deep networks generally use residual connections. Deep networks also generally add a Layer Norm [13] or Batch Norm layer; because the Transformer architecture processes sequences similarly to an RNN [14], Layer Norm is used instead of Batch Norm. Layer Norm improves the training speed and accuracy of the model and makes it more robust.
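The Add & Norm pattern described above can be sketched in a few lines. This is an illustrative NumPy version (tokens as rows, weight shapes and the tanh GELU approximation are assumptions); `msa` stands for any callable implementing the multi-head self-attention sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token vector over its own features (Layer Norm)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def encoder_block(x, msa, W1, b1, W2, b2):
    """One encoder block: residual connection around the MSA sub-layer,
    then around the MLP (Linear -> GELU -> Linear), each followed by
    Layer Norm (the post-norm arrangement is assumed here)."""
    x = layer_norm(x + msa(x))               # Add & Norm after MSA
    h = gelu(x @ W1 + b1)                    # MLP hidden layer
    return layer_norm(x + (h @ W2 + b2))     # Add & Norm after MLP
```

Note how the residual sum `x + ...` lets a sub-layer fall back to the identity simply by outputting zeros, which is exactly the degradation argument above.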

DEIT
Because ViT requires a huge amount of pre-training data, and the JFT dataset is privately owned by Google, we cannot use that dataset for pre-training; even if we could, the computational resources consumed would be too large. The usual practice is therefore to take ViT's open-source pre-trained model and fine-tune it for downstream tasks. Instead, we used the DeiT model (untrained), which can be pre-trained with a much smaller dataset while achieving comparable accuracy.
The structure of the DeiT model is the same as ViT; the differences lie in the training method. The only structural difference in the input is that the last input vector of DeiT is an additional [CLS]-like token called the distillation token. Its output vector is fed to an MLP classifier together with the class-token output vector, and the output of the MLP classifier is the final classification result.
The total loss function is mainly used by the BP algorithm to update the parameters. $\psi(Z_s)$ is the class probability distribution of the student network, i.e. the result of applying the softmax $\psi$ to the class token after it passes through all the Transformer encoder layers, and $y$ is the ground truth. Knowledge distillation [7] achieves knowledge transfer by introducing soft targets associated with the teacher network (teacher model: complex, but with superior inference performance) as part of the total loss, to guide the training of the student network (student model: streamlined, low complexity).
The soft targets come from the temperature-scaled softmax
$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)},$$
where $T = 1$ recovers the conventional softmax. With $T > 1$, the output of the teacher network becomes a smooth probability distribution (often called a soft target or soft label) with values between 0 and 1, more moderate than the conventional softmax output. The larger the temperature $T$, the smoother the distribution; as $T$ decreases, the distribution approaches a one-hot vector, which carries less information, is easier to misclassify, and can introduce unnecessary noise. The hard label is the true label of the sample, represented as a one-hot vector.
The BP algorithm is run on the total loss function to update the trainable parameters. The total loss is designed as a weighted average of the cross-entropies corresponding to the soft and hard targets,
$$L = \alpha \, T^2 \, \mathrm{CE}\big(\psi(Z_t / T), \psi(Z_s / T)\big) + \beta \, \mathrm{CE}\big(y, \psi(Z_s)\big),$$
where $Z_t$ and $Z_s$ are the teacher and student logits. A larger weight $\alpha$ on the soft-target cross-entropy means knowledge transfer relies more on the teacher network, which is desirable early in training: the teacher's output is smoother and contains more information than the ground-truth labels, including dark knowledge beyond the maximum-probability class [15] (for example, which class is least probable), helping the student network recognise easy samples. The weight $\beta$ of the hard target should be increased appropriately later in training, so that the ground-truth labels help the student identify difficult samples, i.e. easily confused images. In addition, the inference performance of the teacher network is usually better than that of the student network; there is no strict limit on model capacity, and usually the more parameters the teacher network has, the better it fits and the more the student network benefits [16].
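The temperature softmax and the weighted soft/hard loss above can be combined in a short sketch. This is an illustrative NumPy version of Hinton-style distillation (function names, the default $T$ and the $\beta = 1 - \alpha$ convention are assumptions, not taken from the paper).

```python
import numpy as np

def softmax_T(z, T=1.0):
    """Temperature-scaled softmax; T=1 is the conventional softmax,
    larger T gives a smoother (softer) distribution."""
    z = z / T
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(z_student, z_teacher, y, T=3.0, alpha=0.7):
    """Weighted average of soft-target and hard-target cross-entropies;
    alpha weights the teacher's soft labels and (1 - alpha) plays the
    role of beta for the ground-truth label y. The T**2 factor keeps
    the soft-term gradients on a comparable scale."""
    p_t = softmax_T(z_teacher, T)            # smooth teacher labels
    p_s = softmax_T(z_student, T)
    soft = -np.sum(p_t * np.log(p_s))        # soft cross-entropy
    hard = -np.log(softmax_T(z_student, 1.0)[y])
    return alpha * (T**2) * soft + (1 - alpha) * hard
```

Setting $T = 1$ and $\alpha = 0$ reduces the loss to the ordinary cross-entropy against the hard label, which matches the formula above term by term.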

Adaptive Width
The overall training of DynaDeiT is divided into two stages: first training the width-adaptive DynaDeiTw, then the width+depth-adaptive DynaDeiT. The width-adaptive training starts by sorting the attention heads of the MSA and the neurons of the MLP in each layer according to Network Rewiring [17]; once the sorting is completed, we obtain an initial, fixed teacher network [18]. The importance score is computed as the change between the loss before a head or neuron is removed and the loss without it: the larger the change, the more important the head or neuron. Before pruning, let a be the number of attention heads and b the number of neurons in the MLP layer; after pruning heads and neurons separately with width multiplier w, the number of heads becomes wa and the number of neurons wb. The multiple pruned student models are called DynaDeiTw, and all of them are trained with the distillation algorithm. The total distillation loss used here consists of three parts: (1) the input vectors after image patching and linear transformation, (2) the output vectors of the last Transformer Encoder layer, and (3) the outputs of the class token and distillation token after the MLP classifier. Finally, the trained DynaDeiTw models serve as the teacher models in the next stage. According to our experimental setup, we find that despite less data and fewer computational resources, DeiT still produces higher-performance image classification models, and the overall results remain good.
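The importance-ranking step above can be sketched as follows. This is a hypothetical illustration of the idea only (the function names and the mask-based loss probe are assumptions, not the paper's Network Rewiring implementation): a component's importance is the increase in loss when it is masked out, and a width multiplier w decides how many top-ranked components survive.

```python
import numpy as np

def rank_by_importance(loss_fn, n_components):
    """Score each head/neuron by the loss increase when it is masked
    out, then return indices sorted from most to least important.
    loss_fn(mask=None) gives the unpruned loss; loss_fn(mask=i) the
    loss with component i removed."""
    base = loss_fn(mask=None)
    scores = [loss_fn(mask=i) - base for i in range(n_components)]
    return np.argsort(scores)[::-1]          # most important first

def keep_top(order, w):
    """Width-adaptive pruning: keep a fraction w of the components,
    discarding the least important ones."""
    k = max(1, int(round(w * len(order))))
    return order[:k]
```

Sorting once and then slicing by different multipliers w is what allows many DynaDeiTw sub-networks to share one ranking instead of re-pruning from scratch for every width.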
DeiT gives better results than the CNN probably because DeiT is based on a modified version of ViT and therefore has a Transformer architecture. DeiT can focus on the relatively important regions instead of treating all pixels equally as a CNN does, regardless of whether a pixel's content is important. Convolution also struggles to relate spatially distant concepts: each convolution filter is restricted to a small region, yet long-range interaction between semantic concepts is crucial. CNN performance is also closely tied to the size of the dataset. In addition, DeiT uses knowledge distillation, making full use of the teacher model's dark knowledge, and its loss function is more reasonable.
DeiT gives better results than ViT because, although the overall structure is similar, DeiT learns the image transformer through knowledge distillation, which genuinely improves its ability to process image data; the operations related to knowledge distillation are therefore well worth adopting.
DeiT with width- and depth-adaptive training performs slightly worse than plain DeiT, but the difference is small, while we use less training time and lower hardware cost, so width- and depth-adaptive training is worthwhile. The drop occurs because the student model is cropped and has fewer neurons, so its fitting capacity is not as strong.

Conclusions
This paper presents a dynamic, knowledge-distillation-based visual coder acceleration technique for image classification, combining DeiT with width- and depth-adaptive training. Because pre-training ViT needs a huge dataset and a multi-GPU environment, we use the DeiT model to reduce the computational cost, and to further compress DeiT we apply width- and depth-adaptive training, obtaining good results on the CIFAR-100 test set while greatly reducing pre-training time and computational cost. After completing the experiments, the pre-training of DeiT is indeed faster than that of a traditional CNN or ViT, as expected, because of the added width- and depth-adaptive training techniques. Moreover, the results of DeiT are slightly better than the CNN even though we use a smaller dataset, thanks to the knowledge distillation techniques applied in DeiT.