Attention Based Data Augmentation for Knowledge Distillation with Few Data

Knowledge distillation has attracted great attention from computer vision researchers in recent years. However, the performance of the student model suffers when the complete dataset used to train the teacher model is unavailable. Especially when conducting knowledge distillation between heterogeneous models, it is difficult for the student model to learn and receive guidance with few data. In this paper, a data augmentation method is proposed based on the attentional response of the teacher model. The proposed method utilizes the knowledge in the teacher model without requiring a homogeneous architecture between the teacher model and the student model. Experimental results demonstrate that, by combining the proposed data augmentation method with different knowledge distillation methods, the performance of the student model can be improved in knowledge distillation with few data.


Introduction
In recent years, the expensive computation and training costs of deep models have raised concerns among researchers. Knowledge distillation was proposed to utilize a teacher model, which has deep convolutional layers and massive parameters, to guide a lighter and smaller student model during training. The ultimate goal of knowledge distillation is to make the light model solve problems with performance similar to the heavy model at a lower computing cost.
Knowledge distillation methods can be divided into three categories according to the source of knowledge [1]: response-based, feature-based and relation-based. Specifically, response-based knowledge distillation [2] utilizes the output logits of the teacher model as the knowledge. Feature-based knowledge distillation [3][4][5] utilizes the activations, neurons or features of intermediate layers as the knowledge to guide the learning of the student model. Relation-based knowledge distillation [6][7][8] utilizes the relationships between different activations, neurons or pairs of samples as the knowledge.
In practical application scenarios, due to objective causes such as the unavailability of sensitive data, inconvenient data access and unexpected loss of data, the complete training data of the teacher model may not be usable during knowledge distillation. The absence of the complete dataset makes it difficult for student models to learn, and under this condition the training results of student models become unsatisfactory. Naturally, this problem is gradually attracting the attention of researchers. Fang et al. [9] proposed a data-free adversarial distillation method inspired by the form of Generative Adversarial Networks (GANs), and good image classification performance was achieved during distillation in which the teacher and student have similar network structures. Nayak et al. [10] presented Zero-Shot Knowledge Distillation (ZSKD), in which data impressions are synthesized from the softmax space of the teacher model and utilized as surrogates for the original training data samples; the student model could achieve a degree of classification performance without using any training data. Li et al. [11] presented few-sample knowledge distillation (FSKD), which is used in network compression where the student model is obtained by pruning the teacher model. Subsequently, Bai et al. [12] proposed a novel layer-wise knowledge distillation approach for effectively compressing networks with few data. Recently, Shen et al. [13] proposed a novel grafting strategy for few-shot knowledge distillation, in which a dual-stage distillation scheme gradually replaces the blocks of the teacher model with the blocks of the student model, achieving gratifying results with few samples.
However, most previous works achieved good results between teacher models and student models that have homogeneous architectures; model compression methods such as pruning do not change the model structure. Conducting knowledge distillation between heterogeneous models with few data is still a challenging problem. In this paper, we explore knowledge distillation between heterogeneous models with few data from the perspective of data augmentation. A data augmentation method is proposed based on the attentional response of the teacher model. The proposed method utilizes the effective response of the teacher model to data, and the knowledge contained in the teacher model is used to enrich the data. Experimental results demonstrate that our approach can improve the performance of the student model in knowledge distillation with few data and does not require homogeneous architectures between teacher models and student models.
The main contributions of this paper are summarized as follows:
• A data augmentation method for knowledge distillation is proposed based on the attentional response of the teacher model. Combined with several typical distillation methods, performance breakthroughs on the full dataset are achieved in knowledge distillation between heterogeneous models.
• The proposed data augmentation method can further improve knowledge distillation performance on small datasets of varying sizes. Combined with the proposed method, the performance of several typical distillation methods is improved when conducting knowledge distillation between heterogeneous models with few data.
The remainder of the paper is organized as follows. The attentional response of the teacher model, the proposed data augmentation method and the simplified cross-layer knowledge distillation method are described in Materials and Methods. The experimental results and analyses are given in Results and Discussions. Conclusions and future work are given in the Conclusion section.

The attentional response of teacher model
Teacher models are usually trained well on the full dataset, which means that a teacher model can perceive similar data well. Such perception is reflected not only in the output logits, but also in the activations of deep feature maps. Zagoruyko et al. [4] showed that the feature maps in the deep layers of a network can be mapped onto the original image to express the attention of the network to the image, and the areas with high attention are used for classification by the model. Hence, the attentional response of a model is defined as the mapping result of a deep feature map.
To generate the attentional response of a teacher model, first consider a deep layer and its activation tensor A ∈ ℝ^(C×H×W), where C is the number of feature channels and H×W is the spatial size of the feature map. Then take the mean over the channel dimension C to obtain a spatial attention map of size H×W, and normalize the attention map. Finally, to project the spatial attention map onto the original image, an average pooling function is applied to generate the attentional response. A set of examples of attentional responses is shown in figure 1. A well trained ResNet-32x4 [14] (79.42% top-1 accuracy) is used to generate attentional responses based on the activations of the last residual block. The key regions of the targets in the images produce high activation responses, which demonstrates that the model has a strong ability to perceive data. We can use this capability to highlight the key information in the few available training data and enrich the data while keeping the key information.
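For concreteness, a minimal PyTorch-style sketch of this procedure is given below. The function name is illustrative, and since the projection step is only described briefly above, the sketch simply resizes the attention map to the image resolution with bilinear interpolation; this should be read as an assumption rather than the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def attentional_response(activation: torch.Tensor, image_size) -> torch.Tensor:
    """Attentional response of a teacher layer (sketch).

    activation: tensor of shape (C, H, W) from a deep layer of the teacher,
    e.g. the last residual block of ResNet-32x4.
    image_size: (h, w) of the original input image.
    """
    # Mean over the channel dimension C gives an H x W spatial attention map.
    attn = activation.mean(dim=0)
    # Normalize the attention map to [0, 1].
    attn = (attn - attn.min()) / (attn.max() - attn.min() + 1e-8)
    # Project the map back to the original image resolution (assumption:
    # bilinear resizing stands in for the projection step described above).
    attn = F.interpolate(attn[None, None], size=image_size,
                         mode="bilinear", align_corners=False)
    return attn[0, 0]  # shape (h, w)
```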

Attention based data augmentation
The main idea of the proposed data augmentation method is that the teacher model is utilized to sketch out the key information in the limited data, and the data are then varied while the key information is kept. Specifically, given a small training dataset, the size of each image is assumed to be h×w. We run the dataset through the teacher model so that the attentional response of each image is generated and saved. The attentional response is denoted as AR ∈ ℝ^(h×w). Then, the mask of the attentional response MAR ∈ ℝ^(h×w) can be expressed as:

MAR(i, j) = 1 if AR(i, j) ≥ mean(AR), and MAR(i, j) = 0 otherwise,

that is, an element of MAR is set to 0 when the element at the same position of AR is less than the mean of AR, and to 1 otherwise. An augmented sample of an original image img is defined as:

img_aug = img ⊙ MAR + img_random ⊙ (One − MAR),

where ⊙ denotes element-wise multiplication, One ∈ ℝ^(h×w) is a matrix in which all elements are 1, and img_random is a random sample from the dataset. In summary, an augmented sample consists of the key areas of the original image and random interference. The augmented samples increase the diversity of the dataset without losing target information, which alleviates the problem that the model cannot generalize well from a small number of samples. An example of the augmented data is shown in figure 2.
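As a concrete illustration, the two expressions above amount to a masked blend of the original image with a randomly drawn one. The following is a minimal sketch under that reading; tensor shapes and function names are ours.

```python
import torch

def ada_augment(img: torch.Tensor, ar: torch.Tensor,
                img_random: torch.Tensor) -> torch.Tensor:
    """Attention-based data augmentation (sketch).

    img, img_random: images of shape (3, h, w); img_random is a randomly
    drawn sample from the small training set.
    ar: the teacher's attentional response AR of shape (h, w).
    """
    # MAR: 1 where the response reaches its mean, 0 elsewhere.
    mar = (ar >= ar.mean()).float()
    # Keep the key areas of the original image and fill the remaining
    # positions with the random sample as interference.
    return img * mar + img_random * (1.0 - mar)
```

Since the attentional responses are generated and saved once with the teacher model, the augmentation itself adds little cost at training time.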

Simplified cross-layer knowledge distillation
In this paper, we use a simplified cross-layer knowledge distillation modified from SemCKD [5]. Specifically, considering that the attention allocation module may become difficult to learn with few samples, we remove it to reduce the number of parameters that need to be optimized. Each student layer that needs to be guided is directly associated with all target teacher layers through corresponding projection modules. Each projection module consists of a stack of three layers with 1x1, 3x3 and 1x1 convolutions to align the shapes of the feature maps between a student layer and a target teacher layer.

Fig. 2 An example of the attention based augmented data. Images are from the CIFAR-100 dataset. Subgraph (a) is the original image. Subgraph (b) is the augmented image with a random sample. Subgraph (c) is a random sample from the dataset.
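As a rough illustration of such a projection module, the sketch below stacks 1x1, 3x3 and 1x1 convolutions to map a student feature map with in_ch channels to the channel width of a target teacher layer. The hidden width, normalization and activation choices are our assumptions, and spatial-size alignment (needed when the two feature maps differ in resolution) is omitted.

```python
import torch.nn as nn

def make_projection(in_ch: int, out_ch: int) -> nn.Sequential:
    """1x1 -> 3x3 -> 1x1 convolution stack aligning channel shapes (sketch)."""
    mid = max(out_ch // 2, 1)  # assumed hidden width
    return nn.Sequential(
        nn.Conv2d(in_ch, mid, kernel_size=1, bias=False),
        nn.BatchNorm2d(mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid, mid, kernel_size=3, padding=1, bias=False),
        nn.BatchNorm2d(mid),
        nn.ReLU(inplace=True),
        nn.Conv2d(mid, out_ch, kernel_size=1, bias=False),
        nn.BatchNorm2d(out_ch),
    )
```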

Results and Discussions
In this paper, CIFAR-100 [15], which contains 100 classes, is used as the experimental dataset. CIFAR-100 has 60000 colour images of size 32×32; 50000 images are used as the training set and the remaining 10000 images are used as the test set. We randomly sample k% of the images per class from the training set to construct small datasets, where k ∈ {10, 20, 30, 40, 50}. Following SemCKD [5], we adopt ResNet-32x4 [14] as the teacher model and VGG8 [16] as the student model. The architectures of the two models are significantly different, so we consider them a pair of heterogeneous models following Shen et al. [13]. Several typical one-stage heterogeneous distillation methods are used in the experiments for comparison. No-KD represents training the student model on the specified dataset without any knowledge distillation method. VKD [2] represents the vanilla knowledge distillation, a classic response-based distillation approach. SP [6] represents similarity-preserving knowledge distillation, a typical relation-based distillation approach. SemCKD [5] is a state-of-the-art feature-based knowledge distillation approach. SCKD is the simplified cross-layer knowledge distillation modified from SemCKD [5]. ADA represents using the attention-based data augmentation proposed in this paper, and None represents conducting knowledge distillation without any data augmentation. The experimental results of knowledge distillation with few data are shown in table 1.
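For reproducibility, the k% subsets described above can be constructed by per-class sampling of the training set; the following is a minimal sketch assuming torchvision's CIFAR100 dataset and a fixed random seed.

```python
import random
from collections import defaultdict
from torch.utils.data import Subset
from torchvision.datasets import CIFAR100

def sample_k_percent(dataset: CIFAR100, k: int, seed: int = 0) -> Subset:
    """Randomly keep k% of the training images of every class."""
    rng = random.Random(seed)
    per_class = defaultdict(list)
    for idx, label in enumerate(dataset.targets):
        per_class[label].append(idx)
    kept = []
    for indices in per_class.values():
        rng.shuffle(indices)
        kept.extend(indices[: max(1, round(len(indices) * k / 100))])
    return Subset(dataset, kept)

# e.g. small_train = sample_k_percent(
#     CIFAR100("./data", train=True, download=True), k=10)
```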
Tab. 1 Experimental results of knowledge distillation with few data on the CIFAR-100 dataset. The teacher model is ResNet-32x4 [14] (Top-1 Accuracy: 79.42%), and the student model is VGG8 [16].

Firstly, comparing whether knowledge distillation is used at each k, it can be found that using knowledge distillation always improves performance, and the lower the k value, the greater the boost knowledge distillation brings. For example, at k=10% and k=50%, the top-1 accuracies of the student model are 43.61% and 65.16% respectively without using KD or ADA, while using SCKD in the same settings (still without ADA), the top-1 accuracies increase by 13.87% and 6.79% respectively. The benefit of KD at k=10% is thus almost double that at k=50%. We believe that knowledge distillation itself has a certain ability to resist the absence of data, because the teacher model contains knowledge beyond the small dataset, so using knowledge distillation during training can improve the performance of the student model.
From the experimental results, it can be seen that the attention-based data augmentation (ADA) proposed in this paper can further improve knowledge distillation performance on small datasets. Using ADA with SCKD at k=10%, the top-1 accuracy of the student model increases by 5.4% compared to SCKD without ADA at the same k. Besides, we also find that ADA is compatible with different KD methods: the performances of three typical distillation methods of different types (response-based, feature-based and relation-based) are all improved with ADA at every k.
Combining SCKD and ADA, a gratifying performance is achieved in knowledge distillation between heterogeneous models with few data: for every k value setting, the highest performance is achieved by SCKD with ADA compared to the other experimental groups. The performance improvement even on the full dataset (k=100%) further proves the effectiveness of the proposed ADA method.
Besides, it can be seen from the comparative experiments without knowledge distillation (No-KD) that ADA does not work well without KD. This phenomenon may indicate that the augmented samples can only play a role under the guidance of the teacher model; student models may not make good use of these augmented samples by themselves.
In addition, to further understand the mechanism of ADA, a comparative experiment is conducted with different kinds of random interference. In ADA, the random interference comes from a randomly selected sample; an alternative is to generate random values directly. We name the alternative strategy ADA(RN) and the original strategy ADA(RS) (a sketch of the ADA(RN) variant is given below). The dataset and models remain unchanged and the KD method is SCKD. The comparative experimental results for the different kinds of random interference are shown in table 2.
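For clarity, ADA(RN) only swaps the interference term of the earlier augmentation sketch for random values; the noise distribution is not specified in the text, so uniform noise in [0, 1) is assumed here.

```python
import torch

def ada_augment_rn(img: torch.Tensor, ar: torch.Tensor) -> torch.Tensor:
    """ADA(RN): random values instead of a random sample as interference (sketch)."""
    mar = (ar >= ar.mean()).float()
    noise = torch.rand_like(img)  # assumed: uniform random values in [0, 1)
    return img * mar + noise * (1.0 - mar)
```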
Tab. 2 Comparative experimental results of different kinds of random interference on the CIFAR-100 dataset. The teacher model is ResNet-32x4 [14], and the student model is VGG8 [16].

It can be seen that using random values as the random interference can still improve the performance of knowledge distillation in all k value settings. Using ADA(RN) with SCKD at k=30%, the top-1 accuracy of the student model increases by 2.23% compared to using SCKD without ADA at the same k.
However, the performance of ADA(RN) is consistently inferior to that of ADA(RS) in all k value settings. Especially at k=10%, using ADA(RN) results in a 3.87% decrease in top-1 accuracy compared with ADA(RS). The difference between random samples and random values is whether the interference carries semantic information. A random sample has obvious semantic information, and such semantic information makes model training more challenging. Particularly when the amount of data is small and the model cannot learn sufficient generalization from the limited samples, the confusing augmented samples with semantic interference improve the generalization of the model more effectively.

Conclusion
In this paper, we explore knowledge distillation between heterogeneous models with few data from the perspective of data augmentation. A data augmentation method is proposed based on the attentional response of the teacher model. Combining the proposed data augmentation method with different knowledge distillation methods, gratifying performances are achieved in knowledge distillation between heterogeneous models with few data. Besides, we discuss some characteristics of the proposed method, including that it needs to be used in conjunction with knowledge distillation to be effective.