Feature Purity: A Quantitative Evaluation Metric for Feature Extraction in Convolutional Neural Networks

The quantitative analysis of the feature information in specific convolutional layers of Convolutional Neural Networks (CNNs) has received considerable attention in the field of deep learning. In this paper, we propose an innovative measurement method based on the principle of information entropy to quantitatively analyze the performance of feature extraction in CNNs. In this method, the feature purity is defined as the normalized entropy of the activation histogram, calculated over all kernels in a feature layer of the CNN. Feature purities were evaluated in different layers of a CNN as an inspection of the internal structure of specific classification models. At the same time, the overall purities of different models were compared against their classification performance. Experiments were performed on AlexNet, VGG, ResNet, and SENet models trained for different numbers of epochs on the CIFAR10 and ImageNet1000 datasets. Additionally, a visual evaluation was performed with the Grad-CAM method for both intra-model and inter-model purities. Experimental results showed a strong relationship between feature purity and classification accuracy in each model. The visualized feature maps also demonstrated that the salient regions extracted by a given feature layer were highly consistent with its purity value. As a result, the proposed feature purity can serve as a quantitative metric representing the degree of feature extraction in a CNN without depending on labels or a specific network structure, and it offers significant interpretability and versatility across models.


Introduction
In recent years, deep learning technology has developed rapidly in the field of computer vision, from basic image classification tasks to further applications such as object detection [1], semantic segmentation [2], and so on. Compared with traditional methods based on hand-crafted visual features, convolutional neural networks (CNNs) have achieved state-of-the-art performance in these fields. However, the internal mechanism of a CNN is often regarded as a 'black box', and its poor interpretability remains one of its most criticized shortcomings. Therefore, a growing number of researchers have recognized the significant value of interpretability in neural networks, and the internal working mechanisms of these models have received extensive attention.
Explainability research on neural networks has gone through many stages [21]. Classical interpretability methods include visual interpretation methods [3,4,5,6], which interpret a neural network by visualizing the feature information of the corresponding classes extracted from a feature layer. On the other hand, interpretability methods based on models and data have also been proposed. LIME [9] explained a neural network by locally fitting a simple interpretable model (a linear classifier) to a complex classification model. Influence functions [10] studied the effect of training samples (e.g., removing one sample, or perturbing a sample) on model predictions to explain the neural network. Wang et al. [11] analyzed the relationship between the generalization behavior of convolutional neural networks and the spectrum of image datasets. Knowledge consistency [12] has been used to judge the representation ability of the middle layers of a convolutional neural network.
At present, there are still problems to be solved in the research on the interpretability of neural networks. Existing work mainly explains a model by inspecting the features corresponding to certain classes extracted by the CNN, or explains the model from a local or data-centric perspective. An evaluation of the global feature information, together with a criterion for measuring it, would help to further understand and evaluate neural network models. Therefore, we propose a feature measurement method based on the principle of maximum entropy. Our main idea is to measure the global feature information by evaluating the degree of overlap of cross-class information extracted by a convolutional layer. That is, by counting the activation degree of the kernels over all classes, we obtain the activation histogram of the kernels, calculate the information entropy of the feature layer, and derive the feature purity to evaluate the global feature information extracted by the convolutional layer. The comparison of intra- and inter-model feature purities showed a positive correlation between purity and the performance of kernels, layers, or models. At the same time, the purity of a feature layer increases gradually with the number of training epochs.
Our main contributions are summarized as follows: (1) Based on the principle of maximum entropy, we propose a neural network feature measurement method. (2) Through experiments, we obtain the relationship between feature layers within a model and across models, as well as the trend of extracted feature information under different training degrees, and validate these observations with Grad-CAM visual interpretation. (3) The proposed method can compare the features extracted by different feature layers of the same neural network; at the same time, the metric does not depend on the specific network structure and can be used across networks.

Related work
Visual interpretation of neural networks includes Deconvolution [3] and Guided Backpropagation [4], which elaborated the concept of features in the shallow and deep layers of CNNs. Further, by replacing the fully connected layer with global average pooling and retraining the weights, CAM [5] can capture the importance of each part of the input image to the category and then visualize the classification results of the neural network. Selvaraju et al. proposed Grad-CAM [6], which, without replacing the fully connected layer, uses the global average of the gradients to compute the channel weights and obtain the importance of each category. These methods investigate the visual interpretation of a specific class of features in an image, but they cannot capture well the global features extracted by a feature layer. In addition, TCAV [24] used directional derivatives to measure the sensitivity of model predictions with respect to concepts at any model layer.
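As a concrete illustration of the Grad-CAM weighting just described, the following is a minimal NumPy sketch; the array shapes and random inputs are illustrative only, and a real implementation would take the activations and gradients from a backward pass through the network:

```python
import numpy as np

def grad_cam(activations, gradients):
    """Minimal Grad-CAM sketch: the channel weights are the global
    average of the gradients, and the map is the ReLU of the weighted
    sum of feature maps.

    activations, gradients: arrays of shape (C, H, W) for one image.
    """
    weights = gradients.mean(axis=(1, 2))             # alpha_c: global average of gradients
    cam = np.tensordot(weights, activations, axes=1)  # sum_c alpha_c * A^c -> (H, W)
    return np.maximum(cam, 0.0)                       # ReLU keeps positively contributing regions

# toy example with random feature maps (purely illustrative)
rng = np.random.default_rng(0)
acts, grads = rng.normal(size=(8, 7, 7)), rng.normal(size=(8, 7, 7))
heatmap = grad_cam(acts, grads)
print(heatmap.shape)  # (7, 7)
```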
Entropy is a commonly used metric to measure disorder or uncertainty in information theory. At the same time, information entropy is often used as a quantitative index of the information content of a system: a larger entropy value means the system contains more information. Moreover, the KSE (kernel sparsity and entropy) [25] method was proposed to quantify feature-map importance in a feature-agnostic manner to guide model compression. Li et al. [26] proposed an entropy-based filter pruning (EFP) method that prunes unimportant filters based on the amount of information carried by their corresponding feature maps.
Different from these works, we use entropy to propose the concept of feature purity for quantitatively analyzing the feature extraction performance of a CNN. For a specific convolutional layer, we multiply each feature map by its channel weight from Grad-CAM to obtain the weighted feature maps. From these, we compute the activation of each kernel under a threshold for every class, calculate the information entropy, and obtain the feature purity to analyze the global feature information extracted by the convolutional layer. The proposed feature purity can thus serve as a quantitative metric representing the degree of feature extraction in a CNN without depending on labels or a specific network structure, and it offers significant interpretability and versatility across models.

Principle of Feature Purity
In our method, the feature purity is defined as the normalized entropy of the activation histogram, calculated over all kernels in a feature layer of the CNN. In a CNN, we expect a filter to learn only the features belonging to a specific category, as purely as possible; that is, it should be activated by a specific category. Correspondingly, for a convolutional layer, we expect the filters to be activated relatively uniformly, with little overlap of cross-class information expressed by different kernels. The feature information extracted by such a layer is then 'purer' and more meaningful, which means its feature purity is higher, as shown in Figure 1(a). The feature purity is relatively small if the filters are insufficiently activated across classes, as shown in Figure 1(b). In deeper layers with more filters, some kernels can be regarded as learning different features of the same class; that is, the features extracted by the convolutional layer are more abundant, which helps the model make better decisions. Correspondingly, its feature purity is higher, as shown in Figure 1(c).
Figure 1. Samples of different activation cases in the feature layer. The dashed circles represent kernels that are not activated, and the circles of different colors represent neurons fully activated by different classes.

Algorithm Flow
The steps of the feature purity calculation are shown in Figure 2. As in Grad-CAM, for a given convolutional layer with $C$ channels, we first obtain the weight $\alpha_c^m$ of the $c$-th feature map $A^c$ for category $m$, and multiply each weight by its corresponding feature map to obtain the weighted feature map $\hat{A}^c = \alpha_c^m A^c$. We then apply global average pooling to it and obtain the activation value of each channel, $L^c = \frac{1}{Z}\sum_{i}\sum_{j} L^c_{ij}$, where $L^c_{ij}$ is the pixel value at position $(i, j)$ of the $c$-th weighted feature map and $Z$ is the number of pixels. Given $n$ images, we count for each channel the number of images whose activation exceeds a threshold, which yields the activation histogram $N_c$ of the layer; normalizing it gives the activation probability $p_c = N_c / \sum_{k=1}^{C} N_k$. The information entropy of the layer can then be calculated as $H = -\sum_{c=1}^{C} p_c \log p_c$. Ideally, when all $C$ channels are activated uniformly ($p_c = 1/C$), the entropy reaches its maximum $H_{\max} = \log C$. The normalized feature purity score of the corresponding feature layer is $F_s = H / H_{\max}$, where $0 \le F_s \le 1$. A large value of $F_s$ means that the activations of all kernels are more uniform and the features extracted by the kernels are 'purer'; ideally, when the activation probability of every channel is equal, $F_s$ reaches its maximum of 1. A small value suggests that the channels are not fully activated and the kernels extract fewer features.
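The purity computation above can be sketched numerically as follows; this is a minimal sketch under our reading of the method, assuming the pooled channel activations for $n$ images have already been collected into an $(n, C)$ array (the array name and the threshold value are illustrative, not from the paper):

```python
import numpy as np

def feature_purity(channel_acts, threshold):
    """Feature purity sketch: count, per channel, how many images
    exceed `threshold`, normalize the histogram to probabilities,
    and return entropy / log(C).

    channel_acts: array of shape (n_images, C), the pooled weighted
    activation L^c of every channel for every image.
    """
    counts = (channel_acts > threshold).sum(axis=0).astype(float)  # activation histogram N_c
    total = counts.sum()
    if total == 0:
        return 0.0                        # no channel activated at all
    p = counts / total                    # activation probabilities p_c
    p = p[p > 0]                          # convention: 0 * log 0 = 0
    entropy = -(p * np.log(p)).sum()      # H = -sum_c p_c log p_c
    return entropy / np.log(channel_acts.shape[1])  # Fs = H / log C

# perfectly uniform activation across 16 channels -> purity of 1
uniform = np.ones((100, 16))
print(round(feature_purity(uniform, 0.5), 3))  # 1.0
```

As a sanity check of the definition, activation concentrated in a single channel gives an entropy of zero and therefore a purity of zero.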

Experiment
We used CIFAR10 [13] and ImageNet (ILSVRC-12) [14] for our experiments. On the CIFAR10 dataset, we used Tiny VGG [22] and ResNet18 [15] as the baseline models. Simple data augmentation was applied during training: 4 pixels were padded on each side, and a 32×32 crop was randomly sampled from the padded image or its horizontal flip. The learning rate was set to 0.001 and the batch size to 64. On the ImageNet (ILSVRC-12) dataset, we chose AlexNet [16], VGG [17], DenseNet121 [18], ResNet [15], and SENet154 [19] as experimental models, using the pre-trained weights officially published by PyTorch. First, we compared the effectiveness of the feature metric under different thresholds to choose the best threshold. We then conducted three comparisons: (1) intra-model comparison, i.e., comparing the feature purities of different feature layers of the same model; (2) inter-model comparison, i.e., comparing the feature purities extracted by different models; (3) comparison of purities from models trained for different numbers of epochs. The threshold selection was carried out on ResNet18 and VGG19 on CIFAR10. The intra-model comparison used the ResNet18 and VGG19 models on CIFAR10. For the inter-model comparison, we used the AlexNet, VGG, DenseNet121, ResNet, and SENet154 models on ImageNet (ILSVRC-12). For the purity comparison at different training epochs, Tiny VGG and ResNet18 were selected on CIFAR10. The feature layers corresponding to a given category were visualized with Grad-CAM, and the relationship between feature purity and the training effect of the feature layer was further compared.
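The CIFAR10 augmentation described above (pad 4 pixels per side, random 32×32 crop, random horizontal flip) can be sketched in plain NumPy; in practice this would typically be done with torchvision transforms, and the function below is only an illustrative stand-in:

```python
import numpy as np

def augment(img, rng):
    """Sketch of the CIFAR10 augmentation: pad 4 pixels on each side,
    take a random 32x32 crop, and flip horizontally with probability
    0.5. `img` is an (H, W, C) uint8 array with H = W = 32.
    """
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)), mode="constant")  # 40x40 padded image
    top, left = rng.integers(0, 9, size=2)                           # crop offsets in [0, 8]
    crop = padded[top:top + 32, left:left + 32]
    if rng.random() < 0.5:
        crop = crop[:, ::-1]                                         # horizontal flip
    return crop

rng = np.random.default_rng(0)
out = augment(np.zeros((32, 32, 3), dtype=np.uint8), rng)
print(out.shape)  # (32, 32, 3)
```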

Thresholds in Purity Scores
We chose different thresholds ([0.1-0.9]) on ResNet18 and VGG19 and compared the purity scores of each feature layer at the different thresholds, as shown in Figures 3a and 3b. We then evaluated the effectiveness of the resulting feature measures and selected the best threshold. When the threshold is 0.4-0.5, the purity of each feature layer shows significant differentiation, which can distinguish the features extracted by different feature layers. At the same time, we find that as the threshold increases, the purity of the shallow layers decreases more, while the purity of the deep layers decreases only slightly. When the threshold is 0.4-0.5, the shallow-layer purity drops considerably, which indicates that the activation of the shallow layers is weak while the activation of the deep kernels is strong. When the threshold is greater than 0.5, the feature purity begins to decrease significantly, which degrades the effectiveness of the feature metric, especially in the shallow feature layers. Therefore, a threshold in the range 0.4-0.5 was adopted in the subsequent experiments.
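A threshold sweep of this kind can be sketched as follows. The activations below are synthetic stand-ins for real layer outputs, and the uneven channel scaling is our assumption for mimicking a weakly activated shallow layer, so the printed values are only qualitative:

```python
import numpy as np

def purity(channel_acts, t):
    """Normalized entropy of the over-threshold activation histogram
    (same definition as in the Principle of Feature Purity section)."""
    counts = (channel_acts > t).sum(axis=0).astype(float)
    if counts.sum() == 0:
        return 0.0
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum() / np.log(channel_acts.shape[1]))

rng = np.random.default_rng(1)
# hypothetical shallow layer: channels activate unevenly, most of them weakly
shallow = rng.uniform(0.0, 1.0, size=(500, 64)) * np.linspace(0.2, 1.0, 64)
# hypothetical deep layer: all channels can reach strong activations
deep = rng.uniform(0.0, 1.0, size=(500, 64))
for t in (0.1, 0.5, 0.9):
    print(t, round(purity(shallow, t), 3), round(purity(deep, t), 3))
```

With this toy setup, the shallow layer's purity drops as the threshold rises, since high thresholds leave only a few strongly activated channels in the histogram, mirroring the trend observed above.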

Intra-model Comparison
Correspondingly, on the ResNet18 model, we chose images of each class, observed the features extracted by different feature layers of the same model through Grad-CAM visualization, and compared the feature purity of the different convolutional layers.
We chose some representative layers for comparison. As shown in Figure 4, by observing the Grad-CAM of the corresponding layers, we found that feature layers with high purity scores extract more feature information and respond better to the activations of all classes. For example, the feature purity score of layer4.1.conv2 is 0.971; the Grad-CAM of this layer shows that the activation response region of its class is good, the localization ability for each class is strong, and the semantic information extracted by the layer is rich. In contrast, for layer1.1.conv1 and layer1.1.conv2, with purity scores of 0.147 and 0.394, the corresponding activation regions are scattered and the localization of the classes is relatively poor, indicating that the kernels are not fully or only weakly activated. Among them, layer4.1.conv1 shows no activation for birds, but in general its activation for the other classes is relatively good. Overall, in line with the observations in [3], the nonlinear ability of the shallow layers is weak and their learning ability is insufficient; they can only extract low-level features such as object edges, textures, and contours, so the activation of features is relatively weak and the feature purity is small. As the layers deepen, the nonlinear ability improves, the extracted features become high-level semantic information about the categories, and more information can be learned.

Inter-model Comparison
For the measurement of different models' features, we mainly considered two aspects. First, we selected the feature layers containing low-, medium-, and high-level semantic features before each downsampling stage of each model, and calculated the average of their feature purities to measure the feature information extracted by the model. Second, considering that the deep feature layers contain more high-level semantic information and have a great influence on the final classification result, the last feature layer of each model was also compared. On the ImageNet (ILSVRC-12) dataset, we used VGG and ResNet to compute the purity of each feature layer before downsampling and calculated their average values to observe the feature purity scores of the feature layers, as shown in Table 1. The feature purity of the last layer (C5) is higher than that of the other feature layers. With increasing model depth and performance, the purity improved. The average value shows different trends for the different models and is positively correlated with performance. In particular, there is a strong correlation between the purity and performance of the final convolutional layer of a model, which is most closely related to the final classification result. Therefore, the C5 layer was chosen to compare the purity of different models.
Figure 5. Comparison of activation patterns and feature purity of the last convolutional layers of different models on ImageNet1000. The red boxes and green boxes mark feature layers that are poorly and well activated in Grad-CAM, respectively.
Subsequently, on the ImageNet (ILSVRC-12) dataset, we used AlexNet, VGG16, DenseNet121, ResNet50, and SENet154, models with different performance levels, for the experiments. As shown in Table 2, the purity of the last feature layer of the better-performing models is relatively large. Figure 5 shows the purity and the Grad-CAM comparison of each model on ImageNet, where a red box and a green box mark the classes that are poorly and well activated in Grad-CAM, respectively. We observe that the better-performing models have more green boxes; at the same time, their feature layers obtain high feature purity scores and include less background information in Grad-CAM. Furthermore, the localization of the filters to the classes is better.

Purities with different training epochs
Moreover, in order to study the variation of model purity across training epochs, we used Tiny VGG and ResNet18 to compare the purity of the feature layers of the same model at different training degrees. Models trained for 10, 20, 50, and 100 epochs were selected to compare the purity of the last convolutional layer. Table 3 shows the feature purity scores of the ResNet18 and Tiny VGG models at different training epochs. Correspondingly, as shown in Figures 6 and 7, we used Grad-CAM to visualize the last feature layer of Tiny VGG and ResNet18 trained for different numbers of epochs and observed the change in the purity of the corresponding feature layers. A red box and a green box mark the classes that are poorly and well activated in Grad-CAM, respectively.
The results suggest that as training proceeds, the features extracted by Tiny VGG and ResNet18 are gradually optimized and the extracted semantic feature information improves. The interference of background information is gradually eliminated, and the ability to localize classes improves. At the same time, we find that as the number of training epochs increases further, the model begins to overfit. Although the purity of the feature layer is improved, the features of some classes extracted by the feature layer gradually become worse and are localized at the edge or in the background of the object. For example, in Figure 7, the activation of car and deer at epoch 50 is better than at epoch 100; at epoch 100, due to overfitting, the activation regions of these classes drift toward the edge or background of the object.

Discussion
In a CNN, the accuracy for a single output class depends heavily on a small number of important kernels, and this applies to all categories [23], so their activations should be relatively large. In our method, we hypothesize that during training, different channels in a specific layer are activated by different classes or feature patterns, with little overlap of cross-class information expressed by different kernels. The purer the extracted feature, the more 'professional' the kernel is.
To verify the reliability of our principle, a group of experiments was carried out. The comparison of intra- and inter-model purities proved that there is a positive correlation between the purity and the performance of kernels, layers, or models. At the same time, the purity of a feature layer increases gradually with the number of training epochs, which provides further evidence for our hypothesis.
Information theory can also provide an interpretation of purity from another angle. As described above, purity is defined following the concept of information entropy within a specific receptive field in a CNN. A low purity value reflects an averaged, uniform activation by the input information, which represents ambiguity, unclarity, and even chaos in information-theoretic terms. On the contrary, a high purity value reflects specific and unique activation, which indicates significance and salience for inference. Significant and salient feature extraction is exactly the goal we pursue when designing and training a CNN model. In traditional methods, this goal can only be evaluated through the final performance, such as classification accuracy. In our method, however, purity provides another viewpoint from which to measure it.
By comparing the purity value with the Grad-CAM feature maps, another phenomenon can be observed: in most of our experimental cases, higher purities corresponded to more salient and significant regions in the Grad-CAM feature map, and more semantic information could be activated for the specified class. Across models, the purity was positively correlated with the quality of the last feature map visualized by Grad-CAM. At the same time, the purity of the feature layer and feature map increased gradually as training deepened.
One deficiency is that the proposed method cannot handle biased sampling of the dataset. The calculation of purity requires a relatively large-scale and category-balanced dataset as input. If the input dataset lacks some categories, the calculated purity may not fully reflect the performance of a well-trained model. This means that the purity is only responsible and meaningful with respect to the given input datasets/categories. Additionally, our algorithm workflow filters out weakly activated neurons and thus loses a small part of the information. From another point of view, however, this helps to mitigate the problem of filter-class entanglement [20]. In general, the resulting measurement is a more specific description of neuronal activation.

Conclusion
In this paper, we proposed an innovative explainable metric, namely feature purity, to interpret the performance of feature extraction in CNNs. Feature purity can quantitatively measure the degree of feature extraction in a CNN without depending on labels or a specific network structure, and it has shown significant applicability and generality across models. Future work will consider retraining feature layers with poor purity to see whether this can further improve network performance. At the same time, for models with sufficient performance, we will consider pruning kernels with poor activation, exploring the feature layers that contribute little to the model in order to further improve its performance.