Improved deep semantic dictionary learning for multi-label image classification

Multi-label image classification is a practical and challenging task in the field of machine learning. It is a fundamental yet essential task used in autonomous driving, internet image classification, and other fields, and many researchers have investigated its various aspects. Deep Semantic Dictionary Learning (DSDL) is a multi-label image classification method proposed by previous researchers that incorporates semantic information into the classification task. However, it only uses one kind of feature extraction network as the feature map generator, without considering other alternatives. In this paper, six types of pretrained networks are applied as alternative choices to validate the performance of DSDL on different neural networks. Experiments are conducted on the VOC2007 dataset; the results reveal the inner relations among different series of neural networks and demonstrate that different feature extraction networks perform differently on different tasks. ResNet-152 achieves a mean average precision (mAP) of 94.5% and outperforms the original DSDL method by 0.3% mAP.


Introduction
Image classification is a fundamental and classical task in computer vision. In recent years, since the introduction of AlexNet [1], various Convolutional Neural Networks (CNNs) have achieved great improvements in multi-class image classification. In this kind of task, there are several labels in the classification mission, but only one object in each image and only one label for it. However, in real practice, it is often the case that an image contains more than one object. As shown in Figure 1, the former two images can simply be labeled as car and train separately, while the third one should at least be labeled as ship and person, and the last one should be labeled as person and dining table. It is also common that objects in the same image appear at different scales: as shown in the third image of Figure 1, the person standing in front of the camera is much larger than the boat behind him. In a single-class classification task, only the most salient object is identified, but in multi-label image classification it is required to recognize all the things in the picture, including the ship, the person, and even the sea. A simple way to solve this problem is to treat the objects separately and detect the existence of every object with a binary classifier. However, this method completely ignores the relationships between objects and their semantic meanings, and also often misses small-scale objects. Due to these weak points, some methods such as RLSD proposed adding a Region Proposal Network (RPN) to obtain the region information of the objects [2,3]. Some researchers also used graph convolutional networks (GCN) to capture the topological structure among objects [4,5]. Others used approaches based on Recurrent Neural Networks (RNN) to explicitly model label dependencies [6]. One method proposed in 2021 is Deep Semantic Dictionary Learning (DSDL), which uses dictionary learning to mix and utilize the label, visual, and semantic spaces altogether [7]. This method reaches state-of-the-art results, but it only used ResNet-101 to extract the visual features of images. Therefore, other feature extraction networks may do better than the original one. In this article, different Convolutional Neural Networks are tried to test the performance of the DSDL algorithm with different backbones and to see whether there is a better way to implement this algorithm.

DSDL overview
The DSDL method is a multi-label image classification method proposed by Zhou et al. that uses a semantic dictionary to combine visual features and semantic meanings to obtain the final classification results [7]. The DSDL method can be divided into three parts: Feature Map Generation, Semantic Dictionary Generation, and Visual Features Representation. It also provides a training method called the Alternately Parameter Updating Strategy (APUS) to train this special model.
In the original method, a ResNet-101 pretrained on ImageNet is used as the core of the Feature Map Generation module. In this module, an input image x is fed to the ResNet-101 and becomes a 2048×14×14 feature map, followed by a 14×14 global max pooling to obtain a 2048-dimensional feature vector. The ResNet-101 is the module that will be replaced in this article.
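As a minimal sketch (not the authors' code), the Feature Map Generation step can be illustrated as follows: a backbone such as ResNet-101 maps the input to a 2048×14×14 feature map, and a 14×14 global max pooling turns it into the 2048-dimensional visual feature vector. The random tensor below stands in for the backbone's output.

```python
import torch
import torch.nn as nn

# Stand-in for the 2048x14x14 feature map a ResNet-101 backbone would produce
backbone_output = torch.randn(1, 2048, 14, 14)

# Global max pooling: each 14x14 channel map is pooled to a single value
global_max_pool = nn.AdaptiveMaxPool2d(1)
feature_vector = global_max_pool(backbone_output).flatten(1)

print(feature_vector.shape)  # torch.Size([1, 2048])
```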
In the Semantic Dictionary Generation module, GloVe pretrained on the Wikipedia dataset is used to obtain a c×k dimensional word embedding matrix as the basis of dictionary learning [8]. Dictionary learning is a method that uses an encoder and a decoder to obtain a sparse representation of the word embeddings. Every original word embedding can be seen as a linear combination of basic elements, and these elements are exactly what the dictionary aims to learn. The encoder and the decoder are the main parts of dictionary learning. In the DSDL method, the encoder takes the word embeddings as input and generates the semantic dictionary, i.e., the basic elements mentioned before, while the decoder should be able to decode the semantic dictionary back into a "decoded word embedding" that is as close to the original as possible. Usually, the encoder and the decoder can each be represented as a matrix, so in this method the decoder is implemented as the transpose of the encoder matrix. To measure the similarity between the original embedding and the reconstructed one, the cosine distance is used.
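A simplified sketch of this encode/decode cycle is shown below. For clarity the encoder is reduced to a single matrix W (the actual DSDL encoder stacks two linear layers); the decoder is its transpose, and the cosine distance between original and reconstructed embeddings measures reconstruction quality. The dimensions c, k, and d are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

c, k, d = 20, 300, 2048              # labels, word-embedding dim, visual feature dim
embeddings = torch.randn(c, k)       # stand-in for the GloVe label embeddings

W = torch.randn(k, d) * 0.01         # encoder matrix
dictionary = embeddings @ W          # c x d semantic dictionary
reconstructed = dictionary @ W.t()   # decoder = transpose of the encoder

# Cosine distance averaged over labels: lower means better reconstruction
recon_loss = (1 - F.cosine_similarity(embeddings, reconstructed, dim=1)).mean()
print(dictionary.shape)  # torch.Size([20, 2048])
```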
In the Visual Features Representation module, DSDL introduces a c-dimensional vector α (c is the number of labels) as representation coefficients, and the final result y is the sigmoid of α. The visual feature f is represented as a combination of the dictionary atoms weighted by α, so the quality of the representation is measured by the residual norm ‖f − Dᵀα‖, where D is the semantic dictionary.
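The representation step can be sketched as below: the visual feature is approximated by dictionary atoms weighted by the coefficients α, the residual norm serves as the representation loss, and the label scores are the sigmoid of α (dimensions are illustrative assumptions).

```python
import torch

c, d = 20, 2048
D = torch.randn(c, d)        # semantic dictionary: one d-dimensional atom per label
f = torch.randn(d)           # visual feature vector from the backbone
alpha = torch.zeros(c, requires_grad=True)  # representation coefficients to be learned

repr_loss = torch.norm(f - D.t() @ alpha)   # residual norm of the reconstruction
y = torch.sigmoid(alpha)                    # final multi-label prediction scores
print(y.shape)  # torch.Size([20])
```

With α initialized to zero, every score starts at 0.5; training would jointly refine α, the dictionary, and the backbone.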

Backbone CNNs
The DSDL method only used ResNet-101 as the feature learning module to obtain the visual feature vector. However, there might be a more suitable network for the baseline method. To find an alternative CNN to replace the original ResNet-101, a proven effective, pretrained convolutional neural network is needed. Different versions of VGG, ResNet, and DenseNet are chosen for the trials.

VGG.
The VGG net was presented by the Visual Geometry Group of Oxford in 2014; it is a traditional convolutional neural network that proved the effectiveness of adding additional layers to the network [9]. Compared to AlexNet, it uses several consecutive 3×3 convolution kernels to replace the larger (11×11, 7×7, 5×5) kernels [2]. It introduced the idea that stacked small convolution kernels are better than large ones: it used three 3×3 kernels instead of one 7×7 kernel, and two 3×3 kernels to replace one 5×5 kernel. This ensures that while the network gets deeper, the receptive field remains the same. Also, three 3×3 kernels have fewer parameters than one 7×7 kernel. Two types of VGG net are used here, VGG16 and VGG19; the only difference between the two is their depth. In this experiment, the VGG net serves as a typical traditional deep convolutional neural network to test the performance.
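The parameter saving can be checked with a line of arithmetic (bias terms omitted, and C is an assumed channel count with equal input and output channels): three stacked 3×3 convolutions cover the same 7×7 receptive field as one 7×7 convolution but use 27·C² instead of 49·C² weights.

```python
# Parameter arithmetic behind stacking 3x3 kernels (bias terms omitted)
C = 64                                   # assumed channel count, in == out
params_7x7 = 7 * 7 * C * C               # one 7x7 convolution: 49 * C^2 weights
params_3x3_stack = 3 * (3 * 3 * C * C)   # three 3x3 convolutions: 27 * C^2 weights
print(params_3x3_stack < params_7x7)     # True
```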

ResNet. The ResNet series is a type of convolutional neural network proposed by Microsoft that won first place in the ImageNet competition [10]. Compared to traditional CNNs, it introduced new structures to solve existing problems. Before ResNet, CNNs were composed of stacked convolutional and pooling layers, and it was assumed that as the number of layers grows, feature extraction becomes more complete and performance improves. This assumption was proven false in practice: researchers found that as the network goes deeper, performance does not necessarily improve, and new problems appear. One is the vanishing and exploding gradients, and the other is that the network even gets worse when the depth reaches around 56 layers. The former problem is addressed by data preprocessing and the addition of Batch Normalization (BN) layers. What makes ResNet distinct is the residual structure, which adds shortcuts for the information flow in the CNN to avoid the information loss caused by deep layers. The residual structure proved to work, and the deepest ResNet now has 152 layers.
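The shortcut idea can be sketched in a few lines: the block learns a residual F(x) and adds the input back, so information can bypass the stacked layers. This is a simplified basic block for illustration, not the exact bottleneck block used in ResNet-152.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Simplified residual block: output = ReLU(F(x) + x)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU()

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # shortcut: add the input back

x = torch.randn(1, 64, 14, 14)
print(ResidualBlock(64)(x).shape)  # torch.Size([1, 64, 14, 14])
```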

DenseNet.
DenseNet is a further improvement of ResNet [11]. It jumped out of the conventional idea of improving the network by making it deeper and wider, and instead changed the structure of the network. Although the problems of gradient vanishing and model degradation had been partly solved by using Batch Normalization and bypasses, DenseNet introduced a new concept: the best way to extract features is to reuse the features in the network while keeping the influence of the former two problems minimal. Compared to ResNet, DenseNet has fewer parameters and makes stronger use of bypasses. Every layer of the network makes use of the outputs of all the previous layers. The calculation is x_l = H_l([x_0, x_1, …, x_{l−1}]), where x_i stands for the output of the i-th layer, [·] stands for concatenation, which combines all the output feature maps along the channel dimension, and H_l is the combination of BN, ReLU, and a 3×3 convolution. Because the outputs of different layers are concatenated as the next input, the feature maps must keep the same spatial size across layers, which restricts the down-sampling process. To realize down-sampling, DenseNet uses a structure called the dense block: every layer within the same block keeps the same feature size, and down-sampling is done between dense blocks. In this article, three types of DenseNet are chosen as networks that perform better on ImageNet.
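The concatenation pattern above can be sketched as follows. Each layer receives the channel-wise concatenation of all previous outputs; H_l is shown as BN + ReLU + 3×3 convolution producing a fixed number of new channels (the growth rate, an illustrative assumption here). The spatial size stays 14×14 throughout, as it must within one dense block.

```python
import torch
import torch.nn as nn

growth_rate = 32
features = [torch.randn(1, 64, 14, 14)]  # x_0: initial feature map

for _ in range(3):
    x_in = torch.cat(features, dim=1)    # [x_0, x_1, ..., x_{l-1}]: channel concat
    h = nn.Sequential(                   # simplified H_l: BN + ReLU + 3x3 conv
        nn.BatchNorm2d(x_in.shape[1]),
        nn.ReLU(),
        nn.Conv2d(x_in.shape[1], growth_rate, 3, padding=1),
    )
    features.append(h(x_in))             # each layer adds growth_rate channels

print(torch.cat(features, dim=1).shape)  # torch.Size([1, 160, 14, 14])
```

After three layers the concatenated output has 64 + 3·32 = 160 channels, showing how the channel count grows while the spatial size is preserved.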

Implementation
In the original DSDL method, the visual feature vector is 2048-dimensional because the output of ResNet-101 is a 2048-channel feature map, which becomes a 2048-dimensional vector after the global max pooling.
Accordingly, the size of the dictionary must be 2048×c for the matrix calculation to be feasible. However, different CNNs have different numbers of output channels, so the calculation would not work directly. For VGG16 and VGG19, the original output has only 512 channels, which might be too small for a feature vector compared to the original 2048; therefore, the output layer of the pretrained VGG16/19 is changed to 2048 channels to fit the size of the dictionary. The other CNNs already have enough output channels, so their output layers are not changed; instead, the size of the dictionary is changed to fit the length of the visual feature vector. To change the size of the dictionary, the structure of the encoder is modified. The original encoder is constructed from two linear layers connected by a LeakyReLU, and it encodes the original word embeddings into a size that can be multiplied with the visual feature vector, so the only thing that needs to be changed is the output of the second linear layer. For example, DenseNet-121 gives a feature map of 1024 channels, so the encoder encodes the embeddings into a c×1024 dictionary; for DenseNet-169 with a 1664-dimensional feature map, the output of the encoder is changed to 1664.
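A hypothetical sketch of this per-backbone adjustment is shown below. The channel counts are those stated in the text; the word-embedding dimension k and the hidden width of 1024 are assumptions for illustration, not the authors' exact settings.

```python
import torch
import torch.nn as nn

c, k = 20, 300  # label count and word-embedding dimension (assumed)

# Feature dimensions stated in the text; VGG16/19 outputs are widened to 2048
feature_dims = {"resnet101": 2048, "vgg16": 2048, "vgg19": 2048,
                "densenet121": 1024, "densenet169": 1664}

def make_encoder(backbone):
    # Two linear layers joined by a LeakyReLU; the second layer's output width
    # is set to the backbone's feature dimension d, yielding a c x d dictionary.
    d = feature_dims[backbone]
    return nn.Sequential(nn.Linear(k, 1024), nn.LeakyReLU(0.2), nn.Linear(1024, d))

embeddings = torch.randn(c, k)
dictionary = make_encoder("densenet169")(embeddings)
print(dictionary.shape)  # torch.Size([20, 1664])
```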

Experiments
All the programs are run in a Python 3.8.10 environment with PyTorch 1.10.0, and some code details are adjusted to fit the current version. Experiments are conducted on the Pascal VOC 2007 dataset [4,12,13]. It is a dataset that contains 9963 pictures in total, with 5011 training images and 4592 testing images, and 20 classification categories: aeroplane, bird, bicycle, boat, bus, bottle, cat, car, cow, chair, dog, dining table, horse, motorbike, potted plant, person, sofa, sheep, train, and tv monitor. This dataset was originally used for multi-class image classification but can also be used for multi-label image classification. As shown in Figure 2, the same image in multi-class classification would only be detected as horse, but in multi-label classification it is labelled as horse and person. An obvious advantage of this dataset is that its total scale is not so big compared to the recently famous Microsoft COCO, which ensures the efficiency of obtaining training results.

Evaluation metrics
Following conventional settings, the average overall recall (OR), precision (OP), and F1 (OF1), and the average per-class recall (CR), precision (CP), and F1 (CF1) are chosen for performance evaluation [14]. The results for the top-3 labels are also examined, as well as the mAP. These metrics are calculated as follows:

OP = Σᵢ Nᵢᶜ / Σᵢ Nᵢᵖ,  OR = Σᵢ Nᵢᶜ / Σᵢ Nᵢᵍ,  OF1 = 2·OP·OR / (OP + OR),

CP = (1/c) Σᵢ (Nᵢᶜ / Nᵢᵖ),  CR = (1/c) Σᵢ (Nᵢᶜ / Nᵢᵍ),  CF1 = 2·CP·CR / (CP + CR).

In these calculations, c represents the number of labels, Nᵢᶜ stands for the number of accurately predicted images for the i-th label, Nᵢᵖ represents the number of images predicted for the i-th label, and Nᵢᵍ stands for the ground-truth count for the i-th label.
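These metrics can be computed directly from the three per-label counts, as in this short sketch (the toy counts at the bottom are illustrative, not results from the paper):

```python
import numpy as np

def multilabel_metrics(n_correct, n_pred, n_gt):
    """Overall and per-class precision/recall/F1 from per-label counts:
    n_correct = N_i^c, n_pred = N_i^p, n_gt = N_i^g."""
    n_correct, n_pred, n_gt = map(np.asarray, (n_correct, n_pred, n_gt))
    OP = n_correct.sum() / n_pred.sum()      # overall precision
    OR = n_correct.sum() / n_gt.sum()        # overall recall
    CP = (n_correct / n_pred).mean()         # per-class precision, averaged
    CR = (n_correct / n_gt).mean()           # per-class recall, averaged
    OF1 = 2 * OP * OR / (OP + OR)
    CF1 = 2 * CP * CR / (CP + CR)
    return OP, OR, OF1, CP, CR, CF1

# Toy example with two labels
m = multilabel_metrics([8, 6], [10, 10], [10, 8])
print([round(float(v), 3) for v in m])
```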

Implementation details
Following the original DSDL method, the input images are randomly resized to 448×448 with data augmentation of random horizontal flips. All the alternate CNNs output a feature map of size d×14×14, where d stands for the dimension of the feature vector. The encoder module contains two fully connected layers with a LeakyReLU of negative slope 0.2 in between. Other hyperparameters are kept the same as in the original implementation.
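The input pipeline can be sketched as follows, shown here on a raw tensor for self-containedness (the actual implementation would use torchvision transforms on PIL images):

```python
import torch

def augment(img):
    """Resize a C x H x W tensor to 448x448 and randomly flip horizontally."""
    img = torch.nn.functional.interpolate(
        img.unsqueeze(0), size=(448, 448), mode="bilinear", align_corners=False
    ).squeeze(0)
    if torch.rand(1).item() < 0.5:
        img = torch.flip(img, dims=[2])  # flip along the width axis
    return img

out = augment(torch.randn(3, 300, 500))
print(out.shape)  # torch.Size([3, 448, 448])
```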

Results
Table 1 and Table 2 show the performance of the different CNNs within the DSDL method. It was expected that, as a traditional CNN, VGG would show weaker performance as the feature extraction network, while DenseNet would have high potential to make an improvement. The results show that replacing the feature extraction network can indeed improve on the original method, and they confirm the assumption that the VGG net, as a traditional CNN, performs worse in the DSDL method. However, it is ResNet-152 that performs best among all the methods tested, including the original method with ResNet-101, with an improvement of around 0.3 in mAP. Surprisingly, although the DenseNet series has better performance on ImageNet than ResNet and all of its variants reached a relatively high accuracy, none of them performed better in the DSDL method than ResNet; even the best one, DenseNet-201, is still 0.3 mAP inferior to the original method.

Discussion
Throughout the whole experiment, it can be found that a better feature-extracting network can indeed improve performance on multi-label image classification. The performance of the traditional VGG networks is inferior to the modern networks. It can also be found that within the same series of networks, VGG19 does better than VGG16, DenseNet-201 does better than DenseNet-169 and DenseNet-121, and ResNet-152 does better than ResNet-101: the more complex the network, the better it performs with the DSDL method. However, this does not mean that superior performance on other tasks, such as multi-class classification on ImageNet, ensures better performance on other specified tasks. The DenseNet series is an improvement on the ResNet series but has inferior performance on the DSDL task. So, the choice of the right network can be as important as the choice of algorithm. Because the original researchers studied the influence of the hyperparameters, it is presumed that the hyperparameters are already set to the best values; future studies could explore different hyperparameters to fit the different networks.

Conclusion
Multi-label image classification is an important computer vision topic that is widely used in various fields. Compared to other methods, DSDL adopts a modular design and combines semantic meanings with visual features: it uses dictionary learning to generate a semantic dictionary for the visual features to look up, thus enhancing performance. In this paper, different versions of pretrained VGG, ResNet, and DenseNet are applied as alternative backbones for the DSDL method on the multi-label image classification task; compared to the original mAP of 94.232, the alternative with ResNet-152 makes an improvement of around 0.3, reaching an mAP of 94.543. Other indicators also show that the method with ResNet-152 surpasses the remaining methods. The experimental results also show that the choice of the right network can have an obvious influence on the performance of deep learning tasks: it is not guaranteed that a feature extraction network working well on some image classification tasks will perform well on others. Since the recombination of pre-existing networks can actually make some progress, in future studies, additional information such as the location information of different objects could be combined into the DSDL method to achieve better performance.

Figure 1 .
Figure 1. Some image examples from VOC2007.

3.1. Datasets
To test the effectiveness of different feature extraction networks, comparison experiments are done on the VOC 2007 dataset. Pascal VOC (Visual Object Classes) 2007 is a dataset used extensively in the area of computer vision; the famous Faster-RCNN, YOLO-v1, and YOLO-v2 all used this dataset as a demonstration.

Figure 2 .
Figure 2. A typical image from VOC 2007 that shows a person riding a horse.