Hierarchical Auxiliary Learning

Conventional application of convolutional neural networks (CNNs) to image classification and recognition is based on the assumption that all target classes are equal (i.e., no hierarchy) and exclusive of one another (i.e., no overlap). CNN-based image classifiers built on this assumption therefore cannot take into account an innate hierarchy among target classes (e.g., cats and dogs in animal image classification) or additional information that can be easily derived from the data (e.g., numbers larger than five in the recognition of handwritten digits), which results in scalability issues when the number of target classes is large. Combining two related but slightly different ideas, hierarchical classification and logical learning by auxiliary inputs, we propose a new learning framework called hierarchical auxiliary learning, which not only addresses the scalability issues with a large number of classes but can also further reduce the classification/recognition errors with a moderate number of classes. In hierarchical auxiliary learning, target classes are semantically or non-semantically grouped into superclasses, which turns the original problem of mapping between an image and its target class into a new problem of mapping between a pair of an image and its superclass and the target class. To take advantage of superclasses, we introduce an auxiliary block into a neural network, which generates auxiliary scores used as additional information for final classification/recognition; in this paper, we add the auxiliary block between the last residual block and the fully-connected output layer of ResNet. Experimental results demonstrate that the proposed hierarchical auxiliary learning can reduce classification errors by up to 0.56, 1.6 and 3.56 percent on the MNIST, SVHN and CIFAR-10 datasets, respectively.


Introduction
Deep convolutional neural networks (CNNs) have attracted considerable attention due to their superior performance in image classification [25,19,9,5]. With residual blocks, the depth and width of a neural network architecture have become key factors in reducing the classification error. Researchers have been investigating not only neural network architectures but also ways of utilizing a given dataset. For example, data are augmented by rotation and translation [4,10,20], auxiliary information from external data is fed to a neural network [21,12,18,26], data are grouped into superclasses in a supervised or unsupervised way [24,3], and information about the data is gradually fed to the neural network [1].
Note that in the conventional use of CNNs for image classification and recognition, it is assumed that all target classes are equal (i.e., no hierarchy) and exclusive of one another (i.e., no overlap). CNN-based image classifiers built on this assumption cannot take into account an innate hierarchy among target classes (e.g., cats and dogs in animal image classification) or additional information that can be easily derived from the data (e.g., numbers larger than five in the recognition of handwritten digits), thereby resulting in scalability issues when the number of target classes is large.
In this paper, we propose a new learning framework called hierarchical auxiliary learning based on two related but slightly different ideas, hierarchical classification [3,23,24,27] and logical learning by auxiliary inputs [22], which not only addresses the scalability issues with a large number of classes but can also further reduce the classification/recognition errors with a moderate number of classes. In hierarchical auxiliary learning, we first group classes into superclasses (e.g., grouping "Beagle" and "Poodle" into "Dog", and "Persian Cat" and "Russian Blue" into "Cat") and provide this superclass information to a neural network in the following three steps: First, the neural network is augmented with an auxiliary block, which takes superclass information and generates an auxiliary score. We use ResNet [6] as an example neural network architecture in this paper and insert the auxiliary block between the last residual block and the fully-connected output layer. Second, a superclass is semantically or non-semantically assigned to each image and one-hot encoded. Finally, the one-hot-encoded superclass vector is fed to the auxiliary block, and the product of the output of the last residual block and the output of the auxiliary block is injected into the fully-connected output layer.
The rest of the paper is organized as follows: Section 2 introduces work related to hierarchical auxiliary learning. Section 3 describes the neural network architecture based on the proposed hierarchical auxiliary learning. Section 4 presents experimental results showing the classification performance improvement achieved by hierarchical auxiliary learning and the effect of different superclass assignments on the performance. Section 5 concludes the paper.

Related Work
It is well known that transferring learned information to a new task as auxiliary information enables efficient learning of the new task [15], while providing acquired information from a wider network to a thinner network improves the performance of the thinner network [16].
Auxiliary information derived from the input data also improves performance. In stage-wise learning, coarse to fine images, subsampled from the original images, are fed to the network step by step to enhance the learning process [1]. The ROCK architecture introduces an auxiliary block that performs multiple auxiliary tasks, extracting useful information from the input and injecting it back for the main task [13].
Numerous approaches have been proposed to utilize hierarchical class information as well. Cerri et al. [2] connect multi-layer perceptrons (MLPs) and let each MLP sequentially learn one level of the class hierarchy, where each MLP takes the output of the preceding one as its input. Yan et al. [24] insert a coarse category component and fine category components after a shared layer: classes are grouped into K coarse categories, and each of the K fine category components is targeted at one coarse category. In [3], a CNN learns labels generated by maximum margin clustering at the root node, and images in the same cluster are classified at a leaf node. B-CNN learns from coarse to fine features by calculating losses between superclasses and the outputs from the branches of the architecture [27], where the overall loss of B-CNN is the weighted sum of the losses over all branches. In [23], an ultrametric tree based on the semantic meaning of all classes is proposed to exploit hierarchical class information; the probability of each node of the ultrametric tree is the sum of the probabilities of the leaves that have a path to that node. Furthermore, auxiliary inputs are used for logical reasoning in [22]: auxiliary inputs based on human knowledge are provided to the network so that it learns logical reasoning, verifying the logical information with the auxiliary inputs first before proceeding to the next stage.

Hierarchical Auxiliary Learning
Learning step by step makes a task easier and more efficient, and much of learning naturally forms a hierarchical structure. In image classification, images can be classified through several steps. For example, the digits 0 to 9 can be grouped into two groups based on whether the digit is greater than or equal to 5. As another example, a dataset consisting of mammals, birds, fish, cars, airplanes and electrical devices can be grouped into two superclasses according to aliveness. If images are hierarchically classified, the task becomes easier and more efficient, especially when the numbers of data and classes are large. The goal of hierarchical auxiliary learning is to utilize such superclasses: for example, digits 0 to 4 and 5 to 9 can be grouped into superclasses 0 and 1, respectively. Let (x, y) be a pair of an image and a class, and let x* be the superclass assigned to the pair. Each element of the dataset then consists of three components, (x, x*, y). Unlike in conventional neural networks, both x and x* are fed into the network, and the goal becomes to learn a function f that minimizes the loss between the network output y_d and y.
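The construction of (x, x*, y) triples can be sketched as follows, using the digits-grouped-by-five example from the text (function names and the placeholder images are illustrative):

```python
# Sketch: augment (image, label) pairs with a superclass, following the
# example in the text: digits 0-4 -> superclass 0, digits 5-9 -> superclass 1.
def assign_superclass(label):
    """Illustrative grouping: digits below five vs. five and above."""
    return 0 if label < 5 else 1

def make_triples(dataset):
    """Turn (x, y) pairs into (x, x*, y) triples."""
    return [(x, assign_superclass(y), y) for x, y in dataset]

pairs = [("img_of_3", 3), ("img_of_7", 7)]
print(make_triples(pairs))  # [('img_of_3', 0, 3), ('img_of_7', 1, 7)]
```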

Learning Scheme
To take advantage of superclasses, we introduce an auxiliary block, which takes the superclass information of the inputs and utilizes it to improve the performance of the neural network. The auxiliary block can be located between any two consecutive layers. In this paper, a small ResNet is used as the baseline, and the auxiliary block is located between the last residual block and the fully-connected output layer as shown in Figure 1. With the one-hot-encoded superclass vector, the forward pass is as follows: First, the auxiliary block takes the superclass vector, whose size is batch×s, and the output of the last residual block, whose size is batch×l. The superclass vector then passes through a linear layer producing a batch×l output, and the auxiliary score, whose size is batch×1, is obtained by element-wise subtraction of the residual-block output and summation over each row as described in Figure 1. Finally, the output of the last residual block is element-wise multiplied by the auxiliary score.
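The forward pass described above can be sketched with NumPy as follows; the dimension values, the random inputs and the weight initialization are assumptions for illustration (a real implementation would sit inside the ResNet):

```python
import numpy as np

def auxiliary_forward(y_res, x_star, W):
    """Forward pass of the auxiliary block as described in the text.

    y_res  : (batch, l) output of the last residual block
    x_star : (batch, s) one-hot-encoded superclass vectors
    W      : (s, l) weights of the linear layer in the auxiliary block
    """
    linear_out = x_star @ W                                # (batch, l)
    # Auxiliary score: element-wise subtraction, then summation over each row.
    a = np.sum(linear_out - y_res, axis=1, keepdims=True)  # (batch, 1)
    # Output of the last residual block scaled by the auxiliary score.
    return y_res * a                                       # (batch, l)

batch, s, l = 4, 2, 8
rng = np.random.default_rng(0)
y_res = rng.normal(size=(batch, l))
x_star = np.eye(s)[rng.integers(0, s, size=batch)]  # random one-hot superclasses
W = rng.normal(size=(s, l))
out = auxiliary_forward(y_res, x_star, W)
print(out.shape)  # (4, 8)
```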

Backpropagation
The weights of the neural network are adjusted to reduce the loss through backpropagation [17]. The key point of the auxiliary block is that the auxiliary score is not directly calculated from the superclass but learned through backpropagation. Let $y_{N-1}$, $y_{N-2}$, and $a$ be the input to the fully-connected output layer, the output of the last residual block and the auxiliary score, respectively. Compared to the original ResNet, the input to the output layer $y_{N-1}$ in the proposed architecture is calculated by

$$y_{N-1} = a \cdot y_{N-2}$$

and

$$a = \sum_i \Big( \sum_j w_{ij} x^*_j - y_{N-2,i} \Big),$$

where the $w_{ij}$'s are the weights of the linear layer in the auxiliary block and $x^* = (x^*_j)$ is the one-hot-encoded superclass vector. Then, the neural network is trained as follows:

$$w_{ij} \leftarrow w_{ij} - \eta \frac{\partial L}{\partial w_{ij}}$$

and

$$\frac{\partial L}{\partial w_{ij}} = \frac{\partial L}{\partial a}\, x^*_j,$$

where, in the backward pass,

$$\frac{\partial L}{\partial a} = \sum_k \frac{\partial L}{\partial y_{N-1,k}}\, y_{N-2,k},$$

with $L$ the loss function of the neural network and $\eta$ the learning rate.
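The gradient of the loss with respect to the auxiliary-block weights can be checked numerically. The following sketch compares the analytic gradient $\partial L/\partial w_{ij} = (\partial L/\partial a)\, x^*_j$ against central finite differences for a single sample; the squared-error loss, the random values and the (s, l) weight layout are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
l, s = 6, 2
y_res = rng.normal(size=l)      # output of the last residual block (one sample)
x_star = np.array([1.0, 0.0])   # one-hot superclass vector (s entries)
W = rng.normal(size=(s, l))     # weights of the auxiliary linear layer
t = rng.normal(size=l)          # arbitrary target for a squared-error loss

def loss(W):
    a = np.sum(x_star @ W - y_res)   # auxiliary score
    y_out = a * y_res                # input to the output layer
    return 0.5 * np.sum((y_out - t) ** 2)

# Analytic gradient: with this (s, l) weight layout, dL/dW_ij = (dL/da) * x*_i,
# where dL/da = sum_k (dL/dy_out_k) * y_res_k.
a = np.sum(x_star @ W - y_res)
dL_dyout = a * y_res - t
dL_da = np.sum(dL_dyout * y_res)
grad_analytic = dL_da * np.outer(x_star, np.ones(l))

# Numerical gradient via central finite differences.
eps, grad_num = 1e-6, np.zeros_like(W)
for i in range(s):
    for j in range(l):
        Wp, Wm = W.copy(), W.copy()
        Wp[i, j] += eps
        Wm[i, j] -= eps
        grad_num[i, j] = (loss(Wp) - loss(Wm)) / (2 * eps)

print(np.max(np.abs(grad_analytic - grad_num)) < 1e-4)  # True
```

Since the loss is quadratic in the weights here, the central differences agree with the analytic gradient up to floating-point roundoff.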

Experimental Results
We evaluate the proposed model on three standard benchmark datasets: MNIST, SVHN and CIFAR-10. The residual block in [8] is used as the baseline in our experiments. A 10-layer ResNet is trained on MNIST, and a 28-layer ResNet is trained on SVHN and CIFAR-10. Cosine annealing [11] is used for the learning rate with a maximum learning rate of 1.0 and a minimum learning rate of 0. The batch size and the number of epochs are set to 128 and 250, respectively, for all datasets. Weights are initialized by He initialization [7]. During training, all images are randomly cropped after padding 4 pixels on each side, randomly flipped along the horizontal axis and normalized. Only normalization is applied during testing.
Let c be the number of superclasses. The superclass is one-hot encoded from the label in order to be fed to the auxiliary block; that is, x*_j in Section 3.2 takes 1 if the superclass of the image is j and 0 otherwise.
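The one-hot encoding of superclass labels can be sketched as:

```python
import numpy as np

def one_hot_superclass(superclass_labels, c):
    """One-hot encode superclass labels: x*_j = 1 iff the superclass is j."""
    return np.eye(c)[superclass_labels]

x_star = one_hot_superclass([0, 2, 1], c=3)
print(x_star)
# [[1. 0. 0.]
#  [0. 0. 1.]
#  [0. 1. 0.]]
```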

MNIST
MNIST is one of the most widely used benchmark datasets in classification. It is composed of 10 classes of handwritten digits from 0 to 9. The size of each image is 28×28, and the numbers of training and test images are 60,000 and 10,000, respectively. Superclasses are assigned in three different ways based on human knowledge and one way based on the shape of the digits. First, the dataset is divided into two superclasses by whether the digit is less than five. Second, we split the dataset into even and odd numbers. Third, the prime numbers form one superclass. Finally, superclasses are determined by the shape of the digits: digits 0, 6, 8 and 9 contain a circle; digits 2, 3 and 5 do not contain a circle but contain a curve; the remaining digits consist of straight lines only. The four ways of assigning superclasses are summarized in Table 1. All cases are trained five times, and the means and standard deviations are shown in Table 2. While the baseline error is 0.93%, all cases reduce the error, irrespective of whether the superclass is given based on human knowledge or the image itself. The training and test losses of the baseline and of each case are shown in Figure 2.
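The four superclass assignments described above can be written out as follows; the functions are named after the descriptions rather than the case numbers in Table 1, and which group receives superclass id 0 versus 1 is our assumption:

```python
def by_five(d):    # digits 0-4 vs. digits 5-9
    return 0 if d < 5 else 1

def parity(d):     # even vs. odd digits
    return d % 2

PRIMES = {2, 3, 5, 7}
def prime(d):      # prime digits vs. the rest
    return 0 if d in PRIMES else 1

CIRCLE, CURVE = {0, 6, 8, 9}, {2, 3, 5}
def shape(d):      # contains a circle / a curve but no circle / straight lines only
    if d in CIRCLE:
        return 0
    if d in CURVE:
        return 1
    return 2       # digits 1, 4 and 7 consist of straight lines only

print([shape(d) for d in range(10)])  # [0, 2, 1, 1, 2, 1, 0, 2, 0, 0]
```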
Due to the auxiliary score, the training and test losses of all cases converge faster than those of the baseline. Figure 3 shows the auxiliary scores of all training images obtained after 250 epochs of training. The auxiliary scores of case1, which has the lowest error, show a clear separation between the two superclasses. In contrast, the auxiliary scores of the superclasses are mixed for all other cases, which results in lower error reductions than case1.

SVHN
The Street View House Numbers (SVHN) dataset also covers 10 digits like MNIST, but while MNIST consists of handwritten digits, SVHN consists of real-world digit images of size 32×32 [14]. 73,257 training images and 26,032 test images are available in the dataset. Because it has the same classes as MNIST, we train on SVHN with the four cases used for MNIST. Each case is trained five times, and the mean errors are shown in Table 2. The results demonstrate that all cases reduce the error by at least 1.2%. As observed on MNIST, Figure 4 shows that the training and test losses converge faster when superclasses are introduced to the network. The auxiliary scores of all cases are well separated according to their superclasses, as shown in Figure 5. As a result, the differences between the error reductions of case1, case2 and case3 are not significant. In addition, case4, which has three superclasses, also achieves more than 1% error reduction, as its auxiliary scores are well divided into three groups according to their superclasses.

CIFAR-10
CIFAR-10 consists of 60,000 images of size 32×32, which belong to the 10 classes listed in Table 3; 5,000 images of each class form the training set and the rest are used for testing. We assign superclasses to images both non-semantically and semantically. First, a superclass is given simply according to the label: if the label is larger than or equal to 5, superclass 0 is given; otherwise, superclass 1 is given. Second, superclass 0 is given to a class if its label is an odd number; otherwise, superclass 1 is given. As shown in Table 2, even these non-semantic superclass assignments improve the performance. Third, the classes are semantically grouped into two superclasses, i.e., transportation and animals. Finally, 5 superclasses are assigned based on the criteria described in Table 1. Figure 6 shows that learning is again faster and more efficient in terms of convergence when superclasses are used. The separation of auxiliary scores shows a stark difference between case1 and case3 in Figure 7: case1, which shows a clear split among auxiliary scores, provides much better classification performance than case3, as shown in Table 2. The auxiliary scores for case4, which uses 5 superclasses, are divided into almost 5 groups, though the auxiliary scores of the superclasses car and small animals are mixed together.
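The semantic two-superclass grouping (transportation vs. animals) can be sketched with the standard CIFAR-10 class names; which group receives superclass id 0 is our assumption:

```python
# Standard CIFAR-10 class names in label order.
CIFAR10_CLASSES = ["airplane", "automobile", "bird", "cat", "deer",
                   "dog", "frog", "horse", "ship", "truck"]
TRANSPORTATION = {"airplane", "automobile", "ship", "truck"}

def semantic_superclass(label):
    """0 for transportation, 1 for animals (orientation is illustrative)."""
    return 0 if CIFAR10_CLASSES[label] in TRANSPORTATION else 1

print([semantic_superclass(i) for i in range(10)])  # [0, 0, 1, 1, 1, 1, 1, 1, 0, 0]
```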

Concluding Remarks
To address the scalability issues in classification with a large number of classes and to further improve the classification/recognition performance with a moderate number of classes, in this paper we have proposed hierarchical auxiliary learning, a new learning framework that exploits an innate hierarchy among target classes or additional information easily derived from the data themselves.

Figure 1: Neural network architecture for the hierarchical auxiliary learning.

Figure 2: Loss comparison of train and test dataset at each epoch during training between the baseline and (a) case1, (b) case2, (c) case3 and (d) case4 with the MNIST dataset.
Figure 3: Auxiliary scores of all training images after 250 epochs of training with the MNIST dataset.
Figure 4: Loss comparison of train and test dataset between the baseline and each case with the SVHN dataset.
Figure 5: Auxiliary scores of all training images with the SVHN dataset.

Figure 6: Loss comparison of train and test dataset at each epoch during training between the baseline and (a) case1, (b) case2, (c) case3 and (d) case4 with the CIFAR-10 dataset.
Figure 7: Auxiliary scores of all training images with the CIFAR-10 dataset.