Multi-task continuous learning model

In continual learning, previously learned knowledge tends to be overwritten by subsequent training tasks. This bottleneck, known as catastrophic forgetting (CF), has recently been relieved for simple vision tasks. Nevertheless, the challenge remains for the continuous classification of sequential sets discriminated by global transformations, such as extensive spatial rotations. To address this, a novel strategy of dynamic memory routing is proposed to dominate the forward paths of the capsule network (CapsNet) according to the current input sets. To recall previous knowledge, a binary routing table is maintained across the sequential tasks. Then, an incremental procedure of competitive prototype clustering is integrated to update the routing of the current task. Moreover, a sparsity measurement is employed to decouple the salient routing among the different learned tasks. The experimental results demonstrate the superiority of the proposed memory network over state-of-the-art approaches in recall evaluations on extended versions of SVHN, CIFAR-100, CelebA and other datasets.


Introduction
Intelligent creatures can learn and memorize required knowledge throughout their lives. This ability, referred to as continual learning, is key to achieving general artificial intelligence. In continual learning, previously learned knowledge tends to be erased by the overlap of subsequent training tasks, which is known as catastrophic forgetting (CF) [1]. Over the years, various approaches have been proposed to relieve this problem. The existing solutions can be briefly grouped into three categories: self-refreshing memory, stochastic gradient descent (SGD), and restraint approaches [2].
In the early days, researchers generally applied self-refreshing memory schemes through hybrid training, constantly stimulating neurons with information related to historical data to form an activated competition, so as to minimize the impact on the important parameters related to historical knowledge while learning new knowledge [3]. Alternatively, another solution is to use an ensemble of neural networks. When a new task arrives, this ensemble strategy sets up a new network branch and shares representations between tasks [4,5]. These approaches often improve performance across the whole set of tasks and are practically used in continual learning. However, they have a complexity limitation, especially at inference time, since the number of networks increases with the number of new tasks to be learned [6].
Another solution focuses on the implicit distributed storage of information in typical learning with stochastic gradient descent (SGD) [7]. These approaches adopt the idea of dropout or maxout to distributively store the information for each task by utilizing the large capacity of deep neural networks [8]. Unfortunately, most studies following this solution had limited success and failed to preserve performance on old tasks when an extreme change to the environment occurred [9,10].
Later studies suggest that the CF problem can be effectively relieved by restraining and pruning the connections of deep networks [11,12,13]. Recently, weight regularization has been integrated into the iterative optimization to adapt to new tasks by transferring the old network configurations [14]. Also, the evolutionary theory of networks has been further explored [15] in terms of the efficient transfer strategy of elastic weight consolidation (EWC) [16,17]. Learning without forgetting (LWF) [18] utilizes a knowledge distillation loss to preserve the performance of previous tasks. In addition, some studies continually train the neural network by sequential Bayesian estimation [19], introducing a constraint in the loss function that directs plasticity to the weights that contribute the most to the previous tasks. CLAW [20] is proposed in terms of probabilistic modelling and variational inference. Sparsity-based approaches often embed a hidden layer of sparse coding to relieve CF [21]; nevertheless, the obtained sparsity might impair the learning generalization on new tasks. TFM [22] adopts activation masks on each layer to prevent both catastrophic forgetting and negative transfer. Though these studies attempt to maintain a balance between the stability and plasticity of the original networks during continual learning, this balance is hard to preserve once the task sequence becomes longer or more difficult.
In this paper, we provide a new solution for continual multi-task learning. We extend CapsNet [23] to overcome catastrophic forgetting in continual learning. Overall, this model adapts well to multi-task continual learning problems with global transformations (such as extensive spatial rotations). A dynamic memory routing is formulated to dominate the feedforward connections of CapsNet according to the input task sets. Unlike other capsule routing algorithms, an independent routing table is maintained across all the sequential tasks and updated with each training batch. An incremental prototype clustering for the discrimination of sequential tasks is integrated into the proposed dynamic routing. Different from existing memory networks, the current task sets are analyzed and discriminated by learning the prototypes of each task set online.

Paradigm of sequential learning
Let an annotated set $D_t = \{(x_i, y_i)\}$ denote the $t$-th task of a sequence $t = 1, \dots, T$. Therefore, the whole network should be trained to decide both the shared weights $W$ and the task-aligned structure $C_t$, which will be detailed in the following sections.

Dynamic memory routing on CapsNet
For complex vision applications, the capsule network (CapsNet) provides more integral representations of spatial attributes by employing both vectored capsules and the routing between them. The capsule routing can be dynamically formulated by activating the important capsules aligned to different task sets. Therefore, this paper focuses on the capsule routing for building up the memory mechanism in sequential learning. The basic routing is applied on the capsule layer $l$, yielding a map from the projected capsules $\hat{U}$ (projected by the weights $W$) to the layer output, where the routing weights $C$ are organized as matrices with the capsule dimensions $p$ and $q$ respectively. This basic routing architecture is then expanded to deal with sequential sets for multiple tasks. Thus, a memory tensor called the multi-task dynamic routing table (MDRT), denoted $P = (C_1, C_2, \dots, C_T)$, is formulated by stacking the routing matrices $C_t$ of the single tasks, as shown in Figure 1(b). During sequential learning, the task routing $C_t$ is selected from the table $P$ by a selection operation $C_t = P(t)$, so the multi-task routing can be indicated as $V = P(t) \cdot \hat{U}$, where the capsule routing $C_t$ is dynamically selected in terms of the current task index $t$. In this paper, this architecture is called the dynamic memory routing network (DMRN), as shown in Figure 1. It implies that this capsule routing tends to feed forward the more discriminative capsules in $\hat{U}$ for the current task $t$. Inspired by this, the routing matrix $C_t$ can be considered as a clustering center of the projected capsules $\hat{U}$ depending on the current input. Then, the dynamic selection of the task routing is achieved by comparing the distances between $\hat{U}$ and all the clustering centers, namely $P$. In order to ignore the influence of element amplitudes, the task label is predicted by minimizing the cosine distance, $t = \arg\min_{t'} d_{\mathrm{cosine}}(\bar{U}, \phi(P(t')))$, where $\bar{U}$ is the compressed tensor of $\hat{U}$ obtained by averaging over the capsule dimension, and $\phi(\cdot)$ denotes vectorization with matrix flattening.
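The task-selection step above can be sketched as follows. This is a minimal numpy illustration, not the paper's implementation: `routing_table` stands for the MDRT $P$, `u_bar` for the compressed capsule tensor $\bar{U}$, and the flattening plays the role of $\phi(\cdot)$.

```python
import numpy as np

def select_task(u_bar, routing_table):
    """Pick the task whose routing prototype is closest to u_bar under
    cosine distance. Names and shapes are illustrative assumptions."""
    v = u_bar.flatten()                      # phi(.): matrix flattening
    best_t, best_d = 0, np.inf
    for t, c_t in enumerate(routing_table):  # routing_table = (C_1, ..., C_T)
        p = c_t.flatten()
        # cosine distance ignores element amplitudes, as noted in the text
        d = 1.0 - np.dot(v, p) / (np.linalg.norm(v) * np.linalg.norm(p) + 1e-12)
        if d < best_d:
            best_t, best_d = t, d
    return best_t
```

Because the distance is scale-invariant, a capsule tensor pointing in the same direction as a prototype selects that task regardless of its magnitude.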
Then the routing equation can be rewritten as $V = (P \cdot \mathrm{onehot}(t)) \cdot \hat{U}$; in terms of this equation, the current task routing can also be selected by a multiplication along the channel dimension. Moreover, the capsule routing $P$ should perform as a kind of hard weighting mechanism in order to be separated from the continual weighting $W$; thus all the elements of $C_t = P(t)$ are fed forward using the binary approximation applied in the binary neural network (BNN) \cite{courbariaux2016binarized}. The forward path of a single route can then still provide gradient updates for all the differentiable variables, as $\Delta w_{ij} = c^t_{ij} \cdot e_{ij}$, where $e_{ij}$ is the error value back-propagated from the higher layers. It implies that if a specific capsule route is inactivated ($c^t_{ij} = 0$), the related gradient update is restrained. The above formulations provide a harmonious updating strategy for the whole network weights, excluding the intrinsic MDRT, which is unfolded in the next section.
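The binary gating and the restrained gradient can be sketched as below. This is a simplified numpy illustration under assumptions of my own (a zero threshold for the binarization and element-wise gating); the paper only states that a BNN-style binary approximation is used.

```python
import numpy as np

def binarize(c):
    """Hard 0/1 routing mask (BNN-style hard approximation).
    The zero threshold here is an illustrative assumption."""
    return (c > 0).astype(c.dtype)

def routed_backward(c_t, e):
    """Gradient gating: routes that are inactive after binarization
    (c_t == 0) block the error e back-propagated from higher layers."""
    mask = binarize(c_t)
    return mask * e   # updates are restrained where the route is off
```

In a straight-through setup the same mask would gate both the forward pass and the backward error, which is what keeps inactive routes from being overwritten by new tasks.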

Increment prototype clustering
To achieve sequential learning, the key mechanism is to analyze the input task sets online, and on that basis to automatically switch from one task routing to another as a retrospective of previous memory. In this paper, the routing table $P$ is thus dynamically updated by prototype clustering in an incremental manner. The task routing $C_t = P(t)$ is intrinsically the clustering prototype of the projected capsules, as illustrated in Figure 2. Given the distance $d(C_t, \bar{U})$ within the currently learned task range, the prototype $P(t)$ is updated by

$C_t \leftarrow C_t + s\,(\bar{U} - C_t)$ if the task prediction $t$ is correct,
$C_t \leftarrow C_t - s\,(\bar{U} - C_t)$ otherwise,

where $s \in (0,1)$ is the updating step length; this is the standard prototype clustering update. Note that if the task prediction $t$ is correct, the prototype $C_t = P(t)$ is attracted toward $\bar{U}$ by the first equation; otherwise the mislabeled prototype is penalized by the second. Both are uniformly applied during the whole learning iterations. Furthermore, these prototype updating equations can be uniformly integrated into the feedforward of CapsNet through a softmax operation,

$p(t \mid \bar{U}) = \dfrac{\exp(-d(\bar{U}, P(t)))}{\sum_{t'} \exp(-d(\bar{U}, P(t')))}$.

This equation is aligned to the current clustering number $t$, which increases during the sequential task learning, implying that the softmax performs prototype clustering in an incremental manner. It means that the previous tasks $t'$ share their learned knowledge to discriminate the current task $t$. Then, the updating loss of the routing table $P$ can be indicated by the negative log-probability $-\log p(t \mid \bar{U})$, which is practically equivalent to the former prototype updating equations. According to memory networks, local gradient restraint is important for maintaining previously learned knowledge, so the learning strategy should focus on the current task routing.
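The incremental prototype update above can be sketched as an LVQ-style rule. This is a minimal numpy sketch assuming `P` is a mutable list of prototype arrays; the winning prototype is attracted when the task prediction is correct and repelled when it is wrong.

```python
import numpy as np

def update_prototype(P, t_pred, t_true, u_bar, s=0.1):
    """Incremental prototype update of the routing table P (list of
    prototypes), with step length s in (0, 1). Correct predictions pull
    the winning prototype toward u_bar; wrong ones push it away."""
    if t_pred == t_true:
        P[t_pred] = P[t_pred] + s * (u_bar - P[t_pred])
    else:
        P[t_pred] = P[t_pred] - s * (u_bar - P[t_pred])
    return P
```

Repeated correct updates make each prototype a running cluster center for the capsule statistics of its task, which is what lets the cosine-distance selection recall old tasks later.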
Thus, the one-hot task label $\mathrm{onehot}(t)$ is applied as a restraining mask on the outputs during the training phase, and the masked task loss is computed on the masked predictions $\hat{\mathbf{t}}$, which are the one-hot outputs of the task predictions sharing the same dimension as the label. Therefore, only the dynamic routing aligned to the output neurons of the reference label can be updated through backpropagation. In addition, the activated elements in the MDRT should be sparse, to build salient and discriminative routing paths for each sequential task. Thus, the intrinsic MDRT is restricted by an $L_1$-norm loss on each element of the learned routing table $P(t)$. Finally, the total routing loss of the capsule layer can be given by $L_{\mathrm{route}} = L_{\mathrm{task}} + \lambda \|P(t)\|_1$, where $\lambda$ is the sparse coefficient, and this layered routing loss is integrally combined with the final CapsNet margin loss $L_{\mathrm{margin}}$ using the default margin formula. Theoretically, it should be pointed out that the one-hot discriminant function is discontinuous. For this reason, it restrains the gradient back-propagation from the higher layers, and the prototype updating is thus relatively independent of the basic CapsNet. In order to formulate different sequential tasks, each dataset is sequentially extended by data augmentation such as axis rotations, as shown in Figure 3. Figure 3 gives a schematic of the split MNIST task protocol: each task includes all classes of images, i.e., the digits 0-9, and the other datasets are processed in the same way as MNIST.
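The routing-table loss, the negative log-probability over prototype distances plus the $L_1$ sparsity term, can be sketched as follows. This is an illustrative numpy sketch, assuming Euclidean prototype distances and a simple log-sum-exp softmax; the default $\lambda = 0.2$ matches the experimental settings reported later.

```python
import numpy as np

def routing_loss(P, u_bar, t_true, lam=0.2):
    """Negative log-probability of the true task under a softmax over
    negative prototype distances, plus an L1 sparsity term on the
    selected routing-table entry P(t). lam is the sparse coefficient."""
    d = np.array([np.linalg.norm(u_bar - p) for p in P])   # distances to prototypes
    logits = -d
    # numerically stable log of the softmax partition function
    log_z = logits.max() + np.log(np.exp(logits - logits.max()).sum())
    nll = -(logits[t_true] - log_z)            # negative log-probability
    sparsity = lam * np.abs(P[t_true]).sum()   # L1 norm on P(t)
    return nll + sparsity
```

The sum in the partition function runs only over the prototypes learned so far, so the loss grows with the task count, matching the incremental clustering view.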

Experimental settings
In the experiments, various datasets are extended for online sequential learning, which is more difficult than the settings of other existing reports on overcoming catastrophic forgetting. In detail, CIFAR-100 is composed of 60k 32×32 RGB images of 100 classes, with 600 images per class; each class has 500 images for training and 100 for testing. MNIST consists of 28×28 gray-scale images of handwritten digits, and Fashion-MNIST comprises gray-scale images of the same size. SVHN includes 32×32 color images of digits cropped from house-number scenes, and CelebA involves numerous face images of 218×178 pixels, each face annotated with 40 attributes. Tiny ImageNet is a 64×64×3 resized version of ImageNet with 200 of its classes. Sequential learning on sets involving global spatial transformations remains a challenge at present; thus, each image in these sets is rotated by different multiples of 18° to formulate the sequential task sets. We extensively compare the proposed technique with existing state-of-the-art methods for overcoming CF. The baselines include EWC, HAT [24], LWF, IMM (mode) [19], less-forgetting learning (LFL) [25] and PathNet [26], as well as some traditional methods, including standard SGD with dropout [7], CLAW, TFM [22], iCaRL [27] and RPS-Net [28].
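The rotated-task construction can be sketched as below. This is a minimal numpy sketch of my own (the paper does not specify the interpolation scheme, so a nearest-neighbour rotation is used here purely for illustration): task $t$ rotates every image by $t \cdot 18°$.

```python
import numpy as np

def rotate_nn(img, deg):
    """Nearest-neighbour rotation about the image centre
    (illustrative; interpolation choice is an assumption)."""
    h, w = img.shape
    cy, cx = (h - 1) / 2.0, (w - 1) / 2.0
    th = np.deg2rad(deg)
    ys, xs = np.mgrid[0:h, 0:w]
    # inverse mapping: source coordinates for each output pixel
    sy = cy + (ys - cy) * np.cos(th) - (xs - cx) * np.sin(th)
    sx = cx + (ys - cy) * np.sin(th) + (xs - cx) * np.cos(th)
    sy = np.clip(np.rint(sy).astype(int), 0, h - 1)
    sx = np.clip(np.rint(sx).astype(int), 0, w - 1)
    return img[sy, sx]

def make_sequential_tasks(images, n_tasks=10, step_deg=18):
    """Task t (1-indexed) rotates every image by t * step_deg degrees."""
    return [np.stack([rotate_nn(im, t * step_deg) for im in images])
            for t in range(1, n_tasks + 1)]
```

In practice a library routine with proper interpolation (e.g. from an image-processing package) would be used; the point is only that each sequential task differs by a fixed global rotation.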
In this paper, a CapsNet-based architecture containing two convolutional layers and one fully connected layer is implemented \cite{hinton2018matrix}. The first layer (Conv1) has 256 channels with 9×9 convolution kernels, a single stride and ReLU activation. This layer converts pixel intensities to the activities of local features that are then used as inputs to the next layer. The second layer (PrimaryCaps) is a convolutional capsule layer with 32 channels of convolutional 8-D capsules (i.e., each primary capsule contains 8 convolutional units with a 9×9 kernel). Each primary capsule sees the outputs of all Conv1 units whose receptive fields overlap with the location of the center of the capsule. In total, PrimaryCaps has 32×6×6 capsule outputs (each an 8-D vector), and the capsules in the 6×6 grid share their weights with each other. The final layer (DigitCaps) has one 16-D capsule per digit class, and each of these capsules receives input from all the capsules in the layer below. On this basis, the prototype clustering sub-network is integrated with the dynamic memory routing under the sparsity restraint, so that the parameters can be updated online after each optimization iteration.
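The layer shapes above can be checked with a few lines of arithmetic. This sketch assumes valid (no-padding) convolutions and a stride of 2 for the primary-capsule layer, as in the original CapsNet; the paper only states the 6×6 output explicitly.

```python
def conv_out(size, kernel=9, stride=1):
    """Output size of a valid (no-padding) convolution."""
    return (size - kernel) // stride + 1

# MNIST-style 28x28 input, following the layer description above
h1 = conv_out(28, kernel=9, stride=1)   # Conv1 spatial size: 20 (20x20x256)
h2 = conv_out(h1, kernel=9, stride=2)   # PrimaryCaps spatial size: 6 (6x6 grid)
n_primary = 32 * h2 * h2                # 32 channels -> 1152 primary capsules
```

So the DigitCaps routing operates over 1152 8-D primary capsules feeding the 16-D class capsules.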
The memory capability can be measured by training the model on a new task set and then testing the recall accuracy on the previously learned sets. For the sequential analysis, we resize the images in CelebA to 109×89 while maintaining the other datasets at their original sizes. For all the evaluation tasks, the unified learning rate and the sparse loss coefficient are fixed at 0.001 and 0.2 respectively, and the batch size is set to 48. All the methods share the same task order, data split rate, batch shuffle operation and weight initialization. The other network configurations are selected manually: in different tasks, the optimal number of channels is chosen empirically and maintained, and the optimal capsule dimension is determined through multiple test runs.
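The recall protocol described above can be sketched as an accuracy matrix. This is an illustrative harness, not the paper's code: `train_fn` and `eval_fn` are hypothetical stand-ins for the actual model's training and evaluation routines.

```python
import numpy as np

def recall_matrix(train_fn, eval_fn, tasks):
    """After training on task t, evaluate on every task seen so far.
    Returns A where A[t, s] is the accuracy on task s after training
    up to task t (defined only for s <= t; NaN elsewhere)."""
    T = len(tasks)
    A = np.full((T, T), np.nan)
    for t in range(T):
        train_fn(tasks[t])          # learn the new task (no retraining later)
        for s in range(t + 1):
            A[t, s] = eval_fn(tasks[s])   # recall accuracy on earlier tasks
    return A
```

Forgetting then shows up as decay along each column of the matrix as more tasks are learned.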

Results and Comparisons
The purpose of this experiment is to assess the impact of learning previous tasks on the current task. In other words, we want to evaluate whether an algorithm avoids the CF problem, by measuring the relative performance achieved on a unique task after learning a varying number of previous tasks. For the sake of fairness, a task is not retrained after its training phase. Each experimental dataset is expanded into 10 tasks with the same rotation-angle interval: a single task rotates the images by 18° clockwise or anticlockwise about the image center, so the rotation angle of the first task is 18° and each subsequent task adds the same interval. Figure 4 compares the accuracy of the different methods on the different datasets, with one task learned incrementally at a time; panel (e) shows the extended CelebA.
We extensively compare the proposed technique with the existing state-of-the-art methods for overcoming CF in Figure 4. The results indicate superior performance of the proposed method in all settings. For the extended CelebA in Table 1 and Figure 4(e), we outperform CLAW by an absolute margin of 21.69%. Compared with the second-best method, our approach achieves relative gains of 8.29% and 6.36% for the extended SVHN and extended CIFAR-100 datasets, shown in Figure 4(c) and (d). On the extended MNIST, Figure 4(a), our performance is only 0.09% below the CLAW approach. Table 1 compares the accuracy on the tenth task ($A_{10}$) of the different methods across datasets. The results show that LWF does well when learning each new task with the help of the representations of the previous tasks; however, as more tasks are included, the older tasks are forgotten more. IMM (mode) has the opposite effect: it focuses on intransigence and tries to keep the knowledge of the older tasks, running out of capacity for the newer tasks. This allows the approach to forget little and even exhibit a small backward transfer, but at the cost of performing worse on newer tasks. EWC has one of the worst performances, possibly due to the difficulty of obtaining a good approximation of the FIM when there are so many classes per task. TFM has a good overall performance without forgetting, but relies on the capacity of the network more than the other approaches. Probabilistic modelling and variational inference can perform well on specific tasks, but when the tasks become complex, CLAW performance degrades due to network redundancy. LFL has certain advantages in the first tasks; as the number of tasks increases, it becomes difficult to select hyperparameters and the robustness decreases, so the memory cannot be preserved well. Soft-constraint methods such as SGD have similar problems.

Conclusion
This paper presents an extended CapsNet which uses incremental prototype clustering to yield dynamic memory routing. It has been shown to be capable of retaining previous knowledge while learning new task sets. In the experiments, the testing image sets are augmented by axis rotations. Through a series of experiments, the effectiveness of the proposed network in overcoming catastrophic forgetting is verified and compared with state-of-the-art approaches. The main conclusions are as follows. First, the capsule-based memory architecture adapts to global axis rotations and performs much better than the CNN-based approaches. Second, the prototype-clustering-based dynamic memory routing can be successfully applied to sequential classification tasks, improving the recall performance. In addition, the sparse routing table can efficiently reduce weight overlapping, especially when the number of tasks increases. For challenging cases such as object deformations and occlusions, the enhancement of the dynamic model needs further study.