Auxiliary Decoder and Classifier for Imbalanced Skin Disease Diagnosis

To date, deep learning has been widely adopted in medical diagnosis systems and has achieved great success in real-world applications. However, in medical image-based intelligent diagnosis, class imbalance often arises because substantially less training data is available for rare diseases than for common diseases, which usually degrades classification performance dramatically. In this paper, we propose a novel learning framework to effectively alleviate the impact of class imbalance: an auxiliary decoder is added to reconstruct the original images, and the original CNN classifier is reused so that the classifier is more likely to extract disease-relevant features for both rare and common diseases. Thorough experiments on two skin disease datasets show that the proposed framework outperforms strong baselines by a clear margin. The proposed method is also independent of network architecture, so it can be flexibly combined with different model structures. Moreover, applying our method together with existing training strategies designed for class imbalance can further improve classification performance.


Introduction
Nowadays, significant progress has been achieved in deep learning, and relevant techniques have been applied in many medical diagnosis systems [1][2][3][4][5][6][7][8]. However, accurate diagnosis often relies on a large amount of training data, while in many real applications, such as skin disease diagnosis [9][10][11], the collected data may be very limited, especially for rare diseases. Imbalanced training data between common and rare diseases can make deep learning models largely ignore discriminative features of rare diseases during training, resulting in predictions biased towards the common (large-sample) diseases. Since the class-imbalance issue has been extensively investigated over the last few decades, some traditional but effective methods have been adopted to deal with class imbalance in deep learning. One widely adopted approach is to augment small classes by simply over-sampling the data from these classes [12]. Over-sampling can be implemented indirectly through various transformations of the original data, such as horizontal flipping and random rotation within a certain range of degrees. Besides generating more data, cost-sensitive methods, such as class weighting [13] and focal loss [14], have also proven effective in alleviating the class-imbalance issue. Class weighting helps deep models pay more attention to small-class data during training, and focal loss helps models automatically select and focus on hard training samples, many of which come from small classes. In addition, transfer learning via a pretrained model can also be useful for alleviating class imbalance [15]. While these traditional approaches have been widely adopted to handle class imbalance when training deep learning models, deep learning techniques themselves have seldom been explored for this purpose.
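To make the focal loss idea concrete, the following is a minimal NumPy sketch for a single sample; the function name and the optional per-class weight `alpha` are our own illustrative choices, not from the cited work. With focusing parameter gamma = 0 it reduces to ordinary cross-entropy, and larger gamma down-weights well-classified (easy) samples.

```python
import numpy as np

def focal_loss(probs, target, gamma=2.0, alpha=None):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t),
    optionally scaled by a per-class weight alpha[target]."""
    pt = probs[target]                      # predicted probability of the true class
    weight = 1.0 if alpha is None else alpha[target]
    return -weight * (1.0 - pt) ** gamma * np.log(pt)

# A confidently classified sample (p_t = 0.7) contributes far less
# to the focal loss (gamma = 2) than to plain cross-entropy (gamma = 0).
probs = np.array([0.7, 0.2, 0.1])
easy_ce = focal_loss(probs, 0, gamma=0.0)   # == -log(0.7)
easy_fl = focal_loss(probs, 0, gamma=2.0)   # == 0.3^2 * -log(0.7)
```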
One exception is the recent work that uses the Grad-CAM attention map to help models focus on the lesion region in images of rare diseases during training, resulting in improved diagnosis performance on both common and rare diseases [16]. In this paper, a simple but novel deep learning-based framework is proposed to alleviate class imbalance, mainly with the help of a decoder network. This is inspired by the idea that better image reconstruction from a higher layer of the CNN classifier would encourage the classifier to retain more visual content in its features, especially when reconstruction faithfulness is enforced for images of rare diseases. Extensive experiments on two skin image datasets demonstrate the effectiveness of the proposed framework.

Method
The objective of interest is to effectively handle the class-imbalance problem such that small classes, such as rare diseases, can be well learned by the classifier. To help the classifier learn small classes, larger weights are often assigned to smaller classes so that the importance of each training sample from a small class is emphasized and the sample can therefore be correctly recognized. However, due to the limited training data for each small class, the classifier might overfit the small-class data, i.e., learn to recognize each small class not by its class-specific characteristics but by certain superficial features. With this consideration, it would be desirable for the feature vector used for class prediction to contain more information about the original data, such that the class-specific information is more likely encoded in the feature vector. Here we propose a simple way to achieve this goal, i.e., by adding a subsidiary decoder and an auxiliary CNN classifier to the (original) CNN classifier (figure 1). Intuitively, if any input image can be well reconstructed from the feature output of the original CNN classifier, such a feature vector should contain all essential visual information of the input image, including the discriminative information needed for accurate class prediction. Therefore, while training the CNN classifier, a decoder can be attached to the end of the feature extractor part of the original classifier to encourage the feature extractor's output to contain as much visual information of the input as possible, with the feature extractor also serving as an encoder. On the other hand, since the output of the decoder is often over-smoothed compared to the input image (e.g., discarding detailed information like high-frequency edges and textures), the class-specific information in the input image could be partly or mostly removed during decoding.
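The wiring described above can be sketched in PyTorch as follows. This is an illustrative miniature, not the paper's actual architecture: the experiments use a ResNet50 backbone, while here a tiny convolutional encoder stands in, and the layer sizes (`TwinReconNet`, 32 x 32 inputs) are our own assumptions. The key point is that the same feature extractor and head are applied to both the input image and its reconstruction, so the auxiliary classifier shares all parameters with the original one.

```python
import torch
import torch.nn as nn

class TwinReconNet(nn.Module):
    """Sketch of the framework: a shared feature extractor / classifier
    head, plus a decoder reconstructing the input from the features.
    Layer sizes are illustrative, not the paper's ResNet50 setup."""
    def __init__(self, num_classes=7):
        super().__init__()
        # feature extractor doubles as the encoder
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(32, num_classes))
        # decoder mirrors the encoder with transposed convolutions
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        feats = self.features(x)
        logits = self.head(feats)            # original classifier output
        recon = self.decoder(feats)          # reconstructed image
        # auxiliary classifier: the very same weights applied to the reconstruction
        logits_aux = self.head(self.features(recon))
        return logits, recon, logits_aux
```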
To help the output of the decoder retain the essential discriminative information of each input image, we propose applying the original classifier again, as an auxiliary CNN classifier sharing the model parameters with the original CNN classifier, to classify the reconstructed image from the decoder ('twin classifier network' in figure 1). Overall, the twin CNN classifiers and the decoder can be jointly trained by minimizing the loss

L = Lc + α Lt + β Lr,   (1)

where Lc can be the general cross-entropy loss or one of its variants, such as class-weighted cross-entropy, for the original CNN classifier; Lt is the same type of loss as that of the original CNN classifier, applied to the auxiliary CNN classifier; and Lr is the (L2 or L1) reconstruction loss for the decoder. α and β are coefficients that balance the three loss terms. Lr can further be decomposed as

Lr = Σ_{k=1}^{K} ωk Lk,   (2)

where Lk is the reconstruction loss for the kth class of training images and K is the total number of classes. ωk is the class weight, with a higher value for smaller classes, thus emphasizing that images from smaller classes should be reconstructed more faithfully. It is expected that the class weight further helps the output of the feature extractor keep all important (including class-specific) information, especially for small classes.

Figure 1. The proposed model framework. The green part represents the model structure, and the blue part represents the loss terms for model training.
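A minimal sketch of the combined objective is given below, assuming the three model outputs described above. Two caveats: the pairing of α with Lt and β with Lr follows the order in which the terms are introduced in the text (the paper's exact pairing is not spelled out), and the class-weighted reconstruction term of equation (2) is approximated here by weighting each sample's L2 error with its class weight, which is equivalent up to class-size normalization.

```python
import torch
import torch.nn.functional as F

def total_loss(logits, logits_aux, recon, images, labels,
               class_weights, alpha=10.0, beta=0.2):
    """L = Lc + alpha*Lt + beta*Lr, with Lr class-weighted as in eq. (2).
    alpha/beta defaults follow the paper's experimental settings; their
    pairing with Lt/Lr is our reading of the text."""
    lc = F.cross_entropy(logits, labels)        # original classifier loss
    lt = F.cross_entropy(logits_aux, labels)    # auxiliary (twin) classifier loss
    # per-sample L2 reconstruction error, weighted by each sample's class weight
    per_sample = ((recon - images) ** 2).flatten(1).mean(dim=1)
    lr = (class_weights[labels] * per_sample).mean()
    return lc + alpha * lt + beta * lr
```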

Experimental Settings
Dataset. Two medical image datasets were used to evaluate the proposed approach. The first is the Skin-7 dataset provided by the ISIC2018 Challenge with 7 disease categories [17,18], in which 6705 images are of Melanocytic nevus while only 115 images are of Dermatofibroma, exhibiting serious data imbalance between classes. The other is the Skin-198 dataset with 198 categories [19]. The smallest class contains only 10 samples, and more than 70 classes contain fewer than 20 samples. All images were resized to 300 × 300 pixels and then randomly cropped to 224 × 224 pixels. For each dataset, images were randomly split into five folds with stratification for five-fold cross-validation. Each time, four folds were gathered as the training set and the remaining fold was used as the test set.
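The stratified five-fold protocol above can be sketched with scikit-learn; the 90:10 toy labels below are a stand-in for the real disease categories. Stratification guarantees each held-out fold preserves the original class proportions, so even a 10-sample rare class appears in every fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy labels with a 9:1 imbalance, standing in for disease categories.
labels = np.array([0] * 90 + [1] * 10)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
folds = list(skf.split(np.zeros(len(labels)), labels))

# Each held-out fold keeps the original class proportions:
# 20 samples per fold, exactly 2 of them from the rare class.
for _, test_idx in folds:
    assert len(test_idx) == 20
    assert (labels[test_idx] == 1).sum() == 2
```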
Implementation and protocol. In the experiments, each encoder backbone was pretrained on ImageNet, while the decoder was initialized by Kaiming normal initialization [20]. The decoder consists of 2 blocks, each containing 1 deconvolutional layer. α and β in equation (1) were set to 10.0 and 0.2, respectively. The SGD optimizer was used throughout, with the initial learning rate set to 0.001 and momentum set to 0.9. The learning rate was divided by 10 at the 100th epoch. Each model was trained for up to 200 epochs, with training consistently observed to converge within 160 epochs. Considering the imbalanced distribution across classes, the mean class F1-score (MF1, i.e., the average F1-score over all classes), Precision (i.e., the average precision over all classes), and Recall (i.e., the average recall over all classes) at the last training epoch were calculated on each validation set, and the mean and standard deviation of these measurements over the five cross-validation sets were reported.
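The optimization schedule above maps directly onto PyTorch's SGD plus MultiStepLR; a minimal sketch, with a trivial linear layer standing in for the real network:

```python
import torch
from torch.optim.lr_scheduler import MultiStepLR

model = torch.nn.Linear(4, 2)                       # stand-in for the real network
opt = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
sched = MultiStepLR(opt, milestones=[100], gamma=0.1)  # divide lr by 10 at epoch 100

for epoch in range(200):
    # ... one training epoch would run here ...
    sched.step()

# After the milestone, the learning rate is 0.001 / 10 = 0.0001.
final_lr = opt.param_groups[0]["lr"]
```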

Results
In order to test the effectiveness of the proposed approach, we compared our method to two widely used training strategies for handling data imbalance, namely 1) cost-sensitive learning (i.e., class-weighted cross-entropy loss, denoted by WCE), and 2) focal loss [14] (denoted by FL), a representative method of hard negative mining. The traditional cross-entropy loss (BCE) and the class-weighted focal loss (WFL) were also used as baseline training strategies. For a fair comparison with each baseline, the proposed model (denoted by CDC) was trained with the same baseline training strategy each time. The performance of the ablation version without the auxiliary CNN classifier (denoted by CD) was also reported. From tables 1-2, it can be observed that the proposed framework outperforms all the baselines on both the Skin-7 and Skin-198 datasets. In particular, the improvement is also clear on the small-sample classes (table 3, average performance over the smallest classes on Skin-7 and the 40 smallest classes on Skin-198) compared to the baselines BCE and WCE. Similar improvement was also observed when compared to the FL and WFL baselines on the small-sample classes (not shown due to limited space). Figure 2 shows the performance of various approaches on one validation set during training, confirming that all training runs converged and that the proposed framework consistently outperforms the baselines. Besides, while all the reported results were based on the ResNet50 backbone, similar performance was also observed with the VGG and DenseNet backbones, supporting that the proposed framework is generalizable and not limited to a specific CNN backbone.

Figure 2. MF1 curves of different methods over the smallest classes on one Skin-7 validation set with respect to training epochs. Training converged at around 160 epochs, and the performance of the proposed framework is consistently better than that of the corresponding baselines.
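The macro-averaged metrics used in these comparisons (MF1, Precision, Recall) correspond to scikit-learn's `average="macro"` option; a minimal sketch on toy predictions, where class 2 plays the role of a rare disease:

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 2])   # toy ground truth; class 2 is "rare"
y_pred = np.array([0, 0, 1, 1, 1, 2])

# Macro averaging weights every class equally, so rare classes
# count as much as common ones -- unlike plain accuracy.
mf1 = f1_score(y_true, y_pred, average="macro")        # mean class F1 (MF1)
prec = precision_score(y_true, y_pred, average="macro")
rec = recall_score(y_true, y_pred, average="macro")
```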

Conclusion
In conclusion, this paper proposed a novel and effective way to handle the class-imbalance issue, mainly by using a subsidiary decoder to help the CNN classifier extract disease-relevant visual features. Experiments on two skin image datasets showed that the proposed learning framework improves not only the overall average classification performance over all diseases, but more importantly the performance on small-class (often corresponding to rare) diseases. The proposed framework is independent of existing training strategies and model backbones, and therefore can be easily combined with existing strategies and various CNN models.