Multi-layered Stacked Ensemble Method with Feature Reduction Technique for Multi-Label Classification

Multi-label classification, in which each instance may be assigned more than one class label, is one of the important challenges in classification. Ensemble learning is a supervised learning process in which several classifiers are trained to obtain a better solution for a given problem. Feature reduction can be used to improve classification accuracy by considering the class label information together with Principal Component Analysis (PCA). In this paper, a stacked ensemble learning method with class-augmented PCA (CA PCA) is proposed for the classification of multi-label data (SEMML). First, the dimensionality reduction step is applied; then a number of base classifiers are chosen and applied to the training dataset, and finally the stacking method is applied. The results of the experiments conducted show that our proposed method performs better than the existing methods.


Introduction
Classification is a fundamental problem in machine learning, and it is generally single-label classification, i.e., there is only one class label for each pattern. In multi-label classification, there are multiple class labels, and each object or instance may be mapped to several of them. Multi-label classification is becoming an important area of research for many applications such as text [7] [8], image [8] [9], and movie genre classification. An ensemble technique combines diverse classifiers, and ensembles have been shown to be effective for multi-label classification problems. The real challenge lies in building an effective multi-label ensemble of classifiers that uses a subset of labels from the given label set for each member of the ensemble [10]. In daily life, before buying a product or choosing a hospital, we usually read reviews or gather opinions and individual experiences from many people so that we can make a better selection among the available choices.
There is always the problem of which classifier to choose among the many algorithms available, such as Naive Bayes (NB), Support Vector Machine (SVM), Decision Tree (DT), etc. In every case, it is necessary to choose the right classifier with minimal error. By using an ensemble of different classifiers and combining their outputs with different ensemble methods [2] [12], we can reduce the risk of randomly selecting one of the poorly performing classifiers. However, it is important to note that a combination of multiple classifiers may not always perform best; there is no guarantee, but it may reduce the risk of selecting a poorly performing model. The curse of dimensionality is another problem to be considered for classification.
In this paper we use the class-augmented PCA method, proposed in [27], to reduce the dimensionality of multi-label data. The following sections of the paper describe related work, the proposed method, the evaluation metrics, and an analysis of the experimental results.

Related work

Multi-label Classification
Multi-label classification can be formally described as follows: given a data set X, each instance X_i ∈ R^n has an associated label set Y_i ⊆ Y, where Y = {l_1, l_2, ..., l_h} is the set of labels. The model has to learn a function f mapping a given instance X_j ∈ X to a subset of labels, i.e., f(X_j) = Y_j such that Y_j ⊆ Y. There are two different types of approaches for multi-label classification problems: the Problem Transformation [5] approach and the Algorithm Adaptation [4] approach.
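As a concrete illustration of this setting, consider a small hypothetical dataset (not from the paper), stored as a binary label indicator matrix, a common representation for multi-label data:

```python
import numpy as np

# Toy multi-label dataset: 4 instances, 3 features, label set Y = {l1, l2, l3}.
X = np.array([[0.2, 1.1, 0.5],
              [1.3, 0.4, 0.9],
              [0.8, 0.7, 0.1],
              [0.5, 0.2, 1.4]])

# Y_indicator[i, j] == 1 iff instance X_i carries label l_{j+1};
# a row may contain several 1's, unlike single-label classification.
Y_indicator = np.array([[1, 0, 1],
                        [0, 1, 0],
                        [1, 1, 0],
                        [0, 0, 1]])

# f(X_j) = Y_j is then a mapping from a feature vector to a subset of Y.
labels_of_instance_0 = {f"l{j + 1}" for j in np.flatnonzero(Y_indicator[0])}
```

Here the first instance carries the label subset {l1, l3}, which a single-label formulation could not express.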

Problem Transformation
This approach transforms the multi-label data into single-label data, so that traditional single-label algorithms can be used for classification. The final output is an ensemble of the results given by the individual classifiers. Among the many commonly used problem transformation methods is the Binary Relevance method [18] [27], in which the dataset is duplicated for each unique label to give single-label instances. The new labels indicate the presence or absence of the true label for the instance, and the number of classifiers used equals the number of labels. This method is criticized because it assumes that there are no label dependencies. The Label Powerset method [4] considers each unique combination of labels found in the training set as a single label, and the test output is one of these unique label combinations. The problems associated with such an approach are complexity and class imbalance, as the number of data points with a specific label combination can be sparse in a given dataset.
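The two transformations can be sketched in a few lines of scikit-learn code; the toy data and the choice of logistic regression as the base learner are illustrative assumptions, not the paper's configuration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 4))
# Toy rule: the two labels depend on the signs of two feature sums.
Y = np.column_stack([(X[:, 0] + X[:, 1] > 0).astype(int),
                     (X[:, 2] - X[:, 3] > 0).astype(int)])

# --- Binary Relevance: one independent binary classifier per label. ---
br_models = [LogisticRegression().fit(X, Y[:, j]) for j in range(Y.shape[1])]
br_pred = np.column_stack([m.predict(X) for m in br_models])

# --- Label Powerset: each distinct label combination becomes one class. ---
combos, combo_ids = np.unique(Y, axis=0, return_inverse=True)
lp_model = LogisticRegression().fit(X, combo_ids)
lp_pred = combos[lp_model.predict(X)]          # map class id back to label set
```

With h labels, Label Powerset may need up to 2^h classes, which is exactly the complexity and imbalance issue noted above.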
The following are some of the basic classification algorithms used with the binary relevance approach on multi-label data. The performance of the proposed method is compared with these algorithms.
Binary Relevance k-Nearest Neighbours (BRkNN) [24] For multi-label classification with the binary relevance approach, a separate kNN model is used for each label.
Binary Relevance Decision Tree (BRDT) [20] Used for the improved generalization capability of the model, it builds one decision tree for each label.
Binary Relevance Naive Bayes (BRNB) [18] [23] For each label in the label set Y, a binary Naive Bayes classifier H : d → {0, 1} is learned for the classification of new data points. This method outputs a set of labels which is the union of the outputs of the different single-label Naive Bayes classifiers. Apart from these algorithms, some variants of the stacking concept have been proposed in the literature; in this paper we extend the concept of stacking with a multi-layered approach and a feature reduction technique.

Algorithm Adaptation

This approach adapts traditional classification algorithms to work directly on multi-label data without using a transformation method. One such algorithm is AdaBoost.MH [6], a boosting algorithm for multi-label problems aimed at minimizing the Hamming loss. In ML-kNN (Multi-label kNN) [11], statistical information such as prior and posterior probabilities is obtained from the training data. The former indicates the number of data points having a specific label, and the latter gives the number of training instances with a specific label whose k nearest neighbors have exactly a given number of that label. The Bayesian rule (maximum a-posteriori) is then applied to these statistics to predict the output. The posterior probabilities for a test input can also be used to form a ranking vector over the predicted labels. All the needed information is obtained by frequency counting over the training data. Random k-Label sets (RAkEL) [10] This method provides improved performance over the LP [4] method mentioned earlier. The major difference is that the label set is first broken down into smaller finite subsets, and a Label Powerset classifier is trained for each of these subsets. The decisions of all the classifiers are combined for the final prediction.
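The RAkEL idea can be sketched as follows, using disjoint label subsets for simplicity (the published method also covers overlapping subsets combined by voting; the data and the decision-tree base learner here are hypothetical):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)
X = rng.normal(size=(80, 5))
Y = (rng.random((80, 4)) < 0.4).astype(int)    # 4 labels, random for illustration

label_subsets = [[0, 1], [2, 3]]               # disjoint subsets of size k = 2

models = []
for subset in label_subsets:
    # Label Powerset on this subset: each distinct combination is one class.
    combos, ids = np.unique(Y[:, subset], axis=0, return_inverse=True)
    models.append((subset, combos, DecisionTreeClassifier(random_state=0).fit(X, ids)))

# Prediction: each subset model fills in its own columns of the output.
pred = np.zeros_like(Y)
for subset, combos, clf in models:
    pred[:, subset] = combos[clf.predict(X)]
```

Keeping the subsets small bounds the number of label combinations each Label Powerset classifier must distinguish.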

Dimensionality reduction algorithms
Different dimensionality reduction methods are available, for example MDDM, multi-label informed feature selection, and kernel dependency estimation. In this paper we use class-augmented PCA for feature reduction.

Ensemble methods and algorithms
Ensemble methods are distinguished as dependent and independent ensemble frameworks [13]. In the former case, the result of a base classifier can be used in constructing the next-level classifier. AdaBoost [14] can be used to boost the performance of any machine learning algorithm and is best used with weak learners. Each instance in the training dataset is weighted, with the initial weight set to W(X_i) = 1/n, where X_i is the i-th training instance and n is the number of training instances. Meta-learning methods, on the other hand, are mostly more suitable for classifiers which consistently correctly classify or misclassify certain instances. Some of the weighting methods are majority voting, performance weighting, distribution summation, Bayesian combination, and entropy weighting, as described in [13].
In this paper we use a meta-learning method called stacking [17]. Stacking is used to achieve maximum generalization accuracy [17]. It combines the models built by different classifiers by creating a meta dataset: instead of using the input features from the original dataset, it uses the predicted results of the base classifiers as input features, while the class labels are taken as-is from the original dataset. The meta classifier then produces the final prediction from this meta dataset.
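The construction of the meta dataset can be sketched as follows; the single-label toy data and the particular base and meta estimators are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 6))
y = (X[:, 0] + X[:, 1] > 0).astype(int)        # single-label case for clarity

base = [LogisticRegression().fit(X, y), GaussianNB().fit(X, y)]

# Meta dataset: base-classifier predictions become the input features,
# while the class labels are taken as-is from the original dataset.
X_meta = np.column_stack([clf.predict(X) for clf in base])
meta = DecisionTreeClassifier(random_state=0).fit(X_meta, y)
final_pred = meta.predict(X_meta)
```

In practice the base predictions used to build the meta dataset are usually produced by cross-validation, so that the meta learner does not see predictions made on the base learners' own training points.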

The proposed method
A multi-layered stacked ensemble for the classification of multiclass problems has been described in [18], and a stacked ML-kNN for multi-label data with a lazy learning approach in [22]. In this paper we extend the stacked ensemble method for multi-label data (SEMML) with a feature reduction technique. Fig. 1 shows the flow architecture of our method SEMML. Initially we apply class-augmented PCA for feature reduction; then we apply the problem transformation method. Here we consider binary relevance [5] [18], which transforms the multi-label problem into simple binary class problems so that the base learners can be applied. Different base classifiers are then applied to the transformed dataset. To generate the new dataset for the next level, their predicted results are stacked and the actual label set from the original dataset is augmented. This new dataset is the input for the meta-learner, which produces the final prediction.
In Algorithm 1, Step 1 takes the input feature set and transforms it into a reduced feature set by considering the class label information. In Step 2, by applying the transformation method, i.e., the binary relevance method, the problem is decomposed into binary class problems, and the chosen base classifiers are applied to the transformed data D_BR.
Step 4 constructs the new dataset D_2 by stacking the output of each base classifier used in Step 3 and augmenting the class labels from the original dataset D. In Step 5, the meta-learner is trained on the new dataset D_2 constructed in Step 4. Here the meta-learner may be a meta decision tree (MDT) [25] for producing the final prediction.
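The steps of Algorithm 1 can be sketched end to end as follows; plain PCA stands in for CA PCA here, and the dataset and estimators are illustrative assumptions rather than the paper's exact configuration:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.normal(size=(120, 8))
Y = np.column_stack([(X[:, 0] > 0).astype(int),          # 3 toy labels
                     (X[:, 1] > 0).astype(int),
                     (X[:, 2] + X[:, 3] > 0).astype(int)])

# Step 1: feature reduction (plain PCA as a stand-in for CA PCA).
X_red = PCA(n_components=5, random_state=0).fit_transform(X)

# Steps 2-3: binary relevance with m base classifiers, one binary model per label.
base_types = [LogisticRegression, GaussianNB]
base = [[cls().fit(X_red, Y[:, j]) for j in range(Y.shape[1])] for cls in base_types]

# Step 4: D_2 = stacked base predictions, labels augmented from the original D.
D2_X = np.column_stack([m.predict(X_red) for models in base for m in models])

# Step 5: one meta learner per label, trained on D_2 (a plain decision tree
# stands in for the meta decision tree MDT of [25]).
meta = [DecisionTreeClassifier(random_state=0).fit(D2_X, Y[:, j])
        for j in range(Y.shape[1])]
final = np.column_stack([m.predict(D2_X) for m in meta])
```

With l = 3 labels and m = 2 base classifiers, D_2 has l * m = 6 meta features per instance, matching the dimension analysis given later.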

Mathematical Model
Let D be the original input data consisting of n points, D = {(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)}, where X_i ∈ X and Y_i ⊆ Y.

CA PCA for multi-label data

PCA determines the orthogonal vectors of the projected data. The eigenvectors of the projected data may not be sufficient for classification, as no label information is considered; CA PCA addresses this by augmenting a new axis, orthogonal to the existing axes, that carries the class label information.
The following steps summarize the usage of CA PCA:
• Apply PCA to the feature set of the original dataset.
• Encode the label information, i.e., each Y_i, as a vector of 0's and 1's.
• Augment the data with the encoded labels.
• Maximize the importance of the class label information.
• Apply PCA once again to determine the transformation matrix.
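A minimal sketch of these steps follows; the weighting factor `alpha` and the way the feature-side transformation is extracted are assumptions based on the description above, not taken from [27]:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 6))
Y = (rng.random((50, 3)) < 0.5).astype(int)    # multi-label indicator matrix

# Step 1: PCA on the original feature set (kept full-rank here).
X_pca = PCA().fit_transform(X)

# Step 2: the labels are already encoded as vectors of 0's and 1's (Y above).
# Step 3: augment the data with the encoded labels.
# Step 4: scale the label block to maximize its importance (alpha is assumed).
alpha = 10.0
X_aug = np.hstack([X_pca, alpha * Y])

# Step 5: apply PCA once again to determine the transformation matrix, then
# project the *feature* part of the data with the learned components.
pca2 = PCA(n_components=4).fit(X_aug)
W = pca2.components_[:, :X_pca.shape[1]]       # feature-side transformation
X_reduced = (X_pca - X_pca.mean(axis=0)) @ W.T
```

The second PCA is steered toward directions correlated with the labels because the scaled label block dominates the variance of the augmented data.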
Base level: Apply the transformation method on D: D_BR = BR(D_CAPCA). Let BC = {BC_1, BC_2, ..., BC_m} be the set of base classifiers, and train them on D_BR. Each base classifier BC_i maps the data points to a subset of the classes in Y.
Stacked ensemble: Let D_2 be the new dataset constructed by stacking the predicted outputs of the different base classifiers,
where NewX_i = {BC_1(X_i), BC_2(X_i), ..., BC_m(X_i)} and Y_i is taken from the original dataset D.
Consider the following example.

Table 1. Example of the initial training dataset

Table 1 shows the training dataset where each instance X_i is associated with a label set Y_i, and Table 2 shows the new dataset generated from the initial training dataset after applying the base classifiers. NewX_i consists of the concatenation of BC_1(X_i), BC_2(X_i), ..., BC_m(X_i), where BC_1(X_i) is the label vector predicted for X_i by classifier 1. Here we considered two base classifiers, but there could be more; note that with binary relevance the number of binary classifiers equals the number of labels. The dimension of the new dataset changes according to the number of labels present, in this case four for each classifier. While constructing the next-level dataset we can consider either the prediction probabilities or the binary predictions; in the above example, for simplicity, we considered binary predictions. If the maximum number of labels is l and there are m classifiers, the number of elements in NewX_i will be l * m. This new dataset is used for meta-learning to get the final prediction.

Meta Learning
The meta-learning classifier is a learning algorithm used to induce a meta-level model that combines the predictions of the base-level models. Let ML be the meta-learning classifier; in this paper we consider a meta decision tree (MDT) as the meta-learning classifier. After constructing the dataset D_2, the meta-learning classifier is applied to build a model on D_2. For a test point, the base level first transforms the test data to form the new instance X_t; ML then decides the final label relevance for X_t based on the model built on D_2.

Evaluation Metrics
In multi-label classification methods, the output is a set of labels, so the evaluation metrics normally used for analyzing single-label classification algorithms cannot be applied directly.
Consider a multi-label data set consisting of N points (X_1, Y_1), (X_2, Y_2), ..., (X_N, Y_N), and let f be our multi-label classifier. Let f(X_i) denote the predicted class labels for the instance X_i. Assume there are h possible labels, denoted by the set of all labels Y. The following list includes some of the commonly used evaluation metrics under these assumptions. The metrics used for evaluation differ according to the nature of the target problem.
Precision [3] [5] for an instance is the ratio of the correctly predicted labels to the total number of predicted labels. It is averaged over all instances of the data set.
Recall [3] [5] for an instance is the ratio of the correctly predicted labels to the total number of true labels. It is then averaged over all instances of the data set.

F1-Measure [3] [5] for an instance is the harmonic mean of precision and recall. It is then averaged over all instances of the data set.
Hamming Loss [3] [5] measures the number of times labels are misclassified for an instance. It is found by taking the symmetric difference (XOR) between the predicted labels and the original labels.
The performance of an algorithm improves as the Hamming loss decreases; a value of zero indicates perfect classification. Measures such as precision, recall, and F1, or any measure used to evaluate binary classification, can also be computed on a per-label basis and averaged over all possible labels for multi-label evaluation.
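These example-based metrics can be computed directly from the indicator matrices; the true and predicted labels below are hypothetical:

```python
import numpy as np

# Hypothetical true and predicted label indicator matrices (N = 3, h = 4).
Y_true = np.array([[1, 0, 1, 0],
                   [0, 1, 0, 0],
                   [1, 1, 0, 1]])
Y_pred = np.array([[1, 0, 0, 0],
                   [0, 1, 0, 1],
                   [1, 1, 1, 1]])

tp = (Y_true & Y_pred).sum(axis=1)             # correctly predicted labels per instance

precision = np.mean(tp / Y_pred.sum(axis=1))   # / total predicted labels
recall    = np.mean(tp / Y_true.sum(axis=1))   # / total true labels
f1        = np.mean(2 * tp / (Y_pred.sum(axis=1) + Y_true.sum(axis=1)))

# Hamming loss: fraction of label positions where prediction and truth differ (XOR).
hamming_loss = np.mean(Y_true ^ Y_pred)
```

For these three instances the per-instance precisions are 1, 1/2, and 3/4, giving an averaged precision of 0.75, and 3 of the 12 label positions disagree, giving a Hamming loss of 0.25.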

Experimental Results
To evaluate the performance of SEMML, we chose 10 standard multi-label datasets from the Mulan and Meka repositories [26]; Table 3 shows the properties of these datasets. We tested on datasets from these different domains to assess the effectiveness of the proposed method. Table 4, Table 5 and Table 6 show the experimental results for the different algorithms. We experimented with the traditional as well as the proposed methods on the datasets listed in Table 3. The algorithms used are BRNB, BRDT, BRkNN and ML-kNN, described in Section 2, and they are compared with our proposed method SEMML. In all cases, cross-validation is used to avoid overfitting. We used the MEKA tool for the implementation of BRNB, BRDT and BRkNN, while ML-kNN is implemented separately with the scikit-multilearn package. The proposed method SEMML has been implemented with MEKA as well as with the scikit-multilearn package. The results of all these algorithms are listed in the tables with respect to the performance metrics average precision, F1-score and Hamming loss, which are described in Section 4. As can be observed in Figures 2 to 4 and Tables 4 to 6, the performance of SEMML is better with respect to average precision, F1-score and Hamming loss in comparison with all the other algorithms. In all cases except one, the average precision of SEMML is much higher than that of the other algorithms. On the majority of datasets, the F1-score of SEMML is better than that of the other algorithms. The Hamming loss of SEMML, as shown in Figure 4 and Table 6, is lower than that of all the other algorithms in most cases. This supports the strength of our algorithm, since Hamming loss is one of the important criteria for multi-label performance evaluation. In Figure 5, the performance of SEMML is far superior to the other algorithms for all the metrics.

Conclusion
In this paper we have experimented with the stacked ensemble method combined with the feature reduction technique CA PCA, and compared our method with other existing multi-label classification algorithms. The ensemble method with feature reduction for multi-label data provides good results compared to the traditional algorithms with respect to the evaluation metrics used. Other ways of carrying out stacking can be explored as part of future work.