Multi-label text classification algorithm based on semi-supervised learning

The standard Co-Training algorithm can effectively improve classifier performance by using unlabeled data. In practice, however, its requirement of sufficient and redundant views is difficult to meet. At the same time, the constraints of Co-Training are further relaxed here to address the problem of inaccurate confidence estimation for unlabeled samples. In this paper, IMML-KNN and MLPLSA-KNN are used to train two base classifiers with large mutual divergence. By using unlabeled data through the exchange of pseudo-labeled samples between the classifiers, a Co-Training-based multi-label text classification algorithm, CT-MLTC, is proposed. Experimental results show that the algorithm performs well on multi-label text classification tasks with a high proportion of unlabeled data.


Introduction
Semi-supervised learning (SSL) is an extension of supervised learning that uses unlabeled data according to certain principles. When a semi-supervised learning method is applied to a classification task, it is usually called semi-supervised classification (SSC) [1]. The unlabeled data in SSC serve as an auxiliary resource for training the classifier. SSC can be carried out in two different ways. In transductive learning, the labeled data are used to construct the classifier, which is then applied to the unlabeled data. In inductive learning, some labeled data are set aside from the training set as a validation set; feedback is obtained by evaluating the classification results on this set, the relevant parameters or rules are updated, and the classifier is optimized iteratively [2].
Co-training is a classic semi-supervised learning method. The standard Co-Training algorithm [3] proves that classifier performance can be effectively improved by using unlabeled data, under the condition that the two views used to train the classifiers are sufficient, redundant, and conditionally independent. However, in real tasks, the requirement of completely independent views often cannot be met. S. Goldman and Y. Zhou proposed a co-training algorithm that does not require sufficient and redundant views: decision trees are used for co-training, two classifiers are trained on the same attribute set, and each classifier can divide the sample space into several equivalence classes [4]. During Co-Training, statistical techniques are used to estimate each classifier's confidence in the unlabeled samples, and the samples with high confidence are selected and labeled. These samples, together with their predicted labels, are added to the other classifier's training set, and that classifier is retrained. This process alternates between the two classifiers until a stopping condition is met.
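The explicit, confidence-based selection step described above can be sketched as follows. The probability interface and the 0.9 threshold are illustrative assumptions, not details taken from [4]: each unlabeled sample is kept only if the classifier's highest class probability exceeds the threshold.

```python
# Illustrative sketch of explicit confidence-based selection in co-training.
# The probability lists and the 0.9 threshold are assumptions for exposition.

def select_confident(probs, threshold=0.9):
    """Return (index, label) pairs for unlabeled samples whose highest
    class probability reaches the threshold."""
    selected = []
    for i, p in enumerate(probs):
        label = max(range(len(p)), key=lambda k: p[k])
        if p[label] >= threshold:
            selected.append((i, label))
    return selected

# Example: predicted class distributions for four unlabeled samples.
probs = [
    [0.95, 0.05],   # confident -> pseudo-label 0
    [0.55, 0.45],   # uncertain -> skipped
    [0.10, 0.90],   # confident -> pseudo-label 1
    [0.70, 0.30],   # uncertain -> skipped
]
print(select_confident(probs))  # [(0, 0), (2, 1)]
```

Only the confidently pseudo-labeled samples are then passed to the other classifier's training set.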
To further relax the constraints of co-training, Zhihua Zhou et al. proposed Tri-training [5], which places no sufficient-and-redundant-view restriction on the attribute set and does not require different types of classifiers. Its most distinctive feature is that three classifiers are used for co-training. In this way, both the confidence estimation and the prediction of unlabeled samples are simplified, and the idea of ensemble learning is introduced, which improves the generalization ability of the classifiers. However, compared with the explicit confidence estimation in standard co-training, the implicit confidence estimation in Tri-training is often not accurate enough, especially when the initial classifiers are weak: unlabeled samples are then likely to be mislabeled by both auxiliary classifiers at the same time, which introduces noise into the training of the main classifier. Subsequent theoretical research has found that the data set for this kind of algorithm does not need to satisfy requirements such as independent, sufficient, and redundant views: generalization performance can be improved as long as the base classifiers provide pseudo-labeled samples to each other, that is, as long as there are significant differences between the base classifiers.
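Tri-training's implicit confidence estimate can be illustrated with a small sketch: an unlabeled sample is pseudo-labeled for the main classifier only when the two auxiliary classifiers agree on its label. The plain prediction lists below are an assumption for exposition; the full algorithm in [5] additionally weighs the auxiliaries' error rates.

```python
# Sketch of Tri-training's implicit confidence rule: agreement of the
# two auxiliary classifiers stands in for an explicit confidence score.

def agreed_pseudo_labels(pred_aux1, pred_aux2):
    """Pseudo-label sample i with label y when both auxiliaries predict y."""
    return [(i, a) for i, (a, b) in enumerate(zip(pred_aux1, pred_aux2))
            if a == b]

pred1 = ["sports", "tech", "tech", "news"]
pred2 = ["sports", "news", "tech", "tech"]
print(agreed_pseudo_labels(pred1, pred2))  # [(0, 'sports'), (2, 'tech')]
```

The sketch also makes the weakness visible: if both auxiliaries agree on a wrong label, the main classifier receives a noisy training sample.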
From the perspective of Co-Training, IMML-KNN and MLPLSA-KNN are therefore used in this paper to train two base classifiers with large mutual divergence. By using unlabeled data through the exchange of pseudo-labeled samples between the classifiers, the Co-Training-based algorithm CT-MLTC is proposed.

Construct a cooperative training classifier
Co-training is essentially a divergence-based method. The base classifiers to be generated must not only differ significantly from each other but also perform well individually. Otherwise, a poorly performing classifier will repeatedly mislabel data when using unlabeled samples, the classifiers will contaminate each other, and the performance of the whole classification model will eventually degrade. Therefore, on the one hand, classifiers with large differences are selected at the start of co-training, so that they maintain large differences throughout the repeated iterations. On the other hand, each base classifier should have high classification accuracy in multi-label text classification, otherwise classification errors will accumulate.
The effectiveness of the IMML-KNN and MLPLSA-KNN algorithms on multi-label text classification tasks has been verified. The IMML-KNN algorithm is based on the co-occurrence probability of category labels, while the MLPLSA-KNN algorithm works from a semantic perspective. These two different starting points ensure that the trained classifiers remain different. Therefore, IMML-KNN and MLPLSA-KNN are selected in this paper as the algorithms for training the base classifiers.

Definition
The training sample set D is composed of a labeled sample set L and an unlabeled sample set U.
Here, |L| is the number of labeled samples and |U| is the number of unlabeled samples.

Algorithm input
Labeled sample set L, unlabeled sample set U, the number of nearest neighbors K, the number of topics num_topic, the sample set to be predicted Test, and the smoothing parameter s = 1.

Algorithm output
Predicted label sequence for each text sample in Test.

Algorithmic process
Step 1: The labeled sample set L is divided into L1 and L2 by repeated random (Bootstrap) sampling. The initial classifiers C1(1) and C2(1) are trained on L1 and L2 using the IMML-KNN algorithm and the MLPLSA-KNN algorithm respectively, which ensures the divergence of the classifiers.
Step 2: In the first round, the two initial classifiers C1(1) and C2(1) traverse the unlabeled sample set U. For each unlabeled sample x ∈ U, the text samples for which a classifier has high confidence are selected and given pseudo-labels: the prediction result is used as the pseudo-label of x and saved in the set P1(1) or P2(1). After the first traversal, the classification result set P2(1) of C2(1) is added to the training set L1 of C1, which is updated to L1 + P2(1). Similarly, the training set of C2 is updated to L2 + P1(1).
Step 3: The updated training sets are used to retrain the classifiers with the same algorithms, obtaining C1(2) and C2(2), and Step 2 is repeated. However, the pseudo-labeled sets P1(1) and P2(1) generated in the previous round are no longer treated as labeled samples; they are returned to the unlabeled sample set U as unlabeled samples.
Step 4: Steps 2 and 3 are repeated until a stopping condition is met.
Step 5: Use the final classifiers to classify the test sample set Test.
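The steps above can be sketched as a co-training loop. This is a minimal skeleton, not the authors' implementation: the `train1`/`train2` arguments stand in for IMML-KNN and MLPLSA-KNN, the `confident` predicate stands in for the confidence estimate, and the toy majority-label classifier exists only to make the sketch runnable.

```python
# Skeleton of the CT-MLTC loop (Steps 1-5). The training functions and
# the confidence rule are placeholders, assumed for illustration.
import random

def bootstrap_split(labeled, seed=0):
    """Step 1: draw two bootstrap samples of the labeled set."""
    rng = random.Random(seed)
    n = len(labeled)
    l1 = [labeled[rng.randrange(n)] for _ in range(n)]
    l2 = [labeled[rng.randrange(n)] for _ in range(n)]
    return l1, l2

def co_train(labeled, unlabeled, train1, train2, confident, rounds=3):
    l1, l2 = bootstrap_split(labeled)
    p1, p2 = [], []                      # previous round's pseudo-labels
    for _ in range(rounds):
        # Steps 2-3: retrain on the bootstrap set plus the peer's
        # pseudo-labeled samples, then re-estimate pseudo-labels.
        c1 = train1(l1 + p2)
        c2 = train2(l2 + p1)
        # Old pseudo-labels return to the unlabeled pool (Step 3).
        pool = unlabeled + [x for x, _ in p1 + p2]
        p1 = [(x, c1(x)) for x in pool if confident(c1, x)]
        p2 = [(x, c2(x)) for x in pool if confident(c2, x)]
        unlabeled = [x for x in pool
                     if not confident(c1, x) and not confident(c2, x)]
    return c1, c2                        # Step 5: final classifiers

# Toy "classifier": predicts the majority label of its training pairs.
def train_majority(pairs):
    labels = [y for _, y in pairs]
    top = max(labels, key=labels.count)
    return lambda x: top

always_confident = lambda clf, x: True   # trivial confidence rule
L = [(1, "a"), (2, "a"), (3, "b")]
U = [4, 5]
c1, c2 = co_train(L, U, train_majority, train_majority, always_confident)
print(c1(4), c2(4))
```

In a real implementation the IMML-KNN and MLPLSA-KNN classifiers would replace `train_majority`, and `confident` would use each classifier's label-score output rather than accepting everything.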

Experiment and analysis
To test the classification performance of the algorithm under different proportions of unlabeled data, the public dataset Delicious from Mulan, a multi-label classification research platform, was selected for this experiment. Delicious datasets with different proportions of unlabeled data were constructed by removing category labels. Hamming Loss, Ranking Loss, Coverage, Precision, and One-Error are used as evaluation metrics.
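For reference, two of these metrics can be computed as follows. These are the standard multi-label definitions, not code from the paper; the 0/1 label matrices and score lists are illustrative.

```python
# Reference implementations (standard definitions) of two multi-label
# evaluation metrics used in the experiments.

def hamming_loss(y_true, y_pred):
    """Fraction of label slots predicted incorrectly, over all samples."""
    total = sum(sum(t != p for t, p in zip(row_t, row_p))
                for row_t, row_p in zip(y_true, y_pred))
    return total / (len(y_true) * len(y_true[0]))

def one_error(y_true, scores):
    """Fraction of samples whose top-ranked label is not a true label."""
    errors = 0
    for truth, s in zip(y_true, scores):
        top = max(range(len(s)), key=lambda k: s[k])
        errors += truth[top] == 0
    return errors / len(y_true)

y_true = [[1, 0, 1], [0, 1, 0]]   # ground-truth label matrix
y_pred = [[1, 0, 0], [0, 1, 1]]   # binarized predictions
scores = [[0.9, 0.2, 0.4], [0.1, 0.3, 0.8]]  # ranking scores
print(hamming_loss(y_true, y_pred))  # 2 wrong slots / 6 -> 0.333...
print(one_error(y_true, scores))     # second sample's top label is wrong -> 0.5
```

Ranking Loss and Coverage are computed analogously from the score matrix; lower is better for all of them except Precision.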
To verify the effectiveness and superiority of the proposed algorithm on multi-label text classification problems, we compared it with other semi-supervised multi-label classification algorithms. Miha Pavlinek and Vili Podgorelec proposed a self-training-based LDA method, STLDA [6], which continuously incorporates unlabeled text data through self-training to iteratively optimize classifier performance. Tan Qiaoyu et al. proposed a graph-based transductive linear classification algorithm, SMILE [7], in which missing category label information can be supplemented through the constructed graph; it works well on data with incomplete category labels.
Wen Zhang et al. proposed TESC [8], a classification algorithm based on semi-supervised clustering, which clusters unlabeled text data by computing the distance between each sample and each cluster center; the unlabeled text data are thus converted into prior knowledge that helps improve classification performance. The evaluation metric values of the four algorithms on the Delicious dataset with different proportions of unlabeled data are shown in Figure 1. Because the Coverage values are much larger than the other metric values, they are rescaled so that all five metrics can be displayed in the same figure. When unlabeled data accounts for 20%, Hamming Loss, Coverage, Precision, and One-Error are all optimal; only Ranking Loss is higher, exceeding that of the STLDA algorithm by 0.199.
When the proportion of unlabeled data is 50%, the Precision, Coverage, and One-Error of the proposed algorithm are the best; its Hamming Loss is 0.028 higher than that of STLDA, and its Ranking Loss is 0.199 higher than that of STLDA.
When unlabeled data accounts for 80%, Hamming Loss, Coverage, Ranking Loss, Precision, and One-Error are all optimal.
In summary, the experiments demonstrate the effectiveness and superiority of the proposed CT-MLTC algorithm on multi-label text classification tasks with unlabeled data. Compared with other algorithms based on semi-supervised learning, CT-MLTC not only performs well in terms of classification performance but is also stronger than some of those algorithms in its use of unlabeled data.

Conclusions
In this paper, the initial classifiers for co-training are constructed with the IMML-KNN and MLPLSA-KNN algorithms, so that the divergence of the classifiers can be guaranteed and the unlabeled data can be used effectively. On this basis, a multi-label text classification algorithm based on co-training, CT-MLTC, is proposed. Experiments on datasets with different proportions of unlabeled data demonstrate its performance on multi-label text classification tasks with a large amount of unlabeled data. There are still many directions worth exploring; for example, the reliability and robustness of the proposed algorithm in different task scenarios need to be verified on more datasets.