A filter-based feature selection approach in multilabel classification

Multilabel classification is a fast-growing field of machine learning. Recent developments have shown several applications of multilabel classification, including social media, healthcare, bio-molecular analysis, and scene and music classification. In these classification problems, multiple labels (i.e. more than one class label) are assigned to an unseen record instead of a single class label. Feature selection is a preprocessing phase used to identify the most relevant features, which can improve the accuracy of multilabel classifiers. The focus of this study is feature selection for multilabel classification. The study uses filter-based feature selection methods, namely the Fisher score, the analysis of variance (ANOVA) test, mutual information, chi-square, and ensembles of these statistical methods. An extensive range of machine learning algorithms is applied in the modelling phase of the multilabel classification model, including binary relevance, classifier chain, label powerset, binary relevance kNN, multi-label twin support vector machine, and multi-label kNN. In addition, the label space partitioning and majority voting ensemble methods are used, with random forest as the base learner. The experiments are carried out over five multilabel benchmarking datasets. The classification model is evaluated using accuracy, precision, recall, F1 score, and Hamming loss. The study demonstrates that the filter methods (in particular, mutual information), retaining the top-weighted 80% down to 20% of features, provide significant outcomes.


Introduction
In the machine learning literature, there are three types of classification: binary, multiclass, and multilabel. This categorization is based on the number of labels assigned to each unseen record in the test set. Much of supervised machine learning deals with single-label data, in which each unseen record is associated with only one label [1]. In the multiclass classification problem, the unseen record's class label is predicted from a set of labels. However, in several applications, such as social media, medical diagnosis, bio-molecular analysis, scene classification, music classification, and text classification, instances may have more than one class label [2]. This phenomenon is known as the multilabel classification problem, and it is an emerging research topic [1]. Multilabel and multiclass are two different concepts: in multiclass classification, each record in a dataset belongs to exactly one label from a set of labels, whereas in multilabel classification, each record may belong to more than one label. Figure 1 gives an example of all three classification types. Like single-label classification, multi-label classification faces high-dimensionality problems. Several feature selection algorithms are available to reduce the dimensionality by removing redundant and irrelevant features from multi-label datasets [3, 4]. Feature selection is an effective technique that can reduce computational time and sometimes improve a classification model's accuracy. There are three main categories of feature selection approaches in the machine learning literature (illustrated in figure 2), categorized by their evaluation criteria. In addition, there are two standard multi-label classification approaches: (1) problem transformation and (2) algorithm adaptation. Problem transformation converts the multi-label problem into one or more single-label or multi-class problems, whereas algorithm adaptation modifies single-label classifiers by changing their cost or decision functions [5]. An illustration of these two approaches is given in figure 3.
These observations lead to open challenges in multi-label classification, in particular the following:
• How can an effective multi-label classification model be designed and developed?
• Which features and label correlations may provide better outcomes?
• How can the multi-label classification model's performance be improved, and its computational time reduced, with the help of feature selection?
This research proposes feature selection for multi-label datasets using the filter method. The aim is to tackle the dimensionality problem, reduce the training time, and enhance the compatibility of the data with the classifiers. A wide range of conventional machine learning algorithms is used in the study over five different benchmarking datasets. Furthermore, well-established evaluation indexes are used to measure the performance of the multi-label classification model. The rest of the paper is organized as follows: the background and related work are discussed in section 2; the methodology is reported in section 3; the adopted machine learning algorithms are described in section 4; the evaluation indexes are explained in section 5; the experimental results are reported in section 6; and, finally, conclusions are drawn in section 7.

Literature review
This section reviews the literature on multilabel classification and feature selection for multilabel datasets. The main techniques, approaches, multilabel datasets, performance, and limitations are reviewed. Despite the substantial advancements in the field, numerous problems and challenges still need to be solved. Table 1 summarizes the relevant studies.
Referring to table 1, the study in [2] evaluates, reviews, compares, classifies, and analyzes the existing work by applying several filter feature selection techniques to different datasets. It suggested the need for improvements in classifier accuracy and also reported a gap in work based on wrapper and embedded approaches. An algorithm for feature selection that improves multi-label classification performance is proposed in [6], which produced significant outcomes on fourteen multi-label datasets. A multi-objective multi-label feature selection algorithm (MOMFS) based on two particle swarms is introduced in [7]. It addresses two objectives: (1) measuring the relevance between labels and features, and (2) measuring the redundancy between features.
The study [8] proposed a novel approach for feature selection based on neighborhood mutual information (MI) in multi-label neighborhood decision systems and multi-label ReliefF. The proposed approach reduced the computational complexity of the multi-label dataset and improved classification performance by eliminating useless features with a heuristic forward approach. An improved k-nearest neighbor (kNN) method for multi-label classification with three modified strategies is presented in [9]. The three strategies are (1) Gaussian mixture models for splitting the input space into multiple sub-spaces, (2) desirable labels, and (3) using unseen mutual/non-mutual examples when finding local instances. A neighborhood rough-sets-based multi-label feature selection approach combined with ReliefF is reported in [10] to reduce redundancy and increase relevancy. Information gain, chi-square, Fisher score, V-score, minimum redundancy maximum relevance (mRMR), and ReliefF filter-based feature selection are used in [11] to reduce data dimensionality, remove irrelevant and redundant features, and improve and simplify learning. The Fisher score algorithm for selecting the best features has been applied in [12]. Recent literature has proposed several novel methods for choosing multi-label features. One such method, which focuses on high-dimensional data, is the 'Relevance based on Weight Feature Selection (RWFS)' method highlighted in [13]. This technique offers a novel viewpoint by combining changing and established information ratios to construct a new feature relevance term. RWFS outperforms eight state-of-the-art techniques on thirteen real-world datasets when evaluating feature contributions to label sets, demonstrating its effectiveness in improving multi-label feature selection performance. Similarly, another approach introduces the 'Label Correlations and Feature Redundancy-based Multi-label Feature Selection (LFFS)' technique described in [14]. Label correlation exploration and feature redundancy reduction are combined to resolve duplicate information in multi-label data. LFFS achieves a near-optimal feature subset with less redundancy through ridge regression, low-dimensional embedding, and cosine similarity analysis. Comprehensive tests support its superiority over eleven rival approaches. A cost-constrained multi-label feature selection method is presented in [15]. This approach outperforms conventional approaches in terms of efficacy by fusing feature relevance and costs via MI and a user-defined parameter. Finally, [16] introduces 'MFSJMI,' a technique that uses joint MI and an interaction weight to address the problems of traditional multi-label feature selection techniques. Its improved feature subset selection accuracy has been validated through trials under various assessment criteria. These recent contributions jointly advance multi-label feature selection with their creative approaches and encouraging outcomes.

Methodology
This study employed a multilabel classification model involving a wide range of conventional machine learning algorithms: binary relevance (BR), classifier chain (CC), label powerset (LP), binary relevance kNN (BRKNN), multi-label twin support vector machine (MLTSVM), and multi-label kNN (MLKNN). The label space partitioning and majority voting ensemble methods have also been applied in the experiments over five different multi-label datasets. In particular, a filter-based feature selection approach is used, and the filter methods are applied to select the top 80%, 60%, 50%, 40%, and 20% of features for each dataset. This design allows the performance of the classification model to be compared before and after feature selection. The goal is to determine the impact of the filter-based feature selection methods on the multi-label classification model's performance and to reduce the training time. Figure 4 shows the block diagram of the applied methodology; the details of each block are described in the subsequent sections.

Datasets
This study considered five multi-label datasets: (i) Scene, (ii) Medical, (iii) Emotions, (iv) Genbase, and (v) Enron. These are summarized in table 2 and are publicly available online in the MULAN repository. The key characteristics of the datasets are described in the following:
• Domain: the field or area of study to which the dataset belongs.
• Instances: the total number of examples in the dataset.
• Labels: the total number of labels in the dataset.
• Cardinality: the mean number of labels per example in the dataset.
• Density: the cardinality divided by the total number of labels.
In a multi-label dataset, not all data samples have the same number of labels: some samples may have few labels, whereas others have many. Density and cardinality are two key characteristics of a multi-label dataset that can affect the performance of multi-label classifiers. Table 2 shows that the Emotions dataset has the highest density among the considered datasets, whereas the Enron dataset has the highest cardinality; thus, these datasets are more likely to affect the performance of the multi-label classifiers. Label cardinality is the average number of labels per example in the dataset, as shown in equation (1), and label density is the label cardinality divided by the total number of labels, as shown in equation (2):

Cardinality(D) = (1/N) Σ_{i=1}^{N} |Y_i|  (1)

Density(D) = (1/N) Σ_{i=1}^{N} |Y_i| / |L|  (2)

where N is the number of examples, Y_i is the label set of the ith example, and |L| is the total number of labels in the dataset.
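Following these definitions, label cardinality and density can be computed directly from a binary label indicator matrix; a minimal sketch in Python (the function name and toy data are illustrative, not from the paper):

```python
import numpy as np

def cardinality_and_density(Y):
    """Compute label cardinality and label density of a multi-label
    indicator matrix Y (shape: n_samples x n_labels, entries 0/1)."""
    Y = np.asarray(Y)
    cardinality = Y.sum(axis=1).mean()   # average number of labels per example
    density = cardinality / Y.shape[1]   # normalised by the total label count
    return cardinality, density

# Toy example: 3 samples, 4 labels
Y = [[1, 0, 1, 0],
     [0, 1, 0, 0],
     [1, 1, 1, 0]]
card, dens = cardinality_and_density(Y)
```

With this toy matrix the samples carry 2, 1, and 3 labels, so the cardinality is 2.0 and the density is 2.0/4 = 0.5.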

Data preprocessing
Data preprocessing is one of the essential operations that cleans the given dataset and transforms it into a suitable format for further processing.The considered datasets have undergone preprocessing operations.
Statistical analysis showed that all five datasets are balanced with respect to their class labels, so class imbalance issues did not arise. Next, we examined the label-to-label correlations using label network graphs: figures 5(a)-(e) show the correlations between the labels of the datasets. We then performed the most important preprocessing task, feature selection, which is discussed in detail in section 3.2.1.

Feature selection
The selection of features affects the overall performance of the classification model; thus, feature selection is important in designing and developing a classification model. This study applies a filter-based feature selection method, which helped to improve the performance and training time in most cases. The detailed results are discussed in section 6.
The employed filter-based method computes the worth of features from their correlations with the dependent variables. It does not use a training model for feature selection; because of this, a filter-based method is much faster than wrapper methods and is computationally less expensive. In particular, it evaluates subsets of features using statistical methods. This study used four filter-based methods, namely MI, Fisher score, the analysis of variance (ANOVA) test, and chi-square, together with an ensemble of these four filter techniques; a detailed explanation is given in table 3. Furthermore, the MI approach has been tested over the five considered datasets. MI is often used in feature selection because it is a flexible and powerful method that can handle both continuous and categorical data and can capture non-linear relationships between features. MI is also a popular approach in machine learning due to its ability to identify features that are highly informative for prediction tasks. Conversely, the chi-square method can only be applied to categorical features, and ANOVA can handle categorical variables only when they are transformed into continuous ones. Fisher's score can handle both categorical and continuous features; it is a statistical measure that ranks the importance of features based on their discriminatory power between classes or groups.
Using MI as a preprocessing step for feature selection, the top 80%, 60%, 50%, 40%, and 20% of features for each possible label in the dataset were selected, as shown in the output snapshots in figures 6 and 7. The weight of each feature was computed: the number of labels with which a feature is correlated is counted and then divided by the total number of labels (as reported in algorithm 1, section 3.2.2). Based on these weights, the top 80%, 60%, 50%, 40%, and 20% weighted features were selected for classification (shown in table 4). This study utilized three classification approaches: problem transformation, algorithm adaptation, and ensembles of classifiers.

Algorithm
The algorithm takes the following parameters: X, the feature matrix of shape m × n, where m is the number of samples and n is the number of features; Y, the target matrix of shape m × q, where q is the number of class labels; and K, the number of features to be selected. The algorithm uses MI to select, for each class label in Y, the top K features of X that are most informative for that label. A weight is computed for each selected feature, and the features are ranked by weight in descending order. Finally, the algorithm returns the subset of the top K weighted features. Using this algorithm, the study selected 80% to 20% of the features of the original datasets, as shown in table 4.
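A possible sketch of this per-label MI weighting in Python, using scikit-learn's `mutual_info_classif`; the function name and the exact weighting scheme (counting how often a feature reaches the per-label top K, divided by the number of labels) are our illustrative reconstruction of algorithm 1, not the authors' code:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

def mi_weighted_top_k(X, Y, k):
    """Select the top-k features: weight each feature by the fraction of
    labels for which it ranks among that label's top-k MI scores, then
    keep the k highest-weighted features (sketch of algorithm 1)."""
    X, Y = np.asarray(X), np.asarray(Y)
    n_labels = Y.shape[1]
    weights = np.zeros(X.shape[1])
    for j in range(n_labels):
        mi = mutual_info_classif(X, Y[:, j], random_state=0)
        top = np.argsort(mi)[::-1][:k]      # top-k features for label j
        weights[top] += 1.0 / n_labels      # fraction of labels selecting it
    return np.argsort(weights)[::-1][:k]    # indices of top-k weighted features
```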

Ensemble of filter techniques
Ensembling is a well-known technique used to aggregate multiple results and produce better outcomes. Simple ensemble techniques include max voting, averaging, and weighted averaging. In max voting, the majority (max count) among multiple outputs is selected as the final set. Averaging sums all the sets of outputs of the multiple models/techniques and divides by the number of outputs. Weighted averaging is an extension of averaging that assigns a weight to each model based on prior experience. Lastly, the simplest ensemble of sets of features takes the intersection of all the sets; see figure 8. In this research study, we used the intersection technique; the results are presented and discussed in section 6. Mathematically, the MI between two discrete variables is defined, in terms of their probability mass functions (PMFs), as

I(X; Y) = Σ_x Σ_y p(x, y) log [ p(x, y) / (p(x) p(y)) ]

where p(x, y) is the joint PMF and p(x) and p(y) are the marginal PMFs.
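The intersection ensemble can be sketched in a few lines; the feature-index sets below are hypothetical placeholders for the outputs of the four filter methods:

```python
def intersect_feature_sets(*selected):
    """Ensemble of filter methods by intersection: keep only the features
    chosen by every filter technique (sketch; each argument is assumed
    to be an iterable of feature indices from one filter method)."""
    sets = [set(s) for s in selected]
    return sorted(set.intersection(*sets))

# Hypothetical top-feature indices from the four filters
mi_set     = [0, 2, 3, 5, 7]
fisher_set = [0, 1, 3, 5, 8]
anova_set  = [0, 3, 4, 5, 9]
chi2_set   = [0, 2, 3, 5, 6]
common = intersect_feature_sets(mi_set, fisher_set, anova_set, chi2_set)
```

Here only the features selected by all four filters (indices 0, 3, and 5) survive, which is why the intersection ensemble tends to produce small, conservative feature subsets.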

Fisher score
The Fisher score is a supervised feature selection approach that selects the top k ranked features by calculating each feature's Fisher score [11, 12]. The Fisher score S_i of the ith feature is

S_i = Σ_j n_j (μ_ij − μ_i)² / Σ_j n_j ρ_ij²

where μ_ij and ρ_ij are the mean and standard deviation of the ith feature in the jth class, n_j is the number of instances in the jth class, and μ_i is the overall mean of the ith feature.
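A direct NumPy implementation of this score for a single-label target (a sketch; the function name is ours, and the denominator uses the within-class variance as in the formula above):

```python
import numpy as np

def fisher_scores(X, y):
    """Per-feature Fisher score for a single-label target y:
    S_i = sum_j n_j (mu_ij - mu_i)^2 / sum_j n_j sigma_ij^2."""
    X, y = np.asarray(X, float), np.asarray(y)
    mu = X.mean(axis=0)                        # overall mean of each feature
    num = np.zeros(X.shape[1])
    den = np.zeros(X.shape[1])
    for c in np.unique(y):
        Xc = X[y == c]
        num += len(Xc) * (Xc.mean(axis=0) - mu) ** 2   # between-class spread
        den += len(Xc) * Xc.var(axis=0)                # within-class variance
    return num / den

# Feature 0 separates the two classes; feature 1 is pure noise
X = np.array([[0.0, 1.0], [0.1, 0.0], [2.0, 1.0], [2.1, 0.0]])
y = np.array([0, 0, 1, 1])
scores = fisher_scores(X, y)
```

The discriminative feature receives a much larger score than the noise feature, which is exactly the ranking the filter exploits.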

ANOVA test
The ANOVA filter performs, for every feature, an analysis of variance in which the feature is explained by the class variable. The F statistic is used as the score: the greater the F statistic, the more the average values of the feature differ between the classes [18]. For feature X_k, the F statistic compares the between-class variance to the within-class variance,

F = [ Σ_i n_i (x̄_{k,i} − x̄_k)² / (c − 1) ] / [ Σ_i Σ_{j∈class i} (x_{k,j} − x̄_{k,i})² / (N − c) ]

where x̄_{k,i} is the average X_k value in the ith class, x̄_k is the average X_k value over all instances in the dataset, n_i is the number of instances in class i, c is the number of classes, and N is the total number of instances.
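In practice this score is available directly in scikit-learn as `f_classif`; a minimal sketch with hypothetical toy data:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif

# Toy data: feature 0 separates the classes, feature 1 is noise
X = np.array([[0.0, 5.0], [0.2, 1.0], [3.0, 5.1], [3.1, 0.9]])
y = np.array([0, 0, 1, 1])

F, p = f_classif(X, y)                        # one F statistic per feature
selector = SelectKBest(f_classif, k=1).fit(X, y)
X_top = selector.transform(X)                 # keeps the highest-scoring feature
```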

Validation: split percentage
This section describes the validation of the classification model, which uses the well-established percentage-split method. The datasets listed in table 2 were segmented into training and testing sets with a split ratio of 70:30 (i.e. a 70% training set and a 30% test set). This ratio was chosen because the literature reports its popularity and effectiveness, which was also observed in the experimental results of this study. The split was applied to prevent the models from overfitting and to ensure that all prediction models used in this study generalize to unseen data, thereby reducing potential biases in the results obtained when testing the models.
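A 70:30 percentage split of a multi-label dataset can be done with scikit-learn's `train_test_split`; the data below is randomly generated for illustration:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical multi-label data: 100 samples, 8 features, 4 labels
rng = np.random.default_rng(42)
X = rng.normal(size=(100, 8))
Y = (rng.random((100, 4)) > 0.6).astype(int)

# 70:30 percentage split, fixed seed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.30, random_state=42)
```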

Machine learning models (multilabel classifiers)
This section reports the machine learning models used for multilabel classification. In particular, three classification approaches were used: the algorithm adaptation method, the problem transformation method, and the ensemble approach.

Algorithm adaptation method
The methods of the algorithm adaptation approach utilized in the study's experiments are presented in the following.

MLTSVM
MLTSVM was designed for both linear and nonlinear cases [19]. The basic aim of linear MLTSVM is to find L hyperplanes for an L-label multilabel classification problem such that the lth hyperplane is close to the examples with label l and far from the examples with the other labels [19-21]. Non-linear MLTSVM applies the kernel trick to the learning process, mapping linearly non-separable samples from the input space into a kernel space in which they are more likely to be linearly separable.

MLKNN
MLKNN is a multi-label lazy learning method derived from the conventional kNN algorithm [22]. First, the k nearest neighbors in the training dataset are found for each test sample; then, based on statistical information derived from the label sets of the neighboring samples, maximum a posteriori (MAP) estimation is used to determine the label set of the test sample.

BRKNN
BRKNN is an extension of the kNN algorithm; conceptually, it is the combination of two approaches, kNN and BR. The basic BRKNN approach has two extensions [23]: BRKNNa, which first checks whether any labels are supported by the top half of the k nearest neighbors and, if not, selects the label with the highest confidence score as the predicted label; and BRKNNb, which calculates a weight for each label based on its frequency among the k nearest neighbors, uses these weights to assign a weighted score to each label, and selects the labels with the highest weighted scores for the test instance.

Problem transformation method
This approach transforms the multilabel dataset into one or more single-label datasets and applies a random forest classifier. The three problem transformation approaches used in the experiments are discussed below.

LP
LP creates a single label for each distinct subset of labels in the multilabel training dataset. Thus, the new label set is drawn from the powerset of the original label set, and the number of possible label combinations is 2^L, where L is the number of labels in the dataset. However, LP is not recommended when there are many distinct labels: the resulting powerset makes the dataset sparse and harder for the classifier to learn, so LP is only suitable for a small number of distinct labels [2]. An example is illustrated in figure 9.
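The LP transformation can be sketched as follows, using a random forest as in the paper; the function name and toy data are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_powerset_fit_predict(X_train, Y_train, X_test):
    """Label powerset sketch: map each distinct label combination to a
    single class, train one multi-class random forest, and map back."""
    combos = [tuple(row) for row in np.asarray(Y_train)]
    classes = sorted(set(combos))                    # distinct label sets
    to_id = {c: i for i, c in enumerate(classes)}
    y_single = np.array([to_id[c] for c in combos])  # single-label target
    clf = RandomForestClassifier(random_state=0).fit(X_train, y_single)
    return np.array([classes[i] for i in clf.predict(X_test)])
```

Note that only label combinations seen during training can ever be predicted, which is the sparsity limitation described above.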

BR
BR transforms the dataset into q = |L| independent single-label binary classification problems, one for each label. If the original label set of an instance contains label l, the binary target is set to true (1); otherwise, it is set to false (0) [1, 2, 24]. Figure 10 shows an example of BR.
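A minimal BR sketch with one random forest per label (the function name is ours; note that label correlations are ignored by construction, which is BR's main limitation):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def binary_relevance_fit_predict(X_train, Y_train, X_test):
    """Binary relevance sketch: one independent binary classifier per
    label; predictions are stacked back into a label matrix."""
    Y_train = np.asarray(Y_train)
    preds = []
    for j in range(Y_train.shape[1]):              # one model per label
        clf = RandomForestClassifier(random_state=0)
        clf.fit(X_train, Y_train[:, j])
        preds.append(clf.predict(X_test))
    return np.column_stack(preds)                  # (n_test, n_labels)
```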

CCs
The CC method is similar to training conventional binary classifiers, except that the labels (target variables) are not treated as independent: the prediction of one target variable becomes an input feature for the next classifier. The CC implementation is exemplified in figure 11: feature set X is used to predict L1; then feature set X together with L1 is used to predict L2; and so on, until target variable Ln is predicted from feature set X and labels L1 through Ln-1. The order of target-variable prediction, which is a user-defined input, significantly impacts classification performance. In the Python implementation, the order parameter was set to the range of label indices, so the labels are predicted in the order in which they appear in the dataset.
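This setup maps directly onto scikit-learn's `ClassifierChain`; a sketch with toy data (the data values are illustrative):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain

# Toy multi-label data: feature 0 determines both labels
X_train = np.vstack([np.zeros((10, 1)), np.full((10, 1), 5.0)])
Y_train = np.vstack([np.tile([1, 0], (10, 1)), np.tile([0, 1], (10, 1))])

# Each classifier in the chain sees the original features plus the
# predictions for all earlier labels; order=[0, 1] follows dataset order.
chain = ClassifierChain(RandomForestClassifier(random_state=0),
                        order=[0, 1], random_state=0)
chain.fit(X_train, Y_train)
Y_pred = chain.predict(np.array([[0.1], [4.9]]))
```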

Limitations of the problem transformation methods
The BR approach does not consider the correlations between the different target variables of a sample, so its predictive results can be weak. The limitation of the LP approach is its risk of overfitting, as the label subset space can be very large. The CC classifier will be slow if the number of labels is large [25, 26].

Ensembles of classifiers
An ensemble of classifiers combines base classifiers: each base classifier's predictions are combined to obtain the final, optimal result [28]. This study used the random forest as the base classifier and applied two ensemble methods, the label space partitioning classifier and the majority voting classifier. These two approaches are discussed in the following.

Label space partitioning classifier
The multi-label data is partitioned into subsets according to label co-occurrence, using the label network graph discussed in section 3.2. As shown in figure 12, each subset is transformed into a single-label classification problem using PT methods, and a random forest base classifier is then trained on each subset per label. Each sub-classifier predicts a result, and the sum of all sub-classifier predictions is taken; this method is called LSPU.

Majority voting classifier
This approach is the same as the label space partitioning method, except that it uses voting instead of a sum. The base classifier, random forest, is trained on each subspace, and label l is assigned to a sample by majority voting: if more than half of the base classifiers of the subspaces (clusters) predict label l, the label is assigned to the sample.
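The majority-voting rule itself is simple to express; a sketch over hypothetical sub-classifier outputs (the function name and data are ours):

```python
import numpy as np

def majority_vote(predictions):
    """Assign label l to a sample when more than half of the
    sub-classifiers predicted it. `predictions` has shape
    (n_classifiers, n_samples, n_labels) with 0/1 entries."""
    P = np.asarray(predictions)
    votes = P.sum(axis=0)                       # votes per sample and label
    return (votes > P.shape[0] / 2).astype(int) # strict majority threshold

# Three hypothetical sub-classifier outputs for 2 samples, 2 labels
preds = [[[1, 0], [0, 1]],
         [[1, 1], [0, 1]],
         [[0, 0], [1, 1]]]
final = majority_vote(preds)
```

For the first sample, label 0 gets two of three votes and is kept, while label 1 gets only one vote and is dropped.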

Evaluation measures
The classification model in this study was evaluated using well-known and popular evaluation metrics, which are reported below.

Hamming loss
Hamming loss is a standard evaluation metric defined as the fraction of incorrectly predicted labels out of the total number of possible labels:

HL = (1 / (N·L)) Σ_{i=1}^{N} Σ_{j=1}^{L} XOR(YT_ij, YP_ij)

where N is the total number of data examples, L is the total number of possible labels, YT_ij is the target, YP_ij is the prediction, and XOR is the 'exclusive or' operator, which returns zero when a label is correctly identified and one otherwise. Hence, the lower the Hamming loss, the better the classifier's performance.
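The definition can be checked against scikit-learn's implementation on a small example (the label matrices are illustrative):

```python
import numpy as np
from sklearn.metrics import hamming_loss

YT = np.array([[1, 0, 1], [0, 1, 0]])   # target label sets
YP = np.array([[1, 1, 1], [0, 1, 1]])   # predicted label sets

# Manual computation: fraction of label positions where YT != YP
manual = np.mean(YT != YP)
# scikit-learn's implementation gives the same value
assert manual == hamming_loss(YT, YP)
```

Here two of the six label positions are wrong, so the Hamming loss is 2/6 ≈ 0.333.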

F-measure
F-measure is an example-based metric and is the harmonic mean of precision and recall, as shown in equation (10):

F1 = 2 · (precision · recall) / (precision + recall)  (10)

Precision is the number of instances predicted as relevant that are actually relevant, divided by the total number of instances predicted as relevant; recall measures how many of the relevant instances in the dataset are found. The formulae for precision and recall are:

precision = TP / (TP + FP),  recall = TP / (TP + FN)
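These metrics are available in scikit-learn for multi-label indicator matrices; the averaging choice below (`micro`, which pools TP/FP/FN over all labels) is ours for illustration, as the paper does not state which average it used:

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

YT = np.array([[1, 0, 1], [0, 1, 0]])   # target label sets
YP = np.array([[1, 1, 1], [0, 1, 1]])   # predicted label sets

p = precision_score(YT, YP, average='micro')
r = recall_score(YT, YP, average='micro')
f1 = f1_score(YT, YP, average='micro')
```

With 3 true positives, 2 false positives, and 0 false negatives, precision is 3/5 = 0.6, recall is 1.0, and F1 is 0.75.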

Accuracy
Accuracy is the number of correct predictions divided by the total number of predictions made by the model.

Experimental results
The research experiments were conducted on a machine equipped with GPUs and 32 GB of RAM, using the Anaconda Jupyter Notebook as the IDE and Python 3.7.9 as the programming language. The required libraries, including Mulan, pandas, NumPy, scikit-learn, matplotlib, seaborn, and arff, were used. Overall, the experimental setup was carefully configured to ensure reliable and accurate results.
The results of the study in terms of the evaluation measures (accuracy, Hamming loss, and F1 score) and the training times of the classifiers (problem transformation, algorithm adaptation, and ensembles of classifiers) on the five considered multilabel datasets are reported, together with a comparison before and after the proposed feature selection.

Performance of multilabel classification model
Accuracy measures the closeness of the predicted values to the true values. The line plot of all classifiers' accuracies in figure 13(a) shows that the ensembles of classifiers and LP perform well on all five multilabel datasets. Performance on the Enron and Emotions datasets is poor overall because of their higher cardinality and density. Similarly, figure 13(b) shows the line plot of all classifiers' Hamming loss. MLTSVM performs poorly on all datasets, especially on the Emotions dataset. CC has the lowest Hamming loss of all the classifiers; since a lower Hamming loss means better performance, CC performs well on all five multilabel datasets, particularly the Medical and Genbase datasets. The ensembles of classifiers give a Hamming loss slightly higher than that of CC, while MLKNN and BRKNNa give almost the same Hamming loss, which is lower than that of MLTSVM.

Accuracy
In figure 14, it can be observed that the MLKNN classifier on the Scene dataset correctly predicted 59% of label sets, and after 80% FS its performance increased to 65%. Similarly, MLTSVM gives 33% accuracy before FS, and after 80% FS its accuracy increases to 38%. On the Medical dataset, BRKNNa gave 51% accuracy before FS, and 58% and 57% after 80% and 20% FS, respectively. Likewise, the BRKNNa classifier originally predicted 55% of label sets correctly, and after 80%, 60%, and 40% FS its accuracy increased to 59%, 60%, and 58%, respectively, while the CC classifier originally achieved 56% accuracy and, after 80%, 60%, 40%, and 20% FS, achieved 62%, 61%, 62%, and 60%, respectively. It is therefore a good achievement that FS increased performance by 6% to 10%; in a few cases, FS yielded an increase of only 1%, the same accuracy, or a lower accuracy than the original.

F1 score
Figure 15 shows the F-scores of all classifiers on the five multilabel datasets. It can be concluded that the F-score increased after FS, while in very few cases it remained the same or decreased. Hence, this is evidence of the effectiveness of the filter-based feature selection method.

Hamming loss
From the results in figure 16, it is concluded that after 80% FS on the Scene and Medical datasets, the MLKNN Hamming loss decreases by 1%. On the Emotions dataset, after FS the Hamming loss of the BRKNNa classifier decreases by 3% to 5%, whereas on the Enron dataset the Hamming loss decreases by 1% after 40% FS.

Conclusions and future work
Multilabel classification algorithms have recently contributed significantly to the machine learning and data mining fields. This study considered five multilabel datasets from different application domains and applied filter-based methods for feature selection, together with an ensemble of these methods. First, the importance of each feature was calculated by its weight; then, based on these weights, 80% to 20% of the features of the original dataset were selected and the classifiers were applied, including (1) algorithm adaptation: MLTSVM, MLKNN, and BRKNNa; (2) problem transformation: BR, LP, and CC; and (3) ensembles of classifiers: LSPU and LSPV. The ensembles LSPU and LSPV and the LP classifier perform well among all classifiers over the five datasets. In this way, feature selection helped to reduce the runtime burden of the multilabel classifiers. The experimental results comparing performance before and after applying the filter methods show that MLKNN accuracy improved by 6% on the Scene dataset, and accuracy improved on the Medical and Emotions datasets by 7% and 4%, respectively. The performance and K (the percentage of features selected) are directly proportional: as K decreases, the classifier's performance also decreases; for example, MLKNN accuracy on the Medical dataset decreases by 7% when only 20% of the original features are retained. Another feature selection approach used was the ensemble of filters. The objective of applying the ensemble was to improve performance, but the results did not improve. The results could be improved by (1) trying different techniques, such as the simple ensemble methods of max voting, averaging, and weighted averaging, or other techniques including stacking and boosting; and (2) analyzing various domains/types of benchmark datasets and measuring performance using different classifiers. All classifiers performed well on the five datasets except on the Emotions and Enron datasets, owing to their higher density and cardinality. In the future, one can focus on feature-to-feature correlation to select more appropriate features. Label density and label cardinality affect the classifiers' performance, and there is a pressing need to study how these factors can be taken into account to improve the performance of multilabel classifiers. Feature engineering could also be used to add 10% more features before applying feature selection, to reduce the chances of losing essential features. Owing to time constraints, only a limited set of filter method strategies was considered for the feature selection process: only MI was evaluated over all the considered datasets. This limitation is acknowledged, and prospective results of testing alternative filtering techniques are considered in the discussion.

Figure 1 .
Figure 1. Three types of classification.

Figure 5 .
Figure 5. Label-to-label correlation.

Figure 6 .
Figure 6. Top 20% selected features of the medical dataset.

Figure 7 .
Figure 7. Top 40% selected features of the medical dataset.

Figure 8 .
Figure 8. Ensemble: intersection of four selected set of features.

Figure 13 .
Figure 13. Accuracy and Hamming loss line plots of classifiers on different datasets.

Figure 17 shows the training times of all classifiers before and after feature selection. MLTSVM took the longest training time among all classifiers, and this time was reduced after FS. It is concluded that, in most cases, FS helped to reduce the training time and hence the computation time and power.

Figure 14 .
Figure 14. Comparison of accuracies for various feature selections on different datasets.

Figure 18 .
Figure 18. Comparison of filter methods by evaluation metrics on different datasets.

Table 1 .
Summary of literature review.

Table 3 .
Filter methods for feature selection.

Table 4 .
Number of features before and after feature selection for all five datasets.