Feature space reduction method for ultrahigh-dimensional, multiclass data: random forest-based multiround screening (RFMS)

In recent years, several screening methods have been published for ultrahigh-dimensional data that contain hundreds of thousands of features, many of which are irrelevant or redundant. However, most of these methods cannot handle data with thousands of classes. Prediction models built to authenticate users based on multichannel biometric data lead to this type of problem. In this study, we present a novel method known as random forest-based multiround screening (RFMS) that can be effectively applied under such circumstances. The proposed algorithm divides the feature space into small subsets and executes a series of partial model builds. These partial models are used to implement tournament-based sorting and the selection of features based on their importance. This algorithm successfully filters irrelevant features and also discovers binary and higher-order feature interactions. To benchmark RFMS, a synthetic biometric feature space generator known as BiometricBlender is employed. Based on the results, the RFMS is on par with industry-standard feature screening methods, while simultaneously possessing many advantages over them.


Introduction
The understanding of human motor coordination and the building of prediction models to meet various business needs have become widely studied topics in fields such as neurology and cybersecurity. With the help of adequate sensors, gestures, walking, handwriting, eye movement, or any other human motor activity can be transformed into multidimensional time series. However, from a general perspective, any fixed set of features is either uncharacteristic of these time series or too large for resource-efficient classification. Thus, instead of computing an a priori defined, conveniently small set of features, a promising alternative strategy is to create an ultrahigh-dimensional dataset that consists of hundreds of thousands of features and to search for the most informative minimal subset 1 . In this process, as in many other machine learning applications, the evaluation of feature importance and the elimination of irrelevant or redundant predictors has become one of the crucial elements in improving the performance of algorithms 2 . This elimination can increase the accuracy of the learning process and reduce the resource needs of model building. The statistical challenges of high dimensionality have been thoroughly reviewed in 3-5 .
Traditional variable selection methods do not usually work well in ultrahigh-dimensional data analysis because they aim to specifically select the optimal set of active predictors 6-9 . It has also been reported that traditional dimensionality reduction methods such as principal component analysis (PCA) do not yield satisfactory results for high-dimensional data (for example, see 10-12 ). In contrast to these methods, feature screening uses rough but fast techniques to select a larger set that contains most or all of the active predictors 13-15 . Although several screening methods have been published for ultrahigh-dimensional data in recent years (e.g., 16-21 ), only a few of them can be used in cases when the response variable contains numerous classes. In particular, datasets with these properties are often encountered in the domains of neuroscience and biometric authentication.
To reduce various ultrahigh-dimensional feature spaces in binary classification problems, Fan and Lv (2008) 22 proposed a sure independence screening (SIS) method in the context of linear regression models. According to Fan and Fan (2008) 23 , all features that effectively characterize both classes can be extracted by using two-sample t-test statistics, resulting in the features annealed independence rules (FAIR). For similar binary classification problems, Mai and Zou (2013) 13 used a Kolmogorov filter (KF) method, which was later extended to handle a multiclass response in Mai and Zou (2015) 14 . The KF method is also applied to the ultrahigh-dimensional binary classification problem with a dependent variable in Lai et al. (2017) 24 . For solving similar tasks, Roy et al. (2022) 21 proposed a model-free feature screening method based on energy distances (see 25,26 ).
While most existing feature screening approaches are unsuitable for examining higher-order interactive structures and nonlinear structures, random forest (RF) 27 can overcome such difficulties 28 . To provide a robust screening solution for ultrahigh-dimensional, multiclass data, we propose the random forest-based multiround screening (RFMS) method. The Julia package that implements RFMS is publicly available on GitHub 29 . The RFMS improves the accuracy and scalability of both traditional selection methods and existing RF-based screening by organizing the screening process into rounds. As an advantage, the input is processed in larger chunks, and we can iteratively distill a well-predicting subset of features.
The paper is organized as follows. The Data and methodology section introduces the dataset that was used for benchmarking and the proposed feature screening method. The Results and discussion section presents the performance of the novel screening method and compares it with other reduction algorithms. Finally, the Conclusions and future work section provides our conclusions and suggests future research directions.

Synthetic dataset
To compare the performance of the proposed RFMS with a wide range of feature screening methods, an ultrahigh-dimensional, multiclass feature space (with ground truth and some additional side information on the usefulness of the features) was employed. This feature space imitates the key properties of the private signature dataset of Cursor Insight, the winner of the ICDAR competition on signature verification and writer identification in 2015 30 . Moreover, it was compiled by using the BiometricBlender data generator 31 . The BiometricBlender Python package provides an alternative to real biometric datasets, which are typically not freely accessible and cannot be published for industrial reasons. It is publicly available on GitHub 32 .
The resulting dataset contains a categorical target variable with 100 unique classes and 10 000 intercorrelated features. Note that none of the features in itself contains enough information to accurately classify the data over the target variable. However, an appropriate combination of them can provide sufficient information to achieve classification with high accuracy. Due to the high dimensionality of the dataset, the identification of such a combination is a nontrivial task (regardless of the classification algorithm). The screening algorithm introduced in this paper provides a reliable, robust, and resource-efficient means to achieve that goal.
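Since BiometricBlender's own interface is beyond the scope of this paper, a dataset with comparable characteristics (many classes, intercorrelated features, and no single decisive feature) can be sketched with scikit-learn's make_classification as a stand-in. All parameter values below are illustrative, and the sizes are deliberately scaled down from the 6 400 × 10 000 benchmark shape:

```python
# Stand-in for BiometricBlender: generate a multiclass dataset in which
# class information is spread across a few informative features plus
# intercorrelated (redundant) copies, buried among noise features.
from sklearn.datasets import make_classification

X, y = make_classification(
    n_samples=500,          # scaled down from the 6 400 samples in the paper
    n_features=2000,        # scaled down from the 10 000-feature space
    n_informative=40,       # only a small subset carries class information
    n_redundant=100,        # intercorrelated combinations of informative ones
    n_classes=100,          # many-class target, as in the benchmark
    n_clusters_per_class=1,
    random_state=0,
)
print(X.shape, len(set(y.tolist())))
```

As in the BiometricBlender feature space, no single column of X suffices to separate 100 classes; a classifier needs an appropriate combination of features.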

Random forest-based multiround screening
Before we describe the steps of the proposed screening algorithm, several notations must be introduced. Let y ∈ {1, 2, . . ., k} be a categorical target variable that contains k different classes (k ∈ N+, k ≥ 2), and let x = ⟨x_1, x_2, . . ., x_n⟩ be the tuple of input features (n ∈ N+). (Note that the method may straightforwardly be applied to continuous target variables as well.) Moreover, let α, β ∈ N+ be predefined parameters such that 1 ≤ β ≤ α ≤ n, where α denotes the size of the subsets that the feature space will be divided into, and β denotes the number of features that will be selected by the algorithm. For optimal values of α and β, see the Supplementary information parameters step-size and reduced-size, respectively.
Partition. The input features are first randomly permuted (the permutation is denoted by π) and divided into m = ⌈n/α⌉ subsets x_1^π, x_2^π, . . ., x_m^π, each containing at most α features.
Iteration. In this step, we iterate over the abovementioned subsets by selecting the β most important features from a subset, adding them to the next subset, and repeating this process until the β most important features are selected from the last subset. Formally, for 1 ≤ i ≤ m, let x̃_i^π = z_{i−1} ⊕ x_i^π (the concatenation of the two tuples), where t = |x_i^π| + β ≤ α + β, z_0 = ⟨ ⟩ is an empty tuple, and z_i (1 ≤ i < m) will be defined below. (Note that x̃_1^π = x_1^π.) The relative feature importance of x̃_i^π on y is identified by using random forest classification. The importance of a feature is determined by the total number of times it appears in the classification forest (often termed the selection frequency).
The most important β features of x̃_i^π are stored in z_i = ⟨x̃_{G_i(1)}^π, x̃_{G_i(2)}^π, . . ., x̃_{G_i(β)}^π⟩, where G_i : {1, 2, . . ., β} → {1, 2, . . ., t} is an injective function that sorts the features in z_i in descending order of their importance.
The aforementioned steps of the calculation are illustrated in Figure 1.
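The partition-and-iterate procedure can be sketched in Python as follows. Note that this is a simplified illustration and not the published Julia implementation: scikit-learn's impurity-based feature_importances_ is used here as a stand-in for the selection-frequency importance described above, and the parameter values are arbitrary.

```python
# Sketch of RFMS: partition the permuted feature space into chunks of
# size alpha, carry the beta tournament winners into each next chunk,
# and rank features with a random forest fitted on the current pool.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def rfms_screen(X, y, alpha=50, beta=10, random_state=0):
    """Return the indices of the beta features that survive the tournament."""
    rng = np.random.default_rng(random_state)
    perm = rng.permutation(X.shape[1])             # random permutation pi
    survivors = np.empty(0, dtype=int)             # z_0 = empty tuple
    for start in range(0, len(perm), alpha):
        chunk = perm[start:start + alpha]          # subset x_i^pi
        pool = np.concatenate([survivors, chunk])  # z_{i-1} concatenated with x_i^pi
        rf = RandomForestClassifier(n_estimators=50, random_state=0)
        rf.fit(X[:, pool], y)                      # partial model build
        order = np.argsort(rf.feature_importances_)[::-1]
        survivors = pool[order[:beta]]             # z_i: beta most important
    return survivors
```

The chunks of the permutation are disjoint, so the pool never contains duplicate feature indices, and the final survivors correspond to z_m, the output of the screening.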

Figure 1.
Steps of the RFMS.

Results and discussion
To compare the performance of the RFMS with off-the-shelf screening methods, we completed the following measurements:
1. We measured the maximum accuracy of three basic classifiers (k-nearest neighbors (kNN) 33,34 , support vector classifier (SVC) 35 , and random forest (RF) 27 ) on the full feature set by using n-fold cross-validation. The optimal parameters of the classifiers were identified via a grid search.
2. We performed screening by using four different methods (including our method), thus resulting in the requested number of screened features (from 10 to 500) per method. The tested screening methods included principal component analysis (PCA) 36,37 , factor analysis (FA) 38,39 , k-best 40 , and RFMS.
3. We measured the maximum accuracy of the three classifiers on each of the screened feature sets by using n-fold cross-validation.
4. For every step above, we also measured the CPU usage.
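A minimal version of this measurement protocol, with the hyperparameter grid search omitted and arbitrary, scaled-down data sizes, could look as follows using the scikit-learn implementations of the reference screeners and classifiers:

```python
# Screen to k features with PCA, FA and k-best, then cross-validate
# three classifiers on each reduced feature set.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA, FactorAnalysis
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, n_features=200,
                           n_informative=20, n_classes=5, random_state=0)
k = 10  # requested number of screened features (10 to 500 in the paper)
screened = {
    "PCA": PCA(n_components=k).fit_transform(X),
    "FA": FactorAnalysis(n_components=k).fit_transform(X),
    "k-best": SelectKBest(f_classif, k=k).fit_transform(X, y),
}
classifiers = {
    "kNN": KNeighborsClassifier(),
    "SVC": SVC(),
    "RF": RandomForestClassifier(n_estimators=100, random_state=0),
}
# Mean cross-validated accuracy for every screener/classifier pair.
scores = {(s, c): cross_val_score(clf, Xs, y, cv=3).mean()
          for s, Xs in screened.items()
          for c, clf in classifiers.items()}
```

In the actual benchmark, RFMS replaces one of the screeners, each classifier is additionally tuned via a grid search, and CPU time is recorded for every step.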
Note that methods based on neural networks are legally restricted to prevent the restoration of original signatures. Therefore, we did not utilize these methods as a basis for comparison. The highest classification accuracies for each combination, along with their screening and fitting times, are summarized in Table 1. The optimized hyperparameters that were used during the application of the RFMS method can be found in the Supplementary information. In Table 1, (a) only the best accuracy among all of the parameters is reported; (b) screening times are the CPU times of the feature screening step and correspond to the best accuracy shown; and (c) fitting times are defined as the CPU times after the reduction step and correspond to the best accuracy shown.
Based on the results, the RFMS and FA methods outperformed both PCA and k-best screening in accuracy. The highest accuracy was achieved by the RFMS-SVC and FA-RF pairs (61.4%); however, the latter combination required a considerably lower screening time. Notably, depending on the persistence of the features (see, e.g., 41 ), screening is performed relatively infrequently in comparison with the fitting procedure, and in the fitting procedure the combinations comprising RFMS proved to be relatively fast. Furthermore, in exchange for a slower screening procedure, RFMS offers several advantages over the FA method. These advantages are detailed below.
Potential cost reduction in feature computation. To use FA on an incoming sample, its full feature set must be computed before the transformation can be applied. The trained model only works on the transformed feature set. In contrast, the output of RFMS is a transformation-free subset of the original feature set. This facilitates the interpretation of the resulting features; in addition, once RFMS has finished and we have the set of optimal features, only these features need to be computed on any further incoming samples. This could be a significant factor in saving cost and time in a production system.
Suitability for several classifiers. Although the combination of FA and RF resulted in a high accuracy and a low screening time, the same FA output with the SVC and kNN classifiers produced significantly weaker results (accuracies of 42% and 10%, respectively). However, for the RFMS output, SVC performed slightly better than RF (just as well as the FA-RF combination), and even the accuracy of the kNN classifier, at 38.1%, was much closer to the top performers.
Robustness. If we look beyond the highest accuracies for every combination and observe how the accuracy changes with the adjustment of the hyperparameters, we can conclude that FA is quite sensitive. If we reduce the number of screened features (components) from 500 to 250, the highest achievable accuracy drops to 33.1%. A further reduction to 125 results in an accuracy of only 25%. A similar performance drop is observable if we begin to increase the number of features from 500. However, with RFMS, a reduction in the number of screened features from 500 to 200 only slightly reduces the best accuracy to 60.8%, and with a further reduction to 100, the accuracy is still 55.4%. We observed this behavior with high probability when the degrees of freedom of the data were well defined but FA was requested to produce fewer features.
Figure 2 summarizes both trends on a single plot, demonstrating how the highest achievable accuracy converges to its global optimum as the number of screened features increases. Note that the deviation from the plotted accuracy values under randomization of the selection and measurement process is negligible. In addition, by adjusting the RFMS hyperparameters, the screening time can be significantly reduced without compromising the classification accuracy. For example, with the right combination, the screening time can be decreased to 2 143 s (merely 1/5th of the highest value in Table 1), while the achievable accuracy is still 60%. The fastest run in our tests took 1 738 s (15% of the longest screening time), and even that output achieved a 57.3% accuracy (93.4% of the overall highest accuracy).

Conclusions and future work
In this study, we proposed the random forest-based multiround screening (RFMS) method and benchmarked it on a synthetic feature space that imitates a real (private) signature dataset. Based on the results, the RFMS is on par with industry-standard feature screening methods, and it also possesses many advantages over these methods due to its flexibility and robustness, as well as its transformation-free operation. The Julia package that implements RFMS is publicly available on GitHub 29 .
The difference in maximum accuracy achieved on real and synthetic data suggests that the synthetic data generator used for the tests does not yet reproduce all of the properties of real data that challenge feature screeners; this is especially true for factor analysis. Therefore, it would be important to explore the properties of real data that cause this difference and to further develop BiometricBlender in this direction, which could subsequently enable more realistic tests.
To further develop the RFMS method, the following future works are suggested:
1. Filter highly correlated variables in every iteration just before classification, as this could improve the importance of the features that are proposed by the method.
2. Identify the means of automatically determining the number of important features to be retained per cycle, thus allowing all of the important features to be kept and most unnecessary features to be dropped. This could improve both accuracy and computation time.
3. Reduce screening time by using more parallel computations (random forest building already utilizes multiple threads when available).
4. Replace random forest and the importance metrics with other less common (but potentially better performing) alternatives.
5. Hyperparameter optimization is typically not viable with brute force due to lengthy computation times. Handy visualization tools could provide useful hints for manual tuning.
6. Consider various types of elimination tournaments, such as going through the input several times or using alternative scoring systems like Elo or Glicko 42 . This may further improve accuracy when some information is nontrivially distributed across multiple "entangled" features.
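As an illustration of item 6, a standard Elo update could score pairwise feature "matches" derived from the partial model builds, with each feature keeping a rating across rounds. This is purely a sketch of the suggested direction, not part of the published RFMS:

```python
# Standard Elo rating update for a single "match" between two features.
# score_a is 1.0 if feature A beats feature B in importance, 0.0 if it
# loses, and 0.5 for a tie; k controls how fast ratings move.
def elo_update(r_a, r_b, score_a, k=32):
    """Return the updated (r_a, r_b) ratings after one match."""
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))  # win probability of A
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta
```

Two equally rated features start at the same rating; the winner of their first match gains exactly what the loser gives up, so the rating pool is conserved across the tournament.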

Figure 2 .
Figure 2. Convergence of the highest accuracy as a function of the number of screened features (components). Accuracy is measured on a scale of 0-1. RFMS converges to the optimum much more quickly than FA.

Table 1 .
Classification results on the 6 400×10 000 dataset for three basic classifiers and various reduction algorithms.(a)

Table S1 .
Optimal screening hyperparameters, the corresponding screening times, and the best achievable classification accuracies for various classifiers, together with the fastest screening runs, according to the grid search results.