Analysis and benchmarking of feature reduction for classification under computational constraints

Machine learning is often expensive in terms of computational and memory costs because models are trained on large volumes of data. The computational limitations of many computing systems motivate us to investigate practical approaches, such as feature selection and reduction, that lower time and memory costs without sacrificing the accuracy of classification algorithms. In this work, we carefully review, analyze, and identify feature reduction methods that have low costs/overheads in terms of time and memory. We then evaluate the identified reduction methods in terms of their impact on the accuracy, precision, time, and memory costs of traditional classification algorithms. Specifically, we focus on the least resource-intensive feature reduction methods available in the Scikit-Learn library. Since our goal is to identify the best-performing low-cost reduction methods, we do not consider complex, expensive reduction algorithms in this study. In our evaluation, we find that at quadratic-scale feature reduction, the classification algorithms achieve the best trade-off among competing performance metrics. Results show that, with quadratic-scale reduction, the overall training times are reduced by 61%, the model sizes are reduced 6×, and accuracy scores improve by 25% compared to the baselines on average.


Introduction
Throughout the last two decades, the field of machine learning (ML) has significantly advanced in terms of both theory and applications [1]. It has shown impressive success across many tasks from different realms and settings. Modern ML algorithms often require immense volumes of data, which has become even more prevalent with the introduction and adoption of deep neural networks [2]. Coupled with the advances in high-performance computing systems and hardware, such as those in large-scale distributed and heterogeneous systems, ML has gained access to considerable compute and storage capabilities and has leveraged them successfully [3,4].
While these large-scale capabilities have enabled large-scale ML, on the opposite side of the spectrum, where there are critical compute, power, and storage constraints, such as ML on edge devices, ML applications have shown success despite the constraints [5]. However, several research problems remain open [6]. One such open problem is to understand how ML methods can leverage low-cost feature space reduction algorithms, such as univariate feature selection and random projections, to address those constraints, and to understand the impact of these reduction algorithms in terms of classification performance and computational costs. Furthermore, the conflicting nature of classification and cost performance on this side of the spectrum is yet to be explored and understood. We posit that there is a research gap regarding the extent to which classification performance can be sacrificed while employing the lowest-cost feature reduction algorithms. This is categorically different from a typical parameter-space exploration study where there are no prior constraints on the exploration. Moreover, to ensure wide-range applicability, feature reduction methods that do not make any assumptions about the input data, e.g. non-negativity or occurrence in specific numerical intervals, should be prioritized and studied.
In this work, we therefore review, identify, and evaluate the lowest-cost feature reduction methods and assess their impact on the classification accuracy and the computational overheads for a diverse set of ML algorithms. In particular, we concentrate on the least resource-intensive feature reduction methods provided by the Scikit-Learn library [7]. Our primary objective is to determine the most effective and affordable reduction methods; therefore, we exclude complex and costly algorithms from our study.
The first step in our work is to analyze the existing feature reduction algorithms and identify the lowest-cost ones in the Scikit-Learn library [7]. The second step is to evaluate the effect of feature reduction on classification performance, i.e. accuracy, precision, and recall, and on computational overheads in a systematic way. In our study, we restrict our attention to traditional classification algorithms: k-nearest neighbours (KN), support vector machines (SVMs), decision trees (DTs), random forests (RFs), the AdaBoost classifier (Ada), multi-layer perceptrons (NN), and Naive Bayes (NB).
At the end of this study, we should be able to answer questions such as (i) 'at what scale (amount) of reduction do ML algorithms achieve the best trade-off between classification and cost performance?', (ii) 'among the lowest-cost methods, which reduction methods perform the best?', (iii) 'how do the probabilistic and randomized reduction methods perform compared to the deterministic ones?', (iv) 'how do feature selection methods compare to feature transformation ones?', and (v) 'how do the individual classification methods impact cost savings?'. Overall, our study can help guide ML practitioners in designing and developing their applications and pipelines according to their resource constraints with widely available implementations.
In summary, the contributions of our study include:
• We analyze the existing feature space reduction algorithms in Scikit-Learn and identify those that have no data assumptions and the lowest computational complexity, which we require to be less than or equal to O(N²) and O(D²), where N is the size of a dataset and D is the number of features. In our analysis, we found that randomized principal component analysis (PCA), Gaussian (GS) and Sparse (SP) random projections, and ANOVA-based univariate feature selection (ANOVA) have the lowest computational complexity within their respective families of algorithms. More importantly, the implementations of these four algorithms have complexities below quadratic in terms of the dataset and feature space sizes. In addition, these algorithms do not introduce any limitations in terms of applicability, such as assuming non-negative features.
• We evaluate how the four suggested feature reduction techniques affect both the classification accuracy and the computational costs of various ML algorithms. This evaluation is conducted on ten datasets and specifically designed to address the questions we previously mentioned.
• The main takeaway of our study is that at quadratic-scale feature space reduction, the classification algorithms achieve the best trade-off in terms of classification and cost performance. Results show that, with quadratic-scale reduction, the training times are reduced by 61%, the model sizes are reduced 6×, and accuracy scores improve by 25% on average compared to the baselines where no feature reduction is performed. Moreover, the ANOVA-based reduction (ANOVA) outperforms the other three reducers in terms of classification performance and incurred costs. ANOVA is not only one of the algorithms with the lowest computational complexity but also a deterministic method. This makes ANOVA an extremely practical, stable, and widely-applicable method.
The organization of our work is as follows: Section 2 provides our analysis of the existing feature space reduction methods, discusses the selection process of the lowest-cost methods to evaluate, and introduces the reduction scales. Section 3 presents the experimental methods, and Section 4 presents the evaluation results. Section 5 overviews the related work. Section 6 summarizes this study.

Feature space reduction: methods and scales
In this section, we first review the existing feature reduction algorithms in the Scikit-Learn framework [7] with specific attention to their computational complexities and discuss how we select four feature reduction algorithms to further evaluate. Second, we introduce the concept of the feature reduction scale.

Analysis and selection of algorithms for feature reduction
Scikit-Learn is a widely-used, comprehensive ML framework [7]. It has numerous feature selection, feature transformation, and dimensionality reduction algorithms that can be used to reduce the feature space of the input data. Among them, PCA decomposes data into a set of orthogonal components that account for the maximum amount of the variance. While PCA works well in practice, it is computationally expensive and is not suited for large datasets. Let N be the size of the dataset (the number of data points), and D be the dimension of the dataset (the number of features). The exact computation of principal components has a computational complexity of O(ND² + D³): the computation of the covariance matrix is O(ND²) and the eigen-decomposition is O(D³). Compared to the exact PCA, randomized PCA [8,9] has a time complexity of O(Nk² + k³), where k is the dimension of the reduced feature space. Therefore, we select randomized PCA over the exact PCA.
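As a concrete illustration, the following is a minimal sketch of quadratic-scale reduction with Scikit-Learn's randomized PCA solver; the dataset loader and the choice of k here are illustrative and not the exact pipeline of this study.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.decomposition import PCA

X, y = load_breast_cancer(return_X_y=True)   # N x D feature matrix (Cancer dataset)
k = int(np.sqrt(X.shape[1]))                 # quadratic-scale target dimension

# svd_solver='randomized' selects the randomized PCA algorithm instead of the exact SVD
reducer = PCA(n_components=k, svd_solver="randomized", random_state=0)
X_reduced = reducer.fit_transform(X)
print(X_reduced.shape)                       # (N, k)
```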
Random projection is a computationally efficient approach to reducing the dimensionality of a feature space. It trades off accuracy for fast computations and smaller model sizes. There are two types of random projections implemented in Scikit-Learn. Gaussian random projection (GS) projects the original feature space by a randomly generated matrix whose components are drawn from the normal distribution N(0, 1/k). Its computational complexity is O(NDk) [10,11]. The second type of random projection is Sparse random projection (SP) [12], which uses a sparse random matrix. The time complexity of sparse random projections is O(N√D k), where the density of the sparse matrix is set to 1/√D, i.e. O(√D) nonzero entries per column [12]. Both types of projections have exceptionally low computational complexities compared to many existing dimensionality reduction algorithms. Consequently, we select GS and SP to evaluate.
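As a minimal sketch, both projection types are available as drop-in transformers; the target dimension k below is set explicitly, whereas by default the transformers would derive it from the Johnson-Lindenstrauss lemma.

```python
from sklearn.random_projection import GaussianRandomProjection, SparseRandomProjection

k = 8  # illustrative target dimension

gs = GaussianRandomProjection(n_components=k, random_state=0)
sp = SparseRandomProjection(n_components=k, density="auto", random_state=0)  # density ~ 1/sqrt(D)

X_gs = gs.fit_transform(X)  # X as in the previous snippet
X_sp = sp.fit_transform(X)
```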
Feature selection algorithms [13,14] are designed to capture the features that impact a classifier's accuracy on high-dimensional datasets. Among these algorithms, Variance-Threshold is a simple method which removes all features whose variance is less than some threshold. Although it is an extremely simple and low-cost method, there is no way to control the resulting number of features/dimensions, and in practice it performs poorly with complex datasets [15]. Therefore, for our purposes, we exclude it. Furthermore, univariate feature selection approaches use statistical reasoning to select a strict subset of the features; an example is given in the sketch below. Recursive feature elimination (RFE) [16] is another technique for feature selection. It first assigns weights by training an external model such as an SVM. It then recursively omits features using the feature weights and recursively retrains the model on the remaining features. The recursion continues until the desired number of features is obtained. RFE has a very high computational complexity since an external model is trained at every iteration. Therefore, RFE is not suitable for resource-constrained feature reduction.
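The univariate (ANOVA) selection mentioned above reduces to a single transformer call in Scikit-Learn; a minimal sketch, reusing the X, y, and k from the previous snippets:

```python
from sklearn.feature_selection import SelectKBest, f_classif

# f_classif scores each feature by the ANOVA F-statistic between its values
# and the class labels; SelectKBest keeps the k highest-scoring features.
selector = SelectKBest(score_func=f_classif, k=k)
X_anova = selector.fit_transform(X, y)   # supervised: requires the labels y
```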
SelectFromModel [17] is a meta-transformer in Scikit-Learn that can be used with any estimator that assigns importance to each feature through a specific attribute. The usage of this meta-transformer is called model-based selection. Examples of specific attributes are estimator coefficients and feature importances. A numerical threshold is used to select a subset of the features. Model-based selection algorithms are expensive because of fitting an external model and therefore are not suitable for reducing computation and memory costs.
Sequential feature selection (SFS) [18] is a greedy algorithm that iteratively finds the best new feature to add to the set of selected features. Starting with an empty set of features, it finds the one feature that maximizes a cross-validated score when an estimator is trained on this single feature. Once that first feature is selected, the algorithm keeps adding a new feature to the set of selected features. It stops when the desired number of selected features is obtained. SFS is computationally more expensive than RFE and model-based selection. This is because in every iteration i, SFS fits D − i models, whereas RFE fits one model per iteration, and model-based selection fits a single overall model and does not perform iterations.
Manifold learning [19] refers to a family of techniques for non-linear dimensionality reduction, such as isomap, locally linear embedding, and multi-dimensional scaling. It is most often used for data visualization purposes. Moreover, manifold learning algorithms almost always have at least quadratic time complexity in terms of the dataset size; only with additional assumptions about the data do variant algorithms become sub-cubic.
Finally, feature agglomeration [20] is a dimensionality reduction method that uses agglomerative clustering to merge similar features to decrease the number of features. Feature agglomeration has cubic time complexity in terms of the dataset size.
Table 1 summarizes the time complexity of the existing feature reduction algorithms in Scikit-Learn.

Feature reduction scales
In this study, we explore and evaluate five different feature reduction scales. A constant-scale feature reduction means that the resulting (target) number of the reduced features is the original number of the features divided by a constant; for our study, we choose this constant to be 2. A quadratic- and cubic-scale feature reduction means that the resulting reduced number of the features is the square and cubic root of the original number of the features, respectively. An exponential-scale feature reduction means that the resulting reduced number of the features is the natural logarithm of the original number of the features. An infinite-scale feature reduction corresponds to the resulting reduced number of the features being a fixed small number regardless of the original number of the features; in our evaluation, we set this number to 2. Figure 3 plots the number of the reduced features as the number of the original features increases under the different reduction scales. We note that for visualization purposes, we use 10 as the constant instead of 2. A hedged sketch of this mapping is given below.
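The following is a small illustrative helper showing how the five scales map an original feature count D to a target dimension k; the function name and the rounding/minimum handling are our own choices, not part of the study's implementation.

```python
import math

def target_dimension(D: int, scale: str, constant: int = 2, fixed: int = 2) -> int:
    if scale == "constant":
        k = D // constant            # D divided by a constant (2 in our study)
    elif scale == "quadratic":
        k = round(D ** 0.5)          # square root of D
    elif scale == "cubic":
        k = round(D ** (1 / 3))      # cubic root of D
    elif scale == "exponential":
        k = round(math.log(D))       # natural logarithm of D
    elif scale == "infinite":
        k = fixed                    # fixed small number regardless of D
    else:
        raise ValueError(f"unknown scale: {scale}")
    return max(k, 1)                 # never reduce below one feature

print([target_dimension(100, s) for s in
       ("constant", "quadratic", "cubic", "exponential", "infinite")])
# [50, 10, 5, 5, 2]
```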

Methods
In this section, we present the methods and the experimental setup. We run our experiments on single compute nodes of the Deception computing system at PNNL. A compute node has dual AMD EPYC 7502 CPUs running at 2.5 GHz with the ability to boost to 3.35 GHz, 256 GB of octa-channel DDR4-3200 memory, and 512 GB of local NVMe storage. Our implementation is based on Python 3.9.6, Scikit-Learn 1.3.0, Scipy 1.11.1, and Statsmodels 0.14.1.
Table 2 shows the ten datasets we evaluate in this study (Bio [21], Cancer [22], Cov [23], Eeg [24], Hill [25], KDD [26], Madelon [27], Kidney [28], Random [29], and Sylva [30]). We use a randomly selected 80% of a dataset as training data and the remaining 20% as test data. We standardize the datasets with the StandardScaler of Scikit-Learn. We ran our experiments with five different random seeds to check whether the results are consistent across different runs and observed that they had negligible variations.
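A minimal sketch of this per-dataset preprocessing; fitting the scaler on the training split only and the particular random seed are our illustrative choices.

```python
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# 80/20 random split, as described above
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

scaler = StandardScaler().fit(X_train)   # learn per-feature mean and std
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
```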
Table 3 shows the ML algorithms we evaluate in our study along with their computational training complexities. We select these algorithms to cover a variety of different ML methods. Table 3 also shows the specific hyper-parameters we set. All other hyper-parameters are set to their default values.
Table 4 shows the four feature reduction algorithms we evaluate in our study. Among them, ANOVA is a deterministic feature selection algorithm, and the other three are probabilistic feature transformation algorithms. From the table alone, it is not possible to draw definite conclusions about the computational costs, because the relative sizes of a dataset, its features, and the reduced features are not known, and it is these relative sizes that determine the relative costs.
Table 5 shows the class distributions for the datasets. We see that KDD, Kidney, and Sylva are imbalanced. For such imbalanced datasets, the traditional accuracy metric may not be the best fit; metrics such as Balanced Accuracy might be needed for their evaluation. As a note, Cov and KDD have more than two classes. In KDD, we treat abnormal instances as one class against the normal instances; as a result, it becomes a two-class dataset. In Cov, two classes constitute most of the dataset (37% and 49%). We evaluate Cov in a one-versus-all manner (37% and 63%).
To assess the classification performances of the evaluated algorithms, we use the accuracy, precision, and recall metrics to consider different perspectives on a classification task. Formally, assuming that TP, FP, TN, and FN represent true positives, false positives, true negatives, and false negatives, respectively, then accuracy, precision, and recall (in terms of percentages) are defined as

Accuracy (%) = (TP + TN) / (TP + FP + TN + FN) × 100
Precision (%) = TP / (TP + FP) × 100
Recall (%) = TP / (TP + FN) × 100

To better evaluate imbalanced datasets, Balanced Accuracy is typically used. It is defined as the arithmetic mean of the true positive rate (TPR) and the true negative rate (TNR):

Balanced Accuracy (%) = 1/2 (TP / (TP + FN) + TN / (TN + FP)) × 100
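A minimal sketch of computing these four metrics with Scikit-Learn on a held-out test split; the choice of classifier is illustrative and stands in for any of the models in Table 3.

```python
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, balanced_accuracy_score)

clf = KNeighborsClassifier().fit(X_train, y_train)
y_pred = clf.predict(X_test)

acc  = 100 * accuracy_score(y_test, y_pred)           # (TP + TN) / (TP + FP + TN + FN)
prec = 100 * precision_score(y_test, y_pred)          # TP / (TP + FP)
rec  = 100 * recall_score(y_test, y_pred)             # TP / (TP + FN)
bacc = 100 * balanced_accuracy_score(y_test, y_pred)  # (TPR + TNR) / 2
```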

Results
In this section, we present our results and discuss them in detail. Figure 4 demonstrates one of the main takeaways of our study, where the y-axes represent the average percentage reductions (gains/savings) with respect to the baseline. We see that quadratic-scale reduction, i.e. √n for an n-dimensional feature space, is the sweet spot in terms of time and memory cost reductions and classification accuracy: the overall training times are reduced by 61%, the model sizes are reduced 6×, and accuracy scores improve by 25% compared to the baselines on average with quadratic-scale reduction. While the algorithms provide classification performance on par with the quadratic scale under constant-scale reduction, their time and memory cost reductions are not as high as at the quadratic scale. On the other hand, with the cubic-, exponential- and infinite-scale reductions, the classification algorithms obtain better memory size reduction than with the quadratic scale (from 30× to 57× versus 6× on average), but their accuracy performance is markedly worse; in fact, the classification performance is mostly worse than in the original feature space. We note that the time gains at the quadratic-, cubic-, exponential- and infinite-scale reductions are similar, between 61% and 68% on average. This is because, given the sizes of the datasets and the original dimensions, the resulting integer dimensions are not too different from each other. As a result, different reduction scales achieve similar time cost savings. For eight datasets, the numbers of features (dimensions) are around or less than 100; the two others are in the thousands and ten thousands. In addition, the reported results are averages over all reducers, datasets, and ML models. All these lead to similar performance for the quadratic-, cubic-, exponential- and infinite-scale reductions.

Figure 4. Time, memory reductions/savings and accuracy results with respect to the baseline across the five feature reduction scales and seven ML algorithms. The baseline of an ML classification algorithm is established when it is run on a full dataset without any feature reduction as a pre-processing step (see figures 1 and 2). We see that the quadratic-scale feature reduction offers the best trade-off considering time and memory savings, and classification (accuracy) improvements. A negative difference in a classification metric (accuracy in this case) shows an improvement over the baseline.
Figure 5 shows the complete summary/aggregate classification results, that is, the average accuracy, precision and recall percentage improvements or losses with respect to the different feature reduction scales and the seven ML algorithms. A negative percentage means that the reduced space improved the performance compared to the original space, while a positive percentage indicates that the classification performance is worse than in the original space. We see that while the constant-scale feature reduction improves the accuracy the most, only with the quadratic-scale reduction do the classifiers improve their classification performance in terms of all three metrics. At all scales, feature reduction improves the recall performance, while, with one exception, the accuracy and precision performance of the cubic-, exponential- and infinite-scale reductions is worse than in the original space. Overall, with the quadratic scale, accuracy, precision, and recall improve by 25%, 35%, and 520%, respectively, on average.

Figure 6 details the average gains/reductions in the time costs of the four reducers at the different scales, averaged over all seven ML algorithms. We see that constant-scale feature reduction provides categorically the lowest execution time gains, while the other scales achieve similarly higher time gains. As we discuss above, this similar performance is due to the original feature sizes of the datasets, which are all less than 2000 (with the exception of the Kidney dataset), and the resulting integer dimensions. Averaging further contributes to this similarity. In terms of the reducers, PCA and ANOVA achieve larger execution time reductions than the probabilistic reducers GS and SP. Additionally, we see that the difference in the time savings of PCA and ANOVA is low.

Figure 6. Time savings over the baselines with different feature reduction scales and different feature reducer algorithms, averaged over all seven ML algorithms. We see that constant-scale reduction obtains the lowest time savings as expected. Beyond it, at all scales and with different reducers, the time savings are similar and close. The main reason is that, given the original feature sizes of the datasets, quadratic-, cubic-, exponential- and infinite-scale reductions result in similar (integer) feature sizes.

Figure 7. Time savings over the baselines for each dataset with quadratic-scale feature reduction, averaged over all seven ML algorithms. We see that PCA and/or ANOVA reduce execution times more than GS and SP in all datasets.
Figure 7 shows the performance gains of the reducers for each dataset at quadratic-scale feature reduction, averaged over all seven ML algorithms. We see that for all datasets, either PCA or ANOVA or both obtain higher execution time savings than GS and SP. We also see that, while not always, the higher the number of features, the higher the time savings. For instance, the Kidney and Bio datasets have the highest numbers of features, and therefore their time savings are the highest compared to the other datasets. On the other hand, the Eeg dataset has the lowest number of features and, as a result, the lowest time savings. However, this trend does not always hold. For instance, for the Cov and Cancer datasets, the savings are relatively high considering their feature sizes in comparison to other datasets such as Madelon.
Figure 8 shows the original sizes of the feature dimensions of the datasets as well as the reduced sizes under the different reduction scales in a logarithmic-scale plot. We see that, other than for the Kidney, Bio, and Madelon datasets, the numbers (sizes) of the features are around 100 or less, which, in turn, causes the reduced sizes to be close to each other. Figures 9-18 show the model size savings with respect to the different ML models and feature reducers with quadratic-scale reduction. ML model size reductions are calculated as the model sizes for the original feature spaces divided by the reduced-space model sizes. For the non-parametric ML models DT, RF and Ada, there are no model size savings, as the feature reduction does not have an effect. That is, there are no (feature based) parameters in these models for which feature space reduction would decrease their size. Furthermore, we use the same hyper-parameter values for the original- and reduced-space models, and as a result, the model sizes do not differ much. We note that both the RF and Ada models we evaluated are based on DTs. For the ML models KN, SVM, NN, and NB, the model size reductions are significant and increase as the size of the input dataset and the number of model parameters increase. They are up to over 100× for KN, up to about 135× for SVM, up to 100× for NN, and up to about 90× for NB. The variation among the reductions is very limited across the ML models and the feature reducers. Some variation occurs in the KN and SVM models. These variations are due to the constraints of KN and SVMs. For instance, for KN, in some datasets, the target (reduced) feature size is too low and not admissible for the KN algorithm. In those cases, we had to adjust the target sizes to be able to run the algorithm. We note that even though KN is non-parametric, feature reduction still reduces the model sizes. KN is an instance-based learning algorithm which memorizes the dataset; therefore, the smaller the dataset, the smaller the model size. The variations in SVM are due to the random selection of the features during model fitting: we use the default value of the dual hyper-parameter, which is True. This intrinsic, algorithmic random feature selection process is different and separate from the feature reduction that we perform before training [31]; it is part of the SVM fitting algorithm [31].
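As a hedged sketch, one simple way to compare model sizes is to serialize the fitted models and compare byte counts; this is an illustrative proxy, not necessarily the exact measurement procedure used in our experiments, and it reuses the k and the standardized splits from the earlier snippets.

```python
import pickle
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import SelectKBest, f_classif

def model_size_bytes(model) -> int:
    return len(pickle.dumps(model))   # serialized size as a proxy for memory cost

clf_full = KNeighborsClassifier().fit(X_train, y_train)          # baseline model

sel = SelectKBest(f_classif, k=k).fit(X_train, y_train)          # quadratic-scale ANOVA reduction
clf_reduced = KNeighborsClassifier().fit(sel.transform(X_train), y_train)

ratio = model_size_bytes(clf_full) / model_size_bytes(clf_reduced)
print(f"model size reduction: {ratio:.1f}x")
```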
Figures 19-28 show the detailed heat maps of the percentage differences (improvements or losses) in accuracy, recall and precision across the ML models and the feature reducers for the datasets with quadratic-scale space reduction. In these heat maps, a negative value means that the classification performance is improved with feature reduction, while a positive value means the classification performance is worse than in the original space. We clearly see that, overall, PCA and ANOVA outperform GS and SP in terms of all three metrics. The classification performances of GS and SP are mostly poor and unpredictable, and therefore the usage of GS and SP should be avoided. Figures 19-28 also show that the impact of feature space reduction differs across datasets. In addition, we see that NB exhibits markedly different performance in some datasets such as Cov, Eeg, and KDD. Lastly, the accuracy and recall scores are exactly the same in the Cov and KDD datasets because they are multi-class and we used the same average='weighted' value for the average parameter of the Sklearn Recall function [32].
Figures 29-33 show the decision boundaries for the ML algorithms for the baseline, that is, with no feature reduction performed, as well as for the four types of reductions. We set the target dimension to 2 to make the visualisations straightforward. For the figures, we use the Cancer dataset. For the baseline models, we simply took the first two features from the dataset and evaluated the models on them. In the baseline, in figure 29, the original input is not transformed. Per the baseline decision boundaries, KN respects neighbourhoods, and SVM shows a radial boundary because of its radial kernel. DT and RF divide the input space into rectangles; however, since RF deploys multiple trees, it produces overlapping rectangles. Ada's decision boundary is similar to RF's as it also deploys several DTs. NN has a linear boundary as its default configuration is a linear classifier. Finally, NB assumes a Gaussian distribution for the conditional probabilities; as a result, its decision boundary is Gaussian.
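A minimal sketch of producing such a two-feature decision-boundary plot with Scikit-Learn's DecisionBoundaryDisplay (available since version 1.1); the classifier and plotting details here are our illustrative choices, not the exact figure-generation code.

```python
import matplotlib.pyplot as plt
from sklearn.inspection import DecisionBoundaryDisplay
from sklearn.svm import SVC

X2 = X[:, :2]                          # first two Cancer features, as in the baseline
clf2 = SVC(kernel="rbf").fit(X2, y)    # radial kernel, hence a radial boundary

disp = DecisionBoundaryDisplay.from_estimator(
    clf2, X2, response_method="predict", alpha=0.5)
disp.ax_.scatter(X2[:, 0], X2[:, 1], c=y, edgecolor="k", s=10)
plt.show()
```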
In figure 30, we see the decision boundaries for the models with PCA. PCA transforms the input data into a new space according to the orthogonal directions with the highest variance. As a result, we see that the transformed data is dispersed. Other than the dispersion, the boundaries reflect the characteristics of the ML algorithms as discussed above.
In figure 31, we see the boundaries for Gaussian projections. The input data is transformed by using a Gaussian distribution. However, this catastrophically removes the separation among the instances belonging to different classes. Consequently, the classification performance is poor.
Figure 32 shows the boundaries for the reductions with Sparse projections. As expected, the transformed input data is sparsified. Sparse projections degrade class separation, which results in degraded performance. However, the degradation is not as severe as with Gaussian projections.
Figure 33 shows the boundaries for ANOVA. Since ANOVA is a feature selection method, it does not transform the input data. As a result, any existing class separation in a dataset is preserved. This helps ANOVA to successfully classify the data instances, similar to the baseline models. As a side note, ANOVA selects the features 'mean concave points' and 'worst concave points' of the Cancer dataset in our evaluation.
To better understand how the ANOVA reductions work, we analyze box-and-whisker plots. Figures 34-39 are the box-and-whisker plots for the ANOVA reductions with the Cancer dataset. Figures 34 and 35 show the best performing three features in terms of classification accuracy when an ML algorithm is trained on the reduced feature space, and figures 37-39 show the worst performing three features. In statistics, ANOVA is used for testing whether the means of two populations are the same. When used as a supervised feature selection method, the populations are taken to be the data instances of the same class. In our case, since we perform binary classification, we have two populations for each dataset. As a result, a feature whose means differ between the classes can be used to distinguish the instances belonging to different classes (figures 34 and 35). Conversely, if the class means are close to each other with respect to a feature, then the feature is most likely not informative for successful classification (figures 37-39).
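A hedged sketch of ranking the Cancer features by their ANOVA F-statistic, which is how the best- and worst-scoring features behind figures 34-39 can be identified; the plotting of the box-and-whisker diagrams themselves is omitted.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import f_classif

data = load_breast_cancer()
F, p = f_classif(data.data, data.target)    # one F-score and p-value per feature

order = np.argsort(F)
print("highest F-scores:", [data.feature_names[i] for i in order[-3:][::-1]])
print("lowest  F-scores:", [data.feature_names[i] for i in order[:3]])
```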
Finally, figure 42 visualizes the two components of randomized PCA for the Cancer dataset, where the target dimension is set to 2. We see how PCA transforms the data, which has 30 dimensions, into 2 dimensions by finding the directions along which the variance is the highest. These components correspond to the eigenvectors of the covariance matrix.
We note that we focused on an in-depth presentation for the quadratic scale. As for the constant, cubic, exponential, and infinite scales, we did not include the corresponding results due to space limitations. Nevertheless, the discussion of the quadratic-scale results constitutes a representative analysis of all scales. To gauge the impact of the imbalance in our datasets, we evaluate the balanced accuracy of the ML algorithms in addition to traditional accuracy. We focus on KDD, Kidney and Sylva because they are the imbalanced datasets. Table 6 shows the average balanced accuracy scores for KDD, Kidney and Sylva for the ML algorithms. The scores include the base results with no feature reduction, and the ANOVA and PCA based reductions. We see that the effect of the imbalance is limited, at less than 7%.
ANOVA assumes data normality and homogeneous variances among data features. As seen from our results, it performs very well; however, its performance does not imply the normality of the data. To verify this, we test the normality of the features in our datasets. Here, we include a representative case. Figure 40 shows the histogram of the first feature of Sylva selected by ANOVA with quadratic-scale reduction. We see that the feature is not normally distributed, although it is not wildly far from a normal distribution either. In figure 41, we see the QQ plot, where the red line corresponds to a theoretical normal distribution and the blue curve represents the feature. This is in agreement with the histogram. We conclude that even though ANOVA is successful in feature reduction, it does not require data normality. We run Shapiro-Wilk and Kolmogorov-Smirnov tests on the data features and find that the p-values are always less than 0.05 (the standard value used in the literature). This implies that we reject the null hypothesis which states that the data is normally distributed. As a side note, p-value based tests are typically cautioned against for sample sizes bigger than 5000 because the p-value becomes unreliable. That is why we augment our analysis with visualizations based on histograms and QQ-plots.
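A minimal sketch of these normality checks using the SciPy tests and a Statsmodels QQ-plot; the feature vector x below is an illustrative stand-in for the first ANOVA-selected feature of Sylva.

```python
import matplotlib.pyplot as plt
from scipy import stats
import statsmodels.api as sm

x = X_train[:, 0]    # stand-in for the selected feature

w_stat, w_p = stats.shapiro(x)
# compare against a normal distribution with the sample's own mean and std
k_stat, k_p = stats.kstest(x, "norm", args=(x.mean(), x.std()))
print(f"Shapiro-Wilk p = {w_p:.3g}, Kolmogorov-Smirnov p = {k_p:.3g}")

sm.qqplot(x, line="s")   # reference line for a fitted normal distribution
plt.show()
```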

Discussion: related work
The studies [13,33,34] provide detailed surveys of feature and dimensionality reduction methods. Solorio-Fernández et al [14] focus on unsupervised feature selection approaches.
There are a number of studies that are close to ours. Hua et al [35] compare the performances of several filter and SFS feature selection methods in terms of classification error rates on real-world medical and synthetic datasets. The four reduction methods we evaluate are not included in their study. They report that no single feature selection method performs best in all cases, which is corroborated by [36]. In addition, they report that univariate filter methods, such as the t-test, have better or similar performance compared to wrapper methods. This supports our conclusion that univariate selection methods offer the best trade-off considering computational costs and classification performance.
Avramov and Dong [37] evaluate the impact of feature reduction based on exact PCA, feature correlation, t-test significance, and random feature selection on the accuracy performance of logistic regression, KN, DTs, and linear and cubic SVMs on the same Cancer dataset that we evaluate. They perform random feature selection via a large number of brute-force trials to select a random subset of features. Random feature selection is different from the random projections we perform. Random projections randomly sample a projection matrix based on assumptions about the matrix, such as Gaussianity or sparsity, and generate the final feature matrix as the product of the projection matrix and the original feature matrix. That is, random projections do not perform feature selection but rather feature transformation. A significant drawback of their work is that it evaluates only the Cancer dataset, which is a trivial dataset to classify: many of the accuracy, recall and precision results pertaining to this dataset that we collect are above 97%. As a result, the conclusions drawn on only Cancer cannot be reliably generalized. Moreover, the authors only study the accuracy performances; no other classification metrics, time or memory costs are studied, which limits their study's scope and the generalizability of its conclusions.
Kondo et al [38] focus on the impact of a variety of feature reduction techniques on defect prediction. While they evaluate a rich set of techniques, from feature agglomeration to component analysis to random projections to neural network based methods, they focus exclusively on classification performance based on the Area Under the receiver operating characteristic Curve. They do not consider the computational aspects and costs, which are a core focus of our study. As a side note, while the authors distinguish between feature selection and reduction, and treat them as separate techniques, we view feature selection as a strict subset of feature reduction.
Effrosynidis and Arampatzis [15] explore a large number of feature selection methods with RF and LightGBM [39] classifiers on eight fish-species datasets. Their feature selection methods include filter methods, such as ANOVA, chi-square, and Fisher score; wrapper methods, such as recursive feature elimination and SHAP [40]; and ensembles of filter and wrapper methods with various voting schemes. While they report that filter methods show relatively weak performance compared to ensemble and wrapper methods, the difference between the performances is low and arguably negligible. The best overall method is reported to be ensemble reciprocal ranking, whose mean F1 score is 0.83, and the second best is SHAP with a mean score of 0.82. It is clear that ensemble methods are very costly as they require each individual method to be computed beforehand. SHAP is reported to have an average execution time of 1.35 s, while ANOVA is reported to take 0.2 s, which is 6.75× faster. ANOVA's F1 score is reported as 0.76 (the F1 score with the original non-reduced dataset is 0.78). A 0.06 F1 difference is most often negligible in real-world settings unless the ML application is security- or safety-critical. In fact, as reported, the average F1 scores of the individual methods (not ensembles) lie between 0.69 and 0.83, while their execution times lie between 0.025 s (variance-threshold) and 7 h (permutation importance). Considering that the largest difference in F1 score is negligible, we posit that ANOVA-like methods are well-suited for real-world settings.
Studies such as [41,42] focus on feature selection approaches based on filter methods, whereas our study considers all types of feature reduction algorithms (selection, transformation, embedded, and agglomeration based) as long as they have low computational complexities.

Conclusions
In this study, we investigate the impact of four carefully selected lowest-cost feature reduction algorithms, namely ANOVA, PCA, GS, and SP, on the classification performance of different ML algorithms. Specifically, we explore the impact of the scale of feature reduction on the time and memory savings, and the accuracy, precision and recall performances, of seven classifiers with ten datasets. The main takeaways of our study are:
• The classification algorithms achieve the best overall trade-off in terms of computational savings and classification performance at quadratic-scale feature reduction. Moreover, at quadratic-scale reduction, the classifiers improve their classification performance over the baseline where no feature reduction is performed.
• ANOVA and PCA outperform GS and SP in terms of classification performance at all scales with all datasets.
• Generally speaking, the higher the number of features, the higher the time and memory savings.
• Memory savings tightly depend on the classification algorithm. The algorithms KN, SVM, NN, and NB enable significant model size reductions/savings, while the algorithms DT, RF, and Ada are not affected by feature reduction; therefore, these classifiers do not have model size savings, as the model sizes (mostly) stay the same with or without feature reduction.
• Our work can be used as a guide for developing ML applications under computational and resource constraints.
Our work differs from typical landscape analyses in that we do not blindly search the complete search space or any large subset of it. We focus on the extreme case where we aim to reduce computation and memory costs as much as possible while assessing the sacrifice in terms of classification performance across different types of classifiers. We note that even though our study focused on classification, our analysis can similarly be performed for regression. We leave the evaluation of regression as future work.

Figures 1 and 2
depict the baseline and feature reduction based training and testing steps. In both cases, we obtain execution times and model sizes from the training step and the classification performances from the testing step. Compared to the baseline, feature reduction based learning has the additional stage of reducing the number of features via a selection or transformation method.

Figure 1 .
Figure 1. The baseline training and testing.

Figure 2 .
Figure 2. Feature reduction based training and testing.

Figure 3 .
Figure 3. Feature reduction scales as the number of features increases. A constant-scale feature reduction means that the resulting (target) reduced number of the features is the original number of the features divided by a constant, which is 10 in this illustration. A quadratic- and cubic-scale feature reduction means that the resulting reduced number of the features is the square and cubic root of the original number of the features, respectively. An exponential-scale feature reduction means the resulting reduced number of the features is the natural logarithm of the original number of the features. An infinite-scale feature reduction corresponds to the resulting reduced number of the features being set to a fixed small number.

Figure 5 .
Figure 5. Classification performances (accuracy, precision and recall) with respect to the baseline across the five feature reduction scales and seven ML algorithms. The baseline of an ML classification algorithm is established when it is run on a full dataset without any feature reduction as a pre-processing step (see figures 1 and 2). The quadratic-scale feature reduction offers the best trade-off considering accuracy, precision and recall improvements. A negative difference in a classification metric shows an improvement over the baseline.

Figure 8 .
Figure 8. Datasets' original and reduced dimension/feature sizes (logarithmic plot). We see that except for the constant-scale reduction, the resulting reduced sizes are relatively close to each other.

Figure 9 .
Figure 9. Model size reductions for Bio.

Figure 10 .
Figure 10. Model size reductions for Cancer.

Figure 13 .
Figure 13. Model size reductions for Hill.

Figure 15 .
Figure 15. Model size reductions for Kidney.

Figure 17 .
Figure 17. Model size reductions for Random.

Figure 29 .
Figure 29. Base Model: Decision boundaries for the original space for Cancer.

Figure 30 .
Figure 30. PCA: Decision boundaries for the reduced input spaces for Cancer.

Figure 31 .
Figure 31. Gaussian projections: Decision boundaries for the reduced input spaces for Cancer.

Figure 32 .
Figure 32. Sparse projections: Decision boundaries for the reduced input spaces for Cancer.

Figure 33 .
Figure 33. ANOVA: Decision boundaries for the reduced input spaces for Cancer.

Figure 40 .
Figure 40. Histogram of an ANOVA-selected feature of Sylva.

Figure 41 .
Figure 41. The QQ-plot of an ANOVA-selected feature of Sylva.

Figure 42 .
Figure 42. PCA components and their directions for Cancer.

Table 1 .
Time complexities of Scikit-Learn algorithms for feature space reduction. N is the size of a dataset, D is the number of features, and k is the target dimension of the reduced feature space (the reduced number of features).

Table 2 .
Datasets. N is the size of a dataset, D is the number of features.

Table 3 .
Machine learning models (algorithms). D is the size (number) of the features, N is the size of a dataset, M is the number of trees, and K is the number of neighbors. For neural networks, i, ne, h, and o are the numbers of iterations, neurons, hidden layers, and outputs, respectively. If not stated, hyper-parameter values are the default values.

Table 4 .
Feature reduction algorithms. N is the size of a dataset, D is the number of features, and k is the target dimension of the reduced feature space.

Table 6 .
The average balanced accuracy scores for KDD, Kidney and Sylva.