An approach to determining the search range boundaries of optimal parameter values for the random forest algorithm

The problem of determining the search ranges for the optimal values of the main parameters of the random forest (RF) algorithm has been considered, with the goal of reducing the time spent on developing an RF classifier. The aim of the work is to obtain formulas for determining the search ranges for the values of RF classifier parameters. The formulas are derived from the results of experimental research on the development of RF classifiers using various datasets from machine learning data repositories. The results of experimental research on the development of RF classifiers using training and test sets formed from the analyzed datasets are presented. General-form formulas for the graphical dependences of the classification quality on the test set and of the development time have been obtained. Recommendations on the application of the proposed formulas in the development of RF classifiers are given.


Introduction
A large number of machine learning algorithms that are successfully used to solve various data classification problems are known nowadays [1-7]. However, these algorithms require "fine" tuning of their parameter values to ensure high-quality classification solutions.
Obviously, enumerating the values of the parameters of a classification algorithm over a grid, or searching for them using modern evolutionary optimization algorithms, can entail significant time costs, while using the "default" parameter values does not guarantee that the resulting classifier will achieve high values of the classification quality indicators. Therefore, algorithms for finding the optimal values of the classifier parameters are necessary. At the same time, it is advisable to set a search range for each parameter. It is thus necessary to develop formulas for determining the search ranges for the values of the classification algorithm parameters, since this will minimize the time spent on developing the desired classifier.
One of the machine learning algorithms currently in active use is the random forest (RF) algorithm [1,2]: it can be used both for developing regression models and for developing data classifiers.
It should be noted that a random forest can be used as an independent classifier or as part of a hybrid classifier. In the latter case, the RF algorithm can be applied, for example, as part of the SVM-RF classifier [3] to refine the class membership of objects near the class boundary found by the SVM classifier [5,6].
When working with the random forest algorithm, it is necessary to search for the optimal values of its parameters. In the simplest case, this can be done by grid enumeration, which entails significant time costs. Alternatively, one can apply an evolutionary optimization algorithm, for example, the genetic algorithm [8] or the particle swarm algorithm [9], encoding the optimized parameters in the solution. But even in this case the time costs will be large. Obviously, it is worth trying to reasonably narrow the range in which the optimal values of the RF algorithm parameters are sought.
In the context of developing an RF classifier, it is proposed to carry out experimental studies whose results can be used to form recommendations for determining the boundaries of the search ranges for the optimal values of the main parameters of the algorithm.
In particular, to determine the boundaries of the search ranges, an experiment was performed in which, for 40 datasets, we studied the dependences of the classification quality and the development time of the random forest on the number of trees, the number of features considered when splitting a node, and the depth of the tree.

Decision tree
A decision tree is based on logical diagrams that make it possible to obtain a decision on the classification of an object from the answers to a hierarchically organized system of questions. The tree is composed of "leaves" and "branches". The edges ("branches") of the decision tree contain the attributes on which the objective function depends, the "leaves" contain the values of the target label, and the remaining nodes contain the questions asked at the current hierarchical level, which depend on the answers received in the previous nodes [1,2,4].
The classifier must be trained before it can classify data. Training is performed on a training set randomly extracted from the analyzed dataset and includes the search for optimal threshold values of parameters or binary partitions of features, based, for example, on the requirement to maximize the reduction of the entropy index of heterogeneity across the resulting subsets.
The construction of the tree begins from the root, where, taking into account the values of the inhomogeneity index for each feature of the objects, the optimal feature is found, with the help of which the set is divided into subsets. Each of the subsets is then divided into further subsets based on its own optimal feature. The splitting continues until the current subset meets a stopping criterion and is declared a leaf with a class label. The criterion of achieving complete homogeneity (all objects belonging to one class) can be used as a stopping criterion. The tree can always be built if the training set does not contain objects that have the same value of every feature but belong to different classes.
The following can be used as a criterion for splitting a tree node when solving a classification problem: the entropy index of heterogeneity H = −Σ_{l=1..L} (N_l / N) · log2(N_l / N), or the Gini index G = 1 − Σ_{l=1..L} (N_l / N)^2, where L is the number of classes, N is the number of objects in the subset, and N_l is the number of objects of class l in the subset. When choosing a split, the reduction of the entropy index (the information gain) should be maximized, and the Gini index should be minimized.
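These impurity measures can be sketched in a few lines of Python; the sketch assumes that a subset is summarized by its per-class object counts N_l:

```python
import math

def entropy_index(counts):
    """Entropy index of heterogeneity of a subset, given per-class counts N_l."""
    n = sum(counts)
    s = sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    return -s if s != 0 else 0.0

def gini_index(counts):
    """Gini index of a subset, given per-class counts N_l."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

# A pure subset has zero impurity; a balanced two-class subset is maximal.
print(entropy_index([10, 0]))   # 0.0
print(entropy_index([5, 5]))    # 1.0
print(gini_index([5, 5]))       # 0.5
```

Both measures are zero for a homogeneous subset and largest when the classes are evenly mixed, which is why a split is chosen to reduce them as much as possible.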

Random Forest
The random forest is an ensemble decision tree algorithm proposed by Leo Breiman and Adele Cutler. It is one of the most popular and efficient algorithms for solving classification and regression problems. The random forest is based on bagging, a method of constructing compositions of classifiers in which the classifiers are trained independently of each other [1,3].
Each tree is constructed from a subset obtained from the original training set using the bootstrap (sampling with replacement). When constructing each tree, only a fixed number of randomly selected features of the training set is considered at each node-splitting stage, and a complete tree is built, that is, each leaf of the tree contains observations of only one class. In classification problems based on a random forest, the decision is made by a majority vote.
The random forest algorithm can be represented by the following sequence of steps.
1. Formation of a bootstrap subset from the initial training set for each tree.
2. Construction, on the obtained subset, of a non-truncated decision tree with a minimum number of observations at the nodes, using the following steps:
- randomly select a certain number of features from the initial set of features (it is recommended to take the square root of the total number of features);
- from the selected features, choose the feature that provides the best split;
- divide the sample corresponding to the processed node into two subsets.
3. Formation of an ensemble of decision trees. Classification of test data is performed by voting: the class of an object is the class chosen by the greatest number of trees.
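The steps above can be sketched in pure Python. For brevity the sketch grows depth-1 trees (stumps) instead of the complete trees the algorithm prescribes; the toy dataset and all function names are ours, for illustration only:

```python
import random
from collections import Counter

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_stump(data, feat_idx):
    """Among the given feature indices, find the (feature, threshold) split with
    the lowest weighted Gini index; returns the majority class of each side."""
    best = None
    for f in feat_idx:
        for t in sorted({x[f] for x, _ in data}):
            left = [y for x, y in data if x[f] <= t]
            right = [y for x, y in data if x[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(data)
            if best is None or score < best[0]:
                best = (score, f, t,
                        Counter(left).most_common(1)[0][0],
                        Counter(right).most_common(1)[0][0])
    return best

def train_forest(data, n_trees=25, seed=0):
    """Steps 1-3: bootstrap a subset per tree, restrict each tree to a random
    subset of sqrt(n_features) features, and collect the ensemble."""
    rng = random.Random(seed)
    n_feat = len(data[0][0])
    m = max(1, int(n_feat ** 0.5))               # sqrt of the number of features
    forest = []
    for _ in range(n_trees):
        boot = [rng.choice(data) for _ in data]  # bootstrap subset
        feats = rng.sample(range(n_feat), m)     # random feature subset
        stump = best_stump(boot, feats)
        if stump is not None:
            forest.append(stump)
    return forest

def predict(forest, x):
    """Majority vote over the ensemble."""
    votes = [(left if x[f] <= t else right) for _, f, t, left, right in forest]
    return Counter(votes).most_common(1)[0][0]

# Toy two-class data, separable on either feature
data = [((0.1, 0.15), 0), ((0.2, 0.25), 0), ((0.3, 0.20), 0),
        ((0.7, 0.80), 1), ((0.8, 0.75), 1), ((0.9, 0.85), 1)]
forest = train_forest(data)
print(predict(forest, (0.2, 0.2)), predict(forest, (0.85, 0.8)))
```

Each base learner sees a different bootstrap sample and a different random feature subset, which is what makes the independently trained trees diverse enough for the majority vote to help.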
The main parameters of an RF classifier are the following [1,3]:
- the number of trees;
- the number of features used to find the split in a tree node;
- the maximum depth of the trees.
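In the scikit-learn implementation used later in the paper, these three parameters map onto the arguments of RandomForestClassifier; the dataset and the concrete parameter values below are illustrative assumptions, not the paper's experimental settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real dataset with 20 features
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

clf = RandomForestClassifier(
    n_estimators=100,  # number of trees
    max_features=5,    # number of features tried at each node split
    max_depth=10,      # maximum depth of trees
    random_state=0,
)
clf.fit(X_tr, y_tr)
print(clf.score(X_te, y_te))  # Accuracy on the test set
```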

Description of experiments
In order to determine how strongly the parameter values of an RF classifier affect the classification quality, it is necessary to train many classifiers with different parameter values on different datasets. During the experiments, 40 different datasets from machine learning data repositories were considered [10,11]. Each dataset was randomly divided into training and test sets in a ratio of 80:20.
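The 80:20 partition can be sketched in plain Python (a minimal sketch; the paper does not specify its splitting tooling beyond Python 3):

```python
import random

def split_80_20(objects, seed=0):
    """Randomly split a dataset into training and test sets in a ratio of 80:20."""
    idx = list(range(len(objects)))
    random.Random(seed).shuffle(idx)   # random, reproducible permutation
    cut = int(0.8 * len(objects))
    train = [objects[i] for i in idx[:cut]]
    test = [objects[i] for i in idx[cut:]]
    return train, test

train, test = split_80_20(list(range(1000)))
print(len(train), len(test))  # 800 200
```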
Since the quality of the developed classifiers was assessed not only by the Accuracy indicator, which estimates the proportion of objects for which the classifier made the right decision, but also by the development time, it was considered expedient to develop RF classifiers on different computers. In the course of the experiments, computers with the following characteristics were used:
- PC 1 (processor: Intel(R) Core(TM) i3-2100 CPU 3.10GHz, 2 cores; RAM: 8 GB; 64-bit operating system);
- PC 2 (processor: Intel(R) Core(TM) i5-3337U CPU 1.80GHz, 2 cores; RAM: 6 GB; 64-bit operating system);
- PC 3 (processor: Intel(R) Core(TM) i5-3330 CPU 3.00GHz (3.20GHz), 2 cores; RAM: 4 GB; 64-bit operating system).
RF classifiers were trained in Python 3, which has an implementation of the classification algorithm based on a forest of random trees that makes it possible to tune the above parameters of the RF classifier [1,3].
The following ranges were defined for the values of the RF classifier parameters: a range for the number of trees r; a range for the number of features Features, where Features_max is the total number of features; and a range for the maximum tree depth depthTree.

Experimental results
As a result of the experiments, graphs of the dependences of the classifier quality and development time on the various RF classifier parameters were obtained. Figures 1-6 show the graphs obtained for the Bank-additional dataset [12]. This dataset can be used to train a binary classifier that determines whether a client will subscribe to a term deposit. Each object of this dataset is described by 20 features. The total number of objects is 41188 (the training set thus contains 32951 objects and the test set contains 8237 objects).

Number of features when dividing a node
The dependence of the classification quality on the number of features when dividing a node has the shape of a parabola (figure 1).
The dependence of the development time on the number of features has a linear form (figure 2). Therefore, the classifier development time should not be relied upon when identifying formulas for the search range boundaries. To derive the calculation formulas, the probability of choosing the best separating feature in a node was used (table 1). Based on the data in table 1, a formula for calculating the lower and upper boundaries was obtained, in which P is the probability of choosing the best feature.
We suggest using P = 0.3 for the lower boundary and P = 0.8 for the upper boundary.
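If P is read as the probability that the best separating feature falls into the randomly selected subset, i.e. P ≈ Features / Features_max, the boundaries can be computed as in the following sketch (this reading, and the rounding up to a whole number of features, are our assumptions):

```python
import math

def feature_boundaries(features_max, p_low=0.3, p_high=0.8):
    """Search range for the number of features per split, assuming
    P = Features / Features_max is the probability that the best
    separating feature is among the randomly selected ones."""
    low = max(1, math.ceil(p_low * features_max))
    high = max(low, math.ceil(p_high * features_max))
    return low, high

# 20 features, as in the Bank-additional dataset
print(feature_boundaries(20))  # (6, 16)
```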

Depth of tree
The dependence of the classification quality on the tree depth has a logarithmic form (figure 3). In the general case, the classification quality becomes close to its maximum at the tree depth at which the set is most quickly divided into classes; this observation was used to obtain the formula for the lower boundary. To obtain the formula for the upper boundary, the dependence of the development time on the tree depth was used (figure 4). Figure 4 shows that a moment comes when the development time stops increasing. This means that tree branching has stopped, since the minimum-number-of-objects condition has been reached, that is, there is one object in each leaf, and there is no sense in developing a model with a greater tree depth.
Thus, the formula for calculating the lower boundary is depthTree_low = ⌈log2(countClass)⌉, where countClass is the number of classes in the dataset, and the formula for calculating the upper boundary is depthTree_up = ⌈log2(n_obj)⌉, where n_obj is the number of objects in the training set.
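A sketch of these boundaries under a log2 reading of the argument above: a balanced binary tree needs about log2(countClass) levels to separate the classes and about log2(n_obj) levels to reach one object per leaf. The exact form, including the ceiling rounding, is our assumption:

```python
import math

def depth_boundaries(count_class, n_obj):
    """Search range for the maximum tree depth: a balanced binary tree
    separates count_class classes at depth ceil(log2(count_class)) and
    reaches one object per leaf at depth ceil(log2(n_obj))."""
    low = max(1, math.ceil(math.log2(count_class)))
    high = max(low, math.ceil(math.log2(n_obj)))
    return low, high

# Bank-additional: 2 classes, 32951 training objects
print(depth_boundaries(2, 32951))  # (1, 16)
```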

The number of trees
The dependence of the classification quality on the number of trees has a logarithmic form (figure 5). The dependence of the development time on the number of trees has a linear form (figure 6). Therefore, the classifier development time should not be relied upon when identifying formulas for the search range boundaries.
To determine the formulas, it was assumed that the range boundaries depend on the number of objects in the dataset [8]; that is, the larger the dataset, the greater the values of the lower and upper range boundaries. To verify this assumption, values of the lower and upper boundaries were determined subjectively for each dataset.
The lower boundary is taken as the number of trees at which the classification quality indicator Accuracy stops increasing sharply.
The upper boundary is taken as the number of trees at which Accuracy stops changing and fluctuates around its current maximum value.
The obtained values of the lower and upper boundaries for the number of trees were plotted on a graph, with the datasets sorted in advance by increasing number of objects.
To derive the formulas, the lower envelope for the lower boundary and the upper envelope for the upper boundary were built. As a result, it was possible to place the green and yellow approximation lines of the boundaries so that all previously identified values of the range boundaries lie between them (figure 7).
The least squares method was used to build the approximation lines. This made it possible not only to determine the most accurate dependence of the search range boundary values on the number of objects in the dataset, but also to derive the functions of the obtained curves.
For a more complete coverage of the boundary values, it was decided to widen the region between the approximation curves. As a result, the following formulas for calculating the boundaries were obtained.
To calculate the value of the number of trees at the lower boundary and at the upper boundary, it is proposed to use the obtained formulas, in which n_obj is the number of objects in the dataset.
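The least-squares fitting step behind these formulas can be illustrated as follows. Since the quality curves are logarithmic, the sketch fits y = a·ln(n) + b by ordinary linear regression after substituting x = ln(n); the boundary points below are hypothetical stand-ins for the subjectively chosen values, not the authors' data:

```python
import math

def fit_log_curve(points):
    """Least-squares fit of y = a*ln(n) + b to (n, y) pairs, i.e. simple
    linear regression on the transformed variable x = ln(n)."""
    xs = [math.log(n) for n, _ in points]
    ys = [y for _, y in points]
    m = len(points)
    mx, my = sum(xs) / m, sum(ys) / m
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

# Hypothetical lower-boundary points: (number of objects, number of trees)
lower_points = [(150, 10), (1000, 20), (5000, 30), (41188, 40)]
a, b = fit_log_curve(lower_points)
print(round(a, 2), round(b, 2))
```

The fitted slope a is positive, matching the assumption that larger datasets push both boundaries upward.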

Comparative analysis results
To assess the time effectiveness of the obtained formulas, experiments were carried out on PC 3. The optimal parameter values were selected on the Bank-additional dataset using grid search with a step of 1 for each parameter, with the computation time measured (table 2). In the first experiment, the range boundaries for each parameter were obtained using the proposed formulas. In the second experiment, the range boundaries were chosen so that the ranges contain the optimal parameter values with very high probability. As can be seen from table 2, the use of the proposed formulas significantly reduced the time needed to search for the optimal values of the classifier parameters.
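The time saving reported in table 2 comes from shrinking the search grid. A simple count of step-1 grid points shows the effect; the concrete ranges below are illustrative assumptions, not the values from table 2:

```python
def grid_size(ranges):
    """Number of parameter combinations visited by a step-1 grid search."""
    total = 1
    for low, high in ranges:
        total *= (high - low + 1)
    return total

# Illustrative ranges for (number of trees, features per split, tree depth)
wide = [(1, 200), (1, 20), (1, 30)]    # "very high probability" ranges
narrow = [(20, 60), (6, 16), (1, 16)]  # ranges narrowed by formula-style boundaries
print(grid_size(wide), grid_size(narrow))
print(grid_size(wide) / grid_size(narrow))  # speed-up factor at equal cost per fit
```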

Conclusion
The proposed study was carried out in order to minimize the time spent developing an RF classifier when selecting the optimal values of such parameters as the number of trees, the number of features considered when splitting a node, and the tree depth. The obtained results allow us to conclude that it is advisable to use the proposed approach to determining the search range boundaries for the optimal values of RF classifier parameters in order to reduce the time spent on its development.
We came to the conclusion that the proposed approach to determining the boundaries of the RF classifier parameter ranges can be used both in the development of independent RF classifiers and in the development of hybrid classifiers, for example, the SVM-RF classifier [3], in which the RF classifier refines the class membership of objects near the class boundary found by the SVM classifier [5,6].
The goal of further research is to develop a similar approach to finding the search range boundaries for the optimal parameter values of the Isolation Forest algorithm [13], which is used, in particular, to develop classifiers on imbalanced datasets and to identify outliers in datasets.