Dynamic Prediction of Rock Mass Classification in the Tunnel Construction Process based on Random Forest Algorithm and TBM in situ Operation Parameters

Accurately predicting the rock mass classification is of great significance to ensure the safe and efficient construction of tunnel boring machines (TBMs). On the basis of the TBM in situ operation data recorded during tunnel construction, a prediction model for rock mass classification using the random forest (RF) algorithm is proposed. Through data preprocessing, 7538 TBM excavation cycles were obtained to form a data set. Each data sample included 195 operation parameters and corresponding information on mileage and rock mass classification. Furthermore, 6784 samples were randomly selected as the training set and the remaining 754 samples as the test set. According to the changing characteristics of operation parameters, each TBM excavation cycle was divided into the empty-push phase, the rising phase, and the stable phase. On the basis of the mean decrease Gini index, seven machine parameters highly correlated with rock mass classification were selected. Then, the variation characteristics (i.e., mean value and linear fitting slope) of the seven operation parameters in the first 30 s of the rising phase were used as the input features of the RF model. Additionally, the hyperparameters in the RF model were analyzed. The quantitative results show that the prediction accuracy is up to 87.27%, indicating that the proposed model is effective for the prediction of rock mass classification.


Introduction
As tunnel engineering continues to develop, tunnel boring machines (TBMs) have played an increasingly important role in tunnel excavation [1]. TBMs are sensitive to changes in rock mass conditions. For tunnels constructed in complex geological environments, the uncertainty of the surrounding rock properties is more prominent [2]. Therefore, it is of great significance to propose an effective and accurate method for predicting the rock mass classification ahead of the tunnel face, which can help ensure the safety and efficiency of TBM construction. The first rock mass classification schemes had their origin in 1987 and were used to determine the support requirements of a tunnel [3]. On this basis, rock mass classification systems have developed from single-parameter schemes to multi-parameter schemes [4,5]. In tunnel engineering, the commonly used classification systems are the rock mass rating, the rock mass quality, and the geological strength index. Determining the rock mass classification with these methods requires field sampling before construction, laboratory experiments, or some  [6], all of which entail additional costs and equipment. Moreover, field sampling is discontinuous and relatively coarse, and it cannot predict the rock mass classification in front of the tunnel face in real time. Many studies have shown that the operating parameters of TBMs correlate strongly with the rock mass conditions [7]. The rock mass classification can therefore be inferred by analyzing how the TBM operation parameters change during excavation. In recent years, because machine learning is well suited to nonlinear problems, machine learning-based models have been widely used in rock engineering applications such as tunneling [8-10].
In this paper, a prediction model for rock mass classification using the random forest (RF) algorithm is proposed. On the basis of the 802-day data of the Songhua River water conveyance project in China, a data set having 7538 TBM excavation cycles with 195 operation parameters is formed. Each excavation cycle is divided into three working phases: the empty-push phase, the rising phase, and the stable phase. By calculating the mean decrease Gini index, seven operation parameters are selected as the high-correlation features with rock mass classification, namely, advance rate (v), penetration (Pr), cutterhead torque (T), total thrust (F), cutterhead power (p), cutterhead rotational speed (n), and pressure of gripper shoes (Pgs). Furthermore, the variation characteristics (i.e., mean value and linear fitting slope) of the seven features in the data of the first 30 s of the rising phase were used as the inputs of the RF model. Then, 6784 and 754 samples were randomly selected as the training set and the test set, respectively, to test the prediction performance of the RF model on rock mass classification. Additionally, the influence of the hyperparameters in the RF model is analyzed.

Decision tree
A decision tree is a commonly used machine learning algorithm that uses a simple and comprehensible structure to solve classification and regression problems [11]. A decision tree contains one root node and several branches (also called sub-trees). Before a decision tree model is used, the attribute variables and the target variable should be determined, and all data are input to the root node before splitting. Then, the root node is split into two or more branches on the basis of a specific split rule. Each branch is composed of decision nodes and terminal nodes. A terminal node corresponds to a decision result, and a decision node corresponds to an attribute test, which is split further into new terminal nodes (or a new terminal node and a new decision node) on the basis of the split rule. According to the criterion used to select the attribute variable at each decision node, decision tree algorithms can be divided into the ID3 algorithm, the C4.5 algorithm, and the classification and regression tree (CART) algorithm [12,13]. In this study, the CART algorithm is used to select the attribute at each decision node. In the CART algorithm, the Gini impurity is used to select the attribute, which reduces the problem of inflated information gain for attributes with too many distinct values. Given the data set $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, suppose that the number of categories is $m$. The Gini impurity is calculated as follows:

$$\mathrm{Gini}(D) = 1 - \sum_{k=1}^{m} p_k^2$$

where $p_k$ represents the probability that a sample point belongs to class $k$. The smaller the Gini impurity is, the lower the probability that a sample selected from the set will be misclassified, which means that the classification effect is better.
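The Gini criterion can be illustrated with a short sketch. It assumes the standard definition, one minus the sum of squared class probabilities, and the size-weighted form that CART uses to score a binary split. The function names and the example labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a list of class labels: 1 - sum_k p_k^2."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def split_gini(left, right):
    """Size-weighted Gini impurity of a binary split (lower is better)."""
    n = len(left) + len(right)
    return (len(left) / n) * gini_impurity(left) + (len(right) / n) * gini_impurity(right)


# Hypothetical rock-grade labels, for illustration only.
pure = ["III", "III", "III", "III"]
mixed = ["II", "III", "IV", "V"]

print(gini_impurity(pure))   # 0.0  (a single class cannot be misclassified)
print(gini_impurity(mixed))  # 0.75 (four equally likely classes)
```

A candidate split is preferred when `split_gini` for its two children is lower than for the alternatives.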

Random forest
A decision tree is a single classifier. On the training set, its classification performance is usually good. However, decision trees are prone to overfitting in some classification problems, resulting in poor classification performance on the full data set. To solve this problem, Breiman [14] proposed the RF algorithm in 2001, a supervised ensemble learning algorithm suitable for classification and regression problems. The idea of RF, which is essentially a multi-decision-tree model, is to combine multiple CARTs according to certain rules. Figure 1 shows the topological structure of the RF model. Compared with the decision tree, RF introduces the ideas of bagging and random subspace. According to the findings of many theoretical and case studies, RF models have good prediction accuracy and do not easily overfit [15]. As shown in Figure 1, bagging is carried out first; that is, bootstrap sampling is applied to the original data set to extract $l$ subsets, each of which serves as the training data set of one decision tree. The samples in the $l$ subsets are called in-bag data, and the unselected samples form a subset called out-of-bag (OOB) data [16]. While the RF model is being trained, the OOB data are used to estimate the internal error. The lower the OOB error rate, the better the performance of the RF model. When the RF model is used for a classification problem, each decision tree produces a classification result, and the final classification result is obtained by majority voting over all decision trees.
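The bagging and voting steps can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes each bootstrap subset has the same size as the original data set (a common convention that the text does not state), and the helper names are hypothetical.

```python
import random
from collections import Counter


def bootstrap_split(n_samples, rng):
    """Draw n_samples indices with replacement; unselected indices are out of bag."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    oob = sorted(set(range(n_samples)) - set(in_bag))
    return in_bag, oob


def majority_vote(votes):
    """Return the class label predicted by the largest number of trees."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
in_bag, oob = bootstrap_split(10, rng)
print("in-bag indices:", in_bag)
print("OOB indices:   ", oob)

# Hypothetical votes from five trees for one excavation cycle.
print(majority_vote(["III", "IV", "III", "II", "III"]))  # III
```

Each tree would be trained on its in-bag indices and evaluated on its OOB indices, which yields the internal error estimate without a separate validation set.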

Project description and data acquisition
The study area in this paper is the No. 4 bid section of the Songhua River water conveyance project in Jilin Province, China. The main body of the project is constructed by an open TBM with a diameter of 8.03 m. The total length of the tunnel is about 22,955 m, of which about 20,198 m is the TBM section and 2757 m is the drill-and-blast section. In this paper, we choose the TBM section as the research object. During tunnel construction, the 195 TBM operation parameters are collected once per second. In total, 802 days of data are collected, and about 86,400 records are stored per day as one storage unit. Figure 2 shows the variation of four TBM operation parameters in a day. It can be seen that there are a large number of no-operation sections in a day. Through geological exploration and tests at the engineering site, the rock mass classification and its corresponding mileage are obtained. The surrounding rock at the engineering site belongs mainly to four grades: grade III samples make up most of the samples, grade IV samples come second, and grade II and grade V samples are relatively few. The proportions of grades II-V rock mass are 8.13%, 66.73%, 20.03%, and 5.11%, respectively.

Figure 2. Time variation of four TBM operation parameters in a day.

Data preprocessing
A TBM takes the excavation cycle as a working unit. Figure 3 shows a complete TBM excavation cycle. Each TBM excavation cycle can be divided into three working phases: the empty-push phase, the rising phase, and the stable phase. In the empty-push phase, the TBM is started up, and the cutterhead rotational speed is brought to its set value. Then the advance rate increases gradually. When the TBM comes into contact with the tunnel face, an obvious vibration can be sensed in the main control room. The driver then immediately reduces the advance rate, producing a falling edge in the advance rate curve. In the rising phase, the TBM operation parameters increase sharply until they reach stable values. In the stable phase, the TBM advances with each parameter held at a relatively stable value. As shown in Figure 2, the collected data contain many zero values, which are of no use to a machine learning model. Additionally, the data of the rising phase reflect the interaction between the TBM and the surrounding rock. Therefore, we use the data of the first 30 s of the rising phase to predict the rock mass classification, specifically by calculating the mean value and linear fitting slope of the first 30 s of data points in the rising phase as the inputs of the RF model. First, we eliminate the useless zero-value data by constructing a state discriminant function [17]: where S is the state discriminant function and the other symbols are as described above. On the basis of the data preprocessing, 7538 TBM excavation cycles are obtained. Then, the data of the first 30 s of the rising phase are obtained by identifying the falling edge of the advance rate curve. The starting point $t_c$ of the falling edge is a local maximum, so it can be found with the findpeaks function in MATLAB.
Furthermore, because the duration of the falling edge is short, the ending point $t_s$ of the falling edge can be determined by finding the minimum point within the interval $[t_c, t_c + 29]$. After that, the data of the first 30 s of the rising phase can be collected. Because the inputs and outputs of the machine learning model are numerical data, the rock mass classification must be converted into numerical form. In this paper, the one-hot encoding method [18] is used to encode the rock mass classification. The one-hot encoding results are shown in Table 1.
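The window extraction and label encoding can be sketched as follows, in Python rather than MATLAB. Several points are assumptions: `first_local_max` is a crude stand-in for findpeaks, the rising phase is taken to start at the falling-edge end point t_s, and the one-hot column order (II, III, IV, V) is illustrative, since the actual codes are those given in Table 1. The advance-rate series is synthetic.

```python
def first_local_max(v):
    """Index of the first strict local maximum (a crude stand-in for findpeaks)."""
    for i in range(1, len(v) - 1):
        if v[i - 1] < v[i] > v[i + 1]:
            return i
    return None


def rising_phase_window(v, t_c, search=30, window=30):
    """Locate t_s as the minimum of v on [t_c, t_c + search - 1], then return
    t_s and the indices of the next `window` samples (assumed rising phase)."""
    stop = min(t_c + search, len(v))
    t_s = min(range(t_c, stop), key=lambda i: v[i])
    end = min(t_s + window, len(v))
    return t_s, list(range(t_s, end))


GRADES = ["II", "III", "IV", "V"]  # assumed column order, see Table 1


def one_hot(grade):
    """Encode a rock grade as a 0/1 vector with a single 1."""
    return [1 if g == grade else 0 for g in GRADES]


# Synthetic advance rate sampled once per second: empty push, falling edge, rise.
v = [0, 10, 20, 30, 40, 5, 3] + [4 + i for i in range(40)]
t_c = first_local_max(v)
t_s, idx = rising_phase_window(v, t_c)
print(t_c, t_s, len(idx))  # 4 6 30
print(one_hot("III"))      # [0, 1, 0, 0]
```

The same index window would then be applied to every selected operation parameter of the cycle, not only to the advance rate.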

Setting of predictors
As mentioned above, a total of 195 TBM operation parameters have been recorded. For machine learning models, too many redundant input features can reduce accuracy in some cases and will increase the calculation time. Therefore, it is necessary to carry out feature selection to determine the useful input parameters of the machine learning model. The RF algorithm provides two indices for feature selection, namely, the mean decrease Gini and the mean decrease accuracy [19]. In this study, the mean decrease Gini index is used for feature selection. During the training process of the RF model, the decrease in Gini impurity attributable to each feature is averaged over all decision trees to give the mean decrease Gini index. The larger the mean decrease Gini index, the more important the feature is. By calculating the mean decrease Gini, the following seven TBM operation parameters with a mean decrease Gini value of more than 14 are selected as the input features of the RF model: the pressure of gripper shoes (Pgs), cutterhead power (p), cutterhead rotational speed (n), total thrust (F), penetration (Pr), advance rate (v), and cutterhead torque (T). The mean decrease Gini values of the seven selected features are shown in Figure 4. The RF model cannot directly deal with time series data. Therefore, we calculate the mean value and linear fitting slope of the first 30 s of data of the seven selected TBM operation parameters as the inputs of the RF model. That is, the input data of the RF model are composed of 14 features. Furthermore, 6784 (90%) samples are randomly selected as the training set to train the model, and the remaining 754 (10%) samples are used as the test set to test the prediction performance of the model.
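The construction of the 14 input features can be sketched as follows. The slope is computed by ordinary least squares against time in seconds, which is one reasonable reading of "linear fitting slope". The feature ordering, the function names, and the synthetic window are assumptions made for illustration.

```python
PARAMS = ["v", "Pr", "T", "F", "p", "n", "Pgs"]  # the seven selected parameters


def mean_and_slope(y):
    """Mean of y and least-squares slope of y against t = 0, 1, ..., len(y) - 1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (yi - y_mean) for t, yi in enumerate(y))
    return y_mean, sxy / sxx


def build_features(window):
    """window maps each parameter symbol to its 30 one-second samples.
    Returns [mean, slope] for each parameter, concatenated in PARAMS order."""
    feats = []
    for name in PARAMS:
        m, k = mean_and_slope(window[name])
        feats.extend([m, k])
    return feats


# Synthetic window: every parameter rises linearly (illustrative values only).
window = {name: [2.0 * t + 1.0 for t in range(30)] for name in PARAMS}
x = build_features(window)
print(len(x))      # 14
print(x[0], x[1])  # mean and slope of the first parameter
```

Reducing each 30-sample series to two numbers keeps the level of the parameter and how fast it climbs, which is what allows a non-sequential model such as RF to use the rising-phase data.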

Hyperparameter analysis of the RF model
In the RF model, ntree and mtry are two important hyperparameters that influence the prediction performance. ntree is the number of decision trees. Generally, the more decision trees there are, the better the training effect of the model is. mtry is the number of variables randomly sampled as split candidates when a branch of a decision tree is constructed. Choosing an appropriate value of mtry can reduce the prediction error rate of the RF model. Generally, the value of mtry is set to 66% of the total number of variables. First, to study ntree, we set the number of decision trees to 1, 10, 50, 100, 200, 300, 400, and 500, with mtry fixed at 9. The prediction accuracy under different values of ntree is shown in Figure 5. It can be seen that the prediction accuracy reaches a stable value when the number of decision trees exceeds 100. Second, we set the value of mtry to 5, 6, 7, 8, 9, and 10, with the number of decision trees fixed at 300. Figure 6 shows the prediction accuracy of the RF model under the different values of mtry; the differences are small. Moreover, when mtry equals 8 or 9, the RF model demonstrates the best prediction performance. On the basis of this analysis, we set the values of ntree and mtry to 300 and 9, respectively.
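As a quick check, the 66% rule of thumb can be applied to the 14 input features. This assumes that the rule is applied to the feature count and rounded to the nearest integer, which the text does not state explicitly:

```latex
\mathrm{mtry} \approx 0.66 \times 14 = 9.24 \approx 9
```

Under that assumption, the rule gives the same value, 9, as the one used in the ntree study and adopted in the final model.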

Prediction results and discussion
As discussed, 7538 TBM excavation cycles were extracted, of which 6784 (90%) were randomly selected as the training set and the remaining 754 (10%) as the test set. The established RF model is used to predict the rock mass classification, with the settings described above. All the calculations are performed in MATLAB on a CPU, and the prediction accuracy is used as the evaluation index of model performance. Figure 7 shows the prediction results of rock mass classification. Table 2 lists the prediction accuracy for each rock mass grade. It can be seen that the total prediction accuracy for rock mass classification is 87.27% (658/754). Among the grades, the prediction accuracy is highest for grade III rock mass, at 99.12%. The prediction accuracy for grade IV rock mass is 57.36% (74/129). However, the prediction accuracy for grades II and V rock mass is relatively low, at 38.64% (17/44) and 35.71% (5/14), respectively. The difference in prediction accuracy among the grades is mainly due to the large difference in the number of samples of each grade. As mentioned in Section 3.1, the proportions of grades II-V rock mass are 8.13%, 66.73%, 20.03%, and 5.11%, respectively. Grade III rock mass accounts for the overwhelming majority of the samples, and grades II and V rock mass samples are very few. Therefore, the trained RF model has good prediction performance on grade III rock mass and relatively low prediction performance on grades II and V rock mass. Overall, the proposed RF model is useful for predicting rock mass classification, and the result is acceptable. Furthermore, the prediction performance of the model can be expected to improve as more samples of the under-represented grades become available.
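The reported accuracies can be reproduced from the counts with a short calculation. The counts for grades II, IV, and V are quoted in the text. The grade III counts (562 of 567) are not stated and are inferred here from the 658/754 overall total, so they should be read as an assumption.

```python
# (correct, total) per grade on the test set. Grades II, IV, and V are quoted
# in the text; grade III is inferred from the 658/754 overall total (assumption).
results = {"II": (17, 44), "III": (562, 567), "IV": (74, 129), "V": (5, 14)}


def accuracy_pct(correct, total):
    """Accuracy as a percentage."""
    return 100.0 * correct / total


per_grade = {g: accuracy_pct(c, t) for g, (c, t) in results.items()}
total_correct = sum(c for c, _ in results.values())
total_samples = sum(t for _, t in results.values())
overall = accuracy_pct(total_correct, total_samples)

for g, a in per_grade.items():
    print(f"grade {g}: {a:.2f}%")
print(f"overall: {overall:.2f}% ({total_correct}/{total_samples})")
```

The inferred grade III figure rounds to the reported 99.12%. Because grade III supplies 567 of the 754 test samples, the overall accuracy is dominated by that grade.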

Conclusions
In this paper, a rock mass classification prediction model is proposed on the basis of the RF algorithm. With data from the No. 4 bid section of the Songhua River water conveyance project in China, a database was established, including 195 TBM operation parameters and the rock mass classification corresponding to mileage. Through data analysis and preprocessing, 7538 TBM excavation cycles were extracted to make up the data set. Seven TBM operation parameters were selected as the input features of the RF model on the basis of the mean decrease Gini index. Then, 6784 samples were randomly selected as the training set, and the remaining 754 samples were used as the test set to test the prediction performance. The conclusions of this study are as follows: (1) The data of the rising phase in a TBM excavation cycle can effectively reflect the interaction between the machine and the surrounding rock. This study proposes a method to extract the data of the rising phase and uses the variation characteristics (i.e., mean value and linear fitting slope) of the first 30 s of the rising phase as the inputs of the RF model. (2) The prediction accuracy of the rock mass classification is 87.27%. However, because of the large difference in the number of samples of different grades of the rock mass, the trained model has