Soil classification based on unsupervised learning using cone penetration test data

Borehole data obtained during geological surveys are the most essential source for understanding soil stratification. It is a prerequisite to know the soil classes up to some depths prior to any construction. However, the direct method to identify the soil classes by drilling boreholes and testing soil samples is costly. A cost-effective alternative is the Cone Penetration Testing (CPT), which is one of the most popular soil investigation methods. This paper explores the intelligent classification of soil layers based on CPT data using three unsupervised machine learning methods: K-means, Gaussian Mixture Process, and BIRCH. The research investigates the classification performance of different models in scenarios with 2 combinations, 3 combinations, 4 combinations, and 5 combinations. The results indicate that the Gaussian Mixture Process method exhibits the best classification performance, followed by the BIRCH method, while K-means performs relatively poorly. Using unsupervised learning for intelligent soil layer classification offers a fast and clear process, but the accuracy still requires further improvement. This study provides a valuable reference for future soil classification studies.


Introduction
In soil classification, drilling is a commonly used method. This approach involves extracting soil samples through drilling and determining soil categories through laboratory analysis. However, this method has many drawbacks, the most significant being its high cost. Cone Penetration Testing (CPT) offers a more economical alternative. CPT is a geological engineering testing method used to measure the mechanical properties of soil and rock. It involves inserting a metal cone probe into the soil or rock to determine its physical and mechanical properties (Mao Rui et al. 2022). During CPT, the probe is pushed underground, and the measuring equipment records data such as resistance and lateral friction as the probe passes through the soil layers. These data can be used to assess geological characteristics such as bearing capacity and shear strength, enabling a relatively accurate inference of soil types. CPT's advantages lie in its speed, accuracy, and relatively non-disruptive nature, making it an important on-site testing technique in civil engineering.
However, there are challenges in using CPT to determine soil layers. Currently, the classification of soil types based on CPT data often relies on empirical methods and lacks a highly objective classification approach, which introduces a certain degree of bias into the classification. With the continuous development of artificial intelligence technology, machine learning algorithms are advancing rapidly. Machine learning, a branch of artificial intelligence, enables computer systems to learn and improve performance from data rather than relying on explicit programming. Some research results have emerged from combining CPT with machine learning. Lok et al. (2022) proposed using CPT and SPT data as variables and employing a backpropagation neural network to predict shear wave velocity in soil layers. They compared the predictive performance under three scenarios: using CPT data as variables, using SPT data as variables, and using a combination of CPT and SPT data. The model's highest accuracy was observed with the combined CPT and SPT data. Cho et al. (2020) used decision tree algorithms to intelligently classify soil layers. They selected four different parameter combinations as input variables and compared their effectiveness. The best classification results were obtained with the input variable combination of Qtn, Fr, Qtn·Fr, and Qtn/Fr. Reale et al. (2018) employed artificial neural network algorithms, using modified tip resistance and sleeve friction from CPT data as variables, to obtain a highly accurate predictive model.
IOP Publishing doi:10.1088/1755-1315/1337/1/012032
Although many scholars have made significant contributions in this field, there are still some shortcomings in current research results. Firstly, most studies use supervised learning methods. While supervised learning can train models based on existing samples and produce suitable models, the models may lack stability due to randomness in training. Additionally, the lack of accessibility to intermediate processes in most supervised algorithms makes model modification difficult, hindering further improvements. To address these shortcomings, this paper adopts three different unsupervised machine learning methods for soil classification, using only CPT data as the basis for classification. The paper also improves the original model to achieve better classification accuracy.

Data Acquisition and Analysis
This article uses Cone Penetration Testing (CPT) data from the Shanghai region. More specifically, the CPT data mainly come from geotechnical investigation reports for certain sections of Shanghai Metro Line 2, primarily from Jinke Road Station to Guanglan Road Station and from Chuansha East Station to Yuandong Avenue Station. Because the unsupervised learning methods in this study operate on numerical coordinate data, while CPT data are mostly presented as curve images, the images must first be converted into two-dimensional coordinate sets. Using the Getdata software, the CPT image data are extracted as coordinate data with a depth interval of 0.1 meters, giving a two-dimensional coordinate set of depth and resistance. A database is then established. The burial depth of tunnels in Shanghai generally does not exceed 60 meters (Shen, SL et al. 2014). Therefore, only the portion of CPT data within a depth of 60 meters is analyzed in this study.
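The conversion of digitized curve points into a regular 0.1 meter depth grid limited to 60 meters can be sketched as follows. This is a minimal illustration that assumes linear interpolation between digitized points with strictly increasing depths; the function name `resample_cpt` and the sample values are hypothetical and do not describe the actual export of the digitizing software.

```python
import math
from bisect import bisect_right


def resample_cpt(points, step=0.1, max_depth=60.0):
    """Interpolate digitized (depth_m, ps_mpa) points onto a regular depth grid.

    Assumes at least two points with strictly increasing depths.
    """
    pts = sorted(points)
    depths = [d for d, _ in pts]
    # First and last grid indices covered by the digitized curve (capped at max_depth).
    first = math.ceil(round(depths[0] / step, 9))
    last = math.floor(round(min(depths[-1], max_depth) / step, 9))
    grid = []
    for i in range(first, last + 1):
        z = round(i * step, 6)
        # Index of the digitized point just below z, clamped to a valid segment.
        j = min(max(bisect_right(depths, z), 1), len(depths) - 1)
        (z0, p0), (z1, p1) = pts[j - 1], pts[j]
        t = (z - z0) / (z1 - z0)
        grid.append((z, p0 + t * (p1 - p0)))
    return grid


# Hypothetical digitized points (depth in meters, Ps in MPa).
digitized = [(0.55, 1.0), (0.75, 3.0), (61.0, 3.0)]
grid = resample_cpt(digitized)
```

Points deeper than 60 meters are dropped by the `max_depth` cap, which mirrors the depth limit used in this study.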
Within the depth range of 60 meters in Shanghai, the strata consist mainly of fill soil and soil types 2 to 7. Because the characteristics of fill soil are unstable, it is excluded from the analysis. Analyzing the data for each soil type reveals their respective characteristics.
For soil type 2, the maximum and minimum depths are 8.2 meters and 0.6 meters, with an average of 2.60 meters, a median of 2.3 meters, and a standard deviation of 1.36 meters. The maximum and minimum values of Ps are 13.22 MPa and 0.20 MPa, with an average of 1.09 MPa, a median of 0.79 MPa, and a standard deviation of 1.25 MPa.
For soil type 3, the maximum and minimum depths are 20.1 meters and 1.4 meters, with an average of 6.47 meters, a median of 6.3 meters, and a standard deviation of 2.30 meters. The maximum and minimum values of Ps are 18.32 MPa and 0.20 MPa, with an average of 0.94 MPa, a median of 0.55 MPa, and a standard deviation of 1.18 MPa.
For soil type 4, the maximum and minimum depths are 26 meters and 6.3 meters, with an average of 14.59 meters, a median of 14.6 meters, and a standard deviation of 3.16 meters. The maximum and minimum values of Ps are 4.74 MPa and 0.28 MPa, with an average of 0.67 MPa, a median of 0.66 MPa, and a standard deviation of 0.17 MPa.
For soil type 5, the maximum and minimum depths are 60 meters and 9 meters, with an average of 30.4 meters, a median of 28.2 meters, and a standard deviation of 8.72 meters. The maximum and minimum values of Ps are 26.71 MPa and 0.38 MPa, with an average of 1.71 MPa, a median of 1.27 MPa, and a standard deviation of 1.68 MPa.
For soil type 6, the maximum and minimum depths are 49.1 meters and 24.1 meters, with an average of 32.26 meters, a median of 30.4 meters, and a standard deviation of 5.81 meters. The maximum and minimum values of Ps are 16.27 MPa and 0.72 MPa, with an average of 2.11 MPa, a median of 1.74 MPa, and a standard deviation of 1.29 MPa.
For soil type 7, the maximum and minimum depths are 60 meters and 27 meters, with an average of 45.67 meters, a median of 46.6 meters, and a standard deviation of 9.87 meters. The maximum and minimum values of Ps are 32.03 MPa and 1.21 MPa, with an average of 14.29 MPa, a median of 13.7 MPa, and a standard deviation of 6.63 MPa.
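Statistics of this kind can be computed per soil type with a few lines of code. The sketch below is only illustrative: the records are hypothetical values rather than entries from the study's database, and it assumes the sample standard deviation, since the source does not state which definition was used.

```python
import statistics
from collections import defaultdict


def summarize(values):
    """Return the descriptive statistics reported for each soil type."""
    return {
        "max": max(values),
        "min": min(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "std": statistics.stdev(values),  # assumption: sample standard deviation
    }


# Hypothetical (soil_type, depth_m, ps_mpa) records, not taken from the paper.
records = [
    (2, 0.6, 0.20), (2, 2.3, 0.79), (2, 8.2, 1.90),
    (3, 1.4, 0.55), (3, 6.3, 0.60), (3, 9.0, 1.10),
]

# Collect the depth and Ps columns separately for each soil type.
by_type = defaultdict(lambda: {"depth": [], "ps": []})
for soil_type, depth, ps in records:
    by_type[soil_type]["depth"].append(depth)
    by_type[soil_type]["ps"].append(ps)

summary = {
    soil_type: {name: summarize(vals) for name, vals in columns.items()}
    for soil_type, columns in by_type.items()
}
```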
Based on the characteristics of the different soil types, this study groups the soils in various combinations to investigate their classification performance. The article categorizes all soil layers into 2 combinations, 3 combinations, 4 combinations, and 5 combinations. The 2 combinations group soil types 2 to 5 together and soil types 6 and 7 together. The 3 combinations group soil types 2 and 3, soil types 4 and 5, and soil types 6 and 7 together. The 4 combinations treat soil type 2 separately, combine soil types 3 and 4, treat soil type 5 separately, and combine soil types 6 and 7. The 5 combinations treat soil type 2 separately, combine soil types 3 and 4, treat soil type 5 separately, treat soil type 6 separately, and treat soil type 7 separately.
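The four grouping schemes can be written down as a simple lookup table, which makes it easy to derive the reference group of any soil type. The names `COMBINATIONS` and `group_label` are hypothetical and are used here only for illustration.

```python
# Soil types grouped per scheme, following the combinations described above.
COMBINATIONS = {
    2: [(2, 3, 4, 5), (6, 7)],
    3: [(2, 3), (4, 5), (6, 7)],
    4: [(2,), (3, 4), (5,), (6, 7)],
    5: [(2,), (3, 4), (5,), (6,), (7,)],
}


def group_label(soil_type, n_groups):
    """Return the zero-based group index of a soil type in the given scheme."""
    for label, members in enumerate(COMBINATIONS[n_groups]):
        if soil_type in members:
            return label
    raise ValueError(f"soil type {soil_type} is not part of any group")
```

For example, soil type 5 falls into the first group under the 2-combination scheme and into the second group under the 3-combination scheme.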

Methodology
In this paper, three unsupervised machine learning methods, K-means, Gaussian Mixture, and BIRCH, are adopted to build models that identify different strata. Furthermore, an improvement is made to obtain better results. A brief introduction to the three algorithms and the improvement is presented in this section.

K-means
K-means is a widely used algorithm for clustering problems (Sinaga, KP et al. 2020). From a statistical viewpoint, clustering methods are generally divided into probability model-based approaches and nonparametric approaches. The algorithm first chooses k samples as the initial centers. The centers can be represented by Eq. (1): $C = \{c_1, c_2, c_3, \ldots, c_k\}$, where $C$ stands for the set of centers and $k$ equals the number of strata in the classification. Then, for every point in the CPT result, the distance to each center is calculated, and the point is assigned to the stratum whose center is closest. All the centers are then recalculated according to Eq. (2): $c_i = \frac{1}{|S_i|} \sum_{(x, y) \in S_i} (x, y)$, where $x$ and $y$ are the coordinates of the points in the CPT results and $S_i$ is the range of the i-th stratum, that is, the set of points currently assigned to it. The assignment and recalculation steps are repeated until the number of iterations reaches the preset limit.
The advantage of this method is that it is easy to understand and normally reaches an acceptable result, despite the risk of converging to a local optimum.
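The assignment and recalculation loop described above can be sketched in plain Python. This is a minimal illustration rather than the implementation used in the study: the random initialization with a fixed seed, the early stop when assignments no longer change, and the sample points are all assumptions made for the example.

```python
import random


def kmeans(points, k, n_iter=100, seed=0):
    """Cluster 2-D (depth, ps) points into k groups with Lloyd's iteration."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # k samples as the initial centers
    labels = None
    for _ in range(n_iter):
        # Assignment step: each point joins the closest center.
        new_labels = [
            min(range(k),
                key=lambda i: (x - centers[i][0]) ** 2 + (y - centers[i][1]) ** 2)
            for x, y in points
        ]
        if new_labels == labels:  # assumption: stop early once nothing changes
            break
        labels = new_labels
        # Update step: each center becomes the mean of its assigned points.
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:  # an empty cluster keeps its previous center
                centers[i] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centers


# Hypothetical (depth_m, ps_mpa) points forming two well-separated groups.
sample = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
labels, centers = kmeans(sample, k=2)
```

Because depth and Ps are measured on different scales, it may be worth standardizing both coordinates before clustering; the sketch leaves the values unscaled.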

Gaussian Mixture
Gaussian Mixture is an unsupervised machine learning method based on the Gaussian distribution, which is the most widely used finite mixture density (Everitt, B S 1996). It assumes that every stratum follows a Gaussian distribution. The algorithm estimates the Gaussian parameters of every stratum and classifies the points in the CPT results according to their probabilities under the different distributions. More specifically, the EM algorithm is adopted to optimize the parameters. The EM algorithm has some limitations: the number of clusters needs to be known in advance, and the solution depends strongly on the initial conditions. First, the algorithm initializes the parameters. Then the conditional expectation $Q(\theta, \theta^{t})$ is calculated according to Eq. (3) and Eq. (4). The maximization step is done by Eq. (4): $\theta^{t+1} = \arg\max_{\theta} Q(\theta, \theta^{t})$, where $\theta^{t}$ denotes the parameters at iteration $t$. The algorithm repeats the expectation and maximization steps until the model converges or the number of iterations reaches the preset limit.
Gaussian Mixture can be used in various conditions, including clustered points and strip-shaped points. It decides the affiliation of points according not only to distance but also to the shape of the CPT results of each stratum.
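The alternation between the expectation and maximization steps can be illustrated with a deliberately simplified one-dimensional mixture. The sketch below is not the model used in the study, which works on two-dimensional depth and resistance data; the quantile-based initialization, the small variance floor, and the sample values are assumptions made for the example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-(x - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(xs, k, max_iter=100, tol=1e-8):
    """Fit a k-component 1-D Gaussian mixture with EM and label each value."""
    n = len(xs)
    ordered = sorted(xs)
    # Assumed initialization: evenly spaced quantiles as means, shared variance.
    means = [ordered[int((i + 0.5) * n / k)] for i in range(k)]
    overall_mean = sum(xs) / n
    variances = [sum((x - overall_mean) ** 2 for x in xs) / n] * k
    weights = [1.0 / k] * k
    prev_ll = -math.inf
    for _ in range(max_iter):
        # E-step: responsibility of every component for every value.
        resp, ll = [], 0.0
        for x in xs:
            dens = [w * normal_pdf(x, m, v)
                    for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
            ll += math.log(total)
        if abs(ll - prev_ll) < tol:  # converged: log-likelihood stopped changing
            break
        prev_ll = ll
        # M-step: re-estimate weights, means and variances from responsibilities.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            variances[j] = sum(r[j] * (x - means[j]) ** 2
                               for r, x in zip(resp, xs)) / nj + 1e-6
    labels = [max(range(k), key=lambda j: r[j]) for r in resp]
    return weights, means, variances, labels


# Hypothetical Ps values (MPa) drawn from two clearly separated layers.
weights, means, variances, labels = fit_gmm_1d([0.9, 1.0, 1.1, 4.9, 5.0, 5.1], k=2)
```

Each value is assigned to the component with the highest responsibility, so a wide component can claim a point that lies closer to the mean of a narrow one. This is the behavior the text describes as classifying by shape as well as by distance.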

BIRCH
BIRCH is a useful algorithm for large datasets (T Zhang et al. 1996). It is based on the CF tree, a height-balanced tree that stores the clustering features (CF) for a hierarchical clustering. A CF consists of N, LS and SS. The definitions of these variables are shown in Eq. (5): $CF = (N, LS, SS)$, with $LS = \sum_{i=1}^{N} x_i$ and $SS = \sum_{i=1}^{N} x_i^2$, where $N$ is the number of points in the cluster, $LS$ is their linear sum, and $SS$ is their squared sum. When building the CF tree, a threshold radius T is set. When a new point is entered, it is included in an existing CF tuple if the threshold T allows it. If the point cannot be placed in any existing tuple, a new CF tuple is built. Moreover, the maximum number of tuples in a node is also limited, so when the number of tuples in a node exceeds this limit, the node splits into two new nodes.
In BIRCH, the CF tree is built once all the data have been entered into the program. Another method, such as K-means, is then used to cluster the CF data in order to eliminate the effect of the order in which the points from the CPT result are entered, as well as the defects of the CF tree itself.
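The role of the CF tuple and the threshold T can be shown with a flat list of CF entries. This is a simplified sketch under stated assumptions: it omits the tree structure, the node capacity and the node splitting of full BIRCH, and the class name `CF`, the function `build_cf_entries` and the sample points are hypothetical.

```python
import math


class CF:
    """Clustering feature (N, LS, SS) of a group of points."""

    def __init__(self, dim):
        self.n = 0               # N: number of points
        self.ls = [0.0] * dim    # LS: per-coordinate linear sum
        self.ss = 0.0            # SS: sum of squared norms

    def add(self, point):
        self.n += 1
        self.ls = [a + b for a, b in zip(self.ls, point)]
        self.ss += sum(v * v for v in point)

    def centroid(self):
        return [v / self.n for v in self.ls]

    def radius_with(self, point):
        """Radius the entry would have after absorbing the point."""
        n = self.n + 1
        ls = [a + b for a, b in zip(self.ls, point)]
        ss = self.ss + sum(v * v for v in point)
        var = ss / n - sum((v / n) ** 2 for v in ls)
        return math.sqrt(max(var, 0.0))


def build_cf_entries(points, threshold):
    """Absorb each point into the nearest CF entry if its radius stays within T."""
    entries = []
    for p in points:
        best = min(entries, key=lambda e: math.dist(e.centroid(), p), default=None)
        if best is not None and best.radius_with(p) <= threshold:
            best.add(p)
        else:
            cf = CF(len(p))
            cf.add(p)
            entries.append(cf)
    return entries


# Hypothetical (depth_m, ps_mpa) points.
entries = build_cf_entries(
    [(1.0, 1.0), (1.2, 1.0), (10.0, 10.0), (10.0, 10.4)], threshold=0.5
)
```

Because N, LS and SS are all sums, an entry can absorb a point or merge with another entry without revisiting the raw data, which is what makes the approach suitable for large datasets. The resulting entry centroids can then be passed to a second clustering step such as K-means.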

Analysis of Results
In this article, the results obtained by the model are the boundary data for each combination of soil layers, as shown in Figure 1. Figure 1(a) is a schematic diagram of the model results for 2 combinations, and Figure 1(b) illustrates the model results for 3 combinations. The dashed lines indicate the boundaries of the actual soil layer combinations, while the different background colors represent the boundaries calculated by the model.
The evaluation metric employed in this article, the MR value, is defined by Eq. (7). It is computed from the number of combinations, the actual soil layer area for the i-th combination, and the algorithm-derived soil layer area for the i-th combination. This value ranges from 0 to 1, with a higher value indicating better classification performance of the model. The schematic diagram of the evaluation metric is illustrated in Figure 2.
For the case of 2 combinations, the average MR value of the Gaussian Mixture Process is significantly higher than that of the other two methods. The MR values for the K-means and BIRCH methods are roughly similar, but the results from the BIRCH method are more dispersed. The results are shown in Figure 3.
For the case of 3 combinations, the Gaussian Mixture Process still maintains a relatively high average MR value. The MR values for the K-means and BIRCH methods are roughly similar, but the results from the BIRCH method are more dispersed. The results are shown in Figure 4.
For the case of 4 combinations, the Gaussian Mixture Process still exhibits good performance, but its gap with the average MR values of K-means and BIRCH has narrowed. The average MR values for K-means and BIRCH are similar, but the MR values from BIRCH are again more dispersed. The results are shown in Figure 5.
For the case of 5 combinations, the Gaussian Mixture Process method demonstrates a clear advantage over the other two methods, and BIRCH in turn shows an advantage over the K-means method. However, the MR values from the Gaussian Mixture Process are more dispersed than those of the BIRCH method. While the average MR value of BIRCH is not as high as that of the Gaussian Mixture Process, it surpasses that of the K-means method, and its MR values are more concentrated. The results are shown in Figure 6.
After a comprehensive examination of the three methods, it can be concluded that the Gaussian Mixture Process method exhibits the best overall performance, while the BIRCH and K-means models perform similarly, with the BIRCH model slightly outperforming the K-means model.

Conclusion
This paper explores the intelligent classification of soil layers using three unsupervised machine learning methods: K-means, Gaussian Mixture Process, and BIRCH. The research investigates the classification performance of different models in scenarios with 2 combinations, 3 combinations, 4 combinations, and 5 combinations. The findings indicate that the Gaussian Mixture Process method exhibits the best classification performance, followed by the BIRCH method, while K-means performs relatively poorly. Using unsupervised learning for intelligent soil layer classification offers a fast and clear process, but the accuracy still requires further improvement.

Figure 1. Model Classification Schematic Diagram

Figure 2. Evaluation Metric Schematic Diagram

Figure 3. Combination Result Charts

Figure 4. Combination Result Charts

Figure 5. Combination Result Charts