Research on Clustering of Material Inspection Data for Highway Construction Projects Based on Agglomerative Hierarchical Clustering

In large-scale highway construction projects, the implementing parties carry out detailed testing of the various material indicators used in the works. By processing and analyzing these material testing data, large-scale highway construction projects can be classified in a reasonable way, so that each implementer can better manage material quality; this helps all parties strengthen quality control and improve the level of project management. This study first preprocesses the material testing data to obtain a structured dataset suitable for data analysis. Statistical features are then constructed for the features in the dataset, including the maximum, minimum, mean, median, and standard deviation, to improve the performance and accuracy of the model. Next, agglomerative hierarchical clustering is applied for classification, and the classification results are visualized as dendrograms. Finally, by comparing performance analyses and clustering evaluation indexes, we conclude that classifying the works into 3 or 4 categories according to material performance is consistent with their actual level of engineering quality.


Introduction
In highway construction projects, the implementing party performs detailed inspection and analysis of the materials used in the works under its responsibility. Previous research on highway construction projects has mainly focused on highway-related funds and resources; relatively little work classifies projects from the perspective of the materials used. This thesis therefore starts from material testing data to classify highway construction projects and improve the quality management of materials. The testing data for the materials used come from the national highway and water transportation inspection and testing big data platform.
Machine learning is widely used in road construction projects, and clustering algorithms in particular have a wide range of applications, the most prominent being the K-means algorithm and hierarchical clustering. For road construction projects, the K-means algorithm has been used to cluster cash flow data and optimize the financial management of the project [1,2]. Unlike K-means, hierarchical clustering does not require the number of clusters to be determined in advance, which gives it stronger applicability and flexibility; it has been applied increasingly widely [5,8,9,10] and has achieved good results.
Hierarchical clustering can be categorized into divisive (top-down) and agglomerative (bottom-up) approaches. The results of divisive hierarchical clustering tend to be unstable, and its computational cost is relatively high. In contrast, agglomerative hierarchical clustering is relatively stable and, as a bottom-up approach, is well suited to large datasets. The hierarchical structure it produces also facilitates visualization and interpretation of the results [3,4]. We use agglomerative hierarchical clustering for project classification and apply the Silhouette Score, the Davies-Bouldin Index, and dendrograms to synthesize the clustering results and determine the final classification. The final classification will help engineering projects to target the management of their own material testing and meet the standards of the Institute of Highway Science of the Ministry of Transportation and Communications [6,7].

Raw Data
The original data cover a total of 19 highway construction projects. Each project comprises a number of sub-projects, and each sub-project contains testing data for materials of different sample types and specification models. The testing data for each material include identifying indicators such as sample type, specification, and parameter name, as well as statistical indicators such as the number of tests, maximum, minimum, mean, median, and standard deviation. The data cover the period from the start of construction through October 2023.
There were 28 sample types of materials, but not all 28 were used on every project. To screen and identify representative sample types, we created a bar chart depicting the frequency of occurrence of these 28 sample types across the 19 projects. To protect data privacy and improve the readability of the chart, the sample types are denoted by the numbers 1 through 28. See figure 1 for the data integrity rates. As shown in figure 1, the five sample types with the highest integrity rates are: hot-rolled ribbed rebar (0.96), ordinary Portland cement (0.93), coarse aggregate (0.93), fine aggregate (0.93), and hot-rolled round rebar (0.81). These five materials play important roles in road and bridge construction: ordinary Portland cement withstands the heavy pressure of vehicles and pedestrians; hot-rolled ribbed rebar offers good mechanical properties and durability, and its ribbed surface improves the adhesion between steel and concrete; and coarse aggregate and fine aggregate act as the skeleton and flesh of concrete, respectively.
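As a minimal sketch of how the figure-1 integrity rates could be computed, the snippet below counts, for each sample type, the fraction of projects in which it appears at least once. The column names `project` and `sample_type` and the use of pandas are assumptions for illustration, not part of the stated environment.

```python
import pandas as pd

def integrity_rates(records: pd.DataFrame, n_projects: int) -> pd.Series:
    """Fraction of projects in which each sample type appears at least once."""
    counts = records.groupby("sample_type")["project"].nunique()
    return (counts / n_projects).sort_values(ascending=False)

# Toy records; the real data would hold 19 projects and 28 sample types.
records = pd.DataFrame({
    "project": ["P1", "P2", "P1"],
    "sample_type": ["cement", "cement", "rebar"],
})
rates = integrity_rates(records, n_projects=2)
# rates.plot.bar() would reproduce a figure-1 style bar chart.
```

Sorting in descending order makes the most complete sample types appear first, mirroring the selection of the top five in the text.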

Data Processing
Before constructing a clustering model, the raw material inspection data exhibit missing values, outliers, inconsistencies, and values containing special symbols, collectively called "noise data". The presence of this noise may negatively affect subsequent modeling results. Therefore, before the actual modeling, the data should be preprocessed: non-standard data unified into a standardized form, missing values removed or completed, and dimensions unified. Here we remove records with outliers and special-symbol values; the handling of missing values is deferred to the feature engineering step below.
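The cleaning step above could be sketched as follows. This is an illustrative implementation under assumed column names, with pandas assumed available; the 3-sigma rule for outliers is one common choice, not necessarily the paper's exact criterion.

```python
import pandas as pd

def clean_inspection_data(df: pd.DataFrame, value_cols) -> pd.DataFrame:
    """Drop records whose numeric fields contain special symbols or outliers."""
    df = df.copy()
    for col in value_cols:
        # Values containing special symbols (e.g. '≥', '--') become NaN.
        df[col] = pd.to_numeric(df[col], errors="coerce")
    df = df.dropna(subset=value_cols)
    for col in value_cols:
        # Remove outliers beyond 3 standard deviations of each column.
        mu, sigma = df[col].mean(), df[col].std()
        df = df[(df[col] - mu).abs() <= 3 * sigma]
    return df.reset_index(drop=True)

# Example: 29 normal readings, one special symbol, one extreme outlier.
raw = pd.DataFrame({"v": ["1"] * 29 + ["n/a", "1000"]})
clean = clean_inspection_data(raw, ["v"])
```

Coercing to numeric first means the special-symbol removal and the outlier removal can share one pass over the value columns.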

Feature Engineering
For the same sample type there are different specifications, and for the same specification there are different parameter indicators. The combination of these three features uniquely identifies a record, so we integrate them into a new feature named "specific model". The number of occurrences of each "specific model" across the inspected projects is shown in table 1. We selected the test data for materials whose cumulative number of occurrences is greater than or equal to 7, totaling 10. For each of these 10 "specific model" materials, we constructed the corresponding mean, maximum, minimum, median, and standard deviation. These statistical features represent the characteristics of the data well, so we finally constructed 50-dimensional features for each project.
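The construction of the "specific model" key and the five per-material statistics could be sketched with a pandas group-and-pivot, as below. The column names (`project`, `sample_type`, `specification`, `parameter`, `value`) are hypothetical, and pandas is assumed available beyond the packages listed in the environment section.

```python
import pandas as pd

def build_project_features(records: pd.DataFrame) -> pd.DataFrame:
    """Pivot per-test records into one statistical-feature row per project."""
    df = records.copy()
    # Combine the three identifying columns into a single "specific model" key.
    df["specific_model"] = (
        df["sample_type"] + "|" + df["specification"] + "|" + df["parameter"]
    )
    stats = (
        df.groupby(["project", "specific_model"])["value"]
          .agg(["mean", "max", "min", "median", "std"])
          .unstack("specific_model")
    )
    # Flatten the (statistic, model) column MultiIndex into plain names.
    stats.columns = [f"{model}_{stat}" for stat, model in stats.columns]
    return stats

# Toy example: one project, two "specific model" materials -> 10 features.
records = pd.DataFrame({
    "project": ["A", "A", "A", "A"],
    "sample_type": ["cement", "cement", "rebar", "rebar"],
    "specification": ["P42.5", "P42.5", "HRB400", "HRB400"],
    "parameter": ["strength", "strength", "yield", "yield"],
    "value": [40.0, 44.0, 400.0, 420.0],
})
features = build_project_features(records)
```

With the paper's 10 selected materials, the same pivot yields the 50-dimensional (10 materials × 5 statistics) feature vector per project.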
After the above processing, we obtained a dataset covering 19 highway construction projects with 10 different material features, each expanded into 5 statistical features. For the clustering problem, differences in dimension strongly influence the results: features with large values have a greater influence than those with small values and can even dominate the clustering, making the final results inaccurate. Therefore, we apply min-max normalization to the non-empty data in the selected 50-dimensional features to eliminate the influence of dimensionality without destroying the original data distribution. The normalization formula is shown in (1):

x' = (x − x_min) / (x_max − x_min)    (1)

Some missing values remain in the 50-dimensional data. Missing values are usually handled by mean filling or zero filling; mean filling is generally used to fill gaps in a continuous series of data. In our case, some projects simply never used certain materials, so zero filling is the more reasonable choice here.
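The normalize-then-zero-fill pipeline can be sketched directly in numpy (which is in the stated environment). This is a minimal illustration: each column is min-max scaled over its non-missing entries, and the remaining NaNs (materials a project never used) are replaced with zero afterwards, so the fill value does not distort the per-column min and max.

```python
import numpy as np

def normalize_and_fill(X: np.ndarray) -> np.ndarray:
    """Min-max normalize each column over non-missing entries, then zero-fill."""
    X = X.astype(float).copy()
    col_min = np.nanmin(X, axis=0)
    col_max = np.nanmax(X, axis=0)
    # Guard against constant columns to avoid division by zero.
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    X = (X - col_min) / span
    return np.nan_to_num(X, nan=0.0)

# Toy 3-project, 2-feature matrix; NaN marks an unused material.
X = np.array([[1.0, np.nan],
              [3.0, 10.0],
              [5.0, 20.0]])
Y = normalize_and_fill(X)
```

Note the ordering matters: filling zeros before scaling would pull the column minimum down to 0 and distort the scale of the observed values.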

Implementation Environment and Interface
The data analysis environment uses Python 3.7 with numpy 1.19.2 and matplotlib 3.3.2, and the data interface adopts the JSON format.

Introduction to the Hierarchical Clustering Algorithm
Hierarchical clustering is a type of clustering algorithm that starts by considering each object as an individual cluster and then progressively merges these atomic clusters into larger ones until all objects belong to a single cluster or a certain termination condition is met.
Common linkage criteria for hierarchical clustering include single linkage, complete linkage, average linkage, and Ward linkage. Single linkage and complete linkage use the distance between the two closest and the two farthest data points, respectively, as the distance between two clusters; both are susceptible to the influence of outliers. Average linkage calculates the average distance over all pairs of data points in the two merging clusters, providing a more balanced approach but still subject to the "chaining effect". In contrast, Ward linkage bases its decision on the increase in the within-cluster sum of squares when two clusters are merged, and performs particularly well on convex datasets. The Ward linkage procedure is as follows:
1. Initialization: treat each point as an independent cluster and calculate the Euclidean distance between each pair of points, producing an n × n distance matrix, where n is the number of data points.
2. Merge the closest clusters: find the pair of clusters whose merger gives the minimum increase in within-cluster variance, and merge them into a new cluster.
3. Update the distance matrix: after merging two clusters, update the distance matrix to reflect the distances between the new cluster and the remaining clusters, using Ward's variance criterion.
4. Repeat steps 2 and 3, merging the closest clusters and updating the distance matrix, until only one cluster remains, i.e., all data points belong to the same cluster.
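The steps above are exactly what `scipy.cluster.hierarchy.linkage` performs with `method="ward"`. The sketch below runs it on synthetic data with three well-separated groups; scipy is assumed available in addition to the packages listed in the environment section, and the group layout is purely illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Three synthetic "project" groups in a 50-dimensional feature space.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(7, 50)),
    rng.normal(loc=0.5, scale=0.05, size=(6, 50)),
    rng.normal(loc=1.0, scale=0.05, size=(6, 50)),
])

# Ward linkage: at each step, merge the pair of clusters whose union
# gives the smallest increase in total within-cluster variance.
Z = linkage(X, method="ward")  # (n-1, 4) merge history

# Cut the tree into a chosen number of flat clusters.
labels = fcluster(Z, t=3, criterion="maxclust")
```

Each row of `Z` records one merge (the two cluster indices, the merge distance, and the new cluster size), which is the same information the dendrogram later visualizes.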

Practical Application and Results of Agglomerative Hierarchical Clustering
An agglomerative hierarchical clustering model was constructed, taking as input the cleaned and feature-engineered dataset. Based on this model, the 19 projects were progressively merged until only one cluster remained. From the dendrogram in figure 2, we can intuitively see that the classification is best when the number of clusters is 3. In practice, the choice of the final number of clusters must also consider the actual needs of the application: according to the dendrogram in figure 3, a cluster count of 4 is also a reasonable choice.
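A figure-2 style dendrogram can be drawn with scipy and matplotlib as below. The random placeholder data and the `project_i` labels stand in for the real 19-project feature matrix; scipy is assumed available, and the `Agg` backend is used only so the figure can be saved headlessly.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend; the figure is saved to a file
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

rng = np.random.default_rng(1)
X = rng.normal(size=(19, 50))  # placeholder for the 19 projects' features

Z = linkage(X, method="ward")

fig, ax = plt.subplots(figsize=(10, 4))
# Each U-shaped link joins two clusters at its merge height.
dn = dendrogram(Z, labels=[f"project_{i + 1}" for i in range(19)], ax=ax)
ax.set_ylabel("Ward merge distance")
fig.tight_layout()
fig.savefig("dendrogram.png", dpi=150)
```

Reading the tree top-down, a horizontal cut through 2 links yields 3 clusters, and a cut through 3 links yields 4, which is how the 3- vs 4-cluster choice is inspected visually.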

Evaluation Metrics for Agglomerative Hierarchical Clustering
Compared with supervised learning, agglomerative hierarchical clustering does not have evaluation metrics such as accuracy and recall. In this context, we have chosen several metrics to assess the quality of the clustering results.
The Silhouette Score is a metric used to measure the quality of clustering. It takes into account both the cohesion of data points within clusters and the separation between clusters. The formula is shown as (2):

s(i) = (b(i) − a(i)) / max(a(i), b(i))    (2)

where a(i) represents the average distance of sample i to the other samples within the same cluster (intra-cluster cohesion), and b(i) represents the average distance of sample i to all samples in the nearest different cluster (inter-cluster separation). The Silhouette Score ranges from -1 to 1, with values closer to 1 indicating better clustering performance.
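The score in (2) is available as `sklearn.metrics.silhouette_score`; the sketch below scans candidate cluster counts on synthetic two-group data, mirroring the figure-4 analysis. scikit-learn and scipy are assumed available beyond the listed environment, and the data are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)
# Two tight, well-separated groups -> k = 2 should score highest.
X = np.vstack([
    rng.normal(loc=0.0, scale=0.05, size=(10, 5)),
    rng.normal(loc=1.0, scale=0.05, size=(10, 5)),
])
Z = linkage(X, method="ward")

# Mean silhouette over all samples, for each candidate cluster count.
scores = {}
for k in range(2, 6):
    labels = fcluster(Z, t=k, criterion="maxclust")
    scores[k] = silhouette_score(X, labels)
```

On the real 19-project features, the same loop produces the curve in figure 4, where the peak at 2 or 3 clusters guides the choice.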
The Davies-Bouldin Index is another metric used to assess clustering quality; a lower value indicates better clustering performance. The formula is shown as (3):

DB = (1/n) Σ_{i=1}^{n} max_{j≠i} (σ_i + σ_j) / d(c_i, c_j)    (3)

where n represents the number of clusters, σ_i represents the average distance between the data points in cluster i and its centroid c_i, and d(c_i, c_j) represents the distance between the centroids of clusters i and j.
It can be seen from figure 4 that when the number of clusters is 2 or 3, the Silhouette Score is highest and the clustering effect is best. When the number of clusters increases to 4, the clustering effect drops sharply. The Silhouette Score then rises slowly, but always remains below its value at 2 or 3 clusters. Therefore, according to the Silhouette Score, 2 or 3 clusters is the best choice.
According to figure 5, the Davies-Bouldin Index for 3 clusters is lower than for 2 or 4 clusters, indicating a better clustering effect. When the number of clusters exceeds 4, the index decreases as the number of clusters grows, but overfitting may occur. Therefore, according to the Davies-Bouldin Index, 3 clusters is the most reliable classification result.
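The figure-5 scan can be reproduced with `sklearn.metrics.davies_bouldin_score`, which implements (3). The sketch below uses three synthetic, well-separated groups so that the index is minimized at 3 clusters; scikit-learn and scipy are assumed available, and the data are illustrative rather than the paper's.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.metrics import davies_bouldin_score

rng = np.random.default_rng(3)
# Three tight groups centered at 0, 1, and 2 in a 5-dimensional space.
X = np.vstack([
    rng.normal(loc=c, scale=0.05, size=(6, 5)) for c in (0.0, 1.0, 2.0)
])
Z = linkage(X, method="ward")

# Lower Davies-Bouldin means tighter, better-separated clusters.
db = {k: davies_bouldin_score(X, fcluster(Z, t=k, criterion="maxclust"))
      for k in range(2, 6)}
best_k = min(db, key=db.get)
```

Merging two true groups (k = 2) inflates within-cluster scatter, while splitting one (k = 4) creates a pair of near-identical centroids with a small d(c_i, c_j); both raise the index relative to k = 3.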

Conclusion
This paper preprocesses the material testing data of large-scale highway construction projects and constructs the corresponding statistical features, on which an agglomerative hierarchical clustering model is built to cluster the projects. Based on the evaluation results of the silhouette scores, the Davies-Bouldin Index, and the dendrograms, classifying the highway construction projects into three clusters is a relatively reasonable choice that maintains high classification performance. When a more detailed project categorization is needed, dividing the data into 4 classes can also be considered. In conclusion, depending on the application requirements, we can flexibly choose 3 or 4 as the most appropriate number of clusters. The implementers of the projects in each cluster can then identify the material problems to prioritize by comparing against the material levels of the other highway construction projects in the same cluster.
Table 2 below shows part of the clustering process: