Research on Classification Method of Highway Concrete Water Reducing Agent Manufacturers Based on K-means++ Clustering

Concrete water reducing agent is an important admixture in the preparation of concrete, and machine learning is widely applied in material science. In this paper, machine learning is applied to the classification of water reducing agent manufacturers. In practical engineering, categorizing water reducing agent suppliers relies heavily on expert experience, which makes it difficult to find a comparable alternative manufacturer when a water reducing agent is out of stock. To address this problem, the paper first cleans the original data: 72-dimensional eigenvalues are selected, missing values are filled, and the dataset is normalized. The K-means++ algorithm is then used to cluster the manufacturers, and the optimal K value is selected with three evaluation indexes, namely the Silhouette Coefficient, the Davies-Bouldin Index, and the Calinski-Harabasz Index. The best clustering is obtained when K is 3. With this result, when a manufacturer in a given class is out of stock, a similar manufacturer from the same class can be chosen as a replacement.


Introduction
Concrete water reducing agent is a chemical admixture commonly used in the preparation of concrete. It reduces the amount of water needed in the mix while maintaining the fluidity of the concrete [1]. At the same time, it reduces the porosity and moisture in the concrete, which increases its strength and helps the concrete comply with engineering construction standards [2,3].
However, admixtures are tested against more than ten indicators, such as pH value, water reduction rate, and bleeding rate ratio. The basis for judging these indicators is complex, and it is difficult to determine the quality of a manufacturer's product from a single indicator and categorize the admixture manufacturers accordingly.
In addition, when a project needs concrete water reducing agent, a manufacturer will often be in short supply. When a manufacturer is out of stock, it is necessary to purchase from manufacturers of similar quality. In the past, this work could only be carried out by highway engineering experts, relying heavily on their experience. When a clustering algorithm is used to group the manufacturers, however, the relationship between manufacturers in terms of ex-factory product quality is quantified to some extent. When a manufacturer is out of stock, the clustering results identify the manufacturers in the same category, from which the goods can be purchased instead, which gives the method practical significance for manufacturer substitution. Clustering has also been applied in other engineering case studies: Ying Zhou [4] used clustering algorithms to implement an early warning function that serves as a precautionary measure for fresh food safety issues, and Zili Chen [5] used clustering to improve routing protocols, extending the network's lifespan while balancing its energy consumption.
Common clustering methods include K-means clustering, K-means++ clustering, and hierarchical clustering. The K-means algorithm was proposed by MacQueen [6] in 1967. Based on it, improved algorithms such as K-means++ and Mini-batch K-means have been proposed, and these also have many applications in engineering [7,8,9]. The K-means++ algorithm optimizes the selection of the initial cluster centroids, which reduces the possibility of the algorithm falling into local minima [10] and increases the stability and efficiency of the K-means algorithm.
In this paper, data cleaning and feature engineering are first carried out on the material data. Then, to address the classification difficulties encountered in engineering, the K-means++ algorithm is used to cluster the concrete water reducing agent manufacturers. To determine the optimal K value, three evaluation indexes, the Silhouette Coefficient, the Davies-Bouldin Index, and the Calinski-Harabasz Index, are introduced, and the K value with the best clustering effect is obtained by comprehensive analysis. Finally, the full work is summarized.

Raw Data
The data in this paper come from the Material Testing and Inspection Data Platform and were measured in highway construction projects. The original data cover 12 manufacturers supplying water reducing agent for highway concrete between 30 September 2022 and 30 September 2023. Each manufacturer provides two specifications of concrete water reducing agent: a high-performance standard-type water reducing agent and a high-performance retarding-type water reducing agent. Each type of water reducing agent has various testing indicators, such as pH value and water reduction rate, and each indicator has statistical values such as the maximum and minimum.

Feature Engineering and Missing Data Handling
To select the eigenvalues for clustering the manufacturers, we first select appropriate detection indicators and then select reasonable statistical values over all the measurements of these indicators within one year. The number of detections of each indicator in the most recent year is shown in Table 1. The analysis shows that the number of tests drops sharply starting from the total alkali content. Because the number of detections affects the stability of the results, and more detections give more reliable results, nine indicators were selected as the detection indexes: water reduction rate, air content, compressive strength ratio (7 days), solid content, bleeding rate ratio, compressive strength ratio (28 days), average 1 h slump change over time, solid content, and pH.
For each of these nine testing indexes, the average value, maximum value, minimum value, and coefficient of variation were then selected as statistical indexes. The coefficient of variation, the ratio of the standard deviation to the mean, describes the degree of data dispersion. Since there are two types of concrete water reducing agent, the nine testing indicators were selected for both types. A total of 2 × 9 × 4 = 72 eigenvalues was therefore selected, and these eigenvalues were applied to the 12 manufacturers for cluster analysis.
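As a rough illustration of how the four statistics could be assembled into a feature vector, the sketch below summarizes each indicator and concatenates the results. The function names, the indicator keys, and the sample numbers are hypothetical, and the use of the population standard deviation in the coefficient of variation is an assumption, since the paper does not state which form was used.

```python
from statistics import mean, pstdev


def summarize(values):
    """Return (mean, max, min, coefficient of variation) for one indicator."""
    avg = mean(values)
    # Coefficient of variation = standard deviation / mean.
    # Assumption: population standard deviation; the paper does not specify.
    cv = pstdev(values) / avg if avg != 0 else 0.0
    return avg, max(values), min(values), cv


def build_features(tests_by_indicator):
    """Concatenate the four statistics of every indicator into one flat vector."""
    features = []
    for indicator in sorted(tests_by_indicator):  # fixed order across manufacturers
        features.extend(summarize(tests_by_indicator[indicator]))
    return features


# Hypothetical test results for one manufacturer and one agent type.
example = {
    "pH": [6.0, 7.0, 8.0],
    "water_reduction_rate": [26.0, 28.0, 30.0],
}
vector = build_features(example)
print(len(vector))  # 2 indicators * 4 statistics = 8 values
```

With nine indicators and two agent types, the same construction would yield the 72 values per manufacturer described above.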
For some indicators, the number of tests is high for some manufacturers and low for others, and some manufacturers may have no data at all, so missing values in this part of the data must be handled. When a value is missing, the average of that indicator over the remaining manufacturers is used as a substitute, which reduces the impact of missing values on the experimental results.
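The mean substitution described above can be sketched as follows. This is a minimal illustration, assuming the data are held as a list of rows (one per manufacturer) with `None` marking a missing value; the function name and the sample table are hypothetical.

```python
def fill_missing_with_column_mean(table):
    """Replace None entries with the mean of the observed values in that column."""
    n_cols = len(table[0])
    filled = [list(row) for row in table]  # copy so the input is not modified
    for j in range(n_cols):
        observed = [row[j] for row in table if row[j] is not None]
        if not observed:
            continue  # no manufacturer reports this feature; leave it untouched
        col_mean = sum(observed) / len(observed)
        for row in filled:
            if row[j] is None:
                row[j] = col_mean
    return filled


# Hypothetical table: 3 manufacturers, 2 features, None marks a missing value.
raw = [
    [7.0, 25.0],
    [None, 27.0],
    [8.0, None],
]
clean = fill_missing_with_column_mean(raw)
```

Here the missing value of the first feature is replaced by the mean of 7.0 and 8.0, and the missing value of the second feature by the mean of 25.0 and 27.0.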
Part of the processed data is shown in Table 2, where I represents the high-performance standard-type water reducing agent and II represents the high-performance retarding-type water reducing agent. Excluding the first row of descriptive information, there are 12 rows and 72 columns in total.

Standardization of data
Taking the type II concrete water reducing agent as an example, a box plot of the 36 eigenvalues mentioned in the previous section is shown in Figure 1. Some eigenvalues, such as the 25th, 26th, 27th, 29th, 30th, and 31st, have a much wider range of values than the others. Without rescaling, these large-range eigenvalues would dominate the distance calculation, so all 36 eigenvalues are normalized as shown in equation (1) so that a change in any eigenvalue has a comparable effect on the result and the error of the subsequent clustering is reduced.

$$f'_{ij} = \frac{f_{ij} - f_{j\min}}{f_{j\max} - f_{j\min}} \tag{1}$$

where $f_{ij}$ denotes the value of the $j$th eigenvalue of the $i$th manufacturer, $f_{j\min}$ denotes the smallest value of the $j$th eigenvalue, and $f_{j\max}$ denotes the largest value of the $j$th eigenvalue. The box plot after normalization is shown in Figure 2. The same normalization was applied to the type I concrete water reducing agent. At this point the data cleaning is complete: a total of 72 eigenvalues has been selected, and missing value processing and normalization have been carried out for the subsequent application of the algorithm.
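The min-max normalization of equation (1) can be sketched per column as below. The function name and sample data are hypothetical, and mapping a constant column to 0.0 is an assumption made here to avoid division by zero; the paper does not discuss that case.

```python
def min_max_normalize(table):
    """Scale every column of a row-major table to the range [0, 1]."""
    n_cols = len(table[0])
    normalized = [list(row) for row in table]  # copy so the input is not modified
    for j in range(n_cols):
        column = [row[j] for row in table]
        lo, hi = min(column), max(column)
        span = hi - lo
        for row in normalized:
            # Assumption: a constant column (span == 0) is mapped to 0.0.
            row[j] = (row[j] - lo) / span if span != 0 else 0.0
    return normalized


# Hypothetical data: 3 manufacturers, 2 eigenvalues on very different scales.
data = [
    [10.0, 0.2],
    [20.0, 0.4],
    [30.0, 0.8],
]
scaled = min_max_normalize(data)
```

After scaling, both columns span the same range, so neither dominates a Euclidean distance purely because of its units.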

K-means++ Algorithm
The K-means++ algorithm is an improved version of the K-means algorithm. It is an unsupervised learning algorithm and a widely used clustering algorithm. Compared with K-means, K-means++ converges faster and gives higher clustering quality, while reducing the possibility of the algorithm falling into local minima. To better describe the difference between the two algorithms, the flow of the K-means algorithm is described first.
1. First, the value of k is determined, which fixes how many clusters the data are finally divided into. Given the number of manufacturers, k is taken from the range [2, x-1], where x is the total number of manufacturers (12 in this dataset).
2. Next, k data points are randomly selected as the initial cluster centroids. Each centroid is the vector of admixture eigenvalues of one manufacturer.
3. For each data point, its distance from the centroid of every cluster is calculated based on the 72-dimensional eigenvalues described previously, which represent the quality properties of the manufacturer's material. The point is then assigned to the cluster with the nearest centroid.
4. The average of all data points within each cluster is calculated and used as the new centroid of that cluster.
5. Steps 3 and 4 are repeated until the cluster centroids no longer change or the algorithm reaches a set number of iterations.
6. The algorithm ends by outputting the data points contained in each of the k clusters.
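The six steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used in the paper: the function names, the fixed random seed, the sample points, and the choice to keep the old centroid when a cluster becomes empty are all assumptions.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def kmeans(points, k, max_iter=100, seed=0):
    rng = random.Random(seed)
    # Step 2: pick k distinct data points at random as initial centroids.
    centroids = [list(p) for p in rng.sample(points, k)]
    labels = None
    for _ in range(max_iter):
        # Step 3: assign each point to the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Step 5: unchanged assignments imply unchanged centroids, so stop.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 4: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = mean_point(members)
    # Step 6: return the cluster label of every point and the final centroids.
    return labels, centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids = kmeans(points, k=2)
```

On this toy input the two nearby pairs end up in separate clusters regardless of which points are drawn as initial centroids.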
Because the initial centroids of the K-means algorithm are selected at random, the final result can easily be a local minimum instead of the global minimum. The K-means++ algorithm overcomes this shortcoming by improving the way the initial cluster centroids are selected, which improves the stability of the algorithm. Specifically, step 2 above is changed as follows: the first cluster centroid is selected at random, and each subsequent centroid is drawn by weighted random selection, so that a point further away from the centroids already selected has a higher probability of becoming the next centroid. This change spreads the initial centroids more evenly over the data and improves the overall stability and efficiency of the algorithm.
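The modified step 2 can be sketched as below. The paper only states that farther points are more likely to be chosen; weighting by the squared distance to the nearest centroid already selected is the standard K-means++ rule and is assumed here. The function names, seed, and sample points are hypothetical, and the sketch assumes the data contain at least k distinct points.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans_pp_init(points, k, seed=0):
    """Choose k initial centroids with distance-weighted random selection."""
    rng = random.Random(seed)
    # The first centroid is drawn uniformly at random.
    centroids = [list(rng.choice(points))]
    while len(centroids) < k:
        # Squared distance from each point to its nearest selected centroid.
        weights = [
            min(squared_distance(p, c) for c in centroids) for p in points
        ]
        # Farther points get larger weights; already selected points get 0.
        next_point = rng.choices(points, weights=weights, k=1)[0]
        centroids.append(list(next_point))
    return centroids


# Hypothetical 2-D points forming two well-separated groups.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
seeds = kmeans_pp_init(points, k=2)
```

The returned centroids would replace the uniformly random choice in step 2, and the remaining steps of K-means stay unchanged.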
The K-means++ algorithm is run with k set from 2 to 11, and the dataset processed in the previous section is input into the model to classify the twelve water reducing agent manufacturers, S01 to S12. The classification results are shown in Table 3, which gives the manufacturers assigned to each class for each value of k. Manufacturers clustered into the same class have similar eigenvalues. Therefore, in practical engineering, when a manufacturer is out of stock, another manufacturer from the same class can be selected as a substitute.

Related test indicators of K-means++
To quantify the effectiveness of the clustering model, this paper introduces three evaluation metrics commonly used for clustering algorithms. Analyzing these metrics allows the results of the model to be evaluated more objectively.
The Silhouette Coefficient (SC) evaluates the quality of a clustering, and its value lies in the range [-1, 1]. It reflects both the cohesion of each data point with the cluster it belongs to and its separation from the other clusters, and is calculated as shown in equation (2).

$$SC(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}} \tag{2}$$

where $a(i)$ denotes the average distance between data point $i$ and all other points in its cluster, and $b(i)$ denotes the smallest average distance between data point $i$ and the data points of any other cluster. For each data point, the closer the SC value is to 1, the better the clustering effect. The SC values of all points are then averaged to give the overall SC value, and the closer this value is to 1, the better.
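A minimal sketch of the Silhouette Coefficient, following the definitions of a(i) and b(i) given above with Euclidean distance, is shown below. The function names and sample data are hypothetical. Assigning a score of 0 to points in single-member clusters is a common convention assumed here, and the sketch requires at least two clusters.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def silhouette_coefficient(points, labels):
    """Mean silhouette value over all points (needs at least two clusters)."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    scores = []
    for p, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a(i): mean distance to the other members of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b(i): smallest mean distance to the members of any other cluster.
        b = min(
            sum(euclidean(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical points with a sensible and a poor labelling.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
sc_good = silhouette_coefficient(points, [0, 0, 1, 1])
sc_bad = silhouette_coefficient(points, [0, 1, 0, 1])
```

The labelling that groups nearby points together yields a value close to 1, while the labelling that mixes the two groups yields a negative value.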
The Davies-Bouldin Index (DBI) also assesses whether the cluster division is reasonable; a smaller DBI indicates more effective clustering. It is calculated as shown in equation (3).

$$DBI = \frac{1}{K} \sum_{i=1}^{K} \max_{j \neq i} \frac{S_i + S_j}{\lVert w_i - w_j \rVert_2} \tag{3}$$

where $K$ denotes the number of clusters and $S_i$ denotes the average distance from the data points in cluster $i$ to its centroid; the Euclidean distance is used as the distance measure in this paper. $\lVert w_i - w_j \rVert_2$ denotes the distance from the centroid of cluster $i$ to the centroid of cluster $j$.
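The Davies-Bouldin Index can be sketched from the quantities defined above: the within-cluster scatter S_i and the Euclidean distance between centroids. The function names and sample data are hypothetical, and the sketch assumes at least two clusters with distinct centroids.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def davies_bouldin_index(points, labels):
    """Average over clusters of the worst (S_i + S_j) / centroid-distance ratio."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    keys = sorted(clusters)
    centroids = {key: mean_point(clusters[key]) for key in keys}
    # S_i: mean distance from the members of cluster i to its centroid.
    scatter = {
        key: sum(euclidean(p, centroids[key]) for p in clusters[key])
        / len(clusters[key])
        for key in keys
    }
    total = 0.0
    for i in keys:
        total += max(
            (scatter[i] + scatter[j]) / euclidean(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)


# Hypothetical points and labels: two tight, well-separated clusters.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
dbi = davies_bouldin_index(points, [0, 0, 1, 1])
```

For this toy input each cluster has scatter 0.5 and the centroids are far apart, so the index is small, consistent with lower values indicating better clustering.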
The Calinski-Harabasz Index (CHI) evaluates the degree of separation between clusters and the degree of aggregation within clusters, with larger values indicating better clustering. It is calculated as shown in equation (4).

$$CHI = \frac{BCSS / (K - 1)}{WCSS / (N - K)} \tag{4}$$

where $N$ denotes the total number of data points and $K$ denotes the number of clusters. BCSS denotes the total variance between clusters and is calculated as shown in equation (5). WCSS denotes the total variance within clusters and is calculated as shown in equation (6).

$$BCSS = \sum_{i=1}^{K} n_i \lVert w_i - \bar{w} \rVert_2^2 \tag{5}$$

$$WCSS = \sum_{i=1}^{K} \sum_{x \in C_i} \lVert x - w_i \rVert_2^2 \tag{6}$$

Here $C_i$ is the set of data points in cluster $i$, $n_i$ is the number of points in it, $w_i$ is its centroid, and $\bar{w}$ is the centroid of the whole dataset.

Fig. 3 shows that the SC decreases gradually as k increases, with the steepest decline occurring when k changes from 3 to 4. When k = 2 or 3, the SC stays at a relatively high level. Fig. 4 shows that as k increases, the DBI first decreases and reaches a low point at k = 3, then rises and turns again at k = 6, after which it gradually decreases for k > 6. Fig. 5 shows that as k increases, the CHI first decreases and then turns at k = 5, after which it gradually increases.
Taking the three indexes together, with the SC as the main evaluation index and the other two as secondary indicators, the clustering effect is best when k = 3. At this value the SC remains at a high level, the DBI is at a low level, and the CHI is higher than at the several k values that follow.
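For reference, the Calinski-Harabasz Index used in the comparison above can be sketched as the ratio of between-cluster to within-cluster dispersion, each divided by its degrees of freedom. The function names and sample data are hypothetical, and the sketch assumes at least two clusters, more points than clusters, and a non-zero within-cluster sum of squares.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return [sum(coords) / len(points) for coords in zip(*points)]


def calinski_harabasz_index(points, labels):
    """(BCSS / (K - 1)) / (WCSS / (N - K)) for a given labelling."""
    clusters = {}
    for p, label in zip(points, labels):
        clusters.setdefault(label, []).append(p)
    n = len(points)
    k = len(clusters)
    overall = mean_point(points)
    bcss = 0.0  # between-cluster sum of squares
    wcss = 0.0  # within-cluster sum of squares
    for members in clusters.values():
        centroid = mean_point(members)
        bcss += len(members) * squared_distance(centroid, overall)
        wcss += sum(squared_distance(p, centroid) for p in members)
    return (bcss / (k - 1)) / (wcss / (n - k))


# Hypothetical points and labels: two tight, well-separated clusters.
points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
chi = calinski_harabasz_index(points, [0, 0, 1, 1])
```

On this toy input the between-cluster term is 200 and the within-cluster term is 1, so compact, well-separated clusters produce a large index value.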

Conclusion
In this paper, 12 manufacturers supplying water reducing agent for highway concrete were clustered. First, 72-dimensional feature data were selected by analyzing the detection indicators and their statistical values. The data were then cleaned, and missing values were filled with mean values. Next, the K-means++ algorithm was used to cluster the 12 manufacturers, with the SC, DBI, and CHI as evaluation indicators. The clustering effect was found to be best when k was set to 3. If a larger number of clusters is required, k = 7 can be chosen as a second-best value, at which the evaluation index scores are also relatively good. The K-means++ approach used in this paper also has reference value for problems of the same type, for which K-means++ clustering analysis can be tried.

Figure 2. Box plots of eigenvalues after normalization.

Table 1. Numbers of tests for detections.

Table 2. Processed part of the data.