Machine learning based short fatigue crack growth rate prediction for aluminium alloys

Because fatigue characteristics are influenced by intricate microstructures, the fatigue crack growth rate in the short crack regime is more difficult to fit with a small number of parameters than that in the long crack regime. Traditional models are therefore more restricted in describing the nonlinear relation between the fatigue crack growth rate and the stress intensity factor range in the short crack regime. Owing to their high prediction accuracy and their flexibility in describing nonlinearity, machine learning approaches have received increasing attention and offer a way around limitations in fatigue-related problems that traditional solutions cannot overcome. In this paper, two machine learning algorithms, the k-nearest neighbour algorithm (KNN) and random forest (RF), are implemented to predict the short fatigue crack growth rate for 2024-T3 and LC9cs aluminium alloys. The testing outcomes of the two algorithms are compared to evaluate their prediction abilities. The results reveal that the values of the Pearson correlation coefficient $R^2$ of the KNN are generally higher than those of the RF for each material. Both models perform well in accuracy and effectiveness, and both show excellent extrapolation capability in predicting the nonlinearity.


Introduction
Fatigue of alloy materials accounts for a large proportion of the failures of engineering components and has long been a focus of research. Crack initiation and early propagation in the short crack stage account for most of the fatigue life [1], so a large number of studies focus on the behaviour of short fatigue crack growth (FCG).
To formulate a law suitable for short crack propagation, Elber [2] defined the effective stress intensity factor (SIF) range $\Delta K_{eff}$ to describe the FCG process, as given in Eq. 1. The expression for $\Delta K_{eff}$ is given in Eq. 2, in which $K_{op}$ is the measured crack opening stress intensity factor:
$$\frac{da}{dN} = C\left(\Delta K_{eff}\right)^{m} \tag{1}$$
$$\Delta K_{eff} = K_{max} - K_{op} \tag{2}$$
where $C$ and $m$ are material constants and $K_{max}$ is the maximum stress intensity factor. However, the theoretical crack growth rate is larger than the experimental one [3].
Santus and Taylor [4] ...
To overcome this, machine learning (ML) methods are a better choice because of their accuracy in nonlinear regression and their flexibility in handling variables. Han et al. [5] proposed an artificial neural network model combined with a hybrid genetic algorithm (GA-ANN) to predict the thermal fatigue life of microelectronic chips. Wang and Xie [6] developed a numerical calculation model to predict curved-crack FCG failure by combining the artificial neural network (ANN) with FCG models. Zhang et al. [7] proposed a deep learning model for fatigue life prediction of components made of 316 austenitic stainless steel. Nonetheless, the prediction accuracy shows high scatter for some datasets.
In this paper, two ML models, the k-nearest neighbour algorithm (KNN) and random forest (RF), are employed to predict short FCG behaviour. Each model is operated in a multi-input, single-output configuration. So that the methods can be applied flexibly, the basic theory of each ML approach under nonlinear conditions is introduced first. Each model is then trained and tested with experimental data, and the prediction results are presented and compared with the experimental datasets of 2024-T3 and LC9cs aluminium alloys. The Pearson correlation coefficient $R^2$ is used to quantify the prediction ability of the models, and the accuracy of their predictions is analyzed using the probability density function (PDF). Finally, future work on ML for short FCG is discussed.

K-Nearest Neighbour Algorithm
The k-nearest neighbour (KNN) algorithm is a powerful tool that can be applied to both classification and regression problems [8]. It uses feature similarity to predict a numerical target: a new data point receives a prediction according to how closely it resembles the points in the training data. Since the present problem is a regression problem, the average value of the k nearest neighbours in the training data is taken as the prediction for the unseen data. In this study, the Euclidean distance $D$ between data points is calculated as shown in Eq. 4:
$$D = \sqrt{\sum_{i}\left(X_i - Y_i\right)^2} \tag{4}$$
where $X_i$ and $Y_i$ are the i-th components of the new point and of the existing point, respectively. The value of k is a hyperparameter that needs to be tuned to attain a better outcome. In addition, the training dataset should be standardized before the algorithm is applied.
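The KNN regression procedure described above (standardize the inputs, compute Euclidean distances, average the targets of the k nearest neighbours) can be sketched as follows. This is a minimal standard-library illustration and not the authors' implementation; the function names, the toy data and the choice k = 2 are assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each feature column; return scaled rows, means and std devs."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    # A constant column has zero spread; fall back to 1.0 to avoid dividing by zero.
    stds = [pstdev(c) or 1.0 for c in columns]
    scaled = [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in rows]
    return scaled, means, stds


def euclidean(x, y):
    """Euclidean distance D between two feature vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def knn_predict(train_X, train_y, query, k=2):
    """Predict the target at `query` as the mean target of its k nearest neighbours."""
    scaled_X, means, stds = standardize(train_X)
    # The query is scaled with the statistics of the training data.
    q = [(v - m) / s for v, m, s in zip(query, means, stds)]
    ranked = sorted(zip(scaled_X, train_y), key=lambda pair: euclidean(pair[0], q))
    return mean([target for _, target in ranked[:k]])


# Hypothetical toy data: features are (SIF range, maximum stress, stress ratio),
# the target is log10 of the crack growth rate. Values are illustrative only.
train_X = [[1.0, 120.0, 0.0], [2.0, 120.0, 0.0], [3.0, 145.0, 0.0], [4.0, 145.0, 0.0]]
train_y = [-9.0, -8.5, -8.0, -7.6]

print(knn_predict(train_X, train_y, [1.0, 120.0, 0.0], k=2))
```

Standardizing first keeps a feature with large numerical values (here the maximum stress) from dominating the distance. In practice k would be tuned, for example by comparing the testing error for several values.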

Random Forest
The RF model is an ensemble model that creates a number of decision trees [9,10]. Each decision tree can be pictured as an upside-down tree: the top represents the root and the terminal nodes represent the leaves. A decision tree can therefore be viewed as a binary tree in which each intermediate node has two outputs. Each node evaluates the inputs with a test function, and the values are then passed down the branches according to the sample characteristics. Finally, the outputs of all trees are averaged to obtain the predicted result $y^p$, as shown in Eq. 5:
$$y^p = \frac{1}{k}\sum_{i=1}^{k} y_i^p \tag{5}$$
where $k$ is the number of decision trees grown from the training dataset and $y_i^p$ represents the output of the i-th tree. The topological structure of the RF is shown in Fig.1.
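The averaging of tree outputs described above can be illustrated with a small bagged ensemble of regression trees. This is a simplified sketch under stated assumptions, not the authors' model: each tree is grown on a bootstrap sample, each node splits on one randomly chosen feature, and the names `build_tree`, `fit_forest` and `predict_forest`, the depth limit and the toy data are all hypothetical.

```python
import random
from statistics import mean


def build_tree(rows, targets, rng, depth=0, max_depth=3):
    """Grow a binary regression tree; a leaf stores the mean target of its samples."""
    if depth >= max_depth or len(set(targets)) == 1:
        return mean(targets)
    feature = rng.randrange(len(rows[0]))  # assumption: one random feature per node
    best = None
    for threshold in sorted({row[feature] for row in rows})[1:]:
        left = [t for row, t in zip(rows, targets) if row[feature] < threshold]
        right = [t for row, t in zip(rows, targets) if row[feature] >= threshold]
        ml, mr = mean(left), mean(right)
        # Test function: choose the threshold with the smallest squared error.
        sse = sum((t - ml) ** 2 for t in left) + sum((t - mr) ** 2 for t in right)
        if best is None or sse < best[0]:
            best = (sse, threshold)
    if best is None:  # the chosen feature is constant in this node
        return mean(targets)
    threshold = best[1]
    left = [(r, t) for r, t in zip(rows, targets) if r[feature] < threshold]
    right = [(r, t) for r, t in zip(rows, targets) if r[feature] >= threshold]
    return (feature, threshold,
            build_tree([r for r, _ in left], [t for _, t in left], rng, depth + 1, max_depth),
            build_tree([r for r, _ in right], [t for _, t in right], rng, depth + 1, max_depth))


def predict_tree(node, x):
    """Follow the branches from the root until a leaf value is reached."""
    while isinstance(node, tuple):
        feature, threshold, left, right = node
        node = left if x[feature] < threshold else right
    return node


def fit_forest(rows, targets, n_trees=10, seed=0):
    """Train n_trees trees, each on a bootstrap sample of the training data."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(rows)) for _ in rows]
        forest.append(build_tree([rows[i] for i in idx], [targets[i] for i in idx], rng))
    return forest


def predict_forest(forest, x):
    """Average the outputs of all trees to obtain the forest prediction."""
    return mean([predict_tree(tree, x) for tree in forest])


# Hypothetical toy data: (SIF range, maximum stress, stress ratio) -> log10(da/dN).
train_X = [[1.0, 120.0, 0.0], [2.0, 120.0, 0.0], [3.0, 145.0, 0.0], [4.0, 145.0, 0.0]]
train_y = [-9.0, -8.5, -8.0, -7.6]
forest = fit_forest(train_X, train_y, n_trees=10, seed=0)
print(predict_forest(forest, [2.5, 130.0, 0.0]))
```

Because every leaf holds an average of training targets, the forest output always lies between the smallest and largest target seen in training, which is worth keeping in mind when such a model is asked to extrapolate.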

Fatigue Crack Growth Datasets
Fig.1 The structure diagram of RF with k decision trees
At present, the FCG rate is usually expressed as a relation between da/dN and $\Delta K$, and the corresponding data are used to present the prediction results of the ML models. In this paper, the FCG datasets of 2024-T3 [11] and LC9cs [12] aluminium alloys are divided into training and testing data. The training data are first used to train the ML models; the testing data are then used to test the performance of these models.
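A division into training and testing data could look like the sketch below. The split ratio is not stated in the text, so the 80/20 split, the fixed random seed, the function name `split_dataset` and the synthetic records are all assumptions for illustration.

```python
import random


def split_dataset(samples, test_fraction=0.2, seed=0):
    """Shuffle the samples reproducibly and return (training, testing) lists."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# Hypothetical records: (SIF range, maximum stress, stress ratio, crack growth rate).
records = [(float(i), 120.0, 0.0, 1e-9 * (i + 1)) for i in range(10)]
train, test = split_dataset(records)
print(len(train), len(test))
```

Shuffling before splitting avoids placing all points from one stress level or one part of the $\Delta K$ range into the same subset.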

Prediction results of machine learning models
In this section, the experimental datasets of 2024-T3 and LC9cs aluminium alloys are employed to characterize the FCG rate in the short crack regime in terms of stress intensity factor (SIF) ranges, maximum stress levels and stress ratios. The best prediction results of the two ML models, obtained after several rounds of trial and error, are presented in 3D figures together with the corresponding experimental data points.
The detailed performance of the ML models is compared to identify the characteristics of each model. The Pearson correlation coefficient $R^2$ is used to represent the prediction accuracy of the ML models. In general, the closer $R^2$ is to its maximum value of 1, the higher the prediction accuracy of the model. In this work, the coefficients are presented in Tab.1 to show the detailed differences between the two models. Fig.3a) and b) compare the experimental data with the predictions made by KNN and RF for Al2024-T3, and Fig.3c) and d) do the same for the LC9cs aluminium alloy. It can be clearly seen from the graphs that both ML models accurately fit the nonlinearity of the short FCG behaviour. Furthermore, the predicted surface along a constant stress ratio R is nonlinear, which means that the results of the ML methods are more consistent with the experimental data than those of theoretical models. As shown in Fig.3a) and b), the KNN fits the nonlinearity very well, with a better predicted curved surface. As seen in Fig.3c) and d), the surfaces differ in their trends in the region marked by the red arrows: the trend of the KNN is nearly flat, while that of the RF is downward. Moreover, the RF has a better extrapolating ability than the KNN. This leads to the results shown in Tab.1, which are influenced by the scatter of the data points. Therefore, all outcomes indicate that the distribution and total number of data points play a dominant role in the predictive performance.
Fig.3 The experimental data and corresponding results predicted by KNN and RF. Note that: (a) and (c) predicted by KNN; (b) and (d) predicted by RF.
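For reference, a squared Pearson correlation coefficient between experimental and predicted values can be computed as in the sketch below. It assumes that $R^2$ here means the square of the Pearson coefficient, as the name used in the text suggests; the function names and sample values are hypothetical.

```python
import math
from statistics import mean


def pearson_r(measured, predicted):
    """Pearson correlation coefficient between two equally long sequences."""
    mx, my = mean(measured), mean(predicted)
    cov = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sx = math.sqrt(sum((x - mx) ** 2 for x in measured))
    sy = math.sqrt(sum((y - my) ** 2 for y in predicted))
    return cov / (sx * sy)


def r_squared(measured, predicted):
    """Squared Pearson coefficient; 1 indicates a perfect linear relation."""
    return pearson_r(measured, predicted) ** 2


# Hypothetical values of log10(da/dN): experimental versus model prediction.
measured = [-9.0, -8.5, -8.0, -7.6]
predicted = [-8.9, -8.6, -8.1, -7.5]
print(r_squared(measured, predicted))
```

Note that this quantity equals 1 for any perfect linear relation, even a biased one, whereas the coefficient of determination $1 - SS_{res}/SS_{tot}$ also penalizes systematic offsets; the two coincide only in special cases.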

Evaluation of prediction accuracy of machine learning models
To show the prediction accuracy of both models intuitively, the probability density function (PDF) of the prediction errors is used. The results for the short crack behaviour of Al2024-T3 at R = 0 are shown in Fig.4a) and b), and those for LC9cs at R = -1 in Fig.4c) and d). The PDF summarizes the outcomes through the mean and standard deviation of the prediction error. It can be seen from all figures that the prediction accuracy of both models is acceptable, with a relatively narrow range of the PDFs of the prediction errors. In addition, the PDFs follow the extent of dispersion in the experimental data. As shown in Fig.4a) and b), at R = 0 with maximum stress levels of 120 MPa and 145 MPa, the scatter in the data at 120 MPa is larger than that at 145 MPa; hence, the range of the PDF on the horizontal coordinate at 120 MPa is relatively wider. Conversely, as shown in Fig.4c) and d) for LC9cs, the PDF for 70 MPa covers a wider range on the horizontal coordinate, which can be attributed to the larger spread of the data shown in Fig.3c) and d). In addition, as shown in Tab.1, since the KNN has a larger $R^2$ at every stress level, the result shown in Fig.4c) is better.
Fig.4 Probability density function of prediction errors. Note that: (a) and (c) predicted by KNN; (b) and (d) predicted by RF.
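A PDF built from the mean and standard deviation of the prediction errors can be sketched as follows. Two assumptions are made here because the text does not specify them: the error is taken as the predicted value minus the experimental value, and the errors are modelled with a normal distribution. The function name and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def error_distribution(measured, predicted):
    """Fit a normal distribution to the prediction errors (predicted - measured)."""
    errors = [p - m for m, p in zip(measured, predicted)]
    return NormalDist(mean(errors), stdev(errors))


# Hypothetical values of log10(da/dN): experimental versus model prediction.
measured = [-9.0, -8.5, -8.0, -7.6]
predicted = [-8.9, -8.6, -8.1, -7.5]
dist = error_distribution(measured, predicted)

# The mean locates the PDF on the horizontal axis; the standard deviation sets its width.
print(dist.mean, dist.stdev, dist.pdf(0.0))
```

A narrow PDF centred near zero corresponds to small, unbiased errors, while larger scatter in the data widens the curve.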

Conclusion
The above study demonstrates that ML-based models have many advantages in dealing with the nonlinearity of fatigue crack growth in the short crack regime. Nonetheless, the performance of ML models relies significantly on the extent of dispersion of the data points and the abundance of the experimental datasets. In addition, the results would be better with data that include all the nonlinear information within all possible ranges of $\Delta K$, R and $S_{max}$. Hence, an appropriate set of FCG data needs to be established, and further optimization should address the following factors:
(1) A sufficient number of experimental data points.
(2) The uniformity of the distribution of data points over the entire predicted area.
(3) The extent of dispersion of the datasets.
The efficiency and predictive power of these ML models will improve when these factors are optimized.