Clear Water Scour Depth Prediction using Gradient Boosting Machine and Deep Learning

The scouring process adjacent to spur dikes has the potential to compromise the stability of riverbanks. Hence, precise estimation of the maximum scour depth in the vicinity of spur dikes is necessary in river engineering. Nevertheless, determining the maximum scour depth has proven to be a challenging task, primarily due to the complex nature of the scour phenomena associated with these structures. In this study, two data-driven models, namely the Gradient Boost Machine (GBM) and Deep Learning (DL), were developed to predict the clear water scour depth near a spur dike. A total of 154 distinct observations were collected from previous literature; 103 observations were utilized for training the models, while 51 observations were allocated for validation. Several assessment measures were employed to evaluate the performance of the models, including the correlation coefficient (R), the coefficient of determination (R2), scatter plots, variation plots, and box plots. GBM outperformed DL on the basis of the above-mentioned assessment measures. Sensitivity analysis suggests that l/d50 is the most influential input parameter. Thus, it is concluded that both data-driven models can be used to predict the clear water scour depth around spur dikes, but GBM has the highest accuracy.


Introduction
The main purpose of spur dikes, which are often artificial hydraulic constructions, is to protect the riverbanks. They are typically employed as river training works (Kothyari et al., 2007; Ettema & Muste, 2004). The direction of the entering flow is perpendicular to (or at an angle with) a single spur dike or a group of spur dikes (Pandey et al., 2019). Spur dikes efficiently protect riverbanks by reducing the stream's flow velocity (Pourshahbaz et al., 2020). Among river training structures, spur dikes are recognized as efficient and inexpensive methods of protecting riverbanks from scouring (Kuhnle et al., 2002; Kothyari & RangaRaju, 2001). Additionally, spur dikes play a significant role in waterways by enhancing stream stability and navigational conditions.
In fluvial hydraulics, local scour is defined as the loss of sediment particles from alluvial streams. It may result in the collapse of hydraulic structures such as spur dikes, bridge piers, and abutments (Oliveto & Hagen, 2002). Localized scouring phenomena in streams are usually triggered either by human activity or by natural events (Raudkivi & Ettema, 1983). According to research conducted by the United States Federal Highway Administration, local scour around piers and abutments was responsible for the collapse of more than 350 bridges in 1973 (Richardson & Davis, 2001). The collapses of these hydraulic structures demonstrate how crucial it is to predict the scour depths near such structures. Using experimental data sets, Pandey et al. (2016) studied spur dike scour, evaluated the precision of scour depth equations, and then formulated an empirical equation to determine the maximum scour depth. Kothyari et al. (2007) analysed both single and multiple spur dikes with the objective of quantifying the temporal variation in scour depth, and subsequently formulated a mathematical equation to forecast the temporal variation in scour depth for piers, abutments, and spur dikes. These investigations highlight the importance of a comprehensive understanding of scour phenomena when establishing the foundations of hydraulic structures. In recent years, data-driven models have been used extensively to address various challenges in the fields of hydraulics, environmental science, agriculture, and civil engineering. Singh et al. successfully established a strong correlation between their findings and empirical evidence through multiple soft computing methodologies for estimating scour depths in the vicinity of river pipes. Najafzadeh et al. (2015) employed a soft computing approach to estimate the extent of downstream scour occurring at various hydraulic structures, including sluice gates, ski-jump bucket spillways, and bridge piers. Keeping in view the vast applications of soft computing techniques in civil and hydraulic engineering, this study aims to assess the effectiveness of two data-driven models, Gradient Boost Machine (GBM) and Deep Learning (DL), in predicting the clear water scour depth around spur dikes.

2.1. Data Collection
The dataset was obtained from various literature (Ezzeldin et al., 2007; Pandey et al., 2016; Zaghloul, 1983; Nasrollahi et al., 2008). The dataset comprises several variables, including the length of the spur dike (l), the approach flow velocity (v), the threshold velocity of the bed particles (vs), the approach flow depth (y), the median diameter of the bed particles (d50), the densimetric Froude number (Fd50), the flume width (B), the Reynolds number (Re), and the clear water scour depth (ds). The variable l ranges from 0.045 to 0.75 m, while v ranges from 0.195 to 0.43 m/s. The variable vs ranges from 0.182 to 0.705 m/s, and y ranges from 0.005 to 0.53 m. The variable d50 ranges from 0.6 to 4.49 mm, while Fd50 varies from 1.45 to 4.1. Furthermore, B ranges from 0.4 to 2 m, Re varies from 12700 to 13500, and ds varies from 4.7 to 74.7 cm.
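To make the preprocessing concrete, the raw quantities above can be converted to the dimensionless inputs used later in the paper. This is a minimal sketch: the sample values are hypothetical (merely chosen within the reported ranges, not data from the study), and the densimetric Froude number is computed under the common assumption of a relative submerged sediment density of 1.65.

```python
import numpy as np

# Hypothetical raw observations, chosen within the ranges reported above
l = np.array([0.15, 0.30, 0.45])          # spur dike length (m)
v = np.array([0.25, 0.30, 0.40])          # approach flow velocity (m/s)
vs = np.array([0.30, 0.45, 0.60])         # threshold velocity of bed particles (m/s)
y = np.array([0.10, 0.20, 0.30])          # approach flow depth (m)
d50 = np.array([0.0008, 0.0015, 0.003])   # median particle diameter (m)
ds = np.array([0.10, 0.25, 0.50])         # observed scour depth (m)

# Dimensionless inputs and response used by the models
v_vs = v / vs
y_l = y / l
l_d50 = l / d50
ds_l = ds / l

# Densimetric Froude number, assuming relative submerged density (s - 1) = 1.65
Fd50 = v / np.sqrt(1.65 * 9.81 * d50)
```

Each raw record thus collapses to one row of (v/vs, y/l, l/d50, Fd50) with ds/l as the target.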

2.2. Dimensionless Analysis and Data Processing
Following previous research (Pandey et al., 2021), the dimensionless functional relationship is:

ds/l = f(v/vs, y/l, l/d50, Fd50)

The gathered raw data were processed into the independent dimensionless variables (v/vs, y/l, l/d50, and Fd50) as the inputs and the maximum clear water scour depth (ds/l) as the response variable.

3. Data-driven models
3.1. GBM:
Boosting algorithms were introduced to the machine learning community by Freund (1995). The gradient boosting machine algorithm employs a unique approach, combining weak learners to construct a robust learner. With the addition of each weak learner, a new model is trained to improve the accuracy of the estimated response variable. Each newly introduced weak learner is fitted to the negative gradient of the loss function associated with the current ensemble. The primary objective of the GBM is thus to enhance predictive accuracy by combining an ensemble of comparatively weak prediction models.

The combination is determined by the input values of the neurons, denoted as xi and yi. A rectifier-type activation function has been observed to exhibit superior accuracy. Different distribution arguments, such as Bernoulli, Poisson, Gaussian, and Huber, are employed in conjunction with a designated loss function: Bernoulli is associated with cross-entropy, a metric commonly employed in classification tasks; Gaussian is linked to the mean squared error, a prevalent measure in statistical analyses; and Huber is associated with the Huber loss function, a robust estimator used when outliers may significantly affect performance. Cross-entropy was employed in the present investigation, with its computation derived from the work of Martinez and Stiefelhagen (2019):

E = − Σ_{j=1}^{z} t_j log(O_j)

where t_j and O_j are the predicted and actual outputs, respectively, and z is the number of output units.
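The GBM workflow described in this section can be sketched with scikit-learn's GradientBoostingRegressor as a stand-in for the implementation used in the study. Everything below is illustrative: the data are synthetic (random values in plausible dimensionless ranges with an arbitrary target function), and the hyperparameters are assumptions, not the study's tuned values.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for the 154 observations: columns are (v/vs, y/l, l/d50, Fd50)
X = rng.uniform([0.4, 0.1, 50.0, 1.45], [1.0, 2.0, 600.0, 4.1], size=(154, 4))
# Arbitrary smooth target standing in for ds/l
y = 0.5 * X[:, 0] + 0.2 * X[:, 1] + 0.001 * X[:, 2] + 0.1 * X[:, 3] + rng.normal(0, 0.01, 154)

# 103 training / 51 testing observations, mirroring the split used in the study
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=103, random_state=0)

# Each new tree is fitted to the negative gradient of the loss of the current ensemble
gbm = GradientBoostingRegressor(n_estimators=200, learning_rate=0.05, max_depth=3)
gbm.fit(X_tr, y_tr)
pred = gbm.predict(X_te)
```

The `learning_rate` shrinks each weak learner's contribution, so many shallow trees accumulate into the strong learner described above.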

Performance Evaluation Criteria:
The precision of the values predicted by both models was evaluated using two performance evaluation criteria, namely the correlation coefficient (R) and the coefficient of determination (R2). R varies from −1 to 1, while R2 varies from 0 to 1. The formulas for the two criteria are:

R = Σ_{i=1}^{t} (Scour_actual,i − mean(Scour_actual)) (Scour_predicted,i − mean(Scour_predicted)) / √[ Σ_{i=1}^{t} (Scour_actual,i − mean(Scour_actual))² × Σ_{i=1}^{t} (Scour_predicted,i − mean(Scour_predicted))² ]

R2 = (R)²

where Scour_actual,i = actual values of the clear water scour depth, Scour_predicted,i = values of the clear water scour depth predicted by the data-driven models, mean(Scour_actual) = mean of the actual values, mean(Scour_predicted) = mean of the predicted values, and t is the number of observations.
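The two criteria can be computed directly as a check. This is a minimal sketch with hypothetical scour-depth values (not data from the study); R2 is taken here as the square of the Pearson correlation coefficient.

```python
import numpy as np

def correlation_coefficient(actual, predicted):
    """Pearson correlation coefficient R between observed and predicted scour depths."""
    a = actual - actual.mean()
    p = predicted - predicted.mean()
    return np.sum(a * p) / np.sqrt(np.sum(a**2) * np.sum(p**2))

def coefficient_of_determination(actual, predicted):
    """R2 computed as the square of the correlation coefficient."""
    return correlation_coefficient(actual, predicted) ** 2

# Hypothetical scour depths in cm, within the reported 4.7-74.7 cm range
actual = np.array([4.7, 20.0, 35.5, 50.2, 74.7])
predicted = np.array([5.1, 19.2, 36.0, 49.5, 73.8])
```

For well-matched predictions like these, R approaches 1 and R2 approaches R.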

Results and Discussion:
To assess the usefulness of the gradient boosting machine and deep learning in predicting clear water scour depth around spur dikes, experimental data were obtained from Ezzeldin et al. (2007), Pandey et al. (2016), Zaghloul (1983), and Nasrollahi et al. (2008). Correlation coefficient (R) and coefficient of determination (R2) values are used to compare the performance of GBM and DL. The performance of both models depends upon user-defined parameters; the details of these parameters are summarized in Table 2. These parameters were obtained by a trial-and-error process (Kumar and Singh, 2021). Table 3 provides the results in terms of R and R2 values for the training and testing datasets in the prediction of clear water scour depth around spur dikes. A comparison of these values suggests that both models are capable of predicting the clear water scour depth, but the performance of GBM (R = 0.9986 and 0.9970; R2 = 0.9972 and 0.9941 for the training and testing datasets, respectively) is slightly better than that of DL (R = 0.9931 and 0.9867; R2 = 0.9862 and 0.9737 for the training and testing datasets, respectively). The GBM model is therefore highly capable of predicting the clear water scour depth around spur dikes. Although the performance of DL is not bad, it is slightly worse than that of GBM. Thus, on the basis of Table 2, Figure 1, Figure 2, and Figure 3, it is concluded that both modelling techniques give good results in the prediction of clear water scour depth around spur dikes, but GBM has an edge over DL as its predictions are more precise.

Sensitivity Analysis:
Sensitivity analysis is the process by which the importance of the input parameters is calculated. It was performed with both data-driven models, and the results are depicted in Figure 4(a) and (b): Figure 4(a) shows the sensitivity analysis performed using GBM, and Figure 4(b) that performed using DL.
On the basis of both figures (Figure 4(a) and (b)), it is concluded that l/d50 is the most important parameter affecting the clear water scour depth. The parameter y/l is the second most important for GBM, whereas v/vs is second for DL. Fd50 is the third most important parameter for both models. Finally, v/vs ranks last for GBM and y/l for DL, having the least impact on the output (the clear water scour depth around spur dikes). Thus, l/d50 is the most influential parameter on which the clear water scour depth depends.
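A sensitivity ranking of this kind can be reproduced with permutation importance, one common way to score inputs for tree ensembles. The sketch below is illustrative only: it uses synthetic data deliberately constructed so that the column labelled l/d50 dominates the response, so it demonstrates the mechanism rather than the study's actual result.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(1)
names = ["v/vs", "y/l", "l/d50", "Fd50"]

# Synthetic data in which l/d50 (column 2) dominates the response by construction
X = rng.uniform(0.0, 1.0, size=(154, 4))
y = 0.1 * X[:, 0] + 0.3 * X[:, 1] + 1.0 * X[:, 2] + 0.2 * X[:, 3]

model = GradientBoostingRegressor(random_state=0).fit(X, y)
result = permutation_importance(model, X, y, n_repeats=10, random_state=0)

# Rank parameters from most to least influential
ranking = [names[i] for i in np.argsort(result.importances_mean)[::-1]]
```

Permuting a column destroys its relationship with the target; the larger the resulting loss in score, the more influential that input.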

Figure 1: Actual and predicted values of ds/l for training and testing datasets

Figure 4: Sensitivity analysis using (a) GBM and (b) DL

6. Conclusion:
This study was carried out to model the clear water scour depth around spur dikes using GBM and DL. The results suggest promising performance by both models, with the predicted values of both models lying close to the line of agreement. A comparison on the basis of R and R2 suggests that GBM (R = 0.9986 and 0.9970; R2 = 0.9972 and 0.9941 for the training and testing datasets, respectively) performs better than DL (R = 0.9931 and 0.9867; R2 = 0.9862 and 0.9737 for the training and testing datasets, respectively). Considering the relative importance values of the dimensionless variables given by both models, the variable l/d50 is the most significant in predicting the dependent variable, whereas v/vs for GBM and y/l for DL are the least significant parameters.

Table 1: Descriptive analysis of the variables
Table 1 presents the descriptive analysis of the input and output variables. The entire dataset, consisting of 154 observations, was subjected to statistical analysis and subsequently divided into two distinct groups. The larger subset, the training dataset, consisted of 103 observations used for model creation, whereas the second subset, the testing dataset, comprised 51 observations used for model validation.
3.2. DL:
Deep learning models have seen a surge in popularity over the past few years owing to notable attributes such as training stability, strong generalizability, capacity to handle large datasets, hierarchical feature selection, and proficiency in tackling challenging problems (Bengio, 2009; Ismail Fawaz et al., 2019; Xiao et al., 2018). The present study employed a multilayer, feedforward H2O deep learning architecture, with the underlying algorithm elucidated within this sub-section. The model employs a weighted combination of signals, which are aggregated and passed between interconnected neurons to transmit the output signal. The model comprises three distinct layers: an initial input layer, an intermediate layer characterized by multilayer nonlinearity, and a final output layer. The learning algorithm can be denoted as (Bengio, 2009; LeCun et al., 2002):

X_{i+1} = f(W_i X_i + C_{i+1})

where W_i denotes the weight matrix connecting the assembly of layers {X}_{1:O−1}, with i connecting layers within a network comprising O layers; K denotes the training sample; and C_{i+1} signifies the column vector of biases for layer i + 1. A variety of nonlinear activation functions, such as tanh, maxout, or rectified linear, are employed in the computational framework; the rectified linear function follows the works of Goodfellow et al. (2013) and LeCun et al. (2002).
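A feedforward pass of the form X_{i+1} = f(W_i X_i + C_{i+1}) with a rectified-linear f can be sketched in a few lines of NumPy. This is a minimal illustration only: the layer sizes, random weights, and sample input are arbitrary assumptions, not the study's trained H2O network.

```python
import numpy as np

def relu(x):
    """Rectified-linear activation f."""
    return np.maximum(0.0, x)

rng = np.random.default_rng(0)

# 4 dimensionless inputs -> 8 hidden neurons -> 1 output (ds/l); sizes are arbitrary
W1 = rng.normal(0, 0.1, size=(8, 4))   # weight matrix W_1
C2 = np.zeros(8)                       # bias vector C_2 for layer 2
W2 = rng.normal(0, 0.1, size=(1, 8))   # weight matrix W_2
C3 = np.zeros(1)                       # bias vector C_3 for the output layer

def forward(x):
    """One forward pass X_{i+1} = f(W_i X_i + C_{i+1}); linear output for regression."""
    h = relu(W1 @ x + C2)
    return W2 @ h + C3

# Hypothetical input (v/vs, y/l, l/d50, Fd50)
x = np.array([0.8, 0.5, 200.0, 2.5])
out = forward(x)
```

Training then adjusts the W_i and C_{i+1} by backpropagating the loss for each training sample.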