Predicting maximum scour depth at sluice outlet: a comparative study of machine learning models and empirical equations

Estimating the maximum scour depth at sluice outlets is pivotal in hydraulic engineering, directly influencing the safety and efficiency of water infrastructure. This research compared traditional empirical formulas with advanced machine learning (ML) algorithms, including Ridge Regression (RID), Support Vector Machine (SVM), CatBoost (CAT), and XGBoost (XGB), utilizing experimental datasets from prior studies. Performance statistics highlighted the efficacy of the ML algorithms over the empirical formulas, with CAT and XGB leading the way. Specifically, XGB demonstrated superiority with a correlation coefficient (CORR) of 0.944 and a root mean square error (RMSE) of 0.439. Following closely, the CAT model achieved a CORR of 0.940, and SVM achieved 0.898. For the empirical formulas, although CORR values of up to 0.816 and RMSE values down to 0.799 were obtained, these figures still trail most of the ML algorithms. Furthermore, a sensitivity analysis underscored the densimetric Froude number (Fd) as the most crucial factor in the ML models, with influences ranging from 0.839 in RID to 0.627 in SVM. The uncertainty in the ML model estimates was further quantified using the Monte Carlo technique, with 1,000 simulations on the testing datasets. CAT and XGB proved more stable than the other models, providing estimates with mean CORRs of 0.937 and 0.946, respectively; their 95% confidence intervals (CIs) are [0.929–0.944] for CAT and [0.933–0.954] for XGB. These results demonstrate the potential of ML algorithms, particularly CAT and XGB, in predicting the maximum scour depth. Although these models offer high accuracy, with 95% CIs located at higher correlation values than the others, the empirical formulas retain their relevance due to their simplicity and quick computation, which may still make them favored in certain scenarios.


Introduction
Scour is a fundamental geomorphological process that removes and erodes sedimentary particles from riverbeds or coasts due to fluid forces (Amini and Mohammad 2017). The severity of this erosion, particularly around hydraulic structures such as sluices and culverts, poses substantial challenges to the civil engineering and hydraulics communities. One of the primary parameters underpinning this process is the scour depth, which refers to the vertical distance from the original bed level to the deepest point formed after erosion (Dargahi 1990, Ahmed et al 2021). The accurate forecast of maximum scour depth at sluice outlets is crucial for ensuring the safety and longevity of hydraulic structures, as its misestimation can potentially compromise the stability of these structures (Lim and Yu 2002, Karami et al 2011).
Culverts and sluices are important hydraulic structures that regulate discharge or upstream water levels. They are pivotal in flood control, irrigation, and urban drainage systems. Scholars have taken an interest in scouring around such structures over the years because of its direct bearing on their structural safety (Hosseini et al 2016). Prolonged scouring, especially when underestimated or unchecked, can expose the foundations of these hydraulic structures, jeopardizing their stability (Mutlu Sumer 2007, Pizarro et al 2020). The necessity of predicting anticipated local scour geometry is underscored by its significance in the optimal design of sluice outlet foundations (Abt et al 1985, Galán and González 2019). The attempt to provide accurate predictions for local scour downstream of hydraulic constructions has resulted in extensive research to discover effective protective solutions.
For the outlets of hydraulic structures, estimating the maximum scour depth remains challenging due to bed materials, flow conditions, structural dimensions, and additional auxiliary works (Jahangirzadeh et al 2014). Historically, researchers have employed various methodologies, from physical models to mathematical representations and, more recently, machine learning (ML) approaches, to grapple with these challenges. Physical models have long been the foundation of traditional methods to comprehend scour hole geometry and estimate maximum scour depth (Emami and Schleiss 2012, Link et al 2019). These models have served as instrumental platforms for deriving empirical equations that guide scour depth estimation and understanding hole morphology. Notable studies in this domain include those by Olsen and Kjellesvig (1998) and Taha et al (2020), which underscore the potential and complexity of these physical models.
The investigations for determining the scour depth at outlets have been conducted in numerous laboratories under various conditions (Aderibigbe and Rajaratnam 1998, Lu et al 2022). A large number of experiments have been conducted to assess the physical importance of the different dominant conditions for the local scour depth (Abida and Townsend 1991, Sarkar and Dey 2005, Dey and Sarkar 2006). Despite these significant contributions, the practical utility of physical models faces critical challenges. Their construction can be labor-intensive and costly, making alterations concerning initial conditions or dimensions difficult (Aamir and Ahmad 2016). Additionally, their applicability to real-world scenarios is frequently contested due to the limited physical conditions under which they are constructed (Le et al 2022). For instance, many of these models have been exclusively tailored to non-cohesive soil types. The occasional omission of auxiliary structures, such as headwalls and aprons, raises concerns regarding their holistic relevance. The often-restricted volume of datasets used in various investigations also poses potential threats to the accuracy and generalizability of the ensuing empirical equations (Aamir and Ahmad 2019a).
In recent years, the development of hydraulic engineering alongside computational advancements has brought to the fore the potential of soft-computing and ML techniques in predicting scour depth (Akib et al 2014, Mostaani and Azimi 2022). The traditional reliance on empirical formulae and physical models has encountered challenges due to their inherent constraints, and ML has emerged as a promising solution (Bashiri et al 2018). ML algorithms are designed to identify patterns and relationships from large datasets, making them suitable candidates for tasks that require prediction based on complex, interdependent variables. The current literature witnesses the application of ML methods such as the group method of data handling (GMDH) (Najafzadeh 2015), genetic programming (GP) (Guven and Gunal 2008), gene expression programming (GEP) (Moussa 2013), artificial neural networks (ANN) (Eghbalzadeh et al 2018), and the adaptive neuro-fuzzy inference system (ANFIS) (Bateni et al 2007, Sharafati et al 2020) in predicting local scour depth. Najafzadeh et al (2017) examined several techniques, such as the model tree (MT), evolutionary polynomial regression, and GEP, against each other for scour depth prediction downstream of a sluice gate. Their findings indicated that the MT technique is superior in prediction accuracy to traditional empirical equations. In another study, conducted by Abd El-Hady Rady (2020) to predict pier scour depth, the authors concluded that the GP algorithm outperformed the ANFIS model and the empirical equations. A separate study on scour depth estimation by Qaderi et al (2021) reported that the ANFIS algorithm gave better results than some commonly used models such as ANN, GMDH, GEP, and the support vector machine (SVM). A similar conclusion was drawn by Parsaie et al (2019), who showed that the SVM algorithm had a minor advantage over ANN and ANFIS in the prediction of scour depth. Although the methods discussed above are capable of providing commendable accuracy, their performance is still tied to the datasets used. In other words, these models are often constrained by the quality and quantity of the data on which they are trained (Aamir and Ahmad 2016).
As the landscape of ML continues to evolve, newer models emerge, each addressing specific limitations of prior methodologies and offering enhanced predictive capabilities. Among the recent advancements in this domain are algorithms such as XGBoost (XGB), CatBoost (CAT), SVM, and Ridge Regression (RID). XGB, a gradient-boosting framework, combines the advantages of tree-based learning algorithms with the power of boosting to offer high-performing, scalable solutions. CAT, another gradient-boosting algorithm, is renowned for handling categorical data seamlessly. SVM stands out for its utilization of the kernel trick, enabling the incorporation of expert domain knowledge into the problem-solving framework. Meanwhile, RID offers a resolution to the challenges posed by multicollinearity in datasets, creating nuanced and stable models.
With the primary objective of enhancing accuracy and reliability in predicting the maximum scour depth at sluice gates, this study focuses on a comparative analysis of the performance of the above-mentioned ML algorithms: XGB, CAT, SVM, and RID. Furthermore, the effectiveness of these algorithms was examined against the empirical equations, which served as the benchmark in this study.

Understanding the local scour problem at sluice outlets
The scouring phenomenon downstream of sluices is intrinsically complex, primarily driven by various hydraulic, geometric, and sedimentary factors (Farooq and Ghumman 2019). Figure 1 presents a schematic representation of sediment geometry downstream of a sluice, further elucidating the complex issues in understanding the scour phenomenon. This complexity lies in the equilibrium or maximum scour depth (d_s), an important parameter describing the scour shape. The dynamics of d_s can be attributed to several contributing aspects.
As reported in several studies (Aamir and Ahmad 2016), the maximum equilibrium scour depth is a function of the initial conditions (namely the input discharge (Q) and the upstream (d_u) and downstream (d_t) water depths) and parameters (such as the apron length (L), the open height of the sluice gate (a), and the roughness of the apron). In addition, the presence of auxiliary structures like wing walls and blockages, as well as properties of the bed material, such as soil density (ρ_s), mean grain size (D_50), its standard deviation (σ), and the soil type (cohesive or non-cohesive), further nuances the scour depth determination. The density of water (ρ) and gravitational acceleration (g) also factor into the dynamics. To express the relationships more succinctly, researchers often use dimensionless parameters, of which the densimetric Froude number (F_d) and the Froude number of the water jet behind the sluice gate (F) are the most important. They can be represented mathematically by the following equations (reconstructed here in their conventional forms, with the jet Froude number normalized by the gate opening):

$$F_d = \frac{V}{\sqrt{\Delta\, g\, D_{50}}}, \qquad \Delta = \frac{\rho_s - \rho}{\rho} \quad (1)$$

$$F = \frac{V}{\sqrt{g\, a}} \quad (2)$$

Here, V denotes the jet velocity, a foundational metric in experimental campaigns. The scour depth, d_s, normalized by the sluice gate opening (a), can thus be expressed as:

$$\frac{d_s}{a} = \Psi\!\left(\frac{d_t}{a}, \frac{L}{a}, F_d, \frac{D_{50}}{a}\right) \quad (3)$$

It is imperative to recognize that the function Ψ will vary based on the unique configuration of the sluice outlet.
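As a quick numeric illustration of these dimensionless groups, the short Python sketch below evaluates F_d and F directly from raw quantities; all input values are hypothetical and are not drawn from the cited datasets.

import math

# Hypothetical laboratory values (for illustration only)
V = 1.2         # jet velocity behind the gate (m/s)
a = 0.02        # sluice gate opening (m)
D50 = 0.0008    # mean grain size (m)
rho_s = 2650.0  # sediment density (kg/m^3)
rho = 1000.0    # water density (kg/m^3)
g = 9.81        # gravitational acceleration (m/s^2)

delta = (rho_s - rho) / rho          # relative submerged sediment density
Fd = V / math.sqrt(delta * g * D50)  # densimetric Froude number, equation (1)
F = V / math.sqrt(g * a)             # jet Froude number, equation (2)

print(f"Fd = {Fd:.2f}, F = {F:.2f}")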
The primary target of this work is to analyze and compare the predictive capabilities of advanced ML models with traditional empirical equations in determining the maximum scour depth at sluice outlets. By delving into the complexity of this hydraulic phenomenon, we aim to pave the way for more efficient and accurate prediction methods. The findings here will serve as the foundation for future research efforts in hydraulic scouring dynamics.

Data collection
In the process of collecting data, a comprehensive database consisting of 267 experimental samples was obtained from two key studies in the literature. Specifically, 42 samples were extracted from the work of Sarkar and Dey (2005) (hereafter Sarkar_2005), and a broader set of 225 samples was collected from the follow-up investigation by Dey and Sarkar (2006) (hereafter Dey_2006).
The study by Sarkar_2005 investigated the characteristics of scour holes in both uniform and nonuniform sediments downstream of an apron, caused by the action of a submerged horizontal jet emanating through a sluice gate. Their investigation revealed the relationship between F_d and critical parameters of the scour hole, laying a vital foundation for understanding the impact of sediment uniformity and the geometric standard deviation on scour dimensions. Dey_2006 extended this empirical work, shedding light on the similarities observed in the scouring mechanism and scour patterns for non-cohesive sediment layers. Their findings, particularly on scour depth reduction through the strategic location of a launching apron, underscored the influential parameters that govern the maximum equilibrium scour depth. Table 1 provides an in-depth overview of the experimental dataset, presenting the ranges of significant components such as the length of the apron (L), the densimetric Froude number (F_d), the Froude number of the jet (F), the mean grain size (D_50), the tailwater depth (d_t), and the open height of the gate (a), together with the maximum scour depth (d_s) downstream of the sluice gate.

Empirical equations
For predicting scour depth at sluice outlets, researchers have developed a plethora of empirical formulas over the years. The analysis of these predictive models based on the available data will provide further insight into their accuracy and practical applications.
Table 2 briefly summarizes several notable equations proposed for this prediction, including those of Chatterjee et al (1994), Dey and Westrich (2003), and Sarkar and Dey (2005); among the quantities appearing in them, u_*c denotes the threshold shear velocity, σ_g the geometric standard deviation of the particle size distribution, Δh the difference in water level between downstream and upstream, and t the time of scouring. A remarkable observation from the literature reveals that a considerable fraction of these empirical formulations considers d_s as a function of a single variable, as seen in equations 4-6, or of specific soil or jet-water characteristics such as D_50, F_d(95), and u_*c, as demonstrated in equations 4, 6, and 9. Several parameters like F_d(95) and u_*c pose challenges during data collection, making them less feasible for real-world case studies. Interestingly, equations 4, 6, and 8 overlook the influence of the tailwater depth d_t. Yet, Dey and Sarkar (2006) indicated that an increase in d_t up to the critical tailwater depth leads to a decrease in the maximum scour depth, followed by a subsequent rise. Furthermore, equations 4 and 7 do not factor in the effects of sediment properties. Contrarily, Dey and Sarkar (2006) emphasized that an increase in sediment size results in a reduced maximum scour depth.
A study by Aamir and Ahmad (2019b) found that empirical equations, even those based on complex multiple linear regressions, do not always succeed in predicting the maximum scour depth. Moreover, an important observation emerges when considering the applicability of these empirical formulas: many have been formulated and calibrated based on specific experimental datasets and conditions. Therefore, their effectiveness and performance could vary significantly when extrapolated to other datasets. This variation underscores the necessity of considering the initial experimental conditions and datasets upon which these formulas were determined.
For the scope of this study, given our reliance on the experimental data of Sarkar and Dey (2005) and Dey and Sarkar (2006), it was considered appropriate to select the two empirical equations proposed in these specific studies as the representative benchmarks for empirical formulations (equations 10-11). This choice ensures consistency in the comparative analysis and provides a reliable foundation to evaluate the performance of the other predictive methods.

ML models
This section briefly introduces four prevalent ML models and explains their potential applications in predicting scour depths. XGB is an advanced gradient-boosting framework known for its efficiency and capability to produce high-quality predictions. It operates by iteratively combining the predictions of several decision trees, using errors from previous iterations to refine subsequent trees (Chen and Guestrin 2016). The model's inherent ability to handle large datasets, its resistance to overfitting due to its regularization parameters, and its feature importance capability make it an ideal candidate for predicting scour depths, where various factors influence the outcome.
CAT is another gradient-boosting algorithm, renowned for its prowess in seamlessly dealing with categorical data (Ostroumova et al 2018). Unlike other gradient-boosting models, CAT employs a unique encoding scheme that minimizes the need for extensive preprocessing. Given the diverse nature of the variables influencing scour depth, including both continuous (e.g., water depth) and categorical types (e.g., shape), CAT offers a streamlined approach to model development and prediction in our study.
SVM is a supervised ML technique that can effectively address regression and classification tasks. Revered for its kernel trick, SVM offers flexibility in fitting non-linear boundaries by transforming the initial data space into a higher dimension; the kernel trick also allows users to build expert knowledge about the problem into the model (Cortes and Vapnik 1995). Given the complexity of the factors influencing scour depth and the potential non-linear relationships between them, SVM's ability to delineate intricate decision boundaries makes it a robust choice for our predictive modeling.
RID is a linear regression technique that incorporates a regularization term. The regularization term discourages excessively complicated models that may overfit the training data, so ridge regression balances precision and computational simplicity (Hoerl and Kennard 1970). In predicting scour depths, multicollinearity can arise due to the interconnected nature of contributing factors, like input discharge and water depth. RID is adept at handling multicollinearity, making it an apt tool for modeling and prediction in scenarios where predictor variables are closely intertwined.
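To make the modeling setup concrete, the minimal Python sketch below shows how the four regressors can be instantiated with a shared fit/predict interface; the hyperparameter values are placeholders, not the tuned settings reported in table 3.

from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.svm import SVR
from sklearn.linear_model import Ridge

# Placeholder hyperparameters; the tuned values come from the grid search (table 3)
models = {
    "XGB": XGBRegressor(n_estimators=200, learning_rate=0.1, max_depth=4),
    "CAT": CatBoostRegressor(iterations=200, learning_rate=0.1, depth=4, verbose=0),
    "SVM": SVR(kernel="rbf", C=10.0, epsilon=0.05),
    "RID": Ridge(alpha=1.0),
}

# Each model exposes the same interface:
# model.fit(X_train, y_train); y_pred = model.predict(X_test)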

Performance evaluation

2.5.1. Hyperparameter tuning
In the process of refining the performance of the ML models to forecast local scour depth at sluice outlets, hyperparameter tuning is considered a pivotal step. The grid search strategy was chosen because of its efficiency in hyperparameter optimization. This systematic approach, recognized for its suitability to small and medium-sized datasets, trains and evaluates the learning algorithm on the complete set of hyperparameter combinations defined in a pre-specified grid. The Python scikit-learn library was leveraged for the seamless execution of the grid search procedure.
This study focuses on a dataset of 267 laboratory samples, each providing valuable insights into the complexity of scour depths occurring downstream of the sluice gate. The ratio of maximum scour depth to the open height of the gate (d_s/a) is formulated as a function of four determining variables: d_t/a, L/a, F_d, and D_50/a. This approach ensures a comprehensive synthesis of the factors affecting scour depth. To determine the accuracy and reliability of the developed models, a process for measuring performance was delineated.
Given the size of the dataset, it was divided into training and test subsets in a ratio of 75:25. This allocation resulted in a training set of 200 samples and a test set of 67 samples. Such a division allows rigorous training of the models while reserving considerable data for objective evaluation. As an integral part of the hyperparameter optimization process, 5-fold cross-validation was implemented during the grid search. This strategy not only enhances the robustness of the tuned models but also ensures their applicability and accuracy on unseen data. The specific hyperparameters and their respective ranges selected for each ML model are detailed in table 3.
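A minimal sketch of this split-and-tune procedure is given below, using XGB as the example; the feature matrix is synthetic stand-in data, and the parameter grid is illustrative rather than the actual search ranges of table 3.

import numpy as np
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor

# Synthetic stand-in for the 267-sample dataset: [d_t/a, L/a, F_d, D_50/a] -> d_s/a
rng = np.random.default_rng(0)
X = rng.random((267, 4))
y = rng.random(267)

# 75:25 split (200 training / 67 test samples)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# Illustrative grid; table 3 lists the ranges actually searched
param_grid = {
    "n_estimators": [100, 200, 500],
    "max_depth": [3, 4, 6],
    "learning_rate": [0.01, 0.05, 0.1],
}

search = GridSearchCV(
    XGBRegressor(),
    param_grid,
    cv=5,  # 5-fold cross-validation during the grid search
    scoring="neg_root_mean_squared_error",
)
search.fit(X_train, y_train)
best_model = search.best_estimator_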

Monte Carlo simulation to quantify uncertainty
The Monte Carlo simulation offers a robust statistical method to quantify the uncertainty inherent in estimating the depth of scour at sluice outlets. This method primarily leverages the power of random sampling, allowing researchers to understand the potential variability in model outputs under various input scenarios. The fundamental principle of the Monte Carlo simulation is rooted in drawing random numbers to mimic real-world scenarios and then aggregating the results of these random experiments to estimate the desired outcomes (Brownlee 2019). An underlying assumption of this technique is the convergence of the average of these randomly drawn samples towards the expected value of the chosen probability distribution, given that a sufficiently large number of simulations are run (Guo 2020).
In our research on predicting scour depth, the Monte Carlo simulation is a pivotal tool to gauge the consistency and reliability of the models' predictions. By generating 1,000 simulations from the input dataset of 267 laboratory samples, we endeavor to procure a panoramic understanding of the range and variability of the predicted scour depths. This offers insights into the robustness of the chosen models and the trustworthiness of the predictions they yield.
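The text does not spell out the resampling scheme beyond the 1,000 runs; a minimal sketch, assuming each run refits a model on a bootstrap resample of the training set and scores the fixed test set, could look like this:

import numpy as np
from xgboost import XGBRegressor

def monte_carlo_corr(X_train, y_train, X_test, y_test, n_runs=1000, seed=0):
    """Distribution of test-set CORR over bootstrap-resampled training sets.

    Expects numpy arrays; XGBRegressor() here is a placeholder model."""
    rng = np.random.default_rng(seed)
    n = len(y_train)
    corrs = []
    for _ in range(n_runs):
        idx = rng.integers(0, n, size=n)          # bootstrap resample of the training set
        model = XGBRegressor()                    # placeholder hyperparameters
        model.fit(X_train[idx], y_train[idx])
        y_pred = model.predict(X_test)
        corrs.append(np.corrcoef(y_test, y_pred)[0, 1])
    corrs = np.asarray(corrs)
    ci = np.percentile(corrs, [2.5, 97.5])        # 95% confidence interval
    return corrs.mean(), corrs.std(), ci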

Metrics used
A suite of statistical criteria was employed to assess the accuracy and reliability of the methods developed in this study for predicting scour depth (table 4). These metrics gauge the deviation of estimated values from actual observations, the strength of the relationship between estimated and observed values, and the overall efficiency of a method. The criteria chosen are the mean absolute error (MAE), RMSE, CORR, and the Nash-Sutcliffe efficiency (NSE). These metrics were selected due to their widespread usage in hydrological and geomorphological studies and their ability to evaluate model performance from different perspectives.
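For reference, a compact implementation of the four criteria is given below; it uses their standard definitions, which the formulations in table 4 are assumed to match.

import numpy as np

def evaluate(obs, est):
    """Compute MAE, RMSE, Pearson CORR, and NSE for observed vs estimated values."""
    obs, est = np.asarray(obs, float), np.asarray(est, float)
    mae = np.mean(np.abs(obs - est))              # mean absolute error
    rmse = np.sqrt(np.mean((obs - est) ** 2))     # root mean square error
    corr = np.corrcoef(obs, est)[0, 1]            # Pearson correlation coefficient
    nse = 1.0 - np.sum((obs - est) ** 2) / np.sum((obs - obs.mean()) ** 2)  # Nash-Sutcliffe
    return {"MAE": mae, "RMSE": rmse, "CORR": corr, "NSE": nse}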

3.1. Performance comparison
To evaluate predictive efficacy, the study undertook a comprehensive analysis of six different methods, encompassing both empirical equations and ML algorithms. Figures 2 and 3 outline the key performance metrics and provide graphical interpretations detailing the association between the experimental data and the corresponding predictions of each method. The performance statistics show varying degrees of success between the empirical equations and the ML models. The RMSE and MAE values for both empirical methods indicate a weaker performance compared to the ML models. In particular, Sarkar_2005, with an RMSE of 0.966, showed a higher discrepancy from the observed data than any of the ML models. Dey_2006 performs slightly better than Sarkar_2005, indicating that the Dey_2006 formula estimates scour depth with somewhat higher accuracy, but it still lags behind its ML counterparts. This reinforces the view that although empirical equations provide valuable insights, they may not encapsulate all the complexity and variation found in the dataset.
In the case of the ML models, the superiority in predictability becomes more apparent. XGB demonstrated the most robust performance among its peers, as evidenced by the lowest RMSE and MAE values, at 0.439 and 0.321, respectively. This shows that XGB effectively captures the complexity of the data, resulting in more accurate predictions. CAT and SVM closely follow XGB in terms of predictive power: the RMSE and MAE values are 0.444 and 0.334 for CAT, and 0.569 and 0.451 for SVM. Although RID's performance is commendable, it is marginally outperformed by the other ML models.
Regarding the CORR metric, which quantifies the degree of linear association between observed and estimated values, the XGB algorithm was identified as the leader with a CORR value of 0.944. This model both minimizes errors and correlates strongly with the observed data. The CAT algorithm followed closely, notching up a CORR value of 0.940, followed by SVM with a value of 0.898. The empirical equations exhibit a lower correlation, with Sarkar_2005 at 0.700 and Dey_2006 at 0.816. This difference might seem slight, but it could lead to significant deviations in predictions for larger datasets.
NSE quantifies the magnitude of the residual variance compared to the variance of the observed data; an efficiency of 1 indicates perfect predictions. Among the methods, the XGB algorithm most closely approaches this ideal with an NSE of 0.938. It is worth noting that the CAT algorithm achieved an NSE value of 0.937, closely followed by SVM with 0.936, indicating a high level of model efficiency. Of the two empirical formulations, Dey_2006 achieved an NSE value of 0.796, reflecting its relative accuracy compared to Sarkar_2005, which had an NSE of 0.702.

Figure 3 further enriches the comparative analysis with scatter plots for each method. Each plot's x-axis signifies the observed values, while the y-axis portrays the estimated ones. These scatter plots reiterate the trends identified in figure 2 and underscore the closeness of the predicted values to the observed data, especially for the ML algorithms.
Generally, both the empirical equations and the ML algorithms deliver respectable performances in predicting the maximum scour depth. Taken together, the quantitative metrics and their visual counterparts indicate that the XGB algorithm is the most potent tool among the assessed methods.
3.2. Sensitivity analysis of ML models

3.2.1. Variable importance ranking

ML models often rely on a combination of input variables to produce predictions, with some variables having a more pronounced influence than others. This study employed permutation importance to assess the significance of the input variables in the ML models used. This method evaluates the change in the model's performance when the values of each variable are randomly shuffled, thus breaking the relationship between the variable and the target; the magnitude of the decrease in model performance indicates the importance of the variable. The method is applied uniformly across all four ML models (RID, SVM, CAT, and XGB) to calculate variable importance consistently (a minimal sketch of the procedure is given below). Table 5 and figure 4 present the permutation importance of the crucial variables across the four models, providing insights into their significance in influencing the predictions.
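Sketched with scikit-learn's implementation, reusing the fitted best_model and test split from the tuning sketch above; the scoring function is an assumption, as it is not stated here which scorer was used.

from sklearn.inspection import permutation_importance

feature_names = ["d_t/a", "L/a", "F_d", "D_50/a"]

# best_model: any fitted regressor (RID, SVM, CAT, or XGB)
result = permutation_importance(
    best_model, X_test, y_test,
    scoring="r2",      # assumed scorer; not specified in the text
    n_repeats=30,      # repeated shuffles stabilize the estimate
    random_state=42,
)
for name, mean, std in zip(feature_names,
                           result.importances_mean,
                           result.importances_std):
    print(f"{name}: {mean:.3f} +/- {std:.3f}")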
Table 5 reveals that the variable F_d remains dominant across all ML models. Specifically, RID attributes the highest importance to F_d with a score of 0.816, closely followed by the XGB, CAT, and SVM models with importance values of 0.689, 0.638, and 0.635, respectively. The consistently high ranking of F_d across all models indicates its pivotal role in estimating scour depth downstream of sluices, underscoring the importance of collecting accurate data for this variable. In a noteworthy contrast, L/a sees varied importance among the models, with the CAT and XGB models attributing relatively higher significance scores of 0.163 and 0.122, respectively. Meanwhile, SVM rates its importance more modestly with a score of 0.064, and RID only slightly above zero at 0.027. Although L/a plays a role in predictions, its magnitude of influence differs across models.
The graphical representation in figure 4 complements these findings. It visually emphasizes the importance of F_d, with its bar noticeably outperforming the others. The other variables, d_t/a and D_50/a, show different bar heights, indicating their fluctuating significance between models. For d_t/a, the highest importance emerges in the XGB model, with a score of 0.124; CAT and RID follow with values of 0.113 and 0.086, respectively. In the case of the D_50/a variable, SVM assigns it a predominant importance of 0.263, substantially higher than its counterparts, followed by CAT, RID, and XGB with scores of 0.086, 0.070, and 0.065, respectively. The importance given by SVM suggests that, within its framework, D_50/a may have a more pronounced relationship with the predictions than observed in the other models.
In addition to the permutation importance analysis, the study delves into the contribution of each predictor to the model output through SHAP (SHapley Additive exPlanations) values, which provide a robust framework for interpreting the decisions of ML models. SHAP values were computed for the four ML models, RID, SVM, CAT, and XGB, as illustrated in figure 5. These values quantify the impact of each feature on the prediction of each model, providing an interpretable and detailed perspective on feature importance. The SHAP summary plots (figures 5(a)-(d)) illustrate the distribution of the impact that each feature has on the model output, offering insights into the predictive power of each feature within the models. A higher absolute SHAP value signifies a more significant impact on the model's predictions. Notably, across all models, the F_d variable consistently exhibits the most substantial impact, as indicated by the spread and positioning of its SHAP values towards the higher end of the axis. In contrast, features such as D_50/a, L/a, and d_t/a demonstrate a more heterogeneous influence on the models' output, with their SHAP values spanning a wider range on the axis.
The distribution of SHAP values aligns with the permutation feature importance findings, affirming the prominence of F_d as a critical predictor across all models. Furthermore, the SHAP analysis adds a layer of detail by quantifying the direct influence of feature values on the prediction output. Elevated SHAP values for a feature indicate a greater positive or negative predictive impact on the model, as denoted by their horizontal placement in the SHAP summary plots.
In general, the permutation importance analysis provides a detailed view of how different variables determine the predictive power of the ML algorithms. F_d clearly takes center stage, while variables such as L/a, d_t/a, and D_50/a play their roles in more varied and subtle ways, reflecting the complex structure of these algorithmic models. The SHAP analysis confirms the results obtained from permutation importance and augments the interpretability of the ML models by quantifying the direct impact of individual feature values on model predictions.
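A minimal sketch of the SHAP computation, again reusing the fitted best_model and test split from the earlier sketches; TreeExplainer applies to the tree-based XGB and CAT, while the model-agnostic shap.KernelExplainer can serve SVM and RID.

import shap

# Fast exact SHAP values for tree ensembles (XGB, CAT)
explainer = shap.TreeExplainer(best_model)
shap_values = explainer.shap_values(X_test)   # shape: (n_samples, n_features)

# Summary plot as in figure 5: features sorted by importance,
# point color encoding the feature value (red high, blue low)
shap.summary_plot(shap_values, X_test,
                  feature_names=["d_t/a", "L/a", "F_d", "D_50/a"])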

3.2.2. Uncertainty in predictions
In evaluating the reliability and robustness of ML models, the uncertainty associated with predictions often provides a more comprehensive outlook. This study adopted the Monte Carlo technique to gauge the inherent variability and uncertainties present in the ML algorithms. By producing a robust set of 1,000 predictions for the test dataset (effectively amounting to 1,000 resamplings of the training data for each model), this technique rendered a comprehensive understanding of the models' behaviors under various scenarios. The results are described in table 6 and figure 6.

Table 6 enumerates the performance statistics for the uncertainty estimates associated with the ML models. The observations from the table show that XGB has the highest mean correlation (mean CORR) at 0.946. CAT follows closely with a mean CORR of 0.937, and SVM comes next with 0.897. RID, though still respectable, has the lowest score at 0.723. These values suggest that, on average, the XGB model's predictions best fit the observed values, followed by those of CAT and SVM.
Regarding variability, the standard deviation values provide insights into the consistency and reliability of the models. XGB displays the most significant variation with a standard deviation of 0.0050, implying that its predictions, although on average quite accurate, can exhibit a slightly wider spread. In contrast, RID has the least variability, with a standard deviation of 0.0010, indicating its more stable properties. CAT and SVM are in the middle, presenting standard deviations of 0.0039 and 0.0035, respectively.
Figure 6 further elucidates these findings by graphically representing the performance of the ML models alongside their 95% confidence intervals (CIs). For XGB, the CI spans from 0.933 to 0.954, suggesting that 95% of the model's predictions will likely fall within this range. This relatively wide CI, combined with its high mean CORR, confirms the model's effectiveness while drawing attention to its variability. Similarly, CAT's CI ranges from 0.929 to 0.944, reflecting both its accuracy and consistency. SVM, despite its wider variability, has a CI from 0.889 to 0.903, underscoring its capacity to generate reasonably accurate predictions. RID, with a CI between 0.721 and 0.725, shows a stable predictive ability, although with lower correlation.
In addition to assessing the uncertainty of the predictions, the permutation feature importance of the predictors, along with the corresponding 95% CIs, was estimated and is shown in figure 7.
F_d continues to demonstrate significant influence across all models, consistent with the findings from the SHAP value analysis. However, the breadth of the CIs for F_d varies notably among models. In the SVM model, the F_d CIs range from 0.6130 to 0.6264 (mean 0.6197), indicating substantial variability and potential effects of data sampling on its importance. This contrasts with the more stable estimates in the RID model (mean 0.8029) and the CAT model (mean 0.5973).
For the variable D_50/a, the CIs are considerably narrower in the RID, CAT, and XGB models, suggesting a more consistent assessment of this feature's importance across these models. The SVM model shows a wider range (0.3138 to 0.3230, mean 0.3184), aligning with the moderate influence indicated in the SHAP analysis. The importance scores of L/a and d_t/a show narrower intervals and exhibit less variability across the models.

In conclusion, figure 7 validates the earlier findings from the permutation importance and SHAP analyses and accentuates the uncertainty inherent in estimating feature importance. The Monte Carlo-driven assessment emphasizes the duality of accuracy and variability inherent to the predictions of ML models. While metric-based evaluations offer a snapshot of performance, insights into uncertainties provide a richer perspective; here, reliability and robustness play equally crucial roles. The findings suggest that while the XGB and CAT models lead in accuracy, RID offers more consistent predictions, though with a slight compromise in correlation.

Discussion
The analytical procedure of performance comparison and sensitivity analysis has provided rich insights into the complexities and nuances of estimating the maximum scour depth. Both empirical and ML methods have shown their unique strengths and limitations, offering varied perspectives on the matter. The performance comparison (section 3.1) highlighted a significant finding: while empirical formulas like Sarkar_2005 and Dey_2006 yield good results, the ML algorithms outperform them on most performance metrics. Specifically, the XGB algorithm exhibited exceptional prowess in its predictions, marked by the lowest RMSE and the highest CORR values. This shows that ML models, with their ability to distinguish complex data patterns and relationships, can provide superior predictive capabilities compared to traditional empirical methods. Yet, the tight performance metrics among the ML techniques also spotlight the necessity of adjusting algorithm selection to the specific context and data complexity.
Moreover, the efficacy of these models does not depend solely on accuracy. Section 3.2.1 delved into sensitivity analyses and uncertainty predictions, providing a deeper understanding of the behavior of the ML models. The variable importance rankings shed light on the pivotal role of F_d across all models, underscoring its centrality in the predictive landscape. This has practical implications for field studies and data collection efforts, emphasizing the need for accurate measurement and representation of this variable. In addition, the SHAP value analysis has enhanced the interpretability of the ML models, clarifying the contribution of each predictor to the model output.
The uncertainty assessment (section 3.2.2) using the Monte Carlo technique illuminated the duality of precision and variability in ML predictions. While raw predictive skill is essential, understanding the uncertainties allows a comprehensive assessment of a model's reliability. For instance, XGB demonstrated superior average performance, but its wider CIs imply higher prediction variability. Hence, although the model may be statistically superior, this variability needs to be accounted for, especially in applications where the margin of safety is critical.
This study underscores the potential and challenges of both empirical and ML methods in predicting scour depth. Although ML models, particularly XGB and CAT, exhibit promise in absolute predictability, they also come with inherent uncertainties. The empirical formulas, although generally less accurate, have the advantage of being simple and fast to compute.

Conclusions
This study provides a comprehensive comparison and assessment of empirical and ML methodologies in predicting scour depth, offering valuable insights for researchers and practitioners. The following conclusions can be drawn:

• ML models, especially XGB and CAT, consistently surpassed empirical formulas in predicting the maximum scour depth, underscoring their potential in this domain.
• The variable F_d emerged as a crucial determinant across all ML models, highlighting the importance of its accurate measurement and representation. The SHAP value analysis further confirms its dominant role and the necessity for precise measurement in scour-depth studies.
• Uncertainty analysis, conducted using Monte Carlo techniques, revealed that while ML models like XGB show higher mean performance, they also possess broader CIs, indicating prediction variability.
• Despite their lower accuracy compared to specific ML models, empirical formulas offer the advantages of simplicity and swift calculations, which might still make them preferable in particular scenarios.
• As research in scour depth prediction progresses, a potential integration of the rapidity of empirical formulas with the accuracy of ML models might pave the way for enhanced predictions in subsequent studies.

Figure 2. Comparison of MAE and CORR for various methods.

Figure 3. Comparison between experiment data and estimated results by six methods.

Figure 4. Effect of variables on the estimated outcomes of ML models.

Figure 5. SHAP summary plots for ML models. Features are sorted by their importance. Points represent individual feature values; their horizontal placement indicates the impact on the prediction (right for positive, left for negative). The color indicates the feature value (red for high, blue for low).

Figure 6. Performance of ML models and 95% confidence intervals.

Figure 7. Effect of variables on the estimated outcomes and their 95% CI.

Table 1. Overview of experiment data.

Table 2. Empirical equations to estimate the depth of scour.

Table 3. Grid search parameters for ML algorithms.

Table 4. Metrics used for model evaluation, with their range and optimal values. Here, n denotes the number of observations; x_i and y_i denote the observed and estimated values; x̄ and ȳ the means of the observed and estimated values.

Table 5. Permutation importance of ML models.

Table 6. Performance statistics of uncertainty estimates for ML models.