Detecting high/low-speed large-scale structures from a machine learning perspective in turbulent boundary layers

Recent developments in machine learning have facilitated data-driven strategies for predicting the evolution of vortex structures in turbulent boundary layers. This study combines the attached eddy hypothesis with the extreme gradient boosting model to forecast large-scale high/low-speed motion from a series of input signals. The performance of the trained model, quantified as the percentage of accurately predicted high/low-speed regions, varies with the streamwise offset between the input velocity fluctuations and the predicted output value at z/δ = 0.016, where z represents the wall-normal height and δ the boundary layer thickness. Our findings underscore the significant potential of machine learning for predicting high/low-speed regions within large-scale motion.


Introduction
In the past, turbulent boundary layers were thought to be chaotic and unpredictable because there were no tools to observe the flow effectively. As time has passed, however, the idea that turbulence can involve organized patterns of motion has become widely acknowledged [1]. Nevertheless, the fluid mechanics community has not reached a unanimous agreement on a universally accepted definition of coherent structures. One suggested definition is that coherent structures are turbulent fluid motions characterized by correlated vorticity patterns within a defined spatial region [2]. Similarly, for a motion to be classified as coherent, at least one property of the fluid must display a significant level of self-correlation over a spatial or temporal range that greatly exceeds the smallest scales of the flow [3].
In the study of zero-pressure-gradient wall-bounded turbulence, coherent structures were categorized into eight distinct groups through the application of various experimental flow-observation methods [4]. Researchers later streamlined this categorization into four groups using a more efficient approach [5]. By employing conditional statistics and two-point correlations, they showed that the motion near smooth boundaries is characterized by non-random, repetitive patterns [6]. Using a flow-visualization method based on hydrogen bubbles, researchers achieved the first visualization of highly organized structures in the near-wall region of the turbulent boundary layer [7]. The findings from this flow visualization are encapsulated in the "lifted stretched vortex element" model, which suggests that these distinct low-speed regions propagate outward from the wall at a defined angle and speed [7]. By combining the hydrogen bubble technique with hot-wire anemometry, it was determined that low-velocity streaks lift up, oscillate, burst, and ultimately move away from the wall through vortex induction [8].
In 1952, the concept of a hairpin/horseshoe vortex model, consisting of tornado-shaped vortical structures that emerge from the near-wall region and overlie low-speed structures at a 45° inclination, was first introduced [1]. By conducting experiments with synchronized hot-wire anemometry and oil-fog flow visualization on an inclined laser sheet in zero-pressure-gradient turbulent boundary layers, scientists made the first observations of these hairpin/horseshoe vortices [9]. They also proposed the hypothesis that the hairpin/horseshoe vortex forms the basis for the structures seen in turbulent flows [10].
Subsequently, the attached eddy hypothesis (AEH) emerged as a conceptual model for wall-bounded turbulence, describing structures as a set of self-similar, inertia-driven eddies randomly distributed in the wall plane [11]. A recent review discusses the primary assumptions and constraints of the AEH [12]. In this study, we apply features of the AEH to illustrate and explore prediction capability as a function of wall-normal height and streamwise offset. These findings underscore the complexity of the turbulence structure and the challenges associated with predicting its behavior. Nevertheless, with the development of machine learning and its ability to improve performance by self-learning from extensive databases, there is now a promising opportunity to predict turbulent motion. Machine learning algorithms have been widely adopted in turbulence research and in related disciplines such as astrophysics, atmospheric physics, and climate science, with significant achievements.
The current work aims to leverage a machine learning model to predict high/low-speed regions in large-scale motion, as illustrated in Figure 1. The input variables used in the prediction are labelled u_1, ..., u_n, denoting the streamwise velocity components at a specific height, u(z). The predicted output signal, u_0, is located at the same height as the input variables and positioned immediately downstream of u_n.
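The construction of input/output pairs described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `make_io_pairs` and the exact indexing convention (output label taken `offset` samples downstream of the last input) are assumptions based on the description of u_1, ..., u_n and u_0.

```python
import numpy as np

def make_io_pairs(u, n_inputs, offset):
    """Slide along a 1-D streamwise fluctuation signal u(x) and build
    (input, output) pairs.

    Each input vector holds n_inputs consecutive velocity fluctuations
    u_1..u_n; the output label is 1 if the sample `offset` points
    downstream of u_n lies in a high-speed region (u > 0), else 0.
    """
    X, y = [], []
    last = len(u) - offset
    for i in range(last - n_inputs):
        X.append(u[i:i + n_inputs])
        y.append(1 if u[i + n_inputs + offset - 1] > 0 else 0)
    return np.asarray(X), np.asarray(y)

# Example on a synthetic quasi-periodic signal standing in for the
# large-scale streamwise fluctuation:
u = np.sin(np.linspace(0, 4 * np.pi, 200))
X, y = make_io_pairs(u, n_inputs=8, offset=5)
```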

Database
The data used in this research were obtained from previously published experiments and simulations [14,15,16]. The analysis relies primarily on the data of Graham et al. (2016). The domain dimensions are L_x × L_y × L_z = 8πh × 3πh × 2h, with h representing the channel half-height. All data are stored in physical space on a grid of 2048 × 1536 × 512 points, without zero-padding. The streamwise spacing between two horizontally adjacent grid points is Δx = 8πh/2048. The friction Reynolds number of the direct numerical simulations is Re_τ = 1000.
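As a quick worked check of the grid spacings implied by the domain and grid sizes above (h is set to 1 here; note the actual wall-normal grid of such channel databases is typically non-uniform, so dz below is only a mean value):

```python
import math

h = 1.0                        # channel half-height (normalized)
dx = 8 * math.pi * h / 2048    # streamwise spacing: 8πh over 2048 points
dy = 3 * math.pi * h / 1536    # spanwise spacing: 3πh over 1536 points
dz_mean = 2 * h / 512          # mean wall-normal spacing (grid is stretched)
```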
In Table 1, u_τ denotes the friction velocity of the turbulent flow. In this study, to obtain the dataset used for prediction, the raw data undergo an initial filtering process aimed at removing small-scale motion while preserving large-scale motion. Subsequently, the mean is subtracted to extract the fluctuations; data with fluctuation values exceeding 0 are categorized as belonging to the high-speed zone, denoted by '1', whereas data with fluctuation values below 0 are classified within the low-speed zone, denoted by '0'.
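The labeling pipeline above can be sketched in a few lines. This is a hypothetical minimal version: the paper does not specify the filter kernel, so a simple moving average stands in for the large-scale filter.

```python
import numpy as np

def label_speed_regions(u_raw, filter_width):
    """Sketch of the labeling step (assumed moving-average filter).

    1. Low-pass filter to remove small-scale motion.
    2. Subtract the mean to obtain the fluctuation.
    3. fluctuation > 0 -> high-speed ('1'), otherwise low-speed ('0').
    """
    kernel = np.ones(filter_width) / filter_width
    u_large = np.convolve(u_raw, kernel, mode="same")  # keep large scales
    fluct = u_large - u_large.mean()
    return (fluct > 0).astype(int)

# Example: a noisy quasi-periodic signal
rng = np.random.default_rng(0)
u_raw = np.sin(np.linspace(0, 6 * np.pi, 300)) + 0.1 * rng.standard_normal(300)
labels = label_speed_regions(u_raw, filter_width=7)
```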

Extreme Gradient Boosting (XGBoost)
Introduced in 2014, XGBoost is now widely utilized for training and testing models on large amounts of data [13]. This versatile algorithm handles both regression and classification tasks and performs well with little parameter tuning, allowing use with minimal configuration [18]. We provide a brief overview of how XGBoost works. Given a dataset D = {(x_i, y_i) : i = 1, ..., n, x_i ∈ R^m, y_i ∈ R} containing n samples and m features, the predicted label ŷ_i is produced by an additive model:

ŷ_i = Σ_{k=1}^{K} f_k(x_i), f_k ∈ F, (1)

F = {f(x) = w_{q(x)}} (q : R^m → T, w ∈ R^T), (2)

where f_k(x_i) is the predicted score of the k-th tree for a given sample, F denotes the space of regression trees (the structural parameters of f), q maps a sample to a leaf index, w is the vector of leaf weights, and T is the number of leaves in the tree. To learn this set of functions, one minimizes a regularized objective combining the loss and a complexity penalty:

Obj = Σ_{i=1}^{n} l(ŷ_i, y_i) + Σ_{k=1}^{K} Ω(f_k), Ω(f) = γT + (1/2) λ ‖w‖². (3)

In Equation (3), l represents the standard loss function, measuring the difference between the predicted output and the actual output; the second term, Ω, captures the model's complexity.
In these equations, n represents the number of samples in the dataset, and T the number of leaves in the k-th tree. The parameters γ and λ are employed to fine-tune the complexity of the tree. The regularization term smooths the final learned weights, thereby preventing overfitting.
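The additive model of Equations (1)-(3) can be illustrated with a deliberately simplified boosting sketch: depth-one trees (stumps) fitted to the residual under a squared loss. This is not XGBoost itself (which uses second-order gradients and the γT + ½λ‖w‖² penalty); the function names are hypothetical and serve only to show how ŷ = Σ_k f_k(x) is built up tree by tree.

```python
import numpy as np

def fit_stump(x, r):
    """Best depth-1 regression tree for residuals r: choose the split s
    minimizing squared error; each leaf predicts the residual mean."""
    best = None
    for s in np.unique(x)[1:]:
        left, right = r[x < s], r[x >= s]
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or err < best[0]:
            best = (err, s, left.mean(), right.mean())
    _, s, wl, wr = best
    return lambda q: np.where(q < s, wl, wr)

def boost(x, y, n_trees=20, lr=0.3):
    """Additive model y_hat = sum_k lr * f_k(x): each new tree is fitted
    to the current residual (first-order, squared-loss simplification
    of XGBoost's objective)."""
    pred = np.zeros_like(y, dtype=float)
    trees = []
    for _ in range(n_trees):
        f = fit_stump(x, y - pred)
        pred += lr * f(x)
        trees.append(f)
    return lambda q: sum(lr * f(q) for f in trees)

# Example: learn a step function (a crude stand-in for a high/low-speed label)
x = np.linspace(0.0, 1.0, 100)
y = (x > 0.5).astype(float)
model = boost(x, y)
```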
In this research, to identify the flow state, each decision tree outputs an identification label s_{x,z,t}(u_0) ∈ [0, 1] at each grid point u_0, where the positions in space are indicated by the subscripts x and z, and t refers to the current tree. The final prediction s_{x,z}(u_0) is a weighted sum of the predictions from all T trees. Regions where s_{x,z}(u_0) exceeds a threshold σ_t ∈ [0, 1] are identified as positive predictions, while regions where it falls below the threshold are identified as negative predictions, following the conventional criterion. Figure 2 illustrates an example of how the model is built and used to predict high/low-speed large-scale motion at a fixed height u(z). In Figure 2(a), the raw large-scale motions of the streamwise velocity fluctuation, collected at z/δ = 0.016, are displayed; they show approximately cyclic variations, indicating highly correlated features among them. The marked points show the positional relationship of the dataset: the dots labeled as circles represent the input signals, and the dot labeled as a rectangle represents the output signal. The model is trained on 320,000 input/output combinations (I/O-C) collected at the same height to achieve the best performance, and an additional 320,000 I/O-C data points are employed to assess its predictive performance. Figure 2(b) demonstrates the model's performance in predicting high-speed regions in large-scale motion by comparing the true values with the predicted values when Δx/δ = 0.061. 'H-Val' represents the raw large-scale motions, with labels 1 and 0 indicating high-speed and non-high-speed regions, respectively. 'H-Pre' is based on the predicted value s_{x,z}(u_0) obtained from the XGBoost model, converted into 1 or 0 according to the threshold σ_t = 0.5 in Equation (1). Similarly, Figure 2(c) shows the prediction when Δx/δ = 0.491. The accuracy of the prediction for the large-scale motions is the percentage of correctly predicted high-speed or low-speed regions, determined by assessing the extent of overlap between 'H-Val' and 'H-Pre'. Comparing Figures 2(b) and (c) shows that the smaller Δx is, the better the performance of the current model; this is also supported by Figure 2(d).
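The thresholding of the ensemble score and the overlap-based accuracy just described amount to two one-line operations. A minimal sketch (function names `threshold` and `overlap_accuracy` are assumptions, not the authors' code):

```python
import numpy as np

def threshold(score, sigma_t=0.5):
    """Binarize the tree-ensemble score: 1 (high-speed) if score >= sigma_t."""
    return (np.asarray(score) >= sigma_t).astype(int)

def overlap_accuracy(h_val, h_pre):
    """Fraction of grid points where the predicted label ('H-Pre')
    matches the true label ('H-Val')."""
    h_val, h_pre = np.asarray(h_val), np.asarray(h_pre)
    return float((h_val == h_pre).mean())
```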
Clearly, the accuracy of the prediction decreases dramatically before Δx/δ reaches 0.5, and then gradually converges to 0 as Δx increases. It is established that a structure whose length is greater than one boundary layer thickness δ and less than three boundary layer thicknesses 3δ is referred to as a large-scale structure. It is also known that high- and low-speed regions occur in cycles within large-scale structures [17], as exhibited in Figure 3(a). The solid gray line in Figure 3(a) illustrates the unprocessed large-scale motions of the streamwise velocity fluctuation, and the windows illustrate the size and location of the data. By conducting a correlation analysis on these two windows, the correlation coefficient R can be obtained. Figure 3(b) shows the variation of R with Δx/δ at z/δ = 0.016 and z/δ = 0.20, represented by the blue and orange lines, respectively. When Δx/δ is less than 0.5, R decreases sharply, following the same trend as the accuracy of the model; this explains why the model's ability to predict large-scale motion is limited. However, when Δx/δ is larger than 0.5, instead of decreasing further, R increases and decreases in accordance with the alternation of the high-speed and low-speed regions in the turbulence. Figures 2 and 3 demonstrate the successful application of machine learning to predicting large-scale motion under specific conditions, indicating the potential to understand the structure of large-scale motion from limited data. To achieve this, the model will be improved by training under more diverse conditions, such as varying Reynolds numbers and heights.
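The windowed correlation analysis of Figure 3 can be sketched as follows: take two windows of equal length separated by an offset Δx and compute their Pearson correlation coefficient. On a purely periodic signal this reproduces the qualitative behavior described above, with R ≈ 1 at an offset of one period and R ≈ -1 at half a period (the function name and window placement are illustrative assumptions).

```python
import numpy as np

def window_correlation(u, width, offset):
    """Correlation coefficient R between two windows of length `width`
    separated by `offset` samples, as in Figure 3(a)."""
    a = u[:width]
    b = u[offset:offset + width]
    return float(np.corrcoef(a, b)[0, 1])

# Example: sine with a period of 100 samples
u = np.sin(2 * np.pi * np.arange(1000) / 100)
r_full_period = window_correlation(u, width=300, offset=100)  # shift = 1 period
r_half_period = window_correlation(u, width=300, offset=50)   # shift = 1/2 period
```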

Conclusion
This study utilizes machine learning to gain an understanding of the development and forecasting of wall-attached eddies, a component of large-scale turbulence, within the turbulent boundary layer. The model's predictive accuracy for high- and low-speed regions of large-scale motion is analyzed at specific heights and Reynolds numbers. The accuracy is observed to decrease as Δx increases, especially when Δx is less than 0.5 times the boundary layer thickness. This conclusion is further substantiated by an analysis of the correlation coefficient of the structural components within a windowed framework. In the next phase, the current model will be further trained to make predictions at various heights and Reynolds numbers. The model is also anticipated to exhibit robust performance in scenarios involving changes in the spacing of the input signals.

Figure 1.
Figure 1. Illustration of the eddy geometry with the wall-normal length l_z and the input-data length l_x. α indicates the inclination of the structure. u_1, ..., u_n represent the input components of the large-scale streamwise velocity fluctuations. u_0 indicates the output signal at the same height in the turbulent flow. Δx indicates the streamwise offset between u_n and u_0.

Figure 2.
Figure 2. (a) Illustration of velocity fluctuations at z/δ = 0.016. The dots labeled as circles indicate the signals used as inputs for the large-scale streamwise velocity fluctuation components, while the dot labeled as a rectangle represents the predicted output signal. (b) La(H) means the line in the plot is based on whether the velocity is in a high-speed region. The validation is represented by the blue solid line 'H-Val', which corresponds to the validation obtained from the large-scale signal in (a). The dashed purple line 'H-Pre' shows the predicted value s_{x,z}(u_0), marked as 1 when it exceeds the threshold σ_t = 0.5 and as 0 otherwise. These values are based on Δx/δ = 0.061. (c) Similar to (b), but with Δx/δ = 0.491. (d) Prediction accuracy at z/δ = 0.016 as Δx changes.

Figure 3.
Figure 3. An example of how to obtain the correlation coefficient R of large-scale motion at z/δ = 0.016. (a) Windows in different colors represent two groups of signals. Δx/δ refers to the offset between the windows. The data length is the same as δ. (b) Relationship between R and Δx/δ. The blue solid line shows R at z/δ = 0.016, the orange solid line at z/δ = 0.195, and the yellow solid line at z/δ = 0.781.

Table 1 .
Relevant information about the dataset.