Data-driven turbulence modelling using symbolic regression

The study focuses on the performance of machine-learning methods applied to improve velocity field predictions for canonical turbulent flows computed with Reynolds-averaged Navier–Stokes (RANS) equation models. A key issue is approximating the unknown Reynolds stress (RS) tensor needed to close the RANS equations. A turbulent channel flow over a curved backward-facing step on the bottom wall, for which a high-fidelity LES data set is available, is chosen as the test case to examine the ability of gene expression programming (GEP) to formulate enhanced RANS approximations. This symbolic regression technique yields new explicit expressions for the RS anisotropy tensor. Results obtained with the new GEP-produced model are compared with the LES data (which serve as the target benchmark solution during training of the machine-learning algorithm) and with the conventional RANS model based on the linear-gradient Boussinesq hypothesis for the Reynolds stress tensor.


Introduction
The development of machine learning and artificial intelligence techniques has opened up avenues in computational fluid dynamics (CFD) that were not well explored before, specifically in the area of data-driven turbulence modelling [1]. Solving the Navier–Stokes equations for turbulent flows remains one of the biggest challenges in CFD. Direct numerical simulation (DNS) yields very accurate results because no turbulence model is needed, but it is rarely a practical option since its cost scales roughly as Re³, where Re is a typical Reynolds number of the flow. In large eddy simulation (LES), only the small-scale turbulent motions are modelled, which reduces the computational complexity in comparison with DNS. However, applying LES is still challenging, especially for high-Reynolds-number separated flows with thin near-wall boundary layers [2]. The more popular way to solve the Navier–Stokes equations is to analyze their Reynolds-averaged form [3][4][5][6]. Solving the Reynolds-averaged Navier–Stokes (RANS) equations is widely used in engineering applications because of its relatively low computational cost compared to other methods. RANS simulations do not resolve turbulent fluctuations explicitly but employ empirical Reynolds-stress approximations, which is computationally efficient. These approximations are, however, often inaccurate, leading to errors and uncertainties that should be quantified.
There is a significant opportunity to improve RANS models with machine learning (ML) algorithms [1], which were first introduced into turbulence modelling [7,8] to create markers detecting regions of high RANS model uncertainty and to model the Reynolds stress anisotropy (RSA) tensor. While progress has been made in applying ML tools to predict some flow properties in a data-driven manner, e.g. using high-fidelity (LES, DNS, measurement) data for training, wider applications of ML models remain vastly unexplored. With the increase in available computational resources, many recent studies approximate the RSA tensor bij = <ui′uj′>/(2k) − δij/3 (where k is the turbulent kinetic energy) using ML techniques to decrease the discrepancy with respect to high-fidelity data for canonical turbulent wall-bounded flow cases, as in [8][9][10][11][12][13][14][15][16]. The workability of gene expression programming (GEP), which belongs to symbolic regression, is explored here for a channel flow over a curved backward-facing step, for which a high-fidelity LES data set is available [17]. This is a new test case in comparison with those in [10], where the novel GEP algorithm was proposed, trained and tested for channel flows with a rectangular backstep and with periodic hills placed on the bottom.

Turbulence model formulation
To describe the flow behaviour in terms of the velocity and pressure fields, the momentum and continuity equations for an incompressible fluid are considered here in the Reynolds (time- or ensemble-) averaged form, leading to the widely used RANS equations, in which the unknown Reynolds stress (RS) tensor must be modelled to close the equations. The Boussinesq hypothesis (gradient expression)

τij = (2/3)k δij − 2νt Sij,   νt = Cμ k²/ε = Cμ k/ω,   Cμ = 0.09,   Sij = (1/2)(∂Ui/∂xj + ∂Uj/∂xi),

is usually applied for the RS tensor: it is based on the isotropic eddy-viscosity coefficient νt and is related linearly to the strain tensor Sij defined by the spatial derivatives of the mean velocity components Ui. The turbulent kinetic energy k and its viscous dissipation rate ε (or the specific rate ω) are found from extra transport equations formulated within the various versions of the k-ε, k-ω and other turbulence models. The linear Boussinesq relation between the RS and strain tensors can be rewritten in terms of the RSA tensor as

bij = τij/(2k) − δij/3 = −Cμ sij,    (1)

where the non-dimensional strain tensor sij = (k/ε) Sij can be introduced as in [8,18] to develop an algebraic relation between the non-dimensional RSA tensor and the mean velocity derivatives. As mentioned in the Introduction, the conventional model (1) leads to considerable errors and uncertainties, especially in separated turbulent flows bounded by complex-geometry surfaces. Most attempts to enhance the RANS approximations modify the expression for bij following [8], where the generalized eddy-viscosity model for incompressible flow is presented as a function of the non-dimensional strain (sij) and rotation (rij) tensors, decomposed into ten terms with the isotropic basis tensors T(n), whose coefficients depend on the scalar functions λ1, …, λ5 (the invariants of sij and rij):

bij = Σn g(n)(λ1, …, λ5) T(n)ij,   n = 1, …, 10.    (2)

The ML tool is directed to find the optimal scalar coefficients g(n)(λ1, …, λ5).
Once the coefficients are found by training the model with high-fidelity data, which serve as the target solution and are taken from available datasets, relation (2) yields a new RSA tensor bij that can be inserted into the RANS solver to improve its performance for new flow cases in comparison with the baseline (1).
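As a concrete illustration, the baseline anisotropy of model (1) can be evaluated pointwise from the mean velocity gradient and the k, ε fields. The sketch below is a minimal NumPy version with illustrative names, not the authors' code:

```python
import numpy as np

C_MU = 0.09  # eddy-viscosity coefficient of model (1)

def boussinesq_anisotropy(grad_u, k, eps):
    """Baseline RSA tensor b_ij = -C_mu * (k/eps) * S_ij from the linear
    Boussinesq hypothesis, for one 2D point (names are illustrative)."""
    S = 0.5 * (grad_u + grad_u.T)   # mean strain-rate tensor S_ij
    return -C_MU * (k / eps) * S
```

By construction the result is symmetric, and for an incompressible (trace-free) mean velocity gradient it is traceless; both properties make useful sanity checks on the data before the learning stage.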

Methodology
As the machine learning algorithm, gene expression programming (GEP), which belongs to symbolic regression, is used here as in [10,11]. GEP was first introduced in [19] and has since attracted research communities that further develop and enhance the method for new fields. It is a type of evolutionary algorithm closely related to genetic algorithms (GA) and genetic programming (GP). GEP is a special field of evolutionary computation that builds models and programs automatically to solve problems in any domain, without depending on or making assumptions about the domain. The fundamental difference between the GA, GP and GEP algorithms resides in the nature of the individuals: linear strings of fixed length (chromosomes) in GA, and nonlinear entities of different sizes and shapes (parse trees) in GP. In GEP, by contrast, the individuals are encoded as linear strings of fixed length (the genome or chromosomes) which are afterwards expressed as nonlinear entities of different sizes and shapes (i.e., simple diagram representations or expression trees).
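The defining trick of GEP, namely that a fixed-length string always expresses into a valid expression tree, can be sketched in a few lines. The breadth-first ("Karva") decoding below is a toy illustration with an assumed three-function primitive set, not the geppy implementation:

```python
from collections import deque
import operator

# Toy function set and arities; illustrative, not the paper's primitive set.
FUNCS = {'+': operator.add, '-': operator.sub, '*': operator.mul}
ARITY = {'+': 2, '-': 2, '*': 2}

def express(chromosome, terminals):
    """Decode a fixed-length Karva string breadth-first into an expression
    tree and evaluate it. Unused tail symbols are simply ignored, which is
    why every fixed-length string is a syntactically valid program."""
    root = {'sym': chromosome[0], 'kids': []}
    queue, i = deque([root]), 1
    while queue:
        node = queue.popleft()
        for _ in range(ARITY.get(node['sym'], 0)):
            child = {'sym': chromosome[i], 'kids': []}
            i += 1
            node['kids'].append(child)
            queue.append(child)

    def ev(node):
        if node['sym'] in FUNCS:
            a, b = (ev(c) for c in node['kids'])
            return FUNCS[node['sym']](a, b)
        return terminals[node['sym']]

    return ev(root)
```

For example, the string "+*abb" expresses to (b·b) + a, while in "*abab" the last two symbols are dead tail that mutation may later activate.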
The GEP process begins with the random generation of the chromosomes of the initial population. The chromosomes are then expressed, and the fitness of each individual is evaluated. The individuals are selected according to their fitness to reproduce with modification. Reproduction with modification ensures the creation of individuals with the genetic diversity that drives evolution in each generation; it includes replication, mutation and transposition, which imitate the real-life behaviour of genes. In other words, the modifications are in fact modifications of the functions reproduced every generation. Diversifying these functions enlarges the search space, helping to find the best-fitting coefficients and preventing the optimisation from getting stuck in local maxima or minima.
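The generation loop just described can be sketched schematically. The toy below evolves a single scalar coefficient by truncation selection and Gaussian-mutated replicas, a much-simplified stand-in for GEP's replication/mutation/transposition operators (all names, parameters and the synthetic data are illustrative); it recovers the coefficient Cμ = 0.09 of model (1) from samples of b = −Cμ s:

```python
import random

def evolve(fitness, init_pop, generations=60, mut_sigma=0.1, seed=0):
    """Minimal selection-with-modification loop (a schematic stand-in for
    GEP's operators): sort by fitness, keep the better half, refill the
    population with mutated replicas of the survivors."""
    rng = random.Random(seed)
    pop = list(init_pop)
    for _ in range(generations):
        pop.sort(key=fitness)                      # lower error = fitter
        survivors = pop[:len(pop) // 2]            # truncation selection
        children = [g + rng.gauss(0.0, mut_sigma)  # mutated replicas
                    for g in survivors]
        pop = survivors + children                 # survivors act as elites
    return min(pop, key=fitness)

# toy target: recover C_mu = 0.09 in b = -C_mu * s from four samples
samples = [(s, -0.09 * s) for s in (0.5, 1.0, 1.5, 2.0)]
mse = lambda g: sum(((-g * s) - b) ** 2 for s, b in samples)
init_rng = random.Random(1)
best = evolve(mse, [init_rng.uniform(-1.0, 1.0) for _ in range(20)])
```

Keeping the unmutated survivors in the population is what guarantees the best fitness never degrades between generations, mirroring the elitism used in most GEP implementations.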
The implementation of GEP in RANS turbulence modelling is as follows:
- define the Reynolds stress anisotropy tensor bij from the high-fidelity DNS or LES dataset;
- calculate the scalars λm and tensors T(n) from the values of k, ε and ∂Ui/∂xj in the high-fidelity data set;
- start GEP based on these λm, T(n), using the a priori form bij = bij(T(1), T(2), …, λ1, λ2, …);
- receive optimal algebraic equations from GEP as a combination of some or all input functions;
- test the new algebraic model for bij in different test cases to assess its predictive capability.
Advantages of such an approach are as follows [10]:
- no particular model form needs to be assumed;
- little human bias enters the selection of functions, as they are chosen randomly;
- the computational cost is low as long as the search space is restricted to relevant functions.
In the present study, the GEP tool is explored with fixed-length chromosomes to find the bij values. For the GEP implementation, the geppy library (https://geppy.readthedocs.io/en/latest) is used: an evolutionary algorithm framework designed for GEP in Python, built on top of the more general evolutionary computation framework DEAP, which does not completely support GEP by itself.

Test case
A turbulent flow over a curved backward-facing step (CBFS) on the channel bottom, for which an extensive set of high-fidelity computations is available [17], is chosen as the test case (figure 1). The tensor basis is first calculated from (2), taking all values from the available LES data (https://turbmodels.larc.nasa.gov/Other_LES_Data/curvedstep.html) and using only four tensors T(1), T(2), T(3), T(4) and two scalar invariants λ1, λ2. Taking more tensors causes only a negligible increase in accuracy while increasing the computation time considerably [10]. In the two-dimensional setting, each input tensor T contains four components (T11, T12, T21, T22), which are flattened as in [13] to give a single vector as the input to the GEP algorithm. With the input features ready, the GEP algorithm is trained on the anisotropy tensor bij to obtain the new equations for bij. These equations can then be inserted into the RANS solver and verified in an a posteriori study for the same flow at different Reynolds numbers or for a new flow case of similar geometry.

Figure 1. The LES data [17] at Reynolds number Re = UinH/ν = 13 700, presented by contours of the non-dimensional horizontal mean velocity component U(x,y)/Uin, the turbulent kinetic energy k(x,y)/Uin², and the shear Reynolds-stress component <u′v′>/Uin², where Uin is the inlet free-stream velocity and H is the curved step height.
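The feature preparation described above (four basis tensors, two invariants, flattening of the 2×2 components) might look as follows in NumPy. This is a sketch under the standard decomposition of [8], with illustrative function names rather than the authors' code:

```python
import numpy as np

def tensor_basis(grad_u, k, eps):
    """First four basis tensors and two invariants from the 2x2 mean
    velocity gradient, non-dimensionalized by k/eps (names illustrative)."""
    s = 0.5 * (grad_u + grad_u.T) * (k / eps)   # non-dimensional strain s_ij
    r = 0.5 * (grad_u - grad_u.T) * (k / eps)   # non-dimensional rotation r_ij
    I = np.eye(2)
    T1 = s
    T2 = s @ r - r @ s
    T3 = s @ s - np.trace(s @ s) / 3.0 * I
    T4 = r @ r - np.trace(r @ r) / 3.0 * I
    lam1 = np.trace(s @ s)                      # invariant λ1
    lam2 = np.trace(r @ r)                      # invariant λ2
    # flatten each tensor's (T11, T12, T21, T22) into one feature vector,
    # as done for the GEP input in the paper
    features = np.concatenate([T.ravel() for T in (T1, T2, T3, T4)])
    return features, (lam1, lam2)
```

A quick consistency check: λ1 = tr(s²) is non-negative for the symmetric s, while λ2 = tr(r²) is non-positive for the antisymmetric r.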

Computation details
When running the GEP algorithm, the number of data points was kept in the feasible range from 2500 to 10000 to obtain an optimized solution with an acceptable training time. The training time increases drastically with the number of points, so the training dataset was initially reduced to the area (in the recirculation zone downstream of the step, near the lower surface) with the highest discrepancy between the high- and lower-fidelity data of the LES and RANS simulations. The population size and the number of generations are the criteria that ensure, respectively, a sufficient search space and enough iterations for the algorithm to converge to an optimum solution. The population size was kept between 500 and 1000, and the better-optimized equations were obtained with population sizes near the higher threshold. The equations converge within 50-70 generations, after which the drop in loss is no longer large enough to justify the additional computation time.
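Restricting the training data to the high-discrepancy window and capping the number of points amounts to a masking and subsampling step, sketched below in NumPy (the window bounds, the point cap and all names are illustrative):

```python
import numpy as np

def select_training_points(x, y, xlim, ylim, max_points=10000, seed=0):
    """Indices of grid points inside the training window, randomly
    subsampled to a feasible size (a sketch; names are illustrative)."""
    mask = (x > xlim[0]) & (x < xlim[1]) & (y > ylim[0]) & (y < ylim[1])
    idx = np.flatnonzero(mask)
    if idx.size > max_points:
        idx = np.random.default_rng(seed).choice(idx, size=max_points,
                                                 replace=False)
    return idx
```

Returning indices rather than values lets the same selection slice the feature vectors and the target bij components consistently.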

Results
The final convergent equation (3) for the RSA tensor was obtained after 20 runs with 50 iterations (i.e. 50 generations) of GEP for the reduced training region (2.0 < x/H < 3.2, 0 < y/H < 0.45) of the CBFS flow. Next, we trained the GEP algorithm once again, with 50 iterations and 20 runs, for a larger CBFS flow area (−5.0 < x/H < 10.0, 0 < y/H < 1.25) and obtained the modified explicit relation (4) for the RSA tensor. Because GEP is a symbolic regression algorithm, a few of the equations obtained after 20 runs contain a constant term (without T) on the right-hand side, which makes the equation unphysical; these results had to be discarded. We also had to carefully select the equations (and then average them) that closely resembled the form given by [10,11], in which each input tensor T is represented. There were, however, instances where the best results were obtained without all the tensors being represented, as shown by expression (4) for the larger training area. The Reynolds-stress anisotropy tensor components for the normal stress <u′u′> and shear stress <u′v′> obtained by the baseline Boussinesq model (1) and by the GEP models (3) and (4) are compared with the LES data in figures 2−4. To calculate the bij components by (1), (3) and (4), all quantities are taken from the high-fidelity data [17]. Our primary goal in this a priori study is to examine whether the RSA tensor predictions indeed improve on the baseline model, i.e. whether new explicit expressions that better fit the anisotropy tensor have been found. One can observe (figures 2−4) that the Boussinesq model (1) represents the anisotropy tensor poorly, while the machine-learning model equations (3) and (4) reproduce the RSA tensor components much more closely.

Conclusion
Results of implementing machine learning techniques to enhance the performance of RANS turbulence models using available high-fidelity LES data are presented and discussed. For this purpose, the suitability of gene expression programming (GEP), a form of symbolic regression, is tested for the canonical case of a turbulent channel flow with a curved backward-facing step located on the bottom. Using GEP, a new explicit algebraic Reynolds-stress model is obtained, and the predictions of the Reynolds-stress anisotropy tensor components are improved in comparison with those of the baseline Boussinesq linear-gradient hypothesis, as demonstrated in the a priori study.
For future work, it should be checked whether the ML models obtained in the present study perform consistently well for other flow cases and remain useful in general once the new relation is inserted into a RANS solver. Furthermore, it is of interest to see how the GEP technique works when the tensor basis input is considered in its original shape rather than flattened for computational convenience, as was done in the present research. Also, a more elegant method than averaging needs to be developed to obtain a faster-converging equation, so that a limited number of GEP runs returns acceptable solutions.