Random Fourier Surrogate for simulation calibration

In this paper we propose the Random Fourier Surrogate (RFS) as a method for simulation calibration. Computer simulations are actively used in various fields to model and analyze complex real-world phenomena, enabling practitioners to explore situations inaccessible by other means. Creating a simulation that accurately reflects the desired real-world scenario requires calibrating the simulation parameters, a non-trivial process that demands domain expertise and iterative trial and error. The model-bridge paradigm has been proposed to automate this process: a more computationally efficient surrogate model is used in lieu of the real simulation, and a bridge model maps the non-interpretable surrogate parameters to calibrated, physically interpretable simulation parameters. Recently, an efficient solution called tangent slope-intercept (TSI) descriptors has been proposed, in which tangent lines of a principal curve represent the simulation in low dimensionality. However, we find that it has two key problems: it is prone to overfitting and is sensitive to initialization. We address these issues by using Random Fourier Features to construct a surrogate model, thereby reducing calibration time by avoiding complex computations while retaining high accuracy. We evaluate the effectiveness of our method with several experiments on synthetic signal simulations and physical simulations of turbulent flow dynamics.


Introduction
Computer simulations have been widely used in decision-making systems, given their ability to imitate complex real-life problems and predict their outcomes. They are often used in areas such as weather forecasting [1], medicine [2], applied physics [3], and many other fields. For a simulation to be effective in its predictions, it must first be calibrated accurately. Simulation calibration is the essential task of estimating the parameters of a given simulation so that it correctly reflects reality. However, simulation calibration usually requires large amounts of computational power and poses scalability issues as simulations grow in scale. Therefore, several solutions have been proposed to automate this process and provide more accurate estimates while reducing the computational overhead.
Simulation calibration has been addressed by several methods, one of which is Kernel Approximate Bayesian Computation (Kernel ABC) [4,5]. In Kernel ABC, calibration is performed by calculating the kernel mean of the posterior distribution from a sample parameter θ, which is initially generated from a prior distribution π(θ). Then, kernel herding [6] is applied to sample data from the kernel mean distribution, resulting in a newly formed posterior sample. This sampling process is iterated until the desired outcome is obtained. However, highly complex simulations still require substantial computational power to calibrate in this way.
Therefore, a more recent framework has been proposed to alleviate the computational cost of simulation calibration: the "model-bridging" paradigm [7], in which a bridge model maps the non-interpretable machine learning parameters to the interpretable parameters of a simulation model. It requires two main models: a surrogate model for the simulation system, and a surrogate-to-simulator bridge model. Figure 1 illustrates the general framework of this approach.

IC-MSQUARE-2023 Journal of Physics: Conference Series 2701 (2024) 012048

Figure 1. General framework of the model-bridge paradigm that was first proposed by [7].

The surrogate model is created using the input and output data available from the target simulation, representing the simulation model explicitly in order to obtain a non-interpretable vector descriptor of the simulation parameters. The descriptor is then mapped to the simulation parameters using a machine learning model called the bridge model, producing an interpretable vector that can be used in the target simulation model. To produce optimal parameter estimates, however, the bridge model must first be trained on a dataset with known calibrated parameters. This framework reduces the computational cost typically expected when performing simulation calibration.

Creating a surrogate model of the simulation is key in the model-bridging paradigm, as this model must closely match the structure of the target simulation in order to produce accurate parameter estimates. A recent geometry-based method has been proposed for use as a surrogate model, called tangent slope-intercept (TSI) [8]. In TSI, tangent lines of the principal curve are created at various points of the simulation data using the bending algorithm [9]. The slope and intercept at each point of the principal curve are then ordered into a TSI vector descriptor that serves as input for the bridge model; the bridge model used in this method is a simple linear regression model. Although effective, the use of the bending algorithm in TSI has a number of limitations. Firstly, it is sensitive to initialization, as it is an iterative optimization algorithm: the initial values can greatly influence the result the algorithm eventually converges to, making it unreliable for cases where high error rates cannot be tolerated. Secondly, the bending algorithm is prone to overfitting, capturing noise and random fluctuations along with the true data and leading to poor performance when used over many simulations. Lastly, the TSI descriptor method has slower calibration times on more complex forms of simulation data, as shown in Section 3, Table 2; this can also be attributed to the iterative nature and optimization requirements of the principal curve in the bending algorithm.
To address these issues, we propose the random Fourier surrogate (RFS), a kernel representation with a fast implementation via random Fourier features (RFF) [10]. Kernel spaces, with possibly infinite dimension, can handle nonlinear structures while being more resistant to overfitting. However, the kernel trick often used to implement kernel methods can be computationally costly, especially in cases such as ours, where one needs to compute a kernel matrix not only for the target simulation but for all simulations used to train the machine learning bridge model. Therefore, in our surrogate model, we employ random Fourier features to handle the kernel implementation: the kernel data transformation is approximated directly by computing a set of random sinusoidal basis functions, thereby avoiding the kernel trick altogether. Additionally, RFS uses a single basis shared between all simulations, further reducing computational cost by avoiding re-sampling and mapping each simulation to newly generated random coordinates during training.
We evaluate the effectiveness of our method by reporting the results of several experiments on synthetic signal simulations and physical simulations of turbulent flow dynamics. We also compare our results to the previously proposed methods mentioned in this section, further showcasing our method's efficacy in the task of simulation calibration. The rest of the paper is organized as follows. In Section 2, we describe our method in detail and how we tackle the current problems of simulation calibration. In Section 3, we provide our experimental results and discussion in comparison with a number of baseline methods. Finally, Section 4 concludes the paper and provides grounds for future work.

Proposed Method
In this section, we formulate the problem that our proposed method addresses. Then, we describe our proposed surrogate model, the random Fourier surrogate (RFS), and the bridge model.

Problem formulation
In our formulation, a simulation s is a process that takes some input variable x ∈ R and transforms it to some output y ∈ R. For example, in one of our experiments, s simulates a system composed of a flow moving within a cavity, where x is the initial velocity and y is the velocity at the center after 5 [s] have elapsed. However, in this paper a simulation is defined very generally, and need not even be a physical simulation.
Our goal is to calibrate a target simulation s(x, θ) given data D = {(x_j, y_j)}_{j=1}^n, where each of the n pairs (x_j, y_j) ∈ R^2 contains a corresponding input and output of the simulation. By calibration, we concretely refer to retrieving the simulation parameters θ ∈ R^p. In our cavity example, the parameter is the Reynolds number of the flow, but it could be any system property.
As usual in machine learning, we assume access to a training dataset, in our case composed of N other simulations, which is used to train the bridge model. More concretely, each training simulation is represented by its samples and calibrated parameters, denoted as {(D_i, θ_i)}_{i=1}^N. Returning to our running example, we are given N simulations from various systems s_i, with different Reynolds numbers θ_i, and are tasked to find the unknown Reynolds number θ of the target system s.

Random Fourier Features as a surrogate model
In our proposed method, we adopt the model-bridging paradigm explained in the Introduction, so it is divided into two steps: 1) representing the system s by our RFS, and 2) mapping the RFS to the parameter θ using the bridge model, achieving our objective of calibration.
The key idea of RFS is to represent the system s : R → R by a linear map S : H → H using kernel ridge regression, where H is a reproducing kernel Hilbert space. The motivation is that increasing the dimension to infinity allows us to describe even a complex s with a linear map. Each state can theoretically be mapped by some feature map ϕ : R → H, so that

    s(x) ≈ ⟨ξ, ϕ(x)⟩_H ,

where ξ ∈ H is the parameter of the map S. In practice, such kernel techniques are often implemented via the kernel trick: ϕ itself is inaccessible, and only the inner product ⟨•, •⟩_H of the Hilbert space H can be calculated, through some function k(x, y) = ⟨ϕ(x), ϕ(y)⟩_H that has a closed form independent of ϕ. By applying the kernel trick, we avoid performing complex transformation calculations and only compute inner products between the given data points in a high-dimensional space. Although efficient, the time and memory complexity of the kernel trick calculations increase significantly as the number of samples n in a given dataset increases.
This is because the algorithm must compute the inner product of every pair of data points in the dataset, resulting in a memory complexity of O(n^2) and a time complexity of O(n^3). In our case, we would need to apply the kernel trick to multiple datasets: not only the target simulation data D, but every training simulation dataset D_i. The kernel trick is therefore an inefficient approach here. Instead, in our RFS we employ random Fourier features (RFF), proposed to alleviate the cost of large-scale kernel learning [10]. The key idea is to approximate the inner product with a randomized map z, which does not depend on pairs of points but transforms each data point independently:

    k(x, y) ≈ z(x)ᵀ z(y).

This idea is inspired by Bochner's theorem [11], which states that a continuous kernel k(x, y) = k(x − y) is positive definite if and only if k(∆) is the Fourier transform of a nonnegative measure. That is, selecting some measure p(ω), with ω ∼ p(ω), we can write

    k(x − y) = ∫ p(ω) e^{iω(x−y)} dω = E_ω[e^{iω(x−y)}].    (5)–(6)

Note that most common kernels used in machine learning are of the form k(x − y), including the Gaussian, Cauchy and Laplace kernels. We draw R i.i.d. samples ω_r (r = 1, …, R) from p(ω), then define a map h : x ↦ [e^{iω_1 x}, …, e^{iω_R x}]. With this apparatus, we can write

    E_ω[e^{iω(x−y)}] ≈ (1/R) h(x) h(y)* = (1/R) Σ_{r=1}^{R} e^{iω_r(x−y)},    (8)

where Equation (5) is just the application of Bochner's theorem, Equation (8) follows from the definition of h, and * denotes the complex conjugate.
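In practice the complex exponentials are replaced by the equivalent real-valued map z(x) = √(2/R) [cos(ω_r x + b_r)]_r with b_r ∼ U(0, 2π), as in [10]. The following is a minimal sketch of this approximation for the Gaussian kernel, whose spectral measure is p(ω) = N(0, 1) (the choice adopted later in this paper); the test points and feature count are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
R = 2000                                 # number of random features
omega = rng.normal(0.0, 1.0, size=R)     # omega_r ~ p(omega) = N(0, 1)
b = rng.uniform(0.0, 2 * np.pi, size=R)  # b_r ~ U(0, 2*pi)

def z(x):
    """Real-valued random Fourier feature map z : R -> R^R."""
    x = np.atleast_1d(x)
    return np.sqrt(2.0 / R) * np.cos(np.outer(x, omega) + b)

def gaussian_kernel(x, y):
    """Exact Gaussian kernel k(x - y) = exp(-(x - y)^2 / 2), whose spectral measure is N(0, 1)."""
    return np.exp(-0.5 * (x - y) ** 2)

x0, y0 = 0.3, 1.1
approx = (z(x0) @ z(y0).T).item()        # z(x)^T z(y), the randomized estimate of k(x, y)
exact = gaussian_kernel(x0, y0)
print(abs(approx - exact))               # shrinks as R grows
```

The estimation error decreases at roughly O(1/√R), so a few thousand features already give a close match for scalar inputs.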
The resulting real-valued map z(x) = √(2/R) [cos(ω_1 x + b_1), …, cos(ω_R x + b_R)]ᵀ, with b_r ∼ U(0, 2π), has the following property: E_{ω,b}[z(x)ᵀ z(y)] = k(x, y), so z(x)ᵀ z(y) is an unbiased randomized estimate of the kernel.

In summary, our algorithm works as follows. In the beginning, we draw R samples {ω_r}_{r=1}^R and {b_r}_{r=1}^R from p(ω) and U(0, 2π), respectively. R is a hyperparameter that we determine experimentally, and we reuse the same generated values {ω_r}_{r=1}^R and {b_r}_{r=1}^R across all simulation datasets D_i. This ensures that the same transformation is applied when we predict on new data. We also use p(ω) = N(0, 1) in all experiments, which corresponds to using the Gaussian kernel.
Then, we compute z(x_j) (j = 1, …, n) for all the data of the target system s. Let Z ∈ R^{n×R} be the application of the map z to all n samples x_j. We compute the kernel ridge regression parameters as

    ξ = (ZᵀZ + λI)^{-1} Zᵀ y,

where y = [y_1, …, y_n]ᵀ is the vector of all outputs of the target simulation s, and λ is the ridge regularization parameter. Finally, ξ is an R-dimensional parameter vector that represents s. Note that, similar to TSI [8], our surrogate approximates s without depending on the simulation model or the parameter θ, only on the system data D; unlike TSI, however, it does not require an iterative algorithm such as the bending algorithm to learn this representation.
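The surrogate fit above can be sketched as follows; the toy system, feature count and λ below are illustrative placeholders, not the paper's datasets or tuned hyperparameters:

```python
import numpy as np

rng = np.random.default_rng(1)
R, lam = 300, 1e-6                    # feature count and ridge parameter (illustrative values)
omega = rng.normal(0.0, 1.0, size=R)  # shared basis, reused across every simulation
b = rng.uniform(0.0, 2 * np.pi, size=R)

def z(x):
    """Random Fourier feature map applied to a vector of scalar inputs."""
    return np.sqrt(2.0 / R) * np.cos(np.outer(x, omega) + b)

def fit_rfs(x, y, lam):
    """Kernel ridge regression in RFF space: xi = (Z^T Z + lam*I)^(-1) Z^T y."""
    Z = z(x)                                           # shape (n, R)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(R), Z.T @ y)

# Toy system standing in for one simulation dataset D = {(x_j, y_j)}
x = np.linspace(0.0, 1.0, 36)
y = x ** 2 - 0.3 * x
xi = fit_rfs(x, y, lam)               # R-dimensional descriptor representing the system
y_hat = z(x) @ xi                     # surrogate reconstruction of the outputs
```

Because the same (ω_r, b_r) basis is reused, `fit_rfs` can be called once per simulation and the resulting ξ vectors are directly comparable across systems.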

Bridge regression model
Using the RFS described in the previous subsection, we compute a representation ξ_i for every simulation in the training dataset, and then use these to train a bridge regression model. We use a multivariate linear regression model Ψ mapping ξ to θ. Note that this model is linear with respect to ξ, but is a kernel regression with respect to x.
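Putting the two steps together, the pipeline can be sketched end to end on a toy family of simulations; the family s(x, θ) = θx² and all numeric values below are illustrative assumptions, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
R, lam = 100, 1e-6
omega = rng.normal(0.0, 1.0, size=R)     # shared RFF basis
b = rng.uniform(0.0, 2 * np.pi, size=R)

def z(x):
    return np.sqrt(2.0 / R) * np.cos(np.outer(x, omega) + b)

def fit_rfs(x, y):
    Z = z(x)
    return np.linalg.solve(Z.T @ Z + lam * np.eye(R), Z.T @ y)

# Toy family of simulations: s(x, theta) = theta * x**2, theta in [1, 3]
def simulate(theta, x):
    return theta * x ** 2

x_grid = np.linspace(0.0, 1.0, 30)
thetas = np.linspace(1.0, 3.0, 40)       # N training simulations with known parameters
Xi = np.stack([fit_rfs(x_grid, simulate(t, x_grid)) for t in thetas])  # (N, R)

# Bridge model: linear regression from xi to theta (least squares, with intercept)
A = np.hstack([Xi, np.ones((len(thetas), 1))])
w, *_ = np.linalg.lstsq(A, thetas, rcond=None)

# Calibrate a new target simulation with unknown theta = 2.2
xi_target = fit_rfs(x_grid, simulate(2.2, x_grid))
theta_hat = np.append(xi_target, 1.0) @ w
print(theta_hat)
```

In this toy family ξ is exactly linear in θ (ridge regression is linear in y, and y is linear in θ), so the linear bridge recovers the parameter almost exactly; real simulations are only approximately linear in ξ.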

Experiments and Discussion
In this section, we first discuss the datasets used in our experiments in detail. Then, we present our evaluation of both the calibration time and the error rate of our method on these datasets. Lastly, we discuss our findings in comparison with previously proposed methods.

Datasets
For our experiments, we use two kinds of datasets:

(i) Cavity fluid dynamics: This dataset was created using OpenFOAM® [12], an open-source computational fluid dynamics software package that enables users to simulate fluid behaviour in various scenarios. For our experiments, we used a two-dimensional cavity filled with a fluid in a stationary state; movement is initiated at the top of the cavity with initial velocity x at t = 0 s, and the flow reaches a turbulent state at t = 5 s, at which point we measure the velocity y at the center of the cavity. We are interested in the Reynolds number θ as the parameter of the simulation system y = s(x, θ). We collected 600 simulation systems, with Reynolds numbers ranging from 10,000 to 70,000 in increments of 100, and initial velocities x varying from 0.65 [m/s] to 1 [m/s] in increments of 0.1 [m/s]. Each simulation system contains 36 data points. Figure 2 illustrates the relationship between the initial and final velocities x and y depending on the Reynolds number θ.

(ii) Synthetic signals: We also created three separate synthetic datasets, namely the Polynomial, Sinewave and Logistic datasets, which emulate typical behaviours of simulation systems used in a variety of application scenarios. We randomly generate data for these families of simulation systems using equations (15), (16) and (17), given a parameter vector θ and system noise ε drawn from N(0, σ_s²). For the Polynomial simulation system, we compute

    y = Σ_k θ_k x^k + ε.    (15)

For the Sinewave system, given an amplitude A, frequency F and phase φ, we compute

    y = A sin(2πFx + φ) + ε.    (16)

Finally, for the Logistic system, given an amplitude A, slope s and midpoint µ, we compute

    y = A / (1 + e^{−s(x−µ)}) + ε.    (17)

The resulting total number of simulation systems for each synthetic signal is S = 100, with n = 100 data points per system.
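Generation of the synthetic families can be sketched as follows, assuming the functional forms suggested by the named parameters; the coefficient values, input range and noise level σ_s below are illustrative assumptions, not the paper's settings:

```python
import numpy as np

rng = np.random.default_rng(3)
sigma_s = 0.05                        # system noise level (illustrative)

def polynomial(x, theta):
    """y = sum_k theta_k * x**k + noise, for an ascending coefficient vector theta."""
    return np.polyval(theta[::-1], x) + rng.normal(0.0, sigma_s, size=x.shape)

def sinewave(x, A, F, phi):
    """y = A * sin(2*pi*F*x + phi) + noise."""
    return A * np.sin(2 * np.pi * F * x + phi) + rng.normal(0.0, sigma_s, size=x.shape)

def logistic(x, A, s, mu):
    """y = A / (1 + exp(-s * (x - mu))) + noise."""
    return A / (1.0 + np.exp(-s * (x - mu))) + rng.normal(0.0, sigma_s, size=x.shape)

n = 100                               # data points per system, as in the paper
x = np.linspace(-1.0, 1.0, n)
y_poly = polynomial(x, np.array([0.2, -1.0, 0.5]))   # 0.2 - x + 0.5*x**2
y_sine = sinewave(x, A=1.0, F=2.0, phi=0.3)
y_logi = logistic(x, A=2.0, s=5.0, mu=0.0)
```

Each call produces one simulation system {(x_j, y_j)}; repeating it with randomly drawn parameter vectors yields the S = 100 systems per family.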

Experiment setup and results
For our baselines, we used the previously proposed methods tangent slope-intercept (TSI) [8] and Kernel ABC [13]. We set several hyperparameters for Kernel ABC in our experiments: the number of samples from the prior, B = 1000; the Gram matrix regularization parameter, 10^{-6}; and the bandwidth parameter, σ_y = 1. Kernel ABC estimates the posterior distribution of θ, so we use the average θ to compare against our method's results. For the hyperparameter R, determined separately on each dataset, we used the following values: R = 350 for the Cavity dataset; R = 180 for both the Logistic and Polynomial datasets; and R = 20 for the Sinewave dataset.

The calibration results of our experiments are detailed in Table 1, and the respective training and testing calibration times in Table 2. Results in Table 1 are given as root mean squared error (RMSE), a common regression metric that keeps the unit of measure of the original data. KernelABC, a standard simulation calibration method, achieves decent calibration accuracy with an RMSE of 134.83 on the Cavity dataset; TSI and our proposed RFS achieve better performance, with error rates of 38.52 and 80.524, respectively, showcasing the effectiveness of model-bridge approaches. Although RFS fails to outperform TSI on the Cavity dataset, the difference in their performance is not very significant, considering that the error rate is measured in the order of magnitude of 10, while the actual parameters to be calibrated are in the order of magnitude of 10,000. This can easily be seen in Figure 6, where the parameters calibrated by RFS are very close to the ground-truth parameters.
It is worth noting that, due to the nature of the bending algorithm, TSI is particularly well suited to low-degree simulation data such as the Cavity dataset shown in Figure 2, which partly explains its good performance there. Figure 7 shows the reconstructions of both TSI and RFS on different examples from the Cavity dataset. TSI reconstructs the shape of the curve well, but its estimate is shifted along the y-axis, scaling considerably higher than the ground truth. RFS, on the other hand, stays within the range of the ground-truth values, but does not completely reconstruct the true curve.
However, RFS achieves calibration results several orders of magnitude better than both KernelABC and TSI on all three synthetic signal datasets. Here, the simulations exhibit higher-order complexity, which is difficult for TSI: the bending algorithm either struggles to capture the high-degree curvature or overfits. This can also be seen graphically in Figure 8, which compares TSI and RFS reconstructions of the ground truth for the synthetic simulations. Compared to the Cavity dataset, the curves of the synthetic simulations are more varied, and TSI is unable to reconstruct them; RFS, on the other hand, captures the shape of the curves more accurately and provides a better representation. The difference in error rates is more significant for the synthetic datasets, as the actual parameter values are on a much smaller scale.

Calibration times for all datasets are shown in Table 2. As KernelABC is not a machine learning-based approach but a standard calibration approach, it does not have a training phase; however, its actual calibration time is very long, as it requires multiple runs of the actual simulation. TSI and RFS, in contrast, require a training phase but calibrate far more quickly. A similar difference in calibration times holds for the synthetic simulations, where it amounts to several orders of magnitude despite the simulations themselves being much smaller in scale than the Cavity dataset; this strongly supports the model-bridge paradigm. For all datasets, RFS completes calibration significantly faster than KernelABC and TSI; it is also faster than TSI, which is likewise model-bridge based, in both the training and testing (calibration) phases.
The experiments show the value of model-bridge approaches to simulation calibration in general, and of RFS in particular. RFS is significantly faster than standard calibration methods like KernelABC and model-bridge methods like TSI, while achieving better or similar calibration accuracy across the board.

Conclusion
In this paper, we proposed a new model-bridge method that uses Random Fourier Features (RFF) as a surrogate model for simulation calibration. A simple linear regression model was used as the bridge in our method, but any regression model can be used in this framework. We evaluated our method on synthetic signal data and turbulent flow dynamics data produced with OpenFOAM. Our findings show that our method performs similarly to, or in some cases better than, previously proposed models for simulation calibration, while requiring less time and fewer expensive calculations.

Figure 2 .
Figure 2. Cavity fluid behaviour where each color corresponds to simulation behavior under one Reynolds number.
Figures 3, 4 and 5 illustrate each of the synthetic datasets used.

Figure 6 .
Figure 6. RFS predictions compared to the ground truth for the Cavity, Logistic, Polynomial and Sinewave datasets. Values closer to the diagonal indicate more accurate calibration.

Figure 7 .
Figure 7. Reconstruction of TSI (blue) and RFS (orange) compared to ground truth (red) on several points of the Cavity dataset.

Figure 8 .
Figure 8. Reconstruction of TSI (blue) and RFS (orange) compared to ground truth (red) on several examples in synthetic signal datasets.

Table 1 .
Root mean squared error (RMSE) of each method on different datasets.

Table 2 .
Calibration time, in milliseconds (unless otherwise stated), of each method during training and testing stages.