
Cosmic Inference: Constraining Parameters with Observations and a Highly Limited Number of Simulations


Published 2021 January 11 © 2021. The American Astronomical Society. All rights reserved.
Citation: Timur Takhtaganov et al 2021 ApJ 906 74. DOI: 10.3847/1538-4357/abc8ed


Abstract

Cosmological probes pose an inverse problem where the measurement result is obtained through observations, and the objective is to infer values of model parameters that characterize the underlying physical system (our universe) from these observations and theoretical forward-modeling. The only way to accurately forward-model physical behavior on small scales is via expensive numerical simulations, which are further "emulated" due to their high cost. Emulators are commonly built with a set of simulations covering the parameter space with Latin hypercube sampling and an interpolation procedure; the aim is to establish an approximately constant prediction error across the hypercube. In this paper, we describe a novel statistical framework for obtaining accurate parameter constraints. The proposed framework uses multi-output Gaussian process emulators that are adaptively constructed using Bayesian optimization methods with the goal of maintaining a low emulation error in the region of the hypercube preferred by the observational data. We compare several approaches for constructing multi-output emulators that enable us to take possible inter-output correlations into account while maintaining the efficiency needed for inference. Using the Lyα forest flux power spectrum, we demonstrate that our adaptive approach requires considerably fewer simulations, by a factor of a few in the Lyα P(k) case considered here, than emulation based on Latin hypercube sampling, and that the method is more robust in reconstructing parameters and their Bayesian credible intervals.


1. Introduction

The field of cosmology has rapidly progressed in the past few decades, going from a largely qualitative picture of the hot Big Bang to the now well-tested standard model of cosmology. This relatively simple model describes current observations at the few-percent level using only six parameters (Planck Collaboration et al. 2020). While this has been a great success—driven by a deluge of observations—questions still remain about the nature of dark matter and dark energy, primordial fluctuations relating to inflation in the early universe, and the masses of neutrinos. To make further progress in answering these questions, new ground- and space-based observational missions will be carried out, probing the highly nonlinear scales of cosmic structure. Planned wide-field sky surveys such as the Dark Energy Spectroscopic Instrument (DESI Collaboration et al. 2016), the Large Synoptic Survey Telescope (LSST Dark Energy Science Collaboration 2012), the Wide Field Infrared Survey Telescope (WFIRST; Spergel et al. 2015), and Euclid (Refregier et al. 2010) will provide precision measurements of cosmological statistics such as weak-lensing shear correlations, cluster abundance, and the distribution of galaxies, quasars, and Lyα absorption lines. Inferring the values of physical model parameters from observations of these sky surveys belongs to the class of inverse problems in statistics.

The application of Markov chain Monte Carlo (MCMC; Metropolis et al. 1953; Gelman et al. 2013) or similar Bayesian methods requires hundreds of thousands to millions of forward-model evaluations in order to determine the posterior probabilities of the considered parameters. When modeling the highly nonlinear regime of structure formation in the universe, each such evaluation is a high-performance computing simulation costing more than ${10}^{5}$ CPU hours. Most perturbation theory methods break down at scales of $x\lesssim 65\,{h}^{-1}$ Mpc (Carlson et al. 2009). The most recent works have pushed this scale down to ∼10 ${h}^{-1}$ Mpc (Chen et al. 2020; d'Amico et al. 2020; Ivanov et al. 2020), but this is still between one and two orders of magnitude larger than what is needed for Lyα studies.

Cosmological simulations that numerically evolve the density field from first principles are therefore essential for the analysis of, and scientific inference from, future observational data sets.

While it may seem at first that this cost makes the inference computationally unfeasible, it is in fact possible to efficiently sample the parameter space with a dramatically reduced number of simulations, provided that certain smoothness conditions are satisfied. This is achieved through the development of cosmological emulators, that is, computationally cheap surrogate models of expensive cosmological simulations. The pioneering work on these techniques in cosmology was the cosmic calibration effort (Heitmann et al. 2006; Habib et al. 2007), resulting in a 1% accurate matter power spectrum emulator (Heitmann et al. 2010). Later works have expanded the range of validity of this nonlinear matter power spectrum in terms of k and redshift coverage, and they have also increased the number of cosmological parameters (Lawrence et al. 2017; Euclid Collaboration et al. 2019). In addition, this emulation technology has proliferated into the modeling of many other statistics and observables, including gravitational lensing quantities (Liu et al. 2015; Petri et al. 2015; Wibking et al. 2020), the galaxy halo occupation model (Kwan et al. 2015), the halo mass function (McClintock et al. 2019), the galaxy correlation function (Zhai et al. 2019), the 1D Lyα flux power spectrum (Walther et al. 2019), and the 21 cm power spectrum (Jennings et al. 2019). More generally, outside the field of astrophysics, there is a large body of work on emulators and designs for large-scale computer experiments; see, e.g., Haaland & Qian (2011) and Kleijnen (2015).

In this work we are not concerned with building an emulator for the cosmological simulation models that is accurate over the entire prior range of the input parameter values. We instead focus on constructing an emulator in an adaptive fashion by preferentially selecting the inputs for the simulation that are more likely to result in the values of the output that are consistent with the observational data. By building up our emulator in this sequential way, we strive to avoid performing unnecessary simulations that would be needed to have a globally accurate surrogate. To find an optimal point in parameter space for running the simulation, we use Bayesian optimization techniques (Mockus 1994; Kennedy & O'Hagan 2001; Mockus et al. 2014; Leclercq 2018), specifically developed to efficiently determine the global optima of functions (we also refer to Shahriari et al. 2016 or Frazier 2018 for recent pedagogical surveys of Bayesian optimization). A similar Bayesian optimization for the construction of an emulator of the 1D Lyα flux power spectrum has recently been considered in Rogers et al. (2019). The authors subsequently applied their method to the problem of the Lyα forest in Bird et al. (2019). The difference between their work and ours here is that we use a different acquisition function and a different problem parameterization. In practice, we do expect the two approaches to result in similar computational efficiency, and we consider our work as complementary to that of Rogers et al. (2019).

Such an iterative workflow can be executed efficiently on high-performance computing platforms using the system described in Lohrmann et al. (2017). Briefly, as the workflow requires exploration of the parameter space via simulation trials, each of these simulations becomes a job managed by a parallel scheduler. This approach relies on Henson (Morozov & Lukić 2016), a cooperative multitasking system for the in situ execution of loosely coupled codes.

Our treatment of multi-output emulation differs from the previous approach of Habib et al. (2007), which relied on dimension-reducing techniques. Instead of approximating the power spectrum in the basis obtained from a principal-component decomposition of the simulator's covariance structure, we assume a simple separable form for the covariance of the power spectrum as a vector-valued function (similar to the approach of Conti & O'Hagan 2010). This allows us to start with a small number of training inputs for the initial emulator construction and to iteratively refine the initial design. Additionally, the separable structure of the covariance function allows us to perform training and prediction with the emulator using Kronecker products of small matrices, which makes both operations efficient.

In this work, we use the 1D Lyα flux power spectrum as the output quantity of interest. Following reionization, which occurs around redshift z ∼ 8, the diffuse gas in the intergalactic medium (IGM) is predominantly photoionized, but the small residual fraction of neutral hydrogen gives rise to the Lyα absorption observed in spectra of distant quasar sightlines (for a recent review, see McQuinn 2016). This so-called Lyα forest is the premier probe of the IGM and cosmic structure formation at redshifts 2 ≲ z ≲ 6. As Lyα absorption at z ∼ 3 is sensitive to gas at around the cosmic mean density, and at redshifts z ≥ 4 even to the underdense gas in void regions of the universe (Lukić et al. 2015), complex and poorly understood physical processes related to galaxy formation are expected to play only a minor role in determining its structure (Desjacques et al. 2006; Kollmeier et al. 2006). Forward-modeling the structure of the IGM for a given cosmological and reionization scenario is thus a theoretically well-posed problem, but it requires expensive cosmological hydrodynamical simulations. The 1D power spectrum is a summary statistic of the Lyα flux field that measures the Fourier-space analog of two-point correlations in flux absorption along lines of sight to quasars. This statistic can be used to measure cosmological parameters (Seljak et al. 2006; Palanque-Delabrouille et al. 2020), constrain the neutrino sector (Palanque-Delabrouille et al. 2015; Rossi et al. 2015; Yèche et al. 2017), probe exotic dark matter models (Armengaud et al. 2017; Iršič et al. 2017; Rogers & Peiris 2020), or measure the thermal properties of the IGM (Boera et al. 2019; Walther et al. 2019). Here, we focus on parameters describing the thermal state of the IGM, similarly to Walther et al. (2019). However, nothing in our inference formalism is specific to the Lyα forest, to this particular data set, or to these simulations; the method we present here can therefore be applied straightforwardly to other cosmological probes as well.

The outline of the paper is as follows. In Section 2 we describe the details of the forward model for the Lyα power spectrum. In addition to hydrodynamical simulations, we also use an approximate model for postprocessing the thermal state of the IGM, which is described in Appendix A. In Section 3 we provide a high-level overview of our main approach to inferring cosmological parameters from measurement data. First, we state the general Bayesian inference problem, and following Bilionis & Zabaras (2014), we show how it can be reformulated using a Gaussian process (GP) emulator as a Bayesian surrogate of the forward model. Next, we provide an outline of the adaptive algorithm developed in Takhtaganov & Müller (2018), which we use to construct a GP emulator iteratively. The details of the GP emulator construction and a comparison of the approaches to modeling interactions between emulator outputs are provided in Section 4, as well as in Appendix B. Results of applying this method on Viel et al. (2013) data and inferring thermal parameters of the IGM are given in Section 5. Finally, we present our conclusions in Section 6.

2. Forward Model

In this paper, we analyze different ways of inferring the model parameters using flux power spectrum observations. To this end, it is necessary to model the growth of cosmological structure and the thermal evolution of the IGM on scales far smaller (down to ${ \mathcal O }(10\,{h}^{-1}\,\mathrm{kpc})$) than those described by linear perturbation theory. Cosmological hydrodynamical simulations with atomic cooling and UV heating are the only method capable of modeling this process at percent-level accuracy (for approximate methods, see Sorini et al. 2016; Lochhaas et al. 2016, and references therein). Unfortunately, such simulations are computationally very expensive, requiring ∼${10}^{5}$ CPU hours or more. It is therefore desirable to also have a reduced model that we can evaluate for a large number of points in the chosen parameter space, even if it is not as accurate as the full simulation model. In the following we first review our direct simulation model; in Appendix A we present the approximate model based on postprocessing the simulation's instantaneous temperature–density relation and the mean flux.

2.1. Simulations

The hydrodynamical simulations we use in this paper are part of the THERMAL suite of Nyx simulations (Walther et al. 2019), consisting of 75 models, each in an $L=20\,{h}^{-1}$ Mpc box with ${1024}^{3}$ Eulerian cells and ${1024}^{3}$ dark matter particles. The Nyx code (Almgren et al. 2013) follows the evolution of dark matter modeled as self-gravitating Lagrangian particles, and baryons modeled as an ideal gas on a set of rectangular Cartesian grids. The Eulerian gas dynamics equations are solved using a second-order accurate piecewise parabolic method to accurately capture shocks. In addition to solving for gravity and the Euler equations, we also include the main physical processes relevant for the Lyα forest. We consider the chemistry of the gas as having a primordial composition of hydrogen and helium, include inverse Compton cooling off the microwave background, and monitor the net loss of thermal energy resulting from atomic collisional processes (Lukić et al. 2015). All cells are assumed to be optically thin to ionizing radiation, and radiative feedback is accounted for via a spatially uniform, time-varying UV background radiation given to the code as a list of photoionization and photoheating rates (Haardt & Madau 2012).

This type of simulation is used as a forward model in virtually every recent inference work using the Lyα power spectrum (Boera et al. 2019; Walther et al. 2019; Palanque-Delabrouille et al. 2020; Rogers & Peiris 2020). Effects of inhomogeneous reionization, including both temperature and UV background fluctuations, are neglected. We expect future simulations to start adopting models with fluctuations (Oñorbe et al. 2019) to achieve better accuracy of the power spectrum, especially on larger scales. High column density absorbers (${N}_{{\rm{H}}\,{\rm{I}}}\gtrsim {10}^{17}\,{\mathrm{cm}}^{-2}$), which broaden absorption lines with characteristic damping wings (Rogers et al. 2018), are not modeled in these simulations either. These percent-level details in the accuracy of the physics forward model do not affect any of our conclusions.

Thermal histories are generated in a similar way as in Becker et al. (2011), by rescaling the photoheating with a density-dependent factor: ${\epsilon }=A\,{{\rm{\Delta }}}^{B}\,{\epsilon }_{\mathrm{HM}12}$. Here, ${\rm{\Delta }}={\rho }_{b}/\bar{{\rho }_{b}}$ is the baryon overdensity, ${\epsilon }_{\mathrm{HM}12}$ are the heating rates tabulated in Haardt & Madau (2012), and A and B are free parameters adjusted to obtain different thermal histories. While this approach makes it straightforward to change the instantaneous density–temperature relation in the simulation, changing the pressure smoothing scale is more difficult, as it represents an integral of (an unknown) function of temperature across cosmic time. We return to this point in Section 5.
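To make the rescaling concrete, the following minimal sketch (in Python, with hypothetical variable names; the actual Nyx implementation differs) applies the density-dependent factor to tabulated heating rates:

```python
import numpy as np

def rescale_photoheating(delta, eps_hm12, A=1.0, B=0.0):
    """Rescale tabulated photoheating rates by a density-dependent
    factor: eps = A * Delta**B * eps_hm12.

    delta    : baryon overdensity Delta = rho_b / <rho_b> per cell
    eps_hm12 : Haardt & Madau (2012) heating rates for the same cells
    A, B     : free parameters controlling the thermal history
               (A = 1, B = 0 recovers the unmodified HM12 rates)
    """
    return A * delta**B * eps_hm12
```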

We choose mock sightlines, or skewers, crossing the domain parallel to one of the axes of the simulation grid and piercing the cell centers. Computationally, this is the most efficient approach. This choice of rays avoids explicit ray-casting and any interpolation of the cell-centered data, which would introduce additional numerical and periodicity issues. As a result, from an ${N}^{3}$-cell simulation, we obtain up to ${N}^{2}$ mock spectra, each spectrum having N pixels. We calculate the optical depth, τ, by convolving the neutral hydrogen in each pixel along the skewer with the profile for resonant line scattering, assuming a Doppler shift for the velocity (for details, see Lukić et al. 2015). We compute this optical depth at a fixed redshift, meaning that we do not account for the speed of light when we cast rays in the simulation; we use the gas thermodynamical state at a single cosmic time. The simulated skewers are therefore not meant to globally mock observed Lyα forest spectra, but they recover the statistics of the flux in a narrow redshift window, which is what we need for this work. We have neglected instrumental noise and metal contamination in the simulated skewers, but this is not relevant for the conclusions of this paper.

2.2. Model Parameters

Lyα forest simulations include both cosmological parameters and astrophysical parameters needed to model the thermal state of the gas, which is significantly affected by hydrogen and helium reionizations. Our main goal is to test and improve the parameter sampling scheme and the emulation method used for constraining the parameters; in order to reduce the computational expense, in this work we focus our attention on the set of standard parameters, $\{{T}_{0},\gamma ,{\lambda }_{P},\bar{F}\}$, describing the thermal state of the IGM. We keep the cosmological parameters fixed to the Planck Collaboration et al. (2014) flat ΛCDM model with h = 0.67, ${{\rm{\Omega }}}_{m}=0.32$, ${{\rm{\Omega }}}_{b}{h}^{2}=0.022312$, ${n}_{s}=0.96$, and ${\sigma }_{8}=0.8288$.

The values for the thermal parameters T0 and γ are obtained from the simulation by approximating the temperature–density relation as the power law,

$T={T}_{0}{{\rm{\Delta }}}^{\gamma -1},$ (1)

and finding the best fit using a linear least-squares method in log T–log Δ (Lukić et al. 2015). Therefore, T0 parametrizes the temperature at the mean density of the universe, while γ is the slope of the temperature–density relation, expected to asymptote to γ ≈ 1.6 long after reionization ends. We determine the pressure smoothing scale, λP, in postprocessing by fitting the cutoff in the power spectrum of the real-space Lyα flux, as described in Kulkarni et al. (2015). The real-space Lyα flux is calculated using the actual density and temperature at each cell in the simulation, but omitting all redshift-space effects such as peculiar velocities and thermal broadening.
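As an illustration, the least-squares fit of Equation (1) in log T–log Δ might look as follows (a minimal sketch; the cell-selection cuts applied in the actual fits of Lukić et al. 2015 are omitted):

```python
import numpy as np

def fit_t0_gamma(delta, temp):
    """Fit T = T0 * Delta**(gamma - 1) (Equation (1)) by linear least
    squares in log T vs. log Delta.

    delta : overdensities of the selected cells
    temp  : gas temperatures [K] of the same cells
    """
    slope, intercept = np.polyfit(np.log10(delta), np.log10(temp), 1)
    T0 = 10.0**intercept      # temperature at mean density (Delta = 1)
    gamma = slope + 1.0       # slope of the temperature-density relation
    return T0, gamma
```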

In Section 5.3 we demonstrate our adaptive GP approach on the problem of inferring the three parameters ${\boldsymbol{\theta }}=(\bar{F},{T}_{0},\gamma )$. There, we use the postprocessing of hydrodynamical simulations as the forward model, which is described in Appendix A. We use only three of the four parameters because there is no good way to model λP in postprocessing: this parameter depends on the integrated thermal history rather than on one moment in time. The advantage of using this approximate model is that it allows us to evaluate our model at different points in parameter space at a very low computational cost. While this in principle already demonstrates the efficiency of our new sampling scheme, a legitimate concern is whether the conclusions might be affected by leaving out the λP parameter.

To alleviate this concern and confirm that the demonstrated behavior of our adaptive GP emulator is not qualitatively different when the parameter space changes, we apply our technique in Section 5.4 to a problem where the forward model consists of hydrodynamical simulations and the full set of $\{{T}_{0},\gamma ,{\lambda }_{P},\bar{F}\}$ parameters. For computational efficiency, we use the existing THERMAL suite of Nyx simulations. This means that we do not evaluate the forward model at arbitrary new points, as would be the case in a practical application of our method. Importantly, the conclusions from this experiment are fully consistent with the experiment presented in Section 5.3.

3. Adaptive Construction of Gaussian Process Emulators for Bayesian Inference

In this section we outline our main approach to the adaptive construction of the GP emulators. Our approach is designed to solve the specific problem at hand, which is inferring the parameters of interest that serve as input into the forward model of the power spectrum from the observational data. The main ingredient of our approach is a so-called acquisition function that guides the selection of the training inputs for the emulator. This acquisition function arises from the form of the likelihood for the measurement data. Thus, before explaining the acquisition process, we start by providing a general framework.

We denote the parameters of interest by ${\boldsymbol{\theta }}\in {{\mathbb{R}}}^{p}$. We denote the vector of observations by ${\boldsymbol{d}}={({d}_{1},...,{d}_{q})}^{T}$, where each di represents a measured value of the Lyα forest flux power spectrum at a certain value of the wavenumber k. The outputs of the forward model of the power spectrum for a given θ are treated as a q-dimensional vector P(θ) (more on this in Section 4). We work under the assumption of Gaussian measurement noise with zero mean and known covariance ΣE. With this specification of the measurement noise, we formulate the likelihood function for the observational data, which depends on the value of the forward model at θ,

$L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})={{ \mathcal N }}_{q}\left({\boldsymbol{d}}-{\boldsymbol{P}}({\boldsymbol{\theta }})\,| \,{0}_{q},{{\boldsymbol{\Sigma }}}_{E}\right).$ (2)

Assuming the Bayesian framework, we model the prior information about the parameters θ as a known distribution p(θ). Given the prior and the observed measurements d, the solution of the inverse problem is the posterior density obtained by applying Bayes' rule,

$p({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})\propto L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})\,p({\boldsymbol{\theta }}).$ (3)

This posterior density can, in principle, be explored with MCMC methods. However, because evaluating the likelihood function $L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})$ requires evaluating the forward model P(θ), the direct application of MCMC methods is unfeasible for the current problem (or for any expensive-to-evaluate forward model). This difficulty can be circumvented by using a surrogate model, such as a GP emulator, in place of the forward model. A GP is fully specified by its mean and covariance functions. The mean is commonly set to zero, while the covariance function describes the statistical relation between data points and is a reflection of the prior beliefs. This Bayesian nature of GPs makes them particularly suitable for our framework. Next, we provide a formal review of GP emulators, leaving the details until Section 4, and outline our adaptive approach to sequentially adding training inputs for the emulator.

We assume that we have collected a set of evaluations of the forward model P(θ) at n input points,

${ \mathcal D }={\left\{{{\boldsymbol{\theta }}}^{(j)},{\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{(j)}\right)\right\}}_{j=1}^{n}.$ (4)

The information in ${ \mathcal D }$ can be used to obtain a surrogate model specified by a random variable PGP with a predictive distribution conditioned on the input θ and data ${ \mathcal D }$,

$p\left({{\boldsymbol{P}}}^{\mathrm{GP}}\,| \,{\boldsymbol{\theta }},{ \mathcal D }\right)=\int p\left({{\boldsymbol{P}}}^{\mathrm{GP}}\,| \,{\boldsymbol{\theta }},{ \mathcal D },{\boldsymbol{\psi }}\right)\,p\left({\boldsymbol{\psi }}\,| \,{ \mathcal D }\right)\,d{\boldsymbol{\psi }}.$ (5)

Here ψ denotes the hyperparameters of the predictive model, $p\left({{\boldsymbol{P}}}^{\mathrm{GP}}\,| \,{\boldsymbol{\theta }},{ \mathcal D },{\boldsymbol{\psi }}\right)$ is the predictive distribution of the assumed model given the hyperparameters, and $p({\boldsymbol{\psi }}| { \mathcal D })$ is the posterior distribution of ψ given data ${ \mathcal D }$. Hyperparameters are thus free parameters that allow customizing (training) the GP for the particular problem. Together with the choice of the covariance function, they define a specific GP model (this is detailed in Section 4.1). In this work, we use the Bayesian view on model selection, where the optimal hyperparameters are determined by maximizing the probability of this model given the data. Specifically, we use the maximum likelihood estimate (MLE) of the hyperparameters when training the surrogate model PGP. More generally, we could apply MCMC to obtain $p({\boldsymbol{\psi }}| { \mathcal D })$; see Appendix B for details.

The solution of the inverse problem with a limited number of forward-model evaluations can now be formulated using the likelihood of the observational data evaluated using the surrogate model. This ${ \mathcal D }$-restricted likelihood (similarly to Bilionis & Zabaras 2014) is defined as follows:

$L\left({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D }\right)=\int L\left({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{{\boldsymbol{P}}}^{\mathrm{GP}}\right)\,p\left({{\boldsymbol{P}}}^{\mathrm{GP}}\,| \,{\boldsymbol{\theta }},{ \mathcal D }\right)\,d{{\boldsymbol{P}}}^{\mathrm{GP}},$ (6)

where $L\left({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{{\boldsymbol{P}}}^{\mathrm{GP}}\right)={{ \mathcal N }}_{q}\left({\boldsymbol{d}}-{{\boldsymbol{P}}}^{\mathrm{GP}}({\boldsymbol{\theta }})\,| \,{0}_{q},{{\boldsymbol{\Sigma }}}_{E}\right)$. Once we have an approximation of the distribution $p({\boldsymbol{\psi }}| { \mathcal D })$, e.g., ψMLE when using the MLE approach, we can integrate the product of the two Gaussians and obtain an approximate formula for the likelihood $L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D })$; see Takhtaganov & Müller (2018) and Appendix B for details:

$L\left({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D }\right)\approx {{ \mathcal N }}_{q}\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,| \,{0}_{q},{{\boldsymbol{\Sigma }}}_{E}+{{\boldsymbol{\Sigma }}}_{{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right).$ (7)

Evaluations of the approximate likelihood $L({\boldsymbol{\theta }}| {\boldsymbol{d}},{ \mathcal D })$ involve computing the following misfit function between the observational data and the predictions of the GP emulator:

$g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})={\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right)}^{T}{\left({{\boldsymbol{\Sigma }}}_{E}+{{\boldsymbol{\Sigma }}}_{{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right)}^{-1}\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right)$ (8)

for the estimate ψ = ψMLE. In Equation (8), we denote by ${\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ the mean vector of the GP emulator evaluated at θ, and by ${{\boldsymbol{\Sigma }}}_{{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ its predictive covariance. This misfit function captures the discrepancy between the observed values of the power spectrum and the predicted values in the norm weighted by the measurement error and the uncertainty of the emulator. For inputs θ in the training set ${ \mathcal D }$, the mean of the GP emulator ${\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ coincides with the values of the power spectrum P(θ) and the covariance ${{\boldsymbol{\Sigma }}}_{{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ vanishes; thus, the exact value of the misfit (and hence the likelihood) is known.
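In code, the approximate likelihood of Equation (7) and the misfit of Equation (8) reduce to evaluating a Gaussian density whose covariance is the sum of the measurement error and the emulator uncertainty. A minimal sketch, assuming the emulator exposes its predictive mean and covariance:

```python
import numpy as np

def misfit(d, mean_gp, cov_gp, sigma_e):
    """Misfit g(theta; D, psi) of Equation (8): the squared distance
    between the data d and the GP predictive mean, weighted by the
    measurement error plus the emulator uncertainty.

    d       : (q,) observed power spectrum values
    mean_gp : (q,) emulator predictive mean m(theta; D, psi)
    cov_gp  : (q, q) emulator predictive covariance Sigma_GP
    sigma_e : (q, q) measurement noise covariance Sigma_E
    """
    r = d - mean_gp
    cov = sigma_e + cov_gp
    return r @ np.linalg.solve(cov, r)

def log_likelihood(d, mean_gp, cov_gp, sigma_e):
    """Log of the approximate likelihood of Equation (7)."""
    cov = sigma_e + cov_gp
    _, logdet = np.linalg.slogdet(cov)
    q = len(d)
    return -0.5 * (misfit(d, mean_gp, cov_gp, sigma_e)
                   + logdet + q * np.log(2.0 * np.pi))
```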

We use the misfit function of Equation (8) to inform our choice of the candidate inputs to add to the data set ${ \mathcal D }$ in order to improve the GP surrogate. The improvement we are looking for is to make the approximate likelihood $L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D })$ resemble the true likelihood $L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})$ more accurately. The overall accuracy of the GP emulator across the support of the prior is not important for this purpose.

Our adaptive approach to extending the training set ${ \mathcal D }$ is based on an acquisition function commonly used in Bayesian optimization—the so-called expected improvement (EI) criterion (Jones et al. 1998). In our version of the EI criterion, we look for an input θ that provides the largest expected improvement in the fit, with the expectation taken with respect to the posterior distribution of the hyperparameters $p({\boldsymbol{\psi }}| { \mathcal D })$,

${ \mathcal I }({\boldsymbol{\theta }})=\int {\left[{g}_{\min }-g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right]}_{+}\,p({\boldsymbol{\psi }}\,| \,{ \mathcal D })\,d{\boldsymbol{\psi }}.$ (9)

Here gmin denotes the smallest misfit to the measurement data for the points in the current training set ${ \mathcal D }$, and ${[\cdot ]}_{+}$ takes the positive part of the difference: ${[\cdot ]}_{+}\equiv \max \{\cdot ,0\}$. This formulation allows us to balance exploration and exploitation of the GP emulator in an iterative search for a new training input that maximizes the expected improvement in the fit function ${ \mathcal I }({\boldsymbol{\theta }})$; that is, we search for the input θ that provides the largest improvement in the fit to the measurement data under the current GP model, conditioned on the misfit being smaller than the current best value for the points in the training set. The outline of the algorithm is given in Algorithm 1. For more details, see Takhtaganov & Müller (2018).

Algorithm 1. Adaptive construction of GP emulators

Input: Initial design ${\left\{{{\boldsymbol{\theta }}}^{(j)}\right\}}_{j=1}^{n}$, threshold value ${\epsilon }_{\mathrm{thresh}}$, search space ${{ \mathcal X }}_{\theta }$, maximum allowed number of forward-model evaluations smax.
Output: Adaptive design ${ \mathcal D }$.
1: Evaluate P(θ) for ${\boldsymbol{\theta }}\in {\left\{{{\boldsymbol{\theta }}}^{(j)}\right\}}_{j=1}^{n}$ to obtain ${ \mathcal D }={\left\{{{\boldsymbol{\theta }}}^{(j)},{\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{(j)}\right)\right\}}_{j=1}^{n}$.
2: for s from 1 to smax do
3: train the GP model using the current design;
4: update the current best-fit value gmin;
5: maximize the expected improvement in the fit function ${ \mathcal I }({\boldsymbol{\theta }})$ over the search space ${{ \mathcal X }}_{\theta }$, and let ${{\boldsymbol{\theta }}}^{s}=\mathop{\arg \,\max }\limits_{{\boldsymbol{\theta }}\in {{ \mathcal X }}_{\theta }}{ \mathcal I }({\boldsymbol{\theta }})$;
6: if ${ \mathcal I }\left({{\boldsymbol{\theta }}}^{s}\right)\lt {\epsilon }_{\mathrm{thresh}}\cdot {g}_{\min }$ then
7: break
8: end if
9: evaluate P at θs and augment the training set: ${ \mathcal D }={ \mathcal D }\cup \left\{{{\boldsymbol{\theta }}}^{s},{\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{s}\right)\right\}$;
10: end for
11: return ${ \mathcal D }$.


For our numerical experiments, we take the search space ${{ \mathcal X }}_{\theta }$ to be the support of the prior p(θ), and we set the threshold value ${\epsilon }_{\mathrm{thresh}}$ to 1%. We solve the auxiliary optimization problem in Step 5 by using multistart gradient-based optimization; see Takhtaganov & Müller (2018) for details. We set the allowed number of simulations smax to a large value so that the effective termination condition is the one in line 6. In practice, smax is dictated by the simulation budget. In the next section, we discuss the details of the construction of the GP emulators for modeling the Lyα forest power spectrum.
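The following compact sketch of Algorithm 1 reuses the `misfit` function from the sketch above and a hypothetical `gp` emulator object (`train`, `predict`, `training_inputs`, and `add_training_point` are assumed interfaces, not part of any actual library); with MLE-trained hyperparameters, the integral in Equation (9) collapses to a single term:

```python
import numpy as np
from scipy.optimize import minimize

def expected_improvement(theta, gp, d, sigma_e, g_min):
    """Expected improvement in the fit, Equation (9). With MLE-trained
    hyperparameters the expectation over p(psi | D) collapses to the
    single term [g_min - g(theta)]_+; with MCMC hyperparameter samples
    it would be averaged over those samples instead."""
    mean, cov = gp.predict(theta)              # hypothetical emulator API
    g = misfit(d, mean, cov, sigma_e)          # Equation (8), sketched above
    return max(g_min - g, 0.0)

def adaptive_design(gp, forward_model, d, sigma_e, bounds,
                    s_max=50, eps_thresh=0.01, n_starts=20):
    """Algorithm 1: iteratively add the input that maximizes the EI."""
    rng = np.random.default_rng(0)
    for s in range(s_max):
        gp.train()                             # step 3: refit hyperparameters
        g_min = min(misfit(d, *gp.predict(t), sigma_e)
                    for t in gp.training_inputs)   # step 4: current best fit
        best_theta, best_ei = None, -np.inf
        for _ in range(n_starts):              # step 5: multistart search
            x0 = rng.uniform(bounds[:, 0], bounds[:, 1])
            res = minimize(
                lambda t: -expected_improvement(t, gp, d, sigma_e, g_min),
                x0, bounds=bounds)
            if -res.fun > best_ei:
                best_theta, best_ei = res.x, -res.fun
        if best_ei < eps_thresh * g_min:       # step 6: termination test
            break
        # step 9: run the forward model and augment the training set
        gp.add_training_point(best_theta, forward_model(best_theta))
    return gp
```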

4. Gaussian Process Emulators for the Lyα Flux Power Spectrum

We model the power spectrum P(k, θ) as a multi-output GP, with outputs corresponding to the fixed values of the wavenumber k. Furthermore, we assume a separable structure of the kernel function, meaning that it can be formulated as a product of a kernel function for the input space θ alone, and a kernel function that encodes the interactions between the outputs k (Alvarez et al. 2012, Section 4).

In the following subsections we discuss the construction of multi-output GP emulators for the modeling of the power spectrum P(k, θ) in detail. To our knowledge, a similar comparison of multi-output emulators has not been done in the current context. Through a detailed numerical comparison, we arrive at the two preferred approaches that are used in the rest of the paper. Our choices are motivated by considerations of efficiency, accuracy, and correct modeling of the correlation between the outputs.

The preferred approach is inherently application-dependent. We work here with the Lyα flux power spectrum, but we emphasize that in a different setting, with much stronger correlations, there would be a larger difference between the approaches. The method that we describe in this section can thus be used to diagnose when to apply a given approach, and it is meant to serve as a practical guide for application specialists. Readers who are not practitioners themselves can skip this section and proceed to the results presented in Section 5.

4.1. Gaussian Process Emulators

We treat P(k, θ) as a scalar function from ${{ \mathcal X }}_{k}\times {{ \mathcal X }}_{\theta }$ to ${\mathbb{R}}$, where ${{ \mathcal X }}_{k}\subset {\mathbb{R}}$, ${{ \mathcal X }}_{\theta }\subset {{\mathbb{R}}}^{p}$, and q is the number of values of the wavenumber k for which we have observations. For a given vector of input parameters θ, we treat the output of the simulation code as a vector ${\boldsymbol{P}}({\boldsymbol{\theta }})\in {{\mathbb{R}}}^{q}$ of the values of the power spectrum at the fixed values of k.

Similarly to Conti & O'Hagan (2010), we model P(·) as a q-dimensional separable GP,

${{\boldsymbol{P}}}^{\mathrm{GP}}(\cdot )\,| \,{\boldsymbol{\psi }},{{\boldsymbol{\Sigma }}}_{k}\sim {{ \mathcal G }{ \mathcal P }}_{q}\left({\boldsymbol{\mu }}(\cdot ),c(\cdot ,\cdot ;{\boldsymbol{\psi }})\,{{\boldsymbol{\Sigma }}}_{k}\right),$ (10)

conditional on hyperparameters ψ of the correlation function $c:{{ \mathcal X }}_{\theta }\times {{ \mathcal X }}_{\theta }\times {{ \mathcal X }}_{\psi }\to {\mathbb{R}}$, and symmetric positive-definite matrix ${{\boldsymbol{\Sigma }}}_{k}\in {{\mathbb{R}}}^{q\times q}$. This means that for any two inputs θ(1) and θ(2), we have ${\mathbb{E}}\left[{{\boldsymbol{P}}}^{\mathrm{GP}}({{\boldsymbol{\theta }}}^{(i)})\,| \,{\boldsymbol{\psi }},{{\boldsymbol{\Sigma }}}_{k}\right]={\boldsymbol{\mu }}({{\boldsymbol{\theta }}}^{(i)})$, i = 1, 2, and ${\mathbb{C}}\mathrm{ov}\left[{{\boldsymbol{P}}}^{\mathrm{GP}}({{\boldsymbol{\theta }}}^{(1)}),{{\boldsymbol{P}}}^{\mathrm{GP}}({{\boldsymbol{\theta }}}^{(2)})\,| \,{\boldsymbol{\psi }},{{\boldsymbol{\Sigma }}}_{k}\right]=c({{\boldsymbol{\theta }}}^{(1)},{{\boldsymbol{\theta }}}^{(2)};{\boldsymbol{\psi }}){{\boldsymbol{\Sigma }}}_{k}$. As indicated by several studies, e.g., Chen et al. (2016), the introduction of the regression term μ(·) does not generally affect the performance of the predictive model, and in some cases, it might have an adverse effect. In our case, adequate results were obtained by simply setting μ(θ) ≡ 0. Furthermore, we set the covariance function c( · , · ; ψ) to be a squared-exponential with p + 1 hyperparameters ${\boldsymbol{\psi }}={({\sigma }_{c},{{\ell }}_{1},...,{{\ell }}_{p})}^{T}$,

$c\left({{\boldsymbol{\theta }}}^{(1)},{{\boldsymbol{\theta }}}^{(2)};{\boldsymbol{\psi }}\right)={\sigma }_{c}^{2}\exp \left(-\frac{1}{2}\displaystyle \sum _{i=1}^{p}\frac{{\left({\theta }_{i}^{(1)}-{\theta }_{i}^{(2)}\right)}^{2}}{{{\ell }}_{i}^{2}}\right).$ (11)

The choice of the covariance function here is purely empirical and does not affect the forthcoming method.
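For concreteness, the covariance function of Equation (11) can be evaluated as follows (a minimal sketch):

```python
import numpy as np

def sq_exp_kernel(theta1, theta2, sigma_c, lengthscales):
    """Squared-exponential covariance of Equation (11) with
    hyperparameters psi = (sigma_c, l_1, ..., l_p).

    theta1 : (n1, p) array of inputs
    theta2 : (n2, p) array of inputs
    Returns the (n1, n2) covariance matrix.
    """
    diff = theta1[:, None, :] - theta2[None, :, :]       # (n1, n2, p)
    sq = np.sum((diff / lengthscales)**2, axis=-1)       # scaled sq. distances
    return sigma_c**2 * np.exp(-0.5 * sq)
```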

Our treatment of the inter-output covariance matrix Σk differs from that in Conti & O'Hagan (2010) and Bilionis et al. (2013). There, the authors assumed a weak noninformative prior on the matrix Σk and integrated it out of the predictive posterior distribution. Instead, we study four different approaches to treating interactions between outputs, which are summarized in Section 4.2.

4.2. Approaches to Treating Multi-output Models

  • 1.  
First, we test a naive approach that emulates each output separately with a single-output GP. We refer to this approach as MS because it corresponds to the MS (many single-output) emulator in Conti & O'Hagan (2010). Compared to training only a single GP, as the following two approaches do, this approach has an increased computational cost (which could be alleviated by training in parallel). It also ignores any dependences between the outputs.
  • 2.  
    In our second approach (IND), we treat the outputs as independent given the hyperparameters of the covariance function c(·, ·; ψ). This approach has been considered in Bilionis & Zabaras (2014) and Takhtaganov & Müller (2018). It leads to a simple and efficient implementation of a multi-output emulator with a diagonal Σk, see Appendix B for details.
  • 3.  
    Our third approach (COR) assumes that correlations between different outputs are nonzero but are still independent of the parameter θ. In this case, we fix the correlation matrix a priori and use it to obtain Σk by rescaling by the training variances. This approach requires specifying the inter-output correlation matrix. In terms of computational efficiency, this approach still allows us to use the Kronecker product structure of the training covariances (see Appendix B), and it is as efficient as using the diagonal covariance in approach IND.
  • 4.  
Our final approach (INP) is related to approach COR, but it is computationally more demanding. Here, we treat k as another input dimension of the GP model, with the associated covariance kernel again being a squared-exponential. The main computational cost is associated with inverting the training covariance matrices, which become q times larger because another input dimension is added. Conceptually, however, this approach is similar to approach COR, with Σk having entries specified by the squared-exponential kernel for a fixed (learned) value of the length-scale hyperparameter ${{\ell }}_{p+1}$ associated with the k input dimension. The difference from approach COR is that the inter-output correlation matrix now depends on the training data. As our experiments demonstrate, this additional flexibility provides no discernible advantage. This approach is referred to as the time-input (TI) emulator in Conti & O'Hagan (2010), where the additional dimension is time rather than k.

All of our approaches use matrix-valued kernels that fall into the category of separable kernels; see Alvarez et al. (2012, Section 4) for an overview. Specifically, our approaches IND, COR, and INP are examples of the so-called intrinsic coregionalization model, or ICM (Alvarez et al. 2012, Section 4.2). The ICM approach allows an efficient implementation of GP-based regression and inference that exploits the properties of the Kronecker product of the covariance matrix; see Appendix B for details. For the adaptive algorithm, efficiency is important for solving the auxiliary optimization problem for the expected improvement in the fit function ${ \mathcal I }({\boldsymbol{\theta }})$. Solving this optimization problem requires multiple restarts because the ${ \mathcal I }({\boldsymbol{\theta }})$ function is highly multimodal, leading to a large number of evaluations of the GP emulator predictions and their gradients. For the GP-based inference, having an efficient emulator allows us to carry out MCMC sampling of the posterior with minimum computational effort. As the following numerical study of the considered approaches suggests, in our application it is reasonable to expect that the outputs corresponding to different values of the wavenumber k have similar properties with respect to the parameters θ; therefore, our choice of the separable form of the kernel is justified.
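The payoff of the Kronecker structure is that the q × q matrix Σk drops out of the predictive mean entirely and only scales the predictive covariance, so only the small n × n input covariance ever needs to be factorized. A minimal sketch of ICM prediction with a zero prior mean, reusing `sq_exp_kernel` from the sketch above:

```python
import numpy as np

def predict_icm(theta_star, Theta, Y, sigma_c, ls, Sigma_k):
    """Predictive mean and covariance of a separable (ICM) multi-output GP
    with zero prior mean and kernel c(.,.;psi) * Sigma_k (Equation (10)).
    Sigma_k cancels in the mean and factors out of the covariance, so only
    the n x n input covariance C is ever solved against.

    theta_star : (p,) prediction input
    Theta      : (n, p) training inputs
    Y          : (n, q) training outputs (power spectrum vectors)
    """
    C = sq_exp_kernel(Theta, Theta, sigma_c, ls)            # (n, n)
    c_star = sq_exp_kernel(theta_star[None, :], Theta,
                           sigma_c, ls)[0]                  # (n,)
    alpha = np.linalg.solve(C, c_star)                      # C^{-1} c_*
    mean = Y.T @ alpha                                      # (q,) mean
    var_input = sigma_c**2 - c_star @ alpha                 # scalar
    cov = var_input * Sigma_k                               # (q, q) covariance
    return mean, cov
```

In practice a small jitter on the diagonal of C improves numerical stability; we omit it here for brevity.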

4.3. Numerical Study of Multi-output Approaches

In this section we compare the predictive performance of the different multi-output GP emulators introduced in Section 4.2 using the approximate model of the power spectrum P(k, θ) described in Appendix A. To obtain a better picture of the dependence of the results on the choice of design inputs, we build emulators using 10 to 30 inputs arranged in a Latin hypercube design (LHD) in which the minimum distance between the points has been maximized—the so-called maximin LHD (see, e.g., Johnson et al. 1990). For each design we perform multiple experiments. For each experiment we generate a large test set consisting of 500 input-output pairs to compute various measures of predictive accuracy. In order to avoid the unnecessary cost associated even with the postprocessing procedure of Appendix A, we precompute the power spectrum on a dense grid in ${{ \mathcal X }}_{\theta }$ using our automated Henson system (Morozov & Lukić 2016; Lohrmann et al. 2017) and interpolate the outputs using trilinear interpolation to obtain a continuous approximation of the power spectrum in ${{ \mathcal X }}_{\theta }$. We have confirmed that the P(k, θ) error associated with this interpolation is negligible. Specifically, we take a grid of ${10}^{3}$ input parameters ${\boldsymbol{\theta }}=(\bar{F},{T}_{0},\gamma )$ covering the box

Equation (12)

We restrict our attention to the redshift z = 4.2 and q = 8 outputs corresponding to the following k values: $\{3.26\times {10}^{-3},6.51\times {10}^{-3},9.77\times {10}^{-3},1.63\times {10}^{-2},2.28\times {10}^{-2},3.26\times {10}^{-2},5.21\times {10}^{-2},8.14\times {10}^{-2}\}\,{\mathrm{km}}^{-1}\,{\rm{s}}$, which cover the same range of values as the Viel et al. (2013) measurements that we use later. In the following we simply number these k outputs from 1 to 8.

We construct the four multi-output GP emulators (MS, IND, COR, and INP) using fixed LHDs with 10, 20, and 30 points in ${{ \mathcal X }}_{\theta }$. In each case, the training output values are normalized (see Appendix B). We fit the hyperparameters of the covariance function using the maximum likelihood approach (MLE in Appendix B). We recall that the COR emulator requires a fixed output correlation matrix Σk. We obtain an estimate of this matrix by first using the INP emulator built with the LHD 20 design. Upon training the INP emulator, we obtain an estimate of the length scale for its k (output) variable. By plugging this estimate into the 1D squared-exponential kernel and evaluating the kernel at the values of k that we consider, we obtain the desired estimate of Σk. This estimate remains fixed for all experiments with the COR emulator; see Figure 1.

Figure 1. Correlation matrix between outputs (k-wavemodes) 1 through 8 for the COR emulator.
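The fixed inter-output matrix for the COR emulator can then be assembled by evaluating a 1D squared-exponential kernel at the observed wavenumbers with the length scale learned by the INP emulator. A minimal sketch (`ell_k` stands for the learned length scale, an assumed input here):

```python
import numpy as np

def output_correlation(k_values, ell_k):
    """Inter-output correlation matrix for the COR emulator: a 1D
    squared-exponential kernel evaluated at the q wavenumbers, with the
    length scale ell_k taken from a trained INP emulator. Rescaling by
    the training variances then yields Sigma_k."""
    k = np.asarray(k_values)
    dk = k[:, None] - k[None, :]
    return np.exp(-0.5 * (dk / ell_k)**2)

# The q = 8 wavenumbers of Section 4.3 (in s/km):
k_vals = [3.26e-3, 6.51e-3, 9.77e-3, 1.63e-2,
          2.28e-2, 3.26e-2, 5.21e-2, 8.14e-2]
```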

To test the predictive performance of the emulators, we use a test set consisting of N = 500 points in a randomized LHD. We use the following measures of predictive accuracy:

  • 1.  
    Standardized mean-squared error (SMSE): this is the mean-squared prediction error scaled by the variance of the test data for each output j = 1, ..., q:
${\mathrm{SMSE}}_{j}=\displaystyle \frac{\frac{1}{N}{\sum }_{i=1}^{N}{\left({P}_{j}\left({{\boldsymbol{\theta }}}^{(i)}\right)-{m}_{j}\left({{\boldsymbol{\theta }}}^{(i)},{ \mathcal D }\right)\right)}^{2}}{\frac{1}{N}{\sum }_{i=1}^{N}{\left({P}_{j}\left({{\boldsymbol{\theta }}}^{(i)}\right)-{\overline{P}}_{j}\right)}^{2}},$ (13)
    where ${m}_{j}({\boldsymbol{\theta }},{ \mathcal D })$ is the jth component of the vector of predictive means ${\boldsymbol{m}}({\boldsymbol{\theta }},{ \mathcal D })$ of the GP model, and ${\overline{P}}_{j}$ is the mean of the test values Pj(θ(i)), i = 1, ..., N.
  • 2.  
    Credible-interval percentage (CIP), also known as coverage probability: the percentage of the 100α% credible intervals that contain the true test value. For an emulator that provides adequate estimates of the uncertainty about its predictions, this value should be close to α. We plot the CIP against α ∈ [0, 1] and look for deviations from the straight line. This statistic can only be plotted for each output separately.
  • 3.  
    Squared Mahalanobis distance (SMD) between the predicted and the test outputs at a test point i:
${D}_{i}^{2}={\left({\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{(i)}\right)-{\boldsymbol{m}}\left({{\boldsymbol{\theta }}}^{(i)},{ \mathcal D }\right)\right)}^{T}{{\boldsymbol{\Sigma }}}_{{GP}}^{-1}\left({{\boldsymbol{\theta }}}^{(i)},{ \mathcal D }\right)\left({\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{(i)}\right)-{\boldsymbol{m}}\left({{\boldsymbol{\theta }}}^{(i)},{ \mathcal D }\right)\right),$ (14)
where ${\boldsymbol{m}}({\boldsymbol{\theta }},{ \mathcal D })$ and ${{\boldsymbol{\Sigma }}}_{{GP}}({\boldsymbol{\theta }},{ \mathcal D })$ are the predictive mean and the predictive covariance of the multi-output GP emulator, respectively. According to multivariate normal theory, this distance should be distributed as ${\chi }_{q}^{2}$ for all test points. A discrepancy between the distribution of distances for the emulator and the reference distribution indicates a misspecification of the covariance structure between the outputs. (A sketch computing all three statistics follows this list.)
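A minimal sketch of the three accuracy statistics, assuming arrays of test values and of emulator predictive means, standard deviations, and covariances:

```python
import numpy as np
from scipy.stats import norm

def smse(P_test, M_pred):
    """Standardized mean-squared error per output, Equation (13).
    P_test, M_pred : (N, q) arrays of test values and predictive means."""
    num = np.mean((P_test - M_pred)**2, axis=0)
    den = np.mean((P_test - P_test.mean(axis=0))**2, axis=0)
    return num / den

def cip(P_test, M_pred, S_pred, alpha):
    """Credible-interval percentage per output: the fraction of test
    values falling inside the central 100*alpha% Gaussian credible
    interval. S_pred : (N, q) predictive standard deviations."""
    z = norm.ppf(0.5 + alpha / 2.0)
    return (np.abs(P_test - M_pred) <= z * S_pred).mean(axis=0)

def smd(P_i, m_i, cov_i):
    """Squared Mahalanobis distance at one test point, Equation (14);
    distributed as chi^2_q when the covariance is well specified."""
    r = P_i - m_i
    return r @ np.linalg.solve(cov_i, r)
```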

Figure 2 reports the SMSE estimates for the three LHDs. In each case we repeat the experiment 20 times, meaning that we generate 20 training and 20 test sets for each of the three designs and use them to train and test each of the four emulators. We observe a noticeable spread of the estimates between the experiments with the same number of design points, regardless of the chosen emulator or output index. Increasing the size of the training design generally improves accuracy, but does not necessarily suppress the variation in results for different experiments. All four emulators demonstrate similar output-marginal accuracy, and the difference in errors for different outputs is low, with the exception of output 8, which has consistently higher relative errors for all four emulators. The COR and the INP emulators appear to have a similar spread of the error values, indicating that fixing the output covariance Σk a priori rather than allowing it to depend on training data has little effect on accuracy. The MS and the IND emulators achieve smaller errors, but also have relatively larger spreads between experiments, which suggests a higher dependence on the design than in the cases of COR and INP emulators.

Figure 2. Standardized mean-squared errors for the four multi-output emulators (MS, IND, COR, INP) for three different LHDs (LHD 10, 20, and 30). Test errors are computed with 500 points. Each experiment is repeated 20 times.

While there is little that separates different multi-output approaches in terms of per-output SMSE, the results for CIP and SMD provide a more complete picture.

Figure 3 reports the CIP estimates for the three LHDs and for selected output indices (1, 4, and 7, in the top, middle, and bottom rows, respectively). Here we report the CIP values averaged over the 20 experiments. In general, we observe that all four emulators underestimate the uncertainty in their predictions, i.e., they are overconfident in their predictions. This trend becomes more pronounced with growing training design size. We view this as an artifact of the MLE approach to GP training, see the discussion in Takhtaganov & Müller (2018), Section 3. The MS emulator exhibits the worst performance of all emulators and is most overconfident in its predictions in all cases. The other emulators compensate for the shortcomings of the MLE approach by accounting for correlations between the outputs. The IND emulator does it in the least effective way and hence performs worse than the COR and the INP emulators. The COR emulator on average comes closest to providing accurate uncertainty estimates.

Figure 3. Percentage of credible intervals containing the true test values for the four multi-output emulators (MS, IND, COR, and INP) for three different LHDs, based on 500 test points and averaged over 20 experiments. The top row shows output 1, the middle row output 4, and the bottom row output 7. The closer the colored graphs are to the black line, the better.

Figure 4 shows the box plots of the distributions of the squared Mahalanobis distances for the test points for the three designs. We only show the results of a single experiment for each design. The reference distribution ${\chi }_{8}^{2}$ has mean of 8 and variance of 16. We observe that although all of the emulators fail to correctly model the posterior covariance (a direct consequence of underestimating the predictive uncertainty), the IND, COR, and INP emulators all perform better than MS, with the last two coming closest to the reference. The means of the distributions show that for LHD 10, the average values over the 20 experiments are 54, 38, 27, and 29 for the MS, IND, COR, and INP emulators, respectively. The results shown represent the general trend that on average, COR and INP emulators tend to represent the predictive covariance better.

Figure 4. Distribution of the squared Mahalanobis distances for the four multi-output emulators (MS, IND, COR, and INP) for three different LHDs, computed with 500 test points. The reference distribution (ref) is ${\chi }_{8}^{2}$. A single representative experiment is chosen for each of the three designs. The closer the distributions are to the reference, the better.

In terms of the wall-clock time required for training and evaluating the four emulators, the most time-consuming emulator to train is INP, whose training time grows the fastest with the size of the training set because its training covariance matrices are a factor of q larger in each dimension (and dense matrix inversion scales cubically). The next most time-consuming emulator to train is MS, but it can easily be trained and evaluated in parallel because there is no overlap between the outputs. The time for training the IND and COR emulators is approximately the same and is considerably shorter than for the other two because only a single GP needs to be trained.

Based on these experiments, in the following we choose to work with the IND and COR emulators due to their good performance on the three test statistics and their high computational efficiency. The question of how to obtain the correlation matrix for the COR emulator remains open. In our numerical experiments we did not observe significant differences in emulator performance under small changes to the correlation matrix Σk. In particular, the estimates of Σk obtained with the INP emulator were similar across all experiments. It appears that the variability in the results due to the training designs outweighs the variability from using approximate correlations.

As mentioned previously, an alternative way of treating the separable form of the covariance is to assume a noninformative prior on Σk as in Conti & O'Hagan (2010), which allows analytical integration of this matrix out of the predictive distribution. In addition, if the mean μ(·) is taken to be a generalized linear model with a flat prior, it can also be treated analytically. We deviated from the approach of Conti & O'Hagan (2010) for the sake of a simpler and more easily interpretable model.

5. Numerical Study of Inference with Adaptive GP Emulators

In this section we evaluate the performance of the adaptive construction of the GP emulators introduced in Section 3. First, we solve a synthetic inference problem for the same set of three parameters ${\boldsymbol{\theta }}=(\bar{F},{T}_{0},\gamma )$. For this task, we generate synthetic measurement data with the postprocessing model of Appendix A and corrupt it with noise. Similarly to Section 4.3, we use a trilinear interpolation of the outputs of the postprocessing model as the forward model for the GP construction and inference. We run Algorithm 1 and use the constructed GP emulator to obtain the posterior of the parameters θ given the measurements. We compare the results obtained with the adaptive approach to those obtained using GP emulators built with a fixed design (see Section 5.1). Having a relatively inexpensive forward model, we can obtain a reference posterior using the true likelihood $L({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})$, which allows us to obtain quantitative measures of the quality of the GP-based posteriors.

Following this detailed study, we use the adaptive algorithm to obtain parameter posteriors using observational data from Viel et al. (2013) and the postprocessing model of Appendix A as the forward model. These results are reported in Section 5.3.

Finally, we use a simplified version of our adaptive approach to construct posteriors for an extended set of parameters, namely ${\boldsymbol{\theta }}=(\bar{F},{T}_{0},\gamma ,{\lambda }_{P})$, using the same Viel et al. (2013) data and the THERMAL suite of Nyx simulations. Here, we restrict the search space ${{ \mathcal X }}_{\theta }$ to the 75 points for the parameters (T0, γ, λP) in this data set and 40 values of $\bar{F}$, giving us a total of 3,000 possible values of θ. We use our adaptive algorithm to select a small subset of these points to construct a GP emulator. We compare the results obtained with this restricted version of our algorithm to those of Walther et al. (2019), where all 75 points for (T0, γ, λP) and several values of $\bar{F}$ were used for the GP construction. We do not expect to obtain identical results because, in addition to differences in implementation (Walther et al. 2019 built a GP emulator using the PCA-based approach of Habib et al. 2007, see below), we use only the Viel et al. (2013) subset of the measurement data for the given redshift. The reason for this choice is our focus on the inference method. However, even using a restricted version of our adaptive algorithm, we are able to effectively constrain the parameters using only a fraction of the available inputs and simulation results.

5.1. State-of-the-art Data-agnostic Approach

Construction of emulators is often done using space-filling LHDs, such as maximin and orthogonal designs (Moon et al. 2011), or sliced LHDs (Qian 2012). Some of these space-filling LHDs are also used by the cosmological community to select the training points for the construction of emulators. Typically, the number of training points is selected a priori and remains fixed. For example, in the first such work in cosmology (Habib et al. 2007), the space-filling LHD with 128 points is selected for the problem of inferring five cosmological parameters from matter power spectrum measurements. The authors employ a PCA-based approach to construct multi-output emulators. Specifically, they use a singular-value decomposition (SVD) of the simulations at the training points specified by the LHD. The weights of the SVD are then modeled as independent GP emulators.
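For contrast with the separable approach used in this paper, the PCA-based construction can be sketched as follows (`gp_factory` and the per-weight GP interface are hypothetical; the actual implementation of Habib et al. 2007 involves further modeling choices):

```python
import numpy as np

def build_pca_emulator(Theta, Y, n_components, gp_factory):
    """PCA/SVD-based multi-output emulation: decompose the training
    outputs and model each retained SVD weight with an independent
    single-output GP, as in the approach of Habib et al. (2007).

    Theta      : (n, p) LHD training inputs
    Y          : (n, q) simulation outputs at those inputs
    gp_factory : callable (inputs, targets) -> trained single-output GP
                 with a .predict(theta) method (hypothetical interface)
    """
    Y_mean = Y.mean(axis=0)
    U, s, Vt = np.linalg.svd(Y - Y_mean, full_matrices=False)
    basis = Vt[:n_components]                  # (n_components, q) PCs
    weights = (Y - Y_mean) @ basis.T           # (n, n_components)
    gps = [gp_factory(Theta, weights[:, j]) for j in range(n_components)]

    def predict_mean(theta):
        w = np.array([gp.predict(theta) for gp in gps])
        return Y_mean + w @ basis              # back to the q outputs

    return predict_mean
```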

A similar approach is taken in Walther et al. (2019) to recover thermal parameters of the IGM using the Lyα flux power spectrum. However, the training points do not exactly form an LHD because one of the parameters, λP, is difficult to vary independently in a way that is not correlated with the other thermal state parameters T0 and γ. This is because λP probes the integrated thermal history, which is smooth for each individual physical model of heating and cooling of the IGM during and after the reionization process. In principle, one could generate models with abruptly changing instantaneous temperature such that the pressure smoothing does not have enough time to adjust, but we lack physical motivation for such models. In addition, one of the parameters—the mean flux of the Lyα forest—is not part of the LHD because it can easily be rescaled in postprocessing; thus, its sampling does not require running additional expensive simulations.

5.2. Comparison of the Adaptive Method and the Data-agnostic Approach

We continue to use the postprocessing model described in Section A as the true forward model (with the same q = 8 outputs). We first generate the mock measurement by evaluating the model at a fixed ${{\boldsymbol{\theta }}}_{\mathrm{true}}={(0.275,1.245\times {10}^{4}{\rm{K}},1.6)}^{T}$, and we corrupt the resulting measurement vector with noise from a multivariate Gaussian distribution ${{ \mathcal N }}_{q}({0}_{q},{\sigma }_{e}^{2}{{\boldsymbol{I}}}_{q})$ with σe = 0.01. This level of measurement noise is consistent with the observational data (Viel et al. 2013) that we use later.

In order to analyze how the size of the initial design for the adaptive algorithm affects the obtained solution, we perform experiments with 4, 6, or 8 design points in the 3D parameter space ${\boldsymbol{\theta }}=(\bar{F},{T}_{0},\gamma )$. In each iteration s of Algorithm 1, we construct a GP surrogate using the current selection of design points, solve the auxiliary optimization problem to maximize the expected improvement in the fit function, and augment the design set with the single point that provides the largest predicted improvement. We iterate until the relative expected improvement drops below the 1% threshold.

For each selection of the size of the initial training design, we run our algorithm 10 times, each time selecting new initial design points using a maximin LHD. For each of the initial designs, we run the algorithm with two emulator choices: IND and COR (see Section 4.2). For the COR emulator, the inter-output correlation matrix Σk is taken to be the same as the one used in Section 4.3 (see Figure 1). On average, the final designs contain 10–11 inputs regardless of the number of points in the initial design or the emulator choice.

We perform MCMC sampling of the posterior using the integrated likelihood based on the constructed GP surrogates (see Appendix B for details). To obtain a quantitative measure of the quality of the obtained posteriors, we compute the Hellinger distance between the GP-based posteriors ${p}^{\mathrm{GP}}({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D })$ and the reference posterior ${p}^{\mathrm{ref}}({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})$ obtained by direct MCMC sampling with the true likelihood function, i.e., the likelihood of the measurement data that uses evaluations of our (postprocessing) forward model. The Hellinger distance is a metric for evaluating differences between two probability distributions and is a probabilistic analog of the Euclidean distance. It can be related to other commonly used distance measures, such as the total variation distance and the Kullback–Leibler divergence; see, e.g., Dashti & Stuart (2016). It has also recently been studied in the context of posteriors obtained with GP emulators (Stuart & Teckentrup 2018). The Hellinger distance between pref and pGP is defined as follows:

${D}_{{\rm{H}}}^{2}\left({p}^{\mathrm{ref}},{p}^{\mathrm{GP}}\right)=\frac{1}{2}\int {\left(\sqrt{{p}^{\mathrm{ref}}({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}})}-\sqrt{{p}^{\mathrm{GP}}({\boldsymbol{\theta }}\,| \,{\boldsymbol{d}},{ \mathcal D })}\right)}^{2}\,d{\boldsymbol{\theta }}.$ (15)

To compute ${D}_{{\rm{H}}}({p}^{\mathrm{ref}},{p}^{\mathrm{GP}})$, we approximate the densities pref and pGP by fitting kernel-density estimates (KDEs) with Gaussian kernels to the generated samples from the respective posteriors, and we discretize the integral in Equation (15) using a 3D Sobol sequence with ${10}^{4}$ points. The results for the posteriors obtained with the adaptive GPs (10 runs for each initial design) are presented on the left in Figure 5.
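A minimal sketch of this computation using SciPy's Gaussian KDE and Sobol sampler, assuming `samples_ref` and `samples_gp` are MCMC draws from the two posteriors and `bounds` is the integration box (2^13 ≈ 10^4 points):

```python
import numpy as np
from scipy.stats import gaussian_kde, qmc

def hellinger_distance(samples_ref, samples_gp, bounds, m=13):
    """Hellinger distance (Equation (15)) between two posteriors, each
    approximated by a Gaussian KDE fit to MCMC samples; the integral is
    discretized with a Sobol sequence of 2**m quasi-random points.

    samples_* : (n_samples, p) posterior draws
    bounds    : (p, 2) lower/upper bounds of the integration box
    """
    kde_ref = gaussian_kde(samples_ref.T)
    kde_gp = gaussian_kde(samples_gp.T)
    sob = qmc.Sobol(d=bounds.shape[0], scramble=True).random_base2(m)
    pts = qmc.scale(sob, bounds[:, 0], bounds[:, 1])    # map to the box
    vol = np.prod(bounds[:, 1] - bounds[:, 0])
    diff2 = (np.sqrt(kde_ref(pts.T)) - np.sqrt(kde_gp(pts.T)))**2
    return np.sqrt(0.5 * vol * diff2.mean())            # quasi-MC estimate
```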

Figure 5. Hellinger distances ${D}_{{\rm{H}}}\left({p}^{\mathrm{ref}},{p}^{\mathrm{GP}}\right)$ for the GP-based posteriors. While fixed-design-based emulators can occasionally produce good-quality posteriors, the inconsistency of the results makes them a poor choice compared to adaptive designs.

We compare the posteriors obtained with the adaptive approach to the posteriors obtained by training GP models with fixed maximin LHDs. Here, we fix the design sizes to be 10, 11, and 12. As in the adaptive case, we train the GP emulators using both IND and COR approaches, and we repeat each experiment 10 times. The results for the posteriors obtained with the fixed designs are shown on the right in Figure 5.

The comparison of the results in Figure 5 demonstrates the superiority of the adaptive approach: it achieves results that are closer to the reference posterior in the Hellinger distance, often with fewer design points. Furthermore, the results for the adaptive approach are less spread out, which makes it more robust and consistent.

In Figure 6 we plot the data from Figure 5 again for the adaptive cases, and we show the final design sizes on the x-axis. In most cases, the final design size is either 10, 11, or 12. There is no significant difference in the results for different initial design sizes. However, there appears to be more variability in the final design sizes when the initial design contains only 4 points (LHD 4)—the final designs have between 7 and 13 points. We also observe a small trend of decreasing distance values as the final design size increases from 7 to 10. However, beyond 10 there is no significant difference in the results. The COR emulator appears less likely to terminate with a design consisting of fewer than 10 points, but in terms of the distance values, COR does not outperform the IND approach.


Figure 6. Hellinger distances for the posteriors obtained with the adaptive designs, ordered by the size of the final designs. We plot the IND approach in blue and the COR approach in red. Different markers correspond to different initial design sizes.


The results for the Hellinger distances confirm that there is little difference between the IND and the COR approaches for our current application. As far as the choice of the initial design size is concerned, starting with smaller designs (4 to 6 points for the current 3D problem) leads to fewer forward-model evaluations without compromising the quality of the result.

Additionally, we compare the posteriors obtained with the adaptive algorithm and with fixed LHDs by looking at the 95% highest posterior density (HPD) intervals of the marginal posteriors, i.e., the shortest interval containing 95% of the marginal posterior mass for each parameter; a sketch of this computation follows. This provides a visual representation of the spread of the obtained distributions. We plot the intervals for two parameters at a time and overlay the results from all 60 experiments (2 approaches × 3 initial designs × 10 random realizations) in Figures 7 (adaptive cases) and 8 (fixed cases). We observe a good correspondence between the HPD intervals for the adaptive designs. For the fixed designs there is much more variation in the spreads of the posterior estimates. This comparison reinforces the conclusion that the results obtained with fixed designs are inconsistent, and therefore less reliable.
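
A sketch of the shortest-interval computation from 1D marginal MCMC samples (a standard windowing approach; `samples` is our placeholder for a 1D sample array):

```python
# Shortest interval containing a given posterior mass (95% HPD interval).
import numpy as np

def hpd_interval(samples, mass=0.95):
    x = np.sort(samples)
    n = len(x)
    k = int(np.ceil(mass * n))              # number of points in the interval
    widths = x[k - 1:] - x[: n - k + 1]     # widths of all contiguous windows
    i = int(np.argmin(widths))              # shortest window is the HPD interval
    return x[i], x[i + k - 1]
```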


Figure 7. 95% HPD intervals for the marginal posteriors obtained with adaptive designs (blue rectangles). We plot intervals for two parameters at a time and overlay results from 60 experiments. The red rectangle represents the HPD intervals of the marginal distributions of the reference posterior. We observe a good correspondence between the intervals of the obtained posteriors and those of the reference.


Figure 8. 95% HPD intervals for the marginal posteriors obtained with fixed designs (blue rectangles). We plot intervals for two parameters at a time and overlay results from 60 experiments. The red rectangle represents the HPD intervals of the marginal distributions of the reference posterior. We observe a noticeably larger scatter in the results between different experiments than in the adaptive designs case presented in Figure 7.


5.3. Results for the Adaptive Gaussian Process with Postprocessing Model and Viel Data

In this section we apply our adaptive GP approach to the problem of inferring the same three parameters θ = (F, T0, γ), using the postprocessing model of Appendix A as the forward model of the power spectrum and data from Viel et al. (2013) at redshift z = 4.2.

The measurement data consists of seven values of the power spectrum for k = {5.01 × 10−3, 7.95 × 10−3, 1.26 × 10−2, 1.99 × 10−2, 3.16 × 10−2, 5.01 × 10−2, 7.95 × 10−2} km−1 s with error bars that we treat as  ± 1σk. The measurement noise covariance therefore has a diagonal form ${{\boldsymbol{\Sigma }}}_{E}\,=\mathrm{diag}[{\sigma }_{1}^{2},\ \ldots ,\ {\sigma }_{7}^{2}]$. We take a uniform (flat) prior p(θ) on all three parameters defined over the same box ${{ \mathcal X }}_{\theta }$ as in Section 4.3:

Equation (16)

We use the COR emulator with the same Σk as in Section 4.3 and initialize Algorithm 1 with a maximin LHD with 4 points. The stopping threshold for the algorithm is again set to 1%.

Figure 9 shows the design points at different iterations of the adaptive algorithm. The first panel shows the initial 4 points arranged in a maximin LHD, the middle panel includes 3 additional points after three iterations of the adaptive algorithm, and the last panel shows the final design upon termination.


Figure 9. Evolution of the designs for the adaptive GP using the postprocessing model and Viel data. We have achieved the desired convergence with 10 evaluations of the forward model.


Figure 10 shows the iteration history of the algorithm, with iteration s = 1 corresponding to the initial design with 4 points. The blue line shows the value of the best misfit among the points in the training set at each iteration s, ${g}_{\min }^{s}$, scaled by the best misfit value obtained upon convergence, ${g}_{\min }^{\mathrm{best}}$. A point with a better misfit value is not obtained in every iteration; e.g., in iterations 2 and 3 we have the same gmin as initially. The points added to the design in these iterations serve to decrease the overall uncertainty of the GP. The point added to the training set ${ \mathcal D }$ after iteration s = 3 provides a reduction in gmin for the first time (see the red dot in the middle panel of Figure 9). The dotted red line in Figure 10 shows one minus the relative expected improvement in each iteration, i.e., $1-{ \mathcal I }({{\boldsymbol{\theta }}}^{s})/{g}_{\min }^{s}$. As the algorithm progresses, we expect both lines to approach 1.


Figure 10. Iteration history of Algorithm 1 applied to postprocessing model with Viel data.


Figure 11 shows the power spectrum P(θs=6) evaluated at the last θ added by the algorithm, in iteration s = 6. This point, shown as a red dot in the bottom panel of Figure 9, corresponds to the smallest misfit ${g}_{\min }^{\mathrm{best}}$ to the Viel data found by our algorithm.


Figure 11. P(θs=6) corresponding to the best found misfit for the adaptive GP with postprocessing model and Viel data.


5.4. Results for the Adaptive Gaussian Process with Nyx Simulations and Viel Data

In this section we work with the THERMAL suite of Nyx simulations, consisting of 75 models with given parameters (T0, γ, λP) (see Figure 12). As in virtually all Lyα power spectrum inference works, we treat the mean flux in postprocessing: specifically, we produce 40 equidistant values of the parameter F in the interval [0.2, 0.5] for each of the 75 thermal models, thereby sampling the 4D parameter space θ = (F, T0, γ, λP).


Figure 12. All 75 simulated models in the THERMAL suite.


We run a restricted version of Algorithm 1 with the possible selection of θ limited to the existing thermal models. We start by selecting the first 6 points randomly, build a GP emulator using the IND approach, and evaluate the expected improvement in fit function ${ \mathcal I }({\boldsymbol{\theta }})$ at the remaining points (using Viel data). We select the point that corresponds to the largest improvement in fit, and iterate until no further improvement can be made; a schematic sketch follows. In this regime we do not perform a direct optimization over the parameter space, but instead select inputs from the available THERMAL data. Our restricted algorithm terminates after iteration s = 4, using a total of 10 design points. Figure 13 demonstrates the fit of P(θs=4) corresponding to the best misfit value found. Using Nyx simulations as the forward model with a preexisting set of discrete points to choose from, we reached convergence in a similar number of evaluations as in Section 5.3, where we performed a continuous search for candidate evaluations of the rescaled model (purely by coincidence, the number of evaluations in both cases was actually identical: 10).
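
A schematic of the restricted selection step (a sketch only; `fit_gp`, `expected_improvement`, and `data_misfit` are hypothetical stand-ins as before):

```python
# Restricted variant of Algorithm 1: candidates are limited to the THERMAL
# grid, so each iteration is an argmax over the remaining models. Assumes
# grid_thetas and grid_outputs are NumPy arrays over the available models.
import numpy as np

def restricted_adaptive(grid_thetas, grid_outputs, n_init=6, tol=0.01, seed=0):
    rng = np.random.default_rng(seed)
    idx = list(rng.choice(len(grid_thetas), size=n_init, replace=False))
    while True:
        gp = fit_gp(grid_thetas[idx], grid_outputs[idx])          # hypothetical
        rest = [i for i in range(len(grid_thetas)) if i not in idx]
        ei = np.array([expected_improvement(gp, grid_thetas[i]) for i in rest])
        g_min = min(data_misfit(gp, grid_thetas[i]) for i in idx)  # hypothetical
        if ei.max() / g_min < tol:         # no useful improvement remains
            return idx
        idx.append(rest[int(np.argmax(ei))])
```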


Figure 13. P(θs=4) corresponding to the best found misfit value for the adaptive GP with Nyx simulations and Viel data.


The other important ingredient is the construction of the prior p(θ). We use a flat prior for F, log T0, γ, and $\mathrm{log}{\lambda }_{P}$ in a box constrained by the lowest and highest values for each parameter. We then truncate this prior to the convex hull of the THERMAL grid points, as is done in Walther et al. (2019); one possible implementation of this truncation is sketched below. The resulting truncated prior is shown in Figure 14. This truncation avoids GP extrapolation into regions of parameter space where this IGM model cannot produce an answer, for example, very low T0 combined with a very high pressure smoothing scale (see also the discussion of the λP parameter in Section 5.1).
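
A minimal sketch of the hull test (not necessarily the authors' implementation; `grid_points` is our placeholder for the thermal-grid coordinates, e.g., in the (log T0, γ, log λP) space of the 75 models):

```python
# Flat prior truncated to the convex hull of the simulation grid: points
# outside the Delaunay triangulation of the grid get zero prior density.
import numpy as np
from scipy.spatial import Delaunay

tri = Delaunay(grid_points)          # triangulate the simulation grid once

def log_prior(theta):
    # find_simplex returns -1 for points outside the hull
    inside = tri.find_simplex(np.atleast_2d(theta))[0] >= 0
    return 0.0 if inside else -np.inf
```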


Figure 14. Prior p(θ) for θ = (F, T0, γ, λP) according to Walther et al. (2019).


The posterior obtained with this prior, using the likelihood based on the adaptively constructed GP, is shown in Figure 15. We observe that the resulting posterior is considerably more constrained than the prior, although we stress that the constraints on the parameter γ are poor; this parameter plays very little role at high redshifts, where the density of Lyα absorbing gas is at most close to the cosmic mean. Overall, the marginal ranges and central values of the parameters agree well with those reported in Walther et al. (2019). Note, however, that we do not use the BOSS (Palanque-Delabrouille et al. 2013) measurements here, but only the Viel et al. (2013) data, as we wish to avoid modeling the correlated silicon III absorption that is present in the BOSS data set. We prefer to keep the forward model as simple as possible, as the goal of this work is testing and improving inference schemes for the Lyα power spectrum. In any case, the fact that our results presented here are completely consistent with Walther et al. (2019) indicates that the BOSS data set contributes negligible information about the thermal state of the IGM at high redshifts.


Figure 15. 1D and 2D marginal posteriors for θ = (F, T0, γ, λP) obtained with a restricted version of Algorithm 1 using Nyx simulations and Viel data. We apply smoothing to the plots of the marginal histograms, which makes them look more Gaussian. The numbers above the 1D histograms report the 50% quantiles of the marginal distributions, with uncertainties given by the differences between the 84% and 50% quantiles and between the 50% and 16% quantiles.


6. Conclusions

In this work we described the use of adaptively designed GP emulators of the Lyα flux power spectrum for solving inference problems for the thermal parameters of the IGM. While GP emulators constructed from a Latin hypercube design are in common use in the cosmological community today, to the best of our knowledge the data-driven adaptive selection of training inputs has been considered only very recently, in Rogers et al. (2019) and in our work. In the future, we expect to see wider use of similar techniques in other astrophysical applications where observational data already exist prior to the construction of an emulator.

Our motivation for this work is primarily the reduction of the number of computationally intensive simulations required to build a GP emulator. By prioritizing the regions of the parameter space that are consistent with the measurement data under the predictive model of the emulator, we obtain the desired reduction without sacrificing the quality of the parameter posteriors. A numerical study performed on a problem with an approximate model of the Lyα forest power spectrum and synthetic measurement data demonstrated that our adaptive approach obtains consistently good approximations of the parameter posterior and outperforms a similar-size fixed-design approach based on maximin Latin hypercube designs.

We provided a complete framework for building multi-output GP emulators that predict the power spectrum at the preselected modes k. Our numerical study demonstrates that the resulting multi-output emulators, which either treat outputs as conditionally independent given the hyperparameters (IND) or explicitly model linear correlations between the outputs (COR), are effective and computationally efficient. Furthermore, our approaches allow us to train emulators using only a highly limited number of training inputs, which in turn enables the adaptive selection of additional inputs.

The initial results obtained with our adaptive approach are encouraging. Specifically, for the problem of inferring three thermal parameters of the IGM and the mean flux using measurements of the power spectrum at seven values of k, our approach (constrained to the 75 available Nyx THERMAL simulations) required simulation outputs for only 10 input values to constrain the parameters to the same level of accuracy as in Walther et al. (2019), who used substantially more simulations.

Finally, we emphasize that we do not consider the classical parameterization θ = (F, T0, γ, λP) to be the best way to model the state of the IGM, but we nevertheless perform this type of analysis because it makes comparisons of our results with previous works straightforward. While these parameters have an intuitive physical meaning in describing the thermodynamical state of the IGM, there are several practical problems with them. First, they are output rather than input parameters, which significantly complicates the implementation of sampling and iterative emulation procedures. Second, these four parameters parameterize each time snapshot rather than the physical model itself. For these reasons, we consider models that parameterize the timing and duration of reionization as well as the associated heat input (Oñorbe et al. 2019) to be better suited, and we will use them in future work.

The authors are grateful to Jose Oñorbe for making his Nyx simulations available to us, as well as for providing helpful comments and insights. We thank Joe Hennawi and members of the Enigma group at UC Santa Barbara for insightful suggestions and discussions. This research used resources of the National Energy Research Scientific Computing Center (NERSC), which is supported by the Office of Science of the U.S. Department of Energy under Contract no. DE-AC02-05CH11231. This work made extensive use of the NASA Astrophysics Data System and of the astro-ph preprint archive at arXiv.org.

Software: Nyx (Almgren et al. 2013).

Appendix A: Rescaling the Thermal Parameters

The parameter space considered in this work consists of four standard parameters, $\{{T}_{0},\gamma ,{\lambda }_{P},\bar{F}\}$, describing the thermal state of the IGM. In this appendix we specify the rescaling of these parameters, which is used as a model in Section 5.3. The advantage of this approximate model is that it does not require a new simulation for every new evaluated point.

In photoionization equilibrium, the mean flux of the Lyα forest is proportional to the fraction of neutral hydrogen, and is thus degenerate with the amplitude of the assumed UV background. Therefore the mean flux can be rescaled by finding the constant multiplier of the optical depth of all spectral pixels in the simulation so that the mean flux matches the desired value: $\bar{F}=\langle \exp (-A{\tau }_{i,j,k})\rangle $. For accuracy considerations of rescaling the mean flux, we refer to Lukić et al. (2015).
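
A minimal sketch of this rescaling, assuming `tau` holds the optical depths of all spectral pixels in a snapshot (our notation):

```python
# Solve <exp(-A * tau)> = Fbar for the scalar multiplier A by bracketed root
# finding; the mean flux is monotone decreasing in A.
import numpy as np
from scipy.optimize import brentq

def rescale_mean_flux(tau, fbar, lo=1e-4, hi=1e4):
    resid = lambda a: np.mean(np.exp(-a * tau)) - fbar
    A = brentq(resid, lo, hi)          # assumes the bracket [lo, hi] is valid
    return np.exp(-A * tau), A         # rescaled pixel fluxes and multiplier
```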

Modifying any of the other three thermal parameters commonly requires running a new simulation (see, e.g., Walther et al. 2019). While modifying λP (3D pressure smoothing) is inherently difficult due to its dependence on the whole thermal history, we can hope to also be able to modify the instantaneous temperature, which determines the recombination rate and the thermal broadening (1D smoothing). In this way, we can generate different values of {T0, γ, $\bar{F}$} without the need to repeat expensive simulations. To test this, we use three simulations that have the same cosmological and numerical parameters and yield the same pressure smoothing parameter λP ≈ 68 kpc at redshift z = 3. Temperature–density diagrams for these simulations are shown in Figure 16. The fiducial simulation has T0 = 1.1 × 104 K and γ = 1.57; the low-T0 simulation differs from the fiducial only in that T0 = 7 × 103 K, while the low-γ simulation has γ = 1.03, with all other parameters the same as in the fiducial model.


Figure 16. The density–temperature distribution of gas (volume-weighted histogram) in three Nyx simulations: (a) the fiducial one, (b) the simulation with a lower T0 value than the fiducial one, and (c) the simulation with a lower γ value. Apart from the single parameter that differs, all simulation parameters are the same, including the pressure smoothing scale at this redshift, λP.


In Figure 17 we show power spectrum ratios of the low-T0 and low-γ simulations with respect to the fiducial model. The mean flux is matched in all cases shown. Solid lines (with circles) are the ratios of the unscaled simulations; they differ significantly over the k range covered by the data (k ≲ 0.08 km−1 s). The two dashed lines, with square and triangle markers, are models in which the temperature–density relation has been rescaled to the fiducial one without and with accounting for the scatter in T–ρ, respectively. We note the significant improvement in the power spectrum; the scatter in T–ρ expected from optically thin models does not play a significant role, although accounting for it helps in the case of a radical change in the γ parameter, as shown in the right panel of Figure 17. Finally, the dotted line represents the case in which we rescale the temperature–density relation and use the line-of-sight velocities from the fiducial simulation. This is not a practical solution, as we would not know these velocities when rescaling a simulation to a target T–ρ relation, but it shows that differences in velocity account for most of the remaining error in this rescaling procedure.


Figure 17. Power spectrum ratios showing the accuracy of different postprocessing approaches in rescaling the instantaneous temperature in simulations. Solid lines show the ratios of the low-T0 (left panel) and low-γ (right panel) simulations with respect to the fiducial simulation. Dashed lines show the same ratios after rescaling the simulations to match the temperature–density relation of the fiducial simulation, assuming all the gas lies exactly on the power law (squares) or accounting for the scatter in T–ρ (triangles). Dotted lines additionally take the line-of-sight velocity from the fiducial simulation into account, demonstrating that most of the remaining rescaling error comes from gas elements that move at different speeds. Throughout the BOSS, eBOSS, and DESI range of k (yellow region), this rescaling shows good accuracy.


While our approximate postprocessing model does not recover the power spectrum at the percent level over the whole range of available data, it is sufficiently accurate for the experiments conducted in Section 4. The essential requirements there are to know the true answer for a given model and to be able to evaluate the model many times. The rescaling procedure loses accuracy at the high-k end, which is important for thermal constraints and for interpreting P(k) from high-resolution spectra, but over the k range relevant to BOSS, eBOSS, and DESI observations (k ≲ 0.02 km−1 s), the achieved accuracy is ≈1%, which should suffice for many studies.

Appendix B: Training GP Emulators

Let θ(j), j = 1, ..., ntrain, represent training inputs, with ${\boldsymbol{P}}\left({{\boldsymbol{\theta }}}^{(j)}\right)$ being a q-vector of outputs corresponding to a particular input. Denote by yi the ntrain-vector of the normalized values of the ith output, i = 1, ..., q, defined as follows:

${y}_{i}^{(j)}={ \mathcal T }_{i}\left[{P}_{i}\left({{\boldsymbol{\theta }}}^{(j)}\right)\right],\qquad j=1,\ \ldots ,\ {n}_{\mathrm{train}},$ (B1)

where

${ \mathcal T }_{i}[x]=\frac{x-{{\mathbb{E}}}_{i}}{\sqrt{{{\mathbb{V}}}_{i}}},$ (B2)

with

${{\mathbb{E}}}_{i}=\frac{1}{{n}_{\mathrm{train}}}\sum _{j=1}^{{n}_{\mathrm{train}}}{P}_{i}\left({{\boldsymbol{\theta }}}^{(j)}\right),\qquad {{\mathbb{V}}}_{i}=\frac{1}{{n}_{\mathrm{train}}}\sum _{j=1}^{{n}_{\mathrm{train}}}{\left({P}_{i}\left({{\boldsymbol{\theta }}}^{(j)}\right)-{{\mathbb{E}}}_{i}\right)}^{2}.$ (B3)

The normalized training outputs together form an output matrix ${\boldsymbol{Y}}=[{{\boldsymbol{y}}}_{1},\ ...,\ {{\boldsymbol{y}}}_{q}]\in {{\mathbb{R}}}^{{n}_{\mathrm{train}}\times q}$. Finally, the vectorized form of Y is obtained by stacking the normalized training outputs into a (ntrain · q)-vector $\overline{{\boldsymbol{y}}}=\mathrm{vec}({\boldsymbol{Y}})$.

We denote the set of all training inputs by θtrain = {θ(j), j = 1, ..., ntrain}, and the combined set of training inputs and outputs, or training data, by ${ \mathcal D }=\{{{\boldsymbol{\theta }}}_{\mathrm{train}},{\boldsymbol{Y}}\}$.
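
A minimal sketch of this setup (with `forward_model` and `theta_train` as hypothetical placeholders for the simulator and the chosen training inputs):

```python
# Standardize each of the q outputs over the training set, Eqs. (B1)-(B3);
# the means E and variances V are kept for rescaling predictions (Appendix C).
import numpy as np

P = np.array([forward_model(t) for t in theta_train])  # (n_train, q) outputs
E = P.mean(axis=0)                 # per-output means E_i
V = P.var(axis=0)                  # per-output variances V_i
Y = (P - E) / np.sqrt(V)           # normalized output matrix Y
y_bar = Y.flatten(order="F")       # vec(Y): stack the normalized outputs
```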

Training a GP emulator requires specifying the hyperparameters ψ of its kernel. These are characterized by the posterior distribution (conditioning on Σk is left implicit in the notation because this matrix is fixed)

$p({\boldsymbol{\psi }}\,|\,{ \mathcal D })=\frac{p({ \mathcal D }\,|\,{\boldsymbol{\psi }})\,p({\boldsymbol{\psi }})}{p({ \mathcal D })},$ (B4)

where

$p({ \mathcal D }\,|\,{\boldsymbol{\psi }})={ \mathcal M }{ \mathcal N }\left({\boldsymbol{Y}}\,|\,{{\boldsymbol{\mu }}}^{\mathrm{norm}},c({{\boldsymbol{\theta }}}_{\mathrm{train}},{{\boldsymbol{\theta }}}_{\mathrm{train}};{\boldsymbol{\psi }}),{{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}\right)$ (B5)

is the likelihood of the training data ${ \mathcal D }$ under the matrix-normal distribution defining the GP (see Section 4), and $p({ \mathcal D })\,=\int p({ \mathcal D }| {\boldsymbol{\psi }})p({\boldsymbol{\psi }})d{\boldsymbol{\psi }}$ is referred to as the evidence. Note that μnorm is the normalized version of the mean function obtained by applying the linear transformation (B2), and ${{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}$ is the inter-output correlation matrix. To simplify this notation, we denote the covariance matrix for the training inputs by Cψ = c(θtrain, θtrain; ψ). We also recall that we take μ(·) ≡ 0. Thus, we have

$p({ \mathcal D }\,|\,{\boldsymbol{\psi }})={ \mathcal M }{ \mathcal N }\left({\boldsymbol{Y}}\,|\,{\bf{0}},{{\boldsymbol{C}}}_{\psi },{{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}\right).$ (B6)

In a vectorized form, we can express the likelihood above as a regular multivariate normal density,

$p({ \mathcal D }\,|\,{\boldsymbol{\psi }})={{ \mathcal N }}_{{n}_{\mathrm{train}}\cdot q}\left(\overline{{\boldsymbol{y}}}\,|\,{\bf{0}},{{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}\otimes {{\boldsymbol{C}}}_{\psi }\right).$ (B7)

How do we obtain the hyperposterior (B4)? Because no analytical form for this posterior exists, we describe it via a particle approximation (Bilionis & Zabaras 2016, Section 2.6). That is, we approximate the hyperposterior with a weighted sum of Dirac delta functions centered at samples ψ(j):

$p({\boldsymbol{\psi }}\,|\,{ \mathcal D })\approx \sum _{j=1}^{{n}_{\psi }}{w}^{(j)}\,\delta \left({\boldsymbol{\psi }}-{{\boldsymbol{\psi }}}^{(j)}\right),$ (B8)

with weights w(j) ≥ 0 and ${\sum }_{j=1}^{{n}_{\psi }}{w}^{(j)}=1$.

One way to obtain such a particle approximation is by maximizing the likelihood of the data given by (B7). This leads to a single-particle approximation,

$p({\boldsymbol{\psi }}\,|\,{ \mathcal D })\approx \delta \left({\boldsymbol{\psi }}-{{\boldsymbol{\psi }}}_{\mathrm{MLE}}^{* }\right),$ (B9)

where

${{\boldsymbol{\psi }}}_{\mathrm{MLE}}^{* }=\arg \mathop{\max }\limits_{{\boldsymbol{\psi }}}\ p({ \mathcal D }\,|\,{\boldsymbol{\psi }})$ (B10)

is the maximum likelihood estimator (MLE) of the hyperparameter vector. In the case of a flat prior on the hyperparameters, this estimator coincides with the maximum a posteriori (MAP) estimator. The MLE approach is convenient to work with because the covariance matrix Cψ needs to be formed only once, but it might lead to somewhat overconfident estimates of the predictive uncertainties of the GP emulator. In the case of a sharply peaked likelihood $p({ \mathcal D }| {\boldsymbol{\psi }})$, the MLE estimator can be sufficient. Another way of obtaining the particle approximation of the hyperposterior is by sampling it using MCMC techniques. This approach provides a more complete picture of the hyperposterior, but at additional computational cost.

Whether the MLE or MCMC approach is used to obtain the hyperposterior $p({\boldsymbol{\psi }}| { \mathcal D })$, we need to be able to evaluate the logarithm of the likelihood function (B7) (note that by applying the logarithm, we preserve the order relation and obtain a better-behaved function). In the following we derive the expression for the log-likelihood of the data and explain how it can be efficiently computed.

Let ${{\boldsymbol{\Sigma }}}_{\mathrm{tot}}={{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}\otimes {{\boldsymbol{C}}}_{\psi }$ and let ai,j denote the entries of ${({{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}})}^{-1}$, then

$\mathrm{log}\,p({ \mathcal D }\,|\,{\boldsymbol{\psi }})=-\frac{1}{2}\sum _{i,j=1}^{q}{a}_{i,j}\,{{\boldsymbol{y}}}_{i}^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{{\boldsymbol{y}}}_{j}-\frac{{n}_{\mathrm{train}}}{2}\mathrm{log}\,\det {{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}-\frac{q}{2}\mathrm{log}\,\det {{\boldsymbol{C}}}_{\psi }-\frac{{n}_{\mathrm{train}}\,q}{2}\mathrm{log}(2\pi ).$ (B11)

In our implementation we first compute the Cholesky decomposition of the input covariance,

${{\boldsymbol{C}}}_{\psi }={\boldsymbol{L}}{{\boldsymbol{L}}}^{T},$ (B12)

and let

$\widetilde{{\boldsymbol{Y}}}={{\boldsymbol{L}}}^{-1}{\boldsymbol{Y}},$ (B13)

then compute

${\boldsymbol{M}}={\widetilde{{\boldsymbol{Y}}}}^{T}\widetilde{{\boldsymbol{Y}}}={{\boldsymbol{Y}}}^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{\boldsymbol{Y}},$ (B14)

and set

${\boldsymbol{D}}={{\boldsymbol{S}}}^{-1}{\boldsymbol{M}}{{\boldsymbol{S}}}^{-T},$ (B15)

where ${{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}={\boldsymbol{S}}{{\boldsymbol{S}}}^{T}$ is the Cholesky decomposition of the output correlation matrix. Then

$\mathrm{log}\,p({ \mathcal D }\,|\,{\boldsymbol{\psi }})=-\frac{1}{2}\mathrm{tr}({\boldsymbol{D}})-{n}_{\mathrm{train}}\sum _{i=1}^{q}\mathrm{log}\,{S}_{{ii}}-q\sum _{i=1}^{{n}_{\mathrm{train}}}\mathrm{log}\,{L}_{{ii}}-\frac{{n}_{\mathrm{train}}\,q}{2}\mathrm{log}(2\pi ).$ (B16)

The expression above can be further simplified in certain cases. For example, for the IND emulator, which treats the outputs as independent given the Cψ matrix, the output correlation matrix is the identity matrix, ${{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}={{\boldsymbol{I}}}_{q}$, and the computation of $\mathrm{tr}({\boldsymbol{D}})$ does not require the cross-terms ${{\boldsymbol{y}}}_{i}^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{{\boldsymbol{y}}}_{j}$ for i ≠ j. In the case of the IND emulator, the log-likelihood thus simplifies to

$\mathrm{log}\,p({ \mathcal D }\,|\,{\boldsymbol{\psi }})=-\frac{1}{2}\sum _{i=1}^{q}{{\boldsymbol{y}}}_{i}^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{{\boldsymbol{y}}}_{i}-q\sum _{i=1}^{{n}_{\mathrm{train}}}\mathrm{log}\,{L}_{{ii}}-\frac{{n}_{\mathrm{train}}\,q}{2}\mathrm{log}(2\pi ).$ (B17)
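
A minimal sketch of evaluating (B17) via a Cholesky factor of Cψ (with a hypothetical `kernel` helper that builds the covariance matrix):

```python
# IND log-likelihood: one Cholesky factorization serves all q outputs.
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def log_likelihood_ind(psi, theta_train, Y, kernel):
    n, q = Y.shape
    L, low = cho_factor(kernel(theta_train, theta_train, psi), lower=True)
    quad = np.sum(Y * cho_solve((L, low), Y))     # sum_i y_i^T C^-1 y_i
    logdet = 2.0 * np.sum(np.log(np.diag(L)))     # log det C_psi from Cholesky
    return -0.5 * quad - 0.5 * q * logdet - 0.5 * n * q * np.log(2.0 * np.pi)
```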

In our implementation, the optimal hyperparameter values ${{\boldsymbol{\psi }}}_{\mathrm{MLE}}^{* }$ are obtained by maximizing the above log-likelihood using a multistart strategy with a quasi-Newton iterative nonlinear optimizer such as sequential least-squares programming (SLSQP) or limited-memory Broyden–Fletcher–Goldfarb–Shanno (L-BFGS-B).
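
A sketch of the multistart strategy (our simplified version; `neg_log_lik` would wrap the negative of the log-likelihood above):

```python
# Restart a bounded quasi-Newton optimizer from several random initial
# hyperparameter vectors and keep the best local optimum found.
import numpy as np
from scipy.optimize import minimize

def mle_hyperparameters(neg_log_lik, bounds, n_starts=10, seed=0):
    rng = np.random.default_rng(seed)
    best = None
    for _ in range(n_starts):
        x0 = np.array([rng.uniform(lo, hi) for lo, hi in bounds])
        res = minimize(neg_log_lik, x0, method="L-BFGS-B", bounds=bounds)
        if best is None or res.fun < best.fun:
            best = res
    return best.x
```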

Appendix C: Obtaining Predictions

In order to obtain a prediction for an untried input θ, we apply the standard GP formulas obtained by conditioning on the data ${ \mathcal D }$. Furthermore, by exploiting the Kronecker product structure of the covariance, we can apply standard GP formulas to each output separately. Indeed, as shown in Bonilla et al. (2008),

${m}_{i}^{\mathrm{norm}}({\boldsymbol{\theta }})={{\boldsymbol{c}}}_{\psi }^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{{\boldsymbol{y}}}_{i},\qquad i=1,\ \ldots ,\ q,$ (C1)

where the superscript norm indicates that this is the predictive mean of the GP fitted to the normalized outputs, and ${{\boldsymbol{c}}}_{\psi }\,=c({\boldsymbol{\theta }},{{\boldsymbol{\theta }}}_{\mathrm{train}};{\boldsymbol{\psi }})\in {{\mathbb{R}}}^{{n}_{\mathrm{train}}}$. For the predictive covariance, we obtain

${{\boldsymbol{\Sigma }}}_{\mathrm{GP}}^{\mathrm{norm}}({\boldsymbol{\theta }})={\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}},\qquad {\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})=c({\boldsymbol{\theta }},{\boldsymbol{\theta }};{\boldsymbol{\psi }})-{{\boldsymbol{c}}}_{\psi }^{T}{{\boldsymbol{C}}}_{\psi }^{-1}{{\boldsymbol{c}}}_{\psi }.$ (C2)

Upon rescaling, we obtain

${{\boldsymbol{P}}}_{\mathrm{GP}}({\boldsymbol{\theta }})\,|\,{ \mathcal D },{\boldsymbol{\psi }}\sim {{ \mathcal N }}_{q}\left({\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }}),{{\boldsymbol{\Sigma }}}_{\mathrm{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right),$ (C3)

with

${\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})={\boldsymbol{E}}+{{\boldsymbol{V}}}^{1/2}\,{{\boldsymbol{m}}}^{\mathrm{norm}}({\boldsymbol{\theta }}),\qquad {\boldsymbol{E}}={\left[{{\mathbb{E}}}_{1},\ \ldots ,\ {{\mathbb{E}}}_{q}\right]}^{T},$ (C4)

and

${{\boldsymbol{\Sigma }}}_{\mathrm{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})={\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{{\boldsymbol{V}}}^{1/2}{{\boldsymbol{\Sigma }}}_{k}^{\mathrm{norm}}{{\boldsymbol{V}}}^{1/2},$ (C5)

where ${\boldsymbol{V}}=\mathrm{diag}[{{\mathbb{V}}}_{1},\ ...,\ {{\mathbb{V}}}_{q}]\in {{\mathbb{R}}}^{q\times q}$. Finally, integrating out the hyperparameters ψ (we recall the particle approximation (B8)), we obtain

$p\left({{\boldsymbol{P}}}_{\mathrm{GP}}({\boldsymbol{\theta }})\,|\,{ \mathcal D }\right)\approx \sum _{j=1}^{{n}_{\psi }}{w}^{(j)}\,{{ \mathcal N }}_{q}\left({\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{{\boldsymbol{\psi }}}^{(j)}),{{\boldsymbol{\Sigma }}}_{\mathrm{GP}}({\boldsymbol{\theta }};{ \mathcal D },{{\boldsymbol{\psi }}}^{(j)})\right).$ (C6)
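
A minimal sketch of these prediction formulas for a single untried θ (with the hypothetical `kernel` helper from above and the normalization constants E, V from Appendix B):

```python
# Per-output predictive means share one Cholesky solve with C_psi thanks to
# the Kronecker structure; the scalar variance multiplies the inter-output
# matrix (Eqs. C1-C5).
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def predict(theta, theta_train, Y, E, V, psi, kernel):
    cf = cho_factor(kernel(theta_train, theta_train, psi), lower=True)
    c_vec = kernel(theta_train, theta[None, :], psi).ravel()      # c_psi
    m_norm = c_vec @ cho_solve(cf, Y)               # q normalized means (C1)
    v = kernel(theta[None, :], theta[None, :], psi).item() \
        - c_vec @ cho_solve(cf, c_vec)              # scalar variance (C2)
    return E + np.sqrt(V) * m_norm, v               # rescaled mean, V(theta)
```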

Appendix D: Inference using GP Emulators

We assume that we are given a vector of observations ${\boldsymbol{d}}\in {{\mathbb{R}}}^{q}$ and a distribution of the measurement noise ${{ \mathcal N }}_{q}({0}_{q},{{\boldsymbol{\Sigma }}}_{E})$ with a known covariance ΣE. Upon substituting the true response P(·) with the GP emulator PGP(·) in the likelihood of the measurement data, and integrating with respect to the GP distribution (C6), we obtain (see Takhtaganov & Müller 2018 for details) the so-called ${ \mathcal D }$-restricted likelihood,

$L({\boldsymbol{\theta }}\,|\,{\boldsymbol{d}},{ \mathcal D })\approx \sum _{j=1}^{{n}_{\psi }}{w}^{(j)}\,{s}^{(j)}\exp \left(-g({\boldsymbol{\theta }};{ \mathcal D },{{\boldsymbol{\psi }}}^{(j)})\right),$ (D1)

where ${s}^{\left(j\right)}={(2\pi )}^{-q/2}| {{\boldsymbol{\Sigma }}}_{E}+{{\boldsymbol{\Sigma }}}_{{GP}}\left({\boldsymbol{\theta }};{ \mathcal D },{{\boldsymbol{\psi }}}^{\left(j\right)}\right){| }^{-1/2}$, and $g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ is a data misfit function, defined as

$g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})=\frac{1}{2}{\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right)}^{T}{\left({{\boldsymbol{\Sigma }}}_{E}+{{\boldsymbol{\Sigma }}}_{\mathrm{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right)}^{-1}\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right).$ (D2)

When performing inference, the likelihood $L({\boldsymbol{\theta }}| {\boldsymbol{d}},{ \mathcal D })$ needs to be repeatedly evaluated for different values of θ. Instead of using the Cholesky factorization of the matrix appearing in the definition of $g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$, we compute the misfit efficiently as follows.

First, we cover the case of homoscedastic measurement noise, i.e., when ${{\boldsymbol{\Sigma }}}_{E}={\sigma }_{E}^{2}{{\boldsymbol{I}}}_{q}$. We denote the matrix appearing in $g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$ as

${{\boldsymbol{\Sigma }}}_{\mathrm{lik}}({\boldsymbol{\theta }})={{\boldsymbol{\Sigma }}}_{E}+{{\boldsymbol{\Sigma }}}_{\mathrm{GP}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }}).$ (D3)

Plugging in ΣE and ΣGP, we obtain

${{\boldsymbol{\Sigma }}}_{\mathrm{lik}}({\boldsymbol{\theta }})={\sigma }_{E}^{2}{{\boldsymbol{I}}}_{q}+{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{{\boldsymbol{\Sigma }}}_{k}.$ (D4)

We have the sum of a symmetric matrix and a constant times the identity matrix. In this case, the inverse of Σlik can be efficiently computed using the eigendecomposition of the Σk matrix. Let

${{\boldsymbol{\Sigma }}}_{k}={\boldsymbol{Q}}{\boldsymbol{\Lambda }}{{\boldsymbol{Q}}}^{T}.$ (D5)

Then

${{\boldsymbol{\Sigma }}}_{\mathrm{lik}}^{-1}={\boldsymbol{Q}}{\left({\sigma }_{E}^{2}{{\boldsymbol{I}}}_{q}+{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{\boldsymbol{\Lambda }}\right)}^{-1}{{\boldsymbol{Q}}}^{T}.$ (D6)

In order to compute the misfit function $g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})$, we first compute

${\boldsymbol{v}}={{\boldsymbol{Q}}}^{T}\left({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right),$ (D7)

then

$g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})=\frac{1}{2\,{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})}\,{{\boldsymbol{v}}}^{T}{{\boldsymbol{D}}}^{-1}{\boldsymbol{v}},$ (D8)

where ${\boldsymbol{D}}={\boldsymbol{\Lambda }}+\left({\sigma }_{E}^{2}/{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\right){{\boldsymbol{I}}}_{q}$ is a diagonal matrix. Thus, for each θ and ψ the computation of the data misfit function requires ${ \mathcal O }({q}^{2})$ operations.
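
A minimal sketch of this homoscedastic case, assuming the predictive mean `m_theta` (q-vector) and scalar variance `v_theta` have already been computed for the given θ and ψ:

```python
# The eigendecomposition of Sigma_k is done once and reused for every theta;
# each misfit evaluation then costs O(q^2), Eqs. (D5)-(D8).
import numpy as np

lam, Q = np.linalg.eigh(Sigma_k)    # Sigma_k = Q diag(lam) Q^T

def misfit_homoscedastic(d, m_theta, v_theta, sigma_e2):
    v = Q.T @ (d - m_theta)         # rotated residual, Eq. (D7)
    return 0.5 * np.sum(v**2 / (v_theta * lam + sigma_e2))  # Eq. (D8)
```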

For a general ΣE, we use the generalized eigendecomposition,

${{\boldsymbol{\Sigma }}}_{k}{\boldsymbol{U}}={{\boldsymbol{\Sigma }}}_{E}{\boldsymbol{U}}{\boldsymbol{\Lambda }},\qquad {{\boldsymbol{U}}}^{T}{{\boldsymbol{\Sigma }}}_{E}{\boldsymbol{U}}={{\boldsymbol{I}}}_{q},$ (D9)

which leads to the following form for the inverse of Σlik:

${{\boldsymbol{\Sigma }}}_{\mathrm{lik}}^{-1}={\boldsymbol{U}}{\left({{\boldsymbol{I}}}_{q}+{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{\boldsymbol{\Lambda }}\right)}^{-1}{{\boldsymbol{U}}}^{T}.$ (D10)

Then

$g({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})=\frac{1}{2}\,{{\boldsymbol{v}}}^{T}{\left({{\boldsymbol{I}}}_{q}+{\mathbb{V}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }})\,{\boldsymbol{\Lambda }}\right)}^{-1}{\boldsymbol{v}},$ (D11)

with ${\boldsymbol{v}}={{\boldsymbol{U}}}^{T}({\boldsymbol{d}}-{\boldsymbol{m}}({\boldsymbol{\theta }};{ \mathcal D },{\boldsymbol{\psi }}))$.
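
A companion sketch for general ΣE using a generalized symmetric eigendecomposition (SciPy normalizes the eigenvectors so that ${{\boldsymbol{U}}}^{T}{{\boldsymbol{\Sigma }}}_{E}{\boldsymbol{U}}={{\boldsymbol{I}}}_{q}$, matching Eq. (D9)):

```python
# Generalized eigendecomposition Sigma_k u = lambda Sigma_E u, computed once;
# each misfit evaluation then reduces to a diagonal solve, Eq. (D11).
import numpy as np
from scipy.linalg import eigh

lam, U = eigh(Sigma_k, Sigma_E)     # U.T @ Sigma_E @ U = I by construction

def misfit_general(d, m_theta, v_theta):
    v = U.T @ (d - m_theta)                            # rotated residual
    return 0.5 * np.sum(v**2 / (1.0 + v_theta * lam))  # Eq. (D11)
```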
