CHARACTERIZING THE FORMATION HISTORY OF MILKY WAY LIKE STELLAR HALOS WITH MODEL EMULATORS

Facundo A. Gómez; Christopher E. Coleman-Smith; Brian W. O'Shea; Jason Tumlinson; Robert. L. Wolpert

doi:10.1088/0004-637X/760/2/112

1. INTRODUCTION

Understanding the formation and evolution of galaxies is a central and long-standing problem in astrophysics. Over the past century, and particularly in the past decade, a tremendous amount of information has been gleaned about populations of galaxies and their temporal evolution, and data have been collected on galaxies spanning more than six orders of magnitude in stellar mass and over 13 billion years in the age of the universe. These observations show that the galaxies that we can see have undergone radical changes in size, appearance, and content over the last 13 billion years (Rix et al. 2004; Brinchmann et al. 2004; Papovich et al. 2005; Shapley 2011). Complementary observations have provided a rich data set on the kinematics and elemental abundances of stars in the Milky Way (MW), including large numbers of metal-poor stars in the halo of our own Galaxy and in local dwarf galaxies. In principle, this "galactic fossil record" can probe the entire merger and star formation history of the MW and its satellites, and complement direct observations at higher redshifts.

The quantity and quality of observational data on galaxy formation, which is already staggering, is going to increase exponentially over the next decade. Surveys such as LAMOST (Newberg et al. 2009), SkyMapper (Keller et al. 2007), Gaia (Perryman et al. 2001), and, ultimately, the Large Synoptic Survey Telescope (LSST Science Collaborations et al. 2009) will produce petabytes of data on billions of individual objects, both galactic and extragalactic, that will strongly inform our understanding of galaxy behavior.

Despite this wealth of observational information, we currently lack the detailed and self-consistent theoretical models necessary to adequately interpret such observational data sets. Purely analytic (i.e., "pencil-and-paper") theoretical models are insufficient to address the questions that are currently being asked about galaxy formation, due in no small part to the range of physical components that must be simultaneously modeled (e.g., gravity, dark matter (DM), gas dynamics, radiative cooling, star formation, and feedback), and the complex and nonlinear coupling of these components. As a result of these complications, two separate theoretical methods are commonly used to study galaxy formation: multiphysics hydrodynamical simulations and semi-analytic models.

Multiphysics numerical simulations are typically used to model galaxy formation by implementing relevant physical processes in as realistic a manner as is technically and computationally feasible. These calculations are typically based on N-body DM dynamics simulations of cosmological structure formation, and include gas dynamics, the radiative cooling and heating of gas, subgrid models for star formation and feedback, and possibly more complex physics such as magnetohydrodynamics, radiation transport, and the formation of, and feedback from, super massive black holes. Commonly used codes of this type include Enzo (O'Shea et al. 2004; Norman et al. 2007), Gadget (Springel 2005), Gasoline (Wadsley et al. 2004), RAMSES (Teyssier 2002), and more recently AREPO (Springel 2010). These codes produce broadly similar results, although some important differences remain to be resolved (O'Shea et al. 2005; Agertz et al. 2007; Tasker et al. 2008; Sijacki et al. 2011). The main advantage of such calculations is that they attempt to faithfully reproduce the relevant physical processes in as accurate of a manner as possible, and by virtue of their construction automatically include any complex, nonlinear interaction between important physical processes. For example, this approach is able to naturally handle the nonlinear hydrodynamics of gas ejecta (once initialized) due to bursts of star formation, as does the return of such gas into later generations of structure formation. The main disadvantage of this sort of simulation lies in its cost: current-generation calculations of a single MW-like galaxy performed at high (∼100 pc) spatial resolution (e.g., Agertz et al. 2011) can easily consume hundreds of thousands of CPU hours and months of time to complete, making it challenging to model statistically significant numbers of galaxies or to perform a meaningful study of variations in free parameters within the models.

A second approach is often referred to as "semi-analytic" or "phenomenological" modeling of galaxy formation. This type of model typically is based upon either the extended Press–Schechter formalism or N-body cosmological simulations, which provide the evolutionary histories for a population of galaxies. Prescriptions are then applied on top of these evolutionary histories to describe the behavior of the gas and stellar populations contained within, and surrounding, the DM halos that drive dynamics on large scales, as well as the observational properties of the resulting galaxies. These models are then calibrated by comparison to some set of observations. Some examples of this sort of model include GALFORM (Benson et al. 2003; Bower et al. 2006; Benson & Bower 2010), Galacticus (Benson 2012), and ChemTreeN (Tumlinson 2006, 2010). Two important strengths of this type of model are flexibility and speed: one can easily implement variations on a model (gas ejection from galaxies as a function of halo mass and redshift versus a constant value) and then see within minutes how this affects the modeled population of galaxies. Alternately, one can rapidly sweep through large swaths of parameter space, finding the best-fit model to a given set of observational constraints. The disadvantages of this modeling technique include the extent to which the observable properties of simulated galaxies depend on the models of specific physical phenomena, such as the behavior of galaxies during mergers, as well as the large number of free parameters. Note that the latter also pertains to hydrodynamical simulations, although to a lesser extent. Even with these substantial downsides, however, semi-analytic models are useful for exploring the consequences of various physical phenomena on the observable properties of galaxies.

One consequence of the large number of free parameters in semi-analytic models is that there is rarely a single "best-fit" set of parameters to a given set of observations. Rather, given N free parameters that define an N-dimensional space of potential values, it is likely that there is an M-dimensional surface, with M < N, of approximately equally good statistical fit to the observations in question, or even possibly several discrete regions within the parameter space that provide comparably good statistical fits to the chosen observations. When multiple observational data sets are used, or observations of different populations, it is possible that a given model parameter may only be relevant for a subset of the observations (and thus, the other observations provide no significant constraints upon this parameter). This challenge has only recently begun to be explored in the context of galaxy formation by Henriques et al. (2009), Bower et al. (2010), and Lu et al. (2011, 2012).

It is also desirable to attempt to model the formation of the MW as an exemplar of galaxy formation in general. The primary reason for doing so is that by virtue of the Earth being embedded within the MW, we possess a massive amount of data on this particular object (e.g., Dame et al. 2001; Yanny et al. 2003; Carollo et al. 2007; Ivezić et al. 2008; Jurić et al. 2008; Ghez et al. 2008; Bond et al. 2010, and many others). One important drawback of modeling a single object, however, is that the most fundamental part of either type of galaxy formation model described above—the gravitationally driven merger of successive generations of DM halos—is unique to a given halo, and may profoundly affect many observational quantities. As a result, there may be critical degeneracies between the merger history of a single galaxy and the model parameters. This is not an issue when studying large numbers of galaxies, as is done by Bower et al. (2010, hereafter B10), since the varieties of merger histories are included in the overall population. However, it presents substantial complications when trying to duplicate observations of the MW using either hydrodynamic or semi-analytic models.

In this paper, we combine semi-analytic models of the formation of the MW (including several different N-body simulation-based merger histories) with modern statistical techniques to explore how one might meaningfully constrain the formation of the MW's stellar halo and population of satellite galaxies both from a theoretical standpoint and in terms of guiding future observations. This paper is arranged as follows. In Section 2, we describe our methodology, including the cosmological simulations, semi-analytic models, and in Section 3 the statistical tools. In Section 4, we examine how a small set of model parameters affects observational quantities of the MW in a qualitative way, and do the same in a statistical sense in Section 5. We discuss the results and limitations of this work in Section 6.

2. NUMERICAL METHODS

In this section, we briefly describe the N-body simulations analyzed in this work and provide a brief summary of the main characteristic of the semi-analytical model, ChemTreeN. For a detail description of our numerical methods, we refer the reader to Tumlinson (2010, hereafter T10).

2.1. N-body Simulations

Four different simulations of the formation of MW-like DM halos are analyzed in this work. The simulations were run using Gadget-2 (Springel 2005) in a local computer cluster. MW-like halos were first identified in a cosmological simulation with a particle resolution of 128³ within a periodic box of side 7.32 h⁻¹ Mpc. A WMAP3 cosmology (Spergel et al. 2007) was adopted, with matter density Ω_m = 0.238, baryon density Ω_b = 0.0416, vacuum energy density Ω_Λ = 0.762, power spectrum normalization σ₈ = 0.761, power spectrum slope n_s = 0.958, and Hubble constant H₀ = 73.2 km s⁻¹ Mpc⁻¹. The candidates were selected to have gravitationally bound DM halos with virial masses of M₂₀₀ ≈ 1.5 × 10¹² M_☉ at z = 0 and no major mergers since z = 1.5–2. These MW-like DM halos were subsequently re-simulated at a resolution of 512³ by applying a multi-mass particle "zoom-in" technique. At this resolution, each DM particle has a mass of M_p = 2.64 × 10⁵ M_☉. Snapshots were generated at intervals of 20 Myr before z = 4 and 75 Myr intervals from z = 4 to z = 0. A six-dimensional friends-of-friends (Diemand et al. 2006) algorithm was applied to identify DM halos in each snapshot. The gravitational softening length was 100 comoving pc in all simulations. The main properties of the resulting DM halos are listed in Table 1.

Table 1. Main Properties at z = 0 of the Four Dark Matter Halos Analyzed in this Work

Name	R₂₀₀^a	M₂₀₀^b	c	z_LMM
MW1	381	1.63	12.2	2.1
MW2	378	1.59	9.2	3.5
MW3	347	1.23	15.5	2.0
MW4^c	366	1.44	13.6	3.0

Notes. ^aDistances are listed in kpc. ^bMasses are listed in 10¹² M_☉. ^cMW4 corresponds to the simulation MW6 presented in T10.

Download table as: ASCII Typeset image

2.2. Galactic Chemical Evolution Model

DM-only simulations are a useful tool for self-consistently characterizing the mass assembly history of DM halos in a cold dark matter cosmology. Yet they do not provide any information about the stellar populations of a galaxy such as MW. For this purpose, we have coupled to our N-body simulations the semi-analytical model, ChemTreeN (T10). A semi-analytical model consists of a set of analytic prescriptions that describe the underlying physical mechanisms driving the evolution of the baryons. Processes such as star formation, stellar winds, or chemical enrichment are introduced in the model through differential equations that are controlled via a set of adjustable input parameters. These parameters are commonly set to simultaneously match different observable quantities, such as the galaxy luminosity functions (LFs) or a variety of scaling relations. The general approach used in our models is to assume that each DM halo found in the simulations, and followed through the merger tree, possesses gas that has been accreted from the intergalactic medium (IGM), that this gas forms stars, that these stars return metals and energy to the host halo and to the larger environment, and that future generations of stars form with the now metal-enriched gas. Table 2 summarizes the numerical values of the parameters used for our fiducial models. The set of parameters that are considered in the model emulator analysis, as well as the range of values over which they are allowed to vary, are also listed in this table. In what follows, we briefly describe the implementation of the prescriptions that are most relevant to this work.

Table 2. Model Parameters

Parameter	Fiducial Value	Range	Description	Explored
z_r	10	5–19	Epoch of re-ionization	Yes
f_bary	0.05	0–0.2	Baryonic mass fraction	Yes
f_esc	50	0–110	Escape factor of metals	Yes
_*	1 × 10⁻¹⁰	0.2–1.8	Star formation efficiency (10⁻¹⁰ yr⁻¹)	Yes
m^II_Fe	0.07	0.04–0.2	SN II iron yield (M_☉)	Yes
f_Ia	0.015	...	SN Ia probability	No
_SN	0.0015	...	SNe energy coupling	No
m^Ia_Fe	0.5	...	SN Ia iron yield (M_☉)	No

Download table as: ASCII Typeset image

2.2.1. Parcels and Particle Tagging

ChemTreeN follows the evolution of stellar mass, metal abundance, and gas budgets of galaxies as a function of cosmic time. For each endpoint or "root halo" in which the star formation and chemical enrichment history will be determined, the code identifies its highest-redshift progenitor and starts the calculation there. Note that due to tidal disruption, the endpoint of some halos can be found at z > 0. The star formation history is calculated using 10 time steps between each redshift snapshot, so time steps are 2–8 Myr. At each time step, a star formation "parcel" is created with a single initial mass, metallically, and initial mass function (IMF). The metallicity for the parcel is derived from the present gas metallicity. Each parcel thus represents a single-age stellar population with a unique metallicity. When halos merge, their lists of parcels are concatenated. The calculation concludes when the "root halo" or endpoint is reached. Note that this process is performed independently for each satellite galaxy or building block that gave rise to the final MW-like halo. To explore the spatial, kinematic, and dynamical properties of stellar populations in the resulting halos, it is necessary to assign "stars" to DM particles in the N-body simulation. This is performed as follows. At each snapshot output, a fraction of the most bound particles in each halo are selected. The star formation parcels that occurred between this snapshot and the previous snapshots are then identified, and an equal fraction of star-forming parcels are assigned to each of the selected particles. Note that only the 10% most gravitationally bound particles in each halo are considered, in order to approximate the effect of stars forming deep in the potential well of the galaxy, where dense gas would be most likely to collect. As shown by Cooper et al. (2010), to reproduce the observed relationship between size and luminosity of dwarf galaxies in the Local Group, a fraction of 1%–3% is required. This choice is not feasible here due to the limited resolution of our N-simulations. However, gross properties of dwarf galaxies such as total luminosity are not strongly affected by our particular choice. Furthermore, rather than comparing with real observational data, in this work we focus on mock observational data sets generated by the model itself.

2.2.2. Baryon Assignment

The number of stars that a galaxy has formed throughout its history strongly depends on the amount of gas it contained. It is therefore important to define a prescription to model baryonic accretion into DM halos. Our models adopt a prescription based on that of Bullock & Johnston (2005) that heuristically takes into account the influence of a photoionizing background from the aggregate star formation in all galaxies. This model assigns a fixed mass fraction of baryons, f_bary, to all DM halos before re-ionization, z_r. After z_r, gas accretion and therefore star formation in small halos are suppressed below v_c = 30 km s⁻¹. Between 30 km s⁻¹ and 50 km s⁻¹, the assigned baryon fraction varies linearly from 0 to f_bary. This baryon assignment is intended to capture the IGM "filtering mass" (Gnedin 2000) below which halos are too small to retain baryons that have been heated to T ≳ 10⁴ K by global re-ionization.

2.2.3. Star Formation Efficiency

Stars are formed in discrete "parcels" with a constant efficiency, epsilon _*, such that the mass formed into stars M_* = epsilon _*M_gasΔt in time interval Δt. The star formation efficiency is equivalent to a timescale, epsilon _* = 1/t_*, on which baryons are converted into stars. The fiducial choice for this parameter is t_* = 10 Gyr, or epsilon _* = 10⁻¹⁰ yr⁻¹.

2.2.4. Stellar Initial Mass Function

An invariant stellar IMF at all times and at all metallicities is assumed. The invariant IMF adopted is that of Kroupa (2001), dn/dM∝(m/M_☉)^α, with slope α = −2.3 from 0.5 to 140 M_☉ and slope α = −1.3 from 0.1 to 0.5 M_☉. The impact that variations of the IMF may have on the stellar populations of the resulting stellar halos will be studied in a follow-up work.

2.2.5. Type Ia SNe

Type Ia supernovae (SNe) are assumed to arise from thermonuclear explosions triggered by the collapse of a C/O white dwarf precursor that has slowly accreted mass from a binary companion until it exceeds the 1.4 M_☉ Chandrasekhar limit. For stars that evolve into white dwarfs as binaries, the SN occurs after a time delay from formation that is roughly equal to the lifetime of the least massive companion. In our models, stars with initial mass M = 1.5–8 M_☉ are considered eligible to eventually yield a Type Ia SN. When stars in this mass range are formed, some fraction of them, f_Ia, are assigned the status as a Type Ia and are given a binary companion with mass obtained from a suitable probability distribution (Greggio & Renzini 1983). The chemical evolution results are sensitive to the SN Ia probability normalization, f_Ia. This parameter is fixed by normalizing to the observed relative rates of Type II and Type Ia SNe for spiral galaxies in the local universe (Tammann et al. 1994). This normalization gives a ratio of SN II to Ia of six to one.

2.2.6. Chemical Yields

ChemTreeN tracks the time evolution of galaxies' bulk metallicities by considering Fe as the proxy reference element. For Type Ia SNe with 1.5–8 M_☉, the models adopt the W7 yields of Nomoto et al. (1997) for Fe, with 0.5 M_☉ of Fe from each Type Ia SN. Type II SNe are assumed to arise from stars of 10 to 40 M_☉, with mass yields provided by Tominaga (2009). They represent the bulk yields of core-collapse SNe with uniform explosion energy E = 10⁵¹ erg. These models have M = 0.07–0.15 M_☉ Fe per event.

2.2.7. Chemical and Kinematic Feedback

One possible cause of the observed luminosity–metallicity (L–Z) relation for Local Group dwarf galaxies is SN-driven mass loss from small DM halos (Dekel & Woo 2003). To model this physical mechanism, ChemTreeN tracks mass loss due to SN-driven winds in terms of the number of SNe per time step in a way that takes into account the intrinsic time variability in the star formation rate and rate of SNe from a stochastically sampled IMF. At each time step, a mass of gas

$\begin{equation} M_{\rm lost} = \epsilon _{\rm SN} \sum _{i} \frac{N^{i}_{\rm SN} E^{i}_{\rm SN}}{2v_{\rm circ}^{2}} \end{equation} \tag{ 1 }$

becomes unbound and is removed permanently from the gas reservoir. Here, v_circ is the maximum circular velocity of the halo, N_SN is the number of SNe occurring in a given time step, and E_SN is the energy released by those SNe. The only free parameter, epsilon _SN, expresses the fraction of the SN energy that is converted to kinetic energy retained by the wind as it escapes. The sum over index i sums over all massive stars formed in past time steps that are just undergoing an explosion in the current time step. Note that this approach allows for variations in the number and energy of SNe from time step to time step.

The selective loss of metals that should arise when SNe drive their own ejecta out of the host galaxy is captured by the parameter f_esc, which expresses the increased metallicity of the ejected winds with respect to the ambient interstellar medium. At each time step, a total mass in iron M^Fe_lost is removed from the gas reservoir of the halo:

$\begin{equation} M^{\rm Fe}_{\rm lost} = f_{\rm esc} M_{\rm lost} \frac{M^{\rm Fe}_{\rm ISM}}{M_{\rm gas},} \end{equation} \tag{ 2 }$

where M^Fe_ISM is the total mass of iron in the ambient interstellar medium, M_gas × 10^[Fe/H]. This prescription ensures that, on average, the ejected winds are f_esc times more metal-enriched than the ambient interstellar medium. Alternatively, the fraction of metal mass lost from the halo is f_esc times higher than the total fraction of gas mass lost.

2.2.8. Isochrones and Synthetic Stellar Populations

To compare these model halos to observational data on the real MW and its dwarf satellites, it is necessary to calculate the luminosities and colors of model stellar populations using pre-calculated isochrones and population synthesis models. Each star formation parcel possesses a metallicity, age, and a total initial mass distributed according to the assumed IMF. These three quantities together uniquely specify an isochrone and how it is populated. The models adopt the isochrones of Girardi et al. (2002, 2004) for the UBVRIJHK and Sloan Digital Sky Survey ugriz systems, respectively, as published on the Padova group Web site.⁸ The lowest available metallicity in these isochrones is [Fe/H] = −2.3. Thus, this value is used to represent stellar populations with lower metallicities.

3. STATISTICAL METHODS

As discussed in the introduction, semi-analytic models such as ChemTreeN need to be calibrated by comparing their outputs to an observational data set. It is important to explore the input parameter space as thoroughly as possible to identify regions where good fits to the data can be achieved. The simplest method of exploring the dependence of interesting model outputs on these input parameters would be to run the model over a grid of points, densely sampling the parameter space. Although ChemTreeN can generate a new model within a few minutes, this technique quickly becomes prohibitively expensive as the dimensionality of the input parameter space increases. As an alternative to this approach, we can construct a Gaussian process (GP) emulator (O'Hagan 2006; Oakley & O'Hagan 2002, 2004; Kennedy & O'Hagan 2000), which acts as a statistical model of our computer model, ChemTreeN. An emulator is constructed by conditioning (i.e., constraining fits as described below) a prior GP on a finite set of observations of model output, taken at points dispersed throughout the parameter space. Once the emulator is trained, it can rapidly give predictions for both model outputs and an attendant measure of uncertainty about these outputs at any point in the parameter space. The probability distribution for the model output at all points in parameter space is a very useful feature of GP emulators—simpler interpolation schemes, such as interpolating polynomials, produce an estimate of the model output at a given location in the parameter space with no indication as to how much this value should be trusted. Furthermore, numerical implementations of GP emulators are very computationally efficient (producing output in microseconds rather than minutes), making it feasible to predict vast numbers of model outputs in a short period of time. This ability opens many new doors for the analysis of computer codes which would otherwise require unacceptable amounts of time (Higdon et al. 2008; Bayarri et al. 2002; B10).

3.1. Gaussian Process Model Emulator

We construct an emulator for a model by conditioning a GP prior (see Figure 1) on the training data (Chilès & Delfiner 1999; Cressie 1993; Rasmussen & Williams 2005). A GP is a stochastic process with the property that any finite set of samples drawn at different points of its domain will have a multivariate-normal (MVN) distribution. Samples drawn from a stochastic process will be functions indexed by a continuous variable (such as a position, time or, in our case, a parameter of the model) as opposed to a collection of values as generated by, e.g., a normally distributed random variable. A GP is completely specified in terms of a mean and covariance, both of which can be functions of the indexing variable. The unconditioned draws (solid lines) shown in the left panel of Figure 1 are smooth functions over the domain space labeled x. If enough samples are drawn from the process, then the average of the resulting curves at each point would converge to zero. A posterior distribution function can be obtained by conditioning this process on the training points obtained from the model. This forces samples drawn from the process to always pass through the training points, as shown in the right-hand panel of Figure 1. Repeated draws from the conditioned posterior distribution would on average follow the underlying curve with some variation, shown by the gray confidence regions. These confidence bubbles grow away from the training points where the interpolation is least certain, and contract to zero at the training points where the interpolation is absolutely certain. The posterior distribution can be evaluated to give a mean and variance at any point in the parameter space. We may interpret the mean of the emulator as the predicted value at a point, the variance at this point gives an indication of how close the mean is to the true value of the model. Again, we emphasize that simpler interpolation methods, such as interpolating polynomials or splines, generally do not provide any measure of the accuracy of the method at a given point in parameter space.

To construct an emulator, we need to fully specify our GP by choosing a prior mean and a form for the covariance function. The model parameter space is taken to be p-dimensional. We model the prior mean by linear regression with a desired basis function space h(x); we use h(x) = {1}. We specify a power exponential form for the covariance function with power α ≃ 2, which ensures smoothness of the GP draws,

$\begin{equation} c({\bf x}_i, {\bf x}_j) = \theta _0 \exp \left(-\frac{1}{2} \sum _{k=1}^{p} \left(\frac{x_i^{k} - x_j^{k}}{\theta ^{k}}\right)^{\alpha }\right) + \delta _{ij} \theta _{N}. \end{equation} \tag{ 3 }$

Here, θ₀ is the overall variance, the θ^k set characteristic length scales in each dimension in the parameter space, and θ_N is a small term, usually called a nugget, added to ensure numerical convergence or to model some measurement error in the code output. The shape of the covariance function determines how the correlations between pairs of outputs vary as the distance between them in the parameter space increases. The scales in the covariance function θ^k are estimated from the data using maximum likelihood methods (Rasmussen & Williams 2005). In Figure 2, we demonstrate their influence on an artificial data set. The linear regression model handles large-scale trends of the model under study, and the GP covariance structure captures the residual variations.

**Figure 2.** Demonstration of emulator behavior as a function of correlation length, θ₁. In all panels, the solid blue line shows the mean of the emulator and the solid gray region is a 95% confidence interval around this region. Left panel: fitting with a value of θ₁ that is too small (undersmoothing). Right panel: oversmoothing by using a value of θ₁ that is too large. Central panel: smoothing with a value of θ₁ = 0.68 that was obtained by a maximum likelihood estimation method.
Download figure:
Standard image High-resolution image

Given a set of n design points $\mathcal {D} = \lbrace {\bf x}_1, \ldots , {\bf x}_n\rbrace$ in a p-dimensional parameter space, and the corresponding set of n training values representing the model output at the design locations Y = {y₁, ..., y_n}, the posterior distribution defining our emulator is

$\mathcal {P}({\bf x}, {\bf \theta }) \sim \mbox{GP}(\hat{m}({\bf x}, {\bf \theta }), \hat{\Sigma }({\bf x}, {\bf \theta })),$

for conditional mean $\hat{m}$ and covariance $\hat{\Sigma }$ .

$\begin{eqnarray} \hat{m}({\bf x}) &= {\bf h}({\bf x})^{T}\hat{\beta } + {\bf k}^{T}({\bf x}) \mathbf {C}^{-1}({\bf Y} - {\bf H} \hat{\beta }), \nonumber \\ \hat{\Sigma }({\bf x}_i, {\bf x}_j) &= c({\bf x}_i, {\bf x}_j) + {\bf k}^{T}({\bf x}_i) \mathbf {C}^{-1} {\bf k}({\bf x}_j) + \Gamma (x_i, x_j), \nonumber \\ \mathbf {C}_{ij} &= c({\bf x}_i, {\bf x}_j) \end{eqnarray} \tag{ 4 }$

$\begin{eqnarray} \Gamma (x_i, x_j) &=& ({\bf h}({\bf x_i})^{T} - {\bf k}^{T}({\bf x_i})\mathbf {C}^{-1} {\bf H})^{T} ({\bf H}^{T} \mathbf {C}^{-1} {\bf H})^{-1} \nonumber \\ &&\times ({\bf h}({\bf x_j})^{T} - {\bf k}^{T}({\bf x_j})\mathbf {C}^{-1}{\bf H}), \nonumber \\ {\bf k}({\bf x})^{T} &=& (c({\bf x}_1, {\bf x}), \ldots , c({\bf x}_n, {\bf x})), \end{eqnarray} \tag{ 5 }$

where $\hat{m}({\bf x})$ is the posterior mean at x, $\hat{\Sigma }({\bf x}_i, {\bf x}_j)$ is the posterior covariance between points x_i and x_j, $\mathbf {C}$ is the n × n covariance matrix of the design $\mathcal {D}$ , $\hat{\beta }$ is the maximum likelihood estimated regression coefficients, h is the basis of regression functions, and H is the matrix of these functions evaluated at the training points.

The elements of the vector k(x) are the covariance of an output at x and each element of the training set. It is through this vector k(x) that the emulator "feels out" how correlated an output at x is with the training set and thus how similar the emulated mean should be to the training values at those points. Note that the quantities defined in Equation (4) depend implicitly upon the choice of correlation length scales θ = {θ₀, θ^k, θ_N}, which determine the shape of the covariance function.

The expression for the mean in Equation (4) can be decomposed into a contribution from the prior, and the linear regression model ${\bf h}({\bf x})^{T}\hat{\beta }$ plus a correction applied to the residuals determined by the covariance structure ${\bf k}^{T}({\bf x})\mathbf {C}^{-1}({\bf Y} - {\bf H} \hat{\beta })$ . Similarly, the covariance can be decomposed into a contribution from the prior, and the covariance function c(x_i, x_j) plus corrections arising from the prior covariance structure and the covariance of the new location x through k(x). These terms weight the points x_i, x_j more highly the closer they are to the training points through k. The Γ term gives the corrections to the covariance arising from the regression model.

A Latin Hypercube (LHC) design is used to generate the training locations in the parameter space. This is an efficient design for space-filling in high-dimensional parameter spaces (Sacks et al. 1989; Santner et al. 2003). LHC sampling divides the range of each dimension of the p-dimensional input parameter space into N equal length intervals. N points are then randomly distributed in the resulting grid, satisfying the condition that each row and column are occupied by only one point for each two-dimensional (2D) projection of the design.

3.2. Model to Data Comparison

To compare model output (via the emulator) to experimental data, it is convenient to introduce the notion of implausibility. Following B10, we define an implausibility measure as follows. Consider a model with a single output for which we have generated an emulator with posterior mean $\hat{m}({\bf x})$ and variance $\hat{\Sigma }({\bf x}, {\bf x})$ . The implausibility I(x) of the emulated model output at a point x in the parameter space is determined by

$\begin{equation} I^{2}({\bf x}) = \frac{(\hat{m}({\bf x}) - E[Y_f])^2}{\hat{\Sigma }({\bf x}, {\bf x}) + V[Y_f]}, \end{equation} \tag{ 6 }$

where Y_f represents the distribution of field data that we seek to compare our model against, E[Y_f] is the expected value of Y_f, and V[Y_f] is the observational uncertainties. Large values of I(x_t) indicate that the input parameter vector x_t is unlikely to give a good fit to the observable data. Here, we have only accounted for the variation from the emulator itself and the field data. In the following work, we carry out comparisons of the model output with idealized field data generated from the model itself. As we will show, this can be a very useful exercise in understanding a model and its dependence upon the underlying parameters. It is common practice (introduced by Kennedy & O'Hagan 2001) to introduce a bias or discrepancy term representing the deviation of the model predictions from real observation; this is beyond the scope of the present analysis, in which we compare the model output at different locations in the parameter space to certain selected default values.

The output of ChemTreeN is multivariate—the code produces predictions for many observables, such as the distributions of stellar populations in stellar halos of MW-like galaxies or its satellite galaxy LF and metallicity–luminosity relation. It is possible to compare separately each observable with a model emulator generated from the corresponding model output. However, considerably more powerful conclusions can be drawn by examining the joint properties of the observables and model outputs. Consider a t-dimensional vector of model outputs ${\bf y({\bf x})} = \lbrace y_1, \ldots , y_t \rbrace$ with a corresponding vector of field data Y_f. We extend our training set to be the t × n matrix ${\bf Y} = \lbrace {\bf y({\bf x}_1)}, \ldots , {\bf y({\bf x}_n)} \rbrace$ .

We apply a principal component analysis (PCA; Jollife 2002) decomposition to our training data set Y to obtain a reduced set of r ≪ t that is approximately independent and numerically orthogonal basis vectors that nearly span the t-dimensional output space, discarding terms in the eigen-decomposition which contribute less than 5% of the total variation. We construct individual independent emulators from the training values projected onto each basis. When we wish to evaluate the model output at a new location, we invert this transformation to obtain predictive distributions for each of the t model outputs at a given location in the parameter space. This procedure is detailed in Appendix A.

The implausibility (6) has a natural extension to multivariate outputs (B10, Section 3.7). From the emulator, we obtain a t-dimensional vector of predictions for the model output with means ${\bf \hat{m}}({\bf x})$ . The emulated t × t dimensional covariance matrix $\mathbf {K}({\bf x})$ between the model outputs at the point x in the design space⁹ can also be constructed from the PCA decomposition. With these two quantities, we define the joint implausibility J(x) for observables Y_f with measurement variance V[Y_f] and mean values E[Y_f]:

$\begin{eqnarray} &&J^{2}({\bf x}) = (E[{{\bf Y}_f}]- {\bf \hat{m}}({\bf x}))^{T} \nonumber\\ &&(\mathbf {K}({\bf x}) + I \cdot V[{{\bf Y}_f}] )^{-1} (E[{{\bf Y}_f}]- {\bf \hat{m}}({\bf x})). \end{eqnarray} \tag{ 7 }$

This covariance-weighted combination of the multiple observables gives a reasonable indication of which input values x are predicted by the emulator to lead to model predictions close to the observed values Y_f. The squared implausibility score J²(x) can be thought of as approximately the sum of r squared independent standard normal random variables, hence with approximately a χ²_r distribution, leading in the usual way to confidence intervals for J(x) that induce confidence sets in the input space. In this work, we consider 75% confidence sets for x. In the univariate case, for example, locations x in the parameter space with J(x) < 3 are viewed as "plausible," or likely to give model outputs closely reproducing the observational data, given the experimental and interpolation uncertainties.

The model parameters x and output Y were scaled and centered prior to emulator analysis, which improved numerical convergence in the matrix inversions and maximum likelihood estimation process.

4. EFFECTS OF GALAXY FORMATION HISTORY ON BULK STELLAR PROPERTIES I: GENERALITIES

In this section, we examine how observable quantities of our final Galactic models, such as properties of their dwarf galaxy population, depend on the galaxy's merger history as well as on the model parameters that control bulk properties of their star formation histories. We start by qualitatively comparing the output of models obtained by varying both the merger histories and the model parameters, and then move on to a more rigorous statistical analysis in the following section.

In the left panels of Figure 3, we explore the effects that different parameter values have on MW-like models obtained after coupling ChemTreeN with the merger tree based on simulation MW1. The top panel shows how the LF of the surviving satellites at z = 0 depends on the redshift of the epoch of re-ionization, z_r, the baryon fraction, f_bary, and the fraction of metal lost through SNe winds, f_esc. As previously discussed by T10, for a given f_bary = 0.05, the number of bright satellite galaxies (i.e., M_v < −10) does not strongly depends on z_r. However, we do observe a strong dependence of the number of satellites on z_r at the faint end of the LF. Note that the number of faint dwarfs increases by a factor of ≈2–3 when shifting z_r from 10 to 6.5. This is not surprising, since by moving the epoch of re-ionization to lower redshifts, we allow star formation to take place in small halos (i.e., v_c < 30 km s⁻¹) for extended periods of time. On the other hand, the number of bright satellites does depend on the parameter controlling the amount of gas available to form stars, f_bary. This parameter basically acts as a re-normalization of the satellite galaxy LF, shifting it up or down for higher or lower values of f_bary, respectively. In our models, the escape factor of metals does not have a significant impact on the LF. This is because we have assumed a fixed star formation efficiency, and thus, complicated physical processes that affect a galaxy's star formation history, such as metal cooling, are neglected (e.g., Choi & Nagamine 2009).

It is very interesting to compare these LFs with those obtained from galaxies that have had different merging histories, after fixing the values of all model parameters. Figure 4 shows the fraction of mass within R₂₀₀ at any given redshift of our four MW-like DM halos. Note the range of merger histories spanned by these halos. The LFs obtained after coupling each halo merger tree to ChemTreeN are shown in the top right panel of Figure 3. The values chosen for the input parameters in this analysis are z_r = 6.5, f_bary = 0.05, and f_esc = 80. The number of bright dwarf galaxies is very sensitive to the merger history of the parent halo, though there is a significant halo-to-halo variation (approximately by a factor of two) in the overall number of dwarf galaxies between the four simulations. It is important to note that the scatter observed in these LFs is similar to that previously observed when varying model parameters. This suggests a degeneracy between the effects that different parameters and merger histories have on the observable properties of our MW-like galaxies.

**Figure 4.** Galaxy formation history as shown by the virial mass of the most massive progenitor of our Milky-Way-like dark matter halos as a function of the expansion factor. In all cases, the mass is normalized to the z = 0 mass of the galaxy. Note the range of merger histories spanned by these halos.
Download figure:
Standard image High-resolution image

The situation is different for the luminosity–metallicity (L–Z) relation of the surviving satellite galaxies. We derived each satellite's metallicity by obtaining a model metallicity distribution function from the galaxy's particle-tagged stellar populations and by taking its mean value. The right middle panel of Figure 3 shows that this relation is not sensitive to the merger history of a galaxy. For comparison, we also show the linear fits derived from each data set. This result highlights the importance of scaling relations for this kind of analysis, as they allow us to put constraints on our models independently of the particular galaxy's merger history. In the left middle panel, we explore the dependencies of this relation on model parameters. The parameter with the strongest impact is the escape factor of metals. Note that by decreasing the value of f_escp from 80 to 50, we modify not only the slope of the L–Z relation but also its intercept, increasing the mean metallicity of all satellites. Instead, by varying the baryon fraction, we re-normalize the L–Z relation, keeping its slope constant. Although the mean metallically of a given satellite does not depends on the value chosen for f_bary, its absolute magnitude M_v does. For a smaller (larger) value of f_bary, the whole distribution of satellites is shifted toward fainter (brighter) absolute magnitudes, thus effectively decreasing (increasing) the average satellite's metallicity at a given M_v. We finally note that variations of z_r mainly induce changes in the slope of this relation. A later epoch of re-ionization allows star formation to take place on the smallest galaxies for longer periods of time. As a consequence, faint satellites that were fed by the smallest building blocks have a larger absolute magnitude at z = 0. Note, however, that this mechanism does not strongly affect the brightest satellites (M_v ≲ −12), since accretion of these smallest building blocks does not significantly modify their overall luminosity. This is because they are big enough to accrete gas and keep forming stars after re-ionization.

It is also interesting to compare the stellar halos' metallicity profiles as a function of galactocentric radius obtained from both simulations with different merger histories and model parameters. Note that after the last major merger (LMM) event, most of the star formation in the host galaxy is expected to take place in a dissipationally collapsed, baryon-dominated disk. The formation of this galactic component cannot be properly accounted for in our DM-only N-body simulation. Therefore, for this analysis, we have decided to neglect all star formation episodes that have taken place in the host halo after the redshift of the LMM. The right bottom panel of Figure 3 shows that independently of the merger history, our four halos present a noticeably flat spherically averaged metallicity profile up to ∼150 kpc, with mean metallicities between −1.3 dex and −1.6. dex. This is in agreement with previous studies that have found flat metallicity profiles on halos formed purely through the accretion of a large number of satellites (see, e.g., De Lucia & Helmi 2008; Cooper et al. 2010; Monachesi et al. 2012). Furthermore, recent observations on the halos of M31 (e.g., Richardson et al. 2009; Kalirai et al. 2006), M81 (Monachesi et al. 2012), and the MW (Ma et al. 2012) have shown indications of flat or relatively mild metallicity profiles outside ∼20–30 kpc: the radius at which accretion from satellites is expected to become the dominant contribution as opposed to in situ star formation (Zolotov et al. 2009, 2010; Font et al. 2011b). Note that our results are not in contradiction with the dual stellar halo proposed by Carollo et al. (2007, 2010) since, as previously discussed, in situ star formation is not properly accounted for in these simulations. In addition, Cooper et al. (2010) showed that stronger metallicity gradients can be found in halos with a small number of relatively massive progenitors.

In the left bottom panel of Figure 3, we show that this result is robust to variations of the model's parameters. The parameter that most strongly affect this profile is f_esc since, as discussed above, this controls the mean metallicity of the host halo's building blocks. Nevertheless, variations of this parameter result in final stellar halos with approximately flat metallicity profiles. Note that the scatter induced by varying the model parameters is similar to that obtained by sampling different merger histories.

5. EFFECTS OF GALAXY FORMATION HISTORY ON BULK STELLAR PROPERTIES II: STATISTICAL ANALYSIS

In the previous section, we showed qualitatively that certain observable quantities, commonly used to constrain the input parameter space of our semi-analytic model, depend strongly on the particular merger history of a galaxy. Furthermore, as shown previously by B10 in a different context, simultaneous variations of certain groups of parameters can yield similar observables, revealing important connections among the underlying physical mechanisms. The present discussion highlights the need for robust statistical tools capable of densely exploring the input parameter space, as well as the need for quantitative characterization of the effects that different merger histories have on the present-day distribution of observable quantities. In what follows, we will show how model emulators are well suited to address this kind of problem.

5.1. Parameter Space Exploration

As discussed in Section 3, the first step in constructing a model emulator is to train a suite of GP priors using a finite set of model outputs. These outputs are obtained by running ChemTreeN using different sparsely sampled sets of input parameters drawn from an experimental design $\mathcal {D} = \lbrace {\bf x}_1, \ldots , {\bf x}_n\rbrace$ . For simplicity, we will first take x_i to be a three-component vector, x_i = (zⁱ_r, fⁱ_esc, f_baryⁱ). Within this framework, it is trivial to increase the dimensionality of x_i, but interpreting and visualizing the final implausibility surfaces becomes progressively more challenging. (For examples of this kind of analysis with larger dimensionality, see Appendix B or B10.)

The training set consists of a number n = 200 of design points $\mathcal {D} = \lbrace {\bf x}_1, \ldots , {\bf x}_n\rbrace \subset \mathbb {R}^3$ , drawn using LHC sampling. This gives an acceptable balance between adequate coverage of the input space and acceptable run time. The input parameters are allowed to vary within the ranges specified in Table 2. The full model ChemTreeN is then run at each of these 200 training points. The next step is to select a set of outputs, Y = {y₁, ..., y_t}, for which individual emulators will be constructed. Note that the output selection is determined purely by the set of t observables or field data, Y_f = {y_{f, 1}, ..., y_{f, t}}, chosen. Motivated by our discussion in Section 4, we chose to emulate five values of the satellite galaxy LF, each at a different absolute magnitude, in addition to the slope and the intercept of the satellite galaxy luminosity–metallicity (L–Z) relation. This gives us a total of t = 7 outputs to be extracted from the models. As we show below, each of these outputs most strongly constrains different model parameters. In Figure 5, we show the LFs and L–Z relations extracted from the training set models, obtained after running ChemTreeN on the design points $\mathcal {D}$ . For clarity, only the result of a linear fit to each L–Z relation is shown. Note the great diversity of outputs that can be obtained by varying the input parameter of the model for a fixed merger history. The black dashed lines on the left panel indicate the five values of M_v chosen to sample the LFs.

**Figure 5.** Yellow/red solid lines show the satellite galaxy luminosity functions (left panel) and satellite galaxy luminosity–metallicity relations (right panel) extracted from a set of 200 models used to train our suite of model emulators. For clarity, only the result of a linear fit to each luminosity–metallicity relation is shown. Note the great diversity of outputs that can be obtained by varying the different input parameters. The models were obtained after coupling ChemTreeN with the N-body simulation MW1. The black solid line on both panels indicates the model considered to be the galaxy's "true" observational quantities, obtained after running ChemTreeN with the input parameter vector x_obs. The vertical dashed lines in the left panel indicate the five values of *M_v* chosen to sample the LFs. We sample each L–Z relation by extracting slopes and intercepts from the corresponding linear fits. A different model emulator was constructed for each of these seven model outputs.
Download figure:
Standard image High-resolution image

After training seven model emulators (one for each observable), we compare the model (via the emulators) to the observable data by calculating individual surfaces of implausibility, with level sets of I(x) for each observable. As described in Section 3.2, these three-dimensional surfaces bound the set of input parameter vectors x that are more likely to reproduce the observed data set Y_f. In principle, the observable data should be obtained from the LF and L–Z relation of the MW satellite galaxies. However, to test the constraining power of this approach when the "truth" is known, a particular run of the ChemTreeN model will be used as a mock observable data set. This type of controlled experiment can be very helpful in model performance assessment, since we know exactly what values of the input parameters were used to obtain the artificial "field data;" it also obviates the need for modeling discrepancy. The black solid lines in Figure 5 show the LF and L–Z relation of the model used as the mock observations. The values of the input parameters used are x_obs = (z_r, f_esc, f_bary) = (10, 50, 0.05). As shown by T10, this set of parameters provides a reasonable fit to the actual field data. It is important to note that this input parameter vector is not included among our design points $\mathcal {D}$ .

In Figure 6, we show three different sections of each of the implausibility surfaces obtained from the five model emulators constructed for the LF's outputs. The three-dimensional implausibility surfaces are sliced with three orthogonal planes as defined by the components of x_obs. The top row panels show the f_esc = 50 section of the I(x) surfaces. The black dashed lines indicate the values of the remaining two components of x_obs. Given an input parameter vector x_t, the larger the value of the I(x_t), the less likely is it that a good fit to the observable data can be obtained. From the leftmost panel (i.e., M_v = −16.5), it becomes clear that the parameter controlling the amount of available gas to form stars, f_bary, is strongly constrained by the number of satellite galaxies at the bright end of the satellite galaxy LF. Furthermore, within the range of values considered here, the number of satellites at this M_v is independent of the redshift of the epoch re-ionization, z_r. The most plausible parameter values are near the true value of f_bary = 0.05. As we move toward the faint end of the LF, the model parameter z_r becomes progressively more constrained and the total number of satellite galaxies becomes less dependent on f_bary. For M_v = −3.5 (top rightmost panel), the corresponding model emulator strongly constrains the input parameter space around values of z_r ≈ 10, but it gives equally good fits for nearly all possible values of f_bary. The second row of panels shows sections of the I(x) surfaces at f_bary = 0.05. The satellite galaxy LF appears to be completely independent of the value adopted for the escape factor of metals, f_esc. This reflects the lack of a metallicity dependence on the baryon budget and star formation rate. At the bright end of the LF, any combination of z_r and f_esc would yield an equally good fit to the mock observable data. However, at the faint end, values of z_r ≈ 10 are required to fit the mock data. A similar result can be obtained for the third row of panels showing the remaining sections, i.e., z_r = 10. Again, a good fit to the "observable" data can be obtained with values of f_bary ≈ 0.05 for any possible value of f_esc.

It is possible to put constraints on the parameter f_esc by exploring cross-sections of the implausibility surfaces constructed for the satellite galaxy luminosity–metallicity relation's slope and intercept model emulators. The middle and bottom panels of Figure 7 show the sections defined by f_bary = 0.05 and z_r = 10, respectively. Comparison with Figure 6 reveals implausibility surfaces with a more complex topology. Although both emulators present regions of low implausibility for a wide range of f_esc values, these regions are strongly associated with z_r and f_bary. These two parameters are also strongly constrained by the slope and intercept of the L–Z relation, as shown in the top row panels.

**Figure 7.** As in Figure 6, for the two model emulators constructed for the L–Z relation. The left and right panels show sections of the implausibility surfaces associated with the slope and the intercept, respectively. Note that these model emulators can provide strong constrains to the model parameter f_esc.
Download figure:
Standard image High-resolution image

Individually, none of the previously explored implausibility surfaces can constrain the full parameter space. This is not the case with the joint implausibility measure J(x), which combines the information obtained from the seven model emulators into one quantity.¹⁰ Figure 8 shows different iso-implausibility surfaces of the resulting J(x). Note that as the value of J(x) decreases, the volume enclosed by each iso-surface becomes smaller, converging toward the values associated with x_obs, as shown by the red solid lines. This can be more clearly appreciated in Figure 9. Each row of panels shows different sections J(x) as we traverse one of the three possible dimensions. The black solid line on the color bars show the cutoff applied to the joint implausibility. A value of J(x) > 3 (in t = 7 dimensions) indicates that it is implausible to obtain a good fit to the observed data with the corresponding values of the model parameters. Thus, regions of the parameter space lying above this threshold can be disregarded. We find that J(x) strongly constrains the full parameter space under study. Furthermore, the values of the components of x_obs are located in the most plausible regions of the space, as indicated by the star symbols in the corresponding panels.

**Figure 8.** Iso-implausibility surfaces extracted from the joint implausibility measure J(x). Redder colors indicate larger values of J(x). As the value of J(x) decreases, the volume enclosed by each iso-surface becomes smaller, converging toward the values associated with x_obs, as shown by the red solid lines. The region of lowest implausibility (and thus highest plausibility) is shown by the opaque blue volume at the intersection of the red lines.
Download figure:
Standard image High-resolution image

**Figure 9.** Different sections of the joint implausibility surface, J(x), obtained by combining information provided by the seven model emulators shown in Figures 6 and 7. The different colors show different values of J(x) in logarithmic scale. Model emulators are compared to the mock observable data obtained after running ChemTreeN with the input parameter vector x_obs. Both the mock observables and the training data set are obtained by coupling ChemTreeN with the N-body simulation MW1. The top, middle, and bottom row panels show different sections of constant f_esc, f_bary, and *z_r*, respectively. On each row, the black dashed lines indicate the values of two of the components of x_obs. If the three components are simultaneously present in a section, then the location of x_obs is indicated with a blue star. The horizontal black solid line on the color bars indicates the imposed threshold: a value above this threshold shows that it is very implausible to obtain a good fit to the observed data with the corresponding values of the model parameters. Note that J(x) can strongly constrain the full parameter space under study. Note as well that the values of the components of x_obs are located in the most plausible regions of the space.
Download figure:
Standard image High-resolution image

As described by B10, it is possible to further constrain the input parameters by performing an iterative approach. The idea involves isolating the region below the chosen implausibility threshold and generating a new training set using design points located within these regions. The kind of "controlled" experiments performed here, in addition to the low dimensionality of our previous example, renders this approach unnecessary. Thus, we refer the reader to B10 for thorough description of this procedure.

5.2. Impact of Merger Histories

In the previous subsection, we showed that it is possible to recover the set of input parameter chosen to create a mock MW-like observational data set using a suite of model emulators. In this "controlled" experiment, both training and mock observable data were obtained by coupling ChemTreeN to a merger tree obtained from a single simulation. Therefore, we have implicitly assumed that the exact merger history of our MW-like galaxy is a known variable of the problem. In reality, this merger history is poorly known, and could be regarded as an extra input parameter of the model. It is thus important to study how different merger histories can compromise our ability to meaningfully constrain the input parameter space. To this end, we perform the following set of controlled experiments. Using the merger trees extracted from the four available N-body simulations, we generate four different training sets (each training set containing n = 200 design points) and construct, for each set, the suite of model emulators discussed previously. Hereafter, we will refer to these emulators' sets as "MWi-emulators," with i = 1, 2, 3, and 4. The input parameter vector x_obs = (z_r, f_esc, f_bary) = (10, 50, 0.05) is used to obtain a mock observational data set from each merger tree. We will refer to these mock observables as "MWi-observables." We then ask the following question:

Is it possible to recover the input parameter vector, x_obs, if we use training data obtained from a merger tree different than that used to generate the mock observables?

In Figure 10, we show the outcome of this experiment. Each block of four panels shows sections of the joint implausibility surface obtained after comparing given MWi-observables with the four MWi-emulators. The merger tree used to generate the MWi-observables in each block is indicated with a green label and marked "OBS." For example, the top left corner shows comparison results using MW1-observables. As in Figure 9, when the model emulators are trained on the same merger tree used to generate the mock observables, we can successfully constrain the input parameter space and recover the components of x_obs. However, when model emulators constructed on different merger trees are considered, the most plausible regions are located around values of f_bary much larger than those used to obtain the mock observables. This is not surprising since, as shown in Figure 3, MW1 is the MW-like halo that contains the largest number of satellites at all M_v. To achieve a good fit to MW1-observables in the remaining simulations requires a larger amount of gas to form stars. Note as well that the joint implausibility surface obtained with the MW3-emulators never falls significantly below the chosen threshold, so MW1-observables cannot be reproduced using the merger history extracted from halo MW3. Another interesting example is shown in the lower right panels of Figure 10, where we consider MW4-observables. Very good fits to these observables can be obtained for either larger (MW3-emulators) or smaller (MW2-emulators) values of f_bary than that used to generate the mock observables. A similar situation is observed for the input parameter z_r. Note that we have only considered the f_esc = 50 section of each joint implausibility surface. As described in Section 4, the satellite galaxy luminosity–metallicity relation is not sensitive to the merger history of the host halo. The parameter f_esc is therefore well constrained in all cases. This forces the J(x) surfaces to have the most plausible regions in the vicinity of the aforementioned section of J(x).

The previous analysis clearly shows how a particular merger history can influence the model parameter selection: similarly good fits to a given set of observables can be obtained with different model parameter values simply by modifying the host's merger history. In our experiments, these values may differ from those used to generate the mock observables. When comparing with real observational data, a given set of the best-fitting parameter's values may be significantly different from the values that could best parameterize the desired underlying physical processes. This in turn may have important implications on other observable quantities that we would like to study and which have not been used for model parameter selection.

5.3. Model Comparison

The previous analysis illustrated the importance that a host's merger history has on model parameter estimation. For a given set of mock observables, locations of low-joint implausibility regions may significantly vary simply by changing the host merger history. However, it remains to be explored whether differences in the resulting joint implausibility surfaces obtained with, e.g., MW4-observables (bottom right panels of Figure 10) could allow us to isolate a "most likely" merger history for this mock data. Is it possible to constrain the host merger history by statistically comparing these surfaces?

This question can be addressed within a Bayesian framework for model quality assessment. The posterior probability for a model M_i, given an observable data set Y_f, is given by the Bayes's theorem (Gregory 2005),

$\begin{equation} P(M_i\mid Y_{f}) = \frac{P(Y_{f}\mid M_i) P(M_i)}{P(Y_{f})}, \end{equation} \tag{ 8 }$

where the denominator P(Y_f) = ∑⁴_{j = 1}P(Y_f∣M_j)P(M_j) is the marginal probability density of Y_F, and P(Y_f∣M_i) is the probability density of Y_F under model M_i, marginalized over the model parameters, x, representing the probability density for the observable Y_f under the assumption of model M_i. In this analysis, a model M_i represents the coupling of ChemTreeN with a given cosmological simulation. In model selection problems, it is often more useful to consider the ratio of the probability of two models rather than their probabilities directly. The quantity

$\begin{equation} O_{ij} = \frac{P(M_i\mid Y_{f})}{P(M_j\mid Y_{f})} = \frac{P(Y_{f}\mid M_i)}{P(Y_{f}\mid M_j)} \frac{P(M_i)}{ P(M_j)} \end{equation} \tag{ 9 }$

is known as the odds ratio in favor of model M_i over model M_j. The first factor on the right side of Equation (9) is known as the Bayes factor,

$\begin{equation} B_{ij} =\frac{P(Y_{f}\mid M_i)}{P(Y_{f}\mid M_j)} = \frac{\int P({\bf x}\mid M_i) P(Y_{f}\mid {\bf x},M_i)\, d{\bf x}}{\int P({\bf x}\mid M_j) P(Y_{f}\mid {\bf x},M_j)\, d{\bf x}}. \end{equation} \tag{ 10 }$

If we assign equal prior probabilities for all models, then the odds ratios and Bayes factors coincide:

$\begin{equation*} O_{ij} =B_{ij}. \end{equation*}$

In our problem, we approximate B_ij by calculating the likelihoods, $\mathcal {L} = P(Y_{f}\mid {\bf x},M_i)$ , under an MVN approximation from the corresponding joint implausibility, i.e., $\mathcal {L} \propto \int e^{-J_{i}({\bf x})^2 / 2} d{\bf x}$ , with uniform priors P(x∣M_i). Here, J_i(x) is obtained using MWi-emulators.

While the use of Bayes factors to compare and select among models is well established, a variety of scales have been proposed to facilitate their quantitative interpretation (Fadely & Keeton 2012). We will employ Jeffreys' scale (Jeffreys 1998), presented in Table 3. According to this scale, values of log₁₀B_ij between 0 and 0.5 indicate that models M_i and M_j are equally supported by the observable data set Y_f. Conversely, a value larger than 2 gives decisive evidence for model M_i against model M_j, given Y_f.

Table 3. Jeffreys' Scale for Grading the Significance Associated with Different Ranges of the Logarithmic Bayes Factor

log₁₀B_ij	Significance
0–0.5	Barely worth mentioning
0.5–1	Substantial
1–1.5	Strong
1.5–2	Very strong
>2	Decisive

Download table as: ASCII Typeset image

In this work, we are dealing not only with four different models, but also with four different mock observable data sets. From now on, we will refer to B^k_ij as the Bayes factor obtained comparing models M_i and M_j using MWk-observables. The results of this analysis are shown in Figure 11 in the form of a matrix. The elements of this matrix show, color coded, the values of the different Bayes factors. Note that the matrix is divided into four blocks, each one of them associated with a different MWk-observable. For comparison, the panels are distributed as in Figure 10. For example, the first block on the top left corner shows the probability ratios of model M₁ and M_j, with j = 1, 2, 3, and 4, using MW1-observables. The first element of this block is, not surprisingly, B¹₁₁ = 0, as we are comparing model M₁ with itself. It is interesting however to observe that on the basis of MW1-observables, models M₁ and M₂ are equally plausible, as indicated by B¹₁₂. Comparison with Figure 10 (top left panels) shows that the combination of model M₂ and MW1-observables yields plausible values of f_bary that are much larger than that used to generate the mock observables (i.e., f_bary = 0.05). A similar situation arises when considering MW4-observables. In this case, models M₂, M₃, and M₄ are all equally supported by the mock observable data. However, the corresponding joint implausibility surface's sections, shown on the bottom left panels' block of Figure 10, suggest that the most plausible values of the input parameter for models M₂ and M₃ differ significantly from the "true" values used to generate the MW4-observables.

**Figure 11.** Values of the log₁₀ Bayes factors obtained by comparing two different models, after fixing a given mock observable data set. The values are presented in logarithmic scale. Each element *B^k_ij* shows the result of comparing models *M_i* and *M_j* using MWk-observables. The matrix is divided in four block, each of them associated with a different MWk-observable. For comparison, the panels are distributed as in Figure 10. The values of the different Bayes factor are color coded according to the Jeffery's scale, introduced in Table 3. Each element log₁₀*Bⁱ_ii* = 0, since model *M_i* is being compared to itself. More than one model can be equally well supported by a given mock observable data, as shown by the multiple white and light gray elements of each matrix's blocks. Note that the "best-fitting" parameters extracted from each model may differ significantly across data sets (see Figure 10).
Download figure:
Standard image High-resolution image

This analysis has shown us that multiple models with different merger tree histories can be equally supported by a given observational data set. The "best-fitting" parameters extracted from these models are, however, significantly different from each other. On the basis of this analysis, it is therefore not possible to assess what set of parameters would be best suited to model the formation of the stellar halo of the MW.

6. DISCUSSION AND CONCLUSIONS

In this work, we combined modern statistical methods with cosmological simulations of the formation of the satellite galaxy population and stellar halo of MW-like galaxies to explore how one might meaningfully constrain a galaxy's formation history using observables related to these components. The semi-analytic galaxy formation model ChemTreeN, coupled to merger trees extracted from four cosmological N-body simulations, is used to track the evolution of the baryonic properties of MW-like galaxies and their satellite populations. GP model emulators are used to emulate the outputs of ChemTreeN at any location of its p-dimensional input parameter space. This allows us to explore the full input parameter space orders of magnitude faster than could be done with the ChemTreeN code itself. To compare observational data to the model using these emulators, we introduce the joint implausibility measure (see also B10). We show that by using this quantity as a measurement of correctness, we can successfully recover the input parameter vector used to generate mock observational data sets (namely, the satellite galaxy LF of our chosen MW-like DM halos and the slope and normalization of the satellite galaxy luminosity–metallicity relation).

From our analysis, we draw the three following main conclusions.

1.
Using our GP model emulator, we have explored the dependence of a variety of observational quantities on model parameters. In some cases, this highlights straightforward relationships between observational quantities and model parameters: for example, the number of low-luminosity dwarf galaxies depends very sensitively on the redshift of re-ionization, but is insensitive to other model parameters. On the other hand, the number of high-luminosity dwarfs is insensitive to the redshift of re-ionization, but sensitive to the reservoir of gas available for star formation. For simplicity, throughout this work we have focused our analysis on a three-dimensional input parameter space. However, as shown in Appendix B and B10, the method can be extended in a straightforward manner to a much higher-dimensional parameter space. Thus, techniques like the one presented here are well suited not only to explore the input parameters of complex models, but also to expose and visualize the nonlinear coupling between the parameterization of the different physical processes being considered.
2.
The detailed merger history of a galaxy can very sensitively affect the outcome of the procedure for finding the "best-fitting" parameters to a given mock observational data set. Similarly, good fits can be obtained with different parameter values simply by modifying the host galaxy's merger history. More importantly, in our experiments, these values may strongly differ from those used to generate the mock observables. This is true even when the input N-body simulations have been carefully screened to all have MW-like masses and times of LMMs. When comparing with real observational data, the resulting best-fitting parameter values may differ significantly from the values that could best parameterize the desired underlying physical processes. This in turn may have important implications on other observable quantities that we would like to study and which have not been used for model parameter selection.
3.
Using a Bayesian framework for model quality assessment, we have explored whether it is possible to identify (or at least constrain) the merger history of the galaxy from which observables have been obtained. This exercise showed that multiple models with different merger histories can be equally supported by a given observational data set. The "best-fitting" parameters extracted from these models do, however, differ significantly from each other. On the basis of this analysis, it is therefore not possible to assess what set of parameters would be best suited to model the formation of the MW's satellite galaxy population and its stellar halo.

The results presented in this work highlight the dangers of selecting model parameters based on simulations and observations of individual objects such as the MW. A way to partially circumvent this problem is to fix the values of a subset of parameters such that it is possible to reproduce properties of the galaxy population in the local universe as well as at higher redshift. This approach has been successfully adopted by several authors to study the formation of the MW stellar halo and characterize its satellite population (see, e.g., De Lucia & Helmi 2008; Li et al. 2010; Cooper et al. 2010; Font et al. 2011a). It is however not surprising that important physical process involved in the evolution of, e.g., the faintest MW dwarf galaxies (see, e.g., Willman et al. 2005; Belokurov et al. 2009, 2007; Walsh et al. 2007) are poorly constrained by the properties of the much larger-scale galaxy population. Statistical methods such as the one presented here are very well suited to visualize the interplay between different physical mechanism and thus can be used to further constrain the parameters that are more relevant to the evolution of dwarf galaxies.

We have seen that different models of the MW stellar halo, each with a different formation history, can be equally supported by a given observational data set even though the values extracted for the best-fitting parameter may significantly differ. This seems to compromise the possibility of constraining the formation history of the MW through this kind of analysis. Note, however, that our results are based on merger trees extracted from only four cosmological simulations. Therefore, it remains to be determined whether statistical constraints on the MW's merger history could be obtained by analyzing a large set of high-resolution cosmological simulations densely sampling full range of possible formation histories. We defer this analysis to a later paper.

F.A.G., B.W.O., C.C.S., and R.L.W. are supported through the NSF Office of Cyberinfrastructure by grant PHY-0941373. F.A.G. and B.W.O. are supported in part by the Michigan State University Institute for Cyber-Enabled Research (iCER). B.W.O. was supported in part by the Department of Energy through the Los Alamos National Laboratory Institute for Geophysics and Planetary Physics and by NSF grant PHY 08-22648: Physics Frontiers Center/Joint Institute for Nuclear Astrophysics (JINA). R.L.W. is also supported by in part by NSF grant DMS–0757549 and by NASA grant NNX09AK60G.

APPENDIX A: MULTIVARITE EMULATOR FORMALISM

Here, we outline the details of constructing an emulator for models with multidimensional output y = {y₁, ..., y_t}. In theory, each component could be considered independent of the others and emulated separately, but as we have shown above by examining the implausibilities I(x), this does not usually give satisfactory results. PCA is used to generate an orthogonal decomposition of the output vector y. The resulting basis is orthogonal and approximately independent. We retain the r eigenfunctions forming a subset of the full decomposition responsible for 95% of the observed variation. The set of training observations Y = {y₁, ..., y_n} is then projected into this new orthogonal basis and independent emulators are generated for the weights of the r basis functions. Given our set of n observations Y = {y₁, ..., y_n}, we suppose that ${\bf y}_i \sim \mathrm{MVN}({\bf {\mu }},\mathbf {\Sigma }),$ where

$\begin{eqnarray} \mu \approx \hat{\mu }_j &= \frac{1}{n}\sum _{i=1}^{n}(y_i)_j, \end{eqnarray} \tag{ A1 }$

$\begin{eqnarray} \mathbf {\Sigma }\approx \mathbf {\hat{\Sigma }} &= \frac{1}{n}\sum _{i=1}^{n} [({\bf y}_i - \hat{\mu }_i)^{T} ({\bf y}_i - \hat{\mu }_i)] \end{eqnarray} \tag{ A2 }$

are the sample mean vector (of length t) and sample covariance matrix (t × t), respectively. An eigen-decomposition of the sample covariance matrix $\hat{\Sigma }$ ,

$\begin{equation} \hat{\Sigma } = \mathbf {U} \mathbf {\Lambda } \mathbf {U}^{t}, \end{equation} \tag{ A3 }$

determines the PCA decomposition. The columns of the matrix $\mathbf {U}$ are the eigenvectors of $\hat{\Sigma }$ , while $\mathbf {\Lambda }$ is diagonal with the corresponding eigenvalues in decreasing order. The eigenvectors define the new orthogonal basis and the eigenvalues determine the weights of each basis function.

The sum of the eigenvalues or the trace of $\mathbf {\Lambda }$ is also the trace of $\hat{\Sigma }$ , the sum of the sample variances of Y, and so each eigenvalue represents the variance contribution of its associated eigenfunction to the observed total. The first eigenvector gives the direction in which the data Y are most variable, and the remaining eigenvectors correspond to orthogonal directions with successively smaller amounts of variation. Typically, a small number r ≪ t of the largest eigenvalues corresponding to the most variable combinations of the observables y are sufficient to describe nearly all of the variation of the full data set. PCA can be used for dimension reduction by keeping only the r most influential components, and by regarding the remaining t − r dimensions as noise and discarding them,

$\begin{equation} \mathbf {\hat{\Sigma }} = \mathbf {U} \mathbf {\Lambda } \mathbf {U}^{t} \approx \mathbf {U}_{r} \mathbf {\Lambda }_{r} \mathbf {U}_{r}^T. \end{equation} \tag{ A4 }$

The variability fraction $V(r)=\sum _{i=1}^{r} \lambda _i / \mathrm{Tr} (\mathbf {\Lambda })$ accounted for by the first r components can be used to help select r; in this analysis, we chose the smallest r for which V(r) ⩾ 0.95. Typically, with t = 7, we could attain this with r = 5.

After carrying out the eigen-decomposition and selecting an appropriate r, we project our training set Y' into the associated PCA basis as

$\begin{equation} \mathbf {Z} = \frac{1}{\sqrt{\Lambda _r}}\mathbf {U}_r^{T} ({\bf Y} - {\bf \hat{\mu }}), \end{equation} \tag{ A5 }$

an r × n matrix of the projections of the training data projected into the r-independent and orthogonal components of the PCA decomposition. Note that $\mathbf {Z}$ is standardized to have mean zero and unit covariance. Now we can construct r-independent emulators, each one trained on a component of Z.

Each orthogonal emulator is constructed as outlined in Section 3, giving a posterior mean m_z(x) and variance Σ_z(x) (Equation (4)). The mean and variance are functions of the location in the design space x. It is important to remember that they are functions which give output in the PC space. To obtain predictions for the model outputs Y, we need to undo the PCA decomposition by reprojecting back into the natural observable space. Naturally, by selecting r < t, we have lost some of the original information, however, with judicious choice of r, this is usually not a serious issue.

The projected mean ${\bf m({\bf x})}$ , a vector of length t, in the Y space is given as

$\begin{equation} {\bf m}({\bf x}) \approx {\bf \hat{\mu }} + \mathbf {U}_r \Lambda _r^{1/2} {\bf m}_z({\bf x}), \end{equation} \tag{ A6 }$

where m_z(x) is the length r vector of emulator means in the PC space. We label the emulated covariance of the lth and jth model observables at the location x_i as K_lj(x_i), which is

$\begin{eqnarray} K_{lj}({\bf x}_i) &=& \mbox{Cov}[y_l({\bf x}_i),y_j({\bf x}_i)] \nonumber\\ & \approx & \sum _{\alpha ,\beta ,\gamma =1}^r U_{l\alpha }\Lambda ^{1/2}_{\alpha \beta }U_{j\gamma }\Lambda ^{1/2}_{\gamma \beta } \mbox{Var}[Z_{\beta }({\bf x}_i)], \end{eqnarray} \tag{ A7 }$

where Var[Z_β(x_i)] is the emulated variance of the βth PC dimension at the location x_i in the parameter space. This gives a useful estimate of the covariance between our y observables at as yet untried input locations, and the t × t matrix $\mathbf {K}$ is crucial for the joint implausibility J(x).

APPENDIX B: FIVE-DIMENSIONAL PARAMETER SPACE EXPLORATION

As discussed in Section 5.1, model emulators can be easily generalized to deal with input parameter spaces of much larger dimensionality than previously considered. In this appendix, we demonstrate this by training a suite of model emulators in a five-dimensional input parameter space. The dimensionality is increased by including in the analysis the star formation efficiency, epsilon _*, and the Type II SNe yield, m^II_Fe. The training data set is constructed by running ChemTreeN on a design $\mathcal {D} = \lbrace {\bf x_1}, \ldots , {\bf x_n}\rbrace$ , containing a number n = 240 design points. Note the larger number of design points with respect to previous examples, intended to better sample the larger volume of this input parameter space. A mock observational data set is generated by considering the input parameter vector x_obs = (z_r, f_esc, f_bary, epsilon _*, m^II_Fe) = (10, 50, 0.05, 10⁻¹⁰, 0.07). Both the mock observables and the training data set are obtained by coupling ChemTreeN with the merger tree extracted from the DM-only N-body simulation MW1. The results are shown on Figure 12. Each panel shows a different section of the resulting joint implausibility surface. These 2D sections are obtained after fixing the remaining three parameters to the values associated with x_obs. The black solid line on the color bars show the threshold applied to the joint implausibility. Values above this threshold indicate that it is very implausible to obtain a good fit to the mock observational data using the corresponding values of the model parameters. As observed in the three-dimensional example shown in Figure 9, J(x_obs) can strongly constrain the full parameter space under study. Even in this higher-dimensional space, the values of the components of x_obs are always located in the most plausible regions of the space, as indicated by the black dashed lines.

**Figure 12.** Different sections of the joint implausibility surface, J(x), obtained after training model emulators in a five-dimensional input parameter space. The mock observable data are generated by feeding ChemTreeN with the input parameter vector x_obs = (z_r, f_esc, f_bary, _*, m^II_Fe) = (10, 50, 0.05, 10⁻¹⁰, 0.07). Each 2D section is obtained after fixing three out of the five parameters to the values associated with the component of x_obs. The black dashed lines indicate the values of the remaining two components. The different colors show different values of J(x) in logarithmic scale. The horizontal black solid line on the color bars indicates the imposed threshold: a value above this threshold shows that it is very implausible to obtain a good fit to the observed data with the corresponding values of the model parameters. Note that J(x) can strongly constrain the five-dimensional input parameter space under study. Note as well that the values of the components of x_obs are always located in the most plausible regions of the space.
Download figure:
Standard image High-resolution image

epsilon — **Figure 12.** Different sections of the joint implausibility surface, J(x), obtained after training model emulators in a five-dimensional input parameter space. The mock observable data are generated by feeding ChemTreeN with the input parameter vector x_obs = (z_r, f_esc, f_bary, _*, m^II_Fe) = (10, 50, 0.05, 10⁻¹⁰, 0.07). Each 2D section is obtained after fixing three out of the five parameters to the values associated with the component of x_obs. The black dashed lines indicate the values of the remaining two components. The different colors show different values of J(x) in logarithmic scale. The horizontal black solid line on the color bars indicates the imposed threshold: a value above this threshold shows that it is very implausible to obtain a good fit to the observed data with the corresponding values of the model parameters. Note that J(x) can strongly constrain the five-dimensional input parameter space under study. Note as well that the values of the components of x_obs are always located in the most plausible regions of the space.
Download figure:
Standard image High-resolution image

CHARACTERIZING THE FORMATION HISTORY OF MILKY WAY LIKE STELLAR HALOS WITH MODEL EMULATORS

Article metrics

Permissions

Author affiliations

Dates

ABSTRACT

1. INTRODUCTION