
ELUCID—Exploring the Local Universe with the reConstructed Initial Density Field. II. Reconstruction Diagnostics, Applied to Numerical Halo Catalogs


Published 2017 May 23 © 2017. The American Astronomical Society. All rights reserved.
Citation: Dylan Tweed et al 2017 ApJ 841 55. DOI: 10.3847/1538-4357/aa6bf8


Abstract

The ELUCID project aims to build a series of realistic cosmological simulations that reproduce the spatial and mass distributions of the galaxies observed in the Sloan Digital Sky Survey. This requires powerful reconstruction techniques to create constrained initial conditions (ICs). We test the reconstruction method by applying it to several N-body simulations. We use two medium-resolution simulations, from each of which three additional constrained N-body simulations were produced. We compare the resulting friends-of-friends catalogs by using the particle indexes as tracers, and quantify the quality of the reconstruction as the main smoothing parameter is varied. The cross-identification method we use proves to be efficient, and the results suggest that the most massive reconstructed halos are effectively traced from the same Lagrangian regions in the ICs. A preliminary time-dependence analysis indicates that halos at the high-mass end converge only at a redshift close to the reconstruction redshift. This suggests that, for earlier snapshots, only collections of progenitors may be effectively cross-identified.


1. Introduction

One of the key aspects of modern cosmology is interpreting the distribution and properties of galaxies in the sky. The diversity of galaxy properties is rooted in the galaxies' histories and environments.

The formation and evolution of galaxies is generally understood within the ΛCDM paradigm. Matter is dominated by a dark component: dark matter, which is subject only to gravitational interactions. Galaxies are understood to reside in dark matter halos, which act as gravitational potential wells. Halos follow a pattern of hierarchical structure formation, where smaller structures form first and merge to build ever larger structures. The gas is bound to these structures, and as they build up, baryonic processes take place, leading to the formation of stars in galaxies that further evolve and interact within those halos.

Structure formation is thus understood as a nonlinear process that cannot be fully described analytically. Modeling these structures requires specific numerical methods: N-body simulations. These codes focus on the dark matter component, and successfully describe the formation of structures by implementing gravitational interactions on large scales. Dark matter is represented numerically by data points (otherwise referred to as particles), each tracing a mass element corresponding to a volume element of the early, nearly homogeneous universe.

Over the decades, such simulations have been widely used, with little variation in terms of physics. Combined with ever-growing computing resources and vast improvements in implementation, this has allowed larger volumes to be explored at increasing resolution.

The distributions and properties of galaxies are observed with ever-increasing accuracy, but little information concerning the dominant dark component can be inferred. Nevertheless, various methods exist for exploring the formation of galaxies within dark matter halos. The most direct method is to numerically solve the evolution of the baryonic component on top of the dark matter component. This method requires more complex implementations than N-body simulations, with gas described either as (i) numerical data points with associated density (smoothed particle hydrodynamics: Springel et al. 2001b; Wadsley et al. 2004; Springel 2005), (ii) grid cells fixed in the volume (cells are refined and unrefined as required to explore high gas density while neglecting low-density regions; nested grids or Adaptive Mesh Refinement: Kravtsov et al. 1997; Teyssier 2002; Kravtsov 2003), or (iii) moving cells (the gas element is associated with a numerical point within a volume defined from the distribution of nearby mesh points through Voronoi tessellation; Springel 2010). Less computationally intensive methods involve applying models on the scale of the halo itself, where evolution can be traced with merger trees. These semi-analytical models (SAMs) have been successfully applied to halo catalogs extracted from N-body simulations (White & Frenk 1991; Kauffmann et al. 1993; Somerville & Primack 1999; Cole et al. 2000; van den Bosch 2002; Hatton et al. 2003; Kang et al. 2005; Baugh 2006; Croton et al. 2006).

Both types of methods are used to create mock galaxy catalogs. These catalogs are usually compared with observed ones, and have been quite successful at reproducing statistical features such as two-point correlation functions, luminosity functions, color distributions, and star formation rates. Both sub-grid models (applied to gas elements) and semi-analytical models (applied to halos) require the determination of a large number of parameters, which can lead to degeneracies.

Even more simplified approaches are halo occupation distribution (HOD) models, where observed galaxies are assigned to halos by matching the halo mass and stellar mass functions (Jing et al. 1998, 2002; Berlind & Weinberg 2002; Bullock et al. 2002; Scranton 2002; van den Bosch et al. 2003; Yang et al. 2003; Behroozi et al. 2010; Rodríguez-Puebla et al. 2015). Such methods have been successful at defining the stellar mass–halo mass relation, but the scatter of this relation remains uncertain and difficult to interpret. Stronger constraints can be obtained by comparing specific galaxies, their environments, and the intergalactic medium. To achieve this, one needs one or several simulations that match the distribution of galaxies as seen in the sky.

In order to understand how an N-body simulation can reproduce a given distribution, one must have a basic understanding of how a cosmological N-body code works. The distribution of matter (here dark matter only) is represented by numerical data points (particles) describing the mass and volume elements of this matter. Since dark matter responds to gravity alone, the motion of these elements is governed by the gravitational potential, which obeys the Poisson equation. A cosmological N-body code is essentially a Poisson solver, with various optimizations rendering the computation feasible. An N-body simulation starts at an early redshift (∼100) from a near-uniform distribution of particles on a grid. The distribution cannot be perfectly uniform in position and velocity, since the distribution would then remain unchanged. Obtaining a realistic simulation requires applying small shifts in both positions and velocities to the initial grid nodes. This displacement field constitutes the initial conditions (ICs); it is usually generated at random from the initial power spectrum for a given cosmology. From the same set of ICs, any N-body code would produce the same density field. However, the formation and evolution of large-scale structures is a nonlinear process, so the only way to derive the final matter distribution is to run a numerical simulation.
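To make the notion of a displacement field concrete, here is a deliberately simplified Python sketch (our illustration, not the ELUCID procedure; normalization constants are glossed over, and the function name and `pk` argument are hypothetical). It draws a Gaussian random field from a power spectrum and displaces grid particles with the Zel'dovich approximation discussed in the next section:

```python
import numpy as np

def toy_zeldovich_ics(n, box, pk, seed=42):
    """Draw a Gaussian overdensity field with power spectrum pk(k), then
    shift particles off a uniform grid using the Zel'dovich approximation.
    Purely schematic: normalizations are glossed over."""
    rng = np.random.default_rng(seed)
    freq = 2.0 * np.pi * np.fft.fftfreq(n, d=box / n)   # wavenumbers
    kx, ky, kz = np.meshgrid(freq, freq, freq, indexing="ij")
    k2 = kx**2 + ky**2 + kz**2
    k2[0, 0, 0] = 1.0              # avoid 0/0; the DC mode is zeroed below
    k = np.sqrt(k2)

    # White noise FFT'd from real space is Hermitian by construction, so
    # rescaling by sqrt(P(k)) keeps the inverse transforms real-valued.
    delta_k = np.fft.fftn(rng.normal(size=(n, n, n))) * np.sqrt(pk(k))
    delta_k[0, 0, 0] = 0.0

    # Zel'dovich displacement: psi_k = i k delta_k / k^2, so that
    # div(psi) = -delta to linear order. (In this approximation the
    # initial velocities are proportional to the displacements.)
    psi = [np.fft.ifftn(1j * ki * delta_k / k2).real for ki in (kx, ky, kz)]

    cell = box / n
    grid = np.indices((n, n, n)) * cell                 # unperturbed nodes
    pos = [(g + p) % box for g, p in zip(grid, psi)]    # periodic wrap
    return np.stack(pos, axis=-1).reshape(-1, 3)

# Example: 64^3 particles in a 300 Mpc/h box with a toy power law.
# positions = toy_zeldovich_ics(64, 300.0, lambda k: 1e3 * k**-2.5)
```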

Reconstructing a particular density field with an N-body simulation consists of finding, among all possible sets of ICs, the one that produces the closest match. It is not realistic to create an endless number of trial sets of ICs, run an N-body simulation from each set, and choose the optimal one. The solution is to use other numerical techniques to reverse engineer the displacement field (which characterizes a set of ICs) from the target density field.

Since the pioneering work of Hoffman & Ribak (1991) and Nusser & Dekel (1992), many methods have been developed to reconstruct the ICs of the local universe. These methods use either redshift surveys of galaxies (Heß et al. 2013; Wang et al. 2016) or radial peculiar velocities (Kravtsov et al. 2002; Klypin et al. 2003; Gottloeber et al. 2010; Sorce et al. 2016) of low-redshift galaxies as tracers of the cosmic density field at the present day. In order to derive ICs from these low-redshift tracers, several dynamical models are adopted in the literature: (i) linear theory (Hoffman & Ribak 1991; Gottloeber et al. 2010), (ii) Lagrangian perturbation theory (Nusser & Dekel 1992; Brenier et al. 2003; Lavaux 2010; Doumler et al. 2013; Jasche & Wandelt 2013; Kitaura 2013; Wang et al. 2013), or (iii) particle-mesh (PM) dynamics (Wang et al. 2014). These dynamical models can be applied either backward or forward in time. The Zel'dovich approximation (Zel'dovich 1970) has classically been used to trace the density field back in time to the linear regime (Nusser & Dekel 1992; Doumler et al. 2013; Sorce et al. 2014). Recently, several studies (Heß et al. 2013; Jasche & Wandelt 2013; Kitaura 2013; Wang et al. 2013, 2014) have implemented forward dynamical models within Markov Chain Monte Carlo algorithms. These Bayesian methods gradually adjust the phases and amplitudes of the ICs until the evolved density field closely matches the one traced by the observed galaxy population. We refer to Wang et al. (2014, 2016) for a more detailed discussion of the advantages and shortcomings of the various methods.

The ELUCID project (Exploring the Local Universe with the reConstructed Initial Density field) consists of using the large amount of data gathered by the SDSS (Sloan Digital Sky Survey) to map the local universe and generate reliable simulations reproducing the distribution of large-scale structures in that survey. The first large ELUCID simulation was completed in 2015: it corresponds to a WMAP five-year cosmology with ${{\rm{\Omega }}}_{M}=0.258$, ${{\rm{\Omega }}}_{{\rm{\Lambda }}}=0.742$, and h = 0.72. The simulation volume is a box of 500 Mpc/h per side, where matter is described by 3072³ particles, corresponding to a mass resolution of $3.1\times {10}^{8}$ ${{\rm{h}}}^{-1}\,{M}_{\odot }$. A more detailed background of the ELUCID project is given in Appendix A.

One of the main applications of the ELUCID simulation is to build realistic mock catalogs by applying various SAMs to the reconstruction. The validity of the model can then be tested by direct comparison with the original observed SDSS sample and the group catalog derived using the method of Yang et al. (2005, 2007). The first problem is to build a framework for direct comparison; this means devising reliable criteria for identifying the most likely reconstruction of a given observed group. The second problem is to disentangle the effects of the reconstruction method from those of the SAM.

In Wang et al. (2014), a forward method employing a Hamiltonian Markov Chain Monte Carlo (HMC) algorithm and PM dynamics was developed. This method was tested with N-body simulations in order to find the best set of parameters, the main criterion being the similarity between the initial power spectra, and between the halo mass functions, of an original dark-matter-only simulation and its reconstruction. The simulation sample they used is independent of any baryonic physics and observational bias. For this reason, this simulation data set may be used to prepare the analysis of the ELUCID simulations by exploring the limitations of the reconstruction technique itself.

In this paper, we make use of part of the sample of N-body simulations from Wang et al. (2014). More specifically, we use halo catalogs extracted from these simulations. We first devise a method to match halos in the original simulation to halos in the reconstructed one. This catalog of matched halo pairs can be used to test alternative criteria that can be applied to the ELUCID simulations. Additionally, various internal properties of halos, such as shape, orientation, or angular momentum, can be compared. More precisely, given a reconstructed halo, we can identify which of its properties are most likely to be constrained by the reconstruction. Furthermore, N-body simulations contain the full formation history of the halos they describe. By applying this cross-identification of halos in the original and reconstructed simulations at different times, one can assess how much of the formation history has been reproduced by the reconstruction method. Since both internal properties and formation histories of halos can have a strong impact on the SAM, these questions are extremely relevant to the ELUCID project.

We detail the method used in the paper in Section 2, before focusing on the results in Section 3. We discuss our results after a short summary in Section 4.

2. Method

In Section 2.1, we introduce the N-body simulation sample and explain how it was processed to obtain the halo catalogs. We describe the halo cross-comparison method in detail in Section 2.2.

2.1. Simulations

In Figure 1 we summarize the processes involved in creating a simulation pair. The first one (referred to as the original) is built from random ICs (generated from the power spectrum predicted for the cosmology), and the second simulation is a reconstruction of the first. The ICs of the reconstructed simulation are generated following Wang et al. (2014). As the schematic shows, we proceeded as follows.

  • 1.  
    The original simulation is run from random ICs.
  • 2.  
The final (redshift 0) output is post-processed with a Gaussian-smoothing kernel to produce the density field (a code sketch follows this list).
  • 3.  
    This density map is used as input for the reconstruction method to produce a reconstruction of the ICs.
  • 4.  
    The reconstructed simulation is run from the reconstructed ICs.
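As a rough illustration of step 2 (our own sketch, not the actual ELUCID post-processing pipeline; the nearest-grid-point assignment and the function name are simplifying assumptions), the smoothed density field could be produced as follows:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def smoothed_overdensity(pos, box, n_grid, sigma):
    """Deposit particles on a grid with nearest-grid-point assignment and
    smooth with a periodic Gaussian kernel of width sigma (same length
    units as the box). Schematic stand-in for step 2 above."""
    cell = box / n_grid
    idx = np.floor(pos / cell).astype(int) % n_grid
    rho = np.zeros((n_grid,) * 3)
    np.add.at(rho, (idx[:, 0], idx[:, 1], idx[:, 2]), 1.0)  # NGP deposit
    delta = rho / rho.mean() - 1.0                           # overdensity
    return gaussian_filter(delta, sigma=sigma / cell, mode="wrap")
```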


Figure 1. This diagram introduces how a pair of original/reconstructed simulations were generated. We start from initial conditions randomly generated from the initial power spectrum. The original N-body simulation is produced from these initial conditions down to redshift 0. The last output is processed to produce a new set of initial conditions. This new set is used to run the reconstructed N-body simulation.


In Table 1 we list the simulation data set used in this paper. Each simulation has 512³ particles, of mass $2\times {10}^{10}$ ${M}_{\odot }$, in a comoving box of side 300 ${{\rm{h}}}^{-1}\,\mathrm{Mpc}$. The cosmological parameters are ${{\rm{\Omega }}}_{M}=0.258$, ${{\rm{\Omega }}}_{{\rm{\Lambda }}}=0.742$, and h = 0.72. The exact cosmology is not critical here, as we focus on numerical techniques rather than on the effect of the cosmological model. We emphasize that all simulations are run with the same N-body code, GADGET2 (Springel 2005), with the same time outputs (60 outputs equally spaced in ${\rm{\Delta }}\mathrm{log}(1/(1+z))$ from z = 18 to z = 0). The only difference between the simulations mentioned in this paper lies in the ICs.
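As a quick sanity check of the quoted particle mass (our own arithmetic, not from the paper): the mass per particle is the mean matter density times the volume per particle.

```python
# m_p = Omega_M * rho_crit * (L_box / N^(1/3))^3
rho_crit = 2.775e11                 # critical density in h^2 M_sun Mpc^-3
omega_m, h = 0.258, 0.72
l_box, n_side = 300.0, 512          # box side in Mpc/h, particles per side
m_p = omega_m * rho_crit * (l_box / n_side) ** 3   # in h^-1 M_sun
print(m_p, m_p / h)                 # ~1.4e10 h^-1 M_sun, i.e. ~2e10 M_sun
```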

Table 1.  List of N-body Simulations

Name      σ_HMC (Mpc/h)   σ_PM (Mpc/h)   N_PM
L300A     ...             ...            ...
L300Ac1   2.25            1.5            10
L300Ac2   3               1.5            10
L300Ac3   4.5             1.5            10
L300B     ...             ...            ...
L300Bc1   2.25            1.5            10
L300Bc2   3               1.5            10
L300Bc3   4.5             1.5            10

Note. We have two series of simulations, L300A and L300B. The original simulations are L300A and L300B. From each of the original simulations, three reconstructions were performed. The extensions c1, c2, and c3 are used to distinguish them from their respective originals and to refer to the corresponding sets of reconstruction parameters (${\sigma }_{\mathrm{HMC}}$, ${\sigma }_{\mathrm{PM}}$, ${N}_{\mathrm{PM}}$) indicated in each row.


As indicated in Table 1, we have two independent N-body simulations: L300A and L300B. For each of these original simulations, three reconstructed simulations were generated. The reconstruction method is detailed in Wang et al. (2014). It depends on the density smoothing ${\sigma }_{\mathrm{HMC}}$ and on the PM parameters: ${N}_{\mathrm{PM}}$ (number of steps) and ${\sigma }_{\mathrm{PM}}$ (grid length). The smoothing scale has to be chosen to reduce the resolution effects of the PM model: choosing ${\sigma }_{\mathrm{HMC}}\leqslant {\sigma }_{\mathrm{PM}}$ leads to significant discrepancies in the two-point correlation functions at small scales (see Figure 3 of Wang et al. 2014). The smallest ${\sigma }_{\mathrm{HMC}}$ we consider is $1.5\times {\sigma }_{\mathrm{PM}}$, while $2\times {\sigma }_{\mathrm{PM}}$ renders these discrepancies negligible. The choice of $3\times {\sigma }_{\mathrm{PM}}$ represents a compromise between further reducing this resolution effect and limiting the loss of information at small scales. In this paper we use the simulation sample that was used to explore the impact of the density smoothing with fixed PM parameters. We focus on halo catalogs extracted from these simulations rather than on the raw simulation data.

There is a large variety of group-finding algorithms, from simple halo-finding methods such as the friends-of-friends (FOF; Davis et al. 1985) and Spherical Over-Density (SOD; Lacey & Cole 1994) methods, to more elaborate subhalo finders such as SUBFIND (Springel et al. 2001a), AdaptaHOP (Aubert et al. 2004; Tweed et al. 2009), and AHF (Knollmann & Knebe 2009), to even more complex phase-space group-finders such as 6DFOF (Diemand et al. 2006), HST (Maciejewski et al. 2009), and RockStar (Behroozi et al. 2013); see the review by Knebe et al. (2011), Onions et al. (2012), and subsequent "subhalos going notts" papers. In addition to the algorithm used to collect particles, the properties associated with the halos are not necessarily standardized. The position can be defined as the center of mass of the collection of particles, the position of the density peak, or the position of the most bound particle. As mentioned by Cui et al. (2016), the latter two positions are normally coincident. The mass itself can be defined as the total mass of the collection of particles, or as the virial mass defined as the total mass within a spherical (or ellipsoidal) region. This virial region can be defined by an overdensity of either $200\times \overline{{\rho }_{\mathrm{DM}}}$, $200\times \overline{{\rho }_{\mathrm{crit}}}$, or ${\rm{\Delta }}(z)\times \overline{{\rho }_{\mathrm{crit}}}$. These criteria also lead to multiple definitions of halo size, boundary, and shape.

For simplicity, we have used the standard FOF (Davis et al. 1985) algorithm, where groups are built by iteratively linking particles to all their neighboring particles closer than a specific distance. In our case, the distance used is $0.2\,\times {\overline{{\rm{\Delta }}}}_{N}$ (b = 0.2), where ${{\rm{\Delta }}}_{N}={L}_{\mathrm{box}}/{N}^{1/3}$ is the mean interparticle distance of the N-body simulation in a volume of side ${L}_{\mathrm{box}}$ (${N}^{-1/3}$ in box units). We also note that no unbinding procedure has been applied to the halo catalogs. This group-finder is often the baseline for more complex ones: if we were not to find any match between FOF halo catalogs, it is unlikely we would find any between catalogs built using other group-finders. Internal property comparisons and subhalo cross-identifications can still be implemented at a later stage for FOF cross-identified halos.
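To fix ideas, here is a minimal FOF sketch in Python (our illustration, not the production group-finder; the function name is ours), using a periodic KD-tree and union-find. Particles sharing a label belong to the same group:

```python
import numpy as np
from scipy.spatial import cKDTree

def fof_labels(pos, box, b=0.2):
    """Minimal friends-of-friends: link particles separated by less than
    b times the mean interparticle distance, in a periodic box of side
    `box`. Positions must lie in [0, box)."""
    n = len(pos)
    link = b * box / n ** (1.0 / 3.0)     # b x mean interparticle distance
    tree = cKDTree(pos, boxsize=box)      # periodic KD-tree

    parent = np.arange(n)                 # union-find forest
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]] # path halving
            i = parent[i]
        return i

    for i, j in tree.query_pairs(link):   # all pairs closer than `link`
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri               # merge the two groups

    return np.array([find(i) for i in range(n)])
```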

Our starting point here for each simulation is a list of particles, along with their associated indexes, positions, velocities, and masses. We also have a list of halos defined as a collection of particle indexes. From this collection we can easily calculate the position, mass, or other useful quantities relevant to the halo.

2.2. Halo Cross-identification

Having introduced the data we use in this paper, we now describe the comparison algorithm.

2.2.1. General Idea

Figure 2 illustrates the initial layout of a halo cross-identification method applied to two catalogs A and B.

  • 1.  
    We compare halo catalog B to halo catalog A:
    • (i)  
Each halo in catalog B can be associated with at most one halo in catalog A. Multiple halos B1, B2, ... can be associated with the same halo Ai in catalog A (upper half of Figure 2(a)).
    • (ii)  
      Only one of these halos (B1, B2...) can be "cross-identified" with Ai, while the others remain "associated" with Ai (upper half of Figure 2(b)).
  • 2.  
    We compare halo catalog A to halo catalog B, following the same method to ensure consistency (lower halves of Figures 2(a) and (b)).


Figure 2. Illustration of the cross-identification method; the method is applied twice: from catalog B to catalog A, and from catalog A to catalog B.


There are multiple reasons for designing the method so that the comparison runs in both directions. The first is to avoid introducing bias into our interpretation by designating one catalog as the fiducial reference. Let us suppose that for each halo A, the best match found in B is systematically larger. Unless we reverse the problem, we cannot rule out that this systematic effect is caused by the matching criteria themselves. Running the comparison both ways provides the means to ensure that the cross-identification is consistent. The second reason is the interpretation of the non-cross-identified halos. To illustrate this, we now suppose that catalog A is a subset of catalog B. By comparing catalog B to catalog A, we find that every halo in B is cross-identified, and we would conclude that catalogs A and B are identical. But by comparing catalog A to catalog B, we find a population of halos in catalog B that are not cross-identified. We then draw the correct conclusion: catalogs A and B are different.

As detailed in Figure 2, a one-way identification can provide multiple correspondences for a halo. If we only care about cross-identification (double-tipped arrow), we can choose one possibility and discard the others. However, we chose to keep track of these discarded choices as associated halos (single-tipped arrow), and distinguish them from halos that have no possible counterparts. In order to collect these associated halo populations from both catalogs, it is also necessary to run the comparison both ways.

As we have described it, our algorithm needs to be run twice, once from B to A and once from A to B, and the results then have to be compared for possible mismatches. One very simple improvement is that, instead of reading and using both catalogs twice, the code can be restructured to read and use one of the catalogs twice and the other only once. The procedure described in Figure 2 becomes that illustrated in Figure 3. We note here that we used an A–B–A configuration, but a B–A–B configuration would provide the same answers. The structure of the cross-comparison between catalogs takes the form of a tree, with one main branch (the trunk) for cross-identifications and secondary branches for associations.


Figure 3. Code optimization: the layout of the algorithm is reorganized, with the lower part inverted and combined with the top one. The same cross-identification search can be implemented in both directions in one run.


2.2.2. Cross-identification Criteria

Since the aim of reconstruction is to obtain the same spatial distribution of large-scale structures, the most logical choice would be to compare the positions of halos. However, minimizing the distance alone may mislead the algorithm into pairing up halos of very different masses, while the correct reconstructed halo lies at a slightly larger distance. If one wanted to focus on the most massive structures only, one could simply cross-identify halos starting from the high-mass end of the mass function, but this cheap and easy method would quickly break down after a few hundred groups. Solving the problem requires criteria that minimize differences in both position and mass between the original and reconstructed halo matches. Wang et al. (2016) implemented such a method to associate galaxy groups with halos in the ELUCID reconstructed simulation. A future cross-comparison method between real and reconstructed data will implement similar criteria within the framework we propose in this paper. The purpose of this paper is to build reliable cross-identified catalogs in order to evaluate and calibrate such a method.

The reason why we use reconstructions of N-body simulations is that halo properties are no longer restricted to position, mass, and size. Furthermore, halo cross-identification is a well-known problem for simulations. Building a merger tree consists of associating halos between time outputs. Testing a new group-finder also requires cross-comparison with a halo catalog built using another algorithm. In both cases, cross-identification is applied to one simulation (i.e., one set of ICs), and particle indexes are used as tracers. The problem is solved by maximizing the number of particles common to matched halos. But having different ICs implies that identical particle indexes have different displacement fields applied to them. Technically, particle indexing has no effect on the simulation outcome and could be entirely random.

However, in N-body simulations, particle indexing is not random. Indexes are assigned following a fixed grid to which the displacement field characterizing the ICs is applied. Thinking back to the problem at hand, one may suppose that if the reconstruction proves truly efficient, reconstructed halos will emerge from similar regions in the ICs. Provided that the same particle indexes are associated with the same coordinates in the ICs, we can expect a perfectly reconstructed halo to have the same particle indexes as the original one. In the case of N-body simulations, this means we can apply a tree-builder to the problem.

We can illustrate this by supposing that catalog B is the earlier output. For each halo in catalog B we first map out a list of all possible descendants in catalog A. Each candidate in A has at least one particle in common with halo B, so their common mass is ${M}_{{\rm{A}}\cap {\rm{B}}}\gt 0$. This constitutes a graph. To create a merger-tree out of the graph, one needs to select one single descendant and update the list of their progenitors accordingly. In tree-builders, the strength of connections between progenitors and descendants is weighted by a merit function. The most widely used is the normalized shared merit function, and for two halos of masses MA and MB, the function is defined as

$$\mathcal{M}=\frac{M_{{\rm A}\cap {\rm B}}^{2}}{M_{{\rm A}}\,M_{{\rm B}}} \qquad (1)$$

For each halo in catalog B, the descendant in catalog A with the highest merit is selected. Among the progenitors found in catalog B, the main progenitor of any halo in catalog A is the one with the highest merit. Going back to the cross-identification problem, a cross-identified halo would be translated as a main progenitor and an associated halo would be translated as a secondary progenitor.
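With halos stored as arrays of particle indexes, the merit of Equation (1) reduces to a set intersection. A minimal sketch (our own, assuming equal-mass particles so that masses reduce to particle counts):

```python
import numpy as np

def shared_merit(idx_a, idx_b):
    """Normalized shared merit of Equation (1) for two halos given as
    arrays of particle indexes; with equal-mass particles the masses
    reduce to counts. A merit of 0.25 corresponds to 50% of the mass
    shared on both sides."""
    n_common = np.intersect1d(idx_a, idx_b).size  # |A ∩ B|
    return n_common**2 / (idx_a.size * idx_b.size)
```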

As the review by Srisawat et al. (2013) showed, many tree-builders are available and most are variations of the algorithm by Lacey & Cole (1994). They differ in the way in which some very technical corrections are implemented. Avila et al. (2014) demonstrated that the choice of group-finder will have a stronger impact on the merger trees themselves than the choice of tree-builder. The tree-builder used here is TreeMaker (Hatton et al. 2003; Tweed et al. 2009). Using this tree-builder simply requires that the original (as A) and reconstructed (as B) halo catalogs are used as indicated in Figure 3.

This tree structure is useful for validating the cross-identification: the main branch should start and end with the same halo, and secondary branches should contain at most one halo. As one can see from our illustration in Figure 3 (or even Figure 2(b)), this is not necessarily the case, and these inconsistencies could represent a problem. In Appendix B, we describe how they are resolved before constructing the cross-identified and associated catalogs. Some additional selection is applied for these catalogs. By default we apply a merit threshold of 0.25, but we may increase or decrease this parameter to derive either more complete or more strictly accurate cross-identified catalogs. We chose this value because it can be interpreted as a minimum quadratic average of 50% common mass between two cross-identified groups. To improve the interpretability of the associated catalogs, we systematically apply, after cross-identification, a threshold of 50% on the common mass (${M}_{{\rm{A}}\cap {\rm{B}}}/{M}_{{\rm{A}}}\gt 0.5$ for the associated catalog A, ${M}_{{\rm{A}}\cap {\rm{B}}}/{M}_{{\rm{B}}}\gt 0.5$ for the associated catalog B). Even though their merit is below the first threshold, discarded cross-identified halos can be added to the associated catalog if their common mass is above this second threshold.
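Putting the pieces together, the catalog construction can be sketched as a mutual-best-match pass. This is a simplified stand-in for the TreeMaker-based pipeline and the inconsistency resolution of Appendix B, with all names ours:

```python
def cross_identify(halos_a, halos_b, merit_cut=0.25, frac_cut=0.5):
    """Toy sketch of the two-way matching of Section 2.2. halos_* are
    lists of particle-index sets. Returns cross-identified pairs (mutual
    best matches above merit_cut) and the associated halos on each side
    (best matches failing the cut but sharing > frac_cut of their mass)."""
    def best_matches(src, dst):
        owner = {p: j for j, h in enumerate(dst) for p in h}
        out = []
        for h in src:
            shared = {}
            for p in h:
                if p in owner:
                    shared[owner[p]] = shared.get(owner[p], 0) + 1
            if not shared:
                out.append((None, 0.0, 0.0))
                continue
            # Pick the counterpart with the highest merit (Equation (1)).
            j = max(shared, key=lambda jj: shared[jj]**2 / (len(h) * len(dst[jj])))
            merit = shared[j]**2 / (len(h) * len(dst[j]))
            out.append((j, merit, shared[j] / len(h)))
        return out

    a_to_b = best_matches(halos_a, halos_b)
    b_to_a = best_matches(halos_b, halos_a)
    pairs, assoc_a, assoc_b = [], [], []
    for i, (j, merit, frac) in enumerate(a_to_b):
        if j is not None and b_to_a[j][0] == i and merit >= merit_cut:
            pairs.append((i, j, merit))   # mutual best match: cross-identified
        elif frac > frac_cut:
            assoc_a.append(i)             # most of halo i's mass sits in B
    for j, (i, merit, frac) in enumerate(b_to_a):
        mutual = i is not None and a_to_b[i][0] == j and merit >= merit_cut
        if not mutual and frac > frac_cut:
            assoc_b.append(j)
    return pairs, assoc_a, assoc_b
```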

3. Results

We assess in Section 3.1 the effect of the density smoothing scale on the reconstruction. In Section 3.2 we test the consistency of the mass-based merit we use against a distance-based merit function. The impact of the merit threshold selection on the cross-identified sample is tested in Section 3.3. The evolution of the quality of the reconstruction at earlier redshifts is explored in Section 3.4. We check in Section 3.5 whether earlier time steps of the reconstructed simulations could correspond to more accurate reconstructions.

3.1. Density Smoothing and High Mass Bias

Figure 4 illustrates the effect of the HMC smoothing length on the reconstructed mass function at the final redshift. Each subfigure corresponds to a different value of ${\sigma }_{\mathrm{HMC}}$ used for the reconstruction of the same original simulation. In the top panels, the cumulative mass functions of each catalog are represented as dashed curves. In each case, the mass function of the reconstruction (in blue) appears quite similar to that of the original (in red). Using solid curves following the same color-code, we also show the cumulative mass functions of the cross-identified subsample of each simulation. At the high-mass end, these curves are superimposed on the dashed ones. In Figures 4(a) and (b), where lower values of ${\sigma }_{\mathrm{HMC}}$ are tested, the reconstructed mass function lies above the original one.


Figure 4. Each subfigure assesses the performance of a different reconstruction of the same original simulation. Top panels: for both the original (in red) and the reconstruction (in blue) we show the cumulative mass functions of all halos (dashed curves), of the cross-identified subsamples (solid curves), and of the associated samples (dotted curves). Bottom panels: the last four curves are shown again after normalization by the full cumulative mass functions; the horizontal purple dashed line indicates unity. ${{Np}}_{\mathrm{all}}$ denotes the total number of cross-identified pairs, and ${{Np}}_{0.5}$ denotes the number of pairs for which the reconstruction is 50% complete. In both panels, the shaded regions indicate the 50% completeness regions, and the vertical dashed line indicates the minimal halo mass.


This mass bias could be caused by an overproduction of massive halos in the reconstruction or by a tendency toward reconstructing halos with a larger mass. The former possibility is ruled out since the cross-identified mass function overlaps the mass function of the complete catalog. The lower panel further confirms this statement: it displays (as solid curves) the ratios of the cross-identified mass functions to the mass functions of their respective simulations. We refer to these curves as cross-identified fractions. A systematic overproduction of massive halos in the reconstruction would translate at the high-mass end as a cross-identified fraction lower than unity for the reconstruction (blue curve), and a cross-identified fraction close to unity for the original (red curve).

These panels also illustrate how the reconstruction degrades as we explore lower masses. Below ${10}^{14}$ ${{\rm{h}}}^{-1}\,{M}_{\odot }$, the cross-identified sample diverges from the complete one, illustrating the increasing incompleteness of the reconstruction as we consider lower and lower masses. The cumulative mass functions of associated halos (associated fractions for short, once normalized) are represented as dotted curves (red for the original, blue for the reconstructions). These associated halo mass functions highlight an additional discrepancy between the original and reconstructed simulations. For low ${\sigma }_{\mathrm{HMC}}$ values, the associated mass functions are higher for the original, especially in the $[{10}^{12},{10}^{14}]$ mass range. Associated halos from the original simulation that appear in excess may be found within larger halos in the reconstruction. This trend is likely related to the mass excess in reconstructed halos at the high-mass end.

In order to further quantify the quality of the reconstruction, we have added some extra information to these figures. We indicate the total number of pairs in the cross-identified sample, ${{Np}}_{\mathrm{all}}$. Using both cross-identified fractions, we have estimated the mass above which 50% of the halo catalog is reconstructed. We start by selecting halo pairs corresponding to cross-identified fractions greater than 0.5 in both simulations, and denote the number of such pairs as ${{Np}}_{0.5}$. The mass range covered by this 50% complete cross-identified sample is represented with shaded regions color-coded according to the simulation, with the overlap appearing in purple. As we explore larger smoothing lengths from Figures 4(a)–(c), the total number of cross-identified pairs ${{Np}}_{\mathrm{all}}$ decreases from $\sim {10}^{4}$ to ∼7000 and then ∼4300 for smoothing lengths of 2.25, 3, and 4.5 Mpc/h. The corresponding numbers at 50% completeness are obviously lower as well (∼3000, ∼1500, and ∼500), and cover smaller mass ranges as the minimum mass increases ($2\times {10}^{13}$, $4\times {10}^{13}$, and $9\times {10}^{13}$ ${M}_{\odot }$). Additionally, the 50% completeness mass range found for the reconstructed simulation appears to be wider in all three cases. This seems to be related to higher values of the mass function for the reconstruction at the high-mass end.

It appears from this figure that there is a bias toward increasing halo mass during reconstruction, at least at the high-mass end. We explore this mass increase further in Figure 5. In order to compare the masses of cross-identified halos directly, we have produced a scatter plot. Each point corresponds to a cross-identified pair, with the mass as measured in the reconstructed simulation displayed as a function of the mass as measured in the original one. To guide the eye, various mass ratios are indicated with dashed–dotted diagonal lines, with the 1:1 relation in purple, 50% and 100% excess for the reconstructed simulation in blue, and 50% and 100% excess for the original one in red. The horizontal histograms in the left panel show, as a function of the median mass, the fraction of reconstructed halos with an excess in mass; excesses greater than 50% and 100% appear in light blue and dark blue. The vertical histograms in the bottom panel show, as a function of the median mass, the fraction of original halos with an excess in mass; excesses greater than 50% and 100% appear in light red and dark red.


Figure 5. Each subfigure assesses the performance of a different reconstruction of the same original simulation. The scatter plot represents, for each cross-identified halo pair, the mass in the reconstructed simulation (ordinate) as a function of the mass in the original simulation. The 1:1 relation is marked as a purple dotted–dashed diagonal line. The fraction of data points above and below this line is represented as a function of the median mass as black histograms in the left panel (overestimation in the reconstruction) and in the bottom panel (overestimation in the original), respectively. Additional dotted–dashed diagonals represent different mass ratios between the reconstruction and the original, from the top left to the bottom right: 2 (blue), 1.5 (light blue), 1/1.5 (light red), and 1/2 (red). The fractions above the blue lines (large overestimations in the reconstruction) and below the red ones (large overestimations in the original) are also represented in the horizontal (left) and vertical (bottom) histograms, respectively, following the same color-code. In all panels, the vertical and horizontal black dashed lines represent the halo mass resolution.


Since the mass bias is most prominent for the smallest smoothing length, it should also appear in Figure 5(a). One can assess qualitatively from the scatter plot that there does not seem to be any mass bias in this first case up to ${10}^{13}$ ${{\rm{h}}}^{-1}\,{M}_{\odot }$. However, the bias becomes quite apparent above ${10}^{13}$, with higher masses found in the reconstruction. The histograms make this high-mass bias clearly apparent in Figure 5(a), with 90% of pairs in the highest-mass bins found with a higher mass in the reconstruction, and 20% with twice the mass in the reconstruction. As we explore higher HMC smoothing lengths in Figures 5(b) and (c), the distributions of points appear to be centered closer to the equal-mass diagonal represented in purple. This is confirmed by the black histograms: the probabilities of finding mass excesses of 50% and 100% appear similar in the original and reconstructed simulations.

As Wang et al. (2014) suggested, for a structure formation model that is valid on a scale larger than ${\sigma }_{\mathrm{HMC}}$, the HMC method will introduce non-Gaussian perturbations (non-Gaussianities) into the reconstructed ICs. These non-Gaussian ICs eventually lead to such an excess in halo masses. As we further confirm here, increasing the smoothing length significantly reduces the mass bias; however, it also reduces the mass range within which halos are efficiently reconstructed. Here we only consider reconstruction models with ${\sigma }_{\mathrm{PM}}=1.5\,\mathrm{Mpc}/{\rm{h}}$ and ${N}_{\mathrm{PM}}=10$. As mentioned in Wang et al. (2014), to avoid the mass bias the smoothing scale should be of the order of 3 PM cells. In order to improve the mass range with smaller ${\sigma }_{\mathrm{HMC}}$ smoothing, the ${\sigma }_{\mathrm{PM}}$ parameter must be decreased at the cost of increasing the number of steps ${N}_{\mathrm{PM}}$. This was demonstrated by Wang et al. (2014), with more accurate reconstructions using ${N}_{\mathrm{PM}}=40$, ${\sigma }_{\mathrm{PM}}=0.75\,\mathrm{Mpc}/{\rm{h}}$, and ${\sigma }_{\mathrm{HMC}}=2.25\,\mathrm{Mpc}/{\rm{h}}$. Such reconstruction models (with low ${\sigma }_{\mathrm{PM}}$ and high ${N}_{\mathrm{PM}}$) are more computationally expensive. They are well suited to producing a single set of reconstructed ICs, whereas less accurate reconstruction models such as the one used here are better adapted for exploration and testing.

3.2. Testing Alternative Merit Functions

So far we have implemented a simple merit function as a criterion for cross-identification. We know that our implementation depends solely on tracing mass from the initial matter distribution. However, the objective of the ELUCID project is to reconstruct halos at the same positions. In order to quantify how close cross-identified halos are, we define this additional distance-based merit function:

$$\mathcal{M}_{\rm dist}=\left(1+\frac{|{\boldsymbol{p}}_{\rm A}-{\boldsymbol{p}}_{\rm B}|}{r_{{\rm FOF},{\rm A}}+r_{{\rm FOF},{\rm B}}}\right)^{-2} \qquad (2)$$

where ${{\boldsymbol{p}}}_{{\rm{A}}}$ and ${{\boldsymbol{p}}}_{{\rm{B}}}$ are the positions of cross-identified halos A and B, and ${r}_{\mathrm{FOF},{\rm{A}}}$ and ${r}_{\mathrm{FOF},{\rm{B}}}$ are their radii. We note that, for simplicity, halo positions are defined by their centers of mass. Their size ${r}_{\mathrm{FOF}}$ is estimated as the maximum distance of their particles from the center; effectively, each halo is described as the smallest sphere containing all of its particles. One should also note that, when measuring the distance between halos $| {{\boldsymbol{p}}}_{{\rm{A}}}-{{\boldsymbol{p}}}_{{\rm{B}}}| $, periodic boundary conditions must be taken into account. The distance criterion we obtain with Equation (2) is fairly simple: the value tends to 0 as halos get further from one another and rises to 1 as they get closer. If the halos overlap, i.e., $| {{\boldsymbol{p}}}_{{\rm{A}}}-{{\boldsymbol{p}}}_{{\rm{B}}}| \lt {r}_{\mathrm{FOF},{\rm{A}}}+{r}_{\mathrm{FOF},{\rm{B}}}$, the result of this merit function is larger than 0.25.
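A sketch of Equation (2) as reconstructed above, including the periodic minimum-image correction mentioned in the text (the function name is ours):

```python
import numpy as np

def distance_merit(p_a, p_b, r_a, r_b, box):
    """Distance-based merit of Equation (2). Halos are modeled as the
    smallest spheres enclosing their particles; the minimum-image
    convention handles periodic boundaries."""
    d = np.asarray(p_a) - np.asarray(p_b)
    d -= box * np.rint(d / box)              # periodic wrap
    sep = np.linalg.norm(d)
    return (1.0 + sep / (r_a + r_b)) ** -2   # 1 when coincident, 0.25 when touching
```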

In Figure 6 (top panels) we directly compare the two merit functions for different mass bins (using the median mass of the cross-identified pairs). The cross-identified catalog used to create this figure differs from those used in Figures 4 and 5: we have not applied any merit threshold, in order to obtain the equivalent distance-based merit for unlikely cross-identified pairs. The vertical and horizontal dashed lines display the value of our fiducial merit threshold of 0.25. The bottom panels display the distributions of pairs as a function of the distance-based merit. We explore in Figures 6(a)–(c) the different reconstructions of simulation L300A. For all reconstructions, we clearly see in the top panels that low-mass cross-identified halos do not appear to be close in the simulation volume and typically have low values of the corresponding mass-based merit function. For the reconstructions, as we look at increasing mass bins, a new population with a distance-based merit larger than 0.25 emerges. This bimodality is made even clearer in the bottom panels, which reveal the number distributions of cross-identified pairs as a function of the distance-based merit.


Figure 6. Each subfigure assesses the performance of one of the reconstructions of the same original simulation. For each halo pair cross-identified using our mass-based criteria (Equation (1)), we estimate a distance-based merit (abscissa, Equation (2)). The top panels show the scatter of the former as a function of the latter. The bottom panels display the number distribution of the cross-identified pairs as a function of the distance-based merit. Each column corresponds to a different range of median halo mass. As a guide, the merit threshold of 0.25 is indicated as a dashed green line.


These figures show that high-mass halos tend to be reproduced with high mass-based merits, and as we explore those higher masses, these pairs also tend to have high distance-based merits. The two merit functions are strongly correlated for the halo pairs that have most probably been correctly cross-identified, especially above $5\times {10}^{13}$–${10}^{14}$ ${M}_{\odot }$.

For any of the smoothing lengths, we find these figures very useful for merit-function calibration. For the distance-based merit function, the value of 0.25 corresponds to overlapping halos; it is represented by the vertical green dashed line. For our fiducial (mass-based) merit function, we find very few non-overlapping cross-identified pairs above our fiducial 0.25 threshold (horizontal green dashed line). This result shows that we can indeed apply this mass-based cross-identification approach to such simulation reconstruction problems.

We can now decide to use our fiducial mass-based merit function as a baseline, and suppose that we wish to reproduce its predictions using the distance-based merit. The green dashed lines divide the top panels into four domains. We find the true negatives and true positives in the lower left and upper right corners, respectively, and the false negatives and false positives in the upper left and lower right corners, respectively. We can see that the number of false negatives is very low compared to the number of false positives. This suggests that the performance of the distance-based cross-identification may be improved by using a larger merit threshold.
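Treating the mass-based merit as ground truth, the four domains can be counted directly (a small sketch of our own; the array names are hypothetical):

```python
import numpy as np

def confusion_counts(merit_mass, merit_dist, cut=0.25):
    """Quadrant counts for the top panels of Figure 6, with the mass-based
    merit taken as ground truth and the distance-based merit as the
    classifier under test."""
    truth = np.asarray(merit_mass) >= cut
    guess = np.asarray(merit_dist) >= cut
    tp = np.sum(truth & guess)     # upper right: both merits agree
    tn = np.sum(~truth & ~guess)   # lower left
    fn = np.sum(truth & ~guess)    # upper left: distance merit misses
    fp = np.sum(~truth & guess)    # lower right: distance merit over-matches
    return tp, tn, fp, fn
```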

We have explored an existing cross-identified sample to test a potential distance-based merit function. The data we used here maximize the value of the mass-based merit, so the distance-based merits shown here are overestimated compared to what a cross-identification algorithm using this criterion would find. One must keep in mind that this merit function cannot be fully tested unless it is implemented in a cross-identification algorithm.

3.3. Applying Different Merit Thresholds

In Figure 7, we explore the cumulative mass functions after cross-identification with multiple merit thresholds, considering the different reconstructions in Figures 7(a)–(c). The columns correspond to different values of the merit threshold, increasing from left to right. Our fiducial value of 0.25, which was used in Figures 4(a)–(c), is displayed in the central panel.


Figure 7. Each subfigure assesses the performance of one of the reconstructions of the same original simulation. Each column contains panels similar to Figures 4(a)–(c). For each subfigure, each column corresponds to a different selection threshold applied to the merit. Increasing the merit cut reduces the number of halo pairs in the cross-identified catalog as well as the mass range for which the reconstruction can be considered 50% complete.


Note that the selection threshold described and implemented here is not a parameter of the cross-identification method described in Section 2.2, but is rather applied a posteriori to generate cross-identified and associated catalogs. Each member of any halo pair that does not meet the merit threshold is added to the associated catalog, provided that 50% of its mass is found in the other member.

As expected, reducing the value of the merit threshold increases the number of halos considered cross-identified. Compared to our fiducial value, dividing this selection threshold by two increases the number of halos in the cross-identified catalog by a factor of 4–6 depending on the reconstruction (HMC smoothing): we obtain 27,000–34,000 paired halos with this lower threshold, instead of 4000–8000 pairs with the fiducial value of 0.25. Allowing for a smaller merit strongly increases the size of the subsample that can be used for cross-comparison. The downside is that a lower selection threshold allows for a much larger mass difference between cross-identified halos. Moreover, as Figure 6 suggests, halo pairs with low merits may also be located at larger distances within the simulation box. As we double our fiducial selection threshold, the size of the cross-identified sample is greatly reduced, by a factor of 30 for the smallest smoothing length up to a factor of 65 for the largest one. Despite its small size, this catalog has the advantage of containing halo pairs of very similar masses at close coordinates.

The 50% completeness region is also strongly affected, with its lower limit moving toward the high-mass end (upper limit) with increasing merit threshold. The mass range discrepancy is also more apparent as we consider lower merit thresholds. The merit is thus a very effective criterion to create either a small but accurate reconstructed subsample or a very large but much less accurate one. The former may be preferred for very detailed analyses of specific halos, while the latter could be more relevant for large number statistics.

Within the framework of comparing reconstructed simulations, our fiducial value seems to be a good compromise. The size of the cross-identified catalog is still dependent on the realization itself, as we see when going through the same exercise with the second set of simulations, L300B, in Figure 8. For the smallest value of ${\sigma }_{\mathrm{HMC}}$, we find 15%–25% fewer cross-identified pairs than in the L300A simulation set. There is also a 10% discrepancy in the opposite direction for the medium value of ${\sigma }_{\mathrm{HMC}}$. The difference is less apparent for the largest smoothing, unless we focus on the 50% completeness region, where we find larger numbers in the L300A simulation set. This highlights the fact that, for a given set of reconstruction parameters, the number of reconstructed halos may differ between realizations to a potentially large extent. However, one aspect of the reconstruction that does not seem to be affected is the 50% completeness mass range. This suggests that the mass range for which we obtain a certain level of completeness is a better indicator for testing a new set of reconstruction parameters.


Figure 8. Same as Figure 7, but with the L300B simulation and its reconstructions. The results are qualitatively the same, but we find a strong variation (up to 25%) in the number of cross-identified pairs for similar parameter sets.


3.4. Cross-comparison at Earlier Redshifts

So far, we have shown that the z = 0 output is efficiently reconstructed at the high end of the mass function. This result is by design, since the reconstruction constraints come from the z = 0 smoothed density map. Given that the larger structures should also be present in the simulation at earlier times, we proceed to compare earlier snapshots to find out whether the reconstruction also provides some constraints at earlier times.

The most logical first step is to directly compare the mass functions at earlier redshifts. In Figure 9, we run the same test as in Section 3.1 but at multiple redshifts. Each column corresponds to a different reconstruction of the same original simulation, with a different smoothing scale ${\sigma }_{\mathrm{HMC}}$ applied to the density map. The rows correspond to redshifts 0, 0.1, 0.5, and 1, from top to bottom. As in Figures 4(a)–(c), we display the number of cross-identified halo pairs (with merit above 0.25) and the mass range corresponding to 50% completeness.


Figure 9. Similar to the upper panels of Figures 4(a)–(c), we assess the reconstructions through the cumulative mass functions. Each column corresponds to a different reconstruction of the same original simulation, and each row corresponds to a different redshift. As we explore larger redshifts the reconstruction is degraded both in terms of the number of cross-identified pairs and also in terms of the 50% completeness mass range.


Independent of the smoothing length used, one can see in this figure how the reconstruction degrades as we explore larger redshifts. The minimum mass above which half the halos are reconstructed is higher at larger redshifts. Notably, the number of halos reconstructed above that mass gets smaller as well. If the growth of halos were purely linear, we would expect the high-mass end to be populated by the same halos taken at different times, and the 50% completeness mass range would simply shift to the left.

Cross-identified halos at redshift 0 most certainly have multiple progenitors at earlier times. Their main progenitors may also be cross-identified, but their secondary progenitors are much less likely to be constrained by the reconstruction technique. The contribution of these secondary progenitors to the mass function, along with the degradation of the cross-identification of the main progenitor, explain why the minimum of the shaded region increases. The cross-identification gets so degraded by redshift 1 that the most massive halos are no longer cross-identified. This is especially evident for the largest smoothing lengths, as the 50% completeness regions no longer overlap.

This degradation can be understood as a variation of the merger histories of specific halos. For example, consider a perfectly cross-identified pair: the two halos have the same particles, translating to a merit of 1. As we explore earlier time outputs, we can imagine that the original halo is split into two parts, while the reconstructed halo is unchanged. The larger part would still be cross-identified, but with a lower merit, while the smaller part would be associated. At an even earlier time output, it is the reconstructed halo that is split, but a different subset of particles could be removed (even if the split fraction were identical). The merit of the cross-identified pair would once again be lowered, and an additional associated halo would be found in the reconstruction. As each split corresponds to a new progenitor, we understand how individual progenitors (including the main one) may no longer be cross-identified. However, the collection of progenitors of the original halo could still be cross-identified with the collection of progenitors of the reconstructed one.
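To put numbers on this picture (a worked example of our own, using Equation (1)): suppose the original halo of mass $M$ splits at an earlier output into a main part of mass $fM$ and a secondary part of mass $(1-f)M$, while the reconstructed halo keeps all of its $M$ in one piece. The merit of the main pair is then

$$\mathcal{M}=\frac{(fM)^{2}}{(fM)\,M}=f,$$

so a 50/50 split already halves the merit, and the fiducial 0.25 cut is crossed once the main part retains less than a quarter of the original particles.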

3.5. Cross-comparison Across Redshifts

We pointed out in Section 3.1 that at the high-mass end of the mass function, we find more massive halos in the reconstructed simulation. The issue may be related to an over-merging process, with nearby original halos being reconstructed as one. The halos may thus appear more evolved than they are in the original simulation. We investigate this possibility in Figure 10. It is similar to Figure 9, with the key difference that the redshift of the original simulation remains fixed at z = 0, while we use five consecutive snapshots of the reconstructed simulations.


Figure 10. This figure is similar to Figure 9. Instead of exploring earlier redshifts, we compare earlier snapshots of the reconstruction to the final output of the original simulation. Even though earlier reconstructed mass functions are a better match to the final original mass functions, the cross-identified catalogs contain fewer pairs, and the 50% completeness mass range gets smaller with increasing time difference.


The mass functions of the reconstructions appear closer to the original mass functions if we use an earlier time step in the reconstruction: z = 0.16 for L300Ac1, z = 0.11 for L300Ac2, and z = 0.05 for L300Ac3. We note that, as the high-mass issue is less relevant for a given reconstruction (larger smoothing length), a time output closer to z = 0 is naturally picked. The cumulative mass function of cross-identified halos follows the same trend as the mass function, and the discrepancies between the associated mass functions are mitigated at earlier time steps in the reconstruction.

These qualitative remarks tend to suggest that the reconstructions are more evolved than the original. If the reconstructed halos are mergers of cross-identified and associated original halos, we could find an earlier time step in the reconstruction where these mergers have not yet occurred. In that case, the main reconstructed halos would show no mass excess, and the associated original halos would now be cross-identified. The number of cross-identified pairs should then increase, and the lower limit of the 50% completeness region should decrease (at least slightly).

Once we focus on the numbers of reconstructed halos, however, we notice that the performance of the reconstruction does not appear to improve: the number of halos reconstructed from the original simulation decreases. We also note some variation in the 50% completeness region, with the region corresponding to the reconstruction (in blue) shifted to lower masses while the one corresponding to the original simulation remains fixed. Even though the mass functions are closer when we compare different redshifts, the quality of the reconstruction, in terms of the number of reconstructed halos, is not significantly improved. The most massive halos are simply a little less massive, and fewer original halos are associated with the reconstructed ones. The reconstruction is thus not more evolved than the original simulation.

4. Summary and Discussion

This paper can be summarized as follows.

  • 1.  
    We have implemented a simple and reliable cross-identification method for halo catalogs derived from N-body simulations.
  • 2.  
We have used the cumulative mass functions to qualify the comparison, and the numbers of cross-identified pairs and the 50% completeness mass ranges to quantify it.
  • 3.  
    We have applied this method to simulations obtained by Wang et al. (2014) to test the IC reconstruction method used for the ELUCID project.
  • 4.  
    Our cross-identified catalogs have been further tested with a distance-based criterion. The results are highly compatible at the high-mass end, where 50% of the halos are reconstructed. With our default selection threshold, only a very limited number of non-overlapping halos enters the cross-identified catalogs.
  • 5.  
    We have confirmed the effect of the smoothing parameter ${\sigma }_{\mathrm{HMC}}$ on the reconstructed halo catalog. We have also highlighted how a larger value can avoid an artificial increase of the halo mass at the high end of the mass function, at the cost of the number of reconstructed halos.
  • 6.  
    Even for the same set of reconstruction parameters, the number of reconstructed halos can vary between realizations. The reconstructed (50% completeness) mass range may be a more reliable indicator.
  • 7.  
    We have further explored the quality of the reconstruction and have confirmed a degradation of the results at higher redshifts.

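As a brief illustration of the completeness measure used above, the sketch below estimates a 50% completeness mass range from binned halo counts. The binned definition, bin spacing, and variable names are assumptions for illustration, not the paper's exact procedure.

```python
import numpy as np

def completeness_range(m_all, m_matched, bins, level=0.5):
    """Return the mass range where the cross-identified fraction of
    halos stays at or above `level` (None if it never does).

    m_all: masses of every halo in one catalog; m_matched: masses of
    the halos that belong to a cross-identified pair; bins: mass-bin
    edges (e.g., logarithmically spaced).
    """
    n_all, _ = np.histogram(m_all, bins=bins)
    n_match, _ = np.histogram(m_matched, bins=bins)
    # Completeness per bin, leaving empty bins at zero.
    frac = np.divide(n_match, n_all,
                     out=np.zeros(len(n_all)), where=n_all > 0)
    above = np.where(frac >= level)[0]
    if above.size == 0:
        return None
    # Edges of the first and last bins above the level.
    return bins[above[0]], bins[above[-1] + 1]
```
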
The overall goal of the ELUCID project is to produce reliable ICs that lead to more accurate reconstructions of the large-scale structures observed in SDSS. Applying reconstruction methods to cosmological simulations with a large number of constraints is not trivial. Such simulation tests are extremely useful, as they can be used to benchmark the reconstruction algorithms, weighing the efficiency of the code itself against the quality of the reconstruction.

The diagnostic scheme we implemented in this paper differs considerably from that of Wang et al. (2014), as it focuses on the end result of the simulations through one-to-one matching. We have confirmed their observation of the mass bias in more detail. Furthermore, by quantifying the reconstruction, we have also ruled out the possibility that this bias is due to a faster evolution in the reconstructed simulations. The cross-identified catalog is another tool for diagnosing discrepancies once either an HOD model or a SAM is applied to the simulations. Systematic differences between mock catalogs should be assessed directly in light of the reconstructed halo properties themselves.

An additional aspect that may be relevant to such a study is the impact of the group-finder. The FOF algorithm is one of the simplest and most widely used, even serving as an initial guess for more accurate and reliable algorithms. The high-mass-end discrepancies we observed may become less apparent when using alternatives. Applying a subhalo finder on top of the FOF halos could separate coalesced halos in the reconstructed simulations. Still, given the smoothing, one should not expect subhalo distributions to closely match the reconstructed halos in terms of either mass or position. Defining halos as bound spheres using an SOD-like algorithm (spherical overdensity; Lacey & Cole 1994) could also mitigate the problem. However, neglecting particles that are either unbound or outside the virial radius could reduce the chance of cross-identifying halos, thus introducing a bias. We preferred the simplest method for this analysis. In order to identify differences between original and reconstructed mock galaxies, one should cross-identify the (sub)halo catalogs that are part of the semi-analytical pipeline.

An extremely interesting aspect of this study is not only the cross-identification itself but also the nature of the cross-identification criterion. The merger-tree structure is indeed useful for catalog comparison, since a classic merger-tree code uses the particle indexes to find connections between halos. One can see why tree-builders have been used efficiently to compare group catalogs extracted from the same simulation, for example, in confronting two group-finders. Such methods are useful for studying the effect of baryons in simulations (as in Cui et al. 2012, 2014), as they require cross-identification of halos from the hydrodynamical run with the N-body simulations. A further application of such methods could be resolution studies. Taking the same ICs applied at two resolutions (e.g., a factor of 8 in particle number), one can map the correspondence of particle indexes between simulations and then modify a tree-builder for halo cross-identification across resolutions.
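
To illustrate the shared-index bookkeeping that such tree-builders rely on, here is a minimal sketch in Python; the catalog layout (a mapping from halo ID to a list of particle indexes) is a hypothetical convenience, and real FOF outputs would first be parsed into this form.

```python
def shared_counts(halos_a, halos_b):
    """Count the particles shared by every overlapping halo pair.

    halos_a, halos_b: dicts mapping a halo ID to the list of particle
    indexes it contains (hypothetical layout, not the paper's actual
    data format). Returns {(id_a, id_b): n_shared}.
    """
    # Invert catalog B: particle index -> halo ID.
    owner_b = {p: hid for hid, parts in halos_b.items() for p in parts}

    counts = {}
    for hid_a, parts in halos_a.items():
        for p in parts:
            hid_b = owner_b.get(p)
            if hid_b is not None:
                counts[(hid_a, hid_b)] = counts.get((hid_a, hid_b), 0) + 1
    return counts

# Toy example: halo 0 of A shares two particles with halo 0 of B.
print(shared_counts({0: [1, 2, 3]}, {0: [2, 3], 1: [7]}))
# -> {(0, 0): 2}
```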

In our case we used two sets of ICs. Even if halos could be produced with the same mass at the same positions, there is no apparent reason why they would contain the same particle indexes. However, since particles are effectively indexed by their initial positions on the Lagrangian grid, the indexes are not random. This leads one to conclude that our method relies on the fact that cross-identified halos are formed out of the same Lagrangian regions. This is quite interesting, as it is actually the concept behind IC reconstruction techniques.
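
To make this explicit: for particles laid out on a regular $n^3$ Lagrangian grid, recovering grid coordinates from a particle index is a simple integer decomposition. The row-major numbering convention below is an assumption for illustration; actual IC generators may order particles differently.

```python
def lagrangian_coords(index, n_grid):
    """Map a particle index back to its (i, j, k) cell on the initial
    n_grid^3 lattice, assuming row-major numbering (an assumption)."""
    i, rem = divmod(index, n_grid * n_grid)
    j, k = divmod(rem, n_grid)
    return i, j, k

# Nearby indexes correspond to nearby Lagrangian positions,
# e.g. on a 512^3 grid:
print(lagrangian_coords(5, 512))          # -> (0, 0, 5)
print(lagrangian_coords(512 * 512, 512))  # -> (1, 0, 0)
```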

This brings us to another interesting aspect of this cross-identification scheme. As reconstructed halos are born out of similar Lagrangian regions, we could expect their progenitors to be born out of the same regions themselves. A tree-builder would use the same criterion as our cross-identification algorithm. If the merit of a cross-identified halo pair is high, their respective main progenitors are likely to be cross-identified as well. We have found, however, that any consistency at earlier redshifts remains confined to a narrow redshift range. It would be necessary to combine the cross-identification catalogs with merger trees to understand this degradation of the cross-identification. Future study along this line of inquiry could reveal both (i) how far back halos are effectively reconstructed and (ii) how far back the collections of their progenitors are reconstructed. Similarities in the accretion histories of cross-identified pairs should also be assessed relative to halos of similar masses. This topic is extremely relevant for determining the indirect constraints that the reconstruction places on semi-analytical galaxy mock catalogs.

D.T. thanks Alexander Knebe and Peter Thomas for useful discussions that initiated the idea behind the method. This work is supported by the 973 Program (No. 2015CB857002), the National Science Foundation of China (grant Nos. 11128306, 11121062, 11233005, 11073017, 11621303), NCET-11-0879, and the Shanghai Committee of Science and Technology, China (grant No. 12ZR1452800). H.W. further acknowledges the support from grants NSFC 11522324 and NSFC 11421303. We also appreciate the support from a key laboratory grant from the Office of Science and Technology, Shanghai Municipal Government (No. 11DZ2260700). This work is also supported by the High Performance Computing Resource in the Core Facility for Advanced Research Computing at Shanghai Astronomical Observatory. D.T. thanks Richard Tweed and the anonymous referee for helping to improve the quality and clarity of this manuscript.

Appendix A: ELUCID: Historical Background

Here is a short history of the ELUCID project.

  • Yang et al. (2005): developed a method to associate galaxy groups with halos. This iterative method consists of defining galaxy groups in observational surveys using a halo mass estimation algorithm, then applying an HOD model to estimate a halo radius and velocity dispersion, and iteratively refining the group membership with an improved center and halo radius. The method was benchmarked on mock catalogs derived from N-body simulations (code: Jing & Suto 2002) and applied to the 2dFGRS survey (Colless et al. 2001).
  • Yang et al. (2007) applied the Yang et al. (2005) method to the SDSS DR4 (Adelman-McCarthy et al. 2006) survey. Subsequent publications assessed the halo occupation numbers in terms of luminosity (Yang et al. 2008), and the derived group luminosity and stellar mass functions (Yang et al. 2009).
  • Wang et al. (2009): implemented a Monte Carlo reconstruction method and applied it to cosmological N-body simulations run with the P3M code of Jing & Suto (2002). This method was applied by Muñoz-Cuartas et al. (2011) to the SDSS DR4 (Adelman-McCarthy et al. 2006) without taking redshift distortions into account.
  • Yang et al. (2012) applied the Yang et al. (2005) method to the SDSS DR7 (Abazajian et al. 2009) survey. The stellar mass evolution of both central and satellite galaxies was discussed in a self-consistent way.
  • Wang et al. (2012) used the method of Wang et al. (2009) to reconstruct the cosmic velocity field and the associated tidal field in the SDSS DR7 (Abazajian et al. 2009) survey volume, including redshift distortions.
  • Wang et al. (2013) combined a Hamiltonian Monte Carlo Method (HMC: Hanson 2001; Taylor et al. 2008) and a modified Zel'dovich approximation (MZA: Tassev & Zaldarriaga 2012). This method was applied to a mock catalog of SDSS DR7 galaxy populations associated with a halo catalog extracted from the Millennium Simulation (Springel 2005); the N-body code used is GADGET2 (Springel 2005).
  • Wang et al. (2014) further modified the model by combining the HMC method and a PM method widely used in N-body codes. This new model was compared with the previous one by testing it on N-body simulations. This paper constitutes the first publication of the ELUCID project.
  • Wang et al. (2016) applied the HMC + PM method to SDSS DR7 using a group catalog derived by Yang et al. (2005, 2007). This publication describes the first large-scale, high-resolution ELUCID simulation. This simulation was performed on the CPU nodes of π at the Center for High Performance Computing (CHPC) at Shanghai Jiao Tong University. The simulation ran for 178 h (7.4 days) on 128 nodes (2048 cores in total). Each of the 100 time outputs required about 865 GB of disk space.

Appendix B: Cross-identification: Consistency Issue and Correction

Despite the fact that tree-builders can be efficiently applied to the halo cross-identification problem, some further corrections are necessary to ensure self-consistency. Going back to the illustrations introduced in Figures 2(b) and 3, starting from a group of catalog A and tracing the double-tipped arrows, one does not necessarily end up at the same halo. We mentioned this possibility in the illustrations; in this appendix we show how this issue can occur and how we avoid it.

These cross-identification mismatches come from limiting associations and potential cross-identifications to a single candidate. At first glance such errors may seem improbable, since we use the same definition in both directions, but they can be explained through Figure 11(a). In this illustration, we have selected the two rightmost halos from each of the simulations A and B. On the right side, the three possible connections (A5, B7), (A5, B8), and (A6, B8) are represented with dashed lines, and their respective strengths (or merits) are labeled M1, M2, and M3. We see that if the condition $M1\gt M2\gt M3$ is satisfied, we obtain the associations described in the middle panel (comparing B to A) and the right panel (comparing A to B). Once the same criterion is applied to define the cross-identifications, we are faced with this inconsistency.

Figure 11. Illustration of a cross-comparison inconsistency and a possible correction. We ensure that the same answers are obtained after exchanging catalogs A and B. The configuration illustrated in the left panel is detected in order to obtain the final result displayed in the right panel.


Solving it simply requires ignoring at least one connection. One way to do so is to apply additional conditions for associating halos, or apply a threshold to the merit. For a particle-based merit function, we could define a criterion that would render inconsistencies impossible by construction. Let us first denote ${f}_{Y}^{X}={M}_{X\cap Y}/{M}_{X}$, and suppose our example has the following characteristics: ${f}_{{\rm{A}}5}^{{\rm{B}}7}\gt 0.5$, ${f}_{{\rm{A}}5}^{{\rm{B}}8}\gt 0.5$, and ${f}_{{\rm{B}}8}^{{\rm{A}}6}\gt 0.5$. By definition we have ${f}_{{\rm{A}}5}^{{\rm{B}}8}+{f}_{{\rm{A}}6}^{{\rm{B}}8}\leqslant 1$, so ${f}_{{\rm{A}}6}^{{\rm{B}}8}\lt 0.5$. Imposing the conditions ${f}_{Y}^{X}\gt 0.5$ and ${f}_{X}^{Y}\gt 0.5$ would remove the (A6, B8) connection and the inconsistency itself. By construction we also have ${f}_{{\rm{B}}7}^{{\rm{A}}5}+{f}_{{\rm{B}}8}^{{\rm{A}}5}\leqslant 1$. Unless ${f}_{{\rm{B}}7}^{{\rm{A}}5}={f}_{{\rm{B}}8}^{{\rm{A}}5}=0.5$, at least one additional connection is removed.
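
As a concrete check of this criterion, here is a small numerical illustration in Python; the particle counts are hypothetical, chosen only so that the conditions listed above hold.

```python
# Hypothetical particle counts for the configuration of Figure 11(a).
masses = {"A5": 160, "A6": 50, "B7": 110, "B8": 100}
shared = {("A5", "B7"): 90, ("A5", "B8"): 70, ("A6", "B8"): 30}

for (a, b), n in shared.items():
    f_in_b = n / masses[a]  # fraction of halo a's particles inside b
    f_in_a = n / masses[b]  # fraction of halo b's particles inside a
    kept = f_in_b > 0.5 and f_in_a > 0.5
    print(f"({a}, {b}): {f_in_b:.2f}, {f_in_a:.2f} -> "
          f"{'kept' if kept else 'removed'}")
# Only (A5, B7) survives: the double cut removes (A6, B8) as
# required, but also the additional connection (A5, B8).
```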

This selection criterion can thus be too severe, as it may remove connections that do not lead to inconsistencies. In this paper we used the normalized shared merit function, which translates as ${{ \mathcal M }}_{m}(X,Y)={f}_{Y}^{X}\times {f}_{X}^{Y}$. As we still suppose that $M1\gt M2$, we have ${f}_{{\rm{A}}5}^{{\rm{B}}7}\times {f}_{{\rm{B}}7}^{{\rm{A}}5}\gt {f}_{{\rm{A}}5}^{{\rm{B}}8}\times {f}_{{\rm{B}}8}^{{\rm{A}}5}$. We still have ${f}_{{\rm{A}}5}^{{\rm{B}}7}\gt 0.5$ and ${f}_{{\rm{A}}5}^{{\rm{B}}8}\gt 0.5$, and we know that either ${f}_{{\rm{B}}7}^{{\rm{A}}5}$ or ${f}_{{\rm{B}}8}^{{\rm{A}}5}$ has to be lower than 0.5. We could expect that ${f}_{{\rm{B}}8}^{{\rm{A}}5}\lt 0.5$ and that the stronger connection remains. But with ${f}_{{\rm{A}}5}^{{\rm{B}}7}$ being much larger than ${f}_{{\rm{A}}5}^{{\rm{B}}8}$ (${f}_{{\rm{A}}5}^{{\rm{B}}7}\sim 1$ and ${f}_{{\rm{A}}5}^{{\rm{B}}8}\sim 0.5$), we could have ${f}_{{\rm{B}}7}^{{\rm{A}}5}\lt 0.5\lt {f}_{{\rm{B}}8}^{{\rm{A}}5}$ while respecting the ordering of the merits. The stronger connection (between A5 and B7) would then be removed, and only the second best (between A5 and B8) might remain. A more flexible variation of this criterion would be necessary to avoid removing connections that one may find important. The conditions ${f}_{Y}^{X}\geqslant 0.5$ and ${f}_{X}^{Y}\geqslant 0.5$ imply that ${{ \mathcal M }}_{m}(X,Y)\geqslant 0.25$. This threshold of 0.25 is quite efficient in the majority of cases, but it does not rule out either ${f}_{Y}^{X}\lt 0.5$ or ${f}_{X}^{Y}\lt 0.5$. This fiducial threshold is not sufficient to avoid all possible issues, but too large a selection threshold would strongly limit the cross-identified catalog.
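
With the same hypothetical counts as above, the normalized shared merits and the fiducial 0.25 cut behave as follows (a sketch, not the paper's actual pipeline):

```python
def shared_merit(n_shared, m_x, m_y):
    """Normalized shared merit M_m(X, Y) = f_Y^X * f_X^Y,
    i.e., n_shared^2 / (m_x * m_y)."""
    return (n_shared / m_x) * (n_shared / m_y)

print(shared_merit(90, 160, 110))  # M1 (A5, B7): ~0.46, kept
print(shared_merit(70, 160, 100))  # M2 (A5, B8): ~0.31, kept even
                                   # though the fraction of A5's
                                   # particles in B8 is below 0.5
print(shared_merit(30, 50, 100))   # M3 (A6, B8): ~0.18, rejected
```

In this configuration the 0.25 cut removes only the weakest connection, leaving the inconsistency of Figure 11(a) in place, which motivates the explicit correction described next.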

An alternative approach is to maximize the number of cross-identified pairs. Since the cross-comparison is structured as a tree, it is straightforward to detect problematic configurations such as the one illustrated in Figure 11(a). Once the pattern is found, one can simply ignore the second connection M2 and correct the problem automatically, as displayed in Figure 11(b). In the end, we favor a cross-identification over a simple association. The advantage of this method is that it is easily applied to any kind of merit function, whether the criterion is based on distance or on traced mass. The downside is that we may artificially build cross-identifications from low-strength connections. For example, we could have ${f}_{{\rm{A}}6}^{{\rm{B}}8}\lt 0.1$, which would make ${{ \mathcal M }}_{m}({\rm{A}}6,{\rm{B}}8)\lt 0.05$. Applying a threshold to the merit function thus remains relevant even after the inconsistencies have been edited out. This is the correction we implemented for this paper. This way we ensure that all inconsistencies are avoided, independently of any value we may choose for the merit threshold.
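
A minimal sketch of this correction, assuming the merit table is available as a plain dictionary (the function name and data layout are hypothetical, and only the configuration pictured in Figure 11(a) is handled; the mirrored case would be treated symmetrically):

```python
def cross_identify(merits):
    """Build cross-identified pairs, dropping conflicting connections.

    merits: dict {(id_a, id_b): merit}. Whenever a halo of B points
    at a halo of A that already belongs to a reciprocal pair (the M2
    connection of Figure 11(a)), that connection is removed and the
    associations are rebuilt.
    """
    merits = dict(merits)  # local copy we can edit
    while True:
        best_ab, best_ba = {}, {}  # best candidate in each direction
        for (a, b), merit in merits.items():
            if merit > best_ab.get(a, (None, -1.0))[1]:
                best_ab[a] = (b, merit)
            if merit > best_ba.get(b, (None, -1.0))[1]:
                best_ba[b] = (a, merit)
        # Reciprocal best matches form the cross-identified pairs.
        pairs = {(a, b) for a, (b, _) in best_ab.items()
                 if best_ba.get(b, (None,))[0] == a}
        # Halos of B whose best candidate is already paired elsewhere.
        conflicts = [(a, b) for b, (a, _) in best_ba.items()
                     if (a, b) not in pairs
                     and any(pa == a for pa, _ in pairs)]
        if not conflicts:
            return pairs
        for connection in conflicts:
            del merits[connection]

# With the counts above, merits ~ {("A5", "B7"): 0.46,
# ("A5", "B8"): 0.31, ("A6", "B8"): 0.18}: the first pass pairs
# (A5, B7) and drops the conflicting M2 connection (A5, B8); the
# second pass then cross-identifies (A6, B8).
```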
