Identification of Galaxy Protoclusters Based on the Spherical Top-hat Collapse Theory

We propose a new method for finding galaxy protoclusters that is motivated by structure formation theory and also directly applicable to observations. We adopt the conventional definition that a protocluster is a galaxy group whose virial mass M vir < M cl at its epoch, where M cl = 1014 M ⊙, but would exceed that limit when it evolves to z = 0. We use the critical overdensity for complete collapse at z = 0 predicted by the spherical top-hat collapse model to find the radius and total mass of the regions that would collapse at z = 0. If the mass of a region centered at a massive galaxy exceeds M cl, the galaxy is at the center of a protocluster. We define the outer boundary of a protocluster as the zero-velocity surface at the turnaround radius so that the member galaxies are those sharing the same protocluster environment and showing some conformity in physical properties. We use the cosmological hydrodynamical simulation Horizon Run 5 (HR5) to calibrate this prescription and demonstrate its performance. We find that the protocluster identification method suggested in this study is quite successful. Its application to the high-redshift HR5 galaxies shows a tight correlation between the mass within the protocluster regions identified according to the spherical collapse model and the final mass to be found within the clusters at z = 0, meaning that the regions can be regarded as the bona fide protoclusters with high reliability. We also confirm that the redshift-space distortion does not significantly affect the performance of the protocluster identification scheme.


Introduction
Galaxy clusters are typically defined as the objects that are bound and dynamically relaxed with total mass of M tot > 10 14 M ⊙ (e.g., Overzier 2016).As the progenitors of present-day galaxy clusters, protoclusters must have formed in the densest environments in the early universe, and the majority of the galaxies in protoclusters probably have formed and evolved earlier than those in other environment (Kaiser 1984).
Many observational efforts have been made to search for protoclusters at high redshifts.Deep-field spectroscopic survey is a direct approach for finding protoclusters (e.g., Steidel et al. 1998Steidel et al. , 2000Steidel et al. , 2005;;Lee et al. 2014c;Toshikawa et al. 2014;Cucciati et al. 2014;Lemaux et al. 2014;Chiang et al. 2015;Diener et al. 2015;Wang et al. 2016;Calvi et al. 2021;McConachie et al. 2022).However, the survey volume should be very large to include many of such rare objects, and spectroscopic observations are currently too time-consuming to carry out large-volume blind surveys for the deep universe.Therefore, large-area imaging surveys have been often con- Lee et al. ducted to search for overdense regions at high redshifts by utilizing the narrow-band photometry for emissionline galaxies or the photo-z/dropout technique (e.g., Shimasaku et al. 2003;Ouchi et al. 2005;Toshikawa et al. 2012Toshikawa et al. , 2016;;Cai et al. 2017;Toshikawa et al. 2018;Shi et al. 2019;Yonekura et al. 2022).
Some energetic events are expected to happen in overdense regions at high redshifts.High-z radio galaxies are believed to be the potential progenitors of brightest cluster galaxies and, thus, they are assumed as a proxy for protoclusters (Pascarelle et al. 1996;Le Fevre et al. 1996;Venemans et al. 2002Venemans et al. , 2004Venemans et al. , 2005Venemans et al. , 2007;;Hatch et al. 2011b,a;Hayashi et al. 2012;Cooke et al. 2014;Shen et al. 2021).Although it is still debated (see Husband et al. 2013;Hennawi et al. 2015), high-z QSOs are also known to trace overdense regions (Djorgovski et al. 2003;Wold et al. 2003;Stevens et al. 2010;Falder et al. 2011;Adams et al. 2015).Lyα blobs can be lit by a huge amount of ionized photons emitted from AGNs or starburst galaxies in dense regions which still bear sufficient cold gas as a fuel.High-z submillimeter galaxies are regarded as the progenitors of massive ellipticals (e.g., Lilly et al. 1999;Fu et al. 2013;Toft et al. 2014).Therefore, Lyα blobs or overdensity regions of submillimeter galaxies are also used as the indicators of protocluster regions (Stevens et al. 2003;Greve et al. 2007;Prescott et al. 2008;Daddi et al. 2009;Prescott et al. 2012;Umehata et al. 2014Umehata et al. , 2015;;Oteo et al. 2018;Cooke et al. 2019;Rotermund et al. 2021;Álvarez Crespo et al. 2021).Gas absorption lines are another probe of protoclusters that does not rely on galaxy distribution: high-z overdense regions that still contain plenty of intergalactic neutral hydrogen can be detected by examining the Lyα forests in the spectra of background QSOs or star-forming galaxies (e.g., Lee et al. 2014b;Stark et al. 2015;Cai et al. 2016Cai et al. , 2017;;Newman et al. 2022).
While the observations targeting protoclusters have used a variety of selection techniques, they commonly focus on the identification of overdense regions.The protoclusters that are expected to eventually form massive clusters with the total mass of M tot > 10 15 M ⊙ have the overdensity of δ ∼ 10 − 12 for typical galaxies or Lyα emitters within an aperture radius of R ∼ 15 cMpc at z ∼ 2 − 3 (e.g., Lemaux et al. 2014;Cucciati et al. 2014;Cai et al. 2017).Toshikawa et al. (2018) identify protocluster candidates in a wide field of > 100 deg 2 by selecting the regions that show the galaxy overdensity significance level higher than 4σ within an aperture radius of R ∼ 16 cMpc at z ∼ 3.8.This significance level corresponds to the overdensity of the regions ending up forming halos of M halo ≳ 5 × 10 14 M ⊙ .The overdensity significance level is adopted to achieve ∼ 80% reliability, at the cost of completeness (Toshikawa et al. 2016) Several theoretical studies have been conducted to examine the properties of protocluster regions.Chiang et al. (2013) and Muldrew et al. (2015) investigate the matter and galaxy overdensity in the areas enclosing protoclusters using the semi-analytic model of Guo et al. (2011) based on the Millennium simulation (Springel et al. 2005).In the two studies, protoclusters are traced using halo merger trees.They show that the protocluster galaxies are more widespread in larger clusters, and the distribution of protocluter galaxies largely shrink during z = 4 − 2. Chiang et al. (2013) also show that, in a top-hat box of (15 cMpc) 3 , the galaxy overdensity of protoclusters strongly correlates with final cluster mass.Wang et al. (2021) develop a method to identify protoclusters from halo distribution of an N-body simulation using an extension of the Friend-of-Friend (FoF) algorithm.They show that the approach reasonably recovers protoclusters with high completeness.
Hydrodynamical simulations are also used to study the formation and evolution of clusters of galaxies.Given that the mean separation of rich clusters is ∼ 70 cMpc (Bahcall & West 1992), it is thus necessary to use a simulation box larger than about 1 cGpc 3 to study the formation and evolution of Coma-like clusters accurately and with high statistical significance.However, due to the limitation of the current computing resources, it has been nearly impossible to conduct hydrodynamical simulations in such a large box while keeping a resolution below ∼ 1kpc.As a compromise between the need for the extremely large dynamic range and the limited computing resources, the zoom-in technique is widely adopted in the hydrodynamical simulations for galaxy clusters (Bahé et al. 2017;Choi & Yi 2017;Truong et al. 2018;Yajima et al. 2022;Trebitsch et al. 2021).In these simulations, cluster regions are pre-identified and zoomed in the initial conditions, and protoclusters are traced by using merger trees.
It should be noted that, in the previous studies, protoclusters have been defined inconsistently between observations, theories, and numerical simulations.If a protocluster is defined as the group of all the objects that will eventually collapse into a cluster, their initial distribution typically spans more than tens of cMpc (Chiang et al. 2013;Muldrew et al. 2015Muldrew et al. , 2018)).In this definition, protoclusters can be neither self-bound nor compact, and thus a protocluster is hardly viewed as a physical object in which galaxies are associated with each other in a common environment.Furthermore, diachronic information is not available in observations.Therefore, observers have focused on the identification Lee et al. of sufficiently overdense regions.This is justified by the fact that larger structures in the current universe are more likely to originate from more massive progenitors at high redshifts (Chiang et al. 2013;Muldrew et al. 2015).The range of overdense region varies between protoclusters.Since the virial radius only encloses the objects which are already bound to the local density peak, it inevitably misses a number of progenitors which are still in the course of infall, outside the virialized regions.Because the proto-objects of larger clusters are more extended (Muldrew et al. 2015), a systematic approach is required to define the boundary (or spatial extent) of protoclusters, which should be based on the physical conditions of specific environments of interest.
This study aims at proposing a new scheme for the identification of protoclusters that is motivated by structure formation theories, and also applicable to observations directly.Our prescription is justified and calibrated on a cosmological hydrodynamical simulation Horizon Run 5 (hereafter HR5, Lee et al. 2021;Park et al. 2022).HR5 covers a volume of (1048.6 cMpc) 3 with a spatial resolution down to about 1 kpc.Thanks to its large volume, HR5 enables us to look into the formation and evolution of galaxies in a wide range of environments.By taking advantage of HR5, we derive a scheme applicable to observations to find the centers of protocluster candidates based on the spherical top-hat collapse (SC) model.The scheme also defines the physical region of a given protocluster as the volume within the turnaround radius from their centers.The turnaround radius is the zero-velocity surface at which gravitational infall counterbalances the local Hubble expansion (Gunn & Gott 1972).
This paper is organized as follows.In Section 2, we briefly introduce the HR5 simulation, a structure finding and a tree building algorithm, and the scheme to identify clusters using a low resolution version of HR5.In Section 3, we present the methodology to find the candidate regions for protoclusters from the galaxy distribution.The method for finding the boundary of protoclusters is presented in Section 4. We discuss and summarize this study in Section 5. Additional details of structure identification, merger tree building schemes, the SC models, and protocluster identification are given in Appendix.

Horizon Run 5
HR5 is a cosmological hydrodynamical zoomed simulation aiming at covering a wide range of cosmic structures in a 1.15 cGpc 3 volume, with a spatial resolution down to ∼ 1 kpc.We adopt the cosmological parameters of Ω m = 0.3, Ω Λ = 0.7, Ω b = 0.047, σ 8 = 0.816, and h = 0.684 that are compatible with the Planck data (Planck Collaboration et al. 2016).We generate the initial conditions using the MUSIC package (Hahn & Abel 2011), with a second-order Lagrangian scheme to launch the particles (2LPT; Scoccimarro 1998;L'Huillier et al. 2014).HR5 is conducted using a version of the adaptive mesh refinement code RAMSES (Teyssier 2002) upgraded for an OpenMP plus MPI two-dimensional parallelism (Lee et al. 2021).We generated a number of random sets and selected the one that reproduced the theoretical baryonic acoustic oscillation features most closely.While the volume of the zoomed region is still somewhat insufficient for accurate statistical analyses of the most massive galaxy clusters and the impact of the very large-scale structures, the whole simulation box does manage to encompass the relevant large-scale perturbation modes, and provides us with a representative volume corresponding to the input cosmology.
The volume of HR5 is set to have a high-resolution cuboid zoomed region of 1048.6 × 119.0 × 127.2 cMpc 3 crossing the center of the volume.The effective volume of the region is ∼ (260 cMpc) 3 .The cosmological box has 256 root cells (level 8, ∆x = 4.10 cMpc) on a side and the zoomed region has 8192 cells (level 13, ∆x = 0.128 cMpc) along the long side in the initial conditions.The high-resolution region initially contains 8192×930× 994 cells and dark matter particles, and is surrounded by the padding grids of levels from 12 to 9. The dark matter particle mass is 6.89×10 7 M ⊙ in the zoomed region, and increases by a factor of 8 with a decreasing grid level.The cells are adaptively refined down to ∆x ∼ 1 kpc when their density exceeds eight times the dark matter particle mass at level 13.HR5 was proceeded through z = 0.625.
Physical processes driving the evolution of baryonic components are implemented in subgrid forms in RAMSES.Gas cooling is computed using the cooling functions of Sutherland & Dopita (1993) in a temperature range of 10 4 − 10 8.5 K and fine-structure line cooling is computed down to ∼ 750 K using the cooling rates of Dalgarno & McCray (1972).RAMSES approximates cosmic reionization by assuming a uniform UV background (Haardt & Madau 1996).The statistical approach of Rasera & Teyssier (2006) is adopted to compute a star formation rate.Supernova feedback affects the interstellar medium in thermal and kinetic modes (Dubois & Teyssier 2008) and AGN feedback operates in radio-jet and quasar modes, relying on the Eddington ratio (Dubois et al. 2012).Massive black holes (MHBs) are seeded with an initial mass of 10 4 M ⊙ in grids when gas density is higher than the threshold of star formation and no other MBHs is found within 50 kpc (Dubois et al. 2014b).MBHs grow via accretion and The white dotted lines display the Lagrangian volumes enclosing the dark matter particles that end up forming clusters at z = 0 in HR5-Low.We assume that all the objects in HR5 located inside the same Lagrangian volume are the progenitors of the corresponding cluster.The thickness of the projected volume is 8.2 cMpc (top), 13.8 cMpc (middel), and 21.5 cMpc (bottom), fully containing each cluster in the projected direction.In the right panels, M enclosed presents the total mass enclosed by the Lagrangian volume.All the objects inside the Lagrangian volume are traced back to high redshifts using their merger trees in this study.Lee et al. coalescence, and their angular momentum obtained from the feeding processes are traced (Dubois et al. 2014a).Metal enrichment is computed using the method proposed by Few et al. (2012) based on a Chabrier initial mass function (Chabrier 2003), and in particular the abundance of H, O, and Fe are traced individually.One can find further details of HR5 in Lee et al. (2021).

Identification of Clusters Using a Low Resolution Simulation
We identify FoF halos and self-bound objects embedded in FoF halos using PGalF (Kim et al. 2022).We also construct the merger trees of self-bound objects using ySAMtm (Jung et al. 2014;Lee et al. 2014a) based on stellar particles for galaxies and dark matter particles for halos that contain no stars.The details of the structure finding and tree building algorithms are given in Appendix A.
In this study, we define a galaxy cluster as the virialized object that has acquired the total mass of M tot > 10 14 M ⊙ at or before z = 0.The mass cut is adopted following the conventional mass range of galaxy clusters (e.g., Overzier 2016), and can be varied if a different mass range is necessary.Protoclusters are the progenitors of galaxy clusters that have not reached the clusterscale mass range yet.By this definition, both clusters and protoclusters can be found at any epoch.According to this definition of a galaxy cluster, we cannot directly identify all the clusters and protoclusters in HR5 as the simulation stopped at z = 0.625.At this redshift, we find 63 clusters with M tot > 10 14 M ⊙ in the zoomed region.Objects having mass contamination higher than 0.7% by the lower level particles are excluded.However, there can be many structures that are not massive enough to be identified as clusters at z = 0.625 but will evolve to cluster-scale halos by z = 0.
To find clusters and protoclusters in the last snapshot of HR5 (i.e.z = 0.625), we additionally conduct a lowresolution simulation HR5-Low (∆x ∼ 16 kpc) based on the initial conditions and the model parameters used in HR5.We identify structures from the snapshots of HR5-Low at z = 0 and 0.625 using PGalF.At z = 0, we find 2,794 objects of M 0 tot ≥ 10 13 M ⊙ and 189 objects of M 0 tot ≥ 10 14 M ⊙ with the contamination tolerance mentioned above.The dark matter particles are traced back to z = 0.625 using their IDs, to search for the progenitors of the clusters.We then construct the Lagrangian volume (hereafer LV, for details see Oñorbe et al. 2014) of the progenitors using the uniform cubic grids enclosing the dark matter particles finally assembling the clusters.We assume that the LVs constructed from HR5-Low also enclose the clusters or protoclusters in HR5.We present the details of the identification scheme and reliability of this approach in Appendix B. Figure 1 shows the dark matter distribution in three HR5-Low cluster regions at z = 0 (left), the same regions of HR5-Low (middle), and HR5 (right) at z = 0.625.The structure colored in yellow is the FoF halo of each cluster (left), its progenitors at z = 0.625 (middle), and its counterpart in HR5 (right panels).The grids enclosed by dotted lines mark the LVs of the objects constructed by tracing the dark matter particles.This figure demonstrates that the two simulations are in good agreement despite their different resolutions.The position of a structure may show a slight offset between the two different resolution simulations (HR5-Low and HR5) at z = 0.625 partly due to the adaptive time step in RAMSES.

Identification of protoclusters
We define 'protoclusters' as galaxy groups whose total mass within R vir is currently less than 10 14 M ⊙ at their epochs but would exceed that limit by z = 0.The physical extent of a protocluster is defined as the spherical volume within the turnaround radius or the zerovelocity surface.The concept is schematically visualized in Figure 2. A protocluster is located at the center of a sphere that has the mean density δ and encloses the total mass exceeding 10 14 M ⊙ .The critical overdensity δ sc m = δ is given by the spherical top-hat theory, and the mass contained is the expected virial mass of the region at z = 0.It should be noted that only the galaxies within the turnaround radius are called the protocluster member galaxies, and that the cluster progenitor galaxies can be spread out to much larger radii.
We first identify the authentic proto-objects by tracing their merger trees in Section 3.1, and then present a systematic approach for finding the candidate regions enclosing protoclusters from the galaxy distribution in a snapshot, without diachronic information, in Section 3.2.

Identification of Proto-objects using Merger Trees
We search for the bona-fide progenitors of each cluster or protocluster of HR5 at z = 0.625 by tracing backward their merger histories.All the progenitors of each object are identified in all snapshots.Note that we do not call all the progenitors the protocluster galaxies, as protocluster galaxies will be defined as those within the turnaround radius.We define the most massive galaxy among the progenitors in a snapshot as the central galaxy.Thus, the central galaxy of a protocluster may change over time, depending on their mass accretion history.
The bottom panel of Figure 3 shows the distribution of the galaxies belonging to clusters or protoclusters cluster progenitor galaxies RTA Rvir protocluster member galaxies Rvir < l a t e x i t s h a 1 _ b a s e 6 4 = " A j E a c S B t 2 c F J u Z 7 2 / c P O M tot 10 14 M Protocluster cluster member galaxies Figure 2. A schematic diagram presenting the definition of galaxy protoclusters and clusters.Clusters are groups of galaxies with Mvir currently greater than 10 14 M⊙.Protoclusters are those with Mvir < 10 14 M⊙ currently, but will have Mvir ≥ 10 14 M⊙ by z = 0.The future virial mass is estimated from the total mass within the region having the mean overdensity δ equal to the critical overdensity δ sc m for complete collapse at z = 0 predicted by the spherical top-hat theory.The physical volume of protoclusters is defined to be the region within the turn-around radius RTA.
in comoving space at z = 0.625.The upper four panels show their progenitors that are traced along merger trees.Red, yellow, and blue dots mark the galaxies with M ⋆ > 10 11 M ⊙ , 10 10 − 10 11 M ⊙ , and 10 9 − 10 10 M ⊙ , respectively.It can be seen that the overall locations of protoclusters hardly change over time: the initial conditions are essentially preserved for these massive objects sitting at deep gravitational potential minima.On the other hand, the systems of cluster progenitor galaxies have been monotonically shrinking since z ∼ 2.4.However, at redshifts higher than z ∼ 2.4, their extent is roughly static at the value of R ∼ 10 − 30 cMpc, and the systems start to fade away.The three redshifts of z = 2.4, 3.1, and 4.5 are the target redshifts of the ODIN survey for LAEs at z = 2.4, 3.1, and 4.5 (Ramakrishnan et al. 2022).We will discuss the results of this study mainly at these redshifts.

Identification of Protocluster Candidates based on the Spherical Top-Hat Collapse Model
In this subsection, we propose a systematic method to identify the candidate regions enclosing protoclusters from galaxy distribution based on the SC models.

Overdensity threshold for complete collapse at z = 0
We define protocluster candidate regions as the spherical volumes that enclose total mass greater than 10 14 M ⊙ and will collapse completely at z = 0 according to the overdensity threshold given by the spherical top-hat collapse model.We will search for the centers of protoclusters inside the spherical regions.
In the spherical top-hat collapse model, an overdense region at an epoch will contract into a point at some stage if its overdensity is equal to the critical threshold density.We find this threshold density as a function of redshift for two types of cosmology.In the Einstein de-Sitter (EdS) universe with Ω m = 1, a homogeneous density sphere that collapses at z = 0 reaches its maximum radius at z = 0.59 with δ sc m = 9π 2 /16 − 1 ≃ 4.55, where δ sc m is the spherical top-hat matter overdensity.See Appendix C for more details.For comparison, the linear theory predicts overdensity δ lin m ≃ 1.062 at t max in the EdS universe.
On the other hand, the SC model does not have an exact analytic solution in a flat universe with a nonzero cosmological constant, i.e., Ω m + Ω Λ = 1.We thus numerically solve the second order non-linear differential equation of the spherical top-hat overdensity δ sc m given in Pace et al. (2010): (1) where the derivatives are with respect to the expansion factor a, and E(a) = H(a)/H 0 = Ω m /a 3 + Ω Λ , where H(a) and H 0 are the Hubble parameter at the epoch of an expansion factor a and z = 0 (a = 1), respectively.The density parameters of Ω m = 0.3 and Ω Λ = 0.7 are adopted in this calculation.We numerically search for the initial conditions δ sc,i m and δsc,i m = δ sc,i m /a i at a i = 10 −3 that lead to δ sc m → ∞ at z = 0, and find a solution δ sc,i m = 2.16 × 10 −3 .The evolution of δ sc m is shown in Figure 4 (dashed line).For the general flat universe with non-zero Ω Λ , a fitting formula for the numerical solution of the SC model for the objects collapsing at z = 0 is given in Appendix C.

Overdensity of the HR5 regions to be collapsed
The SC model gives insight into the evolution of overdensities based on a simple assumption of homogeneous density distribution in a spherical region.However, in the real universe, structures are generally not spherical nor homogeneous.To examine if the simple assumption is applicable to practical cases, we compare the critical overdensity predicted by the SC model with the actual overdensity of the spherical region at a high redshift that encloses M 0 tot , the total mass of each cluster at z = 0 measured in the HR5-Low simulation.The sphere is centered at the most massive galaxy among all the cluster progenitors at the redshift.
The open circles in the upper panel of Figure 4 show the mean matter overdensity within the radius R(M 0 tot ) from the most massive progenitor of each of 189 HR5-Low clusters.It should be noted that δ m [R(M 0 tot )] for HR5 clusters agrees quite well with the prediction of the SC model (dashed line) at all redshifts in the flat ΛCDM universe.This result demonstrates that the SC model is remarkably accurate in the ΛCDM universe at the mass scale of galaxy clusters, and thus the critical density threshold is applicable to identify protocluster regions.

Identification of the regions enclosing protoclusters
We have shown a good agreement between the spherical top-hat overdensity predicted by the SC model and that actually measured for the HR5 clusters.However, to propose a protocluster identification scheme applicable to observations, it is necessary to find the relation between the total mass and stellar mass at the cluster mass scale.For the clusters with log M 0 tot /M ⊙ > 14 the bottom panel of Figure 4 shows the stellar-mass to totalmass ratio M ⋆ /M tot within the spherical region having the critical overdensity of δ sc m at redshift z.Open diamonds are the ratio when only the stars of the galaxies Identification of Protoclusters Lee et al. m for collapse at z = 0 predicted by the spherical top-hat collapse model in the ΛCDM universe.Bottom: Ratio of stellar mass to total mass within the protocluster regions whose mean overdensity is equal to the critical value δ sc m of the ΛCDM cosmology.Open circles show the ratios computed from entire stellar mass, and open diamonds are calculated from the galaxies with M gal,⋆ > 2 × 10 9 M⊙.The dotted curves are the fitting functions given in Equation 2.
with M gal,⋆ > 2 × 10 9 M ⊙ are used, and open circles are those when all stars are taken into consideration.We provide a fitting formula for the stellar-total mass relation in the following form: (2) This formula can fit the ratio well as a function of redshift with (α, β, γ) = (−0.055,1.903, −1.915) when the galaxies of M gal,⋆ > 2 × 10 9 M ⊙ are used (shown as the dotted curve fitting the diamonds in the bottom panel of Figure 4).When all stellar components are used (open circles), the best fit is made with the parameter set (−0.057, 1.755, −1.855).We note that the stellar-tototal mass relation is insensitive to mass in the case of the proto-objects of M 0 tot > 10 13 M ⊙ .This is because the region having the mean overdensity δ sc m is typically so large that the ratio converges to a value at a given redshift.The stellar-to-total mass ratio relation can be changed if the parameters of subgrid physics regulating star formation activities are changed.Therefore, the relation needs to be calibrated based on observations.
The protocluster identification starts with finding the candidate regions that enclose protoclusters.At a given epoch, we visit galaxies starting from the most massive ones, and inspect the spherical volume centered at the galaxy.The radius of the sphere is increased until the overdensity drops to the critical value δ sc m at that epoch.If the total mass contained within the sphere exceeds 10 14 M ⊙ , the galaxy can be assumed as a candidate for the center of a protocluster.The fitting formula in Equation 2 is used to convert the observed stellar mass to the total mass.
The central candidate galaxies do not always locate at the density peak of each sphere.Thus, we compute the center of mass (CM) from all the galaxies with M gal,⋆ > 2×10 9 M ⊙ located inside the spherical regions.To find the most representative center of galaxy distribution, we iterate the identification process until the CM converges to |⃗ x i−1 − ⃗ x i | < ϵ,, where ⃗ x i is the CM at i th iteration.In this study, we adopt ϵ = 0.25 cMpc for efficient searching since a smaller ϵ does not notably affect the results.A sphere is selected as a region enclosing protoclusters when it finally has M tot ≥ 10 14 M ⊙ after the iteration process.
In dense environment, the separations between the centers of the protocluster candidates can be very small.We combine a protocluster candidate region i with another one j if D ij /R i < 1.0 or D ij /R j < 1.0, where D ij is the distance between the centers and R i and R j are the radii of the spheres within which the mean overdensity meets δ sc m .In this case, we define the most massive sphere as the central one, and accordingly, M tot of the central one is set as the estimated total mass of a spherical region group (SRG).

Reliability of the Protocluster Identification Scheme
We assume that the objects in the spherical regions of protoclusters identified based on the SC model eventually form cluster-scale objects by z = 0. We evaluate the reliability of this approach by comparing the total mass of an SRG (M SRG tot ) at a redshift z with the mass M 0,SRG tot that ends up being inside clusters at z = 0.The latter is estimated using the final total mass weighted by the stellar mass of the cluster progenitor galaxies found  3. Larger dots are the SRGs with cluster-scale mass (M 0,SRG tot ≥ 10 14 M⊙).The black solid, dotted, and dashed curves delineate the region of average final mass of M 0,SRG tot = 10 14 , 10 14.25 , and 10 14.5 M⊙, respectively.The SRGs in the upper left corner demarcated by red lines are discarded in this work as protocluster candidates.
within the SRG as follows: where G is the set of the galaxies enclosed by an SRG, P i is the set of the progenitor galaxies of a cluster i, M (P i ) is the mass sum of P i , and M 0 tot,i is the final total mass of cluster i.The relation between M SRG tot and M 0,SRG tot tells us how reliably the spherical top-hat model predicts the final mass of enclosed objects.
It is reasonable to expect that the growth history of an SRG can be affected by its environment and the above relation may depend on the history.So we inspect if the final mass depends on both M SRG tot and mass growth environment.As a proxy of the environment, we choose D 1 /R SRG , where D 1 is the distance to the nearest neighbor SRG and R SRG is the radius of the target SRG.An SRG should have the total mass larger than half the total mass of the target SRG of interest to be qualified as a neighbor.
Figure 5  , which justifies our use of the spherical overdensity criterion for identifying the protocluster centers.In particular, 90% of the SRGs whose M SRG tot is larger than 10 14.2 M ⊙ end up having M 0,SRG tot > 10 14 M ⊙ , indicating that they probably contain the authentic protoclusters.This illustrates the high reliability of our identification scheme.This figure also demonstrates that the final mass to be included in clusters is rather independent of the environment represented by the nearest neighbor SRG distance.We find, however, that the purity slightly improves if we discard the isolated small-mass SRGs with M SRG tot < 10 14.15 M ⊙ and D 1 /R SRG > 2.5 (the region enclosed by double dot-dashed lines).Based on these criteria, we examine the purity and completeness of our approach in identifying the bona-fide protoclusters in Appendix D. We find that the identification scheme recovers the authentic protoclusters with high reliability.We also show in Appendix E that the redshift-space distortion (RSD) does not significantly affect the performance of the protocluster identification scheme.

Protocluster member galaxies within Turnaround Radius
In numerical simulations and theories, it is relatively easy to define a protocluster as a group of objects that eventually contracts and forms a cluster.As described in Section 3.1, the progenitors of cluster galaxies can be traced using their merger trees in numerical simulations, and the corresponding protoclusters can be identified.
However, as shown in Figure B3, the progenitor galaxies of clusters are widespread up to ∼ 30 cMpc at high redshifts and it is not reasonable to adopt all the progenitor galaxies as the physically-associated members of protoclusters.Most observations identify protoclusters by finding sufficiently overdense regions of galaxies (see Overzier 2016, and references therein).However, there has been no consensus on the value of the overdensity defining the membership of protocluster galaxies.Applying the virial radius in identifying protocluster galaxies is not so desirable as protoclusters are supposed to be the objects still under the process of formation and virized regions of protoclusters tend to vanish quickly as redshift increases.
We thus propose to define the protocluster member galaxies as those within the zero proper velocity surface from protocluster center.The distance from a density peak to the zero-velocity surface is dubbed the turnaround radius R TA .The turnaround radius is the distance to the spherical surface on which the gravitational infall counterbalances the Hubble expansion (Gunn & Gott 1972).The turnaround radius provides a theoretically motivated overdensity for defining the protocluster region, and also makes protoclusters physical objects where their member galaxies can have some degree of conformity.In this section we present a scheme for finding R TA from observed galaxy distribution.To measure R TA from the protocluster centers in HR5, we construct the matter (dark matter, gas, and stars) density and peculiar velocity fields on a uniform grid with pixel size of ∆x = 0.128 cMpc.The proper radial velocity v r at r 1 relative to a local density peak at r 0 is given as follows:

Turnaround Radius
where r = r 1 − r 0 , H(z) is the Hubble parameter at redshift z, e r is the unit vector of r, and v is the peculiar velocity at r 1 relative to the mean velocity of matter within |r|.The turnaround radius is measured by finding the radius of a shell on which the average v r becomes zero.
As an illustration, Figure 6 shows the matter-density and velocity fields of a HR5 protocluster region at four redshifts.The blue and yellow circles indicate R vir and R TA , respectively, centered at the most massive galaxy in the field at each epoch.Arrows are the proper velocity vectors projected onto a 4 cMpc-thick slice centered at the galaxy.The overdensity of the protocluster increases with time, and consequently, both R TA and R vir increase with time too.It can be seen that R vir contains only the very center of the protocluster and becomes uninterestingly too small at high redshifts.On the other hand, R TA is much larger than R vir , does separate the inner collapsing region from the outer expanding space, and embraces the high density region of intersecting filaments of galaxies.In this sense R TA defines the outer boundary of the protocluster and the galaxies within R TA can be called its 'members'.Even though protocluster members are identified only within a spherical region, their distribution is quite anisotropic as the region encloses connecting filaments.
Figure 7 shows R TA of the HR5 protoclusters and protogroups at four redshifts as a function of their final total mass at z = 0. R TA has a good correlation with the final mass.The tightness of the correlation increases toward low redshifts.The linear Pearson correlation co-efficient is 0.634 at z = 4.5 and this increases to 0.81 at z = 1.0 in the log R TA − log M 0 tot /M ⊙ plane.We have also checked if the turnaround radii measured from the most massive galaxies in SRGs are accurate compared to those of bonafide protoclusters, and find that more than 80% of the SRGs have R TA identical to that of the bonafide protoclusters (Appendix F).

Correlations between Turnaround Radius, Virial Mass, and Viral Radius
In this section we study the general nature of the turnaround radius by inspecting its relation with the virial mass and radius.The turnaround radius is known to be 3-4 times the virial radius of massive objects in the local universe (Mamon et al. 2004;Wojtak et al. 2005;Rines & Diaferio 2006;Cuesta et al. 2008;Falco et al. 2013).The virial mass of an object is defined as M vir = 4πr 3 vir ∆ c ρ c /3, where r vir is the virial radius within which the mean matter density is ∆ c times the critical density of the universe ρ c = 3H 2 /8πG, where H is the Hubble parameter at z and ∆ c is computed using the fitting formula derived by Bryan & Norman (1998) for the cosmology with Ω Λ > 0: where x = Ω m (z) − 1.Meanwhile, the total mean radial velocity at r from the center of a bound object is the sum of the Hubble expansion velocity and mean infall peculiar velocity: ⟨v r ⟩ = H(z)r + ⟨v infall (r)⟩, where ⟨v infall (r)⟩ is the averaged radial velocity of matter in a spherical shell at radius r.
In the region where the Hubble flow starts to dominate and the total mean radial velocity becomes positive, Falco et al. (2014) found a good approximation for the infall velocity profile as follows: where v vir = GM vir /r vir is the circular velocity at r vir , and a and b are free fitting parameters.The best fit values found are a = 0.8 ± 0.2 and b = 0.42 ± 0.16 at z = 0 in the N-body simulations of a ΛCDM universe with Ω m = 0.24 and h = 0.73 (Falco et al. 2014).Since ⟨v r ⟩ = 0 at the turnaround radius, the ratio of R TA to r vir can be reduced to R TA /r vir = (a ∆ c /2) 1/(b+1) by combining the equations above with r = R TA and ⟨v r ⟩ = 0. Thus, the ratio R TA /r vir is expected to be ∼ 4.3 and in the range of 3.1 -6.2 at z = 0.
We now inspect the relation of R TA with M vir or R vir directly for the HR5 protocluster/group regions.Measurements are made relative to the most massive galaxy in each region.Figure 8 demonstrates the tight correlation between R TA and the virial mass at each epoch.Objects are distinguished in color according to their total mass at z = 0.It can be noticed that the relation moves slowly downward with time and R TA decreases at the same virial mass at lower redshifts.
A weak evolution of the turnaround-to-virial radius ratio can be seen in Figure 9 for protoclusters (red, M 0 tot ≥ 10 14 M ⊙ ) and the proto-groups (blue, M 0 tot = 10 13 − 10 14 M ⊙ ).The median of the ratio slowly decreases from 4.8 at z = 6 to 3.9 at z = 0.625 for protoclusters or clusters (red).The decreasing rate of the ratio is higher at z < 2 than before as ∆ c significantly lowers.The ratio also decreases a little faster for protogroups.This seems to be caused by the disturbance of   velocity field that becomes more severe for smaller mass objects at lower redshifts.The major origin of this weak redshift dependence will be discussed in the next section.Our measurement of R TA /R vir at z = 0.625 is consistent with the ratio range of 3.1-6.2derived based on the semi-analytic approach of Falco et al. (2014).

Matter Overdensity within Turnaround Radius
The tight correlation between R TA and R vir implies nearly constant overdensity within R TA at z > 2. We measure the average matter overdensity of the HR5 proto-objects inside the sphere of radius R TA .Figure 10 presents the matter overdensity δ TA m as a function of R TA for all proto-objects.The large dots mark the protoclusters and small dots are proto-groups with the final mass of M 0 tot = 10 13 − 10 14 M ⊙ .The turnaround radius

Identification of Protoclusters
Lee et al.
R TA of protoclusters can temporarily decrease and δ TA m can jump up when they undergo close encounters with neighbors.In order to mitigate the impact of such temporal events, we choose to use the lower boundary (bottom 5%) of the distribution of δ TA m shown in Figure 10 for the threshold overdensity corresponding to R TA .When protoclusters have close neighbors, the radius found with the lower boundary will be somewhat larger than the actual turnaround radius directly measured, and the protocluster regions are allowed to overlap.The bottom 5% of the distribution of δ TA m are 4.96, 5.04, 5.30, and 6.55 at z = 4.5, 3.1, 2.4, and 1.0, respectively.The median and 1σ dispersion are 5.63 (σ = 0.58), 5.98 (1.01), 6.17 (1.07), and 7.71 (1.91), respectively.
Like R TA /R vir , δ TA m also weakly evolves over time, with small scatter for protoclusters.It should be noted that δ TA m hardly depends on R TA or the final cluster mass of the protoclusters (large dots).On the other hand, δ TA m of the low mass structures with relatively small R TA shows stronger evolution.The scatter of δ TA m at small R TA emerges when the field of interest is disturbed by neighbouring structures.
Figure 11 illustrates the evolution of four HR5 protoclusters representing different total mass scales at z = 0. Dotted circles mark the turnaround radii, and properties shown are stellar mass density and age, gas density and metallicity.Similar to Figure 6, Figure 11 again shows that the volume within the turnaround radius does encompass the interesting large-scale structures connected to the protocluster cores.It can be noticed in Figure 11 that R TA is not always larger for the protoclusters with larger mass.It is also possible for R TA to decrease temporarily when mergers happen.This is a desirable nature of R TA as it is supposed to define the member galaxies of protoclusters and separate them from approaching nearby objects.However, during close interactions with neighbors, R TA becomes smaller and δ TA m tends to increase.The upward scatter of the proto-objects in Figure 10 can be attributed to such events.

Stellar mass to total mass conversion within Turnaround radius
We define the outer boundary of protoclusters as R TA , which is the turnaround radius enclosing the threshold overdensity given by Equation 7 below.We will use an empirical relation between the total mass and stellar mass within R TA so that the definition can be applied to observations.Figure 12 shows the redshift evolution of δ TA m of the HR5 protoclusters (top) and the stellar-to-total mass ratio within the turnaround radius, M TA ⋆ /M TA tot , averaged over the HR5 protoclusters.The stellar mass is obtained from all stars (red open circles) or only for the galaxies with M gal,⋆ > 2 × 10 9 M ⊙ (blue open circles).
The overdensity δ TA m delineating the bottom 5% of the distribution at z can be fit well by the following formula: Figure 13.The turnaround radius estimated from the stellar mass distribution using the relations shown in Figure 12 versus the directly-measured turnaround radius of protoclusters.The former is based on the δ TA m of the bottom 5%.The turnaround radii estimated by using all stars, and using only the galaxies more massive than 2 × 10 9 M⊙ are marked by red and blue dots, respectively.at z < 1.5 due to decrease of the Hubble parameter and disturbance by neighboring structures.
The stellar-to-total mass ratio within the turnaround radius can be also fit well by Equation 2 with (α, β, γ) = (−0.0092,2.027, −1.962) when all stellar mass is counted, or with (−0.0128, 1.882, −2.017) when only the stellar mass in the galaxies with M gal,⋆ > 2 × 10 9 M ⊙ are used.We use this fitting formula to derive the total mass from the stellar mass within a radius from each protocluster center, and find the radius within which the mean total mass density reaches the predicted δ TA m at the given redshift (i.e.Equation 7).This gives the estimated turnaround radius.
Figure 13 compares the directly measured R TA with R est TA estimated from stellar mass.They correlate quite well for both cases when all stellar mass is counted in R est TA or only the stellar mass in the galaxies with M gal,⋆ > 2 × 10 9 M ⊙ is used.R est TA tends to be larger than R TA as expected, particularly for relatively smaller mass protoclusters, because we use bottom 5% δ TA m .At z = 4.5, when protoclusters have only a few galaxies above our stellar mass threshold, the correlation breaks.This necessitates to include the small mass galaxies with M gal,⋆ < 2 × 10 9 M ⊙ at z ≳ 4 for accurate estimation of R TA and reliable identification of protocluster environment.

Summary and Discussion
In this paper we have proposed a practical method to find galactic protoclusters in observational data, and demonstrated its validity to the protoclusters in the cosmological hydrodynamical simulation HR5.We first define 'protoclusters' as galaxy groups whose total mass within R vir is currently less than 10 14 M ⊙ at their epochs but would exceed that limit by z = 0. Conversely, 'clusters' are the groups of galaxies whose virial mass currently exceeds 10 14 M ⊙ .Therefore, there can be a mixture of clusters and protoclusters at z > 0. The extent of a protocluster is defined as the spherical volume within the turnaround radius or the zero-velocity surface.The future mass that a protocluster would achieve at z = 0 is estimated using the spherical top-hat collapse model.The whole concept is schematically visualized in Figure 2.
Our protocluster identification method is summarized as follows: 1. Visit galaxies starting from the most massive ones, and measure the mean total mass density within radius R. The total mass is obtained from the stellar mass by using the conversion relation in Equation 2.
2. Find the radius where the mean density drops to the threshold density given by the SC model.Equation C6 is a useful fitting formula for the threshold overdensity δ sc m .
3. Adopt the galaxy (or nearby density peak) as a protocluster center candidate if the total mass included within the radius is greater than 10 14 M ⊙ .Group the spherical regions if their separation is less than their radii.Protocluster centers are now identified.
4. The protocluster region is defined as the spherical volume from the protocluster center up to the turnaround radius.The turnaround radius is the radius where the mean overdensity drops to the threshold value given by Equation 7. The stellar mass to total mass conversion within R TA is made using Equation 2, with the parameters given in Section 4.4.
HR5 used in this paper adopts a flat ΛCDM cosmology with Ω m = 0.3 and Ω Λ = 0.7.As the threshold density given by the spherical top-hat collapse model is used to find the protocluster centers, it will be useful to check how sensitive the threshold is to the cosmology adopted.We examine how δ sc m changes depending on the matter density parameter while keeping the geometry of the universe flat and fixing the dark energy equation of state parameter to −1.Our choice of HR5 is based on the Planck data (Planck Collaboration et al. 2016).This is close to the recent measurement of Dong et al. (2023) who used the extended Alcock-Paczyński test to obtain Ω m = 0.285 +0.014 −0.009 .In Figure 14, δ sc m for four choices of Ω m , i.e., 0.25, 0.3, 0.35, and 1, safely bracketing the recent observational values, are plotted.The figure shows that δ sc m differs on average only by ∼12% at z = 6 − 2 among the flat ΛCDM models with Ω m from 0.25 to 0.35.Therefore, the threshold density used for finding the protocluster centers is not very sensitive to the choice of the matter density parameter, when the current tight constraint on the parameter is taken into account.
To estimate the reliability of this prescription, we use the clusters at z = 0 with M 0 tot ≥ 10 14 M ⊙ and groups with 10 13 M ⊙ < M 0 tot < 10 14 M ⊙ identified in HR5-Low, a low-resolution version of HR5.There are 2,794 objects with M 0 tot > 10 13 M ⊙ in the zoomed region of HR5, and among them 189 are clusters.Merger trees are constructed for these objects, and all progenitor galaxies are identified.We apply our protocluster identification scheme to the galaxy distributions at four simulation snapshots of z = 4.5, 3.1, 2.4, and 1, being motivated by the ODIN survey of Lyman-α emitters.We find a tight correlation between the mass within the protocluster regions identified in accordance with the SC model, and the final mass to be situated within clusters at z = 0.In particular, it is highly likely (probability ≳ 90%) for a protocluster region to evolve to a cluster if the region contains a total mass greater than about 2 × 10 14 M ⊙ , meaning that the region is likely to be the authentic protocluster.
We have defined the outer boundary of protoclusters as the zero-velocity surface at the turnaround radius.Even though protocluster members are identified within a spherical region, their distribution is quite anisotropic as the region encloses numerous filaments beaded with galaxies.The definition would make sense if the galaxies within the turnaround radius do share some physical properties, which is not found for those outside.In the next study, we will examine the physical properties and evolution of the protocluster galaxies based on the definition proposed in this study.acknowledgments J.L. is supported by the National Research Foundation of Korea (NRF-2021R1C1C2011626).C.P. and J.K. are supported by KIAS Individual Grants (PG016903, KG039603) at Korea Institute for Advanced Study.BKG acknowledges the support of STFC through the University of Hull Consolidated Grant ST/R000840/1, access to viper, the University of Hull High Performance Computing Facility, and the European Union's Horizon 2020 research and innovation programme (ChETEC-INFRA -Project no.101008324).This work benefited from the outstanding support provided by the KISTI National Supercomputing Center and its Nurion Supercomputer through the Grand Challenge Program (KSC-2018-CHA-0003, KSC-2019-CHA-0002).This research was also partially supported by the ANR-19-CE31-0017 http://www.secular-evolution.org.This work was supported by the National Research Foundation of Korea(NRF) grant funded by the Korea government (MSIT, 2022M3K3A1093827).Large data transfer was supported by KREONET, which is managed and operated by KISTI.This work is also supported by the Center for Advanced Computation at Korea Institute for Advanced Study.

Appendix A Structure Finding and Merger Trees
We use a galaxy finder PGalF introduced by Kim et al. (2022) to extract self-bound and stable galaxies from the snapshots of HR5.PGalF is devised to identify the Friend-of-Friend group of particles from the distribution of heterogeneous particles, i.e., star, MBH, gas, and dark matter in HR5.For the mixture of various types of particles, PGalF uses an adaptive linking length to connect a pair of particles of different species or masses.PGalF identifies self-bound substructures in the FoF halos.We classify a substructure as a galaxy when it contains stellar particles.To find galaxies from a FoF halo, PGalF first constructs an adaptive stellar density field and hierarchically determine the membership of the particles Lee et al. bound to the galaxies centered at stellar density peaks.A bound particle is eventually assigned to a galaxy when it is located inside the tidal boundary of the galaxy.We note that a galaxy identified in this process is generally composed of heterogeneous particles.For the substructures with no stellar particles, a similar process is conducted for the rest matter species.For a full description on the method, refer to Kim et al. (2022).
Since stellar or dark matter particles carry their own unique identification numbers (IDs) throughout the simulation runs, we are able to trace the progenitors/descendants of substructures between two time steps.A branch of a merger tree is described using the binary relation between the two sets of all stellar particles in two snapshots, motivated by the Set theory.First, we define S i as a set of all stellar particles at time step, t i .Then, where "new stars" are those created between time steps, t i−1 and t i .We define G j i as the group of star particles of the j'th galaxy at time step i.Because a stellar particle is never destroyed in HR5, S i−1 ⊆ S i .Our galaxy finder dictates that where n is the total number of galaxies identified in time step i.The left-hand and right-hand sides of the equation are not always equal due to unbound stray stellar particles which are not bound to any galaxies.
We associate galaxies between two snapshots by mapping a set of stellar particles (a galaxy) at a time step into sets of stellar particles (galaxies) at the next time step using ySAMtm (Jung et al. 2014;Lee et al. 2014a).In ySAMtm, we define the j'th galaxy as the main descendant of the k'th galaxy when satisfying the mapping: where P (G j i+1 |G k i ) is the fractional number of stellar particles of the k'th galaxy to be found in the j'th galaxy.Multiple galaxies in time step i are allowed to have a common main descendant in time step i + 1 once the mapping is satisfied or in short f (j) = f (k) for j ̸ = k.Now we consider the reverse mapping as ) which denotes that the j'th galaxy in time step i − 1 is the main progenitor to k'th galaxy in time step i.Unlike the mapping f for the main descendant, in principle, multiple galaxies in time step i cannot have a common main progenitor in time step i−1.So, in this case g(j) ̸ = g(k) for all j ̸ = k.This is because we assume that a galaxy cannot be fragmented into multiple descendants in ySAMtm.
The mapping f is the left inverse mapping of g ; it can be defined more formally as, Here, equation (A4) means that the main descendant of a main progenitor is the galaxy itself.One the other hand, g is not left inverse mapping of f (Eq.A5) because of the case when the j'th galaxy is merged into its descendant.
Our tree building scheme does not allow two galaxies to have the same main progenitor (or g(j) ̸ = g(k) for j ̸ = k), but this usually happens when a galaxy flies by a more massive galaxy.To circumvent such cases, we remove the main progenitor mapping of the less massive galaxy (the flying-by one) and trace back its previous history until its actual main progenitor is found, using the most bound particle (MBP).The MBP is a particle that has the largest negative total energy in the galaxy (Hong et al. 2016) and, thus, we assume that the MBPs trace density peaks of galaxies.We use dark matter particles as the MBPs because, unlike stellar particles, they do not disappear when backtracking snapshots.We also use the MBP scheme to trace the substructures with no stellar particles.The merger trees of substructures are constructed by connecting the progenitor-descendant relations across the all snapshots.The progenitor/descendant relation of FoF halos is traced based on the merger trees of their most massive substructures.Further details of the tree buliding algorithm are given in Park et al. (2022).

Appendix B Identification of Cluster
Progenitors in HR5 In this section, we describe the details of the identification process of the clusters in HR5 using its low resolution simulation HR5-Low.While HR5 achieves a spatial resolution down to ∆x ∼ 1 kpc and minimum dark matter particle mass of m p ≃ 6.89×10 7 M ⊙ , HR5-Low is set to have a spatial resolution down to ∆x ∼ 16 kpc with a minimum dark matter particle mass of m p ≃ 3.02 × 10 9 M ⊙ .Because the main purpose of HR5-Low is to identify structures at z = 0, we use the parameters and initial conditions of HR5 without any modification or calibration.We identify structures from the snapshot at z = 0 and 0.625 of HR5-Low using PGalF.At z = 0, we find 2,794 halos in M 0 tot ≥ 10 13 M ⊙ and 189 halos in M 0 tot ≥ 10 14 M ⊙ with the number fraction of lower   level particles less than 0.1%, which ensures the mass contamination lower than 0.7%.The dark matter particles of the clusters are traced back to z = 0.625 using their IDs, to search for the progenitors of halos of M 0 tot ≥ 10 13 M ⊙ .We measure the LV of a cluster in terms of the Cartesian grids.In HR5-Low, we place a mesh of uniform cubic grids with ∆l = 0.512 cMpc over the entire volume of interest (the simulated zoomed region).To build a density field, we use the dark matter particles of cluster halos at z = 0.When dark matter particles in a grid do not belong to (or are not members to) a single cluster, the grid is finally associated with the cluster which contributes most to the grid mass.By utilizing the LV method with the HR5-Low data, we are able to define protocluster regions at an arbitrary redshift.
In the subsequent analysis we assume that the LVs of HR5 clusters are identical to the LVs of corresponding HR5-Low clusters.In the last snapshot of HR5, therefore, we are able to find structures inside the LVs directly imported from the HR5-Low clusters.We only use grids having mass larger than 10 10 M ⊙ because 97.5% of galaxies with M ⋆ ≥ 10 9 M ⊙ have M tot > 10 10 M ⊙ .This mass cut helps us minimize the contamination by noncluster progenitors in the LVs of the cluster progenitors at z = 0.625.
Figure B1 presents the relation between the cluster mass in HR5-Low at z = 0 and the corresponding LV mass M LV in HR5 at z = 0.625.The two masses are nearly same with a median scattering of ∼ 6%.The Identification of Protoclusters Lee et al. mass difference may be caused by matter that happens to be enclosed in the LVs but would not fall into the cluster at z = 0.
To examine the consistency or similarity in particle distributions between HR5-Low and HR5 especially on halo scales at z = 0.625, we identify an HR5 FoF halo which is spatially closest to the main progenitor of each HR5-Low cluster.Here, the progenitor of a cluster is determined by the scheme described in Section 2.2.
Figure B2 shows the relation of FoF halo masses between the main progenitors of clusters in HR5-Low and their counterparts in HR5 at z = 0.625.Except for two cases marked by A and B, all FoF halos in the two simulations have nearly same mass.We slightly overestimate the mass of FoF halos in HR5-Low compared to HR5 because of the purer mass resolution which tends to more easily destroy clumpy structures in the outskirts of halos.Here, A and B are the cases when substructures are distinguishable only within eitherHR5 or HR5-Low.We assume that, although rare, the adaptive linking length may cause the different FoF halo identification between two simulations at different resolutions.Alternatively, the different-resolution simulations may, of course, produce different particle distributions more often in the outskirts of halos especially around a close binary or a multiple system of halos.
Figure B3 shows R 95 , the radius enclosing 95% of stellar mass in cluster progenitors, as a function of the final total mass.The progenitors of more massive halos tend to have larger R 95 .The range of R 95 is consistent with Muldrew et al. (2015) who measure R 90 of protoclusters using a semi-analytic model of Guo et al. (2011).In this study, we suggest the turnaround radius as the physical size of protoclusters, instead of R 95 because R 95 measures merely the spatial extent of the distribution of progenitor galaxies.

Appendix C Spherical top-hat overdensity in
the ΛCDM and Einstein de-Sitter universe In the Einstein de-Sitter (EdS) universe with Ω m = 1, the outermost radius R of a sphere of mass M evolves over time t as follows: where G is the gravitational constant.This equation has the cycloidal solution: where t max is the time when the sphere reaches a maximum radius R max .In this solution, the spherical region collapses at the collapse time t c = 2t max (θ = 2π).The overdensity of the sphere at a given epoch derived from the analytic solution is given by (e.g., Peebles 1980;Suto et al. 2016): A homogeneous density sphere that collapses at z = 0 reaches its maximum radius at z = 0.59 with δ sc m = 9π 2 /16 − 1 ≃ 4.55 in the EdS universe.For comparison, the linear theory predicts overdensity δ lin m ≃ 1.062 at t max in the EdS universe.
In the flat universe with non-zero Ω Λ , the expansion factor of maximum radius a max can be derived using the formula (Peebles 1984;Eke et al. 1996): where ω = Ω Λ /Ω m and I(ω) is given from: where a c is the expansion factor at the time of collapse.These equations give a max = 0.56 in the case of a c = 1.0 (z = 0) and the overdensity at the epoch is interpolated as δ sc m = 5.85 for our choice of the ΛCDM universe.When Ω m = 1.0 and Ω Λ = 0, Equation 1 has the solution that is equal to the exact solution of the EdS universe case derived above.
Figure C1 shows the overdensity evolution in a homogeneous sphere that collapses at z = 0 in the EdS (blue) and ΛCDM (red) universe.The two filled stars indicate the overdensities at the epochs of maximum radius.Because dark energy counteracts gravitational collapse and the growth of overdensity is relatively slower, the sphere should have higher overdensity in the universe with Ω Λ > 0 than in the EdS universe, to be able to collapse by z = 0.
Since δ sc m does not have an exact analytic solution in the ΛCDM universe adopted, we find a formula that fits the numerical solution of the SC model for the objects collapsing at z = 0: (bottom).We define the purity as the number fraction of the SRGs enclosing bona-fide protoclusters (those identified based on merger trees) to the all SRGs more massive than a given mass.The completeness is the number fraction of the authentic protoclusters enclosed by SRGs above a given mass.In these statistics, we assume that an SRG recovers a protocluster when the most massive galaxy of the SRG is the member of the protocluster and half the galaxy mass of the protocluster is enclosed by the SRG.In this scheme, an SRG can be associated with only one protocluster.Color code in the bottom panels indicates the D 1 /R SRG parameter.Colored dots show the distribution of all the SRGs sample and black concentric circles mark the SRGs with M SRG tot > 10 14.15 or D 1 /R SRG < 2.5.These two different mass definitions are overall in good agreement, particularly at z < 4.
Their correlation becomes tighter with decreasing redshift as structures form and develop further.The completeness and purity show that more than 80% of protoclusters can be recovered by our scheme with ∼ 60% purity at z ∼ 2 − 3. The purity increases to 80% in M SRG tot ≥ 2 × 10 14 M ⊙ .At z = 4.5, however, these statistics are inevitably poorer than at lower z, because galaxies have not had time to develop yet.We note that the purity and completeness are enhanced by ∼ 10% if an SRG is allowed to associate with all the protoclusters in which half their galaxy mass is enclosed by the SRG.

Appendix E Redshift-Space Distortion Effect on the Protocluster Identification
The peculiar velocities of galaxies distort the distribution of galaxies in redshift space (e.g., see Guzzo et al. 1997;Hamilton 1998).We examine the impact of RSD on the protocluster identification scheme.In this test, we assume that a virtual observer has the line of sight aligned with the major axis of the HR5 zoom-in region.The redshift of a snapshot is assigned to the center of the zoomed region, and the cosmological redshifts of the galaxies in the snapshot are computed from the distance relative to a virtual observer at z = 0.The Doppler redshifts induced by the peculiar velocities of galaxies are added to the cosmological redshifts, and the distances to the galaxies are re-estimated from the combined redshifts.The standard deviations of the differences between the intrinsic and redshift-distorted distances are 2.3, 3.0, and 3.6 cMpc at z = 2.4, 3.1, and 4.5, respectively.Figure E1 presents the impact of the RSDs on the protocluster identification scheme.Since the large-scale peculiar velocity vector tends to point toward overdense regions, the galaxy distribution near a protocluster is statistically flattened along the line of sight in redshift space (Kaiser 1987).This results in slight overestimation of the overdensity and size of the top-hat spheres of dense regions.The final impact is that the completeness increases, at higher redshifts in particular, while the purity slightly decreases.The bottom panels of Figure E1 show that the RSD effect slightly increases the SRG mass, but overall distribution is similar between the cases with and without the RSD effects.These statistics are computed based on the assumption that an SRG is only associated with a protocluster.The purity and completeness can change if an SRG is allowed to recover multiple protoclusters.This result demonstrates that the RSD effect does not have significant impact on the protocluster identification scheme.D1, but for the cases with and without the RSD effect.In the bottom panels, the scatter between M 0,SRG tot and M SRG tot is similar between the two cases with and without the RSD effect.Upper panels show that the RSD effect lowers the purity while it slightly enhances the completeness at given mass.This is caused by the RSD effect that makes overdense regions look flattened in the redshift space (Kaiser 1987), resulting in the overestimation of the SRG radius.

Figure F1
. Relation of turnaround radius of bona-fide protoclusters (RTA) to turnaround radius measured from the most massive galaxies in SRGs (R SRG TA ).A protocluster is assumed to be associated with an SRG when half its galaxy mass is enclosed by the SRG.Scatter is caused when the most massive galaxy of an SRG is not the most massive one of its host protocluster.We find that ∼ 80% SRGs recover the RTA of enclosed protoclusters.

Figure 1 .
Figure1.Dark matter particles in three clusters found at z = 0 in HR5-Low (left), in their progenitors at z = 0.625 (middle), and in the same volumes in HR5 at z = 0.625 (right).In the left panels, the halos in yellow are the members of the clusters with M 0 tot ∼ 10 14 M⊙ (top), 10 14.5 M⊙ (middle), and 10 15 M⊙ (bottom).White horizontal bars illustrate the scale of 4cMpc.The white dotted lines display the Lagrangian volumes enclosing the dark matter particles that end up forming clusters at z = 0 in HR5-Low.We assume that all the objects in HR5 located inside the same Lagrangian volume are the progenitors of the corresponding cluster.The thickness of the projected volume is 8.2 cMpc (top), 13.8 cMpc (middel), and 21.5 cMpc (bottom), fully containing each cluster in the projected direction.In the right panels, M enclosed presents the total mass enclosed by the Lagrangian volume.All the objects inside the Lagrangian volume are traced back to high redshifts using their merger trees in this study.

Figure 4 .
Figure 4. Top: Matter overdensity inside the radius enclosing the final mass of protoclusters (M 0 tot > 10 14 M⊙) as a function of redshift.Dots are the medians, and scatter bars show 16 th − 84 th percentile distributions.Dashed line is the critical matter overdensity δ scm for collapse at z = 0 predicted by the spherical top-hat collapse model in the ΛCDM universe.Bottom: Ratio of stellar mass to total mass within the protocluster regions whose mean overdensity is equal to the critical value δ sc m of the ΛCDM cosmology.Open circles show the ratios computed from entire stellar mass, and open diamonds are calculated from the galaxies with M gal,⋆ > 2 × 10 9 M⊙.The dotted curves are the fitting functions given in Equation2.

Figure 5 .
Figure 5. Final total mass M 0,SRG tot (encoded by color) as functions of distance to the nearest spherical region group (SRG) D1 (normalized by RSRG) and its total mass M SRG tot at each redshift.Each dot indicates a SRG, and color represents the final mass M 0,SRG tot shows the final mass M 0,SRG tot (encoded by color of large circles) in the D 1 /R SRG versus M SRG tot space.Redder color indicates larger final mass.Small dots are the SRGs with M SRG tot ≥ 10 14 M ⊙ at redshift z but with M 0,SRG tot ≤ 10 14 M ⊙ at z = 0, namely failed protocluster candidates.The figure demonstrates a tight correlation of M SRG tot with M 0,SRG tot

Figure 6 .
Figure 6.Matter density and velocity fields within and in the vicinity of a protocluster at four epochs.Denser regions are brighter.The panels show matter distribution within ±2cMpc from the most massive galaxy along the projected direction.Blue and yellow circles indicate Rvir and RTA measured from the density peak, respectively.All the blue dots are the gravitationally self-bound objects with the total mass greater than 10 10 M⊙.Larger blue dots are the cluster progenitor objects, and among them, those with M⋆ > 2 × 10 9 M⊙ are marked by red open circles.

Figure 7 .
Figure 7. Turnaround radius RTA of the proto-objects as a function of the final total mass M 0 tot at z = 0 that is measured in HR5-Low.Contrary to R95, RTA gradually increases as dense regions grow in mass and the Hubble parameter decreases with decreasing redshift.

Figure 8 .
Figure 8. Relations between the turnaround radius RTA and virial mass Mvir at four epochs for the proto-objects that will have the final total mass of M 0 tot that is measured in HR5-Low.The final total mass is color-coded.Protoclusters are marked by large filled circles and non-protocluster objects are marked by small dots.

Figure 9 .
Figure 9. Ratio of the turnaround radius RTA to the virial radius Rvir as a function of redshift.The blue and red circles correspond to the structures with log M 0 tot /M⊙ = 13 − 14 and log M 0 tot /M⊙ > 14 at z = 0, measured from HR5-Low, respectively.The scatter bars show 16 th −84 th percentile distributions.This figure indicates that RTA/Rvir evolves very weakly before z = 2.

Figure 10 .
Figure 10.Matter overdensity within the turnaround radius of proto-objects at the four redshifts.Protoclusters (M 0 tot ≥ 10 14 M⊙ at z = 0, measured in HR5-Low) are marked by large filled circles and non-protocluster objects are marked by small dots.The color code presents the final total mass of proto-objects.The dashed and solid arrows indicate the medians and bottom 5% of δ TA m of protoclusters.The matter overdensity of protoclusters only weakly increases from δ TA m ≈ 5.0 (bottom 5%) at z = 4.5 to δ TA m ≈ 5.3 (bottom 5%) at z = 2.4 (see the text and Figure 12).

Figure 11 .
Figure11.Distribution of gas and stars in the regions of four protoclusters that end up forming clusters with M 0 tot ≈ 10 14 − 10 15 M⊙ at z = 0 that is measured in HR5-Low.The dotted circles mark the turnaround radius of the protoclusters.Metal poor gas is colored in green, and gas color becomes redder with increasing metallicity.Younger stars are colored in blue and older ones are yellow.Grayish shades display the regions filled with the hot medium with T > 10 6 K.The upper two panels are relatively zoomed, as indicated by the scale bars.

Figure 12 .
Figure12.Top: Redshift evolution of the overdensity within RTA (δ TA m ) for the protoclusters of M 0 tot > 10 14 M⊙ that is measured in HR5-Low.The open circles indicate the bottom 5% overdensity measured from the HR5 protoclusters and the dashed and solid lines are the fits to the median and bottom 5%.Bottom: ratio of stellar to total mass within RTA as a function of redshift.The red and blue open circles denote the ratios measured from all stars and from galaxies with M gal,⋆ > 2 × 10 9 M⊙, respectively.The dashed curves are the fits based on Equation 2 with fitting parameters given in section 4.4.
) where (a, b, c, d) = (0.168, 4.068, −0.381, −0.734), which is shown as the solid line in the top panel of Figure 12.The error of the fit is smaller than 0.9%.As shown in Section 4.3, δ TA m monotonically increases with time on average, and reaches a finite maximum at z = 0.The evolution of δ TA m is weak at z > 2, but becomes rapid Identification of Protoclusters Lee et al.

Figure 14 .
Figure14.The critical overdensity for complete collapse at z = 0 given by the spherical top-hat collapse model in the Einstein-de Sitter universe (EdS, red) and the flat ΛCDM universes with three different matter density parameters.

Figure B1 .
FigureB1.The relation between the total mass of the clusters found at z = 0 in HR5-Low and the LV mass at z = 0.625 in HR5.The LV mass is on average ∼ 6% higher than the cluster mass due to the matter that are contained in voxels at the epoch but will not form the clusters.

Figure B2 .
Figure B2.Relation between the total mass of the main progenitors (z = 0.625) of the clusters found at z = 0 in HR5-Low and the total mass of their counterparts in HR5.Halo A is the one that is identified as two separate structures in HR5-Low while a smaller one already becomes a substructure of the halo in HR5.Halo B is the opposite case.The halos in HR5 are ∼ 9% less massive than their counterparts in HR5-Low because their small neighboring structures are not well resolved in HR5-Low.

Figure B3 .
Figure B3.Radius that encloses 95% of the stellar mass of the proto-objects of the FoF halos identified at z = 0 as a function of their final mass that is measured from HR5-Low.The radius measurement is centered at the most massive galaxy in each proto-object.Red dashed and sold lines mark 16 th and 84 th percentiles and the median of R95 at a given final mass.

Figure C1 .
Figure C1.The critical overdensity of a homogeneous tophat sphere collapsing at z = 0 predicted by the spherical tophat collapse model in the ΛCDM (blue) and EdS (red) universe in logarithmic (top) and linear (bottom) scales.Stars indicate the epoch and overdensity when the sphere reaches its maximum radius in each universe.
Figure D1 demonstrates the completeness and purity of our protocluster identification scheme (top) and the relation between M 0,SRG tot

Figure D1 .
Figure D1.Bottom: Final mass of SRGs estimated from the final mass of bona-fide protoclusters (those identified based on merger trees) embedded in the SRGs (M 0,SRG tot ) as a function of the total mass of SRGs (M SRG tot ).Color code denotes D1/RSRG.Colored dots mark all the SRGs sample and black concentric circles indicate the SRGs with M SRG tot > 10 14.15 or D1/RSRG < 2.5.As also seen in Figure5, most protoclusters have D1/RSRG ≲ 4. We note that M 0,SRG tot is an estimated mass to examine the prediction accuracy of M SRG tot .Top: Purity (blue) and completeness (red) of the bona-fide protoclusters in the spherical regions found by the SC model as a function of M SRG tot .The purity is the number fraction of the SRGs enclosing bona-fide protoclusters to the entire SRGs above a given mass.The completeness is the number fraction of the authentic protoclusters which are recovered by SRGs and more massive than a given mass.

Figure E1 .
Figure E1.Same as FigureD1, but for the cases with and without the RSD effect.In the bottom panels, the scatter between M 0,SRG