COSMOS2020: Identification of High-z Protocluster Candidates in COSMOS

We conduct a systematic search for protocluster candidates at z ≥ 6 in the Cosmic Evolution Survey (COSMOS) field using the recently released COSMOS2020 source catalog. We select galaxies using a number of selection criteria to obtain a sample of galaxies that have a high probability of being inside a given redshift bin. We then apply overdensity analysis to the bins using two density estimators, a Weighted Adaptive Kernel estimator and a Weighted Voronoi Tessellation estimator. We have found 15 significant (>4σ) candidate galaxy overdensities across the redshift range 6 ≤ z ≤ 7.7. The majority of the galaxies appear to be on the galaxy main sequence at their respective epochs. We use multiple stellar-mass-to-halo-mass conversion methods to obtain a range of dark matter halo mass estimates for the overdensities in the range of ∼1011–1013 M ⊙, at the respective redshifts of the overdensities. The number and the masses of the halos associated with our protocluster candidates are consistent with what is expected from the area of a COSMOS-like survey in a standard Λ cold dark matter cosmology. Through comparison with simulation, we expect that all of the overdensities at z ≃ 6 will evolve into Virgo-/Coma-like clusters at present (i.e., with masses ∼1014–1015 M ⊙). Compared to other overdensities identified at z ≥ 6 via narrowband selection techniques, the overdensities presented appear to have ∼10× higher stellar masses and star formation rates (SFRs). We compare the evolution in the total SFR and stellar mass content of the protocluster candidates across the redshift range 6 ≤ z ≤ 7.7 and find agreement with the total average SFR from simulations.


INTRODUCTION
Galaxy clusters are the most massive (∼ 10 13 − 10 15 M ) and largest (∼ 1 − 10 Mpc) gravitationallybound structures in the Universe. In the present-day Universe, they contain up to thousands of bright (L ≥ L ) galaxies and reside in massive dark matter halos that, according to the ΛCDM paradigm of hierarchical structure formation, mark the sites of the greatest overdensities of matter (White & Rees 1978). In this paradigm, clusters are the last structures to virialize and arXiv:2210.17334v1 [astro-ph.GA] 31 Oct 2022 assemble (e.g., Sheth & Tormen 1999;Mo et al. 2010), and they are, therefore, expected to have a complex and prolonged formation history (e.g., Bower & Balogh 2004;Kravtsov & Borgani 2012;Overzier 2016). Observations tell us that most of the mass locked up in stars at the present time is found in massive elliptical galaxies, which are predominantly found in clusters (e.g., Dressler 1980). The same is not true for the cosmic star-formation rate today, where clusters contribute a negligible fractionthe bulk of their galaxy populations consist of red and quiescent ellipticals -and most of the star-formation takes place in spirals and late-type galaxies, which are typically found outside clusters (Poggianti et al. 1999). This alone suggests that the star-formation activity and stellar mass build-up in clusters must have peaked at earlier times (z ≥ 2), and we naturally expect the red and quiescent massive galaxies residing in clusters today to be in star-forming overdense environments at earlier times. In fact, young and star-forming galaxies are observed to be an increasingly dominant galaxy population in overdense regions at higher redshifts (z > ∼ 1.5, e.g., Elbaz et al. 2007;Cooper et al. 2008;Scoville et al. 2013), although red and massive quiescent galaxies continue to be present in some clusters up to z 2−3 (e.g., Kodama et al. 2007;Kubo et al. 2013).
Studies of present-day galaxy clusters are limited in their ability to probe the formation and evolution of clusters since key signatures of their formation history are erased in the final stages of assembly as the clusters, and the galaxies within them undergo transformational processes such as dynamical relaxation (virialization), mergers and tidal interactions (e.g., Zabludoff et al. 1996;Kodama et al. 2001). If we are to understand the emergence of clusters, therefore, we are best off searching for them in the high-redshift (high-z) Universe, where we can catch them during their formative stages as overdense regions of galaxies (often termed protoclusters; Overzier (2016)). Finding and investigating distant protoclusters is critical to understanding the formation and evolution of present-day clusters of galaxies (e.g., Toshikawa et al. 2012Toshikawa et al. , 2014Long et al. 2020;Calvi et al. 2021), as well as galaxy quenching caused by dense environment of clusters and formation of quiescent systems we see today (e.g., Boselli et al. 2016;Foltz et al. 2018). With a comoving abundance of ∼ 10 −7 cMpc −3 , however, high-z protoclusters are rare, and their discovery requires systematic surveys with extensive sky coverage. At z ≥ 2, protoclusters are typically identified using rest-frame optical color features in Lyman break galaxies (LBGs), narrow-band imaging and spectroscopic follow-up of Lyα emitters (LAEs) and/or Hα emitters (HAEs) (e.g., Steidel et al. 1998;Malhotra et al. 2005;Matsuda et al. 2011;Galametz et al. 2013;Capak et al. 2015). However, protoclusters at z ≥ 3 have also been found in large (sub-)millimeter surveys with single-dish telescopes such as the South Pole Telescope (SPT), the Herschel, and Planck space observatories, where subsequent high-resolution ALMA observations of the brightest sources have revealed overdensities of dust-enshrouded starburst galaxies (e.g., Clements et al. 2014;Oteo et al. 2018;Ivison et al. 2020;Hill et al. 2020;Wang et al. 2021).
According to state-of-the-art simulations of cosmic structure formation (e.g., Chiang et al. 2017;Lovell et al. 2018Lovell et al. , 2021Lagos et al. 2020;Springel et al. 2021), the growth of protoclusters occurred at the peaks of the dark matter density distribution during the Epoch of Reionization (EoR) at z ≥ 6, when the Universe was less than a billion years old. To probe the onset of protocluster formation, therefore, observations must be pushed back to the EoR, where galaxies, in their formative stages, are thought to have congregated around these density peaks to form protoclusters and eventually developed clusters. However, only fragments of this process have been observed. One limiting factor is that to date, there are only four spectroscopically confirmed protoclusters at z ≥ 6 with ≥ 10 confirmed member galaxies 1 . The first one was discovered as an overdensity of i -band dropouts, and subsequently, spectroscopically verified to reside at z = 6.01 (Toshikawa et al. 2012(Toshikawa et al. , 2014. The other three were discovered as LAE-overdensities at z = 6.54, z = 6.6 and 6.9 and subsequently spectroscopically confirmed (Chanchaiworawit et al. 2019;Harikane et al. 2019;Hu et al. 2021). While all four protoclusters have ≥ 10 confirmed member galaxies, none of them have the deep multi-band optical/near-IR data required to accurately characterize their galaxy populations (e.g., stellar masses, star-formation rates, and ages (Conroy 2013)). In addition to these four protoclusters, there is a tentative millimeter-selected galaxy-overdensity at z 6.9 (Marrone et al. 2018), consisting of two spectroscopically confirmed dusty galaxies with optical HST counterparts and three fainter sub-mm sources lacking both spectroscopic redshifts and optical/near-IR counterparts .
In this paper, we present the discovery of 15 massive protocluster candidates at 6 ≤ z ≤ 7.7 over a 1.7 deg 2 area in the Cosmic Evolution Survey (COSMOS (Scoville et al. 2007)) using the new release of the COSMOS catalog (COSMOS2020 (Weaver et al. 2022)). The can-didates were identified from the source list of the near-IR (izY JHK S ) selected catalog, in a manner similar to the z = 6.01 protocluster found by Toshikawa et al. (2012) but, being located in the COSMOS field, they have unique multi-band wavelength coverage and depth.
The paper is organized as follows. Section 2 briefly summarizes the COSMOS survey and our protocluster candidates selection criteria. Section 3 explains the methods used to identify protocluster candidates through galaxy overdensity analyses. Section 4 presents our findings and analysis of the protocluster candidates. We compare our candidates with galaxy overdensities found using traditional dropout selection techniques, and present estimates for the dark matter halo mass and the present-day mass of the candidates. Section 5 discusses if our dark matter estimates are in line with what we would expect for a COSMOS-like survey. We also compare the physical parameters of our candidates with other protoclusters from the literature. Section 6 summarizes the main findings and conclusions.
Throughout the paper, we have adopted a standard ΛCDM cosmology with H 0 = 70 km s −1 Mpc −1 , Ω m = 0.3, and Ω Λ = 0.70. All magnitudes are expressed in the AB system (Oke 1974). A Chabrier (2003) stellar Initial Mass Function (IMF) is used to present our results. Results are reported with 68% confidence interval uncertainties.

The COSMOS Survey
The Cosmic Evolution Survey (COSMOS) covers 2 deg 2 and boasts a plethora of deep multi-wavelength data in over 40 bands from the world's major facilities (Scoville et al. 2007). The new version of the COSMOS source catalog, COSMOS2020 (Weaver et al. 2022) 2 , constitutes a major improvement over the previous catalog (Laigle et al. 2016), in that it includes detections and new ultra-deep optical/NIR imaging and multi-waveband photometry of 1.7 million sources over the entire COSMOS field, with ∼ 89000 measured in all the available broadband filters. Key additions for COS-MOS2020 include new ultra-deep optical data from the Hyper Suprime-Cam Subaru Strategic Program (HSC-SSP) public data release 2 (PDR2; Aihara et al. (2019)), new data from the Visible Infrared Survey Telescope for Astronomy (VISTA) data release 4 (DR4; McCracken et al. (2012)), reaching up to one magnitude deeper over the full area than previous data, and the inclusion of all Spitzer IRAC data in the COSMOS (Moneti et al. 2021). Legacy data sets (such as the Suprime-Cam imaging) have also been reprocessed (see Weaver et al. (2022) for details). COSMOS2020 consists of two independent photometric catalogs. The first is the Classic catalog, where standard aperture photometry is performed (Bertin, E. & Arnouts, S. 1996) on PSF-homogenized optical/NIR images, except for the IRAC images where the software IRACLEAN (Hsieh et al. 2012) is used to do PSF photometry. The second catalog is created using a new profile-fitting photometric software developed specifically for COSMOS2020 called The Farmer. Using The Tractor software (Lang et al. 2016) for the source modelling, The Farmer generates reproducible source detection and photometry to generate a full multiwavelength catalog. In this paper, we have used The Farmer catalog throughout since it has more accurate photometry in different bands for fainter sources than the Classic catalog and appears to have a higher density of sources with photometric redshifts > 6 by almost a factor of two in the faintest NIR magnitude bins (Weaver et al. 2022). This also translates to more sources for The Farmer catalog, as seen in Fig. 1. This difference in high redshift sources could be explained by two factors, one is The Farmer is able to deblend more sources than Classic and the other is that for the Classic catalog, the apertures can be contaminated by stray blue (optical) light, while the same is not the case in The Farmer catalog, leading to more robust high photometric redshift solutions.

Photometric redshifts: LePhare and EAZY
Photometric redshifts and physical parameters, including stellar masses and star-formation rates, for the COSMOS2020 sources, were derived using two independent photometric redshift codes, LePhare (Arnouts et al. 2002;Ilbert et al. 2006) and EAZY (Brammer et al. 2008). Both codes were used to fit photometric redshifts (photo-z) and UV/optical to near-IR spectral energy distributions (SEDs) to the objects in COSMOS2020. The codes fit galaxy SED templates to the photometric data, corrected for Galactic extinction using the Schlafly & Finkbeiner (2011) dust map.
The codes have similar approaches to determining the photo-z as they iteratively derive multiplicative corrections to both the individual photometric bands and the SED templates. The main differences consist of the stellar population synthesis templates used and the method by which they are fit to the observed photometry. A list of all the magnitude offsets used to optimize the absolute calibration of the photometry for each band for LePhare, and EAZY and both The Farmer and Classic COS-MOS2020 catalogs are given in Weaver et al. (2022). While results between the two codes are in good overall agreement, LePhare has a lower outlier percentage (η) 3 as a function of the Normalized Median Absolute Deviation (σ NMAD ) and i-band magnitude bin Weaver et al. (2022). This is particularly the case for the faintest magnitude bin (25 − 27 magnitude in the i-band, see Fig. 15 in (Weaver et al. 2022)), where both σ NMAD and η are lower when applying LePhare on The Farmer catalog. For this reason we mainly use the LePhare photo-z, which, unless stated otherwise, we will henceforth refer to as the photo-z (for more details on the calculation of the outlier percentage and the differences between the COSMOS2020 catalog versions, see §5.3 in Weaver et al. (2022)). To test the consistency of the results we have also done our overdensity analysis with the EAZY photo-z and found general agreement between the maps at the 1σ level. The relevant overdensity maps are available upon request

Physical parameters: LePhare and EAZY
The physical properties of the COSMOS2020 galaxies, such as stellar masses and star-formation rates, are also available from the two relevant codes (see Weaver et al. (2022) for details on how these quantities are calculated). In this paper, however, we did not use the SFR estimates provided by the two codes. Tests showed significant run-to-run variation in the SFR-estimates, particularly for LePhare, and between the two codes. In the case of LePhare, the code is run twice, once to obtain the photo-z and then again with the photo-z fixed to obtain physical parameters such as stellar mass and SFR. For more details, see §6 in Weaver et al. (2022). LePhare imposes a declining SFH for both star-forming and quiescent galaxies (though at the redshifts we are considering, the delayed SFH is likely to be in the rising SFR epoch), whereas EAZY makes no assumption about the SFH. This has no effect on the stellar masses, where the two codes show good agreement, but it can strongly affect the SFR estimates. Neither of the codes includes far-IR photometry in the SED fitting, which may also lead to significant uncertainties in SFR estimates (see Laigle et al. (2019)). Since we use the LePhare photoz for our galaxy selection and to ensure our analysis is done as self-consistent as possible, we will use the stellar masses from LePhare throughout the paper. We will use the Ultra Violet (UV) luminosity to SFR conversion from Barro et al. (2019) for the SFR estimates, corrected for dust attenuation: SFR corr UV = (1.09 × 10 −10 )(10 0.4A2800 )(3.3L 2800 /L ), (1) where L 2800 and A 2800 are the UV luminosity and dust attenuation at rest-frame λ = 2800Å, respectively. Assuming a Calzetti et al. (2000) attenuation law, the UV attenuation at 2800Å can be inferred directly from the best-fit model to the overall SED so that A 2800 = 1.8A V , where A V is the extinction in the V -band. The COS-MOS2020 catalog provides L 2800 and A 2800 from the best fit SED made with EAZY. The same values are not provided for LePhare in the catalog. However, it is possible to obtain the flux (and thereby the luminosity, assuming a luminosity distance from the photo-z) at rest-frame λ = 2800Å from the best fit SED before the dust attenuation using the Calzetti et al. (2000) law is applied, effectively using eq. 1 with A 2800 = 0. Since they are already available in the catalog and we require the photo-z from the two codes to be similar (thereby requiring the SED fits to be similar by proxy), we will use the two EAZY properties to estimate the star-formation rates. The star-formation rates we get from this conversion and the LePhare fit both scatter around the expected main sequence from Speagle et al. (2014), with the conversion having a greater scatter overall. The Barro et al. (2019) relation has a scatter of 0.32 dex when compared with the Wuyts et al. (2011) relation that includes both the IR and UV. We have chosen not to include this scatter in our estimate and use the uncertainty for L 2800 and A 2800 given by EAZY.

Selection criteria and redshift binning
Our search for high-z protoclusters in the COS-MOS2020 catalog focuses on galaxies at z ≥ 6, as it goes from the End of Reionization (EoR) to the highest photo-z provided by the LePhare. The photo-z assigned to a given galaxy and used in its selection is the median of its redshift probability distribution, p(z), from LePhare. We use the flags in The Farmer catalog, so as to only include galaxies and not sources such as stars, sources that coincide with an X-ray source (based on Chandra COSMOS Legacy; Civano et al. (2016)), sources in masked-out region as defined in Weaver et al. (2022) (combining Hyper Suprime-Cam, Suprime-Cam and UltraVISTA masks) or sources where LePhare fails to fit the SED. We checked the catalog for any AGN sources with X-ray detection's and a LePhare AGN fit and found seven sources between 6.3 < z < 9.7. None of the sources are in an overdense environment and would not be inside any of our overdensities given our selection criteria, so we have chosen not to incorporate them in our analysis further.
To ensure that the LePhare photo-z we work with comes from a secure sample with good SED fits, we require the reduced χ 2 < 5 for all sources. The χ 2 criteria accounts for a small number of outliers with very high χ 2 values. We also investigated the relationship between the reduced χ 2 from LePhare and the difference between the median photo-z from LePhare and EAZY, to explore if there was a positive correlation between the two. We found no such correlation, with the majority of the sources having an absolute difference in photoz of < 0.5 between the two codes. There appears to be smaller groups of galaxies where the absolute difference in photo-z between the two codes is > 5. This is because EAZY places many ill-constrained SEDs at z > 11.5 or between z = 0 − 1, whereas LePhare places the same galaxies at z ∼ 6, as its p(z) is constrained between z = 0 − 10. To account for the possible bias that comes with using a single SED fitting code, we require that the absolute difference between the median LePhare and EAZY photo-z is < 0.5. This accounts for the majority of the galaxies with median LePhare photo-z between z = 6 − 10. The total number of galaxies in the sample is 3256 across z = 5.5 − 10. The redshift distribution of these galaxies is shown in Fig. 1 using the FARMER and CLASSIC catalogs.
Next, we will describe the redshift bins we are using to search for protocluster candidates, we initially describe our lowest redshift bin and then generalise to higher redshifts. We initially target a redshift bin centered on z = 6.05 with a bin width of ∆z = 0.2. At this redshift, the bin width is equivalent to a distance of ≈ 80 cMpc, which we will use for all the other bins (i.e., the redshift range will increase with redshift to a maximum of z = 9.81 ± 0.19 for the highest bin). This bin width is the same as the one used by Harikane et al. (2019) to search for LAE overdensities. Semi-analytical models in Chiang et al. (2017) suggest that the average protocluster size at z 6 is ≈ 10 cMpc, so we should be able to determine overdensities within our adopted redshift bin width. This bin size also encompasses the redshift range of other protoclusters found at z ≈ 6 (Toshikawa et al. 2014;Chanchaiworawit et al. 2019;Harikane et al. 2019;Hu et al. 2021).
To select galaxies with a high probability of being inside our redshift bin, we require that all galaxies inside the bin have their maximum p(z) value within 5% of their median value. This excludes four groups of p(z) that could otherwise affect our overdensity analysis: 1) p(z), which have narrow low-z solutions with peak values higher than the broader peak found inside the z-bin, 2) p(z) with broad distributions and median redshifts inside our z-bin, but with a peak on top of the broad trend that is more than 5% away from the median value, 3) double-peaked p(z) with median values that fall within the z-bin and 4) p(z) with a plateau with equivalent values on either side of the bin that span multiple redshifts, but has a peak more than 5% away from the median value. This selection does not take p(z) into account that have a plateau-like probability distribution similar to 4), but with their maximum p(z) value within 5% of the median value. We describe in §3.1 how to account for these objects. These selection criteria aim to obtain a sample with the best trade-off between having many galaxies, so as to have a representative estimate for the mean density field, and to have secure galaxies that have a high probability of being in our redshift bins.
In LePhare, the p(z) is only defined between z = 0 − 10, and we find a group of galaxies that are placed right at the z = 10 boundary with a narrow p(z). Furthermore, galaxies that fall within redshift bins very close to z = 10 will have a significant part of their p(z) cut off at z > 10. This can lead to overestimating the galaxy overdensity near these galaxies since their weights will be biased high because less of the total p(z) will be outside the redshift bin due to the cut-off at z = 10. We, therefore, chose to discard redshift bins higher than z = 9.23 since beyond this cut-off, it is not possible to say anything meaningful about the overdensities. This is because a large number of the galaxies in the higher bins (z = 9.59 ± 0.18 and z = 9.81 ± 0.19) either have a large fraction of its p(z) outside the z = 10 boundary for LePhare or in the case of the highest bin, the p(z) peaks right at the z = 10 boundary showing the code is not adequately able to model the probability distribution for these sources.

Possible false detections using The Farmer
The Farmer and CLASSIC use a combined izY JHK s (denoted CHI MEAN) to detect the sources. This makes it possible to detect sources in the combined image, which would not be detected at the same confidence in the individual images. There are also sources where there are no individual detection in the images, but there is a detection in the combined image. A small group of galaxies in our sample have no individual detections in the grizYJHKs bands The Farmer uses or the bands LePhare uses for SED fitting (see Weaver et al. (2022) for a full list), These galaxies all have low weights (see §3.1 for definition of weights) with none having a weight above 0.5. We also see an increase in the weight of galaxies as a function of the number of bands with detections. While The Farmer uses mul-5.5 6.0 6.5 7.0 7.5 8.0 8.5 9.0 9.5 10.0  Figure 1. The redshift distribution of galaxies between z = 5.5 − 10 in the COSMOS2020 catalog, based on the LePhare median photo-z and the selection criteria specified in §2.4. The histogram bin width is ∆z = 0.1. The black curve is for the FARMER catalog and the red curve is for the Classic catalog (see §2.1).
tiple bands to locate sources in the field, it is still possible for it to falsely detect spurious sources, especially around very bright sources. While bright sources have been masked out, there is always a trade-off regarding how much of the region around a bright object should be masked. Within the UltraVista area of the COSMOS field, 23079 sources detected with The Farmer catalog are not in the Classic catalog (for our sample, compare the number distributions at high redshifts in Fig. 1). These sources are mainly faint sources or sources that The Farmer de-blended. Most of the 23079 sources are fainter than the detection limit in the UltraVISTA bands while being brighter than the detection limit in the IRAC CH1 and HSC i bands. We produce izY 4 and JHK true color images for all significant overdensities we find in a given redshift bin. We searched for any sources located inside overdensities with no individual detection in the six bands that The Farmer uses, but did not find any. Upon inspection, we found that sources in the overdense regions that were only detected with The Farmer are real sources, clearly visible in the true-color images in at least one of the izYJHKs bands. We highlight the KsHJ-band (red, green, blue) true color images in appendix C.

Accounting for lower redshift interlopers
An indicator of possible low-z interlopers is strong FIR or mm detections. We cross-matched all the protoclus-ter candidate galaxies we found with the Spitzer/MIPS 24 µm/radio catalogs in COSMOS and found no robust detections in either.
We have also cross-matched all protocluster candidate galaxies with the A3COSMOS photometry and found no single dust continuum detection above a Signal-to-Noise ratio (S/N ) of 4.3. This S/N threshold is the criterion for a 50% false detection rate. There are in total 15 sources with dust continuum upper limits from the A3COSMOS photometry. There is only one S/N > 3 source, COSMOS2020-ID871193 with S/N ∼ 4, so considering it as an upper limit is still reasonable.

Weighted Adaptive Kernel
To search for galaxy overdensities, which indicate the presence of protoclusters, we use a Weighted Adaptive Kernel (WAK) Estimator (Darvish et al. 2015). This estimator uses an iterative process that computes the galaxy surface density field and takes into account the uncertainties related to the photometric redshifts of the galaxies by assigning a weight, w i , to each galaxy. The weight of a galaxy is assigned according to the percentage of its p(z) that falls within the chosen redshift bin. By only selecting galaxies whose median and maximum p(z) values fall within a given bin (see §2.4) and further weighing them in the aforementioned manner, we are conservative with our selection, preferentially selecting galaxies that have a high probability of being at their assigned redshift. A more lenient approach would be to assign a weight to all galaxies whose p(z) overlaps with the chosen redshift bin (e.g., Darvish et al. 2015Darvish et al. , 2020. In Appendix B we show the distribution of the weights for all the redshift bins we consider in our analysis (Fig.  17).
The galaxy number surface density,Σ i , at some gridposition, i, in a given redshift bin is calculated by summing over the weighted fixed kernels placed at the positions of the galaxies, j, within the bin, where i = j, i.e.:Σ where N is the number of galaxies in the bin, K(r i , r j , h) is the fixed kernel, r i is the position of the galaxy for which the estimate of surface density is measured, r j is the position of all other galaxies in the bin and w j the weights of all other galaxies in the redshift bin. The kernel function used is a 2D symmetric Gaussian, with a kernel width, h, that controls the smoothing. The kernel function is defined as follows: To select an optimal global kernel width, h, we maximize the so-called Likelihood Cross-Validation (LCV) quantity (Chartab et al. 2020), which is defined as: where N is the total number of galaxies in the given redshift bin and σ −k (r) is the kernel estimator computed at position r excluding the k'th galaxy 5 . Determining h in this way provides an optimal trade-off between a high-variance estimate (under-smoothing) and a highbias estimate (over-smoothing) (Chartab et al. 2020). We perform a grid search from 0.0001 deg to 0.1 deg with 1000 steps to determine the optimal h-value that maximizes LCV(h). An adaptive kernel width (h i ), which is a measure of the local surface density associated with each galaxy, is then used to account for the fact that a fixed kernel width would underestimate the surface density in crowded regions while overestimating in sparsely populated areas. The adaptive kernel width is defined as h i = h × λ i , where λ i is calculated as: where G is the geometric mean of all theΣ(r i ) values. By using the adaptive kernel, the surface density field can now be calculated on a 2D grid, r = (x, y) as: The overdensity is then calculated as δ OD = Σ− Σ Σ for all the grid points, as a measure of how high the surface density is over the background. 6 The resulting overdensity field corresponding to the z = 6.05±0.1 redshift-bin is shown in Fig. 2a. Contours are in steps of 1σ and all grid points below 1σ have the same color. All galaxies inside a given 4σ overdensity level contour are selected and investigated as potential protocluster member candidates. Here, σ is defined as the standard deviation of the overdensity value for the entire field.
To meaningfully investigate the properties of the protocluster candidates, we require that there are at least five or more galaxies inside a given 4σ contour. If there are fewer galaxies than that, we do not include the overdensity as a protocluster candidate. This approach of identifying multiple (in our case ≥ 5) galaxies with a high probability of being close together is supported by simulations. Simulations show that the best indicator of whether a protocluster at z > 4 will end up as a massive cluster with a present-day virial mass of M vir > 10 15 M is not an extreme star-formation rate or stellar mass of the galaxies at z > 4, nor the presence of a massive bright central galaxy (BCG). Instead, the best indicator is the galaxy number overdensity (Chiang et al. (2013); Muldrew et al. (2015); Remus et al. (2022)). This is because the number of galaxies and satellites associated with the protocluster traces the range of accretion of material from cosmic filaments.

Weighted Voronoi Tessellation Estimator
It is important when using kernel estimation to locate overdensities to check if the resulting galaxy overdensity map gives reasonable results. The most straightforward approach would be to inspect the distribution of galaxies in and around the overdensities, as we expect a relatively high number of galaxies associated with an overdensity compared to the rest of the field. This approach is complicated by the fact that galaxies can have low weights (i.e., a low probability of being in the redshift bin) and therefore contribute little to the overdensity. Another way to gauge the robustness of our overdensity maps and to test for potential biases is to use an independent estimator and check if it recovers overdensities at the same locations. We adopt the Weighted Voronoi Tessellation (WVT) Estimator as a verification of overdensities identified by the Weighted Adaptive Kernel method. Voronoi tessellation sections the field into regions, socalled Voronoi cells, so that each galaxy has an area associated with it (Darvish et al. 2015;Shi et al. 2021). The Voronoi cell of a galaxy is defined as all points in the redshift bin plane closer to that galaxy than any other galaxy. Consequently, in dense regions of the field, the Voronoi cells will be smaller relative to more sparsely populated regions. This means that the inverse of the area of a galaxy's Voronoi cell informs us about the local surface density at its location, which can be written as: where A i is the Voronoi cell area associated with a given galaxy, i. The weights of each galaxy are taken into account by using a Monte-Carlo acceptance-rejection process that works as follows:

a) b)
Figure 2. a) The galaxy overdensity field at z = 6.05 ± 0.10 in COSMOS, derived from the Weighted Adaptive Kernel method described in §3.1. contours are in steps of 1σ and all grid points below 1σ have the same color. Individual galaxies are indicated as black dots, except for the colored dots which indicate the 19 galaxies, which make up the highly significant (δOD,max 9, ∼ 8σ) galaxy overdensity, COSMOS2020-PCz6.05-01 (see Tables 1 and 2 1. We start by generating a random number, R i , from a uniform distribution between the minimum and maximum weight values in a given redshift bin.
2. We then check if w i > R i , and if that is the case, we use the galaxy in the density estimation.
3. Using Voronoi tessellation, we calculate the surface density only for those selected galaxies.
4. We use Sibson natural neighbor interpolation to estimate the surface density for the grid points of a regularly spaced 2D grid. The result is a Monte Carlo estimate for the surface density fieldΣ(r).
5. We repeat this procedure N times and take the mean of all the Monte-Carlo density fields as the actual density field Σ(r): We choose to set N = 20 to save on computational time. A trial run with N = 100 performed on the z = 6.05 bin yielded no significant differences in the overdensity map. The contours are chosen the same way as the Weighted Adaptive Kernel Estimator. The resulting overdensity field for the z = 6.05 ± 0.1 redshift bin is shown in Fig. 3. An advantage of Voronoi tessellation is its scale independence and its ability to span a wide range of physical lengths. In addition, it does not make any assumptions about the geometry and morphology of the structures in the density field. This means that it is not susceptible to the same bias as the Weighted Adaptive Kernel since we do not need to worry about the choice of global kernel width. We investigate the drawbacks of WVT and the differences between the two estimators in §4. It is also possible to account for substructure within a redshift by constructing 3D (RA, Dec, z) overdensity maps in comoving space, though this requires sub-bin-width photo-z uncertainties, which with our 1σ uncertainties on the order of the bin size is currently infeasible.

Accounting for ultra-deep coverage at high-z
Observing the distribution of galaxies and their resulting density field in bins at z = 6.4 and above, we noticed that galaxies were predominantly located within vertical stripes in the field, with fewer galaxies found between them. The reason for this is that at z ≥ 6.4 galaxies in COSMOS are mainly detected due to the UltraVISTA ultra-deep survey, which covers (part of) the COSMOS field in four separate vertical stripes (McCracken et al. 2012;Weaver et al. 2022). This will affect our overden- Overdensity Figure 3. Voronoi tessellation of the galaxy overdensity field in the z = 6.05 ± 0.1 redshift bin in COSMOS. As in Fig. 2a, individual galaxies are shown as black dots, except for the 19 galaxies shown as colored dots, which make up COSMOS2020-PCz6.05-01 (highlighted by the same red rectangle as in Fig. 2a).
sity estimates since the mean galaxy surface density is calculated from all the grid points in the field. Including grid points where we lack coverage would overestimate the overdensity since the mean surface density would be lower. It would also affect our 4σ selection criterion, since the standard deviation of the overdensity across the grid would be lower. To account for this and obtain a more representative overdensity estimate of the field, we only calculate the mean surface density and overdensity standard deviation within the stripes covered by Ultra-Deep for all z ≥ 6.4 bins. We tested if the ultra-deep coverage would affect our overdensity estimates at z < 6.4 by using the method mention above and found the difference to be minimal, with 1-2 galaxies no longer part of the overdensities at the 4σ level and a slightly lower peak overdensity in the z = 6.05 ± 0.1 bin (δ max = 8.0 vs. δ max = 9.2 as seen with the first entry in table 2). We still argue that the overdensity at z < 6.4 should be calculated using the area covered by UltraVISTA, since we observe clear overdensities that are not within the ultra-deep stripes (half of PCz6.05-01 is outside the stripes), as opposed to z ≥ 6.4 where all the overdensities are concentrated on the stripes.

Overdensity maps
To present our findings, we initially highlight one of the overdensities we have located in the COSMOS field and then, subsequently, present our results for all the redshift bins considered. In the galaxy over- density map for the z = 6.05 ± 0.1 bin derived from the WAK estimator, we identify a significant (> 4σ) overdensity (δ OD = 9.2) consisting of 19 galaxies at z 6.0 with 14 of them being i-band dropouts (see Fig. 2b). This overdensity spans a volume of roughly 10.7 cMpc × 14.3 cMpc × 66.3 cMpc, corresponding to the maximal extent in the RA, Dec, and z directions, respectively. We verify the overdensity using the WVT estimator, see Fig. 3. Both estimators show a clear overdensity at the same location in the field, lending additional credence to the presence of a protocluster. We note that the Voronoi estimator of the overdensity is somewhat higher (δ OD 11 at the peak) than the WAK estimator, which is more smoothed out at the center. This galaxy overdensity -which we hereafter refer to as COSMOS2020-PCz6.05-01 -was not identified before, as the majority of the sources are too faint in the NIR to have been included in the previous COSMOS catalog (Laigle et al. 2016). It has also not been identified in any other surveys (e.g., narrow-band searches for Lyman-α emitters at z ∼ 6). The properties of the 19 galaxies are given in Table 1.
Having shown our approach to locating and analyzing overdensities in the COSMOS field at z = 6, we now extend our search of protocluster candidates to higher redshifts. To search for protocluster candidates between z = 6 − 10, we repeat our overdensity analysis for redshift bins corresponding to 80 cMpc in size. The number of galaxies in the higher redshift bins is of course lower. In the z = 6.05 ± 0.1 bin, there is ≈ 600 objects after applying all our selection criteria, but applying the same criteria to the higher redshift bins leaves us with ≈ 20 − 230 galaxies. We show the overdensity maps for all bins using both the WAK and WVT estimators in Fig. 16 in Appendix A, highlighting the 4σ contours that fulfil our ≥ 5 galaxies selection criterion. In total, we find 15 overdensities at ≥ 4σ across the redshift range z = 6.0 − 7.7 using the WAK estimator. At z > 7.7, we do not identify any galaxy overdensities at a significance of ≥ 4σ. We note that, in some cases ( Fig. 16km), overdensities are found at the same location in the field across adjacent redshift bins, suggestive of either extended structures or possible overdensities near the edge of a bin that are smeared out due to the photo-z uncertainty.
From Fig. 16a-z, it is seen that the two overdensity estimators are in excellent agreement and capture the same galaxy overdensities in the COSMOS field in nearly all of the redshift bins investigated. One aspect of using WVT that is clear from the maps is that it struggles to capture overdensities close to the edge of the field (see Fig. 16m-n and o-p). This is due to the way the areas of the Voronoi cells are calculated. The cells closest to the edge of the field will have one side facing the edge, which means these cells will be open facing polygons, and we cannot calculate their area. The same problem does not exist with the WAK, since it calculates the surface density by essentially using Gaussians, which can be arbitrarily close to the edge. We could alleviate this by defining boundary points around the edge of the field, which would allow us to then calculate the area of the cells spanned between the points in the field and the boundary points, but it would not capture the true distribution of galaxies outside the field and give a false impression of the overdensity near the edge. Not knowing the true distribution of galaxies outside the field affects both the WAK and WVT kernel estimators, but the effect inside the field close to the edge is different. In one instance, an overdensity is located at the same position using both estimators, but the 4σ criterion is not met by one of the estimators. This is seen in Fig. 16o-p, where a 4σ is present using WAK but not present using WVT.
We have not discarded this overdensity from our selection but merely caution that since no independent (4σ level) verification has been made, it could be the effect of using a biased kernel. It is also worth noting that the z = 7.69 ± 0.14 bin only has 26 galaxies in total, meaning the ability to discern what is an galaxy in an overdensity and what is a field galaxy becomes difficult.

Galaxy overdensity properties
In this section, we examine the properties of the galaxies associated with the overdensities identified in the previous section. We use the physical properties (e.g., stellar masses and star-formation rate) as described in Section 2.2. As in the previous section, we first present our results for COSMOS2020-PCz6.05-01, followed by a summary of our findings for the remaining overdensities at higher redshifts.
A JHK S true-color image of a 13 × 18 region centered on COSMOS2020-PCz6.05-01 is shown in Fig. 2b, with the 19 galaxies that make up the overdensity highlighted. The galaxies span a stellar-mass range of M 10 8 −2.4×10 10 M and a range in star-formation rate of SFR 2 − 125 M yr −1 . The photo-z probability distribution, p(z), of the 19 protocluster galaxies is shown in Fig. 4. Most of the galaxies in COSMOS202-PCz6.05-01 are consistent with being normal star-forming galaxies and agree with estimates of the galaxy main sequence at z 6 as seen in Fig For comparison, the galaxies residing in one of the most distant spectroscopically confirmed protocluster at z = 6.6 (hereafter denoted OD66 (Harikane et al. 2019)), lie ∼ 5× above the galaxy main sequence at this epoch (Fig. 5). This may be explained by the fact that OD66 was selected as an LAE overdensity, which is biased towards dust-poor young, actively star-forming galaxies. We note that the i -band dropouts in the z = 6.01 protocluster discovered by Toshikawa et al. (2012Toshikawa et al. ( , 2014, the protocluster at z = 6.54 from Chanchaiworawit et al. (2019) and the one at z = 6.9 from Hu et al. (2021) do not have reliable stellar mass estimates. Narrow-band Lyα selected sources are biased towards unobscured star-formation activity, and (sub)mm selected sources are biased towards obscured starformation activity (Fig. 13). COSMOS2020-PCz6.05-01 is NIR-selected (specifically the izY JHK S bands) and thus biased towards blue stars and unobscured starformation in a similar manner to the LAEs. For a comparison with a different selection method, most galaxies in COSMOS2020-PCz6.05-01 are 10× more massive than those in OD66 (Fig. 5). The total stellar mass and Table 1. Properties of the 19 galaxies in COSMOS2020-PCz6.05-01 ( Fig. 2) from LePhare and EAZY. The galaxies with a * indicate they are i-dropouts according to the criteria in Ono et al. (2017 Figure 5. SFR vs M for all z phot = 6.05 ± 0.10 galaxies in COSMOS2020 (grey circles) and for the 19 galaxies in COSMOS2020-PCz6.05-01 (shown as filled squares, diamonds, and circles -color coded according to their mass as in Fig. 2). The galaxies follow best estimates of the galaxy main sequence at z 6 (Speagle et al. 2014;Salmon et al. 2015;Lovell et al. 2021). For comparison, the z = 5.7 and z = 6.6 spectroscopically confirmed LAE-overdensities reported by (Harikane et al. 2019) are also shown (shown as blue + and yellow × symbols, respectively), along with the stacked LAE non-detection subsamples of OD57 and OD66 (blue and yellow open squares, respectively). The LAEs have low stellar masses and lie ∼ 5× above the main sequence, suggesting that they probe a different protocluster galaxy population than those in COSMOS2020-PCz6.05-01.
star-formation rate of the 19 galaxies in COSMOS2020-PCz6.05-01 is ∼ 1 × 10 11 M and ∼ 700 M yr −1 , respectively. The total star-formation rate is, therefore, ∼ 6× higher than that of OD66 (Fig. 13). Table 2 shows the global properties of all the overdensities identified in the COSMOS field that meet our selection criteria. In Appendix C, we highlight the individual galaxies within each overdensity 4σ-contour overlaid on a JHK S -band true-color image of the region they occupy. We also show the location of the galaxies in the SFR − M plane, their stellar mass Cumulative Distribution Function, and the p(z) of the galaxies. Of the 15 overdensities identified, 9 have either 5 or 6 galaxies inside their 4σ contour, meaning they are just above our selection criterion of ≥ 5 galaxies. The maximum overdensity values vary significantly from 4.7 for COSMOS2020-PCz6.05-05 to 13.0 for COSMOS2020-PCz6.92-05. The total stellar masses and star-formation rates of the overdensties also vary significantly, spanning a range of M ,tot = 10 9.7−11.1 M and SFR UV,tot = 37 − 692 M yr −1 , respectively.
Multiple protocluster candidates appear to have one massive galaxy surrounded by less massive ones, most obviously seen in COSMOS2020-PCz6.05-01, COSMOS2020-PCz6.92-05, and COSMOS2020-PCz7.42-01. To test that this is not simply an effect of the fact that the number density of less massive galax- Table 2. Properties of the overdensities found at z = 6−10. This includes (1) overdensity name, (2) number of galaxies inside the 4σ overdensity contour, with the parenthesis showing the number classified as LBGs (3) mean redshift of overdensity members, (4,5) RA and Dec. of the peak overdensity value inside the 4σ contour, (6) peak overdensity value inside the 4σ contour with the upper lower 1σ(Corresponding to Gaussian statistics 1σ = 0.8413) Poisson uncertainties (Gehrels 1986), (7) total stellar mass of all overdensity members, (8)  ies is higher than that of massive galaxies, we examine whether the galaxies in the overdensities are different from the galaxies in the field. Specifically, we compute the Cumulative Distribution Functions (CDF) for the stellar mass and SFR of these three overdensities and compare them to the environments of field galaxies with a central galaxy of similar stellar mass as the most massive galaxy in each overdensity (within ±0.2 dex). For each overdensity, we do this by defining a circular region that has the same area as the 4σ contour of the overdensity in a given z-bin and selecting all galaxies within that region to determine the CDFs. We only target field galaxies, that is we mask out all galaxies inside 4σ overdensity contours. Fig. 6 shows the CDFs for the three overdensities, as well as the individual and mean CDFs of the field galaxies. We see that both COSMOS2020-PCz6.05-01 and COSMOS2020-PCz6.92-05 skew towards more massive and more starforming galaxies than the field. The CDFs for COSMOS2020-PCz7.42-01, however, are noticeably shifted towards lower masses and lower SFR than the field. While there is difference between how the individual overdensities skew, it is clear that they appear different than the majority of environments with central galaxies of similar mass in the field. The most massive galaxies in our overdensities could be the progenitor Bright Cluster Galaxies (BCG) of the protoclusters. There is evidence to suggest that there is a downsizing effect for galaxy clusters where, on average, the BCGs (cores) for the most massive (proto-)clusters assemble at earlier times than the ones at lower masses (Rennehan et al. 2020). BCGs are formed in high galaxy density protocluster cores and among our candidates, COSMOS2020-PCz6.05-01 seem to fit that description best, with its most massive galaxy appearing to lie below be the main sequence and in a central position in the cluster (see Figs. 2 and 5, and Table 1). This suggests a highly evolved (proto-)BCG has formed already at z 6, pushing back the evolution of BCGs further than previously expected (Ito et al. 2019;Rennehan et al. 2020). Keep in mind that that there may exist multiple possible pathways for the formation of a BCG, as seen by protoclusters dominated by a single massive object like many (sub-)millimeter-selected sources , or a closely clustered association of objects that will rapidly merge to form a BCG (e.g., what is suggested in Kubo et al. (2021)). There are also protoclusters without any evidence for a massive (proto-)BCG in them (e.g., Toshikawa et al. . Stellar mass and SFR cumulative distribution functions for the three overdensities COSMOS2020-PCz6.05-01, COSMOS2020-PCz6.92-05, and COSMOS2020-PCz7.42-01, compared with field galaxies of similar mass in the same z-bin. The individual CDFs for the field galaxies are shown in gray, whereas the mean CDF is shown in red with associated errorbars. smaller than their extend in the z-direction. The extend in the z-direction is, in most cases (the exceptions being COSMOS2020-PCz6.05-03 and COSMOS2020-PCz7.17-03), not near 80 cMpc, so this is not a problem related to the choice of bin size for the overdensity maps. This could be expected, since we are not working with a 3D overdensity map and can therefore not distinguish the position of galaxies inside the bin, only by proxy through their p(z) weighting. The explanation is that this is due to a combination of two factors. One is that we have chosen to only look at galaxies inside a 4σ contour and therefore only get the central galaxies of the overdensity. Suppose we relaxed the restriction to 3σ or 2σ, the extent in RA and Dec would increase. The other factor is the uncertainty associated with the p(z) of the galaxies can make their distribution in space to be more elongated that they actually are. Keeping these two factors in mind, we may be seeing the infall of galaxies from the cosmic web, which would appear as an elongated structure inside our bin. This should be more prevalent as we go to higher redshift since the galaxies have had less time to move in from the cosmic web and coalesce into a protocluster structure. We do not expect this elongation to only be present in the zdirection, but since our 4σ contour selection limits the extent of the structure in RA and Dec, and there is an uncertainty associated with the p(z) of the galaxies, we will only see the elongation in the z-direction. Fig. 7 compares the stellar mass Cumulative Distribution Function (CDF) for the 19 galaxies in COSMOS2020-PCz6.05-01 with that of the field galaxies (i.e., the ones not in other protocluster candidates) at the same redshift bin. It is seen that the latter has a broader distribution with tails at low and high masses. ∼ 90% of the field galaxies are in the same mass range as COSMOS2020-PCz6.05-01, with ∼ 5% having a lower mass and ∼ 5% having a higher mass. Generally, we see that COSMOS2020-PCz6.05-01 is skewed towards higher masses than the field CDF. To check if the overdensity and field are drawn from different distributions, we can perform a two-sample Kolmogorov-Smirnov (KS) test (Darling 1957) using the stellar masses with the null hypothesis that the two independent samples are drawn from the same continuous distribution. We obtain a KS statistic of 0.28 and a p-value of 0.10 and therefore cannot reject the null hypothesis at 10% level. This means that we cannot state that protocluster candidate galaxies are from a separate population with a different stellar mass distribution from the field population. We also compare with the protocluster HDF850.1 at z = 5.2 (Calvi et al. 2021) and a lower redshift comparison with PCL1002 at z = 2.47 (Casey et al. 2015). Despite being at z = 5.2, HDF850.1 is skewed towards lower masses than COSMOS2020-PCz6.05-01, possibly indicating that our candidate will evolve into . The stellar mass cumulative distribution function (CDF) of the 19 galaxies in COSMOS2020-PCz6.05-01 (dashed blue line) and of the field (orange dashed line) consisting of 544 galaxies at z phot = 6.05±0.10. For comparison, we also show the stellar mass CDF from the spectroscopically verified protoclusters HDF850.1 at z = 5.2 (Calvi et al. 2021) and PCL1002 at z = 2.47 (Casey et al. 2015). a more massive system. We also see the difference in mass with a more evolved structure in PCL1002, which is skewed towards masses higher than both our field, COSMOS2020-PCz6.05-01 and HDF850.1, as expected of a structure at half the redshift.

Stellar mass cumulative distribution functions
Comparing the CDFs for the other protocluster candidates (see appendix C), we note that for all the candidates in the z = 6.05 bin, there are more massive galaxies (> 10 10 M ) than in the higher redshift bins. The higher redshift bins have steeper CDFs with masses typically lower than those in the z = 6.05 bin. The trend is also visible in the main sequence plots for z > 6.9, where there are fewer massive galaxies compared to the lower redshift bins. This cannot be an issue related to the sensitivity of the survey since we should target the most massive (and brightest) galaxies, which means this could hint at an evolution in the most massive galaxies between the lowest redshift bin and the higher ones. This evolution is investigated further in §5.

Star-formation rate evolution
To compare the evolution of the SFR for our protocluster candidates across the various redshift bins, we calculate the cumulative SFR as a function of the area from the peak overdensity value of each candidate. The resulting curves displayed in Fig. 8 show a general trend of decreasing SFR with redshift and a tendency for the lower redshift bins to have a sharper increase at the center of the overdensity, whereas the higher z-bins either have larger increases further out from the center of the overdensity or a more gradual increase as seen in the highest z-bins. This could indicate the central cores of the protoclusters being more evolved at lower redshift.

Dark matter halo masses
There exist multiple approaches in the literature to derive estimates of the dark matter halo masses M DM (Behroozi et al. 2013;Long et al. 2020;Calvi et al. 2021). We first consider three different methods to derive the halo masses laid out in Long et al. (2020); Calvi et al. (2021) as well as the method laid out in Shuntov et al. (2022) and discuss which ones are appropriate in our case. After introducing the methods we discuss how we handle uncertainties on our dark matter halo mass estimates and compare the different methods. The resulting estimates are shown in Table 3.
First, we derive a modest estimate of the total mass in dark matter halos associated with the protocluster using method: i) Individual Abundance Matching The first method estimates the total dark matter halo mass associated with the protocluster candidate by summing the halo masses of the constituent galaxies. This estimate assumes that individual galaxies are self-bound objects with halos closer to virialization than the protocluster itself and that each galaxy formed its halo prior to coalescing in this overdensity. The halo masses of the galaxies are derived by applying the stellar mass to halo mass relationship (SHMR) from abundance matching as detailed in Behroozi et al. (2013) (Be13, see Fig. 9), which is developed assuming that the bulk of the baryonic mass in dark matter halos is locked up in stars and that massive galaxies trace massive halos.

ii) Summed Abundance Matching
The second method assumes that the galaxies in overdensity are all residing in and evolving as part of the same overall dark matter halo. In this scenario, the stellar masses of the galaxies are summed into a single total stellar mass for the halo and then a stellar-to-halo abundance matching relationship from Behroozi et al. (2013) is used to convert the stellar mass into a dark matter halo mass. A problem with this method is that the relationship in Behroozi et al. (2013) generally does not extend to the large total stellar masses found in protoclusters, nor for the overdensities we identify at z ≥ 6. To circumvent this problem, Long et al. (2020); Calvi et al. (2021) uses the stellar-to-halo mass ratios in Behroozi et al. (2013) for the most massive individual galaxy halo mass estimate from method i). It has been 10 3 10 2 10 1 10 0 10 1  For each protocluster candidate, the cumulative SFR is plotted as a function of the area of the aperture. Candidates are separated into four redshift bins from: z = 6.0 − 6.5 (upper left), z = 6.5 − 7.0 (upper right), z = 7.0 − 7.5 (lower left), and z = 7.5 − 8.0 (lower right). The average protocluster core size (R200) from Chiang et al. (2017) at each redshift bin is plotted as vertical lines, with the color corresponding to the same bin as the protocluster candidates. When multiple lines are present in a plot, the higher redshift bin line always has the lower area.
argued that it is possible to use this method if the galaxies occupy the same single massive halo, which would require spectroscopic redshift confirmation and radial velocity measurements for the galaxies, as seen in Long et al. (2020). A potential problem with this method is that simulations of protoclusters at z ≥ 6 do not consist of a single dark matter halo but large agglomerations of individual dark matter halos (Chiang et al. 2017). Note also that the virial radius of any of these dark matter halos (few tens to 100 ckpc at most) will be an insignificant fraction of the structure size of the protocluster (many cMpc). For these reasons we have chosen not to use this method.
iii) 5% M Bar /M DM The third and final method assumes a 5% baryonic-to-dark-matter fraction (Behroozi & Silk 2018, BS18) to convert the total stellar mass of the protocluster candidate galaxies to an estimate of the halo mass. Note, however, that this method does not consider the gaseous component of the galaxies, and the dark matter halo mass should therefore be considered a lower limit. While the conversion ratio has been applied to lower redshift results (Long et al. 2020;Calvi et al. 2021), it does not take the evolution of the ratio between baryonic and dark matter at higher redshifts into account. It is also worth nothing the difference between the observationally motivated 5% conversion ratio and the universal baryon fraction obtained from cosmological results (Planck Collaboration et al. 2016; Behroozi & Silk 2018, see Fig. 9) Table 3. Estimates of the total dark matter halo mass associated with the protocluster candidate at the redshift of the overdensity and the most massive individual halo at the redshift of the overdensity using Behroozi et al. (2013); Behroozi & Silk (2018) and Shuntov et al. (2022).

Paper
Behroozi Shuntov Name log(MDM,tot(z)) log(MDM,cent(z)) log(MDM,tot(z)) log(MDM,cent (z) A potential caveat associated with methods i) (and ii)) is that the redshift-dependent M − M DM parametrization given by Behroozi et al. (2013) are only supported over a specific mass range. For example, at z = 6, a valid parametrization is only provided over the stellar mass range ≈ 10 9 − 2 × 10 10 M (see Fig. 7 in Behroozi et al. (2013)). When our stellar masses exceed this range, we obtain estimates of orders of magnitude higher than expected. In such cases, where one or more of the galaxy stellar mass estimates are outside the supported range for Behroozi et al. (2013), we use method iii) to estimate a lower limit for the dark matter halo mass.
iv) HOD-based modelling Finally, we use the SHMR provided by Shuntov et al. (2022) (Sh22). This relationship is determined using the stellar mass function and clustering of galaxies at 6.0 < z < 7.7 in the COSMOS2020 catalog and fitting them with a halo occupation distribution-based (HOD) model where a parameterized form of the SHMR is used. Whereas the Behroozi et al. (2013) relationships are each for one redshift, the method used here is similar to the "universal" relationship between 6.0 < z < 10.0 of Stefanon et al. (2021) because they both consider a range of redshifts. There are two curves which describe the SHMR for the centrals and satellites, that is, the total stellar mass of centrals/satellites in a halo of a given mass. To estimate the total halo mass of an overdensity we take the total stellar mass of the galaxies and use the sum of the central and satellite curves to convert to a dark matter halo mass. We consider this approach reasonable, because the total stellar mass includes both the central(s) and its(their) satellites.
The SHMRs that we have used are shown in Fig. 9. We see that the Shuntov et al. (2022) SHMR generally gives higher halo mass estimates than the Behroozi et al. (2013) and Stefanon et al. (2021) curves, the main exception being the Behroozi et al. (2013) curve at z = 6.0. We also note that the uncertainty on the stellar mass for the Shuntov et al. (2022) SHMR is substantially larger than the other SHMRs shown and that the SHMR does not go below dark matter halo masses of 10 11.8 M .
To estimate the uncertainties on our dark matter halo masses, we adopt the mean SMHRs from Behroozi et al. (2013) and Shuntov et al. (2022) and propagate the uncertainties on our stellar mass estimates (see Table  3). This is an underestimate of the true uncertainties since both SHMRs have intrinsic uncertainties (shaded regions in Fig. 9. The dark matter halo mass estimates using the different methods are shown in Table 3. Using the Be13 method, eight galaxy overdensities presented in this paper have abundance matching results, with total dark matter halo mass estimates in the range M DM 3.5 × 10 11 − 7.2 × 10 12 M (see Table 3). For the remaining seven overdensities, we only have an estimate using the BS18 method, corresponding to lower limits for the total dark matter halo masses. Using the Sh22 method, the range of total halo mass estimates is M DM 7.8 × 10 11 − 3.4 × 10 12 M for all of the 15 galaxy overdensities.
Comparing the Be13 and Sh22 estimates in Table 3, we see that for the lowest z-bin where is good agreement between the estimates, with the Be13 estimates being more massive, but also more uncertain (not accounting for the uncertainty in the M of the Shuntov et al. (2022) curves), while for the higher z-bins the relationship is flipped and the Sh22 estimates are higher, though the uncertainty on the Be13 estimates is still larger. This is in line with what we would expect from the different SHMR in Fig. 9. In both cases the estimates are higher than the lower limits estimated using the 5%M bar /M DM ratio.
The estimates presented in Table 3 correspond to large total dark matter halo masses associated with the protoclusters at redshifts ≥ 6. COSMOS2020-PCz6.05-01, in particular, may represent a very massive structure early in the Universe's history. For comparison, the spectroscopically confirmed galaxy overdensities associated with the (sub-)millimeter-selected sources SPT 0311−58 at z = 6.9 and HDF850.1 at z = 5.2 were found to have dark matter halo masses of 1.4 − 7.0 × 10 12 M and 1.9 − 8.0 × 10 12 M , respectively (Walter et al. 2012;Marrone et al. 2018;Arrabal Haro et al. 2018;Calvi et al. 2021). At later times, when the halos have had time to grow, more massive (sub-)millimeter-selected protoclusters have been found, such as SPT 2349−56 at z = 4.3 and the Distant Red Core at z = 4.0, which have dark matter halo masses of 10 13 M and 10 14 M , respectively (Miller et al. 2018;Long et al. 2020).
We caution that our halo estimates must be viewed in light of the inherent model uncertainty when using the stellar masses from COSMOS2020 and its effect on the halo mass estimates. Furthermore, the above dark matter halo mass estimates could be underestimated for the following reasons. First, the estimates do not take the cluster members that are not detected in COS-MOS2020 into account. Galaxies that are likely to have been missed are low-mass (and thus faint) galaxies and massive quiescent (and thus faint in the rest-frame UV) galaxies (Muldrew et al. 2015). The latter could have a more significant effect on observations of the central regions of protoclusters where downsizing, i.e., the most massive galaxies form earlier than less massive ones, might be an important factor (Rennehan et al. 2020). The second reason our dark matter halo mass estimates might be biased low is that the abundance matching method of Be13 is developed on the basis that the dominating mass component in all halos is the stellar one. It is important to keep in mind that the galaxies in high-z overdensities may contain large gas reservoirs (Strandet et al. 2017;Marrone et al. 2018), which can be significant (or even the main baryonic) contributors to the overall mass budget (Long et al. 2020). While none of the galaxies in COSMOS2020-PCz6.05-01, or indeed any of the galaxies associated with the other overdensities presented in this paper, have robust detections at (sub-)millimeter wavelengths, this does not rule out the possibility that they could harbour significant amounts of gas (and dust).

Descendant halo masses
Using Millenium Simulations of 3000 galaxy clusters and their evolution across cosmic time, Chiang et al. (2013) presented evolutionary tracks of protocluster progenitor masses with redshift (see also Muldrew et al. (2015)). The Chiang et al. (2013) models predict that overdensities residing in dark matter halos with masses > ∼ 10 12 M at z ≥ 6 are expected to evolve into presentday descendent galaxy clusters similar to a Virgo or Coma-like cluster with a total mass ∼ 10 15 M . Since Chiang et al. (2013) considers the main progenitor halo in a cluster merger tree, we need to identify the most massive individual dark matter halo in our protoclus-ter candidates. We have done this by using the most massive galaxy in each overdensity to estimate a central dark matter halo mass (M DM,cent ) using the Be13 and Sh22 methods (see Fig. 9 and Table 3) Note, for the Sh22 method, we use the central SHMR to estimate the (sub)halo mass of the single most massive galaxy in each overdensity. The dark matter halo mass estimates of the Be13 and Sh22 can also be seen as a subhalo mass, in the case of Be13 if the horizontal axis in Fig. 9 is interpreted as the halo mass at the time of accretion and in the case of Sh22 because the SHMR functional form used in Shuntov et al. (2022) has been used to also link the satellite galaxy mass to its subhalo mass (Behroozi et al. 2010(Behroozi et al. , 2019. In essence, the central SHMR can be considered the same for satellites if the halo mass is taken as the subhalo mass at the time of accretion. We therefore argue the central SHMR is applicable in this case, knowing that the estimated mass would be the halo mass at the time of accretion due to the fact that once a subhalo is accreted onto a more massive one, mass is tidally stripped as time goes on. Considering the abundance matching results of Be13, of the eight overdensities where we were able to use the curves of Fig. 9, three have halo masses in excess of 10 12 M and would therefore be expected to evolve into ∼ 10 15 M clusters at z = 0. These three overdensities are in the lowest z-bin. Using the Sh22 method, out of the 15 z ≥ 6 galaxy overdensities we have identified, 9 have halo masses in excess of 10 12 M . The smaller scatter and difference in redshift we see with the Sh22 method (all galaxies have M DM ≈ 10 12 M within 0.3 dex) would suggest that our higher redshift overdensities will evolve to be even more massive than a Virgoor Coma-like cluster at present.
We caution that the above descendant halo mass estimates require several assumptions that are not currently well constrained by observations. For example, the Chiang et al. (2013) halo growth curves depend heavily on the presumed volumes of the observed galaxy overdensities.

Galaxy richness and growth
It is important to keep in mind that it is not enough for a dark matter halo to be in the excess of 10 12 M at z ≥ 6 to grow into a Virgo or Coma-like cluster at present, they also need to be in an environment that allows them to grow over time (Overzier et al. 2009;Angulo et al. 2012). From simulations like Magneticum (Remus et al. 2022) it has been argued that galaxy richness, which can be characterised by determining the number overdensity, is the best tracer of this growth. This is because it traces the feeding of galaxies into the main dark matter halo from the cosmic web. What this means for us is we should consider not only which overdensity has the most massive halo estimate, but also which ones are in the most overdense environments. COSMOS2020-Pz6.05-1 stands out in this regard not only by having the second most prominent dark matter halo mass estimates (see Table 3) next to COSMOS2020-Pz6.05-5 using method i) and being among the highest dark matter halo mass estimates ( 2 × 10 12 M ) using the Sh22 method, but also by containing almost twice as many galaxy member candidates (19) as the secondrichest overdensity, COSMOS2020-PCz7.17-1 (10). This makes COSMOS2020-Pz6.05-1 the most likely candidate for becoming a truly massive galaxy cluster at the present day.

Present day halo masses from spherical collapse
It is well known that once matter overdensities enter the non-linear regime, their future evolution can no longer be rigorously described analytically. Instead, one has to resort to numerical simulations or adopt analytical approximations. Following the procedure in Calvi et al. (2021), we tested the so-called spherical collapse model which is the simplest analytical description of the non-linear evolution of matter overdensities with cosmic time (Gunn & Gott 1972). We found that the resulting upper limit on the total mass had a number of uncertainties that ultimately led us to disregard the estimate. An example on the small scale is that if we calculate the overdensity δ OD using the number of galaxies in the overdensity with the 4σ area and the number of galaxies in a field with the area of COSMOS, we obtain a higher value of δ OD , which in turn gives a higher estimate to the total mass upper limit. δ OD could also be affected by the lack of detections of fainter objects in COSMOS since if more of these objects are part of a given overdensity compared to the field and we had the depth to detect them, δ OD would have an increased value for the overdensity, in turn resulting in a higher present-day mass. The effect of galaxy downsizing would run counter to this argument, since the supposed less massive and faint galaxies would not have formed at z 6. On the other hand if there were more undetected galaxies in the field relative to the overdensity, the δ OD would be lower, leading to a lower mass estimate. An example on a larger scale is the effect the volume has on the final result. Due to the uncertainty associated with the photometric redshift, true volume occupied by the protocluster candidate is unclear and determining the extend from the median photometric redshift of the galaxies would overestimate the volume of the protocluster candidate and in turn the the present day total mass. Taking these effect in aggregate, the difference on the upper limit for the protocluster candidate total mass estimates vary more than an order of magnitude.

Overdensity maps using LBG selection methods
In this section, we compare the overdensities found with the WAK and WVT methods described in §3 with the traditional LBG dropout selection methods. To this end, we create overdensity maps using the relevant dropout selection techniques between z ∼ 6−10. We use the full The Farmer catalog and put no restrictions on the galaxies other than the dropout selections and the use of the same mask as we did for the WAK and WVT methods that masks out bright stars in the field.
To cover the redshift range z ∼ 5.5 − 6.3, we adopt the i-dropout criteria from Ono et al. (2017), which are based on the HSC-bands. The i-dropout criteria are: with the further requirement that the sources must be detected at a S/N > 5 in z and S/N > 4 in y (Ono et al. 2017). To select sources in the redshift range z ∼ 6.3 − 7.7, we use the z-dropout criteria specified by Ono et al. (2017). They require a S/N > 5 detection in y-band and a color-selection given by z − y > 1.6. Sources at z ∼ 7.4 − 8.8 are selected using the Y -dropout selection criteria from Schmidt et al. (2014). Since The Farmer does not include measurements in the V -band, which is used for one of the non-detection criteria in Schmidt et al. (2014), we use the HSC g-band as a substitute. Our adopted selection criteria are: where the Y JHK magnitudes are from the UltraVISTA imaging in COSMOS2020. For the J-dropouts, we use the selection criteria from Oesch et al. (2014), which covers the redshift range z ∼ 9 − 10 and above: where χ 2 opt+Y = SGN(f i ) × i (f i /σ i ) 2 , and f i /σ i is the band flux divided by the flux error for the optical (griz) and Y -bands (Oesch et al. 2014). The χ 2 criterion accounts for low-z interlopers while only slightly (∼ 20%) decreasing the galaxy sample size of real z > 9 sources. The H − [4.5] < 3.2 criterion accounts for intermediate redshift, dusty sources at z = 2 − 4. Fig. 10 shows the galaxy overdensity maps in COS-MOS when applying the above i-, z-, Y -, and J-dropout selection criteria to the COSMOS2020 catalog. Comparing these overdensity maps, which corresponds to redshift ranges of z ∼ 5.5 − 6.3, 6.3 − 7.7, ∼ 7.4 − 8.8, and ∼ 9 − 10, with the relevant WAK galaxy overdensity maps derived in §3, we find multiple overdensities that line up. For i-dropouts, Fig. 10a show corresponding overdensities to the ones we have found in the z = 6.05 ± 0.1 bin (Fig. 16a), with the exception of COSMOS2020-PCz6.05-08. For z-dropouts in Fig. 10b, there are also agreements between the overdensity maps, though less significant ones as seen with COSMOS2020-PCz6.69-01 and COSMOS2020-PCz6.92-01, -04, -05. The Y -dropout overdensity map in Fig. 10c shows one ≥ 4σ overdensity contour with 4 galaxies inside, though none of our 4σ contours align with it. For the J-dropout overdensity map in Fig. 10d, There are no overdensity contours at 4σ or above and no corresponding contour that fulfill our selection criteria. The small number of significant (≥ 4σ) overdensities in the Y − and J−dropout overdensity maps is in agreement with the lack of significant overdensities in the WAK overdensity maps in the redshift bins z ≥ 7.4 (Figs. 16m-16z). Comparing the maps, we find little agreement between the WAK and dropout maps, which we ascribe this to the low number statistics in the highest redshift bins.

Dark matter halo mass estimates
Considering the central dark matter halo mass estimates in Table 3, Fig. 11 shows the expected cumulative counts for a COSMOS-like survey. We have chosen to use the analytical fit of Warren et al. (2006) because it matches simulations over a wide range of masses well. We use the number density of halos with mass greater than a given mass from the Warren et al. (2006) analytical fit. To obtain the cumulative counts (the number of halos with mass greater than a given mass), we then multiply the number density with the volume of each redshift bin where we have found a protocluster candidate, calculated as a truncated pyramid. The range of mass estimate, shown as lines, using the three different methods are between 10 11 − 10 13 M . The estimates using the BS18 method are shown with lower error as the lower bar and the position of the arrow as the upper error. The upper error from the Be13 abundance Figure 10. Panels a) to d) show overdensity maps of LBGs in COSMOS2020 based on the i-(1116 galaxies), z-(353 galaxies), Y -(240 galaxies), and J-(17 galaxies) dropout techniques (see Section 4.5), which select galaxies in the redshift ranges z ∼ 5.5−6.3, ∼ 6.3−7.7, ∼ 7.4−8.8, and ∼ 9−10, respectively. The 4σ LBG overdensities are shown as cyan dashed contours. For comparison, the other lines show the 4σ overdensities that fulfill our selection criteria across redshift bins. In figure a) the red lines are for the z = 6.05 ± 0.10 bin. In figure b) the red line is (are) for the z = 6.69 ± 0.11 bin, the green lines for the z = 6.92 ± 0.12 bin, the magenta lines for the z = 7.17 ± 0.12 bin, the orange line for the z = 7.42 ± 0.13 bin and the white line for the z = 7.69 ± 0.14 bin. The orange and white lines in figure c) are the same ones as in figure b).

a) i-dropouts (N gal =1116) b) z-dropouts (N gal =353) c) Y-dropouts (N gal =240) d) J-dropouts (N gal =17)
matching estimates are shown as "}" and the upper error of the estimates using the Sh22 method are shown as "]". We have chosen the cumulative count value for each dark matter halo mass estimate arbitrarily, so it is easier to discern the relation between the estimates and the Warren et al. (2006) curves. A comparison with the curves of Warren et al. (2006) shows that our candidates have dark matter halo mass estimates that have corresponding cumulative count values that span a wide range. For the number of protocluster candidates (1-6) we have found in each redshift bin, we would expect the cumulative counts to be between 0.0 dex to 1.0 dex. The BS18 lower limits give high cumulative counts, generally 2.0 dex to 4.0 dex. The Be13 estimates, where available, give lower counts than the BS18 limits. For the z = 6.05 ± 0.1 bin, the estimates range from 1.0 dex to −6.0 dex, while for the higher bins the range is higher at 2.5 dex to 1.5 dex. The Be13 estimates suggest we would expect to find more halos of similar masses to ours, at least for the higher redshifts bins. This could be an effect of survey depth, as we might be missing out on faint and massive galaxies that would increase the value M DM,cent , though at z > 6 we would expect the most massive galaxies to be the brightest as well.
Finally, for the Sh22 estimates, the cumulative counts ranges from 1.5 dex to −3.0 dex. With the exception of a couple of outliers, the Sh22 counts generally range from 1.0 dex to −0.5 dex for each candidate, similar 4 z = 7.42 ± 0.13 z = 7.69 ± 0.14 Figure 11. Expected cumulative counts from a COSMOSlike survey following the analytical fit of Warren et al. (2006). The curves show the expected cumulative counts (number of halos with mass greater than a given mass) for the redshift bins where we have located protocluster candidates. The horizontal lines indicate a range of estimates for a given protocluster candidate using the different Mcent estimates described in section 4.3 and shown in Table 3. The lower bar for each estimate is the value using the BS18 method method minus its lower error and the position of the arrow is determined by its value plus the upper error. The "}" indicates the upper error of the Be13 abundance matching estimates and the "]" marks the upper error for the Sh22 method estimates. The position of each estimate on the y-axis is chosen arbitrarily for clarity. Estimates from the same bin have the same color, which correspond to the color of each Warren et al. (2006)  to what we would expect given the number of protocluster candidates we have found in each bin. The greatest outliers are COSMOS2020-PCz6.05-01 and -05, where their upper bounds using the Be13 method have corresponding cumulative counts of ∼ −3.0 dex and ∼ −6.0 dex respectively, meaning these appear to be rare structures. Another structure that appears to be rare is COSMOS2020-PCz7.69-01, where the Sh22 bounds varies between −2.0 dex to −3.0 dex. These comparisons have to be viewed in light of the fact that both simulation and analytical models have trouble reproducing the most extreme overdensities that we find observationally (Warren et al. 2006;Harrison & Hotchkiss 2013).
Another way to investigate if our dark matter halo mass estimates are realistic is to compare the most massive halos that are expected to be observable for a COSMOS-like survey for a given redshift, in a similar fashion to what was done in Fig. 3 of Marrone et al. (2018). Figure 12 shows how our protocluster candi-  Fig. 11. The exclusion curves are based on the models by Harrison & Hotchkiss (2013) and show the most massive halos that are expected to be observable within the COSMOS 1.7 deg 2 area (black curve) at 99% and 68% levels and the whole sky (grey dashed curve) at the 99% level, meaning for an 100α% exclusion curve we should expect to observe a (proto-)cluster above the line only 100(1-α)% of the time (Harrison & Hotchkiss 2013). The green, blue, and orange points are dark matter halo mass estimates based on overdensities associated with DSFG, QSO and LBG, respectively, and taken from Casey (2016) dates compare to the curves showing the expected most massive halo at a given redshift for a both a 2 deg 2 COSMOS-like survey and a full sky survey, based on the models from Harrison & Hotchkiss (2013). We have chosen only to show the BS18 estimates for clarity. We also compare with a number of protoclusters found with different selection methods from Marrone et al. (2018), as wel as Casey (2016) and Overzier (2022). In Harrison & Hotchkiss (2013) they argue that the rareness method of constructing exclusion curves as in Fig. 11 is biased so that it overestimates the amount of tension a particular observation may cause in relation to a ΛCDM cosmology. We observe that our candidates are generally under what is expected to be the most massive halo at their given redshift for a COSMOS-like survey and they have similar DM halo mass estimates as other structures found with different methods. The exceptions are again COSMOS2020-PCz6.05-01 and -05, where their upper estimates using the Be13 method are close to 99% exclusion curve. We also note that the lower bound of some of the candidates between z = 6.69 − 7.7 are close to the expected most massive DM halo mass at their redshift, though the large redshift uncertainty on some of the candidate galaxies makes it difficult to conclude if this result is significant.

Comparison with literature
Fig . 13 shows the SFR of protoclusters selected via several different methods over a wide redshift range. The different methods can generally be grouped into targeting obscured and unobscured star-formation. The former typically selects protoclusters with higher SFRs than the latter.
For obscured star-formation and at lower redshifts, we have a combination of protoclusters rich with dusty starforming galaxies (DSFG) and some with ultra-luminous AGN (Quasars, see Casey (2016)). These types of objects are rare and have intrinsically short duty cycles (∼ 100 Myr). Casey (2016) finds multiple DSFG-rich proto-clusters within a few square degrees and suggests this is evidence that proto-clusters assemble in shortlived, stochastic bursts that likely correspond to the collapse of large-scale filaments on 10 Mpc scales. Comparing with our COSMOS2020 candidates, we also have appear to have an extended filament like structure, but none of our galaxies are DSFGs. This does not exclude the scenario that our candidates had or will have a rapid growth phase of dusty star-formation at some point in time, merely that the different methods we use to select galaxies let us probe structures that are outside of this phase. Two of these DSFG-rich protocluster in COS-MOS at z = 2.1 and z = 2.47 were followed up with ALMA to study their less rare, more typical galaxies by Zavala et al. (2019). We have included the SFR and stellar masses of those galaxies in Fig. 13 and 14. Lewis et al. (2018) use ultra red galaxies as "signposts" for dense regions in the early Universe. They find multiple DSFGs near each signpost and posit that with the average total SFR and space density they obtain for their candidate protoclusters (which are similar to the most massive M DM ∼ 10 15 M galaxy clusters at z < 0.2), these DSFG systems will evolve into the massive ETGs (relatively passive ellipticals and lenticulars) at the centers of rich galaxy clusters at present. Our descendant mass estimates for PCz6.05-01 are in a similar range as those of (Lewis et al. 2018), though we, as mentioned, have no DSFGs. Ota et al. (2018) observes the environment around a quasar at z = 6.6 and detect multiple LBG and LAE candidates around in the quasar field. They compare with a control field, a general blank sky field where they confirm the non-existence of quasars at similar z, any clustering of LAEs and LBGs, or over/underdensities of them. In the southern area of their quasar field, they find the number density of LAEs is almost equivalent to the mean to mean −1σ density of LAEs in the control field. Conversely, over this area, LBGs exhibit a filamentary overdensity structure running from east to west. The LBG structure contains several 3σ-7σ high-density excess clumps. Their northern area is very sparse of both LAEs and LBGs. They find that the quasar is in an high-density LBG environment at the 4σ level, but in an near edge region and not at the center. To our knowledge, none of the galaxies in our overdensities are quasars or have strong AGN emission. Ota et al. (2018) provide no line fluxes, luminosities, stellar masses or SFR estimates (only the Lyα luminosity for one Lyα-blob candidate) and we therefore cannot provide further comparisons.
At the extreme end of star-formation, we have the SPT sources of Wang et al. (2021), compact structures with SFRs at least an order of magnitude higher than COS-MOS2020 protocluster candidates. These objects were selected due to their bright 870µm flux densities and point-source nature. They are classified as protoclusters since they constitute overdensities of dusty star-forming galaxies and, therefore, likely the progenitors of presentday clusters. See also the SMG selected protocluster of Oteo et al. (2018). This highlights the current discrepancy between the SFR of (proto)clusters determined from observations and the SFR expected from simulations, with a tendency for the former to be higher than the models at a given (high) redshift. This discrepancy could come from several factors like simulation volume, sub-grid models not capturing the physics of extreme environments or insufficient resolution due to computational constraints. The difference in SFR for the SPT sources and COSMOS2020-PCz6.05-01 shows the most apparent difference between the methods used to select the objects. The SPT sources are selected with 870 µm data and are therefore biased towards highly obscured star-formation, and they find all the bright sub-mm galaxies to be in a very active (starburst) phase of their formation where star-formation is at a maximum. The COSMOS2020 protocluster candidates are NIR selected and therefore biased towards unobscured star-formation but are not necessarily undergoing a dusty starburst phase, which can be argued due to the lack of mm and far infrared detections. Moreover, the COSMOS2020 candidate galaxies (e.g., Fig. 5) appear to be on the main sequence, although due to the significant scatter we cannot rule out the possibility that some of the galaxies are undergoing starburst.  Kato et al. (2016), some of the galaxies from the two COSMOS2015 protoclusters at z = 2.1, 2.47 by Zavala et al. (2019), The LBG protocluster at z ∼ 3.8 from Kubo et al. (2019), the LBG protocluster at z = 6.31 by Mignoli et al. (2020), the LBG protocluster discovered by Endsley & Stark (2022) at z = 6.80 and the LAE protocluster at z = 6.90 by Hu et al. (2021). Mix refers to the use of a mix of different selection methods for a given protocluster. The curves show the evolution of the total average SFR per protocluster for different Semi Analytical Models (SAM) from Chiang et al. (2017).
For unobscured star-formation, we have Lyman Alpha Emitters (LAE), which, similarly to the COS-MOS2020 candidates, are biased towards unobscured star-formation. The search for LAEs at high-z is typically done with narrow-band filters, which restricts the search to small ranges at specific redshifts (z = 5.7, z = 6.6 as seen in Ouchi et al. (2005Ouchi et al. ( , 2008; Jiang et al. (2018); Harikane et al. (2019)). We see from Fig. 13 that the SFRs of LAE selected protoclusters are the ones most similar to ours, though their structures contain objects with lower masses compared to the COSMOS2020 candidates (see Fig. 5). Higuchi et al. (2019) looks for LAE overdensities in a number of fields, including COSMOS, and they find a z = 6.6 overdensity HSC-z7PCC17 containing a handful of galaxies close to our protocluster candidate PCz6.69-02 and at a similar peak overdensity (δ max = 7.0 +6.1 −3.1 for theirs and δ max = 8.5 +4.6 −2.4 for ours), though their overdensity have no spectroscopically confirmed LAEs within 10 cMpc from the center of the protocluster candidates and is not marked as a high density region by their standard (see §5.1.2. in Higuchi et al. (2019) for more detail). One of their other overdensities HSC-z7PCC26 also at z = 6.6 is highlighted in Harikane et al. (2019), and is shown in Fig. 13 with the SFR taken from Lyman Break Galaxy (LBG) selected protoclusters are most similar to our selection of candidates, at least in method and are also biased towards unobscured starformation. LBGs are selected using color selection criteria applicable to specific redshifts. These color se-  Tadaki et al. (2019) at z = 2.16 and z = 2.49 and the protocluster at z = 6.8 from Endsley & Stark (2022). Labels are the same as in Fig. 13. lections use 2-4 observational bands, and as a result, the redshifts of the galaxies that fulfil these criteria are only known inside a broad range (see the LBG selected protoclusters at z ≈ 3 − 4 from Toshikawa et al. (2016) in Fig. 13).
Because of the redshift uncertainty associated with LBG selection, the method is often used to select objects which can be followed up with other methods, such as searching for LAEs. This is the case for the LBG selected protocluster at z = 6.01 from Toshikawa et al. (2012Toshikawa et al. ( , 2014, where followup on an LBG overdensity was done by searching for LAEs. Another example of is the protocluster at z ∼ 3.8 from Kubo et al. (2019), where LBG overdensities have been followed up with Planck and Herschel data. This followup gives IR SFRs and places the protocluster next to the DSFGs in Fig.  13. Toshikawa et al. (2016) found a number of dropout overdensities, specifically 5 i-band dropouts overdensities. The 5 dropout overdensities have peak OD values between 4.1σ − 7.6σ, similar to the peak overdensities for our protocluster candidates (see table 2). Only two of the Toshikawa et al. (2016) i-dropout overdensities have galaxies with spec-z, and of those there are only 2-3 galaxies with spec-z in each. Toshikawa et al. (2016) marks one as unclear and the other possible as to whether or not they are protoclusters. Since there is only data available for the galaxies with spec-z and it is unclear whether they are actual protoclusters, we have chosen to not include these overdensities in our plots and further analysis. the lower redshift dropout overdensities with followup from Toshikawa et al. (2016) (D1UD01, D4UD01 and D4GD01 as seen in Harikane et al. (2019)) are included in Fig. 13. For the z = 6.8 protocluster found in Endsley & Stark (2022), two of their galaxies with strong (> 7σ) Lyα detections are inside one of our 4σ overdensity contours in the z = 6.69±0.11 bin and another galaxy which has no Lyα detection is inside a 4σ contour in the z = 6.92 ± 0.12. Both of these contours are not included in our results, since they individually have less than 5 galaxies inside them, but they are clearly part of an overdensity in both bins (see Fig. 16g,h and 16i,j). Combining the galaxies from the contours in both bins would give us 5 galaxies in total, thereby classifying it as a candidate using our selection criteria. This highlights the trade-off when choosing a specific bin size, as there is the possibility of missing overdensities where the galaxies fall between the bin edges. Referencing the positions of the 12 candidate galaxies in Endsley & Stark (2022), we see that they are generally more spread out than our candidates, but their overdensity value is also lower than ours at δ = 3, whereas our candidates have an overdensity value of δ ∼ 5 or above. decreasing the overdensity value of the contours in the two bins to δ = 3 does not include any more of the Endsley & Stark (2022) galaxies.
We also compare the total average protocluster SFR from the semi-analytical models (SAM) in Chiang et al. (2017) (see the black curves in Fig. 13). From a comparison with the models, we see that for our candidates at z ≈ 6 − 8, our selection seems to target protoclusters that are average in terms of their SFR, as indicated by the candidates being close to the values predicted by the SAMs.
A subsample of the galaxies in Fig. 13 also has stellar mass estimates, making it possible to compare the total stellar mass for these protoclusters with the COS-MOS2020 candidates, as shown in Fig. 14. We observe a general trend where the lower redshift protoclusters have higher total stellar masses. This is in line with what we would expect given that the lower redshift protoclusters have had more time to build up mass.

Candidate stellar mass, SFR, sSFR evolution
To investigate the differences between the z = 6.05 bin candidate galaxies and the higher z-bins, we show their stellar mass, SFR and specific SFR (sSFR) histograms and CDFs in Fig. 15. It is clear that the distribution at z = 6.05 is skewed towards higher masses with a me-dian of M = 10 9.6 M for the z = 6.05 bin galaxies and M = 10 8.9 M for the galaxies in the other bins. Running a KS-test on the two distributions gives a KSvalue of 0.53 and a p-value of 2.22 × 10 −7 , suggesting that these are stellar mass distributions distinct from each other. To test whether this difference is mainly due to our most significant overdensity PCz6.05-01, we removed its galaxies from the mass distribution and found the change in the median mass negligible (∼ 0.005 dex), and the stellar mass CDF skewed towards even higher masses, since we loose some M < 10 8.5 M galaxies. For the SFR histogram and CDF, the difference between the bins is less noticeable. the z = 6.05 bin protocluster candidates skew towards higher SFR, though the highest values are found in the higher z bins. Running a KS test for these distributions give a KS-value of 0.23 and a p-value of 0.09, meaning we cannot reject the null-hypothesis and the 9% level. Finally for the sSFR histogram and CDF, we see a noticeable difference between the bins. The higher z bins are skewed towards high sSFR, with a median of 7.88 dex, as opposed to the z = 6.05 bin candidate galaxies with a median of 8.18 dex. The sSFR CDF shows that the higher bin candidates always skew towards higher sSFR than the ones in the z = 6.05 bin. A KS test of the two distributions give a KS-value of 0.40 and a p-value of 2.66 × 10 −4 , the two distribution appear to be distinct from one another.
Combining the information from the stellar mass, SFR and sSFR distributions, we see a tentative evolution of the protocluster candidates from having higher sSFR at higher redshift (i.e. more efficient at star-formation) to higher stellar masses at lower redshift. The difference we are seeing could be because we are probing larger volumes at lower redshift, thereby taking in more galaxies. However, comparing the expected sizes of protoclusters from Chiang et al. (2017), the difference between z = 6 and z = 7.7 is small, suggesting the effects we see in Fig.  15 are not due to volume selection effects.

CONCLUSIONS
In this paper, we have analyzed galaxies that are part of The Cosmic Evolution Survey between z = 6 − 10, utilizing the new The Farmer version of the COS-MOS2020 catalog (Weaver et al. 2022). The photometric redshifts are determined by the state-of-the-art SED fitting codes LePhare and EAZY. The goal was to probe the onset of protocluster formation by locating highredshift structures in redshift bins between z = 6 and 10. We perform a galaxy number density analysis in 80 cMpc bins between z = 6 − 10 using a 2D Weighted Adaptive Kernel and 2D Voronoi tessellation. The main findings of this paper are as follows: 1. We locate 15 significant (> 4σ) overdensities between z = 6.0 − 7.7, each containing at least five galaxies. Of the 15 overdensities, 14 are recovered at ≥ 4σ by both galaxy number density estimators employed here. The most prominent of these, COSMOS2020-PCz6.05-01, is an overdensity at z 6.05 ± 0.10 consisting of 19 galaxies, with 14 being i-band dropouts. The median photometric redshifts of the 19 galaxies from Le-Phare spans a range of 5.97 ≤ z gal ≤ 6.13. Above z = 7.7, we do not find any galaxy overdensities at a significance ≥ 4σ.
2. To compare with traditional dropout selection techniques, we construct overdensity maps consisting of i-, z-, Y -and J-band dropouts in the COSMOS2020 catalog. For redshift bins between z = 6.0 − 7.7, we find excellent correspondence between the overdensities found using the WAK and WVT estimators and the overdensities of i-and z-dropouts, which correspond to z ∼ 6 and ∼ 7, respectively. A similar agreement is not found, however, for the z > 7.7 redshift bins, where the low number of sources in our sample hinders a rigorous comparison with the Y -and J-band dropout maps.
3. We estimate the total dark matter halo mass associated with the 15 protocluster candidates using three different techniques and find a range of total halo masses of M DM ≈ 3.5 × 10 11 − 7.2 × 10 12 M using the abundance matching of Behroozi et al. (2013) and M DM ≈ 7.8 × 10 11 − 3.4 × 10 12 M using the method outlined in Shuntov et al. (2022). Considering the simulation of Chiang et al. (2013), and the central halo mass estimates from Behroozi et al. (2013);Shuntov et al. (2022), we find the descendant halo mass of COSMOS2020-PCz6.05-01 to be similar to a Virgo-/Coma-like cluster (∼ 10 14−15 M ), and we expect the other candidates from the z = 6.05 ± 0.10 bin to end up with a similar halo mass at z = 0. For the higher z candidates the Behroozi et al. (2013) abundance matching estimates are closer to M DM ≈ 10 11 M , whereas the Shuntov et al. (2022) estimates are still M DM ≈ 10 12 M , meaning they could evolve to be even more massive than a Virgo-/Coma-like cluster at present. These mass estimates are likely to be underestimated due to faint low mass galaxies and quiescent high mass galaxies not detected in COSMOS.
4. We compare the 15 candidates to the number of protoclusters located with different selection methods and find that they occupy a unique position as a NIR-selected structure biased towards massive blue stars and unobscured star-formation with galaxies that appears to be on the galaxy main sequence. combined with the lack of mm and far infrared detections, the candidate galaxies appear to not be in a starbursting phase. Compared with the semi-analytical models of Chiang et al. (2017), we find that our candidates agree with the total average star-formation rate predicted by those models.
5. We show a trend in the evolution of the stellar masses between the z = 6.05 bin and the higher z-bins by comparing their histograms and CDFs, with more massive galaxies at lower redshift. Combined with a higher specific SFR for the candidates in the higher redshift bins, we appear to trace the evolution of the protocluster galaxies from being more efficiently star-forming at z > 6 to more massive at z = 6.
Many of the protocluster candidates presented here will be covered by COSMOS-Web (Cycle 1, ID. #1727), a large program on the James Webb Space Telescope (JWST), which will survey the COSMOS field using the MIRI and NIRCam instruments. Our results emulate what will be possible with Euclid combined with large ground-based optical surveys, which, via deep near-IR imaging of tens of square degrees, will uncover dozens of z ≥ 6 protocluster-targets. Following these up with spectroscopic campaigns using 8-10m class telescopes on the ground, the JWST, and the Atacama Large Millimeter/sub-millimeter Array (ALMA), will offer uniquely detailed studies of the assembly processes of protoclusters and their galaxies during reionization.

ACKNOWLEDGEMENTS
The Cosmic Dawn Center (DAWN) is funded by the Danish National Research Foundation under grant No. 140. We gratefully acknowledge the contributions of the entire COSMOS collaboration consisting of more than 100 scientists in constructing the COSMOS2020 catalog as well as for the feedback and discussion. The research with the COSMOS survey is also partly supported by the Centre National d'Etudes Spatiales (CNES). TRG and MB are grateful for support from the Carlsberg Foundation via grant no. CF20-0534.

A. COSMOS GALAXY OVERDENSITY MAPS
Figs. 16a-16z show galaxy overdensity maps (colored contours) using the WAK (left panels) and WVT estimators (right panels). Starting from the top row the overdensity maps shown are for the following redshift bins: a-b) z = 6.05 ± 0.10 (595 galaxies), c-d) z = 6.25 ± 0.10 (226 galaxies), e-f) z = 6.47 ± 0.11 (95 galaxies), g-h) z = 6.69 ± 0.11 (123 galaxies), i-j) z = 6.92 ± 0.12 (187 galaxies), k-l) z = 7.17 ± 0.12 (193 galaxies), m-n) z = 7.42 ± 0.13 (102 galaxies), o-p) z = 7.69 ± 0.14 (26 galaxies), q-r) z = 7.97 ± 0.14 (22 galaxies), s-t) 8.26 ± 0.15 (20 galaxies), u-v) 8.57 ± 0.16 (21 galaxies), w-x) 8.89 ± 0.17 (23 galaxies), y-z) 9.23 ± 0.18 (33 galaxies). Contours indicate regions of increasing overdensity values in steps of 1σ. Red contours indicate regions where the galaxy overdensity is significant by 4σ or higher and meet our selection criteria. These are indicated by numbers that refer to their name in Table 2. The location of the individual galaxies are marked by cyan dots, and the number of galaxies within each redshift bin is indicated in the parentheses above. . Galaxy overdensity maps (colored contours in steps of 2σ) using the WAK (left panels) and WVT estimators (right panels) as described in Section 3. The overdensity maps are for the indicated redshift bins. Red contours indicate regions where the galaxy overdensity is significant by 4σ or higher and meet our selection criteria. These ≥ 4σ overdensities are labeled by numbers that refer to their IDs in Table 2. The location of the individual galaxies are marked by black dots, and the number of galaxies within each redshift bin is indicated in parentheses. In this appendix we list the properties of the galaxies associated with each of the 15 ≥ 4σ overdensities given in Table 2. In Tables 4 to 18 given below, * indicates galaxies that fullfil the LBG dropout criteria expected for their photometric redshift. COSMOS2020-PCz6.05-01: We identify 19 sources inside the 4σ contour of this overdensity, of which 14 are i-band dropouts ( Table 4). The galaxies span a volume of 10.7 × 14.3 × 66.3 cMpc (Fig. 18c). We infer a dark matter halo mass of M DM ≈ 2 − 13 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 19 galaxies is shown in Fig. 18b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.28 and a p-value of 0.10.  COSMOS2020-PCz6.05-02: We identify 8 sources inside the 4σ contour of this overdensity, of which 7 are i-band dropouts and 1 has no i-band data. (Table 5). The galaxies span a volume of 6.9 × 4.8 × 54.4 cMpc. (Fig. 19c). We infer a dark matter halo mass of M DM ≈ 8 − 26 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 8 galaxies is shown in Fig. 19b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.45 and a p-value of 0.05. COSMOS2020-PCz6.05-03: We identify 6 sources inside the 4σ contour of this overdensity, of which 3 are i-band dropouts ( Table 6). The galaxies span a volume of 2.3 × 2.3 × 71.4 cMpc. (Fig. 20c). We infer a dark matter halo mass of M DM ≈ 1 × 10 12 M using the methods described in 5.1. The stellar mass CDF for the 8 galaxies is shown in Fig. 20b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.43 and a p-value of 0.17.  COSMOS2020-PCz6.05-05: We identify 6 sources inside the 4σ contour of this overdensity, of which 2 are i-band dropouts ( Table 7). The galaxies span a volume of 2.3 × 3.2 × 62.7 cMpc. (Fig. 21c). We infer a dark matter halo mass of M DM ≈ 8 − 58 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 8 galaxies is shown in Fig. 21b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.45 and a p-value of 0.20.  COSMOS2020-PCz6.05-06: We identify 8 sources inside the 4σ contour of this overdensity, of which 5 are i-band dropouts (Table 8). The galaxies span a volume of 4.9 × 6.1 × 43.0 cMpc. (Fig. 22c). We infer a dark matter halo mass of M DM ≈ 5 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 8 galaxies is shown in Fig. 22b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.24 and a p-value of 0.66.  COSMOS2020-PCz6.05-08: We identify 5 sources inside the 4σ contour of this overdensity, of which 3 are i-band dropouts (Table 9). The galaxies span a volume of 2.2 × 4.5 × 39.2 cMpc. (Fig. 23c). We infer a dark matter halo mass of M DM ≈ 8 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 23b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.32 and a p-value of 0.61.  COSMOS2020-PCz6.69-02: We identify 5 sources inside the 4σ contour of this overdensity, of which 2 are z-band dropouts (Table 10). The galaxies span a volume of 2.3 × 6.4 × 15.0 cMpc. (Fig. 24c). We infer a dark matter halo mass of M DM ≈ 1 × 10 12 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 24b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.41 and a p-value of 0.30.  COSMOS2020-PCz6.92-01: We identify 6 sources inside the 4σ contour of this overdensity, of which 0 are z-band dropouts (Table 11). The galaxies span a volume of 2.5 × 5.7 × 56.8 cMpc. (Fig. 25c). We infer a dark matter halo mass of M DM ≈ 9 − 39 × 10 10 M using the methods described in 5.1. The stellar mass CDF for the 6 galaxies is shown in Fig. 25b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.51 and a p-value of 0.06.  COSMOS2020-PCz6.92-04 We identify 5 sources inside the 4σ contour of this overdensity, of which 2 are z-band dropouts (Table 12). The galaxies span a volume of 3.5 × 3.5 × 37.4 cMpc. (Fig. 26c). We infer a dark matter halo mass of M DM ≈ 2 − 5 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 26b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.34 and a p-value of 0.51.  COSMOS2020-PCz6.92-05: We identify 7 sources inside the 4σ contour of this overdensity, of which 3 are z-band dropouts (Table 13). The galaxies span a volume of 6.6 × 4.6 × 55.9 cMpc. (Fig. 27c). We infer a dark matter halo mass of M DM ≈ 1 × 10 12 M using the methods described in 5.1. The stellar mass CDF for the 7 galaxies is shown in Fig. 27b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.50 and a p-value of 0.05.  COSMOS2020-PCz7.17-01: We identify 5 sources inside the 4σ contour of this overdensity, of which 0 are z-band dropouts (Table 14). The galaxies span a volume of 3.4 × 4.9 × 52.6 cMpc. (Fig. 28c). We infer a dark matter halo mass of M DM ≈ 8 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 28b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.31 and a p-value of 0.64.  COSMOS2020-PCz7.17-04: We identify 5 sources inside the 4σ contour of this overdensity, of which 0 are z-band dropouts (Table 16). The galaxies span a volume of 1.6 × 1.9 × 31.7 cMpc. (Fig. 30c). We infer a dark matter halo mass of M DM ≈ 9 − 35 × 10 10 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 30b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.34 and a p-value of 0.52.  COSMOS2020-PCz7.42-01: We identify 8 sources inside the 4σ contour of this overdensity, of which 0 are z-band dropouts (Table 17). The galaxies span a volume of 6.5 × 4.3 × 54.1 cMpc. (Fig. 31c). We infer a dark matter halo mass of M DM ≈ 1 − 4 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 8 galaxies is shown in Fig. 31b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.53 and a p-value of 0.02.  COSMOS2020-PCz7.69-01: We identify 5 sources inside the 4σ contour of this overdensity, of which 0 are z-band dropouts (Table 18). The galaxies span a volume of 12.0 × 13.6 × 49.4 cMpc. (Fig. 32c). We infer a dark matter halo mass of M DM ≈ 9 × 10 11 M using the methods described in 5.1. The stellar mass CDF for the 5 galaxies is shown in Fig. 32b along with the CDF of the field population. A two-sample K-S test of the two CDFs yields a statistic of 0.24 and a p-value of 0.92.