COSMOS2020 Exploring the Dawn of Quenching for Massive Galaxies at 3 < z < 5 with a New Color-selection Method

We select and characterize a sample of massive ( log ( M * / M e ) > 10.6 ) quiescent galaxies ( QGs ) at 3 < z < 5 in the latest Cosmological Evolution Survey catalog ( COSMOS2020 ) . QGs are selected using a new rest-frame color-selection method, based on their probability of belonging to the quiescent group de ﬁ ned by a Gaussian mixture model ( GMM ) trained on rest-frame colors ( NUV − U , U − V , V − J ) of similarly massive galaxies at 2 < z < 3. We calculate the quiescent probability threshold above which a galaxy is classi ﬁ ed as quiescent using simulated galaxies from the SHARK semi-analytical model. We ﬁ nd that, at z 3 in SHARK , the GMM / NUVU − VJ method outperforms classical rest-frame UVJ selection and is a viable alternative. We select galaxies as quiescent based on their probability in COSMOS2020 at 3 < z < 5, and compare the selected sample to both UVJ - and NUVrJ - selected samples. We ﬁ nd that, although the new selection matches UVJ and NUVrJ in number, the overlap between color selections is only ∼ 50% – 80%, implying that rest-frame color commonly used at lower-redshift selections cannot be equivalently used at z > 3. We compute median rest-frame spectral energy distributions for our sample and ﬁ nd the median QG at 3 < z < 5 has a strong Balmer / 4000 Å break, and residual NUV ﬂ ux indicating recent quenching. We ﬁ nd the number densities of the entire quiescent population ( including post-starbursts ) more than doubles from 3.5 ± 2.2 × 10 − 6 Mpc − 3 at 4 < z < 5 to 1.4 ± 0.4 × 10 − 5 Mpc − 3 at 3 < z < 4, con ﬁ rming that the onset of massive galaxy quenching occurs as early as 3 < z < 5. Uni


INTRODUCTION
In the past few years it has become evident that there exists an extraordinary population of massive galaxies which have already ceased their star formation and become quiescent when the universe was just 2 billion years old (z ≥ 3).The first hints of this population were candidates selected from optical/near-infrared (NIR) sur-veys such as in the GOODS-South field (Fontana et al. 2009), and later the Newfirm Medium-Band Survey (NMBS, Whitaker et al. 2011;Marchesini et al. 2010) and the FourStar Galaxy Evolution Survey (ZFOURGE, Straatman et al. 2014;Spitler et al. 2014).Targets were selected using various indicators, either on the basis of their specific star formation rate (sSFR) or rest-frame colours.Although these candidates were tentative then, the recent influx of spectroscopically confirmed massive galaxies at z > 3 puts aside any doubt that these galaxies were already passive, and rapid evolution in the first few billion years caused the onset of the dawn of quenching.There are several examples of single galaxies (e.g., Marsan et al. 2015;Forrest et al. 2020a;Valentino et al. 2020;Saracco et al. 2020;Carnall et al. 2023), as well as full samples (Marsan et al. 2017;Schreiber et al. 2018;Forrest et al. 2020b;D'Eugenio et al. 2020a;Mc-Conachie et al. 2021;Nanayakkara et al. 2022), that have been spectroscopically confirmed as both massive (usually log(M * /M ) 10) and quiescent (log(sSFR) yr −1 −9.5), as well as a number of studies which selected and analysed massive quenched galaxies in large photometric data sets (Merlin et al. 2018;Girelli et al. 2019;Merlin et al. 2019;Guarnieri et al. 2019;Cecchi et al. 2019;Carnall et al. 2020;Shahidi et al. 2020;Santini et al. 2020;Stevans et al. 2021;Ito et al. 2022).
Thanks to the named studies, we now know about the properties of massive quiescent galaxies (QGs) at z > 3.These galaxies exhibit significant Balmer/4000 Å breaks indicative of ages between 100 Myr and ∼ 1 Gyr, although the ages can be quite uncertain (e.g.Schreiber et al. 2018;Carnall et al. 2020;Forrest et al. 2020a;D'Eugenio et al. 2020a;Nanayakkara et al. 2022;Carnall et al. 2023).A few spectroscopically confirmed galaxies exhibit weak flux blue-wards of the break (a "blue bump"), indicating a stellar population comprising predominantly A-type stars, and therefore likely recent quenching (< 1Gyr).This is distinct from the UVupturn observed in local quenched galaxies, which seems to be due to emission from post main sequence stars such as Horizontal Branch (HB) or Asymptotic Giant Branch (ABG) stars (Greggio & Renzini 1990;Yi et al. 1997; see Dantas et al. 2020 for a more recent overview).The age and mass of such early QGs implies a rapid formation period and recent quenching, much like post-starburst, or E/K+A galaxies observed at z < 3 (e.g.Dressler & Gunn 1983;Goto 2005;Wild et al. 2009;Ichikawa & Matsuoka 2017;Wu et al. 2018;Chen et al. 2019;Wild et al. 2020;Wu et al. 2020;Wilkinson et al. 2021; see French 2021 for a review of post-starburst galaxies).An onset of rapid quenching at z ∼ 4 − 6 as suggested by the previous works would result in a fairly young massive quiescent population at z ∼ 3 which had recently undergone an intense period of star formation.
Recently quenched QGs at z > 3 also seem to be fairly dust-free (Schreiber et al. 2018;Kubo et al. 2021).This could be because dust is being destroyed in the poststarburst phase (e.g., Akins et al. 2021) and our selections for QGs at z > 3 are dominated by PSBs and relatively dust-free QGs, or because our selections and surveys are blind to dusty QGs at z > 3 (D'Eugenio et al. 2020a).The upcoming investigations into galaxies appearing in NIR observations with no optical flux (HST -dark galaxies) with the James Webb Space Telescope (JWST) will hopefully address this pertinent debate (Wang et al. 2019;Long et al. 2022;Barrufet et al. 2021Barrufet et al. , 2022)).
Quiescent populations are commonly selected using rest-frame colour-colour diagrams, with an increasing number of studies further analyzing the correlations of physical properties relative to the sample location within these diagrams (e.g.Whitaker et al. 2012, Belli et al. 2019).The four most popular in use are the U V J selection criteria (Wuyts et al. 2008, Williams et al. 2009), or variations thereof (e.g.Whitaker et al. 2012), BzK (Daddi et al. 2004), and N U V rK or N U V rJ (Arnouts et al. 2013, Ilbert et al. 2013, Davidzon et al. 2017).Rest-frame colour-colour diagrams have been favoured for their quick and easy use with large photometric data sets; galaxies tend to occupy distinct parameter spaces within these diagrams, making selection for quiescent populations moderately straightforward.Critically, the use of two colours allows for the separation of quiescent from star forming galaxies, with the U − V or N U V − r colour separating star forming galaxies from quiescent, and the r − J or V − J colour separating red dusty galaxies from those that are quiescent.
These selections were originally designed to separate galaxy populations at low redshift, but as newer data has emerged demonstrating that the bi-modality in populations clearly extends to higher redshift (e.g.Whitaker et al. 2011;Muzzin et al. 2013;Ilbert et al. 2013;Straatman et al. 2014), the selections and variations thereof are now often adopted at z > 2 (e.g., Whitaker et al. 2012;Straatman et al. 2014;Hwang et al. 2021;Suzuki et al. 2022;Ji & Giavalisco 2022).In sum, rest-frame colour colour diagrams are ideal to select quiescent populations as they are computationally cheap, straightforward.They can also be model dependent, but this depends on the way in which the rest-frame colours are calculated.
However, the application of rest-frame colour diagram selection boxes to higher redshift samples (in particular, z > 3) should be cautioned.Several studies (Leja et al. 2019b;Belli et al. 2019;D'Eugenio et al. 2020a) find that 10 − 30% of U V J selected QGs at higher redshift are either low redshift interlopers or dusty star-forming galaxies.Lustig et al. (2022) show that this contamination occurs more frequently in the U V J area populated by older QGs, at the top right corner of the quiescent selection box.Moreover, despite sSFR decreasing with redder U − V and bluer V − J colours, Leja et al. 2019b demonstrate that a U V J selection cannot differentiate between moderate and low specific star formation rates.Such a distinction in specific star formation rate is critical at high redshift where recently quenched galaxies, or those in the process of quenching, are more prevalent.Furthermore, Antwi-Danso et al. 2022 show that the extrapolation of rest frame J (which can be common at z > 3) leads to galaxies being wrongly classified.
Whilst rest-frame colour diagram selections remain useful, particularly for large ground based surveys, the current methods explored above are less reliable at z > 3 for several reasons.Firstly, many studies clearly show that 3 < z < 5 is the epoch where QGs first appear (e.g.Marsan et al. 2020,Valentino et al. 2020), and so the galaxy population is not yet bimodal.Instead, as Marsan et al. 2020 find, massive galaxies in this epoch exhibit a wide spectrum of properties, with varying amounts of dust and star formation, as well as some harbouring an Active Galactic Nucleus (AGN) (see also Marsan et al. 2015, Marsan et al. 2017and Ito et al. 2022).It therefore does not make sense anymore to apply a colour colour cut that was designed to separate galaxies in a bimodal universe when that bi-modality does not yet exist.Secondly, it takes on the order of ∼Gyrs for galaxies to evolve and become truly red (without dust), which implies that we would not expect to find the upper edge of the U V J quiescent box highly populated at z > 3. Instead, it is more likely that galaxies found in this parameter space are instead dusty star forming galaxies that have scattered in over the V − J < 1.5 line.This is seen clearly in Lustig et al. 2022, who compare observations of QGs at z ∼ 3 with several simulations.They find that the contamination of the U V J quiescent box at z ∼ 2.7 is most significant at reddest edge of the box, near the V − J < 1.5 line.To remedy this, they suggest changing the V − J constraint from V − J < 1.5 to V − J < 1, which effectively cuts the selection area in half.
The third reason is that galaxies in the high redshift universe are faint, and thus even state-of-the-art groundbased surveys will have fairly large photometric errors for most galaxies which propagate to uncertainties a factor of two larger in colours (see Appendix C in Whitaker et al. 2010).This is further antagonised by the fact that surveys such as COSM OS are limited to the resolution of the now decommissioned Infra Red Array Camera (IRAC) on the Spitzer Space Telescope, which, besides having a large point spread function (PSF), also barely probes the Balmer break at z > 4 for QGs.The use of rest-frame V − J colour as a separator between dusty star forming galaxies and red QGs becomes highly uncertain at z > 5, where rest-frame J moves beyond ∼ 9µm (IRAC/Channel 4) and must therefore be extrapolated.For surveys limited to ∼ 5µm (Channel 2), this becomes z > 3.All of these reasons together imply that simply drawing a dividing line over a non bimodal, noisy population will result in both incomplete and quite significantly contaminated samples.
Given the arguments against them, one might reconsider the usefulness of rest-frame colour-colour selections when there are multiple options for spectral energy distribution (SED) fitting tools to calculate sSFRs which can be used to select for QGs.Whilst this alternative is often used (e.g.Carnall et al. 2020;Ito et al. 2022;Carnall et al. 2022), more complex methods may require careful tuning, time, and sometimes vast computing resources.Finally, this still does not solve the problem because these approaches require a sSFR cut or otherwise to select QGs in a population where the lines between star forming and quiescent are blurred.Whilst these methods offer opportunities to study the statistical properties and their stellar assembly, the selection of QGs should still be possible with simpler methods.It is therefore important that we take care in optimizing the rest-frame colour-colour selections as they are fast and can be easily applied to different surveys irrespective of the filters used, therefore serving as crucial tools for the selection of candidates for spectroscopic follow up with instruments such as JWST.This is something that the community has already begun to explore, e.g.Leja et al. 2019c (efficient selection of QGs at z < 1) and Antwi-Danso et al. 2022 (efficient selection of QGs at z > 3 using synthetic ugi s filters).
In this paper, we present a new rest-frame colour colour diagram and selection method specifically designed to find and select high confidence massive QGs at z ∼ 3 and beyond.Using the latest COSMOS catalogue, COSMOS2020, we explore the utility of this new selection technique.In Sections 2 and 3, we describe the data and a modified COSMOS2020 catalogue made specific to this study.In Section 4 we describe the selection of a robust sample of massive galaxies at 3 < z < 5.In Section 5, we introduce the new colour selection method.In Section 6, we present this new selection applied to COS-MOS2020 and the main results.Finally, we summarise our conclusions and outlook in Section 10.For all calculations we use the WMAP9 flat LambdaCDM cosmology (Hinshaw et al. 2013) with H 0 = 69.3km s −1 Mpc −1 , Ω m = 0.307.All magnitudes are in the AB system defined by Oke & Gunn 1983 as m AB = −2.5log 10 fν 3631Jy .

COSMOS2020
The Cosmological Evolution Survey (COSM OS) is currently the deepest NIR wide-field multi-wavelength survey that provides data over 2 square degrees of the sky (Scoville et al. 2007, Koekemoer et al. 2007).The latest version of the COSM OS photometric catalogue, COSM OS2020 (Weaver et al. 2021), improves on previous version (Laigle et al. 2016) in many ways.Firstly, with the addition of ultra-deep optical data from Hyper-Suprime Cam (HSC; Miyazaki et al. 2018;Aihara et al. 2018), secondly, the fourth data release of Ultra-VISTA (DR4) (McCracken et al. 2012;Moneti et al. 2019), effectively 1 mag deeper in K s over the entire field, and thirdly, the inclusion of all Spitzer /IRAC data ever taken in the COSM OS field (Weaver et al. 2021;Moneti et al. 2022).COSM OS2020 contains four catalogues, combining two source extraction methods and two photometric redshift codes.The Classic catalogue is produced using classical aperture based photometry method using Source Extractor (Bertin & Arnouts 1996), whilst The Farmer catalogue is produced using a profile fitting photometry method (Weaver at al., submitted), which is a wrapper written for the source profile fitting code The Tractor (Lang et al. 2016).The Farmer has a distinct advantage over classical photometry methods due to its ability to correctly model and de-blend overlapping sources.This is particularly beneficial in deep fields where source crowding and overlapping is common.Both photometry catalogs are fit with two different photometric redshift codes (Le Phare, Arnouts &Ilbert 2011, andeazy-py (Brammer et al. 2008;Brammer 2021).Weaver et al. 2021 show that The Farmer performs equivalently to Source Extractor and excels at i < 24, independent of the redshift code used.For this analysis we use The Farmer photometric catalogue combined with redshift and physical parameter estimation using eazy-py.
Although the Farmer photometry and its associated uncertainties are suitable for general use SED fitting, Weaver et al. (2021) noted that it is likely the photometric uncertainties are underestimated.To test this, we ran the custom code Golfir (Kokorev et al. 2022) on a small subset of the COSMOS2020 IRAC images.Golfir uses a prior based on the highest available resolution imaging to model the flux in the IRAC photometry.Overall, the agreement of both photometric catalogs is high, however the IRAC errors derived with The Farmer are smaller by a factor of ∼ 5.This is similar to the approach and conclusions in Valentino et al. (2022).At z > 3, the IRAC photometry plays a crucial role in the identification of massive galaxies.This is due to the fact that the light emitted from the bulk of the stellar mass is measured at observed frame λ obs > 2.5µm, which is measured predominantly by the IRAC bands.Although the depths probed by the first two IRAC bands Channels 1 and 2 are comparable to those in Ultra-VISTA bands JHK s (∼ 25.5 mag, 3σ), IRAC channels 3 and 4 are shallower by ∼ 3 mags.For massive QGs at z > 3, this may result in the IRAC bands approaching the detection limit of the survey.For this reason and the investigations explained above, for this work we conservatively boost the IRAC photometric errors by a factor of 5 and refit the COSM OS2020 Farmer photometry with eazy-py (see Section 3).As with Weaver et al. (2021), we do not include the Subaru Suprime-Cam broad band photometry or the GALEX F U V and N U V photometry, due to their shallow depth and broad PSF, respectively.In total, we use 27 photometric bands for SED fitting, probing observed frame ∼ 0.3 − 8µm.

3D-HST
In Section 3, we will introduce the latest python implementation of eazy, eazy-py and its newer features.To demonstrate their validity, in particular the computation of physical properties such as stellar mass and star formation rate, we use an observational catalog with similar properties as COSM OS2020 to benchmark the performance.We consider the Prospector version of the 3D-HST catalog from (Leja et al. 2019d).The 3D-HST survey (Skelton et al. 2014) is a 248−orbit Hubble Treasury program completed in 2015, providing W F C3 and ACS spectroscopy for galaxies in 5 extra-galactic deep fields, including COSM OS, covering 900 square arc-minutes (199 square arc-minutes in COSM OS).The catalog includes photometry in up to 44 bands in COSM OS covering 0.3-8µm in the observed frame, with the inclusion of Spitzer M IP S 24µm photometry from Whitaker et al. (2014).The version that we considered contains a representative sub sample of 58, 461 galaxies from the 3D-HST survey (roughly 33%) which span the observed star formation rate density and mass density at 0.5 < z < 2.5.These galaxies are fit with the Bayesian SED fitting code Prospector (Leja et al. 2017;Johnson et al. 2020).Prospector uses galaxy models generated using python-FSPS (Conroy et al. 2009;Conroy & Gunn 2010), non parametric star formation histories, variable stellar metallicities and a two-component dust attenuation model, under the assumption of energy balance.For this paper, we cross match our catalog and the 3D-HST Prospector catalog using a 0. 6 matching radius, resulting in 8964 galaxies in common.

Simulated data
To test the robustness of our quiescent galaxy selection method, we use simulated galaxies from the Shark semi-analytical model (SAM, Lagos et al. 2018).Shark models the formation and evolution of galaxies following many physical processes, including star formation, chemical enrichment, gas cooling, feedback from stars and active galactic nuclei, among others.Critically for this work, Shark includes a two-phase dust model for light attenuation that is based on radiative transfer results from hydrodynamical simulations (Trayford et al. 2020).This model was combined with the stellar population synthesis model of Bruzual & Charlot (2003), adopting a Chabrier (2003) initial mass function, and a model for re-emission in the IR in Lagos et al. (2019).Shark does this using the generative SED mode of the ProSpect SED code of Robotham et al. (2020).The predicted SED of each galaxy is then convolved with a series of filters to get their broad-band photometry.Lagos et al. (2019) showed that Shark is very successful in reproducing the FUV-to-FIR emission of galaxies in observations at a wide redshift range, including luminosity functions, number counts and cosmic SED emission.Lagos et al. (2020) used this model to show that the contamination fraction of dusty, star-forming galaxies (those with 870 µm flux > 0.1 mJy) in the U V J diagram increases with increasing redshift, reaching ≈ 47% at z = 3, showing that the commonly adopted U V J colour selection could be improved to reduce contamination.Here, we use a lightcone presented in Lagos et al. (2019) (see their Section 5) that has an area of ≈ 107 deg 2 , and includes all galaxies in the redshift range 0 ≤ z ≤ 8 that have a dummy magnitude, computed assuming a stellar mass-to-light ratio of 1, < 32.The method used to construct lightcones is described in Chauhan et al. (2019).

Eazy-Py
In this study, we use photometric redshifts, stellar masses, star formation rates and rest-frame colours derived with eazy-py1 (Brammer 2021).Here we detail this process, including an updated description and presentation of the eazy-py code.eazy-py is a pythonic photometric redshift code adapted from the original eazy code written in C+ (Brammer et al. 2008).eazy-py estimates redshifts by fitting non-negative linear combinations of galaxy templates to broad-band photometry to produce a best fit model.The flexibility of the code means that in theory, any kind of tem-plate can be used, e.g., quasar or AGN templates.The templates used in this paper are comprised of 13 stellar population synthesis templates from FSPS (Conroy et al. 2009;Conroy & Gunn 2010) spanning a wide range in age, dust attenuation and log-normal star formation histories.This template set is designed to encompass the expected distribution of galaxies in U V J color space at 0 < z < 3, and in theory beyond.We The latest version of eazy-py (version 0.5.2) has been adapted to be a general use SED fitting code, providing estimates of physical parameters such as stellar mass, star formation rate, and dust attenuation.These physical properties are assigned to each template and "propagate through" the fitting process with the same nonnegative linear combination coefficients that return the best-fit SED model.This results in star formation histories that are non-negative linear combinations of a number of basis star formation histories, and therefore the star formation history of a galaxy best-fit model will mimic non-parametric methods such as those used in Pacifici et al. 2016;Iyer et al. 2019;Leja et al. 2019a and other works.This means that eazy-py is more similar to codes that implement non-parametric star formation histories than it is to traditional SED fitting codes, and can account for the mass that is often missed as a result of using parametric star formation histories (Lower et al. 2020).We demonstrate this in Section 3.1.2.Errors on physical parameters are calculated by drawing 100 fits from the best fit template error function and using the 16th and 84th percentiles as the extremes of the 1σ errors.It should be noted that the physical parameter errors are not marginalised over the redshift error, therefore it is likely that the errors from eazy-py are underestimated.Figure 1 shows a comparison of photometric and spectroscopic redshifts for all galaxies at 0 < z < 6 in our catalog.We find similar outlier fractions and bias for both star forming and QGs compared to the official COSMOS2020 release (Weaver et al. 2021).

Rest-frame flux densities
The rest-frame fluxes are calculated based on the approach of Brammer et al. (2011), which uses the best fit model as a guide to interpolate between the observed photometry.This interpolation itself is weighted by the photometric errors, and therefore the rest-frame flux errors reflect those of the photometry.This means that the rest-frame fluxes are almost entirely derived from the photometry with only partial guidance from the best fit template.The results also account for the filter shapes and relative depths, which is advantageous particularly for multi-instrument photometric catalogs.This essentially ensures a "model-free" approach, which is crucial for our method, as rest frame colours are used in the selection process, and should ideally reflect the observed universe and not our models.For a full description of calculation of the rest-frame fluxes we refer the reader to Appendix A. The rest-frame filters used in this paper are the GALEX N U V band (λ = 2800 Å) (Martin et al. 2005), the re-calibrated U and V filters by Maíz Apellániz ( 2006), and 2MASS J band (Skrutskie et al. 2006).

Stellar masses and Star formation rates
In this section we explore the stellar masses (M ) and star formation rates (SFRs) derived with eazy-py.We note that it has been shown that star formation rates derived from broad-band photometric SED fitting assuming parametric star formation histories can underestimate the SFR of a galaxy by several orders of magnitude (Lower et al. 2020).Sherman et al. (2020), who used eazy-py with default FSPS templates using a variety of realistic SFHs (rising, bursty), find that M are overestimated on average by 0.085 dex, and SFRs are underestimated on average by ∼ 0.46 dex.As we fit with a different, newer set of templates using log-normal star formation histories, we cannot assume the derived masses and SFRs have from the same offsets.
To benchmark the eazy-py M and SFRs in this study, we use the 3D-HST catalog described in Section 2.2.From the prospector 3D−HST catalog we use redshift (which are a compilation of spectroscopic redshifts, grism redshifts and redshifts derived using eazy), stellar masses and star formation rates, which are averaged over the past 100 million years.Here, we use the masses and star formation rates derived with Prospector as the "ground truth".We select all galaxies with a photometric redshift agreeing within 15% of ∆z/(1 + z) to the Prospector redshift, which results in 8134 galaxies spanning 0.5 < z < 3.0.We then compute the difference (in dex) between our parameters (SFR, M ) and the Prospector parameters.Due to differences in method, the K s band magnitudes for COSMOS2020 are marginally brighter ( ∼ 0.09 dex) than those in the Skelton et al. 2014 catalog that was used in Leja et al. 2019a.To facilitate a fairer comparison, we scale the Prospector masses and star formation rates accordingly.In practice, this results in a very small difference.We show both the SFR and mass comparisons in Figure 2 as a function of the estimates derived with Prospector.We calculate the median offset and scatter, and find that there is excellent agreement between the two catalogs, with only a minor overestimation of 0.04 dex for masses and 0.07 dex for SFRs by eazy-py with a scatter of 0.21 dex for mass and 0.55 dex for SFRs.The difference between the results of this investigation compared to Sherman et al. 2020 is notably that of the SFRs -whilst they find an underestimation of ∼0.5 dex, we report a slight over-estimation.This is likely due to the different template sets used.In conclusion, given the agreement and no catastrophic bias in stellar masses or SFRs derived with eazy-py we deem them acceptable for use.

SELECTING A ROBUST SAMPLE OF MASSIVE GALAXIES
We begin by selecting all objects within the area covered by both HSC and Ultra-VISTA (i.e., Flag Combined = 0, area ≈ 1.27 •2 ), classified as galaxies according to the star galaxy separation criteria (see Appendix B), and with K s < 25.6 mag, which is the 3σ limiting magnitude.From these, we select all galaxies with photometric redshifts 3.0 < z phot < 5.0, with a constraint that the lower 16% of the photometric redshift probability distribution, p(z), should be contained above z = 3, and the upper 84% below z = 5.If a spectroscopic redshift is available in the catalogue we require 3 < z spec < 5.At 4.5 < z < 5, the rest-frame J band is covered by IRAC channels 3 and 4, which means that it is still observationally constrained and not extrapolated (see Appendix C).This is vital to ensure the robustness of the sample (Antwi-Danso et al. 2022).For this reason, we do not extend beyond z = 5.0 in our sample selection.We next make a cut on stellar masses, 10.6 < log(M * /M ) < 12, which is the knee of the mass function at this redshift range and well above the stellar mass completeness (Weaver et al. 2021).However, it is important to note that at 4.5 < z < 5, the Balmer/4000 Å break moves through the K s band, which is the reddest band in the chi-mean izY JHK s detection image used in the photometric modelling.Therefore, our selection is likely not sensitive to older quiescent galaxies at this redshift range, if they exist.To our sample, we then add the following requirements to ensure the confidence in our massive galaxy sample is high.We choose to conservatively exclude anything with a reduced χ 2 equal to χ 2 /(N filt ) greater than 10.We also require K s SNR > 5.In summary, we require: • Inside combined HSC/UVISTA area • Classified as galaxy • z phot,16% > 3.0 and z phot,84% < 5.0 In total, this selection results in 1568 galaxies with a median redshift of z phot = 3.55 and median stellar mass of log(M * /M ) = 10.94.In addition to this, we select a sample at 2 < z < 3 in a similar manner with a lower mass cut of 10 < log(M * /M ) < 12, in order to fit the Gaussian Mixture Model described in the following section.This gives a fitting sample of 12985 galaxies at 2 < z < 3 with a median photometric redshift of z phot = 2.46 and median stellar mass of log(M * /M ) = 10.52.

Introducing the GMM
We present here a new colour selection designed to make finding QGs at z > 3 easier.Our new colour selection method combines the N U V − U , U − V and V − J colours, using the filters defined in Section 3.1.1.This colour selection method is similar to both U V J and NU-VrJ, with two major differences.The first is that four bands (three colours) are used instead of three, and the second is that the separation itself is not defined by a traditional "box" but rather using simple machine learning methods.Both are introduced to make the separation of QGs from (dusty) star forming more pronounced, and furthermore allow an easier separation between recently quenched and older QGs.The U − V and V − J colours are included to separate quiescent galaxies from star forming, as originally designed.The N U V − U colour is added to measure the light emitted from recent star formation in the form of A-type stars and is similar to N U V − r, except that it probes a shorter baseline, specifically the flux directly blue-wards of the 4000 Å/Balmer break.Previous works have suggested N U V − U is a viable indicator for time since quenching, for example Schawinski et al. (2014).Similarly, Phillipps et al. (2020) use N U V − u as a means to remove passive galaxies with some residual star formation from a sample of truly passive galaxies.The reason for the use of four bands / three colours is twofold: firstly, the increase in dimensions allows us to extract more information whilst still only requiring 4 data points.Whilst this is similar to dimensionality reduction, it has the advantage of using information that is likely already available or easy to compute, and does not require imposing a high signal to noise cut to the data.Secondly, N U V −U does not have a co-dependent variable with V −J, which allows for the spread of colours in N U V − U /V − J to become more obvious, making it easier to both separate the quenched population from star forming, as well as explore the properties of galaxies within the quenched population with different amounts of recent star formation.This is particularly relevant at z > 3, where the fraction of PSB galaxies is high (e.g.D'Eugenio et al. 2020b and Lustig et al. 2020 report a PSB fraction of 60-70% for photometrically selected QGs at log(M * /M ) > 11).If we consider the 2D projection (N U V −U , V −J), it is clear that the star forming and quiescent populations separate more clearly due to the divergent alignment of the dust vector relative to the evolution tracks (see Figure 3).As an example, we plot the massive 3 < z < 5 sample in this colour space with tracks showing the colour evolution of a single stellar population aged to 1 Gyr at both A V = 0 and A V = 1 generated using the Python-FSPS package.Both tracks move directly up through the quiescent area as they evolve over time.The lower portion of this area preferentially selects quenching or post-starburst galaxies, whilst the upper half encompasses older QGs.This age trend has been observed in the U V J diagram (e.g.Whitaker et al. 2012, Whitaker et al. 2013, Belli et al. 2019, Carnall et al. 2019), thus it is not entirely surprising that it appears in this colour diagram tooand in fact more pronounced due to the addition of the N U V − U colour.We note that the direction of the age tracks diverge from the direction of the dust vector for (N U V − U , V − J), resulting in red dusty star forming galaxies pushed downwards and to the left relative to old, red QGs, whereas for U V J they follow the same direction, resulting in old, red QGs populating the same space as red dusty star forming galaxies.

Defining the separator
Most selections in rest-frame colour diagrams have been made by drawing a box empirically.We choose to separate QGs from star forming by using a Gaussian Mixture Model (GMM).This is a probabilistic model that operates under the assumption that the data can be fit by a finite number of Gaussian curves, for which the parameters are not known.GMMs have already been employed successfully in multiple research areas in astrophysics, including to select and explore various galaxy populations such as quiescent and PSB galaxies (e.g.Black & Evrard 2022; Ardila et al. 2018).Probabilistic selection of QGs at z > 3 has also been carried out with great success by Shahidi et al. (2020) and Santini et al. (2020).We use the Gaussian Mixture Model package supplied by Scikit-learn (Pedregosa et al. 2011).Briefly, the GMM algorithm uses expectation-maximisation (EM), which is an iterative process used for classification when there are no "correct" labels (as is our case).EM works by choosing random components and iterating towards the best fit by computing the likelihood of each data point being drawn from the current model, and then adjusting the parameters to maximise that likelihood.In this approach, we choose to use all three colours: N U V − U , U − V and V − J.We fit the model using the 2 < z < 3 sample, which is first cleaned by requiring all colours have 0 < colour < 4. To ascertain the optimal number of components, we fit multiple times using 1 − 10 Gaussian components and calculate the Bayesian Information Criterion (BIC) for each number of components, which is defined as where n is the number of model parameters, k is the number of data points fit and L max is the maximum likelihood of the fit.The BIC discriminates between models by penalising the number of parameters, which avoids over-fitting.We randomly select 60% of the sample, fit and compute the BIC for 1 − 10 components, and repeat 1000 times.We find that a 6−component model is best fit in ∼ 75% of the repeated fits and adopt this as our baseline model.The GMM returns for each galaxy a likelihood of belonging to each group, which we convert to a probability.We could simply use this value to decide what to classify as quiescent, however, this does not account for the errors in a galaxy's colours.
To assign each galaxy a probability of belonging to the quiescent group, which is identified by eye as the component in the quiescent area of U V J space, we instead take the following approach, inspired by Hwang et al. (2021).Each galaxy has rest frame fluxes in the N U V , U , V and J bands, and associated errors.Assuming these are Gaussian distributed, we boot-strap the fluxes 1000 times and compute the N U V −U , V −J and U −V colours for each set of rest frame fluxes.Each set of 1000 colours is then fit with the GMM and the percentiles (5, 16, 50, 84, 95) of the probabilities are calculated.This results in a quiescent probability distribution p(Q) for each galaxy, which is marginalised over its rest frame flux errors.Therefore, the final distribution of p(Q) for each galaxy accounts for flux errors.We refer mainly to p(q) 50 (abbrev.p(q)) throughout the text, which is the median quiescent probability calculated in this way.
Here, we use the GMM purely as a tool to classify QGs and do not explore the other five groups defined by the model.In Figure 4 we show the 2 < z < 3 fitting sample and the 3 < z < 5 sample in the N U V U − V J plane, coloured by their quiescent probability p(q).It is evident that the model correctly finds the boundary between quiescent and star forming, but a boundary that is a smooth gradient rather than a binary separation.
5.2.1.The p(q) threshold for quiescence .Left column: SHARK simulated galaxies in different colour planes.GMM/N U V − V J (this work, top) and U V J (middle) colour diagrams for massive (log(M * /M > 10) galaxies in the Shark simulation (grey), with QGs found by each selection method as red (blue) hexagons and galaxies missed by both selection methods as red (blue) crosses.The bottom panel shows the Receiver Operator Characteristic (ROC) curve for the trained Gaussian Mixture Model as a function of quiescent probability (p(Q)) threshold.The Area Under the Curve (AUC), which represents the percentage probability of correctly classifying QGs, is reported, as well as the p(Q) threshold, which is calculated as the geometric mean (shown on the figure as the coloured red circle on the red curve).The False Positive Rate and True Positive Rate is also presented for the Whitaker et al. 2012 U V J selection (blue star).Middle column: Same as left column but at z = 3.Right column: Same as left column but at z = 4.
To calculate the p(q) threshold above which a galaxy is classed as quiescent and assess the performance of the method, we use galaxies in the Shark simulation (Section 2.3).At redshift snapshots of z = 2, z = 3, and z = 4 we select a massive galaxy sample (log(M * /M ) > 10).We then define QGs in the sample as those with specific star formation rates log(sSFR t H yr −1 ) < 0.2, where t H is the age of the universe at that redshift snapshot and the sSFR is averaged over the last ∼ 50 Myr.We fit the GMM to the N U V −U , V −J and U −V colours at each redshift snapshot.We then calculate the Receiver Operator Characteristic (ROC) curve, which is simply the False Positive Rate (FPR) as a function of True Positive Rate (TPR).The Area-Under-Curve (AUC) can be used as a measure of effectiveness of the classification method: an AUC very close to unity shows that the method correctly classifies objects with a high TPR and low FPR.We also compute the same numbers for the Whitaker et al. 2012 U V J selection method (defined as U −V > 0.8×(V −J)+0.7,U −V > 1.3 and V −J < 1.5).Figure 5 shows both the N U V U -V J and U V J diagrams and the ROC curves for each redshift snapshot.The AUC increases from z = 2 to z = 4 , meaning that the GMM has increasing chance of correctly identifying QGs (88%, 93% and 97% at z= 2, 3, 4 respectively).It is not possible to calculate the AUC for U V J because there is only one selection "threshold", which is whether or not a galaxy falls inside the U V J quiescent area.We can instead calculate the TPR and FPR for U V J.The optimal p(q) threshold is defined as the maximum of the geometric mean, which finds the threshold at which the TPR and FPR are perfectly balanced (or the top left most point on the ROC curve), and is defined as max( T P R * (1 − F P R).The resulting p(q) thresholds at z = 2, z = 3, and z = 4 are calculated using the geometric mean and are shown in Table 1.

UVJ versus GMM in
0.1, with U V J having FPR=30%, TPR=90% and N U V − V J with FPR=33%, TPR=89%.However, the optimal threshold for N U V − V J performs substantially better at this redshift, with a TPR only a few % lower (81%), whilst the FPR is more than halved, reducing from 33% to 15%.If we consider instead the contamination, which is defined as the number of galaxies defined as quiescent compared to the total number of UVJ quiescent galaxies, the conclusion changes.The contamination fraction of UVJ in SHARK at z = 2, 3, 4 is 80, 86, 91%, further highlighting the issues with UVJ.Looking to higher redshift, it is evident that N U V −V J/ GMM classification not only out-performs U V J classi-Table 1. Quiescent probability (p(q)50) thresholds for redshift ranges, defined by the maximum of the geometric mean of the ROC curve trained on Shark simulated data.

APPLICATION TO COSMOS2020
We apply the GMM to the massive galaxy sample in COSMOS2020 at 2 < z < 5 selected in Section 5.
Whilst SHARK produces galaxies at redshift snapshots of z = 2, 3, 4, observed galaxies are continuously distributed in redshift.We therefore choose not to assume that the p(q) threshold is a step function, but rather a smoothly varying function of redshift.We fit a second order polynomial to the three points (zphot, p(q)) = (2, 0.56; 3, 0.18; 4, 0.01) and use this function to determine if a galaxy is chosen as quiescent or not.This naturally incorporates the assumption that as we move to higher redshift, in order to be defined as quiescent, galaxies may look less like quiescent galaxies at z = 2, and more like post-starburst galaxies.We require that at z = 2−4, the median of the p(q) distribution is above the threshold.At z ≥ 4, we instead require that 95% of the p(q) distribution is above p(q) = 0.01, in order to only select the highest confidence candidates.The application of redshift dependent p(q) thresholds selects 1455 QGs at 2 < z < 3 and 230 QGs at 3 < z < 5. We visually inspect cut-outs of the 3 < z < 5 sample in the optical and NIR bands, and remove nine QG candidates which are spurious, reducing the total 3 < z < 5 sample to 221 in number.In the following sections, we present the properties of our sample of QGs, compare its statistics with those of galaxies selected using traditional colour diagrams, and discuss the differences between methods.
6.1.Comparing GMM/N U V − V J to U V J and NUVrJ colour selections We apply both the Whitaker et al. ( 2012) U V J selection as well as the Ilbert et al. ( 2013) N U V rJ selection for quiescence to our massive galaxy samples at 2 < z < 3 and 3 < z < 5, and compare these traditional colour selections to the GMM colour selection method.For each panel of Figure 6, we calculate the overlap between For each panel, the overlap is given, defined as the number of galaxies selected by that selection method divided by the number of galaxies selected by the main selection method for that row.In the first row we show the GMM selected sample (dark red) at 2 < z < 3 in the N U V − V J diagram, NUVrJ diagram and U V J diagram.The second row shows the NUVrJ selected sample in NUVrJ (light blue) as well as the other two colour colour diagrams and the third row shows the U V J selected sample in U V J as well as in the other two colour colour diagrams (dark blue).selections, which is defined as the sample size divided by the sample size of the "main" selection for that row, which is highlighted by the coloured panel.At 2 < z < 3, the overlap between all three selection methods is high: U V J and N U V rJ selections have an ∼ 85% overlap, whilst the GMM selected sample also agrees well, sharing ∼ 80 − 90% with U V J and N U V rJ selected QGs.This would be expected if the rest frame fluxes were measured directly from the template, but in this case they are weighted by the observed fluxes and so the conclusion is that the different selections are generally measuring the same observed SED shape.Hwang et al. (2021), who compared N U V rJ and U V J selected samples at 0 < z < 3 in COSMOS2015, found a similar overlap (∼ 85%).This suggests that the choice of rest-frame colour diagram selection at z < 3 is not crucial, although modifications to both U V J and N U V rJ colour selections may be in order.
At 3 < z < 5 the agreement between colour selections is less clear: U V J and N U V rJ share 58 − 77% of the same sample, whilst GMM shares 52 − 79% with N U V rJ and 66 − 75% with U V J.This difference is highlighted clearly in Figure 7.This is likely due to two reasons: firstly, that neither U V J nor N U V rJ selections include the region where post-starburst galaxies are most often found, which is at the lower edge beyond the U − V < 1.3 boundary.Secondly, the contamination fraction of U V J and N U V rJ is likely much higher than the GMM selection method.Whilst it is easy to remedy the first issue in both cases by simply lowering or removing the dividing line in U − V /N U V − r as discussed above, this may also increase the contamination of the sample by introducing U −V /N U V −r blue star forming galaxies.This is something that Schreiber et al. (2018) confirmed in their analysis.For completeness, we report the performance of an extended U V J selection benchmarked against SHARK in Appendix D. The issue of contamination is more difficult to solve due to the nature of these colour selections, however our method aims to alleviate this problem by both the introduction of the third N U V −U colour, and the option of choosing galaxies based on a probability distribution ideally makes it easier to sharpen the boundaries between different populations.

Full spectral energy distribution
To offer a more complete view of the full SEDs of our selected QGs, in Figure 8 we show both the cutouts from optical/NIR bands and best-fit SEDs for three candidates at z phot = 3.12, 3.92, and 4.68.We also compute the median rest-frame photometry and best-fit models for the whole GMM-selected sample normalized to V = 1 (Figure 9).The median values of K s observed magnitudes, M , SFRs, and sSFRs are also reported in the figure.
The galaxies in our sample show a strong Balmer/4000 Å break indicating a dominant aged stellar population on average, as expected, However, they still have residual flux blue-ward of the break, indicative of a younger populations and recent quenching.The median mass (log(M * /M ) = 10.92) of this sample is ∼ 0.3 dex higher than the mass cut for the original sample selection, confirming that only the most massive galaxies have quenched their star formation at z > 3. The median log(sSFR/yr −1 )= −10.8 indicates that these galaxies are best fit mainly by templates that include little or no ongoing star formation.

Spectroscopically confirmed QGs
As a confidence check, we cross-matched our sample of candidate QGs with a literature compilation of 7 spectroscopically confirmed z 3 QGs in COSMOS from Forrest et al. 2020b (4), Forrest et al. 2020b/Marsan et al. 2015/Saracco et al. 2020 (1), Valentino et al. 2020 (1) andD'Eugenio et al. 2020a (1).For 6/7 sources we retrieve p(q) 10%, consistent with being QG according to our selection.To the remaining galaxy at z spec = 3.352 (Marsan et al. 2015;Forrest et al. 2020b;Saracco et al. 2020), we assign p(q) = 0.6%.This source would thus not be selected using our fiducial threshold at 3 < z < 4. We note that this galaxy has experienced rapid quenching, possibly due to an AGN.The presence of the latter is inferred from the large [OIII]/Hβ emission line ratio, with the oxygen line possibly contaminating the K s photometry (see the discussion in Forrest et al. 2020a).

NUMBER DENSITIES
The number densities of massive QGs at z > 3 is an important constraint on galaxy evolution and theory.The assembly of galaxies with such high stellar masses in an evolved state within only 1.5-2 billion years not only places important constraints on the formation of the first galaxies (e.g.Steinhardt et al. 2016 and references within), but also on the cosmic star formation rate density (cSFRD) (Merlin et al. 2019).As such, the number density of these galaxies provides important context to early galaxy evolution.However, the number densities of massive QGs at z > 3 has been an intense topic of debate, due to both the disagreement between observations and theory, and between the observational studies themselves.The rest frame N U V, U, V, J photometry is shown by the orange circles.The p(z) is shown to the right of each SED in yellow, along with its best-fit photometric redshift, stellar mass, and quiescent probability median value (P (q)50).
Number densities derived from photometric and spectroscopic observations span ∼ 2 dex, but they generally remain higher than those extracted from simulations (Ilbert et al. 2013;Straatman et al. 2014;Davidzon et al. 2017;Schreiber et al. 2018;Merlin et 2018Merlin et , 2019;;Girelli et al. 2019;Shahidi et al. 2020;Carnall et al. 2020;Santini et al. 2020;Valentino et al. 2020;Carnall et al. 2022, Valentino et al., in press).The disagreement among observational works and with theoretical predictions is likely due to a mix of several factors, primarily different sample selections and quiescent criteria used together with analyses done using multiple diverse data sets.Of particular note is the size of such fields, as statistics can be affected by both the rarity of these galaxies and by cosmic variance.
The latter is strongly mitigated by the large contiguous area of the COSMOS field.Here we compute and present the number densities for our of QGs in three bins at 2 < z < 3, 3 < z < 4 and 4 < z < 5 (Table 7).Number densities are calculated using the survey area of the combined HSC /UVISTA coverage, which corresponds 1.27 square We calculate the fractional error due to cosmic variance using the method of Steinhardt et al. (2021) who extend method Moster et al. (2011) for use in the early verse (for details, see Steinhardt et al. 2021 andWeaver et al. 2022).This is combined in quadrature with the Poisson error to get error In Figure 10 references therein).Additionally, we show QG number densities from the SHARK simulation at two mass (log(M * /M ) > [10., 10.6]) and sSFR selection thresholds (log 10 (sSFR/Gyr −1 ) <[-1,-2]), as well as both the flares (z ≥ 5) and eagle (z ≤ 5) simulations for QGs at (log(M * /M ) > 10.6) at two different selections (log 10 (sSFR/Gyr −1 ) <[-1,-2]) (Lovell et al. 2022).
It is important to note that the literature compilation of QGs at z > 3 comprise selections at different mass limits ranging from (log(M * /M ) > 10, (Carnall et al. 2020) upwards, whilst our mass limit is much higher (log(M * /M ) > 10.6).In general, our number densities at 3 < z < 4 agree with the observational studies within 1σ errors.It is interesting to note that the number densities derived by Girelli et al. (2019) are much lower than those derived from COSMOS2020, which likely arises from the COSMOS2020 catalog including all UVISTA stripes, some of which have known over-densities (e.g.McConachie et al. 2021).Looking to higher redshift at z > 4, our estimates agree with those of Weaver et al. (2022) and Shahidi et al. (2020).Generally, it appears that number densities for massive QGs derived from a variety of different fields, selected with a variety of methods, are finally converging on agreement.
Previously, simulations were not able to reproduce the number densities of massive QGs at z > 3 observed in the real universe by a factor of 1-2 dex; this tension remains slightly with eagle, which predicts only upper limits for QGs at log(M * /M ) > 10.6 that are several times lower than most observations.At z = 5, the results from flares are ∼ 50× smaller than our current limits at fixed M threshold, resulting in a direct tension.At z > 5, this simulation predicts similarly low number densities, but the lack of data from observations at z 5 means that for the first time at high redshift, simulations have QGs where observations have found none.SHARK performs the best at producing similar number densities of massive QGs to observations up to at least z ∼ 3, but has a dearth of massive (log(M * /M ) > 10.6) QGs at z 4. The only QGs at z > 4 in SHARK are ∼ 4 times less massive than those in observations, and in fewer numbers too, implying that those galaxies which have been able to quench are still to accumulate mass.Finding QGs at z 5 will require both the combination of wide optical/NIR surveys to find such rare galaxies, and also deep NIR/MIR spectroscopy to confirm them.
In the epoch of the universe where quenching begins, it may make more sense to instead differentiate between quenching, recently quenched, and older (or "true") QGs.This issue confronts a more philosophical ques-   et al. 2018;Park et al. 2022).Alternatively, the entire sample can be treated as homogeneous, which is what we have done with this analysis.The prevalence of QGs with recent star formation at z > 3 -the so-called PSB population -has been measured by a plethora of works, both from photometric and spectroscopic data.Whilst at low redshift, the definitions of post-starburst, albeit broad, are based on measurements involving spectral indices or spectra-based PCA that encode information about strong Balmer absorption lines as well as weak or no emission lines (e.g.Wild et al. 2014, Chen et al. 2019, Wild et al. 2020, Wilkinson et al. 2021), this type of definition is not possible to apply to large photometrically selected samples.We instead consider how one might define a PSB galaxy in a sample of photometrically selected QGs based just on colours.Marsan et al. (2020) proposed a definition of PSB based on a visual SED inspection and the presence of a UV peak brighter than the emission red-ward of the Balmer/4000 Åbreak.This study estimates that PSB galaxies comprise 28% of the massive (log(M * /M ) > 11) population at 3 < z < 4 and 17% at 4 < z < 6.Approximately twice higher fractions of PSB at log(M * /M ) > 11 are reported by D' Eugenio et al. (2020a) based on their sample of spectroscopically con-firmed objects at z > 3 with measured D n4000 , D B and H δA indices.
Thirty percent of the spectroscopic sample in Forrest et al. (2020b) ceased star formation less than 300 Myr prior the epoch of the observed redshift at 3 < z <.At the same epoch, Schreiber et al. (2018) report that "young quiescent" galaxies, defined as U V J-star forming galaxies with V − J < 2.6 but below some sSFR threshold, are just as numerous as classical QGs.
Here we base a definition of PSB on the strength of the "blue bump" in the SEDs traced by the N U V − U colour.The distribution of N U V − U colours for our QG sample of 221 galaxies peaks at N U V − U = 1.75.Only 28% of galaxies have N U V − U > 2 and only one galaxy has N U V − U > 3. Based on our single stellar population model from Section 5.1, a stellar population will not reach N U V − U > 2 until 0.5 − 1 Gyr has passed and N U V − U > 3 until well after 1 Gyr, assuming no dust attenuation.Based on this, we can define PSBs as those galaxies in our sample with N U V − U < 2, comprising 70% of QGs at 3 < z < 4 and 87% at 4 < z < 5.This is marginally higher than what previously reported (Schreiber et al. 2018).If we restrict our selection to only the most massive galaxies, i.e., those with log(M * /M ) > 11, then these numbers change to 60% at 3 < z < 4 and 86% at 4 < z < 5.This agrees with the 60 − 70% value given by D 'Eugenio et al. (2020a) for galaxies at z ∼ 3.As a fraction of the entire ultra-massive galaxy population, PSBs only comprise 7% at 3 < z < 4, and 12% at 4 < z < 5. Together with the results of the previous section, it is clear that 3 < z < 5 is an active era in the history of the universe, one in which the transition to quiescence is fairly common.9. WILL WE FIND OLD QUIESCENT GALAXIES AT Z > 4 WITH JW ST ?
QGs have already been spectroscopically confirmed with JW ST (Nanayakkara et al. 2022;Carnall et al. 2023), however no old quiescent galaxies have been found at z > 4 yet.In general, the blue colours of our 3 < z < 5 QG candidates indicates a lack of very red, evolved galaxies such as the one presented in Glazebrook et al. 2017.This could be because as an ensemble, the QG population at z > 3 has only recently quenched, and older QGs are much rarer.However, at z > 4.5, the reduced sensitivity to older/redder quiescent galaxies due to the lack of data directly covering the 4000 Å break could be biasing us to the bluest (and youngest) quiescent galaxies.Although this is easily side-stepped with JW ST /N IRSP EC, it is difficult to solve even with JW ST broad-band N IRCAM imaging; at 4.5 < z < 5, the 4000 Å break falls at 2.2 − 2.4µm and is straddled by F 200W on the blue side and F 277W (or redder bands such as F 350W ) on the red side (depending on the program), with no observations of quiescent galaxies at z > 4.5 currently spanning the midpoint or upper edge of the break (see e.g.Valentino et al., in press, and Carnall et al. (2022)).This could easily be alleviated with the inclusion of N IRCAM narrow-bands covering 2.2-2.4µm,such as F 210M or F 250M .Currently, the only Cycle 1 program that has the combination of F 200W , F 277W/F 350W and at least one medium band filter in between is the Canadian NIRISS Unbiased Cluster Survey (CANUCS; PI Willot (Willott et al. 2017)), which has both F 210M and F 250M .However, the combined area covered by these filters is ∼ 50 arc-minutes squared, which would net less than one massive quiescent galaxy based on our number densities (see Section 7), assuming the same space distribution on the sky.In short, one would have to be incredibly lucky to find an evolved quiescent galaxy at z > 4.5 without both area and either spectroscopy or medium-band photometry.

SUMMARY AND CONCLUSIONS
In this work, we explore the massive quiescent galaxy population at 3 < z < 5 in COSM OS using the latest photometric catalogue, COSM OS2020.We create a new catalogue using the COSM OS2020 photometry, with Spitzer/IRAC errors boosted and derive redshifts and physical parameters with eazy-py.We motivate the need for a new rest-frame color diagram to select for QGs at z > 3 and present the Gaussian Mixture Model (GMM)/N U V U − V J method as a viable alternative.We use a GMM to fit the N U V − U, U − V, V − J colours of massive galaxies at 2 < z < 3 and galaxies are assigned a probability of being quiescent based on their bootstrapped rest frame colours.This GMM model is then applied to the colours of massive galaxies at 3 < z < 5.Both the GMM model and code to calculate quiescent probabilities from rest frame flux densities are made available online2 .Our key findings can be summarised below: • We calculate the quiescent probability for simulated galaxies from the shark semi-analytical model at redshift snapshots of z = 2, 3, and 4 and find that GMM performs just as well as classical U V J selection at z = 2, with both methods returning similar true positive and false pos-itive rates (TPR∼ 90%, FPR∼ 30%).However, at z > 3, the GMM method outperforms classical U V J selection and particularly excels at z ≥ 4, where the TPR is almost 100% compared to U V J that has less than 40% TPR.This highlights our proposed method as a viable alternative to traditional colour selection methods at z > 3.
• We compare our GMM-selected quiescent galaxies (QGs) in COSM OS2020 to traditional rest-frame colour selected quiescent galaxies (U V J, N U V rJ) and calculate the overlap between selection methods.We find that all three colour selections have high agreement at 2 < z < 3, with methods selecting ∼ 85% of the same galaxies, implying that the exact choice of rest-frame colour selection method at 2 < z < 3 is not crucial.However, overlap between colour selections at z > 3 reduces to ∼ 65%, implying that one should be careful about using U V J and N U V rJ colour selections interchangeably.Generally, the bluer colours of GMM selected QGs at 3 < z < 5 implies a dearth of any highly evolved QGs in this epoch.
• We compute the number densities of QGs at 2 < z < 5 and confirm the overall abundance of QGs presented in many different works, which our values agree with within the uncertainties.Given the conservative nature of this work (the boosting of IRAC errors, the high mass limit and the probabilistic approach), we interpret the number densities as a lower limit.
The next few years will undoubtedly see our knowledge of QGs at z > 3 grow vastly with the recent launch and successful commissioning of JWST (Rigby et al. 2022), for which QGs at z > 3 have already been reported (Carnall et al. 2022;Pérez-González et al. 2022;Cheng et al. 2022;Rodighiero et al. 2023) Valentino et al., in press andeven confirmed (Nanayakkara et al. 2022;Carnall et al. 2023), as well as ongoing large ground-based surveys such as the Cosmic Dawn Survey (Moneti et al. 2022, McPartland et al., 2023, in prep.).Spectroscopic confirmation of multiple QG candidates with, e.g., NIR-SPEC can be done in a matter of minutes (Glazebrook et al. 2021;Nanayakkara et al. 2022), and whilst ∼ 8 hours with V LT /X − shooter or Keck/M OSF IRE are only sufficient to obtain a redshift and stellar populations, the same time (or less) with JWST could provide the required S/N to study detailed physics such as velocity dispersions and metallicities (Nanayakkara et al. 2021;Carnall et al. 2023).Whilst JWST will be crucial for confirming quiescent galaxy candidates and studying their physical properties in detail, wide area photometric surveys still have an important role to play, both for the selection of, and statistical study of these galaxies.
We acknowledge the constructive comments from the referee, which significantly improved the content and presentation of the results.Consider two methods for inferring the flux density of a given galaxy at redshift, z, in a rest-frame bandpass, e.g., the V (visual) band in the Johnson filter system (Johnson 1955;Bessell 1990), with a central wavelength3 λ c ∼ 5500 Å.In the first method we identify two observed bandpasses with central wavelengths that bracket the redshifted restframe bandpass λ c,A < λ c (1 + z) < λ c,B and perform a linear interpolation between the flux densities in the observed bandpasses A and B. In a second method we simply adopt the V band flux density of the best-fit model template that was used to estimate the photometric redshift of the galaxy.The first method has the benefit of being purely empirical, though it will be limited by noise in the two photometric measurements used for the interpolation.The second method incorporates constraints from all available measurements, but it will be more sensitive to systematic biases induced by the adopted model template library.
Here we adopt a methodology that is a hybrid between these two extremes that first makes better use of multiple observational constraints and second somewhat relaxes the dependence on the template library.Specifically, for a galaxy at redshift, z, and a given rest-frame bandpass with rest-frame central wavelength, λ r , we compute a weight, w i , for each observed bandpass with observed-frame central wavelength λ o,i where x i = log(λ r ) − log(λ o,i /(1 + z)), and ∆ and W are each parameters with a default value of 0.5.That is, the weights prioritize (i.e., are smallest for) observed bandpasses that are closest to the redshifted rest-frame bandpass.
The templates are then refit to the observed photometry with least squares weights , where F i and σ i are the original photometric measurements and their uncertainties, respectively.The adopted rest-frame flux density is that of the template in this reweighted fit integrated through the target bandpass.In short, this approach uses the templates as a guide for a weighted interpolation of observations that best constrain a targeted rest-frame bandpass that fully accounts for all bandpass shapes and relative depths of a multi-band photometric catalog.Because the weights are localized for each target rest-frame bandpass, there is no implicit requirement that, e.g., the inferred rest-frame V −J color is within the range spanned by the templates themselves.Finally, we note that this is essentially the same methodology as that in version 1.0 of the original eazy code4 and now implemented in eazy-py.

B. STAR-GALAXY SEPARATION
We use the same method for star-galaxy separation that is described in Weaver et al. (2021).Briefly, the catalogue is matched to the Hubble ACS morphology catalogue (Leauthaud et al. 2007) to isolate galaxies in the half-light radii versus magnitude plane.We fit each object with stellar templates from the PEGASUS library and compute the χ 2 star .We calculate the reduced chi-squared χ 2 r , where (χ 2 r = χ 2 /N f ilt − 1), of both the galaxy best fit template and star best template where N f ilt is the number of filters used.Objects with HSC i-band magnitude < 21.5 or HST /ACS F814W magnitude < 23 with K s SNR > 3 or IRAC Channel 1 SNR > 3 are classified as stars if they lie on the point like sequence in the half-light radii versus magnitude plane or if they are fit better with a stellar template than a galaxy template (i.e, χ 2 star < χ 2 galaxy ).Additionally, objects that do not satisfy these criteria can be classified as a star if χ 2 star + 1 < χ 2 galaxy .We find a similar fraction of stars to the number reported in the official COSMOS2020 catalogue and find a similar distribution in the gzK colour colour diagram (see Figure 11).
C. ROBUSTNESS OF REST-FRAME J BAND ESTIMATE Although searches for massive QGs at z > 3 in the coming era will likely begin with rest-frame colour colour selections, we should be critical about which rest-frame colours should be used.The choice of rest-frame colours should depend on both the desired redshift range probed, and the ability of models to adequately measure rest-frame flux densities from the data.In particular, the colour used to discriminate between red dusty star forming galaxies and red QGs should be carefully chosen.Antwi-Danso et al. (2022) explored the effects of rest-frame J extrapolation for a flux limited sample of galaxies at 0.5 < z < 6 and find that the flux is affected significantly at z > 2 when extrapolation occurs.This can result in the removal of QGs in standard selections and high contamination rates.
At z > 3, current ground-based surveys such as COSM OS rely on data from Spitzer/IRAC Channels 3 and 4. By z ∼ 5, the rest-frame J-band is constrained entirely by Channel 4, which has a moderately shallow limiting 3σ depth of 23.1 mag in COSM OS2020.By comparison, Spitzer/IRAC Channels 1 and 2 are much deeper limited to 26.4 and 26.3 mags 3σ, respectively.However, the fraction of galaxies in COSM OS2020 with SN R 3 in either Channel 3 or 4 at 3 < z < 5 is only ∼ 5%.Making progress towards understanding the physical properties of the first quenched galaxies therefore mandates deeper data in the NIR, else inventive new methods, such as the use of a synthetic rest-frame i s instead of J (Antwi-Danso et al. 2022).To ensure that rest-frame J is constrained for our sample, we limited the redshift upper bound at z < 5. Here, we compute the average SNR in bins of redshift for both Channels 3 and 4 for the entire massive sample and the quiescent sample (see Figure 12).At 3 < z < 3.5 and for both bands, the mean SNR for QGs is 3.At 3.5 < z < 4.0, the mean SNR in Channels 3 and 4 is 2 − 3, so rest-frame J is still constrained by reasonable photometry.At z > 4 however, both Channels have mean SNR between 1 − 2, even after the 5× boosting.This highlights the need for deeper observations in the NIR at > 5 µm.For part of the COSM OS field, this can be achieved with JWST /MIRI within the COSM OS − W eb (PI: C. Casey, Casey et al. 2022) and P RIM ER (PI: J. Dunlop) programs.
The issue of galaxies being classified incorrectly due to their rest-frame colours can partly be alleviated by our method of assigning a probability of quiescence to each galaxy, instead of a binary separator, which relies on the relative scatter within each population to be low -evidently not the case at z > 3.
D. PERFORMANCE OF AN EXTENDED U V J COLOR SELECTION At high redshift (z ≥ 2), it may make sense to lower, or remove, the U − V > 1.3 boundary, allowing for the selection of "recently quenched" or post-starburst galaxies.This has been suggested and implemented by multiple works.For example, Schreiber et al. (2018) suggested combining the lowering of the boundary with the use of a band measuring the contribution from more recent star formation, whilst Marsan et al. (2020), who studied ultra-massive galaxies in the COSMOS-Ultra-VISTA field, remove the line altogether.Belli et al. 2019, Forrest et al. 2020b, Carnall et al. 2020and Park et al. 2022 also advocated for either removing or lowering the boundary.If the U − V > 1.3 constraint is entirely removed, the picture changes for U V J.At z = 2 the results remain roughly the same: the TPR and FPR both increase by 4%.At z > 3, the improvement is marked: the TPR increases by 25% to 98%, whilst the FPR only increases by 3% by 15%.z = 4 sees the most improvement, with the TPR increasing from 34% to 97% whilst the FPR still remains low at 6%. Considering instead the contamination, which is defined as the number of galaxies defined as quiescent compared to the total number of UVJ quiescent galaxies, the conclusion changes.The contamination fraction of UVJ in SHARK at z = 2, 3, 4 is 81, 86, 87%, which is similar to the contamination including the border (80, 86, 91%).Therefore, although the TPR and FPR imply the extended U V J selection is suitable for use, the contamination fraction suggests otherwise, which is the same conclusion as for the classical U V J selection.

Figure 2 .
Figure 2. Left: ∆log(SFR) (Eazy−Prospector) as a function of Prospector SFR for 8134 galaxies at 0.5 < z < 3.0 matched to our catalogue with that of the 3D-HST survey in COSMOS.The median ∆log(SFR) is 0.07 dex (shown as red dashed line) and the scatter is 0.55 dex.The binned median offset and the associated 16th and 84th percentiles are shown as red points with error bars.Right: ∆log(M * ) (Eazy−Prospector) as a function of Prospector stellar mass for the same sample.The median ∆log(M * ) is 0.04 dex and the scatter is 0.21 dex.The binned mean offset and the associated 16th and 84th percentiles are shown as red points with error bars.

Figure 3 .
Figure3.Left: N U V − U , V − J colours of our massive (10.6 < log(M * /M ) < 12) galaxy sample at 3 < z < 5 (grey) and evolution tracks for an SSP with AV = 0 (blue line on the left of the figure) and AV = 1 (dark blue line) and Chabrier IMF aged to 1 Gyr.The dust vector representing the movement of a galaxy in this space by attenuation of AV = 1 is shown by the arrow at the bottom right.Right: Same but for U V J with the quiescent box defined byWhitaker et al. (2012) (defined as U − V > 0.8 × (V − J) + 0.7, U − V > 1.3 and V − J < 1.5).

Figure 4 .
Figure 4. Left: N U V − V J diagram with the 2 < z < 3 fitting sample coloured by p(q)50%, which is the median of the boot-strapped quiescent probability distribution based on the Gaussian Mixture Model.Right: Same but for the 3 < z < 5 sample.
Figure5.Left column: SHARK simulated galaxies in different colour planes.GMM/N U V − V J (this work, top) and U V J (middle) colour diagrams for massive (log(M * /M > 10) galaxies in the Shark simulation (grey), with QGs found by each selection method as red (blue) hexagons and galaxies missed by both selection methods as red (blue) crosses.The bottom panel shows the Receiver Operator Characteristic (ROC) curve for the trained Gaussian Mixture Model as a function of quiescent probability (p(Q)) threshold.The Area Under the Curve (AUC), which represents the percentage probability of correctly classifying QGs, is reported, as well as the p(Q) threshold, which is calculated as the geometric mean (shown on the figure as the coloured red circle on the red curve).The False Positive Rate and True Positive Rate is also presented for the Whitaker et al. 2012 U V J selection (blue star).Middle column: Same as left column but at z = 3.Right column: Same as left column but at z = 4.

Figure 6 .
Figure 6.Top row : N U V − V J selected QGs at 2 < z < 3 (dark red, left figure) shown in N U V − V J space, U V J space and NUVrJ space.Middle row : NUVrJ selected QGs (middle figure, light blue) shown in N U V − V J and U V J colour diagrams.Bottom row : U V J selected QGs (right figure, dark blue) shown in N U V − V J and NUVrJ colour diagrams.For each panel, the overlap is given, defined as the number of galaxies selected by that selection method divided by the number of galaxies selected by the main selection method for that row.In the first row we show the GMM selected sample (dark red) at 2 < z < 3 in the N U V − V J diagram, NUVrJ diagram and U V J diagram.The second row shows the NUVrJ selected sample in NUVrJ (light blue) as well as the other two colour colour diagrams and the third row shows the U V J selected sample in U V J as well as in the other two colour colour diagrams (dark blue).

Figure 10 ×
Figure10× 10" Postage stamp cutouts from selected bands in the optical to NIR scaled by ±3RMS (left) and corresponding best fit SEDs (right) for three quiescent galaxy candidates selected using the N U V − V J method, taken from the main sample.The observed photometric points are shown as black circles (any with SNR< 2 in grey) and the best fit model is shown in blue.The rest frame N U V, U, V, J photometry is shown by the orange circles.The p(z) is shown to the right of each SED in yellow, along with its best-fit photometric redshift, stellar mass, and quiescent probability median value (P (q)50).

Figure 9 .
Figure 9. Top: Median smoothed rest frame SED for the GMM/N U V − V J selected quiescent sample at 3 < z < 5. Rest frame photometry for all galaxies is shown in grey, whilst the median per band and the associated 16th and 84th percentiles (error bars) are shown in dark grey.The median best fit (smoothed) model is shown in red.The median model photometry per band is shown by the open face circles.The location of the sample in N U V − V J space is shown in the top right inset, whilst at the top left the sample median properties (number, median Ks band magnitude, median log(stellar mass), median log(SFR), median log(sSSFR)) are shown, as well as the standard deviation.

Figure 11 .
Figure 11.gzK diagram for all galaxies in the combined UVISTA and HSC area with g, z and K SNR > 3. Stars (black dots) occupy a locus across the lower part of the diagram, whereas galaxies (shown in green and purple dots) tend to occupy a wider spread of colours.

Figure 12 .
Figure12.Signal to Noise ratio (SNR) as a function of redshift at 3.0 < z < 5.0 for IRAC Channel 3 (top) and IRAC Channel 4 (bottom) for the whole massive sample (gray).The average SNR for each channel is shown binned in redshift for the entire massive sample (blue dots) and the extended quiescent sample (red stars).The worst performance is at 4.5 < z < 5.0 for both channels.

Table 2 .
Number The Cosmic Dawn Center (DAWN) is funded by the Danish National Research Foundation under grant No. 140.S.J. acknowledges the financial support from the European Union's Horizon research and innovation program under the Marie Sk lodowska-Curie grant agreement No. 101060888.