Fast Forward Modelling of Galaxy Spatial and Statistical Distributions

. A forward modelling approach provides simple, fast and realistic simulations of galaxy surveys, without a complex underlying model. For this purpose, galaxy clustering needs to be simulated accurately, both for the usage of clustering as its own probe and to con-trol systematics. We present a forward model to simulate galaxy surveys, where we extend the Ultra-Fast Image Generator to include galaxy clustering. We use the distribution functions of the galaxy properties, derived from a forward model adjusted to observations. This population model jointly describes the luminosity functions, sizes, ellipticities, SEDs and apparent magnitudes. To simulate the positions of galaxies, we then use a two-parameter relation be-tween galaxies and halos with Subhalo Abundance Matching (SHAM). We simulate the halos and subhalos using the fast PINOCCHIO code, and a method to extract the surviving subhalos from the merger history. Our simulations contain a red and a blue galaxy population, for which we build a SHAM model based on star formation quenching. For central galaxies, mass quenching is controlled with the parameter M limit , with blue galaxies residing in smaller halos. For satellite galaxies, environmental quenching is implemented with the parameter t quench , where blue galaxies occupy only recently merged subhalos. We build and test our model by comparing to imaging data from the Dark Energy Survey Year 1. To ensure completeness in our simulations, we consider the brightest galaxies with i < 20 . We find statistical agreement between our simulations and the data for two-point correlation functions on medium to large scales. Our model provides constraints on the two SHAM parameters M limit and t quench and offers great prospects for the quick generation of galaxy mock catalogues, optimized to agree with observations.


Introduction
One of the main focuses of modern cosmology is the study of the distribution of matter in the universe.We know that most of the matter in the universe is composed of dark matter, which can only be observed indirectly, for example with gravitational lensing.Visible matter on the other hand is in the form of gas and galaxies and can be observed in different ranges of the electromagnetic spectrum.Galaxy surveys are one of the most important ways to probe the universe.Galaxy shapes, distortions and magnifications allow for weak lensing analyses (e.g.Dark Energy Survey (DES, [1]), Kilo-Degree Survey (KiDS, [2])).The clustering of galaxies is a powerful cosmological probe and also a systematic effect for lensing analyses.To understand and evaluate the observed galaxy clustering we rely on simulations.They are important for example to estimate uncertainties of observational data, to build analysis pipelines, and for cosmological parameter estimation.For a simulation to be useful, it must be realistic and accurate enough and yet cover a large enough volume.This becomes a trade-off for the finite computational resources.Simulations that contain only dark matter are simpler, since they do not require a treatment of the baryons and therefore allow larger numbers of simulation particles.These N-body simulations describe the motion of the particles, often through Newtonian gravity, thereby predicting structure formation (e.g.GADGET-2 [3,4], GADGET-4 [5], PKDGRAV3 [6] and ABACUS [7]).A way to extract single objects from the resulting cosmic web is to use dark matter halos.A halo is a bound overdense region consisting of multiple simulation particles, describing objects, like the dark matter halo surrounding our Milky Way.Substructures are then described as subhalos, i.e., collapsed halos within another halo.Baryonic matter can be simulated using hydrodynamical simulations, but these are computationally expensive and therefore usually applied to smaller simulation volumes.Furthermore, the modelling of baryons is less well understood since the underlying theory is more complex than with dark matter only.An alternative is to simulate the dark matter distribution and relate it to the distribution of baryons.This is possible for example for neutral hydrogen mass using a HI to halo mass relation (e.g.[8]) or for galaxies hosted by halos (e.g.[9]).The latter requires an understanding of the relation between the dark matter and the galaxy distribution, or the relation between the galaxy properties and the properties of the dark matter halos they reside in.Since the field studying galaxy formation and evolution is still evolving and cannot easily be simplified, the galaxy to halo relation is an approximate model, which can nonetheless generate realistic galaxy distributions [9].Commonly used methods for simulations to populate dark matter halos with galaxies are the Halo Occupation Distribution HOD (e.g.[10][11][12][13][14][15][16][17]), (Sub-)Halo Abundance Matching (S)HAM (e.g.[18][19][20][21][22][23][24][25][26]) and semi-analytic models (e.g.GAEA [27][28][29][30]).While semi-analytic models are powerful for many applications, they require treatment of the formation and clustering of galaxies and are thus more complicated.For HOD, one needs to specify the functions N cen and N sat , standing for the average number of central and satellite galaxies respectively residing in a halo of a given mass at a given redshift.Since Halo Abundance Matching assigns only one galaxy to each halo or subhalo and derives a monotonic relation between halo and galaxy properties, SHAM does not require these two functions.Additionally, SHAM has the advantage of giving positions and velocities for all the galaxies, since every galaxy is assigned to a separate halo.In this work, we simulate the distribution of galaxies with a forward model consisting of two steps.In the first step, the distribution functions of the galaxy properties were forward modelled and constrained with Approximate Bayesian Computation (ABC) [31,32], comparing to data using the image simulations by the Ultra-Fast Image Generator UFig [33,34].For this purpose, the whole galaxy population is split into two categories: Red galaxies are more quiescent and blue galaxies have more star formation.In this model, the luminosity functions, sizes, ellipticities, SEDs and apparent magnitudes in observed filter bands were jointly constrained.As a second step, we are left to derive the relation between halo and subhalo mass and galaxy luminosity, using a SHAM approach for the red and blue galaxies.The dark matter simulation we use was described in [35].It allows us to efficiently simulate halos and subhalos at high resolution in a lightcone and is based on PINOCCHIO [36][37][38].We describe the halo-subhalo simulation in more detail in section 2.1.The (sub)halo mass we use for the SHAM is the one at the observed redshift for halos and the one at accretion time for subhalos.It has been shown by e.g.[39] that the subhalo mass at accretion is a good property to be used for matching galaxies to halos.The galaxy to halo relation could also be derived by using other halo properties (for example the highest mass in its history, the maximum circular velocity at observation, the peak maximum circular velocity in its history, or a combination of mass and virial velocity) and other galaxy properties (for example the galaxy size or the stellar mass) [39].The dark matter simulation from [35] does not have information on velocities within the halos, though, and our model to draw galaxies is based on luminosity functions.Previous work has been done in this field.The project PTHalos [15] used halo occupation numbers to simulate the galaxy spatial distribution.The ELUCID project [40,41] derives the galaxy-subhalo connection using a neighbourhood abundance matching method.The UniverseMachine [42] derives the star formation rate of galaxies from different halo properties, constrained by observed galaxy properties.ScamPy [43] uses subhalo clustering and abundance matching (SCAM) to paint galaxies onto halos and subhalos.In [44] they compare HOD with SHAM as well as SCAM.Another extension of subhalo abundance matching using hydrodynamical simulations and recipes for star formation rate was presented by [45].Several SHAM models were developed by comparing to SDSS and BOSS/eBOSS [46][47][48][49], e.g.[50,51].Recent work using the first data of DESI (EDR, Early Data Release [52]) includes projects on SHAM [53][54][55] and on HOD [56,57].Our goal is to develop a forward model that can be run many times due to its low computational cost.A discussion on forward modelling can be seen in e.g.[58].The idea in forward modelling is to generate the same data format as provided by observations, evaluate both consistently for comparison, and then loop back to adapt the model parameters.Contrary to a semi-analytical model or a hydrodynamical simulation, our forward model is simple and in particular involves only two free SHAM parameters.In order to be able to use Approximate Bayesian Computation (ABC, e.g.[31,32,59,60]) to optimize the galaxy model in the future, the galaxy to halo assignment needs to be very fast.Only two additional free parameters are added to the existing galaxy model of UFig [33,34].We also use the approximate dark matter halo catalogues described in [35] instead of a full N-body simulation with a halo finder to speed up our method, since the underlying simulation for the dark matter halos and subhalos is by far the most expensive part of the pipeline.This paper is structured as follows.In section 2, we describe our procedure for assigning simulated galaxies to halos and subhalos.We describe the galaxy survey that we use to compare to our simulated galaxy catalogue in section 3.In section 4, we show the results of the statistical properties and of the clustering of our simulated galaxy catalogue.Our conclusions are summarised in section 5.

Method
In this section, we describe our approach to simulate galaxies including their spatial distribution.Our approach includes both the simulation of dark matter halos and subhalos and the subsequent assignment of galaxies to them.

Halo and subhalo catalogue
For the spatial distribution of galaxies, we use the simulated distribution of dark matter halos and subhalos by applying Subhalo Abundance Matching.The simulation of faint galaxies with accurate clustering requires populating both halos and subhalos with galaxies.This helps reproduce the fact that galaxies are also associated with subhalos, especially in high density regions such as galaxy clusters.More distant galaxies are observed at an earlier stage of the universe.We thus need lightcones of halos and subhalos in order to populate them with galaxies.Since one of our main goals is to create fast simulations, we cannot depend on full N-body simulations with halo finders.They are computationally too expensive, especially since we need simulations with a high enough resolution for halos associated to single galaxies.We therefore use the halo-subhalo clustering simulations presented in [35].These simulations are based on PINOCCHIO [36][37][38], a dark matter simulation that efficiently produces halos in simulation boxes and lightcones.PINOCCHIO is much faster than an N-body simulation, as it uses Lagrangian Perturbation Theory and builds groups of particles while running.It can thus be run at high resolution, while still using a large enough simulation volume.However, it does not directly simulate the subhalo catalogues.In a previous work in this series [35], the subhalo progenitors for each halo in a snapshot or lightcone were extracted from the merger history.The survival of subhalos is modelled using a fit for the subhalo merger time.Radial density profiles are used for the spatial distribution of the surviving subhalos within their hosts.A comparison with simulations using the N-body simulation GADGET-2 [3,4] and the halo finder Rockstar [42,61] showed that the resulting subhalo velocity function and the two-point correlation function of subhalos are realistic.The halo-subhalo simulations by [35] were shown to be of the order ∼ 700 faster than a full N-body simulation.We use the lightcones calculated directly within PINOCCHIO, combined with the subhalo code by [35].With PINOCCHIO, we can create high resolution halo simulations in large cosmological volumes.For this work, we simulate lightcones with a half opening angle of 60 deg, resulting in one quarter of the sky.We simulate until redshift z = 0.75, which is sufficient given our selection of relatively bright galaxies in this work.To reach the necessary resolution of halos corresponding to single galaxies and even dwarf galaxies, we use a box size of 500 h −1 Mpc and 2048 3 simulation particles, and include halos and subhalos with at least 10 particles.This results in a minimum (sub-)halo mass of 1.1 • 10 10 h −1 M ⊙ .Since PINOCCHIO is not a fully numerical simulation, it is possible for us to use halos with very few particles, as described in [35].For the simulations, we use the same cosmological parameters as in [35], namely Ω Λ = 0.73, Ω m = 0.27, h = 0.7, Ω b = 0.045, σ 8 = 0.811, n = 0.961 and w = −1, which is close to the WMAP Cosmology [62].To estimate the statistical uncertainty, we run multiple simulations with the same cosmology, the same resolution, the same lightcone settings but different initial conditions.

UCat
The simulated galaxies for the abundance matching and the final simulation are taken from the framework of the Ultra-Fast Image Generator (UFig, [33,63]).UFig simulates images for astronomical applications by getting a catalogue and rendering images that include observational and instrumental effects.Observational properties include spectral coefficients, while instrumental effects include the point spread function, noise and pixelization.Originally developed for the Monte-Carlo Control Loops Pipeline (MCCL, [34,64]), UFig was designed for forward modelling and therefore to be fast.The underlying models are simple, which beside code optimization is the reason for UFig's speed.The catalogue generator for UFig is called UCat and the galaxy population model was developed by [31,33].The total galaxy population is split into a redder and a bluer population, with bluer galaxies having higher star formation rates.Whether a galaxy is red or blue in this simulation is an intrinsic property and not decided according to a colour cut, as is custom for galaxy surveys.UCat estimates the number of red and blue galaxies per volume and applies that to a given redshift range and field of view (i.e.lightcone aperture or image size).The code then samples red and blue galaxies separately from luminosity functions, which are expressed in absolute magnitudes M. The luminosity functions Φ(M, z) as a function of absolute magnitude and redshift were optimized together with the rest of the model parameters of UFig in [31] and [32] using Approximate Bayesian Computation (ABC, [60]).A more detailed description of the luminosity functions can be found in [32].We use the posterior from the ABC of [31] for this work.The luminosity functions for red and blue galaxies are displayed in appendix A. UCat then calculates apparent magnitudes in different filter bands from the absolute magnitudes and redshifts by using spectral coefficients, filter throughput and extinction values.The five spectral coefficients for each galaxy give a combination of the kcorrect spectral templates [65].The procedure is shown schematically in figure 1.Note that additionally, the spectra simulator USpec [58,66] can calculate realistic spectra from the spectral coefficients by adding a sky model, atmospheric transmission and readout noise.Before this work, the simulated galaxy positions were randomly selected.In this work, we introduce a method to simulate galaxy catalogues with realistic clustering properties.

Abundance matching
Abundance Matching derives a monotonic relation between (sub)halo and galaxy properties by matching the cumulative numbers of halos and galaxies per volume, assuming that each halo in the catalogue hosts one galaxy.In our case, the matching is done based on the assumption that more luminous galaxies are hosted by more massive halos.The number density of (sub)halos and galaxies, as well as the luminosity function of galaxies, depends on redshift.We therefore match the abundance in small bins of redshift.To ensure that also the simulated volume is the same, we match the field of view.The clustering of galaxies does not only depend on the intrinsic luminosity, but also on the galaxy type.A simple way to split the whole population is to distinguish between red and blue galaxies.Red galaxies are passive, with little star formation, while blue galaxies have  more star formation and are typically younger.The spatial distribution of halos depends on (sub)halo mass and on whether that halo is a halo or a subhalo.A galaxy assigned to a halo is called a central galaxy (especially if there are subhalos in this halo), while the galaxies assigned to subhalos are satellites.There has been plenty of research into the relation between red and blue and central and satellite galaxies (see e.g.[9,39,[67][68][69][70][71][72][73]).Red galaxies are typically older than blue galaxies, having already lost most of their active star formation due to being stripped of their cold gas.Galaxy clusters also have a higher abundance of red galaxies compared to the whole Universe.Furthermore, the central galaxy in a group or cluster, therefore the galaxy associated to a massive halo, is more often red than blue.Overall, blue galaxies on average reside in less dense regions than red galaxies (see e.g.[74][75][76]).We therefore assign red galaxies to older subhalos and to halos above a certain halo mass M limit .The distinction between red and blue satellites is done using the satellite quenching time t quench , where a satellite is blue until it has been part of its host for longer than t quench .This procedure is shown in figure 2. A step function without any spread is a simplification, but our model is designed to be as simple as possible.The mass limit M limit and the quenching time of satellites t quench are free parameters of our model and can be determined to ensure the galaxy statistics are conserved.Subhalos that have been part of their host for longer than their expected survival time according to the survival model from [35] are not included in our simulation, and therefore not assigned a galaxy.Each halo and subhalo in the lightcone is flagged to either host a red or a blue galaxy.The galaxies drawn from UCat also have a well-defined flag to be either red or blue.The relation between halo mass and galaxy luminosity is therefore done separately for red and blue galaxies.We bin the halos in redshift and in logarithmic mass, to get a two-dimensional histogram.The number of bins, both in redshift and in log(M halo ), has to be adjusted.Too few bins give a relation that is too approximated, which is especially problematic when too few mass bins are used.Too many bins lead to gaps and instability in the halo histograms.We found that a good bin width in redshift is ∆z = 0.01.In mass, we're using 60 bins with logarithmic spacing between the minimal and maximal halo mass.After binning the halos and subhalos in mass and redshift, we bin the galaxies in redshift, with the same binning edges.For each redshift bin, we then start at the highest mass bin and match the numbers with those of the most luminous galaxies and save the minimum and maximum absolute magnitude for this mass bin.Removing the matched galaxies, we proceed with the next lower mass bin and do the same, continuing until the lowest mass bin that contains halos.When creating the mass to luminosity relation, it is important to have a large enough halosubhalo catalogue with a high enough resolution.Otherwise, when applying the mass to luminosity relation to a catalogue for a different simulation, there may be halos or subhalos for which the corresponding absolute magnitude is not specified.In section 4.3.3,we show the derived relation between halo mass and absolute magnitude for red and blue galaxies.For each halo, we draw the absolute magnitude of its hosted galaxy using the derived two-dimensional interpolation in mass and redshift.The number of galaxies simulated this way is thus the same as the number of halos in the lightcone.The halo redshift and the assigned absolute magnitude are then used to calculate the apparent magnitude with spectral coefficients, extinction values and filter throughputs within UCat.The filters bands have to be set according to the galaxy survey one wishes to simulate, and the filter throughputs are integrated once in UFig using the kcorrect templates and K corrections.Figure 3 shows a schematic of the procedure of creating the halo catalogue, deriving the interpolation, populating each halo with a galaxy and getting the apparent magnitudes.This halo and subhalo catalogue is complete in mass, therefore the new galaxy catalogue is complete in absolute magnitude.To simulate images with UFig or to simulate a galaxy survey in general, the catalogue can be cut by setting an upper limit for the apparent magnitude.It is thereby important to have a dark matter simulation with a high enough resolution, otherwise some of the faintest galaxies will be missing, especially at low redshifts.

Magnitude shift calibration
Our comparison to data is done at catalogue level.The data we compare to is collected by a telescope and in images, as described in section 3. The survey collaboration has run Source Extractor [77] on their imaging data to produce catalogues.Source Extractor retrieves the sources from images and measures their fluxes and therefore their apparent magnitudes.Depending on the exact settings used in Source Extractor, the values for the apparent magnitudes are offset differently.Even small offsets in apparent magnitude that are not calibrated correctly will lead to shifted clustering results, especially when magnitude cuts or binning are involved.The apparent magnitudes that we calculate using UFig for our catalogues do not have such an offset.UFig allows us to create realistic images.We therefore produced a small set of images, then ran Source Extractor with the same settings as the survey, and matched the retrieved objects to the ones in the original catalogue before the images.This allows us to calibrate the effect of Source Extractor on the apparent magnitudes in different filter bands.For the i-band, the results of this calibration are shown in figure 4. We have performed the same calibration also for the other magnitude bands.On the left, the shift from simulated magnitudes to observed magnitudes due to Source Extractor is shown.On the right, the opposite is shown, calibrating the shift to get from observed magnitude to true magnitude.
Figure 4: Calibration of the shift in apparent magnitude i with Source Extractor, using simulations.The solid orange line is the mean offset as a function of magnitude; the dotted lines are the 2σ intervals.The calibration is done by simulating images using UFig and Source Extractor and then comparing the resulting magnitudes to the ones before creating images.On the left, the shift between real and Source Extractor magnitude is shown, on the right the opposite.We show dashed lines at ±0.5 for reference.
The latter is useful for us, as we can either perform the correction by applying the first shift to the simulated catalogue or by applying the second shift to the data.Each blue dot corresponds to one galaxy, after removing stars.We calculate the mean and standard deviation of the offset in narrow bins of apparent magnitude since both shift and scatter are strongly magnitude dependent.The orange lines show the mean and 2σ intervals of the offset.The black lines show -0.5, 0.0 and 0.5 shifts for clarity.For this calibration, we used objects reaching observed magnitudes of i = 22, but our later analysis is restricted to brighter objects.We apply the shift whenever comparing data to simulations.For figures 7, 11 and 12, we have shifted and added a scatter to the i-band apparent magnitude according to the calibration shown in figure 4. For figure 8, the shift of each magnitude band is applied, but not the scatter, since the scatter would have to be correctly correlated for the different bands for an analysis in colour space.As explained in more detail below, for figure 13 we calibrated the magnitude shift separately for redder and bluer galaxies.For simplicity, and since applying the scatter has no noticeable effect on the measured clustering, we applied the shift to the data instead of the simulations for the chi-squared analysis presented in figure 14.

Data
To constrain the free model parameters M limit and t quench described in section 2.2.2, we compare our simulations to observational data.Unlike imaging surveys, spectroscopic surveys are usually not complete in a certain magnitude range.The Bright Galaxy Survey (BGS) of the Dark Energy Spectroscopic Instrument (DESI) [78] will provide a magnitude complete spectroscopic data set, but the survey is still running.Such incompleteness is complicated to forward model, as the same kind of objects have to be missing in the simulations as in the data, otherwise systematic errors are created.Furthermore, our model includes all galaxy types, not just a specific selection according to colour cuts.Since we do not want to build a model for just certain types of galaxies, we avoid any colour cuts, and therefore work with imaging data.We compare our simulations to galaxies with i < 20 from DES Y1, the public data release of the first year of observations of the Dark Energy Survey [1,[79][80][81][82][83][84].Both the data in catalogue form and the spatial, pixelized mask we use are publicly available from the DES Data Management. 1or accurate forward modelling, we simulate the same cut-out of the sky as is covered by the data we use.The mask made publicly available by DES is in HEALPix [85] format with NSIDE=4096.For simplicity, since a smaller lightcone is easier to simulate, we reduce ourselves to the larger connected part of DES Y1 south of DEC < −30 deg.This results in an area of 1685 deg 2 , after removing the pixels that are flagged with badmask.Our simulated galaxy catalogue is by construction free of stars and of any other objects besides galaxies.We therefore use the star-galaxy classification provided by DES and select only the objects classified as galaxies.Furthermore, we remove all the flagged objects, to ensure a clean data set.The corresponding columns are called MODEST_CLASS and FLAGS_GOLD in the data release.We note that UFig does simulate stars when rendering images, but we are only using the galaxy catalogues here.The area on the sky considered needs to be large enough to provide stable clustering information on small, medium, and large scales.Although an even larger area would increase the stability, the data provided by DES Y1 is enough to build our model.Furthermore, a wider survey would require a bigger lightcone, leading to more difficulties when simulating and processing it.Figure 5 shows the spatial distribution of the galaxies from DES Y1 used in our comparison.As described in section 2.2.2, we assign one galaxy to each halo and subhalo.At low redshifts, completeness in apparent magnitude requires a higher mass resolution for faint galaxies.Faint galaxies at low redshifts are small dwarf galaxies residing in very small halos and subhalos.Achieving the same completeness in the simulations as in the data is important for our forward model.Considering only the brightest galaxies of the dataset allows us to ensure that we do not miss galaxies in our simulations due to lack of resolution.We therefore apply a simple cut in apparent magnitude in the i-band, at i = 20.At low redshifts, this data set still includes very small and faint galaxies.

Results
In this section, we present our results derived from simulations and comparisons to the data described in the previous section.Our focus lies on the spatial distribution of galaxies and on the dependence of our simulations on the two free SHAM parameters.

Galaxy distribution functions
The galaxy population model in UCat and UFig is tuned via ABC such that the distribution of photometric properties of the galaxy sample agrees with observations (see [32]).In appendix A, we show a comparison between the red and blue luminosity functions used here and observational results from other studies.In this work, the number of galaxies is given by the number of halos and subhalos.The abundance matching technique ensures the relation between halo mass and absolute magnitude, but does not maintain e.g., the number ratio between red and blue galaxies.We therefore perform some consistency checks of UCat before and after including clustering by using halos.We also compare the distribution of simulations and data.Figure 6a shows the number of galaxies as a function of redshift, where the galaxies have an apparent magnitude below i = 20.The orange line shows the distribution from UCat only, the black line shows it after including populating halos with galaxies.By construction, the luminosity function of both red and blue galaxies is preserved.What may be changed is the minimum luminosity (maximum absolute magnitude) of the simulated galaxies.This can also be different for red and blue galaxies, if the number of halos set to host blue or red galaxies respectively is incorrect.Figure 6b shows the distribution of absolute magnitudes for sampled red and blue galaxies when applying the cut at i = 20, before and after using the abundance matching method.The consistency check presented in figure 6 shows good agreement.For forward modelling, the galaxy distributions of our simulations must agree with the data that we wish to simulate.The easiest way to check this is by looking at the histogram of apparent magnitudes.We do not have direct access to absolute magnitudes and estimating them for the data is difficult since we do not have accurate redshifts.Figure 7 thus shows the histograms for the apparent magnitudes in the r and in the i band.The i-band apparent magnitude is sharply cut at i = 20, by construction.We see that the simulation agrees well to the data.The luminosity function was fine-tuned in previous works involving UFig and UCat, and we use the posterior from the ABC presented in [31].Figure 8 displays the distribution of galaxies in colour space, for our simulation on the left and for the data on the right.Only a small number of galaxies is shown, for easier inspection.
For the simulation, the catalogue is split into red and blue galaxies, according to the intrinsic types from UCat.The distribution in colour space agrees well with the data, though the data is more spread out and scattered due to uncertainties and scatter from Source Extractor.
A possible way to improve this in the future could be to add scatter when applying the calibrated magnitude shift due to Source Extractor, but the scatter would have to be correctly correlated between the different magnitude bands.For the simulations, we additionally colour code the blue and the red galaxies, according to the two UCat galaxy populations.The lines indicate the 68% and 95% contours.

Spatial distribution
We assign the simulated galaxies differently to (sub-)halos depending on whether they are from the red or from the blue population within UCat.Therefore, the two populations do not have the same clustering.This can be seen in figure 9 where blue galaxies are shown on the left and red galaxies on the right.In both cases, we show the same cutout of the sky in right ascension and declination spanning about 6.5 deg 2 , and the same redshift range 0.40 < z < 0.41.The simulated red galaxies are visibly more clustered, while the blue galaxies are more spread out to under-dense regions.This comes from the fact that we place red galaxies in highly clustered massive halos and into the majority of subhalos.

Correlation functions
To compare the simulations with the data and to check whether they are in statistical agreement, we look at the two-point correlation function.We measure the angular correlation function w(θ) as a function of angular separation using Corrfunc [86,87].Given that we do not have radial information for the data, since DES in an imaging survey and therefore does not measure accurate redshifts, we cannot compare the 3D two-point correlation function.We therefore only consider the positions on the projected sphere.We first evaluate the scales of trust in real space, in order not to over-interpret our simulations or the data on certain scales.We have defined certain physical scales corresponding to different effects in the dark matter simulations.This is related to the work presented in [35], where the simulations of halos and subhalos used here are described.The following scales are in decreasing order.At 20 h −1 Mpc, we have medium scales where we trust our halo/subhalo simulations and they are well in agreement with the reference simulations based on GADGET-2 with Rockstar.Below about 6 h −1 Mpc, PINOCCHIO starts to underestimate the clustering of halos due to its approximations.This scale is relatively inaccurate as it depends somewhat on the mass of the considered halos.The one-halo regime (according to the halo model of the power spectrum) starts at about 2 h −1 Mpc.While our simulations from [35] added the subhalos to the output halo catalogues from PINOCCHIO, the density profile used to distribute the subhalos within halos could not be calibrated precisely.This is mainly because the subhalos from the numerical reference simulations had significant resolution effects.The work in [35] saw a bump compared to the reference simulation, meaning a higher correlation function, at around 0.7 h −1 Mpc.Scales below 0.3 h −1 Mpc are too small to be simulated precisely with the given resolutions.To relate these physical scales to our results, we transform them from comoving distances to angular distances projected on the sky.As this relation depends on redshift, figure 11a shows the redshift distributions of the simulated galaxies for different magnitude bins.Given that we place the galaxies drawn from UCat into halos and subhalos, each resulting galaxy has a precise redshift in the simulated lightcone.The dotted vertical lines in figure 11a indicate the mean redshifts for each magnitude bin.For the magnitude bin 17 < i < 18, we show the resulting scales in figure 11b, along with the two-point correlation functions from the data and from one simulation as a reference.Given the discussion above, we can only use scales within the two-halo regime for a stable comparison between data and simulations.For the magnitude bin shown in 11b, this corresponds to angular scales with θ > 0.15 degrees.Given that brighter magnitude bins have a lower mean redshift than fainter magnitude bins, the usable scales correspond to smaller angles for fainter galaxies.This effect would only be increased if the mass dependence of the physical scales were included, since brighter galaxies sit in larger halos.For the following comparison, we only show the scales within the two-halo regime for each magnitude bin.In figure 12, we show the comparison of galaxy clustering between data and simulations.The data is shown as dashed lines, while the simulations are shown as solid lines.Different colours correspond to different magnitude bins in the i band.For the simulations, the mean from 10 PINOCCHIO realizations with the same SHAM model is shown.For this figure, the fiducial SHAM parameters were chosen as M limit = 8 • 10 12 h −1 M ⊙ and t quench = 2.0 Gyr, since t quench is expected be about 2 Gyr for redshifts 0 < z < 1 and the mass limit between 10 12 and 10 13 h −1 M ⊙ from literature [9].In the lower panels, the ratio between simulations and data in the corresponding bands is shown.The coloured areas given are the 1σ and 2σ intervals from our estimation of uncertainties.We combined the statistical and systematic uncertainties together.Statistical variability comes from running different PINOCCHIO runs.Systematic uncertainty comes here from the galaxy population model, where we run the same simulation with a fixed set of SHAM parameters and different posterior points from [31].The parameters include shape and normalization of the luminosity function, which directly affect the mass to luminosity relation derived with our SHAM method.We find that our simulations are well in agreement with the data, mostly within 1σ.The least agreement is achieved for the highest magnitude bin, meaning the faintest galaxies, for which the simulation slightly under-predicts the clustering, especially for larger separation angles.Our SHAM model with its two free model parameters distinguishes between red and blue galaxies from UFig.The red vs. blue label of the simulated galaxies is not an observable property, but we can define a cut in colour space to perform a colour dependent comparison between our simulations and the DES data.A colour cut allows us to consistently separate the simulations and the data into a "redder" and a "bluer" subset, making it possible to measure clustering separately.The cut we use for this separation is a straight line at i − z = 0.3, with redder galaxies having i − z > 0.3 and bluer galaxies having i − z < 0.3.The reason for this choice is that it approximately splits the simulation into the red and blue galaxies according to the categories from UFig.Furthermore, we chose a simple cut involving few magnitude bands, to minimize the effect from differences in colour space (see figure 8).As described in section 2.3, we have to carefully calibrate any shift in the magnitudes due to Source Extractor.For the analysis with redder and bluer galaxies, we have first calibrated the shift in i − z, before cutting into the two samples.Afterwards, starting with the original i-band magnitudes, we have calibrated the shift in i separately for redder and bluer galaxies.This allows us to bin the two catalogues into magnitude bins and compare the correlation functions from the data with our simulations, similar to figure 12.The reason for performing   Figure 13: Angular two-point correlation functions for redder (i−z > 0.3) and bluer galaxies (i − z < 0.3), comparing our simulations to the data.For the simulations, the mean values from 10 simulation runs are shown.The 1σ ranges correspond to the combination of the statistical and systematic uncertainties in the simulations.On the left, the magnitude bin 17 < i < 18 is shown, while on the right, we present the bin 18 < i < 19.
a separate calibration for redder and bluer galaxies is that while the magnitude dependence of the shift in each magnitude band is small, the resulting colour dependence of the magnitude shift is significant.In figure 13, we show the angular two-point correlation functions for redder and bluer galaxies.We show the results for two separate magnitude bins, 17 < i < 18 (subfigure 13a) and 18 < i < 19 (subfigure 13b).The correlation functions are calculated using Corrfunc.The dashed lines represent the result from the DES data, while the solid lines are the mean values from 10 simulation runs with the fiducial SHAM parameters.The shaded areas show the 1σ uncertainties, estimated by combining the statistical uncertainties with the variations from different posterior points, as described above for figure 12.The simulations agree with the data both for the redder and the bluer subset.The colour dependent clustering agrees better for the bluer galaxies, since most redder galaxies occupy subhalos, which are simulated with less stability than the halos at a given resolution.We do not show the brighter magnitude bins, as there are few redder galaxies below i = 17.The brighter bins are in agreement due to the large scatter between different simulation runs and especially between different posterior points, but do not provide much insight.Such a colour-dependent analysis is limited by the accuracy of the underlying galaxy model.As shown in figure 8, the posterior from the ABC analysis in [31] gives a good but not perfect match to the data in colour space.A cut in colour space therefore results in a slightly different sample for the simulations and the data, more prominently than a simple magnitude cut.Future works may be able to distinguish between redder and bluer galaxies more accurately, by using newer ABC posteriors resulting in a better agreement with the data.It is worth mentioning that we consider a bright sample of galaxies, which contributes only a small fraction of the whole galaxy population used for the ABC analysis.

Galaxy to halo relation
Our main goal is to build the SHAM model and show that the resulting clustering agrees with observational data.Since the relation between galaxy luminosities and halo masses is derived when the halos and subhalos are matched to red and blue galaxies, it is not assumed or an input.In this section we present the effect of the SHAM parameters, as well as the resulting galaxy-halo relation when using the reference SHAM parameters.

Constraints on the SHAM parameters
We have calculated the covariance matrix for the simulations based on 10 simulation runs with the same cosmological and SHAM parameters but different underlying PINOCCHIO runs.For this and the following chi-squared analysis, we have calculated the angular two-point correlation functions w(θ) using Corrfunc.We split up the data set into 5 magnitude bins: 15 < i ≤ 16, 16 < i ≤ 17, 17 < i ≤ 18, 18 < i ≤ 19 and 19 < i ≤ 20.For better stability, we have reduced the number of bins in angular separation θ.Since the scales of trust depend on apparent magnitude, as discussed in section 4.2.2, this gave us 4 to 6 bins in θ between about 0.1 deg and 30 deg, with the number of bins depending on the magnitude bin.Using the same bins as for the simulations, both in magnitude and angular separation, we calculated the correlation functions for the observational data.We then computed the chisquared values using χ 2 = D T COV −1 D, where D is the data vector defined as D = w data − w sim .Both for the data vector and for the covariance matrix COV the w(θ) vectors for the separate magnitude bins are stacked together.Since the covariance matrix is ill-conditioned, we used Tikhonov Regularization [88,89] for the inversion.Figure 14 shows the grid points of the SHAM parameters we used, as well as the contours of the chi-squared surface.The chi-squared values are not normalized here.We investigated values between 0 and 4 Gyr for t quench , as we expect it to be around 1-2 Gyr for redshifts 0 < z < 1 (e.g.[9]).Values of t quench below 0 Gyr are impossible, and values above 4 Gyr would make almost all satellite galaxies in clusters red, which is not what observations show.For M limit , we considered the range 10 12 to 4 • 10 13 h −1 M ⊙ .Literature places the transition between red and blue centrals in the range 10 12 to 10 13 h −1 M ⊙ [9].Values of M limit above 2 • 10 13 h −1 M ⊙ lead to too few red central galaxies, inconsistent with observations.On the other hand, values below 10 12 h −1 M ⊙ would mean that most medium sized galaxies should be red, again disagreeing with observations.As discussed in more details in section 4.3.2,increasing either the quenching time t quench for satellite galaxies or the mass limit M limit for central galaxies has a similar effect on galaxy clustering.This is the reason for the degeneracy between the two parameters, as can be seen in figure 14.The best fit parameters lie along the degeneracy between (M limit , t quench ) = (2 • 10 12 , 4.0), and (4 • 10 13 , 0.0).The units are h −1 M ⊙ for the mass and Gyr for the time.The constraining power is relatively weak since the parameters do not have a strong effect on clustering.Therefore, the shape of the best fit contour is irregular, and its size is large.The exact shape also depends on the number of layers used for plotting, but the degeneracy remains the same.

Impact of the SHAM parameters on clustering
The faintest galaxies sit in the smallest halos, by definition of abundance matching, which have the lowest correlation function.This holds mainly for halos and not for subhalos, since subhalos are much more clustered.On small scales, subhalos and therefore satellite galaxies The resulting chi-squared surface for the two SHAM model parameters M limit on the x-axis and t quench on the y-axis.The black dots show the grid points at which we run a simulation.There is a degeneracy from the top left to the bottom right.are more clustered due to the 1-halo term.On large scales, subhalos are more clustered compared to halos of the same mass since they sit in host halos which are often 100 times as massive as themselves, and more massive halos are more clustered.To understand the effect of the two SHAM parameters, it is important to note that the galaxy sample we look at is cut in magnitude.The whole simulation is limited in the iband by i < 20, and each magnitude bin corresponds to two cuts according to its bin limits.Therefore, although each halo and subhalo is populated with a galaxy, many of them are not included.Galaxies are excluded if they are too faint for the magnitude cut.For a given redshift, galaxies are effectively cut in absolute magnitude or luminosity.Overall, distant galaxies appear much fainter and are therefore excluded unless they are very bright.Increasing the mass limit M limit leads to blue galaxies occupying larger central halos.Since the luminosity function of blue galaxies is fixed, this leads to small halos being occupied by fainter galaxies, thus leading to them being excluded by the selection.Since small halos lower the correlation function, this increases the overall clustering.A similar effect happens when the quenching time t quench is increased.This leads to more blue galaxies occupying subhalos and becoming satellites.Therefore, the small halos are filled with faint galaxies, making them be removed by the magnitude cuts.In either case, to increase clustering one needs to ensure that small halos are not included, therefore increasing M limit and/or t quench .This can be seen in figure 15a, where the correlation function for the magnitude bin 18 < i < 19 is shown.Only results from our simulations are shown here, and we distinguish between red and blue galaxies according to the two UCat populations.The values in the 2D parameter plane for this figure were chosen perpendicular to the degeneracy.The colours from dark to blue galaxies, z=0.3 red galaxies, z=0.3 blue galaxies, z=0.3 red galaxies, z=0.3(b) Relation between mass and mass-to-light ratio Figure 16: Relation between (sub)halo mass and luminosity, derived using a two-dimensional interpolation in the mass-redshift plane.Red and blue galaxies have separate relations.On the left, the relation between mass and absolute magnitude or luminosity is shown.The points are observed galaxies of different types, from other works [90][91][92][93].The dashed line is from a Tully-Fisher relation for observed spiral galaxies [94].On the right, the mass-to-light ratio is shown in relation to the (sub)halo mass.The relations are shown at redshift z=0.3.The shaded areas correspond to the 1σ and 2σ uncertainties.points for the Milky Way and Andromeda (values from [90][91][92][93], in black), two star forming spiral galaxies, in agreement with our relation for blue galaxies.Spiral galaxies have more active star formation than elliptical ones.While the correspondence is not exact, we compare observational data for elliptical galaxies to our red galaxies and data for spiral galaxies to our blue ones.Note that our relation depends on redshift, and only a specific redshift of z = 0.3 is shown here, while the observed galaxies are at various redshifts around this value.The redshift dependence of the mass to luminosity relation is significant overall but weak for a narrow redshift range.We further show a linear fit between log circular velocity log(v c ) and log luminosity log(L B ) for the Tully-Fisher relation (see e.g.[95]) for spiral galaxies from [94] as a dashed line.We relate v c to (sub-)halo mass M via v c = (10 GH(z)M ) 1/3 .This Tully-Fisher relation is in good agreement with our mass to luminosity relation for blue galaxies.The data from [96] on early-type galaxies at z ∼ 1 is also in agreement with our mass to luminosity relation for red galaxies at z = 1, but not shown for better visibility.Other studies (e.g.[97,98]) show similar trends as our relation, but are not directly comparable, as they measured luminosities in other bands or stellar masses.

Conclusion
We presented our galaxy forward model based on Subhalo Abundance Matching (SHAM).The halo lightcones are generated with PINOCCHIO.The subhalo extraction from the merger history presented in [35] was used to add the surviving subhalos to the halo catalogue.This allows us to create halo-subhalo lightcone catalogues at high resolution with large cosmological volumes.The galaxies used to populate the halos and subhalos are drawn from luminosity functions from UCat, the catalogue component of the Ultra-Fast Image Generator (UFig).UCat contains two separate galaxy populations, for red and blue galaxies.The blue galaxies have more star formation, while the red galaxies are more quiescent.These two galaxy populations are included in our SHAM model based on star formation quenching.With time, the formation of stars is quenched due to several processes (e.g., reduced inflow of gas or tidal stripping), making blue galaxies become red.For central galaxies, our SHAM model includes mass quenching, with the parameter M limit .For satellite galaxies, the star formation is mostly reduced due to environmental quenching.Our corresponding parameter t quench separates into recently and long-since merged subhalos, hosting blue and red satellite galaxies, respectively.The properties matched in the abundance matching are the halo or subhalo mass and the galaxy absolute magnitude, corresponding to the luminosity.The most luminous galaxies reside in the most massive halos, filling up the halos and subhalos from massive to light.Our model performs this matching in narrow redshift bins, leading to an accurate, redshift dependent relation between mass and luminosity.We showed the comparison of our simulations with data from the public Dark Energy Survey Year 1 (DES Y1).To ensure completeness of our simulations at very low redshifts, we have restricted the comparison to a magnitude range of i < 20, leaving only bright galaxies.In section 4.1, we showed that the histograms of apparent magnitudes and galaxy distribution in colour space are in agreement with the data.The results of our comparison for the clustering were shown in section 4.2.We can only trust the comparison between data and our simulations in the the 2-halo regime, due to the approximate nature of PINOCCHIO and the subhalo model introduced by [35].We have calculated the magnitude-dependent clustering by binning in the i-band.Furthermore, we looked at the colour-dependent clustering by splitting the catalogue into a redder and a bluer sample, according to a cut in colour space.On reliable angular scales, the correlation functions of the simulations and the data are in statistical agreement.By comparing to the data, we constrained the two SHAM parameters purely based on clustering.We found that good values within the best fit contour are M limit = 8 • 10 12 h −1 M ⊙ and t quench = 2.0 Gyr.There is a degeneracy between the two parameters, since increasing either parameter increases the clustering of the galaxies.We could not reduce this degeneracy by distinguishing between red and blue galaxies, since the effect of changing the parameters along the degeneracy is only measurable in the 1-halo regime, where we cannot compare data and simulations reliably.Compared to models matching galaxies to halos based on a full N-body simulation or even compared to a hydrodynamical simulation, our simulations take little computational effort.Running the PINOCCHIO lightcone (once) took 19 minutes on 1032 CPU cores of the Euler Cluster from ETH Zurich 2 , with 2048 3 simulation particles.Creating the subhalo catalogue based on the merger history takes about 10 hours, on one CPU core.The SHAM simulation itself is much faster and can be run multiple times for one halo/subhalo ligthcone.Sampling from the luminosity functions before the SHAM takes about 4 minutes, the SHAM model itself until the final galaxy catalogue about 28 minutes, both on one CPU.The SHAM model requires about 30 GB of RAM.The reported values are for the simulations of DES Y1 for the redshift range z<0.75.Future work may include an extension to fainter galaxies and higher redshifts, and applications to current wide surveys.Further projects in progress include the development of a bias model based on the presented SHAM model, for the galaxies in UCat and UFig.Additionally, the effect of clustering on weak lensing systematics can be investigated using our framework.Matching the initial conditions and therefore the cosmic seed of our halo-subhalo lightcone to a full N-body particle simulation may open doors for further applications, including probe combinations.

A Luminosity function of red and blue galaxies
In this appendix, we show a comparison between the red and blue luminosity functions used for the galaxy clustering simulations in this work and observed luminosity functions from other works.The comparison is displayed in figure 17, similar to figures 8 and 9 in [32], though with the posterior distribution from the ABC in [31] used for this work.The literature studies shown in the comparison are from [99][100][101][102][103][104][105], same as in the analysis presented in [32].[31] (in black) for red (left) and blue galaxies (right) compared to literature luminosity functions.The functions from literature are from the works of [99][100][101][102][103][104][105] (in red, blue, yellow, magenta, brown, cyan and orange, respectively).For the posterior samples, the median and 1σ uncertainties are shown.For the literature luminosity functions, the bands correspond to 1σ errors.The redshifts chosen are z=0.3(top), z=0.5 (middle) and z=0.7 (bottom).
The redshifts chosen for this figure are z=0.3,z=0.5 and z=0.7.We do not show a comparison at lower redshifts, since the available literature studies do not span that redshift range.The luminosity functions used here are in good agreement with the literature, both for blue and red galaxies, for each considered redshift.The constraining power of the posterior from [31] on the luminosity functions is similar to that from the different literature studies.

Figure 1 :
Figure 1: Diagram showing the workings of UCat, UFig and USpec.UCat draws the galaxies on catalogue level without observational influence, UFig adds survey-specific properties to simulate realistic images and USpec adds a sky model and noise sources to simulate spectra.
(a) Red vs. blue for halos (b) Red vs. blue for subhalos

Figure 2 :
Figure 2: Diagrams depicting the SHAM model based on star formation quenching.Halos host blue central galaxies below a certain mass limit (left figure, 2a).Subhalos host blue satellite galaxies until a certain quenching time (right figure, 2b).

Figure 3 :
Figure 3: Schematic description of the clustered galaxy simulations.The halo catalogue is created separately, while the interpolation and the drawing of galaxies is within UCat.The output in the end can be used for UFig the same way as in the version without clustering.

Figure 5 :
Figure 5: Spatial mask used in this work, with the galaxies from DES Y1 with i < 20.This figure uses NSIDE=256, while the mask used throughout this work has NSIDE=4096.

Figure 6 :
Figure 6: Histogram distributions of the simulated galaxies, comparing UCat only (light colours) and UCat with PINOCCHIO (dark colours).The redshift distribution is shown on the left with the vertical lines indicating the mean redshifts, the distribution of absolute magnitudes on the right.

Figure 7 :Figure 8 :
Figure 7: Apparent magnitude histograms in the r and i band, for SHAM simulations (in orange) and the data (in blue).

Figure 9 :
Figure 9: Visualization of the clustering of blue (left) and red galaxies (right), within a sky area of about 6.5 deg 2 in a narrow redshift range 0.40 < z < 0.41.Galaxies are represented by circles, with both size and colour representing their absolute magnitudes.

Figure 10 :
Figure 10: Polar plot displaying the spatial galaxy distribution in real space as a function of redshift and angle, for both red and blue galaxies.Redshift is shown as the radial coordinate, and right ascension as the angular coordinate.For visual clarity, we only display galaxies within the declination range -55.2 deg < DEC < -55 deg.

Figure 10
Figure 10 shows a visual representation of the resulting simulated galaxy clustering in the lightcone in real space.A cut-out through the lightcone along the redshift direction is shown, with redshift or comoving distance plotted as the radial coordinate.Only a small cut-out along DEC is shown for better visibility, -55.2 deg < DEC < -55.The structure on large scales including filaments, underdense and overdense regions is visible.Red and blue galaxies are shown in different colours.

Figure 11 :
Figure11: Redshift distribution for different apparent magnitude bins on the left, calculated from the simulations.The dotted vertical lines are the mean redshifts for each magnitude bin.On the right, we show the two-point correlation function for data (dashed line) vs. simulation (solid line) for the magnitude bin 17 < i < 18, to indicate certain scales of interest (dotted vertical lines).

Figure 12 :
Figure 12: Galaxy two-point correlation function, as a function of separation angle θ, projected along the line of sight.Solid lines are the mean values from 10 simulations, dashed values correspond to the data.Different colours refer to different bins in apparent magnitude i.The lower panels show the ratio of simulations to data, with the 1σ and 2σ intervals due to statistical and systematic uncertainties.
Figure14: The resulting chi-squared surface for the two SHAM model parameters M limit on the x-axis and t quench on the y-axis.The black dots show the grid points at which we run a simulation.There is a degeneracy from the top left to the bottom right.