Dark Energy Survey Year 3 Results: Measuring the Survey Transfer Function with Balrog

We describe an updated calibration and diagnostic framework, Balrog, used to directly sample the selection and photometric biases of the Dark Energy Survey (DES) Year 3 (Y3) data set. We systematically inject onto the single-epoch images of a random 20% subset of the DES footprint an ensemble of nearly 30 million realistic galaxy models derived from DES Deep Field observations. These augmented images are analyzed in parallel with the original data to automatically inherit measurement systematics that are often too difficult to capture with generative models. The resulting object catalog is a Monte Carlo sampling of the DES transfer function and is used as a powerful diagnostic and calibration tool for a variety of DES Y3 science, particularly for the calibration of the photometric redshifts of distant “source” galaxies and magnification biases of nearer “lens” galaxies. The recovered Balrog injections are shown to closely match the photometric property distributions of the Y3 GOLD catalog, particularly in color, and capture the number density fluctuations from observing conditions of the real data within 1% for a typical galaxy sample. We find that Y3 colors are extremely well calibrated, typically within ∼1–8 mmag, but for a small subset of objects, we detect significant magnitude biases correlated with large overestimates of the injected object size due to proximity effects and blending. We discuss approaches to extend the current methodology to capture more aspects of the transfer function and reach full coverage of the survey footprint for future analyses.


Introduction
Wide-field imaging surveys have revolutionized modern astronomy. Some of the primary science goals of these projects are to extract precise constraints on cosmological models and galaxy evolution using measurements made from hundreds of millions of galaxies for ongoing surveys such as the Dark Energy Survey (DES; The Dark Energy Survey Collaboration 2005), the Kilo Degree Survey (KiDS; de Jong et al. 2013), and the Hyper Suprime-Cam Survey (HSC; Aihara et al. 2018), and even billions of sources for upcoming Stage IV experiments such as Euclid (Amiaux et al. 2012) and the Vera C. Rubin Observatory Legacy Survey of Space and Time (LSST; Ivezić et al. 2019). For the largest surveys, the resulting constraints have become so precise that percent-level spatial variations in the survey's depth can cause biases that dominate over the statistical errors (see for instance Huterer et al. 2006; Blake et al. 2010; Ross et al. 2012; Leistedt et al. 2016; Weaverdyck & Huterer 2020). Small biases (as small as one part in 10⁴ in some cases) in the measurements of sizes, shapes, and fluxes of sources can have a similarly important impact on the science results (Massey et al. 2013).
The cumulative effect of the many selection effects and measurement biases of an astronomical survey is captured by its transfer function. This function maps how the photometric properties of astronomical sources are distorted by real physical processes such as interstellar extinction or by our imperfect measurements at every step from detector calibration to object catalog creation. As most cosmological measurements from survey data are based on the same processed images and source catalogs, this mapping is crucial for accurately estimating the true cosmic signals imprinted on the sky such as the spatial clustering of galaxies (see Blumenthal et al. 1984; Tegmark et al. 2006; Elvin-Poole et al. 2018 for a few examples) and weak lensing of galaxy light profiles by the intervening matter field (similarly, see Brainerd et al. 1996; Mandelbaum 2018; Troxel et al. 2018).
Unfortunately, many of these effects are in practice difficult to characterize or even identify. For example, the object catalogs derived from survey images are produced by a complex process; calibration, detection, measurement, and validation involve a number of nonlinear transformations, thresholds applied to noisy quantities, and post facto cuts made on the basis of human judgment. Despite significant efforts to characterize some of these effects in the past (see Connolly et al. 2010 and Chang et al. 2015 for the LSST and DES pipelines, respectively), this complexity makes each contribution to the transfer function extremely difficult to model, and even small errors in the estimated survey completeness can substantially bias measurements such as the amplitude of galaxy clustering or important calibration efforts like the photometric redshift inference of weak-lensing samples (Aihara; Massey et al. 2013; Hildebrandt 2016; Fenech Conti et al. 2017).
Simulating the survey data from scratch can accurately capture some, but not all, of this complexity. Spatial variations in the effective survey completeness depend not just on the observing conditions but also on the ensemble properties of the stars and galaxies being studied. Systematic errors in the sky background estimation and biases in the measurements of galaxy and stellar properties can couple to fluctuations in the galaxy density field, leading to a completeness that depends on the signal being measured. Finally, there is a wide variety of nonastrophysical features that can affect the measurement quality and completeness such as artificial satellite trails, pixel saturation, or the diffraction spikes of bright stars. Not only are these effects difficult to model or simulate at high fidelity, but attempts to do so can introduce model-misspecification bias, which can underestimate the true uncertainty in the downstream fitted photometric parameters (Lv & Liu 2010;Pujol et al. 2020).
In contrast, injecting artificial sources directly into the real images can naturally capture many of these effects. Synthetic objects added to the real data automatically inherit the background and noise in the images as well as the biases arising from measurements in proximity to their real counterparts. Injecting realistic star and galaxy populations, convolving their light profiles with an accurate model for the point-spread function (PSF), and applying accurate models for effects not directly probed (such as Galactic reddening and variable atmospheric transparency) result in a population of simulated sources that inherits the same completeness variations and measurement biases as the real data. Mock catalogs made in this way can be used to discover, diagnose, and derive corrections for systematic errors and selection biases at high precision.
Injection simulations of this kind have been used for limited calibration studies of detection efficiency and photometric calibration in the presence of realistic noise and crowded fields since at least the mid-1980s (McClure et al. 1985; Smith et al. 1986; Stetson 1987), not long after the widespread adoption of charge-coupled devices (CCDs) in astronomical imaging. There is a rich history of mixing real and synthetic data to estimate the detection efficiency of an apparatus in hybrid Monte Carlo techniques commonly used in particle physics measurements (Bunce 1980). In addition, there have been recent efforts to improve blinding procedures for rare events, such as embedding fake gravitational-wave signals ("hardware injections") into the Laser Interferometer Gravitational-wave Observatory (LIGO; Abbott et al. 2009) data and similarly "salting" the data taken by the Large Underground Xenon (LUX; Akerib et al. 2013) experiment with artificial events to test the robustness of their detection pipelines and guard against confirmation bias (Biwer et al. 2017 and Akerib et al. 2017, respectively).
However, generating full-scale mocks via injection is computationally demanding for a modern wide-field (WF) galaxy survey. The injection simulations described in Suchyta et al. (2016) for the early releases of DES data did not attempt to pass the injected galaxies through every part of the measurement process, opting to inject only onto the coadd images. The SynPipe package (Huang et al. 2018) has been used to characterize measurement biases for the HSC pipeline and includes single-epoch processing, but only on a small fraction of the survey's available imaging. The Obiwan tool, currently developed to model completeness variations for the Dark Energy Spectroscopic Instrument (DESI; Martini et al. 2018), also has incorporated single-epoch processing but focuses only on the emission-line galaxies that are the primary DESI targets and, so far, has only injected sources within 0.2 mag of their used selection cuts for increased efficiency (Kong et al. 2020). Despite injection pipelines having shown great promise, the extremely high computational cost (in addition to the difficulty in distinguishing intrinsic methodological uncertainties in their sampling of the transfer function from actual measurement biases) has, until now, largely relegated them to proof-of-concept measurements rather than being used to directly calibrate the cosmological measurements from WF surveys.
This paper describes the generation of the Balrog 64 injection simulations for the first three years of DES data (referred to as Y3), covering a randomly selected 20% of the total Y3 footprint. Sources drawn from DECam (Flaugher et al. 2015) measurements of the DES Deep Fields (DF) (Hartley et al. 2022) are self-consistently added to the single-epoch DES images, which are then coadded and processed through the full detection and measurement pipeline. This extensive simulation and reduction effort allows us to characterize, in detail, the selection and measurement biases of DES photometric and morphological measurements as well as the variation of those functions across the survey footprint. In addition, using an input catalog with measurements from the same filters as the data resolves many of the issues in capturing the same photometric distributions as real DES objects seen in Suchyta et al. (2016), particularly for color. The resulting catalogs generally follow completeness and measurement bias variations in DES catalogs to high accuracy, with mean color biases of a few millimagnitudes and number density fluctuations varying with survey properties within 1% for a typical cosmology sample.
As the measurement pipelines for the DES DF and WF data are complex and quite technical, so too are parts of this paper. However, we also motivate interesting science cases for the presented response catalogs for both calibration and direct measurement purposes including the photometric redshift calibration of weak-lensing samples, magnification effects on lens samples, and the impact of undetected sources on image noise. For readers more interested in using Balrog for potential science applications or as a general diagnostic tool, this is discussed in detail in Sections 4 and 5.
This paper is organized as follows: In Section 2 we introduce the significantly updated Balrog pipeline, which now emulates more of the DES measurement stack and uses a completely new injection framework for source embedding into single-epoch images. Section 3 describes the injection samples and methodological choices for the Y3 Balrog simulations, including a new scheme for handling ambiguous matches. In Section 4 we compare the recovered Balrog samples to the fiducial Y3 object catalog (Y3 GOLD; see Sevilla-Noarbe et al. 2021 for details), as well as present the photometric response of the main star and galaxy samples. We also examine the performance of a typical Y3 GOLD star-galaxy separation estimator and investigate a set of catastrophic photometric modeling failures that enter science samples with dramatically overestimated fluxes (sometimes by multiple orders of magnitude). We then discuss novel applications of an injection catalog in cosmological analyses in Section 5, including the photometric redshift calibration of Y3 "source" galaxies and the effect of magnification on "lens" galaxy samples, in addition to a few unexpected discoveries such as noise from undetected sources and issues with background subtraction. Finally, we close in Section 6 with a discussion of our results, methodological limitations, and future directions before concluding remarks in Section 7.
64 Balrog is not an acronym. The software was born out of the original authors delving "too greedily and too deep" (Tolkien 1954) into their data, hence the name.

The BALROG Pipeline
Balrog was introduced in Suchyta et al. (2016) as a software package 65 that injects synthetic astronomical source profiles into existing DES coadd images to capture realistic selection effects and measurement biases for the Science Verification (SV) and Year 1 (Y1) analyses. However, as the precision of the subsequent DES cosmological analyses has increased, so too has the need for even more robust systematics control and more precise characterization of the survey transfer function. The main limitations of the original methodology were that (1) injections into the coadd rather than single-epoch images skip many important aspects of the measurement pipeline whose effects we want to capture, and (2) the injected objects were drawn from fitted templates to sources in the space-based Cosmological Evolution Survey (COSMOS; Scoville et al. 2007) rather than measurements consistent with DECam filters-thus introducing discrepancies in the recovered colors. While the latter is solved by using the new Y3 DF catalog (Hartley et al. 2022), the former required significant additional complexity in the simulation framework to consistently inject objects across all exposures and bands.
To address this, we have developed a completely new software framework that is described and validated in the remainder of this section. An overview of the Y3 Balrog process is shown in Figure 1, with simplified summaries of the DF and Y3+Balrog measurement pipelines. Briefly, we use the significantly deeper DECam measurements of sources in the DES DF as a realistic ensemble of low-noise objects to inject into the Y3 calibrated single-epoch images. We then rerun the DES measurement pipeline on the injected images to produce new object catalogs that contain the Balrog injections. Finally, we match the resulting catalogs to truth tables containing the injection positions to provide a mapping of DF truth to WF measured properties.
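The final matching step can be illustrated with a minimal positional cross-match. This is only a sketch under stated assumptions: the function name `match_injections`, the 0.5″ radius used in the example, and the flat-sky distance approximation are illustrative choices, and the treatment of ambiguous matches described in Section 3.5 is not reproduced.

```python
import math

def match_injections(truth, detections, radius_arcsec):
    """Match each injected position to its nearest detection.

    `truth` and `detections` are lists of (ra_deg, dec_deg) tuples.
    Returns a dict mapping truth index -> detection index, or None when
    no detection lies within `radius_arcsec`. Uses a flat-sky
    approximation and ignores RA wrap-around (assumptions of this sketch).
    """
    matches = {}
    for i, (ra_t, dec_t) in enumerate(truth):
        best, best_sep = None, radius_arcsec
        cos_dec = math.cos(math.radians(dec_t))
        for j, (ra_d, dec_d) in enumerate(detections):
            d_ra = (ra_d - ra_t) * cos_dec
            d_dec = dec_d - dec_t
            sep = math.hypot(d_ra, d_dec) * 3600.0  # degrees -> arcsec
            if sep <= best_sep:
                best, best_sep = j, sep
        matches[i] = best
    return matches

truth = [(10.0, -45.0), (10.1, -45.0)]
detections = [(10.0 + 0.2 / 3600.0, -45.0), (10.5, -45.0)]
# The 0.5 arcsec radius is a hypothetical value chosen for illustration.
matches = match_injections(truth, detections, radius_arcsec=0.5)
```

Injections that map to None are exactly the undetected sources that an injection catalog needs to retain for completeness studies.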
All astronomical image injection pipelines such as Balrog have two distinct elements: emulation of a survey's measurement pipeline and source injection into the processed images. As our methodology for the former is intrinsically specific to DES while the latter is a fairly generic problem, development on the new Y3 Balrog was split into the two corresponding pieces discussed in detail in Sections 2.1 and 2.2 below.

DESDM Pipeline Emulation
The DES survey data are processed through a set of pipelines by the DES Data Management team (DESDM), which performs basic astronomical image processing as well as applying state-of-the-art galaxy fitting, PSF estimation, and shear measurement codes. The standard processing steps applied to the DES Y3 data are described in detail in Morganson et al. (2018). Ideally, to ensure that identical codes and versions were used at each stage of processing, one would implement Balrog as part of the standard data reduction. However, this was not an option for DES Y3 as the updated Balrog methodology did not exist until after the Y3 data were completely processed (this is now true for a future Year 6 (Y6) Balrog analysis as well). Therefore, it was necessary to replicate the DESDM processing pipeline stack as closely as possible. While this usually amounted to calling the relevant codes and scripts with identical configurations and software stack components, sometimes minor changes were required due to differences in computing environments or practical considerations such as processing time. These differences will be noted whenever relevant.
Figure 1. A high-level overview of how the Deep Fields (DF) and Y3 image processing pipelines interact to create the Balrog catalogs. The raw DECam exposures are used as the basis for both tracts, with the much deeper DF data being represented by the larger image stacks. The null-weight images, weight maps, PSF models, and zero-point solutions are computed from the raw exposures after calibrations are applied and are the starting point of the sampled transfer function. The DF exposures are not dithered and thus single-CCD coadds are created in place of the much larger Y3 coadds. The fiducial DF catalog is created by fitting CModel profiles to detections with Multi-Object Fitting (MOF), which simultaneously models the light profiles of detected neighbors. These fitted model profiles (after a few limited cuts discussed in Section 3.4) are used as the Balrog injection catalog, which are added to the Y3 null-weight images directly. Afterwards, the injected null-weight images are processed in a nearly identical way to the real images including coaddition, detection, and photometric measurements. Finally, we match the output object catalog to truth tables containing the injected positions. As all sources are remeasured, there is some ambiguity in the matching; this is discussed further in Section 3.5. See Hartley et al. (2021) and Morganson et al. (2018) for further DF and Y3 pipeline details, respectively.
65 https://github.com/emhuff/Balrog
A modular design for the measurement pipeline was chosen both for ease of testing and for the ability to do nonstandard production runs (see Sections 5.2 and 5.3 for examples). The individual Balrog processing stages for a single DES coadd tile (44′ × 44′) are as follows:
1. Database query and null weighting: Find all single-epoch immasked (the DES designation for flattened, sky-subtracted, and masked) images in the griz bands that overlap the given DES Y3 tile. Download all exposures, PSFs, and photometric and astrometric solutions from the DESDM Y3 processing archive. A masking process called "null weighting" is applied to these immasked images, which sets the weights of pixels with certain flagged features (e.g., cosmic rays) to 0. These null-weight images are the starting point of the later injection step.
2. Base coaddition and detection: Remake the tile coadds from the single-epoch exposures with no objects injected using SWarp (Bertin et al. 2002) and the detection catalogs with SExtractor (Bertin & Arnouts 1996). Construct Multi-Epoch Data Structure (MEDS; Jarvis et al. 2016) files with cutouts of the coadd and single-epoch images used for additional photometric measurement codes. This allows us to cross-check our measured catalogs with Y3 GOLD to ensure that we recover the same detections and base photometry, as well as easily investigate proximity effects on the injections. This stage can be skipped to save processing time if desired.
3. Injection: Consistently add input objects in all relevant exposures and bands using the local PSF model in each exposure with corrections to the flux from the image zero points and local extinction, along with any other desired modifications such as an applied shear or magnification. This is discussed in detail in Section 2.2.
4. Coaddition and detection: Same as stage 2 but with the injected null-weight images. The resulting photometric catalogs contain existing real objects, injections, new spurious detections, and blends between the two.
5. Single-Object Fitting (SOF): Fit a composite bulge + disk model that is the sum of an exponential and a de Vaucouleurs profile (CModel) to every source, while masking nearby sources.
6. Multi-Object Fitting (MOF): Fit sources with CModel, but group nearby detections into friends-of-friends (FOF) groups that have all of their properties fit iteratively to account for proximity effects. Only available for some Balrog runs due to its computational expense.
7. Metacalibration: Fit a simple Gaussian profile to detections and then remeasure after applying four artificial shears (Sheldon & Huff 2017). This is useful for the creation of weak-lensing samples where correcting for shear-dependent systematics is more important than absolute flux calibration (Huff & Mandelbaum 2017).
8. Gaussian APerture (GAp) fluxes: Measure a robust, scale-length-independent alternative to model-fitted photometry. Object flux is calculated within a Gaussian-weighted aperture with an FWHM of 4″. Described further in Section 3.5.
9. Bayesian Fourier Domain (BFD): Estimate the shear of sources without explicitly fitting a shape using the methodology described in Bernstein & Armstrong (2014). Available only for a few specialized runs.
10. Match and compute GOLD value adds: Match input injections to output detections while accounting for ambiguous matches (see Section 3.5). Merge truth and measured table quantities. Compute Y3 GOLD value-added quantities including flags, object classifiers, masks, and magnitude corrections (though only the dereddening component is used for Balrog magnitude corrections; see Section 2.1.1).
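The null-weighting operation in stage 1 amounts to a bitmask test on each pixel. The sketch below illustrates the idea only; the flag names and bit values are hypothetical placeholders and are not the actual DESDM mask definitions, which are not given here.

```python
# Hypothetical flag bits for illustration only; the real DESDM mask
# bit definitions are not specified in the text.
FLAG_COSMIC_RAY = 0b0001
FLAG_SATURATED = 0b0010
FLAG_STREAK = 0b0100

def null_weight(weights, flags, null_bits):
    """Return a copy of `weights` with flagged pixels set to zero.

    A pixel is nulled when any bit in `null_bits` is set in its flag value.
    """
    return [
        [0.0 if (f & null_bits) else w for w, f in zip(w_row, f_row)]
        for w_row, f_row in zip(weights, flags)
    ]

weights = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
flags = [[0, FLAG_COSMIC_RAY, 0], [FLAG_STREAK, 0, FLAG_SATURATED]]

# Null only cosmic rays and streaks; saturated pixels keep their weight here.
nulled = null_weight(weights, flags, FLAG_COSMIC_RAY | FLAG_STREAK)
```

Passing the set of nulled features as an argument keeps the choice of which flags zero the weight separate from the masking logic itself.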
The resulting photometric catalogs of measured Balrog sources can then be used to measure the DES wide-field response of various input quantities or used directly as randoms with realistic selection effects (see Suchyta et al. 2016 and Kong et al. 2020 for examples). In addition, an "injection catalog" that contains information for all injected sources, detected or not, is created for investigations into detection and completeness properties. Emulation steps 3 through 10 can be repeated for multiple injection realizations of a given tile to obtain sufficient sampling for the needed science case. However, as discussed in Section 3, for Y3 analyses, we opted for a single realization with a relatively high injection density due to the large computational cost of each realization.

Differences from the DESDM Pipeline
While Balrog strives to emulate the DESDM pipeline from null-weight images to science catalogs at high fidelity, there are some discrepancies due to practical limitations. The most significant are:
Table 2), we elected to skip this step for the main samples.
4. Zero-point and chromatic corrections are not applied: The Y3 photometric calibration introduces new chromatic corrections that achieve subpercent uniformity in magnitude by accounting for differences in response arising from varying observing conditions and differences in object spectral energy distributions (SEDs) (see Sevilla-Noarbe et al. 2021). However, the mean Y3 GOLD chromatic correction is between 0.1 and 0.4 millimagnitudes (mmag) in every band except g (0.9 mmag). As this is a subdominant effect that requires significant computation to correct in each injection realization, we do not account for these corrections before injecting into images. In addition, the SED-independent "gray" corrections that account for variations in sky transparency and instrumentation issues like shutter timing errors were not accounted for in the injection zero points. This was not intentional and will be included in all future Balrog runs. However, these corrections are also quite small, with the mean Y3 GOLD gray zero-point correction between −1 and 1 mmag for all bands. As we do not modulate the truth fluxes with these corrections during injection, it is not necessary to apply these corrections after measurement either.
5. Partial GOLD catalog creation: Due to the staged approach in the creation of Y3 GOLD, with value-added products being incorporated as they were being developed, the exact same procedure for compiling the Balrog catalog could not be followed strictly as it would have produced an unnecessary and severe overhead in the production time. Scripts that approximately replicate this process were provided by DESDM, though they only reproduce the columns that were deemed to be most relevant to Y3 key science goals. Slight modifications had to be made to quantities such as FLAGS_GOLD and the object classifier EXTENDED_CLASS_SOF, where the required MOF columns were not available; these differences are mentioned when relevant throughout the paper.
While not technically a difference in the pipeline emulation itself, we note here that PSF models used for injections (PSFEx; Bertin 2011) were found to be slightly too large in Zuntz et al. (2018) for bright stars in Y1 due to the brighter-fatter effect (see Antilogus et al. 2014). However, we still used PSFEx for our injection PSFs as the new Y3 PIFF PSF model described in Jarvis et al. (2021) was not yet implemented into the GalSim configuration structure that was required for our injection design, which is discussed below.

Injection Framework
As mentioned at the beginning of this section, incorporating single-epoch injection into Balrog required a new software design to handle the significant increase in simulation complexity beyond what was done in Suchyta et al. (2016) for the SV and Y1 analyses. Development on the injection framework was partitioned into its own software package as the injection step is fairly generic and of potential interest to other analyses outside of DES Y3 projects, as well as upcoming Stage IV dark energy experiments such as LSST. Briefly, our injection framework maps high-level simulation choices into individual object and image-level details consistent between all single-epoch images for the simulation toolkit GalSim (Rowe et al. 2015) to process. With this design, Balrog automatically inherits much of the modularity, diverse run options, and extensive validation of GalSim. A schematic overview of the injection process is shown in Figure 2. The remainder of this section will quickly summarize the most relevant aspects of each step; we leave a more detailed description of the implementation details as well as a description of the most important user options for this new software package to Appendix A.
Figure 2. High-level overview of the injection processing for a single realization. Green boxes are inputs to the injection framework while red boxes are outputs. The length of each loop is determined by the number of exposures and tiles considered in the full simulation. While the main runs used for Y3 cosmology calibration modify only the position, orientation, and flux normalization of the truth inputs, there are many optional transformations that can be applied such as a constant shear or magnification. The main output of our injection package is a multidocument configuration file with detailed injection specifications that is then executed by GalSim, with each step being executed in the physically correct order. Additional realizations replicate all steps, other than the initial configuration parsing, and produce unique outputs.

Injection Configuration
The Balrog configuration serves as the foundation for the final, much larger GalSim configuration file produced for each tile by the injection pipeline that follows the GalSim configuration conventions that are extensively documented. 68 Global simulation parameters that apply to all injections are defined here such as the input object type(s) (see Appendix A.2), position sampling method, injection density, and number of injection realizations. During injection processing, the requisite simulation details needed to inject the sampled input objects consistently across the relevant survey images are appended to this file to create a multidocument GalSim configuration file with each document corresponding to a single CCD exposure. An example configuration that was used for the two main cosmology runs is given in Appendix A.4.

Input Sample and Object Profiles
While any native GalSim input type can be used for the simulations, most Balrog runs sample objects from an existing catalog with parametric properties that describe the flux and morphology of each source. The photometric measurements of the DF catalog, as well as most measurements in the Y3 DES WF science catalogs, are based on Gaussian mixture model fits to various profiles by ngmix 69 introduced in Sheldon (2014) and most recently updated in Sevilla-Noarbe et al. (2021). Each profile parameterization is converted to a sum of GalSim Gaussian objects that represent the Gaussian components used in the original fit. Balrog can currently inject the following ngmix model types: a single Gaussian (Gauss); a composite model (CModel; cm) first introduced in SDSS, 70 which is a linear combination of an exponential disk and a central bulge described by a de Vaucouleurs' profile (de Vaucouleurs 1948); and a slightly simpler CModel with fixed size ratio between the two components (bdf, for Bulge-Disk with Fixed scale ratio). In DES Y3, the DF measurements use bdf profiles while the WF uses cm.
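The structure of a composite profile built from Gaussian components can be sketched as follows. This is a toy illustration: the component weights and sizes are placeholder values, not the Gaussian approximations used by ngmix, and the function names are hypothetical. It only shows how a bulge fraction (called `fracdev` here) splits the total flux between two Gaussian mixtures.

```python
import math

def combine_cmodel(flux, fracdev, dev_mixture, exp_mixture):
    """Build one Gaussian mixture for a bulge + disk profile.

    Each input mixture is a list of (weight, sigma) pairs whose weights
    sum to 1. The output is a list of (component_flux, sigma) pairs.
    """
    bulge = [(flux * fracdev * w, s) for w, s in dev_mixture]
    disk = [(flux * (1.0 - fracdev) * w, s) for w, s in exp_mixture]
    return bulge + disk

def surface_brightness(mixture, r):
    """Evaluate a sum of circular, normalized 2D Gaussians at radius r."""
    return sum(
        f / (2.0 * math.pi * s * s) * math.exp(-0.5 * (r / s) ** 2)
        for f, s in mixture
    )

# Placeholder mixtures for illustration (NOT the ngmix approximations).
dev_mixture = [(0.5, 0.3), (0.3, 1.0), (0.2, 3.0)]
exp_mixture = [(0.6, 0.8), (0.4, 1.5)]

profile = combine_cmodel(100.0, 0.25, dev_mixture, exp_mixture)
total_flux = sum(f for f, _ in profile)
```

Because every component is a normalized Gaussian, the total flux of the composite equals the input flux regardless of the bulge fraction.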
See Appendix A.2 for all provided custom input types, including the option to inject the "postage stamp" image cutouts of objects in MEDS files. While using the actual images of DF sources rather than parametric fits to their profiles would be a more accurate representation of the true distribution of galaxy properties and morphologies, there are significant added complexities due to adding artificial noise from stamps with associated PSFs larger than the injection image and ensuring stamp and mask fidelity of the full DF catalog; these issues are discussed in detail in Section 6.

Updating Truth Properties and Optional Transformations
Measurements of the transfer function with Balrog require truth tables that compile the properties of injected objects. For injections that are based on real sources, some of these object properties are modified to fit the needs of the simulation such as the positions, orientations, and fluxes. The updated source properties either replace their original columns in the output truth catalogs or are appended as new columns. Object fluxes are scaled to account for interstellar extinction and to match the photometric zero point of each single-epoch injection image. Additional transformations such as a constant shear or magnification factor can be applied depending on the desired science case (see Section 5.2 for an example using magnification in Y3).
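The flux scaling described above follows from the standard magnitude relation m = zp − 2.5 log10(f). The sketch below is a minimal illustration, assuming the truth fluxes are quoted at a single reference zero point; the function name, the reference value of 30.0, and the example numbers are assumptions for this sketch and not values taken from the pipeline.

```python
def injected_flux(true_flux, ref_zp, image_zp, extinction_mag):
    """Scale a truth flux to the counts expected in one single-epoch image.

    true_flux:      flux at the reference zero point `ref_zp`
    image_zp:       photometric zero point of the target image
    extinction_mag: extinction in magnitudes at the object position
    """
    # Holding the magnitude fixed while changing zero point rescales flux.
    zp_scale = 10.0 ** (0.4 * (image_zp - ref_zp))
    # Extinction makes the object fainter by `extinction_mag` magnitudes.
    dimming = 10.0 ** (-0.4 * extinction_mag)
    return true_flux * zp_scale * dimming

# Hypothetical example values.
flux_no_change = injected_flux(1000.0, ref_zp=30.0, image_zp=30.0, extinction_mag=0.0)
flux_scaled = injected_flux(1000.0, ref_zp=30.0, image_zp=32.5, extinction_mag=0.0)
flux_reddened = injected_flux(1000.0, ref_zp=30.0, image_zp=30.0, extinction_mag=2.5)
```

A zero-point difference of 2.5 mag corresponds to a factor of 10 in counts, and 2.5 mag of extinction to a factor of 0.1, which gives a quick sanity check on the signs.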
The position sampling of injections depends on the desired science case; uniform sampling naturally allows for Balrog objects to be used directly as randoms for galaxy clustering calibration, but overlapping Balrog injections can artificially inflate the inferred blending rate. Alternatively, a hexagonal lattice is more appropriate for a perturbative sampling of the transfer function at a given position, but this embeds an unrealistic (though correctable) clustering signal at small scales. The available options are described in Appendix A.3 and the tradeoffs are discussed in more detail in Section 3.4.
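For intuition, a hexagonal lattice of injection positions can be generated as in the following sketch. It works in arbitrary flat image units and omits details such as random offsets or rotations of the grid; the function name and the example dimensions are illustrative assumptions rather than the options described in Appendix A.3.

```python
import math

def hex_lattice(width, height, spacing):
    """Return (x, y) points of a hexagonal lattice covering a rectangle.

    Rows are separated by spacing * sqrt(3) / 2 and every other row is
    shifted by half a spacing, so each point's nearest neighbors all lie
    at exactly `spacing`.
    """
    points = []
    row_step = spacing * math.sqrt(3.0) / 2.0
    row = 0
    y = 0.0
    while y <= height:
        x = 0.5 * spacing if row % 2 else 0.0
        while x <= width:
            points.append((x, y))
            x += spacing
        row += 1
        y = row * row_step
    return points

points = hex_lattice(width=10.0, height=10.0, spacing=2.0)

# Smallest separation between any two lattice points.
min_sep = min(
    math.hypot(x1 - x2, y1 - y2)
    for i, (x1, y1) in enumerate(points)
    for (x2, y2) in points[i + 1:]
)
```

The fixed minimum separation is what prevents injections from blending with one another, at the cost of the artificial small-scale clustering signal noted above.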

PSF Convolution
The PSF used for each object is determined by the local single-epoch PSFEx solution at the injection position. Simpler PSF models are also allowed for testing purposes but not recommended for science runs.

Object Rendering and Injection
All of the previous simulation choices are ultimately encoded in a detailed configuration file that is structured to be read by GalSim. This design was chosen over the explicit use of the software's Python API as the configs facilitate easily reproducible simulations and allow for runs that are identical except for minor modifications such as an added constant magnification factor. Each transformation from truth property to pixel value is automatically handled by GalSim processing in the physically correct order. After an object stamp is rendered (including Poisson noise from the new source), its pixels are summed with the initial image while ignoring any part of the profile that may go off image. Rarely, a profile will require an extremely large grid for the fast Fourier transform (FFT) during PSF convolution and exceed available memory. To avoid this, we set a maximum grid length of 16,384 pix⁻¹ (or ∼63,000 arcsec⁻¹ for DES) per side and skip objects that exceed this limit. While the injection framework was designed with flexibility in mind for uses outside of the Y3 cosmology science goals (and even DES itself), there are currently some assumptions made about the structure of the input data to emulate DES Y3 that we plan on generalizing in upcoming releases.
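The two bookkeeping rules in this step, clipping a stamp at the image boundary and skipping objects whose FFT grid would be too large, can be sketched in a few lines. This is a simplified stand-in that uses nested lists rather than GalSim images; the function names are hypothetical, and the required FFT grid size is passed in as a number rather than computed from a profile.

```python
MAX_FFT_GRID = 16384  # maximum grid length per side quoted in the text

def add_stamp(image, stamp, x0, y0):
    """Add `stamp` to `image` in place, with its first pixel at (x0, y0).

    Stamp pixels that fall outside the image are silently ignored.
    """
    n_rows, n_cols = len(image), len(image[0])
    for dy, stamp_row in enumerate(stamp):
        y = y0 + dy
        if not 0 <= y < n_rows:
            continue
        for dx, value in enumerate(stamp_row):
            x = x0 + dx
            if 0 <= x < n_cols:
                image[y][x] += value

def inject(image, stamp, x0, y0, fft_grid_size):
    """Inject a rendered stamp unless its FFT grid exceeds the limit.

    Returns True if the object was added and False if it was skipped.
    """
    if fft_grid_size > MAX_FFT_GRID:
        return False
    add_stamp(image, stamp, x0, y0)
    return True

image = [[0.0] * 3 for _ in range(3)]
stamp = [[1.0, 1.0], [1.0, 1.0]]

added = inject(image, stamp, x0=2, y0=2, fft_grid_size=1024)      # overlaps one pixel
skipped = inject(image, stamp, x0=0, y0=0, fft_grid_size=20000)   # exceeds the limit
```

Returning a flag for skipped objects makes it straightforward to record them in the truth tables so that they are not later mistaken for nondetections.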

Pipeline Validation
As Balrog is a nongenerative, or discriminative, model of the transfer function, it is difficult to disentangle any intrinsic errors in the input sample or survey pipeline emulation from actual systematic effects we are trying to characterize, particularly as Balrog was run independently of DESDM processing for Y3. Therefore, a series of increasingly complex test runs were completed in order to validate both the injection and emulation steps and characterize the pipeline fidelity at a detailed level. We initially ran Balrog with the injection step turned off to confirm that we recovered identical detection and photometry catalogs as Y3 GOLD when carefully accounting for the same random seeds in the fitters that were used in nominal Y3 processing. Once this was achieved, we verified that the injected profiles of objects drawn onto blank images matched single-object renderings made independently of the pipeline.
Footnotes: 68 https://github.com/GalSim-developers/GalSim/wiki/Config-Documentation; 69 https://github.com/esheldon/ngmix; 70 https://www.sdss.org/dr12/algorithms/magnitudes/#cmodel
We then ran a series of tests where we ignored the existing survey image data during injection except for the estimated residual local sky background that is automatically subtracted from the exposures later in the pipeline. Objects were placed on a sparse grid to limit proximity effects from other injections, with two types of noise depending on the run: either only Poisson noise for the injections or Poisson in addition to low levels of zero-mean Gaussian background sky noise. These blank image runs became progressively more complex as we added the features used in the main science runs described in Section 3 and acted as a form of regression testing.
These tests are relevant for more than pipeline validation; effects from methodological choices can also be identified and quantified while working in a simplified environment. As an example, the runs with only Poisson noise indicated that there were two subgroups of objects with statistically significant differences in magnitude response: one was well calibrated, and the other had a mean offset of ∼7.5 mmag, too faint in each of griz. This was ultimately discovered to be a result of different priors used for the parameter that measures the relative flux ratio between the de Vaucouleurs and exponential components, fracdev, for the ngmix profile type used to fit DF objects (bdf) and the one used to fit wide-field measurements (cm). A series of plots that show the difference in input versus measured fracdev and examples of its downstream effect on the recovered magnitude and color responses for this test is shown in Figure 3.
The impact of the different fracdev fits on the magnitude response can be seen clearly in Figure 3(d), where the difference in measured versus true i-band magnitude as a function of injected magnitude is colored by the response in fracdev for a single tile. As the difference in profile definition between cm and bdf is largely due to fitting stability and has little to do with the true distribution of galaxy properties, this effectively puts a lower bound on the accuracy of the mean magnitude response that we are able to measure with Balrog when using the DF sample as inputs at around 3 mmag. Importantly, however, the effect is nearly identical in each of the griz bands and has a negligible impact on the recovery of colors, as seen in Figure 3(c). This example highlights some of the difficulties in choosing a "truth" definition for injections based on model fits and the importance of carefully testing the impacts of model assumptions.
The final version of the blank image test was performed with identical input and configuration to that used to produce the fiducial Y3 catalogs across 200 tiles that contain over 2.3 million injections and 1.6 million detections. Zero-mean Gaussian background noise was applied to the blank images with variance set to the corresponding CCD SKYVAR value. The resulting object responses allow us to characterize the baseline performance of the photometric pipeline in ideal (though overly simplistic) conditions, which in turn may provide lower limits on the intrinsic uncertainty in our sampling of the DES transfer function. The mean and median difference in recovered versus injected magnitude for griz are plotted in Figure 4. The vertical bars correspond to the mean of the standard deviations of the griz magnitude responses in each truth magnitude bin, centered at the mean magnitude response.
The medians are extremely well calibrated, with only g < 18.5 and 22.5 < z < 23 off by more than 5 mmag, or 0.45%, through the 23rd magnitude where selection effects near the detection threshold become significant. The mean responses are consistently biased toward larger recovered flux on the bright end by ∼15 mmag due to the asymmetric tendency of SOF to measure the sizes of bright, extended objects to be too large in the presence of neighbors; this is a real effect seen in the main data runs and is discussed in greater detail in Section 4.3.1. Such biases are not seen in isolated SOF measurements of similar objects (E. Sheldon, private communication) and appear in this test as it was inefficient to use a grid size large enough to keep all other grid injections outside the MEDS stamps of the largest injections. This effect also keeps the magnitude error from decreasing as the intrinsic brightness increases, as one would naively expect. While the magnitude bias induced by the difference in the cm versus bdf profile definition is present in this measurement, it is negligible compared to proximity biases for extended sources and selection effects present in the noisier images.
Importantly, there is no significant band dependence in the median magnitude responses where the recovered sample is complete, with a typical spread in median griz biases of ∼3 mmag for truth magnitudes ranging from 18.5 to 22 with no characteristic shape or distribution systematics. While there is a detectable band dependence in the mean magnitude responses, it is nearly eliminated when binned in signal-to-noise ratio (S/N) instead of magnitude to account for differences in sky noise.

BALROG in DES Year 3
We describe here the injection samples, pipeline settings, and matching choices used to create the Y3 Balrog data products for the photometric performance characterization described in Section 4 and downstream science calibrations described in Section 5. For Y3, we ran Balrog several times with different configurations for various validation and science cases. These runs are tabulated in Table 1, which lists the following quantities: the run name, the number of simulated tiles, the total number of injected objects, the fraction of detected objects, the mean number of times a given object is injected across all tiles, the spacing between injections, and the magnitude limit used for sampling. As detection in DES is based on a composite riz detection coadd, we emulate the detection magnitude by averaging the dereddened riz fluxes of the injections.
The primary runs used for cosmological analyses are called main and aux (auxiliary). The former samples the transfer function across 1544 randomly chosen tiles (of the 10,338 Y3 tiles) to a detection magnitude limit of 25.4. This limit was chosen to capture DF objects that had at least a 1% chance of being detected as measured from a 200 tile test run. The latter, aux, was a supplemental run at a shallower limiting magnitude of 24.5 across 497 tiles to increase the fraction of recovered injections for analyses that needed a larger total sample. These runs are combined for the fiducial Balrog catalogs y3-deep and δ-stars, which are described in upcoming sections. The distributions of the number of injection realizations per input object for these runs are shown in Figure 5, and the spatial distribution of these tiles is shown compared to the full DES footprint in Figure 6. main-mag and aux-mag are identical to the above runs except for a constant added magnification of μ ∼ 0.02 for a limited subset of tiles; these are described in more detail in Section 5.2. The grid and noiseless-grid runs were used for the validation tests shown in Section 2.3. The blank-sky and clusters runs were conducted separately from the main cosmology runs in order to facilitate two of the science cases discussed in Sections 5.3 and 5.5, respectively.
Figure 3. A series of plots highlighting aspects of the noiseless blank image test described in Section 2.3. (a) The first panel shows the difference in input bdf_fracdev vs. measured cm_fracdev for detected objects. The additional peak at 0.5 for bdf_fracdev is a result of the slightly different model definition; for bdf, the relative size ratio between the bulge and disk components is forced to be 1. This constraint does not exist for cm and thus it has a different prior on the parameter. (b) This panel shows the i-band magnitude response of these objects, where there are clearly two different populations. The first is well calibrated with the majority of detections well within ±2.5 mmag of the truth. The second population is biased toward fainter measurements by ∼7.5 mmag on average. (c) The g − r color response for these objects. The bias in recovered magnitude is nearly identical in griz and so does not translate to the recovered colors. The mean color response for g − r, r − i, and i − z is 0.1, 0.3, and 0.2 mmag, respectively. (d) The final panel shows that the biased magnitude population is a result of injections with input bdf_fracdev ∼ 0.5 scattering to 0 or 1 to match the expected cm_fracdev prior. As we do not believe this differential response to be of physical origin, it contributes to a lower bound on the precision with which Balrog can calibrate Δmag, though importantly this does not contribute a bias to recovered colors.
The processing was done on a dedicated compute cluster at Fermilab, "DEgrid," consisting of 3000 cores with 6-8 GB RAM per core available. The typical core and memory provisioning along with wall-clock running times for each stage of the pipeline is given in Table 2. MOF is not used for the fiducial Y3 cosmology analyses and so is excluded for main and aux, along with their corresponding magnification runs. We include the estimated computational cost to show the difficulty in scaling this methodology to full footprint coverage and WF density; we discuss this more in Section 6. All output measurement catalogs were archived including the MEDS cutout images of detected objects; the injected single-epoch images and resulting coadds were only saved for validation runs.
A few additional postprocessing steps were required to match changes made to the Y3 object catalogs after the fiducial GOLD catalog creation. These consisted of a correction to the Metacalibration S/N column, redefining the size_ratio quantity from mcal_T_r / psfrec_T to mcal_T_r / mcal_Tpsf, and adding a shear weight to each of the Metacalibration measurements for the photometric redshift calibration detailed in Section 5.1.
Figure 4. The mean (solid circle) and median (hollow diamond) difference in measured vs. injected magnitude (〈Δmag〉) as a function of input magnitude for the final blank image runs with zero-mean Gaussian background noise. The vertical bars represent the average of the standard deviations of griz magnitude responses in each truth magnitude bin of size 0.5 mag, centered at the mean magnitude response. The overall calibration is excellent, with the median response less than 5 mmag in all bins except for g < 18.5 and 22.5 < z < 23. We expect significant biases past magnitude 23 due to selection effects near the detection threshold. However, the mean responses show some bias, particularly on the bright end. As discussed in the text, this is due to an asymmetric tendency for SOF to measure the fluxes of bright, extended galaxies to be too large when neighbors are contained in the object's MEDS stamps. The errors in 〈Δmag〉 do not substantially decrease past input magnitudes of 20 for the same reason. This is discussed in greater detail in Section 4.3.1.
Figure 5. The number of injections per unique DF object for main in blue, aux in green, and their combination y3-deep in red. The mean number of injections per run is shown with dashed vertical lines and is stated along with the maximum number of injection realizations. main is composed of 1544 tiles vs. only 497 for aux but has a larger input catalog to sample due to the more conservative composite riz detection magnitude of 25.4 vs. 24.5 for aux. The resulting combination is no longer a Poisson distribution, but this can be accounted for in downstream analyses by using the column injection_counts for building a weighting scheme; see Myles et al. (2021) for an example. The typical Balrog object in y3-deep has just over 20 unique injection realizations across the sampled footprint.
Table 1 note. Parameters include the number of tiles sampled, the number of total detections (N Det), the detection fraction (Det-Frac), the mean number of injections per unique DF object (〈Injʼs〉), the composite riz detection magnitude limit, and injection lattice spacing.

Input Deep Field Catalog for y3-deep
The majority of Y3 Balrog analyses use injections drawn from DECam measurements of objects in the DF described in Hartley et al. (2022). In brief, this catalog of nearly 3 million sources is assembled from hundreds of repeated exposures of three DES supernova (SN) fields and the COSMOS field. The corresponding deep single-CCD coadds have S/N of ∼10 times their WF counterparts and thus provide a good sample of low-noise sources to draw from for explorations of systematics in the WF measurements. There are multiple versions of the DF catalogs that provide tradeoffs in the average seeing quality versus the maximum depth. In Y3 Balrog, we use COADD_TRUTH as it strikes a balance between using observations with 10 times the mean WF exposure time while ensuring that the composite DF FWHM be no worse than the median single-epoch FWHM in the WF for each of the injection bands.
We emphasize that we are not injecting the actual images of DF galaxies but instead take the MOF ngmix parameterized model fit to each detection and generate an idealized galaxy profile based on those model parameters (with added Poissonian noise). The injection framework described in Section 2 is capable of injecting the MEDS stamps directly, which in principle would account for additional diversity in galaxy morphologies and eliminate any model bias compared to the true distribution of galaxy properties. However, this requires extensive validation of the DF stamps before injection and introduces additional complications due to image masks and added noise for injections into CCDs with better seeing than the DF composite image. We plan to revisit these issues for Balrog in the Y6 methodology.
The DF catalog contains model fits that are very similar to the WF CModel with two major differences: the two components (bulge + disk) are fit simultaneously rather than separately, and the ratio of the size of each component, TdByTe, is fixed to be 1. While this was chosen for increased fitting stability for the fainter DF sources, fixing the relative bulge-disk size ratio reduces the total number of free parameters in the model by one and significantly changes the distribution in the relative flux fraction fracdev (recall Section 2.3 for how this impacts the corresponding recovered CModel photometry in idealized conditions). Ultimately, any photometry can be used for the injection truth as long as it is an unbiased estimate of the real distribution of object properties.
Figure 6. The spatial distribution of randomly sampled DES tiles used for Balrog injections. 1544 main and 497 aux tiles are shown in blue and red, respectively. The outline of the DES Y3 footprint is shown in black. Some tiles are slightly outside of the official footprint due to partial image coverage from DECam observations on the footprint edge.
The bdf profile will be used for all Y6 DES source fitting and for Y6 Balrog, avoiding the small systematic difference in magnitudes between cm and bdf.

DF Object Extinction
The DF catalog has detailed photometric corrections to the fluxes including those for extinction as described in Hartley et al. (2022). However, these corrections were not yet ready when Balrog began the cosmology runs. Thus, in order to accurately account for variations in DF extinction, as well as extinction variations among tiles in the Y3 survey footprint, we enacted the following procedure to deredden the DF input objects and then reextinct them by an appropriate amount in the injection WF tile. For the DF objects, we sample the Schlegel et al. (1998) extinction maps at five points (center and corners) in each input DF CCD (of size 9′ × 18′) and record the average of the five E(B − V) values. We also record the five-point average of E(B − V) for the larger (size 44′ × 44′) WF tiles. During injection, we deredden each object by the DF recorded value for its CCD of origin and apply the mean extinction value for the WF injection tile. This chip- and tile-level correction is simple to implement and distorts the overall magnitude and color distribution of the DF galaxy sample from the cosmic average only slightly. However, we plan on implementing per-object extinction corrections in the Y6 methodology. The dereddening and extinction values used are preserved in the injection truth tables for later flux and magnitude corrections to enable consistent comparisons between true and measured quantities.
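The deredden-then-reextinct procedure above can be sketched as follows. This is an illustrative sketch only: the per-band coefficients in `R_BAND` are placeholder values (the coefficients actually used are not given in this section), the relation A_b = R_b × E(B − V) is the standard convention assumed here, and the function names and the `ebv_at` callable are hypothetical.

```python
import numpy as np

# Placeholder extinction coefficients, A_b = R_b * E(B-V).
# Illustrative values only; not taken from this paper.
R_BAND = {"g": 3.2, "r": 2.2, "i": 1.6, "z": 1.2}

def five_point_mean(ebv_at, corners, center):
    """Average E(B-V) over the four corners and the center of a CCD or tile.

    `ebv_at(ra, dec)` is a user-supplied (hypothetical) map lookup.
    """
    points = list(corners) + [center]
    return float(np.mean([ebv_at(ra, dec) for ra, dec in points]))

def reextinct_flux(flux, band, ebv_df, ebv_wf):
    """Deredden by the DF CCD value, then apply the WF tile value.

    The two steps combine into one factor set by the difference in E(B-V).
    """
    delta_a = R_BAND[band] * (ebv_wf - ebv_df)  # net change in magnitudes
    return flux * 10.0 ** (-0.4 * delta_a)

# Example: an object from a DF CCD with E(B-V) = 0.01 injected into a
# WF tile with E(B-V) = 0.05 becomes slightly fainter in g.
flux_injected = reextinct_flux(1000.0, "g", ebv_df=0.01, ebv_wf=0.05)
```

Storing `ebv_df` and `ebv_wf` alongside each injection, as the truth tables do, lets the factor be inverted later for comparisons between true and measured quantities.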

Input Star Sample for δ-stars
While the majority (∼90%) 71 of the injections are sources (both stars and galaxies) from the DES DF, ∼10% of injections are simulated stars. In addition to characterizing the photometric response of stars in DES with nearly no galaxy contamination (see Section 4.2), the δ-stars sample is useful for quantifying the baseline performance of the DESDM pipeline for the simplest morphologies. This allows us to isolate the more complex model fitting issues for the heterogeneous y3-deep sample.
The morphologies are modeled as pure delta (δ) functions convolved with the local PSFEx solution used during injection. The magnitude and color distributions are based on the local stellar population in each of the 10,338 tiles in the Y3 footprint. For example, areas of the survey with a higher stellar density near the galactic plane received more bright stars than areas toward the south galactic pole in the center of the footprint. To represent color distributions fainter than the WF limit of i ∼ 24, the color distribution near i ∼ 24 was extended by two magnitudes to i ∼ 26 using models of the Galactic disk and halo (Bienaymé et al. 2018). The simulated star catalog has already been corrected for extinction, so no other preprocessing is required. The measurement pipeline has no knowledge of the difference in input star/galaxy classification and returns the same CModel fits as y3-deep.

Object Classification and Differences in Measurement Likelihood
While we expect y3-deep and δ-stars will be used for calibration of DES galaxy and stellar systematics, respectively, there are additional star injections in y3-deep as it draws from all sources in the DF that pass quality cuts. Sources in the DF catalog have been classified with a k-nearest neighbor algorithm 72 trained on a subset of objects that have near-infrared (NIR) data from the UltraVISTA survey (McCracken et al. 2012; Hartley et al. 2022). The classifier's stellar sample is not perfectly complete over magnitudes 18 < i < 24 (an average of 93%), but its mean weighted purity is greater than 98% over the same range. The requirement of successful detection and measured photometry for all ugrizJHK bands reduces the total number of objects with classification by 44.5%. The cut NearestNeighbor_class = 2 selects this star sample while NearestNeighbor_class = 1 will select the classified galaxies. The DF stars are not used in the analysis of the Y3 stellar photometric performance in this paper but are available if a larger sample is required for a given science case. However, we do use these classifications when estimating the galaxy contamination in Y3 stellar samples in Section 4.4.
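We note that there is a subtle difference in the measurement likelihoods corresponding to each sample. The likelihood of the δ-sample, $\mathcal{L}^{\delta}_{\rm star}$, assumes perfect classification knowledge and is given by

$$\mathcal{L}^{\delta}_{\rm star} = P(\theta_{\rm meas}, c_{\rm meas} \mid \theta_{\rm true}, c_{\rm true}) = P(\theta_{\rm meas}, c_{\rm meas} \mid \theta_{\rm true}), \qquad (1)$$

where $\theta_{\rm meas}$ and $\theta_{\rm true}$ are the measured and true objects' photometric parameters and $c_{\rm meas}$ and $c_{\rm true}$ are the corresponding object classifications. Alternatively, the likelihood of the DF star sample, $\mathcal{L}^{\rm DF}_{\rm star}$, accounts for the uncertainty in the truth classification:

$$\mathcal{L}^{\rm DF}_{\rm star} = P(\theta_{\rm meas}, c_{\rm meas} \mid \theta_{\rm true}, c_{\rm true}). \qquad (2)$$
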
This becomes particularly relevant if one wants to combine results from Sections 4.2 and 4.3 for modeling errors of the composite sample. The needed conditional probabilities that capture the stellar efficiency and galaxy contamination of y3-deep can be derived from the results in Section 4.4.

Sample Selection and Injection Strategy
While in principle we would randomly sample from all sources in the DF, there are some methodological and practical considerations that led to the following conservative cuts. First, we eliminate any objects flagged with model fitting errors or in manually masked regions. We also require injections be from regions with external observations in the near-infrared (IR) as these IR bands are critical for the photometric redshift calibration (Section 5.1). We restrict the characteristic size of the injections (bdf_T) to be less than 100 arcsec² (corresponding to ∼10″) to reduce the rate of Balrog-Balrog blends and proximity effects on the injection grid, though this selection may result in slightly oversampling large, highly elliptical galaxies. In addition, this choice may be in conflict with other potential science cases such as measuring the detection efficiency and photometric response of low-surface-brightness (LSB) galaxies (Tanoglidis et al. 2020). Next, we remove objects with flux to error ratios of less than −3 in any band; this cut was needed after inspection of the DF catalog showed that there was an excess of objects with extremely negative flux values compared to WF measurements (though ngmix fluxes are clipped below 10⁻³ when computing magnitudes). Finally, we apply a detection magnitude limit of 25.4 to limit the time spent on injections that have almost no chance of being detected while still using a source catalog that is ∼2 mag deeper than WF. As described at the beginning of Section 3, this limit was derived from the mean dereddened riz bdf_flux of injections that had at least a 1% chance of being detected during a 200 tile test of main. We do not consider the flux in g in this calculation as it is not used in the detection image in DESDM processing. The aux limit of 24.5 was chosen based on requirements for the lens magnification measurement detailed in J. Elvin-Poole et al. (2021, in preparation) (and described further in Section 5.2).
After making this selection, the DF injection catalogs used in main and aux have just over 1.23 million and 746,000 objects, respectively.
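The cuts listed above can be summarized in a short sketch. The column names (`flags`, `in_mask`, `has_nir`, `flux`, `flux_err`) and the flux zero point of 30 are assumptions made for illustration; only `bdf_T`, the thresholds, and the use of the mean dereddened riz flux come from the text. Fluxes are assumed to be already dereddened and ordered as g, r, i, z.

```python
import numpy as np

ZEROPOINT = 30.0  # assumed magnitude zero point for the catalog fluxes

def detection_mag(flux_r, flux_i, flux_z):
    """Composite detection magnitude from the mean dereddened riz flux."""
    mean_flux = (flux_r + flux_i + flux_z) / 3.0
    # Clip tiny or negative fluxes before taking the logarithm
    mean_flux = np.clip(mean_flux, 1e-3, None)
    return ZEROPOINT - 2.5 * np.log10(mean_flux)

def select_injections(cat, mag_limit=25.4):
    """Boolean mask of DF objects passing the injection cuts.

    `cat` is a dict of NumPy arrays; `flux` and `flux_err` have
    shape (N, 4) in griz order. All column names are hypothetical.
    """
    keep = (cat["flags"] == 0) & ~cat["in_mask"] & cat["has_nir"]
    keep &= cat["bdf_T"] < 100.0                      # size cut, arcsec^2
    s2n = cat["flux"] / cat["flux_err"]
    keep &= np.all(s2n >= -3.0, axis=1)               # no band below -3
    flux = cat["flux"]
    mag = detection_mag(flux[:, 1], flux[:, 2], flux[:, 3])  # riz only
    keep &= mag <= mag_limit
    return keep
```

Calling `select_injections(cat, mag_limit=24.5)` would apply the shallower aux limit with otherwise identical cuts.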
The star catalog was sampled to its full depth of 27th magnitude in g at a fraction of 10% of the total objects injected into the aux and (most) main tiles. No additional cuts were made. Because the relative contribution of Galactic stars to the total object count peaks at about 21st magnitude in a standard Y3 tile, these injections do not dominate the faint end of the distribution.
Choosing the injection density per realization is a tradeoff between increasing the statistical power of the catalogs, reducing the rate of Balrog-Balrog blends, and reaching the desired footprint coverage given available computational resources. Ideally, we would measure the response of a single source added to DES images for a high number of realizations. As this is infeasible, we instead add objects on a hexagonal lattice with 20″ spacing using a MixedGrid (see Appendix A.3) for a single realization, corresponding to a density of ∼7.8 objects per arcmin² (or about 40% of the total Y3 density).
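As a rough arithmetic check on the quoted density, the sketch below computes the number density of a hexagonal lattice under two spacing conventions. Which convention the MixedGrid uses is defined in Appendix A.3 and is not restated here; reading 20″ as the separation between lattice rows is an assumption, and it is the one that reproduces ∼7.8 objects per arcmin². The function names are hypothetical.

```python
import math

def hex_density_from_neighbor_spacing(a_arcsec):
    """Points per arcmin^2 if `a_arcsec` is the nearest-neighbor distance."""
    a = a_arcsec / 60.0                     # arcsec -> arcmin
    return 2.0 / (math.sqrt(3.0) * a * a)

def hex_density_from_row_spacing(h_arcsec):
    """Points per arcmin^2 if `h_arcsec` is the distance between rows.

    Rows separated by h imply a nearest-neighbor distance of 2h/sqrt(3).
    """
    h = h_arcsec / 60.0
    a = 2.0 * h / math.sqrt(3.0)
    return 1.0 / (a * h)

row_density = hex_density_from_row_spacing(20.0)            # about 7.8
neighbor_density = hex_density_from_neighbor_spacing(20.0)  # about 10.4
```

The two conventions differ by a factor of 4/3, so the spacing definition matters when translating a lattice parameter into an expected number of injections per tile.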
We can achieve a much higher injection density than that used in Suchyta et al. (2016) as we do not randomly sample the positions, which greatly reduces the self-blending rate of injections. This is crucial as running a single Balrog tile realization in Y3 takes ∼40 times longer than in SV and Y1 due to the increased complexity of the injection framework and additional photometric measurements. While this does in principle limit the ability to use Balrog injections as randoms to measure clustering signals on scales at and below the grid size, this is currently well below the scale cuts of order 10′ used in the Y3 analysis. In addition, we note that Balrog can still be used for studies of samples with intrinsic clustering by subsampling the full catalog of grid injections to match the desired clustering signal.
However, this relatively high density could have significant implications for a nonlocal deblender like the one used in MOF. In early testing, we found that this level of injection density can sometimes lead to nearly all objects in a tile becoming a single MOF FOF group. Such nonlocal effects are less relevant for SOF except in cases where blends of other nearby injections with large, real sources may change how the masking of the blend is handled (or for extremely large injections that would be captured in the MEDS cutout of other injections, which is why we cut on the injection size). Dealing with nonlocal contributions to the measurement likelihood may be an important consideration for Y6, as the object detection threshold is lower and proximity effects are more of a concern.

Blending and Ambiguous Matches
An important caveat in using an object injection pipeline like Balrog is that there is often inherent ambiguity in the matching of the new object catalogs to the injections. Remeasurement on the injection images changes the number of detections and catalog ID assignments in unpredictable ways, and light profiles that were previously considered distinct detections can be blended together into single objects. While we will show that the fraction of ambiguous cases is relatively small at our injection density in DES images (<1.5%) and can in principle be removed for our photometric tests, this ignores the increased shear noise and rms of the measured ellipticity distribution for these objects, which may be a dominant systematic for weak-lensing measurements in deeper surveys like LSST (Dawson et al. 2015). In addition, highly nonlinear detection and photometry algorithms can often respond in unexpected ways to perturbations (particularly deblenders that are intrinsically nonlocal), which can lead to additional spurious detections and splitting of objects. As a rule: Any matched catalog from an injection pipeline has made assumptions about ambiguous matches and blending! For these reasons, we save the full remeasured photometry catalogs so that different matching procedures can be applied depending on the desired science case. This is distinct from the approach in Suchyta et al. (2016), which ran remeasurement in SExtractor's association mode near injection positions.
However, it is useful to have a standard catalog sample with consistent matching for downstream cosmological analyses. Unless otherwise specified, Y3 analyses using Balrog catalogs use a catalog that applied the following matching prescription: We define the antecedent of any blend as the "brightest" of the individual objects that contributes to it by some metric. Each blend thus comprises a noisy version of the antecedent as well as the nondetection of all other contributors to the blend. This approach gives a consistent and complete assignment of detection, nondetection, and antecedent to all objects of interest in the remeasured images and strikes the desired balance of including photometric scatter by blend contributors while excluding extreme outliers due to faint injections near existing bright objects. In addition, in the absence of measurement noise, this scheme sets a maximum for the possible flux error of the antecedent in a two-object blend to be |Δmag| ∼ 0.75, a factor of 2. An overview of how this scheme applies to the most common case of a two-object blend is shown in Figure 7.
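The quoted bound of |Δmag| ∼ 0.75 follows from a short calculation, sketched here under the assumption that the blend's measured flux is at most the sum of the two contributors' fluxes (the symbols f_ant and f_other are introduced only for this illustration). Because the antecedent is by definition the brighter object, f_other ≤ f_ant, so

```latex
|\Delta{\rm mag}|
  = 2.5\,\log_{10}\!\left(\frac{f_{\rm ant} + f_{\rm other}}{f_{\rm ant}}\right)
  \le 2.5\,\log_{10}(2) \approx 0.753 .
```

The limit is reached only when both contributors are equally bright and all of the second object's flux is attributed to the antecedent.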
The above prescription requires a brightness metric to determine the antecedent. We use the average of the dereddened Gaussian-weighted aperture (GAp) fluxes in each of the DES detection bands (riz). GAp fluxes are conceptually similar to the GAaP fluxes described in Kuijken (2008) but instead measure the aperture flux for source profiles before convolution with the PSF. These fluxes are computed analytically from the MOF bdf fits to the DF injections and the SOF CModel fits to Y3 GOLD objects using a Gaussian weight function with FWHM of 4″. This allows us to use an estimate derived from our best guess of the flux of the PSF-deconvolved profile near the relevant object centroids while discounting variations in measured flux due to morphological differences, particularly those arising from significant flux contributions from the wings of extended profiles. We use the average of the detection band δ fluxes for δ-stars because an equivalent GAp flux is not well defined. This difference only becomes relevant for the brightest star injections, though in these cases they are very likely to be the antecedent.
The matching procedure is implemented in two separate steps. First, the injection positions are matched to the closest object in the remeasured photometry catalogs within a search radius of r_1 = 0.5″. All objects that have a match are saved in the output Balrog catalogs and undergo the aforementioned postprocessing steps. Afterwards, the output catalogs are matched against the Y3 GOLD catalog to compare the relative brightness of any existing detections within a second match radius r_2 for a series of radii from 0.5″ to 2.0″ in increments of 0.25″. Over 96% of candidate objects have no GOLD sources within the search aperture and are unambiguously a Balrog injection. Candidates that have an existing GOLD object within r_2 with mean riz GAp flux below their own are considered the antecedent and given a match_flag_{r2}_asec = 1 to indicate the presence of a nearby real source. Candidates that have a match within r_2 but have a smaller mean GAp flux than the existing object are assigned match_flag_{r2}_asec = 2 and are recommended to be cut from science analyses. We encode this information as a flag instead of cuts to the fiducial catalog to allow Balrog users more flexibility in choosing how to handle blending and ambiguous cases as needed. In this paper, we cut on match_flag_1.5_asec < 2 as we found that only 0.1% and 0.5% of Y3 GOLD objects were separated by less than 1.5″ (about 1.3-1.8 times the median PSF size, depending on the band) at i magnitudes of 21 and 22.5, respectively.
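The two-step matching can be sketched with brute-force NumPy distance calculations. This is an illustration, not the Balrog matching code: positions are assumed to be flat-sky offsets in arcseconds, the function names are hypothetical, a flag value of 0 for "no GOLD object within r_2" is an assumption, and when several GOLD objects fall within r_2 the sketch compares against the brightest one.

```python
import numpy as np

def match_injections(inj_xy, det_xy, r1=0.5):
    """Step 1: index of the closest detection within r1 of each injection.

    Returns -1 where no detection lies within r1 (arcsec).
    """
    inj_xy = np.asarray(inj_xy, dtype=float).reshape(-1, 2)
    det_xy = np.asarray(det_xy, dtype=float).reshape(-1, 2)
    out = np.full(len(inj_xy), -1, dtype=int)
    if len(det_xy) == 0:
        return out
    for k, (x, y) in enumerate(inj_xy):
        d = np.hypot(det_xy[:, 0] - x, det_xy[:, 1] - y)
        j = int(np.argmin(d))
        if d[j] <= r1:
            out[k] = j
    return out

def match_flag(cand_xy, cand_flux, gold_xy, gold_flux, r2):
    """Step 2: ambiguity flag for one matched candidate.

    0: no GOLD object within r2 (assumed convention)
    1: GOLD neighbor(s) present, but the candidate is brighter (antecedent)
    2: a GOLD neighbor within r2 is brighter than the candidate
    Fluxes are the mean riz GAp fluxes described in the text.
    """
    gold_xy = np.asarray(gold_xy, dtype=float).reshape(-1, 2)
    gold_flux = np.asarray(gold_flux, dtype=float)
    d = np.hypot(gold_xy[:, 0] - cand_xy[0], gold_xy[:, 1] - cand_xy[1])
    near = d <= r2
    if not near.any():
        return 0
    return 2 if gold_flux[near].max() > cand_flux else 1

# Flags for the series of radii used in the text, 0.5" to 2.0" in 0.25" steps
radii = np.arange(0.5, 2.0 + 1e-9, 0.25)
flags = {r: match_flag((0.0, 0.0), 10.0, [(1.0, 0.0)], [20.0], r) for r in radii}
```

For catalogs of realistic size, a spatial index such as a k-d tree would replace the brute-force loops; the flag logic itself is unchanged.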
We show in Figure 8 the difference between the recovered and injected GAp magnitude, Δmag_gap, for all recovered main objects for three choices of ambiguous matching cuts. In the left panel where no cut on ambiguous matches has been made, there is a long, asymmetric tail for negative Δmag_gap where the recovered GAp flux is up to 10 mag brighter than the input. While there can be extremely large magnitude responses to model-fitted photometry in crowded fields or extreme imaging conditions (see Section 4.3.3), we expect GAp magnitudes to be less sensitive to these failure modes and most large discrepancies to be due to ambiguous matches. This is indeed the case: In the following panels where a match flag with r_2 of 0.5″ and 1.5″ is used to create the sample, the worst GAp response outliers have been removed, and the fraction of detections where |Δmag_gap| > 1 falls by 41% and 65%, respectively. Some remaining scatter beyond |Δmag_gap| = 0.75 is expected even for an optimal r_2 due to ambient light in dense fields, blends with extended sources, and image artifacts, though the number of objects below Δmag_gap = −1 for the 1.5″ cut falls by over an order of magnitude for each bin of unit size.

Case (B) has both the injection and the GOLD object detected within $r_2$ but is extremely rare; in this case, we select the closer detection. Cases (C) and (D) are true blends where there is ambiguity in whether to classify it as a Balrog object with properties blended by the GOLD source or as a GOLD object that was blended by an injection. In this case, we assign the object with the larger average riz GAp flux as the antecedent. Only Case (D) is removed from the Balrog catalogs when applying a match_flag cut.

DES Y3 Photometric Performance
Here we present the photometric performance of the Y3 Balrog DF sample y3-deep along with the synthetic star sample δ-stars. While there are many photometric catalogs and science samples of interest for Y3, here we largely focus on the SOF CModel photometry of a basic Y3 GOLD sample (Sevilla-Noarbe et al. 2021) used as a starting point for more restrictive samples. Unless otherwise specified, the cuts for this sample are given by along with any appropriate object classification cut, which will be mentioned when relevant. Note that FLAGS_GOLD_SOF_ONLY is used in place of the typical FLAGS_GOLD as we are unable to compute the first bit flag without y3-deep MOF runs. While ∼3.5% of Y3 GOLD objects have FLAGS_GOLD = 1, no Y3 cosmology analyses currently use this flag bit due to the use of SOF or Metacalibration photometry in favor of MOF. Additional samples for a few interesting Balrog applications are discussed in more detail in Section 5. We begin by examining how representative the Balrog catalog properties are compared to Y3 GOLD in Section 4.1, including a detailed look at how the number density fluctuations of both samples vary with respect to survey property maps. We then show the magnitude and color responses of δ-stars and y3-deep along with a discussion of interesting photometric failure modes in Sections 4.2 and 4.3, respectively. We then end by characterizing the performance of the EXTENDED_CLASS_SOF star-galaxy separator, using the extremely pure δ-stars sample whenever possible. As it is not practical to plot the photometric responses of all quantities of interest, one-dimensional Gaussian summary statistics for many relevant parameters are provided in Appendix C.

Consistency with DES Data
Even without perfect emulation fidelity, we expect the measured Balrog property distributions to closely resemble DES catalogs if we are indeed sampling an adequately representative transfer function and input sample. We will broadly check this agreement at various steps along the measurement path (object detection, photometric properties, and correlations with survey systematics), along with how these differences impact a typical clustering signal measurement. As we are primarily interested in the consistency in the transfer function of galaxies for cosmology, we use the y3-deep sample throughout and mention any classification cuts when relevant.

Completeness
We begin with object detection. Of the nearly 26.5 million galaxies injected in y3-deep, just over 41.9% were detected during remeasurement after accounting for ambiguous matches. However, as this catalog is the merger of two runs with different magnitude limits, it is more accurate to say that 36.3% and 59.4% of objects were recovered for main and aux, respectively. The fraction of injections contained in the fiducial sample drops to 14.4% and 44.2% after considering the basic flag and mask cuts described above. To simplify the comparison on the faint end, we use only main for the following comparison as it is about a magnitude deeper.
The detection completeness of sources in griz for main (points) compared to Y3 GOLD objects in the X3 SN field (lines) is shown in Figure 9. The completeness is plotted as a function of reference magnitude: the injection magnitudes for Balrog and the DF measurements of objects in the X3 field for Y3 GOLD. Because we are comparing the mean completeness of the Balrog sample across all main tiles to only a small region for Y3 GOLD, we estimate the uncertainty in the difference with 50 jackknife samples of the main footprint to make a fair comparison. Note that the inferred completeness is only robust until the forced magnitude limit cutoff of 25.4 indicated by the dashed vertical line; beyond this point, the sampled injection objects have inherited a selection bias that forces at least one of the other detection bands to be significantly brighter than the magnitude limit, and thus they are more likely to be detected.
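As an illustration of the jackknife error estimate, the sketch below computes a delete-one jackknife uncertainty on the completeness in a single magnitude bin. It is a simplified stand-in: the partition of the footprint into regions, the function name, and the use of per-region (detected, injected) counts are assumptions for the example, and the actual analysis applies the jackknife to the Balrog versus Y3 GOLD difference.

```python
import math


def jackknife_completeness(regions):
    """Delete-one jackknife for the completeness in one magnitude bin.

    regions : list of (n_detected, n_injected) tuples, one per jackknife
              region of the footprint (hypothetical input layout).
    Returns (completeness, jackknife_standard_error).
    """
    total_det = sum(d for d, _ in regions)
    total_inj = sum(i for _, i in regions)
    full = total_det / total_inj

    # Completeness recomputed with each region left out in turn.
    leave_one_out = [
        (total_det - d) / (total_inj - i) for d, i in regions
    ]
    n = len(leave_one_out)
    mean_loo = sum(leave_one_out) / n
    variance = (n - 1) / n * sum((x - mean_loo) ** 2 for x in leave_one_out)
    return full, math.sqrt(variance)


# Two toy regions with different recovery rates.
completeness, error = jackknife_completeness([(40, 100), (60, 100)])
```

With 50 regions, as in the text, the same function would be called with a list of 50 count pairs per magnitude bin.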
Overall the completeness measurements are quite similar, with the only discrepancies greater than twice the estimated error occurring for the brightest g-band magnitudes and the faintest i and z bins. The Balrog g-band completeness dips on the bright end despite the very high S/N as g is not included in the composite detection magnitude image limit, and thus objects bright in the g band but not in other bands are sometimes not detected. This is not seen as significantly in the Y3 GOLD sample, which suggests that the input DF sample overrepresents these kinds of objects. It is more difficult to determine possible discrepancies past the detection threshold in each band without careful examination of both measurements, though their residuals are only marginally beyond 1σ and could simply be statistical fluctuations. While it is encouraging to see similar detection properties between Balrog and the data, that alone is not enough to ensure sufficient similarity for science calibrations.

Figure 8. The effectiveness of our ambiguous matching scheme, illustrated by the difference in measured vs. true i-band GAp magnitude ($\Delta\mathrm{mag}_{\mathrm{gap}}$) as a function of input GAp magnitude for three ambiguous matching choices. The overplotted contours contain 39.3%, 86.5%, and 98.9% of the data volume, corresponding to the volume contained by the first three σ's of a 2D Gaussian distribution, respectively. The percentage of detections outside of the dashed region denoting $|\Delta\mathrm{mag}_{\mathrm{gap}}| < 1$ for each choice is labeled in the bottom left of each panel. The left panel shows the $\Delta\mathrm{mag}_{\mathrm{gap}}$ response for y3-deep when no cut is made to handle ambiguous matches. There is an extremely long outlier tail of injections measured to be significantly brighter than the injected flux both from ambiguous blends and real effects (see Section 4.3.3, though GAp fluxes are much less sensitive to these failures). The outlier tail significantly decreases in size as more ambiguous blends are accounted for.

SOF Photometry
We can make similar comparisons of the measured photometry. Figure 10 compares the recovered Balrog SOF griz magnitudes, g − r and r − i colors, and a few morphological parameters to Y3 GOLD after both samples have applied basic cuts. The comparison is in absolute counts with Balrog in blue and the mean of 100 GOLD bootstrap subsamples of identical size to the y3-deep sample in black. The standard deviation of the subsample counts in each bin is used to estimate the uncertainty, and the percent errors of the binned residuals are plotted below each distribution.
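The bootstrap comparison described above can be sketched in a few lines. This is an illustrative outline only: the function names, the half-open binning, the use of the population standard deviation, and the sign convention of the residual (Balrog minus GOLD, relative to GOLD) are assumptions rather than details taken from the analysis code.

```python
import random
from statistics import mean, pstdev


def histogram(values, edges):
    """Counts of values in half-open bins [edges[k], edges[k+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for k in range(len(edges) - 1):
            if edges[k] <= v < edges[k + 1]:
                counts[k] += 1
                break
    return counts


def bootstrap_counts(reference, size, edges, n_boot=100, seed=0):
    """Mean and scatter of binned counts over bootstrap subsamples.

    Each subsample draws `size` values from `reference` with replacement,
    so that it matches the size of the comparison (Balrog) sample.
    """
    rng = random.Random(seed)
    draws = [histogram(rng.choices(reference, k=size), edges)
             for _ in range(n_boot)]
    per_bin = list(zip(*draws))
    return [mean(b) for b in per_bin], [pstdev(b) for b in per_bin]


def percent_residual(balrog_counts, gold_mean):
    """Percent difference per bin; None where the reference bin is empty."""
    return [100.0 * (b - g) / g if g else None
            for b, g in zip(balrog_counts, gold_mean)]
```

Here `reference` would hold one measured quantity (for example an i-band magnitude) for the GOLD sample, and `size` would be the number of objects in the Balrog sample after the same cuts.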
Qualitatively, the distributions are extremely similar in the densest regions of parameter space for most quantities, with the most obvious discrepancies occurring in the low-density tails of the distributions. This is particularly noticeable for the magnitudes and colors. The relative residuals confirm this: While nearly all Balrog magnitude bins have fractional distribution differences below 5% of the mean Y3 GOLD sample from 18 to 24, the region of interest for most Y3 cosmological analyses, Balrog counts in magnitudes below 18 underestimate GOLD by 10%-50% by magnitude 16. The colors are similar with the only discrepancy above 5% in the densest regions occurring at 1.3 < g − r < 1.5, values typical of M-dwarf stars (Smolčić et al. 2004). A few other notable discrepancies are that Balrog appears to underestimate the number of objects with ellipticities cm_g_{1/2} ∼ 0 and negative size parameter cm_T relative to the Y3 GOLD sample, both of which are again values typical of stars.
We stress that these binned residuals are still a largely qualitative check on the agreement between property distributions as they are very sensitive to sample selection. For example, the relative errors in cm_T, cm_g_1, and cm_g_2 near zero are all significantly smaller after applying the stellar cut EXTENDED_CLASS_SOF > 1, which indicates that the y3-deep sample does not capture the transfer properties of stars as well as galaxies. Yet the shape of these residuals often indicates important real differences. The change in residual sign near the detection threshold in each band indicates potential small differences in the effective depth of the samples, and the overabundance of Balrog objects with cm_fracdev near 0.5 reflects the effect of parameter priors not matching the true underlying distribution as discussed in Section 2.3.
In addition, residuals consistent with zero even under the assumption of perfect emulation fidelity require a completely representative input sample. There are many known reasons for why our input sample fails this requirement, a few of which we discuss here: (i) The DF sample underestimates cosmic variance as it only uses objects from a tiny fraction of the sky, which is particularly a problem for the stellar population as its distribution varies across the sky much more strongly than galaxies. (ii) The photometric pipeline used to make measurements of DF objects is not identical to the one used in the WF in order to deal with nondithered observations, an increased blending rate, the large number of exposures per detection, and instabilities in the detection of very faint sources in the presence of diffuse emission (see Hartley et al. 2022). (iii) The morphological model fits to the DF objects are subtly different (bdf versus cm), which we have shown can introduce small biases in other parameters such as the magnitude. (iv) CModel is not an appropriate photometric model for all objects in the sky. There are simple practical limitations that contribute to these discrepancies as well, such as limiting the size and magnitude distribution of objects to reduce Balrog-Balrog blends and the computational time spent on injecting near certain nondetections. We discuss these issues more in Section 6.

Spatial Variation and Property Maps
While the overall similarities in the photometries are encouraging, what is most critical is how well Balrog reproduces the measurable signals used in cosmological analyses as well as correlations with spatially varying image conditions and survey properties. These systematic trends are particularly important when measuring the galaxy clustering signal where local observing conditions can imprint fluctuations in number density that are not cosmological in origin such as variations in seeing, depth, and sky brightness (Rodríguez-Monroy et al. 2021). We now investigate the similarity of these systematic trends in Balrog and Y3 GOLD for a highly incomplete sample where the variation is more apparent, before looking at their contribution to the clustering signal itself for a cosmology-like sample in Section 4.1.4. Figure 11 compares the number density of all y3-deep and Y3 GOLD galaxies with basic cuts as a function of survey property in overlapping HEALPix (Górski et al. 2005)   was estimated by resampling the pixels used in each sample of equal size with replacement for 100 bootstrap samples. The distribution of the rescaled survey properties for the Y3 GOLD sample is plotted in the background in green to highlight typical property values.
With a few notable exceptions, the number density of the two samples match closely in both amplitude and shape. It is especially encouraging to see Balrog capturing the high-frequency structure in the dependence of a few of the more complex trends such as the local sky brightness (skybrite) and airmass. The largest differences in recovered number density occur for extremely rare values of a few properties such as the quadrature sum of zero-point uncertainties (sig_zp) and exposure time (exp_time) and are not particularly concerning. However, there are still some more serious unresolved discrepancies in amplitude, particularly in r-band seeing and airmass. The same potential issues in input sample representativeness and photometric assumptions discussed previously apply to these measurements, but it is not immediately clear why these issues would manifest in a band-dependent fashion in seeing or why the largest discrepancies occur for an indirect parameter of the images like airmass. These differences may be indicative of features in the transfer function not currently captured by Balrog such as PSF modeling errors with unexpected chromatic effects or the unapplied injection zero-point corrections. Such differences warrant further investigations in preparation for an improved Y6 Balrog methodology but do not themselves indicate insufficient consistency for a clustering measurement. We explore this further below.

Figure 10. Comparison of the y3-deep sample (in blue) vs. Y3 GOLD (in black) for measured griz magnitudes, g − r and r − i color, shape parameters cm_g_1 and cm_g_2, size cm_T, flux component ratio cm_fracdev, size component ratio cm_TdByTe, and i-band S/N. Both samples have had the basic cuts applied as described in Section 4. To compare the distributions, we resample Y3 GOLD with replacement to match the size of the y3-deep catalog 100 times and plot the mean and standard deviation of these bootstrap samples in black. The percent error of the binned residuals is shown below each distribution, which has been zoomed in to show the results of the most relevant regions. The region corresponding to ±5% has been shaded in gray. When quantities do not have hard boundaries, we include at least the 2nd-97th percentiles of the values. The residuals are very sensitive to selection cuts. For example, the discrepancies at cm_T < 0 and |cm_g_{1/2}| ∼ 1 are significantly smaller after cutting out suspected stars from the sample.

Galaxy Clustering Systematics
Many of the core science cases of interest to cosmology involve measurements of galaxy clustering. To be useful in calibrations for this purpose, it is not enough that the number counts of Balrog and Y3 GOLD galaxies follow the same trends with image properties like those shown in Figure 11. Where the systematic error is independent of the signal (as, for example, variations in the airmass and the true galaxy density on the sky are statistically independent of one another), the resulting variations in survey depth enter, to leading order, as additive systematic errors in the two-point statistics used for cosmology.

Table E.1 in Sevilla-Noarbe et al. (2021), but we briefly define them here in order from the top: the mean PSF size, the local sky brightness, the quadrature sum of the zero-point uncertainties, the variance of the sky brightness, the airmass, and the exposure time. Balrog captures many of the nonlinear features in the trend lines, though there are some unexplained band-dependent discrepancies in some property maps.
Correcting for these observational systematics is critical for unbiased cosmological inference from clustering. The ability to use Balrog as object randoms with realistic measurement biases, provided it sufficiently captures the clustering fluctuations of the data, offers an ideal calibration method that does not use the data vector directly and so avoids possible overfitting (see Choi et al. 2016; Suchyta et al. 2016; Garcia-Fernandez et al. 2018). In addition, direct calibration with Balrog would eliminate the need to identify all sufficiently important survey property contributions at the desired precision (and avoid biases from any unidentified systematics) while potentially allowing for measurements on larger scales where the true signal is very small and the corrections have to be extremely accurate.
Here we estimate the approximate impact on the clustering signal due to systematic differences between Balrog and Y3 GOLD for a sample broadly similar to the MAGLIM science sample described in Porredon et al. (2021), where we cut both the Y3 GOLD and Balrog samples to 17.5 < i < 21.5 in addition to the previous cuts. We make density maps based on each property map across the full Y3 GOLD footprint by interpolating the trends in Balrog and GOLD to fill in cells where we do not have injection samples. These maps are estimates of the MAGLIM galaxy number density fluctuations in Y3 if they could be completely described by the survey property in question. 74 We then estimate the angular power spectra of both interpolated maps for each survey property using the pseudo-$C_\ell$ estimation code PyMaster (Alonso et al. 2019). These are then compared to the power spectra of the survey property maps themselves along with a typical nonlinear galaxy power spectrum at z = 0.7 computed with the CAMB (Lewis et al. 2000) implementation of the nonlinear power spectrum described in Mead et al. (2015). Finally, we compute the differences in power from the interpolated Balrog and Y3 GOLD density maps as a fraction of the galaxy power spectrum at each ℓ scale.
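Two of the steps above, interpolating a measured density trend onto a survey property map and expressing the difference in power as a fraction of a reference spectrum, can be sketched as below. This is a schematic under stated assumptions: the piecewise-linear interpolation with clamped ends, the absolute value in the difference, and all function names are choices made for the example, and the power spectrum estimation itself (done with PyMaster in the text) is not reproduced here.

```python
from bisect import bisect_right


def interp_trend(x, centers, density):
    """Piecewise-linear interpolation of a binned trend, clamped at the ends.

    centers : increasing survey property values at the bin centers
    density : relative number density measured in each bin
    """
    if x <= centers[0]:
        return density[0]
    if x >= centers[-1]:
        return density[-1]
    k = bisect_right(centers, x) - 1
    t = (x - centers[k]) / (centers[k + 1] - centers[k])
    return density[k] + t * (density[k + 1] - density[k])


def density_map(property_map, centers, density):
    """Predicted relative density for every pixel of a property map."""
    return [interp_trend(v, centers, density) for v in property_map]


def fractional_power_difference(cl_gold, cl_balrog, cl_galaxy):
    """|C_l(GOLD) - C_l(Balrog)| as a fraction of a reference galaxy C_l."""
    return [abs(g - b) / c for g, b, c in zip(cl_gold, cl_balrog, cl_galaxy)]
```

The inputs to `fractional_power_difference` are assumed to be power spectra already evaluated at the same multipoles; values below 0.01 would correspond to the 1% threshold discussed in the text.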
Results for the best (g-band PSF FWHM) and worst (i-band sig_zp) performing map are shown in Figure 12. Angular clustering systematics for the remaining survey properties, generated in the same way, are shown in Appendix B. For scales comparable to or smaller than the DECam focal plane (approximately ℓ > 200), the difference between Y3 GOLD and Balrog is in all cases less than 1% of the typical amplitude of the angular clustering of galaxies (plotted in black). For some quantities, such as the g-band PSF (shown in the top panel in Figure 12), the differences are several orders of magnitude smaller.
While the differences are small in absolute terms, or as compared to a realistic cosmological signal, the relative deviation between the simulated and real catalogs is in some cases quite large. It is difficult to disentangle the relative contribution to these differences from insufficient sampling across survey property values, issues in the input sample, or missing features in the sampled transfer function (such as the zero-point corrections discussed in Section 2.1.1). We discuss these issues further in Section 6. However, that the absolute additive contributions are well below 1% at most relevant scales for even a single realization of a 20% sampling of the footprint gives us confidence that injection simulations like Balrog will be crucial for systematics calibration of clustering measurements in Y6 and the next generation of galaxy surveys with even more ambitious precision goals.
Whether Balrog is sufficiently similar to Y3 data ultimately depends on the science case and desired measurement precision. In addition, the magnitude of discrepancies can depend strongly on the choice of sample cuts, particularly for those effects related to star-galaxy separation and magnitude limits. However, we find that Balrog captures a significant amount of the variation in number density as a function of observing conditions even for extremely incomplete samples and provides systematics control of well under 1% for the clustering measurement of a typical cosmology sample. For an additional example of how to estimate the contribution of the intrinsic uncertainty in the Balrog methodology to the Y3 photometric redshift calibration error budget, see Myles et al. (2021).

Photometric Performance of δ-stars
As discussed in Section 3.2, the injections in δ-stars consist of pure delta functions convolved with the local PSFEx solution. The extremely high purity of this star sample with realistic transfer properties is unique to injection pipelines such as Balrog, where we have truth information about the underlying object classification in addition to its photometry, which is not always the case for galaxy samples (discussed further in Section 4.3). This eliminates the need for a traditional star-galaxy separation metric like EXTENDED_CLASS_SOF and (nearly) removes any bias resulting from misclassified objects, though we still cut on EXTENDED_CLASS_SOF <= 1 to match what is done to create stellar samples in Y3 GOLD. The only contaminants in the main star sample come from ambiguous matches, which is why we still cut on match_flag_1.5_asec < 2. This eliminated 1.9% of detections for this sample. Here we focus on the photometric performance and leave the discussion of stellar completeness and galaxy contamination to Section 4.4. We remind the reader that this sample probes a subtly different measurement likelihood than that of y3-deep as we have knowledge of the underlying object classification, as described in Section 3.3.
While the underlying morphology of stellar profiles is not well described by a Sérsic model, we still use the SOF CModel fits for the stellar sample as there was a systematic calibration offset in the PSF model photometry used in Y3 measurements on the data. This has been corrected for Y6 processing but leaves us without a reliable PSF photometry for our response measurements. However, ultimately this has only a small impact on the recovered photometry for sources smaller than the PSF as these objects are fit with a cm_T size near 0, effectively eliminating the Sérsic components.

SOF CModel Magnitudes
The difference in recovered CModel magnitude compared to input magnitude, $\Delta\mathrm{mag}_\delta$, as a function of input magnitude for griz is shown in Figure 13. Density contours are plotted on top of the scatter with percentiles equivalent to the first three sigmas of a 2D Gaussian distribution, corresponding to 39.3%, 86.5%, and 98.9% of the total data volume. The mean response bias $\langle\Delta\mathrm{mag}_\delta\rangle$, median response $\widetilde{\Delta\mathrm{mag}}_\delta$, and scatter $\sigma_{\mathrm{mag}_\delta}$ in truth magnitude bins of size 0.25 mag are overplotted in black bars. These summary statistics provide estimates for the statistical precision and accuracy of the SOF magnitudes, though we stress that the underlying distributions are not Gaussian. These are compared to the mean reported SOF error in the bin indicated by the solid white curve, which does not attempt to account for systematic effects.

74 Where only regions with Balrog samples are used for the estimate.
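The contour levels quoted for the density plots follow from the fraction of a circular 2D Gaussian enclosed within $n\sigma$, which is $1 - e^{-n^2/2}$. The short check below reproduces the three percentages; the function name is only illustrative.

```python
import math


def enclosed_fraction_2d(n_sigma):
    """Fraction of a 2D Gaussian's volume within n_sigma of its center."""
    return 1.0 - math.exp(-0.5 * n_sigma ** 2)


levels = [round(100.0 * enclosed_fraction_2d(n), 1) for n in (1, 2, 3)]
# levels == [39.3, 86.5, 98.9]
```

These are smaller than the familiar 1D values of 68.3%, 95.4%, and 99.7% because the probability is spread over two dimensions.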
The overall calibration of CModel for the stellar sample is quite good, with the mean $\langle\Delta\mathrm{mag}_\delta\rangle$ and median $\widetilde{\Delta\mathrm{mag}}_\delta$ ranging from 1 to 10 mmag (or 0.1%-0.9%) across all bands up to an input magnitude of 20 and between 2 and 15 mmag (0.2%-1.4%) for input magnitudes between 20 and 22, except for the final two z-band bins. $\langle\Delta\mathrm{mag}_\delta\rangle$ stays under 1.5% for each band in all bins where the number of objects is increasing (input magnitudes of 23.5, 22.5, 22, and 22, respectively) except for the final z-band bin, which is ∼1.7%. The responses are a bit higher than the quoted 3 mmag uniformity of Y3 GOLD stars when compared to the Gaia star catalog (Gaia Collaboration et al. 2018; Sevilla-Noarbe et al. 2021), though the Y3 GOLD uniformity was measured only with respect to Gaia's G band, which we find to have the best photometric performance (differences of 0.5-6 mmag) over the quoted magnitude range. The Y3 GOLD measurement used a restricted 0.5 < g − i < 1.5 color range as well, which eliminates the worst outliers that we still consider here. In addition, the larger discrepancies found here could be the result of the CModel model-misspecification bias discussed previously.
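The percentages quoted next to the mmag values appear consistent with the standard magnitude-flux relation. The sketch below converts a magnitude offset to a fractional flux deficit, assuming that a positive offset means the source was measured too faint and that the percentage refers to the fraction of flux lost; both the convention and the function name are assumptions for illustration.

```python
def flux_deficit_from_mmag(dmag_mmag):
    """Fractional flux deficit for a source measured dmag_mmag too faint."""
    return 1.0 - 10.0 ** (-0.4 * dmag_mmag / 1000.0)


# Percent equivalents of a few offsets quoted in the text.
percent = {m: round(100.0 * flux_deficit_from_mmag(m), 1) for m in (1, 10, 15, 30)}
# percent == {1: 0.1, 10: 0.9, 15: 1.4, 30: 2.7}
```

For offsets this small the conversion is close to linear, roughly 0.092% per mmag.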
The response bias and scatter increase significantly after these points due to competing systematic effects as the sample becomes progressively more incomplete, with the mean responses rising to ∼1.5%-3% as they approach the detection threshold in each band. Small sample sizes and strong selection effects lead to mean and median biases, $\langle\Delta\mathrm{mag}_\delta\rangle$ and $\widetilde{\Delta\mathrm{mag}}_\delta$, of ∼4% for g and r by the 24th magnitude, while the biases of the much shallower i and z rise significantly to over 10%. At the median coadd magnitude limits quoted in Table 2 (corresponding to a S/N of 10), the mean griz biases are measured to be 3.0%, 4.1%, 2.5%, and 2.2%, respectively. The complete set of values for all binned summary statistics is included in Table C1. While the underlying measurement likelihood of these objects is non-Gaussian, the morphological simplicity of stars results in these summary statistics qualitatively capturing the response features well when complete. We will return to this point in Section 4.3, where the situation is significantly more complicated.

Figure 12. Examples of the survey property maps with the smallest (top row) and largest (bottom row) estimated additive systematic impact on the clustering signal from differences in number density between Balrog and Y3 GOLD. The left panels show the angular power spectrum of the noted survey property (in green) and the corresponding power spectra of the number densities of the Balrog (in blue) and Y3 GOLD (in gold) MAGLIM-like galaxies across the Y3 footprint using the interpolated trends described in Sections 4.1.3 and 4.1.4. The reference galaxy power spectrum in black is CAMB's implementation of the nonlinear matter power spectrum described in Mead et al. (2015), meant to represent a typical cosmological signal at z = 0.7 with a linear galaxy bias parameter of 1. The right panels show the difference in power between Y3 GOLD and Balrog as a fraction of the fiducial cosmological power spectrum shown on the left. We draw a red dashed line indicating the 1% systematic error threshold as reference. Even in the worst case, we find that Balrog is able to capture the clustering amplitude due to variations in survey properties to better than 1% for ℓ > 50 (corresponding to θ ≲ 3.5 deg). Equivalent plots for many other survey property maps in all griz bands are shown in Appendix B.
There is evidence of a small band dependence in both the accuracy and precision of the magnitude response. This is most evident when comparing the g band, where $\Delta\mathrm{mag}_\delta$ is never more than 5 mmag (0.5%) too faint below an input magnitude of 23.25, to the z band, where $\Delta\mathrm{mag}_\delta$ is exclusively more than 5 mmag too faint over the same interval. Unlike the blank image tests in Section 2.3, the median responses $\widetilde{\Delta\mathrm{mag}}_\delta$ for each band in a bin have a distinct, monotonically increasing shape, with the spread between the bands consistently 5-10 mmag for injection magnitudes brighter than 21. However, this effect is much less pronounced when binned by the measured S/N in each band, where the detection significance and local sky background are taken into account. Binned in this way, the median response is nearly identical for the i and z bands for S/N greater than 20 while g and r are consistently offset by at least 5 and 2 mmag, respectively. As this band-dependent median response was not present in the blank image tests, it may suggest issues in the real image calibration such as the estimation of sky background, which we discuss more in Sections 4.3 and 5.3.
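The binned summary statistics used throughout this section (mean, median, and scatter of the magnitude response in bins of input magnitude) can be computed as in the sketch below. It assumes the response is defined as measured minus input magnitude, so that positive values are too faint, and uses the population standard deviation for the scatter; the function name and these conventions are illustrative assumptions.

```python
import math
from statistics import mean, median, pstdev


def binned_response(true_mag, meas_mag, lo, hi, width=0.25):
    """Mean, median, and scatter of (measured - true) in true-magnitude bins.

    Returns a list of (bin_center, mean, median, std) for non-empty bins.
    """
    nbins = int(round((hi - lo) / width))
    bins = [[] for _ in range(nbins)]
    for t, m in zip(true_mag, meas_mag):
        k = math.floor((t - lo) / width)
        if 0 <= k < nbins:
            bins[k].append(m - t)
    stats = []
    for k, deltas in enumerate(bins):
        if deltas:
            center = lo + (k + 0.5) * width
            stats.append((center, mean(deltas), median(deltas), pstdev(deltas)))
    return stats


# Three toy injections at the same input magnitude.
stats = binned_response([20.0, 20.0, 20.0], [20.5, 20.0, 21.0], lo=20.0, hi=21.0)
```

Binning by measured S/N instead of input magnitude, as discussed above, would only change the quantity used to assign each object to a bin.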

SOF CModel Colors
Of primary interest is the accuracy of the recovered colors due to their importance for photometric calibration, star-galaxy separation, photometric redshift estimation, and the study of the Milky Way structure. We plot the difference in the measured SOF CModel g − r, r − i, and i − z color versus the input δ-color with respect to the input color in Figure 14. The contours and summary statistics are computed in the same way as the magnitudes, though with a bin size of 100 mmag for g − r and r − i and 50 mmag for i − z. The color calibration for this sample is excellent. For the three colors examined here, the median color difference $\widetilde{\Delta c}_\delta$ is never greater than 5 mmag (0.5%) for injected colors from −0.25 to 1.25 and is most commonly less than 3 mmag (0.3%). Beyond 1.25, $\widetilde{\Delta c}_\delta$ grows to a maximum of 25 mmag (2.3%) too blue for g − r while for r − i it never exceeds an absolute difference of 3 mmag. The mean responses vary significantly due to extremely long scatter tails in both directions from the magnitude difference and are less reliable estimators of the overall performance in this case. However, they tend to be within a factor of 2 of the medians except for g − r, which increases in absolute size dramatically after 0.75 due to the long tail as can be seen in the figure. The full set of summary statistics is shown in Table C1. Notably we do not find evidence of a systematic chromatic response in CModel color.

Figure 13. The distribution of differences in the recovered griz SOF CModel magnitude vs. the injected δ-magnitude ($\Delta\mathrm{mag}_\delta$) as a function of input magnitude for the δ-stars sample. The density is overplotted where the contour lines correspond to the percentiles of the first three sigmas of a 2D Gaussian, containing 39.3%, 86.5%, and 98.9% of the data volume, respectively. The mean (solid), median (dotted), and standard deviation of the magnitude responses in bins of size 0.25 mag are shown in the overlaid black bars. These are compared to the reported SOF CModel errors by the solid white lines, which do not attempt to account for systematic effects. The marginal distributions of $\Delta\mathrm{mag}_\delta$ are included to highlight the small relative volume of the outlier tails.
Next we compare the color-color diagrams for g − r versus r − i and r − i versus i − z for the input and recovered samples in Figure 15. As expected, the recovered injected colors have broader distributions due to the inherited WF noise as well as moderately large magnitude scatters near the detection threshold. However, the broadening is concentrated outside of the 1σ contours, within which the agreement is extremely close.

Photometric Performance of y3-deep
Unlike the synthetic star sample, y3-deep objects are sampled from fits to real sources contained in the DES DF. Thus, not only are the properties of these injections far more diverse, but we do not have perfect knowledge of their true classification. However, we anticipate that most uses of this Balrog sample will be to calibrate galaxy samples used in cosmology analyses. In these cases, we do not care about the true classification as we want to capture the same contamination fraction as the data. For this reason we apply the cut EXTENDED_CLASS_SOF > 1 and leave questions of star contamination to Section 4.4. Removing ambiguous matches with the cut match_flag_1.5_asec < 2 decreased the sample by just under 1.5%.
There are numerous photometries and parameters whose response can be explored with this sample. We restrict ourselves largely to SOF CModel colors, magnitudes, and sizes here for brevity but find similar results for Metacalibration. As with δ-stars, we include tabulated summary statistics in Appendix C.

SOF CModel Magnitudes
We compare the difference in the recovered SOF CModel magnitude versus the true DF magnitude, $\Delta\mathrm{mag}_{\mathrm{DF}}$, as a function of input magnitude for the griz bands in Figure 16. As with δ-stars, we characterize the photometric performance of y3-deep measured galaxies with the summary statistics $\langle\Delta\mathrm{mag}_{\mathrm{DF}}\rangle$ (mean), $\widetilde{\Delta\mathrm{mag}}_{\mathrm{DF}}$ (median), and $\sigma_{\mathrm{mag}_{\mathrm{DF}}}$ (scatter) in bins of truth magnitude overplotted in black bars. Unsurprisingly, the overall scatter in magnitude response for this sample is significantly larger than that for the pure stellar injections due to the rich variety of injected morphologies and issues with the blending of extended sources. The measured $\sigma_{\mathrm{mag}_{\mathrm{DF}}}$ values reflect this by being on average over 4 times larger than the corresponding $\sigma_{\mathrm{mag}_\delta}$ values over the same magnitude range, with the ratio reaching as high as 9 for very bright objects. We then expect the mean response bias $\langle\Delta\mathrm{mag}_{\mathrm{DF}}\rangle$ to be larger as well, but its behavior is more interesting than for the stellar sample. On the bright end below 19th magnitude, the 50th-99th percentiles of objects are detected within 30 mmag (or 2.7%) of truth but there is a clear asymmetric preference for the recovered flux to be too large for the remaining objects. This result is driven by a sizable fraction of bright, extended injections that are commonly blended with existing Y3 GOLD galaxies and are subsequently measured to have far too large of a size. The measured fluxes of these objects vary significantly depending on local conditions and create visible vertical lines in the response scatter due to their many injection realizations and the relatively small population of objects with true magnitude less than 19. Image cutouts for a set of these objects along with the 50th and 95th percentiles of their measured CModel flux profiles are shown in Figure 17, in addition to a more compact, typical injection at the same input magnitude that does not suffer from proximity effects or blending. These examples of large magnitude responses correlated with measured size errors are the first hint of a systematic issue with SOF fits in crowded fields that we investigate in more detail in Section 4.3.3.

Figure 14. The distribution of differences in the measured SOF CModel g − r, r − i, and i − z color vs. the injected δ-color ($\Delta c_\delta$) as a function of input color for the δ-stars sample. The density is overplotted where the contour lines correspond to the percentiles of the first three sigmas of a 2D Gaussian, containing 39.3%, 86.5%, and 98.9% of the data volume, respectively. The mean (solid), median (dotted), and standard deviation of the color responses in bins of size 100 mmag for g − r and r − i and 50 mmag for i − z are shown in the overlaid black bars.
As in the δ-stars sample, we detect a relatively small but clear band dependence in the mean and median responses. For all input magnitude bins brighter than 23 where the sample is nearly complete, there is a monotonic increase in the mean and median response in griz with an absolute spread of ∼16 mmag, or about 1.4% difference between g and z. This effect was hinted at in the response of the pure stellar sample but is far more evident here. This chromatic response is diluted but not eliminated when binning in measured S/N rather than input magnitude, with the median Δ̃mag DF no longer strictly monotonic and with a typical spread of 4-5 mmag for the riz bands but 10-20 mmag when including the g band for S/N greater than 20.
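The binned summary statistics used throughout this section can be sketched as follows. This is a minimal illustration rather than the Balrog pipeline code: the function name, the array inputs, and the sign convention Δmag = measured − true (so that negative values mean the object was recovered too bright) are assumptions made for the example.

```python
import numpy as np

def binned_response_stats(true_mag, meas_mag, bin_edges):
    """Mean, median, and standard deviation of the magnitude response
    (measured - true, an assumed convention) in bins of truth magnitude.

    Returns an array of shape (n_bins, 3); empty bins are filled with NaN.
    """
    true_mag = np.asarray(true_mag, dtype=float)
    delta = np.asarray(meas_mag, dtype=float) - true_mag
    # np.digitize returns 1 for the first bin, so shift to 0-based indices.
    idx = np.digitize(true_mag, bin_edges) - 1
    rows = []
    for i in range(len(bin_edges) - 1):
        d = delta[idx == i]
        if d.size == 0:
            rows.append((np.nan, np.nan, np.nan))
        else:
            rows.append((d.mean(), np.median(d), d.std()))
    return np.array(rows)

# Toy usage with hypothetical values and 1 mag wide bins.
stats = binned_response_stats(
    true_mag=[18.1, 18.2, 18.3, 19.5],
    meas_mag=[18.1, 18.3, 18.2, 19.6],
    bin_edges=[18.0, 19.0, 20.0],
)
```

Because the mean and standard deviation are sensitive to long outlier tails while the median is not, comparing the three columns bin by bin gives a quick indication of how non-Gaussian the response is.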
We believe this chromatic effect is due to a systematic overestimation of the true sky background level in DES (and thus Balrog-injected) images. The SExtractor sky mode estimator is somewhat susceptible to the presence of neighboring objects in its sky annulus, especially in moderately to highly crowded fields. A mode estimate for the background appropriately allows for the fact that there will be background sources, detections, and undetected sources, which are particularly important in the presence of many sources (Stetson 1987). As precise mode estimation was once computationally impractical, traditional codes such as SExtractor have in practice used a Pearson-style mode estimator, Mode est = 2.5 · Median − 1.5 · Mean, for background estimation. This can result in a slight overestimate of the background, which becomes larger as the field becomes more crowded and in the neighborhood of bright stars with extended wings (E. Bertin 2021, private communication). This sky overestimation results in too faint a measurement of a galaxy's true magnitude, and the effect is stronger when there is more sky noise per unit of object signal.
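To make the argument concrete, the sketch below implements the mode estimator quoted above and shows how a constant sky overestimate translates into a magnitude shift. It is an illustration under simplifying assumptions: the real SExtractor background code also applies clipping and mesh filtering, which are omitted here, and the helper names, the aperture size, and the per-pixel offset are hypothetical.

```python
import numpy as np

def pearson_style_mode(pixels):
    """Mode estimate quoted in the text: 2.5 * median - 1.5 * mean.
    No clipping is applied in this simplified version."""
    pixels = np.asarray(pixels, dtype=float)
    return 2.5 * np.median(pixels) - 1.5 * np.mean(pixels)

def delta_mag_from_sky_offset(true_flux, sky_offset_per_pixel, n_pixels):
    """Magnitude shift (measured - true) when the sky is overestimated by a
    constant per-pixel offset across an aperture of n_pixels pixels."""
    measured_flux = true_flux - sky_offset_per_pixel * n_pixels
    if measured_flux <= 0:
        raise ValueError("sky offset removes all of the object flux")
    return -2.5 * np.log10(measured_flux / true_flux)

# Same hypothetical sky error (0.1 counts/pixel over 100 pixels),
# applied to a bright and a faint object.
bright_shift = delta_mag_from_sky_offset(1000.0, 0.1, 100)  # about 11 mmag too faint
faint_shift = delta_mag_from_sky_offset(100.0, 0.1, 100)    # about 114 mmag too faint
```

The same absolute sky error produces a shift roughly ten times larger for the object with ten times less flux, which is the sense in which the bias grows with the amount of sky noise relative to object signal.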
The fact that the sky is more crowded as one moves from bluer (g, r) to redder (i, z) bands could lead to the chromatic effect described above. That the scale of this effect is lessened by binning objects of similar S/N across bands together supports this conclusion. Note that these offsets are computed with dereddened magnitudes, which has the effect of enhancing the chromatic offset in the g band compared to the redder bands. Additionally, Eckert et al. (2020) analyzed the noise properties of DES images and found that there was a slight positive bias induced in the sky-noise level due to faint unresolved sources in the field of essentially all images (see Section 5.3 for more details). The sign of this effect, while smaller, has the same trend and was found to only be significant for the riz bands. We plan to investigate this further for the Y6 Balrog analysis and potentially propose additional magnitude corrections to account for this effect.

SOF CModel Colors
Next, we investigate the color response of y3-deep objects in Figure 18, where we plot the difference in the measured SOF CModel g − r, r − i, and i − z colors versus the injected DF colors, Δc DF, against the input colors. The density contours and overplotted summary statistics are defined in the same way as in the previous plots. While the color response scatter is significantly larger than in δ-stars, the overall calibration is still excellent and with less extreme outlier tails than in the individual magnitude responses. The behavior of the summary statistics is slightly more complex, but we find that the median color response Δ̃c DF is typically ∼3 mmag (0.3%) too faint from −0.25 to 0 and ∼1-11 mmag too bright between 0 and 1.0 for all three colors. The responses are much noisier outside of these regions due to much smaller sample sizes. Δ̃c DF tends to be ∼15-25 mmag (1.4%-2.2%) too faint below −0.25 and 15-25 mmag too bright beyond 1.0 for all colors (though a bit worse for r − i, reaching 12% too bright near 1.5), while 〈Δc DF〉 differences are about three times as large as Δ̃c DF in the same direction depending on the color and bin. As with the stellar injections, individual 〈Δc DF〉 and Δ̃c DF bin values can vary significantly due to long scatter tails, and we find no evidence of a systematic chromatic response in CModel color. The full color response is summarized in Table C4.

Catastrophic Model Fitting
While Figure 16 shows that the vast majority of magnitude responses are well calibrated, with |Δmag DF| typically much less than 0.5, it ignores the very long tail of upscattered outliers that are far larger than the measured photometric errors would predict. The responses of these outliers from blends and catastrophic photometry failures can be over an order of magnitude larger than those previously discussed, as shown for the i band in Figure 19, where the contours from Figure 16 are overlaid in white.
Figure 16. The distribution of differences in the recovered griz SOF CModel magnitude vs. the injected DF magnitude (Δmag DF) as a function of input magnitude for the y3-deep sample. The density is overplotted where the contour lines correspond to the percentiles of the first three sigmas of a 2D Gaussian, containing 39.2%, 86.5%, and 98.9% of the data volume, respectively. The mean (solid), median (dotted), and standard deviation of the magnitude responses in bins of size 0.25 mag are shown in the overlaid black bars. These are compared to the reported SOF CModel errors by the solid white lines, which do not attempt to account for systematic effects. The marginal distributions of Δmag DF are included to highlight the small relative volume of the outlier tails.
Here, the true complexity of even a small slice of the transfer function is revealed: the many competing effects are often in opposition, with biases in the opposite direction of long, asymmetric tails that vary as a function of truth magnitude in a complex way. Simple Gaussian summary statistics like 〈Δmag DF〉 and σ mag DF are not able to appropriately capture the magnitude of these features, and we argue that the Balrog samples themselves (or at least higher fidelity forms of data compression) should be used for most cosmological analyses that need accurate photometric error modeling. Examples of how the full richness of the transfer function can be used in photometric redshift calibration and the magnification of lens samples are given in Sections 5.1 and 5.2, respectively.
Figure 17. A few examples of injections that contribute to the long scatter tail in the magnitude response of bright y3-deep objects, due to the blending of extended DF injections discussed in Section 4.3.1. Each injection had a true g-band magnitude between 17 and 19, and we include the tile name and magnitude response Δm at the top of each panel. The red lines correspond to the 50th and 95th percentile flux contours of the measured profile. The extended profiles of these injections cause the MEDS image cutout size (based on the fitted SExtractor FLUX_RADIUS value) to be relatively large, which increases the probability of including real neighbors in the MEDS stamp. This in turn can cause SOF to significantly overestimate the cm_T size, which leads to a much larger Δm than one would naively expect for objects with these bright magnitudes. This is discussed further in Section 4.3.3. The final panel shows a typical bright but compact object that is very well calibrated for comparison. Note the presence of a nearby source in the bottom that could have potentially caused the same failure mode if the box size had been slightly larger. The stretch in each panel runs from −3σ sky to +10σ sky.
Figure 18. The distribution of differences in the measured SOF CModel g − r, r − i, and i − z color vs. the injected DF color (Δc DF) as a function of input color for the y3-deep sample. The density is overplotted where the contour lines correspond to the percentiles of the first three sigmas of a 2D Gaussian, containing 39.2%, 86.5%, and 98.9% of the data volume, respectively. The mean (solid), median (dotted), and standard deviation of the responses in bins of size 100 mmag for g − r and r − i and 50 mmag for i − z are shown in the overlaid black bars.
However, it is reasonable to be skeptical of magnitude responses of Δmag DF ∼ 2-8 (a factor of 6-1600 in flux!) from supposedly well-calibrated photometry pipelines. To demonstrate what is causing these extremely large differences in recovered flux, we show in Figure 20 a set of injections of the same DF object with an r-band magnitude of 21.42 in eight different WF tiles, where the red lines correspond to the 50th and 95th percentile flux contours. In most cases, the true magnitude is recovered within the reported errors of a few percent. However, in four instances there is at least one nearby object contained in the MEDS cutout image that interferes with SOF's ability to provide a reliable fit due to either an excess of masked pixels in the cutout or residual light unassociated with the injection. The result is a fitted characteristic size cm_T that is much greater than the object's actual size. For this particular injection, the true size of the object (after deconvolution with the PSF) corresponds to a scale length of 0.″77. Yet in the four cases with nearby sources the fitted size of the object is at least 1″, resulting in a flux measurement that is significantly greater than the true input flux. In the worst case, for tile DES0346-5248, the target object is by chance injected near a very bright pair of merging galaxies and is fitted with a scale length of over 17″, resulting in a flux 2.32 mag brighter than the input DF value.
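The flux factors quoted for these outliers follow directly from the definition of the magnitude scale. As a short worked example (the function name and the convention Δm = measured − true are assumptions for illustration):

```python
def flux_ratio_from_delta_mag(delta_mag):
    """Ratio of measured to true flux for delta_mag = m_measured - m_true."""
    return 10.0 ** (-0.4 * delta_mag)

# An object recovered 2.32 mag too bright has about 8.5 times its true flux.
worst_tile_ratio = flux_ratio_from_delta_mag(-2.32)
# Responses of 2 and 8 mag correspond to factors of about 6.3 and 1585.
low_ratio = flux_ratio_from_delta_mag(-2.0)
high_ratio = flux_ratio_from_delta_mag(-8.0)
```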
These photometric measurement failures correlated with errors in measured cm_T can be even more dramatic. In Figure 21 we show eight examples of catastrophic fitting failures due to crowded fields, nearby bright stars, and unflagged image artifacts. These rare but real environments lead to Balrog magnitude responses from 5 to even 7 mag brighter than the injected truth. We emphasize that all of these objects pass the basic Y3 GOLD science catalog quality cuts described at the beginning of Section 4.
Figure 19. The distribution of differences in the recovered i-band SOF CModel magnitude vs. the injected DF magnitude (Δmag DF) as a function of input magnitude. The inset corresponds to the i-band panel in Figure 16, where the density contours still contain 39.2%, 86.5%, and 98.9% of the data volume, respectively. While most of the density is captured in the inset, it misses many of the rich features of the full magnitude response, particularly the long outlier tail of injections measured to have magnitudes up to 10 greater than truth. We explore some of the causes of this in Section 4.3.3.
While the exact causal relationship between complex local environments and extreme magnitude errors requires further analysis, preliminary investigations suggest the following: In crowded fields or areas with unusual image features or artifacts, the SExtractor FLUX_RADIUS (which defines a circle that contains half of the total corresponding FLUX_AUTO value) can be artificially inflated compared to what it would return for an object in an isolated environment. As a source's MEDS cutout image size is rounded up to the next integer multiple of 16, this leads to a MEDS stamp that is significantly larger than what is needed to fit the flux profile in question. This leaves large areas of the stamp with masked pixels when fit with SOF, as the algorithm masks rather than models the light of other detected sources within the cutout. The resulting CModel fits then preferentially overestimate cm_T for this subpopulation, which can greatly increase the inferred flux for a given surface brightness measurement, though we defer investigations into the exact details of the scale and frequency of this effect to a future analysis.
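The way a modest inflation of FLUX_RADIUS can change the stamp size in discrete jumps is illustrated below. Only the rounding to an integer multiple of 16 pixels is taken from the text; the linear scaling from radius to box size and the value of `padding_factor` are hypothetical stand-ins and not the actual MEDS box-size prescription.

```python
import math

def round_up_to_multiple(size_pix, multiple=16):
    """Round a box size up to the next integer multiple of `multiple`."""
    return max(multiple, math.ceil(size_pix / multiple) * multiple)

def stamp_side(flux_radius_pix, padding_factor=10.0, multiple=16):
    """Side length of a square cutout in pixels.
    padding_factor is a hypothetical scaling chosen only for illustration."""
    return round_up_to_multiple(padding_factor * flux_radius_pix, multiple)

isolated = stamp_side(3.0)   # 32 pixels on a side
inflated = stamp_side(6.0)   # 64 pixels on a side
area_ratio = (inflated / isolated) ** 2  # 4.0
```

In this toy setup, doubling the measured radius quadruples the stamp area, so the cutout is far more likely to contain neighbors whose pixels must then be masked during the fit.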
Even without a complete understanding of the underlying cause, the correlation between Δmag DF and ΔT is evident, as can be seen in Figure 22. Here we have plotted the full i-band magnitude response of y3-deep but colored individual responses by the absolute difference in measured cm_T versus input bdf_T. The vast majority of injections with truth i magnitude below 23 with very small Δmag DF responses have T differences much less than 1, which are colored blue. Bright objects with responses substantially below the zero line have moderately large errors in recovered T, as we discussed in Section 4.3.1, while fainter injections with enormous magnitude errors have correspondingly large errors in T, reaching as high as the parameter prior limit of 10⁶ arcsec² (or a scale length of ∼10³ arcsec). The situation is more complicated near and past the detection threshold, about 23rd magnitude in the i band, where additional systematic effects become important.
Model fitting photometry codes are complex, nonlinear, and sometimes nonlocal algorithms that can have unexpected consequences, particularly for low-S/N measurements, crowded fields, or when image artifacts are not appropriately weighted or masked. The journey from pixels to catalogs can at times be chaotic, and our modeling of photometric uncertainties should reflect this.

Scatter from Ambiguous Matches
Despite the efforts described in Section 3.5, there will always be some ambiguity in the matching to injected sources that can introduce large, nonphysical scatter. To check this, we visually inspected hundreds of the MEDS stamps of Balrog objects whose absolute magnitude response was greater than 2, and in particular the set of objects with large Δmag DF whose size errors were small. There were a few isolated instances of ambiguous matches where a faint injection landed in the very center of an extremely bright Y3 star whose GAp flux measurement failed. These can easily be accounted for by adapting our ambiguous matching algorithm to reject Balrog injections near objects with flagged GAp fluxes, but this was not discovered in time to update the catalogs used in downstream measurements. However, this issue has a negligible impact, as we estimate only a few hundred instances in the total y3-deep sample.

Star-Galaxy Separation
We use the δ injections of δ-stars to estimate the stellar efficiency (or true-positive rate) in blue and the classified DF sources in y3-deep for the contamination rate (or false discovery rate) in red for the Balrog star sample as a function of injection magnitude in Figure 23(a), where the solid, dashed, and dotted lines correspond to the fraction of objects classified as less than or equal to an EXTENDED_CLASS_SOF value of 0, 1, or 2, respectively. While y3-deep is required to estimate the contamination rate in order to have a realistic relative ratio between star and galaxy counts, we use the δ-stars sample to compute the efficiency, as its truth classifications are nearly noiseless and the measurement does not need any external information about galaxy contaminants. We find that the stars are correctly classified (EXTENDED_CLASS_SOF <= 1) over 95% of the time below an i-band magnitude of 21.75 and 80% of the time below magnitude 22.75 before dipping to 70% efficiency near the detection threshold at i ∼ 23. The stellar efficiency quickly drops to below 50% beyond the 23rd magnitude. The efficiency of high-confidence stars (EXTENDED_CLASS_SOF == 0) follows a similar trend but reaches the previously quoted values about 0.5 mag earlier. Alternatively, the rate of DF galaxies misclassified as stars stays below 10% until 22nd magnitude, where there is a sharp increase until the detection limit, where at low S/N it is extremely difficult to differentiate between classifications. However, we again note that the stellar efficiency measurement is less noisy due to the higher degree of confidence in accurate classification compared to the DF sample.
Figure 20. … 10034605248852). The red contours give the 50% and 95% enclosed light apertures for the injected object as modeled in each tile. The difference between the measured and injected magnitude Δm is listed next to each tile name, with the cutouts ordered by the magnitude response. The box sizes are in 0.″263 pixels. Not all cutouts are the same size, as the box size expands based on the initial SExtractor FLUX_RADIUS measurement. The true scale length of the object (after PSF deconvolution) is 0.″77. The fitted profile for the object on tile DES0149-4123 is 1.″0, while that on tile DES0346-5248 is an unrealistic 17″, leading to an overestimate of the object flux corresponding to an error of 2.32 mag. The stretch in each panel runs from −3σ sky to +10σ sky.
Figure 21. The MEDS image cutouts for eight Balrog objects with extremely large differences between the measured and injected magnitude Δm. The red lines correspond to the 50th and 95th percentile flux contours of the measured profile. These injections happened to be placed in regions of rapidly varying sky brightness, in the spiral arm of a large spiral galaxy, in a rich cluster, near a stellar diffraction spike, in between two extended galaxies, or simply in crowded fields. In all cases, the fitted size is far too large for the source, which in turn leads to an overestimate of the object's flux. This process is discussed in detail in Section 4.3.3. The stretch in each panel runs from −3σ sky to +10σ sky.
Figure 22. The full i-band magnitude response Δmag DF for y3-deep shown in Figure 19 but now colored by the logarithmic absolute error in recovered size parameter cm_T vs. input size bdf_T. The response scatter is largely correlated with the error in recovered size; injections with small Δmag DF values typically have small errors in recovered T as well (in blue), while nearly all of the extreme magnitude outliers have correspondingly large size errors. The correlation is less strong past the detection threshold at i ∼ 23, where other systematic effects increase in importance.
We make equivalent measurements for the galaxy efficiency and contamination in Figure 23(b), where the solid, dashed, and dotted lines now correspond to the fraction of objects classified as greater than or equal to EXTENDED_CLASS_SOF values of 1, 2, and 3. Here we must use sources in y3-deep exclusively, as the ratio between stars in the δ-sample and galaxies in the DF sample is not realistic as required by a contamination estimate. The efficiency is slightly lower than that of the stars on the bright end due to impurities in the DF knn classifier but is quite close to 100% below the 22nd magnitude. The efficiency of high-confidence galaxies (EXTENDED_CLASS_SOF == 3) decreases sharply near the detection limit, but over 85% of DF galaxies with assigned classifications are correctly identified (EXTENDED_CLASS_SOF >= 2) down to the 24th magnitude in the i band. The contamination rate of stars into the galaxy sample is consistently ∼2% until the 22nd magnitude, where it rises slightly to 4% at a magnitude of 23. This low level of contamination is largely due to the relatively small number of stars compared to galaxies at these magnitudes and is consistent with the findings quoted in Sevilla-Noarbe et al. (2021). A table of the Balrog classification (or "confusion") matrix as a function of input magnitude is provided in Table C5.
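The two quantities plotted in Figure 23 can be written compactly for the star sample as follows. This is a simplified sketch, not the code used to produce the figure: the function names and toy inputs are hypothetical, and the thresholds follow the EXTENDED_CLASS_SOF cuts quoted in the text.

```python
import numpy as np

def star_efficiency(ext_class_of_true_stars, max_star_class=1):
    """True-positive rate: fraction of true stars with
    EXTENDED_CLASS_SOF <= max_star_class."""
    ext = np.asarray(ext_class_of_true_stars)
    return float(np.mean(ext <= max_star_class))

def star_contamination(ext_class, is_true_galaxy, max_star_class=1):
    """False discovery rate: fraction of objects classified as stars
    that are truly galaxies. Needs a realistic star-to-galaxy ratio."""
    ext = np.asarray(ext_class)
    is_true_galaxy = np.asarray(is_true_galaxy, dtype=bool)
    selected = ext <= max_star_class
    if not selected.any():
        return float("nan")
    return float(np.mean(is_true_galaxy[selected]))

# Toy inputs (hypothetical classifications).
eff = star_efficiency([0, 0, 1, 2, 3])
contam = star_contamination(
    ext_class=[0, 1, 1, 3, 2, 0],
    is_true_galaxy=[False, False, True, True, True, False],
)
```

The efficiency depends only on the true stars, which is why a pure sample such as δ-stars suffices, whereas the contamination is normalized by everything that passes the cut and therefore depends on the relative abundance of stars and galaxies in the input sample. In practice both would be evaluated in bins of injection magnitude.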

Applications to DES Y3 Projects
Below we present some of the most important applications of the Y3 Balrog catalogs, particularly those that are relevant for the DES Y3 cosmology analysis. To our knowledge, this is the first time an object injection pipeline has been used for any of the following measurements or played such a critical role in the calibration of a galaxy survey's cosmological constraints.

Photometric Redshift Calibration
Chief among the applications of our results is facilitating a novel inference method for the photometric redshift calibration of weak-lensing samples. As shown in Buchs et al. (2019), we can extract information from the DES Y3 DF to break degeneracies in the riz⁷⁵ color-redshift relation if we have accurate estimates of the corresponding WF properties of the DF sources. In this inference method, Balrog plays the essential role of determining the likelihood of a given deep, many-band color to be observed at a given region of noisier color-magnitude space in DES measurements at Y3 depth. This allows us to rigorously separate the contributions from measurement noise to the true color-redshift relation when estimating the ensemble photometric redshift distribution of the lensing source sample. In practice, this inference method is facilitated by the use of two Self-Organizing Maps (SOMs), which classify the galaxies in the deep and wide samples into discrete classes, called cells, of color and color-magnitude space. The redshift distribution of a given Y3 source is then given by an expression (see Myles et al. 2021) built from several conditional probability factors, where z is redshift, c is a deep SOM cell, ĉ is a wide SOM cell, ŝ is the sample selection function, and θ denotes any additional conditions such as position on the sky. The middle factor, $p(c\,|\,\hat{c},\hat{s},\theta)_{\rm Balrog}$, a narrow slice of the full Balrog transfer function, expresses the likelihood of a deep color to be observed at a certain region of wide color-magnitude space. This transfer function serves to correctly weight the well-constrained redshift distribution p(z|c) of each deep SOM cell according to the probability of detecting those galaxies. As the SOM cells ĉ are determined by Metacalibration magnitude and color, the Balrog samples are key to generating a distribution of observed Metacalibration magnitudes for each injected DF galaxy.
In addition to breaking degeneracies in the color-redshift relation, Balrog, by virtue of enabling this scheme, facilitates avoiding otherwise prohibitive selection biases resulting from the use of spectroscopic redshifts for weak-lensing redshift calibrations (see, e.g., Gruen & Brimioulle 2017) because it uses the spectroscopic redshifts only of galaxies for which eight bands of DES DF photometry provide relatively well constrained p(z).
Figure 23. The efficiency (in blue) and contamination (in red) of the Balrog stellar sample (a) and galaxy sample (b). We use the δ injections of δ-stars as our population of true stars for (a), as it is a nearly pure sample, with only ambiguous matches as potential contaminants. We use the DF injections classified as galaxies from the DF k-nearest neighbor (knn) classifier described in Section 3.3 as our true galaxy sample, which has intrinsic uncertainty as detailed in Hartley et al. (2021). For (b), we cannot use the δ injections, as the contamination measurement requires a realistic ratio of galaxy and star sources in the sample, so we instead use the classified DF stars. Each line corresponds to the fraction of objects above or below the noted EXTENDED_CLASS_SOF threshold value. We do not expect the galaxy efficiency to be 100% even at magnitudes where the sample is complete, due to small impurities in the DF knn classifier.
In the first application of this inference scheme to data, Myles et al. (2021) found that the intrinsic uncertainty in Balrog's estimation of the transfer function is a negligible contributor to the overall error budget, with an uncertainty on the mean redshift in each tomographic bin of $\sigma_z < 10^{-3}$. This is a significant accomplishment, as Balrog was able to decrease the systematic bias in the photometric redshift estimates without contributing a novel source of intrinsic systematic uncertainty in its sampling of the transfer function, which was not obviously the case a priori. The use of Balrog in photometric calibration can be further leveraged in future analyses by incorporating position-dependent selection effects θ in the measurement likelihood $p(c\,|\,\hat{c},\hat{s},\theta)_{\rm Balrog}$. For further details on this method, we refer the reader to Myles et al. (2021).
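The role of the Balrog transfer term can be illustrated with a deliberately simplified calculation that drops the dependence on the selection ŝ and on the conditions θ and uses only the relation p(z|ĉ) = Σ_c p(z|c) p(c|ĉ). This is an assumption made for the example and is not the full expression used in the Y3 analysis; the function names, array shapes, and toy cell assignments are hypothetical.

```python
import numpy as np

def transfer_matrix(deep_cells, wide_cells, n_deep, n_wide):
    """Estimate p(c | c_hat) from (deep cell, wide cell) pairs of recovered
    injections. Each column (fixed wide cell) sums to 1 when populated."""
    counts = np.zeros((n_deep, n_wide))
    np.add.at(counts, (np.asarray(deep_cells), np.asarray(wide_cells)), 1.0)
    col_sums = counts.sum(axis=0, keepdims=True)
    return np.divide(counts, col_sums,
                     out=np.zeros_like(counts), where=col_sums > 0)

def pz_given_wide_cell(pz_given_deep, p_c_given_chat):
    """p(z | c_hat) = sum_c p(z | c) p(c | c_hat).
    pz_given_deep has shape (n_deep, n_z); the result is (n_wide, n_z)."""
    return p_c_given_chat.T @ np.asarray(pz_given_deep, dtype=float)

# Toy example: two deep cells, two wide cells, two redshift bins.
p_c_chat = transfer_matrix(deep_cells=[0, 0, 1, 1, 1],
                           wide_cells=[0, 1, 1, 1, 0],
                           n_deep=2, n_wide=2)
pz_wide = pz_given_wide_cell([[1.0, 0.0], [0.0, 1.0]], p_c_chat)
```

Each wide-cell redshift distribution is a mixture of the deep-cell distributions, weighted by how often injections from each deep cell are actually recovered in that wide cell, which is the information that only the injection sample can provide.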

Magnification Bias on Clustering Samples
The magnification of galaxy light profiles induced by gravitational lensing changes both the number of detected sources on the sky as well as their measured properties, such as size. This effect biases measurements of large-scale structure and should be taken into account in the modeling of galaxy clustering and galaxy-galaxy lensing correlation functions (Unruh et al. 2020).
We can express the observed galaxy density fluctuation field $\delta_g^{\rm obs}$ for a particular redshift bin as
$$\delta_g^{\rm obs} = \delta_g^{\rm int} + \delta_g^{\rm mag},$$
where $\delta_g^{\rm int}$ is the intrinsic number density fluctuation field and $\delta_g^{\rm mag}$ is the contribution due to magnification. In the weak-lensing regime, where the magnitudes of the convergence κ and the shear γ are much less than 1, we can approximate the local magnification μ in terms of the convergence only as
$$\mu \approx 1 + 2\kappa.$$
Under this approximation, the magnification contribution to the number density is proportional to the local convergence,
$$\delta_g^{\rm mag} = C\,\kappa,$$
where C encodes the slope of the intrinsic flux distribution of the sample (Garcia-Fernandez et al. 2018), as well as any selection effects. Magnification impacts the observed galaxy density field through two competing effects: a geometric suppression factor $C_{\rm area}$ resulting from the increased sky area for a fixed set of detections, and a boost in the detection efficiency of faint sources, which increases the local number density and is captured by $C_{\rm sample}$:
$$C = C_{\rm area} + C_{\rm sample}.$$
In addition, magnification may change the measured properties of sources that would be detected without this effect, such as their size or even color as the blending rate is increased, which may alter their selection into clustering samples or their assigned tomographic redshift bin. While a simple argument in J. Elvin-Poole et al. (2021, in preparation) shows $C_{\rm area}$ to be −2, the contribution by $C_{\rm sample}$ for even a simple linear response to δκ depends on the change in the observed number density $n_{\rm obs}$ for a given δκ compared to the intrinsic number density $n_{\rm int}$ as a function of measured object fluxes F (Equation (8)), which is extremely difficult to model as $n_{\rm obs}$ implicitly depends on complex detection and measurement systematics.
To aid in this effort, supplemental runs to main and aux (designated as main-mag and aux-mag, respectively) were created in which the same input objects were injected with identical simulation configurations except for an additional GalSim magnify call that was applied to all objects uniformly. Each object was given a lensing magnification corresponding to δκ = 0.01, effectively increasing the flux and area of objects by about 2%. A given galaxy sample selection can be applied to both the magnified (κ = δκ) and unmagnified (κ = 0) runs, and Equation (8) can be used to estimate the magnification bias C sample. This estimate will include not only the impact of magnification on galaxy fluxes but also any selection bias (e.g., on size) introduced by the altered images on the downstream fitted photometry. Figure 24 shows the C sample estimates from Balrog for samples with a constant i-band flux limit and a simple galaxy selection on EXTENDED_CLASS_SOF and FLAGS_GOLD_SOF_ONLY (with values of 3 and 126, respectively, in the selection criteria).
The same process is also applied to the real data where magnification is applied only to the galaxy fluxes. For this very simple selection, the Balrog estimates are consistent with the data flux-only estimates, indicating any contribution from size selection or other systematics is small.
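A finite-difference version of this comparison can be sketched as follows. The estimator below assumes a linear response of the selected counts to δκ and is one plausible reading of the procedure, not necessarily the exact form of Equation (8); the function names and the toy flux values are hypothetical. The flux-only variant mimics the data-side estimate by scaling fluxes with μ ≈ 1 + 2δκ and reapplying a flux limit.

```python
import numpy as np

def c_sample_from_counts(n_magnified, n_unmagnified, delta_kappa):
    """Fractional change in selected counts per unit convergence,
    assuming a linear response to delta_kappa."""
    return (n_magnified - n_unmagnified) / (n_unmagnified * delta_kappa)

def flux_only_c_sample(fluxes, flux_limit, delta_kappa=0.01):
    """Flux-only estimate: magnify fluxes by mu = 1 + 2 * delta_kappa
    and recount the objects above a fixed flux limit."""
    fluxes = np.asarray(fluxes, dtype=float)
    mu = 1.0 + 2.0 * delta_kappa
    n0 = np.count_nonzero(fluxes >= flux_limit)
    n1 = np.count_nonzero(mu * fluxes >= flux_limit)
    return c_sample_from_counts(n1, n0, delta_kappa)

# Hypothetical counts from an unmagnified and a magnified injection run.
c_sample = c_sample_from_counts(1030, 1000, 0.01)   # approximately 3
c_total = c_sample + (-2.0)                          # adding C_area = -2

# Toy flux-only estimate on a handful of made-up fluxes.
c_flux_only = flux_only_c_sample(
    [90.0, 95.0, 98.5, 99.0, 100.0, 101.0, 110.0, 120.0, 150.0, 200.0],
    flux_limit=100.0,
)
```

With an injection catalog, the two counts come from applying the identical selection to the magnified and unmagnified runs, so any change in measured size, classification, or flags induced by magnification enters the estimate automatically, whereas the flux-only version captures only the change in flux.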
In J. Elvin-Poole et al. (2021, in preparation) this Balrog methodology is applied to the lens samples used in the DES Y3 analysis, including more complex color cuts and tomographic redshift binning. In this analysis, the MAGLIM lens sample (Porredon et al. 2021), which has a redshift-dependent magnitude limit and tomographic binning, is found to have a C sample ranging from approximately 2 to 5 from low to high redshift. The redMaGiC lens sample, which is a luminous red galaxy (LRG) selection, has C sample values ranging from consistent with 0 to approximately 4 at high redshift. The Balrog estimates of C sample are systematically lower than the flux-only estimates due to the additional selection effects captured by the full Balrog transfer function. See J. Elvin-Poole et al. (2021, in preparation) for additional details.
Figure 24. The magnification bias from the boosted detection efficiency C sample estimated on samples with a uniform i-band magnitude limit. The Balrog estimates in blue use the magnified Balrog runs with a 2% magnification applied to every input object. The data estimate applies the same artificial magnification to the galaxy magnitudes in the data and reapplies the selection. The Balrog estimates of C sample are consistent with the flux-only estimates for this very simple selection, indicating that the contributions from additional selection effects are small. However, the Balrog C sample estimates are systematically lower than the simple data estimate for the real Y3 samples used in J. Elvin-Poole et al. (2021, in preparation), where the selections are significantly more complicated.

Noise from Undetected Sources
It is important to accurately characterize image noise to get unbiased estimates of an object's photometric properties and image moments. While Poisson noise is dominant for calibrated images, there are other less-studied contributions to the image noise, including undetected sources (US). Using the Bayesian Fourier Domain (BFD) method described in Bernstein & Armstrong (2014) on Balrog detections across 48 tiles, the variance of measured galaxy moments was found to be up to 30% in excess of Poisson predictions in Eckert et al. (2020). Furthermore, an oversubtraction of the background was detected in the riz bands, leading to a bias in the zeroth moment flux estimator as shown in Figure 25. The blue points show the mean μ of the Gaussian fit to the pull distribution of BFD flux moments for each tile as a function of object density, where a clear correlation can be seen, particularly for the redder bands. The green points are the same measurements after making a local estimate of the background in each postage stamp.
In order to determine if the excess noise was due to US, a slight variant on the Balrog injection procedure was followed in which we injected zero-flux objects into 39 tiles at random positions and then made cutout postage stamps of these random patches of sky. The cross-power spectra of distinct exposures of the "dark" injections in griz were then computed, which would yield zero signal if the noise is Poisson or read noise. A clear detection of US noise is made in each band. This empirical approach allows computed BFD moments to calibrate the moment covariance matrix on the survey images rather than relying on simulations of unknown fidelity and naturally includes the contribution by US as a source of noise within the Bayesian calculation. See Eckert et al. (2020) for further details.
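The logic of the cross-power test can be illustrated with synthetic images: noise that is independent between two exposures averages to zero in the cross-power spectrum, while a sky signal common to both, such as light from undetected sources, does not. The sketch below uses Gaussian random fields as stand-ins for real postage stamps; the function name, stamp size, and amplitudes are assumptions for illustration only.

```python
import numpy as np

def mean_cross_power(img_a, img_b):
    """Mean of the real part of the cross-power spectrum of two
    mean-subtracted images, normalized by the number of pixels.
    By Parseval's theorem this equals the pixel covariance of the images."""
    a = np.asarray(img_a, dtype=float)
    b = np.asarray(img_b, dtype=float)
    a = a - a.mean()
    b = b - b.mean()
    fa = np.fft.fft2(a)
    fb = np.fft.fft2(b)
    return float(np.mean((fa * np.conj(fb)).real) / a.size)

rng = np.random.default_rng(0)
shape = (64, 64)

# Two "exposures" with purely independent noise.
independent = mean_cross_power(rng.normal(size=shape), rng.normal(size=shape))

# Two "exposures" that share a faint common sky signal.
shared = 0.5 * rng.normal(size=shape)
correlated = mean_cross_power(rng.normal(size=shape) + shared,
                              rng.normal(size=shape) + shared)
```

In this toy case the independent pair gives a value consistent with zero, while the pair with a shared component recovers roughly the variance of that component (about 0.25), analogous to the detection of US noise across distinct exposures.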

Accurate Joint Redshift-Stellar Mass Probability Distributions with Random Forests
In Mucesh et al. (2020), Balrog is used together with the random forest machine-learning algorithm to produce well-calibrated joint redshift-stellar mass probability distributions in a fraction of the time required by traditional template-fitting methods. This was made possible because Balrog produces an ideal training sample: it captures both the realistic noise properties of DES WF measurements and the redshift and mass information of the DF injections from the COSMOS2015 catalog (Laigle et al. 2016).

Photometric Response near Galaxy Clusters
Clusters of galaxies-especially rich, crowded clusters-are known to present additional obstacles in the accurate detection and characterization of both cluster members and nearby, unassociated sources. These galaxies often have higher detection incompleteness and significant photometric biases because of the increased rate of proximity effects. Detected sources in or near galaxy clusters in the sky can be further biased because of blending with member galaxies or contamination from intracluster light (Zhang et al. 2019). To aid in studies of these difficult measurement biases and selection effects, a high-density Balrog run was performed targeting areas near rich galaxy clusters.
A sample of 900 tiles, each containing a galaxy cluster with optical richness λ > 35 (see Rykoff et al. 2016 for a description of richness and the DES cluster catalog), was injected with a similar DF galaxy sample as used in y3-deep at a lattice separation of 10″, resulting in four times the injection density of the main cosmology runs. This higher injection density was needed to properly sample the effects of clusters on the transfer function as a function of radius from a cluster center given the number of tiles used. 76 Additionally, we used a more restrictive riz detection magnitude of 23 to increase the fraction of detected objects for this analysis.
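The factor of four follows from the inverse-square dependence of lattice density on spacing. The short sketch below assumes a hexagonal lattice and a 20″ spacing for the main cosmology runs; both are assumptions made here for illustration (the 20″ value is simply what a factor of four implies for a 10″ cluster-run spacing), and the ratio itself does not depend on the lattice geometry.

```python
import math

def hex_lattice_density(spacing_arcsec):
    """Points per square arcminute for a hexagonal lattice.

    Each lattice point owns a cell of area (sqrt(3)/2) * spacing**2.
    """
    cell_area_arcsec2 = (math.sqrt(3) / 2.0) * spacing_arcsec ** 2
    return 3600.0 / cell_area_arcsec2  # 3600 arcsec^2 per arcmin^2

dense = hex_lattice_density(10.0)     # cluster run spacing quoted in the text
fiducial = hex_lattice_density(20.0)  # assumed spacing of the main runs
ratio = dense / fiducial

print(f"density ratio: {ratio:.1f}")
```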
The magnitude responses of the injected galaxies were measured as well as their distances to the center of the nearby clusters. The sample was further subdivided by the host cluster's richness, the measured object size cm_T, and the magnitude of the cluster's brightest cluster galaxy (BCG) to see how these parameters affect the magnitude bias of the inserted objects. Preliminary results of this analysis for clusters in the redshift range from 0.2 to 0.3 are shown in Figure 26 and will be followed up in A. Masegian et al. (2021, in preparation), including a careful study of the detection efficiency as a function of various parameters including radial distance. Unsurprisingly, the magnitude responses become more negatively biased closer to cluster centers where the complex environments make accurate photometric measurements difficult and faint sources are upscattered by the abundant residual light.
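A minimal sketch of the binning described above, assuming the magnitude response is defined as measured minus injected magnitude and that each injection carries a radial distance to its nearest cluster center. The function name, bin edges, and toy values are hypothetical and are not DES measurements.

```python
from statistics import median

def median_response_by_radius(delta_mag, radius, bin_edges):
    """Median magnitude response in each [edge_i, edge_{i+1}) radial bin.

    Returns None for bins that contain no injections.
    """
    medians = []
    for lo, hi in zip(bin_edges[:-1], bin_edges[1:]):
        in_bin = [d for d, r in zip(delta_mag, radius) if lo <= r < hi]
        medians.append(median(in_bin) if in_bin else None)
    return medians

# Toy values only: radius in arbitrary units, delta_mag in magnitudes.
radius = [0.1, 0.2, 0.3, 0.6, 0.8, 0.9, 1.5, 1.8]
delta_mag = [-0.30, -0.20, -0.25, -0.10, -0.05, -0.08, 0.00, -0.02]

result = median_response_by_radius(delta_mag, radius, [0.0, 0.5, 1.0, 2.0])
print(result)
```

The same function can be applied to subsamples split by richness, measured size, or BCG magnitude to compare how the median response changes with each quantity.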
We find a similar correlation between an object's measured size and magnitude response as seen in Section 4.3.3. The proximity effects that cause asymmetric overestimates of cm_T are amplified in the very crowded cluster environments, a trend that grows even stronger closer to the cluster centers. Correlation between magnitude bias and the other examined parameters, cluster richness and the BCG magnitude, is weaker but still present, particularly for richness. All correlations appear to bias the recovered magnitudes in the same direction. The scale of these effects increases as the injections approach cluster centers. Taken together, the proximity to cluster centers, cluster richness, and BCG brightness artificially increase the number of observed objects near clusters above a fixed brightness threshold which, in turn, can collectively bias cluster measurements from a corresponding increase in cluster member galaxies. We plan on accounting for these correlations in future DES cluster analyses.
Figure 25. Reproduced from Eckert et al. (2020), Figure 3. The Gaussian mean offset μ in the BFD flux moment pull distribution as a function of object density for the 48 used Balrog tiles in blue. The green points show the mean offset for the tiles after a local sky subtraction that mitigates the flux bias. While the g band is relatively unaffected, the redder riz bands show statistically significant sky oversubtraction that is correlated with object density.
76 While this increases the probability of unwanted proximity effects from other Balrog injections, we estimate the chance of two neighboring injections in this run both having bdf_T > 10 (or ∼3 3) to be less than 0.25%.

Current Methodological Limitations and Future Directions
While this latest iteration of Balrog has made great advances in its ability to precisely quantify difficult measurement systematics, there remain many challenges to overcome if we are to reach the level of precision required by upcoming Stage IV surveys like LSST where the increased depth, pipeline complexity, and blending rate will otherwise limit the constraining power on cosmological parameters. Some of these challenges, such as properly accounting for per-object chromatic corrections at injection time or pushing the injection step further upstream in the measurement pipeline to account for more systematic effects in the image calibration, are largely technical barriers that can be addressed with more development time. Our ambiguous matching scheme can be improved by incorporating pixel-level information on the overlap between injected and real sources similar to the blending parameter introduced in Huang et al. (2018). In addition, many of the complexities and additional development time needed for careful emulation of a survey's measurement pipeline can be nearly eliminated by having injection pipelines placed directly in the software stack of the fiducial data processing runs. While this was not possible in DES, this approach is now taken in HSC with SynPipe and planned for LSST. However, there are more fundamental barriers to leveraging injection pipelines to their full potential.
Figure 26. The difference in measured CModel z-band magnitude vs. the injected DF magnitude Δmag DF as a function of input magnitude for the high-density clusters run in a redshift range of 0.2-0.3. The three columns present the Δmag DF responses binned by their radial distances to nearby cluster centers as specified at the top of the columns. The median response biases across the range of the injected magnitude are displayed as solid red lines, with the first and second σ contours of Δmag DF indicated by the dashed lines above and below. As the injections approach the center of a cluster, the median bias becomes increasingly negative, indicating that the objects are measured to be progressively brighter than injected truth the closer they are to the brightest cluster galaxy (BCG). In the three rows we color the magnitude responses as a function of (a) measured object size cm_T, (b) cluster richness, λ, and (c) the magnitude of the cluster's BCG. The measured object size appears to have the strongest influence over magnitude bias among the three quantities, though richer clusters also show larger Δmag DF responses. We use cm_T+1 for the color scale as the ngmix T sizes are allowed to be slightly negative. This preliminary analysis will be followed up in more detail in A. Masegian et al. (2021, in preparation).
A primary challenge is increasing the representativeness of the input catalog. Using the DECam observations of sources in the DES DF as the basis for the input object photometry rectified many of the input sample issues described in Suchyta et al. (2016), particularly the discrepancy in recovered Balrog colors as compared to Y1 GOLD that arose from interpolating the SED of COSMOS galaxies to match DECam filters. However, Figure 10 shows that we have further work to do. While it is difficult to disentangle intrinsic errors in the emulation of the DESDM pipeline from the input sample representativeness, there are some clear avenues for improvement.
The conceptually simplest is to sample a wider population of deep objects across more deep patches of sky in order to incorporate greater cosmic variance in the injection sample. However, these deep observations are very expensive, which limits the practicality of this approach. It may be possible to combine with external deep data sets, though this comes at the expense of a return to SED interpolations to match DECam filters. In addition, more detailed studies of the difficult PSF modeling in the DF may yield a stellar population more similar to the WF measurements and resolve some of the largest discrepancies between Balrog and Y3 GOLD for bright, PSF-like objects.
Another possibility is that the discrepancies between the recovered WF sample and Y3 GOLD are driven, at least in part, by the inability of the CModel profiles to accurately capture the full diversity of galaxy morphologies. True galaxy profiles have many complex features such as spiral arms, star knots, and long asymmetric disruptions from mergers that we are not currently capturing with our DF injections. The most direct solution to this problem is to inject the MEDS image cutouts of the DF sources. We have already built the basic infrastructure to do so with Balrog, as described in Section 2.2.2, but there are new issues to consider. The image cutouts can include artifacts, excessive masking, truncated profiles of nearby objects, or even be blended with other sources. This may be rectified in the future by using machine-learning methods such as nonnegative matrix factorization or generative adversarial networks to handle the required pixel-level deblending of sources in the stamps (see Melchior et al. 2018 and Reiman & Göhre 2019 for examples, respectively).
However, using the image cutouts directly would introduce undesired noise when injecting into single-epoch exposures that had better seeing conditions than the composite PSF of the single-chip DF coadd and remains an unresolved issue. In addition, precisely defining the "truth" properties of the stamps is less straightforward than for model fit injections. This will likely be handled by making accurate measurements of each relevant WF photometry type on the stamps, which would eliminate inheriting nonphysical parameter biases from small profile definition differences such as the resulting magnitude bias from differences in fracdev prior shown in Figure 3.
The most difficult challenge to overcome is the high computational cost of the injection pipelines. The new single-epoch processing and additional photometric measurements in Y3 Balrog have increased the total mean CPU time per recovered injection to ∼80 s, about 12 times greater than in Suchyta et al. (2016). This large increase in runtime applies at Y3 depth alone, corresponding to ∼4-6 epochs per injection, and it made it infeasible to directly calibrate many of the other key aspects of the Y3 cosmological analysis for which Balrog would otherwise be the ideal measurement tool. For example, to achieve the equivalent statistical precision on how the blending of galaxies at different redshifts affects the multiplicative shear bias as measured in MacCrann et al. (2022), we estimate that we would have to run the equivalent of over a dozen y3-deep samples to sufficiently capture how an identical injection population responds to an input shear signal that varies with redshift. In addition, the original goal of using Balrog to directly calibrate the spatially dependent measurement biases and completeness inhomogeneities in the galaxy clustering measurement as outlined in Suchyta et al. (2016) would require many more injections than sources for the estimation of the angular correlation function, a daunting prospect in light of the ∼40% Y3 density achieved in this analysis. The situation will become significantly worse for much deeper surveys like LSST, where we can expect hundreds of exposures for each object.
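To give a sense of scale, the quoted per-injection cost can be turned into a rough budget. The sketch below uses only the ∼80 s figure and the factor of ∼12 from the text; the count of ten million recovered injections is a hypothetical round number chosen for illustration, not a value reported for the Y3 runs.

```python
def total_cpu_hours(n_recovered, seconds_per_injection=80.0):
    """Total CPU hours for a given number of recovered injections."""
    return n_recovered * seconds_per_injection / 3600.0

y3_seconds = 80.0               # mean CPU time per recovered injection (from the text)
y1_seconds = y3_seconds / 12.0  # implied earlier cost, given the quoted factor of ~12

# Hypothetical sample size, used only to illustrate the scaling.
example_hours = total_cpu_hours(10_000_000, y3_seconds)

print(f"implied earlier cost: {y1_seconds:.1f} s per injection")
print(f"10 million recovered injections: {example_hours:,.0f} CPU hours")
```

Because the cost is linear in the number of injections, each additional full realization of the same sample multiplies this budget accordingly.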
Perhaps even more consequential than the low number density realistically achievable, the high cost of running Balrog led to only a single injection realization across just 20% of the total footprint area. This limited the calibrations from Balrog in Y3 to those either based on mean measurements, such as those described in Section 5, or requiring a reduction in the considered footprint, such as the clustering measurement presented in Section 4.1.4. While even the relatively low sampling of y3-deep was sufficient to capture systematics variations in the clustering amplitude to better than 1% for a MAGLIM-like sample in the overlap area, reaching this threshold (or beyond) for highly incomplete samples or for accurate calibrations of large-scale fluctuations may require orders of magnitude more injections. Despite an expected significant increase in the total tiles sampled for Y6, achieving the many realizations of full footprint coverage required for the most ambitious Balrog measurements, such as providing realistic random properties for clustering and shear two-point measurements, will likely remain impractical without a dramatic increase in computational investment.
One promising solution that we plan to explore is the use of the Balrog samples as a training set for an emulator that predicts additional realizations conditional on the survey property maps. A somewhat similar approach is taken in Johnston et al. (2020), where they mitigate galaxy clustering systematics by producing "organized" random catalogs with fluctuations in number density imprinted from an SOM approach that trained on maps of the variations in KiDS observing properties. Using an injection catalog like Balrog directly as a training sample for this approach would leverage our very-high-fidelity measurements of the survey transfer function to include unknown systematics not fully captured by the identified survey properties. While still more computationally expensive than a machine-learning-only approach, this will allow us to build an efficient way of creating accurate random samples tuned for the desired measurement without increasing the total survey pipeline computational cost by more than a factor of 2. We plan to use the presented Balrog catalogs to gauge the accuracy and feasibility of this approach in an upcoming analysis.

Summary and Conclusion
We have presented here the suite of DES Y3 Balrog simulations and resulting object catalogs used in downstream Y3 analyses. Like its Y1 predecessor, this current iteration of Balrog directly samples the DES transfer function by injecting an ensemble of realistic sources into real survey images to make precise measurements of the inherited systematic biases in the photometric response. However, the updated methodology (and entirely new coding framework) for Y3 Balrog makes significant strides beyond Suchyta et al. (2016) in replicating many of the more complex features of the DESDM pipeline, including the coaddition of single-epoch images and multiepoch photometric measurements from SOF and Metacalibration in order to probe more aspects of the true measurement likelihood. In addition, we used a more realistic input sample based on the DES DF source catalog with observations using DECam filters that eliminated the need for template fitting to COSMOS galaxies and incorporated more cosmic variance in object properties. We also implemented a novel ambiguous matching scheme to capture many of the impacts of source blending while largely eliminating the contributions from undesired dropouts that happened to land on top of existing bright sources.
This effort culminated in tens of millions of Monte Carlo samples of the DES transfer function at high fidelity across 20% of the full DES footprint to Y3 depth, capturing systematic biases from more variations in observing conditions than any previous Balrog analysis. The improved methodology resulted in the injected objects matching Y3 GOLD photometric properties and capturing clustering systematics correlated with survey property maps to better than 1% accuracy for a typical cosmology sample on relevant scales. Additionally, we find that Balrog captures the clustering amplitudes of these systematics within a few percent for even highly incomplete samples, an encouraging first step for future analyses that wish to leverage more of our hard-earned (and often expensive) photons.
We quantified the photometric responses of Balrog injections through the Y3 DESDM measurement pipeline, particularly for magnitudes, colors, and morphology. We find that the magnitudes of most injections are well calibrated until selection effects near the detection threshold become significant. However, objects in crowded fields or near image artifacts show a clear asymmetric tendency toward moderately to severely overestimated sizes, which correlate with large negative magnitude biases. These biases are fairly common for bright, extended objects and can become extremely large (up to Δmag DF ∼ 8) at fainter magnitudes, though they constitute a much larger relative fraction of objects on the bright end. We demonstrated that these catastrophic photometry failures are real effects and often pass science cuts. We plan on exploring the causal relationship of this photometric failure mode further in a future analysis. While these magnitude response biases can cause significant discrepancies from more naive error estimates, fortunately, their effect appears to have little impact on the recovered colors, where we find typical median response biases of ∼1-3 mmag for stars and ∼5-10 mmag for galaxies in the densest regions of parameter space, an effective median color calibration offset of less than 1%.
Finally, we discussed a few of the most important applications of the presented Balrog catalogs to the Y3 cosmology analysis and other DES science measurements. In particular, we provided a realistic measurement likelihood in the calibration of photometric redshifts to reduce systematic biases in one of the largest sources of uncertainty in the cosmological measurement without contributing any additional uncertainty to the overall error budget. Unexpected findings such as the noise contributions from undetected sources in DES images and sky oversubtraction in the riz bands described in Eckert et al. (2020), in addition to the moderate band dependence in magnitude response and discovery of a new class of catastrophic photometry failures correlated with measured size, are indicative of the diagnostic power of object injection pipelines like Balrog in modern galaxy surveys.
We believe that this paper only scratches the surface in cosmological calibration potential and the identification of new systematics with injection pipelines such as Balrog. In particular, the combination of direct Monte Carlo sampling of the transfer function with an emulator to boost the total statistical power has the potential to facilitate many of the most difficult measurements in modern galaxy surveys. It is clear that we have yet to dig too deep.
S.E. and T.J. acknowledge support from the U.

Data Availability
The DES Y3 data products used in this work will be made publicly available following publication at the URL: https://des.ncsa.illinois.edu/releases.

Appendix A Injection Software
Here we describe a few of the most relevant configuration options when running the new injection framework, as well as templates for custom injection classes defined by the user for more advanced interfacing; see the code repository 77 for more details on running the simulations.

A.1. Injection Configuration
Configuration settings specific to a typical Balrog run have been wrapped into custom GalSim image and stamp types, both called Balrog:
1. image: Balrog. This image type is required for a full Balrog run. It parses all novel configuration entries and defines how to add GalSim objects to an existing image with consistent noise properties. It also allows the Balrog framework to be run on blank images for testing.
2. stamp: Balrog. An optional stamp type that allows GalSim to skip objects whose fast Fourier transform (FFT) grid sizes are extremely large and can occasionally cause memory errors when using photometric model fits to DES DF objects.
We also provide a much simpler image class called AddOn, which adds any simulated images onto an initial image without the full Balrog machinery. Some configuration details can also be set on the command-line call to balrog_injection.py for ease of use as long as they do not conflict with any settings in the configuration file.

A.2. Input Sample and Object Profiles
In principle any native GalSim input and object type can be used for injection. However, the object sampling, truth property updating, and truth catalog generation steps require knowledge about the underlying structure of the input data (e.g., parametric models versus image cutouts). We handle this ambiguity through the use of BalInput and BalObject parent classes that define the necessary implementation details to connect GalSim to Balrog. These classes can be used to register any needed injection types to Balrog including custom GalSim classes. Subclasses provided for injection types used in DES Y3 runs are described below:
1. ngmixGalaxy: Described in Section 2.2.2. A sum of GalSim Gaussian objects that represent a Gaussian mixture model fit to a source by the measurement software ngmix. Balrog can currently inject the following ngmix profile model types: a single Gaussian (Gauss), a composite model (cm) that combines an exponential disk with a de Vaucouleurs' profile, and a modified CModel with a fixed size ratio between the two components (bdf). As ngmix allows for objects with negative size before convolution with a PSF, these negative values are clipped to a small nonzero value (T = 10^-6, corresponding to a size scale of ∼10^-3 arcsec) to avoid rendering failures.
2. DESStar: A synthetic star sample with realistic density and property distributions across the DES footprint was created to a depth of 27 mag in g. These objects are treated as delta functions convolved with the local PSF. These magnitudes are referenced as δ-mag in later figures. Further details about this star catalog are described in Section 3.2.
3. MEDSGalaxy: Single-epoch image cutouts of detected DES objects are stored in MEDS files for each band. These image cutouts can be used directly for injection after deconvolving with the original PSF solution and reconvolving with the local injection PSF.
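The size clipping applied to ngmix profiles can be sketched in a few lines. This is an illustration rather than the Balrog source: the function names are hypothetical, T is assumed to be in arcsec^2 so that its square root gives the quoted size scale, and treating T = 0 like a negative value is an additional assumption made here.

```python
import math

T_FLOOR = 1e-6  # arcsec^2; the small nonzero value quoted in the text

def clip_size(T, floor=T_FLOOR):
    """Replace nonpositive pre-PSF sizes with a small positive floor."""
    return floor if T <= 0.0 else T

def size_scale_arcsec(T):
    """Characteristic linear size implied by T (assumed to be in arcsec^2)."""
    return math.sqrt(T)

for T in (-0.02, 0.0, 0.35):
    clipped = clip_size(T)
    print(f"T = {T:+.2f} -> clipped T = {clipped:g}, "
          f"size scale = {size_scale_arcsec(clipped):.3g} arcsec")
```

The floor corresponds to a size scale of 10^-3 arcsec, far below the pixel scale, so a clipped object is effectively a point source once convolved with the PSF.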
Balrog can inject multiple object types in the same run by setting the gal field in the configuration as a List type; this is identical to GalSim configuration behavior. The relative fraction of each injection type is then set in the pos_sampling field described below.

A.3. Updating Truth Properties and Optional Transformations
Most Balrog runs sample objects from an existing catalog. Some of the object properties are modified to fit the needs of the simulation such as the positions, orientations, and fluxes. Updates to positions and orientations are automatically applied to the output truth catalogs while flux corrections due to local extinction and zero-point offsets are not, though we save the applied extinction factor. Different behaviors for these quantities as well as any additional changes can be defined when creating the relevant BalObject subclass.
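The two flux corrections mentioned above follow standard magnitude arithmetic, sketched below. The function names are hypothetical, the reference zero point of 30 is an assumption made for illustration, and this is not the Balrog code itself.

```python
def extinction_factor(a_band):
    """Multiplicative flux dimming for a_band magnitudes of extinction."""
    return 10.0 ** (-0.4 * a_band)

def zeropoint_factor(zp_image, zp_ref=30.0):
    """Rescale a flux defined at zp_ref into the units of an image with zp_image."""
    return 10.0 ** (0.4 * (zp_image - zp_ref))

def injected_flux(true_flux, a_band, zp_image, zp_ref=30.0):
    """Flux to render after applying extinction and the zero-point offset."""
    return true_flux * extinction_factor(a_band) * zeropoint_factor(zp_image, zp_ref)

# Example: 0.1 mag of extinction, image zero point 0.5 mag above the reference.
flux = injected_flux(1000.0, a_band=0.1, zp_image=30.5)
print(f"extinction factor: {extinction_factor(0.1):.4f}")
print(f"rendered flux: {flux:.1f}")
```

Saving the extinction factor alongside the unmodified truth flux, as described in the text, lets the dereddened input be compared with either the raw or the extinction-corrected measurement.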
Position sampling is determined by the configuration parameter pos_sampling and can be set to Uniform for spherical random sampling or one of the following grid choices that are regularly spaced in image space: RectGrid for a
77 https://github.com/sweverett/Balrog-GalSim

Appendix B Angular Clustering Systematics
Section 4.1.4 introduced a method for translating the differences between the Balrog and Y3 GOLD catalogs into a predicted systematic error in the angular clustering of galaxies. We first choose a sample selection that is applied to both catalogs. We then measure the dependence of galaxy count fluctuations in both selected Balrog and Y3 GOLD samples on several measured image quality indicators, as in Figure 11. Finally, for each data quality indicator, we interpolate the density fluctuation trends to the full survey area and estimate the angular clustering that these trends imply. As small systematic variations in the survey depth enter, to leading order, as additive power in the measured clustering signal, a comparison of the power we measure in these interpolated maps offers a direct estimate of the importance of any deviation between our injection catalogs and the real data.
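The statement that depth variations add power can be made explicit with a short derivation. The symbols below are introduced here for illustration and are not the paper's notation; the sketch assumes that a survey property modulates the observed counts multiplicatively through a small fluctuation $\delta_{\rm sys}$ that is uncorrelated with the true galaxy overdensity $\delta_g$.

```latex
\begin{align}
1 + \delta_{\rm obs}(\hat{n}) &= \left[1 + \delta_g(\hat{n})\right]\left[1 + \delta_{\rm sys}(\hat{n})\right], \\
\delta_{\rm obs} &\simeq \delta_g + \delta_{\rm sys}
  \qquad \text{(dropping the second-order term } \delta_g\,\delta_{\rm sys}\text{)}, \\
C_\ell^{\rm obs} &\simeq C_\ell^{gg} + C_\ell^{\rm sys}
  \qquad \text{if } \langle \delta_g\,\delta_{\rm sys} \rangle = 0 .
\end{align}
```

Under these assumptions, the quantity compared between catalogs is the difference of the two systematic power spectra relative to a fiducial signal, $\left(C_\ell^{\rm sys,\,GOLD} - C_\ell^{\rm sys,\,Balrog}\right)/C_\ell^{\rm fid}$, which corresponds to the fractional difference in power described for the right panels of the figures in this appendix.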
Here we show the same maps as Figure 12 for six measured survey properties in all bands, for the 17.5 < i < 21.5 sample selection meant to emulate the Y3 MAGLIM sample. The investigated survey properties correspond to the mean airmass (Figure B1), mean exposure time (Figure B2), mean PSF FWHM (Figure B3), mean error on the gray zero-point correction (SIGMA_MAG_ZERO, Figure B4), mean sky brightness (Figure B5), and variance of the sky background (Figure B6) of all exposures within a given HEALPix pixel. With the exception of a negligible spike in power in a few of the SIGMA_MAG_ZERO maps, the measured systematic errors are less than 1% of the fiducial galaxy clustering signal (calculated as described in Figure 11) on scales below approximately 1° (ℓ > 180).
Figure B1. Power spectra of the mean airmass and associated interpolated Balrog and Y3 GOLD galaxy count variations, as in Figure 12. The left panels show the angular power spectrum of the noted survey property (in green) and the corresponding power spectra of the number densities of the Balrog (in blue) and Y3 GOLD (in gold) MAGLIM-like galaxies across the Y3 footprint using the interpolated trends described in Sections 4.1.3 and 4.1.4. The reference CAMB nonlinear matter power spectrum in black is at z = 0.7 with a linear galaxy bias parameter of 1. The right panels show the difference in power between Y3 GOLD and Balrog as a fraction of the fiducial cosmological power spectrum shown on the left.
Figure B2. Same as Figure B1, but for the mean exposure time.
Figure B3. Same as Figure B1, but for the mean PSF FWHM.
Figure B4. Same as Figure B1, but for the mean error on the gray zero-point correction.
Figure B5. Same as Figure B1, but for the mean sky brightness.
Figure B6. Same as Figure B1, but for the variance from the sky background.
Note. The quoted magnitudes correspond to the left bin edge. Simple Gaussian statistics do not fully capture the complexity of the responses; see Figure 16.

Table C3
The Mean (〈Δ〉), Median (D), and Standard Deviation (σ) of the Balrog g − r, r − i, and i − z Color Responses Binned in Injection Color for the δ-stars Sample
Note. The quoted colors correspond to the left bin edge. Simple Gaussian statistics do not fully capture the complexity of the responses; see Figure 14.

Table C4
The Mean (〈Δ〉), Median (D), and Standard Deviation (σ) of the Balrog g − r, r − i, and i − z Color Responses Binned in Injection Color for the y3-deep Sample
Note. The quoted colors correspond to the left bin edge. Simple Gaussian statistics do not fully capture the complexity of the responses; see Figure 18.
Note. The second through fifth columns correspond to the true-positive (TP), false-positive (FP), false-negative (FN), and true-negative (TN) rates of Balrog stars, respectively. The very pure δ-stars sample is used to compute the TP and FN rates, while the noisier classifications of the DF y3-deep injections are used for the rest. The quoted magnitudes correspond to the left bin edge. See Figure 23.
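For reference, the four classification rates named in the last note can be computed as in the sketch below. It assumes the conventional per-class normalization (TP and FN rates divided by the number of true stars, FP and TN rates by the number of true galaxies), which is consistent with the note's use of separate star and galaxy samples but is not confirmed by it. The function name and toy inputs are hypothetical.

```python
def classification_rates(is_star_truth, is_star_measured):
    """Star-classification rates from paired truth and measured labels.

    Returns a dict of rates; a rate is None when its class has no members.
    """
    tp = fp = fn = tn = 0
    for truth, measured in zip(is_star_truth, is_star_measured):
        if truth and measured:
            tp += 1   # true star classified as a star
        elif truth and not measured:
            fn += 1   # true star classified as a galaxy
        elif measured:
            fp += 1   # true galaxy classified as a star
        else:
            tn += 1   # true galaxy classified as a galaxy
    n_star, n_gal = tp + fn, fp + tn
    return {
        "TP": tp / n_star if n_star else None,
        "FN": fn / n_star if n_star else None,
        "FP": fp / n_gal if n_gal else None,
        "TN": tn / n_gal if n_gal else None,
    }

# Toy labels only: four injected stars followed by five injected galaxies.
truth = [True, True, True, True, False, False, False, False, False]
measured = [True, True, True, False, False, False, False, False, True]
rates = classification_rates(truth, measured)
print(rates)
```

In a binned table such as the one described, the same calculation would be repeated for the objects falling in each magnitude bin.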