Quaia, the Gaia-unWISE Quasar Catalog: An All-sky Spectroscopic Quasar Sample

We present a new, all-sky quasar catalog, Quaia, that samples the largest comoving volume of any existing spectroscopic quasar sample. The catalog draws on the 6,649,162 quasar candidates identified by the Gaia mission that have redshift estimates from the space observatory’s low-resolution blue photometer/red photometer spectra. This initial sample is highly homogeneous and complete, but has low purity, and 18% of even the bright (G < 20.0) confirmed quasars have discrepant redshift estimates (∣Δz/(1 + z)∣ > 0.2) compared to those from the Sloan Digital Sky Survey (SDSS). In this work, we combine the Gaia candidates with unWISE infrared data (based on the Wide-field Infrared Survey Explorer survey) to construct a catalog useful for cosmological and astrophysical quasar studies. We apply cuts based on proper motions and colors, reducing the number of contaminants by approximately four times. We improve the redshifts by training a k-Nearest Neighbor model on SDSS redshifts, and achieve estimates on the G < 20.0 sample with only 6% (10%) catastrophic errors with ∣Δz/(1 + z)∣ > 0.2 (0.1), a reduction of approximately three times (approximately two times) compared to the Gaia redshifts. The final catalog has 1,295,502 quasars with G < 20.5, and 755,850 candidates in an even cleaner G < 20.0 sample, with accompanying rigorous selection function models. We compare Quaia to existing quasar catalogs, showing that its large effective volume makes it a highly competitive sample for cosmological large-scale structure analyses. The catalog is publicly available at 10.5281/zenodo.10403370.

1. INTRODUCTION Quasars are powerful tools for many fields of astrophysics.They are key probes of accretion physics (e.g.Sunyaev & Zeldovich 1970;Yu et al. 2020), which informs the evolution of active galactic nuclei (AGNs).The evolution of quasars and their host galaxies are intertwined, giving insight into supermassive black hole growth (e.g.Hopkins et al. 2006) as well as massive galaxy formation (e.g.Kormendy & Ho 2013).Studies of the quasar distribution can also be used to understand black hole evolution (e.g.Powell et al. 2020) and halo masses and environmental effects (e.g.DiPompeo et al. 2017).Quasars can also be utilized as background sources for cosmic phenomena such as gravitational lenses (e.g.Claeskens & Surdej 2002), and quasar spectra encode the properties of the intergalactic medium via the Lyα forest (e.g.Rauch 1998).
Corresponding author: Kate Storey-Fisher k.sf@nyu.eduQuasars are key tracers for large-scale structure cosmology.They reside in peaks of the dark matter distribution and their clustering can be used to measure cosmological parameters, including the growth rate of structure f σ 8 (e.g.García-García et al. 2021;Alonso et al. 2023), the Hubble distance D H (e.g.Hou et al. 2020), primordial non-Gaussianity (e.g.Leistedt et al. 2014;Castorina et al. 2019;Krolewski et al. 2023), and the baryon density Ω b (e.g.Yahata et al. 2005).Crosscorrelations between quasars and other tracers provide measurements of key cosmological quantities, such as with photometric galaxy samples to measure the baryon acoustic feature (e.g.Ata et al. 2018), with cosmic microwave background (CMB) lensing to constrain quasar bias and the growth of structure (e.g.Sherwin et al. 2012), and with foreground galaxies as a probe of weak lensing (e.g.Ménard & Bartelmann 2002;Scranton et al. 2005;Zarrouk et al. 2021).They can also be used as standardizable candles to measure the expansion rate of the universe (e.g.Setti & Woltjer 1973;Risaliti & Lusso 2015;Lusso et al. 2020).Finally, given the large volume typically covered by quasar samples, the quasar distri-bution provides a test of the cosmological principle of isotropy and homogeneity (e.g.Secrest et al. 2021;Dam et al. 2022;Hogg et al. 2024).
Many surveys have observed and cataloged quasars, with around 1 million spectroscopically identified and several million when including photometric samples.The Sloan Digital Sky Survey (SDSS) Data Release 16 includes a highly complete catalog of 750,414 quasars with spectroscopic redshifts (Lyke et al. 2020).Photometric surveys observe a much larger number of quasars, at the expense of low redshift accuracy; nearly 3 million quasars with reliable photometric redshifts have been cataloged (Kunsági-Máté et al. 2022), including with the Wide-field Infrared Survey Explorer (WISE; Wright et al. 2010), which imaged the entire sky and Pan-STARRS, (Chambers et al. 2019) which observed three-quarters of the sky.Shu et al. (2019) combined photometry from Gaia DR2 and unWISE (Lang 2014) to identify 2.7 million AGN candidates and estimate their photometric redshifts.Upcoming surveys will observe even more quasars: the Dark Energy Spectroscopic Instrument (DESI; Aghamousa et al. 2016) expects to obtain spectra for 3 million quasars, and the Rubin Observatory's LSST will photometrically observe upward of 10 million quasars (Ivezić 2016).However, none of these quasar catalogs is both all-sky and contains precise redshift information.The recently released Gaia DR3 quasar candidates (Gaia Collaboration et al. 2023a) constitute a new sample that promises to fill this gap.
The Gaia quasar sample presents a new opportunity to explore these science topics.While the Gaia satellite was designed to map stars in the Milky Way (Gaia Collaboration et al. 2016), it broadly observes bright objects in the sky, which includes many extragalactic sources.Previous work identified a small number of quasars in earlier Gaia data releases, including identification based solely on their astrometric properties (Heintz et al. 2018(Heintz et al. , 2020)).In DR3, the Gaia collaboration released a sample of 6,649,162 quasar candidates that were incidentally observed during the survey (Delchambre et al. 2023;Gaia Collaboration et al. 2023a,b).The sources cover the entire sky and have Gaia blue photometer (BP)/red photometer (RP) spectra, low-resolution spectra covering the wavelength range of 330-1050 nm.These spectra allow for redshift estimates of the sources, with 86% having a precision of |∆z/(1 + z)| < 0.01 compared to SDSS redshifts when no processing issues affect the redshift estimation (flags qsoc = 0 or flags qsoc = 16), which is the case for 20% of the sample; for the full sample including sources with redshift warning flags set, this percentage of high-precision redshifts decreases to 53%.While not as precise as high-resolution spectroscopic redshifts, they are significantly better than photometric redshifts.The median redshift of the sample is z = 1.67.The Gaia quasar candidate sample was constructed for completeness over purity, and has an estimated purity of 52%; the Gaia Collaboration also suggests criteria for a higher purity (∼95%) subcatalog of ∼1.9 million quasars.Overall, the sample presents an unprecedented resource for quasar science and cosmology.
There are two main issues with this raw Gaia sample.First, the sample contains a large number of non-quasar contaminants.Second, a significant fraction of the redshift estimates are catastrophic errors, due to emission line misidentification given the limitations of the lowresolution spectra.Understanding and eliminating sample contaminants matters greatly in identifying the most extreme (e.g.brightest or most luminous) quasars, which has been addressed in the AllBRICQS catalog (Onken et al. 2023) that draws on Gaia quasar candidates.In this work, we construct a clean quasar catalog across the full magnitude range with lower contamination and improved redshift estimates, with the particular goal of building a catalog appropriate for large-scale structure analyses as well as other quasar science.For both of these, we rely on crossmatches with WISE observations of the quasars (Wright et al. 2010), which adds key infrared (IR) information.To filter out contaminants, we apply color cuts based on the Gaia and WISE photometry, as well as a proper motion cut.To improve the redshifts, we identify quasars that are also observed by SDSS, for which we have highly precise spectroscopic redshifts, and train a k-Nearest-Neighbors (kNN) model based on their photometry and Gaia redshift estimates.Further, the Gaia quasar candidate sample has strong systematic imprints from various observational effects, such as Galactic dust.To model these systematics so that their effects can be mitigated in analyses of the catalog, we fit a model for the selection function based on observational templates using a Gaussian process.We release both the catalog and selection function as publicly accessible data products.
This paper is organized as follows.In §2, we describe the initial data sets used in the construction of the catalog.The construction of the catalog is detailed in §3.In §4, we present the final catalog and perform verification and comparisons to other samples, and outline the data format.We summarize the catalog and describe the access to the data in §5.

INITIAL DATA SETS 2.1. Gaia DR3 quasar candidate sample
While performing its all-sky survey of the Milky Way, the Gaia satellite (Gaia Collaboration et al. 2016) also observed millions of extragalactic objects.These sources-both quasar and galaxy candidateswere first released in Gaia DR3 (Gaia Collaboration et al. 2023a,b).Gaia obtained BP/RP spectra of the sources, which are low-resolution spectra with relatively narrow wavelength ranges; BP covers 330-680 nm and has 30 ≤ R ≤ 100 and RP covers 640-1050 nm (Carrasco et al. 2021) with 70 ≤ R ≤ 100.The raw spectra are not released by Gaia (besides a small subsamplethe rest will be released in Gaia DR4), but redshift es-  timates and other derived information are contained in the catalogs.
The quasar candidates were selected based on multiple classifiers and criteria, described in detail in Gaia Collaboration et al. (2023a).The majority (5.5 million) of the quasar candidates were identified with the Discrete Source Classifier (DSC) module (detailed in Delchambre et al. (2023), a machine-learning model that takes as input the source's BP/RP spectrum, G-band magnitude, G-band variability, parallax, and proper motion, and outputs a class label trained on SDSS spectroscopic classifications.Given these SDSS labels, the results of this module will inherit many of the same selection effects as SDSS.DSC is estimated to have a completeness of over 90% and a purity of around g24% for quasars.Another machine learning model selected over 1 million sources based on their variability, as active nuclei have time-variable accretion; the model inputs were statistics of time series data in all Gaia bands as well as photometric and astrometric quantities, as detailed in Rimoldini et al. (2023).Additionally, a set of nearly 1 million sources was selected based on their surface brightness profile; this selection used existing major quasar catalogs to compile an initial list of sources, which were then processed by the Gaia surface brightness profile module (Ducourant et al. 2023).This module included quasars in the candidates catalog which passed certain criteria, including having Gaia observations covering > 86% of the source's surface area and a confident assessment (positive or negative) of host galaxy presence.Finally, the 1.6 million sources used to define the Gaia-CRF3 celestial reference frame were contributed, which are based on crossmatches of Gaia to external quasar catalogs.A large fraction of sources are identified as quasars by multiple of these methods; the overlapping contributions are shown in Figure 3 of Gaia Collaboration et al. (2023a).The full quasar candidate sample contains 6,649,162 sources1 , selected for high completeness, but with a low purity estimated to be around 52% (Gaia Collaboration et al. 2023a).We show the overlaps between this Gaia quasar candidate sample and other samples and subsamples used and constructed in this work in Figure 1.
Most of the quasar candidates (6,375,063) are assigned redshifts using the Quasar Classifier (QSOC) module, which uses a chi-squared approach on the quasars' BP/RP spectra compared to composite spectra from SDSS DR12Q (Delchambre et al. 2023).We refer to these Gaia redshift estimates as z Gaia .Many of these redshifts are determined by a single line due to the narrow spectral range, resulting in aliasing issues when lines are misidentified (see Figure 15 in Delchambre et al. 2023).An estimated 63.7% of the redshifts have |∆z| < 0.1, increasing to 97.6% for quasar candidates with no redshift warning flags (this is the case for nearly 80% of quasars with G < 18.5, but decreases to less than 20% for G > 19.5).
Gaia Collaboration et al. (2023a) provide a query to select a purer subsample of the quasar candidates.It requires higher quasar probability thresholds from the various classifiers and excludes surface-brightnessselected galaxies that have close neighbors.This results in 1,942,825 sources with an estimated purity of 95%; 1.7 million of these have Gaia redshifts.The sky distribution of this sample, which we call the Gaia DR3 'Purer' sample, is shown in Figure 2. The Gaia DR3 'Purer' sample has a low density in the Galactic plane; we speculate that this is largely due to dust extinction making sources too faint to observe at low Galactic latitudes.Gaia DR3 'Purer' also has significant overdensities around the LMC and SMC, as the sample still contains stellar contaminants.
For our analysis, we start with the full quasar candidate sample, rather than the Gaia DR3 'Purer' sample or cutting on other Gaia pipeline flags, to allow greater completeness and minimize reproducing biases; we compare our catalog with the Gaia Collaboration et al. (2023a) Gaia DR3 'Purer' subsample in §4.3.We construct a superset of our catalog (which is a subset of the Gaia quasar candidates sample) that contains all the information needed for catalog construction: we require that sources are in the Gaia quasar candidates table, have Gaia G, BP , and RP measurements, unWISE W 1 and W 2 observations (described in §2.2), Gaia-estimated QSOC redshifts, and a maximum G magnitude of G < 20.6.This magnitude cut was chosen to be slightly deeper than our desired catalog magnitude limit of G < 20.5, in order to provide a buffer for redshift estimation.This results in a superset with 1,518,782 sources.We call our final catalog Quaia, so we refer to this as the Quaia superset.

unWISE Quasar Sample
We use the unWISE reprocessing (Lang 2014;Meisner et al. 2019) of WISE (Wright et al. 2010) to contribute IR photometry to Gaia sources.The unWISE coadds combine data from NEOWISE (Mainzer et al. 2011) with the original WISE survey, providing a time baseline 15 times longer.Compared to the original AllWISE catalog, unWISE has deeper imaging and improved modeling of crowded fields.The unWISE catalog (Schlafly et al. 2019) contains measurements in the W 1 (3.4 µm) and W 2 (4.6 µm) bands for over 2 billion sources.We do not use the W 3 and W 4 bands as these do not go as deep as we need.We perform a crossmatch of the Gaia quasar candidate sample to unWISE sources within 1 ′′2 .We also crossmatch the SDSS training and validation samples ( §2.3, §2.4) to unWISE.
When combined with optical photometry, unWISE IR color information is very useful to identify quasars and distinguish them from contaminants.This photometry also contains useful redshift information; recent approaches to estimate redshifts from photometry with neural networks achieve a mean |∆z| ∼ 0.22 (Yang et al. 2017;Jin et al. 2019;Kunsági-Máté et al. 2022).In our case of redshift estimates from narrow-range BP/RP spectra, we expect IR photometry to add information that can break line identification degeneracies in order to improve estimates.We incorporate the W 1 and W 2 bands into both our quasar selection ( §3.1) and redshift estimation ( §3.2) procedures.

SDSS DR16 quasar sample
The Sloan Digital Sky Survey released the largest spectroscopic quasar catalog in DR163 (Lyke et al. 2020).It combines new sources from the extended Baryon Oscillation Spectroscopic Survey (eBOSS), part of SDSS-IV, with previously observed sources from earlier SDSS campaigns.The catalog contains 750,414 quasars, with an estimated 99.8% completeness (compared to the SDSS-III/SEQUELS sample of Myers et al. 2015, which has higher signal-to-noise spectra) and 98.7-99.7%purity.We remove sources with redshift warnings, ZWARNING!=0, as well as a handful of sources with unreasonably low or negative redshift estimates (z < 0.01).This results in 638,083 sources, which is the sample shown in Figure 1.We crossmatch these with the Gaia catalog, as well as unWISE ( §2.2), using a maximum separation of 1 ′′ on the sky.We remove sources with fewer than five observations in BP (phot bp n obs) or RP (phot rp n obs), following Bailer-Jones (2021), as well as sources that are duplicated in the SDSS star or galaxy samples ( §2.4).This results in 343,074 sources with both Gaia and unWISE observations that pass these criteria.
We use these to calibrate the cuts to decontaminate our sample ( §3.1); for this purpose, we only keep sources that are also in the Quaia superset (sources that are in the Gaia quasar candidates table, have all necessary Gaia and unWISE photometry, Gaia-estimated QSOC redshifts, and G < 20.6).This sample contains 246,122 quasars.We also use this sample (after applying the cuts described in §3.1) to train our redshift estimation model ( §3.2).While this spectroscopic sample has quite high completeness and accurate redshift information, we note that it is still imperfect, contains selection effects, and represents only a particular definition of a quasar; these issues will propagate to our catalog.

Contaminant samples: galaxies and stars
To guide the decontamination of our catalog ( §3.1), we compile known contaminant samples, namely galaxies and stars.For the galaxy sample, we use SDSS spectroscopic galaxies from DR18 4 .Following Bailer-Jones (2021), we include all galaxies with class label GALAXY in the SpecObj table, exclude galaxies with subclass labels AGN or AGN BROADLINE, and exclude sources with redshift warnings, zWarning=0.We crossmatch these with Gaia DR3 and unWISE with a 1 ′′ radius, and remove sources with fewer than five observations in BP or RP , as for the SDSS quasars.We also remove apparent stellar contaminants from the galaxies sample with the cut in G-RP and BP -G from equation (1) of Bailer-Jones et al. (2019), and additionally remove sources duplicated in the SDSS quasar or star samples.This leaves 600,897 crossmatched SDSS galaxies in our sample; 1,316 of these are in the Quaia superset.
For the star sample, we also use SDSS DR18 sources, selecting objects with class label STAR in the SpecObj table.As for the quasars and galaxies, we crossmatch these with Gaia DR3 with a 1 ′′ radius and remove sources with fewer than 5 observations in BP or RP, and remove sources duplicated in the other samples.This results in a stellar sample with 482,080 crossmatched SDSS-Gaia stars, with 2,276 of these in the superset.
For the decontamination procedure, we also compile a sample of sources in or near the LMC or SMC, as most of these will be stellar contaminants but have different properties than the SDSS star sample.To do this, we select all sources in the Gaia quasar candidates table that are within 3°of the center of the LMC or 1.5°from the center of the SMC.While this may include stars not actually in the LMC or SMC, we have chosen these fairly narrow radii in order to capture mostly LMC and SMC stars and few potential quasars.Additionally requiring that these have unWISE photometry, this gives 11,770 LMC-and SMC-adjacent stars; 9,927 are in the superset.

CATALOG CONSTRUCTION 3.1. Decontamination with proper motions and unWISE colors
The full Gaia quasar candidate sample is known to contain a significant fraction of contaminants (stars and other non-quasars, such as galaxies).The stellar contaminants might include sources such as brown dwarfs, which have similar colors as high-redshift quasars, and potentially blue horizontal branch stars, blue stragglers, and white dwarfs, which are UV bright like lowerredshift quasars.To remove stellar contaminants, we make an initial cut on proper motion µ, as quasars should have negligible proper motions due to their large distances.The value of µ has a dependence on G, so we make a cut in this space.To guide this cut, we use labeled sources: SDSS quasars, SDSS galaxies, SDSS stars, and Gaia LMC-and SMC-adjacent stars, as described in §2.3 and §2.4.The G-µ distributions of these sources are shown in the top panel of Figure 3.In the middle panel, we show the intersection of these labeled sources with our Quaia superset, which consists of sources in the Gaia quasar candidates table that have Gaia redshift estimates, complete Gaia and unWISE photometry, and are below G < 20.6.We see that the SDSS quasars tend to have much smaller proper motions than the other types of sources, with a very linear edge to the G dependence at the high proper motion side of the distribution.Based on this, we choose the cut µ < 10 0.4 (G−18.25)mas/yr . (1) At G = 18.25, this corresponds to µ ≲ 2.5 mas yr −1 , and allows for less severe cuts at deeper magnitudes given the typically less precise astrometry.This is related to the proper motion uncertainty as a function of G, which has been quantified by Gaia (Gaia Collaboration et al. 2021).We show this cut overlaid on the Quaia superset in the lower panel of Figure 3; based on the labeled data, we can clearly pick out the populations.The proper motion cut excludes 39470 sources, 2.6% of the superset.
Next, we determine the color cuts based on Gaia and unWISE photometry.Generally, stars and galaxies are dim in redder, IR wavelengths compared to AGN.For instance, the eBOSS quasar target selection (Myers et al. 2015) involved linear cuts in the optical-IR, involving the SDSS g, r, and i bands and the WISE W 1 and W 2 bands.
In Figure 4, we show color-color distributions for the same samples as in Figure 3.The left panel shows W 1-W 2 vs. G-W 1 color, and the right column shows G-RP vs. BP -G color.The top row, with the full labeled samples, shows that different types of sources tend to be localized to different areas of this parameter space (we show only a subset of each type for clarity).In particular, the colors involving unWISE (left panel) separate out the source types relatively clearly, demonstrating the importance of the unWISE crossmatch: SDSS quasars have very red W 1-W 2, and intermediate G-W 1 color, while galaxies have bluer W 1-W 2 and redder G-W 1 compared to quasars, and stars (both SDSS stars and stars near the LMC and SMC) are bluer in both colors.In Gaia color-color space, galaxies tend to have bluer BP -G and redder G-RP colors than the other types of sources.In the middle row of Figure 4, showing the intersection of the labeled sources with the Quaia superset, we see that the superset restrictions have eliminated many of the sources, especially SDSS galaxies and stars, though a significant number remain.(We note that it is possible that some of these SDSS galaxies do host AGN though they were not classified as such by SDSS.)The Quaia superset is shown in the bottom panel; we can see clear populations of quasars, stars, and galaxies lining up with the labeled sources.Importantly, we can see the effect of the stricter SDSS color selection in the red (high G-W 1) region of parameter space into which the Gaia quasar candidates extend, but are not represented in the SDSS sample in the above panels.
We choose to apply linear cuts in these colors to decontaminate the sample.While other works (e.g.Hughes et al. 2022) train classifiers to determine which objects are true quasars using SDSS-classified quasars as labels, we opt for simpler cuts for ease of reproducibility and to mitigate the propagation of SDSS selection effects, which may include color-and magnitude-dependent effects.We choose four cuts based on the distribution of sources in color-color space.The first is in W 1-W 2, which has been shown to be useful for distinguishing quasars; for instance, Nikutta et al. (2014) demonstrated that a small crossmatched SDSS quasar sample has very red W 1-W 2 = 1.2 ± 0.16, while other types of objects-namely star-forming and AGN galaxies, luminous red galaxies and stars-have bluer W 1-W 2. Stars tend to have the bluest W 1-W 2, with a mean of W 1-W 2 = −0.04 ± 0.03, so a cut in W 1-W 2 is a reliable way to filter out stellar contaminants.We add a cut in G-W 1 to filter out the bulk of the stars (including the LMC and SMC), and another in BP − G to cut out the galaxy contaminants.Finally, we find that these single color cuts were not sufficient to remove all of the LMC and SMC, so we add an additional diagonal cut in W 1-W 2 and G-W 1, choosing a reasonable slope.
We optimize the values (intercepts) of these four cuts with a grid search, trying values spaced out by 0.1 mag.We note that while we show the full samples in Figure 4, in practice we make the proper motion cut before optimizing the color cuts.We choose the color cuts that maximize our objective function L, where N q is the number of true quasars that make it into the catalog, N s SDSS stars, N g SDSS galaxies, and N m LMC and SMC stars, and the λ parameters balance the relative ratios of each.We choose λ s = 3, λ m = 5, and λ g = 1.
The optimal cuts for the objects to keep in the catalog are These are shown as the black lines in all panels of Figure 4, with the gray shading indicating exclusion regions.These cuts, as well as the proper motion cuts described above, exclude ∼7% of the superset, resulting in 1,414,385 quasars in our decontaminated sample.We apply an additional magnitude cut of G < 20.5 to reduce edge effects in our redshift estimation; this constitutes our deep sample, with 1,295,502 sources.We refer to this as Quaia in the rest of this work.However, the catalog becomes less clean and reliable as we push to deeper magnitudes-due to less precise measurements and stronger systematics, notably the Gaia scanning pattern-so we produce a version of the catalog with G < 20.0 to ensure a cleaner sample.This brighter catalog has 755,850 sources, and we report most of our results on this sample throughout the rest of this work.BP -G color.The black lines show the cuts we make; the shaded gray region is excluded from the catalog.The rows have the same samples as in Figure 3, except that in the top row, only 20,000 of each type of SDSS source is shown for clarity.In both color-color projections, the labeled sources are mostly localized in particular regions of parameter space, and we can see these populations somewhat clearly in the Quaia superset.

Spectrophotometric redshifts with unWISE and SDSS
We use unWISE and SDSS data to improve the redshift estimation of the sources.Figure 5a shows the redshifts estimated by the Gaia QSOC pipeline z Gaia compared to the SDSS redshifts z SDSS for a test sample of sources from Quaia with G < 20.5; note that the 2D histogram is plotted in log-space to show the outliers more clearly.We find that of the Gaia redshifts z Gaia , 82% (81%) agree to |∆z/(1 + z)| < 0.2 (0.1).A significant fraction of z Gaia are highly precise: 75% agree with SDSS to |∆z/(1 + z)| < 0.01.We also clearly see bands of incorrect estimation due to line aliasing issues.Additionally, in the crossmatched sample, nearly all of the very high z Gaia estimates (z > 4.5) are shown to be incorrect in comparison to SDSS.We note that the redshift estimation is much more accurate for sources that have no redshift warning flags set (flags qsoc=0), as discussed in §2.1, but this is only true for 21% of the sources in Quaia (G < 20.5), and even including sources with flags qsoc=16 this leaves only 39% of sources.
We train a kNN model on Quaia sources to estimate improved redshifts.(We also tested other models including XGBoost and a multilayer perceptron, and found that the kNN outperformed both by a small margin.)We include all sources in our decontaminated catalog ( §3.1) which goes out to G < 20.6, in order to have a buffer beyond our desired G < 20.5 sample to reduce edge effects from the training set.(We find that including the rest of the photometry does not make a difference in the results.)The reddening is determined with the Corrected Schlegel, Finkbeiner, & Davis (SFD) dust map introduced by Chiang (2023), which corrects the standard Schlegel et al. (1998) dust map by subtracting off the contribution from the cosmic infrared background (CIB).(We also include the appropriate correction factor given by Schlafly & Finkbeiner (2011).)5 .The labels are the SDSS redshifts, z SDSS .
We use as our labeled data sources from the crossmatched SDSS DR16Q sample ( §2.3) that are also in our decontaminated catalog Quaia, so that we train on sources drawn from the same distribution to which we will apply the model; this is 243,206 sources.We apply a 70%/15%/15% train/validation/test split.We build a k-d tree on the training set features using the KDTree implementation of sklearn.At the prediction stage, we access the K nearest neighbors of each input feature vector, first excluding neighbors with zero distance in feature space (i.e.neighbors that are in the training set).We assign the predicted label to be the median z SDSS of the K nearest neighbors, and the uncertainty to be the symmetrized inner 68% error of those neigh-bors.We use the validation set to tune K, and choose the value that maximizes the fraction of predicted redshifts with |∆z/(1 + z)| < 0.1, which is K = 27; we note that this value only varies at the ∼1% level for values 15 < K < 50, and is similar for other choices of |∆z/(1 + z)|.Finally, we apply the model to the full Quaia and output kNN redshift estimates, z kNN , for each source.
The results are shown in Figure 6, which shows the cumulative distribution of errors |∆z/(1 + z)| for z kNN compared to that of z Gaia (with z SDSS as the truth) for the test set with G < 20.0.(The shapes are similar for G < 20.5, just shifted to somewhat lower accuracy.)We find that the z kNN estimates have far fewer outliers than z Gaia .However, the z Gaia estimates tend to be more precise, as they use the full spectral information, while the kNN is essentially smoothing over the likeliest neighboring sources in feature space.We thus choose to combine the properties of both of these redshift estimates to obtain our final spectrophotometric (SPZ) redshifts z Quaia in the following way.For sources for which z Quaia and z Gaia agree to |∆z/(1 + z)| < 0.05, we assign z Quaia = z Gaia to preserve the precision of the Gaia estimate.For sources for which z Quaia and z Gaia differ by |∆z/(1 + z)| > 0.1, we assign z Quaia = z kNN to preserve accuracy.In between these thresholds, we apply a smooth, linear transition to avoid hard features in our estimates.These z Quaia estimates are also shown in Figure 6 compared to the "true" (spectroscopic, taken as truth for our purposes) SDSS redshifts, and we can see that these achieve nearly as high precision as z Gaia while maintaining the high accuracy of z kNN .
Our z Quaia results for the test set are shown in Figure 5b compared to z SDSS , shown here for the full catalog depth G < 20.5.We find that 91% (84%) of our SPZ redshifts agree to |∆z/(1 + z)| < 0.2 (0.1), and 62% highly agree to |∆z/(1+z)| < 0.01.We also give the bias (mean redshift error) and scatter (σ 68 , the symmetrized inner 68% region of the redshift errors) of |∆z/(1 + z)| in the figure; our SPZ redshifts significantly decrease the bias and scatter.The SPZ estimation corrected all of the very high-z Gaia estimates, and some of the intermediateoutlying aliasing effects.We still have some catastrophic outliers due to line aliasing, but with our SPZ redshifts, we find a reduction in the number of |∆z/(1 + z)| > 0.2 (0.1) outliers by ∼3× (∼2×) compared to the Gaia redshift estimates.
We investigate the dependence of the redshift error on the G-band magnitude in Figure 7.The fraction of redshifts with an error above various thresholds is shown as a function of samples with the given cut on G.The errors are lowest at a bright magnitude cut of G <∼19.0; in this sample, sources with SPZ redshift estimates inaccurate to |∆z/(1+z)| > 0.2 (0.1) comprise only 3% (4%) of the sample, and to the more stringent requirement of |∆z/(1 + z)| > 0.01, 12%.This outlier fraction increases steadily as fainter sources are included.For G < 20.0,   will depend on the nature of the analysis and its sensitivity to outliers.We note that our finding that the unWISE IR information significantly improves redshift estimates, compared to only the optical information used in the Gaia QSOC estimates, is consistent with other photomet-ric redshift work.For instance, DiPompeo et al. (2015) showed that including WISE mid-IR photometry in the redshift estimation of SDSS-imaged quasars results in a significant improvement on the estimates, even more so than including both GALEX near-and far-UV data and UKIDSS near-IR data.More recently, Yang & Shen (2023) compiled a photometric quasar catalog from the Dark Energy Survey (DES) DR2, combining DES optical photometry with near-IR photometry as well as unWISE mid-IR photometry; they obtained photo-zs with 92% having |∆z/(1 + z)| < 0.1 when all IR bands are used compared to 72% with only optical data.Additional photometric information at other wavelengths could further improve our estimates (as well as catalog decontamination), but is not currently available for enough sources in our Quaia catalog to be worthwhile.For instance, for the UV all-sky survey GALEX (Martin et al. 2005), crossmatches to Quaia sources are only available for 32% of the Quaia objects for near-UV observations, and when including far-UV only 16 %; this significant discrepancy is largely due to the faint end of Quaia, where GALEX observations do not reach deep enough.The Pan-STARRS1 survey (Chambers et al. 2019) covers only three quarters of the sky, with crossmatches to 75% of Quaia sources.We tested adding Pan-STARRS1 data to the redshift estimation feature set and found only a small improvement, and thus chose to prioritize keeping the full sky span of Quaia, though we make note that incorporating Pan-STARRS1 may be useful for certain applications.

Selection Function Modeling
Observational and astrophysical effects impact which sources we observe and their properties; this is known as the selection function.As Gaia is a space-based mission, it avoids many of the observational issues of groundbased surveys, such as seeing and airmass.However, there are still significant selection effects: for our model, we consider dust, the source density of the parent surveys, and the scan patterns of the parent surveys.
We fit a selection function model to a particular version of the catalog, namely, a particular maximum G.For the fiducial selection function we work only in terms of sky position.We make a healpix map of the catalog with NSIDE = 64 and count the number of observed catalog sources in each healpix pixel.We choose this NSIDE, which results in 49,152 pixels each with an area of ∼0.84 deg 2 , to balance constructing a map with reasonably high resolution with ensuring a sufficient number of sources in the pixels for stable fits, as well as fitting within memory limitations for the Gaussian process fit.In the case of no selection effects (and under the assumption of isotropy), we would expect each pixel to contain roughly the same number of sources.Our goal is to model the dependence between the number of sources per pixel and the various systematics.
The systematics maps (templates) we use are shown in Figure 8.We use the dust map of Chiang (2023), and convert it to a healpix map of NSIDE = 64.To do this, we evaluate the reddening E(B − V ) at the centers of pixels of a high-resolution NSIDE = 2048 healpixelization of the sphere, and apply the 0.86 correction factor proposed by Schlafly & Finkbeiner (2011).We convert these to extinction values by multiplying by R V = 3.1, and then take the mean of all of these values within each healpixel target NSIDE = 64 map.This produces a smoothed dust extinction map on the desired scale.The result is shown in Figure 8a; the extinction is highest around the Galactic plane, with structure extending outward.
For the stellar distribution, we randomly select ∼10.6 million Gaia sources with 18.5 < G < 20, the magnitude range of most of our quasar sample.The vast majority of these will be true stars.(While this sample will contain some other types of objects, including possibly some quasars and other extragalactic sources, these will be orders of magnitude less numerous than stars.)We count the number of stars per NSIDE = 64 healpixel; this is shown in Figure 8b.We also include a template of the unWISE source distribution, for which we randomly selected ∼10.6 million unWISE sources (1% of the catalog) that have flux in both W 1 and W 2, and have primary status (Prim=1).We count the number of these sources per NSIDE = 64 healpixel, as shown in Figure 8c.
In initial fits we found that the regions of the LMC and SMC are particularly poorly modeled, and that the fit is improved by including separate templates of just the LMC and SMC source density for both the Gaia and unWISE sources; this gives the model the freedom to assign different coefficients to these regions than to the overall survey source density.(The need for different coefficients could be for a number of reasons, such as a difference in stellar density, contamination, or magnitude distribution; we leave a deeper investigation of this to future work and just use this empirical finding to improve our model.)For the LMC/SMC templates, we cut out a wide region around the LMC and SMC (9°in radius around the LMC and 5°around the SMC), and subtract the background, which we approximate using the region at the same latitude but opposite longitude (mirrored across the l = 0°line) of the given source distribution map.We don't show these maps here as they are visually similar to the stellar and unWISE source density maps in the LMC and SMC regions (though with the background subtracted).
For the Gaia completeness, we use the quantity M 10 introduced by Cantat-Gaudin et al. (2023)6 .M 10 is the median magnitude in a given sky patch of the Gaia sources with ≤ 10 transits across the Gaia field of view; it incorporates the effects of both the scanning law and source crowding.The actual completeness map derived by Cantat-Gaudin et al. ( 2023) depends on both M 10 and G-band magnitude; this completeness is very close to 1 for nearly all of the sky for G = 20.0, with some non-negligible incompleteness for G = 20.5.However, this completeness model is based on the full Gaia source catalog, while we expect the selection function of our quasar sample to be different.We thus use the M 10 map directly in our fit to capture the effects of the Gaia scanning law and source crowding specific to Quaia.We downsample the map to NSIDE = 64; this is shown in Figure 8d.
For the unWISE scanning law, using the ∼10.6 million unWISE sources described above, we take the mean number of single-exposure images in the coadd in W 1 for the sources in each NSIDE = 64 healpixel.This is shown in Figure 8e; we can see that the scan is in strips of constant ecliptic latitude, and that there is a significant increase in observations at the ecliptic poles.
To model the selection function we use a Gaussian process, a flexible machine-learning method for regression; for a detailed treatment, see Rasmussen & Williams (2005).(We first tried a linear model and found that it gave a very poor fit, because there are significant nonlinearities between the systematics and the catalog number density.)We first scale the data: for the labels (number of Quaia sources per pixel) we work in their logarithm, and only fit for the pixels with a nonzero number of sources.For the Gaia stellar distribution, the unWISE source distribution, the unWISE scan pattern, and LMC/SMC map templates, we also take the log of the number of quasars per pixel; for the LMC/SMC map, we first replace zeros with a very small value.For all of the input feature maps, we take the meansubtracted systematics values.We assume a Poisson error on the labels (and apply the appropriate log transformation).For the Gaussian process, we use the george software package (Ambikasaran et al. 2016).We use an exponential squared kernel k of the form where r is the distance between points in feature space.
We train the Gaussian process on all of the data, optimizing the parameter vector using the BFGS solver (Fletcher 1987); this includes fitting for the mean of the labels.We finally evaluate the predicted number of sources in each pixel.Where there were no Quaia sources in the label map, we fix the prediction to zero.
To convert this to a selection function in terms of the relative completeness, we first identify "clean" pixels in the map having low dust extinction (A V < 0.03 mag), low star counts (N stars < 15), low unWISE source counts (< 150), no stars or unWISE sources in the LMC or SMC, and high M 10 (M 10 > 21 mag) and unWISE coadds (> 150); this results in 479 pixels.We take the mean predicted number of quasars in these clean pixels, and add two times the standard deviation in these pixels to encompass the scatter.We then normalize the predicted source numbers by this value, which ensures that all final values end up being less than 1 for clarity.The result is a selection function map in terms of the relative probability of a source at a given location being included in the catalog.We emphasize that this is relative; we have not normalized it to an absolute probability so as not to make the selection function map extremely sensitive to the maximum value.We also note that this fit must be redone for each version of the catalog because it depends on the particular number density and distribution of sources.
There will be a dependence of the selection function on the G-band magnitude, as well as other quantities such as redshift.While we do not include these in our modeling or fiducial selection function map, we do release selection functions for a redshift split version of the catalog, using two redshift bins, which is important for certain cosmological analyses.The code to generate the selection function for any input catalog is also provided so that users can construct maps that meet their needs.

CATALOG: RESULTS AND VERIFICATION 4.1. Properties of the catalog
Quaia, the Gaia-unWISE Quasar Catalog, consists of 755,850 (1,295,502) quasar candidates with G < 20.0 (20.5).The sky distribution of Quaia for each of these magnitude limits is shown in Figure 9.The catalog covers the full sky, besides the Galactic plane, including the southern sky-most of which is not well covered by other surveys (discussed further in §4.3).The sky distribution is remarkably uniform, and the nonuniform imprints visually follow the selection effects that we incorporated into our selection function map, most notably the dust distribution (Figure 8a).Quaia also does not show an obvious overdensity around the LMC and SMC (as the Gaia DR3 'Purer' sample does) because we have removed these with our decontamination procedure.In fact, there is now a slight underdensity of sources near the LMC; this makes sense because some quasars in that sky region are obscured by dust and confusion in the LMC, though it is possible we have also somewhat overcorrected for this and removed some true quasars.
The dearth of quasars in the Galactic plane is due largely to dust extinction and stellar crowding, as well as the fact that the SDSS training set quasars (for both the original Gaia DR3 quasar candidates sample and our decontamination procedure) are not representative of quasars in this dust-reddened region.If we exclude the regions with very high extinction A V > 0.5 mag, the quasars nearly uniformly cover the remaining sky area, which comprises 30277.52 deg 2 (f sky = 0.73).Based on this area we can also compute the effective volume V eff covered by the quasars, which depends on the number density as a function of redshift and the power spectrum value P (k), integrated over the physical volume.We assume a P (k) of 4 × 10 4 (h −1 Mpc) 3 , based on the value for the eBOSS clustering catalog of quasars at around k ∼ 0.01 (Mueller et al. 2021).This gives an effective volume of 7.67 (h −1 Gpc) 3 (3.19(h −1 Gpc) 3 ) for the G < 20.5 (G < 20.0) sample.
We show a 3D map of the Quaia catalog in Figure 10, using our z Quaia redshift estimates converted to spatial coordinates with a fiducial Planck cosmology.We also show a 3D map of the full SDSS quasar sample for comparison; Quaia spans a much larger volume than SDSS.We note that for SDSS large-scale structure analyses, the eBOSS quasar clustering catalog is used, which contains fewer sources than the full SDSS catalog as it spans only the intermediate (UV-excess) redshift range and is Figure 10.Left: a projection of the 3D map of the full Quaia catalog (G < 20.5).Right: the same projection for the quasars in SDSS DR16Q, the largest spectroscopic quasar catalog (note that it is a superset of SDSS quasars from multiple campaigns and as such is not intended to be uniform).The color bar shows the redshifts of the quasars (zQuaia for Quaia, zSDSS for SDSS), which have been converted to distances with a fiducial cosmology.Quaia spans a significantly larger volume than the SDSS sample.A rotating animation of this image is available in the online journal, and at the link in arXiv comment field.
designed to be uniform across the sky (described in more detail in §4.3).
We show the redshift distribution of Quaia in Figure 11.The distribution of our Gaia-unWISE-SDSS spectrophotometric redshift estimates, z Quaia , for the full G < 20.5 catalog is shown in black.We compare this to other samples, cut to the same G limit where relevant: the Gaia redshifts z Gaia for the same sample; z Gaia for sources in the full Gaia quasar candidates sample with G < 20.5 (that have redshift estimates); z Gaia for sources in the Gaia DR3 'Purer' sample with G < 20.5 (that have redshift estimates); and z SDSS for the SDSS DR16Q sources that have Gaia crossmatches, with G < 20.5.We see that the Quaia SPZ redshifts have a smoother distribution than the others, with a clear peak around z = 1.5; the median value is 1.47.These SPZ estimates have also greatly reduced the high-z tail present in the Gaia redshifts.There are still a significant amount of intermediate-z objects; 10% (N = 132, 417) of the sources in the full G < 20.5 Quaia catalog have z > 2.5 (for the G < 20.0 catalog, this is also 10% (N = 77, 337) of sources).We note that the z Gaia redshift distribution for the Gaia DR3 'Purer' sample is very similar to those same redshift estimates for Quaia; this is partially because a very high fraction of the objects in Quaia are also in the larger Gaia DR3 'Purer' sample (see Figure 1).
We see a slight bump in the z Quaia distribution around z ∼ 2.3, the same location as the peak in the SDSS DR16Q quasar distribution.In the SDSS distribution this feature is most prominent in the SDSS-III campaign quasars (see Figure 6 of Lyke et al. 2020), which targeted higher-redshift sources.To check the robust-ness of our redshift estimation, we reconstruct the sample and retrain the redshifts using the eBOSS quasar clustering catalog (Ross et al. 2020).This is the sample used for large-scale structure clustering analyses (e.g.Mueller et al. 2021;Rezaie et al. 2021), which has a smooth redshift distribution peaked around z = 1.5.It does still have a slight step around z ∼ 2.3.We find that the z Quaia redshift distribution does not change significantly when trained on this sample, and that the feature at z ∼ 2.3 remains.We hypothesize that this feature is thus a real feature of Gaia-selected quasars, rather than an imprint from the training set, likely related to details of the optical color selection around that redshift.We also find that compared to the full SDSS-trained sample, the sample trained on the eBOSS quasar clustering catalog produces a redshift distribution that is less smooth at low redshifts, possibly because of the lower number of low-z eBOSS quasars; similarly, the high-z tail is shorter.For these reasons, we choose to use the full SDSS sample (as described in §2.3) for the spectroscopic quasar training sample for our fiducial Quaia catalog, but confirm that the redshift distribution (and the source selection) is broadly robust to this choice.
We show the G-band magnitude distribution of Quaia in Figure 12, in comparison to the other Gaia and SDSS quasar samples described above.We see that our catalog (as well as the Gaia DR3 'Purer' sample) has removed all of the sources with excessively bright (for quasars) magnitudes G < 12.5 that are present in the full Gaia sample, as well as many sources with 12.5 < G < 16.For the Gaia DR3 and SDSS samples, the number of quasars drops off sharply after G ∼ 20.75; to avoid the complicated selection effects at these depths, we limit  our catalog to G < 20.5 as shown.We also note that the SDSS DR16 quasars do not extend as bright as Quaia, and this extrapolation past the training set could bias the results in this regime, though in practice this affects very few sources.
We note that some of the Quaia sources may technically be considered lower-luminosity AGNs, or Seyfertlike galaxies, rather than quasars.We estimate the fraction of these sources using the criterion of Schneider et al. (2010): sources are considered true quasars if they have SDSS i-band luminosity M i brighter than M i = −22.0.To estimate the i-band magnitude for our Gaia sources, we compute the median G − i color for the subset of Quaia sources with SDSS crossmatches, where G is the Gaia G band, and then subtract this value from the G-band magnitudes to obtain an effective i-band magnitude for all Quaia sources.We convert these to absolute magnitudes M i assuming a flat ΛCDM cosmology with H 0 = 70 km s −1 Mpc −1 , Ω m = 0.3, and Ω Λ = 0.7, following Schneider et al. (2010), and assuming a value of dust reddening of A v /E(B − V ) = 1.698 corresponding to the SDSS i band and R v = 3.1.We find that a small fraction, 8%, of Quaia sources have effective M i < −22.0 and thus do not meet this standard luminosity criterion for being true quasars.This distinction may be important for certain studies, though may not be relevant for others, and should be kept in mind for analyses of Quaia.

Selection function model
We show the results of our selection function modeling ( §3.3) for the G < 20.0 catalog in Figure 13.The Around the edges of the Galactic plane the residuals show a slight bias to positive values (meaning the completeness there was predicted to be higher than it actually is); in the region around zero Galactic longitude just above the Galactic plane, the residuals are slightly biased to negative values (meaning the completeness there was predicted to be lower than it is).These discrepancies indicate that our templates are not fully capturing the selection effects in these regions.As these are largely limited to the region around the Galactic plane, the issue could be circumvented by applying a latitude cut for sensitive analyses.The underdensity around the LMC is well modeled by the selection function, with no clear residual in that region.The selection function map for the G < 20.5 catalog (not shown) is similar with some moderate differences, and is also provided as a data product.
The selection function may change more significantly for different subsets of the catalog, such as redshift bins.
The selection function should be re-fit for a given sample to be analyzed; we provide code to fit the selection function for any other subset of the catalog.We note that depending on the subsample, certain regions may be more poorly modeled, and in particular, the regions around the LMC and SMC; users should check the residuals and may choose to mask the regions around the LMC and SMC to be more conservative.

Comparison to existing quasar catalogs
We compare Quaia to other existing quasar catalogs: Projections of these catalogs are shown in Figure 14.We show the Gaia DR3 'Purer' sample (Figure 14a); a crossmatched catalog of WISE and Pan-STARRS (WISE-PS1), a current leading large-area photometric redshift quasar sample (Figure 14b); the SDSS DR16Q catalog, the current best spectroscopic sample of quasars (Figure 14c); the eBOSS quasar clustering catalog, the subsample of SDSS DR16Q intended for clustering analyses (Figure 14d); and Milliquas, a meta-catalog compiling confirmed quasars from the literature (Figure 14e).
The Gaia DR3 'Purer' sample is described in §2.1; here we include only sources with QSOC redshift estimates (z Gaia ).The WISE-PS1 sample was constructed by Beck et al. (2022), based on the Source Types and Redshifts with Machine learning (STRM) algorithm by Beck et al. (2020).The quasar catalog with updated photometric redshifts is presented by Kunsági-Máté et al. (2022); here we include only those quasars with redshifts labeled "reliable", which is 59% of the sample.The SDSS DR16Q quasar catalog is the one described in §2.3, from Lyke et al. (2020), which compiles sources from eBOSS as well as previous SDSS campaigns (and is intended as a superset of SDSS quasars rather than a uniform sample).The eBOSS quasar clustering catalog is detailed in Ross et al. (2020); it is a subsample of SDSS DR16Q selected for large-scale structure clustering analyses, and as such is much more homogeneous than the full catalog.For the eBOSS clustering catalog, we have included both eBOSS and legacy SDSS quasars (IMATCH=1 or 2) and applied the clustering cuts of requiring sectors to have > 0.5 completeness (COMP BOSS) and redshift success rate (sector SSR); we have additionally removed sources with ZWARNING!=0.The Milliquas catalog was compiled by Flesch (2021); a significant portion of the sources are from SDSS and AllWISE.For each of these samples, we have shown quasars brighter than a limit- ing magnitude of G ∼ 20.5; for the non-Gaia catalogs we convert to G from the survey's r-band magnitude using the conversion in equation ( 2) of Proft & Wambsganss (2015), which is based on the SDSS r ′ band.While this should give a reasonable estimate for the SDSS sample (using r SDSS ) and the WISE-PS1 sample (using r PS1 which is very similar to r SDSS ), it may not be as reliable for Milliquas which catalogs "red" magnitudes from various sources, as well as for sources with z > 3 which were not included in the Proft & Wambsganss (2015) fit.A summary of the catalogs is shown in Table 1, for the full catalogs (limited to sources with reliable redshifts) as well as the G eff < 20.5 subsamples.We exclude Milliquas from this comparison given its very heterogeneous nature ; we do include SDSS DR16Q, though it is also not intended to be uniform, to show the comparison of Quaia to this large spectroscopic catalog of quasars.For these quantifications, we exclude areas that have A V > 0.5 mag, as well as healpixels with no quasars.For the sky fraction f sky , we see that Quaia and Gaia DR3 'Purer' are limited only by the dusty regions, and cover over 30% more area than WISE-PS1 (which is limited by Pan-STARRS), nearly 3× that of SDSS DR16Q, and over 5× that of the eBOSS quasar clustering catalog.Compared to the Gaia DR3 'Purer' sample, Quaia has a slightly smaller number of sources, but due to its redshift distribution gives a slightly higher effective volume.The on-sky number density is similar for all of the catalogs when limiting them to similar magnitudes, with WISE-PS1 slightly higher because it has a similar number of objects to the Gaia catalogs but over a smaller area, and SDSS DR16Q and the eBOSS clustering catalog slightly lower.When including faint sources for the catalogs, WISE-PS1 has 2.5× the on-sky number density as Quaia, and SDSS DR16Q and the eBOSS clustering catalog have 1.5 − 2×.
For the volume comparison, we compute two different volumes.The first is a simple "spanning" volume, V span , which is just the comoving volume in the sky area of the survey (as given by f sky of the full sky area) in a redshift range 0.8 < z < 2.2, a typical redshift range for clustering analyses (taken from the range of the eBOSS quasar clustering catalog).Thus it compares in the same way as the survey areas, but gives an idea of the physical volume the catalogs span.The second is the effective volume, described in §4.1; we use that same P (k) = 4 × 10 4 (h −1 Mpc) 3 for the volume calculation for all catalogs.We see that the effective volume of WISE-PS1 is much larger (nearly 3×) than that of Quaia as a result of its larger number of sources, though when considering samples with the same limiting magnitude, WISE-PS1 and Quaia have comparable effective volumes.The effective volume of Quaia is nearly twice as large as that of SDSS DR16Q, and 6× for the magnitude-limited sample; compared to the eBOSS quasar clustering catalog, the effective volume of Quaia is over twice as large, and 7× for the magnitude-limited sample.
The catalogs all have a similar median redshift, of around 1.4 < z < 1.7, extending to 1.77 for SDSS DR16Q when including faint sources.However, they have significantly different redshift precision; in Table 1 we show outlier fractions estimated from comparisons to spectroscopic redshifts.We see that both of the Gaia catalogs have a similar fraction of high-precision redshifts (|∆z/(1 + z)| < 0.01), but Quaia has a much higher fraction of redshifts that are not strong outliers (|∆z/(1 + z)| < 0.1) compared to Gaia DR3 'Purer'.WISE-PS1 falls between Quaia and Gaia DR3 'Purer' in terms of strong outliers, but has an extremely low fraction of high-precision redshifts as it is a photometric survey.We note that for both Gaia DR3 'Purer' and WISE-PS1, the redshift precision is significantly lower when considering the full catalog compared to samples limited to G eff < 20.5 like Quaia; we show both for a fair comparison.The SDSS DR16Q catalog and the eBOSS quasar clustering catalog have spectroscopic redshifts, so these are almost all very high precision; Lyke et al. (2020) estimated from a visual inspection that less than 1% of the SDSS DR16Q redshifts are outliers with ∆v > 3000 km s −1 (|∆z| > 0.01), independent of redshift; note that this is a slightly different sample than the eBOSS clustering catalog, but we can expect it to be similar.The SDSS DR16Q quasar sample has typical statistical redshift errors of |∆z| ∼ 0.001.
To give more of an idea of the redshift precision of Quaia, we compare it to existing all-sky photometric galaxy catalogs.A common statistic to summarize photometric redshift uncertainty robust to outliers is the SMAD, scaled median absolute deviation, defined as 1.4826×med(|∆z −med(∆z)|), where ∆z = z phot −z spec (the scaling factor adjusts the MAD such that SMAD is approximately equal to the standard deviation for normalized data).The SMAD of the full Quaia catalog (G < 20.5) is SMAD(∆z) = 0.023, and the normalized SMAD of the redshift errors with the (1 + z) factor divided out is SMAD(∆z/(1 + z)) = 0.008.For comparison, the WISE × SuperCOSMOS catalog of 20 million galaxies with z med = 0.2 (Bilicki et al. 2016) has an SMAD(∆z) of ∼ 0.04 and an SMAD(∆z/(1 + z)) of ∼ 0.035.The Two Micron All Sky Survey Photometric Redshift (2MPZ) catalog has around 1 million galaxies with a similar median redshift (Bilicki et al. 2013), which have an SMAD(∆z) of ∼ 0.015.Quaia thus falls in between these common photometric galaxy samples in terms of overall redshift precision; however, we note that it is difficult to capture the redshift error of Quaia in a single statistic, given both its large number of highly precise redshifts and non-insignificant number of outliers.
We also note that the ongoing DESI survey (Aghamousa et al. 2016;DESI Collaboration et al. 2023) will observe a high density of quasars over a large sky Table 1.Comparison between Quaia and Other Existing Quasar Catalogs, Detailed in the Text.We show the quantities for the full catalogs (for sources with reliable redshifts) as well as the catalogs limited to G < 20.5 or the rough equivalent converted from another band.For all quantities and catalogs shown, we exclude areas with high dust extinction (AV > 0.5 mag); this excludes ∼5% of sources for Quaia and Gaia DR3 'Purer', ∼18% of the full WISE-PS1 sample, and a negligible number of sources for SDSS DR16Q and the eBOSS clustering catalog.We note that the SDSS DR16Q catalog is a superset of quasars from many SDSS campaigns and is not intended to be uniform, which should be considered in particular for the sky fraction and spanning volume quantities.We show the number of sources N , the fraction of sky area covered f sky , the mean number density per square degree n, the spanning volume between 0.8 < z < 2.2 Vspan, the effective volume V eff , the median redshift z med , and the fraction of objects with |δz| ≡ |∆z/(1 + z)| < 0.01 and < 0.1 (where applicable).area (Chaussidon et al. 2023), which will be competitive with and complementary to Quaia.

Catalog format
The complete Quaia catalog contains our decontaminated quasar sample with computed redshift information, relevant Gaia properties, and crossmatched catalog information.The complete catalog format with column names, units, column descriptions, and an example entry is shown in Table 2.Additional information for the sources can be obtained by joining the catalog with the relevant data source with the associated identifier (Gaia or unWISE).We include only sources with G < 20.5 in the catalog; we also publish a version limited to G < 20.0, along with the selection function models fit to each ( §4.2) and "random" catalogs generated from the selection functions.The catalog includes our SPZ redshifts z Gaia along with 1σ redshift errors, sky position, Gaia photometry, unWISE photometry, and proper motion information.The catalog is in FITS format (Wells et al. 1981), and units and descriptions are provided for each column.

Limitations
While the Quaia catalog presents a highly useful quasar sample, it does have various limitations.We reiterate and discuss the main ones here.
We estimate spectrophotometric redshifts for the quasars, which are generally more accurate than the Gaia estimates, but are still low precision compared to spectroscopic redshifts.The uncertainties on these redshifts should be taken into account for any measurements, and the rate of catastrophic redshift errors (not necessarily captured by the redshift uncertainty) should be considered when thinking about possible uses of the catalog.
The selection function model has multiple potential limitations.While it broadly captures the selection effects that affect the quasar sample, it has significantly lower accuracy around the galactic plane; precision measurements may require masking this region.The regions around the LMC and SMC are also more poorly modeled; users may want to mask this area.We also note that we are not fitting the healpixels with zero quasars, which may result in a slight bias toward populated regions, and fixes the zero-probability region of the selection function.Our selection function map depends only on sky position and not other properties such as magnitude or redshift (besides fitting it to the appropriate subsample); a treatment incorporating these dependencies may be important for certain uses.The gold standard for completeness estimation is data injection and recovery tests.Unfortunately, the Gaia instrumentation has black-box elements, such as onboard image segmentation, onboard object detection, and onboard downlink prioritization, that make it impossible to perform endto-end injection tests, so we rely on a data-driven ap-proach, which may be less robust and more sensitive to modeling choices.Given this, it is possible that we are overfitting the selection function.Finally, the selection function depends on the assumption of isotropy, which we know to be broken to some extent by the kinematic dipole (Stewart & Sciama 1967;Secrest et al. 2021); we will explore and measure this in an upcoming work (see §4.6).Users employing the selection maps or generating their own selection function for some subset of the catalog should take note of these potential issues.
Generally, Quaia has relatively low number density (e.g. compared to the SDSS sample).This means that it may not be ideal for certain cosmological measurements, which may be shot noise dominated.
Finally, we note that this catalog is based on the Gaia quasar candidates sample, and it will inherit many of the limitations of that sample (Gaia Collaboration et al. 2023a).We are also limited to the Gaia-derived properties (e.g. the Gaia redshifts that are a feature for our estimates).In upcoming Gaia data releases, the collaboration will release more BP/RP spectra and we will have the opportunity to work directly from the spectral data to improve the catalog.

Potential applications
Quasars are highly biased tracers of the cosmic web that trace the matter distribution at higher redshift than galaxies and in the mildly nonlinear regime.Given the Quaia catalog's sampling of quasars to deep magnitudes and across a large volume, and its reduced systematic contamination allowed by space-based observations, Quaia lends itself to large-scale structure analyses, many of which are currently ongoing.
Thanks to its large volume and well-characterized selection function, Quaia is perhaps the best current sample for testing homogeneity and isotropy in the Universe (Hogg et al. 2024), and relatedly for measuring the dipole in the quasar distribution (Williams et al. 2024), which recent measurements have consistently found to be in mild tension with the kinematic interpretation in the ΛCDM model.Quaia's volume also makes it a good sample for a measurement of the matter-radiation equality scale, k eq (e.g.Bahr-Kalus et al. 2023).
The catalog is particularly well suited for crosscorrelations with other all-sky observations of projected tracers of the large-scale structure, which are less sensitive to redshift errors compared to 3D ones.Examples of this are the CMB, the CIB, or maps of the thermal Sunyaev-Zel'dovich effect.Alonso et al. (2023) used the cross-correlation between CMB lensing and Quaia to constrain the growth of matter fluctuations via the parameter S8, achieving competitive constraints as well as showing that Quaia can break the degeneracy between Ω m and σ 8 .An analysis of primordial non-Gaussianity (parameterized by f N L ) from this cross-correlation with CMB lensing is also underway.Analyses of the crosscorrelation with CMB temperature to measure the In- tegrated Sachs-Wolfe effect, and with the CIB to constrain the star formation history at high redshifts (e.g.Jego et al. 2023), are currently under investigation.Another measurement enabled by the catalog is the crosscorrelation of quasar proper motions with the largescale structure, which gives a direct estimate of the cosmological quantity Hf σ 8 (Duncan et al. 2023).Additionally, cross-correlations of Quaia with galaxy surveys may allow for measurements of the baryon acoustic feature (Patej & Eisenstein 2018;Zarrouk et al. 2021) and quasar environments (Padmanabhan et al. 2009;Shen et al. 2013).Quaia is also useful for void studies, including constraining core cosmological parameters with the void size distribution; this investigation is underway (Arsenov et al. 2024).The catalog is additionally relevant to astrophysical analyses of quasar properties, given its large sky coverage and multiband photometry, such as the role of galaxy interactions on AGN activity.Quaia sources may also be used to study the potential of quasars as standard candles.Further, Quaia provides perhaps the best quasar coverage of the southern sky, which may be important for a variety of applications such as identifying interesting sources there, adding new information to known sources, or calibrating surveys in that sky region.Finally, while a 3D clustering analysis of Quaia may be limited by the catalog's relatively low number density and moderate redshift precision, a careful analysis may yield useful constraints, especially using techniques targeted at wide-field surveys (e.g.Lanusse et al. 2015).
The latter is comparable or better than other stateof-the-art galaxy and quasar samples used in large-scale structure analyses, but not enough to necessarily allow an accurate interpretation.

SUMMARY AND DATA ACCESS
We have constructed a new quasar catalog, Quaia, the Gaia-unWISE Quasar Catalog, designed for cosmological studies, derived from the Gaia DR3 quasar candidates sample and using unWISE photometry to remove contaminants and derive precise redshifts.Our key contributions and the features of the catalog are as follows: • We have decontaminated the Gaia DR3 quasar candidates sample with proper motion cuts and optimized color cuts based on Gaia and unWISE photometry.This reduced the number of known contaminants by ∼4×, while only excluding 1.2% of known quasars with respect to the superset of Gaia quasar candidates (that have unWISE pho-

Figure 1 .
Figure 1.A summary of the overlaps between the various data sets and subsamples used in this work.The values describe the fraction of objects in each column's sample that are in each row's sample.Note that we only list unWISE as a row because the inverse is not relevant to this work.

Figure 2 .
Figure 2. Sky distribution of the quasar candidates in the Gaia DR3 'Purer' quasar sample, in Galactic coordinates and displayed using a Mollweide projection.

Figure 3 .
Figure3.Proper motion µ vs. G magnitude for two different sets of sources.The black line shows the cut we make; the shaded gray region is excluded from the catalog.Top: the sources for which we have labels (SDSS data as well as sources near the LMC and SMC in the Gaia quasar candidates sample) that are also in the Quaia superset (Gaia DR3 quasar candidates that have all necessary photometry, Gaia redshift estimates, and G < 20.6).Middle: sources in the top row that are also in the Quaia superset.Bottom: the superset of quasar candidates from which Quaia is constructed.The proper motion cut includes nearly all SDSS quasars in the superset while excluding a large number of stars.

Figure 4 .
Figure 4. Color-color plots of three different sets of sources.The left column shows W 1-W 2 vs. G-W 1 color, and the right column shows G-RP vs.BP -G color.The black lines show the cuts we make; the shaded gray region is excluded from the catalog.The rows have the same samples as in Figure3, except that in the top row, only 20,000 of each type of SDSS source is shown for clarity.In both color-color projections, the labeled sources are mostly localized in particular regions of parameter space, and we can see these populations somewhat clearly in the Quaia superset.

Figure 5 .
Figure 5. (a) Gaia redshift estimate zGaia vs. SDSS ("true") redshift zSDSS for a test set of sources in our quasar catalog Quaia with G < 20.5.(b) Our estimated spectrophotometric (SPZ) redshifts zQuaia, which are based on a kNN model, vs. zSDSS for the same sample.The bias (mean redshift error) and scatter (σ68, the symmetrized inner 68% region of the redshift errors) of the redshift estimates compared to zSDSS are shown in the panels.The zQuaia redshifts significantly decrease both the bias and scatter, as well as catastrophic outliers and unreasonably high redshift estimates.The one-to-one line (perfect accuracy) is shown in gray; note that the color bar is on a log scale, and that a majority of the sources in both cases lie along this line.

Figure 6 .
Figure 6.The cumulative distribution of redshift errors for Quaia test set sources with G < 20.0, considering SDSS spectroscopic redshifts zSDSS as the ground truth, for estimates directly from our kNN model (gray), the original zGaia redshifts (purple), and our final zQuaia estimates (black) based on a combination of the other two.Our SPZ redshifts have far fewer outliers and similar precision compared to the Gaia estimates.

Figure 7 .
Figure 7.The fraction of outlying redshifts with |∆z/(1 + z)| > (0.01, 0.1, 0.2), as a function of G magnitude, for our redshift estimation test set.The SPZ redshifts are shown in black, and the Gaia redshifts in purple.The fraction of outliers increases steeply with increasing G for G > 19.5 for both zQuaia and zGaia, though the fraction of catastrophic outliers for zQuaia is significantly lower (and the dependence less steep) compared to zGaia.

Figure 8 .
Figure 8.The systematics maps used in the selection function model: (a) dust extinction from Chiang (2023); (b) the stellar distribution based on ∼10.6 million randomly selected Gaia sources with 18.5 < G < 20; (c) the unWISE source distribution based on ∼10.6 million randomly selected unWISE sources; (d) the quantity M10, the median magnitude of sources with ≤ 10Gaia transits, which encodes the Gaia scanning law and source crowding; and (e) the unWISE scan pattern given by the mean number of single-exposure images of the sky region in the coadd.Note that the color bar on the M10 and unWISE scanning law maps are reversed, as high values indicate a cleaner region, the inverse of the other maps.We also include separate templates for only sources in the LMC and SMC regions for both the stellar and unWISE source densities, with the background subtracted.All templates are discussed in more detail in the text.

Figure 9 .
Figure 9. Sky distribution of the Quaia quasar catalog, in Galactic coordinates and displayed using a Mollweide projection.Panel (a) shows sources with G < 20.0, the cleaner version with more reliable redshifts, and (b) shows the catalog down to its magnitude limit of G < 20.5.

Figure 11 .
Figure 11.Redshift distribution of Quaia for our spectrophotometric redshift estimates zQuaia (black), normalized to the total number of objects.For comparison, we also show the normalized distributions of other samples, cut to the G < 20.5 limiting magnitude of Quaia where relevant: the Gaia redshift estimates zGaia for the same Quaia sources (purple); zGaia for the sources in the full Gaia quasar candidate sample with G < 20.5 (gray); zGaia for the Gaia DR3 'Purer' subsample with G < 20.5 (green); and the SDSS redshifts zSDSS for the SDSS DR16Q quasar sample that have Gaia crossmatches, with G < 20.5 (blue).The median redshift of each distribution is shown by the diamond and vertical line in the respective color.

Figure 13 .
Figure 13.(a) The selection function map for the G < 20.0 subset of Quaia, based on a Gaussian process model of the dust, stellar distribution, and M10.(b) The fractional residuals between a random catalog downsampled by the modeled selection function and the true Quaia G < 20.0 catalog.

Figure 14 .
Figure 14.Other current quasar catalogs for comparison with Quaia.All are shown for sources with G < 20.5 or the equivalent converted from another band, in Galactic coordinates and displayed using a Mollweide projection.The catalogs are (a) the Gaia DR3 'Purer' sample, (b) the WISE-PS1-STRM catalog, (c) the SDSS DR16Q catalog, (d) the eBOSS quasar clustering catalog, and (e) the Milliquas catalog.Note that the color bars have different scales in each panel.

Table 2 .
(Wells et al. 1981)escriptions of Quaia, Published as a FITS Data File(Wells et al. 1981).For the example entry, we show the first catalog row.