The Hitchhiker’s Guide to the Galaxy Catalog Approach for Dark Siren Gravitational-wave Cosmology

We outline the "dark siren" galaxy catalog method for cosmological inference using gravitational wave (GW) standard sirens, clarifying some common misconceptions in the implementation of this method. When a confident transient electromagnetic counterpart to a GW event is unavailable, the identification of a unique host galaxy is in general challenging. Instead, as originally proposed by Schutz, one can consult a galaxy catalog and implement a dark siren statistical approach incorporating all potential host galaxies within the localization volume. Trott & Huterer recently claimed that this approach results in a biased estimate of the Hubble constant, H_0, when implemented on mock data, even if optimistic assumptions are made. We demonstrate explicitly that, as previously shown by multiple independent groups, the dark siren statistical method leads to an unbiased posterior when the method is applied to the data correctly. We highlight common sources of error that can arise in the generation of mock data and in the implementation of the statistical framework, including the mismodeling of selection effects and inconsistent implementations of the Bayesian framework, which can lead to a spurious bias.


INTRODUCTION
The use of gravitational waves (GWs) from compact binary mergers as standard sirens (Schutz 1986; Holz & Hughes 2005; Dalal et al. 2006) for cosmology is an idea which has finally come to fruition in recent years. These signals directly provide a measurement of the luminosity distance to the source, which is therefore independent of the cosmic distance ladder. With the addition of redshift information, measurements can therefore be made of those cosmological parameters which impact the expansion history of the Universe, such as the Hubble constant (H_0). This approach is independent of all other local measurements to date.
The detection of the binary neutron star (BNS) GW170817 (Abbott et al. 2017a) and its electromagnetic (EM) counterpart (Abbott et al. 2017b) by the LIGO (Aasi et al. 2015) and Virgo (Acernese et al. 2015) GW detectors - which allowed the host galaxy, and hence the redshift of the merger, to be identified - led to the first GW measurement of H_0 (Abbott et al. 2017). In the absence of an EM counterpart, redshift information from other sources can be used, such as i) galaxy catalogs (using the statistical method (Del Pozzo 2012; Chen et al. 2018a; Fishbach et al. 2019; Gray et al. 2020; Gray et al. 2022; Leandro et al. 2022) or the cross-correlation method exploiting the spatial clustering between GW sources and galaxies (Oguri 2016; Mukherjee et al. 2020, 2021; Bera et al. 2020; Diaz & Mukherjee 2022)) and ii) "spectral siren" inference of the redshift from the astrophysical distributions of the GW sources themselves (Chernoff & Finn 1993; Taylor et al. 2012; Farr et al. 2019; Ezquiaga & Holz 2020; Mastrogiovanni et al. 2021; Mukherjee 2022; Leyde et al. 2022; Ezquiaga & Holz 2022; Karathanasis et al. 2022).
This paper will focus on the dark siren + galaxy catalog method (also informally known as the statistical or dark siren method), in which galaxy catalogs are used to provide the redshift information for all potential host galaxies within a GW's localisation volume, which are statistically averaged over. While less informative than the counterpart method on an event-by-event basis, the constraint strengthens as more events are included in the analysis. Given the current detection rates of bright and dark sirens (the latter have a factor > 10 more detections), this method is expected to make a significant contribution to the GW constraint on H_0 (Chen et al. 2018a). Indeed, this approach has already been implemented using dozens of available GW detections (Fishbach et al. 2019; Soares-Santos et al. 2019; LIGO Scientific Collaboration & Virgo Collaboration 2019; Palmese et al. 2020; Abbott et al. 2021a; Finke et al. 2021; Palmese et al. 2021).
This paper aims to act as a point of introduction for those new to the field of GW cosmology, and the dark siren + galaxy catalog method in particular, using mock data to build up a basic analysis following a Bayesian approach. This is motivated by the recent claim that GW dark sirens generally provide biased estimates of H_0 even under simplified assumptions (Trott & Huterer 2021; TH21 hereinafter). We show here that this is not the case, and that it is incorrect modeling assumptions that lead to biased measurements. Given this, it is important to stress that many of the inconsistencies and biases explored in this work are not relevant for the case of realistic data, and instead primarily arise due to simplified assumptions and toy models for the GW sources and galaxy catalogs.
It is well understood that biases cannot arise in a statistical analysis when the model and the data-generating process are consistent. In simulation studies, it is possible to control the data-generating process, and so if biases appear these must be due to errors or inconsistencies in generating or analyzing the data. This should be used as a diagnostic tool to track down those errors and inconsistencies. This does not mean that analyses of real data are free of biases, since the true data-generating process for any observed phenomenon is unknowable. However, the true parameters are then also not known, and so biases are hard to diagnose. It is certainly valuable, within a simulation study, to vary the assumptions about the data-generating process, while not changing the analysis, to understand what biases could appear in the analysis of real data. Indeed, this is the correct procedure for identifying potential sources of systematic error. However, this is only useful if a consistent analysis has first been shown to give unbiased results, and if the types of modifications to the data-generating process are controlled and physically well-motivated. Any sources of bias identified in this way can then be mitigated by increasing the complexity of the analysis model.
This paper is organised as follows. Section 2 introduces the basic Bayesian framework, then describes the mathematics of how the mock GW and EM galaxy catalogue data should be constructed. Section 3 gives details of the simulated data, then shows results for the scenario in which 200 GW detections are used to constrain H_0, and how the constraint changes in the limit of a large number of detections. Section 4 discusses some of the common mistakes made when applying the dark siren + galaxy catalog method, particularly when generating simplified mock data, which can lead to biased outcomes. It also highlights some of the real-world complications which need to be addressed in a true dark siren + galaxy catalog analysis, but which are beyond the scope of this paper. We note that this work is not intended to present a forecast of how well we will be able to constrain H_0, as we make some simplifying assumptions that differ from reality; it is instead intended as a pedagogical work on the method.

STATISTICAL FRAMEWORK
Let us assume that we have observed a set of N_obs GW events {x} from which we are able to measure the luminosity distance of each source. In this basic example, H_0 can be determined by the fact that we measure the source luminosity distance from the GWs and can identify potential host galaxies from the catalog. According to Bayes' theorem, the posterior on H_0, given a set of detected GW events {x}, can be written as

p(H_0|{x}) ∝ p(H_0) L({x}|H_0), (1)

where p(H_0) is the prior on H_0. The likelihood L({x}|H_0) can be described by an inhomogeneous Poisson process (Mandel et al. 2019; Vitale et al. 2022),

L({x}|H_0) ∝ e^{-N_exp(H_0)} Π_{i=1}^{N_obs} N_CBC ∫ dz L_GW(x_i|d_L(z, H_0)) p_CBC(z). (2)

Here L_GW(x_i|d_L(z, H_0)) is the GW likelihood (the probability of measuring the data x_i given some luminosity distance d_L), p_CBC(z) is the probability that the source, a compact binary coalescence (CBC), is at redshift z for an observer on Earth, while P_det^GW(z, H_0) is the GW detection probability used to account for selection biases. The term N_exp(H_0) = N_CBC ∫ dz P_det^GW(z, H_0) p_CBC(z) is the number of GW detections expected for a given value of H_0, with N_CBC the total number of mergers. This term can be marginalized analytically by assuming a 1/N_exp prior on the expected number of events (Fishbach et al. 2018), or equivalently on the compact object binary merger rate. With this assumption, the pre-factor loses its dependence on H_0 and the likelihood for a single event reduces to

L(x_i|H_0) ∝ [∫ dz L_GW(x_i|d_L(z, H_0)) p_CBC(z)] / [∫ dz P_det^GW(z, H_0) p_CBC(z)], (3)

which is the one we use here. The quantity p_CBC(z) is being used here to represent the distribution of redshifts from which the GW sources are drawn. If this is known, the likelihood, which is the probability distribution of observed data sets, can be found by marginalising over the distribution of possible redshifts, which is what is done in Eq. (3). In practice we do not know p_CBC(z), but we construct it from observations of galaxies, as described in Section 2.1 below. The interpretation is then that we are using the posterior from EM observations as the prior for the GW data. This is perfectly legitimate, but to be rigorously correct, what the EM data provide is a distribution of possible GW source distributions, since the EM data are not perfect. In practice, this means that we should treat the unknown true redshifts of the galaxies as population-level parameters that we infer jointly from the EM and GW data, and marginalise over these after combining the individual GW measurements. In simpler terms, the integral over z in Eq. (3) should be done at the end, rather than separately for each event. However, as we discuss in more detail in Sec. 2.3, it can be shown that in the limit that the redshifts of the galaxies are perfectly known, or the number density of CBCs is much lower than the number density of galaxies, the hierarchical likelihood for H_0 can be reduced to Eq. (3). In particular, the second hypothesis is perfectly reasonable considering that currently the rate of CBC mergers is estimated to be ∼ 10^-6 - 10^-5 yr^-1 per galaxy (Abbott et al. 2021b). For this reason, Eq. (3) is used in most current analyses and this is perfectly legitimate. Where this distinction can matter, however, is in a mock data scenario where the number of CBCs is artificially inflated (or the density of galaxies artificially reduced). For most of this paper, we will limit our discussion to the case in which the number density of CBCs is much lower than the number density of galaxies. In Sections 2.3 and 3.3 only, we will demonstrate when this assumption has an impact and how it can be mitigated. As a final remark, it is important to note that the form of Eq. (3) is based on the usual assumption that "detection" is a property of the observed data only, not of the true parameters of the source. It can be convenient to simulate data by selecting events based on the true source parameters, but this is a modification to the prior rather than to the likelihood, and would be inconsistent with the detection model assumed in Eq. (3). A consistent analysis in this case would just retain the numerator of Eq. (3), but with the prior p_CBC(z) renormalised by dividing by its integral over the range of events that are detectable (which will depend on the cosmological parameters), and the integral in the numerator truncated to the same range. While this correction should eliminate biases, we would not advocate this approach, as it does not reflect the reality of an analysis of real data and might therefore give misleading results.
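The single-event ratio in Eq. (3) - the integral of the GW likelihood over the catalog-based redshift prior, divided by the selection term - can be sketched numerically. The following is a minimal illustration, not the pipeline used in this paper: the five-galaxy "catalog", the observed distance of 900 Mpc, and the fractional error A = 0.2 are all hypothetical choices.

```python
import numpy as np
from math import erf

C_KMS = 299792.458   # speed of light [km/s]
OM = 0.25            # matter density, matching the MICEcat fiducial value

def d_L(z, H0):
    """Luminosity distance [Mpc] in flat LambdaCDM, by trapezoidal integration."""
    zg = np.linspace(0.0, z, 512)
    Ez = np.sqrt(OM * (1.0 + zg) ** 3 + (1.0 - OM))
    return (1.0 + z) * C_KMS / H0 * np.trapz(1.0 / Ez, zg)

def gw_likelihood(d_obs, d_true, A):
    """Gaussian distance likelihood with sigma = A * d_true (toy model)."""
    s = A * d_true
    return np.exp(-0.5 * ((d_obs - d_true) / s) ** 2) / (np.sqrt(2.0 * np.pi) * s)

def p_det(d_true, d_thr, A):
    """Probability that the measured distance scatters below the threshold."""
    return 0.5 * (1.0 + erf((d_thr - d_true) / (np.sqrt(2.0) * A * d_true)))

def single_event_posterior(d_obs, H0_grid, z_cat, A=0.2, d_thr=1550.0):
    """Eq. (3)-style ratio on an H0 grid for one event, with a flat H0 prior."""
    logL = []
    for H0 in H0_grid:
        dls = np.array([d_L(z, H0) for z in z_cat])
        num = np.mean(gw_likelihood(d_obs, dls, A))       # sum over the catalog
        den = np.mean([p_det(d, d_thr, A) for d in dls])  # selection-effect term
        logL.append(np.log(num) - np.log(den))
    post = np.exp(np.array(logL) - np.max(logL))
    return post / np.trapz(post, H0_grid)

# Hypothetical inputs: a tiny catalog with perfect redshifts and one detection.
z_cat = np.array([0.10, 0.15, 0.20, 0.25, 0.30])
H0_grid = np.linspace(20.0, 140.0, 61)
post = single_event_posterior(d_obs=900.0, H0_grid=H0_grid, z_cat=z_cat)
```

A single event yields a broad, multi-modal posterior; the constraint only sharpens once many events are multiplied together and then normalized.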

Galaxy Catalog modeling
In this section we describe how to build the probability p_CBC(z) of a CBC occurring at redshift z, under the assumption that CBCs occur in galaxies, and so will trace the distribution of galaxies in the universe in some fashion. Note that in this section (and for the rest of the paper) we will discuss the distribution of galaxies purely as a function of redshift, and thus neglect their spatial distribution in right ascension and declination. This closely follows the assumptions and approximations made in TH21. Neglecting this aspect does not automatically introduce problems, but it is an unrealistic set-up which increases the risk mentioned above regarding the relative number densities of CBCs and galaxies.
The probability that a merger will occur at a redshift z is the product of the probability that there is a galaxy at z, p_cat(z), and the probability of a galaxy at redshift z hosting a GW merger, p_rate(z),

p_CBC(z) ∝ p_cat(z) p_rate(z). (4)

The rate - or signal emission - probability is given by

p_rate(z) ∝ R(z)/(1 + z), (5)

where R(z) is the merger rate of GWs in their source frame, usually expressed in Gpc^-3 yr^-1. The merger rate is usually parametrised as R(z) ∝ (1 + z)^γ in the redshift region 0 < z < 2 (Fishbach & Holz 2017) to account for a possible evolution of the merger rate. If GW mergers are uniform in comoving volume and source-frame time, this term is constant. The 1/(1 + z) factor accounts for the effect of time dilation, due to the expansion of the Universe, between the source frame and the detector frame. In the absence of any galaxy catalog observation, p_cat(z) can be constructed with a uniform in comoving volume distribution,

p_cat(z) ∝ dV_c/dz ∝ [∫_0^z dz'/E(z')]^2 / E(z). (6)

Note that the above prior does not depend on H_0 once normalized, but it does depend on other cosmological parameters through the expansion history (H(z)/H_0 ≡ E(z) = [Ω_m(1 + z)^3 + Ω_Λ]^{1/2} for a flat Lambda Cold Dark Matter cosmological model) and through the parametrization of the rate term. This is the prior used for analyses making use of mass information (Chernoff & Finn 1993; Taylor et al. 2012; Farr et al. 2019; Mastrogiovanni et al. 2021; Mukherjee 2022; Leyde et al. 2022; Ezquiaga & Holz 2022; Karathanasis et al. 2022), where cosmology and p_CBC(z) are fit jointly. Note also that if the CBC rate presents some features in redshift, such as peaks, it might help to measure cosmology even in the absence of counterparts or galaxy information (Ye & Fishbach 2021).
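The H_0-independence of the normalized uniform-in-comoving-volume prior noted above can be checked numerically. This sketch builds (dV_c/dz) together with the rate weighting R(z)/(1 + z); the rate index γ = 1.8 is an assumed illustrative value, not one used in the paper:

```python
import numpy as np

C_KMS, OM = 299792.458, 0.25   # speed of light [km/s]; MICEcat matter density

def p_z_unnorm(z, H0, gamma):
    """Unnormalized p_cat(z) * p_rate(z): (dV_c/dz) * R(z)/(1+z), R ∝ (1+z)^gamma."""
    Ez = np.sqrt(OM * (1.0 + z) ** 3 + (1.0 - OM))
    # Comoving distance via a cumulative trapezoid on the same grid.
    dz = np.diff(z)
    dc = np.concatenate(([0.0], np.cumsum(dz * 0.5 * (1.0 / Ez[1:] + 1.0 / Ez[:-1]))))
    dc *= C_KMS / H0
    dVc_dz = dc ** 2 / Ez / H0          # overall constants dropped
    return dVc_dz * (1.0 + z) ** gamma / (1.0 + z)

z = np.linspace(1e-4, 1.4, 2000)
for gamma in (0.0, 1.8):
    p70 = p_z_unnorm(z, 70.0, gamma)
    p70 /= np.trapz(p70, z)
    p100 = p_z_unnorm(z, 100.0, gamma)
    p100 /= np.trapz(p100, z)
    # The unnormalized prior scales as 1/H0^3, so the normalized one is H0-free.
    assert np.allclose(p70, p100)
```

The assertion passes because every factor of H_0 enters as an overall multiplicative constant, which cancels upon normalization; only Ω_m (through E(z)) and γ shape the distribution.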
Physically, p_CBC(z) is something like the redshift distribution of galaxies that have hosted a compact merger, the signal of which has passed through the Earth. It is interesting to note that the combination of Eqs. 4-6 can also be obtained from a more "physical" argument, starting from the rate of CBC mergers seen by an observer on Earth. This quantity can be expressed as

dN/(dt_d dz) = [dN/(dV_c dt_s)] [dV_c/dz] [dt_s/dt_d], (7)

where t_d and t_s are the times in the detector and source frame, and the number of CBCs per comoving volume per source-frame time is by definition the CBC merger rate, dN/(dV_c dt_s) = R(z). Since dt_s/dt_d = 1/(1 + z), the redshift distribution of mergers as seen from Earth is

p_CBC(z) ∝ dN/(dt_d dz) = [R(z)/(1 + z)] dV_c/dz. (8)
For this basic mock data, when calculating Eq. 4, we will neglect the rate term in Eq. 5 in order to align more closely with the approach taken in TH21. As long as this rate assumption is treated self-consistently when generating the mock data and analyzing it, this will not introduce any bias in the results. Let us note that this is a simplified description in which the CBC rate depends only on the epoch of the Universe (redshift); in a more general case, we might also model the fact that more luminous galaxies are more likely to host CBCs. Therefore, in the remainder of the paper, we will approximate

p_CBC(z) ≈ p_cat(z). (9)

Moreover, here we want to exploit the fact that we are provided with a galaxy survey. We want to build p_cat(z) given the observation of N_gal galaxies with measured redshifts {ẑ_g}. In other words, we now compute p_cat(z|{ẑ_g}). In this computation, we will assume that the galaxy catalog is complete. For details on how galaxy catalog incompleteness can be incorporated into such an analysis, see e.g. Chen et al. (2018a); Fishbach et al. (2019); Gray et al. (2020); Gray et al. (2022). While constructing p_cat(z|{ẑ_g}), we must consider that {ẑ_g} are not the true redshifts {z_g}. We can take this uncertainty into account using the laws of probability, namely

p_cat(z|{ẑ_g}) = ∫ d{z_g} p_gal(z|{z_g}) p_red({z_g}|{ẑ_g}), (10)

where p_gal(z|{z_g}) is the probability of having a galaxy at redshift z given a set of true redshifts for the galaxies. This is simply given by

p_gal(z|{z_g}) = (1/N_gal) Σ_{g=1}^{N_gal} δ(z - z_g), (11)

where δ is the Dirac delta distribution. The probability p_red({z_g}|{ẑ_g}) encodes the fact that we are not able to measure the true redshifts of the galaxies perfectly, but only their measured values. We label this probability with "red" to indicate that it refers to the measurement uncertainties on redshift. Since each galaxy measurement is independent of the others,

p_red({z_g}|{ẑ_g}) = Π_{g=1}^{N_gal} p(z_g|ẑ_g). (12)

By plugging Eqs. 11-12 into Eq. 10 we obtain

p_cat(z|{ẑ_g}) = (1/N_gal) Σ_{g=1}^{N_gal} p(z|ẑ_g). (13)

At this point we can immediately note that if the galaxy redshifts are perfectly measured, Eq. 13 reduces to a sum of Dirac delta functions,

p_cat(z|{ẑ_g}) = (1/N_gal) Σ_{g=1}^{N_gal} δ(z - ẑ_g), (14)

and the likelihood in Eq. 3 can be expressed analytically as

L(x_i|H_0) ∝ [Σ_g L_GW(x_i|d_L(ẑ_g, H_0))] / [Σ_g P_det^GW(ẑ_g, H_0)]. (15)

In Sec. 2.3 we show that the hierarchical likelihood still reduces to this equation even if we relax the assumption that the CBC density is lower than the galaxy density. However, we are usually in a situation where the redshifts of the galaxies are measured with large uncertainties. The term p_CBC(z) is our prior on the CBC redshift distribution. In order for it to be used as such, the sum in Eq. 13 must be over the redshift posteriors of the individual galaxies. In this case, we construct the likelihood of the individual galaxies in the same manner as the GW likelihood, and an additional prior must be applied to convert it to a posterior. That is,

p(z|ẑ_g) ∝ L_red(ẑ_g|z) p_bg(z), (16)

where L_red(ẑ_g|z) is the likelihood model used to generate observed redshifts from the true redshift of each galaxy, and p_bg(z) is a chosen prior on redshift that reflects our best belief about the background distribution of galaxies. We adopt a Gaussian redshift likelihood model as in TH21,

L_red(ẑ_g|z) ∝ exp[-(ẑ_g - z)^2 / (2σ_z^2)], (17)

with σ_z = min[0.013(1 + z)^3, 0.015]. Note that the redshift uncertainty explicitly depends on the true redshift, and this has to be taken into account in the modeling.
Regarding priors, in the absence of any special knowledge about the galaxy distribution, the simplest choice we can make is to set p_bg(z) uniform in comoving volume. This can appear to be "double counting", but it is the more conservative choice in the absence of any data-driven information, since we would expect galaxies to be nearly uniformly distributed in comoving volume. In the limit that σ_z → ∞, i.e. we detect galaxies but we do not measure their redshifts, p_CBC(z) will then represent galaxies uniform in comoving volume. In the opposite limit that σ_z → 0, the prior choice p_bg(z) will not matter, and p_CBC(z) will be given by a sum of δ-like peaks located at the measured redshifts of the galaxies.
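The construction described above - a per-galaxy redshift posterior from a Gaussian likelihood times a uniform-in-comoving-volume p_bg(z), then averaged over the catalog as in Eq. 13 - can be sketched as follows. The 50-galaxy catalog and its redshift range are synthetic stand-ins, and we use the uncapped σ_z = 0.013(1 + z)^3 form for simplicity:

```python
import numpy as np

OM = 0.25
z_grid = np.linspace(1e-4, 1.4, 1400)

def E(z):
    return np.sqrt(OM * (1.0 + z) ** 3 + (1.0 - OM))

# Uniform-in-comoving-volume background prior p_bg(z) ∝ dV_c/dz (constants dropped).
dz = np.diff(z_grid)
Ez = E(z_grid)
dc = np.concatenate(([0.0], np.cumsum(dz * 0.5 * (1.0 / Ez[1:] + 1.0 / Ez[:-1]))))
p_bg = dc ** 2 / Ez

def sigma_z(z):
    return 0.013 * (1.0 + z) ** 3   # redshift uncertainty grows with the TRUE z

rng = np.random.default_rng(42)
z_true = rng.uniform(0.05, 1.0, size=50)        # toy "true" galaxy redshifts
z_obs = rng.normal(z_true, sigma_z(z_true))     # noisy measurements

# Average of per-galaxy posteriors: the likelihood of each observed redshift is
# evaluated with sigma at the grid (true) redshift, multiplied by p_bg, and each
# posterior is normalized before summing.
p_cbc = np.zeros_like(z_grid)
for zo in z_obs:
    like = np.exp(-0.5 * ((zo - z_grid) / sigma_z(z_grid)) ** 2) / sigma_z(z_grid)
    g = like * p_bg
    p_cbc += g / np.trapz(g, z_grid)
p_cbc /= len(z_obs)
```

Because σ_z is evaluated at the grid (true) redshift rather than at the observed one, each per-galaxy posterior is skewed, which is exactly the non-Gaussian reconstruction effect the text warns about.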

GW data modeling
The problems of detection and source parameter estimation for GW signals are complex and an active field of current research. We refer the reader to Abbott et al. (2020) for a more in-depth discussion. Here we just discuss the fundamental aspects of detection and parameter estimation for GW signals. GW signals are detected using low-latency algorithms, able to calculate signal-to-noise ratios and false alarm rates using either a template bank (Messick et al. 2017; Cannon et al. 2021; Aubin et al. 2021; Dal Canton et al. 2014; Nitz et al. 2017) or a superposition of wavelets (Klimenko & Mitselmakher 2004). Typically, GW signals with a false alarm rate lower than 1-10 per year are selected for population analyses (Abbott et al. 2021b,c). The choice of a detection threshold is important to correct for selection biases. Once a signal is classified as detected, the source physical parameters are estimated using a Bayesian sampling of the GW likelihood,

L_GW(x_i|θ) ∝ exp[-(x_i - h(θ)|x_i - h(θ))/2], (18)

where θ is the set of GW source parameters, h is the GW signal, and x_i indicates the GW strain data. The scalar product is given in Fourier frequency space by

(a|b) = 4 Re ∫_0^∞ df ã(f) b̃*(f)/S(f), (19)

where S(f) is the one-sided power spectral density of the noise.
For this analysis, we do not use the full GW likelihood, and instead use a toy model similar to the one used in TH21. The GW likelihood in Eq. 3 is a central quantity of the mock study. The GW likelihood provides a model to generate error budgets on the estimation of the luminosity distance, and it is also used to define the detection probability

P_det^GW(z, H_0) = ∫_{detectable x} dx L_GW(x|d_L(z, H_0)), (20)

where the integral is done over all the possible detectable GW signals. In principle, the GW likelihood and detection probability should take into account all the GW parameters, but in this simplified mock study we will just model them as functions of d_L. We simplify the GW likelihood by defining an "observed" luminosity distance d̂_L^i, which replaces the "observed data" x_i. We use a likelihood model

L_GW(d̂_L^i|d_L) ∝ exp[-(d̂_L^i - d_L)^2/(2σ_dL^2)], (21)

where σ_dL = A d_L(z, H_0), with A a constant fractional error. Note that Eq. 21 is a probability density function of the "measured" luminosity distance d̂_L^i. In other words, given a value of the true luminosity distance d_L, the probability of obtaining a certain measured value d̂_L^i is distributed according to a normal distribution. However, we note that the reconstruction of the "true" luminosity distance d_L given an observed value d̂_L^i is not Gaussian. We also note that, while a simple likelihood model for d_L is sometimes assumed in GW analyses (e.g. Palmese et al. 2021) using the bayestar (Singer & Price 2016) sky maps, in general this likelihood will take more complicated forms.
Regarding the detection process, we assume that GWs are detectable if their measured luminosity distance is smaller than a threshold, d̂_L < d̂_L^thr = 1550 Mpc. The GW detection probability in Eq. 20 can then be written using the likelihood model in Eq. 21 as

P_det^GW(z, H_0) = ∫ dd̂_L Π(d̂_L) L_GW(d̂_L|d_L(z, H_0)) (22)
= (1/2) {1 + erf[(d̂_L^thr - d_L(z, H_0)) / (√2 A d_L(z, H_0))]}, (23)

where the prior term Π(d̂_L) = Θ(d̂_L^thr - d̂_L). In the above equations, Θ is a Heaviside step function of d̂_L, dropping to 0 for d̂_L > d̂_L^thr, and "erf" is the error function of a standardized normal distribution. In TH21 it is argued that P_det^GW(z, H_0) is a Heaviside step function dropping to 0 at d̂_L^thr. This is correct only in the limit that σ_dL → 0 (for small error budgets). In Fig. 1 we plot the detection probability as a function of the true luminosity distance of the GW. When the GW likelihood has a non-negligible error, the GW detection probability does not drop sharply to zero. This is expected, as the scatter in the measured d_L makes us able to detect sources whose true luminosity distance is above the detection threshold. The detection probability reduces to a Heaviside step function only when the distance error goes to 0 (i.e. when we are perfectly able to measure the luminosity distance).
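A few lines of code make this threshold behaviour concrete. For the Gaussian likelihood model with σ_dL = A d_L and a cut on the measured distance, the detection probability defined above is an error function, not a step; this sketch checks its limits (the 1550 Mpc threshold follows the text, the other numbers are illustrative):

```python
import numpy as np
from math import erf, sqrt

def p_det(d_true, d_thr=1550.0, A=0.2):
    """P(measured distance < d_thr) for a Gaussian likelihood, sigma = A * d_true."""
    return 0.5 * (1.0 + erf((d_thr - d_true) / (sqrt(2.0) * A * d_true)))

# At the threshold itself the detection probability is exactly 1/2:
assert abs(p_det(1550.0) - 0.5) < 1e-12

# Sources with TRUE distance beyond the threshold can still be detected,
# because the measured distance can scatter below the cut:
assert p_det(1800.0, A=0.3) > 0.1

# As the fractional error shrinks, P_det approaches a Heaviside step function:
assert p_det(1600.0, A=1e-4) < 1e-12
assert p_det(1500.0, A=1e-4) > 1.0 - 1e-12
```

Treating P_det as a sharp step while the mock data are scattered with a finite A is exactly the kind of selection-effect mismodeling that produces a spurious bias.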

Full likelihood derivation
As described above, when writing down Eq. (3) we have effectively assumed that the GW likelihood depends on a distribution p_CBC(z), which is known. We then proceeded to construct p_CBC(z) from EM observations, {ẑ_g}, of galaxies in the catalogue via Eq. (13). Using the EM data as a prior for the GW data is perfectly legitimate, but the usual form of the GW likelihood, Eq. (3), is then just an approximation. The GW likelihood actually depends on the true, and unknown, redshifts of the galaxies, {z_g}. In the absence of selection effects, this distinction does not matter for a single event, which we will now demonstrate. The GW likelihood for a single observation, x_i, is

L_GW(x_i|{z_g}, H_0) = (1/N_gal) Σ_g L_GW(x_i|d_L(z_g, H_0)), (24)

the likelihood for the EM observations is

L_EM({ẑ_g}|{z_g}) = Π_g L_red(ẑ_g|z_g), (25)

and the prior on the galaxy redshifts (ignoring clustering) is

p({z_g}) = Π_g p_bg(z_g), (26)

which we assume to be independent of the prior on H_0, so that the joint prior on the population-level parameters is p({z_g}, H_0) = p({z_g}) p(H_0). The joint likelihood for the EM observations comprising the catalogue and the GW data is

L({x}, {ẑ_g}|{z_g}, H_0) = L_EM({ẑ_g}|{z_g}) Π_i L_GW(x_i|{z_g}, H_0), (27)

from which we obtain the posterior on the population parameters, (H_0, {z_g}), via Bayes' theorem,

p(H_0, {z_g}|{x}, {ẑ_g}) ∝ p(H_0) p({z_g}) L({x}, {ẑ_g}|{z_g}, H_0). (28)

We can now integrate out the unknown true galaxy redshifts, {z_g}. For each term in the sum over galaxies entering the GW likelihood, there is precisely one of these integrals that is over the corresponding galaxy redshift.
The other integrals are independent of the GW data and reduce to the evidence for the EM observation of that galaxy, Z_g ≡ ∫ dz_g L_red(ẑ_g|z_g) p_bg(z_g). We deduce that the marginalised posterior takes the form

p(H_0|{x}, {ẑ_g}) ∝ p(H_0) Π_i Σ_g ∫ dz_g L_GW(x_i|d_L(z_g, H_0)) p(z_g|ẑ_g), (29)

and so we recover the result derived earlier. However, this alternative way of deriving the posterior reveals two important corrections that are in principle present, but negligible in practice. Firstly, we have ignored GW selection effects in the above. These are straightforward to include by replacing L_GW by L_GW/p_det(H_0, {z_g}). The detection probability, p_det(H_0, {z_g}), is the integral of the GW likelihood over data sets deemed above the threshold and hence included in the analysis, i.e.,

p_det(H_0, {z_g}) = (1/N_gal) Σ_g P_det^GW(z_g, H_0). (30)

However, this is a function of the true values of the galaxy redshifts. This dependence of the denominator of the GW likelihood on {z_g} breaks the separability of the integrals that we exploited above, unless the true redshifts of the galaxies are perfectly known. In that case, it can be seen that the full hierarchical likelihood reduces to Eq. 15. In practice, however, we will not know the true redshifts of the galaxies. The detection probability is effectively an average of the galaxy redshift distribution over the whole volume within which GW sources can be observed. If that volume is sufficiently large, as is the case for current GW detectors, then the average galaxy distribution will be approximately uniform in comoving volume, and so the dependence of the detection probability on the actual galaxy redshifts will be relatively weak; this term can therefore be factored out, reducing the result to the simpler expression used in this paper.
The second correction arises when considering multiple GW observations. The likelihood is the product of the likelihoods for each GW observation, but each one of those likelihoods includes the sum over galaxies. This will introduce cross terms which are not simply the product of the individually-marginalized likelihoods, but marginals of the product of the likelihoods over the true redshifts of the galaxies. These terms represent corrections for the case when multiple GW events are observed from the same galaxy. In practice, these corrections are small, as the cross terms are suppressed by a factor of ∼ 1/N_gal relative to the dominant independent-host terms. These terms will only become important once a significant number of sources in the catalogue share a host. With typical merger rates of a few tens per Gpc^3 per year and approximately one galaxy per cubic megaparsec, the typical spacing of mergers in any given galaxy is millions of years, so we will essentially never be in a regime where these corrections matter. Once again, we emphasise that these terms only arise when the galaxy redshift measurements are imperfect. When galaxy redshifts are known perfectly, Eq. 15 is still exact when analysing many GW sources.
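The back-of-the-envelope estimate above can be written out explicitly; the specific inputs (30 mergers per Gpc^3 per year, one galaxy per Mpc^3) are the rough figures quoted in the text:

```python
# Rough numbers from the text: a few tens of mergers per Gpc^3 per year,
# and roughly one galaxy per cubic megaparsec (1e9 galaxies per Gpc^3).
rate_per_gpc3_yr = 30.0
galaxies_per_gpc3 = 1.0e9

rate_per_galaxy_yr = rate_per_gpc3_yr / galaxies_per_gpc3  # ~3e-8 mergers/galaxy/yr
mean_wait_yr = 1.0 / rate_per_galaxy_yr                    # mean spacing per galaxy

# Millions (indeed tens of millions) of years between mergers in a given galaxy:
assert mean_wait_yr > 1.0e6
```

This is why the shared-host cross terms are irrelevant for any realistic catalog, while a mock data set with an artificially inflated merger rate can enter the regime where they matter.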
To conclude this section, we note that the inconsistency described here does not represent a fundamental limitation of the galaxy catalogue approach. The analysis can be done consistently by computing the posterior using the full expression,

p(H_0, {z_g}|{x}, {ẑ_g}) ∝ p(H_0) Π_g L_red(ẑ_g|z_g) p_bg(z_g) Π_i [L_GW(x_i|{z_g}, H_0) / p_det(H_0, {z_g})], (31)

where p_det(H_0, {z_g}) is given in Eq. (30). This expression involves the multiplication of a sum over galaxies for each GW event, followed by an integration over all the unknown true galaxy redshifts. This is extremely computationally expensive, which is why it is better to use the standard approximate expression, especially since the latter is expected to be accurate in any application to real data. Nonetheless, to further illustrate this issue and its resolution, we will show a simplified example of the application of the full-likelihood framework in Sec. 3.3.

RESULTS
In this section we present results on H_0 using the statistical framework discussed in Sec. 2. We start by describing the simulation of the mock data in Sec. 3.1, and in Sec. 3.2 we show results on H_0 using the statistical framework in the limit that the number density of galaxies is higher than the number density of GWs. In Sec. 3.3 we simulate a case where the number density of galaxies is lower than the number density of mergers.

Simulating the Mock data
We now describe step-by-step how we create the mock data. For this mock study, we use galaxies taken from the MICEcat v1.0 Grand Challenge catalog (Hoffmann et al. 2015; Carretero et al. 2015; Fosalba et al. 2014; Crocce et al. 2015; Fosalba et al. 2015). The MICEcat simulation is a lightcone simulation covering 1/8th of the sky out to a redshift of 1.4, down to halo masses of 2.2 × 10^11 h^-1 M_⊙, with a total of about 205 million galaxies. The fiducial cosmological model in MICEcat v1.0 is a flat ΛCDM model with cosmological parameters: H_0 = 70 km/s/Mpc, Ω_m = 0.25, Ω_b = 0.044, Ω_Λ = 0.75, n_s = 0.95, σ_8 = 0.8. We also choose this model for our simulations. We note that MICEcat was previously used in Fishbach et al. (2019), which showed explicitly that dark siren estimations of H_0 are unbiased. Below we highlight how MICEcat is employed for this work.
When generating our mock data, we follow the method in TH21 as closely as possible. We take two sky directions, referred to as "Direction 1" and "Direction 2", and for each of these we select galaxies found within an opening angle from the line-of-sight of θ = 1 and 5 deg, leading to four different scenarios. To simplify the notation, we adopt the nomenclature D_15 to indicate Direction 1 with an opening angle of 5 deg, and similarly for the other combinations of directions and opening angles. On top of this choice, to reduce the overall number of galaxies being considered, we also subsample the galaxies in Direction 2 by half.
To simulate GW events, given N_gal galaxies with a set of true redshifts {z_g}, we randomly choose N_ev of them in which to simulate a GW signal. We decide to draw GWs only from galaxies with true redshift below z_draw = 1.4. This is reasonable because the luminosity distance at this redshift (∼ 10 Gpc) is much larger than our threshold distance for the GW events. The probability of detecting a GW event from that distance, even when allowing for fluctuations due to the difference between measured and true luminosity distance, is negligible. For each GW signal, we compute the true luminosity distance d_L^i by converting the true redshift using the fiducial MICEcat cosmological parameters. Using the likelihood in Eq. 21, we draw an observed value of the luminosity distance d̂_L^i from a Gaussian centered at d_L^i with σ = A d_L^i. If d̂_L^i is lower than the threshold d̂_L^thr = 1550 Mpc (chosen to approximately match the detectability threshold used in TH21), we label the signal as detected, and it can be used for cosmological inference.
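The generation step just described - draw host redshifts, convert to true distances with the fiducial cosmology, scatter by the likelihood model, and select on the measured distance - can be sketched as below. The uniform toy redshift distribution stands in for the MICEcat galaxies and is purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
C_KMS, OM, H0_TRUE = 299792.458, 0.25, 70.0   # fiducial flat LambdaCDM
A, D_THR = 0.2, 1550.0                         # fractional error; threshold [Mpc]

def d_L(z):
    """Luminosity distance [Mpc] at the fiducial cosmology."""
    zg = np.linspace(0.0, z, 512)
    Ez = np.sqrt(OM * (1.0 + zg) ** 3 + (1.0 - OM))
    return (1.0 + z) * C_KMS / H0_TRUE * np.trapz(1.0 / Ez, zg)

# Toy stand-in for the catalog: true host redshifts drawn below z_draw = 1.4.
z_host = rng.uniform(0.01, 1.4, size=5000)
d_true = np.array([d_L(z) for z in z_host])

# Scatter by the Gaussian likelihood model, then select on the MEASURED distance.
d_obs = rng.normal(d_true, A * d_true)
detected = d_obs < D_THR

# Crucially, some detected events have TRUE distances beyond the threshold:
assert detected.any()
assert (d_true[detected] > D_THR).any()
```

The final assertion is the point of the next paragraph: cutting the injections at the redshift corresponding to the threshold distance would remove exactly these events and make the simulation inconsistent with the selection model.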
A crucial aspect of the simulation is the choice of z_draw in comparison to the choice of d̂_L^thr. Following TH21, we might be tempted to simulate GW events only from galaxies with redshift below a z_draw value obtained by converting d̂_L^thr to a redshift for given cosmological parameters (H_0 and Ω_m). This procedure would only allow us to detect GW sources with a true luminosity distance below the detection range, and this is inconsistent with the framework assumed to correct for selection biases. As we argued with Fig. 1, we would expect to find a significant number of sources that are detected even though their true luminosity distance is higher than d̂_L^thr (the detection range). By selecting z_draw = 1.4 instead, we are able to simulate GWs at high redshift that become detectable due to fortuitous noise fluctuations. Note that we could always choose to inject GW events in 0 < z < 0.3, but we would then need to account for this choice, using p_rate(z) = 0 if z > 0.3, consistently implemented in the statistical framework. In our analysis, we include the fact that p_rate(z) = 0 for z > 1.4, in order to avoid introducing any possible bias.
The final step of the simulation is to calculate the p_CBC(z) to use for the inference in the cases for which we assume that the galaxy redshifts are not perfectly known. To do so, we draw a set of observed redshifts {ẑ_g} from the true ones {z_g}, according to the likelihood model in Eq. 17. Using Eq. 16, we build a posterior on redshift for every galaxy, which is later used to build the approximant for p_CBC(z). To build the approximant, we use a uniform in comoving volume prior,

p_bg(z) ∝ dV_c/dz ∝ [∫_0^z dz'/E(z')]^2 / E(z), (32)

where the H_0 dependence cancels out upon normalization, so that the prior depends only on the value of Ω_m = 0.25 used in the MICEcat v1.0 simulation.
In Fig. 2 we show the reconstruction of the galaxy density profile as a function of redshift for the different lines-of-sight (LOS). We can see that the constructed interpolant for p_CBC(z) tracks the true galaxy density profile of the catalog.

H0 inference in the low-galaxy density limit
Following TH21, we simulate 200 GW dark sirens for each of the four directions D11, D15, D21 and D25, with three different scenarios for the error on the GW likelihood: σ_dL/dL = 10%, 20% and 30%. For this case, we assume that we know the redshift of the galaxies perfectly. We use the hierarchical likelihood in Eq. 15, which we have shown to be formally correct even in the low-galaxy density limit, which occurs for D11 and D21.
In Fig. 3 we show the H0 posteriors that we obtain for each line of sight and for each error budget of the GW likelihood. From the plot, we can see that the lines of sight with more galaxies correspond to less constraining power on H0. This is expected, as the GW localization volume will include a larger number of galaxies, and the effect of structures in the galaxy distribution (which can provide useful redshift information) is reduced. Interestingly, we note that for the cases with σ_dL/dL = 30% it is possible to obtain a posterior that is slightly more informative than for σ_dL/dL = 20%. This might be counter-intuitive, but it is consistent with the fact that for σ_dL/dL = 30% some GW events might be generated at the redshift edge of our simulation, z_draw = 1.4. This effectively introduces another redshift scale (an edge) into the inference, which can then help in inferring H0. This effect can be visualized from the GW likelihood. In Fig. 4, we plot the GW likelihood as a function of redshift for H0 = 140 km/s/Mpc for a signal detected at dL^thr with σ_dL/dL = 20% and σ_dL/dL = 30%. We can see that the GW likelihood for σ_dL/dL = 30% has non-negligible support beyond z_draw = 1.4, which is excluded from our statistical analysis. This adds extra information and allows us to quickly exclude high values of H0, as shown in Fig. 3.
To show more quantitatively that the statistical method for dark sirens does not present any significant bias, we generate a probability-probability (PP) plot. PP plots are used to test that parameters subject to Bayesian inference (in our case H0) are consistently recovered. To generate a PP plot, we repeat the H0 inference for 200 GW signals 100 times, each time drawing an injected H0 value from the explored prior range of [20, 140] km/s/Mpc. We then check within which credible interval the true value for each simulation is found. If the analysis is performed properly, we expect to see that, e.g., 40% of the injections are found within the 40% credible interval. Fig. 5 shows the PP plot for our mock data with two lines of sight and varying error budgets on dL.
As we can see from the plot, the Bayesian framework is suitable for performing the H0 inference, as in all cases the curves are consistent with the diagonal within 3σ.
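The PP-plot machinery itself can be sketched with a toy, perfectly calibrated inference (the Gaussian measurement model and its σ are illustrative, not our actual pipeline):

```python
import numpy as np

rng = np.random.default_rng(7)

def credible_level(posterior_samples, truth):
    """Fraction of posterior samples below the true value; for a
    well-calibrated analysis this quantile is uniform on [0, 1]."""
    return np.mean(posterior_samples < truth)

# Toy stand-in for the full inference: each realization draws a true
# H0 from the prior range, simulates a noisy measurement, and returns
# posterior samples centred on the measurement.
n_runs, sigma = 100, 10.0
levels = []
for _ in range(n_runs):
    h0_true = rng.uniform(20.0, 140.0)
    h0_obs = h0_true + rng.normal(0.0, sigma)      # simulated measurement
    posterior = rng.normal(h0_obs, sigma, size=5000)
    levels.append(credible_level(posterior, h0_true))
levels = np.sort(levels)

# PP-plot data: empirical CDF of the quantiles against the diagonal.
ecdf = np.arange(1, n_runs + 1) / n_runs
max_dev = np.max(np.abs(ecdf - levels))
print(f"maximum deviation from the diagonal: {max_dev:.3f}")
```

An unbiased analysis produces quantiles consistent with a uniform distribution; a systematic shift of the posteriors would show up as a coherent departure from the diagonal.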
Moreover, we have also performed simulations assuming errors on the galaxy redshifts. In the limit that N_GW ≥ N_gal, the simplified framework cannot be used (e.g. cases D11 and D21), otherwise there could be a bias (see Sec. 3.3 for more details). Therefore, we restrict the simulation to D15 and D25. Figs. 6 and 7 show a sample of H0 posteriors and the PP plots generated for these cases. We observe that the statistical dark siren method is unbiased in this case.

The one galaxy limit
As described in Section 2.3, the standard analysis makes some approximations that are only valid in the limit that the gravitational wave detection volume contains a sufficiently large number of galaxies. To illustrate this, we consider here an extreme and unrealistic example in which the galaxy catalogue contains only one galaxy, whose redshift is known imperfectly from electromagnetic observations. As in the previous cases, we generate a catalogue of 200 GW sources and construct the posterior on H0 using the standard analysis that employs the approximate likelihood.

In Figure 8 we show the posteriors obtained when the fractional error on the EM redshift measurement is δz/z = 3% or 0.3%. We see that with a 3% error the posterior is shifted significantly to the right and shows a large bias. For the smaller redshift uncertainty the posterior appears unbiased, but a small bias is in fact still present on average. This is revealed by constructing a PP-plot over 100 random realisations of the Universe. The PP-plot is shown in Figure 9, and there is a clear and significant departure from the diagonal, even when δz/z = 0.3%.

Figure 8. Comparison between posteriors obtained in the one-galaxy limit, with 200 observed gravitational wave sources, computed using the standard approximate likelihood given in Eq. (3) or using the full likelihood given in Eq. (33). We show results for fractional errors in the EM measurement of the galaxy redshift of δz/z = 0.3% and 3%, and also the perfect-measurement limit δz/z = 0%.
These biases arise from the fact that in the standard analysis the approximation has been used that the true redshift of the galaxy can be marginalised out of the likelihood separately for each GW event. To carry out a consistent analysis we must use the full likelihood given in Eq. (31); in the case that there is only one galaxy, with unknown true redshift z and observed redshift ẑ, this simplifies to the form given in Eq. (33). The results from analysing the same datasets with the full likelihood are also shown in Figures 8 and 9. We see that the full likelihood broadens the posterior, making it consistent with the true value even in the case of large redshift uncertainties, and it generates a PP-plot that is perfectly consistent with the expected diagonal.
The figures also show results computed in the perfect redshift-measurement limit, δz/z = 0%. In this limit the full and approximate likelihoods are exactly equivalent, and so we recover unbiased results using either likelihood.
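The difference between the approximate and full treatments in the one-galaxy limit can be made concrete with a small numerical sketch; the redshift posterior, Hubble-law distances, and event values below are illustrative stand-ins for the quantities entering Eqs. (31) and (33):

```python
import numpy as np

# Single galaxy: grid over its unknown true redshift z, with a broad
# (photometric-like) redshift posterior p(z | zhat).
z = np.linspace(0.01, 0.3, 3001)
p_z = np.exp(-0.5 * ((z - 0.15) / 0.02) ** 2)
p_z /= p_z.sum()

def event_like(d_i, z, h0=70.0, sigma=100.0):
    """Toy GW likelihood p(d_i | z, H0): Gaussian around the low-z
    Hubble-law distance (all numbers here are illustrative)."""
    d_of_z = 299792.458 * z / h0
    return np.exp(-0.5 * ((d_i - d_of_z) / sigma) ** 2)

d_events = [600.0, 640.0, 660.0]   # three observed distances [Mpc]

# Approximate analysis: marginalize the galaxy redshift separately
# for each event, then multiply the per-event results.
approx = np.prod([np.sum(event_like(d, z) * p_z) for d in d_events])

# Full analysis: multiply the per-event likelihoods first, then
# marginalize the single shared redshift once.
full = np.sum(np.prod([event_like(d, z) for d in d_events], axis=0) * p_z)

print(approx, full)   # the two disagree once the redshift error matters
```

The approximate treatment pays the redshift-uncertainty penalty once per event, whereas the full treatment pays it only once for the shared redshift, which is why the two likelihoods differ whenever the redshift error is non-negligible.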
The reason for going through this example is that it illustrates the importance of ensuring, when generating mock catalogues, that they have a realistic galaxy density. Using galaxy catalogues that are too sparse can lead to biases because of the inconsistency in the standard analysis that becomes apparent in this limit. However, these biases are not indicative of a fundamental problem in the analysis, and can be avoided by using the full likelihood of Eq. (31).

Figure 9. PP-plots for the one-galaxy analysis, with 200 observed gravitational wave sources, comparing the results using the approximate and full likelihoods. We consider the same values of the fractional error in the redshift measurement that were shown in Figure 8.

DISCUSSION
We have shown that the dark siren statistical method, utilizing a galaxy catalog of potential hosts, is able to produce unbiased estimates of H0. This is true, however, only if the statistical framework is consistent with the generative model of the population. An analysis model that is inconsistent with the simulations will incur biases.
Trott & Huterer (2021) conclude that the bias on H0 is introduced by the absence of galaxy clustering, which is exacerbated when the number of galaxies in a given direction is large (e.g. the cases D15 and D25), or when the error budget on the GW luminosity distance is large (leading to the clustering effects being washed out in the GW localization volume). We argue that the conclusions in Trott & Huterer (2021) are likely due to an inconsistent procedure for generating their mock data and calculating the selection effects for their GW events. The following sections demonstrate that an absence of structure in the redshift distribution of galaxies does not lead to a biased estimate of H0 (the result merely becomes uninformative), and that biases similar to those seen in TH21 can be produced by introducing inconsistencies between the mock data generation and the analysis.

The absence of clustering
In Sec. 3 we have shown that in the limit that there are many galaxies along the line of sight (e.g. D15 and D25) we recover an unbiased H0 result. As discussed above, in this limit we expect the constraining power on H0 to significantly decrease. It can be demonstrated mathematically, as we do in App. A, that under some simplified assumptions the H0 posterior is expected to be flat (assuming a flat prior) in this case.
To further demonstrate this, here we perform a full simulation of this measurement. We repeat the steps of simulating a Mock Data Challenge (MDC) explained in Sec. 3.1, with the only difference that the galaxy distribution is synthetically generated to be uniform in comoving volume without any large-scale clustering, i.e. p_gal(z) ∝ dVc/dz. We draw 2 × 10^6 galaxies for this case, in order to have a distribution that is as continuous as possible, and we use the same uncertainty model for the observed redshifts of the galaxies. In Fig. 10, we show the combined H0 posteriors of 200 GW signals for relative errors on dL in the GW likelihood model of 10%, 20% and 30%. We note that the recovered H0 posterior is not as constraining as in the previous cases, as expected, but it does not display any noticeable bias. Therefore, contrary to Trott & Huterer (2021), we conclude that the absence of clustering is not expected to introduce a bias in the H0 estimation. In reality, the Universe is not uniform in comoving volume on smaller scales, as its large-scale structure does exhibit clustering, but the effect of clustering is to enhance the H0 constraint, rather than being essential to it.
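Generating such an unclustered catalog amounts to inverse-CDF sampling of p_gal(z) ∝ dVc/dz; a sketch, assuming the same Ωm = 0.25 and z_max = 1.4 as above (the grid size is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
Om, z_max, n_gal = 0.25, 1.4, 2_000_000

# Tabulate p_gal(z) ∝ dVc/dz ∝ Dc(z)^2 / E(z) on a grid.
z = np.linspace(1e-4, z_max, 4000)
E = np.sqrt(Om * (1 + z) ** 3 + 1 - Om)
Dc = np.cumsum(1.0 / E) * (z[1] - z[0])   # ∝ comoving distance
pdf = Dc**2 / E
cdf = np.cumsum(pdf)
cdf /= cdf[-1]

# Inverse-CDF sampling: a synthetic catalog with no clustering at all.
z_gal = np.interp(rng.random(n_gal), cdf, z)
```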

Possible sources of inconsistency in implementing the standard siren method
In this section we discuss how possible inconsistencies between the statistical framework and the generation of mock data can easily introduce biases in the estimation of H0. It is difficult to characterize the interplay of different issues and how they might translate into a final H0 bias; for this reason, we explore one possible inconsistency at a time. In what follows we focus on a number of possible errors and demonstrate how they impact the H0 posterior.
Inconsistency 1, double counting. A first issue is assigning GW events to the sample of MICEcat galaxies using a weight factor similar to Eq. 13. However, since the sample of MICEcat galaxies is already distributed as Eq. 13, this procedure effectively introduces a "double counting" problem. In other words, if the distribution of galaxies follows a z^2 profile, the distribution of GW events will follow a z^4 profile. This has a crucial impact on the H0 inference when combined with errors in the luminosity distance generation and the presence of a detection threshold. Understanding the implication for the H0 bias is not trivial for the dark siren approach, so we perform a simulation. In Fig. 11 (top panel), we repeat the analysis as in Sec. 3 but generate GWs in galaxies with an extra weight factor given by Eq. 13 (the analysis is then performed normally). As we see, this type of mismatch usually results in a bias towards lower values of H0.
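The double counting can be verified with a toy example in which the catalog is already distributed as p(z) ∝ z^2 on [0, 1] (an illustrative stand-in for the volumetric weighting of Eq. 13):

```python
import numpy as np

rng = np.random.default_rng(0)

# Galaxies already distributed with a p(z) ∝ z^2 profile
# (inverse-CDF draw: u^(1/3) maps uniform u to p ∝ z^2 on [0, 1]).
z_gal = rng.random(200_000) ** (1.0 / 3.0)

# Correct: hosts drawn uniformly among catalog galaxies -> p(z) ∝ z^2.
hosts_ok = rng.choice(z_gal, size=50_000)

# Inconsistent: weighting the already z^2-distributed galaxies by a
# further z^2 factor "double counts" the volume -> p(z) ∝ z^4.
w = z_gal**2
hosts_bad = rng.choice(z_gal, size=50_000, p=w / w.sum())

# Mean redshift on [0, 1]: 3/4 for p ∝ z^2 but 5/6 for p ∝ z^4,
# so the doubly counted hosts sit systematically at higher redshift.
print(hosts_ok.mean(), hosts_bad.mean())
```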
Inconsistency 2, GW detection probability. A second issue is a general misinterpretation of observed quantities ("data") and latent variables ("true values") and their interaction with the selection biases. As we detail in Sec. 2, the selection bias on the GW side should be corrected using the GW detection probability, which is defined by integrating the likelihood over all possible data sets that would be detected given a true value of the luminosity distance. TH21 assumes that the detection probability is a Heaviside step function of the true luminosity distance, with the probability dropping to 0 at dL^thr. This is correct only in the limit that there are no errors on the luminosity distance estimation. Fig. 11 (second panel) shows that, when the luminosity distance estimation comes with a non-negligible uncertainty, assuming a Heaviside step function for the detection probability can bias the estimate of the Hubble constant to lower values.
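For a toy Gaussian distance likelihood in the spirit of Eq. 22, with σ_dL proportional to the true distance, the detection probability is the integral of the likelihood up to the threshold and has a closed form; the threshold and error values below are illustrative:

```python
import math

def p_det(d_true, d_thr, sigma_frac):
    """Probability that a source with true distance d_true is detected,
    i.e. that its MEASURED distance scatters below d_thr, for a Gaussian
    likelihood with sigma_dL = sigma_frac * d_true (toy model)."""
    if sigma_frac == 0.0:
        return 1.0 if d_true < d_thr else 0.0   # Heaviside limit
    x = (d_thr - d_true) / (sigma_frac * d_true)
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

# At the threshold the detection probability is 50%, not a step ...
print(p_det(1500.0, 1500.0, 0.2))
# ... and sources beyond the threshold can still be detected.
print(p_det(1800.0, 1500.0, 0.2))
```

Replacing this smooth function with the σ_dL → 0 Heaviside limit while the mock data are generated with σ_dL ≠ 0 is precisely the mismodeling discussed above.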
Inconsistency 3, z_draw not accounted for. A third potential misconception arises when GW events are drawn from galaxies with z < 0.3, but this aspect is not accounted for in the analysis. In order to understand the biases introduced by this type of error, we simulate GW events from galaxies at z < 0.3 and then build the interpolant p_CBC(z) for the analysis using z_draw = 1.4. In other words, the analysis also accounts for selection effects due to GW hosts present at 0.3 < z < 1.4, while the simulation only considered GW hosts at z < 0.3. In Fig. 11 (third panel), we show that this type of error can introduce a high-H0 bias. We note that in a real analysis, such as that of Abbott et al. (2021a), we do not know whether there is a maximum redshift at which GW events are generated. However, this is taken into account by fitting flexible merger rate models simultaneously with the value of H0.
Inconsistency 4, GW likelihood mismodeling. Another possible source of error is treating the GW likelihood as if its standard deviation did not depend on the true value of dL, although the simulations are made assuming this dependence (see Section 2). By dropping the overall normalization factor in the GW likelihood, one is in practice ignoring a part of the likelihood that depends on the true luminosity distance. This causes a biased dependence of the H0 posterior on the luminosity distance uncertainty, which we show in Fig. 11 (bottom panel). We find that in this case the inconsistency has the effect of biasing H0 towards lower values for increasing values of σ_dL, similarly to what TH21 found. In the bottom panel of Fig. 11 we show the results using a 1 deg light cone simulation (instead of the 5 deg D15 simulation), as it shows a stronger bias change with σ_dL. We also assume a Heaviside function for the selection effects, as in TH21, so that the σ_dL dependence on the true luminosity distance is not taken into account in this term either. A general discussion of the misinterpretation of latent and observed variables, and of the bias caused by assuming that observables depend on intrinsic properties in the mocks but not in the likelihood modeling, in the context of GWs, is also provided at https://github.com/farr/MockPosteriors.
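The effect of dropping the dL-dependent normalization can be seen directly by comparing the two likelihood shapes on a grid of true distances (the observed distance and fractional error are illustrative):

```python
import numpy as np

d_obs, sigma_frac = 1500.0, 0.3            # illustrative numbers
d_grid = np.linspace(500.0, 4000.0, 20001)
sigma = sigma_frac * d_grid                # sigma tracks the TRUE distance

gauss = np.exp(-0.5 * ((d_obs - d_grid) / sigma) ** 2)
full_like = gauss / sigma                  # keeps the 1/sigma(dL) factor
trunc_like = gauss                         # normalization factor dropped

d_full = d_grid[np.argmax(full_like)]
d_trunc = d_grid[np.argmax(trunc_like)]
print(d_full, d_trunc)                     # the peaks do not coincide
```

The truncated likelihood peaks exactly at the observed distance, while the properly normalized one peaks at a lower true distance; the mismatch grows with σ_dL, which is why this inconsistency imprints a σ_dL-dependent bias on H0.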
Inconsistency 5, simulating dark siren events along the same line of sight. The statistical method used in TH21 is for the high-galaxy density regime, but this is inconsistent with the generation of data from D11 and D21, where it is possible for multiple GW events to be drawn from the same low-redshift host galaxy. Some of the H0 biases for D11 and D21 might originate from this inconsistency, as we have shown in Sec. 3.3. However, in our own analyses, this produced a noticeable bias less frequently than the other potential sources of bias we explored. Moreover, we find that a low-H0 bias will arise if there are low-redshift overdensities along the line of sight and the full likelihood analysis is not used. When simulating multiple events along the same line of sight, if there is an overdensity of galaxies in that direction, the H0 posterior is expected to display a peak at an H0 lower than the input value, corresponding to the redshift of the overdensity and the luminosity distance of the peak of the GW distribution (likely ∼ 1500 Mpc in this work, close to the detection horizon). Especially when using the light cones with fewer galaxies, there are very few galaxies at the redshifts of interest, z ≲ 0.3, so we are sensitive to the small-number statistics of a peak in the redshift distribution that may shift from realization to realization (one does not always get the same redshift peak, and thus the same corresponding H0 peak, in all realizations). As there are significantly more galaxies and more volume at larger redshifts than at lower redshifts, the effect of overdensities on the H0 posterior is typically less pronounced in the former regime than in the latter, so this effect is more significant at lower H0. Because in reality it is extremely unlikely to have multiple events from one narrow line of sight, we advise against using a small-area light cone simulation for dark siren analyses. Alternatively, the presence of the same galaxies across multiple GW events needs to be taken into account with the full likelihood analysis so as not to incur a bias.

CONCLUSIONS
We have presented an introduction to the dark siren statistical method. We have explicitly demonstrated that, contrary to TH21, the dark siren statistical method can robustly recover an unbiased estimate of H0, and we expect the method to produce unbiased estimates of additional cosmological parameters. We have also shown that, in the limit of the absence of galaxy clustering, the dark siren statistical method continues to provide an unbiased estimate of H0. Notebooks showing how to reproduce this analysis can be found at https://github.com/simone-mastrogiovanni/hitchhiker_guide_dark_sirens.
We emphasize that this work is not a forecast, but instead is meant to provide a pedagogical introduction to the dark siren approach, highlighting common pitfalls. Because of the specific assumptions that we have made, this is not a realistic simulation of GW events and their detections, and we do not provide estimates for the future sensitivity of dark siren probes (see e.g. Chen et al. 2018b; Gray et al. 2020 for forecasts). This is why we do not specifically quote precision measurements of H0 throughout the manuscript.
Moreover, we remind the reader that a data analysis framework able to analyze real GW data is, in some aspects, more complicated than what is discussed in this paper. On the GW side, as has already been noted in the literature, selection effects based on GW sensitivity studies injecting signals in real data should also be carefully taken into account for standard sirens with counterparts (e.g. Mortlock et al. 2019; Farr 2019). Moreover, assumptions about the underlying population of compact object binaries can have a significant impact on estimating the Hubble constant, not only for the galaxy catalog approach (Abbott et al. 2021a), but perhaps more importantly for a GW-only approach to standard siren cosmology (Mastrogiovanni et al. 2021; Mukherjee 2022; Ezquiaga & Holz 2022; Karathanasis et al. 2022). On the galaxy catalog side, there are several challenges related to the completeness correction of the galaxy catalog and the presence of a possible correlation between galaxy intrinsic luminosity and merger rates (Gray et al. 2020; Gray et al. 2022). Also, techniques exploring spatial clustering using cross-correlation need to properly mitigate the effects of GW bias parameters and their redshift evolution, as demonstrated in Mukherjee et al. (2021, 2022). Clearly, the full Bayesian framework is more complicated for a statistical dark standard siren analysis than for the case where a unique host galaxy is identified, so it is understandable that this method poses more challenges than the one simulated in this paper. However, it is important to note that all standard siren methods are dependent on the same assumptions and potential sources of mismodeling considered here.
Another caveat of this analysis, since it is built in part around the assumptions of TH21 in order to show how not to incur the biases they claim to find, is that of drawing a large number of events from the same line of sight. In this analysis, we have used only two directions from which to simulate all of the GW sources. When drawing from the same line of sight, the contribution of the large-scale structure is always the same, so the posterior may prefer the specific values of H0 corresponding to the overdensities along the line of sight at distances close to the peak of the detected GW source distribution, thereby lacking the expected variation in the large-scale structure distribution. In reality, over different lines of sight, the contributions from the different underdensities and overdensities will cancel out. We caution the reader against running the simulations we have made available with a number of GW events much larger than the number of galaxies.
To summarize, our analysis shows that using the dark siren statistical method we can measure the value of the Hubble constant H0 without any bias, in contrast to the recent claim by TH21. Accurate measurements of the Hubble constant require correctly taking into account the selection effects on both the galaxy catalog side and the GW side, and a proper modeling of the likelihoods in question. We encourage research studies focused on the impact of population assumptions and selection effects, to advance the entire field of standard siren cosmology.

Figure 1. GW detection probability computed with the toy likelihood model in Eq. 22, as a function of the true luminosity distance. The colors indicate various fractional errors on the luminosity distance. The vertical dashed line indicates the detection threshold for the measured luminosity distance dL. We note that the detection probability only reduces to a Heaviside step function in the limit σ_dL → 0. If σ_dL ≠ 0, a step function in the true luminosity distance cannot be used to account for selection effects, or else the H0 posterior may be biased in a way that depends on σ_dL.

Figure 2. Reconstructions of the galaxy density profile p(z|C) for each of the populations considered in the mock data. The blue histograms indicate the true redshifts of the galaxies, the orange line is the reconstruction of the density profile using Eq. 13, and the black dashed line marks a uniform in comoving volume distribution. For the LOSs with small galaxy numbers, we do not fit the redshift interpolant, as these are used only in the limit that we perfectly know the redshift (in order not to introduce a bias due to the breaking of the statistical framework).

Figure 3. Posteriors of H0 derived for four different lines of sight (D1 and D2) in the MICECat simulations, and for two different sky areas (1° and 5° opening angles for D1x and D2x, respectively) around those directions. The top, middle, and bottom panels assume a standard deviation on the luminosity distance of 10, 20, and 30% of the luminosity distance, respectively. All the posteriors are the result of combining 200 GW events and assume that galaxy redshift is perfectly known. We do not find a bias. Note that any particular choice of the random seed used for the simulation will lead to fluctuations of the posteriors around the true value. However, by performing several simulations one should find that these fluctuations are compatible with the confidence levels at which the true value is found (see the discussion of PP plots in the text).

Figure 4. GW likelihood (vertical axis) as a function of redshift (horizontal axis) for H0 = 140 km/s/Mpc. The lines show the cases σ_dL/dL = 20% (blue) and σ_dL/dL = 30% (orange). The vertical dashed line indicates z_draw.

Figure 6. Posteriors of H0 derived for four different lines of sight (D1 and D2) in the MICECat simulations, and for two different sky areas (1° and 5° opening angles for D1x and D2x, respectively) around those directions. The top, middle, and bottom panels assume a standard deviation on the luminosity distance of 10, 20, and 30% of the luminosity distance, respectively. All the posteriors are the result of combining 200 GW events and assume errors on the galaxy redshifts. We do not notice any significant bias.

Figure 7. PP plot generated for Direction 1 and a nominal error on dL of 20% to 30%. The shaded area shows the 1, 2, 3σ fluctuations of the PP plot around the ideal case. Our simulations show that we are able to recover the input value of H0 without incurring any significant bias.
Figure 10. H0 posteriors for 200 GW events drawn from a uniform in comoving volume distribution of 10^6 galaxies. The several lines indicate various fractional errors on dL for the GW likelihood model. The absence of clustering significantly weakens the result, rendering it nearly uninformative in the limit of large numbers of galaxies, but does not incur significant biases.
Figure 11. H0 posteriors with 200 GW events for several cases with inconsistent choices in either the simulation of mock data or the statistical framework, for the D15 line-of-sight. The solid, dashed and dotted lines are obtained with σ_dL/dL = 10%, 20% and 30%. First panel: GW events are generated by counting twice over the galaxy catalog. Second panel: the detection probability in the analysis is set as a Heaviside step function. Third panel: GW events are drawn from z < 0.3 and this correction is ignored in the analysis. Fourth panel: the luminosity distance dependence of the GW likelihood standard deviation is not properly taken into account in the formalism. Note that for this particular case alone we show D21, which shows a more dramatic behaviour than D15, at least for the specific random seeds we tested.