The clustering of dark sirens' invisible host galaxies

Dark sirens are a powerful way to infer cosmological and astrophysical parameters from the combination of gravitational wave sirens and galaxy catalogues. Importantly, the method relies on the completeness of the galaxy catalogues being well modelled. A magnitude-limited catalogue will always be incomplete to some extent, requiring a completion scheme to avoid biasing the parameter inference. Standard methods include homogeneous and multiplicative completion, which have the advantage of simplicity but underestimate or overestimate the amplitude of structure at low completeness, respectively. In this work, we propose a new method to complete galaxy catalogues which uses clustering information to incorporate knowledge of the large scale structure into the dark sirens method. We find that if the structure of the true number of galaxies is sufficiently well preserved in the catalogue, our estimator can perform drastically better than both homogeneous and multiplicative completion. We lay the foundations for a maximally informative dark sirens analysis and discuss its limitations.


Introduction
Standard sirens have been proposed as a powerful test of fundamental physics, cosmology and astrophysics. The accurate measurement of a gravitational wave (GW) waveform from the inspiral of a compact binary allows for a direct measurement of the luminosity distance to the source, together with other source parameters [1,2]. Unlike electromagnetic probes, for which redshift information is most easily measured, gravitational waves do not directly contain redshift information. This prevents them from being sufficient alone to build a Hubble diagram and constrain cosmological parameters.
The presence of an electromagnetic counterpart in coincidence with a gravitational wave allows for the identification of the host galaxy, from which a spectroscopic redshift can be extracted [2]. After nearly a hundred GW events [3], the signal from the binary neutron star GW170817 [4] remains the only so-called bright siren observed to date. This is despite the optimism that followed this event, when the hope was to obtain a 2% measurement of the Hubble constant with O(50) events within 5 years [5]. In most cases, electromagnetic counterparts are absent and one has to work with dark sirens, relying on statistical methods to extract source redshifts and study cosmology. Inference of the source redshift is informed by two pieces of information. Firstly, knowledge of features in the intrinsic (source-frame) mass distribution of black holes and neutron stars [6][7][8][9][10][11][12][13], when compared to the redshifted masses extracted from a waveform. Secondly, one can exploit a galaxy catalogue to provide the redshift distribution of observed galaxies, and hence a subset of the possible host galaxies of the event. This was proposed in [1], with early tests on GW170817 without the electromagnetic counterpart in [14], applications to real data in [15][16][17][18] and forecasts in [19][20][21][22]. State-of-the-art analyses are now able to incorporate information from both sources simultaneously [15,23]. If the rate of observed bright sirens remains relatively low, the numerical superiority of dark sirens may simply make them the most competitive GW probe of cosmology. Even if more bright sirens are observed in the future, current analyses [15,[23][24][25]] indicate that incorporating dark sirens can provide a significant boost to the results obtained from bright sirens alone.
In this work, we focus on the part of the constraints originating from the galaxy catalogue information. The poor angular resolution of a gravitational wave event, together with large uncertainties on the luminosity distance itself, leads to localisation volumes which can contain hundreds of thousands of galaxies. Assigning cosmological parameters to each potential host yields a broad distribution of cosmological parameters, with 'bump' features reflecting the structure of the line-of-sight redshift distribution, at least for events with strong catalogue support.1 In most cases, the inferred cosmological parameters are expected to be wrong except for the true host. In this sense, the true cosmological parameters should emerge constructively as a peak when several GW events are combined.
On the other hand, the posterior distribution for the wrong cosmological parameters should be set to zero if no galaxy is collected for those values for even one event. As may be clear by now, this technique crucially relies on the completeness of the galaxy catalogue. If the true host galaxy is missing from the catalogue, then one might wrongly eliminate the corresponding cosmological parameters from the realm of possibilities. Often galaxy catalogues are magnitude limited,2 i.e. they cannot see objects fainter than a given apparent magnitude. At a fixed distance, this implies that one is necessarily going to miss some galaxies below the corresponding intrinsic magnitude threshold, and this may sometimes include the true GW host galaxy. In fact, the more distant the source, the fewer galaxies are observed, the more galaxies are missed, and the higher the chance that the true GW host was missed. On that account, it is of paramount importance to know how to complete a galaxy catalogue in an efficient and unbiased way.
The current standard in LIGO-Virgo-KAGRA (LVK) analyses is to implement a technique we will call homogeneous completion (further details will be given in Section 2.1), proposed as an efficient and unbiased way to complete a galaxy catalogue. It assigns a homogeneous probability density for the missing galaxies in comoving volume. As we will see, the downside of this technique is that it damps structure in regions of low completeness, reducing the information content of high-redshift sirens. A pixelated approach to homogeneous completion was implemented in the python package gwcosmo, to account for different levels of completeness in different directions on the sky [16]. Multiplicative completion was proposed in [18], and fills in missing galaxies proportionally to the number of catalogue galaxies. This time, the disadvantage resides in over-amplifying structure at low completeness.3 Machine learning algorithms were also proposed as a way to complete a galaxy catalogue by imitating sky regions with similar completeness [18]. Whilst efficiency concerns are important, we note that in current analyses with gwcosmo the line-of-sight redshift priors for a given catalogue are pre-computed quantities. Therefore, if our new completion method can similarly be precomputed, the additional computation time needed is not problematic. In this work we thus leave efficiency as a secondary goal and study how one can introduce knowledge of the large scale structure to complete a galaxy catalogue. In particular, we introduce an algorithm to estimate the most likely number of missing galaxies in each voxel of the allowed source region for the gravitational wave source, which crucially respects the statistical properties of the galaxy distribution.
This article is structured as follows. In Sec. 2, we outline the theoretical foundations of galaxy completion techniques, reviewing homogeneous and multiplicative completion. We then detail our new variance-informed completion algorithm and give a measure to compare estimators. We outline how we simulate galaxy positions and the protocol we use to remove a subset of 'unobserved' galaxies. In Sec. 3, we implement the galaxy completion algorithms and compare performances in two dimensions, with and without evolution in the completeness fraction. We discuss limitations and applications to gravitational wave cosmology in Sec. 3.4 and conclude in Sec. 4.

Galaxy catalogue completion methods
In this section, we lay out our strategy for how to complete a galaxy catalogue in a manner that takes into account knowledge of the large-scale structure of the Universe. We first fix notation and a few properties of the catalogue, before moving on to describe homogeneous and multiplicative completion, which have been introduced in the literature. Then, we introduce variance completion, which represents the core of this section. Finally, we detail how we simulate a toy galaxy catalogue on which to perform initial tests of the completion methods, and introduce the tools to quantify how well an algorithm is performing.
Consider a comoving volume V in which one believes that the gravitational wave source is located. This can correspond in practice to the 99.9% confidence sky localisation region in angular and redshift space (n, z), translated into a comoving volume.4 One can divide this comoving volume into N_v voxels (3D pixels), or pixels if one is working in 2D. The true number of galaxies n_g^i in each voxel is the sum of the catalogued galaxies n_c^i and the missing galaxies n_m^i, which reads

n_g^i = n_c^i + n_m^i .    (2.1)

Of these, only the catalogued galaxies n_c^i are known. The values of n_m^i and n_g^i have to be estimated. An implicit subtlety in this notation is the minimum mass, luminosity or absolute magnitude for an object to count as a galaxy. In order for the analysis that follows to make sense, it is important to count galaxies which satisfy a fixed selection criterion. Once the criterion is fixed, it should be applied throughout the analysis. In particular, this should hold for the average number density, as well as for the standard deviation in the number density. It is useful to define a subset S ⊂ V with a number of elements N_k ≡ card(S), which describes the number of voxels in S.
The total number of galaxies in S, together with the catalogued and missing ones, can be written as sums over the voxel index in the subspace S,

N_g = Σ_{i∈S} n_g^i ,  N_c = Σ_{i∈S} n_c^i ,  N_m = Σ_{i∈S} n_m^i .    (2.2)

The completeness fraction, estimating the proportion of galaxies observed in the comoving volume S, is defined as

f_S = N_c / N_g .    (2.3)

In principle, one does not know the total number of galaxies in a comoving volume; however, the average number density of galaxies, when averaged over sufficiently large volumes, is estimated to be ∼ 0.1 Mpc^{−3} between redshift 0 and 2 [28]. This allows us to estimate the completeness fraction as

f̂_S = N_c / (n̄_g V_S) ≃ N_c / N_g ,    (2.4)

where n̄_g denotes the average number density of galaxies, V_S the comoving volume of S, and we use hats to indicate estimators throughout this work. The last equality requires that the comoving volume S is large enough to ensure that clusters and voids average out. The subset S serves to denote comoving volumes which are estimated to have similar completeness fractions. In practice, the completeness varies significantly with redshift, due to both the magnitude limit of a survey and the intrinsic evolution of galaxy populations. The completeness fraction may additionally vary across pixels within a sky localisation area, as surveys with different magnitude limits may need to be combined to cover that sky area. Other factors, such as obscuration by the Milky Way or dust, may add further non-uniformity [16,29].
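To make the completeness estimator of Eq. (2.4) concrete, here is a minimal numerical sketch. The function name, the Poisson/binomial toy field and the fiducial density of 0.1 galaxies per unit volume are our own illustrative assumptions, not part of the paper's pipeline.

```python
import numpy as np

def estimate_completeness(n_c, nbar_g, voxel_volume):
    """Estimate the completeness fraction f_S = N_c / (nbar_g * V_S).

    n_c          : array of catalogued galaxy counts per voxel in S
    nbar_g       : assumed average galaxy number density (e.g. ~0.1 Mpc^-3)
    voxel_volume : comoving volume of a single voxel
    """
    N_c = np.sum(n_c)                      # total catalogued galaxies in S
    V_S = n_c.size * voxel_volume          # total comoving volume of S
    return N_c / (nbar_g * V_S)

# Toy check: a field with true density 0.1 per unit volume, of which
# each galaxy makes it into the catalogue with probability 0.5.
rng = np.random.default_rng(0)
n_g = rng.poisson(100, size=1000)          # "true" counts, 100 per voxel
n_c = rng.binomial(n_g, 0.5)               # catalogue keeps galaxies w.p. 0.5
f_hat = estimate_completeness(n_c, nbar_g=0.1, voxel_volume=1000.0)
```

With enough voxels, clusters and voids average out and `f_hat` recovers the injected 50% completeness.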
Another quantity which may be estimated from a complete galaxy catalogue is the variance, or standard deviation σ_g (alternatively called intrinsic scatter), of the number density of galaxies,

σ_g² = ⟨(n_g − ⟨n_g⟩)²⟩ ,    (2.5)

where the brackets ⟨. . .⟩ denote the expectation value. This variance depends implicitly on the voxel size, of which it is a decreasing function. For small voxels, the galaxy distribution can fluctuate significantly from one region to another, leading to a relatively high variance. On the contrary, for larger voxels the variance is smaller, converging to zero as the voxels reach the size of a Hubble patch. For a fixed voxel size, this quantity should also evolve with redshift: starting from a homogeneous Universe with low variance, the large scale structure forms under the pull of gravity, which enhances the variance at late times. In this sense, σ_g depends on cosmology and gravity.
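As a toy illustration of extracting these two numbers in practice, one can measure them in a region of (near-)unit completeness, where the catalogued counts nearly coincide with the true ones. The hypothetical lognormal count field below is our own illustrative assumption, not the paper's catalogue.

```python
import numpy as np

# Sketch: estimate the mean and intrinsic scatter of voxel galaxy counts
# from a nearly complete region, where n_c ~ n_g.
rng = np.random.default_rng(1)

# Hypothetical clustered field: lognormal counts with median 100 per voxel.
n_g = rng.lognormal(mean=np.log(100.0), sigma=0.5, size=5000)

nbar_g = n_g.mean()               # average count per voxel
sigma_g = n_g.std(ddof=1)         # intrinsic scatter of the counts
```

For a lognormal field these estimates converge to the underlying mean and scatter as the number of voxels grows; with smaller voxels the same field would show a larger relative scatter.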
In the following, we outline the standard completion techniques, known as homogeneous and multiplicative completion, which estimate n_m^i, the number of missing galaxies in a voxel, from the observed galaxies n_c^i and the completeness fraction f_S. We then introduce a new algorithm to estimate n_m^i from n_c^i, n̄_g and the additional knowledge of the standard deviation σ_g of the true underlying distribution of galaxies.

Homogeneous completion
Given an estimate of the completeness fraction f̂_S (via Eq. (2.4)), we can estimate the number of missing galaxies in each voxel, assuming that they are homogeneously distributed in the comoving volume S. This is, of course, a simplification, which has the dual advantage of being easy to implement and of not being expected to bias the inference of cosmological parameters with dark sirens. The intuition is that it corresponds to averaging over all possible realisations of structure in the Universe. However, it also damps the amplitude of the fluctuations in the estimated structure, by making the galaxies more homogeneously distributed than they really are. For completeness, we outline here the homogeneous estimator. The total number of missing galaxies can be estimated from f̂_S and the observed total number of catalogued galaxies in S as

N̂_m = N_c (1 − f̂_S)/f̂_S ,    (2.6)

where N_c can be computed using Eq. (2.2). Note that the fraction N_c/f̂_S gives an estimate of the total galaxy count N̂_g by Eq. (2.3). The average number of missing galaxies in each voxel can be estimated as

n̂_{m;hom}^i = N̂_m / N_k .    (2.7)

Note that this estimator does not depend on the considered voxel. The estimated number of galaxies then reads

n̂_{g;hom}^i = n_c^i + N̂_m / N_k .    (2.8)

This estimator gives on average the correct result,

⟨n̂_{g;hom}^i⟩ = n̄_g ,    (2.9)

with a variance which corresponds to the one present in the catalogue,

Var(n̂_{g;hom}) = Var(n_c) .    (2.10)

In most cases, the variance in the catalogue is damped with respect to the true distribution; therefore, homogeneous completion underestimates the amplitude of observed structure. We will see this in action in Section 3. In the next subsection, we discuss multiplicative completion, which on the contrary overestimates structure.
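The homogeneous estimator described above can be sketched in a few lines; the function name and toy numbers are our own illustrative choices.

```python
import numpy as np

def homogeneous_completion(n_c, f_hat):
    """Homogeneous completion sketch: spread the estimated total number of
    missing galaxies uniformly over the voxels of S.

    n_c   : catalogued counts per voxel
    f_hat : estimated completeness fraction of S
    """
    N_c = n_c.sum()
    N_m_hat = N_c * (1.0 - f_hat) / f_hat     # estimated total missing
    n_m_hat = N_m_hat / n_c.size              # same share in every voxel
    return n_c + n_m_hat                      # completed counts

# Usage: 4 voxels, 50% complete.
n_c = np.array([10.0, 30.0, 20.0, 40.0])
n_g_hat = homogeneous_completion(n_c, f_hat=0.5)
```

By construction the completed field has the same voxel-to-voxel variance as the catalogue itself: the structure is preserved but its relative amplitude is damped.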

Multiplicative completion
The multiplicative completion estimator, introduced in [18], places missing galaxies proportionally to the catalogued galaxies. In particular, in a region S, we have

n̂_{m;mul}^i = b_S n_c^i .    (2.11)

The parameter b_S can be thought of as a bias factor relating observed and missing galaxies. It is found by requiring that the catalogued and missing galaxies sum up to the total number of galaxies,

Σ_{i∈S} (n_c^i + n̂_{m;mul}^i) = N̂_g = N_c / f̂_S ,    (2.12)

which gives

b_S = (1 − f̂_S)/f̂_S ,    (2.13)

such that the estimated number of galaxies reads

n̂_{g;mul}^i = n_c^i / f̂_S .    (2.14)

The expectation value of this estimator is indeed the average number density of galaxies,

⟨n̂_{g;mul}^i⟩ = n̄_g ,    (2.15)

and it also rescales the variance using the completeness fraction,

Var(n̂_{g;mul}) = Var(n_c)/f̂_S² .    (2.16)

At high completeness, f̂_S → 1, the variance in the number of galaxies estimated from multiplicative completion agrees with that of homogeneous completion. However, it is precisely when the completeness becomes lower that the details of the completion method become important. The variance is singular for f̂_S → 0, which makes it clear that multiplicative completion disproportionately inflates structure in cases of low completeness. It benefits from the fact that at intermediate completeness, a low Var(n_c) < σ_g² competes with a low f̂_S < 1, making it able to reproduce σ_g². Be that as it may, this achievement may be viewed as coincidental, without any physical or mathematical motivation. Having identified that the variances of the homogeneous and multiplicative estimators fail to reproduce the variance σ_g² of the true galaxy field n_g, we introduce variance completion in the next section, which uses knowledge of σ_g² to achieve that goal.
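The multiplicative estimator is even simpler to sketch, and the inflation of the variance by the inverse square of the completeness can be checked directly; the toy numbers below are ours.

```python
import numpy as np

def multiplicative_completion(n_c, f_hat):
    """Multiplicative completion sketch: missing galaxies are placed
    proportionally to the catalogued ones, n_m^i = b_S * n_c^i with
    b_S = (1 - f_S)/f_S, so the completed field is simply n_c / f_S."""
    return n_c / f_hat

n_c = np.array([10.0, 30.0, 20.0, 40.0])
f_hat = 0.5
n_g_mul = multiplicative_completion(n_c, f_hat)

# The completed variance is inflated by 1/f_S^2 relative to the catalogue:
var_ratio = np.var(n_g_mul) / np.var(n_c)
```

At `f_hat = 0.5` the variance ratio is 4, and it diverges as the completeness tends to zero.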

Variance completion
The variance completion method uses variance information, in addition to the mean, to estimate the number of missing galaxies. The variance, i.e. the square of the standard deviation (or intrinsic scatter) σ_g, of the number density of galaxies reads

σ_g² = ⟨(n_g(x) − n̄_g)²⟩ = ⟨n_g²(x)⟩ − n̄_g² ,    (2.17)

where n̄_g = ⟨n_g(x)⟩ denotes the average number of galaxies per comoving volume.
In principle, one only has access to the catalogued galaxies, leaving the statistical properties of the true number of galaxies inaccessible. However, one may use regions of high completeness, such as at low redshift, where the two nearly coincide, i.e. n_g^i ≃ n_c^i, to infer the mean and the variance of the n_g(x) field with sufficient accuracy. We also expect these two numbers to evolve with redshift, requiring their extraction from highly complete samples at higher redshift, with a deep field survey for example. Alternatively, properties of the galaxy field can be extracted from N-body simulations given a cosmological model and a prescription to assign galaxies to halos, which may be nontrivial if the voxel size reaches non-linear scales. In the following, we assume that the variance σ_g² in the number density of galaxies, on top of the average n̄_g, is known, although we recognize that obtaining it may be a challenging task in practice.
We now detail the algorithm which incorporates this knowledge into a set of estimators for the n_m^i, which we call n̂_{m;var}^i. To this end, we define a number N_S of subsets of V, labelled S_k ⊂ V, in which the completeness fraction is homogeneous and for which

∪_{k=1}^{N_S} S_k = V .    (2.18)

Each S_k should contain enough voxels that a representative mean and variance can be extracted from them. As before, we call the number of voxels in each subset N_k ≡ card(S_k). The method which we propose to determine the number of missing galaxies n_m^i differs substantially from the previous two. We shall minimize a well-chosen cost function, which bears some resemblance to minimizing an action in classical mechanics. One part of the function penalises the total number of missing galaxies introduced in each voxel, while another part communicates interactions between different voxels. Indeed, increasing the number of missing galaxies in one voxel alone should be compensated by at least one other voxel, so as to keep the mean and variance fixed. Without further motivating the nomenclature, the estimators for the N_k unknowns n_m^i in the subset of voxels contained in S_k can be found by minimizing the following function, which we shall refer to as a Lagrangian,

L = L_free + L_int ,    (2.19)

where the free Lagrangian reads

L_free = A_0 Σ_{i∈S_k} (n_m^i)² ,    (2.20)

and A_0 ∈ R is a coefficient to be fixed. Clearly, this part of the Lagrangian is minimal for n_m^i → 0; in this sense, it can be thought of as a cost function. Introducing more n_m^i increases the value of the Lagrangian. The power of 2 for each n_m^i ensures that it is bounded from below, such that a minimum can be found. We leave the study of the space of other possible cost functions for future work; at present we see no benefit to making this more complicated than a sum of squares. The interaction Lagrangian reads

L_int = B_k [ (1/N_k) Σ_{i∈S_k} (n_c^i + n_m^i)² − (n̄_g² + σ_g²) ]² ,    (2.21)

where B_k ∈ R is also a coefficient to be fixed.5 The aim of the interaction part is to force the average of the squares (n_g^i)² = (n_c^i + n_m^i)² to converge to n̄_g² + σ_g², so as to enforce Eq. (2.17). The chosen phraseology may by now be clear to readers familiar with classical mechanics.
For a good choice of A_0, we will be looking for the minimal sum of squared n_m^i which can complete the catalogue in a way that respects the variance. In other words, we are looking for the minimal number of total missing galaxies which can enforce the correct statistics for the galaxies in S_k. By making A_0 larger, we make the introduction of additional missing galaxies more costly.6 This also prevents the introduction of an unnecessary negative number of missing galaxies in a voxel, which would be physically meaningless; by definition, these numbers should be positive in each voxel. To minimize Eq. (2.19), one has to solve for the N_k unknowns n_m^i. Minimizing the free part requires the partial derivatives

∂L_free/∂n_m^l = 2 A_0 n_m^l .

For the interaction part, one gets

∂L_int/∂n_m^l = (4 B_k/N_k) (n_c^l + n_m^l) [ (1/N_k) Σ_{i∈S_k} (n_c^i + n_m^i)² − (n̄_g² + σ_g²) ] .

Accordingly, the stationarity condition ∂L/∂n_m^l = 0,

2 A_0 n_m^l + (4 B_k/N_k) (n_c^l + n_m^l) [ (1/N_k) Σ_{i∈S_k} (n_c^i + n_m^i)² − (n̄_g² + σ_g²) ] = 0 ,    (2.22)

is a complicated system of N_k nonlinear coupled equations. We can solve it perturbatively by feeding in zeroth order estimates and solving for linear perturbations around them. Explicitly, we write

n_m^i = n̄_m^i + δn_m^i ,    (2.23)

where n̄_m^i is a zeroth order guess, and we shall solve Eq. (2.22) for δn_m^i ≪ n̄_m^i. In the interactions, δn_m^i always appears next to n_c^i + n̄_m^i. The perturbative scheme therefore works as long as

δn_m^i ≪ n_c^i + n̄_m^i .    (2.24)

In the free term, it appears next to n̄_m^i, which seems to require the more restrictive condition

δn_m^i ≪ n̄_m^i .    (2.25)

For a lognormal distribution of fluctuations of the number density, this is easily satisfied for most voxels, which are slightly underdense with respect to the average. For the few voxels which may contain clusters, this inequality may be broken, although this does not appear to affect the performance of the algorithm. Indeed, the inequality may only need to be satisfied on average, although we refrain from performing a formal study of the convergence requirements of our algorithm. Linearizing the free term in δn_m^i, that is, neglecting terms of order O((δn_m^i)²), leads to

∂L_free/∂n_m^l ≃ 2 A_0 (n̄_m^l + δn_m^l) ,    (2.26)

while linearizing the interaction term gives

∂L_int/∂n_m^l ≃ (4 B_k/N_k) [ (n_c^l + n̄_m^l) D + (n_c^l + n̄_m^l) (2/N_k) Σ_{i∈S_k} (n_c^i + n̄_m^i) δn_m^i + D δn_m^l ] ,    (2.27)

with

D ≡ (1/N_k) Σ_{i∈S_k} (n_c^i + n̄_m^i)² − (n̄_g² + σ_g²) .    (2.28)

We write the linearized version of Eq.
(2.22) in a matrix form A · δn_m = b, where δn_m = (δn_m^1, . . ., δn_m^{N_k}). The first term in Eq. (2.27) belongs to the l-th element of b. The second term gives one element in each column i of line l of the matrix A_{il}. Finally, the third term corresponds to elements on the diagonal of the matrix A. Explicitly, the expressions for the elements of the matrix A and the vector b read

A_{li} = [ 2 A_0 + (4 B_k/N_k) D ] δ_{li} + (8 B_k/N_k²) (n_c^l + n̄_m^l)(n_c^i + n̄_m^i)    (2.29)

and

b_l = −2 A_0 n̄_m^l − (4 B_k/N_k) (n_c^l + n̄_m^l) D .    (2.30)
The matrix A and vector b can be calculated using a zeroth order estimate of n̄_m^i, such as the homogeneous solution n̂_{m;hom}^i. The linear system A · δn_m = b can then be solved for δn_m by inverting the matrix A, such that δn_m = A⁻¹ b. We solve the system iteratively by updating the zeroth order solution n̄_m^i → n̄_m^i + δn_m^i in A and b, until it converges to a satisfactory accuracy. When the solutions give corrections which are much below one galaxy, i.e. δn_m^i ≪ 1, one may stop iterating and conclude that7

n̂_{m;var}^i = n̄_m^i + δn_m^i .    (2.31)

As hinted earlier, we trade A_0 and B_k for a single constant C_0 by rescaling them such that the function we are minimizing is of order unity for the initial, homogeneous guess for the number of missing galaxies,

A_0 = C_0 / Σ_{i∈S_k} (n̂_{m;hom}^i)² ,  B_k = (1 − C_0) / [ (1/N_k) Σ_{i∈S_k} (n_c^i + n̂_{m;hom}^i)² − (n̄_g² + σ_g²) ]² ,    (2.32)

where C_0 ∈ (0, 1) is a tuneable parameter, which we can adapt until we recover that the total number of missing galaxies in S_k matches the total number of galaxies we expect from the homogeneous estimator. The intuition is that if C_0 is too large, the Lagrangian L is dominated by the free part L_free and is minimised for n_m^i → 0. If, on the contrary, C_0 is too small, then there is little cost in introducing a negative number of missing galaxies, and the solutions which respect the variance of the distribution of galaxies are degenerate. In practice, a balance between these two extremes can be found iteratively. We stop the iteration once the total number of missing galaxies that we estimate reaches the expectation from the homogeneous estimator, to within 1% (that is, |Ξ_{n_m;var}| ≤ 0.01 as of Eq. (2.40)). There are cases where this criterion cannot be met after a large number of iterations. In that case, we abandon the scheme and simply resort to homogeneous completion, n̂_{m;var}^i = n̂_{m;hom}^i. Additionally, there are situations where convergence to the total number of missing galaxies expected in the homogeneous case is reached while some n̂_{m;var}^i lie below some threshold n_t which is too close to zero. In this case, one may, at the end, remove those and fill in the minimum threshold number of missing galaxies n_t > 0, which we fix to n_t = 1. One may repeat these steps for all the subsets S_k until an estimate n̂_{m;var}^i for each voxel of V is obtained.
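As a cross-check of the scheme above, a Lagrangian of this type can also be minimized directly with a generic bounded optimizer instead of the iterative linearization. The sketch below assumes a quadratic free term and a squared-moment interaction term of the kind described in the text; the exact normalisation of A_0 and B_k through C_0, the function names and the toy lognormal field are our own illustrative choices, not the paper's implementation.

```python
import numpy as np
from scipy.optimize import minimize

def variance_completion(n_c, f_hat, nbar_g, sigma_g, C0=0.5):
    """Sketch: complete a catalogue by minimizing
    L = A0 * sum(n_m^2) + Bk * (mean((n_c + n_m)^2) - (nbar_g^2 + sigma_g^2))^2,
    with n_m^i >= 0 enforced through bounds. Assumes the initial moment
    mismatch is nonzero (division by it below)."""
    N_k = len(n_c)
    # Homogeneous initial guess for the missing galaxies.
    n_m0 = np.full(N_k, n_c.sum() * (1.0 - f_hat) / (f_hat * N_k))

    # Rescale A0, Bk so each part of L is O(1) at the initial guess.
    A0 = C0 / np.sum(n_m0**2)
    D0 = np.mean((n_c + n_m0)**2) - (nbar_g**2 + sigma_g**2)
    Bk = (1.0 - C0) / D0**2

    def lagrangian(n_m):
        D = np.mean((n_c + n_m)**2) - (nbar_g**2 + sigma_g**2)
        return A0 * np.sum(n_m**2) + Bk * D**2

    res = minimize(lagrangian, n_m0, method="L-BFGS-B",
                   bounds=[(0.0, None)] * N_k)   # positivity of n_m^i
    return res.x

# Toy usage: proportional incompleteness preserves the structure exactly.
rng = np.random.default_rng(3)
n_g = rng.lognormal(np.log(100.0), 0.4, size=50)   # "true" counts
f_S = 0.3
n_c = f_S * n_g                                    # idealised catalogue
n_m_hat = variance_completion(n_c, f_S, n_g.mean(), n_g.std())
```

The completed field `n_c + n_m_hat` has its second moment pushed towards n̄_g² + σ_g², which neither homogeneous nor multiplicative completion targets by construction.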

Galaxy removal simulations
In order to test the completion methods described above, we need a mock catalogue of galaxies for which the 'ground truth' n_g is known. We first generate our own mock data, following the method described in this subsection. Whilst not fully physically realistic, this enables us to generate a range of catalogues with different parameters, to test how the completion methods fare under various conditions. In section 3.3, we will further test their performance on data from the more advanced MICECAT simulations [30][31][32][33][34].
We follow the procedure outlined in [35] to generate a catalogue of galaxies given a real space correlation function. Throughout the article, to simulate incompleteness, we remove galaxies according to the following procedure, except in subsection 3.3, where we also introduce an apparent magnitude threshold.
For each voxel in a subset S ⊂ V, we apply a superposition of three ways to remove galaxies: homogeneous, proportional and random.

[Figure 1 caption: Left: histogram of the true galaxy counts. Right: a realization of the catalogued galaxies corresponding to the distribution in the left panel, with completeness fraction f_S = 0.15, removal scatter to intrinsic scatter ratio σ_S/σ_g = 0.05 and fraction of homogeneously removed galaxies a = 0.05. In this case, the catalogue roughly keeps the structure of the true distribution of galaxies, but careful examination reveals variations in the structure. Bottom: structure difference between the catalogued and true galaxies presented on top. We show n_g^i − n_c^i/f_S, which rescales the catalogued galaxies so as to compare the difference in structure rather than the difference in average between the catalogued and the true galaxies. If they had exactly the same structure, this last plot would be 0 everywhere.]

To some extent, as motivated above, we do expect to miss more galaxies in overdense regions, justifying the proportional part. Perfect proportionality is also not expected, and we add the simplest deviations we can think of, which are to remove galaxies homogeneously, with some scatter. To implement this in practice, we remove a fraction (1 − f_S) of the galaxies. Of those, a fraction a ∈ [0, 1] is removed homogeneously, while the remaining fraction (1 − a) is removed proportionally. The randomness comes from drawing, for each voxel, a random number x_i from a Gaussian centred on the number of removed galaxies in that voxel, with standard deviation (or removal scatter) σ_S,

x_i ∼ N( μ_i , σ_S² ) ,  μ_i = (1 − f_S) [ a n̄_g + (1 − a) n_g^i ] .    (2.34)

The scatter σ_S allows for some randomness and ensures that the true number of galaxies in each voxel cannot be reconstructed even if one knows a and f_S exactly. The input f_S should be distinguished from the estimated completeness f̂_S, which can be calculated according to Eq. (2.4) once the catalogue has been simulated. In a second part of our analysis, we will promote f_S and σ_S to redshift-dependent quantities, by allowing them to evolve. That is, we will vary the expectation value and standard deviation of the Gaussian distribution corresponding to the properties of each subspace S_k ⊂ V. Then, to keep the number of simulated missing galaxies in the interval [0, n_g^i], which is a physical requirement, we determine n_m^i, ∀i ∈ {1, . . ., N_v}, as per

n_m^i = 0 if x_i < 0 ;  n_m^i = x_i if 0 ≤ x_i ≤ n_g^i ;  n_m^i = n_g^i if x_i > n_g^i .    (2.35)

The limit σ_S ≪ 1 preserves the structure of the galaxy distribution in the catalogue, such that the former can be well, though not perfectly, reconstructed from the latter using knowledge of f_S and a. The limit a → 0 and σ_S → 0 is where we expect multiplicative completion to perform well. Conversely, a → 1 and σ_S → 0 is where we expect homogeneous completion to do well. When σ_S becomes larger, the true galaxy structure is progressively lost from the catalogue, and it becomes correspondingly more difficult to reconstruct the underlying galaxy distribution.
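A minimal sketch of this removal protocol is given below. The exact form of the Gaussian mean (a homogeneous share plus a proportional share of the removed fraction) is our reading of the text, and the Poisson toy field is purely illustrative.

```python
import numpy as np

def remove_galaxies(n_g, f_S, a, sigma_S, rng):
    """Simulate incompleteness: remove a fraction (1 - f_S) of galaxies,
    a fraction `a` of it homogeneously and (1 - a) proportionally, with
    Gaussian removal scatter sigma_S, clipped to the physical range."""
    nbar = n_g.mean()
    mu = (1.0 - f_S) * (a * nbar + (1.0 - a) * n_g)  # mean removed per voxel
    x = rng.normal(mu, sigma_S)                      # Gaussian draw per voxel
    n_m = np.clip(np.rint(x), 0, n_g)                # keep n_m in [0, n_g]
    return n_g - n_m, n_m                            # (catalogued, missing)

rng = np.random.default_rng(7)
n_g = rng.poisson(200, size=1600).astype(float)      # toy 40x40 "true" field
n_c, n_m = remove_galaxies(n_g, f_S=0.15, a=0.05, sigma_S=2.0, rng=rng)
```

Because of the clipping, the realised completeness `n_c.sum() / n_g.sum()` can differ slightly from the input `f_S`, as discussed next.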
The fact that the results from the Gaussian are clipped if x_i ∉ [0, n_g^i] results in differences between f_S and f̂_S, especially in the case of low completeness. This is because the random number may often be negative, such that the number of missing galaxies ends up being set to zero according to Eq. (2.35) in a large number of voxels, driving the average number of missing galaxies upwards. The upper boundary is less of a problem because we are interested in low completeness, such that most of the time f̂_S ≥ f_S. The catalogued galaxy number densities are computed as

n_c^i = n_g^i − n_m^i ,    (2.36)

and these catalogued number densities are the only ones to which the observer has access.
In the following section, we describe how we compare how well the different estimators reconstruct n_g^i from n_c^i, a task that we, as simulators, are able to perform.

Performance comparison
In this section, we describe how well an estimator performs when it comes to reconstructing the number of missing galaxies. For a given estimator, we define the average difference between the estimated number of missing galaxies n̂_m^i and the true (simulated) number of missing galaxies n_m^i,

Δ_n = (1/N_v) Σ_i |n̂_m^i − n_m^i| .    (2.37)

Estimators with lower residual amplitudes are more precise. In the situation where the completeness fraction is f_S = 0.15, the fraction of homogeneously removed galaxies is a = 0.15 and the removal to intrinsic scatter ratio is σ_S/σ_g = 0.05, the variance-informed estimator has residuals with an amplitude of Δ_var = 41.08 on average. The homogeneous estimator has a much larger amplitude of about Δ_hom = 141.21, and the multiplicative estimator gives Δ_mul = 173.99. In this case, the variance-informed estimator is more than 4 times more precise.
It should be clear that a lower Δ_n indicates a more precise estimator. This performance measure indicates that, on average, the algorithm estimates the number of missing galaxies Δ_n galaxies away from the true number of missing galaxies. We may also check whether the total number of missing galaxies in S_k,

N̂_m = Σ_{i∈S_k} n̂_m^i ,    (2.38)

is recovered. In this way, we can also estimate

Γ_n = (N̂_m − N_m)/N_m ,    (2.39)

which is the fractional difference in the summed number of missing galaxies in S_k. The ideal estimator has Δ_n → 0 and Γ_n → 0. In practice, one does not know the n_m^i, so that the Δ_n and Γ_n tests cannot be applied to real data. However, one may check that the total number of missing galaxies recovers the expectation from the average. For this purpose, we define

Ξ_n = ( Σ_{i∈S_k} n̂_m^i − N̂_{m;hom} ) / N̂_{m;hom} ,    (2.40)

where

N̂_{m;hom} = N_c (1 − f̂_S)/f̂_S .    (2.41)

[Figure 3 caption: One line of sight for the reconstructed number of galaxies n̂_g using the three estimators, together with the true number of galaxies n_g and the average number of missing galaxies n̄_m. Here, the completeness fraction is f_S = 0.15, the fraction of homogeneously removed galaxies is a = 0.15 and the removal scatter to intrinsic scatter ratio is set to σ_S/σ_g = 0.05. The homogeneous estimator damps existing structure, while multiplicative completion over-amplifies it. The variance-informed estimator amplifies catalogued structure with the correct amplitude even for a completeness as low as 15%, provided the structure of the true number of galaxies is preserved in the catalogue.]

In the variance algorithm, we use this quantity to gauge the constant C_0 (see Eq. (2.32)). When Ξ_n falls below a threshold, which we fix arbitrarily to |Ξ_{n_m;var}| < 0.01, we fix C_0 and extract the estimators n̂_{m;var}^i. In this work, we simulate the removal of galaxies, so that we have full control over the number of missing galaxies, allowing us to perform the Δ_n and Γ_n tests as well. In the next section, we investigate the homogeneous, multiplicative and variance completion techniques quantitatively and compare their performances on both our mock data and full simulations.
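The three diagnostics described in this subsection can be sketched as follows; the function names are ours and the toy numbers are purely illustrative.

```python
import numpy as np

def delta_n(n_m_hat, n_m_true):
    """Average absolute residual between estimated and true missing counts
    (our reading of the Delta_n diagnostic)."""
    return np.mean(np.abs(n_m_hat - n_m_true))

def gamma_n(n_m_hat, n_m_true):
    """Fractional difference in the summed number of missing galaxies."""
    return (n_m_hat.sum() - n_m_true.sum()) / n_m_true.sum()

def xi_n(n_m_hat, N_c, f_hat):
    """Fractional difference from the homogeneous expectation
    N_m = N_c (1 - f)/f; usable on real data, unlike Delta_n and Gamma_n,
    which require the true missing counts."""
    N_m_hom = N_c * (1.0 - f_hat) / f_hat
    return (n_m_hat.sum() - N_m_hom) / N_m_hom

# Toy usage on 4 voxels.
n_m_true = np.array([10.0, 20.0, 30.0, 40.0])
n_m_hat = np.array([12.0, 18.0, 33.0, 37.0])
d = delta_n(n_m_hat, n_m_true)   # mean of |2, -2, 3, -3| = 2.5
g = gamma_n(n_m_hat, n_m_true)   # totals match, so 0.0
```

A per-voxel error can thus be small or large independently of whether the summed number of missing galaxies is recovered, which is why both kinds of diagnostic are used.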

Performance comparison of the completion methods
In this section, we present our results. We first study a situation of constant completeness, using a toy-model correlation function to generate the galaxy catalogue. We then extend the analysis to incorporate a completeness which evolves in one direction, so as to model a redshift-dependent completeness. In Sec. 3.3, we apply our algorithm to simulations from a realistic catalogue, for which we also introduce an apparent magnitude threshold to remove galaxies. Finally, we discuss our results and limitations in Sec. 3.4. Throughout this work, we limit ourselves to two dimensions of space and shall use one dimension as a proxy for redshift. This is sufficient to extract visually convincing results. We leave the generalization to the redshift, azimuthal and declination coordinates (z, α, δ) appropriate for observed galaxy positions for future work. It turns out that discrete directions on the sky, identified uniquely by the two angles (α, δ), can also be parametrized by ordered pixel numbers, which are contained in a 1D vector. In this sense, the cross product with a vector of redshift bins makes the real data problem closer to a 2D problem than a 3D one.

[Figure caption: One line of sight for the reconstructed number of galaxies n̂_g, using the three outlined estimators, together with the true number of galaxies n_g, for the case f_S = 0.15, a = 0.15, σ_S/σ_g = 0.5. Such a large σ_S = 132.10 means that some peaks are converted to troughs and some troughs to peaks. This tricks all three estimators, which convert above-average catalogue densities to peaks and below-average ones to troughs, which may however be incorrect. For example, there are nearly no catalogued galaxies at bin 25, yet an important peak exists there, which is underestimated by all three estimators. Conversely, at bin 16, all three estimators give a peak, with strongly varying amplitudes, where none exists: the true number of galaxies is average there. In this case, homogeneous completion is closest to the simulated data, with Δ_hom = 174.11, Δ_mul = 462.01 and Δ_var = 204.96.]

Constant completeness
We simulate 10^6 galaxies living in a 2-dimensional space of 1000 × 1000 spatial units, which we split into 40 × 40 bins. We follow the procedure outlined in [35] to generate a catalogue of 10^6 galaxies given a real-space correlation function ξ(r) = A|(r + ∆r)/r0|^(−α), where we fix A = 10^(−1), ∆r = 5, r0 = 20 and α = 1.77. This choice of parameters ensures that the correlation function is non-singular and vanishes at large distances. A histogram of galaxy counts is presented in the left panel of Fig. 1.
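The field generation can be sketched as follows. This is a simple correlated-Gaussian-field stand-in (not necessarily the exact procedure of [35]): binned counts are drawn with covariance n̄² ξ(r_ij) plus Poisson shot noise on the diagonal, then rounded and clipped to non-negative integers.

```python
import numpy as np

def simulate_galaxy_counts(nbins=40, boxsize=1000.0, n_total=10**6,
                           A=0.1, dr=5.0, r0=20.0, alpha=1.77, seed=1):
    """Draw binned galaxy counts whose two-point statistics approximate
    xi(r) = A * |(r + dr)/r0|**(-alpha), via a correlated Gaussian field
    (a toy stand-in for the generation procedure of [35])."""
    rng = np.random.default_rng(seed)
    dx = boxsize / nbins
    centres = (np.arange(nbins) + 0.5) * dx
    X, Y = np.meshgrid(centres, centres)
    pos = np.column_stack([X.ravel(), Y.ravel()])
    # pairwise separations between all bin centres
    r = np.linalg.norm(pos[:, None, :] - pos[None, :, :], axis=-1)
    xi = A * np.abs((r + dr) / r0) ** (-alpha)
    nbar = n_total / nbins**2                      # mean counts per bin
    # count covariance: clustering term plus Poisson shot noise on the diagonal
    cov = nbar**2 * xi + np.diag(np.full(nbins**2, nbar))
    # project to the nearest positive semi-definite matrix, then sample
    w, V = np.linalg.eigh(cov)
    sample = V @ (np.sqrt(np.clip(w, 0.0, None)) * rng.standard_normal(nbins**2))
    # rounding and clipping to non-negative counts skews the mean slightly upward
    n_g = np.clip(np.rint(nbar + sample), 0, None).astype(int)
    return n_g.reshape(nbins, nbins)

n_g = simulate_galaxy_counts()
```

The clip at zero is needed because a Gaussian field does not enforce positivity; for strongly clustered fields a lognormal transform would be the more faithful choice.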
We vary f S ∈ {0.05, 0.15, 0.30, 0.50}, thereby covering representative low completeness fractions where the estimators for the number of missing galaxies are relevant. We assume the scatter in the number of missing galaxies σ S to be a fraction of the scatter in the distribution of galaxies σ g, which we vary in the set {0.05, 0.15, 0.30}. While σ g may be extracted from data, σ S is unknown. We vary the fraction of galaxies removed homogeneously in the set a ∈ {0.0, 0.1, 0.2, 0.3, 0.4} and present the galaxy catalogue for the case f S = 0.15, σ S /σ g = 0.05 and a = 0.05 in Fig. 1. We study 36 different cases with variations in the removal parameters in Tab. 1, for which we report the performance results of the three estimators from Sec. 2.
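A minimal sketch of a removal prescription of this type, assuming (our reading of Sec. 2.4) that the mean number of missing galaxies per voxel combines a homogeneous part (weight a) and a part proportional to the true counts (weight 1 − a), with Gaussian scatter σ S truncated to the physical range:

```python
import numpy as np

def make_catalogue(n_g, f_S, a, sigma_S_over_sigma_g, seed=2):
    """Remove galaxies voxel by voxel (assumed form of the Sec. 2.4 prescription):
    a fraction a of the missing galaxies is removed homogeneously, the rest
    proportionally to n_g, with scatter sigma_S, truncated to [0, n_g]."""
    rng = np.random.default_rng(seed)
    n_g = np.asarray(n_g, dtype=float)
    sigma_S = sigma_S_over_sigma_g * n_g.std()
    # mean missing counts: homogeneous part + proportional part
    mean_missing = (1.0 - f_S) * (a * n_g.mean() + (1.0 - a) * n_g)
    n_m = rng.normal(mean_missing, sigma_S)
    n_m = np.clip(np.rint(n_m), 0, n_g)        # truncated ("chopped") Gaussian
    return (n_g - n_m).astype(int), n_m.astype(int)

# toy true field (not the correlated one of Fig. 1)
n_g = np.random.default_rng(0).poisson(625, size=(40, 40))
n_c, n_m = make_catalogue(n_g, f_S=0.15, a=0.2, sigma_S_over_sigma_g=0.05)
```

The truncation at n_m ≤ n_g is what makes the measured completeness differ from the input f S at very low completeness, as noted later for Fig. 6.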
To compare visually the performance of the variance informed algorithm and the homogeneous one, we unravel this 40 × 40 pixel distribution into a one-dimensional line; we present the number densities as a function of bin number and plot the residuals in Fig. 2. In the context of our estimators, the numbers of missing galaxies n i m are N v numbers for which the order plays no role beyond labels. This would not be the case if we tried to make the estimated number of galaxies respect the full correlation function: that would add a notion of inter-voxel distance to the labeling, which would dramatically complicate our algorithm and its generalization to 3D. In this case, the average differences between the estimated and the true number of missing galaxies are ∆ hom = 141.21, ∆ mul = 173.99 and ∆ var = 41.08, using the performance estimators defined in Sec. 2.5. We show one line of sight (i.e. one row/column) and the reconstructed number of galaxies using the three different estimators in Fig. 3. The homogeneous estimator captures the structure, but with an amplitude which is damped compared to the true number of galaxies. In contrast, multiplicative completion over-amplifies existing structure, according to Eq. (2.16). Finally, the variance informed estimator takes the existing structure and amplifies its variance to reproduce the known one, σ g.
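For concreteness, the three estimators and the performance metric can be sketched as below. The variance-informed entry here is a closed-form moment-matching proxy (rescale the catalogue fluctuations to mean ng and scatter σ g), not the iterative minimisation of Sec. 2.3, and `average_difference` is only in the spirit of Eq. (2.37); all names are ours.

```python
import numpy as np

def complete_homogeneous(n_c, f_S):
    """Spread the estimated missing galaxies uniformly over all voxels."""
    n_missing_total = n_c.sum() * (1.0 - f_S) / f_S
    return np.full(n_c.shape, n_missing_total / n_c.size)

def complete_multiplicative(n_c, f_S):
    """Missing galaxies trace the catalogue: each voxel is rescaled by 1/f_S."""
    return n_c * (1.0 - f_S) / f_S

def complete_variance_proxy(n_c, f_S, nbar_g, sigma_g):
    """Moment-matching proxy for variance completion: rescale catalogue
    fluctuations so the completed field has mean nbar_g and scatter sigma_g,
    then clip negative missing counts."""
    n_hat_g = nbar_g + (n_c - n_c.mean()) * sigma_g / n_c.std()
    return np.clip(n_hat_g - n_c, 0.0, None)

def average_difference(n_m_true, n_m_est):
    """Mean absolute difference between true and estimated missing counts."""
    return np.abs(np.asarray(n_m_true) - np.asarray(n_m_est)).mean()

# purely proportional removal (a = 0) on a toy uncorrelated field
rng = np.random.default_rng(3)
n_g = rng.normal(625.0, 80.0, size=1600)
n_m_true = 0.85 * n_g
n_c = n_g - n_m_true
d_hom = average_difference(n_m_true, complete_homogeneous(n_c, 0.15))
d_mul = average_difference(n_m_true, complete_multiplicative(n_c, 0.15))
d_var = average_difference(n_m_true, complete_variance_proxy(n_c, 0.15,
                                                            n_g.mean(), n_g.std()))
```

For a purely proportional removal (a = 0), the multiplicative estimator and the moment-matching proxy both recover the truth essentially exactly, while homogeneous completion misses the structure entirely, mirroring the trends reported in Tab. 1.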
From the results in Tab. 1, we see that at relatively high completeness f S = 0.5, the variance completion competes with the multiplicative completion even when the removed number of galaxies is proportional to the existing galaxies (a = 0).It also performs better than homogeneous completion when a ∈ [0, 0.3] and with sufficiently low variance in the scatter of removed galaxies σ S ≤ 0.3σ g = 79.26.When the completeness drops to f S ≃ 0.15 − 0.3, variance completion performs better unless the scatter in the removed number of galaxies becomes too large, i.e. σ S ≥ 0.25σ g = 66.05.For larger variance in the number of removed galaxies, i.e. σ S ≥ 0.3σ g , homogeneous completion may perform better.This may happen if a peak is strongly suppressed, which erases structure and misleads both the multiplicative and variance informed estimators.In this case the structure of the true galaxies is no longer sufficiently representative of the underlying galaxy distribution for the variance estimator to perform well.
At even lower completeness, such as f S = 0.05, the structure of the galaxy distribution needs to be very well preserved in the catalogue for the variance algorithm to work well. We get very good results for σ S ≤ 0.05σ g = 13.21. If σ S becomes larger, then from time to time an entire catalogue peak, of the order of σ S galaxies above the catalogue average, may disappear, causing the variance algorithm to mistake that voxel for a trough. We illustrate this in Fig. 4 with σ S /σ g = 0.5.
In the next section, we make use of the knowledge acquired so far and allow the completeness to evolve in one direction, so as to model a redshift-dependent completeness.

Table 1. Simulation parameters and results for 10^6 galaxies. This table contains the main results of this section. We present 36 simulations, with variations in the galaxy removal parameters: the completeness fraction f S, the fraction of homogeneously removed galaxies a and the removal scatter to intrinsic scatter ratio σ S /σ g, as defined in Sec. 2.4. The bin size is fixed to 25, such that there are 40 × 40 = 1,600 bins. We then present the average difference between the estimated and true number of missing galaxies ∆ n, defined in Eq. (2.37), for the three methods: homogeneous, multiplicative and variance completion. The bold symbols allow the reader to quickly identify which method works best at estimating the missing number of galaxies on average. When two estimators result in ∆ n's separated by less than 1, we mark them both in bold, as fluctuations in the particular realization of the removal parameters may influence the result.

Evolving completeness
In practice, one observes galaxies on the lightcone.Because many surveys are magnitude limited, this translates to a luminosity threshold on the galaxies that can be observed, which evolves with redshift.That means that the further one observes, the less complete galaxy catalogues are expected to be.In our 2D simulations, we model this for simplicity with a completeness fraction which evolves linearly between f Smax and f Smin in one direction of space.More sophisticated evolutions in the completeness fraction are not expected to affect the analysis and may easily be included.We plot an example of a catalogue with such a linear evolution in the completeness fraction in Fig. 5.We measure the completeness fraction fS in columns of 100 bins, constituting the 100 S k volumes according to Eq. (2.4), and plot the corresponding homogeneous completion in Fig. 6.The fact that the homogeneous completion exhibits small variations on top of the linear evolution comes from the difference between f S and fS as outlined in Sec.2.4.
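A sketch of the per-region completeness measurement, taking each column of the grid as one S k region; the normalisation by N k · ng is our reading of Eq. (2.4).

```python
import numpy as np

def measured_completeness(n_c, nbar_g):
    """Measure fS region by region, taking each column of the grid as one
    region S_k: catalogued counts over expected true counts N_k * nbar_g
    (assumed form of Eq. (2.4))."""
    N_k = n_c.shape[0]                      # voxels per column
    return n_c.sum(axis=0) / (N_k * nbar_g)

# linear completeness ramp from f_Smax (left) to f_Smin (right)
f_cols = np.linspace(0.7, 0.05, 100)
n_g = np.full((100, 100), 100.0)            # toy true field with nbar_g = 100
n_c = n_g * f_cols[None, :]                 # deplete galaxies column by column
f_hat = measured_completeness(n_c, nbar_g=100.0)
```

On this noiseless toy field the measured f_hat reproduces the input ramp exactly; with the truncated-Gaussian removal of Sec. 2.4 it fluctuates around it, which is the effect visible in Fig. 6.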
We plot one line of sight example for which the variance informed estimator performs well in Fig. 7. In this plot, the input completeness fraction varies between f Smax = 0.7 and f Smin = 0.05. At high completeness, the three estimators perform similarly. As the completeness drops, the homogeneous and multiplicative estimators start to drift away from the true number of missing galaxies, in opposite ways: homogeneous completion underestimates the contrast between peaks and troughs, while multiplicative completion exaggerates it. The variance informed estimator reproduces an amplitude closer to the truth. We summarize our results with variations in the removal parameters in Tab. 2. We find that variance completion performs similarly or better in the considered cases so long as a ≥ 0.1 and σ S ≤ 0.2σ g. When the fraction of galaxies removed homogeneously vanishes, i.e. a = 0, multiplicative or homogeneous completion may perform marginally better.

Realistic catalogue
So far, we have generated a catalogue of galaxies according to a toy-model correlation function. In this section, we use a catalogue from the MICECAT v2 simulations [30][31][32][33][34] to study the efficiency of our algorithm at reconstructing the missing number density of galaxies with more realistic data. The MICECAT simulations yield a lightcone catalogue in which the number density of galaxies evolves with observed redshift. Since we do not include this effect in our analysis, meaning that ng is fixed, we take a slice at x ∈ [x 0, x 0 + ∆x] [Mpc/h] and y, z ∈ [0, 1000] [Mpc/h], for which the number density is sufficiently constant across the sample, where ∆x is the binsize and h is the reduced Hubble constant. We introduce an evolving completeness along the y axis according to the prescription of Sec. 2.4, with f Smax = 0.5 and f Smin = 0.05. We present the selected 105,817 galaxies in Fig. 8.
One line of sight is presented in Fig. 9 in an optimistic scenario where a = 0.2 and σ S = 0.05 σ g .This example shows how the algorithm may be applied to realistic data, by adapting the binsize to ensure that the statistical properties of the sample are well represented by the voxel decomposition of V.
At this point, one may wonder whether the protocol to remove galaxies described in Sec. 2.4 favors the variance estimator. In practice, galaxies are missed because they are too faint, that is, their apparent magnitude m is above the telescope threshold m *. To model this, we use the absolute magnitude M in the ks band from the MICECAT simulations; any other appropriate band may be used. We compute an artificial apparent magnitude with a distance-modulus-like relation, m i = M i + 5 log 10 (D) up to a constant offset absorbed into m *, where D = z 0 + z. We adjust z 0 and m * such that the measured completeness fraction fS varies in the z direction. The catalogued galaxies are then the ones for which m i ≤ m *, while the missing galaxies have m i > m *. We take a volume y, z ∈ [0, 1000] [Mpc/h] and x ∈ [1600, 1600 + ∆x] [Mpc/h]. We fix m * = 18 and z 0 = 100 [Mpc/h], which reproduces well a completeness fraction varying between 98% and 2%, and vary the binsize in the set ∆x ∈ {10, 25, 50, 100} [Mpc/h]. (Note that because the number of missing galaxies in the prescription of Sec. 2.4 is drawn from a chopped Gaussian, the estimated completeness stays above f Smin; for example, for a = 0.4 and σ S /σ g = 0.15, min( fS ) = 0.13 > f Smin. In the evolving-completeness study of Tab. 2, we vary the fraction of galaxies removed homogeneously in the set a ∈ {0.0, 0.1, 0.2, 0.3, 0.4} and the removal scatter to intrinsic scatter ratio σ S /σ g between 5% and 25% where appropriate.) We present our results in Tab. 3 and a histogram of the resulting catalogued galaxies in Fig. 10. There, the variance estimator systematically outperforms its competitors. For ∆x = 10 [Mpc/h], the variance estimator may seem only mildly better than the homogeneous estimator, but the difference should be compared to the average number of galaxies in each bin, ng = 5.34. Integrating over all the bins results in a 10% difference in the ability to reproduce the correct number of galaxies in each bin; the difference grows to 22% if one compares the variance and multiplicative estimators. We present one line of sight in Fig. 11, where the superiority of variance completion in reproducing the true number of galaxies is visually striking.
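A sketch of this selection, with the caveat that the displayed equation is elided in the text: we assume the distance-modulus-like form m = M + 5 log10(D) + const with D = z0 + z, the constant being absorbed into m *. The magnitudes and positions below are toy values, not MICECAT's.

```python
import numpy as np

def apparent_magnitude_cut(M, z, z0=100.0, m_star=18.0):
    """Catalogue selection by apparent magnitude, assuming
    m = M + 5*log10(D) with D = z0 + z (constant offset absorbed
    into the threshold m_star). Returns m and the selection mask."""
    m = M + 5.0 * np.log10(z0 + z)
    return m, m <= m_star

rng = np.random.default_rng(4)
M = rng.normal(5.0, 2.0, size=10000)      # toy absolute magnitudes
z = rng.uniform(0.0, 1000.0, size=10000)  # line-of-sight position in [Mpc/h]
m, kept = apparent_magnitude_cut(M, z)
```

Because m grows with distance, the kept fraction decreases along z, reproducing the qualitative behaviour of a magnitude-limited survey: nearby bins are nearly complete, distant ones nearly empty.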

Discussion
The results presented in the previous section assume the optimistic scenario where the structure of the galaxy distribution is well preserved in the catalogue, with structure variations σ S (the standard deviation of the Gaussian part of the removed galaxies) bounded by 30% of σ g (the standard deviation of the true galaxy distribution). We discuss here some of the assumptions that go into these results, together with their limitations.
To simulate the positions of galaxies given a correlation function, we followed the procedure outlined in [35]. While we could vary the correlation function, we do not expect this to fundamentally change our results. The reason is that variance completion makes use of ng and σ g, which we measured from the 2D histogram of the true number of galaxies n g (x). Of course, σ g is expected to change if a different correlation function is implemented. However, for different correlation functions which yield the same variance σ 2 g, we do not expect variance completion to be affected. Changes in the variance can be studied to some extent by changing the voxel size. This can be done provided each subregion S k of similar completeness still contains a large enough sample of voxels, so as to extract a representative average and variance. We further applied the algorithm to the MICECAT catalogue in Sec. 3.3, which shows that it may well be applied to realistic data.

Table 3. We compute the resulting maximum fSmax = max( fS ) and minimum fSmin = min( fS ) using Eq. (2.4). We present the average number density of galaxies ng and the intrinsic scatter σ g, which we assume to be known. We present the average difference between the estimated and true number of missing galaxies for the three different techniques, according to Eq. (2.37). The variance estimator systematically outperforms the homogeneous and multiplicative estimators.
On top of the constraint that there should be sufficiently many voxels in one S k to form a representative sample of the distribution of galaxies, the binsize is limited by computational resources and by the knowledge of σ g. We remind the reader that σ g should in principle be a function of redshift and bin volume, and might be measured only for some binsizes and redshifts. If it is known for a few binsizes and redshifts, one may be able to interpolate, although that may require a model. In Sec. 3.1, we had N S = 1 and N 1 = 40 × 40 = 1,600 bins, which means that one is dealing with a matrix A which is 1,600 × 1,600 and which has to be inverted a number of times for the algorithm to converge. In Sec. 3.2, we had 100 × 100 bins which were split into N S = 100 regions S k, k ∈ {1, . . ., N S }, of N k = 100 bins each. That means that one has to deal with a hundred matrices which are 100 × 100. These numbers are sufficiently large that the average and variance in the number of galaxies closely resemble those of the total sample, and the algorithm converges on the timescale of seconds to a minute on a modern laptop. The computational complexity scales as O(N S N 3 k ) with a standard Gauss-Jordan matrix inversion algorithm. It can be made faster with an optimized Coppersmith-Winograd-like inversion algorithm, which scales as O(N S N 2.376 k ) [36, 37]. An interesting feature of the algorithm is that if N k is too large in a region S k, the region can be split into different subsets, as long as N k does not become too small; the cubic factor N 3 k in the complexity is then traded against a larger N S, on which the complexity depends linearly. This means that computational cost should not be viewed as a major obstacle to the convergence of this algorithm.
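The trade-off between N S and N k can be made concrete with a leading-order operation count (the function name and numbers below are illustrative).

```python
def inversion_cost(N_S, N_k, exponent=3.0):
    """Leading-order operation count for inverting one N_k x N_k matrix per
    region S_k: N_S * N_k**exponent (exponent 3 for Gauss-Jordan, about
    2.376 for Coppersmith-Winograd-like algorithms)."""
    return N_S * float(N_k) ** exponent

# One region of 1600 voxels versus 16 regions of 100 voxels each:
speedup = inversion_cost(1, 1600) / inversion_cost(16, 100)
print(speedup)  # 256.0: the cubic N_k factor is traded against a linear N_S
```

The same split with the 2.376 exponent still gains more than two orders of magnitude, so region splitting dominates over the choice of inversion algorithm.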
Biases in ng, or equivalently in the completeness fS, are expected to affect all three estimators. For the variance informed algorithm, this is because it stops whenever the number of expected galaxies N k ng is reached within 1% in S k, as described in Sec. 2.3. Homogeneous completion is trivially biased by a mismeasurement of the completeness, as can be guessed from Eq. (2.6). However, the impact is strongest for multiplicative completion, for which the error propagates to the estimated variance of galaxies (see Eq. (2.16)) on top of the total number of galaxies. There, a small bias may be dramatic below some completeness threshold, which is why the authors of [18] continuously interpolate between multiplicative and homogeneous completion at low completeness. In contrast, the variance in the distribution of galaxies σ 2 g only enters variance completion. Therefore, a bias in σ g only affects this estimator and leaves homogeneous and multiplicative completion unaffected. The extra parameter σ g can be an advantage for variance completion, but it is also an additional parameter on which the algorithm relies.
The other major parameters which we varied are the three removal parameters f S ∈ [0, 1], a ∈ [0, 1] and σ S ∈ [0, ∞). These allowed us to remove galaxies with three mechanisms: homogeneous, multiplicative and random. It is evident that there should be some uncertainty in the catalogue, which is modeled by σ S. The motivation for multiplicative completion is precisely that we expect the number of removed galaxies to be larger in regions where there are more galaxies. That is, the motivation for a = 0, which corresponds to galaxies removed proportionally to n i g, is quite straightforward, but exact proportionality is not obvious. A homogeneous removal component allows for deviations from perfect proportionality, which are not excluded. At low completeness, it allows one to remove sufficiently many galaxies in underdense regions so as to make them appear empty in the catalogue. In the realistic catalogue section (Sec. 3.3), we also applied an apparent magnitude cut to distinguish catalogued galaxies from missing ones. In this situation, the variance algorithm systematically outperforms its competitors. This supports the idea that direct proportionality between the catalogued galaxies and the true galaxies, as modelled by a = 0 and σ S /σ g ≪ 1, where multiplicative completion competes with variance completion, is an oversimplification of what happens in magnitude limited surveys.
When the completeness becomes so low that the structure of the catalogue is clearly unrepresentative of the underlying distribution of galaxies, one may increase the voxel size. As one does so, the advantage of variance completion over homogeneous completion in estimating the true number of missing galaxies, relative to the average number density of galaxies, drops, as may be seen in Tab. 3. There is a voxel size for which σ g becomes negligible, in the sense that the variance completion algorithm will pick up the homogeneous completion estimate as its initial guess and immediately converge. In this sense, variance completion converges to homogeneous completion.

Conclusion
In this work, we laid the foundation for incorporating knowledge of large-scale structure into a galaxy catalogue completion technique. This has the potential to enhance the line-of-sight redshift support utilised in the dark sirens method, and hence improve the resulting measurements of cosmological and astrophysical parameters. We reviewed homogeneous and multiplicative completion and analyzed the variance in the resulting estimate of the number density of galaxies from an incomplete catalogue. We demonstrated that homogeneous completion typically underestimates the variance, while multiplicative completion overestimates it at low completeness and may benefit from lucky cancellations otherwise. Taking advantage of the inability of both of these algorithms to reproduce the expected variance in the galaxy distribution, we introduced variance completion, which completes the galaxy catalogue by enforcing that both the expectation value ng and the variance σ 2 g of the underlying distribution of galaxies are respected. This requires knowledge of σ 2 g in addition to ng, a quantity that can be measured from a complete galaxy catalogue at low redshift, although subtleties may appear in practice.
Our variance completion technique requires the minimisation of a function L of the missing number of galaxies n i m in each voxel. The parameters which enter its definition are the catalogued galaxies n i c, the average number of galaxies ng and the variance σ 2 g. It requires iteratively solving a linear system of N k equations with N k unknowns, one n i m per voxel. This is done for each region S k of constant completeness f S within the total hosting volume V. As such, it is more computationally costly than homogeneous and multiplicative completion, but can be handled on a modern laptop with reasonable computation times of the order of seconds or minutes.
To compare the different completion schemes, we simulated a distribution of 10^6 galaxies according to a given correlation function, from which we extracted ng and σ g after fixing the binsize. To simulate the catalogue, we removed galaxies using three parameters, modeling homogeneous removal (with a = 1), multiplicative removal (with a = 0) and some randomness parametrized by σ S. We computed the average difference between the true number of missing galaxies and the estimated one for homogeneous, multiplicative and variance completion. We found that if the structure of the true distribution of galaxies is well preserved in the catalogue, variance completion always outperforms its competitors; more precisely, this holds for σ S ≤ 0.25σ g and f S ≥ 0.15. It performs marginally better than multiplicative completion for a = 0, but significantly better for a > 0 when the structure of the galaxy distribution is well preserved in the catalogue with σ S = 0.05σ g. We then applied the variance completion algorithm to a realistic simulation in which galaxies were removed by an apparent magnitude cut. In this situation, the variance algorithm always outperforms its opponents.
We conclude that variance completion, introduced in this work, is a precise and robust way to estimate the distribution of missing galaxies in a catalogue whenever the catalogue preserves, to some extent, the structure of the true distribution of galaxies. The next stage of this work will be to implement our variance completion algorithm in a full dark sirens analysis and determine to what depth it can extend the useful reach of the supporting galaxy catalogues. This, in turn, has the potential to enhance the precision of the resulting estimates of the Hubble parameter and black hole population parameters, and of tests of General Relativity with gravitational waves.

Figure 1 .
Figure 1. Top left: A 2D histogram of a realization of 10^6 simulated galaxies in 40 × 40 bins; this will be the 'true' galaxy distribution of our mock data. Top right: A realization of the catalogue galaxies corresponding to the distribution of the left panel, with completeness fraction f S = 0.15, removal scatter to intrinsic scatter ratio σ S /σ g = 0.05 and fraction of homogeneously removed galaxies a = 0.05. In this case, the catalogue roughly keeps the structure of the true distribution of galaxies, but careful examination reveals structure variations. Bottom: Structure difference between the catalogued and true galaxies presented on top. We present n i g − n i c /f S, which rescales the catalogued galaxies so as to compare the structure difference rather than the average difference between the catalogued and the true galaxies. If the two had exactly the same structure, this last plot would be 0 everywhere.

Figure 2 .
Figure 2. We plot the residual of the homogeneous, multiplicative and variance estimators defined as Res(ni m ) = n i m − ni m .Estimators with lower residual amplitudes are more precise.In the situation where the completeness fraction f S = 0.15, fraction of homogeneously removed galaxies a = 0.15 and removal to intrinsic scatter ratio σ S /σ g = 0.05, the variance informed estimator has residuals with an amplitude of ∆ var = 41.08 on average.The homogeneous estimator has a much larger amplitude of about ∆ hom = 141.21.The multiplicative estimator gives ∆ mul = 173.99.In this case, the variance informed estimator is more than 4 times more precise.

Figure 4 .
Figure 4. In this plot, we show one line of sight for the reconstructed number of galaxies ng, using the three outlined estimators, together with the true number of galaxies n g. We do this for the case f S = 0.15, a = 0.15, σ S /σ g = 0.5. Such a large σ S = 132.10 means that some peaks are converted to troughs and some troughs to peaks. This misleads all three estimators, which convert above-average catalogue densities into peaks and below-average ones into troughs, which may be incorrect. For example, there are nearly no catalogued galaxies at bin 25, yet a prominent peak exists there, which is underestimated by all three estimators. Conversely, at bin 16, all three estimators produce a peak, with strongly varying amplitudes, although none exists: the true number of galaxies there is average. In this case, homogeneous completion is closest to the simulated data, with ∆ hom = 174.11, ∆ mul = 462.01 and ∆ var = 204.96.

Figure 5
Figure 5.We plot a histogram of the 10 6 catalogued galaxies for which the completeness varies with the x bins.The completeness fraction varies between f Smax = 0.7 on the left hand side to f Smin = 0.05 on the right hand side.The removal scatter to intrinsic scatter ratio is fixed to σ S /σ g = 0.05 and the fraction of homogeneously removed galaxies is set to a = 0.2.

Figure 6 .
Figure 6. We plot a 2D histogram of the results for the homogeneous completion estimator ni m;hom in a case where the completeness varies with the x bin. The completeness fraction varies between f Smax = 0.7 and f Smin = 0.05. The removal scatter to intrinsic scatter ratio is fixed to σ S /σ g = 0.05, with the fraction of homogeneously removed galaxies set to a = 0.2. The estimated completeness fraction fS differs from f S, which evolves linearly, because the number of missing galaxies is drawn from a truncated Gaussian distribution, as described in Sec. 2.4.

Figure 7 .
Figure 7. Here, we plot one line of sight, i.e. a line from the catalogue with evolving completeness parameters f Smax = 0.7 and f Smin = 0.05, from left to right. We plot the reconstructed number of galaxies per bin ni g = n i c + n i m for the homogeneous, multiplicative and variance informed estimators, together with the true number of galaxies n i g. At high completeness, on the left of the plot, all schemes give similar results, as the number of missing galaxies is subdominant. As the completeness of the catalogue decreases towards the right-hand side of the plot, the missing galaxies come to dominate the catalogued ones, which damps the amplitude of existing structure in the case of homogeneous completion. On the other hand, multiplicative completion overamplifies the structure. Variance informed completion moderates these two extremes, using knowledge of σ g. Here, we have assumed that the removal scatter to intrinsic scatter ratio is σ S /σ g = 0.05 and the fraction of homogeneously removed galaxies is a = 0.2, which is a situation where the variance informed estimator performs much better than the other two, as can be seen from Table 2.

Figure 9 .
Figure 9. Here, we plot one line of sight, i.e. a line from the MICECAT catalogue of 188, 976 galaxies with evolving completeness parameters f Smax = 0.5 and f Smin = 0.05.We chose a line of sight of 50 bins and plot the results of our estimators for visual comparison.At high completion (on the left), all schemes give similar results as the number of missing galaxies is subdominant.As the completeness of the catalogue decreases (towards the right), the missing galaxies come to dominate the catalogued ones, which damps the amplitude of existing structure in case of homogeneous completion.On the other hand, multiplicative completion overamplifies the structure.Variance completion is able to reproduce more accurately the galaxy distribution for a realistic catalogue.Here, we have assumed that the removal scatter to intrinsic scatter ratio is given by σ S /σ g = 0.05 and the fraction of homogeneously removed galaxies is set to a = 0.2, which is a situation where the variance informed estimator performs much better than the other two, ∆ hom = 23.39,∆ mul = 26.22 and ∆ var = 13.53.

Figure 10 .
Figure 10.We plot a histogram of catalogued galaxy counts from the MICECAT V2 simulations where y, z ∈ [0, 1000] [Mpc/h] and x ∈ [1600, 1600+∆x] [Mpc/h] with ∆x = 10 [Mpc/h].The average number of galaxies per bin in that case is ng = 5.34 and the standard deviation is σ g = 6.28.Bins with more than 25 galaxies are plotted with same color as for n i c = 25 for visual clarity.The results for the reconstructed number densities are presented in the first line of Tab. 3.
Figure 11. Here, we plot one line of sight, i.e. a line from the MICECAT catalogue with ng = 5.34, and plot the results of the estimators for visual comparison. The completeness decreases from 96% at low line-of-sight bin to 2% at higher bins. When the catalogue is almost complete, all schemes give similar results, as the number of missing galaxies is subdominant. As the completeness of the catalogue decreases, the missing galaxies come to dominate the catalogued ones, which damps the amplitude of existing structure in the case of homogeneous completion. Multiplicative completion overamplifies the structure. Variance informed completion moderates these two extremes, using knowledge of σ g. Around bins 33-35, variance completion is misled by a clump of high-magnitude (faint) galaxies into overestimating the number of missing galaxies. The variance informed estimator nevertheless performs much better than the other two, with ∆ hom = 3.40, ∆ mul = 4.09 and ∆ var = 2.87.

Table 2 .
Simulation of 10^6 galaxies with evolving completeness; this table contains the main results of this section. The columns list the study label, the fraction of homogeneously removed galaxies a, the removal scatter to intrinsic scatter ratio σ S /σ g, and the average differences ∆ for the three estimators. The bin size is fixed to 10. The completeness fraction f S varies linearly from f Smax = 0.7 to f Smin = 0.05.