A COMPARISON OF GALAXY COUNTING TECHNIQUES IN SPECTROSCOPICALLY UNDERSAMPLED REGIONS

Mike A. Specian; Alex S. Szalay

doi:10.3847/0004-637X/831/1/53

1. INTRODUCTION

Mapping the three-dimensional distribution of galaxies is of primary importance to those seeking to understand the large-scale structure of the universe. Galaxy surveys, like the Sloan Digital Sky Survey (SDSS; York et al. 2000), are often the most effective tool for generating these maps. Candidate galaxies are first identified photometrically; those that meet certain criteria are then designated as targets for follow-up spectroscopy. This latter step is crucial, as spectroscopically derived redshifts are often the most effective tool for determining galaxies' radial distances.

However, it is rarely the case that every galaxy target will have its spectrum taken. This can occur for a number of reasons, including a maximum number of simultaneous fibers per spectrograph, a minimum separation distance between any two fibers, competing priorities for spectroscopic targets, and active surveys in which spectroscopic observations lag behind photometric ones. As a result, some areas of the sky have a spectroscopic completeness (the fraction of targets with measured spectra) of less than one.

This raises the challenging question of how to account for galaxy targets whose lines of sight are known photometrically, but that lack redshifts. In this paper, we examine the issue of spectraless targets within the context of measuring galaxy overdensities within the SDSS volume as a function of position. We tackle this problem by generating mock catalogs based on the Sixth Data Release (DR) of the SDSS Main Galaxy Sample (MGS; Strauss et al. 2002), and then testing the effectiveness of seven separate counting techniques as a function of redshift, survey region type, and smoothing radius.

Our usage of DR6 concludes the analysis begun in Specian & Szalay (2016) and M.A. Specian & A.S. Szalay (2016, in preparation). The former paper corrects errors in the photometric and spectroscopic footprints (SFs)—corrections critical for our current analysis—while the latter suggests remedies for statistical and systematic noise. These remedies are not employed in this paper, but require a discretized space to implement, which further necessitates the ability to accurately count the numbers of galaxies in given volumes.

Approximately 20% of DR6 MGS targets within the SF lack spectra, leading to an environment that is ripe for study. The situation is even more pronounced where galaxy densities are high. Yoon et al. (2008) cited a 30%–40% incompleteness rate in dense regions while searching for galaxy clusters in three dimensions. In their study of galaxy clusters, von der Linden et al. (2007) discovered that the central galaxy in 30% of clusters was missing a redshift.

Spectroscopic incompleteness adds uncertainty to the number of galaxies contained within a volume and, by extension, the overdensity of that volume. The spatial distribution of overdensities determines the power spectra (linear, logarithmic, Gaussianized, and otherwise), which are valuable tools for constraining a variety of cosmological properties. These include the universe's matter density via the cosmological horizon at matter-radiation equality (Percival 2007) and the "scalar index" n_s (Chung et al. 2003), which describes how density fluctuations vary with scale. Imprints in the power spectrum left by the baryon acoustic oscillations (see, e.g., Blake et al. 2011) are related to the physical densities of cold dark matter ${{\rm{\Omega }}}_{c}{h}^{2}$ and baryons ${{\rm{\Omega }}}_{b}{h}^{2}$, the Hubble parameter (e.g., Peebles & Yu 1970; Sunyaev & Zeldovich 1970; Doroshkevich et al. 1978), dark energy (e.g., Blake & Glazebrook 2003; Hu & Haiman 2003; Seo & Eisenstein 2003), and the curvature of the universe (Blake et al. 2007).

The structure of this paper is as follows. In Section 2, we set the stage for our simulations by providing relevant background information regarding the SDSS, MGS, and photometric redshifts. We also explain our motivation and methodology for dividing the MGS into subsamples. In Section 3, we describe in detail the operation and implementation of our seven counting techniques under various conditions. In Section 4, we utilize Monte Carlo simulations to assess the effectiveness of each counting technique within three distinct SDSS region types, and report our results. We offer our final comments and conclusions in Section 5.

2. BACKGROUND

This section begins with a review of critical facets of the SDSS. We summarize the survey's observational strategy and the creation of its photometric and spectroscopic footprints. We highlight key aspects of how photometric data are gathered and processed, then describe which photometric criteria are used to admit a target into the MGS. MGS targets are eligible for spectroscopic observation, but in DR6 only ∼80% have their spectra measured. This leads us to split the MGS into two mutually exclusive groups: those that possess quality spectra, and the remainder that do not. Accounting for members of the latter group requires specialized techniques. A few of those techniques rely upon the use of photometric redshifts, the critical properties of which we discuss in Section 2.4.

2.1. Sloan Digital Sky Survey

The SDSS (York et al. 2000) is one of astronomy's premier projects. It is a multi-fiber imaging and spectroscopic survey purposed with measuring the positions and properties of hundreds of millions of celestial objects over about a quarter of the sky. Since beginning operations in 2000, the SDSS has photometrically imaged about 500 million objects and taken spectra for about one million of those that satisfied certain color and brightness (i.e., apparent flux) criteria. Its primary goal was to image a contiguous, well-calibrated 10,000 deg² area of the northern Galactic cap with follow-up spectroscopy of brown dwarfs, supernovae, MGS galaxies, and Luminous Red Galaxies (LRGs; Eisenstein et al. 2001). The complete catalog of these results is referred to as the Sloan Legacy Survey. This and other SDSS data are accessible through a relational database system known as the Catalog Archive Server (CAS) via a web portal called SkyServer¹ (SkyServer 2008; Thakar et al. 2008).

About once a year, the SDSS issues a data release containing the latest results. Each new release subsumes earlier versions. While the Legacy Survey was effectively complete after the final spectroscopic observations of DR7, DR8 made small changes in photometric calibration and derived new photometric redshifts. So while DR8 will not introduce any new objects to our analysis, we will occasionally use its improved measurements to better inform our handling of DR6.²

The areas of the survey within which photometric and spectroscopic observations were collected are referred to as the photometric footprint (PF) and SF, respectively. We define the PF to be the union of primary segments, or the areas containing primary photometric observations of targets. The SF (which is completely subsumed by the PF) is defined as the union of sectors, where a sector is an SDSS region type formed by the intersections of spectroscopic tiles and masks. For a more complete description of segments, sectors, etc., as well as a host of corrections that were made to improve their quality, we refer the reader to Specian & Szalay (2016).

Spectra were collected with multi-object fiber-fed spectrographs (Uomoto et al. 1999). At the focal plane, fibers were manually inserted into holes in a 1 m diameter circular disk known as a tile. The position of each hole/fiber on the tile corresponded to the position of a spectroscopic target. Only 592 fibers could be allocated per tile, and no two fibers on a single tile could be positioned closer than 55'' to one another. Tiles were permitted to overlap but, even so, many small areas in between their circular projections were not spectroscopically observed. Additional masks and other effects further reduced the spectroscopic completeness of the survey. For these reasons and various other limitations, only about 80% of MGS targets in the SF had their spectra collected by the end of DR6.

Those that did had their spectra processed through a pipeline with two primary data products, the type of target (e.g., star, galaxy, QSO) and its redshift, which were determined through a best overall chi-squared fit. In cases where accurate classifications could not be made, a flag was set in the database to indicate a poorly determined or otherwise inconsistent result. Redshift confidence was quantified by the CAS SpecObj table parameter zConf.

2.2. The MGS

The Legacy Survey targeted two samples of galaxies—the MGS and the LRGs. The former is a brighter, flux-limited, local sample of higher density. The latter is a redder, deeper, sparser, color-limited sample. We have opted to focus our galaxy counting analysis on MGS targets since they outnumber the LRGs by about an order of magnitude, and yield more robust spatial correlation and photometric redshift statistics.

Strauss et al. (2002) lay out the full list of conditions that must be met for a photometric object to be considered an MGS target and eligible for spectroscopic observation. However, the MGS target criteria were imperfect. Stars were sometimes confused with galaxies and vice versa. Identifying galaxies was the primary focus of the survey, so minimizing false negatives was prioritized over admitting false positives, especially since the latter could likely be identified spectroscopically. Depending on which specific criteria were met, one or more of the flags listed below were set in the primTarget database field of the PhotoPrimary table:

64	=	"TARGET_GALAXY"
128	=	"TARGET_GALAXY_BIG" and/or
256	=	"TARGET_GALAXY_BRIGHT_CORE".
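As an illustration of how these flags combine, the snippet below treats primTarget as a bitmask and tests whether any of the three galaxy bits is set. It is a minimal sketch: the helper name `is_mgs_target` and the example values are assumptions made for illustration, and only the flag values 64, 128, and 256 come from the list above. The full MGS selection is the one given by Strauss et al. (2002).

```python
# Flag values taken from the list above; names mirror the SDSS flag labels.
TARGET_GALAXY = 64
TARGET_GALAXY_BIG = 128
TARGET_GALAXY_BRIGHT_CORE = 256

# Union of the three MGS-related bits.
MGS_MASK = TARGET_GALAXY | TARGET_GALAXY_BIG | TARGET_GALAXY_BRIGHT_CORE


def is_mgs_target(prim_target: int) -> bool:
    """Return True if any MGS galaxy bit is set in a primTarget value.

    Hypothetical helper for illustration; it is not part of the CAS schema.
    """
    return (prim_target & MGS_MASK) != 0


if __name__ == "__main__":
    # 192 = 64 + 128 sets two galaxy bits; 32 sets none of them.
    for value in (64, 192, 32, 0):
        print(value, is_mgs_target(value))
```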


For more details on how we extract the MGS from the CAS database, consult the SQL code provided in Appendix A of Specian & Szalay (2016).

There are approximately 230 million primary photometric objects in the DR6 database, but only 824,287 of those are MGS targets. In DR6, almost all MGS targets have photometric redshifts determined through both template and training-set methods (see Section 2.4). About 70% of targets have spectra, and over 99% of those have K corrections. Table 1 summarizes these results.

Table 1. Census of DR6 Objects Identified as MGS Targets through Their Photometric Properties

SDSS Objects	Count
Objects in `PhotoPrimary`	230,417,920
Objects defined as MGS targets	824,287
MGS targets with template photo-z's	824,286
MGS targets with ANN photo-z's	816,401
MGS targets with spectra	577,436
MGS targets with spectra and K corrections	572,819


2.3. Pristine MGS Galaxies and Non-pristine MGS Objects

Anything that satisfies the MGS photometric criteria (i.e., MGS targets) falls into one of two broad categories—those with quality spectra (hereafter MGS galaxies) and those without (hereafter MGS objects). While MGS galaxies readily yield their radial distances, MGS objects complicate the creation of a complete and accurate map of the local universe. In other words, we know the lines of sight MGS objects lie along, but can only make clumsy guesses as to their distances.

Overdensity measures depend on the ratio of the number of targets counted to the number expected within a volume. If the latter quantity is normalized using only MGS galaxies, overdensities calculated by counting galaxies alone could approximately equal the true overdensities. This approximation would be exact if the angular distribution of MGS objects were isotropic. However, this is not the case. Spectroscopic fiber collisions can leave overdense areas with low spectroscopic completeness. Survey boundaries slated to be tiled and spectroscopically observed in subsequent data releases leave MGS objects in their wake. Even if anisotropy were not an issue, the large fraction of MGS objects adds significant variance to the approximation.
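The argument above can be made concrete with a toy calculation. The sketch below uses the overdensity definition from Table 2, δ = n/⟨n⟩ − 1, and compares a uniform completeness with one that is lower in the denser cell. All counts and completeness values are invented for illustration, and normalizing ⟨n⟩ by a simple mean completeness is an assumption of this sketch rather than the procedure used in this work.

```python
def overdensity(n_counted: float, n_expected: float) -> float:
    """delta = n / <n> - 1, as defined in Table 2."""
    return n_counted / n_expected - 1.0


# Toy numbers (assumed): two cells that would each hold 100 targets
# in the absence of clustering, one overdense and one underdense.
true_counts = [120, 80]
expected_targets = 100.0
delta_true = [overdensity(n, expected_targets) for n in true_counts]

# Case 1: the same completeness everywhere. Scaling <n> by the same
# factor recovers the true overdensities from galaxies alone.
uniform = 0.8
delta_uniform = [
    overdensity(n * uniform, expected_targets * uniform) for n in true_counts
]

# Case 2: the dense cell loses more spectra (e.g., to fiber collisions).
# <n> is scaled by the mean completeness, so the contrast is damped.
completeness = [0.7, 0.9]
mean_completeness = sum(completeness) / len(completeness)
delta_varying = [
    overdensity(n * c, expected_targets * mean_completeness)
    for n, c in zip(true_counts, completeness)
]

if __name__ == "__main__":
    print("true:   ", delta_true)
    print("uniform:", delta_uniform)
    print("varying:", delta_varying)
```

In the second case both cells are pulled toward δ = 0, which is the kind of bias that counting galaxies alone cannot remove.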

To begin addressing this problem, we split the MGS targets of Strauss et al. (2002) into two mutually exclusive groups—galaxies with high-quality spectra, which we refer to as pristine galaxies, and everything else (e.g., targets without spectra, targets with low-quality spectra), which we refer to as non-pristine objects.

In order for an MGS target to enter either sample, it must satisfy an additional bright magnitude limit of r_P > 15, where r_P is the extinction-corrected Petrosian magnitude in the r band. This is designed to both maintain the uniformity of the sample and flatten out the selection function at low redshift. This additional constraint rejects about 2.8% of otherwise eligible targets. It should be assumed that all references to MGS objects, galaxies, or targets incorporate this new criterion.

Since this paper utilizes specific, and sometimes non-standard terminology, we have created Table 2 as a reference. It summarizes the differences between targets, galaxies, objects, the regions they occupy, and more.

Table 2. Definitions

Term/Symbol	Definition
Target	anything satisfying the MGS photometric criteria
(Pristine) galaxy	targets spectrally classified as MGS galaxies
(Non-pristine) object	targets not spectrally classified as MGS galaxies

PF	photometric footprint
SF	spectroscopic footprint
β_PF	fraction of a cell's volume within the PF
β_SF	fraction of a cell's volume within the SF

Interspersed region	area within both a cell's circular projection and the SF
Dark region	area within a cell's circular projection, but outside the SF
External region	area significantly outside the SF
Interspersed object/galaxy	object/galaxy within an interspersed region
Dark object/galaxy	object/galaxy within a dark region
Galaxy	(italicized) galaxy retaining its spectroscopic information during simulations
Object	(italicized) galaxy stripped of its spectroscopic information during simulations

n	number of targets in cell
$\langle n\rangle$	number of targets expected in cell
δ	overdensity $=\,n/\langle n\rangle -1$
2PCF	two-point correlation function

Note. Definitions for select terminology and symbols used throughout this paper. The definitions of "target," "galaxy," and "object" provided here are simplifications. Full caveats and conditions are provided in the text.


2.3.1. Pristine Sample

To be designated as a pristine galaxy, an MGS target must

1. have a redshift (i.e., specObjID cannot be set to 0) that is of high confidence and quality,
2. be spectrally classified as a galaxy (as opposed to a star or QSO) through field specClass,
3. reside in the redshift range of 0.02 ≤ z ≤ 0.30, and
4. have an absolute magnitude of M_r ≤ −17.

The first two conditions maximize the probability that members of the pristine sample are actually galaxies. Like the apparent magnitude condition, the absolute magnitude criterion is introduced to maintain sample uniformity. The numbers of MGS targets that satisfy each subsequent pristine galaxy requirement are provided in Table 3. The median redshift of the pristine galaxy sample is approximately z = 0.1.

Table 3. Census of 496,617 DR6 MGS Targets that Satisfy the Pristine Galaxy Criteria

SDSS Objects	Count	Percentage
MGS targets with r_P > 15	801,281	N/A
above with spectra	562,743	70.2
above with specClass = galaxy	551,847	98.1
above at 0.02 ≤ z ≤ 0.30	543,663	98.5
above with good zStatus	513,986	94.5
above with zConf ≥ 0.9	496,617	96.6
above where M_r < −17	496,617	100

Note. Each row contains the number and percentage of objects remaining from the previous row after the filtering condition is applied. The absolute magnitude condition was ultimately redundant because all DR6 pristine galaxies that satisfied the first three conditions also satisfied the fourth.
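The sequence of cuts in Table 3 can be sketched as a single predicate. This is a minimal illustration, not the SQL used to build the sample: the `Target` record and its field names are hypothetical, and the set of rejected zStatus values is an assumption inferred from the values that Section 2.3.2 lists as unusable (0, 1, 2) or low confidence (5, 8, 10).

```python
from dataclasses import dataclass

# zStatus values described in Section 2.3.2 as unusable or low confidence.
# Treating every other value as "good" is an assumption of this sketch.
BAD_ZSTATUS = frozenset({0, 1, 2, 5, 8, 10})
SPECCLASS_GALAXY = 2


@dataclass
class Target:
    """Hypothetical record; field names loosely mirror CAS columns."""
    spec_obj_id: int   # 0 means no spectrum was taken
    spec_class: int    # 2 = GALAXY
    z: float           # spectroscopic redshift
    z_status: int
    z_conf: float
    r_petro: float     # extinction-corrected Petrosian r magnitude
    abs_mag_r: float   # K-corrected absolute magnitude M_r


def is_pristine(t: Target) -> bool:
    """Apply the pristine-galaxy cuts in the order used by Table 3."""
    return (
        t.r_petro > 15.0
        and t.spec_obj_id != 0
        and t.spec_class == SPECCLASS_GALAXY
        and 0.02 <= t.z <= 0.30
        and t.z_status not in BAD_ZSTATUS
        and t.z_conf >= 0.9
        and t.abs_mag_r <= -17.0
    )
```

Any target for which `is_pristine` returns False would fall into the non-pristine sample, including targets with no spectrum at all.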


Galaxies in the low-redshift range z < 0.02 posed a couple of problems. After spectroscopic analysis, many were discovered to be non-galaxies (usually stars) that made it through the MGS target selection pipeline. Peculiar velocities were also a concern in this regime because a number of galaxies had redshifts close to or less than zero. If left in, these galaxies would artificially drive up measurements of the absolute magnitude and erroneously skew the selection function. The lower redshift limit of z ≥ 0.02 was chosen to avoid the worst of these problems.

2.3.2. Non-pristine Sample

All MGS targets not defined to be pristine galaxies are non-pristine objects. Members of this subsample fall into one of three mutually exclusive groups:

1. those whose quality spectroscopic data reveal that they are not actually galaxies,
2. those whose spectral classifications and/or redshifts are of low confidence, and
3. those with no or useless spectra.

Non-galaxies. Of the MGS targets with measured spectra, the vast majority were identified as galaxies and had their specClass fields set equal to two. However, after spectral data reduction, about 0.5% of targets that would have otherwise been classified as pristine galaxies were revealed to be non-galaxies, i.e., stars or QSOs. Distributions of the types and redshifts of these misidentified targets are provided in Table 4 and Figure 1, respectively. We find that the probability of an MGS target actually being a galaxy can be modeled with the linear best-fit relation,

$$m(z)=-0.057z+1.00, \tag{1}$$

as seen in Figure 2.

**Figure 1.** Distribution of MGS targets that satisfy all pristine galaxy criteria except *specClass* = 2. Targets are counted in redshift bins of size Δz = 0.01.

**Figure 2.** Probability that an MGS target is verified as an MGS galaxy after spectroscopic observation. The vertical axis represents the ratio of pristine galaxies to the number when the *specClass* = 2 criterion is lifted. Targets are counted in 70 redshift bins uniformly spaced between z = 0.02 and z = 0.22. A linear best-fit with equation $m(z)=-0.057z+1.00$ is superimposed. This relationship can be interpreted as the probability of an MGS object meeting the pristine galaxy requirements were its spectrum to be measured.

Table 4. Distribution of Pristine Galaxies if the Criterion Forcing Them to be of specClass GALAXY is Lifted. The Possible Values of specClass and Their Integer Identifiers in CAS Occupy the First and Second Columns

specClass	Identifier	Number Count
UNKNOWN	0	2
STAR	1	3
GALAXY	2	496,617
QSO	3	2691
STAR_LATE	6	1


Because there is no bias in the tiling algorithm, it is likely that approximately 0.5% of MGS objects are also stars or QSOs. These objects ought to be disregarded when estimating galaxy counts. One could utilize Equation (1) to do this as a function of redshift. However, this correction is of substantially smaller magnitude than the characteristic uncertainties of our counting methods, so we do not apply it within the course of our present analysis. We recognize that such a correction might be useful in other applications, so we provide the result for completeness.
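For readers who do wish to apply the correction, one possible use of Equation (1) is to weight each object by m(z) so that it contributes a fractional galaxy count. The sketch below is illustrative only: the function names are hypothetical, the redshifts passed in would have to be estimates (objects lack spectra), and the fit was made over 0.02 ≤ z ≤ 0.22, so values outside that range are extrapolations.

```python
def galaxy_probability(z: float) -> float:
    """Equation (1): m(z) = -0.057 z + 1.00."""
    return -0.057 * z + 1.00


def expected_galaxy_count(estimated_redshifts) -> float:
    """Sum of m(z) over objects, so each object counts as a fractional galaxy.

    Hypothetical helper: the inputs are estimated redshifts, since the
    objects being corrected have no spectroscopic redshift.
    """
    return sum(galaxy_probability(z) for z in estimated_redshifts)


if __name__ == "__main__":
    sample = [0.05, 0.10, 0.20]  # assumed redshift estimates for three objects
    print(expected_galaxy_count(sample))
```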

Objects with no or useless spectra. About 30% of MGS targets (i.e., 238,538 objects) lack spectra, as is indicated by their having specObjID = 0. Another 29,478 have nonzero specObjID values but have zStatus values equal to 0, 1, or 2, which are, respectively, "redshift not taken," "redshift measurement failed," and "redshift cross-correlation and emission line redshift both high-confidence but inconsistent."

None of these objects possess useful spectral information; therefore, no redshift range or specClass criteria are imposed. Spectrally defined K corrections are also unavailable, so no absolute magnitude limit can be applied. If the statistics of the pristine galaxies are representative, this omission should have a negligible effect.

However, these MGS objects still possess useful information including photometric redshifts and angular positions relative to other pristine galaxies. The correlations between these quantities and radial distance can each be exploited as a way to improve the counting of targets within discrete volumes.

Objects with low-quality spectra. In the middle ground between pristine galaxies and no-redshift objects are MGS objects with low-quality spectra. These include objects with zConf < 0.9 and those with their zStatus flags set to any of the options listed in Table 5. Targets with z ≤ 0 and z > 1 are also admitted to this sample because those spectroscopic redshifts are likely nonphysical given the properties of the MGS.

Table 5. Census of DR6 MGS Targets That Enter the Low-quality Sample Due To Having Their zStatus Flag Set to 5, 8, or 10

zStatus	Identifier	Number Count
Redshift determined from cross-correlation with low confidence	5	2386
Redshift determined from em-lines with low confidence	8	133
Redshift determined "by hand" with low confidence	10	304

Note. Descriptions of these flags are provided in the first column, while the number of targets that satisfy them are provided in the third column.


Members of the low-quality sample have spectroscopic redshifts that are less well-determined than those in the pristine sample. While the redshift confidence bright-line of 0.9 is somewhat arbitrary, we argue that targets on the lower side of the zConf spectrum should not be treated identically to pristine galaxies, at least not by default.

While spectra of members of the low-quality sample could be inspected manually, doing so for thousands of objects is unrealistic. There are other possible solutions. Low-quality spectroscopic redshifts could be used directly, but only if other distance estimation methods have greater uncertainties or if there is robust agreement with derived photometric redshifts. Alternatively, the precise value of zConf could be used to derive a weighted redshift or a modified redshift probability distribution.

The nature of our counting simulations in Section 4 places a premium on identifying MGS targets with high-precision redshifts. To maintain the integrity of the pristine galaxy sample, we therefore opt to group the low-quality targets in with the non-pristine objects, thereby discarding their spectroscopic redshift information. We recognize, however, that others may wish to handle these targets differently. The selection criteria and analysis employed here should allow them to do so.

2.4. Photometric Redshifts

One of the most popular techniques for estimating the radial distances to galaxies when spectroscopic redshifts are unavailable is the use of photometric redshifts, or photo-z's. Efforts to derive redshifts photometrically date back 50 years (e.g., Baum 1962; Weymann et al. 1999) and the methods developed can be classified into two broad categories: template-fitting and training set.

The template-fitting approach attempts to match an object's multi-color spectral energy distribution (SED) with model or empirical templates of known objects (see, e.g., Budavári et al. 2000; Csabai et al. 2003; Coe et al. 2006; Brammer et al. 2008).³ Training-set methods use empirical spectroscopic and photometric information from known galaxies to train estimators to best predict unclassified objects' types and redshifts. For a comparison of methods, see, e.g., Dahlen et al. (2013) and Hildebrandt et al. (2010). In either case, errors for photometric redshifts are typically much larger than those for spectroscopic redshifts.

For DR6 the training-set approaches of Oyaizu et al. (2008), which utilize artificial neural networks (ANNs), were incorporated directly into the database table Photoz2. There are two sets of photo-z's (and their errors) reported in Photoz2—CC2 and D1.⁴ Each is trained using different concentration parameters (which measure how compact a galaxy's light is) and colors. Our analysis utilizes the D1 set since it has been shown to display better performance at brighter magnitudes, which are a characteristic of MGS galaxies. We note that comparisons between the methods have shown that, with surveys the scale of SDSS, photo-z's derived through training sets exhibit less bias and scatter relative to their true spectroscopic redshifts than those from template methods (Cunha et al. 2009).

With either method, photometric redshifts are conditioned upon the colors of their galaxies and are thus Bayesian reconstructions of their true redshifts. Therefore, each photo-z should be considered less of a determined number and more (when combined with its error) of a probability distribution. This statistical interpretation of photometric redshifts is put into practice in Section 3.4.
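To illustrate what treating a photo-z as a distribution can look like in practice, the sketch below spreads one object over a set of redshift bins. It assumes a Gaussian PDF centered on the photo-z with a width equal to its reported error; this Gaussian form, the function names, and the example numbers are assumptions made for illustration and are not necessarily the form adopted in Section 3.4.

```python
import math


def gaussian_cdf(x: float, mean: float, sigma: float) -> float:
    """Cumulative distribution of a normal with the given mean and sigma."""
    return 0.5 * (1.0 + math.erf((x - mean) / (sigma * math.sqrt(2.0))))


def photoz_bin_weights(z_phot: float, z_err: float, bin_edges):
    """Fraction of a Gaussian photo-z PDF that falls in each redshift bin.

    Weights are renormalized to sum to 1 over the supplied bins, i.e., the
    object is assumed to lie somewhere inside the binned redshift range.
    """
    cdf = [gaussian_cdf(edge, z_phot, z_err) for edge in bin_edges]
    raw = [hi - lo for lo, hi in zip(cdf[:-1], cdf[1:])]
    total = sum(raw)
    if total <= 0.0:
        raise ValueError("photo-z PDF has no weight inside the binned range")
    return [w / total for w in raw]


if __name__ == "__main__":
    edges = [0.02, 0.06, 0.10, 0.14, 0.18, 0.22]  # assumed bin edges
    print(photoz_bin_weights(z_phot=0.10, z_err=0.02, bin_edges=edges))
```

Each weight can then be read as the partial count that the object contributes to a volume element at that redshift along its line of sight.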

Finally, we acknowledge that there are many other correlations that can be exploited in pursuit of better photo-z's, and a host of codes that do so. Because the purpose of this paper is not to reflect the cutting edge of photometric redshift analyses, these will not be considered here. Instead, we remain focused on the two principal photometric redshift types supplied directly by the SDSS: SED and D1.

3. COUNTING TECHNIQUES

In this section, we present seven techniques for counting MGS objects in cells. To compare these techniques, we simulate the distribution of MGS targets by randomly splitting the true MGS galaxy sample into two sets—objects that are stripped of their redshift information and galaxies that are not.⁵ We then use the counting techniques to approximate the number counts and overdensities within cells. These measures are compared against the true counts, and conclusions are drawn regarding which techniques are most effective as a function of redshift.

We discretize space by populating the volume of the DR6 SF with closely packed, non-overlapping spherical cells of radii 7, 11, and 16 ${h}^{-1}\,\mathrm{Mpc}$ (hereafter R7, R11, and R16) that fill about 74% of the survey volume using the hexagonal closest packing (HCP) arrangement (Conway & Sloane 1993). We define the survey volume to lie within the redshift limits $z\in [0.02,0.22]$ for R7 and R11, and $z\in [0.02,0.30]$ for R16. We utilize a Monte Carlo process with the SDSS region algebra to empirically determine the cells' volume fractions β_PF and β_SF that intersect the PF and SF, respectively. To ensure that at least half of the targets in cells (on average) are galaxies, we require that β_SF ≥ 0.62.
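The geometry described above can be sketched in a few lines. The first function places touching spheres on a hexagonal close-packed lattice, whose ideal filling fraction is π/(3√2) ≈ 0.74; the second estimates the fraction of a cell's volume that satisfies a footprint test by Monte Carlo sampling. This is a simplified illustration: `inside_footprint` is a hypothetical Cartesian predicate standing in for the SDSS region algebra, and the sample size and seed are arbitrary choices.

```python
import math
import random

# Ideal filling fraction of close-packed equal spheres, pi / (3 sqrt(2)).
HCP_PACKING_FRACTION = math.pi / (3.0 * math.sqrt(2.0))


def hcp_centers(radius: float, n_x: int, n_y: int, n_z: int):
    """Centers of touching spheres on a hexagonal close-packed lattice."""
    centers = []
    for k in range(n_z):          # layers alternate A, B, A, B, ...
        for j in range(n_y):
            for i in range(n_x):
                x = (2 * i + (j + k) % 2) * radius
                y = math.sqrt(3.0) * (j + (k % 2) / 3.0) * radius
                z = (2.0 * math.sqrt(6.0) / 3.0) * k * radius
                centers.append((x, y, z))
    return centers


def volume_fraction_inside(center, radius, inside_footprint,
                           n_samples: int = 20000, seed: int = 0) -> float:
    """Monte Carlo estimate of the fraction of a cell's volume for which
    inside_footprint(x, y, z) is True (a stand-in for beta_PF or beta_SF)."""
    rng = random.Random(seed)
    hits = 0
    accepted = 0
    while accepted < n_samples:
        dx, dy, dz = (rng.uniform(-radius, radius) for _ in range(3))
        if dx * dx + dy * dy + dz * dz > radius * radius:
            continue  # rejection sampling: keep only points inside the sphere
        accepted += 1
        if inside_footprint(center[0] + dx, center[1] + dy, center[2] + dz):
            hits += 1
    return hits / n_samples
```

A cell would then be retained only if its estimated β_SF meets the threshold quoted above.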

The radial distribution of galaxies within the SF is used to parameterize the Schechter luminosity function, which in turn is integrated over all absolute magnitudes between M_min(z) and M_max(z) to yield the radial selection function S(z). Assuming a standard flat cosmology where ${{\rm{\Omega }}}_{m}$ = 0.3, ${{\rm{\Omega }}}_{k}$ = 0, ${{\rm{\Omega }}}_{{\rm{\Lambda }}}$ = 0.7, and h = 0.7, the selection function yields $\langle n\rangle$, or the expected number of galaxies in a cell in the absence of clustering. For further details on this and the creation of the cells, we refer the reader to M. A. Specian & A. S. Szalay (2016, in preparation).

The seven counting techniques fall into three categories. The first is "discrete counting," in which every object is assigned a single redshift. The second is "scaling," in which the number count of galaxies is scaled up by a factor related to either (a) a cell's spectroscopic completeness or (b) the volumes the cell occupies in the PF and SF. The third is "probabilistic smearing," in which an object's redshift is interpreted as a probability distribution function (PDF). The PDF is subsequently used to assign partial galaxy counts to cells along the object's line of sight. These distributions can be given by the selection function, two-point correlation function (2PCF), or photometric redshifts.

We classify objects by the environments they occupy. Objects that lie within the SF but lack spectra (perhaps due to fiber collisions or insufficient tile coverage) will be referred to as interspersed objects. Objects that lie outside the SF, but inside the boundaries of cells will be known as dark objects. Finally, any object that lies in a large, contiguous area at a substantial distance from the SF is referred to as an external object. The areas these objects occupy will be called interspersed regions, dark regions, and external regions, respectively. An example of a dark region is offered in Figures 3 and 4.

**Figure 3.** Circular projection of an R16 cell at z = 0.037 (purple) is superimposed on top of a Monte Carlo visualization of the DR6 SF (yellow). The portion of the cell that overlaps the black area, i.e., that which lies outside the SF, is referred to as a dark region. For this cell, β_SF = 0.8055 and β_PF = 1. The fraction of the circular projection outside the SF is 0.22.

**Figure 4.** Alternative view of the circular projection of the cell from Figure 3. The 15,381 targets within the SF are enclosed within the magenta boundary (i.e., interspersed region), while 5125 lie outside (i.e., dark region). MGS galaxies and objects are colored in bluish green and vermilion, respectively. Some galaxies lie outside the SF by virtue of footprint boundary corrections from Specian & Szalay (2016) designed to make the survey's spectroscopic completeness more isotropic.

The list of counting methods that follows is by no means exhaustive. Others have used the color–magnitude relation (e.g., Baum 1959; Visvanathan & Sandage 1977; Hogg et al. 2004; López-Cruz et al. 2004) to identify which angularly proximal early-type galaxies likely belonged to a particular cluster. Cunha et al. (2009) use a spectroscopic subsample of galaxies to assign an individual redshift probability distribution to each galaxy based on its photometry.

There is also no shortage of photo-z codes one can employ to correlate photometric properties with redshift. For instance, Hildebrandt et al. (2010) compare the performance of 19 such codes over 18 optical and near-infrared bands. Dahlen et al. (2013) conduct a similar study for 11 photo-z codes.

Rather than attempt a thorough comparison of all methods in the literature, we focus on a subset of seven fundamental techniques. We do this with the understanding that the best counting methods may possibly be excluded from this analysis. This is by no means a judgment of these methods' merits, but merely a necessity to constrain the scope of our investigation. The work that follows is designed as an initial study from which future analyses can be extended.

3.1. Galaxies with Redshifts

Counting the number of MGS galaxies in each cell is straightforward. Each galaxy possesses a well-calibrated angular position and high-quality redshift. These are mapped to comoving radial distances χ(z) to establish their three-dimensional positions. If $({X}_{i},{Y}_{i},{Z}_{i})$ and $({X}_{c},{Y}_{c},{Z}_{c})$ are the comoving coordinates of the ith galaxy and center of a sphere, respectively, the ith galaxy resides within the cell if

$${({X}_{i}-{X}_{c})}^{2}+{({Y}_{i}-{Y}_{c})}^{2}+{({Z}_{i}-{Z}_{c})}^{2}\leqslant {R}_{c}^{2}, \tag{2}$$

where R_c is the radius of the cell.
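The mapping from redshift to position and the membership test of Equation (2) can be sketched as follows. The code assumes the flat cosmology quoted in Section 3 (Ω_m = 0.3, Ω_Λ = 0.7) and works in units of h⁻¹ Mpc so that the value of h drops out. The function names, the Simpson-rule integration, and the step count are illustrative choices, not a description of the code used in this work.

```python
import math

C_OVER_H0 = 2997.92458  # c / (100 km s^-1 Mpc^-1), in h^-1 Mpc
OMEGA_M = 0.3
OMEGA_LAMBDA = 0.7


def comoving_distance(z: float, n_steps: int = 200) -> float:
    """Comoving radial distance chi(z) in h^-1 Mpc for flat LCDM.

    Integrates dz' / E(z') with the composite Simpson rule (n_steps even).
    """
    if n_steps % 2:
        raise ValueError("n_steps must be even for Simpson's rule")

    def inv_e(zp: float) -> float:
        return 1.0 / math.sqrt(OMEGA_M * (1.0 + zp) ** 3 + OMEGA_LAMBDA)

    h = z / n_steps
    total = inv_e(0.0) + inv_e(z)
    for i in range(1, n_steps):
        total += (4.0 if i % 2 else 2.0) * inv_e(i * h)
    return C_OVER_H0 * total * h / 3.0


def to_cartesian(ra_deg: float, dec_deg: float, z: float):
    """Comoving (X, Y, Z) in h^-1 Mpc from angular position and redshift."""
    chi = comoving_distance(z)
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (chi * math.cos(dec) * math.cos(ra),
            chi * math.cos(dec) * math.sin(ra),
            chi * math.sin(dec))


def in_cell(position, center, r_c: float) -> bool:
    """Equation (2): squared distance to the cell center is at most R_c^2."""
    return sum((p - c) ** 2 for p, c in zip(position, center)) <= r_c ** 2
```

For example, `in_cell(to_cartesian(ra, dec, z), cell_center, 7.0)` tests a galaxy against an R7 cell whose center is given in the same coordinates.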

3.2. Discrete Counting

Discrete counting methods assign each object a single redshift. This approach offers the best upside. Under optimal (though admittedly unlikely) conditions, discrete counting can be exactly right, something scaling and probabilistic smearing methods cannot offer.

The errors induced by discrete counting of a single object are limited to, at most, two non-overlapping cells—the one it is estimated to reside inside and the one it is actually inside. In this way, the negative impact of discrete counting is limited in the number of cells it can affect, though the absolute errors in those cells are potentially much larger than with other methods.

Ignore: the simplest of the seven methods, "ignore" simply disregards every object during the counting process. This method will systematically undercount the number of targets in each cell. For cells with large angular projections, ignoring all objects can discount a significant number of targets, leading to large errors. This method should improve as cell size shrinks or the number of cells per unit redshift increases. Either scenario will decrease the number of objects intersecting each cell's line of sight.

Photometric redshifts: each object is assigned a redshift equal to its photometric redshift. This implicitly correlates radial distance with an object's brightness and color profile, as described in Section 2.4. The SDSS database offers two types of photometric redshifts. The first are template based, or SED photo-z's, which utilize an object's spectral energy distribution. The second are training set (i.e., ANN) photo-z's, of which the SDSS database offers two varieties—D1 and CC2. We utilize D1 photo-z's since these have been shown to display better performance at brighter magnitudes. They are drawn from CAS table Photoz2.

In light of the wide range of photo-z codes, and to avoid placing too much emphasis on either SED or D1 photo-z's, in particular, we combine these two SDSS statistics into a single photometric redshift when reporting their performance. Whichever photo-z estimate counts more accurately (see Section 4) is selected as the singular photometric redshift for that scenario. In this way, we examine photo-z's "at their best" within the limits of what the SDSS natively provides.

Nearest neighbor method: each object is assigned the redshift of the galaxy that lies the smallest angular separation away (e.g., Zehavi et al. 2002, 2005; Berlind et al. 2006). Nearest neighbor was an early solution for handling fiber collided galaxies in SDSS-I/II (Zehavi et al. 2002, 2005; Zehavi 2011) down to ∼0.1 ${h}^{-1}\,\mathrm{Mpc}$ . It has also been used to identify galaxy groups and clusters (Berlind et al. 2006). This method works better when angular separation is small, though it has had difficulty below the fiber collision scale and when constructing the redshift-space correlation function. Nearest neighbor is weaker at large redshifts, where a given angular separation implies a larger physical separation. It is expected to be less effective in dark regions and external regions.
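A minimal brute-force sketch of the nearest neighbor assignment is given below. It assumes positions are supplied as right ascension and declination in degrees and that the galaxy list is short enough for a linear scan; the function names are hypothetical, and a full catalog would likely call for a spatial index instead.

```python
import math


def unit_vector(ra_deg, dec_deg):
    """Convert (RA, Dec) in degrees to a Cartesian unit vector."""
    ra, dec = math.radians(ra_deg), math.radians(dec_deg)
    return (math.cos(dec) * math.cos(ra),
            math.cos(dec) * math.sin(ra),
            math.sin(dec))


def nearest_neighbor_redshift(obj_radec, galaxies):
    """Return the redshift of the galaxy at the smallest angular separation.

    galaxies is an iterable of (ra_deg, dec_deg, z). The largest dot product
    of unit vectors corresponds to the smallest angular separation.
    """
    ox = unit_vector(*obj_radec)
    best_z, best_dot = None, -2.0
    for ra, dec, z in galaxies:
        gx = unit_vector(ra, dec)
        d = sum(a * b for a, b in zip(ox, gx))
        if d > best_dot:
            best_dot, best_z = d, z
    return best_z
```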

3.3. Scaling

Rather than directly approximate their redshifts, the scaling method accounts for the presence of objects by upweighting the count of galaxies in cells. The scaling mechanism works differently for interspersed objects and dark objects.

Consider first the case in which dark objects lie in well-defined dark regions formed by the intersection of a cell's circular projection with the spectroscopic and PFs. Assume dark regions contain only objects, and that the remainder of the cell contains only galaxies. If n_g is the number of galaxies in the cell, the scaling method approximates the number of dark objects within its volume to be

$\begin{eqnarray}&&{n}_{d}=\left(\displaystyle \frac{{V}_{d}}{{V}_{g}}\right){n}_{g}=\left(\displaystyle \frac{{\beta }_{\mathrm{PF}}-{\beta }_{\mathrm{SF}}}{{\beta }_{\mathrm{SF}}}\right){n}_{g},\end{eqnarray} \tag{ 3 }$

where V_d is the volume of the cell intersected by the dark region and V_g is the volume of the cell inside the SF. The total number of targets n_t approximated to lie within the cell becomes

$\begin{eqnarray}&&{n}_{t}=\left(\displaystyle \frac{{\beta }_{\mathrm{PF}}}{{\beta }_{\mathrm{SF}}}\right){n}_{g},\end{eqnarray} \tag{ 4 }$

where c = β_PF/β_SF is the "scaling factor." The scaling method, as represented through Equation (4), can only be employed when the cells contain distinct, contiguous, spectroscopically unsampled regions, i.e., when β_PF and β_SF are clearly specified. Situations like this are common when cells are placed near the edges of the SF, or when masks are introduced.

Assuming one's survey is photometrically complete, the scaling method should only be employed if a cell's dark region actually contains dark objects. If no dark objects are present, the approximate number of targets in the cell's dark volume is trivially set to zero. In this way, each dark object is not counted individually, but rather, acts as a binary switch for whether the scaling method will be executed or not.
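Equation (4), together with the binary switch just described, can be sketched as follows. The function and argument names are illustrative assumptions; β_PF and β_SF are simply passed in as numbers, and the case β_SF = 0 (a cell with no spectroscopic coverage) is rejected because the scaling factor is undefined there.

```python
def scaled_total_count(n_g, beta_pf, beta_sf, n_dark_objects):
    """Approximate the total number of targets in a cell via Equation (4).

    n_g            : galaxies with redshifts counted inside the cell
    beta_pf        : beta_PF for the cell
    beta_sf        : beta_SF for the cell (must be positive)
    n_dark_objects : dark objects whose lines of sight cross the dark region;
                     used only as an on/off switch, not as a count
    """
    if beta_sf <= 0.0:
        raise ValueError("beta_SF must be positive to define the scaling factor")
    if n_dark_objects == 0:
        # No dark objects present: the dark volume contributes nothing.
        return float(n_g)
    return n_g * (beta_pf / beta_sf)
```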

A disadvantage of this method is that useful information—the number of dark objects in a cell's dark region—is essentially discarded. A single dark object that intersects numerous cells along its line of sight can potentially contribute to an aggregate number count totaling more or less than one distinct object. This almost guarantees that the total number count of dark objects will not be conserved, an issue that resurfaces with the probabilistic smearing methods introduced in Section 3.4.

Because scaling relies heavily on the approximated number densities of galaxies in cells, it should be most effective when the dark region's total area is small relative to the cell's projection. It is better suited for cells with large projections since the number densities inside and outside the dark regions are more likely to be similar and less likely to vary due to fluctuations in large-scale structure.

Next, consider the case in which interspersed objects are distributed among interspersed galaxies within a cell's interspersed region. Because there are no clearly delineated areas containing only objects or only galaxies, Equation (4) is inapplicable. Instead, the scaling factor c can be calculated using the spectroscopic completeness of the interspersed region. To first approximation c = 1/f, where f is the spectroscopic completeness of the survey.

However, spectroscopic completeness is anisotropic and, in some cases, extremely so. While the footprint corrections of Specian & Szalay (2016) helped equalize the completeness across the SF as a whole, there are still localized clusters where the proportion of MGS targets assigned fibers deviates appreciably from the average.

We therefore propose a direction-dependent spectroscopic completeness factor

$\begin{eqnarray}c=\left\{\begin{array}{ll}1 & \mathrm{if}\ {N}_{g}=0\ \mathrm{and}\ {n}_{{io}}=0\\ {N}_{g}/({N}_{g}+{n}_{{io}}) & \mathrm{otherwise}\end{array}\right.,\end{eqnarray} \tag{ 5 }$

where N_g and n_io are the numbers of galaxies and interspersed objects whose projections intersect the cell's interspersed region. The total approximated number count of targets within the interspersed region is then

$\begin{eqnarray}{n}_{t}=\left\{\begin{array}{ll}0 & \mathrm{if}\,c=0,\\ {n}_{{ig}}/c & \mathrm{otherwise}\end{array}\right.\end{eqnarray} \tag{ 6 }$

where n_ig is the number of interspersed galaxies inside the cell's volume as given by Equation (2). The distributions of completeness factors c are presented in Figure 5.
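Equations (5) and (6) translate directly into code. The sketch below uses hypothetical function names and takes the three counts as plain integers.

```python
def completeness_factor(N_g, n_io):
    """Direction-dependent completeness factor c of Equation (5)."""
    if N_g == 0 and n_io == 0:
        return 1.0
    return N_g / (N_g + n_io)


def interspersed_total(n_ig, N_g, n_io):
    """Approximate target count in the interspersed region, Equation (6)."""
    c = completeness_factor(N_g, n_io)
    if c == 0:
        # No galaxies along the projection: nothing to scale up.
        return 0.0
    return n_ig / c
```

For example, with eight galaxies and two objects along the projection, c = 0.8, so four galaxies inside the cell scale to five targets.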

**Figure 5.** Distribution of direction-dependent completeness factors c. Factors are counted in bins of width Δc = 0.01.

Unlike with dark regions, the number of interspersed objects n_io acts as more than a binary switch. By helping to parameterize c, this scaling approach introduces a proximity bias. If a cell contains no galaxies within its volume, it cannot contain any objects either. This effectively limits object counts to volumes where galaxies already exist.

Interspersed scaling works best when the radial distribution of objects is the same as that of galaxies. While this is certainly not the case in all directions, it should hold reasonably well along restricted lines of sight. The circular projections of R7, R11, and R16 cells, especially those at large redshifts, fall squarely into this category.

A final scaling option would ignore MGS objects entirely and renormalize the selection function to only account for the presence of MGS galaxies. This tactic would adjust the expected number of galaxies downward, similar to how the scaling method would adjust the total number count upward. In either case, their ratio, which determines the value of the overdensity δ, would remain the same, save some small differences induced by anisotropies in the spectroscopic completeness. Because the difference between scaling the number count and scaling the expected number is so slight, the latter option is not examined any further in this paper.

3.4. Probabilistic Smearing

The probabilistic smearing methods introduced in this section operate not by assigning an object any particular redshift, but by reporting the probabilities that the object lies in each cell along the line of sight. Under this methodology a single object count is smeared among multiple cells, with each cell i along the line of sight receiving a partial count n_i. Objects may also exist between cells (recall that cells positioned using the HCP only fill about 74% of the survey volume) or beyond the survey's redshift boundaries. Consequently, in most realistic scenarios ${\sum }_{i}{n}_{i}\lt 1$ and some of the count will be "lost."

The probability that an object lies within the boundaries of a cell depends upon two factors—the entry and exit points of the line-of-sight chord through the cell and the shape of the density function itself. A chord that passes directly through the center of a cell will contribute a higher number count than one that glances its edge even for cells at the same redshift.

Let ${\hat{{\boldsymbol{s}}}}_{i}$ represent the vector that points toward the center of cell i, and ${\hat{{\boldsymbol{x}}}}_{j}$ be the vector that points toward object j. It was shown in M. A. Specian & A. S. Szalay (2016, in preparation) that when $a\equiv \cos \theta =\hat{{\boldsymbol{s}}}\cdot \hat{{\boldsymbol{x}}}$ , the depths χ_l and χ_u at which the chord enters and exits the cell are

$\begin{eqnarray}{\chi }_{l} & = & {a}^{2}\chi -a\sqrt{{R}^{2}+{\chi }^{2}({a}^{2}-1)},\\ {\chi }_{u} & = & {a}^{2}\chi +a\sqrt{{R}^{2}+{\chi }^{2}({a}^{2}-1)},\end{eqnarray} \tag{ 7 }$

where χ is the comoving distance to the cell's center and R is the radius of the cell. The comoving distances χ_l and χ_u can be subsequently converted to redshifts.
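Equation (7) can be evaluated as printed with the short sketch below. The function name is an assumption. A negative value under the square root means the line of sight misses the cell, in which case the function returns None; for a = 1 (a line of sight through the cell's center) the result reduces to χ − R and χ + R.

```python
import math


def chord_depths(chi, R, a):
    """Entry and exit depths (chi_l, chi_u) from Equation (7).

    chi : comoving distance to the cell's center
    R   : cell radius (same units as chi)
    a   : cosine of the angle between the cell center and the object
    Returns None when the line of sight does not intersect the cell.
    """
    disc = R * R + chi * chi * (a * a - 1.0)
    if disc < 0.0:
        return None
    root = math.sqrt(disc)
    return (a * a * chi - a * root, a * a * chi + a * root)
```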

Letting ${z}_{l}^{{ij}}$ and ${z}_{u}^{{ij}}$ represent the entry and exit redshifts for object j's line of sight chord through cell i of angular radius θ_i, the partial count contributed by the object to the cell is

$\begin{eqnarray}{n}_{{ij}}=\left\{\begin{array}{ll}{w}_{j}{\displaystyle \int }_{{z}_{l}^{{ij}}}^{{z}_{u}^{{ij}}}{p}_{j}(z)\,{dz} & \mathrm{if}\ {\hat{{\boldsymbol{s}}}}_{i}\cdot {\hat{{\boldsymbol{x}}}}_{j}\gt \cos {\theta }_{i}\\ 0 & \mathrm{otherwise}\end{array}\right..\end{eqnarray} \tag{ 8 }$

The total number count in cell i will be ${\sum }_{j}{n}_{{ij}}$ . The function p_j(z) is normalized such that its integral over the redshift range within which the object is constrained to exist equals unity. The weighting factor w_j reflects how heavily to count one object relative to the others.
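A sketch of Equation (8) for a single object and cell follows. It assumes p_j(z) is available as a Python callable that is already normalized, and it uses a simple trapezoid rule with a fixed number of steps; the function name and the integration scheme are illustrative choices, not a description of the authors' code.

```python
def partial_count(p, z_l, z_u, s_dot_x, cos_theta, w=1.0, steps=1000):
    """Partial count n_ij of Equation (8).

    p         : normalized probability density p_j(z), a callable
    z_l, z_u  : entry and exit redshifts of the chord through the cell
    s_dot_x   : dot product of the cell-center and object unit vectors
    cos_theta : cosine of the cell's angular radius
    w         : weighting factor w_j
    """
    if s_dot_x <= cos_theta:
        # The line of sight falls outside the cell's circular projection.
        return 0.0
    h = (z_u - z_l) / steps
    total = 0.5 * (p(z_l) + p(z_u))
    for k in range(1, steps):
        total += p(z_l + k * h)
    return w * total * h
```

As a check, a density that is uniform over 0.02 ≤ z ≤ 0.30 deposits 0.07/0.28 = 0.25 of an object in a cell whose chord spans 0.10 ≤ z ≤ 0.17.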

In this paper, we set ${w}_{j}=1\ \forall \,j$ because all objects are weighted equally. It is possible, however, to develop hybrid criteria through which a fraction w < 1 of an object is smeared through one method, while the remainder $1-w$ is counted via another. This more sophisticated approach embodies a dual probability system—one in which the full set of an object's properties is used to nominate multiple counting methods, each applied with a different weight. We discuss this approach for completeness, but do not further explore its usage in these pages.

Because probabilistic smearing deposits partial counts in cells, it almost certainly will not yield a result that is exactly correct. This is a critical distinction from discrete counting methods. The hope is that if the probability models are accurate, the combination of smeared counts from multiple objects will better approximate the true distribution of targets than other counting alternatives. We provide evidence to support this claim—at least for the survey as a whole—later in this section.

Below, four probabilistic smearing methods are introduced and explained. Each distributes galaxy counts in cells using Equation (8). They differ only in the probability density functions p(z) used to do so.

Selection function smearing: this method takes p_j(z) to equal the distribution p_exp(z), or the expected number of galaxies observed in the absence of clustering:

$\begin{eqnarray}&&{p}_{j}(z)\,{dz}={p}_{\exp }(z)\,{dz}\\ &&=\displaystyle \frac{1}{({d}_{{\rm{H}}}\int d{\rm{\Omega }})C}\,S(z)\,\left({d}_{{\rm{H}}}\displaystyle \frac{\chi {(z)}^{2}}{E(z)}{dz}\displaystyle \int d{\rm{\Omega }}\right),\end{eqnarray} \tag{ 9 }$

where the normalization factor C equals

$\begin{eqnarray}&&C={\displaystyle \int }_{{z}_{\min }}^{{z}_{\max }}S(z)\displaystyle \frac{\chi {(z)}^{2}}{E(z)}{dz}.\end{eqnarray} \tag{ 10 }$

In Equations (9) and (10), d_H is the Hubble distance, S(z) is the selection function, $\displaystyle \int d{\rm{\Omega }}$ is the angle integral over the survey footprint, z_min and z_max are the survey's redshift limits, and from Peebles (1993)

$\begin{eqnarray}&&E(z)\equiv \sqrt{{{\rm{\Omega }}}_{m}{(1+z)}^{3}+{{\rm{\Omega }}}_{k}{(1+z)}^{2}+{{\rm{\Omega }}}_{{\rm{\Lambda }}}},\end{eqnarray} \tag{ 11 }$

where ${{\rm{\Omega }}}_{m}$ , ${{\rm{\Omega }}}_{k}$ , and ${{\rm{\Omega }}}_{{\rm{\Lambda }}}$ are the density parameters of matter, curvature, and dark energy respectively.
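The following sketch evaluates Equations (9) through (11) on a redshift grid. Several ingredients are assumptions made purely for illustration: the default density parameters (a flat model with Ω_m = 0.3 and Ω_Λ = 0.7), the trapezoid integration, and the requirement that the selection function S(z) be supplied by the caller. The Hubble distance is written in ${h}^{-1}\,\mathrm{Mpc}$; it and the solid-angle integral cancel in the normalized density.

```python
import math

D_H = 2997.92458  # Hubble distance c/H0 in h^-1 Mpc


def E(z, om=0.3, ok=0.0, ol=0.7):
    """Dimensionless Hubble parameter of Equation (11)."""
    return math.sqrt(om * (1 + z) ** 3 + ok * (1 + z) ** 2 + ol)


def comoving_distance(z, steps=512):
    """chi(z) = d_H * integral_0^z dz'/E(z'), by the trapezoid rule."""
    if z == 0.0:
        return 0.0
    h = z / steps
    total = 0.5 * (1.0 / E(0.0) + 1.0 / E(z))
    for k in range(1, steps):
        total += 1.0 / E(k * h)
    return D_H * total * h


def p_exp_grid(S, z_min, z_max, n=281):
    """Return (zs, pdf) with pdf proportional to S(z) chi(z)^2 / E(z).

    The pdf is normalized so its trapezoid integral over [z_min, z_max] is 1,
    which plays the role of the constant C in Equation (10).
    """
    h = (z_max - z_min) / (n - 1)
    zs = [z_min + k * h for k in range(n)]
    f = [S(z) * comoving_distance(z) ** 2 / E(z) for z in zs]
    C = h * (sum(f) - 0.5 * (f[0] + f[-1]))
    return zs, [v / C for v in f]
```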

In a sense, Equation (9) can be considered a default case for smearing in that no object's property is utilized other than the fact that it is drawn from the same distribution as the MGS galaxies. While smearing via the selection function guarantees that no clustering statistics beyond homogeneity will be reported, it may serve as a useful tool to "fill in the gap" left by an object's absent redshift.

Selection function smearing adheres to the same principle employed by Guo et al. (2012) to recover the 2PCF when SDSS fiber collisions reduce spectroscopic completeness. As with their approach, this method does not estimate the radial location of an object using observables (e.g., color, photo-z, proximity to nearest neighbor), but rather draws conclusions based off the simple assumption that objects possess the same radial distribution as galaxies.

Two-point correlation function smearing: the 2PCF smearing method quantifies the probability that an object lies between two redshifts by using the two-point correlation function $\xi ({\boldsymbol{r}})$ . By necessity, the 2PCF considers galaxies as a pair. We chose to partner each object of unknown redshift with its nearest-galaxy neighbor of known redshift, since this is the galaxy with which the object will have the largest spatial correlation.

The joint probability ${dP}({\boldsymbol{r}})$ of finding galaxies in each of two separate volume elements dV₁ and dV₂ separated by a vector ${\boldsymbol{r}}$ is ${dP}({\boldsymbol{r}})={n}^{2}(1+\xi ({\boldsymbol{r}}))\,{{dV}}_{1}\,{{dV}}_{2}$ , where n is the background density that would exist if the universe were perfectly homogeneous. If one galaxy's position is fixed, the probability of finding another galaxy a distance r away is found by marginalizing over one of the volume elements. The universe is assumed to be isotropic and the directional dependence on ${\boldsymbol{r}}$ is dropped,

$\begin{eqnarray}&&{dP}(r)\propto n(1+\xi (r))\,{dV}.\end{eqnarray} \tag{ 12 }$

The differential volume element can be written as dV = A dz, where A is an arbitrarily sized cross-sectional area. The background density of galaxies is independent of direction but, in a magnitude-limited survey, it is redshift-dependent. Therefore,

$\begin{eqnarray}&&{dP}(r)={{Cp}}_{\exp }(z)(1+\xi (r))\,{dz},\end{eqnarray} \tag{ 13 }$

where C is a normalization constant. If an object of unknown redshift is constrained to exist between the limits z_i and z_f, then the normalization is fixed such that

$\begin{eqnarray}&&C={\left[{\displaystyle \int }_{{z}_{i}}^{{z}_{f}}{p}_{\exp }(z){dz}+{\displaystyle \int }_{{z}_{i}}^{{z}_{f}}{p}_{\exp }(z)\xi (r){dz}\right]}^{-1}.\end{eqnarray} \tag{ 14 }$

When two galaxies' positions are uncorrelated, the second integral in Equation (14) vanishes and this method reduces to selection function smearing. When r is small, ξ(r) dominates and a galaxy's depth is mostly constrained by the two-point correlation function. In this way, 2PCF smearing is a combination of selection function smearing and a probabilistic version of the nearest neighbor method.

Integration is done numerically. Redshifts between ${z}_{l}^{{ij}}$ and ${z}_{u}^{{ij}}$ are computed on a z-grid with resolution dz = 10⁻⁵. At each grid location, a conversion to comoving depth is performed and p_exp(z) is evaluated. The distances r between the nearest neighbor and each comoving grid point are found and the values of ξ(r) are subsequently interpolated. The integration then proceeds as normal. The final form of the probability density function is

$\begin{eqnarray}&&p(z)={{Cp}}_{\exp }(z)(1+\xi (r)).\end{eqnarray} \tag{ 15 }$
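A compact sketch of Equations (13) through (15) is shown below. It assumes the caller supplies p_exp(z), the redshift-to-distance conversion χ(z), and an interpolated ξ(r) as callables, and it obtains the separation r between each grid point and the nearest neighbor from the law of cosines. The function name, argument names, and the uniform-grid trapezoid normalization are choices made for this example.

```python
import math


def twopcf_pdf(zs, p_exp, chi, xi, chi_nn, cos_sep):
    """Normalized 2PCF smearing density on a uniform redshift grid zs.

    p_exp   : callable, expected redshift distribution
    chi     : callable, comoving distance as a function of redshift
    xi      : callable, two-point correlation function xi(r)
    chi_nn  : comoving distance of the nearest-galaxy neighbor
    cos_sep : cosine of the object/neighbor angular separation
    """
    weights = []
    for z in zs:
        c = chi(z)
        # Law of cosines for the distance between grid point and neighbor.
        r2 = c * c + chi_nn * chi_nn - 2.0 * c * chi_nn * cos_sep
        r = math.sqrt(max(r2, 0.0))
        weights.append(p_exp(z) * (1.0 + xi(r)))
    h = zs[1] - zs[0]
    norm = h * (sum(weights) - 0.5 * (weights[0] + weights[-1]))
    return [w / norm for w in weights]
```

When ξ(r) is identically zero the result reduces to the normalized p_exp(z), consistent with the limiting behavior described above.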

Figure 6 displays three sample 2PCF smearing density functions. Each possesses the shape of the selection function with a spike at the redshift of its nearest-galaxy neighbor. As the nearest neighbor separation decreases, the height of the spike increases, indicating an increased likelihood that the object resides close to its nearest neighbor. Conversely, the probabilities that the object resides at all other redshifts decrease accordingly. In the case of the largest angular separation of 2096 arcsec, the nearest neighbor correlation is quite weak and introduces only a minor perturbation to the selection function.

Photometric redshift smearing: the final smearing method correlates an object's radial distance to its color and luminosity through either the SED or D1 photo-z's and their associated errors. As with the discrete photo-z's (see Section 3.2), we let the smeared photo-z statistic stand for whichever method, SED or D1, performs better.

As summarized in Section 2.4, photometric redshifts are Bayesian reconstructions that may be interpreted as Gaussian probability functions. In these cases, p(z) = C g(z), where g(z) follows a Gaussian distribution with a mean μ_z equal to the photometric redshift, and a variance ${\sigma }_{z}^{2}$ equal to the square of the error reported in the SDSS database,

$\begin{eqnarray}&&g(z)=\exp \left[-\displaystyle \frac{{(z-{\mu }_{z})}^{2}}{2{\sigma }_{z}^{2}}\right].\end{eqnarray} \tag{ 16 }$

The normalization constant is again set by specifying the redshift limits within which the object is constrained to exist,

$\begin{eqnarray}&&C={\left({\displaystyle \int }_{{z}_{i}}^{{z}_{f}}g(z){dz}\right)}^{-1}.\end{eqnarray} \tag{ 17 }$

Redshift limits z_i = 0.02 and z_f = 0.30 are selected to match the range of MGS galaxies.
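Because the integral of a Gaussian has a closed form in terms of the error function, the count that Equations (16) and (17) assign to a redshift interval can be sketched without numerical integration. The function name is hypothetical; the default limits are the z_i and z_f quoted above.

```python
import math


def gaussian_fraction(z_l, z_u, mu, sigma, z_i=0.02, z_f=0.30):
    """Fraction of a truncated Gaussian photo-z density lying in [z_l, z_u].

    mu, sigma : photometric redshift and its reported error
    z_i, z_f  : limits within which the object is constrained to exist
    """
    def cdf(z):
        return 0.5 * (1.0 + math.erf((z - mu) / (sigma * math.sqrt(2.0))))

    lo, hi = max(z_l, z_i), min(z_u, z_f)
    if hi <= lo:
        return 0.0
    # The denominator is the reciprocal of C in Equation (17), up to the
    # constant factor that cancels in the ratio.
    return (cdf(hi) - cdf(lo)) / (cdf(z_f) - cdf(z_i))
```

If the photometric redshift lies many σ outside the allowed range, the denominator can underflow to zero, so a practical implementation would need to guard that case.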

The assumption of photo-z Gaussianity has precedent. Balogh et al. (2014) model photometric redshifts with a Gaussian distribution. They and others (Ilbert et al. 2009; George et al. 2011) have shown that integrating such a distribution is an effective way to count galaxies within a redshift range. López-Sanjuan et al. (2010) have used the Gaussian model to identify close galaxy pairs.

Several studies have concluded that the best way to utilize photo-z's is probabilistically (e.g., Fernández-Soto et al. 2002; Cunha et al. 2009; Myers et al. 2009; Wittman 2009; Carrasco Kind & Brunner 2014). A plethora of algorithms exists to do so. Carrasco Kind & Brunner (2014) combine different models into a stronger estimator through a Bayesian framework. Some groups get probability distributions directly from the photometric templates. More sophisticated analysis represents the total PDF as a combination of red and blue templates.

The photo-z smearing methods implicitly ignore any spatial correlations that might be present. This sacrifice becomes less of a problem at high redshifts where galaxies are sparse and their physical separations start to exceed the spatial correlation radius. At this point, it becomes statistically unlikely that a galaxy is spatially correlated with its nearest neighbor, and the photometric redshift begins to carry more information.

A preliminary comparison of these counting methods is presented in Figure 7. The true distribution of pristine MGS galaxies (as measured through spectroscopic redshift) is plotted along with those of the selection function and photo-z smearing methods. Note that there is no difference between the suitably normalized selection function and the aggregate result of selection function smearing.

It is immediately obvious that none of the methods are capable of reproducing the true galaxy distribution on small scales. The errors inherent to each method are larger than this level of detail, which reinforces the idea that none are suitable on their own for conducting high-precision redshift surveys. The net result is a smoothing effect everywhere except the peak of the discrete photo-z distribution.

Of the three methods, photo-z smearing is best able to match the true redshift distribution of MGS galaxies for z > 0.04. The fact that this method induces no large, systematic shift from the true distribution suggests that the photometric redshifts (and their errors) contain more information about large-scale structure than does the selection function alone.⁶ Discrete photometric redshifts offer the worst performance, indicating that a photo-z's uncertainty carries useful information that should not be discounted.

3.5. Two-point Correlation Function

We calculate the 2PCF empirically using the pair counting method of Landy & Szalay (1993). Pairs are drawn from the N MGS galaxies in the northern hemisphere's SF. This is done to best match the distribution of MGS objects we will eventually be counting. The distances between all galaxy pairs DD are measured, sorted into bins spanning r ± dr, and counted. The same is done for pairs of random points RR, where the random points are distributed uniformly as a function of angle within the SF and radially according to p_exp(z) from Equation (9).

According to Landy & Szalay, the estimator

$\begin{eqnarray}&&\xi (r)=\left\{\begin{array}{l}\left(\tfrac{{DD}}{{N}_{{DD}}}-\tfrac{2{DR}}{{N}_{{DR}}}+\tfrac{{RR}}{{N}_{{RR}}}\right)/\tfrac{{RR}}{{N}_{{RR}}}\\ \quad r\lt 89.575\,{h}^{-1}\,\mathrm{Mpc}\\ 0\quad \mathrm{otherwise}\end{array}\right.,\end{eqnarray} \tag{ 18 }$

minimizes the variance in ξ(r) to the Poisson level, where DR are the pair counts of the cross-correlated galaxies and randoms, which are introduced to account for survey edge effects. (The dependence of DD, RR, and DR on r in Equation (18) is implied for notational simplicity.)

We select the number of random points to be ${N}_{R}=10N$ . This yields ${N}_{{RR}}={N}_{R}({N}_{R}-1)/2$ unique pairs of random points, ${N}_{{DD}}=N(N-1)/2$ pairs of galaxies, and N_DR = NN_R galaxy/random pairs. The cosmological principle is in effect for r ≥ 100 ${h}^{-1}\,\mathrm{Mpc}$ , meaning homogeneity and isotropy are typically maintained beyond these scales. In practice, we find that ξ(r) first runs negative at r = 89.575 ${h}^{-1}\,\mathrm{Mpc}$ , so the correlation function is set to zero beyond that point. After convolving Equation (18) with the spherical windows for R7, R11, and R16, the correlation functions in Figure 8 result.
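Given binned pair counts, Equation (18) and the pair normalizations above can be sketched as follows. The function name and list-based interface are assumptions, as is the decision to return zero for a bin with no random pairs; the pair counting itself and the convolution with the spherical windows are not shown.

```python
def landy_szalay(dd, dr, rr, n_gal, n_rand, r_centers, r_cut=89.575):
    """Estimate xi(r) per bin from binned pair counts (Equation (18)).

    dd, dr, rr : lists of galaxy-galaxy, galaxy-random, random-random counts
    n_gal      : number of galaxies N
    n_rand     : number of random points N_R
    r_centers  : bin centers in h^-1 Mpc
    r_cut      : separation at and beyond which xi(r) is set to zero
    """
    n_dd = n_gal * (n_gal - 1) / 2.0
    n_rr = n_rand * (n_rand - 1) / 2.0
    n_dr = float(n_gal * n_rand)
    xi = []
    for d, x, r, rc in zip(dd, dr, rr, r_centers):
        if rc >= r_cut or r == 0:
            # Beyond the cutoff, or an empty RR bin (assumed convention).
            xi.append(0.0)
            continue
        rr_norm = r / n_rr
        xi.append((d / n_dd - 2.0 * x / n_dr + rr_norm) / rr_norm)
    return xi
```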

**Figure 8.** Empirical two-point correlation function as calculated using MGS galaxies from the northern hemisphere of the SF. The ratio from Equation (18) is displayed (black) along with its convolutions with spherical window functions of radii 7, 11, and 16 ${h}^{-1}\,\mathrm{Mpc}$ (orange, blue, pink). The original ξ(r) is binned in bins of width 2 dr = 0.05 ${h}^{-1}\,\mathrm{Mpc}$ .

4. SIMULATIONS AND RESULTS

In this section, we simulate the presence of MGS galaxies and objects in three different region types—interspersed regions (Section 4.1.1), dark regions (Section 4.1.2), and external regions (Section 4.1.3)—in order to test which of the seven counting techniques described in Section 3 are most effective under various conditions.

In lieu of constructing full MGS mock catalogs, we generate galaxy and object realizations by randomly dividing the MGS galaxies within the SF into two sets—those that retain their redshift information, and those that do not. Because the true redshifts of MGS galaxies are known, their number in each cell is fixed. This "ground truth" may be compared against counts estimated through each of the seven counting techniques to draw conclusions about their efficacy.

Using the real set of MGS galaxies to create galaxy/object simulations offers some advantages. Randomizing which MGS galaxies retain their redshifts allows for the rapid generation of multiple universes in which the constituent targets share the same properties (both spatial and photometric) as those we are trying to count. In this way, difficult-to-model correlations between spatial clustering and photometric redshift do not need to be discovered before simulations can commence. This increases the likelihood that conclusions drawn from simulations will be applicable to the full set of MGS targets.

The MGS is static, however, and conclusions drawn from it will be weighted toward the local universe. Also, the full MGS within the SF contains about 25% more targets than does the set of MGS galaxies from which the simulations are drawn. Consequently, the conclusions reached in this section will be for a somewhat sparser universe than reality. We do not attempt to account for this difference beyond advising that even if conclusions are imperfect, they should still be able to offer informative first-order principles to be carried forth into even more robust analyses.

To assess the effectiveness of each counting technique as a function of redshift, we introduce an error metric ${\epsilon }^{()}$ that quantifies each method's impact on galaxy number count n, overdensity δ, and overdensity squared δ² by measuring the average size of the discrepancy between the truth and each realization τ. Let n_i equal the true galaxy count in cell i. Let ${n}_{i}^{(\tau )}$ equal the count in cell i during realization τ using one's method of choice. We report the average discrepancy over K realizations,

$\begin{eqnarray}&&{\rm{\Delta }}n({z}_{i})\equiv \displaystyle \sum _{\tau =1}^{K}| {n}_{i}-{n}_{i}^{(\tau )}| /K,\end{eqnarray} \tag{ 19 }$

where z_i is the redshift of cell i. We split redshift-space into a discrete number of bins and let the boundaries of the jth bin be z_j and ${z}_{j+1}$ . The deviation is ultimately reported as an average over all the cells in each redshift bin,

$\begin{eqnarray}&&{\epsilon }_{j}^{(n)}=\langle {\rm{\Delta }}n({z}_{i})\rangle ,\ {z}_{j}\leqslant {z}_{i}\lt {z}_{j+1}.\end{eqnarray} \tag{ 20 }$

The error bars on ${\epsilon }_{j}^{(n)}$ are reported as the 1σ standard deviation of Δn(z_i) for all cells within the corresponding redshift range. A similar statistic is applied to the overdensities and overdensities squared by replacing $| {n}_{i}-{n}_{i}^{(\tau )}|$ in Equation (19) by $| {\delta }_{i}-{\delta }_{i}^{(\tau )}|$ and $| {\delta }_{i}^{2}-{{\delta }_{i}^{2}}^{(\tau )}|$ respectively.
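The error metric of Equations (19) and (20) can be sketched as below. The function names are hypothetical, and the use of the population standard deviation for the error bars is an assumption, since the text does not specify which estimator is used. The same functions apply to δ and δ² if those quantities are passed in place of the counts.

```python
import statistics


def mean_abs_discrepancy(true_value, realized_values):
    """Equation (19): average absolute discrepancy over K realizations."""
    K = len(realized_values)
    return sum(abs(true_value - v) for v in realized_values) / K


def binned_error(cell_z, cell_delta, bin_edges):
    """Equation (20): average the per-cell discrepancies in each redshift bin.

    cell_z     : redshift z_i of each cell
    cell_delta : the Equation (19) value for each cell
    bin_edges  : increasing redshift boundaries z_j
    Returns one (mean, standard deviation) tuple per bin, or None if empty.
    """
    results = []
    for z_lo, z_hi in zip(bin_edges[:-1], bin_edges[1:]):
        vals = [d for z, d in zip(cell_z, cell_delta) if z_lo <= z < z_hi]
        if vals:
            results.append((statistics.mean(vals), statistics.pstdev(vals)))
        else:
            results.append(None)
    return results
```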

4.1. Simulating Spectroscopically Incomplete Regions

Counting techniques that rely upon the nearest neighbor method, the 2PCF, and scaling are sensitive to the environments in which they are applied. In this section, we describe how we simulate three separate, spectroscopically undersampled region types. The first are interspersed regions, or areas within the SF. The second, dark regions, lie adjacent to, but outside the SF. Finally, we look at external regions, which lie a significant distance from the SF. For a summary of the terminology used to describe region types, refer to Table 2.

4.1.1. Interspersed Regions

One realization of an interspersed region is generated by randomly stripping 20% of MGS galaxies of their redshifts and turning them into objects. Galaxy counts are approximated and the process repeats until counts are gathered within 50,000 cells in each redshift range. The density of MGS galaxies is large enough that we did not encounter any situations in which a cell's circular projection did not contain at least one object.

An object's nearest neighbor is defined to be the galaxy at the smallest angular separation. To speed up testing, the 10 nearest neighbors for each MGS galaxy are precomputed. In the rare event that an object's 10 nearest angular neighbors are also stripped of their redshift information during a realization, the tenth of those assumes the role of nearest neighbor.
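The lookup against the precomputed neighbor lists might be organized as in the sketch below. The identifiers and the use of a set of stripped galaxy IDs are assumptions made for illustration; the fallback to the last precomputed entry mirrors the rule stated above.

```python
def pick_neighbor(neighbor_ids, stripped_ids):
    """Choose a nearest neighbor from a precomputed, distance-ordered list.

    neighbor_ids : IDs of the nearest angular neighbors, closest first
    stripped_ids : set of IDs stripped of their redshifts in this realization
    Returns the closest neighbor that kept its redshift; if every listed
    neighbor was stripped, returns the last entry in the list.
    """
    for gid in neighbor_ids:
        if gid not in stripped_ids:
            return gid
    return neighbor_ids[-1]
```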

4.1.2. Dark Regions

To simulate the effects of dark regions, we create a set of "dark region random variables" designed to perfectly match the shapes of our cells' dark regions. We determine the properties of dark regions within each redshift range ${z}_{i}\lt z\lt {z}_{i+1}$ to parameterize the distributions, then apply simulated dark regions only to cells within the same range.

Using the SDSS region algebra, it is possible to describe each cell's circular projection with a four-vector $[\hat{{\boldsymbol{s}}},c]$ where $\hat{{\boldsymbol{s}}}$ points toward the center of the cell and $c\equiv \cos \theta$ where θ is the cell's angular radius. Sets of these four-vectors, also known as halfspace constraints, can be used to describe any SDSS region. True dark regions are created by intersecting a cell's circular projection with the halfspace constraints that define the SF.

A cell's center may be repositioned to point ${\hat{{\boldsymbol{s}}}}_{p}$ using a rotation matrix ${\boldsymbol{M}}$ such that ${\hat{{\boldsymbol{s}}}}_{p}={\boldsymbol{M}}\hat{{\boldsymbol{s}}}$ . Multiplying each of the SF's halfspace constraints by ${\boldsymbol{M}}$ realigns the dark region with its cell. If ${\hat{{\boldsymbol{s}}}}_{p}$ is selected to lie within the SF, this action effectively generates a dark region at a new location. For example, the cell in Figure 9 contains two dark regions. The complement of these regions has a shape vaguely resembling a mushroom, as illustrated in Figure 10.

**Figure 9.** Circular projection of an R16 cell centered at z = 0.113 is shaded in magenta and superimposed upon the DR6 SF as marked in yellow. This cell has two dark regions, one in the lower left and the other in the lower right.

**Figure 10.** View of all MGS targets that lie within the circular projection of the R16 cell at z = 0.113 from Figure 9. Targets within the SF are represented by yellow pixels, while those outside are colored in blue or red. The seven red targets lie in the small areas in the complement of TILEs.

One can relocate dark regions like these using the following procedure. First, from within the redshift range $[{z}_{i},{z}_{i+1}]$ , randomly select a cell that contains dark regions, i.e., where β_SF < 0.99. This will be referred to as a template cell. Then, select a random point ${\hat{{\boldsymbol{s}}}}_{p}$ within the SF and derive the rotation matrix ${\boldsymbol{M}}$ that recenters the template cell on that point. Multiply the SF halfspace constraints by ${\boldsymbol{M}}$ and apply a random angular rotation about the cell's center to produce a shape like that pictured in Figure 11.

**Figure 11.** Cell (magenta) and its dark regions from Figure 9 are rotated into a new position within the SF (yellow). Galaxies within the magenta layer retain their redshifts and become nearest neighbor candidates. Galaxies within the relocated dark regions will become dark objects and be stripped of their redshifts.

Next, identify the galaxies that fall within the relocated template cell's circular projection. Those that satisfy the rotated SF constraint conditions retain their redshifts and are eligible to be nearest neighbors. The remainder become dark objects and are stripped of their redshifts. All cells within the redshift range $[{z}_{i},{z}_{i+1}]$ whose circular projections enclose at least one of the dark objects are identified and the seven counting methods are employed. This process continues until counts have been gathered for 10,000 cells in each redshift bin.
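
The classification step above can be illustrated with a short sketch. It assumes the rotated SF is represented by a single set of halfspace constraints that a target must satisfy simultaneously (the real footprint may need a union of such sets), and that a constraint $[\hat{{\boldsymbol{s}}},c]$ is satisfied when ${\boldsymbol{x}}\cdot \hat{{\boldsymbol{s}}}\geqslant c$. The names `roll_about` and `split_targets` are hypothetical.

```python
import numpy as np

def roll_about(axis, angle):
    """Rotation by `angle` (radians) about the direction `axis`."""
    k = np.asarray(axis, dtype=float)
    k = k / np.linalg.norm(k)
    K = np.array([[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * K + (1.0 - np.cos(angle)) * (K @ K)

def split_targets(xyz, cell_cap, rotated_sf):
    """Split targets in a relocated template cell into kept and dark.

    xyz        : (N, 3) unit vectors of the targets
    cell_cap   : [sx, sy, sz, c] for the relocated cell's circular projection
    rotated_sf : (K, 4) SF halfspaces after rotation (assumed to be one
                 convex set, so a target must satisfy all of them)
    Returns two boolean masks: (keeps_redshift, becomes_dark_object).
    """
    xyz = np.asarray(xyz, dtype=float)
    cap = np.asarray(cell_cap, dtype=float)
    sf = np.asarray(rotated_sf, dtype=float)
    in_cell = xyz @ cap[:3] >= cap[3]
    in_sf = np.all(xyz @ sf[:, :3].T >= sf[:, 3], axis=1)
    return in_cell & in_sf, in_cell & ~in_sf
```

The random angular rotation about the cell's center corresponds to applying `roll_about(s_p, phi)` to the directions of the rotated constraints, with `phi` drawn uniformly from [0, 2π).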

An astute observer might argue that requiring dark objects' nearest neighbors to lie within the rotated template cell is too restrictive. After all, there could be galaxies adjacent to the dark region but outside that cell. Disregarding what could actually be the nearest neighbors might cause the nearest neighbor and 2PCF smearing methods to appear weaker than they truly are.

We set this concern aside for two reasons. The first is geometric. Because dark regions tend to lie on the survey's edges, it is likely that there are no other galaxies on the side opposite the SF. Consider the cell in Figure 9, for example. The areas outside the cell and adjacent to its two dark regions also lie outside the SF, so no targets there can act as dark objects' nearest neighbors. In this case, the nearest neighbors are most likely to come from inside the template cell.

The second issue is practical. If we allowed nearest neighbors to come from outside the template cell, we would need to generate a larger list of candidate galaxies. To fairly represent the SDSS geometry, we would have to rotate all of the spectroscopic constraint conditions by ${\boldsymbol{M}}$ and compare them against a much larger set of targets. Nearest neighbor distances would have to be calculated. These additional computations would slow down the simulation process, leading to fewer cell counts overall.

The majority of dark regions are caused by the survey's edges, yet some are due to the small areas between TILEs (see the targets colored in red in Figure 10, for example). Objects that lie within them are classified as dark objects, even though their small number and close proximity to galaxies make them more like interspersed objects.

While interspersed counting techniques would likely be more effective, we draw no distinction between dark objects based upon the type of dark region they occupy. Doing so would require setting a minimum area for dark regions, a limit that at this point would be arbitrary. Regardless, the absolute number of dark objects in these regions is small, so treating them differently than those in the larger areas is unlikely to significantly alter the results. At worst, failure to handle them separately would increase the error metric by some small amount. However, since the goal of this analysis is to get a sense of which counting techniques are preferable to others, and since all counting techniques are subject to the same experimental conditions, we can expect the results to be generally applicable to the (larger) dark regions this section intends to study.

Five of the seven counting techniques could be applied to dark objects just as they were to interspersed objects. However, the "ignoring" and scaling methods, as well as determining the expected number of galaxies in each cell, require minor adjustments due to the introduction of the rotated constraint conditions. Each cell whose circular projection intersects a dark object has a precomputed spectroscopic completeness volume fraction β_SF. Once the rotated constraint conditions (i.e., those that generate the simulated dark regions) are applied, these cells adopt new completeness fractions ${\beta }^{\prime }\lt {\beta }_{\mathrm{SF}}$ that can be quantified using our empirical Monte Carlo procedure.

Let the number of galaxies within a cell's volume, after both sets of conditions are applied, be denoted by n_g. This serves as the final count for the "ignore" method. For the scaling method, the final number count ${n}_{g}^{\prime }$ equals the known galaxy count scaled by the relative increase in volume,

$\begin{eqnarray}&&{n}_{g}^{\prime }={n}_{g}\displaystyle \frac{{\beta }_{\mathrm{SF}}}{\beta ^{\prime} }.\end{eqnarray} \tag{ 21 }$

Similarly, the expected number of galaxies in the cell once the dark region rotated constraints are implemented is

$\begin{eqnarray}&&\langle n{\rangle }^{\prime }=\langle n\rangle \displaystyle \frac{{\beta }_{\mathrm{SF}}}{{\beta }^{\prime }}.\end{eqnarray} \tag{ 22 }$
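
As an illustration of how a completeness fraction such as β′ might be estimated and then used in Equation (21), the sketch below draws uniform random points inside a spherical cell and measures the fraction that pass a user-supplied test. It is a generic Monte Carlo volume estimate under stated assumptions, not a reproduction of the procedure used in this paper: the function names, the `accept` callback, and the sample size are all hypothetical.

```python
import numpy as np

def monte_carlo_fraction(center, radius, accept, n_samples=100_000, seed=0):
    """Estimate the fraction of a sphere's volume where accept(points) is True.

    center : (3,) Cartesian position of the cell center
    radius : cell radius, in the same units as center
    accept : callable mapping an (N, 3) array of points to a boolean mask
    """
    rng = np.random.default_rng(seed)
    # Uniform points in a ball: random direction times radius * u**(1/3).
    direction = rng.normal(size=(n_samples, 3))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    r = radius * rng.random(n_samples) ** (1.0 / 3.0)
    points = np.asarray(center, dtype=float) + direction * r[:, None]
    return float(np.mean(accept(points)))

def scaled_count(n_g, beta_sf, beta_prime):
    """Equation (21): n_g' = n_g * beta_SF / beta'."""
    if beta_prime <= 0:
        raise ValueError("beta_prime must be positive for scaling to be defined")
    return n_g * beta_sf / beta_prime
```

For example, a cell with n_g = 12, β_SF = 0.9, and β′ = 0.6 would be assigned a scaled count of 18.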

4.1.3. External Regions

External regions are large, contiguous areas that contain no MGS galaxies. Unlike dark regions, which can occupy no more than a fraction $1-{\beta }_{\mathrm{SF}}$ of a cell's volume, external regions can be survey-sized. They are a natural byproduct of large photometric surveys prior to spectroscopic observation.

We simulate external regions by carving away the edges of the DR6 SF. The galaxies within them are subsequently stripped of their redshifts, becoming external objects in the process. The area that remains becomes the trimmed SF, and the galaxies therein become candidates for the nearest neighbor and 2PCF smearing methods. As with interspersed and dark objects, all external objects are drawn from the set of pristine MGS galaxies.

The survey is trimmed along lines of constant "lambda" and "eta," otherwise referred to as "survey coordinates" within SDSS. These coordinates align with the direction of the SDSS stripes. Their edges are chosen so as to create areas about 4° wide along the boundaries of the northern hemisphere.

The 125,306 galaxies that occupy the external regions are pictured in Figure 12. Color is used to represent the distance between each external object and its nearest neighbor. As seen in the figure, the nearest neighbor distances for many of these objects are so large that they likely lie beyond a spatial correlation radius that would put them at similar redshifts. For this reason, it is expected that the nearest neighbor method will perform poorly and that 2PCF smearing will offer little advantage over selection function smearing.

We study our counting methods in external regions to understand how well they perform when all targets in cells lack redshifts. Therefore, we prohibit any galaxies within the trimmed SF from being counted in cells. Because galaxies are disqualified from the counting analysis, the scaling method is undefined in external regions and is consequently disregarded.

The exclusion of galaxies has the greatest impact on low-redshift cells and/or those with large angular projections. To avoid biasing our results, this change requires that we reprocess our cell sample in a couple of ways. To ensure uniformity, we only consider cells for which β_SF > 0.99. To account for the fact that cells that reach outside the external regions will now contain "empty volumes," the volume each cell occupies in the external region exclusively is calculated using the usual Monte Carlo process. The number of galaxies expected therein is subsequently scaled downward to account for the volume reduction. Otherwise, tests proceed as normal.

4.2. Results

This section contains the results of our counting simulations. We begin with comparisons between two pairs of related techniques—discrete photo-z's versus smeared photo-z's (Section 4.2.1), and selection function smearing versus 2PCF smearing (Section 4.2.2). The latter comparison reveals that 2PCF smearing is always superior. This removes the need to consider selection function smearing any further. We then report how each of the remaining counting techniques performed in the interspersed regions (Section 4.2.3), dark regions (Section 4.2.4), and external regions (Section 4.2.5).

We acknowledge that there are many other comparisons that could be made between counting methods as a function of redshift, region variety, and measurement type. We lack the space in these pages to explore them all. For this reason, we have made the processed data files for each of the counting methods available online—http://bit.ly/1Nc8MmI—should the reader be compelled to explore the problem further.

4.2.1. Discrete versus Smeared Photo-z's

In Figure 13, we plot the differences in error metrics of photo-z's against smeared photo-z's, in both interspersed and dark regions.⁷ Several conclusions emerge. First, smearing tends to fail in regions where $\langle n\rangle$ is low. These include R16 cells at z ≳ 0.22 and R7 cells at z ≳ 0.17.⁸ At these higher redshifts, excess probability in the photo-z PDFs drives up measures of n. This leads to the conclusion that at high-z, all else being equal, discrete photometric redshifts are preferable to smeared photometric redshifts.

**Figure 13.** Comparison between discrete photo-z and probabilistic photo-z counting techniques. For each cell size and redshift bin, the optimal photo-z (SED or D1) and smeared photo-z counting methods are determined. The errors for the best photometric redshift smearing techniques are subtracted from the errors ${\epsilon }^{()}$ for the best photo-z counting methods to produce a comparison statistic ${\rm{\Delta }}{\epsilon }^{()}$ . When ${\rm{\Delta }}{\epsilon }^{()}\gt 0$ , photometric redshift smearing is outperforming the use of photometric redshifts alone. The left and right columns contain the results for interspersed and dark regions respectively.

The second conclusion is that for low- and intermediate-redshift cells, probabilistic photo-z's are generally better for measuring n and δ, while discrete photo-z's are preferable for δ² (though this latter preference is of limited statistical significance). This result holds for both interspersed and dark regions. This is anticipated since the regions differ only in the way objects are clustered within the cells, a distinction that has no impact on the efficacy of photometric redshift methods.

Finally, we conclude that photo-z smearing methods offer relatively better performance in measuring n when $\langle n\rangle$ is large. Recall the assumption underlying probabilistic smearing—when applied in aggregate the sum of partial counts should approach the true count. This condition is best realized when the number of targets in each cell is high, as is the case with the low-redshift R16 cells in Figure 13.
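
The aggregate behavior of smearing can be illustrated with a simplified sketch. It assumes a Gaussian photo-z PDF for each object and replaces the spherical cells along a line of sight with one-dimensional redshift bins. The function names and the bin layout are hypothetical, and no renormalization is applied for probability that falls outside the binned range.

```python
import math

def gaussian_cdf(x, mu, sigma):
    """Cumulative distribution of a normal with mean mu and width sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def smear_counts(photo_z, sigma_z, bin_edges):
    """Accumulate partial counts per redshift bin from Gaussian photo-z PDFs.

    photo_z, sigma_z : per-object photometric redshift and its uncertainty
    bin_edges        : increasing redshift bin edges (len = n_bins + 1)
    """
    counts = [0.0] * (len(bin_edges) - 1)
    for mu, sigma in zip(photo_z, sigma_z):
        for i in range(len(counts)):
            lo, hi = bin_edges[i], bin_edges[i + 1]
            # Probability that this object lies in bin i.
            counts[i] += gaussian_cdf(hi, mu, sigma) - gaussian_cdf(lo, mu, sigma)
    return counts
```

When the bins span most of each PDF, the partial counts sum to approximately the number of objects, although each individual bin receives only a fraction of any one object. A bin that collects partial counts from many objects is therefore better constrained than one that collects them from a few.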

4.2.2. Selection Function versus 2PCF Smearing

The differences between selection function smearing and 2PCF smearing are plotted in Figure 14. To summarize the figure in one statement: 2PCF smearing is always preferable to selection function smearing. The information "boost" provided to the selection function by including spatial correlation information from an object's nearest neighbor is always beneficial in both interspersed and dark regions.

This benefit is larger and more significant for interspersed objects than dark objects. Because the average nearest neighbor distance is smaller for interspersed objects, the boost provided by the correlation function is more valuable here than in dark regions. We also note that the boost is better for larger cells when approximating n, and for smaller cells when approximating δ and δ².

This preference for 2PCF smearing is significant enough for us to remove selection function smearing from further consideration. The figures and results that follow in subsequent sections will therefore offer comparisons of only six counting techniques rather than the original seven.

4.2.3. Interspersed Regions

Full counting results for the R7, R11, and R16 cases are plotted in Figures 15–17 respectively. The optimal counting techniques as a function of redshift and cell size are presented in Table 6.

**Figure 15.** Error metrics from Equation (20) for number counts n, overdensities δ, and overdensities squared δ² in interspersed regions for R7 cells. Error metric values are averaged over redshift bins of width Δz = 0.01. Error bars are omitted for visual clarity here, but are available in text files online: http://bit.ly/1Nc8MmI. Uncertainties for select counting methods and comparisons are also plotted in Figures 13, 14, and 18.

**Figure 16.** Same as Figure 15 but for R11 cells.

**Figure 17.** Same as Figure 15 but for R16 cells.

Table 6. Optimal Counting Techniques for Interspersed MGS Objects As a Function of Redshift

	R7			R11			R16
z	n	δ	δ²	n	δ	δ²	n	δ	δ²
0.030	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.045	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.055	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.065	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.075	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.085	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.095	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.105	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.115	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.125	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.135	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.145	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.155	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.165	ignore	ignore	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.175	ignore	ignore	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.185	ignore	ignore	scaling	scaling	scaling	scaling	scaling	scaling	scaling
0.195	ignore	ignore	ignore	scaling	scaling	scaling	scaling	scaling	scaling
0.205	ignore	ignore	ignore	ignore	ignore	scaling	scaling	scaling	scaling
0.215	ignore	ignore	ignore	ignore	ignore	scaling	scaling	scaling	scaling
0.225	⋯	⋯	⋯	⋯	⋯	⋯	scaling	scaling	scaling
0.235	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	pzs
0.245	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	pz
0.255	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.265	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.275	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.285	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	pz
0.295	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore

Note. The abbreviation "pz" stands for "photo-z," and "pzs" stands for "photo-z smearing."

Results for the three cell sizes share some common features. At low redshifts, where galaxies are plentiful, ignoring objects performs the worst, particularly as cell size increases. However, at high redshifts, where galaxies are scarce, ignoring objects is the preferred technique. So few galaxies are detectable at large distances that assuming no objects reside there is actually a sound strategy.

Scaling is the preferred method at low and intermediate redshifts. When the number of expected galaxies as a function of redshift is normalized using all MGS targets (i.e., galaxies and objects), scaling brings the number counted in each cell more in line with reality. An alternative is to calculate only the expected number of galaxies in each cell, then compute the overdensity using the number counts of galaxies only. Both approaches should yield similar overdensities since the former scales both n and $\langle n\rangle$ by approximately the same factor. As a practical matter, this suggests that the latter approach should be a reasonable counting approach in regions where scaling is preferred. We note also that the redshift at which ignoring objects becomes preferable to scaling increases with cell size.
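
The equivalence of the two overdensity estimates noted above can be made explicit. Assuming the standard definition $\delta =n/\langle n\rangle -1$ and writing s for the common scaling factor (a symbol introduced here only for illustration), the scaled overdensity is

$\begin{eqnarray}&&{\delta }_{\mathrm{scaled}}=\displaystyle \frac{s\,n}{s\,\langle n\rangle }-1=\displaystyle \frac{n}{\langle n\rangle }-1.\end{eqnarray}$

The two estimates differ only to the extent that the factors applied to n and $\langle n\rangle$ differ within a given cell.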

The nearest neighbor method is nonoptimal in almost all situations. As Figure 18 illustrates, this conclusion is of high significance, especially for n and δ when z ≲ 0.16. When estimating n, scaling is particularly dominant when either cell size or the expected number of galaxies per cell is large. This conclusion loses 2σ significance only at the highest redshifts, but since ignoring objects is the dominant method in this regime, the practical impact of this fact is largely moot.

4.2.4. Dark Regions

The results of the dark region counting analyses for R7, R11, and R16 are presented in Figures 19–21, respectively. A summary of the best counting methods at each redshift is supplied in Table 7. The preferred methods to count dark objects are less uniform than those for interspersed objects, and are more dependent on cell size and redshift.

**Figure 19.** Dark region counting method results for R7 cells. Error metrics for number count ${\epsilon }^{(n)}$ , overdensity ${\epsilon }^{(\delta )}$ , and overdensity squared ${\epsilon }^{({\delta }^{2})}$ are presented on the vertical axis. Values are averaged over cells in redshift bins of width Δz = 0.01. Error bars are omitted for clarity, but are available in text files online: http://bit.ly/1Nc8MmI. Preferred counting methods have lower error metric values.

**Figure 20.** Same as Figure 19 but for R11 cells.

**Figure 21.** Same as Figure 19 but for R16 cells.

Table 7. A Summary of the Best Methods to Count Dark Objects for Each Cell Size and Measurement Type As a Function of Redshift

	R7			R11			R16
z	n	δ	δ²	n	δ	δ²	n	δ	δ²
0.030	pzs	pzs	pzs	pzs	pzs	pzs	pzs	pzs	pzs
0.045	scaling	scaling	scaling	pzs	pzs	pzs	pzs	pzs	pz
0.055	scaling	scaling	scaling	pzs	pzs	pz	pzs	pzs	pzs
0.065	scaling	scaling	pzs	scaling	scaling	pzs	pzs	pzs	pz
0.075	scaling	scaling	2PCF	2PCF	2PCF	2PCF	pzs	pzs	pz
0.085	scaling	scaling	pzs	scaling	pzs	pz	pz	pz	pz
0.095	ignore	ignore	ignore	pzs	pzs	pzs	pzs	pzs	pz
0.105	ignore	ignore	ignore	scaling	scaling	pzs	pzs	pzs	pzs
0.115	ignore	ignore	ignore	scaling	scaling	scaling	pzs	pzs	pz
0.125	ignore	ignore	ignore	scaling	scaling	pzs	pzs	pzs	pz
0.135	ignore	ignore	ignore	pzs	pzs	pz	pzs	pzs	pz
0.145	ignore	ignore	ignore	pzs	pzs	pzs	pz	pzs	pz
0.155	ignore	ignore	ignore	pzs	pzs	pz	pzs	pzs	pz
0.165	ignore	ignore	ignore	pzs	pzs	pz	pzs	pzs	pz
0.175	ignore	ignore	ignore	ignore	ignore	ignore	pzs	pzs	pz
0.185	ignore	ignore	ignore	ignore	ignore	pz	pzs	pzs	pz
0.195	ignore	ignore	ignore	ignore	ignore	pz	pzs	pzs	pz
0.205	ignore	ignore	ignore	ignore	ignore	ignore	pzs	pzs	pz
0.215	ignore	ignore	ignore	ignore	ignore	ignore	ignore	ignore	pzs
0.225	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	pz
0.235	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.245	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.255	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.265	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.275	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.285	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.295	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore

Note. The abbreviation "pz" stands for "photo-z," and "pzs" stands for "photo-z smearing." Best methods are defined to be those with the lowest values of the error metrics ${\epsilon }^{()}$ .

Ignoring dark objects in high-redshift cells is optimal. The redshift at which ignoring becomes preferable is lower than with interspersed objects. This result follows from the fact that, as a function of redshift, the average number of dark objects approaches zero sooner than that of interspersed objects.

The preferred methods for counting dark objects in low-redshift cells vary as a function of cell size and distance. Scaling is generally preferred for R7, and photo-z smearing is generally preferred for R16, while a combination of photo-z smearing, 2PCF smearing, and scaling is preferred for R11. The preference for photo-z smearing in larger cells at lower redshifts is consistent with our presumption that probabilistic methods would be most effective when counting large numbers of objects whose lines of sight intersect the same volume.

The nearest neighbor method is the least effective in dark regions. This is especially true at low redshifts where cells' large circular projections increase the average angular distance between nearest neighbors. This method offers a comparative advantage only for low-redshift R11 and R16 cells, where ignoring objects outright is sometimes worse.

To quantify the significance of the preferences reported in Table 7, we refer to Figure 22 where three pairs of methods for R11 cells are compared directly.⁹ The significances vary by pair. We discover that for most redshifts, neither photo-z smearing nor scaling offers a significant advantage over the other. Scaling is usually preferred to 2PCF smearing, often at the 1σ–2σ level. However, photo-z smearing is significantly more effective at counting dark objects at low redshifts than ignoring, while the opposite is true for z > 0.17.

**Figure 22.** Comparison between select counting method pairs for dark objects in R11 cells. The differences in the error metrics ${\rm{\Delta }}{\epsilon }^{(\delta )}$ reported on the vertical axis are scaling minus photo-z smearing (top), 2PCF smearing minus scaling (middle), and ignore minus photo-z smearing (bottom). Error bars are 1σ spreads of the differences in the error metric.

This dark region analysis was carried out using cells for which β_SF ≥ 0.62. These tests can be replicated for other minimum values of β_SF and the error metrics from Figures 19–21 can be recalculated. This can quantify how the maximum volume a cell is permitted to occupy in a dark region affects the magnitude of the counting errors.

4.2.5. External Regions

Comparisons between all counting methods in the external regions are presented in Figures 23–25, while Table 8 summarizes the optimal counting methods for each cell size and redshift.

**Figure 23.** External region counting method results for R7 cells. Error metrics for number count ${\epsilon }^{(n)}$ , overdensity ${\epsilon }^{(\delta )}$ , and overdensity squared ${\epsilon }^{({\delta }^{2})}$ are presented on the vertical axis. Values are averaged over cells in redshift bins of width Δz = 0.01. Error bars are omitted for clarity, but are available in text files online: http://bit.ly/1Nc8MmI. Preferred counting methods have lower error metric values.

**Figure 24.** Same as Figure 23 but for R11 cells.

**Figure 25.** Same as Figure 23 but for R16 cells.

Table 8. A Summary of the Best Methods to Count External Objects for each Cell Size and Measurement Type As a Function of Redshift

	R7			R11			R16
z	n	δ	δ²	n	δ	δ²	n	δ	δ²
0.035	pzs	pzs	pzs	⋯	⋯	⋯	⋯	⋯	⋯
0.055	pzs	pz	pz	pz	pz	pz	⋯	⋯	⋯
0.065	pz	pz	pz	⋯	⋯	⋯	pzs	pzs	pzs
0.075	2PCF	2PCF	2PCF	pzs	pzs	pzs	⋯	⋯	⋯
0.085	pzs	pzs	pz	pzs	pzs	pzs	⋯	⋯	⋯
0.095	pzs	pzs	pz	pz	pz	pz	pz	pz	pz
0.105	pzs	pzs	pz	pzs	pzs	pzs	pzs	pzs	pzs
0.115	pzs	pzs	ignore	pzs	pzs	pz	pzs	pzs	pzs
0.125	pzs	pzs	pzs	pzs	pzs	pzs	pzs	pzs	2PCF
0.135	pzs	pzs	ignore	pzs	pzs	pzs	pzs	pzs	pzs
0.145	pzs	pzs	ignore	pzs	pzs	pzs	pzs	pzs	pzs
0.155	pzs	pzs	ignore	pzs	pzs	pzs	pzs	pzs	2PCF
0.165	pzs	pzs	ignore	pzs	pzs	pzs	pzs	pzs	pz
0.175	pzs	pzs	ignore	pzs	pzs	pzs	pzs	pzs	pzs
0.185	ignore	ignore	ignore	pzs	pzs	pz	pzs	pzs	pzs
0.195	ignore	ignore	ignore	pzs	pzs	ignore	pzs	pzs	pzs
0.205	ignore	ignore	ignore	pz	pz	ignore	pz	pz	2PCF
0.215	ignore	ignore	ignore	ignore	ignore	ignore	pz	pz	pz
0.225	⋯	⋯	⋯	⋯	⋯	⋯	pz	pz	pz
0.235	⋯	⋯	⋯	⋯	⋯	⋯	2PCF	2PCF	ignore
0.245	⋯	⋯	⋯	⋯	⋯	⋯	2PCF	2PCF	ignore
0.255	⋯	⋯	⋯	⋯	⋯	⋯	2PCF	2PCF	ignore
0.265	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.275	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.285	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore
0.295	⋯	⋯	⋯	⋯	⋯	⋯	ignore	ignore	ignore

Note. The abbreviation "pz" stands for "photo-z," and "pzs" stands for "photo-z smearing." Best methods are defined to be those with the lowest values of the error metrics ${\epsilon }^{()}$ . Redshift bins containing too few cells to generate meaningful statistics are grouped together. These groups are z < 0.07 for R11 and z < 0.09 for R16.

The nearest neighbor method offers no advantages in external regions. Ignoring external objects is also a poor strategy, save for the highest redshift cells where $\langle n\rangle$ ≈ 0. Of the remaining methods, none perform significantly better than any of the others. Figure 26 compares four pairs of these counting techniques. We find that in almost all cases, the 1σ uncertainties in the differences of the error metrics are many times larger than the differences themselves.

**Figure 26.** Comparisons between photo-z smearing and other counting techniques in external regions for R11 cells. The error metric ${\epsilon }^{(\delta )}$ for photo-z smearing is subtracted from that of ignoring (top), nearest neighbor (second from top), 2PCF smearing (second from bottom), and discrete photo-z (bottom) and plotted on the vertical axis. Error bars are 1σ spreads of the differences in the error metric.

Photo-z smearing performs exceptionally poorly for R16 cells at large redshifts. This is due to large photo-z variances ${\sigma }_{z}^{2}$ depositing too many partial counts at high redshifts. Ultimately, this reveals a problem with using a simple Gaussian model for the photo-z PDFs. Improvements might consist of a reduced high-z tail, better parameterized σ_z, or replacement of the Gaussian model with an individualized distribution for each target.

We conclude that no single counting method is clearly optimal at any redshift for any cell size in external regions. The magnitudes of the error metrics ${\epsilon }^{()}$ for external regions are approximately an order of magnitude larger than those for dark regions, though this comes with the caveat that cells containing dark regions are very likely to contain galaxies of known redshift as well as dark objects. Regardless, the external region error metrics are large enough to partially justify our constraint that cells must lie mostly within the SF through the requirement β_SF ≥ 0.62. Moreover, they support the conclusion derived from inspection of Figure 7—none of the counting methods tested are capable of reproducing small-scale structure.

4.3. Galaxies in Dark Regions

The dark region/object simulations of Section 4.1.2 made an important assumption—that dark regions are exclusively populated by objects. In practice, this is not always the case. The DR6 footprint corrections (Specian & Szalay 2016) occasionally recharacterize galaxies as lying outside the SF. We refer to these MGS targets as dark galaxies.

The presence of dark galaxies has no impact on the counting of dark objects for all but one counting method—scaling. Naively applying the scaling relation of Equation (3) necessarily discards information about the number of targets in the dark region. This simplification is worth avoiding if any of those targets' spectroscopic redshifts are known. However, simply adding the number count of dark galaxies to the approximated count of dark objects implies that the effective galaxy density in the dark region is possibly greater than in the rest of the cell.

Our solution is to derive an effective spectroscopic completeness ${\beta }_{\mathrm{SF}}^{\prime }$ that better reflects the volume of the cell within the SF once the dark galaxies are taken into account. The fundamental assumptions required to calculate ${\beta }_{\mathrm{SF}}^{\prime }$ are (1) each target in the dark region occupies an equally sized area of that dark region and (2) the area fraction occupied by each is the same as its volume fraction. Put another way, the surface density local to each target is assumed to be the same as that of the dark region as a whole.

Consider a cell of volume V with a fraction β_PF inside the PF and a fraction β_SF within the SF. Let n_ig equal the number of interspersed galaxies inside both the cell and the SF, n_io equal the number of interspersed objects approximated to be inside the cell (through the optimal interspersed counting strategy), n_dg equal the number of dark galaxies in the cell, n_do equal the number of dark objects within the cell's projection, and c equal the direction-dependent interspersed region spectroscopic completeness factor from Equation (5).

The average fraction f of the cell's area (and volume, by assumption) occupied by each dark galaxy or dark object in the dark region is

$\begin{eqnarray}&&f=\displaystyle \frac{{\beta }_{\mathrm{PF}}-{\beta }_{\mathrm{SF}}}{{n}_{{dg}}+{n}_{{do}}}.\end{eqnarray} \tag{ 23 }$

These partial volumes are added to the volume of the cell within the SF such that

$\begin{eqnarray}&&{\beta }_{\mathrm{SF}}^{\prime }={\beta }_{\mathrm{SF}}+{{fn}}_{{dg}}.\end{eqnarray} \tag{ 24 }$

We replace β_SF in Equation (4) with ${\beta }_{\mathrm{SF}}^{\prime }$ from Equation (24), and take the adjusted number of galaxies inside the cell's interspersed volume to be ${n}_{{ig}}+{n}_{{io}}+{n}_{{dg}}$ . The number of targets n_t approximated to lie within the cell becomes

$\begin{eqnarray}&&{n}_{t}=\left(\displaystyle \frac{{\beta }_{\mathrm{PF}}({n}_{{dg}}+{n}_{{do}})}{{\beta }_{\mathrm{SF}}{n}_{{do}}+{\beta }_{\mathrm{PF}}{n}_{{dg}}}\right)({n}_{{ig}}+{n}_{{io}}+{n}_{{dg}}).\end{eqnarray} \tag{ 25 }$

The following is a summary of the scaling method when dark galaxies are present.

1. The number of interspersed galaxies n_ig is counted using Equation (2).
2. The optimal interspersed counting method is applied to approximate the counts n_io of interspersed objects.
3. The average volume fraction f per dark region target is calculated using Equation (23) and used to update the spectroscopic completeness in Equation (24).
4. Using the number of dark galaxies n_dg, Equation (25) is evaluated to deliver a final approximated number count.
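
The steps above can be sketched as follows. The function names are hypothetical, and the counts n_ig and n_io are taken as inputs because Equations (2) and (4) are not reproduced here. Substituting Equations (23) and (24) shows that the prefactor of Equation (25) equals ${\beta }_{\mathrm{PF}}/{\beta }_{\mathrm{SF}}^{\prime }$, which the usage example checks numerically.

```python
def effective_completeness(beta_pf, beta_sf, n_dg, n_do):
    """Equations (23) and (24): return (f, beta_sf_prime).

    f is the average volume fraction per dark region target, and
    beta_sf_prime is the effective spectroscopic completeness.
    """
    n_dark = n_dg + n_do
    if n_dark == 0:
        raise ValueError("f is undefined when the dark region holds no targets")
    f = (beta_pf - beta_sf) / n_dark
    return f, beta_sf + f * n_dg

def scaled_targets(beta_pf, beta_sf, n_ig, n_io, n_dg, n_do):
    """Equation (25): approximate number of targets n_t in the cell."""
    denominator = beta_sf * n_do + beta_pf * n_dg
    if denominator == 0:
        raise ValueError("Equation (25) is undefined for these inputs")
    return beta_pf * (n_dg + n_do) / denominator * (n_ig + n_io + n_dg)

# Usage with made-up numbers: one dark galaxy and three dark objects.
f, beta_sf_prime = effective_completeness(1.0, 0.8, n_dg=1, n_do=3)
n_t = scaled_targets(1.0, 0.8, n_ig=10, n_io=2, n_dg=1, n_do=3)
# The closed form of Equation (25) matches beta_PF / beta_SF' times the count.
assert abs(n_t - 1.0 / beta_sf_prime * (10 + 2 + 1)) < 1e-9
```

With no dark galaxies (n_dg = 0), the effective completeness reduces to β_SF and the prefactor reduces to β_PF/β_SF.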

Ultimately, the impact of dark galaxies is at the percent level. As illustrated in Figure 27, there is only a limited number of cells for which ${\beta }_{\mathrm{SF}}^{\prime }\ne {\beta }_{\mathrm{SF}}$ . For R7, R11, and R16 there are 366, 271, and 210 cells for which ${\beta }_{\mathrm{SF}}^{\prime }-{\beta }_{\mathrm{SF}}\geqslant 0.001$ , or 0.46%, 1.37%, and 1.38%, respectively.

**Figure 27.** Distribution of cells for which ${\beta }_{\mathrm{SF}}^{\prime }-{\beta }_{\mathrm{SF}}\geqslant 0.01$ . Cells are counted in bins of width 0.01.

Consideration was given to the idea of treating dark objects whose nearest neighbors were within the SF as interspersed objects. Their proximity to the footprint boundary might make them more akin to interspersed objects than to dark objects, which tend to clump together and lack galaxy neighbors on one side of their region.

However, none of the optimal interspersed methods involve the use of spatial correlations. Also, ignoring objects is preferred at high redshifts for both region types, further mitigating the need for a separate designation. While there could be some marginal benefit to scaling boundary objects rather than smearing them for low-redshift R11 and R16 cells, the added complexity provides a disincentive. This particular issue is pursued no further.

5. SUMMARY AND CONCLUSIONS

This paper compared seven techniques for counting galaxies in discrete volumes of space. By "count," we refer specifically to the measurement of number count, overdensity, and overdensity squared within non-overlapping spherical cells of radii 7, 11, and 16 ${h}^{-1}\,\mathrm{Mpc}$ (i.e., R7, R11, and R16) densely packed into the survey footprint of the SDSS's sixth data release. We used MGS galaxies—all with high-quality spectroscopic redshifts—to generate simulations in which a percentage were randomly stripped of their redshifts. The true counts were compared to those estimated through our counting methods to draw conclusions about which performed best under different scenarios. The results of these comparisons are most succinctly summarized in Tables 6–8.

We used the term "object" to mean "a target without a high-quality spectroscopic redshift." Our first three techniques for counting objects each assigned them a singular redshift. These were

1. ignore: do not count object,
2. nearest neighbor: assign object redshift of nearest angular neighbor, and
3. photo-z: use optimal photometric redshift.
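
The nearest neighbor assignment can be sketched in a few lines of Python. This is an illustration only: the function names and the tuple layout are hypothetical, positions are assumed to be right ascension and declination in degrees, and no maximum separation is imposed.

```python
import math


def angular_separation(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees (haversine formula).
    All inputs are in degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    a = (math.sin((dec2 - dec1) / 2.0) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2.0) ** 2)
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a))))


def nearest_neighbor_redshift(obj, galaxies):
    """Return the redshift of the galaxy closest on the sky to obj.

    obj      -- (ra, dec) of the target lacking a redshift
    galaxies -- list of (ra, dec, z) for galaxies with redshifts
    """
    if not galaxies:
        raise ValueError("no galaxies with redshifts available")
    nearest = min(
        galaxies,
        key=lambda g: angular_separation(obj[0], obj[1], g[0], g[1]),
    )
    return nearest[2]


# Hypothetical object with two candidate neighbors.
z_assigned = nearest_neighbor_redshift(
    (180.0, 0.0),
    [(180.1, 0.0, 0.05), (181.0, 0.5, 0.12)],
)
```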

Our next three techniques were probabilistic in nature. Each cell along an object's line of sight received a partial count equal to the probability of the object lying within that cell, a process we refer to as "smearing." The only differences between the probabilistic methods tested were the distributions used. These were

1. selection function: distribution of galaxies in the absence of spatial clustering,
2. 2PCF: selection function distribution "boosted" by nearest neighbor information, and
3. photo-z smearing: Gaussian distribution with mean and standard deviation equal to optimal photometric redshift and its error, respectively.
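
The following Python is a minimal sketch of photo-z smearing. It assumes that each cell crossed by the line of sight can be represented by a redshift interval, and it does not truncate or renormalize the Gaussian at z < 0. The function names and the interval representation are illustrative and not taken from the paper.

```python
import math


def gaussian_cdf(z, mean, sigma):
    """Cumulative distribution of a Gaussian evaluated at z."""
    return 0.5 * (1.0 + math.erf((z - mean) / (sigma * math.sqrt(2.0))))


def smear_photo_z(z_phot, z_err, cell_intervals):
    """Partial count received by each cell along one line of sight.

    z_phot, z_err  -- photometric redshift and its error
    cell_intervals -- list of (z_lo, z_hi) pairs, one per cell (assumed)
    """
    return [
        gaussian_cdf(z_hi, z_phot, z_err) - gaussian_cdf(z_lo, z_phot, z_err)
        for z_lo, z_hi in cell_intervals
    ]


# Hypothetical object centered in the middle of three adjacent cells.
partial_counts = smear_photo_z(
    0.075, 0.02, [(0.00, 0.05), (0.05, 0.10), (0.10, 0.15)]
)
```

The partial counts sum to at most one. Any remainder is probability that falls outside the listed cells, which is where the overestimate at large redshifts noted for the Gaussian tail would enter.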

We refer to our seventh technique as "scaling." Scaling upweights the number of galaxies (with spectroscopic redshifts) counted in each cell by a scaling factor. Inside the SF, the scaling factor equals the inverse of the spectroscopic completeness along that cell's line of sight. Outside, it is related to the relative volumes the cell occupies within the PF and SF. This approach is similar to disregarding objects from both counts n and expected counts $\langle n\rangle$ when measuring overdensities.
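
The similarity can be made explicit with a short worked equation. As assumptions for this illustration, take the usual overdensity definition $\delta =(n-\langle n\rangle )/\langle n\rangle $ , let c be the spectroscopic completeness along the line of sight, and let n_g be the number of galaxies with redshifts in the cell, upweighted to n_g/c. Then

$\begin{eqnarray}&&\delta =\displaystyle \frac{{n}_{g}/c-\langle n\rangle }{\langle n\rangle }=\displaystyle \frac{{n}_{g}-c\langle n\rangle }{c\langle n\rangle },\end{eqnarray}$

so upweighting the galaxy count yields the same overdensity as leaving the count untouched and reducing the expected count to $c\langle n\rangle $ , the expectation for galaxies with redshifts alone.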

Three of our seven techniques (nearest neighbor, 2PCF, scaling) utilized spatial correlations, meaning the type of environment within which the techniques were applied is relevant. We separately considered three region types. The first included areas within the SF where the angular separations between objects and other galaxies were small (see Figures 15–17). The second included areas just outside the SF, where there were fewer viable nearest neighbors and spatial correlations weakened (see Figures 19–21). The final included areas far from the edge of the SF (see Figures 23–25).

One of the clearest conclusions is that smearing with the 2PCF is always preferable to using the selection function alone (see Figure 6). Applying selection function smearing to an object merely reflects that object's belonging to the galaxy sample. The information "boost" provided by incorporating the spatial correlation with its nearest neighbor appears to be beneficial regardless of cell size or environment.

We also examined whether discrete photo-z's or smeared photo-z's were more effective in recovering accurate galaxy counts. At lower redshifts, smearing was preferable, partly because probabilistic methods improve when the number of objects per cell is large. However, we discovered that modeling the high-redshift tail of the smeared photo-z's with a Gaussian distribution tended to overestimate the probability of objects lying at large redshifts. At z ≳ 0.2, discrete photo-z's were generally preferable to smearing.

At large redshifts, though, no counting technique was more successful than ignoring objects. This approach systematically underestimates number counts and overdensities in cells. However, in distant regions where the expected number of galaxies is small, we found the uncertainty associated with ignoring objects to be smaller than those of alternative methods.

At lower redshifts, scaling galaxy counts was often the optimal approach, especially for objects inside the SF. Scaling was also preferable outside the SF, but only until about z = 0.09 and z = 0.13 for R7 and R11 cells, respectively. Outside the SF, in the intermediate redshift range before ignoring was preferable, photo-z smearing was often found to be the best approach.

Despite its popularity, the nearest neighbor method was outperformed by scaling within the SF, though it was favored over other alternatives in this regime. The nearest neighbor method was largely ineffective outside the SF.

We also compared these methods in cells far outside the SF where the spectroscopic completeness was zero. The scaling method is undefined here, and the other spatial correlation methods (nearest neighbor, 2PCF) offer no advantages. With the exception of ignoring, which is again preferable in the most distant cells, none of the remaining methods performed significantly better than any of the others, and no substantial preference for a counting method could be established.

We conclude by reporting in Figure 28 the distribution of MGS overdensities after the optimal object counting techniques for each cell were applied. By virtue of having the largest number of cells and the smallest cell size, the R7 set offers the highest resolution statistics. The large number of high-δ R7 cells results from many of those cells having moderate $\langle n\rangle$ , but large n. The R11 cells occupy the same redshift range, but their larger size smooths out structure, thereby depressing overdensities. The R16 cells are larger still, but over 56% lie in the range of 0.22 < z ≤ 0.30, where $\langle n\rangle$ is very low and δ is high when galaxies are present.

In Figure 29, we visualize what percentage of each kind of MGS target occupies cells within the stated redshift ranges. Galaxies with redshifts inside the SF comprise the largest fraction—one that at low-z is only slightly smaller than the overall spectroscopic completeness of the DR6 survey. At greater radial distances, the fraction of these galaxies approaches one, while the fraction of objects drops to zero. This is a consequence of the transition from "scaling" to "ignoring" objects, a result that holds both in and outside the SF. The percentage of counted objects outside the SF at low-z is ∼7% for all cell sizes, a number that roughly reflects the average value of $1-{\beta }_{\mathrm{SF}}$ in that range. When these objects are counted via scaling (i.e., low-z R7 cells), they constitute a smaller fraction than when smeared photometrically (e.g., R16 cells).

It is important to bear in mind that these seven counting techniques were tested under particular circumstances—the sixth Data Release of the SDSS using only MGS targets selected within a predetermined photometric, spectroscopic, and angular range. Even so, optimal counting methods still differed by cell size and region type. These results may differ for other redshift surveys and cell geometries. As such, they should be viewed as more of a general sketch than a universal statement on optimal counting methods. Specific conclusions can only be drawn by applying these testing methodologies to one's own data set.

We also emphasize that no individualized photo-z distributions per object are available in the CAS. Such distributions can be, and have been, utilized with success in other studies. Their existence suggests that the probabilistic photo-z results reported here could be thought of as the "floor" of what the method can ultimately provide.
