Global 21 cm Signal Extraction from Foreground and Instrumental Effects. I. Pattern Recognition Framework for Separation Using Training Sets

Keith Tauscher, David Rapetti, Jack O. Burns, and Eric Switzer

Published 2018 February 5. © 2018. The American Astronomical Society. All rights reserved.
Citation: Keith Tauscher et al. 2018, ApJ, 853, 187. DOI: 10.3847/1538-4357/aaa41f


Abstract

The sky-averaged (global) highly redshifted 21 cm spectrum from neutral hydrogen is expected to appear in the VHF range of ∼20–200 MHz and its spectral shape and strength are determined by the heating properties of the first stars and black holes, by the nature and duration of reionization, and by the presence or absence of exotic physics. Measurements of the global signal would therefore provide us with a wealth of astrophysical and cosmological knowledge. However, the signal has not yet been detected because it must be seen through strong foregrounds weighted by a large beam, instrumental calibration errors, and ionospheric, ground, and radio-frequency-interference effects, which we collectively refer to as "systematics." Here, we present a signal extraction method for global signal experiments which uses Singular Value Decomposition of "training sets" to produce systematics basis functions specifically suited to each observation. Instead of requiring precise absolute knowledge of the systematics, our method effectively requires precise knowledge of how the systematics can vary. After calculating eigenmodes for the signal and systematics, we perform a weighted least-squares fit of the corresponding coefficients and select the number of modes to include by minimizing an information criterion. We compare the performance of the signal extraction when minimizing various information criteria and find that minimizing the Deviance Information Criterion most consistently yields unbiased fits. The methods used here are built into our widely applicable, publicly available Python package, pylinex, which analytically calculates constraints on signals and systematics from given data, errors, and training sets.


1. Introduction

The highly redshifted 21 cm spectrum from neutral hydrogen in the early universe has become an important observational target in the astrophysics community. This is because it fills outstanding gaps in our knowledge of the universe between the formation of the first neutral atoms $\sim 4\times {10}^{5}$ years after the Big Bang (recombination) and when the earliest stars seen to date by the Hubble Space Telescope existed, $\sim {10}^{9}$ years later. One of these knowledge gaps concerns the heating of the gas in the intergalactic medium (IGM) around the time when the first stars formed and the first black holes began accretion. This heating changes the spin temperature of the IGM gas, changing how strongly it emits the 21 cm line (Madau et al. 1997; Cohen et al. 2017; Mirocha et al. 2017a). Since the 21 cm line is a quantum transition of neutral hydrogen (Ewen & Purcell 1951), its brightness temperature is modulated by the neutral fraction of the IGM gas. The other knowledge gap addressable by the 21 cm signal is therefore the nature of the reionization of the universe's hydrogen, including both when it started and how it progressed. Furlanetto et al. (2006), Loeb & Furlanetto (2013), Morales & Wyithe (2010), and Pritchard & Loeb (2012) provide in-depth studies of 21 cm cosmology.

Experiments made to measure the highly redshifted 21 cm line come in two different varieties—power spectrum and global signal—with different science goals. The power spectrum reflects spatial anisotropies in 21 cm emission, whereas the global signal quantifies its isotropic component. For both cases, but particularly for the former, physical parameter inference is difficult because the posterior distributions are broad and the likelihood functions needed to numerically explore them often require a prohibitively long time to calculate for a Markov Chain Monte Carlo (MCMC) exploration. Recently, to avoid this challenge, Kern et al. (2017) proposed an emulation technique, which uses a Gaussian Process model to drastically speed up the evaluation of the likelihood function for the Hydrogen Epoch of Reionization Array power spectrum experiment (DeBoer et al. 2017). For a similar purpose, Schmit & Pritchard (2018) proposed a different emulation technique based on a neural network. In contrast to these approaches, which maintain a connection between curves and the physical parameters that generated them, in this paper, we explore our likelihood analytically by using linear basis functions encoding information from training sets.

Although there has not yet been a detection of the 21 cm spectrum, some recent limits have been placed on both the power spectrum (Paciga et al. 2013; Dillon et al. 2014, 2015; Parsons et al. 2014; Ali et al. 2015; Jacobs et al. 2015, see also Figure 9 in DeBoer et al. 2017 for a comprehensive review) and the global signal (Bowman & Rogers 2010; Voytek et al. 2014; Bernardi et al. 2016; Monsalve et al. 2017b; Singh et al. 2017).

Although some have suggested that interferometers may be suited to measure the global signal (Presley et al. 2015; Singh et al. 2015), most experimental efforts use single antennas. Experiments and mission concepts to measure the global 21 cm signal include the Experiment to Detect the Global EoR Signature (EDGES; Bowman & Rogers 2010; Monsalve et al. 2017a; Monsalve et al. 2017b), the Dark Ages Radio Explorer (Burns et al. 2012, 2017), the Shaped Antenna measurement of the background RAdio Spectrum (Patra et al. 2013; Singh et al. 2017), the Sonda Cosmológica de las Islas para la Detección de Hidrógeno Neutro (Voytek et al. 2014), the Zero-spacing Interferometer Measurements of the Background Radio Spectrum (Mahesh et al. 2014), the Large-aperture Experiment to detect the Dark Ages (Bernardi et al. 2015, 2016; Price et al. 2017), and the Broadband Instrument for Global HydrOgen ReioNisation Signal (Sokolowski et al. 2015).

The main impediment to all of these experiments is the vast strength of the foregrounds ($\sim {10}^{3}\mbox{--}{10}^{4}$ K) relative to the expected strength of the 21 cm signal (∼10–100 mK). Since the observations must be made at VHF radio frequencies (∼20–200 MHz), the instrument making the observations will have a very large beam that will corrupt the intrinsically smooth foreground spectrum by imprinting its own spatial/spectral characteristics onto it. Likewise, the radiometer system will leave its own features in the data (some visible even after calibration). As these antenna and receiver features are characteristic of the particular instrument, we use Gaussian beams and do not include calibration errors in the results of this paper.

In order to fit or remove the immense beam-weighted foregrounds observed by global signal experiments, a foreground model must be chosen. In most analyses, this model is a polynomial or a polynomial multiplied by a power law. However, these models can also be used to fit spectral features which look like the expected 21 cm signal, leading to large covariances between the foreground and signal models and, hence, large errors and low result significance. Some have attempted to circumvent this by proposing a model that considers only polynomials with special smoothness properties (Sathyanarayana Rao et al. 2017). This model is "loss resistant," meaning that when it is used to fit spectra, it cannot fit shapes similar to the signal. Use of this model, however, requires that all spectral features introduced into the data by the instrument that are not as smooth as the foreground be removed completely in some cleaning process. The effectiveness of this technique relies on knowledge of all relevant aspects of the instrument (e.g., antenna and receiver reflection coefficients, receiver noise temperature, etc.); if the knowledge is inaccurate, the cleaning process and the rigidity of the foreground model introduce biases. In contrast to this method, here we present an analysis which, instead of requiring precise knowledge of instrument parameters, requires precise knowledge of the modes in which the instrument parameters can vary (e.g., in frequency)—knowledge more feasibly extracted from simulations and lab measurements. This knowledge allows the errors introduced by imperfect cleaning of the data to be fit consistently.

In this first paper in the series, we propose a method in which we simulate a given experiment multiple times to create a "training set." Then, Singular Value Decomposition (SVD)—closely related to Principal Component Analysis and eigenvalue decomposition—is used to extract from this training set linear basis functions describing the systematics. After a similar process is performed using a global signal training set, the signal and systematics basis functions are combined to fit the spectra, yielding an estimate of the 21 cm global signal based solely on the input training sets and the data. This is an extremely fast process that also allows us to extract the signal in a model independent parameter space with a unimodal distribution. The second paper in this series will address how the pipeline takes advantage of initial results such as those found here by facilitating the initialization of an MCMC search to constrain physical parameters (D. Rapetti et al. 2018, in preparation; referred to as Paper II hereafter). Paper III (K. Tauscher et al. 2018, in preparation) will then detail how, with a realistic instrument, we can take advantage of instrument-induced polarization to differentiate the power of anisotropic sources like our Galaxy from the power of isotropic sources like the 21 cm global signal (see Nhan et al. 2017).

Others have proposed using SVD in the context of 21 cm cosmology. Chang et al. (2010) were the first to suggest that SVD could be used to fit out foregrounds of 21 cm experiments. Vedantham et al. (2014) and Switzer & Liu (2014) both put forth SVD as a possible method to model systematics in data acquired to measure the 21 cm signal. Unlike those works, we develop systematic modes from a training set based on instrumental degrees of freedom and published maps of Galactic emission, rather than infer modes from the data themselves. Our approach is similar to Leistedt & Peiris (2014), who identified principal contaminant modes in a quasar survey based on known systematic maps.

One important aspect of our method is the choice of where to truncate the modes of both signal and systematics. We make this choice through a grid search for the global minimum of a given Information Criterion (IC). By running the extraction algorithm with many different inputs and varying the IC that is minimized, we compare the statistical accuracy of the signal extractions resulting from each IC. The ICs we consider are Akaike's Information Criterion (AIC; Akaike 1974), the Bayesian Information Criterion (BIC; Schwarz 1978), the Bayesian Predictive Information Criterion (BPIC; Ando 2007), and the Deviance Information Criterion (DIC; Spiegelhalter et al. 2002, 2014). Astrophysics has seen studies of model selection using ICs in various contexts (see, e.g., Porciani & Norberg 2006; Liddle 2007). However, they come to differing conclusions about which is best to use, with some preferring the BIC and others the DIC. Using a statistical measure of the bias in our separation of components of the data (e.g., signal and systematics), we offer a method of choosing which IC to minimize in order to better separate those components.

This work also includes a Python package named pylinex (Linear Extraction in Python), described in Appendix A, implementing the methods described in Section 2. Since its methods are so general, pylinex, which fits for signals hidden in large systematics and Gaussian noise, is a quick and efficient statistical tool with applications outside of 21 cm cosmology and even outside of astrophysics.

2. Methodology

2.1. Framework

Consider a data vector ${\boldsymbol{y}}$, which is a combination of N different data components ${{\boldsymbol{y}}}_{1},{{\boldsymbol{y}}}_{2},\,\ldots ,\,{{\boldsymbol{y}}}_{N}$ and Gaussian noise ${\boldsymbol{n}}$,

$${\boldsymbol{y}}=\left(\sum _{k=1}^{N}{{\boldsymbol{\Psi }}}_{k}{{\boldsymbol{y}}}_{k}\right)+{\boldsymbol{n}}. \qquad (1)$$

In our case, ${{\boldsymbol{y}}}_{1}$ would be the global 21 cm signal and ${{\boldsymbol{y}}}_{k}$ for $k\gt 1$ would include all identified systematic effects (e.g., foregrounds). The matrices ${{\boldsymbol{\Psi }}}_{1},{{\boldsymbol{\Psi }}}_{2},\,\ldots ,\,{{\boldsymbol{\Psi }}}_{N}$—referred to as "expansion matrices" here since they allow the data components, ${{\boldsymbol{y}}}_{k}$, to exist in different (usually smaller) spaces than the full data vector, ${\boldsymbol{y}}$—encode the observation strategy used in obtaining the data. For example, if the data vector consists of the concatenation of two spectra which contain the same signal but different systematics, then the relevant expansion matrices are

$${{\boldsymbol{\Psi }}}_{\mathrm{signal}}=\left[\begin{array}{c}{\boldsymbol{I}}\\ {\boldsymbol{I}}\end{array}\right],\qquad {{\boldsymbol{\Psi }}}_{\mathrm{sys},1}=\left[\begin{array}{c}{\boldsymbol{I}}\\ {\boldsymbol{0}}\end{array}\right],\qquad {{\boldsymbol{\Psi }}}_{\mathrm{sys},2}=\left[\begin{array}{c}{\boldsymbol{0}}\\ {\boldsymbol{I}}\end{array}\right]. \qquad (2)$$

The expansion matrices can also be used to model situation-dependent modulation of data components5 or linear transformations of data.6 Note that if the different components of the data, ${{\boldsymbol{y}}}_{k}$, have different sizes, then the corresponding expansion matrices, ${{\boldsymbol{\Psi }}}_{k}$, will have different numbers of columns but the same number of rows.
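To make the role of the expansion matrices concrete, the following minimal numpy sketch (ours, not taken from pylinex) builds the matrices of Equation (2) for two concatenated spectra and assembles a noiseless data vector as in Equation (1); the channel count n and the component values are arbitrary placeholders.

```python
import numpy as np

n = 4  # channels per spectrum (arbitrary placeholder)
I, O = np.eye(n), np.zeros((n, n))

# Expansion matrices of Equation (2): two concatenated spectra that share
# one signal but have distinct systematics.
Psi_signal = np.vstack([I, I])  # signal appears in both spectra
Psi_sys1 = np.vstack([I, O])    # first systematic affects only spectrum 1
Psi_sys2 = np.vstack([O, I])    # second systematic affects only spectrum 2

signal = np.arange(1.0, n + 1)  # toy component vectors
sys1 = 10.0 * np.ones(n)
sys2 = 20.0 * np.ones(n)

# Noiseless version of Equation (1): y = sum_k Psi_k y_k
y = Psi_signal @ signal + Psi_sys1 @ sys1 + Psi_sys2 @ sys2
print(y)  # first half: signal + sys1; second half: signal + sys2
```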

Our task is to separate a single component of the data (say ${{\boldsymbol{y}}}_{1}$) from ${\boldsymbol{y}}$. To do so, we model ${\boldsymbol{y}}$ less noise as a sum similar to Equation (1),

$${\boldsymbol{ \mathcal M }}({\boldsymbol{x}})=\sum _{k=1}^{N}{{\boldsymbol{\Psi }}}_{k}{{\boldsymbol{F}}}_{k}{{\boldsymbol{x}}}_{k}. \qquad (3)$$

In this expression, ${{\boldsymbol{F}}}_{k}$ is a matrix with basis vectors for fitting ${{\boldsymbol{y}}}_{k}$ as its columns and ${{\boldsymbol{x}}}_{k}$ is a column vector of coefficients modulating those basis vectors. With the definitions

$${\boldsymbol{G}}=[{{\boldsymbol{\Psi }}}_{1}{{\boldsymbol{F}}}_{1}\ \ {{\boldsymbol{\Psi }}}_{2}{{\boldsymbol{F}}}_{2}\ \ \cdots \ \ {{\boldsymbol{\Psi }}}_{N}{{\boldsymbol{F}}}_{N}]\qquad \mathrm{and}\qquad {\boldsymbol{x}}={[\,{{\boldsymbol{x}}}_{1}^{T}\ \ {{\boldsymbol{x}}}_{2}^{T}\ \ \cdots \ \ {{\boldsymbol{x}}}_{N}^{T}\,]}^{T},$$

the model can be written ${\boldsymbol{ \mathcal M }}({\boldsymbol{x}})={\boldsymbol{G}}{\boldsymbol{x}}$. If the model is adequate, the data vector ${\boldsymbol{y}}$ is real,7 the true model parameters are ${\boldsymbol{x}}$, and the noise ${\boldsymbol{n}}$ has covariance ${\boldsymbol{C}}$, then, up to a constant, the probability of observing ${\boldsymbol{y}}$ is

$$ \mathcal{L} ({\boldsymbol{x}})=\exp \left\{-\frac{1}{2}{({\boldsymbol{y}}-{\boldsymbol{G}}{\boldsymbol{x}})}^{T}{{\boldsymbol{C}}}^{-1}({\boldsymbol{y}}-{\boldsymbol{G}}{\boldsymbol{x}})\right\}. \qquad (4)$$

To form the posterior parameter distribution from this likelihood, we must assume a form for the prior parameter distribution. Here, we use the least informative prior possible:8 one that is flat over the entirety of parameter space and any transformation of it.9 This means that the likelihood function is proportional to the posterior distribution, which is therefore Gaussian in ${\boldsymbol{x}}$ with covariance ${\boldsymbol{S}}$ and mean ${\boldsymbol{\xi }}$ given by

$${\boldsymbol{S}}={({{\boldsymbol{G}}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{G}})}^{-1}, \qquad (5a)$$

$${\boldsymbol{\xi }}={\boldsymbol{S}}{{\boldsymbol{G}}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{y}}. \qquad (5b)$$

The maximum likelihood reconstruction of ${{\boldsymbol{y}}}_{k}$, ${{\boldsymbol{\gamma }}}_{k}$, its channel covariance, ${{\boldsymbol{\Delta }}}_{k}$, and its averaged $1\sigma $ root-mean-square error, ${\mathrm{rms}}_{k}$, are given by

$${{\boldsymbol{\gamma }}}_{k}={{\boldsymbol{F}}}_{k}{{\boldsymbol{\xi }}}_{k}, \qquad (6a)$$

$${{\boldsymbol{\Delta }}}_{k}={{\boldsymbol{F}}}_{k}{{\boldsymbol{S}}}_{kk}{{\boldsymbol{F}}}_{k}^{T}, \qquad (6b)$$

$${\mathrm{rms}}_{k}=\sqrt{\frac{\mathrm{tr}({{\boldsymbol{\Delta }}}_{k})}{{n}_{k}}}, \qquad (6c)$$

where ${{\boldsymbol{\xi }}}_{k}$ is the portion of ${\boldsymbol{\xi }}$ containing parameters modeling ${{\boldsymbol{y}}}_{k}$, ${{\boldsymbol{S}}}_{{kk}}$ is the diagonal block of the covariance matrix corresponding to those parameters, and nk is the number of data channels in the ${{\boldsymbol{F}}}_{k}$ basis. These expressions are also valid without the k subscript, where they represent the same quantities for reconstructions of the full data. When removing the k subscript, ${{\boldsymbol{F}}}_{k}$ is replaced by ${\boldsymbol{G}}$.
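The equations above translate directly into a few lines of linear algebra. The following Python sketch (a toy illustration of ours with random bases and trivial expansion matrices, ${{\boldsymbol{\Psi }}}_{k}={\boldsymbol{I}}$, rather than the pylinex implementation) evaluates Equations (5a)–(6c) for a two-component model; all sizes and values are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
nc, p1, p2 = 100, 3, 5  # data channels; parameters for components 1 and 2

# Toy basis matrices F_k with trivial expansion matrices (Psi_k = I),
# so that G = [F_1 F_2] as in Equation (3)
F1 = rng.standard_normal((nc, p1))
F2 = rng.standard_normal((nc, p2))
G = np.hstack([F1, F2])

C = (0.1 ** 2) * np.eye(nc)  # noise covariance
Cinv = np.linalg.inv(C)

x_true = rng.standard_normal(p1 + p2)
y = G @ x_true + rng.multivariate_normal(np.zeros(nc), C)

# Equations (5a) and (5b): posterior parameter covariance and mean
S = np.linalg.inv(G.T @ Cinv @ G)
xi = S @ G.T @ Cinv @ y

# Equations (6a)-(6c) for component k = 1
xi1, S11 = xi[:p1], S[:p1, :p1]
gamma1 = F1 @ xi1                      # reconstruction of y_1
Delta1 = F1 @ S11 @ F1.T               # channel covariance of gamma_1
rms1 = np.sqrt(np.trace(Delta1) / nc)  # averaged 1-sigma rms error
print(rms1)
```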

2.2. Training Sets

The expressions above are valid for any choice of the basis vectors in $\{{{\boldsymbol{F}}}_{k}\}$.10 However, only sets of basis vectors, $\{{{\boldsymbol{F}}}_{k}\}$, which capture the variability of their respective data components, $\{{{\boldsymbol{y}}}_{k}\}$, will provide meaningful constraints. To automate the intelligent choice of such sets of basis vectors, we assume that, for all $k\in \{1,2,\,\ldots ,\,N\}$, it is possible to simulate a set of curves $\{{{\boldsymbol{b}}}_{i}^{(k)}| i\in \{1,2,\,\ldots ,\,{N}_{b}^{(k)}\}\}$, which reasonably characterizes the most important modes of variation of ${{\boldsymbol{y}}}_{k}$. These modes are then encoded in ${{\boldsymbol{F}}}_{k}$ and used in fitting the full data, ${\boldsymbol{y}}$. For this reason, the set of curves is known as a training set. The $k\mathrm{th}$ training set is described by an ${n}_{k}\times {N}_{b}^{(k)}$ matrix, ${{\boldsymbol{B}}}_{k}$, with the curves ${{\boldsymbol{b}}}^{(k)}$ as its columns. For the rest of Section 2.2, it is assumed that we are speaking about only one of the N different components of the data, so the k subscripts and superscripts are neglected.

We define the basis for the given data component as the matrix ${\boldsymbol{F}}$, which, under the constraint ${{\boldsymbol{F}}}^{T}{{\boldsymbol{K}}}^{-1}{\boldsymbol{F}}={\boldsymbol{I}}$ (where ${{\boldsymbol{K}}}^{-1}={{\boldsymbol{\Psi }}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{\Psi }}$), minimizes the Total Weighted Square Error (TWSE) of the fits to all curves in the training set, given by

$$\mathrm{TWSE}({\boldsymbol{F}})=\sum _{i=1}^{{N}_{b}}{W}_{i}\,{({\boldsymbol{\Phi }}{{\boldsymbol{b}}}_{i})}^{T}{{\boldsymbol{K}}}^{-1}({\boldsymbol{\Phi }}{{\boldsymbol{b}}}_{i}) \qquad (7a)$$

$$=\,\mathrm{tr}[{\boldsymbol{W}}{{\boldsymbol{B}}}^{T}{{\boldsymbol{\Phi }}}^{T}{{\boldsymbol{K}}}^{-1}{\boldsymbol{\Phi }}{\boldsymbol{B}}], \qquad (7b)$$

where ${\boldsymbol{W}}$ is a diagonal matrix of weights Wi,11 and ${\boldsymbol{\Phi }}={\boldsymbol{I}}-{\boldsymbol{F}}{{\boldsymbol{F}}}^{T}{{\boldsymbol{K}}}^{-1}$ projects out components of ${{\boldsymbol{b}}}_{i}$ in the subspace of ${\boldsymbol{F}}$. The desired basis is given in terms of the weighted SVD of ${\boldsymbol{B}}$, which takes the form ${\boldsymbol{B}}={\boldsymbol{U}}{\boldsymbol{\Sigma }}{{\boldsymbol{V}}}^{T}$, where ${\boldsymbol{\Sigma }}$ is diagonal with decreasing non-negative numbers on the diagonal, ${{\boldsymbol{U}}}^{T}{{\boldsymbol{K}}}^{-1}{\boldsymbol{U}}={\boldsymbol{I}}$, and ${{\boldsymbol{V}}}^{T}{\boldsymbol{W}}{\boldsymbol{V}}={\boldsymbol{I}}$ (see Appendix B for implementation details). Given a number of basis vectors to retain, η, the basis ${\boldsymbol{F}}$ defined above is given by the first η columns of ${\boldsymbol{U}}$.
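As a check of this construction, the short sketch below (ours; it assumes ${\boldsymbol{W}}={\boldsymbol{I}}$ and ${{\boldsymbol{K}}}^{-1}={\boldsymbol{I}}$, in which case the weighted SVD reduces to the ordinary SVD) builds a toy training set, truncates the basis at several values of η, and evaluates the TWSE of Equation (7); the training-set family is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(1)
nc, ntrain = 80, 500
x = np.linspace(0.0, 1.0, nc)

# Toy training set (columns are curves) with two dominant modes of variation
B = np.array([a * x + b * x ** 2 + 0.01 * rng.standard_normal(nc)
              for a, b in rng.standard_normal((ntrain, 2))]).T

Kinv = np.eye(nc)  # Psi^T C^-1 Psi, taken to be the identity here
U, sigma, VT = np.linalg.svd(B, full_matrices=False)

for eta in (1, 2, 3):
    F = U[:, :eta]                     # first eta columns of U
    Phi = np.eye(nc) - F @ F.T @ Kinv  # projects out the span of F
    resid = Phi @ B                    # residuals of all training curves
    twse = np.einsum('ij,ij->', resid, Kinv @ resid)  # Eq. (7) with W = I
    print(eta, twse)  # TWSE drops sharply once eta reaches 2
```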

SVD is a reliable way of capturing the modes of variation of a single training set, but if it is performed independently on all components of the data, it may not yield the optimal set of basis vectors for the present purpose, which is separating the different components when they are combined in the same data set and fit simultaneously. Nevertheless, in lieu of a more sophisticated technique, we perform SVD independently on each training set.

2.3. Model Selection

To select from the models formed by different truncations of the SVD basis sets, we set up a framework within which we test figures of merit, known as information criteria, based on the competition between two terms: a goodness-of-fit term that measures the bias in the fit to the data and a complexity term that penalizes the number of parameters used in that fit. We consider the following information criteria for every truncation under consideration: the DIC (Spiegelhalter et al. 2002, 2014), the BPIC (Ando 2007), and the BIC (Schwarz 1978). For our likelihood function (Equation (4)), up to constants independent of the parameters, these are given by

$$\mathrm{DIC}={{\boldsymbol{\delta }}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{\delta }}+2{N}_{p}, \qquad (8a)$$

Equation (8b)

$$\mathrm{BIC}={{\boldsymbol{\delta }}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{\delta }}+{N}_{p}\,\mathrm{ln}\,{N}_{c}. \qquad (8c)$$

${N}_{p}$ is the total number of varying modes across all N sets of basis vectors, Nc is the total number of data channels, ${\boldsymbol{\Delta }}={\boldsymbol{G}}{\boldsymbol{S}}{{\boldsymbol{G}}}^{T}$, ${\boldsymbol{\delta }}={\boldsymbol{G}}{\boldsymbol{\xi }}-{\boldsymbol{y}}$, and ${\boldsymbol{D}}={[\mathrm{diag}({\boldsymbol{\delta }})]}^{2}$. Note that while the goodness-of-fit term remains the same across the information criteria, the complexity term varies. The AIC (Akaike 1974) and a variant of the DIC in which the complexity term is based on the variance of the log-likelihood (Gelman et al. 2013, page 173) were also considered, but since our model is linear, they are both equivalent to the DIC in Equation 8(a). When truncating, i.e., selecting between our nested SVD models, we choose the model that minimizes the desired criterion. We investigate which information criterion works best in our analysis in Section 3.2.
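For concreteness, the sketch below (ours) performs this minimization for the reconstructed DIC of Equation (8a) over a grid of truncations, mimicking the (ps, pf) search of Section 3.2; the stand-in "training set" bases, amplitudes, and noise level are all arbitrary assumptions.

```python
import numpy as np

def dic(y, Cinv, G):
    """DIC of the linear Gaussian model, up to parameter-independent
    constants (the form delta^T C^-1 delta + 2 N_p used in Equation (8a))."""
    S = np.linalg.inv(G.T @ Cinv @ G)
    xi = S @ G.T @ Cinv @ y
    delta = G @ xi - y
    return delta @ Cinv @ delta + 2 * G.shape[1]

rng = np.random.default_rng(2)
nc = 200
# Stand-ins for the SVD mode matrices of the two training sets
U_signal = np.linalg.qr(rng.standard_normal((nc, 60)))[0]
U_fg = np.linalg.qr(rng.standard_normal((nc, 30)))[0]
Cinv = np.eye(nc)

# Simulated data: 9 signal modes, 10 (much stronger) foreground modes, noise
y = 5.0 * (U_signal[:, :9] @ rng.standard_normal(9)) \
    + 100.0 * (U_fg[:, :10] @ rng.standard_normal(10)) \
    + rng.standard_normal(nc)

# Grid search over truncations, as in Section 3.2
grid = {(ps, pf): dic(y, Cinv, np.hstack([U_signal[:, :ps], U_fg[:, :pf]]))
        for ps in range(1, 61) for pf in range(1, 31)}
print(min(grid, key=grid.get))  # the (p_s, p_f) pair minimizing the DIC
```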

3. 21 cm Global Signal Application

3.1. Simulated Data and Training Sets

To illustrate our methods, we propose a simple, simulated experiment to measure the global 21 cm signal using dual-polarization antennas that yield data for all four Stokes parameters at frequencies between 40 and 120 MHz. For simplicity, we ignore all systematic effects other than beam-weighted foreground emission; the neglected effects include human-generated Radio Frequency Interference (RFI); refraction, absorption, and emission due to Earth's ionosphere; and receiver gain and noise temperature variations. The experiment proposed here is similar to a pair of antennas orbiting above the lunar farside, where the ionospheric effects and RFI need not be addressed (Burns et al. 2017). An instrument training set corresponding to a realistic antenna and receiver will be included in the analyses of Papers II and III.

The data product of our simulated experiment is a set of 96 brightness temperature spectra. The spectra correspond to four Stokes parameters and six different rotation angles for four different antenna pointing directions. The data vector, ${\boldsymbol{y}}$, consists of the concatenation of all of the spectra. The noise level of the data, σ, is roughly constant across the different Stokes parameters and is related to the total power (Stokes I) brightness temperature, Tb, through the radiometer equation, $\sigma (\nu )={T}_{b}(\nu )/\sqrt{{\rm{\Delta }}\nu \ {\rm{\Delta }}t}$, with a frequency channel width ${\rm{\Delta }}\nu $ of 1 MHz and an integration time ${\rm{\Delta }}t$ of 1000 hr (split between the different antenna pointing directions and rotation angles about those directions). The data are split into N = 5 different components—one for the 21 cm signal and one for the beam-weighted foregrounds (which are correlated across boresight angles and frequency) of each pointing. The signal is the same across all four pointings, while the foregrounds for each pointing only affect the data from that pointing. The expansion matrices encode this fact.
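The noise model is just the radiometer equation evaluated per channel. The snippet below (ours) computes σ(ν) for an illustrative power-law Stokes I spectrum; the sky amplitude and spectral index are assumptions, and the split of integration time between pointings and rotation angles is ignored.

```python
import numpy as np

nu = np.arange(40.0, 121.0)        # frequency channels [MHz]
T_b = 1.0e4 * (nu / 40.0) ** -2.5  # illustrative Stokes I sky temperature [K]

delta_nu = 1.0e6                   # channel width: 1 MHz [Hz]
delta_t = 1000.0 * 3600.0          # total integration: 1000 hr [s]

# Radiometer equation: sigma(nu) = T_b(nu) / sqrt(delta_nu * delta_t)
sigma = T_b / np.sqrt(delta_nu * delta_t)
print(sigma[0] * 1e3, 'mK at 40 MHz')  # a few mK at the band edge
```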

To create the beam-weighted foreground data for each of the four antenna pointing directions, we use the simulation framework of Nhan et al. (2017), henceforth referred to as N17. Each Stokes parameter, ζ, observed by the instrument is given by an expression of the form $\int {B}_{I\to \zeta }\ {T}_{\mathrm{gal}}\ d{\rm{\Omega }}$, where ${T}_{\mathrm{gal}}$ is the galaxy brightness temperature and $d{\rm{\Omega }}$ is the differential solid angle. As in N17, the four relevant beams at frequency ν, polar angle θ, and azimuthal angle ϕ are of the form

Equation (9)

where $b(\nu ,\theta )\propto \exp [-{\theta }^{2}/2{\alpha }^{2}(\nu )]$ and $\alpha (\nu )$ is the angular extent of the beam as a function of frequency. The beam's polarization response converts intensity anisotropy into apparent polarization, while the monopole (cosmological signal) averages to zero in the instrumental Stokes Q and U channels. Measurement in the polarization channels therefore provides discrimination of foreground modes from the signal. Furthermore, rotation about the boresight is used to modulate the polarized components. N17 used this method assuming a spectrally invariant beam and a sky following a single power law in frequency. Here, with the aid of training sets, we extend the method to allow for spectrally varying beams and an arbitrary sky model. The Galaxy map used in this paper is a spatially dependent power-law interpolation between the maps provided by Haslam et al. (1982) and Guzmán et al. (2011). The beam-weighted foreground training sets are created using 125 Gaussian beams described by Equation (9) with varying quadratic models of $\alpha (\nu )$. Figure 1 shows the training set for one of the four antenna pointing directions and some of the corresponding basis functions. Although V = 0 in this work because ${B}_{I\to V}=0$, in real 21 cm global signal experiments with polarimetry, V can contain useful information about instrument variations, since we expect no circularly polarized light to be incident on the antenna.
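The beam portion of such a training set is straightforward to generate. The sketch below (ours) produces 125 Gaussian beams b(ν, θ) with quadratic α(ν) models; the 5 × 5 × 5 coefficient grid, the coefficient ranges, and the frequency scaling are all assumptions, since the text specifies only "125 Gaussian beams ... with varying quadratic models of α(ν)."

```python
import numpy as np
from itertools import product

nu = np.linspace(40.0, 120.0, 81)               # frequencies [MHz]
theta = np.radians(np.linspace(0.0, 90.0, 91))  # polar angles [rad]

def alpha(nu, c0, c1, c2):
    """Quadratic model of the beam angular extent alpha(nu), in radians;
    the centering and scaling of nu are assumptions."""
    x = (nu - 80.0) / 40.0
    return np.radians(c0 + c1 * x + c2 * x ** 2)

# 125 beams from an assumed 5 x 5 x 5 grid of quadratic coefficients [deg]
c0s = np.linspace(30.0, 50.0, 5)
c1s = np.linspace(-5.0, 5.0, 5)
c2s = np.linspace(-2.0, 2.0, 5)

beams = []
for c0, c1, c2 in product(c0s, c1s, c2s):
    a = alpha(nu, c0, c1, c2)[:, None]                 # shape (n_nu, 1)
    beams.append(np.exp(-theta[None, :] ** 2 / (2.0 * a ** 2)))
print(len(beams), beams[0].shape)  # 125 beam maps of shape (81, 91)
```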


Figure 1. Beam-weighted foreground training set for a single rotation angle about one of the four antenna pointing directions (top), the same training set with its mean subtracted (middle), and the first six SVD basis functions obtained from the training set (bottom). In the top panel, the y-axis scale is linear between −1 and +1 and logarithmic below −1 and above +1. In addition to the Stokes parameters shown in the three columns (I, Q, U), each curve in the training set contains data for each of the rotation angles $\phi =n\pi /3$ for $n\in \{0,1,\,\ldots ,\,5\}$. The different rotation angles about the antenna pointing direction are part of the same training set so that SVD can pick up on ϕ-dependent structure and imprint it onto the basis functions. The modes are ordered from most to least important: blue, orange, green, red, purple, and brown. As described in Section 2.2, the modes are normalized so that they yield 1 when divided by the noise level, squared, and summed over frequency, Stokes parameter, and rotation angle about the antenna pointing.


The 21 cm global signal training set and a few of the SVD basis functions it provides are shown in Figure 2. The training set was created by varying the parameters of the Accelerated Reionization Era Simulations (ares) code.12 See Mirocha et al. (2015, 2017a, 2017b) for information on the signal models used by ares. In this paper, we ignore the parameter values that generate the signals in the training set because we focus only on measuring the brightness temperature profile of the 21 cm signal, not on the astrophysical or cosmological parameters behind it. Paper II will address inference on these parameters.


Figure 2. Signal training set used for our analysis was generated by running the ares code $7\times {10}^{5}$ times within reasonable parameter bounds in order to fill the frequency band. The top panel shows a thinned sample of that set (black curves). The red signals are taken instead from the tanh signal model of Harker et al. (2016). Such signals will be used in Figures 5–7. For illustration purposes, the first 6 SVD basis functions obtained from the ares training set are shown in the bottom panel. The modes are ordered from most to least important: blue, orange, green, red, purple, brown. As described in Section 2.2, the modes are normalized so that they yield 1 when divided by the noise level, squared, and summed over frequency, antenna pointing, and rotation angles about the antenna pointing.


3.2. Optimized Truncation

The quality of the weighted least squares fit to the data given by the equations in Section 2 depends on the number of parameters describing the signal, ${p}_{{\rm{s}}}$, and the number of parameters describing the foregrounds of each antenna pointing direction, ${p}_{{\rm{f}}}$. For each $1\leqslant {p}_{{\rm{s}}}\leqslant 60$ and $1\leqslant {p}_{{\rm{f}}}\leqslant 30$,13 we evaluated all of the information criteria in Section 2.3. A section of the 60 × 30 grid of DIC values for a typical fit where both signal and foregrounds are taken from their training sets is shown in Figure 3. The model with the lowest DIC is the one with ${p}_{{\rm{s}}}=9$ and ${p}_{{\rm{f}}}=10$. See Section 3.4 for a comparison between the qualities of the fit signals generated by minimizing each of the information criteria.


Figure 3. Grid of values of the DIC (see Equation 8(a)) for 144 different fits to simulated data with modes from Figures 1 and 2. The colors indicate the difference between the DIC and its minimal value. In this case, that minimal value (marked by the white square) occurs when there are 10 foreground and 9 signal modes included in the fit. This same process can be done with any of the information criteria given in Section 2.3. Although only a 12 × 12 grid is shown here, all of the information criteria were calculated over a 60 × 30 grid.


3.3. Parameter Covariance

The parameter covariance matrix, ${\boldsymbol{S}}$ (Equation 5(a)), corresponding to the DIC-chosen model from Figure 3 is shown in Figure 4. To aid in intuition about ${\boldsymbol{S}}$, the set of all basis vectors and their overlaps (with respect to the noise covariance, ${\boldsymbol{C}}$) can be viewed as a complete graph—a graph with an edge connecting every pair of vertices—whose vertices belong to N distinct sets. Each vertex is characterized by a pair of numbers $\{i,\alpha \}$, where i indexes the set of basis vectors to which the vertex belongs and α the specific basis vector within that set. Each edge has a weight, ${w}_{\{i,\alpha \}\to \{j,\beta \}}$, given by the overlap between the modes under consideration,

Equation (10a)

Equation (10b)

where ${{\boldsymbol{f}}}_{\{i,\alpha \}}$ is the $\alpha \mathrm{th}$ column of ${{\boldsymbol{F}}}_{i}$. For the purpose needed here, we define the weight of a path traversing many edges as the product of the weights of its edges, i.e., ${w}_{\{i,\alpha \}\to \{j,\beta \}\to \{k,\gamma \}}\,={w}_{\{i,\alpha \}\to \{j,\beta \}}\cdot {w}_{\{j,\beta \}\to \{k,\gamma \}}$. Assuming the normalization conditions ${{{\boldsymbol{F}}}_{k}}^{T}{{{\boldsymbol{\Psi }}}_{k}}^{T}{{\boldsymbol{C}}}^{-1}{{\boldsymbol{\Psi }}}_{k}{{\boldsymbol{F}}}_{k}={\boldsymbol{I}}$, the value of the covariance of the coefficients modulating two basis vectors is the sum of the weights of all paths connecting the corresponding vertices of the graph.
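This path-sum statement can be verified numerically. Under our reading of it (with each edge weight equal to minus the overlap of the two normalized modes), the covariance ${\boldsymbol{S}}={({\boldsymbol{I}}+{\boldsymbol{O}})}^{-1}$ equals the Neumann series $\sum _{m}{(-{\boldsymbol{O}})}^{m}$, whose entries accumulate the signed products of overlaps along paths of every length; the sketch below (ours, with ${\boldsymbol{C}}={\boldsymbol{I}}$ and random modes) checks this.

```python
import numpy as np

rng = np.random.default_rng(3)
nc, p = 50, 6

# Modes normalized so that G^T C^-1 G = I + O with C = I; O holds the
# pairwise overlaps between distinct modes
G = rng.standard_normal((nc, p))
G /= np.linalg.norm(G, axis=0)
O = G.T @ G - np.eye(p)

S_exact = np.linalg.inv(np.eye(p) + O)  # Equation (5a) with C = I

# Path-sum picture: (I + O)^-1 = sum_m (-O)^m; the (a, b) entry of O^m
# sums products of overlaps along all length-m paths from a to b
S_series = np.zeros((p, p))
term = np.eye(p)
for _ in range(200):  # converges here because ||O|| < 1
    S_series += term
    term = term @ (-O)
print(np.max(np.abs(S_exact - S_series)))  # agreement to machine precision
```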


Figure 4. Matrix of covariances between the coefficients of all basis functions for the model of the data chosen in the procedure shown in Figure 3 (which has 9 signal modes and 10 foreground modes for each pointing area). The color scale is linear between −10 and +10 and logarithmic below −10 and above +10. Once the numbers of signal and foreground parameters are chosen, these covariances are given by Equation 5(a). The black horizontal and vertical lines indicate distinctions between different sets of basis functions. As illustrated by Equation 6(b), the error on the final signal estimate depends only on the signal–signal covariance, ${{\boldsymbol{S}}}_{21\mathrm{cm}}$ (top left block in this figure).


From the above, it is clear that the parameter covariances and, hence, errors in the fit are influenced both by the modes included in the basis matrices, $\{{{\boldsymbol{F}}}_{k}\}$, and by the expansion matrices, $\{{{\boldsymbol{\Psi }}}_{k}\}$. The former are controlled by the training sets and the latter are determined by the experimental design. Therefore, errors can be lowered either by modifying the training set to refine the modes used to fit both signal and foreground spectra or by designing an experiment that intrinsically separates the signal and foreground spectra (e.g., through polarization modulation, as employed here).

3.4. Results

3.4.1. Statistical Measures

To evaluate the behavior of the extraction algorithm, we calculate a few statistical measures for an ensemble of input data sets. In order to assess how accurately the signal is extracted from the data, we first define the signal bias statistic, ε, as

$$\varepsilon =\sqrt{\frac{1}{{n}_{\nu }}\sum _{i=1}^{{n}_{\nu }}\frac{{({{\boldsymbol{\gamma }}}_{21\mathrm{cm}}-{{\boldsymbol{y}}}_{21\mathrm{cm}})}_{i}^{2}}{{({{\boldsymbol{\Delta }}}_{21\mathrm{cm}})}_{ii}}}, \qquad (11)$$

where nν is the number of frequencies at which the signal is measured, ${{\boldsymbol{y}}}_{21\mathrm{cm}}$ is the "true" (input) signal, and ${{\boldsymbol{\gamma }}}_{21\mathrm{cm}}$ and ${{\boldsymbol{\Delta }}}_{21\mathrm{cm}}$ are given by Equations 6(a) and (b). In an averaged sense, a signal bias statistic of ε corresponds to a signal bias at the $\varepsilon \sigma $ level. For this reason, if the errors are to be meaningful, the signal bias statistic should be of the order of unity.

The signal bias statistic is a measure of the accuracy and precision of the signal estimate but it does not gauge the overall quality of the fit to the data. In order to do so, we define the normalized deviance, D, through

$$D=\frac{{{\boldsymbol{\delta }}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{\delta }}}{{N}_{c}-{N}_{p}}, \qquad (12)$$

where Np is the number of parameters in the fit, Nc is the number of data channels, and ${\boldsymbol{\delta }}={\boldsymbol{G}}{\boldsymbol{\xi }}-{\boldsymbol{y}}$ is the bias in the fit to the data. D is a measure of how well the full set of basis functions (signal and foreground) fits the data. Its true distribution is difficult to ascertain because the number of parameters chosen varies with the inputs; but, since the likelihood is Gaussian and we expect to be able to adequately fit the non-noise component of the data, if Np parameters are chosen for a given extraction, then the corresponding value of $({N}_{c}-{N}_{p})D$ should follow a ${\chi }^{2}$ distribution with ${N}_{c}-{N}_{p}$ degrees of freedom. Since, unlike ε, D can be calculated even for data whose split between signal and foreground is unknown, it is useful to check its value against confidence intervals generated from the expected distribution to determine if any anomalies were encountered in the extraction process or if the training sets used were inadequate.
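Both statistics are simple to evaluate given the fit outputs. The following Python sketch (ours) implements the reconstructed forms of Equations (11) and (12) and demonstrates, on a toy unbiased estimate with correct errors, that ε comes out near 1; all array sizes are placeholders.

```python
import numpy as np

def signal_bias_statistic(y_signal, gamma, Delta):
    """Signal bias statistic epsilon (Equation (11)): rms of the channel-wise
    bias in units of the 1-sigma channel errors sqrt(diag(Delta))."""
    bias = gamma - y_signal
    return np.sqrt(np.mean(bias ** 2 / np.diag(Delta)))

def normalized_deviance(y, G, xi, Cinv):
    """Normalized deviance D (Equation (12))."""
    delta = G @ xi - y
    Np, Nc = G.shape[1], y.size
    return (delta @ Cinv @ delta) / (Nc - Np)

# Toy check: an unbiased signal estimate with correct errors gives eps ~ 1
rng = np.random.default_rng(4)
truth = np.zeros(100)
Delta = 0.01 * np.eye(100)
estimate = rng.multivariate_normal(truth, Delta)
print(signal_bias_statistic(truth, estimate, Delta))
```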

3.4.2. Comparing Different Inputs and Information Criteria

We calculated ε and D for 5000 input data sets, each with their own signal, foregrounds, and noise realization, for two different scenarios. In both cases, the input foregrounds are taken from the training set. In the first case (solid lines in Figures 5 and 6), the input signals were taken from the ares training set used to define the modes (shown in Figure 2). In the second case (dashed lines in Figures 5 and 6), the input signals come from a separate set that parameterizes the ionization history and HI spin temperature as hyperbolic tangent (tanh) functions of redshift (see Harker et al. 2016, and the red curves in Figure 2).


Figure 5. Estimate of the Cumulative Distribution Function (CDF) of the signal bias statistic, ε (see Equation (11) and the text for a definition), from 5000 input simulated data sets. The value of the CDF at $\varepsilon ={\varepsilon }_{0}$ is equal to the fraction of the 5000 cases where $\varepsilon \leqslant {\varepsilon }_{0}$. A bias statistic of ε roughly corresponds to a bias at the $\varepsilon \sigma $ level. The solid black reference line indicates the CDF associated with a χ random variable with 1 degree of freedom. This is the distribution that associates 1σ with 68% confidence and 2σ with 95%. To guide the eye, the dotted black line indicates the 95% level. The realizations that generated the solid lines came from input signals and beam-weighted foreground curves which were both taken from their respective training sets. The dashed lines show the effect of using instead input signals generated using a model (tanh) different from that used for the training set (ares). The color of each curve indicates the statistic that was minimized in order to choose the numbers of parameters of each type (see Sections 2.3 and 3.2).


The cumulative distributions of ε for the two cases described above are shown in Figure 5. The different colors represent the different IC from Section 2.3, which are minimized in order to choose the number of SVD modes of each type to retain. As expected, when the input signals are not taken from the training set that was used to generate the SVD modes (dashed lines), on average, the fits tend to degrade (compare the dashed with the corresponding solid lines). Whether the input signal is in the training set or not, for a given value ${\varepsilon }_{0}$ of the statistic, we find that the DIC recovers the global 21 cm signal with $\varepsilon \lt {\varepsilon }_{0}$ more often than any other IC. Thus, we adopt the DIC as our model selection criterion.

Figure 6 shows that whether the input signals are taken from the training set or are generated with the tanh model, D follows the expected distribution. Therefore, we find that, combined, the selected SVD signal and foreground modes are adequate to fit the input data down to the noise level.


Figure 6. Histogram showing the Probability Distribution Function (PDF) of 5000 values of the normalized deviance, D (see Equation (12)), for fits with different input signals, beam-weighted foregrounds, and noise when the DIC is used to choose the best model. The solid red line represents the case where the input signal is taken from the signal training set (see Figure 2) whereas the dashed red line represents the case where the input signal is generated from a tanh model. For both histograms, the input beam-weighted foregrounds are taken from their training sets. As described in the text, D should follow a distribution approximated by the extremely thin black region, which is a combination of (scaled) ${\chi }^{2}$ distributions associated with the range of degrees of freedom chosen for the extractions. The plot is qualitatively unchanged if we use our other IC for the model selection procedure.


3.4.3. Signal Estimates

The channel mean and covariance of the estimated global 21 cm signal (in frequency channel space), ${{\boldsymbol{\gamma }}}_{21\mathrm{cm}}$ and ${{\boldsymbol{\Delta }}}_{21\mathrm{cm}}$, are given by Equations 6(a) and (b), respectively. Figure 7 shows a representative selection of extractions of simulated signals by our analysis. In each panel of the figure, the widths of the bands at frequency ν are given by 1.7 and 3.2 times the square root of the corresponding diagonal element of ${{\boldsymbol{\Delta }}}_{21\mathrm{cm}}$, making them 1.7σ and 3.2σ confidence intervals. Figure 5 shows that, when the input signal is generated by the tanh model as opposed to being taken from the training set, signals are fit with biases of less than 1.7σ (3.2σ) about 68% (95%) of the time. Therefore, the light bands in Figure 7 show 95% confidence intervals in the sense that there is a 95% probability that the squared ratio of the bias (difference between red and black lines) to the error (light red band) averaged across the spectrum is less than 1. Note that this does not mean that there is a 95% probability that the value of the 21 cm global signal at a given frequency is within the error band. The root-mean-square widths of the 95% bands of 95% of the ares (tanh) input signals are less than 21 (28) mK.


Figure 7. Signal estimates obtained using the linear model defined by SVD eigenmodes calculated from the training sets. The black curves show the input signals, the red curves show the signal estimates, and the dark (light) red bands represent the posterior 1.7σ (3.2σ)—or, equivalently, 68% (95%)—confidence intervals (see Figure 5 and Section 3.4.3 for details). For all plots, the input beam-weighted foregrounds came from their training sets. The input signals for the four plots on the left came from the signal training set (Figure 2), whereas the input signals for the four plots on the right were generated by the tanh model (see Section 3.4.2). Note that by eye there is hardly a difference between these cases, but, on average, we find that there is, as shown in Figure 5.


4. Conclusions

In this paper, we have proposed a method of analysis that transforms the task of extracting a signal from data into one of composing training sets that accurately model the relevant modes of variation in all components of the data. Applying this method to a simple, simulated 21 cm global signal experiment using dual-polarized antennas, we extracted a wide variety of input signals accurately with a 95% confidence rms error of $\lesssim 30$ mK. We also showed that, for our purpose of extracting components of data using well separated training sets, minimizing the DIC yields less biased results than minimizing the BIC or BPIC.

For 21 cm experiments analyzed with our method, there are three strategies for decreasing the errors: changing the expansion matrices through modifications of the experimental design (e.g., implementing the polarization technique of N17), changing the basis vectors by using more separable training sets (if possible), and lowering the noise level in the data by adding integration time. The first two of these strategies decrease parameter covariances, while the last changes the normalization of the modes, leading to lower errors with the same parameter covariances (see Section 3.3).

As presented in this paper, our method assumes that the posterior distribution on the coefficients of the basis functions is Gaussian. This imposes two requirements: (1) the likelihood function and, hence, all noise must be Gaussian and (2) the different components of the data must be combined by addition. If either of these conditions is violated (e.g., if an expansion matrix, ${\boldsymbol{\Psi }}$, is parameter dependent), then the posterior parameter distribution is not Gaussian and therefore must be explored with more sophisticated methods such as MCMC sampling. Such an exploration, however, could still be relatively fast because, although they may be combined in nonlinear ways, the models of the individual components are still linear. Under this variation of the method, model selection as in Section 2.3 can still be performed but different expressions for the information criteria must be derived. This more general extraction will be implemented with an MCMC sampler in future versions of pylinex. In addition to generalizing our methods, in future analyses, we will include multiple Galaxy maps in the foreground training set and we will use calibration errors generated by a realistic instrument to generate an instrument training set.

As will be described in Paper II, we have also created a pipeline built around the methods described here to perform inference on physical parameters of the 21 cm signal. After achieving a signal fit like those shown in Figure 7, we use it to initialize the walkers of an MCMC sampler, tightly packed around or near the likelihood peak. In this MCMC, the 21 cm signal model using SVD basis functions is replaced with one that uses physical parameters. Since there are essentially no constraints on the majority of signal parameters and most realizations of the 21 cm global signal require $\gtrsim 1\,{\rm{s}}$ to compute, such a first guess is important to avoid local minima and is very advantageous for the convergence of an MCMC chain.

Therefore, in this paper, we have presented methods to bypass the initial phase of potential traps and the time wasted when an MCMC algorithm wanders through parameter space looking for features in the likelihood by skipping to the stage where the MCMC explores systematically around the likelihood peak. In a simplified yet general setting, we have shown that, with well-characterized training sets and a well-designed experiment, a wide variety of realistic 21 cm global signals can be extracted from simulated sky-averaged observations in the VHF range.

We thank Raul Monsalve, Jordan Mirocha, Richard Bradley, Bang Nhan, Licia Verde, and Nicholas Kern for useful discussions. This work was directly supported by the NASA Solar System Exploration Research Virtual Institute cooperative agreement number 80ARC017M0006. This work was also supported by grants from NASA's Ames Research Center (NNA09DB30A, NNX15AD20A, NNX16AF59G) and by a NASA ATP grant (NNX15AK80G). D.R. is supported by a NASA Postdoctoral Program Senior Fellowship at NASA's Ames Research Center, administered by the Universities Space Research Association under contract with NASA. Resources supporting this work were provided by the NASA High-End Computing (HEC) Program through the NASA Advanced Supercomputing (NAS) Division at Ames Research Center through the award SMD-16-7501.

Appendix A: Software Release

pylinex14 is a Python package implementing the methods presented in Section 2. It is split up into five modules, the most important of which are the expander, basis, and fitter.

The expander module contains classes that efficiently implement the expansion matrices, ${\boldsymbol{\Psi }}$. This is particularly useful when there is a large amount of data involved in the calculations as storing all elements of ${\boldsymbol{\Psi }}$ could be unnecessarily memory-intensive, especially when ${\boldsymbol{\Psi }}$ is sparse.

The basis module stores classes representing many different types of basis functions, including simple polynomials, Legendre polynomials, Fourier series, and, most importantly for the current work, SVD basis sets.

The fitter module contains the main products of the software: objects that perform fits using the basis sets and expansion matrices from the other modules. The most useful and comprehensive class is the Extractor class, which takes as inputs training sets, data with errors, and expansion matrices and yields posterior channel means and errors for each data component as outputs.

pylinex is compatible with Python 2.7+ as well as Python 3.5+ and is implemented efficiently through the use of the numpy, scipy, and matplotlib packages.15 Future versions of pylinex will implement extraction when the model is nonlinear and/or the likelihood function is non-Gaussian, as described in Section 4.

Appendix B: Weighted SVD Implementation

This section describes our implementation of weighted SVD using only basic SVD algorithms. By weighted SVD, we refer to the decomposition of a real matrix, ${\boldsymbol{B}}$, of the form ${\boldsymbol{B}}={\boldsymbol{U}}{\boldsymbol{\Sigma }}{{\boldsymbol{V}}}^{T}$, where ${\boldsymbol{\Sigma }}$ is diagonal and of the same shape as ${\boldsymbol{B}}$, ${{\boldsymbol{V}}}^{T}{\boldsymbol{W}}{\boldsymbol{V}}={\boldsymbol{I}}$, ${{\boldsymbol{U}}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{U}}={\boldsymbol{I}}$, and ${\boldsymbol{C}}$ and ${\boldsymbol{W}}$ are symmetric positive definite matrices. If ${\boldsymbol{B}}$ is a matrix with training set curves as columns, then ${\boldsymbol{C}}$ represents the noise covariance used in fits and ${\boldsymbol{W}}$ is a matrix that weights the different training set curves. To begin, with standard SVD, we decompose the matrix ${{\boldsymbol{C}}}^{-1/2}{\boldsymbol{B}}{{\boldsymbol{W}}}^{1/2}$ into ${\boldsymbol{P}}{\boldsymbol{\Gamma }}{{\boldsymbol{Q}}}^{T}$, where ${\boldsymbol{\Gamma }}$ is diagonal, ${{\boldsymbol{P}}}^{T}{\boldsymbol{P}}={\boldsymbol{I}}$, and ${{\boldsymbol{Q}}}^{T}{\boldsymbol{Q}}={\boldsymbol{I}}$. Solving for ${\boldsymbol{B}}$, we find the weighted SVD of ${\boldsymbol{B}}$,

$${\boldsymbol{B}}=({{\boldsymbol{C}}}^{1/2}{\boldsymbol{P}}){\boldsymbol{\Gamma }}{({{\boldsymbol{W}}}^{-1/2}{\boldsymbol{Q}})}^{T}. \qquad (13)$$

Therefore, the matrices in the weighted decomposition are ${\boldsymbol{U}}={{\boldsymbol{C}}}^{1/2}{\boldsymbol{P}}$, ${\boldsymbol{\Sigma }}={\boldsymbol{\Gamma }}$, and ${\boldsymbol{V}}={{\boldsymbol{W}}}^{-1/2}{\boldsymbol{Q}}$. In most cases, ${\boldsymbol{C}}$ and ${\boldsymbol{W}}$ are diagonal and the matrix square root is equivalent to the element-wise square root.
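For diagonal ${\boldsymbol{C}}$ and ${\boldsymbol{W}}$, this algorithm reduces to a few array operations. The sketch below (ours, not the pylinex implementation) wraps numpy's standard SVD accordingly and verifies the defining properties on a random toy matrix.

```python
import numpy as np

def weighted_svd(B, Cinv_diag, W_diag):
    """Weighted SVD B = U Sigma V^T with U^T C^-1 U = I and V^T W V = I,
    for diagonal C and W, built on the ordinary SVD (Appendix B)."""
    c_half = 1.0 / np.sqrt(Cinv_diag)  # diagonal of C^{1/2}
    w_half = np.sqrt(W_diag)           # diagonal of W^{1/2}
    # Ordinary SVD of C^{-1/2} B W^{1/2} = P Gamma Q^T
    P, Gamma, QT = np.linalg.svd(B / c_half[:, None] * w_half[None, :],
                                 full_matrices=False)
    U = c_half[:, None] * P            # U = C^{1/2} P
    V = QT.T / w_half[:, None]         # V = W^{-1/2} Q
    return U, Gamma, V

# Toy check of Equation (13) and the normalization conditions
rng = np.random.default_rng(5)
B = rng.standard_normal((40, 25))
Cinv = np.full(40, 4.0)   # C = 0.25 I
W = np.ones(25)
U, Gamma, V = weighted_svd(B, Cinv, W)
print(np.allclose(U @ np.diag(Gamma) @ V.T, B))            # B = U Sigma V^T
print(np.allclose(U.T @ (Cinv[:, None] * U), np.eye(25)))  # U^T C^-1 U = I
```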

Empirically, we find that the weighted SVD algorithm described above scales as ${ \mathcal O }(\min ({{N}_{t}}^{2}{N}_{c},{N}_{t}{{N}_{c}}^{2}))$, where Nt is the number of training set curves used and Nc is the number of data channels in each training set curve.

Appendix C: Priors from Training Sets

Under certain circumstances, it could be beneficial to include prior information on one or more of the data components derived from their training sets. This technique relies on a few key assumptions that limit its applicability. First, the training set must encompass the true multivariate distribution of the observed curves. Second, the prior must be Gaussian for the posterior to be Gaussian, which it must be for simple, analytical expressions to be written for the signal estimate and error. If either of these assumptions is invalid, priors will bias the results.

To generate priors from the training sets, each curve in the training set can be fitted with the SVD modes defined by that training set. Then, the sample mean, μ, and covariance, Λ, of the points defined by the coefficients of each fit can be found and used to define a multivariate Gaussian prior distribution. Defined in this way, μ and Λ are given by

$${\boldsymbol{\mu }}=\frac{1}{{N}_{b}}{\boldsymbol{\Pi }}{{\boldsymbol{Z}}}^{T}{\boldsymbol{J}},\qquad {\boldsymbol{\Lambda }}=\frac{1}{{N}_{b}}({\boldsymbol{\Pi }}{{\boldsymbol{Z}}}^{T}-{\boldsymbol{\mu }}{{\boldsymbol{J}}}^{T}){({\boldsymbol{\Pi }}{{\boldsymbol{Z}}}^{T}-{\boldsymbol{\mu }}{{\boldsymbol{J}}}^{T})}^{T}, \qquad (14)$$

where ${\boldsymbol{J}}$ is a column vector composed of Nb ones. If η basis vectors are used (as in Section 2.1), the matrix ${\boldsymbol{\Pi }}$ is the top left $\eta \times \eta $ block of ${\boldsymbol{\Sigma }}$ and the matrix ${\boldsymbol{Z}}$ is the first η columns of ${\boldsymbol{V}}$ (where ${\boldsymbol{\Sigma }}$ and ${\boldsymbol{V}}$ are defined as in Equation (13)). Including this prior in the fit, the posterior parameter covariance ${\boldsymbol{S}}$ and parameter mean ${\boldsymbol{\xi }}$ given by Equations 5(a) and (b) are modified to

$${\boldsymbol{S}}={({{\boldsymbol{G}}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{G}}+{{\boldsymbol{\Lambda }}}^{-1})}^{-1},\qquad {\boldsymbol{\xi }}={\boldsymbol{S}}({{\boldsymbol{G}}}^{T}{{\boldsymbol{C}}}^{-1}{\boldsymbol{y}}+{{\boldsymbol{\Lambda }}}^{-1}{\boldsymbol{\mu }}). \qquad (15)$$
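A compact numerical sketch of this appendix (ours, assuming ${\boldsymbol{C}}={\boldsymbol{W}}={\boldsymbol{I}}$ up to a scalar noise level, a toy training set, and the reconstructed forms of Equations (14) and (15)) is given below; with these assumptions the coefficient matrix ${\boldsymbol{\Pi }}{{\boldsymbol{Z}}}^{T}$ is simply ${{\boldsymbol{F}}}^{T}{\boldsymbol{B}}$.

```python
import numpy as np

rng = np.random.default_rng(6)
nc, Nb, eta = 60, 400, 5
t = np.linspace(0.0, 3.0, nc)

# Toy training set built from five smooth modes (columns of B are curves)
modes = np.array([np.cos(k * t) for k in range(eta)]).T
B = modes @ rng.standard_normal((eta, Nb))

U, sigma, VT = np.linalg.svd(B, full_matrices=False)
F = U[:, :eta]   # truncated SVD basis (C = W = I)
X = F.T @ B      # fit coefficients of every training curve: Pi Z^T

# Equation (14): sample mean and covariance of the coefficients
J = np.ones(Nb)
mu = X @ J / Nb
R = X - np.outer(mu, J)
Lam = R @ R.T / Nb

# Equation (15): posterior including the Gaussian prior, with G = F
sig_n = 0.01  # noise level, so C = sig_n^2 I
y = B[:, 0] + sig_n * rng.standard_normal(nc)
Lam_inv = np.linalg.inv(Lam)
S = np.linalg.inv(F.T @ F / sig_n ** 2 + Lam_inv)
xi = S @ (F.T @ y / sig_n ** 2 + Lam_inv @ mu)
print(np.allclose(F @ xi, B[:, 0], atol=0.05))  # recovers the input curve
```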

Footnotes

5. For example, if the signal is expected to be $1-\kappa $ times as strong in the second spectrum as in the first (e.g., if a fraction κ of the source of the signal is blocked in the second spectrum), then the signal expansion matrix is modified to
$${{\boldsymbol{\Psi }}}_{\mathrm{signal}}=\left[\begin{array}{c}{\boldsymbol{I}}\\ (1-\kappa ){\boldsymbol{I}}\end{array}\right].$$

6. For instance, setting ${{\boldsymbol{\Psi }}}_{k}$ equal to the Discrete Fourier Transform (DFT) matrix allows one to model the DFT of ${{\boldsymbol{y}}}_{k}$.

7. If ${\boldsymbol{y}}$ is complex, transpose operations in this paper should be replaced by Hermitian conjugates.

8. See Appendix C for both the assumptions required to include informative priors derived from the training sets and the equations necessary to do so.

9. Note that even though such a prior is referred to as improper, i.e., it cannot be normalized, the posterior distribution is well-behaved.

10. The only exception to this statement occurs when two or more of the basis vectors are degenerate with one another.

11. We generally use ${\boldsymbol{W}}={\boldsymbol{I}}$.

12. The ares code is publicly available at https://github.com/mirochaj/ares.
13. As is true here, the maximum number of parameters considered for a given data component should be greater than or equal to the number of modes necessary to fit each curve in the corresponding training set down to the noise level of the data component, which is determined by the covariance matrix ${({{{\boldsymbol{\Psi }}}_{k}}^{T}{{\boldsymbol{C}}}^{-1}{{\boldsymbol{\Psi }}}_{k})}^{-1}$. Recall that ${\boldsymbol{C}}$ is the covariance of the noise, ${\boldsymbol{n}}$, in the data (see Equation (1)).

14. pylinex is publicly available at https://bitbucket.org/ktausch/pylinex.
15. https://www.numpy.org/, https://www.scipy.org/, and https://matplotlib.org/.