Fitting and Comparing Galactic Foreground Models for Unbiased 21-cm Cosmology

Accurate detection of the cosmological 21-cm global signal requires galactic foreground models which can remove power over ~$10^6$. Although foreground and global signal models unavoidably exhibit overlap in their vector-spaces inducing bias error in the extracted signal, a second source of bias and error arises from inadequate foreground models, i.e. models which cannot fit spectra down to the noise level of the signal. We therefore test the level to which seven commonly employed foreground models -- including nonlinear and linear forward-models, polynomials, and maximally-smooth polynomials -- fit realistic simulated mock foreground spectra, as well as their dependence upon model inputs. The mock spectra are synthesized for an EDGES-like experiment and we compare all models' goodness-of-fit and preference using a Kolomogorov-Smirnov test of the noise-normalized residuals in order to compare models with differing, and sometimes indeterminable, degrees of freedom. For a single LST bin spectrum and p-value threshold of $p=0.05$, the nonlinear-forward model with 4 parameters is preferred ($p=0.99$), while the linear forward-model fits well with 6-7 parameters ($p=0.94,0.97$ respectively). The polynomials and maximally-smooth polynomials, like those employed by the EDGES and SARAS3 experiments, cannot produce good fits with 5 parameters for the experimental simulations in this work ($p<10^{-6}$). However, we find that polynomials with 6 parameters pass the KS-test ($p=0.4$), although a 9 parameter fit produces the highest p-value ($p\sim0.67$). When fitting multiple LST bins simultaneously, we find that the linear forward-model outperforms (a higher p-value) the nonlinear for 2, 5 and 10 LST bins. Importantly, the KS-test consistently identifies best-fit \textit{and} preferred models.


INTRODUCTION
Many windows into the early Universe-from the Cosmic Dawn to the Epoch of Reionization-have begun to open as low-frequency single antenna/radiometer and interferometer experiments, aimed at the elusive 21cm emission line of neutral Hydrogen produced by the Inter-Galactic Medium (IGM, see Madau et al. 1997;Shaver et al. 1999), experience first light.Beginning with the Experiment to Detect the Global EoR Signature (EDGES, Bowman et al. 2018), and followed by the Shaped Antenna Measurement of the Background Radio Spectrum (SARAS3, Singh et al. 2021) and the Hydrogen Epoch of Reionization Array (HERA, Collaboration et al. 2022), data now exist which have the potential to illuminate the nature of the first stars and galaxies (Furlanetto et al. 2006a;Pritchard & Loeb 2012;Barkana et al. 2018;Schauer et al. 2019;Muñoz et al. 2022;Furlanetto & Mirocha 2022), measure small-scale structure formation and the tomography of Reionization (Muñoz et al. 2020;Mirocha et al. 2022), and perhaps provide a new method for constraining dark matter (Furlanetto et al. 2006b;Barkana 2018;Fialkov et al. 2018;Schneider 2018;Hibbard et al. 2022).
Other global and power-spectrum experiments, including the Large-Aperture Experiment to Detect the Dark Age (LEDA, Price et al. 2018;Spinelli et al. 2022), Probing Radio Intensity at high-Z from Marion (PRIZM, Philip et al. 2018), Sonda Cosmológica de las Islas para la Detección de Hidrógeno Neutro (SCI-HI, Voytek et al. 2015), the Radio Experiment for the Analysis of Cosmic Hydrogen (REACH, de Lera Acedo et al. 2022), the Mapper of the IGM Spin Temperature (MIST, Monsalve et al. 2023) and power spectrum experiments with data taken by the Murchison Widefield Array (MWA, Yoshiura et al. 2021), the LOw-Frequency ARray (LOFAR, Mertens et al. 2020), and the Long Wavelength Array (LWA, Eastwood et al. 2019), are under development or have released initial constraints.
The foremost obstacle to measuring the 21-cm signal, both global and power spectrum remains the galactic foreground, four to six orders of magnitude greater than the cosmological signal (Pritchard & Loeb 2012) and comprised primarily of synchrotron radiation generated by relativistic electrons spiraling in galactic magnetic fields (Condon & Ransom 2016).Although the intrinsic foreground is well-modeled by spectrally-smooth power laws, when coupled to realistic, chromatic antenna beams the result is a spectral Gordian knot; that is, the frequency dependence of the beam couples spatial variations of the foreground sky as spectral variations into the received signal at the antenna.The ineluctable fact remains, seldom acknowledged, that models for the beam-weighted, galactic foreground must be able to fit the foreground component of the data down to the noise-level of the signal.Residual total intensity beam-weighted foreground power can not only obscure the signal entirely, but lead to false signal reconstructions (Tauscher et al. 2020a;Anstey et al. 2021;Rao et al. 2017;Bassett et al. 2021b), as can unmodelled intrinsic foreground polarization (Spinelli et al. 2022), or inadequately removed ionospheric ripples (Shen et al. 2021); we focus on modelling the first in this paper.Any proper foreground model, then, ought to be able to fit realistic, mock foreground spectra down to the ∼ 20 mK level or less (the experimental level reported by EDGES, with residual systematic error above the radiometer noise), constituting a requirement of nearly six orders of magnitude of foreground removal.
To be explicit, there are two main sources of systematic bias and additional uncertainties in the process of extracting the 21-cm signal.The first comes from the inevitable overlap of the foreground and the signal models, and cannot be mitigated wholly as their vector spaces are in general not orthogonal.However, the second significant source comes from inadequate foreground mod-els themselves.If a foreground model alone cannot fit mock foreground-only spectra down to the noise-level of the signal, even with the addition of more parameters, then the residual level that is attained represents the "limit" of this foreground model, and one cannot hope to get better signal extractions than this level regardless of the overlap with the signal model.This source of error has received less recent attention, if any, by the community.This paper is thus primarily concerned with testing foreground models to see when/if they reach this level of accuracy, given mock foreground-only spectra which mimic real observations in complexity.
For readers who are interested in modelling and/or mitigating the first source of bias and error, namely that of overlap, we refer them to e.g., Tauscher et al. (2020b); Anstey et al. (2023).Note that all foreground models used in this work have also been used successfully in joint-fits (i.e., fits with foreground and signal); see below where each model is referenced to its primary work.
Ideally, we would construct foreground models from all-sky temperature maps at the Cosmic Dawn and EoR frequencies (from 10 − 200 MHz).However, due to the dearth of such maps, let alone the utter lack of any with error estimations, it is necessary to use one of two methods to build robust, full-coverage foreground models.The first method uses principal component analysis (see de Oliveira-Costa et al. 2008;Zheng et al. 2017) to extract modes from published low-frequency sky temperature maps, such as those of Haslam et al. (1982); Guzmán et al. (2011) at 408 MHz and 45 MHz, respectively.These modes can then be used to fill in the missing frequencies through interpolation.In the second method, the temperature maps are parameterized by nonlinear emission and absorption physics, such as spectral indices of synchrotron emission or electron densities contributing to free-free absorption (Rao et al. 2016;Anstey et al. 2021).
Both methods depend upon the spatial features revealed by sky temperature maps.Such maps require high numbers of spherical harmonics to fully characterize, and models reliant upon such decompositions can be prohibitively computationally expensive.The lack of errors in the temperature maps themselves also prohibits their direct use as a model in likelihoods where error characterization is essential.Thus it is necessary to use the published sky temperature maps as a baseline (a model input) for characterizing the spatial features of the low-frequency sky, and to then build flexible enough foreground models which can account for some amount of spatial variation or error in the underlying temperature maps.
The method of generating linear eigenmodes (via singular value decomposition, SVD), which characterize the most important modes of variation, from a matrix which consists of columns of spectra generated from a nonlinear, physically-motivated foreground model, as proposed in Switzer & Liu (2014); Vedantham et al. (2014); Tauscher et al. (2018), lies somewhere between method one and two.
Finally, it is common in the literature to employ purely phenomenological foreground models which describe smoothly-varying spectral features, such as polynomials and maximally smooth functions.Although it was initially believed that these models would be sufficient to describe the intrinsic power-law foreground with relatively few (∼ 3) terms (Furlanetto et al. 2006a;Pritchard & Loeb 2012), it became quickly clear that the large chromaticity of the beam distorted the smoothstructure sufficiently to vitiate fits with a small number of parameters (Harker et al. 2012;Bernardi et al. 2016;Mozdzen et al. 2016;Sims & Pober 2020).Distortions from the ionosphere further increase the number of needed terms (Mozdzen et al. 2019).As such, increasing numbers of polynomial parameters are required to describe the resultant beam-weighted foreground spectra adequately, indelibly increasing the overlap (and thus extracted errors) with 21-cm signal models (see e.g., Bassett et al. 2021a).
We reemphasize that these two separate problems of modelling the foreground alone sufficiently, and of the foreground's inevitable overlap with signal models, are deeply coupled: residual power left from the former will almost certainly be absorbed by the latter and cause a biased signal extraction.To avoid this, more foreground parameters will probably be needed, which in turn, will increase the overlap with signal models, leading to larger signal errors.Furthermore, goodness-of-fit statistics may proclaim a fit at the noise level of the signal when in fact both foreground and signal have poor fits individually, but their unmodelled components combine to cancel in the residuals.
To understand the limits of foreground modelling alone, as well as which models are preferred when fitted to realistic mock foreground spectra, we present an analysis of seven foreground models commonly employed in the literature: two with spatial dependencies -one nonlinear based upon the Radio Experiment for the Analysis of Cosmic Hydrogen (REACH) model of Anstey et al. (2021) and one linear based upon eigenmodes and singular value decomposition from Tauscher et al. (2018)-, three common polynomial models in frequency ν or ln ν (Bowman et al. 2018), and two maximally-smooth polynomial models (Rao et al. 2017;Bevins et al. 2021).
We first construct mock foreground spectra from realistic spatial and spectral intrinsic foregrounds, and then convolve them using realistic beams, horizons, timesampling, and noise levels.We test the ability of each model to fit the mock spectra down to the noise level, and compute several goodness-of-fit and model comparison statistics.We test not only how dependent the models are on their spatial inputs (where applicable), but how many terms are needed in each case for foregroundonly fits to reach the signal noise level.Importantly, we also present a more robust method for comparing and choosing best-fits across all models (useful in the case where goodness-of-fit statistics are unreliable due to an inability to calculate degrees of freedom accurately, as in the case of nonlinear models such as those employed by REACH and SARAS), using normalized residuals in a Kolmogorov-Smirnov test.This work is motivated in part by requirements for upcoming lunar-based telescopes that will soon begin global 21-cm observations from the uniquely radio-quiet, ionosphere-free, and environmentally stable lunar surface.These observations are enabled by NASA's Commercial Lunar Payload Services program that will deliver our radio instruments to the lunar farside including operations during lunar night (Burns et al. 2021b,a;Bale et al. 2023).
We first construct our realistic foreground mock spectra in Section 2, and then in Section 3 introduce the foreground models which we fit to the mock.Section 4 outlines the statistics used and our Bayesian analysis framework.We present the results of our fits in Section 5, discuss the implications of our findings in Section 6, and conclude in Section 7.

SIMULATED FOREGROUND DATA (REACH MOCK)
We begin by building a spatially and spectrally realistic, beam-weighted mock foreground.We refer to the unweighted galactic sky as the intrinsic foreground, and the foreground measured by a radio antenna experiment as the beam-weighted foreground.The former is generated by synchrotron emission from relativistic electrons spiraling in the galactic magnetic field.These electrons have power-law energy number density distributions, and thus produce power-law emission spectra (Condon & Ransom 2016).The beam-weighted foreground is generated by coupling a radio beam to the intrinsic foreground.

Intrinsic Foreground
The intrinsic mock foreground we employ is the same as that of the REACH team (Anstey et al. 2021(Anstey et al. , 2022) ) in their studies: where T 408 is the sky map generated at 408 MHz by the Global Sky Model (GSM) of de Oliveira-Costa et al. (2008), T CMB = 2.725 K is the temperature of the Cosmic Microwave Background (CMB), and the spectral index β R is given by where T 230 is the sky map generated at 230 MHz by the same GSM.We calculate all extrapolations and convolutions using the HEALPIX module of Gorski et al. (2005).This intrinsic foreground mock exhibits both complicated spatial temperature and spectral distributions, and is thus a suitable representation of the inherent complexity of the true foreground.Our intrinsic foreground is shown in Figure 1, with the upper panel showing T 408 and the bottom panel β R .
However, in reality, it is likely that the true foreground deviates in spatial structure from 408 MHz down to 40 MHz, and contains small fluctuations in the spectral indices over the bandwidth of interest, known as spectral curvature (Mozdzen et al. 2017(Mozdzen et al. , 2019;;Irfan et al. 2021).For this work, however, we neglect such intrinsic spectral variations, and assume that the spectral indices β R contain all spectral information, and T 408 all spatial temperature information.

Beam Model, Horizon, and Observation Strategy
To generate our beam-weighted foreground spectra from the intrinsic foreground mock, we assume a beam and observation strategy similar to the EDGES observatory, situated near the Murchison Radio Observatory (MRO) in Western Australia (Bowman et al. 2018).Therefore, we use a model of the antenna beam as generated by the computational electromagnetics software package FEKO 1 for a blade antenna with an extended ground plane (see Mahesh et al. 2021).
Figure 2 shows the E and H Plane polar cuts of the EDGES antenna beams B(Ω) used to convolve the intrinsic foreground mock and generate the simulated spectra 2 .
Our horizon profile (the fraction of the 4π steradian sky set to zero due to blocking by the ground and lo-1 Altair FEKO -https://altairhyperworks.com/product/FEKO 2 These beam simulations have been provided courtesy of the EDGES collaboration.cal topology, H(Ω)) is generated by the SHAPES code from Bassett et al. (2021b) for the EDGES experiment's latitude and longitude.See the latter work for the calculation and visualization of the horizon profile3 .Lastly, note that the observing band for our mock spectra runs from 51 to 87 MHz, with ∆ν ≈ 0.4 MHz channel spacing and a total number of channels N ch = 93.

Time-Sampling and Multiple LST Bins
Most mock spectra of global experiments simulations assume continuous integration of the intrinsic foreground for some LST duration δt (see for example Tauscher et al. 2020b;Anstey et al. 2021).For instance, one might assume integration for a duration of δt = 6 hours from LST 4 -10 for mock observations.The effective foreground seen in this example is then a smeared version of the intrinsic foreground produced by the continuous integration and rotation of the sky from LST 4 to LST 10: where t i labels the middle of the LST bin, and f n (t) is a time-sampling function, described below.In practice, most experiments do not observe the sky continuously for hours or even minutes, but take spectra with short integration times of several seconds while the switch is in the antenna position (see for instance, work by the EDGES collaboration Monsalve et al. 2017;Mozdzen et al. 2019;Monsalve et al. 2021;Mahesh et al. 2021).As we seek to produce mock spectra as similar as possible to the actual EDGES low-band 2016-2017 run data, we follow the time-sampling of that experiment, which consists of taking δt = 13 second integrated spectra every 39 seconds.These spectra are then averaged together into LST bins according to the center of their LST time stamp t i .This allows us to write down the effective foreground as seen by an antenna taking discrete time samples t i of length δt as where N n is the number of samples in each LST bin n.
Finally, we note that our method of LST-binning differs slightly from that employed in Anstey et al. (2023) by the REACH team.For this work, we separate the entire sky into equal-spaced bins based upon the desired number of total bins n = [1, 2, 5, 10], and then within each final LST bin we average together a number N n of higher resolution LST snapshots.Therefore, for example, if the sky is broken in to n = 2 final bins, snapshots in LST ranges 1-12 would fall into the first final bin, and 13-24 into the second final bin.In the latter paper, typically one to three hours at a particular LST will then be split into higher-resolution bins of the order of minutes.

Noise Estimate
We simulate the noise for our mock spectra by assuming that the system temperature T sys is set by the sky temperature T fg , equivalent to setting noise source temperatures to infinity and both load and receiver temperatures to zero.As we are comparing our mock spectra to those taken by the EDGES experiment during its 2016-2017 run, as seen in Hibbard et al. in prep, we calculate the noise in each LST bin as from an ideal radiometer, given by where the weighted, dynamic range D w = ∆νδtW n , and W n are the weights (number of raw channels times number of raw spectra) in each LST bin, but are the same for every channel with a given bin.T n (ν) refers to the final, convolved mock spectra in LST bin n (see below).
In practice, we estimate D w for use in our simulations by dividing the actual averaged, observed EDGES spectra in Hibbard et al. (in prep) by the final observed EDGES noise level as estimated from propagation of errors when averaging together raw spectra in time and frequency which has already undergone excision.This estimate is done solely to provide us with a comparable noise level to calculate for our mock spectra here, given a dynamic range comparable to that in the real experiment, and indeed this approximation is less than a factor of two off when used to estimate the noise level in the actual EDGES data.For real data such a simplification will not suffice, but for the simulations here we are only interested in generating reasonable noise levels given the mock spectra of the sky and comparable weighted, dynamic ranges.

Convolved Mock Spectra
Given an intrinsic foreground mock T fg (Ω, ν, t i ), a horizon profile H(Ω), a beam model B(Ω, ν), and time samples t i belonging to an LST bin n, our final convolved foreground mock is given by where as throughout the paper, N n is the number of spectra in LST bin n, and σn is a noise realization drawn from a Gaussian distribution with zero mean and variance given by Equation 5.
In Figure 3 we plot the final mock spectra (hereafter, the REACH mock) for each beam-weighted foreground of different LST bins, with the thick black lines delineating the individual spectra, concatenated together when n > 1.In the bottom panel of the figure, below each of the concatenated spectra, the corresponding noise levels used in the analysis are also shown.These T n represent our mock data that we shall now fit with galactic foreground models.

FOREGROUND MODELS
We fit the REACH mock with four categories of foreground models: nonlinear forward-model, linear forward-model, polynomials (linear), and maximally smooth polynomials (nonlinear).The first two model categories are forward-models in the sense that they rely upon spatial and spectral knowledge of model inputs such as the intrinsic foreground and beam, whereas the polynomial and maximally smooth models are purely spectral, phenomenological models.For brevity, we shall simply refer to the first two models as nonlinear and linear, respectively.In this work we seek to test not only how well each foreground model category can fit the REACH mock for a given set and number of parameters, but also how reliant upon the model inputs the first two categories are.In particular, we will test the robustness of assumptions about the spatial distribution of the base temperature map (BM) and of the spectral index patch map (PM; see below) which extrapolates the BM to the desired frequency band.
Table 1 summarizes the seven foreground models tested in this work, including the nonlinear, linear, 3 polynomial, and 2 maximally-smooth polynomial models.They are delineated by model symbol, the number of LST bins to which they are applied (and consequently, the number of LST bins in the REACH mock they fit), the equation which generates the model spectra, the parameters in the model Θ, the number of parameters in a given fit N Θ , the constraints upon the generating equation, and the parameter priors, if any.Only one model has a fixed number of parameters, the Linear Physical Polynomial M LinP hys , with N py = 5.
The rest of this section will be devoted to further illuminating the assumptions made and calculations involved for each model.Readers familiar with the galactic foreground model literature and Bayesian analysis may wish to skip directly to Section 5 for the results of fitting each model to the REACH mock using standard Bayesian analysis.

Nonlinear (Physical) Foreground Model
Based on work from the REACH collaboration (Anstey et al. 2021(Anstey et al. , 2022;;Pagano et al. 2022;de Lera Acedo et al. 2022), we assume that T fg in Equation 1 is well-modelled to the chosen noise level by approximating its spectral index map β R in Equation 2as a set of N r regions or "patches", each with roughly the same number of pixels.These regions are chosen by sorting the spectral index map pixels into N r equalpercentile bins, where each bin defines a region on the sky.For instance, a sky consisting of 4 regions would have a 0 − 25 percentile bin, or region, a 26 − 50 percentile region, and so on.Figure 4 shows an example of this spectral index patch map when N r = 8 regions.Each region j is then assigned a constant spectral index β j , magnitude A j , and spectral curvature γ j4 .The curvature parameters allow for spectral structure poorly captured by the spectral index to be accounted for, as well as intrinsic curvature in the foreground (neglected here), while the magnitude parameters allow for a rescaling of the temperature base map5 .
The above nonlinear parameters are shown in Equation 7 of Table 1, and in its corresponding Θ column.The so-called chromaticity functions K j (ν) in that equation contain all of the information concerning the observation strategy, time-sampling, beam, horizon, temperature base map (BM) and spectral index patch map (PM).They are parametrized by region j and calculated via where the different PM regions are formed by the masks M j , which are unity for each pixel within a region j, and zero otherwise.M j is calculated from β R (the same as the REACH mock) for the primary case of the nonlinear model, although see below where we also test for a case when M j is not derived from β R .This set of masks is one of the two model inputs required to build the nonlinear model.The other model input, the base map (BM) T 0 at reference frequency ν 0 , is likewise set  2, 3.6, 6, 8.4, 10.8, 13.2, 15.6, 18, 20.4, and 22.8 hr (red).The vertical black lines delineate each LST bin.We also plot the radiometer noise levels for the mock data (bottom panel), as well as flat 20 mK and 100 mK noise levels in black dashed lines for comparison.Black vertical lines delineate different spectra when the number of bins is greater than 1, and channels refers to a particular frequency in the bandwidth for a particular LST spectrum.Thus, channel 0 corresponds to 51 MHz for the first spectrum, while channel 93 refers to 51 MHz for the second spectrum, and so on.to T 408 for the primary, or ideal, case.We also test the case in which T 0 ̸ = T 408 .For N r sky regions, there can be up to N nl = 3N r parameters for this model, depending on whether we include magnitude A j and spectral curvature γ j parameters.When these parameters are not included in a fit, we set their values to 1 and 0, respectively.

Linear Foreground Model
The linear foreground model we employ in this work is, in essence, a linearized version of the nonlinear model given above.For a given set of nonlinear model inputs, such as the number of regions N r , the BM and PM, and the nonlinear parameters θ, we simulate ten thousand nonlinear model spectra by computing b draw,i = M nl (θ draw,i ) for each nonlinear parameter draw θ draw,i labelled by i.We generate the latter by sampling from uniform prior ranges similar to those used for the nonlinear model priors, albeit slightly truncated: β j ∼ U (−3.5, −2.0),A j ∼ U (0, 3), γ j ∼ U (−0.1, 0.1).Each simulated nonlinear spectrum b draw,i is then stacked horizontally into a matrix denoted B fg as one of its columns.This matrix thus has dimensions N ch × N sp , where N ch is the number of channels in each spectra (number of frequencies times the number of LST bins concatenated together) and N sp = 10, 000 is the number of simulated spectra.We then compute the eigenmodes of this matrix of spectra, hereafter referred to as LinLogPoly MLinLogPoly Table 1: All galactic foreground models considered in this work, where each row denotes a different model.We fit each model to the REACH mock for a given set of LST bins, parameters, and model inputs.The Nonlinear model is the same as that employed by the REACH collaboration; the linear model generates eigenspectra from the nonlinear model via Singular Value Decomposition; the next three are all polynomial models; the last two are maximally-smooth polynomial models.The generating equation gives the model spectra for a given set of parameter values and model inputs.Parameter labels, the number of parameters, constrain equations, and priors are shown in the last four columns.eigenspectra, via Singular Value Decomposition (SVD)for details on SVD, we refer the reader to our pylinex6 pipeline series (Tauscher et al. 2018;Rapetti et al. 2020;Tauscher et al. 2020bTauscher et al. , 2021)), as well as to recent work by the REACH collaboration building on our SVD techniques (Saxena et al. 2022).The SVD eigenspectra, given mathematically as the columns of F fg , then represent the modes of our linear model and provide the best fit to every spectrum in the matrix B fg .In this case, best fit means that the eigenspectra derived from SVD minimize the total root-mean-square (RMS) of all residuals for fits to every spectrum in the matrix.
Theoretically we would need a number of eigenspectra equal to the rank of the matrix to span its null space, but in practice we truncate the number of vectors when we fit the desired mock data to the input noise level.This typically requires 5−50 eigenspectra for one or two LST bins, but can require up to 150 eigenspectra for 10 LST bins.The parameters in this model are labelled as Θ = [x fg ], which is a parameter vector of length N x equal to the number of eigenspectra, or columns in F fg .
To generate priors for the linear parameters x fg , we first fit every spectrum in B fg with its own eigenspectra.We then calculate the mean and covariance of each linear model parameter x i fg across all fits, and use these values to define a multivariate Gaussian prior distribution for the linear model parameters with mean ν and diagonal covariance Λ (see also Bassett et al. 2021a;Tauscher et al. 2020b, for details on this calculation).Using these priors assumes that the magnitudes of the spectra in the simulated matrix B fg well-describe the magnitudes of the underlying convolved foreground, hence further relying on the validity of the foreground model.

Polynomial Foreground Models
As there is no obvious way to generalize the polynomial foreground models to fit multiple LST bins simultaneously, these models are limited to the case of 1 LST bin.We focus on three polynomial models commonly employed in the literature (see e.g., Bowman et al. 2018;Singh et al. 2021;Bevins et al. 2022): linear-logarithmic (LinLogPoly), linear (LinPoly), and linearized physical (LinPhys).

Traditional Polynomial Models
The LinLogPoly model consists of powers of natural logarithms, while the LinPoly model consists of mere polynomials in (ν/ν 0 ).Incidentally, the former model is a consequence of Taylor-expanding a foreground consisting of many smooth power-laws added together (Hibbard et al. 2020), though including a chromatic beam factor to each power-law disrupts this smoothness and thus this approximation 7 .The LinPhys model is supposed to account for both the galactic foreground and ionospheric distortion and is similarly derived from a Taylor-expansion of a physically-motivated, nonlinear, power-law based foreground model (Bowman et al. 2018).
In each polynomial model, the Θ = [a k ] are the (linear) polynomial parameters, and N py refers to the number of polynomial parameters in each fit.We select broad, Gaussian priors with zero mean and variance given by σ 2 poly = (3000) 2 .

Maximally Smooth Polynomial Models
Maximally smooth polynomials are polynomial models with the added constraint that their higher-order derivatives have no zero-crossing points.That is, for 7 For an alternative form of the LinLogPoly model, see the following memo by the EDGES collaboration: https://loco.lab.asu.edu/loco-memos/edges_reports/report122.pdf.This form factors out a power-law before Taylor-expanding, as opposed to what is done in (Hibbard et al. 2020).We retain our version of the polynomial model for comparison to other works which use the same form.
the two models considered here, we require where M M DP refers to the maximally smooth Difference Polynomial in Equation 12, and M M LLP refers to the maximally smooth Log Log Polynomial in Equation 13.The order of the derivative is labelled by m, and it is typically required that m ≥ 2. Such constraints can be gathered into a matrix equation (Bevins et al. 2021) given by Ga ≤ 0, where G denotes the matrix of zeroes and derivatives of all desired orders, and a is a vector of the maximally smooth polynomial parameters, with length N M SF .These models were chosen as they appear best-suited to fit mock foreground spectra and are commonly employed in maximally smooth fits, as seen in Bevins et al. (2021) and Rao et al. (2017).These derivative constraints ensure that the polynomials fit only smooth foregrounds characterized by spectral indices and curvatures, and not global signals, which likely have multiple turning points and would thus be ill-fit by such smooth models (see again Rao et al. 2017;Bevins et al. 2021;Singh et al. 2021).However, it should be noted that in the literature maximally-smooth polynomials have largely been used to fit spectra with achromatic beams, spectrally constant foregrounds, or noise levels on the order of ∼ 500 mK, much too high for most predicted global signal trough amplitudes.See the latter references for examples.

Bayesian Analysis
Given a data vector T n and a model M i with (linear or nonlinear) parameters Θ from a set of possible models [M i ], the posterior distribution, or the probability of the parameters given the data, is calculated from Bayes' theorem: where L is the likelihood, or the probability of the data given the model and parameters; π is the prior, or probability of the parameters for a given model; and the normalizing factor Z is the evidence, given by: which is the marginal likelihood over the prior volume.
If the prior volume is uniform, it is a measure of the likelihood volume.
Assuming Gaussian-distributed noise, the natural logarithm of the likelihood is proportional to where C is the noise covariance matrix of the data, which is typically written as a symmetric, diagonal matrix by assuming that all cross-correlations between channels are zero.In our simulations, the diagonal elements of this matrix are given by the noise level estimates in Equation 5. Bayesian analysis then proceeds by seeking the model parameters which maximize the posterior distribution.If the priors are uniform, the parameters which maximize the likelihood are the same as the parameters which maximize the posterior.The former are referred to as the maximum likelihood estimate parameters (MLE), while the latter are the maximum a posterior parameters (MAP).
For the nonlinear model, we calculate the MAP parameters θ MAP and the evidence ln Z using polychord with the default number of live points, repeats, and evidence tolerance (Handley et al. 2015).We run our nonlinear models on the BLANCA CURC condo-cluster at the University of Colorado.In general, these nonlinear fits must be performed individually for a given number of spectral regions N r , and nonlinear parameters.
If the model is linear, the MAP parameters can be calculated analytically: where S MAP is the MAP parameter covariance.
If the model is linear and the priors are Gaussiandistributed, we can also calculate the evidence analytically: See Appendix A for a derivation of the above result.
While the number of parameters in the nonlinear model is a function of N r , the number of parameters for the linear model, N x , is determined via a grid search.Given a particular set of eigenspectra F fg , we compute the evidence for each possible number N of linear parameters in a one-dimensional grid with N ≤ 100 for 1 and 2 LST bins, and N ≤ 150 for 5 and 10 LST bins.
Moreover, we define N x to be the number of parameters needed to fit the spectra in the matrix B fg down to the noise level, as determined by the N with the highest evidence from the grid search, and is typically N x = 7 for the 1 and 2 LST bin cases, and N x = 15 or more for the 5 and 10 LST bin cases.Note that this grid search is performed for every different model input (BM, PM, and N r ).N x is then employed as the number of parameters used in the linear fit.
For the traditional polynomial models in Section 3.3.1,we obtain the best-fit parameters a MAP using Equation 19with F fg replaced by the polynomial modes defined in Equations 9 -11.
For the maximally smooth polynomial models in Section 3.3.2,we use maxsmooth from Bevins et al. (2021) to compute the MAP parameters.

Goodness of Fit
While the evidence can be used to determine a preferred model for a given data set, it does not quantify the goodness-of-fit of the model to the data set.Instead, the reduced chi-squared statistic is often employed for this purpose: where δ = T n −M(Θ MAP ) are the residuals of the bestfit, and N DOF is the number of degrees of freedom of the fit.If the model is linear and there are no priors, the expected value and variance of the χ 2 red statistic are E[χ 2 red ] = 1 and V [χ 2 red ] = 2/N DOF , allowing for a calculation of the number of standard deviations σ away from the expected value that a particular model fit produces.Robustly, the degrees of freedom for a linear model without priors is equal to the rank of the basis matrix, in this case the rank of F fg , if the vectors in the basis are linearly independent (Andrae et al. 2010).SVD produces orthogonal vectors, which satisfy this requirement; otherwise, is the number of channels in the data vector.Therefore, for the linear models here, N DOF = N d − N Θ .When priors are included in linear models, E[χ 2 red ] ≥ 1, and the expectation value can be calculated analytically.

Challenges of Reduced Chi-Squared
The trouble emerges when the model is nonlinear.In that case, not only is it in general impossible to calculate E[χ 2 red ] when priors are included, but the number of degrees of freedom lies somewhere between zero and N d − 1, may change during the fitting procedure, and subsequently cannot be estimated using the same equation for the linear model.Rigorously, this is because in the case of a linear model it is possible to write down a basis matrix with linearly independent vectors (F fg , for example).The rank of this basis, or design matrix, is equal to the degrees of freedom; (see e.g.Ye 1998;Hastie et al. 2009).This means that in general, E[χ 2 red ] ̸ = 1 for nonlinear models, and without a method to estimate E[χ 2 red ], determining the quality of fit based upon how close χ 2 red is to unity is grossly inadequate for nonlinear models, especially when considering the dynamic ranges involved in 21-cm signal extractions.

KS-Test of Normalized Residuals
An alternative statistic that can be used to characterize both goodness-of-fit and model comparison is the set of noise-weighted residuals in a Kolmogorov-Smirnov (KS) test (Kolmogorov 1933;Smirnov 1948).By definition for our Gaussian loglikelihood, the distribution of residuals normalized by the true, input errors for the true underlying model evaluated at the true parameters is Gaussian with zero mean and unity variance; i.e. the true model removes all foreground power, leaving exactly only Gaussian noise at the input level.All other models will produce a distribution of residuals which will deviate, whether slightly or significantly, from the case where the errors are known and the model is the truth.
One can quantify this deviation by comparing the cumulative distribution function (CDF) of a particular model fit's normalized residuals δ/σ n to the true model's normalized residuals, which is the CDF of a unit normal distribution, using the KS-test.The latter produces a p-value for a particular model fit, p ks .If the null hypothesis is that a particular model produces normalized residuals which follow a unit normal distribution (i.e. the true distribution), then we reject the null hypothesis if the particular model's fit produces a p-value below some threshold.The model does not follow a unit normal distribution in that case, and we conclude that the foreground has not been removed to the noise level8 .To be clear, here we interpret the p ks -value from the KStest to mean the probability of observing the KS-test statistic D given that the null hypothesis (the model residuals follow a unit normal distribution) is true.If the p ks value falls below some threshold of significance, then we conclude that there is a low probability that the null hypothesis is true, or equivalently, a low probability that the model residuals follow a unit normal distribu-tion.We choose our default threshold to be p ks < 0.05; we leave a full quantification of the best threshold for determining goodness-of-fit to future work, although see Section 6.2 for a brief discussion on this value.
Therefore, models which produce p ks > 0.05 pass the null hypothesis, and their residuals are indistinguishable from the true model's residuals.Furthermore, if model A produces a p-value of p a and model B produces a p-value of p b , both pass the null hypothesis test, and p a > p b , we conclude that p a is a better fit to the data than p b .

Ideal Model Inputs
The nonlinear and linear foreground models rely on explicit spatial forms for the temperature BM, the spectral index PM, and the beam.When these model inputs are identical to those used to compute the REACH mock, then model fits test to what noise level the REACH mock can be fit in the ideal case.In this case we write T 0 = T 408 and M j (β R ), as the BM and the spectral index masks are taken directly from those used to generate the intrinsic foreground, T 408 and β R , respectively.
Care still has to be taken, however, to choose the correct number of regions N r and nonlinear parameters for the optimal fit.In general, the number of regions which maximizes the Bayesian evidence is a function of both the size of the beam and the number of LST bins in T n .The former enters into consideration because larger beams will average more of the sky together and make many small regions largely indistinguishable from one large region, while the latter also enforces spatial resolution through time-dependence.More LST bins requires greater spatial resolution, although this too is limited by the size of the beam.See Appendix B for a demonstration of this effect in the case of the linear-forward models.
Therefore, we appeal to the Bayesian evidence to choose the optimal number of regions N r for both the nonlinear and linear models.We find that for the nonlinear models the number of regions which maximizes the Bayesian evidence is as follows: N r = 4 for 1 LST bin, N r = 8 for 2 LST bins, N r = 9 for 5 LST bins, and N r = 9 also for 10 LST bins, in the ideal model input case.Due to the computational expense of running each of these models for many regions, we use these optimal numbers for all of our fits.
For the linear model, we compute the Bayesian evidence for N r = 1 − 20 for each number of LST bins and choose the model with the highest evidence.If two models exhibit a similar evidence ln Z 1 − ln Z 2 ≤ 3, fol- We also use this same method of model parameter number selection for the polynomial and maximally smooth models.
Figure 5 shows both the evidence ln Z and χ 2 red for each number of LST bins (blue, orange, green, and red for 1, 2, 5, and 10, respectively), and when different nonlinear model parameters are included in the matrix of eigenspectra F fg (solid, dotted, and dashed lines for [β j ], [β j , A j ], [β j , A j , γ j ], respectively).It is interesting to note that the evidence for the linear models saturates at a given number of regions, dependent upon the number of LST bins.In general, one region is typically enough for 1 LST bin, while at least 2 regions are required for 2 LST bins, 5 regions for 5 LST bins, and 12 regions for 10 LST bins.We have checked that this pattern is consistent across different model inputs (BM, PM) and numbers of nonlinear parameters included.This trend reflects the fact mentioned above that more LST bins requires higher spatial resolution (see Appendix B).The filled circle symbols in the bottom panel of Figure 5 mark the expected values E[χ 2 red ] for each fit.

Incorrect Model Inputs
To test the dependence of these models on their inputs, in addition to the ideal case (mathematically represented by M j (β R ) = M j (β 408/230 ) and T 0 = T 408 ), we compute nonlinear and linear fits when one of the inputs T 0 or M j is incorrect, and leave tests of the beam for future work.We wish to use model inputs which are similar to those used in the REACH mock, but different enough to test the limits of the models.For the case when the BM is incorrect, we set T 0 = T 300 ; for the case when the PM is incorrect, we set M j = M j (β 300/130 ), where the spectral index map β 300/130 is the ratio of the GSM maps generated at 300 and 130 MHz.
The left panel of Figure 6 shows a mollweide view of the absolute difference between the T 408 BM of the REACH mock and the T 300 BM model input.The right panel of the figure shows a histogram of the pixel temperatures, which can be thought of as the errors between the model input temperature map and the temperature map in the mock.Note that these represent differences in spatial structure in the assumed BM.These errors, which trace the galactic structure, are analogous to errors in the temperature map which are proportional to the temperature in each pixel and with zero degrees of correlation across the sky.See also Pagano et al. (2022) for similar error map analyses.
For clarity, note that these figures show the absolute temperature difference between the two maps: if the pixels (and thus spatial structure) of one map were identical up to a multiplicative factor, then we would expect no difference when changing the model input map, as the spatial structures would be identical and the scaling by the reference frequency ν 0 would account for the absolute temperature difference.This test examines how small pixel-by-pixel differences in the BM affect the ability of the foreground model to account for them, generally through using more parameters (like the magnitude parameters, A j ).To summarize, we are asking the question, if the true pixel absolute values are those given by T 408 , but we use a model input with slightly different spatial structure in the temperature map, such as T 300 , how many parameters will it take for the foreground to fit the mock spectra using the incorrect model input down to the noise level?
Similarly, Figure 7 shows an example mollweide view of the difference in the two patch maps for N r = 8 when each region is assigned a number from zero to seven.Therefore, a difference of zero (cyan color) means that both pixels are assigned to the same patch of the sky, while a difference of one means that they differ by one region, and so on.The right panel of Figure 7 shows a bar chart of the pixels, where one can see that roughly seventy percent of the pixels are within the same region, and twenty-eight percent differ by a single region on the sky.We do not consider this too draconian a difference in patch assignations, as seen by the preponderance of cyan in the left panel.
All together then, for the nonlinear and linear models, we perform three categories of tests: when the model inputs (BM, PM) are ideal, when the BM only is incorrect, and finally when the PM only is incorrect.For the polynomial models, we only test to what residual level they can fit the single LST bin REACH mock for various numbers of polynomial parameters N py or N M SF .

RESULTS
Tables 2 3, and 4 summarize the main models, evidences, and goodness-of-fit statistics (such as χ 2 red and the KS test p-value, p ks ) produced when fitting the REACH mock with linear and nonlinear, polynomial, and maximally smooth foreground models, respectively.Each model is defined first by its model inputs (such as LSTs, Input Base Map, Patch Map, and parameters), and secondly by the number of parameters in the fit, N Θ , where Θ takes on the symbols [x, θ, py, M SF ] for linear, nonlinear, polynomial, and maximally smooth models, respectively.Where the model is linear in its parameters, such as for the linear and polynomial foreground models, we also include the number of sigma σ the model fit lies away from the expected χ 2 red , for the linear models where the degrees of freedom are calculable.A value of σ = 1, therefore, means that the value of χ 2 red produced by the model fit lies within one sigma of E[χ 2 red ].We color-code each table, for a particular model or model input (as delineated by the horizontal lines), according to its p ks −value.Under the null hypothesis, the noise-weighted residuals produced by a particular model fit do not deviate significantly from a Gaussiandistribution with mean µ = 0 and variance σ 2 = 1.Therefore, if a fit produces a p ks < 0.05, we reject the null hypothesis and conclude that residuals deviate from a unit normal Gaussian; i.e. that significant foreground power remains in the residuals at or above the noise level in the data.Models with the highest p ks -value (and therefore the best-fit) for a particular input are colored in gold , while models which fail the null hypothesis are colored in gray .Additionally, for the linear and nonlinear foreground models, we label the best-fitting gold model with either L or N L, and a number, as seen in the column labeled BF for Best-Fit.

Nonlinear and Linear Models
As was mentioned previously, neither χ 2 red nor the ln Z are always jointly representative of the best-fit model for the linear and nonlinear foreground models.The latter allows one to form a Bayes ratio and hence calculate model preference, while the former describes goodnessof-fit when the model is linear.The p ks -value from the KS test accomplishes both for the linear and nonlinear models.For instance, in Table 2 under the 2 LST bin case for the 300 MHz BM model input, according to the highest Bayesian evidence, the linear model prefers only the spectral index parameter, while the nonlinear model prefers all three (index, magnitude, and curvature).The linear model is able to produce good fits according to χ 2 red in all parameter cases (all have σ = 1), while for the corresponding nonlinear models, it is not well-defined whether they correspond to fits at the one or more sigma level.Examining the p ks , we find that the two nonlinear cases pass the null hypothesis, but are not preferred over the highest p ks −value of the linear model, even though both nonlinear cases have higher evidences.
Another interesting example is the 5 LST bin, (300/130) PM model case, which illustrates the possible failure of the evidence to pick a preferred and goodfitting model: here, the nonlinear model with all three parameters (spectral indices, magnitudes, and curva-  tures) is preferred by the evidence compared to the corresponding linear model, and yet if one calculates the nonlinear model χ 2 red naïvely using N θ as the degrees of freedom, one finds χ 2 red = 7.9, which is almost certainly not a good fit.The p ks −value, however, sorts out the difficulty, as the linear model has 0.93, which is orders of magnitude higher than the nonlinear value of 8.6 × 10 −20 .The linear model's χ 2 red value further confirms this preference.
Remarkably, the linear model does not produce any fits that fail the null hypothesis until using 10 LST bins, whereas the nonlinear model fails in many instances before that when the model input is incorrect and there are not sufficient nonlinear parameters included (for our method of LST-binning).At 10 LST bins, no nonlinear model produces good fits for any of the inputs described, as shown by the graying-out of all the nonlinear rows.The linear model, on the other hand, exhibits sev-Table 2: Linear and Nonlinear model fit results to the REACH mock.Each category of fits is delineated by its model inputs (IDEAL, BM, or PM) and separated by a dividing horizontal line, within a given number of LSTs.IDEAL means the model input is the same as the REACH mock, BM denotes the temperature base map used in the model input which is different from the REACH Mock, and PM denotes the spectral index patch used in the model input which differs from the REACH mock.Within a category, the best-fitting model based upon the KS-test p-value p ks are outlined in gold.Model fits which do not pass the null hypothesis exhibit p ks < 0.05 and are outlined in gray.The last column BF indicates the best-fitting model within each category, labelled according to whether it is linear L or nonlinear N L.   eral good-fitting models, all of which require the complexity added by including all three parameters.The model input cases corresponding to the incorrect patch map labelled (300/130) PM appear to be the most difficult cases for the nonlinear model to fit, and for 5 LST bins the nonlinear model is already incapable of producing any good fits.The number of LST bins, or timedependence, that the nonlinear model can accommodate is largely limited by the size of the beam, and therefore the number of distinguishable patches on the sky (see Appendix B for a brief discussion on the effect of beam size on the number of regions and LST bins preferred by the evidence).
Figure 8 shows the residuals produced for each of the best-fit linear and nonlinear models.Best fit models that are linear are shown by solid lines, while those that are nonlinear are shown by dotted lines.For the 1 and 2 LST bin cases, it is clear that the best-fit models remove all foreground power, leaving only the input Gaussian-distributed noise.However, for both the 5 and 10 LST bin cases, especially in the channels where the power is greater (because the galaxy is overhead), even the bestfit models leave small amounts of residual foreground power which manifests as small ripples in the spectrum (see for instance the model L8 represented by the green line in 10 LST bins within the 400 to 900 channel band).These ripples can be described by sinusoidal functions, as we have verified by taking the Fourier transform of each LST bin and finding that the residual power contains features with periodicity of ∼ 0.025 cycles/MHz, although in this case they are below the noise level.In particular, we found that the models which do not produce good fits (via the KS-test) produced ripples with considerable power, in the case of the nonlinear model for 2, 5, and 10 LST bins.In contrast, for the linear model the ripples are below the noise level when the null hypothesis is satisfied for all LST bins.

Polynomial Models
Table 3 gives the statistics for the fits of the polynomial models to the REACH mock.Both the LinLog and the Lin Polynomial fail the KS test for N py = 5 parameters, a number of terms commonly employed in the literature for fits to galactic foreground spectra.N py = 6 passes the KS-test, however, and while it is still three sigma away from the expected χ 2 red , it thus produces residuals which are indistinguishable from noisenormalized foreground-free residuals.Polynomial models with N py = [6, 7, 9] have comparable evidences, but we find that N py = 9 is preferred because of its low sigma value σ = 2.1 compared to the other two models, and it also exhibits the highest p ks -value of the polynomial fits.
We find that at least N py = 10 is required to produce fits with σ ≈ 1 and RMS residuals on the order of 10 mK.Furthermore, we find that these models are not preferred, according to the highest p ks −value for 1 LST bin cases, when compared to the 408 BM linear and nonlinear models.The LinPhys Polynomial model fails outright to fit the REACH mock (χ 2 red = 9.04), and would require many more terms to fit out any residual foreground power.
Panel (a) of Figure 9 shows the residuals produced by various polynomial model fits to the REACH mock, including the best-fit cases with 9 terms.It is interesting to note that increasing the number of terms in the fit merely reduces the residual amplitudes, but does not appear to alter their structure significantly.That is, the ripples in the residuals follow a similar sinusoidal pattern with roughly the same period for the LinLog and Lin Polynomial models even when including more terms.
To verify the robustness of the above results, we conducted additional fits to mock foregrounds where we allowed the BM T 0 used for extrapolation and the spectral index map β R0 generating the sky temperatures to be different from the REACH mock.We both fixed T 0 = 408 MHz, and changed the spectral index map to β R0 = (408/300), (300/100), (408/100), and also fixed the spectral index map to β R0 = (408/230) and changed T 0 = 300, 200, 45 MHz, giving six new intrinsic foregrounds that we then convolved according to the same beam and observation strategy.We found that all mock foreground cases required N py > 5 in order to achieve good fits.Therefore, we conclude that the polynomial models, for realistic experiments, require N py > 5 for good fits, and typically the best fits have N py ∼ 9 terms, although N py = 6 is not disallowed according to our tests.

Maximally Smooth Polynomials
Lastly, Table 4 shows the statistics and model fits produced when fitting the REACH mock with the two maximally smooth polynomials.We find that the LogLog model is unable to fit out enough of the foreground power to pass the null hypothesis, even with N M SF = 15 terms.On the other hand, the Diff model requires at least N M SF = 10 terms to achieve a good fit.A greater number of terms, however, seems to diminish the fit, as seen by the N M SF = 15 terms case, and as we verified by calculating the fits for N M SF = 1 − 20.We conclude that the maximally smooth polynomials in general require higher numbers of terms to fit the REACH mock than the polynomial models and moreover are not preferred over the latter (as per the higher of the various p ks −values).
Panel (b) of Figure 9 shows the residuals produced by several maximally smooth polynomial cases, which are clearly much larger than their polynomial counterparts for a given number of terms -compare them to those in panel (a) in the same scale.The poor fits produce similar sinusoidal ripples which are indicative of unmodelled foreground power, though they have a different residual periodicity than the polynomial models.
Lastly, we note that the above analyses for the polynomial and maximally smooth polynomials neglect the potential added fitting benefit of multiplying the spectrum by a beam-chromaticity correction, such as in (Bowman et al. 2018;Sims et al. 2023).In particular, the latter work shows how such a factor removes chromaticity coupling perfectly in a spectrum with a spatially isotropic intrinsic foreground, or can dampen residual foreground power when the spatially anisotropic intrinsic foreground can be described by perturbations which are constructed from LinLog polynomials.We leave an investigation of how such a correction factor might affect our own beam-weighted foreground fits to future work.The nonlinear model allows us to reconstruct an intrinsic foreground from the convolved mock spectra, as seen through the beam and sky-averaging.This is important because, due to the latter two effects, there is no a priori reason that the nonlinear parameters, such as the spectral index, will correspond exactly to those in the input intrinsic foreground.Moreover, there is a fundamental limitation on how well the nonlinear model can fit the intrinsic foreground, even neglecting the beam and sky-averaging, due to the fact that we are approximating a sky with tens of thousands of pixels by effectively ten pixels (i.e.regions) or less.
In Figure 10 we show the intrinsic foreground reconstructions for the best-fit nonlinear models with 1, 2, and 5 LST bins.For 10 LST bins there were no fits that passed the null hypothesis test.In the left panel of the figure we show the RMS of the difference-or error-between the REACH mock intrinsic foreground and the reconstructed foreground as a function of frequency.That is, at each frequency we reconstruct an intrinsic foreground using the best-fit nonlinear parameters as found from the fit to the corresponding convolved mock spectra, and the following equation for the intrinsic, reconstructed foreground, T IRFG (Ω, ν): where the number of regions N r and nonlinear parameters are determined by the particular fit (See (Chluba et al. 2017) however for a much more general formula using spherical decompositions and various foreground components).If a model fit does not have magnitude or spectral curvature parameters, we set them to unity or zero, respectively.
In the left panel of Figure 10, the different colors correspond to the different LST bins, while the shapes correspond to whether we use the p ks −value to determine the best-fit model (triangles), or the evidence (squares).The black dotted, dashed, and solid lines show the RMS error when we use the mean value of the spectral index β R in each region of the REACH mock to reconstruct the foreground, and thus they are not affected by the beam      The blue contours correspond to the best-fit model selected by the evidence, while the orange contours correspond to the model selected by the KS-test.The black contours show the input mock distribution for the intrinsic foreground, as generated from the GSM.Contours which are closer to the center of the mock distribution represent better fits to the intrinsic foreground, but not necessarily to the spectra which result from convolving the foreground with a beam.
or sky-averaging.Given the latter ideal circumstances, the black lines represent in essence the fundamental limit of the model to fit the data.When approximating a sky with many pixels by a few regions (4, 8, and 9 for 1, 2, and 5 LST bins, respectively), one cannot do better than an RMS error of roughly four percent, as shown by the solid black line for 5 LST bins.Interestingly, many of the best fit models (colored lines) produce RMS errors with respect to the intrinsic foreground that are relatively close to the ideal case (black lines), and overall roughly between 5 to 10 percent.It is important to remember that for these cases we did not fit the intrinsic foreground, but the convolution of it with the beam, so once again, we do not expect to obtain parameter values that ideally fit the intrinsic foreground.The evidence and KS-test both choose the same model for the 1 LST bin case, namely that of only spectral indices, which is why their symbols overlap and there appears to be only one red line in the left panel of the figure.The 5 LST bin cases (in cyan) both give around 8 percent error, while there is a slightly larger separation between the two purple lines for 2 LST bins with different model fit statistics (see also Figure 11 below).This is because the best-fit model as selected by the KS-test for the 2 LST bins incorporates magnitude parameters (triangle symbols), while the other fit has only spectral indices (circle symbols).
The right panel of Figure 10 shows the reconstructed foreground (colored regions) as a function of LST.Each shaded region encapsulates all frequencies, with the top of the band corresponding to the lowest frequencies, and the bottom of the band, the highest frequencies.The gray region shows the input mock spectra as a function of LST.Clearly, using more time-dependence information improves how well we can approximate the whole sky, although there is a limit to how many LST bins can be used before the nonlinear model breaks down, as noted above.
Lastly, in Figure 11 we show a triangle plot of the spectral index parameters for the two best-fit models (as determined by the evidence and KS-test, respectively) for 2 LST bins.As before, the input mock spectral index distributions are outlined in black in the one-dimensional panels, and appear as gray contours in the 2-dimensional panels.These distributions represent again the ideal case of intrinsic foreground reconstruction, so the closer the blue (best evidence case) and orange (best KS-test case) distributions are to the centers of the black/gray distributions, the better the reconstruction.Note that the model corresponding to the orange contours has magnitude parameters, while the blue does not.Note also that even for the worst case for the orange con-tours (β 1 ), the reconstructed values are only off from the mean value of roughly −2.87 by ∼ 0.02, well within real observational errors (Guzmán et al. 2011;Mozdzen et al. 2019).Therefore, our reconstruction of the intrinsic foreground using best-fit nonlinear parameters from a fit to the convolved mock spectra is quite encouraging, despite the corruption induced by the chromatic beam.

P-values in the KS-test
Astute readers may have noticed that four of the fits for 5 or 10 LST bins in the linear model column produced p ks > 0.05 and thus passed the null hypothesis test, but according to the analytical calculation of χ 2 red , still had σ > 2. In particular, the last row has a p ks = 0.12 but is σ = 7 away from the E[χ 2 red ].When the data vector is large, as in those cases, a higher threshold for the null hypothesis test would alleviate such apparent tension.By choosing a threshold p-value of 0.5, for instance, a ten-fold increase, some of the models that are currently not, would be colored gray.So the p-value threshold may be made more stringent, in accordance with the nature of the test and data vector.Note also that a higher number of LST bin data vectors for the same amount of integration time implies a greater noise level in each bin.

CONCLUSIONS
There are two primary sources of bias and error in 21-cm signal extraction: the first comes from overlap between signal and foreground models, while the second comes from inadequate foreground models alone which cannot fit mock spectra to the noise level of the data required for signal extraction.In this work, for a given mock foreground spectrum, we tested seven galactic foreground models commonly employed for 21-cm cosmology, including a physically-motivated nonlinear model, a linear model characterized by the eigenmodes of the latter, three polynomials, and two maximallysmooth polynomials in order to determine what residual level, and hence potential bias, they produce, given different model input assumptions and their number of foreground parameters to generate the mock data.
We first simulated mock foreground spectra that incorporated realistic spatial and spectral complexity, beam models, horizon profiles, discrete time-sampling, and low noise levels for different numbers of LST bins, for an EDGES-like experiment.Next, we fitted each of the seven models to the mock foreground spectra for various numbers of parameters and input model assumptions, including the dependence of the forward-models upon their input spatial temperature and spectral index maps.Lastly, we calculated the Bayesian evidence and two goodness-of-fit statistics for each fit: χ 2 red and the p-value, p ks , from a KS-test comparing the noisenormalized residuals of the fit to a unit normal distribution.We recommend using the latter statistic in particular as it denotes goodness-of-fit and model preference between two models when the degrees of freedom are unknown or indeterminable (as in the case of nonlinear models), as opposed to the traditionally used χ 2 red .We found the following results: • For 1 LST bin, to which all the models were applied, the nonlinear model was preferred according to the p ks statistic (p ks = 0.99).This model is able to fit the convolved foreground spectra even when its model spatial inputs, such as the temperature map used for extrapolation or the spectral index map distribution, are different from the mock spectrum, needing 4 parameters in the ideal case and 12 in the worst.On the other hand, the linear model fits the mock spectra well in all cases with about 6-7 parameters (p ks = 0.94, 0.97 respectively), making it only slightly less preferred than the nonlinear model.
• For 1 LST bin, the polynomial and maximally smooth polynomial models do not provide good fits when using only 5 parameters p ks < 10 −6 , as commonly used in the literature for these models.However, we do find that 6 parameters produces a fit which passes the null hypothesis (residuals are indistinguishable from foreground-free residuals) for the linear, logarithmic polynomial model p ks = 0.4, and that it is only three sigma away from the expected reduced chi-squared with a Bayesian evidence comparable to the 7 and 9 term polynomial fits ∆ ln Z ≤ 1.However, both the linearized physical polynomial and the log-log maximally-smooth polynomial models could not fit the mock spectra with any number of parameters.
• When jointly fitting multiple LST bins, we found that the linear model significantly out-performed the nonlinear model (according to the p ks value and our choice of LST-binning).The nonlinear model begins to leave foreground power in the residuals starting at 2 LST bins, and fails goodness-of-fit tests (p ks < 0.05) utterly in every case by 10 LST bins.In contrast, the linear model only begins to leave residual foreground power at 10 LST bins, and even then still achieves good fits when the model is made sufficiently complex, even when the input spatial temperature map or spectral index map are different from those used to generate the mock.The linear model is also orders of magnitude faster than its nonlinear counterpart.
• The p ks -value is a more robust and general statistic to use for model selection and goodness-offit than the Bayesian evidence and reduced chisquared χ 2 red when comparing different foreground models.For nonlinear models in particular, we found several instances where the nonlinear model had high evidence (−234) compared to the linear model (−1130), but no unambiguous χ 2 red goodness-of-fit, as one cannot accurately calculate the degrees of freedom for a nonlinear model.Examining the p ks -values, however, clears up this ambiguity, giving 0.5 for the nonlinear model and 0.78 for the linear model, where the latter is thus clearly preferred.
It is also worth recalling here that for joint global signal plus foreground fits, the linear foreground model is also a desirable choice as its parameters can be efficiently and analytically marginalized over (Rapetti et al. 2020;Tauscher et al. 2021).In addition, the overlap between the linear foreground and global signal models is significantly decreased by employing more LST bins that are modeled simultaneously (Tauscher et al. 2020b), making the linear model desirable for joint analyses.This method of decreasing the overlap between foreground and global signal models increases the number of linear foreground parameters, which is largely how good fits to the mock foreground spectra were obtained in this paper.In contrast, for the other foreground models, the nonlinear and the two types of polynomials, increasing the number of foreground parameters to achieve good fits inevitably leads to more overlap with global signal models.
der.Blanca is jointly funded by computing users and the University of Colorado Boulder.
Figure 12: Bayesian evidence as a function of the number of regions of the sky N r and the number of LST bins (compare to the top panel of Figure 5).For each line-style, the legend shows the relative size of the FWHM of the analytical Gaussian beam as a quadratic function of the band.The first number refers to the degree FWHM of the beam in the low band, while the second number the shows the high band.The dotted and dashed lines therefore have smaller beams across the band than the solid line.For 1 and 2 LST bins (blue and orange contours), the best-fit number of regions saturates to the same value of N r quickly, regardless of the size of the beam.However, when the number of LST bins increases to 5 or 10, the best-fit (according to the highest evidence) number of regions N r increases dramatically for beams which have smaller FWMHs across the band (dotted and dashed lines).

B. TIME DEPENDENCE, SKY REGIONS, AND BEAM SIZE
Figure 12 shows the Bayesian evidence as a function the number of regions that the sky is split into for fits with a beam which is spatially Gaussian and with a Full-Width Half-Maximum (FWHM) that varies as a quadratic polynomial across the band.Different colors show different numbers of LST bins fit, and hence we expect greater time-resolution to require higher spatial resolution-or more regions-which is indeed what the figure shows.Moreover, a larger beam size averages more of the sky together, making finer resolution sky regions indistinguishable in the convolved spectra.The solid line shows the case for a beam which has a FWHM of 80 degrees in the low band, and varies up to 120 degrees in the high band.For comparison, the dotted and dashed lines show beams which have FWHMs which are 30 and 60 degrees smaller across the band compared to the solid line.Clearly decreasing the size of the beam increases the number of distinguishable sky regions, and hence also increases the time-resolution of the forward-model.

Figure 1 :
Figure 1: Base temperature map and spectral index map for the intrinsic foreground in the simulated data.The latter map is calculated by forming the natural logarithmic ratio of the two temperature maps generated by the GSM at 408 and 230 MHz, modulated by their reference frequency ratio.

Figure 2 :
Figure 2: E and H Plane polar cuts of the EDGES antenna beam used to convolve the intrinsic foreground mock.The units shown are normalized, absolute units, such that the integral of the beam over the entire sky is unity at every frequency.The different colors represent the beam at various frequencies, as shown in the legend, from roughly 50 to 90 MHz.

Figure 3 :
Figure3: Simulated Mock Spectra (REACH Mock, top panel) and noise levels (bottom panel) plotted for various numbers of LST bins: 1 spectrum at LST 12 hr (magenta), 2 spectra at 6 and 18 hr (green), 5 spectra at 2.4, 7.2, 12, 16.8, 21.6 hr (cyan), and 10 spectra at 1.2, 3.6, 6, 8.4, 10.8, 13.2, 15.6, 18, 20.4, and 22.8 hr (red).The vertical black lines delineate each LST bin.We also plot the radiometer noise levels for the mock data (bottom panel), as well as flat 20 mK and 100 mK noise levels in black dashed lines for comparison.Black vertical lines delineate different spectra when the number of bins is greater than 1, and channels refers to a particular frequency in the bandwidth for a particular LST spectrum.Thus, channel 0 corresponds to 51 MHz for the first spectrum, while channel 93 refers to 51 MHz for the second spectrum, and so on.

Figure 4 :
Figure4: Approximation of the intrinsic spectral index map by dividing the sky into N r equal-percentile bin regions.Each region j is then assigned parameters which are constant throughout the region.This particular map is generated from the spectral index map used in the intrinsic mock foreground, β R .

Figure 5 :
Figure 5: The linear model evidences ln Z and χ 2 red for fits to different numbers of LST bins and regions N r .The models shown here (solid, dashed, and dotted) are when the model inputs (BM, PM) are the same as those in the REACH mock, the ideal case, but the number of nonlinear parameters included varies.Each LST bin tends to saturate at a maximum number of regions, with greater LST bin mock foregrounds requiring greater spatial resolution.The solid, colored dots in the bottom panel correspond to the expectation value E[χ 2 red ] for each model input and number of LST bins.Similar analyses are carried out when the model inputs differ from the REACH mock.lowing Kass & Raftery (1995), we select the model input with χ 2 red which is the least number of σ away from the expected reduced chi-squared, E[χ 2 red ].If several model inputs have similar evidences and number of sigma, we select the model input that has the smallest N r .We also use this same method of model parameter number selection for the polynomial and maximally smooth models.Figure5shows both the evidence ln Z and χ 2 red for each number of LST bins (blue, orange, green, and red for 1, 2, 5, and 10, respectively), and when different nonlinear model parameters are included in the matrix of eigenspectra F fg (solid, dotted, and dashed lines for [β j ], [β j , A j ], [β j , A j , γ j ], respectively).It is interesting to note that the evidence for the linear models saturates at a given number of regions, dependent upon the number of LST bins.In general, one region is typically enough for 1 LST bin, while at least 2 regions are required for 2 LST bins, 5 regions for 5 LST bins, and 12 regions for 10 LST bins.We have checked that this pattern is consistent across different model inputs (BM, PM) and numbers of nonlinear parameters included.This trend reflects the fact mentioned above that more LST bins requires higher spatial resolution (see Appendix B).The filled circle symbols in the bottom panel of Figure5mark the expected values E[χ 2 red ] for each fit.

Figure 6 :
Figure 6: Mollvweide view and histogram of the difference between the temperature BM used to create the REACH mock (T 408 ) and the model input (T 300 ).The right panel shows the temperature difference per pixel, which peaks around 12-17 K, while the left panel shows the view on the galactic-centered sky.Pixels which are cyan have model input values very close to the REACH mock, while those in magenta are more different.Note that these errors still trace out the features of the galaxy, and are thus indicative of errors which are proportional to the temperatures in each pixel.Hence, the pixel errors near the galactic center, which is brighter, are larger.

Figure 7 :
Figure 7: Mollweide view (left panel) and histogram (right panel) of the difference between the spectral index patch map used to create the REACH mock (408/230) and the patch map used as a model input (300/130).Each region in ascending order is assigned a number between 0 and 7, before the difference is taken.Pixels which are in the same region in both the mock and model input are shown in cyan in the left panel and have a difference value of zero, while those which differ by one region have a value of one, and so on.

Figure 8 :
Figure 8: Residuals from the best-fitting (BF) cases from Table 2 for the linear and nonlinear models.Each panel is denoted by the number of LST bins in the fit.Nonlinear models use dotted lines and have N L in their label, while linear models use solid lines and are denoted by L.

Figure 9 :
Figure9: Panel (a) shows several example residuals produced from fitting polynomial models to the REACH mock.Solid (dashed) lines denote 5 (9) term fits.The input noise level is shown as a solid black line, while the red, cyan, and magenta lines denote the linear physical, linear-logarithmic, and linear polynomial models, respectively.Panel (b) shows similar example residuals for the maximally smooth polynomials, with solid (dashed) lines for 5 (10) parameter fits.Blue denotes the difference polynomial and orange the log-log polynomial model.
(a) RMS Error of the Best-fit Nonlinear Foreground Models Compared to the Input Intrinsic Foreground.(b) Reconstructed foreground as a function of LST.

Figure 10 :
Figure10: A comparison of the reconstructed intrinsic foreground as determined by the best-fit nonlinear parameters from fits to the convolved spectra.The left panel shows the RMS error in the reconstruction (as computed with respect to the input intrinsic foreground) as a function of frequency, where the different colors (red, purple, and cyan) label the different LST bin cases (1, 2, and 5 respectively).The triangle symbols label models which were selected as best-fit via the KS-test, whereas the squares denote models selected using the Bayesian evidence.The black lines correspond to the ideal case, without the beam or averaging over the sky, by using the mean spectral index in the input intrinsic foreground for each spectral region, with the number of regions N r = 4, 8, and 9, for 1, 2, and 5 LST bins, respectively.The right panel shows the reconstructions of the foreground as a function of LST for the models selected via the KS-test.The gray band shows the input intrinsic foreground.The top (bottom) of the band corresponds to the lowest (highest) frequencies.

Figure 11 :
Figure11: Example triangle plot for the 2 LST bin case, showing the best-fit spectral indices for each region, N r = 1 − 8.The blue contours correspond to the best-fit model selected by the evidence, while the orange contours correspond to the model selected by the KS-test.The black contours show the input mock distribution for the intrinsic foreground, as generated from the GSM.Contours which are closer to the center of the mock distribution represent better fits to the intrinsic foreground, but not necessarily to the spectra which result from convolving the foreground with a beam.

Table 3 :
Polynomial Model fits to the REACH mock.The color codes here are the same as for the nonlinear and linear model fits in Table2.

Table 4 :
Maximally Smooth Polynomial fits to the REACH mock.The color codes here are the same as for the nonlinear and model fits in Table2.