Precision Calibration of Radio Interferometers for 21 cm Cosmology with No Redundancy and Little Knowledge of Antenna Beams and the Radio Sky

We introduce CALibration AMITY (calamity), a precision bandpass calibration method for radio interferometry. calamity can solve for direction-independent gains with arbitrary frequency structure to the high precision required for 21 cm cosmology with minimal knowledge of foregrounds or antenna beams and does not require any degree of redundancy (repeated identical measurements of the same baseline). We have achieved this through two key innovations. First, we model the foregrounds on each baseline independently using a flexible and highly efficient set of basis functions that have minimal overlap with 21 cm modes and enforce spectral smoothness in the calibrated foregrounds. Second, we use an off-the-shelf GPU accelerated API (tensorflow) to solve for per-baseline foregrounds simultaneously with per-frequency antenna gains in a single optimization loop. GPU acceleration is critical for our technique to be able to solve for the large numbers of foreground and gain parameters simultaneously across all frequencies for an interferometer with ≳10 antennas in a reasonable amount of time. In this paper, we give an overview of our technique and, using realistic simulations, demonstrate its performance in solving for and removing pathological gain structures down to 4.5 orders of magnitude below the level of foregrounds and consistent with our simulated thermal noise limit. If readers want to start using calamity now, they can find a tutorial notebook online.

Corresponding author: Aaron Ewall-Wice aaronew@berkeley.edu The primary obstacle to a successful measurement of 21 cm fluctuations are bright radio foregrounds which dwarf the cosmological signal by four-to-five orders of magnitude.Because they are spectrally smooth, these foregrounds can be circumvented by taking advantage of the fact that they are described by a small fraction of the spectral shapes that comprise 21 cm fluctuations (Di Matteo et al. 2004;Morales & Hewitt 2004).More precisely, foregrounds as observed by an interferometer are contained within a "wedge" shaped region of the 3D Fourier space of the sky (Datta et al. 2010;Morales et al. 2012;Parsons et al. 2012a;Vedantham et al. 2012;Pober et al. 2013;Thyagarajan et al. 2013).In principal one can excise these wedge modes, leaving enough SNR for a detection (Pober et al. 2014).
While the intrinsic properties of foregrounds are amenable to 21 cm signal recovery, spectral variations within direction dependent and direction independent antenna gains imprint additional spectral structures beyond the wedge.Sources of spectral structure that can leak power outside of the wedge include reflections sourced by impedance mismatches within the analog sig-nal chain (Beardsley et al. 2016;Ewall-Wice et al. 2016;Kern et al. 2020a), direction dependent mutual coupling from reflections of sky signal between antenna elements (Kern et al. 2020a;Fagnoni et al. 2021;Josaitis et al. in preparation.),digital effects (Prabu et al. 2015;Barry et al. 2019), injudicious choices in RFI flagging (Offringa et al. 2019) and polarization leakage (Moore et al. 2013;Kohn et al. 2016;Asad et al. 2016Asad et al. , 2018)).Since the amplitude of foregrounds is roughly four-to-five orders of magnitude greater then the 21 cm signal, leakage introduced by these effects must either be suppressed by a factor of roughly five orders of magnitude in the design of the instrument, or corrected to a similar level through calibration (Trott & Wayth 2016;Thyagarajan et al. 2016;Ewall-Wice et al. 2016;Patra et al. 2018).
To date, bandpass calibration methods either rely heavily on the correctness of a sky-model and perantenna beams or make assumptions about the reproducibility of the beams and the exact positions of the antennas.Neither of the requirements under-girding calibration techniques have been realized in the field and it is unclear whether they can be.In this paper, we introduce a new approach that only relies on the more robust assumption that foregrounds, as observed by our antenna beams, occupy a limited number of smooth spectral modes contained within the wedge.Our technique does not require that the beams are identical between antennas nor does it require any degree of redundancy between baselines.It works well on both minimally redundant and redundant arrays.Our technique achieves this by solving for per-frequency gains across all spectral channels simultaneously with smooth foreground parameters on each individual baseline in the array.Our per-baseline smooth degrees of freedom approach stands in contrast to the existing techniques that attempt to model foreground emission and beam variations as sources with associated positions on the sky.For modern interferometers, this necessarily involves optimization of tens to hundreds of thousands of free parameters which we accomplish by implementing our technique using the off-the-shelf GPU accelerated tensorflow API.We call our technique CALibration AMITY (calamity) because we believe it represents a significant improvement in the amity between detecting 21 cm fluctuations and living with the direction independent gain features that arise in radio interferometers.
This paper is organized as follows.In § 2, we give a short overview of calibration techniques being employed to calibrate 21 cm cosmology arrays and briefly introduce our notation.In § 3, we discuss our modeling approach in formal detail along with how its performance scales with the number of antennas and frequency chan-nels.In § 4, we discuss simulations of existing arrays to furnish several test-cases for our technique.In § 5, we explore some of the differences to expect when applying calamity to a highly redundant versus a minimally redundant array (it can be used effectively on both).In § 6, we investigate whether calamity In § 7 we discuss calamity's runtime and memory scaling with array size.In § 8, we emphasize our techniques limitations and conclude in § 9 2. PREVIOUS APPROACHES TO CALIBRATION Calibration seeks to solve for and correct for the impact of N inst different complex instrumental parameters given some set of N data interferometric measurements.Adopting the notation of Byrne et al. (2021b), one typically models the observed cross-correlations between antenna a and b at frequency ν, ζ abν as where g aν are the direction-independent gain parameters for antenna a at frequency ν.If there are N ant antennas and N ν frequency channels, these gains form a set of N ν N ant complex numbers.m abν (u sys , u fg ) represents a model of all direction dependent systematics which includes intrinsic foregrounds which we parameterize with u fg , and direction dependent systematics like beam effects and ionospheric distortions which we parameterize with u sys .
A number of strategies have been explored to solve for gains which broadly fall into two categories.Within each category, these techniques are distinguishable primarily in their approaches on how they parameterize g and m, which parameters they solve for, and which they keep fixed.

Solve for Direction Independent Gains.
In this approach, one assumes a fixed model for the foregrounds and the direction dependent antenna beams and attempts to solve for g only.This method was an excellent strategy for instruments whose fields of view could be dominated by a single bright and well understood source (e.g.Baars et al. 1965) and the conservative approach of not allowing the sky to vary can reduce the risk of signal loss.Direction independent gain solvers being used for 21 cm cosmology include stefcal (Salvini & Wijnholds 2014) and cubical (Sob et al. 2020), self-holography (Gueuning et al. 2020).The large fields of view at low frequency and the extreme dynamic range requirements of ∼ 10 5 pose a serious obstacle since any calibration observation will have a multitude of sources off of boresight contributing significantly to observed visibilities.Hence, m must not only involve accurate models of faint sources which are today highly uncertain (Jacobs et al. 2013) but also precision a-priori knowledge of the beam(s).The state-of-the-art in beam modeling today is far behind the level of precision required for 21 cm (Pober et al. 2012;Newburgh et al. 2014;Neben et al. 2015;Colegate et al. 2015;Sutinjo et al. 2015;Berger et al. 2016;Neben et al. 2016;Sokolowski et al. 2017;Jacobs et al. 2017;Line et al. 2018;de Lera Acedo et al. 2018;Nunhokee et al. 2020).If we fit for enough spectral degrees of freedom to correct intrinsic gain structures outside of the wedge, beam and source model errors leak power into the EoR window at levels that far exceed the 21 cm signal (Barry et al. 2016;Trott & Wayth 2016;Ewall-Wice et al. 2017;Joseph et al. 2020).While this might be circumvented by excluding / downweighting long baselines (Ewall-Wice et al. 2017), doing so requires an accurate model of diffuse emission which is highly polarized (Lenc et al. 2016).Polarized diffuse sky models are being constructed for calibration Byrne et al. (2021a) and work remains on determining the requirements on such a model for short baseline calibration to be effective.Even if the beams and foregrounds were known to sufficient accuracy, ionospheric fluctuations corrupt calibration solutions (e.g.Intema et al. 2009;Jordan et al. 2017;Gehlot et al. 2018;Yoshiura et al. 2021) albeit in a way that may average down with time (Vedantham & Koopmans 2015).Thus, solving for direction independent gains given a sky-model has so-far required making potentially innacurate a-priori assumptions of the gains being spectrally smooth (e.g.Beardsley et al. 2016;Mertens et al. 2020) while fine-frequency degrees of freedom must be modeled using other measurements such as autocorrelations (e.g.Ewall-Wice et al. 2016;Li et al. 2019;Kern et al. 2020b), beam variations (Line et al. 2018), and simulations of digital effects (Barry et al. 2019).In its current state, gain-only calibration requires spectrally smooth priors which may not be realized in the field.

Solve for Direction Dependent Gains and a model
of the Sky.
One solution to resolving beam, ionosphere, and foreground modeling errors is to explicitly solve for them.Foreground models are often encoded as a list of source locations and spectral parameters but have also been represented by extended spatial shapes such as wavelets (Gu et al. 2013;Chapman et al. 2013) to more efficiently describe diffuse emission.Packages such as ddfacet (Tasse et al. 2018), sagecal (Kazemi et al. 2011;Kazemi & Yatawatta 2013), the MWA Realtime System (Mitchell et al. 2008), facetcal (van Weeren et al. 2016) and cubical (Kenyon et al. 2018;Sob et al. 2020) all attempt to solve for the direction dependent Jones matrices of the antennas along with a parameterized source model (such as locations and spectral evolution) and ionospheric distortions.Several of these methods such as the rts and sagecal also have direction independent functionality which has appeared in the literature.The primary obstacle to increasing the number of degrees of freedom in our foreground and telescope models is the prospect of unintentionally subtracting components of the 21 cm signal itself.For example Patil et al. (2016) report potentially problematic suppression of diffuse polarized emission by sagecal.The solution these authors adopt in Patil et al. (2017) is to exclude short baselines (which are dominated by the diffuse emission) from calibration but this runs into the issue of contamination from long baselines if the beams are not identical between antennas.Further refinements to algorithms such as sagecal-co (Yatawatta 2016;Gehlot et al. 2019;Mertens et al. 2020) use concensus optimization to force the direction dependent gains to follow smooth polynomials.This mitigates the contamination from long baselines along with signal suppression but still leaves unresolved the issue of fine frequency scale direction independent gains.
A popular method for constraining a sky and beam model with arbitrary frequency resolution while attempting to limit signal loss and errors incurred by innacuracies in apriori models is to assume that all antenna beams are identical and exploit the redundancy between repeated copies of the same baseline to solve for sky parameters and gains up to some frequency dependent degeneracies before referring back to the sky (Wieringa 1992;Liu et al. 2010;Zheng et al. 2013Zheng et al. , 2014;;Dillon et al. 2020).Unfortunately, high redundancy is difficult to accomplish due to mechanical variations between antennas (Line et al. 2018;Kim et al. in preparation.),positional errors, and over-the-air mutual coupling between antennas (Fagnoni et al. 2021;Josaitis et al. in preparation.).Joseph et al. (2018) demonstrate that small errors in antenna positions lead to significant biases in redundant calibration solutions and Orosz et al. (2019) find that gain solution errors arising from expected levels of non-redundancy introduce foreground contamination can exceed 21 cm fluctuations by orders of magnitude while Choudhuri et al. (2021) determined that these errors also cause mild decoherence in the 21 cm fluctuations themselves.The equations in redundant calibration still admit frequency dependent degeneracies which must still be solved for by referencing back to a sky-model and the associated errors can be amplified by redundant antenna arrangements (Byrne et al. 2019).
Several authors have proposed "hybrid" approaches such as solving for non-redundant degrees of freedom (Sievers 2017), projecting the degeneracies in a redundant sub-array calibration solution onto a sky-model based calibration solution (Li et al. 2018), and using skymodels to inform priors on redundant solutions (Byrne et al. 2021b).Some of these techniques demonstrate improvements over purely redundant or sky-based approaches.None have demonstrated that the errors in gain solutions with arbitrary spectral degrees of freedom are suppressed sufficiently in the presence of nonredundancies to allow for a 21 cm detection.
All calibration approaches must balance a tradeoff between flexibility and signal loss.A calibration technique must be flexible enough to model and correct fine-scale frequency structures in the instrumental gains along with antenna-to-antenna variations in the beams.However, more flexible techniques run the risk of subtracting actual 21 cm fluctuations or introducing additional errors.An example of the latter, calibration with a fixed sky-model but with completely flexible frequency degrees of freedom in the gains compensates for the disagreement between the true and measured sky on long baselines by inserting fine-scale frequency structure.When performed on an array that is not truely redundant, redundant calibration does the same thing.Solving for the direction dependent gains towards bright foregrounds as point sources adds additional flexibility but still introduces errors since the radio sky is comprised of large numbers of faint and extended sources that are not completely encapsulated by any point source model.For example, Yoshiura et al. (2021), find that solving for direction dependent gains in the directions of a small number of the brightest sources actually increases systematics in the EoR window over direction independent calibration.
In this paper, we introduce calamity which is a calibration method that is flexible enough to deal with nonredundant beams and direction dependent effects, and a completely unconstrained radio sky while at the same time can be demonstrated to not introduce biases within the EoR window that can prevent a robust 21 cm detection.Most existing calibration techniques model direction dependent effects as a source model tied to the sky modulated by different direction dependent systematics.Our approach is to instead model both the direction dependent effects and the foregrounds per-baseline as linear combinations of discrete prolate spheroidal sequences which are a maximally efficient basis for modeling band limited signals (Slepian 1978) and have proven to be excellent for modeling per-baseline foregrounds (Ewall-Wice et al. 2021).More explicitly, we model the visibility formed from antennas a and b, including all direction dependent beam effects, m abν from equation 1 as where A iν (τ ab ) is the discrete prolate spheroidal sequence (DPSS) (Slepian 1978) with a length of N ν and standardized half bandwidth of W ab = 2N ν ∆ντ ab where τ ab is a delay approximately equal to the light travel time between antenna a and antenna b (the horizon delay at the edge of the wedge) and ∆ν is the channel width.Each u iab is a coefficient which must be solved for by fitting to measurements and each visibility has its own series of u iab .The number of DPSS vectors required to describe band-limited signals with delay half-width of (Slepian 1978).Thus, we model each intrinsic visibility with a sum of one-dimensional sequences in frequency.We refer the reader to Ewall-Wice et al. ( 2021) for a more detailed description of this modeling strategy.Doing this, instead of sky-based modeling, ignores correlations between the baselines and limits any increases in variance and signal loss, to the foreground wedge while achieving a high degree of flexibility to model systematics.Our method's philosophy is to model the foregrounds and direction dependent effects with a highly flexible and degenerate set of parameters which incur significant variance and biases within the wedge while leaving the EoR window relatively free of biases.
We implement our technique in python using the optimization and auto-differentiation machinery of the tensorflow library (Abadi et al. 2015) which provides a readable and flexible code base for large-scale modeling and optimization problems with out-of-the-box GPU support.

calamity
In this section, we describe calamity in detail § 3.1, how it can be modified to reproduce the redundant calibration approach § 3.2 (though we can calibrate redundant arrays perfectly well without this modification).

A Description of the Method
Incorporating equation 2 into equation 1 yields the complete model for our data as Since we model each visibility with an independent set of u iab , our model is flexible enough to account for all variations caused by the ionosphere and non-redundant beams as long as they are below τ ab for a particular baseline.
To take advantage of fast rectilinear matrix operations on a GPU we represent our foreground and direction dependent systemics parameters, u, as an N b × N max v tensor where N max v is the maximum number of vectors needed to model any of our baselines and is approximately equal to the maximum standardized half bandwidth of any baseline in our array Max ∼(a,b) (W ab ).Because many of our short baselines require fewer components to model, this means that a large fraction of our u elements are zero.We represent A as an N max v ×N b ×N ν tensor which dominates our technique's memory requirements.We set the rows of A that correspond to modeling vectors that are unused on short baselines to zero as well.
To solve for u and g, we minimize the mean-squared errors between our instrument model and the measured visibilities v abν , using the adamax algorithm (Kingma & Ba 2014).w abν stands for a per-baseline and per-frequency weight which can take into account, for example, flagged channels and differing noise levels per-baseline.While it is computationally expedient, first order gradient descent will only converge to the local minimum closest to our initial guesses for g and u.Thus, our technique as written should be used as a second stage of calibration after a first stage that obtains initial guesses that fall within the convex region containing the global minimum of equation 4 such as firstcal (Parsons et al. 2014;Dillon et al. 2018;Li et al. 2018;Dillon et al. 2020).As written, the cost function in equation 4 is completely degenerate with respect to a single complex gain multiplier.We break this degeneracy by adding a minimal prior, p(u) to L on the sum over all baselines and channels in the data.
where m abν is a sky model which can have relatively low detail and merely serves to fix degeneracies like the overall gain scale.w abν is a set of prior weights which we can adjust to set the strength of the prior.For this work we set w abν = w abν = (N b N ν ) −1 for all weights.
One could also drop this prior from the optimization since the degeneracy, by definition does not effect the first likelihood term and set an overall gain scale after completing optimization by taking the ratio between the sum of measured visibilities and the sum of model visibilities.

Redundant Calibration with calamity
If we are highly confident that our visibilities are actually redundant, we can reduce the set of modeling vectors u abi and A iabν coefficients so that a single set of modeling vectors and coefficients describes all baselines within a redundant group.If we label our redundant groups by R, then equation 4 reads (6) Thus, calamity is equipped to perform standard redundant calibration if the user so wishes.

THE PERFORMANCE OF calamity IN RECOVERING 21 CM FLUCTUATIONS IN THE EOR WINDOW IN REALISTIC SIMULATIONS
In this section, we test calamity on simulated visibilites for a minimally redundant array containing realistic foregrounds and calibration errors.Since calamity can calibration a minimally redundant array with minimal sky knowledge, it follows that it can calibrate nearly redundant arrays (redundant arrays with positional and beam errors) as well.We will briefly address the redundant case in § 5 In § 4.1, we describe our simulations

The Simulations
We test calamity on examples of two different array paradigms in wide use today -a highly redundant hexagonal layout and a minimally-redundant layout.Beyond the already discussed calibration strategy that they enable, redundant arrays are intended to enhance sensitivity to a handful of modes and increase our leverage on systematics by including many repeated copies of the same baseline separation.This allows us to beat down per-antenna features by crossing and averaging different copies of the same redundant baseline (Parsons et al. 2014;Ali et al. 2015;The HERA Collaboration et al. 2021).Randomized arrays have the advantage of reducing the risk of signal loss from calibration overfitting by averaging more independent cosmological modes into each k-bin.Lanman & Pober (2019) also note that this reduces sample variance.We compare calamity's performance on both types of arrays by considering a minimally-redundant and redundant subset of antennas from the compact configuration of the MWA phase-II (Wayth et al. 2018).For our minimally-redundant array, we choose the first thirty-six antennas that comprise the majority of the MWA's compact minimallyredundant core.For our redundant array, we use one of the MWA phase-II's two thirty-six antenna hexagonal patterns.The primary beam model for MWA we used for the visibility simulations is described in Sutinjo et al. (2015) and Sokolowski et al. (2017).Our simulations are run using the pyuvsim visibility simulator package1 (Lanman et al. 2019).We simulate fifty-six, two second integrations starting at an LST of ≈ 1.8 hours.Visibilites are simulated for 384 channels with 80 kHz channel width spanning 167 − 187 MHz bandwidth.The foreground model for the simulated visibilities is taken from the GLEAM catalog (Hurley-Walker et al. 2016) contains all sources with wide-band integrated flux F int > 50 mJy.Sources in the foreground model are assumed to be flat-spectrum over the simulation bandwidth, and the a single wide-band integrated flux value per source is used for all frequency channels.While the true foregrounds are not spectrally flat, the majority of spectral variation originates from antenna beams anyways and Ewall-Wice et al. ( 2021) have established the robustness of using DPSSs to model foregrounds with direction dependent spectral variation.The EoR model used for the simulations is a flat power spectrum P (k) = 12482.356K2 Mpc 3 .This corresponds to noise sky of variance 9 K 2 in image space in zeroth channel.The EoR model is generated in the healpix projection with nside=256 that corresponds to a spatial resolution of 13.74 arcmin.The EoR model is stored in a SkyModel object of pyradiosky 2 package for seamless readability by pyuvsim.Table 1 summarizes the simulation details.

Reduction of Simulation Products and Power Spectra
Our foreground simulation contains beam interpolation artifacts which set an intrinsic systematics floor at ≈ −45 dB below the foreground brightness levels, therefor we scale our EoR fluctuations to be ≈ −40 dB below the level of foregrounds.While this is brighter then the expected EoR signal levels, this arbitrary scaling does not affect the conclusions of this paper.For each baseline and time, we simulate two independent realizations of thermal noise scaled to have a standard deviation on each baseline that is 10× the level of the 21 cm signal and generate two effective measurements with an independent noise realization at each baseline, frequency and time which we will interleave to form power spectra in the fashion of Dillon et al. (2014).We set our simulated gains to be constant over the 112 seconds of observation.We set each frequency and time sample to be unity plus an independent normally distributed real and imaginary component with an amplitude of 0.1.Our simulated vis-ibilities can thus be broken down into ) where F ab is the foreground visibility for baseline ab, S ab is the EoR visibility and N ab are the even (odd) noise realizations.To demonstrate calamity's ability to calibrate arbitrarily fine gain structures, we set our frequency dependent simulated gains to be highly chromatic and equal to a normally distributed noise with a standard deviation in the real and imaginary parts of 0.1 added to a baseline gain of unity.We assume that the gains are constant in time.
Running calamity provides us with estimates of gains and foregrounds for each even/odd integration which we denote as g a (ν) and F ab (ν, t) respectively.We compute a calibrated residual at each time and frequency, (8) We fringe-stop each residual and average coherently over the full 112 seconds of observation.This has been demonstrated to lead to negligible signal loss for the baseline lengths under consideration (Ali et al. 2015).We next form power spectra for pairs of baselines ab and cd at two different times t and t by computing the complex conjugated products between delay-transformed simulated data with our two independent noise realizations and scaling to cosmological power spectrum units using the relation (Parsons et al. 2012a) where k ⊥ = 2πb ab /X, k = 2πτ /Y , Ω pp is the solid angle integral of the primary beam squared, B pp is the frequency integral of the passband squared, X and Y are linear scaling factors relating transverse and parallel distances to angle and frequency distances respectively and V ab (τ, t) is the delay-transform of V ab (ν, t) (Parsons et al. 2012b) where T (ν) is a frequency tapering function and ∆ν is the channel width.For all power spectra, we use a Blackman-Harris taper.We compute power spectra for the uncalibrated visibilities (P V ) as well as our calibrated residuals (P R ), the EoR fluctuations (P S ), noise only (P N ) and the EoR fluctuations combined with noise and foregrounds (P S+F +N ).For the redundant array, we only form products between different baselines within the same redundant group (ab = cd).For  the minimally redundant array, we take the products of baselines with themselves (ab = cd).We estimate error bars based on the RMS amplitude of thermal noise power spectrum, σ N and the RMS amplitude of the 21 cm power spectrum σ S (McQuinn et al. 2006), When we calculate noise only power spectra, we neglect the σ S term and when we calculate signal only power spectra we neglect the σ N terms.

Simulation Results
We now discuss the performance of calamity in calibrating a minimally redundant array.Before diving into the detailed results, it is illustrative to inspect how well our fitted gains and calibrated data match the injected gains and data to better understand where we can trust calamity.
We inspect the fitted gains for antennas 0 and 1 in our minimally redundant array in Fig. 2. Our fit-ted gains to not agree particularly well with the true gains, only on the 20% level on all delays.However, the quantity that governs whether a calibrated visibility formed from antennas a and b is smooth is In the right hand panel of Fig. 2, this quantity is much smoothers.We see in Fig. 3 that our foreground model on this baseline only agree at the 10% level.The disagreements between the foregrounds and gains alone arise from frequency dependent degeneracies which might be broken by incorporating additional information such as baseline-baseline covariances or a more stringent prior on the foreground model and beams, neither of which are available with the current state-of-the art.While we may not have have highprecision estimates of the foregrounds or the gains alone, we do obtain high precision estimates of the actual quantity we optimized for by minimizing equation 4 -the products of the gains with the foreground model.In Fig. 4 we show our model for the measured visibility between antennas 0 and 1 which is the product of the two gains and the model for that visibility.While the foreground model and gains alone only agree with the true gains and foregrounds at the 10% level, the products of the gains and foregrounds are in excellent agreement and are identical to within the uncertainties from thermal noise.If we subtract the model of foregrounds leaked by gain errors.If we subtract our estimate for the foregrounds multiplied by the gains, then we eliminate the contamination introduced by foregrounds being leaked by gains into fine-spectral scales.This is precisely what calamity does and in this sense, it might be better thought of as a filter then a calibration technique since it identifies and subtracts power that is described by wedge-like foreground modes leaked by antenna gains at high precision.
In Fig. 5, we inspect the RMS average of the delaytransformed visibilities.We see that the per-baseline R values recovered by calamity are broadly consistent with the levels of injected thermal noise fluctuations and orders of magnitude lower then the injected gain features.
While the residuals are broadly consistent with the noise outside of the wedge, there is a noticeable reduction in amplitude inside of the wedge.This is due to the fact that our per-baseline modeling cannot distinguish between foreground modes and thermal noise or cosmological fluctuations.Thus, subtracting our guess at the the foregrounds leads to signal loss inside of the wedge.What does remain within the wedge are errors in the gain solutions which exist at a lower level then the thermal noise.For the redundant array, we also find that we get roughly the same residual when calibrating with redundant modeling using a single set of foreground coefficients to describe all baselines within each redundant group (equation 6).We hypothesize that this is because when we use multiple foreground coefficients within a truly redundant group, we are simply adding repeated copies of the same set of parameters which has no effect on our final answer.
We inspect the real parts of wedge power spectra in Fig. 6.In our simulated visibilites without calibration errors, foreground power is contained within the wedge (left panel).Our calibration solutions are uncorrelated between frequencies and as a result, the entire EoR window is filled in with systematics (left of center panel).After running calamity, these systematics are effectively removed (center panel) and over most cylindrical modes, these residuals are roughly consistent with the injected EoR signal level (center right panel) albiet with clear postive and negative fluctuations arising from noise (right panel).To further beat down noise, we must average in spherical k-bins.
In Fig. 7, we show spherically averaged power spectra for P V , P S+F +N and P S averaging over all sampled Fourier bins.We compare with P R spherically averaged over all Fourier bins lying outside of the wedge plus a 100 ns buffer.We find that P R is in broad agreement with with P S beyond k 0.2h −1 Mpc −1 when we only spherically average modes outside of the wedge.As shown in Ewall-Wice et al. (2021), our residual does significantly better then a tapered estimator (approximated by the black points) which is contaminated by Blackman-Harris sidelobes.When we include modes inside of the wedge in our spherical average, we encounter negative bias out to ≈ 0.3hMpc −1 .This is because we subtracted foreground estimates inside of the wedge which are completely degenerate with 21 cm modes.
In Fig. 7, we also show calibration residuals from simulations with foregrounds and noise but no EoR.Calibration residuals are an order of magnitude lower then the EoR, at a similar level to the noise only simulations, and ≈ 90 dB below the level of the foregrounds in the power-spectrum domain.Our minimally-redundant noise residuals show a slight positive bias that is at roughly the same level as the thermal noise at small k values.We hypothesize that this could be caused by residual gain errors that still leak foregrounds or artifacts in the simulation itself.

PERFORMANCE OF calamity ON A REDUNDANT ARRAY
We have demonstrated that on a minimally redundant array calamity can remove foreground leakage into the EoR window at dynamic ranges necessary for 21 cm cosmology, even when our gains contain large amounts of fine frequency structure.In this section, we examine its performance on a highly redundant array (the "Redundant" MWA subarray) and compare to the performance on our minimally redundant case.In Fig. 8, we compare spherical averages of Fourier modes outside of the wedge with a 100 ns buffer.
We observe that our redundant array suppresses calibration errors by one-to-two orders of magnitude compared to our minimally redundant array.This is because when we average over power spectra formed from pairs of non-identical baselines in each redundant group, we average over independent calibration systematics which average down.We are not able to do the same in our minimally redundant array since each baseline samples a unique sky mode and must be crossed with itself.Because we are able to coherently average over more baselines (rather then incoherently average), the noise-only power spectrum is also significantly reduced for our re- Figure 2. A comparison between the real parts of our simulated (solid lines) and recovered (dashed lines) gains for antennas 0 and 1, part of our "Minimally Redundant" array.We use gains that are completely uncorrelated between frequencies to make the spectral calibration challenge as difficult as possible.(This is substantially more chromatic then what is typically encountered in the field.)Our gain estimates do not agree particularly well with the true injected gains.In fact, their residuals are on the order of 10% (dotted lines).Right Panel: Even in the delay domain (with a Blackman-Harris taper), the residuals are at 10% at all delays.While it may be tempting to think that such large residuals have catastrophic implications for 21 cm cosmology, they arise from degeneracies between the foreground wedge and the gains and between the different gain frequencies themselves.The gain correction ratio, g0 × g1/(ĝ0 × ĝ * 1 ) − 1 (black dotted line) is smooth at the level of the thermal noise fluctuations.Thus, the calibrated foregrounds agree with our data at high precision (Fig. 4).Comparison between the true simulated foregrounds (dashed lines) and our fitted foreground model (dashed lines) on a single baseline and time snapshot.Like the gains (Fig. 2), our foreground model does not agree particularly well with the True foregrounds and the residual (dotted line) is on the order of ≈ 10 − 20% and significantly exceeds the level of the thermal noise.Right Panel: The same as the left, but in delay space with a Blackman-Harris taper.By definition, our foreground model resides within a compact region of delay space.As the model is contained within the foreground wedge, subtracting it does not introduce any additional spectral structures beyond the wedge.
dundant simulation, relative to our minimally redundant simulation.
Another noteable difference between the redundant and minimally-redundant calibration residuals in Fig. 8 is that the calibration residual in our redundant array has a negative bias at larger k than the minimallyredundant array.This is due to the fact that our gains absorb some 21 cm signal at all frequency scales which incurs signal loss.Each gain absorbs 21 cm contributions from the ∼ N ant baselines that each antenna participates in.In our minimally-redundant array, each of these 21 cm visibilities is statistically independent while in our redundant array, these visibilities are highly correlated within a redundant group, reducing the effective number of independent 21 cm realizations that are averaged down in each gain solution which increases sig-  Comparison between the real part of an uncalibrated visibility in our simulations at a single time snapshot (solid line) with our best fit model (dashed line).While our fitted gains are in relatively poor agreement with the true gains (Fig. 2), our fitted model visibilities have excellent agreement and at the same level as the thermal noise (dash-dotted line).Right Panel: The same comparison in delay space.We see that while fitted gains and foregrounds alone only agree at the 10% level, our visibility model is accurate at the 0.1% level and at a similar level to the simulated thermal noise at all delays.Thus, if we subtract our visibility model, we can eliminate the systematics caused by spectral bandpass features leaking power into high delays.For this reason, calamity might be better thought of as a precision filter rather then a precision calibration technique.

EoR
EoR + Foregrounds EoR + Foregrounds + Noise Noise Gain Corrupted Residual Corrected Figure 5.The RMS over time of simulated visibility amplitudes for several baselines in our minimally redundant layout.calamity solves for foregrounds multiplied by gain structures (thin solid magenta line) and subtracts the model to obtain residuals (dashed cyan line).Outside of the wedge (vertical black dashed line), the RMS amplitudes of our calibrated residuals (cyan dashed line) are reasonably consistent with thermal noise fluctuations (dotted orange line).Our per-baseline modeling approach cannot distinguish between foregrounds and noise / EoR fluctuations inside of the wedge, leading to signal loss which is observable here as a small decrease in residual amplitude.Coherent time averages and cross-multiplication between independent noise realizations lead to a more visible decrease (see Figs. 6 and 5). Figure 7. Top Row: Spherically averaged power spectra for our simulations of foregrounds, 21 cm fluctuations, and noise P S+F +N (black circles), the same simulations corrupted by gain errors P V (red diamonds), calibrated residuals P R (orange squares), calibrated residuals for a simulation that only contains foregrounds and thermal noise (cyan squares), and a simulation with only thermal noise (magenta circles).Before cross multiplying time samples, the visibilities were coherently averaged over all integrations.Left Column: Spherically averaging over all Fourier modes, including those inside the wedge.Right Column: Averaging over all Fourier modes outside of the wedge with a 100 ns buffer.We plot 95% confidence intervals for the power spectrum of 21 cm fluctuations P S with sample variance as grey regions and denote negative points with open circles.Bottom Row: We zoom in on the 21 cm fluctuations and calibration residuals.Error bars denote 95% confidence regions.We do not apply any corrective normalization to the power spectrum and only multiply by constant normalization factors constant in k.When we average over all modes, including those inside of the wedge, we see that our calibrated residual is biased below k 0.25hMpc −1 because of degeneracies between wedge 21 cm modes and foreground estimates.
nal loss on all scales.This effect should be reduced on larger arrays with greater numbers of independent baseline groups involving each gain.It can also be reduced by imposing priors on the gain, such as smoothness, to reduce the number of degrees of freedom we must solve for and their overlap with 21 cm fluctuations.

THE LOW SIGNAL-TO-NOISE REGIME
For the majority of this paper, we have applied calamity to simulations with gain-features with much larger amplitudes than the thermal noise.This is because our primary aim is to demonstrate calamity's ability to remove gain features to the level necessary for 21 cm cosmology.In practice, calibration is usually run on much lower SNR data.In this section, we check whether calamity is able to remove gain artifacts to below the level of thermal noise for observations with lower (and more representative) sensitivities.We explore this by observing the power-spectrum of calamity residuals for simulations where the true gains are equal to  7 except now the left column is our minimally-redundant Array and the right column is our redundant array and we only spherically average Fourier modes outside of the wedge in both the left and right columns.We are able to reduce calibration systematics on our redundant array by 1-2 orders of magnitude by only crossing different redundant baselines to form power spectra.calamity is able to recover 21 cm fluctuations for the redundant case, albeit with higher signal loss (compare the regions below 0.3hMpc −1 on the left and right plot.This is because fewer independent samples of 21 cm fluctuations are absorbed into our gain estimates in our redundant array which leads to greater signal loss.
unity plus a small, visually distinctive sinusoidal ripple that is below the level of thermal noise on a particular integration and baseline.For each antenna a, we set where r a is an amplitude drawn from an exponential distribution with a mean of 0.01, φ a is a uniformly distributed phase, and τ a is drawn from a normal distribution with mean of 1200 ns and a standard deviation of 100 ns.We show the delay-transoforms of our gains in Fig. 9.In Fig. 10 we show the RMS of a single baseline in delay-space over all integrations.As discussed, the noise level is now roughly an order of magnitude below the foreground level and the gain errors are relatively small as compared to the noise so that on an individual baseline, "uncalibrated" visibilities with the corrupting gains are visually difficult to distinguish from the visibilities without the gains.The residuals from calamity agree reasonably well with both of these lines.We are concerned that gain structures below the level of the noise on a single baseline and integration might still be present in the data and raise their heads once we perform averaging over many times and baselines.To determine whether this is the case with calamity, we plot the spherically averaged power spectra from all times and baselines in Fig. 11.With no calibration, reflection artifacts exceed the thermal noise level by a factor of ≈ 10 in the spherically averaged power spectrum.calamity removes the reflection artifacts so that the residual is still consistent with noise only simulations.We determine that in the regime where gain features are below the thermal noise level on a single integration and baseline, calamity is still able to mitigate gain-like spectral structures to the noise level of the spherically averaged power spectrum for the entire dataset.

COMPUTATIONAL RUNTIME
In this section, we discuss the computational runtime of calamity.The computational cost of each gradient update step is dominated by the sum u abi A iabν in equation 4 which involves N v N ν N b floating point operations.Memory use is dominated by the N v N ν N b element A matrix.In Fig. 12, we show the scaling in runtime for a single adamax gradient update step with the number of antennas in rectangular arrays with rows of eight 2 m apertures with 2 m spacing between each antenna in each row and spacings between rows set so that the maximum baseline length is always 100 m to keep N max v constant.Each simulation contains 200 frequency channels between 120 − 180 MHz.We also illustrate the linear scaling of the gradient update computations with maximum baseline length ∝ N max v in Fig. 13 where we hold the number of elements constant (10) in a linear array but change the inter-baseline spacing.We measure runtimes on two different machines, one with Two dual core Intel Xeon CPUs at 2.3 GHz and a another with identical CPUs and a Tesla P100 GPU.Gradient update steps on the GPU tend to run on the order of ∼ 20 − 30 ms for arrays with ∼ 100 antennas.We find that given a learning rate of γ = 0.01, calamity requires thousands of gradient descent steps to converge to the precisions required for 21 cm cosmology.Thus 0 200 400 600 800 1000 1200 1400 [ns] .Same as Fig. 5 but now for a single baseline with the level of thermal noise set so that on a single integration, the RMS is equal to one tenth the RMS amplitude of foregrounds.We have also introduced gain errors with distinctive reflection features at the 0.01 level which are dominated on a single baseline by thermal noise.We clearly observe that the gain corrupted visibilities, residuals, and noise are all at a similar level for a single baseline and integration.
calamity can be used to obtain gradient descent solutions for arrays with hundreds of antennas within minutes on a single computer-node outfitted with a modern GPU.

CAVEATS AND LIMITATIONS
We have demonstrated that calamity can remove arbitrary frequency dependent gains at high precision.However, there are several scenarios where its ability to correct systematics will be limited.Firstly, while calamity can deal with non-redundancy, it requires that non-redundant systematics only occupy spectral scales within the foreground wedge.We do not expect calamity to be able to correct high-delay systematics that are direction dependent high RM polarization leakage (Moore et al. 2013;Asad et al. 2016) and mutual coupling artifacts that extend significantly beyond the edge of the wedge such as those discussed in Kern et al. (2019) and Josaitis et al. (in preparation.).Thus, telescope designers can relax their specifications on the analog signal chain and its direction independent effects but great care must be still be taken to limit the effects of inter-antenna mutual coupling over large delays.Josaitis et al. (in preparation.)find that super-horizon emission from mutual coupling primarily contaminates  9) and noise that is roughly one-tenth the amplitude of the foregrounds on a single baseline and integration.A reflection features is clearly visible in our uncalibrated visibilities (red points) but it is removed by calamity in the residuals (orange squares) which is roughly the same as the residuals for simulations with noise only (purple squares).
fringe-rates outside of the fringe-rates occupied by the main lobe.Thus main-lobe filtering along the lines of Parsons et al. (2016) prior to bandpass calibration could still allow for calamity to remove fine-frequency bandpass features as long as they are stable on time-scales longer then sky variations.
Another limitation of our technique in it's current form is its reliance on first-order gradient descent.This requires that we make an initial guess of the calibration solution within the convex region containing the local minimum.Thus, we recommend using calamity as a final calibration to be performed after a rougher initial calibration Finally, because calamity models the foregrounds per-baseline, it introduce signal loss within a region close to the wedge plus whatever buffer the analyst has specified.The degree and extent of this signal loss will not only depend on the specific configuration of the interferometer (as we have seen in Fig. 7) but will also depend on the bandwidth (Ewall-Wice et al. 2021) and chan- ) and Nant equally spaced antennas arranged in rows of 8 with 2 m spacing between the antenna of each row and Nν = 200 channels over 80 MHz and centered at 150 MHz.Markers denote distribution medians while filled regions denote 95% confidence regions.The computational time for a gradient update follows quadratic scaling with Nant, as we expect (lines).calamity performs 30 times faster on a GPU than on a CPU on arrays with hundreds of antennas (blue regions).Gradient descent typically requires thousands of steps to converge.Thus, on a single GPU equipped computer node calamity can provide robust per time-integration bandpass calibration for hundreds of element arrays on the timescales of minutes and for thousands of element arrays on timescales of hours.nelization resolution.Care must be taken in any analysis that uses calamity to account for this loss either through simulations or propagating the filtered modes into a window function matrix (e.g.Kern & Liu 2021).

CONCLUSION
In this paper, we introduced calamity, a fast and spectrally robust calibration algorithm for 21 cm cosmology.The key innovations leveraged by calamity are that it gives up on trying to precisely solve for the highly degenerate gains and foregrounds and instead focuses on solving for and subtracting the products of gains and foregrounds in a way that only relies on the robust assumption that the intrinsic foregrounds are are contained within the wedge region of instrumental Fourier space.We are able to accomplish the computationally expensive modeling problem in a reasonable amount of time for modern array sizes by taking advantage of GPU acceleration and auto-differentiation.By simultaneously modeling gains with per-baseline smooth foreground components, calamity is capable of removing arbitrary direction independent bandpass structures to the precision necessary for 21 cm intensity mapping experiments in the presence of non-redundancy and a highly uncertain foreground sky.We demonstrated calamity's effectiveness using simulations of a redundant and minimally-redundant configuration on the Murchison Widefield Array and find that in both scenarios, it recovers relatively unbiased estimates of the power spectrum within the EoR window.While we assumed artifically high SNR to demonstrate calamity's ability to filter out systematics to the level of 21 cm fluctuations, we also find that calamity removes spectral gain errors that are below the noise level of a single baseline integration and is effective in the low SNR regime encountered in the field.Low-level spectral bandpass structures are one of the primary systematics currently hampering observational efforts, thus calamity has the potential to drive significant progress towards a detection.Because calamity models each baseline independently, it does not require redundancy which can be dif-ficult to achieve in the field but as a consequence, it gives up on recovering any 21 cm signal inside of the foreground wedge since its per-baseline spectral modes are completely degenerate with large LoS scale 21 cm fluctuations.While calamity can correct arbitrary direction independent gain errors, we do not expect it to be capable of correcting direction dependent gain errors with delay structures that extend significantly beyond the edge of the wedge.Thus, array designers should focus their efforts on keeping direction dependent overthe-air mutual coupling effects to large spectral scales while potentially relaxing their requirements on redundancy on large spectral scales and direction independent fine-scale features.

Figure 1 .
Figure 1.We test calamity on simulations of two sub-arrays of the MWA.A minimally-redundant array consisting of antennas 0 through 35 (left) and one of the MWA's redundant hexagonal sectors (right).

Figure 3 .
Figure3.Left Panel: Comparison between the true simulated foregrounds (dashed lines) and our fitted foreground model (dashed lines) on a single baseline and time snapshot.Like the gains (Fig.2), our foreground model does not agree particularly well with the True foregrounds and the residual (dotted line) is on the order of ≈ 10 − 20% and significantly exceeds the level of the thermal noise.Right Panel: The same as the left, but in delay space with a Blackman-Harris taper.By definition, our foreground model resides within a compact region of delay space.As the model is contained within the foreground wedge, subtracting it does not introduce any additional spectral structures beyond the wedge.

Figure 4 .
Figure 4. Left Panel:Comparison between the real part of an uncalibrated visibility in our simulations at a single time snapshot (solid line) with our best fit model (dashed line).While our fitted gains are in relatively poor agreement with the true gains (Fig.2), our fitted model visibilities have excellent agreement and at the same level as the thermal noise (dash-dotted line).Right Panel: The same comparison in delay space.We see that while fitted gains and foregrounds alone only agree at the 10% level, our visibility model is accurate at the 0.1% level and at a similar level to the simulated thermal noise at all delays.Thus, if we subtract our visibility model, we can eliminate the systematics caused by spectral bandpass features leaking power into high delays.For this reason, calamity might be better thought of as a precision filter rather then a precision calibration technique.

Figure 6 .
Figure6.From left to right, cylindrically averaged power spectra for our simulated foregrounds, noise, and EoR model, P S+N +F , uncalibrated visibilities, P V , calibrated residuals P R , the injected EoR fluctuations, and thermal noise only.

Figure 8 .
Figure8.The same as Fig.7except now the left column is our minimally-redundant Array and the right column is our redundant array and we only spherically average Fourier modes outside of the wedge in both the left and right columns.We are able to reduce calibration systematics on our redundant array by 1-2 orders of magnitude by only crossing different redundant baselines to form power spectra.calamity is able to recover 21 cm fluctuations for the redundant case, albeit with higher signal loss (compare the regions below 0.3hMpc −1 on the left and right plot.This is because fewer independent samples of 21 cm fluctuations are absorbed into our gain estimates in our redundant array which leads to greater signal loss.

Figure 9 .
Figure9.Fourier transforms of simulated gains that we used to test calamity's performance in the low SNR regime.Gains include a single reflection with an amplitude of ≈ 0.01 and delay of ≈ 1200 ns (see equation12) and are otherwise flat.Secondary peaks at higher delays are from the second round-trip reflection.

Figure 11 .
Figure11.The same as the right-hand panel of Fig.7but now with reflection gains (fig.9) and noise that is roughly one-tenth the amplitude of the foregrounds on a single baseline and integration.A reflection features is clearly visible in our uncalibrated visibilities (red points) but it is removed by calamity in the residuals (orange squares) which is roughly the same as the residuals for simulations with noise only (purple squares).

Figure 12 .
Figure 12.Runtimes for a single adamax gradient update step in calamity for a linear array with a maximum baseline length of one-hundred meters (fixed N max v

Figure 13 .
Figure13.The runtime of a single adamax gradientdescent step for a 10-element linear array of equally spaced antennas with 200 channels between 120 and 180 MHz.Shaded regions describe 95% confidence regions over 1000 realizations on Two dual core Xeon CPUs (blue) and a P100 GPU (orange).We observe linear scaling of the runtime on the CPU.The runtime on the GPU appears constant since we are not yet in the regime where linear scaling is occuring.