Improved Constraints on the 21 cm EoR Power Spectrum and the X-Ray Heating of the IGM with HERA Phase I Observations

We report the most sensitive upper limits to date on the 21 cm epoch of reionization power spectrum using 94 nights of observing with Phase I of the Hydrogen Epoch of Reionization Array (HERA). Using similar analysis techniques as in previously reported limits, we find at 95% confidence that Δ2(k = 0.34 h Mpc−1) ≤ 457 mK2 at z = 7.9 and that Δ2(k = 0.36 h Mpc−1) ≤ 3496 mK2 at z = 10.4, an improvement by a factor of 2.1 and 2.6, respectively. These limits are mostly consistent with thermal noise over a wide range of k after our data quality cuts, despite performing a relatively conservative analysis designed to minimize signal loss. Our results are validated with both statistical tests on the data and end-to-end pipeline simulations. We also report updated constraints on the astrophysics of reionization and the cosmic dawn. Using multiple independent modeling and inference techniques previously employed by HERA Collaboration, we find that the intergalactic medium must have been heated above the adiabatic cooling limit at least as early as z = 10.4, ruling out a broad set of so-called “cold reionization” scenarios. If this heating is due to high-mass X-ray binaries during the cosmic dawn, as is generally believed, our result’s 99% credible interval excludes the local relationship between soft X-ray luminosity and star formation and thus requires heating driven by evolved low-metallicity stars.


INTRODUCTION
21 cm cosmology-the observation of the hyperfine transition of neutral hydrogen at cosmological distances-has long promised to become a sensitive probe of the structure and evolution of the intergalactic medium (IGM) from the Cosmic Dark Ages through to the cosmic dawn, the epoch of reionization (EoR) (Hogan & Rees 1979;Madau et al. 1997), and beyond. By measuring fluctuations in the 21 cm brightness temperature relative to the Cosmic Microwave Background (CMB) that trace the density, temperature, and ionization state of the IGM, we can precisely constrain our models of cosmology and of the first stars and galaxies (Mao et al. 2008;Patil et al. 2014;Pober et al. 2014;Greig et al. 2016;Ewall-Wice et al. 2016a;Kern et al. 2017). For pedagogical reviews see, e.g. Ciardi & Ferrara (2005), Furlanetto et al. (2006), Morales & Wyithe (2010), Pritchard & Loeb (2012), Mesinger (2016), and Liu & Shaw (2020).
A number of low-frequency radio telescopes designed to detect and characterize the cosmic dawn and EoR 21 cm signal have been built over the last decade and a half. Many are interferometers seeking to statistically detect and ultimately tomographically map 21 cm fluctuations over a broad range of frequencies and thus redshift. This period has seen increasingly tight limits on the 21 cm power spectrum from a number of different telescopes, including the Giant Metre Wave Radio Telescope (GMRT; Paciga et al. 2013), the Low Fre-quency Array (LOFAR; van Haarlem et al. 2013;Patil et al. 2017;Gehlot et al. 2019;Mertens et al. 2020), the Donald C. Backer Precision Array for Probing the Epoch of Reionization (PAPER; Parsons et al. 2010;Cheng et al. 2018;Kolopanis et al. 2019), the Murchison Widefield Array (MWA; Tingay et al. 2013;Dillon et al. 2014Dillon et al. , 2015aJacobs et al. 2016;Ewall-Wice et al. 2016b;Beardsley et al. 2016; Barry et al. 2019;Li et al. 2019;Trott et al. 2020;Yoshiura et al. 2021;Rahimi et al. 2021), and the Owens Valley Long Wavelength Array (LWA; Eastwood et al. 2019;Garsden et al. 2021).
Additionally, a number of total-power experiments have been conducted to measure the sky-averaged, global 21 cm signal as it evolves with redshift (Bernardi et al. 2016;Singh et al. 2017;Monsalve et al. 2017). Recently, the Experiment to Detect the Global EoR Signature (EDGES; Bowman et al. 2018) reported the detection of an unexpectedly strong absorption feature in the global signal at z ≈ 17 which would require either an IGM temperature below the adiabatic cooling limit (Muñoz et al. 2015;Barkana 2018;Muñoz & Loeb 2018) or a high-redshift radio background in excess of the CMB (Feng & Holder 2018;Ewall-Wice et al. 2018;Mirocha & Furlanetto 2019). A number of subsequent analyses have further investigated alternative explanations for this result in terms of instrumental systematics (Bradley et al. 2019;Hills et al. 2018;Singh & Subrahmanyan 2019;Sims & Pober 2020;Mahesh et al. 2021) and the recent non-detection by the Shaped Antenna measurement of the background RAdio Spectrum 3 (SARAS 3; Singh et al. 2022) in an overlapping frequency band is in tension with the EDGES result.
The main challenge facing both interferometric and sky-averaged 21 cm observations is the roughly five orders of magnitude of dynamic range between the 21 cm signal and astrophysical foregrounds-largely synchrotron and free-free emission from our Galaxy and other galaxies. While foregrounds are in principle separable from 21 cm signal using their intrinsic spectral smoothness, that separability is complicated by many real-world factors. Calibration errors due to e.g. incomplete sky and instrument models or unaccountedfor non-redundancy can leak foreground power into regions of Fourier space that would otherwise be signaldominated (Barry et al. 2016;Ewall-Wice et al. 2017;Orosz et al. 2019;Byrne et al. 2019;Joseph et al. 2019). Moreover, interferometers are inherently chromatic instruments with increasing frequency structure with baseline length-the origin of the so-called "wedge" feature in 2D power spectra (Datta et al. 2010;Vedantham et al. 2012;Parsons et al. 2012b,a;Liu et al. 2014a,b).
The extreme sensitivity and calibration requirements of high-redshift 21 cm cosmology have driven the design of second-generation interferometers including the Hydrogen Epoch of Reionization Array (HERA; DeBoer et al. 2017) and the Square Kilometre Array (SKA; Koopmans et al. 2015) with larger collecting areas and a diversity of approaches to understanding and controlling instrumental systematics. HERA, when complete, will be a interferometer with 350 fully cross-correlated elements-each a fixed, zenith-pointing 14 m dish-at the South African Radio Astronomy Observatory site in the Karoo desert. The dishes are designed to minimize the frequency structure of the instrumental response (Thyagarajan et al. 2016;Neben et al. 2016;Ewall-Wice et al. 2016;Patra et al. 2018;Fagnoni et al. 2021). HERA's compact, hexagonally-packed configuration maximizes sensitivity on short baselines, which are intrinsically less chromatic, while enabling relative gain calibration of antennas using redundant baselines (Dillon & Parsons 2016).
During Phase I, which culminated in the 2017-2018 observing season, 1 HERA repurposed PAPER's sleeved dipoles, suspended at prime focus (see Figure 1), along with PAPER's correlator and signal chains to observe from 100 to 200 MHz. During that time, the array continued to be built and commissioned. In Phase II, a new signal chain, correlator, and upgraded Vivaldi feeds have extended the bandwidth to 50-250 MHz (Fagnoni et al. 2020).
Recently, we reported the first upper limits on the 21 cm brightness temperature power spectrum from HERA in Abdurashidova et al. (2022a, hereafter H22a), using 18 nights of Phase I data and only 39 antennas. H22a built upon a number of supporting papers exploring various aspects of the data analysis. These included redundant-baseline calibration (Dillon et al. 2020), absolute calibration (Kern et al. 2020a), systematics mitigation (Kern et al. , 2020b, error estimation (Tan et al. 2021), analysis pipeline architecture (La Plante et al. 2021), and end-to-end validation of that pipeline with realistic simulated data (Aguirre et al. 2022). We focused on the so-called "foreground-avoidance" approach to power spectrum estimation (Kerrigan et al. 2018;Morales et al. 2019), working predominately in foreground-free regions of Fourier space and applying conservative techniques that minimized potential signal loss or bias.
Our results, which constrained the "dimensionless" brightness temperature power spectrum to ∆ 2 (k) < 946 mK 2 at z = 7.9 and k = 0.19 hMpc −1 and to ∆ 2 (k) < 9,166 mK 2 at z = 10.4 and k = 0.26 hMpc −1 (both at the 95% confidence level), represented the most stringent constraints to date. They allowed us in Abdurashidova et al. (2022b, hereafter H22b) to constrain the astrophysics of reionization and the cosmic dawn and show that the IGM was heated above the adiabatic cooling limit by at least z = 7.9. Evidence from other probes-including the integrated optical depth to reionization (Planck Collaboration 2018), observed galaxy UV luminosity functions, and quasar spectroscopyindicates that reionization is likely well underway by z = 7.9 (Mason et al. 2018;Greig et al. 2022). Our results therefore already rule out some of the most extreme of the so-called "cold reionization" models where an adiabatically cooling IGM produces a bright temperature contrast with the CMB, amplifying the 21 cm power spectrum as it is driven by ionization fluctuations .
In this work, we adapt and apply the analysis techniques of H22a and H22b to a full season of HERA Phase I data. Retaining the philosophy of foregroundavoidance and minimizing (and carefully accounting for) signal loss, we further tighten constraints on the 21 cm power spectrum at z = 7.9 and z = 10.4, and update the astrophysical implications of those limits. While some of our analysis techniques are updated to reflect an im- proved understanding of our instrument or adapted to better handle the larger volume of data considered, the core analysis techniques remain largely unchanged. 2 We begin in Section 2 by detailing the observations themselves and the basic cuts performed to ensure data quality. Then in Section 3 we review the data reduction steps performed to go from raw visibilities all the way to power spectra, highlighting updated analysis techniques and revised analysis choices. These techniques are tested with end-to-end pipeline simulations designed to validate our analysis choices and software in Section 4, in which we quantify a number of potential small biases and reproduce a few key figures from Aguirre et al. (2022) in the context of our new limits. In Section 5, we can then present our final power spectrum estimates, error bars, and upper limits. We build confidence in our results in Section 6 by applying a variety of statistical tests on our power spectra and how they integrate down across baselines and time. In Section 7, we report the impact of our new limits on the various approaches to astrophysical modeling and inference used in H22b, detailing our updated constraints on the epoch of reionization and the cosmic dawn. We conclude in Section 8, looking forward to potential future analyses of these data and data from the full HERA Phase II system.

OBSERVATIONS AND DATA SELECTION
In this work, we analyze observations with the HERA Phase I system that were performed over the period from September 29, 2017 (JD 2458026) through March 31, 2018 (JD 2458208). In Table 1, we summarize the key observational parameters of the instrument. For more detail about the precise configuration of the instrument, its signal chain components, and its FX correlator architecture, we refer the reader to DeBoer et al. (2017) and H22a. In this section, we discuss the process by which a selection of high-quality nights and antennas was performed.

Selection of Nights and Epochs
Of the 182 nights during this season of simultaneous construction, commissioning, and observing, a significant fraction of nights was discarded for a variety of reasons. Most of these were hardware failures, including network outages, power outages, too many low-and/or high-power antennas, a briefly broadcasting antenna, broken receivers, and broken X-engines. Some were due Epoch 0 Epoch 1 Epoch 2 Epoch 3 Jan. 2018Feb. 2018Mar. 2018Oct. 2017Nov. 2017Dec. 2017 Total Antennas in the Array Unflagged Antennas Epoch Boundaries H22a Data Set Figure 2. Here we show the observing season, split into epochs, and the number of antennas observing each night, both in total and after cuts. While the number of total antennas in the array grew from 47 to 71, the number passing all cuts remained roughly constant at ∼40. Epochs were defined by natural breaks in the observing season, mostly due to hardware issues. H22a analyzed data from 39 antennas on 18 nights, a subset of Epoch 1.
to site issues, including high winds, a lightning storm, and excess radio frequency interference (RFI). While all nights have significant narrow-band RFI contamination from FM radio, TV broadcasts, and ORBCOMM satellites (see Section 3.2.3), most nights that were completely flagged for excess RFI showed consistent broadband emission contaminating channels typically free of RFI. These cuts first reduced the 182 nights to 104 nights using contemporaneous observing logs and realtime analysis. After inspecting hundreds of jupyter notebooks 3 summarizing the nightly results of the data analysis pipeline after each key stage (see Section 3), this was reduced to 94 nights. For more details on the precise selection of nights, see Dillon (2021a).
In Figure 2, we show how the nights passing our various data quality checks span the observing season. The breaks in good data (due to a network outage, a correlator malfunction, and a broadcasting antenna) naturally divided the season into four epochs. The data used in H22a were a subset of Epoch 1. While in theory each night could be analyzed independently before binning all of them together at constant local sidereal time (LST), we found it useful to analyze epochs individually, both for systematics mitigation (see Section 3.2.4) and for statistical tests on subsets of the data (see Section 6.3).
Because we observed during the Southern summer, most 12-hour "nights" included some data taken before sunset or after sunrise. These were flagged in our analysis. Further, a number of partial nights were flagged, usually due to excess RFI. The majority of these were due to broadband RFI during the first few hours of the night, possibly attributable to construction activity on site. A few other partial nights were flagged due to suspicious nightly calibration solutions, especially in Epoch 3 when the Galaxy was rising at the end of each night. More detail on the precise subset of nights flagged is given in Dillon (2021a).
The end result of our data cuts is a set of observations that, when LST-binned together, are significantly deeper than those in H22a, and cover over 21 hours of LST. As we can see in Figure 3, this data set peaks at ∼70 nights' observing around 7 hours of LST, nearly four times deeper than the observations used in H22a. Roughly speaking, this sets an upper bound on the factor by which our limits might improve due purely to the increase in sensitivity, since the noise on P (k) scales inversely with observing time.

Selection of Antennas
Antenna selection began with the nightly data quality monitoring system described in H22a. It identified malfunctioning antennas by looking for antennas participating in baselines that were either outliers in total visibility power, or had visibility amplitude spectra significantly different from other baselines measuring the same physical separation on the ground. This procedure ultimately informed the most rigorous identification of malfunctioning antennas described in Storer et al. (2022), which was applied to HERA Phase II data. The results of nightly analysis were synthesized into a set of perantenna, per-night flags by the HERA commissioning team as part of an internal data release.
Because our metrics for antenna quality are relative ones, we decided to expand and harmonize this list of flagged antenna-nights under the conservative presumption that antennas that misbehave consistently enough are probably also anomalous at some lower level on other nights. If, in any given epoch, an antenna was flagged more than 10% of the days, we flagged it for the whole epoch. If an antenna was completely flagged for more than 60% of the epochs that it appeared in (i.e. 3 of 4 epochs for antennas that were observing for the whole season), then it was completely removed.
Antennas passing this first series of cuts were then used for an initial round of redundant-baseline calibration where per-antenna gains and per-unique-baseline visbilities are solved for simultaneously as part of a large overdetermined system of equations . Antennas outside the southwest sector of HERA's split- H22a (18 Nights) Full-Season (94 Nights) Figure 3. The full season of HERA data that we analyze spans nearly the full 24 hours of LSTs. Since we only observe at night, the time of greatest overlap in LST between nights occurs near the middle of the night in the middle of the season, at roughly 7 hours. At LSTs near that peak, this data set roughly quadruples the depth of the H22a data set. The increased number of flagged times (which cause vertical dips here) is due in large part to the changes in RFI excision (see Section 3.2.3).
core configuration (155,156,180,181,182, and 183-see Figure 4-as well as two outriggers not pictured) were excluded as well because they would introduce extra tiptilt degeneracies Zheng et al. 2014Zheng et al. , 2016Dillon et al. 2018) and thus complicate a subsequent skybased absolute calibration (Li et al. 2018;Kern et al. 2020a). As Dillon et al. (2020) describes, redundant-baseline calibration can be cast as a χ 2 -minimization problem where χ 2 quantifies how consistent deviations from redundancy are with thermal noise (see Section 3.1 and Equation 2). If one attributes each baseline's contribution to χ 2 to both of its constituent antennas equally, one can form a χ 2 per antenna statistic that is sensitive to particularly non-redundant antennas. In our first round of redundant-baseline calibration, antennas which made significant excess contributions to χ 2 are flagged in 60-integration (i.e. 10.7 minute) chunks, and then calibration is performed again, iteratively, until no outliers remain. These were then likewise harmonized; antennas flagged for non-redundancy more than 20% of a night or 5% of an epoch were flagged for the whole night/epoch.
In Figure 4, we show the per-antenna flagging fraction after all of these cuts. The overall impact is that while the array was growing, the number of antennas included in our analysis remained largely static (see also Figure 2). Likely this set of nightly flags and antenna flags is overly conservative and some good data were thrown out. Due to the extreme dynamic range challenge faced by all of 21 cm cosmology, we adopted the stance that it was far safer to throw out possibly good data than to risk including bad data. Importantly, all these decisions about data selection were performed without reference to final power spectra and are therefore less likely  . The layout of antennas during the season we analyze. Each antenna is a pie chart showing the fraction of the 94 total nights that each antenna was either flagged, unflagged, or not yet in the data set. The array was under construction as we observed, as can be seen by the northward expansion of the antennas available for observing. All antennas shown here numbered except 155, 156, and 180-183 would eventually become part of the southwest sector of HERA's split-core configuration (Dillon & Parsons 2016;DeBoer et al. 2017). to introduce experimenter bias. Once the set of good antennas-nights was selected, it was not subsequently changed.

DATA REDUCTION AND SYSTEMATICS MITIGATION
We now turn to a discussion of our data analysis pipeline, which we designed to reduce nightly visibilities to a final set of power spectra while avoiding systematics contamination. In general, our goal in this work is to apply the methods developed and validated in H22a and its supporting papers to this larger data set. This is a fundamentally iterative approach and likely does not leverage the full constraining power of the data set. Thus, the steps in the analysis pipeline-which we review in Section 3.1-remain largely unchanged.
However, a number of changes were incorporated in this work. Some were necessary because this data set is larger and more heterogeneous than the 18 nearlycontiguous nights examined in H22a with the same 39 antennas each night. Others are the result of various tweaks and minor improvements in the HERA team's codebase developed between the H22a analysis and this work. Finally, some are simply minor changes in data analysis parameters and choices motivated by various intermediate data products. In Section 3.2 we report changes to our data reduction pipeline and in Section 3.3 we similarly detail changes to the estimation of power spectra and their errors and potential biases.

Overview of the Data Analysis Pipeline
H22a gives a detailed description of the analysis steps that take raw visibilities all the way to power spectra. Here we provide a high-level overview of each step in order to give context for the changes and updates in this analysis. We refer the reader to H22a and its supporting papers for more detail. The steps in our data reduction and systematics pipeline are as follows: 1. Redundant-baseline calibration: We begin with direction-independent calibration, wherein our observed visbilities, V obs ij , are modeled as Here g i is a complex, time-and frequencydependent gain associated with the ith antenna and n ij is the noise on that visibility. Redundantbaseline calibration assumes that all baselines b ij with the same physical separation and orientation should observe the same true visibility, and thus solves for both gains and unique visibilities as a large χ 2 -minimization problem, where V sol i−j here is the visibility solution for all redundant baselines with the same physical separation as V ij . We calibrate by minimizing χ 2 for every time and frequency independently, using only internal degrees of freedom and without reference to any model of the instrument or the sky. For an exploration of redundant-baseline calibration in the context of HERA, see Dillon et al. (2020).
2. Absolute calibration: The internal consistency of redundant baselines alone cannot solve for three important degrees of freedom, namely the overall gain amplitude and two phase tip/tilt terms. These degeneracies must be solved, per time and frequency, by reference to externally calibrated (or simulated) visibilities (Kern et al. 2020a). In H22a and in this work, we performed absolute calibration using a set of visibilities synthesized from three nights with LSTs spanning this data set (JDs 2458042, 2458116, and 2458207 3. RFI flagging: RFI is identified and flagged using an iterative outlier detection algorithm described in H22a. Essentially, several sets of waterfalls (visibilities, gains solutions, etc.) are converted into time-and frequency-dependent measures of "outlierness" expressed as a z-score or modified zscore. This is done by looking at a 17×17 pixel region centered on each pixel of the waterfall and measuring how consistent each it is with its neighbors in time and frequency. After averaging together z-scores of baselines or antennas to improve the signal-to-noise ratio (SNR), 5σ outliers and 2σ outliers neighboring 5σ outliers are flagged. This is done independently for each set of waterfalls and the flags are then combined. We start RFI flagging with median filters and modified z-scores to reduce the effect of really bright RFI, then use those flags as a prior on a second round using mean filters and standard z-scores. Finally, we examine these statistics over the whole day, looking for whole channels or whole integrations that are 7σ outliers and their 3σ outlier neighbors. The same flags are applied to all baselines.
4. Gain smoothing: After flagging, all gains solutions are smoothed to mitigate the effect of noise and calibration errors, taking as a prior that the gains should be stable and relatively smooth in frequency. This smoothing is performed with a CLEAN-like deconvolution algorithm (Högbom 1974;Parsons & Backer 2009), filtering gains in 2D Fourier space on a 6 hour timescale and a 10 MHz frequency scale (or equivalently, 100 ns delay scale). These are the scales on which we have evidence for intrinsic gain variation in time (Dillon et al. 2020) and frequency (Kern et al. 2020a). For more implementation details, see H22a.
5. LST-binning: Having calibrated and flagged each night's data, we then coherently average nights together on a fixed LST grid. This 21.4 s cadence grid-double the integration time of raw visibilities-is created by assigning each observation to the nearest gridded time and then rephasing that visibility to account for sky rotation due to the difference in LST. An additional round of flagging is performed on a per-LST, per-frequency, and per-baseline basis, looking for 5σ outliers 4 in modified z-score among the list of rephased visibilities to be averaged together. This cut is designed to identify low-level residual RFI and calibration failures; it is highly unlikely to flag outliers due to noise.
6. Hand-flagging: After LST-binning, a final set of additional flags are added by manually examining high-pass delay-filtered residuals. This filtering was performed on using an iterative delay CLEAN to remove power below the 2000 ns scale in order to highlight spectrally compact features. Clear outlier channels and/or times that are consistent across baselines are flagged upon visual inspection. The same additional flags are applied to all baselines.
7. Inpainting flagged channels: When using Fast Fourier Transforms (FFTs) to form power spectra, flagged channels introduce discontinuities that leak foreground power to high delays. To avoid this, we use the same delay-based CLEAN algorithm to low-pass filter the data on 2000 ns scales and inpaint the flagged channels with the filtered result. Entirely flagged times are not inpainted. For more details and a demonstration of this procedure, see H22a.
8. Cable reflection calibration: When a signal bounces off of both ends of a cable before being transmitted, the result is a copy of the signal at a fixed delay associated with twice the lighttravel time along the cable. Kern et al. (2020b) showed how the 20 m and 150 m cables in the signal chain produce these reflections, which can be represented as complex gain terms. While these gains are in principle calibratable with redundantbaseline calibration, our gain smoothing procedure completely removes them. Thus, following the procedure outlined in , we iteratively model and calibrate out reflection and subreflection terms using autocorrelations, which have much higher SNR than cross-correlations. Since cable reflections are stable over many nights, this is done after LST-binning.
9. Crosstalk subtraction: Kern et al. (2020b) also demonstrated the pernicious impact of crosstalk systematics across a large range of delay modes in HERA Phase I data. It was hypothesized that this was due to over-the-air crosstalk that led to autocorrelations leaking to cross-correlations at high delays. In Appendix A, we show how this effect's delay and amplitude structure can be explained by an emitter in the signal chain. Because autocorrelations are non-fringing and thus quite stable in time, the effect can be mitigated by modeling each baseline's excess power near zero fringe-rate.  does this using singular value decomposition (SVD) to find the delay and time modes affected and uses Gaussian process regression to limit the range of fringe-rates modeled and subtracted so as to avoid EoR signal loss. That fringerate range is symmetric about zero and limited by the east-west projected baseline length |b E-W | such that The signal loss due to crosstalk subtraction was calculated in Aguirre et al. (2022) and corrected for in H22a; we repeat the calculation in Section 4.3 and find similar results. Crosstalk subtraction is not applied to baselines with projected east-west distances less than 14 m, since the zero fringe-rate mode overlaps with the fringe-rates associated with the main lobe of the primary beam .
10. Coherent time averaging: Following H22a, we next coherently average visibilities from the 21.4 s cadence after LST-binning down to a 214 s cadence, using the same rephasing procedure described above to account for sky-rotation. This timescale was chosen to keep signal loss at the ∼1% level (Aguirre et al. 2022); we repeat that calculation for this data set in Section 4.3 and find consistent results.
11. Forming pseudo-Stokes I visibilities: Before forming power spectra, we construct pseudo-Stokes visibilities. As H22a showed, this limits the leakage of Faraday-rotated foregrounds into highdelays, though Q → I leakage from primary beam asymmetry is still possible (Moore et al. 2013;Asad et al. 2016;Kohn et al. 2016;Nunhokee et al. 2017;Asad et al. 2018). For the pseudo-I channel, this step consists of simply averaging together the EE-and N N -polarized visibilities for the same baseline, where E and N denote the east-and north-aligned linearly-polarized feeds respectively .
12. Power spectrum estimation: We compute power spectra using the delay approximation, in which we substitute a Fourier transform along the frequency axis of a visibility (i.e. a delay transform) for a line-of-sight Fourier transform (Parsons et al. 2012a). This strategy avoids mapmaking entirely (Dillon et al. 2013(Dillon et al. , 2015bXu et al. 2022). We thus approximate τ and u (the magnitude of the baseline in units of wavelengths) as mapping linearly to line-of-sight Fourier modes, k , and transverse Fourier modes, k ⊥ , respectively. The power spectra are estimated by taking the real part ofP where V is a Fourier transformed visibility in frequency. Ω pp is the full-sky integral of the squared primary beam response-we use the beam simulated in Fagnoni et al. (2021)-and B is the bandwidth. X and Y are scalars mapping angles and frequency to cosmological distances, de- where H(z) is the Hubble parameter, and D(z) is the transverse comoving distance (Hogg 1999). 5 P (k ⊥ , k ) here is in units of mK 2 h −3 Mpc 3 , though it conventional to report the "dimensionless" power spectrum (which has units of mK 2 ): H22a shows how this can be recast into the language of quadratic estimators (Tegmark 1997;Liu & Tegmark 2011;Dillon et al. 2013;Trott et al. 2016). In that formalism, we use a diagonal normalization matrix M (i.e no decorrelation of bandpower uncertainties). In lieu of any inverse covariance weighting, we use a Blackman-Harris taper to prevent foreground leakage into the EoR window. When computing Equation 4, we cross-multiply Fourier transformed visibilities from alternating 214 s blocks of time (i.e. V 1 and V 2 ), using all pairs of baselines in a redundant baseline group. In the (quite accurate) approximation that visibilities interleaved in this way have uncorrelated noise, this produces an estimate of the power spectrum free of noise bias.
13. Error estimation: As in H22a, we employ the noise estimation formalism of Tan et al. (2021). The noise power spectrum is given by where T sys is the system temperature, t int is the integration time, and N coherent and N incoherent are the numbers of integrations averaged together coherently or incoherently-i.e. averaged as visibilities with phase information or averaged as power spectra. Ω eff is the effective beam area, defined as Ω eff ≡ Ω 2 p /Ω pp in Appendix B of Parsons et al. (2014). We use the inverse square of the noise power spectrum to perform inverse varianceweighted averaging of the power spectra. For reporting final errorbars on power spectra, we use an unbiased estimator of the noise and signal-noise cross-terms developed by Tan et al. (2021), 14. Incoherent power spectrum averaging: Here and in H22a, power spectra are averaged incoherently over several axes to produce the final limits. First, all baseline-pairs within a redundant baseline group are averaged, ignoring autobaseline pairs (power spectra formed from the same pair of antennas at interleaved times). This preserves most of the sensitivity of coherently averaging visibilities within a redundant group before forming power spectra, while excluding the pairs most likely to exhibit correlated systematics. Then power spectra are averaged incoherently in several disjoint LST ranges, which we call "fields" since they correspond to different parts of the sky at zenith. Power spectra are estimated independently for the LST ranges in the separate fields and we perform no further averaging in H22a or this analysis when reporting power spectrum upper limits. Finally, we perform a binning to spherical k = k 2 + k 2 ⊥ , excising baselines based on their length and certain sets of delay modes based on their proximity to the "horizon wedge," which is set by the light travel time along the baseline P N andP SN are propagated through each of these averaging steps.
For more details on the implementation of these techniques, we refer the interested reader to H22a and its supporting papers.

Updates to the Data Reduction and Systematics
Mitigation Pipeline Since H22a With the full context of the analysis pipeline established above, we now detail the ways it has changed since H22a. While most are relatively minor tweaks (Dillon 2021a), we detail them here for completeness and reproducibility.

Updates to Redundant-Baseline Calibration
Two minor changes were made to redundant-baseline calibration. The first is the addition of a step in firstcal-the solver for per-antenna delays and phase offsets (Dillon et al. 2020)-to also solve per-antenna polarity flips. A polarity flip, which could result from rotating the feed by 180 • , simply flips the sign on the measured voltage from the antenna or equivalently multiplies the gain by −1. Solving for polarities allows firstcal to converge faster and more reliably, but does not appreciably change the result.
The second change was an increase of the maximum number of iterations allowed in omnical, which uses damped fixed-point iteration to minimize χ 2 in Equation 2 (Dillon et al. 2020). This was increased from 500 to 10,000. This likely makes little difference in practice, since omnical usually only converges that slowly for a given time and frequency in the presence of bright RFI contamination. Since allowing more steps did not substantially increase runtime (individual times and frequencies can converge independently), we felt it was safer give the algorithm as long as necessary to minimize χ 2 , even if doing so had little impact after gain smoothing.

Updates to Absolute Calibration
Two important changes were made to absolute calibration as compared to H22a. The first is a change to how flags are propagated from the sky-calibrated reference visibilities. Previously, antennas flagged or otherwise not included in the reference set of visibilities were also flagged on a nightly basis after absolute calibration. In this work, these antennas are simply given zero weight when solving for the degeneracies of redundantbaseline calibration-which are then applied to all gain solutions. Per-antenna flags, once set (see Section 2.2), did not change during the nightly calibration.
Second, we added a new step in absolute calibration to fix the bias discovered in the course of validating the H22a pipeline. In Aguirre et al. (2022), we found that absolute calibration produced gains that were biased high and that the bias got larger with decreasing visibility SNR. This is particularly worrisome because gains affect both power spectrum and error estimates quartically, and high gains lead to artificially low power spectrum estimates. In Aguirre et al. (2022) we calculated the size of this effect and in H22a we increased our measurements and error bars to compensate for this ∼10% bias on our power spectra.
A detailed mathematical account of the origin of the bias appears in Appendix B of Byrne et al. (2021). However, it can be understood intuitively as follows: when solving for the overall gain degree of freedom in absolute calibration, noise turns individual visibilities in the complex plane into samples of a circularly symmetric distribution whose center is displaced from the origin (the "true" visibility). When measuring magnitudes, that probability distribution is Ricean and always has a larger mean than the magnitude of the "true" visibility. Simply put, adding symmetric noise is more likely to increase the amplitude of a complex number than decrease it. However, by calibrating the overall multiplicative amplitude as a complex number and then only taking the absolute value at the very end, one can avoid this bias-as we show in Section 4.2 and Figure 13.

Updates to RFI Excision
While the fundamental algorithm for RFI excision remains unchanged, we made two updates to how it was performed on a nightly basis. The first is related to how the analysis dealt with data file boundaries. Previously, every data file was analyzed in parallel for RFI. Since Here we illustrate the process for RFI excision after binning together all 94 days. On the left, we show 2000 ns high-pass filter residuals of all epochs combined. It shows some clear temporal and spectra structure which necessitates further flagging. These additional flags, shown on the right, are performed first by looking for outlier channels or integrations, then by hand-flagging contiguous regions of excess structure. We attribute these outliers to low-level RFI, as well as the interplay between the night-to-night variation in the RFI flagging mask and night-to-night changes in calibration errors. Most of the additional flagged pixels in this waterfall were already flagged on a significant fraction of the nights.
the outlier identification algorithm relies on neighboring times and frequencies, this became less reliable near file boundaries where there are roughly half as many neighbors to compare to, and likely led to some of the ∼10 minute periodicity we saw in the flags in H22a. In this analysis, we parallelized the pipeline in overlapping time chunks, so that every point was compared to exactly the same number of neighbors-except at the beginning and end of the night and at the edges of the band.
Second, we modified the set of data products used in searching for outliers. In both analyses we used raw visibilities (albeit only in the mean filter round); gains from both redundant-baseline calibration and absolute calibration; and χ 2 from both calibration steps. Based on experiments we performed on which data products were providing unique information and not leading to overflagging, we removed a global cut on outliers in χ 2 -which likely led to the over-flagging around Fornax A in H22a-and added uncalibrated autocorrelations for their high SNR and computational tractability compared to the full set of visibilities. The result is still a very expansive set of flags and likely contains a significant number of false positive identifications of RFI, especially around ORBCOMM at 137 MHz and the clock line at 150 MHz (see e.g. Figure 5 and Figure 8). Given the extreme dynamic range requirements of 21 cm cosmology, it is far safer to over-flag than under-flag (Kerrigan et al. 2019).
In that spirit, we also revisited how a final set of byhand flags were identified. In H22a, this flagging was performed by examining a handful of key baselines after high-pass filtering to remove structure below the 2000 ns scale. In this work, the four epochs were first combined together without any inpainting, reflection calibration, or crosstalk subtraction. After performing the same perbaseline high-pass filter on every baseline, their amplitudes were averaged together, inverse variance weighted by noise and then corrected for the noise bias.
The result, shown in the left-hand panel of Figure 5, highlights residual frequency structure. Much of this has low N samples and/or borders on previously identified RFI, indicating its origin as low-level, inconsistently flagged RFI. As the right-hand panel shows, outlier channels and integrations were first identified by averaging along each axis. Next, individual areas of concern were flagged by hand, by converting the waterfall to a bitmap image and individually marking times and frequencies in Adobe Photoshop. An effort was made to flag coherent rectangular regions near previously identi- Figure 6. In Kern et al. (2020b), the observation that the phase of crosstalk systematics remained stable in delay in space was key to removing them down to nearly the noise level. This technique proved foundational to our results in H22a. However, when we combined all four epochs together, we began to see discontinuities in the phase structure of the crosstalk. In the left panel, this effect is particularly clearly illustrated in Band 1 of the north-polarized baseline between antennas 11 and 66. We see a number of delay modes where the sign of the crosstalk feature abruptly shifts from positive to negative (or vice versa) as a function of LST. These discontinuities appear to be associated with epoch boundaries, which give rise to discontinuities in the number of samples binned together (right panel). We hypothesize that these discontinuities arise due to the effects of the ongoing construction of HERA, either on the source of the crosstalk emission or on how it is transmitted through the array (see Appendix A for more detail).
fied flags to avoid cherry-picking, though fundamentally this step involved a series of subjective judgment calls. Once the final flagging waterfall was developed, it was not revisited after estimating power spectra in order to avoid experimenter bias.

Mitigation
In H22a, all the steps in Section 3.1 after LST-binning were performed on the full-sensitivity, 18-night data set. At first, we attempted performing the same analysis with all 94 nights binned together but found that the level of residual crosstalk had substantially higher SNR than anything seen in H22a. Some baselines were particularly bad, exhibiting a delay-and time-averaged SNR greater than 10 in the affected delay range. Upon examining the pre-subtraction waterfalls of the baselines where crosstalk subtraction performed the worst, we found clear evidence for temporal structure in the delays contaminated by crosstalk. In Figure 6, we show one such baseline. Plotting the real part of its waterfall in delay space shows clear temporal structure at certain delays, including some where it flips from positive to negative and vice versa. This is potentially disastrous for the crosstalk subtraction technique of , which relies on stability in delay space.
Furthermore, there appears to be a correlation between discontinuities in N samples (right-hand panel of Figure 6) and discontinuities in the delay structure of V ij . Since the former are largely attributable to epoch boundaries, which affect how often each LST is observed, we hypothesized that the changing and growing array was affecting the precise structure of the crosstalk. This ultimately pointed us toward a new understanding of the physical origin of the effect, namely that all antennas' signals were being broadcast from one point on the west side of the array, likely the refrigerated enclosures which contained the analog receivers. We discuss this model and the evidence for it in detail in Appendix A.
The upshot of this result is twofold. First, it confirms that the model of , of autocorrelations leaking into cross-correlations, is correct. Second, it implies that as long as the array is stable, the effect should be stable in LST-binned data as well. We therefore decided to perform inpainting, cable reflection calibration, and crosstalk subtraction on a per-epoch basis before binning together the four epochs. This proved a sub- Figure 7. Here we show the effect of our crosstalk subtraction algorithm on the power spectrum of a baseline-pair (in this case, 38-66 crossed with 52-82 in Band 1) with a particularly strong crosstalk signal at negative delay. The technique of , applied on a per-epoch basis, removes the crosstalk features seen here at roughly ±1000 ns down to a level nearly consistent with noise even after averaging all four epochs together. Since the crosstalk is proportional to the amplitude of the autocorrelations (see Appendix A), which rise very steeply when the Galactic center transits the beam at 17.8 hours LST, and since our removal algorithm depends on the temporal stability of the crosstalk, we give LSTs between ∼15.3 and 21 hours zero weight. This results in cleaner residuals in the more sensitive fields, at the price of greater residuals at high LST. This also motivated our definition of the five fields in which the power spectrum was independently estimated; see Section 3.3.1 and Figure 9 for more details on the definition of the fields. stantially better approach-as Figure 7 shows-and got us much closer to consistency with noise after crosstalk subtraction.
A few minor tweaks were also made to cable reflection calibration and crosstalk subtraction. The total number of terms used to fit the cable reflections was increased from 28 terms between 75 and 1500 ns to 35 terms between 75 and 2500 ns, to both better model the 20 m cables and to be able to model a few extra long cables whose reflection timescale were larger than 1500 ns but were not in the H22a data set. The SVD used in crosstalk subtraction was previously limited to 30 delay modes; we increased it to 50 based on experiments where it made the crosstalk residuals a bit more consistent with noise. Finally, we revised how weights were applied before computing the SVD. First, we weighted each time by the frequency-averaged number of samples.
H22a used an unweighted SVD, but as Figure 3 shows, the approximation of weights that are constant in time breaks down when considering such a large and discontinuous data set. Second, we also set the weight in the SVD to zero from 15.3-21 hours to prevent bright galactic emission at the edge of the data set from introducing temporal structure that caused the crosstalk subtraction procedure to perform worse for the most sensitively measured LSTs.

Updates to Power Spectrum Estimation Since H22a
Just as with the data reduction pipeline, we sought to apply the same power spectrum analysis procedures and choices as were used in H22a. However, differences in LSTs observed, flagging, and systematics removal motivated slightly different approaches to how the data should be reduced and cut. These decisions were made without reference to the final power spectra in order to minimize experimenter bias.

Picking Bands and Fields
The two frequency bands analyzed in H22a were Blackman-Harris tapered ranges from 117.  MHz. These were motivated by the two contiguous regions of low flag occupancy (see Figure 12 in H22a). We reproduce that same plot in Figure 8 and following the same logicalbeit with somewhat different flags-pick Band 1 and Band 2 to range from 117.  MHz, respectively. The bands still center on approximately the same redshifts: z = 10.4 and z = 7.9.
Because we observed a larger range of LSTs than in H22a, we need to define new fields (i.e. new LST ranges) in which to independently estimate power spectra. In order to motivate the choice of fields without reference the final power spectra, we looked at two other statistics, which we plot in Figure 9. These two statistics are the inverse-variance-weighted, baseline-averaged delay-zero power, P (τ = 0)-a proxy for foreground power-and P N , which is flat in delay and tells us about both foreground power and total observation time. We used these to define a total of five fields: A (21. Here we illustrate the bands as Blackman-Harris windows, indicating the relative weight of the different channels in our final power spectra. These two bands are from 117.  MHz, corresponding to z = 10.4 and z = 7.9 respectively. These bands differ very slightly from the bands with the same names used in H22a. (6.25-9.25 hours), and E (9.25-14.75 hours). 6 We restricted our field boundaries to quarter-hour increments to avoid cherry-picking integrations. The rationale for defining the fields is as follows. We wanted fields B, C, and D, to correspond reasonably well to fields 1, 2, and 3 in H22a, so to cover the new LST ranges, we added fields A and E. Field A was set by the flagging gaps at either end, intentionally avoiding the last integration before the flagging gap between 0 and 1 hours due to the potential for signal loss from crosstalk subtraction (Aguirre et al. 2022). Likewise, field B was defined to exclude the first integration after the gap and keep Fornax A in the main lobe no brighter than its brightest point in the first sidelobe around 2 hours. Field C was defined to start at a roughly symmetrical place to where Field B ended with respect to Fornax A and to include the range of maximum sensitivity from roughly 4-6 hrs. Thus, the upper field boundary was set by the sidelobe of the Galactic plane at 6.25 hours. The boundary between fields D and E was set to include the roughly symmetrical sidelobe at ∼9 hours within Field D, keeping the Galactic plane contained to a single field. Field E ends a bit before where the crosstalk subtraction gets zero weight in the SVD. Once these field definitions were established, they were not allowed to change.

Crosstalk
Despite subtracting crosstalk on a per-epoch basis, we still found strong evidence for residual crosstalk on certain baselines. In the baseline examined in Figure 7, we can see clear residual power as an excess SNR in the delay range of 800 < |τ | < 1500 ns (right-hand panel). This is most prominent near the Galactic center, which got zero weight in the SVD, and near Fornax A, but there appears to be a slight excess at other LSTs as well. To quantify this, we averaged |SNR| over that delay range and over the three most sensitive fields, B, C, and D. This average was performed separately for positive and negative delays, since we now know that those two signals have independent origin (see Appendix A).
We computed this averaged |SNR| for every autobaseline pair (i.e. power spectra formed from the same baseline at interleaved times, rather than power spectra formed from different but redundant baselines) and plotted a histogram of it for each band in Figure 10 . Two metrics-the baseline-averaged power spectrum at τ = 0 in the top panel, and noise power spectrum amplitude in the bottom panel-were used to divide the data set into fields in which to independently estimate the power spectrum. Following H22a, which divided the data set into three fields, we divide the data set into five and label them A, B, C, D, and E since they do not directly correspond to the fields used in H22a. As we describe in Section 3.3.1, we picked the fields to avoid gaps in the data, Fornax A (at 3.4 hours), and the galactic center (at 17.8 hours). These are all features that introduce sharp temporal changes that make crosstalk subtraction more difficult. We also wanted to avoid the times where the SVD in the crosstalk subtraction was given zero weight (gray hatched region) to avoid affecting the more sensitive fields.
ing positive and negative delays as independent samples. Compared to an equivalently sized delay range from 2800 < |τ | < 3500 ns (solid black), we see evidence for a mild excess on most baselines. This is perhaps not too surprising; the crosstalk subtraction algorithm of  attempts to model and subtract the crosstalk down to the noise-in our case, the noise in a single epoch. To the extent that crosstalk remains correlated from epoch to epoch, integrating down should reveal more crosstalk. That said, there is a tail of outliers in Figure 10 that motivated us to perform a cut at |SNR| > 1.5. The cut was performed separately for positive and negative delays, so some baseline pairs are "half-flagged." The vast majority of auto-baseline pairs (>95%) were kept. More baseline-pairs were cut at negative delays than positive delays because the antenna ordering means that negative delays were more often associated with antennas nearer the crosstalk source (see Appendix A). We also computed the |SNR| for cross-baseline pairs as well and found that they were highly correlated with the |SNR| of the two corresponding auto-baseline pairs. However, we decided to more conservatively use only the auto-baseline pairs-which are not included in the final power spectra-for our cut. Any cross-baseline pair with one baseline participating in a flagged auto-baseline pair (and delay sign) was flagged. This cut is the most surgical and perhaps more worrisome analysis change from H22a, in the terms of removing individual power spectra before averaging. However, by only looking at high |τ |-well above the corresponding k values where we set our tightest upper limits-and by using only the auto-baseline-pairs, we insulate ourselves from the risk of cherry-picking and signal loss.
One other key change from H22a is the shortest eastwest projected baseline length allowed to be included in our final spherical power spectra. Even after averaging cross-baseline pairs within redundant groups, we still see a substantial uptick in |SNR| in the crosstalk delay range for baselines with 14.6 m projected east-west baselines, the shortest baselines used in H22a (see Figure 11). This makes sense physically-these baselines have their main lobe closest to zero fringe-rate, where the crosstalk subtraction algorithm only operates on a very narrow range of fringe-rates (Equation 3) for fear of signal loss. While the crosstalk is centered at 0 mHz, it has some width in fringe-rate space. We expect, therefore, that these baselines should be the first to show residual cross-talk as we integrate down. It was also likely true in H22a; the lower Residual Cross-Talk |SNR| at 800 < | | < 1500 ns Noise-Dominated |SNR| at 2800 < | | < 3500 ns Cut at |SNR| > 1.5 Figure 10. While our crosstalk subtraction algorithm removes the vast majority of the systematic, the technique is not perfect. We quantify the level of residual crosstalk by taking the average magnitude of the SNR in our most sensitive fields-B, C and D-and in the delay range most affected by crosstalk, 800 < |τ | < 1500 ns. Comparing this quantity to the same quantity computed in the noise-dominated delayrange of 2800 < |τ | < 3500 ns, we find significant excess. Using auto-baseline-pairs, which are not included in the final power spectrum, we eliminate baselines with |SNR| > 1.5. Note that our most sensitive upper limits come from lower delays, so this is not a potentially biasing cut on SNR in the final quantity of astrophysical interest. Because the crosstalk at positive and negative delay is sourced independently (see Appendix A), we perform this cut separately for positive and negative delays. Because, by convention baselines are east-west-orientated and not west-east-oriented and because the crosstalk emitter is close to the west side of the array, more negative delays are flagged than positive delays. In all, this cut flags 3.5% and 3.8% of baselines at negative delays for bands 1 and 2, respectively, and 1.0% and 0.9% of baselines at positive delays. noise level in these data simply makes the systematics clearer.
We decided to conservatively increase that cut to 15 m, throwing out several redundant baseline groups, including the most sensitive single baseline group used in H22a, the single-unit 14.6 m east-west baseline. To keep this baseline without simply accepting excess systematics, we would have had to find a way to more ag- Field B Field C Field D Figure 11. Even after flagging baselines for residual crosstalk, averaging the power spectra of cross-baseline pairs incoherently within redundant groups shows residual crosstalk. This is not surprising, since Figure 10 showed ubiquitous excess in this delay range. What averaging reveals more clearly is an important trend in crosstalk |SNR| with projected east-west baseline. Baselines with small eastwest components are the slowest fringing, which means that the crosstalk subtraction algorithm of  only attempts to remove crosstalk in a relatively small range of fringe rates for fear of removing cosmological signal. That is why baselines with east-west components less than 14 m were removed before power spectrum estimation in H22a and do not appear here either. Based on this metric, we decided to extend that cut to baselines with east-west components less than 15 m (those inside the dashed lines), which includes all single-unit separated baselines in the hexagonal grid (see Figure 4). While some other long baselines also show strong residuals, this noted jump among the most sensitive baselines was concerning enough to merit a cut.
gressively filter crosstalk, which would then have necessitated a more thorough and precise quantification of baseline-dependent signal loss using end-to-end simulations. Since our aim in this work is to apply the analysis of H22a as directly as possible, we defer such an investigation to future work.

Changes to k Cuts and Bins
The final key analysis change between H22a and this work is an increase in the area of power spectrum modes that were excised from within the EoR window (but still near the wedge). In H22a, the modes excluded from the spherical power spectra were all those within 200 ns of the horizon wedge (Equation 8). This "wedge buffer" has a long history in the field, going back to Parsons et al. (2012a), which suggested that a combination of foreground and beam chromaticity and the application of tapering functions in the delay power spectrum can extend power ∼0.15 hMpc −1 beyond the horizon wedge. The choice of 200 ns in H22a (equivalent to 0.11 hMpc −1 at z = 7.9 and 0.10 hMpc −1 at z = 10.4) was motivated by Figures 14 and 15 of that paper, which show power spectrum SNRs after cylindrically binning to k -k ⊥ space. The buffer was picked to mostly exclude the region of k-space with SNR consistently larger than 1, while balancing that exclusion of foregrounddominated modes at low k ⊥ against the admission of noise-dominated modes at high k ⊥ .
Reassessing the same question in light of our equivalent plots of SNR in cylindrical k-space ( Figure 17 and Figure 18), we increase the wedge buffer to 300 ns, to achieve roughly the same balance of excluding and admitting modes. We picked 300 ns as a round number to avoid cherry-picking. This produces a wedge buffer at k = 0.15 hMpc −1 in Band 1 and k = 0.17 hMpc −1 in Band 2, which is more in line with the value suggested by Parsons et al. (2012a), and used in early HERA forecasts (Pober et al. 2014). That said, increasing the wedge buffer is another sensitivity hit, so finding other ways to mitigate foreground emission near the wedge (e.g. Ewall-Wice et al. 2021) in future work might increase the constraining power of this data set.
One other minor change to our spherical power spectra is our precise binning in k. In most of the EoR window, k is dominated by k , which maps to τ . H22a picked k bin centers and widths with the intention that two τ modes would fall into each bin. However, using a fixed ∆k of 0.064 hMpc −1 for both bands, with the first bin centered at k = 0, did not quite achieve this. Some fixed-τ modes got split between k bins and some k bins had more power spectra averaged together than others. To achieve a better alignment with k bin centers nearer the average k value of modes in the bin, we used ∆k = 0.0619 hMpc −1 in Band 1 and ∆k = 0.0709 hMpc −1 in Band 2m with the first bin centered at 3∆k/4. For more details on this change, see Dillon (2021b).

Functions
The expectation value of the estimated power spectrum for a given baseline and delay,P (u, τ ), is actually a weighted sum of the neighboring true bandpowers P (k). These weights are usually referred to as the window functions W , defined througĥ In H22a, the horizontal error bars on the spherical power spectrum were evaluated with the same assumptions made to analyze the data, i.e. the delay approximation, in which the delays and line-of-sight Fourier modes are treated interchangeably. This leads to underestimating the tails of the window functions, and in particular the foreground power leaking into the EoR window at low k (Liu et al. 2014a). In this work, we estimate the exact window functions by lifting this approximation in the derivative of the covariance with respect to each bandpower, and hence obtain an accurate description of the mapping between instrumental space (u, τ ) and cosmological space (k ⊥ , k ). In doing so, we can account for the delay approximation when comparing theory to data because we now know exactly which k ⊥ and k modes contribute to a given bandpower and in what proportion. Note that, in order to account for the frequency dependence of the HERA primary beam, we have used the simulations introduced in Fagnoni et al. (2021). For details on the derivation of the window functions, as well as a complete illustration of their importance in the analysis of low-frequency radio data in general, and of the HERA data in particular, we refer the interested reader to Gorce et al. (2022). We discuss the impact of the improved window function calculation in Section 5.2.

Quantifying Decoherence Due to Non-Redundancy
Because we average together power spectra of pairs of different baselines within a redundant group, we must quantify the effect of non-redundancy on the power spectrum. We know HERA's putatively redundant baselines are not quite redundant (Dillon et al. 2020;Carilli et al. 2020), so we should expect some level of decoherence when cross-multiplying baselines that see a slightly different beam-weighted sky. Following H22a, we compare incoherent power spectrum averages-which are decoherence-free by construction because they only use auto-baseline pairs-to forming power spectra from visibilities coherently averaged within a redundant baseline group. This yields a metric for decoherence given by where the angle brackets indicate a rolling time average over 1 hour timescales to ameliorate the effects of nulls in power for certain baselines (as was done in H22a).
In particular, we examine ∆κ(t, τ = 0), which we take as a foreground-dominated (and thus high SNR) metric of decoherence of sky signal in the primary beam. In Figure 12, we show the histogram of decoherence levels at zero delay, using different LSTs and unique baselines as samples of decoherence. H22a performed this analysis over a short range of LSTs around the galactic plane crossing (7.2-8.3 hours) and found a ∼1% signal loss due to decoherence. We measure ∆κ by taking a median over a wider range of LSTs-all five fields-in part to account for LST-or JD-dependent gain errors and the possibility that the array became more or less redundant Band 1 Distribution Band 1 Median: -1.9% Band 2 Distribution Band 2 Median: -2.4% Figure 12. Following H22a, we show here our estimate for signal loss due to non-redundancy while forming and averaging cross-baseline power spectra within redundant baseline groups. Our metric ∆κ, defined in Equation 10, looks at the difference between the power in an incoherent average and a coherent average over baselines. We compute those powers at τ = 0, the delay of sources at zenith, as a proxy for cosmological signal loss in the main beam. As in H22a, that difference is normalized by the incoherently-averaged power smoothed on 1 hour timescales to minimize the effect of nulls. This quantity is then weighted by the inverse noise variance, and we show here the histogram over unique baselines and over our five fields. Taking the median of this histogram, we estimate a 1.9% correction for Band 1 and a 2.4% correction for Band 2, as we discuss in Section 3.3.6.
over the season. Assuming that the signal loss due to non-redundancy should relatively be stable in LST, and with the knowledge that nulls in power on certain baselines can create spurious temporal outliers in ∆κ that always create extra apparent decoherence, even with the 1-hour smoothing, we take the median of this histogram as our signal loss estimate. This yields a slightly larger estimate of the loss: 1.9% for Band 1 and 2.4% for Band 2. This will be accounted for in our final power spectrum upper limits, as we will discuss presently.

Correcting for Potential Biases and Signal Loss
H22a performed a careful accounting for four potential sources of bias in the final power spectrum. Three of these are forms of signal loss-ways in which true sky power from the 21 cm signal can be removed due to the analysis choices we made. The fourth, due to absolute calibration, produced an overall bias in H22a that affected both our measured power spectrum and also our noise and error estimates, which are ultimately derived from autocorrelations which were also biased (Aguirre et al. 2022). Each of these corrections was applied separately per band.
In Table 2, we report all of the per-band bias corrections used in this work. The corrections for crosstalk Table 2. Summary of fractional signal power lost over the course of the analysis. The effect of non-redundancy is estimated from the data directly (Section 3.3.5); the rest are derived from simulations of sky signal, noise, and instrumental systematics (Section 4). subtraction, coherent time averaging, and redundantbaseline averaging are all forms of signal loss that do not affect our autocorrelations and thus our estimate of the thermal noise P N (though they can have a small effect onP SN ). All these effects are taken into account when reporting power spectra, errors, and upper limits in Section 5. We discussed our evaluation of the signal loss due to non-redundancy in Section 3.3.5. The other three sources of bias are quantified using the realistic sky and instrument simulations we use to validate our analysis pipeline, as we discuss in Section 4. The first two, signal loss due to coherent time averaging and signal loss due to crosstalk subtraction, are evaluated in simulations performed without noise in order to precisely measure the effect on a known EoR-like signal in Section 4.3. As discussed in Section 3.2.2, the absolute calibration bias that was due to the effect of low-SNR visibilities should have been eliminated in this work. Indeed, we will show in Section 4.2 that the effect has been reduced in magnitude to less than 1% in the gains, and has reversed sign-our gains now appear to be biased very slightly low. If this is correct, then that would lead to an overestimate of our power spectra, error bars, and upper limits. Since we are far more concerned with the possibility of signal loss leading us to report an upper limit lower than the data justify, we choose conservatively not to adjust for this effect in the limits we report in Section 5.

ANALYSIS PIPELINE VALIDATION WITH SIMULATIONS
Before we present our final upper limits in Section 5, we first report the results of our extensive simulationbased validation of the analysis pipeline. The techniques used for simulating our instrument and the analyses performed on the output of those simulations are very similar to those in Aguirre et al. (2022), which was written to support H22a. We present here a brief summary of how we applied those techniques to this work, highlighting the relevant updates, and then show some key results that both validate the overall pipeline and help us to quantify specific signal loss biases that we correct for (see Section 3.3.6).

Visibility and Systematics Simulations
Our primary method for validating the analysis pipeline presented in Section 3 is via an end-to-end simulation, wherein we generate realistic visibility simulations of foregrounds, noise, and a boosted EoR-like signal that should be easily detectable given our sensitivity. This allows us to holistically evaluate the performance of the analysis and identify any unknown sources of bias. One could also approach the same problem by injecting a signal of known amplitude (larger than the real EoR) into the visibilities and analyzing that data in the same way. However, this technique requires high confidence in one's calibration solutions (injected visibilities must be "uncalibrated" before injection) and it is difficult to disentangle residual systematics form power spectrum biases.
Our sky simulations consist of three unpolarized components: diffuse Galactic emission, a point source catalog, and an EoR analogue created as a Gaussian random field drawn from a known power spectrum. We present each of these here and discuss how they differ from the corresponding components in Aguirre et al. (2022). For the diffuse Galactic emission, we use the Global Sky Model (GSM; de Oliveira-Costa et al. 2008), computed at every frequency we measure using pygdsm 7 . The result is a HEALPix map (Górski et al. 2005) with resolution N side = 512. We smooth the map with a 1 • Gaussian kernel and degrade it to N side = 128 to speed up our simulations, since that provides sufficient resolution to cover the angular scales HERA is sensitive to. Each HEALPix pixel is treated as a single point source at its center using the pixel area for normalization.
Our model of point source emission differs significantly from Aguirre et al. (2022) in our handling of spatial gaps in the MWA GLEAM catalog (Hurley-Walker et al. 2017) due to the Galactic plane. Since H22a used observations spanning a much smaller range of LSTs than those used in this work, it was decided to slightly further restrict the range of LSTs simulated to 1.5-7 hours in order to avoid those gaps. In this work, we wanted to be able to simulate the full 24 hours of LST, so we developed a GLEAM-analogue with simulated point sources. We created 14,073,688 random synthetic sources with uniformly random positions across the sky. Their fluxes were drawn to match the GLEAM source count distribution in Franzen et al. (2019) from 0.001-87 Jy. Their spectral indices were drawn from a Gaussian distribution with mean −0.79 and standard deviation 0.05, again following Franzen et al. (2019) (see also, e.g., Offringa et al. 2016;Carroll et al. 2016). The 200,000 brightest sources were simulated at their random positions, the remainder (those below 0.1975 Jy) were treated as confusion noise. For the sake of computational expediency, these were added to the GSM at the nearest HEALPix grid point. Above 87 Jy, we added a several real sources in their true positions. This includes GLEAM J215706-694117, GLEAM J043704+294009, GLEAM J122906+020251, GLEAM J172031-005845, and the "A-Team" sources reported in Table 2  Finally, our boosted EoR analogue is created with the same simulator 8 used in Aguirre et al. (2022). We used a higher amplitude mock EoR 9 and a slightly different power spectrum slope: P (k) ∝ k −2.7 . That EoR simulation was binned to the same HEALPix grid as the diffuse foregrounds, where again each pixel center is treated as a point source.
To actually simulate visibilities, we use vis cpu (Bull 2021), 10 a fast visibility simulator validated against pyuvsim (Lanman et al. 2019), 11 a reference simulator designed for accuracy (Pascua et al. in prep.). The simulator used in Aguirre et al. (2022), RIMEz, 12 calculates visibilities in spherical harmonic space. vis cpu takes a much simpler approach. It calculates a per-antenna visibility factor for each point-like sky componentessentially the square root of the source flux with a phase factor that depends on frequency and antenna position-and multiplies them by a Jones matrix. These are then cross-multiplied to form visibilities and summed over sources. We use the primary beam calculated in Fagnoni et al. (2021) and interpolated in azimuth and zenith angle. We simulate the full 24 hours of LST at a five second cadence for each unique baseline and frequency observed by HERA.
Our end-to-end validation began with producing two sets of 94 nights of simulated data, one with the mock boosted EoR and one without. Both included foregrounds, noise, and systematics. For each night, we interpolated the original set of simulated visibilities using a cubic spline onto each night's 10.7 second cadence, since the LST grid of the data varied from night to night. Unlike in Aguirre et al. (2022), where we used only a subset of the antennas, in this work we inflated the redundant baselines to produce data files with all baselines not flagged in the real data. The final simulated data set completely matched the real data in terms of baselines, times, and frequencies. No non-redundancy due to antenna-to-antenna variation of beams or positions offsets (Orosz et al. 2019;Dillon et al. 2020;Choudhuri et al. 2021) was simulated.
We then "uncalibrate" the data by applying perantenna complex bandpasses and cable reflection terms and then add noise to the visibilities, both steps performed exactly the same way as in Aguirre et al. (2022). Finally, our simulations have per-baseline crosstalk added to them. Again, this is done with nearly the same procedure as in the prior validation work. For each baseline, we model the crosstalk as a series of copies of the autocorrelation of each antenna in the baseline, each multiplied by a complex delay term and an amplitude that decreases exponentially with delay by a factor of 100 from the peak delay out to 2000 ns. In this work, the amplitudes and delays of the crosstalk peak, and how they depend on position in the array, are motivated by the new physical model of the crosstalk (see Appendix A) in which we attribute crosstalk to an emitter on the west side of the array. (In our model we use α = −2.29 and τ offset = 0 ns; see Equation A4 for definitions.) The crosstalk structure is allowed to change per epoch, but not per night. The amplitudes are on the low end of what is observed in real data in order to avoid cross-talk effects on certain baselines becoming crosstalk-dominated, since the visibility simulation itself is also somewhat underpowered on long baselines relative to the real sky. We do not explicitly model any of the multipath effects hypothesized to be responsible for the breadth of the crosstalk spectrum, relying instead on that series of peaks as in Aguirre et al. (2022). All of the techniques for visibility corruption, along with an interface to vis cpu, are packaged into hera sim. 13

Validation Results from End-to-End Simulations
With our procedure for turning visibility simulations into full nights of "uncalibrated," systematicscorrupted data matching the real data, we then apply our analysis pipeline almost exactly as described in Section 3. We perform redundant-baseline calibration and absolute calibration, then flag, then smooth our gain solutions. We next bin together individual epochs and perform inpainting, cable reflection calibration, and crosstalk subtraction. After binning together the four epochs, we form pseudo-Stokes visibilities, time-average, and estimate power spectra. We run the entire end-to-end pipeline twice-once without the mock EoR and once with it-including the various power spectrum cuts described in Section 3.3. In order to faithfully validate the analysis pipeline, the same software packages-pyuvdata 14 (Hazelton et al. 2017), hera cal, 15 hera qm, 16 and hera pspec 17 -with the same git hashes that were used to run the end-toend validation. 18 The one major step that was performed differently was RFI flagging. Following Aguirre et al. (2022), we do not inject RFI and then attempt to detect it. Rather, we simply cross-apply the real data's flags to the simulated data at the same step in the pipeline. Since the times, frequencies, and baselines match perfectly, this step is very straightforward. Likewise, we take the final flag-13 https://github.com/HERA-Team/hera sim 14 https://github.com/RadioAstronomySoftwareGroup/pyuvdata 15 https://github.com/HERA-Team/hera cal/ 16 https://github.com/HERA-Team/hera qm 17 https://github.com/HERA-Team/hera pspec 18 With one exception: the first round of LST-binning was performed with a newer version of hera cal because the version run on data assumes that flagged baselines are left in the data files and just flagged. Our end-to-end validation run skipped simulating these antennas, which required an update to the original code.  Figure 13. In our end-to-end tests with simulated data, we can compare the known input gains to the derived gains after redundant-baseline calibration, absolute calibration, and calibration smoothing. Here we show the average gain errors after averaging over antennas, times, and nights for Epoch 1, the longest epoch. In the bottom panel, we show the error in our gain phases; these are comparable to those found in the validation of our H22a limits in Aguirre et al. (2022).
In the top panel we show our gain amplitude errors. These are substantially smaller than those found in Aguirre et al. (2022) due to our new algorithm for absolute amplitude calibration which is not biased when visibility SNR is low (see Section 3.2.2 for details). In the bands of interest, our gains are correct to within 1%. While we do have some evidence that they are biased slightly low, which would lead to a power spectrum that is slightly too high, we choose conservatively to not to adjust the power spectrum to compensate (see Section 3.3.6).
ging mask generated from a manual inspection of the data (see Section 3.2.3) and apply it to each epoch of simulated data at precisely the same point in the real pipeline that those flags are applied to real data.
Before we discuss the final power spectra, we can now evaluate whether the fix offered in Section 3.2.2 eliminates the bias in absolute calibration discovered in Aguirre et al. (2022) and corrected for in H22a. In Figure 13, we compare the calibration solutions produced by the pipeline for Epoch 1 (the longest epoch) with the known simulated gains, averaging over antennas, times, and nights. By comparison to Figure 11 of Aguirre et al. (2022), we can see that the absolute calibration bias is largely eliminated and that the phases are still wellrecovered across the band. The gain errors are largest at the band edges, due to edge effects of the low-pass filter used in gain smoothing.
Interestingly, we actually now see a slight negative bias for most frequencies (0.6% for Band 1, 0.9% for Band 2), which could lead to overestimating power spectra and error bars by a few percent, since gain errors impact power spectra quartically. One possible origin of the effect is a rare failure mode of the absolute calibration bias fix described in Section 3.2.2. When fitting for a single overall amplitude, solutions are biased but quite stable in time. When fitting for a complex number and then taking the absolute value of that, we have seen rare instances where the data and reference visibilities are so far apart that the gains are driven to 0 in order to minimize χ 2 . In real data, these sorts of collapses are easily identified as discontinuities and flagged as RFI. This justifies our conservative choice in Section 3.3.6 to not correct for any remaining absolute calibration bias since we do not expect this effect to be as large on our final upper limits. However, in our end-to-end pipeline validation, we flag precisely where we did in the real data, ignoring any potential new discontinuities. These artificially low gains can thus be the consequence of a rare calibration failure getting spread out and diluted by gain smoothing.
We turn now to the final result of our end-to-end test: spherically-averaged power spectra for both bands and all five fields, with and without our mock boosted EoR. In Figure 14, we show one field for each band (corresponding to our lowest limits, see Section 5). The power spectra include signal loss corrections for crosstalk subtraction and coherent time averaging, as described in Section 3.3.6, but no correction for non-redundancy as none is included in the simulation. For comparison, we also show the results of a simulation run with only the mock EoR with no foregrounds, noise, systematics, or flags (black line). It is interpolated directly onto the grid of the final LST-binned data set, time-averaged, converted to pseudo-Stokes I, and formed into power spectra.
In general, we find the results of our end-to-end test in good agreement with the mock EoR-only power spectrum, as the bottom panels of Figure 14 show. Though the error bars on the power spectrum with EoR signal may be underestimates (since sample variance is not accounted for in our real pipeline), there is no evidence for an additional, unaccounted-for contribution to signal loss that might lead us to report artificially low upper limits. If anything, we see some evidence that our results are biased slightly high. This is likely due to a number of factors; the bias high due to rare absolute calibration failures, a possible overestimation of the im- Here we show the final result of our end-to-end test of our data reduction pipeline on simulated data. We simulated visibilities both with (green) and without (purple) a boosted EoR analogue Gaussian random field, along with foregrounds, noise, and instrumental systematics, for all 94 nights. (For visual clarity, the points along the k axis were slightly off set between green and purple.) This was then processed with almost all the same code as was used to process the real data in order to thoroughly test our analysis chain. As the bottom panels showing the ratio of our end-to-end-test (the green points) to the particular boosted mock EoR realization (black solid line), our results match the EoR realization quite well. There is no evidence for additional signal loss not accounted for in Section 3.3.6. While this calculation was performed for all bands and fields, we show here just the two fields where our lowest limits are derived at each band (see Section 5).
pact of signal loss, and the effect of flagging and inpainting. When the same pipeline is run without the EoR, the results are consistent with noise at almost all k-a result of the somewhat aggressive cuts we discuss in Section 3.2.4 and Section 3.3.3-though we unsurprisingly do see some residual foregrounds at very low k, especially in Band 1. This is consistent with what we see in the real data in most fields (see Section 5.2).

Noise-Free Tests for Quantifying Potential Signal Loss
Separately from our end-to-end simulations, we also repeat two tests of signal loss performed in Aguirre et al. (2022) that were used in H22a to correct the final power spectrum measurements (see Section 3.3.6). The first is the effect of coherently time-averaging LST-binned visibilities from 21.4 s to 214 s before cross-multiplying interleaved visibilities to form power spectra. This is done by interpolating mock EoR-only visibilities onto the 21.4 s grid of the final LST-binned data set, forming pseudo-Stokes I, and then either forming power spectra directly or forming power spectra after coherently averaging to a 214 s cadence with rephasing as described in Section 3.1. If we take the result with 21.4 s integrations to be loss-free, which it should be to a good approximation, then we find a 1.2% signal loss for Band 1 and a 1.5% signal loss for Band 2 (see Figure 15).
It is not surprising that the signal loss should be slightly higher for Band 2, since the primary beam is smaller at higher frequencies and the rephasing of visibilities before averaging cannot account for the changing beam-weighted sky. Likewise, if we were to open up additional degrees of freedom in our signal loss correc-  Figure 15. Here we repeat the test of decoherence from time-averaging shown Figure 17 of Aguirre et al. (2022). We compare power spectra computed from EoR-only visibility simulations at a 21.4 s cadence-which is what we obtain after LST-binning-to the same data averaged to a 214 s cadence before forming power spectra. After averaging over all fields and baselines, we find a result in agreement with Aguirre et al. (2022). In H22a, the two bands are averaged together and the effect is taken to be a 1% correction; here we keep the two bands separate to arrive at our signal loss correction factors (see Section 3.3.6 and Table 2).
tion, we would likely find that this decoherence depends on baseline length and orientation-baselines that fringe more quickly should see more loss here. Since we are using the same weighting of baselines here as in analysis of the real data, the overall loss estimate should be correct to first order. Therefore, because the overall signal loss is quite small, it is not necessary to further complicated the signal loss correction by making it baseline dependent. Regardless, these values are consistent with the 1% loss figure used in H22a and calculated in Aguirre et al. (2022), which did not separate out the two bands. The second specialized test is to examine the impact of crosstalk subtraction (described above in Section 3.1 and Section 3.2.4). The crosstalk subtraction algorithm devised in , demonstrated in Kern et al. (2020b), and employed for H22a removes power near fringe-rate zero. The maximum extent of this removal in fringe-rate space (Equation 3) is designed to keep signal loss at the ∼1% level. To measure this effect, we interpolate to get one data set per epoch with foregrounds, mock EoR, and crosstalk injected, but no noise or calibration errors. To this data set we apply our final flagging mask, inpaint, subtract cross-talk, LST-bin the four epochs together, form pseudo-Stokes visibilities, coherently time-average, and then form power spectra. We spherically average those power spectra over baselines using the same noise-based weights as were applied to the data. In Figure 16, we compare those per-time-step power spectra to mock EoR-only power spectra with the 20 22 0 2 4 6 8 10 12 14 16 Figure 16. Our final test of potentially biasing signal loss examines the effect of our crosstalk subtraction algorithm on the EoR signal. Here we compare simulations with just our boosted EoR model-the numerator in Figure 15-to noise-free, per-epoch simulations with EoR, foregrounds, and crosstalk systematics. Crosstalk subtraction is performed on a per-epoch basis, after which epochs are combined, formed into power spectra, and then averaged as usual to get spherical power spectra as a function of time. Taking the median over delays less than 4000 ns and LSTs in our five fields, we estimate 2.4% loss in Band 1 and 3.3% loss in Band 2, which we report in Table 2.
same averaging performed. We take a median over delay up to 4000 ns, the highest delay in the crosstalk subtraction SVD, to produce a single bias estimate per LST and per band. As H22a argues, we expect the crosstalk subtraction bias to scale-independent. The final result is a bit difficult to interpret. Clearly the result is LST-dependent; just as in Aguirre et al. (2022), we see higher levels of signal loss near gaps in the data-this was one of the reasons we avoided the range from 0-0.75 hours of LST. We also see some evidence under-subtraction around Fornax A (in Band 1) and the Galactic center, though the latter is expected due the weighting in the SVD. To avoid the effect of outliers in both directions, we estimate our per-band signal loss by taking the median over LSTs in the five fields. This yields 2.4% signal loss in Band 1 and 3.3% loss in Band 2, which is basically consistent with the 1% and 3% used in H22a.

UPPER LIMITS ON THE 21 CM POWER SPECTRUM
Having surveyed our technique for reducing 94 nights of visibilities to our final power spectra, and having validated that technique and quantified the signal loss biases we must correct for, we are finally in a position to present our upper limits on the 21 cm EoR power spectrum.

Cylindrically-Averaged Power Spectra
We begin with cylindrically-averaged power spectra. These are first averaged over all cross-baseline pairs within each redundant baseline group (excluding those with high residual crosstalk, see Section 3.2.4). Next they are averaged incoherently in time over each field. Finally, they are incoherently averaged in |u|, combining together baselines with the same length. All averages use P N (Equation 6) to perform inverse variance weighting; P N andP SN (Equation 7) errors are propagated during each step. This produces power spectra in τ and u, which is equivalent in the delay approximation to k and k ⊥ (Parsons et al. 2012a). These cylindrical power spectra are the most sensitive data products that still keep k and k ⊥ separate, which is useful because different scales along those two axes are measured in a fundamentally different way instrumentally and correspond to different ways of measuring distance cosmologically. While we are not yet sensitive enough to constrain the cosmological signal's dependence on line-of-sight velocity effects, cylindrical power spectra are still useful for evaluating how foregrounds and other systematics appear upon deep integration.
We show our cylindrically-averaged power spectra for each of the five fields in Figure 17 for Band 1 and Figure 18 for Band 2.
In the top row of each figure, we show the real part of the power spectrum. 19 We also show the power spectrum SNR, P/P N in the bottom row. In regions that are noise-dominated, we expect roughly half of the power spectrum bins to be negative (white) and we expect the SNR to have mean 0 and standard deviation 1.
What we see is fairly consistent with that expectation across most of the EoR window. However, the foreground-dominated region clearly extends well beyond the horizon wedge (gray dashed line). This justifies the need for a buffer beyond the horizon wedge and our choice specifically to expand it from 200 ns as in H22a to 300 ns (black solid line). As we discussed in Section 3.3.3, our choice of the buffer was set by examining these SNR plots and trying to balance the amount of foreground leakage into the window at low k ⊥ with the amount of noise-dominated modes excised at high k ⊥ . 19 The power spectrum is complex because it involves a crossmultiplication of two independent times. We expect the imaginary part to be noise-dominated in the EoR window and dominated in the wedge by signal-noise cross terms (Tan et al. 2021).
This is exactly what we see, both in cylindrical and spherical power spectra, but we do not show them here for the sake of brevity. Their consistency with noise, however, is shown in Table 5.
Note that a number of data points that we excise from our spherically-averaged power spectra are still shown in Figure 17 and Figure 18. We have not yet removed the modes under the horizon wedge and buffer, nor have we removed baselines with projected east-west distance less than 15 m (though baselines with a projected eastwest distance less than 14 m have already been removed because crosstalk subtraction could not be performed on them without substantial signal loss, see Section 3.1).

Spherically-Averaged Power Spectra and the Deepest Upper Limits
We now turn to our spherically-averaged power spectrum measurements and upper limits. These are produced from averaging together all baselines incoherently after excising modes below the horizon wedge (Equation 8) plus a 300 ns buffer and all baselines with a projected east-west length less than 15 m (which includes the single-unit east-west spacing; see Figure 4). The result, for all bands and fields, is shown as a "dimensionless" power spectrum (Equation 5) in Figure 19. 20 We estimate vertical errors by propagatingP SN (k) (Equation 7) and show the measurements ±2σ after converting to ∆ 2 using Equation 5. When measurements are negative, we adopt the strong prior that the true power spectrum must be zero or positive and plot upper limits at precisely 2σ. The 1σ error level derived fromP SN (k) is shown as a black dashed line. Our absolute lowest 2σ upper limits on the power spectrum are ∆ 2 (k = 0.36 hMpc −1 ) ≤ 3,496 mK 2 at z = 10.4 (Band 1, Field D) and ∆ 2 (k = 0.34 hMpc −1 ) ≤ 457 mK 2 at z = 7.9 (Band 2, Field C). These limits are 2.6 and 2.1 times deeper, respectively, than those presented in H22a. We report the measured power spectra, 1σ error bars, and and 2σ upper limits results at most values of k-including low k, which is where most of our astrophysical constraining power comes from-in Table 3 for Band 1 and Table 4 for Band 2. 21 In general, most of our measurements are consistent with the expected noise level. At low k, especially in Band 1, we see evidence for residual foregrounds beyond our horizon plus buffer cut. This makes sense; fore- Figure 17. Here we show our cylindrically-binned power spectra for all five fields for Band 1 (z = 10.4). In the top row, we show the real part of P (k); in the bottom row, we show the same power spectra normalized by the power spectrum of expected noise. Since noise-dominated regions of this space are equally likely to be positive or negative, we can see that most of the EoR window is noise-dominated. These power spectra are formed by averaging together all cross-baseline pairs within a redundant baseline group and then by incoherently averaging in time in each field. Both averages are performed with inverse noise variance weighting. We convert from baseline and delay to k ⊥ and k respectively using the delay spectrum approximation (Parsons et al. 2012a). Note that bins that fall below the horizon wedge (Equation 8) plus a 300 ns buffer, along with baselines with projected east-west distances less than 15 m, are included here to help illustrate systematics but are excluded from all spherically binned power spectra.
grounds are brighter at low frequency. For a few fields and bands, there is some evidence for residual crosstalk, which should appear between roughly 0.4-1.0 hMpc −1 . Close inspection of Table 3 and Table 4 also reveals a handful of large negative power spectra at the roughly −2σ level, especially in Band 2, Field E. There are more such points than might be expected from random noise, even after accounting for that fact that, as H22a showed, errors are correlated between neighboring k bins at the ∼25% level. This is potentially concerning, since ignoring autobaseline-pairs as we do here and as was done in H22a has the potential to introduce negative power spectrum biases (Morales et al. 2022), which would then lead to artificially low upper limits. Tracing back these points to their corresponding cylindrical band powers in Figure 17 and Figure 18 reveals only a handful of negative power spectra in excess of what is expected from ther-mal noise, the largest of which are all near |τ |=900 ns. Because these excesses appear at higher k than the best limits and because we conservatively chose to use 2σ as the upper limit wherever ∆ 2 is negative, it is unlikely that our deepest power spectra or the likelihoods used in Section 7 are significantly affected. That the systematics in this range remain mostly marginal speaks to the efficacy of our crosstalk subtraction and baseline selection techniques (see Section 3.2.4).
We use horizontal error bars in Figure 19 to represent the range of k values whose contribution sums to 68% of each measurement. To examine both the window functions and our deepest upper limit more closely, we reproduce our Band 2, Field C results from Figure 19 in the upper panel of Figure 20, along with the window functions themselves in the lower panel. The window function matrix W, defined in Equation 9 and propagated to the spherical power spectra, tells us the extent to which Figure 18. The same plot of cylindrically-binned P (k) and SNRs as in Figure 17, but for Band 2 (z = 7.9). different bandpower measurements are expected to be the linear combination of the true bandpowers across k modes. Each row-shown as a different colored line in the bottom row of Figure 20-sums to 1 by definition. The fact that each row is peaked along the diagonal of the matrix means dominant contribution to each measurement is in fact the k mode at which the measurement is reported.
As discussed in Section 3.3.4, these window functions more precisely take into account the delay approximation than the more simpler calculation used in H22a. While they may look similar to the window functions presented in H22a because of the linear scale used in the figure, there are as many as three orders of magnitude of difference in the amplitude of the tails, with the delay approximation leading to a large underestimate (Gorce et al. 2022). This is particularly important if the true 21 cm power spectrum has features in k space or deviates substantially from a roughly flat power law in ∆ 2 (k), though that is not the case in most "vanilla" models of reionization.

STATISTICAL TESTS OF THE POWER SPECTRUM UPPER LIMITS
In this section we report on a series of statistical tests designed to build confidence in our upper limits. Our goal is to test the self-consistency of our results by showing that they either integrate down like noise or are inconsistent with noise in ways that are well understood. Likewise, we split our data set in various ways to look for signs of possible residual systematic effects. By repeating key tests from H22a, we can help ensure that no new failure modes have cropped up.

Noise Integration Tests at High Delay
As in H22a, we performed a series of noise integration tests to determine whether bandpowers outside the wedge region are consistent with being noise-dominated. By cumulatively averaging the samples that went into the fully-averaged power spectra shown in Figure 19, we are able to test whether the samples average together as would be expected for bandpowers formed from uncorrelated (white) random noise, or whether an additional correlated or non-random signal, such as a source of systematic contamination, may be present.
Our first test is to simply integrate all bandpowers above a fixed k. Following H22a, we randomly draw 300,000 pure-noise realizations of the bandpowers using the error covariance matrix propagated through our var-  Table 3 and Table 4. Following H22a, we leave the incoherent averaging of different fields together for future work.
ious averaging steps and used to set the error bars on our power spectra. We then compute the fraction of noise realizations that are larger than the data, which is by definition the p-value of the measurement if the null hypothesis is that measured values are drawn from the noise covariance and have mean zero. In Table 5, we show the results of this test for both the real and imaginary parts of both bands' and all fields' power spectra.
In general, we find our imaginary power spectra consistent with noise across k, as expected. The one exception is in Field A for Band 1, where we are likely seeing noise-foreground cross terms dominating the imaginary power spectrum. Our real power spectra are another matter. We see strong evidence for inconsistency with noise at low k, which is expected given the foreground leakage just outside the wedge buffer in both cylindrical power spectra (Figure 17 and Figure 18) and at low k in spherical power spectra ( Figure 19). In Band 1, we also see more marginal evidence for inconsistency with noise even after k < 0.5 hMpc −1 is excluded. This is likely attributable to residual crosstalk, which appears in our final power spectra at the few-σ level.
While the p-values are a useful test of the null hypothesis on the final power, a lot can be obscured by collapsing an entire power spectrum to just a few numbers. To further probe the consistency of our results with noise, especially at high delay, we perform two additional tests. The first assesses how baselines integrate down within a redundant group. In Figure 21 we examine how a single delay mode at τ ≈ ±3000 ns integrates down as more baselines are added to the cumulative average. We use our most sensitive redundant group-the 29 m east-west baseline. We normalize the cumulative Table 3. Band 1 (z = 10.4) power spectra, errors, and upper limits from Figure 19. The upper limit is taken to be 2σ above the power spectrum measurement or above 0, whichever is greater. The lowest upper limit is in bold. average by the expected variance of that cumulative average derived from P N . We show 1σ and 2σ regions where we expect 68% and 95% of pure noise realizations to fall in gray. We have validated this calculation with a set of 100 white noise-only simulations that have been passed through the same flagging, weighting, and power spectrum estimation steps as the real data. The simulations are constructed to have noise variance that depends on time, frequency, and baseline in the same Table 4. Band 2 (z = 7.9) power spectra, errors, and upper limits from Figure 19. The upper limit is taken to be 2σ above the power spectrum measurement or above 0, whichever is greater. The lowest upper limit is in bold. way as the data, according to our empirical estimates of the noise power spectrum P N . The overall conclusion from Figure 21 is that there are no strong deviations from the expected noise-like behavior for any band or field in our data set at |τ | ≈ 3000 ns, according to this statistic. Band 1, Field D has the most substantial deviations, but these remain largely within the 95% region. Band 2, Field A also approaches the edge of the 95% region. Bands 1 and 2 of Field C have averages that are more consistently Here we show in greater detail the most sensitive 2σ upper limit on the 21 cm power spectrum we report, namely 457 mK 2 at 0.34 hMpc −1 Band 2 (z = 7.9), Field C. The information in the top panel is identical to that in Figure 19. We also include the window functions W (k) in the bottom panel, which tell us how each measurement is expected to be a linear combination of the underlying bandpowers. All band window functions peak at the measured k and are relatively narrow, hence the relatively narrow horizontal errors bars which are interpolated from the window functions to span the 16th to the 84th percentile.
close to zero than for any of the others, though this is not necessarily a statistically significant anomaly. Another similar test is shown in Figure 22 where we examine the same baseline group, but average over baselines in the group and over a wider range of delays (±2000-4000 ns) to look for evidence of LST-dependent systematics. We compare each average to the distribution of 100 noise realizations which are drawn from P N and reflect the real data's sampling and cuts. In the fields of interest, we see little evidence for significant outliers, indicating that there is no strongly LSTdependent high-delay systematic affecting our results for this baseline group. Outside our fields, we do see some large outliers from Fornax A and the Galactic center which might be concerning in the future if they appear elsewhere in HERA data as we integrate deeper.

Testing Systematics Mitigation
Though we have taken several steps to mitigate the effect of crosstalk-including subtracting it from our visibilities and cutting baselines that exhibit substantial residuals (see Section 3.2.4)-it is useful to understand Table 5. Statistical p-values, testing the consistency of the real and imaginary components of the spherical ∆ 2 (k) with noise over different overlapping ranges of k. At high k, most measurements are consistent with noise. At low k, we do see decisive evidence for residual systematics (values when p < 0.01, in bold)-mostly foregrounds and perhaps also crosstalk, especially in Band 1. We do not quote numbers below p < 0.001, since we cannot precisely calculate such low p-values with our 300,000 random draws. how those choices may have affected the distribution of bandpowers. In H22a we performed a one-sample Kolmogorov-Smirnov (KS) test comparing the cumulative distribution function (CDF) of power spectra over 0.4 hours of LST to a validated analytic noise model. Here we iterate slightly on that test with a two-sample KS test comparing the CDF of measured bandpowers for the redundantly-averaged 29 m east-west baseline group to that of the same 100 noise realizations as in our zscore test above. Each CDF is computed over LSTs in Field C and over a series of 200 ns-wide delay bins (except the first bin, which spans 0-500 ns). In Figure 23, we show the results of that work for both bands and for  Cumulative average of the (crosstalksubtracted) delay power spectrum for a single redundant baseline group (29 m east-west) as a function of the number of baseline pairs, for our two frequency bands and five fields. Only the results for single bandpowers at τ ≈ ±3000 ns are shown here, corresponding to k ≈ 1.48 h Mpc −1 (Band 1) and k ≈ 1.68 h Mpc −1 (Band 2). This is well above the crosstalk contaminated delay range. Solid lines denote positive delays, dotted denote negative delays, and an incoherent time average within each field has been performed for each baseline pair before the cumulative average. Different amounts of flagging apply to each band, field, and bandpower, hence the different lengths of the lines. The cumulative average is normalized by the expected noise variance, calculated from the mean noise power over each field/band; see the text for more details. The gray bands show regions of 1σ and 2σ corresponding to where the cumulatively-averaged power spectra of white noise with the same (inhomogeneous) noise variance as the data, PN , would be expected to fall 68% and 95% of the time. We see no strong evidence for any violation of the null hypothesis that these baseline groups average down like noise at this delay. data with (dashed lines) and without cross-talk subtraction (solid lines).

p-value
To infer the expected range of distances between CDFs, we also compute a distribution of two-sample KS statistics for each delay ranges by comparing many pairs of independent noise realizations. We show the 2σ range of that distribution in the gray band of Figure 23. We can see that while the distribution of bandpowers was highly inconsistent with noise before crosstalk subtraction, afterwards they are consistent beyond 500 ns for this baseline group.

Jackknife Tests Across Epochs
In order to check the consistency of the epochs (defined in Section 2.1) at the bandpower level and thus justify (post hoc) the decision to combine them to arrive at our final limits, it is useful to begin by looking at single-epoch power spectra. These are formed by simply going straight from crosstalk subtraction to forming pseudo-Stokes I and coherently averaging. In Figure 24, we show one such power spectrum from Band 2, Field C-our deepest limit-and compare measurements and vertical error bars from individual epochs to the full data set. The full data set's power spectrum (which is identical to that in Figure 20), is generally lower than any individual epoch. Since most individual epochs are consistent with noise, this is not too surprising. Since the epochs are binned together with weights proportional to the number of visibility samples, it is not trivial to look at this figure and read off the impact of each epoch on any given k mode. Therefore, while there are no obvious inconsistencies visible here, it behooves us to approach the question more quantitatively.
We answer that question by performing a Bayesian jackknife test across epochs. This test is described in detail in Wilensky et al. (2022); we summarize our specific implementation of it here. For each band, field, and spherically-averaged k-mode, we consider 2 N hypothetical models for the N ≤ 4 epochs that contribute to the particular band and field's power spectrum at the specified k-mode (Epoch 0, for example, does not contain any observations in Field E). In each hypothetical model, we propose that a bandpower is either unbiased-i.e. consistent with zero-mean noise-or strongly biased-i.e. consistent with measuring a signal-by an unspecified amount within a relatively constrained prior range. To compute the posterior for each model, we marginalize over a multivariate Gaussian bias prior that is centered at 6σ i with a width of σ i in each direction, where σ i is the square root of the estimated variance for the ith bandpower. This is intended to represent a strong but not excessive bias with an a priori SNR in the 4-8 range. We then use a series of Bayesian model comparisons to identify which epoch(s) is most likely to be biased for any given bandpower. Figure 25 shows the results of this test for each band, field, and k mode. To calculate the inferred odds of there being at least one significantly biased epoch-albeit only when the large biases reflected by the alternate hypotheses are considered-we have evaluated the posterior probability of each bias configuration, summed those in which at least one bias is present, and divided by the posterior probability of the null hypothesis that no biases were present. We use a flat prior for the bias configurations. We generally find that for the majority of bands and fields, most spherically-averaged Fourier modes are more consistent with pure noise than a scenario with an epoch-dependent bias, particularly for higher-k modes (see the caption of Figure 25 for a discussion of fields/epochs that potentially deviate from the null hypothesis). Since crosstalk subtraction is performed on a per-epoch basis, it is likely that residual crosstalk is a major source of bias that varies from epoch to epoch. This majority-null result justifies the decision to average the epochs together into one final power spectrum for each band and field. This is not to say that we strongly suspect there is no underlying signal due to residual systematics or one of cosmological origin. Since only large biases are considered in the alternate hypotheses, our conclusion is that if biases are present, then they are sufficiently small to be difficult to distinguish from the expected statistical fluctuations in the bandpowers. Indeed, if we marginalize over bias priors centered at smaller biases (e.g. less than σ i ), the posterior probability of the bias configurations is more evenly diffused over the hypotheses, demonstrating a lack of certainty that arises from an inability to finely distinguish weak bias configurations with so few data.

CONSTRAINTS ON THE ASTROPHYSICS OF REIONIZATION AND THE COSMIC DAWN
Having established our new upper limits on the 21 cm power spectrum at both z = 7.9 and z = 10.4, we now turn to their astrophysical implications. Just as much of this work so far has been the application of the techniques developed in H22a and its supporting papers, in this section we directly apply the techniques laid out in H22b to our updated power spectra. We begin by explaining how we compute model likelihoods in order to perform astrophysical inference in a Bayesian framework (Section 7.1). In Section 7.2, we then briefly sur-  Here we show the odds of at least one epoch being significantly biased, as a function of k for each band and field, as determined by our a Bayesian jackknife test. We observe that for most bands, fields, and k modes, there is only occasionally "strong" evidence (odds > 10 1 ; medium gray region) or "decisive" (odds > 10 2 ; dark grey) for a significantly biased epoch, using the terminology of Kass & Raftery (1995). The majority of points fall beneath the region of "substantial" evidence (odds > 3.2; light gray) as well. Band 1, Field C shows consistent evidence for possessing a significantly biased epoch. Examination of the posterior over bias configurations (see Wilensky et al. 2022) suggests that Epoch 1 is most likely to be biased for a number of k modes in Band 1, Field C. The biases observed in Band 2, Fields D and E are less clearly attributable to a single epoch. Notably, Band 2, Field C (see Figure 24) shows only mild or occasionally strong evidence for the presence of a bias.
vey the four techniques employed in H22b and compare their updated constraints on the ratio of the average spin temperature of the IGM to the temperature of the radio background, T S /T rad . After giving some background detail on the two simplest models in Section 7.3, we proceed with a more detailed report of updated techniques and results from both 21cmMC (Section 7.4) and from models with an extra radio back-ground generated by galaxies (Section 7.5). Finally, we conclude in Section 7.6 with a more comprehensive comparison of our models and the significance of their results in the broader context of 21 cm cosmology.

Evaluating Model Likelihoods
To interpret the power spectra and upper limits reported in H22a, all four techniques examined in H22b employ the same statistical approach to evaluating the posterior probability of model parameters in a Bayesian framework. Each theoretical model M with a set of parameters θ compares the data d to modeled power spectra m(θ) convolved with the window function W. We write this difference as t ≡ d − Wm(θ). We next assume that each measurement is due to some unknown combination of 21 cm signal, noise, and systematics. Further we assume that systematics (typically residual foregrounds and crosstalk) can only add power. These assumptions are only appropriate when claiming upper limits on the 21 cm power spectrum, as we do in this work. To claim a detection, one would have to impose additional constraints on the relative contribution of systematics to the measurement. To do that credibly, one likely needs high SNR detections at multiple k modes, redshifts, and fields, which can be subjected to rigorous jackknives and other statistical tests for internal consistency, along with a theoretical framework able to match the data well.
Marginalizing over systematics, as we show in H22b, yields a posterior probability of the form where the product on the right-hand side is the likelihood. Here N d is the number of data points considered and σ i is the standard deviation of d i . This result is consistent with similar approaches in the literature (Li et al. 2019;Ghara et al. 2020). In this derivation, we also assumed that measurements have uncorrelated errors. While neighboring measurements in k are in fact quite correlated, we follow H22b and throw out every other measurement in k-keeping the absolute lowest limits-to largely eliminate that correlation.
To understand the effect of the form of the likelihood in Equation 11, we calculate it using a power law ∆ 2 ∝ k α over several orders of magnitude in ∆ 2 evaluated at k = 0.35 hMpc −1 in Figure 26. At z = 10.4, we use α = 1.3; at z = 7.9 we use α = 0.0. These values of α were derived by interpolating the 21cmMC power spectrum realizations in our posterior without HERA (see Section 7.4.3) at the k values of our two deepest measurements, fitting a power law, and then taking the median α over models that had not yet completely reionized. The flattening of ∆ 2 with decreasing redshift is a generic consequence of inside-out reionization, which suppresses small-scale power. Single k 0.35 hMpc 1 0.2 < k < 0.5 hMpc 1 : 0.2 < k < 1.0 hMpc 1 : 0.2 < k < 2.5 hMpc 1 : Figure 26. Here we show our marginalized likelihood from Equation 11 for each band and multiple fields and ranges of k. In solid colored lines, we show the likelihood for each field independently, but combine all 0.2 < k < 1.0 h Mpc −1 after throwing away every other power spectrum measurement to eliminate correlated errors. To combine multiple k values, we have assumed a power law ∆ 2 ∝ k α where α = 1.3 at z = 10.4 and α = 0.0 at z = 7.9. In the various black curves, we combine together all fields but use different ranges of k values, including just the k of the best upper limits at k ≈ 0.35 hMpc −1 . To help provide some intuition for how the upper limits yield these likelihoods, we also show the best upper limit from each band (pink error bars, taken from Figure 19 and rotated 90 • ). It is clear that combining together multiple fields has an effect, especially for Band 1 (z = 10.4) where Fields C and D contribute roughly equally. Likewise, combining together multiple k modes tightens the constraints significantly, especially at z = 10.4 where power law is steeper. When evaluating the posterior probability of our various models and parameters, we compare the models to the data at the proper k values.
The first thing to note is that our likelihoods are essentially flat for ∆ 2 values much less than the upper limit; the signal could fall anywhere in that range because the systematics could fall anywhere in that range as well. Our results can therefore be heavily impacted by the prior, p(θ|M), and what it considers the smallest viable power spectrum. Figure 26 also highlights the fact that, in contrast to H22b, our result features fairly comparable limits from multiple fields and k modes. Because our likelihoods are not Gaussian, the results do not scale like 1/ √ N as one might expect. Limits with larger measurements and smaller error bars look different in Equation 11 than measurements with the same 2σ upper limits but with smaller measured ∆ 2 and larger error bars. So, while it is not surprising that most of the information in Band 2 comes from Field C, it is interesting and not necessarily intuitive that Fields C and D contribute equal amounts of information in Band 1. Likewise, we can see that just using a single k mode (dotted black lines)-even though it is the single best upper limit and combines together all fields-yields somewhat worse constraints than combining multiple ks. The vast majority of the constraining power comes form k < 1 hMpc −1 (solid black line). Of course, combining k modes together in this way is only technically correct if we are evaluating the likelihood in Equation 11 for a model with ∆ 2 ∝ k α . In general, our model posteriors use all fields and all k-values, but compare them to the data at the proper k values.

Overview of Theoretical Models and their IGM Spin Temperature Constraints
In H22b, we interpreted the results of H22a using a suite of four different theoretical models to infer constraints on the IGM and high-z galaxies: 1. a simple "density-driven" linear bias model, in which the 21 cm fluctuations are assumed to trace the density fluctuations, multiplied by a bias factor that depends on the average thermal and ionization state of the IGM; 2. a phenomenological "reionization-driven" model, which parameterizes the ionized bubble distribution and IGM spin temperature directly without making any explicit assumptions about galaxies (Mirocha et al. 2022); 3. a semi-numeric model of the 21 cm signal, 21cmFAST (Mesinger et al. 2011), along with the inference engine 21cmMC (Greig & Mesinger 2017) that uses it to explore and constrain a range of parameterized galaxy properties (e.g. Park et al. 2019) with multi-wavelength probes; and 4. an independent semi-numeric model that models the galaxies differently and allows for a radio background in excess of the CMB at high redshifts (Reis et al. 2020).
Our models are constructed in two qualitatively different ways. The density-driven and reionization-driven models interpret the 21 cm signal directly as a function of IGM properties. They avoid making explicit assumptions about the sources generating the radiation fields that determine the 21 cm signal (though their priors on IGM properties carry some implicit assumptions), and as such have a lighter implementation. The seminumeric models, on the other hand, start with a model of galaxy evolution and simulate a cosmological volume in order to predict the 21 cm signal. The latter are significantly more complicated but have the distinct advantage of a parameter set that is rooted in our current understanding of galaxy properties, which allows us to combine 21 cm measurements with other constraints on the galaxy population (such as UV luminosity functions and the X-ray background). The disadvantage of the semi-numeric approach is that the resulting constraints on galaxy physics are only sensible if their parameterization is flexible enough to include all the relevant physics. In H22b we considered two independent semi-numeric models to mitigate this problem.
The key result of H22b, across all these models, was that the IGM must have been heated above the expectation for an adiabatically cooling IGM from recombination to z = 7.9 at >95% credibility. 22 If this heating is interpreted in the conventional way-as the result of X-rays generated by accretion onto black hole remnants of early star formation and then depositing that thermal energy in the IGM-this level of heating suggested that early galaxies were more efficient X-ray emitters than their local counterparts. If an excess radio background were allowed, the observations jointly constrained the efficiencies of radio and X-ray emission from galaxies, also restricting the parameter space of otherwise viable models of these mechanisms during the EoR.
In this work, we apply those same four techniqueswith some minor adjustments, as we will discuss belowto our data from both bands. As H22b argued, the easiest point of comparison for the models is the inferred 22 When we say, e.g., "at 95% credibility" or present a "95% credible limit" we mean that the particular parameter value bounds the 95% credible interval-the region of the posterior on that parameter that contains 95% of the integrated probability density. This is different from a frequentist's 95% confidence interval, which is by definition bounded by a pair of random variables that should contain the true parameter value in 95% of repeated trials. The credible interval depends directly on the theoretical model, its parameterization, and its priors. To say that a specific model with a given set of parameters is ruled out at 95% confidence, by contrast, one would have to compare the model's prediction for ∆ 2 (k, z) directly to the measurements and their error bars (see Table 3 and Table 4).
average spin temperature of the neutral IGM, T S . In Figure 27 we compare the 68% and 95% highest posterior density (HPD) credible interval on T S in all four models and show how those constraints have improved since H22b. 23 The improvements are modest but consequential, especially at z = 10.4: all four models now independently require the IGM to be heated above the adiabatic cooling limit at >95% credibility. The precise values of lower limits vary from model to model, reflecting their different assumptions and approximations. In particular, 21cmMC shows the most dramatic increase in the spin temperature constraint at z = 7.9, a fact that deserves further examination. 24 As we discuss the specific models and their detailed results throughout this section, we will look to understand what precisely drives these differences. In Section 7.6, we will summarize our takeaways and put these spin temperature constraints in the broader cosmological context.

Phenomenological Models of the 21 cm Power Spectrum
Before we discuss the detailed results of our more complex techniques for astrophysical inference based on semi-numerical simulations, here we briefly detail the two simpler methods used in H22b to infer constraints on the spin temperature and other IGM properties.
In the first, which we refer to as the "density-driven" model, the 21 cm power spectrum is assumed to follow that of matter, multiplied by a bias parameter squared (as is standard in perturbation theory, see e.g. McQuinn & D'Aloisio 2018; Georgiev et al. 2022;Qin et al. 2022;Sailer et al. 2022 for more complex approaches). Limits on this bias can then be translated into lower bounds on T S /T CMB under some assumptions, including redshiftspace distortions as a function of the line-of-sight cosine µ, see H22b) and that the IGM properties are roughly homogeneous (as the model ignores ionization and temperature fluctuations beyond adiabatic, see H22b for details). Our limits on the bias b m using 94 nights (and all fields and every other k, as discussed above) are |b m | < 60 mK at z = 7.9 and |b m | < 160 mK at z = 10.4 at 95% credibility, which assuming x HI = 1 23 In this work, we generally compute HPD credible intervals in the space sampled in, either logarithmic or linear. For T S , this is done in by minimizing the logarithmic interval in most models. The exception is when we compute posteriors in ∆ 2 (see Figure 30 and Figure 31), a derived quantity in our models, which can often be spread over a wide dynamic range, multi-modal, and include zeros where the universe is fully reionized. There we instead used the equal-tailed credible interval, which has equal integrated probabilities above and below the interval. 24 In fact, its 68% credible lower bound at z = 7.9 is 79.0 K, off the right edge of the plot (see Figure 29). translate into the IGM limits shown in Figure 27 and in Figure 34. Note that these limits are on the absolute value |b m |. This is unimportant for cases where T S T radio , but as the limits tighten we will obtain contours around T S = T radio (which yields no 21-cm signal and thus b m = 0). In fact, the 68% credible limit on the z = 7.9 bias is |b m | < 30 mK, which translates into the double-sided region from T S = 15 K to T S = 60 K (assuming T radio = T CMB , as is standard).
In the second, which we refer to as the "reionizationdriven" model, the IGM is modeled as a two-phase medium, with fully ionized bubbles drawn from a lognormal size distribution and the "bulk" IGM outside bubbles assumed to be of uniform temperature (as described in Mirocha et al. 2022). The advantage of this approach 25 is that it works directly with IGM quantities, which makes it easy to interpret, and avoids making explicit assumptions about galaxies 26 . In its simplest form, it requires four parameters: the volume filling fraction of ionized gas, Q ≡ (1 − x HI ), the IGM spin temperature, T S , the characteristic bubble size, R b , and log-normal dispersion, σ b . In this work, when jointly fitting both HERA bands, we also use a 7-parameter version of the model in which the ionized fraction and characteristic bubble size are allowed to evolve with redshift as power laws. We require only that Q and R b increase with time, that reionization completes at z ≥ 5.3, and that all parameters are positive. We perform our Markov Chain Monte Carlo (MCMC) inference using emcee (Foreman-Mackey et al. 2013). Now, with 94 nights of data, we infer spin temperatures in excess of the adiabatic cooling limit at z = 7.9 and 10.4, both for fits that consider each band separately and the joint fit to both bands using a 7-parameter version of the model (see Figure 27). At 95% (68%) credibility we obtain T S > 11.0 (35.2) K at z = 7.9, and T S > 6.2 (26.4) K at z = 10.4, when fitting both bands simultaneously. Note that, in this joint fit, lower limits on T S grow at z = 7.9, as expected, but actually slightly decrease at z = 10.4 (at 95% credibility) relative to the results of single-band fits. This is a result of using an HPD estimate of the credible interval, combined with change in the shape of the T S posterior, which goes from being roughly flat as T S → 10 3 K to a more "peaked" distribution, slightly away from the maximal value of Here we summarize HERA's constraints on the IGM spin temperature TS and contrast TS/T radio at z = 10.4 (left) and z = 7.9 (right). Each row shows the HPD results obtained with a different model for 21-cm fluctuations (described in Section 7). These include a linear bias model with density fluctuations only (top row; see Section 7.3); a phenomenological model that parametrizes the ionized bubble size distribution and assumes an IGM of uniform temperature (second row; also see Section 7.3); 21cmMC, a Bayesian technique for fitting parameters of semi-numeric 21cmFAST simulations (third row; see Section 7.4); and another semi-numerical model that includes a prescription for radio emission generated by galaxies (bottom row; see Section 7.5). In each panel, we compare results obtained in this work with the previous set of upper limits published in H22b, where we only saw evidence for heating above the adiabatic limit (gray hashed region) at z = 7.9. In this work, we see consistent evidence across all our models for an IGM heated above the adiabatic at z = 10.4 as well. The more dramatic rise of the z = 7.9 spin temperature in 21cmMC relative to the other models is driven by the z = 10.4 constraints combined with independent constraints on galaxy luminosity functions (as we discuss in detail in Section 7.6). Note that for the first three models, T radio = TCMB, which enables a conversion between the top and bottom axes. For the model in the bottom row with an excess radio background, T radio = TCMB in general, and so the TS tick marks along the top axis should be ignored.
T S . This is consistent with the results of the densitydriven model, suggesting that our limits are beginning to disfavor scenarios with saturated T S at z = 7.9. The differences between these approaches explains some of the differing conclusions in Figure 27. For example, the density-driven model yields higher T S limits than the reionization-driven model, because the latter assumes ionization and density are positively correlated. As a result, the IGM must be made colder to compensate for the loss of the densest regions to ionization. We see also in the reionization-driven model the power of jointly fitting multiple bands; the z = 7.9 limit increases by roughly a factor of 2 in the joint fit because a given temperature at z = 7.9 is only viable if the temperature evolution is consistent with the z = 10.4 data.

Updated Constraints on Reionization and X-ray
Heating of the IGM with 21cmMC 21cmFAST (Mesinger et al. 2011;Murray et al. 2020) is a semi-numerical simulator for computing the evolution of the 21 cm signal across cosmic time by assuming that star-forming galaxies, hosted by dark matter halos, drive the cosmic radiation fields that heat and ionize the IGM. It uses empirical scaling relations to assign galaxy properties to dark matter halos based on their halo masses. These include the stellar mass to halo mass ratio, the UV ionizing escape fraction, the star formation rate, and the X-ray luminosity. 21cmFAST simulates reionization using the excursion-set formalism (Furlanetto et al. 2004). It accounts for inhomogeneous recombinations (Sobacchi & Mesinger 2014) and includes a numerical correction for photon conservation (Park et al. 2021).
In Section 5.1 of H22b, we discussed the nine physically-meaningful free parameters that go into the scaling relations describing galaxy properties (also see Park et al. 2019). We adopted either flat linear or logarithmic priors-depending on the parameter's dynamic range and how it enters into the model-within our 9dimensional hypercubic parameter space. The ranges on these priors were picked to allow a broad range of physically plausible values while not strongly constraining the parameters most constrained by other highredshift probes. These include UV luminosity functions (Bouwens et al. 2015;Bouwens et al. 2016;Oesch et al. 2017) and measurements of the IGM opacities during the EoR such as the Ly-α forest (McGreer et al. 2015;Bosman & Becker 2015;Qin et al. 2021a) and CMB polarization and optical depth (Planck Collaboration 2018; . The Ly-α forest and the CMB's τ provide important clues as to the timing and duration of reionization, which then constrain the ionizing escape fraction, given the observed UV luminosity function. We use 21cmMC (Greig & Mesinger 2015, 2017 and its MultiNest sampler (Feroz et al. 2009;Qin et al. 2021b) to perform Bayesian inference in this work. The posterior probability distribution without HERA serves as the starting point for comparing models against HERA data.
One key result from these other high-redshift probes is the strong constraint on the star formation efficiency of dark matter halos at high redshift, which is determined by the UV luminosity functions at z ∼ 6-8; existing observations constrain both the peak efficiency and show that it declines toward small halo masses (e.g., Tacchella et al. 2013;Mason et al. 2015;Mirocha et al. 2017;Park et al. 2019;Sabti et al. 2022). The version of 21cmFAST used here assumes that this behavior can be extrapolated to higher redshifts and smaller haloes. As a result, the range of galaxy formation models that are allowed by 21cmMC is relatively restricted, and all display a rapid increase in the stellar mass density between the two redshifts measured by HERA. The resulting constraints must be interpreted with this in mind, as more complex galaxy evolution histories (which break the assumed extrapolation by, for example, introducing a new population of sources, see e.g. Muñoz 2019; Muñoz et al. 2022 for implementations in 21cmFAST) are not included in the prior and are left for future work.

21cmMC Constraints on X-Ray Luminosity
In H22b we explored how adding HERA affected the full posterior parameter covariance. In this work, we focus on the parameter most constrained by HERA, the ratio of the integrated soft-band (<2keV) X-ray luminosity to the star formation rate. Since 21cmFAST assumes that X-ray photons govern the thermal history of the neutral IGM, this L X<2keV /SFR parameter essentially describes the heating power of EoR galaxies per unit of star formation. In H22b, we obtained the first observational evidence for an enhanced X-ray luminosity of Local relation for HMXBs (Mineo et al. 2012) Prediction for low-metallicity HMXBs (Fragos et al. 2013) 21cmMC Prior 21cmMC Posterior without HERA 21cmMC Posterior after 18 nights with HERA (H22b) 21cmMC Posterior after 94 nights with HERA (this work) Figure 28. Here we show how our marginalized 21cmMC posterior PDF of the ratio of soft X-ray luminosity to SFR, L X<2keV /SFR, tightens with a full season of HERA data. The shaded regions show the 68% and 95% credible intervals of the posterior. Probability densities are plotted per logarithmic interval. Our results are consistent with theoretical expectations for X-rays produced during the cosmic dawn by a population of low-metallicity high-mass X-ray binaries (HMXB) (Fragos et al. 2013), likely a more representative model of the first galaxies (dash-dotted black vertical line). Compared to H22b (orange dashed line), our result's 99% credible interval excludes models where the local relation for X-ray efficiency (solid black vertical line; Mineo et al. 2012) continues to hold at high redshift.
high-redshift (z > 6) galaxies, with a 68% HPD credible interval of L X<2keV /SFR ∼ 10 39.9 -10 41.6 erg s −1 M −1 yr. This disfavors a relationship between star formation and soft X-ray luminosity at or below the one seen in local, metal-enriched galaxies at >68% credibility.
As Figure 28 shows, we find that the full season of HERA observing constrains the 95% credible interval on L X<2keV /SFR to the range 10 40.4 -10 41.8 erg s −1 M −1 yr. This result assumes as a prior that L X<2keV /SFR < 10 42 erg s −1 M −1 yr, beyond which X-rays reionize the universe too quickly (Mesinger et al. 2013). More than 99% of the 21cmMC posterior volume excludes the possibility of the local relation for HMXBs (Mineo et al. 2012) continuing to hold at high redshift. It is consistent, however, with models of extremely low-metallicity galaxies, where high mass stars have less mass-loss from line-driven winds than their solar-metallicity counterparts (Fragos et al. 2013).

21cmMC Constraints on the IGM's Thermal History
Our constraints on the soft X-ray efficiency are themselves a consequence of our ability to use our upper limits to exclude a range of scenarios with low levels of IGM heating. In Figure 29 we show our updated marginalized posteriors on the predicted average IGM temperatures-both the spin temperature, T S , and the kinetic temperature, T K -along with results from H22b for comparison. As demonstrated in H22b, current EoR constraints from Planck and quasar spectra already disfavor a large number of models in the prior volume which predict either highly ionized IGM at z ≥ 10.4 or completely neutral one at z ≤ 10.4. These constraints also have a slight impact on the average IGM temperature, excluding models with high T K or T S at these redshifts. However, because a decently-sized fraction of parameter space with an unheated IGM at these redshifts is not ruled by these probes, and since 21cmMC cannot produce spin temperatures below the adiabatic limit, our posterior without HERA shows a pileup of probability right around that limit.
When we incorporate the HERA limits, a significant range of models with low IGM temperatures can be further excluded. We showed in H22b how HERA observations substantially improve our understanding of the neutral IGM at z = 7.9. However, there was still a small fraction of the total posterior with low values of T S , so H22b could not completely rule out an unheated IGM the observed redshifts. With the improved limits presented in this work, we now find that an unheated IGM is disfavored at greater than 99% credibility at both z = 10.4 and 7.9. The new HPD 95% credible intervals on the spin and kinetic temperatures are 4.7 K < T S < 171.2 K and 3.2 K < T K < 313.2 K at z = 10.4 and 15.6 K < T S < 656.7 K and 13.0 K < T K < 4768 K at z = 7.9. 27 Since the 21 cm brightness temperature is proportional to (1−T CMB /T S ), any model where T S < 0.5T CMB will show an enhanced power spectrum amplitude relative to one where T S T CMB . Since the CMB is 24.3 K at z = 7.9, our results rule out the range of cold reionization scenarios with enhanced power spectra at z = 7.9 with greater than 95% credibility. While we cannot definitively rule out any power spectrum enhancement due to a cold IGM z = 10.4, we have ruled out the large swath of parameter space that produces the largest power spectra. Figure 29 brings into sharper relief the question raised by Figure 27 of why the 21cmMC results show such a large change in T S . Once we reach the regime where T S > T CMB , large increases in spin temperature have only a modest impact on the power spectrum. It follows then that the updated power spectrum limits at z = 7.9 are not primarily driving the spin temperature constraint at z = 7.9. Just as in the reionization-driven phenomenological model where jointly fitting the two redshifts produced tighter T S constraints, the combination of our two redshift bands is crucial. In particular, the star formation histories inferred by 21cmMC from UV luminosity functions (in which the stellar mass density increases significantly from z = 10.4 to z = 7.9) causes substantial IGM heating between the two bands. Thus the modest heating demanded by HERA at z = 10.4 translates to a strong inference about the temperature at the later times. We will see quite a similar effect when we examine the inferred power spectrum posteriors in Figure 31.

Spectrum
Because we have ruled out very large negative 21 cm brightness temperatures arising from cold reionization, we can significantly constrain the range of possible values of the 21 cm power spectrum in our model. Figure 30 shows the inferred 68% and 95% posterior equal-tailed credible intervals on the power spectrum at redshifts 12, 10, 8 and 6 after incorporating results of a full season of HERA observation. One can think of these contours as the testable-and therefore falsifiable-predictions of 21cmMC given the HERA data, the other astrophysical constraints, and the assumptions of the model. In Figure 30, we also show the 95% equal-tailed credible interval from H22b and from our posterior without HERA. Upper limits from H22a and this work as well as a number of other previously published measurements from GMRT (Paciga et al. 2013), PAPER (Kolopanis et al. 2019), MWA (Li et al. 2019;Trott et al. 2020), and LOFAR (Mertens et al. 2020) are also presented in Figure 30. It is evident that the HERA limits have been significantly improved since H22a and that this work represents our best constraint so far on the neutral IGM during the EoR. As a result, the posterior distribution of 21 cm power spectra in our model has also become much tighter, which is consistent with our exclusion of an unheated IGM at these redshifts. That said, one might wonder why the 95% posteriors are well below our 2σ power spectrum upper limits in Figure 30. The reason is threefold. First, recall that our likelihood model is not Gaussian; it is an error function that asymptotes to We compare the prior on these quantities to the posterior from H22b data (orange dashed) and the posterior after a full season (purple, with 68% and 95% credible intervals shaded). We also show our prior (black dashed) and the posterior after including non-HERA astrophysical constraints on reionization (gray). The averaging is performed over neutral cells with xHI > 0.95; for models completely reionized at z = 7.9, we take the average temperature from the last time-step with neutral cells. The hashed region indicates temperatures below the adiabatic cooling limit. Our observations rule out an unheated IGM at >99% credibility at both z = 10.4 and z = 7.9, placing new constraints on the population of X-ray emitting compact objects during the cosmic dawn.
a flat probability at small model ∆ 2 (Equation 11). As Figure 26 shows, power spectra just below our limits are far less likely than power spectra well below them. This concentrates posterior probability at lower power spectrum values. The effect is especially important when multiple fields and k modes contribute significantly to the likelihood, instead of the two measurements from a single field that dominated the results in H22b.
The second reason is that, as already discussed, our inference is heavily informed by other high-redshift observations and the galaxy model assumed by 21cmFAST. This is particularly relevant in the context of the third reason, which is the influence of constraints from the two bands on each other. To better understand the impact of the model, we show in the top row of Figure 31 the prior and full posterior probability distributions from the three inferences, i.e. without HERA and with HERA after 18 (H22b) and 94 nights (this work). We only show the distributions at the k values of our deepest limits, roughly 0.35 hMpc −1 . First consider the PDF without incorporating HERA. It has two clear peaks: the one at smaller ∆ 2 corresponds to models with T S T CMB (with abundant X-ray heating) while the other is mostly "cold reionization" with little heating. In between, there is a valley in the distribution because one must fine-tune the heating and ionization to get a signal between these two extremes, which is comparatively unlikely given our priors and the other high-redshift constraints.
The H22b results ruled out the extreme end of the cold reionization peak z = 7.9, but at z = 10.4 much of that stronger peak was still viable. With our new limits, an IGM near the adiabatic limit is essentially excluded Here we show the 68% and 95% equal-tailed credible intervals for the inferred power spectra at redshifts 12, 10, 8, and 6, after a full season of HERA (purple). We also show 95% credible intervals after 18 nights of HERA (orange, reproduced from H22b) and without HERA (gray). We also include the HERA 2σ limits from H22a and this work that most strongly constrain the likelihood, namely the single deepest limit over all fields at each k. When evaluating likelihoods, data is compared to models using all fields and k, which can make a big difference (see Section 7.1). Note that, due to the form of the likelihood in Equation 11, which depends on both our measurements and our error bars, models with power spectra just below our limits are more disfavored than models with power spectra well below them. To better understand how the shapes of these likelihoods were updated at z ∼ 10 and z ∼ 8 and why the contours are often surprisingly far from the deepest limits, see the full posteriors at the k values of our deepest limits in Figure 31. Following H22b, we use every other k to avoid unmodeled correlations between measurements at different k values. For context, we also show the most competitive 2σ limits from other telescopes at similar redshifts, including GMRT (Paciga et al. 2013), PAPER (Kolopanis et al. 2019), MWA (Li et al. 2019;Trott et al. 2020), and LOFAR (Mertens et al. 2020). at z = 10.4, which clearly favors models with considerable X-ray heating. Because that heating is assumed by the models to continue through z = 7.9 (when the stellar mass density, and hence density of X-ray sources, has increased by a factor of a few), the higher-redshift measurement helps to constrain the lower-redshift one as well. In particular, it eliminates the last little tail of large amplitude power spectra in the posterior, which is why the 95% contour moves so much between H22b and this work in Figure 30, though the 68% contour (not shown here for H22b) does not change very much. Approximate likelihood: All fields, 2 k 0.2 < k < 1.0 hMpc 1 Figure 31. Here we show the full 21cmMC derived prior and posterior PDFs per logarithmic interval in ∆ 2 at the redshifts and k modes roughly corresponding to our best upper limits (top row). Just as in Figure 29, we show our priori and three posteriors: one with other astrophysics constrains but no HERA (black), one from H22b (orange dashed), and one from this work (purple, with 68% and 95% equal-tailed credible intervals shaded). In the bottom row, we show the ratio of this work's 21cmMC posterior to the posterior without HERA data. After renormalizing, this ratio is effectively the likelihood that went into the Bayesian update. We compare those effective likelihoods to those calculated in Figure 26 using all fields and all 0.2 < k < 1.0 hMpc −1 . It is clear that the update at z = 10.4 is largely (though not entirely) attributable to the measures at z = 10.4. However, the update at z = 7.9 is sharper, indicating that much of the information about the inferred ∆ 2 comes from ruling out models with inefficient X-ray heating of the IGM, which are better constrained by the z = 10.4 measurement despite the larger upper limit on ∆ 2 . This in turn helps us understand this work's 95% contours in Figure 30 relative to H22b, especially at z = 7.9. Because we have eliminated a broad range of cold reionization scenarios-the right-hand peaks in the pre-HERA posteriors-the 21cmMC posteriors have shifted substantially toward the peaks associated with hot reionization.
As further evidence for the impact of the z = 10.4 measurements on the z = 7.9 posterior, we show two additional sets of curves in the bottom row of Figure 31. The first are the ratios of the posterior after 94 nights with HERA to the pre-HERA posterior. The second are the likelihoods that we get from Equation 11 by combining together measurements from all fields and k modes from 0.2 < k < 1.0 hMpc −1 (dropping every other k, as discussed in Section 7.1). The likelihoods are evaluated assuming power law ∆ 2 ∝ k 1.3 at z = 10.4 and ∆ 2 ∝ k 0.0 at z = 7.9 (see Section 7.1 for details). Both curves are normalized to plateau at 1. Since the pre-HERA posterior is treated as the prior for the post-HERA inference, we should expect this ratio to match the likelihood by Bayes' theorem. It should be noted that using a power-law power spectrum is only an approximation. To understand the precise disagreements between between the likelihood and the ratio of the posteriors, we would have to account for the detailed dependence of ∆ 2 on k, the ways in which the z = 7.9 measurement constrains the z = 10.4 posterior (and not just vice versa), and sampling noise.
With those caveats the match looks reasonable at z = 10.4, but not nearly as good at z = 7.9. It follows then that the inferred constraints on both T S and ∆ 2 at z = 7.9 are driven externally to the data at z = 7.9, which is all that goes into the black likelihood curve. Therefore, the z = 7.9 results must be a consequence of the z = 10.4 limits and the way 21cmFAST models the evolution the X-ray luminosity by tying it to star formation and then constraining that star formation rate with other probes, most notably the UV luminosity function. In Section 7.6, we will return to the question of the impact of the specific modeling choices of 21cmMC and how they compare to the other three techniques.

Updated Constraints on Astrophysical Models with Excess Radio Background
In H22b, we reported parameter constraints on an alternate semi-numerical model that allows for a significant excess radio background. EoR scenarios with high levels of radio background at rest-frame 21 cm wavelengths can potentially produce strong 21 cm absorption signals, since the brightness temperature is proportional to (1 − T radio /T S ) (Feng & Holder 2018). We know there exists today a radio background well in excess of the CMB from observations with ARCADE 2 Seiffert et al. 2011) and the LWA (Dowell & Taylor 2018). It has been theorized that if this excess is sourced by a population of unresolved high-redshift, potentially exotic sources (Ewall-Wice et al. 2018;Fraser et al. 2018;Jana et al. 2019;Brandenberger et al. 2019;Thériault et al. 2021), it could explain the anomalously strong absorption signal seen by EDGES (Bowman et al. 2018). Such explanations are not without difficulty; they would have to feature sources far stronger than what would be expected from local observations (Ewall-Wice et al. 2018;Mirocha & Furlanetto 2019;Mebane et al. 2020;Ewall-Wice et al. 2020) and would need to include rapid X-ray heating at 16 z 14 to explain the high-frequency side of the EDGES trough (e.g. Mittal & Kulkarni 2022).
Since a radio background can also increase the amplitude of 21 cm fluctuations Fialkov & Barkana 2019;Reis et al. 2020), limits from HERA can constrain astrophysical parameters describing models with excess radio background. In H22b, we used a semi-numerical simulation (Visbal et al. 2012;Fialkov et al. 2014;Fialkov & Barkana 2019;Cohen et al. 2020;Reis et al. 2020Reis et al. , 2021 in which the key radiation fields are all driven by the cosmic star formation rate. This is set (in part) by the star formation efficiency f * with which collapsed gas in halos turns into stars, and the circular velocity V c which determines the minimum mass for star forming halos. Just as in H22b, the efficiencies of X-ray and radio background luminosity relative to the star formation rate are parameterized by f X and f r , and reionization is parameterized by the CMB optical depth τ . For more details on the implementation of the model and its parameterization, see Section 8 of H22b. 28 There are a few differences between the precise inference procedure used here and the one used in H22b. More parameters are now allowed to vary rather than fixing them to specific values. Most significantly, we no longer limit ourselves to the X-ray spectral energy distribution (SED) of X-ray binaries (Fragos et al. 2013;Fialkov et al. 2014) and instead parameterize the SED by a power law spectral index α X (either 1.0, 1.3, or 1.5) and a minimum frequency cutoff ν min which we vary from 0.1 to 3 keV with log-uniform prior. Additionally, we no longer fix the mean free path to R mfp = 40 cMpc but vary it between 10 and 70 cMpc. 29 The new parameterization of the X-Ray SED implies that more varieties of X-ray heating are captured. For the same value of f X , a softer SED (with a larger number of low-energy photons, e.g. lower ν min ) leads to a more efficient heating with shorter X-ray mean free path, while a harder SED (lower number of low-energy photons, e.g. higher values of ν min ) results in less heating with a larger fraction of photons remaining unabsorbed.
Another improvement over the analysis performed in H22b is the application of more accurate emulators. Our new emulator, based on the GLOBALEMU formalism (Bevins et al. 2021) and implemented using scikit-learn (Pedregosa et al. 2011), uses a more general neural network reaching an accuracy of +11% −7% (at 68% confidence) at z = 7.9 and k = 0.34 hMpc −1 . Additionally, instead of an MCMC, we now use nested sampling with PolyChord (Handley et al. 2015a,b).
The updated analysis and the improved HERA upper limit together result in tightened constraints, two of which are of particular interest. The biggest impact is on the X-ray efficiency f X , which is consistent with the 21cmMC results on L X /SFR (see Section 7.4.1). In general, HERA excludes models with high f r and low f X , since they would produce the brightest amplitude of 21 cm fluctuations. The lower bound on the 68% HPD credible interval of f X increases from f X > 0.03, using the data from H22a, to f X > 0.18 using the limits presented here. However, the upper bound on the 68% HPD credible interval of f r decreases only slightly, from f r < 586 to f r < 575. Note that these numbers differ from the values quoted in H22b due to the changed X-ray SED and priors.
In Figure 32, we show the region of the parameter space disfavored by HERA in this model, where we include the impact of f * on both X-ray heating and the total radio background. This space was already constrained by other measurements of astrophysical backgrounds. Reis et al. (2020) showed that models with strong f * · f r are inconsistent with the radio background from LWA (Dowell & Taylor 2018) and ARCADE 2 Seiffert et al. 2011), where the lower 10 6 10 5 10 4 10 3 10 2 10 1 10 0 10 1 10 2 X-Ray Heating f * f X . Models with excess high-redshift radio background can produce much larger power spectra than the standard scenario where the 21 cm brightness temperature is seen in contrast to the CMB. However, such models without accompanying X-ray heating of the IGM are excluded by HERA. Here we show the region of parameter space disfavored by 68% and 95% of HERA's posterior, as well as regions inconsistent with either the LWA's radio background measurements (Dowell & Taylor 2018) or Chandra's X-ray background measurements (Lehmer et al. 2012). Between HERA's constraints and Chandra's, models where LWA's extra-Galactic radio background as entirely explained by z > 8 emission (i.e. models of the radio background at the bottom of the yellow region) are disfavored but not entirely excluded.
frequency measurements from LWA are the more constraining of the two (Reis et al. 2020). Likewise, Fialkov et al. (2017) show that large values of f * · f X are generally ruled out by Chandra X-ray background measurements in the 0.5-2 keV band (Lehmer et al. 2012). We show those limits in Figure 32 in yellow and blue, respectively. Compared to the results in H22b, our new 21 cm constraints leave little room for the LWA radio background to be explained entirely as a cosmological signal originating from z > 8. Such models are not entirely excluded, but they are mostly disfavored at 68% credibility or greater. The second result, which is also an effect of the higher X-ray efficiency, can be seen via our derived constraints on the spin temperature of the IGM. In Figure 33, we show how HERA data updates our model's prior on T S and T radio . Both of these quantities are averaged over the neutral IGM at z = 7.9.
In general, our results exclude models with large radio background temperatures without also featuring large spin temperatures. Specifically, our 95% credible interval requires that T S /T radio > 0.21 at z = 7.9. By performing the same calculation for Band 1, we can also 10 0 10 1 10 2 10 3 10 4 10 5 10 6 10 7 T S (K) Posterior with 94 nights with HERA (this work) 95% posterior limit on T S /T radio Figure 33. The derived constraints from our model with excess radio emission from high-redshift galaxies allows for a large range of both radio background and IGM spin temperatures at z = 7.9. HERA generally favors models with lower radio background, though large radio backgrounds counteracted by large spin temperatures are still possible. Both temperatures are averaged over the neutral IGM where 21 cm emission or absorption might be observed. 95% of the posterior volume with HERA falls below the line where T S /T radio = 0.21. Performing the same calculation with the z = 10.4 posterior (not shown) we can constrain T S /T radio > 0.18 at 95% credibility.
show that at z = 10.4, T S /T radio > 0.18 with 95% credibility. While the posterior largely prefers relatively low radio background temperatures, there remain a set of models where a large radio background becomes coupled to the spin temperature, yielding T S /T radio ≈ 1 and thus small power spectra.

Comparison of Astrophysical Models and Constraints
Finally, we can now more fully compare the results of our four models and their implications for the thermal history of the IGM. As we saw in Figure 27, the results from 21cmMC exhibit a more dramatic improvement at z = 7.9 in this work relative to H22b than either of the phenomenological models (see Section 7.3) or the radio background model (see Section 7.5). In the discussion around Figure 31, we saw clear evidence that 21cmMC's change in T S and thus in the inferred posterior probability distribution of ∆ 2 was driven by the power spectrum constraints at z = 10.4. This tight connection between redshifts requires rather rapid thermal evolution of the IGM, which is generally the case for the range of 21cmMC models we consider.
In the other models, the spin temperature history can evolve quite gradually, and thus yield similarly gradual 21 cm signal evolution that will be more difficult to disfavor with improved z = 10.4 limits. For the reionizationdriven model, this is because T S is parameterized as a power-law in redshift with a uniform prior on the powerlaw index. In the radio background model, there are two reasons for gradual evolution. First, the stellar-masshalo-mass relation is constant in the radio background model, which gives rise to a much more gradual cosmic star formation history than, e.g., models in which the relation is itself a function of halo mass. Second, because the radio and X-ray backgrounds are both sourced by galaxies, both T radio and T S will grow similarly as the cosmic star formation rate density rises, resulting in relatively gradual evolution in the 21 cm signal. By contrast, in our 21cmMC simulations, T radio = T CMB , which of course declines as (1 + z). This means T radio /T S will evolve more rapidly than in than in the radio background models, even if both share an identical T S history.
Another reason for the difference between models is the way 21cmMC directly incorporates data from other wavelengths (especially the galaxy luminosity function)-a major advantage of the approach. Models that fit the luminosity function at all redshifts require the star formation efficiency to increase with galaxy mass (at least up to galaxy masses comparable to the Milky Way, see e.g., Mirocha et al. 2017;Park et al. 2019). The 21cmMC models therefore also favor massive galaxies, whose abundance evolves quite rapidly at high redshifts. Because these galaxies are also the sources of IGM heating in this model, their rapidly evolving abundance also leads to rapid IGM heating. Because the luminosity function fits "bake in" this kind of behavior, 21cmMC models that feature modest heating at z = 10.4 require stronger heating at z = 7.9. The expected increase is roughly the ratio of the stellar mass density at the two redshifts, which is a factor of ∼3. This explains why the strong T S limit in 21cmMC at z = 7.9 is actually informed in large part by the ostensibly weaker temperature limits at z = 10.4.
Of course, this conclusion is reliant on the parameterization of the galaxy luminosity function within 21cmMC. While a wide range of models agree that a physically-motivated extrapolation of the observed luminosity function to higher redshifts behaves similarly, it is of course reasonable to suppose that the extrapolation breaks down-perhaps because of processes unique to high-redshifts, or perhaps simply because young dwarf galaxies behave differently than their larger cousins (see e.g., . One could always broaden the allowed set of galaxy models to accommodate such exotic astrophysics. We have, of course, explored one class of such models by examining the effect of an excess radio background in Section 7.5. One could also imagine adding another source of early heating like Pop III remnants that could inject sufficient heat into the IGM at z = 10.4 without necessarily affecting the constraints at z = 7.9, though this might require some degree of fine tuning-a question left for future work. Regardless, it is useful to understand how we should interpret the HERA limits in the context of "vanilla" galaxy models. Here, the 21cmMC results demonstrate that the "known" galaxy population requires substantial IGM heating by z ∼ 8. Regardless of the differences between our models, we should not lose sight of the fact that they are all pointing in fundamentally the same direction: a heated IGM during the EoR. To put them in the broader cosmological context of other 21 cm experiments, we compare them to a few illustrative scenarios for the thermal history of the IGM in Figure 34. The arrows show the lower bounds of the 95% credible regions of T S /T radio at z = 7.9 and z = 10.4 inferred by each of the four models. We also show the expectation for an IGM temperature set by adiabatic cooling since recombination with no additional heating from astrophysical sources. Just as Figure 27 showed, the lower limits at both redshifts are above this curve, requiring substantial X-ray heating (as illustrated by the solid black line where L X<2keV /SFR = 10 41 erg s −1 M −1 yr). This is a significant improvement from H22b, which could not demonstrate any heating beyond the adiabatic cooling curve at z = 10.4. More interestingly, they are all inconsistent with an IGM thermal history that one would predict by extrapolating the local relation between X-ray luminosity and star formation rate to high redshift (black dotdashed line, see Figure 28). The accordance of multiple models here strengthens the argument in Section 7.4.1 that our results broadly favor an IGM heated by lowmetallicity HMXBs. Figure 34 also shows measurements at higher redshifts. The EDGES result (Bowman et al. 2018) found a surprisingly large absorption signal at z ∼ 17, requiring either substantial cooling of the gas below the adiabatic limit or a larger radio background than provided by the CMB. The implied range of T S /T radio is shown by the red error bars, but note that EDGES found this absorption trough to be very narrow, implying that the Universe was heated shortly afterward. However, the recent SARAS 3 measurement (Singh et al. 2022)   shown offset from their proper redshifts, 7.9 or 10.4, for visual clarity. They differ in their conclusions, both due to their different physical assumptions as well as their different priors and incorporation of other high-redshift probes. However, they all conclude that the IGM spin temperature at both z = 7.9 and z = 10.4 is in excess of the temperature one would expect from adiabatic cooling since recombination, assuming T radio = TCMB (purple dashed line). They are also generally inconsistent with an IGM weakly heated by X-rays (black dot-dashed line), as it would be if the local relation for LX /SFR for HMXBs held at high redshift (see Figure 28)-although that statement is somewhat model dependent. However, all our models are consistent with a more rapidly heated IGM, such as the one shown as a solid black line where L X<2keV /SFR = 10 41 erg s −1 M −1 yr, which is closer to the expectation for low-metallicity HMXBs. To put HERA's result in the context of high-redshift global signal measurements, we also show results from EDGES (Bowman et al. 2018) and SARAS 3 (Singh et al. 2022). For EDGES, we show the implied T S /T radio from the depth of their best fit model, as well as the 2σ range of model amplitudes. For SARAS, which disfavors the EDGES best fit at ∼95% confidence when marginalizing over only the amplitude, we show the lower limit on models with the same shape as the EDGES signal. The SARAS central redshift is also offset from EDGES for visual clarity. Finally, we show a model with 0.5% millicharged dark matter, as in Figure 11 of H22b, which could explain EDGES and still be consistent with HERA if there is sufficient X-ray heating between z ∼ 17 and z ∼ 10. detect such a signal; the upper limit is subtle to express quantitatively, but here we show the limit their measurement places on a feature with the shape of the best-fit EDGES signal with an unknown amplitude. Though we show the SARAS 3 limit offset in redshift for clar-ity, the measurement spanned a wide frequency range corresponding to 15.8 < z < 24.6. Because HERA observed at much smaller redshifts than EDGES and SARAS 3, it is difficult to compare directly-even if the deep EDGES trough is real, any heating between z ∼ 15 and z ∼ 10 (as indeed the EDGES best-fit model requires!) would make the two measurements consistent. We show an example of such a model (green solid line) where the IGM is cooled by interactions with a fraction (0.5%) of millicharged dark matter with mass 10 MeV and charge 10 −5 times that of the electron, following Muñoz & Loeb (2018). The gas is subsequently heated by HMXBs. This model is designed to explain EDGES and still be consistent with lower redshift observations like HERA's, though it is in mild tension with SARAS 3. More independent measurements like SARAS 3, as well as low-frequency power spectra from HERA Phase II and other interferometers, will be extremely valuable in understanding this early era and any new physics that may have impacted it.

CONCLUSION
In this work we have presented improved upper limits on the 21 cm power spectrum using 94 nights of observing with Phase I of the Hydrogen Epoch of Reionization Array (HERA), as well as their astrophysical implications for the X-ray heating of the IGM. We have found with 95% confidence that ∆ 2 (k = 0.34 hMpc −1 ) ≤ 457 mK 2 at z = 7.9 and that ∆ 2 (k = 0.36 hMpc −1 ) ≤ 3,496 mK 2 at z = 10.4, an improvement by a factor of 2.1 and 2.6 respectively over previous HERA limits in H22a with 18 nights of data and roughly the same number of antennas. Our full set of upper limits across k are detailed in Section 5.
Our results rely heavily on the application of existing techniques to this new data set. In particular, we adapted most of the techniques used in H22a to our larger data volume, noting where and why we made different analysis choices. We replicated many of the statistical tests developed in H22a-and some new ones-to show that our data largely integrate down as expected and that our techniques for systematics mitigation do not introduce any new biases. Likewise, we have performed a number of the simulation-based tests developed in Aguirre et al. (2022) using our updated techniques, including an end-to-end test of our analysis pipeline that simulates the full data volume before reducing it to power spectra. While many small adjustments were made to accommodate the larger data set considered here, the fundamental philosophical approach remains the same. Instead of subtracting or filtering foregrounds, we focused on maintaining spectral smoothness and sys-tematics control while minimizing and rigorously quantifying potential sources of signal loss.
We also revisit four independent theoretical models for inferring constraints on IGM and galaxy properties during the EoR and the cosmic dawn. These techniques were applied to the H22a data set in H22b where we showed that the IGM had to heated above the adiabatic cooling limit by at least z = 7.9. Using the improved upper limits presented in this work, the four techniques broadly agree with at least 95% of their posterior volumes (and in some cases greater than 99%): the IGM had to be heated above the expectation from adiabatic cooling by at z = 10.4 the latest.
There are two key consequences of this result. The first follows from our current understanding from existing probes of reionization-especially the integrated optical depth to reionization, galaxy UV luminosity functions, and quasar spectroscopy-that the bulk of reionization happens after z = 10.4. If that is the case, then we have ruled out the "cold reionization" scenarios in which the IGM continues to adiabatically cool until it reionizes, creating a bright 21 cm power spectrum is boosted by a strong contrast between the CMB and the IGM spin temperature. It is still possible that the IGM is slightly heated at z = 10.4 but still colder than the CMB. However, a broad class of models where the IGM remains very cold until reionization are no longer viable.
Second, if high mass X-ray binaries dominate the soft X-ray budget of high redshift galaxies and thus were responsible for the heating of the IGM, as is generally expected (Fragos et al. 2013), then more than 99% of HERA's posterior excludes the local relationship between star formation and soft X-ray production (Mineo et al. 2012) extrapolated to high redshift. We instead favor models with low-metallicity HMXBs, which is clear evidence for the impact on the IGM of some of the first compact objects to form during the cosmic dawn.
We also used a semi-numerical model that allows galaxies to create radio backgrounds brighter than the CMB to jointly constrain those radio backgrounds and the X-ray luminosity of those galaxies. Specifically, we combined HERA limits (which disfavor strong radio backgrounds with weak X-ray heating) and Chandra Xray background constraints (which rule out strong X-ray heating) to exclude most models that would explain the radio backgrounds observed by LWA and ARCARDE 2 as originating at z 8.
Looking forward, we see a number of ways HERA might continue to improve the constraints on the 21 cm signal as we continue the steady march to greater and greater sensitivity. One approach would be to move beyond cautious foreground-avoidance and attempt to apply more aggressive and more nearly optimal filters, removing foregrounds (Ewall-Wice et al. 2021) in delay space and integrating down coherently for longer with better-tailored fringe-rate filters ). These techniques might help us explore more frequency bands and claw back some of the most sensitive baselines that we had to excise in this work. However, they will likely incur higher levels of signal loss that will have to be rigorously quantified and taken into account.
More importantly, this analysis only used a small fraction of HERA's final size and bandwidth. HERA Phase II, now being commissioned, will have 350 antennas observing from 50-250 MHz which corresponds to 4.7 < z < 27.4. We now have a well-tested analysis pipeline taking us all the way to power spectra and astrophysical inference, including a suite of statistical tests and end-to-end simulations with which to validate our results. This work, along with H22a, H22b, and their supporting papers, will serve as a foundation for future HERA analysis. Of course, HERA Phase II has an entirely new signal chain-from feeds to correlator-and will likely have to contend with a somewhat different set of systematics. However, if these can be overcome, HERA could be the instrument to detect and characterize the 21 cm power spectrum from the epoch of reionization and push our knowledge of early stars, black holes, and galaxies into the cosmic dawn.
National Research Foundation, an agency of the Department of Science and Innovation.
This work used the Extreme Science and Engineering Discovery Environment (XSEDE; Towns et al. 2014), which is supported by National Science Foundation grant number ACI-1548562. We acknowledge the use of the Ilifu cloud computing facility (www.ilifu.ac.za) and the support from the Inter-University Institute for Data Intensive Astronomy (IDIA; https://www.idia.ac.za).
The authors wish to thank the anonymous referee for their insightful feedback.
J.S. Dillon gratefully acknowledges the support of the NSF AAPF award #1701536. N. Kern gratefully acknowledges support from the MIT Pappalardo fellowship. P. Kittiwisit and M.G. Santos acknowledge support from the South African Radio Astronomy Observatory (SARAO; www.sarao.ac.za) and the National Research Foundation (Grant No. 84156). This result is part of a project that has received funding from the European Research Council ( One of the most important systematics that we needed to mitigate before estimating power spectra is been crosstalk. The strategy developed in , demonstrated in Kern et al. (2020b), and applied in H22a is quite effective at removing that crosstalk. However, as we discovered in Section 3.2.4, our crosstalk was significantly less time-stable over the entire season that it was during the 18 nights analyzed in H22a. This motivated LST-binning over epochs and new cuts on the data (see Section 3.3.2).
This discovery also motivated a renewed attempt to understand the physical origin of the crosstalk. We began with the basic mathematical model of crosstalk presented in . If we postulate that the voltage measured on an antenna i is due to both the incident sky signal v i absorbed by antenna i and contributions from other antennas, we can write that voltage (ignoring noise) as where ε ni is the coupling between the nth antenna and antenna i. Since visibilities are formed by cross-correlating voltages, a visibility with cross-coupling takes the form where angle brackets indicate a time average. Assuming the coupling is small and dropping all terms that are second order in ε or first order in ε but only involve cross-correlations, we get Thus, to leading order, the main cross-coupling contributions appear as autocorrelations of each antenna involved leaking into a visibility. It should be noted that we are ignoring antenna-to-antenna coupling effects due to reflections off antennas, which Josaitis et al. (2022) showed would be an important systematic for HERA Phase II. At this point, we have seen no evidence that this is a dominant systematic for HERA Phase I.  was agnostic as to the origin of the coupling. They instead focused on how autocorrealtions, which evolve much more slowly in time than cross-correlations with any appreciable east-west projected length, can be distinguished from the primary sky signal in fringe-rate space. As long as ε was time-stable, its structure was of secondary concern. Our observation of discontinuities in the crosstalk in delay space (see Figure 6) meant that ε was effectively no longer stable over a full season of LST-binned data.
Explaining the range of delays over which crosstalk was observed in e.g. Figure 6 and Figure 7-generally 800 to ∼2000 ns-required solving a puzzle. Those delays are too long to be explained by a broadcasting antenna, which would created correlated signals at delays explicable by the light travel time across the array. They are also too short to be explained by invoking cable reflections, which require two traversals of the ∼150 m cables connecting antennas to the receivers and thus take ∼1200 ns.  explored and rejected several possible explanations. No model could explain that range of delays and also the asymmetric structure we see in a single visibility, whose positive and negative delays can exhibit completely different crosstalk structure.
However,  did note that attributing positive or negative delays to the first and second antennas in a baseline reveals interesting patterns in the dependence of the delay and amplitude of the cross-talk peak on position in the array. That proved to be a key insight. Taking the time-averaged amplitude of the Fourier transform of a several inpainted visibilities from a single epoch (in this case Epoch 2) which share one antenna in common illustrates that pattern. In Figure 35 we show this for a subset of the baselines that share antenna 119, all of them in the same two rows (see Figure 4). What we see is a remarkable asymmetry between positive and negative delays. At negative delay, we see a delay structure that looks similar within rows, but with decreasing amplitude and at more negative delays as the second antenna moves east. Meanwhile, at the crosstalk systematics at positive delay show up at largely similar delays but with wildly ranging amplitudes and structures. Holding a single antenna fixed and seeing such similarities at negative delays indicates a clear association between antenna i and negative delays in V ij -and thus by symmetry, an association between antenna j and positive delays. A model for this crosstalk must explain how the voltage signal from e.g. antenna 119 was received by every other antenna with a position-dependent peak delay and amplitude.
To understand these delays, recall that HERA Phase I reused PAPER feeds and signal chains. These included ∼150 m coaxial cables from each feed to a set of eight "receiverators," RF-shielded mini-fridges each containing 16 receiver and post-amplification modules (Bradley 2017), located just west of the array. These were then connected by ∼20 m coaxial cables to a shipping container housing the analog to digital converters and the correlator (DeBoer 2015). We hypothesize that at one of those connection points, likely after the receiverators, was leaking and broadcast virtually every antenna's voltage signal. These signals were then picked up by every other antenna, leading to autocorrelations appearing in cross-correlation visibilities at high delay. We thus explain the crosstalk peak delays and amplitudes in Figure 35 with a model for ε ij where Here τ i,cable is the light travel time along the ∼150 m cable connecting antenna i to the receiverators; it can be easily measured by examining the delay spectrum of autocorrelations since cable reflections appear at 2τ i,cable (Kern et al. 2020b). The rest of the terms are free parameters. The first two are the positions of the emitter, x * and y * , from which we can calculate d * j , the distance from the emitter to antenna j. The next is τ offset which can account for the possibility that emission occurred after traversing another ∼20 m cable between the receiverators and the correlation. No per-antenna variation is allowed in τ offset . Finally, the amplitude of the crosstalk seen on each baseline depends on the "leakiness" of antenna i-A i , each a free parameter-and the distance it must travel to the power −α. Since Crosstalk Peaks Figure 35. Examining the time-averaged delay spectrum of several baselines all sharing a single antenna reveals a clear pattern in the delay structure of the crosstalk feature between 800 |τ | 1500 ns, but only one on side. Here we show eleven baselines all sharing antenna 119 (and all north-polarized). All the antennas are part of two rows (see Figure 4). We plot their timeaveraged amplitudes in delay space, each arbitrarily offset for readability. We also mark the peak delays of each baseline's crosstalk with black stars, which our model must explain. Crosstalk at negative delay shows similar structure within rows of the array (120-124 as compared to 137-142) with diminishing amplitude as we move eastward. By contrast, the crosstalk feature at positive delay shows no clear pattern. This supports the argument that the negative delay feature is associated with antenna 119. It follows from symmetry that, since this effect is not unique to antenna 119, the positive delay feature must be associated with the other antenna. At low delay we can also see the foregrounds peak in the main lobe of the primary beam, as well as a widening "pitchfork" feature associated with foregrounds on the horizon (Thyagarajan et al. 2015a,b;Pober et al. 2016), which is at higher delay for longer baselines.
emitted voltages go down as 1/r, we should expect α = 1. However, to attempt to account for the effects of complicated mutli-path propagation through a lattice of antennas, we leave α as a free parameter. Since this model does not predict the full delay spectrum of the crosstalk, our goal here is not to solve for every parameter optimally. Instead, we want to test its physical plausibility. Therefore, we fit both parts-the phase and the amplitude-separately with potentially different emitter positions. First we fit the delays, since that is a simpler model with far fewer free parameters. In Figure 36, we show in the left panel the measured crosstalk peak positions at negative delay (black stars in Figure 35) for all baselines V ij where i = 119. In the right panel, we show our best fit source position and the prediction for the total delay term in Equation A4. Both are overlaid over recent satellite photography of HERA. This fit is performed over all unflagged baselines, with the exception of a few baseline that had too little crosstalk to measure the peak reliably. The fit is quite good; the average delay error is only 47 ns which is only a few times larger than the 12.8 ns delay resolution of the Fourier transform after removing flagged channels. More tellingly, the emitter position is quite consistent with the suspect signal chain elements, namely the connections between the receiverators and the correlator. The fit τ offset = 99 ns which consistent with expectations for the for the ∼20 m cable (Kern et al. 2020b), given the speed of light in the medium. While this result lends significant credence for our model, we cannot definitively state that it locates the emitter as the input ports on the correlator container. The emitter position and τ offset are correlated and, as Figure 35 shows, quantifying the error in measuring peak delays is challenging.
As a second test, we also fit the amplitude of each crosstalk signal using Equation A4. This test is a bit less straightforward, since it requires a free parameter A i for every antenna. That said, an independent fit of the emitter position yields a quite consistent result, as we can see in Figure 37. Again, we show the crosstalk amplitudes associated . We can predict the measured peak delays in the crosstalk in each visibility Vij using a model (Equation A4) where antenna i's signal travels down a ∼150 m cable, gets amplified in the "receiverators," and then is emitted shortly thereafter. That voltage signal then travels over the air and is picked up by antenna j, producing a contribution of Vii to Vij at large negative delays (Equation A3). In the left-hand panel, we show all such measurements involving a single antenna, 119, and the prediction of our best fit model (right-hand panel) using all baselines. We overlay our data and fit over recent satellite imagery of HERA from Google Maps (accessed in June 2022), which shows that our best fit for the position of the emitter is spatially consistent with our receiverators and correlator container. This explanation of the physical origin of the crosstalk validates our approach to removing it. Crosstalk Relative Amplitude Model = A 119 /d *X 2.31 (unitless) Figure 37. Analogously to Figure 36, we can also predict the amplitudes of the crosstalk as they appear in each baseline using the same model (Equation A4). In the left-hand panel we show measured peak crosstalk amplitudes, relative to the autocorrelations that source them, for all baselines involving antenna 119. On the right, we show the best-fit model prediction for that same data set using all baselines. The relatively good agreement here, as well as the fact that we find a consistent emitter position despite fitting for it independently fit, lend credence to our physical model for the crosstalk.
with antenna 119 (left-hand panel) but perform our fit over all baselines. The result is a fairly good fit, with a mean amplitude error of 1.2 × 10 −5 relative to the autocorrelations. Oddly, the best fit power law is not α = 1 as we had expected, but α = 2.31. We do not have an explanation for why it should follow that power law, though we note that the problem is a fair bit more complicated than simple free-space propagation. There are also some parameter degeneracies to consider; one can still get a decent fit by fixing α = 1 and moving the emitter position much closer to the array. Without a proper error analysis, it is difficult to validate the precise parameterization of our model. What should α be? What about τ offset ? Can we explain the full delay structure of the crosstalk signal as some sort of multi-path propagation effect? Can construction of new antennas explain the epoch-to-epoch change in the crosstalk we saw in Figure 6? Unfortunately we can only speculate. The system in question has long-since been disassembled and HERA Phase II does not use receiverators nor does this systematic appear in more recent data. That said, having a plausible physical mechanism that allows autocorrelations to leak into cross-correlations in a relatively time-stable way adds significant support to our strategy for removing the systematic.