An Infrared Search for Kilonovae with the WINTER Telescope. I. Binary Neutron Star Mergers

The Wide-Field Infrared Transient Explorer (WINTER) is a new 1 $\text{deg}^2$ seeing-limited time-domain survey instrument designed for dedicated near-infrared follow-up of kilonovae from binary neutron star (BNS) and neutron star-black hole mergers. WINTER will observe in the near-infrared Y, J, and short-H bands (0.9-1.7 microns, to $\text{J}_{AB}=21$ magnitudes) on a dedicated 1-meter telescope at Palomar Observatory. To date, most prompt kilonova follow-up has been in optical wavelengths; however, near-infrared emission fades more slowly and depends less on geometry and viewing angle than optical emission. We present an end-to-end simulation of a follow-up campaign during the fourth observing run (O4) of the LIGO, Virgo, and KAGRA interferometers, including simulating 625 BNS mergers, their detection in gravitational waves, low-latency and full parameter estimation skymaps, and a suite of kilonova lightcurves from two different model grids. We predict up to five new kilonovae independently discovered by WINTER during O4, given a realistic BNS merger rate. Using a larger grid of kilonova parameters, we find that kilonova emission is $\approx$2 times longer-lived and red kilonovae are detected $\approx$1.5 times further in the infrared than in the optical. For 90% localization areas smaller than 150 (450) $\rm{deg}^{2}$, WINTER will be sensitive to more than 10% of the kilonova model grid out to 350 (200) Mpc. We develop a generalized toolkit to create an optimal BNS follow-up strategy with any electromagnetic telescope and present WINTER's observing strategy with this framework. This toolkit, all simulated gravitational-wave events, and skymaps are made available for use by the community.

Optical-wavelength searches dominated the UVOIR O3 GW follow-up landscape, with dozens of optical telescopes across the globe and in orbit triggered during O3 for GW follow-up. Theoretical models and observational constraints from O3 nondetections predict that the blue, optical emission from a kilonova is angle-dependent, fades rapidly (<1 week), and may not be present in all BNS or NSBH mergers (Metzger 2020;Kasen et al. 2013Kasen et al. , 2017Barnes et al. 2016;Kasliwal et al. 2020). In contrast, the near-infrared emission is expected to be isotropic, longlived (>1 week), and ubiquitous in models regardless of mass ratio, viewing angle, or remnant lifetime . Models predict the detection rates of kilonovae in the near-infrared could be up to ∼ 8 − 10 times higher than in optical wave bands (Zhu et al. 2021b).
However, due to the high cost-per-pixel of detectors and bright sky backgrounds, the dynamic infrared sky remains largely underexplored compared to optical wavelengths. Existing time-domain, infrared surveys are either restricted to small areas on sky (e.g., the VISTA Variables in the Via Lactea survey covering 520 deg 2 (Catelan et al. 2011) or the UKIRT Deep Extragalactic Survey covering 35 deg 2 (Lawrence et al. 2007)) or are relatively shallow (e.g., Palomar Gattini IR, J ∼ 16 mag; De et al. (2020)). The lack of deep, all-sky reference images limits the number of GW events that can be followed up in the infrared.
The Wide-field Infrared Transient Explorer (WINTER) is a new instrument that will perform the first nearinfrared, all-sky survey to J AB = 21 magnitudes and is specially built for GW follow-up (Frostig et al. 2020;Lourie et al. 2020). WINTER will operate on a dedicated 1-meter telescope at Palomar Observatory that was commissioned in June 2021 with an optical camera on one port, with the infrared WINTER instrument to be added on the second port in late 2021. WINTER's wide, 1 deg 2 field of view can quickly tile the median expected fourth observing run (O4) 33 deg 2 BNS localization contour (Abbott et al. 2020b) with rapid-response robotic observing. With three near-infrared filters in the Y, J, and a shortened-H bands (centered at 1.0, 1.2, and 1.6 µm, respectively), WINTER is designed to discover kilonovae and observe them for two weeks or more (Frostig et al. 2020).
As an alternative to the traditional, but expensive, mercury-cadmium-telluride (HgCdTe) detectors that dominate the near-infrared landscape, WINTER employs cheaper indium-gallium-arsenide (InGaAs) detectors new to astronomical instrumentation. A prototype instrument confirmed InGaAs detectors achieve backgroundlimited near-infrared photometry without cryogenic cooling (Simcoe et al. 2019), which is required for HgCdTe detectors. WINTER will head a new class of InGaAsbased near-infrared astronomical instruments coming online in the next decade, including the DREAMS telescope in the Southern Hemisphere (Travouillon et al. 2020). The upcoming HgCdTe-based PRime-focus Infrared Microlensing Experiment telescope will also join WINTER and DREAMS in an upcoming effort to deepen our understanding of the near-infrared time-domain sky.
In this paper, we present an end-to-end simulation of WINTER's performance in O4 and make a case for infrared follow-up of BNS GW signals. Existing simulations such as Abbott et al. (2020c) also present predictions for BNS merger rates detected by the global GW network in O4, but do not model any electromagnetic counterparts. Petrov et al. (2021) repeats the analysis in Abbott et al. (2020c) with more realistic parameters and studies the resulting BNS lightcurves using only optical-wavelength kilonova models. Our study additionally takes into account uncertainty introduced by the use of realistic GW matched-filter pipelines, compares any potential differences between the use of low-and medium-latency skymaps, and is primarily focused on the infrared. We include only BNS mergers in this work and leave the study of NSBH mergers, which are expected to be especially promising to follow up in the infrared Fernández et al. 2017;Zhu et al. 2020), for future work.
In Section 2 we describe the simulation, including modeling a population of BNS mergers, their respective skymaps, and WINTER follow-up observations. We present the results in Section 3. This leads to a discussion of the merits of studying kilonovae in the infrared and the planning of WINTER observations in Section 4, and we conclude in Section 5.

METHODS
We simulate follow-up of BNS gravitational-wave signals with WINTER using the following procedure: 1. We generate a simulated BNS population and model a set of electromagnetic lightcurves from two different model grids Bulla 2019) for each merger event.
2. To model the gravitational-wave response to each event, we simulate Advanced LIGO (Aasi et al. 2015), and Virgo (Acernese et al. 2014) detector noise and network configurations expected for O4. We also include the Japanese interferometer KA-GRA (Somiya 2012;Aso et al. 2013;Akutsu et al. 2018), which joined the global network of groundbased gravitational-wave detectors at the end of the third observing run.
3. We perform a low-latency search for gravitationalwave events in our simulated data using the PyCBC Live pipeline (Nitz et al. 2018;Dal Canton et al. 2020). Low-latency skymaps are generated using the BAYESTAR software (Singer & Price 2016) to obtain the localization uncertainty contour for each event recovered by PyCBC Live. Full source characterization is then performed using the bilby parameter estimation pipeline (Ashton et al. 2019;Romero-Shaw et al. 2020) to obtain mediumlatency skymaps. Our simulated PyCBC Live results, BAYESTAR skymaps, and bilby posteriors are publicly available on Zenodo at Frostig et al. (2021).
4. We simulate WINTER follow-up observations searching the resultant BAYESTAR and bilby skymaps. If WINTER successfully observes the true location of the event, we then verify that the event is bright enough to qualify as a statistically significant detection.

Simulated BNS population
The fourth observing run of the Advanced LIGO gravitational-wave interferometers (Aasi et al. 2015) is expected to have a duration of one year and include a four-detector network consisting of the two Advanced LIGO detectors (Aasi et al. 2015), Advanced Virgo (Acernese et al. 2014), and KAGRA (Somiya 2012;Aso et al. 2013;Akutsu et al. 2018;Abbott et al. 2020c). One major goal of this study is to determine how many BNS systems would be detected in gravitational waves and then successfully followed up with WINTER during O4. To this end, we simulate a realistic population of BNS systems that could be detected over the course of one year of observing with this detector network operating at its expected sensitivity during this observing run (Abbott et al. 2020a). A (quasi-circular) BNS system is fully described by 17 binary parameters, θ, including the component masses, six-dimensional spins, and tidal deformabilities of the individual neutron stars, and the extrinsic parameters-the distance, inclination angle, right ascension, declination, polarization angle, and time and phase at coalescence.
We simulate systems following the mass distribution providing the best fit to the population of Galactic double neutron stars and the components of GW170817 as determined in Farrow et al. (2019), which is parameterized in terms of the "slow" and "recycled" binary neutron star components. In the isolated channel for BNS formation (e.g., Tauris et al. 2017), the recycled neutron star forms first and gets spun up to a period of ∼ 10 − 100 ms via accretion from its companion (Radhakrishnan & Srinivasan 1982;Alpar et al. 1982;Heuvel 2017). Conversely, the second-born neutron star does not experience this period of accretion, and spins down to a "slow" period of O(1) s on a timescale of ∼ 1 Myr. The Galactic double pulsar PSR J0737-3039A/B serves as the canonical example for this type of system (Burgay et al. 2003;Lyne et al. 2004).
We model the mass distribution of the recycled neutron star, m r , as a Gaussian mixture model, with mixing fraction α = 0.68 determining the fraction of systems in the first Gaussian, The two Gaussians have means and widths given by  The tidal deformabilities of the neutron stars, Λ i , are calculated from the masses assuming the AP4 equation of state (Akmal & Pandharipande 1997;Akmal et al. 1998). We assume the dimensionless spins of the neutron stars χ are aligned with the orbital angular momentum and drawn from the implied distribution on χ z assuming that the magnitudes are distributed uniformly on [0, 0.05] and the directions are isotropic. This spin prior choice is motivated by the maximum spin of observed Galactic double neutron star systems (Lorimer 2008). The resultant distribution is shown in the bottom panel of Figure 1. The inclination angle, θ JN , between the total angular momentum and the line of sight of the observer is drawn uniformly in cos θ JN , and the orbital phase at coalescence and polarization angle also follow uniform distributions.
The detector-frame coalescence times, t d , of the sources are distributed uniformly over the course of one year starting on January 1 st 2022. We choose the sky locations and distances (and hence the redshifts) assuming the sources Table 1. Number of BNS mergers occurring and detected in gravitational-waves for each rate considered in this manuscript during one calendar year of O4. The first column indicates the total number of mergers, while the second column gives the median and 90% symmetric credible interval on the number of systems detected in gravitational waves obtained by averaging over the 101 different realizations of BNS merger combinations. Also included are the number of unique pairs of events that are found in gravitational waves within one day and one week of each other.

Rate
Mergers Found 1 day 1 week are distributed isotropically on the sky and uniformly in comoving volume and source-frame time, such that p(z, t d ) ∝ dVc dz (1 + z) −1 out to a luminosity distance of 400 Mpc, where V c is the comoving volume. We consider three different merger rates consistent with those presented in the LIGO Scientific Collaboration et al. (2021): pessimistic, realistic, and optimistic rates of 100, 500, and 1000 Gpc −3 yr −1 , respectively. This results in 21, 105, and 210 sources per year out to d L = 400 Mpc (assuming a merger rate unevolving with redshift; see Table 1). We generate 625 unique BNS merger events following the distributions specified above. For each of the three rates considered, we draw 101 different realizations of the corresponding total yearly number of events from these 625 systems. This allows us to marginalize over many different realizations of the "universe" described by each merger rate and more realistically account for uncertainties.

Detection in gravitational waves
For each simulated system, we randomly assign a detector configuration assuming each of the four interferometers are independently operating with a duty cycle of 0.7 (Abbott et al. 2020b). Each BNS is added to 296 s of simulated Gaussian noise sampled at 8192 Hz colored by the expected O4 power spectral density (PSD, shown in Figure 2) for each of the detectors that are "observing" during that event (Abbott et al. 2020a) starting at a frequency of 17 Hz.

PyCBC Live
Low-latency searches for gravitational waves from compact binary mergers rely on the matched filtering technique (e.g. Cutler et al. 1993;Allen et al. 2012), where a template bank of waveforms is used to filter the data to search for candidate events. In order to more accu- rately account for the uncertainties associated with the matched-filter detection of gravitational-wave events, we use the PyCBC Live (Nitz et al. 2018;Dal Canton et al. 2020) pipeline-which is one of the pipelines used to search for gravitational-wave events in LIGO and Virgo data in real time-to identify candidates in our simulated population. The discreteness of the template bank is the dominant source of uncertainty on the source parameter estimates obtained from matched filtering. It is constructed using a stochastic placement algorithm (Babak 2008;Harry et al. 2009;Privitera et al. 2014) so that the maximum loss in the signal-to-noise ratio (SNR GW ) calculated from a starting frequency of 20 Hz between adjacent templates is 3%. The templates cover detectorframe component masses of between 1−3 M and aligned spins out to χ = 0.05. For convenience, the search is configured to run on data from all four detectors in our O4-like network. It calculates the PSD for each detector in real time using the simulated data for each event. A trigger is generated if the criterion SNR GW ≥ 4.5 is met in at least two of the four detectors. In post-processing, we then calculate the network matched-filter SNR GW using the trigger information returned by PyCBC Live by taking the square root of the quadrature sum of the SNR GW in each detector that was actually "observing" at the time of the event in question.
Because we only simulate data in short segments around the times containing a BNS merger rather than a continuous one-year data stream, the false alarm rate (FAR) cannot be meaningfully calculated by PyCBC Live. Rather than using an FAR threshold to indicate detec- tion, we instead set a threshold on the network matchedfilter SNR GW recovered by the search for each event of SNR GW > 9 in order for the source to count as "found." Of the 625 independent BNS mergers, 96 are found by PyCBC Live. By averaging over the 101 different realizations of BNS merger combinations for each rate as described in the previous section, we obtain the median and 90% symmetric credible interval on the number of found events shown in Table 1. The fraction of the total number of events detected with each number of interferometers is shown in Figure 3, along with the recovered SNR GW distribution. The farthest event is found at a distance of 372.2 Mpc and the nearest at 59.0 Mpc.

BAYESTAR Map Generation
We localize each event identified by PyCBC Live (i.e., with network SNR GW > 9) using the BAYESTAR rapid localization algorithm (Singer & Price 2016). During O3, skymaps produced using BAYESTAR were released in lowlatency for each GW event that passed the public-alert threshold.
BAYESTAR takes as inputs the estimated masses of the neutron stars, the coalescence time, and the SNR time series from each detector as calculated by PyCBC Live. From this time series, BAYESTAR extracts the timing, relative phases, and amplitudes from each detector. BAYESTAR then constructs a three-dimensional sky localization posterior distribution, with a probability for each pixel on the two-dimensional sky and an additional distance component that is approximated by a Gaussian along each line of sight.
In O3 and in this work, BAYESTAR sky localizations for each event were computed in O(10 s). Further efforts to optimize BAYESTAR have led to improvements resulting in runtimes of O(1 s) (Magee et al. 2021).

Parameter estimation
After the initial low-latency skymap using the BAYESTAR algorithm is sent out, further source characterization is performed using full parameter estimation, whereby posterior probability distributions for the binary parameters are obtained using stochastic sampling methods. While BAYESTAR fixes the intrinsic parameter values to those corresponding to the maximum-likelihood template returned by the search pipeline, parameter estimation enables marginalization over the uncertainty in the intrinsic parameters and the extrinsic parameters that are not necessary for the skymap. Accounting for this uncertainty and allowing for correlations between parameters as done using full parameter estimation should lead to skymaps that have systematically smaller sky areas and include the true location of the source at smaller confidence intervals (Singer et al. 2014). We present a comparison of the skymaps from the two localization algorithms in Section 3.1.
The likelihood of observing gravitational-wave data d given binary parameters θ is given by (Veitch & Vecchio 2010;Romano & Cornish 2017;Ashton et al. 2019): where h(θ) represents the gravitational waveform for the BNS signal with parameters θ, T is the duration of the analyzed segment, S k is the PSD, and k indicates the frequency dependence of the data, waveform, and PSD. The posterior probability distribution for the binary parameters characterizing each systems is then given by Bayes' Theorem: where π(θ) represents the prior probability distribution assumed for θ. We simulate the progression of skymaps that could occur during a real observing run by performing parameter estimation using the bilby software (Ashton et al. 2019;Romero-Shaw et al. 2020) for all of the events that were found by PyCBC Live. We use the PyMultiNest nested sampler (Feroz & Hobson 2008;Feroz et al. 2009Feroz et al. , 2019Buchner et al. 2014) to generate samples from the posterior probability distribution for the source parameters, where the likelihood is analytically marginalized over the distance to the source and the phase at coalescence (Veitch et al. 2015;Singer & Price 2016;. We make the assumption that the neutron stars are point masses with tidal deformability Λ = 0, which is unlikely to affect the inference of the sky location of the source. This assumption enables the use of the reduced order quadrature implementation (Smith et al. 2016) of the IMRPhenomPv2 waveform (Hannam et al. 2014;Khan et al. 2016;Husa et al. 2016), significantly reducing the computational cost of the parameter estimation. The likelihood uses the true PSD used to color the Gaussian noise in each detector. In order to keep the computational cost low, we do not marginalize over uncertainty in the detector calibration.
Because the search pipelines recover the chirp mass of the source extremely well (Biscoveanu et al. 2019), we use a uniform prior on the detector-frame chirp mass with a width of 0.2 M centered on the true value. The prior on the mass ratio, q ≡ m 2 /m 1 , is uniform over the interval [0.125, 1]. We restrict the analysis to aligned spins only, and the spin prior is the same as that used for drawing the simulated sources, shown in Figure 1. The luminosity distance prior is uniform in the source frame between 10 and 500 Mpc, and the prior on the coalescence time in the geocentric frame is uniform over a width of 0.2 s centered on the true coalescence time. We use standard priors on all other parameters (see, e.g., Romero-Shaw et al. 2020).
With these parameter estimation settings, the sampling stage takes 63 +61 −32 mins across all of the recovered events. This depends most sensitively on the SNR GW of the signal, since louder signals take longer to analyze due to the way the nested sampler explores the prior volume. We note that there are additional timing overheads for generating the weights for calculating the reduced order quadrature likelihood, reconstructing the posterior for the analytically marginalized distance parameter, and generating a skymap of the appropriate format that can be used by observers, although these additional steps generally take less time than the sampling. This timing is not necessarily indicative of what will occur in O4. There are a number of alternative techniques being explored that can be used to further accelerate parameter estimation, particularly for low-mass sources, like focused reduced order quadrature (Morisaki & Raymond 2020), relative binning (Zackay et al. 2018;Finstad & Brown 2020), parallelization (Pankow et al. 2015;Talbot et al. 2019;Wysocki et al. 2019;Smith et al. 2020), prior restrictions (You et al. 2021), and machine learning based approaches (Gabbard et al. 2021;Williams et al. 2021).

Lightcurve modeling
In order to capture some of the current uncertainty in kilonova modeling, we use two different prescriptions to predict the lightcurves of kilonovae from our simulated BNS mergers.
Bulla model : The first is a grid of kilonova models generated using the Monte Carlo radiative transfer code POSSIS (Bulla 2019). The grids assume a twocomponent ejecta comprised of mass ejected on a dynamical timescale and mass ejected in the form of a wind from the debris disk post merger. The dynamical ejecta is further divided into a lanthanide-rich equatorial component and a lanthanide-free polar component. The lightcurves are calculated using four parameters: the total dynamical ejecta mass (M dyn ej ), the total wind ejecta mass (M wind ej ), the opening angle of the lanthanide-rich component (Φ) and the observer viewing angle ( . A larger value of the opening angle Φ increases the amount of lanthanide-rich material, making the kilonova redder. Further details about the BNS grids are presented in Dietrich et al. (2020).
Kasen model : The second model uses the kilonova simulations presented in Kasen et al. (2017). While these simulations only include the effects of a single, spherically symmetric ejecta component with radial density given by a broken power law, they feature a more rigorous treatment of the microphysics determining the radiation transport in the system and allow for a range of compositions. The model parameters are the total ejecta mass, which we calculate as M ej = M dyn ej + M wind ej , the expansion velocity of the ejecta, v ej , and the mass fraction of lanthanides, X lan . A higher M ej leads to lightcurves that peak at brighter magnitudes, while higher v ej leads to faster-fading lightcurves. More neutron-rich ejecta with a higher lanthanide fraction, X lan 10 −2 , produce a redder and longer-lived kilonova than ejecta composed primarily of light r-process material, 10 −6 X lan 10 −2 .
To estimate the ejecta masses and velocity from the neutron star masses, we use the prescription outlined in Stachie et al. (2021). First, we assume an AP4 equation of state (Akmal & Pandharipande 1997;Akmal et al. 1998) for the merging neutron stars to calculate their radii and compactness, although recent works have developed (Stachie et al. 2021) and employed (Abbott et al. 2021a) methods for marginalizing over the uncertainty in the equation of state. We then calculate the dynamical ejecta mass using the fitting formula from Coughlin et al. (2019a): where m i and C i are the masses and the compactnesses of the two neutron stars, respectively, and a = −0.0719, b = 0.2116, d = −2.42 and n = −2.905. The ejecta velocity is similarly calculated using a fitting formula from Coughlin et al. (2019a): where the coefficients are given by a = −0.3090, b = 0.657, and c = −1.879.
We estimate the wind ejecta mass as a fraction of the disk mass M wind ej = ζM disk ej and set ζ = 0.15 (Dietrich et al. 2020). We calculate the disk mass using the fitting formula where M thresh is the minimum total mass that results in a prompt collapse post merger and is calculated as in Bauswein et al. (2013); a , b , c , and d , are calculated as in Dietrich et al. (2020). We interpolate between the parameters of standard grids for both the Bulla and Kasen models using the Python package gwemlightcurves (Coughlin et al. 2018a(Coughlin et al. , 2019aDietrich et al. 2020) to predict the J-band and r-band lightcurves of the kilonovae. The left panel of Figure 4 shows the lightcurves for an example system. The differences in peak magnitude and decay time for the lightcurves calculated with the Kasen and Bulla models demonstrate the uncertainties currently underlying kilonova modeling. We present results for both models to conservatively account for this uncertainty in our predictions. Additionally, we note that the ejecta masses calculated from the fitting formulae are approximate and depend strongly on the equation of state, fitting parameters, and assumptions of the numerical relativistic models that they are based on. It is thus possible that the ejecta masses are not entirely representative of realistic kilonovae. We investigate the effect of varying ejecta masses in Section 4.1.

WINTER simulated observing
For each found gravitational-wave event, we use the corresponding skymap and simulated time of the event  to create a realistic observing schedule with the gwemopt package (Coughlin et al. 2018b(Coughlin et al. , 2019bGhosh et al. 2016). The package takes gravitational-wave probability maps as inputs, such as the BAYESTAR and bilby skymaps, subdivides the sky into tiles sized to the telescope field of view, and generates an optimized observing schedule. During O3, a network of telescopes including the Zwicky Transient Facility (Bellm et al. 2018) created GW followup schedules with gwemopt (e.g., Coughlin et al. 2019c;Kasliwal et al. 2019b) and we expect to use the package during nominal WINTER operations in O4.
We run a set of observing simulations allowing one, three, five, seven, and fourteen nights of dedicated telescope time searching for the kilonova from each event.
For the scope of this study, we limit observations to the J-band to match WINTER's planned J-band reference images and all-sky survey and only study one gravitational-wave event at a time, even though multiple events may occur during the same night or same week (see Table 1). Each WINTER observation lasts 450 seconds to match the J AB = 21 magnitude reference images, split across five dithers, with the time for dithering at approximately one second per dither represented as an overhead time in the simulation. Time to slew between each field is calculated based on the telescope and dome slew rates measured at WINTER's host telescope at Palomar Observatory. WINTER's InGaAs sensors read out continuously during each exposure, leading to no overhead time due to sensor readout (Malonis et al. 2020). Combining overhead and exposure times, for an eight hour night WINTER covers ∼ 63 deg 2 to J AB = 21 magnitudes. The WINTER data processing pipeline will subtract new science images from prebuilt reference im-ages (constructed well before the GW alert) and detect candidate kilonovae in near real time.
We follow up each skymap with a ranked search strategy, in which we subdivide the sky into a fixed grid of telescope pointings. The center of each tile corresponds to a WINTER reference image for easy image subtraction during data processing. We prioritize the tiles based on the skymap-generated probability. Finally, we create a greedy observing schedule, where the highest probability tiles are observed first, as described in Coughlin et al. (2018b). We prioritize scheduling at least two observations of each field, spread out over the length of an observing campaign to study the lightcurve evolving over time. In this simplified simulation, one observing schedule is created and executed without modification for each follow-up campaign. The simulation schedules multiple visits for each field, but fields are not preferentially revisited. During real O4 follow-up observing, WINTER will iteratively compare observations to prebuilt reference images and revisit any fields containing promising kilonova candidates (see Section 4.3 for details) Next, we check the observing schedules for a successful observation of the kilonova, which can be defined in multiple ways. For the sake of clarity in this study, we define an event as localized if WINTER takes at least one image of the kilonova's true location on the sky. An event may not be localized if the skymap is too large for WINTER to search it efficiently or if the event is not overhead at the given time of year. Additionally, poor weather can hinder follow-up observations; however, weather simulations are outside of the scope of this study.
Even if an event is localized, the kilonova must be sufficiently bright at the time of imaging to be detected. The event qualifies as a discovery if it is localized, imaged at least twice, and observed to a signal-to-noise ratio (SNR EM ) ≥ 5. We employ the lightcurve models described in Section 2.3.1 to calculate the magnitude of the event at the time of each observation and scale the magnitude based on the airmass of the observation. We follow Equation 1 as described in Section 3.1 of Frostig et al. (2020) to calculate the SNR EM for each event based on historic Palomar data and predicted instrument noises. For comparison, we repeat this exercise for WINTER observing in J band and for a fictitious optical telescope observing in the Sloan Digital Sky Survey r filter with equivalent sensitivity, field of view, exposure times, and overhead times as the WINTER J-band observations.
For the scope of this study, we use two observations to SNR EM ≥ 5 as a simplified proxy for true kilonova discovery, as two observations are the minimum number required to distinguish a kilonova from asteroids or other transients. In reality, to confirm a new kilonova candidate, more than two observations may be required, and promising candidates will be selected based on their color evolution, how quickly the lightcurve fades, and if the event is associated with a probable host galaxy. Discovery will then be confirmed with further photometric and spectroscopic follow-up. See Section 4.3 for further discussion of electromagnetic follow-up observations. 3. RESULTS

Skymap comparisons
A total of 96 events were found by PyCBC Live out of the 625 total independent simulated mergers, as shown in Table 1. Each event was localized by both BAYESTAR and bilby, as described in Section 2. We present a comparison of the two algorithms. This is especially relevant for electromagnetic follow-up purposes, since in O3, BAYESTAR skymaps were distributed hours or days before  Figure 6. Comparison skymaps showing the 50 and 90% credible regions obtained using bilby and BAYESTAR for two different events. The true source location is marked with the black star. The event on the left was observed with the Hanford and KAGRA interferometers at a distance of 141 Mpc, although it was not confidently detected with KAGRA since the optimal SNR in that detector is 2.6. The one on the right was detected with Hanford, Livingston, Virgo, and KAGRA at a distance of 114 Mpc. In the skymap to the right, the BAYESTAR and bilby localizations almost completely overlap. if there are significant discrepancies in the two localizations, electromagnetic observers might switch partway through the night, or might even prefer to wait for the bilby skymap before beginning observations. The distribution of 90% credible localization areas for all bilby and BAYESTAR skymaps is shown in Figure 5. Events that are recovered by PyCBC Live with larger values of matched-filter SNR GW tend to have smaller localization areas. Some events have network SNR GW > 20 but localizations larger than 100 deg 2 ; these are typically one-or two-detector events where the low number of detectors results in poor constraints on timing and phase. The cumulative distributions of 90% localization areas from each algorithm are similar, with 24% of bilby and 27% of BAYESTAR localizations falling under 50 deg 2 . The BAYESTAR distribution lies slightly to the left (i.e., to smaller areas) of the bilby distribution. Figure 6 shows a comparison between the BAYESTAR and bilby skymaps for two particular events which are generally indicative of the performance observed in the larger sample. In the left skymap, which has a 90% localization area of order O(1000)deg 2 from a two-detector event at a distance of d L = 141.1 Mpc, the BAYESTAR and bilby skymaps have some overlap, but with probability concentrated in different areas of the sky. In the right skymap, which stems from a four-detector event at a distance of d L = 113.8 Mpc, the localizations are small and in very good agreement.
A useful metric for the accuracy of localizations of simulated events is the searched area, which is the amount of sky area that is covered by integrating in order of decreasing probability from the highest probability pixel until reaching the true location of the source. The left panel of Figure 7 shows the distributions of searched areas for bilby and BAYESTAR. The two algorithms perform very similarly.
Another convenient statistic for comparing probability distributions is the Jensen-Shannon (JS) divergence. The JS divergence measures how similar two distributions are and ranges from 0 bit for identical distributions to 1 bit for completely divergent distributions. JS divergence values greater than 0.002 are considered to be statistically significant (Romero-Shaw et al. 2020). The right panel of Figure 7 shows the intersection of the two 90% credible areas normalized by the bilby 90% area versus the JS divergence between the two skymap posterior distributions for each event. These distributions describe the probability of the event being in each pixel of the two-dimensional skymap. For skymaps that have almost complete 90% area overlap, the JS divergence is very small and thus the maps share significant information content, and vice versa. However, there are also events for which the normalized intersections are close to unity, but have relatively large JS divergences. This is due to two types of events: events that have similar 90% localization areas, but with different probability distributions for the pixels within that area; and events for which the intersection of the bilby 90% area is small compared to the BAYESTAR 90% area, so the intersection divided by the bilby area is almost unity.
Previous studies have found that localizations from full parameter estimation pipelines such as LALInference, when compared to those from BAYESTAR, have systematically smaller sky areas and include the true location of the source at smaller confidence intervals (see, e.g. Figure 3 of Singer et al. (2014)). In this study we find that the discrepancy between the two algorithms is significantly reduced, since our BAYESTAR skymaps use data from all online detectors, while those in Singer et al. (2014) use data only from detectors that register an SNR GW above the detection threshold. Furthermore, since matched-filter searches recover the true chirp mass of BNS systems with extremely high accuracy, to within ∼ O(10 −4 ) M (Biscoveanu et al. 2019), not marginalizing over the uncertainty in the mass parameters should lead to less significant biases in the low-latency skymap compared to higher-mass sources. We expect the difference between the skymaps obtained with the two algorithms to be more significant for NSBH sources, which will be explored in future work.  Figure 8 shows the results of searching the bilby and BAYESTAR maps with WINTER, varying the length of the observing campaign from one through fourteen nights of telescope time, given an optimistic BNS merger rate. The simulation produces separate observing schedules based on the amount of time allowed to follow up the event. In agreement with Section 3.1, WINTER localizes the BNS merger events at similar rates for both the bilby and BAYESTAR skymaps. The number of events localized with WINTER steadily increases given more nights of observing time, with ∼ 2 times more events localized with a fourteen-night search strategy than a one-night search strategy.

Results from WINTER simulated observing
Varying the number of nights allowed searching for the kilonova changes the ordering and prioritization of each observing schedule. For example, given a one-night observing campaign, 100% of the images of the kilonova are taken before the J-band peak of the lightcurve, regardless of model. For a five-night campaign, 80% (86%) of the localized events have their first images taken before the peak and 30% (27%) have two images taken before the peak for the Kasen (Bulla) lightcurve model. Increasing observing time allows WINTER to cover a greater portion of the skymap, but risks tiling the high-probability area of the skymap less efficiently or observing the kilonova later in its evolution when it may have already faded. Therefore, discovery of new kilonovae (defined above as observing the event at least twice to SNR EM ≥ 5) does not increase steadily with time, but levels off as the kilonova eventually fades beyond detection. In the right panel of Figure 4, we compare discovery of new kilonovae in the WINTER J band and for an equally sensitive r- band telescope given the number of nights each telescope is allowed to search for the kilonovae. Given the Kasen lightcurve models, discovery in both J band and r band peaks around three nights of tiling and levels off given more nights of observing, with the exception of lanthanide-poor (X lan = 10 −5 ) kilonova discovery in the r band continuing to increase with fourteen nights of observing (bottom right panel of Figure 4). In the J band with three nights of observing, WINTER discovers 8.5 times more lanthanide-rich (X lan = 10 −2 ) events than lanthanide-poor events (X lan = 10 −5 ). In contrast, an equally sensitive r band telescope discovers zero lanthanide-rich events and up to 28% of lanthanidepoor events.
Despite differences in the lightcurve models, J-band and r-band discovery also levels off after three nights of observing for the Bulla models (top right panel of Figure  4). Varying the opening angle of the lanthanide-rich component changes the resultant lightcurves less than varying the lanthanide fraction in the Kasen models. At three nights of observing, a large opening angle of the lanthanide-rich component (Φ = 60 • ) decreases r-band discovery by 46% and does not change J-band discovery. Table 2 displays the results of the end-to-end simulation studying how many kilonovae WINTER will discover during one year of follow-up observations, including the median and 90% symmetric credible interval on the number of events as described in Section 2.2.1. Some subset of the GW triggers will not be observable by a telescope at Palomar Observatory at the given trigger time, either because the event is too far south or too near the sun.
For a given year of GW triggers, on average ∼ 70% of the events are visible above 20 • altitude for WINTER at some point during the night following the event trigger time. These events are denoted as EM accessible in the table. For a portion of those events, WINTER observes the correct patch of sky and localizes the events, with some events missed due to large localization areas or events with true locations outside of the 90% localization areas. Localized events only qualify as a discovery (SNR EM ≥ 5 in at least two observations) if the event is sufficiently bright, which can vary based on the model grid used to simulate the event. WINTER discoveries range from as low as zero new kilonovae per year with a pessimistic BNS merger rate to as high as ten new kilonovae discovered per year to 90% confidence with an optimistic BNS merger rate. Given a realistic BNS merger rate, WINTER discovers up to five new kilonovae per year to 90% confidence.

Advantages of infrared follow-up
In Section 3, we demonstrate that a 1 deg 2 J-band survey like WINTER can detect up to ten kilonovae during O4. Here, we examine the advantages of a J-band search over a similar optical search. We use a grid of realistic kilonova lightcurves calculated with the Bulla model to identify the parameter space where an infrared search outperforms an optical search. To generate the grid, we do not use the fitting formulae from Section 2.3.1, as they may not be entirely representative of the underlying physical population of kilonovae. Instead, we Max. detectable distance [Mpc] Figure 9. The maximum distance out to which a kilonova can be detected by an m lim = 21 survey for a given combination of dynamical and wind ejecta masses, based on the Bulla models. We plot the distances separately for J and r bands. We further distinguish the kilonova models as blue and on-axis (top left), blue and off-axis (top right), red and on-axis (bottom left) and red and off-axis (bottom right). The ejecta masses for GW170817 are plotted as a red star. The red dashed line shows the contour at the distance at which a kilonova with GW170817-like ejecta masses is detectable in each case (distance indicated in red). We also plot the contours at the distances of 15 events followed up by WINTER from one realization of our realistic-rates simulation.
Twelve of these events are on-axis and three are off-axis. Two off-axis events have distances < 150 Mpc and lie off the plots. It is evident that for the same set of ejecta masses, the J band can detect kilonovae out to larger distances than r band if the kilonova is red.

Red kilonovae are brighter at infrared wavelengths
Kilonovae with a larger opening angle of the dynamical ejecta will have more lanthanide-rich material and hence will be brighter at redder wavelengths. We quantify this in Figure 9 which shows the maximum distance out to which a kilonova can be detected in the r and J bands by a telescope with a limiting depth of 21 magnitudes. We distinguish between blue (Φ = 30 • ) and red (Φ = 60 • ) kilonovae, and an on-axis (θ obs < 60 • ) and off-axis (θ obs > 60 • ) kilonova. We use the maximumlikelihood estimates of the ejecta masses of GW170817 as a benchmark (M dyn ej ∼ 0.005 M , M wind ej ∼ 0.05 M , Di-etrich et al. 2020) and indicate the distances out to which a kilonova with these masses can be detected. Finally, we plot contours corresponding to the distances of 15 events from one realization of our uniformly distributed, realistic-rate simulation that were detected in GWs and followed up with WINTER (see Table 1). Twelve of these events are on-axis and three are off-axis.
For a blue kilonova (Φ = 30 • ), an infrared search does not offer a significant advantage over an optical search. A blue on-axis (off-axis) kilonova with GW170817-like ejecta would be detectable out to ≈ 300 (230) Mpc in both r and J bands. This is because the peak values of r-and J-band lightcurves for a blue kilonova are not significantly different (see top left panel of Figure 4). Off-axis kilonovae are generally fainter than on-axis ones, which explains the reduced sensitivity (300 Mpc vs. 220 Mpc) for off-axis kilonovae. All 12 on-axis events from our simulation are detectable in both r and J bands if they are blue. We note that the detectable region of the ejecta mass phase space (i.e., the region to the Detected fraction (f det ) Figure 10. The fraction of kilonovae from our entire model grid that can be detected in the r and J bands by an m lim = 21 telescope, as a function of distance and number of days since the merger. The contours corresponding to detection fractions (f det ) of 0.1, 0.5, and 0.9 are plotted as dashed, solid, and dotted lines, respectively. Red crosses mark the distances and areas enclosing 90% localization probability of the events from our realistic-rate simulation (of the 16 events, only 10 are shown, as the remaining six lie outside the bounds of the axes). It is clear that kilonovae can be detected for much longer in the J band compared to r band. The right panel can also be used to select GW triggers that are worth following with WINTER. We will only follow events that have median distance estimates and localization areas such that the detection fraction is at least 10%. We note that the detection fraction drops at very early search times (t < 0.5 day), as the kilonova is still brightening in our models at these times. However, we will observe events with localization areas that can be tiled within 0.5 day, observing them repeatedly until the kilonova becomes bright enough to be detected.
right of each contour in Figure 9) is slightly larger for r band than for the J band. Ten of these events have a detectable region that includes GW170817 ejecta masses in both r and J bands. All three off-axis events from our simulation are detectable in both r and J bands. Of these three, two are closer than 150 Mpc, and are detectable for the entire range of ejecta masses in both r and J bands. However, an infrared search performs significantly better than an optical search for red kilonovae. Both onand off-axis red kilonovae are detectable out to larger distances in the infrared than in the optical. A red on-axis (off-axis) kilonova with GW170817-like ejecta is detectable to 284 (246) Mpc in the infrared but only to 217 (174) Mpc in the optical. If the 12 on-axis kilonovae from our simulations are red, only 10 are detectable in the r band while all 12 are detectable in the J band. GW170817 ejecta masses lie in the detectable region for only six kilonovae in the r-band but for 10 kilonovae in J band. If the three off-axis simulated events are red, only two are detectable in the r band while all three can be detected in the J band. Finally, we note that if a particular kilonova is not detected in WINTER observations, Figure 9 can be used to place constraints on the ejecta masses and opening angles associated with it. 4.1.2. All kilonovae are longer-lived in the infrared A second advantage of infrared searches is that kilonova emission is longer-lived in the infrared than in the optical. We quantify this in Figure 10, which shows the fraction of kilonovae from our entire model grid that can be detected in the r and J bands by an m lim = 21 telescope as a function of distance and number of days since the merger. We plot contours corresponding to detection fractions (f det ) of 0.1, 0.5, and 0.9. Figure 10 clearly shows that kilonovae can be detected for much longer in the J band compared to r band. For example, at 200 Mpc more than 10% (i.e. f det > 0.1) of kilonovae are detectable in the r band for a maximum duration of three days. However, in the J band the same events can be detected for almost six days. A WINTER-like telescope with a 1 deg 2 field of view can thus search localization regions of ≈ 450 deg 2 in the J band for kilonovae at 200 Mpc, but is limited to only ≈ 200 deg 2 in the r band. At a distance of 100 (300) Mpc, these values change to ≈ 350 (120) deg 2 for r band and 750 (220) deg 2 for J band.  Figure 11. Contours marking the regions for which the kilonova detection fraction is at least 0.1 (f det > 0.1), as a function of the localization area. We plot separate contours for searches with limiting magnitude of 21, 20.5, and 19.5 mag (black, blue, and red, respectively). With WINTER, we will follow up all events with localization areas < 450 deg 2 . down to a depth of 21 mag. If the localization area is larger than 450 deg 2 , we will follow only those events that have median distance estimates < 200 Mpc down to a depth of 20.5 mag. If the localization area is larger than 1000 deg 2 , we will follow only those events with median distances < 150 Mpc down to a depth of 19.5 mag.
In Figure 10, we also plot the distances and areas enclosing 90% localization probabilities from the skymaps of the 15 events from one realization of our realistic BNS rate simulation. For an r-band search, seven of the 15 events have 90% areas lying in the f det > 0.1 region. In the J band, eight events lie in this region.

WINTER GW follow-up strategy
WINTER is a dedicated instrument for follow-up of slowly fading infrared kilonovae and can spend weeks searching for a single event. However, even though longer searches cover larger fractions of the skymaps, they risk covering the high-probability areas less efficiently or observing the kilonova once it has already faded. Based on the results from simulating WINTER observations with the Kasen and Bulla models, WINTER will dedicate up to seven nights of searching for each event. WINTER will observe compelling transients until they fade, and if there are no candidate kilonovae in the data, will stop searching after seven nights.
Furthermore, WINTER will not follow up all BNS GW alerts. Figure 10 provides a prescription to select which GW triggers are worth following up with WINTER during O4 based on information that is available at the time of the trigger. We will only follow events that have median distance estimates and localization areas such that the chance of detecting a kilonova is at least 10% (i.e., lying within the f det > 0.1 region of Figure 10). If the localization areas are larger and the events are nearer, we will reduce our exposure times to tile the localizations faster and increase the chances of detecting a kilonova. Figure 11 shows the f det = 0.1 contours for searches with depths of 21 mag (t exp = 450 s), 20.5 mag (t exp = 180 s) and 19.5 mag (t exp = 40 s) as a function of the localization area that can be searched with WINTER. With WINTER, we will follow up all events with localization areas < 300 deg 2 down to a depth of 21 magnitudes, matching WINTER's J-band reference images. For events with localization areas larger than 300 deg 2 and distances < 250 Mpc, we will reduce exposure time to 180 seconds (i.e. to a depth of 20.5 magnitudes). With 180 second exposures, we can tile areas as large as 1000 deg 2 while maintaining f det = 0.1. For events with localization areas larger than 1000 deg 2 , we will follow only those events that are closer than 150 Mpc to a depth of J AB = 19.5 magnitudes. If there are multiple gravitational-wave triggers of interest on the same night (as discussed in Table 1), we will prioritize events that have a higher f det and are easier to observe with WINTER.
Finally, we have assumed that the areas mentioned above enclose all of the BNS localization probability. If the full skymap areas are larger than the limits mentioned above, we will cover only those events where the area enclosing 50% of the localization probability can be tiled with f det > 0.2. We also note that in Figure  10, the detection fraction f det drops at very early search times (t < 0.5 day), as the kilonova is still brightening in our models at these times. However, we will observe events with localization areas that can be tiled within 0.5 day, observing them repeatedly until the kilonova becomes bright enough to be detected. Similar detectability constraints were derived by ; their Figure 17) using the Los Alamos National Laboratory (LANL, Wollaeger et al. 2021) grid of kilonova models. The constraints presented in Figure 10 are broadly consistent with their constraints, with minor variations attributable to the differences in the underlying Bulla and LANL models (see, for example, the differences in the analyses of Anand et al. (2020) and Thakur et al. (2020), Dichiara et al. (2021)).
The methods described above can be used by other surveys to select GW triggers for follow-up during O4. We include a python notebook with the code to reproduce Figures 9-11 in the Zenodo repository at Frostig et al. (2021).

The realities of electromagnetic follow-up observing
The methods outlined in Section 2.3.2 represent a simplified approach to follow-up observations, where one observing strategy is decided at the outset and followed unchangingly throughout the campaign. In reality, followup observing is an iterative process where new decisions are made nightly or even multiple times per night. Instead of solely following one search strategy as shown in the above simulations, we will prioritize repeat observations of compelling transients found in the follow-up data from WINTER and other telescopes. In simulations, WINTER observes at the zenith to a limiting magnitude of J AB = 21 in 450 seconds, J AB = 21.8 in 30 minutes, and J AB = 22.7 in 3 hours, with many repeat observations allowing for more significant constraints on kilonova detection.
Furthermore, if another telescope discovers a new kilonova candidate, infrared localizations, not just discoveries, add unique data to kilonova science. A nondetection in WINTER of an optical kilonova constrains both the dynamical and wind ejecta masses . Additionally, faint detections (SNR EM < 5) or single images of the event in the near infrared contribute to models of chemical evolution, ejecta mass, and wind speeds (Barnes et al. 2016;Coughlin et al. 2020c;Kasen et al. 2013).
However, with no confirmed discoveries from other telescopes, there can be thousands of candidate transient events to sort through in a week of near infrared survey data. There are many tools available for classifying transients in survey data with machine-learning algorithms becoming a standard tool in the field, particularly for wide-field optical surveys, such as ZTF and the Vera C. Carefully planning the color and cadence of WINTER follow-up observations can assist these transient classification techniques in narrowing down the number of candidate events. For example, the characteristic ∼ 1 week fading time distinguishes a kilonova lightcurve from longer-lasting supernovae or short-lived asteroids (Cowperthwaite & Berger 2015). Additionally, observations in at least two filters assist in studying the reddening of the transient over time, a distinctive feature of kilonovae seen in GW170817 (Arcavi et al. 2017;Pian et al. 2017;Smartt et al. 2017). For optical telescopes, the i band provides the reddest images, and models predict kilonova discovery is maximized with the g -i filter pair for LSST (Andreoni et al. 2019) and a g, r, i filter cycle for ZTF (Almualla et al. 2021). In the simulation described in Section 2.3.2, we only observe in the J-band to leverage WINTER's J-band reference images and all-sky survey. In practice, we aim to conduct most of the search in J band but also follow up all interesting candidate events in the Y-band to study the Y -J color evolution. We will also study the g -J color pair, either with g-band images from ZTF follow-up or with the optical camera on WINTER's companion port. Time permitting, WIN-TER and its counterpart optical camera will also observe candidate events in the u, r, i, and H filters.

CONCLUSION
The BNS merger GW170817 brought about a new field of multimessenger astronomy, but despite extensive follow-up campaigns in O3, we have not observed a second multimessenger kilonova. In this study, we show infrared observations are a promising avenue for kilonova discovery, particularly for lanthanide-rich "red" kilonovae, as these are detectable to larger distances in the infrared than at optical wavelengths. We predict that infrared follow-up of GW triggers with WINTER could discover up to ten new kilonovae per year during O4. Furthermore, by employing more targeted follow-up strategies than those we have simulated, we can achieve a deeper sensitivity on a subset of interesting targets, therefore enhancing our ability to confirm new discoveries.
Moreover, we limit this study to BNS mergers and leave the study of NSBH kilonovae to a future work. Infrared follow-up of NSBH kilonovae is especially promising, as they are brighter in the infrared compared to BNS kilonovae Fernández et al. 2017;Zhu et al. 2020), and observing both event types has the potential to increase the number of kilonovae discovered each year. Even just one new electromagnetic observation of a kilonova in O4 will double the number of known multimessenger kilonovae, helping to answer ongoing questions in the study of the Hubble tension, the neutron star equation of state, and r-process nucleosynthesis.