Abstract
The mergers of neutron star–neutron star binaries and neutron star–black hole binaries (NSBHs) are the most promising gravitational wave (GW) events with electromagnetic (EM) counterparts. The rapid detection, localization, and simultaneous multimessenger follow-up of these sources are of primary importance in the upcoming science runs of the LIGO-Virgo-KAGRA Collaboration. While prompt EM counterparts during binary mergers can last less than 2 s, the timescales of existing localization methods that use Bayesian techniques vary from seconds to days. In this paper, we propose the first deep learning–based approach for rapid and accurate sky localization of all types of binary coalescences, including, for the first time, neutron star–neutron star and NSBH binaries. Specifically, we train and test a normalizing flow model on matched-filtering output from GW searches to obtain sky direction posteriors in around 1 s using a single P100 GPU, which is several orders of magnitude faster than full Bayesian techniques.

Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
In 2015, the first direct detection of a gravitational-wave (GW) signal from a merging binary black hole (BBH; Abbott et al. 2016) system was made by the LIGO Scientific Collaboration (Aasi et al. 2015) and the Virgo Collaboration (Acernese et al. 2014). Since then, the total number of GW detections has grown, at the time of writing, to 90 (The LIGO Scientific Collaboration et al. 2021), and a new detector, KAGRA (Akutsu et al. 2019), has joined the GW search effort. These 90 detections include two confirmed signals from merging binary neutron stars (BNS) and at least four from mergers of neutron star–black hole (NSBH) binaries (The LIGO Scientific Collaboration et al. 2021). The first BNS detection in 2017 also involved a coincident observation of a short gamma-ray burst (GRB) originating from the same source (Abbott et al. 2017a, 2017b, 2017c), making it the first multi-messenger observation involving both GW and electromagnetic (EM) signals.
The observations of these EM counterparts, including the early kilonova lightcurve, and several other possible emissions in the optical, ultraviolet, X-ray, and radio can enrich our understanding of the physics of compact objects, constrain the neutron star equation of state (Metzger & Piro 2014), and reveal the processes governing formation and emission of shocks and ejecta (Metzger 2017). In GW data analysis, a full Bayesian framework, involving Markov Chain Monte Carlo (MCMC) and nested sampling, contained within the codes BILBY (Ashton et al. 2019) and LALInference (Veitch et al. 2015) has been traditionally used to estimate optimal source parameters of detected GW events (The LIGO Scientific Collaboration et al. 2021). These tools are, however, very computationally intensive and time-consuming, since they probe the full 15-dimensional parameter space that typically characterizes GW signals. The rapid, Bayesian, non-MCMC code BAYESTAR (Singer & Price 2016) can generate posterior distributions of R.A. (α), decl. (δ), and luminosity distance in a few seconds.
Other fast parameter estimation methods use a heterodyned likelihood (Cornish 2021), iterative fitting methods as used in the RIFT algorithm (Lange et al. 2018), and an excess power approach (Tsutsui et al. 2021). What all of these methods have in common is that they are approximations of Bayes's theorem

$$p(\theta \mid d) = \frac{p(d \mid \theta)\, p(\theta)}{p(d)}, \tag{1}$$

where p(d∣θ) is the likelihood distribution of the observed strain data, assumed to consist of a GW signal h(θ) plus stationary Gaussian noise n, p(θ) represents our prior belief about the distribution of the source parameters, and p(d) is the evidence.
On the other hand, deep learning techniques, in particular, neural posterior estimation (NPE; Papamakarios & Murray 2016; Lueckmann et al. 2017; Greenberg et al. 2019) models have proven to be efficient at estimating accurate posterior distribution of BBH source parameters (Chua & Vallisneri 2020; Dax et al. 2021; Gabbard et al. 2022), enabling sky direction inference at much faster speeds than conventional sampling algorithms like MCMC. Previous applications of deep learning for BBH sky localization involved the use of a classification approach in which a neural network was trained to learn handcrafted features from raw GW strain data and associate those features with a particular pixel in the sky in which the source may be located (Chatterjee et al. 2019). The prediction of the neural network is a probability distribution of the source's sky direction over those pixels. While this approach showed promising results for simulated BBH signals over 2048 sky pixels, it does not scale well for any arbitrary number of pixels, leading to, in some cases, poor localization due to low angular resolution.
Moreover, there have been no published results, at the time of writing, on deep learning–based posterior estimation of BNS and NSBH sources, which are the primary compact binary coalescence (CBC) candidates with potential EM counterparts. The main challenge involving the use of deep learning models for BNS and NSBH parameter estimation is that the GW waveforms of these sources are much longer than those of BBHs and can last up to 1000 s at detector design sensitivity. This means that for the same overall signal-to-noise ratio (S/N), BNS and NSBH GW signals will have lower signal strength in each individual time segment, compared to BBH signals. This makes feature extraction using deep learning models for BNS and NSBH strains extremely challenging and is one of the possible reasons why no parameter estimation results on BNS and NSBH using deep learning have been demonstrated yet.
In this paper, we introduce GW-SkyLocator, the first deep learning model for sky localization of all compact binaries (BBH, BNS, and NSBH) at orders of magnitude faster speeds than standard Bayesian techniques. We have benchmarked the performance of our model against BAYESTAR and other standard samplers like BILBY and LALInference using both real GW events and simulated injections in Gaussian noise. This paper is organized as follows. In Section 2, we discuss the novelty of our deep learning–based sky localization method; in Section 3, we provide the details of our data generation process and describe our neural network architecture. In Section 4, we compare the performance of GW-SkyLocator against BAYESTAR, BILBY, and LALInference for simulated injections and real events. We summarize our findings in Section 5 and discuss plans for future work.
2. Methods
Previous deep learning–based methods for parameter estimation used the GW strain data directly for training and testing. Here, we have adopted an alternative approach, first introduced by BAYESTAR (Singer & Price 2016), in which the S/N time series data is used instead of the full GW strains for rapid sky localization. We train our model, called GW-SkyLocator, on the S/N time series data, obtained by the template-based matched filtering technique used for detecting modeled GW signals (Maggiore 2007) in existing online and offline GW search pipelines (Klimenko et al. 2016; Usman et al. 2016; Sachdev et al. 2019; Aubin et al. 2021; Chu et al. 2022). Our model also predicts the posterior distribution of the sky location over any arbitrary number of sky pixels, and supports the generation of multi-order sky maps, thus alleviating the problems associated with the approach in Chatterjee et al. (2019). The S/N time series of a GW signal can be thought of as a condensed version of the GW strain data, in which all information about the GW source sky direction is preserved in its amplitude and phase (Hooper 2013), with a typical length of 0.1 s around the merger time. The top panel in Figure 1 shows an example BNS waveform and the corresponding strain obtained after injecting the signal in stationary Gaussian noise with advanced LIGO power spectral density (PSD). The bottom panel shows the S/N time series obtained by matched filtering the strain data with the signal template. The time difference of the S/N peaks between pairs of detectors represents the arrival time delays of the signals, which is essential information for source localization based on the triangulation method (Wen & Chen 2010). Since the sensitivity of each detector to incoming GW signals is a function of the sky direction of the source, the relative amplitudes and phases of the S/N time series at peak, which depend on the interferometers' directional response to GWs, further help constrain the possible sky directions (Hooper 2013). The S/N time series is also readily available from matched filtering–based GW search pipelines, which means we can directly apply our model in conjunction with an online GW detection pipeline to perform rapid sky localization immediately after detection.
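To make this geometric information concrete, the short sketch below (an illustration, not part of the GW-SkyLocator pipeline) uses PyCBC's detector utilities to evaluate the H1–L1 arrival-time delay and the H1 antenna response for one trial sky position; the sky coordinates, polarization angle, and GPS time are placeholder values.

```python
# Illustrative only: the time delay and antenna response that encode
# sky-direction information in the S/N time series, for one trial sky position.
from pycbc.detector import Detector

ra, dec, t_gps = 3.45, -0.41, 1187008882.4   # placeholder sky position and time
h1, l1 = Detector("H1"), Detector("L1")

# Arrival-time delay between H1 and L1 (seconds) for a source at (ra, dec).
dt_hl = h1.time_delay_from_detector(l1, ra, dec, t_gps)

# Directional response of H1 to the plus and cross polarizations.
fp, fc = h1.antenna_pattern(ra, dec, 0.0, t_gps)

print(f"H1-L1 delay = {1e3 * dt_hl:.2f} ms, F+ = {fp:.2f}, Fx = {fc:.2f}")
```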
Figure 1. A BNS waveform, the strain signal obtained after injecting the waveform in noise, and the corresponding S/N time series obtained by matched filtering the strain with the optimal waveform template are shown. Top panel: the whitened GW strain signal corresponding to a BNS merger is plotted in blue. These data were obtained after injecting the GW waveform (rescaled amplitude shown in orange) into stationary Gaussian noise colored by advanced LIGO PSD. Bottom panel: the optimal S/N time series obtained by matched filtering the GW strain with the signal template shown in the above panel.
Matched filtering involves cross-correlation of the GW strain data, d, with waveform templates h(θ) that depend on the intrinsic source parameters θ, i.e., masses and spins. The output of matched filtering is the S/N time series ρ(t), defined as (Hooper 2013)

$$\rho(t) = \frac{|z(t)|}{\sigma}, \tag{2}$$

where z(t) is the complex matched filter defined as (Hooper 2013)

$$z(t) = 4\int_{0}^{\infty} \frac{\tilde{d}(f)\left[\tilde{h}_c(f) + i\,\tilde{h}_s(f)\right]^{*}}{S_n(f)}\, e^{2\pi i f t}\, df. \tag{3}$$

Here $\tilde{d}(f)$ is the Fourier transform of the GW strain, $\tilde{h}_c(f)$ and $\tilde{h}_s(f)$ are the Fourier transforms of the cosine and sine components of the templates used for matched filtering, and $S_n(f)$ is the PSD of the noise. σ is the standard deviation of the real part of the matched filter, which is used to normalize ρ(t).
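As a concrete illustration of Equations (2) and (3), the sketch below uses PyCBC to inject a BNS-like waveform into Gaussian noise colored by the aLIGO design PSD and compute the complex S/N time series with the same (optimal) template; the waveform parameters, data duration, and cropping window are assumptions for the example, not the exact settings of our pipeline.

```python
# A minimal sketch (assumed settings, not the authors' exact pipeline):
# inject a BNS-like signal into Gaussian noise and compute rho(t) = |z(t)|/sigma.
import numpy as np
from pycbc.waveform import get_td_waveform
from pycbc.psd import aLIGOZeroDetHighPower
from pycbc.noise import noise_from_psd
from pycbc.filter import matched_filter

f_lower, sample_rate, duration = 20.0, 2048, 256      # assumed values
delta_t, tlen = 1.0 / sample_rate, duration * sample_rate
psd = aLIGOZeroDetHighPower(tlen // 2 + 1, 1.0 / duration, f_lower)

# "Optimal" template: same parameters as the injection.
hp, _ = get_td_waveform(approximant="SpinTaylorT4", mass1=1.4, mass2=1.4,
                        delta_t=delta_t, f_lower=f_lower)
hp.resize(tlen)

strain = noise_from_psd(tlen, delta_t, psd, seed=0)
strain.data += hp.cyclic_time_shift(-hp.duration / 2).data   # crude injection

# Complex matched-filter output; |snr| peaks near the injected merger time.
snr = matched_filter(hp, strain, psd=psd, low_frequency_cutoff=f_lower)

# Keep a 0.2 s stretch of the complex S/N time series around the peak.
peak = int(np.argmax(np.abs(snr.numpy())))
half = int(0.1 * sample_rate)
rho_opt = snr[peak - half: peak + half]
```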
In this work, we train our model on the optimal S/N time series data, $\rho_{\rm opt}^{i}(t)$, and the intrinsic parameters, $\hat{\theta}_{\rm in}^{i}$, of the matched filter used to generate $\rho_{\rm opt}^{i}(t)$. Here the index i refers to the ith training sample. This matched filter, having the same source parameters as the injection, optimally extracts the signal from the noise with the highest S/N. Matched filtering with nonoptimal templates, i.e., templates having different parameters compared to the injection, produces lower S/N output. During testing, we make direct comparisons of our model with BAYESTAR. We first generate 2000 injections each for BBH, BNS, and NSBH sources using the parameters described in the Methods section (Section 2), and then run a simulated matched filtering pipeline to generate triggers. Samples having S/N > 4 in at least two detectors and a total network S/N between 9 and 40 are then picked for inference using GW-SkyLocator and BAYESTAR, and the results are compared. In the future, we plan to run an end-to-end GW search pipeline on both Gaussian and real noise injections and test our model's performance on the recovered events. These investigations will directly inform our method's scalability and robustness for online sky localization in future LVK observation runs.
The exact relation between ρ_opt and sky coordinates α and δ is given in Equation (7.181) in Maggiore (2007). Since our model is trained and tested on very short (0.2 s long) ρ_opt(t) data, we are able to obtain highly accurate localization results for all CBC sources with a relatively simple neural network architecture, compared to deeper and more sophisticated neural networks used in similar analyses (Dax et al. 2021; Gabbard et al. 2022). The use of short S/N time series data instead of strains for training and inference was the crucial reason for the success of our model on BNS and NSBH events, since machine learning models usually struggle to extract features from hundreds of seconds long input data, even after robust feature engineering, data compression, or the use of extremely deep neural networks.
Our model consists of three main components: (i) a ResNet34 network (He et al. 2015) to extract essential features from the complex S/N time series data $\rho_{\rm opt}^{i}(t)$, (ii) a Normalizing Flow (Rezende & Mohamed 2015) model, in particular, a Masked Autoregressive Flow (MAF) network (Kingma et al. 2016; Papamakarios et al. 2017), which generates the posterior distribution of α and δ given the input data, and (iii) a multilayered fully connected neural network that extracts features from the intrinsic parameters, $\hat{\theta}_{\rm in}^{i}$. The features extracted from the ResNet34 and fully connected networks are combined and passed as a conditional input to the MAF. We condition the Normalizing Flow model on the intrinsic parameters of the templates, in addition to the S/N time series, in order to provide additional context to our network during training and inference. This is particularly motivated by BAYESTAR's algorithm, which achieves a much higher inference speed-up over parameter estimation codes like LALInference by fixing the intrinsic parameters to values near the peak of the likelihood, rather than marginalizing over them, as is done in offline parameter estimation. We found that including the intrinsic parameter information helped the model constrain the sky position posterior better, as the intrinsic parameters encode information about the shape of the S/N time series peak around the merger. While testing our model on real GW events, as discussed in later sections, we use the intrinsic parameters of the best-matched templates that produce the highest S/N output, as for such cases the exact intrinsic parameters are unknown.
A Normalizing Flow is a chain of invertible transformations that maps a simple base distribution, like a multivariate Gaussian, to a more complex target distribution. During training, the model learns to map samples from the base distribution into the target posterior samples. If z is a random sample drawn from the base distribution π(z) and f is the transformation induced by the Normalizing Flow, then the new random variable is x = f(z). Since f is invertible, we can write z = f^{-1}(x). The probability distribution of the new random variable p(x) is therefore given by the change of variables formula for probability distributions

$$p(x) = \pi\!\left(f^{-1}(x)\right)\left|\det\left(\frac{\partial f^{-1}}{\partial x}\right)\right|. \tag{4}$$
During testing, the trained network can not only generate samples from the target distribution in milliseconds, but also directly evaluate the probability density of the data.
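The change of variables formula in Equation (4) can be illustrated with a toy one-dimensional flow; the affine map below is a stand-in for the learned MAF, not part of GW-SkyLocator.

```python
# Toy illustration of Equation (4) with an invertible affine map x = f(z) = a*z + b.
import numpy as np
from scipy.stats import norm

a, b = 2.0, 1.0
f = lambda z: a * z + b
f_inv = lambda x: (x - b) / a
log_det_jac_inv = -np.log(abs(a))          # log |d f^{-1} / dx|

# Density of x via the change of variables formula; it integrates to ~1.
x = np.linspace(-9.0, 11.0, 4001)
log_p_x = norm.logpdf(f_inv(x)) + log_det_jac_inv
print(np.trapz(np.exp(log_p_x), x))        # ~1.0

# Sampling direction: draw z from the base distribution and push it through f.
samples = f(np.random.default_rng(0).standard_normal(5000))
```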
We use a multivariate Gaussian as the base distribution, and the MAF as the transformation f, to obtain our target posterior $p_\phi(\alpha, \delta \mid \rho_{\rm opt})$. Since our target distribution is a conditional probability distribution, the transformation f is conditioned on features extracted from $\rho_{\rm opt}^{i}(t)$ and $\hat{\theta}_{\rm in}^{i}$ using the ResNet34 and fully connected networks. The architectures of these networks are discussed in the Methods section.
The MAF models a joint probability density by decomposing it into a product of one-dimensional conditional densities. This is implemented efficiently using a fully connected neural network with a special architecture called Masked Autoencoder for Distribution Estimation (MADE; Germain et al. 2015). The characteristic of the MADE network is that the connections between several neurons in the network are masked or set to zero. This preserves the autoregressive property of the network, which is necessary for correctly modeling the one-dimensional conditional densities at the output layer.
Figure 2 shows the workflow of GW-SkyLocator. The MAF takes as input random samples (z) drawn from our base distribution, which we choose to be a multivariate Gaussian, and transforms them into samples of our target distribution. The α parameter is cyclic, i.e., for a fixed δ, α = 0 and α = 2π point to the same location in the sky. In order to make our model learn this cyclical nature of α, and prevent it from mistaking α = 0 with α = 2π and vice versa, we decompose α into αx and αy, equal to the cosine and sine components of α, respectively. Therefore, instead of predicting α and δ directly, we predict αx, αy, and δ. We found that this trick improved the performance of the model on injection samples located at α = 0 and 2π. The final α posterior is obtained by taking the arc tangent of αy and αx. The normalizing flow model then scales and shifts the z samples to obtain p_φ(α, δ∣ρ_opt), the model's approximation of p_true(α, δ∣ρ_opt).
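A minimal sketch of this cyclic encoding is shown below; the function names are ours, for illustration only.

```python
# Encode right ascension as (cos alpha, sin alpha) and recover it with arctan2,
# so that alpha = 0 and alpha = 2*pi map to the same training target.
import numpy as np

def encode_ra(alpha):
    return np.cos(alpha), np.sin(alpha)

def decode_ra(alpha_x, alpha_y):
    return np.mod(np.arctan2(alpha_y, alpha_x), 2.0 * np.pi)   # wrap to [0, 2*pi)

a1 = decode_ra(*encode_ra(0.0))
a2 = decode_ra(*encode_ra(2.0 * np.pi))
assert np.isclose(np.cos(a1 - a2), 1.0)   # the two encodings give the same sky point
```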
Figure 2. Architecture of our model, GW-SkyLocator. The input data, consisting of the S/N time series, $\rho_{\rm opt}^{i}(t)$, and the intrinsic parameters, $\hat{\theta}_{\rm in}^{i}$, are provided to the network through the ResNet34 channel (only one ResNet block is shown here) and the multilayered fully connected (dense) network channel, respectively. The features extracted from $\rho_{\rm opt}^{i}(t)$ and $\hat{\theta}_{\rm in}^{i}$ are then combined and provided as conditional input to the MAF, denoted by f(z).
Training a neural network involves minimizing an objective or loss function, which for our case is the expectation over ρ(t) of the cross entropy between the true posterior distribution $p(\alpha, \delta \mid \rho)$ and the prediction of our model, $p_\phi(\alpha, \delta \mid \rho)$. This is given by

$$L = -\int p(\rho)\, d\rho \int p(\alpha, \delta \mid \rho) \log p_\phi(\alpha, \delta \mid \rho)\, d\alpha\, d\delta, \tag{5}$$

where $p(\rho) = \int p(\rho \mid \alpha, \delta)\, p(\alpha, \delta)\, d\alpha\, d\delta$.
Using Bayes's theorem and Monte Carlo approximation, we obtain the following approximate expression for the loss function:

$$L \approx -\frac{1}{N}\sum_{i=1}^{N} \log p_\phi\!\left(\alpha^{(i)}, \delta^{(i)} \mid \rho^{(i)}\right), \tag{6}$$

where N is the size of a mini-batch or subset of the full training set. Equation (6) shows that the optimization step is free from costly likelihood and posterior evaluations. Instead, we only need to draw samples from the likelihood, which is readily available as S/N time series outputs from online GW search pipelines (Klimenko et al. 2016; Usman et al. 2016; Sachdev et al. 2019; Aubin et al. 2021; Chu et al. 2022). The optimization of L is done using stochastic gradient descent, specifically using the Adam optimizer (Kingma & Ba 2017).
Under the transformation given in Equation (4), the loss function, Equation (6), becomes

$$L \approx \frac{1}{N}\sum_{i=1}^{N}\left[\frac{1}{2}\left\|f^{-1}\!\left(x^{(i)}\right)\right\|^{2} + \frac{n}{2}\log(2\pi) - \log\left|\det\left(\frac{\partial f^{-1}}{\partial x}\right)\right|_{x^{(i)}}\right], \tag{7}$$

where $x^{(i)}$ denotes the labels of the ith training sample and n refers to the number of parameters/labels being predicted.
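A schematic NumPy implementation of this objective for a single affine autoregressive transform is given below; mu and alpha stand for the network outputs described in Section 3.1, and the function name is ours.

```python
# Hedged sketch of the mini-batch loss in Equation (7) for one affine
# autoregressive transform with a standard-normal base distribution.
import numpy as np

def flow_nll(x, mu, alpha):
    """x, mu, alpha: arrays of shape (N, n); returns the scalar loss L."""
    u = (x - mu) * np.exp(-alpha)                 # inverse pass u = f^{-1}(x)
    log_det_inv = -np.sum(alpha, axis=1)          # log |det(d f^{-1} / dx)|
    n = x.shape[1]
    log_base = -0.5 * np.sum(u ** 2, axis=1) - 0.5 * n * np.log(2.0 * np.pi)
    return -np.mean(log_base + log_det_inv)       # Equation (7)
```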
3. Data Generation
We trained and tested GW-SkyLocator on 5 × 10⁵ and 2000 injections, respectively, for each of the three CBC sources (BBH, BNS, and NSBH). For each injection, the GW strain was generated by injecting a simulated waveform in stationary Gaussian noise with the advanced LIGO Zero Detuned High Power PSD. Matched filtering was then performed on the strains to obtain the S/N time series. For the training set, $(\rho_{\rm opt}(t), \hat{\theta}_{\rm in})$, matched filtering was done using the optimal filters, as explained earlier. The details of the prior distributions of the injection parameters of our training and test sets are discussed in the Methods section (Section 2). For each CBC class, the same network architecture was used, which shows the robustness of our approach. During inference, we generate 5000 α and δ samples from the trained model and evaluate a Gaussian kernel density estimator (KDE) over the samples with a fixed bandwidth of 0.02. The trained KDE is then used to evaluate the probability density over sky pixels using the adaptive sampling scheme employed by BAYESTAR, in which the posterior is first evaluated over a coarse HEALPix grid with nside = 16 (Górski et al. 2005), corresponding to a pixel area of 13.4 deg². The highest-probability pixels are then adaptively subdivided into smaller pixels over which the posterior evaluation is repeated. This process is repeated seven times to obtain the final multi-order sky map for each test sample. We then run BAYESTAR on these injections and compute the areas of different credible intervals for comparison.
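A simplified sketch of this inference step, without the adaptive refinement, is shown below; it assumes healpy and SciPy, uses a single fixed nside for clarity, and the scalar passed to gaussian_kde is SciPy's bandwidth factor, so it is only a stand-in for the fixed bandwidth quoted above.

```python
# Simplified sketch: fit a Gaussian KDE to (alpha, delta) posterior samples and
# evaluate it on a coarse nside=16 HEALPix grid (the highest-probability pixels
# would then be subdivided adaptively, as described in the text).
import numpy as np
import healpy as hp
from scipy.stats import gaussian_kde

def skymap_from_samples(ra_samples, dec_samples, nside=16, bw=0.02):
    kde = gaussian_kde(np.vstack([ra_samples, dec_samples]), bw_method=bw)
    npix = hp.nside2npix(nside)                          # 3072 pixels at nside=16
    theta, phi = hp.pix2ang(nside, np.arange(npix), nest=True)
    ra_pix, dec_pix = phi, 0.5 * np.pi - theta           # HEALPix angles -> (ra, dec)
    prob = kde(np.vstack([ra_pix, dec_pix]))
    return prob / prob.sum()                             # per-pixel probabilities
```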
Training GW-SkyLocator takes ∼1.5 hr on a P100 GPU. Generating 5000 samples from the trained model during inference takes a few milliseconds on the same computational resource. The total inference time, including the generation of samples, fitting the KDE, and evaluating the probability density over the adaptive mesh, is ∼1.2 s on average.
The injected signals used in this analysis are generated using the LALSuite package (LIGO Scientific Collaboration 2018). For BBHs, the source-frame primary mass distribution is drawn from a Salpeter initial mass function (IMF) distribution (Salpeter 1955), with values lying between 10 and 80 M⊙, and the secondary is drawn from a uniform distribution between 10 M⊙ and the value of the primary mass. For BNS, both components are drawn from a uniform distribution between 1.0 and 2.5 M⊙. For NSBH, the neutron star mass is drawn from a uniform distribution between 1.0 and 2.5 M⊙, and the black hole is drawn from the Salpeter IMF between 10 and 80 M⊙. For the test sets, the BBH injections are simulated with uniform distribution in comoving volume up to a redshift of 2.5, while for BNS and NSBH, the injections are uniform in redshifts up to 0.35 and 0.25, respectively. These distributions are similar to what has been used by the LVK Collaboration for analysis of both simulated and O3 data, but with a narrower black hole mass range between 10 and 80 M⊙ (Messick et al. 2017; Sachdev et al. 2020; The LIGO Scientific Collaboration et al. 2021; Kovalam et al. 2022).
We performed matched filtering on the simulated strains to obtain the S/N time series, and discarded samples with network S/N < 8 or > 40. While signals with a network S/N < 8 are not considered significant detections, ones with S/N > 40 were discarded because such loud events are expected to be extremely rare, even at the design sensitivities of the ground-based GW interferometers. In a future work, we plan to train different versions of GW-SkyLocator with signals having different S/N ranges, such as 8–20, 20–40, 40–60, and so on. This strategy is expected to yield better results than what we may obtain from training a single model, since the variance of the S/N distribution in each individual training set is smaller than in a single training set covering the entire S/N range.
For generating the training set, instead of simulating signals with similar redshift distribution as the test sets, we adopt a uniform network S/N distribution between 8 and 40 for all three cases. This is done to ensure there are a sufficient number of training samples in each S/N bin and therefore avoid potential biases in the network's performance arising from training our model with a nonuniform S/N distribution. We use the waveform approximant SEOBNRv4 (Bohé et al. 2017) for BBH samples and SpinTaylorT4 (Sturani et al. 2010) for BNS and NSBH. All the waveforms were simulated with a fixed coalescence GPS time of 1187008882.4. The generated waveforms are then injected into Gaussian noise and the resultant strains, sampled at 2048 Hz, are whitened with an estimate of the PSD made using the Welch median average method (Welch 1967) on simulated data. For generating the training samples, after whitening, we rescale the amplitudes of the strains to match a desired network S/N. In order to do that, we first calculate the network optimal matched filtering S/N (NOMF–S/N) of the whitened waveforms using the formula

$$\rho_{\rm NOMF} = \sqrt{\sum_{k} \langle h^{\rm w}_{k},\, h^{\rm w}_{k} \rangle},$$

where $h^{\rm w}_{k}$ is the whitened waveform in detector k and ⟨· , ·⟩ denotes the noise-weighted inner product, which for whitened data reduces to a time-domain inner product.
We then use the NOMF–S/N to calculate a scaling factor that we use to rescale the strain amplitudes to match the injection S/N drawn from the prior network S/N distribution. After rescaling the strains, matched filtering is performed using the optimal templates to obtain ρ_opt(t). We crop ρ_opt(t) for all CBC sources to 0.2 s around the merger peak. Sample generation and matched filtering were implemented with a modified version of the code developed by Gebhard et al. (2019) that uses the PyCBC software (Nitz et al. 2021). GW-SkyLocator was written in TensorFlow 2.4 (Abadi et al. 2016) using the Python language.
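A hedged sketch of this rescaling step is given below; the quadrature-sum form of the NOMF–S/N and the whitening normalization are assumptions for the illustration, and the function names are ours.

```python
# Sketch of the amplitude rescaling: compute the network optimal matched-filtering
# S/N from the whitened waveforms and scale them to a target network S/N.
import numpy as np

def nomf_snr(whitened):
    """whitened: dict mapping detector name -> whitened waveform (1D array)."""
    return np.sqrt(sum(np.sum(h ** 2) for h in whitened.values()))

def rescale_to_network_snr(whitened, target_snr):
    scale = target_snr / nomf_snr(whitened)
    return {det: scale * h for det, h in whitened.items()}
```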
3.1. MAF
In this section, we review MAFs and the autoregressive density estimation technique. The MAF is a type of normalizing flow that maps a simple base distribution to a more complex distribution. In autoregressive density estimation techniques, a model learns a complex joint density by decomposing it into a product of one-dimensional conditional densities as follows (Kingma et al. 2016; Papamakarios 2017):

$$p(x) = \prod_{i=1}^{n} p\!\left(x_i \mid x_{1:i-1}\right).$$

In autoregressive models, these conditional densities are usually parameterized as a univariate Gaussian. The means and standard deviations of the Gaussians are predicted by neural networks that depend on the previous $x_{1:i-1}$:

$$p\!\left(x_i \mid x_{1:i-1}\right) = \mathcal{N}\!\left(x_i \mid \mu_i, (\exp \alpha_i)^2\right),$$

where

$$\mu_i = f_{\mu_i}\!\left(x_{1:i-1}\right)$$

and

$$\alpha_i = f_{\alpha_i}\!\left(x_{1:i-1}\right)$$

for i = 1,...,n. $f_{\mu_i}$ and $f_{\alpha_i}$ are the transformations induced by the MAF. If we sample $u_i \sim \mathcal{N}(0, 1)$, then

$$x_i = u_i \exp(\alpha_i) + \mu_i$$

defines a sample from p(x). This defines a transformation of the base distribution, which is a univariate normal, to the target distribution p(x). We can stack multiple autoregressive transformations to create a sufficiently expressive flow. It is easy to compute the inverse transformation u = f^{-1}(x) since it only requires us to reverse the scale and shift as follows:

$$u_i = \left(x_i - \mu_i\right)\exp(-\alpha_i).$$

Therefore, the forward and inverse pass of the flow both only require the forward evaluation of $f_{\mu}$ and $f_{\alpha}$. The Jacobian in Equation (7) can be calculated as follows:

$$\left|\det\left(\frac{\partial f^{-1}}{\partial x}\right)\right| = \exp\!\left(-\sum_{i=1}^{n} \alpha_i\right).$$
For parameterizing a Normalizing Flow, two conditions need to be fulfilled: the inverse of the transformation must exist and the Jacobian must be easy to compute. The above arguments show that the MAF satisfies both requirements necessary for a normalizing flow. The autoregressive property is implemented in neural networks by masking the weights of a fully connected neural network, i.e., by keeping only those connections which are allowed and setting the rest to zero. This architecture is called MADE, or Masked Autoencoder for Distribution Estimation. Usually, to obtain a sufficiently flexible model, a stack of several MADE networks is used. Between each pair of MADE networks, the order of the components is permuted so that if one network is unable to model a distribution well due to a poor choice of variable ordering, a subsequent network with a different combination of weights and masking is able to do it. In our MAF implementation, we also use a batch normalization layer between each MADE unit to stabilize training and prevent the performance of the model from suffering due to spurious random weight initializations.
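The masking scheme can be illustrated with a single hidden layer; in the sketch below (degrees, sizes, and weights are arbitrary) a weight is kept only when the destination degree is greater than or equal to the source degree (strictly greater at the output), which enforces the autoregressive property.

```python
# Toy MADE-style masking for n = 3 outputs (e.g. alpha_x, alpha_y, delta):
# output i may depend only on inputs with degree < i.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 8

deg_in = np.arange(1, n_in + 1)                       # input degrees 1..n
deg_hid = rng.integers(1, n_in, size=n_hidden)        # hidden degrees in [1, n-1]

mask_ih = (deg_hid[:, None] >= deg_in[None, :]).astype(float)   # input -> hidden
mask_ho = (deg_in[:, None] > deg_hid[None, :]).astype(float)    # hidden -> output

W1 = rng.normal(size=(n_hidden, n_in)) * mask_ih      # masked weight matrices
W2 = rng.normal(size=(n_in, n_hidden)) * mask_ho

x = rng.normal(size=n_in)
out = W2 @ np.tanh(W1 @ x)    # out[i] depends only on x[0], ..., x[i-1]
```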
3.2. Network Architecture
In this section, we describe the architectures of the different components of GW-SkyLocator. We use a ResNet34 network (He et al. 2015) and a multilayered fully connected network to extract features from the complex S/N time series data and the intrinsic parameters of the templates used for generating them using matched filtering. The ResNet34 is a deep network with 34 layers, which has a special feature called skip connections. A skip connection connects the output of one layer to another by skipping some layers in between. These layers, together with the skip connection, form a residual block. ResNets are made by stacking these residual blocks together. The advantage of using several skip connections is that the network starts to learn the features of the input data right from the start of training, even if some layers have not started learning, i.e., their weights are close to the values they were initialized with (usually close to zero). The skip connections make it possible for the signal to flow through the whole network and prevent training degradation and the vanishing gradients problem usually observed in very deep neural networks.
The first two layers of the ResNet34 network are 2D convolutional layers with ReLU activation function and a batch normalization layer. We use 32 filters for the first convolutional layer with a kernel size of 7 and a stride of 2. The stride determines the length by which the receptive field of a convolutional kernel shifts over the activations of the previous layer. We then use a max-pooling layer with a pool size of 3 to reduce the input dimensions by a factor of 3. This is followed by a stack of residual units. After every few residual units, the number of filters is doubled, starting from 32. Each residual unit is comprised of two convolutional layers with batch normalization, ReLU activation, and a kernel size of 3. Overall, our ResNet34 implementation consists of 34 layers (only counting the convolutional and fully connected layers) containing three residual units with 32 filters, followed by four residual units with 64 filters, followed by six residual units with 128 filters, and then three residual units with 256 filters. The intrinsic parameters are passed through a stack of five fully connected layers with the ReLU activation function. The features from the final layers of the ResNet and fully connected networks are then combined into a 1D feature vector and passed as conditional input to the MAF. We use a stack of 10 MADE units, each consisting of five hidden layers with 256 neurons. In between each pair of MADE networks, we use batch normalization to stabilize training and produce a more flexible model. This architecture is consistent across all the three CBC sources we have considered in this study. We arrived at this architecture from extensive tests and experimentation over different network hyperparameters such as the number of residual and MADE blocks, the number of hidden layers in the MADE blocks, and the number and size of kernels in the convolutional layers. These hyperparameter choices gave us the best results for injection runs conducted in Gaussian noise with minimal overfitting during the training, validation, and inference steps.
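A hedged Keras sketch of the conditioning (embedding) networks is shown below; the input shapes, filter counts, and layer sizes are illustrative placeholders rather than the exact GW-SkyLocator configuration, and only one residual block is written out.

```python
# Illustrative embedding networks: a residual block over the S/N time series and
# a dense stack over the intrinsic parameters, concatenated into the MAF context.
import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x
    y = layers.Conv2D(filters, 3, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    if shortcut.shape[-1] != filters:                 # match channels on the skip path
        shortcut = layers.Conv2D(filters, 1, padding="same")(shortcut)
    return layers.ReLU()(layers.Add()([y, shortcut]))

# Complex S/N time series: (time samples, detectors, real/imag), assumed shape.
snr_in = layers.Input(shape=(410, 3, 2))
x = layers.Conv2D(32, 7, strides=(2, 1), padding="same", activation="relu")(snr_in)
x = layers.MaxPooling2D(pool_size=(3, 1), padding="same")(x)
x = residual_block(x, 32)
x = layers.GlobalAveragePooling2D()(x)

# Intrinsic parameters of the template (e.g. masses and spins), assumed size.
par_in = layers.Input(shape=(4,))
p = layers.Dense(64, activation="relu")(par_in)
p = layers.Dense(64, activation="relu")(p)

context = layers.Concatenate()([x, p])                # conditional input to the MAF
embedding = tf.keras.Model([snr_in, par_in], context)
```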
4. Results
4.1. Injection Run on Simulated Gaussian Noise
We discuss our model's performance on simulated GW events injected in stationary Gaussian noise with the Advanced LIGO PSD. Figures 3(a)–(f) show GW-SkyLocator's performance on BBH, BNS, and NSBH injections, respectively. The heat maps in the figures show the probability densities of the source's sky location obtained by our model. The outer (blue) and inner (green) contours corresponding to the heat maps in the figures represent the 90% and 50% credible interval regions, respectively, and the blue marker shows the true sky coordinates of the injections. We observe that the true sky locations are within our model's 90% credible interval regions. In addition, we have also plotted the 90% and 50% credible intervals (white) obtained by BAYESTAR for these injections. We notice that while our 90% credible intervals overlap with BAYESTAR's, the areas of GW-SkyLocator's 90% and 50% intervals are usually larger than BAYESTAR's, especially for high S/N injections. This is also reflected in the histograms in Figures 4(a)–(i), which show the densities of the areas of the 90% credible intervals, 50% credible intervals, and searched areas from GW-SkyLocator and BAYESTAR for 2000 BBH, BNS, and NSBH injection samples detected in coincidence by two or more detectors using matched filtering from a catalog of simulated injections.
Figure 3. Sky maps of three-detector (upper panel) and two-detector (bottom panel) BBH, BNS, and NSBH injections generated by GW-SkyLocator. The heat maps show the probability density obtained from GW-SkyLocator with the 90% and 50% credible intervals shown in blue and green, respectively. The white contours show the 90% and 50% credible regions obtained from BAYESTAR. The blue marker shows the true sky coordinates of the injections. The areas of the 90% and 50% credible regions from GW-SkyLocator are shown in the inset.
Figure 4. Upper panels (a)–(c) show densities of the areas of the 90% credible intervals, 50% credible intervals, and searched areas, respectively, for our BBH injection set, obtained using GW-SkyLocator (blue) and BAYESTAR (orange). Middle panels (d)–(f) show the densities of the 90% credible interval areas, 50% credible interval areas, and searched areas for BNS samples. Bottom panels (g)–(i) show the same for NSBH injections.
Figures 4(a), (d), and (g) show the distribution of the areas of the 90% credible intervals of the BBH, BNS, and NSBH samples, respectively, as obtained by GW-SkyLocator (blue) and BAYESTAR (orange). We observe that for all three test cases, the median of GW-SkyLocator's histograms lies close to the medians obtained by BAYESTAR. The 90% area histograms of BAYESTAR for all three test cases show two distinct peaks, while the distribution obtained by our model shows a single peak centered in between the two peaks of BAYESTAR. Therefore, on average, the areas of the 90% credible regions obtained by the two methods are consistent. We however observed that for high S/N cases (>25), our model does not reproduce the same localization accuracy as BAYESTAR, which is evident from the higher fraction of test samples with 90% area <10 deg² in BAYESTAR compared to GW-SkyLocator for all three cases. A possible reason for this is our choice of a fixed KDE bandwidth during inference to obtain the probability densities. While the LVK Collaboration uses the k-means clustering algorithm implemented in the ligo.skymap software (Singer & Price 2016) to find appropriate KDE bandwidths for each event separately, this method takes a few minutes to evaluate and therefore defeats our objective of rapid sky localization. We notice a similar trend in the distributions of the 50% credible interval areas (Figures 4(b), (e), and (h) for BBH, BNS, and NSBH, respectively), in which the histograms from GW-SkyLocator and BAYESTAR overlap for areas greater than ∼10 deg² but show a significant difference in the fraction of samples with areas less than 10 deg². The distributions of the searched areas (Figures 4(c), (f), and (i) for BBH, BNS, and NSBH, respectively) for both GW-SkyLocator and BAYESTAR, however, are more Gaussian, with BAYESTAR's median searched area for all three cases being slightly smaller than that of GW-SkyLocator.
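For reference, the sketch below shows how the credible-interval area and the searched area can be computed from a per-pixel probability map like the one produced above; it assumes a single fixed nside with nested pixel ordering, and the function names are ours.

```python
# Credible-interval area and searched area from a normalized HEALPix map `prob`.
import numpy as np
import healpy as hp

def credible_area(prob, level=0.9, nside=16):
    order = np.argsort(prob)[::-1]                    # pixels by decreasing probability
    csum = np.cumsum(prob[order])
    n_pix = int(np.searchsorted(csum, level)) + 1     # pixels needed to reach `level`
    return n_pix * hp.nside2pixarea(nside, degrees=True)

def searched_area(prob, true_ra, true_dec, nside=16):
    ipix_true = hp.ang2pix(nside, 0.5 * np.pi - true_dec, true_ra, nest=True)
    order = np.argsort(prob)[::-1]
    rank = int(np.where(order == ipix_true)[0][0]) + 1   # pixels inspected until truth
    return rank * hp.nside2pixarea(nside, degrees=True)
```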
We have investigated the effect of employing a fixed KDE bandwidth instead of ligo.skymap's k-means clustering algorithm on the overall uncertainties of the sky location estimates, as shown in Figures 5 and 6. Figure 5(a) shows a sky map for a simulated BBH event generated using BAYESTAR. Figure 5(b) shows the same event's sky map generated using GW-SkyLocator with a fixed bandwidth Gaussian KDE and Figure 5(c) shows the result from GW-SkyLocator with ligo.skymap's KDE implementation. Figures 5(d)–(f) and (g)–(i) show similar results for a simulated BNS and NSBH event, respectively. We observe that implementing KDE with k-means clustering results in significant shrinkage of the sky location uncertainty, compared to what we obtain using a KDE with fixed bandwidth. We further run an injection test using 200 BBH, BNS, and NSBH samples each and generate box and whisker plots for the 90% credible interval areas obtained using BAYESTAR, GW-SkyLocator with fixed bandwidth KDE, and GW-SkyLocator with ligo.skymap's KDE, as shown in Figure 6. The improvement in the sky direction estimates from employing k-means clustering for bandwidth selection is apparent from the locations of the lower quartiles and the lower extreme tails of the box and whisker plots. While the smallest 90% credible interval area obtained using a fixed KDE bandwidth is around 40 deg², the minimum area with the ligo.skymap KDE turns out to be less than 15 deg² for events in this injection set. Apart from the differences arising from our choice of KDE, Figures 4–6 indicate that further optimization of our networks is needed to match the overall accuracy of BAYESTAR.
Figure 5. Upper panels (a)–(c) show sky maps of a simulated BBH event obtained using BAYESTAR, GW-SkyLocator with fixed bandwidth Gaussian KDE and GW-SkyLocator with ligo.skymap's KDE, respectively. Middle panels (d)–(f) and bottom panels (g)–(i) show similar sky maps for simulated BNS and NSBH events obtained using the same methods as (a)–(c).
Figure 6. Box and whisker plots showing areas of the 90% credible intervals for a set of 200 BBH, BNS, and NSBH injections each, obtained using BAYESTAR, GW-SkyLocator with fixed bandwidth Gaussian KDE, and GW-SkyLocator with ligo.skymap's KDE.
Figures 7(a)–(c) show probability–probability (P–P) plots for GW-SkyLocator's predictions on the BBH, BNS, and NSBH injections, respectively. The P–P plots show the cumulative histograms of the true sky directions of events lying in different credible intervals. They serve as a test of the consistency of any Bayesian parameter estimation method with the frequentist interpretation (i.e., on average the true sky directions should lie within the 90% interval for 90% of the test cases). To generate the P–P plot, for each sky map, we rank the sky pixels by descending probability density and sum up their probabilities until we reach the pixel that contains the true sky location for the injection. We then plot the empirical cumulative distribution of these injections. For accurate posteriors, the cumulative distribution should be a diagonal line, i.e., it should follow the dotted lines in Figures 7(a)–(c). From the figures, we observe that our empirical distribution lies within the 95% confidence interval of the dotted lines. This is also evident from the p-values of the Kolmogorov–Smirnov (K-S) tests we performed for α and δ (see the insets in Figures 7(a)–(c)). The results of the K-S tests (p-values > 0.05) suggest that at a 5% level of significance, we cannot reject the null hypothesis that the cumulative distributions of the percentile scores of the true sky directions within their marginalized credible intervals are uniform, i.e., they follow the diagonal line. These tests suggest that our posteriors are indeed accurate.
Figure 7. Left to right: P–P plots for the BBH, BNS, and NSBH test samples, respectively. The dotted diagonal line shows the expected cumulative distribution of events within different credible intervals. The gray bands in the P–P plots show the 95% error margins arising due to statistical fluctuations. The p-values for the K-S tests of α and δ for each injection are shown in the legend.
4.2. Localization of Real GW Events
We tested our model's performance on real GW events detected during the second and third observation runs of LIGO and Virgo and compared it with published results from BAYESTAR and LALInference. In order to test on real events, we retrained our model on injections in real LIGO-Virgo data from the second and third observation runs. This was done because the design Advanced LIGO PSD, which was used for the previous analyses, is significantly different from the PSDs of the real events. Therefore, the model had to be retrained with S/N time series data obtained from similar PSDs in order to obtain accurate sky location distributions. For O3 data, we used real noise from GPS times between 1242181632 and 1242705920, and for O2 data we used noise between GPS times 1187008512 and 1187733618. The noise and the GW strain data for the real events were downloaded from the public data release available at the Gravitational Wave Open Science Center (GWOSC). We analyzed the strain data of the real events and obtained their respective S/N time series by performing matched filtering using templates with source parameters whose values were drawn from the medians of the published O2 and O3 source parameter estimation results (The LIGO Scientific Collaboration et al. 2021). It is to be noted that while our model is trained on optimal-template S/N time series, obtained by matched filtering strains with the exact injection templates, for real events the S/N time series are not optimal since the true values of the source parameters are unknown.
Figures 8(a)–(c) show sky maps for the BBH event GW200224_222234 detected during the second half of the third observation run (O3b), the first BNS event detected by LIGO and Virgo, GW170817, and the NSBH event GW200115_042309 detected during the third observation run. These events were selected in particular because their masses and distances fall within our chosen prior ranges for generating the training sets. We have also plotted the 90% and 50% credible intervals from BAYESTAR (white) and LALInference (yellow) for comparison. We notice that our predicted contours overlap well with these methods but are larger in area. A possible explanation for this, related to our choice of a fixed KDE bandwidth, has already been provided. Another explanation for this mismatch could be the difference in the PSDs between our training and test sets. While we generated training data from PSDs drawn from O2 and O3, real LIGO-Virgo data is nonstationary, and therefore the PSDs corresponding to each individual event are different. In order to make our model robust against different PSDs, it is essential to condition the Normalizing Flow on the PSD computed from strain data close to the events being tested on. This method was applied in a recent study by Dax et al. (2021) to obtain unprecedented parameter estimation accuracy for real BBH events from O1 and O2. We plan to incorporate this feature in our network in the future.
Figure 8. Sky maps of real GW events detected during O2 and O3 generated by GW-SkyLocator. The upper panel shows sky maps for the BBH and BNS events GW200224_222234 and GW170817, respectively. The bottom panel shows the sky map for the NSBH event GW200115_042309. The heat maps show the probability density obtained from GW-SkyLocator with the 90% and 50% credible intervals shown in blue and green, respectively. The yellow contour shows published sky localization results obtained from full parameter estimation software and the white contour shows the sky location estimates obtained from BAYESTAR. The areas of the 90% and 50% credible regions from GW-SkyLocator are shown in the inset.
We also notice that the shapes of the localization contours for GW200224_222234 and GW170817 are different from that of GW200115_042309. This is because the shape of the sky localization contours is primarily determined by the signal arrival time delays, and is partially modulated by the phase lag and amplitude difference between signals at individual detectors. For each pair of detectors, one can plot a ring of possible sky directions for a given time-of-arrival difference. For a three-detector network, the rings intersect at two patches in the sky that are mirror images with respect to the plane defined by the detector network. This degeneracy can be broken by considering the amplitude and phase consistency of the signals between detectors, which allows us to neglect one of the patches. For events like GW200115_042309, in which the Virgo S/N is very weak, we obtain elongated ellipses and rings in the sky, which are only partially broken by including the amplitude and phase information, as we observe in Figure 8(c). For GW170817, however, the weak Virgo S/N helped constrain the sky localization contour by eliminating the region of the sky in which Virgo has high sensitivity, producing a highly constrained sky location posterior.
5. Discussion
In summary, we have reported the first deep learning–based approach for sky localization of all CBC sources, which enables orders of magnitude faster inference than other methods. Using S/N time series data instead of the actual GW strain data for training and testing, we were able to achieve localization of BNS and NSBH sources for the first time using deep learning. The use of S/N time series data also significantly reduces the computational overheads and memory that act as a bottleneck in many deep learning–based approaches. We plan to adapt this method for extensive training and testing on real noise injections in the future. Since deep learning methods do not require a likelihood to be explicitly defined, and can learn the distribution of data directly from training examples, they can possibly be more robust against glitches and nonstationary, non-Gaussian noise than methods like BAYESTAR, BILBY, and LALInference, which assume a Gaussian likelihood. The present work is a feasibility study of deep learning as an alternative tool for rapid GW sky localization, and it is hoped that certain modifications to the current model architecture will improve the consistency of our model's results with standard approaches, while retaining the same inference speed as BAYESTAR. One of these modifications is to condition the network not only on the S/N time series and intrinsic parameters, but also on the individual detector PSDs of the noise around the event in consideration. This modification will make the inputs to our model the same as BAYESTAR's (i.e., S/N time series, intrinsic parameters of best-matched templates, and detector PSDs), and it has also been shown to significantly improve the performance of deep learning–based parameter estimation approaches in previous studies (Dax et al. 2021). We will investigate this feature in our future work. In the future, we also plan to extend this method for rapid inference of the luminosity distance of GW sources, which is essential for prompt identification of the host galaxies for EM follow-up observations of compact object mergers.
Acknowledgments
This research was supported in part by the Australian Research Council Centre of Excellence for Gravitational Wave Discovery (OzGrav, through Project No. CE170100004). This research was undertaken with the support of computational resources from the Pople high-performance computing cluster of the Faculty of Science at the University of Western Australia. This work used the computer resources of the OzStar computer cluster at Swinburne University of Technology. The OzSTAR program receives funding in part from the Astronomy National Collaborative Research Infrastructure Strategy (NCRIS) allocation provided by the Australian Government. This research used data obtained from the Gravitational Wave Open Science Center (https://www.gw-openscience.org), a service of LIGO Laboratory, the LIGO Scientific Collaboration, and the Virgo Collaboration. LIGO is funded by the U.S. National Science Foundation. Virgo is funded by the French Centre National de Recherche Scientifique (CNRS), the Italian Istituto Nazionale della Fisica Nucleare (INFN), and the Dutch Nikhef, with contributions by Polish and Hungarian institutes. This material is based upon work supported by NSF's LIGO Laboratory which is a major facility fully funded by the National Science Foundation. The authors would like to thank Prof. Amitava Datta from The University of Western Australia for help in this work.