Denoising gravitational-wave signals from binary black holes with a dilated convolutional autoencoder

The broadband frequency output of gravitational-wave (GW) detectors is a non-stationary and non-Gaussian time series data stream dominated by noise populated by local disturbances and transient artifacts, which evolve on the same timescale as the GW signals and may corrupt the astrophysical information. We study a denoising algorithm dedicated to expose the astrophysical signals by employing a convolutional neural network in the encoder-decoder configuration, i.e. apply the denoising procedure of coalescing binary black hole signals to the publicly available LIGO O1 time series strain data. The denoising convolutional autoencoder neural network is trained on a dataset of simulated astrophysical signals injected into the real detector’s noise and a dataset of detector noise artifacts (‘glitches’), and its fidelity is tested on real GW events from O1 and O2 LIGO-Virgo observing runs.


Introduction
The onset of gravitational wave (GW) astronomy began in 2015 with first direct detection of GWs from a binary inspiral and merger of two stellar-mass black holes (BHs), the event later denoted as GW150914 [1].Since then, the Advanced LIGO [2] and the Advanced Virgo [3] detector network has detected many GW signals, mainly binary black holes but also binary neutron star (NS) inspiral events [4,5], and BH-NS systems [6,7]; for the list of published GW transient signal detections, see the GWTC-1 [8], GWTC-2 [9], GWTC-2.1 [10] and GWTC-3 [11] catalogs.During the a few dilated decoder layers [40] is able to train on the GW signal waveforms injected to realistic LIGO data time series, and recover real GW events.We consider this type of a lightweight algorithm a potentially useful trigger generator, performing a role of rapid initial classification of GW signals, and/or data characterisation tasks.
This article is composed as follows.Section 2 contains a brief description of the CNN and AE methodology, description of the DAE network, as well as the training and testing data.Section 3 describes the results, obtained using both the simulated signals and real O1 and O2 events, whereas Sect. 4 contains the conclusions.

Convolutional autoencoders
CNN is a type of NNs that applies a set of convolutions (by means of the kernel filters) to the network's input [37,38].In the context of denoising, training a CNN consists in finding the filters weights which optimally extract the preferable features in input data and are present in clean (noise-free) data.The filters are weighted, and their relative importance is also acquired during training.CNNs are widely used in data processing and are able to perform a wide range of tasks such as classification, pattern recognition or feature extraction.Concretely, the convolutional filter is moved across the input image to reveal important features while preserving spatial relationship between samples, see e.g.[41,42].In all cases, the CNN is effectively performing a role of a dimensionality reduction algorithm, the output of the convolutional layer being a 'simplified view' of the input.In our one-dimensional context of time series data we adopt the CNN architecture for a practical advantage of their much faster training times, as compared to more complicated LSTM/RNN architectures [38], while retaining the ability to learn temporal information encoded in the time series [43].Specifically, we introduce dilated convolutional layers [40], with an aim to connect distant samples within the signal, and use multi-scale information contained in the time series.

Denoising AE model
The CNN layers in this work adopt the AE architecture.In general, the AE NN represents an identity function by compressing the representation of input data and then decompressing it.During the training, the overall network will learn its own sparse representation of the input signals.AE are composed of two networks: an encoder g φ , and a decoder f θ .
The parameters φ and θ denote the parameters of the encoder and the decoder respectively.The encoder maps the original high-dimensional input into a latent space z = g φ (x), where x is the training or testing input.Note that we do not make an explicit use of the latent space in the present study, although in principle it may be used for e.g.parameter estimation study, as it is related to the Bayesian formulation of the problem through the variational AE (VAE) approach [44,45], or a version of the VAE, conditioned by actual values of signal parameters in question (Conditional VAE, CVAE [46]).Contrary to the encoder, the decoder recovers the data from z, and successively unfold the compressed data.The output data of the decoder (and hence the AE) is denoted x = f θ (z).AE accomplishes the dimensionality reduction in the same way as the principal component analysis (PCA, [47]) or matrix factorisation algorithms [48], but the underlying process of AEs is highly non-linear and explicitly optimizes the data reconstruction.
The training of an AE consists in finding the set of parameters (θ, φ) which minimize the distance between x and x .This distance is properly defined by the so-called loss function L AE (θ, φ).In the case of regression problems like the AE identity function, a popular choice is the mean square error (MSE): where N is the number of samples of input/output signals.
We adopt a one-dimensional CNN AE to denoise generated astrophysical GW signals added to real GW detector data time series, collected during the LIGO O1 run.A DAE procedure consists in training an AE with a corrupted version of the input signal (i.e.clean signal immersed in the noisy time series), denoted by x, and by demanding that the recovered output x is as close to the original clean input x as possible.A schematic representation of a DAE is presented in Fig. (prevent saturation of the non-linearities), and to speed up the training.The overall effect is also to make the network more robust to the initialization of weights.We use 32 input instances per batch.In the encoder part of the AE, the pooling layers [50] apply a non-linear down-sampling and compactify (reduce) the information of the input.
In the decoder part of the AE, we introduce upsampling layers to ensure that the low dimensional information is successively unfolded.We use a Rectified Linear Unit (ReLU) activation function all over the layers.As there are no strong arguments for the use of asymmetric encoder-decoder structure, we introduce three dilated convolutional layers [40] which correlate non adjacent samples within the signal, and is able to systematically gather multi-scale information contained in the time series without losing resolution.In this respect the final layers (CNN dilated layers) of the decoder bear resemblance to the LSTM architecture, in the sense that they register and keep track of the signal evolution in various scales.More precisely, they include a causal padding that couples signal samples that are originally far from each other.The more important is the dilation rate, the wider is the coupling area.
For the DAE loss function, we chose the following MSE function: where f θ (g φ (x)) = x is the DAE output, and x is the ground-truth (clean) signal waveform input.

Training and testing data
The input "clean" signals used in this project are simulated astrophysical GW signals from binary BHs (BBHs).In general, astrophysical GW signals from close binary compact systems exhibit a characteristic increase of GW amplitude and frequency during the inspiral (the "chirp"), followed by the merger of the binary components, and the ringdown GW emission from the remnant [51].Approximately, the GW inspiral frequency evolves as [52] f where t c is the time of coalescence, and the M c is a function of component masses M 1 and M 2 , called the chirp mass: At a given moment the GW strain amplitude h depends, approximately, on the binary system parameters as follows: For production runs, we assume that signal waveforms (the amplitude-frequency evolution h(f ) or, equivalently, h(t)) are well-modeled using general relativity, and for the sake of this study employ the SEOBNRv4 waveform model [53], assuming non-spinning components with masses randomly chosen from a uniform distribution in the range M 1 , M 2 ∈ (10, 30) M , compatible with the current state of observational knowledge on the binary BH population [8].Sky localizations were chosen to be optimal with respect to the antenna pattern of the detector for a given time, corresponding to the data segment.Other parameters describing the source, e.g., the orientation of the orbit with respect to the line of sight, were selected randomly from their natural ranges.
The "corrupted" input is obtained injecting these astrophysical GW signals in pre-selected time series segments from the science-quality data stream of the LIGO Livingston detector, collected during the O1 data taking campaign and released by the LIGO and Virgo Collaborations via the Gravitational Wave Open Science Center [15].For convenience, these time series segments are packaged with a fixed length of 1 s and down-sampled from the detector's raw sampling rate of 16384 Hz to 2048 Hz (corresponding to the Nyquist frequency of 1024 Hz).They are selected to contain only "quiet" data, i.e. rejecting periods when transient artifacts of instrumental origins (socalled "glitches") [54,55], as well as hardware signal injections ( [56], arranged to test and calibrate the detection system), are present.The GW signals are injected in the time series segments randomly off-center (taking the signal's maximum amplitude peak as a reference), with random time shifts drawn from a uniform distribution of (-0.1, 0.4) s.
GW data analysis is essentially based on matched-filtering techniques, which consist in finding the waveform template that matches the data best.Working under the assumption of the additive property of the noise (i.e., the time series d containing the GW signal h immersed in noise n is d = n + h), then the optimal signal-to-noise ratio (SNR) ρ, corresponding to the best matching filter (template h equals the signal) is , where (d|h with d(f ) a Fourier transform of the time series d(t), and S n (f ) the power spectral density (PSD) of the detector; asterisk denotes complex conjugation [18].Detector's PSD S n (f ) represents the frequency-dependent sensitivity in a broad range of frequencies, and is quantified by its sensitivity curve.From the astrophysical perspective, the SNR is a function of the waveform amplitude and, since the waveform describes the evolution of the GW amplitude, is inversely proportional to the luminosity ("loudness") distance.While preparing the data set, we label the signal waveform with their optimal matched-filter SNR which approximates ρ, assuming that the noise effect is negligible, d ≈ h.The optimal matched-filter SNR ρ opt is a good first order approximation to the actual matched-filter SNR ρ in e.g.stationary Gaussian noise [57].Subsequently, we produce a synthetic distribution of source luminosity distances such that the ρ opt distribution is uniform in the range of (5,20) in order to consider a wide range of signals during the training.Source distances are in the range between 100 to 1000 Mpc.The adopted ρ opt range is comparable to the figure-of-merit optimal SNR of 8, which correspond to a confident single-interferometer detection of an optimally-oriented compact binary inspiral [58,59], and is also consistent with the previous LIGO-Virgo detections [8,9,10].We extend the ρ opt range to the lower values to study the sensitivity and robustness of our DAE implementation in the low ρ regime.Last, but not least, to normalize the dependence of signals' strength at different frequencies we additionally perform the whitening of the time series data with added signals: we divide the Fourier representation of the time domain data by an estimate of the amplitude spectral density of the noise S n (f ) (ASD, square root of the PSD) to ensure that the data has equal significance in each frequency bin [60].A low-frequency cutoff at f low = 30 Hz was chosen for the simulated signals and the data, taking into account the low frequency (seismic) limit of the detectors' sensitivity.This is reflected in the example of GW strain amplitude evolution h(t), immersed in the LIGO Livingston detector noise, shown on Fig. 2.
The network is trained during 50 epochs on 7000 data time series containing injected astrophysical signals.In addition, the training set contains 1000 time series from the O1 LIGO Livingston data when known instrumental artifacts ("glitches") are present in the data.We have used the the Gravity Spy database [55] to obtain various common types of glitches with an estimated SNR larger than 10.All the instances in this dataset are treated with the whitening procedure and a 30 Hz high band pass filter.Mixing signals and glitches prevents from any over-fitting because the network is fed with several distinct datasets which all span the targeted parameter space.Additionally, we introduce randomization of the data at two levels.First, the set of training and testing datasets are split in two after randomly shuffling the initial dataset; the traintest split factor is 0.75.In addition, we randomly replace some of the training and testing astrophysical signal input data by a null signal (array of zeros) with a probability p = 35%, to increase the robustness of the trained CNN to Gaussian noise fluctuations (in case of instrumental glitches, the instances of signal are automatically set to arrays containing zero values).In training we have used the Adam optimizer [61] with a constant learning rate of 0.001 as an optimization algorithm for stochastic gradient descent calculations.

Results
Using the figures of merit commonly used in the GW astronomy, the MSE (Eq.2), and the waveform overlap O, which compares the original "clean" waveform h and the denoised waveform h d obtained at the output of the DAE: where N is the number of points in the time series.In the evaluation, we also show distributions of selected astrophysical parameters of the GW signals.

Population of injected signals
In Fig. 3 we show both the estimated MSE and the recovered waveform overlap O as a function of the injected matched filtering signal-to-noise ratio (SNR) [18,27], with additional information on the distance to the source (top plot) and the chirp mass (bottom panel) encoded in color, for 1000 GW signals added to detector's noise.We don't detect a correlation between the MSE and the SNR, as expected in properly executed training on the input data set with uniform distribution SNR.While we don't detect a clear correlation between the overlap and the chirp mass M c , a preference towards smaller SNR for large distances is seen.Signals with waveform overlap smaller than 0.75 constitute 6.6% of all the signals; 5.6% of all the signals have both O < 0.8 and SNR < 8, so using this threshold overlap criterion we estimate that about 1% of potentially detectable signals (with SNR > 8) are incorrectly recovered by the DAE. Figure 4 shows the comparison between the injected SNR ρ opt and the recovered (denoised) output SNR ρ out,d , both calculated using the optimal matched filter SNR formula (Eq.7).The color code indicates the waveforms overlap.As expected, in cases of high overlap the denoised SNR approximates quite well the injected one.The cases of lower overlap have a denoised SNR close to zero.The distribution follows the ideal ρ out ≡ ρ out,d relation with a root-mean-square of residuals of 1.9 and variance of residuals of 3.8.The denoised SNR may be used as an approximate proxy for detection criterion: 8.2% of the output signals have ρ opt,d < 5, whereas 1.3% of signals have both ρ opt > 8 and ρ opt,d < 5, i.e. are potentially strong enough to detect with the standard methods, but incorrectly recovered by the DAE.Additionally, we perform the denosing procedure on the same detector time series samples but without injected GW signals, to study the output signals.The distribution of output SNR in that case is depicted by the red histogram; only a few noise-only samples have ρ out,d > 5.

Evaluation on known instrumental glitches
As discussed in Sect.2.3, we have also used time series containing known glitches to train the network.The GPS times of these glitches have been extracted from the Gravity Spy  database [55].The corresponding 1 second LIGO data segments centered at these GPS times have been downloaded via the Gravitational Wave Open Science Center [15].For the training phase, 1000 GPS times have been randomly selected among the list of glitch times available in O1 with SNR>10 and duration<1 s.To test the ability of the network to denoise also time series containing glitches, we have tested the network on different glitch families, selecting some examples of recurrent glitch types, still with the conditions SNR>10 and duration<1 s.The nomenclature of these families comes from Gravity Spy.In some cases the origin of the glitch is known, for example for the glitch type called whistle, which seems to be caused by signals at megahertz frequencies that beat with Voltage Controlled Oscillators in the interferometer control system [62].The other glitch families considered for this test, Low-Frequency Burst, Koi Fish and Blip, have an unknown origin.A Low-Frequency Burst appears in a time-frequency spectrogram as an excess noise at low frequency.A Koi Fish is a short-duration broadband noise.A blip is a short duration noise that appears in a spectrogram as a symmetric 'teardrop' typically between 30 and 250 Hz, with the majority of the power appearing at the lowest frequencies [54].
Figure 5 contains an histogram of the recovered SNR for a group of various glicthes randomly selected (blue line) and for specific glitch classes.It is visible that, even if all the selected glitches have SNR>10, the denoised SNR is always quite small, so it can be assumed that a denoised time series with sufficiently high SNR is probably a signal of cosmic orgin and not a glitch.However, this test have been done with few samples so a deeper study on this respect is needed.

Aleatoric uncertainty modelling
The results presented above are obtained with a forward pass of a noisy GW signal in our DAE model.It produces another denoised signal which has the same length as the input one.It is the result of minimizing an unweighted L2 distance between samples (Eq.2) or equivalently of maximising a Gaussian likelihood.Obtaining a maximum likelihood estimator (MLE) provides an unbiased estimator of the mean.However this does not provide an unbiased estimator for the standard deviation.When it comes to estimating uncertainty in a statistical model, two contributions are often distinguished.The epistemic uncertainty is caused by the fact the model is not appropriate to the data.Indeed multiple model parameters could be consistent with the observed training data.In practice it arises in parts of the parameter space where there are fewer samples for training.The aleatoric uncertainty is the uncertainty arising from the stochastic nature of observed data.
A further improvement of our work is notably to enrich the prediction made by the NN model by introducing the estimation of the aleatoric uncertainty.To do so we follow the following procedure.First, the noisy signal x(t) is fed into the DAE and it predicts a point estimate of the denoised signal x (t).Then we add a random noise n(t) from a standard normal distribution to x (t) in order to form a new noisy signal.Repeating the last step several time one gathers { x (t)} M , i.e.M distinct Gaussian noise realisations on top of x (t).Finally, the M noisy timeseries are fed to the DAE and one gets { x } M denoised timeseries from which the standard deviation at each time sample is computed.
This estimation of the aleatoric uncertainty relies on two assumptions.First, the noise part of the input signal follows a Gaussian distribution.Second, the point estimate is an unbiased estimate of the mean true signal.On Fig. 2 and Figs. A1-B4, this mean is effectively close from the clean signal.No quantitative estimation of the closeness of the mean with the true signal has been performed.
Recent works on Bayesian deep learning like normalizing flows [63] and variational autoencoders [45] offer a formalized statistical and probabilistic framework to properly estimate both the epistemic and aleatoric uncertainties.This implies estimating a distribution on every sample of the signal.For instance, if we assume Gaussian distributed prediction of a denoising CVAE model, this implies the inference of {µ[s], σ[s]} s for s ranging from 0 to 2048.We leave this investigation for further work.

Real GW events
The result of the denoising of data segments containing real GW events registered during O1 and O2 are showed in Appendix A and Appendix B, respectively.We study events included in the GWTC-1-confident catalog subset at the GWOSC [64], and in particular evaluate how well the DAE trained with the O1 LIGO Livingston data only performs on O1 and O2 data gathered by both the LIGO Livingston and LIGO Hanford instruments, see the Gravitational Wave Open Science Center for detailed information [15].For brevity, we denote the Livingston detector by L1 and the Hanford detector by H1.The data have been whitened and high-pass filtered with a f low threshold of 30 Hz, as described for the training and testing dataset in Sec.2.3.The events' waveforms are taken from the data behind Fig. 10 of the GWTC-1 catalog paper [8], more precisely from the lalinference folder of [65].The plots in Appendix A and Appendix B display the aleatoric uncertainty estimate discussed in Sec.3.3.

Model size and training time
All key parameters of the network (batch size, learning rate, number of units per layer, activation functions) have been determined following a simple grid-based approach.In total, the final version of the DAE model consists of 1491137 parameters (4096 non trainable parameters).The training time on 8000 data instances (7000 GW signals immersed in noise, 1000 glitch instances) is approximately 480 s on NVIDIA Tesla V100-SXM2-32GB, including reading in the training data.The evaluation of the trained DAE takes approximately 20 ms to denoise one time series instance.

Conclusions
We introduce a deep-learning technique relying on an AE network architecture to reduce Gaussian and non-Gaussian contributions from BBH GW measurements by groundbased detectors.The model is trained on a population of simulated astrophysical sources injected into real interferometric noise from O1 and O2 LIGO-Virgo observing runs (LIGO LIvingston and Hanford detectors).We studied the efficiency in recovering the SNR and overall waveform from a population of injected signals, and assess the model robustness with respect to some classes of instrumental glitches.Finally, we propose and implement an aleatoric uncertainty estimation method before applying the method to real events observed during the O1 and O2 LIGO-Virgo observing runs (GW events robustly detected and confirmed by other methods).The DAE method presented here is potentially a versatile pre-processing tool prior to detection and/or source parameter estimation pipelines used to analyse data collected by ground based instruments.An immediate step is to extend the source parameter ranges such as the individual masses or the sky localization, and apply the conditional parameter training, as in the CVAE.The approach can also be improved by (i) considering other morphologies of instrumental glitches [55] and increase the training/testing dataset size.
(ii) make the loss function more complex, e.g. by adding a regularization term to penalize glitches and unwanted features [36].
(iii) include more ground based detectors in the CNN architecture and see whether it improves noise removal.Note that we have not used the currently-available Virgo data in the analysis because of the low SNR of recovered events.
(iv) consider more sophisticated hyperparameter tuning of the network.
(v) improve the uncertainties estimation by e.g. the analysis of the latent space features.
In summary, we emphasize that we aimed at obtaining a denoising reconstruction method of realistic GW BBH signals with a relatively small training data set (of the order of thousands of samples), and a minimal-size NN.Vast majority of test sample GW signals are recovered sufficiently well, especially the high-frequency part which is also correctly resolved in phase, making the DAE method a potentially computationallyinexpensive trigger generator working in low-latency (a pre-processing step before more expensive parameter estimation methods); the shortcomings presented on e.g.real data examples are generally related to parameters of the GW signals being outside the training dataset parameters, and therefore possible to relieve with training tuned to specific signal parameters (taking into account known techniques to prevent catastrophic forgetting [66]).Improvements in the uncertainties estimation by the DAE model would additionally strengthen its role as a pre-processing step in the general parameter estimation.

Figure 1 .
Figure 1.Data flow diagram of a DAE architecture.The clean input data x is corrupted (denoted by x) and fed to the encoder part of the AE; for the well-trained network, the decoded output (reconstructed clean input) x should be a close analogue of the clean input x, i.e. x ≈ x.The middle part of the DAE -the latent space -is denoted by z.In case of x ≡ x the DAE becomes a classical AE.

Figure 2 .
Figure 2.An example of a training/testing data time series instance, and the DAE output.Top panel: a GW signal (green) immersed in the noisy data (blue).Middle panel: denoised version of the signal (red), with the aleatoric uncertainty estimate (yellow); see Sect.3.3 for more details.Bottom panel: for tests, we perform a denoising procedure on the time series without a GW signal added (grey).The result is the black line, optimally with signal-to-noise equal to 0. Parameters of the GW waveform are as follows: M 1 = 22.74 M , M 2 = 17.34 M , distance r = 479.27Mpc, corresponding to the optimal matched filter signal-to-noise ratio ρ opt = 9.3 (Eq.7).The denoised ρ opt,d is 10.3.Output waveform obtained from pure-noise sample (containing no GW signal) has ρ opt,pn of 1.5.Both the data and GW waveform are whitened by the detector's PSD, calculated using neighbouring data segments; see Sect. 3 for more details.

Figure 3 .
Figure 3. Upper part of the plot: mean square error (MSE, Eq. 2) as a function of the signal-to-noise (SNR) resulting from for evaluating 1000 data instances with added astrophysical GW waveforms.Color of points correspond to the source distance (in Mpc).Side panels show histogram counts (in logarithmic scale) along both axes.Lower part of the plot: the overlap (Eq.8) between the original and the denoised GW waveforms as a function of the SNR.Color of points correspond to the chirp mass M c .The histogram count of the SNR values shows approximately uniform distribution of the values in the dataset.

Figure 4 .
Figure 4. Denoised SNR (calculated from the DAE output, vertical axis) as a function of the injected SNR (horizontal axis) for a testing dataset of 1000 data instances with added astrophysical GW waveforms.Points are colored by their corresponding overlap values.Orange dashed line denotes the denoised SNR equal to injected SNR.Example waveform presented in Fig. 2 is denoted by a red circle.Side histograms (in logarithmic scale) show the distribution of the injected SNR (upper plot), and SNRs denoised from samples containing added GW waveforms (blue histogram), and -for comparison -not containing GW signals (i.e.pure noise, red histogram), respectively.

Figure 5 .
Figure 5. Evaluation of the DAE on instrumental glitches.The logarithmic vertical scale plot shows histograms of denoised output SNR for a selection of 38 Low Frequency Burst glitches, 5 Koi Fish type glitches, 80 Blips and 7 Whistle glitches.Blue line marks the evaluation on 792 assorted various types of glitches.All the glitches data are obtained from the Gravity Spy database [55].The glitches have their estimated intrinsic SNR > 10.

Figure B4 .
Figure B4.Denoising applied to the O2 data GW170608 event for the L1 detector top panels) and the H1 detector (3 bottom panels).While for the other plots the high-pass filter was set to 30 Hz (as in the training set), in this case we apply high pass at 50 Hz to the original data before the denoising.

Table 1 .
1.The training data set is described in detail in the subsequent Sect.2.3.Results of the trained DAE on simulated data in real detector's noise are described in Sect.3, whereas denoising of real GW signals found in the O1 and O2 LIGO run are presented in Sect.3.4.Architecture of the DAE model.Here N = 2048 denotes the number of samples in the time series signal instance provided at the input of the encoder (the number of samples corresponds to 1 second of data with the frequency sampling of 2048 Hz).The AE latent space "bottleneck" is located between layers 8 and 9. DR stands for dilation rate.The dilated part of the decoder starts at layer 14.
[49]layer-by-layer structure of the NN is described in detail in Tab.1.The batch normalization technique[49]normalizes and scales the layer inputs in order to stabilize [55]distribution of the SNR of denoised signals is compared to the SNR of injected signals.As a sanity check we evaluate the DAE on time series not containing injected signals ("only noise" time series) and on several types of the Gravity Spy database[55]glitches.