Towards a robust and reliable deep learning approach for detection of compact binary mergers in gravitational wave data

The ability of deep learning (DL) approaches to learn generalised signal and noise models, coupled with their fast inference on GPUs, holds great promise for enhancing gravitational-wave (GW) searches in terms of speed, parameter space coverage, and search sensitivity. However, the opaque nature of DL models severely harms their reliability. In this work, we meticulously develop a DL model stage-wise and work towards improving its robustness and reliability. First, we address the problems in maintaining the purity of training data by deriving a new metric that better reflects the visual strength of the ‘chirp’ signal features in the data. Using a reduced, smooth representation obtained through a variational auto-encoder (VAE), we build a classifier to search for compact binary coalescence (CBC) signals. Our tests on real LIGO data show an impressive performance of the model. However, upon probing the robustness of the model through adversarial attacks, its simple failure modes were identified, underlining how such models can still be highly fragile. As a first step towards bringing robustness, we retrain the model in a novel framework involving a generative adversarial network (GAN). Over the course of training, the model learns to eliminate the primary modes of failure identified by the adversaries. Although absolute robustness is practically impossible to achieve, we demonstrate some fundamental improvements earned through such training, like sparseness and reduced degeneracy in the extracted features at different layers inside the model. We show that these gains are achieved at practically zero loss in terms of model performance on real LIGO data before and after GAN training. Through a direct search on ∼8.8 days of LIGO data, we recover two significant CBC events from GWTC-2.1 (Abbott et al 2022 2108.01045 [gr-qc]), GW190519_153544 and GW190521_074359. We also report the search sensitivity obtained from an injection study.

The field of GW searches has witnessed promising developments on the machine learning (ML) front in recent years.Some of the early works professing the advantages of ML methods for GW searches focused on sensitivity comparisons against matched-filtering primarily using simulated noise [72,73] with a limited extension to real data [74].These works based their sensitivity estimates on Receiver operating characteristic (ROC) curves obtained from datasets consisting of discrete samples that placed CBC injections in specific regions of the analysis window.Gebhard et al. [75] critically ex-amined the suitability of such approaches for realistic searches by addressing various issues like class imbalance, temporal localisation of candidate events and assigning statistical significance.A more reliable approach of performing trigger-based coincident analysis with the DL outputs on a stream of continuous data was proposed in the literature [34,35], which is followed in this work as well.A systematic approach in constructing a proper ranking statistic is also a crucial step towards implementing a coincident search [34,38,41,76].To enable comparative evaluation of different ML pipelines in a realistic search setting and to provide a common platform, a mock data challenge was organised recently [77].
The DL approaches come with a cost.The most critical hurdle that keeps them from being considered for productionlevel deployment is the unreliability originating from their black-box nature.It has been shown that tiny perturbations to the input images achieved through targeted attacks can lead to surprise failures of these models [78].Also, generative adversarial attacks can reveal many other failure modes.A generator trained to deceive a DL model can yield many unrealistic images which do not belong to any meaningful class, although, the model classifies them into specific classes with high probabilities.These failures highlight the ambiguity over the actual features learnt by a DL model that otherwise shows an excellent performance in metrics like accuracy, F1-score, and area under the ROC curve.It also highlights how the performance of DL models on new data is unpredictable and fragile, making them unreliable for the analysis of real data at the production stage.In the context of GW searches, by obtaining adversarial examples through input perturbations, Gebhard et al. [75] first demonstrated the simple "failure modes" of DL models, which the conventional approaches could handle easily.In this work, we take a generative adversarial network (GAN)-based approach to generate the adversarial samples.GANs have previously been used in GW data analysis primarily for generating various transient signals [79,80].Here, we take a novel approach of utilising the GAN framework for bringing robustness to our DL model.
In this work, we explore several ways of making DL models more reliable.Although the focus is on CBC searches, the methods presented here are extendable to searches for signals from other sources and possibly to tasks other than searches, such as glitch classification and parameter estimation.In Section II, we address the issues related to the purity of the training data, describe the development of a new metric for data generation and the preparation of various datasets for different stages of training.Section III elaborates the training of a variational auto-encoder (VAE) model to achieve a reduced representation of the signal-specific features in the input.Repurposing this information for the task of signal identification is detailed in Section IV.Section V describes the scrutiny of the resulting classifier model for robustness and its retraining in the GAN setup to increase robustness.Lastly, the implementation of a direct search using the classifier is provided in Section VI.We conclude with outlook and future directions in Section VII.

II. WHAT ARE WE SHOWING TO THE MODEL?
In this section, we describe the development of a new metric for data generation and the preparation of datasets.For efficient training of various stages of our model, we construct four different datasets for training: injections in simulated Gaussian noise (INJ), the same set of injections without noise (CLEANINJ), the same Gaussian noise realisations without injections (GN) and a glitch dataset (GL) obtained from GravitySpy [53,55,81,82].To ensure near absolute fidelity of the training data, we stick to simulated Gaussian noise.However, we consider only real LIGO data for testing.We prepare three classes of test datasets: CBC injections (INJ-TEST), PYCBC background triggers (TRIG-TEST) and glitches (GL-TEST).Our experiments on the test datasets show that the features learnt by our model are generalised enough to ensure the model's applicability to real data.
A. New metric for injection data Most of the past works considered a fixed lower cutoff on signal-to-noise ratio (SNR) across the parameter space for generating training data either directly or by fixing the upper cutoff on the chirp-distance 1 .However, the duration of time the signal remains in the sensitive band of a detector, which in turn depends on the component masses and the power-spectral density (PSD), strongly determines the visibility of the signal 2 .Figure 1 illustrates this with the continuous wavelet transforms (CWT) maps of two injections that have nearly the same recovered SNR values but different chirp masses.Clearly, the low chirp-mass binary has a smaller amplitude but spends a 1 Chirp-distance is the luminosity distance adjusted for the leading order chirp mass term in the CBC waveform. 2Here, visibility refers to the relative strength of the pixels corresponding to the signal features in the input image to those corresponding to the noise features.It can also be imagined as a metric that captures humans' ability to distinguish signal features from noise features in the image data.
much longer time in the sensitive band of LIGO/Virgo.Therefore, the total energy contained in its signal is distributed over a much longer duration resulting in reduced visibility of its chirp features compared to the high chirp-mass binary.This difference is expected to play a crucial role for data representations which do not include the phase information, e.g., timefrequency representations such as CWT maps, Q-transforms, and short-time Fourier transforms saved as images with pixel values proportional to the amplitude.
To efficiently train the DL models to detect the weakest possible signals across the parameter space, maintaining uniform visibility of signal features across the training data is important.Therefore, instead of using SNR-based thresholds, an approximate metric that better reflects the signal's visible strength could be the SNR divided by the duration of the signal in the detector's sensitive band.For a detector with flat frequency response, the duration of the CBC signal can be captured in leading order using the chirp time (τ 0 ).It is the Newtonian coalescence time calculated starting from the frequency f 0 and is given by, Considering that the detector noise is coloured, we can choose f 0 in such a way that we operate in a frequency band where the response of the detector is approximately flat and τ 0 remains relevant even after whitening.However, we found that the approximate metric constructed this way lost its validity rapidly upon approaching higher masses where Newtonian approximation fails to describe the waveform.To solve this issue, we numerically modelled the intrinsic mass dependence of the visibility of the chirp features.We obtained maximum amplitude values of the normalised whitened waveforms generated at each point in the component mass space to capture the variations in the maximum pixel values corresponding to the CBC chirp features in the CWT maps.We uniformly divided the mass-ratio (q) -total mass (M ) space 3 , generated waveforms at each point of the grid using IMRPhenomXPHM approximant and normalised them (such that ρ( ĥ, ĥ) = 1, where ĥ is the normalised waveform).The q and M values were sampled in the range [1,8] and [4,800] with step-sizes 0.1 and 10, respectively.We whitened both '+' and '×' polarisations of each waveform using the aLIGOZeroDetHighPower PSD, and recorded the maximum amplitude of their quadrature sum as the intrinsic mass dependence factor in the amplitude, A intr (q i , M i ) for the i th binary.
We used linear interpolation between the grid points to obtain a continuous function A intr (q, M ) that is valid at all points within the range of parameters considered for this numerical estimation.It can be observed from Figure 2 that the maximum relative variation in A intr (q, M ) is as high as ∼ 7. Using this intrinsic mass dependence, we construct a new metric that better reflects the visible strength of the chirp features. 3All the masses mentioned here are in the detector frame.FIG. 2. Value of the maximum amplitude of normalised whitened waveform Aintr(q, M ) as a function of the mass ratio q and total mass M .We use aLIGOZeroDetHighPower PSD for whitening.The relative variation of Aintr(q, M ) can be as large as ∼ 7 in the extreme regions of the mass space.The dashed line marks the region of parameter space used for the training and testing.
We call this metric γ, which is obtained as, where, ρ(s, ĥ(θ)) is the SNR calculated with data s and template ĥ(θ) with θ limited to only the mass parameters.We, therefore, obtain a metric that is a direct function of the visibility of the signal features in the input image and is independent of the component masses of the CBC.As an example, the injections shown in Figure 1 having nearly the same SNR values are fairly separated in their γ values which come out to be 26.5 and 44.9, respectively.The non-inclusion of the spin parameters in the A intr function is expected to retain some impurity in the training data.However, these impurities are expected to be small enough that their effect on training remains minimal.In Figure 3, we show how the visible strength of signals in the image changes as a function of γ alone while the chirp mass only changes the signal morphology, leaving the visible strength of the signal almost unaffected.The recovered SNRs are annotated on the respective images for comparison.The SNR values can be seen to increase with reducing chirp mass for a given γ bin.

B. Preparation of datasets
For generating the INJ dataset, we consider a region of CBC parameter space covering masses that are most likely to be observed with LIGO/Virgo detectors, and precession.A large number of injections are made in the simulated coloured Gaussian noise generated from aLIGOZeroDetHighPower PSD 4 .The hyper-parameter choices for these injections are listed in Table I.We whiten and band-pass the data between 10 − 512 Hz and collect 3 second long slices containing injections.Further, we obtain the time-frequency representations of these data slices using CWT with Morlet wavelet.We then crop these 2-D arrays to 1 second long CWT maps in such a way that the merger times of the binaries are uniformly distributed between 0.7 − 0.85 second of the slice.Such a range is chosen to allow a reasonable coverage of the inspiral and post-merger parts of the low-mass and high-mass binaries, respectively.To discard the filter artefacts, we manually set the values corresponding to frequencies less than 16 Hz in the CWT maps to zero.For better visualisation, the frequency scale is set to logarithmic.The same procedure is repeated for generating the CLEANINJ dataset using the same injection set, except without noise this time.The GN dataset is prepared by collecting the Gaussian noise realisations from the INJ dataset while omitting the CBC injections.Therefore, at the level of time-series data, the INJ dataset is effectively the direct addition of data slices from CLEANINJ and GN datasets.However, since the image pixels of CWT maps in our data correspond to the absolute amplitude at the respective locations, this relation is not followed exactly.Nevertheless, such data preparation greatly reduces the possibility of over-fitting the training data, as the features corresponding to noise in INJ and GN datasets are very close to each other.This forces the model to focus on the uncommon features that correspond to the CBC signal exclusively.To prepare the GL dataset, we obtain the class and GPS time information from Gravity Spy [53,55,81,82], cluster the GPS times within 0.1 second, and choose the data slices such that the glitches are uniformly distributed between 0.25 − 0.75 second of the slice.Since no specific morphology is to be captured in the case of glitches, this choice of range is made while ensuring complete coverage.The same procedure mentioned above is repeated to prepare the CWT maps.For the datasets involving CBC injections (INJ and CLEANINJ), we consider samples with γ > 30 and ρ > 5. Note that we use the recovered SNR instead of injected SNR while calculating the γ values to account for variations due to noise realisation, thereby reducing the error in the estimations of γ.This is especially helpful in maintaining faithfulness of the injection data even at lower values of γ.A total of 50000 samples were collected with these cuts, of which 95% were used for training and the rest for validation.We also prepare three separate test datasets using real LIGO data.For the signal class (INJ-TEST), we make injections with the same parameter choices described in Table I.However, unlike the INJ dataset, these were made in the real data from the entire third observing run of LIGO detectors at Hanford (H1) and Livingston (L1), and with a new seed value to ensure new samples of parameters.For the noise class, we consider two cases of direct relevance for actual searches: the glitches (GL-TEST) and the background triggers collected through matched filtering (TRIG-TEST).The specification of GL-TEST dataset is kept the same as that of GL while ensuring no overlap between the two.For TRIG-TEST dataset, random samples were obtained from ∼ 1.3 billion triggers collected in each detector by the focused PYCBC BBH search of ∼ 7.24 days long chunk of data between May 29 to June 5, 2019 UTC from GWTC-2.1 [1].As the CBC events are sporadic, we do not expect any real events to be present in this dataset.We collect ∼ 5000 samples for each of the datasets.

III. WHERE IS THE MODEL LOOKING?
One of the primary hurdles that limit the reliability of DL methods is their tendency to over-fit the data in undesirable ways.Here, we explore an approach that significantly mitigates this issue and helps our model focus only on the relevant features in the input data.Also, the signal-specific information is minimal compared to the actual amount of data fed to the model.Therefore, a reduced representation of the input data that captures the complete signal information is particularly beneficial for any analysis.Specifically, for the case of modelled signals, this reduced parameter space can, in principle, be written in terms of only the non-degenerate set of intrinsic and extrinsic parameters of the signals.Therefore, a model that reads the complete signal information from the data should ideally be able to represent the signal in a reduced parameter space that can be transformed back to the full signal without any loss 5 .However, the presence of additive noise inevitably introduces statistical errors in estimating source parameters in any space.Though the resulting errors in the inferred parameters can be theoretically characterised using the statistical noise model.
For the signal model, an extensive matched-filter-based search uses maximum likelihood estimation (MLE) analytically over the extrinsic parameters and numerically over the intrinsic parameters.MLE essentially achieves the reduced representation of the input strain in the parameter space of CBC signals.However, such searches are computationally expensive, which limits the coverage of some parameters, such as eccentricity, component spins, etc. Full coverage of such parameters could increase the computation cost by orders of magnitude [83,84], which is presently impossible to accommodate.Also, the noise model considered for constructing the log-likelihood function for the search assumes stationary Gaussian noise for calculating the signal-to-noise ratio [85].In general, search models built with this assumption (matched-filter-based searches or DL approaches trained against Gaussian noise alone) prove inadequate when the noise has non-Gaussianities in the form of glitches.Including glitches in the construction of the search log-likelihood function can play a crucial role in separating signal and noise more effectively.However, this is not achievable in practice due to the lack of an analytical description of the glitches.Rather, various signal consistency tests are often employed as additional measures to mitigate the adverse effect of glitches on search sensitivity [86,87].DL models being excellent approximators of non-linear features can learn the broad set of morphologies followed by these glitches.In the presence of Gaussian noise alone, the performance of a DL model can be compared against that of matched filtering.However, their added ability to learn artefacts originating from glitches allows them to surpass matched-filter-based searches in principle.However, these arguments are based on theoretical assumptions of the capabilities of DL models and achieving such performance in practice is often very challenging.

A. Introduction to VAEs
A specific class of neural networks, known as autoencoders, offer encoding of the data into an efficient coding [88].The architecture of auto-encoders involves an encoder E ϕ that encodes the input data X into a latent space coding Z and a decoder D θ that learns to regenerate the required output data X ′ from it, where X ′ can be same as X or a transformed version of X .The reduced coding Z offers a condensed representation of the data.Going a step beyond, a VAE offers the representation of each input sample x as a multivariate posterior distribution q ϕ (z|x) for x ∈ X and z ∈ Z [89].Typically, the posterior q ϕ (z|x) is modelled as a Gaussian and is regularised to be as close as possible to the prior of D θ using the Kullback-Leibler (KL) divergence.This prior is denoted by p model (z) and is employed for generating new data using D θ .It is generally taken as the Normal distribution N (0, 1).As values of z are drawn from the distribution q ϕ (z|x) at the latent layer, performing back-propagation becomes problematic.This issue is solved using the re-parametrisation trick, which models the distribution q ϕ (z|x) as a multivariate Gaussian whose mean, µ and standard deviation, σ are predicted as latent space outputs from the encoder.Now, q ϕ (z|x) can be written as, where the random variable ϵ is described by the independent fixed distribution N (0, 1).Therefore, the back-propagation is made possible with the gradient descent performed over the parameters µ and σ that govern the distribution q ϕ (z|x) instead of the sampled values z.The final loss function to train a VAE, given by Equation 4, has two components.First, the reconstruction loss that compares the output against X ′ , and second, the KL divergence-based loss.The loss terms are decomposed into the sum of losses coming from each sample j from the training data.Training of the VAE model with such a loss function instils bounded, smooth and meaningful parametrisation of the latent code Z.

B. Reduced representation of signal space using VAE
We build a VAE for the task of denoising the CWT maps.However, the denoising task here is not the literal pixel-level denoising of the image.Instead, it is transforming the input CWT maps from INJ into the corresponding CWT maps of CLEANINJ, which demands preserving of features belonging to the injection exclusively and ignoring other artefacts belonging to the noise class as shown in Figure 4. Therefore, during training, the model learns to look for only the signal features, encode them into the latent space and decode a pure estimate of the signal.Since our final classification task is only signal-specific, any lack of retention of other non-signalspecific information from the input, such as glitch or Gaussian noise features, is not a concern.Rather, the insensitivity of the encoder to some of the glitch features may be desirable for eliminating the respective glitches right at the first step of inference.We construct the encoder part of the VAE model using sub-modules consisting of parallel branches of convolutions of different kernel sizes as shown in Figure 20.The parallel modules are inspired by Inception modules which help extract features at varying scales [90].Similarly, the architecture of the decoder part is shown in Figure 21.We use L1 regularisation with l 1 = 0.001 for kernel and bias parameters in all convolution layers of the encoder and decoder.The weights of convolutional kernels were initialised using Xavier uniform initialiser [91] while biases were initialised with a constant value of 0.2.We use a cyclical learning rate varying in the range 10 −5 − 10 −4 , which resulted in faster convergence and better performance at the same time [92].Since, such a scheme results in oscillating losses, we use a callback function to monitor the total loss during training and store the best checkpoint with minimum loss.The choice of optimiser was stochastic gradient descent (SGD) with momentum 0.01.We also set the gradient clipping at 0.5 to deal with the exploding gradient problem.We trained our VAE model on four NVIDIA A100 Tensor Core GPUs for 600 epochs (run-time ∼ 11 hours).We use t-distributed Stochastic Neighbour Embedding (t-SNE) to visualise the latent space encoding at different stages of model development and make qualitative observations on their performance.t-SNE is a statistical method for visualising high-dimensional data by giving each data point a location in a two or three-dimensional map [93].Unlike Principal Component Analysis (PCA), a linear dimensionality reduction algorithm, t-SNE is based on probability distributions with a random walk on neighbourhood graphs to interpret the complex polynomial relationships between features and find structure within the data.We generate the t-SNE visualisations of the latent space coding obtained from the encoder at two separate stages.The first being the encoder output from the trained VAE model, while the latter one is obtained after the classifier fine-tuning is accomplished (described in Section IV).The prime difference between the two is that the first model was trained with the INJ dataset alone, while the second one was trained with INJ, GN and GL datasets.To obtain the latent space required for t-SNE, we analyse the validation data samples from INJ, GN, GL and test data samples from TRIG-TEST with both the encoder models.Before proceeding with t-SNE visualisation, we reduce the dimensions of the latent space from 256 to 10 using PCA.The benefits are twofold, it helps suppress some noise in the final representation and speeds up the algorithm without losing much information.After performing PCA, we find that the explained variance ratio for the encoder output after VAE training was 0.46 compared to 0.94 after the fine-tuning stage.From the t-SNE representations shown in Figure 8, it is evident that even without training on the GN, GL and TRIG-TEST datasets, the initial encoder is already able to separate them satisfactorily.The class separation improves further after fine-tuning.

IV. UTILITY OF THE LATENT CODE FOR CLASSIFICATION
We develop a hybrid classifier model by appending the encoder with additional densely connected layers that map the latent space to a single output node predicting the presence of a CBC signal in the input (refer to Figure 9).The input to the densely connected layers is the mean-vector µ of the multi-variate Gaussian distribution from the latent space.We train this hybrid model in two stages.Initially, we froze the encoder and only allowed the appended dense layers to learn the mapping between the latent space and the binary outputs.The training data was composed of signal class (INJ dataset) and noise class (GN and GL datasets) with 1's and 0's as targets, respectively.The losses for GN and GL datasets were weighed down by half to maintain class balance.The data were shuffled before the start of training.We train the hybrid network for 600 epochs (run-time ∼ 5 hours 30 minutes) with a cyclical learning rate varying between 10 −5 − 10 −4 and binary cross-entropy as the loss function with label smoothing set to 0.005.We used Adam optimiser with β 1 = 0.9, β 2 = 0.999 and ϵ = 10 −7 .The gradients were clipped at a maximum value of 50.At the end of the training of the first stage, both training and validation accuracies were close to 99.6% while the losses saturated around 42.4.After the parameters of the densely connected layers were learnt, we further fine-tuned the network for 600 epochs (run-time ∼ 6 hours) by making the entire model, including the encoder, trainable.We used a callback function that monitored the validation loss during training and stored the best checkpoint version of the network.The best checkpoint had a training accuracy of 99.99% and a validation accuracy of 99.55% while the corresponding loss values are 0.025 and 0.031, respectively.The evolution of loss and accuracy for the training of the hybrid classifier for both frozen and fine-tuning stages is shown in Figure 10.Note that, especially during fine-tuning, the accuracies appear to stagnate while the binary cross-entropy losses continue to de- crease.This indicates an increasing separation between the output distributions corresponding to INJ and noise classes.

V. BRINGING ROBUSTNESS TO THE MODEL
Large DL models typically show an over-sensitivity to certain features over others [94].A class of techniques called adversarial attacks exploits this weakness of DL models and finds their failure modes.This can either be accomplished by systematically perturbing the input data with gradient ascent or training a separate generative adversarial DL model.The first kind, that of perturbative attack on the regular input, can be as simple as a single pixel attack [95] or a more generic attack [78], leading to a dramatically degraded model performance.These perturbations in the data are typically insignificant when perceived by a human.The second kind of attack can be achieved through generative adversarial frameworks which involve training generative DL models which can create images that deceive the discriminator successfully.Initialised randomly, these models learn to generate images that do not contain any practically meaningful information, yet are able to mislead the classifier [96,97].
Our fine-tuned classifier shows very promising performance provided the inference is made on data that belong to the same input space as the training data.However, as mentioned above, the model may show a dramatically degraded performance if the data for inference deviates from this space.Such deviations could occur due to changes in the data characteristics or the pre-processing scheme.Such changes can render the model inapplicable, and retraining the model with an updated dataset may be required each time they occur.This can become especially challenging in certain cases, for example, new types of glitches that may keep arising with changing operating conditions of the detectors.
In recent years, multiple strategies have been explored to make DL models more robust [78,98,99].Particularly, including adversarially perturbed images in training has been shown to improve the robustness of the models to small perturbations [78].However, it leads to robustness against perturbations only in the vicinity of the input space covered by training data.As we intend to explore the failure modes in a parameter space that is not necessarily close to the input image space, we take a somewhat different approach of using the generative adversarial framework to bring robustness to our classifier, as described in the next section.

A. Introduction to GANs
A GAN is composed of two DL models, a generator G ϕ and a discriminator D θ , playing a minimax game and being trained in the process [96].The generator uses a latent space input z to generate fake data G ϕ (z).The discriminator analyses this data and gives the output predictions D θ (G ϕ (z)) of them being real.In such a setup, D θ is trained for correctly classifying the training data x from the fake data G ϕ (z) generated by G ϕ .In other words, it is trained with 1's and 0's as targets for D θ (x) and D θ (G ϕ (z)), respectively.Therefore, for D θ , the training objective is the minimisation of a loss function given by [96], (5) through gradient descent.Simultaneously, G ϕ is trained to generate fake data that can mislead the discriminator into giving positive outputs.The loss function that is to be minimised in this case is [96], Provided G ϕ and D θ have enough capacity, it can be proved that the optimisation algorithm saturates when the generator learns to produce data that is indistinguishable from the real data [96], i.e., the posterior of G ϕ , p g converges to that of data, p data which is the global optimum of the minimax game.At this point, the output of D θ saturates to D θ * (x) = p data p data +pg , which, for equal proportions of training and generated data, is

B. Fragility of classifier and proposal for robustness
Firstly, we demonstrate the fragility of our fine-tuned model in its original state by identifying its failure modes.We trained five adversarial generators with the loss function given in Equation 6 against the classifier, which was frozen.Training in this setup is equivalent to training only the adversaries in a GAN to find the failure modes of the discriminator.The adversaries were initialised with random weights at the start of training.Hence, their output images are unrealistic in the initial epochs.However, we observed that the adversaries learn quickly, within ∼ 5 epochs, and produce images classified as signals by the classifier with very high output probabilities.Figure 11 shows a few sample images generated by the adversaries at the end of 5 th epoch.It can be seen that these images do not belong to any meaningful category, and yet, are classified as CBC signals with output values > 0.99.
To train our classifier against its failure modes, we trained the discriminator alongside the five adversaries in the same GAN framework.The motivation for such training is to identify and exhaust all the possible failure modes of the classifier by simultaneously besetting it with several adversaries while the training is still in progress.It would lead to an ultimate saturation point when the adversaries start producing images very close to the signal space in the training data and exhaust all other possibilities.
However, there are several pitfalls in this proposal.We need to modify the regular GAN framework to achieve desirable performance for the required task of robust classification.GANs are typically used for obtaining a competent generator model that generates realistic images similar to the training data.Therefore, the usual training data of GAN comprises of samples belonging to only one category.The biases inherited by the discriminator through such training are not a concern, as the discriminator usually gets discarded at the end of training.In our case, we wish to retain the discriminator as a robust classifier and discard the generator instead.Thus, retaining the capabilities of the original classifier is crucial.In principle, the generator is expected to exhaust all the failure regions of the input space, including those corresponding to GN and GL.However, since the input space of the model can be considered infinitely large for all practical purposes, a generator model starting from a random initialisation of parameters can only attack the discriminator from one side with respect to the training data.During training, it typically covers only a limited region that lies along the path explored by it and cannot leap to a vastly different region of input space.Our experiments also confirmed that training a GAN with only the INJ dataset results in a bias towards the signal class and a loss of performance in the other noise classes, GN and GL.To make the training setup more complete and retain the performance of our classifier on all three classes of data, INJ, GN and GL, we use the same composition of training data that was used for training the classifier as described in Section IV.

C. Training for improved robustness
As mentioned before, since covering the entire input space is practically impossible, absolute robustness is unachievable.Nevertheless, to exhaust as much space around the input space as possible, we create a setup where our fine-tuned hybrid Since it is impossible to achieve absolute robustness using this approach, the very need for such an additional training effort may naturally be questioned.To thoroughly examine the actual changes the GAN training brought about, we inspect the transformations of sample input data inside our classifier at various levels with and without GAN training.Specifically, we focus on the changes in the operations of 3 × 3 convolutional filters from the second and third parallel modules, and the one before the flattening operation towards the end.We pass a sample image from INJ and GN data each through the classifier states before and after GAN training and obtain the transformed maps at the outputs of these convolutions as shown in Figure 18 and Figure 19.For the classifier state be-fore GAN training, the transformed maps display a high degree of similarity in terms of the features collected from the input image.Whereas, in the case of the robust classifier obtained after the GAN training, this degeneracy is largely eliminated.Fewer and more helpful features are recovered that can be translated to the binary output more reliably.This can be compared with the sparseness achieved through regularisation [88] which is a highly desirable property for a DL model.
To test the performance of the classifier on real data and study the possible deterioration in its capabilities after GAN training, we ran the inference on the test datasets with the states of the classifier before and after GAN training.Figure 14 shows the separate sets of ROC curves with GL-TEST and TRIG-TEST as the noise class for classifier before and after GAN training.The area under the curve (AUC) corresponding to the ROC curve of CBCs vs glitches reduces from 0.989 to 0.984, while for CBCs vs PYCBC triggers it improves from 0.976 to 0.989 after GAN training.This change is relatively insignificant and can be attributed to statistical fluctuations in performance.Therefore, we can say that the improvement in robustness is achieved with no adverse impacts in terms of the model performance on real data.

VI. SEARCH
Though improving the sensitivity of CBC searches was not the target of this work, to assess the capabilities of our GAN trained model to identify events in real data, we implement a direct search on the GW strain data from H1 and L1 7 .For inference, we analyse the GW strain data in 512 seconds long chunks.The data were first whitened with a PSD generated from the same chunk.We analyse 1 second-long slices of whitened strain by obtaining their CWT maps.For analysis, we stride the data with fixed steps, generate CWT maps and save them for inference with GPU later.Taking a shorter stride leads to wasteful computation of image generation and GPU inference, whereas taking longer strides leads to the possibility of missing out on signals.To systematically decide a reliable stride value, we need to identify a sensitive time window of the classifier.One can anticipate the sensitive time window as a function of chirp mass since the chirp morphology is expected to affect such a window.Once this window function is obtained, we can choose an optimal stride step such that no detectable signal is missed while the number of images generated and analysed remains as low as possible.

A. Estimating the sensitive time-window
To obtain the sensitive time window of the classifier, we produce a set of injections generated by uniformly sampling the chirp time τ 0 between 0.01 − 2 seconds.To calculate τ 0 , we take the starting frequency as 30 Hz.The component masses are constrained to lie in the range 2 − 400 M ⊙ with mass ratio q < 8.The injections were generated without noise, in a setting similar to the preparation of CLEANINJ dataset as described in Section II B. Next, the analysis window was stridden over the injection so that the merger of the signal traversed from the left to right end of the window in steps of 0.05 second.The complex CWT map of the clean injection was obtained at each step and added to the CWT maps of 10 different simulated noise realisations.An average of the predictions on this stack of CWT maps gives a more reliable output score which was recorded for the corresponding injection at the respective step.The injections were generated with op-

B. Search on real LIGO data
To test the independent performance of our classifier, we perform a CBC search on 8.8 days long chunk of real data from LIGO Hanford and Livingston detectors between May 12 − 21, 2019 UTC.This data includes six gravitational wave candidates with P astro ≥ 0.5 as listed in GWTC-2.1 [1].For estimating the sensitivity of the search, we also create an injection set whose parameters are described in Table II.The image data for CWT maps were generated as described in the previous section with a stride of 0.1 second, which translates to ∼ 5 million images in H1 and ∼ 5.9 million images in L1.The computational cost was observed to be ∼ 1 CPU core second per image, out of which, a significant fraction came from the operation of saving the image file on the disk.We plan to optimise this cost in future work.We ran a multi-GPU inference on these images to obtain the model predictions as time-series data (run-time ∼ 16 hours on four NVIDIA Tesla P100 GPUs).We call these predictions P CBC scores which correspond to how the classifier ranks the triggers for them to belong to the CBC class.However, note that this is not a realistic statistical probability of the trigger belonging to the CBC class.Triggers were generated using peak finding and clustering over a 1 second window.Also, a lower cut-off of 10 −5 was used for trigger collection.By dividing all the triggers by the maximum trigger value of the entire run, we normalised them to the range [0 − 1].We used the PYCBC workflow for trigger collection and post-processing of these triggers for coincident analysis.The coincident foreground events were generated by collecting triggers that lie in a time window of 0.4 second to conservatively collect triggers captured at an offset of ∼ 0.2 second from the actual merger time in each detector.The P CBC data from H1 were time-shifted in 0.9 second steps with respect to those from L1, and the same coincidence collection procedure was repeated to obtain the background events.We build the coincident ranking statistic as a simple multiplication of the P CBC values of the coincident triggers, which essentially expects the triggers to belong to the CBC class in both detectors at the same time.We do not consider single detector events in this work.We perform injection recoveries by looking for coincident foreground events lying within 2 seconds on either side of the injections.From the significance of these recovered injections, the sensitive distance for the search can be found [100].The sensitive distance as a function of FAR is plotted in Figure 16 for different chirp-mass bins.The sensitive distance at the FAR of 1 per month is found to be 140 Mpc, 996 Mpc and 2173 Mpc for chirp mass bins [1,10], [10,40] and [40,80], respectively.
The foreground and background events obtained from the search are shown in Figure 17.
Our search recovered two significant CBC events, GW190519_153544 and GW190521_074359, with FAR < 1 per 20350.4years.
There are four additional events in the analysed chunk that were found by other search pipelines in GWTC-2.1,viz., GW190513_205428, GW190514_065416, GW190517_055101 and GW190521.Initial investigations suggest that there could be several reasons our search did not find them, viz., one of the detectors having a weaker signal or the presence of a glitch artefact in addition to signal (GW190514_065416 and GW190513_205428), signal lying outside the parameter space considered for training the model (GW190521) and lack of almost any tail feature in the chirp morphology (could be due to high spin modulations in the signal) leading to resemblance with blip glitches (GW190517_055101).However, it should be noted that due to the inherent opaqueness of the DL models, it is very difficult to provide a definitive and complete explanation of the case-to-case performance.It is also worth mentioning that the search results could be further improved by optimising the hyper-parameters of the search.However, we did not pursue this aspect as improving the search sensitivity is not the goal of this work.

VII. CONCLUSIONS
DL algorithms hold much promise due to their ability to learn and model vast amounts of complex data -a task that is formidably difficult to perform using classical methods.Their human-like capability to generalise the learned features, along with the deployability on accelerated computing hardware like GPUs, make them better placed for solving many data-intensive problems across domains.However, these algorithms lack transparency over the learned features and a prescription of where they could possibly fail.The fragility of large DL models has been exposed using various methods, and improving their robustness is an area of active research [78,98,99].
In the field of GW astronomy, identifying true GW signals in the data against the spurious noise transients of terrestrial origin is one of the leading applications of DL that has invoked a lot of interest.However, the lack of transparency and trustworthiness of these models has been a significant hurdle in deploying them for production-level analysis and attestation of their results.In this work, we take the first steps towards having better control over the performance of the DL models and improving their reliability to perform CBC searches in the GW data.The key developments are summarised below - We used our classifier as a discriminator and trained it alongside 5 adversarial generators.We tested the model's performance before and after GAN training on separate test datasets.These datasets were composed of injections in real data, glitches and random triggers collected from matched-filter searches.We found that the GAN training had no significant effect on the model's performance on real data, while it brought some fundamental improvements in terms of sparseness which could contribute to enhanced robustness.We obtained the transformations of sample inputs at different layers of the classifier and found that a large amount of degeneracy in the extracted features was removed after GAN training and much fewer channels were used to make inferences.Also, the outputs from the layers were better localised, which could help in enabling a more robust mapping to the binary output.
• Implementation for the search of real data and estimation of search sensitivity: Lastly, we examined the stand-alone capability of the model to search for CBC signals in 8.8 days of real LIGO data.We chose an optimum stride for the generation and analysis of CWT maps by systematically investigating the model's response to injections as a function of time.This investigation helped in optimising the computational cost while ensuring recovery of all underlying signals to which the classifier is sensitive.
We also performed an injection study to estimate the sensitive distance of the CBC search.At a FAR of 1 per month, the sensitive distance for chirp mass bins [1,10], [10,40] and [40,80] was 140 Mpc, 996 Mpc and 2173 Mpc, respectively.Importantly, our search recovered two CBC events, GW190519_153544 and GW190521_074359, with high significance.
The analysed data also contained four additional events, GW190513_205428, GW190514_065416, GW190517_055101 and GW190521, found by various pipelines in GWTC-2.1 [1].We discuss the potential causes of our classifier missing these signals in Section VI B. As search sensitivity was not the focus of this work, we did not pursue possible improvements in the results by optimising the hyper-parameters of the search.Also, the overall search sensitivity can be considerably improved if our classifier is used in conjunction with an existing offline search using MLStat [38].In future, we intend to analyse the real LIGO/Virgo data with this classifier in the framework of MLStat and study the improvement in the significance of events from GWTC-2 and GWTC-3 [8,9].
Apart from providing speedy inference, the fast-growing field of DL applications in GW astronomy is expected to compete with, if not complement, conventional algorithms in terms of sensitivity and parameter space coverage in the near future.Through this work, we have addressed a crucial aspect of this field: the reliability of DL approaches.While the novel proposal to increase the robustness presented here is still in its nascent stages and under further development, we believe these techniques will pave the way for the successful integration of DL methods in production-level analyses.We hope this work goes beyond the field of GW astronomy and proves to be a valuable stepping-stone for other areas of science and industrial applications.

1 FIG. 1 .
FIG. 1. Example CWT maps of two injections highlighting how the chirp-masses of CBC signals determine their visibility in the timefrequency domain representation even when the recovered SNR values are nearly the same.However, in γ metric (as described in Equation 2), these injections are well separated with values 26.5 and 44.9, respectively.

FIG. 3 .
FIG. 3. Variation in the visible strength of chirp features with the metric γ(θ), independent of the chirp mass.The images are sampled from the INJ dataset with M chirp and γ values lying in the corresponding bins shown in the brackets.Recovered SNR values are annotated on respective images.Higher SNR values are required for CBCs with lower chirp masses to maintain equal degree of signal visibility.

FIG. 4 .
FIG. 4. The schematic of the VAE model, along with its input (INJ) and target output (CLEANINJ) used for training.The sampling layer at the latent space learns to represent the CBC features extracted from the input as a multivariate Gaussian, thereby obtaining their reduced and smooth representation.

FIG. 5 .
FIG. 5. Evolution of losses over 600 epochs of training.The KL divergence saturates to an almost constant value after ∼ 400 epochs while the reconstruction loss continues to reduce slowly.Note that the y-axis is in log scale.The VAE model reaches a satisfactory level of reconstruction ability at the end of training.

FIG. 6 .
FIG. 6. Histograms of KL loss (green) and reconstruction loss (red) for the VAE on training data (solid line) and test data (dashed line).The histogram for reconstruction loss shows mild overfitting to the training data.This is a commonly observed phenomenon during the training of DL models resulting in the difference between the training and validation performance.

Figure 5
Figure 5 shows the evolution of losses during model training with an overall asymptotically saturating behaviour with oscillations originating from cyclical learning.A comparison of the performance of the trained VAE model on training and validation INJ datasets is shown in Figure 6.The histograms of KL loss for training and validation datasets are similar.Whereas, the reconstruction loss exhibits slight overfitting towards training data, a tendency that is commonly observed in the training of DL models and is not a cause of serious concern unless the predictions for training and validation data show significant variations from each other.The bestcollected checkpoint had a reconstruction loss value of 19.3 and the KL divergence loss value of 15.3.To showcase the model's ability to reconstruct clean signals from input injections, in Figure 7, we show 4 randomly selected images from

FIG. 8 .
FIG. 8. Latent space representations of the validation data from INJ, GN and GL datasets along with the TRIG-TEST dataset visualised using t-SNE at the encoder output.(a) Encoder output taken from the trained VAE model.Though the VAE was trained on INJ and CLEANINJ data alone, the encoder can already separate the classes very well.(b) Encoder output from the final classifier after fine-tuning.Since the classifier was trained on INJ, GN and GL datasets, an improved class separation can be observed after fine-tuning.

FIG. 10 .
FIG. 10.Performance metrics for the hybrid classifier are shown for its two stages of training: with encoder frozen and fine-tuning of the entire model.During both stages, the classifier is trained to separate the INJ class from the noise classes (GN and GL).(a) Evolution of binary cross-entropy losses for training and validation data.Initially, with the encoder frozen, both training and validation losses saturated to a value of 42.4.During fine-tuning, the best checkpoint was collected with a training loss of 0.025 and a validation loss of 0.031.(b) Evolution of training and validation accuracies.At the end of the training, both the accuracies for the hybrid model with frozen encoder were close to 99.6%.The best checkpoint had a training accuracy of 99.99% and a validation accuracy of 99.55%.

FIG. 12 .
FIG. 12. Schematic of the GAN setup with the hybrid classifier being trained as the discriminator alongside 5 adversarial generator models tasked with finding its failure modes.The training data was the same as that used for training the hybrid classifier earlier.The initially unrealistic fake images produced by the generators eventually converge to images similar to those from the INJ dataset.
FIG. 13.(a) Evolution of discriminator losses given by Equation 5 in the GAN setup during training of 400 epochs.Loss on fake data generated by the generators is shown along with the losses for INJ, GN and GL data.(b) Progress of generator losses given in Equation 6 through the training.Five generators were trained simultaneously alongside the discriminator.The initial incapability of generators is overcome quickly in the first ∼ 30 epochs.

FIG. 14 .
FIG.14.Classifier performance on real datasets before and after the GAN training.The ROC curves on the left were obtained through inference of INJ-TEST and GL-TEST as the signal and noise datasets, respectively, whereas for those on the right, the noise dataset was TRIG-TEST.The 0.5% decrease and 1.3% increase in the AUC values of the two test cases are statistically insignificant, indicating that GAN training does not adversely impact the performance of the classifier on real data.Also, it is worth noting that the overall AUC values are sufficiently high, given that, except for GL, we did not use real data during training.

FIG. 15 .
FIG. 15.The sensitivity of the classifier as a function of the location of the merger in the CWT map and the chirp-time of the CBC signal.The injections are stridden in steps of 0.05 second across the analysis window of 1 second (shown by dotted lines) and the average PCBC values are shown with a colour bar.The sensitive window can be seen to become relatively narrower for injections with shorter duration.

FIG. 18 .
FIG. 18.Comparison of image transformations of a sample CWT map taken from the INJ database with the model-state before (left column) and after the GAN training (right column).Outputs of the model extracted at the end of the 3 × 3 convolutional layers from second (top) and third parallel modules (middle), and the one before flattening operation (bottom) are shown.The extracted features on the left show many similarities.On the right, these degeneracies disappear, and only a few more helpful sets of features are extracted, which can be mapped to the binary output more reliably.

FIG. 19 .
FIG. 19.Comparison of image transformations similar to Figure 18 with a sample CWT map taken from the GN database.A similar difference in feature extraction can be observed between the two models.
FIG.20.Architecture of the encoder part of the VAE model along with its sub-components.Outputs of different layers, colour coded for distinction, are shown.The shapes of these outputs are given at the top with the number in the bracket denoting the size of the square kernel.The encoder consists of three parallel modules, which have convolutions in parallel and in series, inspired by the inception architecture[90].The latent space has a sampling layer for the multi-variate Gaussian distribution represented by the mean and variance vectors.
FIG. 21.Architecture of the decoder part of the VAE model along with its branch structure.The depictions are similar to those shown in Figure 20 except for the difference in the types of layers.The inputs to the decoder are the latent space mean vector codings of the INJ dataset and the outputs are the corresponding images from the CLEANINJ dataset.
FIG. 22. Architecture of the generators used for adversarial attacks and in GAN training with their branch structure.The input to the generator is a noise vector of size 256.The generator outputs are the fake images deceiving the classifier model.

TABLE I .
Choices of hyper-parameters for injections used in INJ and INJ-TEST datasets.Additional cutoffs of γ > 30 and ρ > 5 are applied to ensure uniform visible strength of chirp features across the parameter space.

•
New metric for enhanced purity of data: We started by addressing ways to maintain the purity of training data, especially for the CBC class across the parameter space.Due to varying chirp-time of the signals based on their component masses, the CBC injection data generated with a constant lower threshold on SNR leads to variations in the visibility of their chirp-features.We built a metric that accounts for variations in the peak intrinsic amplitude of the CBC signals and enables the generation of large amounts of data with uniform visibility of chirp features across the parameter space.•DLmodeldevelopment focused on signal space: To better control which features of the input image the model learns to focus on, we trained a VAE to reproduce only the chirp features from the input data and ignore everything else.Such a VAE encodes the signal-specific information from the input into a reduced, smooth latent space representation.Using densely connected layers, we mapped this information-rich latent coding to the binary output that corresponds to the signal or noise class.The signal class consisted of CBC signals sampled from a focused BBH parameter space injected into the simulated LIGO noise.Whereas, the noise class consisted of simulated Gaussian noise and glitches obtained from real LIGO data.We fine-tuned the model further to achieve high training and validation accuracies.•Investigation of fragility of DL model and proposal for robustness: Going beyond just careful model building, we tested this model for robustness by subjecting it to adversarial attacks.These attacks revealed several simple failure modes of the model which did not have any features similar to CBC signals.These results also underlined the crucial fact that despite various advantages, DL approaches can be extremely fragile.As the first step towards bringing robustness to our DL model, we employed a novel GAN setup to retrain the model.