Neural network time-series classifiers for gravitational-wave searches in single-detector periods

The search for gravitational-wave (GW) signals is limited by non-Gaussian transient noise that mimics astrophysical signals. Temporal coincidence between two or more detectors is used to mitigate contamination by these instrumental glitches. However, when a single detector is in operation, coincidence is impossible, and other strategies have to be used. We explore the possibility of using neural network classifiers and present the results obtained with three types of architectures: convolutional neural network, temporal convolutional network, and Inception Time. The last two architectures are specifically designed to process time-series data. The classifiers are trained on a month of data from the LIGO Livingston detector during the first observing run (O1) to identify data segments that include the signature of a binary black hole merger. Their performances are assessed and compared. We then apply trained classifiers to the remaining three months of O1 data, focusing specifically on single-detector times. The most promising candidate from our search occurred on 4 January 2016 at 12:24:17 UTC. Although we are not able to constrain the significance of this event to the level conventionally followed in GW searches, we show that the signal is compatible with the merger of two black holes with masses $m_1 = 50.7^{+10.4}_{-8.9}\,M_\odot$ and $m_2 = 24.4^{+20.2}_{-9.3}\,M_\odot$ at a luminosity distance of $d_L = 564^{+812}_{-338}$ Mpc.


Introduction
The breakthrough discovery of gravitational waves (GW) on September 14, 2015 [1], announced by the LIGO Scientific Collaboration [2] and the Virgo Collaboration [3], opened the era of GW astronomy. The detection happened during the first observing run (O1) of the LIGO detectors. With the subsequent observing runs, O2 and O3, performed jointly with Virgo, the list of detected GW signals has grown to 90 events. While the detected sources are mainly associated with the merger of binary black holes (BBH), they also include binary systems with neutron stars [4][5][6][7]. These detections are collected and characterized in the GW transient catalogs GWTC [8][9][10][11]. In May 2023, the fourth observing run (O4) started with increased detector sensitivity and consequently an enhanced expected rate of detections. GW transient signals are detected in the data by a variety of data analysis pipelines, see e.g. [11] for a recent review. In particular, matched filtering [12] is a prominent technique to search for signals when an accurate waveform model is available, as in the case of compact star binary mergers. Algorithmically, this consists of correlating the data with a large set of template waveform models (the "template bank") that are representative of all the morphologies the expected signal can possibly take.
To make robust detection statements, those pipelines have to address a major difficulty: the presence in the data of short-duration noise artefacts, often called "instrumental glitches" [13,14], that can mimic the GW signal [15,16]. A very powerful tool to discriminate the signal from noise glitches is time coincidence across two or more separate detectors (see [17] for a discussion on multi-detector noise rejection techniques).
Obviously, coincidence cannot be used during periods when only one detector operates. During the O1 and O2 observing runs, single-detector periods amount to about 30% of the observation time [18,19]. During O3, thanks to a more stable and reliable operation, this fraction was reduced to about 15% in O3a [20] and 11% in O3b [21] (the first and second six-month parts of O3). In total, more than five months of observing time fall in this category so far. The recently started O4 run may also include long periods of single-detector time.
The lack of coincidence makes it difficult to disentangle the signal from glitches and to measure the statistical significance of a trigger to high confidence levels. Several studies investigate ways to resolve these difficulties. Two methods [22,23] that allow the identification of gravitational-wave candidates in single-detector data have been employed in production in the context of low-latency gravitational-wave searches [24], enabling the initial identification of GW170817 and GW190425. Similarly, Ref. [25] introduces a framework for assigning significance to single-detector gravitational-wave events by leveraging the measured rate of binary black hole mergers. More recently, Ref. [26] studies the possibility of extending the multi-variate likelihood-ratio statistics used by the GstLAL pipeline to generate single-detector events. The likelihood estimation has been recently updated in view of the O4 run [27], and one of the improvements is the addition of a tuneable penalty for single-detector candidates to down-weight their significance [28]. To extrapolate the significance measure of single-detector triggers produced by the PyCBC pipeline [29], a method proposed in [30] allows the recovery of loud signals in single-detector data. In both cases, it is shown that the search sensitivity is significantly reduced compared to multi-detector searches.
Despite those developments, single-detector periods have received less attention than the rest of the observations and are covered in only a few studies. Following a "multimessenger" approach, several works looked for coincidences between data from a solitary gravitational-wave detector and gamma-ray observations from the Fermi Gamma-ray Burst Monitor [31][32][33]. Three searches for binary mergers in single-detector periods relied on gravitational-wave data only. Ref. [34] presents a search which specifically targets a narrow range of low masses motivated by the population of known double neutron-star binaries. Two contributions present the results of searches for binary mergers over the entire range from 1 to 100 $M_\odot$ for the component masses, for the observing runs O1 and O2 [35] and for O3 [30]. The former finds two candidate events observed in single-detector periods: 2015-12-25 04:11:44 UTC with the LIGO Hanford detector and 2016-01-04 12:24:17 UTC with the LIGO Livingston detector. The first candidate event has a low significance, with a probability of astrophysical origin [36] $p_{\rm astro} = 0.12$, while the second has a larger significance, $p_{\rm astro} = 0.47$. However, for the second event, an excess power observed in the residual after subtraction of the best-fit waveform from the data suggests that it may not be of astrophysical origin, and it is thus discarded.
Glitches of different types vary widely in duration, frequency range and morphology. It is difficult to construct a statistical model able to capture the overall complexity of the glitch populations. Their complex and time-evolving nature makes glitch identification and rejection a natural use case for machine learning (ML). In principle, this approach allows one to train a classifier able to distinguish between different types of input (glitches versus real GW signals in our case), and thus to learn a possibly very complex and high-dimensional statistical model from a set of examples.
As in many scientific fields, the use of ML has recently gained in popularity in the context of GW astronomy. There is a fairly large body of work pertaining to various aspects, ranging from denoising, glitch classification and cancellation, waveform modelling, and searches for GW signals to astrophysical parameter estimation and population studies (see e.g. [37,38] for recent reviews).
In the context of GW signal searches, convolutional neural networks (CNN) [39] have been investigated to detect BBH signals for both single- and multi-detector cases [40][41][42][43][44][45]. The primary motivation put forward in those contributions is the computational gains expected from the use of CNNs compared to matched filtering techniques.
So far, a large fraction of those investigations use simulated Gaussian noise [40,42,43,45]. In this case, it is not possible to learn the non-Gaussian component of the instrumental noise. Only a few studies use real GW data including glitches [41,44]. The classifiers obtained in those contributions are limited to a false positive probability (i.e., noise or glitches classified as signal) of about 1%. This corresponds to a false alarm rate of once every 40 minutes, which is not sufficient in practice. A recent review [46] compares different approaches on a mock data challenge.
The purpose of this study is to enhance the ability of neural-network-based searches to reject noise artifacts and improve their sensitivity, with a particular focus on analyzing data from a single detector. The goal is to achieve a false alarm rate similar to that of current online searches performed by the LIGO-Virgo-KAGRA collaboration (LVK), i.e. two false alarms per day [47]. We explore various network architectures, particularly those designed for time series classification [48,49].
We trained and tested neural network classifiers using a dataset produced from one month of O1 data collected by the LIGO Livingston detector, during which no GW signals were detected using the matched filtering based searches.
Section 2 provides details on how the training and testing sets are generated, while Sec. 3 describes the structure of the various neural network classifiers being considered. The performance and efficiency of the classifiers are assessed using testing data, and the results are presented in Sec. 4. We applied these classifiers to the remaining three months of O1 data, including segments associated with the three GW events detected during O1. Sec. 5 summarizes the results of this analysis. We first checked the classifiers' response to the known events detected during O1. A particular focus is then given to the single-detector times. Interestingly, we found that only one data segment was classified as "signal" by all three classifiers we considered. This event coincides with the single-detector event found by [35] in the LIGO Livingston data, as mentioned above, which the same study downgraded to a noise artifact. The additional checks we conducted on this event led us to a different conclusion, as they confirmed its compatibility with an astrophysical origin. Finally, Sec. 6 concludes on the applicability of the proposed methodology.

Generation of datasets for training and testing
The typical approach for applying ML methods to GW detection is to treat it as a classification problem, see e.g. [40][41][42][43][45]. In this approach, we aim to determine whether a given segment of GW strain data of fixed duration contains an astrophysical signal or not. This problem can be solved by developing an ML-based classifier that is trained using example data. We produce training data labeled as follows:
• noise: the data are compatible with stationary background noise, i.e., are free of transient instrumental artifacts (glitches) or known GW events,
• glitch: the data include one or several transient instrumental artifacts (glitches),
• signal: the data include a (simulated) astrophysical signal, added to the stationary background noise.
This three-class approach differs from other contributions in the literature, which consider only two classes. The presence of glitches is known to significantly alter the statistical distribution of the data. Assigning a specific label to data segments containing glitches may help the classifier achieve improved performance. Furthermore, the relative significance assigned to each class could offer valuable information when evaluating the contents of a given segment.
Training and testing data are extracted from the dataset of the observing run O1, which was publicly released via the Gravitational Wave Open Science Center (GWOSC) [50]. Specifically, we utilize the data from the LIGO Livingston detector spanning one month between November 25, 2015 (GPS time 1132444817) and December 25, 2015 (GPS time 1135036817). Throughout this duration, no GW signals were detected by the standard search pipelines.
In this period the available L1 data amounts in total to about 13.3 days (1,147,457 s), of which 3.6 days (312,284 s) were in single-detector time, i.e. 27% of the time.
The raw data are sampled at 16 kHz. The data are downsampled to 2048 Hz, band-pass filtered between 20 Hz and 1 kHz, and whitened by applying the inverse amplitude spectral density (ASD) in the frequency domain. The ASD is estimated over stretches of variable length, depending on the duration of uninterrupted data-taking periods (minimum duration is 37 s and maximum is 100,573 s). The data are divided into one-second non-overlapping segments.
The data are distributed into the three classes introduced above, as explained in the next sections. Representative instances of the three classes are shown in Fig. 1.

The noise class
The noise class corresponds to segments that are free of known GW signals, glitches (see next section) or hardware injections. All segments in the dataset passed the first criterion, as no GW signals were confidently detected by standard pipelines over the selected period. Overall, there is a total of 750,000 noise samples in the one-month O1 dataset.

The glitch class
A database of glitches is created using two different sources: the unmodeled transient search coherent WaveBurst (cWB) [54,55] and the citizen science project Gravity Spy [56].
The cWB pipeline is an open-source software package designed to search for a wide range of GW transients without prior knowledge of the signal waveform. To evaluate the analysis background, cWB uses a resampling technique [54] that involves applying non-physical time shifts to the data before analysis. Loud, i.e., high signal-to-noise ratio (SNR), background triggers resulting from this procedure are good candidates for glitches. The loudest triggers in LIGO Livingston with an SNR higher than 5.8 were selected (258,480 glitches). This list was complemented with the Gravity Spy database (13,144 glitches). The timestamps and durations of the identified glitches from these two sources are collected in a single list, which is then used to label the one-second data segments from the O1 observing run. If the glitch duration is shorter than 1 second, the associated segment is labeled as a glitch. Note that the glitch has a random position within the one-second window. If the glitch duration is longer than 1 second, all segments that overlap with that glitch duration are labeled as glitch. Only the glitches whose time belongs to the data segments available on GWOSC are considered. In many cases, the glitches are closer in time than one second, so multiple glitches can fall in the same one-second segment.
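The labeling rule described above can be sketched as follows. This is only an illustration: the function name `label_glitch_segments`, the `(start, duration)` glitch representation and the boundary convention (a glitch ending exactly on a segment boundary does not mark the next segment) are assumptions, not details given in the text.

```python
import math


def label_glitch_segments(glitches, t_start, n_segments):
    """Flag every one-second segment that overlaps at least one glitch.

    glitches   : iterable of (start_time, duration) pairs, in seconds
    t_start    : start time of the first one-second segment
    n_segments : number of consecutive one-second segments
    Returns a list of booleans (True = segment receives the glitch label).
    """
    is_glitch = [False] * n_segments
    for g_start, g_dur in glitches:
        g_end = g_start + g_dur
        first = math.floor(g_start - t_start)
        last = math.floor(g_end - t_start)
        # Assumed convention: a glitch that ends exactly on a boundary
        # does not spill into the following segment.
        if g_dur > 0 and (g_end - t_start) == last:
            last -= 1
        # Clip to the available data; glitches outside are ignored.
        for idx in range(max(first, 0), min(last, n_segments - 1) + 1):
            is_glitch[idx] = True
    return is_glitch


# Two glitches share segment 4; the long one also covers segments 5 and 6.
labels = label_glitch_segments([(2.3, 0.2), (4.5, 2.0), (4.6, 0.1)], 0.0, 8)
print(labels)
```

Several glitches falling in the same segment simply set the same flag, which matches the remark that one segment can contain multiple glitches.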
From the one-month O1 data, a total of 150,000 segments receive the glitch label.

The signal class
The samples from the signal class are produced by adding simulated GW signals from BBH systems to the one-month O1 data in periods without known GW signals or hardware injections. For the training set, the data segments used to generate samples of the signal class are not used for the noise class or the glitch class, while for the testing set, the same data segments are used for both the noise and signal classes. To generate the astrophysical signals, the waveform model SEOBNRv4 [57] is employed, with a lower frequency cutoff of 30 Hz. The simulated signals are sampled, whitened, and band-pass filtered in the same manner as the data segments.
The masses of the binary BH used for generating the simulated signals in the signal class are chosen to ensure that they fall within the mass range observed by the LVK and that the signals are short enough to be contained within the one-second data segments. Specifically, the component masses $m_1$ and $m_2$ are chosen randomly, with the constraint that $m_1 > m_2 \geq 10\,M_\odot$ and the total mass $M = m_1 + m_2$ is uniformly distributed in $33\,M_\odot \leq M \leq 60\,M_\odot$. We consider non-spinning BH, so the dimensionless spin magnitudes $\chi_1$ and $\chi_2$ are set to 0. The phase at coalescence and the polarization angle are drawn uniformly in $(0, 2\pi)$, and the inclination angle in $(0, \pi)$. Since the focus is on a single detector, the right ascension and declination are not particularly important and are thus fixed to zero.
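As an illustration of these constraints, the sketch below draws component masses and evaluates the chirp mass $\mathcal{M} = (m_1 m_2)^{3/5} / (m_1 + m_2)^{1/5}$. The text fixes only the distribution of the total mass; drawing $m_2$ uniformly between $10\,M_\odot$ and $M/2$ is an assumption made here for concreteness, and the function names are hypothetical.

```python
import random


def sample_masses(rng, m_min=10.0, total_min=33.0, total_max=60.0):
    """Draw (m1, m2) in solar masses with m1 > m2 >= m_min.

    The total mass is uniform in [total_min, total_max], as in the text.
    ASSUMPTION: m2 is uniform in [m_min, M/2); the text does not say
    how the total mass is split between the two components.
    """
    total = rng.uniform(total_min, total_max)
    while True:
        m2 = rng.uniform(m_min, total / 2.0)
        m1 = total - m2
        if m1 > m2:  # guards against the rare draw m2 == M/2
            return m1, m2


def chirp_mass(m1, m2):
    """Chirp mass (m1*m2)^(3/5) / (m1+m2)^(1/5), in the units of m1, m2."""
    return (m1 * m2) ** 0.6 / (m1 + m2) ** 0.2


rng = random.Random(0)
m1, m2 = sample_masses(rng)
print(m1, m2, chirp_mass(m1, m2))
```

With these bounds the chirp mass lies between about 13 and 26 solar masses: the lower end is reached for $(m_1, m_2) = (23, 10)\,M_\odot$ and the upper end for two $30\,M_\odot$ components.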
The amplitude of the added signals is computed such that the corresponding optimal SNR $\rho_{\rm opt}$ is uniformly distributed between 8 and 20. Following [58], it is defined as
$$\rho_{\rm opt}^2 = 4 \int_0^\infty \frac{|\tilde{h}(f)|^2}{S_n(f)}\, df, \qquad (1)$$
where $\tilde{h}(f)$ denotes the Fourier transform of the template $h(t)$ and $S_n(f)$ is the power spectral density of the detector noise. To generate the signals, a fiducial luminosity distance $d_L$ of 100 Mpc is initially chosen, and then scaled to obtain the desired $\rho_{\rm opt}$. The final values of $d_L$ range from approximately 1 to 1300 Mpc. The simulated signals are added at a random position within the segment while ensuring the chirping part of the signal is completely contained in the segment. The final part of the signal is randomly shifted between -0.25 s and 0.3 s with respect to the center of the one-second segment. A total of 750,000 signal samples are generated.
Overall, the training set consists of 250,000 segments for the noise class, the same number for the signal class, and 70,000 for the glitch class. A 20% fraction of the training set is allocated for validation. The testing set, used to evaluate the classifier, comprises 500,000 samples for each of the noise and signal classes, and 80,000 for the glitch class. This provides sufficient statistics for characterizing the classifier's performance. In total, the training and testing datasets comprise 1,650,000 one-second segments, with 45% for the noise class, 45% for the signal class, and 10% for the glitch class. This amounts to a storage space of 26 gigabytes. Out of the total number of segments, 28% is used for training, 7% for validation, and 65% for testing.
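The amplitude scaling can be sketched numerically. The snippet assumes the standard optimal-SNR definition, $\rho_{\rm opt}^2 = 4 \int |\tilde{h}(f)|^2 / S_n(f)\, df$ with a one-sided power spectral density, discretized as a plain sum, and uses the fact that the strain amplitude (hence the SNR) scales as $1/d_L$. The function names and the toy numbers are hypothetical.

```python
import math


def optimal_snr(h_tilde, psd, df):
    """Discretized optimal SNR: sqrt(4 * sum(|h(f)|^2 / S_n(f)) * df).

    h_tilde : complex Fourier amplitudes of the template (positive freqs)
    psd     : one-sided noise power spectral density at the same bins
    df      : frequency resolution in Hz
    """
    integrand = sum(abs(h) ** 2 / s for h, s in zip(h_tilde, psd))
    return math.sqrt(4.0 * integrand * df)


def scale_distance(d_fiducial_mpc, snr_fiducial, snr_target):
    """Distance giving snr_target, using the fact that SNR scales as 1/d_L."""
    return d_fiducial_mpc * snr_fiducial / snr_target


# Toy example: a template with SNR 40 at the fiducial 100 Mpc must be
# moved to 400 Mpc to be seen with SNR 10.
print(scale_distance(100.0, 40.0, 10.0))
```

Equivalently, the waveform generated at the fiducial distance can be multiplied by `snr_target / snr_fiducial` before being added to the data.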

Classifier architectures
This section discusses the types of neural network architectures considered in this study. Similarly to other works [40][41][42][43][44][45], the classifier is fed directly with the one-second segment of the strain time series, i.e., a vector of size 2048. We experiment with three different network architectures, namely the CNN, as well as two other architectures specialised for time-series classification: Temporal Convolutional Network (TCN) [48] and Inception Time (IT) [49]. The last two, to our knowledge, have never been tested on this type of problem. The architectures are described in more detail in the following subsections. The model hyperparameters provided below have been tuned after a coarse exploration of the parameter space.

Convolutional Neural Network (CNN)
CNNs were first introduced for image classification [39]. They are now used for a wide variety of tasks, including the detection of GW signals [40][41][42][43][44][45]. In this study, we tested a range of CNNs similar to those considered in previous works.
We limited ourselves to shallow networks with five layers: four convolutional layers and one final fully connected embedding layer. For simplicity, we only report here on the best-performing CNN, whose structure is detailed in Table 1.
The convolutional layers are defined by the number of output filters, the length of the 1D convolution window (kernel size), the stride length of the convolution, and the activation function. The dense layer only requires the definition of the activation function. The input of inner convolutional layers is downsampled with a max pooling operation over a window size indicated in the table. The output of convolutional layers is processed by a dropout layer that randomly sets the input units to 0 with the frequency rate specified in the table. A global average pooling, followed by a dropout with a rate of 10%, is applied to the output of the last convolutional layer.

Temporal Convolutional Network (TCN)
TCN [48,61] is a neural network architecture specifically developed for sequence modeling problems. TCN has been shown to outperform generic state-of-the-art architectures over a diverse range of tasks and datasets. The TCN architecture is based on causal convolutions, where an output at time t is only convolved with past inputs from the previous layer. This allows the network to collect information from further in the past, using a combination of deeper networks (augmented with residual layers) and dilated convolutions.
In this study, we have tuned the hyperparameters of the TCN model to find a compromise between the best performance and a reasonable training time. We ended up using a network with a TCN layer consisting of $N = 6$ dilated convolutional layers with 32 filters, a kernel size of $k = 16$, default values of the dilation factors $d_{k=1\ldots6} = (1, 2, 4, 8, 16, 32)$ for the 6 convolutional layers, and a dropout rate of 0.1. The output of the TCN layer goes into a final dropout layer with a rate of 0.5, and a dense embedding layer closes the model.
A key parameter that governs the training efficiency is the receptive field, which is the size of the region in the input data that produces a given feature in the output. The receptive field of the TCN can be expressed as $R = 1 + 2\,(k - 1)\, d_{\rm tot}$, where $d_{\rm tot} = \sum_k d_k$ is the sum of the dilation factors [48]. With the above configuration, we have $R \approx 1900$. The data used in this work have a sampling rate of 2048 Hz, so each segment of data has 2048 points. The training with TCN is effective when R is much larger than the length of the input sequence [48]. To satisfy this constraint, for this model only, it is necessary to downsample the input data to 1024 Hz, therefore producing an input vector of size 1024.
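The receptive-field arithmetic can be checked in a few lines, using the formula $R = 1 + 2\,(k - 1)\sum d$ with the kernel size and dilation factors quoted in this section; the function name is hypothetical.

```python
def tcn_receptive_field(kernel_size, dilations):
    """Receptive field R = 1 + 2 * (kernel_size - 1) * sum(dilations)."""
    return 1 + 2 * (kernel_size - 1) * sum(dilations)


# Configuration used in this study: kernel size 16, dilations 1..32.
r = tcn_receptive_field(16, (1, 2, 4, 8, 16, 32))
print(r)               # 1891, i.e. R of about 1900
print(r > 1024, r > 2048)
```

The value 1891 is smaller than the 2048-sample segment at the original sampling rate but exceeds the 1024-sample input obtained after downsampling to 1024 Hz.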

Inception Time (IT)
IT [49] is a deep network ensemble designed specifically for time series classification. It leverages the concept of residual networks and incorporates Inception modules [62]. In a nutshell, the Inception module first produces a one-dimensional summary of the input multivariate time series (this is the "bottleneck" layer), and then convolves this summary through multiple filters of different lengths, leading to a multivariate output that provides inherently multi-resolution features. The module output is finally reduced by max pooling (pool size of 4) before passing to the next module.
The IT architecture is composed of five ResNet networks, each consisting of a sequence of d Inception modules (d being the depth) with two residual blocks. The outputs of the five models are combined through a global average pooling and a final softmax layer, used to produce the classification probabilities for the different classes. In this study, we have used the standard implementation of IT provided by the authors [49] with networks of depth d = 10, each with a bottleneck size of 32 processed through 32 filters with kernel sizes 20, 40 and 80.

Training process
The three classifiers are optimized using the training set described in Sec. 2 to minimize the categorical cross-entropy loss function. The default implementation of the Adam optimizer is used, with a batch size of 24 [59]. The training procedure is repeated 10 times with different (random) initializations of the model weights and dropouts, and the instance exhibiting the best Receiver Operating Characteristic (ROC) curve on the testing dataset (as explained in Sec. 4) is chosen. Note that this evaluation cannot be done with the validation dataset, as it does not provide enough statistics to compute the ROC in the relevant regime of low false alarm rates.
Throughout the training process, the model's area under the ROC curve [63] is evaluated on the validation data, and the model with the highest value is ultimately selected. The CNN, TCN, and IT models are trained for 50, 150, and 20 epochs, respectively. The best models are obtained at the 24th epoch for CNN, the 34th epoch for TCN, and the 5th epoch for IT. On the Tesla K40d GPU we used, the training times per epoch were 220 seconds for CNN, 1000 seconds for TCN, and 3320 seconds for IT.

Decision statistic
The final objective is to detect with high confidence the segments with a true astrophysical signal, i.e., to classify them as signal and to reject the other segments as noise or glitch. We aim to constrain false alarms to a rate of two per day (similar to the current online search pipelines). This implies that we should reject all but one noise or glitch segment from the testing set in $1.7 \times 10^5$ trials.
The classifiers output the probability of class membership for each of the classes, that is, three numbers between 0 and 1 that sum to 1. The final detection is performed by applying a threshold to the membership probability $P_s$ assigned to the signal class, which thus defines our decision statistic. The class membership probability is computed by the softmax activation function applied to the raw output (the "logits tensor") of the fully connected embedding layer which concludes the classifiers. Because of the high confidence level required, this threshold is very close to 1, thus requiring attention to the numerical precision in the evaluation of the membership probability (this issue, related to the precision of floating-point arithmetic, was already noted in [45]). This has consequences on the way the classification loss is computed from the membership probability at the training stage. We found that the categorical cross-entropy loss should be computed directly from the logits tensor rather than from the class membership probability after the softmax transformation.
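The precision issue can be reproduced with a toy example. The sketch below is not the implementation used in this study; it only contrasts two mathematically equivalent ways of computing the cross-entropy of one sample, here in double precision (the effect appears at much smaller logits in single precision). The function names and the logit values are illustrative assumptions.

```python
import math


def loss_via_softmax(logits, true_idx):
    """Cross-entropy computed from the softmax probability of the true class."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    p_true = exps[true_idx] / sum(exps)  # rounds to exactly 1.0 when confident
    return -math.log(p_true)


def loss_from_logits(logits, true_idx):
    """Same loss computed as logsumexp(logits) - logits[true_idx].

    The largest term is factored out and the remainder goes through
    log1p, so tiny contributions from the other classes are not lost.
    """
    m = max(logits)
    i_max = logits.index(m)
    rest = sum(math.exp(z - m) for i, z in enumerate(logits) if i != i_max)
    return (m - logits[true_idx]) + math.log1p(rest)


confident = [0.0, 0.0, 40.0]            # strongly favours class 2 ("signal")
print(loss_via_softmax(confident, 2))   # rounds to zero: no gradient information left
print(loss_from_logits(confident, 2))   # about 8.5e-18, still resolved
```

Once the softmax output saturates at 1, the loss computed from the probability is exactly zero and no longer distinguishes between very confident and extremely confident predictions, which is precisely the regime probed by thresholds close to $P_s = 1$.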
This numerical precision issue has an impact on the performance of both the TCN and IT classifiers. The right panel of Fig. 4 compares the ROC curves based on the detection statistic $P_s$ (see Sec. 4 for the details on how the curves are computed) obtained with IT when the categorical cross-entropy loss is calculated from the logits tensor (green) and when it is calculated from the class membership probability after the softmax transformation (red). The shaded area represents the range between the best and worst models among the 10 instances computed at training. When softmax is used, the uncertainty in the performance is larger (the shaded area is wider) and the classification efficiency reached at small false alarm rates is lower.

Classifier evaluation with the testing data
This section describes the results obtained with the three classifiers presented above applied to the testing set. The classifiers all exhibit poor separation power between the noise and glitch classes. This can be attributed to several factors, including the absence of a distinct boundary between the two classes (potentially due to contamination and mislabeling), the considerable variation in glitch morphology, and the class imbalance, with the glitch class being underrepresented by a factor of 3.5 compared to the noise class in the training set. The initial assumption that a three-class division would enhance classification performance turned out to be incorrect, at least with this dataset. Consequently, we proceed by combining the noise and glitch classes into a single class representing the absence of an astrophysical signal.

Noise rejection
We first assess the noise rejection capabilities of the classifiers. Fig. 2 compares the distributions of the decision statistic $P_s$ (the membership probability assigned to the signal class) when the input segment belongs to each of the three classes. The $P_s$ distributions obtained with samples from the noise or glitch classes have very similar shapes, reflecting the intrinsic similarity of those two classes (see above). The best classifier is the one that provides the greatest contrast between the distributions of the $P_s$ statistic obtained in the presence of a signal (blue) versus noise or glitch (red dot-dashed and black).
The distributions obtained for the noise or glitch classes exhibit maxima at zero for the TCN and IT classifiers, while the maximum is shifted to around 0.1 for CNN. Moving from the peak to higher values, the distribution shows a monotonic decay for CNN and TCN. However, for IT, the distribution initially decreases and then slightly increases near $P_s = 1$. The TCN classifier appears to reach the lowest background ($10^{-2}$ in normalized count units). Since our objective is to achieve high-confidence classification, we are primarily interested in the immediate vicinity of $P_s = 1$. This motivates reparameterizing the $P_s$ statistic as $\lambda := -\log_{10}(1 - P_s)$. While $P_s$ ranges from 0 to 1, $\lambda$ can in principle take any non-negative value. However, our main focus lies in the range $\lambda \gtrsim 7$. The most stringent criterion is to require $P_s = 1$ at machine precision, which corresponds to $\lambda = \infty$. The number of noise and glitch samples in the testing set that satisfy this selection criterion is 0, 1, and 2 for the CNN, TCN, and IT classifiers, respectively. Such rejection power (between 0 and 2 false alarms in $5.8 \times 10^5$ trials) is in agreement with the false-alarm rate targeted initially.
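A minimal sketch of this reparameterization, $\lambda = -\log_{10}(1 - P_s)$, is given below; the function name is hypothetical, and the saturated case $P_s = 1$ is mapped explicitly to infinity rather than left to raise a math error.

```python
import math


def lambda_stat(p_s):
    """Map the signal-class probability P_s to lambda = -log10(1 - P_s)."""
    if p_s >= 1.0:
        return math.inf  # P_s equal to 1 at machine precision
    return -math.log10(1.0 - p_s)


for p in (0.0, 0.9, 0.999, 1.0):
    print(p, lambda_stat(p))
```

Each additional "9" in $P_s$ adds one unit to $\lambda$, which stretches the region close to 1 where the high-confidence thresholds are set.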

Signal extraction
We proceed to assess the classifiers' ability to extract signals. Fig. 2 illustrates the distributions of the decision statistic $P_s$ when the input segment belongs to the signal class, represented in blue. As anticipated, all distributions exhibit a peak at $P_s = 1$. However, the peak appears narrower for the IT classifier. To focus on the region of interest near $P_s = 1$, we employ the $\lambda$ reparametrization, as depicted in Fig. 3. This figure also incorporates the dependencies on the signal-to-noise ratio (SNR) of the injected GW signal and the chirp mass $\mathcal{M}$ of the source binary. The distributions are computed separately for three ranges of chirp mass $\mathcal{M}$: low, mid, and high, corresponding to $\mathcal{M}$ values between 13 and 17 $M_\odot$, 17 and 21 $M_\odot$, and 21 and 26 $M_\odot$, respectively. The histograms on the right-hand side are computed with the samples of the signal class that are classified with $P_s = 1$, showing their distribution in terms of SNR for the three chirp mass ranges.
It is worth noting that we have either $\lambda \lesssim 7.5$ or $\lambda = +\infty$ (i.e., $P_s = 1$). In a sense, the latter case seems to accumulate all samples with $\lambda \gtrsim 7.5$. From the left column of the figure, it is apparent that the IT classifier assigns larger $\lambda$ (or $P_s$) values more uniformly over the full range of chirp mass and to lower SNR. In contrast, the CNN fails to do so for the lower chirp mass interval shown in blue. This is confirmed by the histograms in the right column, which indicate that the TCN and IT classifiers have a higher overall count (approximately 76%, compared to 65% for CNN) and extend to lower SNR values.
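One plausible origin of this gap, offered here as an assumption rather than a statement from the text, is that the membership probabilities are stored in single precision: the largest float32 value strictly below 1 then caps the finite values of $\lambda = -\log_{10}(1 - P_s)$ at about 7.2. The sketch below checks this with the standard library only; `to_float32` is a hypothetical helper.

```python
import math
import struct


def to_float32(x):
    """Round a Python float (double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Largest float32 strictly below 1 (spacing 2**-24 just below 1).
largest_below_one = to_float32(1.0 - 2.0 ** -24)
lambda_max = -math.log10(1.0 - largest_below_one)
print(largest_below_one, lambda_max)

# Anything closer to 1 than that is rounded to exactly 1 (lambda = +inf).
print(to_float32(1.0 - 2.0 ** -26) == 1.0)
```

Under this assumption, every sample whose true probability is closer to 1 than the float32 spacing is reported with $P_s = 1$, which is consistent with the accumulation at $\lambda = +\infty$ described above.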

Global assessment with Receiver Operating Characteristics
To fully characterize the performance of the classifier, the noise rejection and signal extraction capabilities have to be evaluated jointly. This can be done by computing the ROC curves [63]. The classification efficiency $S_{\rm th}/S_{\rm tot}$ and false alarm rate $N_{\rm th}/N_{\rm tot}$ are evaluated from the testing set, with $S_{\rm th}$ and $N_{\rm th}$ the number of signal samples and of noise and glitch samples with a $P_s$ value above some threshold, and $S_{\rm tot}$ and $N_{\rm tot}$ the total number of samples for each category. Note that since each sample has a duration of 1 second, $N_{\rm tot}$ is intended here as the total duration in seconds of noise and glitch samples, so $N_{\rm th}/N_{\rm tot}$ is measured in s$^{-1}$. By varying the threshold, one obtains the ROC curves in Fig. 4, which display the classification efficiency versus the false alarm rate. The TCN and IT classifiers appear to have similar ROC curves and show a clear improvement with respect to CNN. Fig. 4 also shows the ROC computed for two instances of the IT architecture, with and without the softmax activation during training (see Sec. 3.5 for a discussion).
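The construction of one ROC curve from the two sets of scores can be sketched as follows. The function name and the toy scores are hypothetical, and counting samples with a score greater than or equal to the threshold (so that a threshold of exactly 1 remains usable) is a convention chosen for this sketch.

```python
def roc_curve(signal_scores, background_scores, thresholds):
    """Return a list of (false_alarm_rate, efficiency) pairs.

    signal_scores     : P_s values of the signal samples
    background_scores : P_s values of the noise and glitch samples
    For one-second samples, the false alarm rate N_th / N_tot is
    directly expressed in events per second.
    """
    s_tot = len(signal_scores)
    n_tot = len(background_scores)
    points = []
    for th in thresholds:
        s_th = sum(1 for s in signal_scores if s >= th)
        n_th = sum(1 for b in background_scores if b >= th)
        points.append((n_th / n_tot, s_th / s_tot))
    return points


signal = [0.2, 0.9, 1.0, 1.0]
background = [0.0, 0.1, 0.5, 1.0]
print(roc_curve(signal, background, [0.5, 1.0]))
```

With 580,000 background samples in the testing set, the lowest non-zero false alarm rate that can be resolved this way is one count, i.e. about $1.7 \times 10^{-6}$ s$^{-1}$.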
Fig. 5 shows the classification efficiency for a given false alarm rate set to $10^{-5}$ s$^{-1}$, as a function of the injected SNR as defined in Eq. (1). The TCN and IT classifiers give similar efficiencies and uniformly surpass the CNN. Note that the efficiency shown in this figure is averaged over the full chirp mass range and thus does not show the differences evidenced in Fig. 3. Overall, signals with SNR = 10 can be detected at the considered significance level with a good probability, larger than 50%.
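An efficiency curve at fixed false alarm rate can be obtained in two steps: pick the threshold from the background scores, then count the recovered signals per SNR bin. The sketch below is illustrative only; the function names, the half-open binning and the toy numbers are assumptions.

```python
import bisect
import math


def threshold_for_far(background_scores, target_far):
    """Smallest observed background score th such that the fraction of
    background samples with score >= th does not exceed target_far."""
    ranked = sorted(background_scores)
    n_tot = len(ranked)
    for th in sorted(set(ranked)):
        n_th = n_tot - bisect.bisect_left(ranked, th)  # scores >= th
        if n_th / n_tot <= target_far:
            return th
    return math.inf  # target not reachable: every sample is rejected


def efficiency_by_snr(signal_scores, snrs, threshold, bin_edges):
    """Fraction of signals with score >= threshold in each SNR bin.

    Bins are half-open, [edge_i, edge_{i+1}); empty bins yield None.
    """
    n_bins = len(bin_edges) - 1
    found = [0] * n_bins
    total = [0] * n_bins
    for score, snr in zip(signal_scores, snrs):
        idx = bisect.bisect_right(bin_edges, snr) - 1
        if 0 <= idx < n_bins:
            total[idx] += 1
            found[idx] += score >= threshold
    return [f / t if t else None for f, t in zip(found, total)]


th = threshold_for_far([0.1, 0.2, 0.3, 0.9], 0.25)
print(th)
print(efficiency_by_snr([0.95, 0.5, 0.99, 0.91], [9, 9.5, 15, 16], th, [8, 12, 20]))
```

In practice the target would be the $10^{-5}$ s$^{-1}$ used in Fig. 5, and the statistical uncertainty in each bin depends on the number of injections it contains.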

Application to the remaining O1 single-detector data
This section presents the results of applying the different classifiers to the remaining O1 data from the Livingston detector. Our primary focus is on the IT classifier, while the results for the other models can be found in Appendix A.

Analysis of known O1 GW events
We first investigate how the three events detected in the O1 data [8] by matched filtering searches are classified by the considered models. The statistic $P_s$ is evaluated for different positions of the one-second window, that is for different time delays $\Delta t$ between the start of the analysis window and the merger time. This definition implies that, for $\Delta t = -1$ s, the analysis window only includes the initial part of the signal (inspiral), whereas, for $\Delta t = 0$ s, the analysis window starts at the merger time and thus only includes the final part (merger and ringdown). Fig. 6 shows the evaluation of $P_s$ between those two extreme cases for the IT model for GW150914, GW151012 and GW151226 (see also Appendix A).
Figure 6. Evolution of the statistic $P_s$ produced with the IT classifier versus the relative delay $\Delta t$ of the analysis window to the O1 event merger time (GW150914, GW151012 and GW151226). For $\Delta t = -1$ s, the analysis window only includes the initial part of the signal (inspiral), whereas, for $\Delta t = 0$ s, the analysis window starts at the merger time and thus only includes the final part (merger and ringdown).
As expected, when the chirp signal is not included in the analysis window, the classifier is not able to detect the presence of the signal. GW150914 appears to be loud enough to be always identified, regardless of its position in the time window, even when it is only partially visible. GW151012 is only detected when the chirp is at the center of the analysis window. GW151226 is not detected. This is expected, as the binary component masses are outside the range used to generate the astrophysical signals in the signal class of the training data. In addition, both GW151012 and GW151226 have single-detector optimal SNRs for Livingston from parameter-estimation analyses that are lower than the minimum value of 8 used to train the network (namely, $5.8^{+1.2}_{-1.2}$ for GW151012 and $6.9^{+1.2}_{-1.1}$ for GW151226 according to Table V of [8]).
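The window scan over $\Delta t$ can be sketched as below. This is an illustrative sketch under stated assumptions: `classifier` is a hypothetical callable returning $P_s$ for a one-second segment, the strain is assumed to be already whitened, and the sampling rate, the merger sample index and the number of steps are free parameters that are not taken from the paper.

```python
import numpy as np

def scan_window(strain, sample_rate, merger_index, classifier, n_steps=11):
    """Evaluate P_s for one-second windows starting at merger time + delay.

    The delay runs from -1 s (the window ends at the merger, inspiral only)
    to 0 s (the window starts at the merger, merger and ringdown only).
    """
    window = int(sample_rate)  # number of samples in one second
    delays = np.linspace(-1.0, 0.0, n_steps)
    scores = []
    for delay in delays:
        start = merger_index + int(round(delay * sample_rate))
        if start < 0 or start + window > len(strain):
            raise ValueError("analysis window falls outside the data")
        scores.append(float(classifier(strain[start:start + window])))
    return delays, np.array(scores)

# Toy check with a stand-in "classifier" that returns the first sample.
toy_strain = np.arange(300.0)
delays, scores = scan_window(toy_strain, 100, 150, lambda seg: seg[0])
```

With a trained model in place of the stand-in, plotting `scores` against `delays` gives a curve of the kind shown in Fig. 6.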

Analysis of the remaining O1 data
We analysed all the remaining L1 data in O1, excluding the month we used to train and test the classifiers (see Sec. 2). This corresponds to the period between GPS=1126051217 (2015-09-12 00:00:00 UTC) and GPS=1132444817 (2015-11-25 00:00:00 UTC) and between GPS=1135036817 (2015-12-25 00:00:00 UTC) and GPS=1137254417 (2016-01-19 16:00:00 UTC). In this period we excluded the intervals of ±1 second around the chirp time of the 3 known events (see previous section). This amounts to a total of 4,216,489 s (about 49 days), of which 1,054,564 s (about 12 days) are single-detector times, corresponding to 25% of the total. This data set is whitened following the same procedure used to produce the training set (the ASD was calculated from periods of non-interrupted data taking with 26 s minimum and 146,978 s maximum). The data are then divided into non-overlapping one-second segments that are processed through the three classifiers. For each architecture, we used the best-performing model on the testing data. The processing time for the full data set is about 4 hours per model on NVIDIA Tesla V100S GPUs, but most of this time is spent loading the data: the extraction of the model predictions takes about 8 min for CNN, 18 min for TCN and 52 min for IT. No data quality information was used, so this analysis is solely based on the gravitational-wave strain data. Fig. 7 shows the distribution of the $\lambda = -\log_{10}(1 - P_s)$ statistic obtained with the IT classifier (similar plots can be found in Appendix A for the other models). We apply the most restrictive selection cut, by requiring $P_s = 1$ (at machine precision). We recall that this selection cut corresponds to a false-alarm rate of $4 \times 10^{-6}$ s$^{-1}$ (that is one false alarm per 3 days) and a classification efficiency of 76% when estimated on the testing set, see Secs. 4.1 and 4.2. Based on these results, we estimate from basic counting statistics that the maximum number of false alarms expected for this analysis should be 29, 43 and 55 for CNN, TCN and IT respectively at 95% level for the full data set, and 9, 13 and 16 when restricting to the single-detector part.
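The durations quoted above, and the order of magnitude of the expected number of false alarms, can be checked with simple arithmetic. The sketch below takes the quoted false-alarm rate of $4 \times 10^{-6}$ s$^{-1}$ as a point estimate, which is an assumption made for illustration; it does not reproduce the 95% upper bounds given in the text, which depend on the authors' counting procedure.

```python
SECONDS_PER_DAY = 86_400

total_seconds = 4_216_489    # analysed L1 data, from the text
single_seconds = 1_054_564   # single-detector part, from the text
false_alarm_rate = 4e-6      # per second, point estimate (assumption)

total_days = total_seconds / SECONDS_PER_DAY
single_days = single_seconds / SECONDS_PER_DAY
single_fraction = single_seconds / total_seconds

# Mean number of false alarms for a constant rate over each duration.
expected_total = false_alarm_rate * total_seconds
expected_single = false_alarm_rate * single_seconds

print(f"{total_days:.1f} d total, {single_days:.1f} d single-detector "
      f"({100 * single_fraction:.0f}%)")
print(f"mean false alarms: {expected_total:.1f} (all), "
      f"{expected_single:.1f} (single-detector)")
```

The mean counts obtained this way are of the same order as the number of segments passing the cut for the CNN and IT classifiers.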
For the IT classifier, a total of nine segments pass the selection cut, with two occurring in single-detector time at GPS=1131289775 (2015-11-11 15:09:18 UTC) and GPS=1135945474 (2016-01-04 12:24:17 UTC). For the CNN and TCN classifiers, we obtain 4 and 105 segments passing the cut, with 2 and 14 falling in single-detector periods. The results are thus consistent with the expectations for CNN and IT, while there is a clear excess with TCN. We have observed that a significant fraction of the TCN triggers comes from two time intervals around 2015-10-20. Our interpretation is that the data from those periods could differ in nature from those of the training set, and TCN may be sensitive to this difference.
Interestingly, there is only one segment passing the selection cut for all three classifiers: GPS=1135945474 (2016-01-04 12:24:17 UTC), which we investigate further in the next section. As single-detector searches cannot employ statistical resampling techniques with time shifts [64], we can only provide an upper limit on the false alarm rate for this detection. The upper limit is estimated to be 1 event every 49 days, based on the available data from the three-month analysis period. This segment on 2016-01-04 corresponds to the event identified in the Livingston detector data during the O1 single-detector periods using a standard matched-filtering-based search, as reported in [35]. However, this candidate was subsequently discarded by the authors of Ref. [35] after examining the residual obtained by subtracting the best-fit waveform from the data, since excess power was observed in the residual at frequencies below 80 Hz.

Detailed analysis of the 2016-01-04 event
We have performed a number of detailed checks of the 2016-01-04 event. We first carried out a "visual" inspection with the time-frequency Q-transform [66]. Fig. 8 provides the resulting time-frequency representation of the entire segment. A transient is visible ∼ 0.35 seconds after the start of the segment, at a frequency of about 150 Hz. In the magnified view, the shape of the transient is clearly indicative of a frequency-modulated, chirp-like signal.
The Gravity Spy database [65] has classified this specific GPS time as an instrumental artefact of the "Blip" type. The term refers to a well-identified family of instrument glitches whose origin is still largely unknown (see, e.g., [67,68] for more details). Generally, "Blip" glitches do not exhibit a chirping frequency (see Fig. 1 of [69] for a typical example). To complement this initial inspection, Fig. 7 gives in pink the statistic $\lambda$ (or equivalently $P_s$) of the 600 blip glitches listed in Gravity Spy overlapping with the part of the O1 dataset being analyzed. The resulting distribution is compatible with the overall background distribution. The Jan 4 segment appears to be an outlier with respect to the blip glitches identified in the data.
Further, we checked if the transient signal can be fitted by a GW waveform model associated with a compact binary merger. To do so, we ran the Bayesian inference library Bilby [70] and used the IMRPhenomXPHM waveform model [71]. It is assumed that the component spins are co-aligned with the orbital momentum. For the rest of the source parameters, generic and agnostic priors are assumed, along with a standard Λ-CDM cosmology model with $H_0 = 67.9$ km s$^{-1}$ Mpc$^{-1}$ [72]. The analysis did not include a marginalization over calibration uncertainties. The analysis results in a signal-versus-noise log Bayes factor of 47. The estimated time of arrival of the merger at the detector is GPS=$1135945474.373^{+0.076}_{-0.07}$ and the measured optimal SNR is $11.34^{+1.8}_{-1.6}$. Fig. 9 shows the result of the fit in the time domain, by comparing the whitened data in orange to the inferred waveform (blue) with a 90% credible belt. There is no significant residual after subtraction of the inferred waveform, as shown in Fig. 10. As an independent check of the nature of the signal, Fig. 9 also includes the waveform estimate produced by the denoising convolutional autoencoder described in [73] (dashed red). The two reconstructed waveforms are in good agreement, following a similar phase evolution, except for the initial and final parts of the signal, where the denoiser's reconstruction is not optimal because of its low-frequency cut-off and the rather low SNR of the signal. In addition, we note that the denoising autoencoder was trained on the waveform family SEOBNRv4, which is different from the one used for Bilby (IMRPhenomXPHM); this may contribute to the differences.
The above checks are all compatible with the event being of astrophysical origin. The corner plot in Fig. 11 displays the posterior distribution of the source parameters including the binary component masses, spins and source distance. Since only one detector is available, the source direction is not localized in the sky. The 90% credible intervals for those parameters are: the measured (redshifted) chirp mass $\mathcal{M} = 30.18^{+12.\ldots}$ […] and the luminosity distance $d_L = 564^{+812}_{-338}$ Mpc; see [74] for a definition of those physical parameters. Overall, these values are consistent with the observed population of BBH to date.
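For convenience, the standard definitions of the mass and spin parameters shown in Fig. 11 are recalled below. They are the usual textbook expressions and are not restated in this section; the symbols $\chi_{1z}$ and $\chi_{2z}$ (dimensionless spin components along the orbital angular momentum), $z$ (source redshift) and the labels "det" and "src" are notation introduced here for illustration.

```latex
% Chirp mass from the component masses m_1 >= m_2
\mathcal{M} = \frac{(m_1 m_2)^{3/5}}{(m_1 + m_2)^{1/5}}

% Measured (redshifted, detector-frame) versus source-frame chirp mass
\mathcal{M}_{\rm det} = (1 + z)\, \mathcal{M}_{\rm src}

% Effective spin for spins aligned with the orbital angular momentum
\chi_{\rm eff} = \frac{m_1 \chi_{1z} + m_2 \chi_{2z}}{m_1 + m_2}
```

The redshift $z$ follows from the luminosity distance through the assumed cosmology, which is why the value of $H_0$ enters the conversion between measured and source-frame masses.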

Conclusions
This contribution demonstrates the viability of training neural network classifiers on real detector data for analyzing single-detector observing periods of ground-based GW detectors. We show that architectures specifically designed for time-series classification, such as IT or TCN, outperform the standard CNN typically used so far. Their relative detectability limit in terms of signal-to-noise ratio is lower by a few percent to 15% for 50% and 90% classification efficiencies, respectively. The models were trained with one month of the observing run O1 data from the LIGO Livingston detector. When applied to the remaining three months of O1 data, the classifiers independently detect a plausible GW signal of astrophysical origin on January 4, 2016. This candidate signal was also identified by [35] using standard matched filtering techniques. While [35] downgraded the event as a noise artifact, the various diagnostics we performed substantiate the possibility of its astrophysical origin.
Operationally, we propose an approach where the multiple-detector data from the first month of an observing run, labeled by standard matched-filtering-based pipelines, are used to train the neural network models. The resulting classifiers can then be applied to the remaining data collected during single-detector periods. Once trained, the computational cost is such that the classifiers can produce low-latency triggers. However, the poor sky localization obtained with only one detector limits the relevance of this approach.
The current approach faces two limitations: (i) using real data for training and testing inherently limits the statistical characterization of these algorithms and their noise rejection capabilities, as already highlighted in [46] and observed with the excess of triggers produced by the TCN classifier; (ii) there is a technical issue arising from the use of bounded selection statistics (i.e., class membership probabilities in our case) that leads to numerical intricacies. More generally, due to the absence of a mathematical theory for neural networks, their precise statistical characterization on noisy data remains an open question. Consequently, research in this field is limited to a trial and error heuristic approach.
This contribution opens up new possibilities for analyzing the fairly large single-detector data set. Applying the proposed classifiers to other LIGO-Virgo observing runs and broadening the parameter space to include lower masses and effects such as higher-order modes or precession would be interesting directions for future work.

Figure 1. Instances of the classes noise (blue), signal (black) and glitch (green). Top (noise): one-second data segment recorded by LIGO Livingston at the GPS time 1132550972.487. Middle (signal): a simulated BBH waveform with SNR of 20 (dashed red line) is injected in the previous time series. Bottom (glitch): data recorded at the GPS time 1132580628.41, which contains a low-frequency transient instrumental artifact.

Figure 2. Distributions of $P_s$ (the class membership probability assigned to the signal class), conditioned on the class of the input segment from the testing set: signal (blue), noise (dot-dashed red), or glitch (black). These distributions were computed for the CNN (left), TCN (middle), and IT (right) architectures. The histograms are normalized to have a unit sum. The classifiers do not distinguish between samples from the noise and glitch classes, thus resulting in practically identical probability distributions (see Sec. 4 for a discussion on this point).

Figure 3. Distribution of the statistic $\lambda := -\log_{10}(1 - P_s)$ obtained with testing samples from the signal class and computed for the three considered classifiers: CNN (top), TCN (center) and IT (bottom). The column on the left shows a kernel density estimate of the $\lambda$ distribution for the samples with $P_s < 1$, thus leading to a finite value for $\lambda$. The shaded area is the 50% containment region, and the line is the 90% containment region. Those distributions are shown versus the SNR of the injected GW signal and computed separately for three ranges of chirp mass. The column on the right shows a histogram for the samples with $P_s = 1$. The signal samples detectable with high confidence fall in the range of large $\lambda \gtrsim 7$ (i.e., $P_s$ values very close or equal to 1).

Figure 4. ROC curves for the three considered classifiers, CNN, IT, and TCN, illustrating the classification efficiency versus the false alarm rate. Each classifier has been trained 10 times, and the continuous line represents the result obtained for the best model, while the shaded area covers the range from the best to the worst model. The left panel displays the TCN (orange) and CNN (blue) ROC curves. In the right panel, the ROC curves are shown for two instances of the IT architecture: one trained with softmax activation (red) and another without softmax activation (green) (refer to Sec. 3.5). The TCN ROC curve is reproduced in this panel as a dashed orange line to facilitate comparison.

Figure 5. Classification efficiency versus SNR for a false alarm rate of $10^{-5}$ s$^{-1}$. The TCN and IT classifiers give similar efficiencies and uniformly outperform the CNN. Overall, signals with SNR=10 can be detected at the considered significance level with a good probability, larger than 50%.

Figure 7. Distribution of the $\lambda = -\log_{10}(1 - P_s)$ statistic (shown in blue) obtained using the IT classifier on the remaining O1 dataset (refer to Sec. 5.2 for details). The segments with $P_s = 1$ have been assigned a value of $\lambda = 8$ for plotting purposes. The pink histogram corresponds to a subset labeled as "Blip" glitches by Gravity Spy [65]. The markers at the top indicate the highest values for the three O1 events displayed in Fig. 6. Please note that the vertical position of these markers is arbitrary.

Figure 8. Time-frequency representation of the segment at 2016-01-04 12:24:17 UTC (GPS=1135945474 s) recorded by the LIGO Livingston detector. The top panel shows the entire segment. The bottom panel is a detailed view that focuses on the transient signal at t ∼ 0.35 s and f ∼ 150 Hz. The frequency of the signal is distinctly increasing in a chirping pattern.

Figure 9. Comparison of the whitened L1 data (orange line) with the reconstructed waveform from the Bilby posteriors (blue) and the ML denoising convolutional autoencoder neural network described in [73] (dashed red line).

Figure 10. Time-frequency representation of the residual after the subtraction of the reconstructed waveform from Bilby posteriors from the data segment at 2016-01-04 12:24:17 UTC (GPS=1135945474 s). The dynamic range and color code are the same as in Fig. 8. No excess power is visible in this plot.

Figure 11. Posterior distribution of the chirp mass $\mathcal{M}$, luminosity distance, component masses $m_1$ and $m_2$, and effective spin $\chi_{\rm eff}$ for the 2016-01-04 event (see Sec. 5.3 for details).

Figure A1. Distribution of the $\lambda = -\log_{10}(1 - P_s)$ statistic (shown in blue) obtained using the CNN (top panel) and TCN (bottom panel) classifiers on the remaining O1 dataset (refer to Sec. 5.2 for details). The segments with $P_s = 1$ have been assigned a value of $\lambda = 8$ for plotting purposes. The pink histogram corresponds to a subset labeled as "Blip" glitches by Gravity Spy [65]. The markers at the top indicate the highest values for the three O1 events displayed in Fig. 6. Please note that the vertical position of these markers is arbitrary.

Figure A2. Top panel: evolution of the statistic $P_s$ produced with the CNN classifier versus the relative delay $\Delta t$ of the analysis window to the O1 event merger time (GW150914, GW151012 and GW151226). For $\Delta t = -1$ s, the analysis window only includes the initial part of the signal (inspiral), whereas, for $\Delta t = 0$ s, the analysis window starts at the merger time and thus only includes the final part (merger and ringdown). Bottom panel: the TCN classifier results.

Table 1. Architecture of the CNN considered in this study. The type of the layer is either convolutional (Conv) or fully connected (Dense). The activation function is either the rectified linear unit (relu) or the softmax function [39].