Comparison of neural network architectures for feature extraction from binary black hole merger waveforms

We evaluate several neural-network architectures, both convolutional and recurrent, for gravitational-wave time-series feature extraction by performing point parameter estimation on noisy waveforms from binary-black-hole mergers. We build datasets of 100 000 elements for each of four different waveform models (or approximants) in order to test how approximant choice affects feature extraction. Our choices include SEOBNRv4P and IMRPhenomPv3 , which contain only the dominant quadrupole emission mode, alongside IMRPhenomPv3HM and NRHybSur3dq8 , which also account for high-order modes. Each dataset element is injected into detector noise corresponding to the third observing run of the LIGO-Virgo-KAGRA (LVK) collaboration. We identify the temporal convolutional network architecture as the overall best performer in terms of training and validation losses and absence of overfitting to data. Comparison of results between datasets shows that the choice of waveform approximant for the creation of a dataset conditions the feature extraction ability of a trained network. Hence, care should be taken when building a dataset for the training of neural networks, as certain approximants may result in better network convergence of evaluation metrics. However, this performance does not necessarily translate to data which is more faithful to numerical relativity simulations. We also apply this network on actual signals from LVK runs, finding that its feature-extracting performance can be effective on real data


I. INTRODUCTION
The third observing run of the LIGO-Virgo-KAGRA (LVK) collaboration delivered a total of 79 gravitationalwave (GW) events, raising the total number of detected signals to date to 90 [1][2][3][4].All these events are consistent with compact binary coalescences (CBCs) comprising mostly black-hole mergers (BBH) but also neutronstar mergers [5,6] and neutron-star black-hole mergers [7].The detection of signals from CBCs as well as the extraction of physical information of the sources relies on matched filtering [8].This is based on the comparison (through cross-correlation) of the incoming GWs to pre-computed signal templates (or waveforms) for given source parameters [9][10][11][12].This process requires that templates are faithful representations of true GWs, as the opposite can introduce biases in the source parameter estimation or even actual misses of incoming signals.
The field of waveform modelling has gradually reached a status of maturity, currently providing robust methods for the description of gravitational radiation from compact binary systems through the combination of information from three main physical frameworks, namely post-Newtonian theory, the effective-one-body approach, and numerical relativity.There exist nowadays accurate and efficient waveform models, or approximants, describing all stages of the signal of a BBH coalescence (inspiral, merger and ringdown) and allowing for accurate parameter estimation of the source.This is especially true for quasi-circular binaries, also accounting for the effects of unequal masses, misaligned spins, precession, and subdominant (high-order) modes [13][14][15][16][17][18][19][20].
Given a waveform model, the traditional method to infer source parameters is through Bayesian inference.This is a computationally expensive task as it requires applying stochastic sampling techniques to evaluate the likelihood function for a large set of source parameters, resulting in the need to generate potentially millions of waveforms.Moreover, as the performance of GW detectors increases, with predicted sensitivities for third-generation detectors being around a tenfold improvement on the current instruments [21], the amount of detected signals is expected to increase accordingly.This may render unsustainable traditional approaches for both, signal detection and parameter estimation.Recently, there have been successful attempts to significantly speed up inference using machine-learning techniques [22][23][24][25][26][27][28].The role of deep learning (DL) for GW data analysis is already relevant and is bound to become increasingly so, as GW astronomy fully unfolds (see [29,30] and references therein).Current applications are not only limited to accelerating parameter estimation but also include efforts to develop detection methods [31][32][33], to improve signal quality [34,35], to accelerate waveform generation [36,37] as well as to simulate noise transients in GW detectors [38,39].
There are several relevant choices when implementing a DL method to deal with GW data.On the one hand, there is the choice of a network architecture.There is already a rich body of work when it comes to the application of DL to generic time series [40][41][42].Therefore, there is a large amount of pre-existing architectures that can detect relevant features of time series data (or at least are able to serve as the basis for new architectures).On the other hand, however, there is the issue of having an appropriate dataset.Points to consider include data representation, such as the window size in time and the sample rate of the data, as well as more involved technical aspects such as dimensional reduction through, for example, principal component analysis.More fundamentally, though, there is the issue of the composition of the dataset.The number of confirmed GW detections is minuscule by data science standards to realistically attempt to train models on real data.Moreover, full numerical relativity (NR) simulations are too computationally expensive to cover the entire parameter space of BBH coalescences.As such, waveforms for use in GW astronomy are commonly generated by approximants which attempt to quickly and accurately describe the waveform generated by the full NR simulation, developed with different physical approximations in mind, and then calibrated to NR waveforms [13, 17-20, 43, 44].
The existence of different approximants relying on different techniques raises the question of whether choosing a particular one for the creation of a dataset may condition the feature extraction ability of a trained network.The aim of this paper is to address this question.To do so, we test a number of state-of-the-art network architectures implemented in the tsai python package [45] on the task of point parameter estimation, as a way to evaluate how well the waveform's relevant features are identified.Our goal is to find out if it is possible to select a specific type of architecture that might extract the features of the data relevant to GWs in a more accurate way than the rest.Furthermore, after finding the best-performing architecture, we test the effect of approximant choice on the network performance.In particular, we analyse the differences between a NR surrogate model, NRHybSur3dq8, the phenomenological model IMRPhenomPv3 and its sibling IMRPhenomPv3HM including high-order modes, and the effective-one-body based model SEOBNRv4P.Our results indicate that the choice of approximant to build GW datasets can have a non negligible impact on fea-ture extraction.
The paper is organized as follows: In Section II we describe the generation of the four datasets we use for this work.In Section III we compare the training of 10 different networks on the NRHybSur3dq8 dataset, choosing the network with the lowest validation loss as the best architecture for our purposes.In Section IV we take this architecture and train 10 equal networks on each of the generated datasets and compare the results by looking at the root-mean-square error (RMSE) of each of the physical parameters studied.Finally, in Section V we summarize our conclusions.

II. DATASETS A. Approximant overview
The first approximant we employ in our study is IMRPhenomPv3 [14].This is a phenomenological waveform approximant in the IMRPhenomP family [46].This family of approximants is constructed by matching NR simulations of BBH systems to a set of physicallymotivated analytical waveforms, typically in the frequency domain, including terms that attempt to describe the full physics of the dominant (ℓ, m) = (2, 2) mode of compact binary gravitational radiation, encompassing spin effects such as precession.These approximants are then calibrated to a set of NR simulations, in order to obtain a good agreement with the predictions of general relativity.We also employ a second approximant of the same family, IMRPhenomPv3HM [15] which builds upon the foundations of IMRPhenomPv3 but includes higher (subdominant) mode effects, specifically the (2,2), (2,1), (3,3), (3,2), (4,4), and (4,3) modes.The inclusion of these effects was shown to significantly improve the mismatch to NR waveforms for systems with mass ratios up to 1/5.
The third approximant we use is SEOBNRv4P [17].This is part of a family of approximants which utilise the effective-one-body (EOB) formalism.EOB describes the two-body dynamics in general relativity and its GW emission using a resummation of the post-Newtonian information for the dynamics in terms of the geodesic motion of a particle in an effective spacetime [47].The twobody problem is mapped into the effective problem of a particle of mass µ (the reduced mass of the binary system) in an effective spacetime which is a symmetric deformation of Schwarzschild (or Kerr for spinning binaries).The conservative dynamics of the two-body system is encoded in the EOB Hamiltonian from which the evolution of the phase space variables can be obtained.First attempts to construct a resummed version of the GW flux incorporating test-particle results at higher order were taken by [48].A different strategy based on a resummation framework for the GW amplitudes was subsequently initiated by [49].Recent comparisons with NR simulations and test-particle results outperform the standard Taylor-expanded form approximants [50].
Our fourth and last approximant is NRHybSurd3q8 [16].This is a hybrid surrogate of NR waveforms.Surrogate models work by interpolating the space of gravitational waveforms using a reduced basis obtained from a number of NR waveforms.However, despite their high accuracy, surrogate models have the disadvantage of NR waveforms being rather short, usually starting around twenty orbits before the BBH merger.NRHybSur3dq8 gets around this issue by stitching together EOB-corrected post-Newtonian waveforms for the early inspiral with corresponding NR waveforms, and then applying the surrogate methodology to obtain a model that is capable of accurately interpolating the space of waveforms while still being able to reproduce a large portion of the inspiral phase.

B. Dataset Generation
The datasets are generated in two steps.First, a dataset of GW waveforms corresponding to 100,000 BBH mergers is created for each of the four chosen approximants.This is done using the pycbc package [51] to generate the signals given a dictionary of seven physical parameters.These parameters are sampled as follows: the primary detector-frame mass of the system, m 1 , is sampled uniformly in the interval [5,80]M ⊙ .The mass ratio, q, is sampled uniformly in the interval [1/8, 1], in order to respect the limitations of the NRHybSur3dq8 model.The secondary mass, m 2 , is taken to be the maximum value between 5M ⊙ and (q • m 1 )M ⊙ .The spins of the component objects, s 1 and s 2 , have their dimensionless magnitudes sampled uniformly in the interval [0, 0.9], while their direction is sampled isotropically.However, the x and y components are ignored in order to respect the parameter space of the hybrid surrogate model.The inclination ι is sampled uniformly in the range [0, π].Finally, the coalescence phase is sampled uniformly from [0, 2π].All of these waveforms are generated with an initial nominal distance of 100 Mpc, which serves the purpose of scaling them.This distance is then adjusted during the injection process according to the selected signal-to-noise (SNR) criterion (see below).
The second step comprises the injection of the generated waveforms into real detector strain for each of the LIGO detectors [52] and Virgo [53], taken from a 6.5hour time interval from the O3 observation run, starting at GPS time 1264316210.No GW signals have been reported by the LVK Collaboration in this interval, which ensures only real noise conditions.In this study we do not worry about the effects possible glitches in the data might cause to our analysis.This means that, while real parameter estimation analysis are carried our after checking that no glitches are present in the data, our analysis is subject to biases due to potential superposition of glitches over injections, therefore providing conservative results.The injection is performed using the bilby package for python [54].An important choice made at this stage was to limit the detector network's SNR (given by the quadrature sum of the individual detector SNRs) to the interval [10,50].This guarantees some minimal signal quality without going into unrealistically high values.To this end, for each approximant used, the process of injection is as follows: 1.A waveform from a generated dataset is loaded.

2.
A target network SNR is sampled uniformly in the range [10,50].
3. A 20 second segment of real strain data for each of the detectors (LIGO Hanford, LIGO Livingston [52] and Virgo [53]) is loaded, with a randomly chosen GPS time serving as the initial time.
4. The sky position of the source is sampled isotropically.
5. The waveform is injected into the detector strains at t = t 0 + 10, with an additional random value up to 0.8 6.The injection SNR is calculated as where h stands for the template waveform and s stands for the post-injection strain data.Note that here we use the conventional noise-weighted inner product definition, given a frequency-series g(f ) and h(f with S n being the one-sided power spectral density of the detector noise.
7. The injection SNR is compared to the target SNR.
If the absolute difference is less than 3 the final waveform is accepted and written to the injection dataset.If this criteria is not met, a ratio is calculated and step 6 is repeated with the luminosity distance of the clean waveform being scaled by 1/r.
8. Steps 1-7 are repeated until all waveforms from a given dataset have been successfully injected.
The resulting injections (at a sample rate of 4096 Hz) and all the physical parameters used to generate the original waveforms are saved to an HDF5 file.To perform our study, however, we will focus solely on 4 parameters: the (detector-frame) chirp mass of the system FIG. 1: Dataset composition for the 100,000 injections using the NRHybSur3dq8 approximant.The composition of the datasets using other approximants is not shown as it is equivalent to the one displayed in the figure .the mass ratio q = m 2 /m 1 , the effective inspiral spin where s 1 z and s 2 z are the components of the objects' spins aligned with the orbital angular momentum, and the luminosity distance d L .We deliberately do not cover the entire parameter space.These four parameters are chosen due to their larger influence on the gravitational waveforms and, as such, differences in feature extraction performance should be more obvious in this reduced parameter space.
The distribution of these parameters for the NRHybSur3dq8 dataset is displayed in Fig. 1.Note that the imposition of a flat SNR distribution results, effectively, in a prior in distance proportional to 1/d L .This is evident in Fig. 1 and it does mean that the network will have a bias along those lines.On the other hand, typical priors on the distance go with d 2 L .We justify this by again pointing out that our goal is not application-level parameter estimation but rather to test feature extraction, and setting the SNR as described makes network convergence more robust.

III. NETWORK ARCHITECTURE OVERVIEW
For this work we implement the neural networks using the fastai package (version 2.7.10) for python [55] (version 3.8.13),which functions as a while the architectures are taken from the time-series-focused tsai package [45] (version 0.3.5).In total we test 10 different networks, 4 of which are convolutional networks and 3 of which are recurrent networks.Most of the networks we consider were designed for classification tasks, save for the recurrent networks which see the most use in forecasting.This choice of networks architectures was pragmatic: given the rapid pace and subject breadth of DL research, finding a set of networks which stands out in a general way is probably unfeasible.As such, we chose to include recurrent architectures as they are a mainstay in time-series forecasting, as well as some of the top-performing convolutional networks in [42].

A. Convolutional networks
The sample of convolutional networks we employ is the following: • Fully Convolutional Network (FCN) [56]: the FCN architecture makes use of simple convolution operations in sequence in order to extract the features of the data, followed by a linear layer which maps the convolutions' output to the desired output size (in our case, the number of parameters).
• xResNet1d: The xResNet1d architecture is a tweaked version of the ResNet architecture [57].
ResNets are convolutional-based networks which make use of residual connections between layers, that is, the original input of the layer is added to the output of a convolution operation happening within the layer.To be more precise, for a layer input x, the layer output is f (x) = x + C(x), where C stands for the convolutional operation within the layer.The xResNet architecture tweaks the original ResNets, avoiding some information loss by delaying feature reduction operations and replacing expensive convolutions with multiple cheap convolutions [58].We use three different depths for this network, namely 18, 34, and 50 layers, due to our previous positive results for GW parameter estimation using xResNets obtained in [31].
• InceptionTime [59]: The InceptionTime architecture is at heart a residual network.Its main distinguishing feature is the use of the Inception block, which combines concurrent information on the various channels of a multivariate time series in order to obtain a univariate representation of the original data.
• Temporal Convolutional Network (TCN) [60]: the TCN architecture is once again typically based on residual blocks.However, TCN stands out by the use of causal convolutions, that is, throughout the network elements of a layer are considered to be in sequence and are only affected by the elements of the previous layer which are in the past.TCN also employs dilated convolutions, so that deeper layers can have a larger receptive field, i.e. they can look further into the past.
• MiniRocket [61]: MiniRocket is a convolutional network for time series which performs convolutions using a set of 84 convolutional kernels whose elements have their values restricted to the set {1, 2}.
Using randomly generated bias values, MiniRocket generates 9,960 features per timeseries, which are then passed to a linear layer to obtain the final output.

B. Recurrent networks
Correspondingly, our sample of recurrent networks comprises the following ones: • Recurrent Neural Network (RNN) [62]: RNN makes use of memory cells which keep track of the network's immediate past state during training, using it as one of the parameters that defines the next step in the network's parameter space.
• Long Short-Term Memory (LSTM) [63]: LSTMs are an extension of the RNN framework.However, they address a weakness in the simple RNN setup.
RNNs have difficulty incorporating long-term dependencies, as gradients of older states often either vanish (the most common case) or explode as they propagate.Under the LSTM setup, the network is endowed with "input", "output" and "forget" gates that allow it to control information flow, for example being able to ignore potential updates to its memory cells if the input is deemed irrelevant, and being able to forget contents of its memory.As such it is less susceptible (though not immune) to the vanishing/exploding gradients problem.
• Gated Recurrent Unit (GRU) [64]: the GRU architecture adopts the information flow modulation idea from the LSTM concept, but crucially GRU networks do not have an equivalent to the output gate, which modulates the contents of the memory cell that get exposed to the other components of the network.This makes the network computationally simpler but potentially less able to detect long-distance correlations.

IV. RESULTS
For all of the networks we use the default architecture parameters provided by tsai, changing only the number of input and output channels to match the data and the number of parameters estimated, respectively.The different networks have a rather large number of trainable parameters and, moreover, their amounts vary noticeably among networks.This is displayed in Fig. 2 which shows that the number of parameters to train can be as low as 1.1 × 10 4 for the case of RNN and as high as 1.6 × 10 7 for xResNet1d with 50 layers.

A. Training and architecture comparison
We train each of our ten networks on the dataset built with the NRHybSur3dq8 approximant 1 which belongs to the class of waveform models that yield the most accurate results for quasi-circular BBH systems with aligned spins.We only show the comparison of the training of the different network architectures for this approximant as a similar behaviour is found for the datasets built with the other three approximants regarding the evolution of the loss function.
We prepare the data before passing it to the network, rescaling the M and d L values so that they fit in the interval [0, 1].To evaluate the parameter values predicted by the network in comparison to the real injection parameters in the dataset we use a simple mean-square-error (MSE) loss function, given by where N is the number of inputs passed to the network, y truth is a 2D (N params × N ) tensor of ground truth values, x in is a 3D array of input tensors with dimensions ([L signal × N channels ] × N ), where L signal is the length of each signal, N channels is the number of channels in the multivariate time series (that is, in this case, the number of GW detectors considered), f is the mapping represented by the network, taking each of the N [L signal × N channels ]-shaped input tensors to a 1D tensor of length N params , and Θ = (θ 0 , θ 1 , ...) is the array of learned weights which define f .We also use L 2 regularization, which means the square of the weights of the 1 All computations in this paper were performed on the Artemisa cluster from the University of Valencia using 32 Intel Xeon CPU cores and an Nvidia V100 GPU.
network are added to the loss function with some chosen multiplier (10 −5 , in our case), making the full loss function take the form This regularization method prevents the network from giving too much importance to a limited set of weights.
For the training process, we make use of the wellestablished Adam optimizer [65].Using a batch size of 256 and a 0.8/0.2training/validation split, we start the training process by using a learning rate finder, performing a sweep on learning rate values, starting at 10 −6 and monotonically increasing with each batch evaluated, up to the point where the learning rate is high enough that the loss function value diverges.We obtain a starting learning rate by taking the learning rate value corresponding to the lowest MSE loss in this process.We train for a maximum of 100 epochs, using an early stopping call if the validation loss does not decrease by at least 0.002 after 20 epochs.We also reduce the learning rate by a factor of 10 after 5 epochs if there is not a validation loss decrease of at least 0.005 in that period.The evolution of the two losses during training for each network are presented in Fig. 3.The most relevant metric here is the validation loss (right panel of the figure) as it approximately represents the network's performance on out-of-distribution data.Interestingly, the recurrent architectures struggle to converge at all for the RNN and LSTM cases (for the RNN this may indicate gradient vanishing or explosion, as discussed in [66], but for the LSTM case such effects should be mitigated), yet the GRU architecture presents the second-lowest validation loss overall, showing no overfitting, despite not having a particularly significant difference in the number of parameters.Further attempts at making networks based on the RNN and LSTM architectures converge, by fine-tuning the learning rate and some network hyperparameters, yielded no improvement.This issue may require further study.As for the convolutional architectures, while the InceptionTime and xResNet architectures show promising behaviour at the start of training, they quickly enter into an overfitting regime, with the validation losses showing unstable behaviour and values of an order of magnitude higher than the corresponding training losses.The network based on the MiniRocket architecture shows a consistently decreasing validation loss, but this process is rather slow and ends up being noncompetitive given the restriction on the number of epochs.The FCN-based network does not seem to converge, rapidly showing unstable behaviour in the training loss.The TCN-based network maintains a consistent improvement until starting to stabilise at around epoch forty, and the closeness between training and validation loss values indicates that no significant overfitting or underfitting is occurring.From the final value of the validation loss (aproximately 2 × 10 −2 ) and the lack of overfitting or underfitting, we thus conclude that the best performing network of our sample is  the one using the TCN architecture.

B. Approximant comparison
Settling on the top-performing TCN architecture, we move on to testing the effect of approximant selection on feature extraction from BBH merger waveforms.To do so, we follow a similar training procedure to the one detailed above but this time training 10 networks with the same TCN architecture for each of the datasets generated with the different approximants.Since NN training is a stochastic process at its core, training a network for each scenario multiple times allows us to build some statistics and analyse trends and fluctuations in the results, rather than focusing on a single scenario where a network converged.We limit the training to 40 epochs, as the results of the previous section suggest gains after this point are not too significant for the computation time spent.
The evolution of the training and validation losses during the networks' training process is illustrated in Fig. 4, with the mean of the ten runs for each epoch being represented by the solid line, and the standard deviation being shown by the shaded area.Here we see that while the validation loss curves for NRHybSur3dq8 and IMRPhenomPv3 have significant overlap, there is a slight deviation towards higher losses on the SEOBNRv4P run, and a large deviation towards lower losses for the IMRPhenomPv3HM scenario.In addition, the standard deviation band is significantly wider for the runs on the latter dataset.
After training we evaluate each network on its validation set, obtaining a root mean square error for each where N is the number of elements evaluated, index j refers to one of the parameters and i is the index of the sample.To better analyse the distribution of these values for each of the approximants used, we show in Fig. 5 a collection of box plots.In these plots, each data point (black markers) represents the RMSE value for the validation loss of an independent iteration of a TCN network.The horizontal line on each of the boxes shows the median value of the error, while the region encompassed between the median and the edge of the box represents the second and third quartiles of the data.The whiskers of the boxes extend up to 1.5 times the inter-quartile range.Points not in between a box's whiskers are beyond the whisker range and are considered to be outliers.A first inspection of Fig. 5 shows that while there are visible differences in the RMSE for all parameters depending on the chosen approximant, the range of these variations is not too large.In particular, the standard deviation of the median values for each of the M, q, d L and χ eff parameters are respectively, 1.3×10 −1 M ⊙ , 6.7× 10 −3 , 29 Mpc, and 5.7 × 10 −3 .We observe that TCN networks trained on IMRPhenomPv3HM show a very significant advantage in recovering the luminosity distance of the injection.It is quite unclear as to why this difference should manifest.IMRPhenomPv3HM also shows a smaller advantage in recovering the mass ratio q, while the chirp mass and effective spin are better recovered in the IMRPhenomPv3 experiment.TCN instances trained on the SEOBNRv4P dataset stand out by having slightly worse parameter recovery than the other experiments, save for the case of the effective spin where it very slightly out-performs the IMRPhenomPv3HM and NRHybSur3dq8 cases.
It is interesting to note that, despite what could have been expected, the presence of higher modes in an approximant (the case of IMRPhenomPv3HM and NRHybSur3dq8) does not seem to systematically affect the ability of the TCN network to recover the injection parameters.This indicates that the increase in complexity in the data does not necessarily make it harder for the network to converge under the tested conditions.Fig. 6 shows an example of a visual comparison between the performances of networks trained on each approximant on the chirp mass parameter.In these 2D diagrams, obtained on the validation dataset for each approximant, the x-axis represents the injected values of the parameter, defined at generation time for each of the elements of the dataset, while the y-axis represents the predictions of each network.The colour in each bin shows the amount of events that fit within, translated into numbers by the colour bar on the right side of each plot.We observe that the network can infer the chirp mass values with reasonable accuracy, as expected from the previous discussion and Fig. 5.This kind of performance on one of the most dominant parameters is further indication that the main features of the gravitational information within the waveform are being successfully identified.

C. Feature extraction from real data
As a way to test the feature extraction capabilities of the studied method under real-world conditions, we apply our best performing network on actual GW events identified by the LVK collaboration.Given the parameter ranges of our training dataset, and the noise conditions of the training set, we limited this analysis to events from the O3 catalog with 10 < M < 80 M ⊙ .The results for the recovery of the chirp mass M are plotted in Fig. 7.Here we notice that for the considered events most predictions are within 10 M ⊙ .Two events with luminosity distances beyond 6 Mpc have network predictions significantly below the tabulated values, and there is a population of 7 low-mass events that is being assigned significantly higher masses by the network.Given the fact that the noise background varies significantly both between the analysed events, as well as between the events and the training set, it is encouraging to see that feature extracting performance is overall functional under different noise conditions and may be effective on real data.Furthermore, it is likely that training under more varied noise conditions and a finer covering of the parameter space in the training dataset could improve the generalization capacity of the network.

V. CONCLUSION
The existence of DL methods for time-series analysis, both for classification and forecasting, provides the GW community with the option to build data analysis pipelines upon a large body of work.In this paper we have applied a sample of neural networks, both convolutional and recurrent, to the estimation of physical parameters from GW signals produced in binary black hole mergers.Our specific goal has been to address if the choice of a particular waveform model or approximant for the creation of a dataset may condition the feature extraction ability of a trained network.To this aim we have used waveform datasets based on four of the most common waveform models available in the literature, IMRPhenomPv3, IMRPhenomPv3HM, SEOBNRv4P, and NRHybSur3dq8, injected into detector noise of the Advanced LIGO and Advanced Virgo detectors.Our study has been performed using the out-of-the-box network architectures provided by tsai [45] with no further modifications thereof.
For feature extraction, we have found a significant advantage in performance for residual-based architectures, obtaining the best results with the Temporal Convolutional Network (TCN) architecture in particular, in terms of training and validation losses and absence of data overfitting.Using this network we have explored the effects of the choice of waveform approximant on parameter estimation.Differences in the performance of several TCN networks as a function of the waveform model have been found.In particular, TCN networks perform better on datasets generated with the IMPhenomPv3HM approximant.Interestingly, the presence of higher modes (and thus more complex waveform morphologies) did not seem to affect the results in a systematic way.Our bestperforming network has also been applied on actual GW signals identified by the LVK collaboration.Our results show that the feature extracting performance of the network is reasonably robust against noise conditions and may be effective on real data.
We conclude that, under the specific setup we have employed, based on out-of-the-box network architectures, the CBC approximant choice when creating a GW dataset impacts the feature extraction performance of a neural network.This means care should be taken when building a synthetic GW dataset for the training of neural networks, as test performance does not necessarily translate to data which is more faithful to NR simulations.A similar study using Bayesian machine-learning methods for parameter estimation, followed by a comparison to traditional Monte-Carlo samplers, might shed further light on the findings reported in this paper.
Melissa López for her useful comments.
OGF is supported by an FCT doctoral scholarship (reference UI/BD/154358/2022).JCB is supported by a fellowship from "la Caixa" Foundation (ID 100010434) and from the European Union's Horizon 2020 research and innovation programme under the Marie Sk lodowska-Curie grant agreement No 847648.The fellowship code is LCF/BQ/PI20/11760016. JCB is also supported by the research grant PID2020-118635GB-I00 from the Spain-Ministerio de Ciencia e Innovación.JAF and ATF are supported by the Spanish Agencia Estatal de Investigación (grant PID2021-125485NB-C21) funded by MCIN

FIG. 3 :
FIG. 3: Evolution of the loss function during the training of the networks using the GW dataset built with the NRHybSur3dq8 approximant.(a): training loss.(b): validation loss.

FIG. 4 :
FIG. 4: Evolution of the loss function during training of the networks for the approximant comparison test.Lines represents the mean for all the networks trained on a given approximant, while the shaded area represents the standard deviation interval at each epoch.(a): training loss.(b): validation loss.

FIG. 5 :
FIG. 5: Box plots for the RMSE of the TCN networks on the validation set, for each of the physical parameters used to train the network and for each of the four datasets used.The specific panels show the RMSE for (a) the chirp mass of the system, (b) the mass ratio, (c) the luminosity distance, and (d) the effective inspiral spin of the system.

FIG. 7 :
FIG. 7: Prediction behaviour for the chirp mass as a function of d L , for the studied LVK detections.