WaveFormer: transformer-based denoising method for gravitational-wave data

With the advent of gravitational-wave astronomy and the discovery of more compact binary coalescences, data quality improvement techniques are needed to handle the complex and overwhelming noise in gravitational-wave (GW) observational data. Although recent machine-learning-based studies have shown promising results for data denoising, they are unable to precisely recover both the GW signal amplitude and phase. To address this issue, we develop a workflow centered on a deep neural network, WaveFormer, for significant noise suppression and signal recovery on observational data from the Laser Interferometer Gravitational-Wave Observatory (LIGO). WaveFormer has a science-driven architecture design with hierarchical feature extraction across a broad frequency spectrum. As a result, the overall noise and glitch amplitudes are decreased by more than one order of magnitude, and the signal recovery error is roughly 1% and 7% for the phase and amplitude, respectively. Moreover, on 75 reported binary black hole events from LIGO we obtain a significant improvement in inverse false alarm rate. Our work highlights the potential of large neural networks in GW data analysis and, while primarily demonstrated on LIGO data, its adaptable design indicates promise for broader application within the International Gravitational-Wave Observatories Network in future observational runs.


Introduction
In September 2015, the Laser Interferometer Gravitational-Wave Observatory (LIGO) [1] detected gravitational waves (GWs) from distant colliding black holes [2][3][4], ushering in the era of GW astronomy. Since then, dozens of merging black-hole and neutron-star binaries [6][7][8][9] have been observed by LIGO and Virgo [5]. Currently, while some GW detection methods [10][11][12] that do not need templates are emerging, searching for GW sources still typically relies on template-matching-based analysis [13], which performs better in the case of stationary Gaussian noise superimposed on a precisely known signal waveform. However, data collected by the LIGO-Virgo-KAGRA detectors contain time series of GW strain that are heavily contaminated by loud noise artifacts, which can resemble the waveforms of actual signals and bias the estimated parameters of the putative astrophysical sources [14]. To address these challenges, several non-linear noise subtraction frameworks (e.g., DeepClean [15], NonSENS [16]) and glitch subtraction methods (e.g., BayesWave [17]) have been developed, aiming to improve the reliability of catalog parameter estimation. When a candidate signal is identified, rigorous studies are carried out to verify whether the candidate is related to instrumental causes [18,19] or data quality issues [20][21][22] that could impact the analysis of the candidate event through poor significance estimates, or even call its astrophysical origin into question.
In gravitational-wave data analysis, noise sources can generally be categorized into two types: persistent wide-band noise and short-duration noise artifacts, commonly known as glitches. The former is continuously present at certain frequencies, while the latter are abrupt, transient disturbances. Although the noise subtraction process [23] can help reduce the wide-band noise in the LIGO-Virgo-KAGRA detectors, it has no effect on the amplitude of noise artifacts that are unrelated to the addressed noise sources, making the rate of loud noise artifacts one of the primary limitations of an
• We evaluated WaveFormer on pure noise realizations (the off-source data) and found that noise suppression was evident.
The noise level percentile of the amplitude decreased from 52.5 to 0.47, and the noise amplitude spectral density (ASD) over the whole frequency range decreased by 1 to 3 orders of magnitude. For GW signals contaminated by different categories of glitch, the average glitch amplitude is 30 to 800 times smaller than before.
• We further investigated WaveFormer's capacity to recover signals from observational data in terms of phase and amplitude recovery. We achieved state-of-the-art accuracy compared with other deep learning methods [29][30][31][32]. On the majority of the detected binary black hole (BBH) events, the phase overlaps are higher than 0.99 (1% error). Even under challenging circumstances such as low network SNR, we could recover the waveform amplitude with a root mean square error of < 0.53 in matched-filtering SNR, and the typical signal recovery error is approximately 7%.
• Finally, we assessed the performance of our WaveFormer-based workflow by evaluating the inverse false alarm rate (IFAR) on all 75 reported BBH events in the Gravitational-Wave Transient Catalog (GWTC), and achieved a significant IFAR improvement, which indicates, for the first time, that data quality is significantly improved after noise suppression.
Through multiple experiments, we showcase WaveFormer's reliable noise suppression performance and its potential contribution to GW searches. The proposed AI-based workflow provides the means to enable open-source, accelerated, deep-learning-based GW data preprocessing, and the analysis has the potential to lay a solid foundation for future GW-related tasks.

Methods
The overall workflow is shown in Figure 1. The raw strain data are first preprocessed through whitening and normalization. WaveFormer denoising is then applied to suppress the noise, and inverse normalization is used to obtain the denoised observational data in the whitened domain. Finally, peak-finding post-processing and IFAR calculation are used to further evaluate the significance of our results.

Training dataset
First, we download all the public data released by the Gravitational Wave Open Science Center (GWOSC) [40][41][42] using CVMFS under the gwosc.osgstorage.org organization. In this study, considering that waveforms from two detectors are sufficient to calculate IFAR and that Virgo data are not available in O1, we limited our analysis to a subset of the data, specified by:
1. Both Hanford and Livingston data are available;
2. Data quality satisfies at least CBC_CAT3 (using the GWOSC definitions);
3. No hardware injections [43] are included;
4. No real signals: periods around known gravitational wave detections in O1, O2, and O3 are excluded.
The final observational data that we analysed from O1, O2, O3a, and O3b consist of 48.8 days, 118.1 days, 106.7 days and 96.30 days, respectively. These datasets are used for further significance estimates to validate our model's noise suppression performance.
Figure 1. GW noise suppression workflow with WaveFormer. Following a preprocessing procedure including whitening and normalization, LIGO observational strain data are processed into noisy input data, which are then fed into the trained WaveFormer model to suppress the noise. Given the input and denoised output data of WaveFormer, a post-processing procedure including regular peak-finding, cross-correlation, and coincidence analysis is performed to validate the denoising results. The waveforms on top are an example of the input and denoised output of the first BBH event, GW150914 [2].
Finally, each noisy signal in the training dataset is a linear combination of the whitened modeled waveform and randomly sampled noise obtained from GWOSC, down-sampled to 2,048 Hz with a Butterworth filter [47]. Specifically, we extract a 32 s-long noisy signal and compute its noise power spectral density. The power spectral density is then used to whiten both the noisy signal and the modeled waveform. Thereafter, the SNR of the whitened noisy signal is drawn uniformly from the range 4 to 30 by adjusting the amplitude of the signal relative to the noise level. We select the 8.0625 s-long segment at the center of the preprocessed 32 s-long noisy signal and normalize it by its standard deviation to form the input of WaveFormer. The central 8.0625 s segment is selected to minimize the impact of spectral leakage. The ground-truth label, which is the preprocessed modeled waveform, is also clipped and normalized by the corresponding standard deviation of the noisy signal, so that we can perform inverse normalization (Figure 1) and recover the amplitude of the pure whitened waveforms.
Millions of waveforms that combine noise and modeled waveforms are generated for model training to mitigate overfitting. The training dataset is augmented by randomly dropping the signals (and zeroing the labels) with 20% probability.
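The sample construction described above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes the noise and waveform are already whitened, and it approximates the optimal SNR of a whitened template by its Euclidean norm (an assumption that holds only for unit-variance whitened noise). The function name `make_training_sample` and the toy sine-Gaussian signal are hypothetical.

```python
import numpy as np

FS = 2048                      # sampling rate in Hz (from the text)
CROP = int(8.0625 * FS)        # 16512 samples fed to the network

def make_training_sample(white_noise, white_signal, target_snr, rng, drop_prob=0.2):
    """Mix a whitened waveform into whitened noise and crop the centre.

    Assumption: for unit-variance whitened noise, the optimal SNR of the
    template is approximated by its Euclidean norm.
    """
    if rng.random() < drop_prob:
        # Augmentation from the text: drop the signal and zero the label.
        scaled = np.zeros_like(white_signal)
    else:
        scaled = white_signal * (target_snr / np.linalg.norm(white_signal))
    noisy = white_noise + scaled

    start = (len(noisy) - CROP) // 2          # central 8.0625 s segment
    x = noisy[start:start + CROP]
    y = scaled[start:start + CROP]

    sigma = x.std()                           # kept for inverse normalization
    return x / sigma, y / sigma, sigma

# Toy usage with synthetic data (stand-ins for GWOSC noise and a modeled waveform).
rng = np.random.default_rng(0)
noise = rng.standard_normal(32 * FS)
t = np.arange(32 * FS) / FS
signal = np.sin(2 * np.pi * 50 * t) * np.exp(-((t - 16) ** 2) / 0.02)
x, y, sigma = make_training_sample(noise, signal, rng.uniform(4, 30), rng)
```

Multiplying the network output by `sigma` would undo the normalization, which corresponds to the inverse normalization step of the workflow.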

WaveFormer
Our AI-based workflow is centered on WaveFormer, a deep end-to-end transformer-based pretraining model (Figure 2) with science-driven innovations that significantly improve GW noise suppression performance. The input sequence to WaveFormer is a whitened and normalized noisy signal from either the Hanford or the Livingston LIGO detector. We denote an input sequence in the dataset as $S_{(1,16512)}$, meaning one waveform with 16512 sampling points (an 8.0625 s-long waveform sampled at 2048 Hz). First, multiple subsequences are generated by applying a fixed-length window (win_length = 0.125 s × 2048 Hz = 256) with a fixed stride. We set stride = $\tfrac{1}{2}$ × win_length = 128 because a window with 50% overlap loses minimal information in the frequency domain [48]. Thus, $S_{(1,16512)}$ is segmented into 128 subsequences, which are stacked together to form the input $I_{(128,256)}$ of WaveFormer. Each subsequence is treated as one token, so $I_{(i,j)}$ represents the j-th element of the i-th token. The same preprocessing is applied to the label $y_{(128,256)}$ and the mask.
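The segmentation above can be checked with a few lines of NumPy. This is only a sketch of the windowing arithmetic; the helper name `tokenize` is hypothetical and the input is a dummy ramp rather than strain data.

```python
import numpy as np

def tokenize(sequence, win_length=256, stride=128):
    """Split a 1D sequence into overlapping windows (one window = one token)."""
    windows = np.lib.stride_tricks.sliding_window_view(sequence, win_length)
    return windows[::stride].copy()

s = np.arange(16512, dtype=np.float32)   # dummy stand-in for S_(1,16512)
tokens = tokenize(s)                     # corresponds to I_(128,256)
```

With 16512 samples, a window of 256 and a stride of 128, the number of windows is (16512 − 256)/128 + 1 = 128, and the second half of each token equals the first half of the next one.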
The input data $I_{(128,256)}$ is first embedded into dense features (DF) through the embedding module, as described in equation (1). The dense features are composed of a token embedding (TE), a one-dimensional (1D) convolutional embedding (CE) and a positional embedding (PE). In the embedding module, the input and output channels of the 1D convolutional layer are both 128, the kernel size is 3, and GeLU is the activation function. Unlike natural language processing tasks, where each token is represented by a one-hot vector, each token of $I_{(128,256)}$ contains rich information. Hence, CE is introduced in WaveFormer to enrich neighboring low-level local waveform features and high-frequency signal information.
In equation (1), $\mathrm{Position}_{(128,128)}$ represents the position of each token, with each position described by a one-hot vector. $W_{te}$, $W_{ce}$ and $W_{pe}$ are the embedding weights of TE, CE and PE, respectively. Their hidden sizes are all 2048.
The dense features are then further processed through the residual module, as described in equation (2). The input and output channels of the two-dimensional convolutional (Conv2d) layer are both 1, and the kernel size is 7. As illustrated in Figure 2, the Conv2d layer is able to extract sparse spatial mid-level local features, acting like atrous convolution in image feature extraction. The advantage is that it increases the receptive field and learns intermediate-frequency information of the signals.
The residual feature is then fed into encoder blocks that consist of a multi-head self-attention module and a multilayer perceptron (MLP). The self-attention module can extract global waveform features through its global attention mechanism. Compared with the vanilla encoder of [49], some modifications are applied in WaveFormer. First, bias is removed from the MLP, because this helps stabilize the training process for large models [50]. Furthermore, the intermediate activation in the MLP is replaced with SwiGLU, $\mathrm{Swish}(xW) \cdot xV$, because it has been shown to increase performance [51] compared with ReLU, GeLU, and other activations.
Here, $h = 32$ is the number of attention heads, and the hidden size of each head is 64. $W^O$ is the output projection of self-attention, whose hidden size equals 2048.
The hidden sizes of the inner and outer dense layers of the MLP module are 12288 and 2048, respectively. The output of each encoder is used as the input of the following encoder, and there are $m$ encoders in WaveFormer. In our experiment, $m$ is set to 24. Finally, an output projection block and a dense layer that shares weights with the token embedding layer are applied to decode the output of the last encoder block, and the model output $O$ has the same size as the input $I_{(128,256)}$.
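To make the bias-free SwiGLU MLP described above concrete, here is a minimal NumPy sketch. It is not the Megatron-LM implementation: the weight names `W`, `V` and `W_out` follow the $\mathrm{Swish}(xW) \cdot xV$ notation of the text, the toy sizes are placeholders for 2048 and 12288, and how the inner width is split between the two branches is an assumption.

```python
import numpy as np

def swish(z):
    """Swish / SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_mlp(x, W, V, W_out):
    """Bias-free MLP with SwiGLU gating: (Swish(xW) * xV) W_out."""
    return (swish(x @ W) * (x @ V)) @ W_out

rng = np.random.default_rng(0)
hidden, inner = 8, 48            # toy sizes; the text uses 2048 and 12288
x = rng.standard_normal((4, hidden))
W = rng.standard_normal((hidden, inner)) * 0.1
V = rng.standard_normal((hidden, inner)) * 0.1
W_out = rng.standard_normal((inner, hidden)) * 0.1
out = swiglu_mlp(x, W, V, W_out)
```

Because no bias terms are present, an all-zero input maps to an all-zero output, which is one simple way to verify that a layer is bias-free.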

Masked loss
To further improve denoising performance, we propose a masked loss mechanism for WaveFormer training, in which a mask is applied when calculating the mean square error. For each output element $O_{(i,j)}$ and its corresponding ground-truth label $y_{(i,j)}$, the squared error $e_{(i,j)}$ is defined as
$$e_{(i,j)} = \left(O_{(i,j)} - y_{(i,j)}\right)^2 .$$
The masked loss is defined as
$$\mathcal{L}_{\mathrm{masked}} = \frac{1}{N}\sum_{i,j} M_{(i,j)}\, e_{(i,j)}, \qquad M_{(i,j)} = \begin{cases} 1, & \text{inside the mask borders},\\ \alpha, & \text{otherwise}, \end{cases}$$
where $N$ is the number of elements and $\alpha$ is a weight factor for balancing the loss contribution of different elements. In our experiment, $\alpha$ was set to 1/6. Compared with the vanilla transformer, our masked loss takes a more fine-grained form. It can distinguish not only tokens but also each sample within a token, which significantly accelerates convergence and improves training stability. Specifically, each noisy signal has a corresponding mask of the same shape. The left and right borders of the mask are calculated based on post-Newtonian theory and linear perturbation theory. Details of the mask description are provided in Section 4.
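A minimal NumPy sketch of such a masked mean square error is given below. It assumes the loss is a plain average of mask-weighted squared errors, with mask value 1 inside the signal region and $\alpha$ elsewhere; the exact normalization used by the authors is not stated here, and the signal region chosen in the example is purely hypothetical.

```python
import numpy as np

def masked_mse(output, label, mask):
    """Mean of mask-weighted squared errors over all token elements."""
    return float(np.mean(mask * (output - label) ** 2))

alpha = 1 / 6
mask = np.full((128, 256), alpha)
mask[40:60, :] = 1.0              # hypothetical signal region (20 tokens)

label = np.zeros((128, 256))
output = label + 1.0              # unit error everywhere
loss = masked_mse(output, label, mask)
```

With a unit error everywhere, the loss reduces to the mean mask value, (20 + 108/6)/128 = 0.296875, which shows how samples outside the signal region are down-weighted rather than ignored.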

Injection test
To draw a more robust conclusion, we perform an injection test with real LIGO-Virgo observational data. The noise is sampled from the first two observing runs and then injected with a black hole waveform $h$ tuned to the desired optimal SNR [52], $\mathrm{SNR}_{\mathrm{opt}} = \sqrt{\langle h | h \rangle}$. The scalar product $\langle \cdot | \cdot \rangle$ represents the noise-weighted inner product [53]. In total, we generate a large number of injections (5000) using the same prior as in the injection test set. All the injected templates and the denoised samples are used to calculate the matched-filtering SNR with the original injections $d$. Within a 0.25-second window, we use the matched-filtering SNR [54], $\mathrm{SNR}_{\mathrm{mf}} = \langle d | h \rangle / \sqrt{\langle h | h \rangle}$, on the original injections for both injected templates and denoised samples. Moreover, we analyzed phase recovery performance on simulated compact binary coalescence signals. To determine how well the recovered signals fit the expected waveform templates, the overlap $\mathcal{O}$ [7] between them is computed as
$$\mathcal{O}(h_d, h) = \frac{\langle h_d | h \rangle}{\sqrt{\langle h_d | h_d \rangle \langle h | h \rangle}},$$
where $h_d$ and $h$ are the denoised waveform and the whitened injection, respectively, and $\langle \cdot | \cdot \rangle$ is again the noise-weighted inner product [53].
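The quantities used in the injection test can be illustrated with a short sketch. It assumes the series are already whitened with unit-variance noise, so that the noise-weighted inner product is approximated by a plain dot product, and it does not maximize over time or phase shifts; both simplifications are assumptions of this example, and the function names are hypothetical.

```python
import numpy as np

def inner(a, b):
    """Noise-weighted inner product, approximated for whitened series."""
    return float(np.dot(a, b))

def optimal_snr(h):
    """SNR_opt = sqrt(<h|h>)."""
    return np.sqrt(inner(h, h))

def matched_filter_snr(d, h):
    """SNR_mf = <d|h> / sqrt(<h|h>)."""
    return inner(d, h) / np.sqrt(inner(h, h))

def overlap(h_d, h):
    """Normalized inner product between denoised output and template."""
    return inner(h_d, h) / np.sqrt(inner(h_d, h_d) * inner(h, h))

t = np.linspace(0.0, 0.25, 512, endpoint=False)   # 0.25 s window at 2048 Hz
h = np.sin(2 * np.pi * 100 * t) * np.exp(-((t - 0.125) ** 2) / 0.002)
```

The overlap is insensitive to the overall amplitude (`overlap(3 * h, h)` is 1), which is why the text pairs it with the matched-filtering SNR to assess amplitude recovery separately.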

IFAR calculation
Using the noise suppression results on real observational data, we further evaluate denoising performance by comparing the FAR of BBH events with the public reports [6][7][8][9][55][56][57][58]. First, we obtain the denoised output from WaveFormer. Then, triggers are defined and identified in three steps:
1. Find the maximum value in the time-series data.
2. Search for the nearby maximum outside a 0.2 s time window around the maximum found in the previous step.
3. Repeat the second step until the maximum and all local maxima (referred to as "triggers") are identified.
After finding all triggers, the following procedure is used to decide whether a candidate event is present:
1. By keeping only triggers that exist in both detectors, we obtain valid triggers.
2. We then calculate the correlation between the noisy and corresponding denoised segments of the to-be-evaluated trigger (target trigger).
3. Through time shifts, background analysis is done on the other triggers around the target trigger.
Finally, by counting the number of false alarm trigger pairs, we obtain the IFAR value of the target trigger, which represents the reported or candidate BBH event in this experiment. For all GW events with a given IFAR observed in data of duration $T$, we divide the data into analysis periods that allow at least 7 days (30 days for GW150914, GW151226, GW170104, GW170814, GW170809, GW170823, GW170412, GW190521_074359, GW190707_093326, GW200129_065458, GW200225_060421) of coincident data between the two LIGO detectors. The total amount of background time analyzed equals $T_{\mathrm{obs}} = T^2/\delta$, where $\delta$ is the time-shift interval (we set it to 0.1 s, the same as PyCBC [47]). The minimum FAR scales as $1/T_{\mathrm{obs}} = \delta/T^2$, so that approximately 7 to 30 days of coincident data are sufficient to measure a FAR of 1 in $(1 \sim 4) \times 10^5$ years.
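The trigger search and the background-time arithmetic above can be sketched as follows. This is an illustrative reading of the three trigger steps as a greedy search that blanks ±0.2 s around each maximum; the function names, the `max_triggers` stopping rule and the use of a Julian year are assumptions of this example, not details given in the text.

```python
import numpy as np

def find_triggers(series, fs, window=0.2, max_triggers=10):
    """Greedy peak finding: take the global maximum, exclude +/- `window`
    seconds around it, and repeat until `max_triggers` are found."""
    work = np.array(series, dtype=float)          # working copy
    half = int(round(window * fs))
    triggers = []
    while len(triggers) < max_triggers:
        idx = int(np.argmax(work))
        if not np.isfinite(work[idx]):
            break                                  # everything is excluded
        triggers.append(idx)
        work[max(0, idx - half): idx + half + 1] = -np.inf
    return sorted(triggers)

def background_time_years(coincident_days, delta=0.1):
    """T_obs = T^2 / delta, converted from seconds to years."""
    T = coincident_days * 86400.0
    return T ** 2 / delta / (365.25 * 86400.0)

# Toy series: a loud peak, a weaker peak inside its 0.2 s window, and a distant peak.
x = np.zeros(4096)
x[500], x[520], x[2000] = 5.0, 4.0, 3.0
triggers = find_triggers(x, fs=2048, max_triggers=2)
years = background_time_years(7)
```

In the toy series the peak at sample 520 is suppressed because it lies within 0.2 s of the louder peak at sample 500. For 7 days of coincident data and δ = 0.1 s, the background time is about 1.2 × 10^5 years, in line with the lower end of the range quoted above.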

Science-driven deep neural network
We design a deep learning model (Figure 2a) that consists of stacks of transformer encoders [49], residual blocks [59] and embedding modules. The model is referred to as WaveFormer. Compared with the vanilla transformer [49], this work proposes several science-driven innovations that improve noise suppression performance. First, the combination of convolutional neural networks and a transformer enables our model to capture generic and hierarchical features of GWs. As depicted in Figure 2b, the convolutional embedding (CE) in the embedding module and the residual module extract low-level and mid-level local features, respectively. The encoders, on the other hand, are primarily concerned with high-level global features. This hierarchical feature extraction mechanism is robust when applied to noise suppression tasks. For GW signals, high-frequency information corresponds to low-level local features, since these emphasize the connections between nearby data points. Similarly, mid-level local features and high-level global features correspond to intermediate- and low-frequency GW signal information, respectively, because they are more concerned with distant sampling points, separated by milliseconds to seconds. From a scientific standpoint, the comprehensive hierarchical feature extraction mechanism can process long signals and learn rich frequency-domain GW information, resulting in WaveFormer's excellent denoising performance.
Second, the dynamic mask and masked loss are introduced during network training. Compared with masked self-attention in the vanilla transformer, our dynamic mask is more fine-grained: we can assign a different mask value to each element within a token, which masked self-attention cannot. From a scientific standpoint, the importance of each sampling point for phase and amplitude recovery varies; the dynamic mask can distinguish this variation and assign an appropriate mask value accordingly. As a result, the effectiveness of GW denoising is enhanced even further. Finally, some minor adaptations are applied to the activation and bias settings of the encoders. These adaptations have been shown to stabilize and accelerate network training and convergence. WaveFormer is implemented with the Megatron-LM [60] and Ray [61] frameworks in PyTorch [62]. For optimization, we use the Adam [63] algorithm, which works well on problems with large datasets and many parameters. Data-parallel training was performed on eight NVIDIA V100 32GB GPUs and took approximately 24 hours for 300,000 iterations.

Effect on realistic noise
We present WaveFormer's noise suppression performance on real observational data by evaluating the difference in the data after noise suppression. All instances of the input data are processed with the same whitening, normalizing and denoising procedure as in our proposed workflow. The upper panels of Figure 3 show the 2048 s-long off-source data around GW200208_130117 in the time (left) and frequency (right) domains. The noise level percentile of the amplitude is clearly compressed, reduced from 52.5 to 0.47. Analyzing the ASD further reveals that WaveFormer is able to effectively eliminate narrowband and broadband spectral features while drastically decreasing the overall level of all frequency contributions. Specifically, the ASD of intermediate-frequency noise is 10 times lower after noise suppression, while the ASD of low-frequency and high-frequency noise is approximately 1000 times lower than before.
Furthermore, we investigated the effect of noise suppression on loud noise artifacts, known as glitches. We use the Gravity Spy database [64][65][66] to obtain various common types of glitches with an estimated SNR larger than 10 and confidence > 0.95. We focus on three categories (Blip, Scattered Light, and Koi Fish) since they are known to be problematic: they can mimic the response of the detectors to an actual GW event [14] and thus limit the overall sensitivity of GW searches [20][21][22][67]. Peak frequency is defined as the frequency with the highest amplitude in the frequency spectrum of the signal. The bottom panels of Figure 3 compare the amplitude at peak frequency between the original and suppressed glitches during the second half of the third observing run (O3b), together with the corresponding ASD distribution. Detailed results for the other observing runs (O1, O2 and O3a) are given in Section 4, and they exhibit a distribution pattern similar to that of O3b. The amplitude is compressed to multiple orders of magnitude below its original value. Taking the O3b result as an example, the average compression ratios of Blip, Scattered Light, Koi Fish, and other instances are 78.7, 184.7, 605.7, and 611.4, respectively. The ASD distribution is similar to that of pure noise. The results indicate that our model can significantly suppress the level of glitches embedded in real advanced LIGO-Virgo noise.
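The peak-frequency comparison can be sketched as below. This example assumes that the compression ratio is the spectral amplitude of the original data at its peak frequency divided by the amplitude of the denoised data at the same frequency bin; the function names and the synthetic 100 Hz tone are hypothetical stand-ins for whitened glitch data.

```python
import numpy as np

def peak_frequency(x, fs):
    """Return (frequency, amplitude, bin index) of the largest spectral peak."""
    spectrum = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    k = int(np.argmax(spectrum))
    return freqs[k], spectrum[k], k

def compression_ratio(original, denoised, fs):
    """Amplitude at the original peak frequency, before vs. after denoising.

    Assumes both series have the same length and sampling rate.
    """
    _, a_original, k = peak_frequency(original, fs)
    a_denoised = np.abs(np.fft.rfft(denoised))[k]
    return a_original / a_denoised

fs = 2048
t = np.arange(fs) / fs
original = np.sin(2 * np.pi * 100 * t)     # synthetic stand-in for a glitch
denoised = 0.01 * original                 # pretend suppression by 100x
f_peak, _, _ = peak_frequency(original, fs)
ratio = compression_ratio(original, denoised, fs)
```

Here the peak sits at 100 Hz and the ratio is 100, so a value such as 78.7 for Blip glitches would correspond to a nearly two orders of magnitude reduction in amplitude at the peak frequency.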

Recovery of binary black holes
Building on its ability to suppress both pure noise and loud artifacts, we further validate WaveFormer's signal recovery performance while suppressing noise as far as possible. Specifically, we apply the trained network to BBH injections (see Section 2.4) in LIGO observational noise and evaluate phase and amplitude recovery accuracy. The overlap and the matched-filtering signal-to-noise ratio (MFSNR) [68] are calculated to represent phase and amplitude recovery performance, respectively. We calculate the overlap over the same signal duration as [30] for phase recovery and obtain overlaps similar to those of [31,32]. For the overlap distribution of the validation dataset for Hanford O3b (similar results for other observing runs are provided in Section 4), the overlap is higher than 0.9 for most waveforms (Figure 4d) and, as expected, higher SNR leads to better overlap performance (Figure 4c) with injections in LIGO-Virgo noise for all three observing runs, which is consistent with [30,31]. When the optimal SNR > 6, the overlaps of almost all samples, specifically 94.58%, are higher than 90%. We also observe that WaveFormer is slightly biased against low-mass systems: around 13% of samples have overlaps smaller than 0.90 when the chirp mass is < 25 solar masses and the optimal SNR > 6. For high-mass systems, > 96% of samples have overlaps > 90%.
These results demonstrate that the phase information of the GW waveform can be accurately recovered using WaveFormer. Figure 4c also shows a comparison between the injected templates and denoised waveforms using MFSNR. As expected, the denoised SNR comes quite close to the target one in cases of high overlap, while lower-overlap cases have denoised SNR with larger variance. The root-mean-square residual for Hanford O3b is 0.53 ± 0.83, which is significantly better than the results of [31] for Livingston O1 data. As shown in Figure 4b, we go deeper and analyze the ASD of WaveFormer's denoised output. Within the intermediate frequency range that covers rich BBH signal information, the ASD distribution of the denoised waveform is evidently consistent with that of the target signal. The comparison of ASDs shows that the denoised waveform's amplitude is reconstructed with a median error of about 7% relative to the target signal's amplitude, further illustrating the effectiveness of our method in recovering the gravitational wave signal.
Figure 5 presents the output of our denoising model when applied to real advanced LIGO noise that contains different BBH events. We further compare the whitened GW template [6] with WaveFormer's output and derive four cases from the comparison. Figure 5a shows the first successful detection of a GW signal, GW150914. We achieved perfect recovery of the inspiral and merger phases at both detectors. Compared with [29,30], our result recovers not only the amplitude but also the ringdown part, with an overall overlap > 99.10% on the 0.25-second signal around the merger location. GW151012 has the lowest network SNR, $6.4^{+1.3}_{-1.3}$ for Hanford and $5.8^{+1.2}_{-1.2}$ for Livingston, among all BBH events in GWTC-1; Bacon et al. [31] therefore recovered both the phase and amplitude poorly, while our model shows the ability to retrieve clean cycles. We completely recovered the phase information and obtained the amplitudes reasonably well at the merger and ringdown of GW151012 (Figure 5b). The signal overlaps for Hanford and Livingston are 99.04% and 97.16%, respectively. In the case of GW170823 (Figure 5c), a BBH event with a high chirp mass of $29.2\,M_\odot$, both Bacon et al. [31] and Murali et al. [32] could recover the phase of the original GW signal for certain cycles but failed to recover the complete evolution in amplitude. In contrast, we observe a clear match in the amplitude of the peaks of the extracted GW170823 waveform, with overlaps of 96.95% and 99.00% for Hanford and Livingston, respectively. Figure 5d shows the most recently detected BBH candidate, GW200208_130117, from the O3b observing run. Its network SNR is low, like that of GW151012, $10.8^{+0.3}_{-0.4}$ to be exact, and we can still recover the GW signal well. These results show that our denoising algorithm outperforms others by capturing the characteristic chirping morphology of BBH evolution, and can denoise signals in realistic detection scenarios without affecting signal characteristics such as phase and amplitude.

Significance estimates
As shown in Figure 6, we assessed the performance of our denoising workflow by comparing results with the GWTC-1, GWTC-2, GWTC-2.1, and GWTC-3 catalogs (referred to as the 'reported catalogs') as well as their associated data releases. We prioritized the data obtained directly from these releases, ensuring the most accurate and up-to-date analysis, rather than relying solely on the summary tables or figures in the publications. Compared with our results, the significance estimates from GWTC [6][7][8][9] and OGC [55][56][57][58] show a more significant divergence in the distribution of the IFAR from the reported catalogs. For all 75 reported BBH events, we achieve a significant IFAR improvement, which indicates that loud terrestrial noise is well suppressed. For example, in the case of the low network SNR event GW200208_130117 (as shown in Figure 5d), we obtained an IFAR of 8916 years, whereas the maximum IFAR in other catalogs is less than 4000 years. Our analysis indicates that the variability in improvement is closely related to the nature of the noise in the original data. The inherent non-Gaussian, non-stationary characteristics of the noise and the varying strategies of different pipelines in signal recognition contribute to the observed discrepancies in IFAR improvement. Furthermore, we found that IFAR performance depends significantly on the extent to which WaveFormer reduces the non-Gaussian noise near each event. This observation suggests that for events in which IFAR shows substantial improvement, the misleading non-Gaussian noise is effectively eliminated. Conversely, for events where IFAR underperforms, the denoised data still retain non-Gaussian characteristics, which is possibly due to inherent systematic errors of WaveFormer.
[Figure 4 caption, partial] The data ASD after noise suppression. Within the frequency region (red dashed rectangle) that contains the richest signal information, the ASD of our denoised output is consistent with that of the target; the relative difference between the medians of the power spectral density distributions is about 7%. (lower left) Signal phase recovery performance. The overlap is calculated between WaveFormer's denoised output and its corresponding ground-truth whitened waveform to evaluate phase recovery accuracy. Higher optimal SNR and chirp mass both lead to higher overlap. (lower right) After noise suppression, the overlap is higher than 0.9 for most waveforms, which validates our phase recovery accuracy.
[Figure 5 caption, partial] For most events, we achieved perfect phase information recovery. The ringdown part of GW150914, in particular, can be recovered very well. Under all circumstances considered, such as low network SNR (GW151012 and GW200208_130117) or a high chirp mass system (GW170823), the amplitudes of our denoised signal match those of the templates.

Conclusion
Large-scale neural networks are a powerful tool that allows us to apply machine learning algorithms directly to raw observational GW data for data processing. We develop an AI-based workflow centered on WaveFormer to achieve accurate and real-time GW noise suppression. Our proposed WaveFormer model is based on the transformer, but with several science-driven innovations. The combination of a convolutional neural network and a transformer gives our model the ability to extract hierarchical features, which from a scientific perspective correspond to GW signal information over a wide frequency range. Moreover, a masked loss mechanism is proposed and applied. It can distinguish the recovery importance of different sampling points and assign an appropriate mask accordingly.
All the proposed adaptations have been shown to improve noise suppression performance as well as stabilize and accelerate network convergence. First, we directly evaluate the model's noise suppression on pure noise and glitches. Pure noise is significantly suppressed: the standard deviation of the noise amplitude and the noise ASD over the whole frequency range are decreased by at least an order of magnitude. For different glitch categories, the glitch amplitude can be compressed to multiple orders of magnitude below its original value. Second, we validate the model's signal recovery performance. On real observational data and BBH events, we achieve state-of-the-art results compared with other deep-learning-based denoising methods. WaveFormer can recover the amplitude of low network SNR events and high chirp mass events where other methods fail. Finally, through significance estimates, we show that our AI-based denoising workflow yields a dramatic data quality improvement and achieves a significant IFAR improvement on 75 reported BBH events.
Along with the provided applications, this work can be a starting step towards a GW search strategy that can potentially be extended and can contribute to the upcoming heavy data processing and GW search procedures of the fourth observing run.

Mask definition
We implement a dynamic masking operation based on the characteristics of each modeled waveform. As shown in Figure 7, we assume the signal waveform is well modeled and that the maximum absolute value of the waveform is located at the merger time. The left border $t_0$ of the mask depends on the lower frequency cutoff $f_0 = 20$ Hz, based on post-Newtonian theory. We approximate the evolution of the pre-merger part as
$$f(t) = \frac{1}{8\pi \mathcal{M}^{5/8}} \left(\frac{5}{t}\right)^{3/8},$$
where $\mathcal{M}$ represents the chirp mass. For the right border $t_1$, we refer to the damping time $\tau$ from linear perturbation theory to ensure the contribution of the ringdown phase. We specify ten times the damping time of the dominant quasinormal mode, $10 \times \tau_{220}$, to ensure that enough of the waveform is enclosed for effective denoising. Mask values between $t_0$ and $t_1 + 10\tau_{220}$ are set to 1, and to $\alpha$ otherwise. The value of $\alpha$ is decided by the length ratio of the BBH waveform (generally 0.5 to 2 seconds) to the model input (8.0625 seconds). We did ablation studies on four settings ($\alpha \in \{1/10, 1/6, 1/4, 1\}$) and found that $\alpha = 1/6$ performed best. Hence, in our experiment, $\alpha$ is set to 1/6. The dynamic mask is then applied to the training loss and stabilizes the training process.
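The mask borders described above can be sketched numerically. The example inverts the pre-merger frequency evolution for the time to merger at $f_0 = 20$ Hz, with the chirp mass converted to seconds (geometric units, $G = c = 1$). It assumes that the right border lies $10\tau_{220}$ after the merger sample and takes $\tau_{220}$ as a user-supplied value; the function names and the 4 ms damping time are hypothetical.

```python
import numpy as np

MSUN_SEC = 4.925491e-6   # G * M_sun / c^3 in seconds

def time_to_merger(chirp_mass_msun, f0=20.0):
    """Invert f(t) = (5/t)^(3/8) / (8 pi M^(5/8)) for t, with M in seconds."""
    m = chirp_mass_msun * MSUN_SEC
    return 5.0 * (8.0 * np.pi * f0) ** (-8.0 / 3.0) * m ** (-5.0 / 3.0)

def dynamic_mask(n_samples, fs, merger_index, chirp_mass_msun, tau_220,
                 alpha=1 / 6, f0=20.0):
    """Mask equal to 1 from the f0 crossing to 10*tau_220 after merger, alpha elsewhere."""
    mask = np.full(n_samples, alpha)
    left = merger_index - int(round(time_to_merger(chirp_mass_msun, f0) * fs))
    right = merger_index + int(round(10.0 * tau_220 * fs))
    mask[max(left, 0): min(right, n_samples - 1) + 1] = 1.0
    return mask

t_insp = time_to_merger(30.0)                       # chirp mass of 30 solar masses
mask = dynamic_mask(16512, 2048, merger_index=8256,
                    chirp_mass_msun=30.0, tau_220=0.004)
```

For a chirp mass of 30 solar masses the time from 20 Hz to merger is roughly 0.76 s, which falls inside the 0.5 to 2 second range quoted for typical BBH waveforms.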

Detailed suppression results
We provide a clearer picture of the data post-processing, not just from a spectral perspective but also encompassing time series and spectrogram to convey a more holistic view of the method's effectiveness.Examples of signal(GW150914) and blip are shown in Figure 8 and Figure 9.As for signal, we give a 30-second whole signal and its corresponding 0.2-second zoomed-in segment, as well as spectrogram and denoising performance on both two detectors.Similarly, the time series and spectrogram of the blip are represented in Figure 9.
Figure 6. IFAR for LIGO BBH events. The BBH events are collected from the first, second, and third Gravitational-Wave Transient Catalogs (GWTC-1/2/2.1/3) [6-9] and sorted by FAR (from low to high). The events marked with a dagger (†) contain Virgo data. The point with an arrow attached represents the minimum IFAR that can be achieved by the pipelines from the reported catalogs. The black dotted line represents the 2.0 yr$^{-1}$ FAR threshold used in the GWTC-2/2.1/3 catalogs. For all reported BBH events, we achieve significant IFAR improvement.
Glitches are common, occurring at a rate of ≲1 per minute in the LIGO detectors during O3a [7]. To further investigate the performance of noise suppression on these loud non-Gaussian artifacts, we use the Gravity Spy database [64][65][66], which contains a wide range of glitches. The total number of LIGO glitches considered in this work from the first three observing runs (O1, O2, and O3, where O3 is divided into O3a and O3b) is 15487, 41497, 101614, and 144958 for O1, O2, O3a, and O3b, respectively. We set a minimum confidence threshold (0.95) and an estimated SNR threshold (10) for all glitch categories to reduce the risk of contamination from the machine learning classifier in Gravity Spy. In our AI-based workflow, all instances undergo the whitening, normalization, and WaveFormer denoising steps, and the maximum amplitude around each instance's peak frequency is compared to its original value in the whitened domain.
We focus on three categories (Blip, Scattered Light, and Koi Fish) because they are known to be problematic and can create considerable challenges for candidate event analysis [14, 20-22, 67]. Results are given in Figure 10 and Table 1. The percentage of instances with a higher denoised amplitude ($a_\mathrm{denoised}$) than before ($a_\mathrm{original}$) is quite small, and only O1 exceeds 1%. The amplitude reduction factors, $a_\mathrm{original}/a_\mathrm{denoised}$, for the glitch sets in O1, O2, O3a, and O3b are all more than 30. The noise amplitude spectral density of the glitches from O1, O2, O3a, and O3b is presented in Figure 11.
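The glitch selection and the amplitude comparison described above can be illustrated with a short Python sketch. It is a hedged example rather than the pipeline used in this work: the record fields `confidence` and `snr`, the function names, and the choice of the median as the summary statistic are all assumptions made for illustration, and the amplitudes are taken as already measured around each glitch's peak frequency in the whitened domain.

```python
import numpy as np


def select_glitches(records, min_confidence=0.95, min_snr=10.0):
    """Keep records that pass both thresholds (field names are hypothetical)."""
    return [r for r in records
            if r["confidence"] >= min_confidence and r["snr"] >= min_snr]


def suppression_summary(a_original, a_denoised):
    """Summarize how strongly the glitch amplitudes were reduced."""
    a_original = np.asarray(a_original, dtype=float)
    a_denoised = np.asarray(a_denoised, dtype=float)
    ratio = a_original / a_denoised
    return {
        # Typical reduction factor; the median is an assumed choice.
        "median_ratio": float(np.median(ratio)),
        # Share of instances whose amplitude increased after denoising.
        "percent_amplified": float(100.0 * np.mean(a_denoised > a_original)),
    }
```

A reduction factor above 1 means the glitch was suppressed, while `percent_amplified` corresponds to the fraction of instances with $a_\mathrm{denoised} > a_\mathrm{original}$ discussed in the text.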

Example of denoising result on NSBH event
From a scientific standpoint, the comprehensive hierarchical feature extraction mechanism can process signals up to 8.0625 seconds long and learn rich gravitational wave information. This input segment length, while seemingly excessive for typically short BBH signals, is crucial for effective noise modeling and signal reconstruction. Such a design is an advantage not only for BBH signals but also shows promise for longer signal types such as NSBHs and BNSs. We therefore conducted additional tests to explore the potential of WaveFormer in the context of NSBH signals.
Our preliminary tests on available NSBH data from the O1 to O3 observing runs revealed interesting results. Notably, for the event GW191204_171526, classified as either an NSBH or a low-mass BBH candidate in GWTC-3, WaveFormer demonstrated a significant improvement (Figure 12). The overlap with IMRPhenomXPHM reached 0.93 and 0.95 on H1 and L1, respectively, which is a marked improvement over the overlaps achieved by BayesWave and cWB (between 0.82 and 0.86). This single yet promising result suggests that, while WaveFormer is currently optimized and trained on BBH signals, it holds potential for application to NSBH signals. It highlights not only WaveFormer's current capabilities but also its potential versatility and adaptability to other types of gravitational wave signals. The result also demonstrates the significance of WaveFormer as a valuable addition to the IGWN software ecosystem.
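For readers unfamiliar with the overlap statistic quoted above, the following Python sketch shows one common way to compute it. It is an illustration under assumptions, not the procedure used in this work: both series are assumed to be already whitened, so the noise-weighted inner product reduces to a plain dot product, and the overlap is maximized over relative time shifts only. The function name `overlap` is hypothetical.

```python
import numpy as np


def overlap(a, b):
    """Normalized cross-correlation of two whitened series, maximized over lag."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    # Zero-pad so that the FFT-based correlation is linear, not circular.
    n = a.size + b.size - 1
    corr = np.fft.irfft(np.fft.rfft(a, n) * np.conj(np.fft.rfft(b, n)), n)
    norm = np.sqrt(np.dot(a, a) * np.dot(b, b))
    return float(np.max(corr) / norm)
```

An overlap of 1 indicates that the two waveforms agree up to an overall amplitude scale and a time shift, so the statistic is mainly sensitive to phase evolution rather than to amplitude.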

Figure 2. Science-driven WaveFormer architecture design and hierarchical feature extraction. a, Overview of the WaveFormer architecture. The deep neural network mainly consists of an embedding module, a residual module, and encoders. The input and output are noisy and denoised data, respectively. b, Illustration of hierarchical feature extraction, which consists of low-level local features (top), middle-level local features (middle), and high-level global features (bottom). From a science-driven perspective, they correspond to the high-, intermediate-, and low-frequency information of signals. Each feature has the same background color as its corresponding network module in a.

Figure 3. Noise suppression performance on LIGO's O3b data. (Upper panels: results for pure noise) The histogram distribution (left) and ASD (right) of the off-source strain data around GW200208_130117 from the Hanford detector before and after noise suppression. On the histogram, the dashed lines represent the 5th and 95th percentile noise levels, before = 52.5 and after = 0.47. On the spectrogram, the typical percentiles are displayed as a shaded zone surrounding the median, from the 5th to the 95th percentile. (Bottom panels: results for glitches) (left) The amplitude of glitches (Hanford O3b) is 40 to 400 times lower than its original value for almost all samples. The dashed diagonal line marks where the input glitch amplitude equals the denoised glitch amplitude. The amplitude of the notorious blip, which is one of the limiting sources of noise for GW searches from high-mass compact binaries [14], is 78.7 times lower than its original value. (right) Same as the top right panel but for glitches, showing that the glitch ASD is decreased by 1 to 3 orders of magnitude.

Figure 4. Signal recovery results for simulated signals with noise from Hanford O3b data. (upper left) Signal amplitude recovery performance, comparing the target SNR with the SNR of the denoised data. The closer the scatter points are to the black dashed line, the better the amplitudes are recovered. For most waveforms, the corresponding scatter point is quite close to the line. (upper right) The data ASD after noise suppression. Within the frequency region that contains the richest signal information (red dashed rectangle), the ASD of our denoised output is closely consistent with that of the target; the relative difference between the medians of the power spectral density distributions is about 7%. (lower left) Signal phase recovery performance. The overlap is calculated between WaveFormer's denoised output and the corresponding ground-truth whitened waveform to evaluate the phase recovery accuracy. Higher optimal SNR and higher chirp mass both lead to higher overlap. (lower right) After noise suppression, the overlap is higher than 0.9 for most waveforms, which validates our phase recovery accuracy.

Figure 5. Comparison of denoised signals from LIGO observation data with their optimal templates for the events: a. GW150914, b. GW151012, c. GW170823, and d. GW200208_130117. Hanford and Livingston are represented by H1 and L1. For most events, we achieved perfect phase information recovery. The ringdown part of GW150914, in particular, is recovered very well. Under all circumstances considered, such as low network SNR (GW151012 and GW200208_130117) or a high chirp mass system (GW170823), the amplitudes of our denoised signals match those of the templates.

Figure 7. Dynamic mask definition strategy. Top panel: An example of a chirp-like time-domain strain with increasing frequency and amplitude; the location of the maximum absolute value is taken as the merger time. Middle panel: Time-frequency representation of the above chirp-like signal with the low-frequency cutoff at $f_0$. $10 \times \tau_{220}$ represents ten times the damping time of the dominant quasinormal mode. Bottom panel: The mask sequence, based on the characteristic chirping morphology of the above BBH evolution. Values within the mask are set to 1, and to $\alpha$ elsewhere.

Figure 10. Comparison between raw and denoised glitches from the first three observing runs (O1, O2, and O3, where O3 is split into O3a and O3b). The dashed diagonal line marks where the raw glitch amplitude equals the denoised glitch amplitude. For clarity, only 20% of the samples are shown. The four subfigures show a similar distribution pattern, and the amplitudes of all glitch categories are well suppressed.

Figure 12. WaveFormer's denoising performance on the NSBH event GW191204_171526. On H1 and L1 data, we achieve overlap values of 0.9267 and 0.9582, respectively, with the IMRPhenomXPHM template, which are significantly higher than those of BayesWave and cWB (0.82 to 0.86).

Table 1. Noise suppression performance on various glitch categories from the first three observing runs (O1, O2, and O3, where O3 is divided into O3a and O3b).