Accumulating Errors in Tests of General Relativity with Gravitational Waves: Overlapping Signals and Inaccurate Waveforms

Observations of gravitational waves (GWs) from compact binary coalescences provide powerful tests of general relativity (GR), but systematic errors in data analysis could lead to incorrect scientific conclusions. This issue is especially serious for third-generation GW detectors, for which the signal-to-noise ratio (S/N) is high and the number of detections is large. In this work, we investigate the impact of overlapping signals and inaccurate waveform models on tests of GR. We simulate mock catalogs for the Einstein Telescope and Cosmic Explorer and perform parametric tests of GR using waveform models with different levels of inaccuracy. We find that the systematic error in non-GR parameter estimates can accumulate toward a false deviation from GR when results from multiple events are combined, even though a Bayesian model selection analysis may not favor a deviation. Waveform inaccuracies contribute most to the systematic errors, but multiple overlapping signals can magnify the effects of systematics owing to the imperfect removal of signals. We also point out that testing GR using selected "golden binaries" with high S/N is even more vulnerable to false deviations from GR. The problem of error accumulation is universal; we emphasize that it must be addressed to fully exploit the data from third-generation GW detectors and that further investigations, particularly of waveform accuracy, will be essential.


INTRODUCTION
The observation of gravitational waves (GWs) from compact binary coalescences (CBCs) provides an ideal means of testing general relativity (GR) in the strong-field regime (Abbott et al. 2016, 2017a,b, 2019a,b, 2021a,b). The latest GW event catalogs contain nearly 100 CBC events (Abbott et al. 2021c,d), based on which various tests of GR have been performed (Abbott et al. 2021a,b). No concrete evidence of a deviation from GR has been found yet, but unprecedented constraints have been placed on possible violations of the theory. In the coming decades, the third-generation (3G) ground-based GW detectors, i.e., the Einstein Telescope (Punturo et al. 2010) and Cosmic Explorer (Reitze et al. 2019), are expected to detect O(10^5) CBC events per year, with signal-to-noise ratios (SNRs) up to thousands (Maggiore et al. 2020; Himemoto et al. 2021; Samajdar et al. 2021; Relton & Raymond 2021; Oguri 2018). Since the statistical uncertainty of parameter estimates shrinks as the SNR increases, and shrinks further when a catalog of events is combined, observations from 3G GW detectors are expected to yield much tighter constraints on gravity theories.
However, this inspiring prospect of an enlarged detection catalog and higher SNRs brings with it many difficulties in data analysis. For the purpose of testing GR (and any other theory), one needs to ensure that the systematic errors are small, so that the analysis will not favor the wrong theory and cause a false alarm (or false dismissal). Parameterized tests of GR (Meidam et al. 2018) suffer from the same problems as parameter estimation (PE) in general, which has been investigated in many works (e.g. Cutler & Vallisneri 2007; Antonelli et al. 2021). For instance, inaccurate waveform models may have already caused some tensions in current GW observations (Hu & Veitch 2022; Williamson et al. 2017) and are expected to be more important in future high-SNR detections (Cutler & Vallisneri 2007; Gamba et al. 2021; Pürrer & Haster 2020). Additionally, the 3G detectors, with their improved low-frequency sensitivity, are able to observe multiple signals at the same time. Detected overlapping signals cannot be perfectly removed from the data, and could have a non-negligible impact on PE when the merger times of overlapping signals are close (Himemoto et al. 2021; Samajdar et al. 2021; Relton & Raymond 2021; Antonelli et al. 2021; Pizzati et al. 2022; Janquart et al. 2022). Undetected overlapping signals, i.e., signals that are too faint to be detected, may also contribute to the systematic error (Antonelli et al. 2021; Reali et al. 2022). These errors are inevitable in 3G detectors, and repeated biased estimation for each event might end with a wrong conclusion in the catalog-level analysis (Kunert et al. 2022; Moore et al. 2021).
The aforementioned works mainly focus on case studies of single events, or include only one type of systematic error. In this work, we aim to perform a more comprehensive investigation of systematic errors at the catalog level, including interactions between different types of systematics. We perform parameterized post-Newtonian (PPN) coefficient tests (Mishra et al. 2010; Cornish et al. 2011; Li et al. 2012) with our simulated event catalogs and inaccurate waveforms. Our simulations show that systematic errors can accumulate, and could lead to an incorrect measurement of a deviation from GR when results from multiple events are combined. We find that overlapping signals could magnify the effects of waveform systematics because of their imperfect subtraction from the data. Even worse, we find that selected high-SNR events without known overlapping signals (so-called "golden events", which have been examined for GR tests in e.g. Ghosh et al. (2016)) may be more vulnerable to biased conclusions.
This paper is organized as follows. We introduce our methodology in Sec. 2, including the Fisher matrix formalism for error prediction.

Estimating systematic errors
The generic formalism we use for estimating systematic errors in PE was first proposed by Cutler & Vallisneri (2007) and then generalized and validated against full PE by Antonelli et al. (2021). Let θ be the parameters of a GW signal h(θ) (which may include more than one source, in the case of overlapping signals). The frequency-domain data from a GW detector is

    d = h(θ_real) + n ,    (1)

where n is noise. Under the assumption that this noise is stationary and Gaussian, the likelihood for GW PE is

    p(d|θ) ∝ exp[ −(d − h(θ) | d − h(θ)) / 2 ] ,    (2)

where ( · | · ) is the inner product (Finn 1992), defined as

    (a|b) = 4 Re ∫ a*(f) b(f) / S_n(f) df ,    (3)

where * denotes complex conjugation and Re denotes the real part. S_n(f) is the noise power spectral density (PSD) of the detector. The optimal SNR is ρ = √(h|h). For more than one data stream, the inner product is replaced by the sum of the inner products calculated individually for each data stream.
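As a concrete illustration, the discretized inner product and optimal SNR can be sketched as follows. This is a minimal toy example: the flat PSD level and power-law "signal" are invented for illustration, not a real detector configuration or waveform.

```python
import numpy as np

def inner_product(a, b, psd, df):
    # Discretized noise-weighted inner product: (a|b) = 4 Re sum a*(f) b(f) / S_n(f) df
    return 4.0 * df * np.real(np.sum(np.conj(a) * b / psd))

def optimal_snr(h, psd, df):
    # Optimal SNR: rho = sqrt((h|h))
    return np.sqrt(inner_product(h, h, psd, df))

# Toy frequency grid over the 5-2048 Hz band, flat PSD, crude power-law amplitude.
f = np.linspace(5.0, 2048.0, 4096)
df = f[1] - f[0]
psd = np.full_like(f, 1e-46)                                  # assumed flat PSD level
h = 1e-23 * (f / 100.0) ** (-7.0 / 6.0) * np.exp(2j * np.pi * f * 0.1)

rho = optimal_snr(h, psd, df)
```

For a multi-detector network, `inner_product` would simply be summed over data streams, as stated above.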
Consider a maximum likelihood estimator (which is equivalent to Bayesian estimation with flat priors); the maximum likelihood point θ_ML satisfies

    (∂_i h(θ_ML) | d − h(θ_ML)) = 0 ,    (4)

where ∂_i denotes the derivative with respect to the i'th parameter. The data d are known, but the real parameters θ_real and the GW signal in the detector h(θ_real) are unknown. In practice, they are replaced by a waveform model h_m(θ_ML). By doing this, errors are introduced into d − h:

    d − h_m(θ_ML) = n + δH + [h_m(θ_real) − h_m(θ_ML)] .

The first term, n, is what d − h is supposed to be: the noise in the detector. The second term, δH = h(θ_real) − h_m(θ_real), is the excess strain, which represents the difference between the real signal(s) in the data and the model used to subtract signals. Inaccurate waveforms and overlapping signals can both contribute to this term.
The third term comes from the imperfect measurement of the signal parameters due to statistical noise, and is given by the linear expansion of h_m(θ_real) − h_m(θ_ML):

    h_m(θ_real) − h_m(θ_ML) ≈ −∂_j h_m Δθ^j ,    (5)

where Δθ^j is the error of the j'th parameter from the maximum likelihood estimator, and we adopt Einstein notation to indicate the sum over parameters. Substituting Eq. 5 into Eq. 4 and evaluating all derivatives at θ_ML, we get

    Δθ^i = Δθ^i_stat + Δθ^i_sys ,    (6)

where

    Δθ^i_stat = (Γ^{−1})^{ij} (∂_j h_m | n)    (7)

is the error induced by the detector noise, with Γ_ij = (∂_i h_m | ∂_j h_m) the Fisher matrix. Since ⟨Δθ^i_stat⟩ = 0, the maximum likelihood estimator is unbiased if δH = 0; and ⟨Δθ^i_stat Δθ^j_stat⟩ = (Γ^{−1})^{ij}, which is consistent with the Fisher matrix formalism. The systematic error is

    Δθ^i_sys = (Γ^{−1})^{ij} (∂_j h_m | δH) .

Any effect that contributes to δH can be a source of systematic bias in PE. We will use √((Γ^{−1})^{ii}) as the statistical uncertainty and Δθ^i_sys as the predicted systematic error.
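The bias formula above can be exercised numerically. In the following toy sketch, random complex vectors stand in for the waveform derivatives ∂_j h_m (they are not real waveforms); it builds the Fisher matrix and evaluates Δθ_sys = (Γ^{−1})^{ij}(∂_j h_m | δH). A useful sanity check: if δH is proportional to one derivative, the bias lands entirely on that parameter.

```python
import numpy as np

def inner(a, b, psd, df):
    # Noise-weighted inner product (a|b)
    return 4.0 * df * np.real(np.sum(np.conj(a) * b / psd))

def fisher_and_bias(derivs, dH, psd, df):
    # Gamma_ij = (d_i h | d_j h); Delta_theta_sys = Gamma^{-1} (d_j h | dH)
    n = len(derivs)
    gamma = np.array([[inner(derivs[i], derivs[j], psd, df) for j in range(n)]
                      for i in range(n)])
    proj = np.array([inner(d, dH, psd, df) for d in derivs])
    return gamma, np.linalg.solve(gamma, proj)

rng = np.random.default_rng(0)
nf, df = 256, 1.0
psd = np.ones(nf)
derivs = [rng.normal(size=nf) + 1j * rng.normal(size=nf) for _ in range(3)]

# Excess strain proportional to the first derivative: the entire bias should
# appear in parameter 0, with magnitude equal to the proportionality constant.
dH = 0.01 * derivs[0]
gamma, bias = fisher_and_bias(derivs, dH, psd, df)
```

In a real analysis `derivs` would be numerical derivatives of the waveform model with respect to the sampled parameters, and `dH` the excess strain from waveform error and/or overlapping signals.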

PPN formalism, choices of parameters and waveforms
The test of parameterized post-Newtonian coefficients is a generic formalism for finding deviations from GR, initially proposed by Mishra et al. (2010), further developed for application with Bayesian inference (Li et al. 2012), and later applied to catalogs of real GW observations, most recently in Abbott et al. (2021b). We use the waveform model IMRPhenomPv2 (Husa et al. 2016; Khan et al. 2016), whose phase is characterized by a set of parameters {p_i}, including the inspiral phase parameters {φ_0, ..., φ_7} and {φ_5l, φ_6l}, the phenomenological coefficients {β_0, ..., β_3}, and the merger-ringdown parameters {α_0, ..., α_5}. Deviations p_i → (1 + δ_pi) p_i are introduced as violations of GR; δ_pi = 0 reproduces GR. In this framework, testing GR reduces to estimating the testing parameters δ_pi. Although a specific modified gravity theory could produce deviations in more than one testing parameter, previous works have shown that including one testing parameter at a time is enough to detect violations. In fact, it can be more efficient to find violations from GR this way, because it avoids the correlations between testing parameters and GR parameters (Meidam et al. 2018; Sampson et al. 2013). In this work, we choose δ_φ0 as the example testing parameter. We assume GR is the correct theory and focus on whether the PPN test falsely indicates a deviation from GR.
We restrict our Fisher matrix analysis to a subset of the full signal parameters to avoid computational issues. Parametrized deviations of the type we consider have a direct effect on the phasing of the signal, so in addition to δ_φ0 we must include the other parameters that do the same: the chirp mass M and mass ratio q, as well as the time of coalescence t_c. The full 6-dimensional space of spin configurations is known to produce ill-conditioned Fisher matrices (Borhanian & Sathyaprakash 2022) due to correlations between parameters, and because of the prior bounds on angular parameters the results can be misleading even when they can be computed. We therefore use only the effective spin χ_eff to capture the dominant effect of (aligned) spin on the waveforms. We include this by forcing the two aligned spin components to contribute equally to χ_eff, which allows us to treat it as a single parameter. We do not include extrinsic parameters in the Fisher matrix, effectively assuming they are measured precisely. Since these do not have a frequency-dependent effect on the phase, we do not expect them to be highly correlated with the intrinsic parameters. Our choice captures the parameters that appear in the leading PN term and the corresponding PPN modifications, as well as the decisive parameter in the analysis of overlapping signals, t_c. Other parameters are randomly generated (details in Sec. 2.3) but are treated as perfectly known. Setting parameters to their injection values excludes their contributions to both statistical and systematic errors in PE. For instance, if we removed the effective spin from our calculation, we would obtain smaller statistical and systematic errors because its correlation with the mass parameters and the testing parameter would be removed (Berti et al. 2005). In realistic future PE in which all parameters are included, correlations between parameters may make posteriors wider and systematic biases larger. However, due to the linear expression in Eq. 6, we expect the two changes to be proportional, so our conclusion should not change significantly under this simplification.
We introduce a non-zero δ_β2 to mimic inaccurate waveform models, based on the following considerations. To reduce potential correlations with δ_φ0, we exclude parameters of the inspiral stage: correlation between the testing parameter and the waveform-systematic parameter could undermine the generality of the illustration. To make sure the systematic parameter has enough influence on the waveform, we also avoid parameters of the merger-ringdown stage, which covers only the last few cycles. We therefore look to the intermediate region, which is described by the δ_βi (Khan et al. 2016). δ_β0 and δ_β1 produce a global phase shift and a time shift in this region, respectively, so δ_β2 is the dominant parameter that encodes physical (frequency-dependent) modifications.
The excess strain from inaccurate waveforms can be written as

    δH_wf = h(θ_real) − h_m(θ_real) ≈ h(θ_ML) − h_m(θ_ML) ,

where the approximation holds because replacing θ_real by θ_ML introduces only a higher-order error.

Overlapping signals and mock catalogs
When multiple signals are present in the data, they may affect the analysis of each other (Himemoto et al. 2021; Samajdar et al. 2021; Relton & Raymond 2021; Antonelli et al. 2021). It is known that the correlation between signals is not strong unless the merger times are very close (typically < 1 s); in this work we regard two signals as "overlapping" only if the merger time difference |Δt| < 4 s, which captures the most influential neighbours of a signal.
Overlapping signals can be classified into two types: detected signals and undetected signals (confusion signals). The former are strong enough to be detected and should be subtracted from the data in the analysis of other signals (or, the "main" signal)^1. The latter, however, are too faint to be recognized by the detection pipeline and may have an unnoticed impact on PE. In this work, the network SNR threshold for detection is set to 8, below which GWs are assumed to be undetected.
If a signal is detected, it will still contribute excess strain, since we cannot perfectly remove it from the data. The excess strain after imperfect removal is

    δH_DO = −∂_j h′_m Δθ′^j + δH′_wf ,    (8)

where a prime denotes variables of the detected overlapping signal. The first term arises from the inaccurate estimation of the parameters of the overlapping signal, which is random since the error is partly caused by the random noise, although other factors, such as waveform inaccuracies and further overlapping signals, also contribute to it. As a conservative estimate, and following Antonelli et al. (2021), we ignore waveform systematic errors in Δθ′^i (i.e., we assume Δθ′^i is caused only by noise, which tends to underestimate it), and adopt the lowest-order approximation for its correlation with the main signal. Substituting it into Eq. 6, one obtains the covariance of the systematic error induced by the first term in Eq. 8,

    ⟨Δθ^i_DO1 Δθ^j_DO1⟩ = (Γ^{−1} Γ_mix Γ′^{−1} Γ_mix^T Γ^{−1})^{ij} ,    (9)

where (Γ_mix)_{ij} = (∂_i h | ∂_j h′) encodes the correlation between the two signals and Γ′_{ij} = (∂_i h′ | ∂_j h′) is the Fisher matrix of the overlapping signal. The second term in Eq. 8 represents the inaccurate waveform model we use to subtract signals, and can be calculated in the same way as the waveform systematic, yielding Δθ^i_DO2 = (Γ^{−1})^{ij} (∂_j h_m | δH′_wf). In this work, the systematic error from detected overlapping signals is calculated as Δθ^i_DO2 plus a random sample drawn from a multivariate Gaussian distribution with zero mean and covariance matrix given by Eq. 9. For more than one detected overlapping signal, Eq. 8 can be extended by defining h′ as the sum of all overlapping GWs in the data (Antonelli et al. 2021), which enlarges the dimensions of Γ_mix and Γ′.
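A numerical sketch of this step, with toy 2-parameter Fisher matrices whose entries are invented for illustration: the bias from subtracting the noisily-estimated overlap is drawn from a zero-mean Gaussian whose covariance is built from Γ, Γ′ and the cross-correlation block Γ_mix.

```python
import numpy as np

def overlap_bias_cov(gamma, gamma_mix, gamma_prime):
    # Covariance of the overlap-subtraction bias (Eq. 9):
    # Gamma^{-1} Gamma_mix Gamma'^{-1} Gamma_mix^T Gamma^{-1}
    ginv = np.linalg.inv(gamma)
    return ginv @ gamma_mix @ np.linalg.inv(gamma_prime) @ gamma_mix.T @ ginv

# Toy Fisher matrices for the main (gamma) and overlapping (gamma_p) signals,
# and a weak cross-correlation block gamma_mix -- all values illustrative only.
gamma = np.array([[400.0, 30.0], [30.0, 100.0]])
gamma_p = np.array([[250.0, 10.0], [10.0, 80.0]])
gamma_mix = np.array([[5.0, 1.0], [2.0, 4.0]])

cov = overlap_bias_cov(gamma, gamma_mix, gamma_p)
rng = np.random.default_rng(1)
bias_do1 = rng.multivariate_normal(np.zeros(2), cov)  # one realisation of the bias
```

The covariance is symmetric and positive semidefinite by construction, so drawing from the multivariate Gaussian is always well defined.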
An undetected overlapping signal simply contributes to the systematic error via δH_UO = Σ_undetected h′(θ′_real). This quantity is accessible in our simulation but unknown in real data analysis.
We consider BBH and BNS sources, and assume their distribution in redshift z follows the analytical approximation of Oguri (2018), a fitting function with model parameters a_{1,2,3,4}. The intrinsic merger rate is converted to an observable event rate by multiplying by the factor (1/(1+z)) dV_c/dz, where V_c is the comoving volume; we employ the Planck15 cosmology (Ade et al. 2016). Note that "observable" GWs still need to reach a network SNR of 8 to be "detectable". We set a_2 = 1.6, a_3 = 2.1, a_4 = 30 to mimic a peak at z ∼ 2, and a_1 is scaled to the local merger rates given by Abbott et al. (2021e) (R_BNS = 320^{+490}_{−240} Gpc^{−3} yr^{−1} and R_BBH = 23.9^{+14.3}_{−8.6} Gpc^{−3} yr^{−1}) such that R_GW(z = 0) = R_BNS/BBH. We choose three values of a_1, corresponding to the lower, median, and upper estimates of the local merger rate, respectively.
The masses of BBHs are generated from the PowerLaw + Peak model of Abbott et al. (2021e), while all BNS systems are set to be identical: 1.45 M_⊙ + 1.4 M_⊙, Λ_1 = Λ_2 = 425.
The effective spin follows the Gaussian distribution of Abbott et al. (2021e), with mean 0.06 and standard deviation 0.12. IMRPhenomPv2_NRTidal (Dietrich et al. 2019) is used to generate BNS waveforms, with the same δ_β2 as for BBHs. We perform tests of GR with all BBH events and use BNS events as a background: BNS events enter the calculation only as overlapping signals. We assume isotropically distributed inclination and source sky direction, and uniformly distributed coalescence time, phase, and polarization angle.
A summary of the low, median, and high merger rate catalogs is shown in Tab. 1. It shows that most BBH events will not have an overlapping signal near their merger time, which implies that overlapping signals contribute to systematic errors less frequently than waveform systematics. With our ET+CE configuration, the numbers of the two kinds of overlaps are close. However, if the number of detectors is smaller than assumed, or detector sensitivities are lower than designed, some detected overlaps would become undetected, and vice versa. The unnoticed confusion background has drawn attention in recent works (Wu & Nitz 2022; Reali et al. 2022) and needs further investigation. Compact binaries formed by Pop III stars (which we have ignored) could also contribute to the confusion background. However, according to the models in Oguri (2018), the numbers of observable Pop III binaries per year are roughly 40000 and 180000 for the B17 and K16 models, respectively, which is much lower than the BNS background.
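The overlap fractions in such a catalog can be cross-checked with a simple Poisson-timing estimate. The event counts below are assumed round numbers of the right order of magnitude, not the values in Tab. 1:

```python
import numpy as np

rng = np.random.default_rng(2)
T = 365.25 * 24 * 3600.0      # one year of observation, in seconds
n_bbh = 100_000               # assumed number of BBH events (order 1e5)
n_bns = 300_000               # assumed number of BNS background events
window = 4.0                  # overlap criterion: |dt| < 4 s

t_bbh = np.sort(rng.uniform(0.0, T, n_bbh))
t_all = np.sort(np.concatenate([t_bbh, rng.uniform(0.0, T, n_bns)]))

# Count, for each BBH, how many other signals merge within the window.
lo = np.searchsorted(t_all, t_bbh - window)
hi = np.searchsorted(t_all, t_bbh + window)
n_neighbours = hi - lo - 1    # subtract the BBH itself

frac_overlapped = np.mean(n_neighbours > 0)
# Poisson expectation: 1 - exp(-lambda), lambda = (N - 1) * 2*window / T
lam = (n_bbh + n_bns - 1) * 2.0 * window / T
expected = 1.0 - np.exp(-lam)
```

With these assumed counts only roughly 10% of BBH events acquire any overlapping neighbour, consistent with the statement that most events are overlap-free.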
Several simplifications have been adopted in our mock catalog: we regard BNS as a background and use only BBH as the test source; we ignore neutron starblack hole (NSBH) mergers and other possible types of sources; we use an analytical merger rate that peaks at z ∼ 2, ignoring compact binaries from Pop III stars.Our catalogs aim to generate an appropriate merger rate for the study of systematic error accumulation, rather than accurately modeling the astrophysical population.To achieve this, we also adjust the merger rate to different levels, expecting that the real situation will lie somewhere between our lowest and highest estimates.
Signals are injected into a third-generation network: the Einstein Telescope, with the ET-D PSD (Punturo et al. 2010), located at the Cascina site of the current Virgo detector, and Cosmic Explorer, located at the LIGO Hanford site, with the sensitivity curve proposed by Abbott et al. (2017c). The frequency band used for the analysis is 5-2048 Hz.

Combining results
There are several ways of combining results from multiple events (Zimmerman et al. 2019; Isi et al. 2019). We employ two straightforward methods: multiplying likelihoods (equivalently, multiplying posteriors, if priors are flat) and multiplying Bayes factors. The former assumes the modification parameter is the same for all events, while the latter allows it to vary across events.
We assume a flat prior distribution, so that the posterior follows a multivariate Gaussian distribution with covariance matrix Γ^{−1} and mean μ equal to the injected values θ_inj plus the systematic errors Δθ_sys. The statistical uncertainty of a parameter is σ_i = √((Γ^{−1})_{ii}). We define the error ratio between systematic and statistical errors as

    R(θ^i) = |Δθ^i_sys| / σ_i .

We consider the PPN test to be subject to a false deviation from GR when R(δ_φ0) > 1.
In order to combine results from multiple events, one multiplies the posterior distributions of the testing parameter from each. The product of Gaussian distributions is another Gaussian distribution whose mean (and hence systematic error) is a linear combination of the original means. Starting from the first event in a catalog, we multiply in the posterior of each new event one by one and calculate the error ratio. Since the ordering of events is arbitrary, we permute the sequence 200 times and extract the ensemble average and 68% confidence interval.
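The likelihood-multiplication step amounts to inverse-variance weighting of Gaussian posteriors. In the toy model below (the per-event widths and the common systematic offset are assumed numbers), the combined mean stays at the bias while the combined width shrinks, so the error ratio grows with the number of events:

```python
import numpy as np

def combine_gaussians(means, sigmas):
    # Product of 1-D Gaussian posteriors (flat priors): inverse-variance weighting.
    w = 1.0 / np.asarray(sigmas) ** 2
    var = 1.0 / np.sum(w)
    return var * np.sum(w * np.asarray(means)), np.sqrt(var)

rng = np.random.default_rng(3)
n_event = 1000
sigma = rng.uniform(0.05, 0.5, n_event)   # assumed per-event statistical widths
bias = np.full(n_event, 0.02)             # assumed common systematic error

ratios = []
for k in range(1, n_event + 1):
    mu, sig = combine_gaussians(bias[:k], sigma[:k])
    ratios.append(abs(mu) / sig)          # error ratio R after k events
# sig shrinks roughly as 1/sqrt(k) while mu stays pinned at the common bias,
# so R climbs monotonically and eventually exceeds 1: a false deviation from GR.
```

This is the mechanism behind the error-ratio accumulation discussed in Sec. 3: no single event need be badly biased for the combined result to cross the R = 1 threshold.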
Treating GR as a sub-model of the non-GR theory, the Bayes factor can be calculated analytically for the Gaussian posterior (Moore et al. 2021). Denoting the systematic error of δ_φ0 by Δθ_sys, the Bayes factor is

    ln B^nonGR_GR = (1/2) (Γ_δφ0δφ0 − v^T (Γ_GR)^{−1} v) Δθ_sys^2 + (1/2) ln( det Γ_GR / det Γ_nonGR ) ,

where Γ_nonGR is the Fisher matrix including the testing parameter, while Γ_GR includes only the GR parameters; v_i = (∂h/∂θ^i | ∂h/∂δ_φ0) represents the correlation between the GR and non-GR parameters, and Γ_δφ0δφ0 = (∂h/∂δ_φ0 | ∂h/∂δ_φ0). The exponential term in the Bayes factor accounts for the deviation from GR, while the determinant-ratio term usually favors GR, since modified theories introduce extra parameters to explain the data. We also note that the correlation term v^T (Γ_GR)^{−1} v mitigates the deviation from GR; ignoring it may overestimate the Bayes factor (e.g., Moore et al. (2021)). When combining events, Bayes factors are multiplied numerically, with the same permutation procedure as before. We consider a false deviation from GR to be achieved when ln B^nonGR_GR > 8. We re-emphasize that Bayes factors are first computed for each event and then combined across the catalog, rather than being calculated after the posteriors have been multiplied. This analysis should therefore be interpreted as not assuming that the testing parameter is the same for all events. In this sense it is less sensitive to violations of GR when there is a common underlying deviation parameter, so we would expect it to be less vulnerable to simulated false violations. While the error-ratio accumulation is determined by the errors from each event, the Bayes factor accumulation is more sensitive to the fraction of correct analyses in the catalog. The two methods of combining results are independent and do not necessarily lead to the same conclusion. More details are given in Sec. 3.2.

Single events
We first present an example event, investigating the effect of a detected or undetected overlapping signal. The main signal is from a BBH with M_c = 32 M_⊙ (in the detector frame), q = 0.9, χ_eff = 0.2 and a network SNR of 27. The overlapping signal is an equal-mass BBH with M_c = 20 M_⊙ and χ_eff = 0.1. We scale its SNR from ∼26 down to 8 to make it detectable or undetectable. We vary the merger time difference (in steps of 0.01 s) and calculate the total systematic error with different waveform models. Note that, throughout this section, the "systematic error" refers to that of the testing parameter δ_φ0, denoted Δθ_sys.
The error ratio for this example event is shown in Fig. 2, including an illustration of the waveforms. The error from the overlapping signal oscillates as Δt changes due to the repeated alignment and misalignment of the phases of the two GWs. The overlap error is not symmetric around Δt = 0 because the two waveforms are not symmetric, but the peak is always located in the region |Δt| ≤ 1 s, meaning the overlapping signal only produces a large influence when the two mergers are very close. The waveforms in the last row show how the main signal is modulated by overlapping signals. Around Δt ∼ 0, the confusion signal has a larger impact than waveform systematics, so it dominates the systematic error. The detected signal changes the data significantly, but it is then subtracted, and therefore produces a smaller residual strain. When |Δt| is large, it is waveform inaccuracy that dominates the systematic error. These characteristics are consistent with previous works (Himemoto et al. 2021; Samajdar et al. 2021; Relton & Raymond 2021; Antonelli et al. 2021).
It is possible for undetected signals to produce significant systematic errors in our simulation. However, comparing the detected and undetected overlapping signals, the former produces a larger systematic error when the waveform is not accurate, because the waveform systematic is also involved in the signal subtraction. One can see this from the more intense perturbations in the δ_β2 ≠ 0 cases in Fig. 2 for detected overlaps (errors are added directly, so they may interfere constructively or destructively). This implies that different types of systematic errors are correlated and can magnify each other, as expected from Eq. 8. A more direct comparison is given in Fig. 3. We calculate systematic errors for each BBH event in our mock catalog, and show the systematic errors and Bayes factors caused by different numbers of overlapping signals for the high merger rate catalog. As the number of overlaps increases, detected overlaps tend to produce larger errors, while errors from confusion background signals make smaller incremental changes.
The statistical error Δθ^i_stat = (Γ^{−1})^{ij} (∂_j h | n) scales as ∼1/SNR, while the systematic error Δθ^i_sys = (Γ^{−1})^{ij} (∂_j h | δH) does not necessarily shrink as the SNR increases; waveform systematics of the main signal are one example. Therefore, systematic errors may dominate in the high-SNR regime. We plot the absolute error, error ratio and Bayes factor against SNR in Fig. 4. We find that the error ratio can exceed one for the "current" waveform, and this happens more often when SNR > 30, despite the fact that high-SNR events are rarer. Error ratios for the "future" waveform simulations are usually below one, but a certain number of exceptions exist. For the Bayes factor analysis we find a similar situation, although with a smaller fraction of extreme values. Roughly 0.8% and 18% of events produce R(δ_φ0) > 1 for δ_β2 = 5 × 10^{−4} and 5 × 10^{−2}, respectively, while for ln B^nonGR_GR > 8 the fractions are 0.02% and 3%. As pointed out by Moore et al. (2021), false deviations can arise even though the estimates for individual events are generally accurate. We investigate this in more detail in the next subsection.
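The scaling argument can be made concrete with a two-line estimate; the systematic level 0.02 below is an arbitrary assumed value, not a measured one:

```python
import numpy as np

snr = np.linspace(10.0, 300.0, 60)
sigma_stat = 1.0 / snr          # statistical uncertainty scales as 1/SNR
delta_sys = 0.02                # waveform systematic: roughly SNR-independent
ratio = delta_sys / sigma_stat  # error ratio R grows linearly with SNR

snr_cross = 1.0 / delta_sys     # R = 1 crossing, at SNR = 50 for this level
```

For any fixed systematic level there is always an SNR above which R > 1, which is why the loudest events are the first to show false deviations.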

Error accumulation in a catalog
As mentioned in Sec. 2.4, we combine all BBH events by multiplying likelihoods or Bayes factors. The results are shown in Fig. 5. Let N_event be the number of events. When multiplying likelihoods, the statistical uncertainty shrinks as 1/√N_event. The absolute error of the testing parameter also decreases, but at a slower pace due to the perturbations from newly accumulated systematic errors. It would also follow 1/√N_event if there were no systematic errors; we observe that the test with the perfect waveform in the low merger rate catalog approximately does so. In most simulations it is the waveform inaccuracy that keeps contributing to the systematic errors. The slower decay of the systematic error results in a climbing error ratio as the number of events increases. At some point (typically ∼10^3 events, considering the error bars) this leads to a false deviation from GR for the "current" waveform. For the better waveform, the error ratio climbs as well, but it stays below the statistical level until 10^5-10^6 events.
Multiplying Bayes factors amounts to directly adding ln B^nonGR_GR. "Correct analyses" effectively decrease the combined Bayes factor, so that a catalog dominated by correct analyses leads to correct conclusions. Since only 3% of events have ln B^nonGR_GR > 8 (furthermore, only 7% of events have ln B^nonGR_GR > 0) for the current waveform, the sum of all log Bayes factors is negative, and thus a false deviation is not achieved in this case. In contrast, multiplying likelihoods linearly combines systematic errors: for Gaussian distributions f and g, the mean of their product is

    μ_fg = (μ_f σ_g^2 + μ_g σ_f^2) / (σ_f^2 + σ_g^2) ,

and errors of different signs can diminish the combined systematic error somewhat, but the error is never guaranteed to stay around 0. Moreover, the statistical uncertainty also shrinks as events are stacked, so the error ratio shows a clear increase.
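The product-mean behaviour can be checked directly. In this toy example (the means and widths are invented numbers), opposite-sign biases with equal widths cancel exactly, but unequal widths leave a residual bias:

```python
def product_mean(mu_f, sig_f, mu_g, sig_g):
    # Mean of the normalised product of two Gaussians f and g.
    return (mu_f * sig_g**2 + mu_g * sig_f**2) / (sig_f**2 + sig_g**2)

m_equal = product_mean(0.03, 0.1, -0.03, 0.1)   # equal widths: exact cancellation
m_skew = product_mean(0.03, 0.05, -0.03, 0.2)   # unequal widths: residual bias
```

Because the combined mean is an inverse-variance-weighted average, the narrower (higher-SNR) posterior dominates, so cancellation between events of different loudness is never guaranteed.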

Golden events
We combined all detected BBH events in the above subsection. It is also interesting to test GR with only the "golden events", i.e., the GW events with high SNR and clean data that contribute most of the information in the whole-catalog test. This idea is widely used in many works, such as the recent GWTC-3 tests of GR (Abbott et al. 2021b) and cosmology (Abbott et al. 2021f). Since the noise is Gaussian in our simulation, we select the golden events with only two criteria: SNR above a chosen threshold (50 or 200), and no detected overlapping signals.
Results for the error ratio and Bayes factor are shown in Fig. 6: high-SNR events are more vulnerable to systematic errors. Fewer events are needed to create a false deviation for the "current" waveform model, and the "future" waveform is closer to a false deviation in all three catalogs. Moreover, the golden-event catalog contains a larger fraction of incorrect analyses (R(δ_φ0) > 1 or ln B^nonGR_GR > 8), which causes the Bayes factor for the current waveform to incorrectly favor the non-GR theory.
As mentioned in Sec. 3.1, the statistical uncertainty decreases as 1/SNR while the systematic error does not, as long as the waveform is imperfect. The false deviation for golden events is therefore not surprising, but it does need more attention and an appropriate solution in future data analysis.

CONCLUSIONS AND DISCUSSIONS
We have investigated how systematic errors in tests of GR accumulate under the influence of overlapping signals and inaccurate waveforms. We have considered different levels of waveform inaccuracy and event rates, and employed two approaches to combining the results.
We confirm that systematic errors can accumulate when combining multiple events, and can lead to incorrectly disfavoring GR in some cases. Since overlapping signals do not always occur, it is waveform inaccuracy that keeps contributing to the systematic error in the catalog tests. An accurate waveform model is effective at preventing false deviations in most cases, while a worse one can lead to biased conclusions. We additionally find that overlapping signals can enlarge the effect of waveform systematics: as the number of overlaps increases, we tend to obtain a greater systematic error and a Bayes factor that leans more toward the non-GR model. One can avoid this correlated error by selecting events with no detected overlapping signals, and, if one prefers, with high SNR as well. However, we have shown that these events produce biases much faster, because waveform systematics dominate in the high-SNR regime. We should point out that GR is assumed to be the true theory describing the data in this work, which is not necessarily correct. The inverse problem, namely what happens to detection and PE when we use GR waveforms for data analysis but GR is wrong (stealth bias), has been investigated in previous works (Cornish et al. 2011; Vallisneri & Yunes 2013; Vitale & Del Pozzo 2014). The core idea of our work and of stealth bias is the same: using an incorrect model in data analysis can lead to biased results. Stealth bias emphasizes the importance of assuming the correct theory, while our work points out that even if the assumed fundamental theory is correct, waveform modelling and overlapping signals are still able to corrupt the results.
We re-emphasize that systematic errors can accumulate when combining multiple events and can lead to incorrect scientific conclusions. This problem is universal: in addition to tests of GR, any analysis based on a GW catalog faces this issue, such as constraints on cosmological models, neutron star models (Kunert et al. 2022), and astrophysical population inference. Furthermore, there are more sources of systematic error than those investigated in this work: instrumental calibration (Sun et al. 2020; Hall et al. 2019), glitches (Powell 2018; Pankow et al. 2018), missing physical effects (Pang et al. 2018; Saini et al. 2022), and so forth. A full analysis of these contributions, and of their relative importance, will be essential in designing analysis strategies for 3G detectors. An obvious solution to these issues is continued improvement of waveform model accuracy and instrumental stability, but we believe more effort is needed on the data analysis side. A proper estimate of the confusion background may be necessary (Reali et al. 2022), and new techniques might be needed, such as accounting for waveform systematic errors during PE (Moore & Gair 2014) and performing specific analyses of the residual strain (Dideron et al. 2022).

Figure 1. The absolute value of the real part of the plus polarization from a non-spinning BBH with Mc = 30.69 M⊙, q = 0.88, in the frequency domain. Waveforms with different δβ2 are shown in different colors and line styles. The intermediate region of this system starts around 50 Hz, consistent with where the waveform differences appear in the plot.

Figure 2. Uppermost and middle rows: the error ratio of δφ0 as a function of merger time difference. The main signal has (detector-frame) Mc = 32 M⊙, q = 0.9, χeff = 0.2, and an SNR of 27. The overlapping signal is an equal-mass BBH with Mc = 20 M⊙ and χeff = 0.1. The SNR of the overlapping signal is adjusted by changing its luminosity distance: the detected overlap is shown in the upper panel and the undetected one in the middle panel. We use the three kinds of waveforms explained in Sec. 2.2: the perfect waveform (solid line), the "current" waveform (dashed line), and the "future" waveform (faint dotted-dashed line). Bottom row: waveforms of the main and overlapping signals and their superposition. The merger times of the overlapping signals are chosen to maximize their influence, as marked by grey stars in the first two rows. The inaccurate waveform in the δβ2 = 5 × 10−2 case is also plotted for comparison.

Figure 3. Distributions of systematic errors and Bayes factors in the high merger rate catalog with the δβ2 = 5 × 10−2 waveform, classified by the number and type of overlapping signals. Bars denote mean values. The number of overlapping signals is cut at 4 because too few events come with more than 4 overlapping signals. The difference in the growth of the mean values shows that detected overlapping signals can magnify the effects of inaccurate waveform models.

Figure 4. Relation between SNR and the absolute error (first column) and error ratio (second column) of δφ0, and the Bayes factor (third column), for the low (uppermost row), median (middle row), and high (bottom row) merger rate catalogs. Each point represents a BBH event. Blue points are for the "perfect waveform" case, where all systematic errors come from overlapping signals; red points stand for the "current waveform" case and yellow points for the "future waveform" case. Grey points in the first column are statistical errors. Dashed lines in the second and third columns mark the threshold above which GR is mistakenly disfavored. This plot shows R(δφ0) > 1 and ln B nonGR

Figure 5. Systematic error accumulates as the number of events increases. The first column shows the absolute error of δφ0, the second column the error ratio, and the third column the Bayes factor. Solid red lines are the ensemble average for the "current waveform", dashed red lines for the "future waveform", and blue lines for the perfect waveform. The shading along the lines marks the 68% confidence interval. The first, second, and third rows correspond to the low, median, and high merger rate catalogs, respectively. The black dotted-dashed line is the threshold above which a false deviation from GR is claimed. False deviations can be reached as the number of events increases when multiplying posteriors, but multiplying Bayes factors does not give wrong conclusions in these correct-analyses-dominated catalogs.

Figure 6. Similar to Fig. 5, showing the error ratio and Bayes factor accumulation. The left two columns show results from SNR > 50 events; the right two columns from SNR > 200 events. Compared with Fig. 5, tests with high-SNR events are more likely to produce a false deviation from GR.
2.1, configurations of parameter estimation, waveforms, and PPN tests in Sec. 2.2, catalog simulation and overlapping signals in Sec. 2.3, and methods of combining results in Sec. 2.4. Results are given in Sec. 3. We first demonstrate selected example events in Sec. 3.1, and then move on to the catalog-level tests in Sec. 3.2 and Sec. 3.3. Conclusions and discussion are in Sec. 4.

Table 1. A summary of the three mock catalogs. From left to right: catalog type, observable (not necessarily detectable) BBH and BNS mergers per year, and the distributions of the numbers of overlapping signals among BBH events. For example, in the median merger rate catalog, 13125 detected BBH events (15% of all detected BBH events) come with 1 detected overlapping GW signal, and 10461 detected BBHs come with 1 undetected overlapping GW signal. The overlapping signal can be a BBH or a BNS, and two signals are defined as overlapping if their merger time difference ∆t < 4 s.
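The ∆t < 4 s criterion makes the overlap statistics easy to reproduce for any assumed merger rate. The sketch below (with an illustrative total rate, not the catalog rates of Table 1) draws merger times as a Poisson process and counts, for each event, neighbours within ±4 s using a sorted search.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical merger times (s) over one year of data, drawn as a
# Poisson process with an assumed (illustrative) total merger rate.
rate_per_s = 1.0 / 300.0              # one merger every ~5 min
t_obs = 365.25 * 24 * 3600.0
n_total = rng.poisson(rate_per_s * t_obs)
t_merge = np.sort(rng.uniform(0.0, t_obs, n_total))

# Two signals overlap if their merger-time difference is below 4 s.
dt_max = 4.0
# For each event, count neighbours within +-dt_max (excluding itself)
# via binary search in the sorted merger-time array.
left = np.searchsorted(t_merge, t_merge - dt_max, side="left")
right = np.searchsorted(t_merge, t_merge + dt_max, side="right")
n_overlaps = right - left - 1

frac_with_overlap = np.mean(n_overlaps > 0)
print(f"{n_total} events; fraction with >=1 overlap: "
      f"{frac_with_overlap:.3f}")
```

For a Poisson process the expected fraction is 1 − exp(−2 × rate × ∆t), so higher merger rates move events from the zero-overlap column toward the multi-overlap columns of Table 1.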