Inconsistent black hole kick estimates from gravitational-wave models

The accuracy of gravitational-wave models of compact binaries has traditionally been addressed by the mismatch between the model and numerical-relativity simulations. This is a measure of the overall agreement between the two waveforms. However, the largest modelling error typically appears in the strong-field merger regime and may affect subdominant signal harmonics more strongly. These inaccuracies are often not well characterised by the mismatch. We explore the use of a complementary, physically motivated tool to investigate the accuracy of gravitational-wave harmonics in waveform models: the remnant's recoil, or kick velocity. Asymmetric binary mergers produce remnants with significant recoil, encoded by subtle imprints in the gravitational-wave signal. The kick estimate is highly sensitive to the intrinsic inaccuracies of the modelled gravitational-wave harmonics during the strongly relativistic merger regime. Here we investigate the accuracy of the higher harmonics in four state-of-the-art waveform models of binary black holes. We find that the SEOBNRv4HM_ROM, IMRPhenomHM, IMRPhenomXHM and NRHybSur3dq8 models are not consistent in their kick predictions. Our results enable us to identify regions in the parameter space where the models require further improvement and support the use of the kick estimate to investigate waveform systematics. We discuss how numerical-relativity kick estimates could be used to calibrate waveform models further, proposing the first steps towards kick-based gravitational-wave tuning.

yield the most accurate description of the GW signal and they are the only way to access the highly dynamic merger regime from first principles. The standard method of comparison is based on calculating a match between the model and an NR or hybrid waveform, based on a Wiener inner product. This is the same inner product that is used to assess the likelihood of whether the GW data contain a signal, by comparing the agreement between observational data and waveform templates. The match represents a notion of the angle between two signals (related to a distance between them) and is a standard quantity used in the waveform modelling community to test the quality of a model (see e.g. [34]).
A complementary tool to standard match calculations has recently been suggested in [35]. The authors propose an infinite set of constraints on compact binary coalescence waveforms, predicted by full, non-linear general relativity (GR). The set of constraints are the Bondi-van der Burg-Metzner-Sachs (BMS) balance laws, which are induced by the infinite-dimensional group of supertranslations [36], the natural extension of the four-dimensional group of translations defined at null infinity. The theory provides the opportunity to test waveform systematic errors in a new way. The balance laws can be particularly useful for regions of the parameter space where performing NR simulations might be more challenging or where these simulations are not abundant enough for the subsequent calibration of the approximate waveform models. Because these constraints come from exact GR, the balance laws are also attractive to quantify the accuracy of NR waveforms, which are often considered as a proxy of the exact waveform predicted by GR. The set of balance laws have already been applied to currently available waveform models to test their accuracy based on their angular momentum [37] and the gravitational memory estimates [38]. Besides, the Simulating eXtreme Spacetimes (SXS) Collaboration has recently employed the BMS balance laws to correct the extracted SXS waveforms by incorporating gravitational memory effects [39]. Apart from imposing accuracy requirements, the theory presented in [35] represents new tests of GR.
Although the expression of the radiated three-momentum flux by itself does not represent a BMS balance law, the calculation of the linear momentum flux (or the kick velocity) has recently been used to test the accuracy of the relative phase shifts between the GW harmonics of the state-of-the-art phenomenological waveform models [21,25]. These studies have focused on the dependency of the kick magnitude on the symmetric mass ratio of the binary.
Using the recoil prediction to assess the waveform accuracy is attractive for three main reasons: (1) The kick estimate is a highly sensitive quantity [40][41][42]. The presence of small time and phase deviations that appear from incorrect modelling can result in significantly different kick predictions. (2) The kick builds up in the merger, the highly dynamic region that is most complex to model. Therefore, a correct kick estimate requires an accurate description of the merger. (3) The asymmetries of the system that lead to the kick are most completely characterised when using the dominant and higher harmonics. Thus, an accurate description of the subdominant harmonics is essential for a correct prediction of the kick velocity.
In this paper, we explore the use of the kick velocity as a diagnostic test for waveform models, by analysing the waveform estimates of the magnitude and orientation of the kick velocity over the parameter space. Precessing binaries lead to the largest recoil velocities [43] and are therefore observationally easier to accesss. However, only one of the available precessing waveform models includes the mode asymmetries that generate out-of-plane kicks for precessing binaries: the NR Surrogate (NR Sur) model. This model only covers part of the parameter space relevant for LVK data analysis. Kicks in nonprecessing binaries, on the other hand, can be predicted by any of the nonprecessing models. For this reason, in this work, we focus on nonprecessing systems as a first step to understanding the accuracy of GW harmonics through remnant kicks.
We evaluate the accuracy of four waveform models which are used in current GW data analysis studies. We also analyse the performance of the NR Sur fit [44], recently used to make the first measurement of a large kick velocity in an observed GW signal [32]. By exploiting the features of the kick, we create a set of diagnostic tests that can be applied to any waveform model to identify modelling inaccuracies over the parameter space.
After a description of the methodology of our work in section 2, we use the kick velocity to evaluate the accuracy of several gravitational waveforms and analyse their harmonic contributions to the kick in section 3. In section 4 we further explore the applications of the kick in the context of waveform modelling by addressing the first steps towards kick-based GW tuning. Finally, we summarise our results in section 5.

Waveform models
The GW signal radiated by a BBH coalescence is uniquely determined by a number of physical parameters that characterise the binary. Astrophysical quasi-circular BBHs are described by the two individual masses m i and the individual spin vectors. It is common to use the dimensionless spin parameter ⃗ χ i = ⃗ S i /m 2 i for the spin components. BBHs on quasi-circular orbits are thus characterised by eight intrinsic parameters: The total mass of the binary is M = m 1 + m 2 and the symmetric mass ratio is defined as η = m 1 m 2 /M 2 . Throughout this work we use geometric units, G = c = 1. Apart from the intrinsic parameters, there are seven extrinsic parameters: the spherical angles {θ, φ}, which describe the orientation of the binary, the luminosity distance d L , the coalescence time t c , the declination and right ascension {ϑ, ϕ}, and the polarization angle ψ.
In NR, it is common to use the Newman-Penrose scalar, Ψ 4 , which is derived from the Weyl tensor and encodes the outgoing gravitational radiation as Ψ 4 = −ḧ, where h := h + − ih × . The behaviour of a quantity under rotations is expressed by the spin weight. Because the Weyl tensor component has spin weight s = −2, the GW strain can be expanded in a basis of spinweighted spherical harmonics (SWSH) [45], Here h ℓ,m (t, λ) are the GW spherical harmonics associated to the multipole moments. The (2,2) spherical harmonic is the quadrupolar term, while those associated with higher multipole moments are referred to as higher harmonics. The h ℓ,m (t, ⃗ λ) depend on the time and the intrinsic physical properties of the source, ⃗ λ. The orientation of the source with respect to the observer is encoded in the spherical harmonic basis functions of spin weight −2, −2 Y ℓ,m (θ, φ). The coalescence phase φ c is sometimes included as an extrinsic parameter of the binary, and is degenerate with the azimuthal angle φ in nonprecessing systems. In a nonprecessing binary, the spins of the individual black holes are parallel to the direction of the orbital angular momentum, and the evolution of the binary takes place in the Table 1. Gravitational waveform models used in our study. The second column indicates the higher harmonics included in each of the models.
The inspiral and merger phases can be partly described by the EOB formalism, which maps the two-body problem to that of a test particle in an effective metric (see [63] for a review) and free coefficients are calibrated to NR data. On the other hand, Phenom models are based on employing hybrid waveforms that connect an analytical inspiral description with NR data for the late inspiral, merger and ringdown, which are then described by a phenomenological fit. The ringdown phase of the remnant black hole is characterised by the emission of quasinormal modes [64], mathematically described in terms of exponentially damped oscillations. In contrast, the NR Sur family is based on a reduced order method interpolation of the NR simulations over the parameter space, built to cover a larger region of the parameter space than that covered by the NR waveforms.
In our work, we analyse the kick predictions of four nonprecessing waveform models that include higher harmonics: SEOBNRv4HM_ROM [65], two phenomenological models, namely, IMRPhenomHM [57] and the more recent IMRPhenomXHM [21,22], and NRHybSur3dq8 [61]. Table 1 indicates which gravitational multipoles are included in each waveform model.

Linear momentum flux
Because asymmetric BBHs radiate linear momentum through GWs, the remnant black hole acquires a kick velocity. This can be mathematically described from the linear momentum flux radiated by the binary. Since the linear momentum of the system is initially zero, the momentum of the remnant is equal to the opposite of the three-momentum flux carried by the radiated GWs, and is given by: where dΩ = sin θ dθ dφ andx i = (sin θ cos φ, sin θ sin φ, cos θ) is the unit vector expressed in the spherical harmonic basis. Because the asymmetries that cause the kick are encoded in the GW signal, the momentum of the remnant black hole is entirely determined by the waveform h. In asymmetric BBH systems, gravitational higher harmonics are particularly loud during merger. For this reason, we express the radiated momentum in terms of the dominant and higher multipoles by decomposing the gravitational radiation on a basis of SWSH. We choose the z axis along the orbital angular momentum of the binary. In a nonprecessing system, the binary orbits in a fixed plane, that we choose as the x-y plane. In this case, the z-component of the momentum vanishes and the kick takes place in the orbital plane. The momentum can then be expressed as a combination of the two planar coordinates, ⃗ P := P x + iP y . After expressing the unit-vector in terms of the SWSH and integrating over the 2-sphere, one can show that the components of the momentum are given by [66]: where the coefficients a ℓ,m and b ℓ,m read: The kick velocity will then be given by ⃗ v f = − ⃗ P/M f , where M f is the mass of the remnant black hole.

Implementation
We have implemented the expression of the momentum (5) for the mentioned gravitational waveform models. We use the LIGO Algorithm Library (LALSuite) software [67] to obtain individual higher harmonics of the GW signal. It is physically meaningful to compute the momentum flux in the time domain. However, for data analysis purposes the SEOB-NRv4HM_ROM and Phenom models that we employ in our study have been developed in the frequency domain. For this reason, we inverse Fourier transform the GW signal after obtaining the individual spherical harmonics in the frequency domain. We use f min = 10 Hz for the lower frequency cut-off and because the kick velocity does not depend on the total mass of the binary, we fix the total mass to M = 50 M ⊙ . For a nonspinning equal-mass binary this corresponds to 63 orbits before merger. To calculate the remnant's final mass, we use waveform specific fitting functions which are available in LALSuite 3 . The final mass is internally computed from the radiated energy and is scaled by the original total mass. For details on how the radiated energy is calculated we refer the reader to the paper references. We compare the kick estimates with the predictions of a set of NR waveforms. We have used SXS waveforms [68] from the LVCNR Waveform Catalog [69] with the highest resolution available. These include all subdominant harmonics up to ℓ = 8, with |m| ⩽ ℓ. The NR data come with a metadata file which includes a coordinate recoil velocity estimation that is calculated based on the trajectory ⃗ x(t) of the coordinate centre of the apparent horizon of the remnant. Although this value is close to the one computed from the momentum flux integral, they might not be necessarily the same [68]. For this reason, we calculate the recoil velocity from the full waveform by using the momentum flux integral. For the final mass estimate, we use the value indicated at the metadata file.
Finally, we also employ the NR Sur fit [44], which estimates the final properties of the remnant black hole {m f , ⃗ χ f ,⃗ v f } from the initial intrinsic properties of the binary {m 1 , m 2 , ⃗ χ 1 , ⃗ χ 2 }. The surrogate model is trained on quasicircular NR simulations using Gaussian process regression. The NR Sur fit includes two surrogate models: the NRSur7dq4Remnant fit, trained on precessing systems with q ⩽ 4 and |⃗ χ 1 | = 0.8, |⃗ χ 2 | = 0.8, and the NRSur3dq8Remnant fit, trained on nonprecessing systems with q ⩽ 8 and |⃗ The kick prediction is in general limited by two aspects. First, waveform models only include a finite number of harmonics. Phenom models include a subset of the spherical harmonics with ℓ ⩽ 4, while SEOBNRv4HM_ROM and the NRHybSur3dq8 include a subset of the harmonics up to ℓ ⩽ 5 (see table 1). What we use as the gravitational strain is given by: However, the radiated momentum flux is related to the sum over an infinite number of subdominant harmonics. The omission of specific harmonics can influence the final kick velocity value. Second, since waveforms are only available for a limited time range, the time integration of equation (5) is truncated to a finite time. Because the GW amplitude decreases exponentially in the ringdown, having a finite upper limit does not influence the final velocity value. The lower bound limits the amount of early inspiral phase included in the waveform. Based on PN calculations, there exist expressions of the net linear momentum radiated during the early inspiral phase [70][71][72]. The contribution of the inspiral phase to the total linear momentum is significantly smaller than the merger contribution. For a non-spinning binary with symmetric mass ratio 0.2 and total mass 50 M ⊙ , the radiated linear momentum up to 10 Hz is less than 0.05 km s −1 . In our calculations we neglect the linear momentum radiated up to 10 Hz.
We apply the methods discussed to calculate the estimates of the kick velocity from the previously specified waveform models over the parameter space. We divide our results into three parts. We first make model-model comparisons of the magnitude and orientation of the kick velocity. Then, we quantify where exactly in the parameter space models show disagreements by analysing the dependence of the kick estimates on the mass ratio and the individual spin components. We then compare the harmonic contributions of the kick velocity estimated by the models.
Even though NR simulations provide the closest description of the true waveform, only a limited set of NR waveforms is available. In the region where binaries have highly asymmetric masses, simulations are particularly sparse. Therefore, studying the agreement with respect to a particular model allows us to make a more exhaustive analysis of the differences between model estimates over the parameter space. For this reason, we have analysed model-model agreement by choosing a reference model and calculating the relative difference with respect to its estimates. Because the NRHybSur3dq8 model includes a larger set of subdominant harmonics, we are able to compare all the harmonic contributions predicted by the models (each model has a particular set of harmonic contributions, different from the other models) to the NR Sur model estimates. The NR Sur is slightly more accurate than the Phenom and SEOBNR models. These models, on the other hand, cover a wider range of the parameter space than the NR Sur, relevant for LVK data analysis purposes.
However, since the true waveform is unknown, the actual value of the kick velocity is also unknown. By comparing the predictions of two different models, it is difficult to tell which of the two estimates is more accurate. However, disagreements between models are a reflection of systematic errors in either or both waveform models.
The direction of the kick velocity is subject to the orientation of the binary at the initial reference frame. In order to compare the estimates of the direction of the kick velocity from different models, we need to make sure that all waveforms are initially aligned in phase the same way. We first align the waveforms in time, setting t = 0 at the maximum amplitude of the (2,2) spherical harmonic. We then apply a phase shift to each (ℓ, m) harmonic equal to ∆ϕ ℓ,m = m/2 × ϕ 2,2 , where ϕ 2,2 is the optimal phase shift of the (2,2) spherical harmonic, obtained from the match calculation between the model of interest and the reference model. The match is expressed in terms of the Wienner inner product, defined as: whereh indicates the Fourier transform of h and S n ( f ) is the one-sided power spectral density of a GW detector. When aligning two waveforms in phase, we consider a flat noise sensitivity S n = 1. The match M is defined as the normalized inner product maximized over relative time and phase shifts between the two waveforms, i.e.: The mismatch MM is then defined as:

Model-model comparisons
We calculate the estimates of the kick velocity from the indicated waveform models for discrete points in the parameter space. The spin components are uniformly distributed by selecting points in the interval χ z 1,2 = [−0.8, 0.8] with step size ∆χ z 1,2 = 0.1. We consider masses m 1 and m 2 subject to the symmetric mass ratio and total mass values. Because the estimate of the kick velocity does not depend on the total mass, we only sample over the symmetric mass ratio. The symmetric mass ratio is sampled uniformly by choosing 31 points in the interval η = [0.10, 0.25] with step size ∆η = 0.005, giving a total number of 8959 points in the parameter space. The region of the parameter space that we choose to analyse is limited by the region to which the NRHybSur3dq8 model is calibrated.
In figure 1 we show the differences of the kick magnitude predicted by the PhenomHM, PhenomXHM and SEOBNRv4HM_ROM waveform models compared to the NRHybSur3dq8 model. We observe that the differences of PhenomXHM and SEOBNRv4HM_ROM have comparable values, with SEOBNRv4HM_ROM showing better agreement with the NRHyb-Sur3dq8 model. The distribution of PhenomXHM has a mean value of ∆v ∼ 9 km s −1 and a standard deviation (std) of σ ∼ 19 km s −1 . For SEOBNRv4HM_ROM, the mean value lies at ∆v ∼ −0.4 km s −1 and the std σ ∼ 19 km s −1 . On the other hand, the PhenomHM model  largely over-and underestimates the kick velocity, with a mean value of ∆v ∼ 42 km s −1 and a standard deviation of σ ∼ 81 km s −1 . Figure 1 shows that PhenomHM has been superseded in accuracy by its respective newer version, PhenomXHM.
We now discuss the distributions of the differences on the direction of the kick velocity, shown in figure 2. In this case we observe good agreement between PhenomXHM and SEOB-NRv4HM_ROM with the NRHybSur3dq8 model in the kick orientation, with the mean value at ∆θ ∼ −0.13 rad and ∆θ ∼ −0.13 rad respectively. For PhenomHM, the distribution appears shifted by ∆θ ∼ 0.4 rad, meaning that the orientation estimates are inconsistent with the three other waveform models. The NRHybSur3dq8 model is trained against 104 hybridized nonprecessing NR waveforms [61] and inherits the accuracy limitations of the hybrid waveforms. We now investigate the uncertainty of the kick estimate in NR simulations. These waveforms contain two primary sources of error: numerical resolution and extrapolation errors. In the case of SXS waveforms, the uncertainty involved in the extrapolation procedure is, on average, significantly smaller than the resolution error [68]. For this reason, here we focus on understanding the influence of the resolution error in our kick estimates. We estimate the kick uncertainty by comparing the kick values of the highest two resolutions. In particular, we have used 173 nonprecessing NR waveforms from the SXS Collaboration, which include at least two resolutions. The set of waveforms we employ cover a larger region of the parameter space than the set used to calibrate the NRHybSur3dq8, with q ⩽ 10 and χ z 1,2 ⩽ 0.994. We provide a list of the waveforms employed in appendix B. The error estimates of the kick magnitude and orientation are shown in figure 3.
The distribution of the kick-magnitude uncertainty has a std of ∼10 km s −1 . While the errors of waveform models have values between (−π/2, π/2), the orientation errors of the NR waveforms lie between (−π, π). The maximum value is set by our methodology and is different in each case. In the case of waveform models, we align the spherical harmonics with a phase shift ∆ϕ ℓ,m = m/2 × ϕ 2,2 . Here, the phase of the (2, 2) harmonic is degenerate: where ϕ opt 2,2 is the optimal phase shift of the (2, 2) spherical harmonic. This means that the phase shift is also degenerate: Applying a phase shift ∆φ to a waveform translates as rotating the kick orientation by ∆φ. Therefore, when applying ∆ϕ ℓ,m to each spherical harmonic, the kick orientation acquires a π n degeneracy. For this reason, the orientation estimates of the waveform models take values in the range θ ∈ (−π/2, π/2) and thus, the orientation differences also lie between (−π/2, π/2). With NR waveforms, however, there is no such degeneracy and the orientation differences lie between (−π, π).
What is important here is that both waveform models and NR simulations have orientation errors that extend to the maximum values. In the case of NR waveforms, we measure a standard deviation of ∼0.8 rad. The large uncertainty of the kick orientation is mostly related to the fact that, for each waveform resolution, the dominant harmonic contribution, which comes from the (2, ±1)(2, ±2) pair of harmonics, has a significantly different kick orientation in each case.

Symmetric-mass-ratio dependency
We now proceed to find the regions in parameter space where the models show larger disagreement, reflecting the existence of waveform modelling inaccuracies in one or both of the models. We use the same data as in section 3.1 and study the dependency on the symmetric mass ratio. Figure 4 shows the differences on the magnitude of the kick velocity as a function of the symmetric mass ratio, while figure 5 shows the differences on the kick orientation. The curves represent the mean value of the distributions at each symmetric-mass-ratio value, while the shaded region represents the std in each case.
Similar to figure 1, in figure 4 we observe that the distributions of PhenomXHM and SEOB-NRv4HM_ROM have comparable values for the kick magnitude. They both show the larger differences in the region between η ∈ [0.20, 0.25]. We observe that the differences between PhenomHM and NRHybSur3dq8 are significantly larger than for PhenomXHM and SEOB-NRv4HM_ROM. In particular, the estimates of PhenomHM deteriorate in accuracy with increasing symmetric-mass-ratio values, showing the largest inconsistencies close to the equalmass case.
Regarding the estimates of the kick orientation, in figure 5 we observe a constant closeto-zero mean and std for SEOBNRv4HM_ROM. Although the estimates of PhenomXHM are comparable to those of SEOBNRv4HM_ROM, they show a slightly more complicated correlation with the symmetric mass ratio, with small changes in the width of the distributions. Similar to the figure 2, the estimates of PhenomHM appear to be shifted by ∼π/8 rad in the region η ∈ [0.10, 0.20], and show the largest std from the NRHybSur3dq8 values. The estimates converge to zero towards the equal-mass case.
In appendix A, we show the harmonic contributions of the kick velocity as a function of the symmetric mass ratio, which help to understand the origin of the broad distributions observed for PhenomHM. We also include the symmetric-mass-ratio dependency of the NR errors in appendix C. In figure 6, we show the kick magnitude as a function of the symmetric mass ratio for two fixed spin configurations: nonspinning (left column) and positively highly spinning (right column). We analyse the kick estimates of the four waveform models for such configurations. In addition, we include the estimates of the two NR Sur fits, namely NRSur7dq4Remnant and NRSur3dq8Remnant and the estimates of a set of SXS waveforms.
In the case of nonspinning binaries (left column), we observe good agreement between the SXS waveforms and the NR Sur models. The estimates of the Phenom and SEOB-NRv4HM_ROM models show disagreement in the region η ∈ [0.15, 0.24]. The highly spinning configurations (right column) show larger relative errors between models. In particular, we observe a secondary maximum in the estimates of PhenomXHM and SEOBNRv4HM_ROM, located in the region η ∈ [0.05, 0.15]. Even if the true values are not known, it is highly probable that PhenomHM largely overestimates the value of the kick velocity, since the higher harmonics that induce the kick are not calibrated to NR in this model.
We now illustrate how sensitive the kick estimate is compared to the mismatch uncertainty between the models and the NR waveforms. We choose the BBH configuration {η = 0.22, χ z 1 = 0.0, χ z 2 = 0.0}. Such binary lies within the region where the models studied here have been calibrated to NR simulations. For this reason, we expect the mismatch errors with respect to the NR waveform to be small. We have calculated the mismatch for the plus (h + ) and cross (h × ) polarizations between the waveform models and the SXS waveform, SXS:BBH:0169, for three inclination values: 0, π/3 and π/2 (rad). We have considered the Advanced-LIGO design sensitivity curve [73] with a lower cutoff of f min = 10 Hz. When computing the mismatch we maximise the overlap over the relative time. Here we do not optimise this quantity over the relative phase, since these waveforms include higher harmonics and for such waveforms, a relative phase shift does not leave the waveform invariant. Table 2 displays the mismatch values and kick differences of the models with the SXS waveform. From the mismatch values we are tempted to conclude that the waveform models are highly accurate for such particular binary configuration. However, the large disagreement in the kick estimates indicates the existence of modelling errors in the description of the GW harmonics during the merger phase. These results reflect the sensitivity of the kick velocity to waveform systematic errors.

Spin dependency
After analysing the mass ratio dependency, we now study whether there is any correlation between the waveform predictions and the spin components of the binary. Using the same data, we compute the differences with the NRHybSur3dq8 model as a functions of the individual spins and calculate the distributions' mean value. Our results are shown in figure 7. The left column shows the differences of the kick magnitude for PhenomHM (first row), PhenomXHM (second row) and SEOBNRv4HM_ROM (third row) with NRHyb-Sur3dq8. The right column displays the differences of the estimates on the kick orientation for the same models. In addition, we include the spin dependency of the NR errors in appendix C.
We observe discrepancies between all models in the magnitude, direction and overall dependency concerning the spin components. Regarding the differences with the NRHyb-Sur3dq8 model on the kick magnitude (left column), in the case of PhenomHM, we observe a correlation with the spin component of the more massive object. The larger the absolute magnitude of the primary spin component is, the larger the difference appears to become. Phe-nomXHM shows larger disagreement with the NRHybSur3dq8 model in the regions where both individual objects have the largest positive and negative spin magnitudes. In particular, we observe significantly larger relative errors in the region where both objects have negative spin components. As expected, we observe that the newer model is superior to the previous Phenom model. We observe that from the three models, it is SEOBNRv4HM_ROM the model with the closest predictions to those of NRHybSur3dq8. The largest disagreement between

Harmonic decomposition of the kick velocity
The kick velocity results from the sum of the contributions coming from pairs of GW harmonics. We decompose the kick velocity into its harmonic contributions, which allows us to look in more detail into the GW spherical harmonics included in the waveform models. Looking at the individual harmonic contributions allows us to understand which harmonics show more significant disagreements and, in turn, indicate the presence of waveform systematics on a more detailed level. Equation (5) relates two GW harmonics with different (ℓ, m) numbers. Besides, the contributions of the pairs (ℓ 1 , m 1 ) (ℓ 2 , m 2 ) and (ℓ 2 , −m 2 ) (ℓ 1 , −m 1 ) have the same magnitude and direction. The dominant contribution always comes from the (2, 1) (2, 2) and (2, −2) (2, −1) pairs, and we refer to them jointly as (2, ±1) (2, ±2). Besides, the number of GW harmonic contributions to the kick velocity will vary from one waveform model to another, depending on the number of spherical harmonics included in each model. We have calculated the harmonic contributions individually and compared the estimates to those of the NRHybSur3dq8 model for the same set of points as in the previous section, which are uniformly sampled in the parameter space. Figure 8 shows the differences in the kick magnitude of four harmonic contributions. The pairs of spherical harmonics are indicated at the top of each plot. SEOBNRv4HM_ROM and PhenomXHM show comparable differences to the NRHybSur3q8 estimates, while Phe-nomHM largely under-and overestimates the magnitude of the kick velocity. The largest differences appear for contributions where an (ℓ, ±(ℓ − 1)) harmonic is involved. Phe-nomXHM is in good agreement with NRHybSur3dq8, particularly for the (3, ±2)(3, ±3) and (3, ±3)(4, ±4) contributions. We should note that the SEOBNRv4HM_ROM model does not include the (3, 2) harmonic, and thus, it has no contribution coming from the (3, ±2) (3, ±3) pair. In the case of SEOBNRv4HM_ROM, we observe good agreement with the NRHyb-Sur3dq8, particularly for the (2, ±1)(2, ±2) and (2, ±2)(3, ±3) contributions. Figure 9 displays the distribution of the differences in the orientation of the kick velocity. Here we observe a similar pattern to the one displayed in figure 2. The distributions of the differences are centered around zero for SEOBNRv4HM_ROM with σ ∼ 0.3 rad, showing good agreement with the NRHybSur3dq8 model. In the case of the PhenomXHM estimates, we observe good agreement with NRHybSur3dq8, with σ ∼ 0.4 rad, except for the (3, ±2) (3, ±3) harmonic contribution, where the distribution has a significantly larger std, σ ∼ 0.6 rad. The distributions of PhenomHM appear shifted at every contribution, with σ ∼ 0.8 rad. In particular, the (3, ±2)(3, ±3) and (3, ±3)(4, ±4) contributions, show a significantly larger disagreement with the NRHybSur3dq8 model.

Towards kick-based GW tuning
Because the kick estimate is a highly sensitive quantity, here we investigate whether we can use the kick prediction of NR simulations to add additional information in the calibration of the EOBNR and Phenom waveform models. These models calibrate unknown coefficients to NR simulations for the merger and ringdown. Unlike the inspiral region, where the amplitude and phase of the signal are well known from PN or EOB theory, modelling inaccuracies might build up in the merger region. In the following, we propose the first steps towards incorporating the kick prediction in the calibration of a waveform model.
Ideally, one could try to improve the description of the radiated linear momentum together with the energy and angular momentum by tuning these three quantities together. However, this would significantly increase the complexity of the problem. The energy and angular momentum are dominated by the (2, ±2) modes, while the linear momentum is dominated by the interplay of the (2, ±2) with other higher harmonics. The EOBNR and Phenom waveform families started with models for the dominant mode, which over the years have been improved by increasing the number of NR waveforms used for the calibration. For this reason, we expect the energy and angular momentum to be better modeled than the kick velocity, which has a more subtle imprint in the waveform. As a first step, one would then focus on the kick description, making sure that the radiated energy and angular momentum are not modified dramatically. Expressions for the energy and angular momentum in terms of the radiated harmonics can be found in [66].
As indicated before, the prediction of the kick orientation in currently available NR waveforms has a large uncertainty (σ ∼ 0.8 rad). For this reason, it might be difficult to significantly improve the performance of waveform models with current NR waveforms. Here we discuss how we could use information from the kick estimate once we have more accurate NR waveforms, and in turn, more precise kick estimates. By comparing the kick predictions of a waveform model with those predicted by NR, we analyse whether it is possible to carefully tune the individual harmonic contributions and, in turn, improve the accuracy of the modelled gravitational signal. We discuss the requirements that must be fulfilled and describe the complexity of addressing such a problem.
When analysing GW data, small phase variations are much better measured compared to amplitude variations. Hence, most of the information that is inferred from the waveform comes from the phase. The first question we want to answer is, whether it is possible to calibrate the GW phase based on the kick contributions.
We tune the harmonics contributions applying a phase shift α ℓ,m to the GW harmonic h ℓ,m , such that h ℓ,m = A ℓ,m (t) e −iϕ ℓ,m (t) transforms into: whereφ ℓ,m (t) = ϕ ℓ,m (t) + α ℓ,m (t, λ) is the tuned phase function. As specified, in principle, α ℓ,m (t, λ) can be a function of the time evolution and the intrinsic parameters of the binary. We can also write the waveform simply as: In reality, we are interested in the kick prediction, which involves the first time derivative of the gravitational strain: The time derivative of the transformed waveform is on the other hand,

Applying a constant phase shift
In the simple case where the applied phase shift is slowly varying, such that its time derivative can be neglected, the first derivative of the spherical harmonic reduces to: We now look at the the kick formula and compute the kick contribution of the pair of harmonics (ℓ, m) and (ℓ, m + 1), each with a constant phase shift α l,m and α l,m+1 respectively. We have: Since the phase shift is simply a constant, the exponential term can be taken out of the integral. We can see that the transformed kick contribution is rotated by the difference of the constant phase shifts:v This means, a constant phase shift allows us to perfectly correct the orientation of the contributions, and keep the amplitude unchanged at the same time. In general, we consider two (ℓ 1 , m 1 ) and (ℓ 2 , m 2 ) harmonics, with a phase shift α ℓ1,m1 and α ℓ2,m2 , respectively. Their kick contribution will be shifted by: Therefore, if the calibration phase shift of a particular harmonic α ℓ,m is known, the phase shift of the (ℓ, −m) harmonic will be directly determined. Thus, we only need to find the required phase shift for half of the harmonics included in the model. In the following, we use PhenomHM as the base model that we want to recalibrate. However, the procedure described next can be generalised to any waveform model. The reason why we choose PhenomHM is simply that accuracy improvements might be easier to appreciate. Looking at figures 8 and 9, we observe that the m = ℓ − 1 harmonics, namely the (2, 1), (3,2) and (4,3), are the least accurately modelled for PhenomHM. These are the harmonics we might want to tune. In the case of PhenomHM, there are 7 kick contributions that we want to correct. The model includes 6 (positive m) harmonics. In addition, there is one extra degree of freedom that is the reference orbital phase, which is fixed in each SXS waveforms, but not for the Phenom models. The set of unknown phase shifts can be determined by solving a linear system of equations for the positive (or negative) m harmonics. The linear system of equations, Ax = b, is given by: x = (α ref , α (2,2) , α (3,3) , α (4,4) , α (2,1) , α (3,2) , α (4,3) ) , θ (2,1)(2,2) refers to the orientation difference between the SXS and PhenomHM (2,1)(2,2) contribution, The solution to such a system will rotate the kick contributions of the waveform model and perfectly match those predicted by the NR waveform. The next question we want to answer is whether these shifts make the waveform model more accurate. In reality, applying a constant phase shift to the waveform would change the phase of the inspiral part, which is well modelled based on PN and EOB theory. Thus, applying a constant shift would instead make our waveforms inconsistent with the PN predictions of the early inspiral. If we want to address possible systematic errors in the waveform, we must keep the original inspiral phase and correct the merger-ringdown phase with a time-dependent phase shift.

Applying a time-dependent phase shift around the merger
We now want to leave the inspiral phase unchanged and apply a phase shift that is non-zero during the merger. The late-ringdown phase is determined by solving equation (23) for the fully integrated kick velocity. Ideally, one should find the phase shift α ℓ,m (λ, t) as a function of the intrinsic parameters for each harmonic, such that the time profile of the individual kick contributions predicted by a model is calibrated to the NR estimates. Thus, we are looking for a phase correction that is zero in the inspiral part, builds up around the merger and has a constant value in the late-ringdown.
We consider a phase shift that is time-dependent, and at least once differentiable. In this case, the time derivative of the tuned waveform involves an additional term: The tuned kick contribution of a pair of harmonics (ℓ, m) and (ℓ, m + 1), each with a timevarying phase shift α ℓ,m and α ℓ,m+1 respectively, is then given by: We observe that for a phase-shift function that necessarily changes its value over time, it is not possible to recover the expression (20), where the tuned contribution is precisely rotated by the difference of the individual phase shifts. Therefore, when using a time-dependent function, we can only aim to tune the kick contributions as close to the NR predictions as possible.
The rest of the subsection aims to study more deeply how well we can correct the harmonic contributions. We start with a simple function with the following characteristics: it is zero for the inspiral part, builds up around the merger following a Gaussian distribution, and has a constant value in the late-ringdown. The Gaussian distribution is centred at t/M = 0 and has a maximum value equal to the late-ringdown phase shift. We leave the width of the distribution as a free parameter and study how the tuned kick velocity (integrated over the whole evolution) depends on its value. Figure 10 displays the function we first try, with different values of the std of the Gaussian distribution. As a first step, we should mention that one could choose a different type of distribution, centre it at a different point in time, and would still come to the same conclusions.
Next, we want to address whether a more convenient width can be chosen for all harmonics. We compute the dependency of the kick on the width for three different configurations as shown in figure 11.
Because we are only trying to correct the orientation of the contributions, it might be that even though the direction of the kick is close to the NR prediction, the individual magnitudes differ significantly from the NR estimates. A better approach should consider both amplitude and phase corrections at the same time. Besides, one should study the dependency individually for each harmonic. We observe that the value of the distribution width might change for different binary configurations. Thus, the width should not only vary for each harmonic, but one should also find dependency on the intrinsic parameters of the binary.  Although our current approach can be improved in many ways, we want to know whether our simple model can still improve in the phase of the modelled GW harmonic. We use 9 SXS waveforms of different configurations as indicated in table 3 and apply the following algorithm: • Find the value of the std of the distribution that leads to a total kick value that matches the SXS prediction. • Find the required late-ringdown phase shifts for each harmonic by solving the linear system of equations for the fully integrated kick contributions. • Apply the tuning phase shift to each harmonic.
To analyse how the phase shift modifies the original PhenomHM harmonics, we plot the original and tuned waveform, its phase and the phase difference between the SXS and the Phe-nomHM waveforms. We include the results of one of the configurations in figures 12 and 13, where we show how the phase shifts modify the phase and in turn the waveform for the (2,2) and (4,4) harmonics. In all 9 studied cases, we find that the simple phase-shift function does not improve the accuracy of the modelled waveform. Just as for the constant phase shift, it is Table 3. SXS waveforms used in our study to calibrate the kick contributions of PhenomHM. not straightforward that tuning the kick contributions necessarily implies an improvement of the waveform. From the plots of the waveform phases, we find support, once again, to use a different phase function for each harmonic. We observe that the relative error in the phase does not have a simple structure which could be corrected using a phase shift that builds up following a smooth distribution. Besides, it could be that the relative errors are randomly distributed during the merger time and even over the parameter space. These two aspects reinforce our view of how complex kick-based tuning is. The main question we want to address is still how to optimally reduce the relative errors using the information contained in the kick. As mentioned earlier, the amplitude and the phase of the kick contributions should be corrected simultaneously. A time-dependent phase shift does not only change the orientation but also the magnitude of the kick contributions, as indicated by equation (28). In our research, we have only considered the change in the orientation of the kick. One could try to analyse the expression (28) in more detail and study whether it is possible to tune the amplitude and phase simply with a phase-shift function.

Waveform ID
So far, we have only calibrated the kick contributions of the late-ringdown by comparison to the fully integrated kick values. The time evolution 4 of the individual contributions could be used instead, which would allow finding the required phase shifts at several points in time, not only at the late-ringdown. One could choose a set of collocation points and solve the set of equations of the individual phases and possibly the amplitudes at each point. Besides, in our study we have explored the use of phase shifts to calibrate our estimates. However, the kick prediction is also sensitive to time shifts, which could complement the use of phase shifts. One should take care that in the waveform calibration process, the properties of the binary system are not changed. Similar to the radiated linear momentum, the energy and the angular momentum (see [37]) would have to be corrected in the models.

Discussion
In this paper, we show that current waveform models, NRHybSur3dq8, SEOB-NRv4HM_ROM, PhenomHM and PhenomXHM, are not consistent with each other in their kick estimates over the parameter space. Waveform systematic deviations that occur during the merger phase can strongly impact the kick estimate. Because the kick prediction is highly sensitive to waveform inaccuracies, disagreements between models indicate where in the parameter space are waveform systematics more significant. We have studied the dependency of the kick magnitude and orientation on the symmetric mass ratio and the individual spin components of the binary. Analysing model-model agreement, we find that overall, PhenomXHM and SEOBNRv4HM_ROM show comparable differences compared to NRHyb-Sur3dq8. We observe large discrepancies between the predictions of PhenomHM relative to NRHybSur3dq8. Our results show that PhenomHM has been superseded in accuracy by its newer version, PhenomXHM. Such improvement is probably related to the NR calibration of the subdominant spherical harmonics in the latest model. We observe the largest discrepancies in regions of the parameter space where the spin magnitude of the more massive black hole has high absolute values. Besides, the largest uncertainties appear in the region where the symmetric mass ratio takes values between η = [0.20, 0.25].
Since the estimate of the kick velocity involves the description of the higher harmonics during the merger phase, we are able to study the individual contributions of the kick velocity coming from different spherical harmonics. We analyse model-model agreement and find similar results as for the total kick velocity. Both PhenomXHM and SEOBNRv4HM_ROM show considerably smaller differences to NRHybSur3dq8 than PhenomHM. In particular, we observe extremely good agreement between the PhenomXHM and NRHybSur3dq8 in the magnitude of the (3, ±2) (3, ±3) and (3, ±3) (4, ±4) contributions. At the same time, we find large disagreement between PhenomXHM and PhenomHM with NRHybSur3dq8 on the orientation of the (3, ±2) (3, ±3) and (3, ±3)(4, ±4) contribution. Our results support the use of the kick estimate as a complementary tool to the mismatch uncertainty to evaluate the performance of gravitational waveform models.
We further study whether calibrating the individual kick contributions to the NR predictions can, in turn, imply an improvement in the accuracy of a waveform model. We focus on tuning the orientations of the kick contributions, and we use PhenomHM as our base model for the study. Although applying a constant phase shift would allow us to calibrate the orientation of the individual contributions, the shift would change the well modelled inspiral phase. Instead, we need a time-dependent phase shift that builds up around the merger. However, finding the appropriate time-dependent function that corrects the waveform amplitude and phase errors is highly complex. We find that a different phase-shift function is probably needed for each harmonic, which might depend on the binary's intrinsic properties. Besides, one should treat both the magnitude and orientation of the individual kick contributions simultaneously. Once there are more accurate NR predictions of the kick orientation available, this calibration framework could help improve the modelling of the kick imprint in waveform models.

Data availability statement
The data cannot be made publicly available upon publication because the cost of preparing, depositing and hosting the data would be prohibitive within the terms of this research project. The data that support the findings of this study are available upon reasonable request from the authors. Figure A1. Harmonic contributions of the kick magnitude as a function of the symmetric mass ratio. We show results for the PhenomHM (blue), PhenomXHM (orange) and SEOBNRv4HM_ROM (green) models. The shaded region represents the std of such distribution at each symmetric-mass-ratio value, while the curves represent the mean value of the distributions. Figure A2. Harmonic contributions of the kick orientation as a function of the symmetric mass ratio. We show results for the PhenomHM (blue), PhenomXHM (orange) and SEOBNRv4HM_ROM (green) models. The shaded region represents the std of such distribution at each symmetric-mass-ratio value, while the curves represent the mean value of the distributions.

Appendix B. Table of SXS waveforms employed to estimate the kick error
In table B1, we display the list of 173 nonprecessing SXS waveforms we have used to get an estimate of the kick error in NR waveforms coming from the NR resolution uncertainty. For each simulation we include the intrinsic parameters of the BBH, and the error estimate of the kick magnitude (∆v) and direction (∆θ). Table B1. Error estimates of the kick magnitude (∆v) and orientation (∆θ) for the set of SXS waveforms indicated in the first column. For each waveform, we include the intrinsic parameters of the binary.