When to Point Your Telescopes: Gravitational Wave Trigger Classification for Real-Time Multi-Messenger Followup Observations

We develop a robust and self-consistent framework to extract and classify gravitational wave candidates from noisy data, for the purpose of assisting in real-time multi-messenger follow-ups during LIGO-Virgo-KAGRA's fourth observing run~(O4). Our formalism implements several improvements to the low latency calculation of the probability of astrophysical origin~(\PASTRO{}), so as to correctly account for various factors such as the sensitivity change between observing runs, and the deviation of the recovered template waveform from the true gravitational wave signal that can strongly bias said calculation. We demonstrate the high accuracy with which our new formalism recovers and classifies gravitational wave triggers, by analyzing replay data from previous observing runs injected with simulated sources of different categories. We show that these improvements enable the correct identification of the majority of simulated sources, many of which would have otherwise been misclassified. We carry out the aforementioned analysis by implementing our formalism through the \GSTLAL{} search pipeline even though it can be used in conjunction with potentially any matched filtering pipeline. Armed with robust and self-consistent \PASTRO{} values, the \GSTLAL{} pipeline can be expected to provide accurate source classification information for assisting in multi-messenger follow-up observations to gravitational wave alerts sent out during O4.


Introduction
The observation of gravitational waves (GWs), a burst of gamma rays and other electromagnetic (EM) transients in X-ray through radio, from the binary neutron star (BNS) merger GW170817, marked a significant breakthrough in multi-messenger astronomy [2,20,36,38,14,41].Interpretation of the complementary information carried by photons and GWs from a single BNS system has provided new insights into several mysteries of fundamental physics, beyond the scope of what either messenger could have conveyed individually.Some examples include explorations of the properties of matter at extreme densities [30], tests of general relativity in the strong gravity regime [3,38], the measurement of cosmological parameters [39], and investigations on the production of heavy elements in the universe [24].
To facilitate the prompt discovery of multi-messenger counterparts to GW observations, the LIGO, Virgo and KAGRA (LVK) [37,6,7] detector network sends out public alerts about GW candidates, within minutes of their detection [1].These alerts comprise information regarding source properties and candidate significance which can assist in optimizing the triggered follow-up campaigns.For example, the optical counterpart to GW170817 was detected while performing a targeted search that made use of the precise sky localization obtained through the analysis of multi-detector GW data [14,41].
For GWs originating from compact binary coalescences (CBCs), information regarding the nature of the merging compact objects is also considered highly important for conducting real-time multi-messenger searches.The main reason for this consideration is as follows.Among the three kinds of CBC sources expected to be observable by the LVK, BNSs and neutron star-black hole binaries (NSBHs) are far more likely to be associated with multi-messenger counterparts than binary black holes (BBHs) [27,29].Hence, estimates of the probability that a CBC candidate has originated from either a BNS or an NSBH, that are sent out as part of GW candidate alerts [1], can assist astronomers in making informed decisions on whether to follow-up on said alerts.
This so-called probability of astrophysical origin (P astro ) can be computed from GW data, given knowledge of the rate at which BBH, BNS and NSBH systems merge in the observable universe and the population-level distributions of their system parameters [23,5].Informed by the relative rates and population properties of various CBC sources, estimates of P astro can be expected to reveal a larger number of significant candidates in densely populated regions of the CBC parameter space, in complement to estimates of false alarm rate (FAR).Hence, in addition to facilitating source classification, the P astro calculation also offers a population-informed measure of significance for GW candidates.
Several frameworks for estimating P astro self-consistently with the rates and population properties of CBCs from GW data have been studied in existing literature [16,23,19,32].However, they all involve constructing the posterior probability distributions of said rates given data associated with all the GW candidates found above some FAR threshold during an observing run.Since the mentioned dataset can only become available in postprocessing, such analyses cannot be used straightforwardly for computing P astro in real-time, without resorting to various schemes for approximating the rate of detectable CBCs.
In this paper, we describe a self-consistent methodology for computing P astro in low-latency with the purpose of assisting in real-time multimessenger follow-up observations during LVK's fourth observing run (O4).Our framework comprises several improvements to the low-latency P astro calculation that was implemented in LVK's third observing run (O3).We implement our formalism through the GstLAL search pipeline's real-time (online) analysis of O4 data, even though our improved rate approximation scheme is applicable to other pipelines as well.
We validate our formalism through GstLAL's participation in a rigorous Mock Data Challenge (MDC) that comprised a replay of O3 data with simulated sources injected into the data stream [11,15].We show that our formalism yields highly accurate category specific P astro for both real and simulated events in the MDC.We also demonstrate how our improved rate-approximation prevents the underestimation of P astro values of real and simulated events that would have resulted from the use of O3's inaccurate rate-approximation scheme.
This paper is organized as follows.In section 2 we describe our formalism for computing low-latency P astro values in detail.In section 3, we summarize the performance of our methods in the aforementioned MDC and demonstrate the accuracy of our category-specific P astro values.In section 4, we conclude with a summary of our methods, their validation, and future prospects regarding further improvements to our low-latency computation of P astro .

Methods
In this section, we describe our formalism for calculating accurate P astro values in low latency amidst an ongoing observing run.We first summarize the offline calculation that can be implemented given the full set of GW triggers found during an observing run.We then derive several approximation schemes for accurate P astro calculation in real-time, i.e. in the absence of the full list of triggers, which can only become available in post-processing.

Probability of astrophysical origin
GW triggers originating from different sources (including terrestrial ones) can be interpreted as realizations of independent Poisson processes [16,23].Hence, given the rate of events expected from each source category and the GW data associated with a particular trigger, one can compute the probability of its origin being any one of said categories.For example, the probability of the k th trigger originating from category α is given by: where, H α is the hypothesis that a particular trigger has originated from the source category α, R α the expected rate of detectable triggers per unit time from source α ∈ {BBH, BNS, NSBH, Terrestrial}, and d k the data associated with the k th trigger ‡.
Note that the analysis of only detectable triggers implies the existence of a threshold value for some detection statistic such as FAR that is used to rank events by the GW search pipelines.
In an offline analysis, wherein the data corresponding to the full list of N triggers found during an observing run is available, the probabilities in Eq. ( 1) can be inferred self-consistently with the rates from the same dataset [16,23].This is achieved by first constructing the posterior distribution of the rates using the trigger data from a particular observing run (say r), which looks like: , where ⃗ d Nr is a set comprising data from all observed triggers, N r the total number of observed triggers, T obs,r the total observation time and p( ⃗ R r ) an uninformative prior on the rates.Note that we are using the capitalized P for probability and the regular p for probability distribution/density, a notational choice made consistently throughout the rest of the paper.Once the rate posterior is constructed, it is possible to marginalize the probabilities in Eq. ( 1) over the uncertainty in rate estimation: . Upon carrying out said marginalization, the probability of a GW trigger originating from a particular astrophysical source category a ∈ {BBH, BNS, NSBH} can be expressed in terms of the expectation values of the rates w.r.t. the posterior in Eq. ( 2), as in: where, ⟨R α ⟩ N/k = d ⃗ Rp( ⃗ R| ⃗ d N/k )R α is the mentioned expectation value, ⃗ d N/k the dataset corresponding to all N triggers except the k th one, and the subscript 0 representative of the noise/terrestrial source category.Note that the sum over b in the denominator is over astrophysical categories only.Throughout the rest of this paper, Latin subscripts to rates and probabilities (such as a, b, c) are chosen to represent an astrophysical source category while a Greek subscript (such α, β) can denote astrophysical as well as terrestrial categories.In terms of the quantities defined in Eq. 4, the combined probability of astrophysical origin can be written as P astro = a P a .For a derivation of Eq. ( 4) from Eq. (3) see ref. [23].
This framework has been used to compute offline P astro values that were used in the generation of GW transient catalogs for the first three observing runs of LVK [5,4,12,13].In the next subsection, we describe our rate approximation scheme that is needed to implement this framework accurately in low latency amidst an ongoing observing run.

Computing P a in low latency: rate approximation
As mentioned before, in order to evaluate Eq. ( 4), the rate posterior needs to be constructed from the full list of triggers found during an observing run.However, for the purpose of assisting multi-messenger follow-up observations in real-time, P a has to be computed amidst an on-ongoing observing run i.e. in the absence of the complete list of triggers.
In such a scenario, our only option is to approximate the expected rates of detectable GW sources from their corresponding estimates yielded by past observing runs, while accounting for the change in sensitivity of the detectors in between the previous and ongoing runs.In particular, we only need approximate the ratio ⟨Ra,r⟩ ⟨R 0,r ⟩ of the foreground and background rates since it is the only term involving them that appears in Eq. (1).To construct such an approximation scheme, we can exploit the following facts: • The astrophysical rate R of BBHs, BNSs, and NSBHs, per unit comoving volume per unit time, do not change in between observing runs.
• The rate of terrestrial triggers above some FAR threshold (F AR 0 ) can be expected to be equal to said threshold: R 0 = F AR 0 .
Hence, estimates of R a for a ∈ {BBH, BNS, NSBH} from previous observing runs can be used to approximate R a,r for an ongoing observing run r given knowledge of the hypervolume [42] to which the detector is sensitive during said run: where, ⟨V T ⟩ a,r is the sensitive hypervolume during observing run r corresponding to the astrophysical source category a, T obs,r the corresponding live observation time, and ⟨R a ⟩ an estimate of the astrophysical rate yielded through the combined analysis of data from previous observing runs.However, if the ongoing observing run has a different timevolume sensitivity than the previous ones, ⟨V T ⟩ a,r needs to be scaled appropriately from its values during past observing runs to account for said sensitivity change.While the sensitive hypervolume depends on various factors such as the source population model [42] (which itself remains uncertain to some degree [40]), as a first-order step, simplistic scaling factors can be constructed by using the BNS horizon distance of the detector at various observing runs (including the ongoing one for which we use projections based on simulations [1]).For example, if d BN S,s is the horizon distance of the detector network during observing run s at an SNR threshold of 8 and T obs,s the live observation time, then ⟨V T ⟩ a,r can be approximated in the following way: where old represents the combination of all past observing runs.The live observation times of the past runs (T obs,s ) can be approximated from the total number of terrestrial triggers found above threshold (N 0,old ), their corresponding rate (R 0,old = F AR 0,old ), and the individual run times (T run,s ) in the following way: Putting Eqs.(5,6,7) together, we are able to construct an approximation for ⟨Ra,r⟩ ⟨R 0,r ⟩ from estimates of various quantities available at the beginning of the ongoing observing run r, that takes the following form: (8) Notice here that T obs,r conveniently cancels in our self-consistent approximation scheme.The scaling factor on the RHS of Eq. ( 8) is an improvement over the online P astro calculations implemented upto O3 which assumed it to be 1 and hence did not account for the sensitivity change between O2 and O3, leading to potential underestimation of P astro values.We demonstrate this through the analysis of O3 replay data in sec.3.1 .We note that the rate approximation scheme described here is general in the sense that it can be used in conjunction with any low-latency matched filtering pipeline.

Computing P a in low latency: category specific likelihood ratio
With the self-consistent rate approximation scheme at our disposal, we can proceed to assemble a methodology for computing the remaining ingredients required for evaluating Eq. ( 4), namely the marginalized likelihood ratios ( also known as Bayes factors in standard literature): The Bayes factors quantify how much the data supports H a over H 0 .In order to compute these quantities, one must model the population of CBCs in each astrophysical category using some fiducial function characterizing the distribution of measurable CBC parameters.Measurements of said parameters from detector strain data obtained by means of matched filtering can then be used to classify a potential GW candidate into various astrophysical categories given the different population models corresponding to each category.Matched filtering search pipelines usually discretize the space of signal parameters to reduce computational cost, which results in a bank of template waveforms [8].A CBC signal is expected to ring up templates in the bank with some SNR (ρ).Multidetector coincidences are formed when identical templates are triggered at different observatories within the propagation time of GWs between said observatories.The triggered templates are usually ranked in significance with some statistic (say x) that is often computed from additional quantities assigned to GW triggers by the search pipeline.
Among the triggers that pass a chosen detection threshold x th in x, either the highest x or highest ρ template within a certain time-window is chosen to represent a potential GW candidate that might have occurred in said window.The data corresponding to a particular GW candidate (d k ) relevant for Bayes factor calculation thus comprises three quantities assigned to it by the search pipeline namely the representative template (t k ), the corresponding ranking statistic (x k ) and SNR ρ k .
To compute the Bayes factor accurately from the trigger data, one needs to account for the possibility of random noise fluctuations causing a true signal originating from source category a to be recovered in a template whose parameters correspond to a different source category b.This is particularly true if the different source categories map onto neighboring regions of the template bank separated by some hard boundary.
Hence, a true signal corresponding to template t j and SNR ρ, can have a finite probability (P (t k |t j , ρ)) of being found in a different template t k , depending on how similar said templates are with each other.In terms of this probability and a categoryspecific distribution of template parameters, it is possible to evaluate the likelihood with which any template t k in the bank might be triggered by a CBC signal originating from source category a: where, ⃗ θ(t j ) is the CBC parameters corresponding to the template t j , V j the volume in parameter space associated with ⃗ θ(t j ) and p(ρ|H a ) the prior distribution of SNRs corresponding to sources in category a [17].
To compute P (t k |t j , ρ), we make the assumption that migration between true and recovered templates is solely due to Gaussian fluctuations [17].Furthermore, since the purpose of these template-weights (P (t k , ρ k |H a )) is to account for the miss-classification of GW sources due to the template migration caused by noise, it is safe to assume, for a first approximation, that ρ k = ρ j .Under these considerations, the probability of template migration takes the following form: where t k • t j is the noise averaged inner product of the normalized template waveforms.For a derivation of Eq. ( 11) see Appendix B. The volume element V j corresponding to each template needed for evaluating Eq. ( 10) can often be pre-computed from the template bank.For an arbitrary bank, it is possible to approximate the volume element V j corresponding to a template t j as the volume of the Voronoi cell [44,18] that contains the point ⃗ θ(t j ) [17].However the method of Voronoi decomposition is computationally expensive and it is preferable to compute V j as part of bank generation itself, if possible, as in the case of banks populated using methods like [22].Since different search pipelines use different template banks and ranking statistics, the remainder of the Bayes factor calculation will also vary from pipeline to pipeline.We demonstrate how such a calculation can be implemented within the GstLAL search pipeline [26,9,33] while noting that our formalism can also be applied to other pipelines by simply replacing the GstLAL-specific part of the Bayes factor calculation that is described below.
The O4 template bank used by the GstLAL search pipeline places templates in the log 10 (m 1 ) × log 10 (m 2 ) × χ eff space by splitting up the parameter space via a binary tree approach such that the volumes of the nearby templates are similar.Here, χ eff is m 1 ×s 1,z +m 2 ×s 2,z m 1 +m 2 where m i and s i,z are the masses and z-component spins of the i th component of the binary.The volumes surrounding templates are computed using the minimal match of the templates allowed in the template bank.For details regarding the volume calculation for the template bank used in O4 by the GstLAL search pipeline see Ref. [34].
Putting all of these pieces together yields a robust estimate of P (t k , ρ k |H a ), one that accounts for the mismatch between true and recovered waveforms due to detector noise.The variation of these quantities as a function of SNR and template parameters are summarized in Appendix C. We find said variation to be consistent with our expectations regarding miss-classification among various categories at different SNRs.
We note that GstLAL's online analyses upto O3 did account for this migration between true and recovered templates accurately and classified sources into astrophysical categories either based solely on the recovered template parameters or based on simulation campaigns that were not only expensive to generate but also sparse compared to the full template bank.We demonstrate the accuracy of the source classification resulting from the use of our template weights in sec 3.2.
With the template weights accounting for miss-classification among astrophysical categories, the Bayes factors can be computed from said template weights and the probability distributions of the search pipeline's ranking statistic conditional on the signal and noise hypotheses (H 1 and H 0 ) [23].For the GstLAL search pipeline, its ranking statistic is itself the log of the likelihood ratio: e x k ∝ P (d k |H 1 )/P (d k |H 0 ) [10,26,21,43].Hence, in terms of GstLAL's ranking statistic (x k ), the Bayes factors can be written in the following way: where x(F AR 0 ) is the log-likelihood ratio corresponding to the FAR threshold used throughout the P astro calculation and p(x|H 0 ) is the normalized probability distribution of the ranking statistic conditional on the noise hypothesis.For a derivation of Eq. ( 12), see Appendix A.
With the Bayes factors calculable on the fly from trigger data and the rates precomputed using the scaling relation (8), robust and accurate P astro values can be evaluated in low-latency through Eq. ( 4) for assisting in multi-messenger follow-up observations to GW triggers in real-time.The code developed to implement this analysis is publicly available in the python package gwsci-pastro § § https://pypi.org/project/gwsci-pastro/

Illustrative results: mock data challenge
We have tested our formalism for calculating P astro through GstLAL's participation in a rigorous MDC [11,15].The MDC involved a replay of strain data from the detectors at Hanford, Livingston and Virgo spanning 40 days of O3, from 5 Jan. 2020 15:59:42 to 14 Feb.2020 15:59:42, along with simulated signals injected into the data stream.Several low-latency search pipelines (including GstLAL) were used to analyze this mock dataset with our new P astro calculation implemented as part of GstLAL's analysis.For further details regarding the MDC see refs [11,15].
To validate our formalism for calculating P astro , we first demonstrate the improvements resulting from our rate-approximation scheme in the context of real GW events replayed in the MDC.Afterward, we use the P astro values of simulated events computed using our formalism to provide a rigorous validation for the developed rate approximation and source classification schemes.

GW events replayed in the MDC
As mentioned before, the real-time P astro calculations necessitate the use of rate approximation schemes to compensate for the absence of the full list of triggers found during an observing run, which can only be available in post-processing.In that sense, our online P astro values are approximations to the ones computed self-consistently with the rates in an offline analysis carried out at the end of the observing runs for generating GW transient catalogs.Therefore, as a validity test of our rate-approximation scheme, P astro values computed using our formalism for O3 GW candidates replayed in a MDC can be compared with results from the O3 offline analysis.
For this MDC, we can thus pretend that O3 is the ongoing observing run while O1 and O2 are the previous ones.To compute rate estimates for calculating P astro values in the MDC, we therefore scale the sensitive hyper-volume of O1 and O2 obtained from offline analyses, using the projections of the O3 BNS ranges available at the beginning of O3 [1].We then use the estimates of the astrophysical rates available from offline analyses of O1 and O2 in conjunction with the scaled hypervolumes to find ⟨R a,O3 ⟩ / ⟨R 0,O3 ⟩.
The accuracy of our rate approximation can be demonstrated by comparing the aforementioned scaled rate estimates to the unapproximated ones obtained from an offline analysis of public LVK data corresponding to O3 triggers found by GstLAL [25].The comparison is summarized in table 1.It can be seen that the approximated scaling relation yields BBH rates that are remarkably accurate given how close they are to the offline results.On the other hand, the unscaled O1O2 rates are smaller by a factor of 2.
The scaled estimates of BNS and NSBH are further away from the offline O3 estimates as compared to BBH, which is to be expected due to the following reasons.The number of confident BNS and NSBH detections in O1O2 are 1 and 0 respectively [5].Similarly, the corresponding numbers in O3 are 1 and 2 [4,13].Hence, both sets of rate estimates, namely the ones scaled from O1O2 and the ones yielded by the O3 offline analysis, are subject to much larger Poisson uncertainties for BNS and NSBH than for BBH.The BNS and NSBH rate estimates for O4 (listed in table 2), which are scaled from the O1O2O3 rates and VTs, can be expected to be more accurate given the larger number of BNS and NSBH detections in O1O2O3 as compared to O1O2.Table 1: Comparison of ⟨R a,O3 ⟩ / ⟨R 0,O3 ⟩ for the three cases (unscaled, scaled and O3 offline) described above.
Furthermore, table 1 is indicative of the fact that using the unscaled rates and hypervolumes of O1 and O2 has led to the underestimation of P astro values in the O3 online analysis.To verify this claim, we compare the P astro of real GW events replayed in the MDC computed using the 3 different rate estimates listed in table 1.We choose the fiducial population model required for computing the template-weights (and hence P astro ) to be a Salpeter function [35] in primary mass and uniform in all other parameters: . Here Θ is the Heaviside step function, m min is the minimum mass of templates in the bank and S a are hyper-cubes in the space of the parameters (m 1 , m 2 , χ 1z , χ 2z ) whose edges correspond to category specific upper and lower boundaries.The boundaries in parameter space separating BNS from NSBH and NSBH from BBH are determined by the maximum NS mass and spin which are taken to be 3 M ⊙ and 0.5 respectively.Masses and spins of BHs are allowed to go up to 300 M ⊙ and 0.99 respectively.With P (t k , ρ k |H a ) pre-computed using Eq. ( 10) for the population model in Eq. ( 14), P astro values for the real GW events replayed in the MDC can be computed for validating our rate-approximation schemes.For a visual representation of how the P astro values computed using the aforementioned template weights and the 3 different rates listed on table 1 compare against each other see Fig. 1.
It can be seen that while both the scaled and unscaled rate estimates yield accurate P astro values for the highly significant candidates, the scaled rate estimates perform significantly better for the marginal events than the unscaled ones.In particular, it can be seen that the low-latency P astro of the NSBH candidate GW200210 092254 would be underestimated by a factor of 2 if not for our accurate scaling approximation.
Given this performance in the MDC, we conclude that our formalism can be expected to yield P astro values of real GW events in GstLAL's online analysis of O4 data, that approximate the results of a corresponding offline analysis with higher accuracy, as compared to what was implemented in O3.In sec.3.2, We provide a more rigorous demonstration of the necessity of our rate scaling approximations through analysis of the set of simulated sources in the MDC.
The approximate rates to be used by GstLAL to calculate online P astro values during O4 are listed in table 1.They are largely consistent with the estimates yielded by Monte Carlo simulations carried out for assessing LVK's observing capabilities in O4 [1].The rates in table 2 shall be used in conjunction with the template-weights corresponding to the Salpeter population-model in Eq. ( 14) by GstLAL in O4 for computing robust and accurate P astro values in its online analysis of O4.Table 2: The estimates of ⟨R a,O4 ⟩ / ⟨R 0,O4 ⟩ to be used by GstLAL in the O4 online analysis.The columns ⟨R a ⟩ and ⟨V T a,old ⟩ list estimates of the astrophysical rates and the sensitive hyper-volume obtained from the offline analysis of O1O2O3 (appendix C3 of ref. [40]).
We note that the distribution in Eq. ( 14) is a straw-person representative of the true CBC mass and spin population which remains uncertain to this date [40].In particular, the population-level distributions of masses and spins of both BNS and NSBH systems are subject to significant Poisson uncertainties given the small number of detections made to date.If new BNSs and NSBHs discovered in the first half of O4 lead to more precise measurements of their population properties and if the new measurements are found to deviate significantly from our straw-person population model, the template weights corresponding to some best estimate of the newly measured population can be re-computed straightforwardly leading to even more accurate source classification in the second half of O4.
However, improvements in our knowledge of the true CBC population can lead to better classification if and only if the deviation between true and recovered signal parameters due to noise fluctuations is correctly accounted for within the formalism of calculating P astro .We thus want to test for the accuracy with which our formalism can correctly account for the probability of miss-classification among astrophysical source categories given knowledge of true CBC population distribution.
To perform such a validation study, we calculate the category-specific P astro values (P a ) for the simulated events in the MDC that comprise a known fiducial population of CBCs.We summarize the results obtained from said study in the next subsection.

Simulated sources in the MDC
The category-specific P astro values of the simulated sources that are injected into the replay data as part of the MDC, can be used to validate our source classification and rate-approximation schemes.The template weights as well as the rate of injected events per comoving volume, both of which are needed for calculating P a , will depend on the population-level distributions of CBC parameters that were used to generate the mentioned injections.
The masses and spins characterizing the simulated signals in the MDC were drawn from fiducial population-level distributions (one corresponding to each source category), the functional forms of which were chosen to be truncated power-laws in masses and uniform in spins [11], as in: . Here, S a are hyper-cubes in the space of the parameters (m 1 , m 2 , χ 1z , χ 2z ) whose edges correspond to category-specific upper and lower boundaries.The redshift distribution of these injections is taken to be uniform in co-moving volume upto a category-specific maximum (z max,a ) [11].The values of the power-law indices and the maximum redshifts corresponding to each category are listed in table 3.As before, the boundaries in parameter space separating BNS from NSBH and NSBH from BBH are determined by the maximum NS mass and spin.For the population of simulated sources used in the MDC, these values were taken to be 2.05 M ⊙ and 0.4 respectively.Masses and spins of BH components were allowed to go upto 100 M ⊙ and 0.99 respectively for BBHs and upto 60 M ⊙ and 0.99 for NSBHs while no NSs lighter than 1M ⊙ were drawn. Category Table 3: The powerlaw indices characterizing the mass-distribution of simulated sources corresponding to each category.
Given these population distributions, the template weights corresponding to each of them can be computed using Eqs.(10).The variation of these template weights with SNR, masses, and spins are discussed in Appendix C.
To compute P astro for the simulated sources, we also need the rate of injections which is significantly higher than the astrophysical rates listed in table 1.The rate of detectable injections can be estimated from the total number of injections drawn and the maximum redshift corresponding to each category, as in: , where N inj,a is the total number of injections drawn from category a, S r/old the scale factor derived in Eq. ( 5) and V (z) is the co-moving volume upto redshift z.Armed with the rates in Eq. ( 16) and the template weights corresponding to the population distributions in Eq. ( 15), we can compute P astro for all the simulated sources in the MDC that are found by the search pipeline.We can then compare the cumulative sum of P astro values with the number of found injections for validating our rate-approximation and source classification schemes.
To demonstrate the improvements resulting from our rate-approximation scheme, we compute two sets of P astro values, one using rate estimates obtained from the correct values of S r/old and the other using rates that correspond to S r/old = 1.We then compare the cumulative sum of P astro values ( i a P a (d i )) corresponding to both sets of rate estimates against the total number of found injections.The results of said comparison are summarized in Fig. 2.
In the case of perfect signal recovery, the plotted curve would lie exactly atop the diagonal.Fig. 2 thus validates that our formalism yields P astro values that correspond to a near-perfect recovery of all simulated sources with very few injections lost to terrestrial.It also demonstrates how the rate-approximation part of our formalism prevents the underestimation of P astro since the P astro values generated using un-scaled rate estimates are significantly below the diagonal than the scaled ones.
To now demonstrate the accuracy with which our template weights can classify triggers into various astrophysical categories we compute the fraction of simulated  16) while the orange curve corresponds to un-scaled rate estimates with S new/old = 1 sources from each category a that are classified into category b.The value of this fraction can be obtained from the cumulative sum of category-specific P astro values of the simulated events in the following way: where i ∈ a represents the the trigger corresponding to the ith simulated event from source category a and N a is the total number of simulated events from a that are recovered by the search pipeline with F AR ≥ F AR 0 .To compute the fraction of simulated sources lost to the terrestrial category, we use For the simulated astrophysical sources and the noise triggers in the MDC, this set of fractions obtained from our P astro estimates are represented in Fig. 3.We can see in Fig. 3 that the majority of simulated sources are correctly classified with a very small fraction of miss-classifications.For the BNS injections, we find that P BNS , P NSBH and P Terrestrial accounts for 73%, 25% and 2% of them respectively with a negligible fraction lost to BBH.Similarly, for the NSBH injections, we find that 77%, 13% and 8% are classified as NSBH, BNS, and BBH respectively with only 2% lost to terrestrial.86% BBHs are correctly classified with 10% and 4% injections recovered as NSBH and terrestrial.Note that very few of the potentially EM bright sources (BNS and NSBH) are lost to BBH or terrestrial.
To summarize, we conclude that Figs 3 and 2 are implicative of the following.In a simulated universe with known populations of BBH BNS and NSBH systems that merge at some known astrophysical rate per comoving volume per time, our formalism can be used to compute P astro in real-time that enable highly accurate source classification in the presence of noisy data.Furthermore, we conclude that the rate approximation scheme developed as part of our formalism is a necessity since a significant fraction of simulated sources has been shown to be misclassified as terrestrial without it.

Conclusion and future prospects
We have developed a self-consistent analysis framework capable of accurately classifying GW candidates in real-time, for the purpose of assisting multimessenger follow-up observations during O4.We have implemented our formalism through the GstLAL search pipeline while noting that our rate approximation scheme is applicable to other pipelines as well.We have demonstrated the accuracy of our classification scheme and the improvements resulting from our rate approximation scheme, both through the analysis of real and simulated GW data as part of an MDC.With the robust and accurate P astro values yielded by our framework, the GstLAL search pipeline can be expected to better assist in multimessenger follow-up campaigns to the GW alerts sent out during O4, as compared to O3.
While our formalism is an improvement over existing frameworks for calculating P astro , further developments leading to even better source classification can potentially be implemented in the near future.For example, even though our template weights facilitate the correct classification of most GW candidates, they are still based on point estimates of CBC parameters computed by the search pipelines and hence susceptible to potential biases.On the other hand, Bayesian parameter estimation (PE) has been shown to yield estimates of these quantities that are significantly more accurate than the point estimates computed by search pipelines.Given the recent developments in the field of low-latency PE in real-time [28,31], it is possible to replace our templateweights-based classification scheme with one that makes use of more accurate mass and spin measurements yielded by said PE.The related developments and the associated validation studies are part of an ongoing exploration.
To summarize, we have implemented several improvements in the low-latency computation of category-specific P astro values by the GstLAL search pipeline for assisting in the search for counterparts to GW candidates in O4.Given the performance of our analysis framework in an MDC comprising real and simulated GW data, we conclude that these improvements will lead to the correct identification and classification of several GW candidates in O4 that would have otherwise been marked as terrestrial or missclassified into an incorrect astrophysical category.While our classification scheme is susceptible to uncertainties in the population model, these uncertainties are expected to decrease as more and more events are found, leading to further improvements in classification in the later parts of O4 and beyond.

Acknowledgement
The authors are grateful for computational resources provided by the LIGO Laboratory and supported by National Science Foundation Grants PHY-0757058 and PHY-0823459.This material is based upon work supported by NSF's LIGO Laboratory which is a major facility fully funded by the National Science Foundation.LIGO  In Sec.2.3, we mentioned that among the various quantities assigned to a GW candidate by a search pipeline, the trigger data relevant to Bayes factor calculation can be thought to comprise only three quantities, namely the ranking-statistic, the representative template, and the SNR.For the GstLAL search pipeline, Eq. ( 12) suggests that it is indeed the case.However, in order to derive Eq. ( 12) itself, the details of how the ranking statistic is computed from additional pipeline-assigned quantities will become relevant.
For GstLAL, the full dataset corresponding to a particular trigger comprises additional quantities to t k and ρ k such as the signal-based veto parameter ⃗ ξ 2 at each detector.The other independent quantities that comprise the full dataset are measurements of arrival time and coalescence phase ( ⃗ ∆t k , ⃗ ∆ϕ k ) by different detectors expressed relative to the corresponding measurements by some reference detector (t ref,k , ϕ ref,k ).The log-likelihood ratio used by GstLAL to rank triggers can be expressed in terms of the probability distributions of these quantities [21,43] in the following way: where ⃗ O is the subset of detectors that found the trigger with a signal-to-noise ratio (SNR) above threshold.While we do describe in Sec.2.3 the calculation of P (t k , ρ k |H 1 ), details regarding the other individual probabilities in Eq. (A.1) can be found in ref. [21].On the other hand, in terms of the same trigger-data used for calculating x k , the Bayes factors required for P astro can be written as: . We note that template t k is the only parameter that determines which astrophysical source category the signal might have originated from in the sense that the only difference between astrophysical source categories is the population-level distributions of the template parameters corresponding to each category.Hence the probabilities of the remaining quantities ( ⃗ ξ .Matched filtering involves computing the noise averaged inner product of this timeseries data with the time-series corresponding to a template waveform.If the true signal is a perfect match to the jth template then the matched filter corresponding to the kth template can be expressed in the following way: ,where In this appendix, we summarize how the template weights vary as a function of the template parameters and SNR.Given the form of the migration probability in Eq. ( 11), we expect there to be significant confusion among source categories that map onto neighboring regions of the template-bank at low SNR.As the SNR increases, we expect the probability of miss-classification to decrease with perfect classification at infinite SNR.
For the Salpeter function mass-model in Eq. ( 14) that shall be used by GstLAL to compute P astro values in low-latency for O4, the template weights as a function of template masses are displayed in Fig. C1.We find that the variation with SNR is exactly what we expect them to be.
Similarly, for the population model corresponding to the MDC injections (summarized in Eq. ( 15)), the template weights as a function of component masses is displayed in Fig. C2 We also over plot the recovered parameters of the injections from category a found by the pipeline above threshold with SNR ρ, atop the plots of the template weights P (t k , ρ|H a ).
It can be seen that the density of blue points at various regions matches with the heat map representing log P (t k , ρ|H a ).For example, in some SNR ranges wherein BNS injections are recovered in the NSBH region, P (t k , ρ|H BNS ) in that region corresponding to the SNR in question is also found to have non-zero value.This is implicative of the fact our template weights correctly model the mismatch between the recovered template and the true signal due to noise fluctuations, provided the population distribution corresponding to each category is exactly known.

Figure 1 :
Figure1: Comparison of P astro values computed using scaled, unscaled and offline rate estimates.The x-axis marks the actual dates on which the replayed GW events were discovered in O3.

Figure 2 :
Figure 2: Recovery of simulated sources.The y-axis represents the cumulative sum of P astro values of all the triggers found above threshold by the search pipeline.The blue line corresponds to the scaled rate estimates in Eq. (16) while the orange curve corresponds to un-scaled rate estimates with S new/old = 1

Figure 3 :
Figure 3: Classification of simulated sources.In this Sankey diagram, the size of each link is representative of f (b|a) with a on the left and b on the right.
was constructed by the California Institute of Technology and Massachusetts Institute of Technology with funding from the National Science Foundation (NSF) and operates under cooperative agreement PHY-1764464.The authors gratefully acknowledge the Italian Istituto Nazionale di Fisica Nucleare (INFN), the French Centre National de la Recherche Scientifique (CNRS) and the Netherlands Organization for Scientific Research, for the construction and operation of the Virgo detector and the creation and support of the EGO consortium.The authors are grateful for computational resources provided by the Pennsylvania State University's Institute for Computational and Data Sciences (ICDS) and the University of Wisconsin Milwaukee Nemo and support by NSF PHY-2207728, NSF PHY-2011865, NSF OAC-2103662, NSF PHY-1626190, and NSF PHY-1700765.This paper carries LIGO Document Number LIGO-P2300141.Appendix A. Derivation of Bayes Factors from GstLAL's ranking statistics

Figure C2 :
Figure C2: Variation of the weights corresponding to the population distribution of injections, as a function of component masses and SNR.The colorbars represent log P (t k , ρ k |H a ) with a and ρ specified at the top of each panel.The blue dots in each panel correspond to recovered masses of injections whose true source category coincides with the one specified on top of the panel.The SNRs of the injections in each panel lie within a narrow range about the SNR specified on top of the pannel 2 k , ⃗ ∆t k , ⃗ ∆ϕ k , t ref,k , ϕ ref,k , ⃗ O k )conditional on H a can be expected to equal the probabilities of the same quantities conditioned instead on H 1 .Under this consideration, we can use Eq.(A.1) to rewrite Eq. (A.2) in the following way:K a (d k ) = Ae x k P (t k , ρ k |H a ) P (ρ k , t k |H 1 ) (A.3)Since the category-specific signal hypotheses (H a ) correspond to disjoint regions of the template bank (H a ∩ H b = ∅ f or a ̸ = b) that together span the entire bank (∪ a H a = H 1 ), we can re-write the signal model in the denominator to be P (ρ k , t k |H 1 ) ≈ a P (t k , ρ k |H a )P (H a |H 1 ), where P (H a |H 1 ) ≈13 for a ∈ {BBH, BNS, NSBH}.Under this consideration, Eq. (A.3) can be expressed as follows: is the matched filter SNR corresponding to template t k , n k = t k • n and ρ j is the SNR of the true signal s.As mentioned in Sec.2.3, we are interested in quantifying how noise fluctuations might cause a true signal t j to be highest SNR match with template t k .Hence, as a first approximation, we can set ρ j = ρ k .Under this approximation and the assumption that the noise is stationary and Gaussian, we can compute the probability of template migration:P (t k , ρ k |t j , ρ j ) ∝ Appendix C.Template weights P (t k , ρ k |H a ): variation with template parameters and SNR