Things that might go bump in the night: Assessing structure in the binary black hole mass spectrum

Several features in the mass spectrum of merging binary black holes (BBHs) have been identified using data from the Third Gravitational Wave Transient Catalog (GWTC-3). These features are of particular interest as they may encode the uncertain mechanism of BBH formation. We assess if the features are statistically significant or the result of Poisson noise due to the finite number of observed events. We simulate catalogs of BBHs whose underlying distribution does not have the features of interest, apply the analysis previously performed on GWTC-3, and determine how often such features are spuriously found. We find that one of the features found in GWTC-3, the peak at $\sim35\,M_{\odot}$, cannot be explained by Poisson noise alone: peaks as significant occur in $1.7\%$ of catalogs generated from a featureless population. This peak is therefore likely to be of astrophysical origin. The data is suggestive of an additional significant peak at $\sim10\,M_{\odot}$, though the exact location of this feature is not resolvable with current observations. Additional structure beyond a power law, such as the purported dip at $\sim14\,M_{\odot}$, can be explained by Poisson noise. We also provide a publicly-available package, \texttt{GWMockCat}, that creates simulated catalogs of BBH events with correlated measurement uncertainty and selection effects according to user-specified underlying distributions and detector sensitivities.


INTRODUCTION
Gravitational waves (GWs) from more than 70 mergers of compact objects have now been detected in the data of the LIGO (Aasi et al. 2015) and Virgo (Acernese et al. 2014) detectors.A cumulative catalog of these events and their properties has been produced by the LIGO-Virgo-KAGRA (LVK) collaborations.This collection of all detections to date is called the "Third Gravitational-Wave Transient Catalog" (Abbott et al. 2021, GWTC-3), and has enabled several insights into the nature of gravity (Abbott et al. 2021) , the local expansion of the universe (Abbott et al. 2021), and the population of GW sources (Abbott et al. 2021a).
afarah@uchicago.edu bedelman@uoregon.edu The underlying population of GW sources holds information about the astrophysical processes that give rise to merging binaries of compact objects.The mass spectrum of binary black holes (BBHs), for example, encodes information about numerous physical processes underlying massive-star evolution, supernova physics, compact object formation, and binary interactions.For example, the presence or dearth of black holes with masses between ∼ 2-5 M ⊙ ( Özel et al. 2010;Farr et al. 2011;Fishbach et al. 2020;Farah et al. 2022a) may unveil the maximum neutron star mass, the stability of mass transfer, and the timescales relevant for the engines that drive supernova explosions (e.g., Fryer et al. 2012;Zevin et al. 2020;Mandel & Müller 2020;Li et al. 2021;van Son et al. 2022a;Patton et al. 2022;Siegel et al. 2022).On the high mass end, a sharp decrease in the mass spectrum for black holes with masses ≳ 50 M ⊙ (Fishbach & Holz 2017;Edelman et al. 2021) would be a strong indication that the pair instability process is at play and limiting the core mass of massive stars (Fowler & Hoyle 1964;Barkat et al. 1967;Heger & Woosley 2002;Heger et al. 2003;Woosley & Heger 2015;Belczynski et al. 2016;Woosley 2017Woosley , 2019;;Marchant et al. 2019;Renzo et al. 2020), with the location of the decrease in the differential merger rate acting to constrain relevant nuclear reaction rates (Farmer et al. 2020).Other overdensities and underdensities in the observed mass distribution (Edelman et al. 2022;Tiwari & Fairhurst 2021;Tiwari 2022;Edelman et al. 2022), as well as the evolution of the mass distribution with redshift (Fishbach et al. 2021;van Son et al. 2022b;Karathanasis et al. 2022a;van Son et al. 2022a), will further inform the dominant BBH formation channels, binary evolution physics, and the metallicity evolution of the universe.
All of the parameters that are measurable from the signal of a binary merger can provide insight into formation mechanisms of merging binaries, especially when used in a population analysis (Stevenson et al. 2015;Zevin et al. 2017).However, the masses of the objects in the merging system are the best measured and span the largest dynamic range.Additionally, the mass distribution of compact objects can be used to measure cosmological parameters using the "spectral siren" method, provided there is structure in the distribution beyond a boundless power law (Chernoff & Finn 1993;Messenger & Read 2012;Taylor et al. 2012;Farr et al. 2019;Ezquiaga & Holz 2021;Ezquiaga & Holz 2022;Abbott et al. 2021), such as edges, gaps, peaks, or changes in the power law slope.Multiple features must be present to disentangle redshift evolution of the mass spectrum from cosmology, and more features further aid in breaking this degeneracy (Ezquiaga & Holz 2022).Therefore, considerable effort in the field of GW astronomy has gone towards understanding the mass distribution of GW sources.There are currently many more detected BBH mergers than binary neutron star (BNS) or neutron star-black hole (NSBH) mergers, so much of the activity has been on population properties of the BBH distribution, though the mass distribution of BNSs and NSBHs has also been been considered (Fishbach et al. 2020;Landry & Read 2021;Farah et al. 2022a;Ye & Fishbach 2022;Biscoveanu et al. 2022b).
The mass distribution of merging BBHs is typically parameterized by the primary mass m 1 , the larger of the two component masses in the binary, and the mass ratio q = m 2 /m 1 , the ratio of the less massive object's mass to the primary mass, though other parameterizations are possible and valid (e.g., Farah et al. 2022a;Fishbach & Holz 2020a;Tiwari & Fairhurst 2021).The community has thus far gained a robust understanding of the largescale features of the BBH mass distribution, and is just Figure 1.Distribution of primary BBH masses inferred using GWTC-3 and three different population models.The smoothed power law model (grey) consists of a single power law slope between a minimum and maximum mass, with the merger rate set to exactly zero outside of those bounds.It also includes a smoothing parameter at the low-mass end that allows for an offset between the minimum BH mass and the global maximum of the distribution.The Power Law + Peak model is similar to the smoothed power law, but also includes a Gaussian component.The Power Law + Spline model adds a cubic spline modulation to a smoothed power law to allow for additional substructure.We seek to determine if the perturbations beyond a power law found by Power Law + Spline and other semi-parametric models can be explained by random associations in the data due to a finite number of observations, or if they are features of the true underlying distribution.
beginning to resolve its finer details.After the release of the First Gravitational-Wave Transient Catalog (Abbott et al. 2019, GWTC-1), minimum and maximum masses at ∼ 5 M ⊙ and ∼ 40 M ⊙ were identified in the BBH primary mass distribution, but it was not yet possible to distinguish between a uniform distribution and a power law between those two bounds (Fishbach & Holz 2017;Talbot & Thrane 2018;Abbott et al. 2019).The Second Gravitational-Wave Transient Catalog (Abbott et al. 2021b, GWTC-2) brought dozens of additional events, and the BBH mass distribution was found to have a global maximum at ∼ 8 M ⊙ and an excess of BHs between ∼ 30 M ⊙ -40 M ⊙ followed by a steep, although not infinitely sharp, drop off in the rate at higher masses extending to ∼ 80 M ⊙ (instead of sharp cutoff at ∼ 40 M ⊙ ).At the time, there were not enough observations to determine whether the mass distribution had a local maximum at ∼ 35 M ⊙ , represented by a Gaussian peak on top of a power law, or whether the steepening towards higher masses was better described as a break in the power law (Abbott et al. 2021).
At the end of the third LIGO-Virgo observing run, the same two features at ∼ 8 M ⊙ and ∼ 35 M ⊙ remained, and the feature at 35 M ⊙ was classified as a peak rather than a break in the power law (Abbott et al. 2021a). Additionally, non-parametric (Mandel et al. 2017;Rinaldi & Del Pozzo 2022;Sadiq et al. 2022;Payne & Thrane 2022;Edelman et al. 2022) and semi-parametric (Edelman et al. 2022) analyses found robust evidence for an additional peak at ∼ 10 M ⊙ , the same peak at ∼ 35 M ⊙ , as well as modest evidence for a paucity of events near ∼ 14 M ⊙ (Abbott et al. 2021a).These features in the primary mass distribution correspond to similar ones in the chirp mass distribution, occurring at ∼ 9 M ⊙ , ∼ 11 M ⊙ , and ∼ 26 M ⊙ , respectively (Tiwari & Fairhurst 2021;Tiwari 2022).The current picture of the BBH mass distribution is therefore a decreasing power law from low to high masses, with a global maximum at m 1 ∼ 10 M ⊙ , a potential underdensity at m 1 ∼ 14 M ⊙ , and an overdensity at m 1 ∼ 35 M ⊙ .This can be seen in Figure 1, where we plot the results of fitting two parameteric models and one semi-parametric model to the BBHs in GWTC-3.
While the existence of this substructure in the current data set appears robust, its interpretation is less clear.Plausible explanations for this substructure include (1) Poisson noise, (2) modeling systematics, or (3) astrophysical signatures from one or several formation channels.We aim to disentangle the first two possibilities from the third using the Power Law + Spline model (Edelman et al. 2022), one of the semi-parametric models used to identify the substructure reported in Abbott et al. (2021a).
Poisson noise would be caused by the fact that the fiducial BBH analysis in Abbott et al. (2021a) includes only 69 events over a mass range that spans more than an order of magnitude, so observations may appear to be clumped at some masses even if the underlying distribution is smooth.We first determine if this explanation accounts for the data by simulating catalogs of BBHs whose underlying distribution does not have the features of interest, applying the analysis previously performed on GWTC-3, and determining how often such features are spuriously found.We develop several metrics comparing observations to simulated data in order to assess the statistical significance of the "bumps" in the primary mass distribution found by Abbott et al. (2021a); Edelman et al. (2022).All of the metrics derived in this work answer the same general question: how often do we infer the existence of a feature when analyzing observations of a true population without that feature?In this sense, these metrics are analogous to frequentist p-values, as lower values correspond to more significant features in the data.Readers familiar with gravitational wave data analysis might find it useful to think of these metrics as false alarm rates because they quantify how often noise resembles the observed signal.
A similar frequentist analysis on a large number of mock catalogs was performed by Sadiq et al. (2022) on the peak at ∼ 35 M ⊙ using an adaptive kernel density estimator (aKDE) to find features in samples drawn from featureless mass models, as well as from a model with a single peak.They account for selection effects, but not measurement uncertainty.They find that an aKDE is able to identify peaks in the data, and that the peak at ∼ 35 M ⊙ found in GWTC-2 is statistically significant within the aKDE model.
The second effect mentioned above, model systematics, could also plausibly cause spurious inference of features beyond a power law.It is potentially concerning that the models considered in Abbott et al. (2021a) that find peaks and troughs in the mass distribution are inherently "bumpy": both Power Law + Peak (Talbot & Thrane 2018) and Multi source employ a smoothed power law with a Gaussian component (Wysocki & O'Shaughnessy 2021), Flexible Mixtures is a linear combination of Gaussian components, and Power Law + Spline employs a smoothed power law under a cubic spline modulation.The question is then whether these "bumpy" models can recover sharp features or if they instead create peaks and troughs that are morphologically dissimilar to the true distribution.This is most easily addressed by cross-checking with independent models such as Broken Power Law (Abbott et al. 2021;Abbott et al. 2021a) and the autoregressive model presented in Callister & Farr (2023).
Inaccuracies in the selection function are also known to cause systematic biases when inferring the underlying population (e.g.Malmquist 1922Malmquist , 1925)).These biases could, in principle, also cause an incorrect inference of structure in the astrophysical distribution of BBH masses.However, selection effects in GW detectors are remarkably well-characterized, so we expect this effect to be subdominant to Poisson uncertainty.As the number of events grows, so will our accuracy in the estimation of the selection function (Farr 2019;Essick & Farr 2022).
We provide posterior samples from our simulated catalogs in an accompanying data release (Farah et al. 2022b), and also provide a publicly-available python package, GWMockCat (Farah et al. 2022c), to create similar samples according to user-defined populations. 1ection 2 provides a demonstrative example: it foregoes a full fit to the astrophysical population of sources, and compares the observed distribution of masses to possible observed distributions given an underlying power law in primary mass, (incorrectly) assuming no measurement uncertainty.This analysis suggests that the observed peak at ∼ 35 M ⊙ is statistically significant, but that all other features beyond a simple power law might be explainable by Poisson noise.This motivates a thorough study using a full hierarchical Bayesian analysis on simulated event posteriors, which we carry out in Section 3. Section 4 summarizes our conclusions and discusses their implications for the astrophysical origin of the gravitational waves observed thus far by the LVK.Readers primarily interested in the significance of features in the mass distribution may wish to skip to Section 3.3, whereas those interested in using the package GWMockCat can find details in Appendices A and B.

MOTIVATION
To construct a simple test of feature significance and motivate further study, we first avoid a fit to the mass distribution and instead consider the observed distribution of primary masses and its resemblance to one that would result from a simple power law.The observed population differs significantly from the astrophysical one, as current gravitational wave detectors are subject to selection biases that favor the detection of closer and more massive systems, as well as measurement error that affects each system differently.We construct plausible observed mass distributions that could occur from detecting 69 BBHs whose astrophysical distribution is a featureless power law in primary mass.To do this, we use the samples provided by LIGO Scientific Collaboration et al. (2021a), which were created for sensitivity estimation for the LVK's GWTC-3 analysis.Each of these samples comes with a probability of being drawn from an assumed underlying distribution and a false alarm rate (FAR) assigned by each search used by the LVK.We can then re-weight these samples to our desired population model (in this case, a power law in m 1 , q, and z) using the draw probability, and apply the same FAR threshold used in Abbott et al. (2021a) to select "found injections."Of the ∼ 6 × 10 4 found injections, we resample to N = 10 4 independent sets of 69 draws each to directly compare to observations.We then histogram each set of these found injections, thereby obtaining a distribution of bin heights for our mock populations.Using several thousand realizations of found injection sets enables us to construct a null distribution of bin heights and characterize the effect of Poisson noise on the shape of the observed distribution.We compare these null distributions with the observed Observed source-frame primary mass distributions.Black solid lines contain the median a posteriori values for the binary black holes in GWTC-3.Pink and blue bands indicate the 90% credible interval on the observed distributions predicted from astrophysical distributions that are power laws in primary mass with spectral index α = 3.25 and α = 2.7, respectively.The top panel shows a histogram of observed primary masses.For GWTC-3's distribution to be consistent with the null distributions, we expect its bin heights in the top panel to be within the 90% credible intervals in 18 out of the 20 bins.The uncertainties in these predicted distributions are due only to Poisson noise resulting from a finite number of observations, rather than modeling uncertainty or uncertainty in parameter estimation.Therefore, the cumulative distribution functions in the bottom panel are similar to a conventional posterior predictive check, but with only one source of uncertainty.The large deviations of the black curve from the shaded bands in some regions indicate the difficulty that a single power law with Poisson shot noise has in fully explaining the observations.However, many of the apparent excursions from a power law are well-contained within the predicted bands.
distribution of BBH masses in GWTC-3 2 by assuming the primary masses are measured perfectly and using the median a posteriori values of their primary masses as point estimates.The result is shown in the top panel of Figure 2, which plots the 90% credible interval on the observed null distributions, along with the distribution of median primary masses of GWTC-3's BBHs.For the null distributions, we consider two power law spectral indices as representative examples: α = 2.7 and α = 3.25.These are chosen to represent a range of plausible values for the BBHs in GWTC-3: a power law fit to GWTC-3 yields α = 2.98 +0.16 −0.28 , where the bounds represent 1-σ deviations.
To obtain a more quantitative measure, we compare bin heights from the found injections, h inj , to the bin heights of observed events in GWTC-3, h GWTC-3 , obtaining for each bin i the fraction of simulated bin heights that are lower than those of GWTC-3 BBHs.Explicitly, where the sum is over the N = 10 4 sets of found injections, and r h is defined for each bin.For example, if r h = 0.95 for a given bin, the observed distribution in that bin is larger than would be expected from a featureless power law 95% of the time.A value of r h approaching unity corresponds to a "bump" in the observed mass distribution, and a value of r h approaching zero is indicative of a "dip."Note that the comparison between the null distributions and GWTC-3 are occurring at each bin, rather than across all bins.We do this because the magnitude of Poisson noise depends on the value of m 1 : since the underlying distribution is not uniform, fewer events are expected at very high m 1 and therefore the relative standard deviation is larger.This is also a consequence of Eddington bias (Eddington 1913).Making comparisons at specific points in m 1 does not, however, properly correct for the look-elsewhere effect.We will address this effect in Section 3. We use samples generated with the IMRPhenomXPHM waveform and a prior proportional to the square of the luminosity distances (i.e. the samples were not "cosmologically reweighted").
To make the most direct comparison with Abbott et al. (2021a), we keep events with secondary mass larger than 3 M ⊙ and FAR less than 1 yr −1 , resulting in 69 events.
The three most significant values of r i h in the case of α = 3.25 are r 15.6 M⊙ h = 0.033, r 27.9 M⊙ h = 0.036, r 36.1 M⊙ h > 0.999, where the superscripts indicate the centers of the bins at which r was calculated.This means that less than 0.1% of mock populations had more events near 36.1 M ⊙ than GWTC-3 does, 3.3% of mock populations had fewer events near 15.6 M ⊙ than GWTC-3, and at 27.9 M ⊙ , 3.6% of mock populations had fewer events.
Repeating the exercise for α = 2.7, we find the three most significant values of r i h to be r 40.2 M⊙ h = 0.935, r 27.9 M⊙ h = 0.020, r 36.1 M⊙ h > 0.999.The locations of the significant features differ when the assumed underlying distribution changes3 .In either case, the bump at ∼ 35 M ⊙ is unlikely to be due to Poisson noise, but other features may be.
To avoid the need to arbitrarily choose bins, we additionally construct a cumulative distribution function (CDF) of the primary masses and compare it to the CDFs of the null distributions, shown in the bottom panel of Figure 2.This comparison is akin to a posterior predictive check in that it can highlight where the model fails to predict the data.Importantly, though, it differs from the conventional posterior predictive check because we have purposefully left out the effects of modeling uncertainty and measurement uncertainty in order to isolate the effects of Poisson noise.The prior distributions are therefore also not included, since each event is assumed to be measured with perfect accuracy.
If α = 3.25, the null distributions are consistent with the data below ∼ 18 M ⊙ and above ∼ 35 M ⊙ , but not between them, meaning that the ∼ 10 M ⊙ and ∼ 35 M ⊙ peaks can be explained by Poisson noise, but the underdensity between them could not be.On the other hand, if α = 2.7, the null distributions are consistent with the data everywhere except for above ∼ 40 M ⊙ , suggesting that under this scenario, Poisson noise can explain all features except for the ∼ 35 M ⊙ peak.
For both spectral indices considered, two of the three features found by Abbott et al. (2021a) can be explained by Poisson noise from a finite number of observations.However, this does not mean that exactly two of the features are the result of Poisson noise, just that no more than two can be caused by the phenomenon.Additionally, it is not clear which of the features are more likely to have physical origin, as this method offers no quantitative way to determine which power law slope is preferred.
Importantly, this methodology does not account for the effects of measurement error, which can cause significant biases near the edges of sharp distributions when not properly accounted for (Fishbach et al. 2020).We therefore turn to a full hierarchical Bayesian analysis of simulated catalogs, which will allow us to fit for population model parameters, take measurement uncertainty into account, and directly compare to metrics used in Abbott et al. (2021a).

HIERARCHICAL ANALYSIS AND RESULTS
We determine how often the features inferred in the mass distribution of BBHs would be spuriously found in data whose underlying distribution does not have those features.To do this, we construct a null distribution by simulating BBH observations that would occur if the underlying astrophysical distribution was a single power law with no substructure in a finite range.The procedure for creating synthetic BBH observations is described in Appendix A. Mock observations are combined with corresponding sensitivity estimates in a hierarchical Bayesian analysis, described in Loredo (2009); Mandel et al. (2019); Thrane & Talbot (2019).We analyze these simulations in the same way as the BBHs in GWTC-3 to determine how often the features observed in GWTC-3 would be found from an underlying distribution without those features.

Power Law + Spline Mass Model
We use the Power Law + Spline semi-parametric primary mass model as a flexible model that is easily capable of finding peaks and valleys in the mass distribution (Edelman et al. 2022;Abbott et al. 2021a).This model parameterizes perturbations or deviations from a simpler underlying distribution with flexible cubic spline functions.Specifically, given an underlying hyper-prior for primary mass, p(m 1 |Λ), the Power Law + Spline model describes the primary mass distribution as: where f (m 1 |{m i }, {f i }) is the function describing the perturbations, which we model with a cubic spline function interpolated by introduced hyper-parameters, {m i }, the locations of spline knots in mass space, and {f i }, the height of the perturbation function at each knot.This describes a semi-parametric model as it includes a simple "parametric" component (the underlying distribution) in addition to a non-parametric component that models the perturbation around the simple description.For this study we use the simplest primary mass model for the underlying description, which is the Truncated model, describing a power law with sharp cutoffs at the lower and upper mass bounds (Fishbach & Holz 2017;Edelman et al. 2022).While this model has been shown to insufficiently describe the primary mass distribution, it captures the majority of the broadest features (Abbott et al. 2021;Abbott et al. 2021a).
To assess the significance of peaks or valleys found with the Power Law + Spline model one can look at the posterior distribution of the perturbation heights as a function of mass.This tells us how far "off" the simple power law description is from accounting for the data.Specifically we can find what percentile f (m 1 ) = 0 falls in the posterior distribution as a function of mass.For data exactly distributed as a power law (the underlying population), the inferred perturbation function should be symmetric about 0 with widths determined by the prior distributions on the knot heights and the number of observed events.At masses where the percentile of zero perturbation approaches 100% (0%) we can say there is an over (under) density of events at these masses, compared to the underlying power law distribution.This is identical to the analysis done by Abbott et al. (2021a), who use the percentile at which the perturbation function excludes zero at a given location as a metric for how significant a feature is at that location.

Metrics of Feature Significance
As described in Section 3.1, the Power Law + Spline model makes use of a perturbation function constructed from cubic splines.The height of the perturbation function, f (m 1 ), at a point in primary mass, m 1 , is then a direct measure of the deviation from a power law at that point.We can determine how often one would find spurious evidence for substructure by simulating catalogs from a power law, fitting them with the Power Law + Spline model, and examining the resulting perturbation function.
If the mock catalogs produce perturbation functions with similar amplitudes to those seen for GWTC-3, the structure in the GWTC-3 fit might be described by Poisson noise.On the other hand, if the perturbation functions produced by fits to the mock catalogs are always lower in amplitude to that of the GWTC-3 fit, the structure in the GWTC-3 data is likely to be present in the underlying distribution.
For a given mock catalog, we find the m 1 value where the median a posteriori value of the perturbation function is maximal.We obtain the posterior distribution of perturbation function amplitudes at that location, g(f max ).We repeat this for all mock catalogs, obtaining a set of maximal perturbation function distributions, {g j (f max )}.These are plotted in light grey on the left panels of Figure 4.The locations of the three maximal perturbation function amplitudes in the GWTC-3 fit are, from least to most Bayesian significance, 13.8 M ⊙ , 10.3 M ⊙ , and 35.7 M ⊙ .The posterior distributions of perturbation function heights at these locations are g GWTC-3 (f (13.8 M ⊙ )), g GWTC-3 (f (10.3 M ⊙ )), and g GWTC-3 (f (35.7 M ⊙ )), and are plotted in orange in the left panels of Figure 4.The amplitude of the perturbation function at 13.5 M ⊙ is negative (i.e. it is a dip rather than a bump), so we flip its distribution about zero for more direct comparison.The same is done for all g(f max ) whose medians are negative, as the perturbation function's prior is symmetric about zero.

Simulation Study
To determine whether the features in the mass spectrum of GWTC-3 BBHs are the result of Poisson noise of a finite number of observations drawn from a featureless power law, we compare Power Law + Spline fits using the GWTC-3 catalog and 300 mock catalogs generated from a "featureless" power law.The mock catalogs considered in this section are all generated from the same underlying distribution: a truncated power law in primary mass, mass ratio, and redshift, with a smoothing at low component masses to ensure the peak of the mass distribution is not in the same location as the minimum mass.The explicit form of the mock catalogs' population model, including values of all of its hyperparameters, can be found in Appendix B.
Full parameter estimation was not performed on each event in each mock catalog; instead, we use prescriptions for generating event posteriors that reproduce the correlations between an event's parameters, as well as the typical uncertainties seen in GWTC-3.We show in Appendix C that the prescriptions used are sufficient to reproduce population analyses such as the ones scrutinized in this work.In order for our comparisons to be consistent between featureless mock catalogs and GWTC-3, we recreate GWTC-3's event posteriors with the same prescriptions as were used for the mock catalogs, perform a population analysis on those, and use the resulting perturbation function for all comparisons to mock catalogs.Our population re-analysis of the GWTC-3 events with mock posteriors appears consistent with the full analysis presented by the LVK in Abbott et al. 2021a.Lines labeled "GWTC-3" or depicted in orange in Figures 3, 4, and 5 refer to the analysis done on the recreated version of GWTC-3.All analyses presented here were repeated using the original LVK-released parameter estimation samples in Appendix C, and the same qualitative conclusions were reached, though with slightly more statistical significance.
Despite knowing the parameters of the underlying population for the mock catalogs, we allow all hyperparameters to vary when fitting Power Law + Spline to the mock catalogs.The resulting perturbation functions are shown in Figure 3 for 10 randomly chosen mock catalogs and GWTC-3.The perturbation functions deviate from their prior distribution in the mass range where detections exist (above ∼ 5 M ⊙ and below ∼ 85 M ⊙ ), even in the case of mock catalogs.This means that the perturbation functions are informed by the mock data despite the mock data not inherently requiring a deviation from a power law.The question still remains whether the perturbation function heights inferred from mock catalogs with no substructure are larger than those inferred from GWTC-3.While nonzero values of the perturbation function are common in the 10 mock catalog fits shown in Figure 3, only a few amplitudes appear comparable in height to the three largest amplitudes of the GWTC-3 perturbation function.
To verify this, we isolate the largest amplitude perturbations for all 300 mock catalog fits and compare them to the three largest amplitude perturbations for the GWTC-3 fit.These are plotted in the leftmost panels of Figure 4.The light grey curves are the posterior distributions of largest perturbation function amplitudes {g j (f max )} for each simulated catalog j.These appear to have the same general shape as one another, though with noticeable scatter.The orange curves in each panel are the posterior distributions of GWTC-3's perturbation function g GWTC-3 (f (m 1 )) at its three maximal locations: m 1 = 13.8M ⊙ , 10.3 M ⊙ , and 35.7 M ⊙ .
The distribution for the ∼ 14 M ⊙ dip appears qualitatively similar to that of the simulated catalogs, the ∼ 10 M ⊙ peak appears to be slightly shifted with respect to most of the simulated catalogs but still within their range, and the ∼ 35 M ⊙ peak is noticeably shifted towards higher values relative to the bulk of the simulated catalog distributions.This suggests that the ∼ 35 M ⊙ peak is unlikely to be the result of Poisson noise or modeling systematics, while other features could plausibly be explained by those effects.

Maximum Perturbation Amplitude
To obtain a more quantitative measure, we derive several metrics from the distributions of maximal pertur- Law + Spline fit to the primary masses in GWTC-3 (orange) and in 10 mock catalogs (grey).The perturbation function multiplies a smoothed power law in primary mass to add modulations to an otherwise monotonic distribution, making it a direct measure of deviations from a power law.It is a cubic spline with knots fixed at the locations indicated by the black vertical tick marks.The prior on the perturbation heights is the unit normal distribution, as can be seen below ∼ 5 M⊙ where there are no detections to constrain the likelihood and the posterior reverts to the prior.The perturbation function corresponding to GWTC-3 events appears large in amplitude in three locations: ∼ 10 M⊙, ∼ 14 M⊙, and ∼ 35 M⊙.While the medians of the perturbation function at these distributions are comparable in amplitude, the posterior distribution at ∼ 35 M⊙ (∼ 14 M⊙) is the most (least) tightly constrained.
bation function amplitudes.The first uses the Kolmogorov-Smirnov (KS) test: we compute the KS divergence D between each of the {g j (f max )} distributions to obtain a null distribution of KS divergences, shown in the solid black curve in the middle column of Figure 4. We then perform a KS test between the {g j (f max )} distributions and g GWTC-3 (f (m 1 )) and obtain the orange curves in the middle column of Figure 4. From this, we find that the KS divergences for mock catalogs are larger than those of GWTC-3 20%, 11%, and 7% of the time for the 14 M ⊙ , 10 M ⊙ , and 35 M ⊙ features, respectively.This means, for ex-ample, that mock catalogs can produce perturbation function posteriors as tall as the one inferred from GWTC-3 at ∼ 35 M ⊙ only 7% of the time.Written in terms of g(f ), g GWTC-3 (f (14 M ⊙ )) ̸ = g j (f max ) to 20%, g GWTC-3 (f (10 M ⊙ )) ̸ = g j (f max ) to 11%, and g GWTC-3 (f (35 M ⊙ )) ̸ = g j (f max ) to 7%.Though none of these percentages are convincingly small, this indicates that the orange histograms are more statistically distinct from the black histograms in the case of the ∼ 35 M ⊙ peak than they are in the cases of the features at 10 M ⊙ and 14 M ⊙ .The second metric is obtained by quantifying the shift of g GWTC-3 (f (m 1 )) relative to the set of {g j (f max )}.For each point in g GWTC-3 (f (m 1 )), we calculate the percentile in which it lies in each of the {g j (f max )}, obtaining the orange bands in the rightmost panels of Figure 4.For comparison, we do the same for each of the {g j (f max )} relative to each other, constructing the grey bands in the rightmost panels of Figure 4. We then take the mean of the set of light orange bands and light black bands to obtain the solid orange and solid black curves, respectively.The black bands serve as null distributions, so large deviations from those indicate significant shifts.We observe a large deviation for the ∼ 35 M ⊙ peak, a moderate deviation for the ∼ 10 M ⊙ peak, and only a slight deviation for the ∼ 14 M ⊙ dip.Quantitatively, g GWTC-3 (f (35 M ⊙ )) ≥ g j (f max ) to 83 +17 −69 % (90% credible interval), meaning that the ∼ 35 M ⊙ peak lies in the 83 +17 −69 rd percentile of the mock catalogs' largest perturbation heights.For the other features, g GWTC-3 (f (10 M ⊙ )) ≥ g j (f max ) to 74 +25 −60 % and g GWTC-3 (f (14 M ⊙ )) ≥ g j (f max ) to 34 +52 −32 %.In comparison, the corresponding statistic for the null distributions is g j (f max ) ≥ g i (f max ) to 50 +47 −46 %.It is not possible to draw firm conclusions from these large uncertainties.However, the central values indicate that the ∼ 35 M ⊙ peak is noticeably shifted relative to the mock catalogs' perturbation functions, the ∼ 10 M ⊙ peak is moderately shifted, and the ∼ 14 M ⊙ dip even has a slightly lower amplitude than the maximum perturbation functions typical of mock catalogs.

Inconsistency With a Power Law
The final metric we consider is inspired by the statistic presented in Abbott et al. (2021a), which states that "the inferred perturbation f (m 1 ) strongly disfavors zero at both the 10 M ⊙ and 35 M ⊙ peak."We therefore turn from considering the full distribution of perturbation function heights at a given location to the percentile at which it excludes zero.A perturbation function amplitude of zero is a useful reference point for several reasons.The most intuitive is that it causes the population model to behave like a featureless power law, so a posterior that excludes zero to high credibility indicates an inconsistency with a power law.Zero is also the mean of the prior predictive distribution for the perturbation function: the prior allows for equal upwards and downwards fluctuations, symmetric about zero perturbation.Similarly, a vanishing perturbation function amplitude is the state to which we expect the posterior predictive distribution to asymptote in the limit of infinite detections from an underlying power law distribution.We For GWTC-3, we evaluate the perturbation function's posterior distribution at primary mass (m1) values of 13.8 M⊙ (dotted ), 10.3 M⊙ (dashed ), and 35.7 M⊙ (solid ).For mock catalogs, we find the primary mass value at which the perturbation function is maximal and evaluate its posterior distribution there.The values reported here are the percentage of the posterior distribution that is greater than zero at those values in m1.The 13.5 M⊙ feature excludes zero to a level comparable to some of the mock catalogs, but the other two features exclude zero to a level not reproducible by any mock catalogs.therefore plot the percentile at which each mock catalog excludes zero perturbation in Figure 5.
We then calculate how often a simulated catalog's perturbation function excludes zero to the same credibility as that of GWTC-3.This is the same as finding the point along the y-axis of Figure 5 at which each of the vertical orange lines hits the cumulative distribution function.
1.7%, 10.0%, and 92.7% of the {g j (f max )} exclude zero to the same percentile as g GWTC-3 (f (35 M ⊙ )), g GWTC-3 (f (10 M ⊙ )), and g GWTC-3 (f (14 M ⊙ )), respectively.The fact that, for example, g GWTC-3 (f (14 M ⊙ )) < 0 to 20.7% but 92.7% of mock catalogs have a similar or smaller statistical excursion is due in part to the difference between Bayesian credible intervals and frequentist p-values, and because our metric corrects for the look-elsewhere effect by comparing GWTC-3's perturbation function at specific locations to all possible locations in the mock catalogs.
Combined with the metrics presented in Section 3.3.1, the results above lead us to conclude that the peak at ∼ 35 M ⊙ is difficult to reproduce with featureless catalogs, but it is possible that the dip at ∼ 14 M ⊙ is just a large fluctuation rather than a astrophysical feature.The peak at ∼ 10 M ⊙ is difficult to reproduce with fea-tureless catalogs, though it is easier to reproduce than the ∼ 35 M ⊙ peak.We discuss the interpretation of this feature in more detail in Section 4.
In summary, featureless catalogs can sometimes produce features as tall as the ∼ 10 M ⊙ peak, and they can sometimes produce perturbations constrained away from zero with the same credibility.The dip at ∼ 14 M ⊙ could be a Poisson fluctuation because fits to featureless catalogs can easily produce perturbations as large, and as credibly constrained away from zero perturbation.The peak at ∼ 35 M ⊙ is difficult to reproduce by mock catalogs in any way: its perturbation amplitude is too large and too credibly constrained away from zero.
The fact that we find one of the features explainable by Poisson noise is consistent with Section 2, which suggests that up to two of the excursions from a power law can be explained by Poisson fluctuations.Our conclusions are also in broad agreement with those presented in Abbott et al. (2021a), as they report confident detections for the two largest peaks in the mass distribution but only modest evidence for the dip at ∼ 14 M ⊙ .

DISCUSSION
Previous analyses of the BBH mass spectrum by the LVK and others have found evidence for structure beyond a simple power law (Abbott et al. 2021;Abbott et al. 2021a).There has been considerable work exploring possible astrophysical causes of these identified features.Our aim is instead to determine, from a statistical viewpoint, whether astrophysical arguments need be invoked at all.
We first demonstrate that it is only possible for up to two of the three deviations from a power law to be explained by Poisson noise about a single power law distribution.Therefore, at least one feature must be added on top of a power law to describe the data.
We then perform a more thorough analysis, simulating thousands of BBHs with measurement uncertainty, selection effects, and a known underlying distribution.We fit the Power Law + Spline model to the resulting catalogs and find that the data is inconsistent with a single power law, agreeing with the LVK result.However, we find that one of the previously identified features, an underdensity at ∼ 14 M ⊙ , may not be present in the true astrophysical distribution.Instead, it may have been the result of a Poisson fluctuation around a simple power law, or an artifact of the models used to fit the mass spectrum.The metrics constructed in this work differ from those previously used to assess the significance of features in the mass distribution because, by virtue of comparing to several simulated catalogs, they correct for the look-elsewhere effect.This is only in mild tension with the conclusions reached by Abbott et al. (2021a), as they report "modest evidence" in favor of a dip at 14 M ⊙ .
We find the other two previously identified peaks, at ∼ 10 M ⊙ and ∼ 35 M ⊙ , unlikely to be the result of Poisson noise or modeling artifacts.Simulated catalogs coming from distributions that do not include these features can reproduce the height of the ∼ 10 M ⊙ peak, but not its lack of support for zero perturbation.The ∼ 35 M ⊙ peak is difficult to reproduce from featureless catalogs in any way.
Our conclusions are consistent with a recent study by Callister & Farr (2023) who fit the BBH mass distribution with an autoregressive model and find that the primary mass distribution gradually decreases as a function of mass and exhibits two local maxima with a relatively flat continuum between them.They interpret the 14 solar mass dip found by other analyses to be a flattening of the power law index at lower masses rather than a local minimum.We also find similar results to Edelman et al. (2022) who construct the mass distribution entirely from basis splines and find peaks at ∼ 10 M ⊙ and ∼ 35 M ⊙ .The significance of the peaks near 10 M ⊙ and 35 M ⊙ , as well as the lack of significance of the dip near 14 M ⊙ , is also in agreement with Sadiq et al. ( 2022) and Wong & Cranmer (2022).
The dip near ∼ 14 M ⊙ may be a large Poisson fluctuation or an artifact of the models used to characterize it.If it is in fact a feature of the underlying distribution, it is difficult to resolve with current observations.The peak near ∼ 10 M ⊙ is likely an imprint of the true astrophysical distribution.Its amplitude is slightly larger than what featureless catalogs can produce with random fluctuations, and it is inconsistent with the power law that describes the rest of the BBH mass distribution at a level that only a small fraction of featureless catalogs can achieve.We therefore report moderate evidence that additional structure beyond a power law is needed to explain the peak at ∼ 10 M ⊙ .
The ∼ 10 M ⊙ feature may either be an additional peak that is distinct from the one created by the underlying smoothed power law at ∼ 7 M ⊙ (Abbott et al. 2021a;Edelman et al. 2022;Tiwari 2022) or the sole peak in the region between ∼ 5 M ⊙ and ∼ 20 M ⊙ (Edelman et al. 2022).These two possibilities can be seen in Figure 1.The former scenario is the case where we interpret the first two peaks in the orange band as distinct from one another, therefore treating the global maximum inferred by Power Law + Spline as a different feature from the global maximum inferred by Power Law + Peak.In the latter scenario, the role of the perturbation function is to shift the global maximum from the value inferred by the power law component to a slightly higher value without removing the mass distribution's support for 5-10 M ⊙ objects.A simple smoothed power law, such as that employed by the Power Law + Peak model (see grey and blue bands in Figure 1), may not be flexible enough to place a global maximum at ∼ 10 M ⊙ while also fitting the correct slope at larger masses and fitting the correct merger rate below ∼ 10 M ⊙ , so it places its global maximum at ∼ 7 M ⊙ .This scenario, in which there is a single local maximum below ∼ 12 M ⊙ , is consistent with Edelman et al. (2022) and Callister & Farr (2023), both of whom find only one significant maximum between approximately 3 M ⊙ and 12 M ⊙ using fully non-parametric methods.If this interpretation is correct and the global maximum of the BBH mass distribution is indeed offset from the minimum mass by ∼ 5 M ⊙ , the upper edge of the lower mass gap may not be as morphologically simple as previously assumed (e.g., Fishbach et al. 2020;Farah et al. 2022a;Ezquiaga & Holz 2022), making it potentially difficult to resolve with parametric models alone.The marginal evidence for the significance of the ∼ 10 M ⊙ peak is likely driven by the fact that current observations are insufficient to distinguish between these two scenarios, which is unsurprising considering the lower sensitivity of GW detectors to low-mass events relative to high-mass events.
A peak anywhere between ∼ 7 M ⊙ and ∼ 10 M ⊙ could be indicative of particular evolutionary processes that are dominant within formation environments.van Son et al. (2022a) showed that a global maximum near this value is consistent with and robustly predicted by the stable mass transfer channel in isolated binary evolution, as stability during mass transfer requires mass ratios between the donor star and accreting compact object to be relatively symmetric, and stellar companions to ∼ 10 M ⊙ BHs must be near this mass to form compact objects above the minimum BH mass.This may be an indication that the stable mass transfer channel operates more efficiently than the traditional common envelope channel for generating merging BBHs.If the stable mass transfer channel is indeed the cause of the global maximum in the primary mass distribution, the exact location of this global maximum will constrain the core mass fraction, mass transfer stability, and mass transfer efficiency of this process (van Son et al. 2022a).Though dynamical formation channels with low escape velocities, such as globular clusters, struggle to produce a global maximum at 10 M ⊙ (Antonini et al. 2022), dynamical environments with higher escape velocities may more readily produce merging BBHs with lower masses around 10 M ⊙ due to the more prevalent lower-mass BHs preferentially remaining bound to these clusters following supernova kicks.
We find that the peak centered on 35 M ⊙ is the most likely to be a feature of the true underlying distribution.This bodes well for the "spectral siren" (Farr et al. 2019;Ezquiaga & Holz 2022) method of estimating cosmological parameters from GW observations, as this peak happens to be the most informative feature for this method since it is a well-measured, somewhat sharp feature in the mass distribution (Abbott et al. 2021).The astrophysical process that gives rise to this feature is still a topic of discussion.The key reason for including a flexible bump-like feature in the phenomenology of parametric models, such as the Power law + Peak model used by the LVK (Talbot & Thrane 2018), was to accommodate a potential build-up of BHs with masses just below the pair instability mass gap, as pulsational pair instability supernovae are predicted to efficiently shed material from high-mass stars with cores in the mass range of M core ∼ 45 − 65 (Woosley 2017(Woosley , 2019;;Marchant et al. 2019;Renzo et al. 2020).It is difficult to reconcile the locations of the local maxima found in the BBH primary mass distribution with predictions of the pair instability process in the cores of massive stars.The largest uncertainty determining the location of the lower edge of the pair instability mass gap is the 12 C(α, γ) 16 O reaction rate, which determines the abundance of oxygen in stellar cores (e.g., Farmer et al. 2019).Higher 12 C(α, γ) 16 O reaction rates lead to a higher oxygen abundance in the stellar core, which will ignite explosively during core collapse and lead to (pulsational) pair instability supernovae occurring at lower core masses.However, even at 3σ deviations above the median measured value of the 12 C(α, γ) 16 O reaction rate, the lower end of the mass gap only reaches ≈ 38 M ⊙ (Farmer et al. 2020).This is above where the measured overdensity in the observed mass spectrum occurs.This may be an indication that the peak at 35 M ⊙ is the result of certain isolated binary evolution scenarios (e.g.chemically homogeneous evolution, see du Buisson et al. 2020;Zevin et al. 2021;Bavera et al. 2022), another BBH formation channel entirely (e.g.globular clusters, see Antonini et al. 2022), or that stellar evolution models are missing particular ingredients that can shift the location of the pair instability gap (relaxing the assumption that the exploding stars are hydrogen-free, adjustments to convective overshooting, see e.g.Iorio et al. 2022).
Additionally, several studies have suggested that the observed peaks in the BBH mass distribution can be explained by successive generations of hierarchical mergers (Tiwari & Fairhurst 2021;Mahapatra et al. 2022;Tiwari 2022), though no correlation has been detected in the spin distribution of BBHs (Biscoveanu et al. 2022a), which is also necessitated by the hierarchical merger formation scenario (Gerosa & Berti 2017;Fishbach et al. 2017;Rodriguez et al. 2019;Kimball et al. 2020;Doctor et al. 2020Doctor et al. , 2021;;Gerosa et al. 2021).Additionally, for these peaks to correspond to hierarchical mergers of the same population, the dominant hierarchical pairing would have to be the first generation BH with a third generation BH (Mahapatra et al. 2022;Tiwari 2022), whereas the dominant pairing predicted by Rodriguez et al. ( 2019) is a first generation BH generation with a second generation BH.While it is certainly possible that GWTC-3 contains hierarchical mergers (e.g.Abbott et al. 2020, though also see Fishbach & Holz 2020b), the relative fraction of events formed this way is likely too small to form the structure observed in the primary mass distribution (Kimball et al. 2021), and some fine-tuning may be needed to avoid a cluster catastrophe (Zevin & Holz 2022).The exact physical reason for the overdensity at 35 M ⊙ therefore remains unclear.However, we confirm that it is a robust signature in the observational data; future observing runs will help to constrain its precise location, width, and possible redshift evolution.
Given the true parameters, the basic procedure of generating samples from the posterior distribution of each event is to draw an observation from each event's likelihood, use that observation as the central value of the posterior distribution, and then to draw samples from that posterior, assuming a prior.
To obtain observed parameters, θ obs i , we need the likelihood, L total (θ obs i | θi ).We model each event's likelihood as where Here, N (µ, σ) is the normal distribution with mean µ and standard deviation σ.
The standard deviations are determined by assuming the uncertainties on all parameters except for the SNR scale inversely with ρ obs (Veitch et al. 2015).In stationary, Gaussian noise, we expect the matched-filter SNR in a single detector to have unit variance (Allen et al. 2012), i.e. σ ρ i = 1 for all i.We therefore draw ρ obs for each event from L ρ (ρ obs i |ρ i ).This observed SNR will serve as the detection statistic that determines whether each event is observable.We assume events that pass an SNR threshold of ρ obs,i > 8 in a single detector are detected.In this way, we allow for events near threshold to fluctuate above or below threshold, emulating the actual noise process in the detectors.Of the events that make it through detection, we randomly select 69 of them to constitute a mock catalog with the same number of BBHs as were analyzed by Abbott et al. (2021a).The standard deviations for M det , η, and Θ of the detected events are calculated via where we have chosen u M = 0.08 M ⊙ , u η = 0.022, and u Θ = 0.21 to match uncertainties in these parameters typical of events observed in O3.
Observed values for all parameters are drawn from Equation A4 with standard deviations defined in Equation A6.With θ obs i in hand, we are now ready to construct a posterior distribution.We apply the following priors: where U (x 1 , x 2 ) is the uniform distribution with lower bound x 1 and upper bound x 2 .The bounds on η and Θ are chosen because those parameters are only physically defined in the domains (0, 0.25] and [0, 1], respectively.Neither M nor ρ are defined below zero, but the upper bounds were chosen somewhat arbitrarily: they must only be large enough that the likelihood has minimal support above them.The posterior distributions for each parameter are then Gaussians centered on the observed value, with standard deviations defined in Equation A6.They are therefore the same as the distributions in Equation A5, but with the role of the true and observed values switched.We then simulate multiple-dimensional posterior samples for each event by drawing 5000 independent samples 5 of detector-frame chirp 5 We use 5000 samples to optimize the speed of population inference while also ensuring the number of effective samples used for Monte Carlo sums in the population inference always satisfies the criterion outlined in Farr (2019).That criterion has since been shown to be insufficient and has been superseded by Essick & Farr (2022), but we utilize the former for consistency with the analysis performed in Abbott et al. (2021a).However, users of the GWMockCat package can easily modify the number of posterior samples to suit their needs.
and the individual mass and redshift distributions are given by the following: ) This is equivalent to the Power Law + Peak model in Abbott et al. (2021a) and Abbott et al. (2021), with λ peak set to 0. We will call the population model described by Equations B10-B13 Smoothed Power Law.We generate catalogs from the model that results from setting Λ mock to the values provided in Table 1.These values were chosen by fitting this population model to GWTC-3 (grey band in Figure 1) and obtaining the median a posteriori value for each hyperparamter.Spectral index for the power law of the mass ratio distribution.

α
Negative spectral index for the power law of the primary mass distribution.

mmin
Minimum mass of the primary mass distribution.

M⊙ mmax
Maximum mass of the primary mass distribution.

M⊙ δ
Range of mass tapering at the lower end of the mass distribution.

M⊙ κ
Spectral index for the power law factor of the redshift distribution.

2.7
We validate the mock catalogs' generation by fitting them with Smoothed Power Law and allowing the hyperparameters to be inferred from the mock data.We then determine whether the inferred values of the hyperparameters are consistent with the values in Table 1.We fit 100 mock catalogs of 69 events each, 10 results of which are shown in Figure 6.While there is noticeable scatter about the injected value, it is generally consistent with the recovered mass distributions: the hyperparameters of the underlying mass distribution fall within the inferred mass hyperparameters' 90% credible intervals 89.6% of the time.We therefore conclude that any biases that the mock posterior generation process introduces in the mass distribution are subdominant to the statistical uncertainties of the fit.
To further explore systematic differences caused by mock catalog generation that may be subdominant to the considerable statistical uncertainties resulting from a fit to only 69 events, we fit Smoothed Power Law to a single catalog of that is five times larger.The result is shown in Figure 7.The hyperparameters of the underlying distribution seem to be consistent with the inferred hyperposterior, so we conclude that our mock event posterior generation process produces biases subdominant to measurement uncertainty typical of 345-event catalogs.We therefore find this method of generating mock catalogs sufficient to test the significance of features identified in the mass distribution of GWTC-3.

C. ACCURACY OF MOCK CATALOGS WHEN USED IN A POPULATION ANALYSIS
As a second check of GWMockCat's ability to simulate catalogs accurately enough to be used in a population analysis of the mass distribution of BBHs, we recreate GWTC-3 with GWMockCat.We then compare Power Law + Spline's fit to this mock catalog with its fit to the posterior samples released by the LVK for GWTC-3 (LIGO Scientific Collaboration et al. 2021b).We find that the two resulting mass distributions are consistent, and therefore conclude that the approximate prescriptions used in GWMockCat are sufficient to probe the mass distribution of BBHs, at least for current GW detector sensitivities.To simulate GWTC-3, we re-weight all GWTC-3 events to the same prior as used for sampling in GWMockCat, namely uniform in detector-frame chirp mass, uniform in symmetric mass ratio, and uniform in sky angle, as defined in Equation A7.We then take the mean of the re-weighted detector frame chirp mass, symmetric mass ratio, sky angle, and single-detector SNR posteriors as the observed parameters θ obs i for each event.Using the priors defined in Equation A7 and likelihoods defined in Equations A5 and A6, we construct mock posterior distributions on detector frame chirp mass, symmetric mass ratio, sky angle, and single-detector SNR.We then convert these parameters to source-frame component masses and redshift.Finally, we re-weight back to the standard prior used in LVK's parameter estimation process.The end result of this is shown in Figure 8 for GW191215 223052, an event that was chosen at random from the catalog and happened to have a primary mass and redshift near the mode of the detected events.The mock posteriors appear consistent with the true posteriors, and degeneracies between parameters seem to be suitably captured.This behavior is qualitatively similar for all simulated events, though some had mock posteriors that were slightly more consistent with the full parameter estimation posterior samples, and some had mock posteriors that were slightly less consistent.
Any population analysis must define criteria for inclusion in the population.We apply two different criteria and report the results of both in Figure 9.The first, a cut at SNR > 8 in Livingston, is chosen to be analogous to the detection criteria of the other mock catalogs, which use a single-detector SNR cut since it is not possible to run all of the pipelines necessary to produce a FAR on mock data.The second criterion, a false alarm rate (FAR) cut at 1/yr was used to be consistent with the analysis done by the LVK on GWTC-3 (Abbott et al. 2021a).Either choice is reasonable because the selection function is known with respect to both SNR and FAR.We can therefore use the publicly-released sensitivity estimates (LIGO Scientific Collaboration et al. 2021a) to reconstruct the underlying, or astrophysical distribution of BBHs from either of these catalogs.All of the metrics of feature significance presented in Section 3 make use of this astrophysical distribution when comparing GWTC-3's population fit to that of mock catalogs.The first criterion (single-detector SNR cut) produced a final catalog with many fewer events than the second criterion (FAR cut), with the former resulting in 36 events and the latter resulting in 69 events.Therefore, the population analysis performed on the catalog selected by SNR has much wider hyper-posteriors than the one performed on the catalog selected by FAR.However, these two catalogs do not appear to be systematically biased with respect to one another, nor are they systematically biased with respect to the LVK-released analysis.This is again because the selection function is known with respect to both of these criteria, and therefore the reconstructed astrophysical distributions are consistent.
Figure2.Observed source-frame primary mass distributions.Black solid lines contain the median a posteriori values for the binary black holes in GWTC-3.Pink and blue bands indicate the 90% credible interval on the observed distributions predicted from astrophysical distributions that are power laws in primary mass with spectral index α = 3.25 and α = 2.7, respectively.The top panel shows a histogram of observed primary masses.For GWTC-3's distribution to be consistent with the null distributions, we expect its bin heights in the top panel to be within the 90% credible intervals in 18 out of the 20 bins.The uncertainties in these predicted distributions are due only to Poisson noise resulting from a finite number of observations, rather than modeling uncertainty or uncertainty in parameter estimation.Therefore, the cumulative distribution functions in the bottom panel are similar to a conventional posterior predictive check, but with only one source of uncertainty.The large deviations of the black curve from the shaded bands in some regions indicate the difficulty that a single power law with Poisson shot noise has in fully explaining the observations.However, many of the apparent excursions from a power law are well-contained within the predicted bands.

2
For all comparisons to real observations, we use the publicly available posterior samples for the GWTC-2.1 and GWTC-3 data releases(LIGO Scientific Collaboration & Virgo Collaboration  2022; LIGO Scientific Collaboration et al. 2021b, respectively).

Figure 3 .
Figure3.Median (top panel ) and 90% credible interval (bottom panel ) of the perturbation function resulting from the Power Law + Spline fit to the primary masses in GWTC-3 (orange) and in 10 mock catalogs (grey).The perturbation function multiplies a smoothed power law in primary mass to add modulations to an otherwise monotonic distribution, making it a direct measure of deviations from a power law.It is a cubic spline with knots fixed at the locations indicated by the black vertical tick marks.The prior on the perturbation heights is the unit normal distribution, as can be seen below ∼ 5 M⊙ where there are no detections to constrain the likelihood and the posterior reverts to the prior.The perturbation function corresponding to GWTC-3 events appears large in amplitude in three locations: ∼ 10 M⊙, ∼ 14 M⊙, and ∼ 35 M⊙.While the medians of the perturbation function at these distributions are comparable in amplitude, the posterior distribution at ∼ 35 M⊙ (∼ 14 M⊙) is the most (least) tightly constrained.

Figure 4 .
Figure 4. Three largest deviations from a power law observed in GWTC-3 compared to mock catalogs.Left column: The posterior distribution of perturbation function heights at the location where the posterior distribution is maximal for mock catalogs (light grey) and GWTC-3 (solid orange).Middle column: Null distribution (black ) and GWTC-3 distribution (orange) of Kolmogorov-Smirnov (KS) divergences between the individual distributions in the left column.Smaller values of the KS divergence indicate more similar distributions.Right column: Null distribution (black ) and GWTC-3 distribution (orange) of percentiles.Large deviations from the diagonal indicate a more significant rightward shift of the GWTC-3 distribution relative to the mock catalogs.Each row corresponds to a different local extremum for GWTC-3: m1 = 13.8M⊙ (top), m1 = 10.3M⊙ (middle), and m1 = 35.7 M⊙ (bottom), while the global extrema for each mock catalog are shown in all rows, along with the aggregated distribution across all mock catalogs (solid black ).The ∼ 35 M⊙ peak is an outlier with respect to both the KS and percentile statistics, but the other two features are more ambiguous.

Figure 5 .
Figure5.Percentile at which the posterior distribution of the perturbation function excludes zero for GWTC-3 (orange vertical lines) and catalogs drawn from a featureless distribution (black histogram).For GWTC-3, we evaluate the perturbation function's posterior distribution at primary mass (m1) values of 13.8 M⊙ (dotted ), 10.3 M⊙ (dashed ), and 35.7 M⊙ (solid ).For mock catalogs, we find the primary mass value at which the perturbation function is maximal and evaluate its posterior distribution there.The values reported here are the percentage of the posterior distribution that is greater than zero at those values in m1.The 13.5 M⊙ feature excludes zero to a level comparable to some of the mock catalogs, but the other two features exclude zero to a level not reproducible by any mock catalogs.

Figure 6 .
Figure6.Injected (Solid black line) and recovered (colored shaded bands) distributions for 10 mock catalogs.Top: probability density function of primary masses.Bottom left: hyperposterior distribution for β, the power law spectral index of the mass ratio distribution.Bottom Right: hyperposterior distribution for κ, the spectral index of the power law factor in the redshift distribution.

Figure 7 .
Figure 7.A corner plot of the inferred hyperposterior from a fit to a mock catalog with 345 events.The injected values are shown in orange.The recovered hyperposterior is consistent with the injected population.

Figure 8 .
Figure8.Mock posteriors simulated with GWMockCat (red ) compared to posteriors made with full parameter estimation as released by the LVK (blue) for the event GW191215 223052.The two sets of posterior samples appear consistent to the level needed for a population analysis.

Table 1 .
Hyperparameter values for the underlying population of mock catalogs described by Smoothed Power Law (Equations B10-B13).