Extreme data compression for Bayesian model comparison

We develop extreme data compression for use in Bayesian model comparison via the MOPED algorithm, as well as more general score compression. We find that Bayes Factors from data compressed with the MOPED algorithm are identical to those from their uncompressed datasets when the models are linear and the errors Gaussian. In other nonlinear cases, whether nested or not, we find negligible differences in the Bayes Factors, and show this explicitly for the Pantheon-SH0ES supernova dataset. We also investigate the sampling properties of the Bayesian Evidence as a frequentist statistic, and find that extreme data compression reduces the sampling variance of the Evidence, but has no impact on the sampling distribution of Bayes Factors. Since model comparison can be a very computationally-intensive task, MOPED extreme data compression may present significant advantages in computational time.


Introduction
Astronomical datasets can be very large, with galaxy catalogues and weak lensing surveys having up to hundreds of millions of objects [e.g., [1][2][3], and Planck microwave background data [4] containing billions of time-ordered data.Extracting scientific knowledge from such large datasets almost inevitably requires some form of data compression, typically into summary statistics such as point estimates of two-point summaries, either correlation functions or power spectra.Compression without loss of information is generally possible since the datasets are noisy and correlated, and furthermore the information may be captured by a small number of summary statistics, depending on the statistical properties of the underlying data.However, even with compression to two-point summaries, the number of statistics can still be large: for example, cosmic shear surveys such as proposed for the Euclid mission [5] may generate ∼ 10 4 two-point statistics, and this can present problems for likelihoodbased scientific analysis if the covariance of the summaries needs to be simulated, or indeed if their distributions are not known [6].Both of these considerations motivate the use of more radical data compression techniques.This can also be highly beneficial when the resultant summaries are coupled with techniques such as simulation-based inference (SBI, also known as likelihood-free inference), which is undergoing rapid development thanks in part to novel machine learning techniques [e.g., [7][8][9][10][11].Typically, the most extreme compression possible without losing information is to reduce the dataset down from its original size to the number of model parameters to be inferred.Such extreme compression can have a number of benefits, in terms of speed of analysis, in reducing the number of simulations required to determine the sampling distribution of the statistics [6], and in making SBI feasible [12,13], whilst still allowing for new models to be considered [14].Heavens et al. [15] derived the optimal data compression algorithm MOPED, for gaussian-distributed data whose mean signal depends on the parameters, showing that extreme data compression can preserve the Fisher matrix, so the compression is, in this sense, lossless.Techniques, based on Karhünen-Loève compression [16] also exist for data whose covariance depends on the parameters, but not the mean, in contrast to MOPED, which applies when the mean depends on the parameters, but not the covariance.However, the former situation does not lead to lossless extreme compression, nor does it generalise neatly to multiple parameters.MOPED, whether in its original Gram-Schmidt orthogonal form, or in the simpler non-orthogonalised version (which we use in this paper) is a special case of using the score function (the gradient of the likelihood with respect to the model parameters) to compress the data, as shown by [17].Alternative approaches to extreme data compression include information-maximising neural networks (IMNNs) to find maximally-informative summaries, as introduced by [7], and which have in certain cases been shown to be lossless at Fisher level [9].Typically, the neural network-based methods operate on the fields themselves [e.g.9-11], whereas the MOPED and score compressions operate on some intermediate, physics-based summaries [e.g.6,10,12,18].
The main motivation for this paper is to determine whether the qualities of optimal extreme data compression which apply to parameter inference also extend to Bayesian Evidence and Bayes factors, i.e., to model comparison.The answer to this question is yes, provided that the extreme data compression is performed in each of the models separately, and the compressed data then combined, such that the combined compressed dataset is by construction optimal in all models.This applies to both nested and non-nested models.
With this key result, committed Bayesians may stop reading, but we also take the opportunity to investigate the sampling distribution of the Bayesian Evidence and the related Bayes Factor, to look at their use as frequentist statistics under extreme compression.The motivation for this comes from a suggestion [19] that data compression may help in reducing the sampling variability in the Bayesian Evidence.The answer is that it does, but the sampling variance of the Bayes Factor, which is a ratio of evidences and the relevant quantity for model comparison, remains virtually unchanged.To rephrase this same point, one can use extreme MOPED data compression as effectively as the full dataset for model comparison as well as parameter inference.Since model comparison is typically a computationally expensive exercise, extreme data compression, which often results in faster likelihood evaluation, can be very valuable.
Note that the distinction between the sampling distribution of the Bayesian Evidence and of the Bayes Factor is important, especially since it is tempting to conclude -erroneouslythat the sampling distribution of the Bayesian Evidence is sufficiently wide to compromise its usefulness in model selection, but this ignores the fact that much of the variation is common to both models (the data are the same), and cancels out when the ratio of Bayesian Evidence is computed, leaving the Bayes factor as an effective discriminant, as we show in this paper.
The layout of the paper is as follows.In section 2 we review the MOPED data compression algorithm, and show how it works effectively for model comparison as well as parameter inference, and consider more general score compression.In section 3.1 we explore the frequentist properties of Bayes factors, showing that the sampling distribution of the Bayes factor is unaltered whether the full or compressed data are used.In section 3.2 we apply MOPED compression to the Pantheon+SH0ES dataset [20]) and show that both the cosmological parameters and the Bayes Factor can be determined from just 3 numbers (the MOPED coefficients), as precisely as from the original 1590 supernovae.We conclude in section 4.

Bayesian Evidence and Bayes Factors for the uncompressed dataset
Let us first consider gaussian-distributed data, with fixed covariance matrix and arbitrary dependence of the mean on the parameters.Data compression with MOPED [15] allows extreme compression to p summary statistics, where p is the number of model parameters, whatever the length of the original data vector.Although the resulting MOPED coefficients are not strictly sufficient statistics except in the case of linear models, they can preserve the entire Fisher matrix, so the likelihood surface near the peak is no broader than when using the full, uncompressed dataset.As a result, the extreme compression can be used for parameter inference without increasing errors, and has been used in many contexts, such as star formation histories of SDSS galaxies [e.g.[21][22][23][24], the cosmic microwave background [e.g.[25][26][27], galaxy power spectrum and bispectrum [e.g.28,29], gravitational waves [30], weak lensing [e.g. 12, 31], combined probes [e.g.32,33], planetary transits [34], and extended to parameter-dependent covariance matrices [6] and mis-specified models [14].In this section, we explore how MOPED and related score compression can be used in a new context: model comparison.To begin we review the derivation of the unorthogonalised version of MOPED, following not the original analysis [15], but rather the derivation in [17].
We consider a model with parameters θ = {θ 1 , . . ., θ p }, data x ∈ R n (where n p) generated from a gaussian sampling distribution with covariance matrix C ∈ R n×n independent of θ and expectation value of the data µ(θ).The sampling distribution for the uncompressed dataset is therefore Letting X ≡ x − µ * , where µ * is the expectation value of the data at some fiducial set of parameters θ * , the posterior is for a prior density π(θ).Here and a linear Taylor expansion around the fiducial point gives μ µ ,α θα where θ ≡ θ − θ * . (2.4) The comma indicates ∂/∂θ α and the summation convention is used.Hence to this order, We see that the only coupling of the data x to the parameters is through the μT C −1 X term, and only via the p MOPED coefficients and we identify the p (unorthogonalised) MOPED weight vectors [15]) The posterior may then be written or in matrix form where Y is a vector of only p elements, consisting of the MOPED coefficients: and B T ∈ R p×n is made up of p rows of MOPED vectors b T α , defined by equation (2.7).In matrix form, where Φ T ∈ R p×n is a matrix with rows given by µ T ,α .The symmetric matrix Λ ∈ R p×p is Note that Heavens et al. [15] applied a Gram-Schmidt orthogonalisation to the MOPED vectors, to decorrelate the elements of Y , but the information content is the same if this is not applied, and we omit it as it simplifies the analysis.We have also subtracted µ * from x for convenience, and to ensure that a Taylor expansion to linear order is accurate, following [6].
To normalise the posterior, p(θ|x, M ) = p(x|θ, M )π(θ)/p(x|M ), we need the Bayesian Evidence Z(x) = p(x|M ), where we have added explicitly the dependence on the choice of model, M .The Evidence is We see that the data-only prefactor cancels in the normalized posterior: From this it is apparent that, to the extent that the linear expansion is applicable, we may compute the posterior from the MOPED coefficients alone, without considering x in fullwe only need Y.This is a very powerful feature of MOPED compression.Note that for a Gaussian prior density, π( θ|M ) = N (0, Σ) the integrals may be evaluated analytically (see also [13,35,36]) giving

Demonstration that the Bayes factor depends only on MOPED coefficients
Before we derive the general expression for the Bayes factor in the next section, we motivate our approach by demonstrating that, for nested models, the Bayes factor from the full dataset depends only on the set of MOPED coefficients defined in the extended model, to linear order in expansion.The posterior odds between models M 0 and M 1 are where B 01 is the Bayes factor, and the last equality applies if we assume equal prior model probabilities, π(M 0 ) = π(M 1 ).We have defined Model M 0 , with q free parameters, is said to be 'nested' in model M 1 , with p > q parameters, if the first q parameters of Model 1 are common to Model 0, and there is a choice of value for the additional parameters in Model 1 such that the latter reverts to Model 0. In this setting, we choose a fiducial parameter set within the prior support of Model 0 (and, therefore, a fortiori in Model 1); in this case, µ * is common to both models, and X is the same in both. in this case, the prefactor in the Evidence (equation 2.13), cancels out in the Bayes factor, for any choice of prior densities π( θi |M i ) . (2.17) Note that for the simpler model the extra parameters are set at θ = 0, which we accommodate by defining Λ 0 as the top q × q block of Λ 1 .
Just like the posterior for the parameters, the Bayes factor for the full dataset depends on the data only via Y .This motivates adopting MOPED compression for model comparison in the general case, with the potential for a large acceleration of what can be a computationally expensive task for some applications.We develop this in the next section.
It is worth pointing out that the requirements on the accuracy of the Taylor expansion are more stringent for model comparison than for parameter inference, since we are evaluating an integral over the whole parameter space, well beyond the region where the posterior density is appreciable.For some applications this could limit the accuracy of the approach, but for typical cosmological posteriors, their unimodal and relatively simple structure means that the MOPED-derived posteriors are often very close to those obtained from the full dataset, even far from the posterior peak.This is apparent even in the first (astrophysical) example, presented in [15], where the posterior contours are similar between MOPED and full data set at values smaller than the peak by a factor e 100 , even though this is not guaranteed by the method -the compression does far better than one has any right to expect.We therefore compress the data only once for each model, each at a single fiducial parameter set, and we illustrate later in a cosmological supernova study that the Bayes factor computation from the compressed data still agrees with the full dataset within sampling errors.

Bayesian Evidence and Bayes Factor with MOPED compression for nonnested models
We now develop the formalism for using MOPED extreme data compression for computing Bayes factors for non-nested models, with the nested case requiring only a minimal adjustment.In order for the Bayes factor to give the correct Bayesian posterior odds ratio (for equal model priors), the data entering in each of the Bayesian Evidence calculations have to be the same, or else the p(data) denominator in Bayes theorem will not cancel in the ratio.Since the MOPED vectors defined within each model are optimal for parameter inference in their respective models, we construct a compressed dataset from the union of the MOPED coefficients from both (or, more generally, all) models.In this way the augmented compressed dataset will be optimal for parameter inference in both models, since we will not lose information by adding additional data beyond what is required.Of course we must take care to include correctly the correlations between all the compressed data.We thus compare the analysis with the full dataset with the data compressed to where µ * 0 and µ * 1 are fiducial means in the two models.In the nested case, these would be chosen to be equal, and since Y 0 is already contained in Y 1 , it is discarded, with Y 1 alone being retained.Here we assume non-nested models when both Y 0 and Y 1 are present, and specialize to the nested case later.
The Bayesian Evidence for each model, Z i (Y ) is obtained by very similar arguments to the preceding analysis.For Gaussian data x ∼ N (µ, C), Y ∼ N ( Ȳ , Λ) is also Gaussiandistributed, with mean .
(2. 19) p+q) , where in block form Φ = (Φ 0 , Φ 1 ).The Bayesian Evidence for model i is then With a Taylor expansion for µ as before (equation (2.4)), Ȳ = B T Φ θ ≡ Λ θ, we find where A i ≡ Φ T C −1 Φ i and which will have zeros in one or other block depending on whether i = 0 or 1.

Nested models with MOPED compression
There is a significant simplification for nested models, where we dispense with the top blocks of Y and Φ T , and take µ * 0 = µ * 1 .As a consequence Ȳ * i = 0.A i can be replaced by Λ (which reduces to Λ 1 ), since it always comes in combination with θi , and we extend the θ0 vector to include zeros from the extended parameter space.With these simplifications, the Evidence becomes This is very similar to the Evidence computed from the full data set (equation 2.13), but with a different data-dependent prefactor to the integral.
When we set the prior on the extended model parameters to Dirac delta functions in Model 0, Λ effectively becomes Λ 0 , and when we cancel the common prefactors, we recover equation (2.17), and find that the Bayes factor for the MOPED compressed data for nested models is exactly the same as the Bayes factor for the original dataset.In other words, to the extent that a linear Taylor expansion is accurate, no information is lost in performing Bayesian model comparison using MOPED coefficients rather than the full dataset.

Gaussian prior
With a gaussian prior on the parameters, the Bayesian Evidence (equation 2.21) for the different models (nested or otherwise) reduces, after application of the Woodbury formula and the matrix determinant lemma, to a very simple expression where (2.25)

Score compression
We can generalise the result further to arbitrary likelihood functions, using the score compression of Alsing & Wandelt [17], of which MOPED is a special case.Here they expand the log likelihood as where J ≡ −∇∇ T L, and * indicates that quantities are evaluated at the fiducial parameters.
L * depends on the data, but we see again that the L * (which is a source of variability in the Bayesian Evidence) cancels in the Bayes factor if the models are nested and θ * lies in the domain of both M 0 and M 1 .The only coupling between data and parameters at linear order is through the score function t ≡ ∇L * . (2.27) J is strictly constant in the Gaussian case if C is independent of θ, and is given by the Fisher matrix If C is parameter-dependent, the expression is complicated, and couples data and parameters (see [17], equation ( 15)), and replacing J by F would be an approximation, but possibly a good one.We see that the preceding arguments hold in this case as well, that for nested models, the L * term cancels out in both the posterior and also the Bayes factor, if the fiducial parameters are common to both (all) models.As a result, general score compression, using the trick of concatenating the scores from all models considered, can effectively be used for model comparison, with the additional approximation (2.28).

Distribution of Bayes factors
Bayesians may wish to turn away now, as we consider the frequentist properties of Bayesian Evidence and Bayes factors, as investigated by Jenkins & Peacock [37] and Joachimi et al. [19].These papers note that the Bayesian Evidence and the Bayes factor are noisy statistics, and consider their sampling variability and its impact on the outcome of model comparison.Note that we remain of the Bayesian view that the Bayes factor can indicate a preference for one model over another, but not falsify models in absolute terms, and we are not seeking frequentist methods to do this, which typically rely on tail probabilities of p(data|M ) as opposed to posterior model probabilities, p(M |data), cf.[38][39][40].
To the committed Bayesian, the sampling distribution is not relevant, since the Bayesian view is that there is one set of data on which the inference is based, and it does not matter what other realisations of the data might have yielded.Nevertheless, Z is a statistic, and its sampling distribution could be of some interest.There is a contribution to the variability of Z from the data-only prefactor in equation (2.13) in the gaussian case (and we come to a similar conclusion in the more general score compression case: see section 2.3).The log of the prefactor is distributed as a χ 2 distribution with n degrees of freedom with a variance of 2n.However, this main source of variability (the prefactor) cancels in nested models when the Bayes factor is computed, so the quantity on which model comparison depends is not subject to this source of large variability.
This behaviour is apparent in Fig. B.1 of [19], where the Bayes factor is very stable even though Z 0 and Z 1 are highly variable, which would invite the naïve interpretation that the sampling variance of the Evidence may hinder model comparison.However, even if the sampling distribution of ln Z 0 and ln Z 1 is large in comparison with ln B 01 (X), the Bayes factor favours one model consistently, because Z 0 and Z 1 are not independent, as they use the same data realisation.
We illustrate the behaviour of a non-nested case with the following toy non-linear models: where t has 20 values uniformly spaced between 0 and π (an example is shown in Fig. 1).
The data are generated with E = 2.5 and F = 1.0, and the fiducial values are E * = 3.0, F * = 1.2, a * = −1.5, b * = 4.7).The distribution of Bayesian Evidence values for each model from 4×10 4 realisations is shown in Fig. 2 for the full dataset, and from MOPED-compressed data in Fig. 3. Notice that the considerable sample variance exceeds the difference of the mean Evidence values.As suggested by [19], the width of the sampling distribution of the compressed data Evidence is reduced compared to the full dataset.
Fig. 4 shows that, although the Evidence values differ markedly for the full and compressed datasets, the Bayes factors are almost identical, with deviations being due to the linear Taylor expansion not being perfect in the nonlinear case.
Our conclusion from this study is that MOPED compression reduces the variance of the sampling distribution of the Bayesian Evidence, so it would appear advantageous to use such a data compression.However, the Evidence values from an individual noise realisation are highly correlated, as they are based on the same data, and the log of the Bayes factor has a much smaller variance than the variance of the individual ln Z values.We find that the    uncompressed data give the same Bayes factor as the MOPED compressed data (inasmuch as the neglect of higher-order derivatives of µ is valid).As a result MOPED compression can be used for accurate model comparison, with in some cases a large increase in speed of computation.We also note that with a single full likelihood evaluation and one compressed likelihood evaluation at the fiducial parameters, the full dataset Evidence can be computed, if it were ever needed, from the compressed Evidence:

Cosmological example -the Pantheon+SH0ES dataset
In this section, we re-analyse the Pantheon+SH0ES data presented in [20].This consists of 1701 supernovae, with Cepheid calibration.We use the distance moduli from the catalogue1 (with a fiducial SNIa magnitude determined from SH0ES 2021 Cepheid host distances), and the supplied covariance matrix, including systematics, to define a simple gaussian likelihood function for 1590 supernovae with z > 0.01.While not nearly as sophisticated as a Bayesian hierarchical model [e.g.41], this simplified setup serves to demonstrate parameter inference and Bayesian Evidence comparisons for the full and MOPED-compressed datasets.The two models being compared are: • a flat universe, parameterised by θ 0 = (Ω m , h); • a curved universe, which allows Ω Λ to vary away from 1 − Ω m , with parameters where h is the Hubble-Lemaître constant in units of 100 km s −1 Mpc −1 and Ω m is the matter density parameter, and Ω Λ the vacuum energy density parameter.Priors on all varying parameters are taken to be uniform over the range [0, 1].We choose a fiducial model with Ω m = 0.3, h = 0.7 and Ω Λ = 0.7.The compression concatenates the MOPED coefficients from both models, but since they are nested, this is identical to using the MOPED coefficients from the extended, 3-parameter model, yielding We use the nested sampling code dynesty2 [42] with 5000 live points, and no linear approximation is made for the parameter dependence of the mean distance modulus in the likelihood function.We find virtually identical posteriors for full and compressed data, shown in Fig.
with in both cases the flat model being slightly favoured by the data.The error is an estimate from the values of about 0.036 for each, since we do not know the extent to which the errors are correlated.Fig. 5 and the Bayes factors of eq.(3.3) show the strength of MOPED for both parameter inference and now for model comparison as well.The latter, along with the sampling distributions of the Bayes factor in the compressed case (Fig. 4) are the main new results of this paper.

Discussion and Conclusions
In this paper we have shown how the extreme data compression algorithm MOPED [15] can be used to compute Bayes factors with essentially no loss of accuracy, thus extending its power from parameter inference to now include model comparison.The key to this is to perform MOPED compression in all the model spaces, and to concatenate the MOPED coefficients into a slightly larger compressed data space, of dimension equal to the sum of the number of parameters if the models are not nested.This is still an extreme compression, if the number of data far exceed the number of parameters, which is typically the case with cosmological survey data.In the nested case the MOPED coefficients come from the more complex model.Many of the advantages extend to other forms of score compression when MOPED is not applicable.
The computation of Bayes factors is generally expensive, requiring integrals over the parameter space or sophisticated nested sampling techniques, so where MOPED leads to much faster likelihood evaluations from the massively reduced dataset, the process may be significantly accelerated.
Finally, we have investigated the sampling distribution of the Bayesian Evidence and of Bayes factors, for uncompressed and MOPED compressed data.We found that while extreme data compression reduces the variance of the Evidence for each individual model, as previously claimed, the sampling distribution for the Bayes factor remains unaffected by extreme compression of linear models, and virtually identical for nonlinear models.Thus treating the Bayes factor as a frequentist statistic, it is just as effective to compute it from the compressed dataset as the full set.In a cosmological application, we have shown that, assuming a simple gaussian data model for the Pantheon+SH0ES dataset [20], the Bayes factor for the flat vs non-flat model can be computed from only 3 compressed numbers, rather than the full dataset of 1590 supernovae, with no loss of accuracy.

Figure 1 .
Figure 1.Example dataset, with noisy data drawn from the mean curve of Model 0. Parameters for Model 1 are arbitrary.

Figure 2 .
Figure 2. The distribution of Bayesian Evidence for 50000 noise realisations of non-nested toy models M 0 and M 1 with details given in the text, obtained from the full dataset.The normalisation of the y axis is arbitrary.

Figure 3 .
Figure 3.As Fig. 2, but for a compressed dataset consisting of the four MOPED coefficients, two from each model.

Figure 4 .
Figure 4. Comparison of Bayes factor sampling distribution for the nonlinear, non-nested toy models M 0 and M 1 with details given in the text, from the full dataset and the MOPED-compressed dataset.The distributions overlay almost perfectly.

Figure 5 .
Figure 5. Marginal posteriors for cosmological parameters (Ω m , h, Ω Λ ), using the full dataset of size 1590 (red) and only the three MOPED-compressed coefficients (blue).The posteriors are almost identical, resulting in a purple shade in the 1D marginals.
5. The natural log of the Bayesian Evidence for the full dataset is 838.88 ± 0.035 for the flat model, and 837.96 ± 0.037 for the extended model.For the MOPED-compressed data, the log Bayesian Evidence values are −22.35 and −23.29, with the same errors.Hence the Bayes factor are