Fast and effortless computation of profile likelihoods using CONNECT

The frequentist method of profile likelihoods has recently received renewed attention in the field of cosmology. This is because the results of inferences based on the latter may differ from those of Bayesian inferences, either because of prior choices or because of non-Gaussianity in the likelihood function. Consequently, both methods are required for a fully nuanced analysis. However, in the last decades, cosmological parameter estimation has largely been dominated by Bayesian statistics due to the numerical complexity of constructing profile likelihoods, arising mainly from the need for a large number of gradient-free optimisations of the likelihood function. In this paper, we show how to accommodate the computational requirements of profile likelihoods using the publicly available neural network framework connect together with a novel modification of the gradient-based basin-hopping optimisation algorithm. Apart from the reduced evaluation time of the likelihood due to the neural network, we also achieve an additional speed-up of 1–2 orders of magnitude compared to profile likelihoods computed with the gradient-free method of simulated annealing, with excellent agreement between the two. This allows for the production of typical triangle plots normally associated with Bayesian marginalisation within cosmology (and previously unachievable using likelihood maximisation because of the prohibitive computational cost). We have tested the setup on three cosmological models: the ΛCDM model, an extension with varying neutrino mass, and finally a decaying cold dark matter model. Given the default precision settings in connect, we achieve a high precision in χ2 with a difference to the results obtained by class of Δχ2 ≈ 0.2 (and, importantly, without any bias in inferred parameter values) — easily good enough for profile likelihood analyses.


Introduction
In the last few decades, parameter inference in cosmology has traditionally been done using Bayesian statistics.In Bayesian parameter inference, the goal is to characterise the multidimensional posterior probability distribution.This is often done using Markov-chain Monte Carlo sampling.Subsequently, estimates and uncertainties on single parameters are obtained by integrating the multidimensional posterior distribution over all other parameters, a process called marginalisation [1].This is the main reason for the popularity of Bayesian parameter inference in cosmology: all credible intervals and two-dimensional posterior distributions are readily computable once a representative sample of the multidimensional posterior has been obtained.
Marginalisation requires a way to associate (prior) probability to volumes of parameter space, so the marginalised posterior distributions and the credible intervals [2] may sometimes be completely dominated by the choice of prior.This effect is sometimes referred to as volume effects, because the effect is associated with the prior volume being integrated.This can be seen as an advantage because it makes it easy to build e.g.theoretical prejudice into the statistical analysis.For instance, one may wish to penalise a model containing a parameter that needs to be extremely fine-tuned to provide a good fit to the data.However, given that we often do not know the true underlying model, it could very well be that the true underlying model is observationally equivalent to the one proposed, but in the true model, the equivalent parameter is not fine-tuned.Thus, if we penalise the proposed model a priori, we might fail to discover the true model.
If we want to avoid this trap, we may instead employ frequentist statistics.Broadly speaking, this simply entails that we substitute marginalisation for maximisation; instead of integrating out the extra parameters, we maximise the likelihood over these parameters.The resulting object is the profile likelihood, which has recently gained increased popularity [3][4][5][6][7][8][9][10][11].From this profile likelihood, we may then derive confidence intervals [12], akin to the credible intervals in Bayesian statistics.The advantage of the profile likelihood is that it may reveal interesting features of the model that would be missed in the Bayesian approach, but the disadvantage is the difficulty and computational cost associated with the large number of multidimensional optimisations.Early papers on cosmological parameter inference have examples of both marginalisation (see e.g.[13]) and profiling (see e.g.[14][15][16]; see also [17] for an early discussion of profiling versus marginalisation).However, the introduction of MCMC techniques in marginalisation [18,19] led to their increasing dominance in the field because of their speed and simplicity.
Many of the computational problems of profile likelihoods are solved by the recent advancements of machine learning within the field of cosmology.Many different cosmological emulators of observables, using e.g.neural networks [20][21][22][23][24] or Gaussian processes [25][26][27], have emerged in recent years, and these all benefit from much faster evaluation times than ordinary Einstein-Boltzmann solver codes.
In this paper, we show how connect [20] can facilitate fast and accurate computation of one-and two-dimensional profile likelihoods at a tiny fraction of the cost of a more traditional approach.This performance enhancement derives both from the speed-up of evaluating the neural network instead of class [28] or camb [29] but also because the gradient of the likelihood can be computed fast and accurately through automatic differentiation techniques.The paper is organised as follows: In section 1, we give an introduction; in section 2, we introduce profile likelihoods and discuss the difference between profile likelihoods and marginalised posteriors; in section 3, we give a brief overview of the optimisation algorithms we use; and in section 4, we provide some more practical details regarding the implementation.In section 5, we show, for the first time, triangle plots for the profile likelihood compared to the posterior distribution for the ΛCDM-model as well as for an extension with varying neutrino mass and degeneracy.We also show a profile likelihood of the decay constant in a decaying cold dark matter model as an example where the training data of the neural network itself suffers from large volume effects.Finally, we give our conclusions and future outlook in section 6.

Profile likelihoods and Bayesian inference
Given an N -dimensional parameter space Θ where we are mainly interested in constraining the parameters of an M -dimensional subset Ω, the profile likelihood L( ⃗ θ) of a likelihood function L( ⃗ θ, θ) on the full parameter space is obtained by maximising the dimensions not contained in the reduced space Ω [30], where ⃗ θ represents a vector in the parameter subspace of interest and θ a parameter vector in the subspace of Θ we maximise over.Parameters in the latter subspace are commonly referred to as nuisance parameters, and usually we are interested in either one-dimensional profile likelihoods, M = 1, for one-parameter constraints, or two-dimensional profile likelihoods, M = 2, to constrain the correlation between pairs of parameters.Since the profile likelihood (2.1) is obtained by maximising a subset of the parameter space, frequentist inference based on it is an optimisation problem (as opposed to Bayesian inference, which is a sampling problem).The main contributions of this paper are novel strategies for carrying out this optimisation.
From the profile likelihood, confidence intervals (or regions, if M > 1) can be obtained using the Neyman construction [31], where 68.27% (95.45%) confidence regions for ⃗ θ are defined implicitly by the regions ∆χ 2 ( ⃗ θ) < 1.0 (4.0) with the definition 2) The coverage of the intervals constructed using the Neyman method is exact whenever the profile likelihood is Gaussian or whenever there exists a reparametrisation in which it is.
Alternative interval constructions, such as the Feldman-Cousins prescription [32], are known to produce more accurate interval coverages, but since the focus of this paper is on the optimisation and not the interval construction, we employ the Neyman construction throughout.
As seen in the definition (2.1), the profile likelihood is a maximum likelihood estimate in the reduced subspace and therefore inherits the invariance under reparametrisations of this subspace from the latter [30].This is in contrast to posterior distributions from the Bayesian inferences, which may change with the specific parametrisation.To illustrate the difference between the marginalised posterior distribution and the profile likelihood, we investigate a toy likelihood comprised of two Gaussian distributions with different normalisations and covariance matrices.The two Gaussians have only a slight overlap, as shown in figure 1, and the one with the largest maximum is much more narrow in the θ 2 parameter.This makes the contribution of the taller peak to the posterior in the θ 1 parameter much less when marginalising over the θ 2 parameter in the case of uniform priors, even though the likelihood is actually larger in this peak.This shows how the greater volume of a less significant likelihood peak can dominate the marginalised posterior.When computing a profile likelihood in the θ 1 parameter, the likelihood is optimised for each fixed value of θ 1 , and so the profile looks like a projection of the two-dimensional surface onto the one-dimensional θ 1 -axis.This, of course, shows the two peaks with their actual height differences.
The two different ways of representing the likelihood function come from the two different ways of interpreting probability in frequentist and Bayesian statistics.Neither method is more correct; they simply answer different questions.Frequentist inference answers the question of how we can choose a range in the parameter space based on the data such that the true value of the parameter will fall within the range in (1 − α) × 100% of the asymptotically often repeated realisations of the experiment.This range is called the confidence interval with the level of significance α.Bayesian inference treats the true value of the parameter as a random parameter chosen from a distribution, and the question is which range of the parameter space we can choose so that we are (1 − α) × 100% certain that the true value has taken a value that lies within the range.This is called the credible interval with the level of significance α.The posterior can therefore be thought of as a probability distribution, while the profile likelihood cannot.Ultimately, it is this difference in interpretation that manifests in the different constraints obtained from the two statistical paradigms.

Optimisation of the likelihood function
An accurate and robust optimisation routine is crucial for profiling likelihoods, and this task can be difficult in certain situations.Optimisation routines typically require many function evaluations in order to perform well, and this is especially true if there is no prior knowledge of the cost function.In the case of using an Einstein-Boltzmann solver code like class, the function takes on the order of 10 core seconds to evaluate (including both the Einstein-Boltzmann solver code and the likelihood codes), and this quickly adds up to large computation times.Certain optimisation methods make use of gradients, but Einstein-Boltzmann solver codes tend to be numerically unstable with respect to differentiation due to different approximation schemes toggling in different regions of parameter space.Furthermore, in order to construct profile likelihoods, it is essential that the global optimum be found.Therefore, global optimisation routines are essential.

Global optimisation
The problem of global optimisation is numerically difficult.Methods like gradient descent [33] and the simplex algorithm [34] can get stuck in local optima, so this can be a problem whether or not one has access to gradients of the likelihood function.There exist, however, specific methods that can effectively search the parameter space in clever ways inspired by MCMC methods.One such method that has proven to be fairly efficient is simulated annealing [35], which does not require gradients and is therefore a suitable choice when dealing with Einstein-Boltzmann solver codes.In short, this method searches the parameter space as any other MCMC but gradually lowers the sampling temperature while doing so.This makes the features of the likelihood landscape more profound over time, allowing the MCMC chain to eventually settle on the global optimum.The optimisation is highly dependent on the chosen temperature schedule and somewhat inefficient when the acceptance rate of the MCMC is small.Despite the individual simulated annealing optimisation being sequential, each point in a profile likelihood can be optimised in parallel, so one can therefore do profile likelihoods based on simulated annealing in reasonable time, although the number of evaluations needed for an inference of all parameters in a typical cosmological model, as well as their correlations, makes it unfeasible to do with an Einstein-Boltzmann solver code.Instead, we use a neural network from the connect framework to speed up the evaluation of the observables.This means that the likelihood evaluation time is dominated by the comparison of theoretical observables to data.This is, however, a significant decrease in computational cost, and using the same method as before (simulated annealing), we can obtain profile likelihoods much quicker.
We can even improve on the method now that we are using a neural network that does not suffer the same numerical instabilities with respect to differentiation as Einstein-Boltzmann solver codes do.The auto-differentiation of the TensorFlow [36] framework lets us easily differentiate the neural network with respect to the input parameters, so we can now make use of gradient-based optimisation techniques.However, before that, we need to be able to auto-differentiate the likelihood calculation as well.We are seeking the derivative of the likelihood value L with respect to the cosmological input parameters ⃗ θ, and the network only provides us with the derivative of the observables ⃗ O (C ℓ spectra, power spectra, etc.) with respect to the cosmological parameters.By the chain rule, we therefore need the derivative of the likelihood calculation with respect to its input (the cosmological observables computed by the network), Obtaining the derivative of the likelihood with respect to cosmological observables proves to be quite elaborate, but this is discussed further in section 4.
Equipped with auto-differentiation all the way from cosmological input parameters to the likelihood value, one can now make use of gradient-based methods.Relying solely on these can be a problem since gradient-based methods are often unable to explore other optima than the closest.A global optimisation method that can circumvent this issue is the basin-hopping [37] algorithm.This method is reminiscent of simulated annealing, but with the addition of a local optimisation in each step.The method therefore "hops" between different local optima, or "basins", instead of hopping between random points.This reaches convergence much quicker since fewer steps are needed due to the local optimiser always placing each point at an optimum.This method is very dependent on a good local optimisation method; otherwise, it will reduce to simulated annealing in the limit of a trivial optimiser.One drawback of this optimisation method is that one is not guaranteed to find the global optimum given the stochasticity of the optimiser [38], and indeed there is a probability of convergence at a suboptimal point (see appendix B).In our case, however, this is a quite small probability due to the smoothness of the neural networks, and it can be heavily decreased through precision settings.

Local optimisation
A local optimisation method should, in our case, take only a few steps in order to reach the optimum.Gradient-based methods are obviously suitable for this, but the simplest of such methods, gradient descent, does not perform well in this regard.The reason for this is that the step size will decrease with the slope, and a lot of steps will be taken close to the optimum due to the vanishing gradient.Some parts of the likelihood function can even be close to flat, and this requires a lot of steps by the gradient descent optimiser.A better choice could be to also use second-order derivatives, which would mean that we could almost guess the correct optimum after one evaluation if the likelihood landscape is close to Gaussian.The secondorder derivatives are, however, not easy to get in our case since the TensorFlow graph of this computation becomes too large to handle.An appropriate compromise could be a pseudosecond-order method like the Levenberg-Marquardt [39,40] or Broyden-Fletcher-Goldfarb-Shanno [41][42][43][44] (BFGS) algorithms.These use first-order derivatives to approximate the Hessian in order to quickly locate the nearest optimum.TensorFlow has an implementation of BFGS, so this has been chosen as the local optimiser to use with the global basin-hopping method.

Auto-differentiability
In order for us to use gradient-based methods, we need auto-differentiation at each step in the evaluation of the likelihood function.We use the built-in reverse mode differentiation of TensorFlow, and this requires every operation during the evaluation to be written with TensorFlow syntax in order for them to be auto-differentiable.Many popular likelihood codes, such as the full Planck 2018 likelihood [45] (including low-ℓ T T , low-ℓ EE, high-ℓ T T + T E + EE, and lensing), are complex and tedious to translate to TensorFlow syntax, so as of now only the Planck lite likelihood has been translated, first to Python by Ref. [46] and then to TensorFlow by Ref. [21].We have altered the TensorFlow version from Ref. [21] to accommodate neural networks from connect and this involved another rather tedious task of interpolation.connect only computes the same ℓ-grid that class does, and this is more than an order of magnitude fewer points than required by the likelihood code.We have therefore implemented a cubic spline interpolation routine in the likelihood code since no suitable interpolation method is implemented in TensorFlow.This has to be written not only in TensorFlow syntax but also using only differentiable operations (some functionality in TensorFlow is not auto-differentiable).This adds an extra layer of computation to equation (3.1), where ⃗ C ℓ is a vector of C ℓ values, which is the observable used by the likelihood code in our case, and the number in square brackets tells the number of C ℓ values.One could simply emulate all 2508 values needed by the code, but this is a computational waste when training the network since all the information is contained in only the 100 values that class actually calculates [20] (the number of points calculated by class differs based on input and precision settings).Since each of the computational steps in equation (4.1) is now differentiable, the total derivative of the likelihood with respect to the cosmological parameters can be used for optimisation purposes.A single function evaluation of the gradients takes on the order of ∼10 −2 seconds and is approximately twice as expensive as evaluating just the likelihood itself (using the neural network, the interpolation, and the data comparison).

Ensemble basin-hopping
When training a neural network with connect, the training data is gathered using multiple MCMC runs in an iterative manner [20], where each iteration uses an MCMC sampler to gather data using the neural network from the previous iteration.This means that we already have a covariance matrix and an estimate for the best-fit point as a starting point.This is a great help when doing the actual maximum likelihood optimisation, since the likelihood landscape is roughly known beforehand.We can use this information to slightly modify the standard basin-hopping algorithm to accommodate this additional information.Steps are normally taken sequentially, but since we have a covariance matrix and a best fit estimate, we can construct a proposal distribution and draw multiple points from it at once.These can then all be locally optimised in parallel, thus exploring different local optima simultaneously.We can then centre a new proposal distribution around the best point of the ensemble of optimised points and lower the sampling temperature.Repeating this allows us to home in on the global optimum much faster due to the parallelisability.The altered algorithm is sketched below1 : • T and ⃗ b if the majority of the ensemble finds the same optimum then break from while loop L min is now the optimised function value and ⃗ b is the best-fit point

Constraints on parameter space
Profile likelihoods are not subject to any priors as marginalised posteriors are, but there might still be benefits to confining the optimisation within certain ranges in the parameter space due to physical constraints.An example here could be the mass of a particle, which must be positive.When computing profile likelihoods using Einstein-Boltzmann solver codes and simulated annealing, this will never be an issue since the code will raise a computation error in such a case, and the likelihood code will then return a very low likelihood such that no point beyond this physical boundary will be accepted.Neural networks will, however, always produce output based on any input of the correct type and dimensionality, so they might learn some behaviour linked to a certain parameter as it decreases.It is then able to extrapolate, and the result might be a good fit to the data either by chance or because of this extrapolation.This is exactly the case for the model with a varying neutrino mass used as a test case in this paper.It is therefore beneficial to confine our optimisation within given parameter ranges.A variant of the BFGS algorithm dubbed BFGS-B [48] performs the optimisation within a confined box in the parameter space.This is very useful in our case, but unfortunately it is not implemented in TensorFlow.A solution is therefore to introduce a smooth and differentiable penalty function that penalises the likelihood when evaluating a point outside of the box.We have chosen a very steep exponential function that increases depending on the distance from the boundary, and this ensures that gradients are still meaningful in a way such that any evaluation outside the box will send the optimiser back inside the box.A proper implementation of BFGS-B might be beneficial in the future.

Workflow
When doing profile likelihoods with connect, there are a few steps.First of all, one needs to gather training data and train a neural network with a specific cosmological model implemented in class.Then, it is a good idea to run a normal MCMC with the neural network in order to have a good covariance matrix along with Bayesian inference for comparison.The covariance matrix from the gathered training data can be used instead 2 .Then one needs to choose at which points in the parameter space to optimise for both one-and two-dimensional profile likelihoods; the idea is to sample with more resolution where the likelihood function varies significantly as well as near its maximum.This can be tricky if one does not know any features of the likelihood function of the particular cosmological model, and this is another reason for doing an MCMC run with the neural network beforehand.When the posteriors are known, a good initial guess is that the profile likelihood will somewhat resemble them.This is exactly true when there are no volume effects and only flat priors are used, but in any case, it is reasonable to assume that at least some features of the profile likelihood will overlap with the features of the posterior.In the case of one-dimensional profile likelihoods, one can often get away with simply choosing a set of equally spaced points.In the case of two-dimensional profile likelihoods, it is not as straightforward, but a good guess is to use the points of the histograms from which the posteriors are computed.Any non-zero bin in these histograms corresponds to a region where the MCMC has accepted points, and so we can choose these bin centres as the points in our two-dimensional profiles (see appendix A).If the computed profiles seem to not be best represented by these points, a routine to manually add points by clicking in the plots has been included (see appendix B).A full automatisation of which points to choose is beyond the scope of this work but should be further investigated.

Results and discussion
In order to test the performance of the optimisation routine, we have chosen three cosmological models as examples: ΛCDM, massive neutrinos, and decaying cold dark matter.These three models each have features that are useful for highlighting different problems and their corresponding solutions.

ΛCDM
The posterior distribution of the standard ΛCDM model is almost perfectly Gaussian under standard CMB data [45] and therefore has no volume effects.In this case, we expect to see the profile likelihoods coinciding perfectly with the posteriors in both the one-and twodimensional plots.We did not train a specific ΛCDM neural network with connect since this will be contained in the neural networks of the other two models.We have therefore used the same network as for the massive neutrinos model, where the parameters m ncdm and deg ncdm were fixed to 0.06 and 1.0, respectively, in order to match a value of the effective number of degrees of freedom of N eff = 3.046.When fixing the two model-specific parameters, the rest of the network behaves like a ΛCDM network trained with these two parameters fixed at the same values.
Figure 2 shows a full triangle plot of both posteriors and profile likelihoods for the ΛCDM model.The blue lines and filled contours are the one-and two-dimensional posteriors, respectively; the red stars and contour lines are the one-and two-dimensional profile likelihoods, respectively; and the cyan stars mark the best-fit point of the entire parameter space.The stars in the one-dimensional plots are the actual calculated points in the profile likelihoods, but for the two-dimensional plots, we only show the contour lines calculated from the computed points in the same way that the posterior contours are calculated from the histograms.The agreement between the posteriors and the profile likelihood is excellent, and since we would expect this to be true for a near-Gaussian likelihood, this validates our optimisation routine.

Massive neutrinos
Now that we have tested our optimisation on a simple likelihood without any volume effects, like in the ΛCDM model, we can move on to a model that actually contains volume effects.The inclusion of a variable neutrino mass, m ncdm , (two species are assumed to be massless) and the degeneracy, deg ncdm , introduces new volume in the parameter space, and the likelihood is not Gaussian for the neutrino mass.We should therefore expect to see some differences between the posteriors and the profile likelihoods.Figure 3 shows the posteriors and profile likelihoods of this model.The posteriors are again shown in blue, while the profile likelihoods are shown in red.We can clearly see differences between the posteriors and profile likelihoods now, with the profile likelihoods being shifted slightly compared to the posteriors.The most profound shifts are to higher values of H 0 and lower values of both A s and τ reio , and it is interesting that the actual best-fit point (cyan stars) lies on the edge of the 68.27% credible regions of the posteriors involving H 0 , and the 95.45% confidence regions of the profile likelihoods in H 0 seem to alleviate the Hubble tension significantly.Since we are only using the Planck lite likelihood, we cannot draw any reasonable conclusions based on this, but it really emphasises how the volume of the parameter space can impact parameter inference and that both a Bayesian analysis and a frequentist analysis should be performed in order to get the full picture.
The differences between a model with volume effects and one without are more apparent from figure 4, where the 68.27% and 95.45% constraints from both the profile likelihoods and the posteriors are shown for each of the two cosmological models, the ΛCDM model and the massive neutrinos model.The additional volume of the massive neutrinos model not only broadens the constraints but also shifts some of the parameters, as previously discussed.In order to test the performance of the emulator, we compare the results with profile likelihoods obtained using class and simulated annealing.These are shown in figure 5, along with the 68.27% and 95.45% confidence intervals calculated from the connect profile likelihood.We see a great agreement between class and connect and this further strengthens our confidence in the results obtained using connect in figures 2 and 3.

Decaying cold dark matter
In order to really put the framework and optimisation to the test, we have included the model of decaying cold dark matter (DCDM) with dark radiation (DR) as the decay product.This likelihood is notoriously difficult to sample and is also quite challenging when doing profile likelihoods [5].With Planck lite data, the likelihood features a very slight peak for high values of the decay rate, Γ dcdm , with a height corresponding to only ∆χ 2 ≈ 0.5.For this, we require much higher precision than for the other cosmological models, but achieving this turns out to be very difficult.The optimisation can only ever be as good as the neural network used by the optimisation routine, and a good network suitable for this particular optimisation requires a lot of training data around this subtle likelihood peak.As of now, we are limited by the fact that our training data for the neural network is generated by iteratively sampling from the posterior, and this makes it very hard to properly sample around the peak due to the large volume effects that the posterior is influenced by.It can, of course, be solved by sampling maybe an order of magnitude more points to use as training data (as well as training for more epochs and with a larger network architecture), but this becomes unfeasible in terms of training the network.
Another approach would be to slice the parameter space at specific values for any parameters exhibiting such problems and train different networks for each value of this parameter.This also requires more computational resources, and if the goal is to have just the one-dimensional profile likelihood in the parameter in question (Γ dcdm in our case), this is quite wasteful, and one might as well compute the profile likelihood with class and simulated annealing.We can, however, use these individual networks for more, and so if we wanted a full triangle plot of profile likelihoods, we would just use the appropriate network for any point in a two-dimensional profile likelihood containing the sliced parameter and use all networks at once for any point in all other two-dimensional profile likelihoods and simply choose the lowest value of χ 2 .This adds a layer of complexity compared to using just a single network containing the whole parameter space, but if any two-dimensional profile likelihoods are needed, then this approach will be orders of magnitude less computationally expensive than using class and simulated annealing.
Figure 6 shows the profile likelihood in the Γ dcdm parameter using both class and connect.connect has been tested both using single networks trained on the entire parameter space at different sampling temperatures and individual networks for each fixed value of Γ dcdm .The individual networks for each fixed value of Γ dcdm have been trained with the regular iterative approach of connect [20], and the networks spanning the entire parameter space have been trained with a slight modification.The first part of the training is identical to the regular iterative approach with the default sampling temperature of T = 5, but when convergence is reached, the sampling and training do not halt.Instead, new data is gathered with a smaller temperature (T = 4) and used to train a new network, which is then used to gather data for a new network with an even lower temperature, etc.The training data from consecutive iterations are not merged in this second part since they have different statistical properties due to differences in sampling temperature.This annealing results in multiple networks with different sampling temperatures according to a predefined list that would not be possible to obtain with just the regular iterative approach.This is because the likelihood is difficult to sample with small temperatures, so having only a small temperature from the beginning will not result in a useable network for e.g.T = 1.By having the data gathered with a slightly larger temperature as the foundation for the network gathering the next data with a smaller temperature, it ensures that the network is always very accurate when gathering data.The reason for wanting to do this is that we cannot properly resolve this particular best-fit region with too large of a temperature given that it is a very narrow and slight band in the multidimensional likelihood surface.On the other hand, we cannot perform the entire sampling at a low temperature because the accuracy would suffer.With inspiration from the simulated annealing algorithm, this new approach takes all of this into account.
Another problem apparent in figure 6 is that the networks with different temperatures only seem to agree just around the minimum and for higher values of Γ dcdm3 .This is, however, not surprising since we are sampling training data from the posterior, which is heavily influenced by volume effects towards high values of Γ dcdm in this case.We therefore have much more training data on the right side of the figure, which means that all of the networks have better performance here.
The figure also shows results from individual networks with fixed values of Γ dcdm , and we can clearly see that these resemble the results from class much more.There is a small discrepancy in the depth of the well, but this is due to small precision errors.The difference corresponds to ∆χ 2 ≈ 0.2 and this is definitely not significant given that the same behaviour and shape are produced.The networks with different temperatures also seem to agree with the networks with fixed values of Γ dcdm for high values, so it is highly probable that it is a systematic error founded in the precision of neural networks.These networks have, however, all been trained to per mille precision in all of the C ℓ spectra, but when investigating the reason for the differences in χ 2 we found that the Planck lite likelihood is very sensitive to certain ranges in ℓ for the T E spectrum.A per mille error in the T E spectrum can lead to errors in χ 2 up to roughly 0.3 around the best-fit, and that seems to be in agreement with what we can see from the figure.This is only relevant for this very slight likelihood peak, and since the peak is more significant when using the full Planck 2018 likelihood [5], the precision is most likely not an issue.Increasing the precision in the networks, e.g. by training for more epochs, could potentially render the effect unnoticeable in figure 6, but this would be without merit due to the χ 2 precision already being much greater than what is needed in any actual case of use.Regardless of the minute χ 2 discrepancy, the connect networks seem to find the same optimal parameter vectors that class finds, given that evaluating the points with class results in more or less the same profile as class finds on its own (red diamonds in the top panel of the figure).The T T spectrum is furthermore known to have much more constraining power than both T E and EE, so we should expect the same shape in the profile likelihoods due to the sufficiently high precision in the T T spectrum, and the small discrepancy from the precision in the T E spectrum only shifts the χ 2 values by a small amount and not the parameter vectors.Parameter inference is therefore not affected by this discrepancy.This is supported by the curve with the red diamonds in figure 6, which shows the same points in the parameter space obtained by the connect networks with fixed values of Γ dcdm but evaluated by class.We see that this curve is very close to the one actually obtained by class, and this indicates that the optimisations of the neural networks find the correct optima.
It stands to reason that we should be able to obtain the same level of accuracy with a single network spanning the entire parameter space as we can with individual networks at fixed values of Γ dcdm , and indeed the results seem to converge if we generate larger amounts of training data, which is apparent from the bottom panel of figure 6.When increasing the amount of data much further than shown here, it is necessary to train the networks for more epochs and perhaps increase the size of the networks, i.e. hidden layers and nodes in each layer.Given that we have 15 individual fixed-value networks each with ∼5 × 10 4 training points, 6 hidden layers, and 1024 nodes in each hidden layer, we can definitely justify making a much larger network instead with close to ∼10 6 training data points since this would be equally expensive in terms of computational power.Increasing the number of trainable parameters in the network also increases the amount of information it can hold, which eventually will bring the emulation error down to a point where there will be no discrepancy between class and connect.

Computational performance
It is obvious that the computational cost of computing profile likelihoods with connect is much lower than when using class simply because the evaluation time of a connect network is around 3 orders of magnitude faster than that of class [20].Just using connect instead of class with simulated annealing would therefore be a huge speed-up.The utilisation of gradients, however, boosts the speed-up even further since fewer function evaluations are needed.In order to use gradients in TensorFlow, a computational graph of the entire gradient computation needs to be constructed, which roughly takes around a minute, after which the evaluation of both the likelihood and the gradients takes around ∼10 −2 seconds.Given that class is allowed to run on multiple threads, the evaluation time is around ∼1 second.Using the basin-hopping algorithm described in section 4 with the BFGS optimiser as the local optimiser, each local optimisation requires ∼10 2 function calls, each ensemble contains ∼10 1 walkers, and the temperature is updated until convergence (usually around 2-4 times).This results in ∼10 3 -10 4 evaluations, each taking ∼10 −2 seconds.This means that a single point will usually converge in less than a minute on a single CPU core.The simulated annealing algorithm requires on the order of ∼10 4 -10 5 evaluations for each point in the profiles presented in this paper, and with an evaluation time dominated by class, a single point will converge in roughly a few days if run sequentially.There is, however, some degree of parallelisability, given that multiple chains can be utilised and class can be parallelised to around 8-10 cores.The different points of a profile can, of course, be computed completely separately, regardless of which algorithm is used.
There is also the matter of gathering training data for a neural network if one is starting from scratch, and this process is quite time-consuming compared to the use cases of a trained network [20].For most likelihoods, the network requires around ∼50, 000 points of training data, which means that class needs to be evaluated ∼50, 000 times.This is, however, very parallelisable, and it is much faster than doing an actual MCMC run with class -especially for beyond ΛCDM where the number of class evaluations can be as high as 10 5 -10 6 .If one only seeks to optimise a single point of a (ΛCDM) model, it might be better to use class and simulated annealing, but for an entire one-dimensional profile, it is very beneficial to use connect instead, and for two-dimensional profiles requiring several hundred points, it might be necessary to use connect.If one seeks to perform a full frequentist analysis with triangle plots of one-and two-dimensional profiles, the task is virtually impossible at this point without using connect.

Conclusion and outlook
Using the ensemble basin-hopping algorithm for global optimisation combined with the BFGS optimiser for local optimisation has been demonstrated to yield robust, fast, and accurate results when calculating profile likelihoods from CMB data.By making use of gradients in the local optimiser, the number of function evaluations can be greatly decreased when compared to gradient-free simulated annealing.This, together with the much faster evaluation time of connect compared to class, results in speed-ups of several orders of magnitude when constructing profile likelihoods.In addition to being fast, the method is also very robust and accurate, and given the smoothness of the neural networks, the global optimisations often converge with only a few local optimisations.
When a neural network has been trained, the profile likelihoods in any parameter or set of parameters are computationally very inexpensive, and entire triangle plots of profile likelihoods, typically consisting of more than 5000 individual points in parameter space for which optimisation must be performed, are easily computed.Each such point in the parameter space is independent of all other points, making the optimisations embarrassingly parallelisable.The optimisation of each point takes around a minute on a single modern CPU core.A full triangle plot could therefore be computed on a normal quad-core laptop in less than a day.
With fast and easy access to profile likelihoods in cosmology, it is easy to investigate cosmological models with both Bayesian and frequentist statistics.Neither of these statistical frameworks can claim superiority compared to the other, but the two different approaches answer different questions and complement each other.It is therefore very useful to have both a posterior and a profile likelihood in order to get the full picture and draw reasonable conclusions about the given model.The biggest reason for this not being done frequently in analyses of cosmological models is that the computational costs of profile likelihoods are much greater than those of posteriors.Having a fast emulation tool for quick computations of both posteriors and profile likelihoods makes it much easier and more appealing to (re)introduce the frequentist approach in cosmology.
While the optimisation is typically extremely precise, the precision is limited by the emulation precision of the underlying network.Therefore, it is quite important to use welltrained and precise networks to derive profiles consistent with those obtained using class and simulated annealing.In this work, the same neural network has been used for all profile likelihoods and MCMC related to the ΛCDM model and the massive neutrino model due to its simple shape and only modest volume effects.The agreement with profile likelihoods from class is excellent for these two models and very precise out to several standard deviations in parameter space.For the DCDM model, the agreement with class is also quite good when using either individual networks for each point in the decay rate, Γ dcdm , or an annealed network with a large amount of training data, even though a small offset is visible between the profile likelihoods.This is, however, on a scale of ∆χ 2 ≈ 0.2 which is much too small to be significant, and parameter inference would not be affected by this small discrepancy.
The training data for the connect neural networks are sampled iteratively from the posterior, and with large volume effects, this can bias the accuracy of the network away from regions of maximal likelihood: Regions with a large volume are sampled much more than regions with a small volume, even though the small volumes might have better likelihood values.This creates a bias in the networks towards larger volumes due to these being much more represented in the training data.It is possible to overcome this problem using a larger network architecture, more training data, and more training epochs.The training data can also be weighted in a way that reduces the impact of data points with larger values of Γ dcdm on the total loss during training.However, it might be beneficial to pursue new ways of sampling training data more suited for profile likelihoods in future works.
Another thing to consider is the complexity of the likelihood function at this point.We evaluate a network to get C ℓ spectra, perform a cubic interpolation, and use a TensorFlow version of the Planck lite likelihood in order to go from cosmological parameters to a likelihood value, and this makes the computational graph of the gradients quite large in terms of memory.A way to speed up the computations even further and increase usability is to directly emulate the likelihood value of any given likelihood code.We then lose the dependency on having likelihood codes written in TensorFlow syntax or emulated separately.Most of the problems highlighted in this paper will most likely cease to exist with this approach of directly emulating likelihoods.An additional advantage is that any likelihood emulated in this way (e.g.BAO, SNI-a, etc.) will automatically lend itself to auto-differentiation, allowing the efficient combination of basin-hopping and BFGS local optimisation to be used.

Reproducibility.
We have used the publicly available connect framework available at https://github.com/AarhusCosmology/connect_public to create training data and train neural networks.The framework has been extended with the basin-hopping optimiser and a module for computing profile likelihoods.Explanatory parameter files have been included in the repository in order to easily use the framework and reproduce results from this paper.

B Recomputing and adding points
Since the optimisation routine, as described in section 4, has some level of stochasticity, the optimisation might fail by converging on a local optimum a small fraction of the time.By tweaking precision settings and hyper-parameters, the rate of failed optimisations can be greatly decreased, but there will always be some probability of not succeeding.With the default settings, a complete triangle plot of profile likelihoods containing ∼10 3 points to compute will result in only ∼10 1 failed points.This is difficult to detect automatically, but it is very easily seen when plotting the profile likelihoods.A routine to interactively choose failed points after plotting them has been implemented, and figure 8 shows the process of choosing points to recompute.These are then gathered in a file and recomputed.Given the low probability of getting stuck in a local optimum, the recomputation is almost always successful.In rare cases, a few points need to be recomputed twice, and if a specific point turns out to be particularly difficult to optimise, then the precision settings might need adjustment for that single optimisation.This is still much more feasible than running with very precise settings for all points in the profile likelihoods.
If the computed points, chosen according to the process described in appendix A, do not encapsulate the contour of a specific confidence level, additional points have to be selected and optimised.This is also difficult to choose automatically, but it is very easy to pick new points by looking at the contours and previously optimised points.A routine for interactively choosing new points has also been implemented, and using this, one can easily choose points based on the location of all current points and the contours based on those.Figure 9 shows the process of choosing new points since the 99.73% confidence region is not entirely encapsulated by the previously chosen points.The new points are gathered in a file and optimised.After the inclusion of the new optimised points, the contour line looks as it should and is fully represented by the total set of points.

Figure 1 .
Figure 1.The bottom panel shows the function value of the likelihood function written in the panel, while the top panel shows the resulting marginalised posterior and the profile likelihood, both scaled to a maximum of unity.The posterior is dominated by the shorter large Gaussian, while the profile is dominated by the taller small Gaussian.The two one-dimensional statistics thus reveal complementary information about the actual likelihood.
local optimiser (returns value and position) p( ⃗ θ) = Gaussian distribution from Cov( ⃗ θ) and ⃗ b (proposal distribution) T = 1.0 (sampling temperature) T min = minimal temperature (works as a tolerance) N = number of points in ensemble while T > T min do L array = zeros(N ) P array = zeros(N )

Figure 2 .
Figure 2.Posteriors and profile likelihoods of the ΛCDM model.The blue filled contours and the blue lines on the diagonal are the posteriors from an MCMC run with the neural network for one and two dimensions, respectively, and the red contour lines and the red stars on the diagonal are the profile likelihoods for one or two dimensions, respectively.The cyan star marks the best-fit point from a global optimisation of the entire neural network.

Figure 3 .
Figure 3.Posteriors and profile likelihoods of the massive neutrinos model.The blue filled contours and the blue lines on the diagonal are the posteriors from an MCMC run with the neural network for one and two dimensions, respectively, and the red contour lines and the red stars on the diagonal are the profile likelihoods for one or two dimensions, respectively.The cyan star marks the best-fit point from a global optimisation of the entire neural network.

Figure 5 .
Figure 5.Comparison between one-dimensional profile likelihoods using class and connect with simulated annealing and basin-hopping as optimisation routines, respectively.The profile likelihoods are in the parameters m ncdm and H 0 .The 68.27% and 95.45% confidence intervals are also shown.

Figure 6 .
Figure 6.The top panel shows profile likelihoods in the Γ dcdm parameter with both class (simulated annealing) shown with black circles and connect (basin-hopping and BFGS) using several individual networks for each fixed value of Γ dcdm shown with blue stars.The parameter vectors resulting from the connect optimisations have all been evaluated by class and plotted as a profile likelihood, shown with red diamonds.The middle panel shows profile likelihoods resulting from neural networks with Γ dcdm as a varying parameter where training data has been gathered at different sampling temperatures (see text).The bottom panel shows the same kinds of profiles as the middle panel, but with different amounts of training data for the networks.Only the temperatures T=1 and T=5 are shown in order to keep the figure simpler.

Figure 8 .Figure 9 .
Figure 8.The panels from left to right show the process of recomputing specific points in the twodimensional (ω cdm -H 0 )-profile likelihood if the optimiser converges on a local optimum.The points to recompute can be selected interactively and are then optimised again with a better result.
ΛCDM and massive neutrinos.We clearly see that the constraints from posteriors (blue) and profile likelihoods (red) are identical for ΛCDM (top panel) but differ somewhat for massive neutrinos (bottom panel).Along with the constraints, the best-fit point has been included as a centreline in the boxes.