Bayesian Cosmological inference beyond statistical isotropy

With advent of rich data sets, computationally challenge of inference in cosmology has relied on stochastic sampling method. First, I review the widely used MCMC approach used to infer cosmological parameters and present a adaptive improved implementation SCoPE developed by our group. Next, I present a general method for Bayesian inference of the underlying covariance structure of random fields on a sphere. We employ the Bipolar Spherical Harmonic (BipoSH) representation of general covariance structure on the sphere. We illustrate the efficacy of the method with a principled approach to assess violation of statistical isotropy (SI) in the sky maps of Cosmic Microwave Background (CMB) fluctuations. The general, principled, approach to a Bayesian inference of the covariance structure in a random field on a sphere presented here has huge potential for application to other many aspects of cosmology and astronomy, as well as, more distant areas of research like geosciences and climate modelling.


Introduction
With the different experimental data the cosmology has become a play room for the data analysts. In recent past we get data from WMAP, Planck, BICEP, BOSS etc and there are many new experiments coming up. We need to analyse the high volume of data and draw inferences from the data. For many data analysis process we require advanced numerical techniques. In this proceeding we discuss two cosmological parameter estimation methods in which we used some advanced numerical methods.
First, we demonstrate a cosmological parameter estimation for different theoretical model by comparing the C l with the observed C l . We develop a MCMC code, named as SCoPE [1,2,3,4,5], based on global Metropolis Hastings algorithm [6,7,8,9]. In general the Metropolis Hastings algorithm is pretty slow. Therefore, we have developed few novel technique that makes the MCMC method faster. In the first section we briefly discuss about these techniques that make SCoPE faster then standard Monte Carlo methods.
In the second part of the proceeding we have discussed about a Monte Carlo method that we use for Bayesian estimation of the isotropy violation in the CMB sky. For the first time, we develop an algorithm for fast basyan estimation of the isotropy violation on the CMB sky. We use Hamiltonion Monte Carlo for the estimation process. The stability of the algorithm was a big issue there. Therefore, we have applied different integration method for testing the stability of the method. We have discussed that in the second part of the proceeding.

Metropolis Hastings
For cosmological parameter estimation we use Metropolis Hastings (MH) algorithm. The Metropolis-Hastings (MH) is one of the most widely used MCMC sampler, in which the posterior i.e. π(θ) is sampled using a random walk. A standard Markov Chain at each step i, randomly chooses a candidate value θ i+1 from the proposal distribution q(.|θ i ). The candidate value only depends on the current data point θ i . The new data point is then accepted with probability α = min(1, π(θ i+1 )/π(θ i )). If the new data point rejected, the previous point is replicated by increasing its weight by +1. The chain of data-points thus generated, approximate the target posterior distribution π(θ). The proposal distribution is generally taken to be a Gaussian s is the step size. Theoretical optimum step size for an ideal distribution that provide the best acceptance rate is s = 2.4/ √ n for a n dimensional MCMC sampler. The covariance matrix is provided as an input to the program. As the exact covariance matrix is unknown before the analysis, in practice an approximate covariance matrix, often based on some previous analysis, is provided. If no prior information is available about the covariance between parameters then some approximate diagonal matrix is also often used. However, in such cases the acceptance rate of the sampler may reduce drastically and can be ensured to remain reasonable only by trial and error. Therefore a prior knowledge of the covariance is required. Parallelization of MH sampler is generally done by running multiple chains. Whether it is better to run a longer chain than running multiple short chains has been addressed and argued by many authors [10,11]. But, for running multiple parallel chains proper mixing between the chains has to be ensured. Therefore, each of the multiple chains has to be long enough so that the it can represent an unbiased sample of the population. Gelman-Rubin "R" statistics is generally used for testing the mixing of chains. For convergence, the chains have to be long enough such that R is very close to unity [12]. For practical purposes it is taken as R < 1.2. However, this criterion though necessary is often not sufficient for ensuring proper sampling. It is desirable to devise an implementation of MH algorithm that allows the individual chains to run in parallel and increase the acceptance rate of the models of a chain. Apart from that the mixing of the chains are also necessary. Therefore in SCoPE we have modified standard MH algorithm to accomplish effective parallelization through prefetching together with all other above mentioned features, namely, enhanced acceptance, regular covariance update from samples.

2.1.
Embellishing the standard Metropolis-Hastings algorithm 2.1.1. Prefetching In MCMC it is extremely useful to speed up the generation of a single chain, through parallelization rather than using multiple chains. In our work we make the individual chains parallel by precomputing several draws from the posterior distribution ahead of time via multiple evolution of models simultaneously in parallel and then use only the values that are needed. Prefetching is a draw level parallelization in a single chain. The method can be explained by taking the binary tree of a Metropolis algorithm as shown in Fig. 1. In a kth level binary tree there are total 2 k − 1 nodes, each of which represents a possible future state of a metropolis algorithm. The branches at the left child of any node represent the accepted steps and the right child represents the rejected states. If we have enough computational resources then all 2 k − 1 nodes can be evaluated simultaneously and k steps of a MCMC chain can be carried out in parallel simultaneously. Though the method of prefetching allows to parallelize a single chain, it only uses k steps out of 2 k − 1 computations. The rest of the computations are not utilized. Therefore, many argue against computing all the nodes of the binary tree. Rather if we know the acceptance rate at a point of time from the previously accepted data points, we can statistically identify and precompute only the most probable chain and hence avoid any unnecessary wastage of computation power. It is easy to see that if the acceptance probability at any point of time is less than 0.5 then the extreme right chain (1-3-7-15-..) of Fig. 1 Figure 1. Prefetching scheme is explained in the text with the help of above figure be most probable chain. In a similar manner if the acceptance rate is more than 0.5 then the extreme left chain (1-2-4-8-..) will be the most probable chain [13]. Therefore, by pre-evoluting only the most probable chain, we parallelize the code and at the same time we can manage the computational resources in a better way.

Delayed rejection
One of the major problems with the MCMC method is the choice of the step size for the proposal distribution. It will be better idea if the rejected sample from one step can be used to determine the proposal distribution for the next sample by varying stepsize accordingly. This increases acceptance rate but at a cost of violation of the Markovian property. But if we can find some method that can change the acceptance probability of the sample point to compensate the step size variation then that will be useful. The concept of delayed rejection can briefly be explained as follows. Suppose at some step i, the position of a chain is θ i = x. Suppose at this time a candidate y 1 is accepted from q 1 (x, y 1 ) and accepted with probability α 1 (x, y 1 ) = min(1, π(y 1 )q 1 (y 1 , x)/π(x)q 1 (x, y 1 )) as in the standard MH algorithm. For a Markov chain, q 1 (y 1 , x) is time symmetric, i.e. q 1 (y 1 , x) = q 1 (x, y 1 ). Therefore, the acceptance ratio only depends on the posterior. A rejection at any step suggests that there is a local bad fit of the correct proposal and a better one, q 2 (x, y 1 , y 2 ), can be constructed in light of this. But, in order to maintain the same stationary distribution the acceptance probability of the new candidate, y 2 , has to be properly computed. A possible way to reach this goal is to impose detailed balance separately at each stage and derive the acceptance probability that preserves it. It can be done by changing the acceptance probability as [14,15,16,17,18,19] α 2 (x, y 1 , y 2 ) = min 1, π(y 2 )q 1 (y 2 , y 1 )q 2 (y 2 , The Markovian property of the chain will not get destroyed, but still the sample choice can be made dependent on the previously accepted data point. We implement this in our algorithm.

Inter-chain covariance adaptation
The practical problem in implementing MH is the tuning problem of the proposal distribution q so that the sampling is efficient. One of the recent improvements in the MCMC efficiency is to introduce adaptive samplers. The adaptive MCMC uses the sample history and automatically tune the proposal distribution in the sampling process [20,21,22,23]. In adaptive metropolis algorithm, the covariance matrix from the samples obtained so far is used as the covariance of a Gaussian proposal. The most common parallelization scheme of MCMC method is to run parallel chains instead of running a single one. If in each chains proposal distribution is adapted using the local covariance matrix then the acceptance probability a chain may improve, however, the inter-chain mixing will not improve. If some chain stuck at some local minima then the local covariance matrix corresponding to that chain will be erroneous. So, in case of a local peak the local covariance update will give covariance corresponding to the local peak. In that case, the mixing of chains will slow down and sampling may not be proper. Therefore, we have adapted the concept of the inter-chain covariance update in the adaptation technique. We run several parallel chains, and randomly update the covariance matrix taking the data points accepted till then from all the chains. This means we have used the covariance after ith accepted step as Cov(θ 11 .., θ i1 , θ 12 , .., θ i2 ....θ 1n , ..., θ in ), where, n is the number of chains. This speeds up the mixing of the chains and covers the sample space faster. The value of the covariance matrix will freeze after few adaptations and hence we will be using same Gaussian proposal after few steps, which is important to guarantee proper sampling. However, if the adaptive covariance is not frozen, it may give rise to unfair sampling as the Gaussian proposal will vary between steps. Therefore, the process of the adaptive covariance calculation is only used for the initial burn in process and after that the adaptation is stopped, during which the covariance calculation attains partial convergence.

Estimation of Isotrypy violation with Hamiltonion Monte Carlo
The previous section we briefly demonstrate how the standard MH algorithm can be modified to make a faster parameter estimation algorithm. However, MH algorithm is mainly applicable when the number of parameters is small, of order of 10's. However, if the parameter space is high of the order of 10 6 -10 7 then MH is not at all efficient. To explore such a big parameter space we need to use Gibbs sampling or Hamiltonian Monte Carlo (HMC) etc. In the problem of estimating isotropy violation in CMB we use HMC [24,25,26,27].
In CMB the observed sky map is a convolution of the real sky temperature with the instrumental beam with an addition of instrumental noise. Therefore,T (γ) the actual temperature signal of the CMB sky along the direction γ is linearly related to the observed sky temperature,d(γ),asd (γ) =B(γ, γ 0 )T (γ 0 )dΩγ 0 +Ñ (γ) (2) where B(γ, γ 0 ) is the instrumental beam profile andÑ (γ) is the instrumental noise. For a perfectly circular beam profile, B(γ, γ 0 ) ≡ B(γ·γ 0 ), assumed in this work, it is easy to deconvolve the effect of the beam after inferring the power spectra. However, if the beam is not circular symmetric then the effect of the beam depends on the full scan pattern of the experiment and its deconvolution may be non-trivial [28,29,30,31,32]. For data defined on a sphere, it is convenient to work in the spherical harmonic space. The CMB signal,T (γ), then can be expanded in terms of spherical harmonics as where Y lm (γ) are the spherical harmonic functions and a lm are the coefficients in the spherical harmonic basis. Similarly, the observed data,d(γ), can also be expanded in spherical harmonics with coefficients, d lm . For a perfectly statistically isotropic sky, the two point correlation function on sky can be expressed in terms of the angular power spectrum, C l , alone as a lm a * l m = S lml m = C l δ ll δ mm (4) Here . . . denotes the ensemble average. However, in presence of SI violation, C l does not provide a full description of the covariance matrix. A general covariance matrix can be expanded in the BipoSH representation as are the BipoSH spectra that provide a natural generalisation of the angular power spectrum.
The observed sky map contains the noise with the CMB signal and our goal is to calculate the posterior distribution of A LM ll from the observed sky map, i.e. P (A LM ll |d lm ) or P (S lml m |d lm ). However, its difficult to calculate this Probability directly as there are no direct method to find out the a lm s. Therefore, in practice instead of finding this distribution, people find a joint probability distribution of the S lml m and a lm , i.e. P (S lml m , a lm |d lm ) and then marginalize over a lm. . This can be obtained directly by using the Bayesian Technique P (S lml m , a lm |d lm ) = P (d lm |a lm )P (a lm |S lml m )P (S lml m ) P (S lml m ) is the prior on S lml m . Here we use a flat prior on S lml m i.e. P (S lml m ) = 1. We also consider a isotropic noise and N l is the noise power spectrum.
We tried to sample a lm amd A LM ll simultaneously from a given map d lm and noise covariance matrix N l . Therefore we have of the order of 10 7 variables.

Calculating the equations of motion
Hamiltonian Monte Carlo (HMC) technique based on the Classical Hamiltonian Mechanics relies on the fact that the density of a group of particles with random momenta placed in a potential will trace the potential given that all of them start from random momentum drawn from a normal distribution with mean 0 and co-variance M , where M is a positive definite matrix called the mass matrix and can be chosen independently. It is known that HMC method can sample the distribution more effectively even in very high dimensional space in comparison to other conventional MCMC methods. In a Hamiltonian Monte Carlo algorithm we need to define a conjugate momentum and a mass corresponding to each of its parameters. We consider the conjugate momentum to a lm and A LM ll are p a lm and p A LM ll and a corresponding mass m a lm , m A LM l 1 l 2 respectively. The mass matrices are the positive definite quantity by their definition. Thus the Hamiltonian for the motion of this ensemble of particles is Using Hamiltonian mechanicsȧ andṗ Similarly, the equation of motion for A LM ll will bė HMC is performed in two steps. In the first step, values of the momentum variables are chosen from the Gaussian distribution of mean 0 and variance m x , where x ∈ (a lm , A LM ll ). In the next step a Matropolis update is performed from the state (p a lm , p A LM ll , a lm , A LM ll ) to a new state (p 1 a lm , p 1 decides the stability of the integration process. HMC algorithm in general uses the Leapfrog integration algorithm due to its time reversal symmetry and almost symplectic nature. However, in our calculation we have used a fourth order Forest and Ruth integrator , which is a symplectic integrator that involves three Leapfrog steps, works better then a simple Leapfrog.
3.2. Computational implementation 3.2.1. Overcoming the time complexity There are practical limitations arising from carrying out the computationally challenging analysis in reasonable time. The first practical issue that we face is related to inverting the covariance matrix. A brute force inversion of the matrix is computationally prohibitive. The inversion of S −1 lml m a l m can be done by using Gauss Seidel method. Brute force inversion of S lml m is difficult and time consuming except for small l max . However, using the fact that, in case of CMB signal the off-diagonal components of the matrix S lml m are expected to be much smaller than the diagonal components dominated by A 00 ll , we can use Taylor series expansion to invert the matrix. Taylor series up to the first order gives us considerably good results.

Stability of numerical integration
Another computational issue is the choice of the numerical integration method and the mass matrix. In Hamiltonian integrators, though Leapfrog integrator is common because the integrator preserves the Hamiltonian in phase space (symplectic), the propagation error being huge we have to use a fourth order symplectic integrator, namely Forest and Ruth integrator [25,33], which performs better and the propagational errors are contained at a manageable level. Forest-Ruth algorithm is a combination of three Leapfrog steps. It can be shown that for the absolute stability of the integration process we need to take the mass matrix of the particles to the inverse of the covariance matric of the corresponding quantities [24].

Conclusion
In this proceeding we discuss two MCMC techniques that we use for cosmological parameter estimation and discuss the techniques which we use to make the algorithm faster. In our modified global Metropolis Hastings algorithm, SCoPE, the individual chains can run in parallel and a rejected sample can be used to locally modify the proposal distribution without violating the Markovian property. The latter increases the acceptance probability of the samples in chains. The prefetching algorithm allows us to increase the acceptance probability as much as required, provided requisite number of multiple cores are available in the computer. Apart from these, due to the introduction inter-chain covariance update the code can start without specifying any input covariance matrix. The mixing of the chains is also faster in SCoPE. In the second part of the proceeding we discuss about a general method to infer the underlying covariance structure of a CMB sky using a completely Bayesian technique. We employ the Bipolar Spherical Harmonic representation of the general covariance matrix underlying random fields on a sphere and outline the method for a fully Bayesian inference of the angular power spectrum and the BipoSH coefficients simultaneously from a single observed map. We use Hamiltonian Monte Carlo for sampling the posterior distributions of the BipoSH parametrization of the covariance. We also discuss the time complexity and choice of the mass matrix and integration method for the HMC algorithm that can provide stable numerical integration.