The Kullback–Leibler Divergence and the Convergence Rate of Fast Covariance Matrix Estimators in Galaxy Clustering Analysis

We present a method to quantify the convergence rate of fast estimators of the covariance matrix in large-scale structure analysis. Our method is based on the Kullback–Leibler (KL) divergence, which describes the relative entropy of two probability distributions. As a case study, we analyze the delete-d jackknife estimator for the covariance matrix of the galaxy correlation function. With the help of a set of baseline covariance matrices, we introduce the information factor, or normalized KL divergence, to diagnose the information contained in the jackknife covariance matrix. Using a set of quick particle mesh mock catalogs designed for the Baryon Oscillation Spectroscopic Survey DR11 CMASS galaxy survey, we find that the jackknife resampling method recovers the covariance matrix with 10 times fewer simulation mocks than the baseline method at small scales ($s \le 40\,h^{-1}$ Mpc). However, the ability to reduce the number of mock catalogs is degraded at larger scales due to the increasing bias on the jackknife covariance matrix. Note that the analysis in this paper can be applied to any fast estimator of the covariance matrix for galaxy clustering measurements.


Introduction
The covariance matrix plays an important role in the data analysis of the galaxy large-scale structure and contains important information on the statistical and systematic errors on the data. An accurate covariance matrix is crucial for correctly propagating the errors on the data to the errors on the inferred cosmological parameters (Hartlap et al. 2007; Dodelson & Schneider 2013; Taylor et al. 2013; Percival et al. 2014; Taylor & Joachimi 2014). However, we usually do not know the true covariance matrix from first principles. Instead, the standard way is to estimate the covariance matrix from the data themselves or from artificial or mock catalogs (Reid et al. 2010; Manera et al. 2013, 2015; Anderson et al. 2014; Gil-Marín et al. 2016). The mock catalogs are created to follow the statistical properties of the data set as closely as possible and include the diverse observational effects (Manera et al. 2015). A large number of mock catalogs is required to reduce the statistical errors on the covariance matrix (Percival et al. 2014). The creation and analysis of mock catalogs have become one of the most computationally expensive steps in modern galaxy clustering analysis (Monaco et al. 2002, 2013; Scoccimarro & Sheth 2002; Manera et al. 2013, 2015; Tassev et al. 2013; Kitaura et al. 2014, 2015; White et al. 2014; Chuang et al. 2015; Feng et al. 2016; Balaguera-Antolínez et al. 2019), especially for the ongoing and upcoming next-generation galaxy surveys such as the Dark Energy Survey (Frieman & Dark Energy Survey Collaboration 2013), the Dark Energy Spectroscopic Instrument survey (DESI; Schlegel et al. 2011), the Large Synoptic Survey Telescope survey (LSST Science Collaboration et al. 2009), and the Euclid satellite mission surveys (Laureijs et al. 2011).
Numerous efforts have been devoted to finding alternatives that yield reliable estimates of the covariance matrix quickly and accurately. In real observations, the covariance matrix involves complex effects from the galaxy evolution, the scale-dependent and non-Poissonian shot noise, the stochastic bias, and the redshift space distortion (Takahashi et al. 2009; Zhang et al. 2013; Li et al. 2014; Blot et al. 2015; Shi et al. 2016; Zheng & Song 2016; Howlett & Percival 2017; Klypin & Prada 2018). Theoretical modeling of the covariance matrix has achieved great progress on the dark matter power spectrum (Neyrinck 2011; Mohammed & Seljak 2014; Carron et al. 2015; Bertolini et al. 2016; Grieb et al. 2016; Mohammed et al. 2017; Hikage et al. 2020; Taruya et al. 2021), the galaxy power spectrum (e.g., Lacasa & Kunz 2017; Sugiyama et al. 2020), and the galaxy correlation function (e.g., Philcox et al. 2020; Rashkovetskyi et al. 2023). Wadekar & Scoccimarro (2020) have proposed a promising analytical method to compute the covariance matrix of galaxy power spectrum multipoles including various theoretical and observational effects. Their results show that the analytic approach has the benefit of being free of sampling noise and of saving the computational resources needed to recompute covariances in the model fitting process (Wadekar et al. 2020).
Meanwhile, many methods have been proposed to reduce the number of mock catalogs or the size of the simulation boxes that are required to obtain a reliable and accurate covariance matrix estimate. Some of them are based on fitting the estimated covariance from a small number of mocks to an empirical model with several free parameters (Pope & Szapudi 2008; O'Connell et al. 2016; Pearson & Samushia 2016). O'Connell & Eisenstein (2019) extended the method, fitting a jackknife covariance matrix from a single survey volume to obtain the fitting parameters without reference to any mocks in real analysis. Howlett & Percival (2017) proposed a method to reduce the size of the simulation box and to correct for the supersample covariance and the window function effect analytically. Other approaches aim to reduce the number of mock catalogs by the resampling method (Norberg et al. 2009; Schneider et al. 2011; Arnalte-Mur & Norberg 2014; Escoffier et al. 2016; Mohammad & Percival 2022) or the tapering method (Paz & Sánchez 2015).
An important step in assessing the efficiency of a covariance matrix estimator for the large-scale structure is to calculate the convergence rate, i.e., the number of mock catalogs needed to obtain a covariance matrix equivalent to the brute-force sample variance from a given number of mock catalogs. Two factors matter in this case: the noise and the bias. The noise level is usually represented by the mean variance of the elements of the covariance matrix and is commonly used to estimate the convergence rate. However, the different parts of the covariance matrix, most notably the diagonal and off-diagonal terms, do not play equal roles in the parameter fitting process. Compared to the dominant diagonal terms, the off-diagonal terms are usually much smaller but contain critical information on the mode coupling and the window function. The mean variance of the elements cannot distinguish between them. The bias on the estimator of the covariance matrix also plays an important role in the parameter fitting process and should be recognized. Although the biases on the diagonal elements are easy to show, those on the off-diagonal elements are not, because these elements have small values and high noise levels. Furthermore, it is the precision matrix, i.e., the inverse of the covariance matrix, that appears in the likelihood function. The matrix inversion is a nonlinear process. It mixes the diagonal and off-diagonal elements of the covariance matrix and thus makes the effect of the noise and bias much more complicated. We therefore need alternatives to quantify the performance of the estimator of the covariance matrix.
In this paper, we present a simple method to estimate the convergence rate of covariance matrix estimators efficiently. Since we are not comparing two arbitrary matrices but instead two Gaussian likelihood functions characterized by the two covariance matrices, there is a prominent tool to accomplish our goal, the Kullback–Leibler (KL) divergence (Kullback & Leibler 1951). The KL divergence measures the relative entropy between two probability distributions, and it can describe how different two covariance matrices are in the sense of the Gaussian likelihood functions. It has been adopted in several studies of the convergence of covariance matrices from fast methods (e.g., O'Connell et al. 2016; Lippich et al. 2019; Philcox et al. 2020). In our study, we apply the method to a recently proposed covariance matrix estimator of the galaxy correlation functions, which combines the delete-d jackknife resampling and the mock catalogs (Escoffier et al. 2016).
This paper is organized as follows. In Section 2, we introduce the KL divergence and its application to the convergence of covariance matrix estimators. In Section 3, we describe the data set and the two methods to estimate the covariance matrix of the galaxy correlation function, i.e., the brute-force method and the jackknife resampling. In Section 4, we first test the KL divergence using the brute-force covariance matrices from different numbers of mock catalogs. We then apply the KL divergence test to the covariance matrices from the jackknife resampling method to estimate its convergence rate. We close the paper with a brief discussion and summary in Section 5.

KL Divergence
The KL divergence from the probability distribution $Q$ to the reference probability distribution $P$ is a measure of how much $Q$ diverges from $P$. It is defined as

$$ {\rm KL}(P|Q) = \int P(x)\,\ln\frac{P(x)}{Q(x)}\,dx. \tag{1} $$

The KL divergence is non-negative, that is, ${\rm KL}(P|Q) \ge 0$.
The equality holds if and only if $P = Q$. Another important property of the KL divergence is that it is asymmetric in general, ${\rm KL}(P|Q) \neq {\rm KL}(Q|P)$. In Bayesian language, the KL divergence measures the information loss when one uses $Q$ (usually a model) to approximate $P$ (the "true" distribution; Baez & Fritz 2014).
If $P$ and $Q$ are both multivariate normal distributions with the same mean, the KL divergence simplifies to

$$ {\rm KL}(P|Q) = \frac{1}{2}\left[{\rm Tr}\left(C_Q^{-1} C_P\right) - N - \ln\frac{\det C_P}{\det C_Q}\right], \tag{2} $$

where $C_P$ and $C_Q$ are the covariance matrices of $P$ and $Q$, respectively, and $N$ is the dimension of the random variables or data vectors concerned. ${\rm Tr}(A)$ represents the trace of matrix $A$, and $\det A$ is its determinant. In the following, the distributions $P$ and $Q$ are always assumed to be multivariate normal distributions.
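As a minimal sketch of the equal-mean Gaussian case, the function below evaluates KL(P|Q) = [Tr(C_Q^{-1} C_P) − N + ln det C_Q − ln det C_P]/2 with NumPy. The function name `gaussian_kl` and the two small test matrices are illustrative assumptions, not part of the original analysis.

```python
import numpy as np

def gaussian_kl(cov_p, cov_q):
    """KL(P|Q) for two multivariate normal distributions with the same mean."""
    cov_p = np.asarray(cov_p, dtype=float)
    cov_q = np.asarray(cov_q, dtype=float)
    n_dim = cov_p.shape[0]
    # Tr(C_Q^{-1} C_P) through a linear solve, avoiding an explicit inverse.
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    # slogdet returns (sign, log|det|) and is more stable than log(det(...)).
    _, logdet_p = np.linalg.slogdet(cov_p)
    _, logdet_q = np.linalg.slogdet(cov_q)
    return 0.5 * (trace_term - n_dim + logdet_q - logdet_p)

# Toy 2x2 covariance matrices (hypothetical values).
c_p = np.array([[1.0, 0.2], [0.2, 1.0]])
c_q = np.array([[2.0, 0.0], [0.0, 0.5]])
kl_pq = gaussian_kl(c_p, c_q)  # divergence from Q to P
kl_qp = gaussian_kl(c_q, c_p)  # swapped roles; differs in general
```

Evaluating both orderings on the same pair of matrices makes the asymmetry of the divergence explicit.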

Sample Variance of Gaussian-distributed Data
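In data analysis of the cosmological large-scale structure, we usually estimate the covariance matrix of measured data from the sample variance of a large number of independent simulation mock catalogs. If the data follow a multivariate normal distribution, the estimated covariance matrix $\hat{C}$ follows a Wishart distribution (Wishart 1928),

$$ P(\hat{C}\,|\,\Sigma, n) = \frac{|\hat{C}|^{(n-p-1)/2}\,\exp\left[-\frac{n}{2}\,{\rm Tr}\left(\Sigma^{-1}\hat{C}\right)\right]}{2^{np/2}\,|\Sigma/n|^{n/2}\,\Gamma_p(n/2)}, \tag{3} $$

where $N_m$ is the number of mock samples, $n = N_m - 1$ is the degrees of freedom, $\Sigma$ is the expectation value of the covariance matrix, $E(\hat{C})$, and $p$ is the dimension of $\hat{C}$. $\Gamma_p(x)$ is the multivariate gamma function. Considering that $\hat{C}_P$ and $\hat{C}_Q$ are the sample variances of two subsets of the same parent simulation mock catalogs, the expectation value of the KL divergence from $Q$ (with the covariance matrix $\hat{C}_Q$) to $P$ (with the covariance matrix $\hat{C}_P$) can be calculated as

$$ \langle {\rm KL}(P|Q) \rangle = \frac{1}{2}\left\{\frac{p\,(N_Q-1)}{N_Q-p-2} - p + p\ln\frac{N_P-1}{N_Q-1} + \sum_{i=1}^{p}\left[\psi\left(\frac{N_Q-i}{2}\right) - \psi\left(\frac{N_P-i}{2}\right)\right]\right\}, \tag{4} $$

where $N_P$ and $N_Q$ are the numbers of mock catalogs in the subsets used to calculate $\hat{C}_P$ and $\hat{C}_Q$, respectively, and $\psi$ is the digamma function. To derive the above equation, we have used the fact that the expectation values of $\hat{C}_P$ and $\hat{C}_Q$ are the same.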
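The expectation value under the Wishart assumption can be cross-checked numerically. The sketch below assumes two independent sample covariance matrices with the same expectation, so that ⟨Tr(Ĉ_Q^{-1} Ĉ_P)⟩ = p(N_Q − 1)/(N_Q − p − 2) and ⟨ln det Ĉ⟩ involves digamma functions, and compares the resulting prediction with a small Monte Carlo experiment on unit-variance Gaussian data. The helper names (`digamma`, `expected_kl`, `monte_carlo_kl`) and the toy dimensions are our own choices for illustration.

```python
import math
import numpy as np

def digamma(x):
    """Digamma function for x > 0 (upward recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def expected_kl(n_p, n_q, p):
    """Predicted <KL(P|Q)> for sample covariances from n_p and n_q mocks."""
    if n_q <= p + 2 or n_p <= p:
        raise ValueError("need n_q > p + 2 and n_p > p")
    trace_term = p * (n_q - 1.0) / (n_q - p - 2.0)
    log_term = p * math.log((n_p - 1.0) / (n_q - 1.0))
    psi_term = sum(digamma(0.5 * (n_q - i)) - digamma(0.5 * (n_p - i))
                   for i in range(1, p + 1))
    return 0.5 * (trace_term - p + log_term + psi_term)

def monte_carlo_kl(n_p, n_q, p, n_trials=2000, seed=0):
    """Average KL(P|Q) over independent pairs of Gaussian sample covariances."""
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(n_trials):
        cov_p = np.cov(rng.standard_normal((n_p, p)), rowvar=False)
        cov_q = np.cov(rng.standard_normal((n_q, p)), rowvar=False)
        trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
        logdet_p = np.linalg.slogdet(cov_p)[1]
        logdet_q = np.linalg.slogdet(cov_q)[1]
        total += 0.5 * (trace_term - p + logdet_q - logdet_p)
    return total / n_trials

predicted = expected_kl(50, 50, 4)
simulated = monte_carlo_kl(50, 50, 4)
```

For N_P = N_Q the logarithmic and digamma terms cancel, so the prediction reduces to p(p + 1)/[2(N_Q − p − 2)].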

Biased Covariance Matrix
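Suppose that the expectation value of the estimated covariance matrix $\hat{\tilde{C}}_Q$ has a linear bias $\alpha$ with respect to the expectation value of $\hat{C}_Q$, that is, $\tilde{\Sigma}_Q = \alpha\,\Sigma_Q$. The expectation value of the KL divergence from the multivariate normal distribution $Q$ with the covariance matrix $\hat{\tilde{C}}_Q$ to $P$ will then be

$$ \langle {\rm KL}(P|Q) \rangle_{\alpha} = \langle {\rm KL}(P|Q) \rangle_{\alpha=1} + \Delta_{\rm KL}(\alpha), \tag{5} $$

where the first term on the right-hand side is the unbiased expectation value of Section 2.1 and $\Delta_{\rm KL}(\alpha)$ is defined as

$$ \Delta_{\rm KL}(\alpha) = \frac{p}{2}\left[(1+a)\left(\frac{1}{\alpha}-1\right) + \ln\alpha\right], \qquad a \equiv \frac{p+1}{N_Q-p-2}. \tag{6} $$

$\Delta_{\rm KL}(\alpha)$ has a minimal negative value of $\frac{p}{2}\left[\ln(1+a) - a\right]$, reached at $\alpha = 1 + a$, which is close to 0 when the dimension of the covariance matrix is much smaller than the number of mock realizations, i.e., $p \ll N_Q$. When $N_Q \to \infty$, $a \to 0$. The bias on the covariance matrix then sets a lower limit on the KL divergence through $\Delta_{\rm KL}(\alpha)$, which is strictly positive when $\alpha \neq 1$.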
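The behavior of the bias term can be explored with a few lines of code. The sketch below assumes that the biased estimator is simply α times a Wishart-distributed sample covariance; rescaling Ĉ_Q by α in the Gaussian KL divergence then changes the trace term by a factor 1/α and adds p ln α, which gives Δ_KL(α) = (p/2)[(1 + a)(1/α − 1) + ln α] with a = (p + 1)/(N_Q − p − 2). This closed form is our reading of the setup above, and the function name `delta_kl` and the numbers used are hypothetical.

```python
import math

def delta_kl(alpha, n_q, p):
    """Extra expected KL divergence caused by a linear bias alpha on C_Q."""
    a = (p + 1.0) / (n_q - p - 2.0)  # finite-sample correction; -> 0 as n_q grows
    return 0.5 * p * ((1.0 + a) * (1.0 / alpha - 1.0) + math.log(alpha))

p, n_q = 40, 200
a = (p + 1.0) / (n_q - p - 2.0)
at_unbiased = delta_kl(1.0, n_q, p)      # no bias, no extra divergence
at_minimum = delta_kl(1.0 + a, n_q, p)   # slightly negative minimum
large_sample = delta_kl(1.2, 10**9, p)   # noise-free limit with a 20% bias
```

In the large-sample limit the term reduces to (p/2)(1/α − 1 + ln α), which vanishes only at α = 1; this is the floor that a biased estimator cannot go below no matter how many mocks are used.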

Data Sets
We apply the KL divergence test to the covariance matrix calculated from the delete-d jackknife resampling. We discuss the galaxy mock catalogs that we use to calculate the jackknife covariance matrix in Section 3.1. We show the calculated correlation functions and their covariance matrices in Sections 3.2 and 3.3, respectively.

Quick Particle Mesh Mock Samples
In this paper, we use the publicly released mock catalogs from the Baryon Oscillation Spectroscopic Survey (BOSS) collaboration. These mocks are generated using the quick particle mesh (QPM) method (White et al. 2014) with low mass and force resolution. The simulations are run in a flat ΛCDM cosmology with parameters of $\Omega_m = 0.29$, $h = 0.7$, $\Omega_b = 0.0458$, $\sigma_8 = 0.8$, and $n_s = 0.97$. The cubic simulation box has a side length of 2560 $h^{-1}$ Mpc and contains $1280^3$ particles. Halos are assigned to a subset of the simulation particles, which are chosen based on their smoothed local density. The halo masses are then sampled with a density-dependent probability to match the halo mass function and the large-scale bias from the reference high-resolution N-body simulations. The galaxies are populated in the resolved halos using the halo occupation distribution (HOD) approach (e.g., Wechsler & Tinker 2018).
The HOD parameters are adjusted to fit the small-scale projected two-point correlation function of the BOSS DR11 CMASS galaxies. The galaxies are further downsampled based on the radial selection function and the angular mask of the BOSS DR11 CMASS survey on the north Galactic cap, which covers 6391 deg$^2$ and extends over a wide redshift range of $0.43 < z < 0.70$ (Beutler et al. 2014). For more information on the QPM galaxy mock catalogs, we refer the reader to White et al. (2014).

Two-point Correlation Function
We measure the galaxy two-point correlation function of the QPM mocks based on the Landy & Szalay (1993) method,

$$ \xi(s,\mu) = \frac{DD(s,\mu) - 2\,DR(s,\mu) + RR(s,\mu)}{RR(s,\mu)}, \tag{7} $$

where $s$ is the separation between two galaxies and $\mu$ is the cosine of the angle between the galaxy separation vector and the line-of-sight vector. Here we define the line of sight for each pair of galaxies as the direction of the vector passing through the midpoint of the pair separation and the observer, $\boldsymbol{h} = (\boldsymbol{s}_1 + \boldsymbol{s}_2)/2$, with $\boldsymbol{s}_{1,2}$ being the position vectors of galaxy 1 and galaxy 2. $DD(s,\mu)$ is the number of galaxy-galaxy pairs whose separation is located in the $(s,\mu)$ bin, normalized by the total number of pairs. $DR(s,\mu)$ and $RR(s,\mu)$ are the numbers of galaxy-random and random-random pairs, respectively. For each galaxy mock sample, we generate randomly distributed points that follow the radial selection function and the angular mask of the BOSS DR11 CMASS survey. We set the number of random points to be 10 times that of the mock galaxies.
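The multipoles of the correlation function are calculated by expanding the 3D correlation function using the Legendre polynomials $P_l(\mu)$, i.e.,

$$ \xi_l(s) = \frac{2l+1}{2}\int_{-1}^{1}\xi(s,\mu)\,P_l(\mu)\,d\mu, \tag{8} $$

where $P_0(\mu) = 1$ and $P_2(\mu) = (3\mu^2 - 1)/2$. We focus on the monopole ($l = 0$) and quadrupole ($l = 2$) in the following. To do the above integration, we use 20 $\mu$ bins with an equal bin width of 0.05. Our data vector is $\boldsymbol{\xi} = (\boldsymbol{\xi}_0, \boldsymbol{\xi}_2)$, where

$$ \boldsymbol{\xi}_l = \left(\xi_l(s_1), \xi_l(s_2), \ldots, \xi_l(s_N)\right), \tag{9} $$

and $N$ is the number of $s$ bins.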
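The estimator and the multipole projection can be sketched as follows. We assume normalized pair counts binned on an (s, μ) grid with 20 equal-width μ bins covering 0 ≤ μ ≤ 1, and we use the symmetry of the pair counts in μ to replace (2l + 1)/2 times the integral over [−1, 1] by (2l + 1) times the integral over [0, 1] for even l. The function names and the toy input are hypothetical; the pair counting itself is not shown.

```python
import numpy as np

def landy_szalay(dd, dr, rr):
    """Landy-Szalay estimator from normalized pair counts on an (s, mu) grid."""
    dd, dr, rr = (np.asarray(x, dtype=float) for x in (dd, dr, rr))
    return (dd - 2.0 * dr + rr) / rr

def multipoles(xi_smu, mu_edges):
    """Monopole and quadrupole of xi(s, mu) by a midpoint sum over mu bins."""
    mu_edges = np.asarray(mu_edges, dtype=float)
    mu_mid = 0.5 * (mu_edges[:-1] + mu_edges[1:])
    d_mu = np.diff(mu_edges)
    legendre = {0: np.ones_like(mu_mid), 2: 0.5 * (3.0 * mu_mid**2 - 1.0)}
    # Even multipoles: (2l+1)/2 * int_{-1}^{1} equals (2l+1) * int_{0}^{1}.
    return {ell: (2 * ell + 1) * np.sum(xi_smu * p_l * d_mu, axis=1)
            for ell, p_l in legendre.items()}

# Toy input: 20 s bins of width 2 and 20 mu bins of width 0.05.
s_mid = np.arange(1.0, 40.0, 2.0)
mu_edges = np.linspace(0.0, 1.0, 21)
mu_mid = 0.5 * (mu_edges[:-1] + mu_edges[1:])
xi0_true = 1.0 / s_mid**2
xi2_true = -0.5 / s_mid**2
xi_smu = xi0_true[:, None] + xi2_true[:, None] * 0.5 * (3.0 * mu_mid**2 - 1.0)

xi_l = multipoles(xi_smu, mu_edges)
data_vector = np.concatenate([xi_l[0], xi_l[2]])  # (xi_0, xi_2), length 2N
```

With 20 μ bins the midpoint sum recovers the input monopole and quadrupole of this toy model to well below one percent.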
We mainly study the correlation function multipoles in the scale range of 0 $h^{-1}$ Mpc $< s <$ 40 $h^{-1}$ Mpc with the bin size $\Delta s = 2\,h^{-1}$ Mpc, where the jackknife method (Xu et al. 2023) can still give a satisfactory covariance matrix compared to the more sophisticated HOD-based (Yu et al. 2022) or emulator-based (Chapman et al. 2022; Yuan et al. 2022) methods. For the large-scale structure analysis, we choose two scale ranges, the intermediate scale of 20 $h^{-1}$ Mpc $< s <$ 80 $h^{-1}$ Mpc with $\Delta s = 4\,h^{-1}$ Mpc and the large scale of 24 $h^{-1}$ Mpc $< s <$ 160 $h^{-1}$ Mpc with $\Delta s = 8\,h^{-1}$ Mpc. On these large scales, the survey window effect becomes important. The large-scale clustering breaks the independence among different jackknife subregions, which introduces a bias on the jackknife covariance matrix that grows with scale.

The Covariance Matrix
Based on the QPM mocks, we can calculate the baseline covariance matrix from the brute-force method and take it as the true covariance. To show the convergence rate of the fast covariance matrix estimators based on the KL divergence, we choose the delete-d jackknife resampling method as a case study.

Brute-force Covariance Estimation
The baseline covariance matrix is estimated from the variance of independent mock samples drawn from the fiducial cosmological model, which we denote as the brute-force method. In our case, each QPM mock sample has the same observational effects as the real observation, including the complex survey geometry and the completeness effect.
The brute-force covariance estimation of the correlation function multipoles over mock catalogs is calculated using the following formula:

$$ \hat{C}_{ij} = \frac{1}{N_m-1}\sum_{k=1}^{N_m}\left(\xi_i^{(k)} - \bar{\xi}_i\right)\left(\xi_j^{(k)} - \bar{\xi}_j\right), \tag{10} $$

where $\bar{\xi}_i$ is the mean of $\xi_i^{(k)}$ over the mocks and $i, j = 1, 2, \ldots, N, N+1, \ldots, 2N$, with the first (last) $N$ elements corresponding to the $N$ radial bins of the monopole (quadrupole) correlation function. The superscript $k$ enclosed in parentheses denotes the mock index, and $N_m$ is the total number of mocks.
The brute-force covariance matrix obtained above, as a random variable, will follow a Wishart distribution if the correlation functions follow a multivariate Gaussian distribution. The Wishart distribution can be characterized by the degrees of freedom $N_m - 1$, the dimension of the data vector or the number of bins $p$, and the true or expected covariance matrix $\Sigma$.
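A minimal sketch of the brute-force estimate, assuming the measured multipoles of all mocks are stacked in an array of shape (N_m, 2N), is given below. The array name `xi_mocks` and the random toy data are placeholders; the normalization 1/(N_m − 1) matches the N_m − 1 degrees of freedom quoted above.

```python
import numpy as np

def brute_force_cov(xi_mocks):
    """Sample covariance over mocks; xi_mocks has shape (n_mocks, n_bins)."""
    xi_mocks = np.asarray(xi_mocks, dtype=float)
    n_mocks = xi_mocks.shape[0]
    diff = xi_mocks - xi_mocks.mean(axis=0)  # subtract the mean over mocks
    return diff.T @ diff / (n_mocks - 1)

# Toy data standing in for (xi_0, xi_2) measured on 100 mocks with N = 5.
rng = np.random.default_rng(1)
xi_mocks = rng.normal(size=(100, 10))
cov_bf = brute_force_cov(xi_mocks)
```

The result is identical to `np.cov(xi_mocks, rowvar=False)`; the explicit form is written out only to mirror the formula.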

Jackknife Resampling
Unlike the brute-force covariance matrix, which needs a large number of mock samples, the jackknife covariance matrix can be calculated directly from the observational data and so can greatly reduce the computational cost. However, practical analysis shows that the jackknife covariance matrix suffers from a large noise level and cannot meet the requirements of future large-scale structure analysis. It has been proposed that applying the jackknife technique to individual mock samples and averaging over them can greatly enhance the precision of the jackknife covariance matrix estimation (Escoffier et al. 2016). The performance of the jackknife covariance matrix can be further improved by using the delete-d jackknife technique.
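The traditional jackknife covariance matrix is calculated by dividing the observational data into $N_s$ subregions. The jackknife samples are constructed by deleting one subregion at a time. We calculate the correlation function for each jackknife sample. The covariance matrix is calculated by

$$ \hat{C}^{\rm J1}_{ij} = \frac{N_s-1}{N_s}\sum_{k=1}^{N_s}\left(\xi^{k}_{{\rm J1},i} - \bar{\xi}_{{\rm J1},i}\right)\left(\xi^{k}_{{\rm J1},j} - \bar{\xi}_{{\rm J1},j}\right), \tag{11} $$

where $\xi^{k}_{\rm J1}$ is the correlation function for the $k$th jackknife sample, and $\bar{\xi}_{\rm J1}$ is the mean correlation function averaged over all the jackknife samples.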
The above covariance matrix is shown to be not optimal for nonsmooth or nonlinear statistics (Wu 1986), which tends to be true for our case due to the effects from the window function and the redshift-dependent galaxy selection function. Shao & Wu (1989) proposed to delete $d$ subsamples, instead of one, at a time to construct the jackknife samples and proved that this gives an asymptotically unbiased covariance matrix for the case with nonsmooth statistics when $N_s - d \to \infty$ and $d/\sqrt{N_s} \to \infty$. This is called the delete-d jackknife resampling. The delete-d jackknife covariance matrix is calculated by

$$ \hat{C}^{\rm JK}_{ij} = \frac{N_s-d}{d\,N_{\rm JK}}\sum_{k=1}^{N_{\rm JK}}\left(\xi^{k}_{{\rm JK},i} - \bar{\xi}_{{\rm JK},i}\right)\left(\xi^{k}_{{\rm JK},j} - \bar{\xi}_{{\rm JK},j}\right), \tag{12} $$

where $\xi^{k}_{\rm JK}$ is the correlation function of the $k$th delete-d jackknife sample, and $\bar{\xi}_{\rm JK}$ is the mean correlation function over the total number $N_{\rm JK}$ of delete-d jackknife samples, given by

$$ \bar{\xi}_{\rm JK} = \frac{1}{N_{\rm JK}}\sum_{k=1}^{N_{\rm JK}}\xi^{k}_{\rm JK}. \tag{13} $$

In addition, if a jackknife covariance matrix is calculated from a mock, we can further reduce its sample variance by averaging over the jackknife covariance matrices from multiple mocks (Escoffier et al. 2016), i.e.,

$$ \hat{\tilde{C}}^{\rm JK} = \frac{1}{N_m}\sum_{m=1}^{N_m}\hat{C}^{\rm JK}_{(m)}, \tag{14} $$

where $\hat{C}^{\rm JK}_{(m)}$ is the covariance matrix obtained by applying Equation (12) on the $m$th mock.
The delete-d jackknife variance estimator is asymptotically unbiased when $N_s$ and $d$ go to infinity. Increasing the number of subsamples will reduce the minimal transverse size of each subsample and hence the number of independent modes. It will also increase the number of jackknife resamples. Escoffier et al. (2016) show that the delete-d jackknife covariance matrix of the galaxy correlation function converges when $N_s \ge 9$ and that the choice of $d$ has a small effect. In this analysis, we choose $N_s = 12$ and $d = 6$. The total number of jackknife samples for each mock catalog is calculated from the combination formula, $N_{\rm JK} = \binom{N_s}{d} = N_s!/[d!\,(N_s-d)!]$.
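The resampling procedure can be sketched as follows, assuming the standard delete-d prefactor (N_s − d)/(d N_JK) of Shao & Wu (1989). The callable `statistic`, which stands in for a measurement of the correlation function on the retained subregions, and the toy data are hypothetical; a real analysis would recompute the pair counts for each retained set.

```python
from itertools import combinations
from math import comb
import numpy as np

def delete_d_jackknife_cov(statistic, n_sub, d):
    """Delete-d jackknife covariance of statistic(kept_subregion_indices)."""
    all_regions = set(range(n_sub))
    samples = []
    for removed in combinations(range(n_sub), d):
        kept = tuple(sorted(all_regions - set(removed)))
        samples.append(np.atleast_1d(statistic(kept)))
    samples = np.array(samples, dtype=float)  # shape (N_JK, n_bins)
    n_jk = samples.shape[0]
    diff = samples - samples.mean(axis=0)
    return (n_sub - d) / (d * n_jk) * (diff.T @ diff)

def mean_jackknife_cov(cov_per_mock):
    """Average of the jackknife covariance matrices from several mocks."""
    return np.mean(np.asarray(cov_per_mock, dtype=float), axis=0)

n_sub, d = 12, 6
n_jk = comb(n_sub, d)  # number of delete-d jackknife samples per mock

# Toy linear statistic: the mean of one number per subregion.
rng = np.random.default_rng(2)
x = rng.normal(size=n_sub)
cov_jk = delete_d_jackknife_cov(lambda kept: x[list(kept)].mean(), n_sub, d)
```

For a linear statistic such as the sample mean used here, this estimator reproduces the usual variance of the mean, s²/N_s, exactly, which provides a convenient sanity check of the prefactor before the estimator is applied to correlation functions.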

Covariance Matrix from QPM Mocks
In this section, we use the QPM mock catalogs to calculate the delete-d jackknife covariance matrices and to compare them with the brute-force ones. We calculate the baseline brute-force covariance matrix from 1000 mock samples and take it as the benchmark. For the jackknife covariance matrix, we calculate the mean over 100 mocks based on Equation (14). In Figure 1, we compare the two types of covariance matrices at the scale 0 $h^{-1}$ Mpc $< s <$ 40 $h^{-1}$ Mpc with a bin width of 2 $h^{-1}$ Mpc. Overall, the jackknife covariance matrix with 100 mock samples has good consistency with the baseline brute-force one. In the upper panel of Figure 1, we compare the diagonal terms of the covariance matrices. The data points are from the jackknife one, and the solid lines are from the baseline brute-force one. For comparison, we also show the brute-force covariance matrix with fewer mock samples, $N_m = 100$. As expected, it has a larger fluctuation compared to the baseline for both the correlation function monopole (black dashed line) and quadrupole (magenta dashed line). For the monopole, the diagonal terms of the jackknife covariance matrix show a slightly increasing bias compared to the baseline at $s > 30\,h^{-1}$ Mpc.
We also show the cross-correlation matrix for the jackknife covariance matrix and the baseline brute-force one (in the lower panel of Figure 1), which is defined as

$$ r_{ij} = \frac{C_{ij}}{\sqrt{C_{ii}\,C_{jj}}}. $$

The cross-correlation matrix is symmetric and has unity diagonal elements. So we show the baseline cross-correlation matrix in the upper left corner (as indicated by the text BF1000) and the jackknife one in the lower right corner (denoted as JK100). The symmetric feature on small scales indicates that there is good agreement between the two correlation matrices. Again, the discrepancy increases as the scale becomes larger.
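The normalization of a covariance matrix to its cross-correlation matrix, and the packing of two such matrices into one array for a split display, can be sketched as below. Which triangle holds which matrix depends on the plotting orientation, so the `split_display` convention (first matrix above the diagonal, second below) is an assumption, as are the function names and the toy data.

```python
import numpy as np

def correlation_matrix(cov):
    """r_ij = C_ij / sqrt(C_ii * C_jj)."""
    cov = np.asarray(cov, dtype=float)
    sigma = np.sqrt(np.diag(cov))
    return cov / np.outer(sigma, sigma)

def split_display(corr_a, corr_b):
    """Pack corr_a on and above the diagonal and corr_b below it."""
    return np.triu(corr_a) + np.tril(corr_b, k=-1)

rng = np.random.default_rng(3)
cov_a = np.cov(rng.normal(size=(200, 6)), rowvar=False)
cov_b = np.cov(rng.normal(size=(200, 6)), rowvar=False)
corr_a = correlation_matrix(cov_a)
corr_b = correlation_matrix(cov_b)
combined = split_display(corr_a, corr_b)
```

If the two estimates agree, the combined array looks symmetric about the diagonal, which is the visual test described in the text.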
There are two important requirements in the delete-d jackknife resampling method. One is that the number of subsamples $N_s$ and the number of omitted subsamples $d$ should be large enough to satisfy the relations $N_s - d \gg 1$ and $d/\sqrt{N_s} \gg 1$. The success on small scales demonstrates that our choice of $N_s = 12$ and $d = 6$ is reasonable for the correlation function analysis. The other requirement is that the data in each subsample should be independently and identically distributed. Violating this could break the robustness of the jackknife variance estimator and introduce bias. For the galaxy two-point correlation function measurement, subsamples are correlated with each other on large scales due to the large-scale modes of galaxy clustering. This may be the cause of the bias on the jackknife covariance matrices that grows with scale.

KL Divergence: Measurements
In this section, we show the results of the KL divergence for the baseline covariance matrices coming from the same or different sets of mock samples (Section 4.1), as well as the KL divergence for the jackknife covariance matrices and the baseline covariance matrices (Section 4.2). Then we introduce the information factor to estimate the convergence rate of the jackknife resampling method in Section 4.3.

KL Divergence for Brute-force Covariance Matrices
We partition the full 1000 QPM mock samples exclusively into multiple groups. For the setting of $n$-partition, there are $\lfloor 1000/n \rfloor$ non-overlapping groups of $n$ mocks each, which form the set $S_n = \{S_n^i\}$. We can calculate the brute-force covariance matrix $\hat{C}_n^i$ for each member $S_n^i$ in $S_n$. Given two multivariate normal distributions, $Q_n^j$ with the covariance matrix $\hat{C}_n^j$ and $P_m^i$ with the covariance matrix $\hat{C}_m^i$, the KL divergence from $Q_n^j$ to $P_m^i$, ${\rm KL}(P_m^i|Q_n^j)$, can be calculated based on Equation (2). To avoid any possible correlation between $\hat{C}_n^j$ and $\hat{C}_m^i$, we require that $S_n^j$ and $S_m^i$ do not contain any common mock samples. We can calculate the mean and variance of ${\rm KL}(P_m^i|Q_n^j)$ as

$$ \langle {\rm KL}(P_m|Q_n) \rangle = \frac{1}{N_p}\sum_{i,j}{\rm KL}(P_m^i|Q_n^j), \tag{16} $$

$$ \sigma^2\left[{\rm KL}(P_m|Q_n)\right] = \frac{1}{N_p-1}\sum_{i,j}\left[{\rm KL}(P_m^i|Q_n^j) - \langle {\rm KL}(P_m|Q_n) \rangle\right]^2, \tag{17} $$

where $N_p$ is the total number of available pairs of $Q_n^j$ and $P_m^i$. We show the mean and variance of ${\rm KL}(P_m^i|Q_n^j)$ calculated using the above formulae for the various combinations of $n$-partition and $m$-partition in Figure 2. First, the KL divergence is asymmetric under the exchange of $P_m$ and $Q_n$, that is,

$$ \langle {\rm KL}(P_m|Q_n) \rangle \neq \langle {\rm KL}(Q_n|P_m) \rangle. $$

Second, given a fixed $m$ (the number of mock samples used to calculate the brute-force covariance matrix of $P_m$), ${\rm KL}(P_m|Q_n)$ decreases with increasing $n$ (the number of mock samples used to calculate the brute-force covariance matrix of $Q_n$) and saturates at some large value of $n$. Finally, the value of ${\rm KL}(P_m|Q_n)$ is dominated by whichever covariance matrix is calculated from fewer mock samples and so has a higher noise level.
In addition, we show the model prediction of ${\rm KL}(P_m|Q_n)$, obtained by assuming that the covariance matrices of $P_m$ and $Q_n$ follow the same Wishart distribution, as the solid lines in Figure 2. There is quite good agreement between the measurement and the model prediction, which indicates that the correlation functions measured from the mock samples follow a multivariate normal distribution closely.
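The partition test can be sketched on synthetic Gaussian data as follows. We assume that groups are formed from consecutive mocks, that a pair of groups is kept only when the two groups share no mock, and that the variance uses the N_p − 1 normalization; the function names and the toy data vectors (1000 mocks of dimension 6) are placeholders for the measured multipoles.

```python
import numpy as np

def gaussian_kl(cov_p, cov_q):
    """KL(P|Q) for equal-mean multivariate normal distributions."""
    n_dim = cov_p.shape[0]
    trace_term = np.trace(np.linalg.solve(cov_q, cov_p))
    logdet_p = np.linalg.slogdet(cov_p)[1]
    logdet_q = np.linalg.slogdet(cov_q)[1]
    return 0.5 * (trace_term - n_dim + logdet_q - logdet_p)

def partition_covs(xi_mocks, group_size):
    """Split the mocks into consecutive groups; return (index set, covariance)."""
    n_groups = xi_mocks.shape[0] // group_size
    groups = []
    for g in range(n_groups):
        lo, hi = g * group_size, (g + 1) * group_size
        groups.append((set(range(lo, hi)), np.cov(xi_mocks[lo:hi], rowvar=False)))
    return groups

def kl_mean_and_var(xi_mocks, m, n):
    """Mean and variance of KL(P_m|Q_n) over pairs of groups sharing no mock."""
    values = np.array([gaussian_kl(cov_p, cov_q)
                       for idx_p, cov_p in partition_covs(xi_mocks, m)
                       for idx_q, cov_q in partition_covs(xi_mocks, n)
                       if idx_p.isdisjoint(idx_q)])
    return values.mean(), values.var(ddof=1), values.size

rng = np.random.default_rng(4)
xi_mocks = rng.normal(size=(1000, 6))
mean_50, var_50, n_pairs_50 = kl_mean_and_var(xi_mocks, m=100, n=50)
mean_200, var_200, n_pairs_200 = kl_mean_and_var(xi_mocks, m=100, n=200)
```

Because different pairs reuse the same groups, the pairs are not independent, so the variance obtained this way should be read as a rough indication of the scatter only.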

KL Divergence for Jackknife Covariance Matrices
Following Section 4.1, for each member in set $S_n$, we calculate the mean delete-d jackknife covariance matrix using Equation (14) and denote it as $\hat{C}_{n,{\rm JK}}^j$. Supposing $\hat{C}_{n,{\rm JK}}^j$ is the covariance matrix of the multivariate normal distribution $Q_{n,{\rm JK}}^j$, the KL divergence from $Q_{n,{\rm JK}}^j$ to $P_m^i$, ${\rm KL}(P_m^i|Q_{n,{\rm JK}}^j)$, can be calculated using Equation (2). Then we calculate the mean and variance of ${\rm KL}(P_m^i|Q_{n,{\rm JK}}^j)$ using the nonoverlapping pairs of $S_n^j$ and $S_m^i$, similar to Equations (16) and (17), i.e.,

$$ \langle {\rm KL}(P_m|Q_{n,{\rm JK}}) \rangle = \frac{1}{N_p}\sum_{i,j}{\rm KL}(P_m^i|Q_{n,{\rm JK}}^j), $$

$$ \sigma^2\left[{\rm KL}(P_m|Q_{n,{\rm JK}})\right] = \frac{1}{N_p-1}\sum_{i,j}\left[{\rm KL}(P_m^i|Q_{n,{\rm JK}}^j) - \langle {\rm KL}(P_m|Q_{n,{\rm JK}}) \rangle\right]^2. $$

As discussed in the previous sections, if $\hat{C}_P$ and $\hat{C}_Q$ contain the same signal or expectation value, the KL divergence from the multivariate normal distribution $Q$ with the covariance matrix $\hat{C}_Q$ to the multivariate normal distribution $P$ with the covariance matrix $\hat{C}_P$, ${\rm KL}(P|Q)$, is determined by the noise level of $\hat{C}_Q$ and $\hat{C}_P$. ${\rm KL}(P|Q)$ can be larger when $\hat{C}_Q$ contains less information (or a higher noise level) than $\hat{C}_P$, and vice versa; hence, we expect that ${\rm KL}(P|Q)$ can measure the relative amount of information contained in $\hat{C}_P$ and $\hat{C}_Q$. As shown in Figure 1, due to the limited number of mock samples, the sample variance can cause a large fluctuation (noise) on the baseline covariance matrix from the brute-force method. To reduce such a noise effect, we introduce the information factor defined as

$$ \eta(Q_m|Q_{n,{\rm JK}}) = \frac{\langle {\rm KL}(P_m|Q_{n,{\rm JK}}) \rangle}{\langle {\rm KL}(P_m|Q_m) \rangle}, \tag{19} $$

whose denominator is calculated from Equation (16) with $m = n$. The information factor compares the statistical information contained in the jackknife covariance matrix $\hat{C}_{n,{\rm JK}}$ and the baseline covariance matrix $\hat{C}_m$. If $\hat{C}_{n,{\rm JK}}$ contains the same information as $\hat{C}_m$, then $\eta = 1$. Otherwise, $\eta < 1$ or $\eta > 1$, if $\hat{C}_{n,{\rm JK}}$ contains more or less information, respectively. Therefore, the intersections between the solid curves (linked to the data points) and the horizontal dashed line (showing $\eta = 1$) in Figure 3 give the estimated number of mock samples required for the jackknife covariance matrices to be equivalent to the baseline covariance matrices in the sense of the KL divergence. The variance of the information factor can be roughly estimated using the variance of the numerator in Equation (19),

$$ \sigma^2\left[\eta(Q_m|Q_{n,{\rm JK}})\right] \approx \frac{\sigma^2\left[{\rm KL}(P_m|Q_{n,{\rm JK}})\right]}{\langle {\rm KL}(P_m|Q_m) \rangle^2}. $$
We do not account for the contribution from the variance of the KL divergence between the baseline covariance matrices (the denominator), so this estimate underestimates the true variance. As can be seen in Figure 2, the fractional variance of the denominator is about 10% at most. In Figure 3, we show the error bars of $\eta$ based on the approximated variance. The dashed curves show the information factors obtained when the denominator in Equation (19) is replaced by the model prediction for Gaussian-distributed data. They are in good agreement with the solid curves.
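Given the KL divergences already measured for the nonoverlapping pairs, the information factor and its approximate error follow from two means and one standard deviation. In the sketch below, `kl_jk_values` stands for the set of KL(P_m|Q_{n,JK}) values and `kl_bf_values` for the set of KL(P_m|Q_m) values; both names and the numbers are hypothetical, and the scatter of the denominator is ignored as described above.

```python
import numpy as np

def information_factor(kl_jk_values, kl_bf_values):
    """Return (eta, sigma_eta) from per-pair KL divergences."""
    kl_jk_values = np.asarray(kl_jk_values, dtype=float)
    kl_bf_values = np.asarray(kl_bf_values, dtype=float)
    denominator = kl_bf_values.mean()          # <KL(P_m|Q_m)>
    eta = kl_jk_values.mean() / denominator    # <KL(P_m|Q_n,JK)> / <KL(P_m|Q_m)>
    # Error from the scatter of the numerator only (the denominator is held fixed).
    sigma_eta = kl_jk_values.std(ddof=1) / denominator
    return eta, sigma_eta

# Hypothetical per-pair KL values.
eta, sigma_eta = information_factor([2.0, 4.0], [1.0, 3.0])
```

A value of η above 1 indicates that the jackknife covariance matrix carries less information than the baseline built from m mocks, and a value below 1 indicates more.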

Convergence Rate of Jackknife Covariance Matrix
As shown in Section 4.2, the information factor defined in Equation (19) can quantify the relative information contained in the jackknife covariance matrices $\hat{C}_{n,{\rm JK}}$ and the baseline covariance matrices $\hat{C}_m$. Given $m$, increasing the number of mock samples for $\hat{C}_{n,{\rm JK}}$ decreases the information factor $\eta(Q_m|Q_{n,{\rm JK}})$. At the point $\eta = 1$, we consider that the jackknife covariance matrices converge to the baseline covariance matrices, since they contain the same information statistically. In Figure 4, we show the convergence rate of the covariance matrix calculated from the jackknife resampling and the brute-force methods based on the QPM mock samples. For the distance scales of 0 $h^{-1}$ Mpc $< s <$ 40 $h^{-1}$ Mpc (black solid line), there is a linear scaling law for the number of mocks required to obtain the statistically equivalent covariance matrices from the two methods.
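One way to read off the number of jackknife mocks at which η crosses 1 is to interpolate between the two neighboring measured points. The sketch below interpolates linearly in ln n, which is our own assumption about the interpolation scheme; the function name `mocks_needed` and the η values are hypothetical. It returns `None` when η never crosses the target, as can happen when the bias dominates.

```python
import numpy as np

def mocks_needed(n_values, eta_values, target=1.0):
    """Interpolate (linearly in log n) the n at which eta crosses the target."""
    n_values = np.asarray(n_values, dtype=float)
    eta_values = np.asarray(eta_values, dtype=float)
    for k in range(len(n_values) - 1):
        lo, hi = eta_values[k], eta_values[k + 1]
        # A crossing lies in this interval if the target is bracketed.
        if (lo - target) * (hi - target) <= 0.0 and lo != hi:
            frac = (lo - target) / (lo - hi)
            log_n = np.log(n_values[k]) + frac * np.log(n_values[k + 1] / n_values[k])
            return float(np.exp(log_n))
    return None

# Hypothetical eta(n) curves for one value of m.
n_cross = mocks_needed([10, 20, 50, 100], [2.4, 1.5, 0.9, 0.6])
no_cross = mocks_needed([10, 20, 50, 100], [2.4, 1.9, 1.6, 1.5])
```

Repeating this for several values of m yields the relation between n and m whose slope is the convergence rate discussed in the text.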

The Reference Covariance Matrices
Figure 3. The information factor defined in Equation (19). $n$ is the number of mock samples used to calculate the jackknife covariance matrices as in Equation (14), and $m$ is the number of mock samples used to calculate the baseline covariance matrices. Different colors denote different $m$. We also show the results with the denominator in Equation (19) replaced by the model prediction as the dashed curves.

When we compare the statistical information of the jackknife and baseline covariance matrices using the information factor defined in Equation (19), we use $P_m$ as the reference distribution function. Since the covariance matrices of $P_m$ and $Q_m$ are calculated using the same number of mock samples, they contain roughly the same statistical information. In the following, we break this limitation and replace $P_m$ with $P_k$, where $k$ is not necessarily equal to $m$, to test the robustness of the information factor in quantifying the information contained in two covariance matrices. We introduce the extended information factor,

$$ \eta_k(Q_m|Q_{n,{\rm JK}}) = \frac{\langle {\rm KL}(P_k|Q_{n,{\rm JK}}) \rangle}{\langle {\rm KL}(P_k|Q_m) \rangle}. $$

If $Q_{n,{\rm JK}}$ and $Q_m$ are the same distribution function, i.e., their covariance matrices $\hat{C}_{n,{\rm JK}}$ and $\hat{C}_m$ contain the same information, then $\eta_k(Q_m|Q_{n,{\rm JK}}) = 1$ for any $k$.
In Figure 5, we show the extended information factor $\eta_k(Q_m|Q_{n,{\rm JK}})$ as a function of $n$ for different combinations of $k$ and $m$. As shown by the different types of lines with the same color, the shape of the extended information factor as a function of $n$ (the number of mock samples used to calculate the jackknife covariance matrices) varies as $k$ changes; i.e., it is steeper for larger $k$. However, the curves reach the value of 1 at almost the same position (at the same $n$), where the two covariance matrices contain almost the same amount of information. This clearly shows that the information factor defined by Equation (19) is robust with respect to the choice of the reference covariance matrices when quantifying the relative information between the two covariance matrices.

When Bias Is Present: Intermediate and Large Scales
As discussed in Section 2.2, a linear bias on $\hat{C}_Q$ with respect to $\hat{C}_P$ will introduce an additional term, $\Delta_{\rm KL}$ (Equation (6)), on the KL divergence from the multivariate normal distribution $Q$ with covariance matrix $\hat{C}_Q$ to the multivariate normal distribution $P$ with covariance matrix $\hat{C}_P$, ${\rm KL}(P|Q)$. Moreover, $\Delta_{\rm KL}$ is almost positive definite, since its minimum is only slightly negative. So any bias on the estimate of the covariance matrix will affect the power of the information factor to quantify the relative information between the estimated covariance matrix and the true one.
The jackknife resampling method tends to produce biased estimates for the covariance matrix of the galaxy correlation function. This is partially due to the nonlinear feature of the two-point correlation function. Furthermore, the possible correlation between different subregions in the galaxy sample will break the independence of the jackknife observations and introduce bias on the jackknife covariance matrix.
In this section, we study the correlation function at intermediate scales (20 $h^{-1}$ Mpc $< s <$ 80 $h^{-1}$ Mpc) to test the effect of the covariance matrix bias on the information factor. Following Section 3.3, we calculate the baseline covariance matrix by the brute-force method and the jackknife covariance matrix using the delete-d method. Similar to Figure 1, we show the resulting baseline covariance matrix from 1000 QPM mocks and one of the jackknife covariance matrices from 100 QPM mocks in Figure 6. As shown in the upper panel, there are significant biases on the diagonal terms of the jackknife covariance matrix that are larger at larger scales. The off-diagonal terms also show clear biases in the lower panel.
Similarly, the information factors for the distance scales 20 $h^{-1}$ Mpc $< s <$ 80 $h^{-1}$ Mpc are shown in Figure 7. Compared with the small-scale one, the information factor as a function of $n$ (the number of mocks for the jackknife covariance matrix) becomes flatter and converges to a larger value for a given $m$ (the number of mocks for the brute-force covariance matrix) on intermediate scales. As a result, more mock samples are needed for the jackknife covariance matrix to contain the same information as the brute-force covariance matrix. The solid magenta line in Figure 4 shows the scaling law of $n$ as a function of $m$ on intermediate scales, which is clearly flatter than the small-scale one (solid black line).
We expect the information factor to converge to larger and larger values as m increases and eventually to remain above 1 for all n, at which point the bias on the jackknife covariance matrix dominates the statistical noise. Limited by the number of mock samples available, we do not explore this regime further on intermediate scales. Instead, we observe a similar phenomenon using data on larger scales, where the bias on the jackknife covariance matrices becomes even larger. The information factors for the distance scales of 24 h⁻¹ Mpc < s < 160 h⁻¹ Mpc are shown in Figure 8. As expected, the information factors with m = 300 converge to a value close to 1. When m goes up to 500, the information factors stay above 1 for all n. The scaling law of the number of mock samples needed by the jackknife resampling method and the baseline method for the distance scales of 24 h⁻¹ Mpc < s < 160 h⁻¹ Mpc is shown in Figure 4 as the red solid line, which is much flatter than those from the smaller distance scales.

Discussion and Summary
In this paper, we have proposed a simple method to diagnose the equality or similarity of two covariance matrices and then to calculate the convergence rate of the fast covariance matrix estimators for galaxy clustering measurements. The method rests on the fact that we are only interested in the Gaussian likelihood function characterized by the covariance matrix, rather than in the covariance matrix itself. The KL divergence is a natural tool for this purpose.
As a case study, we explore the delete-d jackknife covariance matrix estimator, which is one of the fast covariance matrix estimators in galaxy clustering analysis. In general, the jackknife covariance matrix contains both bias and noise with respect to the true covariance matrix, both of which contribute to the KL divergence. We therefore introduce the information factor (Equation (19)) to study the statistical information in the jackknife covariance matrix.
In this work, we focus on the anisotropic two-point galaxy correlation function and study its covariance matrix based on the QPM mock samples. We first test the KL divergence for the brute-force covariance matrices coming from different numbers of mock samples on the scale range 0 h⁻¹ Mpc < s < 40 h⁻¹ Mpc. We find that they are consistent with the Gaussian predictions. Then we calculate the information factor using the jackknife and the brute-force covariance matrices and estimate the convergence rate for the jackknife resampling method. We find that the jackknife resampling can recover the brute-force covariance matrices statistically by using about 10 times fewer mock samples. This is helpful for studying galaxy clustering at small scales when only a small number of mocks is available.
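The first test described above, comparing brute-force covariance matrices built from different numbers of samples, can be mimicked with a toy Monte Carlo experiment. The sketch below is only an illustration under stated assumptions: it draws Gaussian data vectors from a hypothetical 4 × 4 diagonal covariance instead of using QPM mocks, the helper names (`gaussian_kl`, `sample_covariance`, `mean_kl`) and the values of m and n are invented for the example, and it does not evaluate the model predictions of Equations (16) and (17). It shows only the qualitative behavior that the mean KL divergence between two noisy covariance estimates shrinks as the number of samples n grows at fixed m.

```python
import numpy as np


def gaussian_kl(cov_p, cov_q):
    """Standard closed-form D_KL(P || Q) for P = N(mu, cov_p) and Q = N(mu, cov_q)."""
    dim = cov_p.shape[0]
    ratio = np.linalg.solve(cov_q, cov_p)   # cov_q^{-1} @ cov_p
    _, logdet = np.linalg.slogdet(ratio)    # ln det(cov_q^{-1} cov_p)
    return 0.5 * (np.trace(ratio) - dim - logdet)


def sample_covariance(rng, true_cov, n_samples):
    """Brute-force covariance estimated from n_samples Gaussian draws (rows are samples)."""
    dim = true_cov.shape[0]
    draws = rng.multivariate_normal(np.zeros(dim), true_cov, size=n_samples)
    return np.cov(draws, rowvar=False)      # unbiased estimator (divides by n_samples - 1)


def mean_kl(rng, true_cov, m, n, n_trials=200):
    """Average KL between covariances estimated from m and from n independent draws."""
    values = [
        gaussian_kl(sample_covariance(rng, true_cov, m),
                    sample_covariance(rng, true_cov, n))
        for _ in range(n_trials)
    ]
    return float(np.mean(values))


rng = np.random.default_rng(1)                  # arbitrary seed
true_cov = np.diag([1.0, 2.0, 3.0, 4.0])        # hypothetical true covariance
kl_by_n = {n: mean_kl(rng, true_cov, m=200, n=n) for n in (20, 50, 200)}

for n, kl in kl_by_n.items():
    print(f"m=200  n={n:3d}  mean KL={kl:.4f}")
```

Note that n must exceed the dimension of the data vector for the estimated covariance matrix to be invertible, which is why the smallest n in this toy setup is well above 4.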
By introducing the extended information factor (Equation (21)), we test the robustness of the information factor in Section 5.1. Although a general choice of k in the extended information factor can give us more information, the simplification by taking k = m can still catch the point where Q_{n,JK} and Q_m contain the same information. In addition, we study the influence of the bias of the covariance matrix on the KL divergence based on the correlation functions on larger scales. We find that the bias on the jackknife covariance matrix reduces the power of the jackknife resampling method in recovering the brute-force covariance matrix.
The analysis presented in this paper can be applied to other fast estimators of the galaxy clustering covariance matrix. The findings on the limitation of the jackknife resampling method, especially the significant bias of the covariance matrix on large scales, are generic. Favole et al. (2021) recently studied the constraint on the baryon acoustic oscillation scale from the jackknife covariance based on the CMASS data or mocks and found no significant bias compared to that from the brute-force covariance matrix. However, the jackknife covariance bias still deserves careful investigation for the next-generation redshift surveys, such as DESI and Euclid, as their statistical error will be subdominant compared to the systematics. A larger set of simulation mock catalogs is required to investigate such cases with small bias.
of which contains n mock samples. The groups are arranged in this way: we pack the first n samples of the full catalog into the first group S^n_1, the second n samples into S^n_2, and so on. We denote the ith group as S^n_i. We introduce the set S^n to represent the n-partition, i.e.,

Figure 1. The brute-force and delete-d jackknife covariance matrices of the correlation function monopoles and quadrupoles of the QPM mock catalogs on scales 0 h⁻¹ Mpc < s < 40 h⁻¹ Mpc. The upper panel shows the diagonal elements of the covariance matrices: solid lines for the brute-force method with 1000 mock samples (black for the monopole and magenta for the quadrupole), dashed lines for the brute-force method using 100 mock samples, and plus signs for the delete-d jackknife method with 100 mock samples. The lower panel shows the cross-correlation matrices from the brute-force (upper left corner) and delete-d (lower right corner) methods, respectively.

Figure 2. The mean and variance of KL(P_m^i | Q_n^j) as given in Equations (16) and (17), respectively. The upper panel shows the mean KL divergence KL(P_m | Q_n) (colored symbols) and their model prediction (solid lines). Different colors denote different values of m. The error bars are from the standard deviations σ_KL(P_m, Q_n). The lower panel shows the fractional difference between the measurements and the model prediction at different n. For clarity, we slightly shift the results from different m at a given n.

Figure 3. Information factor of QPM mocks, η(Q_m | Q_{n,JK}), defined in Equation (19). Here n is the number of mock samples used to calculate the jackknife covariance matrices as in Equation (14), and m is the number of mock samples used to calculate the baseline covariance matrices. Different colors denote different m. The dashed curves show the results with the denominator in Equation (19) replaced by the model prediction.

Figure 4. Scaling law of the number of mock samples needed by the jackknife resampling method and the baseline method to give statistically equivalent covariance matrices.

Figure 5. Extended information factor, η_k(Q_m | Q_{n,JK}). The different colors of the lines represent different numbers of mock samples used in Ĉ_m, and different line types denote different numbers of mocks used in the reference covariance matrices Ĉ_k.