Large deviations of spread measures for Gaussian matrices

For a large $n\times m$ Gaussian matrix, we compute the joint statistics, including large deviation tails, of generalized and total variance - the scaled log-determinant $H$ and trace $T$ of the corresponding $n\times n$ covariance matrix. Using a Coulomb gas technique, we find that the Laplace transform of their joint distribution $\mathcal{P}_n(h,t)$ decays for large $n,m$ (with $c=m/n\geq 1$ fixed) as $\hat{\mathcal{P}}_n(s,w)\approx \exp\left(-\beta n^2 J(s,w)\right)$, where $\beta$ is the Dyson index of the ensemble and $J(s,w)$ is a $\beta$-independent large deviation function, which we compute exactly for any $c$. The corresponding large deviation functions in real space are worked out and checked with extensive numerical simulations. The results are complemented with a finite $n,m$ treatment based on the Laguerre-Selberg integral. The statistics of atypically small log-determinants is shown to be driven by the split-off of the smallest eigenvalue, leading to an abrupt change in the large deviation speed.


Introduction -
The standard deviation σ of an array of m data X i is the simplest measure of how spread these numbers are around their average value X = (1/m) m i=1 X i .Suppose that X i represent final Physics marks of m students of a high-school.Most worrisome scenarios for the headmaster would be a low X and/or a high σ, signaling an overall poor and/or highly non-uniform performance.
What if Physics and Arts marks are collected together?Detecting performance issues now immediately becomes a much harder task, as data may fluctuate together and in different directions.A two-dimensional scatter plot may help, though.The "centre" of the cloud gives a rough indication of how well the students perform on average in both subjects.But how to tell in which subject the gap between excellent and mediocre students is more pronounced, or whether outstanding students in one subject also excel in the other?
In Fig. 1 (bottom) we sketch two scatter plots of marks adjusted to have zero mean.A meaningful spread indicator seems to be the shape of the ellipse enclosing each cloud.For example, an almost circular cloud -like School 1 -represents a rather uninformative situation, where your Arts marks tell nothing about your Physics skills, and vice versa.Conversely, a rather elongated shape -like School 2 -highlights strong correlations between each student's marks in different subjects.The semi-axes of the ellipse are called components, and are given by the spectral decomposition of the covariance matrix of data.Finding the most significant components for a high-dimensional dataset is the goal of Principal Component Analysis (PCA), with countless applications including image processing [1], biological microarrays [2,3], population genetics [4,5], finance [6,7], meteorology and oceanography [8] among others.
Let us first come back to our example.For a bunch of many scattered points it would be desirable to summarize the overall spread around the mean just by a single scalar quantity, like the perimeter or area of the enclosing ellipse.Not surprisingly, however, these indicators (taken individually) have evident shortcomings [9].For instance, neither of the two gives information about the orientation of the patterns, making it impossible to pinpoint which of the two subjects fluctuates the most.Surely a wiser choice is to combine more than one single measure of spread (like perimeter or area alone), to obtain a more revealing indicator.
In this Letter, we compute analytically the joint statistics of "perimeter" and "area" enclosing clouds of random highdimensional data.Why this is a crucial (and so far unavailable) ingredient for an accurate data analysis will become clearer very shortly.
In the more general setting of n subjects and m students, their marks can be arranged in a n × m matrix X, adjusted to have zero-mean rows.We then construct the normalized n×n covariance data matrix S = (1/n)XX T , with non-negative eigenvalues {λ}, which is precisely the multi-dimensional analogue of the variance σ 2 for a single array.The surface and volume ("perimeter" and "area" in the two-dimensional example) of the enclosing ellipsoid are related to T = n −1 n i=1 λ i arXiv:1403.4494v1[cond-mat.stat-mech]18 Mar 2014 , the scaled trace and determinant of S. In statistics, these objects are called total and generalized variance respectively [10,11].As in Fig. 1, different correlation structures are not detected by G alone, while the total variance T , although computationally more manageable, essentially ignores all cross-correlations.Blending both estimators together would be preferable, like in the widely used scalar combination 0 ≤ L = T − H − 1, called likelihood ratio [11], where H = ln G.The behavior of L for two different covariance data matrices S is sketched in Fig. 1 (bottom) [42].Now, suppose that we wish to test the hypothesis that the data X ij (yielding a certain L) are independent and identically distributed.To this aim, we generate a random n × m dataset X (0) ij independently drawn from a standard Gaussian distribution, our reference (null) model.The corresponding L (0) has a probability density (see Fig. 1 Top) peaked on a region of ∼ O(1/n) around its mean (see below).Standard statistical tools [10,11] then allow to assess the likelihood that the empirical L has been drawn from the density of L (0) , thus rejecting or not the null hypothesis within a certain confidence interval.
However, what if an atypically high (or low) L comes out from the data?We would be tempted to reject the test hypothesis outright.However, this might lead to a misjudgment, as atypical values of L (0) for the null model can (and do) occur (just very rarely).A more accurate test must consider the possibility that the data are indeed independent and identically distributed, even though the corresponding L is anomalously high (or low).What is the probability of this rare event?In this Letter, using the Coulomb gas technique of statistical mechanics, we give a complete solution to this problem, computing analytically the joint statistics of total and generalized variance ("perimeter" and "area") for a large random dataset.Besides providing an elegant solution to a challenging problem, our unifying framework effortlessly recovers some results earned by other techniques in the statistics literature.We first summarize our main results.
Setting and summary of results -We consider n × m matrices X [43], whose entries are real, complex or quaternion independent random variables with standard Gaussian densities [44] labelled by Dyson's index β = 1, 2 and 4 respectively, and we form the (real, complex or quaternion) covariance (Wishart) matrix [12] S = (1/n)XX † , where † stands for transpose, hermitian conjugate and symplectic conjugate respectively.We define the rectangularity parameter c = m/n ≥ 1.The joint probability density (jpd) of the n real and positive O(1) eigenvalues of S is known [13,14] where α = m − n + 1 − 2/β and Z n is a normalization constant known from the celebrated Selberg integral [15].The joint density of the linear statistics [45] H = n −1 n i=1 ln λ i ("area") and , where • stands for the average with respect to the jpd (1).The Laplace transform P(s, w) = e −βn 2 (sH+wT ) of P(h, t) is Here we show, using a Coulomb gas technique that, for large n and m, with c = m/n ≥ 1 fixed, this Laplace transform behaves as P(s, w) ≈ exp −βn 2 J(s, w) [46].The function J(s, w), independent of β and finite for s ≤ s c = (c − 1)/2 and w > −1/2, is given by and is the generating function (GF) of the joint cumulants of T and H.The functions J T (w) and J H (s) (see Eq. ( 13) and ( 14)) are the individual GF of cumulants of T and H separately. Extracting the cumulants, we obtain the formulae valid for any β, c and to leading order in n They reproduce known results for β = 1 [16][17][18].Denoting ω(n, ) = 2/βn 2 −1 , the higher order cumulants κ (T ) and κ (H) ( > 2) are given to leading order by and encode deviations from the Gaussian behavior.
From J H (s) and J T (w), one computes the so called rate functions [19] ψ H (h), ψ T (t) providing the asymptotic decay with n of the probability densities P H (h) ≈ exp −βn 2 ψ H (h) and P T (t) ≈ exp −βn 2 ψ T (t) of scaled log-determinant and trace respectively.For T we find ψ T (t) = 1 2 (t − c) − c 2 ln t c for t ∈ (0, ∞).We identify three regimes: i) typical fluctuations of order O(1/n) are governed by the quadratic behavior of ψ T (t) around t = T = c [47], implying an asymptotically Gaussian law P T (t) ≈ exp −βn 2 (t − c) 2 /(4c) (see ( 4)); ii) large deviations for t T exhibit a c-independent exponential decay P T (t) ≈ exp −βn 2 t/2 ; iii) for t T we find a c-dependent power law P T (t) ≈ t βn 2 c/2 (t → 0).For the log-determinant H the rate function ψ H (h) is not easy to write down in closed form.However we can again identify i) a Gaussian regime P H (h) ∼ exp(−(h − H ) 2 )/2Var(H) (see ( 4)) on a narrow region of ∼ O(1/n), recovering [16] for β = 1, while ii) atypical right deviations of H are encoded in the behavior of ψ H (h) ∼ e h /2 for h → ∞, implying a super-exponential suppression P H (h) ≈ exp(−(βn 2 /2)e h ) (independent of c).The statistics of anomalously small logdeterminants (h H ) is instead described by a different mechanism (see last section below).The rate functions ψ T (t) and ψ H (h) globally capture large fluctuations away FIG. 2: (color online) Cumulant generating functions JT (w) and JH (s) for w > −1/2 and s ≤ sc from ( 13) and ( 14) respectively.The points are obtained from the finite n formula −(1/βn 2 ) ln P(s, w) (15), for n = 8.In the plotted region the maximal deviation between the exact and the asymptotic expression is of order O(10 −4 ).Inset: Cumulant GF JL(s), −1/2 < s ≤ sc of the likelihood ratio (again the points are from the finite n result).
from the Gaussian behavior near the peak.In particular, the non-Gaussian corrections for H account for deviations of the scaled determinant G = n i=1 λ 1/n i from the log-normal distribution.For statistics of determinants of random matrices, see [20][21][22], while other results on large deviations for the Wishart ensemble can be found in [23][24][25][26].Further remarks are given in [27].
From P(s, w), it is easy to compute the Laplace transform of the likelihood ratio L for Wishart random matrices, PL (s) = e βn 2 s P(−s, s), implying PL (s) ≈ exp −βn 2 {J(−s, s) − s} for −1/2 < s ≤ s c .The cumulants of L then are κ (L) = κ (T ) + (−1) κ (H) + δκ with corresponding to a Gaussian probability density , in agreement with [28].Note that, since T and H are not independent, the cumulants of the likelihood involve the extra term δκ .We will now map the problem of computing (2) to the evaluation of the partition function of an associated Coulomb fluid.
Coulomb fluid problem -The first step to compute the multiple integral (2) is to realize that the integrand can be written in the form Z −1 n exp(−βE[λ; s, w]), where the energy is and V ext (x; s, w) = n(w + 1/2)x + (sn − α/2) ln x.This way the jpd of eigenvalues ( 1) is the Gibbs-Boltzmann weight of a fluid of negative charges (−1), constrained to the pos-itive half-line.This fluid is in equilibrium at inverse temperature β under competing interactions: the external potential V ext (x) tends to drive the particles towards its minimum [48], while the 2D Coulomb repulsion (logarithmic) tends to spread them apart.Now, the Laplace transform (2) can be interpreted as the ratio of two partition functions, P(s, w) = Z n (s, w)/Z n (0, 0), where Z n (0, 0) ≡ Z n .Originally due to Dyson [29], this technique has been recently used in many different problems [30][31][32][33][34][35].
How to evaluate Z n (s, w) for large n? First, one introduces as a key object the empirical spectral density ρ(λ) = n −1 i δ(λ − λ i ), in terms of which any linear statistics on λ can be easily written down.For instance, In the large n limit, the partition function can be written as where we replaced the multiple integral in ( 2) with a functional integral over ρ and we neglected subleading O(n) contributions.The action E[ρ; s, w] (continuous version of the discrete energy ( 8)) reads where µ 0 is a Lagrange multiplier enforcing normalization of ρ.The functional integral is amenable to a saddle-point analysis, yielding Z n (s, w) ≈ e −βn 2 E[ρ ;s,w] , where ρ (λ) (depending parametrically on s and w) is the solution of the saddle point equation δ δρ E[ρ; s, w] ρ=ρ = 0, namely where ffl is the principal value and supp[ρ ] is the interval where ρ (λ) ≥ 0. In the framework of our electrostatic model, Eq. ( 11) is the continuous version of the force balance condition for the charge cloud ρ (λ) to be in equilibrium in the large n limit.We find the solution of ( 11) for s ≤ s c and w > −1/2 as [36] on the compact support [λ − , λ + ], where the edges of the support are λ ± (s, For fixed values of (s, w), the solution (12) provides the typical configuration of eigenvalues yielding prescribed values for T [ρ ] and H[ρ ] from (9).For instance, setting s = w = 0, we recover the celebrated Marčenko-Pastur law [37] ρ mp (λ) for the density of the Wishart ensemble having typical values for trace and log-determinant as in (4).However, while the constraint w > −1/2 does not affect the range of values of T (as 0 < T [ρ ] < ∞), the situation is more delicate for the log-determinant H. Indeed, for s ≤ s c , the integral H[ρ ] ≥ H crit = −1, hence the saddle-point solution (12) does not describe configuration of eigenvalues yielding a log-determinant H smaller than H crit = −1.These smaller log-determinants are described by a different physical scenario, which we briefly discuss below.
Finite n, m results -From the jpd (1) the joint Laplace transform P(s, w) can be also evaluated exactly at finite n, m, using the Laguerre-Selberg integral [15], as P(s, w) = (βn/2) βsn 2 (2w + 1) ] and α as in (1).The special case w = 0 and α = 0 (square case n = m for the determinant alone) of ( 15) was considered in [20].A direct asymptotic analysis of ( 15) is however difficult.In Fig. 2, we show that our large n formulae reproduce very well the finite n, m result even for moderate values of n.
Smaller H and condensation -As s reaches the critical value s = s c , the lower edge λ − of the saddle-point density (12) hits the origin, and ρ , for w = 0, acquires an inverse square-root singularity there, ρ (λ) = (2π) −1 (4 − λ)/λ.This corresponds (w = 0 hereafter) to a c-independent critical value for the log-determinant H crit = −1 (or G crit = 1/e), where the Laplace transform (2) is no longer finite.Beyond s = s c , the logarithmic part of the external potential V ext becomes attractive, implying that an effective positive charge is now located at the origin.The statistics of very small determinants G < 1/e is therefore governed by an entirely different mechanism: a finite fraction of small eigenvalues (the closest to the positive charge at the origin) condenses on a very narrow scale around it, leaving behind a bulk of "macroscopic" eigenvalues still ∼ O(1).Similar condensation phenomena for both correlated and i.i.d.random variables have been re- cently detected in a variety of contexts (see e.g.[31,34,[38][39][40][41] and references therein).Setting c = 1 for simplicity and looking for a solution of (11) in the form ρ (λ) = µδ(λ − λ 0 ) + ρ(λ), with 0 < µ < 1 and ´ρ(λ)dλ = 1 − µ [50], one may verify that this ρ is a solution of (11) in the limit λ 0 → 0 and µ → s with ρ(λ) = (2π) −1 (4(1 − s) − x)/x.Note that µ = s is exactly the charge necessary to neutralize the logarithmic potential source at the origin.Since the total charge of our fluid is −n, this neutralization is no longer possible for s > 1 (or s > s c + 1 in the general case).The average log-determinant restricted to the macroscopic eigenvalues, H # = (1/n) λ∼O(1) ln λ i can be computed from (9) as , where s is the fraction of microscopic eigenvalues.Both the new equilibrium density ρ and H # have been checked numerically with good agreement (see Fig. 3).
Conclusions -In summary, for n × n Wishart matrices S with c = m/n fixed, we computed (joint and marginal) cumulant generating functions for scaled trace T , log-determinant H and likelihood ratio L. We pointed out an interesting macroscopic condensation phenomenon governing the statistics of atypically small determinants.Analytical results for large n have been corroborated with numerical simulations of the Coulomb fluid and finite n expressions.Several questions remain open, for instance a more complete treatment of the condensed phase scenario.Extensions of the method to correlated data sets would be also interesting.
Acknowledgments -PV acknowledges financial support from Labex-PALM (project RandMat).FDC acknowledges hospitality by the LPTMS (Univ.Paris Sud) where this work was completed.We also thank Antonio Lerario for very helpful discussions and advice., 2 π e −2|w ij | 2 for β = 1, 2 and 4 respectively (recall that a real quaternion number is specified by two complex numbers (z, w)).
[45] A linear statistics is a random variable of the form A = n i=1 a(λi), for a well-behaved function a(x).
[47] The average scaled trace is easily computed T [48] The logarithmic part of Vext is repulsive for s < sc and attractive otherwise.This will have important consequences on the Coulomb fluid equilibrium density.
[50] The delta component has to be intended as a continuous density ρ0(λ) that appear concentrated at the typical scales of ρ(λ), so that we can treat it as a delta function centered in λ0.

FIG. 1 :
FIG.1:(color online) Top: Sketch of the probability density of the likelihood ratio L of a Gaussian iid data set.In yellow, the typical region around the mean of order O(1/n).Larger fluctuations are referred to as atypical large deviations.Bottom: Sketch of two multivariate data sets with n = 2 and m = 35.Each point represents a student, for two different schools, and his/her marks in Arts and Physics.The two datasets have same generalized variance H, but different total variance T .The likelihood ratio L of School 1 is compatible with the iid hypothesis, while the value of L for School 2 is atypically far from the average L .

FIG. 3 :
FIG.3:(color online) Density of eigenvalues ρ for w = 0 yielding prescribed values for H.Here c = 1 (sc = 0).Solid green curve is for s = −5, while the red curve (s = 0) is the limiting solution of(11).For s > sc = 0, in solid blue the equilibrium solution for the blob of macroscopic eigenvalues ρ (plotted for a fraction s = 1/2 of "condensed" eigenvalues).Inset: each value of s ≤ 0 yields a typical value ofH = H[ρ ].For larger values of s, the macroscopic density ρ yields H # = H[ρ].The points are from a simulation (Metropolis algorithm) for a Coulomb fluid of n = 40 particles with energy (8).