Behavior of the maximum likelihood in quantum state tomography

Quantum state tomography on a d-dimensional system demands resources that grow rapidly with d. They may be reduced by using model selection to tailor the number of parameters in the model (i.e., the size of the density matrix). Most model selection methods typically rely on a test statistic and a null theory that describes its behavior when two models are equally good. Here, we consider the loglikelihood ratio. Because of the positivity constraint $\rho \geq 0$, quantum state space does not generally satisfy local asymptotic normality, meaning the classical null theory for the loglikelihood ratio (the Wilks theorem) should not be used. Thus, understanding and quantifying how positivity affects the null behavior of this test statistic is necessary for its use in model selection for state tomography. We define a new generalization of local asymptotic normality, metric-projected local asymptotic normality, show that quantum state space satisfies it, and derive a replacement for the Wilks theorem. In addition to enabling reliable model selection, our results shed more light on the qualitative effects of the positivity constraint on state tomography.

Determining the quantum state ρ 0 produced by a specific preparation procedure for a quantum system is a problem almost as old as quantum mechanics itself [1,2]. This task, known as quantum state tomography [3], is not only useful in its own right (diagnosing and detecting errors in state preparation), but is also used in other characterization protocols including entanglement verification [4][5][6] and process tomography [7]. A typical state tomography protocol proceeds as follows: many copies of ρ 0 are produced, they are measured in diverse ways, and finally the outcomes of those measurements (data) are collated and analyzed to produce an estimateρ. This is a straightforward statistical inference process [8,9], where the data are used to fit the parameters of a statistical model. In state tomography, the parameter is ρ, and the model is the set of all possible density matrices on a Hilbert space H (equipped with the Born rule). However, we don't always know what model to use. It is not always a priori obvious what H or its dimension is; examples include optical modes [10][11][12][13][14] and leakage levels in AMO and superconducting [15,16] qubits. In such situations, we seek to let the data itself determine which of many candidate Hilbert spaces is best suited for reconstructing ρ 0 .
Choosing an appropriate Hilbert space on the fly is an instance of a general statistical problem called model selection. Although model selection has been thoroughly explored in classical statistics [17], its application to state tomography encounters some obstacles. They stem from the fact that quantum states -and therefore, estimates of them -must satisfy a positivity constraint ρ ≥ 0. (See Figure 1.) A similar constraint, complete positivity, applies to process tomography. The impact of positivity constraints on state and process tomography is an active area of research [18][19][20][21], and its implications for model selection have also been considered [22][23][24][25][26][27][28]. In this paper, we address a specific question at the heart of this matter: How does the loglikelihood ratio statistic used in many model selection protocols, including (but not lim- 1. Impact of the positivity constraint (ρ ≥ 0) on tomographic estimates: This figure illustrates how the quantum state space's boundary -which results from the constraint ρ ≥ 0 -affects maximum likelihood (ML) tomography for a qutrit state ρ0 (star). Two different 2-dimensional crosssections of the state space are shown, which correspond to a qubit (left) and a classical 3-outcome distribution (right). Top: Without the positivity constraint, some ML estimates (orange squares) are not valid estimates of a quantum state, because they are not positive semidefinite. However, some ML estimates (blue circles) are. Further, the ML estimates are Gaussian distributed. Bottom: Imposing the positivity constraint forces the (previously negative) ML estimates to "pile up" on the boundary of state space; the distribution Pr(ρML) is not Gaussian, and local asymptotic normality is not satisfied. In turn, the assumptions necessary to invoke the Wilks theorem are not satisfied either.
ited to) information criteria such as Akaike's AIC [29], behave in the presence of the positivity constraint ρ ≥ 0?
We begin in Section I by introducing the loglikelihood ratio statistic λ, and outline how it can be used to choose a Hilbert space. In Section II, we show how and why the classical null theory for its behavior, the Wilks theorem, falls apart in the presence of the positivity constraint, because quantum state space does not generally satisfy local asymptotic normality (LAN). We define a new generaliza-tion of LAN, metric-projected local asymptotic normality (MP-LAN), in Section III; this generalization explicitly accounts for the positivity constraint, and is satisfied by quantum state space. Using this generalization, we derive a closed-form approximation for λ's expected value in Section IV, thereby providing a replacement for the Wilks theorem that is applicable in state tomography. Finally, we test the validity of our approximate theory under the assumptions used in its derivation (Section V A), and under harsh conditions by comparing it to numerical results for the realistic problem of optical heterodyne tomography (Section V B).

I. INTRODUCTION -STATISTICAL MODEL SELECTION
Discussing model selection for state tomography requires introducing some terminology/notation from statistics. A model is a parameterized family of probability distributions over some data D, usually denoted as Pr θ (D), where θ are the parameters of the model. In state tomography, the parameters are a quantum state ρ, the data are the observed outcomes of the measurement of a positive operator-valued measure (POVM) {E j }, and the probability of observing outcome "j" is given by the Born rule: p j = Tr(ρE j ) [30]. In this paper, a model is a set of density matrices, and a state ρ is a particular choice of the model's parameters.
Suppose we have data D obtained from an unknown state ρ 0 , and two candidate models M 1 , M 2 that could be used to reconstruct it. Many of the known methods for choosing between them (i.e., model selection) involve quantifying how well each model fits the data by its likelihood. The likelihood of a simple hypothesis ρ is defined as L(ρ) = Pr(D|ρ). Models, however, are composite hypotheses, comprising many possible values of ρ. A canonical way to define model M's likelihood is via the general method of maximum likelihood (ML), by maximizing L(ρ) over ρ ∈ M. In practice, the maximization is usually done explicitly to find an ML estimateρ ML,M [31][32][33] of M's parameters, and then L(M) = L(ρ ML,M ). (Although it is common to refer toρ ML without specifying the model over which L was maximized, we list the model explicitly because in this paper, we are frequently using many different models!) Then, the models can be compared using the loglikelihood ratio statistic [27,33,34]: All else being equal, a positive λ favors M 2 -i.e., the model with the higher likelihood is more plausible, because it fits the data better. However, all else is rarely equal. If both models are equally valid -e.g., they both contain ρ 0 -but M 2 has more parameters, then M 2 will very probably fit the data better. Models with more adjustable parameters do a better job of fitting noise (e.g., finite sample fluctuations) in the data. This becomes strictly true when the models are nested, so that M 1 ⊂ M 2 . In this case, the likelihood of M 2 is at least as high as that of M 1 ; not only is λ ≥ 0, but almost surely λ > 0. Remarkably, the same effect also makes M 2 's fit less accurate (almost surely), because the fit incorporates more of the noise in the data. These two effects constitute overfitting, which can be summed up as "Extra parameters make the fit look better, but perform worse.". An overfitted model would fit current data extremely well, but would fail to accurately predict future data. This provides strong motivation to correct for overfitting by penalizing or handicapping larger models, to prevent them from being chosen over smaller models that are no less valid, and may even yield better estimates in practice [29].
For this reason, any model selection method/criterion that relies (explicitly or implicitly) on a statistic to quantify "how well model M fits the data" also relies on a null theory to predict how that statistic will behave if some null hypothesis is true. For the model selection problems we consider, the null hypothesis is that ρ 0 ∈ M, and the null theory will tell us how statistics of interest behave when that null hypothesis is in fact true. A model selection criterion based on an invalid null theory (or a criterion used in a context where its null theory does not apply) will tend to perform sub-optimally (as compared to a method based on a correct null theory).
The null theory can be used to formulate a decision rule for choosing between models. If we know how the test statistic behaves when both models are equally valid, then we can evaluate the observed value of the statistic under the null theory. If the observed value is very improbable under the null theory, then that constitutes evidence against the smaller model, and justifies rejecting it. On the other hand, if the observed value is consistent with the null theory, there is no reason to reject the smaller model.
The standard null theory for λ is the Wilks theorem [35]. It relies on local asymptotic normality (LAN) [36,37]. LAN is a property of M; if M satisfies LAN, then as N samples → ∞: • The ML estimateρ ML,M is normally distributed around ρ 0 with covariance matrix I −1 : (2) • The likelihood function in a neighborhood ofρ ML,M is locally Gaussian with Hessian I: Here, I is the (classical) Fisher information matrix associated with the POVM. It quantifies how much information the data carry about a parameter in the model. (Note that in expressions involving I, states ρ are treated as vectors in state space, and I is a matrix or 2-index tensor acting on that state space.) Most statistical models satisfy LAN. When LAN is satisfied and N samples is large enough to reach the "asymptotic" regime, we can invoke the Wilks theorem to determine the behavior of λ. This theorem says that under suitable regularity conditions, if ρ 0 ∈ M 1 ⊂ M 2 , where M 2 has K more parameters than M 1 , then λ is a χ 2 K random variable. This is a complete null theory for λ (under the specified conditions), and implies that λ = K and (∆λ) 2 = 2K.
Therefore, in the "Wilks regime", a simple criterion for model selection would be to compare the observed value of λ to λ thresh = λ +k∆λ, for some k ≈ 1, and reject the smaller model if λ > λ thresh . While model selection rules can be more subtle and complex than this [29,[38][39][40], they usually take the general form of a threshold in which λ plays a key role. Rather than attempting to define a specific rule, our purpose in this paper is to understand the behavior of λ and derive an approximate expression for it in the context of state tomography.
The first step in doing so is to explain how and why the Wilks theorem breaks down in that context.

II. QUANTUM STATE TOMOGRAPHY AND THE BREAKDOWN OF THE WILKS THEOREM
Quantum state tomography typically begins with N samples independently and identically prepared quantum systems -i.e., N samples copies of an unknown state ρ 0 . Each copy is measured, and without loss of generality we can assume that each measurement is described by the same positive operator-valued measure (POVM). A POVM is a collection of positive operators {E j } summing to 1l, and the probability of outcome "j" is given by Tr(ρ 0 E j ). The results of all N samples measurements constitute data, represented as a record of the frequencies of the possible outcomes {n j }, where n j is the number of times "j" was observed, and j n j = N samples . Finally, this data is processed through some estimator to yield an estimate of ρ 0 , denotedρ .
Although a variety of estimators have been proposed [31][32][33][41][42][43][44], the exact estimator used is not our concern here. However, since we are concerned with computing the likelihood of a model M, which is defined as the likelihood of the most likely ρ ∈ M, we will make extensive use of the maximum likelihood (ML) estimator. This should not be taken as advocacy for the ML estimator; it is only a convenient way to find the maximum of L over M, and once a model is chosen, a different estimator could be used. The likelihood L(ρ) is is the solution to minimizing a convex function over a convex set, it can be found efficiently via any of several algorithms for convex optimization [45].
Usually, H is taken for granted or chosen by fiat. In this paper, we will consider a nested family of different Hilbert spaces, indexed by their dimension d: The models we consider are therefore given by: For notational brevity, we will useρ ML,d to denote the ML estimate over M d . To select between these models, we need to determine whether one model (say, M d+1 ) is "better" than another (say, M d ). To evaluate which is better, we typically compute the likelihood of each model, and then use λ(M d , M d+1 ) to choose between them. As mentioned in the previous section, this requires having a null theory for λ that describes its behavior when ρ 0 ∈ M d ⊂ M d+1 .
The Wilks theorem, which is the classical null theory for λ, relies on local asymptotic normality (LAN). If the models under consideration satisfy LAN, then as mentioned in the previous section, the likelihood L(ρ) is Gaussian with a Hessian given by the Fisher information. In classical statistics, it is common to assume that boundaries are not relevant, either because the models of interest have none, or because the true parameter values ρ 0 lie far away from them. In the absence of boundaries, and in the asymptotic limit where the curvature of the Fisher information metric is also negligible, many calculations can be simplified by changing to Fisher-adjusted coordinates in which the Fisher information is isotropic (i.e., I ∝ 1l). Under these assumptions and simplifications, the Wilks theorem can be derived.
In quantum state tomography, the Wilks theorem breaks down for two reasons. First, the quantum state space does have boundaries. Second, the Fisher information is anisotropic, and the anisotropy can't easily be eliminated by a coordinate change because those boundaries define a preferred coordinate system. We discuss these obstacles -and our plan to address them -in detail in the remainder of this section.
Given a model M d , its boundary is the set of rankdeficient states within it. When ρ 0 ∈ M d and is full rank, LAN will hold -which is to say that, asymptotically, the boundary can indeed be ignored. But when ρ 0 is rankdeficient, it lies on the boundary of the model. LAN is not satisfied, because positivity constrainsρ ML,d , and so Pr(ρ ML,d ) is not Gaussian (see Figure 1). The Wilks theorem does not apply in this case, and its predictions regarding λ aren't even close (see Figure 2). Moreover, this is the relevant situation for our analysis, because even if ρ 0 is full-rank in M d , it must be rank-deficient in M d+1 . So we require a replacement for the Wilks theorem; that is, we need a null theory for λ when ρ 0 is rank-deficient.
One challenge in deriving this replacement is that the Fisher information generally depends strongly on ρ 0 and the POVM being measured (see Figure 5). In many standard derivations, such anisotropy has no impact and can be eliminated easily by changing to Fisheradjusted coordinates. But the models we consider (quantum states) have boundaries that break scale-invariance, and define preferred coordinate systems. Changing to Fisher-adjusted coordinates does not eliminate the effect of anisotropy, because the boundary has a new shape in the new coordinates that serves as a record of the anisotropy. Moreover, the methods we derive here for calculating the impact of the boundary rely heavily on a particular coordinate system (Hilbert-Schmidt coordinates), and changing to Fisher-adjusted coordinates would break them. This makes it very difficult to derive an precise generalization of the Wilks theorem for arbitrary Fisher information, so to derive our results we make the key simplifying assumption that the Fisher information is isotropic with respect to Hilbert-Schmidt metric. This is almost never exactly true in practice [46], but it is reasonable to presume that our results remain useful and approximately true when the Fisher information is almost isotropic. Our results actually appear to be surprisingly robust to significant anisotropy. In Section V B, we perform numerical simulations of heterodyne tomographyin which the condition number of the Fisher information ranges from 10 1 to 10 9 (1 corresponds to isotropic Fisher information) -and find that our new theory remains reasonably accurate even in this extreme scenario.
To derive our replacement for the Wilks theorem, we first need a new framework for reasoning about models with convex constraints. We develop such a framework in the next section by defining a new generalization of LAN.

III. GENERALIZING LAN TO DEAL WITH CONSTRAINTS
In this section, we develop a framework that will allow us to derive a replacement for the Wilks theorem that holds for rank-deficient ρ 0 . To do so, we define a generalization of LAN in the presence of boundaries, which we call metric-projected local asymptotic normality (MP-LAN). (For other generalizations of LAN, see [47,48].) Like LAN, MP-LAN is a property that a statistical model Rank(ρ 0 ) =1

FIG. 2. Predictions of the Wilks theorem vs reality:
In the context of state tomography on a true state ρ0 in a d-dimensional Hilbert space, the Wilks theorem can be used to predict that, when comparing the zero-parameter model Here, we compare that prediction to numerical simulations of tomography on states ρ0 in dimension d = 2, . . . , 30, with ranks r = 1, . . . , min(10, d). The Wilks theorem only predicts λ correctly for full-rank states; when r d, the actual expected loglikelihood ratio is much smaller. Our main result (Equation 19) gives a replacement that works correctly (see Figure 9). may satisfy. Unlike LAN, MP-LAN is satisfied by quantum state space. For any model that satisfies MP-LAN (quantum or classical), we compute an asymptotically exact expression for λ, a necessary building block in our replacement for the Wilks theorem.
In Section IV, we show that the models M d satisfy MP-LAN, and derive an approximation for λ (Equation (19), on page 12). Section V A compares our theory to numerical results, and Section V B applies our theory to the problem of heterodyne tomography.
The reader should note that to enhance readability, in this section (and only this section) we use N to denote the number of samples, previously denoted as N samples .

A. Definitions and Overview of Results
The main definitions and results required for the remainder of the paper are presented in this subsection. Technical details and proofs are presented in the next subsection. While there are many possible choices for the unconstrained model M , we will find it useful to let M be a model whose dimension is the same as M, but where any of the constraints that define M are lifted. (For example, in Lemma 5, we will take M to be Hermitian matrices of dimension d.) Other choices of M are possible, but we do not explore those here.
Although the definition of MP-LAN is rather short, it implies some very useful properties. These properties follow from the fact that, as N → ∞, the behavior of ρ ML,M and λ is entirely determined by their behavior in an arbitrarily small region of M around ρ 0 , which we call the local state space. Models that satisfy MP-LAN have the following asymptotic properties: • The local state space is the solid tangent cone of the model at ρ 0 , denoted T (ρ 0 ). • The ML estimateρ ML,M is given by the metric projection ofρ ML,M onto T (ρ 0 ): (We first encountered the term "metric projection" in the convex optimization literature [49,50], and inspires our choice for the acronym "MP-LAN". However, it should be noted that in the problem setting considered in those references, I = 1l.) • The loglikelihood ratio λ(ρ 0 , M), defined as takes the following simple form: (This property is non-trivial; see Figure 3.) Even when M satisfies MP-LAN, these properties may not be true when N is finite; they are guaranteed only in the asymptotic limit. When N is sufficiently large, we can (and will!) use the asymptotic properties above.
The following subsection presents the technical details and definitions necessary to show the above results. The reader may skip it without loss of continuity, and proceed to Section IV.  λ is equal, in the asymptotic limit, to the squared distance fromρML,M k to ρ0 (red lines), because ρ0,ρ ML,M , andρML,M k form a right triangle. This is also true for models with curved boundaries (such as quantum state space) because asymptotically, the local state space is the solid tangent cone, whose boundaries are always flat.
where d − → means "converges in distribution to", and Σ = I −1 . The shape of the distribution is entirely determined by I. As N → ∞, this Gaussian distribution becomes more and more tightly concentrated around ρ 0 . Although there is always a non-zero probability thatρ ML,M will be arbitrarily far away from ρ 0 , it is possible to define a sequence of balls B N that shrink with N , yet contain everyρ ML,M with probability 1 as N → ∞.
First, we switch coordinates by sending ρ → ρ − ρ 0 , establishing ρ 0 as the origin of the coordinate system. In these coordinates,ρ ML,M ∼ N (0, Σ/N ), and the following lemma constructs B N .
Proof. Let B 0 N be an ellipsoidal ball defined by {ρ ∈ M | Tr(ρΣ −1 ρ) ≤ 1/N 1/2 }. Change coordinates by Switching back to the original coordinates, we have To see that this is the case, write the equation for B 0 N in the standard quadratic form for an ellipsoid: The standard ellipsoid {x | x T Ax ≤ 1} has semi-major axes whose lengths s j are related to the eigenvalues a j of A: where λ j are the eigenvalues of Σ. Thus, the lengths of the semi-major axes of B 0 N are given by s j = 1/ N 1/2 /λ j = λ j /N 1/4 . Letting λ max (Σ) denote the largest eigenvalue of Σ, the longest semi-major axis of B 0 N has length λ max (Σ)/N 1/4 . Because B N is a ball whose radius is equal to this length, B N circumscribes B 0 N , and B 0 N ⊂ B N . As B 0 N ⊂ B N , it follows from the monotonicity of probability that Pr(ρ ML,M ∈ B 0 N ) ≤ Pr(ρ ML,M ∈ B N ). As the asymptotic limit of Pr(ρ ML,M ∈ B 0 N ) is 1, and Pr(ρ ML,M ∈ B N ) itself is bounded above by 1, it follows from the squeeze theorem that lim N →∞ Pr(ρ ML,M ∈ B N ) = 1.
Informally speaking, Lemma 1 implies that as N → ∞ "all the action" aboutρ ML,M takes place inside B N . Accordingly, to understand the behavior of quantities which depend onρ ML,M (such asρ ML,M and λ), it is sufficient to consider their behavior within B N . In fact, we can show that, asymptotically, all theρ ML,M are contained within the region C N ≡ B N ∩ M: Proof. Using the law of total probability, write Pr(ρ ML,M ∈ C N ) as a sum of two terms, depending on whetherρ ML,M ∈ B N . Letting p denote Pr(ρ ML,M ∈ B N ), we have For anyρ ML,M ∈ B N , the correspondingρ ML,M is somewhere in M. To showρ ML,M ∈ C N , we use a proof by contradiction. Suppose thatρ ML,M is the ML estimate in M forρ ML,M , and thatρ ML,M ∈ C N . Let ρ C denote the closest point in C N toρ ML,M . Because B N ⊃ C N , it follows that ρ C is closer toρ ML,M thanρ ML,M , contradicting the assumptionρ ML,M was the ML estimate in M forρ ML,M . Therefore, ifρ ML,M ∈ B N , it must be the case thatρ ML,M ∈ C N .
In the original coordinates, both B N and the distribution ofρ ML,M shrink with N , but B N shrinks more slowly. Suppose, instead, that we switch to N -dependent coordinates that shrink with the distribution ofρ ML,M . In these coordinates, M and M grow with N , and B N also grows (but more slowly). This homothetic transformation of M, M , and B N scales all of them up. As N → ∞, B N → R d , and the local state space is the solid tangent cone of M at ρ 0 .
Definition 3 (Homothetic Transformation). Given a convex set C, the homothetic transformation of C with respect to any point X ∈ C, with homothety coefficient h, is the set C h defined by Definition 4 (Solid Tangent Cone). For each point X in a convex set C, let C h be the homothetic transformation of C with respect to X, with homothety coefficient h. Then, the solid tangent cone T (X) is defined as the following limit: Tangent cones are a general feature of convex sets; see [51], Chapter 6, Section A for more information about them and their properties.
Let C N = B N ∩ M in Hilbert-Schmidt coordinates. We show that, in an N -dependent coordinate system, C N converges to the solid tangent cone, and is the local state space. 2) The original set C N is a convex subset of M, and from Lemma 2, lim N →∞ Pr(ρ ML,M ∈ C N ) = 1. Further, the coordinate system defined by the mapping ρ → N 1/2 ρ turns the (previously Ndependent) Fisher information I into a constant. Thus, lim N →∞ C N is the local state space.
Therefore, we have shown that, asymptotically, the local state space around ρ 0 is the solid tangent cone T (ρ 0 ). The geometry of T (ρ 0 ) depends strongly on ρ 0 . If ρ 0 is rank-deficient within M, then T (ρ 0 ) is the cone whose faces touch M at ρ 0 . (See Figure 4 for a rebit example.) However, if ρ 0 is full-rank, T (ρ 0 ) is R d 2 −1 .

MLE as Metric Projection
As N → ∞, all theρ ML,M are contained within the ball B N , and the local state space is the solid tangent cone. Because M satisfies LAN, the likelihood function around eachρ ML,M is Gaussian, meaning the optimization problem definingρ ML,M is given bŷ That is,ρ ML,M is the metric projection ofρ ML,M onto the tangent cone. See Figure 4 for a rebit example. (Notice that ifρ ML,M ∈ T (ρ 0 ), thenρ ML,M =ρ ML,M .) What makes this nontrivial is the replacement of the original state space M, whose geometry can be arbitrarily complicated, with its tangent cone T (ρ 0 ). As shown in the next section, cones can be much simpler and tractable.

Expression for λ(ρ0, M)
The loglikelihood ratio statistic between any two models λ(M 1 , M 2 ) can be computed using a reference model R: Let us take R = ρ 0 . Because M satisfies LAN, asymptotically L(ρ) is Gaussian, and λ relates to a difference in squared distances: Using the factρ ML,M is a metric projection, we can prove that λ(ρ 0 , M) has a simple form. Proof. We switch to Fisher-adjusted coordinates (ρ → I 1/2 ρ), and in these coordinates I becomes 1l: So if M satisfies MP-LAN then as N → ∞ the loglikelihood ratio statistic becomes related to squared error/loss (as measured by the Fisher information metric.) This result may be of independent interest in, for example, defining new information criteria, which attempt to balance goodness of fit (as measured by λ) against error/loss (generally, as measured by squared error).
With these technical results in hand, we can proceed to compute λ(M d , M d+1 ) in the next section. We can reduce the problem of computing λ(M d , M d+1 ) to that of computing λ(ρ 0 , M k ) for k = d, d + 1 using the identity where λ(ρ 0 , M k ) is given in Equation (6). Because each model satisfies MP-LAN, asymptotically, λ(ρ 0 , M k ) takes a very simple form, via Equation (7): The Fisher information I k is generally anisotropic, depending on ρ 0 , the POVM being measured, and the model M k (see Figure 5). And while the ρ ≥ 0 constraint that invalidated LAN in the first place is at least somewhat tractable in standard (Hilbert-Schmidt) coordinates, it becomes completely intractable in Fisheradjusted coordinates. So, to obtain a semi-analytic null theory for λ, we will simplify to the case where I k = 1l k / 2 for some that scales as 1/ N samples . (That is, I k is proportional to the Hilbert-Schmidt metric.) This simplification permits the derivation of analytic results that capture realistic tomographic scenarios surprisingly well [52].
With this simplification, λ(M d , M d+1 ) is given by . (11) That is, λ is a difference in Hilbert-Schmidt distances. This expression makes it clear why a null theory for λ is necessary: if ρ 0 ∈ M d , M d+1 , thenρ ML,d+1 will lie further from ρ 0 thanρ ML,d (because there are more parameters that can fit noise in the data). The null theory for λ tells us how much extra error will be incurred in using M d+1 to reconstruct ρ 0 when M d is just as good.
Describing Pr(λ) is difficult because the distributions ofρ ML,d ,ρ ML,d+1 are complicated, highly non-Gaussian, and singular (estimates "pile up" on the various faces of the boundary as shown in Figure 1). For this reason, we will not attempt to compute Pr(λ) directly. Instead, we focus on deriving a good approximation for λ .
We consider each of the terms in Equation (11)  Depending on ρ0, the distribution of the unconstrained estimatesρML (ellipses) may be anisotropic. Imposing the positivity constraint ρ ≥ 0 is difficult in Fisher-adjusted coordinates; in this paper, we simplify these complexities to the case where I ∝ 1l, and is independent of ρ0.
(1) Identify which degrees of freedom inρ ML,M d are, and are not, affected by projection onto the tangent cone T (ρ 0 ).
(2) For each of those categories, evaluate its contribution to the value of λ .
In Section IV A, we identify two types of degrees of freedom inρ ML,M , which we call the "L" and the "kite". Section IV B computes the contribution of degrees of freedom in the "L", and Section IV C computes the contribution from the "kite". The total expected value is given in Equation (19) in Section IV D, on page 12.

A. Separating out Degrees of Freedom inρ ML,M d
We begin by observing that λ(ρ 0 , M d ) can be written as a sum over matrix elements, and therefore λ = jk λ jk . Each term λ jk quantifies the mean-squared error of a single matrix element ofρ ML,d , and while the Wilks theorem predicts λ jk = 1 for all j, k, due to positivity constraints, this no longer holds. In particular, the matrix elements ofρ ML,d now fall into two parts: 1. Those for which the positivity constraint does affect their behavior. The latter, which lie in what we call the "L", comprise all off-diagonal elements on the support of ρ 0 and between the support and the kernel, while the former, which lie in what we call the "kite", are all diagonal elements and all elements on the kernel (null space) of ρ 0 .
Performing this division is also supported by numerical simulations (see Figure 6). Matrix elements in the "L" appear to contribute λ jk = 1, consistent with the Wilks theorem, while those in the "kite" contribute more (if they are within the support of ρ 0 ) or less (if they are in the kernel). Having performed the division of the matrix elements ofρ ML,M d , we observe that λ = λ L + λ kite . Because each λ jk is not necessarily equal to one (as in the Wilks theorem), and because many of them are less than 1, it is clear that their total λ is dramatically lower than the prediction of the Wilks theorem. (Recall Figure  2.) In the following subsections, we develop a theory to explain the behavior of λ L and λ kite . In doing so, it is helpful to think about the matrix δ ≡ρ ML,M d − ρ 0 , a normally-distributed traceless matrix. To simplify the analysis, we explicitly drop the Tr(δ) = 0 constraint and let δ be N (0, 2 1l) distributed over the d 2 -dimensional space of Hermitian matrices (a good approximation when d 2), which makes δ proportional to an element of the Gaussian Unitary Ensemble (GUE) [53].

B. Computing λL
The value of each δ jk in the "L" is invariant under projection onto the boundary (the surface of the tangent cone T (ρ 0 )), meaning that it is also equal to the error (ρ ML,d − ρ 0 ) jk . Therefore, λ jk = δ 2 jk / 2 . Because M satisfies LAN, it follows that each δ jk is an i.i.d. Gaussian random variable with mean zero and variance 2 . Thus, λ jk = 1 ∀ (j, k) in the "L". The dimension of the surface of the tangent cone is equal to the dimension of the manifold of rank-r states in a d-dimensional space. A direct calculation of that quantity yields 2rd − r(r + 1), so λ L = 2rd − r(r + 1).
Another way of obtaining this result is to view the δ jk in the "L" as errors arising due to small unitary pertur- However, if either j or k (or both) lie within the kernel of ρ 0 (i.e., k|ρ 0 |k or j|ρ 0 |j is 0), then the corresponding δ jk are zero. The only off-diagonal elements where small unitaries can introduce errors are those which are coherent between the kernel of ρ 0 and its support. These off-diagonal elements are precisely the "L", and are the set {δ jk | j|ρ 0 |j = 0, j = k, 0 ≤ j, k ≤ d − 1}. This set contains 2rd − r(r + 1) elements, each of which has λ jk = 1, so we again arrive at λ L = 2rd − r(r + 1).

C. Computing λ kite
Computing λ L was made easy by the fact that the matrix elements of δ in the "L" are invariant under the projection ofρ ML,M d onto T (ρ 0 ). Computing λ kite is a bit harder, because the boundary does constrain δ. To understand how the behavior of λ kite is affected, we analyze an algorithm presented in [52] for explicitly solving the optimization problem in Equation (5).
This algorithm, a (very fast) numerical method for computingρ ML,d givenρ ML,M d , utilizes two steps: 1. Subtract q1l fromρ ML,M d , for a particular q ∈ R. 2. "Truncate"ρ ML,M d − q1l, by replacing each of its negative eigenvalues with zero.
Here, q is defined implicitly such that Tr Trunc(ρ ML,M d − q1l) = 1, and must be determined numerically. However, we can analyze how this algorithm affects the eigenvalues ofρ ML,d , which turn out to be the key quantity necessary for computing λ kite .
The truncation algorithm above is most naturally performed in the eigenbasis ofρ ML,M d . Exact diagonalization ofρ ML,M d is not feasible analytically, but only its small eigenvalues are critical in truncation. Further, only knowledge of the typical eigenvalues ofρ ML,d is necessary for computing λ kite . Therefore, we do not need to determineρ ML,d exactly, which would require explicitly solving Equation (5) using the algorithm presented in [52]; instead, we need a procedure for determining its typical eigenvalues.
We assume that N samples is sufficiently large so that all the nonzero eigenvalues of ρ 0 are much larger than . This means the eigenbasis ofρ ML,M d is accurately approximated by: (1) the eigenvectors of ρ 0 on its support; and (2) the eigenvectors of δ ker = Π ker δΠ ker = Π kerρML,M d Π ker , where Π ker is the projector onto the kernel of ρ 0 . Changing to this basis diagonalizes the "kite" portion of δ, and leaves all elements of the "L" unchanged (at O( )). The diagonal elements fall into two categories: 1. r elements corresponding to the eigenvalues of ρ 0 , which are given by p j = ρ jj + δ jj where ρ jj is the j th eigenvalue of ρ 0 , and δ jj ∼ N (0, 2 ). 2. n ≡ d − r elements that are eigenvalues of δ ker , which we denote by κ = {κ j : j = 1, . . . , n}.
In turn, q is the solution to where (x) + = max(x, 0), and λ kite is To solve Equation (12), and derive an approximation for (13), we use the fact that we are interested in computing the average value of λ kite , which justifies approximating the random variable q by a closed-form, deterministic value. To do so, we need to understand the behavior of κ. Developing such an understanding, and a theory of its typical value, is the subject of the next section.

Approximating the eigenvalues of a GUE(n) matrix
We first observe that while the κ j are random variables, they are not normally distributed. Instead, because δ ker is proportional to a GUE(n) matrix, for n 1, the distribution of any eigenvalue κ j converges to a Wigner semicircle distribution [54], given by Pr(κ) = The eigenvalues are not independent; they tend to avoid collisions ("level avoidance" [55]), and typically form a surprisingly regular array over the support of the Wigner semicircle. Since our goal is to compute λ kite , we can capitalize on this behavior by replacing each random sample of κ with a typical sample given by its order statisticsκ. These are the average values of the sorted κ, so κ j is the average value of the j th largest value of κ. Large random samples are usually well approximated (for many purposes) by their order statistics even when the elements of the sample are independent, and level avoidance makes the approximation even better.
Suppose that κ are the eigenvalues of a GUE(n) matrix, sorted from highest to lowest. Figure 7 illustrates such a sample for n = 100. It also shows the average values of 100 such samples (all sorted). These are the order statistics κ of the distribution (more precisely, what is shown is a good estimate of the order statistics; the actual order statistics would be given by the average over infinitely many samples). As the figure shows, while the order statistics are slightly more smoothly and predictably distributed than a single (sorted) sample, the two are remarkably similar. A single sample κ will fluctuate around the order statistics, but these fluctuations are relatively small, partly because the sample is large, and partly because the GUE eigenvalues experience level repulsion. Thus, the "typical" behavior of a sample -by which we mean the mean value of a statistic of the sample -is well captured by the order statistics (which have no fluctuations at all).
We now turn to the problem of modeling κ quantitatively. We note up front that we are only going to be interested in certain properties of κ: specifically, partial sums of all κ j greater or less than the threshold q, or partial sums of functions of the κ j (e.g., (κ j − q) 2 ). We require only that an ansatz be accurate for such quantities. We do not use this fact explicitly, but it motivates our approach -and we do not claim that our ansatz is accurate for all conceivable functions.
In general, if a sample κ of size n is drawn so that each κ has the same probability density function Pr(κ), then a good approximation for the j th order statistic is given by the inverse cumulative distribution function (CDF): This is closely related to the observation that the histogram of a sample tends to look similar to the underlying probability density function. More precisely, it is equivalent to the observation that the empirical distribution function (the CDF of the histogram) tends to be (even more) similar to the underlying CDF. For i.i.d. samples, this is the content of the Glivenko-Cantelli theorem [56]. Figure 8 compares the order statistics of GUE(100) and GUE(10) eigenvalues (computed as numerical averages over 100 random samples) to the inverse CDF for the Wigner semicircle distribution. Even though the Wigner semicircle model of GUE eigenvalues is only exact as n → ∞, it provides a nearly-perfect model for κ even at n = 10 (and remains surprisingly good all the way down to n = 2). We make one further approximation, by assuming that n 1, so the distribution of the κ j is effectively continuous and identical to Pr(κ). For the quantities that we compute, this is equivalent to replacing the empirical distribution function (which is a step function) by the CDF of the Wigner semicircle distribution. So, whereas for any given sample the partial sum of all κ j > q jumps discontinuously when q = κ j for any j, in this approximation it changes smoothly. This accurately models the average behavior of partial sums.

Deriving an approximation for q
The approximations of the previous section allow us to use {p j } ∪ {κ j } as the ansatz for the eigenvalues of ρ ML,M d , where the p j are N (ρ jj , 2 ) random variables, and the κ j are the (fixed, smoothed) order statistics of a Wigner semicircle distribution. In turn, the defining equation for q (Equation (12)) is well approximated as r j=1 (p j − q) + + n j=1 (κ j − q) + = 1.
To solve this equation, we observe that the κ j are symmetrically distributed around κ = 0, so half of them are negative. Therefore, with high probability, Tr Trunc(ρ ML,M d ) > 1, and so we will need to subtract q1l fromρ ML,M d before truncating.
Because we have assumed N samples is sufficiently large (N samples >> min j 1/ρ 2 jj ), the eigenvalues of ρ 0 are large compared to the perturbations δ jj and q. This implies (p j − q) + = p j − q. Under this assumption, q is the solution to where ∆ = r j=1 δ jj is a N (0, r 2 ) random variable. We choose to replace a discrete sum (line 1) with an integral (line 2). This approximation is valid when n 1, as we can accurately approximate a discrete collection of closely spaced real numbers by a smooth density or distribution over the real numbers that has approximately the same CDF. It is also remarkably accurate in practice.
In yet another approximation, we replace ∆ with its average value, which is zero. We could obtain an even more accurate expression by treating ∆ more carefully, but this crude approximation turns out to be quite accurate already.
To solve Equation (15), it is necessary to further simplify the complicated expression resulting from the integral (line 3). To do so, we assume ρ 0 is relatively lowrank, so r d/2. In this case, the sum of the positive κ j is large compared with r, almost all of them need to be subtracted away, and therefore q is close to 2 √ n. We therefore replace the complicated expression with its leading order Taylor expansion around q = 2 √ n, substitute into Equation (15), and obtain the equation This equation is a quintic polynomial in q/ , so by the Abel-Ruffini theorem, it has no algebraic solution. However, as n → ∞, its roots have a well-defined algebraic approximation that becomes accurate quite rapidly (e.g., for n > 4): where x = 15πr 2n 2/5 [57].

Expression for λ kite
Now that we know how much to subtract off in the truncation process, we can approximate λ kite , originally given in Equation (13):

D. Complete Expression for λ
The total expected value, λ = λ L + λ kite , is thus where z is given in Equation (17), n = d − r, and r = Rank(ρ 0 ). This null theory is much more complicated than the Wilks theorem, but as Figure 9 shows, it is very accurate when 2r << d. (In contrast, the prediction of the Wilks theorem is wildly incorrect for r d.) Although our null theory does break down as r → d, it does so fairly gracefully. We conclude that our analysis (and Equation (19)) correctly models tomography if the Fisher information is isotropic (I ∝ 1l), a point we turn to in the following subsections.
In the asymptotic limit d → ∞, while keeping r fixed, (20) That λ scales as O(rd) in this regime is to be expected, as a rank-r density matrix has O(rd) free parameters. Curiously though, this asymptotic result is not equal to λ L , meaning that the "kite" elements continue to contribute to the behavior of the statistic (even though most of them will be set to zero in the projection step when computingρ ML,M ).

V. COMPARISON TO NUMERICAL EXPERIMENTS
To evaluate our null theory for λ , we compare it to numerical experiments, described below.

A. Comparison to Idealized Tomography (Isotropic Fisher Information)
In our derivation of Equation (19), we assumed both that the Fisher information is isotropic and that the number of samples is asymptotically infinite. For a variety of true states ρ 0 with dimension d = 2, . . . , 30 and rank r = 1, . . . , 10 we: (a) generated N = 500 i.i.d. N (ρ 0 , 2 I) unconstrained ML estimates {ρ ML,M d,j } N j=1 , thereby simulating the unconstrained ML estimates at the N samples = ∞ limit, (b) numerically solved Equation (8)  ML estimates are close to ρ 0 , and that we are not erroneously generating estimates which are too far away.
(Recall that our derivation used the fact that, asymptotically, we can "zoom in" on ρ 0 to understand the behavior of λ. Consequently, if is too large, then some of the unconstrained ML estimates may almost be orthogonal to ρ 0 , which clearly violates the conditions used in our derivation.) In Figure 9, we compare our theory (dashed lines) to these numerical results (solid dots). It is clear Equation (19) is almost perfectly accurate when r d/2, but it does begin to break down as r becomes comparable to d.

B. Comparison to Heterodyne Tomography
In practice, the Fisher information is rarely isotropic. So we tested our idealized result by applying it to a realistic, challenging, and experimentally relevant problem: quantum heterodyne (equivalent to double homodyne) state tomography [11,12,14,58] of a single optical mode. (See Figure 10 for a plot of the condition number -the ratio of the largest eigenvalue to the smallestof the estimated Fisher information. It is clear I ∝ 1l.) States of this continuous-variable system are described by density operators on the infinite-dimensional Hilbert space L 2 (R). Fitting these infinitely many parameters to finitely much data demands simpler models.
We consider a family of nested models motivated by a low-energy (few-photon) ansatz, and choose the Hilbert space H d to be that spanned by the photon number states {|0 , . . . , |d − 1 }. Heterodyne tomography reconstructs ρ 0 using data from repeated measurements of the coherent state POVM, {|α α|/π, α = x + ip ∈ C}, which corresponds to sampling directly from the Husimi Qfunction of ρ 0 [59].
We examined the behavior of λ for 13 distinct states FIG. 11. Applying isotropic formula to heterodyne tomography: The Wilks theorem (orange dots) dramatically over-estimates λ(ρ0, M d ) in optical heterodyne tomography. Our formula, Equation 19 (blue squares), is far more accurate. Residual discrepancies occur in large part because N samples is not yet "asymptotically large". The solid red line corresponds to perfect correlation between theory ( λ ) and practice (λ). ρ 0 , both pure and mixed, supported on H 2 , H 3 , H 4 , and H 5 . We used rejection sampling to simulate 100 heterodyne datasets with up to N samples = 10 5 , and found ML estimatesρ ML,M d over each of the 9 models M 2 , . . . , M 10 using numerical optimization [60]. For each ρ 0 and each d, we averaged λ(ρ 0 , M d ) over all 100 datasets to obtain an empirical average loglikelihood ratioλ(ρ 0 , d).
Results of this test are shown in Figure 11, where we plot the predictions for λ given by the Wilks theorem and Equation (19), against the empirical averageλ, for a variety of ρ 0 and d. Our formula correlates very well with the empirical average, while the Wilks theorem (unsurprisingly) overestimates λ dramatically for low-rank states. Whereas a model selection procedure based on the Wilks theorem would tend to falsely reject larger Hilbert spaces (by setting the threshold for acceptance too high), our formula provides a reliable null theory.
Interestingly, as d grows, Equation (19) also begins to overpredict. As Figure 12 indicates, a more accurate description is that the numerical experiments are underachieving, becauseλ is still growing with N samples . Both the Wilks theorem and our analysis are derived in the limit N samples → ∞; for finite but large N samples , both may be invalid. Figure 12 shows that, even at N samples ∼ 10 5 , the behavior ofλ has failed to become asymptotic. This is surprising, and suggests heterodyne tomography is a particularly exceptional and challenging case to model statistically. However, our model does get some of the qualitative features correct. In Figure 13, we present simulated values of λ jk for an isotropic Fisher information and for heterodyne tomography. While the values of λ jk do not agree exactly, they still decompose into two types, the "L" and the "kite". (See Figure 14 for an analysis of the discrepancies.) The empirical averageλ may have achieved its asymptotic value, or is still growing, depending on ρ0 and d.
Solid lines indicate the value of our formula for the asymptotic expected value, given in Equation (19).

VI. CONCLUSIONS AND DISCUSSION
Quantum state space violates local asymptotic normality, a key property satisfied by classical statistical models. Through the introduction of metric-projected local asymptotic normality, we have provided a new framework for reasoning about classical statistical results for models that don't satisfy LAN because of convex constraints. We explicitly investigated one such result, the Wilks theorem, found it is not generally reliable in quantum state tomography, and provided a much more broadly applicable replacement that can be used in model selection methods (Equation (19)). This includes information criteria such as the AIC and BIC [17,29,38,39] that do not explicitly use the Wilks theorem, but rely on the same assumptions (local asymptotic normality, equivalence between loglikelihood and squared error, etc.). Null theories of loglikelihood ratios have many other applications, including hypothesis testing [27,33] and confidence regions [61], and our result is directly applicable to them. Refs. [27,61] both point out explicitly that their methods are unreliable near boundaries and therefore cannot be applied to rank-deficient states; our result fixes this outstanding problem.
However, our numerical experiments with heterodyne tomography show unexpected behavior, indicating that quantum tomography can still surprise, and may violate all asymptotic statistics results. In such cases, bootstrapping [62,63] may be the only reliable way to construct null theories for λ.
Finally, the methods presented here have application beyond the analysis of loglikelihoods. Metric-projected local asymptotic normality provides a new and rigorous framework for describing the behavior of the maximum likelihood estimator in the presence of convex constraints. This framework helps shed light on the behavior ofρ ML,d for rank-deficient states, and can be used to predict other derived properties such as the average rank of the estimate, which is independently interesting for (e.g.) quantum compressed sensing [19,[64][65][66].
Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract de-na0003525.