
Behavior of the maximum likelihood in quantum state tomography

Travis L Scholten and Robin Blume-Kohout

Published 22 February 2018 © 2018 The Author(s). Published by IOP Publishing Ltd on behalf of Deutsche Physikalische Gesellschaft
Citation: Travis L Scholten and Robin Blume-Kohout 2018 New J. Phys. 20 023050. DOI: 10.1088/1367-2630/aaa7e2



Abstract

Quantum state tomography on a d-dimensional system demands resources that grow rapidly with d. They may be reduced by using model selection to tailor the number of parameters in the model (i.e., the size of the density matrix). Most model selection methods rely on a test statistic and a null theory that describes its behavior when two models are equally good. Here, we consider the loglikelihood ratio. Because of the positivity constraint ρ ≥ 0, quantum state space does not generally satisfy local asymptotic normality (LAN), meaning the classical null theory for the loglikelihood ratio (the Wilks theorem) should not be used. Thus, understanding and quantifying how positivity affects the null behavior of this test statistic is necessary for its use in model selection for state tomography. We define a new generalization of LAN, metric-projected LAN, show that quantum state space satisfies it, and derive a replacement for the Wilks theorem. In addition to enabling reliable model selection, our results shed more light on the qualitative effects of the positivity constraint on state tomography.


Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Determining the quantum state ρ0 produced by a specific preparation procedure for a quantum system is a problem almost as old as quantum mechanics itself [1, 2]. This task, known as quantum state tomography [3], is not only useful in its own right (diagnosing and detecting errors in state preparation), but is also used in other characterization protocols including entanglement verification [4–6] and process tomography [7]. A typical state tomography protocol proceeds as follows: many copies of ρ0 are produced, they are measured in diverse ways, and finally the outcomes of those measurements (data) are collated and analyzed to produce an estimate $\hat{\rho }$. This is a straightforward statistical inference process [8, 9], where the data are used to fit the parameters of a statistical model. In state tomography, the parameter is ρ, and the model is the set of all possible density matrices on a Hilbert space ${ \mathcal H }$ (equipped with the Born rule). However, we do not always know what model to use. It is not always a priori obvious what ${ \mathcal H }$, or its dimension, is; examples include optical modes [10–14] and leakage levels in AMO and superconducting [15, 16] qubits. In such situations, we seek to let the data itself determine which of many candidate Hilbert spaces is best suited for reconstructing ρ0.

Choosing an appropriate Hilbert space on the fly is an instance of a general statistical problem called model selection. Although model selection has been thoroughly explored in classical statistics [17], its application to state tomography encounters some obstacles. They stem from the fact that quantum states—and therefore, estimates of them—must satisfy a positivity constraint ρ ≥ 0. (See figure 1.) A similar constraint, complete positivity, applies to process tomography. The impact of positivity constraints on state and process tomography is an active area of research [18–21], and its implications for model selection have also been considered [22–28]. In this paper, we address a specific question at the heart of this matter: How does the loglikelihood ratio statistic used in many model selection protocols, including (but not limited to) information criteria such as Akaike's AIC [29], behave in the presence of the positivity constraint ρ ≥ 0?


Figure 1. Impact of the positivity constraint ($\rho \geqslant 0$) on tomographic estimates: this figure illustrates how the quantum state space's boundary—which results from the constraint $\rho \geqslant 0$—affects maximum likelihood (ML) tomography for a qutrit state ρ0 (star). Two different two-dimensional cross-sections of the state space are shown, which correspond to a qubit (left) and a classical 3-outcome distribution (right). Top: without the positivity constraint, some ML estimates (orange squares) are not valid estimates of a quantum state, because they are not positive semidefinite. However, some ML estimates (blue circles) are. Further, the ML estimates are Gaussian distributed. Bottom: imposing the positivity constraint forces the (previously negative) ML estimates to 'pile up' on the boundary of state space; the distribution $\Pr ({\hat{\rho }}_{\mathrm{ML}})$ is not Gaussian, and local asymptotic normality is not satisfied. In turn, the assumptions necessary to invoke the Wilks theorem are not satisfied either.


We begin in section 1 by introducing the loglikelihood ratio statistic λ, and outline how it can be used to choose a Hilbert space. In section 2, we show how and why the classical null theory for its behavior, the Wilks theorem, falls apart in the presence of the positivity constraint, because quantum state space does not generally satisfy local asymptotic normality (LAN). We define a new generalization of LAN, metric-projected local asymptotic normality (MP-LAN), in section 3; this generalization explicitly accounts for the positivity constraint, and is satisfied by quantum state space. Using this generalization, we derive a closed-form approximation for λ's expected value in section 4, thereby providing a replacement for the Wilks theorem that is applicable in state tomography. Finally, we test the validity of our approximate theory under the assumptions used in its derivation (section 5.1), and under harsh conditions by comparing it to numerical results for the realistic problem of optical heterodyne tomography (section 5.2).

1. Introduction—statistical model selection

Discussing model selection for state tomography requires introducing some terminology/notation from statistics. A model is a parameterized family of probability distributions over some data D, usually denoted as ${\Pr }_{{\boldsymbol{\theta }}}(D)$, where ${\boldsymbol{\theta }}$ are the parameters of the model. In state tomography, the parameters are a quantum state ρ, the data are the observed outcomes of the measurement of a positive operator-valued measure (POVM) {Ej}, and the probability of observing outcome 'j' is given by the Born rule: ${p}_{j}=\mathrm{Tr}(\rho {E}_{j})$. In this paper, a model is a set of density matrices, and a state ρ is a particular choice of the model's parameters.

Suppose we have data D obtained from an unknown state ρ0, and two candidate models ${{ \mathcal M }}_{1},{{ \mathcal M }}_{2}$ that could be used to reconstruct it. Many of the known methods for choosing between them (i.e., model selection) involve quantifying how well each model fits the data by its likelihood. The likelihood of a simple hypothesis ρ is defined as ${ \mathcal L }(\rho )=\Pr (D| \rho )$. Models, however, are composite hypotheses, comprising many possible values of ρ. A canonical way to define model ${ \mathcal M }$'s likelihood is via the general method of maximum likelihood (ML), by maximizing ${ \mathcal L }(\rho )$ over $\rho \in { \mathcal M }$. In practice, the maximization is usually done explicitly to find an ML estimate ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ [30–32] of ${ \mathcal M }$'s parameters, and then ${ \mathcal L }({ \mathcal M })={ \mathcal L }({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})$. (Although it is common to refer to ${\hat{\rho }}_{\mathrm{ML}}$ without specifying the model over which ${ \mathcal L }$ was maximized, we list the model explicitly because in this paper, we are frequently using many different models!)
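
To make this concrete, here is a minimal Python sketch (with hypothetical variable names) of the likelihood of a simple hypothesis ρ for multinomial data, and of how two maximized model likelihoods enter the statistic of equation (1); the maximization over a model is discussed in section 2.

import numpy as np

def log_likelihood(rho, povm, counts):
    # log L(rho) = sum_j n_j log Tr(rho E_j), for observed counts {n_j}
    probs = np.array([np.real(np.trace(rho @ E)) for E in povm])
    return float(np.sum(counts * np.log(probs)))

# The likelihood of a model M is L(M) = L(rho_ML_M): log_likelihood evaluated
# at the maximizer over M.  Given two maximized values logL_M1 and logL_M2,
# the loglikelihood ratio statistic of equation (1) is
#     lam = 2 * (logL_M2 - logL_M1)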

Then, the models can be compared using the loglikelihood ratio statistic [27, 32, 33]:

Equation (1)
$\lambda ({{ \mathcal M }}_{1},{{ \mathcal M }}_{2})=2\log \left[{ \mathcal L }({{ \mathcal M }}_{2})/{ \mathcal L }({{ \mathcal M }}_{1})\right].$

All else being equal, a positive λ favors ${{ \mathcal M }}_{2}$—i.e., the model with the higher likelihood is more plausible, because it fits the data better. However, all else is rarely equal. If both models are equally valid—e.g., they both contain ρ0—but ${{ \mathcal M }}_{2}$ has more parameters, then ${{ \mathcal M }}_{2}$ will very probably fit the data better. Models with more adjustable parameters do a better job of fitting noise (e.g., finite sample fluctuations) in the data. This becomes strictly true when the models are nested, so that ${{ \mathcal M }}_{1}\subset {{ \mathcal M }}_{2}$. In this case, the likelihood of ${{ \mathcal M }}_{2}$ is at least as high as that of ${{ \mathcal M }}_{1};$ not only is λ ≥ 0, but almost surely λ > 0.

Remarkably, the same effect also makes ${{ \mathcal M }}_{2}$'s fit less accurate (almost surely), because the fit incorporates more of the noise in the data. These two effects constitute overfitting, which can be summed up as 'Extra parameters make the fit look better, but perform worse.' An overfitted model would fit current data extremely well, but would fail to accurately predict future data. This provides strong motivation to correct for overfitting by penalizing or handicapping larger models, to prevent them from being chosen over smaller models that are no less valid, and may even yield better estimates in practice [29].

For this reason, any model selection method/criterion that relies (explicitly or implicitly) on a statistic to quantify 'how well model ${ \mathcal M }$ fits the data' also relies on a null theory to predict how that statistic will behave if some null hypothesis is true. For the model selection problems we consider, the null hypothesis is that ${\rho }_{0}\in { \mathcal M }$, and the null theory will tell us how statistics of interest behave when that null hypothesis is in fact true. A model selection criterion based on an invalid null theory (or a criterion used in a context where its null theory does not apply) will tend to perform sub-optimally (as compared to a method based on a correct null theory).

The null theory can be used to formulate a decision rule for choosing between models. If we know how the test statistic behaves when both models are equally valid, then we can evaluate the observed value of the statistic under the null theory. If the observed value is very improbable under the null theory, then that constitutes evidence against the smaller model, and justifies rejecting it. On the other hand, if the observed value is consistent with the null theory, there is no reason to reject the smaller model.

The standard null theory for λ is the Wilks theorem [34]. It relies on LAN [35, 36]. LAN is a property of ${ \mathcal M };$ if ${ \mathcal M }$ satisfies LAN, then as ${N}_{\mathrm{samples}}\to \infty $:

  • The ML estimate ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is normally distributed around ρ0 with covariance matrix ${{ \mathcal I }}^{-1}$:
    Equation (2)
    ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\sim { \mathcal N }({\rho }_{0},\,{{ \mathcal I }}^{-1})$
  • The likelihood function in a neighborhood of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is locally Gaussian with Hessian ${ \mathcal I }$:
    Equation (3)
    ${ \mathcal L }(\rho )\propto \exp \left(-\tfrac{1}{2}\,\mathrm{Tr}\left[(\rho -{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})\,{ \mathcal I }\,(\rho -{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})\right]\right)$

Here, ${ \mathcal I }$ is the (classical) Fisher information matrix associated with the POVM. It quantifies how much information the data carry about a parameter in the model. (Note that in expressions involving ${ \mathcal I }$, states ρ are treated as vectors in state space, and ${ \mathcal I }$ is a matrix or 2-index tensor acting on that state space.)
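As an illustration of this quantity, the following sketch computes the per-sample classical Fisher information matrix of a POVM at ρ0 in a user-supplied basis of Hermitian operators, using the multinomial form ${{ \mathcal I }}_{{ab}}=\sum_j \mathrm{Tr}({B}_{a}{E}_{j})\,\mathrm{Tr}({B}_{b}{E}_{j})/{p}_{j}$ (the function and argument names are ours, and all outcome probabilities are assumed to be non-zero).

import numpy as np

def fisher_information(rho0, povm, basis):
    # Per-sample classical Fisher information at rho0 for the POVM {E_j},
    # expressed in the Hermitian operator basis {B_a}:
    #   I_{ab} = sum_j Tr(B_a E_j) Tr(B_b E_j) / p_j,  with p_j = Tr(rho0 E_j).
    probs = np.array([np.real(np.trace(rho0 @ E)) for E in povm])
    derivs = np.array([[np.real(np.trace(B @ E)) for E in povm] for B in basis])
    return derivs @ np.diag(1.0 / probs) @ derivs.T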

Most statistical models satisfy LAN. When LAN is satisfied and ${N}_{\mathrm{samples}}$ is large enough to reach the 'asymptotic' regime, we can invoke the Wilks theorem to determine the behavior of λ. This theorem says that under suitable regularity conditions, if ${\rho }_{0}\in {{ \mathcal M }}_{1}\subset {{ \mathcal M }}_{2}$, where ${{ \mathcal M }}_{2}$ has K more parameters than ${{ \mathcal M }}_{1}$, then λ is a ${\chi }_{K}^{2}$ random variable. This is a complete null theory for λ (under the specified conditions), and implies that $\langle \lambda \rangle =K$ and ${({\rm{\Delta }}\lambda )}^{2}=2K$.

Therefore, in the 'Wilks regime', a simple criterion for model selection would be to compare the observed value of λ to ${\lambda }_{\mathrm{thresh}}=\langle \lambda \rangle +k{\rm{\Delta }}\lambda $, for some k ≈ 1, and reject the smaller model if λ > λthresh. While model selection rules can be more subtle and complex than this [29, 37–39], they usually take the general form of a threshold in which $\langle \lambda \rangle $ plays a key role. Rather than attempting to define a specific rule, our purpose in this paper is to understand the behavior of $\langle \lambda \rangle $ and derive an approximate expression for it in the context of state tomography.
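
A minimal sketch of such a threshold rule under the Wilks null theory (illustrative only; real model selection rules are usually more refined):

import numpy as np
from scipy.stats import chi2

def reject_smaller_model(lam, K, k=1.0):
    # Reject the smaller model if lambda exceeds <lambda> + k*Delta(lambda) = K + k*sqrt(2K).
    return lam > K + k * np.sqrt(2 * K)

def wilks_p_value(lam, K):
    # Probability of a loglikelihood ratio at least this large when the smaller
    # model is valid and the Wilks theorem applies (lambda ~ chi^2_K).
    return chi2.sf(lam, df=K)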

The first step in doing so is to explain how and why the Wilks theorem breaks down in that context.

2. Quantum state tomography and the breakdown of the Wilks theorem

Quantum state tomography typically begins with Nsamples independently and identically prepared quantum systems—i.e., Nsamples copies of an unknown state ρ0. Each copy is measured, and without loss of generality we can assume that each measurement is described by the same POVM. A POVM is a collection of positive operators {Ej} summing to ${\mathbb{I}}$, and the probability of outcome 'j' is given by $\mathrm{Tr}({\rho }_{0}{E}_{j})$. The results of all Nsamples measurements constitute data, represented as a record of the frequencies of the possible outcomes {nj}, where nj is the number of times 'j' was observed, and ${\sum }_{j}{n}_{j}={N}_{\mathrm{samples}}$. Finally, this data is processed through some estimator to yield an estimate of ρ0, denoted $\hat{\rho }$ .
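
Simulating this data-generation process is straightforward; the sketch below uses an illustrative qubit state and 4-outcome POVM (not taken from the paper) and draws the counts {nj} from the corresponding multinomial distribution.

import numpy as np

rng = np.random.default_rng(0)
rho0 = np.array([[0.9, 0.1], [0.1, 0.1]], dtype=complex)   # an example true state
X = np.array([[0, 1], [1, 0]], dtype=complex)
Z = np.array([[1, 0], [0, -1]], dtype=complex)
I2 = np.eye(2, dtype=complex)
povm = [0.25 * (I2 + X), 0.25 * (I2 - X), 0.25 * (I2 + Z), 0.25 * (I2 - Z)]
N_samples = 10_000
probs = np.array([np.real(np.trace(rho0 @ E)) for E in povm])   # Born-rule probabilities
counts = rng.multinomial(N_samples, probs)                      # the data {n_j}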

Although a variety of estimators have been proposed [30–32, 40–43], the exact estimator used is not our concern here. However, since we are concerned with computing the likelihood of a model ${ \mathcal M }$, which is defined as the likelihood of the most likely $\rho \in { \mathcal M }$, we will make extensive use of the ML estimator. This should not be taken as advocacy for the ML estimator; it is only a convenient way to find the maximum of ${ \mathcal L }$ over ${ \mathcal M }$, and once a model is chosen, a different estimator could be used. The likelihood ${ \mathcal L }(\rho )$ is

${ \mathcal L }(\rho )=\Pr (\{{n}_{j}\}\,|\,\rho )\propto \prod_{j}\mathrm{Tr}{(\rho {E}_{j})}^{{n}_{j}},$

and ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the solution to the optimization problem

${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}=\mathop{\mathrm{argmax}}\limits_{\rho \in { \mathcal M }}{ \mathcal L }(\rho )=\mathop{\mathrm{argmin}}\limits_{\rho \in { \mathcal M }}\left[-\log { \mathcal L }(\rho )\right].$

In state tomography, ${ \mathcal M }$ is almost always the set of all density matrices over a Hilbert space ${ \mathcal H }$:

${{ \mathcal M }}_{{ \mathcal H }}=\{\rho \in { \mathcal B }({ \mathcal H })\,|\,\rho ={\rho }^{\dagger },\ \rho \geqslant 0,\ \mathrm{Tr}(\rho )=1\},$

where ${ \mathcal B }({ \mathcal H })$ is the space of bounded linear operators on ${ \mathcal H }$. To determine ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$, we can use the following facts: (a) ${{ \mathcal M }}_{{ \mathcal H }}$ is a convex set, and (b) ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ minimizes the value of the convex function $-\mathrm{log}[{ \mathcal L }(\rho )]$. Because ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the solution to minimizing a convex function over a convex set, it can be found efficiently via any of several algorithms for convex optimization [44].
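
For illustration, the sketch below finds ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ by parametrizing ρ = T T†/Tr(T T†) (with T an arbitrary complex d × d matrix, which guarantees ρ ≥ 0 and Tr ρ = 1) and handing the resulting unconstrained problem to a generic optimizer. This is a simple alternative to, not an implementation of, the dedicated convex-optimization algorithms of [44]; the function names and restart strategy are our own.

import numpy as np
from scipy.optimize import minimize

def ml_estimate(povm, counts, d, n_restarts=5, seed=0):
    rng = np.random.default_rng(seed)
    def unpack(x):
        # Map 2d^2 real parameters to a density matrix rho = T T^dag / Tr(T T^dag).
        T = x[:d * d].reshape(d, d) + 1j * x[d * d:].reshape(d, d)
        rho = T @ T.conj().T
        return rho / np.real(np.trace(rho))
    def neg_log_likelihood(x):
        rho = unpack(x)
        probs = np.array([np.real(np.trace(rho @ E)) for E in povm])
        probs = np.clip(probs, 1e-12, None)   # guard against log(0)
        return -float(np.sum(counts * np.log(probs)))
    best = None
    for _ in range(n_restarts):   # random restarts: the parametrization itself is not convex
        res = minimize(neg_log_likelihood, rng.normal(size=2 * d * d), method="L-BFGS-B")
        if best is None or res.fun < best.fun:
            best = res
    return unpack(best.x), -best.fun   # (rho_ML, maximized log-likelihood)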

Usually, ${ \mathcal H }$ is taken for granted or chosen by fiat. In this paper, we will consider a nested family of different Hilbert spaces, indexed by their dimension d: ${{ \mathcal H }}_{1}\subset \cdots \,\subset {{ \mathcal H }}_{d}\subset {{ \mathcal H }}_{d+1}\subset \cdots $. The models we consider are therefore given by:

Equation (4)
${{ \mathcal M }}_{d}=\{\rho \in { \mathcal B }({{ \mathcal H }}_{d})\,|\,\rho ={\rho }^{\dagger },\ \rho \geqslant 0,\ \mathrm{Tr}(\rho )=1\}.$

For notational brevity, we will use ${\hat{\rho }}_{\mathrm{ML},d}$ to denote the ML estimate over ${{ \mathcal M }}_{d}$. To select between these models, we need to determine whether one model (say, ${{ \mathcal M }}_{d+1}$) is 'better' than another (say, ${{ \mathcal M }}_{d}$). To evaluate which is better, we typically compute the likelihood of each model, and then use $\lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})$ to choose between them. As mentioned in the previous section, this requires having a null theory for λ that describes its behavior when ${\rho }_{0}\in {{ \mathcal M }}_{d}\subset {{ \mathcal M }}_{d+1}$.

The Wilks theorem, which is the classical null theory for λ, relies on LAN. If the models under consideration satisfy LAN, then as mentioned in the previous section, the likelihood ${ \mathcal L }(\rho )$ is Gaussian with a Hessian given by the Fisher information. In classical statistics, it is common to assume that boundaries are not relevant, either because the models of interest have none, or because the true parameter values ρ0 lie far away from them. In the absence of boundaries, and in the asymptotic limit where the curvature of the Fisher information metric is also negligible, many calculations can be simplified by changing to Fisher-adjusted coordinates in which the Fisher information is isotropic (i.e., ${ \mathcal I }\propto {\mathbb{I}}$). Under these assumptions and simplifications, the Wilks theorem can be derived.

In quantum state tomography, the Wilks theorem breaks down for two reasons. First, the quantum state space does have boundaries. Second, the Fisher information is anisotropic, and the anisotropy cannot easily be eliminated by a coordinate change because those boundaries define a preferred coordinate system. We discuss these obstacles—and our plan to address them—in detail in the remainder of this section.

Given a model ${{ \mathcal M }}_{d}$, its boundary is the set of rank-deficient states within it. When ${\rho }_{0}\in {{ \mathcal M }}_{d}$ and is full rank, LAN will hold—which is to say that, asymptotically, the boundary can indeed be ignored. But when ρ0 is rank-deficient, it lies on the boundary of the model. LAN is not satisfied, because positivity constrains ${\hat{\rho }}_{\mathrm{ML},d}$, and so $\Pr ({\hat{\rho }}_{\mathrm{ML},d})$ is not Gaussian (see figure 1). The Wilks theorem does not apply in this case, and its predictions regarding $\langle \lambda \rangle $ are not even close (see figure 2). Moreover, this is the relevant situation for our analysis, because even if ρ0 is full-rank in ${{ \mathcal M }}_{d}$, it must be rank-deficient in ${{ \mathcal M }}_{d+1}$. So we require a replacement for the Wilks theorem; that is, we need a null theory for λ when ρ0 is rank-deficient.


Figure 2. Predictions of the Wilks theorem versus reality: in the context of state tomography on a true state ρ0 in a d-dimensional Hilbert space, the Wilks theorem can be used to predict that, when comparing the zero-parameter model ${{ \mathcal M }}_{0}=\{{\rho }_{0}\}$ and the $({d}^{2}-1)$-parameter model ${{ \mathcal M }}_{d}$ defined in equation (4), the expected loglikelihood ratio $\langle \lambda ({{ \mathcal M }}_{0},{{ \mathcal M }}_{d})\rangle $ will be d2 − 1. Here, we compare that prediction to numerical simulations of tomography on states ρ0 in dimension d = 2, ..., 30, with ranks r = 1, ..., min(10, d). The Wilks theorem only predicts $\langle \lambda \rangle $ correctly for full-rank states; when $r\ll d$, the actual expected loglikelihood ratio is much smaller. Our main result (equation (19)) gives a replacement that works correctly (see figure 9).


One challenge in deriving this replacement is that the Fisher information generally depends strongly on ρ0 and the POVM being measured (see figure 5). In many standard derivations, such anisotropy has no impact and can be eliminated easily by changing to Fisher-adjusted coordinates. But the models we consider (quantum states) have boundaries that break scale-invariance, and define preferred coordinate systems. Changing to Fisher-adjusted coordinates does not eliminate the effect of anisotropy, because the boundary has a new shape in the new coordinates that serves as a record of the anisotropy. Moreover, the methods we derive here for calculating the impact of the boundary rely heavily on a particular coordinate system (Hilbert–Schmidt coordinates), and changing to Fisher-adjusted coordinates would break them. This makes it very difficult to derive a precise generalization of the Wilks theorem for arbitrary Fisher information, so to derive our results we make the key simplifying assumption that the Fisher information is isotropic with respect to the Hilbert–Schmidt metric. This is almost never exactly true in practice [45], but it is reasonable to presume that our results remain useful and approximately true when the Fisher information is almost isotropic. Our results actually appear to be surprisingly robust to significant anisotropy. In section 5.2, we perform numerical simulations of heterodyne tomography—in which the condition number of the Fisher information ranges from $10^{1}$ to $10^{9}$ ($1$ corresponds to isotropic Fisher information)—and find that our new theory remains reasonably accurate even in this extreme scenario.

To derive our replacement for the Wilks theorem, we first need a new framework for reasoning about models with convex constraints. We develop such a framework in the next section by defining a new generalization of LAN.

3. Generalizing LAN to deal with constraints

In this section, we develop a framework that will allow us to derive a replacement for the Wilks theorem that holds for rank-deficient ρ0. To do so, we define a generalization of LAN in the presence of boundaries, which we call MP-LAN. (For other generalizations of LAN, see [46, 47].) Like LAN, MP-LAN is a property that a statistical model may satisfy. Unlike LAN, MP-LAN is satisfied by quantum state space. For any model that satisfies MP-LAN (quantum or classical), we compute an asymptotically exact expression for λ, a necessary building block in our replacement for the Wilks theorem.

In section 4, we show that the models ${{ \mathcal M }}_{d}$ satisfy MP-LAN, and derive an approximation for $\langle \lambda \rangle $ (equation (19), on page 12). Section 5.1 compares our theory to numerical results, and section 5.2 applies our theory to the problem of heterodyne tomography.

The reader should note that to enhance readability, in this section (and only this section) we use N to denote the number of samples, previously denoted as Nsamples.

3.1. Definitions and overview of results

The main definitions and results required for the remainder of the paper are presented in this subsection. Technical details and proofs are presented in the next subsection.

Definition 1 (MP-LAN). A model ${ \mathcal M }$ satisfies MP-LAN if ${ \mathcal M }$ is a convex subset of a model ${ \mathcal M }^{\prime} $ that satisfies LAN.

While there are many possible choices for the unconstrained model ${ \mathcal M }^{\prime} $, we will find it useful to let ${ \mathcal M }^{\prime} $ be a model whose dimension is the same as ${ \mathcal M }$, but where any of the constraints that define ${ \mathcal M }$ are lifted. (For example, in lemma 5, we will take ${ \mathcal M }^{\prime} $ to be Hermitian matrices of dimension d.) Other choices of ${ \mathcal M }^{\prime} $ are possible, but we do not explore those here.

Although the definition of MP-LAN is rather short, it implies some very useful properties. These properties follow from the fact that, as $N\to \infty $, the behavior of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ and λ is entirely determined by their behavior in an arbitrarily small region of ${ \mathcal M }$ around ρ0, which we call the local state space.

Definition 2 (Local state space). For $N$ samples, let ${{ \mathcal I }}_{N}$ be the corresponding Fisher information matrix of ${ \mathcal M }$ at ${\rho }_{0}$, and let ${{ \mathcal M }}_{N}$ be the set obtained by rescaling each point in ${ \mathcal M }$ by ${{ \mathcal I }}_{N}^{-1/2}$. (In other words, ${{ \mathcal M }}_{N}$ is just ${ \mathcal M }$, but scaled so that the Fisher information at ${\rho }_{0}$ is ${\mathbb{I}}$.) In addition, let ${C}_{N}$ be a convex subset of ${{ \mathcal M }}_{N}$ chosen so that (a) ${C}_{N+1}$ contains ${C}_{N}$, and (b) ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})=1$. Then the sequence $\{{C}_{N}:N=1,2,\ldots \}$ converges to the local state space around ${\rho }_{0}$.

Models that satisfy MP-LAN have the following asymptotic properties:

  • The local state space is the solid tangent cone of the model at ${\rho }_{0}$, denoted T(ρ0).
  • The ML estimate ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is given by the metric projection of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ onto T(ρ0):
    Equation (5)
    ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}=\mathop{\mathrm{argmin}}\limits_{\rho \in T({\rho }_{0})}\mathrm{Tr}\left[({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }-\rho )\,{ \mathcal I }\,({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }-\rho )\right].$
    (We first encountered the term 'metric projection' in the convex optimization literature [48, 49], which inspired our choice of the acronym 'MP-LAN'. However, it should be noted that in the problem setting considered in those references, ${ \mathcal I }={\mathbb{I}}$.)
  • The loglikelihood ratio $\lambda ({\rho }_{0},{ \mathcal M })$, defined as
    Equation (6)
    $\lambda ({\rho }_{0},{ \mathcal M })=2\log \left[{ \mathcal L }({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})/{ \mathcal L }({\rho }_{0})\right],$
    takes the following simple form:
    Equation (7)
    $\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}\left[({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\rho }_{0})\,{ \mathcal I }\,({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\rho }_{0})\right].$
    (This property is non-trivial; see figure 3.)


Figure 3. Equivalence of λ and squared distance when MP-LAN is satisfied: For any model ${{ \mathcal M }}_{k}$, $\lambda ({\rho }_{0},{{ \mathcal M }}_{k})$ is the difference between the squared distance from ${\rho }_{0}$ to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ and that from ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{k}}$ to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ (black lines). If ${{ \mathcal M }}_{k}$ satisfies MP-LAN, then (a) ${{ \mathcal M }}_{k}\subset { \mathcal M }^{\prime} $ for an unconstrained model ${ \mathcal M }^{\prime} $, and (b) λ is equal, in the asymptotic limit, to the squared distance from ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{k}}$ to ρ0 (red lines), because ${\rho }_{0},{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$, and ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{k}}$ form a right triangle. This is also true for models with curved boundaries (such as quantum state space) because asymptotically, the local state space is the solid tangent cone, whose boundaries are always flat.


Even when ${ \mathcal M }$ satisfies MP-LAN, these properties may not be true when N is finite; they are guaranteed only in the asymptotic limit. When N is sufficiently large, we can (and will!) use the asymptotic properties above.

The following subsection presents the technical details and definitions necessary to show the above results. The reader may skip it without loss of continuity, and proceed to section 4.

3.2. Technical details

Assume a statistical model ${ \mathcal M }$ that satisfies MP-LAN. Below, we prove the properties of ${ \mathcal M }$ asserted in section 3.1.

3.2.1. Convergence of the local state space to the solid tangent cone

Because ${ \mathcal M }$ satisfies MP-LAN, there exists a model ${ \mathcal M }^{\prime} \supset { \mathcal M }$ of dimension $d^{\prime} $ that satisfies LAN. This means that as $N\to \infty $, the distribution of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ converges to a Gaussian:

${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\mathop{\to }\limits^{{d}}{ \mathcal N }({\rho }_{0},{\rm{\Sigma }}/N),$

where $\mathop{\to }\limits^{{d}}$ means 'converges in distribution to', and ${\rm{\Sigma }}={{ \mathcal I }}^{-1}$. The shape of the distribution is entirely determined by ${ \mathcal I }$. As $N\to \infty $, this Gaussian distribution becomes more and more tightly concentrated around ρ0. Although there is always a non-zero probability that ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ will be arbitrarily far away from ρ0, it is possible to define a sequence of balls ${B}_{N}$ that shrink with N, yet contain every ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ with probability 1 as $N\to \infty $.

First, we switch coordinates by sending $\rho \to \rho -{\rho }_{0}$, establishing ρ0 as the origin of the coordinate system. In these coordinates, ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\sim { \mathcal N }(0,{\rm{\Sigma }}/N)$, and the following lemma constructs BN.

Lemma 1. Let ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\sim { \mathcal N }(0,{\rm{\Sigma }}/N)$, and let ${\lambda }_{\max }({\rm{\Sigma }})$ denote the largest eigenvalue of Σ. Define ${B}_{N}=\{\rho \in { \mathcal M }^{\prime} \,| \,\mathrm{Tr}({\rho }^{2})\leqslant {r}^{2}\}$, where $r=\sqrt{{\lambda }_{\max }({\rm{\Sigma }})}/{N}^{1/4}$. Then, ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})=1$.

Proof. Let ${B}_{N}^{0}$ be an ellipsoidal ball defined by $\{\rho \in { \mathcal M }^{\prime} | \,\mathrm{Tr}(\rho {{\rm{\Sigma }}}^{-1}\rho )\leqslant 1/{N}^{1/2}\}$. Change coordinates by defining $\sigma ={N}^{1/2}{{\rm{\Sigma }}}^{-1/2}\rho $. In these new coordinates ${\hat{\sigma }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\sim { \mathcal N }(0,{{\mathbb{I}}}_{d^{\prime} })$, and ${B}_{N}^{0}=\{\sigma \in { \mathcal M }^{\prime} | \,\mathrm{Tr}({\sigma }^{2})\leqslant {N}^{1/2}\}$. Therefore,

$\Pr ({\hat{\sigma }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})=\Pr ({\chi }_{d^{\prime} }^{2}\leqslant {N}^{1/2}),$

because $\mathrm{Tr}({\hat{\sigma }}_{\mathrm{ML},{ \mathcal M }^{\prime} }^{2})$ is a ${\chi }_{d^{\prime} }^{2}$ random variable. It follows that

${\mathrm{lim}}_{N\to \infty }\,\Pr ({\hat{\sigma }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})=1.$

Switching back to the original coordinates, we have

$\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})=\Pr ({\hat{\sigma }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0}),$

and ${\mathrm{lim}}_{N\to \infty }\,\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})=1$.

Now that we know ${B}_{N}^{0}$ contains all ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ as $N\to \infty $, we can show the same holds true for ${B}_{N}$. It suffices to show ${B}_{N}^{0}\subset {B}_{N}$. To see that this is the case, write the equation for ${B}_{N}^{0}$ in the standard quadratic form for an ellipsoid:

${B}_{N}^{0}=\{\rho \,|\,\mathrm{Tr}(\rho \,({N}^{1/2}{{\rm{\Sigma }}}^{-1})\,\rho )\leqslant 1\}.$

The standard ellipsoid $\{{\bf{x}}\,| \,{{\bf{x}}}^{T}A{\bf{x}}\leqslant 1\}$ has semi-major axes whose lengths ${s}_{j}$ are related to the eigenvalues ${a}_{j}$ of $A$: ${s}_{j}=1/\sqrt{{a}_{j}}$. The matrix $A={N}^{1/2}{{\rm{\Sigma }}}^{-1}$ has eigenvalues ${N}^{1/2}/{\lambda }_{j}$, where ${\lambda }_{j}$ are the eigenvalues of Σ. Thus, the lengths of the semi-major axes of ${B}_{N}^{0}$ are given by ${s}_{j}=1/\sqrt{{N}^{1/2}/{\lambda }_{j}}=\sqrt{{\lambda }_{j}}/{N}^{1/4}$. Letting ${\lambda }_{\max }({\rm{\Sigma }})$ denote the largest eigenvalue of Σ, the longest semi-major axis of ${B}_{N}^{0}$ has length $\sqrt{{\lambda }_{\max }({\rm{\Sigma }})}/{N}^{1/4}$. Because ${B}_{N}$ is a ball whose radius is equal to this length, ${B}_{N}$ circumscribes ${B}_{N}^{0}$, and ${B}_{N}^{0}\subset {B}_{N}$.

As ${B}_{N}^{0}\subset {B}_{N}$, it follows from the monotonicity of probability that $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})\leqslant \Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})$. As the asymptotic limit of $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}^{0})$ is 1, and $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})$ itself is bounded above by 1, it follows from the squeeze theorem that ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})=1$.□

Informally speaking, lemma 1 implies that as $N\to \infty $ 'all the action' about ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ takes place inside BN. Accordingly, to understand the behavior of quantities which depend on ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ (such as ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ and λ), it is sufficient to consider their behavior within BN. In fact, we can show that, asymptotically, all the ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ are contained within the region ${C}_{N}\equiv {B}_{N}\cap { \mathcal M }$:

Lemma 2.  ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})=1$.

Proof. Using the law of total probability, write $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})$ as a sum of two terms, depending on whether ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}$. Letting p denote $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})$, we have

$\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})=p\,\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N}\,|\,{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})+(1-p)\,\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N}\,|\,{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\,{\not\in}\,{B}_{N}).$
For any ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}$, the corresponding ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is somewhere in ${ \mathcal M }$. To show ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N}$, we use a proof by contradiction. Suppose that ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the ML estimate in ${ \mathcal M }$ for ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$, and that ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\,{\not\in}\,{C}_{N}$. Let ρC denote the closest point in CN to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$. Because ${B}_{N}\supset {C}_{N}$, it follows that ρC is closer to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ than ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$, contradicting the assumption ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ was the ML estimate in ${ \mathcal M }$ for ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$. Therefore, if ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N}$, it must be the case that ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N}$.

Consequently, $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N}| {\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})=1$, implying $\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})\geqslant \Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in {B}_{N})$. Applying the squeeze theorem, plus lemma 1, we conclude ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})=1$.□

In the original coordinates, both BN and the distribution of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ shrink with N, but BN shrinks more slowly. Suppose, instead, that we switch to N-dependent coordinates that shrink with the distribution of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$. In these coordinates, ${ \mathcal M }$ and ${ \mathcal M }^{\prime} $ grow with N, and BN also grows (but more slowly). This homothetic transformation of ${ \mathcal M }$, ${ \mathcal M }^{\prime} $, and BN scales all of them up. As $N\to \infty $, ${B}_{N}\to {{\mathbb{R}}}^{d^{\prime} }$, and the local state space is the solid tangent cone of ${ \mathcal M }$ at ρ0.

Definition 3 (Homothetic transformation). Given a convex set $C$, the homothetic transformation of $C$ with respect to any point $X\in C$, with homothety coefficient $h$, is the set ${C}_{h}$ defined by

${C}_{h}=\{X+h(Y-X)\,|\,Y\in C\}.$

Definition 4 (Solid tangent cone). For each point $X$ in a convex set $C$, let ${C}_{h}$ be the homothetic transformation of $C$ with respect to $X$, with homothety coefficient $h$. Then, the solid tangent cone $T(X)$ is defined as the following limit:

$T(X)={\mathrm{lim}}_{h\to \infty }{C}_{h}.$

Tangent cones are a general feature of convex sets; see [50, chapter 6], section A for more information about them and their properties.

Let ${C}_{N}={B}_{N}\cap { \mathcal M }$ in Hilbert–Schmidt coordinates. We show that, in an N-dependent coordinate system, CN converges to the solid tangent cone, and is the local state space.

Lemma 3. Consider the set ${C}_{N}={B}_{N}\cap { \mathcal M }$ in Hilbert–Schmidt coordinates, and define ${C}_{N}^{{\prime} }=\{{N}^{1/2}\rho \,\forall \,\rho \in {C}_{N}\}$. Then:

  • (1)  
    ${\mathrm{lim}}_{N\to \infty }{C}_{N}^{{\prime} }$ is the solid tangent cone at ρ0.
  • (2)  
    ${\mathrm{lim}}_{N\to \infty }{C}_{N}^{{\prime} }$ is the local state space.

Proof. 

  • (1)  
    By definition, ${C}_{N}^{{\prime} }$ is a homothetic transformation of ${C}_{N}$, with homothety coefficient ${N}^{1/2}$. (The homothetic center is ρ0; in these coordinates, it is 0.) By definition, ${\mathrm{lim}}_{N\to \infty }{C}_{N}^{{\prime} }$ is the solid tangent cone at ρ0.
  • (2)  
    The original set CN is a convex subset of ${ \mathcal M }$, and from lemma 2, ${\mathrm{lim}}_{N\to \infty }\Pr ({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}\in {C}_{N})=1$. Further, the coordinate system defined by the mapping $\rho \to {N}^{1/2}\rho $ turns the (previously N-dependent) Fisher information ${ \mathcal I }$ into a constant. Thus, ${\mathrm{lim}}_{N\to \infty }{C}_{N}^{{\prime} }$ is the local state space.

Therefore, we have shown that, asymptotically, the local state space around ρ0 is the solid tangent cone T(ρ0). The geometry of T(ρ0) depends strongly on ρ0. If ρ0 is rank-deficient within ${ \mathcal M }$, then T(ρ0) is the cone whose faces touch ${ \mathcal M }$ at ρ0. (See figure 4 for a rebit example.) However, if ρ0 is full-rank, T(ρ0) is ${{\mathbb{R}}}^{{d}^{2}-1}$.


Figure 4. Example of the solid tangent cone for a rebit: as $N\to \infty $, the local state space around ρ0 is T(ρ0). In Fisher-adjusted coordinates, it is easy to show that (a) ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the metric projection of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ onto $T({\rho }_{0})$, and (b) $\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}[{({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\rho }_{0})}^{2}]$.


3.2.2. MLE as metric projection

As $N\to \infty $, all the ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ are contained within the ball BN, and the local state space is the solid tangent cone. Because ${ \mathcal M }^{\prime} $ satisfies LAN, the likelihood function around each ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ is Gaussian, meaning the optimization problem defining ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is given by

Equation (8)
${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}=\mathop{\mathrm{argmin}}\limits_{\rho \in T({\rho }_{0})}\mathrm{Tr}\left[({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }-\rho )\,{ \mathcal I }\,({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }-\rho )\right].$

That is, ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the metric projection of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ onto the tangent cone. See figure 4 for a rebit example. (Notice that if ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in T({\rho }_{0})$, then ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}={\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$.) What makes this non-trivial is the replacement of the original state space ${ \mathcal M }$, whose geometry can be arbitrarily complicated, with its tangent cone T(ρ0). As shown in the next section, cones can be much simpler and more tractable.

3.2.3. Expression for $\lambda ({\rho }_{0},{ \mathcal M })$

The loglikelihood ratio statistic between any two models $\lambda ({{ \mathcal M }}_{1},{{ \mathcal M }}_{2})$ can be computed using a reference model ${ \mathcal R }$:

$\lambda ({{ \mathcal M }}_{1},{{ \mathcal M }}_{2})=\lambda ({ \mathcal R },{{ \mathcal M }}_{2})-\lambda ({ \mathcal R },{{ \mathcal M }}_{1}),$

where

$\lambda ({ \mathcal R },{ \mathcal M })=2\log \left[{ \mathcal L }({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})/{ \mathcal L }({\hat{\rho }}_{\mathrm{ML},{ \mathcal R }})\right].$

Let us take ${ \mathcal R }={\rho }_{0}$. Because ${ \mathcal M }^{\prime} $ satisfies LAN, asymptotically ${ \mathcal L }(\rho )$ is Gaussian, and λ relates to a difference in squared distances:

Equation (9)
$\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}\left[({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })\,{ \mathcal I }\,({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })\right]-\mathrm{Tr}\left[({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })\,{ \mathcal I }\,({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })\right].$

Using the fact ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is a metric projection, we can prove that $\lambda ({\rho }_{0},{ \mathcal M })$ has a simple form.

Lemma 4.  $\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}[({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}){ \mathcal I }({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})]$.

Proof. We switch to Fisher-adjusted coordinates ($\rho \to {{ \mathcal I }}^{1/2}\rho $), and in these coordinates ${ \mathcal I }$ becomes ${\mathbb{I}}$:

Equation (10)
$\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}\left[{({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })}^{2}\right]-\mathrm{Tr}\left[{({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })}^{2}\right].$

To prove the lemma, we must consider two cases:

Case 1: Assume ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\,{\not\in}\,T({\rho }_{0})$. Because ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is the metric projection of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ onto T(ρ0) (equation (8)), the line joining ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ and ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ is normal to T(ρ0) at ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$. Because T(ρ0) contains ρ0 (as its origin), it follows that the lines joining ρ0 to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$, and ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ to ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$, are perpendicular. (See figure 4.)

By the Pythagorean theorem, we have

$\mathrm{Tr}\left[{({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })}^{2}\right]=\mathrm{Tr}\left[{({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})}^{2}\right]+\mathrm{Tr}\left[{({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })}^{2}\right].$
Subtracting $\mathrm{Tr}[{({\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} })}^{2}]$ from both sides, and comparing to equation (10), yields the lemma statement in Fisher-adjusted coordinates.

Case 2: Assume ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }\in T({\rho }_{0})$. Then, ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}={\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$, and equation (10) simplifies to the lemma statement in Fisher-adjusted coordinates.

Switching back from Fisher-adjusted coordinates, we have $\lambda ({\rho }_{0},{ \mathcal M })=\mathrm{Tr}[({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}){ \mathcal I }({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{ \mathcal M }})]$.□

So if ${ \mathcal M }$ satisfies MP-LAN then as $N\to \infty $ the loglikelihood ratio statistic becomes related to squared error/loss (as measured by the Fisher information metric). This result may be of independent interest in, for example, defining new information criteria, which attempt to balance goodness of fit (as measured by λ) against error/loss (generally, as measured by squared error).

With these technical results in hand, we can proceed to compute $\langle \lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})\rangle $ in the next section.

4. A Wilks theorem for quantum state space

To derive a replacement for the Wilks theorem, we start by showing the models ${{ \mathcal M }}_{d}$ satisfy MP-LAN.

Lemma 5. The models ${{ \mathcal M }}_{d}$, defined in equation (4), satisfy MP-LAN.

Proof. Let ${{ \mathcal M }}_{d}^{{\prime} }=\{\sigma \,| \,\dim (\sigma )=d,\sigma ={\sigma }^{\dagger },\mathrm{Tr}(\sigma )=1\}$. (That is, ${{ \mathcal M }}_{d}^{{\prime} }$ is the set of all trace-1, d × d Hermitian matrices, but we do not require them to be non-negative.) It is clear ${{ \mathcal M }}_{d}\subset {{ \mathcal M }}_{d}^{{\prime} }$. Now, $\forall \,\sigma \in {{ \mathcal M }}_{d}^{{\prime} }$, the likelihood ${ \mathcal L }(\sigma )$ is twice continuously differentiable, meaning ${{ \mathcal M }}_{d}^{{\prime} }$ satisfies LAN. Thus, ${{ \mathcal M }}_{d}$ satisfies MP-LAN. □

We can reduce the problem of computing $\lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})$ to that of computing $\lambda ({\rho }_{0},{{ \mathcal M }}_{k})$ for k = d, d + 1 using the identity

$\lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})=\lambda ({\rho }_{0},{{ \mathcal M }}_{d+1})-\lambda ({\rho }_{0},{{ \mathcal M }}_{d}),$

where $\lambda ({\rho }_{0},{{ \mathcal M }}_{k})$ is given in equation (6). Because each model satisfies MP-LAN, asymptotically, $\lambda ({\rho }_{0},{{ \mathcal M }}_{k})$ takes a very simple form, via equation (7):

$\lambda ({\rho }_{0},{{ \mathcal M }}_{k})=\mathrm{Tr}\left[({\hat{\rho }}_{\mathrm{ML},k}-{\rho }_{0})\,{{ \mathcal I }}_{k}\,({\hat{\rho }}_{\mathrm{ML},k}-{\rho }_{0})\right].$
The Fisher information ${{ \mathcal I }}_{k}$ is generally anisotropic, depending on ρ0, the POVM being measured, and the model ${{ \mathcal M }}_{k}$ (see figure 5). And while the ρ ≥ 0 constraint that invalidated LAN in the first place is at least somewhat tractable in standard (Hilbert–Schmidt) coordinates, it becomes completely intractable in Fisher-adjusted coordinates. So, to obtain a semi-analytic null theory for λ, we will simplify to the case where ${{ \mathcal I }}_{k}={{\mathbb{I}}}_{k}/{\epsilon }^{2}$ for some $\epsilon$ that scales as $1/\sqrt{{N}_{\mathrm{samples}}}$. (That is, ${{ \mathcal I }}_{k}$ is proportional to the Hilbert–Schmidt metric.) This simplification permits the derivation of analytic results that capture realistic tomographic scenarios surprisingly well [51].


Figure 5. Anisotropy of the Fisher information for a rebit: suppose a rebit state ρ0 (star) is measured using the POVM $\tfrac{1}{2}\{| 0\rangle \langle 0| ,| 1\rangle \langle 1| ,| +\rangle \langle +| ,| -\rangle \langle -| \}$. Depending on ρ0, the distribution of the unconstrained estimates ${\hat{\rho }}_{\mathrm{ML}}$ (ellipses) may be anisotropic. Imposing the positivity constraint $\rho \geqslant 0$ is difficult in Fisher-adjusted coordinates; in this paper, we simplify these complexities to the case where ${ \mathcal I }\propto {\mathbb{I}}$, and is independent of ρ0.


With this simplification, $\lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})$ is given by

Equation (11)
$\lambda ({{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1})=\dfrac{1}{{\epsilon }^{2}}\left(\mathrm{Tr}\left[{({\hat{\rho }}_{\mathrm{ML},d+1}-{\rho }_{0})}^{2}\right]-\mathrm{Tr}\left[{({\hat{\rho }}_{\mathrm{ML},d}-{\rho }_{0})}^{2}\right]\right).$

That is, λ is a difference in Hilbert–Schmidt distances. This expression makes it clear why a null theory for λ is necessary: if ${\rho }_{0}\in {{ \mathcal M }}_{d},{{ \mathcal M }}_{d+1}$, then ${\hat{\rho }}_{\mathrm{ML},d+1}$ will lie further from ρ0 than ${\hat{\rho }}_{\mathrm{ML},d}$ (because there are more parameters that can fit noise in the data). The null theory for λ tells us how much extra error will be incurred in using ${{ \mathcal M }}_{d+1}$ to reconstruct ρ0 when ${{ \mathcal M }}_{d}$ is just as good.

Describing Pr(λ) is difficult because the distributions of ${\hat{\rho }}_{\mathrm{ML},d},{\hat{\rho }}_{\mathrm{ML},d+1}$ are complicated, highly non-Gaussian, and singular (estimates 'pile up' on the various faces of the boundary as shown in figure 1). For this reason, we will not attempt to compute Pr(λ) directly. Instead, we focus on deriving a good approximation for $\langle \lambda \rangle $.

We consider each of the terms in equation (11) separately and focus on computing ${\epsilon }^{2}\langle \lambda ({\rho }_{0},{{ \mathcal M }}_{d})\rangle =\langle \mathrm{Tr}[{({\hat{\rho }}_{\mathrm{ML},d}-{\rho }_{0})}^{2}]\rangle $ for arbitrary d. Doing so involves two main steps:

  • (1)  
    Identify which degrees of freedom in ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ are, and are not, affected by projection onto the tangent cone T(ρ0).
  • (2)  
    For each of those categories, evaluate its contribution to the value of $\langle \lambda \rangle $.

In section 4.1, we identify two types of degrees of freedom in ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$, which we call the 'L' and the 'kite'. Section 4.2 computes the contribution of degrees of freedom in the 'L', and section 4.3 computes the contribution from the 'kite'. The total expected value is given in equation (19) in section 4.4, on page 12.

4.1. Separating out degrees of freedom in ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$

We begin by observing that $\lambda ({\rho }_{0},{{ \mathcal M }}_{d})$ can be written as a sum over matrix elements,

$\lambda ({\rho }_{0},{{ \mathcal M }}_{d})=\sum_{j,k}{\lambda }_{{jk}},\qquad {\lambda }_{{jk}}=\dfrac{{\left|{({\hat{\rho }}_{\mathrm{ML},d}-{\rho }_{0})}_{{jk}}\right|}^{2}}{{\epsilon }^{2}},$
and therefore $\langle \lambda \rangle ={\sum }_{{jk}}\langle {\lambda }_{{jk}}\rangle $. Each term $\langle {\lambda }_{{jk}}\rangle $ quantifies the mean-squared error of a single matrix element of ${\hat{\rho }}_{\mathrm{ML},d}$. The Wilks theorem predicts $\langle {\lambda }_{{jk}}\rangle =1$ for all j, k, but because of the positivity constraint this no longer holds. In particular, the matrix elements of ${\hat{\rho }}_{\mathrm{ML},d}$ now fall into two parts:

  • 1.  
    Those for which the positivity constraint does affect their behavior.
  • 2.  
    Those for which the positivity constraint does not affect their behavior, as they correspond to directions on the surface of the tangent cone T(ρ0). (Recall figure 4 - as a component of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }^{\prime} }$ along T(ρ0) changes, the component of ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$ changes by the same amount. These elements are unconstrained.)

The latter, which lie in what we call the 'L', comprise all off-diagonal elements on the support of ρ0 and between the support and the kernel, while the former, which lie in what we call the 'kite', are all diagonal elements and all elements on the kernel (null space) of ρ0.

This division is also supported by numerical simulations (see figure 6). Matrix elements in the 'L' appear to contribute $\langle {\lambda }_{{jk}}\rangle =1$, consistent with the Wilks theorem, while those in the 'kite' contribute more (if they are within the support of ρ0) or less (if they are in the kernel). Having divided the matrix elements of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ in this way, we observe that $\langle \lambda \rangle =\langle {\lambda }_{{\rm{L}}}\rangle +\langle {\lambda }_{\mathrm{kite}}\rangle $. Because each $\langle {\lambda }_{{jk}}\rangle $ is not necessarily equal to one (as in the Wilks theorem), and because many of them are less than 1, it is clear that their total $\langle \lambda \rangle $ is dramatically lower than the prediction of the Wilks theorem. (Recall figure 2.)


Figure 6. Division of the matrix elements of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$: When a rank-2 state is reconstructed in d = 8 dimensions, the total loglikelihood ratio $\lambda ({\rho }_{0},{{ \mathcal M }}_{8})$ is the sum of terms λjk from errors in each matrix element ${({\hat{\rho }}_{\mathrm{ML},d})}_{{jk}}$. Left: numerics show a clear division; some matrix elements have $\langle {\lambda }_{{jk}}\rangle \sim 1$ as predicted by the Wilks theorem, while others are either more or less. Right: the numerical results support our theoretical reasoning for dividing the matrix elements of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ into two parts: the 'kite' and the 'L'.


In the following subsections, we develop a theory to explain the behavior of $\langle {\lambda }_{{\rm{L}}}\rangle $ and $\langle {\lambda }_{\mathrm{kite}}\rangle $. In doing so, it is helpful to think about the matrix $\delta \equiv {\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}-{\rho }_{0}$, a normally distributed traceless matrix. To simplify the analysis, we explicitly drop the $\mathrm{Tr}(\delta )=0$ constraint and let δ be ${ \mathcal N }(0,{\epsilon }^{2}{\mathbb{I}})$ distributed over the d2-dimensional space of Hermitian matrices (a good approximation when d ≫ 2), which makes δ proportional to an element of the Gaussian unitary ensemble (GUE) [52].

4.2. Computing $\langle {\lambda }_{{\rm{L}}}\rangle $

The value of each δjk in the 'L' is invariant under projection onto the boundary (the surface of the tangent cone T(ρ0)), meaning that it is also equal to the error ${({\hat{\rho }}_{\mathrm{ML},d}-{\rho }_{0})}_{{jk}}$. Therefore, $\langle {\lambda }_{{jk}}\rangle =\langle {\delta }_{{jk}}^{2}\rangle /{\epsilon }^{2}$. Because ${ \mathcal M }^{\prime} $ satisfies LAN, it follows that each δjk is an i.i.d. Gaussian random variable with mean zero and variance ${\epsilon }^{2}$. Thus, $\langle {\lambda }_{{jk}}\rangle =1\,\forall \,(j,k)$ in the 'L'. The dimension of the surface of the tangent cone is equal to the dimension of the manifold of rank-r states in a d-dimensional space. A direct calculation of that quantity yields 2rd − r(r + 1), so $\langle {\lambda }_{{\rm{L}}}\rangle =2{rd}-r(r+1)$.

Another way of obtaining this result is to view the δjk in the 'L' as errors arising due to small unitary perturbations of ρ0. Writing ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}={U}^{\dagger }{\rho }_{0}U$, where $U={{\rm{e}}}^{{\rm{i}}\epsilon H}$, we have

${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}={\rho }_{0}+{\rm{i}}\epsilon [{\rho }_{0},H]+{ \mathcal O }({\epsilon }^{2}),$
and $\delta \approx {\rm{i}}\epsilon [{\rho }_{0},H]$. If j = k, then δjj = 0. Thus, small unitaries cannot create errors in the diagonal matrix elements, at ${ \mathcal O }(\epsilon )$. If $j\ne k$, then ${\delta }_{{jk}}\ne 0$, in general. (Small unitaries can introduce errors on off-diagonal elements.)

However, if either j or k (or both) lie within the kernel of ρ0 (i.e., $\langle k| {\rho }_{0}| k\rangle $ or $\langle j| {\rho }_{0}| j\rangle $ is 0), then the corresponding δjk are zero. The only off-diagonal elements where small unitaries can introduce errors are those which are coherent between the kernel of ρ0 and its support. These off-diagonal elements are precisely the 'L', and are the set $\{{\delta }_{{jk}}\,| \,\langle j| {\rho }_{0}| j\rangle \ne 0,j\ne k,\,0\leqslant j,k\leqslant d-1\}$. This set contains 2rd − r(r + 1) elements, each of which has $\langle {\lambda }_{{jk}}\rangle =1$, so we again arrive at $\langle {\lambda }_{{\rm{L}}}\rangle =2{rd}-r(r+1)$.
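
This counting can be checked directly; the sketch below (with indices 0, ..., r − 1 taken to span the support of ρ0) counts the real degrees of freedom in the 'L' and compares them to 2rd − r(r + 1).

def count_L_parameters(r, d):
    # Ordered pairs (j, k) with j != k and at least one index on the support;
    # each unordered pair carries a real and an imaginary part, i.e. two ordered pairs.
    return sum(1 for j in range(d) for k in range(d)
               if j != k and (j < r or k < r))

assert all(count_L_parameters(r, d) == 2 * r * d - r * (r + 1)
           for d in (2, 5, 10, 30) for r in range(1, d + 1))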

4.3. Computing $\langle {\lambda }_{\mathrm{kite}}\rangle $

Computing $\langle {\lambda }_{{\rm{L}}}\rangle $ was made easy by the fact that the matrix elements of δ in the 'L' are invariant under the projection of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ onto T(ρ0). Computing $\langle {\lambda }_{\mathrm{kite}}\rangle $ is a bit harder, because the boundary does constrain δ. To understand how the behavior of $\langle {\lambda }_{\mathrm{kite}}\rangle $ is affected, we analyze an algorithm presented in [51] for explicitly solving the optimization problem in equation (5).

This algorithm, a (very fast) numerical method for computing ${\hat{\rho }}_{\mathrm{ML},d}$ given ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$, utilizes two steps:

  • 1.  
    Subtract $q{\mathbb{I}}$ from ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$, for a particular $q\in {\mathbb{R}}$.
  • 2.  
    'Truncate' ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}-q{\mathbb{I}}$, by replacing each of its negative eigenvalues with zero.

Here, q is defined implicitly such that $\mathrm{Tr}[\mathrm{Trunc}({\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}-q{\mathbb{I}})]=1$, and must be determined numerically. However, we can analyze how this algorithm affects the eigenvalues of ${\hat{\rho }}_{\mathrm{ML},d}$, which turn out to be the key quantities necessary for computing $\langle {\lambda }_{\mathrm{kite}}\rangle $.
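
A minimal sketch of this subtract-and-truncate projection is given below; the bisection search for q is our own implementation detail, not necessarily that of [51].

import numpy as np

def truncate_to_density_matrix(rho_hat):
    # Project a trace-1 Hermitian matrix onto the density matrices by subtracting
    # q*I and zeroing negative eigenvalues, with q chosen so the result has trace 1.
    evals, evecs = np.linalg.eigh(rho_hat)
    lo, hi = evals.min() - 1.0, evals.max()
    for _ in range(100):                          # bisection on q
        q = 0.5 * (lo + hi)
        if np.clip(evals - q, 0.0, None).sum() > 1.0:
            lo = q                                # truncated trace too large: raise q
        else:
            hi = q                                # truncated trace too small: lower q
    new_evals = np.clip(evals - q, 0.0, None)
    new_evals /= new_evals.sum()                  # clean up residual numerical error
    return (evecs * new_evals) @ evecs.conj().T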

The truncation algorithm above is most naturally performed in the eigenbasis of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$. Exact diagonalization of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ is not feasible analytically, but only its small eigenvalues are critical in truncation. Further, only knowledge of the typical eigenvalues of ${\hat{\rho }}_{\mathrm{ML},d}$ is necessary for computing $\langle {\lambda }_{\mathrm{kite}}\rangle $. Therefore, we do not need to determine ${\hat{\rho }}_{\mathrm{ML},d}$ exactly, which would require explicitly solving equation (5) using the algorithm presented in [51]; instead, we need a procedure for determining its typical eigenvalues.

We assume that ${N}_{\mathrm{samples}}$ is sufficiently large so that all the non-zero eigenvalues of ρ0 are much larger than $\epsilon$. This means the eigenbasis of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ is accurately approximated by: (1) the eigenvectors of ρ0 on its support; and (2) the eigenvectors of ${\delta }_{\ker }={{\rm{\Pi }}}_{\ker }\delta {{\rm{\Pi }}}_{\ker }={{\rm{\Pi }}}_{\ker }{\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}{{\rm{\Pi }}}_{\ker }$, where ${{\rm{\Pi }}}_{\ker }$ is the projector onto the kernel of ρ0.

Changing to this basis diagonalizes the 'kite' portion of δ, and leaves all elements of the 'L' unchanged (at ${ \mathcal O }(\epsilon )$). The diagonal elements fall into two categories:

  • 1.  
    r elements corresponding to the eigenvalues of ρ0, which are given by ${p}_{j}={\rho }_{{jj}}+{\delta }_{{jj}}$ where ρjj is the jth eigenvalue of ρ0, and ${\delta }_{{jj}}\sim { \mathcal N }(0,{\epsilon }^{2})$.
  • 2.  
    n ≡ d − r elements that are eigenvalues of δker, which we denote by ${\boldsymbol{\kappa }}=\{{\kappa }_{j}:\,j=1,\,\ldots ,\,n\}$.

In turn, q is the solution to

Equation (12)
$\displaystyle \sum_{j=1}^{r}{({p}_{j}-q)}^{+}+\sum_{j=1}^{n}{({\kappa }_{j}-q)}^{+}=1,$

where ${(x)}^{+}=\max (x,0)$, and λkite is

Equation (13)
${\lambda }_{\mathrm{kite}}=\dfrac{1}{{\epsilon }^{2}}\left(\displaystyle \sum_{j=1}^{r}{\left[{({p}_{j}-q)}^{+}-{\rho }_{{jj}}\right]}^{2}+\sum_{j=1}^{n}{\left[{({\kappa }_{j}-q)}^{+}\right]}^{2}\right).$
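
To make this construction concrete, the following sketch takes a draw of the support eigenvalues ${p}_{j}={\rho }_{{jj}}+{\delta }_{{jj}}$ and kernel eigenvalues ${\boldsymbol{\kappa }}$, solves for q numerically, and evaluates λkite as the resulting squared error on the 'kite' (the bisection and the function names are ours, and the inputs are assumed to be NumPy arrays).

import numpy as np

def solve_q(p, kappa):
    # Find q such that the truncated eigenvalues sum to 1 (cf. equation (12)).
    evals = np.concatenate([p, kappa])
    lo, hi = evals.min() - 1.0, evals.max()
    for _ in range(100):
        q = 0.5 * (lo + hi)
        if np.clip(evals - q, 0.0, None).sum() > 1.0:
            lo = q
        else:
            hi = q
    return q

def lambda_kite(p, kappa, rho_diag, eps):
    # Squared error (in units of eps^2) of the truncated 'kite' eigenvalues,
    # relative to the true eigenvalues of rho_0 (rho_diag on the support, 0 on the kernel).
    q = solve_q(p, kappa)
    support_err = np.clip(p - q, 0.0, None) - rho_diag
    kernel_err = np.clip(kappa - q, 0.0, None)
    return (np.sum(support_err ** 2) + np.sum(kernel_err ** 2)) / eps ** 2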

To solve equation (12), and derive an approximation for (13), we use the fact that we are interested in computing the average value of λkite, which justifies approximating the random variable q by a closed-form, deterministic value. To do so, we need to understand the behavior of ${\boldsymbol{\kappa }}$. Developing such an understanding, and a theory of its typical value, is the subject of the next section.

4.3.1. Approximating the eigenvalues of a GUE(n) matrix

We first observe that while the κj are random variables, they are not normally distributed. Instead, because δker is proportional to a GUE(n) matrix, for $n\gg 1$, the distribution of any eigenvalue κj converges to a Wigner semicircle distribution [53], given by $\Pr (\kappa )=\tfrac{2}{\pi {R}^{2}}\sqrt{{R}^{2}-{\kappa }^{2}}$ for $| \kappa | \leqslant R$, with $R=2\epsilon \sqrt{n}$. The eigenvalues are not independent; they tend to avoid collisions ('level avoidance' [54]), and typically form a surprisingly regular array over the support of the Wigner semicircle. Since our goal is to compute $\langle {\lambda }_{\mathrm{kite}}\rangle $, we can capitalize on this behavior by replacing each random sample of ${\boldsymbol{\kappa }}$ with a typical sample given by its order statistics $\bar{{\boldsymbol{\kappa }}}$. These are the average values of the sorted ${\boldsymbol{\kappa }}$, so ${\overline{\kappa }}_{j}$ is the average value of the jth largest value of ${\boldsymbol{\kappa }}$. Large random samples are usually well approximated (for many purposes) by their order statistics even when the elements of the sample are independent, and level avoidance makes the approximation even better.

Suppose that ${\boldsymbol{\kappa }}$ are the eigenvalues of a GUE(n) matrix, sorted from highest to lowest. Figure 7 illustrates such a sample for n = 100. It also shows the average values of 100 such samples (all sorted). These are the order statistics $\overline{{\boldsymbol{\kappa }}}$ of the distribution (more precisely, what is shown is a good estimate of the order statistics; the actual order statistics would be given by the average over infinitely many samples). As the figure shows, while the order statistics are slightly more smoothly and predictably distributed than a single (sorted) sample, the two are remarkably similar. A single sample ${\boldsymbol{\kappa }}$ will fluctuate around the order statistics, but these fluctuations are relatively small, partly because the sample is large, and partly because the GUE eigenvalues experience level repulsion. Thus, the 'typical' behavior of a sample—by which we mean the mean value of a statistic of the sample—is well captured by the order statistics (which have no fluctuations at all).
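The following minimal sketch (ours, not from the original analysis) reproduces the kind of comparison shown in figure 7: one sorted GUE(100) spectrum next to an estimate of the order statistics obtained by averaging 100 sorted spectra. We use the normalization in which the matrix elements have unit variance, so the semicircle radius is $2\sqrt{n}$:

```python
import numpy as np

def gue(n, rng):
    """GUE(n) sample, normalized so the semicircle radius is 2*sqrt(n)."""
    m = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) / np.sqrt(2)
    return (m + m.conj().T) / np.sqrt(2)

rng = np.random.default_rng(1)
n, n_avg = 100, 100
single = np.sort(np.linalg.eigvalsh(gue(n, rng)))[::-1]     # one sorted sample (left panel)
kappa_bar = np.mean(
    [np.sort(np.linalg.eigvalsh(gue(n, rng)))[::-1] for _ in range(n_avg)],
    axis=0)                                                 # estimated order statistics (right panel)
# fluctuations of a single sample about the order statistics are small compared to R = 2*sqrt(n)
print(np.max(np.abs(single - kappa_bar)), 2 * np.sqrt(n))
```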

Figure 7. Approximating typical samples of GUE(n) eigenvalues by order statistics: we approximate a typical sample of GUE(n) eigenvalues by their order statistics (average values of a sorted sample). Left: the sorted eigenvalues (i.e., order statistics κj) of one randomly chosen GUE(100) matrix. Right: approximate expected values of the order statistics, ${\bar{\kappa }}_{j}$, of the GUE(100) distribution, computed as the average of the sorted eigenvalues of 100 randomly chosen GUE(100) matrices.

We now turn to the problem of modeling ${\boldsymbol{\kappa }}$ quantitatively. We note up front that we are only going to be interested in certain properties of ${\boldsymbol{\kappa }}$: specifically, partial sums of all ${\kappa }_{j}$ greater or less than the threshold q, or partial sums of functions of the κj (e.g., ${({\kappa }_{j}-q)}^{2}$). We require only that an ansatz be accurate for such quantities. We do not use this fact explicitly, but it motivates our approach—and we do not claim that our ansatz is accurate for all conceivable functions.

In general, if a sample ${\boldsymbol{\kappa }}$ of size n is drawn so that each κ has the same probability density function $\Pr (\kappa )$, then a good approximation for the jth order statistic is given by the inverse cumulative distribution function (CDF):

Equation (14)

This is closely related to the observation that the histogram of a sample tends to look similar to the underlying probability density function. More precisely, it is equivalent to the observation that the empirical distribution function (the CDF of the histogram) tends to be (even more) similar to the underlying CDF. For i.i.d. samples, this is the content of the Glivenko–Cantelli theorem [55]. Figure 8 compares the order statistics of GUE(100) and GUE(10) eigenvalues (computed as numerical averages over 100 random samples) to the inverse CDF for the Wigner semicircle distribution. Even though the Wigner semicircle model of GUE eigenvalues is only exact as $n\to \infty $, it provides a nearly perfect model for $\overline{{\boldsymbol{\kappa }}}$ even at n = 10 (and remains surprisingly good all the way down to n = 2).
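Equation (14) is straightforward to evaluate numerically. The sketch below (our own hedged illustration) inverts the Wigner semicircle CDF at the plotting positions $(j-1/2)/n$, one common convention for approximating order statistics; the resulting values can be compared directly with the averaged sorted spectra from the previous sketch:

```python
import numpy as np
from scipy.optimize import brentq

def semicircle_cdf(k, R):
    """CDF of the Wigner semicircle density Pr(k) = 2/(pi R^2) sqrt(R^2 - k^2) on [-R, R]."""
    k = np.clip(k, -R, R)
    return 0.5 + k * np.sqrt(R**2 - k**2) / (np.pi * R**2) + np.arcsin(k / R) / np.pi

def semicircle_inverse_cdf(p, R):
    """Invert the semicircle CDF numerically (it has no simple closed form)."""
    return brentq(lambda k: semicircle_cdf(k, R) - p, -R, R)

n = 10
R = 2.0 * np.sqrt(n)                                   # unit-variance GUE(10) normalization
probs = (np.arange(1, n + 1) - 0.5) / n                # plotting positions (j - 1/2)/n
kappa_bar_approx = np.array([semicircle_inverse_cdf(p, R) for p in probs])  # ascending order
print(kappa_bar_approx)
```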

Figure 8. Approximating order statistics by the inverse CDF: order statistics of the GUE(n) eigenvalue distribution are very well approximated by the inverse CDF of the Wigner semicircle distribution. In both figures, we compare the order statistics of a GUE(n) distribution to the inverse CDF of the Wigner semicircle distribution. Top: n = 100. Bottom: n = 10. Agreement in both cases is essentially perfect.

We make one further approximation, by assuming that n ≫ 1, so the distribution of the ${\overline{\kappa }}_{j}$ is effectively continuous and identical to $\Pr (\kappa )$. For the quantities that we compute, this is equivalent to replacing the empirical distribution function (which is a step function) by the CDF of the Wigner semicircle distribution. So, whereas for any given sample the partial sum of all ${\kappa }_{j}\gt q$ jumps discontinuously when $q={\kappa }_{j}$ for any j, in this approximation it changes smoothly. This accurately models the average behavior of partial sums.
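As a quick numerical check of this continuum approximation (again our own sketch, with an arbitrary illustrative value for the threshold q), one can compare the discrete partial sum over estimated order statistics with the corresponding semicircle integral:

```python
import numpy as np
from scipy.integrate import quad

rng = np.random.default_rng(2)
n, n_avg = 100, 100
R = 2.0 * np.sqrt(n)

def gue(n, rng):
    m = (rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))) / np.sqrt(2)
    return (m + m.conj().T) / np.sqrt(2)

# estimated order statistics (ascending), as in the earlier sketch
kappa_bar = np.mean([np.sort(np.linalg.eigvalsh(gue(n, rng))) for _ in range(n_avg)], axis=0)

pdf = lambda k: 2.0 / (np.pi * R**2) * np.sqrt(max(R**2 - k**2, 0.0))
q = 0.5 * R                                              # an arbitrary illustrative threshold
discrete = np.sum(np.maximum(kappa_bar - q, 0.0))        # partial sum over kappa_bar_j > q
continuum = n * quad(lambda k: (k - q) * pdf(k), q, R)[0]
print(discrete, continuum)                               # the two agree closely for n >> 1
```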

4.3.2. Deriving an approximation for q

The approximations of the previous section allow us to use $\{{p}_{j}\}\cup \{{\overline{\kappa }}_{j}\}$ as the ansatz for the eigenvalues of ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$, where the pj are ${ \mathcal N }({\rho }_{{jj}},{\epsilon }^{2})$ random variables, and the ${\overline{\kappa }}_{j}$ are the (fixed, smoothed) order statistics of a Wigner semicircle distribution. In turn, the defining equation for q (equation (12)) is well approximated as ${\sum }_{j=1}^{r}{({p}_{j}-q)}^{+}+{\sum }_{j=1}^{n}{({\overline{\kappa }}_{j}-q)}^{+}=1$.

To solve this equation, we observe that the ${\overline{\kappa }}_{j}$ are symmetrically distributed around $\kappa =0$, so half of them are negative. Therefore, with high probability, $\mathrm{Tr}[\mathrm{Trunc}({\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }})]\gt 1$, and so we will need to subtract $q{\mathbb{I}}$ from ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}^{{\prime} }}$ before truncating.

Because we have assumed ${N}_{\mathrm{samples}}$ is sufficiently large (${N}_{\mathrm{samples}}\gg \,{\max }_{j}1/{\rho }_{{jj}}^{2}$, with the maximum taken over the non-zero eigenvalues), the eigenvalues of ρ0 are large compared to the perturbations δjj and q. This implies ${({p}_{j}-q)}^{+}={p}_{j}-q$. Under this assumption, q is the solution to

Equation (15)

where ${\rm{\Delta }}={\sum }_{j=1}^{r}{\delta }_{{jj}}$ is a ${ \mathcal N }(0,r{\epsilon }^{2})$ random variable. In going from the first line of equation (15) to the second, we replace a discrete sum with an integral. This approximation is valid when $n\gg 1$: a discrete collection of closely spaced real numbers is accurately approximated by a smooth density over the real numbers with approximately the same CDF. It is also remarkably accurate in practice.

In yet another approximation, we replace Δ with its average value, which is zero. We could obtain an even more accurate expression by treating Δ more carefully, but this crude approximation turns out to be quite accurate already.
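With these approximations in place, the condition defining q can also be solved numerically before any further simplification, which provides a useful check on the closed-form expression derived below. The following sketch (ours; it uses the reduction $n{\int }_{q}^{R}(\kappa -q)\Pr (\kappa ){\rm{d}}\kappa ={rq}$, obtained from the condition above by using ${\sum }_{j}{\rho }_{{jj}}=1$ and setting Δ = 0) finds q by root-finding:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq

def solve_q(r, n, eps):
    """Solve n * int_q^R (k - q) Pr(k) dk = r * q for q, where Pr(k) is the
    Wigner semicircle density with radius R = 2 * eps * sqrt(n)."""
    R = 2.0 * eps * np.sqrt(n)
    pdf = lambda k: 2.0 / (np.pi * R**2) * np.sqrt(max(R**2 - k**2, 0.0))
    excess = lambda q: n * quad(lambda k: (k - q) * pdf(k), q, R)[0] - r * q
    # excess(0) > 0 and excess(R) = -r*R < 0, so a unique root lies in (0, R)
    return brentq(excess, 0.0, R)

# for r << n, q approaches the spectral edge 2 * eps * sqrt(n)
r, n, eps = 1, 199, 1e-4
q = solve_q(r, n, eps)
print(q, 2.0 * eps * np.sqrt(n), q / (2.0 * eps * np.sqrt(n)))
```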

To obtain a closed-form solution of equation (15), it is necessary to further simplify the complicated expression resulting from the integral (line 3). To do so, we assume ρ0 is relatively low-rank, so $r\ll d/2$. In this case, the total weight of the positive ${\overline{\kappa }}_{j}$ is large compared with what the r large eigenvalues can absorb, so almost all of it must be subtracted away, and therefore q is close to $2\epsilon \sqrt{n}$. We therefore replace the complicated expression with its leading-order Taylor expansion around $q=2\epsilon \sqrt{n}$, substitute into equation (15), and obtain the equation

Equation (16)

This equation is a quintic polynomial in $q/\epsilon $, so by the Abel–Ruffini theorem it has no general solution in radicals. However, as $n\to \infty $, its roots have a well-defined algebraic approximation that becomes accurate quite rapidly (e.g., for n > 4):

Equation (17)

where $x={\left(\tfrac{15\pi r}{2n}\right)}^{2/5}$ (see footnote 4).

4.3.3. Expression for $\langle {\lambda }_{\mathrm{kite}}\rangle $

Now that we know how much to subtract off in the truncation process, we can approximate $\langle {\lambda }_{\mathrm{kite}}\rangle $, originally given in equation (13):

Equation (18)

4.4. Complete expression for $\langle \lambda \rangle $

The total expected value, $\langle \lambda \rangle =\langle {\lambda }_{{\rm{L}}}\rangle +\langle {\lambda }_{\mathrm{kite}}\rangle $, is thus

Equation (19)

where z is given in equation (17), n = d − r, and $r=\mathrm{Rank}({\rho }_{0})$.

This null theory is much more complicated than the Wilks theorem, but as figure 9 shows, it is very accurate when $2r\ll d$. (In contrast, the prediction of the Wilks theorem is wildly incorrect for $r\ll d$.) Although our null theory does break down as $r\to d$, it does so fairly gracefully. We conclude that our analysis (and equation (19)) correctly models tomography if the Fisher information is isotropic (${ \mathcal I }\propto {\mathbb{I}}$), a point we turn to in the following subsections.

Figure 9. Improved prediction for $\langle \lambda ({\rho }_{0},{{ \mathcal M }}_{d})\rangle $, as compared to the Wilks theorem: numerical results for $\langle \lambda ({\rho }_{0},{{ \mathcal M }}_{d})\rangle $ compared to the prediction of the Wilks theorem (solid line) and our replacement theory as given in equation (19) (dashed lines). Our formula depends on the rank r of ρ0 (unlike the Wilks prediction), and is nearly perfect for $r\ll d/2$. It becomes less accurate as r approaches d/2, and is invalid when $r\approx d$.

In the asymptotic limit $d\to \infty $, while keeping r fixed, $\langle \lambda \rangle $ takes the following form:

Equation (20)

That $\langle \lambda \rangle $ scales as ${ \mathcal O }({rd})$ in this regime is to be expected, as a rank-r density matrix has ${ \mathcal O }({rd})$ free parameters. Curiously though, this asymptotic result is not equal to $\langle {\lambda }_{{\rm{L}}}\rangle $, meaning that the 'kite' elements continue to contribute to the behavior of the statistic (even though most of them will be set to zero in the projection step when computing ${\hat{\rho }}_{\mathrm{ML},{ \mathcal M }}$).

5. Comparison to numerical experiments

To evaluate our null theory for $\langle \lambda \rangle $, we compare it to numerical experiments, described below.

5.1. Comparison to idealized tomography (isotropic Fisher information)

In our derivation of equation (19), we assumed both that the Fisher information is isotropic and that the number of samples is asymptotically infinite. For a variety of true states ρ0 with dimension d = 2, ..., 30 and rank r = 1, ..., 10 we: (a) generated N = 500 i.i.d. ${ \mathcal N }({\rho }_{0},{\epsilon }^{2}{ \mathcal I })$ unconstrained ML estimates $\{{\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d,j}^{{\prime} }}\}{}_{j=1}^{N}$, thereby simulating the unconstrained ML estimates in the ${N}_{\mathrm{samples}}\to \infty $ limit; (b) numerically solved equation (8) for each ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d,j}^{{\prime} }}$ to obtain the constrained ML estimate ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d,j}}$; and (c) estimated $\langle \lambda \rangle $ as $\tfrac{1}{N}{\sum }_{j=1}^{N}\mathrm{Tr}[{({\rho }_{0}-{\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d,j}})}^{2}]/{\epsilon }^{2}$. We took $\epsilon ={10}^{-4}$ to ensure that all of the unconstrained ML estimates are close to ρ0. (Recall that our derivation used the fact that, asymptotically, we can 'zoom in' on ρ0 to understand the behavior of λ. If $\epsilon $ is too large, some of the unconstrained ML estimates may be almost orthogonal to ρ0, which violates the conditions used in our derivation.)
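For reference, this idealized experiment is simple to reproduce. The sketch below (our own illustration, not the code used to produce figure 9) models the unconstrained estimate as ρ0 plus $\epsilon $ times a GUE matrix, which is one way to realize the isotropic ${ \mathcal N }({\rho }_{0},{\epsilon }^{2}{ \mathcal I })$ ensemble, and uses the subtract-and-truncate projection described in section 4 in place of an explicit solution of equation (8):

```python
import numpy as np
from scipy.optimize import brentq

def perturbation(d, eps, rng):
    """Hermitian perturbation with entry scale eps (eps times a GUE(d) matrix)."""
    m = (rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))) * eps / np.sqrt(2)
    return (m + m.conj().T) / np.sqrt(2)

def truncate_to_state(rho_unconstrained):
    """Project onto unit-trace positive matrices: subtract q from the eigenvalues,
    truncate negative ones to zero, with q chosen so the trace equals 1."""
    evals, evecs = np.linalg.eigh(rho_unconstrained)
    trace_after = lambda q: np.sum(np.maximum(evals - q, 0.0)) - 1.0
    q = brentq(trace_after, evals.min() - 1.0, evals.max())
    return evecs @ np.diag(np.maximum(evals - q, 0.0)) @ evecs.conj().T

def average_lambda(rho0, eps=1e-4, n_trials=500, seed=0):
    """Monte Carlo estimate of <lambda> for an isotropic Fisher information."""
    rng = np.random.default_rng(seed)
    d = rho0.shape[0]
    total = 0.0
    for _ in range(n_trials):
        rho_unc = rho0 + perturbation(d, eps, rng)               # step (a)
        rho_con = truncate_to_state(rho_unc)                     # step (b)
        diff = rho0 - rho_con
        total += np.real(np.trace(diff @ diff)) / eps**2         # step (c)
    return total / n_trials

# example: a rank-1 state in d = 10
d = 10
rho0 = np.zeros((d, d), dtype=complex); rho0[0, 0] = 1.0
print(average_lambda(rho0))
```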

In figure 9, we compare our theory (dashed lines) to these numerical results (solid dots). It is clear that equation (19) is almost perfectly accurate when $r\ll d/2$, but it begins to break down as r becomes comparable to d.

5.2. Comparison to heterodyne tomography

In practice, the Fisher information is rarely isotropic. We therefore tested our idealized result by applying it to a realistic, challenging, and experimentally relevant problem: quantum heterodyne (equivalent to double homodyne) state tomography [11, 12, 14, 56] of a single optical mode. (See figure 10 for a plot of the condition number, the ratio of the largest eigenvalue to the smallest, of the estimated Fisher information. It is clear that ${ \mathcal I }\,{\not\propto}\,{\mathbb{I}}$.) States of this continuous-variable system are described by density operators on the infinite-dimensional Hilbert space ${L}^{2}({\mathbb{R}})$. Fitting these infinitely many parameters to finite data demands simpler models.

Figure 10. Anisotropy of the heterodyne POVM Fisher information: the condition number κ (the ratio of the largest eigenvalue to the smallest) of the estimated heterodyne Fisher information. (Estimates are the average over 100 Hessians of the loglikelihood function.) κ grows with model dimension, meaning anisotropy is increasing. The dashed lines indicate different states ρ0, and the solid line is κ = 1 (i.e., ${ \mathcal I }\propto {\mathbb{I}}$).

We consider a family of nested models motivated by a low-energy (few-photon) ansatz, and choose the Hilbert space ${{ \mathcal H }}_{d}$ to be that spanned by the photon number states $\{| 0\rangle ,\,\ldots ,\,| d-1\rangle \}$. Heterodyne tomography reconstructs ρ0 using data from repeated measurements of the coherent state POVM, $\{| \alpha \rangle \langle \alpha | /\pi ,\,\alpha =x+{\rm{i}}{p}\in {\mathbb{C}}\}$, which corresponds to sampling directly from the Husimi Q-function of ρ0 [57].
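Sampling from the Husimi Q-function is straightforward to implement. The sketch below (our own illustration of the sampling step, not the simulation code used in this work; the box size and other parameters are arbitrary choices) evaluates $Q(\alpha )=\langle \alpha | \rho | \alpha \rangle /\pi $ for a Fock-basis density matrix and rejection-samples heterodyne outcomes using the bound $Q\leqslant 1/\pi $:

```python
import numpy as np
from math import factorial

def q_function(rho, alpha):
    """Husimi Q-function Q(alpha) = <alpha|rho|alpha> / pi for a Fock-basis rho."""
    d = rho.shape[0]
    n = np.arange(d)
    # coherent-state overlaps <n|alpha> = exp(-|alpha|^2/2) alpha^n / sqrt(n!)
    coh = np.exp(-0.5 * np.abs(alpha)**2) * alpha**n / np.sqrt([factorial(int(k)) for k in n])
    return np.real(coh.conj() @ rho @ coh) / np.pi

def sample_heterodyne(rho, n_samples, box=5.0, seed=0):
    """Rejection-sample heterodyne outcomes alpha = x + i p from Q(alpha).
    Proposals are uniform on [-box, box]^2; since Q <= 1/pi, accepting with
    probability pi * Q(alpha) is a valid rejection step (up to the tiny
    probability mass lying outside the box)."""
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        x, p = rng.uniform(-box, box, size=2)
        alpha = x + 1j * p
        if rng.uniform() < np.pi * q_function(rho, alpha):
            samples.append(alpha)
    return np.array(samples)

# example: single-photon state rho0 = |1><1| truncated to d = 5
d = 5
rho0 = np.zeros((d, d), dtype=complex); rho0[1, 1] = 1.0
alphas = sample_heterodyne(rho0, 1000)
print(np.mean(np.abs(alphas)**2))   # should be close to <n> + 1 = 2 for |1><1|
```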

We examined the behavior of λ for 13 distinct states ρ0, both pure and mixed, supported on ${{ \mathcal H }}_{2},{{ \mathcal H }}_{3},{{ \mathcal H }}_{4}$, and ${{ \mathcal H }}_{5}$. We used rejection sampling to simulate 100 heterodyne datasets with up to ${N}_{\mathrm{samples}}={10}^{5}$, and found ML estimates ${\hat{\rho }}_{\mathrm{ML},{{ \mathcal M }}_{d}}$ over each of the 9 models ${{ \mathcal M }}_{2},\,\ldots ,\,{{ \mathcal M }}_{10}$ using numerical optimization (see footnote 5). For each ρ0 and each d, we averaged $\lambda ({\rho }_{0},{{ \mathcal M }}_{d})$ over all 100 datasets to obtain an empirical average loglikelihood ratio $\bar{\lambda }({\rho }_{0},d)$.

Results of this test are shown in figure 11, where we plot the predictions for $\langle \lambda \rangle $ given by the Wilks theorem and equation (19), against the empirical average $\bar{\lambda }$, for a variety of ρ0 and d. Our formula correlates very well with the empirical average, while the Wilks theorem (unsurprisingly) overestimates λ dramatically for low-rank states. Whereas a model selection procedure based on the Wilks theorem would tend to falsely reject larger Hilbert spaces (by setting the threshold for acceptance too high), our formula provides a reliable null theory.

Figure 11. Applying the isotropic formula to heterodyne tomography: the Wilks theorem (orange dots) dramatically over-estimates $\langle \lambda ({\rho }_{0},{{ \mathcal M }}_{d})\rangle $ in optical heterodyne tomography. Our formula, equation (19) (blue squares), is far more accurate. Residual discrepancies occur in large part because ${N}_{\mathrm{samples}}$ is not yet 'asymptotically large'. The solid red line corresponds to perfect correlation between theory ($\langle \lambda \rangle $) and practice ($\bar{\lambda }$).

Interestingly, as d grows, equation (19) also begins to overpredict. As figure 12 indicates, a more accurate description is that the numerical experiments are underachieving: $\bar{\lambda }$ is still growing with ${N}_{\mathrm{samples}}$. Both the Wilks theorem and our analysis are derived in the limit ${N}_{\mathrm{samples}}\to \infty ;$ for finite but large ${N}_{\mathrm{samples}}$, both may be invalid. Figure 12 shows that, even at ${N}_{\mathrm{samples}}\sim {10}^{5}$, the behavior of $\bar{\lambda }$ has not yet become asymptotic. This is surprising, and suggests that heterodyne tomography is an exceptionally challenging case to model statistically.

Figure 12. 'Underachievement' of $\bar{\lambda }$ in heterodyne tomography: the empirical average $\bar{\lambda }$ may have achieved its asymptotic value, or may still be growing, depending on ρ0 and d. Solid lines indicate the value of our formula for the asymptotic expected value, given in equation (19).

However, our model does get some of the qualitative features correct. In figure 13, we present simulated values of $\langle {\lambda }_{{jk}}\rangle $ for an isotropic Fisher information and for heterodyne tomography. While the values of $\langle {\lambda }_{{jk}}\rangle $ do not agree exactly, they still decompose into two types, the 'L' and the 'kite'. (See figure 14 for an analysis of the discrepancies.)

Figure 13. Detailed comparison of the isotropic model and heterodyne tomography: the values of $\langle {\lambda }_{{jk}}\rangle $ for an isotropic Fisher information (left), and for heterodyne tomography (right). Top: ${\rho }_{0}=| 0\rangle \langle 0| $. Bottom: ${\rho }_{0}={{ \mathcal I }}_{2}/2$. Discussion: qualitatively, the behavior is the same, though there are quantitative differences, particularly within the kite.

Figure 14. Discrepancies between the isotropic model and heterodyne tomography: examining how our prediction for $\langle {\lambda }_{{jk}}\rangle $ disagrees with simulated heterodyne experiments. We take ${\rho }_{0}=| 0\rangle \langle 0| $ and d = 8. Top left: the values of $\langle {\lambda }_{0k}\rangle $ in the 'L' as a function of ${N}_{\mathrm{samples}}$. Top right: at the largest ${N}_{\mathrm{samples}}$ studied, $\langle {\lambda }_{0k}\rangle $ is less than 1, especially for the higher number states. Bottom left: the total from the 'kite' versus ${N}_{\mathrm{samples}}$; it is clear the total is still growing. Bottom right: the individual 'kite' elements $\langle {\lambda }_{{jk}}\rangle $ at the largest ${N}_{\mathrm{samples}}$ studied; most are small compared to their values in the isotropic case.

6. Conclusions and discussion

Quantum state space violates LAN, a key property satisfied by classical statistical models. Through the introduction of MP-LAN, we have provided a new framework for reasoning about classical statistical results for models that do not satisfy LAN because of convex constraints.

We explicitly investigated one such result, the Wilks theorem, found that it is not generally reliable in quantum state tomography, and provided a much more broadly applicable replacement that can be used in model selection methods (equation (19)). This includes information criteria such as the AIC and BIC [17, 29, 37, 38], which do not explicitly use the Wilks theorem but rely on the same assumptions (LAN, equivalence between loglikelihood and squared error, etc.). Null theories of loglikelihood ratios have many other applications, including hypothesis testing [27, 32] and confidence regions [58], and our result is directly applicable to them. References [27, 58] both point out explicitly that their methods are unreliable near boundaries and therefore cannot be applied to rank-deficient states; our result fixes this outstanding problem.

However, our numerical experiments with heterodyne tomography show unexpected behavior, indicating that quantum tomography can still surprise, and may violate all asymptotic statistics results. In such cases, bootstrapping [59, 60] may be the only reliable way to construct null theories for λ.

Finally, the methods presented here have applications beyond the analysis of loglikelihoods. MP-LAN provides a new and rigorous framework for describing the behavior of the ML estimator in the presence of convex constraints. This framework helps shed light on the behavior of ${\hat{\rho }}_{\mathrm{ML},d}$ for rank-deficient states, and can be used to predict other derived properties such as the average rank of the estimate, which is independently interesting for (e.g.) quantum compressed sensing [19, 61–63].

Acknowledgments

TLS thanks Jonathan A Gross for helpful discussions on software design, coding, and statistics, John King Gamble for useful insights on parallelized computation and feedback on earlier drafts of this paper, and Anupam Mitra, Kenneth M Rudinger, and Daniel Suess for proofreading edits. The authors are grateful for those who provide support for the following software packages: iPython/Jupyter [64], matplotlib [65], mpi4py [66], NumPy [67], pandas [68], Python 2.7 [69], SciPy [70], seaborn [71], and SymPy [72].

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

Footnotes

  • The index j may be continuous or discrete.

  • 4. See the supplemental information for a derivation of this solution.

  • 5. The model ${{ \mathcal M }}_{1}$ is trivial, as ${{ \mathcal M }}_{1}=\{| 0\rangle \langle 0| \}$; it will almost always be wrong.
