Spectral thresholding quantum tomography for low rank states

Cristina Butucea; Mădălin Guţă; Theodore Kypraios

doi:10.1088/1367-2630/17/11/113050

1. Introduction

Recent years have witnessed significant developments at the overlap between quantum theory and statistics: from new state estimation (or tomography) methods [2–8], design of experiments [9–11], quantum process and detector tomography [12, 13] construction of confidence regions (error bars) [14–16], quantum tests [17, 18] entanglement estimation [19], asymptotic theory [20–23]. The importance of quantum state tomography, and the challenges raised by the estimation of high dimensional systems were highlighted by the landmark experiment [1] where entangled states of up to eight ions were created and fully characterized. However, as full quantum state tomography of large systems becomes unfeasible [24], there is significant interest in identifying physically relevant, lower dimensional models, and in devising efficient model selection and estimation methods in such setups [7, 8, 25–28]. In this paper we reconsider the multiple ions tomography (MIT) problem by proposing and analysing several new methods for estimating low rank states in a statistically efficient way. We emphasize that, while the theoretical and simulations results are specific to the ion tomography setup, the general methods based on combining linear (least squares) estimation with spectral thresholding (or eigenvalues truncation) can be applied to any informationally complete tomography scenario. Below, we briefly review the MIT setup, after which we proceed with presenting the key ideas and results of the paper.

In MIT [1], the goal is to statistically reconstruct the joint state of k ions (modelled as two-level systems), from counts data generated by performing a large number of measurements on identically prepared systems. The unknown state ρ is a d × d density matrix (complex, positive trace-one matrix) where $d={2}^{k}$ is the dimension of the Hilbert space of k ions. The experimenter can measure an arbitrary Pauli observable ${\sigma }_{x},{\sigma }_{y}$ or ${\sigma }_{z}$ of each ion, simultaneously on all k ions. Thus, each measurement setting is labelled by a sequence ${\bf{s}}=({s}_{1},...,{s}_{k})\in \{x,y,z\}{}^{k}$ out of ${3}^{k}$ possible choices. The measurement produces an outcome ${\bf{o}}=({o}_{1},...,{o}_{k})\in \{+1,-1\}{}^{k},$ whose probability is equal to the corresponding diagonal element of ρ with respect to the orthonormal basis (ONB) determined by the measurement setting ${\bf{s}}.$ The measurement procedure and statistical model can be summarized as follows. For each setting ${\bf{s}}$ the experimenter performs n repeated measurements and collects the counts of different outcomes $N({\bf{o}}| {\bf{s}}),$ so that the total number of quantum samples used is $N:= n\times {3}^{k}.$ The resulting dataset is a ${2}^{k}\times {3}^{k}$ table whose columns are independent and contain all the counts in a given setting. A commonly used [1] estimation method is maximum likelihood (ML) which selects the state for which the probability of the observed data is the highest among all states. However, while this method seems to perform well in practice, and has efficient numerical implementations [29], it does not provide confidence intervals (error bars) for finite samples, and it has been criticized for its tendency to produce rank-deficientstates [2].

The goal of this paper is to find alternative estimators which can be efficiently computed, and work well for low rank states. The reason for focusing on low rank states is that they form a realistic model for physical states created in the lab, where experimentalists often aim at preparing a pure (rank-one) state. While this is generally difficult, the realized states tend to have rapidly decaying eigenvalues, so that they can be well approximated by low rank states. Our strategy is to combine an easy but 'noisy' estimation method—the least square estimator (LSE)—with an appropriate spectral truncation (tuned using available data only) which involves setting certain eigenvalues of the LSE to zero, while possibly adjusting the remaining ones. It turns out that this can lead to significant reduction of the mean square error (MSE) of the estimator. We discuss some of the key results on the LSE and the proposed truncation estimators below.

The LSE ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ is obtained by inverting the linear map $A\ :\rho \mapsto {{\mathbb{P}}}_{\rho }$ between the state and the probability distribution of the data, where the unknown probabilities are replaced by the empirical (observed) frequencies of the measurement data. The resulting estimator is unbiased, and is 'optimal' in the sense that it minimizes the prediction error, i.e. the euclidian distance between the empirical frequencies and the predicted probabilities. However, one of the disadvantages of the LSE is that it does not take into account the physical properties of the state, i.e. its positivity and trace-one property. More importantly, as we explain below, the LSE has a relatively large estimation error for the class of low rank states, and performs well only on very mixed states. This is illustrated in figure 1 where the eigenvalues of ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ are plotted (in decreasing order) against those of the true state ρ, the latter being chosen to have rank r = 2. We see that while the non-zero eigenvalues of ρ are estimated reasonably well, the LSE is poor in estimating the zero-eigenvalues, and as consequence, it has a large estimation variance.

**Figure 1.** Eigenvalues of the LSE (red) arranged in decreasing order, versus those of the true state of k = 4 ions of rank r = 2 (blue), for n = 20 measurement repetitions (left) and n = 100 measurement repetitions (right).
Download figure:
Standard image High-resolution image

Our goal is to design more precise estimators, which have the LSE as a starting point, but take into account the 'sparsity' properties of the unknown state. Figure 1 suggests that the non-zero eigenvalues of the LSE which are below a certain 'statistical threshold', can be considered as statistical noise and may be set to zero in order to improve the estimation error. To find this noise level, we establish a concentration inequality (see proposition 1) which shows that the operator-norm error $\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho {\parallel }^{2}$ is upper bounded by a rate ${\nu }^{2}$ which (up to logarithmic factors in d) is proportional to d/N.

The first estimator we propose, is a rank penalized one obtained by diagonalizing the LSE, arranging its eigenvalues in decreasing order of their absolute values, and setting to zero all those eigenvalues whose absolute values are below the threshold ν

$\begin{eqnarray*}&&{\widehat{\rho }}_{n}^{(\mathrm{ls})}=\displaystyle \sum _{i=1}^{d}{\widehat{\lambda }}_{i}| {\hat{\psi }}_{i}\rangle \langle {\hat{\psi }}_{i}| \quad \longrightarrow \quad {\widehat{\rho }}_{n}^{(\mathrm{pen})}=\displaystyle \sum _{| {\widehat{\lambda }}_{i}| \geqslant \nu }{\widehat{\lambda }}_{i}| {\hat{\psi }}_{i}\rangle \langle {\hat{\psi }}_{i}| .\end{eqnarray*}$

The same outcome can be obtained as solution of the following penalized estimation problem: among all selfadjoint matrices, choose the one that is close to the LSE but in the same time it has low rank, so that it minimizes over τ the norm-two (also known as Frobenius norm) distance squared, penalized by the rank

$\begin{eqnarray*}&&D(\tau ):= \parallel \tau -{\widehat{\rho }}_{n}^{(\mathrm{ls})}{\parallel }_{2}^{2}+{\nu }^{2}\cdot {\rm{rank}}(\tau ).\end{eqnarray*}$

In particular the estimator's rank is determined by the data. In theorem 1 we show that if ρ is of unknown rank $r\leqslant d,$ then the MSE ${\mathbb{E}}\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{2}^{2}$ is upper-bounded (up to logarithmic factors) by the rate $(r\cdot d)/N.$ This captures the expected optimal dependence on the number of parameters for a state of rank r. Indeed, in section 5 we show that no estimator can improve the above rate for all states of rank r, see theorem 3 for the asymptotic minimax lower bound.

The penalized estimator has however the drawback that it may not represent a physical state. To remedy this, and further improve its statistical accuracy, we propose a physical estimator which is the solution of the following optimization problem. We seek the density matrix which is closest to the LSE ${\widehat{\rho }}_{n}^{(\mathrm{ls})},$ and whose non-zero eigenvalues are larger that the threshold $4\nu .$ It turns out that the solution can be found via a simple iterative algorithm whereby at each step the eigevalues of ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ below the threshold are set to zero, and the remaining eigenvalues are normalized by shifting with a common constant, while the eigenvectors are not changed throughout the process. In theorem 2 we show that the physical estimator satisfies a similar upper bound to the penalized one.

In section 6 we present results of extensive numerical investigations of the two proposed estimators. In addition we consider the oracle 'estimator' and the cross-validated estimator ${\widehat{\rho }}_{n}^{(\mathrm{cv})}$ (see next paragraph for a brief explanation of the cross-validation method). The oracle estimator is simply the spectral truncation of the LSE that is closest to the true state ρ, and is obtained by setting to zero a number of eigenvalues with small magnitude. It is not a proper estimator since it uses the true unknown density matrix ρ, but it is very useful as a benchmark. The cross-validated estimator is also a spectral truncation of the LSE which aims to find the optimal truncation rank (i.e. rank of the truncated estimator) by minimizing an estimate of the norm-two square error over all ranks.

Cross-validation is a widely used collection of procedures for model validation and selection [30] which involve splitting the data into different (independent) batches, some of which are used for estimation while the others are used for testing the model. We found that cross-validation can help in better tuning the constant factor of the threshold rate of the penalized and physical estimators. As expected from the theoretical results, we find that all estimators perform significantly better than the LSE on low rank states; moreover the physical estimator has slightly smaller estimation error than the others, including the oracle estimator. We also find that all methods converge to the correct rank in the limit of large number of repetitions but through different routes: the penalized estimator tends to underestimate, while the physical one tends to overestimate the rank, for small number of samples.

Having discussed the upper bounds on the estimators' MSE, we would like to know how they compare with the best possible estimation procedure. For each estimation procedure ${\widehat{\rho }}_{n}$ one can consider its maximum MSE over the class ${{\mathcal{S}}}_{d,r}$ of rank r states

$\begin{eqnarray*}&&{R}_{\mathrm{max}}({\widehat{\rho }}_{n};r)=\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}.\end{eqnarray*}$

Its minimum over all estimators is called the minimax risk ${R}_{\mathrm{minmax}}(r,n).$ Asymptotic statistics theory [31] shows that the limiting value of the minimax risk rescaled by the number of samples $N=n\cdot {3}^{k},$ satisfies the following lower bound

$\begin{eqnarray*}&&\underset{n\to \infty }{\mathrm{lim}\;\mathrm{inf}}\;{{NR}}_{\mathrm{minmax}}(r,n)\geqslant \underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{\rm{Tr}}(I{(\rho )}^{-1}G(\rho )).\end{eqnarray*}$

The left side is the asymptotic minimax risk while on the right side $I(\rho )$ is the Fisher information corresponding to all measurement settings taken together, and $G(\rho )$ is a positive matrix describing the quadratic approximation of the norm-two (Frobenius) distance squared, around ρ. In theorem 3 we show that the above lower bound is further bounded from below by $2r(d-r)$ which shows that (up to logarithmic factors) the upper bounds of the penalized and physical estimators have the same scaling as the asymptotic minimax risk.

Recently, a number of papers discussed related aspects of quantum tomography problems. The idea of the penalized estimator has been proposed in [8], which provided a weaker upper bound for its MSE. Here we improve on the MSE upper bounds, provide minimax lower bounds, and propose and analyse new classes of estimators, e.g. a 'physical' estimator with improved estimation performance. Reference [7] analyses model selection methods for finite rank models and ML estimation. Reference [32] proposes a different estimator and establishes a comparable upper bound for its MSE. The class of low rank states is also employed in compressed sensing quantum tomography [25, 28, 33], but their statistical model is based on expectations of Pauli observables rather than measurement counts.

The paper is organized as follows. In section 2 we describe the measurement procedure and introduce the statistical model of MIT. In section 3 we define the linear (least squares) estimator and derive an upper bound on its operator norm error which improves on a previous bound of [8]. In section 4 we define the penalized and threshold estimators and derive upper bounds for their MSEs with respect to the norm-two (Frobenius) distance squared. The performance of the different methods is analysed in section 6. An asymptotic lower bound for the minimax risk is derived in section 5, based on the Fisher information of the measurement data. The upper and lower bounds match in the scaling with the number of parameters and number of total measurements, up to a logarithmic factor. We give a detailed description of the numerical implementation of the algorithms, including the cross-validation routines used for tuning the pre-factor of the penalty and threshold constants. We illustrate the simulation results with box plots of the MSEs for the least squares, oracle, cross-validation, penalization and threshold estimator, for states of ranks 1, 2, 6 and 10, and for different choices of measurement repetitions $n=20,100.$ Additionally, we plot the empirical distribution of the chosen rank for different estimators, showing the concentration on the true rank as the number of repetitions increases.

2. Multiple ions tomography

This paper deals with the problem of estimating the joint quantum state of k two-dimensional systems (qubits), as encountered in ion trap quantum tomography [1]. The two-dimensional system is determined by two energy levels of an ion, while the remaining levels can be ignored as they remain unpopulated during the experiment. The joint Hilbert space of the ions is therefore the tensor product ${\left({{\mathbb{C}}}^{2}\right)}^{\otimes k}\cong {{\mathbb{C}}}^{d}$ where $d={2}^{k},$ and the state is a density matrix ρ on this space, i.e. a positive d × d matrix of trace one.

Our statistical model is derived from standard ion trap measurement procedures, and takes into account the specific statistical uncertainty due to finite number of measurement repetitions. We consider that for each individual qubit, the experimenter can measure one of the three Pauli observables ${\sigma }_{x},{\sigma }_{y},{\sigma }_{z}.$ A measurement set-up is then defined by a setting ${\bf{s}}=({s}_{1},...,{s}_{k})\in {{\mathcal{S}}}_{k}:= \{x,y,z\}{}^{k}$ which specifies which of the three Pauli observables is measured for each ion. For each fixed setting, the measurement produces random outcomes ${\bf{o}}\in {{\mathcal{O}}}_{k}:= \{+1,-1\}{}^{k}$ with probability

$\begin{eqnarray}&&{p}_{\rho }({\bf{o}}| {\bf{s}}):= {\rm{Tr}}(\rho {P}_{{\bf{o}}}^{{\bf{s}}})=\langle {e}_{{\bf{o}}}^{{\bf{s}}}| \rho | {e}_{{\bf{o}}}^{{\bf{s}}}\rangle ,\end{eqnarray} \tag{ 2.1 }$

where ${P}_{{\bf{o}}}^{{\bf{s}}}$ are the one-dimensional projections

$\begin{eqnarray}&&{P}_{{\bf{o}}}^{{\bf{s}}}=| {e}_{{o}_{1}}^{{s}_{1}}\rangle \langle {e}_{{o}_{1}}^{{s}_{1}}| \otimes ...\otimes | {e}_{{o}_{k}}^{{s}_{k}}\rangle \langle {e}_{{o}_{k}}^{{s}_{k}}| \end{eqnarray} \tag{ 2.2 }$

and $| {e}_{o}^{s}\rangle$ are the eigenvectors of the Pauli matrices, i.e. ${\sigma }_{s}| {e}_{\pm }^{s}\rangle =\pm | {e}_{\pm }^{s}\rangle .$

The measurement procedure consists of choosing a setting ${\bf{s}},$ and performing n repeated measurements in that setting, on identically prepared systems in state ρ. This provides information about the diagonal elements of ρ with respect to the chosen measurement basis, i.e. the probabilities ${p}_{\rho }({\bf{o}}| {\bf{s}}).$ In order to identify the other elements, the procedure is then repeated for all ${3}^{k}$ possible settings.

Before describing the statistical model of the measurement counts data, we start by discussing in more detail the relation between the unknown parameter ρ and the probabilities $p({\bf{o}}| {\bf{s}}).$ Consider the 'extended' set of Pauli operators $\{{\sigma }_{x},{\sigma }_{y},{\sigma }_{z},{\sigma }_{I}:= {\bf{1}}\}$ which form a basis in $M({{\mathbb{C}}}^{2}).$ We construct the tensor product basis in $M({{\mathbb{C}}}^{d})$ with elements ${\sigma }_{{\bf{b}}}={\sigma }_{{b}_{1}}\otimes ...\otimes {\sigma }_{{b}_{k}}$ where ${\bf{b}}\in \{x,y,z,I\}{}^{k}$ and note that the following orthogonality relations hold ${\rm{Tr}}({\sigma }_{{\bf{b}}}{\sigma }_{{\bf{c}}})=d{\delta }_{{\bf{b}},{\bf{c}}}.$ The state ρ can be expanded in this basis as

$\begin{eqnarray}&&\rho =\displaystyle \sum _{{\bf{b}}\in {\{I,x,y,z\}}^{k}}{\rho }_{{\bf{b}}}{\sigma }_{{\bf{b}}},\qquad \mathrm{where}\quad {\rho }_{{\bf{b}}}={\rm{Tr}}(\rho {\sigma }_{{\bf{b}}})/d.\end{eqnarray} \tag{ 2.3 }$

Equation (2.1) can then be written as

$\begin{eqnarray*}&&{p}_{\rho }({\bf{o}}| {\bf{s}})=\displaystyle \sum _{{\bf{b}}\in {\{I,x,y,z\}}^{k}}{\rho }_{{\bf{b}}}{\rm{Tr}}({\sigma }_{{\bf{b}}}{P}_{{\bf{o}}}^{{\bf{s}}})=\displaystyle \sum _{{\bf{b}}\in {\{I,x,y,z\}}^{k}}{\rho }_{{\bf{b}}}{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}).\end{eqnarray*}$

The coefficients ${A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})$ can be computed explicitly as

$\begin{eqnarray}&&{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})={\rm{Tr}}({\sigma }_{{\bf{b}}}{P}_{{\bf{o}}}^{{\bf{s}}})=\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}{o}_{j}\cdot {\delta }_{{b}_{j},{s}_{j}},\end{eqnarray} \tag{ 2.4 }$

where ${E}_{{\bf{b}}}:= \{i\ :{b}_{i}=I\}.$ Let $\tilde{\rho }\in {{\mathbb{C}}}^{{4}^{k}}$ be the representation of ρ as a the vector of coefficients ${\rho }_{{\bf{b}}},$ and let ${{\bf{p}}}_{\rho }$ be the corresponding vector of probabilities for all settings $\left({p}_{\rho }({\bf{o}}| {\bf{s}})\ :\;({\bf{o}},{\bf{s}})\in {{\mathcal{O}}}_{k}\times {{\mathcal{S}}}_{k}\right),$ with settings, and outcomes within settings ordered in lexicographical order. The measurement is then described by the linear map ${\bf{A}}\ :{{\mathbb{C}}}^{{4}^{k}}\to {{\mathbb{C}}}^{{3}^{k}}\otimes {{\mathbb{C}}}^{{2}^{k}}$ with matrix elements ${{\bf{A}}}_{{\bf{b}},({\bf{o}},{\bf{s}})}:= {A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})$ defined in (2.4), such that

$\begin{eqnarray}&&{{\bf{p}}}_{\rho }={\bf{A}}\tilde{\rho }.\end{eqnarray} \tag{ 2.5 }$

The linear map ${\bf{A}}$ is injective and we solve the previous equation via the optimization problem:

$\begin{eqnarray*}&&\tilde{\rho }=\mathrm{arg}\underset{\tilde{\tau }\in {{\mathbb{C}}}^{{4}^{k}}}{\mathrm{inf}}\parallel {\bf{p}}-{\bf{A}}\tilde{\tau }{\parallel }^{2},\end{eqnarray*}$

which comes down to multiplying equation (2.5) by the Moore–Penrose pseudoinverse of ${\bf{A}}$ and gives $\tilde{\rho }={({{\bf{A}}}^{*}\cdot {\bf{A}})}^{-1}\cdot {{\bf{A}}}^{*}\cdot {{\bf{p}}}_{\rho }.$ The following lemma has appeared in [8] but for completeness we include its proof in appendix A.1.

Lemma 1. Let ${\bf{A}}$ be the linear map defined in equation (2.5). Then ${{\bf{A}}}^{*}\cdot {\bf{A}}$ is diagonal and its elements are

$\begin{eqnarray*}&&{[{{\bf{A}}}^{*}\cdot {\bf{A}}]}_{{\bf{b}},{{\bf{b}}}^{\prime }}={2}^{k}{3}^{d({\bf{b}})}{\delta }_{{\bf{b}},{{\bf{b}}}^{\prime }},{\rm{}}\;{\rm{for}}\;{\rm{all}}\;{\rm{}}{\bf{b}}\in {\{I,x,y,z\}}^{k},\end{eqnarray*}$

where $d({\bf{b}}):= | {E}_{{\bf{b}}}|$ is the number of I's in the sequence ${\bf{b}}.$

From the decomposition (2.3) and lemma 1 we find

$\begin{eqnarray}&&\rho =\displaystyle \sum _{{\bf{b}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \sum _{{\bf{s}}}{p}_{\rho }({\bf{o}}| {\bf{s}})\displaystyle \frac{{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})}{{2}^{k}{3}^{d({\bf{b}})}}{\sigma }_{{\bf{b}}}=\displaystyle \sum _{{\bf{b}}}{\tilde{\rho }}_{{\bf{b}}}{\sigma }_{{\bf{b}}}.\end{eqnarray} \tag{ 2.6 }$

The above formula allows to reconstruct the matrix elements from the measurement probabilities. However, since the experiment only provides random counts from these probabilities, we need to construct a statistical model for the measurement data. After n repetitions of the measurement with setting ${\bf{s}},$ we collect independent, identically distributed observations ${X}_{i| {\bf{s}}}\in {{\mathcal{O}}}_{k},$ for i from 1 to n. The data can be summarized by the set of counts $\{N({\bf{o}}| {\bf{s}})\ :{\bf{o}}\in {{\mathcal{O}}}_{k}\},$ where $N({\bf{o}}| {\bf{s}})={\displaystyle \sum }_{i}{\delta }_{{X}_{i| {\bf{s}}},{\bf{o}}}$ is the number of times that outcome ${\bf{o}}$ has occurred. After repeating this for each setting ${\bf{s}}\in {{\mathcal{S}}}_{k},$ we collect all the data in the counts dataset $D:= \{N({\bf{o}}| {\bf{s}}):({\bf{o}},{\bf{s}})\in {{\mathcal{O}}}_{k}\times {{\mathcal{S}}}_{k}\}.$

Since successive preparation-measurement cycles are independent of each other, the probability of a given set of counts $N({\bf{o}}| {\bf{s}})$ (obtained by repeating n times the measurement in setting ${\bf{s}}$ ) is a multinomial distribution, and the probability of a certain dataset D is the product of such multinomials over the different settings:

$\begin{eqnarray}&&{{\mathbb{P}}}_{\rho }(D)={{\mathbb{P}}}_{\rho }\left(\{N({\bf{o}}| {\bf{s}})\ :({\bf{o}},{\bf{s}})\in {{\mathcal{O}}}_{k}\times {{\mathcal{S}}}_{k}\}\right)=\displaystyle \prod _{{\bf{s}}}\displaystyle \frac{n!}{{\displaystyle \prod }_{{\bf{o}}}N({\bf{o}}| {\bf{s}})!}\displaystyle \prod _{{\bf{o}}}{p}_{\rho }{({\bf{o}}| {\bf{s}})}^{N({\bf{o}}| {\bf{s}})}.\end{eqnarray} \tag{ 2.7 }$

The statistical problem is to estimate the state ρ from the measurement data summarized by the counts dataset D. The most commonly used estimation method is ML. The ML estimator is defined by

$\begin{eqnarray*}&&{\hat{\rho }}_{n}^{(\mathrm{ml})}(D):= \underset{\sigma }{\mathrm{arg}\;\mathrm{max}}\;{{\mathbb{P}}}_{\sigma }(D),\end{eqnarray*}$

where the maximum is taken over all density matrices τ on ${{\mathbb{C}}}^{d},$ and can be computed by using standard maximization routines, or the iterative algorithms proposed in [29, 34]. However, ML becomes impractical for about k = 10 ions, and the iterative algorithm has the drawback that it cannot be adapted to models where prior information about the state is encoded in a lower dimensional parametrization of the relevant density matrices, e.g. when the states are low rank. In the next section we discuss an alternative method, the LSE, and derive an upper bound on its MSE. After this, we will show that by 'post-processing' the LSE using penalization and thresholding methods, its performance can be considerably improved when the unknown state has low rank.

3. The LSE

Recall that the vectorized version $\tilde{\rho }$ of the state ρ satisfies (2.5), it is therefore the solution of the optimization

$\begin{eqnarray*}&&\tilde{\rho }=\mathrm{arg}\underset{\tilde{\tau }\in {{\mathbb{C}}}^{{4}^{k}}}{\mathrm{inf}}\parallel {\bf{p}}-{\bf{A}}\tilde{\tau }{\parallel }^{2},\end{eqnarray*}$

giving $\rho ={\displaystyle \sum }_{{\bf{b}}}{\tilde{\rho }}_{{\bf{b}}}\cdot {\sigma }_{{\bf{b}}}$ in (2.6). If the number of repetitions n is large compared with the dimension d, then the outcomes' empirical frequencies are good approximations of the corresponding probabilities, i.e. $f({\bf{o}}| {\bf{s}}):= N({\bf{o}}| {\bf{s}})/n\to p({\bf{o}}| {\bf{s}})$ by the law of large numbers. Therefore, by replacing ${\bf{p}}$ by the vector of frequencies ${\bf{f}}$ in in the previous display, we can define the linear estimator also known as the LSE of ρ

$\begin{eqnarray*}&&{\tilde{\rho }}_{n}^{(\mathrm{ls})}:= \mathrm{arg}\underset{\tilde{\tau }\in {{\mathbb{C}}}^{{4}^{k}}}{\mathrm{inf}}\parallel {\bf{f}}-{\bf{A}}\tilde{\tau }{\parallel }^{2},\end{eqnarray*}$

which has the explicit expression

$\begin{eqnarray}&&{\widehat{\rho }}_{n}^{(\mathrm{ls})}=\displaystyle \sum _{{\bf{b}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \sum _{{\bf{s}}}f({\bf{o}}| {\bf{s}})\displaystyle \frac{{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})}{{2}^{k}{3}^{d({\bf{b}})}}{\sigma }_{{\bf{b}}}.\end{eqnarray} \tag{ 3.1 }$

Note that in this case it comes down to replacing the unknown probability ${\bf{p}}$ in equation (2.6) with the empirical frequencies ${\bf{f}}$ (also known as the plug-in method).

In spite of this 'optimality' property and its computationally efficiency, the LSE has the disadvantage that in general it is not a state, i.e. it is not trace-one and may have negative eigenvalues. A more serious disadvantage is that its risk—measured for instance by the MSE ${\mathbb{E}}(\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho {\parallel }_{2}^{2})$ —is large compared with other estimators such as the ML estimator. This is due to the fact that the LSE does not use the physical properties of the unknown parameter ρ, that is positivity and trace-one. As we will see below, the modified estimators proposed in section 4 outperform the LSE while adding only a small amount of computational complexity. Moreover, the second estimator will be a density matrix.

In the remainder of this section we provide concentration bounds on the square error of the LSE, which will later be used in obtaining the upper bounds of the improved estimators. The following proposition improves the rate $k{(4/3)}^{k}/n$ obtained in [8] to $k{(2/3)}^{k}/n.$

Proposition 1. Let ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ be the LSE of ρ. Then, for any $\varepsilon \gt 0$ small enough the following operator norm inequality holds with probability larger than $1-\varepsilon$ under ${{\mathbb{P}}}_{\rho }$

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \leqslant \nu (\varepsilon ),\end{eqnarray*}$

where

$\begin{eqnarray*}&&\nu {(\varepsilon )}^{2}=\displaystyle \frac{2}{n}{\left(\displaystyle \frac{2}{3}\right)}^{k}\mathrm{log}\left(\displaystyle \frac{{2}^{k+1}}{\varepsilon }\right)=\displaystyle \frac{2d}{N}\mathrm{log}\left(\displaystyle \frac{2d}{\varepsilon }\right)\end{eqnarray*}$

with $N:= n\cdot {3}^{k}$ the total number of measurements. The same bound holds when $k=k(n)$ as long as $\nu (\varepsilon )\to 0.$

Proof. See appendix A.2.

As a side remark we note that projecting the LSE onto the space of Hermitian matrices with trace 1, does not change the rate of convergence from proposition 1. The following proposition allows us to assume that, without loss of generality, the LSE has also trace 1.

Proposition 2. Under the notation and assumptions in proposition 1, let

$\begin{eqnarray}&&{\widehat{\rho }}_{n}^{(\mathrm{ls},n)}=\underset{\tau :\mathrm{tr}(\tau )=1}{\mathrm{arg}\;\mathrm{min}}\parallel \tau -{\widehat{\rho }}_{n}^{(\mathrm{ls})}{\parallel }_{2}^{2}.\end{eqnarray} \tag{ 3.2 }$

Then with probability larger than $1-\varepsilon$ we have $\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls},n)}-\rho \parallel \leqslant 2\nu (\varepsilon ).$

Proof. See appendix A.3.

4. Rank-penalized and threshold projection estimator

In this section we investigate two ways to improve the LSEs. The first method is to project the LSE onto the space of finite rank Hermitian matrices of an appropriate rank. We prove upper bounds for its risk with respect to the norm-two (Frobenius) distance squared. Building on the knowledge about the rank-penalized estimator, we define the second estimator which is the projection of the LSE on the space of physical states whose eigenvalues are larger than a certain positive noise threshold. We give an simple and fast algorithm producing a proper density matrix from the data, which also inherits the good theoretical properties of the rank-penalized estimator.

4.1. Rank-penalized estimator

We introduce here the rank-penalized nonlinear estimator, which can be computed from the LSE by truncation to an appropriately chosen rank.

As noted earlier, while the LSE is unbiased, it has a large variance due to the fact that it does not take into account the physical constraints encoded in the unknown parameter ρ. A possible remedy is to 'project' the LSE onto the space of physical states, i.e. positive, trace-one matrices. This method will be discussed in the following subsection. Another improvement can be obtained by taking into account the 'sparsity' properties of the unknown state. For instance, in many experimental situations the goal is to create a particular low rank, or even pure state. The fact that such states can be characterized with a smaller number of parameters than a general density matrix, has two important consequences. Firstly, they can be estimated by measuring an 'informationally incomplete' set of observables, as demonstrated in [25, 26]. Secondly, the prior information can be used to design estimators with reduced estimation error compared with generic methods which do not take into account the structure of the state. Roughly speaking, this is because each unknown parameter brings its own contribution to the overall error of the estimator.

However, the downside of working with a lower dimensional model is that it contains built-in assumptions which may not be satisfied by the true (unknown) physical state. Preparing a pure state is strictly speaking rarely achievable due to various experimental imperfections, so using a pure state statistical model is in fact an oversimplification and can lead to erroneous conclusions about the true state. On the other hand, one can argue that when the (small) experimental noises are taken into account, the actual state is 'effectively' low rank, i.e. it has a small number of significant eigenvalues and a large number of eigenvalues which are so close to zero that they cannot be distinguished from it. Then, the interesting question is how to decide on where to make the cut-off between statistically relevant eigenvalues and pure statistical noise. This is a common problem in statistics which is closely related to that of model selection [30]. Below we describe the rank-penalized estimator addressing this problem, and show that its theoretical and practical performance is superior to the LSE, and is close to what one would expect from an optimal estimator. In addition, its computation requires only the diagonalization of the LSE.

Before presenting a simple algorithm for computing the estimator, we briefly discuss the idea behind its definition. Let

$\begin{eqnarray}&&{\widehat{\rho }}_{n}^{(\mathrm{ls})}=\displaystyle \sum _{i=1}^{d}{\widehat{\lambda }}_{i}| {\hat{\psi }}_{i}\rangle \langle {\hat{\psi }}_{i}| .\end{eqnarray} \tag{ 4.1 }$

Be the spectral decomposition of the LSE, with eigenvalues ordered such that $| {\widehat{\lambda }}_{1}| \geqslant ...\geqslant | {\widehat{\lambda }}_{d}| .$ For each given rank $\kappa \in \{1,...,d\}$ we can project ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ onto the space of matrices of rank κ by computing the matrix which is the closest to ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ with respect to the norm-two distance

$\begin{eqnarray*}&&{\widehat{\rho }}_{n}(\kappa ):= \underset{\tau :{\rm{rank}}(\tau )=\kappa }{\mathrm{arg}\;\mathrm{min}}\parallel \tau -{\widehat{\rho }}_{n}^{(\mathrm{ls})}{\parallel }_{2}^{2}.\end{eqnarray*}$

Although the projection is not a linear operator, ${\widehat{\rho }}_{n}(\kappa )$ is easy to compute, and is obtained by truncating the spectral decomposition (4.1) to the most significant κ eigenvalues

$\begin{eqnarray*}&&{\widehat{\rho }}_{n}(\kappa )=\displaystyle \sum _{i=1}^{\kappa }{\widehat{\lambda }}_{i}| {\hat{\psi }}_{i}\rangle \langle {\hat{\psi }}_{i}| .\end{eqnarray*}$

The question is now how to choose the rank κ in order to obtain a good estimator. In figure 2 we illustrate the dependence of the norm-two square error $e(\kappa ):= \parallel {\widehat{\rho }}_{n}(\kappa )-\rho {\parallel }_{2}^{2}$ on the rank, for a particular dataset generated with a rank 6, 4-ions state. As the rank is increased starting with $\kappa =1$ (pure states), the error decreases steeply as ${\widehat{\rho }}_{n}(\kappa )$ becomes less biased, it reaches a minimum close to the true rank, and increases slowly as added parameters increase the variance of the estimator. However, since the state ρ is unknown, the norm-two error and optimal rank for which the minimum is achieved, are unknown. To go around this, we can estimate the error $e(\kappa )$ from the data by means of e.g. cross-validation, as it will be described in section 6. However, in this section we follow a different path, and we define the rank-penalized estimator [8, 35] as the minimizer over κ of the following expression:

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}(\kappa )-{\widehat{\rho }}_{n}^{(\mathrm{ls})}{\parallel }_{2}^{2}+{\nu }^{2}\cdot \kappa =\displaystyle \sum _{i=\kappa +1}^{d}{\widehat{\lambda }}_{i}^{2}+{\nu }^{2}\cdot \kappa ,\end{eqnarray*}$

where ν is a constant which will be tuned appropriately. The first term quantifies the fit of the truncated estimator with respect to the LSE, while the second term is a penalty which increases with the complexity of the model, i.e. the rank. The rank penalized estimator ${\widehat{\rho }}_{n}^{(\mathrm{pen})}$ is thus the solution of the simple optimization problem

$\begin{eqnarray}{\widehat{\rho }}_{n}^{(\mathrm{pen})} & := & {\widehat{\rho }}_{n}(\hat{\kappa }),\qquad {\rm{where}}\qquad \hat{\kappa }:= \underset{\kappa =1,\ldots ,d}{\mathrm{arg}\;\mathrm{min}}\left\{\displaystyle \sum _{i=\kappa +1}^{d}{\widehat{\lambda }}_{i}^{2}+{\nu }^{2}\cdot \kappa \right\}=\mathrm{max}\{\kappa \ :{\widehat{\lambda }}_{\kappa }^{2}\geqslant {\nu }^{2}\}.\end{eqnarray} \tag{ 4.2 }$

This means that the eigenvalues below a certain noise threshold are set to zero while those above the threshold remain unchanged. The following theorem is our first main result, and shows that the appropriate threshold is given by the upper bound on the operator norm error of the LSE, as established in proposition 1.

**Figure 2.** The norm-two square error $\parallel {\widehat{\rho }}_{n}(\kappa )-\rho {\parallel }_{2}^{2}$ of the truncated LSE as a function of rank, for a rank 6 state, k = 4 (d = 16) and n = 500 measurement repetitions.
Download figure:
Standard image High-resolution image

**Figure 2.** The norm-two square error $\parallel {\widehat{\rho }}_{n}(\kappa )-\rho {\parallel }_{2}^{2}$ of the truncated LSE as a function of rank, for a rank 6 state, k = 4 (d = 16) and n = 500 measurement repetitions.
Download figure:
Standard image High-resolution image

Theorem 1. Let $\theta \gt 0$ be an arbitrary constant, let $c(\theta ):= 1+2/\theta ,$ and let $\varepsilon \gt 0$ be a small parameter. Then with probability larger than $1-\varepsilon ,$ we have

$\begin{eqnarray}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{2}^{2}\leqslant \underset{\kappa =1,\ldots ,d}{\mathrm{min}}\left\{{c}^{2}(\theta )\displaystyle \sum _{j\gt \kappa }{\lambda }_{j}^{2}(\rho )+2c(\theta )\nu {(\varepsilon )}^{2}\kappa \right\},\end{eqnarray} \tag{ 4.3 }$

where ${\widehat{\rho }}_{n}^{(\mathrm{pen})}$ is the penalized estimator defined in (4.2) with threshold $\nu {(\varepsilon )}^{2}$ given by

$\begin{eqnarray*}&&\nu {(\varepsilon )}^{2}:= \displaystyle \frac{2}{n}{\left(\displaystyle \frac{2}{3}\right)}^{k}\mathrm{log}\left(\displaystyle \frac{{2}^{k+1}}{\varepsilon }\right)=\displaystyle \frac{2d}{N}\mathrm{log}\displaystyle \frac{2d}{\varepsilon },\end{eqnarray*}$

which is assumed to be o(1) with increasing n and k.

Proof. The upper bound follows directly from proposition 1 combined with the following oracle inequality established in [8],

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{2}^{2}\leqslant \underset{\kappa =1,\ldots ,d}{\mathrm{min}}\left\{{c}^{2}(\theta )\underset{R:{\rm{rank}}(R)=\kappa }{\mathrm{min}}\parallel R-\rho {\parallel }_{2}^{2}+2c(\theta )\cdot \kappa \cdot \nu {(\varepsilon )}^{2},\right\}\end{eqnarray*}$

which holds true provided that $(1+\theta )\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho {\parallel }^{2}\leqslant \nu {(\varepsilon )}^{2}.$ This event occurs with probability larger than $1-\varepsilon .$

□

Let us make some explanatory remarks on the above result. Firstly, the bound (4.3) applies to all states ρ, not only 'small' rank ones. Recall that ${{\mathcal{S}}}_{d,r}$ denotes the set of states of rank-r states on ${{\mathbb{C}}}^{d}.$ In the special case when ${\rm{rank}}(\rho )=r\leqslant d,$ the theorem implies that, with probability larger than $1-\varepsilon ,$

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{2}^{2}\leqslant 2c(\theta )\cdot r\cdot \nu {(\varepsilon )}^{2}.\end{eqnarray*}$

If the rank r is much smaller than d, this bound is a significant improvement to the corresponding upper bound $d\cdot \nu {(\varepsilon )}^{2}$ for the LSE, which can be derived by combining the operator norm bound of proposition 1 and the matrix norms inequality $\parallel \tau {\parallel }_{2}^{2}\leqslant d\parallel \tau {\parallel }^{2},$ for $\tau \in M({{\mathbb{C}}}^{d}).$ Moreover, up to a constant factor the rate $\nu {(\varepsilon )}^{2}$ is equal to $d\cdot r\mathrm{log}(d)/N$ which is essentially the ratio of the number of parameters and total number of measurements. In section 5 we will show that apart from the log factor this rate is also optimal, and cannot be improved even if the rank of the state is known, which indicates that the estimator adapts to the complexity of the true parameter. Furthermore, we stress the fact that the bound (4.3) holds true for growing dimension $d={2}^{k}$ as well as the number of measurements n; the bound remains meaningful as far as $d\mathrm{log}d/N\to 0.$

The second observation is that our procedure selects the true rank consistently. Denote by $\hat{\kappa }$ the rank of the resulting estimator ${\widehat{\rho }}_{n}^{(\mathrm{pen})}.$ Following [8] we can prove that, if there exists some κ such that ${\lambda }_{\kappa }(\rho )\gt (1+\delta )\sqrt{\nu (\varepsilon )}$ and ${\lambda }_{\kappa +1}(\rho )\lt (1-\delta )\sqrt{\nu (\varepsilon )}$ for some $\delta \in (0,1),$ then

$\begin{eqnarray*}&&{\mathbb{P}}(\hat{\kappa }=\kappa )=1-{\mathbb{P}}(\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \geqslant \delta \sqrt{\nu (\varepsilon )}).\end{eqnarray*}$

This stresses the fact that the procedure detects the eigenvalues above a threshold related to the error of the LSE. If the true rank of ρ is r and if k and n are such that ν tends to 0 (which always occurs for fixed number of ions k), then ${\lambda }_{r}\gt \nu$ asymptotically and the probability that $\hat{\kappa }=r$ tends to 1.

We can also project ${\widehat{\rho }}_{n}^{(\mathrm{pen})}$ on the matrices with trace 1, to get

$\begin{eqnarray}&&{\widehat{\rho }}_{n}^{(\mathrm{pen},n)}=\mathrm{arg}\;\underset{R\in {\mathcal{S}}}{\mathrm{min}}\parallel R-{\widehat{\rho }}_{n}^{(\mathrm{pen})}{\parallel }_{2}^{2},\end{eqnarray} \tag{ 4.4 }$

where ${\mathcal{S}}$ is the set of all density matrices on ${{\mathbb{C}}}^{d}.$ The following corollary shows that the key properties of the estimator are preserved if we additionally normalize it to trace-one after thresholding.

Corollary 1. Under the notation and assumptions of theorem 1 if ρ is an arbitrary state in ${{\mathcal{S}}}_{d,r}$ and if k and n are such that ${\lambda }_{r}\gt \nu (\varepsilon )$ for some $\varepsilon \in (0,1),$ then, with probability larger than $1-\varepsilon ,$

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2}\leqslant 8c(\theta )\cdot r\cdot \nu {(\varepsilon )}^{2}.\end{eqnarray*}$

Moreover, there exists an absolute constant $C\gt 0$ such that

$\begin{eqnarray*}&&\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{E}_{\rho }\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2}\leqslant C\displaystyle \frac{{rd}}{N}\mathrm{log}\left(\displaystyle \frac{2d}{\varepsilon }\right).\end{eqnarray*}$

Proof. See appendix A.4.

□

4.2. Physical threshold estimator

Although the rank-penalized estimator performs well in terms of its risk, it is not necessarily positive and trace-one and therefore it may not represent a physical state. In this section we propose and analyse the following 'physical estimator'

$\begin{eqnarray}&&{\widehat{\rho }}_{n}^{(\mathrm{phys})}=\mathrm{arg}\;\underset{\sigma \in {\mathcal{S}}(\nu )}{\mathrm{min}}\parallel \sigma -{\tilde{\rho }}_{n}^{(\mathrm{ls})}{\parallel }_{2}^{2},\end{eqnarray} \tag{ 4.5 }$

where ${\tilde{\rho }}_{n}^{(\mathrm{ls})}$ is the 'normalized LSE' defined in (3.2), and ${\mathcal{S}}(\nu )$ denotes the set of states at noise level ν

$\begin{eqnarray*}&&{\mathcal{S}}(\nu )=\left\{\sigma {\rm{}}\;{\rm{density}}\;{\rm{matrix}}\;{\rm{with}}\;{\rm{eigenvalues}}\;{\rm{}}{\lambda }_{j}\in \{0\}\cup (4\nu ,1],j=1,\ldots ,d\right\}.\end{eqnarray*}$

In particular, the space of all density matrices correspond to $\nu =0$ and is denoted ${\mathcal{S}}.$ The estimator ${\widehat{\rho }}_{n}^{(\mathrm{phys})}$ is therefore the physical state which is closest to the (normalized) LSE, and whose non-zero eigenvalues are above the threshold $4\nu .$

Before analysing the performance of the estimator, we describe its numerical implementation through the following simple iterative algorithm.

Let ${\tilde{\lambda }}_{1}\geqslant ...\geqslant {\tilde{\lambda }}_{d}$ denote the eigenvalues of ${\tilde{\rho }}_{n}^{(\mathrm{ls})}.$ Let ${\ell }=0,$ and define ${\widehat{\lambda }}_{j}^{(0)}={\tilde{\lambda }}_{j}$ for $j=1,...,d.$

For ${\ell }=1,\ldots ,d,$ do

if ${\widehat{\lambda }}_{d-{\ell }+1}^{({\ell }-1)}\gt 4\nu ,$ STOP;

else, put ${\widehat{\lambda }}_{d-{\ell }+1}^{({\ell })}=0$ and

$\begin{eqnarray*}&&{\widehat{\lambda }}_{j}^{({\ell })}={\widehat{\lambda }}_{j}^{({\ell }-1)}+\displaystyle \frac{1}{d-{\ell }}\left(1-\displaystyle \sum _{k=1}^{d-{\ell }}{\widehat{\lambda }}_{k}^{({\ell }-1)}\right),{\rm{}}\;{\rm{for}}\;{\rm{}}j=1,\ldots ,d-{\ell };\end{eqnarray*}$

${\ell }={\ell }+1.$

The algorithm checks whether the smallest eigenvalue is larger than the noise level $4\nu$ and if it is not, then sets its value to 0 and distributes the mass of the erased eigenvalue in such a way that they sum to 1. This algorithm is similar to that proposed by Smolin et al [4], with the important difference that we do not keep all positive eigenvalues but only significantly positive eigenvalues. Here, significant means larger than the noise threshold of the order of the operator-norm error of the LSE. The reason for keeping only eigenvalues above the threshold is similar to the case of the penalized estimator. By proposition 1, eigenvalues below the threshold are of the order of the estimation error $\nu (\epsilon ).$ Therefore, if the state is low rank, such eigenvalues are likely to correspond to zero eigenvalues of the true state and keeping them unchanged would induce significant statistical noise.

If the total number of iterations is $\widehat{{\ell }}=d-\widehat{r}$ then the estimator ${\widehat{\rho }}_{n}^{(\mathrm{phys})}$ has rank $\widehat{r}.$ Its eigenvalues are equal to 0 for $j\gt \widehat{r},$ while, for $j\leqslant \widehat{r}$ they are given by

$\begin{eqnarray*}&&{\widehat{\lambda }}_{j}^{(\mathrm{phys})}={\tilde{\lambda }}_{j}+\displaystyle \frac{L}{2},\qquad {\rm{}}\;{\rm{where}}\;{\rm{}}\qquad \displaystyle \frac{L}{2}\widehat{r}=\displaystyle \sum _{k\gt \widehat{r}}{\tilde{\lambda }}_{k}=1-\displaystyle \sum _{k\leqslant \widehat{r}}{\tilde{\lambda }}_{k}.\end{eqnarray*}$

This implies that ${\widehat{\rho }}_{n}^{(\mathrm{phys})}$ has decreasing eigenvalues and ${\widehat{\lambda }}_{\widehat{r}}^{(\mathrm{phys})}\geqslant 4\nu .$ The following theorem shows that ${\widehat{\rho }}_{n}^{(\mathrm{phys})}$ is rank-consistent and its MSE has the same scaling as that of penalized estimator ${\widehat{\rho }}_{n}^{(\mathrm{pen})}.$

Theorem 2. Assume that the state ρ has rank r, i.e. belongs to ${{\mathcal{S}}}_{d,r}.$ For small $\varepsilon \gt 0,$ let $\nu =\nu (\varepsilon )$ be defined as in theorem 1, and assume that ${\lambda }_{r}\gt 8\nu (\varepsilon ).$ Then, with ${{\mathbb{P}}}_{\rho }$ probability larger than $1-\varepsilon$ we have $\widehat{r}=r$ and

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\leqslant 48\cdot r\cdot \nu {(\varepsilon )}^{2}.\end{eqnarray*}$

Moreover, there exists an absolute constant $C\gt 0$ such that

$\begin{eqnarray*}&&\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{E}_{\rho }\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\leqslant C\displaystyle \frac{{rd}}{N}\mathrm{log}\left(\displaystyle \frac{2d}{\varepsilon }\right).\end{eqnarray*}$

Proof. See appendix A.5.

5. Lower bounds for rank-constrained estimation

The goal of this section is to investigate how the convergence rates of our estimators compare with that of an 'optimal estimator' for the statistical model consisting of all states of rank up to r. For this we will derive a lower bound for the maximum risk of any estimator.

In this section ${\widehat{\rho }}_{n}$ will be an arbitrary estimator and the true state ρ is assumed to belong to the set ${{\mathcal{S}}}_{d,r}$ of rank-r states. To quantify the overall performance of ${\widehat{\rho }}_{n},$ we define the maximum risk

$\begin{eqnarray*}&&{R}_{\mathrm{max}}({\widehat{\rho }}_{n};r)=\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}.\end{eqnarray*}$

In view of the previous upper bounds, we expect its asymptotic behaviour (in terms of the total number of measurements, for a large number of repetitions n) to be

$\begin{eqnarray*}&&{R}_{\mathrm{max}}({\widehat{\rho }}_{n};r)=\displaystyle \frac{{rd}}{N}\cdot O(1),\qquad N=n\cdot {3}^{k}.\end{eqnarray*}$

Taking this into account we define the (appropriately rescaled) minimax risk as

$\begin{eqnarray}&&{R}_{\mathrm{minmax}}(r,k;n):= \underset{{\widehat{\rho }}_{n}}{\mathrm{inf}}\;{{NR}}_{\mathrm{max}}({\widehat{\rho }}_{n};r),\end{eqnarray} \tag{ 5.1 }$

which describes the behaviour of the best estimator at the hardest to estimate state. The next theorem provides an asymptotic lower bound for the minimax risk. It shows that the maximum MSE of any estimator is al least of the order of $r(d-r)/N,$ which for low rank states scales as $\#{\rm{parameters}}/\#{\rm{samples}},$ which up to logarithmic factors is the same as the upper bounds derived in theorems 1, corollary 1, and theorem 2.

Theorem 3. The following lower bound holds for the asymptotic minimax risk holds

$\begin{eqnarray*}&&\underset{n\to \infty }{\mathrm{lim}\;\mathrm{inf}}\;{R}_{\mathrm{minmax}}(r,k;n)\geqslant 2r(d-r).\end{eqnarray*}$

Proof. The minimax risk captures the worst asymptotic behaviour of the rescaled risk, over all states of rank r. In order to bound the risk from below, we construct a (lower dimensional) subfamily of states ${{\mathcal{R}}}_{d,r}\subset {{\mathcal{S}}}_{d,r}$ such that the maximum risk for this subfamily provides the lower bound. Let

$\begin{eqnarray}&&{\rho }_{0}:= {\rm{Diag}}\left(\displaystyle \frac{1}{r},...,\displaystyle \frac{1}{r},0,...,0\right)\end{eqnarray} \tag{ 5.2 }$

be a diagonal state with respect to the standard basis ${\bf{B}}$ (consisting of tensor products of eigenvectors of the Pauli ${\sigma }_{z}$ operator), and define ${{\mathcal{R}}}_{d,r}$ to be the set of matrices obtained by rotating ${\rho }_{0}$ with an arbitrary unitary U, i.e. ${{\mathcal{R}}}_{d,r}:= \{\rho := U{\rho }_{0}{U}^{*}\;| \;U{\rm{}}\;{\rm{unitary}}\}.$ This is a smooth, compact manifold of dimension $2r(d-r)$ known as a (complex) Grassmannian [36]. For each point $\rho =U{\rho }_{0}{U}^{*}$ we consider the ONB ${{\bf{B}}}_{U}$ of eigenvectors of ρ, which is obtained by rotating ${\bf{B}}$ by the unitary U. With respect to this basis, we consider first the parametrization of an arbitrary density matrix ${\rho }^{\prime }$ by its matrix elements, more precisely by the diagonal, real and imaginary parts of the off-diagonal matrix elements, such that ${\rho }^{\prime }\equiv {\rho }_{\theta }$ with

$\begin{eqnarray}\theta & = & ({\theta }^{(d)},{\theta }^{(r)},{\theta }^{(i)})\\ & := & ({\rho }_{11},...,{\rho }_{{dd}};{\rm{Re}}{\rho }_{\mathrm{1,2}},...,{\rm{Re}}{\rho }_{d-1,d};{\rm{Im}}{\rho }_{\mathrm{1,2}},...,{\rm{Im}}{\rho }_{d-1,d}).\end{eqnarray} \tag{ 5.3 }$

The norm-two distance is given by

$\begin{eqnarray*}&&\parallel {\rho }_{{\theta }_{1}}-{\rho }_{{\theta }_{2}}{\parallel }_{2}^{2}=\parallel {\theta }_{1}^{d}-{\theta }_{2}^{d}{\parallel }^{2}+2\parallel {\theta }_{1}^{r}-{\theta }_{2}^{r}{\parallel }^{2}+2\parallel {\theta }_{1}^{i}-{\theta }_{2}^{i}{\parallel }^{2}={({\theta }_{1}-{\theta }_{2})}^{T}G({\theta }_{1}-{\theta }_{2}),\end{eqnarray*}$

where G is the constant diagonal weight matrix $G={\rm{Diag}}({1}_{d},2\cdot {1}_{d(d-1)/2},2\cdot {1}_{d(d-1)/2}).$ However, this parametrization does not take into account the prior information about the rank of the true state, and moreover, our key argument involves the even smaller family ${{\mathcal{R}}}_{d,r}$ of states. We will now focus on providing a local parametrization of ${{\mathcal{R}}}_{d,r}$ around $\rho =U{\rho }_{0}{U}^{*}.$ With respect to the basis ${{\bf{B}}}_{U},$ a state ${\rho }^{\prime }\in {{\mathcal{R}}}_{d,r}$ in the neighbourhood of ρ has the form

$\begin{eqnarray}{\rho }^{\prime }=\rho +{{\rm{\Delta }}}_{{off}}+\delta =\left(\begin{array}{ccc}\displaystyle \frac{1}{r}{I}_{r} & & 0\\ & & \\ 0 & & 0\end{array}\right)+\left(\begin{array}{ccc}0 & & {\rm{\Delta }}\\ & & \\ {{\rm{\Delta }}}^{\dagger } & & 0\end{array}\right)+\left(\begin{array}{ccc}O(\parallel {\rm{\Delta }}{\parallel }^{2}) & & 0\\ & & \\ 0 & & O(\parallel {\rm{\Delta }}{\parallel }^{2})\end{array}\right),\end{eqnarray} \tag{ 5.4 }$

where Δ is a matrix of free (complex) parameters, and the two $O(\parallel {\rm{\Delta }}{\parallel }^{2})$ blocks are r × r and respectively $(d-r)\times (d-r)$ matrices whose elements scale quadratically in Δ near ${\rm{\Delta }}=0.$ The intuition behind this decomposition is that a small rotation of ρ produces off-diagonal blocks of the size of the 'rotation angles' while the change in the diagonal blocks are only quadratic in those angles. Since we are interested in the asymptotic behaviour of estimators, the local approach is justified, and the leading contribution to the norm-two distance comes from the off-diagonal blocks. More precisely, if ${\rho }_{1},{\rho }_{2}\in {{\mathcal{R}}}_{d,r}$ are in the neighbourhood of ρ then

$\begin{eqnarray*}&&\parallel {\rho }_{1}-{\rho }_{2}{\parallel }^{2}=2\parallel {\tilde{\theta }}_{1}^{r}-{\tilde{\theta }}_{2}^{r}{\parallel }^{2}+2\parallel {\tilde{\theta }}_{1}^{i}-{\tilde{\theta }}_{2}^{i}{\parallel }^{2}+O(\parallel {\tilde{\theta }}_{1}{\parallel }^{4},\parallel {\tilde{\theta }}_{2}{\parallel }^{4}),\end{eqnarray*}$

where ${\tilde{\theta }}^{r},{\tilde{\theta }}^{i}$ are the real and imaginary parts of the off-diagonal elements contained in the block Δ, i.e. for $i\leqslant r\lt j.$ Locally, the manifold ${{\mathcal{R}}}_{d,r}$ can be parametrized by $\tilde{\theta }:= ({\tilde{\theta }}^{r},{\tilde{\theta }}^{i}).$

Since ${{\mathcal{R}}}_{d,r}\subset {{\mathcal{S}}}_{d,r}$ the maximum risk for the model consisting of rank-r states is bounded from below by that of the (smaller) rotation model ${{\mathcal{R}}}_{d,r}$

$\begin{eqnarray}\underset{{\widehat{\rho }}_{n}}{\mathrm{inf}}\;\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2} & \geqslant & \underset{{\widehat{\rho }}_{n}}{\mathrm{inf}}\;\underset{\rho \in {{\mathcal{R}}}_{d,r}}{\mathrm{sup}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}.\end{eqnarray} \tag{ 5.5 }$

Let π be the 'uniform' distribution over ${{\mathcal{R}}}_{d,r}.$ To draw a sample from this distribution, one can choose a random unitary U from the Haar measure over unitaries, and defines $\rho := U{\rho }_{0}{U}^{*}.$ Then the maximum risk is bounded from below by the Bayes risk

$\begin{eqnarray}&&\underset{\rho \in {{\mathcal{R}}}_{d,r}}{\mathrm{sup}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}\geqslant {\displaystyle \int }_{{{\mathcal{R}}}_{d,r}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}\;\pi ({\rm{d}}\rho )\end{eqnarray} \tag{ 5.6 }$

By applying the van Trees inequality in [37] (see also [38]) we get that

$\begin{eqnarray}&&{\displaystyle \int }_{{{\mathcal{R}}}_{d,r}}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}\;\pi ({\rm{d}}\rho )\geqslant \displaystyle \frac{1}{n}{\displaystyle \int }_{{{\mathcal{R}}}_{d,r}}{\rm{Tr}}(\tilde{G}(\rho ){\tilde{I}}^{-1}(\rho ))\pi ({\rm{d}}\rho )-\displaystyle \frac{\alpha }{{n}^{2}},\end{eqnarray} \tag{ 5.7 }$

where $\alpha \gt 0$ is a constant which does not depend on n. Here, $\tilde{I}(\rho )$ is the (classical) Fisher information matrix of the the data obtained by performing one measurement for each setting, and $\tilde{G}(\rho )$ is the weight matrix corresponding to the quadratic approximation of the norm-two distance squared, around ρ. Both matrices are of dimensions $2r(d-r)={\rm{dim}}({{\mathcal{R}}}_{d,r}),$ and depend on the chosen parametrization, but the trace is independent of it. Inserting (5.6) and (5.7) into (5.5), we get

$\begin{eqnarray*}&&\underset{{\widehat{\rho }}_{n}}{\mathrm{inf}}\;\underset{\rho \in {{\mathcal{S}}}_{d,r}}{\mathrm{sup}}N{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}\geqslant \displaystyle \frac{N}{n}{\displaystyle \int }_{{{\mathcal{R}}}_{d,r}}{\rm{Tr}}(\tilde{G}{(\rho )}^{1/2}\tilde{I}{(\rho )}^{-1}\tilde{G}{(\rho )}^{1/2})\pi ({\rm{d}}\rho )-\displaystyle \frac{\alpha N}{{n}^{2}}.\end{eqnarray*}$

Since t ↦ t⁻¹ is an operator convex function we have

$\begin{eqnarray*}&&\int \;\tilde{G}{(\rho )}^{1/2}\tilde{I}{(\rho )}^{-1}\tilde{G}{(\rho )}^{1/2}\pi ({\rm{d}}\rho )\geqslant {\left(\int \tilde{G}{(\rho )}^{-1/2}\tilde{I}(\rho )\tilde{G}{(\rho )}^{-1/2}\pi ({\rm{d}}\rho )\right)}^{-1}\end{eqnarray*}$

and by taking the limit $n\to \infty$ we obtain the asymptotic minimax lower bound

$\begin{eqnarray}&&\underset{n\to \infty }{\mathrm{liminf}}\;{R}_{\mathrm{minmax}}(r,k)\geqslant {3}^{k}{\rm{Tr}}\left({\left(\int \tilde{G}{(\rho )}^{-1/2}\tilde{I}(\rho )\tilde{G}{(\rho )}^{-1/2}\pi ({\rm{d}}\rho )\right)}^{-1}\right),\end{eqnarray} \tag{ 5.8 }$

where ${R}_{\mathrm{minmax}}(r,k;n)$ is the minimax risk defined in equation (5.1).

At this point we choose a convenient local parametrization around an arbitrary state $\rho \in {{\mathcal{R}}}_{d,r}.$ As discussed in the beginning of the proof we showed that for this we can use the real and imaginary parts $\tilde{\theta }=({\tilde{\theta }}^{r},{\tilde{\theta }}^{i})$ of the off-diagonal block Δ, and that the corresponding weight matrix is $\tilde{G}(\rho )={21}_{2r(d-r)}.$ The lower bound (5.8) becomes

$\begin{eqnarray*}&&{R}_{\mathrm{minmax}}(r,k)\geqslant {3}^{k}\cdot 2\cdot {\rm{Tr}}\left({\left({\displaystyle \int }_{{{\mathcal{R}}}_{d,r}}\tilde{I}(\rho )\pi ({\rm{d}}\rho )\right)}^{-1}\right).\end{eqnarray*}$

Another consequence of (5.4) is that the Fisher information matrix $\tilde{I}(\rho )$ is equal to the corresponding block of the Fisher information matrix I of the full (d²-dimensional) unconstrained model with parametrization θ defined in (5.3). We will now compute the average over states of the Fisher information with respect to the Pauli bases measurements, by showing that it is equal to the average Fisher information at ${\rho }_{0},$ for the random basis measurement. As the different settings are measured independently, the Fisher information $\tilde{I}(\rho )$ is (and similarly for I)

$\begin{eqnarray*}&&\tilde{I}(\rho )=\displaystyle \sum _{{\bf{s}}\in {{\mathcal{S}}}_{k}}\tilde{I}(\rho | {\bf{s}}),\end{eqnarray*}$

where $\tilde{I}(\rho | {\bf{s}})$ is the Fisher information corresponding to the von Neumann measurement with respect to the ONB defined by setting ${\bf{s}}.$ More generally, with ${{\bf{B}}}_{U}$ as defined above, we denote by $\tilde{I}(\rho | {{\bf{B}}}_{U})$ the Fisher information corresponding to this basis. Due to the rotation symmetry, we have

$\begin{eqnarray*}&&\tilde{I}(U\rho {U}^{*}\tilde{| }{{\bf{B}}}_{U})=\tilde{I}(\rho | {\bf{B}})\end{eqnarray*}$

so

$\begin{eqnarray*}\displaystyle \int \pi ({\rm{d}}\rho )\tilde{I}(\rho ) & = & \displaystyle \sum _{{\bf{s}}}\displaystyle \int \pi ({\rm{d}}\rho )\tilde{I}(\rho | {\bf{s}})=\displaystyle \sum _{{\bf{s}}}\displaystyle \int {\mu }^{d}({\rm{d}}U)\tilde{I}(U{\rho }_{0}{U}^{*}| {\bf{s}})\\ & = & {3}^{k}\displaystyle \int \mu ({\rm{d}}U)\tilde{I}({\rho }_{0}| {{\bf{B}}}_{U})={3}^{k}\bar{\tilde{I}},\end{eqnarray*}$

where ${\mu }^{d}({\rm{d}}U)$ is the unique Haar measure on the unitary group on ${{\mathbb{C}}}^{d}.$

The average Fisher information matrix $\bar{I}$ (of the full, unconstrained model) is computed in section A.6 where we show that the block corresponding to $\tilde{\theta }$ parameters has average $\bar{\tilde{I}}=2{{\bf{1}}}_{2r(d-r)}$ such that the lower bound is

$\begin{eqnarray*}&&{R}_{\mathrm{minmax}}(r,k)\geqslant 2r(d-r).\end{eqnarray*}$

6. Numerical results

In this section we present the results of a simulation study which analyses the performance of the proposed estimation methods. The penalized and physical estimators discussed in the previous sections use a theoretical penalization and respective threshold rates proportional to ${\nu }^{2}.$ However in practice we found that the performance of the estimators can be further improved when the rates are adjusted by multiplying with an appropriate constant c—whose choice is informed by the data—from a grid over a small interval which was chosen to be $[0,3].$ The last two estimators are such versions of the theoretical ones with constant c chosen by using cross-validation methods which are explained in detail in section 6.1. We will compare the following 5 estimators described below.

(1)
The LSE ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ defined in (3.1).
(2)
The oracle 'estimator' ${\widehat{\rho }}_{n}^{(\mathrm{oracle})}$ defined below. This is strictly speaking not an estimator since it requires the knowledge of the state ρ itself, and can be computed only in simulation studies. However, the oracle is a useful benchmark for evaluating the performance of the other estimators.
(3)
The cross-validated projection estimator ${\widehat{\rho }}_{n}^{(\mathrm{cv})}.$ Here we try to find the optimal truncation rank of the LSE, by using the cross-validation method.
(4)
The cross-validated penalized estimator ${\widehat{\rho }}_{n}^{(\mathrm{pen}-\mathrm{cv})}.$ This is a modification of the penalized estimator ${\widehat{\rho }}_{n}^{(\mathrm{pen})}$ defined in (4.2), where the value of the penalization constant is adjusted by cross-validation.
(5)
The cross-validated physical threshold estimator ${\rho }_{n}^{(\mathrm{phys}-\mathrm{cv})}.$ This is a modification of the physical estimator ${\widehat{\rho }}_{n}^{(\mathrm{phys})}$ defined in (4.5), where the value of the threshold constant is adjusted by cross-validation.

We explore the estimators' behaviour by simulating datasets from states with different ranks, and with different number of measurement repetitions per setting. The methodology is described in detail below.

6.1. Generation of random states and simulation of datasets

In order to generate a density matrix of rank r, we first create a rank r upper triangular matrix T in which

(i)
the off-diagonal elements of the first r rows are random complex numbers,
(ii)
the diagonal elements ${T}_{22},...,{T}_{{rr}}$ are real, positive random numbers,
(iii)
all elements of the rows $r+1,...,d$ are zero.

The matrix T is completed by setting T₁₁ such that ${T}_{11}\geqslant 0,$ and $\parallel T{\parallel }_{2}^{2}=1.$ If these conditions cannot be satisfied we repeat the procedure by generating a new set of matrix elements for T. When successful, we set $\rho := {T}^{*}T$ which by construction is a density matrix of rank r. We note that it is not our purpose to generate matrices from a particular 'uniform' ensemble, but merely to have a state with reasonably random eigenvectors, and whose r eigenvalues are not significantly smaller than $1/r.$ Following this procedure we have generated 4 states of 4 ions ( $d={2}^{4}$ ) with ranks $1,2,6,10.$ The rank 6 state for instance, has non-zero eigenvalues $(0.47,0.19,0.12,0.11,0.07,0.04).$

For each state, we have then simulated a number of 100 independent datasets with a given number of repetitions chosen from the range $20,100,500,$ and 2500. In this way we can study the dependence of the MSE of each estimator on state (or rank) and number of repetitions.

6.2. Computation of estimators

We conducted the following simulation study for all the possible combinations between the states and the total number of cycles (i.e. $4\times 4=16$ different scenarios). Below, we denote by r the rank of the 'true' state ρ, from which the data has been generated. The procedure has the following steps:

(1)
For a given number of repetitions n, we simulate 5 independent datasets ${D}_{1},...,{D}_{5},$ each with $n/5$ repetitions. By simply adding the number of counts for each setting and outcome, we obtain a dataset D of n repetitions. However, as we will see below, having 5 separate 'smaller' datasets is important for the purpose of applying cross-validation. Note that such a procedure can be easily implemented in an experimental setting.
(2)
We compute the LSE ${\widehat{\rho }}_{n}^{(\mathrm{ls})}$ based on the full dataset D with total number of cycles n.
(3)
We compute the oracle 'estimator' as follows:
- (a)
  We compute the spectral decomposition (4.1) of ${\widehat{\rho }}_{n}^{(\mathrm{ls})},$ with the eigenvalues ${\widehat{\lambda }}_{i}$ arranged in decreasing order of their absolute values. For each rank $1\leqslant \kappa \leqslant d$ we define the truncated (least squares) matrix
  $\begin{eqnarray*}&&{\widehat{\rho }}_{n}(\kappa )=\displaystyle \sum _{i=1}^{\kappa }{\widehat{\lambda }}_{i}| {\hat{\psi }}_{i}\rangle \langle {\hat{\psi }}_{i}| .\end{eqnarray*}$
- (b)
  We then evaluate the norm-two distance squared $e(\kappa ):= \parallel \rho -{\widehat{\rho }}_{n}(\kappa ){\parallel }_{2}^{2}$ and define the oracle estimator as the truncated estimator with minimal norm two error
  $\begin{eqnarray*}&&{\widehat{\rho }}_{n}^{(\mathrm{oracle})}={\widehat{\rho }}_{n}({\kappa }_{0}),\qquad {\kappa }_{0}=\mathrm{arg}\;\underset{\kappa }{\mathrm{min}}e(\kappa ).\end{eqnarray*}$
  Note that the oracle estimator relies on the knowledge of the true state ρ which is not available in a real data set-up. It is nevertheless useful as a benchmark for judging the performance of other estimators in simulation studies. At the next point we define the cross-validation estimator which tries to find the 'optimal' rank ${\kappa }_{0}$ by replacing the unknown state ρ with the LSE computed on a separate batch of data.
(4)
We compute the cross validation estimator as follows.
- (a)
  For each $j\in \{1,...,5\}$ we compute the following estimators. While holding the batch D_j out, we compute the LSE ${\widehat{\rho }}_{n;-j}^{(\mathrm{ls})}$ for the dataset consisting of joining the remaining four batches together. Similarly to the point above, we define the rank κ truncation of this estimator by ${\widehat{\rho }}_{n;-j}(\kappa ).$ We also compute the LSE for the remaining batch j, denoted by ${\widehat{\rho }}_{n;j}^{(\mathrm{ls})}.$
- (b)
  For each rank κ we evaluate the 'empirical discrepancy'
  $\begin{eqnarray*}&&{\rm{CV}}(\kappa )=\displaystyle \frac{1}{5}\displaystyle \sum _{i=1}^{5}{\| {\widehat{\rho }}_{n;-j}(\kappa )-{\widehat{\rho }}_{n;j}^{(\mathrm{ls})}\| }_{2}^{2}.\end{eqnarray*}$
  Since ${\widehat{\rho }}_{n;-j}(\kappa )$ and ${\widehat{\rho }}_{n;j}^{(\mathrm{ls})}$ are independent, and the LSE is unbiased ${\mathbb{E}}({\widehat{\rho }}_{n;j}^{(\mathrm{ls})})=\rho ,$ the expected value of ${\rm{CV}}(k)$ is
  $\begin{eqnarray*}{\mathbb{E}}\left[{\rm{CV}}(\kappa )\right] & = & {{\mathbb{E}}}_{-1}\left[{\rm{Tr}}\left({\widehat{\rho }}_{n;-1}{(\kappa )}^{2}\right)\right]-2{\rm{Tr}}\left({{\mathbb{E}}}_{-1}\left[{\widehat{\rho }}_{n;-1}(\kappa )\right]\cdot {{\mathbb{E}}}_{1}\left[{\widehat{\rho }}_{n;1}^{(\mathrm{ls})}\right]\right)+{{\mathbb{E}}}_{1}\left[{\rm{Tr}}\left({\left({\widehat{\rho }}_{n;1}^{(\mathrm{ls})}\right)}^{2}\right)\right]\\ & = & {{\mathbb{E}}}_{-1}\left[{\rm{Tr}}\left({\widehat{\rho }}_{n;-1}{(\kappa )}^{2}\right)\right]-2{\rm{Tr}}\left({{\mathbb{E}}}_{-1}\left[{\widehat{\rho }}_{n;-1}(\kappa )\right]\rho \right)+{\rm{Tr}}({\rho }^{2})+C\\ & = & {{\mathbb{E}}}_{-1}\left[\parallel {\widehat{\rho }}_{n;-1}(\kappa )-\rho {\parallel }^{2}\right]+C,\end{eqnarray*}$
  where we denoted ${{\mathbb{E}}}_{-1}$ and ${{\mathbb{E}}}_{1}$ the expectation over all batches except the first, and respectively over the first batch. Therefore, the average of ${\rm{CV}}(\kappa )$ is equal to the MSE of the truncated estimator ${\widehat{\rho }}_{n;-1}(\kappa ),$ up to a constant C which is independent of κ.
- (c)
  Based on the above observation we use the the cross-validation method as a proxy for the oracle estimator. Concretely, we minimize CV $(\kappa )$ with respect to κ
  $\begin{eqnarray*}&&{\hat{\kappa }}_{\mathrm{cv}}=\underset{k}{\mathrm{arg}\;\mathrm{min}}{\rm{}}\;{\rm{CV}}(\kappa ),\end{eqnarray*}$
  and define the cross-validation estimator as the truncation to rank ${\hat{\kappa }}_{\mathrm{cv}}$ of the full data LSE ${\widehat{\rho }}_{n}^{(\mathrm{cv})}:= {\widehat{\rho }}_{n}({\hat{\kappa }}_{\mathrm{cv}}).$
(5)
We compute the cross-validated rank-penalized estimator as follows.
- (a)
  Let c be a penalization constant chosen from a suitable set of discrete values in the interval $[0,3].$ Similarly to the cross-validation procedure, we hold out batch j, and we compute the rank-penalized estimator (4.2), with penalty constant $c{\nu }^{2}$ for $j=1,\ldots ,5.$ We denote these estimators by ${\widehat{\rho }}_{n;-j}^{(\mathrm{pen})}(c).$ We will also need the LSE ${\rho }_{n;j}^{(\mathrm{ls})}$ for batch j computed above.
- (b)
  For each value of c we evaluate the empirical discrepancy
  $\begin{eqnarray*}&&{\rm{CV}}(c)=\displaystyle \frac{1}{5}\displaystyle \sum _{i=1}^{5}{\| {\widehat{\rho }}_{n;-j}^{(\mathrm{pen})}(c)-{\rho }_{n;j}^{(\mathrm{ls})}\| }_{2}^{2}\end{eqnarray*}$
  and minimize ${\rm{CV}}(c)$ with respect to the constant c
  $\begin{eqnarray*}&&\widehat{c}=\underset{c}{\mathrm{arg}\;\mathrm{min}}{\rm{}}\;{\rm{CV}}(c).\end{eqnarray*}$
  Finally we compute the cross-validated rank penalized estimator ${\widehat{\rho }}_{n}^{(\mathrm{rk}-\mathrm{cv})}$ which is defined as in (4.2), with constant $\widehat{c}{\nu }^{2},$ on the whole dataset D.
(6)
We compute the physical estimator as follows.
- (a)
  As above we choose a constant c from a grid over the interval $[0,3].$ We hold out batch j, and we compute the physical threshold estimator (4.5), using the algorithm below this equation, with threshold $c\cdot 4\nu .$ We denote the resulting estimators by ${\widehat{\rho }}_{n;-j}^{(\mathrm{phys})}(c),$ for $j=1,\ldots ,5.$
- (b)
  For each value of c we evaluate the empirical discrepancy
  $\begin{eqnarray*}&&{\rm{CV}}(c)=\displaystyle \frac{1}{5}\displaystyle \sum _{i=1}^{5}{\| {\widehat{\rho }}_{n;-j}^{(\mathrm{phys})}(c)-{\widehat{\rho }}_{n;j}^{(\mathrm{ls})}\| }_{2}^{2}.\end{eqnarray*}$
  We then minimize ${\rm{CV}}(c)$ with respect to the constant c.
  $\begin{eqnarray*}&&\hat{c}=\underset{c}{\mathrm{arg}\;\mathrm{min}}{\rm{}}\;{\rm{CV}}(c)\end{eqnarray*}$
  Finally we compute the cross-validated physical estimator ${\widehat{\rho }}_{n}^{(\mathrm{phys}-\mathrm{cv})}$ which is defined as in (4.5), with constant $\hat{c}\cdot 4\nu .$

6.3. Simulation results

We collect here the results of the simulation study described in the previous section. As a figure of merit we focus on the MSE ${\mathbb{E}}\left((\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}\right)$ of each estimator, which is estimated by averaging the square errors over the 100 independent repetitions of the procedure. We are also interested in how the different methods perform relative to each other, and whether the selected rank is consistent, i.e. it concentrates on the rank of the true state for large number of repetitions.

The four panels in figure 3 represent the boxplots of the square errors $\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}$ for the different estimators, and different states, when the number of repetitions is n = 20. Similarly, figure 4 shows the same boxplots at n = 100. As expected, in both cases the least squares performs significantly worse than the other estimators, and the discrepancy is larger for small rank states. The remaining 4 estimators have similar MSE's with the physical one performing slightly better than the rest, followed by the oracle. Note also that the estimators' variances (indicated by the size of the boxes) are larger for the least squares than the other estimators. A similar behaviour has been observed for $n=500,2500$ repetitions.

**Figure 3.** Boxplots for the estimated mean square error ( ${\mathbb{E}}\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}$ ) for different ranks, k = 4 (d = 16), with n = 20 repetitions.
Download figure:
Standard image High-resolution image

**Figure 4.** Boxplots for the estimated mean square error ( ${\mathbb{E}}\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}$ ) for different ranks, k = 4 (d = 16), with n = 100 repetitions.
Download figure:
Standard image High-resolution image

Figure 5 illustrates the dependence of the MSE of a given estimator, as a function of n, for the four different states which have been analysed. Since the MSE decreases as ${n}^{-1}$ we have chosen to plot the 'renormalized' MSE given by $n\cdot {\mathbb{E}}\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2},$ which converges to a constant value for large n. As expected the limiting value increases with the rank of the state, as a proxy for the number of parameters to be estimated.

**Figure 5.** Renormalized MSEs $n\cdot {\mathbb{E}}\parallel {\widehat{\rho }}_{n}-\rho {\parallel }_{2}^{2}$ as a function of the number of repetitions for states with different ranks: 1(black), 2 (red), 6 (green), 10 (blue).
Download figure:
Standard image High-resolution image

The histograms in figure 6 show the probability distributions for the chosen rank of each given estimator, as a function of the number of measurement repetitions n, for the state of rank 6. We note that in all cases the proportion of times that the chosen rank is equal to the true rank of the state increases as with the number of repetitions. However, this convergence towards a 'rank-consistent' estimator is rather slow, as the proportion surpasses 80% only when n = 2500. Another observation is that the penalty and threshold estimators appear to have different behaviours: the former tends to underestimate the true rank, while the latter tends to overestimate it. As expected, the oracle estimator is more likely to choose the correct rank for large number of repetitions. Perhaps slightly more surprising, for small number of repetitions (n = 20), the oracle choose a pure state in most cases.

7. Conclusions and outlook

Quantum state tomography, and in particular MIT is an important enabling component of quantum engineering experiments. Since full quantum tomography becomes unfeasible for large dimensional systems, it is useful to identify lower dimensional models with good approximation properties for physically relevant states, and to develop estimation methods tailored for such models. In particular, quantum states created in the lab are often very well approximated by low rank density matrices, which are characterized by a number of parameter which is linear rather than quadratic in the space dimension.

In this work we analysed several estimation algorithms targeted at estimating low rank states in MIT. The procedure consists in computing the LSE, which is then diagonalized, truncated to an appropriate smaller rank by setting eigenvalues below a 'noise threshold' to zero, and normalized. Among the several truncation methods proposed, the best performing one is the 'physical estimator'; this chooses the density matrix whose non-zero eigenvalues are above a certain threshold and is the closest to the LSE. We proved concentration bounds and upper bounds for the MSE of the penalized and physical estimators, as well as a lower bound for the asymptotic minimax rate for MIT. The results show that the proposed methods have an optimal dependence on rank and dimension, up to a logarithmic factor in dimension. In addition, the algorithms are easy to implement numerically and their computational complexity is determined by that of the LSE.

An interesting future direction is to extend the spectral thresholding methodology to a measurement setup where a smaller number of settings is measured, which is however sufficient to identify the unknown low rank state. Another direction involves the construction of confidence intervals / regions for such estimators, beyond the concentration bounds established here. Since the main ingredients are the LSE and the spectral thresholding algorithms, the proposed estimation methods can be applied to any quantum tomography scenario with informationally complete measurements. It would be interesting to see how these methods perform in other settings.

Acknowledgments

MG's work was supported by the EPSRC Grant No. EP/J009776/1.

Appendix A.:

A.1. Proof of lemma 1

Proof. Since the matrix elements of ${\bf{A}}$ are ${{\bf{A}}}_{{\bf{b}},({\bf{o}},{\bf{s}})}={A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}),$ the product ${{\bf{A}}}^{*}\cdot {\bf{A}}$ has elements

$\begin{eqnarray*}&&{[{{\bf{A}}}^{*}\cdot {\bf{A}}]}_{{\bf{b}},{{\bf{b}}}^{\prime }}=\displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{{\bf{o}}}{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}){A}_{{{\bf{b}}}^{\prime }}({\bf{o}}| {\bf{s}}).\end{eqnarray*}$

If ${\bf{b}}={{\bf{b}}}^{\prime }$ it is easy to see that

$\begin{eqnarray*}&&{[{{\bf{A}}}^{*}\cdot {\bf{A}}]}_{{\bf{b}},{\bf{b}}}=\displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}{\delta }_{{s}_{j},{b}_{j}}={2}^{k}{3}^{d({\bf{b}})}.\end{eqnarray*}$

If ${\bf{b}}\ne {{\bf{b}}}^{\prime },$ we have either ${E}_{{\bf{b}}}={E}_{{{\bf{b}}}^{\prime }}$ or ${E}_{{\bf{b}}}\;\not\hspace{-2pt}{=}\;{E}_{{{\bf{b}}}^{\prime }}.$ On the one hand, in case the sets ${E}_{{\bf{b}}}$ and ${E}_{{{\bf{b}}}^{\prime }}$ are equal, we have

$\begin{eqnarray*}&&{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}){A}_{{{\bf{b}}}^{\prime }}({\bf{o}}| {\bf{s}})=\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}{\delta }_{{s}_{j},{b}_{j}}\cdot {\delta }_{{s}_{j},{b}_{j}^{\prime }}.\end{eqnarray*}$

For each ${\bf{s}},$ the previous product is 0. Indeed, if different from 0 then ${b}_{j}={b}_{j}^{\prime }$ for all j not in ${E}_{{\bf{b}}}.$ As ${b}_{j}={b}_{{j}^{\prime }}=I$ for j in ${E}_{{\bf{b}}},$ it implies that ${\bf{b}}={{\bf{b}}}^{\prime }$ which contradicts the assumption here.

On the other hand, if the sets ${E}_{{\bf{b}}}$ and ${E}_{{{\bf{b}}}^{\prime }}$ are different, there exists at least one coordinate j₀ in the symmetric difference ${E}_{{\bf{b}}}{\rm{\Delta }}{E}_{{b}^{\prime }}$ and the sum over outcomes ${\bf{o}}$ will split over values of ${\bf{o}}$ where ${o}_{{j}_{0}}=1$ and values where ${o}_{{j}_{0}}=-1:$

$\begin{eqnarray*} & & \displaystyle \sum _{{\bf{o}}}{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}){A}_{{{\bf{b}}}^{\prime }}({\bf{o}}| {\bf{s}})\\ & & =\displaystyle \sum _{{\bf{o}}:{o}_{{j}_{0}}=1}{\delta }_{{s}_{{j}_{0}},{b}_{{j}_{0}}}\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}{o}_{j}{\delta }_{{s}_{j},{b}_{j}}\cdot \displaystyle \prod _{l\not\hspace{-2pt}{\in }{E}_{{{\bf{b}}}^{\prime }}}{o}_{l}{\delta }_{{s}_{l},{b}_{l}^{\prime }}\\ & & -\displaystyle \sum _{{\bf{o}}:{o}_{{j}_{0}}=-1}{\delta }_{{s}_{{j}_{0}},{b}_{{j}_{0}}}\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}{o}_{j}{\delta }_{{s}_{j}={b}_{j}}\cdot \displaystyle \prod _{l\not\hspace{-2pt}{\in }{E}_{{{\bf{b}}}^{\prime }}}{o}_{l}{\delta }_{{s}_{l},{b}_{l}^{\prime }}=0.\end{eqnarray*}$

We assumed here that j₀ belongs to ${E}_{{\bf{b}}}\setminus {E}_{{{\bf{b}}}^{\prime }}$ and the same holds for j₀ in ${E}_{{{\bf{b}}}^{\prime }}\setminus {E}_{{\bf{b}}}.$

□

A.2. Proof of proposition 1

Proof of proposition 1. For more clarity, in this proof we will use the notation $I(a=b)={\delta }_{a,b}.$ Note that $f({\bf{o}}| {\bf{s}})=\frac{1}{n}{\displaystyle \sum }_{i}I({X}_{{\bf{s}},i}={\bf{o}}),$ where the random variables ${X}_{{\bf{s}},i}$ are independent for all settings ${\bf{s}}$ and all i from 1 to n. To estimate the risk of the linear estimator we write

$\begin{eqnarray*}{\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho & = & \displaystyle \sum _{{\bf{b}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \sum _{{\bf{s}}}(f({\bf{o}}| {\bf{s}}))-p({\bf{o}}| {\bf{s}}))\displaystyle \frac{{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})}{{2}^{k}{3}^{d({\bf{b}})}}{\sigma }_{{\bf{b}}}\\ & = & \displaystyle \sum _{{\bf{b}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \sum _{{\bf{s}}}\displaystyle \frac{1}{n}\displaystyle \sum _{i}(I({X}_{{\bf{s}},i}={\bf{o}})-p({\bf{o}}| {\bf{s}}))\displaystyle \frac{{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})}{{2}^{k}{3}^{d({\bf{b}})}}{\sigma }_{{\bf{b}}}\\ & := & \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{i}{W}_{{\bf{s}},i},\end{eqnarray*}$

where ${W}_{{\bf{s}},i}$ are independent and centred Hermitian random matrices. We will apply the following extension of the Bernstein matrix inequality [39] due to [40], see also [32, 33].

Proposition 3. (Bernstein inequality, [40].) Let ${Y}_{1},\ldots ,{Y}_{n}$ be independent, centred, m × m Hermitian random matrices. Suppose that, for some constants $V,W\gt 0$ we have $\parallel {Y}_{j}\parallel \leqslant V,$ for all j from 1 to n, and that $\parallel {\displaystyle \sum }_{j}{\mathbb{E}}({Y}_{j}^{2})\parallel \leqslant W.$ Then, for all $t\geqslant 0,$

$\begin{eqnarray*}&&{\mathbb{P}}(\parallel {Y}_{1}+...+{Y}_{n}\parallel \geqslant t)\leqslant 2m\mathrm{exp}\left(-\displaystyle \frac{{t}^{2}/2}{W+{tV}/3}\right).\end{eqnarray*}$

In our setup we bound $\parallel {W}_{{\bf{s}},i}\parallel \leqslant V$ for all ${\bf{s}}$ and i and $\parallel {\displaystyle \sum }_{{\bf{s}}}{\mathbb{E}}({W}_{{\bf{s}},i}^{*}{W}_{{\bf{s}},i})\parallel \leqslant W,$ where $V,W$ are evaluated below. We have

$\begin{eqnarray*}\parallel {W}_{{\bf{s}},i}\parallel & \leqslant & \displaystyle \frac{1}{n}\displaystyle \sum _{{\bf{b}}}\displaystyle \sum _{{\bf{o}}}\left|\displaystyle \frac{{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}})}{{2}^{k}{3}^{d({\bf{b}})}}\right|\cdot | I({X}_{{\bf{s}},i}={\bf{o}})-p({\bf{o}}| {\bf{s}})| \cdot \parallel {\sigma }_{{\bf{b}}}\parallel \\ & \leqslant & \displaystyle \frac{1}{n}\displaystyle \sum _{{\bf{b}}}\displaystyle \frac{1}{{2}^{k}{3}^{d(b)}}\displaystyle \prod _{j\not\hspace{-2pt}{\in }{E}_{{\bf{b}}}}I({b}_{j}={s}_{j})\displaystyle \sum _{{\bf{o}}}| I({X}_{{\bf{s}},i}={\bf{o}})-p({\bf{o}}| {\bf{s}})| \\ & \leqslant & \displaystyle \frac{2}{n{2}^{k}}\displaystyle \sum _{{\ell }=0}^{k}\;\displaystyle \sum _{b:d(b)={\ell }}\displaystyle \frac{1}{{3}^{{\ell }}}=\displaystyle \frac{2}{n{2}^{k}}\displaystyle \sum _{{\ell }=0}^{k}\displaystyle \left(\genfrac{}{}{0em}{}{k}{{\ell }}\right)\displaystyle \frac{1}{{3}^{{\ell }}}=\displaystyle \frac{2}{n{2}^{k}}{\left(1+\displaystyle \frac{1}{3}\right)}^{k}=\displaystyle \frac{2}{n}{\left(\displaystyle \frac{2}{3}\right)}^{k}:= V.\end{eqnarray*}$

Let us denote $B({\bf{o}}| {\bf{s}}):= {\displaystyle \sum }_{{\bf{b}}}{2}^{-k}{3}^{-d({\bf{b}})}{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}){\sigma }_{{\bf{b}}}.$ Then

$\begin{eqnarray} & & \parallel \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{i}{\mathbb{E}}({W}_{{\bf{s}},i}^{*}{W}_{{\bf{s}},i})\parallel \\ & = & \displaystyle \frac{1}{{n}^{2}}\parallel \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{i}\displaystyle \sum _{{\bf{o}},{{\bf{o}}}^{\prime }}{B}^{*}({\bf{o}}| {\bf{s}})\cdot {\rm{Cov}}(I({X}_{{\bf{s}},i}={\bf{o}}),I({X}_{{\bf{s}},i}={{\bf{o}}}^{\prime }))\cdot B({{\bf{o}}}^{\prime }| {\bf{s}})\parallel \\ & \leqslant & \displaystyle \frac{1}{{n}^{2}}\parallel \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{i}\displaystyle \sum _{{\bf{o}}}{B}^{*}({\bf{o}}| {\bf{s}})\cdot B({\bf{o}}| {\bf{s}})\parallel .\end{eqnarray} \tag{ A.1 }$

The last inequality follows from the fact that the covariance matrix can be written as the difference of two positive matrices ${{\rm{Cov}}}_{{\bf{o}},{{\bf{o}}}^{\prime }}:= {\rm{Cov}}(I({X}_{{\bf{s}},i}={\bf{o}}),I({X}_{{\bf{s}},i}={{\bf{o}}}^{\prime }))=p({\bf{o}}| {\bf{s}}){\delta }_{{\bf{o}},{{\bf{o}}}^{\prime }}-p({\bf{o}}| {\bf{s}})p({{\bf{o}}}^{\prime }| {\bf{s}})$ and since $p({\bf{o}}| {\bf{s}})\leqslant 1,$ we get ${\rm{Cov}}\leqslant {\bf{1}}.$

We replace $B({\bf{o}}| {\bf{s}})$ in (A.1) and use lemma 1 to get

$\begin{eqnarray*}&&\parallel \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{i}{\mathbb{E}}({W}_{{\bf{s}},i}^{*}{W}_{{\bf{s}},i})\parallel \\ &&\quad \leqslant \displaystyle \frac{1}{n}\| \displaystyle \sum _{{\bf{s}}}\displaystyle \sum _{{\bf{o}}}\displaystyle \sum _{{\bf{b}},{{\bf{b}}}^{\prime }}\displaystyle \frac{1}{{2}^{2k}{3}^{d({\bf{b}})+d({{\bf{b}}}^{\prime })}}{A}_{{\bf{b}}}({\bf{o}}| {\bf{s}}){A}_{{{\bf{b}}}^{\prime }}({\bf{o}}| {\bf{s}})\cdot {\sigma }_{{\bf{b}}}{\sigma }_{{{\bf{b}}}^{\prime }}\| \\ &&\quad \leqslant \displaystyle \frac{1}{n{2}^{k}}\displaystyle \sum _{{\ell }=0}^{k}\;\displaystyle \sum _{{\bf{b}}:d({\bf{b}})={\ell }}\displaystyle \frac{1}{{3}^{{\ell }}}=\displaystyle \frac{1}{n{2}^{k}}\displaystyle \sum _{{\ell }=0}^{k}\displaystyle \left(\genfrac{}{}{0em}{}{k}{{\ell }}\right)\displaystyle \frac{1}{{3}^{{\ell }}}=\displaystyle \frac{1}{n}{\left(\displaystyle \frac{2}{3}\right)}^{k}:= W.\end{eqnarray*}$

Apply the matrix Bernstein inequality in proposition 3 to get, for any $t\gt 0:$

$\begin{eqnarray*}&&{{\mathbb{P}}}_{\rho }\left(\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \geqslant t\right)\leqslant {2}^{k+1}\mathrm{exp}\left(-\displaystyle \frac{{t}^{2}}{1+2t/3}\displaystyle \frac{n}{2}{\left(\displaystyle \frac{3}{2}\right)}^{k}\right).\end{eqnarray*}$

We choose $t\gt 0$ such that

$\begin{eqnarray*}&&\varepsilon ={2}^{k+1}\mathrm{exp}\left(-\displaystyle \frac{{t}^{2}}{1+2t/3}\cdot \displaystyle \frac{n}{2}{\left(\displaystyle \frac{3}{2}\right)}^{k}\right),\end{eqnarray*}$

which leads to t such that

$\begin{eqnarray*}&&\displaystyle \frac{{t}^{2}}{1+2t/3}=\displaystyle \frac{2}{n}{\left(\displaystyle \frac{2}{3}\right)}^{k}\mathrm{log}\left(\displaystyle \frac{{2}^{k+1}}{\varepsilon }\right)=\displaystyle \frac{3}{2}v(\varepsilon ).\end{eqnarray*}$

Then the convenient choice of t is $\nu (\varepsilon )=v(\varepsilon )/2+\sqrt{{v}^{2}(\varepsilon )/2+3/2\cdot v(\varepsilon )},$ which is equivalent to $\sqrt{3/2\cdot v(\varepsilon )}$ when this last term tends to 0.□

A.3. Proof of proposition 1

Proof of proposition 1. Let us denote by $({\widehat{\lambda }}_{1}\geqslant ...\geqslant {\widehat{\lambda }}_{d})$ the eigenvalues of $| {\widehat{\rho }}_{n}^{(\mathrm{ls})}|$ and by ${\tilde{\lambda }}_{1},\ldots ,{\tilde{\lambda }}_{d}$ the eigenvalues of the resulting estimator ${\widehat{\rho }}_{n}^{(\mathrm{ls},n)}.$ It is easy to see that the latter has the same eigenvectors as ${\widehat{\rho }}_{n}^{(\mathrm{ls})}.$

Therefore

$\begin{eqnarray*}&&({\tilde{\lambda }}_{1},\ldots ,{\tilde{\lambda }}_{d})=\underset{\lambda :{\displaystyle \sum }_{j=1}^{d}{\lambda }_{j}=1}{\mathrm{argmin}}\;{\displaystyle \sum }_{j=1}^{d}{({\lambda }_{j}-{\widehat{\lambda }}_{j})}^{2}.\end{eqnarray*}$

This optimization has the explicit solution

$\begin{eqnarray*}&&{\tilde{\lambda }}_{j}={\widehat{\lambda }}_{j}+\displaystyle \frac{L}{2},{\rm{where}}\;{\rm{}}\displaystyle \frac{L}{2}d=1-\displaystyle \sum _{j=1}^{d}{\widehat{\lambda }}_{j}.\end{eqnarray*}$

Note that

$\begin{eqnarray*}&&\displaystyle \frac{L}{2}=\displaystyle \frac{1}{\;}d{\rm{tr}}(\rho -{\widehat{\rho }}_{n}^{(\mathrm{ls})})\leqslant \parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel .\end{eqnarray*}$

Therefore, $\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls},n)}-{\widehat{\rho }}_{n}^{(\mathrm{ls})}\parallel =L/2\leqslant \parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel$ and thus

$\begin{eqnarray*}\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls},n)}-\rho \parallel & \leqslant & \parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel +\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls},n)}-{\widehat{\rho }}_{n}^{(\mathrm{ls})}\parallel \leqslant 2\nu (\varepsilon ).\end{eqnarray*}$

□

A.4. Proof of corollary 1

Proof of corollary 1. We order the eigenvalues of ${\widehat{\rho }}_{n}^{(\mathrm{pen})}$ in decreasing order of absolute values $(| {\widehat{\lambda }}_{1}| \geqslant ...\geqslant | {\widehat{\lambda }}_{d}| ),$ and denote by ${\tilde{\lambda }}_{1},\ldots ,{\tilde{\lambda }}_{d}$ the eigenvalues of ${\widehat{\rho }}_{n}^{(\mathrm{pen},n)}.$ As in the previous proof, we can see that both matrices will share the same eigenvectors and that the relation between the eigenvalues is

$\begin{eqnarray*}&&{\tilde{\lambda }}_{j}={\widehat{\lambda }}_{j}+\displaystyle \frac{L}{2},{\rm{}}\;{\rm{where}}\;{\rm{}}\displaystyle \frac{L}{2}d=1-\displaystyle \sum _{j=1}^{d}{\widehat{\lambda }}_{j}.\end{eqnarray*}$

Thus

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-{\widehat{\rho }}_{n}^{(\mathrm{pen},n)}{\parallel }_{F}^{2}=d{(L/2)}^{2}\leqslant {{\rm{Tr}}}^{2}({\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho )/d\leqslant \parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{F}^{2}.\end{eqnarray*}$

We deduce that $\parallel \rho -{\widehat{\rho }}_{n}^{(\mathrm{pen},n)}{\parallel }_{F}\leqslant \parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-{\widehat{\rho }}_{n}^{(\mathrm{pen},n)}{\parallel }_{F}+\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{F}\leqslant 2\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen})}-\rho {\parallel }_{F}.$

Therefore

$\begin{eqnarray*}&&{\mathbb{P}}\left(\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{F}\gt 8c(\theta )r\nu {(\varepsilon )}^{2}\right)\lt \varepsilon \end{eqnarray*}$

and the previous inequality remains true for all $0\lt e\leqslant \varepsilon \lt 1.$ Let us denote by

$\begin{eqnarray*}&&x(e)=8r{\nu }^{2}(e)=16\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{e}),\end{eqnarray*}$

which is a decreasing function of e. This implies that $e=2d\mathrm{exp}\left(-\displaystyle \frac{{Nx}(e)}{16{rd}}\right).$ Thus

$\begin{eqnarray*}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2} & = & {\displaystyle \int }_{0}^{\infty }{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x\\ & = & {\displaystyle \int }_{0}^{x(\varepsilon )}{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x+{\displaystyle \int }_{x(\varepsilon )}^{\infty }{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{pen},n)}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x\\ & \leqslant & x(\varepsilon )+{\displaystyle \int }_{x(\varepsilon )}^{\infty }2d\mathrm{exp}(-\displaystyle \frac{{Nx}}{16{rd}}){\rm{d}}x\\ & \leqslant & 16\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{\varepsilon })+2d\cdot 16\displaystyle \frac{{rd}}{N}\mathrm{exp}(-\displaystyle \frac{N}{16{rd}}x(\varepsilon ))\\ & \leqslant & 16\displaystyle \frac{{rd}}{N}\left(\mathrm{log}(\displaystyle \frac{2d}{\varepsilon })+\varepsilon \right)\leqslant C\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{\varepsilon }).\end{eqnarray*}$

□

A.5. Proof upper bound physical estimator

Proof of theorem 2. We recall from proposition 2 that with probability larger than $1-\varepsilon ,$ we have $\parallel {\tilde{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \leqslant 2\nu (\varepsilon ).$ In particular, by using the Weyl inequality [41], this implies that $| {\tilde{\lambda }}_{k}-{\lambda }_{k}| \leqslant 2\nu (\varepsilon )$ for all k from 1 to d, where ${\lambda }_{1},\ldots ,{\lambda }_{d}$ are the eigenvalues of ρ arranged in decreasing order. After a total of $\hat{{\ell }}=d-\hat{r}$ iterations the algorithm stops and we have [42]

$\begin{eqnarray*}&&{\widehat{\lambda }}_{\hat{r}+1}^{(\mathrm{phys})}=...={\widehat{\lambda }}_{d}^{(\mathrm{phys})}=0\end{eqnarray*}$

and

$\begin{eqnarray*}&&{\widehat{\lambda }}_{j}^{(\mathrm{phys})}:= {\widehat{\lambda }}_{j}^{(\hat{{\ell }})}={\tilde{\lambda }}_{j}+\displaystyle \frac{1}{\hat{r}}\displaystyle \sum _{k\gt \hat{r}}{\tilde{\lambda }}_{k}\qquad j=1,...,\hat{r}.\end{eqnarray*}$

From now on we assume that $\parallel {\tilde{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \leqslant 2\nu (\varepsilon )$ which holds with probability larger than $1-\varepsilon ,$ see corollary 3.2. We will show that under this assumption $\hat{r}=r.$ For this we consider the two cases $\hat{r}\gt r$ and $\hat{r}\lt r$ separately.

Suppose $\hat{r}\gt r.$ For $j\leqslant \hat{r}$ we have

$\begin{eqnarray*}| {\widehat{\lambda }}_{j}^{(\mathrm{phys})}-{\lambda }_{j}| & = & \left|{\tilde{\lambda }}_{j}-{\lambda }_{j}+\displaystyle \frac{1}{\hat{r}}\left(1-\displaystyle \sum _{k\leqslant \hat{r}}{\tilde{\lambda }}_{k}\right)\right|\\ & = & \left|{\tilde{\lambda }}_{j}-{\lambda }_{j}+\displaystyle \frac{1}{\hat{r}}\displaystyle \sum _{k\leqslant \hat{r}}({\lambda }_{k}-{\tilde{\lambda }}_{k})\right|\\ & \leqslant & | {\tilde{\lambda }}_{j}-{\lambda }_{j}| +\displaystyle \frac{1}{\hat{r}}\displaystyle \sum _{k\leqslant \hat{r}}| {\tilde{\lambda }}_{k}-{\lambda }_{k}| \leqslant 2\nu +2\nu =4\nu .\end{eqnarray*}$

This implies that ${\widehat{\lambda }}_{\hat{r}}^{(\mathrm{phys})}\leqslant {\lambda }_{\hat{r}}+4\nu =4\nu$ since $\hat{r}\gt r.$ However, as discussed above, by the definition of the threshold estimator, ${\widehat{\lambda }}_{\hat{r}}^{(\mathrm{phys})}\geqslant 4\nu .$ Therefore $\hat{r}$ cannot be larger than r.

Suppose $\hat{r}\lt r.$ Then ${\widehat{\lambda }}_{r}^{(\mathrm{phys})}\geqslant {\lambda }_{r}-| {\widehat{\lambda }}_{r}^{(\mathrm{phys})}-{\lambda }_{r}| \gt 8\nu -4\nu =4\nu .$ However this is not possible since ${\widehat{\lambda }}_{r}^{(\mathrm{phys})}=0$ under the assumption $\hat{r}\lt r.$ Thus, $\widehat{r}=r$ with probability larger than $1-\epsilon .$

We consider now the MSE of the estimator. As shown above, we have $| {\widehat{\lambda }}_{j}^{(\mathrm{phys})}-{\lambda }_{j}| \leqslant 4\nu .$ Let $\rho ={{UDU}}^{*}$ be the diagonalization of ρ, where U is unitary and D is the diagonal matrix of eigenvalues. Similarly, let ${\widehat{\rho }}_{n}^{(\mathrm{ls})}={\widehat{U}}_{n}{D}_{n}^{(\mathrm{ls})}{\widehat{U}}_{n}^{*}$ and ${\widehat{\rho }}_{n}^{(\mathrm{phys})}={\widehat{U}}_{n}{D}_{n}^{(\mathrm{phys})}{\widehat{U}}_{n}^{*}$ be the decompositions of the least squares and the physical threshold estimator, where we take into account that the latter two have the same eigenbasis. Then

$\begin{eqnarray}\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2} & = & \parallel {\widehat{U}}_{n}{D}_{n}^{(\mathrm{phys})}{\widehat{U}}_{n}^{*}-{{UDU}}^{*}{\parallel }_{2}\\ & \leqslant & \parallel {\widehat{U}}_{n}{D}_{n}^{(\mathrm{phys})}{\widehat{U}}_{n}^{*}-{\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}{\parallel }_{2}+\parallel {\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}-{{UDU}}^{*}{\parallel }_{2}\\ & \leqslant & \parallel {D}_{n}^{(\mathrm{phys})}-D{\parallel }_{2}+\parallel {\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}-{{UDU}}^{*}{\parallel }_{2}\end{eqnarray} \tag{ A.2 }$

The first norm on the right-hand side of the previous inequality is bounded as

$\begin{eqnarray}&&\parallel {D}_{n}^{(\mathrm{phys})}-D{\parallel }_{2}^{2}=\displaystyle \sum _{j\leqslant \widehat{r}}{({\widehat{\lambda }}_{j}^{(\mathrm{phys})}-{\lambda }_{j})}^{2}\leqslant 16r{\nu }^{2}.\end{eqnarray} \tag{ A.3 }$

For the second norm we use a similar triangle inequality for the operator norm

$\begin{eqnarray*}\parallel {\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}-{{UDU}}^{*}\parallel & \leqslant & \parallel {\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}-{\widehat{U}}_{n}{D}_{n}^{(\mathrm{ls})}{\widehat{U}}_{n}^{*}\parallel +\parallel {\widehat{U}}_{n}{D}_{n}^{(\mathrm{ls})}{\widehat{U}}_{n}^{*}-{{UDU}}^{*}\parallel \\ & = & \parallel D-{D}_{n}^{(\mathrm{ls})}\parallel +\parallel {\widehat{\rho }}_{n}^{(\mathrm{ls})}-\rho \parallel \leqslant \nu +\nu =2\nu .\end{eqnarray*}$

The first term is smaller than ν since $| {\lambda }_{i}-{\widehat{\lambda }}_{i}^{(\mathrm{ls})}| \leqslant \nu$ as it follows from proposition 1, and the Weyl inequality [41]. The second term is also bounded by ν, by proposition 1. Therefore, since ${\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}$ and ${{UDU}}^{*}$ are rank r matrices, the difference is at most rank $2r$ and $\parallel {\widehat{U}}_{n}D{\widehat{U}}_{n}^{*}-{{UDU}}^{*}{\parallel }_{2}\leqslant 2\nu \sqrt{2r}.$ By plugging this together with (A.3) into the right side of (A.2) we get

$\begin{eqnarray*}&&\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\leqslant {(2\sqrt{2}+4)}^{2}r{\nu }^{2},\end{eqnarray*}$

with probability larger than $1-\epsilon .$

Moreover, the previous inequality remains true for all $0\lt e\leqslant \varepsilon \lt 1.$ Let us denote by

$\begin{eqnarray*}&&x(e)=48r{\nu }^{2}(e)=96\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{e}),\end{eqnarray*}$

which is a decreasing function of e. This implies that $e=2d\mathrm{exp}\left(-\displaystyle \frac{{Nx}(e)}{96{rd}}\right).$ Thus

$\begin{eqnarray*}{{\mathbb{E}}}_{\rho }\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2} & = & {\displaystyle \int }_{0}^{\infty }{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x\\ & = & {\displaystyle \int }_{0}^{x(\varepsilon )}{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x+{\displaystyle \int }_{x(\varepsilon )}^{\infty }{{\mathbb{P}}}_{\rho }(\parallel {\widehat{\rho }}_{n}^{(\mathrm{phys})}-\rho {\parallel }_{2}^{2}\gt x){\rm{d}}x\\ & \leqslant & x(\varepsilon )+{\displaystyle \int }_{x(\varepsilon )}^{\infty }2d\mathrm{exp}(-\displaystyle \frac{{Nx}}{96{rd}}){\rm{d}}x\\ & \leqslant & 96\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{\varepsilon })+2d\cdot 96\displaystyle \frac{{rd}}{N}\mathrm{exp}(-\displaystyle \frac{N}{96{rd}}x(\varepsilon ))\\ & \leqslant & 96\displaystyle \frac{{rd}}{N}\left(\mathrm{log}(\displaystyle \frac{2d}{\varepsilon })+\varepsilon )\right)\leqslant C\displaystyle \frac{{rd}}{N}\mathrm{log}(\displaystyle \frac{2d}{\varepsilon }).\end{eqnarray*}$

A.6. The average Fisher information matrix for the full, unconstrained model, with random measurement design

In this section we present a detailed calculation of the average quantum Fisher information at a the rank r state ${\rho }_{0}$ defined in (5.2), for random measurements with uniform distribution over the measurement basis. We consider the full parametrization by $\theta \in {{\mathbb{R}}}^{{d}^{2}}$ given by equation (5.3). As explained in the proof, the Fisher information matrix for the parametrization $\tilde{\theta }$ of the rotation model ${{\mathcal{R}}}_{d,r}$ is a particular block of the larger Fisher matrix computed here. We will come back to this at the end of the computation.

Let

$\begin{eqnarray*}&&{{\bf{B}}}_{U}:= \{| {\bf{o}};U\rangle := U| {\bf{o}}\rangle :{\bf{o}}=1,...,{2}^{k}\}\end{eqnarray*}$

denote the ONB obtained by rotating the standard basis by the unitary U.

The Fisher info $I(\rho | {{\bf{B}}}_{U})$ associated to the measurement with ONB ${{\bf{B}}}_{U}$ is given by the matrix

$\begin{eqnarray*}&&I{(\rho | {{\bf{B}}}_{U})}_{a,b}=\displaystyle \sum _{{\bf{o}}:{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})\gt 0}\displaystyle \frac{1}{{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}\displaystyle \frac{\partial {p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\partial {\theta }_{a}}\displaystyle \frac{\partial {p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\partial {\theta }_{b}},\end{eqnarray*}$

where ${p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})$ is the probability of the outcome ${\bf{o}}$

$\begin{eqnarray*}{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U}) & = & \langle {\bf{o}},U| \rho | {\bf{o}},U\rangle =\displaystyle \sum _{i}{\rho }_{{ii}}| \langle {\bf{o}},U| i\rangle {| }^{2}\\ & + & 2\displaystyle \sum _{i\lt j}{\rm{Re}}({\rho }_{i,j}){\rm{Re}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle )+2\displaystyle \sum _{i\lt j}{\rm{Im}}({\rho }_{i,j}){\rm{Im}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle )\end{eqnarray*}$

and the partial derivatives are given by

$\begin{eqnarray*}\displaystyle \frac{\partial {p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\partial {\rho }_{i,i}} & = & | \langle {\bf{o}},U| i\rangle {| }^{2}\\ \displaystyle \frac{\partial {p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\partial {\rm{Re}}{\rho }_{i,j}} & = & 2{\rm{Re}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle )\\ \displaystyle \frac{\partial {p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\partial {\rm{Im}}{\rho }_{i,j}} & = & 2{\rm{Im}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle ).\end{eqnarray*}$

The Fisher information matrix $I(\rho | {{\bf{B}}}_{U})$ has the following block structure

$\begin{eqnarray*}I(\rho | {{\bf{B}}}_{U})=\left(\begin{array}{ccc}{I}^{{dd}}(\rho | {{\bf{B}}}_{U}) & {I}^{{dr}}(\rho | {{\bf{B}}}_{U}) & {I}^{{di}}(\rho | {{\bf{B}}}_{U})\\ {I}^{{rd}}(\rho | {{\bf{B}}}_{U}) & {I}^{{rr}}(\rho | {{\bf{B}}}_{U}) & {I}^{{ri}}(\rho | {{\bf{B}}}_{U})\\ {I}^{{id}}(\rho | {{\bf{B}}}_{U}) & {I}^{{ir}}(\rho | {{\bf{B}}}_{U}) & {I}^{{ii}}(\rho | {{\bf{B}}}_{U})\end{array}\right)\end{eqnarray*}$

with superscripts indicating the type of parameter considered: diagonal, real or imaginary part of off-diagonal element. The average Fisher information for a randomly chosen basis is

$\begin{eqnarray*}&&\bar{I}(\rho ):= \int \mu ({\rm{d}}U)I(\rho | {{\bf{B}}}_{U}),\end{eqnarray*}$

where $\mu ({\rm{d}}U)$ is the Haar measure over unitaries used for choosing the random basis. Note that by symmetry $\bar{I}(\rho )$ only depends on the spectrum of ρ. We will not compute $\bar{I}(\rho )$ for an arbitrary state ρ but only at $\rho ={\rho }_{0}.$ The corresponding Fisher information will be denoted $\bar{I}=\bar{I}({\rho }_{0})$ and is a function of d and r. Below we compute the different blocks of $\bar{I}.$ For the matrix elements we will use a suggestive notation, e.g. ${\bar{I}}_{{ii};{jk}}^{d,r}$ denotes the element corresponding to the diagonal parameter ${\theta }_{{ii}}^{d}={\rho }_{{ii}}$ and the real part of the off-diagonal element ${\rho }_{{jk}},$ etc.

(A) Diagonal–diagonal block.

$\begin{eqnarray*}&&I{(\rho | {{\bf{B}}}_{U})}_{{ii};{jj}}^{{dd}}=\displaystyle \sum _{{\bf{o}}:{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})\gt 0}\displaystyle \frac{1}{{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}| \langle {\bf{o}},U| i\rangle {| }^{2}| \langle {\bf{o}},U| j\rangle {| }^{2}.\end{eqnarray*}$

By integrating over unitaries we obtain the corresponding matrix element of the average Fisher matrix $\bar{I}.$ Since ${p}_{{\rho }_{0}}({\bf{o}}| {{\bf{B}}}_{U})\gt 0$ is true for all ${\bf{o}},$ with probability one with respect to ${\mu }^{d}({\rm{d}}U),$ we drop the condition from the sum. At the state $\rho ={\rho }_{0}$ defined in (5.2), we have

$\begin{eqnarray*}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\displaystyle \sum _{{\bf{o}}}\displaystyle \int \displaystyle \frac{{| \langle {\bf{o}},U| i\rangle | }^{2}{| \langle {\bf{o}},U| j\rangle | }^{2}}{{\displaystyle \sum }_{k=1}^{r}{| \langle {\bf{o}},U| k\rangle | }^{2}}{\mu }^{d}({\rm{d}}U)\\ & = & r\cdot d\displaystyle \int \displaystyle \frac{{| \langle {\bf{1}},U| i\rangle | }^{2}{| \langle {\bf{1}},U| j\rangle | }^{2}}{{\displaystyle \sum }_{k=1}^{r}{| \langle {\bf{1}},U| k\rangle | }^{2}}{\mu }^{d}({\rm{d}}U)\\ & = & r\cdot d\displaystyle \int \displaystyle \frac{{| \langle \psi | i\rangle | }^{2}{| \langle \psi | j\rangle | }^{2}}{{\displaystyle \sum }_{k=1}^{r}{| \langle \psi | k\rangle | }^{2}}{\nu }^{d}({\rm{d}}\psi ),\end{eqnarray*}$

where ${\nu }^{d}({\rm{d}}\psi )$ is the uniform measure over the projective space on ${{\mathbb{C}}}^{d}.$ To compute the integral we decompose $| \psi \rangle$ as

$\begin{eqnarray*}&&| \psi \rangle =q| {\psi }_{1}\rangle +\sqrt{1-{q}^{2}}| {\psi }_{2}\rangle \end{eqnarray*}$

with $| {\psi }_{1}\rangle \in {{\mathcal{H}}}_{r}:= {\rm{Span}}\{| 1\rangle ,...,| r\rangle \}$ and $| {\psi }_{2}\rangle \in {{\mathcal{H}}}_{r}^{\perp }$ normalized (orthogonal) vectors, and $0\leqslant q\leqslant 1.$ The uniform measure can be expressed as

$\begin{eqnarray*}&&{\nu }^{d}({\rm{d}}\psi )={m}^{d,r}({\rm{d}}q)\times {\nu }^{r}({\rm{d}}{\psi }_{1})\times {\nu }^{d-r}({\rm{d}}{\psi }_{2}),\end{eqnarray*}$

where ${m}^{d,r}({\rm{d}}q)$ is the distribution of the length of the projection of a random vector in ${{\mathbb{C}}}^{d}$ onto an r-dimensional subspace. With this notation we have

$\begin{eqnarray}&&{\bar{I}}_{{ii};{jj}}^{{dd}}=r\cdot d\int \int \int \displaystyle \frac{1}{{q}^{2}}| \langle \psi | i\rangle {| }^{2}| \langle \psi | j\rangle {| }^{2}{m}^{d,r}({\rm{d}}q){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{d-r}({\rm{d}}{\psi }_{2}).\end{eqnarray} \tag{ A.4 }$

We distinguish four sub-cases depending on whether each of the indices $i,j$ belongs to $\{1,...,r\}$ or $\{r+1,...,d\}.$

Sub-case 1: $i,j\leqslant r.$ In this case (A.4) becomes

$\begin{eqnarray}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\cdot d\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{q}^{4}{m}^{d,r}({\rm{d}}q)\times \displaystyle \int | \langle {\psi }_{1}| i\rangle {| }^{2}| \langle {\psi }_{1}| j\rangle {| }^{2}{\nu }^{r}({\rm{d}}{\psi }_{1})\\ & = & r\cdot d\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{q}^{4}{m}^{d,r}({\rm{d}}q)\times \displaystyle \int {U}_{1,i}^{*}{U}_{i,1}{U}_{1,j}^{*}{U}_{j,1}{\mu }^{r}({\rm{d}}U).\end{eqnarray} \tag{ A.5 }$

In the last line we have re-written $| {\psi }_{1}\rangle$ as $U| 1\rangle$ in order to use the existing formulas for integrals of monomials over the unitary group [43]. Since we will use these formulas repeatedly, we recall that

$\begin{eqnarray}&&\int {U}_{{i}_{1},{j}_{1}}{U}_{{i}_{2},{j}_{2}}{U}_{{j}_{1}^{\prime },{i}_{1}^{\prime }}^{*}{U}_{{j}_{2}^{\prime },{i}_{2}^{\prime }}^{*}{\mu }^{l}({\rm{d}}U)=\displaystyle \sum _{\sigma ,\tau \in {S}_{2}}{\delta }_{{i}_{1},{i}_{\sigma 1}^{\prime }}{\delta }_{{i}_{2},{i}_{\sigma 2}^{\prime }}{\delta }_{{j}_{1},{j}_{\tau 1}^{\prime }}{\delta }_{{j}_{2},{j}_{\tau 2}^{\prime }}{{\rm{Wg}}}^{l}(\sigma {\tau }^{-1}),\end{eqnarray} \tag{ A.6 }$

where σ and τ are permutations in ${S}_{2}=\{(1,1),(2)\}$ and ${{\rm{Wg}}}^{l}(\cdot )$ is the Weingarten function over S₂:

$\begin{eqnarray*}&&\ {{\rm{Wg}}}^{l}(1,1)=\displaystyle \frac{1}{{l}^{2}-1},\qquad {{\rm{Wg}}}^{l}(2)=\displaystyle \frac{-1}{l({l}^{2}-1)}.\end{eqnarray*}$

The integral over q can be easily evaluated as

$\begin{eqnarray}&&\int {q}^{2}{m}^{d,r}({\rm{d}}q)=\displaystyle \sum _{k=1}^{r}\int | \langle k| \psi \rangle {| }^{2}{\mu }^{d}({\rm{d}}\psi )=\displaystyle \frac{r}{d}.\end{eqnarray} \tag{ A.7 }$

By inserting (A.6) and (A.7) into (A.5) we get

$\begin{eqnarray*}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\cdot d\cdot \displaystyle \frac{r}{d}\left(\displaystyle \frac{1}{{r}^{2}-1}-\displaystyle \frac{1}{r({r}^{2}-1)}\right)=\displaystyle \frac{r}{r+1},\quad i\ne j\\ {\bar{I}}_{{ii};{ii}}^{{dd}} & = & r\cdot d\cdot \displaystyle \frac{r}{d}\cdot 2\left(\displaystyle \frac{1}{{r}^{2}-1}-\displaystyle \frac{1}{r({r}^{2}-1)}\right)=\displaystyle \frac{2r}{r+1}.\end{eqnarray*}$

Sub-case 2: $i\leqslant r$ and $j\gt r.$ From (A.4) we get

$\begin{eqnarray*}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\cdot d\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{q}^{2}(1-{q}^{2}){m}^{d,r}({\rm{d}}q)\times \displaystyle \int | \langle {\psi }_{1}| i\rangle {| }^{2}{\nu }^{r}({\rm{d}}{\psi }_{1})\times \displaystyle \int | \langle {\psi }_{2}| j\rangle {| }^{2}{\nu }^{d-r}({\rm{d}}{\psi }_{2})\\ & = & r\cdot d\left(1-\displaystyle \frac{r}{d}\right)\cdot \displaystyle \frac{1}{r}\cdot \displaystyle \frac{1}{d-r}=1.\end{eqnarray*}$

Sub-case 3: $i,j\gt r.$ Similarly, in this case

$\begin{eqnarray}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\cdot d\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{(1-{q}^{2})}^{2}{m}^{d,r}({\rm{d}}q)\times \displaystyle \int | \langle {\psi }_{2}| i\rangle {| }^{2}| \langle {\psi }_{2}| j\rangle {| }^{2}{\nu }^{d-r}({\rm{d}}{\psi }_{2})\\ & = & r\cdot d\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{(1-{q}^{2})}^{2}{m}^{d,r}({\rm{d}}q)\times \displaystyle \int {U}_{1,i}^{*}{U}_{i,1}{U}_{1,j}^{*}{U}_{j,1}{\mu }^{d-r}({\rm{d}}U).\end{eqnarray} \tag{ A.8 }$

We first simplify the integral on the right side

$\begin{eqnarray}&&\int \displaystyle \frac{1}{{q}^{2}}{(1-{q}^{2})}^{2}{m}^{d,r}({\rm{d}}q)=\int \displaystyle \frac{1}{{q}^{2}}{m}^{d,r}({\rm{d}}q)-2+\displaystyle \frac{r}{d}.\end{eqnarray} \tag{ A.9 }$

To evaluate the remaining integral, we consider a multivariate Gaussian random variable $c=({c}_{1},...,{c}_{2d})\sim N(0,{I}_{2d}),$ and denote by ${g}^{2d}({\rm{d}}c)$ its probability distribution. From this we construct the complex vector with uniform distribution over the unit ball in ${{\mathbb{C}}}^{d}$

$\begin{eqnarray*}&&| \psi \rangle := \displaystyle \frac{1}{\parallel c\parallel }\displaystyle \sum _{i=1}^{d}({c}_{i}+{\rm{i}}{c}_{d+1})| i\rangle ,\qquad \parallel c{\parallel }^{2}=\displaystyle \sum _{l=1}^{2d}{c}_{i}^{2}.\end{eqnarray*}$

We can now write

$\begin{eqnarray}\displaystyle \int \displaystyle \frac{1}{{q}^{2}}{m}^{d,r}({\rm{d}}q) & = & \displaystyle \int {g}^{2d}({\rm{d}}c)\displaystyle \frac{{\displaystyle \sum }_{l=1}^{2d}{c}_{l}^{2}}{{\displaystyle \sum }_{l=1}^{2r}{c}_{l}^{2}}=1+\displaystyle \int {g}^{2d}({\rm{d}}c)\displaystyle \frac{{\displaystyle \sum }_{l=2r+1}^{2d}{c}_{l}^{2}}{{\displaystyle \sum }_{l=1}^{2r}{c}_{l}^{2}}\\ & = & 1+\displaystyle \int \parallel {c}^{1}{\parallel }^{2}\;{g}^{2(d-r)}({\rm{d}}{c}^{1})\cdot \displaystyle \int \displaystyle \frac{1}{\parallel {c}^{2}{\parallel }^{2}}{g}^{2r}({\rm{d}}{c}^{2})\\ & = & 1+2(d-r)\cdot \displaystyle \frac{1}{2r-2}=\displaystyle \frac{d-1}{r-1}.\end{eqnarray} \tag{ A.10 }$

Above we used the fact that $\parallel {c}^{2}{\parallel }^{2}$ is a ${\chi }^{2}$ variable with $2r$ degrees of freedom, so the second integral is the mean of its inverse. By inserting (A.10) into (A.9) we obtain

$\begin{eqnarray}&&\int \displaystyle \frac{1}{{q}^{2}}{(1-{q}^{2})}^{2}{m}^{d,r}({\rm{d}}q)=-2+\displaystyle \frac{r}{d}+\displaystyle \frac{d-1}{r-1}=\displaystyle \frac{(d-r)(d-r+1)}{d(r-1)}.\end{eqnarray} \tag{ A.11 }$

Finally, by inserting (A.11) into (A.8) and applying (A.6) we obtain

$\begin{eqnarray*}{\bar{I}}_{{ii};{jj}}^{{dd}} & = & r\dot{d}\cdot \displaystyle \frac{(d-r)(d-r+1)}{d(r-1)}\cdot \left(\displaystyle \frac{1}{{(d-r)}^{2}-1}-\displaystyle \frac{1}{(d-r)({(d-r)}^{2}-1)}\right)\\ & = & \displaystyle \frac{r}{r-1},\quad i\ne j\\ {\bar{I}}_{{ii};{ii}}^{{dd}} & = & \displaystyle \frac{2r}{r-1}.\end{eqnarray*}$

Note that for r = 1 these matrix elements are infinite. This is due to large contributions from measurements which have one basis vector close to being orthogonal to the one-dimensional vector state.. This is somewhat akin to what happens in the case of a Bernoulli variable (coin toss), in the case when the probability is close to zero or one. While this phenomenon is interesting, it does not play any role in our analysis for which the diagonal matrix elements are not relevant parameters.

(B) Diagonal-real block.

$\begin{eqnarray*}&&I{(\rho | {{\bf{B}}}_{U})}_{{ii};{jk}}^{{dr}}=2\displaystyle \sum _{{\bf{o}}:{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})\gt 0}\displaystyle \frac{1}{{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}| \langle {\bf{o}},U| i\rangle {| }^{2}\cdot {\rm{Re}}(\langle j| {\bf{o}},U\rangle \langle {\bf{o}},U| k\rangle ).\end{eqnarray*}$

As before, the corresponding entry of the average Fisher information matrix is

$\begin{eqnarray*}{\bar{I}}_{{ii};{jk}}^{{dr}} & = & 2\cdot d\cdot r\displaystyle \int \displaystyle \frac{{| \langle \psi | i\rangle | }^{2}\cdot {\rm{Re}}(\langle j| \psi \rangle \langle \psi | k\rangle )}{{\displaystyle \sum }_{k=1}^{r}{| \langle \psi | k\rangle | }^{2}}{\nu }^{d}({\rm{d}}\psi )\\ & = & 2\cdot d\cdot r\displaystyle \int \displaystyle \int \displaystyle \int \displaystyle \frac{1}{{q}^{2}}| \langle \psi | i\rangle {| }^{2}\cdot {\rm{Re}}(\langle j| \psi \rangle \langle \psi | k\rangle ){m}^{d,r}({\rm{d}}q){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{d-r}({\rm{d}}{\psi }_{2}).\end{eqnarray*}$

Sub-case 1: $i,j,k\leqslant r.$ In this case the integral is

$\begin{eqnarray*}{\bar{I}}_{{ii};{jk}}^{{dr}} & = & 2\cdot d\cdot r\displaystyle \int {q}^{2}{m}^{d,r}({\rm{d}}q)\displaystyle \int {| \langle {\psi }_{1}| i\rangle | }^{2}{\rm{Re}}(\langle j| {\psi }_{1}\rangle \langle {\psi }_{1}| k\rangle ){\nu }^{r}({\rm{d}}{\psi }_{1})\\ & = & d\cdot r\cdot \displaystyle \frac{r}{d}\displaystyle \int {\mu }^{r}({\rm{d}}U){U}_{i1}{U}_{1i}^{*}({U}_{j1}{U}_{1k}^{*}+{U}_{k1}{U}_{1j}^{*})=0,\end{eqnarray*}$

where we applied formula (A.6) for the integrals over the unitaries.

Sub-case 2: $i,j\leqslant r$ and $k\gt r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int q\sqrt{1-{q}^{2}}{m}^{d,r}({\rm{d}}q)\\ &&\times {\rm{Re}}\displaystyle \int | \langle {\psi }_{1}| i\rangle {| }^{2}\langle j| {\psi }_{1}\rangle {\nu }^{r}({\rm{d}}{\psi }_{1})\displaystyle \int \langle {\psi }_{2}| k\rangle {\nu }^{d-r}({\rm{d}}{\psi }_{2})=0.\end{eqnarray*}$

Sub-case 3: $i\leqslant r$ and $j,k\gt r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int (1-{q}^{2}){m}^{d,r}({\rm{d}}q)\\ &&\times \displaystyle \int | \langle {\psi }_{1}| i\rangle {| }^{2}{\nu }^{r}({\rm{d}}{\psi }_{1})\cdot {\rm{Re}}\displaystyle \int \langle j| {\psi }_{2}\rangle \langle {\psi }_{2}| k\rangle {\nu }^{d-r}({\rm{d}}{\psi }_{2})=0.\end{eqnarray*}$

Sub-case 4: $i\gt r$ and $j,k\leqslant r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int (1-{q}^{2}){m}^{d,r}({\rm{d}}q)\\ &&\times \displaystyle \int | \langle {\psi }_{2}| i\rangle {| }^{2}{\nu }^{d-r}({\rm{d}}{\psi }_{2})\cdot {\rm{Re}}\displaystyle \int \langle j| {\psi }_{1}\rangle \langle {\psi }_{1}| k\rangle {\nu }^{r}({\rm{d}}{\psi }_{1})=0.\end{eqnarray*}$

Sub-case 5: $i\gt r$ and $j,k\leqslant r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int (1-{q}^{2}){m}^{d,r}({\rm{d}}q)\\ &&\times \displaystyle \int | \langle {\psi }_{2}| i\rangle {| }^{2}{\nu }^{d-r}({\rm{d}}{\psi }_{2})\cdot {\rm{Re}}\displaystyle \int \langle j| {\psi }_{1}\rangle \langle {\psi }_{1}| k\rangle {\nu }^{r}({\rm{d}}{\psi }_{1})=0.\end{eqnarray*}$

Sub-case 6: $i,k\gt r$ and $j\leqslant r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int q\sqrt{1-{q}^{2}}{m}^{d,r}({\rm{d}}q)\\ &&\times {\rm{Re}}\displaystyle \int | \langle {\psi }_{2}| i\rangle {| }^{2}\langle {\psi }_{2}| k\rangle {\nu }^{d-r}({\rm{d}}{\psi }_{2})\cdot \displaystyle \int \langle j| {\psi }_{1}\rangle {\nu }^{r}({\rm{d}}{\psi }_{1})=0.\end{eqnarray*}$

Sub-case 7: $i,j,k\gt r.$ In this case the integral is

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{dr}}=2\cdot d\cdot r\displaystyle \int \displaystyle \frac{{(1-{q}^{2})}^{2}}{{q}^{2}}{m}^{d,r}({\rm{d}}q)\\ &&\times \displaystyle \int | \langle {\psi }_{2}| i\rangle {| }^{2}{\rm{Re}}\langle j| {\psi }_{2}\rangle \langle {\psi }_{2}| k\rangle {\nu }^{d-r}({\rm{d}}{\psi }_{2})=0.\end{eqnarray*}$

The last integral is zero for the same reason as in sub-case 1.

In conclusion all diagonal-real matrix elements are zero

$\begin{eqnarray*}&&{\bar{I}}_{{ii};j,k}^{{dr}}=0,\qquad \mathrm{for}\;\mathrm{all}\;i=1,...,{2}^{k},\;\mathrm{and}\;1\leqslant j\lt k\leqslant {2}^{k}.\end{eqnarray*}$

(C) Diagonal-imaginary block.

Similarly to the case of diagonal-real elements, we obtain that all diagonal-imaginary matrix elements are zero

$\begin{eqnarray*}&&{\bar{I}}_{{ii};{jk}}^{{di}}=0,\qquad \mathrm{for}\;\mathrm{all}\;i=1,...,{2}^{k},\;\mathrm{and}\;1\leqslant j\lt k\leqslant {2}^{k}.\end{eqnarray*}$

(D) Real–real block.

$\begin{eqnarray*}&&I{(\rho | {{\bf{B}}}_{U})}_{{ij};{kl}}^{{rr}}=4\displaystyle \sum _{{\bf{o}}:{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})\gt 0}\displaystyle \frac{1}{{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\rm{Re}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle )\cdot {\rm{Re}}(\langle k| {\bf{o}},U\rangle \langle {\bf{o}},U| l\rangle ).\end{eqnarray*}$

As before the corresponding entry of the average Fisher information matrix is

$\begin{eqnarray*}{\bar{I}}_{{ij};{kl}}^{{rr}} & = & 4\cdot d\cdot r\displaystyle \int \displaystyle \frac{{\rm{Re}}(\langle i| \psi \rangle \langle \psi | j\rangle )\cdot {\rm{Re}}(\langle k| \psi \rangle \langle \psi | l\rangle )}{\displaystyle \sum _{k=1}^{r}{| \langle \psi | k\rangle | }^{2}}{\nu }^{d}({\rm{d}}\psi )\\ & = & 4\cdot d\cdot r\displaystyle \int \displaystyle \int \displaystyle \int \displaystyle \frac{1}{{q}^{2}}{\rm{Re}}(\langle i| \psi \rangle \langle \psi | j\rangle )\cdot {\rm{Re}}(\langle k| \psi \rangle \langle \psi | l\rangle ){m}^{d,r}({\rm{d}}q){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{d-r}({\rm{d}}{\psi }_{2}).\end{eqnarray*}$

The same reasoning as before can be applied to transform the integral into a single integral, or a product of integrals over unitaries. If a single index is larger than r while the other three are smaller, or conversely a single index is smaller than r while the other are larger, then we have two integrals over unitaries which are zero since the monomial is of odd order. Therefore the following cases remain to be analysed.

Sub-case 1: $i,j,k,l\leqslant r.$ In this case the integral is

$\begin{eqnarray*}{\bar{I}}_{{ij};{kl}}^{{rr}} & = & 4\cdot d\cdot r\displaystyle \int {q}^{2}{m}^{d,r}({\rm{d}}q)\displaystyle \int {\rm{Re}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{1}| j\rangle )\cdot {\rm{Re}}(\langle k| {\psi }_{1}\rangle \langle {\psi }_{1}| l\rangle ){\nu }^{r}({\rm{d}}{\psi }_{1})\\ & = & d\cdot r\cdot \displaystyle \frac{r}{d}\cdot \displaystyle \int {\mu }^{r}({\rm{d}}U)({U}_{i1}{U}_{1j}^{*}+{U}_{j1}{U}_{1i}^{*})({U}_{k1}{U}_{1\;l}^{*}+{U}_{l1}{U}_{1k}^{*}).\end{eqnarray*}$

Using formula (A.6) we find that the integral over unitaries is zero unless we deal with a diagonal matrix element of I, i.e. i = k and j = l. For the latter we have

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{rr}} & = & d\cdot r\cdot \displaystyle \frac{r}{d}\cdot \displaystyle \int {\mu }^{r}({\rm{d}}U)({U}_{i1}{U}_{1j}^{*}+{U}_{j1}{U}_{1i}^{*})({U}_{i1}{U}_{1j}^{*}+{U}_{j1}{U}_{1i}^{*})\\ & = & 2\cdot d\cdot r\cdot \displaystyle \frac{r}{d}\left(\displaystyle \frac{1}{{r}^{2}-1}-\displaystyle \frac{1}{r({r}^{2}-1)}\right)=\displaystyle \frac{2r}{r+1}.\end{eqnarray*}$

Sub-case 2: $i,j,k,l\gt r.$ In this case, the integral is

$\begin{eqnarray*}{\bar{I}}_{{ij};{kl}}^{{rr}} & = & 4\cdot d\cdot r\displaystyle \int \displaystyle \frac{{(1-{q}^{2})}^{2}}{{q}^{2}}{m}^{d,r}({\rm{d}}q)\displaystyle \int {\rm{Re}}(\langle i| {\psi }_{2}\rangle \langle {\psi }_{2}| j\rangle )\cdot {\rm{Re}}(\langle k| {\psi }_{2}\rangle \langle {\psi }_{2}| l\rangle ){\nu }^{d-r}({\rm{d}}{\psi }_{2})\\ & = & d\cdot r\cdot \displaystyle \frac{(d-r)(d-r+1)}{d(r-1)}\cdot \displaystyle \int {\mu }^{d-r}({\rm{d}}U)({U}_{i1}{U}_{1j}^{*}+{U}_{j1}{U}_{1i}^{*})({U}_{k1}{U}_{1\;l}^{*}+{U}_{l1}{U}_{1k}^{*}),\end{eqnarray*}$

where we have used formula (A.11) for the integral over q. A similar calculation as above shows that all off-diagonal elements are zero and

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{rr}} & = & 2d\cdot r\cdot \displaystyle \frac{(d-r)(d-r+1)}{d(r-1)}\cdot \left(\displaystyle \frac{1}{{(d-r)}^{2}-1}-\displaystyle \frac{1}{(d-r)({(d-r)}^{2}-1)}\right)\\ & = & \displaystyle \frac{2r}{r-1}.\end{eqnarray*}$

Sub-case 3: ( $i,j\leqslant r$ and $k,l\gt r$ ) or ( $i,j\gt r$ and $k,l\leqslant r$ ). In this case we deal with a product of integrals over unitaries of the form

$\begin{eqnarray*}&&\int {\mu }^{r}({\rm{d}}U){U}_{i1}{U}_{1j}^{*}=0.\end{eqnarray*}$

So all matrix elements are zero.

Sub-case 4: $i,k\leqslant r$ and $j,l\gt r.$ Again, the off-diagonal elements are zero and

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{rr}}= & = & 4\cdot d\cdot r\displaystyle \int (1-{q}^{2}){m}^{d,r}({\rm{d}}q)\displaystyle \int {\rm{Re}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{2}| j\rangle )\cdot {\rm{Re}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{2}| j\rangle ){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{r}({\rm{d}}{\psi }_{2})\\ & = & 2d\cdot r\cdot (1-\displaystyle \frac{r}{d})\cdot \displaystyle \frac{1}{r}\displaystyle \frac{1}{d-r}=2.\end{eqnarray*}$

In conclusion, the only non-zero elements of the real–real block are on the diagonal and are given by

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{rr}} & = & \displaystyle \frac{2r}{r+1},\qquad 1\leqslant i\lt j\leqslant r,\\ {\bar{I}}_{{ij};{ij}}^{{rr}} & = & 2,\qquad \qquad \;i\leqslant r\;\mathrm{and}\ \;r\lt j\leqslant {2}^{k},\\ {\bar{I}}_{{ij};{ij}}^{{rr}} & = & \displaystyle \frac{2r}{r-1},\qquad \;r\lt i\lt j\leqslant {2}^{k}.\end{eqnarray*}$

(E) Imaginary–imaginary block. This block is similar to the real–real one. All off diagonal elements are zero, and the diagonal ones are

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{ii}} & = & \displaystyle \frac{2r}{r+1},\qquad 1\leqslant i\lt j\leqslant r,\\ {\bar{I}}_{{ij};{ij}}^{{ii}} & = & 2,\qquad \qquad \;i\leqslant r\;\mathrm{and}\ \;r\lt j\leqslant {2}^{k},\\ {\bar{I}}_{{ij};{ij}}^{{ii}} & = & \displaystyle \frac{2r}{r-1},\qquad \;r\lt i\lt j\leqslant {2}^{k}.\end{eqnarray*}$

(F) Real-imaginary block. Next we show that all real-imaginary off-diagonal elements are equal to zero.

$\begin{eqnarray*}&&I{(\rho | {{\bf{B}}}_{U})}_{{ij};{kl}}^{{ri}}=-4\cdot \displaystyle \sum _{{\bf{o}}:{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})\gt 0}\displaystyle \frac{1}{{p}_{\rho }({\bf{o}}| {{\bf{B}}}_{U})}{\rm{Re}}(\langle i| {\bf{o}},U\rangle \langle {\bf{o}},U| j\rangle ){\rm{Im}}(\langle k| {\bf{o}},U\rangle \langle {\bf{o}},U| l\rangle ).\end{eqnarray*}$

By integration we obtain the corresponding matrix element of the average Fisher information matrix

$\begin{eqnarray*}{\bar{I}}_{{ij};{kl}}^{{rr}} & = & -4\cdot d\cdot r\displaystyle \int \displaystyle \frac{{\rm{Re}}(\langle i| \psi \rangle \langle \psi | j\rangle )\cdot {\rm{Im}}(\langle k| \psi \rangle \langle \psi | l\rangle )}{{\displaystyle \sum }_{k=1}^{r}{| \langle \psi | k\rangle | }^{2}}{\nu }^{d}({\rm{d}}\psi )\\ & = & -4\cdot d\cdot r\displaystyle \int \displaystyle \int \displaystyle \int \displaystyle \frac{1}{{q}^{2}}{\rm{Re}}(\langle i| \psi \rangle \langle \psi | j\rangle )\cdot {\rm{Im}}(\langle k| \psi \rangle \langle \psi | l\rangle ){m}^{d,r}({\rm{d}}q){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{d-r}({\rm{d}}{\psi }_{2}).\end{eqnarray*}$

Again, if one index is smaller that r while the other three are larger, or otherwise, then the matrix element in zero. We analyse the remaining cases.

Sub-case 1: $i,j,k,l\leqslant r.$ In this case the integral is

$\begin{eqnarray*}{\bar{I}}_{{ij};{kl}}^{{ri}} & = & -4\cdot d\cdot r\displaystyle \int {q}^{2}{m}^{d,r}({\rm{d}}q)\displaystyle \int {\rm{Re}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{1}| j\rangle ){\rm{Im}}(\langle k| {\psi }_{1}\rangle \langle {\psi }_{1}| l\rangle ){\nu }^{r}({\rm{d}}{\psi }_{1})\\ & = & i\cdot d\cdot r\cdot \displaystyle \frac{r}{d}\displaystyle \int {\mu }^{r}({\rm{d}}U)({U}_{i1}{U}_{1j}^{*}+{U}_{j1}{U}_{1i}^{*})({U}_{k1}{U}_{1\;l}^{*}-{U}_{l1}{U}_{1k}^{*}).\end{eqnarray*}$

Using formula (A.6) we find that the integral over unitaries is zero unless i = k and j = l. However even in this case, two of the four terms are zero, and the other two cancel each other.

Sub-case 2: $i,j,k,l\gt r.$ Here the integrals over the unitaries are similar as in case 1 above, with the difference that they taken with respect to ${\mu }^{d-r}({\rm{d}}U).$ Therefore the matrix elements are zero.

Sub-case 3: ( $i,j\leqslant r$ and $k,l\gt r$ ) or ( $i,j\gt r,k,l\leqslant r$ ). In this case we deal with a product of integrals over unitaries of the form

$\begin{eqnarray*}&&\int {\mu }^{r}({\rm{d}}U){U}_{i1}{U}_{1j}^{*}=0.\end{eqnarray*}$

so all matrix elements are zero.

Sub-case 4: $i,k\leqslant r$ and $j,l\gt r$ Again, the off-diagonal elements are zero and

$\begin{eqnarray*}{\bar{I}}_{{ij};{ij}}^{{ri}}= & = & -4\cdot d\cdot r\displaystyle \int (1-{q}^{2}){m}^{d,r}({\rm{d}}q)\displaystyle \int {\rm{Re}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{2}| j\rangle )\cdot {\rm{Im}}(\langle i| {\psi }_{1}\rangle \langle {\psi }_{2}| j\rangle ){\nu }^{r}({\rm{d}}{\psi }_{1}){\nu }^{r}({\rm{d}}{\psi }_{2})\\ & = & i\cdot d\cdot r\cdot \left(1-\displaystyle \frac{r}{d}\right)\displaystyle \int \displaystyle \int \left({U}_{i1}{V}_{1j}^{*}+{V}_{j1}{U}_{1i}^{*}\right)\left({U}_{i1}{V}_{1j}^{*}-{V}_{j1}{U}_{1i}^{*}\right){\mu }^{r}({\rm{d}}U){\mu }^{d-r}({\rm{d}}V)\\ & = & 0.\end{eqnarray*}$

In the last integral, two terms are zero and two have different signs and cancel each other. In conclusion all elements of the real-imaginary block are equal to zero.

Summary of the computation of $\bar{I}$ . We found that all off-diagonal blocks of $\bar{I}$ are zero

$\begin{eqnarray*}&&{\bar{I}}^{{dr}}={\bar{I}}^{{di}}={\bar{I}}^{{ri}}={\bar{I}}^{{rd}}={\bar{I}}^{{id}}={\bar{I}}^{{dr}}=0.\end{eqnarray*}$

Moreover the real and imaginary diagonal blocks are diagonal, equal to each other and have three distinct values depending on the position of the indices $i\lt j$ with respect to r:

$\begin{eqnarray*}{\bar{I}}^{{rr}/{ii}}={\rm{Diag}}\left\{\begin{array}{cccc}{\bar{I}}_{{ij};{ij}}^{{rr}/{ii}} & = & \displaystyle \frac{2r}{r+1}, & 1\leqslant i,j\leqslant r\\ {\bar{I}}_{{ij};{ij}}^{{rr}/{ii}} & = & 2, & 1\leqslant i\leqslant r,{\rm{a}}{nd}r\lt j\leqslant {2}^{k}\\ {\bar{I}}_{{ij};{ij}}^{{rr}/{ii}} & = & \displaystyle \frac{2r}{r-1} & r\lt i\lt j\leqslant {2}^{k}.\end{array}\right.\end{eqnarray*}$

Finally, the d × d block ${\bar{I}}^{{dd}}$ is not diagonal but has a simple form

$\begin{eqnarray*}{\bar{I}}_{{ii};{jj}}^{{dd}}=\left\{\begin{array}{cccc}{\bar{I}}_{{ii};{ii}}^{{dd}} & = & \displaystyle \frac{2r}{r+1}, & 1\leqslant i\leqslant r\\ {\bar{I}}_{{ii};{ii}}^{{dd}} & = & \displaystyle \frac{2r}{r-1}, & r\lt i\leqslant {2}^{k}\\ {\bar{I}}_{{ii};{jj}}^{{dd}} & = & \displaystyle \frac{r}{r+1}, & i,j\leqslant r,{\rm{a}}{nd}i\ne j\\ {\bar{I}}_{{ii};{jj}}^{{dd}} & = & \displaystyle \frac{r}{r+1}, & 1\leqslant i\leqslant r{\rm{a}}{nd}r\lt j\leqslant {2}^{k}\\ {\bar{I}}_{{ii};{jj}}^{{dd}} & = & \displaystyle \frac{r}{r+1}, & 1\leqslant j\leqslant r{\rm{a}}{nd}r\lt i\leqslant {2}^{k}\\ {\bar{I}}_{{ii};{jj}}^{{dd}} & = & \displaystyle \frac{r}{r-1} & r\lt i,j,{\rm{a}}{nd}i\ne j.\end{array}\right.\end{eqnarray*}$

The model ${{\mathcal{R}}}_{d,r}$ can be seen (locally) as the restriction of the full unconstrained model ${{\mathcal{S}}}_{r,d}$ parametrized by θ, to the subset of parameters $\tilde{\theta }$ which are real and imaginary parts of matrix elements ${\rho }_{i,j}$ with $i\leqslant r\lt j.$ Therefore the corresponding average Fisher information matrix is equal to the corresponding block of $\bar{I},$ i.e. $\bar{\tilde{I}}={21}_{2r(d-r)}.$ □

Spectral thresholding quantum tomography for low rank states

Article metrics

Author e-mails

Author affiliations

Author notes

Dates

Abstract

1. Introduction

2. Multiple ions tomography

3. The LSE

4. Rank-penalized and threshold projection estimator

4.1. Rank-penalized estimator

4.2. Physical threshold estimator

5. Lower bounds for rank-constrained estimation

6. Numerical results

6.1. Generation of random states and simulation of datasets

6.2. Computation of estimators

6.3. Simulation results

7. Conclusions and outlook

Acknowledgments

Appendix A.:

A.1. Proof of lemma 1

A.2. Proof of proposition 1

A.3. Proof of proposition 1

A.4. Proof of corollary 1

A.5. Proof upper bound physical estimator

A.6. The average Fisher information matrix for the full, unconstrained model, with random measurement design

Spectral thresholding quantum tomography for low rank states

Article metrics

Share this article

Author e-mails

Author affiliations

Author notes

Dates

Abstract

1. Introduction

2. Multiple ions tomography

3. The LSE

4. Rank-penalized and threshold projection estimator

4.1. Rank-penalized estimator

4.2. Physical threshold estimator

5. Lower bounds for rank-constrained estimation

6. Numerical results

6.1. Generation of random states and simulation of datasets

6.2. Computation of estimators

6.3. Simulation results

7. Conclusions and outlook

Acknowledgments

Appendix A.:

A.1. Proof of lemma 1

A.2. Proof of proposition 1

A.3. Proof of proposition 1

A.4. Proof of corollary 1

A.5. Proof upper bound physical estimator

A.6. The average Fisher information matrix for the full, unconstrained model, with random measurement design