Efficient online quantum state estimation using a matrix-exponentiated gradient method

In this paper, we explore an efficient online algorithm for quantum state estimation based on a matrix-exponentiated gradient method previously used in the context of machine learning. The state update is governed by a learning rate that determines how much weight is given to the new measurement results obtained in each step. We show convergence of the running state estimate in probability to the true state for both noiseless and noisy measurements. We find that in the latter case the learning rate has to be chosen adaptively and decreasing to guarantee convergence beyond the noise threshold. As a practical alternative we then propose to use running averages of the measurement statistics and a constant learning rate to overcome the noise problem. The proposed algorithm is numerically compared with batch maximum-likelihood and least-squares estimators. The results show a superior performance of the new algorithm in terms of accuracy and runtime complexity.

I. INTRODUCTION

The field of quantum information processing has grown rapidly over the past decade, largely motivated by the wide range of prospective applications of quantum computing, quantum cryptography, and quantum communications. However, building scalable quantum devices is still an enormous challenge. A core unsolved problem is the efficient characterization of quantum systems of intermediate size: can we check efficiently whether a quantum device comprised of a few qubits performs as intended? Practical considerations and, in particular, efficiency of the estimation procedure are at the forefront as quantum systems move beyond being a curiosity of experimental physics to prototype quantum technology devices.

The most fundamental characterization problem concerns state estimation: determining an unknown state of a quantum system using a series of different measurements. This procedure is referred to as quantum state estimation or quantum state tomography. Quantum state estimation usually refers to estimating the state using incomplete information, whereas quantum state tomography is often used to describe the situation where complete (and sometimes even noise-free) information about the state is assumed. The two terms can be used interchangeably, though we stick to the former throughout the paper. The literature on quantum state estimation is extensive (see, e.g., the survey text [PŘ04]), with methods ranging from simple linear inversion to least-squares (LS) regression [QHL+13], maximum likelihood (ML) estimation [Hra97, ŘHKL07], methods based on compressed sensing [GLF+10, FGLE12], and the Bayesian approach (see, e.g., [GCC16]). The maximum likelihood method is considered optimal in the sense that it yields a valid state that maximizes the probability of the observed data, and it converges to the true state in the limit of many measurements. A disadvantage of the method is that it often yields estimates at the boundary of the state set, i.e., states that are rank deficient.
Gradient-based approximation methods [BKGL17, SZN17] promise to be much faster, but they can produce non-physical states (with the estimate either having negative eigenvalues or being unnormalized), and convergence is in many cases not guaranteed. The former problem can be solved in practice by projecting the state back into the physical space [SGS12]. The same problem is also present in linear regression methods. The matrix-exponentiated gradient (MEG) method has found use in classical machine learning [TRW05, GKCC07] and offers an appealing alternative, as it ensures positive semidefiniteness of the matrix estimate by construction. In [LC17], MEG was applied to perform quantum tomography on qubits and to approximate the maximum likelihood estimate efficiently. In this paper, we chose MEG among other online estimation methods because we can show strong convergence results. Other efficient methods, such as projected gradients, would also be interesting to explore, but this is outside the scope of this paper.
In this work, we use the MEG technique to devise an efficient online estimator for quantum states. Our algorithm satisfies the following three desiderata: (1) it is online, providing a running estimate of the state as data is collected; (2) it is fast, with a runtime that scales well with the dimension of the system; and (3) it comes with a convergence proof. Many other techniques satisfy some of these properties, but we are not aware of any that satisfy all three. The main results of our work can be summarized as follows.
• The proposed algorithm is computationally more efficient than other approaches (such as online versions of ML and LS), scaling as O(d^3) instead of O(d^4), where d is the dimension of the quantum system.
Our algorithm is naturally online, which makes it interesting for many applications. For example, when large amounts of measurements need to be taken to verify a state, or when the state is likely to change over time, it can be beneficial to have a running estimate that allows for a rapid diagnosis of errors. While any batch algorithm (like the maximum likelihood estimator) can be run on a subset of the initial data points to create an online estimate, this creates a significant overhead that can be avoided using an online estimator.
Related work: A different perspective on quantum state learning has been taken in [Aar07, Aar18], where instead of learning a full description of the state the goal is only to predict future measurement outcomes. Concurrent with our work, this approach has also been generalized to the online setting in [ACHN18], also using variations of the MEG method. The first difference is that their work targets predicting future measurement outcomes based on previous ones, which can be achieved without full state tomography. The authors show, somewhat surprisingly, that this can be done up to constant error using only a number of measurements linear in the number of qubits, whereas full characterization requires exponentially many measurements (see, e.g., [HHJ+17]). Second, their error criterion is based on a mistake bound (i.e., the number of time steps where the prediction was far from the true value), whereas we aim to show asymptotic convergence to the true state. A technical consequence of this is that in [ACHN18] the learning rate can be chosen to be a constant, whereas we find that a decreasing learning rate is necessary for convergence.

II. SUMMARY OF MAIN RESULTS
Let us first describe the MEG update rule (see also Section III for more details). We assume that the true state, ρ, is finite-dimensional. The update algorithm takes four inputs: ρ̂_t is the estimate of ρ calculated in the previous step; X_t and ŷ_t are the observable and measurement outcome at time step t; and η_t is the learning rate at time step t. The algorithm then returns the next estimate of the state, ρ̂_{t+1}, as follows.
Algorithm 1 Matrix-exponentiated gradient update rule for quantum state estimation

function MEG-Update(ρ̂_t, X_t, ŷ_t, η_t)
    G_{t+1} ← log(ρ̂_t) − η_t ∇L_t(ρ̂_t)    ▷ correct by the gradient of the loss function
    return ρ̂_{t+1} ← exp(G_{t+1}) / tr exp(G_{t+1})    ▷ our next estimate, properly normalized
end function

First, we introduce the use of MEG for online quantum state estimation in the ideal case where there is no noise in the measurements. This case approximates the situation where, experimentally, a very large number of shots of each measurement are taken. The number of shots refers to the number of copies of the state used to estimate the counts of each possible outcome. First, the initial estimate is chosen arbitrarily to be the completely mixed state, i.e., ρ̂_1 = (1/d) I_d. Next, a measurement operator X_t is selected at random, and the noiseless measurement is performed to obtain ŷ_t = tr(ρX_t). In this setting the learning rate is chosen to be any constant with 0 < η < 1/2. Finally, the estimate is updated according to the MEG rule as in Algorithm 1. The estimate in this case converges in probability to the true state if the random set of measurements forms a unitary one-design, e.g., if they are Pauli measurements in the case of one or more qubits. In other words, we show that for all δ > 0,

lim_{t→∞} Pr( ||ρ̂_t − ρ||_F ≥ δ ) = 0,    (1)

FIG. 1. The infidelity is averaged over 1000 randomly generated quantum states and plotted versus the iteration number. The three lines correspond to the proposed matrix-exponentiated gradient (MEG) method, the maximum likelihood (ML) estimator, and the least-squares (LS) estimator. The number of shots per measurement is 1000.
where ||·||_F denotes the Frobenius norm (or any other matrix norm) and the probability is taken over the choice of measurements. In fact, we can show that convergence in Frobenius norm is essentially as fast as 1/√t in the following sense. For any α ∈ (0, 1/2), we have

lim_{T→∞} Pr( ||ρ̂_t − ρ||_F ≥ 1/t^α ) = 0,    (2)

where the probability is taken over the measurement choices as well as over t chosen uniformly from the set {1, 2, . . . , T}. Essentially this tells us that the probability of a random t exceeding the bound 1/t^α vanishes, even though we cannot guarantee that the bound is satisfied for any fixed t. The proof of this behavior is presented in Section IV B.

Let us next discuss the (more realistic) case of noisy measurements. Here we take a finite number of shots per measurement, so that ŷ_t is a random variable with mean tr(ρX_t) and a variance that depends on the number of shots. In this noisy case the previous scheme will not converge. To see this, assume that at some iteration we hit the true state, ρ̂_t = ρ. Even for this state the gradient will be non-zero, because in general ŷ_t ≠ tr(ρX_t), and thus the update rule will push the estimate away from the true state. To avoid this behavior, we propose a scheme with an adaptive, decreasing learning rate. We show that a convergence guarantee in the form of (2) holds, although the convergence will be slower. To achieve this, for any α ∈ (0, 1/4), we set the learning rate to η_t = (1/4) t^{−β} with β = 3/4 − α to find that the MEG algorithm satisfies

lim_{T→∞} Pr( ||ρ̂_t − ρ||_F ≥ 1/t^α ) = 0,    (3)

where the probability is taken over the measurement choices and outcomes, as well as over t chosen uniformly from the set {1, 2, . . . , T}. Section IV C discusses the proof of this statement. Finally, for our numerical testing in low dimensions we propose another approach to the problem of noisy measurements, using a running average of the outcomes of each measurement. This is effectively equivalent to increasing the number N of shots as measurements are repeated.
This means that eventually the algorithm approaches the noise-free case and convergence is thus ensured (we leave this as an informal statement). Moreover, numerical simulations show that this method converges faster than using an adaptive learning rate. Figure 1 compares the convergence of our algorithm to an ML and an LS estimator for 1-, 2-, 3- and 4-qubit systems, showing that the proposed algorithm converges to the same accuracy as the other two methods. We use the infidelity between the true state ρ and the estimate ρ̂_t as an accuracy measure, defined as 1 − (tr√(√ρ ρ̂_t √ρ))². So, in terms of accuracy measured by infidelity, MEG can perform as well as the other methods. Further numerical results can be found in Section V. In terms of complexity, however, MEG outperforms the other methods with a cost of O(d^3) per update compared to O(d^4) for ML and LS. The bottleneck for MEG is the matrix exponentiation step in the update, as seen in Algorithm 1.
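For reference, the infidelity measure used throughout can be computed with a short numerical routine. The sketch below is illustrative (it is not taken from the paper's code release [You18]); it uses eigendecompositions to take the required matrix square roots:

```python
import numpy as np

def infidelity(rho, sigma):
    """Infidelity 1 - F(rho, sigma), with fidelity
    F = (tr sqrt(sqrt(rho) sigma sqrt(rho)))^2."""
    # Hermitian square root of rho via eigendecomposition
    w, V = np.linalg.eigh(rho)
    sqrt_rho = V @ np.diag(np.sqrt(np.clip(w, 0, None))) @ V.conj().T
    # eigenvalues of sqrt(rho) sigma sqrt(rho) are real and nonnegative
    wm = np.linalg.eigvalsh(sqrt_rho @ sigma @ sqrt_rho)
    F = np.sum(np.sqrt(np.clip(wm, 0, None))) ** 2
    return 1.0 - F
```

For identical states the infidelity vanishes, and for orthogonal pure states it equals 1.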

III. PRELIMINARIES
We give a detailed description of the problem of online quantum state estimation and an overview of the matrix-exponentiated gradient (MEG) update rule.
A. Problem statement

Given a quantum system in an unknown state ρ, we wish to find an estimated quantum state ρ̂ based on the classical outcomes of measurements performed on copies of the system. The system has dimension d, so for an m-qubit system we have d = 2^m. For the numerical simulations in this paper we consider such m-qubit systems and perform Pauli measurements on each individual qubit. We denote the set of measurement operators by U = {X^[i]}. The outcome of such a binary measurement is a classical bit; we call the outcomes "up" and "down", corresponding to the ±1 eigenvalues of the Pauli operator. In order to do tomography, we assume that we have an ensemble of identically prepared quantum systems in the same unknown state ρ, so we can perform independent measurements on each of the subsystems and calculate the average outcome. Selecting a measurement operator X_t = X^[i(t)] at time step t, the expected value of the measurement, denoted y_t, as predicted by the Born rule is given by y_t = tr(ρX_t), while the actual average obtained by repeating the experiment N times is the random variable

ŷ_t = (n_↑ − n_↓)/N.

Here, n_↑ is the number of times the "up" outcome was observed, while n_↓ is the number of times the "down" outcome was observed. We know that n_↑ follows a binomial distribution: given a measurement operator represented in terms of its eigenvalue projectors as X_t = Π_↑ − Π_↓, we have n_↑ ~ B(N, p) with p = tr(ρΠ_↑). It is then easy to verify that

E{ŷ_t} = y_t,  Var(ŷ_t) = (1 − y_t²)/N.

We can then repeat the whole procedure and obtain a sequence of data points of the form {(X_1, ŷ_1), . . . , (X_t, ŷ_t), . . .}. Notice that the measurement outcomes ŷ_t form an independent and identically distributed (i.i.d.) set of random variables. Since we are proposing an online algorithm, we do not have the whole data set in advance. We obtain one point at a time and use it to update an estimate ρ̂_t of the true state. We would like ρ̂_t to converge to ρ as t increases.
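The measurement model above is straightforward to simulate. The sketch below (illustrative; the function name is ours) draws n_↑ from the binomial distribution B(N, p) and returns ŷ_t = (n_↑ − n_↓)/N:

```python
import numpy as np

def measure(rho, X, N, rng):
    """Simulate N shots of the binary observable X on state rho.

    Returns y_hat = (n_up - n_down) / N, which has mean tr(rho X)
    and variance (1 - tr(rho X)^2) / N.
    """
    # projector onto the +1 eigenspace of X (X = P_up - P_down)
    w, V = np.linalg.eigh(X)
    P_up = V[:, w > 0] @ V[:, w > 0].conj().T
    p = np.trace(rho @ P_up).real      # Born rule: Pr("up")
    n_up = rng.binomial(N, p)
    return (2 * n_up - N) / N          # n_down = N - n_up
```

Averaging many such simulated outcomes recovers the stated mean and variance.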

B. The matrix-exponentiated gradient (MEG) method
The MEG method was proposed in [TRW05, GKCC07] for classical machine learning applications involving symmetric matrices. The algorithm generalizes straightforwardly to Hermitian matrices. Given a new data point (X_t, ŷ_t), the loss function at time step t, evaluated for a general quantum state σ, is defined as

L_t(σ) = (tr(σX_t) − ŷ_t)².

The gradient of the loss function at time step t is then

∇L_t(σ) = 2 (tr(σX_t) − ŷ_t) X_t.

Consider now the following online cost function:

F(ρ̂_{t+1}) = D(ρ̂_{t+1} || ρ̂_t) + η_t L_t(ρ̂_{t+1}),

where D is Umegaki's quantum relative entropy [Ume62], defined as D(ρ||σ) = tr(ρ log(ρ) − ρ log(σ)) for any two states ρ and σ, and η_t is the learning rate. This cost function represents two conflicting goals. The first is to have an estimate that is near the previous estimate, quantified by the relative entropy. This is important because in the online setting of the problem we do not want the algorithm to forget what it has learnt so far. The second goal is to move the new estimate so that the loss function at the new data point is hopefully smaller. The learning rate η_t controls this trade-off. Minimizing the cost function with respect to ρ̂_{t+1} by taking the gradient (see Appendix A in [TRW05] for the details of the calculation) and setting it to zero results in

log(ρ̂_{t+1}) = log(ρ̂_t) − η_t ∇L_t(ρ̂_{t+1}) − I,

where I denotes the identity matrix. Since we cannot find an explicit form for ρ̂_{t+1}, we may approximate ρ̂_{t+1} by ρ̂_t in the gradient to arrive at log(ρ̂_{t+1}) = log(ρ̂_t) − η_t ∇L_t(ρ̂_t) − I, or, equivalently,

ρ̂_{t+1} = exp( log(ρ̂_t) − η_t ∇L_t(ρ̂_t) − I ).

This form of the update rule ensures that if we start with a positive definite matrix ρ̂_t and a Hermitian operator X_t, then the new estimate ρ̂_{t+1} is positive definite. This is because the terms inside the exponential function are Hermitian, and the matrix exponential of a Hermitian matrix is positive definite. Next, we want to make sure that the estimate has unit trace, so that it is a valid quantum state. We therefore normalize to finally obtain the MEG rule:

ρ̂_{t+1} = exp( log(ρ̂_t) − η_t ∇L_t(ρ̂_t) ) / tr exp( log(ρ̂_t) − η_t ∇L_t(ρ̂_t) ).
The update rule can also be expressed in the following compact alternative form:

ρ̂_{t+1} = exp( log(ρ̂_t) − 2 η_t (tr(ρ̂_t X_t) − ŷ_t) X_t ) / tr exp( log(ρ̂_t) − 2 η_t (tr(ρ̂_t X_t) − ŷ_t) X_t ).    (12)
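In code, one MEG step amounts to a gradient evaluation, a Hermitian matrix logarithm and exponential (both via eigendecompositions), and a normalization. The sketch below is illustrative (the function name is ours, not from the paper's code release [You18]):

```python
import numpy as np

def meg_update(rho_hat, X, y_hat, eta):
    """One MEG step: rho_{t+1} = exp(log rho_t - eta * grad) / tr(...),
    with grad = 2 (tr(rho_t X) - y_hat) X the gradient of the squared loss."""
    grad = 2.0 * (np.trace(rho_hat @ X).real - y_hat) * X
    # Hermitian matrix log via eigendecomposition
    w, V = np.linalg.eigh(rho_hat)
    log_rho = V @ np.diag(np.log(w)) @ V.conj().T
    # Hermitian matrix exp of the corrected generator
    w2, V2 = np.linalg.eigh(log_rho - eta * grad)
    E = V2 @ np.diag(np.exp(w2)) @ V2.conj().T
    return E / np.trace(E).real      # normalize to unit trace
```

Since the exponential of a Hermitian matrix is positive definite, the estimate remains a valid state at every step; the O(d^3) eigendecompositions dominate the cost.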

IV. CONVERGENCE ANALYSIS
This section starts by stating some bounds related to the MEG update rule. Next, the proof of convergence for the noise-free case is given, followed by the proof of convergence in the noisy case. Finally, the proposed running-average technique is discussed. Some additional proofs are provided in Appendix B.

A. General bounds on the loss functions for the MEG rule
We start by stating the following lemma, which bounds the normalization constant appearing in the MEG update rule, for measurement operators satisfying −I ≤ X_t ≤ I; the normalization ensures that the updated estimate has unit trace. This bound will be used to prove other important results. The proof is given in Appendix B, generalizing the methods of [TRW05], which involved real symmetric matrices, to complex Hermitian matrices.
Lemma 1. The normalization constant in the MEG update rule is bounded by

Next, we state the following lemma, which bounds the difference between the loss function evaluated at the estimate and at a general comparison state. The lemma relates this difference to the progress of the estimator towards that comparison state. This is the main lemma used to prove the convergence of MEG. Appendix B gives the proof, generalizing the results of [TRW05] to the quantum setting.
Lemma 2. Given the loss function L_t(ρ̂_t) = (tr(ρ̂_t X_t) − ŷ_t)² with measurement operators −I ≤ X_t ≤ I and learning rate 0 < η < 1/2, then for any state σ,

This leads to the following corollary, which bounds the loss function of the estimate when the true state is used as the comparison state, in the case of noise-free measurements (i.e., ŷ_t = y_t).
Corollary 1. Given the loss function L_t(ρ̂_t) = (tr(ρ̂_t X_t) − y_t)² with measurement operators −I ≤ X_t ≤ I and learning rate 0 < η < 1/2, and given the true state ρ, the following relation holds:

Proof. Apply Lemma 2, set σ = ρ, and use the fact that L_t(ρ) = 0.

B. Convergence analysis for noise-free measurements
The choice of measurements for quantum state estimation is in principle arbitrary. In this paper, however, we consider the case of performing local Pauli measurements on each qubit of a multiqubit system. This facilitates the experimental realization compared to performing some other, possibly global, measurement. The proofs start by calculating some expectation values involving Pauli operators and loss functions. These results will then be used to prove the main theorem showing the convergence of MEG in the noise-free case. We start with the following lemma about the set of Pauli operators for multi-qubit systems.
Lemma 3. The set U = {X^[i]}_{i=0}^{d²−1} of Pauli operators, including the identity operator, in a d-dimensional quantum system satisfies

Σ_{i=0}^{d²−1} X^[i] ⊗ X^[i] = d P_21,

where P_21 is the swap operator, defined as P_21 = Σ_{i,j} |ij⟩⟨ji|.

Proof. The Paulis form a unitary orthonormal basis of Hermitian d × d matrices. Therefore, they form a quantum 1-design due to Proposition 6 in [ADA13]. In other words,

(1/d²) Σ_{i=0}^{d²−1} X^[i] ρ X^[i] = tr(ρ) I_d / d.

Next, we calculate the expectation value of a Pauli operator that is tensored with itself. This calculation will be needed in the calculation of the expectation of the loss function.
Lemma 4. If the measurement operator X_t is chosen uniformly at random from the set U − {I}, then

E{X_t ⊗ X_t} = (d P_21 − I_d ⊗ I_d) / (d² − 1).

Proof. We have

E{X_t ⊗ X_t} = (1/(d²−1)) Σ_{i=1}^{d²−1} X^[i] ⊗ X^[i] = (1/(d²−1)) ( Σ_{i=0}^{d²−1} X^[i] ⊗ X^[i] − I_d ⊗ I_d ) = (d P_21 − I_d ⊗ I_d)/(d²−1),

where the last equality holds from Lemma 3, and I_d is the identity operator of dimension d.
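The single-qubit (d = 2) instance of the identity in Lemma 3, Σ_i X^[i] ⊗ X^[i] = 2 P_21, can be checked numerically:

```python
import numpy as np

# the four single-qubit Paulis, including the identity
I2 = np.eye(2, dtype=complex)
sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0]).astype(complex)

# swap operator P_21 on C^2 ⊗ C^2: P_21 |i, j> = |j, i>
P21 = np.zeros((4, 4), dtype=complex)
for i in range(2):
    for j in range(2):
        P21[2 * j + i, 2 * i + j] = 1.0

# sum of X ⊗ X over the full Pauli set
S = sum(np.kron(M, M) for M in (I2, sx, sy, sz))
```

For qubits this is the familiar identity SWAP = (1/2)(I⊗I + X⊗X + Y⊗Y + Z⊗Z).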
The following lemma is a commonly-used result in quantum information. The proof is direct; see, for example, Lemma 1.2.1 in [Low10].
Lemma 5 (Swap trick). For any quantum system of arbitrary dimension, and for any two operators M and N, we have tr(MN) = tr[(M ⊗ N) P_21], where P_21 is the swap operator on the doubled quantum system (it interchanges the two copies).
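The swap trick is easy to confirm numerically for random (even non-Hermitian) operators:

```python
import numpy as np

def swap_op(d):
    """Swap operator P_21 on C^d ⊗ C^d: P_21 |i, j> = |j, i>."""
    P = np.zeros((d * d, d * d), dtype=complex)
    for i in range(d):
        for j in range(d):
            P[d * j + i, d * i + j] = 1.0
    return P

rng = np.random.default_rng(1)
d = 3
M = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
N_op = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
lhs = np.trace(M @ N_op)                       # tr(MN)
rhs = np.trace(np.kron(M, N_op) @ swap_op(d))  # tr((M ⊗ N) P_21)
```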
We are now ready to prove the following lemma in which the expectation of the loss function is calculated.
Lemma 6. Assuming we select the measurement operator X_t at each time iteration uniformly at random from the set U − {I}, then for any true state ρ and any state σ independent of X_t,

E{L_t(σ)} = (d/(d² − 1)) ||σ − ρ||²_F.

Proof. From the definition of the loss function, L_t(σ) = (tr(σX_t) − tr(ρX_t))² = (tr[(σ − ρ)X_t])². Taking the expectation of the loss function with respect to X_t, we get

E{L_t(σ)} = tr[ ((σ − ρ) ⊗ (σ − ρ)) E{X_t ⊗ X_t} ].

Then, applying Lemma 4,

E{L_t(σ)} = (1/(d² − 1)) tr[ ((σ − ρ) ⊗ (σ − ρ)) (d P_21 − I_d ⊗ I_d) ].

Now, applying the swap trick in Lemma 5, and using tr(σ − ρ) = 0,

E{L_t(σ)} = (1/(d² − 1)) [ d tr((σ − ρ)²) − (tr(σ − ρ))² ] = (d/(d² − 1)) ||σ − ρ||²_F.

In particular, it is clear that at any time step t, if σ ≠ ρ, then E{L_t(σ)} > 0. Now, we can show the following theorem considering convergence of the noiseless MEG.
Theorem 1. The state estimate using the MEG update rule converges in probability to the true state, i.e., for any δ > 0,

lim_{t→∞} Pr( ||ρ̂_t − ρ||_F ≥ δ ) = 0.

Proof. We know from Corollary 1 that,

Taking the expectation with respect to X_t,

Applying Lemma 6, and using the fact that ρ̂_t is independent of X_t, we get

Taking the expectation of the above inequality over all past time iterations,

Next, we sum the inequality over the time iterations to get

If we now take the limit as T → ∞, we obtain

where the last line follows from the fact that the true state ρ and the initial estimate ρ̂_1 are independent of X_t and y_t. Now the right-hand side of the inequality is constant, so the series on the left-hand side of the inequality converges. By the divergence test, this implies that

Now we can apply Lemma 12 to the random variable Z_t = ||ρ̂_t − ρ||²_F to conclude that

Therefore, the estimate ρ̂_t converges in probability to the true state ρ.
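The key ingredient of the proof is the expectation identity of Lemma 6, E{L_t(σ)} = (d/(d² − 1)) ||σ − ρ||²_F. For a single qubit (d = 2, so the prefactor is 2/3) this can be checked exactly by averaging over the three non-identity Paulis:

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0]).astype(complex)

def random_state(rng):
    """Random single-qubit density matrix (Hermitian, PSD, unit trace)."""
    A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    r = A @ A.conj().T
    return r / np.trace(r).real

rng = np.random.default_rng(5)
rho, sigma = random_state(rng), random_state(rng)
delta = sigma - rho
# exact average of the noiseless loss over X uniform in {sx, sy, sz}
avg_loss = np.mean([np.trace(delta @ P).real ** 2 for P in (sx, sy, sz)])
predicted = (2 / 3) * np.linalg.norm(delta) ** 2   # d/(d^2-1) = 2/3 for d = 2
```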
Finally, we prove the main theorem, which shows a stronger statement for the convergence of the MEG algorithm in the case of noise-free measurements.
Theorem 2. Let δ ∈ (0, 1). Then for any α ∈ (0, 1) and learning rate 0 < η < 1/2, there exists T_0, given by

such that for any T > T_0 we have,

where the probability is taken over all measurement choices and t uniform in {1, 2, . . . , T}. Moreover,

Proof. Let the initial estimate be ρ̂_1 = I_d/d. We know from Corollary 1 that,

Taking the expectation with respect to X_t,

Applying Lemma 6, and using the fact that ρ̂_t is independent of X_t, we get

Taking the expectation of the above inequality over all past time iterations,

Next, we sum the inequality over the time iterations to get

Now, let ε_t = E{||ρ̂_t − ρ||²_F}, δ_t = 1/t^{α+γ}, and γ = (2/3)(1 − α). Notice that α + γ < 1. Define the set

Rearranging the terms in the inequality, we get

In other words, the ratio between the number of iterations in which E{||ρ̂_t − ρ||²_F} ≥ 1/t^{α+γ} and the total number T of iterations performed so far is bounded by

where K := ((d² − 1)/(ηd)) log d. This implies that

because α + γ < 1. This means that increasing the number of iterations decreases the fraction of times where the estimate was not accurate enough. Let us state this formally. Assume we perform a total number of iterations T. If we select at random a fixed time step 1 ≤ t̄ ≤ T, then there are two possible outcomes: either ε_t̄ ≤ δ_t̄ or ε_t̄ > δ_t̄. Assume we get the first outcome; then, by applying Markov's inequality,

Now, we can find the joint probability Pr_{t,ρ̂_t}

Applying Jensen's inequality to the second term (noting that f(x) = x^r is a concave function for 0 < r < 1) yields

Therefore,

Now, let

Then, if we choose T > T_0,

or,

Notice that taking the limit as T → ∞ we may take δ → 0, and the limit statement of the theorem follows.

C. Convergence analysis for noisy measurements

In this part, we show that using an adaptive learning rate with noisy measurements results in the convergence of the MEG estimate to the true state. First, some expectation values are calculated based on techniques similar to those in the noiseless case.
After that, an adaptive learning rate is derived that ensures convergence of the estimate to the true state in probability. However, the learning rate in this case depends on the true state, which is not practical. So, finally, we show that we can choose a learning rate independent of the true state and prove an even stronger statement of convergence.
We start with the following lemma, which calculates the expectation value of the noise term that appears in the loss function due to performing a finite number of measurement shots.
Lemma 7. For measurement operators chosen uniformly at random from the set U − {I}, and any state ρ, the expectation values satisfy the relation

E{ (tr(ρX_t))² } = (d tr(ρ²) − 1) / (d² − 1).

Proof. We have

E{ (tr(ρX_t))² } = tr[ (ρ ⊗ ρ) E{X_t ⊗ X_t} ].

Applying now Lemma 4, we get

E{ (tr(ρX_t))² } = (1/(d² − 1)) tr[ (ρ ⊗ ρ)(d P_21 − I_d ⊗ I_d) ] = (d tr(ρ²) − 1)/(d² − 1),

where the swap trick in Lemma 5 is used in the second line.
Next, we give the following lemma to calculate the expectation of the loss function for the case of noisy measurements.
Lemma 8. Assuming we select the measurement operator X_t at each time iteration uniformly at random from the set U − {I}, then for any true state ρ and any state σ independent of X_t and ŷ_t for any t,

E{L_t(σ)} = (d/(d² − 1)) ||σ − ρ||²_F + d (d − ||ρ||²_F) / (N (d² − 1)).

Proof. Recall the noisy loss function, L_t(σ) = (tr(σX_t) − ŷ_t)². Note that σ is independent of ŷ_t, but can depend on the previous history. So, the expectation over ŷ_t can be calculated as

E_{ŷ_t}{L_t(σ)} = (tr(σX_t) − y_t)² + Var(ŷ_t) = (tr[(σ − ρ)X_t])² + (1 − y_t²)/N.

Now, let us take the expectation with respect to X_t:

E{L_t(σ)} = (d/(d² − 1)) ||σ − ρ||²_F + (1/N) [ 1 − (d ||ρ||²_F − 1)/(d² − 1) ] = (d/(d² − 1)) ||σ − ρ||²_F + d (d − ||ρ||²_F)/(N (d² − 1)),

where we used the results of Lemmas 6 and 7 in the last step. Notice that as N → ∞, the result of the noiseless case is recovered.
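Assuming the noisy expectation takes the form E{L_t(σ)} = (d/(d² − 1))||σ − ρ||²_F + d(d − ||ρ||²_F)/(N(d² − 1)), the single-qubit instance can be checked exactly by averaging the per-Pauli expected loss (the noiseless term plus the binomial variance (1 − y_t²)/N) over the three non-identity Paulis:

```python
import numpy as np

sx = np.array([[0, 1], [1, 0]], dtype=complex)
sy = np.array([[0, -1j], [1j, 0]])
sz = np.diag([1.0, -1.0]).astype(complex)

def random_state(rng):
    """Random single-qubit density matrix."""
    A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
    r = A @ A.conj().T
    return r / np.trace(r).real

rng = np.random.default_rng(7)
rho, sigma = random_state(rng), random_state(rng)
N = 10                                   # shots per measurement
# average over X uniform in {sx, sy, sz} of E_y{ (tr(sigma X) - y_hat)^2 }
terms = []
for P in (sx, sy, sz):
    y = np.trace(rho @ P).real           # Born-rule mean of y_hat
    terms.append((np.trace(sigma @ P).real - y) ** 2 + (1 - y ** 2) / N)
avg_loss = np.mean(terms)
purity = np.trace(rho @ rho).real
predicted = (2 / 3) * np.linalg.norm(sigma - rho) ** 2 \
    + (2 / (3 * N)) * (2 - purity)       # d = 2 instance of the formula
```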
Consequently, the following result shows that the true state is the optimal state that minimizes the loss function.
Corollary 2. The state ρ is the unique state that minimizes the expectation of the noisy loss function E{L_t(σ)}.

The following theorem shows how to select an adaptive learning rate that results in convergence of the MEG estimate in probability for noisy measurements. The proof is given in Appendix B.
Theorem 3. In the presence of noise, the state estimate using the MEG update rule with learning rate

converges in probability to the true state, i.e., for all δ > 0,

The problem with this choice of learning rate is that it depends on the true state. This might be useful in other applications, like state tracking, but it is not practical for tomography applications, where the true state is unknown. So we show next that we can in fact select another form of the learning rate that is independent of the true state, and prove a stronger statement of convergence.
Theorem 4. Let δ ∈ (0, 1), α ∈ (0, 1/2), and β ∈ (1/2, 1 − α). If we choose a learning rate of the form η_t = (1/4) t^{−β}, then there exists T_0, given by

such that for any T > T_0 we have,

where the probability is taken over all measurement choices and t uniform in {1, 2, . . . , T}. Moreover,

Proof. Let the initial estimate be ρ̂_1 = I_d/d. We know from Lemma 2 that,

Taking the expectation with respect to ŷ_t, followed by the expectation with respect to X_t, we get

Applying Lemma 8, we get

Simplifying this expression and taking the expectation with respect to all previous time instants, we get

The second term on the left-hand side is a function of the purity of the true state (defined as ||ρ||²_F). This term comes from the variance of the noise, which varies according to the location of the state. It can be bounded so that the inequality becomes

Summing the inequality over the different time steps, we get

Now, choosing the learning rate η_t = (1/4) t^{−β}, the inequality becomes

where ζ(·) is the Riemann zeta function. Now, let ε_t = E{||ρ̂_t − ρ||²_F}, δ_t = 1/t^{α+γ}, and γ = (2/3)(1 − α − β). Notice that α + β + γ < 1 as long as α + β < 1. Define the set

Rearranging the terms in the inequality, we get

In other words, the ratio between the number of iterations in which E{||ρ̂_t − ρ||²_F} ≥ 1/t^{α+γ} and the total number T of iterations performed so far is bounded by

where K is a constant depending on d and the learning-rate parameters; the ratio vanishes as T → ∞ because α + β + γ < 1. This means that increasing the number of iterations decreases the fraction of times where the estimate was not accurate enough. Let us state this formally.
Assume we perform a total number of iterations T. If we select at random a fixed time step 1 ≤ t̄ ≤ T, then there are two possible outcomes: either ε_t̄ ≤ δ_t̄ or ε_t̄ > δ_t̄. Assume we get the first outcome; then, by applying Markov's inequality,

Now, we can find the joint probability Pr_{t,ρ̂_t}

Applying Jensen's inequality to the second term (noting that f(x) = x^r is a concave function for 0 < r < 1), we obtain

Therefore,

Now, let

Then, if we choose T > T_0,

or,

Finally, taking the limit as T → ∞, we obtain

or, equivalently,

D. Convergence of MEG in the noisy case with averaging
As discussed previously, taking a running average over repeated measurements with a small number of shots is equivalent to increasing the number of shots without having to do so experimentally for each measurement. This method also does not require an adaptive learning rate. Given the data point (X_t, ŷ_t), we calculate the running average

ȳ_t = (1/n_{X_t}) Σ_{i=1}^{n_{X_t}} ŷ_{r_i},

where {r_i}_{i=1}^{n_{X_t}} = {τ ≤ t : X_τ = X_t} are the time indices at which the measurement operator X_t has appeared so far (so that r_{n_{X_t}} = t), and n_{X_t} is the number of such appearances up to time t. If the measurement operators are chosen at random, then after a sufficiently large number of iterations we may assume that all operators have been visited roughly equally often, so as t → ∞, n_{X_t} → ∞. Now, from the strong law of large numbers,

ȳ_t → tr(ρX_t) almost surely,

so the gradient of the loss function evaluated at the true state satisfies ∇L_t(ρ) → 0. In other words, after a sufficient number of iterations, the situation becomes similar to the noise-free measurement case, which allows convergence to the true state with a constant learning rate.
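A minimal sketch of the running-average variant (class and attribute names are ours, not from the paper's code release [You18]): per measurement operator we store a sum and a count, feed the running mean ȳ_t into the MEG step, and keep the learning rate constant:

```python
import numpy as np
from collections import defaultdict

class RunningAverageMEG:
    """MEG with per-operator running averages and a constant learning rate."""

    def __init__(self, d, eta=0.4):
        self.rho = np.eye(d, dtype=complex) / d     # start: maximally mixed
        self.eta = eta
        self.sums = defaultdict(float)              # sum of outcomes per operator
        self.counts = defaultdict(int)              # visits per operator

    def step(self, key, X, y_hat):
        """key identifies the measurement operator X; y_hat is its new outcome."""
        self.sums[key] += y_hat
        self.counts[key] += 1
        y_bar = self.sums[key] / self.counts[key]   # running average
        grad = 2.0 * (np.trace(self.rho @ X).real - y_bar) * X
        # MEG update via Hermitian log/exp (eigendecompositions)
        w, V = np.linalg.eigh(self.rho)
        G = V @ np.diag(np.log(w)) @ V.conj().T - self.eta * grad
        w2, V2 = np.linalg.eigh(G)
        E = V2 @ np.diag(np.exp(w2)) @ V2.conj().T
        self.rho = E / np.trace(E).real
        return self.rho
```

As t grows, ȳ_t concentrates around tr(ρX_t), so the update behaves increasingly like the noise-free case, consistent with the informal argument above.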

V. SIMULATION RESULTS
This section discusses the methods and results of the numerical simulations. An overview of the simulation settings is given first, followed by a discussion of the significance of the results.

A. Methods
In order to assess the performance of the proposed method, we created a dataset consisting of 1000 randomly generated quantum states for 1-, 2-, 3-, 4-, and 5-qubit systems, and simulated 100000 random measurement outcomes with 10, 100, 1000, and 10000 shots for each of these states. The estimate after each measurement is calculated and compared to the true state using the infidelity measure 1 − (tr√(√ρ ρ̂_t √ρ))². Figure 2 shows the behavior of MEG under different learning rates of the form η_t = 0.5 t^{−β}, compared with the running average (RA) method with a constant learning rate. The plot shows that the running average leads to the fastest convergence compared with any of the adaptive learning rates. We therefore choose the RA method for further discussion.
In addition to the matrix-exponentiated gradient (MEG) estimator, the least-squares (LS) method of [QHL+13] and the diluted maximum likelihood (ML) method of [ŘHKL07] are also implemented and used for comparison in the setting of 1-, 2-, 3-, and 4-qubit systems. In these simulations, the learning rate of the MEG rule is 0.5. For the maximum likelihood method, the iteration step parameter (controlling the dilution) is taken to be 0.1. Since this value is much smaller than 1, it is guaranteed that the likelihood increases after each internal iteration, as proved in [ŘHKL07]. The number of internal ML iterations is chosen to be 10, a small number that reduces the total runtime of this method. In other words, for every new data point, we recalculate the ML estimate starting from the previous estimate using 10 iterations, and then evaluate the infidelity. An optimal setting would be a variable number of internal iterations that starts out large and decreases afterwards. However, it should be noted that the objective of this work is not to optimize the implementation of ML, but to have the simplest implementation for comparison purposes. Additionally, we are more interested in the asymptotic behavior of the estimators: after a large number of data points, the estimate will be very near the true state, so there is no need for a large number of internal ML iterations at that stage. The source code is publicly available [You18].

FIG. 2. Simulation results for MEG estimation for a single-qubit system and 100 shots per measurement. The infidelity of the proposed matrix-exponentiated gradient (MEG) method is averaged over 1000 randomly generated quantum states and plotted versus the iteration number. The plot shows the running average case, as well as the variable learning rate η_t = 0.5 t^{−β} for different values of β.
Figure 1 shows the infidelity versus the number of iterations for 1-, 2-, 3-, and 4-qubit systems when the number of shots per measurement is taken to be 1000. Figure 3 shows the performance for a 4-qubit system at different numbers of measurement shots. For 5-qubit systems, only the performance of MEG is assessed, as shown in Figure 4.

B. Discussion
The maximum-likelihood method is a batch method that requires the whole dataset to be available for post-processing: if a new measurement is made, the entire algorithm must be repeated from the beginning. Additionally, the storage requirement for the data operators may be large, especially for multi-qubit systems. Our proposed method does not need to store the whole dataset, just the latest averaged outcome for each measurement operator in the most sophisticated case. The same comparison applies to least squares, which also acts on the whole batch of data and is not an online algorithm.
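The constant-memory bookkeeping behind this is just an incremental mean kept per measurement operator. A minimal sketch (the class and names are illustrative, not taken from the paper's code):

```python
class RunningAverage:
    """Incremental mean of measurement outcomes for one operator.

    Storage is O(1) per operator, regardless of how many data points
    have been seen, in contrast to batch ML/LS which keep the record.
    """

    def __init__(self):
        self.mean = 0.0
        self.count = 0

    def update(self, value):
        # Welford-style incremental mean update.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean

ra = RunningAverage()
for y in [0.2, 0.4, 0.6]:
    ra.update(y)
assert abs(ra.mean - 0.4) < 1e-12
```

After each new shot record, only `mean` and `count` need to survive; the MEG update then consumes the current averages instead of the raw history.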
An additional advantage of our algorithm is that it guarantees positivity of the estimated operator at all times. Least squares and similar approaches are not guaranteed to produce a physical state unless a further projection back to the physical space is performed, which adds a choice and overhead to the algorithm. Moreover, the use of the running average allows a constant learning rate, which removes the need to evaluate the optimal learning rate at each time step. Regarding the accuracy of the estimate, the simulation results show that after a sufficient number of iterations the MEG estimates converge to both the maximum-likelihood and least-squares estimates, which are considered the optimal estimators in batch-processing systems. ML produces a point estimate of the model that maximizes the probability of the observed data, while LS minimizes the sum of squared errors due to observation noise. As the number of shots increases, the accuracy of all estimators improves (i.e., lower average infidelity for a given number of iterations) because the noise becomes less significant. On the other hand, as the number of qubits increases, more iterations are needed to achieve a low average infidelity. This is because at each iteration one basis is selected randomly for measurement, and for high-dimensional systems many more bases ($d^2 - 1$ in total) need to be covered to form a complete set.
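The positivity guarantee comes from the structure of the matrix-exponentiated-gradient update itself: the step is taken in the matrix-log domain and mapped back through the matrix exponential, whose output is always positive definite. The sketch below illustrates this with the standard MEG update form and an illustrative quadratic loss on a single projector; the function name, step count, and toy loss are our own assumptions, not the paper's code.

```python
import numpy as np

def meg_step(rho, grad, eta=0.5):
    # Matrix-exponentiated-gradient update: step in the matrix-log
    # domain, exponentiate, renormalize the trace. The exponential
    # keeps the estimate positive definite at every iteration.
    w, V = np.linalg.eigh(rho)
    log_rho = (V * np.log(np.clip(w, 1e-12, None))) @ V.conj().T
    G = log_rho - eta * grad
    G = (G + G.conj().T) / 2            # keep Hermitian against round-off
    wg, Vg = np.linalg.eigh(G)
    new = (Vg * np.exp(wg)) @ Vg.conj().T
    return new / np.real(np.trace(new))

# Toy run: one projector Pi, target statistic y = tr(Pi rho_true) = 0.7,
# quadratic loss with gradient 2 (tr(Pi rho) - y) Pi.
Pi = np.diag([1.0, 0.0])
y = 0.7
rho = np.eye(2) / 2
for _ in range(300):
    g = 2 * (np.real(np.trace(Pi @ rho)) - y) * Pi
    rho = meg_step(rho, g)
```

The iterate stays a valid density matrix throughout, with no separate projection step.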
As for complexity, maximum likelihood scales as $O(d^4)$. This is because the bottleneck operation is calculating the gradient of the log-likelihood function, $R = \sum_j \frac{f_j}{N \Pr_j} \Pi_j$. For a complete set of measurements, at least $d^2 - 1$ measurement operators are needed, each of dimension $d \times d$, so calculating $R$ requires $O(d^4)$ complex multiplication operations. For the least-squares method, the complexity is also $O(d^4)$, as discussed in [QHL+13]. In this case the bottleneck is the matrix multiplication $X^T Y$ in the estimation equation $\hat\theta = (X^T X)^{-1} X^T Y$: again, a complete set of measurements requires at least $d^2 - 1$ operators, so $Y$ has dimensions $(d^2 - 1) \times 1$ and $X$ has dimensions $(d^2 - 1) \times (d^2 - 1)$. Finally, for the proposed method the bottleneck is calculating the matrix exponential. The complexity depends on the particular implementation; the most common approach is an eigendecomposition followed by exponentiation of the diagonal matrix of eigenvalues, whose complexity is usually taken to be $O(d^3)$ [PC99, DDH07]. It should be noted that the complexities discussed here are per iteration, i.e., for each update given a new data point. Table I summarizes these results. To verify the claim that MEG should be the fastest, the execution times per iteration were recorded in the simulation for the three methods. Figure 5 shows the average of these execution times: as the number of qubits increases, MEG clearly has the lowest runtime of the three methods.
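The $O(d^3)$ matrix-exponential step via eigendecomposition can be sketched as follows (the helper name is ours; the result is checked against SciPy's general-purpose `expm`):

```python
import numpy as np
from scipy.linalg import expm

def expm_hermitian(H):
    # Exponentiate a Hermitian matrix by eigendecomposition:
    # H = V diag(w) V^dagger  =>  exp(H) = V diag(exp(w)) V^dagger.
    # The eigendecomposition dominates the cost at O(d^3).
    w, V = np.linalg.eigh(H)
    return (V * np.exp(w)) @ V.conj().T

rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
H = (A + A.T) / 2                      # real symmetric test matrix
E = expm_hermitian(H)
assert np.allclose(E, expm(H), atol=1e-8)
```

Since the MEG update only ever exponentiates Hermitian matrices, this specialization applies at every iteration.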

VI. CONCLUSION
In this paper, we introduced the idea of using a running average over the noisy measurements together with the MEG update rule to construct a fast and simple online quantum state estimator. There are, however, still some points to consider in the future. First, we considered only fixed measurements; adaptive measurements could further improve the performance. It would also be interesting to test these ideas embedded in a real experiment. Finally, it would be worthwhile to explore other machine learning techniques from the classical literature and investigate their applicability in the quantum setting. In particular, proving convergence for the projected-gradient method as another online estimation algorithm would be worth considering; Appendix E gives more details on this point.

Acknowledgments: AY is supported by an Australian Government Research Training Program Scholarship. MT and CF acknowledge Australian Research Council Discovery Early Career Researcher Awards, projects No. DE160100821 and DE170100421, respectively. This research is also supported in part by the ARCLab facility at UTS.

Appendix A: Auxiliary lemmas
In this Appendix, we present some auxiliary lemmas needed for the proofs. We start by stating the following lemma [TRW05], which is proved as Lemma 1 in [HSSW97].
The following result is presented as Lemma 2.1 in [TRW05].
Finally, we state the following lemma relating convergence in mean to convergence in probability.
Lemma 12. Given a sequence of non-negative random variables $Z_t$, the statement $\lim_{t\to\infty} \mathbb{E}\{Z_t\} = 0$ implies that $Z_t$ converges in probability to zero.

Proof. The statement $\lim_{t\to\infty} \mathbb{E}\{Z_t\} = 0$ is equivalent to: for every $\epsilon > 0$ there exists $T$ such that $\mathbb{E}\{Z_t\} < \epsilon$ for all $t \geq T$. Now, Markov's inequality states that for a non-negative random variable $X$ and any $\delta > 0$, $\Pr(X \geq \delta) \leq \mathbb{E}\{X\}/\delta$. So, for all $t \geq T$, $\Pr(Z_t \geq \delta) \leq \mathbb{E}\{Z_t\}/\delta < \epsilon/\delta$. Writing this back as a limit, $\lim_{t\to\infty} \Pr(Z_t \geq \delta) = 0$ for every $\delta > 0$, which is the definition of convergence in probability.

Proof of Theorem 3
Theorem 3. In the presence of noise, the state estimate using the MEG update rule with a decreasing learning rate converges in probability to the true state, i.e., for all $\delta > 0$, $\lim_{t\to\infty} \Pr\left(\|\hat\rho_t - \rho\|_F^2 \geq \delta\right) = 0$.

Proof. We start from the bound given by Lemma 2. Taking the expectation with respect to $\hat y_t$, followed by the expectation with respect to $X_t$, and applying Lemma 8, then simplifying and taking the expectation with respect to all previous time instants, we obtain an inequality whose second term on the left-hand side depends on the purity of the true state and can be bounded accordingly. Selecting the learning rate as specified and summing the inequality over the time steps $t = 1, \dots, T$ yields a bound in which $\mathbb{E}\{D(\rho\|\hat\rho_1)\} = D(\rho\|\hat\rho_1)$, because $\hat\rho_1$ and $\rho$ are independent of $X_t$ and $\hat y_t$ for any $t$. Now take the limit as $T \to \infty$: since the left-hand side of the inequality is constant, the series on the right-hand side must converge, so by the divergence test its terms must vanish; otherwise we would contradict the condition in (B31). This means that it must be the case that $\lim_{t\to\infty} \mathbb{E}\{\|\hat\rho_t - \rho\|_F^2\} = 0$. Now we can apply Lemma 12 to the random variable $Z_t = \|\hat\rho_t - \rho\|_F^2$ to conclude that $\lim_{t\to\infty} \Pr(\|\hat\rho_t - \rho\|_F^2 \geq \delta) = 0$ for all $\delta > 0$. Therefore, the estimate $\hat\rho_t$ converges in probability to the true state $\rho$.
where $f_j$ is the relative frequency of outcome $j$ described by the POVM set $\{\Pi_j\}$. This can be achieved by iterations of the form $\hat\rho_{t+1} = \frac{R\hat\rho_t R}{\mathrm{tr}(R\hat\rho_t R)}$, where $R = \sum_j \frac{f_j}{\Pr_j} \Pi_j$. This is the $R\rho R$ algorithm. However, there is no guarantee that this form of the update equation converges in general. So, the update equation is modified (diluted) to become
$$\hat\rho_{t+1} = \frac{(I + \epsilon R)\,\hat\rho_t\,(I + \epsilon R)}{\mathrm{tr}\left((I + \epsilon R)\,\hat\rho_t\,(I + \epsilon R)\right)},$$
where $\epsilon$ is the dilution parameter.
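One diluted iteration of this form can be sketched as follows. The function and variable names are ours, and `eps` plays the role of the dilution parameter (set to 0.1 in the simulations); the clamp on the outcome probability is a numerical guard we added.

```python
import numpy as np

def diluted_ml_step(rho, povm, freqs, eps=0.1):
    # One diluted R-rho-R iteration: R = sum_j (f_j / p_j) Pi_j,
    # rho <- (I + eps R) rho (I + eps R) / trace(...).
    d = rho.shape[0]
    R = np.zeros((d, d), dtype=complex)
    for Pi, f in zip(povm, freqs):
        p = np.real(np.trace(Pi @ rho))
        R += (f / max(p, 1e-12)) * Pi   # clamp guards against p ~ 0
    M = np.eye(d) + eps * R
    new = M @ rho @ M.conj().T
    return new / np.real(np.trace(new))

# Toy run: computational-basis POVM, observed frequencies (0.7, 0.3).
P0 = np.diag([1.0 + 0j, 0.0])
P1 = np.diag([0.0, 1.0 + 0j])
rho = np.eye(2, dtype=complex) / 2
for _ in range(300):
    rho = diluted_ml_step(rho, [P0, P1], [0.7, 0.3])
```

For this diagonal toy problem the iteration converges to the state matching the observed frequencies, as expected of a likelihood-increasing update.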
where $e$ is the error vector, which converges to a normally distributed random variable in the limit of a very large number of measurements. The optimal parameter is defined to minimize the sum of squared errors, $\hat\theta = \arg\min_\theta \|Y - X\theta\|^2$, and the solution of this optimization problem is $\hat\theta = (X^T X)^{-1} X^T Y$. After the estimation of the unknown parameter, the quantum state is reconstructed from $\hat\theta$. The reconstructed state may in general be unphysical due to non-positivity of the estimate, in which case the state must be projected back to the physical space. One way to do this is to redistribute the negative eigenvalues over the other eigenvalues until no negative eigenvalues remain. It can be shown [SGS12] that this is an optimal projection method, in the sense that the projected state is nearest to the unphysical estimate in terms of the Frobenius norm. This method is batch, but it can be made online by performing the whole procedure of estimation and projection after each new data point.
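The eigenvalue-redistribution projection of [SGS12] can be sketched as follows. This is a minimal implementation assuming a Hermitian, unit-trace input; the function name is illustrative.

```python
import numpy as np

def project_to_physical(rho):
    # Frobenius-nearest density matrix to a Hermitian, unit-trace
    # input: zero out the most negative eigenvalues and redistribute
    # their weight uniformly over the remaining ones [SGS12].
    w, V = np.linalg.eigh(rho)
    w = w[::-1]                    # eigenvalues in descending order
    V = V[:, ::-1]
    d = len(w)
    acc = 0.0
    lam = np.zeros(d)
    for i in range(d - 1, -1, -1):
        if w[i] + acc / (i + 1) < 0:
            acc += w[i]            # set this eigenvalue to zero
        else:
            for j in range(i + 1):
                lam[j] = w[j] + acc / (i + 1)
            break
    return (V * lam) @ V.conj().T

# Example: trace-one Hermitian matrix with a negative eigenvalue.
P = project_to_physical(np.diag([1.2, 0.1, -0.3]))
```

The output is positive semidefinite with unit trace, so it is a valid density matrix.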
where $\eta$ is the learning rate, $L_t$ is the loss function as defined previously, and $P$ denotes the projection onto the physical space as discussed in Appendix D. It turns out that the performance depends strongly on the choice of learning rate, and is generally very similar to MEG when the learning rate is chosen to be small. In Figure 6a, we compare MEG with a constant learning rate, MEG with running average (RA), PGD with a constant learning rate, and PGD with running average; the same value $\eta = 0.5$ is used in all four. This plot shows that the MEG method converges faster. On the other hand, after changing the step size of the PGD methods to 0.05, both methods seem to perform similarly in the running-average case after a significant number of iterations, as shown in Figure 6b. This makes it very difficult to give a fair comparison with MEG. Note that even for MEG we were not particularly concerned with finding the optimal learning rate; we simply tested a few learning rates that are compatible with the limitations given by the convergence proof. However, faster (but not provably convergent) behavior might be possible, as is often the case, outside that range. For PGD we simply do not know the restrictions on the learning rate under which the algorithm still provably converges with parameters similar to MEG. Thus, it will be interesting as future work to investigate the convergence of the PGD method.
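For reference, a single PGD step can be sketched as below. Note that this sketch uses a simple clip-and-renormalize projection rather than the Frobenius-optimal projection of Appendix D, and the toy loss is our own illustration, so it should be read only as a structural sketch of the additive-update-then-project pattern.

```python
import numpy as np

def project_psd_unit_trace(M):
    # Simple (not Frobenius-optimal) projection: clip negative
    # eigenvalues to zero, then renormalize the trace to one.
    w, V = np.linalg.eigh(M)
    w = np.clip(w, 0.0, None)
    w /= w.sum()
    return (V * w) @ V.conj().T

def pgd_step(rho, grad, eta=0.05):
    # Projected gradient descent: additive step on the loss gradient,
    # followed by projection back onto the physical set.
    return project_psd_unit_trace(rho - eta * grad)

# Toy loss L(rho) = (tr(Pi rho) - y)^2 for one projector Pi,
# with gradient 2 (tr(Pi rho) - y) Pi.
Pi = np.diag([1.0, 0.0])
y = 0.7
rho = np.eye(2) / 2
for _ in range(500):
    g = 2 * (np.real(np.trace(Pi @ rho)) - y) * Pi
    rho = pgd_step(rho, g)
```

Unlike MEG, the additive step can leave the physical set on its own, which is why the explicit projection $P$ is needed at every iteration.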