Quantum state estimation when qubits are lost: A no-data-left-behind approach

We present an approach to Bayesian mean estimation of quantum states using hyperspherical parametrization and an experiment-specific likelihood which allows utilization of all available data, even when qubits are lost. With this method, we report the first closed-form Bayesian mean estimate for the ideal single qubit. Due to computational constraints, we utilize numerical sampling to determine the Bayesian mean estimate for a photonic two-qubit experiment in which our novel analysis reduces burdens associated with experimental asymmetries and inefficiencies. This method can be applied to quantum states of any dimension and experimental complexity.


Introduction
The problem of estimating the state of a quantum system from observed measurement outcomes has been around since the dawn of quantum mechanics. In recent years, driven by technological advances in building and controlling quantum systems, this question has received a renewed interest especially in the context of quantum information processing (QIP). Since QIP relies on one's ability to prepare arbitrary multi-qubit quantum states, verifying experimentally that a quantum state has been indeed prepared acceptably close (by some metric) to a target is of paramount importance. As a result, methods built on classical statistical parameter estimation procedures have been adopted in order to satisfy a demand for quantum state characterization tools.
Arguably the most widespread quantum state estimation approach relies on classical maximum likelihood estimation (MLE) technique. Pioneered in [1], it offers a great simplicity in numerical implementation but suffers from several dangerous flaws. Most prominently, MLE is prone to rank-deficient estimates of a density matrix from a finite number of measurement samples [2] which in turn implies that probability of observing certain states is zero -a statement that is only valid in the limit of infinite number of observations. Also, MLE does not provide a straightforward way to place error bars on an estimated quantum state.
Fortunately, there is an alternative parameter estimation technique called the Bayesian mean estimate (BME) that can be applied to quantum state characterization [2,3,4] and is free of the the aforementioned shortcomings. In addition, BME minimizes the mean square error(MSE) [2,5] i.e. the average square difference between the parameter and its estimate. Thus, BME offers a more accurate estimate of a quantum state. But, BME poses an implementation challenge. It computes a posterior distribution over quantum states for given measurement data by using Bayes' rule which in turn requires one to calculate the probability of the data by integrating over the manifold of all physical quantum states. While it may be possible to carry out the multi-dimensional integration analytically, ultimate evaluation still may be computationally prohibitive. Thus, numerical routines that use Monte Carlo (MC) methods are applied in order to sample from the posterior distribution. There is a trade-off between the speed and accuracy of BME depending on which MC algorithm is used. To our knowledge two types of MC algorithms were proposed for quantum state BME so far. The first one is the Metropolis-Hastings [6] (MH) algorithm-an example of Markov Chain MC (MCMC)-was adopted in [2]. The second one is sequential MC (SMC) [7]-an importance sampling based algorithm-recently used for adaptive quantum state tomography [8]. The MH algorithm is known for its ability to reproduce probability distributions very accurately at the expense of slow convergence. On the other hand, the SMC algorithm is fast but may converge to a sample that does not faithfully represent the distribution of interest.
When to apply or the BME depends on many factors. For instance, for a small measurement data set and a large number of unknown parameters-a typical situation for multi-qubit systems-the BME is superior to as we demonstrate in Section 5 of this paper. But perhaps even more crucially, the applicability of the BME approach depends on the choice of the form and parametrization of the likelihood function. The experimental likelihood most used in application is a simple multinomial that connects the observed data set directly with the quantum probabilities using Born's rule [1]. This approach assumes the data set, observed measurement outcomes, result directly and only from the quantum state and unitary operations. This is not always the case, since the measurement apparatus often introduces operation bias (not always unitary) and inefficiencies that modify the probability of an experimental observation. Previously, James et al. [9] accounted for bias in qubit operations in their two-photon tomography method using MLE. More recent MLE works such as Gate Set Tomography [10,11] assume nothing about the qubit state or operations, gates, other than their dimension. However, previous methods stop short of full accounting for non-unitary operations such as qubit loss. Thus, experimentalists may find themselves applying normalizing constants to account for deficiencies in the defined likelihood. These normalizing constants require preliminary experiments to obtain, and the method for obtaining these constants is often not well defined nor reported.
In this paper we develop a BME-based quantum state reconstruction method that utilizes the slice sampling [12] (SS) algorithm which has the accuracy of the MH algorithm but demonstrates faster convergence [13] and is more resilient for a numerical implementation. We show that by using the hyperspherical parameterization of the manifold of density matrices the BME of a state of a single qubit can be computed analytically by using a uniform prior. For a two-qubit system, in a situation when individual qubits may be lost during the measurement process, we apply SS algorithm to the same parameterization and an experiment-specific likelihood demonstrating a computationally stable and efficient way of sampling from the posterior distribution over the density matrices. We compare the resulting BME estimates to the corresponding MLE estimates as a function of the number of measurements and observe the superiority of the BME method, especially in the limit of small sample sizes.
We begin this paper with a quick outline of our method in Section 2. We derive a closed BME for the ideal single qubit experiment in Section 3. This approachable example illustrates our method and contrasts it with traditional MLE methods. It may also inspire further research into closed-form BME solutions of higher dimensional quantum systems. Next, in Section 4, we derive a likelihood for a finite data two-photon experiment where detector inefficiencies and experimental asymmetries are taken into account. Utilization of this approach results in the real world benefit of eliminating the need to perform preliminary experiments to determine normalization constants. Subsequently, in Section 5 we simulate a multitude of two-qubit photon experiments generating data sets from which we compare the performance of various MLE and BME approaches. Lastly, we apply our estimation to a real world two-photon experiment in Section 6. In the Appendices, we describe a common MLE approach using a traditional likelihood, we detail numerical procedures for sampling density matrices from the true state distribution using slice sampling, and describe the optimization method used in likelihood maximization.

Approach Outline
The components of our quantum state estimation pipeline are outlined in Fig. 1. First, we define a model of our experiment by enumerating all the possible outcomes. This enumeration allows us to specify an experiment-specific likelihood P (D|α), the probability of observing a specific data set D given the experimental parameters α = {α 1 , · · · , α N }. In our case parameters α are elements of a density matrix ρ representing the quantum state to be estimated. Bayes' rule, then allows us to express P (α|D), a posterior distribution (PD) for the variables α, given an observed data set D and a prior probability distribution P (α). Next, the BME for a specific parameter α i given a data set D is We expand our analysis to quantum systems by assuming that parameters α are entries of a density matrix ρ (ρ ij = α k ) describing a valid quantum state (i.e. ρ ≥ 0, ρ = ρ † , Tr(ρ) = 1). Therefore, α's are not independent as we must enforce quantum constraints. To achieve this in a computationally tractable fashion, instead of using the Cartesian parameterization given by α's we parametrize a density matrix ρ utilizing a Cholesky decomposition (see panel c in Fig.(1)) and hyperspherical parameters as Our Bayesian mean estimation of density matrices is outlined above. a) We define our experiment by specifying a likelihood function P (D|α) of data D given parameters α and b) express a corresponding posterior distribution of the parameters α which define the density matrix given data using Bayes' rule. c) We parametrize the density matrix such that any choice of parameters in a specified range leads to a valid physical state. d) We represent our posterior distribution using these new parameters; now the BME of the density matrix can be formally written down as an integral using a Haar-invariant measure. Unfortunately, the integral's analytical solution is typically computationally intensive to evaluate. e) Thus, we use computationally efficient numerical slice sampling to make samples of ρ from the postirior distribution P (τ |D). f) As R→∞ we tend to the true mean ρ. suggested by Daboul [14]. We abbreviate this parametrization with τ to distinguish it from the Cartesian parametrization α.
Next, in order to compute the BME estimator, we need to select a prior probability distribution over density matrices P (τ ) ≡ P (ρ(τ )) and an integration measure over the set of all physical quantum states dτ such that dµ(ρ(τ )) = P (τ )dτ is a valid probability measure i.e. dµ(ρ(τ )) = 1. We use a non-informative prior P (τ ) = const and derive the integration measure dτ induced by the Riemanian metric g ij computed from the Euclidian length element between density matrices (ds) 2 = Tr dρ(τ ) · dρ † (τ ) [15]. This choice of the integration measure guarantees Haar invariance of the probability measure over the set of density matrices. Thus, the probability of a state ρ(τ ) is invariant under an arbitrary unitary rotation U i.e. P (ρ(τ )) = P (U ρ(τ )U † ). Then the BME of an unknown quantum state reads, The latter expression for the BME can, in principle, be evaluated analytically. However, in practice it almost surely requires computational resources and (or) time constraints that prohibit analytical evaluation. In this case, an estimate can be obtained using numerical sampling from the posterior distribution P (τ |D). For example, in later sections we utilize numerical slice sampling [12] to arrive at approximate estimates for a two-photon experiment.
3. An example: Bayesian mean estimation of an ideal single qubit Consider an ideal single-qubit experiment. In this experiment we can reliably and repetitively prepare a qubit in an unknown state ρ and measure the value of any desired observable M without err or qubit loss. If M is a two outcome POVM defined by operators M i with i ∈ {0, 1} , M 0 + M 1 = I then the respective probabilities p i to observe outcome i are determined by the unknown quantum state via p i = Tr (ρ · M i ). To fully describe a single qubit we need to measure a set of informationally complete POVMs which will fully define the density matrix. For concreteness let us consider a case when the qubit is represented by the polarization degree of freedom of a single photon. In this case a complete state description can be achieved by estimating the probability of observing one of two orthogonal outcomes in the rectilinear basis (Z, horizontal (h) and vertical (v) polarization), the diagonal basis (X, diagonal (d) and anti-diagonal (a) polarization), or the circular basis (Y , left (l) and right (r) circular polarization). The likelihood of observing a data set D from these measurements given we know the probabilities of each outcome exactly is where α = {p h , p d , p l }, D = {c h , c v , c d , c a , c l , c r }, we have enforced the single basis requirement that the sum of orthogonal probabilities is unity, p h,d,l + p v,a,r = 1. Using Bayes rule, the distribution for α given D is which has no quantum constraints, i.e. associated density matrices may not be physical. A physical density matrix ρ for the single-qubit must fulfill constraints Tr (ρ) = 1 probabilities sum to 1 φ| ρ |φ ≥ 0 positive semi-definite These can all be fulfilled by parametrizing the density matrix as suggested by Daboul [14]. For the single-qubit the parametrized matrix is ρ(τ ) = cos 2 (u) where parameter ranges τ = {u, θ, φ}, u ∈ [0, π 2 ], θ ∈ [0, π 2 ], and φ ∈ [0, 2π] ensure there is no state redundancy, states having multiple representations. This matrix heeds all quantum constraints for any values of the parameters. The parameters α in terms of the new parameters τ are p h (τ ) = cos 2 (u) (7) p d (τ ) = 1 2 This results in the new likelihood To complete the new description we must define a new integration measure in τ space. Our original probability space has an infinitesimal length element (ds) 2 = (dp h ) 2 + (dp d ) 2 + (dp r ) 2 . The measure in this case is reduced to the volume element in Cartesian coordinates dα = dp h dp d dp l . This space can be considered a"cube" that includes both physical and unphysical states. Within this cube, the new space is a sphere containing only and all physical density matrices. The length element in this space is [15] (ds where τ i ∈ {u, θ, φ}. The new measure, the infinitesimal volume, is where The integration measure in the ideal single qubit experiment is As described earlier, this measure is Haar invariant. We will also consider how this parametrization relates to the Pauli operators and their expectations z = Tr (σ z · ρ) = cos(2u) (15) x = Tr (σ x · ρ) = sin(2u) cos(θ) cos(φ) (16) y = Tr (σ y · ρ) = sin(2u) cos(θ) sin(φ).
With our likelihood defined, one estimation technique is to approximate the true distribution utilizing Laplace's method [16], also known as the saddle-point approximation. This is a multivariate Gaussian centered on the MLE defined by k parameters. This MLE is found by simultaneously solving k equations of the form and verifying this point represents the global maximum. The uncertainty in the parameters can be captured utilizing the covariance matrix which we estimate as The approximate distribution is then where τ is a column vector.
For the ideal single-qubit we find unbounded MLE where z f , x f , and y f are the frequency based linear inversion estimates (LIE) of the Pauli operator expectations When x 2 f + y 2 f + z 2 f ≤ 1 these LIE are the correct MLE. However the parameter set given in Eq. 21 and 22 is undefined for unphysical states, when When this is the case, the MLE is found on the boundary of the Bloch sphere due to the concavity of the likelihood given by Eq. 10. This point is not necessarily the one with smallest Euclidean distance to the unbounded MLE. Determination of the boundary MLE is accomplished by setting θ = 0, restricting us to the boundary, and maximizing the parametrized likelihood Eq. 10 over the parameter ranges u ∈ 0, π 2 and φ ∈ [0, 2π]. Next, we derive a closed-form BME which always results in a quantum bound obedient estimate.
To calculate the BME for the single qubit density matrix we first evaluate the normalizing constant and then estimate our mean density matrix Using the binomial theorem we can rewrite this as The integral over u has solution and similar for θ. The integral over φ can be shown to be which is zero for odd x or y. To ease representation of the solutions, define

AN EXAMPLE: BAYESIAN MEAN ESTIMATION OF AN IDEAL SINGLE QUBIT
The BME for our ideal single-qubit is This is the best possible estimation of the ideal single-qubit given a set of data D and a uniform prior.
To illustrate the physicality of the distribution P (τ |D) and the differences between the MLE, LIE, and the BME consider a true quantum state ρ 0 defined by the parameters u 0 = 0.864, θ 0 = 0.393, and φ 0 = 5.18. For visualization we use the Bloch sphere where the state is represented by the expectations of the the Pauli operators given in Eq. 15-17. We simulated taking 10 measurements in the Z, X, and Y bases from which we generated counts c h = 7, c v = 3, c d = 7, c a = 3, c l = 0, and c r = 10. We plot  the distributions P (x, z|D), P (y, z|D), and P (x, y|D) in Fig. 2. The coordinates for each estimate are given in Table 1. This small data set emphasizes the parametrized distribution's physicality and the qualitative difference between the MLE and BME. The gray locations correspond to unphysical states. As can be seen in the top right and bottom plots in Fig. 2, the LIE can be unphysical. To correct this, the MLE is found on the boundary, a pure state. In contrast, the BME will always be located within the physical space. This illustration is not made to emphasize the performance of any of specific approach. Performance is addressed in Section 5.
In order to utilize the ideal single qubit formalism with single-qubit experiments, the data can be renormalized to the lowest efficiency measurement similar to the procedure used in Appendix A. This method does not fully utilize the available information and has the additional complication that preliminary experiments must transpire to determine the measurement efficiencies. In the remainder of our manuscript we address qubit estimation for experiments.

Bayesian mean estimation for multi-qubit experiments
In an experiment the probability of observing an outcome depends not only on the quantum state but also on the measurement apparatus itself. In this case imperfections and asymmetries in the measurement process prohibit the type of "perfect" estimate we investigated in the last section. Our experiment of investigation is the common two-photon experiment for which we introduce the fundamental assumptions and model below. James et al. [17] previously reported an MLE approach to this experiment as well as higher dimensional experiments. In contrast to that method, we account for qubit loss within our defined likelihood and enable determination of the BME, the best estimate on average [2], which avoids MLE pitfalls such as "zero" probabilities, impossible outcomes.

Estimating parameters in a single-basis two-photon experiment
In this section, we give an example of estimating parameters in a single-basis experiment. To begin, we assume the existence of a photon pair. A member of this pair is sent to Alice and the other one to Bob each of whom has chosen a measurement basis as seen in Fig. 3. A single photon can result in one of two observable orthogonal outcomes, 0  or 1, and one unobservable outcome, the photon is lost. All observable outcomes have probabilities of occurrence proportional to the joint probabilities p 00 , p 01 , p 10 , and p 11 as seen in Fig. 4a. Additionally, Fig. 4b illustrates the four possible outcomes for a given "destiny" when pathway efficiencies are considered. The possibilities include both photons being counted giving one coincidence count and two singles counts, one photon being counted and one lost giving one singles count, or both photons being lost giving no counts.
Alice and Bob record event 0 or event 1 with number A 0 , A 1 ≤ N and B 0 , B 1 ≤ N , respectively, since typically some portion of the N photons are lost. Losses are due to Alice and Bob's suboptimal pathway efficiencies {a 0 , a 1 , b 0 , b 1 } ∈ [0, 1]. In the event both members of a photon pair are detected, Alice and Bob observe joint results, giving coincidence totals c 00 , c 01 , c 10 , and c 11 .
From this data we may enumerate the number of each type of event. The number of joint coincidence events are straightforward, given by the c ij with the probability of these events being a i b j p ij . The number of events where Alice registers result i and Bob loses The terms for Bob registering a photon and Alice losing her photon are similar. The number of events where both photons are lost is N -A 0 -A 1 -B 0 -B 1 +c 00 +c 01 +c 10 +c 11 with probability For now, assume the photon number N is known. In this case, using Bayes' rule, the event number, and the probabilities given above, the PD is where α={p 00 , p 01 , p 10 , p 11 , a 0 , a 1 , b 0 , b 1 } are the unknown parameters, joint probabilities and pathway efficiencies, the data set D={c 00 , c 01 , c 10 , c 11 , A 0 , A 1 , B 0 , B 1 } consists of the known singles and coincidence count values totalling The likelihood P (D, N |α) consists of the probability of each type of event with a multiplicity equal to the number of times it occurred. Both the probabilities and number of events were described in the preceding paragraph. We have retained the full form of the likelihood that includes γ(N ) for use below. It is typical in two-photon experiments that the photon number N is not known. If N is known, the following step may be skipped and the above PD is the appropriate choice. Otherwise, we must make N an unobserved parameter or seek a way to eliminate it. Fortunately, there is an analytical method to remove N from the PD completely [5] by taking an average over the N distribution using the summation formula Since the average is taken over the distribution, see Fig. 5, only probable values of N will have appreciable contribution. Applying this formula, N is removed giving PD where Assuming the integral can be carried out, we can make estimates of any parameter via its the mean value, for instance, Likewise, any other parameter mean p ij , a i , or b i as well as their standard deviations may be estimated. One exception is the mean value N . We find this mean by setting z = 1 in Eq. (27), where In principal, all of the above integrals have analytical solutions via the multinomial theorem, which gives exact answers in the form of sums of Beta and Gamma functions. However, the computation needed to carry out the resultant sums is prohibitive. If we cannot efficiently make our parameter estimations analytically, we can utilize numerical sampling to approximate the BMEs of interest. We discuss this in detail in Appendix B. If the probability of obtaining a sample α (r) tends to the true probability P (α (r) |D), the mean estimations can then be made by repetitive sampling, = p 00 , p 01 , p 10 , p 11 , a 0 , a 1 , b 0 , b 1 .

Single-basis simulation
Consider a single-basis simulation where unbeknownst to Alice and Bob a source generates N =10, 000 photon pairs with joint probabilities and pathways efficiencies Numerical sampling, see Appendix B, is used to produce sample α (r) from P (α|D).  Figure 5. Left. Sample histograms given proportional to the approximate the probability distribution for each parameter p ij whose true value is given by the black vertical line. Middle. Sample histograms are given proportional to the probability distribution for each efficiency parameter a 0 , a 1 , b 0 , and b 1 whose true value is given by the black vertical line. Right. A histogram of samples for parameter N are given proportional to the probability distribution for photon number N .

Parametrizing the n-dimensional density matrix
For approachability, we obscured the construction of the density matrix for the ideal single qubit given in Eq. 6. We briefly describe this construction here, for a full proof with discussion see [14]. We note that this construction is similar to that recently proposed by Seah et al. [18] whose density matrix sampling application is similar to our approach. Daboul's parametrization can be extended to quantum systems of any dimension. For an n-dimensional Hilbert space, the density matrix is formed using the Cholesky decomposition, requiring ρ = L(τ)L(τ) † with being a lower triangular matrix with positive real diagonal elements. The parameter set τ include n 2 −1 parameters which describe a unique density matrix. The elements L ij may be written as Consider the case of two qubits with dimension n=4, the parametrized matrix , and φ ij ∈ [0, 2π]. Indeed, one could instead change the u i and θ ij trigonometric terms to The complex terms involving φ ij remain unchanged. A similar adjustment was used by Chung and Trueman [19].

Estimating parameters in a multi-basis two-photon experiment
In section 4.1 and 4.3, respectively, we defined our experimental likelihood for the singlebasis experiment and detailed the parametrization of any n-dimensional density matrix.
To make estimations using data from multi-basis two-photon experiment we will use both of these pieces. Our example will be full-state tomography. Other multiple basis experiments will have similar estimation constructions. In the case that the data set is incomplete, our method will still return an estimate true to both the given data and all quantum constraints. To complete full-state tomography Alice and Bob each take measurements in bases Z, X, and Y such that all outcomes are observable in each basis combination ZZ, ZX, XZ, ZY , Y Z, XX, XY , Y X, and Y Y . Thus, Alice and Bob's data set will include the data from all 9 basis combinations. The likelihood is a product of the single-basis likelihoods, Eq. 29, from each of these basis combinations, where α includes the probabilities of all measurement outcomes and the four experimental pathway efficiencies which we assume are the same over all bases. However, in the experimental section, Section 6, we do not make this assumtion. Next, we parametrize our density matrix using the hyperspherical parameters described in the previous section, P (D|α) → P (D|τ ).
This parametrization comes with a new measure defined by Eq. 11, 12, and 13.
Putting it all together we can make any BME of interest, for instance the mean density matrix where P (D) = dτ P (D|τ )P (τ ).
If it is not computationally convenient to evaluate the integrals of the type given in Eq. 35, we can utilize numerical sampling. If we can draw samples ρ (r) from the distribution P (τ |D) we can estimate the BME of our density matrix as We address our numerical sampling approach in Appendix B.

State certainty
When reporting the values of experimental measurements such as the visibility of an interference curve V or the value of the Bell parameter S, it is typical to provide a standard deviation to describe the uncertainty in the parameter, e.g. V = 0.98 ± 0.01 or S = 2.65±0.05. This gives a quantification of the uncertainty in the estimate. When the BME is multi-dimensional the uncertainty can be represented by a covariance matrix [2,4] ∆ρ(τ ) = with each element being a covariance, where τ i is the expectation, mean value, of τ i . For i = j this is just the usual variance.
Here the k parameters include the n 2 −1 parameters needed to define the density matrix as well as any additional experimental parameters such as the efficiencies. We can also define a single quantity that captures a compact representation of the certainty in the estimation. We use the trace distance deviation ∆D, which is the mean trace distance over the distribution with the distribution's mean density matrix ρ, The trace distance is where the λ i are the eigenvalues of ρ − σ. We approximate ∆D with numerical sampling using the formula When the certainty is high all samples will be close to the mean value giving ∆D → 0 which can be compared to the typical standard deviation where smaller is better. This is also useful when there is no particular state one wishes to compare the estimations with such as is typically done when reporting the fidelity.

BME performance with numerical sampling
To characterize the performance of the presented estimation methods we used the following procedure. For each N ∈ {10, 10 2 , 10 3 , 10 4 , 10 5 } the following steps are repeated: 1. A density matrix ρ is sampled from a uniform distribution using a Haar measure in the hypershperical parameter space. 3. We simulate a two-photon experiment for N identical states ρ in each of the 9 bases given in Section 4.4, 9N total identical states to generate a data set D. Each simulated experiment for a single basis is the same as described by the Bayesian tree in Section 4.1.
4. Using D we find using a traditional likelihood and the actual randomly chosen pathway efficiencies as described in Appendix A. Also with D, we find the BME and MLE using the experiment-specific likelihood in which the pathway efficiencies are not known as described in Section 4.4. Thus, the traditional MLE has the unfair and unrealistic advantage of knowing the pathway efficiencies exactly.
5. The distance D, Eq. 40, is found between each estimate and the true state ρ.
6. If the experiment-specific BME or MLE is closer to the true state than the traditional MLE, that estimation type has a win tallied.
8. The average distance D over all 1000 repetitions is found for the traditional MLE approach and the experiment-specific BME and MLE. The total wins versus the traditional MLE are also recorded.
For these simulations the average distance D results are given at left in Fig. 6, and the win percentages for the experiment-specific likelihood MLE and BME versus the traditional MLE are given at right in Fig. 6. The MLE was found using an gradient ascent method described in Appendix C. We emphasize that these results are conservative, since we give the traditional MLE process the pathway efficiencies exactly-these would not be known exactly in an experiment. Figure 6. We generated data from simulating two-photon experiments for various states and photon pair number N as outlined in this section. Top. We have plotted the average distance estimate for photon pair number N over 1000 randomly sampled states. Bottom. We give the win percentage for the experiment-specific BME and MLE versus the traditional maximum likelihood method. The best performance is achieved using the experiment-specific Bayesian mean estimate. Another observation from this data is that an experimentalist can achieve a better estimate by switching to an experiment-specific likelihood which allows them to forgo any preliminary experiments to determine normalizing constants.
The states ρ were drawn from a uniform distribution which is Haar invariant when using the measure obtained from Eq. 11, 12, and 13. We also use the same distribution as a prior to compute our BME estimate. We also made estimates using a non-Haar invariant measure to evaluate the significance of prior selection. This results in a drastically different prior relative to that used in generating the random state. In Fig. 6 the experiment-specific BME with the original advantageous prior and the "bad" prior are both plotted. As can be seen, there is, possibly, a small gain using the advantageous prior for the smaller photon pair number estimations. But, it also highlights that the prior choice can be made effectively inconsequential given enough data [5]. The prior certainly can improve the estimate when little data is available. Granade and colleagues discuss this in depth [4].

Experimental tomography
We performed state tomography on a two-photon polarization entangled target state generated by pumping a periodically poled potassium titanyl phosphate (PPKTP) nonlinear crystal inside a Sagnac loop with two counterpropagating 405nm pump beams [20,21]. The two possibilities of Type II § spontaneous parametric downconverison (SPDC), either the clockwise or counter-clockwise beam generated a 810nm photon pair, leads to a polarization entangled state output into the idler and signal modes received by Alice and Bob, respectively. Alice and Bob each choose a basis by inclusion or omission of waveplates. Since this requires a physical adjustment to our apparatus for each basis choice, we assume in our likelihood that the efficiencies are independent parameters in each basis. The half-wave and quarter-wave plate matrix operations are, respectively, To measure in basis Z Alice omits her waveplates. To measure in X she includes the halfwave plate, she operates on her single-photon with H. Finally, to measure in the Y basis she includes both waveplates, operating with Q then H. Single-photon detectors record the detection mode, orthogonal outcomes 0 or 1 in each basis. From the experimental data given in Table 1

Conclusions
We have presented a novel method of Bayesian mean estimation using hyperspherical parametrization and an experiment-specific likelihood. This method has allowed us to derive a closed-form BME for the ideal single-qubit and to develop a numerical approach to approximating the BME for a two-qubit experiment using numerical slice sampling. Our approach offers the real world benefit of eliminating the need for preliminary experiments in common two-photon experiments by accounting for qubit loss within the likelihood. Our method is also scalable beyond two-qubit systems. Finally, we illustrated our approach by applying it to the measurement data obtained from a real-

A. Maximum likelihood estimation with a traditional likelihood
Traditional MLE relies on a simple multinomial likelihood that relates the probability of an observation to its quantum probability using Born's Rule, where E i are the observable of interest, for instance the Pauli operators σ z , σ x , σ y or possible Kronecker products of them, σ z ⊗ σ x , and so forth as dimensionality demands. For instance, the likelihood for a two-photon state with measurement results from a single-basis is P (D|α) = p c 00 00 p c 01 01 p c 10 10 (1−p 00 −p 01 −p 10 ) c 11 with D = {c 00 , c 01 , c 10 , c 11 } and α = {p 00 , p 01 , p 10 }. The problem with this simple view is that only in an ideal, or possibly perfectly symmetric, experiment does the data set D truly result from the quantum probabilities alone. Instead the measurement apparatus will add its own bias and inefficiency to the result. Thus, the data set may need to be corrected using experimental constants determined from initial experiments. In this paper we focus on the pathway efficiencies a 0 , a 1 , b 0 and b 1 present in the two-photon apparatus. Given that we know these exactly for a single basis we can correct the data set in the following manner. Identify the smallest efficiency for both Alice and Bob and use these to correct the counts to If a 0 =a 1 and b 0 =b 1 , it should be apparent that no correction is needed. With the corrected data set D ={k 00 , k 01 , k 10 , k 11 } maximization of P (D |α) can proceed.

B. Numerically sampling density matrices
In this Appendix we describe our procedure for sampling density matrices from the true distribution. While other methods of numerical sampling can be used, we have utilized slice sampling [12,16]. We give a brief description of slice sampling from a singleparameter distribution before describing its extension to density matrix sampling.

B.1. Slice sampling
Slice sampling is an approach to sampling parameter values x from the distribution P (x) given that only the unnormalized distribution P * (x) is known [12,16]. This is useful when evaluating P * (x) everywhere is resource prohibitive-the normalization Z= dxP * (x) is unknown. In this case, consider Fig. 8a) where the dashed curve represents the likelihood P * (x) which we do not know everywhere but can evaluate anywhere. Now consider that we can uniformly fill the space under this curve with points as depicted in Fig. 8b). If we forget the "y" coordinate and randomly select one of the R points we would tend to be sampling from the true distribution as R → ∞. Slice sampling is one method by which to uniformly fill the space under P * (x) with points.
There are different slice sampling methods, but we will use that introduced by Neal [12,16]. Below we evaluate the unnormalized distribution P * (x) in our sampling algorithm. Referring to Fig. 9, a) Start at point x 0 . We assume, for now, this is a random point. b) At this location a vertical coordinate y = q · P * (x), q ∈ [0, 1], is chosen uniformly. c) From x 0 we "step out" to the left and to the right in steps of w until the left point  l and right point r are both outside the curve, P * (l) < y and P * (r) < y. d) We uniformly sample a new point h from the interval x ∈ [l, r]. If h lies inside the curve, P * (h) ≥ y we accept this as our new point. If h lies outside the curve, P * (h) < y , we set the new leftmost or rightmost point equal to h. Whether h is greater or lesser than x 0 reveals which. We do this for our expected unimodal distribution to increase the chance of acceptance for the next sample by eliminating known rejection regions. We reject the first point and shrink the interval. e) We select and accept new point h = x 0 from the interval, since P * (x 1 ) > y. f) We start over at step a) from point x 1 .
In reality, we use ln P * (x) to compare points in the distribution which is more numerically convenient. The first point x 0 in step a) has not come from the true distribution. To resolve this, samples can be taken and forgotten until samples are being taken from P (x) with no influence of the starting point x 0 present. To evaluate when this point has been reached we utilize multiple independent samplers. Our stopping criteria should ensure that all samplers have converged, and also that enough samples have been taken to sufficiently approximate the distribution P (x). Using multiple samplers has compound utility in that convergence can be assessed and also that more samples can be taken in less time when samplers are run in parallel. Our sampling procedure is to perform a "burn-in" where each sampler takes k 0 samples, the mean x and the mean standard deviation σ are calculated for each sampler. When the mean of the ith sampler x i is separated from each of the other j sampler means x j by less than mσ j , m smaller requires closer convergence, we stop the burn in. If this criteria is not met after k i samples, we forget all but the last sample point, double the number of samples k i+1 = 2k i , and start over. Once the criteria is met, we repeat the last sampling request for each sampler. The data from this final request is combined as the final set of samples. These samples tend to be from the true distribution P (x) as the convergence parameter m → 0.
The estimations we have made in this article have been from distributions assumed to be unimodal. While this is certainly true for the simplest distributions given, it is not conclusive for the more complex distributions we have derived. However our assumption is corroborated by the simulations where multiple samplers are given the same data set, started in random independent locations, and always converge. However, we have not addressed how to approach known multi-modal distributions. In this case, more advanced methods such as sequential Monte Carlo (SMC) [7] must be applied to full characterize the posterior distribution.

B.2. Density matrix sampling
The goal of our slice sampling application is to sample density matrices from the true but unknown PD P (τ |D) = 1 Z P (D|τ )P (τ ) used in Eq. 36. To do this we slice sample parameter x in τ from the conditional unnormalized distribution P * (x|τ ) by keeping all other parameters τ fixed, x / ∈ τ . The sampling procedure is performed sequentially Figure 10. This sequence depicts a few iterations of the gradient ascent method. The bottom axis represents an idealized axis along the line of highest gradient ascent direction within the multi-dimensional space.
Each parameter may have unique ranges, thus slice sampling for each must account for the minimum and maximum parameter values x ∈ {min, max}. Additionally, some parameters can be cyclic. For instance, angle φ ∈ {0, 2π}. Extra consideration must be taken for these parameters since their distribution can be centered on a boundary. Hard bounds would not allow this. For these parameters we gave no bounds during the slice sampling algorithm.

C. Maximization using gradient ascent
To locate a local maximum from the likelihood we use an gradient ascent method as shown in Fig. 10. This methods auto-tunes the step size to make both large and small steps when appropriate. a) Starting from the current multi-dimensional point x the gradient ∇P * (x) is determined using the finite difference method. The direction of ascent d is found by normalizing the gradient with the 2 norm ) A step of size w is made in the direction of increase, x → x + w · d. This point is accepted since P * (x + w · d) ≥ P * (x). The step size is doubled.
c) The direction of ascent d is found. A step of 2w is made in the direction of increase, x → x + 2w · d. This point is accepted since P * (x + 2w · d) ≥ P * (x). The step size is doubled.
d) The direction of ascent d is found. A step of 4w is made in the direction of increase, x → x + 4w · d. This point is not accepted since P * (x + 4w · d) < P * (x). The step size is halved.
e) A step of 2w is made in the direction of increase, x → x + 2w · d. This point is not accepted since P * (x + 2w · d) < P * (x). The step size is halved.
f) A step of w is made in the direction of increase, x → x + w · d. This point is accepted since P * (x + w · d) ≥ P * (x). The step size is doubled.
In this way, the local maximum is ultimately approached.