Quantum Mixed State Compiling

The task of learning a quantum circuit to prepare a given mixed state is a fundamental quantum subroutine. We present a variational quantum algorithm (VQA) to learn mixed states which is suitable for near-term hardware. Our algorithm represents a generalization of previous VQAs that aimed at learning preparation circuits for pure states. We consider two different ans\"{a}tze for compiling the target state; the first is based on learning a purification of the state and the second on representing it as a convex combination of pure states. In both cases, the resources required to store and manipulate the compiled state grow with the rank of the approximation. Thus, by learning a lower rank approximation of the target state, our algorithm provides a means of compressing a state for more efficient processing. As a byproduct of our algorithm, one effectively learns the principal components of the target state, and hence our algorithm further provides a new method for principal component analysis. We investigate the efficacy of our algorithm through extensive numerical implementations, showing that typical random states and thermal states of many body systems may be learnt this way. Additionally, we demonstrate on quantum hardware how our algorithm can be used to study hardware noise-induced states.


I. Introduction
The task of learning an unknown d×d quantum state ρ is a fundamental primitive for quantum computing. The well known method of full quantum state tomography learns the matrix elements of ρ directly. The original scheme employs O(d 2 / 2 ) Pauli measurements to learn the state up to additive error in trace distance [1]. Subsequent refinements require only O((rank(ρ)d/ 2 ) · ln(d/ )) measurements [2], or O(d/ 2 ) allowing for a small failure probability [3]. A recent improvement has found it is necessary to use at least Ω(rank(ρ)d/ ) measurements, and it was also conjectured to be sufficient [4]. In all cases, tomography aims to obtain a classical description of a quantum state, and as such, the number of measurements it requires scales exponentially with the number of qubits of the target state.
An alternative approach is to give a more operational meaning to learning. In practice, we are often not interested in the exact form of ρ but instead in its properties; i.e., we wish to estimate o i := Tr[ρO i ] for some observable O i . Ref. [5] proves that if each observable O i , for 1 i M , is restricted to two-outcome measurements, then only O( −4 log 4 M log d) copies of ρ are sufficient * naezzell@proton.me to estimate each o i up to additive error by using a method called shadow tomography [6][7][8]. These results were expanded to simpler, experimentally tractable Clifford measurements and random Pauli measurements in Refs. [6,9]. In particular, it was shown that a polynomial number of Clifford (Pauli) measurements is sufficient to estimate any low rank (local) observable. The polynomial scaling achieved by shadow tomography is gained at the cost of providing only a partial description of the quantum state.
Yet another approach to quantum state learning, and the one we take here, is instead to solve the problem of state compilation. This involves learning a quantum circuit with which we may prepare an approximation of the target state. In this paper, we propose and demonstrate a near-term algorithm for approximately compiling an unknown quantum state by variationally training a parameterized quantum circuit. Similar to the shadow tomography case, one can use the final cost function value in our algorithm to place bounds on the deviation of observables estimated from the learnt state from those of the target state.
While the variational compilation of pure states has been explored in Refs. [10][11][12], here we focus on mixed state compilation. Specifically, we propose two different ansätze that can compile a mixed state either into a purification using ancilla or as a convex combination of pure states. The latter is similar to the Hamiltonian-based model approach in Ref. [13]. In addition, our approach shares similarities with Refs. [14,15], which present algorithms to learn the diagonalization of a target state, but in contrast to our approach, these methods only apply to low-rank quantum states. A circuit to prepare the target state is already known and quantum compilation is used to learn a more efficient circuit to prepare that state. b) The target state is the (unknown) output of (unknown) noise processes on quantum hardware. c), d) Quantum compiling is used to upload a quantum state from an experimental system (or different quantum computer) to a quantum computer. This could be performed either coherently using a Loschmidt echo/SWAP test to compute the cost (c) or incoherently by combining our algorithm with classical shadows techniques (d). In all sub-figures Eρ denotes the channel that prepares ρ and U † σ (α) denotes a single parameterized unitary to learn ρ via the CCPS ansatz.
Quantum state compilation, in providing a quantum circuit description of a state, serves a different purpose from tomography or classical shadows, as summarized in Fig. 1. In the first case, it may be used to 'upload' an unknown state of an experimental quantum system to a quantum computer. In other words, given a quantum system in an unknown target mixed state, our algorithms can learn a circuit representation of it for implementation on any digital quantum device. In this case, we can use the compilation to more efficiently store the state for later processing, compute properties of the state that are hard to measure directly, or use the state as the input to another algorithm, say for quantum simulation. Alternatively, a circuit to prepare the target state might already be known, but the aim would be to learn a more efficient (i.e., shorter depth and/or more noise resilient) circuit to prepare that state. These applications are particularly interesting when the unknown mixed state is generated by unknown noise as in (b). In this case, the compilation serves as a snapshot of how a noisy quantum computer corrupted a desired input state. This snapshot can then be used later or perhaps even on a more coherent machine for reliable processing.
In some cases, it may be of interest not to learn the target state perfectly but rather attempt to learn a low(er) rank approximation of it. In general, the resources required to manipulate a quantum state on quantum hard-ware grow with its rank. Thus, learning a lower rank approximation provides a means of compressing a state to store it more efficiently. The compression of quantum data has a long history tracing back to the early days of quantum information theory [1, 16,17]. More recently, the question of how well any given state may be approximated using a lower rank state was addressed analytically in Ref. [18] for the Hilbert-Schmidt and trace distances. Interestingly, the optimal lower rank approximation essentially corresponds to performing principal component analysis (PCA) (jump to Eq. (3) to look ahead) for the Hilbert-Schmidt distance (but not for trace distance). Hence, as we will show, one can also use our learned compilation to perform PCA with a desired cut-off rank. At the same time, the analytical expression for the optimal state from this work provides a natural benchmark for our learning task.
Our algorithm, which is summarized in Fig. 2, involves variationally minimizing a cost function that is formulated in terms of the Hilbert-Schmidt distance between the target state and ansatz state. This cost can be efficiently measured using either a SWAP test or a Loschmidt echo circuit. In contrast to Ref. [19], which sketches a quantum generative adversarial neural network that might be used to learn a mixed state, we further present local variants of our cost functions to mitigate the trainability barrier posed by barren plateaus [20][21][22][23][24][25][26][27][28][29][30].
In the numerical simulations of our proposed algorithm, we demonstrate the applicability of both the purification and convex combination ansätze for learning full and lower rank approximations of a target mixed state. In particular, we numerically simulate the learning of typical random states, as well as of random thermal states of the Heisenberg XY model. We additionally implement our algorithm on quantum hardware in order to compile an unknown state generated by hardware noise.

II. Quantum Mixed State Compiling Algorithm
The Quantum Mixed State Compiling (QMSC) algorithm takes as input a mixed state ρ and a desired approximation rank R. Though not necessary, we assume R rank(ρ) =: r in defining our algorithm since this is both sensible and simplifies our discussion. In our algorithm, we take this a step further and constrain R r , where r is the -rank 1 of ρ, which counts the number of eigenvalues greater than > 0. This notion of approximate rank is designed to capture the intuition that the contribution of very small but non-zero eigenvalues can The algorithm takes as inputs a target state ρ and a desired approximation rank R. The next step is to (b) i. estimate the purity of the target state or (optionally) ii. estimate the first R eigenvalues of ρ. The core of the algorithm consists of (c) variationally compiling the mixed state using either i. the state purification (SP) ansatz or ii. the convex combination of pure states (CCPS) ansatz (detailed in the green and dark blue boxes respectively). (d) The algorithm outputs a rank-R approximation of ρ. The pink box details the Loschmidt echo circuits used to evaluate the overlap terms when using the SP ansatz (left) or the CCPS ansatz (right). A detailed description of all the circuits necessary to evaluate the cost for both ansätze is given in Appendix B.
often in practice be ignored. The goal then is to optimize the classical parameters α of a parameterized trial state σ(α, R) satisfying rank(σ(α * , R)) = R, such that, for the optimized parameters α * , the output state σ(α * , R) well approximates the target ρ.
In order to assess the closeness between our training state σ and our target state ρ, we employ the following cost function in terms of the Hilbert-Schmidt distance between the two states: (1) As discussed in more detail in Sections II C and II D, the Hilbert-Schmidt distance can be efficiently computed using a combination of SWAP tests and/or a Loschmidtecho-like circuit. In Appendix A we detail how this cost can be reformulated such that it requires only local measurements [21] to mitigate the barrier to trainability posed by barren plateaus [20][21][22][23][24][25][26][27][28][29][30]. If trained well, the learned state σ(α * , R) should be close to the solution to the Quantum Low-Rank Approximation Problem [18]. As discussed in Ref. [18], the unique optimal state that minimizes the Hilbert-Schmidt distance, subject to a rank constraint, i.e., the state σ(α opt , R) where takes the form Here Π R is a projector onto the eigenstates corresponding to the R largest eigenvalues of ρ. That is, Π R is a projector onto the first R principal components of ρ. We note here that the projection Π R is the same conceptually as the typical subspace projection [34, page 31, Theorem 1.18] used in quantum data compression [16] (given that it projects onto the high probability subspace of the state ρ), and the quantum channel in (3) is essentially the same as the encoding channel used in quantum data compression [1, equation (12.50)].
With this analytical solution, we can define a natural performance metric for the Quantum Mixed State Compilation algorithm as the difference between the found cost and optimal possible cost, More explicitly, this expression takes the form where {λ i } r−1 i=0 denotes the set of eigenvalues of the target state. Of course when ρ is truly unknown, we can only report the optimal cost C(α * , R), but in this work, we actually compute ∆ R to verify that our algorithm is working as intended.
In this paper, we develop a practical, noisy intermediate-scale quantum (NISQ)-friendly algorithm to find σ(α * , R), an approximation of σ(α opt , R). In doing so, we also find the first R principal values and components of ρ, and so we are essentially performing quantum principal component analysis (PCA), analogous to some previously proposed algorithms [14,15,35]. The precise form in which σ(α * , R) (or the first R principal components) is obtained depends on our choice of ansatz. We consider two different ansätze, one based on learning an approximate purification of ρ and the second on decomposing an approximation of it into a convex combination of pure states. We also discuss the complexity of computing our cost function, as well as its operational meaning. These are detailed in the following sections. A summary of our algorithm, which also describes the two ansätze choices, is given in Fig. 2.

A. Complexity of Cost Function
For all VQAs, the purpose of using the quantum computer in the optimization loop is to estimate a cost function efficiently, which otherwise would be difficult to estimate classically. Hence, it is helpful to establish that the cost function is indeed classically hard to estimate. Previously proposed VQAs for quantum compiling [10] and linear system solving [36] have established classical hardness via the DQC1 hardness of estimating the relevant cost function. (In the previous sentence, DQC1 stands for deterministic quantum computation with one clean qubit [37]). We can make this same argument for the cost function in Eq. (1), as follows.
A special case of computing Eq. (1) is when the two states happen to be Choi states associated with unitary processes acting on a d dimensional Hilbert space. In this case, we write ρ = |φ U φ U | and σ = |φ V φ V |, where |φ U = (1 1 ⊗ U )|φ , |φ V = (1 1 ⊗ V )|φ , and |φ = (1/ √ d) j |j |j is the standard Bell state. Then the cost function in (1) becomes C = 2 − 2Tr(ρσ) where Tr(ρσ) = |Tr(U † V )| 2 /d 2 . Hence, in this special case we have that C = 2C HST where C HST is the Hilbert-Schmidt test cost of Ref. [10]. Estimating the latter was shown to be DQC1-hard in Ref. [10], and hence our cost function C is also DQC1-hard. Since efficient classical simulation of DQC1 would imply a collapse of the polynomial hierarchy [38,39], standard complexity assumptions imply the classical hardness of estimating C.

Observable estimation
There is a close connection between the Hilbert-Schmidt distance and trace distance for low-rank states [40], which gives operational meaning to our cost function in this case. Namely when at least one of the states is low rank, then the two distance measures are essentially equivalent [40]. This can be seen from the following inequality relating the 2-norm to the 1-norm: for any two density matrices ρ and σ. Here, R = rank(ρ)rank(σ)/[rank(ρ) + rank(σ)] is a quantity called the reduced rank, which is analogous to the reduced mass employed in physics.
In this sense, one can use the Hilbert-Schmidt distance, and hence our cost function, as a strong upper bound on the trace distance. In turn, the trace distance has operational meaning in terms of the difference of observable expectation values on the two states [1], ρ − σ 1 = 2 max P Tr(P (ρ − σ)) where the maximization is over all POVM (Positive Operator Valued Measure) elements P , i.e., operators satisfying 0 P 1 1. Hence, our cost function inherits this operational meaning: for any POVM element P , where R is the reduced rank for ρ and σ(α, R).

Eigenvalue and Eigenvector Estimation
We also establish an operational meaning for our cost function in the context of minimizing the errors in the eigenvectors and eigenvalues of the compiled state. This is particularly relevant in the context of using our algorithm for PCA, since PCA precisely aims to extract the eigenvectors with the largest eigenvalues.
First, we note that our proposed cost is an upper bound on the squared difference between the true eigenvalues of ρ and the eigenvalues of the learnt state σ, as follows. Let us denote the eigenvalues of ρ in increasing order as i=0 and the eigenvalues of σ as {µ i } d−1 i=0 , with d the Hilbert space dimension. Then a measure of eigenvalue error is It then follows from the Hoffman-Wielandt theorem that Thus if one can achieve a small cost value, the average error in the learnt eigenvalues is guaranteed to be small.
We remark that a small cost value might not be achievable since the optimal cost C(α opt , R) is often non-zero. In this case, one could consider an alternative measure of eigenvalue error that, unlike (9), would vanish for the optimal state in (3). Namely, one can consider the set S R of states that have the correct R-largest eigenvalues up to an additive constant, and then define a measure of eigenvalue error as ∆ S R (σ) = min τ ∈S R σ − τ 2 2 . Then it is clear that our cost function upper bounds this error as well: C ∆ S R .
Second, we note that our proposed cost function is an upper bound on the eigenvector error measure introduced in Ref. [14]. Specifically, a natural measure of the difference between the true eigenvectors of the target state, {|v i } r−1 i=0 , and the learnt eigenvectors, {|u i } R−1 i=0 , is given by where Expanding this out we have that (13) where for the inequality we use the fact that (i.e., the trace can be taken with respect to an arbitrary orthonormal basis). Thus a small cost function guarantees a small eigenvector error by this measure.

C. State Purification Ansatz
Let us now move onto the different types of ansatz constructions. The state purification (SP) ansatz constructs trial states of the form where U θ acts on the n system qubits (the same qubits where the state ρ resides) plus an ancilla register A composed of n A qubits. In using n A ancillas, we guarantee that rank(σ) 2 n A , but for a typical parameterized circuit U θ with randomly generated parameters, one has precisely rank(σ) = 2 n A . Thus, in practise, the SP ansatz is only compatible with controlling R in powers of two, so that R = 2 n A . While there are many different possible choices in the ansatz for U θ , we suggest using either a problem-inspired approach obeying the symmetries of the target state [41][42][43] and/or an adaptive approach [44] to mitigate the problem posed by barren plateaus. A discussion of specific choices in our work for different ensembles is given in Appendix B. Upon substitution of the SP ansatz into our proposed Hilbert-Schmidt distance-based cost function, we obtain where we use S to denote the system of the states ρ and σ(θ, n A ). Below we describe efficient ways of computing each of these terms.
Purity of target The purity Tr[ρ 2 ] of the target ρ may be measured using a SWAP test [45] or its destructive variant [46]. For the estimate to be within ε additive error of the true value with probability not smaller than 1 − δ, the Hoeffding bound implies that it suffices to take O(ε −2 log δ −1 ) samples. Alternatively, as this term remains constant throughout the procedure, we may opt to neglect it and only focus on optimizing the remaining two parameter-dependent terms.
Purity of ansatz The purity Tr[σ(θ, n A ) 2 ] of the ansatz σ(θ, n A ) may again be measured via a SWAP test. It is also possible to make use of the fact that we are representing it via its purification and measure the purity of the ansatz via the Loschmidt-echo circuit shown in Fig. 2. Again, from the Hoeffding bound, it suffices to take O(ε −2 log δ −1 ) samples.
Overlap term The overlap term Tr[ρσ(θ, n A )] can be evaluated using either a SWAP test or the Loschmidtecho type circuit pictured in Fig. 2. The latter requires fewer qubits and controlled unitaries and hence is more NISQ friendly. To see why this circuit works, note that since we prepare the trial state via a higher-dimensional purification found by evolving an initial all-zero state to the purification of ρ, we can rewrite the overlap term as follows This corresponds to preparing a maximally mixed state on the n A qubit ancillary system alongside ρ which is prepared on the system register S, evolving the system and ancilla registers under U θ , and then performing an all-zero measurement. Due to the factor of 2 n A , the number of shots required to measure this cost within additive error will scale exponentially in the number of ancilla qubits n A . Hence the Loschmidt echo method for computing the overlap is only appropriate for learning low rank approximations to ρ where n A ∈ poly(log(n)).
As discussed earlier, the above method may also be used to calculate the purity of our ansatz simply by replacing the initial state ρ with the trial state σ. However, the purity of our target may not be measured this way, as it requires knowledge of a purification of ρ, which is precisely what we are trying to find.

D. Convex Combination of Pure States Ansatz
The convex combination of pure states (CCPS) ansatz constructs trial states of the form Here {|i } R−1 i=0 denotes a subset of the computational basis of n qubits, α = (θ, φ) is a vector of parameters, U θ is a parameterized quantum circuit, and p φ is a parameterized probability distribution. An appealing feature of this ansatz is that learning a rank-R approximation gives any rank R < R approximation for free since one can always 'drop' the eigenstates corresponding to the smallest R − R eigenvalues and then re-normalize (see Eq. (31a)).
Since the state that minimizes our cost, σ(α opt , R) (i.e., the solution to the Quantum Low-Rank Approximation Problem given in Eq. (3)), is proportional to the first R principal components of the target state ρ, it follows that this ansatz can be used to learn the principal components of ρ. More precisely, we have that the first R principal values of ρ are given by {p φ (i)} R−1 i=0 and its principal components are {U θopt |i } R−1 i=0 . When the rank of the trial state is low, i.e., R ∈ Ω(poly(n)), such that we are learning a low rank approximation of the target state ρ, or for small scale problems, the vector of probabilities p φ := (p φ (0), . . . , p φ (R − 1)) may be stored simply as a classical vector. For learning high rank approximations to larger full rank states, the probability vector is exponentially large and thus cannot be explicitly stored efficiently. Rather the process will need to be sampled from. Such samples can be generated via classical neural networks, such as generative neural networks [47] or Boltzmann machines [48]. In this case, there are similarities between the CCPS ansatz and the Hamiltonian models considered in Ref. [13]. Similarly to the SPA, for U θ we suggest using either a problem inspired ansatz obeying the symmetries of the target state [41][42][43] and/or an adaptive approach [44]. A discussion of specific choices in our work for different ensembles is given in Appendix B.
Upon substituting the expression for the convex combination of pure states ansatz into our proposed Hilbert-Schmidt distance cost function, we obtain with α = (θ, φ). Here we describe how each of the terms in the above cost may be efficiently computed.
Purity of target. As for the state purification ansatz, the first term, i.e., the purity of the target state, may be computed using a SWAP test [45] or its destructive variant [46] (see also Ref. [49]).
Purity of ansatz. The second term, the purity of the guessed state, is equal to the sum of the square of the probabilities of the parameterized distribution, p φ (i). For low rank approximations, i.e., R ∈ poly(n), one may store the probability vector classically and thus this term can be computed by basic arithmetic.
When the probability vector is too large to be stored explicitly, but rather is handled via sampling, the purity of the ansatz can be estimated with a classical version of the SWAP test. Indeed, the approach is to take two independent samples from the distribution p φ (i) (call the random samples I and J). We then set an indicator random variable χ P as χ P = 0 if the samples are not equal, and χ P = 1 if the samples are equal. The expectation of this random variable is then given by Thus, the random variable χ P is an unbiased estimator of the collision probability, i (p φ (i)) 2 , and takes values between zero and one. The Hoeffding bound then applies, and we can take O(ε −2 log δ −1 ) independent samples of χ P in order to estimate the collision probability i (p φ (i)) 2 to within additive error ε with probability not smaller than 1 − δ.
Overlap term. A naive approach to computing the third term, the overlap between the guess σ and target ρ, would be to estimate each of the i|U † θ ρU θ |i terms using the Loschmidt echo circuit shown in Fig. 2, followed by classical post processing. That is, we compute i|U † θ ρU θ |i by preparing the state ρ, performing the unitary U † θ , and then measuring in the computational basis. The total overlap term could simply be computed by weighting each of the probabilities i|U † θ ρU θ |i by the corresponding classical probability p φ (i) and taking their sum. This naive approach will work for small problems; however, even in the case where p φ (i) can be explicitly stored, this method will not be efficient for larger problems. The problem is that one needs enough shots to estimate all 2 n probabilities i|U † θ ρU θ |i . Combining this observation with Hoeffding's equality, it is apparent that this method requires an exponential number of shots, O(2 n ε −2 log δ −1 ).
Instead, we propose computing the overlap term using a generalization of the classical SWAP test. Let us define the distribution which can be sampled from via the Loschmidt echo circuit. Then we see that This quantity can be estimated by taking a sample from p φ (i) and an independent sample from q θ (i) (call the samples I and J) and setting an indicator random variable χ O if the samples are not equal and χ O = 1 if the samples are equal. The expectation of this random variable is given by Thus, the random variable χ O is an unbiased estimator of the collision probability Tr[ρσ], and it takes values between zero and one. The Hoeffding bound then applies, and we can take O(ε −2 log δ −1 ) independent samples of χ O in order to estimate this collision probability to within additive error ε with probability not smaller than 1 − δ.

E. Performing PCA with our Ansätze
Both ansätze also allow for principal component analysis (PCA) of ρ. To make this precise, we first write the target state as where λ 1 > λ 2 > · · · > λ r are the ordered principal values of ρ with associated principal components |v i . By inspection, the CCPS ansatz is directly an ansatz for the principal components of ρ. In other words, learning a CCPS representation of σ(α * , R) provides an estimate of the principal components, |u i := U (θ * )|i , explicitly. It also provides an estimate of the principal values, p φ (i), but due to normalization, these are expected to be different from λ i by an additive constant when ∆ R is small. As R approaches r , this additive constant goes to zero, and here p φ (i) ≈ λ i . Nevertheless, even when R < r , the values of p φ (i) are such that σ(α * , R) acts as the closest rank R proxy, so in this operational sense, it is still appropriate to call p φ (i) the principal value estimates.
One very useful property of these explicit principal component/value estimates is that they allow us to construct any numerically optimal R < R approximation by truncation, Indeed, this is perhaps one convincing way to view each p φ (i) as the appropriate operational sense of "principal value" when we truncate the rank. Though less obvious, the SP ansatz can also be used for PCA by using a carefully designed circuit for the purification ansatz as described in Appendix B. The result is that computational basis measurements on the ancilla system prepare the principal vectors on the target system with probabilities given by the principal values. Thus, the knowledge of principal values/components here is implicit, and hence cannot be directly used to obtain lower rank approximations by truncation.
Finally, we remark that while PCA is a general procedure, it has an intuitive meaning for physically relevant classes of states. As an example, consider an XY thermal state. In general, the eigenvalues follow a Boltzmann distribution, and the eigenvectors are those of the XY Hamiltonian itself. At low temperature, we often say a state is approximately in its ground-state. More precisely, we mean, for small. In our language, we would say the state is approximately rank q, and when q r as is typical for low temperature thermal states, it's approximately low rank 2 . By performing PCA with a target rank R = q, we find approximations of the eigenvectors with low energies {|v i } q i=1 and corresponding Boltzmann weights Hence, our PCA algorithm can be thought of as a way to learn the Boltzmann weights and a means to prepare low-lying energy eigenvectors of a quantum thermal state which has also been explored in other NISQ friendly works [50].

F. Comparison of Ansätze
The reliance on an ancillary system to compute the cost function for the SP ansatz naturally increases the resources required for computation. Furthermore, in general, for a typical choice of U θ , the guess state will have 2 We comment that this exact notion of low rank actually precisely agrees with the definition of epsilon-rank used in Ref. [31]. We employ a slightly different definition already given that is more amenable to NISQ experiments. rank R = 2 n A . That is, one is limited to ranks of powers of two. In contrast, CCPS both requires no ancilla and allows for fine control over both output ranks. However, the need to learn p φ under the constraint of convexity, in addition to θ, potentially increases the complexity of our optimization subroutine. Thus the choice as to whether to use CCPS or SP will depend in large part on whether classical resources (optimization power) or quantum resources (qubits available) are more constrained.
Of course, the choice as to whether to use the CCPS or SP ansatz may depend not only on the required resources but also the end goal of the subroutine. For example, if the end goal is principal component analysis, this is more readily performed using CCPS since it automatically finds the first R principal components of the target. Alternatively, one can imagine situations where it is desirable to learn the purification of the target. For example, the purification of a state opens up methods for computing entanglement measures between subsystems of that state [51][52][53], the fidelity between two states (given Ulhmann's theorem) [54,55], as well as symmetry measures [56].

G. Summary of Algorithm
The Quantum Mixed State Compiling algorithm is summarized in Fig. 2. At its core it is composed of the following steps.
1. Start with a target state ρ, and a desired rank R for the compiled state σ R used to approximate ρ.
2. Choose whether to learn the target using the purification or convex combination ansatz. In general, this decision will depend on the purpose for which the state is being learned and the resources available. A discussion of the circuits one must run in both cases is given in Appendix B.
3. Minimize the Hilbert-Schmidt distance cost C, Eq. (1), using a hybrid quantum-classical optimization loop to find the trained parameters θ * (in the case of the SP ansatz) and (θ * , φ * ) (in the case of the CCPS ansatz) that approximately minimize C. If using a gradient based optimizer one can analytically compute the gradient of the cost using the parameter shift rule [57,58]. For the case of the purity of the ansatz term for the state purification ansatz, Tr[σ(θ, n A ) 2 ], this rule needs modifying to account for correlations. For more details on computing the gradients, see Appendix C.
Here we briefly describe two ways in which prior information about the structure of the target state could be used to make the mixed state learning algorithm more efficient.
In the first instance, the purity of the target state may be computed and then we need only consider 'guess' states σ R with the correct purity. That is, one may use the target purity as a constraint during the minimization of the Hilbert-Schmidt distance cost, Eq. (1). This amounts to maximizing the overlap term Tr[ρσ] subject to the constraint Tr[σ 2 ] = Tr[ρ 2 ]. We note that this constrained optimization may not be compatible with the rank constraint if the desired rank is much lower than the true rank of the target, i.e., R r. Going a step further, one could also simplify the task of learning a rank-R approximation to ρ by first learning the R largest eigenvalues of ρ and then using the mixed state learning algorithm to learn their corresponding eigenvectors. The largest eigenvalues of any state σ can be learnt non-variationally using the quantum algorithm proposed in Ref. [49]. Denoting the measured eigenvalues as {λ i } R−1 i=0 , the eigenvectors may be learnt by maximizing the overlap term i λ i i|U † θ ρU θ |i . As discussed further in Appendix A, one theoretical advantage of these two modifications is that they require optimizing only a single overlap term rather than the difference between two purity terms and an overlap term. This overlap term takes the form of a standard Variational Quantum Eigensolver (VQE) cost and therefore can be readily transformed into a local cost for which we have trainability guarantees [21].

III. Numerical Simulations
We begin our numerical simulations discussion with a brief summary of the target states and chosen optimizer. Additional details can be found in our open source code [59] which includes our raw data and the scripts we used to generate it. We then discuss the results of compiling each of the listed states in separate sections.

A. Description of States and Optimizer
We discuss the performance of our QMSC algorithm for three main types of states: 1. Random states drawn from the Bures measure.
2. Thermal states of an XY chain with random coefficients at both low and high temperature.
3. Noise-induced states generated by simple circuits on NISQ hardware. (We shall call them NISQ states for short.) In the first two cases, we perform the entire optimization using idealized classical simulations. That is, we evaluate the cost functions using matrix operations on a classical machine with no error model and no shots (i.e., infinite precision). Henceforth, the use of "idealized classical simulations" will continue to have this precise meaning. This serves both as a proof-of-principle as well as a means to estimate an empirical idealized scaling. For the NISQ states, discussed in Section IV, we evaluate all cost functions on quantum hardware but use classical computation for the parameter updates, which is the standard variational quantum algorithm approach.
In each case, we consider a low rank approximation by setting n A = 1 (or, equivalently, R = 2) or a full (epsilon) rank approximation with n A = log 2 r (or R = 2 log 2 r ). It is natural to wonder why we choose to use R = 2 log 2 r instead of R = r . The reason is straightforward: we simply want both the SP and the CCPS ansatz to attempt to learn the same state to the same rank approximation to make a fair comparison, and we can only control the rank of the SP ansatz in powers of two.
We use the gradient-free Powell optimizer [60] provided in the scipy [61] optimization library. We find that Powell is more robust and generally outperforms the common scipy black box alternatives such as BFGS [62], Nelder-Mead [63], SLSQP [64], and COBYLA [65] for our problems. Of course, the performance could be improved by using advanced VQA optimizers such as SPSA [66] or ICANs [67,68], but we do not pursue this refinement since we found Powell to give reasonable results.

B. Bures Random States Results
We first study the Bures random state distribution because it is a reasonable sampling distribution when nothing about the quantum state is known [69,70] (see Appendix D 1 for more details). For some intuition, note that one way to generate n qubit Bures random states is by preparing a Haar random state on n + n qubits, applying an n-qubit "local Haar random" unitary on the system qubits, and then tracing out the ancilla. In this sense, learning the purification is similar to learning a Haar random unitary on 2n qubits which we know to be intractable for VQAs due to an intrinsic, ansatzindependent vanishing gradient problem [22].
Given their lack of structure we see Bures random states as a good test case to compare the state purification and convex combination of pure states ansätze. However, due to the unavoidable vanishing gradient problem-along with the large number of parameters needed for an unstructured state (see Appendix B)-we only test modest sizes. Specifically, we test from n = 1 to n = 4, 5 for low rank approximations and from n = 1 to n = 3 for full -rank approximations (i.e., we learn up to a six-qubit random purification).
We tested our algorithm on 25 Bures random states for each n. The results are shown in Fig. 3. Here, SP results are shown in (a) and the CCPS results in (b). We plot the difference between the optimized cost and the lowest possible cost ∆ R (see Eq. (5) or Appendix E for more details) as a function of system size, n, for both low rank (n A = 1, R = 2) and full -rank (n A = log 2 r , R = 2 log 2 r ) approximations along with the number of iterations n it necessary to reach ∆ R .
Here the compilation is performed classically using the Powell optimizer. Our ansatz, which uses alternating layers of arbitrary two-qubit gates, is explained in Appendix B.
We focus our discussion first on the top row, where we plot the performance metric ∆ R . Here, we see that our algorithm is capable of learning completely (Bures) random states provided n is small enough. Indeed, for the SP ansatz, the value ∆ R stays below 10 −10 for all values of n tested when compiling both a low rank and a full -rank approximation. Interestingly, the values of ∆ R reached for the CCPS ansatz are substantially higher, reaching values of up to 10 −8 for the low-rank and up to 10 −6 for the full -rank approximations. This suggests that the CCPS optimization is more difficult than the SP one. This is plausibly due to the fact that the optimization was performed over both angles and probabilities, which needed to satisfy a normalization constraint.
In the second row of Fig. 3, we plot the number of iterations, n it , needed to reach the ∆ R values above. For reference, we also plot the curve showing 4 n scaling for naive full tomography as well as a curve r 2 n for improved quantum sensing tomography. Here, r denotes the average -rank across the 25 random instances. While it is difficult to draw definitive claims for the small values of n accessible, we see an interesting split in the results. For low-rank (n A = 1, R = 2) approximations, the number of iterations required seems to scale more favorably for our method than for both forms of tomography for both the SP and CCPS ansätze. The opposite appears to be true for the full--rank approximation where it appears that even full tomography is a better strategy at n = 3 for both ansätze. This could plausibly be explained by the barren plateau phenomenon for learning random states that was proven in Ref. [22]; hence reconfirming that variational methods are not well suited to fully learning typical random states. On the other hand, with no shot noise and such small n, this could also be due to the presence of many local minima [71,72]. Regardless of the cause, our simulations suggest that while our algorithm can learn small unstructured random states, it cannot scale beyond modest n.

C. XY Model Results
This discussion naturally raises the question of what happens when structure is present. This leads us to the study of thermal states in the Heisenberg XY model, given by where J i , K i ∼ N (0, 1) are i.i.d. standard Gaussian random variables and β = 1/k B T is the inverse temperature. By controlling the temperature, we can control the -rank of the generated mixed states (see Appendix D 2 for more details). Structurally, this model is clearly invariant under any global rotation of all spins, and the number of spins is constant. Thus, even with random coefficients, it exhibits important symmetries which allow us to greatly simplify our ansatz (see Appendix B). For this reason, we are able to test from n = 2 to n = 8 qubits relatively easily, which is sufficient for an initial study of empirical resource scaling. We tested our algorithm on 50 random XY chain thermal states (see Eq. (33a)) for each n: 25 at a large inverse temperature of β = 20 (i.e., low temperature regime) and 25 at a relatively smaller inverse temperature of β = 2 (i.e., moderate to high temperature regime). The results are shown in Fig. 4, whose format is the same as that in Fig. 3. Similar to the Bures results, we generated the results using noiseless classical optimization of an alternating layer ansatz with the Powell optimizer. Unlike Bures, however, the ansatz consisted of so-called Givens gates (see Appendix B for exact definition) which respect the symmetries of the XY model.
Our first observation is that the performance (i.e., both ∆ R and n it ) does not have a noticeable dependence on β. For this reason, we chose to simplify the presentation of the results and combine all 50 states (for each n) into a single box plot. Note that this was not expected a priori. As discussed in Appendix D 2, β = 20 corresponds to the limit when ρ (XY) n is approximately in the ground state of H XY with a low -rank, whereas β = 2 samples intermediate to large -ranks. Hence, what we have found is that our optimization is insensitive to the underlying -rank of the target state in this case.
For the SP ansatz, the performance (as measured by ∆ R ) is quite good, with a worst case of n = 8, n A = log 2 r , ∆ R ≈ 10 −6 . The same point for the CCPS ansatz only reached ∆ R ≈ 10 −4 in median performance, and indeed, the CCPS performance is noticeably worse across the entire data set. That is, when solving the same problem, the final ∆ R for the CCPS ansatz is often noticeably larger than the SP ansatz. However, even this worst median ∆ R ≈ 10 −4 is an acceptable "4 nines" result (i.e., C(α * , R) differs from C(α opt , R) only in the fourth decimal place). Across the entire data set, whether low or full rank and SP or CCPS, the actual number of iterations scales slightly better than the compressed tomography scaling of r 2 n and is substantially better than naïve tomography. For example, at n = 8, both low and high rank results use ≈ 56 times fewer iterations (relative to 4 n = 4 8 for full naive tomography) with SP. However, this positive result must be considered along with the result that the quality of the solution, ∆ R , deteriorates with n. A fair summary of the result can be understood by setting an acceptable cut-off, δ c . Supposing that δ c = 10 −3 (a "three nines criterion"), what our data shows is that we can reach ∆ R < δ c in fewer iterations than both naïve tomography and compressed sensing tomography.

D. PCA and state compression example
As discussed, the CCPS ansatz (see Eq. (18)) takes the form of a convex combination of the R estimated principal components of ρ weighted by the associated principal values. Hence, learning σ CCPS (α * , R) is tantamount to performing principal component analysis (PCA). As a consequence, we get the form of any R < R approximation from the rank R solution for free by truncation (see Eq. 31a). We explore the practical meaning of these two statements by example in Fig. 5. In this case, we learn a full rank CCPS ansatz approximation of a random three qubit XY thermal state at low temperature, ρ β=20 n=3 . As before, we consider 25 such optimizations over randomly drawn coefficients (see Eq. (33a)).
In the top plot of Fig. 5, we explore the effect of truncation from R = 8 to R < R on the cost as a function of R . At R = 8, the median cost is C(α * , R) = ∆ R=8 ≈ 10 −10 which means we have successfully learned ρ β=20 n=3 . But even as we truncate down to R = 2, the cost hardly changes-only jumping to a large value of 0.5 at R = 1. This suggests that only two principal components are necessary to approximate ρ β=20 n=3 , and we corroborate this intuition by showing the principal values (or spectrum) of both ρ β=20 n=3 and σ(α * , R = 8) in the inset. Here, the bar-plot shows the median k th principal value as a function of k. For ρ β=20 n=3 , the values are λ 0 ≈ 1/2, λ 1 ≈ 1/2, and λ 3 ≈ 10 −13 , so r = 2 for > 10 −13 for at least half of the instances (and all for > 10 −4 ).
In the bottom plot of Fig. 5, we show the pure state infidelity between the k th principal component of ρ β=20 n=3 , |v k , and the associated estimate contained in the CCPS ansatz, |u k = U θ * |k as a function of k. Note that the k labels are ordered by decreasing principal value, i.e., k = 0 corresponds to the principal component with the largest principal value and so on. Clearly, the infidelity for k = 0 and k = 1 is very small ∼ 10 −11 whereas the infidelity for k 2 can be rather large. This again is due to ρ β=20 n=3 having an effective -rank of two. But by plotting the pure state infidelity we have also made the notion of "learning the principal components of ρ" more explicit. Namely, a good approximation of ρ with the CCPS ansatz relies on having a high quality and explicit estimate of its important principal components as weighted by the relative importance of the principal values.
Finally, this discussion suggests an alternative way to use our algorithm as a means to find the approximate rank of an unknown state. By training for different values of R until C 1 or C converges, we can estimate that rank(ρ) ≈ R. This procedure also clarifies what we mean by claiming that our algorithm allows us to "learn a lower rank approximation that allows for more efficient processing." In this example, we mean that learning a rank R = 2 approximation is sufficient, and therefore σ CCPS (α * , R = 2) is a low-rank approximation/compression of ρ β=20 n=3 .

IV. Quantum Hardware Implementation
Finally, we consider the most important task for our algorithm: compiling a quantum state on a quantum device. In Fig. 6, we demonstrate the ability to learn two classes of quantum states on the IBM superconducting qubit devices with both ansatz choices. In Fig. 6(a), we successfully compile a random single qubit mixed state with both the SP and CCPS ansatz. Here, the random state is generated by tracing over one qubit in a two-qubit Haar random state The Hilbert-Schmidt (HS) subscript signifies that such a state is uniformly sampled from the Hilbert-Schmidt metric [70]. Finding a compilation for a HS random state has a similar proof-of-principle goal to the Bures random states, but this class is easier to prepare on NISQ devices. We remark that the state ρ HS is known a priori in this example in the sense that we supply the quantum device with U Haar . In Fig. 6(b), we learn unknown states generated by noisy state preparation which we call "NISQ" states. In particular, we learn a noisy |+ state and a noisy |Φ + state,ρ a single-and two-qubit state, respectively. Forρ + , we apply a single Hadamard and then idle for the duration of 20 Hadamards. Forρ Φ + , we follow the same procedure but then apply a final CNOT at the end. Under perfect conditions, these would generate the states |+ and |Φ + . Due to intrinsic ZZ cross-talk on superconducting devices [73][74][75][76] along with other secondary sources of noise, the qubits undergo a complicated dephasing process which we summarize as some unknown quantum channels E 1 and E 2 , respectively. For some sense of the strength of noise on these devices, we remark that Tr[ρ 2 + ] = 0.90(1) and Tr[ρ 2 Φ + ] = 0.78(4) with 1σ confidence intervals generated from 10 bootstrapped tomography experiments. Since we do not know the quantum channels in advance, compiling the NISQ states is also tantamount to learning the states. For example, we can learn a low-depth circuit to prepareρ + , which is short enough to be unaffected by ZZ cross-talk. This serves as a permanent snapshot to probeρ + even when E 1 inevitably drifts due to two-level system and calibration effects.
With the states now defined, we discuss the results presented in Fig. 6 more closely. For each optimization we plot two costs, C shot (dotted line) and C noiseless (solid line). The shot cost is computed on quantum hardware using 10 5 shots, and it is the cost we optimize over using the Powell optimizer. The noiseless cost is computed classically in post-processing for verification. For the random states this amounts to classically storing the circuits to prepare ρ and σ throughout the optimization and computing the cost with matrix operations. For the NISQ states, we rely on full quantum state tomography to compute ρ as a "trusted third party" method since the states are generated by unknown noise. We terminate the optimization when either C shot flattens for at least 10 iterations or 100 iterations are reached. In all cases, the final noiseless cost reaches 10 −3 C noiseless 10 −2 . We remark that this is consistent with the use of N = 10 5 shots since we expect a reported precision to scale as ∼ 1/ √ N . In all cases, we show the result of learning a full rank approximation of the target state, so ∆ R = C noiseless and hence 10 −3 ∆ R 10 −2 . For all but theρ φ + optimization we note that the noiseless cost is an order of magnitude lower than the cost evaluated on the hardware which is indicative of optimal parameter noise resilience [12]. That is, it suggests that the position of the global cost minimum of the cost landscape is (approximately) invariant under the action of noise.
Overall, our results show that we can successfully learn full-rank approximations of hardware relevant states. As we might expect from the Bures and XY state results, we can also learn lower rank approximations which we show explicitly in App. F. The net result is very similar: we can learn lower rank approximations to within a precision of 10 −3 to 10 −2 , and as in the idealized classical experiments, it takes fewer iterations to learn lower rank approximations. Alternatively, we may choose to learn a high rank approximation of ρ with the CCPS ansatz and with both the SP and CCPS ansatz successfully. In (b), we learn two hardware-noise induced states which are described in Eqs. (35a) and (35b) using the SP and CCPS ansatz, respectively. Optimizations were performed on the seven-qubit devices ibmq_casablanca (random SP result) and ibmq_jakarta (all others) using N = 10 5 shots to evaluate C shot (dotted line). A noiseless cost C noiseless was also computed classically for verification. In each case, 10 −3 C noiseless 10 −2 is reached which denotes a successful compiling given 10 5 shots.
obtain lower rank approximations via truncation (as in Sec. III D) which we also explore for the hardware data in App. F.

V. Discussion and Outlook
We have presented an algorithm to learn an unknown mixed state ρ. In particular, we have developed a procedure to learn a rank-R approximation to ρ, where rank(ρ) R is assumed but not essential. Put precisely, our algorithm is a practical variational way to solve the quantum low-rank approximation problem [18]. Applications of this algorithm are numerous and include PCA, state compression, learning noise-induced states, and uploading of states onto quantum computers, as described in Fig. 1.
We considered two ansatz constructions. If the purification ansatz is chosen, the end result is a unitary V (θ * ) such that V (θ * )|0 ⊗n ⊗ |0 ⊗n A generates a purification of the rank-R approximation of ρ. By tracing out the n A ancilla qubits, we get the desired rank R mixed state. For the convex combination of pure states (CCPS) ansatz, the final output is a classical vector p containing the R principal values of ρ and a unitary U (θ * ) such that U (θ * )|i for i ∈ {1, . . . , R} gives the R principal vectors of ρ.
Our numerical simulations and hardware implementations indicate that our algorithm works well for a variety of state ensembles, including random XY spin chain thermal states at arbitrary temperatures and unknown states generated by hardware noise on superconducting qubit devices. Unsurprisingly, we found that learning XY thermal states was easier than random states because the additional structure of the problem opened up the possibility of using simpler ansätze. Additionally, while the SP ansatz performs better in numerical simulations (because its optimization problem is simpler), the CCPS ansatz, as expected, allows for larger hardware implementations because it requires fewer qubits.
For both ansätze, our algorithm provides a means of compressing the target state when rank(ρ) > R. For the purification ansatz, the reduction is in terms of the number of qubits; for the convex combination ansatz, the reduction is in terms of the number of pure states required to simulate the effect of the state. While the compression of states [16,[77][78][79][80][81], and indeed data sets encoded in states [82], has been explored previously, much of the compression-based literature focuses on finding the compressed state via maximizing the degree to which the original state can be reconstructed via a successful decompression process [83].
A particularly timely application of our algorithm is for quantum PCA. While quantum PCA was orginally proposed to have an exponential speedup over classical methods for low rank states [35], it was later dequantized for the case of classical data analysis [84], reducing speedups for this case to being modest ones [85]. However, recently it was shown that these dequantization arguments break down for quantum data analysis [86] and that quantum PCA for quantum data can indeed achieve an exponential quantum speedup [87]. Moreover a simple method to encode the covariance matrix into a density matrix was recently proposed [88], making quantum PCA an easily accessible application for near-term quantum computers. Hence, our approach for extracting the principal components of a density matrix is especially timely, in the quest for near-term quantum advantage.
In this article we have focused on coherent access models for quantum compilation. That is, computing the cost using the Loschmidt echo or SWAP test requires coherent interaction between the state we wish to compile and the device on which we wish to compile it. For this to be possible, the target quantum state either needs to be already prepared (potentially by some unknown process) on the quantum computer or we require a quantum sensor to mediate the interaction between the target quantum system and the quantum computer. The former is practically viable and reasonable to assume for applications such as learning noise-induced states or state compression. However, further developments in quantum sensors will be required to upload the unknown quantum state of an experimental system to a quantum computer in this coherent access model.
An alternative approach would be to use Clifford shadow tomography [6-8] to efficiently compute the overlap between the target state and a large set of guess compilations via independent measurements and then classical post-processing. This incoherent version of quantum compilation could be applied in situations where the unknown state is prepared on a platform that is very different from the hardware on which we wish to compile it. In this manner, this approach opens up new techniques for uploading the quantum state of experimental systems to quantum hardware. In Fig. 1 we sketch the difference between the coherent and incoherent access models. Additionally, Ref.
[6] provides a means of upper bounding the copy complexity of such a compilation task. In the case of efficiently preparable target states and ansatz states, i.e., those that can be prepared via a circuit with T ∈ O(poly(n)) local gates, then O(T log(T / )/ 2 ) O(poly(n)/ 2 ) copies of the target state ρ suffices to approximately evaluate the cost and its partial derivatives to precision arbitrarily often [89]. Thus O(poly(n)/ 2 ) copies of the target ρ in theory suffice to compile it to precision O( ).
The framework investigated in this article for mixed state compilation may more generally be applied to the compiling of quantum channels. This follows from the fact that a channel may be represented via its Choi representation as a mixed state. That is, a channel may be fully characterized via the mixed state generated by applying a quantum channel to one half of a Bell state. Therefore one means of compiling a quantum channel would be to minimize the Hilbert-Schmidt distance between the Choi state corresponding to a target channel and an ansatz mixed state, formed by applying a parameterized channel to half a Bell state. In this sense, our algorithm further open up new avenues for learning unknown quantum processes.

VI. Data and Code Availability
The data and source code that support the findings of this study are openly available in a Zenodo repository [59], a static and citable version of an available Github repository.  [4] Henry Yuen, "An improved sample complexity lower bound for quantum state tomography," arXiv preprint arXiv:2206.11185 (2022).

Marginal Local Cost Function
A simple, naïve choice in 1-local cost could be formulated in terms of the Hilbert-Schmidt distance between the local reduced states. We call this cost function the marginal 1-local cost function, and it takes the same form for both the state purification and convex combination of pure states ansatz.
Definition 1. Given two quantum states ρ and σ, let us define a Marginal 1-local cost as follows: where ρ j = Tr j [ρ] and σ j = Tr j [σ], with j denoting all system qubits except for qubit j.
This can be measured using the methods outlined in Section II. (Note, in contrast to the main text, in this appendix we stress the dependence of the cost on the target state ρ and guess state σ.) A successful candidate for a local cost function needs to remain faithful to the global cost, i.e., the minimal value of the cost corresponds to the case when ρ = σ. This cost function is trivially faithful for learning product states using a product state ansatz, i.e., if σ = n j=1 σ j and ρ = n j=1 ρ j . However, it does not take much thought to see that this local cost function is not faithful more generally. In simple terms, different entangled states may have the same reduced states and so the local cost may vanish even when the global states are non-identical. For example, consider the case in which ρ and σ are orthogonal Bell states, i.e., ρ = |ψ + ψ + | and σ(θ, n A ) = |ψ − ψ − |. Now, in this case, the local cost vanishes because the reduced states are all maximally mixed However, |ψ + ψ + | = |ψ − ψ − | and so the local cost is not faithful. In contrast, the equivalent global cost, that is the HS distance between |ψ + ψ + | and |ψ − ψ − |, is non zero, as it should be. The lack of faithfulness does not necessarily preclude us using it for training. We could start training on the 1-local cost and as we approach the solution, add in k-local cost terms where we look at the distance between the reduced states on k qubits. That is, we propose to use a cost of the following form.
Definition 2. Given two quantum states ρ and σ, and some (possibly multiple) partitioning(s) 3 P k of the set of n qubits into N k subsets of at most k < n qubits, where at least one set has cardinality k, let us define the Generalized Marginal Cost as follows: where the set of marginals of ρ and σ as determined by P k , and the iteration dependent weightings α k (t) 0 satisfying k α k (t) = 1.
We note that an analysis of the general behaviour of C (k) M (ρ, σ) is non-trivial as the behaviour depends on the partitions P k . One (simple) potential choice of partition is to only allow k such that k | n, and choose N k sets of k qubits. In this case, for states that are k-local product states on the chosen subsets, C (k) M (ρ, σ) would be trivially faithful. By starting with the one local cost, i.e., with α 1 = 1, and by slowly ramping to the global cost C (n) M ≡ D (n) M (ρ, σ) by increasing the weightings of the various α k terms until α n = 1, it should be possible to train to the minimum of the global cost. We expect this proposal to prove most useful for learning mixed states with relatively localized entanglement.

Local Measurements on Global States
An alternative approach to formulating a local cost function is to continue to use the full n-qubit input states, but replace the global measurements with local measurements (similarly to Ref. [10]). As each ansatz includes specifically prescribed measurement operators, we treat each of our ansätze separately, and derive provable guarantees of faithfulness for special cases.

a. Local costs for CCPS ansatz
Consider the global Hilbert-Schmidt cost function in the CCPS ansatz framework: where we write H (i) G := |i i| S to emphasise the globality of the measurement. Since the purity of the target state is not optimized and the purity of the guess state is optimized fully classically, these terms can be left global without impeding trainability. However, for the overlap term, we seek to replace the global H (i) G term with some set of local measurements. There is freedom in how we may choose our local measurements, but one simple approach is to choose the average of all 1-local measurements, as follows Definition 3. For the CCPS ansatz, we can define a Local cost function as where, as in (18), |i = n j=1 |i j denotes an element of the computational basis of n qubits, and |i j denotes the j-th bit of the bit string i.
In order to prove faithfulness in the case of pure states, we first state and re-derive a result introduced in [10].
Proposition 1. Let ρ be an arbitrary quantum state, and let U θ be a parameterized unitary matrix. Denoting Proof. We can write H (i) are projectors that mutually commute. Note that Choosing A i = E j , we see This is precisely the first desired inequality 1 − Tr To prove the remaining inequality, observe that, via the union bound, we have Proposition 2. C CCPS L is faithful for pure states, and "close to faithful" for states with low impurity. Specifically we have that nC CCPS L C G − (n − 1) (Impurity(ρ) + Impurity(σ)) , where Impurity(X) := 1 − Tr[X 2 ] for X = ρ and X = σ and we write C G = ρ − σ 2 2 to emphasize the globality of the standard Hilbert-Schmidt distance cost.
Remark 3. It follows from Proposition 2, that if C D L = 0 then That is, if the target and trained state are pure, i.e., Impurity(σ) = Impurity(ρ) = 0, we have that C L = 0 entails that C G = 0. More generally, if the impurities of the target and trained states are low, C L = 0 entails that C G is small.
Accordingly, if C CCPS L = 0, then C G (n − 1)(Impurity(ρ) + Impurity(σ)) (A21) Therefore if the purity of the target and trained states are zero, the cost is faithful; i.e., C CCPS L = 0 implies C G = 0. Similarly, for high purity states the cost is approximately faithful, i.e., C CCPS L = 0 implies C G is small. Similar to the generalized marginal cost defined earlier, we can construct an extension of the low impurity local cost function, such that it is local on k qubits, for k n.
One simple approach is to perform n/k measurements, on k qubits at a time. This naturally restricts us to only choosing k such that k | n. Defining with H k(i) where P m contains the k indices of the k qubits being measured over by the m-th operator, such that P 1 ∪ · · · ∪ P n/k spans {1, . . . , n}; we can define a k-local cost function as follows: Definition 4. For k | n, we have the k-local cost function Proposition 4. Let ρ be an arbitrary quantum state, and let U θ be a parametrized unitary matrix. Denoting ρ U = U † θ ρU θ , we have that Proof.
forming one side of our inequality. We also have Thus, Proposition 5. The k-local cost function C k L is faithful for pure states, and "close to faithful" for states with low purity, with this closeness increasing with k. Specifically we have that Remark 6. It follows from Proposition 5, that if C k L = 0, then n k − 1 (Impurity(ρ) + Impurity(σ)) Proof. The proof is entirely analogous to that for Proposition 2 but with n → n/k.
Comparing this to the inequality found for the 1-local cost (A16), we find that the k-local cost is 'closer to faithful' at low impurities than the 1-local cost. It becomes increasingly faithful as k tends to n and, trivially, perfectly faithful for k = n. Thus similarly to the marginal local cost, we could start training on the 1-local cost and as we approach the solution, add in k-local cost terms to drive the ansatz towards the global minimum.

b. Local costs for SP ansatz
Consider our global Hilbert-Schmidt cost function, in the SP ansatz framework: Without loss of generality, we can also express ρ via its purification for analysis, i.e., where V ρ is the purifying unitary associated with the target state ρ. Recall where d A is the dimension of the environment system. In order to construct a local cost, we can replace the global (|0 0|) ⊗(n+n A ) projector in each term with a local measurement. The structure of this local measurement may, again, be chosen freely. In the case of learning ρ completely, (i.e., R = rank(ρ)), one simple approach is to replace the global measurement with the average of all measurements local to one system qubit and one environment qubit, i.e., with j denoting all qubits other than qubit j. We dub this the doubly-local SP Hamiltonian. When seeking to learn a low-rank approximation (i.e., compression) of ρ, we need not make the measurement local on the ancilla. In this case we can use the following singly-local SP Hamiltonian, Definition 5. For the state purification ansatz, we can define the Doubly-and Singly-local costs, where with the freedom to choose arbitrary H X L , e.g. H D L or H S L as defined above.
The terms c X L (U θ , σ) and c X L (U θ , ρ) can be measured using Loschmidt-echo type circuits (as discussed in Section II) but with the global measurements replaced with local ones. However, as in general one will not have access to V ρ , it is generally not possible to measure c X L (V ρ , ρ). Nonetheless, as this term remains constant throughout and does not contribute to the gradient (discussed further in App. C) it can be neglected without effecting the optimization procedure.
The methods used to prove faithfulness of C CCPS L for pure states do not carry over for the singly-and doubly-local SP costs, due to the factor of d A . Thus, faithfulness for pure states for C D L and C S L remains an open question. However, we can prove faithfulness for tensor-product states in the SP picture.
Proposition 7. C D L is faithful for tensor-product states.
Proof. In the case of tensor-product mixed states, we can write ρ = n j=1 ρ Sj , U θ = n j=1 U θ Sj Aj . Thus the overlap term can be written as Thus, in this case, we have that which vanishes if and only if ρ j = σ j for all j, i.e., assuming ρ and σ are tensor-product states, iff ρ = σ.
In this subsection, we have shown how by introducing local measurements directly into our cost function, we are able to construct local cost functions. For the case of the CCPS ansatz this construction is provably faithful for pure states and approximately faithful for high purity states. For the case of the SP ansatz the construction is faithful for product states. Thus we expect these costs to prove useful for learning mixed states with relatively low impurities and low entanglement respectively. However, for target states that are highly entangled and/or mixed, we are unable to provide guarantees on the behaviour of the cost function. In these cases, the function no longer resembles a distance measure, as positivity cannot be guaranteed. Thus the construction of a truly faithful, yet entirely local equivalent to the Hilbert-Schmidt distance remains an open question.
However, in practise one may create a cost that is both faithful and exhibits non vanishing gradients by taking a linear combination of the absolute value of the local cost and the global cost, i.e., by training on a cost of the form α|C L | + (1 − α)C Global for some choice of the local cost C L and 0 α 1. By tuning α such that it is (close to) one at the start of the optimization and (close to) zero at later stages of the optimization, it should be possible to steer towards the global minimum.

Summarizing the circuits in our algorithm
We summarize the circuits used to evaluate our cost function to clarify in detail how one can implement our algorithm. In particular, we provide three circuits which sample the three terms in our cost function, for both the state purification (SP) ansatz and the convex combination of pure states (CCPS) ansatz.

a. SP Ansatz Circuits
The SP ansatz generates an n qubit mixed state by applying a unitary on n + n A qubits and tracing out the n A ancilla, The translation of this procedure into a quantum circuit is straightforward and is shown for an n = 3, n A = 2 example in the left-most circuit of Fig. 7. Given U (θ), we can measure an estimate of our cost function using the remaining two Loschmidt echo style circuits shown in Fig. 7. The middle circuit evaluate the purity term Tr[σ SP (θ, n A ) 2 ] and the right-most circuit evaluates the cross term Tr[ρ S · σ SP (θ, n A )] which we showed in the main text Eq. (17). Note that we omitted providing a circuit to measure the purity of ρ S since this is a static term during the optimization anyway. If desired, one could simply use middle circuit and replace σ SP (θ, n A ) with ρ S .

|0
U (θ) In the middle purity evaluation circuit, we avoid writing out the circuit which prepares σ SP (θ, n A ) a second time. However, it is worth mentioning that generating σ SP (θ, n A ) itself takes n A ancilla, so the middle circuit takes a minimum of n + 2n A qubits without resetting the ancilla A to use twice. As for preparing the totally mixed state on the ancilla system, this can be done in two ways. In the first, we prepare one of 2 n A basis states with a uniformly random probability for every shot that C is evaluated. Alternatively, one may choose to actually prepare the uniformly mixed state which can be done by preparing any completely entangled bi-partite state (i.e. a GHZ state) on 2n A qubits. In the latter case, the middle circuit therefore uses n + 3n A qubits.
We also draw attention to the fact that we use U (θ) to prepare σ SP (θ, n A ) in the first circuit but U † (θ) in the two cost function evaluation circuits. The use of U † (θ) is fleshed out mathematically in Eq. (17) of the main text. Intuitively, it's as if U † is undoing the preparation circuit U -hence the name Loschmidt-echo like circuit. In fact, the final cost term is ultimately evaluated by counting the number of 0's obtained at the final registers which corroborates this intuition.
As discussed in the main text, the Loschmidt echo circuits we cooked up in Fig. 7 are NISQ friendly but incur a poor shot scaling as n A grows. If in practice, n A is on the order of n and both are large, then it's best to instead use SWAP test circuits [45] or their destructive variant [46]. The SWAP and destructive SWAP circuits to evaluate FIG. 8. Two circuits to evaluate Tr[ρS · σSP(θ, nA)] or Tr[σSP(θ, nA) 2 ] by choosing ρS = σSP(θ, nA). (Left) A version of the SWAP test as first discussed in Ref. [45]. For completeness, we demonstrate that this circuit works as intended from Eq. (B3) to Eq. (B8) since Ref. [45] is not explicit for our context. This requires an ancilla and a controlled swap (aka Fredkin) gate.
As the proof will show, this circuit can be straightforwardly generalized to n qubit states by performing an 2n qubit SWAP rather than a two qubit SWAP. (Right) A destructive variant of the SWAP test discovered in the context of optical systems and the Hong-Ou-Mandel effect in Ref. [46]. This removes the need for ancilla and a complicated Fredkin gate but destroys the prepared states in the process in what is known as a Bell measurement. Much of of Ref. [46] is dedicated to showing the right circuit is equivalent to the left circuit, so we do not re-derive the entire result here. Instead, we comment on how to connect the measurements of the circuit to the desired quantity in Eq. (B10). We then discuss how to generalize this procedure to n qubit mixed states in Eq. (B11).
Tr[ρ S · σ SP (θ, n A )] are shown in Fig. 8. Note that these also can be used for Tr[σ SP (θ, n A ) 2 ] by replacing ρ S with σ SP (θ, n A ). While the references ultimately contain sufficient information to deduce that these circuits work as claimed, it is not obvious at a glance. For posterity and completeness, we provide a tailored derivation of the claim in the present context. We begin by showing that the left SWAP test circuit is sufficient to measure Tr[ρσ] (we've dropped subscripts for simplicity). The first thing we need to know for this derivation is that As we'll see, our goal will be to find an observable on the ancilla qubit alone whose expectation value gives us the left-hand side of Eq. (B3). From this identity, the desired outcome follows, and this is why we call it a SWAP test method.
Consider the state right after the controlled SWAP (aka Fredkin) gate which we shall call ω for concreteness. By simple Dirac notation manipulations, we arrive at, The second line can be thought of as block matrix, Right before measurement we are then left with the state HωH. Since we then only make a measurement on the ancilla qubit, we will only need the diagonal entries of this block matrix to choose the right observable to measure. In particular, we find (B7) Staring at Eq. (B7) in light of Eq. (B3), we see that the right observable is σ z . Indeed, Note that the above derivation really only relied on the fact that SWAP(ρ ⊗ σ)SWAP = σ ⊗ ρ. Hence, this circuit works to evaluate Tr[ρσ] when ρ and σ are composed of an arbitrary number of qubits despite the way our diagram implies they are single qubit states. We simply chose to write it this way since it's easier to interpret and reason in this way afterwards then try to worry about multiple qubits form the start. Next, we discuss the right destructive SWAP variation. Let's again begin by considering that ρ and σ are single qubit states. It turns out, the relevant measurement is the projection into the |11 11| subspace. In particular, let be the probability that the final result from the destructive SWAP circuit is |11 . Then, gives us the desired quantity. An n qubit generalization is not quite as straightforward as the SWAP test generalization. In particular, we replace the single CNOT and Hadamard with a transversal application of CNOTs and Hadamards-i.e. Fig.11 in Ref. [46]. Further, we don't just project onto |1 . . . 1 1 . . . 1|. In fact, the augmented procedure is actually easier to state at the level of individual measurements rather than projectors. Supposing ρ and σ are n qubit states, then we can label the measurement outcomes for the first n ρ registers with a bitstring a and of the σ registers b. We say the test "fails" when the bit-wise and of the two bit-strings has odd parity, i.e. |a ∧ b| is odd. Identifying P f as this failure probability obtained by repeating this procedure ad infinitum, we again find As a sanity check, we can confirm that P f corresponds to P 11 for the single qubit case. Here, the bit-strings each have one element, so the condition reduces to |a ∧ b| = 1 which occurs if and only if a = b = 1. An alternative derivation for the n-qubit destructive SWAP test viewed as a Bell basis measurement can be found in [55] (page 8) whose intuition can be understood from [90] (page 6).

b. CCPS Ansatz Circuits
A mixed state is often thought as a probabilistic mixture of pure states. The CCPS ansatz is a direct implementation of this idea, Namely, experimental observables are obtained by averaging over many experiments in which the input state is U θ |i chosen with probability p φ (i) 4 . The ensemble of circuits can thus be represented by U θ acting on an input basis vector |i as in Fig. 9. One can choose any set of basis vector {|i } R−1 i=0 , but for our work, we choose the set of computational basis states 5 . Given a the preparation unitary U θ , we can use the right circuit to sample the distribution As discussed in the text surrounding Eq. (24), this distribution-alongside the classically stored p φ (i)-allow us to compute the cross term by using the classical SWAP test. 9. We summarize the circuits used to evaluate the cost function for the CCPS ansatz. (Left) A generic way to prepare a randomly sampled eigenvector of the CCPS state. Here, the state |i1i2 . . . in is a short-hand for an n qubit computational basis state, so each ij is either 0 or 1. Given the randomly sampled bitstring, we can prepare the desired state with single qubit X gates, and then we apply the same unitary U θ regardless of the randomly sampled input. (Right) Here, we provide a circuit to sample the distribution q θ (i) defined in (B14). We simply prepare ρS, apply U † θ , and measure each qubit separately in the computational basis. By (B14) and the fact that p φ (i) is stored classically, we can therefore compute the desired cross term using a classical SWAP test as described in the main text around (24).
FIG. 10. A hardware efficient tiling of a two-qubit gate W (θ) across a linearly connected five-qubit circuit. The notation θ lk is meant to convey the parameters of the k th gate in the l th layer. Each layer consists of n − 1 applications of W for n qubits.
As before, we omit a procedure to estimate Tr[ρ 2 S ] which doesn't affect the optimization. This time, however, we've also omitted a circuit to estimate Tr[σ 2 CCPS ] since this is done entirely classically again using the classical SWAP test. Other than that, the only subtlety in our circuits is simply that we've not explicitly written out different lines for different qubits. The reason is that for this ansatz, there are no ancilla qubits necessary, so it's understood that each line is for n qubits. As mentioned in the caption, the measurement is a single qubit measurement on each qubit.

The parameterized circuits used for U (θ)
So far, we have described all the circuits where U (θ) was understood to be some parameterized quantum circuit (PQC). Here, we define and justify the choices we make for the different classes of states we consider. We begin with an abstract description of a hardware efficient tiling. We then discuss how this tiling is applied for the parameterized circuits in the SP ansatz and the CCPS ansatz. Then we move into specifics for Bures random states, XY thermal states, and hardware noise induced states. As discussed in the main text, the CCPS ansatz automatically provides a description of the principal components of the target state ρ in the computational basis. This is not generally true for the SP ansatz, and we conclude with a discussion of how to generate an SP ansatz that does allow for extraction of the principal components but note that it is generally not practical.

a. Hardware Efficient Tiling and its Use in the CCPS and SP Ansätze
Let W (θ) be an unspecified two-qubit gate parameterized by a vector of angles, θ. A hardware efficient tiling of W is given in Fig. 10. A single layer consists of what is shown in the "dotted rectangle" on the left. The name hardwareefficient comes from the fact that for a linearly connected device, only neighboring qubits need to be coupled, and so no swaps are needed. Furthermore, each layer consists of a depth-2D W circuit only, where D W is the depth of W itself.
For the CCPS ansatz, we simply apply a hardware efficient tiling on the n qubits. That is, where the product over k represents the application of each gate in layer l and the product over l is for L layers, which is chosen depending on the class of state. For the SP ansatz, we simply apply the same tiling but to the n + n A system plus ancilla qubits,

. Bures Random State Ansatz
There are no non-trivial symmetry operators S for which [ρ, S] = 0 for all ρ drawn from the Bures measure. Without any inherent symmetry structure, we choose the most generic W : an arbitrary two-qubit gate; i.e., W can express any rotation in U(4). By the KAK decomposition [91][92][93], we can decompose W using three CNOTs and 15 elementary single qubit rotations (see Fig. 7 in [93] for example). Next, note that one way to generate ρ from the Bures distribution is through applying a Haar random unitary on 2n qubits, and then putting this state in coherent superposition with the same state changed by a local transformation on the n system qubits and then tracing out the ancilla [70] (see Eq. (D4)). This suggests that no low-depth circuit exists to faithfully generate ρ, so we choose L = n. This gives a total of 15n(n − 1) trainable parameters. As discussed in Cerezo et al. [21], we expect a linear depth alternating ansatz with this many parameters to exhibit a barren plateau and thus to not be scalable. However, in fact, we know that trying to learn a Haar random unitary induces a barren plateau regardless of the choice of ansatz when no other information is known [94]. Hence, this class of states is likely not scalable beyond the small sizes testable in the NISQ era anyway. To that end, it serves as a proof of principle that even the most difficult states (for tractable sizes) can be learned by our method.

c. XY Model Ansatz
The XY model does exhibit symmetries. In particular, the spin model is particle conserving, and as a chain in 2D, it is invariant under any global rotation. As discussed in Ref. [95], the structure of the XY model can be used to design a generically good ansatz using O(n 2 ) gates with circuit depth O(n log n). However, as discussed in Appendix 3c of [96], one can instead use n alternating layers of Givens rotations which still obeys the symmetries but results in a lower 2n depth with a gate count of n 2 − n. The Givens rotation is a single parameter gate, which rotates in the subspace where |00 and |11 are fixed. In our work, however, we want to associate the state |0 ⊗n in the CCPS ansatz to the first principal component of ρ. Hence, |0 ⊗n should not in general be preserved. As a fix, we can consider other Givens rotations that differ from G(θ) by an arbitrary permutation of basis elements. For example, 1 0 0 cos(θ/2) 0 0 sin(θ/2) − sin(θ/2) 0 0 cos(θ/2) 0 0 1 0 is also a valid Givens rotation where the set of fixed states is permuted. To avoid attaching to a fixed choice, we define a Givens gate as which is analogous to an Euler decomposition of a rotation in 3D as R X (θ 1 )R Y (θ 2 )R X (θ 3 ). Using a tiling of this W , we find empirically that a depth L = log n is sufficient, and so use a total of 3(n − 1) log(n) single-parameter gates with depth 3 log(n), which represents yet another improvement over [42].

d. Ansatz for Hardware Implementation
To learn the purification (i.e., the SP ansatz) of the single qubit state ρ + , we used an arbitrary two-qubit gate built up with the previously mentioned KAK decomposition. For the pure state approximation, we simply forgo using an ancilla. In other words, for the R = 1 approximation of ρ + , we used a single qubit circuit ansatz, on the system qubit.
To learn the principal components and values of ρ Φ + with the CCPS ansatz, we use an arbitrary two-qubit circuit U 2 . The principal components are then obtained as {U 2 |00 , U 2 |01 , U 2 |10 , U 2 |11 } whereas the principal values are stored as a vector we train over, {p 00 , p 01 , p 10 , 1 − (p 00 + p 01 , +p 10 )}. 11. SP PCA Ansatz. An example SP PCA ansatz acting on 3 + 2 qubits. We first apply a gate VA on the nA = 2 ancilla, connect the two sets of qubits with a CNOT cascade, and then apply a gate VS on the system qubits.

SP PCA Ansatz
We note that it is possible to construct an SP ansatz that allows for the principal components of ρ to be extracted using measurements in the computational basis. For clear reasons, we call this the PCA ansatz, and we show a generic example of such an ansatz in Fig. 11. The idea bears some similarities to the ansatz recently presented in [97].
To see why this is a PCA ansatz, we first consider the action of a generic version of the circuit in Fig. 11, proceeding step by step. The state before the CNOT cascade is as follows: After applying the CNOTs, we get c i1,i2,...,in A |i 1 , i 2 , . . . , i n A S |i 1 , i 2 , . . . , i n A A (B22b) c i1,i2,...,in A |0, . . . , 0, i 1 , i 2 , . . . , i n A S |i 1 , i 2 , . . . , i n A A (B22c) where we introduced an arbitrary simpler indexing at the end. The choice of tilde on the S basis ket is to emphasize that there is a leading n S − n A qubits in the all-zeros state. By applying a system local unitary V S and then tracing out the ancilla, we get where in the last step we simply identified |c k | 2 = p k as probabilities. For arbitrary V A and V S , the state σ S is an arbitrary rank-2 n A density matrix, and hence, this is clearly a legitimate ansatz for a density matrix on S. At the same time, this ansatz prepares the state on the ancilla system. By inspection of σ A and |φ 2 we see that a measurement of the ancilla in the computational basis yields the state with probability p k . Hence, a measurement of the ancilla prepares the eigenstates of σ S , V S |k with corresponding probability p k . Provided σ S ≈ ρ as a result of a successful training, this gives us a way to probabilistically prepare the first 2 n A principal components of ρ. In particular, by repeatedly preparing σ S and measuring the ancilla, we prepare the k th principal component with probability p k .

C. Gradient Analysis for State Purification Ansatz
The Hilbert-Schmidt distance cost for the state purification ansatz takes the form C SP(θ,n A ) = Tr[ρ 2 ] + Tr[σ(θ, n A ) 2 ] − 2Tr[ρσ(θ, n A )], where the trial state σ(θ, n A ) is found via its purification, i.e., σ SP (θ, , for our vector of training parameters θ. The gradient with respect to θ is given by The Tr[ρ 2 ] term vanishes trivially due to a lack of dependence on θ, whilst the overlap term Tr[ρσ(θ, n A )] obeys the standard parameter shift rule [57,58]. In our work, we specifically use the "Pauli parameter shift rule" in which the circuit ansätze are described using single qubit gates of the form P k = e −iθσ k /2 for Pauli σ k and parameter free CNOTs. In this case, the shift refers to running the circuit with θ → θ ± π/2 (this will become clear in the next equation where in the second line we use the Pauli parameter shift rule (hence the ±π/2) as applied directly to an operator rather than an expectation value. Thus we have the complete analytic expression to compute ∇C SP(θ,n A ) . We note that the gradient analysis for the local costs is entirely analogous.

Bures Random States
Before diving into what a Bures random state is specifically, it is helpful to discuss a few general considerations regarding generating random quantum states, as is done in the brief introduction of [69]. Generally, random quantum states are selected from a distribution that is invariant under global unitary transformations. For pure states, this single property uniquely defines a probability measure known as the Haar measure. But for mixed states, this property does not uniquely specify the measure. Assuming the distribution of eigenvectors and eigenvalues is independent, then we can write the probability measure over mixed states as [70] where dµ V is the Haar measure and dν is the measure over the normalized eigenvalues. So the problem is that we must also specify dν, which, while trivial for pure states, is unclear for mixed states.
There are many protocols one can follow to determine dν, but the most mathematically straightforward is to define a measure from the normalized volume elements of a metric [69,70]. In this formalism, the key is then to choose an appropriate metric. The Bures distance, given by has many attractive properties as an unbiased choice. Many of these properties were first pointed out by Bures himself [99], but simpler explanations are provided in Hall [69] and Zyczkowski et al. [70]. The summary in Zyczkowski et al. [70] is especially concise: the Bures metric (i) has an interpretation as a distinguishability measure [100,101], (ii) is the minimal monotone metric under quantum channels [102], and (iii) gives the statistical distance when applied to two diagonal operators [70]. The form of dν is given in Eq. (13) of Zyczkowski et al. [70], and for brevity, we call states drawn from this measure Bures random states. For numerically tractable system sizes, generating a Bures random state is straightforward [70]. First, we generate a 2 n × 2 n Ginibre random matrix G [103] with complex entries 6 . Then, we generate a Haar random unitary matrix U with the same dimensions. With these two matrices, the random state is given by An alternative, more physically motivated way to generate the states is to construct a superposition of a random bipartite state |ψ 1 = U AB |0, 0 with a local transformation of the same state, |ψ 2 = (V A ⊗ 1 1)|ψ 1 and then trace out the B degrees of freedom (see Fig. 5 in Ref. [70]),

XY Thermal States
By XY thermal states, we mean states of the form where J i , K i ∼ N (0, 1) are i.i.d. normal random variables. By controlling β, we control the effective rank r of the random states generated in this way, which we demonstrate in Fig. 12 with = 1/100. The reason is simple: At β = ∞ (i.e., T = 0), we expect a system with a non-degenerate ground state to be in its pure ground state which has r = 1. For a degenerate ground state, we get a totally mixed state in the ground-state subspace, so if the degeneracy is g 0 , we find r = g 0 . For small but non-zero temperatures (β = 20 here), the state is a convex combination of low-lying energy states, so r > g 0 for most choices of . As β decreases, the state becomes closer to the totally mixed state until it actually reaches it at β = 0. By choosing a larger intermediate temperature (β = 2 here), we interpolate between the two extremes and simply get a Gibbs state with large -rank but without being completely mixed. Despite the random states having different coefficients (and subsequent rank), the model still respects important symmetries. Most importantly, our XY model is a chain in 2D. Hence, it is invariant under any global rotation of all spins, and furthermore, the number of spins is constant. These symmetries allow us to greatly simplify the choice of ansatz to one that respects these symmetries, as discussed in Appendix B.

E. Computing Figure of Merit
Our figure of merit comes from the solution to the quantum low-rank approximation problem (QLRAP) [18]. Namely, the QLRAP is to find the state σ opt (R) that satisfies As shown in Ref. [18], the unique optimal solution for the Hilbert-Schmidt distance is given by  (n, β)). At β = 20 (low temperature), the resulting rank never goes beyond r = 4. For n = 7, the ground state is doubly degenerate, so r 2 as well. For n = 8, the ground state is unique, so r = 1 is reached for some of the states. For β = 2, the ranks go from r = 4 all the way up to r = 14, 20 for n = 7, 8, respectively. On the left, we show an example optimization of a seven-qubit XY model thermal state which illustrates what ∆R means pictorially. Namely, for the same fixed state, we use the CCPS ansatz to find a rank R = 1, 2, 3, 4 approximation. The resulting optimizations terminate when the learned cost is approximately equal to the optimal costs represented by horizontal lines. The difference between the final cost and the optimal cost is what is plotted on the right box-plots. The variation in performance for each R comes from sampling over 25 random states (see Appendix D 2).
With a little algebra, it is easy to see that the first term corresponds to the sum of squares of the r − R eigenvalues of ρ not approximated (assuming rank(ρ) = r), and the second term accounts for the constant offset between the first R eigenvalues of ρ and the R re-normalized eigenvalues of σ opt . Any compilation of ρ as described throughout our paper will find an empirical cost C(α * , R) C(α opt , R), so a simple figure of merit is just their difference: ∆ R ≡ C(α * , R) − C(α opt , R).
In Fig. 13, we help clarify how we compute this difference for an example optimization of a seven-qubit XY thermal state using the CCPS ansatz with R ∈ {1, 2, 3, 4}. Pictorially, ∆ R is the vertical distance between the final cost (solid line) and the optimal cost (horizontal dashed line). Direct low rank compilations of single qubit hardware states. We perform a direct full-rank and low-rank compilation of ρHS with the SP ansatz on the (left) and ofρ+ with the CCPS ansatz on the (right). By direct we mean we performed a separate and independent compilation of the same state with an ansatz supporting a rank two or rank one approximation. Since these are one qubit states, this encompasses all interesting possibilities. For the rank one approximation, we also include a horizontal line, C * , which denotes the lowest possible value of the noiseless cost. The difference between the final value of C noiseless and this line denotes ∆R as shown explicitly on the (left) plot.

F. Additional hardware results
In Sec. IV, we successfully compiled full rank approximations of the one qubit states ρ HS andρ + and the two qubit stateρ Φ + (see the referenced section for state definitions). In particular, within an allotted budget of 100 Powell iterations, we optimized to a cost value on the order of 10 −3 C noiseless 10 −2 which is close to the precision floor allowed by our use of 10 5 shots to evaluate the cost function. Here, we present additional hardware results where we compile lower rank approximations of these states either directly as in Fig. 14 or indirectly through truncation of the learned full rank CCPS state as in Fig. 15. For convenience, we have included the full-rank optimizations shown in Sec. IV alongside the lower rank optimizations.
In Fig. 14(a), we compile a full rank (i.e., rank two using n A = 1 ancilla) and a pure state (n A = 0) approximation of the single qubit stateρ + using the SP ansatz. In the full rank case, we find C noiseless (n A = 1) = ∆ 2 ≈ 6.44 × 10 −3 . In the pure state case, we find C noiseless (n A = 0) = 2.5 × 10 −2 , but the optimal possible cost is C * (n A = 0) = 4.1 × 10 −3 . Thus, ∆ 1 = 2.1 × 10 −2 , which is an acceptable final value on the order of 10 −2 . It is also interesting to note that around n it = 10, the difference is minimal, reaching C noiseless (n A = 0) = 4.3 × 10 −3 =⇒ ∆ 1 = 2.0 × 10 −3 . However, it is not forthright to report this as the final found value since it relies on knowledge of C noiseless to pick the right n it whereas our optimization stopping condition does not (i.e., it relies on C shot only).
Overall, we find that direct pure state compilations of ρ for both ansätze are learned to an acceptable value of ∆ 1 . In addition, we find that finding this value takes fewer iterations for both ansätze suggesting that finding a lower-rank approximation is easier. This observations was also found for the Bures random and XY random thermal states in idealized noiseless, infinite shot classical simulations. Given that the lower-rank optimization requires fewer learnable parameters alongside these empirical results, we suspect this to be a general property of our algorithm. Namely, learning a lower rank approximation is easier.
In Fig. 15 we explore the quality of indirect compilations ofρ Φ + by truncating a full-rank compilation. The results here mirror those in Sec. III D but for the hardware optimization ofρ Φ + . The story here is a bit more interesting on account of finite shot noise, however. The punchline is that due to having only a precision of roughly 10 −2 , our R = 4 optimization found a local minimum, C noiseless (R = 4) = 6.8 × 10 −2 (see Fig. 15(a)), that is effectively a rank one approximation. In other words, a pure state approximation ofρ Φ + is sufficient to reach the cost noise floor, and this pure state solution was found in our hardware optimization.
With the punchline stated, the empirical quality of the truncated states is summarized in Fig. 15(b). Evidently, the pure state approximation (R = 1) is closest to its optimal value since ∆ R =1 = 1.2 × 10 −3 . On the other hand, ∆ R >1  FIG. 15. PCA analysis ofρ Φ + when compiled on hardware using CCPS ansatz. In (a), we show the optimization result when finding a direct full-rank (R = 4) compilation ofρ Φ + . Given this CCPS ansatz, we can obtain any R < 4 approximation for free by truncation. As an example, we plot the found cost when truncating to a rank two approximation, C R=4→2 noiseless , as a solid horizontal line. The optimal possible value for this cost, C * , is shown in a dot-dashed line. The vertical distance between these lines denotes their difference, ∆ R =2 . In (b), we show the quality of our truncation state for different truncation levels from R = 4 (no truncation) to R = 1 (maximum truncation to a pure state). For each R , we plot the value of the found truncated cost C(α * , R → R ) (left, blue) as well as the optimal possible cost C(αopt, R → R ) (right, orange). The difference between these costs is ∆ R and is pictorially the height difference in the bars. In (c), we show the principal values (ordered eigenvalues) of the target stateρ Φ + and the learned CCPS state σCCPS(α * , R = 4).
is an order of magnitude higher in all cases, i.e., ∆R = 2 = 1.6 × 10 −2 , ∆ R =3 = 1.8 × 10 −2 , and ∆ R =4 = 6.8 × 10 −2 . In fact, the quality monotonically decreases with R increasing. This suggests that the performance of our R = 4 compilation is dominated by the quality of the estimate of the first principal component. In Fig. 15(c), we verify this intuition by seeing that the numerically optimal CCPS state, σ CCPS (α * , R = 4), is essentially a pure state. Namely, it only has non-trivial support on the largest principal value. This interesting observation aside, we ultimately find that truncated costs have a comparable ∆ R to the original ∆ R , so the indirect compilation of the full-rank CCPS state into lower rank approximations works well.