Abstract
Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers. Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data. However, training these algorithms can be challenging and calls for tailored optimization procedures. Specifically, QML applications can require a large shot-count overhead due to the large datasets involved. In this work, we advocate for simultaneous random sampling over both the dataset as well as the measurement operators that define the loss function. We consider a highly general loss function that encompasses many QML applications, and we show how to construct an unbiased estimator of its gradient. This allows us to propose a shot-frugal gradient descent optimizer called Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). Our numerics indicate that Refoqus can save several orders of magnitude in shot cost, even relative to optimizers that sample over measurement operators alone.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
A new kind of data is emerging in recent times: quantum data. Tabletop quantum experiments and analog quantum simulators produce interesting sets of quantum states that must be characterized. Moreover, the rise of digital quantum computers is leading to the discovery of novel quantum circuits that can, once again, produce quantum states of interest. Quantum sensing, quantum phase diagrams, quantum error correction, and quantum dynamics are some of the areas that stand to benefit from quantum data analysis.
Classical machine learning was developed for the processing of classical data, and it can be inefficient at processing quantum data. This issue has given rise to the field of quantum machine learning (QML) [1, 2]. QML has seen the proposal of parameterized quantum models, such as quantum neural networks [3–6], that could efficiently process quantum data. Variational QML, which involves classically training a parameterized quantum model, is indeed a leading candidate for implementing QML in the near term.
Variational QML, which we will henceforth refer to as QML for simplicity, has faced various sorts of trainability issues. Exponentially vanishing gradients, known as barren plateaus [7–15], as well as the prevalence of local minima [16, 17] are two issues that can impact the complexity of the training process. Quantum hardware noise also impacts trainability [18, 19]. All of these issues contribute to increasing the number of shots and iterations required to minimize the QML loss function. Indeed, a detailed shot-cost analysis has painted a concerning picture [20].
It is therefore clear that QML requires careful frugality in terms of the resources expended during the optimization process. Indeed, novel optimizers have been developed in response to these challenges. Quantum-aware optimizers aim to replace off-the-shelf classical optimizers with ones that are specifically tailored to the quantum setting [21–23]. Shot-frugal optimizers [24–27] have been proposed in the context of the variational quantum eigensolver (VQE), whereby one can sample over terms in the Hamiltonian instead of measuring every term [28]. While significant progress has been made on such optimizers, particularly for VQE, very little work has been done to specifically tailor optimizers to the QML setting. The cost functions in QML go well beyond those used in VQE, and hence QML requires more general tools.
In this work, we generalize previous shot-frugal and iteration-frugal optimizers, such as those in [24, 25, 28], by extending them to the QML setting. Specifically, we allow for random, weighted sampling over both the input and the output of the loss function estimation circuit. In other words, and as shown in figure 1, we allow for sampling over the dataset as well as over the measurement operators used to define the loss function. Our sampling approach allows us to unlock the frugality (i.e. to achieve the full potential) of adaptive stochastic gradient descent optimizers, such as the recently developed individual coupled adaptive number of shots (iCANS) [24] and global coupled adaptive number of shots (gCANS) [25].
We discuss how our approach applies to various QML applications such as perceptron-based quantum neural networks [11, 29], quantum autoencoders [30], variational quantum principal component analysis (PCA) [31, 32], and classifiers that employ the mean squared error (MSE) loss function [26, 33]. Each of these applications can be unified under one umbrella by considering a generic loss function with a highly general form. Thus we state our main results for this generic loss function. We establish an unbiased estimator for this loss function and its gradient. In turn, this allows us to provide convergence guarantees for certain optimization routines, like stochastic gradient descent. Furthermore, we show that for this general loss function one can use the form of the estimator to inform a strategy that distributes shots so as to obtain the most shot-frugal estimates. We also show how to construct an unbiased estimator of the log-likelihood loss function, which can be used in gradient-free shot-frugal optimization.
Finally, we numerically investigate the performance of our new optimization approach, which we call Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). For a quantum PCA task and a quantum autoencoder task, Refoqus significantly outperforms state-of-the-art optimizers in terms of the shot resources required. Refoqus even outperforms Random Operator Sampling for Adaptive Learning with Individual Number of shots (Rosalin) [28]—a shot-frugal optimizer that samples only over measurement operators. Hence, Refoqus will be a crucial tool to minimize the number of shots and iterations required in near-term QML implementations.
2. Background
Before giving the necessary background on shot-frugal optimizers, we introduce some notation. Let ℒ denote the loss function, ∇ℒ its gradient, α the learning rate, and θ the parameter vector. The gradient is assumed to be Lipschitz continuous, with a Lipschitz constant L satisfying the bound
‖∇ℒ(θ) − ∇ℒ(θ′)‖ ⩽ L ‖θ − θ′‖
for all θ, θ′, where ‖·‖ is the ℓ2 norm.
2.1. gCANS
Inspired by an adaptive batch-size optimizer used in classical machine learning [34], the iCANS optimizer [24] was introduced as an adaptive method for stochastic gradient descent in the context of VQE. It allows the number of shots per partial derivative (i.e. gradient component) to vary individually, hence the name iCANS. We denote by sx the number of shots allocated to gradient component x.
Recently, a potential improvement over iCANS was introduced called gCANS [25]. gCANS considers the expected gain (i.e. the decrease in the loss function) over the entire gradient vector. Then the goal is to maximize the shot efficiency
where the sum goes over all components of the gradient. Solving for the optimal shot count then gives:
where σx is the standard deviation of a random variable Xx whose sample mean is gx, an unbiased estimator for the xth gradient component. (Note that an exponential moving average is used to estimate σx and gx, as their true values are not accessible.) It was proven that gCANS achieves geometric convergence to the optimum, often reducing the number of shots needed to reach solutions of comparable quality relative to its predecessor iCANS.
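To make this rule concrete, here is a small sketch of a gCANS-style shot allocation in Python. Since the display equations are not reproduced above, the exact prefactor 2Lα/(2 − Lα) and the floor of two shots per component used below are assumptions based on the description in [25]; the essential feature is the proportionality sx ∝ σx Σx′ σx′ / ‖g‖².

```python
import numpy as np

def gcans_shot_allocation(grad_est, sigmas, lipschitz, lr, s_min=2):
    """Allocate shots per gradient component following a gCANS-style rule,
    s_x proportional to sigma_x * sum(sigma) / ||g||^2 (see [25]).
    The prefactor 2*L*alpha / (2 - L*alpha) and the floor s_min are
    assumptions here; the rule presumes L*alpha < 2."""
    g = np.asarray(grad_est, dtype=float)
    sig = np.asarray(sigmas, dtype=float)
    grad_norm_sq = np.sum(g ** 2)
    prefactor = 2 * lipschitz * lr / (2 - lipschitz * lr)
    s = prefactor * sig * np.sum(sig) / grad_norm_sq
    # round up and enforce a minimal allocation per component
    return np.maximum(np.ceil(s).astype(int), s_min)
```

In practice the σx and gx entering this rule would themselves be exponential moving averages, as noted above.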
2.2. Rosalin
Shot-frugal optimizers like iCANS and gCANS rely on having an unbiased estimator for the gradient or its components. However, this typically places a hard floor on how many shots must be allocated at each iteration, i.e. going below this floor could result in a biased estimator. This is because the measurement operator H is typically composed of multiple non-commuting terms, each of which must be measured individually. Each of these terms must receive some shot allocation to avoid having a biased estimator. However, having this hard floor on the shot requirement is antithetical to the shot-frugal nature of iCANS and gCANS, and ultimately it handicaps these optimizers' ability to achieve shot frugality.
This issue inspired a recent proposal called Rosalin [28]. Rosalin employs weighted random sampling (WRS) of the individual terms in the measurement operator H, which allows one to achieve an unbiased estimator without a hard floor on the shot requirement. (Even a single shot, provided that it is randomly allocated according to an appropriate probability distribution, can lead to an unbiased estimator.) When combined with the shot allocation methods from iCANS or gCANS, the operator sampling methods in Rosalin were shown to be extremely powerful in the context of molecular chemistry problems, which often have a large number of terms in H.
We remark that [28] considered several sampling strategies. The WRS strategy outperformed, or was competitive with, the others on molecular ground-state problems. Given a budget of stot shots and a measurement operator with N terms, WRS uses the magnitudes of the terms' coefficients to define a (non-uniform) probability distribution for selecting which term should be measured at each shot. Because of these results, we choose to focus on the WRS strategy in our work here.
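A minimal sketch of WRS in Python, assuming each term is drawn with probability equal to its normalized coefficient magnitude; the ±1 single-shot outcomes below are a toy stand-in for real hardware measurements:

```python
import numpy as np

def wrs_estimate(coeffs, term_means, s_tot, rng):
    """Weighted random sampling (WRS) estimate of <H> = sum_j c_j <h_j>.
    Each of the s_tot shots picks a term j with probability
    p_j = |c_j| / sum_k |c_k|; single-shot outcomes for term j are modeled
    here as +/-1 variables with mean term_means[j]."""
    c = np.asarray(coeffs, dtype=float)
    p = np.abs(c) / np.abs(c).sum()
    shots_per_term = rng.multinomial(s_tot, p)
    est = 0.0
    for j, s_j in enumerate(shots_per_term):
        if s_j == 0:
            continue
        mu = term_means[j]
        outcomes = rng.choice([1.0, -1.0], size=s_j,
                              p=[(1 + mu) / 2, (1 - mu) / 2])
        # reweighting by c_j / p_j makes each single-shot term unbiased
        est += (c[j] / p[j]) * outcomes.sum() / s_tot
    return est
```

Taking expectations term by term, E[est] = Σj pj (cj/pj) ⟨hj⟩ = Σj cj ⟨hj⟩, so the estimator is unbiased even when some terms receive zero shots.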
While Rosalin was designed for chemistry problems, it was not tailored to QML, where the number of terms in H is not the only consideration. As discussed below, QML problems involve a (potentially large) dataset of input states. Each input state requires a separate quantum circuit, and hence we are back to the situation of having a hard floor on the shots required, due to these multiple input states. This ultimately provides the motivation for our work, which can be viewed as a generalization of Rosalin to the setting of QML.
3. Framework
3.1. Generic variational QML framework
Let us present our general framework for discussing (variational) QML methods; see figure 2 for an illustration. This framework is meant to unify multiple QML algorithms from the literature under one umbrella. We discuss how specific algorithms are special cases of this framework in section 3.2.
In a generic QML setting, one has a training dataset composed of quantum states:
where each ρi is a trace-one positive semi-definite matrix, i.e. a density matrix. Each of these training states may come with an associated probability, defining a probability distribution over the dataset:
The latter distribution is generally uniform but does not need to be (e.g. datasets with non-uniform probability distributions are used in [35]). In variational QML, one trains a parameterized quantum model, written as a map depending on some set of parameters θ. With a large degree of generality, we can assume that this model is a linear, completely positive map. In general, it could act on multiple copies (m copies) of the input Hilbert space. (Multiple copies allow for non-linear operations on the dataset, which can be important in certain classification tasks.) Hence we allow for the output of this action to be of the form:
where we employ a vectorized index notation for the tensor product of input states. Given that we are allowing for multiple copies of the input space, one can define an effective dataset Dm, composed of tensor products of m states from the original dataset, along with an effective probability distribution.
A QML loss function is then defined in an operational manner, so that it can be estimated on a quantum device. This involves the aforementioned mathematical objects as well as a measurement operator, or a set of measurement operators. We allow for the measurement operator to be tailored to the input state, and hence we index each measurement operator by the corresponding input state on m copies of the Hilbert space. Moreover, each measurement operator can be decomposed into a linear combination of Hermitian matrices that can be directly measured:
Generically, we could write the loss function as an average over training states chosen from the effective dataset Dm :
Here, f is an application-dependent function whose input is a measurable expectation value. Specifically, this expectation value is associated with the input state and has the form
There are many possible forms for the function f. However, there are multiple QML proposals in the literature that involve a simple linear form:
and in this case, we refer to the overall loss function as a 'linear loss function'. Alternatively, non-linear functions are also possible, and we also consider polynomial functions of the form:
where D is the degree of the polynomial. In this case, we refer to the loss as a 'polynomial loss function'.
3.2. Examples of QML loss functions
Now let us illustrate how various QML loss functions proposed in the literature fall under the previous framework. Crucially, as shown in figure 3, these loss functions can be used for a wide range of QML tasks.
3.2.1. Variational quantum error correction
Variational quantum algorithms to learn device-tailored quantum error correction codes were discussed in [4, 36]. The loss function involves evaluating the input–output fidelity of the code, averaged over a set of input states. The input state is fed into the encoder, then the noise channel acts, followed by the decoder. The concatenation of these three channels can be viewed as the overall channel
with parameter vector θ. Then the loss function is given by:
where the average is over some appropriately chosen set of states. It is clear that this loss function is of the form in (8), with f having the linear form in (10).
3.2.2. Quantum autoencoder
Inspired by the success of classical autoencoders, quantum autoencoders [30, 37] were proposed to compress quantum data by reducing the number of qubits needed to represent the dataset. Consider a bipartite quantum system AB composed of nA and nB qubits, respectively, and consider an ensemble of pure states on AB. The quantum autoencoder trains a gate sequence to compress this ensemble into the A subsystem, such that one can recover each state with high fidelity from the information in subsystem A. One can think of B as the 'trash' since it is discarded after the action of the trained gate sequence.
The original proposal [30] employed a loss function that quantified the overlap of the trash state with a fixed pure state:
Here, the two states appearing in the loss are the ensemble-average input state and the resulting ensemble-average trash state, and the measurement operator is the projector onto the fixed pure state.
Note that is a global measurement operator, meaning it acts non-trivially on all qubits, which can lead to barren plateaus in the training landscape [8]. To remedy this issue, [8] proposed a loss function with a local measurement operator acting non-trivially only on a small number of qubits:
where the sum runs over the qubits of B, and the identity factor acts on all qubits in B except the jth qubit.
It is clear from (16) and (18) that both loss functions fall under our framework. Namely, they have the form in (8), with f having the linear form in (10).
3.2.3. Dynamical simulation
Recently a QML-based algorithm was proposed for dynamical simulation [38]. Here the idea is to variationally compile the short-time dynamics into a time-dependent quantum neural network [39]. Then one uses the trained model to extrapolate to longer times. The training data for the compiling process can be taken to be product states, due to a generalization result from [40].
Consider the unitary associated with the short-time dynamics, together with a set of product states used for training, where n is the number of qubits, and let the quantum neural network to be trained act on these n qubits. Then the loss function is given by:
Here, the measurement operator is local: it acts non-trivially only on the jth qubit, and as the identity on the set of all qubits excluding the jth qubit.
Once again, this loss function clearly falls under our framework, having the form in (8) with f having the linear form in (10).
3.2.4. Fidelity for quantum neural networks
Dissipative perceptron-based quantum neural networks (DQNNs) were proposed in [29], and their trainability was analyzed in [11]. The loss function is based on the fidelity between the idealized output state and the actual output state of the DQNN. Specifically, we are given access to training data consisting of input–output pairs of pure states, and the DQNN is trained to output a state close to the ideal output state when fed the corresponding input state.
For this application, a global loss function was considered with the form
Here, the state appearing in the loss is the output state of the DQNN, and the measurement operator is the projector orthogonal to the ideal output state.
To avoid the issue of barren plateaus, a local loss function was also considered in [11]:
where the measurement operator is
This loss function is relevant whenever the ideal output states have a tensor-product form across the nout output qubits.
Clearly, these two loss functions fall under our framework, having the form in (8) with f having the linear form in (10).
3.2.5. Variational quantum state eigensolver
Near-term methods for quantum PCA have recently been proposed, including the variational quantum state eigensolver (VQSE) [32] and the variational quantum state diagonalization (VQSD) algorithm. Let us first discuss VQSE.
The goals of VQSE are to estimate the m largest eigenvalues of a density matrix ρ and to find quantum circuits that prepare the associated eigenvectors. When combined with a method to prepare the covariance matrix as a density matrix, VQSE can be used for PCA. Note that such a method was proposed in [41], where it was shown that an appropriate choice of input ensemble prepares the covariance matrix of a given quantum dataset.
The VQSE loss function can be written as an energy:
where the trainable object is a parameterized unitary that approximately diagonalizes ρ. Note that we inserted the ensemble decomposition of ρ in order to arrive at (24).
The measurement operator H is chosen to be non-degenerate over its m-lowest energy levels. For example, one can choose a global version of this operator:
or a local version of this operator:
where Zj is the Pauli-Z operator on the jth qubit, and the real coefficients rj are chosen appropriately to achieve non-degeneracy over the m lowest energy levels. Regardless, it is clear that the loss function in (24) falls under our framework, having the form in (8) with f having the linear form in (10).
3.2.6. VQSD
The goal of the VQSD algorithm [31, 42] is essentially the same as that of VQSE, i.e. to diagonalize a target quantum state ρ. Let us use:
to denote the state after the attempted diagonalization. Here we omit the θ dependency for simplicity of notation.
In contrast to VQSE, the VQSD loss function depends quadratically on the quantum state. Specifically, global and local cost functions have been proposed, respectively given by
Here, the two channels appearing above dephase their input (i.e. destroy the off-diagonal elements) in the global standard basis and in the local standard basis on qubit j, respectively.
We can rewrite the terms in the global loss using:
where SWAP denotes the swap operator, and where the additional unitary corresponds to the layers of CNOTs used in the so-called diagonalized inner product (DIP) test circuit [31]. Hence, we obtain:
Note that we inserted the ensemble decomposition of ρ in order to arrive at (33). Also note that the products of dataset probabilities define a probability distribution over pairs of indices. Hence it is clear that (33) falls under our framework, with m = 2 copies of the input Hilbert space and with f having the linear form in (10).
Similarly, for the local loss, we can write
where the relevant Hermitian operator is the one measured in the partial DIP test [31]. Once again, the local loss falls under our framework, with m = 2 copies of the input Hilbert space and with f having the linear form in (10).
3.2.7. Mean squared error for quantum classifiers
The mean squared error (MSE) loss function is widely employed in classical machine learning. Moreover, it has been used in the context of quantum neural networks, e.g. in [4]. Given a set of labels yi for a dataset, and a corresponding set of predictions from a machine learning model, the MSE loss is computed as:
In the case of a quantum model, there is freedom in specifying how to compute the prediction. Typically, it will be estimated via an expectation value of a measurement operator on the model's output state. In this case, the loss function in (36) is a quadratic function of that expectation value. Hence, this falls under our framework, with f having the polynomial form in (11) with degree D = 2.
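This degree-2 structure is what makes shot-frugal estimation subtle: squaring a finite-shot sample mean gives a biased estimate of the squared expectation value, since E[x̄²] = μ² + Var/shots. A toy numerical check of this bias (not from the paper), with ±1 outcomes standing in for measurements:

```python
import numpy as np

rng = np.random.default_rng(3)

# Squaring a finite-shot sample mean is a *biased* estimator of <H>^2.
# With +/-1 outcomes of mean mu, the bias term is (1 - mu^2)/shots.
mu, shots, trials = 0.5, 10, 200_000
outcomes = rng.choice([1.0, -1.0], size=(trials, shots),
                      p=[(1 + mu) / 2, (1 - mu) / 2])
sample_means = outcomes.mean(axis=1)
naive = (sample_means ** 2).mean()  # concentrates near 0.25 + 0.075 = 0.325, not 0.25
```

This is precisely the issue that the U-statistic constructions of section 4.2 resolve.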
3.3. Possible extensions to our framework
Up until now, we have considered loss functions that have a linear or polynomial dependence on measurable expectation values. As shown above, this form encompasses many loss functions used in the literature, and we will show that cost functions of this form enable cheap gradient estimation and therefore shot-frugal optimization.
However, there are also more complicated loss functions that have a less trivial dependence on measurable expectation values. One such loss function is the log-likelihood, which is commonly used in classification tasks [5, 43].
Let us assume one is provided with a dataset of pairs (ρi, yi), where the ρi are quantum states and the yi are discrete real-valued labels. We aim to train a parametrized model whose goal is to predict the correct labels. Here one cannot directly apply the techniques used previously, since the discrete outputs mean we cannot simply evaluate gradients. Therefore we cannot directly employ the results in [24, 34], and a different shot-frugal method must be used.
Given this dataset, the likelihood function is defined as
where it is assumed that the model's prediction depends only on the current input. From here, we can define the negative log-likelihood loss function as
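As a simple illustration of this loss (not of its shot-frugal estimation), here is a sketch computing the negative log-likelihood from a matrix of model label probabilities. The matrix form and the plain-sum normalization are assumptions for illustration; on hardware, the probabilities would be estimated from measurement frequencies.

```python
import numpy as np

def negative_log_likelihood(pred_probs, labels):
    """Negative log-likelihood of a classifier.
    pred_probs[i, k] is the model's probability of assigning label k to
    input state rho_i; labels[i] is the true (integer) label."""
    probs_of_true = np.asarray(pred_probs)[np.arange(len(labels)), labels]
    return -np.sum(np.log(probs_of_true))
```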
In appendix
4. Unbiased estimators for gradients of QML losses
Gradient-based optimizers are commonly used when optimizing QML models. In shot-frugal versions of these approaches, one of the key ingredients is the construction of an unbiased estimator of the gradient [24–26]. There are two types of loss functions we will consider in this work: those that have a linear dependence on expectation values, and those that have a non-linear dependence given by a polynomial function. We will see that for both types one can construct unbiased gradient estimators, and that one can define sampling strategies that massively reduce the number of shots needed to evaluate them.
Previous work considered how to construct unbiased estimators of QML gradients. However, the shot-frugal resource allocation strategies presented were sub-optimal [26] and left room for improvement. Furthermore, the more sophisticated shot allocation methods presented in [28] were purpose-built for VQE-type cost functions. Here we unify the approaches from these two works and show how one can employ more sophisticated shot allocation strategies in a general QML setting.
First, we consider the simpler case of linear loss functions, where we show that one can directly employ the parameter shift rule [45, 46] to construct an unbiased estimator for the gradient. Then we turn our attention to polynomial loss functions, where we present a general form for an unbiased estimator of the gradient. In both cases, we show that shots can be allocated according to the expansion coefficients in the expressions we derive. These in turn depend on the coefficients of the operators to be measured and on the set of quantum states in the QML dataset. Such shot-allocation schemes are an important ingredient in the design of our QML shot-frugal optimizer in the next section. We use the U-statistic formalism [47] to construct unbiased estimators for the loss function in each case. For an introduction to U-statistics, we refer the reader to appendix
4.1. Loss functions linear in the quantum circuit observables
4.1.1. Using the parameter shift rule
Loss functions that have a linear dependence on expectation values of observables are straightforward to consider. As shown in [26], the parameter shift rule can be used directly to construct unbiased estimators of the gradient of these loss functions. Previously, we wrote our general linear loss function as
Let us now consider the case where
By expanding each measurement operator as a linear combination of directly measurable terms, we can then write this loss function as follows
where the combined expansion coefficients absorb the dataset probabilities and the measurement-operator weights.
Consider the partial derivative of this loss function with respect to the parameter θx ,
Suppose now that the quantum channel is a unitary channel:
defined by a trainable quantum circuit whose parametrized gates are generated by Pauli operators. This assumption allows us to directly apply the parameter shift rule (see section 2). This leads to
Therefore, an unbiased estimator for the gradient can be obtained by combining two unbiased estimators for the loss function. Defining gx to be an unbiased estimator of the xth component of the gradient,
where the two terms on the right-hand side are unbiased estimators of the loss function at the shifted parameter values required by the parameter shift rule. This means that, for loss functions that are linear in the expectation values obtained from the quantum circuit, one can use unbiased estimators of the loss evaluated at shifted parameter values to obtain an unbiased estimator of the gradient.
We can therefore distribute the shots according to the magnitudes of the expansion coefficients when evaluating a single- or multi-shot estimate of a gradient component. This can be done by constructing a probability distribution whose probabilities are proportional to these magnitudes. Note that this strategy relies on the construction of two unbiased estimators of the loss function, which are then combined. In the next section, we show how to construct such estimators. In practice, we use the same total number of shots to evaluate both estimators, as they have the same expansion weights and are therefore equally important.
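A toy sketch of this construction for a single parameter, where ⟨Z⟩ = cos θ after an RY(θ) rotation plays the role of the loss, and finite-shot ±1 outcomes stand in for circuit measurements (the model and shot counts are illustrative assumptions):

```python
import numpy as np

def shot_expectation(theta, shots, rng):
    """Single-qubit toy model: <Z> after RY(theta)|0> equals cos(theta).
    We draw +/-1 outcomes with that mean to mimic finite-shot estimation."""
    p_plus = (1 + np.cos(theta)) / 2
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()

def param_shift_grad(theta, shots, rng):
    """Unbiased parameter-shift gradient:
    ( E(theta + pi/2) - E(theta - pi/2) ) / 2."""
    plus = shot_expectation(theta + np.pi / 2, shots, rng)
    minus = shot_expectation(theta - np.pi / 2, shots, rng)
    return (plus - minus) / 2
```

Since both shifted-circuit estimates are unbiased, their combination is an unbiased estimate of d⟨Z⟩/dθ = −sin θ, regardless of the (even single-shot) budget per circuit.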
4.1.2. Unbiased estimators for linear loss functions
Let stot represent the total number of shots. We introduce an estimator for each measurable expectation value, together with the resulting estimator for the loss. That is,
Here, the number of shots allocated to the measurement of a given term may itself be a random variable. As we work in terms of the total shot budget stot, we impose that the per-term allocations sum to stot. Each single-shot outcome is an independent random variable associated with one measurement of the corresponding term, and for each term these outcomes are identically distributed. Using these definitions, we show in the following proposition that this yields an unbiased estimator for the loss function.
Proposition 1. The estimator defined in equation (45) is an unbiased estimator for the loss function defined in equation (8).
Here it is useful to recall Wald's equation, which states that the expectation value of the sum of N real-valued, identically distributed random variables Xi can be expressed as
E[X1 + ⋯ + XN] = E[N] E[X1],
where N is a random variable that does not depend on the terms of the sum. In our case, each shot is indeed independent and sampled from the same distribution. Furthermore, the total number of shots does not depend on the sequence of single-shot measurements. Therefore,
Alternatively, one can see that the estimator defined in equation (45) has the form of a degree-1 U-statistic. From the above result, we arrive at the following corollary.
Corollary 1. The estimator defined in equation (44) is an unbiased estimator for the xth component of the gradient.
Proof. The proof follows by taking the expectation value of equation (44) and employing the result of proposition 1 together with the parameter shift rule.
We also obtained the variance of the estimator constructed above; its derivation can be found in appendix
Having now constructed an estimator for the loss function, we can, as outlined in the previous section, combine two single- or multi-shot estimates of this function evaluated at the required shifted parameter values to obtain an unbiased estimator of the gradient.
4.2. Loss functions with polynomial dependence on the quantum circuit observables
4.2.1. Constructing an unbiased estimator of the gradient
In the case of non-linear dependence on the observables produced by a quantum circuit, estimating the gradient is not as simple as applying the parameter shift rule. However, we can still derive estimators for the gradient when the non-linearity is described by a polynomial function. We begin with the general expression for the loss functions we consider in this work
Now constructing a polynomial loss function of degree D requires that
which leads to the following expression for the loss function,
where the combined coefficients are defined as before. Taking the derivative with respect to θx leads to
Using the multinomial theorem and given J Hamiltonian terms, we can expand the second sum in the above expression,
where the bj are non-negative integers. Therefore, we need to construct unbiased estimators of these product terms and combine them with the previously established gradient estimators based on the parameter shift rule. We then need to consider how to distribute shots among the terms of this estimator. Rewriting equation (53) leads to
Therefore we can distribute the shots according to the magnitudes of the expansion coefficients. Once again, we can construct an unbiased estimator with the normalized magnitudes of these coefficients defining the probabilities. We now explore how to construct an unbiased estimator for the product terms, which is essential for building an unbiased estimator of the gradients of these kinds of loss functions.
4.2.2. Constructing an unbiased estimator of polynomial terms
In order to construct an unbiased estimator for the gradient, we need an unbiased estimator for terms of the form
We can use the parameter shift rule for the derivative factor on the left-hand side. As for the product factor, since each j index corresponds to a different operator in the Hamiltonian, the factors are independent. Therefore, we need to construct estimators for powers of individual expectation values. This leads us to the following proposition.
Proposition 2. Consider the estimator defined as
where the summation runs over all distinct subscripts. This is an unbiased estimator for the corresponding power of the expectation value, provided the number of shots used is at least the degree bj of that power.
Proof. Equation (56) is inspired by the form of a U-statistic; see appendix
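A minimal sketch of such a U-statistic for the bth power of a mean, averaging products over all size-b subsets of distinct samples. This subset-averaging form is the standard U-statistic construction for this kernel and is assumed to match equation (56) in spirit (the symmetric kernel lets us sum over unordered subsets rather than ordered tuples):

```python
import numpy as np
from itertools import combinations
from math import comb

def ustat_power(samples, b):
    """Degree-b U-statistic: unbiased estimator of (E[X])^b from i.i.d.
    samples, averaging products over all size-b subsets of *distinct*
    samples (distinctness is what removes the bias of naive powering)."""
    s = len(samples)
    assert s >= b, "need at least b samples for an unbiased estimate"
    total = sum(np.prod(subset) for subset in combinations(samples, b))
    return total / comb(s, b)
```

Because the samples in each product are distinct and independent, each product has expectation μ^b, and so does their average; this is why at least bj shots are required for a degree-bj term.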
Bringing this all together we can formulate a proposition regarding an unbiased estimator for gradients of loss functions with polynomial dependence.
where
and the final sum runs over all distinct subscripts. This defines an unbiased estimator for the xth component of the gradient.
Proof. This can be seen to be true by taking the expectation values and invoking the propositions previously established.
These same techniques can be used to construct unbiased estimators of the loss function, which we expand on in appendix
4.2.3. Constructing an estimator for the gradient of the MSE loss function
To clarify the notation used above, let us focus on the special case of the MSE loss function. We consider a slightly more general form than the MSE cost introduced in equation (36): a set of labels yi for a dataset composed of tensor products of m states, together with the corresponding predictions of the quantum machine learning model. Therefore, we can write the loss as
Expanding the previous equation leads to
Evaluating the partial derivative with respect to θx gives,
We can distribute the total number of shots among these terms according to the relative magnitudes of their coefficients. This can be achieved by sampling from a multinomial probability distribution in which the normalized magnitudes of these coefficients define the probabilities.
It is important to note that some estimators have a different minimum number of shots than others. For example, the estimator for the last term in the above equation, involving both a gradient and a direct expectation value, requires a total of three shots spread over different circuit evaluations. The estimator for this term can be written as
Therefore, in this case, the minimum number of shots given to any term can be set to three, ensuring that every estimate of any term in equation (59) is unbiased. We expand on this consideration below.
4.3. Distributing the shots among estimator terms
As previously mentioned, the shots can be distributed according to a multinomial distribution with probabilities given by the normalized magnitudes of the constant factors that appear in the expressions for the above estimators. We note that the number of shots assigned to a given term may be zero. For the overall estimate to be unbiased with a given number of shots, the estimate of each individual term must also be unbiased. One needs at least two shots to produce an unbiased estimate of the gradient. For terms of the form one needs at least bj shots, corresponding to the degree of the U-statistic. Therefore, care must be taken when distributing shots to ensure that each measured term receives sufficiently many. One way to guarantee this is to distribute shots in multiples of the largest number of shots needed to evaluate any single term. Any leftover shots can then be distributed equally across the terms to be measured.
Each term itself may consist of a product of several expectation values, each with a different required minimum number of shots. The shot distribution within each term can then be chosen to respect this required minimum per factor. To make this concrete, consider a term of the form in equation (55). Estimating the gradient takes at least two shots, and estimating the product term takes at least shots. If we are given shots to estimate this term, we can assign to the first factor and to the product term, distributing any remaining shots equally among both.
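The multiples-of-minimum strategy described above can be sketched as follows (an illustrative helper, assuming a single minimum s_min shared by all terms; the paper's setting allows per-term minima):

```python
import numpy as np

def allocate_with_minimum(coeffs, total_shots, s_min, rng):
    """Allocate shots in multiples of s_min so that every sampled term
    receives at least s_min shots; leftover shots are spread equally
    over the terms that were sampled."""
    coeffs = np.asarray(coeffs, dtype=float)
    probs = np.abs(coeffs) / np.abs(coeffs).sum()
    # Sample whole blocks of s_min shots from the multinomial.
    blocks = rng.multinomial(total_shots // s_min, probs)
    shots = blocks * s_min
    leftover = total_shots - shots.sum()
    sampled = np.flatnonzero(shots)   # terms actually being measured
    if sampled.size > 0:
        shots[sampled] += leftover // sampled.size
        shots[sampled[: leftover % sampled.size]] += 1
    return shots
```

By construction, any term that is measured at all receives at least s_min shots, so each per-term estimate remains unbiased.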
5. The Refoqus optimizer
Now that we have defined unbiased estimators of the gradient, we have the tools needed to present our new optimizer. We call our optimizer Refoqus, which stands for REsource Frugal Optimizer for QUantum Stochastic gradient descent. This optimizer is tailored to QML tasks. QML tasks have the potential to incur large shot overheads because of large training datasets as well as measurement operators composed of a large number of terms. With Refoqus, we achieve shot frugality first by leveraging the gCANS [25] rule for allocating shots at each iteration step and second by allocating these shots to individual terms in the loss function via random sampling over the training dataset and over measurement operators. This random sampling allows us to remove the shot floor imposed by deterministic strategies (while still achieving an unbiased estimator for the gradient), hence unlocking the full shot frugality of our optimizer.
The key insight behind the Refoqus protocol is that loss functions with a linear or polynomial dependence on measurable Hermitian operators, as well as their gradients, can always be written as a sum of measurable quantities with expansion coefficients. These coefficients can in turn define a multinomial probability distribution that guides the allocation of shots to each term when evaluating the loss function or its gradient.
We outline the Refoqus optimizer in algorithm 1.
Algorithm 1. The optimization loop used in Refoqus. The function evaluates the gradient at θ, returning a vector of gradient estimates g and their variances S. The vectors are calculated using shots to estimate the xth component of the gradient and its variance, where s0 is the minimum number of shots required to obtain an unbiased estimator. Note that the shots for each component estimation are distributed in multiples of s0, according to a multinomial distribution determined by the expansion coefficients that define the gradient estimator. These expansion coefficients are defined by the function , which returns the probability of measuring the term corresponding to the coefficients , . For the case of linear loss functions .
Input: Learning rate α, starting point , min number of shots per estimation , number of shots that can be used in total , Lipschitz constant L, running average constant µ, a vector of the least number of shots needed for each gradient estimate s0, which is loss function dependent, and , which is also loss function dependent.
1: initialize: , , , , , , , , ,
2: while do
3:
4:
5:
6:
7:
8:
9:
10: for do
11:
12: end for
13:
14: end while
The hyperparameters that we employ are similar to those of Rosalin [28] and come with similar recommendations on how to set them. For instance, the Lipschitz constant L bounds the largest possible value of the derivative. Hence, it depends on the loss expression but can be set as according to [24]. Moreover, a learning rate satisfying can be used.
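To make the structure of algorithm 1 concrete, the following is a schematic Python sketch of a gCANS-style shot-adaptive gradient descent loop. The shot rule and bias-corrected moving averages here are an assumed form following [25], and the gradient oracle `toy_grad` is a hypothetical placeholder standing in for the sampling-based quantum gradient estimator of section 4:

```python
import numpy as np

def refoqus_sketch(grad_estimator, theta0, alpha, lipschitz, mu_avg,
                   s0, shot_budget, rng):
    """Schematic gCANS-style shot-adaptive gradient descent
    (assumed form of the shot rule, following [25])."""
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    chi = np.zeros(d)          # running average of gradient components
    xi = np.zeros(d)           # running average of per-shot variances
    shots = np.full(d, s0)     # shots per gradient component
    spent, t = 0, 0
    while spent + shots.sum() <= shot_budget:
        g, var = grad_estimator(theta, shots, rng)
        spent += int(shots.sum())
        # Bias-corrected exponential moving averages of gradient/variance.
        chi = mu_avg * chi + (1 - mu_avg) * g
        xi = mu_avg * xi + (1 - mu_avg) * var
        chi_hat = chi / (1 - mu_avg ** (t + 1))
        xi_hat = xi / (1 - mu_avg ** (t + 1))
        theta = theta - alpha * g          # gradient-descent update
        # gCANS-style global shot rule (assumed form); requires alpha < 2/L.
        sigma = np.sqrt(np.maximum(xi_hat, 0.0))
        denom = max(float(chi_hat @ chi_hat), 1e-12)
        s_new = (2 * lipschitz * alpha / (2 - lipschitz * alpha)
                 * sigma * sigma.sum() / denom)
        shots = np.maximum(np.ceil(s_new), s0).astype(np.int64)
        t += 1
    return theta, spent

def toy_grad(theta, shots, rng):
    """Hypothetical noisy gradient oracle for f(theta) = ||theta||^2:
    each single-shot sample of component x has mean 2*theta[x], variance 1."""
    g = np.array([rng.normal(2 * th, 1.0, size=int(s)).mean()
                  for th, s in zip(theta, shots)])
    return g, np.ones_like(theta)
```

In the actual optimizer, the per-component shots would additionally be distributed across loss terms in multiples of s0 via the multinomial sampling described above.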
6. Convergence guarantees
The framework for Refoqus leverages the structure and the update rule of the gCANS optimizer presented in [25]. Therefore, we can apply the same arguments introduced to show geometric convergence. We repeat the arguments and assumptions needed for this convergence result here for convenience.
Proposition 4. Provided the loss function satisfies the assumptions stated below the stochastic gradient descent routine used in Refoqus achieves geometric convergence to the optimal value of the cost function. That is,
where t labels the iteration, is the optimal value of the loss function, and .
The update rule used in this work was introduced in [25], and the convergence proof presented there applies directly. This update rule guarantees fast convergence in expectation to the optimal value for sufficiently smooth loss functions. The underlying assumption of this result is that the loss function is strongly convex and has Lipschitz-continuous gradients. In most realistic QML scenarios the optimization landscape is not convex; however, if the optimizer settles into a convex region, then fast convergence is expected.
The exact assumptions needed in order to ensure the geometric convergence of Refoqus to the global minima are as follows:
- 1..
- 2.Var, where Xx is the sampling-based estimator of .
- 3.is µ-strongly convex.
- 4.has L-Lipschitz continuous gradient.
- 5.α is a constant learning rate satisfying .
- 6.An ideal version of the update rule holds, that is:
In the previous sections, we have shown how to construct unbiased estimators for the gradient, satisfying the first assumption. The second assumption is satisfied as the estimate of the gradient is constructed by calculating the mean of several sampling-based estimates. Assumption 5 is satisfied by construction. Assumption 6 is an idealized version of the update rule used in Refoqus. However, the gradient magnitude and estimator variances cannot be exactly known in general, so these quantities are replaced by exponentially moving averages to predict the values of σx and . Assumption 4 is also satisfied for all the loss functions we consider in this work. Finally, as previously noted, assumption 3 is not expected to hold in QML landscapes, which are non-convex in general.
7. Numerical results
7.1. Quantum PCA
We benchmark the performance of several optimizers when using VQSE [32] to perform quantum PCA [48] on molecular ground states. We perform quantum PCA on 3 quantum datasets: ground states of the H2 molecule in the sto-3g basis (4 qubits), H2 in the 6-31g basis (8 qubits) and in the sto-3g basis (14 qubits). We use 101 circuits per dataset. The corresponding covariance matrix can be expressed as . We optimize using the local cost introduced in equation (26) and repeated here for convenience:
taking coefficients and following the presentation in [32]. Hence the latter coefficients form the terms used for shot-allocation strategies, as outlined below.
When implementing Rosalin and Refoqus we use the WRS strategy for allocating shots as it demonstrated superior performance against other strategies in the numerical results of [28]. Hence, the coefficients that appear in the loss function (and the gradient estimators) are used to construct a multinomial probability distribution defined by the terms , where . This distribution is used to probabilistically allocate the number of shots given to each term for each gradient estimation.
We use a circuit consisting of two layers of a hardware-efficient ansatz with depth 4, depicted in figure 4, with 20, 40, and 70 parameters for the 4, 8 and 14 qubit problems respectively. These parameters are optimized in order to minimize the VQSE cost function using the Adam, Rosalin, and Refoqus optimizers 7 . We use the absolute error in estimating the eigenvalues of the system to benchmark the overall performance of the output of the optimized circuit, as this is the desired output of the algorithm. Given the exact 16 highest eigenvalues λi , and their estimation given current parameters θ, the latter is computed as .
Figure 5 shows the results obtained when running each algorithm 20 times with different random initializations of the variational ansatz, up to a total shot budget of 10^8. In general, Refoqus is the best-performing optimizer, achieving the best-case accuracy while using fewer shots at every shot budget. In summary, for the 4 qubit system, a median eigenvalue error of , and was obtained using Adam, Rosalin and Refoqus respectively. Although Adam found a better minimal error value than Rosalin ( and respectively), Rosalin is more advantageous at lower shot budgets. Refoqus, however, achieved a minimal error value of and demonstrates a clear advantage over both Adam and Rosalin. On both the 8 and 14 qubit systems we observe a similar trend, although eigenvalue errors are worse overall. This is because the variational ansatz is kept at a fixed depth while the problem size increases, which degrades performance. Nonetheless, Refoqus clearly matches or outperforms both Adam and Rosalin in the number of shots required to reach a given accuracy.
Additionally, Refoqus requires fewer parameter updates (iterations) than Rosalin. Indeed, for the 4, 8 and 14 qubit problems respectively, a median of , and 80 iterations were used by Refoqus against 1217, 618, and 353 for Rosalin. This is another desirable feature of the optimizer, as it minimizes the number of iterations needed in the quantum–classical feedback loop to arrive at a solution of the same (or better) quality. Although the number of iterations is not a bottleneck in and of itself, on current hardware the quantum–classical feedback can prove restrictive, meaning fewer iterations are favorable for real-device implementation.
Finally, we note that the variance of the Refoqus results appears to be larger, suggesting that sampling over more terms can lead to a larger variety in the quality of the optimization obtained. Nevertheless, the advantage in shot frugality that arises from sampling over more terms is clear and despite this larger variance, Refoqus is clearly the best-performing optimizer.
7.2. Quantum autoencoder
In addition to the quantum PCA results, we also performed a quantum autoencoder task on the same molecular ground states. We optimize using the local cost introduced in equation (17), where half of the qubits are 'trashed'. For the 4 qubit system, the ansatz is inspired by [33]; its implementation is available in PennyLane under the name 'StronglyEntanglingLayers' [49], and we use three layers. For the other systems, we use the same ansatz as for the quantum PCA task, also with three layers, as we obtained better cost values. Figure 6 shows the results obtained when running each algorithm ten times with different random initializations of the variational ansatz, up to a total shot budget of 10^6 shots. We do not include Adam here, as only two iterations would be possible within this budget. In summary, for the 4 qubit system, a median eigenvalue error of was obtained in a median of 1468 iterations using Rosalin, and in 208 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. On the 8 qubit system, a median eigenvalue error of was obtained in a median of 342 iterations using Rosalin, and in 65 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. Finally, on the 14 qubit system, a median eigenvalue error of was obtained in a median of 89 iterations using Rosalin, and in 24 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. Hence, we again witness that Refoqus outperforms Rosalin.
8. Discussion
QML algorithms present a different paradigm for data processing, one particularly well suited to quantum data. VQAs are a key contender for delivering a near-term useful quantum advantage. However, both VQAs and QML models require long run times and large resource overheads during training (requiring many iterations and shots to achieve respectable performance). To address these challenges, we propose Refoqus as a shot-frugal gradient-based optimizer based on state and operator sampling. We outline many cost functions that are of interest to the community and are easily captured within our framework for Refoqus. Our new optimizer leverages the loss function of the problem to allocate shots randomly when evaluating gradients. This randomness allows us to make resource-cheap gradient update steps, unlocking shot frugality. We have shown that Refoqus comes with geometric convergence guarantees under specific assumptions, elaborated in section 6. Additionally, when applying our optimizer to QML tasks, namely a quantum PCA task and a quantum autoencoder task, we obtain significantly better performance in terms of the number of shots and the number of iterations needed to reach a given accuracy. Although we only applied Refoqus to these two tasks, we expect similar results to hold for other problems fitting our framework. Indeed, other cost functions follow a similar pattern, and once there are many weighted terms, the sampling approach is expected to yield similar results.
A potential future research direction is to extend our analysis to more complicated, non-linear loss functions such as the log-likelihood, and to explore in more detail how to introduce shot frugality into gradient-free optimizers [50–52]. Furthermore, applying our optimizer to a problem of interest on a real device is an interesting and logical next step. However, the quality of results can be significantly impacted in noisy settings [20]. To address the latter problem, our optimizer can be combined with several noise-handling procedures when used on a real device [53–60]. It would also be interesting to introduce different techniques to reduce the number of shots spent, such as Bayesian optimization [27]. Finally, other empirical studies related to algorithm selection and configuration [51, 55, 61–65] can be applied to Refoqus on different datasets [63, 66, 67].
Acknowledgments
This work was supported by the U.S. Department of Energy (DOE) through a quantum computing program sponsored by the Los Alamos National Laboratory (LANL) Information Science & Technology Institute. M H G was partially supported by the Laboratory Directed Research and Development (LDRD) program under Project Number 20210116DR. M C acknowledges support from the LDRD program of LANL under Project Number 20230049DR. L C was partially supported by the U.S. DOE, Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program. P J C acknowledges support from the LANL ASC Beyond Moore's Law project.
Data availability statement
The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.
Appendix A: Constructing unbiased estimators for the log-likelihood
Here we review a conceptually different approach for a shot-frugal optimization of a problem utilizing a log-likelihood loss function. In particular, we will follow the results in [44]. First, we assume that one is given a dataset of the form where ρi are quantum states in some domain , and yi are discrete real-valued labels from some domain (e.g. in binary classification ). Then, we assume that one has a trainable parametrized model whose goal is to predict the correct labels. Note that here one cannot directly apply the techniques previously used, as discrete outputs preclude gradient calculations and the use of the results in [24, 34]. Hence, a different shot-frugal method must be employed.
Given and , we can define the likelihood function
where we have defined and assumed that the model's predictions depend only on the current input. From here, we can define the negative log-likelihood loss function as
Interestingly, we can reduce the task of estimating the negative log-likelihood loss function to a Bernoulli sampling problem. Namely, we can assume without loss of generality that the model specifies a stochastic function g that takes as input ρi and the set of parameters θ, and outputs a prediction yi . That is, . Then, we remark that to estimate the loss function we do not need to know the explicit form of g; one only needs to know the probability for a random sample y from the model to match the true label yi . Thus, we can convert each output from the model into a variable
By construction, x follows a Bernoulli distribution with probability pi . Thus, we can estimate the log-likelihood loss function by drawing samples from said Bernoulli distribution. In most common cases these samples can be thought of as single-shot estimates of the model's output, and our goal is to minimize this number of samples/shots.
In [44] the authors study the task of constructing an unbiased estimator for the negative log-likelihood loss function of equation (A2), where they consider a fixed number of shots strategy (one uses s fixed shots) and the inverse binomial sampling (IBS) strategy (where one keeps drawing samples until ). One can show that the fixed number of shots strategy always leads to a biased estimator of (for the special case when one estimates , the estimator can even be infinitely biased if xk is always zero!). On the other hand, using IBS one can define the estimator
where K is the number of shots needed to obtain a result . Notably, it can be shown that , meaning that IBS provides an unbiased estimator for the negative log-likelihood loss function (see [44]). The number of shots K that IBS takes for a trial with probability pi is geometrically distributed with mean . For instance, if , IBS can require only 1 or 2 shots, while it will assign more shots to cases of small pi . We also remark that in [44] the authors derive the estimator variance and show that it is minimal and non-diverging if . This method then provides a shot-frugal, minimum-variance, unbiased estimator for the negative log-likelihood loss function, which can be combined with gradient-free optimizers to train the parameters in the model. We further refer the reader to [44] for a discussion on how to extend these results to models with continuous outputs.
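For concreteness, the IBS scheme can be sketched as follows (a minimal illustration, assuming the estimator returns the harmonic number H_{K-1}, which is unbiased for -log p; see [44] for the exact construction and variance analysis):

```python
import numpy as np

def ibs_neg_log_estimate(p, rng):
    """Inverse binomial sampling: draw Bernoulli(p) trials until the
    first success, at trial K, then return H_{K-1} = sum_{k=1}^{K-1} 1/k,
    an unbiased estimate of -log(p)."""
    k = 1
    while rng.random() >= p:   # failure occurs with probability 1 - p
        k += 1
    return sum(1.0 / j for j in range(1, k))
```

Averaging many such estimates recovers -log p, while a fixed-shot estimate of -log p is biased for any finite number of shots.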
Appendix B: Using U-statistics to construct unbiased estimators
The theory of U-statistics was initially introduced by Hoeffding in the late 1940s [47] and has a wide range of applications. U-statistics presents a methodology on how to use observations of estimable parameters to construct minimum variance unbiased estimates of more complex functions of the estimable parameters. In the cases we are interested in here, the estimable parameter is the expectation value of an observable used when constructing the cost function.
First, let be a family of probability measures on an arbitrary measurable space. A probability measure is defined as the probability of some particular event occurring. Let be a real-valued function defined for . We define to be an estimable parameter within if there exists some integer r for which an unbiased estimator of exists constructed from r independent identically distributed (i.i.d) random variables according to P. In other words, there is a measurable, real-valued function such that
where are i.i.d. with distribution P. The degree of is defined to be the smallest integer r with this property.
The function is also referred to as the kernel. This function can be symmetrized by averaging h over all possible permutations of the random variables . That is, the symmetrized version of the kernel can be defined as follows
where the summation is over Sr , the group of all permutations of a set of r indexes. With these definitions in hand, we can construct a U-statistic.
Assume we are given s > r samples from ; the U-statistic with kernel h is then defined as
where the summation is over the set of combinations of r integers chosen from .
The U-statistic defined above is simply an average of unbiased estimates for and is therefore itself an unbiased estimator for . Indeed, it can be proved that the U-statistic is the best unbiased estimator for , in that it has the smallest variance.
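The definition above translates directly into code (a small illustrative helper, not tied to any particular library):

```python
from itertools import combinations
from math import comb

def u_statistic(kernel, samples, r):
    """Average a symmetric degree-r kernel over all size-r subsets of the
    samples -- the U-statistic with kernel h."""
    total = sum(kernel(*subset) for subset in combinations(samples, r))
    return total / comb(len(samples), r)
```

For instance, with the degree-2 kernel h(x1, x2) = (x1 - x2)^2 / 2 this reproduces the usual (unbiased) sample variance, as shown in the examples below.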
B.1. Examples of U-statistics
Now that we have introduced U-statistics we can build intuition with some simple examples. Firstly, let us consider how to construct an unbiased estimate for the mean µ. Recall we first need to construct a kernel function h. In this case, the kernel will have degree 1 and can simply be defined as . Note this has the required property that . Furthermore, in this case , i.e. the kernel is already symmetric. Therefore, the U-statistic for µ constructed from s samples can be written as,
which is the well-known sample-average formula used to estimate the mean.
Similarly, we can consider how to construct a U-statistic for the variance. Consider the kernel . This kernel has degree 2. Note once again, this has the required property that , which is the definition of the variance. Then this needs to be symmetrized, . Therefore, the U-statistic for the variance constructed with the kernel h using s samples is,
which corresponds to the sample variance.
As a final example, let us consider constructing an estimator of the function . Let us define the kernel as , which clearly fulfills the properties we require. This kernel is already symmetric, such that . Therefore, the U-statistic can be defined as
This is exactly the form of estimators that we require in the main text. For our purposes, the number of samples s is a random variable that is constructed probabilistically.
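As an illustration of this final example, the degree-2 product kernel yields an unbiased estimator of the squared mean, which admits a closed form that avoids enumerating all pairs (a small sketch; the symbols are illustrative):

```python
import numpy as np
from itertools import combinations

def mu_squared_ustat(x):
    """Degree-2 U-statistic with kernel h(x1, x2) = x1 * x2,
    an unbiased estimator of (E[X])^2."""
    x = np.asarray(x, dtype=float)
    s = x.size
    # Average of x_i * x_j over all i != j, in closed form.
    return (x.sum() ** 2 - (x ** 2).sum()) / (s * (s - 1))
```

Note that the naive estimator mean(x)**2 is biased upward by Var[X]/s, which is precisely the bias the U-statistic removes.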
The framework of U-statistics is perfectly suited to our purposes. Indeed, it provides a recipe to construct unbiased estimators for polynomial functions of variables with some probability distribution. Furthermore, these estimators have the smallest possible variance. Finally, the framework shows the minimum number of shots necessary to estimate a given term through the degree of the kernel.
Appendix C: Variance of estimator for linear loss functions
In section 4.1.2, we constructed an unbiased estimator for the loss function. Denoting the latter , we proved in proposition 1 that . We also obtained the variance of such an estimator, which can be used to compare different shot-allocation strategies, similar to [28] for the Rosalin optimizer designed for VQE. Moreover, elements of this variance calculation are reused in appendix E to construct an unbiased estimator for the MSE loss function.
Proposition 5. The variance of is
where and the covariance is defined as
Proof. Using the definition of the variance and recalling that , we can write
We can then decompose the first term into the contributions where the indices in the sum differ and where they coincide
where
Note that since is an unbiased estimator of , by application of Wald's equation we obtain
For the case , care needs to be taken when expanding the sum as the terms are not independent when . Therefore, we decompose the sum into the cases where and ,
Once again, using Wald's equation, but taking into account that the number of entries for the is , one can show
Finally, using the fact that is an unbiased estimator of , we obtain:
Bringing all the terms together and writing the expressions containing E1, E2 and E3 in a single sum, we obtain
Expanding the second term and factorizing leads to,
which we can then rewrite to prove the desired result:
Appendix D: Constructing unbiased estimators of loss functions with polynomial dependence
We consider here how to construct estimators for loss functions that are an arbitrary polynomial of degree D in the observables, expressed as:
where . Using the multinomial theorem, given J Hamiltonian terms, we can expand the power of the sum into a sum of terms:
where and bj are non-negative integers. First, one can generalize the formula for an unbiased estimator of :
The case was elaborated in equation (E11).
For , we define an unbiased estimator as:
This involves estimating terms to construct the following estimator for the cost:
from which we establish the following proposition.
Proposition 6. is an unbiased estimator for .
Proof. Taking the expectation over the above definitions:
Using Wald's equation again, we obtain:
which, when substituted into the previous equation, yields .
Appendix E: Constructing an unbiased estimator for the MSE loss function
The MSE is widely used in machine learning. Here, we show how to construct an unbiased estimator for the MSE as an illustrative example of the polynomial functions of the observables studied in appendix D. We start by considering a simplified case for the MSE, where we do not decompose each measurement operator Hi into a linear combination of Hermitian matrices. We then tackle the more complex case where Hi is of the form presented in equation (7).
E.1. Step 1: simplified MSE
Let be a dataset of N points, where the task consists of minimizing the following cost:
This corresponds to a weighted MSE cost. The most common case is when . We will take a step-by-step approach to constructing unbiased estimators for this cost.
If we expand the brackets, we obtain,
We denote as the estimator for , as the estimator for .
where is an unbiased estimator, as follows directly from Wald's equation; is also an unbiased estimator for , following the reasoning used in equation (C13) in the proof of proposition 5. Then, defining as
we establish the following proposition.
Proposition 7. is an unbiased estimator for .
Proof. Using the definitions in equations (E2) and (E4) we can write
Using Wald's equation we can evaluate this expression to give
Finally, we can evaluate these terms and factorize, which leads to,
This means that we can estimate this simplified MSE loss function by evaluating the first term exactly, sampling from the second term according to the probability distribution and sampling the third term according to the distribution .
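For a single dataset point, the key ingredient is an unbiased estimator of the squared residual, which replaces the biased mean(x)**2 with the degree-2 U-statistic for the squared mean. A minimal sketch (illustrative names; x is the vector of single-shot outcomes and y the label):

```python
import numpy as np

def unbiased_sq_residual(x, y):
    """Unbiased estimator of (E[X] - y)^2 from s >= 2 shots x, using the
    degree-2 U-statistic for (E[X])^2 in place of the biased mean(x)**2."""
    x = np.asarray(x, dtype=float)
    s = x.size
    mu2 = (x.sum() ** 2 - (x ** 2).sum()) / (s * (s - 1))  # unbiased (E[X])^2
    return mu2 - 2.0 * y * x.mean() + y ** 2
```

The naive estimator (x.mean() - y)**2 overestimates the true squared residual by Var[X]/s, which can dominate when few shots are used per point.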
E.2. Step 2: full MSE
Now, we assume the Hamiltonians can be decomposed into many terms such that . Hence the new full MSE cost can be expressed as
Defining as the estimator for as previously,
When we define,
and finally,
Collecting these terms we define an estimator of the full MSE loss function as
which can be used to establish the following proposition.
Proposition 8. is an unbiased estimator for .
Proof. Taking the above definition we can write the following result:
Given that
We can combine the above expressions to obtain
This shows we can estimate the full MSE cost function by evaluating four terms. One can be evaluated exactly and the others can be estimated by sampling from their corresponding probability distributions built with the weights as parameters.
Footnotes
- 7
For Adam, we perform 100 shots per circuit in our simulations. For Rosalin and Refoqus, we use a minimum of 2 shots per circuit (this leads to and for Rosalin and Refoqus respectively) in conjunction with WRS.