Abstract
Quantum-enhanced data science, also known as quantum machine learning (QML), is of growing interest as an application of near-term quantum computers. Variational QML algorithms have the potential to solve practical problems on real hardware, particularly when involving quantum data. However, training these algorithms can be challenging and calls for tailored optimization procedures. Specifically, QML applications can require a large shot-count overhead due to the large datasets involved. In this work, we advocate for simultaneous random sampling over both the dataset as well as the measurement operators that define the loss function. We consider a highly general loss function that encompasses many QML applications, and we show how to construct an unbiased estimator of its gradient. This allows us to propose a shot-frugal gradient descent optimizer called Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). Our numerics indicate that Refoqus can save several orders of magnitude in shot cost, even relative to optimizers that sample over measurement operators alone.
Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.
1. Introduction
A new kind of data is emerging in recent times: quantum data. Tabletop quantum experiments and analog quantum simulators produce interesting sets of quantum states that must be characterized. Moreover, the rise of digital quantum computers is leading to the discovery of novel quantum circuits that can, once again, produce quantum states of interest. Quantum sensing, quantum phase diagrams, quantum error correction, and quantum dynamics are some of the areas that stand to benefit from quantum data analysis.
Classical machine learning was developed for the processing of classical data, and it can be inefficient at processing quantum data. This issue has given rise to the field of quantum machine learning (QML) [1, 2]. QML has seen the proposal of parameterized quantum models, such as quantum neural networks [3–6], that could efficiently process quantum data. Variational QML, which involves classically training a parameterized quantum model, is indeed a leading candidate for implementing QML in the near term.
Variational QML, which we will henceforth refer to as QML for simplicity, has faced various sorts of trainability issues. Exponentially vanishing gradients, known as barren plateaus [7–15], as well as the prevalence of local minima [16, 17] are two issues that can impact the complexity of the training process. Quantum hardware noise also impacts trainability [18, 19]. All of these issues contribute to increasing the number of shots and iterations required to minimize the QML loss function. Indeed, a detailed shot-cost analysis has painted a concerning picture [20].
It is therefore clear that QML requires careful frugality in terms of the resources expended during the optimization process. Indeed, novel optimizers have been developed in response to these challenges. Quantum-aware optimizers aim to replace off-the-shelf classical optimizers with ones that are specifically tailored to the quantum setting [21–23]. Shot-frugal optimizers [24–27] have been proposed in the context of the variational quantum eigensolver (VQE), whereby one can sample over terms in the Hamiltonian instead of measuring every term [28]. While significant progress has been made on such optimizers, particularly for VQE, very little work has been done to specifically tailor optimizers to the QML setting. The cost functions in QML go well beyond those used in VQE, and hence QML requires more general tools.
In this work, we generalize previous shot-frugal and iteration-frugal optimizers, such as those in [24, 25, 28], by extending them to the QML setting. Specifically, we allow for random, weighted sampling over both the input and the output of the loss function estimation circuit. In other words, and as shown in figure 1, we allow for sampling over the dataset as well as over the measurement operators used to define the loss function. Our sampling approach allows us to unlock the frugality (i.e. to achieve the full potential) of adaptive stochastic gradient descent optimizers, such as the recently developed individual coupled adaptive number of shots (iCANS) [24] and global coupled adaptive number of shots (gCANS) [25].
We discuss how our approach applies to various QML applications such as perceptron-based quantum neural networks [11, 29], quantum autoencoders [30], variational quantum principal component analysis (PCA) [31, 32], and classifiers that employ the mean squared error (MSE) loss function [26, 33]. Each of these applications can be unified under one umbrella by considering a generic loss function with a highly general form. Thus we state our main results for this generic loss function. We establish an unbiased estimator for this loss function and its gradient. In turn, this allows us to provide convergence guarantees for certain optimization routines, like stochastic gradient descent. Furthermore, we show that for this general loss function one can use the form of the estimator to inform a strategy that distributes shots so as to obtain the most shot-frugal estimates. We also show how to construct an unbiased estimator of the log-likelihood loss function, which can be used in gradient-free shot-frugal optimization.
Finally, we numerically investigate the performance of our new optimization approach, which we call Refoqus (REsource Frugal Optimizer for QUantum Stochastic gradient descent). For a quantum PCA task and a quantum autoencoder task, Refoqus significantly outperforms state-of-the-art optimizers in terms of the shot resources required. Refoqus even outperforms Random Operator Sampling for Adaptive Learning with Individual Number of shots (Rosalin) [28]—a shot-frugal optimizer that samples only over measurement operators. Hence, Refoqus will be a crucial tool to minimize the number of shots and iterations required in near-term QML implementations.
2. Background
Before giving the necessary background on shot-frugal optimizers, we introduce some notation. Let ℒ denote the loss function, ∇ℒ its gradient, α the learning rate, and θ the parameter vector. The gradient is assumed to be Lipschitz continuous, with a Lipschitz constant L satisfying the bound
‖∇ℒ(θ) − ∇ℒ(θ′)‖ ⩽ L ‖θ − θ′‖
for all θ, θ′, where ‖·‖ is the ℓ2 norm.
2.1. gCANS
Inspired by an adaptive batch-size optimizer used in classical machine learning [34], the iCANS optimizer [24] was introduced as an adaptive method for stochastic gradient descent in the context of VQE. It allows the number of shots per partial derivative (i.e. gradient component) to vary individually, hence the name iCANS. We denote by sx the number of shots allocated to gradient component x.
Recently, a potential improvement over iCANS was introduced called gCANS [25]. gCANS considers the expected gain (i.e. the decrease in the loss function) over the entire gradient vector. Then the goal is to maximize the shot efficiency
where the sum goes over all components of the gradient. Solving for the optimal shot count then gives:
where σx is the standard deviation of a random variable Xx whose sample mean is gx, an unbiased estimator for the xth gradient component. (Note that an exponential moving average is used to estimate σx and gx, as their true values are not accessible.) It was proven that gCANS achieves geometric convergence to the optimum, often reducing the number of shots needed to reach solutions of comparable quality relative to its predecessor iCANS.
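To make this rule concrete, here is a small sketch of a gCANS-style shot allocation in Python. Since the display equations are not reproduced above, the exact prefactor 2Lα/(2 − Lα) and the floor of two shots per component used below are assumptions based on the description in [25]; the essential feature is the proportionality sx ∝ σx Σx′ σx′ / ‖g‖².

```python
import numpy as np

def gcans_shot_allocation(grad_est, sigmas, lipschitz, lr, s_min=2):
    """Allocate shots per gradient component following a gCANS-style rule,
    s_x proportional to sigma_x * sum(sigma) / ||g||^2 (see [25]).
    The prefactor 2*L*alpha / (2 - L*alpha) and the floor s_min are
    assumptions here; the rule presumes L*alpha < 2."""
    g = np.asarray(grad_est, dtype=float)
    sig = np.asarray(sigmas, dtype=float)
    grad_norm_sq = np.sum(g ** 2)
    prefactor = 2 * lipschitz * lr / (2 - lipschitz * lr)
    s = prefactor * sig * np.sum(sig) / grad_norm_sq
    # round up and enforce a minimal allocation per component
    return np.maximum(np.ceil(s).astype(int), s_min)
```

In practice the σx and gx entering this rule would themselves be exponential moving averages, as noted above.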
2.2. Rosalin
Shot-frugal optimizers like iCANS and gCANS rely on having an unbiased estimator for the gradient or its components. However, this typically places a hard floor on how many shots must be allocated at each iteration, i.e. going below this floor could result in a biased estimator. This is because the measurement operator H is typically composed of multiple non-commuting terms, each of which must be measured individually. Each of these terms must receive some shot allocation to avoid having a biased estimator. However, having this hard floor on the shot requirement is antithetical to the shot-frugal nature of iCANS and gCANS, and ultimately it handicaps these optimizers' ability to achieve shot frugality.
This issue inspired a recent proposal called Rosalin [28]. Rosalin employs weighted random sampling (WRS) of the individual terms in the measurement operator H, which allows one to achieve an unbiased estimator without a hard floor on the shot requirement. (Even a single shot, provided that it is randomly allocated according to an appropriate probability distribution, can lead to an unbiased estimator.) When combined with the shot allocation methods from iCANS or gCANS, the operator sampling methods in Rosalin were shown to be extremely powerful in the context of molecular chemistry problems, which often have a large number of terms in H.
We remark that [28] considered several sampling strategies. The WRS strategy outperformed, or was competitive with, the others on molecular ground-state problems. Given a budget of stot shots and a measurement operator with N terms, WRS uses the magnitudes of the terms' coefficients to define a (non-uniform) probability distribution for selecting which term should be measured at each shot. Because of these results, we choose to focus on the WRS strategy in our work here.
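A minimal sketch of WRS in Python, assuming each term is drawn with probability equal to its normalized coefficient magnitude; the ±1 single-shot outcomes below are a toy stand-in for real hardware measurements:

```python
import numpy as np

def wrs_estimate(coeffs, term_means, s_tot, rng):
    """Weighted random sampling (WRS) estimate of <H> = sum_j c_j <h_j>.
    Each of the s_tot shots picks a term j with probability
    p_j = |c_j| / sum_k |c_k|; single-shot outcomes for term j are modeled
    here as +/-1 variables with mean term_means[j]."""
    c = np.asarray(coeffs, dtype=float)
    p = np.abs(c) / np.abs(c).sum()
    shots_per_term = rng.multinomial(s_tot, p)
    est = 0.0
    for j, s_j in enumerate(shots_per_term):
        if s_j == 0:
            continue
        mu = term_means[j]
        outcomes = rng.choice([1.0, -1.0], size=s_j,
                              p=[(1 + mu) / 2, (1 - mu) / 2])
        # reweighting by c_j / p_j makes each single-shot term unbiased
        est += (c[j] / p[j]) * outcomes.sum() / s_tot
    return est
```

Taking expectations term by term, E[est] = Σj pj (cj/pj) ⟨hj⟩ = Σj cj ⟨hj⟩, so the estimator is unbiased even when some terms receive zero shots.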
While Rosalin was designed for chemistry problems, it was not tailored to QML, where the number of terms in H is not the only consideration. As discussed below, QML problems involve a (potentially large) dataset of input states. Each input state requires a separate quantum circuit, and hence we are back to the situation of having a hard floor on the shots required, due to these multiple input states. This ultimately provides the motivation for our work, which can be viewed as a generalization of Rosalin to the setting of QML.
3. Framework
3.1. Generic variational QML framework
Let us present our general framework for discussing (variational) QML methods; see figure 2 for an illustration. This framework is meant to unify multiple QML algorithms from the literature under one umbrella. We discuss how specific algorithms are special cases of this framework in section 3.2.
In a generic QML setting, one has a training dataset composed of quantum states:
where each ρi is a trace-one positive semi-definite matrix, i.e. a density matrix. Each of these training states may come with an associated probability, defining a probability distribution over the dataset:
The latter distribution is generally uniform but does not need to be (e.g. datasets with non-uniform probability distributions are used in [35]). In variational QML, one trains a parameterized quantum model, written as a map depending on some set of parameters θ. With a large degree of generality, we can assume that this model is a linear, completely positive map. In general, it could act on multiple copies (m copies) of the input Hilbert space. (Multiple copies allow for non-linear operations on the dataset, which can be important in certain classification tasks.) Hence we allow for the output of this action to be of the form:
where we employ a vectorized index notation for the tensor product of input states. Given that we are allowing for multiple copies of the input space, one can define an effective dataset Dm, composed of tensor products of m states from the original dataset, along with an effective probability distribution.
A QML loss function is then defined in an operational manner, so that it can be estimated on a quantum device. This involves the aforementioned mathematical objects as well as a measurement operator, or a set of measurement operators. We allow for the measurement operator to be tailored to the input state, and hence we index each measurement operator by the corresponding input state on m copies of the Hilbert space. Moreover, each measurement operator can be decomposed into a linear combination of Hermitian matrices that can be directly measured:
Generically, we could write the loss function as an average over training states chosen from the effective dataset Dm :
Here, f is an application-dependent function whose input is a measurable expectation value. Specifically, this expectation value is associated with the input state and has the form
There are many possible forms for the function f. However, there are multiple QML proposals in the literature that involve a simple linear form:
and in this case, we refer to the overall loss function as a 'linear loss function'. Alternatively, non-linear functions are also possible, and we also consider polynomial functions of the form:
where D is the degree of the polynomial. In this case, we refer to the loss as a 'polynomial loss function'.
3.2. Examples of QML loss functions
Now let us illustrate how various QML loss functions proposed in the literature fall under the previous framework. Crucially, as shown in figure 3, these loss functions can be used for a wide range of QML tasks.
3.2.1. Variational quantum error correction
Variational quantum algorithms to learn device-tailored quantum error correction codes were discussed in [4, 36]. The loss function involves evaluating the input–output fidelity of the code, averaged over a set of input states. The input state is fed into the encoder, then the noise channel acts, followed by the decoder. The concatenation of these three channels can be viewed as the overall channel
with parameter vector θ. Then the loss function is given by:
where the average is over some appropriately chosen set of states. It is clear that this loss function is of the form in (8), with f having the linear form in (10).
3.2.2. Quantum autoencoder
Inspired by the success of classical autoencoders, quantum autoencoders [30, 37] were proposed to compress quantum data by reducing the number of qubits needed to represent the dataset. Consider a bipartite quantum system AB composed of nA and nB qubits, respectively, and consider an ensemble of pure states on AB. The quantum autoencoder trains a gate sequence to compress this ensemble into the A subsystem, such that one can recover each state with high fidelity from the information in subsystem A. One can think of B as the 'trash' since it is discarded after the action of the trained gate sequence.
The original proposal [30] employed a loss function that quantified the overlap of the trash state with a fixed pure state:
Here, the two states appearing in the loss are the ensemble-average input state and the resulting ensemble-average trash state, and the measurement operator is the projector onto the fixed pure state.
Note that is a global measurement operator, meaning it acts non-trivially on all qubits, which can lead to barren plateaus in the training landscape [8]. To remedy this issue, [8] proposed a loss function with a local measurement operator acting non-trivially only on a small number of qubits:
where the sum runs over the qubits of B, and the identity factor acts on all qubits in B except the jth qubit.
It is clear from (16) and (18) that both loss functions fall under our framework. Namely, they have the form in (8), with f having the linear form in (10).
3.2.3. Dynamical simulation
Recently a QML-based algorithm was proposed for dynamical simulation [38]. Here the idea is to variationally compile the short-time dynamics into a time-dependent quantum neural network [39]. Then one uses the trained model to extrapolate to longer times. The training data for the compiling process can be taken to be product states, due to a generalization result from [40].
Consider the unitary associated with the short-time dynamics, together with a set of product states used for training, where n is the number of qubits, and let the quantum neural network to be trained act on these n qubits. Then the loss function is given by:
Here, the measurement operator is local: it acts non-trivially only on the jth qubit, and as the identity on the set of all qubits excluding the jth qubit.
Once again, this loss function clearly falls under our framework, having the form in (8) with f having the linear form in (10).
3.2.4. Fidelity for quantum neural networks
Dissipative perceptron-based quantum neural networks (DQNNs) were proposed in [29], and their trainability was analyzed in [11]. The loss function is based on the fidelity between the idealized output state and the actual output state of the DQNN. Specifically, we are given access to training data consisting of input–output pairs of pure states, and the DQNN is trained to output a state close to the ideal output state when fed the corresponding input state.
For this application, a global loss function was considered with the form
Here, the state appearing in the loss is the output state of the DQNN, and the measurement operator is the projector orthogonal to the ideal output state.
To avoid the issue of barren plateaus, a local loss function was also considered in [11]:
where the measurement operator is
This loss function is relevant whenever the ideal output states have a tensor-product form across the nout output qubits.
Clearly, these two loss functions fall under our framework, having the form in (8) with f having the linear form in (10).
3.2.5. Variational quantum state eigensolver
Near-term methods for quantum PCA have recently been proposed, including the variational quantum state eigensolver (VQSE) [32] and the variational quantum state diagonalization (VQSD) algorithm. Let us first discuss VQSE.
The goals of VQSE are to estimate the m largest eigenvalues of a density matrix ρ and to find quantum circuits that prepare the associated eigenvectors. When combined with a method to prepare the covariance matrix as a density matrix, VQSE can be used for PCA. Note that such a method was proposed in [41], where it was shown that an appropriate choice of input ensemble prepares the covariance matrix of a given quantum dataset.
The VQSE loss function can be written as an energy:
where the trainable object is a parameterized unitary that approximately diagonalizes ρ. Note that we inserted the ensemble decomposition of ρ in order to arrive at (24).
The measurement operator H is chosen to be non-degenerate over its m-lowest energy levels. For example, one can choose a global version of this operator:
or a local version of this operator:
where Zj is the Pauli-Z operator on the jth qubit, and the real coefficients rj are chosen appropriately to achieve non-degeneracy over the m lowest energy levels. Regardless, it is clear that the loss function in (24) falls under our framework, having the form in (8) with f having the linear form in (10).
3.2.6. VQSD
The goal of the VQSD algorithm [31, 42] is essentially the same as that of VQSE, i.e. to diagonalize a target quantum state ρ. Let us use:
to denote the state after the attempted diagonalization. Here we omit the θ dependency for simplicity of notation.
In contrast to VQSE, the VQSD loss function depends quadratically on the quantum state. Specifically, global and local cost functions have been proposed, respectively given by
Here, the two channels appearing above dephase their input (i.e. destroy the off-diagonal elements) in the global standard basis and in the local standard basis on qubit j, respectively.
We can rewrite the terms in the global loss using:
where SWAP denotes the swap operator, and where the additional unitary corresponds to the layers of CNOTs used in the so-called diagonalized inner product (DIP) test circuit [31]. Hence, we obtain:
Note that we inserted the ensemble decomposition of ρ in order to arrive at (33). Also note that the products of dataset probabilities define a probability distribution over pairs of indices. Hence it is clear that (33) falls under our framework, with m = 2 copies of the input Hilbert space and with f having the linear form in (10).
Similarly, for the local loss, we can write
where the relevant Hermitian operator is the one measured in the partial DIP test [31]. Once again, the local loss falls under our framework, with m = 2 copies of the input Hilbert space and with f having the linear form in (10).
3.2.7. Mean squared error for quantum classifiers
The mean squared error (MSE) loss function is widely employed in classical machine learning. Moreover, it has been used in the context of quantum neural networks, e.g. in [4]. Given a set of labels yi for a dataset, and a corresponding set of predictions from a machine learning model, the MSE loss is computed as:
In the case of a quantum model, there is freedom in specifying how to compute the prediction. Typically, it will be estimated via an expectation value of a measurement operator on the model's output state. In this case, the loss function in (36) is a quadratic function of that expectation value. Hence, this falls under our framework, with f having the polynomial form in (11) with degree D = 2.
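This degree-2 structure is what makes shot-frugal estimation subtle: squaring a finite-shot sample mean gives a biased estimate of the squared expectation value, since E[x̄²] = μ² + Var/shots. A toy numerical check of this bias (not from the paper), with ±1 outcomes standing in for measurements:

```python
import numpy as np

rng = np.random.default_rng(3)

# Squaring a finite-shot sample mean is a *biased* estimator of <H>^2.
# With +/-1 outcomes of mean mu, the bias term is (1 - mu^2)/shots.
mu, shots, trials = 0.5, 10, 200_000
outcomes = rng.choice([1.0, -1.0], size=(trials, shots),
                      p=[(1 + mu) / 2, (1 - mu) / 2])
sample_means = outcomes.mean(axis=1)
naive = (sample_means ** 2).mean()  # concentrates near 0.25 + 0.075 = 0.325, not 0.25
```

This is precisely the issue that the U-statistic constructions of section 4.2 resolve.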
3.3. Possible extensions to our framework
Up until now, we have considered loss functions that have a linear or polynomial dependence on measurable expectation values. As shown above, this form encompasses many loss functions used in the literature, and we will show that cost functions of this form enable cheap gradient estimation and therefore shot-frugal optimization.
However, there are also more complicated loss functions that have a less trivial dependence on measurable expectation values. One such loss function is the log-likelihood, which is commonly used in classification tasks [5, 43].
Let us assume one is provided with a dataset of pairs (ρi, yi), where the ρi are quantum states and the yi are discrete real-valued labels. We aim to train a parametrized model whose goal is to predict the correct labels. Here one cannot directly apply the techniques used previously, since the discrete outputs mean we cannot simply evaluate gradients. Therefore we cannot directly employ the results in [24, 34], and a different shot-frugal method must be used.
Given this dataset, the likelihood function is defined as
where it is assumed that the model's prediction depends only on the current input. From here, we can define the negative log-likelihood loss function as
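As a simple illustration of this loss (not of its shot-frugal estimation), here is a sketch computing the negative log-likelihood from a matrix of model label probabilities. The matrix form and the plain-sum normalization are assumptions for illustration; on hardware, the probabilities would be estimated from measurement frequencies.

```python
import numpy as np

def negative_log_likelihood(pred_probs, labels):
    """Negative log-likelihood of a classifier.
    pred_probs[i, k] is the model's probability of assigning label k to
    input state rho_i; labels[i] is the true (integer) label."""
    probs_of_true = np.asarray(pred_probs)[np.arange(len(labels)), labels]
    return -np.sum(np.log(probs_of_true))
```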
In appendix
4. Unbiased estimators for gradients of QML losses
Gradient-based optimizers are commonly used when optimizing QML models. In shot-frugal versions of these approaches, one of the key ingredients is the construction of an unbiased estimator of the gradient [24–26]. There are two types of loss functions we will consider in this work: those that have a linear dependence on expectation values, and those that have a non-linear dependence given by a polynomial function. We will see that for both types one can construct unbiased gradient estimators, and that one can define sampling strategies that massively reduce the number of shots needed to evaluate them.
Previous work considered how to construct unbiased estimators of QML gradients. However, the shot-frugal resource allocation strategies presented were sub-optimal [26] and left room for improvement. Furthermore, the more sophisticated shot allocation methods presented in [28] were purpose-built for VQE-type cost functions. Here we unify the approaches from these two works and show how one can employ more sophisticated shot allocation strategies in a general QML setting.
First, we consider the simpler case of linear loss functions, where we show that one can directly employ the parameter shift rule [45, 46] to construct an unbiased estimator for the gradient. Then we turn our attention to polynomial loss functions, where we present a general form for an unbiased estimator of the gradient. In both cases, we show that shots can be allocated according to the expansion coefficients in the expressions we derive. These in turn depend on the coefficients of the operators to be measured and on the set of quantum states in the QML dataset. Such shot-allocation schemes are an important ingredient in the design of our QML shot-frugal optimizer in the next section. We use the U-statistic formalism [47] to construct unbiased estimators for the loss function in each case. For an introduction to U-statistics, we refer the reader to appendix
4.1. Loss functions linear in the quantum circuit observables
4.1.1. Using the parameter shift rule
Loss functions that have a linear dependence on expectation values of observables are straightforward to consider. As shown in [26], the parameter shift rule can be used directly to construct unbiased estimators of the gradient of these loss functions. Previously, we wrote our general linear loss function as
Let us now consider the case where
By expanding each measurement operator as a linear combination of directly measurable terms, we can then write this loss function as follows
where the combined expansion coefficients absorb the dataset probabilities and the measurement-operator weights.
Consider the partial derivative of this loss function with respect to the parameter θx ,
Suppose now that the quantum channel is a unitary channel:
defined by a trainable quantum circuit whose parametrized gates are generated by Pauli operators. This assumption allows us to directly apply the parameter shift rule (see section 2). This leads to
Therefore, an unbiased estimator for the gradient can be obtained by combining two unbiased estimators for the loss function. Defining gx to be an unbiased estimator of the xth component of the gradient,
where the two terms on the right-hand side are unbiased estimators of the loss function at the shifted parameter values required by the parameter shift rule. This means that, for loss functions that are linear in the expectation values obtained from the quantum circuit, one can use unbiased estimators of the loss evaluated at shifted parameter values to obtain an unbiased estimator of the gradient.
We can therefore distribute the shots according to the magnitudes of the expansion coefficients when evaluating a single- or multi-shot estimate of a gradient component. This can be done by constructing a probability distribution whose probabilities are proportional to these magnitudes. Note that this strategy relies on the construction of two unbiased estimators of the loss function, which are then combined. In the next section, we show how to construct such estimators. In practice, we use the same total number of shots to evaluate both estimators, as they have the same expansion weights and are therefore equally important.
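A toy sketch of this construction for a single parameter, where ⟨Z⟩ = cos θ after an RY(θ) rotation plays the role of the loss, and finite-shot ±1 outcomes stand in for circuit measurements (the model and shot counts are illustrative assumptions):

```python
import numpy as np

def shot_expectation(theta, shots, rng):
    """Single-qubit toy model: <Z> after RY(theta)|0> equals cos(theta).
    We draw +/-1 outcomes with that mean to mimic finite-shot estimation."""
    p_plus = (1 + np.cos(theta)) / 2
    outcomes = rng.choice([1.0, -1.0], size=shots, p=[p_plus, 1 - p_plus])
    return outcomes.mean()

def param_shift_grad(theta, shots, rng):
    """Unbiased parameter-shift gradient:
    ( E(theta + pi/2) - E(theta - pi/2) ) / 2."""
    plus = shot_expectation(theta + np.pi / 2, shots, rng)
    minus = shot_expectation(theta - np.pi / 2, shots, rng)
    return (plus - minus) / 2
```

Since both shifted-circuit estimates are unbiased, their combination is an unbiased estimate of d⟨Z⟩/dθ = −sin θ, regardless of the (even single-shot) budget per circuit.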
4.1.2. Unbiased estimators for linear loss functions
Let stot represent the total number of shots. We introduce an estimator for each measurable expectation value, together with the resulting estimator for the loss. That is,
Here, the number of shots allocated to the measurement of a given term may itself be a random variable. As we work in terms of the total shot budget stot, we impose that the per-term allocations sum to stot. Each single-shot outcome is an independent random variable associated with one measurement of the corresponding term, and for each term these outcomes are identically distributed. Using these definitions, we show in the following proposition that this yields an unbiased estimator for the loss function.
Proposition 1. The estimator defined in equation (45) is an unbiased estimator for the loss function defined in equation (8).
Here it is useful to recall Wald's equation, which states that the expectation value of the sum of N real-valued, identically distributed random variables Xi can be expressed as
E[X1 + ⋯ + XN] = E[N] E[X1],
where N is a random variable that does not depend on the terms of the sum. In our case, each shot is indeed independent and sampled from the same distribution. Furthermore, the total number of shots does not depend on the sequence of single-shot measurements. Therefore,
Alternatively, one can see that the estimator defined in equation (45) has the form of a degree-1 U-statistic. From the above result, we arrive at the following corollary.
Corollary 1. The estimator defined in equation (44) is an unbiased estimator for the xth component of the gradient.
Proof. The proof follows by taking the expectation value of equation (44) and employing the result of proposition 1 together with the parameter shift rule.
We also obtained the variance of the estimator constructed above; its derivation can be found in appendix
Having now constructed an estimator for the loss function, we can, as outlined in the previous section, combine two single- or multi-shot estimates of this function evaluated at the required shifted parameter values to obtain an unbiased estimator of the gradient.
4.2. Loss functions with polynomial dependence on the quantum circuit observables
4.2.1. Constructing an unbiased estimator of the gradient
In the case of non-linear dependence on the observables produced by a quantum circuit, estimating the gradient is not as simple as applying the parameter shift rule. However, we can still derive estimators for the gradient when the non-linearity is described by a polynomial function. We begin with the general expression for the loss functions we consider in this work
Now constructing a polynomial loss function of degree D requires that
which leads to the following expression for the loss function,
where the combined coefficients are defined as before. Taking the derivative with respect to θx leads to
Using the multinomial theorem and given J Hamiltonian terms, we can expand the second sum in the above expression,
where the bj are non-negative integers. Therefore, we need to construct unbiased estimators of these product terms and combine them with the previously established gradient estimators based on the parameter shift rule. We then need to consider how to distribute shots among the terms of this estimator. Rewriting equation (53) leads to
Therefore we can distribute the shots according to the magnitudes of the expansion coefficients. Once again, we can construct an unbiased estimator with the normalized magnitudes of these coefficients defining the probabilities. We now explore how to construct an unbiased estimator for the product terms, which is essential for building an unbiased estimator of the gradients of these kinds of loss functions.
4.2.2. Constructing an unbiased estimator of polynomial terms
In order to construct an unbiased estimator for the gradient, we need an unbiased estimator for terms of the form
We can use the parameter shift rule for the derivative factor on the left-hand side. As for the product factor, since each j index corresponds to a different operator in the Hamiltonian, the factors are independent. Therefore, we need to construct estimators for powers of individual expectation values. This leads us to the following proposition.
Proposition 2. Consider the estimator defined as
where the summation runs over all distinct subscripts. This is an unbiased estimator for the corresponding power of the expectation value, provided the number of shots used is at least the degree bj of that power.
Proof. Equation (56) is inspired by the form of a U-statistic; see appendix
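A minimal sketch of such a U-statistic for the bth power of a mean, averaging products over all size-b subsets of distinct samples. This subset-averaging form is the standard U-statistic construction for this kernel and is assumed to match equation (56) in spirit (the symmetric kernel lets us sum over unordered subsets rather than ordered tuples):

```python
import numpy as np
from itertools import combinations
from math import comb

def ustat_power(samples, b):
    """Degree-b U-statistic: unbiased estimator of (E[X])^b from i.i.d.
    samples, averaging products over all size-b subsets of *distinct*
    samples (distinctness is what removes the bias of naive powering)."""
    s = len(samples)
    assert s >= b, "need at least b samples for an unbiased estimate"
    total = sum(np.prod(subset) for subset in combinations(samples, b))
    return total / comb(s, b)
```

Because the samples in each product are distinct and independent, each product has expectation μ^b, and so does their average; this is why at least bj shots are required for a degree-bj term.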
Bringing this all together we can formulate a proposition regarding an unbiased estimator for gradients of loss functions with polynomial dependence.
where
and the final sum runs over all distinct subscripts. This defines an unbiased estimator for the xth component of the gradient.
Proof. This can be seen to be true by taking the expectation values and invoking the propositions previously established.
These same techniques can be used to construct unbiased estimators of the loss function, which we expand on in appendix
4.2.3. Constructing an estimator for the gradient of the MSE loss function
To clarify the notation used above, let us focus on the special case of the MSE loss function. We consider a slightly more general form than the MSE cost introduced in equation (36): a set of labels yi for a dataset composed of tensor products of m states, together with the corresponding predictions of the quantum machine learning model. Therefore, we can write the loss as
Expanding the previous equation leads to
Evaluating the partial derivative with respect to θx gives,
We can distribute the total number of shots among these terms according to the relative magnitudes of their coefficients. This can be achieved by sampling from a multinomial probability distribution in which the normalized magnitudes of these coefficients define the probabilities.
It is important to note that some estimators have a different minimum number of shots than others. For example, the estimator for the last term in the above equation, involving both a gradient and a direct expectation value, requires a total of three shots spread over different circuit evaluations. The estimator for this term can be written as
Therefore, in this case, the minimum number of shots given to any term can be set to three, ensuring that every estimate of any term in equation (59) is unbiased. We expand on this consideration below.
4.3. Distributing the shots among estimator terms
As previously mentioned, the shots can be distributed according to a multinomial distribution with probabilities given by the normalized magnitudes of the constant factors that appear in the expressions for the above estimators. We note that the number of shots assigned to a given term may be zero. For the overall estimate to be unbiased with a given number of shots, the estimate of each individual term must also be unbiased. One needs at least two shots to produce an unbiased estimate of the gradient. For terms of the form one needs at least bj shots, corresponding to the degree of the U-statistic. Therefore, care must be taken when distributing shots to ensure that each measured term receives sufficiently many. One way to guarantee this is to distribute shots in multiples of the largest number of shots needed to evaluate any single term. Any leftover shots can then be distributed equally across the terms to be measured.
Each term itself may consist of a product of several expectation values, each with a different required minimum number of shots. The shot distribution within each term can then be chosen to respect this required minimum per factor. To make this concrete, consider a term of the form in equation (55). Estimating the gradient takes at least two shots, and estimating the product term takes at least shots. If we are given shots to estimate this term, we can assign to the first factor and to the product term, distributing any remaining shots equally among both.
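The multiples-of-minimum strategy described above can be sketched as follows (an illustrative helper, assuming a single minimum s_min shared by all terms; the paper's setting allows per-term minima):

```python
import numpy as np

def allocate_with_minimum(coeffs, total_shots, s_min, rng):
    """Allocate shots in multiples of s_min so that every sampled term
    receives at least s_min shots; leftover shots are spread equally
    over the terms that were sampled."""
    coeffs = np.asarray(coeffs, dtype=float)
    probs = np.abs(coeffs) / np.abs(coeffs).sum()
    # Sample whole blocks of s_min shots from the multinomial.
    blocks = rng.multinomial(total_shots // s_min, probs)
    shots = blocks * s_min
    leftover = total_shots - shots.sum()
    sampled = np.flatnonzero(shots)   # terms actually being measured
    if sampled.size > 0:
        shots[sampled] += leftover // sampled.size
        shots[sampled[: leftover % sampled.size]] += 1
    return shots
```

By construction, any term that is measured at all receives at least s_min shots, so each per-term estimate remains unbiased.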
5. The Refoqus optimizer
Now that we have defined unbiased estimators of the gradient, we have the tools needed to present our new optimizer. We call our optimizer Refoqus, which stands for REsource Frugal Optimizer for QUantum Stochastic gradient descent. This optimizer is tailored to QML tasks. QML tasks have the potential to incur large shot overheads because of large training datasets as well as measurement operators composed of a large number of terms. With Refoqus, we achieve shot frugality first by leveraging the gCANS [25] rule for allocating shots at each iteration step and second by allocating these shots to individual terms in the loss function via random sampling over the training dataset and over measurement operators. This random sampling allows us to remove the shot floor imposed by deterministic strategies (while still achieving an unbiased estimator for the gradient), hence unlocking the full shot frugality of our optimizer.
The key insight behind the Refoqus protocol is that loss functions with a linear or polynomial dependence on measurable Hermitian operators, as well as their gradients, can always be written as a sum of measurable quantities with expansion coefficients. These coefficients can in turn define a multinomial probability distribution that guides the allocation of shots to each term when evaluating the loss function or its gradient.
We outline the Refoqus optimizer in algorithm 1.
Algorithm 1. The optimization loop used in Refoqus. The function evaluates the gradient at θ, returning a vector of gradient estimates g and their variances S. The vectors are calculated using shots to estimate the xth component of the gradient and its variance, where s0 is the minimum number of shots required to obtain an unbiased estimator. Note that the shots for each component estimation are distributed in multiples of s0, according to a multinomial distribution determined by the expansion coefficients that define the gradient estimator. These expansion coefficients are defined by the function , which returns the probability of measuring the term corresponding to the coefficients , . For the case of linear loss functions .
Input: Learning rate α, starting point , min number of shots per estimation , number of shots that can be used in total , Lipschitz constant L, running average constant µ, a vector of the least number of shots needed for each gradient estimate s0, which is loss function dependent, and , which is also loss function dependent.
1: initialize: , , , , , , , , ,
2: while do
3:
4:
5:
6:
7:
8:
9:
10: for do
11:
12: end for
13:
14: end while
The hyperparameters that we employ are similar to those of Rosalin [28] and come with similar recommendations on how to set them. For instance, the Lipschitz constant L bounds the largest possible value of the derivative. Hence, it depends on the loss expression but can be set as according to [24]. Moreover, a learning rate satisfying can be used.
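To make the structure of algorithm 1 concrete, the following is a schematic Python sketch of a gCANS-style shot-adaptive gradient descent loop. The shot rule and bias-corrected moving averages here are an assumed form following [25], and the gradient oracle `toy_grad` is a hypothetical placeholder standing in for the sampling-based quantum gradient estimator of section 4:

```python
import numpy as np

def refoqus_sketch(grad_estimator, theta0, alpha, lipschitz, mu_avg,
                   s0, shot_budget, rng):
    """Schematic gCANS-style shot-adaptive gradient descent
    (assumed form of the shot rule, following [25])."""
    theta = np.asarray(theta0, dtype=float)
    d = theta.size
    chi = np.zeros(d)          # running average of gradient components
    xi = np.zeros(d)           # running average of per-shot variances
    shots = np.full(d, s0)     # shots per gradient component
    spent, t = 0, 0
    while spent + shots.sum() <= shot_budget:
        g, var = grad_estimator(theta, shots, rng)
        spent += int(shots.sum())
        # Bias-corrected exponential moving averages of gradient/variance.
        chi = mu_avg * chi + (1 - mu_avg) * g
        xi = mu_avg * xi + (1 - mu_avg) * var
        chi_hat = chi / (1 - mu_avg ** (t + 1))
        xi_hat = xi / (1 - mu_avg ** (t + 1))
        theta = theta - alpha * g          # gradient-descent update
        # gCANS-style global shot rule (assumed form); requires alpha < 2/L.
        sigma = np.sqrt(np.maximum(xi_hat, 0.0))
        denom = max(float(chi_hat @ chi_hat), 1e-12)
        s_new = (2 * lipschitz * alpha / (2 - lipschitz * alpha)
                 * sigma * sigma.sum() / denom)
        shots = np.maximum(np.ceil(s_new), s0).astype(np.int64)
        t += 1
    return theta, spent

def toy_grad(theta, shots, rng):
    """Hypothetical noisy gradient oracle for f(theta) = ||theta||^2:
    each single-shot sample of component x has mean 2*theta[x], variance 1."""
    g = np.array([rng.normal(2 * th, 1.0, size=int(s)).mean()
                  for th, s in zip(theta, shots)])
    return g, np.ones_like(theta)
```

In the actual optimizer, the per-component shots would additionally be distributed across loss terms in multiples of s0 via the multinomial sampling described above.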
6. Convergence guarantees
The framework for Refoqus leverages the structure and the update rule of the gCANS optimizer presented in [25]. Therefore, we can apply the same arguments introduced to show geometric convergence. We repeat the arguments and assumptions needed for this convergence result here for convenience.
Proposition 4. Provided the loss function satisfies the assumptions stated below the stochastic gradient descent routine used in Refoqus achieves geometric convergence to the optimal value of the cost function. That is,
where t labels the iteration, is the optimal value of the loss function, and .
The update rule used in this work was introduced in [25], and the convergence proof presented there applies directly. This update rule guarantees fast convergence in expectation to the optimal value for sufficiently smooth loss functions. The underlying assumption of this result is that the loss function is strongly convex and has Lipschitz-continuous gradients. In most realistic QML scenarios the optimization landscape is not convex; however, if the optimizer settles into a convex region, then fast convergence is expected.
The exact assumptions needed in order to ensure the geometric convergence of Refoqus to the global minima are as follows:
- 1..
- 2.Var, where Xx is the sampling-based estimator of .
- 3.is µ-strongly convex.
- 4.has L-Lipschitz continuous gradient.
- 5.α is a constant learning rate satisfying .
- 6.An ideal version of the update rule holds, that is:
In the previous sections, we have shown how to construct unbiased estimators for the gradient, satisfying the first assumption. The second assumption is satisfied as the estimate of the gradient is constructed by calculating the mean of several sampling-based estimates. Assumption 5 is satisfied by construction. Assumption 6 is an idealized version of the update rule used in Refoqus. However, the gradient magnitude and estimator variances cannot be exactly known in general, so these quantities are replaced by exponentially moving averages to predict the values of σx and . Assumption 4 is also satisfied for all the loss functions we consider in this work. Finally, as previously noted, assumption 3 is not expected to hold in QML landscapes, which are non-convex in general.
7. Numerical results
7.1. Quantum PCA
We benchmark the performance of several optimizers when using VQSE [32] to perform quantum PCA [48] on molecular ground states. We perform quantum PCA on 3 quantum datasets: ground states of the H2 molecule in the sto-3g basis (4 qubits), H2 in the 6-31g basis (8 qubits) and in the sto-3g basis (14 qubits). We use 101 circuits per dataset. The corresponding covariance matrix can be expressed as . We optimize using the local cost introduced in equation (26) and repeated here for convenience:
taking coefficients and following the presentation in [32]. Hence the latter coefficients form the terms used for shot-allocation strategies, as outlined below.
When implementing Rosalin and Refoqus we use the WRS strategy for allocating shots as it demonstrated superior performance against other strategies in the numerical results of [28]. Hence, the coefficients that appear in the loss function (and the gradient estimators) are used to construct a multinomial probability distribution defined by the terms , where . This distribution is used to probabilistically allocate the number of shots given to each term for each gradient estimation.
We use a circuit consisting of two layers of a hardware-efficient ansatz with depth 4, depicted in figure 4, with 20, 40, and 70 parameters for the 4, 8 and 14 qubit problems respectively. These parameters are optimized in order to minimize the VQSE cost function using the Adam, Rosalin, and Refoqus optimizers 7 . We use the absolute error in estimating the eigenvalues of the system to benchmark the overall performance of the output of the optimized circuit, as this is the desired output of the algorithm. Given the exact 16 highest eigenvalues λi , and their estimation given current parameters θ, the latter is computed as .
Figure 5 shows the results obtained when running each algorithm 20 times with different random initializations of the variational ansatz, up to a total shot budget of 10^8. In general, Refoqus is the best-performing optimizer, achieving the best-case accuracy while using fewer shots at every shot budget. In summary, for the 4 qubit system, a median eigenvalue error of , and was obtained using Adam, Rosalin and Refoqus respectively. Although Adam found a better minimal error value than Rosalin ( and respectively), Rosalin is more advantageous at lower shot budgets. Refoqus, however, achieved a minimal error value of and demonstrates a clear advantage over both Adam and Rosalin. On both the 8 and 14 qubit systems we observe a similar trend, although eigenvalue errors are worse overall. This is because the variational ansatz is kept at a fixed depth while the problem size increases, which degrades performance. Nonetheless, Refoqus clearly matches or outperforms both Adam and Rosalin in the number of shots required to reach a given accuracy.
Additionally, Refoqus requires fewer parameter updates (iterations) than Rosalin. Indeed, for the 4, 8 and 14 qubit problems respectively, a median of , and 80 iterations were used by Refoqus against 1217, 618, and 353 for Rosalin. This is another desirable feature of the optimizer, as it minimizes the number of iterations needed in the quantum–classical feedback loop to arrive at a solution of the same (or better) quality. Although the number of iterations is not a bottleneck in and of itself, on current hardware the quantum–classical feedback can prove restrictive, meaning fewer iterations are favorable for real-device implementation.
Finally, we note that the variance of the Refoqus results appears to be larger, suggesting that sampling over more terms can lead to a larger variety in the quality of the optimization obtained. Nevertheless, the advantage in shot frugality that arises from sampling over more terms is clear and despite this larger variance, Refoqus is clearly the best-performing optimizer.
7.2. Quantum autoencoder
In addition to the quantum PCA results, we also performed a quantum autoencoder task on the same molecular ground states. We optimize using the local cost introduced in equation (17), where half of the qubits are 'trashed'. For the 4 qubit system, the ansatz is inspired by [33]; its implementation is available in PennyLane under the name 'StronglyEntanglingLayers' [49], and we use three layers. For the other systems, we use the same ansatz as for the quantum PCA task, also with three layers, as we obtained better cost values. Figure 6 shows the results obtained when running each algorithm ten times with different random initializations of the variational ansatz, up to a total shot budget of 10^6 shots. We do not include Adam here, as only two iterations would be possible within this budget. In summary, for the 4 qubit system, a median eigenvalue error of was obtained in a median of 1468 iterations using Rosalin, and in 208 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. On the 8 qubit system, a median eigenvalue error of was obtained in a median of 342 iterations using Rosalin, and in 65 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. Finally, on the 14 qubit system, a median eigenvalue error of was obtained in a median of 89 iterations using Rosalin, and in 24 iterations with Refoqus. The latter achieved a minimal cost value of , versus for Rosalin. Hence, we again witness that Refoqus outperforms Rosalin.
8. Discussion
QML algorithms present a different paradigm for data processing, one particularly well suited to quantum data. VQAs are a key contender for delivering a near-term useful quantum advantage. However, both VQAs and QML models require long run times and large resource overheads during training (requiring many iterations and shots to achieve respectable performance). To address these challenges, we propose Refoqus as a shot-frugal gradient-based optimizer based on state and operator sampling. We outline many cost functions that are of interest to the community and are easily captured within our framework for Refoqus. Our new optimizer leverages the loss function of the problem to allocate shots randomly when evaluating gradients. This randomness allows us to make resource-cheap gradient update steps, unlocking shot frugality. We have shown that Refoqus comes with geometric convergence guarantees under specific assumptions, elaborated in section 6. Additionally, when applying our optimizer to QML tasks, namely a quantum PCA task and a quantum autoencoder task, we obtain significantly better performance in terms of the number of shots and the number of iterations needed to reach a given accuracy. Although we only applied Refoqus to these two tasks, we expect similar results to hold for other problems fitting our framework. Indeed, other cost functions follow a similar pattern, and once there are many weighted terms, the sampling approach is expected to yield similar results.
A potential future research direction is to extend our analysis to more complicated, non-linear loss functions such as the log-likelihood, and to explore in more detail how to introduce shot frugality into gradient-free optimizers [50–52]. Furthermore, applying our optimizer to a problem of interest on a real device is an interesting and logical next step. However, the quality of results can be significantly impacted in noisy settings [20]. To address the latter problem, our optimizer can be combined with several noise-handling procedures when used on a real device [53–60]. It would also be interesting to introduce different techniques to reduce the number of shots spent, such as Bayesian optimization [27]. Finally, other empirical studies related to algorithm selection and configuration [51, 55, 61–65] can be applied to Refoqus on different datasets [63, 66, 67].
Acknowledgments
This work was supported by the U.S. Department of Energy (DOE) through a quantum computing program sponsored by the Los Alamos National Laboratory (LANL) Information Science & Technology Institute. M H G was partially supported by the Laboratory Directed Research and Development (LDRD) program under Project Number 20210116DR. M C acknowledges support from the LDRD program of LANL under Project Number 20230049DR. L C was partially supported by the U.S. DOE, Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program. P J C acknowledges support from the LANL ASC Beyond Moore's Law project.
Data availability statement
The data cannot be made publicly available upon publication because they are not available in a format that is sufficiently accessible or reusable by other researchers. The data that support the findings of this study are available upon reasonable request from the authors.
Appendix A: Constructing unbiased estimators for the log-likelihood
Here we review a conceptually different approach for a shot-frugal optimization of a problem utilizing a log-likelihood loss function. In particular, we will follow the results in [44]. First, we assume that one is given a dataset of the form where ρi are quantum states in some domain , and yi are discrete real-valued labels from some domain (e.g. in binary classification ). Then, we assume that one has a trainable parametrized model whose goal is to predict the correct labels. Note that here one cannot directly apply the techniques previously used, as discrete outputs preclude gradient calculations and the use of the results in [24, 34]. Hence, a different shot-frugal method must be employed.
Given and , we can define the likelihood function
where we have defined and assumed that the model's predictions depend only on the current input. From here, we can define the negative log-likelihood loss function as
Interestingly, we can reduce the task of estimating the negative log-likelihood loss function to a Bernoulli sampling problem. Namely, we can assume without loss of generality that the model specifies a stochastic function g that takes as input ρi and the set of parameters θ, and outputs a prediction yi . That is, . Then, we remark that to estimate the loss function we do not need to know the explicit form of g; one only needs to know the probability for a random sample y from the model to match the true label yi . Thus, we can convert each output from the model into a variable
By construction, x follows a Bernoulli distribution with probability pi . Thus, we can estimate the log-likelihood loss function by drawing samples from said Bernoulli distribution. In most common cases these samples can be thought of as single-shot estimates of the model's output, and our goal is to minimize this number of samples/shots.
In [44] the authors study the task of constructing an unbiased estimator for the negative log-likelihood loss function of equation (A2), where they consider a fixed number of shots strategy (one uses s fixed shots) and the inverse binomial sampling (IBS) strategy (where one keeps drawing samples until ). One can show that the fixed number of shots strategy always leads to a biased estimator of (for the special case when one estimates , the estimator can even be infinitely biased if xk is always zero!). On the other hand, using IBS one can define the estimator
where K is the number of shots needed to obtain a result . Notably, it can be shown that , meaning that IBS provides an unbiased estimator for the negative log-likelihood loss function (see [44]). The number of shots K that IBS takes for a trial with probability pi is geometrically distributed with mean . For instance, if , IBS can require only 1 or 2 shots, while it will assign more shots to cases of small pi . We also remark that in [44] the authors derive the estimator variance and show that it is minimal and non-diverging if . This method then provides a shot-frugal, minimum-variance, unbiased estimator for the negative log-likelihood loss function, which can be combined with gradient-free optimizers to train the parameters in the model. We further refer the reader to [44] for a discussion on how to extend these results to models with continuous outputs.
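For concreteness, the IBS scheme can be sketched as follows (a minimal illustration, assuming the estimator returns the harmonic number H_{K-1}, which is unbiased for -log p; see [44] for the exact construction and variance analysis):

```python
import numpy as np

def ibs_neg_log_estimate(p, rng):
    """Inverse binomial sampling: draw Bernoulli(p) trials until the
    first success, at trial K, then return H_{K-1} = sum_{k=1}^{K-1} 1/k,
    an unbiased estimate of -log(p)."""
    k = 1
    while rng.random() >= p:   # failure occurs with probability 1 - p
        k += 1
    return sum(1.0 / j for j in range(1, k))
```

Averaging many such estimates recovers -log p, while a fixed-shot estimate of -log p is biased for any finite number of shots.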
Appendix B: Using U-statistics to construct unbiased estimators
The theory of U-statistics was initially introduced by Hoeffding in the late 1940s [47] and has a wide range of applications. U-statistics presents a methodology on how to use observations of estimable parameters to construct minimum variance unbiased estimates of more complex functions of the estimable parameters. In the cases we are interested in here, the estimable parameter is the expectation value of an observable used when constructing the cost function.
First, let be a family of probability measures on an arbitrary measurable space. A probability measure is defined as the probability of some particular event occurring. Let be a real-valued function defined for . We define to be an estimable parameter within if there exists some integer r for which an unbiased estimator of exists constructed from r independent identically distributed (i.i.d) random variables according to P. In other words, there is a measurable, real-valued function such that
where are i.i.d. with distribution P. The degree of is defined to be the smallest integer r with this property.
The function is also referred to as the kernel. This function can be symmetrized by averaging h over all possible permutations of the random variables . That is, the symmetrized version of the kernel can be defined as follows
where the summation is over Sr , the group of all permutations of a set of r indexes. With these definitions in hand, we can construct a U-statistic.
Assume we are given s > r samples from ; the U-statistic with kernel h is then defined as
where the summation is over the set of combinations of r integers chosen from .
The U-statistic defined above is simply an average of unbiased estimates for and is therefore itself an unbiased estimator for . Indeed, it can be proved that the U-statistic is the best unbiased estimator for , in that it has the smallest variance.
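The definition above translates directly into code (a small illustrative helper, not tied to any particular library):

```python
from itertools import combinations
from math import comb

def u_statistic(kernel, samples, r):
    """Average a symmetric degree-r kernel over all size-r subsets of the
    samples -- the U-statistic with kernel h."""
    total = sum(kernel(*subset) for subset in combinations(samples, r))
    return total / comb(len(samples), r)
```

For instance, with the degree-2 kernel h(x1, x2) = (x1 - x2)^2 / 2 this reproduces the usual (unbiased) sample variance, as shown in the examples below.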
B.1. Examples of U-statistics
Now that we have introduced U-statistics we can build intuition with some simple examples. Firstly, let us consider how to construct an unbiased estimate for the mean µ. Recall we first need to construct a kernel function h. In this case, the kernel will have degree 1 and can simply be defined as . Note this has the required property that . Furthermore, in this case , i.e. the kernel is already symmetric. Therefore, the U-statistic for µ constructed from s samples can be written as,
which is the well-known sample-average formula used to estimate the mean.
Similarly, we can consider how to construct a U-statistic for the variance. Consider the kernel . This kernel has degree 2. Note once again, this has the required property that , which is the definition of the variance. Then this needs to be symmetrized, . Therefore, the U-statistic for the variance constructed with the kernel h using s samples is,
which corresponds to the sample variance.
As a final example, let us consider constructing an estimator of the function . Let us define the kernel as , which clearly fulfills the properties we require. This kernel is already symmetric, such that . Therefore, the U-statistic can be defined as
This is exactly the form of estimators that we require in the main text. For our purposes, the number of samples s is a random variable that is constructed probabilistically.
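As an illustration of this final example, the degree-2 product kernel yields an unbiased estimator of the squared mean, which admits a closed form that avoids enumerating all pairs (a small sketch; the symbols are illustrative):

```python
import numpy as np
from itertools import combinations

def mu_squared_ustat(x):
    """Degree-2 U-statistic with kernel h(x1, x2) = x1 * x2,
    an unbiased estimator of (E[X])^2."""
    x = np.asarray(x, dtype=float)
    s = x.size
    # Average of x_i * x_j over all i != j, in closed form.
    return (x.sum() ** 2 - (x ** 2).sum()) / (s * (s - 1))
```

Note that the naive estimator mean(x)**2 is biased upward by Var[X]/s, which is precisely the bias the U-statistic removes.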
The framework of U-statistics is perfectly suited to our purposes. Indeed, it provides a recipe to construct unbiased estimators for polynomial functions of variables with some probability distribution. Furthermore, these estimators have the smallest possible variance. Finally, the framework shows the minimum number of shots necessary to estimate a given term through the degree of the kernel.
Appendix C: Variance of estimator for linear loss functions
In section 4.1.2, we constructed an unbiased estimator for the loss function. Denoting the latter , we proved in proposition 1 that . We also obtained the variance of such an estimator, which can be used to compare different shot-allocation strategies, similar to [28] for the Rosalin optimizer designed for VQE. Moreover, elements of this variance calculation are reused in appendix E to construct an unbiased estimator for the MSE loss function.
Proposition 5. The variance of is
where and the covariance is defined as
Proof. Using the definition of the variance and recalling that , we can write
We can then decompose the first term into the contributions where the indices in the sum differ and where they coincide
where
Note that since is an unbiased estimator of , by application of Wald's equation we obtain
For the case , care needs to be taken when expanding the sum as the terms are not independent when . Therefore, we decompose the sum into the cases where and ,
Once again, using Wald's equation, but taking into account that the number of entries for the is , one can show
Finally, using the fact that is an unbiased estimator of , we obtain:
Bringing all the terms together and writing the expressions containing E1, E2 and E3 in a single sum, we obtain
Expanding the second term and factorizing leads to,
which we can then rewrite to prove the desired result:
Appendix D: Constructing unbiased estimators of loss functions with polynomial dependence
We consider here how to construct estimators for loss functions that are an arbitrary polynomial of degree D in the observables, expressed as:
where . Using the multinomial theorem, given J Hamiltonian terms, we can expand the power of the sum into a sum of terms:
where and bj are non-negative integers. First, one can generalize the formula for an unbiased estimator of :
The case was elaborated in equation (E11).
For , we define an unbiased estimator as:
This involves estimating terms to construct the following estimator for the cost:
from which we establish the following proposition.
Proposition 6. is an unbiased estimator for .
Proof. Taking the expectation over the above definitions:
Using Wald's equation again, we obtain:
which, when substituted into the previous equation, yields .
Appendix E: Constructing an unbiased estimator for the MSE loss function
The MSE is widely used in machine learning. Here, we show how to construct an unbiased estimator for the MSE as an illustrative example of the polynomial functions of the observables studied in appendix D. We start by considering a simplified case for the MSE, where we do not decompose each measurement operator Hi into a linear combination of Hermitian matrices. We then tackle the more complex case where Hi is of the form presented in equation (7).
E.1. Step 1: simplified MSE
Let be a dataset of N points, where the task consists of minimizing the following cost:
This corresponds to a weighted MSE cost. The most common case is when . We will take a step-by-step approach to constructing unbiased estimators for this cost.
If we expand the brackets, we obtain,
We denote as the estimator for , as the estimator for .
where is an unbiased estimator, as follows directly from Wald's equation; is also an unbiased estimator for , following the reasoning used in equation (C13) in the proof of proposition 5. Then, defining as
we establish the following proposition.
Proposition 7. is an unbiased estimator for .
Proof. Using the definitions in equations (E2) and (E4) we can write
Using Wald's equation we can evaluate this expression to give
Finally, we can evaluate these terms and factorize, which leads to,
This means that we can estimate this simplified MSE loss function by evaluating the first term exactly, sampling from the second term according to the probability distribution and sampling the third term according to the distribution .
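For a single dataset point, the key ingredient is an unbiased estimator of the squared residual, which replaces the biased mean(x)**2 with the degree-2 U-statistic for the squared mean. A minimal sketch (illustrative names; x is the vector of single-shot outcomes and y the label):

```python
import numpy as np

def unbiased_sq_residual(x, y):
    """Unbiased estimator of (E[X] - y)^2 from s >= 2 shots x, using the
    degree-2 U-statistic for (E[X])^2 in place of the biased mean(x)**2."""
    x = np.asarray(x, dtype=float)
    s = x.size
    mu2 = (x.sum() ** 2 - (x ** 2).sum()) / (s * (s - 1))  # unbiased (E[X])^2
    return mu2 - 2.0 * y * x.mean() + y ** 2
```

The naive estimator (x.mean() - y)**2 overestimates the true squared residual by Var[X]/s, which can dominate when few shots are used per point.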
E.2. Step 2: full MSE
Now, we assume the Hamiltonians can be decomposed into many terms such that . Hence the new full MSE cost can be expressed as
Defining as the estimator for as previously,
When we define,
and finally,
Collecting these terms we define an estimator of the full MSE loss function as
which can be used to establish the following proposition.
Proposition 8. is an unbiased estimator for .
Proof. Taking the above definition we can write the following result:
Given that
We can combine the above expressions to obtain
This shows we can estimate the full MSE cost function by evaluating four terms. One can be evaluated exactly and the others can be estimated by sampling from their corresponding probability distributions built with the weights as parameters.
Footnotes
- 7
For Adam, we perform 100 shots per circuit in our simulations. For Rosalin and Refoqus, we use a minimum of 2 shots per circuit (this leads to and for Rosalin and Refoqus respectively) in conjunction with WRS.