Higher order derivatives of quantum neural networks with barren plateaus

M Cerezo and Patrick J Coles

Published 7 June 2021 © 2021 IOP Publishing Ltd
Citation: M Cerezo and Patrick J Coles 2021 Quantum Sci. Technol. 6 035006. DOI: 10.1088/2058-9565/abf51a

Abstract

Quantum neural networks (QNNs) offer a powerful paradigm for programming near-term quantum computers and have the potential to speed up applications ranging from data science to chemistry to materials science. However, a possible obstacle to realizing that speed-up is the barren plateau (BP) phenomenon, whereby the gradient vanishes exponentially in the system size n for certain QNN architectures. The question of whether high-order derivative information such as the Hessian could help escape a BP was recently posed in the literature. Here we show that the elements of the Hessian are exponentially suppressed in a BP, so estimating the Hessian in this situation would require a precision that scales exponentially with n. Hence, Hessian-based approaches do not circumvent the exponential scaling associated with BPs. We also show the exponential suppression of higher order derivatives. Hence, BPs will impact optimization strategies that go beyond (first-order) gradient descent. In deriving our results, we prove novel, general formulas that can be used to analytically evaluate any high-order partial derivative on quantum hardware. These formulas will likely have independent interest and use for training QNNs (outside of the context of BPs).

1. Introduction

Standard quantum algorithms were not developed to handle the constraints imposed by current quantum computers. Such noisy intermediate-scale quantum (NISQ) devices have limited connectivity, limited qubit count, and noise levels that limit circuit depth.

On the other hand, training parameterized quantum circuits provides a promising approach for quantum computing in the NISQ era, as this approach adapts to the imposed constraints. Here, one utilizes a quantum computer to efficiently evaluate a cost (or loss) function $C\left(\boldsymbol{\theta }\right)$ or its gradient $\nabla C\left(\boldsymbol{\theta }\right)$, while employing a classical optimizer to train the parameters θ of a parameterized quantum circuit $V\left(\boldsymbol{\theta }\right)$. This strategy is employed in two closely related paradigms: variational quantum algorithms (VQAs) [1] for chemistry, optimization, and other applications [2–16], and quantum neural networks (QNNs) for classification applications [17–21]. QNNs can be viewed as a generalization of VQAs to the case of multiple input states, and thus we will henceforth use the term QNNs to encompass all methods that train parameterized quantum circuits.

While many novel QNNs have been developed, more rigorous scaling analysis is needed for these architectures. One of the few known results is the so-called barren plateau (BP) phenomenon [22–31], where the cost function gradient vanishes exponentially with the system size. This can arise due to deep unstructured ansatzes [22, 24, 30], global cost functions [23, 24], noise [25], or an excess of entanglement [27, 31]. Regardless of the origin, when a cost landscape exhibits a BP, one requires exponential precision to determine a cost-minimizing direction and navigate the landscape. Since the standard goal of quantum algorithms is polynomial scaling with the system size (in contrast to the exponential scaling of classical algorithms), the exponential scaling due to BPs can destroy a quantum speedup. Hence, the study and analysis of BPs should be viewed as a fundamental step in the development of QNNs to guarantee that they can, in fact, provide a speedup over classical algorithms.

Recently there have been multiple strategies proposed for avoiding BPs such as employing local cost functions [23, 28], pre-training [32], parameter correlation [33], layer-by-layer training [34], initializing layers to the identity [35], and employing problem-inspired ansatzes [36, 37]. These strategies are aimed at either avoiding or preventing the existence of a BP, and they appear to be promising, with more research needed on their efficacy on general classes of problems. In a recent article [38], an alternative idea was proposed involving a method for actually training inside and escaping a BP. Specifically, the proposal was to compute the Hessian H of the cost function, and the claim was that taking a learning rate proportional to the inverse of the largest eigenvalue of the Hessian leads to an optimization method that could escape the BP.

The question of whether higher-order derivative information (beyond the first-order gradient) is useful for escaping a BP is interesting and is the subject of our work here. Our main results are presented here in the form of two propositions and corollaries. First, we show that the matrix elements Hij of the Hessian are exponentially vanishing when the cost exhibits a BP. This implies that the calculation of Hij requires exponential precision. In our second result we show that the magnitude of any higher-order partial derivative of the cost will also be exponentially small in a BP. Our results suggest that optimization methods that use higher-order derivative information, such as the Hessian, will also face exponential scaling, and hence do not circumvent the scaling issues arising from BPs.

As a byproduct of our work, we derive novel formulas for higher-order partial derivatives of the cost function, which can be used to efficiently evaluate these derivatives on quantum hardware. These formulas are obtained via the so-called Pascal tree, which we also introduce here. Due to their generality, these formulas can be generically used for training parameterized quantum circuits. Higher-order derivative information can be useful for various applications such as chemistry [39] and solving partial differential equations [14], as well as characterizing landscapes [38] and improving optimization methods (e.g., Newton's method [40, 41]).

2. Preliminaries

To set the stage for our results, we first give some background on the cost function, the parameter shift rule, and BPs.

2.1. Cost function

In what follows, we consider the case when the cost can be expressed as a sum of expectation values:

$C(\boldsymbol{\theta}) = \sum_{x=1}^{N} \mathrm{Tr}\left[O_x\, V(\boldsymbol{\theta})\, \rho_x\, V^{\dagger}(\boldsymbol{\theta})\right] \qquad (1)$

where {ρx } is a set (of size N) of input states to the parameterized circuit V( θ ). In order for this cost to be efficiently computable, the number of states in the input set should grow at most polynomially with the number of qubits n, that is, $N\in \mathcal{O}\left(\mathrm{poly}\left(n\right)\right)$. In the context of QNNs, the states {ρx } can be viewed as training data points, and hence (1) is a natural cost function for QNNs. In the context of VQAs, one typically chooses N = 1, corresponding to a single input state. In this sense, the cost function in (1) is general enough to be relevant to both QNNs and VQAs.
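To make (1) concrete, the following is a minimal NumPy sketch of such a cost for a single qubit; the RY-product ansatz, the input states, and the observables $O_x = Z$ are illustrative choices of ours, not a construction from the paper.

```python
import numpy as np

# A minimal sketch of the cost in equation (1) for a single qubit. The ansatz
# (a product of RY rotations), the input states, and the observables O_x = Z
# are illustrative choices, not taken from the paper.

Z = np.diag([1.0, -1.0])

def ry(theta):
    """Single-qubit rotation e^{-i theta Y / 2} as a real 2x2 matrix."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cost(thetas, rhos, ops):
    """C(theta) = sum_x Tr[O_x V(theta) rho_x V(theta)^dag]."""
    V = np.eye(2)
    for t in thetas:
        V = ry(t) @ V
    return sum(np.trace(O @ V @ rho @ V.conj().T).real for rho, O in zip(rhos, ops))

# Two "training" input states: |0><0| and |+><+|.
rhos = [np.diag([1.0, 0.0]), np.full((2, 2), 0.5)]
ops = [Z, Z]
print(cost(np.array([0.3, -0.7]), rhos, ops))  # cos(-0.4) - sin(-0.4) here
```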

2.2. Parameter shift rule

Let θi be an angle that parameterizes a unitary in V( θ ) as ${\text{e}}^{-\text{i}{\theta }_{i}{\sigma }_{i}/2}$, with σi a Hermitian operator with eigenvalues ±1. Then, the partial derivative $\frac{\partial C\left(\boldsymbol{\theta }\right)}{\partial {\theta }_{i}}={\partial }_{i}C\left(\boldsymbol{\theta }\right)$ can be computed via the parameter shift rule [42, 43] as

$\partial_i C(\boldsymbol{\theta}) = \frac{1}{2}\left[C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{\left(\frac{1}{2}\right)}\right) - C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{\left(-\frac{1}{2}\right)}\right)\right] \qquad (2)$

Here, $\bar{i}$ denotes the indices distinct from i (i.e., the indices of parameters that are not being differentiated), and we define the notation

$\theta_i^{(\omega)} = \theta_i + \omega\pi \qquad (3)$

Note that the parameter shift rule in (2) expresses the first-order partial derivative exactly, as a difference of cost function values evaluated at two shifted points. In particular, equation (2) is not a finite difference formula; it evaluates the partial derivative exactly.
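The following sketch (using the same illustrative single-qubit cost as above, our toy setup rather than the paper's) compares (2) against a central finite difference; since the shift rule is exact, the two agree to numerical precision.

```python
import numpy as np

# Numerical check of the parameter shift rule (2) against a finite difference.

Z = np.diag([1.0, -1.0])
rho = np.diag([1.0, 0.0])  # |0><0|

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cost(thetas):
    V = ry(thetas[1]) @ ry(thetas[0])
    return np.trace(Z @ V @ rho @ V.T).real

theta, i = np.array([0.3, -0.7]), 0

shift = np.zeros(2); shift[i] = np.pi / 2        # omega = +/- 1/2 in (3)
exact = 0.5 * (cost(theta + shift) - cost(theta - shift))

eps = 1e-6
d = np.zeros(2); d[i] = eps
approx = (cost(theta + d) - cost(theta - d)) / (2 * eps)

print(exact, approx)  # both ~ -sin(theta_0 + theta_1) = sin(0.4)
```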

2.3. Barren plateaus

As discussed in [22–31], by analyzing the scaling of the variance of the cost function partial derivative one can detect the presence of BPs in the cost function landscape. We denote this variance as Var θ [∂i C], where the expectation values in the variance are taken over θ . Specifically, when the cost function exhibits a BP one finds that Var θ [∂i C] is exponentially vanishing with the number of qubits, i.e.,

${\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right] \in \mathcal{O}\left(\frac{1}{b^n}\right) \qquad (4)$

for some b > 1. Then, combining equation (4) with Chebyshev's inequality, the probability that the cost derivative deviates from its mean value (of zero [30]) is bounded as

$\mathrm{Pr}\left(\left\vert \partial_i C\right\vert \geqslant c\right) \leqslant \frac{{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2} \qquad (5)$

for any c > 0. Equation (5) shows that, on average, the cost function partial derivatives will be exponentially small across the landscape, meaning that an exponential precision is needed to estimate the gradients and determine a cost minimizing direction.
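The scaling in (4) can be observed directly in a deliberately simple toy model of our choosing (not a circuit analyzed in the paper): an entanglement-free tensor product of RY rotations with a global projector as observable, for which the cost factorizes and the parameter-shift gradient can be sampled cheaply.

```python
import numpy as np

# For V(theta) = RY(theta_1) x ... x RY(theta_n) acting on |0...0>, with the
# *global* observable O = |0...0><0...0|, the cost factorizes as
# C(theta) = prod_j cos^2(theta_j / 2). We sample the parameter-shift gradient
# over random theta and watch its variance decay exponentially with n.

rng = np.random.default_rng(0)

def grad_samples(n, n_samples=20000):
    thetas = rng.uniform(0, 2 * np.pi, size=(n_samples, n))
    rest = np.prod(np.cos(thetas[:, 1:] / 2) ** 2, axis=1)
    # Parameter shift for theta_1: (C(theta_1 + pi/2) - C(theta_1 - pi/2)) / 2.
    plus = np.cos((thetas[:, 0] + np.pi / 2) / 2) ** 2
    minus = np.cos((thetas[:, 0] - np.pi / 2) / 2) ** 2
    return 0.5 * (plus - minus) * rest

for n in [2, 4, 6, 8, 10, 12]:
    g = grad_samples(n)
    analytic = (1 / 8) * (3 / 8) ** (n - 1)  # Var = E[sin^2]/4 * E[cos^4]^(n-1)
    print(f"n = {n:2d}   Var[d_1 C] ~ {g.var():.2e}   (analytic {analytic:.2e})")
```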

BPs can arise due to multiple reasons. For instance, in the seminal work of [22] it was shown that deep unstructured circuits (i.e., random parametrized circuits with depths in $\mathcal{O}\left(\mathrm{poly}\left(n\right)\right)$) form 2-designs on n qubits and hence will have BPs. This phenomenon was later extended in [23] to the case of layered hardware efficient ansatzes where random two-qubit gates act on alternating pairs of qubits in a brick-like fashion. Therein it was shown that the existence of BPs is related to the locality of the operators Ox in (1). Global operators Ox that act non-trivially on all qubits have BPs irrespective of the depth of V( θ ), while local operators which measure individual qubits will have BPs only for deep circuits. BPs were then identified in perceptron-based QNNs [24], and in tasks of learning scramblers [26]. Moreover, the BP phenomenon has also been linked to the presence of high levels of entanglement in the circuit [27, 31] and to the hardware noise acting throughout the computation [25].

In what follows, we formulate our main results irrespective of the mechanism that leads to the BP, and hence our results apply to all such mechanisms.

3. Hessian matrix elements

Let us now state our results of how the presence of BPs can affect the estimation of the Hessian. The Hessian H of the cost function is a square matrix whose matrix elements are the second derivatives of C( θ ), i.e.,

$H_{ij} = \frac{\partial^2 C(\boldsymbol{\theta})}{\partial\theta_i\, \partial\theta_j} \qquad (6)$

Reference [38] noted that the matrix elements of the Hessian can be written according to the parameter shift rule. Namely, one can first write

$H_{ij} = \frac{1}{2}\left[\partial_i C\left(\boldsymbol{\theta}_{\bar{j}}, \theta_j^{\left(\frac{1}{2}\right)}\right) - \partial_i C\left(\boldsymbol{\theta}_{\bar{j}}, \theta_j^{\left(-\frac{1}{2}\right)}\right)\right] \qquad (7)$

and then apply the parameter shift rule a second time:

$H_{ij} = \frac{1}{4}\left[C\left(\boldsymbol{\theta}_{\overline{ij}}, \theta_i^{\left(\frac{1}{2}\right)}, \theta_j^{\left(\frac{1}{2}\right)}\right) - C\left(\boldsymbol{\theta}_{\overline{ij}}, \theta_i^{\left(-\frac{1}{2}\right)}, \theta_j^{\left(\frac{1}{2}\right)}\right) - C\left(\boldsymbol{\theta}_{\overline{ij}}, \theta_i^{\left(\frac{1}{2}\right)}, \theta_j^{\left(-\frac{1}{2}\right)}\right) + C\left(\boldsymbol{\theta}_{\overline{ij}}, \theta_i^{\left(-\frac{1}{2}\right)}, \theta_j^{\left(-\frac{1}{2}\right)}\right)\right] \qquad (8)$

Now, the second derivatives of the cost can be expressed as a sum of cost functions being evaluated at (up to) four points.
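As a sketch (not the paper's implementation), the four-term formula in (8) can be coded directly, reusing the toy single-qubit cost from section 2:

```python
import numpy as np

# Sketch of the four-term evaluation in (8) on an illustrative single-qubit
# cost. The i = j case also works: the two shifts on the same angle combine
# into total shifts of 0 and +/- pi.

Z = np.diag([1.0, -1.0])
rho = np.diag([1.0, 0.0])

def ry(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]])

def cost(thetas):
    V = ry(thetas[1]) @ ry(thetas[0])
    return np.trace(Z @ V @ rho @ V.T).real

def hessian_element(thetas, i, j):
    H = 0.0
    for si in (+1, -1):
        for sj in (+1, -1):
            shifted = thetas.copy()
            shifted[i] += si * np.pi / 2
            shifted[j] += sj * np.pi / 2
            H += si * sj * cost(shifted)
    return H / 4

theta = np.array([0.3, -0.7])
print(hessian_element(theta, 0, 1))  # analytic: -cos(theta_0 + theta_1) = -cos(-0.4)
```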

From the parameter shift rule we can then derive the following bound on the probability that the magnitude of a matrix element $\vert H_{ij}\vert$ is larger than a given c > 0.

Proposition 1. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let Hij be the matrix elements of the Hessian as defined in (7). Then, assuming that ${\langle {\partial }_{i}C\rangle }_{\boldsymbol{\theta }}=0$, the following inequality holds for any c > 0,

$\mathrm{Pr}\left(\vert H_{ij}\vert \geqslant c\right) \leqslant \frac{2\,{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2} \qquad (9)$

Here ${\mathrm{V}\mathrm{a}\mathrm{r}}_{\boldsymbol{\theta }}\left[{\partial }_{i}C\right]={\langle {\left({\partial }_{i}C\right)}^{2}\rangle }_{\boldsymbol{\theta }}-{\langle {\partial }_{i}C\rangle }_{\boldsymbol{\theta }}^{2}$, where the expectation values are taken over θ .

Proof. Equation (7) implies that the magnitudes of the Hessian matrix elements are bounded as

$\vert H_{ij}\vert \leqslant \frac{1}{2}\left(\left\vert \partial_i C\left(\boldsymbol{\theta}_{+}\right)\right\vert + \left\vert \partial_i C\left(\boldsymbol{\theta}_{-}\right)\right\vert\right) \qquad (10)$

From Chebyshev's inequality we can bound the probability that the cost derivative deviates from its mean value (of zero) as

$\mathrm{Pr}\left(\left\vert \partial_i C(\boldsymbol{\theta})\right\vert \geqslant c\right) \leqslant \frac{{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2} \qquad (11)$

for all c > 0, and for all i. Then, let ${\mathcal{E}}_{{\pm}}$ be defined as the event that $\left\vert {\partial }_{i}C\left({\boldsymbol{\theta }}_{{\pm}}\right)\right\vert {\geqslant}c$, where ${\boldsymbol{\theta }}_{{\pm}}=\left({\boldsymbol{\theta }}_{\bar{j}},{\theta }_{j}^{\left({\pm}\frac{1}{2}\right)}\right)$. Note that the set of events where |Hij | ⩾ c is a subset of the set ${\mathcal{E}}_{+}\cup {\mathcal{E}}_{-}$. Then, from the union bound and equation (11) we can recover (9) as follows:

$\mathrm{Pr}\left(\vert H_{ij}\vert \geqslant c\right) \leqslant \mathrm{Pr}\left(\mathcal{E}_{+} \cup \mathcal{E}_{-}\right) \qquad (12)$

$\leqslant \mathrm{Pr}\left(\mathcal{E}_{+}\right) + \mathrm{Pr}\left(\mathcal{E}_{-}\right) \qquad (13)$

$\leqslant \frac{{\mathrm{Var}}_{{\boldsymbol{\theta}}_{+}}\left[\partial_i C\right] + {\mathrm{Var}}_{{\boldsymbol{\theta}}_{-}}\left[\partial_i C\right]}{c^2} \qquad (14)$

$= \frac{2\,{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2}, \qquad (15)$

where we used the fact that ${\langle \cdot \rangle }_{\boldsymbol{\theta }}={\langle \cdot \rangle }_{{\boldsymbol{\theta }}_{{\pm}}}$. □

Then, the following corollary holds:

Corollary 1. Consider the bound in equation (9) of proposition 1. If the cost exhibits a BP, such that (4) holds, then the matrix elements of the Hessian are exponentially vanishing since

$\mathrm{Pr}\left(\vert H_{ij}\vert \geqslant c\right) \leqslant \frac{2F(n)}{c^2} \qquad (16)$

where $F\left(n\right)\in \mathcal{O}\left(1/{b}^{n}\right)$ for some b > 1.

The proof follows by combining (9) and (4). Corollary 1 shows that when the cost landscape exhibits a BP, the matrix elements of the Hessian are exponentially vanishing with high probability. This implies that any algorithm that requires the estimation of the Hessian will require a precision that grows exponentially with the system size.
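To get a feel for what this precision requirement means in terms of measurement shots, consider the following back-of-the-envelope arithmetic (our illustration, not a bound derived in the paper):

```python
# To resolve a Hessian entry of magnitude ~ b^{-n} against O(1) single-shot
# noise sigma, the standard error sigma / sqrt(S) must drop below b^{-n},
# so the number of shots must satisfy S >~ sigma^2 * b^{2n}.

b, sigma = 2.0, 1.0
for n in [4, 8, 12, 16, 20]:
    shots = sigma ** 2 * b ** (2 * n)
    print(f"n = {n:2d}   required shots ~ {shots:.1e}")
```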

4. Higher order partial derivatives

Let us now analyze the magnitude of higher order partial derivatives in a BP. We use the following notation for the | α |th-order derivative

$D^{\boldsymbol{\alpha}} C(\boldsymbol{\theta}) = \frac{\partial^{\vert\boldsymbol{\alpha}\vert} C(\boldsymbol{\theta})}{\partial\theta_{\alpha_1}\, \partial\theta_{\alpha_2} \cdots \partial\theta_{\alpha_{\vert\boldsymbol{\alpha}\vert}}} \qquad (17)$

where α is an $\vert\boldsymbol{\alpha}\vert$-tuple. Since one can take the derivative with respect to the same angle multiple times, we define the set Θ (of size M = |Θ|) as the set of distinct angles with respect to which we take the partial derivative. Similarly, let $\bar{\mathbf{\Theta}}$ be the complement of Θ, so that $\mathbf{\Theta} \cup \bar{\mathbf{\Theta}} = \boldsymbol{\theta}$. Then, for any ${\Theta}_k \in \mathbf{\Theta}$ we define $N_k$ as the multiplicity of ${\Theta}_k$ in α, such that ${\sum}_{k=1}^{M} N_k = \vert\boldsymbol{\alpha}\vert$. Since the cost function and any of its higher order partial derivatives are continuous functions of the parameters (as can be seen below via multiple applications of the parameter shift rule), one can extend Clairaut's theorem [44] to rewrite

$D^{\boldsymbol{\alpha}} C(\boldsymbol{\theta}) = \frac{\partial^{\vert\boldsymbol{\alpha}\vert} C(\boldsymbol{\theta})}{\partial{\Theta}_1^{N_1}\, \partial{\Theta}_2^{N_2} \cdots \partial{\Theta}_M^{N_M}} \qquad (18)$

Then, applying the parameter shift rule $\vert\boldsymbol{\alpha}\vert$ times we find that the $\vert\boldsymbol{\alpha}\vert$th-order partial derivative can be expressed as a summation of cost functions evaluated at (up to) $2^{\vert\boldsymbol{\alpha}\vert}$ points as

$D^{\boldsymbol{\alpha}} C(\boldsymbol{\theta}) = \frac{1}{2^{\vert\boldsymbol{\alpha}\vert}} \sum_{\boldsymbol{\omega}} d_{\boldsymbol{\omega}}\, C\left(\bar{\mathbf{\Theta}}, \mathbf{\Theta}^{(\boldsymbol{\omega})}\right) \qquad (19)$

Here we defined ${\mathbf{\Theta }}^{\left(\boldsymbol{\omega }\right)}=\left({{\Theta}}_{1}^{\left({\omega }_{1}\right)},\dots ,{{\Theta}}_{M}^{\left({\omega }_{M}\right)}\right)$, with ${{\Theta}}_{k}^{\left({\omega }_{k}\right)}={{\Theta}}_{k}+{\omega }_{k}\pi $ defined analogously to (3), and where

$d_{\boldsymbol{\omega}} = \prod_{l=1}^{M} d_{\left(\omega_l, N_l\right)} \qquad (20)$

Also, $\boldsymbol{\omega} = (\omega_1, \ldots, \omega_M)$, where $\omega_l \in \{0, \pm 1\}$ if $N_l$ is even, and ${\omega}_l \in \left\{{\pm}\frac{1}{2}, {\pm}\frac{3}{2}\right\}$ if $N_l$ is odd. Additionally, the coefficients $d_{\left(\omega_l, N_l\right)}$ can be obtained from the Pascal tree, which we introduce in figure 1. In appendix A we provide additional details regarding the coefficients $d_{\left(\omega_l, N_l\right)}$ and the Pascal tree.

Figure 1. The Pascal tree. (a) The Pascal tree is obtained by modifying how a Pascal triangle is constructed. In a Pascal triangle, each entry of a row is obtained by adding together the numbers directly above it to the left and to the right, with blank entries treated as zero. The entries of a Pascal tree follow the same rule, with the additional constraint that the width of the triangle may never exceed a given even number. Once an entry in a row falls outside the maximum width, its value is added to the central entry of that row (see arrows). Here the maximum width is four. (b) The coefficients $d_{\left(\omega_l, N_l\right)}$ in (20) are obtained from the Pascal tree of (a) by attaching signs to its entries. As schematically depicted, all entries on a diagonal running from top left to bottom right share the same sign, with the first entry of the first row being positive. Each row corresponds to a given $N_l$, while entries within a row correspond to different values of $\omega_l$, with $\omega_l \in \{0, \pm 1\}$ if $N_l$ is even and ${\omega}_l \in \left\{{\pm}\frac{1}{2}, {\pm}\frac{3}{2}\right\}$ if $N_l$ is odd. For instance, ${d}_{\left(-\frac{1}{2},5\right)} = -12$.


From (19) we obtain that the $(\vert\boldsymbol{\alpha}\vert + 1)$th-order derivative, which we denote as $\partial_i D^{\boldsymbol{\alpha}} C(\boldsymbol{\theta}) = D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})$, can be expressed as the sum of (up to) $2^{\vert\boldsymbol{\alpha}\vert}$ first-order partial derivatives:

$D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta}) = \frac{1}{2^{\vert\boldsymbol{\alpha}\vert}} \sum_{\boldsymbol{\omega}} d_{\boldsymbol{\omega}}\, \partial_i C\left(\bar{\mathbf{\Theta}}, \mathbf{\Theta}^{(\boldsymbol{\omega})}\right) \qquad (21)$

Since one has to individually evaluate each term in (21) and since there are up to $2^{\vert\boldsymbol{\alpha}\vert}$ terms, we henceforth assume that $\vert\boldsymbol{\alpha}\vert \in \mathcal{O}\left(\mathrm{log}\left(n\right)\right)$. This guarantees that the computation of $D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})$ leads to an overhead which is (at most) $\mathcal{O}\left(\mathrm{poly}\left(n\right)\right)$.
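As an illustration of (19) and of how the Pascal tree enters, the sketch below evaluates a fifth-order derivative with respect to a single angle. The signed coefficient recursion used here is our reading of the Pascal-tree construction in figure 1, checked against the exact derivative of a toy one-parameter cost of our choosing.

```python
import numpy as np
from fractions import Fraction

# Sketch of (19) for the simplest case: differentiating N times with respect
# to the SAME angle. Iterating the parameter shift rule yields the recursion
#   d_(nu, N) = d_(nu - 1/2, N - 1) - d_(nu + 1/2, N - 1),
# with contributions at nu = +/-2 folded into nu = 0, since theta^(+/-2) equals
# theta^(0) up to a 2*pi shift (our reading of the Pascal tree of figure 1).

def pascal_tree_row(N):
    """Return {omega: d_(omega, N)} for the N-th derivative w.r.t. one angle."""
    row = {Fraction(0): 1}  # N = 0: the cost itself
    for _ in range(N):
        new = {}
        for omega, d in row.items():
            for nu, sign in ((omega + Fraction(1, 2), 1), (omega - Fraction(1, 2), -1)):
                key = Fraction(0) if abs(nu) == 2 else nu  # fold the overflow
                new[key] = new.get(key, 0) + sign * d
        row = new
    return row

# Toy one-parameter cost C(theta) = cos^2(theta/2) = (1 + cos(theta)) / 2,
# i.e. the probability of measuring |0> on RY(theta)|0>; its 5th derivative
# is -sin(theta)/2, which the shifted-cost sum reproduces exactly.
def cost(theta):
    return np.cos(theta / 2) ** 2

theta, N = 0.4, 5
row = pascal_tree_row(N)  # {-3/2: 4, -1/2: -12, 1/2: 12, 3/2: -4}
dN = sum(d * cost(theta + float(w) * np.pi) for w, d in row.items()) / 2 ** N
print(dN, -np.sin(theta) / 2)
```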

The following proposition, which generalizes proposition 1, allows us to bound the probability that $\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert$ is larger than a given c > 0.

Proposition 2. Consider a cost function of the form (1), for which the parameter shift rule of (2) holds. Let Di, α C( θ ) be a higher order partial derivative of the cost as defined in (17). Then, assuming that ${\langle {\partial }_{i}C\rangle }_{\boldsymbol{\theta }}=0$, the following inequality holds for any c > 0,

$\mathrm{Pr}\left(\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert \geqslant c\right) \leqslant \frac{2^{\vert\boldsymbol{\alpha}\vert}\,{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2} \qquad (22)$

Proof. From equation (21) we can obtain the following bound

$\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert \leqslant \frac{1}{2^{\vert\boldsymbol{\alpha}\vert}} \sum_{\boldsymbol{\omega}} \left\vert d_{\boldsymbol{\omega}}\right\vert \left\vert \partial_i C\left(\bar{\mathbf{\Theta}}, \mathbf{\Theta}^{(\boldsymbol{\omega})}\right)\right\vert \qquad (23)$

Let us define ${\mathcal{E}}_{\boldsymbol{\omega}}$ as the event that $\left\vert \partial_i C\left(\bar{\mathbf{\Theta}}, \mathbf{\Theta}^{(\boldsymbol{\omega})}\right)\right\vert \geqslant c$. Then, from the Chebyshev bound of (11) and the union bound, the following chain of inequalities holds

$\mathrm{Pr}\left(\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert \geqslant c\right) \leqslant \mathrm{Pr}\left(\bigcup_{\boldsymbol{\omega}} {\mathcal{E}}_{\boldsymbol{\omega}}\right) \qquad (24)$

$\leqslant \sum_{\boldsymbol{\omega}} \mathrm{Pr}\left({\mathcal{E}}_{\boldsymbol{\omega}}\right) \qquad (25)$

$\leqslant \sum_{\boldsymbol{\omega}} \frac{{\mathrm{Var}}_{\left(\bar{\mathbf{\Theta}}, \mathbf{\Theta}^{(\boldsymbol{\omega})}\right)}\left[\partial_i C\right]}{c^2} \qquad (26)$

$\leqslant \frac{2^{\vert\boldsymbol{\alpha}\vert}\,{\mathrm{Var}}_{\boldsymbol{\theta}}\left[\partial_i C\right]}{c^2}, \qquad (27)$

where we invoked the union bound, and where we recall that ${\langle \cdot \rangle }_{\boldsymbol{\theta }}={\langle \cdot \rangle }_{\left(\bar{\mathbf{\Theta }},{\mathbf{\Theta }}^{\left(\boldsymbol{\omega }\right)}\right)}$, ∀ ω . In addition, for (27) we used the fact that the summation in (26) has at most $2^{\vert\boldsymbol{\alpha}\vert}$ terms. □

Then, if the cost function exhibits a BP, the following corollary follows.

Corollary 2. Consider the bound in equation (22) of proposition 2. If the cost exhibits a BP, such that (4) holds, then higher order partial derivatives of the cost function are exponentially vanishing since

$\mathrm{Pr}\left(\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert \geqslant c\right) \leqslant \frac{G(n)}{c^2} \qquad (28)$

where $G\left(n\right)\in \mathcal{O}\left(1/{q}^{n}\right)$ for some q > 1.

Proof. Combining (4) and (22) leads to

$\mathrm{Pr}\left(\left\vert D^{i,\boldsymbol{\alpha}} C(\boldsymbol{\theta})\right\vert \geqslant c\right) \leqslant \frac{2^{\vert\boldsymbol{\alpha}\vert} F(n)}{c^2} \qquad (29)$

Then, let us define $G(n) = 2^{\vert\boldsymbol{\alpha}\vert} F(n)$. Since $\vert\boldsymbol{\alpha}\vert \in \mathcal{O}\left(\mathrm{log}\left(n\right)\right)$ and $F(n) \in \mathcal{O}\left(1/b^n\right)$, we know that there exist κ, κ', and n0 such that ∀ n > n0 we, respectively, have $2^{\vert\boldsymbol{\alpha}\vert} \leqslant n^{\kappa}$ and $F(n) \leqslant \frac{\kappa'}{b^n}$. Combining these two results we find

$G(n) = 2^{\vert\boldsymbol{\alpha}\vert} F(n) \leqslant \frac{n^{\kappa} \kappa'}{b^{n}} = \frac{\kappa'}{b^{L(n)}}, \qquad (30)$

where $L(n) = n - \kappa\, \mathrm{log}_b(n)$. Equation (30) shows that $G\left(n\right)\in \mathcal{O}\left(1/{b}^{L\left(n\right)}\right)$. Then, since

$\lim_{n\to\infty} \frac{L(n)}{n} = 1, \qquad (31)$

we have L(n) ∈ Ω(n), meaning that there exists a $\hat{\kappa}{ >}0$ and ${\hat{n}}_{0}$ such that $\forall \enspace n{ >}{\hat{n}}_{0}$, we have $L\left(n\right){\geqslant}\hat{\kappa }n$. The latter implies $G\left(n\right){\leqslant}\frac{{\kappa }^{\prime }}{{b}^{\hat{\kappa }n}}$ for all $n{ >}\mathrm{max}\left\{{n}_{0},{\hat{n}}_{0}\right\}$, which means that $G\left(n\right)\in \mathcal{O}\left(1/{q}^{n}\right)$ where $q={b}^{\hat{\kappa}}$. Also, q > 1 follows from b > 1 and $\hat{\kappa}{ >}0$. □

Corollary 2 shows that, in a BP, the magnitude of any efficiently computable higher order partial derivative (i.e., any partial derivative where $\vert \boldsymbol{\alpha }\vert \in \mathcal{O}\left(\mathrm{log}\left(n\right)\right)$) is exponentially vanishing in n with high probability.

5. Discussion

In this work, we investigated the impact of BPs on higher order derivatives. As shown in [21, 38], information on these higher-order derivatives can be used to analyze the landscape of cost functions and shed some light on the relatively obscure nature of training landscapes for VQAs and QNNs. Moreover, it was also suggested that one could use higher order derivative information to escape a BP. We considered a cost function C that is relevant to both VQAs and QNNs, as BPs are relevant to both of these applications.

Our main result was that, when a BP exists, the Hessian and other high order partial derivatives of C are exponentially vanishing in n with high probability. Our proof relied on the parameter shift rule, which we showed can be applied iteratively to relate higher order partial derivatives to the first order partial derivative (analogous to what reference [38] did for the Hessian). Hence, the parameter shift rule allowed us to state the vanishing of higher order derivatives as essentially a corollary of the vanishing of the first order derivative. We remark that iterative applications of the parameter shift rule led us to a mathematically interesting construct that we called the Pascal tree, depicted in figure 1.

Our results imply that estimating higher order partial derivatives in a BP is exponentially hard. Hence, any optimization strategy that requires information about partial derivatives beyond first order (such as the Hessian) will require a precision that grows exponentially with n. We therefore surmise that optimizers going beyond first-order gradient descent do not, by themselves, offer a feasible solution to the BP problem. More generally, our results suggest that it is better to develop strategies that avoid the appearance of a BP altogether, rather than to try to escape an existing one.

Finally, we remark that our results were derived using equation (21) and the Pascal tree, which provide novel, general formulas for arbitrary higher-order partial derivatives of the cost function in equation (1). These formulas are of interest on their own, as they can be generically employed to analytically evaluate higher-order derivatives of the cost on quantum hardware. As an example application, one could apply this method to take derivatives of quantum feature maps for solving partial differential equations [14].

While some prior work on higher-order derivative formulas was performed in reference [45], our formulas are more explicit and our introduction of the Pascal tree is novel. We also remark that, in a recent post concurrent with our work, reference [41] introduced a formula similar to equation (21) for explicitly determining higher-order partial derivatives, although that work analyzes the formulas in a different context.

Acknowledgments

We thank Kunal Sharma for helpful discussions. Research presented in this article was supported by the Laboratory Directed Research and Development program of Los Alamos National Laboratory under project number 20180628ECR. MC was also supported by the Center for Nonlinear Studies at LANL. PJC also acknowledges support from the LANL ASC Beyond Moore's Law project. This work was also supported by the U.S. Department of Energy (DOE), Office of Science, Office of Advanced Scientific Computing Research, under the Accelerated Research in Quantum Computing (ARQC) program.

Data availability statement

No new data were created or analysed in this study.

Appendix A. Explicit description of $d_{\left(\omega_l, N_l\right)}$

In this appendix we first discuss how the parameter shift rule leads to the Pascal tree. Then, we provide analytical formulas for $d_{\left(\omega_l, N_l\right)}$.

Let us consider the first and second order partial derivatives of the cost function with respect to the same angle. From the parameter shift rule of equation (2) we find

$\partial_i C(\boldsymbol{\theta}) = \frac{1}{2}\left[C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{\left(\frac{1}{2}\right)}\right) - C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{\left(-\frac{1}{2}\right)}\right)\right], \qquad \partial_i^2 C(\boldsymbol{\theta}) = \frac{1}{2^2}\left[C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{(1)}\right) - 2C(\boldsymbol{\theta}) + C\left(\boldsymbol{\theta}_{\bar{i}}, \theta_i^{(-1)}\right)\right],$

where we can see that $\vert {d}_{\left(0,2\right)}\vert =\vert {d}_{\left(-\frac{1}{2},1\right)}\vert +\vert {d}_{\left(\frac{1}{2},1\right)}\vert =2$. Similarly, if we were to take the third partial derivative with respect to $\theta_i$ we would find $\vert d_{(\pm 1/2,3)}\vert = \vert d_{(0,2)}\vert + \vert d_{(\pm 1,2)}\vert$, and $\vert d_{(\pm 3/2,3)}\vert = \vert d_{(\pm 1,2)}\vert$. Note that this procedure forms the first four rows of the Pascal tree, which coincide with the first four rows of the Pascal triangle. When taking the fourth partial derivative we have to take into account the fact that $C\left({\boldsymbol{\theta }}_{\bar{i}},{\theta }_{i}^{\left(-2\right)}\right)=C\left({\boldsymbol{\theta }}_{\bar{i}},{\theta }_{i}^{\left(2\right)}\right)=C\left(\boldsymbol{\theta }\right)$, since e−iθσ/2 is equal to e−i(θ+2π)σ/2 up to an unobservable global phase. The identification $\theta \equiv \theta^{(2)}\ (\mathrm{mod}\ 2\pi)$ thus imposes the restriction on the width of the Pascal tree. Following this procedure one can recover the entries in figure 1.

For arbitrary $\omega_l$ and $N_l$, the coefficients $d_{\left(\omega_l, N_l\right)}$ can be obtained as follows. If $N_l < 2$ we have $d_{(\pm 1,0)} = 0$, $d_{(0,0)} = 1$, $d_{(\pm 1/2,1)} = \pm 1$, and $d_{(\pm 3/2,1)} = 0$. Then, for $N_l \geqslant 2$ the entries follow the signed Pascal-tree recursion

$d_{\left(\nu, N_l\right)} = d_{\left(\nu - \frac{1}{2},\, N_l - 1\right)} - d_{\left(\nu + \frac{1}{2},\, N_l - 1\right)},$

where $\nu \in \{0, \pm 1\}$ if $N_l$ is even and $\nu \in \left\{\pm\frac{1}{2}, \pm\frac{3}{2}\right\}$ if $N_l$ is odd, and where any contribution that would fall at $\nu = \pm 2$ is instead added to the central entry $\nu = 0$ (using $\theta^{(\pm 2)} \equiv \theta^{(0)}$). For instance, $d_{\left(-\frac{1}{2},5\right)} = d_{(-1,4)} - d_{(0,4)} = -4 - 8 = -12$, consistent with figure 1. Note that ∀ $N_l$ we have ${\sum}_{\omega_l} \vert d_{\left(\omega_l, N_l\right)}\vert = 2^{N_l}$.
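A compact numerical check of this normalization, using the same signed recursion (again, our reading of the Pascal-tree construction in figure 1):

```python
from fractions import Fraction

# Self-contained check that sum_omega |d_(omega, N)| = 2^N for every N.

def pascal_tree_row(N):
    row = {Fraction(0): 1}
    for _ in range(N):
        new = {}
        for omega, d in row.items():
            for nu, sign in ((omega + Fraction(1, 2), 1), (omega - Fraction(1, 2), -1)):
                key = Fraction(0) if abs(nu) == 2 else nu
                new[key] = new.get(key, 0) + sign * d
        row = new
    return row

for N in range(1, 9):
    row = pascal_tree_row(N)
    assert sum(abs(d) for d in row.values()) == 2 ** N
    print(N, {float(w): d for w, d in sorted(row.items())})
```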
