Equivalence of cost concentration and gradient vanishing for quantum circuits: An elementary proof in the Riemannian formulation

The optimization of quantum circuits can be hampered by a decay of average gradient amplitudes with increasing system size. When the decay is exponential, this is called the barren plateau problem. Considering explicit circuit parametrizations (in terms of rotation angles), it has been shown in Arrasmith et al., Quantum Sci. Technol. 7, 045015 (2022) that barren plateaus are equivalent to an exponential decay of the variance of cost-function differences. We show that the issue is particularly simple in the (parametrization-free) Riemannian formulation of such optimization problems and obtain a tighter bound for the cost-function variance. An elementary derivation shows that the single-gate variance of the cost function is strictly equal to half the variance of the Riemannian single-gate gradient, where we sample variable gates according to the uniform Haar measure. The total variances of the cost function and its gradient are then both bounded from above by the sum of single-gate variances and, conversely, bound single-gate variances from above. So, decays of gradients and cost-function variations go hand in hand, and barren plateau problems cannot be resolved by avoiding gradient-based in favor of gradient-free optimization methods.


I. INTRODUCTION
Recent rapid advancements in quantum computing hardware enable the implementation of large and deep quantum circuits, reaching regimes beyond the simulation capabilities of classical computers. A promising scheme to harness this potential before the advent of practical fault tolerance is the use of variational quantum algorithms (VQA) [1]: Quantum circuits are executed on quantum computers, and the quantum-gate parameters are optimized through a classical backend to minimize a given cost function. A critical challenge in such hybrid quantum-classical optimizations consists in noise and the probabilistic nature of quantum measurements. In generic variational quantum circuits, average gradient amplitudes tend to decrease exponentially in the system size (number of qudits). This phenomenon is known as barren plateaus [2,3]. Unless one already has a very good guess for the optimal circuit, the barren plateau problem implies that we would need an exponential number of measurement shots for a sufficiently accurate determination of cost-function gradients, prohibiting the application for large problem sizes. Otherwise, we would very likely end up with random walks in flat regions of the cost landscape. Numerous works investigate how to avoid an exponential decay of gradient amplitudes [4-18], and the absence of barren plateaus might imply classical simulability [19].
The quantum circuits in VQA can comprise fixed unitary gates {Ŵ1, Ŵ2, ...} and variable unitary gates {Û1, Û2, ..., ÛK}. For example, the former could be CNOT gates and the latter single-qubit gates. The variable gates are typically parametrized by rotation angles, Ûi = Ûi(θi) ∈ U(Ni), and the optimization is based on the Euclidean metric in the angle space {θi}. Using this framework and assuming that the variable gates are composed of rotations e^(-i θ_{i,k} σ̂_k) with involutory generators σ̂_k = σ̂_k† = σ̂_k^(-1), Arrasmith et al. established the equivalence of barren plateaus and an exponential decay of the variance of cost-function differences with respect to increasing system size [20].
FIG. 1. An increase in dimensionality (scaling up the system size n) can lead to a decay of gradients. The situation where the average gradient amplitude decays exponentially in n is the so-called barren-plateau problem. In general, gradient decay may or may not be accompanied by concentration of the cost function. As discussed here and in Ref. [20], the two phenomena go hand in hand for VQA.
Alternatively, one can formulate the circuit optimization problem directly over the manifold M formed by the direct product of the gates' unitary groups in a representation-free form. In this Riemannian approach [21,22], gradients are elements of the tangent space of M, and one can implement line searches and Riemannian quasi-Newton methods through retractions and vector transport on M as discussed in recent works [23,24]. Riemannian optimization has some advantages over the Euclidean optimization of parametrized quantum circuits. For example, it avoids cost-function saddle points that are introduced when employing a global parametrization {θi} of the manifold M (consider, e.g., sitting at the north pole of a sphere and rotating around the z axis). Furthermore, the Riemannian formulation can simplify analytical considerations, e.g., concerning average gradient amplitudes [16,17] and cost-function variances as discussed in the following.
In this report, we establish a direct connection between cost-function concentration and the decay of Riemannian gradient amplitudes in the optimization of quantum circuits. The proof in the Riemannian formulation is surprisingly simple and, compared to Ref. [20], yields tighter bounds. We will show that, when the gates are sampled according to the uniform Haar measure, the single-gate cost-function variance is exactly half the single-gate variance of the Riemannian gradient. The corresponding total variances, where all gates are varied simultaneously, are both bounded from above by sums of the single-gate variances. Furthermore, the total variances bound all individual single-gate variances. As a consequence, the barren plateau problem can be equivalently diagnosed through the analysis of cost-function concentration and cannot be resolved by switching from gradient-based optimization to a gradient-free optimization [20,27].

II. COST FUNCTION AND RIEMANNIAN GRADIENT
Consider a generic quantum circuit Û composed of some fixed unitary gates {Ŵ1, Ŵ2, ...} and variable unitary gates {Û1, Û2, ...} over which we optimize. Starting from a reference state ρ̂0, the circuit prepares the state ρ̂ = Û ρ̂0 Û†. With an observable Ô, the cost function takes the form

  E = Tr(Ô ρ̂) = Tr(Ô Û ρ̂0 Û†).   (2)

Considering the dependence on one of the variable gates, Û ∈ U(N), we can write the cost function in the compact form

  E(Û) = Tr(Ŷ Ũ X̂ Ũ†)  with  Ũ := Û ⊗ 1_M,   (3)

as illustrated in Fig. 2a, where the Hermitian operator X̂ on C^N ⊗ C^M comprises ρ̂0, Ŷ comprises Ô, and both comprise further circuit gates except Û. See Fig. 2c for an example. As discussed in Refs. [16,17], expectation values ⟨Ψ|Ĥ|Ψ⟩ of a Hamiltonian Ĥ with respect to isometric tensor network states (TNS) |Ψ⟩ = Û|0⟩ can also be written in the form (3). In this case, the TNS are generated from a pure reference state |0⟩ by application of a quantum circuit, and Û corresponds to one tensor of the TNS. The example of a multiscale entanglement renormalization ansatz (MERA) [25,26] is illustrated in Fig. 2d.
Here and in Sec. III, we consider variation of one specific unitary gate Û such that the Riemannian manifold is just U(N); this is referred to as a "single-gate" variation. The extension to variation of all gates ("total" variation) on the full manifold (1) will be discussed in Sec. IV. Projecting the gradient d̂ = ∂_Û E(Û) = 2 Tr_M(Ŷ Ũ X̂) of the cost function (3) onto the tangent space of the unitary group U(N) at Û, we obtain the Riemannian gradient

  ĝ = (d̂ − Û d̂† Û)/2,   (4)

as illustrated in Fig. 2b. Given that we need to stay on the manifold U(N) during the optimization, ĝ is the relevant direction of change. As discussed in Refs. [23,24], it can be efficiently measured on quantum computers. Here and in the following, Tr_N and Tr_M denote the partial traces over the first and second components of C^N ⊗ C^M, respectively. Let us summarize the derivation of Eq. (4): The N × N unitary gates are embedded in the 2N²-dimensional real Euclidean space of complex N × N matrices. For Riemannian optimization algorithms, one needs to project d̂ onto the tangent space T_Û of U(N) at Û, and then construct retractions for line search, and vector transport to form linear combinations of gradient vectors from different points on the manifold [21-23]. An element V̂ of the tangent space T_Û needs to obey

  V̂† Û + Û† V̂ = 0.   (5)

The projection ĝ of d̂ onto this tangent space obeys (V̂, ĝ) = (V̂, d̂) for all V̂ ∈ T_Û. This gives the Riemannian gradient ĝ = (d̂ − Û d̂† Û)/2 stated in Eq. (4).
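The projection can be illustrated with a small numerical sketch (not part of the paper; it assumes the cost form E(Û) = Tr(Ŷ Û X̂ Û†) with a trivial environment M = 1 and randomly chosen Hermitian X̂, Ŷ). It checks that ĝ = (d̂ − Û d̂† Û)/2 satisfies the tangent-space condition and the projection property (V̂, ĝ) = (V̂, d̂) for tangent vectors V̂:

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(7)
N = 4

def rand_herm(n):
    """Random Hermitian matrix (hypothetical stand-in for the circuit environment)."""
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

X, Y = rand_herm(N), rand_herm(N)           # X comprises rho0, Y comprises O (M = 1 here)
U = unitary_group.rvs(N, random_state=rng)  # Haar-random variable gate

# Euclidean gradient of E(U) = Tr(Y U X U^dag); the factor 2 stems from the
# embedding of U(N) in the real Euclidean space of complex N x N matrices
d = 2 * Y @ U @ X

# Riemannian gradient: projection of d onto the tangent space T_U of U(N)
g = (d - U @ d.conj().T @ U) / 2

# tangent-space condition g^dag U + U^dag g = 0
assert np.allclose(g.conj().T @ U + U.conj().T @ g, 0)

# projection property (V, g) = (V, d) for V = U H with anti-Hermitian H,
# using the real inner product (A, B) = Re Tr(A^dag B)
V = U @ (1j * rand_herm(N))
assert np.isclose(np.trace(V.conj().T @ g).real,
                  np.trace(V.conj().T @ d).real)
```

Both assertions hold exactly (up to floating-point error): the anti-Hermitian and Hermitian parts of Û†d̂ are orthogonal with respect to the real inner product.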

III. SINGLE-GATE HAAR-MEASURE VARIANCES
To evaluate averages and variances over U(N) (or, more generally, the manifold M), we employ Haar-measure integrals. The average of the Riemannian gradient (4) is zero because ĝ is an odd function in Û,

  Avg_Û ĝ = 0.   (6)

For the evaluation of Avg_Û E and the variances, we only need the first and second-moment Haar-measure integrals over the unitary group. From the Weingarten formulas [28,29] for the first and second moments, one obtains [17]

  Avg_Û Û† ⊗ Û = Swap/N   (7a)

and, written elementwise for the second moment,

  Avg_Û Û*_{ab} Û_{cd} Û*_{ef} Û_{gh} = (δ_{ac}δ_{bd}δ_{eg}δ_{fh} + δ_{ag}δ_{bh}δ_{ce}δ_{df})/(N²−1)
    − (δ_{ac}δ_{bh}δ_{eg}δ_{df} + δ_{ag}δ_{bd}δ_{ce}δ_{fh})/(N(N²−1)),   (7b)

with Swap = Σ_{i,j=1}^N |i,j⟩⟨j,i|; in the four-component notation, Swap_{k,ℓ} swaps the kth and ℓth components of C^N ⊗ C^N ⊗ C^N ⊗ C^N. Graphical representations of Eqs. (7a) and (7b) are shown in Figs. 5a and 5c. Weingarten formulas can be proven using the Schur-Weyl duality and the double centralizer theorem [29]. An illustrating proof for Eq. (7a) is given in Appx. A.
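The first-moment formula (7a) is easy to confirm by direct sampling. The following sketch (illustrative only, subject to Monte Carlo error) averages Û† ⊗ Û over Haar-random unitaries and compares the result with Swap/N:

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(0)
N, samples = 3, 10000

# Swap operator on C^N (x) C^N: Swap |i,j> = |j,i>
swap = np.zeros((N * N, N * N))
for i in range(N):
    for j in range(N):
        swap[j * N + i, i * N + j] = 1.0

# Monte Carlo estimate of Avg_U (U^dag (x) U) over the Haar measure
acc = np.zeros((N * N, N * N), dtype=complex)
for _ in range(samples):
    U = unitary_group.rvs(N, random_state=rng)
    acc += np.kron(U.conj().T, U)
avg = acc / samples

# first-moment Weingarten formula, Eq. (7a): Avg_U (U^dag (x) U) = Swap / N
assert np.max(np.abs(avg - swap / N)) < 0.03
```

With 10^4 samples, the elementwise statistical error is of order 10^-2, so the tolerance above leaves comfortable headroom.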
Applying the Weingarten formulas (7), we find a simple linear relation between the single-gate cost-function variance

  Var_Û E = Avg_Û E² − (Avg_Û E)²   (8)

and the single-gate gradient variance Var_Û ĝ. With the average gradient (6) being zero, we can quantify the gradient variance by

  Var_Û ĝ := Avg_Û Tr(ĝ† ĝ)/N.   (9)

This definition can be motivated as follows [17]: As any element of the tangent space (5), the gradient ĝ can be expanded in an orthonormal basis of involutory Hermitian operators σ̂_k = σ̂_k† = σ̂_k^(-1) with Tr(σ̂_j σ̂_k) = N δ_{jk}. This gives the gradient in the form ĝ = iÛ Σ_{k=1}^{N²} α_k σ̂_k/N, where each α_k corresponds to the derivative of one rotation angle. Hence, Tr(ĝ† ĝ)/N = Σ_k α_k²/N², i.e., Eq. (9) coincides with the average variance of the rotation-angle derivatives [30].
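The identity Tr(ĝ† ĝ)/N = Σ_k α_k²/N² behind this motivation can be verified directly; the sketch below (an illustration for N = 2, using the identity and Pauli matrices as the involutory orthonormal basis with Tr(σ̂_j σ̂_k) = N δ_{jk}) builds ĝ = iÛ Σ_k α_k σ̂_k/N from random coefficients:

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(1)
N = 2

# involutory Hermitian basis for N = 2: identity and the three Pauli matrices;
# they satisfy Tr(sigma_j sigma_k) = N delta_jk
sigmas = [np.eye(2),
          np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]])]

U = unitary_group.rvs(N, random_state=rng)
alpha = rng.normal(size=N * N)  # stand-ins for the rotation-angle derivatives

# tangent vector g = i U sum_k alpha_k sigma_k / N
g = 1j * U @ sum(a * s for a, s in zip(alpha, sigmas)) / N

# Tr(g^dag g)/N equals the mean squared rotation-angle derivative sum_k alpha_k^2 / N^2
lhs = np.trace(g.conj().T @ g).real / N
rhs = np.sum(alpha**2) / N**2
assert np.isclose(lhs, rhs)
```

The unitarity of Û drops out of Tr(ĝ† ĝ), so the identity is exact for any gate Û.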
An elementary proof given in Appx. B establishes a linear relation between the single-gate cost-function variance (8) and gradient variance (9).
Theorem 1 (Exact equivalence of single-gate cost-function and gradient variances). In the Riemannian formulation, the variance of the cost function (2) is exactly half the variance of the Riemannian gradient (4) when considering the dependence on one of the unitary gates of the quantum circuit (Ûj≠i fixed), i.e.,

  Var_Ûi E = (1/2) Var_Ûi ĝ.   (10)

Of course, the proportionality of these conditional single-gate variances translates directly into a proportionality of the averaged single-gate variances (the conditional variances (10) averaged over all Ûj≠i),

  V_i := Avg_{Ûj≠i} Var_Ûi E = (1/2) Avg_{Ûj≠i} Var_Ûi ĝ.   (11)

IV. TOTAL HAAR-MEASURE VARIANCES

In Sec. III, we only considered the dependence of the cost function (2) on one of the unitary gates (Û) in the circuit as well as the single-gate gradient (4). In this section, we consider the dependence on all variable unitary gates (Û1, Û2, ..., ÛK) ∈ M with Ûi ∈ U(Ni) and the corresponding total variances like

  Var_{Û1,...,ÛK} E   (12)

for the cost function.
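Before turning to the total variances, the single-gate relation of Theorem 1 can be checked by direct sampling. The sketch below (illustrative only, not the paper's derivation) assumes the single-gate cost E(Û) = Tr(Ŷ Û X̂ Û†) with a trivial environment (M = 1), random Hermitian X̂, Ŷ, and the gradient d̂ = 2 Ŷ Û X̂; it compares Monte Carlo estimates of Var_Û E and half the gradient variance (9):

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(42)
N, samples = 4, 10000

def rand_herm(n):
    """Random Hermitian matrix (hypothetical fixed circuit environment)."""
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

X, Y = rand_herm(N), rand_herm(N)  # assumed environment operators, M = 1

E_vals, g_norms = [], []
for _ in range(samples):
    U = unitary_group.rvs(N, random_state=rng)
    E_vals.append(np.trace(Y @ U @ X @ U.conj().T).real)  # cost E(U)
    d = 2 * Y @ U @ X                                     # Euclidean gradient
    g = (d - U @ d.conj().T @ U) / 2                      # Riemannian gradient
    g_norms.append(np.trace(g.conj().T @ g).real / N)

var_E = np.var(E_vals)
half_var_g = np.mean(g_norms) / 2   # Avg g = 0, so Var g = Avg Tr(g^dag g)/N

# Theorem 1: Var_U E = (1/2) Var_U g, up to Monte Carlo error
assert abs(var_E - half_var_g) / var_E < 0.1
```

Both estimators converge to (Tr X̂² − (Tr X̂)²/N)(Tr Ŷ² − (Tr Ŷ)²/N)/(N²−1); the 10% tolerance accounts for the sampling noise at 10^4 samples.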
The full Riemannian gradient of the cost function (2) with respect to all variable gates is simply the direct sum of the individual gradients, i.e.,

  ĝ_full = ĝ1 ⊕ ĝ2 ⊕ ... ⊕ ĝK  with  ĝi = (d̂i − Ûi d̂i† Ûi)/2  and  d̂i = 2 Tr_{Mi}(Ŷi Ũi X̂i).   (13)

Here, Ũi := Ûi ⊗ 1_{Mi} with X̂i and Ŷi depending on the remaining gates of the circuit, and ρ̂0 as well as Ô as in Eq. (3). In extension of Eq. (9), we define the total variance of ĝ_full as

  Var_{Û1,...,ÛK} ĝ_full := Σ_{i=1}^K Avg_{Û1,...,ÛK} Tr(ĝi† ĝi)/Ni.   (14)

The following central result, as proven in Appx. C, is based on an analysis of covariances, the law of total variance, and Theorem 1.
Theorem 2 (Equivalence of circuit cost-function concentration and gradient vanishing). When averaging over the variable unitaries {Ûi} of the quantum circuit Û according to the Haar measure, the total variance of the cost function (2) and the total variance of the full Riemannian gradient (14) are both bounded from below by single-gate variances V_i [Eq. (11)], and they are bounded from above by or proportional to the sum Σ_i V_i,

  V_i ≤ Var_{Û1,...,ÛK} E ≤ Σ_{i=1}^K V_i   (15a)

and

  Var_{Û1,...,ÛK} ĝ_full = 2 Σ_{i=1}^K V_i.   (15b)

In particular, if all single-gate variances V_i of polynomial-depth circuits (K = poly n) decay exponentially in the system size (number of qudits) n, then both total variances (12) and (14) decay exponentially in n.
Conversely, if one of the total variances decays exponentially in n, then all single-gate variances also decay exponentially.So, the barren-plateau problem and exponential cost-function concentration always appear simultaneously.
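As a self-contained illustration of the bounds in Eq. (15a) (a toy two-gate model, not the MERA benchmark discussed below), consider a separable cost of the form E(Û1, Û2) = f(Û1) g(Û2), i.e., a single term of the decomposition used in Appendix C. Estimating all moments from Haar samples, the single-gate variances bound the total variance from below and their sum bounds it from above:

```python
import numpy as np
from scipy.stats import unitary_group

rng = np.random.default_rng(5)
N, samples = 2, 500

def rand_herm(n):
    """Random Hermitian matrix (hypothetical environment operator)."""
    A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
    return (A + A.conj().T) / 2

A, B, C, D = (rand_herm(N) for _ in range(4))

# toy two-gate cost E(U1, U2) = f(U1) * g(U2): one term of the form (C1)
f = lambda U: np.trace(A @ U @ B @ U.conj().T).real
g = lambda U: np.trace(C @ U @ D @ U.conj().T).real

fs = np.array([f(unitary_group.rvs(N, random_state=rng)) for _ in range(samples)])
gs = np.array([g(unitary_group.rvs(N, random_state=rng)) for _ in range(samples)])

# total cost-function variance over both gates (all sample pairs)
total_var = np.var(np.outer(fs, gs))

# averaged single-gate variances V_i, cf. Eq. (11)
V1 = np.mean(gs**2) * np.var(fs)   # vary U1, average over U2
V2 = np.mean(fs**2) * np.var(gs)   # vary U2, average over U1

# Theorem 2 bounds: V_i <= Var(E) <= V_1 + V_2
assert max(V1, V2) <= total_var <= V1 + V2
```

For this separable cost, both inequalities hold exactly even at the level of sample moments: the gaps are Avg(f)² Var(g) (or vice versa) for the lower bound and Var(f) Var(g) for the upper bound, both manifestly nonnegative.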
FIG. 3. Numerical verification of Theorem 2 for binary one-dimensional MERA [25,26] with bond dimension χ = 2 and the cost function E given by the energy expectation value (16). In accordance with Eq. (15a), the cost-function variance is bounded from below by the single-gate variances V_τ,k and from above by Σ_τ,k V_τ,k.
Note that the conclusions below Eq. (15b) remain valid if we choose a different weighting in the definition of the full gradient variance (14). For example, we could also define it as a weighted sum over the single-gate terms Avg Tr(ĝi† ĝi), corresponding to an equal weighting of all rotation-angle derivatives in the parametrization discussed below Eq. (9).

V. NUMERICAL VERIFICATION
For an illustration of the general bounds on the cost-function variance in Theorem 2, consider a one-dimensional binary MERA |Ψ⟩ for spin-1/2 chains of length L = 3 · 2^T, where T is the number of layers in the MERA [25,26]. The cost function is given by the energy (density) expectation value

  E = ⟨Ψ|Ĥ|Ψ⟩/L,   (16)

where the Hamiltonian Ĥ is a sum of three-site interaction terms with Pauli operators σ̂x_i, σ̂y_i, σ̂z_i acting on site i. In the evaluation of the variances, Haar averages are executed by numerical sampling, and we denote the single-gate variances (11) by V_τ,k, where the position (τ, k) indicates the kth tensor in layer τ.
As shown in Fig. 3 for MERA with bond dimension χ = 2, the total cost-function variance Var(E) [Eq. (12)] is, in accordance with Eq. (15a), bounded from above by the sum of all single-gate variances Σ_τ,k V_τ,k, and the single-gate variances V_τ,k provide lower bounds. Within a given layer τ, the single-gate variances V_τ,k are approximately constant and, as shown in Refs. [16,17], they decrease exponentially in the layer index τ [Eq. (17)]. Hence, the best lower bound in Fig. 3 is given by the maximum of V_1,k. Note that the energy optimization problems for MERA, tree tensor network states [31-33], and matrix product states [34,35] for Hamiltonians with finite-range interactions are actually free of barren plateaus [16,17]. Nevertheless, the general variance relations from Theorem 2 apply.

VI. DISCUSSION
Given the equivalence of cost-function concentration and gradient vanishing on both the single-gate as well as full-circuit levels (Theorems 1 and 2, respectively), we can assess gradient vanishing and, especially, barren plateaus more easily through the scalar cost function. In fact, this route has already been pursued in recent analytic works on the trainability of variational quantum algorithms [19,36-40].
Inspired by the work of Arrasmith et al. [20] on parametrized circuits and Euclidean gradients, we studied the question in the Riemannian formulation, which makes the proofs rather simple and yields additional insights: (a) The single-gate variances of gradients and of the cost function turn out to be strictly proportional. (b) In the Euclidean formulation, Arrasmith et al. obtained bounds for the variance of cost-function differences E(θ) − E(θ′) in terms of K(n) and V(n), where θ′ is a random reference point, K(n) is the number of variable gates as a function of the system size n, and V(n) is a common upper bound for all single-gate (gradient) variances V_i. This difference construction turned out to be unnecessary in the Riemannian formulation, and we could access Var_{Û1,...,ÛK} E directly. (c) Furthermore, we obtained the tighter bound Var_{Û1,...,ÛK} E ≤ K(n)V(n). This result aligns with our experience in numerical simulations and could probably be further tightened.
While quantifying the total cost-function variance is easier than studying single-gate gradient variances or, equivalently, single-gate cost-function variances, the latter provide more detailed trainability information. For example, the single-gate variances in MERA tensor networks vary strongly from layer to layer. Gates in lower layers have a more substantial impact on the cost-function landscape than those in upper layers. This can be taken into account to improve optimization schemes [41].
As a specific example, consider the optimization of the quantum circuit that defines the MERA |Ψ⟩ to minimize the energy expectation value ⟨Ψ|Ĥ|Ψ⟩ for the spin-1/2 transverse-field Ising chain

  Ĥ = −Σ_i (σ̂x_i σ̂x_{i+1} + h σ̂z_i).   (18)

The Ising chain has a critical point at |h| = 1, where the ground state is particularly strongly entangled, featuring the entanglement log-area law. It follows from the analysis in Refs. [16,17] that the total cost-function variance is (up to finite-size corrections) independent of the system size L and decays algebraically with increasing MERA bond dimension χ. This means that there is no barren-plateau problem, and the optimization is in general possible. The single-gate variances provide more detailed information: MERA are hierarchical tensor networks with a layer structure. It turns out that the single-gate variances decay exponentially in the layer index τ = 1, ..., T, where T is the number of MERA layers [16,17]. See Eq. (17) for binary MERA with χ = 2. As demonstrated in Fig. 4, this suggests a more efficient optimization scheme [41], where we start by setting the gates of layers τ ≥ 2 to Û_τ,k = 1 and, initially, only optimize those of layer τ = 1. After a suitable number of iterations, we proceed by optimizing the gates of layers τ ≤ 2, then those of layers τ ≤ 3, and continue in this way, building up the MERA circuit layer by layer. The numerical results close to the critical point of the model confirm that this scheme is more efficient than the traditional approach of, right away, optimizing all layers simultaneously. On average, one achieves a higher energy accuracy, and fewer circuits remain stuck in local minima. While single-gate variances provide considerably more information than the total cost-function variance alone, they still give limited insight about trainability and convergence properties. They can show that gradient amplitudes are on average above or below certain thresholds, but they are certainly not a measure for the complexity of the cost-function landscape,
the importance of local minima, or specifics of optimization trajectories. As discussed in Ref. [17] and shown diagrammatically in Fig. 6, the single-gate gradient variance (9) evaluates to the expression (B5). So, the cost-function variance (B4) is exactly half of the Riemannian gradient variance (B5).
Appendix C: Proof of Theorem 2

Proof. (a) Let us first consider a circuit with only two variable unitaries Û1 and Û2. In this case, the cost function (2) can be written in the form

  E(Û1, Û2) = Σ_a f_a(Û1) g_a(Û2),   (C1)

where f_a and g_a are continuous functions which only depend on Û1 and Û2, respectively: Analogously to Fig. 2c, we can always bipartition the tensor network for E(Û1, Û2) into two parts f and g with f containing Û1, Û1† and g containing Û2, Û2†. The contraction of the two parts (operator products and trace to obtain the scalar E) then corresponds to the sum over a in Eq. (C1).
The single-gate cost variance for Û1 at fixed Û2 then is (Avg_i ≡ Avg_Ûi and Var_i ≡ Var_Ûi)

  Var_1 E = Σ_{a,b} [Avg_1(f_a f_b) − (Avg_1 f_a)(Avg_1 f_b)] g_a(Û2) g_b(Û2).   (C2)

The generalization to a circuit with K variable unitaries follows by iterating Eq. (C3). Decomposing the tensor network as before into K parts, each containing only one of the variable unitaries and its adjoint, we can write the cost function in the form

  E = E(Û1, Û2, ..., ÛK) = Σ_a f_a^(1)(Û1) f_a^(2)(Û2) ⋯ f_a^(K)(ÛK),

and the resulting single-gate variance relation holds for all i. Recall that, given two random variables E and U_i on the same probability space (M), the law of total variance states that

  Var(E) = Avg[Var(E|U_i)] + Var[Avg(E|U_i)].

This corresponds to the first step in Eq. (C6). In the second step, we have used the nonnegativity of the variance.
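The law of total variance used in the first step of (C6) can be made concrete with a small numerical example (a hypothetical single-term separable cost E = f(Û1) g(Û2) as in (C1), with f and g replaced by independent random scalars for simplicity):

```python
import numpy as np

rng = np.random.default_rng(11)
n = 1000

# stand-ins for f(U1) and g(U2) sampled over the two independent gates
fs = rng.normal(loc=1.0, scale=0.5, size=n)
gs = rng.normal(loc=-2.0, scale=1.0, size=n)

# total variance of E = f * g over all sample pairs (U1, U2)
total_var = np.var(np.outer(fs, gs))

# law of total variance, conditioning on U1: for the separable cost,
# Var(E|U1) = f(U1)^2 Var(g) and Avg(E|U1) = f(U1) Avg(g)
avg_cond_var = np.mean(fs**2) * np.var(gs)
var_cond_avg = np.var(fs) * np.mean(gs)**2
assert np.isclose(total_var, avg_cond_var + var_cond_avg)
```

For this product structure, the decomposition holds exactly at the level of sample moments, since both sides reduce to Avg(f²)Avg(g²) − Avg(f)²Avg(g)².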

FIG. 4. Optimization of randomly initialized binary-MERA quantum circuits with bond dimension χ = 2 for the minimization of the energy expectation value of the spin-1/2 transverse-field Ising chain (18) at h = 1.03. The plots show the optimization history for the deviation of the energy density e = ⟨Ψ|Ĥ|Ψ⟩/L from the exact ground-state energy density e∞_gs. Insets show histograms for the accuracy of 100 randomly initialized MERA with 6 layers after 1200 iterations. (a) The single-gate variances (11) turn out to decay exponentially in the layer index τ [16,17]. As indicated by the gray regions, we hence first optimize layer τ = 1 only, then, after 100 iterations, all tensors of layers τ ≤ 2, then all tensors of layers τ ≤ 3, etc. (b) In contrast, using the traditional approach of simultaneously optimizing all layers from the very beginning leads to considerably slower convergence and more circuits stuck in local minima.