Laziness, Barren Plateau, and Noise in Machine Learning

We define \emph{laziness} as a large suppression of variational parameter updates in neural networks, classical or quantum. In the quantum case, the suppression is exponential in the number of qubits for randomized variational quantum circuits. We discuss the difference between laziness and the \emph{barren plateau}, a term introduced in quantum machine learning by quantum physicists in \cite{mcclean2018barren} for the flatness of the loss function landscape during gradient descent. We present a novel theoretical understanding of these two phenomena in light of the theory of neural tangent kernels. For noiseless quantum circuits, without measurement noise, the loss function landscape is complicated in the overparametrized regime with a large number of trainable variational angles. Around a random starting point in optimization, there are large numbers of local minima that are good enough and could minimize the mean square loss function: we still have quantum laziness, but we do not have barren plateaus. However, the complicated landscape is not visible within a limited number of iterations and with low precision in quantum control and quantum sensing. Moreover, we look at the effect of noise during optimization by assuming intuitive noise models, and show that variational quantum algorithms are noise-resilient in the overparametrized regime. Our work precisely reformulates the quantum barren plateau statement as a precision statement and justifies it in certain noise models, injects new hope into near-term variational quantum algorithms, and provides theoretical connections to classical machine learning. Our paper provides conceptual perspectives on quantum barren plateaus, together with discussions of the gradient descent dynamics in \cite{together}.

However, a generically designed variational quantum ansatz may not be applicable to real problems. Specifically, the so-called barren plateau problem has been widely discussed in the variational quantum algorithm community and is believed to be one of the primary problems of quantum machine learning [1]. The argument is as follows. A typical gradient descent algorithm looks like
\begin{equation} \theta_\mu(t+1) - \theta_\mu(t) \equiv \delta\theta_\mu = -\eta \frac{\partial \mathcal{L}}{\partial \theta_\mu}, \tag{1} \end{equation}
where $\theta_\mu$ is the variational angle, $t$ refers to the time step of the gradient descent dynamics, $\eta$ is the learning rate, and $\mathcal{L}$ is the loss function. The observation of [1] is that, if our variational ansatz is highly random, then due to the $k$-design integral formula [17][18][19][20] the derivative of the loss function is generically suppressed by the dimension of the Hilbert space $N$, and we might encounter a situation where the variation of the loss function during gradient descent is very small, namely $\delta\mathcal{L} \equiv \mathcal{L}(t+1) - \mathcal{L}(t) \ll 1$ for the step $t$. For instance, the second moment formula for the Haar ensemble is
\begin{equation} \int dU\, U_{ij} U^\dagger_{kl} = \frac{\delta_{il}\delta_{jk}}{N}. \tag{2} \end{equation}
Here $U$ is a unitary taken from a 1-design, $\delta$ is the Kronecker delta, and $i, j, k, l$ are matrix indices. For higher-moment random integrals [17][18][19][20][21], factors of poly$(1/N)$ will appear. Thus, the difference between the variational angles during iterations will be suppressed by the dimension of the Hilbert space. The work [1] demonstrates the existence of the barren plateau (the statement that $\delta\mathcal{L} \ll 1$) numerically and interprets the result as a primary challenge for variational quantum circuits. It is often considered to be the quantum analog of the vanishing gradient problem, but its nature is fundamentally different [22,23]. A further explanation is given in Appendix A.
Although the existence of the barren plateau has been verified by numerous works [24][25][26][27], the theoretical understanding of the barren plateau problem is unclear. Moreover, the classical machine learning community has successfully demonstrated the practical use of its methods in science and business for years, and many successful classical neural network algorithms have been run at large scales. For example, Generative Pre-trained Transformer-3 (GPT-3) from OpenAI [28] uses 175 billion trainable parameters, and it is one of the most successful natural language processing models to date. Considering the standard LeCun initialization of weights $W$ with the normalization of the variance $\sigma_W^2$ [22,23,29],
\begin{equation*} \mathbb{E}(W_{ij}W_{kl}) = \frac{\sigma_W^2}{\text{width}}\,\delta_{ik}\delta_{jl}, \end{equation*}
and its formal similarity to Equation 2, we might imagine that similar issues will happen for classical neural networks too: they might be highly overparametrized in the large-width limit. Here, $\sigma_W$ is a number that is independent of the size of the neural network, and we set the width of the neural network to be the same in each layer for simplicity. In fact, in Appendix A, we will show that the same phenomenon also happens in classical large-width neural networks: the trainable weights do not move much during gradient descent.
So, why are classical overparametrized neural networks considered practical and good, while the barren plateaus of quantum neural networks are considered crucial challenges? In this paper, we define the primary theoretical argument behind the quantum barren plateau, the large suppression of the right-hand side of Equation 1, as laziness. In the quantum context, the suppression comes from the dimension of the Hilbert space, while in the classical case, the suppression comes from the width of the classical neural network. In more precise language, laziness refers to small $\delta\theta_\mu$, and the barren plateau refers to small $\delta\mathcal{L}$.
Moreover, we will show that laziness may not imply the quantum barren plateau, from the perspective of overparametrization theory and representation learning theory through quantum neural tangent kernels (QNTKs) [2,29]. In this paper, for quantum neural networks, overparametrization refers to the regime where $\eta L\,\mathrm{Tr}(O^2)/N^2 \approx O(1)$, where $O$ is the operator we are optimizing, $L$ is the number of trainable angles, and $\eta$ is the learning rate, taken to be a constant.
Defining quantum analogs of neural tangent kernels from their classical counterparts [23, 30-40], we show from a first-principles theoretical derivation that random (noiseless) quantum neural networks can still be trained efficiently in the large-$L$ limit without barren plateaus, despite their laziness. In fact, although each trainable angle does not move much due to the small magnitude of the gradient, the combined effect of many of them on the loss function will still be significant. In addition, there exist good enough achievable local minima that minimize the training error. See Figure 1 for an illustration. This happens in particular when $L\,\mathrm{Tr}(O^2)/N^2 \approx O(1)$, with a small learning rate and the mean square loss function. In the case of a large Hilbert space dimension without overparametrization, the exponential decay rate during gradient descent might be small, so the phenomenon may not become manifest within polynomially many training iterations. In practice, what we see is a very slow decay of the loss function. Interestingly, in this case quantum noise will not affect us significantly until exponentially many iterations. Thus, the averaged QNTK, $\bar{K}$, proportional to $\mathrm{Tr}(O^2)L/N^2$, explains the existence of the barren plateau in practice, with or without noise. On the other hand, in the overparametrization regime where $\eta L\,\mathrm{Tr}(O^2)/N^2 \approx O(1)$, the exponential decay of the gradient descent process is visible.
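The claim that many tiny angle updates can still add up to a visible change of the loss can be illustrated with a toy calculation. The following is only a minimal sketch and not the numerical setup of the paper: it assumes a linearised residual error whose gradient has identical components $\sqrt{K/L}$ for every angle, so that the kernel $K$ is fixed while the number of angles $L$ grows. The function name `lazy_step` and its parameters are hypothetical.

```python
import math


def lazy_step(num_angles, kernel=1.0, lr=0.5, eps0=1.0):
    """One gradient step on a toy linearised model of the residual error.

    Assumption (for illustration only): every gradient component has the
    same size g = sqrt(kernel / num_angles), so sum_l g**2 == kernel.
    """
    g = math.sqrt(kernel / num_angles)
    # Gradient descent on the loss eps**2 / 2: each angle moves by -lr * eps * g.
    delta_theta = -lr * eps0 * g
    # Linear response of the residual error when all angles move together.
    delta_eps = num_angles * g * delta_theta
    return abs(delta_theta), delta_eps


for num_angles in (10, 1_000, 100_000):
    step, change = lazy_step(num_angles)
    print(num_angles, step, change)
```

In this toy model the per-angle update shrinks like $1/\sqrt{L}$ (laziness), while the change of the residual error stays at $-\eta K \varepsilon$, independent of $L$ (no plateau).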
We note that the large-$L$ expansion is a quantum analog of the classical neural tangent kernel theory at large width. In fact, we will show in Section 2 that we have a large-width expansion similar to the classical theory, where in our model the classical width corresponds to $L$. The dimension of the Hilbert space plays an important role in the calculation. Moreover, the correspondence between quantum and classical neural networks might be explained by some physical heuristics, from the duality between matrix models and quantum field theories. See Appendix C for a brief discussion.
Moreover, we need to point out that laziness is intrinsically still a precision problem. More precisely, it could primarily come from quantum measurement and quantum control, since the size of classical devices could scale as $\log(1/\epsilon)$ for a given precision $\epsilon$, while variational quantum circuits cannot, due to the measurement error and the limitation of quantum control [1]. Thus, it naturally motivates us to think about how to include the effect of noise in the gradient descent calculation. Our analysis indicates that we could get good predictions at the end as long as we sufficiently control the noise.
We will give more details in the following sections.

The loss function landscape and the QNTK theory
We begin by considering a variational quantum circuit ansatz, on a Hilbert space of size $N$ with $\log_2 N$ qubits, as follows,
\begin{equation*} U(\theta) = \prod_{\ell=1}^{L} W_\ell \exp(i\theta_\ell X_\ell), \end{equation*}
with some trainable angles $\theta_\ell$, constant unitary operators $W_\ell$, and Pauli operators $X_\ell$. Following [29], we consider the mean square loss function
\begin{equation} \mathcal{L}(\theta) = \frac{1}{2}\left(\langle\Psi_0|U^\dagger(\theta) O U(\theta)|\Psi_0\rangle - O_0\right)^2 \equiv \frac{1}{2}\varepsilon^2, \tag{9} \end{equation}
and train the expectation value $\langle\Psi_0|U^\dagger(\theta) O U(\theta)|\Psi_0\rangle$ on an initial state $|\Psi_0\rangle$ towards a value $O_0$. We define the residual training error $\varepsilon = \langle\Psi_0|U^\dagger(\theta) O U(\theta)|\Psi_0\rangle - O_0$. We use the gradient descent algorithm of Equation 1 with the learning rate $\eta$ and an initial variational angle $\theta(0)$. We now look at the difference of the residual training error, $\delta\varepsilon \equiv \varepsilon(t+1) - \varepsilon(t)$. When the learning rate $\eta$ in Equation 1 is small, we can perform a Taylor expansion,
\begin{equation} \delta\varepsilon \approx \sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\,\delta\theta_\ell = -\eta \sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\frac{\partial\varepsilon}{\partial\theta_\ell}\,\varepsilon = -\eta K \varepsilon. \tag{11} \end{equation}
The quantity $K$ here is called the Quantum Neural Tangent Kernel (QNTK) [29], $K = \sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\frac{\partial\varepsilon}{\partial\theta_\ell}$. Note that in a general supervised learning setup, where one has a labeled dataset instead of just one expected value $O_0$, $K$ is a positive-semidefinite and symmetric matrix instead of a non-negative number. Here we focus on the optimization problem of Equation 9: this example will demonstrate the validity of our theory, which can be readily generalized to a full supervised quantum machine learning setup.
A frozen QNTK, which remains constant during the gradient descent flow, leads to gradient flow equations that can be solved exactly [29], showing that the error decays exponentially with the gradient descent iteration $t$ as
\begin{equation} \varepsilon(t) \approx (1-\eta K)^t\,\varepsilon(0). \tag{12} \end{equation}
For sufficiently random variational ansätze, we could compute the value of $K$ based on the same assumption as the barren plateau problem [1], by taking the 2-design random average $\mathbb{E}$ (see [2] for more details). More precisely, we define $V_{-,\ell}$ and $V_{+,\ell}$ as the parts of the ansatz before and after the $\ell$-th trainable rotation, and we assume that $V_{-,\ell}$ and $V_{+,\ell}$ form 2-designs independently for all $\ell$. We get the following expression for the averaged QNTK, to leading order at large $N$ and up to $O(1)$ numerical factors,
\begin{equation} \bar{K} \equiv \mathbb{E}(K) \propto \frac{L\,\mathrm{Tr}(O^2)}{N^2}. \tag{15} \end{equation}
This simple equation combined with Equation 12 reveals how, on average, the residual training error of the gradient descent dynamics will decay exponentially. Moreover, one should also check the standard deviation $\Delta K$. If $\Delta K \ll \bar{K}$, we get a distribution of $K$ which is concentrated at $\bar{K}$. In fact, one could show from $k$-design assumptions that $\Delta K/\bar{K} = O(1/\sqrt{L})$. In the limit where $L \gg 1$, the neural tangent kernel is concentrated around a fixed value $\bar{K}$. A more precise constraint will also include a time-dependent statement including the perturbations from the higher-order Taylor expansion of the residual training error, which is characterized by the so-called quantum meta-kernel or dQNTK. See Appendix B for more details.
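The exponential decay with a frozen kernel can be checked by iterating the linearised update directly. This is a minimal sketch under the assumption that the update is $\varepsilon(t+1) = \varepsilon(t) - \eta K \varepsilon(t)$ with a constant, non-negative $K$; the function names and the parameter values in the usage lines are illustrative only.

```python
def residual_error_trajectory(eps0, lr, kernel, steps):
    """Iterate eps(t+1) = eps(t) - lr * kernel * eps(t) with a frozen kernel."""
    eps = [eps0]
    for _ in range(steps):
        eps.append(eps[-1] - lr * kernel * eps[-1])
    return eps


def closed_form(eps0, lr, kernel, t):
    """Closed-form solution (1 - lr * kernel)**t * eps0 of the recursion above."""
    return (1.0 - lr * kernel) ** t * eps0


# Illustrative values: lr * kernel = 0.4, so the error shrinks by 0.6 per step.
trajectory = residual_error_trajectory(eps0=1.0, lr=0.1, kernel=4.0, steps=20)
print(trajectory[20], closed_form(1.0, 0.1, 4.0, 20))
```

The larger the product of learning rate and kernel (while it stays below 1), the faster the iterated error approaches zero.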

Precision and noise
Now we give some physical interpretations of Equation 15. We see in Section 2 that the theory should work in the regime where $L \gg 1$, and also in the overparametrization regime where $\eta K \approx O(1)$. From Equation 12, we know that $K$ serves as the exponent of the exponential decay: the larger $K$ is, the faster the algorithm will converge. This qualitative description has been formulated in [29], with numerical evidence in [41] around the same time.
Moreover, a statement about precision could be made by combining Equation 12 and Equation 15. We have
\begin{equation*} T \approx \frac{\log(1/\varepsilon_r)}{\eta \bar{K}}. \end{equation*}
Here, $T$ is the total number of training steps, and $\varepsilon_r$ is the relative residual training error around the end of training, $\varepsilon_r = \varepsilon(T)/\varepsilon(0)$. The relative error $\varepsilon_r$ could be as small as the precision of the quantum device. Using Equation 15, we get, up to $O(1)$ numerical factors,
\begin{equation} T \sim \frac{N^2 \log(1/\varepsilon_r)}{\eta L\,\mathrm{Tr}(O^2)}. \tag{18} \end{equation}
Equation 18 makes the barren plateau problem manifestly a precision problem. If we want to see the convergence within $T \approx O(1)$, we want $\eta \bar{K} \approx 1$. The smaller $\bar{K}$ is, the smaller the decay exponent, and the more likely we are to experience a barren plateau in practice. Otherwise, there will be good enough local optima around the small random fluctuations of the variational angles. The more overparametrized the quantum neural networks are, the faster they can converge. In this case, we do not have a barren plateau if we assume that we have neither measurement noise nor quantum hardware noise, although we have laziness.
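To get a feeling for how the number of training steps depends on the number of qubits, one can evaluate the scaling numerically. This is a rough sketch under stated assumptions: the averaged kernel is taken to be exactly $L\,\mathrm{Tr}(O^2)/N^2$ with all $O(1)$ prefactors dropped, the observable is a Pauli operator with $\mathrm{Tr}(O^2) = N$ unless specified, and the decay follows $(1-\eta K)^T$. The function names are hypothetical.

```python
import math


def averaged_kernel(num_angles, num_qubits, trace_o_squared=None):
    """Estimate K ~ L * Tr(O^2) / N^2 with N = 2**num_qubits (prefactors dropped)."""
    dim = 2 ** num_qubits
    if trace_o_squared is None:
        trace_o_squared = dim  # assumption: Pauli observable, Tr(O^2) = N
    return num_angles * trace_o_squared / dim ** 2


def steps_to_reach(relative_error, lr, kernel):
    """Smallest integer T with (1 - lr * kernel)**T <= relative_error."""
    rate = lr * kernel
    if not 0.0 < rate < 1.0:
        raise ValueError("this sketch assumes 0 < lr * kernel < 1")
    return math.ceil(math.log(relative_error) / math.log(1.0 - rate))


# Same number of angles, growing Hilbert space: the step count blows up.
for num_qubits in (4, 6, 8, 10):
    k = averaged_kernel(num_angles=64, num_qubits=num_qubits)
    print(num_qubits, k, steps_to_reach(1e-3, lr=0.1, kernel=k))
```

Keeping the number of angles fixed while adding qubits shrinks the kernel like $1/N$ for a Pauli observable, which is the slow decay seen in practice; increasing the number of angles along with $N$ keeps the step count under control.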
Originally, a relation between the barren plateau problem and precision was also stated in [1], while we make it clearer by showing that the barren plateau is not algorithmic. In fact, in Appendix A, we show that classical overparametrized neural networks have laziness as well. Many useful, practical machine learning algorithms have to be in this regime [23]. Thus, variational quantum algorithms here have no algorithmic issue, and the origin of the problem comes from measurement and control (see also [42]).
Let us take a look at Equation 1 again. To implement variational algorithms, we need to perform measurements to evaluate the loss function or its derivatives (involving quantum measurements), and update the trainable angles through Equation 1 (involving quantum control). On the measurement side, classical computations could handle a precision $\epsilon$ with resources scaling as $\log(1/\epsilon)$, while measurement errors will be produced in the quantum setup, making the scaling $1/\epsilon^\alpha$ for positive $\alpha$ [1]. There is no known way to date to avoid this, because of the limitations of metrology [43].
On the control side, it is also challenging to update the variational angles with exponential precision. In a sense, our theory makes the statement from [1] more precise.
The discussion naturally motivates us to introduce a noise model. Heuristically, we expect that during the gradient descent process, the effective noise term will also decay exponentially because of the original recurrence relation and its solution. To verify this, we could add a random fluctuation term $\Delta\theta$ to model the uncertainty of measuring the expectation value. One could also assume that the random variable $\Delta\theta$ is Markovian, namely, that it is independent between time steps $t$. Moreover, we assume that the $\Delta\theta$s are distributed with Gaussian distributions $\mathcal{N}(0, \sigma_\theta^2)$. Note that $\sigma_\theta$ could come from the measurement noise during the estimation of quantum observables used for the gradient descent, which scales as $1/\sqrt{n}$, where $n$ is the number of measurements; the Gaussian assumption then comes from the central limit theorem in the large-$n$ limit. Furthermore, $\sigma_\theta$ could also come from hardware noise. On the other hand, the physical implementation of the rotation angles will also have limited precision. One could note that robust quantum control techniques can suppress errors of rotation angles to higher orders, see [44].
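Thus, one could show that the residual training error has the recursion relation, to linear order in the Taylor expansion,
\begin{equation*} \varepsilon(t+1) - \varepsilon(t) = -\eta K \varepsilon(t) + \sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\,\Delta\theta_\ell(t). \end{equation*}
Now, let us assume that $K$ is still a constant. Including the noise term in the recursion relation and averaging over the random distribution of the noise, one could show that
\begin{equation*} \overline{\varepsilon^2}(t) = (1-\eta K)^{2t}\left(\varepsilon^2(0) - \frac{\sigma_\theta^2}{\eta(2-\eta K)}\right) + \frac{\sigma_\theta^2}{\eta(2-\eta K)}. \end{equation*}
Note that the first term decays as the time $t$ increases. At late times, we have
\begin{equation*} \overline{\varepsilon^2}(\infty) \approx \frac{\sigma_\theta^2}{\eta(2-\eta K)}, \end{equation*}
where we assume the overparametrization $\eta K \approx O(1)$. Thus, at late times, the loss function will arrive at a constant plateau of order $O(\sigma_\theta^2/\eta)$. One could improve $\sigma_\theta$ to make the constant plateau controllable, so that it does not increase significantly with $N$, indicating that our algorithm could be noise-resilient. See Appendix D for a more detailed discussion, and see Figure 1 for an illustration. Some numerical results are also shown in Figure 2 and Figure 3.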
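The late-time plateau can be checked with a small Monte Carlo experiment. This is a minimal sketch, not the simulation used for the figures: it assumes a frozen kernel $K$ and the scalar recursion $\varepsilon(t+1) = (1-\eta K)\varepsilon(t) + \sqrt{K}\,\xi_t$ with independent $\xi_t \sim \mathcal{N}(0, \sigma_\theta^2)$, whose stationary second moment is $\sigma_\theta^2/(\eta(2-\eta K))$. Function names, the seed, and the parameter values are illustrative.

```python
import random
import statistics


def simulate_noisy_descent(eps0, lr, kernel, sigma, steps, rng):
    """Run eps(t+1) = (1 - lr*kernel) * eps(t) + sqrt(kernel) * xi, xi ~ N(0, sigma^2)."""
    eps = eps0
    for _ in range(steps):
        eps = (1.0 - lr * kernel) * eps + kernel ** 0.5 * rng.gauss(0.0, sigma)
    return eps


def plateau(lr, kernel, sigma):
    """Late-time mean of eps^2 for the recursion above (needs 0 < lr*kernel < 2)."""
    return sigma ** 2 / (lr * (2.0 - lr * kernel))


rng = random.Random(0)  # fixed seed so the run is reproducible
lr, kernel, sigma = 0.1, 5.0, 0.01
finals = [simulate_noisy_descent(1.0, lr, kernel, sigma, 200, rng) for _ in range(4000)]
empirical = statistics.fmean(e * e for e in finals)
print(empirical, plateau(lr, kernel, sigma))
```

In this toy recursion the plateau depends on the noise width and the learning rate, but only weakly on the kernel once $\eta K$ is of order one.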

Conclusion and outlook
In this paper, we point out that for variational circuits with sufficiently large numbers of trainable angles, the gradient descent dynamics could still be performed efficiently, despite the exponential suppression of the variational angle updates (laziness). We point out that laziness does not happen only in quantum machine learning, but also in overparametrized classical neural networks with large widths. The efficiency of large-width neural networks is justified by the neural tangent kernel theory, and so is that of their quantum counterparts. A solid and simple theory has been established based on the above ideas, and the relation between the number of training steps, the quantum device error, the trainable depth, the dimension of the Hilbert space, and the norm of operators appearing in the loss function has been explicitly derived. Moreover, we have justified that for simple and natural noise models, we could make variational quantum circuits noise-resilient in the overparametrized regime, with solid theoretical and numerical evidence.
Our results also indicate a better-defined path to designing quantum neural networks from first principles. If we are sampling unitary operators uniformly over the whole unitary group, it is hard to avoid polynomial factors of $N$, the dimension of the Hilbert space, in the expression for the number of iterations needed to see beyond the laziness (see parallel efforts in [45,46]). One idea is to reduce the search space, restricting the variational circuits to some subspaces, where people observe some evidence in setups such as quantum convolutional neural networks [25,47] and local loss functions [24], and the barren plateau phenomena are less drastic in those cases. However, since the subspace we are searching is reduced, the decreased expressibility will lead to a lower performance in the final convergence of the loss function on the training set [45]: around the end of the training, drastic corrections to the fixed neural tangent kernels will stop the exponential decay, and we get a local minimum which may not be good enough. The design of variational circuits will be a trade-off between barren plateaus and performance [48], which could be manifest in the presence of laziness. Besides generalizations to full learning setups with multiple output dimensions, other interesting directions include detailed discussions of the quantum noise in real machines during quantum representation learning, to understand how the noise will affect laziness and the barren plateau; a justification of our theory with large-scale classical and quantum simulation; and possible theoretical understandings beyond the limit $L \gg 1$. We look forward to further analysis and research along our path.
Note added: When this paper was being finished, we noticed that another nice independent paper [49] had appeared on the arXiv, with conclusions very similar to our results.

Appendix A Comments on the barren plateau in the classical machine learning
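Now we consider a classical neural network, the MLP model (see [23]). The definition is
\begin{equation*} z^{(1)}_{i;\alpha} = b^{(1)}_i + \sum_{j=1}^{n_0} W^{(1)}_{ij}\, x_{j;\alpha}, \qquad z^{(\ell+1)}_{i;\alpha} = b^{(\ell+1)}_i + \sum_{j=1}^{n_\ell} W^{(\ell+1)}_{ij}\, \sigma\!\left(z^{(\ell)}_{j;\alpha}\right). \end{equation*}
Here, $\sigma$ is a non-linear activation function, and we have widths $n_1, \ldots, n_{L-1}$ for the hidden layers. The input dimension is $n_0$ and the output dimension is $n_L$. Weights and biases at layer $\ell$ are denoted as $W^{(\ell)}$ and $b^{(\ell)}$. $z^{(\ell)}$ is called the preactivation. $x_{j;\alpha}$ denotes the data, where $j$ is the vector index and $\alpha$ is the data sample index. At the beginning, we initialize the neural network with zero-mean Gaussian weights and biases,
\begin{equation*} \mathbb{E}\left(b^{(\ell)}_{i} b^{(\ell)}_{j}\right) = \delta_{ij}\, C_b, \qquad \mathbb{E}\left(W^{(\ell)}_{i_1 j_1} W^{(\ell)}_{i_2 j_2}\right) = \delta_{i_1 i_2}\delta_{j_1 j_2}\, \frac{C_W}{n_{\ell-1}}. \end{equation*}
Here, $C_b$ and $C_W$ set the variance of biases and weights (we use the notation $C_W = \sigma_W^2$ in the main text). We train the neural network by gradient descent. We could consider the simplest version of the gradient descent algorithm,
\begin{equation*} \theta_\mu(t+1) = \theta_\mu(t) - \eta \frac{\partial \mathcal{L}_{\mathcal{A}}}{\partial \theta_\mu}. \end{equation*}
The loss function is
\begin{equation*} \mathcal{L}_{\mathcal{A}} = \frac{1}{2} \sum_{\alpha \in \mathcal{A}} \sum_{i} \varepsilon_{i;\alpha}^2, \end{equation*}
where $\alpha \in \mathcal{A}$ runs over a training set $\mathcal{A}$, and we have a supervised learning task with the data label $y$. $z_{i;\alpha} \equiv z^{(L)}_{i;\alpha}$ is the final prediction from the MLP model, $\eta$ is the training rate, and $\theta_\mu$ is a vector combining all $W$s and $b$s. $\varepsilon$ here is the residual training error, $\varepsilon_{i;\alpha} = z_{i;\alpha} - y_{i;\alpha}$.

A.1 The fundamental difference between barren plateau and vanishing gradient
Firstly, we wish to comment on the fact that there is a fundamental difference between the barren plateau problem and the vanishing gradient problem.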
The vanishing gradient problem is claimed to be a challenge for machine learning algorithms, where the gradient vanishes for some neural network constructions, making the network challenging to train [50,51]. A standard and traditional explanation of the vanishing gradient problem is the multiplicatively large number of layers in a deep neural network. The loss will have exponential behavior with respect to some multiplicative factors during gradient descent, which will cause the loss function to either explode or vanish if there is no fine-tuning. A resolution of the vanishing gradient problem is associated with the idea of He initialization or Kaiming initialization, which fine-tunes the neural network towards its critical point [52] (see also [23]).
The barren plateau problem is a term invented by the quantum community, starting with [1]. As far as we know, outside of geography, there is no such term in classical machine learning.

One could compute the average of the NTK. One could define the frozen NTK $\bar{H}$ and the fluctuating NTK $\Delta H$ through $H_{i_1 i_2;\alpha_1\alpha_2} = \bar{H}_{i_1 i_2;\alpha_1\alpha_2} + \Delta H_{i_1 i_2;\alpha_1\alpha_2}$, where the fluctuations $\Delta H$ are suppressed at large width with coefficients $A$ and $B$. The full expressions of $A$, $B$ are given in Chapter 8 of [23]. Similarly, in the statistics language, one could check [31]. The suppression of $\Delta H$ at large width indicates that large-width neural networks will learn efficiently through a non-trivial $\bar{H}_{i_1 i_2;\alpha_1\alpha_2}$, which is guaranteed to converge exponentially. In the large-width limit, the gradient descent algorithm is theoretically equivalent to the kernel method, where the kernel is defined effectively by the NTK. In Chapter 11 of [23], it is shown that the dNTK, the higher-order correction to the exponential decay, will vanish on its own after averaging over the Gaussian distribution of weights and biases. Moreover, the correlations between the dNTK and other operators, which involve even numbers of $W$s in total, will be suppressed polynomially by the large width. Those theoretical results are classical analogs of the random unitary calculations done in our work.

B Some further details about concentration conditions
For concentration conditions including the quantum meta-kernel, one could see [2] for further details.
Here we provide a simple review. We would like to ask when the QNTK approximation is valid. When the learning rate is small, the error of the prediction in Equation 15 could possibly come from two sources: the fluctuation of $K$ about $\bar{K}$ during the gradient descent, and the higher-order corrections compared to the leading-order Taylor expansion in Equation 11. The fluctuation $\Delta K$ could come from higher-order statistical calculations over the $k$-design assumption, similar to the analysis of higher-order effects in the barren plateau setup [26], in the large-$N$ limit, and we present a detailed calculation in [2] with formulas up to 4-designs. Moreover, we could look at higher-order corrections to the Taylor expansion through the quantum meta-kernel (dQNTK) $\mu$ [29], which could again be computed statistically using $k$-design formulas. One can show that $\mathbb{E}(\mu) = 0$ (which is the same as its classical counterpart [23]), and its fluctuation can be evaluated in the large-$N$ limit. The QNTK estimate in Equation 15 is valid when two conditions hold, Equation 43 and Equation 44, which we call the concentration conditions. Here, we denote $\varepsilon(0) = \varepsilon(t=0)$, and we assume that $\mathrm{Tr}(O^2) \equiv \Omega_O^2 > \mathrm{Tr}^2(O)$. This is correct, for instance, if $O$ is a Pauli operator, where we have $\mathrm{Tr}(O^2) = N$ but $\mathrm{Tr}^2(O) = 0$.
Note that the condition Equation 44 is a weak condition. It only tells us how small $\eta$ needs to be to make sure that the linear expansion is valid. In practice, we often assume that $\eta < O(1)$ and $\Omega_O \geq O(N)$, so Equation 44 is automatically satisfied. The condition that usually matters is Equation 43, which is the definition of overparametrization here, $L \gg 1$. Thus, if $L$ is large, the prediction will be correct, no matter how large $N$ is. But if $N$ is large, the decay rate $\bar{K}$ itself will be small. So this is exactly the definition of the barren plateau! Furthermore, we wish to mention that if we only count powers of $N$ and $L$, demand $\bar{K} = O(1)$, and ignore $\eta$, we get $L = O(N)$, so we get $\Delta K/\bar{K} = O(1/N)$ as well. The $1/N$ or 1/width expansion is exactly what is observed in classical neural networks [23]. The origin of this equivalence comes from the similarity between Equation 2 and Equation 46, while a higher-level (but heuristic) understanding comes from a connection between quantum field theory and the large-width expansion [23,37,38] and a similarity between Feynman rules in quantum field theory and matrix models [54], which we will briefly explain in Appendix C for readers who are interested in how the observations of this paper might be discovered from another perspective.

C A physical interpretation
Here we make some comments about possible heuristic physical interpretations of the agreement between classical and quantum neural networks. There is a duality, pointed out in [23, 37-39], where large-width classical neural networks could be understood in the language of quantum field theory. In the large-width limit, the output of neural networks will follow a Gaussian process, averaging with respect to the Gaussian distribution over weights and biases according to the LeCun parametrization,
\begin{equation} \mathbb{E}(W_{ij}W_{kl}) = \frac{\sigma_W^2}{\text{width}}\,\delta_{ik}\delta_{jl}, \tag{46} \end{equation}
or more generally, the corresponding Gaussian moment formulas of order $2k$ for all positive integers $k$. Here, we are considering the multilayer perceptron (MLP) model with weights $W$, and the width is defined as the number of neurons in each layer. The limit is mathematically similar to the large-$N$ limit of gauge theories, which become almost generalized free theories.
We could understand the ratio between the depth, the number of layers, and the width, the number of neurons, as perturbative corrections against the Gaussian process, which is similar to what we have done in the large-N expansion of gauge theories.
This physical interpretation will also be helpful when we consider its quantum generalization. If classical MLPs are similar to quantum field theories, quantum neural networks will be similar to matrix models [55,56]. Matrix models have been studied for a long time, around and after the second string theory revolution [54], and they have deep connections to the holographic principle [57] and the AdS/CFT correspondence [58,59]. Haar ensembles are toy versions of matrix models, which have been widely studied as toy models of chaotic quantum black holes [17,60]. The similarity between the LeCun parametrization, Equation 46, and the 1-design Haar integral formula $\mathbb{E}(U_{ij}U^\dagger_{kl}) = \delta_{il}\delta_{jk}/\dim\mathcal{H}$, or more generally the higher-moment Haar integral formulas, where $\dim\mathcal{H}$ is the dimension of the Hilbert space, might potentially be related to the similarity of Feynman rules between matrix models and quantum field theories. Thus, the similarity between quantum and classical neural networks might have a physical interpretation in terms of matrix models and their effective field theory descriptions.
The above analogy is heuristic. We should point out that machine learning and physical systems are very different. Some mathematical similarities could provide guidance towards new discoveries and better insights, but we have to be careful, since they are intrinsically different phenomena.

D Noises
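Now let us add the effect of the noise. Starting from the original gradient descent equation, we add a random fluctuation term $\Delta\theta$ to model the uncertainty of measuring the expectation value,
\begin{equation*} \theta_\ell(t+1) - \theta_\ell(t) = -\eta \frac{\partial \mathcal{L}}{\partial \theta_\ell} + \Delta\theta_\ell(t). \end{equation*}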
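We assume that the random variable $\Delta\theta$ is Markovian, namely, that it is independent between time steps $t$. Moreover, we assume that the $\Delta\theta$s are distributed with Gaussian distributions $\mathcal{N}(0, \sigma_\theta^2)$. Thus, the residual training error has the recursion relation, to linear order in the Taylor expansion,
\begin{equation*} \varepsilon(t+1) - \varepsilon(t) = -\eta K \varepsilon(t) + \sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\,\Delta\theta_\ell(t). \end{equation*}
Now, let us assume that $K$ is still a constant. Since $\Delta\theta_\ell \sim \mathcal{N}(0, \sigma_\theta^2)$, we get $\sum_\ell \frac{\partial\varepsilon}{\partial\theta_\ell}\Delta\theta_\ell \sim \mathcal{N}(0, K\sigma_\theta^2)$.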
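Thus, we could write the recursion relation as
\begin{equation*} \varepsilon(t+1) = (1-\eta K)\,\varepsilon(t) + \sqrt{K}\,\Delta\theta(t). \end{equation*}
Here, $\Delta\theta \sim \mathcal{N}(0, \sigma_\theta^2)$. One can solve the difference equation iteratively. The answer is
\begin{equation*} \varepsilon(t) = (1-\eta K)^t\,\varepsilon(0) + \sqrt{K}\sum_{i=0}^{t-1} (1-\eta K)^{t-1-i}\,\Delta\theta(i). \end{equation*}
Now, we have
\begin{equation*} \varepsilon(t) \sim \mathcal{N}\!\left((1-\eta K)^t\,\varepsilon(0),\ \frac{\sigma_\theta^2\left(1-(1-\eta K)^{2t}\right)}{\eta(2-\eta K)}\right). \end{equation*}
At the initial time $t = 0$, there is no effect of the noise. The relative size of the error will grow with time compared to the exponential decay term without noise. Based on this distribution, we could compute the average of $\varepsilon^2$ over the noise, $\overline{\varepsilon^2}$, as
\begin{equation*} \overline{\varepsilon^2}(t) = (1-\eta K)^{2t}\left(\varepsilon^2(0) - \frac{\sigma_\theta^2}{\eta(2-\eta K)}\right) + \frac{\sigma_\theta^2}{\eta(2-\eta K)}. \end{equation*}
Note that the first term decays as the time $t$ increases. At late times, we have
\begin{equation*} \overline{\varepsilon^2}(\infty) \approx \frac{\sigma_\theta^2}{\eta(2-\eta K)}, \end{equation*}
where we assume the overparametrization $\eta K \approx O(1)$. Thus, at late times, the loss function will arrive at a constant plateau of order $O(\sigma_\theta^2/\eta)$. One could improve $\sigma_\theta$ to make the constant plateau controllable, so that it does not increase significantly with $N$, indicating that our algorithm could be noise-resilient.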
One could also estimate the time scale at which the contribution of the noise emerges. We could define the time scale $T_{\text{noise}}$ as the time at which the noise contribution is comparable to the noiseless part of the residual training error. We find that choosing $\eta \approx O(1/K)$ will minimize $\varepsilon(T_{\text{noise}})$. This is exactly the overparametrization condition we use in this paper.
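A rough numerical estimate of this time scale can be sketched as follows. The definition used here is an assumption made for illustration and may differ from the paper's by $O(1)$ factors: $T_{\text{noise}}$ is taken to be the time at which the decaying noiseless term $(1-\eta K)^{2t}\varepsilon^2(0)$ equals the late-time noise plateau $\sigma_\theta^2/(\eta(2-\eta K))$ of a frozen-kernel Gaussian noise model. The function name is hypothetical.

```python
import math


def noise_time_scale(eps0, lr, kernel, sigma):
    """Time at which (1 - lr*kernel)**(2t) * eps0**2 equals the noise plateau.

    Assumed definition (illustrative): plateau = sigma**2 / (lr * (2 - lr*kernel)).
    Returns 0.0 if the noise already dominates at t = 0.
    """
    rate = lr * kernel
    if not 0.0 < rate < 1.0:
        raise ValueError("this sketch assumes 0 < lr * kernel < 1")
    plateau = sigma ** 2 / (lr * (2.0 - rate))
    if plateau >= eps0 ** 2:
        return 0.0
    return math.log(plateau / eps0 ** 2) / (2.0 * math.log(1.0 - rate))


# Smaller noise pushes the crossover to later times.
for sigma in (1e-1, 1e-2, 1e-3):
    print(sigma, noise_time_scale(eps0=1.0, lr=0.1, kernel=5.0, sigma=sigma))
```

Under this definition, the crossover time grows only logarithmically as the noise width decreases, while it grows inversely with the decay rate when $\eta K$ is small.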
To be self-consistent, we need to check whether the choice $\eta \approx O(1/K)$ is consistent with the concentration condition on the dQNTK. In fact, we find that $\eta \approx O(1/K)$ will naturally satisfy the dQNTK concentration condition if $\varepsilon(0) < O(L\sqrt{N})$. This is naturally satisfied in generic situations in variational quantum algorithms, since we will usually not have an exponentially large residual training error initially.

E Numerical results
In this part, we show some simple numerical evidence based on the analysis done in [2]. We use the randomized version of the hardware-efficient variational ansatz defined in [2]. In Figure 2, for each $\sigma_\theta$ value, we run 10 experiments of 100 steps using the same setup of the ansatz $U(\theta)$, the operator $O$, and the input state $\theta_0$ as in [2]. After that, we take the residual error of the last step and average over the 10 experiments to get the mean $\varepsilon$ value, shown with black dots in the figure. The red line in the figure is the theoretical prediction. In these experiments, $L = 64$, and we have 4 qubits. The theoretical prediction uses the analytic late-time mean value of $\varepsilon$, where the $K$ value is taken from the last step, as it fluctuates a lot at early times.
We run multiple experiments to approach the theoretical value as closely as possible, with 10 experiments done for each $\sigma_\theta$ value. To verify that the numerical result lies in a reasonable regime, we calculate the 90% confidence interval of $\varepsilon$ theoretically.
In every experimental setup, due to randomness, the training will lead the parameters to different regimes with different values of $K$. To compensate for the effect of large $K$ on our numerical simulations, we choose those experiments which fulfill our theoretical restrictions for small $K$. The numerical results above are with $K \approx O(10)$, which still show great agreement with our theoretical formalism.
More precisely, in Figure 2, we get the relationship between the residual error fluctuation and the noise. For each $\sigma_\theta$ value, we calculate the standard deviation of the final residual error data from 10 experiments, shown as black dots. We take the absolute value of the final residual error from the numerical experiments for the benefit of the log scale. We find that the numerical results follow the theoretical prediction within a reasonable confidence interval. Moreover, we verify with numerical evidence how small a final residual error can be achieved as a function of the noise $\sigma_\theta$.
In Figure 3, we verify the prediction for the standard deviation of $\varepsilon(\infty)$, $\sigma_\varepsilon$, in the small-$\eta$ regime. In these numerical experiments, the inaccuracy comes mainly from the limited number of experiments and the limited time scale ($t = 100$). Especially for experiments with a small learning rate $\eta$ and random initial states, $T_{\text{noise}}$ may be too large for 100 steps to cover.

Figure 1: Density plots of the loss function landscape comparing usual and overparametrized variational quantum circuits. We illustrate the landscape by color plots of the loss function for two variational angles. Left: the traditional understanding of barren plateaus, where we have a single optimal point. Right: in the overparametrized case, the landscape is not barren, since for a random initial point, we get many good enough local optima that could minimize the loss function. Note that those plots are schematic, since it is not possible to directly plot the loss function landscape in very high dimensions. In order to visualize it in $O(1)$ numbers of iterations, one might need the number of trainable angles $L$ to be comparable to the dimension of the Hilbert space $N$.