Markov chain Monte Carlo enhanced variational quantum algorithms

Variational quantum algorithms have the potential for significant impact on high-dimensional optimization, with applications in classical combinatorics, quantum chemistry, and condensed matter. Nevertheless, the optimization landscape of these algorithms is generally nonconvex, causing them to converge to local, rather than global, minima and to produce suboptimal solutions. In this work, we introduce a variational quantum algorithm that couples classical Markov chain Monte Carlo techniques with variational quantum algorithms, allowing the latter to provably converge to global minima and thus assure solution quality. Due to the generality of our approach, it is suitable for a myriad of quantum minimization problems, including optimization and quantum state preparation. Specifically, we devise a Metropolis–Hastings method that is suitable for variational quantum devices and use it, in conjunction with quantum optimization, to construct quantum ensembles that converge to Gibbs states. These performance guarantees are derived from the ergodicity of our algorithm's state space and enable us to place analytic bounds on its time complexity. We demonstrate both the effectiveness of our technique and the validity of our analysis through quantum circuit simulations for MaxCut instances, solving these problems deterministically and with perfect accuracy, as well as for large-scale quantum Ising and transverse-field spin models of up to 50 qubits. Our technique stands to broadly enrich the field of variational quantum algorithms, improving and guaranteeing the performance of these promising, yet often heuristic, methods.

In order to avoid the local minima convergence that plagues VQAs, we introduce MCMC-VQA, a technique that adapts the ergodic exploration of classical Markov chain Monte Carlo (MCMC) to guarantee the global convergence of quantum algorithms. As samples of ergodic systems are representative of their underlying probability distribution, an ergodic VQA necessarily yields a sample that contains states near the global minimum. In this work, we focus on the Metropolis-Hastings algorithm due to its success in high-dimensional spaces and suitability for unnormalized probability distributions [32]. MCMC-VQA utilizes modified VQAs and their statistics as the Metropolis-Hastings transition kernels and quantum state energies as state likelihoods. These quantities are then used to determine the viability of parameter updates. Our algorithm requires no increase in quantum overhead and only a minimal increase in classical overhead. MCMC-VQA represents a time-discrete, space-continuous Markov chain, as the algorithm progresses in discrete VQA epochs while training a continuous-parameter quantum circuit. It can also be classified as a form of Stochastic Gradient Descent MCMC [33,34]. Although in this work we focus on VQE [3], our techniques are readily applicable to a wide array of quantum machine learning applications.
While other works have introduced quantum subroutines for classical MCMC methods that offer a quadratic speedup for random walks [35-37] and sampling [38,39], this manuscript takes the opposite approach by designing a classical MCMC subroutine for quantum algorithms. Likewise, while classical MCMC methods have been used to simulate quantum computing routines [40,41], our work uses classical MCMC to enhance quantum algorithms on quantum hardware.
We briefly review VQAs, focusing on VQE (Fig. 1, gray) for quantum optimization of MaxCut problems. This choice of application is motivated by the ample nonconvexity of the corresponding quadratic loss functions [11,19]. VQAs are parameterized by input states |ψ⟩ and quantum circuit unitaries U_t = U(θ_t), where θ_t are the variable parameters learned during epoch t−1. Without loss of generality, we choose the n-qubit input state as |0⟩ = ⊗_{i=0}^{n−1} |0⟩_i, such that the output state is entirely defined by θ, and assume that the initial parameters θ_0 are randomly selected at the start of each new sequence of epochs.
MaxCut is a partitioning problem on undirected graphs G (Fig. 1, black), where edges with weights ω_i connect pairs of vertices v_{i_a}, v_{i_b} [42]. The goal is to optimally assign all vertices v_{i_a}, v_{i_b} ∈ {−1, 1} so as to maximize the cut of G, which corresponds to minimizing the objective function of Eq. 1. In this work, we consider a generalized form of the problem known as weighted MaxCut, in which the weights ω_i take arbitrary real values.

FIG. 1: Diagram of a random graph for MaxCut, VQE, and MCMC-VQA. Random graphs (black, Secs. I and III) in this work are generated with normally distributed edge weights ω_i. The objective is to minimize Eq. 1 by optimally assigning each pair of vertices v_{i_a}, v_{i_b} ∈ {−1, 1}. MaxCut can be solved on a quantum computer by mapping v_{i_a}, v_{i_b} → σ_{i_a}, σ_{i_b} and minimizing the corresponding H. See Sec. III for graph details. VQE (gray, Sec. I) minimizes the loss function for each θ by calculating the expectation value Λ(θ) and updating θ with gradient descent using ∇Λ(θ). MCMC-VQA (blue, Sec. II) uses gradient descent with ∇Λ(θ) and random noise ξΘ_r to produce the candidate state θ′, but also calculates the probability distributions P(θ) and P(θ′), as well as the proposal distributions G(θ′|θ) and G(θ|θ′). Using these distributions, the acceptance distribution A(θ′|θ) is calculated and compared to a random uniform sample u ∼ U(0, 1). If A(θ′|θ) > u, then θ′ → θ. Otherwise, the MCMC-VQA algorithm restarts with the original θ. (Red) After the maximum number of MCMC-VQA epochs T_MC have occurred, the sampled parameters with the lowest loss, θ_min, are selected and the optimization completes with a closing sequence of VQE epochs.
To solve MaxCut via VQE, a graph G is encoded in the Ising model Hamiltonian

H = Σ_i ω_i σ_{i_a} σ_{i_b},   (2)

where ω_i remains unchanged from the MaxCut objective function and v_{i_a}, v_{i_b} → σ_{i_a}, σ_{i_b} for Pauli-Z spin operators σ_{i_a}, σ_{i_b}. Maximizing the cut of G is then equivalent to minimizing the loss function

Λ_t = Λ(θ_t) = Σ_i μ_i^t,   (3)

where μ_i^t are the expectation values of the quadratic MaxCut terms ω_i σ_{i_a} σ_{i_b}. VQE circuit training updates the parameters θ_t via gradient descent on Λ_t (Fig. 1), where the gradient of any θ_k^t ∈ θ_t can be calculated by finite difference as ∇_k Λ(θ_t) = [Λ(θ_t + ε ê_k) − Λ(θ_t − ε ê_k)] / 2ε. As ∇Λ(θ_t) → 0 in the vicinity of both global and local minima, VQE training is prone to stagnation at suboptimal solutions.
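The equivalence between maximizing the weighted cut and minimizing the Ising energy can be checked by brute force on a small instance. The following sketch uses a hypothetical four-vertex graph; the edge list and weights are illustrative, not the paper's benchmark set:

```python
import itertools

# Hypothetical weighted edges (a, b, w) of a four-vertex graph.
edges = [(0, 1, 0.7), (1, 2, -0.3), (2, 3, 1.1), (0, 3, 0.5)]

def ising_energy(spins, edges):
    """H = sum_i w_i * s_a * s_b (Eq. 2 with classical spins s = ±1)."""
    return sum(w * spins[a] * spins[b] for a, b, w in edges)

def cut_value(spins, edges):
    """Weighted cut: each edge contributes w_i when it crosses the partition."""
    return sum(w * (1 - spins[a] * spins[b]) / 2 for a, b, w in edges)

# Brute-force check: the assignment minimizing H also maximizes the cut.
assignments = list(itertools.product([-1, 1], repeat=4))
best_energy = min(assignments, key=lambda s: ising_energy(s, edges))
best_cut = max(assignments, key=lambda s: cut_value(s, edges))
assert abs(ising_energy(best_energy, edges) - ising_energy(best_cut, edges)) < 1e-12
```

The check works because cut and energy are related affinely: cut(s) = (Σ_i w_i − H(s))/2, so the two optimizers coincide.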

II. RESULTS
In this section, we present our novel method for enhancing the performance of VQAs with classical MCMC, a technique that we dub MCMC-VQA. We start by briefly reviewing traditional MCMC, focusing on the Metropolis-Hastings algorithm. Then, we introduce MCMC-VQA, derive its behavior, and verify our findings with numerical simulations.

A. MCMC-VQA Method
MCMC algorithms, such as Metropolis-Hastings, combine the randomized sampling of Monte Carlo methods with the Markovian dynamics of a Markov chain in order to randomly sample from a distribution that is difficult to characterize deterministically [32]. MCMC is particularly useful for approximations in high-dimensional spaces, where the so-called "curse of dimensionality" can make techniques such as random sampling prohibitively slow [43]. The core merit of MCMC techniques is their ergodicity, which guarantees that all states of the distribution are eventually sampled in a statistically representative way, regardless of which initial point is chosen. This representative sample is known as the unique stationary distribution π. In particular, any Markov chain that is both irreducible (each state has a non-zero probability of transitioning to any other state) and aperiodic (not partitioned into sets that undergo periodic transitions) will provably converge to its unique stationary distribution π, from which it samples ergodically [44]. The mathematical properties of ergodic Markov chains are well-studied, including analytic bounds for solution quality and mixing time (number of epochs) [45,46].
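This convergence to a unique stationary distribution can be illustrated with a minimal sketch: a hypothetical three-state irreducible, aperiodic chain, where any initial distribution evolves to the same π under repeated application of the transition matrix.

```python
# A 3-state irreducible, aperiodic chain (row-stochastic transition matrix).
# The matrix entries are arbitrary illustrative values.
P = [[0.5, 0.3, 0.2],
     [0.2, 0.6, 0.2],
     [0.3, 0.3, 0.4]]

def step(dist, P):
    """One epoch of the chain: dist' = dist @ P."""
    n = len(P)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def evolve(dist, P, epochs=200):
    """Apply the transition matrix repeatedly."""
    for _ in range(epochs):
        dist = step(dist, P)
    return dist

pi_a = evolve([1.0, 0.0, 0.0], P)   # start concentrated in state 0
pi_b = evolve([0.0, 0.0, 1.0], P)   # start concentrated in state 2
# Both starting points converge to the same stationary distribution pi.
```

The two evolved distributions agree to numerical precision and are fixed points of `step`, which is the defining property of π.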
In order to obtain π for a distribution of interest, Metropolis-Hastings specifies the transition kernel P(x′|x), which is the probability that state x transitions to state x′. Typically, the Markov process is defined such that transitions satisfy the detailed balance condition

P(x′|x) P(x) = P(x|x′) P(x′).   (4)

When Eq. 4 holds, the chain is said to be reversible and is guaranteed to converge to a stationary distribution. P(x′|x) can be factored into two quantities,

P(x′|x) = G(x′|x) A(x′|x),   (5)

where G(x′|x) is the proposal distribution, or the conditional probability of proposing state x′ given state x, and A(x′|x) is the acceptance distribution, or the probability of accepting the new state x′ given state x. To satisfy Eq. 4, the acceptance distribution is defined as

A(x′|x) = min(1, [P(x′) G(x|x′)] / [P(x) G(x′|x)]).   (6)

Note that as only the ratio P(x′)/P(x) is considered, the probability distribution need not be normalized. To determine whether the candidate state x′ or the current state x_t should be used as the future state x_{t+1}, a sample u is drawn from the uniform distribution U(0, 1). If A(x′|x_t) ≥ u, then x_{t+1} = x′ and we say that the candidate state x′ is accepted. Otherwise, x_{t+1} = x_t and we say that x′ is rejected.
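As a concrete illustration, the following is a minimal Metropolis-Hastings sampler for a one-dimensional unnormalized density. It uses a symmetric Gaussian proposal, for which the G terms in Eq. 6 cancel; the target density, starting point, and step size are arbitrary choices for the sketch:

```python
import math
import random

random.seed(0)

def metropolis_hastings(log_p, x0, steps=20000, step_size=1.0):
    """Sample from the unnormalized density exp(log_p). With a symmetric
    Gaussian proposal, A(x'|x) reduces to min(1, P(x')/P(x))."""
    x, samples = x0, []
    for _ in range(steps):
        x_new = x + random.gauss(0.0, step_size)   # propose from G(x'|x)
        log_a = log_p(x_new) - log_p(x)            # log of the ratio P(x')/P(x)
        if random.random() < math.exp(min(0.0, log_a)):
            x = x_new                              # accept the candidate
        samples.append(x)                          # on rejection, x repeats
    return samples

# Target: a standard normal, supplied unnormalized as log P(x) = -x^2/2.
samples = metropolis_hastings(lambda x: -0.5 * x * x, x0=5.0)
burned = samples[5000:]   # discard burn-in epochs
```

Despite starting far from the high-probability region (x0 = 5), the post-burn-in samples have mean and variance close to those of the target, illustrating the ergodic convergence described above.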
We now present the MCMC-VQA method. Fig. 1 contains a diagram of the algorithm (blue). In particular, we focus on an ergodic Metropolis-Hastings algorithm, which is guaranteed to sample states near global minima. We outline the algorithm both idealistically and experimentally, prove its ergodicity and convergence, and verify these findings with numerical simulations.
As we seek the lowest energy eigenstate when solving MaxCut via VQE, we define P(θ) as the Boltzmann distribution

P(θ) ∝ exp(−βΛ(θ)),   (7)

where β is the inverse thermodynamic temperature, such that a state's probability increases exponentially with decreasing loss function.
To calculate the proposal distribution G(θ′|θ_t), we must consider the sampling statistics of VQAs. Due to quantum uncertainty, a measurement m_i^r(θ_t) of the operators ω_i σ_{i_a} σ_{i_b} from Eq. 2 is a sample from a distribution with mean μ_i^t and variance

(Δ_i^t)² = ω_i² − (μ_i^t)²,   (8)

as each single-shot outcome takes the values ±ω_i. The Central Limit Theorem asserts that, assuming at least M ≳ 30 independent and identically distributed measurements m_i^r(θ_t), an estimate of the loss function Λ_t is the statistic

l_t = Σ_i q_i^t ∼ N(Λ_t, δ_Λ²(θ_t)),  q_i^t = (1/M) Σ_{r=1}^{M} m_i^r(θ_t),  δ_Λ²(θ_t) = Σ_i (Δ_i^t)²/M   (9)

[47,48]. Similarly, ∀ θ_k^t ∈ θ_t and assuming small parameter shifts ε, the finite-difference gradient d_k l_t is itself normally distributed about ∇_k Λ(θ_t). The variance of this distribution can be simplified by noting that, to first order in ε, the parameter-shifted Pauli operators are σ^{±k}_{i_a} = σ_{i_a}(θ ± ε ê_k) = σ_{i_a} ± ι_{i_a k}, where σ_{i_a} = σ_{i_a}(θ) and ι_{i_a k} = (∂σ_{i_a}/∂θ_k) ε. Up to first order in ι, the gradient's distribution is

d_k l_t ∼ N(∇_k Λ(θ_t), δ_Λ²(θ_t)/2ε²).   (10)

Standard gradient descent would propose the candidate state θ′ = θ − η∇Λ_t; MCMC-VQA instead adds a normally distributed random noise term Θ_r ∼ N(0, 1) with scale parameter ξ in order to expand the support of the proposal distribution G(θ′|θ_t). This specifies

G(θ′|θ_t) = Π_k pdf_{N(θ_k^t − η ∇_k Λ(θ_t), ξ²)}(θ_k′),   (11)

where the notation pdf_{N(μ, σ²)}(x) denotes the probability density function at point x of a normal distribution with mean μ and variance σ². It follows that the acceptance distribution is given by

A(θ′|θ_t) = min(1, [P(θ′) G(θ_t|θ′)] / [P(θ_t) G(θ′|θ_t)]).   (12)

We note that G(θ_t|θ′) is obtained by simply exchanging θ_t and θ′ in Eq. 11. A random uniform sample u ∼ U(0, 1) is then drawn for comparison, such that θ_{t+1} = θ′ if A(θ′|θ_t) > u and θ_{t+1} = θ_t otherwise.
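This update rule can be sketched classically. The toy one-parameter double-well loss, the hyperparameter values, and the use of an exact (noiseless) gradient are illustrative assumptions, not the paper's configuration:

```python
import math
import random

random.seed(1)

def normal_logpdf(x, mu, var):
    """Log density of N(mu, var) at x (log form avoids underflow)."""
    return -(x - mu) ** 2 / (2 * var) - 0.5 * math.log(2 * math.pi * var)

def log_proposal(theta_to, theta_from, grad_from, eta, xi):
    """log G(theta_to | theta_from): product of per-parameter normals centered
    at the gradient-descent update, with variance xi^2 from the noise term."""
    return sum(normal_logpdf(tt, tf - eta * g, xi ** 2)
               for tt, tf, g in zip(theta_to, theta_from, grad_from))

def mcmc_vqa_step(theta, loss, grad_fn, beta, eta, xi):
    grad = grad_fn(theta)
    # Candidate: theta' = theta - eta*grad + xi*Theta_r, Theta_r ~ N(0, 1).
    theta_new = [t - eta * g + xi * random.gauss(0.0, 1.0)
                 for t, g in zip(theta, grad)]
    # log A = beta*[L(theta) - L(theta')] + log G(theta|theta') - log G(theta'|theta)
    log_a = (beta * (loss(theta) - loss(theta_new))
             + log_proposal(theta, theta_new, grad_fn(theta_new), eta, xi)
             - log_proposal(theta_new, theta, grad, eta, xi))
    if random.random() < math.exp(min(0.0, log_a)):
        return theta_new      # accept candidate
    return theta              # reject: keep current parameters

# Toy double-well loss: local minimum near +0.96, global minimum near -1.04.
loss = lambda th: (th[0] ** 2 - 1) ** 2 + 0.3 * th[0]
grad_fn = lambda th: [4 * th[0] * (th[0] ** 2 - 1) + 0.3]

theta, best = [1.5], [1.5]
for _ in range(2000):
    theta = mcmc_vqa_step(theta, loss, grad_fn, beta=5.0, eta=0.05, xi=0.4)
    if loss(theta) < loss(best):
        best = theta          # track theta_min for the closing VQE phase
```

Tracking `best` mirrors the selection of θ_min for the closing VQE phase: even when the chain wanders, the lowest-loss parameters visited are retained.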
After T_MC epochs of the above Markovian process, MCMC-VQA implements a short series of traditional VQA epochs for rapid convergence to the nearest minimum. In particular, these closing VQA epochs are initialized with θ_min, the parameter set of lowest eigenvalue Λ_min found during the Metropolis-Hastings phase. In this manner, MCMC-VQA can be considered a "warm starting" procedure [20-22], but one with ergodic guarantees.
Example MCMC-VQA trajectories are shown in Fig. 2 with inverse thermodynamic temperatures β = 0.8 and β = 0.2. The details of all simulations are given in Sec. III. Our algorithm combines the gradient descent-based optimization of VQE with a Markovian process that escapes local minima. Such exploration is significantly greater at the higher temperature β = 0.2, where rather than settling into distinct loss function basins from which escape is relatively rare, the trajectories display the trademark "burn-in" behavior of ergodic Markov chains. By the time that the closing VQE epochs are applied, the ergodic β = 0.2 MCMC-VQA chains have sampled states sufficiently near the global minimum and converge to the ground truth nearly uniformly. Fig. 3 (left) displays the average accuracy 1 − α (where α is the average error, blue) and the standard deviation (gray) of MaxCut solutions with MCMC-VQA as a function of β. Dashed lines represent the performance of traditional VQE on the same set of graphs and circuit ansatz. We note that all simulated β values outperform traditional VQE. Until β ∼ 0.2, higher-temperature MCMC-VQA chains have higher accuracy and better convergence, as their more permissive temperature parameter biases the acceptance distribution towards accepting the candidate states. However, performance decreases at very high temperatures, for which the MCMC-VQA chains are no longer appreciably biased towards energy minimization and the algorithm becomes more like random sampling than intrepid gradient descent. Likewise, the optimal amount of parameter update noise ξ is inversely proportional to β (Fig. 3, right), as higher temperatures permit more radical deviations from standard gradient descent.

B. Implementation of MCMC-VQA on Quantum Hardware
As discussed above, the loss function Λ_t is not precisely determined on actual quantum hardware, but rather estimated as the statistic l_t = Σ_i q_i^t, where q_i^t = (1/M) Σ_{r=1}^{M} m_i^r(θ_t). As a result, the variance of a single observable measurement (Δ_i^t)² is estimated by ω_i² − (q_i^t)², while that of the total loss function is estimated by (δ_Λ^t)² = Σ_i [ω_i² − (q_i^t)²]/M, for M measurements per observable. Alternatively, the variances could be directly estimated from the standard deviations of the expectation value statistics. We then define a(θ′|θ_t), the acceptance distribution on quantum hardware, by replacing Λ(θ_t) and ∇_k Λ(θ_t) in Eqs. 7, 11, and 12 with their sampled counterparts l_t and d_k l_t (Eq. 13). MCMC-VQA does not increase the quantum complexity of VQAs (the number of operations carried out on quantum hardware), as the measurements used to estimate Λ(θ) are carried out in the typical way. Moreover, the acceptance distribution and its components are computed classically with simple arithmetic.
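These hardware estimates can be sketched with synthetic single-shot data. The weights, true expectation values, and shot count M below are hypothetical; each shot of ω_i σ_{i_a} σ_{i_b} takes the values ±ω_i, with the bias fixed by the true mean μ_i:

```python
import random

random.seed(2)

# Hypothetical edge weights and true means mu_i of w_i*sigma*sigma (|mu_i| <= |w_i|).
weights = [0.7, -0.3, 1.1]
mus = [0.2, -0.1, 0.9]
M = 1000  # shots per observable

def sample_shots(w, mu, M):
    """Single-shot outcomes are ±w; P(+w) = (1 + mu/w)/2 reproduces mean mu."""
    p_plus = (1 + mu / w) / 2
    return [w if random.random() < p_plus else -w for _ in range(M)]

qs, var_terms = [], []
for w, mu in zip(weights, mus):
    shots = sample_shots(w, mu, M)
    q = sum(shots) / M               # q_i = (1/M) sum_r m_i^r
    qs.append(q)
    var_terms.append(w * w - q * q)  # single-shot variance estimate w_i^2 - q_i^2

l_t = sum(qs)                        # loss statistic l_t = sum_i q_i
delta2 = sum(var_terms) / M          # (delta_Lambda)^2 = sum_i [w_i^2 - q_i^2]/M
assert delta2 >= 0                   # |q_i| <= |w_i| guarantees nonnegativity
```

With M = 1000 shots, l_t lands within a few standard errors of the true loss Σ_i μ_i = 1.0, and δ² quantifies exactly that shot-noise scale, which is what the hardware acceptance distribution a(θ′|θ_t) must account for.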

C. Proof of Ergodicity
If a Metropolis-Hastings algorithm is irreducible and aperiodic, then the resulting Markov chain is provably ergodic [44].That is, it will explore all areas of the probability distribution, converging on average to the Markov process' unique stationary distribution, which includes the global minimum of the solution space.Moreover, as we have chosen to sample from the Boltzmann distribution of the loss function, we sample from states near optimal solutions with exponentially higher probability.

C.1. Irreducibility
The VQA Metropolis-Hastings Markov chain is irreducible if, ∀ θ_a, θ_b, there exist a length T and a sequence {θ_1, θ_2, ..., θ_T} such that

P(θ_1|θ_a) P(θ_2|θ_1) ⋯ P(θ_b|θ_T) > 0.   (14)

That is, the Markov chain is irreducible if, for any two points in parameter space θ_a, θ_b, there exists a series of transitions of any length T such that θ_a → θ_b with nonzero probability [49]. While this definition of irreducibility is sufficient, we will instead focus on the yet more powerful condition of strong irreducibility. A Markov chain is strongly irreducible iff

P(θ_b|θ_a) > 0, ∀ θ_a, θ_b,   (15)

meaning that all points in parameter space have a nonzero probability of transitioning to all other points [50]. This condition is then equivalent to requiring that each per-parameter proposal kernel remain strictly positive,

g(θ_b|θ_a)_k > 0, ∀ k,   (16)

where we note that δ_Λ²(θ_t) ∝ 1/M. Eq. 15 is satisfied, at least to within some tolerance, ∀ θ_a, θ_b. Although g(θ_b|θ_a)_k may become very small, it will generally retain a nonzero probability for virtually all transitions, and the chain will be strongly irreducible, albeit perhaps slow to converge. More precise arguments can be made in the limit of large ξ, where to first order in small 1/ξ, g(θ_b|θ_a)_k → 1/(√(2π) ξ) and all transitions become equally likely. While this extreme ξ limit is too random to result in efficient gradient descent, it illustrates a concrete transition to irreducibility with increasing ξ. Moreover, due to the uncertainty introduced by the finite statistics d_k l_a and (δ_Λ^a)², sampling of the proposal kernel g(θ_b|θ_a)_k can allow for otherwise unlikely transitions.

FIG. 3: (Left, blue) Average MCMC-VQA accuracy (1 − α, for average error α) vs inverse thermodynamic temperature β. Nearly perfect average accuracy is obtained for a properly tuned hyperparameter β (here, β ≈ 0.2). At low temperature (large β), the algorithm mixes slowly, only partially approximating ergodicity in T_MC = 400 Markovian epochs. This partial convergence results in lower accuracy, which approaches that of traditional VQE (blue dashed line) in the limit of large β. Conversely, for high temperature (small β), the algorithm is only weakly biased towards low-energy solutions, which renders its gradient descent inefficient and reduces its accuracy. (Left, gray) The standard deviation of MCMC-VQA accuracy vs β. Higher standard deviation directly corresponds with lower accuracy. As discussed above, at high β this is due to runs trapped in local minima (see Fig. 2), while at low β it stems from the lack of energy-preferred convergence. (Right) Optimal value of ξ vs β, where ξ is the gradient descent noise parameter (θ′ = θ − η∇Λ_t + ξΘ_r) and each trajectory undergoes T_MC = 400 Markovian epochs. As larger temperatures generate more permissive acceptance distributions A(θ′|θ), higher ξ values lead to more efficient mixing in the low-β limit. See Sec. III for simulation details.

C.2. Aperiodicity
In the case of the strong irreducibility argued above (Eq. 15), aperiodicity is automatically satisfied. Assuming only the weaker irreducibility of Eq. 14, it is sufficient to show that the chain retains a nonzero probability of remaining in its current state (Eq. 17) [49]. As long as η ≫ ξ, Eq. 17 holds for all but singular points θ_a.

D. Mixing Time
The mixing time τ of a Markov chain is the number of epochs required to reach a certain threshold of convergence. For an ergodic, discrete-time Markov chain, τ is analytically bounded by

τ ≤ (2/Φ²) ln[1/(α_MC √π_*)],   (18)

where α_MC = |S − π| is the distance between the Markov chain's sampled distribution S and the true stationary distribution π, π_* is the probability of the least likely (maximum energy) state of π, and Φ is the conductance or "Cheeger constant" of the Markov process [45]. The conductance can be understood as the minimum of normalized ergodic flows between all possible partitions of the state space. Fig. 4 demonstrates that the performance of MCMC-VQA is consistent with the theoretical predictions of ergodic Markov chains (Eq. 18). That is, the time dependence of the MCMC-VQA optimization error α follows the same ln(1/α) scaling as the distribution distance α_MC in Eq. 18. Moreover, least-squares analysis of the Fig. 4 data reveals a β-dependent scale factor that is proportional to ln(1/√π_*), which is consistent with the Boltzmann distribution p(θ_a) ∝ exp(−βΛ_a) (Eq. 7) from which our method samples. This temperature-dependent time complexity further verifies that MCMC-VQA is an ergodic Markov process that successfully samples from the target distribution.
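A small helper illustrates how a conductance-based mixing-time bound of this kind scales. The exact constant and logarithm argument below are an assumed form, chosen to be consistent with the ln(1/α) and ln(1/√π_*) dependences discussed in the text, so treat it as a sketch rather than a definitive bound:

```python
import math

def mixing_time_bound(phi, pi_star, alpha_mc):
    """Assumed conductance bound: tau <= (2/phi^2) * ln(1/(alpha_mc*sqrt(pi_star))).
    phi: conductance (Cheeger constant) of the Markov process,
    pi_star: probability of the least likely state of pi,
    alpha_mc: target distance |S - pi| from the stationary distribution."""
    return (2.0 / phi ** 2) * math.log(1.0 / (alpha_mc * math.sqrt(pi_star)))

# Tighter accuracy targets and rarer worst-case states both increase the bound,
# but only logarithmically; the conductance enters quadratically.
t_loose = mixing_time_bound(phi=0.1, pi_star=1e-6, alpha_mc=0.1)
t_tight = mixing_time_bound(phi=0.1, pi_star=1e-6, alpha_mc=0.01)
t_cold = mixing_time_bound(phi=0.1, pi_star=1e-9, alpha_mc=0.1)
```

Because π_* ∝ exp(−βΛ_max) under the Boltzmann distribution, lowering the temperature (raising β) shrinks π_* and lengthens the bound, matching the slow mixing observed at large β.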

III. NUMERICAL SIMULATIONS
The simulations in this work are done using a modified version of TensorLy-Quantum, an open-source software package for quantum circuit simulation using factorized tensors [51,52].TensorLy-Quantum specializes in exact tensor contraction, such that the simulations are carried out without truncation or approximation.
The MaxCut instances optimized in this work are generated from ten graphs. Each graph has ten vertices and an equal number of randomly selected edges, whose weights are drawn from the unit normal distribution. Such graphs are equivalent to the Gilbert model of random graphs [53]. The number of edges was chosen to be equal to the number of vertices, as this ratio is observed to pose high difficulty for random MaxCut problems of this model [54,55].
All numerical simulations in this work are done using the graphs described above, with twenty randomly initialized runs completed for each graph. The quantum circuits use one parameterized rotation per vertex. We illustrate our work using circuits with relatively few parameters because their optimization landscape is especially nonconvex and thus prone to convergence in local minima [19]; however, MCMC-VQA can be used with arbitrary parameterization. The circuit gates alternate between a layer of single-qubit parameterized rotations (angles θ) about the y-axis and a layer of two-qubit control-Z gates. For each method (VQE or MCMC-VQA) and set of hyperparameters, a variety of learning rates is scanned so that numerical comparisons can be drawn against the optimal performance of each algorithm. All VQE sequences consist of 100 epochs. Fig. 2 shows an ensemble of trajectories, whereas Figs. 3 and 4 are averages over the optimal learning rate for all ten graphs and 20 random initializations. For simplicity, we take the large-M limit, assuming many measurements and precise expectation values.
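A minimal statevector sketch of the alternating ansatz described above (a layer of single-qubit Ry rotations followed by a layer of controlled-Z gates) for a hypothetical four-qubit instance; the graph weights and entangling pattern are illustrative, not the paper's benchmark configuration:

```python
import numpy as np

n = 4  # qubits, one per vertex of a toy graph
edges = [(0, 1, 0.7), (1, 2, -0.3), (2, 3, 1.1), (0, 3, 0.5)]

def apply_ry(state, qubit, theta):
    """Apply Ry(theta) to `qubit` of an n-qubit statevector (qubit 0 = MSB)."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    psi = state.reshape([2] * n)
    a = np.take(psi, 0, axis=qubit)  # amplitudes with qubit = |0>
    b = np.take(psi, 1, axis=qubit)  # amplitudes with qubit = |1>
    psi = np.stack([c * a - s * b, s * a + c * b], axis=qubit)
    return psi.reshape(-1)

def apply_cz(state, q1, q2):
    """Controlled-Z: flip the sign of amplitudes with both qubits = |1>."""
    psi = state.reshape([2] * n).copy()
    idx = [slice(None)] * n
    idx[q1], idx[q2] = 1, 1
    psi[tuple(idx)] *= -1
    return psi.reshape(-1)

# Diagonal of the Ising Hamiltonian H = sum_i w_i sigma_a sigma_b (Eq. 2).
bits = (np.arange(2 ** n)[:, None] >> np.arange(n)[::-1]) & 1
z = 1 - 2 * bits  # sigma_z eigenvalues per qubit and basis state
h_diag = sum(w * z[:, a] * z[:, b] for a, b, w in edges)

def loss(thetas):
    """Lambda(theta) = <psi(theta)|H|psi(theta)> for one Ry layer + one CZ layer."""
    state = np.zeros(2 ** n)
    state[0] = 1.0  # |0...0> input state
    for q in range(n):
        state = apply_ry(state, q, thetas[q])
    for q in range(n - 1):
        state = apply_cz(state, q, q + 1)
    return float(np.sum(np.abs(state) ** 2 * h_diag))
```

With all angles zero the circuit is the identity, so the loss is ⟨0|H|0⟩ = Σ_i w_i; rotating one qubit by π flips the sign of every edge term touching that vertex, which checks the encoding.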

IV. DISCUSSION
In this work, we have introduced MCMC-VQA: a novel variational quantum algorithm that harnesses classical Markov chains to obtain analytic convergence guarantees for parameterized quantum circuits. As ergodic Markov chains representatively sample a target probability distribution, they identify regions near the global minimum with high probability. We present MCMC-VQA from both a theoretical and a practical perspective, prove its ergodicity, and derive its time complexity (mixing time) as a function of both accuracy and inverse thermodynamic temperature. Focusing on MaxCut optimization within the VQE framework due to its plentiful local minima, and employing a reversible Metropolis-Hastings Markov process, we demonstrate the ergodicity of our method, the validity of our analytical findings, and the capacity of MCMC-VQA not only to outperform traditional VQAs, but to do so with up to perfect and deterministic convergence.
In future research, MCMC-VQA should be studied for a variety of different applications, quantum algorithms, and Markov processes. In addition to quantum optimization, VQAs have been employed to address a myriad of topics in both quantum chemistry [13-15] and condensed matter physics [16-18]. Moreover, even simple quantum Hamiltonians, such as the transverse field Ising model, are known to acutely struggle with premature convergence to local, rather than global, minima. Similarly, our technique could be extended to QAOA [4] or any of the numerous VQAs that have been proposed in recent years. Finally, dozens of MCMC variants have been devised over the past 70 years, each with its own advantages, including Gibbs sampling [56], parallel tempering [57], and independence sampling [58]. These methods could be substituted for Metropolis-Hastings in order to produce algorithms with lower computational overhead and faster mixing times. In short, varieties of MCMC-VQA can be developed for a broad spectrum of variational quantum algorithms to both improve and guarantee performance.

FIG. 2: Example trajectories with inverse thermodynamic temperature β = 0.8 (left) and β = 0.2 (right). Four hundred MCMC-VQA epochs (Markovian epochs) are followed by a closing sequence of VQE epochs (beginning at the red dashed line), which is initialized with the best parameters θ_min found during the Markov process. At lower temperature (β = 0.8), trajectories become trapped in local minima and reaching ergodicity is a lengthy process. Conversely, the high-temperature (β = 0.2) trajectories rapidly reach burn-in, generating θ_min that lead to near-perfect convergence during the VQE closing sequence. See Sec. III for simulation details.

FIG. 4: Average accuracy vs Markovian epochs for three different β values. Gray dots are the average MCMC-VQA accuracy 1 − α, and blue curves are a least-squares fit of this data to the analytical accuracy of an ergodic Markov chain, 1 − α_MC(τ), with theoretical mixing time τ (see Eq. 18). The analytical time dependence of α_MC matches the observed scaling of α, affirming that MCMC-VQA is an ergodic Markov chain and thus guaranteeing convergence to the global minimum. Furthermore, the ratio of observed scale parameters between MCMC-VQA simulations with different β values is consistent with the analytic dependence τ ∝ ln(1/√π_*) (Eq. 18) on the least likely state π_* ∝ exp(−βΛ_max) (Eq. 7). This functional dependence on temperature further supports our claims of ergodically sampling from P(θ) and thus deterministically converging to the global minimum.