Natural evolution strategies and variational Monte Carlo

A notion of quantum natural evolution strategies is introduced, which provides a geometric synthesis of a number of known quantum/classical algorithms for performing classical black-box optimization. The recent work of Gomes et al (2019 arXiv:1910.10675) on heuristic combinatorial optimization using neural quantum states is pedagogically reviewed in this context, emphasizing the connection with natural evolution strategies (NES). The algorithmic framework is illustrated for approximate combinatorial optimization problems, and a systematic strategy is found for improving the approximation ratios. In particular, it is found that NES can achieve approximation ratios competitive with widely used heuristic algorithms for Max-Cut, at the expense of increased computation time.


Introduction
An evolution strategy (Schwefel 1977, Rechenberg 1978) is a black-box optimization algorithm that iteratively updates a population of candidates within the feasible region of the search space. The population is updated by a process of random mutation, followed by fitness evaluation, and subsequent recombination of best-performing members to form the next generation. The focus of this paper is on the natural evolution strategies (NES) algorithm (Wierstra et al 2014) and its quantum variants, in which the population is represented by a smoothly parameterized family of search distributions defined on the search space. The mutation is achieved by sampling new candidates from the search distribution, which yields a gradient estimator of the expected fitness. The recombination step involves updating the parameters of the search distribution in the direction of steepest ascent, with respect to the information geometry implicit in the choice of search distribution.
Natural evolution strategies has recently demonstrated considerable progress in solving black-box optimization problems in high dimensions, including continuous optimization problems relevant to reinforcement learning (Salimans et al 2017). Comparatively little work has been done on the discrete optimization side (see, however, Malagò et al 2008, Ollivier et al 2017). It was very recently shown by Gomes et al (2019) that techniques from the quantum variational Monte Carlo (VMC) literature (Carleo and Troyer 2017) can be adapted for approximate heuristic solution of combinatorial optimization problems. Meanwhile, in the quantum computation literature, considerable effort has focused on approximate heuristic solution of combinatorial optimization problems using low-depth quantum circuits (Farhi et al 2014).
The unifying principle shared by the above algorithms is their utilization of Monte-Carlo samples drawn from a particular probability distribution, the choice of which is determined by a local optimization problem over a variational family of search distributions. The algorithms differ in the choice of variational family, as well as the geometry underlying the local optimization problem, and the use of quantum or classical resources to perform sampling. In this work, we provide a unified view on the relationship between the geometries which underpin these algorithms. In addition, we revisit the experiments of Gomes et al (2019) in which heuristic algorithms were found to outperform NES both in terms of approximation ratio and time to solution. In contrast, we show that NES can achieve approximation ratios for Max-Cut competitive with widely used algorithms, albeit at the expense of significantly increased computational cost. The key factor impacting performance is identified to be the batch size of the stochastic gradient estimator.
The paper is structured as follows. First, the NES algorithm is reviewed, highlighting its information-geometric origin. Next, we clarify the relationship between NES and quantum approximate optimization including the classical/quantum hybrid approach introduced in Gomes et al (2019). Finally, numerical experiments are presented, focusing on the problem of combinatorial optimization for the Max-Cut problem.

Background
Consider the problem of optimizing a real-valued, but otherwise arbitrary function f ∈ R^X = {X → R} defined on a search space X. For simplicity of presentation we focus on |X| < ∞, although the method generalizes in a straightforward way to infinite search spaces, as required for continuous optimization. Now consider the probability simplex P(X) defined as

P(X) := { p ∈ R^X : p(x) ≥ 0 for all x ∈ X, Σ_{x∈X} p(x) = 1 }.

The NES algorithm can be motivated by the following two observations, which concern the convex and Riemannian geometry of the set P(X):

1. The optimization problem admits the following equivalent convex relaxation,

min_{x∈X} f(x) = min_{p∈P(X)} E_{x∼p}[f(x)],

whose global minimizers form the subsimplex consisting of the following convex hull of Dirac distributions:

conv{ δ_x : x ∈ argmin_{x'∈X} f(x') }.

2. There is a natural Riemannian metric called the Fisher-Rao metric defined on the interior Int(P(X)) of the probability simplex, consisting of strictly positive probability vectors. The distance between p, q ≻ 0 is defined as

d_FR(p, q) = 2 arccos( Σ_{x∈X} √(p(x) q(x)) ),    (4)

where √p denotes the elementwise square root of the probability vector p.
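Both observations can be checked numerically on a small search space. The sketch below (NumPy/SciPy; the fitness values and distributions are hypothetical, chosen only for illustration) verifies that the linear program over the simplex attains its minimum at a Dirac distribution, and that the distance (4) is symmetric and vanishes on the diagonal.

```python
import numpy as np
from scipy.optimize import linprog

# Observation 1: the convex relaxation is minimized by a Dirac distribution.
f = np.array([3.0, 1.0, 4.0, 1.5])                 # hypothetical fitness values, |X| = 4
res = linprog(c=f, A_eq=np.ones((1, 4)), b_eq=[1.0], bounds=[(0, 1)] * 4)
assert np.isclose(res.fun, f.min())                # relaxed optimum equals min_x f(x)
assert np.isclose(res.x[np.argmin(f)], 1.0)        # minimizer is the Dirac at argmin f

# Observation 2: the Fisher-Rao distance between strictly positive distributions.
def fisher_rao(p, q):
    return 2.0 * np.arccos(np.clip(np.sum(np.sqrt(p * q)), -1.0, 1.0))

p, q = np.array([0.5, 0.3, 0.2]), np.array([0.2, 0.3, 0.5])
assert np.isclose(fisher_rao(p, p), 0.0)               # distance vanishes on the diagonal
assert np.isclose(fisher_rao(p, q), fisher_rao(q, p))  # symmetry
```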
Given a smoothly parametrized family of search distributions {p_θ : θ ∈ R^d} ⊆ P(X), one obtains a trivial variational bound as follows:

min_{x∈X} f(x) ≤ L(θ) := E_{x∼p_θ}[f(x)].

Natural evolution strategies seeks to obtain a good approximation ratio by optimizing the above variational upper bound using Riemannian gradient descent in the geometry induced by the Fisher-Rao metric, otherwise known as natural gradient descent (Amari 1998). Specifically, given a learning rate η > 0 and initial condition θ_0 ∈ R^d, one considers the deterministic sequence in R^d defined by

θ_{t+1} = θ_t − η I_{θ_t}^{−1} ∇L(θ_t),    (6)

where I_θ denotes the Fisher information matrix, evaluated at the parameter vector θ ∈ R^d,

I_θ = E_{x∼p_θ}[ ∇_θ log p_θ(x) ∇_θ log p_θ(x)^T ].    (7)

Indeed, it can be shown that I_θ is the local coordinate representation of the Riemannian metric tensor induced by (4), and the restriction to the interior of the probability simplex ensures that the logarithm is well defined.
Observe that the iteration (6) defining the sequence (θ_t)_{t≥0} involves the unknown function f. NES can now be defined as the randomized algorithm inspired by (6), in which ∇L(θ) is replaced by a stochastic gradient estimator obtained by sampling from the search distribution p_θ. Likewise, if the Fisher information (7) cannot be evaluated in closed form, then it can be replaced by an associated estimator.
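As a concrete illustration, the following minimal NES sketch (NumPy only; the toy fitness f(x) = −Σᵢ xᵢ, the mean-field parametrization, and all hyperparameters are our illustrative choices, not the paper's experimental settings) estimates both ∇L(θ) and the Fisher information (7) from the same batch of samples and applies the natural gradient update (6).

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):                       # toy fitness to minimize; optimum at x = (1, ..., 1)
    return -x.sum(axis=-1)

# Mean-field family p_theta(x) ∝ exp(theta · x) over x in {±1}^n,
# so grad_theta log p_theta(x) = x - tanh(theta).
n, batch, iters, eta, lam = 8, 200, 60, 0.2, 1e-2
theta = np.zeros(n)
for _ in range(iters):
    probs = 0.5 * (1.0 + np.tanh(theta))            # P(x_i = +1)
    x = np.where(rng.random((batch, n)) < probs, 1.0, -1.0)
    score = x - np.tanh(theta)                      # per-sample score function
    adv = f(x) - f(x).mean()                        # baseline-subtracted fitness
    grad = (adv[:, None] * score).mean(axis=0)      # stochastic estimate of grad L
    fisher = np.einsum('bi,bj->ij', score, score) / batch  # empirical Fisher matrix
    theta -= eta * np.linalg.solve(fisher + lam * np.eye(n), grad)

expected_f = -np.tanh(theta).sum()                  # E[f] under the trained p_theta
```

Here λ regularizes the empirical Fisher matrix, playing the same role as the regularization parameter in the stochastic reconfiguration experiments below; after training, the search distribution concentrates near the optimum.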

Quantum approximate optimization as NES
In this section we pedagogically review the proposal of Gomes et al (2019), showing it to be a variant of NES in which the optimization dynamics is modified as a consequence of the quantum state geometry. In addition, we identify the regime in which both methods coincide. The quantization of NES proceeds by replacing the search space X by a complex Euclidean space, whose orthonormal basis elements are |x⟩. Moreover, the probability simplex P(X) is replaced by the convex set of density operators, where Herm(C^X) denotes the set of Hermitian operators on C^X. It is clear that any classical probability distribution p ∈ P(X) can be encoded as the diagonal density operator diag(p) := Σ_{x∈X} p(x)|x⟩⟨x|. It is equally clear that this does not exhaust the space of admissible density operators: another possibility is the rank-1 projection operator P_ψ = |ψ⟩⟨ψ|/⟨ψ|ψ⟩ onto the one-dimensional subspace spanned by the vector ψ ∈ C^X. Any admissible density operator ρ ∈ D(C^X) gives rise to a valid probability distribution, which we call diag(ρ) ∈ P(X), obtained from the diagonal matrix representation in the standard basis.
It is conceptually useful to introduce the following Hermitian operator H_f ∈ Herm(C^X), diagonalized by the standard basis,

H_f := Σ_{x∈X} f(x) |x⟩⟨x|,    (10)

whose ground-state subspace encodes the solution of the optimization problem:

span{ |x⟩ : x ∈ argmin_{x'∈X} f(x') }.

Then for any density operator ρ ∈ D(C^X) we see that the quantum expectation value of H_f evaluated in the state ρ is computed by the following classical expectation value,

tr(ρ H_f) = E_{x∼diag(ρ)}[f(x)],

which is an obvious upper bound for min_{x∈X} f(x). The above identity demonstrates the widely known fact that for diagonal operators of the form (10), unconstrained optimization over the space of quantum states D(C^X) is equivalent to optimization over the probability simplex P(X). Gomes et al (2019) ask if constrained optimization within a parametrized subset of density operators provides a useful heuristic for approximate combinatorial optimization. In particular, they consider the case of rank-1 projectors, for which there exists a natural Riemannian metric called the Fubini-Study metric, defined as

d_FS(ψ, φ) = arccos( |⟨ψ, φ⟩| / (‖ψ‖ ‖φ‖) ).
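The identity relating quantum and classical expectations can be verified directly for a small diagonal operator. The sketch below (NumPy; the objective values are arbitrary made-up numbers) draws a random density operator, checks that tr(ρ H_f) equals the classical expectation under diag(ρ), and confirms that it upper-bounds min_x f(x).

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
f = rng.normal(size=d)                     # arbitrary objective values f(x)
H = np.diag(f)                             # H_f = sum_x f(x)|x><x|

A = rng.normal(size=(d, d)) + 1j * rng.normal(size=(d, d))
rho = A @ A.conj().T                       # Hermitian, positive semidefinite
rho /= np.trace(rho).real                  # normalize to a density operator

quantum = np.trace(rho @ H).real           # tr(rho H_f)
classical = np.dot(np.diag(rho).real, f)   # E_{x ~ diag(rho)}[f(x)]
assert np.isclose(quantum, classical)      # the two expectations coincide
assert quantum >= f.min()                  # and upper-bound min_x f(x)
```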
Thus, given a smoothly parametrized subset {ψ_θ : θ ∈ R^d} ⊆ C^X of a complex Euclidean space, one can define quantum NES as the local optimization of the following variational upper bound, via Riemannian gradient descent in the geometry induced by the Fubini-Study metric:

min_{x∈X} f(x) ≤ L(θ) := ⟨ψ_θ| H_f |ψ_θ⟩ / ⟨ψ_θ|ψ_θ⟩.    (14)

The choice to restrict to rank-1 projection operators involves no loss of generality compared to classical NES because if we choose ψ_θ = Σ_{x∈X} √(p_θ(x)) |x⟩, then the Fubini-Study geometry reduces to Fisher-Rao. Thus, if the parametric family is chosen to be strictly positive, ψ_θ ≻ 0, then Riemannian gradient descent in the Fubini-Study geometry coincides with natural gradient descent and we recover classical NES. We therefore henceforth use the terminology 'natural gradient' to refer to both geometries interchangeably.
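The claimed reduction can be checked numerically: for strictly positive unit vectors ψ = √p, the Fubini-Study distance agrees with the Fisher-Rao distance of (4) up to the constant factor fixed by the two normalization conventions (a NumPy sketch with illustrative distributions):

```python
import numpy as np

def fubini_study(psi, phi):
    """Fubini-Study distance between (unnormalized) state vectors."""
    overlap = np.abs(np.vdot(psi, phi)) / (np.linalg.norm(psi) * np.linalg.norm(phi))
    return np.arccos(np.clip(overlap, 0.0, 1.0))

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.3, 0.5])
d_fr = 2.0 * np.arccos(np.sum(np.sqrt(p * q)))   # Fisher-Rao distance, equation (4)
d_fs = fubini_study(np.sqrt(p), np.sqrt(q))      # Fubini-Study distance at psi = sqrt(p)
assert np.isclose(2.0 * d_fs, d_fr)              # geometries agree up to a constant factor
```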
The natural gradient has been thoroughly explored in the VMC literature for ground-state optimization of non-diagonal Hermitian operators (Sorella 1998, Carleo and Troyer 2017), and more recently in the variational quantum algorithm literature (Yuan et al 2019, Stokes et al 2019, Koczor and Benjamin 2019). Finally, we note the possibility of generalizing to higher-rank density operators, the investigation of which is left to future work.

Experiments
Consider combinatorial optimization problems defined on X = {±1}^n. For concreteness we focus on the Max-Cut problem corresponding to an undirected graph G = (V, E) of size |V| = n. The function f ∈ R^X to be minimized is simply

f(x) = Σ_{(i,j)∈E} (x_i x_j − 1)/2,

where −f(x) is the size of the cut corresponding to the configuration x ∈ {±1}^n. After fixing a parametrized family of wavefunctions, we locally optimize the variational upper bound (14) using stochastic natural gradient descent. Following Gomes et al (2019), we choose the variational wavefunction ψ_θ ∈ C^X to be of Boltzmann form (Carleo and Troyer 2017),

ψ_θ(x) = exp( Σ_{i=1}^n c_i x_i ) Π_{j=1}^m 2 cosh( b_j + Σ_{i=1}^n W_{ji} x_i ),

where the variational parameters θ = (W, b, c) ∈ F^{m×n} × F^m × F^n and F denotes either R or C. In the case F = R we have ψ_θ ≻ 0 and the optimization problem is equivalent to NES (Wierstra et al 2014). An example illustrating the increased expressiveness of complex restricted Boltzmann machines (RBMs) compared to their real-valued counterparts for representing classical probability distributions is presented in the supplementary material.
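In this encoding, the objective is a few lines of code (plain Python/NumPy; the 4-cycle instance is our illustrative example, not one of the benchmark graphs):

```python
import numpy as np

def cut_size(edges, x):
    """Number of edges crossing the partition encoded by x in {+1, -1}^n."""
    return sum((1 - x[i] * x[j]) // 2 for i, j in edges)

def f(edges, x):
    """Objective minimized by NES/VMC: the negated cut size."""
    return -cut_size(edges, x)

edges = [(0, 1), (1, 2), (2, 3), (3, 0)]   # 4-cycle; its maximum cut has size 4
x = np.array([1, -1, 1, -1])               # alternating partition cuts every edge
assert cut_size(edges, x) == 4
assert f(edges, x) == -4
```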

Performance evaluation on the Max-Cut problem
Although no polynomial-time algorithm is known for solving Max-Cut on general graphs, many approximation algorithms have been developed in the past decades. The random-cut algorithm is a simple randomized 0.5-approximation algorithm that randomly assigns each node to a partition (e.g. see Mitzenmacher and Upfal (2005)). Goemans and Williamson (1995) improved the performance ratio from 0.5 to at least 0.87856, by making use of the semidefinite programming (SDP) relaxation of the original integer quadratic program. Burer and Monteiro (2001) reformulated the SDP for Max-Cut into a non-convex problem, with the benefit of having lower dimension and no conic constraint, but with the disadvantage of being heuristic in nature.
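The random-cut baseline is trivial to implement: each edge is cut with probability 1/2, so the expected cut size is |E|/2, which yields the 0.5-approximation guarantee. A minimal sketch (the complete graph K₈ is a hypothetical instance of our choosing):

```python
import numpy as np

def random_cut(n, edges, rng):
    """Assign each vertex to a side uniformly at random; expected cut size is |E|/2."""
    x = rng.choice([-1, 1], size=n)
    return sum((1 - x[i] * x[j]) // 2 for i, j in edges)

rng = np.random.default_rng(0)
edges = [(i, j) for i in range(8) for j in range(i + 1, 8)]   # complete graph K8
best = max(random_cut(8, edges, rng) for _ in range(50))
assert best >= len(edges) // 2      # best of many repeats meets the expected value
```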
The above heuristic and approximation algorithms were used as benchmark solvers for comparison with quantum NES. The implementation of the Goemans-Williamson algorithm used the CVXPY package (Diamond and Boyd 2016, Agrawal et al 2018), and the Burer-Monteiro reformulation with the Riemannian trust-region method (Absil et al 2007) used the Manopt toolbox (Boumal et al 2014), which essentially implements the optimization algorithm proposed by Journée et al (2010).
The classical and quantum variants of NES were realized using the VMC (McMillan 1965) method with stochastic reconfiguration (SR) (Sorella 1998), as implemented in the NetKet toolbox (Carleo et al 2019). The SR optimization was performed using a regularization parameter λ = 0.1 and a learning rate η = 5 × 10⁻², for 90 iterations. At each iteration, the number of Monte Carlo sampled observables (the batch size for training) is 4096. The mean performance of the observable batch drawn from the trained model is reported. For the RBM model, the number of hidden variables is set equal to the number of spins; the weights are complex-valued, initialized from a Gaussian distribution with mean 0 and standard deviation (std) 0.01. Throughout the experiments, the timing benchmarks are performed on a single core of an 8-core Intel Xeon CPU E5-2650 v2 @ 2.60 GHz, with 128 GB of memory.
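The generic shape of one SR step can be sketched in a few lines of NumPy (the function name and toy data below are ours; NetKet's actual implementation differs in detail). Given sampled log-derivatives O_k(x) = ∂ log ψ_θ(x)/∂θ_k and local energies, the covariance estimate S plays the role of the metric, regularized by λ exactly as in the λ = 0.1 setting above.

```python
import numpy as np

def sr_update(theta, O, E_loc, eta=0.05, lam=0.1):
    """One stochastic-reconfiguration step (sketch).
    O: (batch, d) array of log-derivatives at the sampled configurations;
    E_loc: (batch,) local energies, equal to f(x) for a diagonal Hamiltonian."""
    O_c = O - O.mean(axis=0)                     # centered log-derivatives
    E_c = E_loc - E_loc.mean()                   # centered local energies
    S = (O_c.conj().T @ O_c) / O.shape[0]        # metric (covariance) estimate
    g = (O_c.conj().T @ E_c) / O.shape[0]        # stochastic energy gradient
    return theta - eta * np.linalg.solve(S + lam * np.eye(len(theta)), g)

# Toy check: local energies linear in the log-derivatives, so the step must descend.
rng = np.random.default_rng(2)
O = rng.normal(size=(500, 2))
w = np.array([1.0, -2.0])
theta1 = sr_update(np.zeros(2), O, O @ w)
assert np.dot(-theta1, w) > 0                    # update moves against the gradient
```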
For evaluation, we constructed a problem instance for each graph size n by randomly generating graph Laplacians with edge density 50%, for n ∈ {50, 70, 100, 150, 200, 250}. This defines a set of problem instances, indexed by n, which are held fixed throughout the experiments. For each fixed problem instance, each algorithm was executed ten times using ten random seeds/initializations. In table 1, we report the mean and std of the performance over graph instances of different sizes. Since the optimal cut for a given problem instance cannot be computed for large-scale problems, we approximate it with an upper bound UBD(I) (Boumal et al 2016), which is the optimal value of the SDP relaxation, following the arguments in Goemans and Williamson (1995). In figure 1(L), we present the ratio cut(A(I))/UBD(I) as a box plot, for the same algorithms A and graph instances I as in table 1. The choice of batch size was found to be a crucial factor controlling the performance of the algorithm. Intuitively, it quantifies the exploration capability in the state space: the algorithm has a better chance of discovering the ground state if it is allowed to explore more. The performance (as measured by the approximation ratio), as well as the training time, are reported as a function of batch size in figure 1(R) for the graph instance with 150 nodes. The ablation study also reveals that increasing the hidden-unit density is correlated with performance, provided that the training batch size is correspondingly increased. This is expected behavior, since increased model capacity is required to capture the increasing complexity of the sampled observables.
Natural gradient descent (Amari 1998, Sorella 1998) proved essential for converging to a good local optimum; results on cRBM-1 suggest that optimizers equipped with natural gradient updates consistently outperformed those without. The use of a complex RBM yielded some improvement relative to the real-valued case, although the performance gap largely disappeared when natural gradient updates were applied. In addition, we found that architectures other than the RBM can achieve high performance: the single-layer perceptron (FC) is comparable with the RBM across all optimizers. In general, stochastic natural gradient descent consistently achieves the best performance across all architectures, in comparison with the other optimizers.

Conclusions
The Max-Cut approximation ratio achieved by NES is competitive with widely used solvers, although this comes at the expense of significantly increased computation time. It will be interesting to investigate other combinatorial optimization problems, particularly from the spin-glass literature, which do not admit semidefinite program relaxations.
It is legitimate to inquire about possible advantages for classical stochastic optimization, given access to efficiently simulable subsets of quantum states, such as the complex Boltzmann machine. In the case of NES, our ablation study indicates that the use of natural gradient descent levels the playing field between the quantum and classical variants.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://github.com/Ericolony/QAOA.