Supplementing Recurrent Neural Networks with Annealing to Solve Combinatorial Optimization Problems

Combinatorial optimization problems can be solved by heuristic algorithms such as simulated annealing (SA) which aims to find the optimal solution within a large search space through thermal fluctuations. The algorithm generates new solutions through Markov-chain Monte Carlo techniques. This sampling scheme can result in severe limitations, such as slow convergence and a tendency to stay within the same local search space at small temperatures. To overcome these shortcomings, we use the variational classical annealing (VCA) framework that combines autoregressive recurrent neural networks (RNNs) with traditional annealing to sample solutions that are uncorrelated. In this paper, we demonstrate the potential of using VCA as an approach to solving real-world optimization problems. We explore VCA's performance in comparison with SA at solving three popular optimization problems: the maximum cut problem (Max-Cut), the nurse scheduling problem (NSP), and the traveling salesman problem (TSP). For all three problems, we find that VCA outperforms SA on average in the asymptotic limit by one or more orders of magnitude in terms of relative error. Interestingly, we reach large system sizes of up to $256$ cities for the TSP. We also conclude that in the best case scenario, VCA can serve as a great alternative when SA fails to find the optimal solution.


Introduction
Combinatorial optimization is widely used in many areas of science, such as physics, computer science, and biology. It also has a wide range of industrial applications including, but not limited to, supply chain, energy, and transportation. Solving combinatorial optimization problems efficiently can boost scientific progress and provide optimal solutions to many industry problems. Nevertheless, a large class of these problems is NP-hard [2]. Thus, clever algorithms are needed to cope with the exponential growth of the search space that contains all possible combinatorial solutions. In particular, a heuristic algorithm based on the concept of annealing was devised, known as simulated annealing (SA) or classical annealing (CA) [3]. The latter is inspired by metallurgy, where slowly reducing the temperature of a metal can result in a more stable configuration with a lower energy [3]. Given the Markovian nature of these algorithms, they can get stuck in local minima due to the rugged nature of some optimization landscapes.
In this paper, we aim to use tools from machine learning to improve traditional implementations of SA. In particular, this work builds on the findings of Ref. [1], which uses recurrent neural networks (RNNs) to variationally simulate the classical version of annealing within a framework called variational classical annealing (VCA) [1]. Here, we apply RNNs through VCA to find the optimal solution to real-world optimization problems, namely the maximum cut problem (Max-Cut), the nurse scheduling problem (NSP), and the traveling salesman problem (TSP). This work shows that the variational implementation is capable of outperforming SA on average in the asymptotic limit of slow annealing. We note that the use of RNNs in our work allows us to avoid the expensive use of GPU memory resources and to reach system sizes beyond 100 cities for the TSP, a limitation encountered with traditional Transformer models [4].

Related Work
Similar studies have proposed the use of a parameterized ansatz to approximate the ground state of combinatorial optimization problems. In particular, Ref. [5] introduces a technique called classical quantum optimization (CQO), which uses neural-network quantum states (NQS) [6] to find the ground state of optimization problems that have been cast into a cost Hamiltonian (also called the energy function) [5,7,8]. The NQS ansatz aims to minimize the cost Hamiltonian by training its parameters using the stochastic reconfiguration (or natural gradient) optimization technique [9]. Our study is based on the findings of Ref. [1], which adds the annealing component on top of CQO. This study was further extended to the task of finding the ground state of quantum systems with frustration [10]. Furthermore, there have been studies that leverage the Transformer architecture [11] to parameterize the solution distribution of the TSP [4,12,13]. The aforementioned papers encode the TSP cities using a Transformer encoder but take different approaches to decoding the next city in the partial (incomplete) tour. Ref. [12] queries the first and last cities in the partial tour, together with a combined representation of all the cities, to decode the next city; for training, the authors use reinforcement learning (RL) with a greedy baseline. Ref. [13] decodes the next city with a query consisting of the last 3 cities in the partial tour, with training done via Actor-Critic RL [13]; the solution generated by the Transformer is then further improved using the 2-OPT heuristic. Ref. [4] decodes the next city autoregressively, meaning the query is constructed using all the cities in the partial tour; training is performed in an RL framework with a greedy baseline similar to Ref. [12].
Additionally, thermal fluctuations have been incorporated into the training of binary neural networks within an approach known as the Bayesian learning rule [14]. Thermal fluctuations and entropy have also been used to improve Actor-Critic RL under a framework called Soft Actor-Critic (SAC) RL [15]. We also see annealing being put forward in machine learning in the training of variational auto-encoders [16]. Similarly to the principle of annealing, a gradual increase in problem complexity has been explored to solve optimization problems under a framework called curriculum learning [17]. We finally note that a summary of the progress made by the machine learning community on the task of solving optimization problems can be found in Ref. [18].

Variational Classical Annealing
For an energy function (or Hamiltonian) H(X) that encodes an optimization problem, our aim is to find the ground state X* by optimizing a model's parameters λ. The variational classical annealing (VCA) algorithm, introduced in Ref. [1], aims to produce a probability distribution P_λ that approximates the Boltzmann distribution at zero temperature through annealing and optimization of the parameters λ. The peak of this Boltzmann distribution provides the optimal solutions of H(X). This algorithm is inspired by the use of variational autoregressive networks to solve statistical mechanics systems [19].
Autoregressive models use the observation from the previous time step as input to the subsequent time step, as is usually done in natural language processing. In the context of our work, we harness the power of autoregressive models for solving optimization problems by sequentially sampling solutions to these problems [1]. This sampling scheme allows us to obtain perfect samples, as opposed to Metropolis sampling, which can produce correlated samples and can get stuck in local minima when the optimization problem has a rugged landscape [20].
Within the setting of VCA, we use RNNs as our model to be optimized. This scheme is performed by minimizing the variational free energy function

F_λ(t) = ⟨H(X)⟩_λ − T(t) S(P_λ).    (1)

Gradient-descent learning on this cost function is used to update λ at every annealing stage. Here, ⟨H(X)⟩_λ is the average Hamiltonian over the RNN distribution P_λ, T(t) denotes the temperature at time t, and S(P_λ) is the Shannon entropy. The latter is formulated as

S(P_λ) = − Σ_i P_λ(X_i) log P_λ(X_i),    (2)

where the sum iterates over all the states {X_i} in the search space. Computing the true entropy over the exact state space is intractable in general. However, one can approximate the entropy by taking a statistical average over the autoregressive samples obtained from the distribution P_λ, as shown in Appendix A. The VCA framework is based on temperature annealing, which was found to be helpful in avoiding mode collapse [19] and in enabling an adiabatic evolution to the lowest energy [1]. In our annealing scheme, we use a linear annealing schedule T(t) = T_0(1 − t), where time t ∈ [0, 1] and T_0 is the initial temperature.
VCA initially performs a series of warm-up steps to bring a randomly initialized RNN probability distribution to the minimum of the free energy F_λ(0), via N_warmup gradient descent steps. Throughout this initial equilibration, the temperature is kept constant at T(0) = T_0. Once the warm-up steps are completed, the annealing process begins. Each time the temperature is cooled from T(t′) to T(t′′), the shape of the free energy landscape changes, depending on the magnitude of ∆t = t′′ − t′. To equilibrate the RNN to a minimum of the new free energy F_λ(t′′), we perform N_train gradient descent steps. This combination of annealing and re-equilibration is repeated until t = 1, where T(1) = 0, marking the end of the annealing process. At this point, the probability distribution P_λ is expected to be peaked around states with an energy equal or close to the ground state energy H(X*).
In our implementation, we parameterize time as t = i/N_annealing, where N_annealing is a user-defined hyperparameter denoting the number of annealing steps, which effectively determines the speed of the annealing process, and i iterates over 0, . . ., N_annealing.
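The warm-up, cooling, and re-equilibration loop described above can be sketched as follows. This is a minimal outline with hypothetical helper names; `train_step` stands in for one gradient-descent update of the RNN parameters on F_λ at the given temperature.

```python
# Sketch of the VCA outer loop (hypothetical helper names; the model and
# optimizer are abstracted away behind `train_step`).

def linear_schedule(i, n_annealing, t0):
    """Temperature T(t) = T0 * (1 - t) with t = i / n_annealing."""
    return t0 * (1.0 - i / n_annealing)

def vca_outline(n_warmup, n_annealing, n_train, t0, train_step):
    """Warm up at T0, then alternate cooling and re-equilibration.

    `train_step(T)` is assumed to perform one gradient-descent step on the
    variational free energy F_lambda at temperature T.
    """
    schedule = []
    for _ in range(n_warmup):          # equilibrate at fixed T(0) = T0
        train_step(t0)
    for i in range(n_annealing + 1):   # t goes from 0 to 1 inclusive
        temp = linear_schedule(i, n_annealing, t0)
        schedule.append(temp)
        for _ in range(n_train):       # re-equilibrate after each cooling step
            train_step(temp)
    return schedule                    # ends at T(1) = 0
```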
Optimization Problems

The Nurse Scheduling Problem (NSP)
We now focus our attention on the optimization problems we tackle using the VCA algorithm. The first is the NSP, which belongs to the class of scheduling problems in operations research and is known to be NP-hard [21]. The NSP aims to assign nurses to specific shifts in a hospital under a set of imposed constraints, which make it hard to find satisfactory solutions. In this work, we apply VCA to the formulation of the NSP introduced in Ref. [22].
This formulation has the following constraints: the hard nurse constraint, the hard shift constraint, and the soft nurse constraint. The hardness determines the importance of respecting the constraint. The hard nurse constraint requires that no nurse works two consecutive shifts, as nurses require sufficient rest after a shift. The hard shift constraint emphasizes the need to deploy enough nurses to handle a given shift. Finally, the soft nurse constraint favors solutions with an even distribution of nurses assigned to shifts. In our paper, the terms 'shift' and 'day' are synonymous.
To better understand the NSP, we look at what defines such a problem. In a hospital, we have N individual nurses indexed by n ∈ {1, . . ., N} and D working days indexed by d ∈ {1, . . ., D}. A solution to the NSP can be represented by a matrix X_M ∈ {0, 1}^{N×D}: if nurse n has been assigned to day d, then x_{n,d} = 1, otherwise x_{n,d} = 0. We use the flattened vector representation of X_M, denoted X ∈ {0, 1}^{ND}, to represent our solution; with index manipulation, the matrix elements of X_M map to the elements of X. The three constraints are formulated individually as separate quadratic penalty functions [22]. The hard nurse constraint is quantified by

H_1(X) = α Σ_{n=1}^{N} Σ_{d=1}^{D−1} x_{n,d} x_{n,d+1},    (3)

which penalizes a solution that has instances of a nurse n working two days in a row. Every time this violation occurs, a penalty of α is added.
The hard shift constraint is formulated using the workforce W(d) required on a particular day d:

H_2(X) = Σ_{d=1}^{D} ( Σ_{n=1}^{N} x_{n,d} − W(d) )^2.    (4)

This constraint Hamiltonian aims to equalize the accumulated contribution of the nurses assigned to a working day with the actual workforce required on that day. A penalty is incurred when there are not enough nurses assigned to a day or, conversely, when there is a surplus of nurses.
Lastly, we have the soft nurse constraint, which promotes an equal distribution of nurses across the working days:

H_3(X) = Σ_{n=1}^{N} ( Σ_{d=1}^{D} x_{n,d} − F(n) )^2,    (5)

where F(n) gives the number of days nurse n wishes to work. If an equal distribution of nurses is the desired outcome, then setting F(n) = ⌊D/N⌋ for all nurses is required, as this ensures a fair workload when D is a multiple of N.
Summing the three penalty terms in Eqs. (3), (4), and (5) gives the resultant energy function

H(X) = H_1(X) + λ H_2(X) + γ H_3(X),    (6)

where λ and γ are real coefficients that scale the penalty terms H_2(X) and H_3(X), respectively. H(X) is minimized when the fewest constraints are broken. If all the constraints are obeyed, then the optimal energy is H(X*) = 0. However, for certain combinations of N and D, it is not always possible for the ground state X* to respect all the constraints, resulting in H(X*) > 0.
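As a concrete check of the three penalty terms and their weighted sum, here is a minimal NumPy sketch; the function and argument names are ours, not from the original implementation.

```python
import numpy as np

def nsp_energy(x, w, f, alpha, lam, gamma):
    """Energy H(X) = H1 + lam*H2 + gamma*H3 for an N x D schedule matrix x.

    x[n, d] = 1 if nurse n works day d; w[d] is the workforce needed on
    day d; f[n] is the number of days nurse n wishes to work.
    """
    h1 = alpha * np.sum(x[:, :-1] * x[:, 1:])  # consecutive-shift violations
    h2 = np.sum((x.sum(axis=0) - w) ** 2)      # workforce mismatch per day
    h3 = np.sum((x.sum(axis=1) - f) ** 2)      # workload mismatch per nurse
    return h1 + lam * h2 + gamma * h3
```

For the D = 5, N = 2 ground state discussed in Appendix C (alternating nurses, W(d) = 1, F(n) = 2), this evaluates to γ, as expected.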
Interestingly, given the quadratic form of H(X) and the binary nature of X, H(X) can be represented as a quadratic unconstrained binary optimization (QUBO) model [22,23]:

H(X) = Σ_{i,j} q_{i,j} x_i x_j + c,

where the coefficients q_{i,j} and the constant c are derived from Eq. (6). This expression can be written more compactly as H(X) = X^T Q X + c, where Q ∈ R^{ND×ND} is a square matrix with matrix elements q_{i,j}.

The Maximum Cut Problem (Max-Cut)
The second problem we tackle with VCA is the Max-Cut problem, which is known to be NP-hard [24]. The Max-Cut problem is defined as follows: given an undirected graph G(V, E), we make a cut along the edges of G to obtain two complementary sets of vertices such that the number of edges between the two sets is maximized. In other words, if E_cut ⊂ E is the set of edges bridging the two complementary sets of vertices, we wish to maximize its size |E_cut|. In the context of our study, we work with unweighted graphs.
To model the partition of the graph, we assign a value of either 1 or 0 to each vertex to label the partition it belongs to. Therefore, any solution for a graph with N vertices is given by X = (x_1, . . ., x_N) with x_i ∈ {0, 1}. This mapping allows us to use an energy function H(X) that computes the negative of the number of edges in E_cut, obtained by summing the following expression over all the edges (i, j) ∈ E:

−(x_i + x_j − 2 x_i x_j).    (7)

Taking the negative converts Max-Cut into a minimization problem. An edge connecting x_i and x_j exists in E_cut when x_i ≠ x_j, in which case the expression above contributes −1 to H(X). For simplicity, we note that H(X) is equivalent to the Kronecker delta expression

H(X) = − Σ_{(i,j)∈E} (1 − δ_{x_i, x_j}).

Note that H(X) in Eq. (7) is quadratic due to the binary nature of the variables x_i. Therefore, H(X) can be cast into QUBO form as H(X) = X^T Q X, where Q ∈ R^{N×N} is the QUBO square matrix with matrix elements derived from Eq. (7).
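The edge-wise expression and its QUBO form can be verified with a short NumPy sketch; the helper names are ours.

```python
import numpy as np

def maxcut_energy(x, edges):
    """H(X) = -|E_cut|: each edge with x_i != x_j contributes -1."""
    return -sum(1 for i, j in edges if x[i] != x[j])

def maxcut_qubo(n, edges):
    """Build Q such that x^T Q x equals H(X) (the constant c is 0 here)."""
    q = np.zeros((n, n))
    for i, j in edges:
        q[i, j] += 1.0   # off-diagonal: +2*x_i*x_j, split over (i,j) and (j,i)
        q[j, i] += 1.0
        q[i, i] -= 1.0   # diagonal: -x_i - x_j (x_i^2 = x_i for binary x)
        q[j, j] -= 1.0
    return q
```

On a triangle graph, cutting one vertex away from the other two severs two edges, so both forms should return −2 for that assignment.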

The Traveling Salesman Problem (TSP)
The TSP is another famous combinatorial optimization problem that we attempt to solve with VCA. The aim of the TSP is to visit each city in a given set exactly once and return to the starting city, in an order that minimizes the total cost of travel. In our study, the cost is the total distance traveled. We are particularly interested in the 2D Euclidean TSP, which is known to be NP-complete [25].
Given N cities on a 2D Cartesian plane, a solution tour takes the form X = (π(1), . . ., π(N)), where π is a permutation of the N cities. Each city π(i) has a tuple of x and y coordinates (x_{π(i)}, y_{π(i)}). An important property of a valid solution is that every city on the tour is unique. We also remark that the cost of a tour is invariant under cyclic translations of the permutation order; this property is exploited in the construction of our parameterized model. The energy function of the 2D Euclidean TSP is given by the sum of the Euclidean distances between consecutive cities, including the trip from the last city back to the first:

H(X) = Σ_{i=1}^{N} sqrt( (x_{π(i)} − x_{π(i%N+1)})^2 + (y_{π(i)} − y_{π(i%N+1)})^2 ),

where % denotes the modulo operator.
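The tour energy above amounts to a few lines of NumPy; this is a sketch with our own function name, where `coords[k]` holds the coordinates (x_k, y_k) and the wrap-around index implements the modulo.

```python
import numpy as np

def tsp_energy(perm, coords):
    """Total Euclidean tour length, closing the loop from the last city
    back to the first via the (i + 1) % N wrap-around."""
    n = len(perm)
    total = 0.0
    for i in range(n):
        a = coords[perm[i]]
        b = coords[perm[(i + 1) % n]]
        total += np.hypot(a[0] - b[0], a[1] - b[1])
    return total
```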

Implementation
As we mentioned in Sec. 3, our VCA implementation uses RNNs as a parameterized probability distribution. With RNNs, a solution X is built by sequentially realizing its elements x_i. In other words, solutions are sampled according to the probability chain rule

P_λ(X) = Π_{i=1}^{N} P_λ(x_i | x_{<i}),    (10)

where the product of conditional probabilities P_λ(x_i | x_{<i}) gives the joint probability of the solution X. The nature of x_i depends on the optimization problem: for the NSP and Max-Cut there are N binary variables x_i ∈ {0, 1}, whereas for the TSP we have N variables x_i ∈ {1, . . ., N} satisfying the condition x_i ≠ x_j for all i ≠ j. The VCA architecture can be visualized as an N-sized chain of RNN cells, as illustrated in Fig. 1(a). We use the term 'site' to refer to a particular position (among N) in this chain. First, we look at the case where all RNN cells share the same set of parameters λ [26], starting with the description of a Vanilla RNN cell [27]. Here, the hidden state h_n and the conditional probability vector P_n at site n are given by

h_n = ELU(W x_{n−1} + U h_{n−1} + b),    (11)
P_n = Softmax(V h_n + c).    (12)

In the above equations, the weights W, U, V and bias vectors b, c jointly parameterize the RNN cells. In Eq. (11), x_{n−1} is the one-hot encoded vector of the variable x_{n−1} sampled by the previous RNN cell, and h_{n−1} is the hidden state from the previous RNN cell. The size of h_n is a hyperparameter called the number of memory (or hidden) units, denoted d_h, which controls the expressivity of the Vanilla RNN cell. Finally, our choice of non-linear activation function is the exponential linear unit (ELU) [28]. In Eq. (12), the Softmax function, given by Softmax(σ)_i = e^{σ_i} / Σ_{j=1}^{n} e^{σ_j}, computes the categorical distribution from which we sample x_n. In this case, the conditional probability of x_n, denoted P_λ(x_n | x_{<n}), can be computed as

P_λ(x_n | x_{<n}) = P_n · x_n,    (13)

where · denotes the dot product. The set of N conditional probabilities {P_λ(x_1), P_λ(x_2|x_1), . . ., P_λ(x_N|x_{<N})} is used to compute P_λ(X) in Eq. (10).
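A minimal sketch of this sampling recursion, with our own helper names and randomly initialized parameters, illustrates how a Vanilla RNN chain samples a solution together with its exact log-probability; the parameter shapes are assumptions consistent with the description above.

```python
import numpy as np

rng = np.random.default_rng(0)

def elu(z):
    return np.where(z > 0, z, np.exp(z) - 1.0)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def sample_autoregressive(params, n_sites, n_outcomes, d_h):
    """Sample X = (x_1, ..., x_N) site by site:
    h_n = ELU(W x_{n-1} + U h_{n-1} + b),  P_n = Softmax(V h_n + c).
    Returns the sample and its exact log-probability."""
    W, U, b, V, c = params
    h = np.zeros(d_h)
    x_onehot = np.zeros(n_outcomes)       # the input to the first cell is null
    sample, logp = [], 0.0
    for _ in range(n_sites):
        h = elu(W @ x_onehot + U @ h + b)
        p = softmax(V @ h + c)
        x = rng.choice(n_outcomes, p=p)   # sample from the categorical P_n
        logp += np.log(p[x])
        sample.append(x)
        x_onehot = np.eye(n_outcomes)[x]  # feed one-hot x_n to the next cell
    return sample, logp
```

The returned log-probability is exact, which is what makes the entropy term of the free energy tractable to estimate from samples.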
It should be noted that random interactions between the variables x_i are best captured using RNNs with site-dependent parameters, i.e., each RNN cell has its own set of parameters dedicated to that particular site. This observation is motivated by Ref. [1], in which disordered systems, such as the Sherrington-Kirkpatrick [29] and Edwards-Anderson [30] spin glass models, are optimized using site-dependent RNNs. In this case, the Vanilla RNN recursions become

h_n = ELU(W_n x_{n−1} + U_n h_{n−1} + b_n),    (14)
P_n = Softmax(V_n h_n + c_n).    (15)

We now focus our attention on another RNN architecture for tackling optimization problems with long-range interactions, namely the Dilated RNN [31]. This architecture uses recurrent skip connections that propagate information between sites that are far from each other, as shown in Fig. 1(b). From the illustration, we also see that this architecture has a stack of dilated layers that dictate the length of these skip connections, given by 2^{l−1}, where l is the layer number such that 1 ≤ l ≤ L. For a system of size N, we choose L = ⌈log_2(N)⌉ layers, taking inspiration from tensor networks [1,32]. Here, we index the RNN weights and biases according to the layer l and the site n where the RNN cell is located. Thus, the RNN recursion becomes

h_n^[l] = ELU( W_n^[l] h_n^[l−1] + U_n^[l] h_{max(0, n−2^{l−1})}^[l] + b_n^[l] ),    (16)

such that h_n^[0] = x_{n−1} and h_0^[l] = 0 for 1 ≤ n ≤ N and 1 ≤ l ≤ L, with the conditional probabilities obtained from the top layer as P_n = Softmax(V_n h_n^[L] + c_n). We note that for Max-Cut and the NSP, we use Vanilla and Dilated RNNs with site-dependent parameters to reflect the randomness in the coupling between the discrete variables. Furthermore, given the binary nature of the discrete variables for the NSP and Max-Cut, the output probability vector P_n has size 2. For the TSP, which enjoys the translation invariance property, we use parameter sharing for the Dilated RNN, i.e., W_n^[l] = W^[l], U_n^[l] = U^[l], b_n^[l] = b^[l], V_n = V, and c_n = c. Additionally, the output probability vector P_n has size N, where N is the number of cities.
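As a small illustration of the dilated layout above, the number of layers and the per-layer skip lengths can be computed as follows (the helper name is ours):

```python
import math

def dilated_layout(n):
    """Number of layers L = ceil(log2(N)) and the skip length 2^(l-1)
    of each layer l, as used by the Dilated RNN."""
    L = max(1, math.ceil(math.log2(n)))
    return L, [2 ** (l - 1) for l in range(1, L + 1)]
```

For N = 256 cities, this gives 8 layers with skip lengths 1, 2, 4, ..., 128, so the top layer directly connects sites half a system apart.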
To enforce the constraint of not visiting a city twice in the TSP, we apply masking during autoregressive sampling, as clarified in Appendix B. This trick allows us to avoid using an energy penalty to enforce the constraint [33], and the masking also reduces the combinatorial search space. We remark that we conducted experiments in which we replaced the Vanilla RNN cell with LSTM and GRU cells, respectively; compared to the Vanilla RNN models, the GRU and LSTM models do not generally improve on our results. We also note that the use of RNN architectures in our study allows us to go beyond the system sizes that can be reached by Transformer architectures [12,4]; in particular, we demonstrate that we can reach instances with N > 100 cities. This advantage is a result of the cheaper computational cost of Dilated RNNs compared to traditional Transformer architectures.

Experiments and Discussion
We now focus our attention on our experiments, where we compare VCA and SA on the Max-Cut, NSP, and TSP optimization problems. For all three problems, we measure the average performance of the two algorithms using the residual energy per site ϵ_res ≡ (H(X) − H(X*))/N, which quantifies the average error made on the estimation of the ground state energy H(X*). Here, N denotes the system size of the problem [34]. For VCA, H(X) is estimated by averaging over the N_samples autoregressive samples generated from the RNNs after training. For SA, H(X) is approximated by taking the mean over different SA runs. An important observation is the ability of RNNs to sample a relatively large number of solutions at the end of annealing, which SA does not offer. We also consider the best-case scenario by computing the minimal residual energy per site ϵ_res^min ≡ (H_min − H(X*))/N, where H_min is the lowest energy obtained by either SA or VCA across the different samples. To make the connection with the literature, we highlight that ϵ_res^min is directly proportional to the optimality gap, defined as g ≡ (H_min/H(X*)) − 1 [12]. More details about the VCA hyperparameters and the SA implementation can be found in Appendix A.
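The two error measures and the optimality gap can be computed from a set of sampled energies as follows; this is a small sketch with our own function name, and the gap expression assumes a nonzero ground-state energy.

```python
def residual_energy_per_site(energies, e_ground, n):
    """Average residual energy per site, minimal residual energy per site,
    and the optimality gap g = H_min / H(X*) - 1 (assumes e_ground != 0)."""
    e_res = (sum(energies) / len(energies) - e_ground) / n
    h_min = min(energies)
    e_min_res = (h_min - e_ground) / n
    gap = h_min / e_ground - 1.0
    return e_res, e_min_res, gap
```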
To compare VCA and SA, ϵ_res is plotted against the number of annealing steps N_annealing (see Sec. 3). In terms of hardware, we use an Nvidia Titan XP GPU for VCA and a CPU for SA.
Figure 2: Plot of the residual energy per site ϵ_res against the number of annealing steps N_annealing. For all the RNN variations of VCA, we compare with SA on a Max-Cut problem with system size N = 128 and edge density (a) ρ = 0.12, (b) ρ = 0.25, and (c) ρ = 0.5. We observe that VCA outperforms SA in the limit of large N_annealing.

Max-Cut. We use unweighted Max-Cut instances with N = 128 vertices and edge densities ρ = 0.12, 0.25, 0.5, generated by the Rudy graph generator by G. Rinaldi [35]. These graphs were originally generated to benchmark the classical-quantum optimization (CQO) technique [5]. The choice of increasing densities is used to investigate the effect of adding complexity to the Max-Cut problem while keeping the number of vertices constant. We approximate the ground state of these graphs using the ConicBundle package by C. Helmberg from the Biq Mac Solver server [36].
We use the Vanilla RNN (see Eq. (14)) and the Dilated RNN (see Eq. (16)) with site-dependent parameters to run VCA on this problem. As illustrated in Fig. 2, the general trend for both VCA and SA is that ϵ_res decreases with the number of annealing steps. VCA with Dilated RNNs (VCA-Dilated) outperforms SA on average starting from N_annealing = 2^11, 2^8, and 2^14 for the densities 0.12, 0.25, and 0.5, respectively. Although SA obtains better energies on average in the fast-annealing regime, VCA provides better convergence at a larger number of annealing steps. This observation suggests that VCA requires a threshold number of annealing steps before it consistently converges to the ground state.
If we consider the best solutions obtained by SA and VCA (see Tab. 1), we observe that for all three densities, SA finds the ground state at N_annealing = 2^4. Both VCA variants require a longer annealing time to find the ground states. This means that SA can find the optimal solution of our Max-Cut instances in fewer annealing steps than VCA.
NSP. We use a configuration of 15 days and 7 nurses, giving a system size of 15 × 7 = 105. Taking inspiration from Ref. [22], we use the NSP parameters described in Sec. 4.1. Based on the chosen parameters, the ground state energy is H(X*) = γ; the proof is provided in Appendix C. As for the Max-Cut problem, we use Vanilla and Dilated RNNs with site-dependent parameters. The results are presented in Fig. 3(a). Again, we see that increasing the number of annealing steps allows VCA to reduce the average error ϵ_res. Both VCA-Dilated and VCA-Vanilla reach lower error margins than SA at large N_annealing. Between N_annealing = 2^9 and 2^13, VCA-Dilated maintains the lowest average error among all the models. For the slowest annealing schedule, N_annealing = 2^14, VCA-Vanilla reaches the overall lowest average error, whereas VCA-Dilated sees a sudden increase in ϵ_res. A possible reason for this sudden increase is our choice of hyperparameters, which could be further tuned. We also note that for a small number of annealing steps, SA outperforms VCA in the average case.
Lastly, by considering the best solutions, SA finds the ground state at N_annealing = 2^4, whereas VCA lands on the optimal solution around N_annealing = 2^6, as illustrated in Tab. 1. This means that SA, as for Max-Cut, needs fewer annealing steps than VCA to find optimal solutions.

TSP. We use three instances of TSP with varying system sizes N = 64, 128, 256. Each component of the coordinates (x, y) of the cities has been uniformly sampled from the range (0, 1). Approximations of the ground state tours of these TSP configurations are provided by the Concorde algorithm [37] through the NEOS Server [38].
Unlike for the previous problems, we use VCA with Dilated RNNs with shared parameters, owing to the translation invariance property of the TSP. The focus on Dilated RNNs for this problem is motivated by the presence of long-range interactions in the TSP, as well as by the overall advantage of Dilated RNNs over Vanilla RNNs on Max-Cut and the NSP. The results of our experiments are presented in Fig. 3(b). For N = 64 and 128, VCA reaches a residual energy per site ϵ_res < 10^−4, whereas for N = 256 we reach an average error ϵ_res ∼ 10^−3. We also observe that SA reaches ϵ_res ∼ 10^−2 in the slow-annealing regime. Overall, in the range of medium to slow annealing (N_annealing ≥ 2^7), VCA demonstrates a better average performance than SA. Furthermore, for the largest system size, VCA requires a relatively larger N_annealing to noticeably improve over SA, highlighting the increase in complexity that arises with larger system sizes.
By considering the best-case scenario, VCA finds the best tours for all three problem sizes, and it finds the exact ground state for N = 64, as illustrated in Tab. 1. In contrast, SA finds tours with a higher cost and gets stuck in local minima. Interestingly, this result differs from what we observed for Max-Cut and the NSP. Thus, we conclude that if a user is interested in finding the best solutions to a particular optimization problem, VCA is a valuable alternative when SA fails to find an optimal solution.

Conclusion
In this work, we supplement recurrent neural networks (RNNs) with the principle of annealing under a framework called variational classical annealing (VCA) to solve real-world optimization problems.
In our experiments, we find that VCA outperforms traditional simulated annealing on average in the limit of slow annealing, suggesting that a variational simulation of classical annealing is a great alternative for solving optimization problems. When considering the best-case scenario, we conclude that VCA is best used when SA fails to provide a good solution. The failure of SA is likely to occur for models with a glassy landscape, which has a large number of local minima, as in spin glass models [39,40,41]. We also highlight that we reach large system sizes of up to 256 cities for the TSP, by virtue of the cheaper cost of RNNs compared to ordinary Transformers [42].
In our results, we see that even though VCA outperforms SA on average for all three problems in the limit of slow annealing, VCA takes more time per iteration than SA for the same number of annealing steps (see Tab. 1). This limitation could be mitigated by meta-learning and transfer-learning techniques that predict the solutions of similar optimization problems within a shorter amount of time [43]. Furthermore, with further hardware advancements, VCA is expected to show speed-ups in running time. We have also seen instabilities during the training of RNNs, some of which we mitigated by a careful choice of hyperparameters, such as tuning the learning rate and varying the number of training samples. Overall, there is still significant room for exploration in terms of hyperparameter tuning, hardware efficiency, and choice of architectures.
From a broader perspective, we believe various research directions could be undertaken in the future. It would be interesting to benchmark VCA against other Monte Carlo optimization algorithms such as parallel tempering [44]. It would also be interesting to extend our work to optimization problems with continuous variables [45], using models that can handle such variables [46,47]. Additionally, one could incorporate graph autoregressive networks into the VCA scheme to take the graph structure of some optimization problems into account [48,49]. There is also flexibility in tuning the temperature cooling schedule to potentially improve the performance of VCA [50]. We believe that the combination of deep learning techniques and the principle of annealing is an interesting research direction that could improve our current solutions to optimization problems.

A Hyperparameters and Training
The VCA hyperparameters are detailed in Tab. 2. Our SA implementation is based on Metropolis moves. At the initial temperature T_0, we perform N_warmup warm-up steps, where each step corresponds to a sweep of N Metropolis moves and N is the number of variables. Afterwards, we alternate between annealing steps (decreasing the temperature) and N_eq equilibration steps; the latter are similar in spirit to N_train in VCA. The hyperparameters of SA (see Tab. 3) are chosen to be consistent with the VCA hyperparameters.
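A minimal sketch of this SA scheme for binary variables, using single-flip Metropolis moves and the same linear temperature schedule as VCA, is given below; the function and argument names are ours.

```python
import numpy as np

rng = np.random.default_rng(1)

def sa_binary(energy_fn, n, n_warmup, n_annealing, n_eq, t0):
    """Metropolis SA on N binary variables with T(t) = T0 * (1 - t);
    one sweep = N single-flip Metropolis moves."""
    x = rng.integers(0, 2, size=n)
    e = energy_fn(x)

    def sweep(temp):
        nonlocal e
        for _ in range(n):
            i = rng.integers(n)
            x[i] ^= 1                      # propose a single bit flip
            e_new = energy_fn(x)
            de = e_new - e
            if de <= 0 or (temp > 0 and rng.random() < np.exp(-de / temp)):
                e = e_new                  # accept the move
            else:
                x[i] ^= 1                  # reject: undo the flip

    for _ in range(n_warmup):              # warm-up sweeps at T0
        sweep(t0)
    for i in range(n_annealing + 1):       # alternate cooling / equilibration
        temp = t0 * (1.0 - i / n_annealing)
        for _ in range(n_eq):
            sweep(temp)
    return x, e
```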
To train our RNN architectures, we minimize the variational free energy F_λ(t) = ⟨H(X)⟩_λ − T(t)S(P_λ). Here, we approximate the true variational free energy at any given temperature by averaging over N_samples solutions sampled from the RNN distribution P_λ:

F_λ(t) ≈ (1/N_samples) Σ_{i=1}^{N_samples} ( H(X_i) + T(t) log P_λ(X_i) ).

Furthermore, the gradients of the variational free energy ∂_λ F_λ(t) have a similar expression, as shown in Ref. [1]. We optimize the parameters λ with the Adam optimizer [52].
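The sample estimate of F_λ(t) can be written directly from the energies and exact log-probabilities of the sampled solutions (the helper name is ours; note that the +T log P term is minus the temperature-weighted entropy):

```python
import numpy as np

def free_energy_estimate(energies, log_probs, temp):
    """Sample estimate of F_lambda(t): the mean over N_samples of
    H(X_i) + T(t) * log P_lambda(X_i)."""
    energies = np.asarray(energies, dtype=float)
    log_probs = np.asarray(log_probs, dtype=float)
    return float(np.mean(energies + temp * log_probs))
```

At T = 0 the estimate reduces to the mean sampled energy, which is the quantity annealing drives toward H(X*).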

B Masking in TSP
To avoid revisiting the same cities during autoregressive sampling in VCA, we take inspiration from Ref. [54], where constraints were applied in the context of quantum physics, and from Ref. [55] in the context of the TSP. Here, we apply masking on the visited cities when computing the conditional probabilities P_θ(c_i | c_{i−1}, . . ., c_2, c_1) at the level of the Softmax layer (see Eq. (13)), where c_i ∈ {1, 2, . . ., N} corresponds to the city to be visited at step i and θ are the parameters of our model. The masking at step i is done as follows: • If cities (c_1, c_2, . . ., c_{i−1}) were visited, where c_j ∈ {1, 2, . . ., N}, then P_θ(c_j | c_{i−1}, . . ., c_2, c_1) is set to zero for 1 ≤ j ≤ i − 1, and the remaining probabilities are renormalized.
After implementing these steps, our autoregressive model generates permutations π of the cities. We note that this masking trick can be implemented in parallel across the batch size.
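The masking step itself amounts to zeroing the entries of the categorical distribution that correspond to visited cities and renormalizing, e.g. (a sketch with our own function name):

```python
import numpy as np

def masked_softmax(logits, visited):
    """Zero out the probability of already-visited cities and renormalize,
    as done at the Softmax layer during autoregressive TSP sampling."""
    logits = np.asarray(logits, dtype=float)
    probs = np.exp(logits - logits.max())  # numerically stable exponentials
    probs[list(visited)] = 0.0             # mask the visited cities
    return probs / probs.sum()
```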

C NSP Ground State Energy
Since each day requires only W(d) = 1 unit of workforce and exactly one nurse is assigned to each day, the hard shift constraint is also satisfied, giving H_2(X*) = 0. However, the soft nurse constraint is broken exactly once because 5 % 2 = 1, where % is the modulo operator. This leaves exactly one day (here d_5) to which a nurse (in this case n_1) must be assigned for an extra working day. Therefore, H_3(X*) = 1. Summing the penalty functions with their respective scalings, we obtain H(X*) = H_1(X*) + λH_2(X*) + γH_3(X*) = γ.
One can notice that it is energetically more favorable to break the soft nurse constraint once than to break the hard shift constraint, since γ < λ, α. Thus, a solution X′ in which no nurse works on d_5 cannot be the ground state (see Tab. 4(b)).

Figure 1: The two variants of RNNs used in our study. (a) A Vanilla RNN, where each RNN cell receives an input x_{n−1} and a hidden state h_{n−1} and outputs a new hidden state h_n. The latter is fed to a Softmax layer (denoted by S) to output the conditional probability P_n. (b) A Dilated RNN architecture with ⌈log_2(N)⌉ layers, where N is the system size. This RNN uses longer recurrent connections to capture long-range interactions.

Figure 3: Plots of the residual energy per site ϵ_res vs. N_annealing. (a) For the NSP with 15 days and 7 nurses, both the Vanilla and Dilated RNN variants of VCA are plotted alongside SA. (b) For the TSP with system sizes N = 64, 128, 256, VCA with Dilated RNNs and shared parameters is plotted alongside SA. In both panels, we see that VCA reaches a lower ϵ_res than SA for large N_annealing.

Table 4: The NSP parameters used in our experiments, along with (a) a ground state solution X* and (b) a sub-optimal solution X′ for the NSP with D = 5 and N = 2.
Instead of deriving the ground state energy for the NSP with D = 15 days and N = 7 nurses, we show near-identical calculations for the relatively simpler configuration D = 5, N = 2. The NSP parameters used in this example are shown in Tab. 4. Looking at the ground state (see Tab. 4(a)), we see that the hard nurse constraint is respected, since neither nurse n_1 nor nurse n_2 has to work two days in a row. Referring back to the penalty functions of Sec. 4.1, we obtain H_1(X*) = 0.

Table 1: A summary table of the best performances of VCA and SA on the NSP, Max-Cut, and TSP. The minimal residual energy per site is defined as ϵ_res^min ≡ (H_min − H(X*))/N, where H_min is the lowest energy obtained by either SA or VCA across the different samples. 'D' stands for the Dilated RNN results and 'V' for the Vanilla RNN results. Values in bold font correspond to the lowest N*_annealing needed to find the exact ground state, or the lowest approximation to it, when comparing SA and VCA; bold values also highlight the lowest ϵ_res^min and the lowest estimated time to find the exact or best solution.

Table 2: The hyperparameters used for the VCA experiments in our study.