Machine-learning-assisted Monte Carlo fails at sampling computationally hard problems

Several strategies have been recently proposed in order to improve Monte Carlo sampling efficiency using machine learning tools. Here, we challenge these methods by considering a class of problems that are known to be exponentially hard to sample using conventional local Monte Carlo at low enough temperatures. In particular, we study the antiferromagnetic Potts model on a random graph, which reduces to the coloring of random graphs at zero temperature. We test several machine-learning-assisted Monte Carlo approaches, and we find that they all fail. Our work thus provides good benchmarks for future proposals for smart sampling algorithms.


A. Motivations
Sampling from a given target probability distribution P t (σ 1 , · · · , σ N ) over N degrees of freedom can become extremely hard when N is large. A universal (i.e. systemindependent) strategy for sampling consists in starting from a random configuration of σ = {σ i } i=1,··· ,N , and generating a local Monte Carlo Markov Chain (MCMC), by sequentially proposing an update of one of the σ i , and accepting or rejecting it with a proper probability (e.g. Metropolis-Hastings), until convergence [1]. However, for large N , the convergence time of the MCMC can grow exponentially in N , because of non-trivial longrange correlations that make local decorrelation extremely hard [2].
A solution to this problem consists in identifying the proper set of correlated variables, and proposing global updates of such variables together, in such a way to speed up convergence [3]. However, this process is not universal, because it relies on the proper identification of systemdependent correlations, which is not always possible. For instance, in disordered systems such as spin glasses, the nature of correlated domains is extremely elusive and proper global moves are not easy to identify [4,5]. Another approach, which has been particularly successful in atomistic models of glasses, consists in unconstraining some degrees of freedom, evolve them and constrain them back [6][7][8][9][10], but again it is model-specific. Alternative proposals based on a renormalization group approach [11,12] also rely on the identification of system-dependent collective variables.
A recently developed line of research, see e.g. [13][14][15][16][17][18][19][20], proposed to solve the problem in an elegant and universal way, by machine learning proper MCMC moves. In a nutshell, the idea is to learn an auxiliary probability distribution P a (σ), which (i) can be sampled efficiently (e.g. linearly in N ) and (ii) provides a good approximation of the target probability. Then, the hope is to use the auxiliary distribution to propose smart MCMC moves. Using this strategy with autoregressive architectures that ensure efficient sampling, some authors found convergence speedup [13,14,16], but others found less promising results [20].
In order to make these studies more systematic, and really assess the performance of the method, it is important to have good benchmarks, i.e. problems that are guaranteed to be really hard to sample by local MCMC. In the early 90s, the very same problem had to be faced to assess the performance of local search algorithms that looked for solution of optimization or satisfiability problems [21]. In that case, the problem of generating good benchmarks was solved by introducing an ensemble of random instances of the problem under study [21][22][23][24]. It was later shown, both numerically and analytically, that these random optimization/satisfiability problems require a time scaling exponentially in N for proper sampling at low enough temperatures in certain regions of parameter space [2]. Hence, they provide very good benchmarks for sampling algorithms. Yet, the recent attempts to apply machine learning methods to speed-up sampling have not considered these benchmarks.
In this paper, we consider a prototypical hard-to-sample random problem, namely the coloring of random graphs, and we show that all the proposed methods fail to solve it. Our results confirm that this class of problems are a real challenge for sampling methods, even assisted by smart machine-learned moves. The model investigated in [20] possibly belong to this class. In addition, we discuss some practical issues such as mode-collapse in learning the auxiliary model, which happens when the target probability distribution has multiple peaks and the auxiliary model only learns one (or a subset) of them.

B. State of the art
Before proceeding, we provide a short review of the papers that motivated our study. Because the field is evolving rapidly, this does not aim at being an exhaustive review, and despite our best efforts, it is possible that we missed some relevant references.
Ref. [25] considered the general problem of whether a target probability distribution P t can be approximated by a simpler one P a , in particular by considering the Kullback-Leibler (KL) divergence If this quantity is proportional to N for N → ∞, then P t (σ)/P a (σ) is typically exponential in N , and as a result samples proposed from P a are very unlikely to be accepted in P t . A small D KL (P t ||P a )/N (ideally vanishing for N → ∞) seems therefore to be a necessary condition for a good auxiliary probability, which provides a quantitative measure of condition (ii) above. Ref. [25] suggested, by using small disordered systems (N ∼ 20), that there might be a phase transition, for N → ∞, separating a phase where D KL (P t ||P a )/N vanishes identically and a phase where it is positive.
(2) Each term P i a is then approximated by a neural network, which takes as input {σ 1 , · · · , σ i−1 } and gives as output P i a , i.e. the probability of σ i conditioned to the input. Such a representation of P a , also called Masked Autoencoder for Distribution Estimator (MADE) [26], allows for very efficient sampling, because one can first sample σ 1 , then σ 2 given σ 1 , and so on, in a time scaling as the sum of the computational complexity of evaluating each of the P i a , which is typically polynomial in N for reasonable architectures. Hence, this scheme satisfies condition (i) above. The simplest choice for such a neural network is a linear layer followed by a softmax activation function. Ref. [13] showed that using such an architecture, several statistical models could be well approximated, and the Boltzmann distribution of a Sherrington-Kirkpatrick (SK) spin glass model (with N = 20) could be efficiently sampled. Note that the model in Ref. [13] was trained by a variational procedure, which minimizes D KL (P a ||P t ) instead of D KL (P t ||P a ). This method is computationally very efficient as it only requires an average over P a , which can be sampled easily, instead of P t , but it is prone to mode-collapse (see Sec. II for details). Moreover, this work was limited to quite small N .
Following up on Ref. [13], Ref. [14] considered as target probability the Boltzmann distribution of a twodimensional (2d) Edwards-Anderson (EA) spin glass model at various temperatures T , and used a Neural Autoregressive Distribution Estimator (NADE) [27], which is a variation of the MADE meant to reduce the number of parameters. Furthermore, the model was trained using a different scheme from Ref. [13], called sequential tempering, which tries to minimize D KL (P t ||P a ), thus preventing mode-collapse. To this aim, at first, a sample from P t is generated at high temperature, which is easy, and used to learn P a . Then, temperature is slightly reduced and smart MCMC sampling is performed using the P a learned at the previous step, to generate a new sample from P t , which is then used in the next step. If P a remains a good approximation to P t and MCMC sampling is efficient, this strategy ensures a correct minimization of D KL (P t ||P a ). This was shown to be the case in Ref. [14], down to low temperatures for a 2d EA model of up to N = 225 spins.
Ref. [15] introduced a different scheme for learning P a . This adaptive scheme combines local MCMC moves with smart P a -assisted MCMC moves, together with an online training of P a . It was successfully tested using a different architecture for P a (called normalizing flows), on problems with two stable states separated by a high free energy barrier. Note that normalized flows can be equivalently interpreted as autoregressive models [28][29][30]. Ref. [16] also proved the effectiveness of smart assisted MCMC moves in a 2d Ising model and an Ising-like frustrated plaquette model.
Several other groups [17][18][19][20] investigated a problem related to sampling, namely that of simulated annealing [31] for finding ground states of optimization problems. This is an a priori slightly easier problem, because simulated annealing does not need to equilibrate at all temperatures to find a solution [32,33]. In these works, simulated annealing moves were once again assisted by machine learning. Ref. [17] tested their procedure on the 2d EA and SK models, and Ref. [18] considered a 2d, 3d, and 4d EA model. However, while finding the exact ground state of the SK and EA (for d ≥ 3) models is hard, in practice for not too large random instances the problem can be solved by a proper implementation of standard simulated annealing [34], and the scaling of these methods with system size remains poorly investigated. Ref. [19] considered the graph coloring problem, which is the zerotemperature version of the benchmark problem we propose to use in this work, and found that a Graph Neural Network (GNN) can propose moves that allow one to efficiently find a proper coloring with comparable performances to (but not outperforming) state-of-the-art local search algorithm. Additionally, GNN have shown to be successful at solving discrete combinatorial problems [35], but they do not provide much advantage over classical greedy algorithms, and sometimes they can even show worse performance [36,37]. Finally, Ref. [20] showed that the machine-learning-assisted simulated annealing scheme does not work on a glassy problem with a rough energy landscape.
These works provided a series of inspiring ideas to improve sampling in disordered systems via machine learning smart MCMC moves. Yet, the question of whether machine learning can really speed up sampling in problems that are exponentially hard to sample via local MCMC remains open. This wide class of systems include many problems of interest, such as optimization problems (e.g. random SAT or random graph coloring) [38] and meanfield glass-forming materials [39][40][41][42].

C. Summary
In this work, we test machine-learning-assisted MCMC in what is considered to be a prototypical hard-to-sample model, namely the coloring of random graphs [21,43,44]. Before doing that, we also tested and reproduced previous results in simpler cases.
The models we consider are: (1) The mean-field ferromagnetic problem, usually called the Curie-Weiss (CW) model, to gain some analytical insight into the different ways of training the auxiliary model.
(2) A two-dimensional Edwards-Anderson spin glass (2d EA) model. We consider this as an 'easy' problem (because, for instance, its ground state can be found in polynomial time), and we use it to reproduce previous results, compare different architectures, and gain insight on the role of some hyperparameters.
(3) The coloring (COL) of a random graph, which at finite temperature becomes an antiferromagnetic Potts model. In the proper range of parameters, this problem is proven to be exponentially hard to sample via local MCMC [2,44], and we use it as a benchmark to understand whether smart MCMC can improve the sampling efficiency.
Any machine learning model that satisfies the autoregressive property can be trained and used as an auxiliary distribution to propose smart moves. However, on the one hand, for complex problems, shallow or simple models might not be expressive enough to accurately learn the target distribution. On the other hand, if a problem can be easily solved by a simpler model, there is no need to employ complex deep architectures. In this paper we used several standard architectures illustrated in Fig. 1 and detailed in the SI: • The MADE [13], which is an autoregressive deep neural network; when its depth is equal to zero, this corresponds to a 'shallow' or single-layer autoregressive model.
• The NADE [14,27], which corresponds to a MADE with additional constraints on the parameters, with different depths and number of hidden units. This architecture has proven to be effective in image detection [45], filtering [46] and quantum systems [47].
• For the coloring, because neither the MADE nor the NADE perform well, we also tested an autoregressive GNN that we called Graph Autoregressive Distribution Estimator (GADE), and a non-symmetric MADE (called ColoredMADE).
Finally, we tried several strategies to learn the auxiliary model: (I) Maximum likelihood: we generate a sample from the target distribution, and we use it to train the auxiliary model by maximum likelihood. While this is not a technique to generate samples from P t , because one needs the samples to begin with, it is the best way to test if a given architecture for P a is expressive enough.
(II) Variational: we minimize the KL divergence D KL (P a ||P t ), which also corresponds to the variational free energy of the auxiliary model when considered as an approximation of the true one [13].
(III) Sequential tempering: we train the auxiliary model at higher temperature T , then use it to generate samples at lower T , and use the new samples to re-train the auxiliary model, and so on [14,15].
The core of our work is the application of methods (II) and (III) to attempt a sampling at low temperatures, for which local MCMC does not decorrelate fast enough. For the 2d EA model, we find that basically all the techniques and architectures perform well down to very low temperatures, although (II) is more prone to mode-collapse. We confirm that machine learning MCMC moves can provide a speedup in this case [14]. For the COL problem, we find that none of these methods work well, even at moderately high temperatures located within the paramagnetic phase of the model.

II. METHODS
We consider a specific application of the general scheme discussed in Sec. I, in which we want to approximate the Boltzmann distribution associated to the 'true' Hamiltonian H(σ) at a fixed inverse temperature β = 1/T (the target), with an autoregressive (AR) network, i.e. P a (σ) = P AR (σ).
In most cases, sampling is easy at small β and becomes harder and harder as β is increased. We now discuss different possible strategies to learn a proper P AR (σ).

A. Maximum likelihood
If a sufficiently large sample of configurations {σ m } m=1,··· ,M is available, independently and identically sampled from P B (σ), it is possible to train the model by maximizing the probability of the sample according to the AR model itself, i.e. to use the maximum likelihood method.
Assuming that the AR model is specified by a set of parameters θ, we maximize the likelihood of the observed data, defined as Equivalently, if P emp (σ) = 1 M M m=1 δ σ,σ m is the empirical distribution of the sample, we minimize The optimal parametersθ are given bŷ The gradient of D(θ) can be computed analytically in terms of the σ m , because log P AR (σ|θ) is given explicitly (or by back-propagation) as a function of θ at fixed σ, and the training can thus be performed efficiently. Note that if P B (σ) has multiple peaks, for sufficiently large M all these peaks are represented in the empirical distribution with the correct weights. Hence, when performing maximum likelihood to learn P AR (σ), the learned model should be able to represent all the peaks of P B (σ), provided the AR model has enough free parameters, i.e. is expressive enough. Then, mode-collapse will not occur.
Obviously, the maximum-likelihood approach relies on the quality of the initial sample, which has to be representative of the true distribution. Such a sample is by definition difficult to obtain for the really hard sampling problems that we want to solve. Moreover, if we were able to obtain such a sample by conventional means, there would be no need for any smart MCMC scheme. Nevertheless, the maximum-likelihood approach constitutes a reliable and effective way to test if a specific AR architecture is capable of learning the complexity of a specific problem. We will thus use this scheme, in cases where standard sampling from P B (σ) is possible, to test the expressive quality of our AR architectures.

B. Variational approach
Ref. [13] proposes to bypass the need for sampling from P B (σ) by using a variational approach.
Instead of minimizing D KL (P B ||P AR ) or its approximation D KL (P emp ||P AR ), as in Sec. II A, we want here to minimize D KL (P AR ||P B ) = σ P AR (σ) log P AR (σ) P B (σ) , or equivalently [13] the variational free energy: As it is well known in statistical mechanics, because the KL divergence is positive, F [P AR ] is minimized when P AR = P B and D KL (P AR ||P B ) = 0, and otherwise it provides an upper bound to the true free energy.
The gradient with respect to the parameters θ that define the AR model can be written as an expectation value over the AR model itself [13], The learning can then be done by gradient descent on F [P AR ], sampling from the AR model to estimate the gradient via Eq. (8). We used, as a condition to stop the gradient descent, that the variance of Q(σ) over batches of generated data is smaller than a given threshold. Indeed, if the AR distribution is exactly the Boltzmann one, then Q(σ) is a constant. Reciprocally, if the variance of Q(σ) is zero, then the AR distribution is proportional to the Boltzmann distribution whenever P AR (σ) > 0, but not necessarily over all possible σ, due to modecollapse. More specifically, if we have mode-collapse, then P AR (σ) = P B (σ)/Z in some regions of the space of σ (typically around some of the peaks of P B ) and P AR (σ) = 0 elsewhere, where the proportionality constant Z < 1 represents the total probability covered by P AR (σ) in P B (σ). We then obtain D KL (P AR ||P B ) = − log Z > 0, because the regions where P AR (σ) = 0 do not contribute to the sum. While this solution has larger KL divergence with respect to the optimal one (P AR = P B , D KL (P AR ||P B ) = 0), it can be a local minimum of D KL (P AR ||P B ) in which the variationally-trained AR model can be trapped, thus learning only a part of the landscape.

C. Local versus global MCMC
Standard local MCMC usually consists in selecting a variable at random and then proposing a random (or semi-random) change. If the MC moves respect microscopic detailed balance, the MCMC is guaranteed to converge to the correct Boltzmann distribution. This can be achieved, e.g., by accepting/rejecting the proposed MC moves following the Metropolis rule, where the acceptance probability of a move from configuration σ old to σ new is defined as: We call this scheme local MCMC because each move consists in a change of a single degree of freedom.
In contrast, we can sample from our autoregressive model P AR (σ) to generate a new proposed configuration σ new . It is still useful to respect microscopic detailed balance in order to ensure convergence to equilibrium, and for this reason the replacement σ old → σ new is accepted with probability (10) Note that, because σ new is generated from scratch by the AR model, it is in most cases completely different from σ old , hence the resulting move is non-local and this is why this scheme is called global MCMC.
We also note that this global MCMC scheme is very similar to importance sampling, in which M i.i.d. samples σ m are generated from P AR (σ), and then reweighted by W (σ m ) = P B (σ m )/P AR (σ m ) to compute averages. However, the formulation in terms of a MCMC is convenient to compare with local MCMC, to monitor efficiency via the acceptance rate (which is morally equivalent to a participation ratio in importance sampling), and to perform smart protocols (e.g. sequential tempering) during the MCMC dynamics [14,15]. This is why we stick to this formulation in this paper.
The reweighting factor W (σ) = P B (σ)/P AR (σ) that appears both in importance sampling and in Eq. (10) is the crucial quantity for the efficiency of the global MCMC scheme. If W (σ) is typically exponential in N , then its fluctuations are too wild and moves are almost never accepted. The KL divergence is precisely the average of log W (σ), either on the Boltzmann or on the AR distribution, and if it is too large (in particular, growing linearly in N ) the method is doomed to failure.

D. Sequential tempering
Sequential tempering is a technique used in Ref. [14] to learn the AR probability at a larger β using data from lower β, which can be convenient because collecting data becomes harder upon increasing β. The first step consists in collecting a sample via local MCMC at low β, where sampling is easy, and then training an AR model to reproduce this sample by maximum likelihood (Sec. II A). Next, in order to create a new sample at β + δβ, we use global MCMC by proposing moves with the previous AR model, at the new temperature. The acceptance rule then becomes: (11) We then learn a new AR model from the new sample by maximum likelihood, and iterate until either we reach the temperature of interest, or the convergence time of the global MCMC exceeds some fixed threshold, indicating a failure of the training procedure. A related adaptive global MCMC scheme has been introduced in Ref. [15] and is detailed in the SI.

E. Evaluation of the AR model
Once the learning is completed, one can use several observables to evaluate the quality of the AR model. By completion of the learning we mean convergence of the gradient ascent in maximum likelihood (Sec. II A), convergence of the gradient descent in the variational free energy (Sec. II B), or reaching the target temperature with high enough acceptance rate in the sequential tempering (Sec. II D).
As a first check, we can use the AR model to estimate thermodynamic observables (energy, entropy, correlations) of the true Hamiltonian. If the AR model has a lower entropy than the true one, the AR model is probably suffering mode-collapse. Another interesting observable is the KL divergence, either D KL (P AR ||P B ) or D KL (P B ||P AR ), which measures how well the AR model approximates the target one, and more quantitatively provides the average of the reweighting factor, as discussed in Sec. II C. A more easily accessible quality measure consists in generating samples with the AR model, then evolving them with local MCMC and checking if the energy remains constant and the correlation functions remain time-translationally invariant [6,41], as expected if the initial configuration is a good equilibrium one.
A very important measure of the quality of the AR model is the acceptance rate of the global MCMC, as a high acceptance rate indicates that the AR model describes well the true model, at least in the region explored by the global MCMC. Yet, the AR model could be modecollapsed and still keep a high acceptance rate, because the global MCMC would then only explore the region on which the AR model has collapsed [15]. In particular, this could happen if: • The true model has a few (or a polynomial number of) probability peaks. In that case, the AR model could have learned the correct energy and correct free energy, and still the entropy could be lower than the true one, but by a sub-extensive (hence difficult to detect) amount. This is the standard scenario of mode-collapse, e.g. in a ferromagnetic model with two states, when the AR model is only learning one of them.
• The true model has a number of peaks growing exponentially in N , hence associated to an extensive contribution to the entropy (usually called configurational entropy or complexity in the glass literature [38]). In this case, if the AR model has learned the true energy (or a higher one), and its entropy is much lower than the true one, then its free energy should be appreciably higher than the true one. If instead the AR model with mode-collapse learns a good approximation to the true free energy, this happens by lowering both the entropy and the energy. In this case, mode-collapse is easily detected by a drift of the energy, that increases with the number of iterations when MCMC is started from AR-generated samples.
In practice, if (i) the global MCMC has good acceptance rate, (ii) the global MCMC dynamics initialized in a configuration generated by the AR model is close to stationary (i.e. one-time quantities such as the energy are constant in time, and two-time quantities such as correlations only depend on the time difference), and (iii) the time-dependent correlation of global MCMC vanish for large time differences, then the model should be of high enough quality. Yet, ultimately, it is hard to assess the quality of the AR model in a general way. We will give more concrete examples in the rest of the paper.

III. CURIE-WEISS MODEL
As a first exercise, we discuss here a mean-field ferromagnetic model, namely the Curie-Weiss (CW) model, which provides some insight on the expressive properties of AR models.

A. The model and its Boltzmann distribution
The model is defined on a set of binary Ising spins, for convenience σ i ∈ {−1, 1}, by the Hamiltonian In the thermodynamic limit, the value of m that dominates the Boltzmann probability distribution is the solution of m = tanh(βm) (see the SI for details), which gives m = 0, i.e. a paramagnetic phase, for T > 1, and m = ±m, i.e. a ferromagnetic phase, for T < 1. Note that the two solutions with positive or negative m have the same free energy, hence the probability distribution of the model has two distinct peaks corresponding to these two ferromagnetic states. The model has a spin-flip symmetry σ i → −σ i , which is spontaneously broken for T < 1.
For N → ∞ and T > 1, the Boltzmann probability distribution is well approximated by a product distribution, in the sense that the KL divergence (per spin) between the two sides of Eq. (13) vanishes for N → ∞. When T < 1, this holds separately for each of the two ferromagnetic states, resulting in a mixture of two product distributions: each peak being approximated by the product form with m =m and m = −m.

B. Expressivity of the autoregressive model
The CW model is well approximated by a mixture of two product distributions for large N . We now want to ask whether this structure can also be well approximated by a single AR model. Because training an AR model on a sample generated by the CW model is only possible numerically, we consider instead a sample generated by the mixture model, which coincides with the CW one for N → ∞. For this model, we can provide an analytic solution.
We then assume that an infinite sample of spin configurations is generated from the mixture model in Eq. (14), i.e. with probability 1/2 we choose m = ±m and we then choose each spin independently with probability p ± (σ i ). We use this infinite sample to train the AR model via maximum likelihood (Sec. II A). We consider here a shallow AR model (see the SI for details). Because of the spin-flip symmetry, which is realized in an infinite sample from the true model, the local fields h i = 0 have to vanish. Furthermore, because the Hamiltonian is invariant under any relabeling of the {σ i }, in each conditional probability P i AR (σ i |σ i−1 , ..., σ 1 ) the spins σ i−1 , ..., σ 1 are seen as equivalent by spin σ i , hence the couplings J ij should be independent of the index j < i, leading to where m <i = j(<i) σ j is the magnetization of the first (i − 1) spins. Hence, each term P i AR is parametrized by a single coupling J i . Minimization with respect to J i of the KL divergence between the mixture model in Eq. (14) and the AR model in Eq. (15) leads to a set of coupled equations: Oncem is computed by solving m = tanh(βm), the J i s can be computed by solving Eq. (16) numerically. We can also compute the resulting minimum KL divergence (SI). Becausem = 0 for T > 1, all the J i = 0 and the AR model trivially reduces to the product distribution in Eq. (13). Results for T < 1 are shown in Fig. 2a for the first few couplings. While we find a non-trivial relation between J i and β in the vicinity of the critical temperature β = 1 where all the J i vanish, Fig. 2a suggests that J i roughly scales as β/(i − 1) at large β. A comparison of the free energy per spin, obtained from the exact solution and from the AR model, is given in Fig. 2b. The KL divergences between P AR and P B (in both directions) are given in Fig. 2c; note that these are not divided by N , hence their value per spin is very small, and shows a mild peak in the vicinity of the phase transition. Finally, Fig. 2d shows that D KL (P B ||P AR )/N vanishes at all temperatures when N → ∞, which confirms that the AR model can perfectly approximate the target one in the thermodynamic limit, even in the ferromagnetic phase.
This calculation gives some useful insight on the expressivity and training of AR models: • If training is done via maximum likelihood with a perfect sample of the target model, no mode-collapse is observed, as expected.
• A single shallow AR model is capable of expressing a two-peaked distribution, at the price of having finite couplings that are much stronger than the CW ones (which where of order β/N , taking into account the factor of β in the Boltzmann measure). This happens as follows. The first spin σ 1 is chosen at random, P 1 AR (σ 1 ) = 1/2. The second spin has a strong coupling with the first, J 2 ∼ β. The third spin also has a strong coupling with the first two, J 3 ∼ β/2, and so on. More generally, this suggests that for q-state variables, a single AR model can express a distribution with q peaks, each peak being selected by the choice of the first spin. Whether an AR model can express a distribution with more than q peaks remains open at this stage, but we will see in Sec. V that this is not possible when there are too many peaks.
• The maximum-likelihood training converges to a solution at all temperatures, and does not suffer from any slowdown in the vicinity of the critical point.
We thus conclude that AR models can represent well models with second-order phase transitions, without suffering from slowing down, at least when training is done via maximum likelihood.

C. Mode-collapse
We also consider a shallow AR model trained in a variational way (Sec. II B) to represent the CW model. In this case, we do not impose any restriction on the couplings J ij and the fields h i (see the SI), and the variational free energy cannot be written explicitly. For T < 1, we found numerically that the AR model can end up in three different fixed points. The first one corresponds to the 'correct' form in Eq. (15), i.e. with h i = 0 and J ij = J i . The other two correspond to mode-collapse in a solution with J ij = 0 and h i = h = ±T atanh(m): in each of these two solutions, the AR model learns only one of the two peaks of the Boltzmann distribution, resulting in a loss of entropy per spin of (log 2)/N , which is however too small to be detected for large N . Moreover, the basin of attraction of the correct solution seems to be vanishingly small for large N , and the two mode-collapsed solutions dominate the training. In fact, to obtain the solution with h i = 0, we have to force the fields to vanish, thus imposing the spin-flip symmetry. We conclude that, as expected, one has to be careful about mode-collapse when training the AR model variationally.

IV. EDWARDS-ANDERSON MODEL
We now consider a two-dimensional (2d) Edwards-Anderson (EA) spin glass model, chosen because it is not exponentially hard to sample (finding its ground state requires polynomial time in N [48][49][50]), there is no phase transition at finite temperature (hence no a priori risk of mode-collapse), and yet the relaxation time of local MCMC grows quite fast upon lowering temperature. Moreover, previous work [14] has explored this model, finding an important speedup using a NADE architecture.

A. Model and architectures
We consider a 2d EA model defined on a 10 × 10 square lattice with periodic boundary conditions, i.e. with N = 100 Ising spins, σ i ∈ {−1, 1}, as Here, the sum runs over neighboring pairs of sites in the square lattice, and the couplings J ij are i.i.d. from a Gaussian distribution with zero mean and unit variance as in Ref. [14].
For the AR decomposition in Eq.
(2), we order the variables moving along e.g. horizontal lines from the top left to the bottom right of the lattice. We consider two different AR architectures: a shallow MADE without fields (h i = 0) and with N (N −1)/2 independent coupling parameters J ij , and a NADE model with N h hidden nodes (see the SI). In the NADE, in order to impose the inversion symmetry σ i → −σ i , we set all the biases to zero, and we use sigmoid functions in the hidden layers. The NADE then has 2 × N h × N parameters.
In order to train the two AR models, we use sequential tempering (Sec. II D) as in Ref. [14]. The initial sample used to learn the model by maximum likelihood is generated at β = 1, where local MCMC still decorrelates fast enough. It contains M = 10 5 configurations, and is obtained by a single local MCMC, adding a new configuration every 10 sweeps, resulting in a total of 10 6 sweeps. The models are learned with gradient descent using minibatches of 256 configurations, during 50 training epochs and with a learning rate of 10 −3 . During the sequential tempering, each sequence of the sample is evolved for 10 global MCMC steps, at a new inverse temperature β ← β + 0.1. The acceptance rate is also computed during this global MCMC. The model is then trained again, with the same hyperparameters, using the new sample. The times needed on a standard laptop to perform local (one local MC sweep corresponds to N single spin-flip attempts) and global MCMC, and to train the model, are given in Table I.

B. Results
We begin by reproducing the results of Ref. [14] with the NADE, choosing N h = 64 which is the minimal number needed to obtain accurate results according to Ref. [14]. Having reproduced the NADE results of Ref. [14], we also considered a shallow MADE, which has a slightly smaller number of parameters, i.e. N (N − 1)/2 = 4950 versus 2N N h = 12800 for N = 100 and N h = 64. The shallow MADE performs identically to the NADE (Fig. 3), and we find the shallow MADE preferable because it has less parameters and it is more interpretable [51]. It might be interesting, in future work, to check whether or not the shallow MADE has a better scaling upon increasing N with respect to the NADE. Fig. 3d shows the number of local MCMC sweeps τ needed to reach a time-averaged spin-spin correlation Note that already for β = 1.5, the time to decorrelate is of the order of magnitude of 10 4 sweeps, and it grows quickly upon decreasing temperature. We stop investigating the local MCMC at β = 1.5 because below this temperature measuring its relaxation time becomes too computationally costly. Next, we train the AR models using the sequential tempering procedure with M = 10 5 training configurations, as detailed in Sec. IV A. We observe in Fig. 3d that the acceptance rate of global MCMC moves remains very close to one at all temperatures, and it actually increases upon decreasing temperature, reaching one for β → ∞. For this paramagnetic model, new configurations proposed by the global MCMC are completely uncorrelated from previous ones, and the decorrelation time for the global MCMC is of the order of magnitude of the inverse acceptance rate, hence between one and two global MCMC moves. We checked explicitly in Fig. 3f, where we show that the average Hamming distance (i.e., number of distinct spins) between two configurations obtained by independent global MCMC, and between a single MCMC and itself a few global MCMC later, coincide.
Using the numbers reported in Table I, we find that at β = 1.5, decorrelation on a standard laptop takes about 0.12 s using local MCMC (i.e. about 10 −5 s per MC sweep times 12000 time sweeps to decorrelate), and about 10 −3 s for global MCMC (that requires a few moves to decorrelate, see Fig. 3f, each taking about 3·10 −4 s), hence with a speedup of two orders of magnitude. Furthermore, the decorrelation time of global MCMC slightly decreases upon lowering temperature, while that of local MCMC strongly increases, leading to an even stronger speedup, divergent for β → ∞. Note that the learning time is negligible at large enough β.
To monitor the quality of the AR models upon decreasing temperature, we compare the exact energy (Fig. 3a) and entropy (Fig. 3b) computed with local MCMC with those estimated by sampling the AR models, finding extremely good agreement. For the local MCMC computation, we perform an annealing from high temperature measuring the energy and the specific heat, and we obtain the entropy by integrating the latter in temperature. For the AR models, we estimate the energy and the entropy by sampling from the model at each T , and computing the average of H(σ) and of − log P AR (σ) over this sample, respectively. We note, in particular, that the AR entropy coincides with that of the true model, which shows that the AR model is not mode-collapsed. In Fig. 3c, we also show the KL divergences D(P AR ||P B ) and D(P B ||P AR ) as a function of temperature. Both are extremely small, which implies that log W (σ) = log[P B (σ)/P AR (σ)] is extremely small for configurations σ that are typical both of the true equilibrium and of the AR model, consistently with the high acceptance rate shown in Fig. 3d (see the discussion in Sec. II C).
For future reference, we also checked the dependence of the acceptance rate of global MCMC on the training set size M . We found that a minimal training set size of about M = 10 4 is needed to obtain good performance, while below this size the AR model cannot be trained properly and global MCMC fails (see the SI).
To summarize, we found that: • A 2d EA model with N = 100 spins can be approximated very accurately by an AR model, as shown by the very small KL divergence.
• Correspondingly, global MCMC is very efficient and leads to an important speedup with respect to local MCMC, as shown in Ref. [14]. We have shown that the speedup diverges for β → ∞.
• A simple shallow MADE performs as well as the NADE used in Ref. [14], even with a slightly smaller number of parameters.
• A sufficiently large training set (M 10 4 ) must be used to achieve a good efficiency of the global MCMC.
We thus conclude that, at least for N = 100 (a 10 × 10 periodic lattice), the 2d EA model is easy to approximate with an AR model, leading to efficient global MCMC. Yet, the fact that the AR model concentrates at low temperatures on the two ground states (related by the inversion symmetry), which can be found in short time by algorithms such as McGroundstate [50], leads us to believe that the instance we studied is too small to be really hard to sample.
Furthermore, we repeated the study on a 3d EA model on a 5 × 5 × 5 cubic periodic lattice (N = 125 spins). Also in this case, because of the small size, the ground state is found easily by McGroundstate [50], despite the fact that the problem is in principle NP-hard [48]. We found that the AR model approximates well the 3d EA model, leading to efficient global MCMC at all temperatures, despite the presence of a phase transition at finite temperature [52], and converging to the ground state in the zero-temperature limit (see the SI).
We believe that a more complete study should investigate systematically the dependence on system size for much larger N , as well as the dependence on the space dimension, because the EA model is known to become harder and harder to sample upon increasing spatial dimensions. Because this is not the main focus of the present work, we leave this for future investigation. The main goal of the present section was to show that our implementation of the global MCMC is able to reproduce and extend previous positive results [14], and it can thus be considered reliable and efficient.

V. COLORING
We now consider our main benchmark, namely the hardto-sample coloring problem. We will show that all the global MCMC implementations discussed above fail for this model, and we could not find an implementation that achieves a better efficiency. In particular, we will show that both the maximum-likelihood and the variational training fail, but for quite different reasons.

A. Model
The model is formulated in terms of N variables σ i ∈ {0, · · · , q − 1}, each taking q possible colors. The variables are the nodes of an Erdös-Rényi random graph G, i.e. a graph in which each link i, j is present at random with the same probability, chosen such that the average connectivity of a node is c. The model Hamiltonian can then be written as i.e. a q-state Potts model where antiferromagnetic couplings (whose value is fixed to one without loss of generality) are only present on the edges of G. The Hamiltonian counts the monochromatic edges by assigning a zero energy to all links that connect sites with distinct colors, and energy one to links that connect sites with the same color. Hence, H(σ) ≥ 0 and H(σ) = 0 if and only if σ is a proper coloring of G. At zero temperature, the problem of finding the ground state of H(σ) is therefore related to the random graph coloring problem [43,44]. At finite temperature and for large enough q, the model belongs to the Random First Order Transition (RFOT) universality class [54], together with many other optimization problems [38] and with mean-field structural glasses [39][40][41][42]. These models are characterized by a finitetemperature dynamical glass transition, below which the time needed for a proper sampling of the Boltzmann probability diverges exponentially with N [2], due to the presence of an exponential number of peaks (i.e. metastable states) [24,38,41,42].
The coloring problem is specified by the number of sites N , the choice of the random graph G which is specified by its average connectivity c, the number of colors q, and temperature T . For the model to be a good representative of the RFOT class, we need q to be large enough, otherwise the transition is too close to a second order one, especially at finite N [54]. We thus chose q = 10, for which a finite RFOT dynamical transition T d (c) is present for all c > 39.02, and the corresponding phase diagram is reported in Fig. 4 (data were privately communicated by the authors of Ref. [53]).
Systems in the RFOT class also display a second phase transition, called condensation or Kauzmann transition T K (c). Below this transition, the number of metastable glassy states become sub-extensive [38]. For our purposes, this phase transition is relevant because for T > T K (c) one can use the quiet planting technique [55], which consists in exchanging the sampling of σ and the generation of G. While one should first generate a graph G and then sample σ from the Boltzmann distribution for that G (which is exponentially hard in N at low T ), in quiet planting one first takes a random configuration σ, and then constructs at a given T a random graph G such that σ is an equilibrium configuration for the model defined on that graph, at temperature T (which can be done in polynomial time). It can be shown [55] that for T > T K (c) in models such as the coloring, this exchange of order does not affect the statistical properties of G. This allows one to run local MCMC from the configuration σ, which is guaranteed to be an equilibrium one, thus bypassing the need for equilibration, which takes an exponentially large time for T < T d (c). Our choice of q = 10 ensures the existence of a large enough region T K < T < T d where model-generated samples may be compared to the planted configurations.
For our study, we thus chose c = 40, for which T d = 0.1768 and T K = 0 (Fig. 4). In this way, we can perform quiet planting at any temperature to study the equilibrium local MCMC dynamics, and we also have a wide range of temperature T < T d where this dynamics is exponentially slow in N . Notice that such graphs are full of short loops, so some finite size effects are measurable, like an energy drift in the planted solution. We have confirmed that the scenario we describe in the next sections is equally reproduced for lower connectivity, {q = 5, c = 13} and {q = 10, c = 30}, so that it is mainly determined by the complexity of the problem. In Fig. 4 we show the decorrelation time of the overlap correlation function, here measured using local MCMC and averaging over time t. Note that with this definition, complete decorrelation corresponds to C(t, τ ) ∼ 1/q. The decorrelation time is defined by C(t, τ ) t = 0.5. When plotted as a function of T −T d , it displays the characteristic power-law divergence predicted by RFOT theory. In the vicinity and below T d , the dynamics becomes so slow that we are unable to observe decorrelation of local MCMC in reasonable time. We consider a fixed size N = 250, which is a good compromise between avoiding too large finite size effects, and having a small enough size to allow for an extensive testing of autoregressive model architectures. It is also important to mention that the thermodynamics of the model can be solved, for T > T K (c) and for N → ∞, by a simple cavity computation [53][54][55]. In particular, the energy and entropy per spin are given by Although for N = 250 we observe significant deviations from the thermodynamic values (Fig. 5), these expressions are still useful as a reference.

B. Model selection
Our goal is to use autoregressive models to generate good trial configurations and speed up the dynamics of the hard-to-sample coloring problem. Before exploring ) is reported on the x-axis and corresponds to a specific color. Overall, the models are ranked, from left to right, by free energy, which at T = 1 is the difference of energy minus entropy. The horizontal lines correspond to the equilibrium values, measured using local MCMC for N = 250 (orange, the entropy is obtained via thermodynamic integration) or calculated from the cavity method for N → ∞ (blue). the hard region below T d , we test how different AR architectures perform in the paramagnetic phase at T = 1, where the local MCMC dynamics is fast and we can easily generate an equilibrium sample from the Boltzmann distribution. Our intermediate goal is to select the best architectures and training methods (see the SI for details).
For the AR decomposition in Eq. (2), variables are ordered naively, which means that we first fix an arbitrary ordering σ 1 , · · · , σ N , and then construct a graph G as described in Sec. V A. We also considered ordering the variables from higher to lower connectivity, and found no qualitative difference. Results are reported here for this second choice of ordering. The models are trained either via maximum-likelihood on a sample of M equilibrium configurations generated via local MCMC, or via the variational method using M model-generated samples to compute the gradient, see Sec. II B. The training is done using the Adam optimizer [56] and mini-batches of 1024 configurations. In order to enforce color symmetry, each configuration in the sample is subjected to a random permutation of colors before being used, in each epoch of training. We start from a learning rate of 0.01 that we reduce by a factor 2 after every 250 epochs of no improvement [57]. We set an early stop criterion that triggers if the likelihood of a validation set made of independent equilibrium samples starts decreasing.
At the end of the training, we compute the model energy (Fig. 5a) and entropy (Fig. 5b) by sampling from the AR model and computing the average of H(σ) and − log P AR (σ), respectively. We compare these results with the exact ones obtained via local MCMC. We observe that the AR energy and entropy are generally quite close to the equilibrium values. In particular, the samples generated by the models often have an energy that is closer to the average equilibrium energy than that of the planted configuration, hence the difference falls within the typical equilibrium fluctuations. As a general trend, we observe that increasing the model expressivity does not improve the energy, probably because of overfitting, as shown by the decrease in entropy. On the other hand, adding a regularization (either dropout or ridge) degrades the performance significantly, by increasing both the energy and the entropy. We thus conclude that the best model is once again a shallow MADE (ColoredMADE, see the SI), which can achieve a very low energy and a very high entropy, close to the target values, both with variational and maximum-likelihood training.
In Fig. 5c,d we report the KL divergences between the Boltzmann distribution and that of the AR models, which confirms that the two shallow ColoredMADE have the smallest KL divergence. Having selected two good architectures at T = 1, we now focus on these two models and check how they perform upon lowering the temperature. We consider, for each T , a ColoredMADE trained by maximum likelihood on an independent equilibrium sample generated from the Boltzmann distribution by long enough local MCMC at that T , and another Col-oredMADE trained variationally at fixed T (which, we recall, does not require to generate equilibrium samples).
We will show next that both models fail at lower T .  variationally (b, continuous lines). The temperature decreases from T = 1 (red) to T = 0.3 (blue) following the color gradient. (c) Relaxation time τ rel defined from C(0, τ rel ) = 1/e for the samples generated by the AR model trained using maximum likelihood (orange) or variationally (purple) compared to equilibrium (green). Relaxation is measured averaging over 100 local MCMC runs. In the inset we report the ratio of the equilibrium time to that of AR models. The difference becomes larger upon decreasing the temperature, but the variational model (purple) remains accurate down to T ∼ 0.35. (d) Average energy and (e) entropy of samples generated by the two AR models and in equilibrium, with the same color code as in (c). In panel (e), the green data correspond to the exact entropy in the thermodynamic limit, Eq. (21). Using the same data of (d)-(e), we plot in (f) the entropy of the generated samples as a function of their energy.
Of course, there might be other architectures that we did not consider here, which might provide better results. However, because both adding regularization or increasing the expressivity of the model (also by using GNNs) did not seem to improve the efficiency, we do not see a clear path for improvement.

C. Failure of sequential tempering
We first consider the ColoredMADE trained by maximum likelihood, and compare the average energy of the equilibrium configurations with those generated by the AR model (Fig. 6d). We observe that the model generates configurations with higher energy as soon as temperature goes below T ∼ 1, while its entropy (Fig. 6e) remains close to the equilibrium value. Yet, once the entropy is plotted parametrically as a function of energy (Fig. 6f), the difference from the equilibrium result becomes apparent.
We also compare the correlation functions of local MCMC dynamics starting from the equilibrium samples and the AR-generated samples. (To avoid confusion, we stress that here we are not yet considering AR-assisted global MCMC.) In Fig. 6a we report the overlap corre-lation function defined by Eq. (20), here for fixed t = 0 as a function of τ measured in MC sweeps. The dashed lines are obtained starting at t = 0 from equilibrium samples, while the full lines are obtained starting from samples generated by the AR model. We observe that AR-generated samples have the correct equilibrium local MCMC dynamics only at very high temperatures. In Fig. 6c we compare the relaxation times, defined by C(0, τ rel ) = 1/e, and show that AR-samples have systematically faster dynamics upon lowering T . Overall, these results suggest that the maximum-likelihood training is effective at high temperatures only, because upon lowering temperature the model is unable to propose configurations of low enough energy. This study is restricted to T ≥ 0.3 because below that temperature we are unable to generate the equilibrium samples needed to train the model by local MCMC.
Next, we consider the global MCMC dynamics, and we measure its acceptance rate at T = 1. As in Sec. IV B, global MCMC can be initialized (i) on AR-generated configurations, (ii) on the training set and (iii) on an independent test set made by additional equilibrium configurations. In Fig. 7a, we consider a ColoredMADE trained on M = 10 3 configurations, and we report the average acceptance rate together with the geometric average of the Boltzmann ratio P B (σ new )/P B (σ old ) and the model ratio P AR (σ old )/P AR (σ new ). Similarly to the 2d EA model (see the SI), we observe that for such a small M , the acceptance rate on the test set is vanishingly small, while using the other initializations, it decreases with time. The situation improves strongly upon increasing the size of the training set to M = 10 5 (Fig. 7b), leading to a good acceptance rate. Global MCMC can then decorrelate reasonably fast in this case.
We explore more systematically the dependence on T and M in Fig. 7c, where we report the acceptance as a function of β for different architectures and different M . We observe that for a given architecture and M , the acceptance drops exponentially fast with β, in some cases becoming so small that we cannot measure it (no move is accepted during the longest global MCMC we can run). We conclude that the maximum likelihood approach fails because, upon lowering temperature, the AR models are not expressive enough to properly fit the target energy, and as a consequence they propose moves that cannot be accepted by the global MCMC. Increasing M does not help at low temperature (Fig. 7c), and we have already seen that increasing the model expressivity only leads to overfitting (Fig. 5).
It is possible that by improving model expressivity and at the same time considering larger M , the acceptance rate could improve. However, the AR architectures we considered are already quite hard to train, and M ∼ 10 5 corresponds to a quite large dataset, and still the method fails at T = 0.5, which is a quite high temperature in the paramagnetic phase. These results thus highlight the lack of computational efficiency of this approach.
Note that the global MCMC fails even when the AR model is trained using maximum likelihood on a perfect equilibrium sample, i.e. under the best possible conditions. Hence, sequential tempering cannot work, because upon decreasing temperature the global MCMC remains stuck and does not lead to a proper new sample on which the AR model can be trained. In fact, performing sequential tempering runs (not shown) we observe that the global MCMC does not equilibrate, hence the AR model has even worse energy and leads to an even slower global MCMC.

D. Failure of the variational training
We now consider the variationally-trained Colored-MADE, and note that its energy remains accurate down to T = 0.3 (Fig. 6d), while its entropy decreases quickly upon lowering T (Fig. 6e), suggesting mode collapse, i.e. the model only learns a limited portion of the support of the Boltzmann distribution. Surprisingly, the parametric curve of entropy versus energy (Fig. 6f) is very similar to that obtained using maximum-likelihood training. Consistently, when we run global MCMC using the variational model, we observe that the acceptance rate remains good, but the proposed moves do not lead to any decorrelation (Fig. 8), and as a result the sampling of the Boltzmann distribution is not efficient. Global MCMC indeed remain confined in the limited portion of the phase space that has been learned by the variational model. This is consistent with the discussion of Sec. III C for the CW model.
In Fig. 6b, we report the correlation functions of the local MCMC dynamics initialized both in equilibrium and in configurations generated by the variational AR model, and note that they are essentially indistinguishable down to T ∼ 0.35. Because this method does not require the generation of equilibrium samples for training, one might wonder whether, despite being mode-collapsed and thus useless for global MCMC, the variational model could still generate configurations that are close enough to some of the equilibrium ones, such that the local MCMC dynamics coincides with the equilibrium one. If true, this would be interesting, because one could use the variational model to generate some typical equilibrium states, although only a small subset of them. However, we show in Fig. 9 that this is not the case. Below T ∼ 0.3, upon approaching the glass transition, the energy of the variational model remains higher than the equilibrium one, and the generated configurations display an initially faster local MCMC dynamics, whose relaxation time increases during the evolution (the so-called aging behavior typical of glasses). Hence, the variational model also fails in providing good equilibrium configurations. We used the following model architectures: shallow MADE (blue), shallow ColoredMADE (red) and deep MADE (D = 3, purple). The models have been trained via the variational approach with batches of M = 10 2 (triangles) or M = 10 4 (circles) configurations. The full line is the correlation measured from tw = 0, while the dashed line is measured after waiting tw = 10 4 sweeps. In black we report, for comparison, the exact equilibrium dynamics, obtained initializing the local MCMC in the planted configuration.

E. Role of the color symmetry
One might wonder whether the failure of the AR models is somehow related to the presence of the color symmetry (see the SI for details), which might lead to a proliferation of states, thus requiring a lot of training samples to be taken into account correctly. Or, maybe, breaking the color symmetry would allow for a better learning of the AR model, that could use local fields to obtain more information on the Boltzmann distribution, thus reducing the role of couplings.
To address this question, we tried to break the color symmetry by adding small random local fields on each spin, thus lifting the degeneracy of the q! equivalent configurations. Recall that q = 10, hence q! is very large, although the contribution to the entropy per spin, log(q!)/N , is negligible. Yet, we repeated the same analysis and we obtained very similar results. We thus conclude that the addition of small external fields is not enough to make the approach work, because its failure is due to the intrinsic difficulty of representing the multiplicity of states typical of a glassy system.

VI. CONCLUSIONS
In this paper, we addressed the general problem of whether machine learning can assist Monte Carlo methods to provide a computational speed up in hard-to-sample glassy problems.
We confirmed previous findings [13][14][15][16][17][18][19] that identified a class of problems, such as simple ferromagnetic models and the two-dimensional Edwards-Anderson spin glass model, which can be well approximated by simple autoregressive models. These models can then be used to sample the Boltzmann distribution by machine-learning-assisted global MCMC, providing a computational speedup of several orders of magnitude. For these type of problems, several different architectures and training schemes have been proposed, but we found that all these methods are roughly equivalent. In particular, we tend to prefer a shallow autoregressive architecture, which provides a simple and interpretable model. We also highlighted the need for a large enough training sample for the efficiency of the approach.
We found that the situation is totally different when one considers really hard-to-sample problems belonging to the Random First Order Transition (RFOT) universality class, such as the coloring of random graphs and, possibly, the model considered in Ref. [20]. For these type of problems, the landscape is very rough and complex, displaying a multitude of locally stable glass states that are known to trap standard local MCMC [2]. We showed that machinelearning-assisted global MCMC fails, either because of the mode-collapse of the variationally learned model into a restricted portion of the phase space (that is not even representative of equilibrium in the vicinity of the glass transition), or because the sequential tempering scheme fails to achieve a high enough acceptance rate of global MCMC moves.
We tried several architectures in order to overcome the problem, but we did not find any promising sign of improvement. In particular, we found that adding regularization leads to underfitting (high entropy), while increasing the expressivity of the model leads to overfitting (low entropy), and both are detrimental to the global MCMC efficiency.
We are thus led to believe that there is an intrinsic difficulty of these problems that prevents machine learning methods to solve them efficiently. Of course, we might have missed something and future work could find a solution using smarter architectures. Yet, we believe that our work is useful because it establishes a benchmark against which, in our opinion, such machine learning methods should be tested to assess their performance as universal methods for sampling speedup.

S1. ARCHITECTURES
In this section we give the details of the different architectures that are used in the paper.

A. Representation of the input variables
Because we deal with autoregressive models taking as input N variables σ = {σ 1 , · · · , σ N }, we define a notation σ <i = {σ 1 , · · · , σ i−1 , 0, 0, · · · , 0}, which is a Ndimensional vector in which the variables with index ≥ i are "masked" to zero. To avoid confusion, let us stress that σ is a N -dimensional vector, σ i are its components, and σ <i is a masked N -dimensional vector.

B. Shallow model
The shallow model is the simplest model that satisfies the autoregressive property (beyond independent variables). In this model, each conditional probability is written in the form of a Boltzmann distribution with local fields and two-variable couplings. For binary variables σ i ∈ {0, 1}, it is conveniently parametrized as: such that the total number of parameters of the AR model is N (N − 1)/2 + N . For non-binary variables, using a one-hot-encoding representation (Sec. S1 A), the shallow model reads: where the J ij are q × q real (non-symmetric) matrices, theĥ i are real q-component vectors, and the AR model has in total q 2 N (N − 1)/2 + N q parameters.
Note that there is an overparametrization in Eq. (S2), because for instance one of the q components ofĥ i can be set to zero without loss of generality, due to the normalization factor. Similarly, one line and one column of the matrix J ij can be set to zero. This so-called "gauge invariance" slightly reduces the number of parameters, such that in particular for q = 2 one recovers the binary representation in Eq. (S1).

C. Color symmetry
In this paper, we will consider in particular the coloring problem, which enjoys a 'color' symmetry, meaning that the Hamiltonian is invariant under any arbitrary permutation of the q colors. We then find useful, in order to (partially) prevent mode-collapse, to enforce the same symmetry into the AR model. This requires in particular the vectorĥ i = h i (1, · · · , 1) to be a constant, hence the termσ T i · (h i , · · · , h i ) = h i becomes a constant too. Furthermore, we note that the only q × q matrices that are fully invariant under permutations of the q states are the identity matrix and the matrix of all ones. The identity matrix with a number J ij on the diagonal gives a contribution J ijσ T i ·σ j = J ij δ σi,σj , while the matrix of all ones gives again a constant contribution. Constant terms can be neglected due to the normalization, and the model becomes which reduces the number of parameters to N (N − 1)/2, i.e. by a factor ∼ q 2 . Note that, introducing a q-component probability vec-torP i AR = {P i AR (σ i = 0), · · · , P i AR (σ i = q − 1)}, Eq. (S3) can be written aŝ i.e. the probability of σ i = k is the softmax (over k, at fixed i) of a vector H k i . Each of the N -dimensional vectors H k = F[σ k ] is calculated independently by applying a function F with the same weights J ij to the N -dimensional one-hot-encoded input vector (σ k ) i = δ k,σi . The function F satisfies the AR property, hence its component i only depends on the components with j < i of the input.
While in Eq. (S4) the function F is a simple linear layer (with the AR mask), more general architectures can be considered while keeping the color symmetry, by replacing F by an arbitrarily complex neural network with multiple hidden layers and non-linearities. The procedure can be summarized as follows: 1. Construct the q one-hot-encodings of the input, (σ k ) i = δ k,σi , each σ k being a binary vector of size N (i.e. the number of spins).
2. Apply the same (i.e. independent of k) arbitrarily complex neural network F to each of the q inputs to construct q output fields also of size N , i.e.
where the function y = F[x] satisfies the AR property, such that y i only depends on {x j } j<i .
3. Take a softmax of the q-component field ) to obtain the probability of variable σ i .
Because the same neural network with the same weights is applied independently for all k, this process guarantees color symmetry, and reduces the number of parameters (as compared with the same network architecture without color symmetry) by ∼ q 2 , as in Eq. (S3).

D. MADE
A Masked Autoencoder for Distribution Estimator (MADE) consists of a generic D-layer neural network satisfying the autoregressive property, i.e. the weights satisfy J l i≤j = 0 for any layer l = 1, · · · , D. When the depth D = 1, we obtain the shallow model described in Sec. S1 B. When D > 1, we use ReLU activation functions between all layers. We can also increase the width of the hidden layers by defining a parameter W that increases the dimensionality of each variable, i.e. {x 1 , · · · , x N } ∈ R N − → y 1 1 , · · · , y W 1 , y 1 2 · · · , y W N ∈ R N ×W . Notice that when W > 1, each variable y a i satisfies the autoregressive property of depending only on {x j } j<i . To describe nonbinary variables with the MADE architecture, we use the symmetric one-hot-encoding introduced in Sec. S1 C.

E. NADE
A Neural Autoregressive Distribution Estimator (NADE) [27] is another estimator based on an encodingdecoding architecture, where the autoregressive property is overimposed. It usually consists of a single hidden layer to encode the message, and the size of this hidden layer corresponds to N h hidden variables for each visible variable. This is similar to a MADE with D = 2 and W = N h . The i = 1, · · · , N hidden units of a NADE are calculated from where y i ∈ R N h , A ∈ R N h ×N , σ <i is the usual Ndimensional input (each of the one-hot-encodings is processed in parallel to enforce color symmetry, see Sec. S1 C) with the autoregressive mask, B ∈ R N h , and Ψ is an arbitrary component-wise non-linear function, e.g. ReLu.
The distinctive feature of the NADE architecture is that the weight matrix A and biases B are shared between all the hidden units y i , while in a MADE there would be a different A i , B i for each hidden unit, which can be encoded in fully connected weight matricesÃ ∈ R N ×N h ×N andB ∈ R N ×N h . This feature, inspired by restricted Boltzmann machines, has proven to be very convenient to reduce the number of parameters (by a factor N ) while keeping a good expressivity of the model. The information is propagated from the hidden layers to the output through where V i ∈ R N h and C i ∈ R assume different values from each output field i = 1, · · · , N . Finally, a softmax is applied to the q fields computed in parallel to obtain the probability, see Sec. S1 C.

F. ColoredMADE and ColoredNADE
These architectures are modifications of the MADE and NADE architectures in order to describe non-binary variables in the standard one-hot-encoding fashion. In practice, this is realized by setting W = q, where q is the number of colors, and one-hot-encoding the input. This choice does not enforce the color permutation symmetry, so each weight matrix is now free to take all possible values. This allows for more flexibility, at the cost of an increased memory consumption and training time.

G. GADE
A natural way to improve the performance of machinelearning models is to incorporate the physics of the problem into the model architecture. In the case of our spin models, the fact that the interactions have a graph structure suggests that Graph Neural Networks (GNN) may be an efficient choice. For this reason, we considered a GNN model that we called GADE, which stands for Graph Autoregressive Distribution Estimator.
Following the message passing paradigm, see e.g. [58], we define m i− →j as the message from node i to node j. Each node has to perform the following operations: (i) define the message to pass, (ii) take the information from the incoming messages, (iii) update its state. Hence, for each of the t = 1, · · · , T graph propagations we do the following operations: This operation imposes the autoregressive property by stating that a message goes from i to j only in the autoregressive direction. The message itself corresponds to the state of the node at the previous iteration, y t−1 i , but it vanishes if the nodes have different colors. Notice that where ∂j are the nearest neighbors of j on the graph. Compared to commonly used GNN like GraphSAGE [59], the approach defined by Eqs. (S8-S9) does not require the introduction of any new model parameters. As such, it can be interpreted as a deterministic pre-processing operation that we add to the model pipeline.
After we have computed the graph-weighted state of each node, {y t i } T t=0 , we feed it to a MADE model (GADE) or to a ColoredMADE model (ColoredGADE). Note that since we have explored the graph up to a depth T , the MADE network requires T input channels to process the data. This means that overall a GADE model will have the same number of parameters as a MADE with width W = T .

H. Variable ordering
When constructing AR models using the Bayes' rule decomposition, the choice of the ordering of variables is arbitrary. In our case, e.g. for the random graph coloring problem, we can arbitrary label the variables as σ 1 , · · · , σ N and then construct the random graph. We call this the 'naive' ordering. One can also consider smarter orderings [51], in which variables are ordered according to some useful characteristics. In our work, we considered a seemingly smarter ordering, in which the interaction graph is constructed first, and the variable σ 1 is taken to be the most connected one, σ 2 as the second most connected, and so on. The rationale for this choice [51] is that it can be useful to put more constrained variables first in the ordering. Yet, we found that the choice of ordering does not affect the conclusions qualitatively, we thus stick to the naive ordering when not told otherwise.

S2. ADAPTIVE TRAINING SCHEME
The idea of this scheme is to perform a mixture of local and global MCMC, now at fixed temperature, starting with a decent guess for P AR (σ). The algorithm works in three steps: 1. Perform a global MCMC step using a proposed σ new from P AR .
2. Perform a certain number of local MCMC steps.
3. Repeat the first two steps M times to generate a sample {σ m } m=1,··· ,M . Use this sample to perform a few steps of gradient ascent on the log-likelihood to improve P AR (σ).
Initially, most global MCMC moves are rejected, hence the sample is only constructed using local MCMC, resulting in poor sampling if local MCMC is inefficient. Yet, the little information contained in the sample might be enough to improve the AR model, resulting in a better acceptance of the global MCMC at the next step, and in turn of a sample of better quality. Ideally, the process should converge to a situation where the acceptance rate of global MCMC is high enough, thus achieving proper sampling; yet, this is far from being guaranteed in practical applications. Furthermore, there is a danger of mode-collapse during the training [15]. Suppose that global MCMC is very inefficient and that local MCMC is stuck into a probability peak of the true model. Then, at the first iteration, the sample only covers this single peak; the AR model will learn that peak, resulting in its reinforcement, and the process thus remains stuck. To prevent this, Ref. [15] uses many parallel MCMC, initialized in very different states. Yet, in systems with a really large number of probability peaks (i.e., a rough energy landscape), this might not be enough to prevent mode-collapse.

A. The model and its Boltzmann distribution
The model Hamiltonian is where an irrelevant constant of order one has been neglected in the second expression of H. Note that because all spin pairs interact, in order to have an extensive energy, the spin coupling must be fixed to 1/N . Also, because the Hamiltonian is fully invariant under permutations of the spin labels, the ordering of variables in the AR model is irrelevant. Because the number of spin configurations corresponding to a given m is N N (1+m)/2 , the partition function can be written as: (S11) and the free energy per spin follows as Using Stirling's formula to approximate the binomial coefficient for large N , one obtains (S12) In the thermodynamic limit, the value of m that dominates the sum in Eq. (S11) is the one that minimizes f (m), which is the solution of m = tanh(βm), leading to f (T ) = min m f (m, T ).

B. Minimization of the KL divergence between the AR model and the mixture model
We detail here the minimization of the KL divergence between the shallow MADE model, where m <i = j(<i) σ j is the magnetization of the first (i − 1) spins, and the mixture model that approximates the Boltzmann distribution of the CW model. For an infinite sample (M → ∞, neglecting constants), the KL divergence between these two distributions is where the average is over the mixture model (S14) Minimization of each term with respect to J i leads to a set of coupled equations: Because of the spin-flip symmetry, the average σ i σ j is the same in the two terms with m = ±m, , hence we can restrict to e.g. m = +m and write σ i σ j = σ i σ j + = σ i + σ j + =m 2 , which leads to Similarly, we can restrict to m = +m and write m <i = 2k − (i − 1), k being the number of positive spins among the first (i − 1). The probability distribution of k is a binomial, hence we arrive to (S17) The minimal value of the KL divergence is then (S18) We can easily check that there is no problem in the convergence of the gradient descent algorithm in the vicinity of the phase transition, by expanding each of the terms D(J i ) = − log P i AR (σ i |σ <i ) in Eq. (S13) for small J i , which leads to (S19) Form = 0, this has a single minimum in J i = 0 with finite curvature B = (i − 1), leading to exponentially fast convergence. Even for T < 1, the minimum is slightly shifted to a finite J i but its curvature stays finite.

A. Dependence on the size of the training set
In order to investigate the dependence on the size of the training set M , we trained the NADE at fixed temperature T = 1 by using a smaller training sample of M = 10 2 (obtained by subsampling the original sample of M = 10 5 configurations). We then studied the acceptance rate of global MCMC as a function of the number of MC iterations. We consider K = 10 3 independent chains, initialized in three different ways: (i) using samples generated by the NADE itself; (ii) using the samples in the training set; and (iii) using independent equilibrium samples obtained by local MCMC. One should keep in mind that, because the test sample initialization (iii) correspond to an initial sample in equilibrium and global MCMC satisfies detailed balance, the system remains in equilibrium at all times, and all quantities must then be time-independent, which is confirmed in Fig. S1a. Also, this initialization thus corresponds to the long-time limit of the other two, (i) and (ii). In Fig. S1a we show, as a function of the number of global MC steps, the acceptance rate averaged over the K chains, together with the geometric average over the K chains of the Boltzmann ratio P B (σ new )/P B (σ old ) and the AR model ratio P AR (σ old )/P AR (σ new ). We recall that, according to Eq. (10), the acceptance rate depends on the product of the Boltzmann and model ratios. We observe that at M = 10 2 the asymptotic acceptance rate (as obtained from the test set initialization) is poor, of the order of 10 −3 . Furthermore, the Boltzmann and model ratios are both much smaller than one, but have very wide fluctuations in log-scale, hence the rare acceptance of moves is due to an atypical fluctuation of these ratios. When the global MCMC is instead initialized in the model or training samples, the acceptance rate is initially much higher and decreases very slowly, indicating that the global MCMC is behaving differently on these samples, and that it is taking an extremely long time to reach the asymptotic behavior from these initializations. We conclude that with a training set of M = 10 2 samples, the global MCMC has poor decorrelation properties, and is generally inefficient, because the AR model is overfitting the training set. We then repeat the same study for larger values of M , and we find that around M = 10 4 the acceptance rate becomes good (Fig. S1b), and correspondingly the dependence on initialization disappears. This justifies the choice of M = 10 5 as an optimal value, and highlights the need for a sufficiently large training set to obtain a good performance of global MCMC. The dependence of the acceptance rate on M is reported in Fig. S1c.

B. Three-dimensional model
We also investigated the three-dimensional Edwards-Anderson (3d EA) model, on a 5 × 5 × 5 cubic periodic lattice, i.e. with N = 125 spins. Contrarily to the 2d case, finding the exact ground state in the 3d case is NPhard [48]. Yet, for these small sizes, the McGroundstate server [50] can find the ground state in a few minutes. Furthermore, in the 3d case it is known that a phase transition happens at a critical temperature very close to T = 1 [52].
We find that global MCMC using the shallow MADE architecture and a training set of M = 10 5 samples performs very similar to the 2d case. In particular, it provides a diverging speedup with respect of local MCMC (whose decorrelation time diverges near the phase transition) and converges to the ground state at low temperatures. The results are reported in Fig. S2.