Sparse autoregressive neural networks for classical spin systems

Efficient sampling and approximation of Boltzmann distributions involving large sets of binary variables, or spins, are pivotal in diverse scientific fields even beyond physics. Recent advances in generative neural networks have significantly impacted this domain. However, these neural networks are often treated as black boxes, with architectures primarily influenced by data-driven problems in computational science. Addressing this gap, we introduce a novel autoregressive neural network architecture named TwoBo, specifically designed for sparse two-body interacting spin systems. We directly incorporate the Boltzmann distribution into its architecture and parameters, resulting in enhanced convergence speed, superior free energy accuracy, and reduced trainable parameters. We perform numerical experiments on disordered, frustrated systems with more than 1000 spins on grids and random graphs, and demonstrate its advantages compared to previous autoregressive and recurrent architectures. Our findings validate a physically informed approach and suggest potential extensions to multivalued variables and many-body interaction systems, paving the way for broader applications in scientific research.


Introduction
This paper addresses the challenge of efficiently sampling and approximating the Boltzmann distribution in systems characterized by two-body interacting binary variables, or spins.The interactions among the variables are typically defined on sparse graphs, such as lattice systems pivotal in material science, and random graphs that model a plethora of complex socio-technological systems [Barabási andAlbert, 1999, Albert andBarabási, 2002].
The complexity of the system's Boltzmann distribution, and thus its underlying physics, significantly depends on the couplings and external field values.For instance, disordered or random coupling values lead to complex and frustrated systems that could reduce the efficiency of sampling algorithms such as Monte Carlo approaches.
This complexity finds relevance in various areas, including spin glasses [Mezard et al., 1986], optimization on random graphs [Mezard and Montanari, 2009], and statistical inference problems [Zdeborová and Krzakala, 2016].arXiv:2402.16579v2 [cond-mat.stat-mech]21 Jun 2024 Consider a system of N spins characterized by a Hamiltonian H.These spins, each denoted as σ i , can take values in ±1.Let σ = {σ 1 , σ 2 , . . ., σ N } represent the set of all spins.The Hamiltonian is given by H(σ) = − ⟨i,j⟩ J ij σ i σ j − i h i σ i , where the summation ⟨i,j⟩ extends to all interacting pairs of spins.The term J ij represents the interaction strengths between the spins, and h i is the external magnetic field acting on each spin.
We approximate the Boltzmann distribution P (σ) using an ARNN whose probability distribution is decomposed in the autoregressive form , where σ <i = {σ 1 , σ 2 , . . ., σ i−1 } denotes the set of variables with indices less than i.If we can write a probability distribution in this form, then the ancestral sampling procedure, where each variable σ i is sampled in the subsequent order according to its conditional probability Q i (σ i | σ <i ) given the previous variables, produces guaranteed independent samples.
We parameterize each conditional probability with a neural network.Without limitation to particular interacting systems, variable sharing schemes, or recurrent structures, one may use feedforward neural networks to parameterize the conditional probabilities, with masked dense layers to ensure the autoregressive property.This architecture is known as masked autoencoder for distribution estimation (MADE), which was first introduced in [Germain et al., 2015] and used in physics in [Wu et al., 2019].However, both its number of parameters and the computation time of each layer scale by O(N 2 ) with the system size N , limiting its applicability to small systems.
In the following, we will show how the knowledge of the Hamiltonian and the Boltzmann distribution leads to systematically reducing both the number of trainable parameters and the computation time.Consider the Boltzmann distribution , where β is the inverse temperature and Z = σ e −βH(σ) is the normalization factor.Following the derivation presented in [Biazzo, 2023], we can write: where we defined f (σ i , σ <i ) = σ >i e −βH (σ) .The positive conditional probability of the variable σ i can be written as: (2) where we impose the sigmoid function S(x) = 1 1+e −x as the last layer of the neural network.This ensures that the network output is constrained to lie between zero and one.Substituting the definition of the Hamiltonian and the Boltzmann distribution, we obtain the following: where: Observing ρ ± i as a sequence of operations on the input variables σ <i , the first operation is a linear transformation, which leads us to define a linear layer with the outputs: (5) Now Eq. ( 3) can be written as: where we defined ρ i (ξ i ) = log , ξ i,i+2 , . . ., ξ i,N }.The variables ξ i can be explicitly computed given the input spins σ <i and the Hamiltonian parameters J , which means that the first layer of our neural network does not involve any trainable parameter.After the first layer, the term 2βξ ii appears as an input to the last layer S, which can be considered as a skip connection [He et al., 2016].As shown in Eq. ( 4), the exact computation of the function ρ i involves a sum over all configurations of σ >i , which scales exponentially with the system size.The entire network architecture to compute Eq. ( 6) is sketched in Fig. 1 (a).
In [Biazzo, 2023], the functions ρ i were approximated with specific architectures derived for the fully connected Curie-Weiss and Sherrington-Kirkpatrick models.In the present work, we propose a more general approach to approximate ρ i using feedforward neural networks, whose inputs are the variables ξ i , and to take advantage of the sparsity of interactions among the spins.In the following, we refer to this architecture as TwoBo (two-body interactions).(a) Sketch of the TwoBo neural network architecture to compute a conditional probability P Bi (σ i = +1 | σ <i ) in the autoregressively decomposed Boltzmann distribution.The parameters of the first layer are taken from the coupling matrix J .The function ρ i is parameterized as a feedforward neural network, whose inputs are the variables ξ i after the first layer.The last layer S is a sigmoid function.(b) Calculating the conditional probability P Bi with i = 13 on a 2D grid of N = 25.The edges between the spins σ <i and σ ≥i are highlighted in blue, and only those edges are used when computing ξ il .Among the variables {ξ il | l > i}, only those with l < i + L are kept as inputs to ρ i , and the others are ensured to be zero.
When computing a conditional probability P Bi with a fixed i, the computation time of the first layer, Eq. ( 5), scales linearly in the number of non-zero elements in J .For fully connected interactions, it scales as O(N 2 ) with the system size N ; whereas for sparse interactions, it is linear in N .Moreover, in the case of sparse interactions, many of the variables in ξ i are ensured to be zero.As seen in Eq. ( 5), if the spin σ l does not interact with all spins σ <i , then ξ il is zero and can be discarded in the parameterization of the function ρ i .Consider, for instance, a 2D grid of side length L, with spins σ ordered as in Fig. 1 (b).The variables ξ il are non-zero only when i < l < i + L, so the number of variables in ξ i is reduced from O(N ) = O(L 2 ) to O(L).Furthermore, since the Hamiltonian is locally interacting, each remaining ξ il only depends on a fixed number of spins, which does not scale with N , so the computation time of the first layer is only O(L).When we compute all conditional probabilities {P Bi | 1 ≤ i ≤ N } in a batch, the computation time of the first layer is O(LN ) = O(L 3 ), which is polynomially lower than O(L 4 ) of MADE.After the first layer, we use a lightweight neural network for ρ i , whose computational complexity does not exceed O(L 3 ).The empirical scaling of the number of trainable parameters is shown in the Supplementary Material A. Similar polynomial reduction of trainable parameters is also valid with other local interactions, such as second and third nearest neighbors, as long as the interaction range is much less than the system size.
When training the neural network in a variational setting, we denote it as Q θ (σ), where θ is the set of trainable parameters.The parameters are trained to minimize the Kullback-Leibler divergence P (σ) between the variational and the target distributions.When the target is the Boltzmann distribution P B , it is equivalent to minimizing the variational free energy , except for a constant term.Its exact computation involves a summation over the spin configurations that grows exponentially with the system size.In [Wu et al., 2019], it was proposed to estimate the variational free energy and its gradients with respect to θ by a finite subset of configurations sampled from the ARNN thanks to the ancestral sampling procedure.
In this work, to alleviate the mode collapse problem, an annealing procedure is used in the training, starting from a low value of β = 0.05 (high temperature) and changing it in steps of 0.05 until β = 3.In each temperature step, a fixed number of optimization steps are executed, then the trained neural network is used as the initialization at the next temperature.In the case of TwoBo, the β used in the skip connection coefficient in Eq. ( 6) is also updated accordingly, and its effect is shown in the Supplementary Material B.

Results
We test the performance of TwoBo in approximating the Boltzmann distribution of sparse and disordered interacting spin systems.
We first choose Edward-Anderson (EA) models [Edwards and Anderson, 1975] on 2D and 3D grids with periodic boundary conditions as target distributions, which can demonstrate the rich physics of the spin glass phase [Mezard et al., 1986, Picco et al., 2001, Stein and Newman, 2013].In particular, the 3D EA model falls into the NPcomplete class of computational complexity [Cipra, 2000], and has been a widely recognized testing ground for algorithms aiming to solve computationally difficult optimization problems [Amey andMachta, 2018, Cirauqui et al., 2024].Moreover, we consider random regular graphs (RRG) as another target, which naturally arises in modeling long-range interactions and cooperative behaviors in several domains, such as biological and techno-social systems [Mézard and Parisi, 2001, Albert and Barabási, 2002, Dommers, 2017].
The couplings J ij are randomly sampled to be either +1 or −1, and the external fields h i are set to zero for simplicity.Each result we report in this paper is averaged from 10 random instances, and for each instance we train a neural network, respectively.When generating the RRGs, we fix the degree d = 3.
Although in general we can use a deep feedforward neural network to parameterize each function ρ i in Eq. ( 6), in this work we found that a single dense layer is enough to obtain satisfactory results.This dense layer maps N ξi inputs to one output, where N ξi is the number of non-zero variables in ξ i .Note that we do not share the parameters among the functions ρ i with different i.For comparison, we have tested the simplest MADE with a single dense layer, which still has more trainable parameters than the simplest TwoBo.Another architecture we have compared with is the tensorized RNN introduced in [Hibat- Allah et al., 2021] for 2D classical spin systems, where we use 4 memory units for a comparable number of trainable parameters, and a comparison with RNN at different sizes is shown in the Supplementary Material C. Details of the numerical experiments can be found in the Supplementary Material D.
In the left column of Fig. 2, the free energies estimated by the three architectures are shown as a function of β, where the TwoBo result is taken as the reference for the relative difference.We observe that TwoBo obtains the lowest free energy estimation on all three interaction graphs, except for smaller system sizes at the beginning of the annealing procedure where β is low.This signifies a better approximation of the corresponding Boltzmann distribution.
Recent studies [Ciarella et al., 2023, Inack et al., 2022] have indicated the potential problem of mode collapse that could distort the free energy estimation in highly frustrated systems, where the variational distribution is trapped in local minima of free energy.To corroborate the efficacy of TwoBo in capturing the high complexity of these systems, we have examined its ability to find ground state configurations, as shown in the right column of Fig. 2. MADE and RNN failed to find configurations with lower energy than TwoBo in large systems, indicating the difficulty in capturing the complexity of these probability distributions.
In addition to variational methods, we have used McGroundstate [Charfreitag et al., 2022], a recent max-cut solver for 2D and 3D lattices, to solve the EA instances and compared the results with TwoBo.McGroundstate solved all 2D instances with N = 256, 576, 1024 and 3D instances with N = 64 in provable optimality, as found by TwoBo, but it failed to provide results for larger 3D cases with N ≥ 512.For the largest 3D cases, we have tried the traditional max-cut algorithm [Burer et al., 2002] implemented in MQLib [Dunning et al., 2018], which yielded energies higher than those obtained by TwoBo.Consequently, we chose the energy found by TwoBo as the reference for the relative difference in the minimum energy shown in the right column of Fig. 2.
To investigate the convergence of the training procedure, Fig. 3 shows the free energy during the optimization steps at three different temperatures on a 2D EA model, from which we can observe that TwoBo always starts with a lower free energy compared to the other architectures.This advantage may stem from the fact that the parameters of the first layer are fixed to the Hamiltonian couplings, as well as the change of the skip connection coefficient with β, which leads the variational distribution to be closer to the target distribution at the beginning of the training procedure.With significantly more optimization steps, MADE could eventually match the free energy obtained by TwoBo.In contrast, RNN converges to a higher free energy, indicating its insufficient expressivity or trainability in a regime of few trainable parameters.
To better assess this claim, Fig. 4 displays the free energy as a function of the number of optimization steps in a whole temperature step.It is evident that as the number of steps increases, the free energy computed by TwoBo systematically decreases.MADE and RNN display a similar trend with almost an order of magnitude more steps, while RNN eventually reaches a plateau with a free energy higher than TwoBo.This comparison shows the abundant expressivity of TwoBo in capturing the target distribution and its ease of training.

Conclusion
In this work, we presented a new autoregressive neural network architecture named TwoBo to sample and approximate the Boltzmann distribution of two-body interacting spin systems.It exploits the knowledge of the target Boltzmann distribution to determine part of its parameters, and systematically reduces the number of trainable parameters in sparse interacting systems without loss of expressivity.It obtains higher accuracy in approximating disordered and frustrated Boltzmann distributions with considerably fewer trainable parameters and faster convergence than previous ARNN architectures.
We achieve a reduction in the number of trainable parameters with respect to MADE in all cases considered.The most notable results are observed in the 2D lattice case, although the benefits decrease with increasing lattice dimensions.In contrast, in the long-range interaction case of RRG, while there is a reduction in parameters, the scaling remains comparable to that of MADE.Nevertheless, TwoBo still shows faster convergence and higher accuracy in these cases, thanks to the knowledge of the Boltzmann distribution.Although the present work focuses on binary, two-body interacting variables, nothing prevents using a similar approach to derive ARNN architectures for multi-valued or continuous variables, such as the Potts model [Wu, 1982] and the Kuramoto model [Acebrón et al., 2005], and with more than two-body interactions, such as the Baxter-Wu model [Novotny and Landau, 1981].Beyond statistical physics, TwoBo has potential applications in a wide range of research areas that utilize Boltzmann distributions, including combinatorial optimization and inference problems, extending from mathematics to social sciences [Mézard et al., 2002, Zdeborová and Krzakala, 2016, Biazzo et al., 2022, Khandoker et al., 2023].It also has promising uses in a family of energy-based models within the rapidly growing field of machine learning, particularly for tasks such as language understanding, image recognition, and other complex data interpretation activities [Deng et al., 2020, Du et al., 2020].We believe that the efficiency of TwoBo will help explore the computationally difficult regimes of these problems.
)UHHHQHUJ\>UHOGLII@ Variational free energy of TwoBo compared to RNN with different numbers of memory units (2,4,8,16,24,32,40), on 2D EA model with N = 576.The results are averaged over 10 random instances of the Hamiltonian, and the error bars show the standard errors over the instances.

D. Details of numerical experiments
In TwoBo and MADE, the only dense layer is initialized using Lecun normal initialization (the default to initialize dense layers in JAX) multiplied by a small factor of 0.01, so that the output values before the last sigmoid function are small, and after the sigmoid function the probability distribution is almost uniform at the beginning of training, which helps alleviate mode collapse.
The implementation of tensorized RNN is taken from https://github.com/mhibatallah/RNNWavefunctions.We only used their neural network architecture but not their annealing procedure in training, because our annealing procedure aims to produce a converged free energy at each step of β, while theirs only aims to produce a ground state energy at zero temperature.
At the beginning of training, we run 500 optimization steps as warm-up with the initial β = 0.05.Then for each β in steps of 0.05 until β = 3, we run 200 optimization steps.Note that the value of β is unchanged during those optimization steps.After each step of β, the trained neural network is used as the initialization in the next β.
The optimizer we use is Adam [Kingma and Ba, 2015] with learning rate 10 −3 , and at each step we take batch size 1024.The max-cut solver McGroundstate provides public access through their website.They put a time limit of 1800 seconds for each spin glass instance, and they failed to solve the 3D EA models with N = 512 and 1728 within the time limit.We have run MQLib on our local machine with a time limit of 24 hours.It outputs the energy whenever it finds a lower one, and it stopped improving the energy after the first hour.
Figure 1.(a) Sketch of the TwoBo neural network architecture to compute a conditional probability P Bi (σ i = +1 | σ <i ) in the autoregressively decomposed Boltzmann distribution.The parameters of the first layer are taken from the coupling matrix J .The function ρ i is parameterized as a feedforward neural network, whose inputs are the variables ξ i after the first layer.The last layer S is a sigmoid function.(b) Calculating the conditional probability P Bi with i = 13 on a 2D grid of N = 25.The edges between the spins σ <i and σ ≥i are highlighted in blue, and only those edges are used when computing ξ il .Among the variables {ξ il | l > i}, only those with l < i + L are kept as inputs to ρ i , and the others are ensured to be zero.

Figure 2 .
Figure 2. Variational free energy (left column) and minimum energy of sampled configurations (right column) at each β on 2D EA (top row), 3D EA (middle row), and RRG (bottom row) models of different sizes.We compare the performances of TwoBo, MADE, and RNN (only in 2D).The variational free energy is shown as the relative difference from TwoBo at the same β, and the minimum energy as the relative difference from the one found by TwoBo at β = 3.The results are averaged over 10 random instances of the Hamiltonian, and the error bars show the standard errors over the instances.

Figure 3 .
Figure 3. Convergence of variational free energy when training at fixed values of β on 2D EA model with N = 576.The results are averaged over 10 random instances of the Hamiltonian, and the error bars show the standard errors over the instances.

Figure 4 .
Figure 4. Variational free energy using different numbers of optimization steps at each β on 2D EA model with N = 576, shown as the relative difference from TwoBo at the same β using 1024 steps.The results are averaged over 10 random instances of the Hamiltonian, and the standard errors over the instances are too small to be visible.

Figure S2 .
Figure S2.Variational free energy when fixing β = 1 in the skip connection coefficient, shown as the relative difference from updating it during the annealing procedure, on 2D EA model with N = 576.The results are averaged over 10 random instances of the Hamiltonian, and the error bars show the standard errors over the instances.