Training neural networks using Metropolis Monte Carlo and an adaptive variant

We examine the zero-temperature Metropolis Monte Carlo algorithm as a tool for training a neural network by minimizing a loss function. We find that, as expected on theoretical grounds and shown empirically by other authors, Metropolis Monte Carlo can train a neural net with an accuracy comparable to that of gradient descent, if not necessarily as quickly. The Metropolis algorithm does not fail automatically when the number of parameters of a neural network is large. It can fail when a neural network's structure or neuron activations are strongly heterogenous, and we introduce an adaptive Monte Carlo algorithm, aMC, to overcome these limitations. The intrinsic stochasticity and numerical stability of the Monte Carlo method allow aMC to train deep neural networks and recurrent neural networks in which the gradient is too small or too large to allow training by gradient descent. Monte Carlo methods offer a complement to gradient-based methods for training neural networks, allowing access to a distinct set of network architectures and principles.


I. INTRODUCTION
The Metropolis Monte Carlo algorithm was developed in the 1950s in order to simulate molecular systems [1][2][3][4]. The Metropolis algorithm consists of small, random moves of particles, accepted probabilistically. It is, along with other Monte Carlo (MC) algorithms, widely used as a tool to equilibrate molecular systems [5]. Equilibrating a molecular system is similar in key respects to training a neural network: both involve optimizing quantities derived from many degrees of freedom that interact in a nonlinear way. Despite this similarity, the Metropolis algorithm and its variants are not widely used as a tool for training neural networks by minimizing a loss function (for exceptions, see e.g. Refs. [6][7][8]). Instead, this is usually done by gradient-based algorithms [9,10], and sometimes by population-based evolutionary or genetic algorithms [11][12][13] to which Monte Carlo methods are conceptually related.
In this paper we address the potential of the zerotemperature Metropolis Monte Carlo algorithm and an adaptive variant thereof as tools for neural-network training [14]. The class of algorithm we consider consists of taking a neural network of fixed structure, adding random numbers to all weights and biases simultaneously, and accepting this change if the loss function does not increase. For uncorrelated Gaussian random numbers this procedure is equivalent, for small updates, to normalized or clipped gradient descent in the presence of Gaussian white noise [15][16][17] [18], and so its ability to train a neural network should be similar to that of simple gradient descent (GD). We show in Section II A that, for a particular supervised-learning problem, this is the case, a finding consistent with results presented by other authors [6][7][8]. * swhitelam@lbl.gov † isaac.tamblyn@uottawa.ca It is sometimes stated that the ability of stochastic algorithms to train neural networks diminishes sharply as the number of network parameters increases (particularly if all network parameters are updated simultaneously). However, population-based evolutionary algorithms have been used to train many-parameter networks [19], and in Section II B we show that the ability of Metropolis Monte Carlo to train a fully-connected neural network is similar for networks with of order a hundred parameters or of order a million: there is not necessarily a sharp decline of acceptance rate with increasing network size.
What does thwart the Metropolis Monte Carlo algorithm is network heterogeneity. For instance, if the number of connections entering neurons differs markedly throughout the network (as is the case for networks with convolutional-and fully-connected layers) or if the outputs of neurons in different parts of a network differ markedly (as is the case for very deep networks) then stochastic weight changes of a fixed scale will saturate neurons in some parts of the network and scarcely effect change in other parts. The result is an inability to train. To address this problem we introduce a set of simple adaptive modifications of the Metropolis algorithm -a momentum-like term, an adaptive step-size scheduler, and a means of enacting heterogenous weight updates -that are borrowed from ideas commonly used with gradient-based methods. The resulting algorithm, which we call adaptive Monte Carlo or aMC, is substantially more efficient than the non-adaptive Metropolis algorithm in a variety of settings. In Section II B we show, for one particular problem, that the acceptance rate of aMC remains much higher than that of the Metropolis algorithm at low values of loss, and can be made almost insensitive to network width, depth, and size. In Section II C we show that its momentum-like term speeds the rate at which aMC can learn the high-frequency features of an objective function, much as adaptive methods such as Adam [20] can learn high-frequency features faster than regular gradient descent. In Section II D we show that the Monte Carlo method can train simple recurrent neural networks in the presence of small or large gradients, where gradient-based methods fail. In Section II E we show aMC can train deep neural networks in which gradients are too small for gradient-based methods to train. In Section II F we comment on the fact that best practices for training nets using Monte Carlo methods await development. We introduce the elements of aMC throughout Section II, and summarize the algorithm in Section III.
Our conclusion is that the Metropolis Monte Carlo algorithm and its adaptive variants such as aMC are viable ways of training neural networks, if less developed and optimized than modern gradient-based methods. In particular, Monte Carlo algorithms can, for small updates, effectively sense the gradient, and they do not fail simply because the number of parameters of a neural network becomes large. Monte Carlo algorithms should be considered a complement to gradient-based algorithms because they admit different design principles for neural networks. Given a network that permits gradient flow, modern gradient-based algorithms are fast and effective [9,10,21]. For large neural nets with tens of millions of parameters we find gradient-based methods to be considerably faster than Monte Carlo (Section II F). However, Monte Carlo algorithms free us from the requirement of ensuring reliable gradient flow (and gradients can be unreliable even in differentiable systems [22]). As a result, we find that Monte Carlo methods can train deep neural networks and simple recurrent neural networks in which gradients are too small (or too large) for gradient-based methods to work. There already exist solutions to these problems, namely the introduction of skip connections or more elaborate recurrent neural network architectures, but aMC requires neither of these things. One type of solution is architectural, the other algorithmic, and having both options offers more possibilities than having only one.

A. Metropolis Monte Carlo and its connection to gradient descent
We start with the zero-temperature Metropolis Monte Carlo algorithm. The zero-temperature limit is not often used in molecular simulation, but it and its variants are widely used (and sometimes called randommutation hill climbing) for optimizing non-differentiable systems such as cellular automata [23,24]. Consider a neural network with N parameters (weights and biases) x = {x 1 , . . . , x i , . . . , x N }, and associated loss function U (x). If we propose the simultaneous change of each neural-network parameter by a Gaussian random number [25], fixed (11) depth (12) variable (13) depth (14) U Here n is training time (epoch), and η is a Gaussian white noise with zero mean and variance η i (n)η j (n ) = (σ 2 /2)δ ij δ(n − n ). That is, small stochastic perturbations of a network's weights and biases, accepted if the loss function does not increase, is equivalent to noisy clipped or normalized gradient descent on the loss function. Given the success of gradient-based training methods, this correspondence shows the potential of the Metropolis algorithm to train neural networks. Consistent with this expectation, we show in Fig. 1 that the zero-temperature Metropolis algorithm can train a neural network. For comparison, we also train the network using simple gradient descent, where α is the learning rate, U (x) the loss function, and ∇ ≡ (∂/∂x 1 , . . . , ∂/∂x N ) the gradient operator with respect to the neural-network parameters x.
We consider a standard supervised-learning problem, recognizing MNIST images [26, 27] using a fullyconnected, two-layer neural net. The net has 784 input neurons, 16 neurons in each hidden layer, and one output layer of 10 neurons. The hidden neurons have hyperbolic tangent activation functions, and the output neurons comprise a softmax function. The net has in total 13,002 parameters [28]. We did batch learning, with the loss function U being the mean-squared error on the MNIST training set of size 6 × 10 4 (in the standard way we consider the ground truth for each training example to be a 1-hot encoding of the class label, and take the 10 outputs of the neural network as its prediction vector). Fig. 1(a) shows the classification accuracy C 10 5 on the MNIST test set of size 10 4 after 10 5 epochs of training. We show results for MC (blue) and GD (gray), for a range of values of step size σ and learning rate α, respectively. The initial neural-net parameters for MC simulations were x i ∼ N (0, σ 2 ). The two algorithms behave in a similar manner: each has a range of its single parameter over which it is effective, and displays a maximum at a particular value of that parameter. The value of the maximum for GD is slightly higher than that for MC (about 96% compared to about 95%), and GD achieves near-maximal results over a broader range of its single parameter than does MC. Fig. 1(b) shows loss U and classification accuracy C as a function of epoch for two examples from panel (a). GD trains faster with n initially, but results are comparable near the end of the learning process. The learning dynamics of these algorithms is not the same: in the limit of small steps, the zero-temperature Metropolis algorithm approaches normalized or clipped gradient descent, not standard gradient descent (and its equivalence to the former would only be seen with an appropriately rescaled horizontal axis) [29]. Nonetheless, MC can in effect sense the gradient, as long as the step size is relatively small, and for this problem the range of appropriate step sizes is small compared to the effective GD step size. GD therefore trains faster, but MC has similar capacity for learning. The computational cost per epoch of the two algorithms is of similar order, with MC being cheaper per epoch for batch learning: each MC step requires a forward pass through the data, and each GD step a forward and a backward pass.
There are many things that could be done to improve the learning precision of these algorithms (no preprocessing of data was done, and a basic neural net was used [30]), but this comparison, given a neural net and a data set, confirms that Metropolis Monte Carlo can achieve results roughly comparable to gradient descent, even on a problem for which gradients are available. For this problem GD trains faster, but MC works. It is worth noting that this conclusion follows from considering a range of step sizes σ: for a single choice of step size it would be possible to conclude that MC doesn't work at all.
Similar findings have been noted previously: simulated annealing on the Metropolis algorithm [6,7] and a variant of zero-temperature Metropolis MC (applied weight-by-weight) [8] were used to train neural nets with an accuracy comparable to that of gradient-based algorithms. These results, and the correspondence described in Ref. [17] establish both theoretically and empirically the ability of Metropolis MC to train neural nets.
We next turn to the question of how the efficiency of Monte Carlo training scales with net parameters, and how to improve this efficiency by introducing adaptivity to the algorithm.

B. Metropolis acceptance rate as a function of net size
To examine how the efficiency of the Metropolis algorithm changes with neural-net size and architecture, we consider in this section a supervised-learning problem in which a neural net is trained by zero-temperature Metropolis MC to express the sine function f 0 (θ) = sin(2πθ) on the interval θ ∈ [0, 1]. The loss U is the meansquared error between f 0 (θ) and the neural-net output f x (θ), evaluated at 10 3 evenly-spaced points on the interval. The neural net has one input neuron, which is fed the value θ, and one output, which returns f x (θ). Its internal structure is fully connected, with hyperbolic tangent nonlinearities. To explore the effect of varying width (panels (a) and (c) of Fig. 2) we set the depth to 2 and varied the width from 10 to 10 3 , these choices corresponding to about 10 2 to about 10 6 parameters. To explore the effect of varying depth (panels (b) and (d) of Fig. 2) we set the width to 25 and varied the depth from 2 to 10, these choices corresponding to about 10 2 to about 10 4 parameters.
In Fig. 2(a,left) we show loss U as a function of Metropolis acceptance rate A for three different neuralnet widths. The acceptance rate tells us, for fixed step size, the fraction of directions that point downhill in loss. It provides information similar to that shown in plots of the index of the critical points of a loss surface [31,32], confirming that at large loss there are more downhill directions than at small loss. In Fig. 2(a,right) we plot the acceptance rate A(U 0 ) at fixed loss U 0 as a function of the number of net parameters N (obtained by taking horizontal cuts across panel (a); note that more net sizes are shown in panel (b) than panel (a)).
The acceptance rate decreases with increasing net size, but relatively slowly. Upon increasing the size of the net by 4 orders of magnitude, the acceptance rate decreases by about 1 order of magnitude. We have indicated an N −1 scaling as a guide to the eye. In the extreme limit, if N simultaneous parameter updates each had to be individually productive, the acceptance rate would decrease exponentially with N , which is clearly not the case. The more dramatic decrease in acceptance rate is with loss: at small loss the acceptance rate becomes very small. Similar trends are seen with depth in Fig. 2(b). The acceptance rate declines sharply with loss. It also declines with the number of parameters, slightly more rapidly than in panel (a) but not as rapidly as N −1 .
Empirically, therefore, we do not see evidence of a fundamental inability of MC to cope with large numbers of parameters. In Section II E we discuss how network heterogeneity can impair the Metropolis algorithm's ability to train a network. The solution, as we discuss there, is to introduce an adaptive Monte Carlo (aMC) variant of the Metropolis algorithm. To motivate the introduction t = 520 (15) Adam (16) or ↵ depth 64 (18) varying width (1) t = 520 (15) Adam (16) or ↵ depth 64 (18) varying width (1) t = 520 Adam (16) or ↵ (17)  of this algorithm we show in panels (c) and (d) the aMC analog of panels (a) and (b), respectively. The trends experienced by Metropolis have been annulled, the acceptance rates of aMC remaining large and essentially constant with loss or model size over the range of parameters considered.
We now turn to a step-by-step introduction of the elements of aMC.

C. Adaptivity speeds learning, particularly of high-frequency features
Modern gradient-based methods are adaptive, allowing the learning rate for each neural-net parameter to differ and to change as a function of the gradients encountered during training [10] (adaptive learning is also used in evolutionary algorithms [33][34][35]). We can copy this general idea in a simple way within a zero-temperature Metropolis Monte Carlo scheme by changing the proposed move of Eq. (1) to The parameters µ i , set initially to zero, are updated after every accepted move according to where is a hyperparameter of the method. This form of adaptation is similar to the inclusion of momentum in a gradient-based scheme: the center µ i of each parameter's move-proposal distribution shifts in the direction of the last accepted move i , with the aim of increasing the probability of generating moves that will be accepted.
In addition, to remove the need for a search over the step-size parameter σ, we introduce a simple adaptive learning-rate schedule by setting σ → 0.95σ (and µ i = 0) after n s consecutive rejected moves. We take the initial step size to be σ = σ 0 . We refer to this adaptive Monte Carlo algorithm as aMC, specified in Section III. The aMC parameter can be used to influence the rate of learning. In Fig. 3 we provide a simple illustration of this fact using the two-dimensional Rosenbrock function often used as a test function for optimization methods [36][37][38]. The Rosenbrock function has a global minimum at (x, y) = (1, 1) that is set within a long valley surrounded by steep slopes on either side. A particle on this function moving under pure gradient descent takes a long time to reach the global minimum because gradients within the valley are small: the particle will quickly reach the valley and then move slowly along the valley floor [36,37]. Placing a particle at the point (−2, 2), outside the valley, we evolve the position x = (x, y) of the particle using aMC. We used the initial step size σ 0 = 10 −3 and rescaling parameter n s = 100, and carried out three simulations for three values of . As shown in the figure, the larger is the more rapidly is the global minimum attained, with the difference between zero and nonzero epsilon being considerable. This form of adaptivity has an effect similar to that of momentum with fixed (17) depth (18) hidden (19) depth (20) x y (1) fixed (17) depth (18) hidden Adaptivity speeds convergence on a test function. We use aMC with three values of to simulate a particle moving on the Rosenbrock function f (x, y), Eq. (6). Panel (a) shows a 2000-step simulation for each of three values of ; the global minimum is indicated by the red dot at (1,1). Panel (b) shows the value of the function along each trajectory when continued to n = 10 4 steps. aMC hyperparameters: σ0 = 10 −3 , ns = 20. gradient descent [39].
Returning to neural-network simulation, the aMC parameter can speed the rate at which a neural network can learn high-frequency features, much as adaptive methods do with gradient-based algorithms. The extent to which high-frequency features should be learned varies by application. For instance, if a training set contains high-frequency noise then we may wish to attenuate an algorithm's ability to learn this noise in order to enhance its ability to generalize. This is the idea expressed in Fig.  2 of Ref. [40]. Empirical studies show that non-adaptive versions of gradient descent sometimes generalize better than their adaptive counterparts [41], in some cases because of the different abilities of these things to learn high-frequency features.
In Fig. 4 we consider a supervised-learning problem inspired by Fig. 2 of Ref. [40]. We use gradient-based methods (GD and Adam) and aMC to train a neural network to learn the function f 0 (θ) = ln(1 + 5θ) + 1 10 sin(20πθ) on θ ∈ [0, 1]. This function contains a low-frequency term, the logarithm, and a high-frequency term, the sine.
The neural network has one input neuron, which is fed the value θ, one output neuron, which returns f x (θ), and a single hidden layer of 100 neurons with tanh activations. The parameters x of the network were set initially to random values x i ∼ N (0, σ 2 0 ). In Fig. 4(a) we show the training loss, the meansquared difference U between f 0 (θ) and the neural-net output f x (θ), evaluated at 1000 evenly-spaced points fixed (11) depth (12) variable (13) depth (14) U Adam (18) over the interval. At left we show results produced using GD and Adam, and at right we show results produced using aMC for three values of , one positive ( = 5×10 −3 ), one negative ( = −5 × 10 −3 ), and zero. Of the gradientbased methods Adam trains faster than GD, while for aMC the training loss U decreases fastest for positive and slowest for negative . (We did not carry out a systematic search of learning rates in order to compare directly the gradient-based and Monte Carlo methods; our intent here is to illustrate how adaptivity matters within the two classes of algorithm.) In Fig. 4(b) we show the pseudo-loss U that expresses the mean-squared difference between the net function f x (θ) and the low-frequency logarithmic term of f 0 . The net is not trained to minimize U , but it so happens during training that U becomes small as the net first learns the low-frequency component of f 0 . Subsequently, U increases as the net also learns the high-frequency component of f 0 .
Of the gradient-based methods Adam learns highfrequency features more rapidly than GD. As it does so, the value of the pseudo-loss U increases. For aMC, the parameter controls the separation of timescales between Metropolis Monte Carlo and aMC can effectively sense the gradient [17], and so can train a neural network to similar levels of accuracy as gradient descent. However, Monte Carlo algorithms can also train a neural network when gradients become unreliable, such as when they vanish or explode.
Vanishing gradients can be overcome by the intrinsic stochasticity of the Monte Carlo method. In the absence of gradients, pure gradient-based methods receive no signal [42,43]. However, the Metropolis Monte Carlo procedure (1) is equivalent, for vanishing gradients, to the diffusive dynamicsẋ i = ξ i (n), where ξ is a Gaussian white noise with zero mean and variance ξ i (n)ξ j (n ) = σ 2 δ ij δ(n − n ). Thus Metropolis MC will, in the absence of gradients, enact diffusion in parameter space until nonvanishing gradients are encountered, at which point learning can resume.
Monte Carlo algorithms can also cope with exploding gradients. Moves are proposed without reference to the gradient, and so can be made on a landscape for which the gradient varies rapidly or is numerically large. Monte Carlo algorithms are also numerically stable, with moves that would induce large increases in loss being rejected but otherwise not harming the training process.
We show in this section that the ability to cope with vanishing and exploding gradients allows aMC to train recurrent neural networks that gradient-based methods cannot. Recurrent neural networks (RNNs) are designed to act on sequences, and have been applied to problems involving text, images, speech, and music [44][45][46][47]. An RNN possesses a vector known as its hidden state, which acts as a form of memory. This vector is updated each time the RNN views a position in a sequence. The size of the gradient scales in general exponentially with sequence length, and so when the sequence is long, and long-term dependencies exist, the gradient tends to vanish or explode. As a result, it can be difficult to train simple RNNs using first-order gradient-based methods in order to learn long-term dependencies in sequence data [9,[48][49][50][51]. One solution to this problem is the introduction of more elaborate and computationally expensive RNN architectures such as long short-term memory [42] and gated recurrent units [52,53]. These architectures can be more reliably trained by gradient-based methods than can simple RNNs.
Here we demonstrate an algorithmic solution to the problem rather than an architectural one, by showing that Monte Carlo methods can train an RNN in circumstances in which gradient-based methods cannot. We train a simple RNN with tanh activation functions to distinguish the digits of class 0 and 1 in the MNIST data set. We binarized the data, and used a vector of length 2 and a one-hot encoding to represent black or white pixels. The RNN was shown an unrolled version of each image, a sequence of 784 pixels. The RNN architecture is a two-layer stack in which the hidden state at each site depth 16 (6) depth 128 (7) depth 16 (6) depth 128 (7) t = 520 (9) Adam (10) or ↵ (11) depth 64 (12) depth 128 (13) j P (a) a (8) t = 520 (9) Adam (10) or ↵ (11) depth 64 (12) depth 128 (13) in the first layer is used as input for the second layer. The dimension of the hidden state is set to 64, and the final hidden state is sent into a linear classifier. The loss function is the cross entropy between the neural network's prediction and the ground truth.
In Fig. 5(a,b) we show the results of training the RNN using aMC and gradient-based optimizers. We stochastically subsample the training data set for each update ("mini-batching"), using 1500 samples at each step. We used gradient descent and the Adam optimizer, both with learning rate 10 −3 , with gradient clipping turned on (this has been shown to solve the exploding-gradient problem for some data sets [54]), and aMC with hyperparameters σ 0 = 10 −3 , n s = 100, = 0 (which is the Metropolis algorithm with an adaptive step-size scheduler). Gradient descent and Adam fail to train (a search over learning rates between the values 10 −5 and 1 did not result in lower loss values), while aMC trains to small values of loss and a test-set accuracy of 99.8%.
In Fig. 5(c) we show the size of the gradient associated with the models produced by aMC. aMC does not calculate or make use of the gradient, but we can nonetheless evaluate it for the models produced by the training process. Two distinct regimes can be seen, with the size of the gradient changing by about three orders of magnitude between them. The gradient-based algorithms cannot escape from the small-gradient regime, and when initiated from the large-gradient regime that is encountered by aMC, in which |∇U | U , the gradient-based integra-tors explode. Thus Monte Carlo can train productively in the face of two of the classic obstacles to training by gradient-based methods, small and large gradients. Simple RNNs are known be harder to train by gradient descent than more complex RNNs, but simple RNNs possess similar or greater capacity per parameter than more complex architectures [55]. Methods that can train simple RNNs may therefore allow more widespread use of those architectures.
E. For nets with heterogeneous structures or neuron activations, the weight-update scale must be made heterogenous For some architectures, particularly those with structural heterogeneity or heterogenous neuron activations, it is necessary to scale the Monte Carlo step-size parameter σ for each neural-net parameter individually. In this section we address this problem using deep neural networks for which, as in Section II D, gradients are too small for gradient-based methods to train.
Choosing a set of heterogeneous Monte Carlo step-size parameters can be done by adapting ideas used in the development of gradient-based methods [56]. Guided by that work we modify the proposal distribution of (4) to read where σ i = λ i σ. The λ i are parameters that are either set to unity (a condition we call "signal norm off") or according to Eq. (18) in Section III B ("signal norm on"). The parameters µ i and σ are adjusted as in Section II C. The parameters λ i , which are straightforwardly calculated during a forward pass through the net, ensure that the scale of signal change to each neuron is roughly constant. The intent of signal norm is similar to that of layer norm [57], except that the latter is an architectural solution -it entails a modification of the net, and is present at test time -while the former is an algorithmic solution and plays no role once the neural net has been trained.
In Fig. 6 we show the results of neural networks of depth d trained by aMC and by gradient-based methods to express a step function f 0 (θ) that is equal to 1/2 if 1/2 < θ < 3/4 and is zero otherwise. The neural nets have one input neuron, which is fed the value θ, and one output neuron, which returns f x (θ). They have 10 neurons in the penultimate hidden layer, and 4 neurons in each of the other d − 1 hidden layers, the intent being to allow very deep nets with relatively few neurons. All neurons have tanh activation functions.
In Fig. 6(a) we show loss U as a function of epoch n for four algorithms: GD (gray); Adam (blue); and aMC with signal norm off (green dotted) and on (green). For each algorithm we ran 20 independent simulations, 10 using Kaiming initialization [58] and 10 initialized with Gaussian random numbers x i ∼ N (0, σ 2 0 ), where σ 0 = 10 −2 . We plot the simulation having the smallest U after 10 6 epochs. As the depth of the network increases beyond 4 layers, GD and aMC with signal norm off stop learning on the timescales shown. Above 32 layers, Adam also stops learning on the timescale shown. (For depth 64 we tried a broad range of learning rates for GD and Adam, from 10 to 10 −6 , none of which was successful. We also varied the Adam hyperparameters β 1 , β 2 over a small range of values, without success. It may be that hyperparameters that enable training do exist, but we were not able to find them.) aMC with signal norm on continues to learn up to a depth of 128 (we also verified that aMC trains nets of depth 256), and so can successfully train deep nets in which gradient-based algorithms receive too little signal to train.
In Fig. 6(b) we show net outputs at n = 10 6 epochs for three algorithms and two depths. As discussed in Section II C, the adaptive algorithms Adam and aMC learn the sharp features of the target function more quickly than does GD. For the deeper net, the gradient-based algorithms GD and Adam do not receive sufficient signal to train.
In Fig. 6(c) we show 10 simulations for each of Adam and aMC for the deeper nets. The outcome of training is stochastic, resembling a nucleation dynamics with an induction time that increases with net depth. For some initial conditions both algorithms fail to train on the allotted timescale. In general, the rate of nucleation is higher for aMC than Adam, and remains measurable for all depths shown  Fig. 1(b), with the addition of a simulation done using aMC (green) with hyperparameters (σ0, , ns, s.n.) = (10 −2 , 0, 20, on). (b) The aMC result from panel (a), which uses batch learning, compared with aMC using conventional minibatching (blue) and progressive batching (cyan). We verified that introducing skip connections [59] or layer norm [57] to the neural nets enabled Adam to train at depth 64 (and enables aMC without signal norm to train at all depths). These architectural modifications allow gradient-based algorithms to train, but they are not required for Monte Carlo. This comparison illustrates the fact that design principles for nets trained by Monte Carlo methods differ from those trained by gradient-based methods.
F. Best practices for training neural nets using MC await development In this section we first revisit Fig. 1 using aMC, and we present data indicating that numerical best practices for training nets using MC may differ from those developed for gradient-based algorithms.
In Fig. 7(a) we reproduce Fig. 1(b) with the addition of an aMC simulation (green) in which we use aMC's adaptive step-size attenuation (with n s = 20) and signal norm. These features allow us to choose an initial stepsize parameter σ 0 = 10 −2 larger than the optimum value for Metropolis MC (see Fig. 1(a)). As a result, training proceeds faster than for MC, at a rate comparable to the GD result shown. Training-set loss and test-set accuracy at long times are similar for all three algorithms.
In Fig. 7(b) we compare the aMC result of panel (a), which uses batch learning (green), with two additional aMC results. The first (cyan) uses conventional minibatching with a minibatch size of 2000. Minibatch learning proceeds faster than batch learning, as happens with gradient-based methods, but achieves slightly larger values of training-set loss and test-set error than does batch learning. We speculate that this happens because of competing sources of stochasticity, that of the minibatch and that intrinsic to the MC algorithm. The second (cyan) uses progressive batching: training begins with a minibatch of size 500, which doubles every time the classification error rate on the minibatch falls below 10% (aMC moves are conditioned against minibatch loss, in the usual way, not minibatch classification error, but the latter is the trigger for the doubling of the minibatch size). After doubling the minibatch size the aMC algorithm is reset (σ → σ 0 and µ i → 0). Progressivebatch learning proceeds faster than batch learning, and reaches similar final values of training-set loss and testset accuracy. This comparison suggests that MC may respond differently than GD to procedures such as minibatch training; best practices for MC training of neural networks await development.
We note that the noise intrinsic to the Monte Carlo method provides a means of exploration even when the batch identity is kept fixed (noise also provides a way to effect change in the presence of vanishing gradients; see Section II E and Section II D). For the problem discussed in this section there is a variation of about 1% in values of test-set accuracy at 10 5 epochs for 30 independent MC trajectories propagated with the same set of hyperparameters. Such fluctuations could provide the basis for an additional form of importance sampling that identified and propagated the best-performing networks in a population.
Finally, we show in Fig. 8 the analog of Fig. 1(b) for ResNET-18. ResNET-18 is a large, deep neural network with ≈ 1.11 × 10 7 parameters [59] (we changed the number of input channels for the first layer in order to apply it to MNIST). We applied layer normalization [57], which homogenizes neuron inputs and removes (or reduces) the need for signal norm when using aMC. The gradient descent learning rate was set to 10 −3 and we applied gradient clipping for |∇U | > 1; the aMC hyperparameters were σ 0 = 10 −3 , n s = 100, and = 10 −3 . Two points are apparent from the plot: gradient descent trains faster than aMC, but aMC has similar ability to train in the long-time limit: both GD and aMC train to small loss and about 99.2% accuracy on the MNIST test set. Thus for large modern architectures such as ResNET-18 we find training with gradients faster than training by Monte Carlo, but the latter has similar capacity for learning, suggesting that it is a promising tool for training neural networks when gradients are unreliable: see Section II D and Section II E.

III. SUMMARY OF AMC
A. An adaptive version of the Metropolis algorithm for training neural networks In this section we summarize aMC, the adaptive Monte Carlo algorithm used in this paper. It is based on the Metropolis MC algorithm, modified to allow the moveproposal distribution to adapt in response to accepted and rejected moves. The Metropolis acceptance criterion is min(1, e −∆U/T ), where ∆U is the change of loss and T is temperature. For nonzero temperature the algorithm allows moves uphill in loss. We focus here on the limit of zero temperature, which allows no uphill moves in loss. This choice is motivated by the success of gradient-descent algorithms and the intuition in deep learning (suggested by the structure of high-dimensional Gaussian random surfaces) that at large loss most stationary points on the loss surface are saddle points that can be escaped by moving downhill [31,32].
aMC is specified by four hyperparameters: σ 0 , the initial move scale; , the rate at which the mean of the move-proposal distribution is modified; n s , the number of consecutive rejected moves allowed before rescaling the parameters of the move-proposal distribution; and by the choice of signal norm being on or off.
We introduce a counter n cr = 0 to record the number of consecutive rejected moves. We initialize the parameters (weights and biases) x = {x 1 , . . . , x i , . . . , x N } of the neural network (e.g. using Gaussian random numbers x i ∼ N (0, σ 2 0 )), and set the centers µ i of each parameter's move-proposal distribution to zero. aMC proceeds as follows.
of each neural-network parameter i, where σ i = λ i σ. Initially, µ i = 0 and σ = σ 0 , where σ 0 is the initial move scale. The parameters λ i are set either to unity ("signal norm off") or by Eq. (18) ("signal norm on"). Evaluate the loss U (x ) at the set of coordinates x resulting from the proposal (9). If U (x ) ≤ U (x) [60] then we accept the move and go to Step 3. Otherwise we reject the move and go to Step 4.
3. Accept move. Make the proposed coordinates x the current coordinates x. Set n cr = 0. For each neural-network parameter i, set using the values i calculated in (9). Return to Step 1.
4. Reject move. Retain the set of coordinates x recorded in Step 1. Set n cr → n cr + 1. If n cr = n s then set n cr = 0, σ → 0.95σ, and (for all i) µ i = 0. Return to Step 1.
The computational cost of one move is the cost to draw N Gaussian random numbers and to calculate the loss function twice (once for batch learning). The memory cost is the cost to hold two versions of the model in memory, and (if = 0 and signal norm is on) the values µ i and λ i for each neural-net parameter. Note that the algorithm requires calculation of the loss U (x) only, and not of gradients of the loss with respect to the net parameters.
We refer to this algorithm as aMC, for adaptive Monte Carlo (the term "adaptive Metropolis algorithm" has been used in a different context [61]). Standard zerotemperature Metropolis Monte Carlo is recovered in the limit = 0, n s = ∞, and λ i = 1.
B. Signal norm: enacting heterogenous weight updates in order to keep roughly constant the change of neuron inputs The proposal step (9) contains the parameter step size σ i = λ i σ. For some applications, particularly involving deep or heterogeneous networks, it is useful to choose the λ i in order to keep the scale of updates for each neuron approximately equal, following ideas applied to gradientbased methods [56]. We call this concept signal norm; when signal norm is off, all λ i = 1. When it is on, we proceed as follows.
Consider the class of neural networks for which the input to neuron j (its pre-activation) is where the sum runs over all weights x i feeding into neuron j; N j is the fan-in of j (the number of connections entering j); and S α i is the output of neuron i (the neuron that the weight x i connects to neuron j) given one particular evaluation α of the neural network. Under the proposal (9) the change of input to neuron j is approximately [62] We therefore have and where · is the expectation over the move-proposal distribution (9), and δ ik is the Kronecker delta. The expected approximate variance of the change of input to neuron j under the move (9) is therefore This quantity, averaged over all N data neural-net calls required to calculate the loss, is We can choose the values of the λ i in order to ensure that the right-hand side of (16) is always σ 2 . A simple way to do so is to set equal the λ i for all weights x i feeding neuron j, in which case If all neuron outputs S α i appearing in (17) vanish identically then the expression must be regularized; one option is to set λ i = 0 for weights feeding a neuron whose input neurons are zero for a given pass through the data. Recall that the sum α runs over the input data; the sum i → j runs over all neurons i whose connections feed j; N j is the fan-in of j (the number of connections entering j); and S α i is the output of neuron i given a particular evaluation α of the neural network. The values (17) can be calculated from the pass through the data immediately before the proposed move.
Under (17), weights on connections that feed into a neuron receiving many other connections will experience a smaller basic move scale than weights on connections that feed into a neuron receiving few connections. Similarly, weights on connections fed by active neurons will experience a smaller basic move scale than weights on connections fed by relatively inactive neurons.
Finally, if the parameter x i is a bias we choose λ i = 1.
To summarize, we consider two settings for the parameters λ i that set the step-size parameters σ i = λ i σ in the proposal (9). The first setting, "signal norm off" (used in Fig. 1, Fig. 2(a,b), Fig. 4, and Fig. 5), has λ i = 1 for all parameters x i .
The second setting, called "signal norm on" (used in Fig. 2(c,d), Fig. 6, and Fig. 7), has λ i = 1 if x i is a bias; Eq. (17) if x i is a weight into neuron j.

IV. CONCLUSIONS
We have examined the Metropolis Monte Carlo algorithm as a tool for training neural networks, and have introduced aMC, an adaptive variant of it. Monte Carlo methods are closely related to evolutionary algorithms, which are used to train neural networks [11][12][13]19], but the latter are usually applied to populations of neural networks; the MC algorithms we have considered here are applied to populations of size 1, just as gradient descent is. For sufficiently small moves the Metropolis algorithm is effectively gradient descent in the presence of white noise [17]. Thus on theoretical grounds the Metropolis algorithm should possess the ability to train a neural network to values of a loss function similar to those achieved by GD; this is indeed what we (and others [6][7][8]) have observed empirically, both for simple neural nets and for large, modern architectures. This correspondence does not guarantee similar training times, however, and we have found gradient-based methods to be faster in general, particularly for large and heterogenous neural nets.
aMC is an adaptive version of the Metropolis algorithm. The efficiency of aMC diminishes less quickly with decreasing loss and increasing net size than does the efficiency of the Metropolis algorithm, and aMC can train faster than Metropolis, much as adaptive gradient-based methods can train faster than pure gradient descent.
The Metropolis algorithm and aMC offer a complement to gradient-based methods in that they can sense the gradient when it exists but can work without it. In particular, aMC can train nets in which the gradient is too small (or too large) to allow gradient-based methods to train on the timescales simulated. We have shown here that aMC can train deep neural networks and recurrent neural networks that gradient descent cannot train. In both cases there exist modifications to those networks that can be trained by gradient-based methods, but aMC does not require those modifications. The design principles of neural nets optimal for Monte Carlo algorithms are largely unexplored but are likely distinct from those optimal for gradient-based methods, and having both sets of algorithms offers more choices for net design than having only one.
Finally, we note that while Metropolis and aMC have a fundamental connection to gradient-based methods in the limit of small step size, Monte Carlo algorithms more generally can enact large-scale nonlocal or collective changes that cannot be made by integrating gradientbased equations of motion [5,[63][64][65][66][67]. The analogy suggests that improved Monte Carlo algorithms for training neural networks await development.

V. CODE AVAILABILITY
Calculations using gradient descent and Adam were done in PyTorch [58]. Monte Carlo calculations were done in C and in PyTorch. A PyTorch implementation of the aMC optimizer (with signal norm off) and an example RNN optimization are available here [68].