Gibbs sampling the posterior of neural networks

In this paper, we study sampling from a posterior derived from a neural network. We propose a new probabilistic model in which noise is added at every pre- and post-activation in the network, and argue that the resulting posterior can be sampled with an efficient Gibbs sampler. For small models, the Gibbs sampler attains performance similar to state-of-the-art Markov chain Monte Carlo methods, such as Hamiltonian Monte Carlo or the Metropolis-adjusted Langevin algorithm, on both real and synthetic data. By framing our analysis in the teacher-student setting, we introduce a thermalization criterion that allows us to detect when an algorithm, run on data with synthetic labels, fails to sample from the posterior. The criterion is based on the fact that in the teacher-student setting we can initialize an algorithm directly at equilibrium.


I. INTRODUCTION
Neural networks are functions parametrized by the so-called weights, mapping inputs to outputs. Neural networks are commonly trained by seeking values of the weights that minimize a prescribed loss function. In some contexts, however, we want to sample from an associated probability distribution of the weights. Such sampling is at the basis of Bayesian deep learning [48, 52]. It is used in Bayesian uncertainty estimation [24, 29, 45] or to evaluate Bayes-optimal performance in toy models where the data-generative process is postulated [3]. In this paper, we focus on studying the algorithms and properties of such sampling.
Given training inputs X in Bayesian learning, one implicitly assumes the labels to be generated according to the stochastic process y ∼ P(y|X, W), where W are the weights of the network, on which a prior P(W|X) is placed. At its heart, Bayesian deep learning consists of sampling from the posterior probability of the parameters, P(W|X, y) ∝ P(y|W, X) P(W|X) (1), which follows from Bayes' theorem. This sampling problem is, in general, NP-hard [10], and many techniques have been developed to sample from (1). In this paper, we look at iterative algorithms that, in the large time limit, return samples from the posterior distribution (1). Most available algorithms for this task are based on MCMC methods. We focus on the two following questions: • Q1: Do we have a method to evaluate whether the algorithms have thermalized, i.e., whether the samples returned by the MCMC plausibly come from the posterior (1)?
• Q2: Which combinations of sampling algorithm and form of the posterior distribution achieve the best performance in terms of ability to thermalize while reaching a low test error?
The first question addresses the long-standing problem of estimating an MCMC's thermalization time, that is, the time at which the MCMC starts sampling well from the posterior. We propose a criterion for thermalization based on the teacher-student setting. The criterion can only be reliably applied to synthetic labels generated by a teacher network. After a comparison with other thermalization heuristics, we argue that the teacher-student criterion is more discriminative, in that it provides a higher lower bound on the thermalization time. The second question explores the interplay between the form of the posterior and the sampling algorithm: since there is more than one way of translating a network architecture into a probabilistic process, we exploit this freedom to introduce a generative process in which noise is added at every pre- and post-activation of the network. We then design a Gibbs sampler tailored to this posterior and compare it to other commonly used MCMCs.
A. Related literature

When running an MCMC one has to wait a certain number of iterations for the algorithm to start sampling from the desired probability measure. We will refer to this burn-in period as the thermalization time, or T_therm [39]. Samples drawn before T_therm should therefore be discarded. Estimating T_therm is thus of great practical importance, as it is crucial to know how long the MCMC should be run.
More formally, we initialize the MCMC at a starting state W_0 ∈ 𝒲 of our liking. We run the chain by iteratively sampling W(t) ∼ P(W(t)|W(t−1)), where P(·|·) is the transition kernel. If the kernel is ergodic and satisfies the stationarity relation π(W) = Σ_{W′∈𝒲} P(W|W′) π(W′) for a certain probability measure π(·), then for t → ∞ the MCMC will return samples from π(·). Thermalization is concerned with how soon the chain starts sampling approximately from π(·).
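As a concrete toy illustration of this abstract setup (not tied to the neural network posteriors studied here), the following sketch runs a random-walk Metropolis kernel whose stationary measure π is a standard Gaussian; the target, step size, and chain length are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random-walk Metropolis kernel whose stationary measure pi is N(0, 1).
# This only illustrates W(t) ~ P(.|W(t-1)); target, step size and chain
# length are arbitrary choices made for the example.
def kernel(w, step=0.5):
    prop = w + step * rng.standard_normal()
    log_accept = 0.5 * (w ** 2 - prop ** 2)    # log pi(prop) - log pi(w)
    return prop if np.log(rng.random()) < log_accept else w

w, chain = 10.0, []                            # far-from-equilibrium start
for t in range(20000):
    w = kernel(w)
    chain.append(w)

tail = np.array(chain[5000:])                  # discard burn-in samples
# after thermalization, the empirical moments approach those of pi
```

After the burn-in period, the empirical mean and variance of the retained samples approach 0 and 1, the moments of π.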
Consider an observable φ : 𝒲 → R. Let π_φ(·) be the distribution of φ(W) when W ∼ π(·). Define S_φ(δ) ⊆ R as the smallest set such that π_φ(S_φ(δ)) ≥ 1 − δ. When δ ≪ 1, S_φ(δ) is a high-probability set for φ(·). We can then look at thermalization from the point of view of φ. For a general initialization W_0 one will usually have φ(W_0) ∉ S_φ(δ), since in most initializations W_0 is unlikely to be a typical sample from π(·). As more samples are drawn, the measure sampled by the chain approaches π(·); we therefore expect that φ(W(t)) ∈ S_φ(δ) (up to a fraction δ of draws) for t greater than some time t_φ = t_φ(W_0). We call t_φ(W_0) the thermalization time of observable φ; notice that it depends both on the observable φ and on the initial condition. In fact, some observables may thermalize faster than others, and a good initialization can make the difference between a thermalization time that is exponentially long (in the dimension of W) and one that is zero (for example if W_0 is drawn from π(·)). In statistical physics it is common to say that the whole chain has thermalized when all observables that concentrate in the thermodynamic limit have thermalized [32, 39, 55]. This will be our definition of T_therm. Despite its theoretical appeal, this definition is inapplicable in practice: for most observables, computing π_φ(·) is extremely hard computationally.
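To make t_φ(W_0) concrete, here is a minimal sketch (our own illustration, not the paper's code) that, given the time series φ(W(t)) and an assumed high-probability interval standing in for S_φ(δ), returns the first time after which the series stays inside the set up to a fraction δ of the remaining draws.

```python
import numpy as np

# Sketch of the observable-wise thermalization time: first t such that
# phi(W(t')) lies in the interval [lo, hi] for all t' >= t, up to a
# fraction delta of draws.  The interval here stands in for S_phi(delta).
def therm_time(series, lo, hi, delta=0.05):
    series = np.asarray(series)
    inside = (series >= lo) & (series <= hi)
    for t in range(len(series)):
        if inside[t:].mean() >= 1.0 - delta:
            return t
    return None  # the observable never thermalized within this run

# toy time series: starts far from the set [-2, 2] and relaxes into it
series = np.concatenate([10.0 - 0.1 * np.arange(100), np.zeros(200)])
t_phi = therm_time(series, -2.0, 2.0)
```

On this toy series, `t_phi` falls slightly before the series actually enters the interval, because the criterion tolerates a fraction δ of excursions.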
Practitioners have instead resorted to a number of heuristics, which provide lower bounds on the thermalization time. These heuristics usually revolve around two ideas. First, there are methods involving multiple chains [5, 9, 15, 39]. In different flavours, all these criteria rely on comparing multiple chains with different initializations: once all the chains have thermalized, samples from different chains should be indistinguishable. Another approach consists of finding functions with known mean under the posterior and verifying whether the empirical mean is close to its predicted value [9, 13, 18, 19, 56]. The method we propose for detecting thermalization relies instead on the teacher-student framework [55].
Another field we connect with is that of Bayesian learning of neural networks; for an introduction see [17, 23, 30, 48] and references therein. We shall first examine the probabilistic models for Bayesian learning of neural networks and then review the algorithms that are commonly used to sample. In order to specify the posterior (1), one needs to pick the likelihood (or data generating process) P(y|X, W). The most common model, employed in the great majority of works [22, 37, 46, 50, 52], takes P(y|X, W) ∝ exp(−(1/∆) Σ_µ ℓ(y_µ, f(X_µ, W))), where f(·, W) is the neural network function, ℓ is the loss function, µ is the sample index, and ∆ is a temperature parameter. As an alternative, other works have introduced "stochastic feedforward networks" (SFNN), where noise is added at every layer's pre-activation [35, 40, 44, 54]. Outside of the Bayesian learning of neural networks literature, models where intermediate pre- or post-activations are added as dynamical variables have also been considered in the predictive coding literature [2, 33, 34, 51].
Once a probabilistic model has been chosen, the goal is to obtain samples from the corresponding posterior. A first solution consists of approximating the posterior with a simpler distribution that is easier to sample. This is the strategy followed by variational inference methods [25, 28, 29, 44, 47, 53]. Although variational inference yields fast algorithms, it is often based on uncontrolled approximations. Another category of approximate methods is that of "altered MCMCs", i.e., Monte Carlo algorithms that have been modified to be faster at the price of no longer sampling from the posterior [7, 26, 27, 36, 38, 57]. An example of these algorithms is the discretized Langevin dynamics [49]. Restricting the sampling to a subset of the parameters has also been considered in [43] as an alternative training technique.
Finally, we have exact sampling methods: iterative algorithms that, in the large time limit, are guaranteed to return samples from the posterior distribution. Algorithms for exact sampling mostly rely on MCMC methods. The most popular ones are HMC [12, 37], MALA [4, 31, 42] and the No-U-Turn sampler (NUTS) [21]. Within the field of Bayesian learning of neural networks, HMC is the most commonly used algorithm [22, 50, 52]. The proposed Gibbs sampler is inspired by the work of [1], and later [14, 20], which introduced the idea of augmenting the variable space in the context of logistic and multinomial regression.

II. TEACHER-STUDENT THERMALIZATION CRITERION
In this section we explain how to use the teacher-student setting to build a thermalization test for sampling algorithms. The test gives a lower bound on the thermalization time. We start by stating the main limitation of this approach: the criterion can only be applied to synthetic datasets. In other words, the training labels y must be generated by a teacher network, using the following procedure.
We first pick the training inputs arbitrarily and organize them into an n × d matrix X. Each row X_µ of the matrix is a different training sample, for a total of n samples. We then sample the teacher weights W⋆ from the prior P(W). Finally, we generate the noisy training labels as y_µ ∼ P(y|X_µ, W⋆). Our goal is to draw samples from the posterior P(W|D), where D = {(X_µ, y_µ)}_{µ∈[n]} indicates the training set. Suppose we want a lower bound on the thermalization time of an MCMC initialized at a particular configuration W_start. The method consists of running two parallel chains W_1(t) and W_2(t). For the first chain, we use an informed initialization, meaning we initialize the chain at the teacher weights, setting W_1(t = 0) = W⋆. For the second chain we set W_2(t = 0) = W_start. To determine convergence we consider a test function φ(W). We first run the informed initialization: after some time T_1, φ(W_1(t)) becomes stationary. Using samples collected after T_1 we compute the expected value of φ(·) (call it φ̄). Next, we run the second chain. The lower bound on the thermalization time of W_2(t) is the time at which φ(W_2(t)) becomes stationary and starts oscillating around φ̄. In practice, this time is determined by visually inspecting the time series of φ(·) under the two initializations, and observing when the two merge.
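The data-generation steps above can be sketched for an (assumed) toy linear likelihood y_µ ∼ N(⟨w⋆, X_µ⟩, noise); the model, sizes and noise level are illustrative, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Teacher-student data generation: fix X, draw teacher weights from the
# prior, then draw noisy labels from the likelihood.  Toy linear model.
n, d, noise = 100, 20, 0.01

X = rng.standard_normal((n, d))      # arbitrary training inputs
w_star = rng.standard_normal(d)      # teacher weights, drawn from the prior P(W)
y = X @ w_star + np.sqrt(noise) * rng.standard_normal(n)

# chain 1 uses the informed initialization W_1(0) = W*;
# chain 2 starts from some arbitrary W_start (here: zeros)
w1_init = w_star.copy()
w2_init = np.zeros(d)
```

Both chains would then be run with the same transition kernel, and the test function φ compared across the two time series.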
At first glance this method does not seem too different from [15] or [5], whose methods (described in Appendix A) rely on multiple chains with different initializations. There is however a crucial difference: under the informed initialization most observables are already thermalized at t = 0. To see this, recall that the pair (W⋆, D) was obtained by first sampling W⋆ from P(W) and then sampling D from P(D|W⋆). This implies that (W⋆, D) is a sample from the joint distribution P(W, D). Writing P(W|D) = P(W, D)/P(D), we see that W⋆ is also typical under the posterior distribution P(W|D). In conclusion, the power of the teacher-student setting lies in the fact that it gives us access to one sample from the posterior, namely W⋆. It then becomes easier to check whether a second chain is sampling from the posterior by comparing the value of an observable. In contrast, other methods comparing chains with different initializations have no guarantee that if the two chains "merge" then the MCMC is sampling from the posterior, since it is possible that both chains are trapped together far from equilibrium.

III. THE INTERMEDIATE NOISE MODEL
In this section, we introduce a new probabilistic model for Bayesian learning of neural networks. We start by reviewing the classical formulation. Let f(x, W) be the neural network function, with W its parameters and x ∈ R^d the input vector. Given a training set X ∈ R^{n×d}, y ∈ R^n, we aim to sample from

P(W|X, y) ∝ P(W) exp(−(1/∆) Σ_{µ=1}^n ℓ(y_µ, f(X_µ, W))),   (2)

where ℓ(·, ·) is the single-sample loss function and ∆ a temperature parameter. Notice that to derive (2) from (1), we supposed that P(W|X) = P(W), i.e., that W is independent of X. This is a common and widely adopted assumption in the Bayesian learning literature, and we shall make it in what follows. Most works in the field of Bayesian learning of neural networks attempt to sample from (2). This form of the posterior corresponds to the implicit assumption that the labels were generated as

y_µ ∼ P(y|X_µ, W) ∝ exp(−ℓ(y, f(X_µ, W))/∆),   (3)

where W are some weights sampled from the prior. We propose an alternative generative model based on the idea of introducing a small Gaussian noise at every pre- and post-activation in the network. The motivation behind this process lies in the fact that we are able to sample the resulting posterior efficiently using a Gibbs sampling scheme. Consider the case where f(·, W) is a multilayer perceptron with L layers, without biases and with activation function σ(·). Hence we have

f(x, W) = W^(L) σ(W^(L−1) σ(··· σ(W^(1) x) ···)).

Here W^(ℓ) ∈ R^{d_{ℓ+1}×d_ℓ} indicates the weights of layer ℓ ∈ [L], with d_ℓ the width of the layer. We define the pre-activations Z^(ℓ) ∈ R^{n×d_ℓ} and post-activations X^(ℓ) ∈ R^{n×d_ℓ} of layer ℓ. Using Bayes' theorem and applying the chain rule to the likelihood, we obtain

P(W, {Z^(ℓ)}, {X^(ℓ)} | X, y) ∝ P(W) Π_{ℓ=1}^{L} P(Z^(ℓ+1) | X^(ℓ), W^(ℓ)) P(X^(ℓ) | Z^(ℓ)),   (4)

with the convention X^(1) = X and the constraint Z^(L+1) = y. The conditional probabilities are assumed to be Gaussian,

P(Z^(ℓ+1) | X^(ℓ), W^(ℓ)) = N(Z^(ℓ+1); X^(ℓ) (W^(ℓ))^T, ∆_Z^(ℓ+1)),   P(X^(ℓ) | Z^(ℓ)) = N(X^(ℓ); σ(Z^(ℓ)), ∆_X^(ℓ)),   (5)

where the variances {∆_Z^(ℓ)}_{ℓ=2}^{L+1} and {∆_X^(ℓ)}_{ℓ=2}^{L} control the amount of noise added at each pre- and post-activation. This structure of the posterior implicitly assumes that the pre- and post-activations are iteratively generated as

Z^(ℓ+1) = X^(ℓ) (W^(ℓ))^T + ϵ_Z^(ℓ+1),   X^(ℓ) = σ(Z^(ℓ)) + ϵ_X^(ℓ),   (7)

where X^(1) = X ∈ R^{n×d} are the inputs, Z^(L+1) = y ∈ R^n represent the labels, and ϵ_Z^(ℓ), ϵ_X^(ℓ) are n × d_ℓ matrices of i.i.d. elements, respectively N(0, ∆_Z^(ℓ)) and N(0, ∆_X^(ℓ)). We will refer to (7) as the intermediate noise generative process. If we manage to sample from the posterior (4), which has been augmented with the variables {X^(ℓ)}_{ℓ=2}^L, {Z^(ℓ)}_{ℓ=2}^L, then we can draw samples from P(W|X, y) just by discarding the additional variables. A drawback of this posterior is that one has to keep in memory all the pre- and post-activations in addition to the weights.
We remark that the intermediate noise generative process admits the classical generative process (3) and the SFNN generative model as special cases. Setting all ∆s (and hence all ϵ) to zero in (7), except for the output noise ∆_Z^(L+1), recovers the classical generative process. Instead, setting ∆_X^(ℓ) = 0 for all ℓ, but keeping the noise in the pre-activations, gives the SFNN model.
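As an illustration, here is a minimal sketch of the process (7) for a ReLU MLP; the function and variable names are our own, and `dz`, `dx` stand for a single shared pre- and post-activation noise variance. Setting `dx = 0` recovers the SFNN special case, and setting both variances to zero gives the deterministic forward pass of the classical model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Intermediate noise generative process (7) for a ReLU MLP without biases:
# Gaussian noise of variance dz (dx) is added to every pre- (post-)
# activation.  The last pre-activation Z^(L+1) plays the role of the labels.
def intermediate_noise_forward(X, weights, dz, dx, rng):
    act = X                                    # X^(1) = inputs
    for ell, W in enumerate(weights):
        Z = act @ W.T                          # pre-activation of layer ell+1
        Z = Z + np.sqrt(dz) * rng.standard_normal(Z.shape)
        if ell < len(weights) - 1:             # no post-activation after the last layer
            act = np.maximum(Z, 0.0)           # ReLU post-activation
            act = act + np.sqrt(dx) * rng.standard_normal(act.shape)
    return Z                                   # Z^(L+1) = y

X = rng.standard_normal((50, 10))
weights = [rng.standard_normal((7, 10)), rng.standard_normal((1, 7))]
y = intermediate_noise_forward(X, weights, dz=1e-4, dx=1e-4, rng=rng)
```

With both variances set to zero the output coincides exactly with the noiseless forward pass f(X, W).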

Algorithm 1 Gibbs sampler for Multilayer perceptron
Input: training inputs X, training labels y, noise variances {∆_Z^(ℓ)}_{ℓ=2}^{L+1}, {∆_X^(ℓ)}_{ℓ=2}^{L}, prior inverse variances {λ^(ℓ)}_{ℓ=1}^{L}. At every sweep, each group of variables is resampled from its conditional distribution: the weights W^(ℓ) (see eq. (8)), the pre-activations Z^(ℓ) (see eq. (9)) and the post-activations X^(ℓ) (see eq. (10)).

Putting all ingredients together, we obtain the Gibbs sampling algorithm, whose pseudocode is reported in Algorithm 1. The main advantages of Gibbs sampling lie in the fact that it has no hyperparameters to tune and, moreover, it is a rejection-free sampling method. In the case of MCMCs, hyperparameters are defined to be all parameters that can be changed without affecting the probability measure that the MCMC asymptotically samples. The Gibbs sampler can also be parallelized across layers: a parallelized version of Algorithm 1 is presented in Appendix D. Finally, one can also extend this algorithm to more complex architectures: Appendices F and G contain respectively the update equations for biases and convolutional networks. We release an implementation of the Gibbs sampler at https://github.com/SPOC-group/gibbs-sampler-neural-networks
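Since the update equations (8)-(10) live in the paper's appendices and are not reproduced here, the following self-contained sketch instead runs the same blocked-Gibbs structure on a two-layer *linear* network with intermediate noise, where every conditional is exactly Gaussian. All names, sizes and the identity activation are illustrative simplifications, not the paper's sampler.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: z_mu = W1 x_mu + N(0, dz I),  y_mu = w2 . z_mu + N(0, dz).
# With identity activations, each blocked-Gibbs update below is an exact
# Gaussian conditional, mirroring the sweep structure of Algorithm 1.
n, d1, d2 = 40, 8, 3
dz, lam1, lam2 = 1e-2, 1.0, 1.0

X = rng.standard_normal((n, d1))
W1_star = rng.standard_normal((d2, d1)) / np.sqrt(lam1)   # teacher
w2_star = rng.standard_normal(d2) / np.sqrt(lam2)
Z_star = X @ W1_star.T + np.sqrt(dz) * rng.standard_normal((n, d2))
y = Z_star @ w2_star + np.sqrt(dz) * rng.standard_normal(n)

def sample_gaussian(prec, b, rng):
    """Draw one sample from N(prec^{-1} b, prec^{-1})."""
    L = np.linalg.cholesky(prec)
    return np.linalg.solve(prec, b) + np.linalg.solve(L.T, rng.standard_normal(len(b)))

W1 = rng.standard_normal((d2, d1))   # uninformed initialization
w2 = rng.standard_normal(d2)
Z = X @ W1.T

for sweep in range(200):
    # layer-1 weights | X, Z: row-wise Bayesian linear regression
    prec1 = lam1 * np.eye(d1) + X.T @ X / dz
    for j in range(d2):
        W1[j] = sample_gaussian(prec1, X.T @ Z[:, j] / dz, rng)
    # layer-2 weights | Z, y
    prec2 = lam2 * np.eye(d2) + Z.T @ Z / dz
    w2 = sample_gaussian(prec2, Z.T @ y / dz, rng)
    # pre-activations | W1, w2, X, y: one small Gaussian per training sample
    precz = (np.eye(d2) + np.outer(w2, w2)) / dz
    for m in range(n):
        Z[m] = sample_gaussian(precz, (W1 @ X[m] + w2 * y[m]) / dz, rng)
```

Every update is rejection-free and parameter-free, which is the structural point of the sweep; the nonlinear case replaces the Gaussian pre-/post-activation conditionals with those of eqs. (9)-(10).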

V. NUMERICAL RESULTS
In this section we present numerical experiments to support our claims. We publish the code to reproduce these experiments at https://github.com/SPOC-group/numerics-gibbs-sampling-neural-nets

A. Teacher student convergence method
In section II we proposed a thermalization criterion based on having access to an already thermalized initialization.
Here we show that it is more discriminative than other commonly used heuristics. We first briefly describe these heuristics.
• Stationarity. Thermalization implies stationarity: once the MCMC has thermalized, it samples from a fixed probability measure, so any observable, plotted as a function of time, should oscillate around a constant value. The converse (stationarity implies thermalization) is not true. Nevertheless, observing when a function becomes stationary gives a lower bound on T_therm.
• Score method [13]. Given a probability measure P(W), we exploit the fact that E_{W∼P}[∂ log P(W)/∂W] = 0. We then monitor the function ∂ log P(W)/∂W along the dynamics. The time at which it starts fluctuating around zero is another lower bound on T_therm.

Figure 1: In the legend, next to each method we write in parentheses the initialization (or pair of initializations) the method is applied to. The circles on the x axis represent the thermalization times estimated by each method. Left: We compare the predictions for the thermalization time of the zero-initialized MCMC. The red y scale on the right refers uniquely to the lines in red. All the other quantities should be read on the black y scale. Right: We compare the predictions for the thermalization time of two chains initialized independently at random. The pink y scale refers uniquely to the pink line. All other quantities should be read on the black logarithmic scale. The randomly initialized runs fail to thermalize and their test MSEs get stuck on a plateau. However, R, whose time series on the plateau is stationary and close to 1, fails to detect this lack of thermalization.
• R statistic [15]. Two (or more) MCMCs are run in parallel starting from different initializations. The within-chain variance is compared to the total variance obtained by merging samples from both chains; call the ratio of these variances R (a precise definition is given in Appendix A). If the MCMC has thermalized, the samples from the two chains should be indistinguishable, and thus R will be close to 1. The time at which R gets close to 1 provides yet another lower bound on the thermalization time.
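The comparison of within-chain and pooled variances can be sketched as follows. This is a Gelman-Rubin-style estimate for a scalar observable; the paper's exact definition of R is in its Appendix A and may differ in details.

```python
import numpy as np

# R statistic for a scalar observable: within-chain variance compared to
# a pooled variance estimate.  R ~ 1 when the chains are indistinguishable.
def r_stat(chains):
    chains = np.asarray(chains)               # shape (m chains, N samples)
    m, N = chains.shape
    W = chains.var(axis=1, ddof=1).mean()     # mean within-chain variance
    B = N * chains.mean(axis=1).var(ddof=1)   # between-chain variance
    var_hat = (N - 1) / N * W + B / N         # pooled variance estimate
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((2, 1000))                  # same distribution
split = np.stack([rng.standard_normal(1000),
                  3.0 + rng.standard_normal(1000)])     # different means
```

On `mixed`, R is very close to 1; on `split`, the between-chain term dominates and R is well above 1.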
We compare these methods in the case of a one-hidden-layer neural network, identical for the teacher and the student, with input dimension d_1 = 50, d_2 = 10 hidden units and a scalar output. This corresponds to the function f(x, W) = W^(2) σ(W^(1) x + b^(1)) + b^(2), where σ(x) = max(0, x) and W indicates the collection of all parameters: W^(1) ∈ R^{d_2×d_1}, W^(2) ∈ R^{1×d_2} and the biases. We specify the prior by choosing Gaussian distributions, with inverse variances λ^(ℓ), over the weights and over the coordinates of the bias vectors. Let n = 2084 be the size of the training set; we pick n to be four times the number of parameters in the network, anticipating that the training set contains enough information to learn the teacher. We start by generating the matrix of training inputs X ∈ R^{n×d_1} with i.i.d. standard Gaussian entries, then we sample the teacher's weights W⋆ from the Gaussian prior. For concreteness we set all ∆_Z^(ℓ) and ∆_X^(ℓ) to the same value ∆, and set ∆ = 10^−4. To generate the training labels y, we feed X, the teacher's weights W⋆ and ∆ into the generative process (7), adapted to also include the biases. For the test set, we first sample X_test with i.i.d. standard Gaussian entries. Both the test labels and the test predictions are generated in a noiseless way (i.e., just passing the inputs through the network). In this way, the test mean square error (MSE) takes the form MSE_test = (1/n_test) Σ_µ (f(X_µ^test, W⋆) − f(X_µ^test, W))². The full details about this experiment set are in Appendix B.
We run the Gibbs sampler on the intermediate noise posterior starting from three different initializations: informed, zero and random. Respectively, the student's variables are initialized to the teacher's counterparts, to zero, or are sampled from the prior. In this particular setting, the Gibbs sampler initialized at zero manages to thermalize, while the random initializations fail to do so. Two independent random initializations are shown, in order to be able to use the multiple-chains method. We first consider the statistic U, the score ∂ log P(W)/∂W^(1) rescaled by ∆ and averaged over the first-layer weights, where P indicates the posterior distribution. In the zero-initialized chain, U starts oscillating around zero already at t = 20. Then we consider the R statistic computed on the outputs of the two chains with zero and informed initializations. The criterion estimates that the zero-initialized chain has thermalized after t = 6 × 10^4, when R approaches 1 and becomes stationary. Next, we consider the teacher-student method, with the test MSE as the test function (φ in our previous discussion). According to this method, the MCMC thermalizes after the test MSE time series of the informed and zero-initialized chains merge, which happens around t = 10^5. Finally, the stationarity criterion, when applied to the test MSE or to R, gives a similar estimate for the thermalization time. The x-axis of the left plot provides a summary of this phenomenology, by placing a circle at the thermalization time estimated by each method. In summary, the teacher-student method is the most conservative, but the R statistic-based method is also reasonable here.
The right panel of figure 1 then shows a representative situation where thermalization is not reached, yet the R statistic-based method would indicate it is. In the right panel, two randomly initialized chains, denoted random 1 and random 2, are considered. Neither of these chains actually thermalizes: looking at the test MSE time series we see that both chains get stuck on the same plateau around MSE = 10^−3 and are unable to reach the MSE of the informed initialization. However, as soon as both chains reach the plateau, R quickly drops to a value close to 1 and thereafter becomes stationary, mistakenly signalling thermalization. This example exposes the problem at the heart of the multiple-chains method: it can be fooled if the chains find themselves close to each other but far from equilibrium. Similarly, since the chains become stationary after they hit the plateau, the stationarity criterion would incorrectly predict that they have thermalized. To conclude, we have shown an example where common thermalization heuristics fail to recognize that the MCMC has not thermalized, while the teacher-student method detects the lack of thermalization.

B. Gibbs sampler
In this section, we show that the combination of the intermediate noise posterior and the Gibbs sampler is effective in sampling from the posterior, by comparing it to HMC, run both on the classical and intermediate noise posteriors, and to MALA, run on the classical posterior. We provide the pseudocode for these algorithms in Appendix H. Notice that we do not compare the Gibbs sampler to variational inference methods or altered MCMCs, since these algorithms only sample from approximate versions of the posterior. For the first set of experiments, we use the same network architecture as in the previous section. The teacher weights W⋆, as well as X, X_test, are also sampled in the same way. The intermediate noise and the classical generative process prescribe different ways of generating the labels. However, to perform a fair comparison, we use the same dataset for all MCMCs and posteriors; thus we generate the training set in a noiseless way, i.e., setting y_µ = f(X_µ, W⋆). We generate 72 datasets according to this procedure, each time using independently sampled inputs and teacher's weights. The consequence of generating datasets in a noiseless way is that the noise level used to generate the data differs from the one in the MCMC, implying that the informed initialization will not exactly be a sample from the posterior. However, the noise is small enough that we did not observe any noticeable difference in the functioning of the teacher-student criterion.
First, we aim to characterize how often each algorithm thermalizes when started from an uninformed initialization. Uninformed means that the network's initialization is agnostic to the teacher's weights. For several values of ∆, and for all the 72 datasets, we run the four algorithms (Gibbs, classical HMC, intermediate HMC, classical MALA) starting from informed and uninformed initializations. More information about the initializations and hyperparameters of these experiments is contained in Appendix I.
The left panel of figure 2 depicts the proportion of the 72 datasets in which the uninformed initialization thermalizes within 5:30h of simulation. The x-axis is the equilibrium test MSE, i.e., the average test MSE reached by the informed initialization once it becomes stationary. When ∆, and thus the test MSE, decreases, the proportion of thermalized runs drops for all algorithms, with the Gibbs sampler attaining the highest proportion over most of the range. In the right panel, we plot the dynamics of the test error under each algorithm for a run where they all thermalize. For the same ∆s as in this plot (respectively ∆ = 10^−3 and 4.64 × 10^−4 for the classical and intermediate noise posteriors), we compute the average thermalization time among the runs that thermalize. Classical HMC, MALA, Gibbs, and intermediate HMC take on average around 130, 2700, 3200 and 12500 seconds to thermalize, respectively. This shows that classical HMC, when it thermalizes, is the fastest method, while MALA and Gibbs occupy the second and third positions, with similar times. However, classical HMC thermalizes about 20% less often than the Gibbs sampler. Therefore, in cases where it is essential to reach equilibrium, the Gibbs sampler represents the best choice.
We now move from the abstract setting of Gaussian data to more realistic inputs and architectures. As architectures we use a one-hidden-layer multilayer perceptron (MLP) with 12 hidden units and ReLU activations, and a simple convolutional neural network (CNN) with a convolutional layer, followed by average pooling, ReLU activations, and a fully connected layer. See Appendix J for a description of both models and of the experimental details. In this setting, we resort to the stationarity criterion to check for thermalization, since the teacher-student method is inapplicable. We compare the Gibbs sampler with HMC and MALA, both run on the classical posterior, picking MNIST as the dataset.
Figure 3 shows the test error as a function of time for the two architectures. We choose the algorithms' ∆s such that they all reach a comparable test error at stationarity. We then compare the time it takes each algorithm to reach this error. The results of the experiments are depicted in figure 3. For the MLP, all algorithms take approximately the same time to become stationary, around 500 s. In the CNN case, HMC and MALA reach stationarity in 100 s, compared to 800 s for Gibbs. We note however that for HMC and MALA to achieve these performances we had to carry out an extensive optimization over hyperparameters, so the overall speed is comparable.

VI. CONCLUSION
In this work, we introduced the intermediate noise posterior, a probabilistic model for Bayesian learning of neural networks, along with a novel Gibbs sampler to sample from this posterior. We compared the Gibbs sampler to MALA and HMC, varying also the form of the posterior. We found that HMC and MALA, both on the classical posterior, and Gibbs, on the intermediate noise posterior, each have their own merits and can be considered effective in sampling the high-dimensional posteriors arising from Bayesian learning of neural networks. For the small architectures considered, Gibbs compares favourably to the other algorithms in terms of the ability to thermalize; moreover, it requires no hyperparameter tuning, can be applied to non-differentiable posteriors, and can be parallelized across layers. The main drawback of the Gibbs sampler lies in the need to store and update all the pre- and post-activations, which slows down the algorithm, compared to HMC, in the case of larger architectures.
We further proposed the teacher-student thermalization criterion: a method to obtain stringent lower bounds on the thermalization time of an MCMC within a synthetic data setting. We first provided a simple theoretical argument to justify the method and subsequently compared it to other thermalization heuristics, finding that the teacher-student criterion consistently gives the highest lower bound on T_therm. We now look more in detail at the properties of the informed initialization. We saw in section II that W⋆ is a sample from the posterior P(W|D). What does this imply for the chain initialized at W⋆? First, any observable φ(·) that concentrates under the posterior and that does not depend explicitly on W⋆ (e.g. φ(W) = ||W||) will be thermalized already at t = 0. This implies that all these observables will be stationary from the very beginning of the MCMC simulation and will oscillate around their mean value under the posterior. The case where φ(·) depends explicitly on W⋆ is more delicate. We comment on it because the observable we use to determine thermalization throughout the whole paper is the test MSE, which explicitly depends on W⋆. In the following we explore the behavior of the test MSE; keep in mind, however, that most other observables dependent on W⋆ will exhibit similar behavior. The first peculiarity of the test MSE is that under the informed initialization it is not stationary, and it is not close to its expected value under the posterior: at initialization we always have test MSE = 0. As more samples are drawn, the MSE relaxes to equilibrium and starts oscillating around its expected value under the posterior. If the test MSE is not thermalized, what then is the advantage of using the informed initialization over an uninformed one? While the test MSE is not thermalized, most other observables are under the informed initialization. This means that the chain is started in a favorable region of the weight space. In practice, looking at the right panel of figure 2, one can compare the smooth convergence to stationarity of the informed initializations (transparent lines) to the irregular paths followed by the uninformed initializations (solid lines).

Numerical experiments
In the rest of this appendix, we report the details of the experiments presented in section V A. Recall that a synthetic dataset was generated using a teacher network with Gaussian weights, according to the intermediate noise generative process (7). The Gibbs sampler was then used to sample from the resulting posterior. All parameters of the Gibbs sampler (i.e., all the ∆s and λs) match those of the generative process.

Figure 4: Percentiles of R as a function of time. A line marked with the number k in the legend represents how the k-th percentile of R changes throughout the simulation. The data comes from the same simulation that was used for computing the average R in figure 1. The red dashed horizontal line is placed at a height of 1, the value that R should approach when the chains are close to each other. Left: percentiles of R, computed on two chains with respectively zero and informed initialization. Right: percentiles when computing R from two chains independently initialized at random.

The Gibbs sampler was run
on four chains: one with informed initialization, one initialized at zero, and two chains with independent random initializations. For the random initializations the weights of the student are sampled from the prior (with the same λs as the teacher), then the pre- and post-activations are computed using the intermediate noise generative process (7), with noises ϵ independent from those of the teacher. The zero-initialized chain plausibly thermalizes, while the randomly initialized ones do not. We briefly comment on how the R statistic and the score statistics were computed.

R statistic
In the notation of A, θ is given by {W^(1), b^(1), Z^(2), X^(2), W^(2), b^(2)}. Next we have to pick the (possibly vector-valued) function ψ(θ). One possible choice is to use the weights, e.g., ψ(θ) = W^(1). However, due to the permutation symmetry between the neurons in the hidden layer, this gives R ≫ 1 even when the MCMC has thermalized. Hence one must focus on quantities that are invariant under this symmetry. A natural choice is the student output on the test set. We pick ψ(θ) ∈ R^{n_test}, with ψ(θ) = f(X_test, W) and f as in (11). We record these vectors along the simulation at times evenly spaced by 100 MCMC steps. We split the samples into blocks of 50 consecutive measurements and compute the R statistic on each block (hence N = 50). Since the function ψ returns an n_test-dimensional output, R is also n_test-dimensional. In figure 1 we then decided to plot the average (over the test set) value of R: calling R^ν_τ the value of R computed on the τ-th block and the ν-th test sample, what we plot are the pairs (t_τ, (1/n_test) Σ_ν R^ν_τ), with t_τ being the average time within block τ. In principle the whole distribution of R is interesting. Figure 4 shows the evolution of the 25th, 50th, 75th and 95th percentiles of R. Recall that the two chains in the left panel thermalize, while those in the right panel are actually very far from equilibrium. Even though the distribution in the right panel is shifted further away from one than that in the left panel, the sudden drop from a higher value and the subsequent stationarity could still be interpreted as the chains having thermalized.
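The blocked R computation described above can be sketched as follows, assuming the standard Gelman-Rubin form of the statistic (the exact variant defined in A may differ in details); function names are ours.

```python
import numpy as np

def gelman_rubin(chains):
    """Gelman-Rubin R statistic for a scalar observable.
    chains: array of shape (m, N) -- m chains, N samples each."""
    m, N = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()      # within-chain variance
    B = N * chain_means.var(ddof=1)            # between-chain variance
    var_hat = (N - 1) / N * W + B / N          # pooled variance estimate
    return np.sqrt(var_hat / W)

def blocked_R(psi_a, psi_b, block=50):
    """psi_a, psi_b: (T, n_test) arrays of test-set outputs recorded along two
    chains. Returns one R value per block and per test sample; averaging the
    second axis gives the curve plotted in figure 1."""
    T = min(len(psi_a), len(psi_b))
    Rs = []
    for start in range(0, T - block + 1, block):
        blk = np.stack([psi_a[start:start + block],
                        psi_b[start:start + block]])   # (2, block, n_test)
        Rs.append([gelman_rubin(blk[:, :, nu]) for nu in range(psi_a.shape[1])])
    return np.array(Rs)   # shape (n_blocks, n_test)
```

When the two chains explore the same distribution, each block's R stays close to 1; a persistent offset between the chains inflates the between-chain variance and pushes R well above 1.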

Score method
If an MCMC has thermalized, then the gradient of the log posterior must have mean zero; hence the time series of each of its coordinates has to oscillate around zero. In figure 1 we plot U = (∆/(d_1 d_2)) Σ_{i=1}^{d_1} Σ_{α=1}^{d_2} ∂ log P/∂W^(1)_{αi}, where P indicates the posterior distribution; this is the score rescaled by ∆ and averaged over the first layer weights. We also tried taking the gradient with respect to the second layer weights W^(2), or with respect to Z^(2). The results do not change significantly, and the score methods keep severely underestimating the thermalization time.
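The rescaled, averaged score U can be computed with a one-liner once a gradient of the log posterior is available. The sketch below illustrates the diagnostic on a toy factorized Gaussian posterior (our choice, for illustration only): at equilibrium U fluctuates around zero, while far from equilibrium it does not.

```python
import numpy as np

def rescaled_score(W, grad_log_post, delta):
    """U = (delta / W.size) * sum over all entries of d log P / d W.
    At equilibrium E[U] = 0, so its time series should oscillate around zero."""
    return delta * grad_log_post(W).sum() / W.size
```

As the text notes, U reaching zero is only a necessary condition for thermalization, which is why this method tends to underestimate the thermalization time.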

Appendix C: Gibbs sampler derivation
In this appendix, we provide the details of the derivation of the Gibbs sampler algorithm in the case of an MLP without biases. In order, we will derive equations (8), (9) and (10).
For X^(ℓ) we see that the conditional distribution factorizes over samples µ ∈ [n].
In the derivation we used that P(X^(ℓ)|W^(ℓ), Z^(ℓ)) = P(X^(ℓ)|Z^(ℓ)). We see that the conditional distribution of X^(ℓ) is a multivariate Gaussian, hence it can be sampled efficiently on a computer.

Algorithm 2: Parallel Gibbs sampler for MLP. The keyword parallel indicates that all iterations of the loop can be executed in parallel.
Once again the conditional distribution of W^(ℓ) is a multivariate Gaussian. In the case of Z^(ℓ+1) the conditional factorizes both over samples and over coordinates. When Z^(ℓ+1)µ_α < 0 the activation term in (C8) does not appear; we denote by Z− the mass of this part of the distribution. We then look at the case Z^(ℓ+1)µ_α > 0, whose mass we denote by Z+, so that the normalization is Z = Z+ + Z−. The conditional is thus a mixture of two truncated Gaussians, from which one can sample in two steps: first draw a Bernoulli variable r ∼ Bernoulli(p−), with p− = Z−/Z the probability of the negative part. If r = 1, sample a negative truncated normal from the z < 0 distribution; if r = 0, sample a positive truncated normal from the z > 0 distribution.
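The two-step sampling procedure can be sketched as follows. The parameter names (the means m_±, variances v_±, and masses Z_±) stand in for the expressions derived in this appendix and are hypothetical; only the Bernoulli-then-truncated-normal structure is taken from the text.

```python
import numpy as np
from scipy.stats import truncnorm

def sample_Z(m_neg, v_neg, m_pos, v_pos, Z_neg, Z_pos, rng):
    """Sample the conditional of one coordinate of Z^(l+1): a mixture of a
    negative- and a positive-truncated Gaussian with masses Z_neg, Z_pos."""
    p_neg = Z_neg / (Z_neg + Z_pos)          # probability of the z < 0 branch
    s_neg, s_pos = np.sqrt(v_neg), np.sqrt(v_pos)
    if rng.random() < p_neg:                 # r ~ Bernoulli(p_neg), branch r = 1
        # truncnorm takes its bounds in standard-deviation units around loc
        return truncnorm.rvs(-np.inf, (0.0 - m_neg) / s_neg,
                             loc=m_neg, scale=s_neg, random_state=rng)
    return truncnorm.rvs((0.0 - m_pos) / s_pos, np.inf,
                         loc=m_pos, scale=s_pos, random_state=rng)
```

Sampling each coordinate independently like this is what makes the Z^(ℓ+1) update embarrassingly parallel across samples and coordinates.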

If Z^(ℓ+1)µ_α > 0 (resp. < 0) one ends up sampling from the positive (resp. negative) part of the corresponding Gaussian; the normalizations associated to the positive and negative parts are computed accordingly.

We now turn to the convolutional case, using the following index conventions:
• a, a′ ∈ [d_{ℓ+1}] are indices for positions inside layer ℓ + 1 (i.e., a specifies both the horizontal and vertical position within the layer);
• b, b′, c, c′ ∈ [d_ℓ] are indices for positions in layer ℓ;
• r, r′ ∈ [K_ℓ] are indices for the position within the filter (e.g., if the filter is 3 × 3 then K_ℓ = 9 and r runs over all the components of the filter);
• i = (β, r) and i′ = (β′, r′) are used to group pairs of indices.
Commas will be used to separate indices whenever there is ambiguity. The basic building block is the noisy convolutional layer, with weights W^(ℓ) ∈ R^{C_{ℓ+1}×C_ℓ×K_ℓ} and i.i.d. Gaussian noises ϵ. We also indicate with ν_a(r) the position of the r-th coordinate of the filter inside the input layer, when the output is in position a; in other words ν_a : [K_ℓ] → [d_ℓ]. First we notice that P(Z^(ℓ+1)|W^(ℓ), X^(ℓ), X^(ℓ+1)) is basically unaffected by the structure of the weights, so we concentrate on computing P(W^(ℓ)|Z^(ℓ+1), X^(ℓ)) and P(X^(ℓ)|Z^(ℓ+1), Z^(ℓ), W^(ℓ)). Let us begin with P(W^(ℓ)|Z^(ℓ+1), X^(ℓ)). Computing its mean and covariance requires grouping the two indices r, β into a single index and then inverting the matrix Ã of the quadratic form in W: in words, we pack the indices of Ã, invert it, and then unpack the indices; the mean m_W is obtained correspondingly. We now look at P(X^(ℓ)|Z^(ℓ+1), Z^(ℓ), W^(ℓ)). In the computation we use the identity Σ_c δ_{c,ν_a(b)} = 1, and as in the previous case we let A be the matrix of the quadratic form. In the last passage ν_a([K_ℓ]) indicates the image of the whole filter through ν_a; this incorporates the constraint that a must be such that c, c′ lie in the same receptive field of the input (if such an a exists at all). As we did previously for W, we group the indices using i = (β, c) and i′ = (β′, c′).

Practical implementation
So far we packed the spatial coordinates (i.e., the x and y coordinates within an image) into a single index, which allowed for more agile computations. We now unpack the indices and translate the results we obtained. In a practical case X^(ℓ) carries separate height and width indices, and similarly for the weights W^(ℓ). Let s_y, s_x be respectively the strides along the y and x axes; we do not use any padding. The height and width of Z^(ℓ+1) then follow from the usual no-padding convolution formula. Given a = (a_y, a_x) the position inside layer ℓ + 1, one can write ν_a(r) = ν_a(r_y, r_x) = (r_y + s_y a_y, r_x + s_x a_x), yielding the corresponding expression for the forward pass. In ϵ we removed all the indices, for notational simplicity.
Recall we had to compute the matrix Ã when sampling W; the expression for Ã, and in turn for m_W, translates accordingly to the unpacked notation. We now move to sampling X^(ℓ)µ. One can see that when the filter W^(ℓ) has dimensions (height and width) much smaller than those of X^(ℓ), then Ã will have few nonzero elements. In fact, for Ã_{βc_yc_x,β′c′_yc′_x} to be nonzero, one must have that c, c′ are close enough to be contained in the same filter window. This implies that all pairs of pixels c, c′ with |c_x − c′_x| > W^(ℓ)_W or |c_y − c′_y| > H^(ℓ)_W will have Ã_{βc_yc_x,β′c′_yc′_x} = 0. Hence Ã will be a sparse tensor. The same is somewhat true for the covariance, which is the inverse of A.
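The index map ν_a, the no-padding output size, and the sparsity condition on Ã can be checked with a small sketch; the function names and argument conventions are ours, for illustration.

```python
def nu(a, r, stride):
    """Position nu_a(r) in the input layer of the r-th filter coordinate,
    when the output pixel is a = (a_y, a_x) (no padding)."""
    (ay, ax), (ry, rx), (sy, sx) = a, r, stride
    return (ry + sy * ay, rx + sx * ax)

def out_size(in_size, filt, stride):
    """Height/width of Z^(l+1) for a stride-s, no-padding convolution
    (the standard formula, assumed consistent with the text)."""
    return (in_size - filt) // stride + 1

def share_filter(c, cp, filt, stride, out_hw):
    """True iff input pixels c, c' fall inside the same filter window for
    some output position a -- the condition under which A~ can be nonzero."""
    for ay in range(out_hw[0]):
        for ax in range(out_hw[1]):
            window = {nu((ay, ax), (ry, rx), stride)
                      for ry in range(filt) for rx in range(filt)}
            if c in window and cp in window:
                return True
    return False
```

Brute-forcing `share_filter` over all pixel pairs reproduces the sparsity pattern of Ã: pairs farther apart than the filter extent can never co-occur in a window.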

Average Pooling
Here we look at how to incorporate pooling. We focus on average pooling, which is easier since it is a linear transformation. For each pixel b ∈ [d_ℓ], let P_ℓ(b) be the "pooled pixel" in layer ℓ + 1 to which b gets mapped. Hence we have P : [d_ℓ] → [d_{ℓ+1}], a surjective function. Given a pixel a in layer ℓ + 1, it will have multiple preimages through P; we denote the set of preimages as P^{-1}(a), which can therefore be seen as the receptive field of pixel a.

We now describe how the MLP is cast as an intermediate noise model. To do so, the additional variables Z^(2) ∈ R^{n×12}, X^(2) ∈ R^{n×12}, Z^(3) ∈ R^{n×10} are introduced, with n = 6 × 10^4 in the case of MNIST. A noise with variance ∆ is put on Z^(2) and X^(2). Notice there is no noise between Z^(3) and the labels y; hence the posterior has the hard constraint y^µ = arg max_{α∈{0,1,...,9}} Z^(3)µ_α. The priors on the parameters are given by λ^(1)_b = 12. On the intermediate noise posterior we ran experiments with ∆ = 2, and all variables were set to zero at the initial condition.
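The average-pooling map described above (each output pixel a averaging its receptive field P^{-1}(a)) can be sketched as a linear operation on the image; this is a minimal sketch assuming non-overlapping k × k windows.

```python
import numpy as np

def avg_pool(x, k):
    """Average pooling with non-overlapping k x k windows: each output pixel
    a is the mean of its receptive field P^{-1}(a). Linear in x."""
    H, W = x.shape
    assert H % k == 0 and W % k == 0, "sketch assumes evenly divisible sizes"
    # group each k x k block into its own pair of axes, then average them
    return x.reshape(H // k, k, W // k, k).mean(axis=(1, 3))
```

Because the map is linear, it folds into the Gaussian conditionals of the Gibbs sampler in the same way as the convolution does.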
In the case of the classical posterior, we ran experiments with ∆ = 2 using MALA and HMC. The value of ∆ was picked so that the test error at stationarity is the same as in the intermediate noise posterior. The optimal parameters of HMC are a learning rate of 10^-3 and 200 leapfrog steps, while for MALA the optimal learning rate is 2 × 10^-6. For HMC and MALA the variables were initialized as i.i.d. Gaussians with respective standard deviations 10^-1 and 10^-4.
Regarding the CNN, we provide a schematic representation of the architecture in figure 5. In the intermediate noise model, a noise is added after the convolution, after the average pooling, after the ReLU, and after the fully connected layer. All noises are i.i.d. Gaussians with variance ∆, as prescribed by the intermediate noise model. To complete the description, we specify the prior by setting the λs. In the intermediate noise posterior, we run the Gibbs sampler with ∆ = 100, and initialize all variables to zero. For the classical posterior, we run MALA and HMC on the CNN architecture with ∆ = 10; this value of ∆ leads to approximately the same test error as in the intermediate noise posterior. For HMC we use a learning rate of 10^-3 and 50 leapfrog steps, while for MALA we choose a learning rate of 5 × 10^-6.
Both for the MLP and the CNN, in the case of the classical posterior we have to specify a loss function. To do so, we replace the argmax in the last layer by a softmax and apply a cross-entropy loss on top of it. Calling Q ∈ R^{n×10} the output of the softmax, the loss function is ℓ(y^µ, Q^µ) = −log Q^µ_{y^µ}. The argmax is, however, still used when making predictions, for example when evaluating the model on the test set. All experiments were run on one NVIDIA V100 PCIe 32 GB GPU and one core of a Xeon Gold CPU running at 2.1 GHz.
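The loss just described (softmax plus cross-entropy for the classical posterior, argmax for prediction) can be sketched as:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def loss(y, Z3):
    """Cross-entropy on the softmax of the last pre-activation Z^(3):
    l(y^mu, Q^mu) = -log Q^mu_{y^mu}."""
    Q = softmax(Z3)
    return -np.log(Q[np.arange(len(y)), y])

def predict(Z3):
    """Predictions still use the argmax over the 10 classes."""
    return Z3.argmax(axis=-1)
```

The softmax only replaces the argmax inside the posterior density; at evaluation time the hard argmax rule is recovered.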

Figure 1 :
Figure 1: Comparison of different thermalization measures. In the legend, next to each method we write in parentheses the initialization (or pair of initializations) the method is applied to. The circles on the x axis represent the thermalization times estimated by each method. Left: we compare the predictions for the thermalization time of the zero-initialized MCMC. The red y scale on the right refers only to the lines in red; all other quantities should be read on the black y scale. Right: we compare the predictions for the thermalization time of two chains initialized independently at random. The pink y scale refers only to the pink line; all other quantities should be read on the black logarithmic scale. The randomly initialized runs fail to thermalize and their test MSEs get stuck on a plateau. However, R, whose time series on the plateau is stationary and close to 1, fails to detect this lack of thermalization.

Figure 1
Figure 1 illustrates a representative result of these experiments. In the left panel, we aim to find the highest lower bound on the thermalization time of the zero-initialized chain. For the score method we plot U = (∆/(d_1 d_2)) Σ_{i=1}^{d_1} Σ_{α=1}^{d_2} ∂ log P/∂W^(1)_{αi}.

Figure 2 :
Figure 2: Thermalization experiments on synthetic data. Left: proportion of the 72 runs that thermalize, plotted against the equilibrium test MSE. Right: example of the dynamics of the test MSE in a particular run where all four algorithms thermalize. In order to get a similar equilibrium test MSE in the classical and intermediate noise posteriors, we pick respectively ∆ = 10^-3 and ∆ = 4.64 × 10^-4. The transparent lines represent the informed initializations.

Figure 3 :
Figure 3: Gibbs sampler on the intermediate noise posterior, and HMC and MALA on the classical posterior, compared on MNIST. Left: MLP with one hidden layer of 12 units. Right: CNN.

Figure 5 :
Figure 5: CNN architecture used in the experiments of section V B. The convolutional layer is composed of the filter W^(1) with shape 2 × 1 × 4 × 4 and an output-channel bias b^(1) ∈ R^2. The final layer has weights W^(2) ∈ R^{72×10} and bias b^(2) ∈ R^{10}.