

Machine Learning 2019

The Journal of Statistical Mechanics: Theory and Experiment (JSTAT) has decided to launch a new initiative in the field of Machine Learning and Artificial Intelligence, a multidisciplinary field with rapidly growing activity that in recent years has drawn many physicists to study its basic conceptual challenges as well as its applications.

JSTAT wishes to contribute to the development of this field on the side of statistical physics by publishing a series of yearly special issues, of which this is the first volume.

The format of these special issues takes into account the status of the machine learning field, where many of the most important papers are published in proceedings of conferences and are often overlooked by the physics community.

Our first special issue on machine learning therefore includes selected papers recently published in the proceedings of some major conferences. The authors of the selected papers were invited to submit, where appropriate, an augmented version of their conference paper, with supplementary material that makes it more suitable for our journal readership.

The present selection has been made by a committee consisting of the following JSTAT editors: Riccardo Zecchina (chair), Yoshiyuki Kabashima, Bert Kappen, Florent Krzakala and Manfred Opper.

Future special issues will include both journal versions of proceedings papers and original submissions of manuscripts on subjects lying at the interface between Machine Learning and Statistical Physics.

With this initiative, JSTAT aims to bring the conceptual and methodological tools of statistical physics to bear on an emerging field that is becoming of fundamental importance across most areas of science.

The editorial committee: Marc Mezard (JSTAT Chief Scientific Director), Riccardo Zecchina (JSTAT editor and chair), Yoshiyuki Kabashima, Bert Kappen, Florent Krzakala and Manfred Opper.

Papers

Tightening bounds for variational inference by revisiting perturbation theory

Robert Bamler et al J. Stat. Mech. (2019) 124004

Variational inference has become one of the most widely used methods in latent variable modeling. In its basic form, variational inference employs a fully factorized variational distribution and minimizes its Kullback–Leibler divergence to the posterior. As the minimization can only be carried out approximately, this approximation induces a bias. In this paper, we revisit perturbation theory as a powerful way of improving the variational approximation. Perturbation theory relies on a form of Taylor expansion of the log marginal likelihood, roughly in terms of the log ratio of the true posterior and its variational approximation. While first-order terms give the classical variational bound, higher-order terms yield corrections that tighten it. However, traditional perturbation theory does not provide a lower bound, making it unsuitable for stochastic optimization. Here, we derive an alternative way of constructing corrections to the evidence lower bound that resemble perturbation theory but result in a valid bound. We show in experiments on Gaussian processes and variational autoencoders that the new bounds are more mass-covering, and that the resulting posterior covariances are closer to the true posterior and lead to higher likelihoods on held-out data.
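
For orientation, the perturbative setup can be sketched in generic notation (my own illustrative formulation, not necessarily the authors' exact one): writing V(z) = log p(x, z) − log q(z), the log marginal likelihood becomes a log expectation that can be expanded order by order.

% Generic perturbative view of the evidence (notation assumed for illustration).
% V(z) = \log p(x,z) - \log q(z) is the log ratio that gets expanded.
\log p(x) \;=\; \log \mathbb{E}_{q(z)}\!\left[ e^{V(z)} \right]
          \;\geq\; \mathbb{E}_{q(z)}\!\left[ V(z) \right] \qquad \text{(the ELBO, first order)}.
% Higher-order terms in the expansion of e^{V}, e.g. ones involving
% \mathbb{E}_q[V^2], tighten the estimate but need not preserve the bound,
% which is the difficulty the paper addresses.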

Nonlinear random matrix theory for deep learning

Jeffrey Pennington and Pratik Worah J. Stat. Mech. (2019) 124005

Neural network configurations with random weights play an important role in the analysis of deep learning. They define the initial loss landscape and are closely related to kernel and random feature methods. Despite the fact that these networks are built out of random matrices, the vast and powerful machinery of random matrix theory has so far found limited success in studying them. A main obstacle in this direction is that neural networks are nonlinear, which prevents the straightforward utilization of many of the existing mathematical results. In this work, we open the door for direct applications of random matrix theory to deep learning by demonstrating that the pointwise nonlinearities typically applied in neural networks can be incorporated into a standard method of proof in random matrix theory known as the moments method. The test case for our study is the Gram matrix FF^T, with F = f(WX), where W is a random weight matrix, X is a random data matrix, and f is a pointwise nonlinear activation function. We derive an explicit representation for the trace of the resolvent of this matrix, which defines its limiting spectral distribution. We apply these results to the computation of the asymptotic performance of single-layer random feature networks on a memorization task and to the analysis of the eigenvalues of the data covariance matrix as it propagates through a neural network. As a byproduct of our analysis, we identify an intriguing new class of activation functions with favorable properties.
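
To make the object concrete, here is a small numerical sketch of the Gram matrix and its spectrum (dimensions, the 1/m normalization, and the choice f = tanh are my illustrative assumptions):

# Sketch: empirical spectrum of the nonlinear Gram matrix F F^T / m, F = f(W X).
# Illustrative only; dimensions, scaling, and f = tanh are assumptions.
import numpy as np

n0, n1, m = 1000, 1000, 1500          # data dim, width, number of samples
rng = np.random.default_rng(0)
W = rng.normal(0.0, 1.0 / np.sqrt(n0), size=(n1, n0))   # random weight matrix
X = rng.normal(0.0, 1.0, size=(n0, m))                  # random data matrix
F = np.tanh(W @ X)                                      # pointwise nonlinearity
M = (F @ F.T) / m                                       # Gram matrix
eigs = np.linalg.eigvalsh(M)
# A histogram of `eigs` approximates the limiting spectral distribution that the
# paper characterizes via the trace of the resolvent.
print(eigs.min(), eigs.max())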

Streamlining variational inference for constraint satisfaction problems

Aditya Grover et al J. Stat. Mech. (2019) 124006

Several algorithms for solving constraint satisfaction problems are based on survey propagation, a variational inference scheme used to obtain approximate marginal probability estimates for variable assignments. These marginals correspond to how frequently each variable is set to true among satisfying assignments, and are used to inform branching decisions during search; however, marginal estimates obtained via survey propagation are approximate and can be self-contradictory. We introduce a more general branching strategy based on streamlining constraints, which sidestep hard assignments to variables. We show that streamlined solvers consistently outperform decimation-based solvers on random k-SAT instances for several problem sizes, shrinking the gap between empirical performance and the theoretical limits of satisfiability.
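
To illustrate the branching idea, here is a hypothetical sketch (function names and marginal values are mine; in practice the marginals come from survey propagation): decimation permanently fixes the most polarized variable, while streamlining only adds a disjunction over the top candidates, leaving assignments soft.

# Hypothetical sketch of decimation vs. streamlining branching, given
# approximate marginals p[i] = Pr(x_i = True).
def decimation_step(marginals):
    # Fix the single most polarized variable to its likelier value.
    i = max(marginals, key=lambda v: abs(marginals[v] - 0.5))
    return [(i, marginals[i] > 0.5)]           # hard assignment x_i := value

def streamlining_step(marginals):
    # Instead of a hard assignment, add a disjunctive constraint over the two
    # most polarized variables: (l_i OR l_j), each literal taking the
    # variable's likelier polarity.  This prunes the search space less
    # aggressively than a hard assignment.
    ranked = sorted(marginals, key=lambda v: abs(marginals[v] - 0.5), reverse=True)
    i, j = ranked[0], ranked[1]
    return [("or", (i, marginals[i] > 0.5), (j, marginals[j] > 0.5))]

p = {1: 0.92, 2: 0.15, 3: 0.55}
print(decimation_step(p))     # [(1, True)]
print(streamlining_step(p))   # [('or', (1, True), (2, False))]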

Mean-field theory of graph neural networks in graph partitioning

Tatsuro Kawamoto et al J. Stat. Mech. (2019) 124007

A theoretical performance analysis of the graph neural network (GNN) is presented. For classification tasks, the neural network approach has the advantage, in terms of flexibility, that it can be employed in a data-driven manner, whereas Bayesian inference requires the assumption of a specific model. A fundamental question is then whether GNNs achieve high accuracy in addition to this flexibility. Moreover, whether the achieved performance is predominantly a result of backpropagation or of the architecture itself is a matter of considerable interest. To gain better insight into these questions, a mean-field theory of a minimal GNN architecture is developed for the graph partitioning problem, and it shows good agreement with numerical experiments.
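
As a point of reference, a minimal GNN of the kind such analyses consider can be sketched as follows (my own illustrative construction with random, untrained weights; not the paper's exact architecture):

# Minimal GNN sketch for 2-way graph partitioning (illustrative assumptions).
import numpy as np

def gnn_layer(A, H, W):
    # Aggregate neighbor states through the adjacency matrix, then apply a
    # pointwise nonlinearity.
    return np.tanh(A @ H @ W)

rng = np.random.default_rng(1)
n, d, layers = 200, 16, 10
A = (rng.random((n, n)) < 0.05).astype(float)
A = np.triu(A, 1); A = A + A.T                  # symmetric adjacency, no self-loops
H = rng.normal(size=(n, d))                     # random initial node states
for _ in range(layers):
    W = rng.normal(0.0, 1.0 / np.sqrt(d), size=(d, d))  # random (untrained) weights
    H = gnn_layer(A, H, W)
labels = (H[:, 0] > 0).astype(int)              # read out a 2-way partition
print(np.bincount(labels))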

Adaptive path-integral autoencoder: representation learning and planning for dynamical systems

Jung-Su Ha et al J. Stat. Mech. (2019) 124008

We present a representation learning algorithm that learns a low-dimensional latent dynamical system from high-dimensional sequential raw data, e.g. video. The framework builds upon recent advances in amortized inference methods that use both an inference network and a refinement procedure to output samples from a variational distribution given an observation sequence, and takes advantage of the duality between control and inference to approximately solve the intractable inference problem using the path integral control approach. The learned dynamical model can be used to predict and plan future states; we also present an efficient planning method that exploits the learned low-dimensional latent dynamics. Numerical experiments show that the proposed path-integral-control-based variational inference method leads to tighter lower bounds in statistical model learning of sequential data. The supplementary video (https://youtu.be/xCp35crUoLQ) and the implementation code (https://github.com/yjparkLiCS/18-NIPS-APIAE) are available online.
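
For context, the quantity such methods tighten is the sequential evidence lower bound, written here in generic notation (the path-integral formulation changes how samples from q are produced, not the form of the bound):

% Generic ELBO for a latent dynamical system (notation assumed for illustration):
% observations x_{1:T}, latent trajectory z_{1:T}.
\log p(x_{1:T}) \;\geq\;
  \mathbb{E}_{q(z_{1:T} \mid x_{1:T})}\!\left[ \log p(x_{1:T} \mid z_{1:T}) \right]
  \;-\; \mathrm{KL}\!\left( q(z_{1:T} \mid x_{1:T}) \,\|\, p(z_{1:T}) \right).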

Deep learning for physical processes: incorporating prior scientific knowledge

Emmanuel de Bézenac et al J. Stat. Mech. (2019) 124009

We consider the use of deep learning methods for modeling complex phenomena like those occurring in natural physical processes. With the large amount of data gathered on these phenomena, the data-intensive paradigm could begin to challenge more traditional approaches elaborated over the years in fields such as mathematics or physics. However, despite considerable successes in a variety of application domains, the machine learning field is not yet ready to handle the level of complexity required by such problems. Using an example application, namely sea surface temperature prediction, we show how general background knowledge gained from physics could be used as a guideline for designing efficient deep learning models. In order to motivate the approach and to assess its generality, we demonstrate a formal link between the solution of a class of differential equations underlying a large family of physical phenomena and the proposed model. Experiments and comparisons with a series of baselines, including a state-of-the-art numerical approach, are then provided.
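
For orientation, the class of differential equations in question is of advection-diffusion type, which transports a quantity by a flow while it diffuses (generic form; the symbols are my notation):

% Advection-diffusion equation (generic form, stated for illustration):
% I(x, t) is the transported quantity (e.g. temperature), w the motion field,
% D the diffusion coefficient.
\frac{\partial I}{\partial t} \;+\; (w \cdot \nabla) I \;=\; D \,\nabla^{2} I .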

Objective and efficient inference for couplings in neuronal network

Yu Terada et al J. Stat. Mech. (2019) 124010

Inferring directional couplings from network spike data is desired in various scientific fields, such as neuroscience. Here, we apply a recently proposed objective procedure to spike data obtained from Hodgkin–Huxley-type models and from in vitro neuronal networks cultured in a circular structure. As a result, we succeed in reconstructing synaptic connections accurately from the evoked activity as well as from the spontaneous activity. To obtain these results, we derive an analytic formula that approximately implements a method of screening relevant couplings. This significantly reduces the computational cost of the screening method employed in the proposed objective procedure, making it possible to treat large systems as in this study.

The scaling limit of high-dimensional online independent component analysis

Chuang Wang and Yue M Lu J. Stat. Mech. (2019) 124011

We analyze the dynamics of an online algorithm for independent component analysis in the high-dimensional scaling limit. As the ambient dimension tends to infinity, and with proper time scaling, we show that the time-varying joint empirical measure of the target feature vector and the estimates provided by the algorithm will converge weakly to a deterministic measure-valued process that can be characterized as the unique solution of a nonlinear PDE. Numerical solutions of this PDE, which involves two spatial variables and one time variable, can be efficiently obtained. These solutions provide detailed information about the performance of the ICA algorithm, as many practical performance metrics are functionals of the joint empirical measures. Numerical simulations show that our asymptotic analysis is accurate even for moderate dimensions. In addition to providing a tool for understanding the performance of the algorithm, our PDE analysis also provides useful insight. In particular, in the high-dimensional limit, the original coupled dynamics associated with the algorithm will be asymptotically ‘decoupled’, with each coordinate independently solving a 1D effective minimization problem via stochastic gradient descent. Exploiting this insight to design new algorithms for achieving optimal trade-offs between computational and statistical efficiency may prove an interesting line of future research.
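
A generic online update of the kind analyzed can be sketched as follows (my own simplified assumptions for the nonlinearity and step size; the paper's algorithm and scaling may differ in detail): each streamed sample nudges the estimate along a nonlinear gradient direction, with an O(1/n) step, and renormalizes.

# Sketch of a generic online (streaming) ICA update for one component.
# The nonlinearity tanh and the step size tau/n are illustrative assumptions.
import numpy as np

def online_ica(samples, tau=0.5, seed=0):
    rng = np.random.default_rng(seed)
    n = samples.shape[1]
    w = rng.normal(size=n); w /= np.linalg.norm(w)
    for y in samples:                      # one pass over the data stream
        g = y * np.tanh(w @ y)             # stochastic gradient of E[log cosh(w.y)]
        w = w + (tau / n) * g              # O(1/n) step, as in the scaling limit
        w /= np.linalg.norm(w)             # keep the estimate on the unit sphere
    return w

# Toy stream: one non-Gaussian (Laplace) direction buried in Gaussian noise.
rng = np.random.default_rng(1)
n, T = 100, 20000
u = np.zeros(n); u[0] = 1.0
Y = rng.laplace(size=(T, 1)) * u + rng.normal(size=(T, n))
w = online_ica(Y)
print(abs(w @ u))                          # overlap with the hidden direction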

On neuronal capacity

Pierre Baldi and Roman Vershynin J. Stat. Mech. (2019) 124012

We define the capacity of a learning machine to be the logarithm of the number (or volume) of the functions it can implement. We review known results, and derive new results, estimating the capacity of several neuronal models: linear and polynomial threshold gates, linear and polynomial threshold gates with constrained weights (binary weights, positive weights), and ReLU neurons. We also derive some capacity estimates and bounds for fully recurrent networks, as well as feedforward networks.
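
As a concrete instance of this definition (the threshold-function count below is the classical asymptotic for linear threshold gates, stated here for orientation):

% Capacity of a learning machine implementing a finite set \mathcal{F} of functions:
C \;=\; \log_{2} \lvert \mathcal{F} \rvert .
% Example: linear threshold gates f(x) = \mathrm{sign}(w \cdot x) on n binary
% inputs number 2^{\,n^{2}(1 + o(1))}, so a single such neuron has capacity
C_{\mathrm{LT}}(n) \;=\; n^{2}\,\bigl(1 + o(1)\bigr) \ \text{bits}.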

Comparing dynamics: deep neural networks versus glassy systems

Marco Baity-Jesi et al J. Stat. Mech. (2019) 124013

We analyze numerically the training dynamics of deep neural networks (DNNs) using methods developed in the statistical physics of glassy systems. The two main issues we address are (1) the complexity of the loss landscape and of the dynamics within it, and (2) to what extent DNNs share similarities with glassy systems. Our findings, obtained for different architectures and datasets, suggest that during the training process the dynamics slows down because of an increasingly large number of flat directions. At large times, when the loss is approaching zero, the system diffuses at the bottom of the landscape. Despite some similarities with the dynamics of mean-field glassy systems, in particular the absence of barrier crossing, we find distinctive dynamical behaviors in the two cases, showing that the statistical properties of the corresponding loss and energy landscapes are different. In contrast, when the network is under-parametrized we observe a typical glassy behavior, thus suggesting the existence of different phases depending on whether the network is under-parametrized or over-parametrized.

Open access
Entropy and mutual information in models of deep neural networks

Marylou Gabrié et al J. Stat. Mech. (2019) 124014

We examine a class of stochastic deep learning models with a tractable method to compute information-theoretic quantities. Our contributions are three-fold: (i) we show how entropies and mutual informations can be derived from heuristic statistical physics methods, under the assumption that weight matrices are independent and orthogonally-invariant; (ii) we extend particular cases in which this result is known to be rigorously exact by providing a proof for two-layer networks with Gaussian random weights, using the recently introduced adaptive interpolation method; (iii) we propose an experiment framework with generative models of synthetic datasets, on which we train deep neural networks with a weight constraint designed so that the assumption in (i) is verified during learning. We study the behavior of entropies and mutual informations throughout learning and conclude that, in the proposed setting, the relationship between compression and generalization remains elusive.
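
The quantities tracked are the standard information-theoretic ones, well defined here because the layers are stochastic (generic notation, stated for orientation):

% For a stochastic layer T = \varphi(W X) + \xi with injected noise \xi, the
% mutual information between input and representation is finite and given by
I(X; T) \;=\; H(T) \;-\; H(T \mid X),
% and the heuristic replica computation supplies these entropies for
% independent, orthogonally-invariant weight matrices W.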

Gauging variational inference

Sungsoo Ahn et al J. Stat. Mech. (2019) 124015

Computing the partition function is the most important statistical inference task arising in applications of graphical models (GMs). Since it is computationally intractable, approximate methods have been used in practice, where mean-field (MF) and belief propagation (BP) are arguably the most popular and successful approaches of a variational type. In this paper, we propose two new variational schemes, coined Gauged-MF (G-MF) and Gauged-BP (G-BP), improving on MF and BP, respectively. Both provide lower bounds for the partition function by utilizing the so-called gauge transformation, which modifies factors of the GM while keeping the partition function invariant. Moreover, we prove that both G-MF and G-BP are exact for GMs with a single loop of a special structure, even though the bare MF and BP perform badly in this case. Our extensive experiments indeed confirm that the proposed algorithms outperform and generalize MF and BP.
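
For orientation, the object being approximated and the invariance being exploited can be written in generic factor-graph notation (the specific gauge parametrization is the paper's own):

% Partition function of a graphical model with factors f_a over variable subsets x_a:
Z \;=\; \sum_{x} \prod_{a} f_{a}(x_{a}).
% A gauge transformation rewrites each factor, f_a \to \tilde f_a, changing the
% individual terms but leaving the sum Z itself invariant; G-MF and G-BP
% optimize over such transformations to obtain provable lower bounds on Z.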

Statistical mechanics of low-rank tensor decomposition

Jonathan Kadmon and Surya Ganguli J. Stat. Mech. (2019) 124016

Often, large, high-dimensional datasets collected across multiple modalities can be organized as a higher-order tensor. Low-rank tensor decomposition then arises as a powerful and widely used tool to discover simple low-dimensional structures underlying such data. However, we currently lack a theoretical understanding of the algorithmic behavior of low-rank tensor decompositions. We derive Bayesian approximate message passing (AMP) algorithms for recovering arbitrarily shaped low-rank tensors buried within noise, and we employ dynamic mean field theory to precisely characterize their performance. Our theory reveals the existence of phase transitions between easy, hard and impossible inference regimes, and displays an excellent match with simulations. Moreover it reveals several qualitative surprises compared to the behavior of symmetric, cubic tensor decomposition. Finally, we compare our AMP algorithm to the most commonly used algorithm, alternating least squares (ALS), and demonstrate that AMP significantly outperforms ALS in the presence of noise.
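
The inference problem has the standard planted low-rank form, written here in generic CP/PARAFAC notation for an order-3 tensor (an illustrative assumption about the setup):

% Observed order-3 tensor = planted rank-R signal + noise (generic form):
T_{ijk} \;=\; \sum_{r=1}^{R} \lambda_{r}\, a_{i}^{(r)} b_{j}^{(r)} c_{k}^{(r)} \;+\; \Delta_{ijk},
% with \Delta_{ijk} i.i.d. noise; AMP attempts to recover the factors
% \{a^{(r)}, b^{(r)}, c^{(r)}\}, and dynamic mean-field theory tracks its iterates.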

Open access
Legendre decomposition for tensors

Mahito Sugiyama et al J. Stat. Mech. (2019) 124017

We present a novel nonnegative tensor decomposition method, called Legendre decomposition, which factorizes an input tensor into a multiplicative combination of parameters. Thanks to the well-developed theory of information geometry, the reconstructed tensor is unique and always minimizes the KL divergence from an input tensor. We empirically show that Legendre decomposition can more accurately reconstruct tensors than other nonnegative tensor decomposition methods.

Entropy-SGD: biasing gradient descent into wide valleys

Pratik Chaudhari et al J. Stat. Mech. (2019) 124018

This paper proposes a new optimization algorithm called Entropy-SGD for training deep neural networks that is motivated by the local geometry of the energy landscape. Local extrema with low generalization error have a large proportion of almost-zero eigenvalues in the Hessian, with very few positive or negative eigenvalues. We leverage this observation to construct a local-entropy-based objective function that favors well-generalizable solutions lying in large flat regions of the energy landscape, while avoiding poorly-generalizable solutions located in sharp valleys. Conceptually, our algorithm resembles two nested loops of SGD, where we use Langevin dynamics in the inner loop to compute the gradient of the local entropy before each update of the weights. We show that the new objective has a smoother energy landscape and show improved generalization over SGD using uniform stability, under certain assumptions. Our experiments on convolutional and recurrent networks demonstrate that Entropy-SGD compares favorably to state-of-the-art techniques in terms of generalization error and training time.
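
In outline, the nested loops look like the following simplified sketch on a generic loss (hyperparameters and the toy objective are my illustrative choices, not the paper's settings):

# Simplified Entropy-SGD sketch: an inner SGLD loop estimates the gradient of
# the local entropy around the current weights; the outer loop moves the
# weights toward the running mean mu of the inner samples.
import numpy as np

def entropy_sgd(grad, w, steps=100, inner=20,
                eta=0.1, eta_in=0.01, gamma=0.03, noise=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(steps):
        x, mu = w.copy(), w.copy()
        for _ in range(inner):
            # Langevin dynamics biased toward w (the "local" part of local entropy)
            g = grad(x) - gamma * (w - x)
            x = x - eta_in * g + np.sqrt(eta_in) * noise * rng.normal(size=x.shape)
            mu = 0.75 * mu + 0.25 * x          # running average of inner iterates
        w = w - eta * gamma * (w - mu)         # local-entropy gradient ~ gamma (w - mu)
    return w

# Toy nonconvex objective x^4 - 2x^2 + 0.25x (a wide and a sharp basin).
loss_grad = lambda x: 4 * x**3 - 4 * x + 0.25
print(entropy_sgd(loss_grad, np.array([1.5])))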

Online learning of quantum states

Scott Aaronson et al J. Stat. Mech. (2019) 124019

Suppose we have many copies of an unknown n-qubit state ρ. We measure some copies of ρ using a known two-outcome measurement E_1, then other copies using a measurement E_2, and so on. At each stage t, we generate a current hypothesis ω_t about the state ρ, using the outcomes of the previous measurements. We show that it is possible to do this in a way that guarantees that |Tr(E_t ω_t) − Tr(E_t ρ)|, the error in our prediction for the next measurement, is at least ε at most O(n/ε²) times. Even in the ‘non-realizable’ setting—where there could be arbitrary noise in the measurement outcomes—we show how to output hypothesis states that incur at most O(√(Tn)) excess loss over the best possible state on the first T measurements. These results generalize a 2007 theorem by Aaronson on the PAC-learnability of quantum states, to the online and regret-minimization settings. We give three different ways to prove our results—using convex optimization, quantum postselection, and sequential fat-shattering dimension—which have different advantages in terms of parameters and portability.

On the information bottleneck theory of deep learning

Andrew M Saxe et al J. Stat. Mech. (2019) 124020

The practical successes of deep neural networks have not been matched by theoretical progress that satisfyingly explains their behavior. In this work, we study the information bottleneck (IB) theory of deep learning, which makes three specific claims: first, that deep networks undergo two distinct phases consisting of an initial fitting phase and a subsequent compression phase; second, that the compression phase is causally related to the excellent generalization performance of deep networks; and third, that the compression phase occurs due to the diffusion-like behavior of stochastic gradient descent. Here we show that none of these claims hold true in the general case, and instead reflect assumptions made to compute a finite mutual information metric in deterministic networks. When computed using simple binning, we demonstrate through a combination of analytical results and simulation that the information plane trajectory observed in prior work is predominantly a function of the neural nonlinearity employed: double-sided saturating nonlinearities like tanh yield a compression phase as neural activations enter the saturation regime, but linear activation functions and single-sided saturating nonlinearities like the widely used ReLU in fact do not. Moreover, we find that there is no evident causal connection between compression and generalization: networks that do not compress are still capable of generalization, and vice versa. Next, we show that the compression phase, when it exists, does not arise from stochasticity in training by demonstrating that we can replicate the IB findings using full batch gradient descent rather than stochastic gradient descent. Finally, we show that when an input domain consists of a subset of task-relevant and task-irrelevant information, hidden representations do compress the task-irrelevant information, although the overall information about the input may monotonically increase with training time, and that this compression happens concurrently with the fitting process rather than during a subsequent compression period.

Plug-in estimation in high-dimensional linear inverse problems: a rigorous analysis

Alyson K Fletcher et al J. Stat. Mech. (2019) 124021

Estimating a vector x from noisy linear measurements y = Ax + w often requires use of prior knowledge or structural constraints on x for accurate reconstruction. Several recent works have considered combining linear least-squares estimation with a generic or ‘plug-in’ denoiser function that can be designed in a modular manner based on the prior knowledge about x. While these methods have shown excellent performance, it has been difficult to obtain rigorous performance guarantees. This work considers plug-in denoising combined with the recently-developed vector approximate message passing (VAMP) algorithm, which is itself derived via expectation propagation techniques. It is shown that the mean squared error of this ‘plug-and-play’ VAMP can be exactly predicted for high-dimensional right-rotationally invariant random measurement matrices A and Lipschitz denoisers. The method is demonstrated on applications in image recovery and parametric bilinear estimation.
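
To illustrate the modular "plug-in denoiser" idea in its simplest form, here is a plug-and-play ISTA-style sketch (deliberately simpler than VAMP, which replaces the gradient step with expectation-propagation-style updates; the soft-thresholding denoiser is an illustrative stand-in for any Lipschitz denoiser):

# Plug-and-play sketch: alternate a least-squares gradient step with a generic
# denoiser.  Soft-thresholding stands in for any Lipschitz denoiser.
import numpy as np

def soft_threshold(v, lam):
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def plug_and_play(A, y, denoise, iters=200, step=None):
    if step is None:
        step = 1.0 / np.linalg.norm(A, 2) ** 2   # 1/L for the data-fidelity term
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = denoise(x - step * A.T @ (A @ x - y))  # gradient step, then denoise
    return x

rng = np.random.default_rng(0)
n, N = 100, 200
A = rng.normal(0, 1 / np.sqrt(n), (n, N))
x0 = np.zeros(N); x0[rng.choice(N, 10, replace=False)] = rng.normal(size=10)
y = A @ x0 + 0.01 * rng.normal(size=n)
xhat = plug_and_play(A, y, lambda v: soft_threshold(v, 0.02))
print(np.linalg.norm(xhat - x0) / np.linalg.norm(x0))   # relative error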

Bucket renormalization for approximate inference

Sungsoo Ahn et al J. Stat. Mech. (2019) 124022

Probabilistic graphical models are a key tool in machine learning applications. Computing the partition function, i.e. the normalizing constant, is a fundamental task of statistical inference, but it is generally computationally intractable, leading to extensive study of approximation methods. Iterative variational methods are a popular and successful family of approaches. However, even state-of-the-art variational methods can return poor results or fail to converge on difficult instances. In this paper, we instead consider computing the partition function via sequential summation over variables. We develop robust approximate algorithms by combining ideas from mini-bucket elimination with tensor network and renormalization group methods from statistical physics. The resulting ‘convergence-free’ methods show good empirical performance on both synthetic and real-world benchmark models, even for difficult instances.

The committee machine: computational to statistical gaps in learning a two-layers neural network

Benjamin Aubin et al J. Stat. Mech. (2019) 124023

Heuristic tools from statistical physics have been used in the past to locate the phase transitions and compute the optimal learning and generalization errors in the teacher-student scenario in multi-layer neural networks. In this paper, we provide a rigorous justification of these approaches for a two-layers neural network model called the committee machine, under a technical assumption. We also introduce a version of the approximate message passing (AMP) algorithm for the committee machine that allows optimal learning in polynomial time for a large set of parameters. We find that there are regimes in which a low generalization error is information-theoretically achievable while the AMP algorithm fails to deliver it, strongly suggesting that no efficient algorithm exists for those cases, and unveiling a large computational gap.
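
For reference, the committee machine computes a majority vote over K hidden units (generic form with a sign readout, the standard convention, assumed here):

% Committee machine with K hidden units: input x \in \mathbb{R}^{N},
% weight vectors w_k \in \mathbb{R}^{N}, output a majority vote over the units
\hat{y}(x) \;=\; \mathrm{sign}\!\left( \sum_{k=1}^{K} \mathrm{sign}\!\left( w_{k} \cdot x \right) \right).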