Machine Learning 2022

The Journal of Statistical Mechanics, Theory and Experiment (JSTAT) is publishing its fourth special issue on the Statistical Physics aspects of Machine Learning/Artificial Intelligence. The scope of this initiative is to bring the methods of statistical physics to the full benefit of machine learning, a computational tool which is becoming of fundamental importance across many fields of science. Conversely, the use of machine learning approaches in statistical physics are also mostly welcome.

The format of the special issue takes into account the situation of the field of machine learning, where many of the most important papers are published in proceedings of conferences that are often overlooked by the physics community. Our special issue on machine learning includes some selected papers published recently in the proceedings of some major conferences, as well as original contributions that are reviewed as all other papers in JSTAT.

The authors of the conference papers have been proposed to include, if needed, an augmented version of their paper, including supplementary material, which makes it more adapted for the readership of our journal.

The selection for this fourth special issue has been made by a committee consisting of the JSTAT Chief Scientific Director Marc Mezard and the following JSTAT editors: Riccardo Zecchina (chair), Yoshiyuki Kabashima, Bert Kappen, Florent Krzakala and Manfred Opper.

Other annual Machine Learning Special Issues are available here.

Participating Journals

Journal
Impact Factor
Citescore
Submit

Paper

Asymptotics of representation learning in finite Bayesian neural networks

Jacob A Zavatone-Veth et al J. Stat. Mech. (2022) 114008

Recent works have suggested that finite Bayesian neural networks may sometimes outperform their infinite cousins because finite networks can flexibly adapt their internal representations. However, our theoretical understanding of how the learned hidden layer representations of finite networks differ from the fixed representations of infinite networks remains incomplete. Perturbative finite-width corrections to the network prior and posterior have been studied, but the asymptotics of learned features have not been fully characterized. Here, we argue that the leading finite-width corrections to the average feature kernels for any Bayesian network with linear readout and Gaussian likelihood have a largely universal form. We illustrate this explicitly for three tractable network architectures: deep linear fully-connected and convolutional networks, and networks with a single nonlinear hidden layer. Our results begin to elucidate how task-relevant learning signals shape the hidden layer representations of wide Bayesian neural networks.

Open access
An analytical theory of curriculum learning in teacher–student networks

Luca Saglietti et al J. Stat. Mech. (2022) 114014

In animals and humans, curriculum learning—presenting data in a curated order—is critical to rapid learning and effective pedagogy. A long history of experiments has demonstrated the impact of curricula in a variety of animals but, despite its ubiquitous presence, a theoretical understanding of the phenomenon is still lacking. Surprisingly, in contrast to animal learning, curricula strategies are not widely used in machine learning and recent simulation studies reach the conclusion that curricula are moderately effective or even ineffective in most cases. This stark difference in the importance of curriculum raises a fundamental theoretical question: when and why does curriculum learning help? In this work, we analyse a prototypical neural network model of curriculum learning in the high-dimensional limit, employing statistical physics methods. We study a task in which a sparse set of informative features are embedded amidst a large set of noisy features. We analytically derive average learning trajectories for simple neural networks on this task, which establish a clear speed benefit for curriculum learning in the online setting. However, when training experiences can be stored and replayed (for instance, during sleep), the advantage of curriculum in standard neural networks disappears, in line with observations from the deep learning literature. Inspired by synaptic consolidation techniques developed to combat catastrophic forgetting, we propose curriculum-aware algorithms that consolidate synapses at curriculum change points and investigate whether this can boost the benefits of curricula. We derive generalisation performance as a function of consolidation strength (implemented as an L2 regularisation/elastic coupling connecting learning phases), and show that curriculum-aware algorithms can yield a large improvement in test performance. Our reduced analytical descriptions help reconcile apparently conflicting empirical results, trace regimes where curriculum learning yields the largest gains, and provide experimentally-accessible predictions for the impact of task parameters on curriculum benefits. More broadly, our results suggest that fully exploiting a curriculum may require explicit adjustments in the loss.

Open access
Relative stability toward diffeomorphisms indicates performance in deep nets

Leonardo Petrini et al J. Stat. Mech. (2022) 114013

Understanding why deep nets can classify data in large dimensions remains a challenge. It has been proposed that they do so by becoming stable to diffeomorphisms, yet existing empirical measurements support that it is often not the case. We revisit this question by defining a maximum-entropy distribution on diffeomorphisms, that allows to study typical diffeomorphisms of a given norm. We confirm that stability toward diffeomorphisms does not strongly correlate to performance on benchmark data sets of images. By contrast, we find that the stability toward diffeomorphisms relative to that of generic transformations Rf correlates remarkably with the test error epsilont. It is of order unity at initialization but decreases by several decades during training for state-of-the-art architectures. For CIFAR10 and 15 known architectures we find \${{\epsilon}}_{\text{t}}\approx 0.2\sqrt{{R}_{f}}\$, suggesting that obtaining a small Rf is important to achieve good performance. We study how Rf depends on the size of the training set and compare it to a simple model of invariant learning.

Open access
Disordered systems insights on computational hardness

David Gamarnik et al J. Stat. Mech. (2022) 114015

In this review article we discuss connections between the physics of disordered systems, phase transitions in inference problems, and computational hardness. We introduce two models representing the behavior of glassy systems, the spiked tensor model and the generalized linear model. We discuss the random (non-planted) versions of these problems as prototypical optimization problems, as well as the planted versions (with a hidden solution) as prototypical problems in statistical inference and learning. Based on ideas from physics, many of these problems have transitions where they are believed to jump from easy (solvable in polynomial time) to hard (requiring exponential time). We discuss several emerging ideas in theoretical computer science and statistics that provide rigorous evidence for hardness by proving that large classes of algorithms fail in the conjectured hard regime. This includes the overlap gap property, a particular mathematization of clustering or dynamical symmetry-breaking, which can be used to show that many algorithms that are local or robust to changes in their input fail. We also discuss the sum-of-squares hierarchy, which places bounds on proofs or algorithms that use low-degree polynomials such as standard spectral methods and semidefinite relaxations, including the Sherrington–Kirkpatrick model. Throughout the manuscript we present connections to the physics of disordered systems and associated replica symmetry breaking properties.

Open access
Learning curves of generic features maps for realistic datasets with a teacher-student model

Bruno Loureiro et al J. Stat. Mech. (2022) 114001

Teacher-student models provide a framework in which the typical-case performance of high-dimensional supervised learning can be described in closed form. The assumptions of Gaussian i.i.d. input data underlying the canonical teacher-student model may, however, be perceived as too restrictive to capture the behaviour of realistic data sets. In this paper, we introduce a Gaussian covariate generalisation of the model where the teacher and student can act on different spaces, generated with fixed, but generic feature maps. While still solvable in a closed form, this generalization is able to capture the learning curves for a broad range of realistic data sets, thus redeeming the potential of the teacher-student framework. Our contribution is then two-fold: first, we prove a rigorous formula for the asymptotic training loss and generalisation error. Second, we present a number of situations where the learning curve of the model captures the one of a realistic data set learned with kernel regression and classification, with out-of-the-box feature maps such as random projections or scattering transforms, or with pre-learned ones—such as the features learned by training multi-layer neural networks. We discuss both the power and the limitations of the framework.

Open access
Locality defeats the curse of dimensionality in convolutional teacher–student scenarios

Alessandro Favero et al J. Stat. Mech. (2022) 114012

Convolutional neural networks perform a local and translationally-invariant treatment of the data: quantifying which of these two aspects is central to their success remains a challenge. We study this problem within a teacher–student framework for kernel regression, using 'convolutional' kernels inspired by the neural tangent kernel of simple convolutional architectures of given filter size. Using heuristic methods from physics, we find in the ridgeless case that locality is key in determining the learning curve exponent β (that relates the test error epsilontPβ to the size of the training set P), whereas translational invariance is not. In particular, if the filter size of the teacher t is smaller than that of the student s, β is a function of s only and does not depend on the input dimension. We confirm our predictions on β empirically. We conclude by proving, under a natural universality assumption, that performing kernel regression with a ridge that decreases with the size of the training set leads to similar learning curve exponents to those we obtain in the ridgeless case.

Deep networks on toroids: removing symmetries reveals the structure of flat regions in the landscape geometry

Fabrizio Pittorino et al J. Stat. Mech. (2022) 114007

We systematize the approach to the investigation of deep neural network landscapes by basing it on the geometry of the space of implemented functions rather than the space of parameters. Grouping classifiers into equivalence classes, we develop a standardized parameterization in which all symmetries are removed, resulting in a toroidal topology. On this space, we explore the error landscape rather than the loss. This lets us derive a meaningful notion of the flatness of minimizers and of the geodesic paths connecting them. Using different optimization algorithms that sample minimizers with different flatness we study the mode connectivity and relative distances. Testing a variety of state-of-the-art architectures and benchmark datasets, we confirm the correlation between flatness and generalization performance; we further show that in function space flatter minima are closer to each other and that the barriers along the geodesics connecting them are small. We also find that minimizers found by variants of gradient descent can be connected by zero-error paths composed of two straight lines in parameter space, i.e. polygonal chains with a single bend. We observe similar qualitative results in neural networks with binary weights and activations, providing one of the first results concerning the connectivity in this setting. Our results hinge on symmetry removal, and are in remarkable agreement with the rich phenomenology described by some recent analytical studies performed on simple shallow models.

Open access
Generalization error rates in kernel regression: the crossover from the noiseless to noisy regime

Hugo Cui et al J. Stat. Mech. (2022) 114004

In this manuscript we consider kernel ridge regression (KRR) under the Gaussian design. Exponents for the decay of the excess generalization error of KRR have been reported in various works under the assumption of power-law decay of eigenvalues of the features co-variance. These decays were, however, provided for sizeably different setups, namely in the noiseless case with constant regularization and in the noisy optimally regularized case. Intermediary settings have been left substantially uncharted. In this work, we unify and extend this line of work, providing characterization of all regimes and excess error decay rates that can be observed in terms of the interplay of noise and regularization. In particular, we show the existence of a transition in the noisy setting between the noiseless exponents to its noisy values as the sample complexity is increased. Finally, we illustrate how this crossover can also be observed on real data sets.

Particle dual averaging: optimization of mean field neural network with global convergence rate analysis

Atsushi Nitanda et al J. Stat. Mech. (2022) 114010

We propose the particle dual averaging (PDA) method, which generalizes the dual averaging method in convex optimization to the optimization over probability distributions with quantitative runtime guarantee. The algorithm consists of an inner loop and outer loop: the inner loop utilizes the Langevin algorithm to approximately solve for a stationary distribution, which is then optimized in the outer loop. The method can thus be interpreted as an extension of the Langevin algorithm to naturally handle nonlinear functional on the probability space. An important application of the proposed method is the optimization of neural network in the mean field regime, which is theoretically attractive due to the presence of nonlinear feature learning, but quantitative convergence rate can be challenging to obtain. By adapting finite-dimensional convex optimization theory into the space of measures, we analyze PDA in regularized empirical/expected risk minimization, and establish quantitative global convergence in learning two-layer mean field neural networks under more general settings. Our theoretical results are supported by numerical simulations on neural networks with reasonable size.

Open access
Approximate message passing with spectral initialization for generalized linear models

Marco Mondelli and Ramji Venkataramanan J. Stat. Mech. (2022) 114003

We consider the problem of estimating a signal from measurements obtained via a generalized linear model. We focus on estimators based on approximate message passing (AMP), a family of iterative algorithms with many appealing features: the performance of AMP in the high-dimensional limit can be succinctly characterized under suitable model assumptions; AMP can also be tailored to the empirical distribution of the signal entries, and for a wide class of estimation problems, AMP is conjectured to be optimal among all polynomial-time algorithms. However, a major issue of AMP is that in many models (such as phase retrieval), it requires an initialization correlated with the ground-truth signal and independent from the measurement matrix. Assuming that such an initialization is available is typically not realistic. In this paper, we solve this problem by proposing an AMP algorithm initialized with a spectral estimator. With such an initialization, the standard AMP analysis fails since the spectral estimator depends in a complicated way on the design matrix. Our main contribution is a rigorous characterization of the performance of AMP with spectral initialization in the high-dimensional limit. The key technical idea is to define and analyze a two-phase artificial AMP algorithm that first produces the spectral estimator, and then closely approximates the iterates of the true AMP. We also provide numerical results that demonstrate the validity of the proposed approach.

ConViT: improving vision transformers with soft convolutional inductive biases

Stéphane d'Ascoli et al J. Stat. Mech. (2022) 114005

Convolutional architectures have proven to be extremely successful for vision tasks. Their hard inductive biases enable sample-efficient learning, but come at the cost of a potentially lower performance ceiling. Vision transformers rely on more flexible self-attention layers, and have recently outperformed CNNs for image classification. However, they require costly pre-training on large external datasets or distillation from pre-trained convolutional networks. In this paper, we ask the following question: is it possible to combine the strengths of these two architectures while avoiding their respective limitations? To this end, we introduce gated positional self-attention (GPSA), a form of positional self-attention which can be equipped with a 'soft' convolutional inductive bias. We initialize the GPSA layers to mimic the locality of convolutional layers, then give each attention head the freedom to escape locality by adjusting a gating parameter regulating the attention paid to position versus content information. The resulting convolutional-like ViT architecture, ConViT, outperforms the DeiT (Touvron et al 2020 arXiv:2012.12877) on ImageNet, while offering a much improved sample efficiency. We further investigate the role of locality in learning by first quantifying how it is encouraged in vanilla self-attention layers, then analyzing how it has escaped in GPSA layers. We conclude by presenting various ablations to better understand the success of the ConViT. Our code and models are released publicly at https://github.com/facebookresearch/convit.

Equilibrium and non-equilibrium regimes in the learning of restricted Boltzmann machines

Aurélien Decelle et al J. Stat. Mech. (2022) 114009

Training restricted Boltzmann machines (RBMs) have been challenging for a long time due to the difficulty of precisely computing the log-likelihood gradient. Over the past few decades, many works have proposed more or less successful training recipes but without studying the crucial quantity of the problem: the mixing time, i.e. the number of Monte Carlo iterations needed to sample new configurations from a model. In this work, we show that this mixing time plays a crucial role in the dynamics and stability of the trained model, and that RBMs operate in two well-defined regimes, namely equilibrium and out-of-equilibrium, depending on the interplay between this mixing time of the model and the number of steps, k, used to approximate the gradient. We further show empirically that this mixing time increases with the learning, which often implies a transition from one regime to another as soon as k becomes smaller than this time. In particular, we show that using the popular k (persistent) contrastive divergence approaches, with k small, the dynamics of the learned model are extremely slow and often dominated by strong out-of-equilibrium effects. On the contrary, RBMs trained in equilibrium display faster dynamics, and a smooth convergence to dataset-like configurations during the sampling. Finally, we discuss how to exploit in practice both regimes depending on the task one aims to fulfill: (i) short k can be used to generate convincing samples in short learning times, (ii) large k (or increasingly large) is needed to learn the correct equilibrium distribution of the RBM. Finally, the existence of these two operational regimes seems to be a general property of energy based models trained via likelihood maximization.

Align, then memorise: the dynamics of learning with feedback alignment

Maria Refinetti et al J. Stat. Mech. (2022) 114002

Direct feedback alignment (DFA) is emerging as an efficient and biologically plausible alternative to backpropagation for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA to explain these diverging results remains elusive. Here, we propose a theory of feedback alignment algorithms. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorize process occurs sequentially from the bottom layers of the network to the top.

Open access
Ising model selection using 1-regularized linear regression: a statistical mechanics analysis

Xiangming Meng et al J. Stat. Mech. (2022) 114006

We theoretically analyze the typical learning performance of 1-regularized linear regression (1-LinR) for Ising model selection using the replica method from statistical mechanics. For typical random regular graphs in the paramagnetic phase, an accurate estimate of the typical sample complexity of 1-LinR is obtained. Remarkably, despite the model misspecification, 1-LinR is model selection consistent with the same order of sample complexity as 1-regularized logistic regression (1-LogR), i.e. \$M=\mathcal{O}\left(\mathrm{log}\,N\right)\$, where N is the number of variables of the Ising model. Moreover, we provide an efficient method to accurately predict the non-asymptotic behavior of 1-LinR for moderate M, N, such as precision and recall. Simulations show a fairly good agreement between theoretical predictions and experimental results, even for graphs with many loops, which supports our findings. Although this paper mainly focuses on 1-LinR, our method is readily applicable for precisely characterizing the typical learning performances of a wide class of 1-regularized M-estimators including 1-LogR and interaction screening.

On the proliferation of support vectors in high dimensions

Daniel Hsu et al J. Stat. Mech. (2022) 114011

The support vector machine (SVM) is a well-established classification method whose name refers to the particular training examples, called support vectors, that determine the maximum margin separating hyperplane. The SVM classifier is known to enjoy good generalization properties when the number of support vectors is small compared to the number of training examples. However, recent research has shown that in sufficiently high-dimensional linear classification problems, the SVM can generalize well despite a proliferation of support vectors where all training examples are support vectors. In this paper, we identify new deterministic equivalences for this phenomenon of support vector proliferation, and use them to (1) substantially broaden the conditions under which the phenomenon occurs in high-dimensional settings, and (2) prove a nearly matching converse result.