Machine Learning 2023

The Journal of Statistical Mechanics: Theory and Experiment (JSTAT) is publishing its fifth special issue on the statistical physics aspects of machine learning and artificial intelligence. The aim of this initiative is to bring the methods of statistical physics to bear on machine learning, a discipline that is becoming of fundamental importance across many fields of science. Conversely, contributions that apply machine learning approaches to physics are also most welcome.

The format of the special issue reflects the current state of the field of machine learning, where many of the most important papers are published in the proceedings of conferences that are often overlooked by the physics community. The special issue therefore includes a selection of papers recently published in the proceedings of major conferences, together with original contributions submitted directly to the special issue and reviewed through the usual JSTAT procedures.

The authors of the conference papers have been invited to provide, where needed, an augmented version of their paper, including supplementary material, to make it better suited to the readership of our journal.

The selection for this fifth special issue is made by a committee consisting of the JSTAT Chief Scientific Director Marc Mezard and the following JSTAT editors: Riccardo Zecchina (chair), Yoshiyuki Kabashima, Bert Kappen, Florent Krzakala and Manfred Opper.

Other annual Machine Learning special issues are also available.

Open access
Phase diagram of stochastic gradient descent in high-dimensional two-layer neural networks

Rodrigo Veiga et al J. Stat. Mech. (2023) 114008

Despite the non-convex optimization landscape, over-parametrized shallow networks are able to achieve global convergence under gradient descent. The picture can be radically different for narrow networks, which tend to get stuck in badly-generalizing local minima. Here we investigate the crossover between these two regimes in the high-dimensional setting, and in particular the connection between the so-called mean-field/hydrodynamic regime and the seminal approach of Saad & Solla. Focusing on the case of Gaussian data, we study the interplay between the learning rate, the time scale, and the number of hidden units in the high-dimensional dynamics of stochastic gradient descent (SGD). Our work builds on a deterministic description of SGD in high dimensions from statistical physics, which we extend and for which we provide rigorous convergence rates.
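
As a rough illustration of the setting analysed above (and not the authors' code), the following numpy sketch runs online SGD for a narrow two-layer tanh network on fresh Gaussian samples in a teacher-student setup and prints the student-teacher overlap matrix, the kind of order parameter a deterministic high-dimensional description tracks. The widths, learning rate, activation and number of steps are arbitrary illustrative choices.

# Minimal sketch (illustrative, not the authors' code): online SGD for a two-layer
# tanh network on i.i.d. Gaussian data in a teacher-student setup.
import numpy as np

rng = np.random.default_rng(0)
d, k_teacher, k_student = 500, 2, 4     # input dimension and hidden widths (illustrative)
lr, steps = 0.5, 20000                  # learning rate and number of online SGD steps

W_star = rng.standard_normal((k_teacher, d)) / np.sqrt(d)   # fixed teacher
W = rng.standard_normal((k_student, d)) / np.sqrt(d)        # student first layer
a = np.ones(k_student) / k_student                          # fixed second layer

g = np.tanh
dg = lambda z: 1.0 - np.tanh(z) ** 2

for t in range(steps):
    x = rng.standard_normal(d)                    # fresh Gaussian sample (online SGD)
    y_star = np.mean(g(W_star @ x))               # teacher label
    z = W @ x
    err = a @ g(z) - y_star
    W -= (lr / d) * err * np.outer(a * dg(z), x)  # one SGD step on the squared loss

M = W @ W_star.T                                  # student-teacher overlaps (order parameters)
print("student-teacher overlap matrix:\n", np.round(M, 3))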

Self-consistent dynamical field theory of kernel evolution in wide neural networks

Blake Bordelon and Cengiz Pehlevan J. Stat. Mech. (2023) 114009

We analyze feature learning in infinite-width neural networks trained with gradient flow through a self-consistent dynamical field theory. We construct a collection of deterministic dynamical order parameters which are inner-product kernels for hidden unit activations and gradients in each layer at pairs of time points, providing a reduced description of network activity through training. These kernel order parameters collectively define the hidden layer activation distribution, the evolution of the neural tangent kernel (NTK), and consequently, output predictions. We show that the field theory derivation recovers the recursive stochastic process of infinite-width feature learning networks obtained by Yang and Hu with tensor programs. For deep linear networks, these kernels satisfy a set of algebraic matrix equations. For nonlinear networks, we provide an alternating sampling procedure to self-consistently solve for the kernel order parameters. We provide comparisons of the self-consistent solution to various approximation schemes including the static NTK approximation, gradient independence assumption, and leading order perturbation theory, showing that each of these approximations can break down in regimes where general self-consistent solutions still provide an accurate description. Lastly, we provide experiments in more realistic settings which demonstrate that the loss and kernel dynamics of convolutional neural networks at fixed feature learning strength are preserved across different widths on an image classification task.
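
A minimal sketch of one of the objects discussed above: the empirical neural tangent kernel of a small two-layer network, evaluated on the same inputs before and after some gradient descent, so that its evolution during training can be monitored. This is only a finite-width illustration; the network size, data, activation and training loop are assumptions for the example, not the paper's field-theory construction.

# Minimal sketch (illustrative): empirical NTK of a small two-layer network,
# evaluated before and after a short run of gradient descent.
import numpy as np

rng = np.random.default_rng(1)
d, width, n = 10, 256, 20
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sin(X @ rng.standard_normal(d))

W = rng.standard_normal((width, d))
a = rng.standard_normal(width)

def ntk(W, a, X):
    """Empirical NTK: parameter gradients of the outputs, contracted over parameters."""
    Z = X @ W.T                                   # pre-activations, shape (n, width)
    Ka = np.tanh(Z) @ np.tanh(Z).T / width        # contribution of second-layer gradients
    G = (1 - np.tanh(Z) ** 2) * a                 # first-layer gradient factors, shape (n, width)
    Kw = (G @ G.T) * (X @ X.T) / width            # contribution of first-layer gradients
    return Ka + Kw

K0 = ntk(W, a, X)
lr = 0.5
for _ in range(200):                              # plain gradient descent on the squared loss
    Z = X @ W.T
    err = (np.tanh(Z) @ a) / np.sqrt(width) - y
    grad_a = np.tanh(Z).T @ err / (np.sqrt(width) * n)
    G = ((1 - np.tanh(Z) ** 2) * a) * err[:, None] / (np.sqrt(width) * n)
    grad_W = G.T @ X
    a -= lr * grad_a
    W -= lr * grad_W
K1 = ntk(W, a, X)
print("relative change of the empirical NTK during training:",
      np.linalg.norm(K1 - K0) / np.linalg.norm(K0))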

Open access
Redundant representations help generalization in wide neural networks

Diego Doimo et al J. Stat. Mech. (2023) 114011

Deep neural networks (DNNs) defy the classical bias-variance trade-off; adding parameters to a DNN that interpolates its training data will typically improve its generalization performance. Explaining the mechanism behind this 'benign overfitting' in deep networks remains an outstanding challenge. Here, we study the last hidden layer representations of various state-of-the-art convolutional neural networks and find that if the last hidden representation is wide enough, its neurons tend to split into groups that carry identical information and differ from each other only by statistically independent noise. The number of these groups increases linearly with the width of the layer, but only if the width is above a critical value. We show that redundant neurons appear only when the training is regularized and the training error is zero.
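
A rough sketch of the kind of measurement described above: grouping neurons of a wide layer whose activations are almost perfectly correlated and counting the groups. The activations below are synthetic stand-ins with a planted redundancy, not the output of a trained convolutional network, and the correlation threshold is an arbitrary choice.

# Minimal sketch of the measurement described above (synthetic activations, not a trained net):
# group neurons of a wide layer whose activations are almost perfectly correlated.
import numpy as np

rng = np.random.default_rng(2)
n_samples, n_groups, copies = 1000, 8, 5
signal = rng.standard_normal((n_samples, n_groups))
# synthetic "wide layer": each informative direction is duplicated with independent noise
acts = np.repeat(signal, copies, axis=1) + 0.05 * rng.standard_normal((n_samples, n_groups * copies))

C = np.corrcoef(acts.T)                       # neuron-neuron correlation matrix
threshold = 0.95
visited, groups = set(), []
for i in range(C.shape[0]):                   # greedy grouping of highly correlated neurons
    if i in visited:
        continue
    members = [j for j in range(C.shape[0]) if abs(C[i, j]) > threshold]
    visited.update(members)
    groups.append(members)
print("number of redundant groups found:", len(groups))   # expected: n_groups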

Open access
Exact learning dynamics of deep linear networks with prior knowledge

Clémentine C J Dominé et al J. Stat. Mech. (2023) 114004

Learning in deep neural networks is known to depend critically on the knowledge embedded in the initial network weights. However, few theoretical results have precisely linked prior knowledge to learning dynamics. Here we derive exact solutions to the dynamics of learning with rich prior knowledge in deep linear networks by generalising Fukumizu's matrix Riccati solution (Fukumizu 1998). We obtain explicit expressions for the evolving network function, hidden representational similarity, and neural tangent kernel over training for a broad class of initialisations and tasks. The expressions reveal a class of task-independent initialisations that radically alter learning dynamics from slow non-linear dynamics to fast exponential trajectories while converging to a global optimum with identical representational similarity, dissociating learning trajectories from the structure of initial internal representations. We characterise how network weights dynamically align with task structure, rigorously justifying why previous solutions successfully described learning from small initial weights without incorporating their fine-scale structure. Finally, we discuss the implications of these findings for continual learning, reversal learning and learning of structured knowledge. Taken together, our results provide a mathematical toolkit for understanding the impact of prior knowledge on deep learning.
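
For concreteness, a minimal numpy sketch (not the paper's code) of the object whose dynamics the exact solutions describe: gradient descent on a two-layer linear network, with the evolving network function W2 W1 tracked from a chosen initialisation. The target map, initialisation scale and learning rate are illustrative, and whitened inputs are assumed so that the loss reduces to a Frobenius-norm error on the product of the weights.

# Minimal sketch (illustrative): gradient descent on a two-layer linear network W2 W1,
# starting from a chosen small initialisation, under whitened inputs.
import numpy as np

rng = np.random.default_rng(3)
d_in, d_hidden, d_out = 5, 5, 3
target = rng.standard_normal((d_out, d_in))          # target linear map

W1 = 0.01 * rng.standard_normal((d_hidden, d_in))    # small random initial weights
W2 = 0.01 * rng.standard_normal((d_out, d_hidden))
lr, steps = 0.05, 5000

for t in range(steps):
    E = W2 @ W1 - target                             # error in the network function
    W2 -= lr * (E @ W1.T)
    W1 -= lr * (W2.T @ E)

print("final error of the network function:", np.linalg.norm(W2 @ W1 - target))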

Open access
Learning sparse features can lead to overfitting in neural networks

Leonardo Petrini et al J. Stat. Mech. (2023) 114003

It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge. For example, feature learning is beneficial for modern architectures trained to classify images, yet detrimental for fully-connected networks trained on the same data. Here, we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via the random feature kernel or the neural tangent kernel) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of the input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark data sets of images. For (i), we compute the scaling of the generalization error with the number of training points and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for the deterioration in performance, which is known to be correlated with smoothness along diffeomorphisms.
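
As a hedged illustration of the 'lazy' side of the comparison above, the sketch below performs random-feature ridge regression on a target that is smooth on the d-dimensional sphere; a full comparison would also train a feature-learning network, which is omitted here. All sizes, the ReLU features and the ridge value are arbitrary choices for the example.

# Minimal sketch of the "lazy" baseline: random-feature ridge regression on a smooth
# target defined on the d-dimensional sphere (illustrative choices throughout).
import numpy as np

rng = np.random.default_rng(4)
d, n_train, n_test, n_features, ridge = 20, 500, 500, 2000, 1e-3

def sphere(n, d):
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

beta = rng.standard_normal(d)
target = lambda X: np.cos(X @ beta)                  # smooth, non-sparse target on the sphere

Xtr, Xte = sphere(n_train, d), sphere(n_test, d)
ytr, yte = target(Xtr), target(Xte)

W = rng.standard_normal((n_features, d))             # frozen random features (lazy regime)
phi = lambda X: np.maximum(X @ W.T, 0.0) / np.sqrt(n_features)

A = phi(Xtr)
w = np.linalg.solve(A.T @ A + ridge * np.eye(n_features), A.T @ ytr)
test_err = np.mean((phi(Xte) @ w - yte) ** 2)
print("random-feature (lazy) test error:", test_err)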

Multi-layer state evolution under random convolutional design

Max Daniels et al J. Stat. Mech. (2023) 114002

Signal recovery under generative neural network priors has emerged as a promising direction in statistical inference and computational imaging. Theoretical analysis of reconstruction algorithms under generative priors is, however, challenging. For generative priors with fully connected layers and Gaussian i.i.d. weights, this was achieved by the multi-layer approximate message passing (ML-AMP) algorithm via a rigorous state evolution. However, practical generative priors are typically convolutional, allowing for computational benefits and inductive biases, and so the Gaussian i.i.d. weight assumption is very limiting. In this paper, we overcome this limitation and establish the state evolution of ML-AMP for random convolutional layers. We prove in particular that random convolutional layers belong to the same universality class as Gaussian matrices. Our proof technique is of independent interest as it establishes a mapping between convolutional matrices and spatially coupled sensing matrices used in coding theory.
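
A very crude illustration (far weaker than the state-evolution universality proved in the paper): the sketch below builds a random circulant matrix implementing a convolution and compares its singular-value statistics with those of an i.i.d. Gaussian matrix of the same shape. The signal length and filter size are arbitrary, and matching second moments of the singular values is only a sanity check, not the paper's universality argument.

# Minimal sketch (illustrative): a random circulant "convolutional" matrix versus an
# i.i.d. Gaussian matrix of the same size, compared through their singular values.
import numpy as np

rng = np.random.default_rng(5)
n, k = 512, 9                                    # signal length and filter size (arbitrary)

filt = rng.standard_normal(k) / np.sqrt(k)
C = np.zeros((n, n))
for i in range(n):
    for j in range(k):
        C[i, (i + j) % n] = filt[j]              # circulant matrix implementing the convolution

G = rng.standard_normal((n, n)) / np.sqrt(n)     # Gaussian comparison matrix

sv_conv = np.linalg.svd(C, compute_uv=False)
sv_gauss = np.linalg.svd(G, compute_uv=False)
print("mean squared singular value, convolutional:", np.mean(sv_conv ** 2))
print("mean squared singular value, Gaussian:     ", np.mean(sv_gauss ** 2))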

Exact solutions of a deep linear network

Liu Ziyin et al J. Stat. Mech. (2023) 114006

This work finds the analytical expression for the global minima of a deep linear network with weight decay and stochastic neurons, a fundamental model for understanding the landscape of neural networks. Our result implies that the origin is a special point in the deep neural network loss landscape where highly nonlinear phenomena emerge. We show that weight decay strongly interacts with the model architecture and can create bad minima at zero in a network with more than one hidden layer, qualitatively different from a network with only one hidden layer. Practically, our result implies that common deep learning initialization methods are generally insufficient to ease the optimization of neural networks.
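
A worked scalar toy, not taken from the paper, that makes the depth dependence concrete: with one unit per layer the regularized loss is L(w) = (1 - w_1 w_2 ... w_D)^2 / 2 + lambda * sum_l w_l^2 / 2, and the Hessian at the origin is a saddle for a single hidden layer (D = 2) but positive definite, hence a local minimum, once there are two or more hidden layers (D >= 3). The sketch estimates this Hessian by finite differences.

# Worked scalar toy (not from the paper): one unit per layer, loss
# L(w) = 0.5*(1 - prod(w))**2 + 0.5*lam*sum(w**2). The Hessian at w = 0 shows that the
# origin is a saddle with one hidden layer (depth 2) but a local minimum with two (depth 3).
import numpy as np

def hessian_at_origin(depth, lam, eps=1e-4):
    def loss(w):
        return 0.5 * (1.0 - np.prod(w)) ** 2 + 0.5 * lam * np.sum(w ** 2)
    H = np.zeros((depth, depth))
    for i in range(depth):
        for j in range(depth):
            def f(a, b):
                w = np.zeros(depth)
                w[i] += a
                w[j] += b
                return loss(w)
            # central finite-difference estimate of the mixed second derivative at the origin
            H[i, j] = (f(eps, eps) - f(eps, -eps) - f(-eps, eps) + f(-eps, -eps)) / (4 * eps ** 2)
    return H

lam = 0.1
for depth in (2, 3):
    eigs = np.linalg.eigvalsh(hessian_at_origin(depth, lam))
    print(f"depth {depth}: Hessian eigenvalues at the origin:", np.round(eigs, 3))
# depth 2 -> eigenvalues lam-1 and lam+1 (saddle for lam < 1);
# depth 3 -> all eigenvalues equal to lam (the origin is a local minimum)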

Open access
Fluctuations, bias, variance and ensemble of learners: exact asymptotics for convex losses in high-dimension

Bruno Loureiro et al J. Stat. Mech. (2023) 114001

From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern Machine Learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high-dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing us to study directly the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations and the singular role played by the interpolation threshold, which are at the root of the 'double-descent' phenomenon.
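
As a numerical caricature of the ensembling setting (not the paper's asymptotic theory), the sketch below trains K ridge regressors on different random-feature maps of the same data and prints the test error as K grows, so that the variance-reducing effect of ensembling can be observed directly. Dimensions, the tanh features and the ridge penalty are illustrative assumptions.

# Minimal sketch (a numerical caricature, not the asymptotic theory): ensemble K ridge
# regressors, each trained on a different random-feature map of the same data.
import numpy as np

rng = np.random.default_rng(6)
d, p, n_train, n_test, ridge = 50, 100, 120, 2000, 1e-2
w_star = rng.standard_normal(d) / np.sqrt(d)

Xtr = rng.standard_normal((n_train, d))
Xte = rng.standard_normal((n_test, d))
ytr, yte = Xtr @ w_star, Xte @ w_star

def rf_predictor():
    F = rng.standard_normal((p, d)) / np.sqrt(d)               # one random feature map
    Atr, Ate = np.tanh(Xtr @ F.T), np.tanh(Xte @ F.T)
    w = np.linalg.solve(Atr.T @ Atr + ridge * np.eye(p), Atr.T @ ytr)
    return Ate @ w

for K in (1, 2, 5, 10):
    preds = np.mean([rf_predictor() for _ in range(K)], axis=0)  # average the K learners
    print(f"ensemble of {K:2d} learners, test error: {np.mean((preds - yte) ** 2):.4f}")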

Precise learning curves and higher-order scaling limits for dot-product kernel regression

Lechao Xiao et al J. Stat. Mech. (2023) 114005

As modern machine learning models continue to advance the computational frontier, it has become increasingly important to develop precise estimates for expected performance improvements under different model and data scaling regimes. Currently, theoretical understanding of the learning curves (LCs) that characterize how the prediction error depends on the number of samples is restricted to either large-sample asymptotics ($m\to\infty$) or, for certain simple data distributions, to the high-dimensional asymptotics in which the number of samples scales linearly with the dimension ($m\propto d$). There is a wide gulf between these two regimes, including all higher-order scaling relations $m\propto d^r$, which are the subject of the present paper. We focus on the problem of kernel ridge regression for dot-product kernels and present precise formulas for the mean of the test error, bias and variance, for data drawn uniformly from the sphere with isotropic random labels in the $r$th-order asymptotic scaling regime $m\to\infty$ with $m/d^r$ held constant. We observe a peak in the LC whenever $m \approx d^r/r!$ for any integer $r$, leading to multiple sample-wise descent and non-trivial behavior at multiple scales. We include a Colab notebook (available at https://tinyurl.com/2nzym7ym) that reproduces the essential results of the paper.
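
A hedged numerical sketch of the regime discussed above: kernel ridge regression with a simple polynomial dot-product kernel on data drawn uniformly from the sphere, with the sample size m swept through values around $d^2/2!$. The kernel, the noise-free degree-2 target and the small sizes are illustrative choices and differ from the paper's isotropic random-label setting; the authors' own Colab notebook (linked above) reproduces their actual results.

# Minimal sketch (illustrative sizes and target, not the paper's exact setting):
# kernel ridge regression with a dot-product kernel on sphere data, sweeping the
# number of samples m through the second-order regime m ~ d**2.
import numpy as np

rng = np.random.default_rng(7)
d, ridge, n_test = 20, 1e-6, 1000

def sphere(n):
    X = rng.standard_normal((n, d))
    return X / np.linalg.norm(X, axis=1, keepdims=True)

kernel = lambda A, B: (1.0 + A @ B.T) ** 3           # a simple dot-product kernel

beta = rng.standard_normal(d)
target = lambda X: (X @ beta) ** 2                   # degree-2 target (illustrative)

Xte = sphere(n_test)
yte = target(Xte)

for m in (50, 100, int(d ** 2 / 2), 400, 800):       # sweep m around d**2 / 2! = 200
    Xtr = sphere(m)
    ytr = target(Xtr)
    alpha = np.linalg.solve(kernel(Xtr, Xtr) + ridge * np.eye(m), ytr)
    err = np.mean((kernel(Xte, Xtr) @ alpha - yte) ** 2)
    print(f"m = {m:4d}, test error = {err:.4f}")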

Two-layer neural network on infinite-dimensional data: global optimization guarantee in the mean-field regime

Naoki Nishikawa et al J. Stat. Mech. (2023) 114007

The analysis of neural network optimization in the mean-field regime is important as the setting allows for feature learning. The existing theory has been developed mainly for neural networks in finite dimensions, i.e. each neuron has a finite-dimensional parameter. However, the setting of infinite-dimensional input naturally arises in machine learning problems such as nonparametric functional data analysis and graph classification. In this paper, we develop a new mean-field analysis of a two-layer neural network in an infinite-dimensional parameter space. We first give a generalization error bound, which shows that the regularized empirical risk minimizer properly generalizes when the data size is sufficiently large, despite the neurons being infinite-dimensional. Next, we present two gradient-based optimization algorithms for infinite-dimensional mean-field networks, by extending the recently developed particle optimization framework to the infinite-dimensional setting. We show that the proposed algorithms converge to the (regularized) global optimal solution, and moreover, their rates of convergence are of polynomial order in the online setting and exponential order in the finite sample setting, respectively. To the best of our knowledge, this is the first quantitative global optimization guarantee of a neural network on infinite-dimensional input and in the presence of feature learning.
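
A finite-dimensional caricature of the particle viewpoint behind such mean-field analyses (and emphatically not the paper's infinite-dimensional algorithms): a two-layer network represented as a cloud of neuron 'particles' updated by noisy gradient descent with weight decay, a basic mean-field Langevin-type scheme. Input dimension, particle number, step size and noise level are all arbitrary choices for the example.

# Finite-dimensional caricature (not the paper's algorithm): a two-layer mean-field
# network seen as a cloud of neuron "particles", trained with noisy gradient descent.
import numpy as np

rng = np.random.default_rng(8)
d, n_particles, n_data = 10, 500, 200
lr, noise, wd, steps = 0.2, 1e-3, 1e-3, 3000

X = rng.standard_normal((n_data, d))
y = np.tanh(X @ rng.standard_normal(d) / np.sqrt(d))

theta = rng.standard_normal((n_particles, d)) / np.sqrt(d)   # each row is one neuron "particle"

def predict(theta, X):
    return np.mean(np.tanh(X @ theta.T), axis=1)             # empirical mean over particles

for t in range(steps):
    pre = X @ theta.T                                        # (n_data, n_particles)
    err = np.mean(np.tanh(pre), axis=1) - y                  # (n_data,)
    grad = ((1 - np.tanh(pre) ** 2) * err[:, None]).T @ X / n_data   # per-particle gradient
    theta -= lr * (grad + wd * theta)                        # gradient step with weight decay
    theta += np.sqrt(2 * lr * noise) * rng.standard_normal(theta.shape)  # Langevin noise

print("final training error:", np.mean((predict(theta, X) - y) ** 2))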

Open access
The dynamics of representation learning in shallow, non-linear autoencoders

Maria Refinetti and Sebastian Goldt J. Stat. Mech. (2023) 114010

Autoencoders are the simplest neural networks for unsupervised learning, and thus an ideal framework for studying feature learning. While a detailed understanding of the dynamics of linear autoencoders has recently been obtained, the study of non-linear autoencoders has been hindered by the technical difficulty of handling training data with non-trivial correlations—a fundamental prerequisite for feature extraction. Here, we study the dynamics of feature learning in non-linear, shallow autoencoders. We derive a set of asymptotically exact equations that describe the generalisation dynamics of autoencoders trained with stochastic gradient descent (SGD) in the limit of high-dimensional inputs. These equations reveal that autoencoders learn the leading principal components of their inputs sequentially. An analysis of the long-time dynamics explains the failure of sigmoidal autoencoders to learn with tied weights, and highlights the importance of training the bias in ReLU autoencoders. Building on previous results for linear networks, we analyse a modification of the vanilla SGD algorithm, which allows learning of the exact principal components. Finally, we show that our equations accurately describe the generalisation dynamics of non-linear autoencoders trained on realistic datasets such as CIFAR10, thus establishing shallow autoencoders as an instance of the recently observed Gaussian universality.
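
A minimal sketch of the phenomenon described above, with illustrative sizes and a planted low-rank structure rather than the paper's high-dimensional setup: a shallow autoencoder with untied weights is trained by SGD and its decoder is compared with the leading principal subspace of the data. For simplicity the hidden code is kept linear, whereas the paper studies non-linear (sigmoidal and ReLU) autoencoders.

# Minimal sketch (illustrative, not the paper's setup): a shallow autoencoder trained
# with SGD on data with a planted low-rank structure, compared with the leading
# principal components of the data.
import numpy as np

rng = np.random.default_rng(9)
d, k, n = 100, 4, 5000
U, _ = np.linalg.qr(rng.standard_normal((d, k)))        # planted principal subspace
X = (rng.standard_normal((n, k)) * np.array([4.0, 3.0, 2.0, 1.5])) @ U.T \
    + 0.3 * rng.standard_normal((n, d))

W_enc = 0.01 * rng.standard_normal((k, d))              # encoder
W_dec = 0.01 * rng.standard_normal((d, k))              # decoder (untied weights)
lr, epochs = 0.01, 20

for epoch in range(epochs):
    for x in X[rng.permutation(n)]:                     # one SGD pass per epoch
        h = W_enc @ x                                   # hidden code
        err = W_dec @ h - x                             # reconstruction error
        g_dec = np.outer(err, h)
        g_enc = np.outer(W_dec.T @ err, x)
        W_dec -= lr * g_dec
        W_enc -= lr * g_enc

# overlap of the learned decoder with the planted principal subspace (1 = perfect recovery)
Q, _ = np.linalg.qr(W_dec)
overlap = np.linalg.norm(U.T @ Q) ** 2 / k
print("subspace overlap with the leading principal components:", round(overlap, 3))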

Open access
How a student becomes a teacher: learning and forgetting through spectral methods

Lorenzo Giambagli et al J. Stat. Mech. (2024) 034002

In theoretical machine learning, the teacher–student paradigm is often employed as an effective metaphor for real-life tuition. A student network is trained on data generated by a fixed teacher network until it matches the instructor's ability to cope with the assigned task. The above scheme proves particularly relevant when the student network is overparameterized (namely, when larger layer sizes are employed) as compared to the underlying teacher network. Under these operating conditions, it is tempting to speculate that the student's ability to handle the given task could eventually be stored in a sub-portion of the whole network. This sub-portion should be to some extent reminiscent of the frozen teacher structure, according to suitable metrics, while being approximately invariant across different architectures of the candidate student network. Unfortunately, state-of-the-art conventional learning techniques could not help in identifying the existence of such an invariant subnetwork, due to the inherent degree of non-convexity that characterizes the examined problem. In this work, we take a decisive leap forward by proposing a radically different optimization scheme which builds on a spectral representation of the linear transfer of information between layers. The gradient is hence calculated with respect to both eigenvalues and eigenvectors, with a negligible increase in computational cost and complexity compared to standard training algorithms. Working in this framework, we could isolate a stable student substructure that mirrors the true complexity of the teacher in terms of computing neurons, path distribution and topological attributes. When pruning unimportant nodes of the trained student, following a ranking that reflects the optimized eigenvalues, no degradation in the recorded performance is seen above a threshold that corresponds to the effective teacher size. The observed behavior can be pictured as a genuine second-order phase transition that bears universality traits. Code is available at: https://github.com/Jamba15/Spectral-regularization-teacher-student/tree/master.
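
A crude stand-in for the pruning experiment described above (not the spectral training scheme itself): an overparameterized student is trained on teacher-generated data with plain gradient descent, its hidden nodes are ranked by a simple magnitude-based importance score rather than the paper's optimized eigenvalues, and the error is recorded as nodes are removed. All sizes and hyperparameters are arbitrary; the authors' own code is linked above.

# Crude stand-in for the pruning experiment above (not the paper's spectral method):
# train an overparameterized student on a small teacher, then prune student hidden
# nodes by a simple magnitude ranking and record the error as nodes are removed.
import numpy as np

rng = np.random.default_rng(10)
d, k_teacher, k_student, n = 20, 5, 40, 2000
lr, epochs = 0.05, 2000

Wt = rng.standard_normal((k_teacher, d)) / np.sqrt(d)
at = rng.standard_normal(k_teacher)
X = rng.standard_normal((n, d))
y = np.tanh(X @ Wt.T) @ at                       # teacher labels

Ws = rng.standard_normal((k_student, d)) / np.sqrt(d)
a_s = 0.1 * rng.standard_normal(k_student)

for epoch in range(epochs):                      # full-batch gradient descent on the squared loss
    H = np.tanh(X @ Ws.T)                        # (n, k_student)
    err = H @ a_s - y
    a_s -= lr * H.T @ err / n
    Ws -= lr * ((1 - H ** 2) * a_s * err[:, None]).T @ X / n

rank = np.argsort(-np.abs(a_s) * np.linalg.norm(Ws, axis=1))   # simple importance score
for keep in (40, 20, 10, 5, 3):
    idx = rank[:keep]
    pruned_err = np.mean((np.tanh(X @ Ws[idx].T) @ a_s[idx] - y) ** 2)
    print(f"keeping {keep:2d} student nodes, training error = {pruned_err:.4f}")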