Machine Learning and Statistical Physics: Theory, Inspiration, Application

Guest Editors

Elena Agliari Sapienza Università di Roma and Istituto Nazionale d'Alta Matematica, Italy
Adriano Barra Dipartimento di Matematica e Fisica "Ennio De Giorgi", Università del Salento; Istituto Nazionale d'Alta Matematica and Istituto Nazionale di Fisica Nucleare, Italy
Peter Sollich Department of Mathematics, King's College London, UK and Institut für Theoretische Physik, Georg-August-Universität Göttingen, Germany
Lenka Zdeborová Institut de Physique Théorique, CEA/SACLAY, France

Scope

This issue is intended to provide a picture of the state of the art and open challenges in machine learning, from a statistical physics perspective. We focus here on three crucial and, to some extent, complementary aspects. These are reflected in the proposed subtitle of the volume:

  • Theory: methods from mathematical and theoretical physics, including in particular statistical physics, are being deployed to analyse theoretically the performance of many machine learning approaches, which can lead to improvements over existing algorithms or a better understanding of the conditions required for good performance.
  • Inspiration: methods developed in statistical physics have inspired—and continue to inspire—machine learning algorithms, for example through the use of mean-field theory and its variants.
  • Application: machine learning techniques have recently come to the fore in solving problems in statistical and more generally theoretical physics, ranging from the automatic detection of phases of matter to learning efficient representations of quantum wave functions.

We hope that the contributions to the special issue will form a compendium of the state of the art in machine learning and its connection to statistical physics. We emphasize to authors the importance of also setting out the open challenges in the field, to inspire new approaches and further studies toward a fuller understanding of this branch of science as a whole.

The issue will be open to submissions until 31 December 2019 and you can submit manuscripts through ScholarOne Manuscripts.

Editorial

Topical Reviews

Open access
Mean-field inference methods for neural networks

Marylou Gabrié 2020 J. Phys. A: Math. Theor. 53 223002

Machine learning algorithms relying on deep neural networks have recently allowed a great leap forward in artificial intelligence. Despite the popularity of their applications, the efficiency of these algorithms remains largely unexplained from a theoretical point of view. The mathematical description of learning problems involves very large collections of interacting random variables, which are difficult to handle analytically as well as numerically. This complexity is precisely the object of study of statistical physics, whose mission, originally directed at natural systems, is to understand how macroscopic behaviors arise from microscopic laws. Mean-field methods are one type of approximation strategy developed in this spirit. We review a selection of classical mean-field methods and recent progress relevant for inference in neural networks. In particular, we recall the principles behind the derivations of high-temperature expansions, the replica method and message passing algorithms, highlighting their equivalences and complementarities. We also provide references for past and current directions of research on neural networks relying on mean-field methods.
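
As a concrete illustration of the simplest members of this family, the sketch below iterates the naive mean-field and TAP (Thouless–Anderson–Palmer) self-consistency equations for the magnetizations of a small Ising model with randomly generated couplings; the model, sizes and normalizations are illustrative assumptions, not taken from the review.

```python
# Naive mean-field and TAP fixed-point iterations for an Ising model:
#   m_i = tanh(beta * (h_i + sum_j J_ij m_j))                        (naive MF)
#   m_i = tanh(beta * (h_i + sum_j J_ij m_j
#              - beta * m_i * sum_j J_ij**2 * (1 - m_j**2)))         (TAP)
import numpy as np

rng = np.random.default_rng(0)
N, beta = 20, 0.5
J = rng.normal(0.0, 1.0 / np.sqrt(N), size=(N, N))
J = (J + J.T) / 2.0
np.fill_diagonal(J, 0.0)                 # no self-couplings
h = rng.normal(0.0, 0.1, size=N)

def iterate(tap=False, n_iter=500, damping=0.5):
    m = np.zeros(N)
    for _ in range(n_iter):
        field = h + J @ m
        if tap:                          # Onsager reaction term
            field -= beta * m * (J**2 @ (1.0 - m**2))
        m_new = np.tanh(beta * field)
        m = damping * m + (1.0 - damping) * m_new   # damped update
    return m

m_mf, m_tap = iterate(tap=False), iterate(tap=True)
print("naive MF magnetizations:", np.round(m_mf[:5], 3))
print("TAP magnetizations:     ", np.round(m_tap[:5], 3))
```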

Papers

Replica symmetry breaking in neural networks: a few steps toward rigorous results

Elena Agliari et al 2020 J. Phys. A: Math. Theor. 53 415005

In this paper we adapt the broken replica interpolation technique (developed by Francesco Guerra to deal with the Sherrington-Kirkpatrick model, namely a pairwise mean-field spin glass whose couplings are i.i.d. standard Gaussian variables) to work also with the Hopfield model (i.e. a pairwise mean-field neural network whose couplings are drawn according to Hebb's learning rule): this is accomplished by grafting Guerra's telescopic averages onto the transport-equation technique recently developed by some of the authors. As an overture, we apply the technique to solve the Sherrington-Kirkpatrick model with i.i.d. Gaussian couplings centered at J0 and with finite variance J; the mean J0 provides a ferromagnetic contribution to be detected in a noisy environment tuned by J, making this model a natural test case to be investigated before addressing the Hopfield model. For both models, an explicit expression of the quenched free energy in terms of the natural order parameters is obtained at the Kth step (K arbitrary, but finite) of replica symmetry breaking. In particular, for the Hopfield model, by assuming that the overlaps respect Parisi's decomposition (following the ziqqurat ansatz) and that the Mattis magnetization is self-averaging, we recover previous results obtained via the replica trick by Amit, Crisanti and Gutfreund (1RSB) and by Steffan and Kühn (2RSB).
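
For orientation, one common way of writing the test-case model above (an assumed normalization, not necessarily the convention adopted in the paper) separates the ferromagnetic mean of the couplings from their Gaussian fluctuations:

```latex
% SK Hamiltonian with couplings of mean J_0 and fluctuation strength J (one common convention)
H_N(\sigma) = -\frac{J_0}{N}\sum_{i<j}\sigma_i\sigma_j
              -\frac{1}{\sqrt{N}}\sum_{i<j}\tilde{J}_{ij}\,\sigma_i\sigma_j ,
\qquad \tilde{J}_{ij} \sim \mathcal{N}(0, J^2)\ \text{i.i.d.},
\qquad
f_N(\beta) = -\frac{1}{\beta N}\,\mathbb{E}\,\ln \sum_{\sigma \in \{-1,+1\}^N} e^{-\beta H_N(\sigma)} .
```

The quenched free energy $f_N(\beta)$ is the quantity whose K-step RSB expression the paper derives for this model and for the Hopfield network.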

A minority of self-organizing autonomous vehicles significantly increase freeway traffic flow

Amir Goldental and Ido Kanter 2020 J. Phys. A: Math. Theor. 53 414001

This study investigates the dynamics of traffic containing human-driven vehicles along with a fraction of self-organized artificial intelligence (AI) autonomous vehicles (AVs) on multilane freeways. We propose guidelines for the development of AI agents, such that a small fraction of AVs forms local constellations that significantly accelerate the entire traffic flow while reducing fuel consumption and increasing safety. Specifically, we report a 40% enhancement in traffic flow efficiency and up to a 28% reduction in fuel consumption even when only 5% of vehicles are autonomous. This scenario does not require changes to current infrastructure or communication between vehicles; it only requires proper regulations. The results indicate that more efficient, safer, faster, and greener traffic flow can be realized in the near future.

Open access
Capacity of the covariance perceptron

David Dahmen et al 2020 J. Phys. A: Math. Theor. 53 354002

The classical perceptron is a simple neural network that performs a binary classification by a linear mapping between static inputs and outputs and application of a threshold. For small inputs, neural networks in a stationary state also perform an effectively linear input–output transformation, but of an entire time series. Choosing the temporal mean of the time series as the feature for classification, the linear transformation of the network with subsequent thresholding is equivalent to the classical perceptron. Here we show that choosing covariances of time series as the feature for classification maps the neural network to what we call a 'covariance perceptron', a mapping between covariances that is bilinear in the weights. By extending Gardner's theory of connections to this bilinear problem, using a replica symmetric mean-field theory, we compute the pattern and information capacities of the covariance perceptron in the infinite-size limit. Closed-form expressions reveal superior pattern capacity in the binary classification task compared to the classical perceptron in the case of a high-dimensional input and low-dimensional output. For less convergent networks, the mean perceptron classifies a larger number of stimuli. However, since covariances span a much larger input and output space than means, the amount of stored information in the covariance perceptron exceeds that of the classical counterpart. For strongly convergent connectivity it is superior by a factor equal to the number of input neurons. Theoretical calculations are validated numerically for finite-size systems using a gradient-based optimization of a soft margin, as well as numerical solvers for the NP-hard quadratically constrained quadratic programming problem to which training can be mapped.
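
The bilinearity mentioned above is easy to see: for a linear readout y = Wx, input covariances propagate as Q_out = W Q_in W^T, i.e. quadratically in the weights. A minimal numerical check (made-up dimensions, purely illustrative):

```python
# For a linear readout y = W x, the covariance is mapped bilinearly in W:
#   Q_out = W Q_in W^T
import numpy as np

rng = np.random.default_rng(1)
n_in, n_out, n_samples = 50, 3, 100_000

W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
A = rng.normal(size=(n_in, n_in))
Q_in = A @ A.T / n_in                      # a valid input covariance matrix

# draw inputs with covariance Q_in and push them through the linear map
x = rng.multivariate_normal(np.zeros(n_in), Q_in, size=n_samples)
y = x @ W.T

Q_out_empirical = np.cov(y, rowvar=False)
Q_out_bilinear = W @ Q_in @ W.T
print(np.allclose(Q_out_empirical, Q_out_bilinear, atol=2e-2))   # True up to sampling noise
```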

Blind calibration for compressed sensing: state evolution and an online algorithm

Marylou Gabrié et al 2020 J. Phys. A: Math. Theor. 53 334004

Compressed sensing allows for the acquisition of compressible signals with a small number of measurements. In experimental settings, the sensing process corresponding to the hardware implementation is not always perfectly known and may require a calibration. To this end, blind calibration performs the calibration and the compressed sensing at the same time. Schülke and collaborators suggested an approach based on approximate message passing for blind calibration (cal-AMP) in (Schülke C et al 2013 Advances in Neural Information Processing Systems 26 1–9 and Schülke C et al 2015 J. Stat. Mech. P11013). Here, their algorithm is extended from the already proposed offline case to the online case, in which the calibration is refined step by step as new measured samples are received. We show that the performance of both the offline and the online algorithms can be theoretically studied via the state evolution formalism. Finally, the efficiency of cal-AMP and the consistency of the theoretical predictions are confirmed through numerical simulations.

Machine learning as ecology

Owen Howell et al 2020 J. Phys. A: Math. Theor. 53 334001

Machine learning methods have had spectacular success on numerous problems. Here we show that a prominent class of learning algorithms—including support vector machines (SVMs)—have a natural interpretation in terms of ecological dynamics. We use these ideas to design new online SVM algorithms that exploit ecological invasions, and benchmark performance using the MNIST dataset. Our work provides a new ecological lens through which we can view statistical learning and opens the possibility of designing ecosystems for machine learning.
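
For readers who want a concrete point of comparison, the snippet below trains a standard (non-ecological) soft-margin SVM with scikit-learn on its small digits dataset; this is merely the conventional classifier that ecology-inspired online algorithms would be benchmarked against (the paper itself uses MNIST), and all parameter choices here are illustrative.

```python
# Standard soft-margin SVM baseline on a small digits dataset.
# This is NOT the ecology-inspired online algorithm of the paper,
# just the conventional classifier it is compared against.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = SVC(kernel="rbf", C=10.0, gamma="scale")   # soft-margin SVM
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```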

Analysis of Bayesian inference algorithms by the dynamical functional approach

Burak Çakmak and Manfred Opper 2020 J. Phys. A: Math. Theor. 53 274001

We analyze the dynamics of an algorithm for approximate inference with large Gaussian latent variable models in a student–teacher scenario. To model nontrivial dependencies between the latent variables, we assume random covariance matrices drawn from rotation-invariant ensembles. For the case of perfect data-model matching, the knowledge of static order parameters derived from the replica method allows us to obtain efficient algorithmic updates in terms of matrix–vector multiplications with a fixed matrix. Using the dynamical functional approach, we obtain an exact effective stochastic process in the thermodynamic limit for a single node. From this, we obtain closed-form expressions for the rate of convergence. Analytical results are in excellent agreement with simulations of single instances of large models.

Learning quantum models from quantum or classical data

H J Kappen 2020 J. Phys. A: Math. Theor. 53 214001

In this paper, we address the problem of how to represent a classical data distribution in a quantum system. The proposed method is to learn a quantum Hamiltonian whose ground state approximates the given classical distribution. We review previous work on the quantum Boltzmann machine (QBM) (Kieferová M and Wiebe N 2017 Phys. Rev. A 96 062327, Amin M H et al 2018 Phys. Rev. X 8 021050) and how it can be used to infer quantum Hamiltonians from quantum statistics. We then show how the proposed quantum learning formalism can also be applied to a purely classical data analysis. Representing the data as a rank-one density matrix introduces quantum statistics for classical data in addition to the classical statistics. We show that quantum learning yields results that can be significantly more accurate than the classical maximum likelihood approach, both for unsupervised learning and for classification. The data density matrix and the QBM solution show entanglement, quantified by the quantum mutual information I. The classical mutual information in the data satisfies Ic ≤ I/2 = C, with C the maximal classical correlation obtained by choosing a suitable orthogonal measurement basis. We suggest that the remaining mutual information Q = I/2 is obtained by non-orthogonal measurements that may violate the Bell inequality. The excess mutual information I − Ic may potentially be used to improve the performance of quantum implementations of machine learning or other statistical methods.
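
To make the rank-one construction concrete, one natural choice (an assumption here, not necessarily the paper's exact prescription) is the pure state whose amplitudes are the square roots of the classical probabilities. The sketch below builds it for a two-bit distribution and compares the classical mutual information of the data with the quantum mutual information of the resulting density matrix.

```python
# A rank-one density matrix from a classical distribution p(x) over two bits:
#   |psi> = sum_x sqrt(p(x)) |x>,  rho = |psi><psi|
# For a pure bipartite state, I_quantum = 2 * S(rho_A).
import numpy as np

p = np.array([[0.4, 0.1],
              [0.1, 0.4]])                  # joint distribution p(x1, x2)

def entropy(w):
    w = w[w > 1e-12]
    return -np.sum(w * np.log2(w))

# classical mutual information of p
pA, pB = p.sum(axis=1), p.sum(axis=0)
I_classical = entropy(pA) + entropy(pB) - entropy(p.ravel())

# rank-one density matrix and its reduced state on the first bit
psi = np.sqrt(p).ravel()                    # amplitudes sqrt(p(x))
rho = np.outer(psi, psi)
rho_A = rho.reshape(2, 2, 2, 2).trace(axis1=1, axis2=3)   # partial trace over bit 2
I_quantum = 2 * entropy(np.linalg.eigvalsh(rho_A))

print(f"classical I = {I_classical:.3f} bits")
print(f"quantum  I  = {I_quantum:.3f} bits")   # exceeds the classical value here
```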

Gaussian-spherical restricted Boltzmann machines

Aurélien Decelle and Cyril Furtlehner 2020 J. Phys. A: Math. Theor. 53 184002

We consider a special type of restricted Boltzmann machine (RBM), namely a Gaussian-spherical RBM where the visible units have Gaussian priors while the vector of hidden variables is constrained to stay on an ${\mathbb{L}}_{2}$ sphere. Since the spherical constraint has the advantage of admitting exact asymptotic treatment, various scaling regimes are explicitly identified based solely on the spectral properties of the coupling matrix (also called the weight matrix of the RBM). Incidentally, these happen to be formally related to similar scaling behaviors obtained in a different context dealing with spatial condensation of zero-range processes. More specifically, when the spectrum of the coupling matrix is doubly degenerate, an exact treatment can be proposed to deal with finite-size effects. Interestingly, the known parallel between the ferromagnetic transition of the spherical model and Bose–Einstein condensation can be made explicit in that case. More importantly, this gives us the ability to extract all needed response functions with arbitrary precision for the training algorithm of the RBM. This then allows us to numerically integrate the dynamics of the spectrum of the weight matrix during learning in a precise way. This dynamics reveals in particular a sequential emergence of modes from the Marchenko–Pastur bulk of singular vectors of the coupling matrix.

How to iron out rough landscapes and get optimal performances: averaged gradient descent and its application to tensor PCA

Giulio Biroli et al 2020 J. Phys. A: Math. Theor. 53 174003

In many high-dimensional estimation problems the main task consists in minimizing a cost function, which is often strongly non-convex when scanned in the space of parameters to be estimated. A standard solution to flatten the corresponding rough landscape consists in summing the losses associated with different data points and obtaining a smoother empirical risk. Here we propose a complementary method that works for a single data point. The main idea is that a large amount of the roughness is uncorrelated in different parts of the landscape. One can then substantially reduce the noise by evaluating an empirical average of the gradient obtained as a sum over many random independent positions in the space of parameters to be optimized. We present an algorithm, called averaged gradient descent, based on this idea and apply it to tensor PCA, which is a very hard estimation problem. We show that averaged gradient descent outperforms physical algorithms such as gradient descent and approximate message passing and matches the best algorithmic thresholds known so far, obtained by tensor unfolding and methods based on sum-of-squares.
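
As a point of reference for the baseline mentioned above, this toy sketch sets up a spiked p = 3 tensor and runs plain gradient ascent on the sphere; sizes and the signal-to-noise ratio are made up, and the paper's averaged gradient descent differs in that it replaces the single-point gradient by an average over many random positions in parameter space (not implemented here).

```python
# Spiked tensor PCA (p = 3): T = lam * v (x) v (x) v + noise.
# Plain gradient ascent on the sphere is the "physical" baseline; whether it
# recovers v from a random start depends on lam and N, which is exactly the
# rough-landscape difficulty that gradient averaging is designed to address.
import numpy as np

rng = np.random.default_rng(2)
N, lam, steps, lr = 60, 5.0, 500, 0.05

v = rng.normal(size=N); v /= np.linalg.norm(v)
noise = rng.normal(size=(N, N, N)) / np.sqrt(N)
T = lam * np.einsum("i,j,k->ijk", v, v, v) + noise

def ascent_direction(x):
    # contraction sum_jk T[i, j, k] x_j x_k; for a symmetric T this is
    # proportional to the gradient of sum_ijk T_ijk x_i x_j x_k
    return np.einsum("ijk,j,k->i", T, x, x)

x = rng.normal(size=N); x /= np.linalg.norm(x)
for _ in range(steps):
    x = x + lr * ascent_direction(x)
    x /= np.linalg.norm(x)          # project back onto the sphere

print("overlap |<x, v>| =", abs(x @ v))
```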

'Place-cell' emergence and learning of invariant data with restricted Boltzmann machines: breaking and dynamical restoration of continuous symmetries in the weight space

Moshir Harsh et al 2020 J. Phys. A: Math. Theor. 53 174002

Distributions of data or sensory stimuli often enjoy underlying invariances. How and to what extent those symmetries are captured by unsupervised learning methods is a relevant question in machine learning and in computational neuroscience. We study here, through a combination of numerical and analytical tools, the learning dynamics of restricted Boltzmann machines (RBM), a neural network paradigm for representation learning. As learning proceeds from a random configuration of the network weights, we show the existence of, and characterize, a symmetry-breaking phenomenon in which the latent variables acquire receptive fields focusing on limited parts of the invariant manifold supporting the data. The symmetry is restored at large learning times through the diffusion of the receptive field over the invariant manifold; hence, the RBM effectively spans a continuous attractor in the space of network weights. This symmetry-breaking phenomenon takes place only if the amount of data available for training exceeds some critical value, depending on the network size and the intensity of symmetry-induced correlations in the data; below this 'retarded-learning' threshold, the network weights are essentially noisy and overfit the data.

Open access
On the universality of noiseless linear estimation with respect to the measurement matrix

Alia Abbara et al 2020 J. Phys. A: Math. Theor. 53 164001

In a noiseless linear estimation problem, the goal is to reconstruct a sparse vector x* from the knowledge of its linear projections y = Φx*. There have been many theoretical works concentrating on the case where the matrix Φ is a random i.i.d. one, but a body of heuristic evidence suggests that many of these results are universal and extend well beyond this restricted case. Here we revisit this problem through the prism of the development of message passing methods, and consider not only the universality of the ℓ1 transition, as previously addressed, but also that of the optimal Bayesian reconstruction. We observe that the universality extends to the Bayes-optimal minimum mean-squared error (MMSE), and to a range of structured matrices.

Dense limit of the Dawid–Skene model for crowdsourcing and regions of sub-optimality of message passing algorithms

Christian Schmidt and Lenka Zdeborová 2020 J. Phys. A: Math. Theor. 53 124001

Crowdsourcing is a strategy to categorize data through the contribution of many individuals. A wide range of theoretical and algorithmic contributions are based on the model of Dawid and Skene. Recently it was shown in the work of Ok et al that, in certain regimes, belief propagation is optimal for data generated from the Dawid–Skene model. This paper is motivated by this recent progress. We analyze a noisy dense limit of the Dawid–Skene model that has so far remained open. It is shown that it belongs to a larger class of low-rank matrix estimation problems for which it is possible to express the Bayes-optimal performance for large system sizes in a simple closed form. In the dense limit the mapping to a low-rank matrix estimation problem provides an approximate message passing algorithm that solves the problem algorithmically. We identify the regions where the algorithm efficiently computes the Bayes-optimal estimates. Our analysis further refines the results of Ok et al about the optimality of message passing algorithms by characterizing regions of parameters where these algorithms do not match the Bayes-optimal performance. In addition, we study numerically the performance of approximate message passing, derived in the dense limit, on sparse instances and carry out experiments on a real-world dataset.
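
For readers unfamiliar with the model, the sketch below samples data from a binary 'one-coin' simplification of the Dawid–Skene model (each worker answers correctly with a fixed individual probability; the full model uses per-worker confusion matrices) and scores a simple majority-vote baseline; all sizes are made up. The dense limit analysed in the paper corresponds to every worker labelling every item.

```python
# Binary "one-coin" Dawid-Skene model: item labels y_i in {-1, +1};
# worker j reports the true label with probability q_j, flips it otherwise.
import numpy as np

rng = np.random.default_rng(3)
n_items, n_workers = 1000, 30

y = rng.choice([-1, 1], size=n_items)              # true labels
q = rng.uniform(0.5, 0.9, size=n_workers)          # worker reliabilities

# dense regime: every worker labels every item
flip = rng.random((n_items, n_workers)) > q        # True where the worker errs
answers = y[:, None] * np.where(flip, -1, 1)

# baseline estimator: unweighted majority vote over workers
y_hat = np.sign(answers.sum(axis=1))
print("majority-vote accuracy:", np.mean(y_hat == y))
```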

Open access
Large deviation analysis of function sensitivity in random deep neural networks

Bo Li and David Saad 2020 J. Phys. A: Math. Theor. 53 104002

Mean-field theory has been successfully used to analyze deep neural networks (DNN) in the infinite-size limit. Given the finite size of realistic DNN, we utilize large deviation theory and path integral analysis to study the deviation of functions represented by DNN from their typical mean-field solutions. The parameter perturbations investigated include weight sparsification (dilution) and binarization, which are commonly used in model simplification, for both ReLU and sign activation functions. We find that random networks with ReLU activation are more robust to parameter perturbations than their counterparts with sign activation, which arguably is reflected in the simplicity of the functions they generate.

Legendre equivalences of spherical Boltzmann machines

Giuseppe Genovese and Daniele Tantari 2020 J. Phys. A: Math. Theor. 53 094001

We study both fully visible and restricted Boltzmann machines with sub-Gaussian random weights and spherical or Gaussian priors. We prove that the free energies of the spherical and Gaussian models are related by a Legendre transformation. Incidentally, our analysis also yields a new, purely variational derivation of the free energy of the spherical models.

Interpolating between boolean and extremely high noisy patterns through minimal dense associative memories

Francesco Alemanno et al 2020 J. Phys. A: Math. Theor. 53 074001

Recently, Hopfield and Krotov introduced the concept of dense associative memories [DAM] (close to spin glasses with P-wise interactions in disordered statistical mechanics jargon): they proved a number of remarkable features these networks share and suggested their use to (partially) explain the success of the new generation of artificial intelligence. Thanks to a remarkable ante-litteram analysis by Baldi & Venkatesh, among these properties, it is known that these networks can handle a maximal amount of stored patterns K scaling as $N^{P-1}$.

In this paper, having introduced a minimal dense associative network as one of the most elementary cost functions falling in this class of DAM, we sacrifice this high-load regime, namely we force the storage of solely a linear amount of patterns, i.e. $K = \alpha N$, to prove that, in this regime, these networks can correctly perform pattern recognition even if the pattern signal is embedded in a sea of noise, also in the large-N limit. To prove this statement, by extremizing the quenched free energy of the model over its natural order parameters (the various magnetizations and overlaps), we derive its phase diagram at the replica-symmetric level of description and in the thermodynamic limit: as a sideline, we stress that, to achieve this task, aiming at cross-fertilization among disciplines, we follow two mainstream routes in the statistical mechanics of spin glasses, namely the replica trick and the interpolation technique.

Both approaches reach the same conclusion: there is a non-empty region in the noise $T$ versus load $\alpha$ plane of the phase diagram where these networks can actually work in this challenging regime; in particular, we obtain a rather high critical (linear) load in the (fast) noiseless case.

Parameter estimation for biochemical reaction networks using Wasserstein distances

Kaan Öcal et al 2020 J. Phys. A: Math. Theor. 53 034002

We present a method for estimating parameters in stochastic models of biochemical reaction networks by fitting steady-state distributions using Wasserstein distances. We simulate a reaction network at different parameter settings and train a Gaussian process to learn the Wasserstein distance between observations and the simulator output for all parameters. We then use Bayesian optimization to find parameters minimizing this distance based on the trained Gaussian process. The effectiveness of our method is demonstrated on the three-stage model of gene expression and a genetic feedback loop for which moment-based methods are known to perform poorly. Our method is applicable to any simulator model of stochastic reaction networks, including Brownian dynamics.
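
A minimal sketch of the fitting idea, on a toy one-parameter birth–death process whose steady state is known to be Poisson (so it can be sampled directly instead of running a stochastic simulator), and with a plain grid plus Gaussian-process surrogate standing in for full Bayesian optimization; all names and numbers are illustrative assumptions.

```python
# Toy version of the Wasserstein-distance fitting idea: a birth-death process
# with birth rate k and unit per-molecule death rate has a Poisson(k) steady
# state, so we sample it directly (a real application would run a simulator).
# We measure the 1d Wasserstein distance to the observed data, fit a Gaussian
# process to distance-vs-parameter, and take the minimizer of the GP mean.
import numpy as np
from scipy.stats import wasserstein_distance
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(4)

def steady_state_samples(k, n_samples=500):
    return rng.poisson(k, size=n_samples).astype(float)

k_true = 8.0
observed = steady_state_samples(k_true)              # "experimental" data

# distance between data and simulator output on a grid of candidate k
k_grid = np.linspace(1.0, 20.0, 15)
d_grid = np.array([wasserstein_distance(observed, steady_state_samples(k))
                   for k in k_grid])

# Gaussian-process surrogate of the distance landscape
gp = GaussianProcessRegressor(kernel=RBF(length_scale=5.0) + WhiteKernel(0.1),
                              normalize_y=True)
gp.fit(k_grid[:, None], d_grid)

k_fine = np.linspace(1.0, 20.0, 400)
k_hat = k_fine[np.argmin(gp.predict(k_fine[:, None]))]
print(f"true k = {k_true}, estimated k = {k_hat:.2f}")
```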

Open access
Empirical Bayes method for Boltzmann machines

Muneki Yasuda and Tomoyuki Obuchi 2020 J. Phys. A: Math. Theor. 53 014004

We consider an empirical Bayes method for Boltzmann machines and propose an algorithm for it. The empirical Bayes method allows for estimation of the values of the hyperparameters of the Boltzmann machine by maximizing a specific likelihood function referred to as the empirical Bayes likelihood function in this study. However, the maximization is computationally hard because the empirical Bayes likelihood function involves intractable integrations of the partition function. The proposed algorithm avoids this computational problem by using the replica method and the Plefka expansion. Our method is quite simple and fast because it does not require any iterative procedures and gives reasonable estimates under certain conditions. However, our method introduces a bias into the estimate, which exhibits an unnatural behavior with respect to the size of the dataset. This peculiar behavior is presumably due to the approximate treatment by the Plefka expansion. A possible extension to overcome this behavior is also discussed.

A jamming transition from under- to over-parametrization affects generalization in deep learning

S Spigler et al 2019 J. Phys. A: Math. Theor. 52 474001

In this paper we first recall the recent result that in deep networks a phase transition, analogous to the jamming transition of granular media, delimits the over- and under-parametrized regimes where fitting can or cannot be achieved. The analysis leading to this result supports the view that, for proper initialization and architectures, poor minima of the loss are not encountered during training anywhere in the over-parametrized regime, because the number of constraints that hinder the dynamics is insufficient to allow for the emergence of stable minima. Next, we study systematically how this transition affects the generalization properties of the network (i.e. its predictive power). As we increase the number of parameters of a given model, starting from an under-parametrized network, we observe for gradient descent that the generalization error displays three phases: (i) initial decay, (ii) increase until the transition point—where it displays a cusp—and (iii) slow decay toward an asymptote as the network width diverges. However, if early stopping is used, the cusp signaling the jamming transition disappears. We thereby identify the region where the classical phenomenon of over-fitting takes place as the vicinity of the jamming transition, and the region where the model keeps improving as the number of parameters increases, thus organizing previous empirical observations made on modern neural networks.

Approximate matrix completion based on cavity method

Chihiro Noguchi and Yoshiyuki Kabashima 2019 J. Phys. A: Math. Theor. 52 424004

In order to solve large matrix completion problems at practical computational cost, approximate approaches based on matrix factorization are widely used. Alternating least squares (ALS) and stochastic gradient descent (SGD) are two major algorithms to this end. In this study, we propose two new algorithms, namely cavity-based matrix factorization (CBMF) and approximate cavity-based matrix factorization (ACBMF), which are developed based on the cavity method from statistical mechanics. ALS yields solutions in fewer iterations than SGD because its update rules are described in a closed form, although each iteration entails a higher computational cost. CBMF also admits closed-form update rules, and its computational cost is lower than that of ALS. ACBMF is proposed to compensate for the relatively high memory cost of CBMF. We experimentally illustrate that the proposed methods outperform the two existing algorithms in terms of convergence speed per iteration and that they can work when relatively few entries are observed. Additionally, in contrast to SGD, (A)CBMF does not require scheduling of the learning rate.
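
For context on the comparison above, a bare-bones version of the standard ALS baseline (ridge-regularized, made-up sizes; this is not the proposed CBMF/ACBMF) looks as follows:

```python
# Alternating least squares for matrix completion: factor R ~ U @ V.T by
# alternately solving ridge-regularized least-squares problems for U and V
# over the observed entries only.
import numpy as np

rng = np.random.default_rng(5)
n_rows, n_cols, rank, lam = 100, 80, 5, 0.1

# synthetic low-rank matrix with 30% of entries observed
U_true = rng.normal(size=(n_rows, rank))
V_true = rng.normal(size=(n_cols, rank))
R = U_true @ V_true.T
mask = rng.random((n_rows, n_cols)) < 0.3

U = rng.normal(scale=0.1, size=(n_rows, rank))
V = rng.normal(scale=0.1, size=(n_cols, rank))

for _ in range(30):
    # update each row of U from its observed entries (closed-form solve)
    for i in range(n_rows):
        obs = mask[i]
        Vi = V[obs]
        U[i] = np.linalg.solve(Vi.T @ Vi + lam * np.eye(rank), Vi.T @ R[i, obs])
    # update each row of V symmetrically
    for j in range(n_cols):
        obs = mask[:, j]
        Uj = U[obs]
        V[j] = np.linalg.solve(Uj.T @ Uj + lam * np.eye(rank), Uj.T @ R[obs, j])

rmse = np.sqrt(np.mean((U @ V.T - R)[~mask] ** 2))
print("RMSE on unobserved entries:", round(rmse, 4))
```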

Open access
Cross validation in sparse linear regression with piecewise continuous nonconvex penalties and its acceleration

Tomoyuki Obuchi and Ayaka Sakata 2019 J. Phys. A: Math. Theor. 52 414003

We investigate the signal reconstruction performance of sparse linear regression in the presence of noise when piecewise continuous nonconvex penalties are used. Among such penalties, we focus on the smoothly clipped absolute deviation (SCAD) penalty. The contributions of this study are three-fold: we first present a theoretical analysis of the typical reconstruction performance, using the replica method, under the assumption that each component of the design matrix is given as an independent and identically distributed (i.i.d.) Gaussian variable. This clarifies the superiority of the SCAD estimator compared with ℓ1 regularization in a wide parameter range, although the nonconvex nature of the penalty tends to lead to solution multiplicity in certain regions. This multiplicity is shown to be connected to replica symmetry breaking in spin-glass theory, and associated phase diagrams are given. We also show that the global minimum of the mean square error between the estimator and the true signal is located in the replica symmetric phase. Second, we develop an approximate formula for efficiently computing the cross-validation error without actually conducting the cross-validation, which is also applicable to non-i.i.d. design matrices. It is shown that this formula is only applicable in the unique solution region and tends to be unstable in the multiple solution region. We implement instability detection procedures, which allow the approximate formula to stand alone and consequently enable us to draw phase diagrams for any specific dataset. Third, we propose an annealing procedure, called nonconvexity annealing, to obtain the solution path efficiently. Numerical simulations on simulated datasets verify the consistency of the theoretical results and the efficiency of the approximate formula and of nonconvexity annealing. The characteristic behaviour of the annealed solution in the multiple solution region is addressed. Another numerical experiment on a real-world dataset of Type Ia supernovae is conducted; its results are consistent with those of earlier studies using the ℓ1 formulation. A MATLAB package of numerical codes implementing the estimation of the solution path, using nonconvexity annealing in conjunction with the approximate CV formula and the instability detection routine, is distributed in Obuchi (2019 https://github.com/T-Obuchi/SLRpackage_AcceleratedCV_matlab).
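
For reference, the SCAD penalty of Fan and Li, written in one standard form (notation may differ from the paper) for a coefficient magnitude θ ≥ 0, regularization strength λ and nonconvexity parameter a > 2, is

```latex
p_{\lambda,a}(\theta) =
\begin{cases}
\lambda\,\theta, & 0 \le \theta \le \lambda,\\[4pt]
\dfrac{2a\lambda\theta - \theta^2 - \lambda^2}{2(a-1)}, & \lambda < \theta \le a\lambda,\\[8pt]
\dfrac{(a+1)\lambda^2}{2}, & \theta > a\lambda .
\end{cases}
```

It coincides with the ℓ1 penalty for small coefficients and saturates to a constant for large ones, which is the origin of both its reduced estimation bias and its nonconvexity.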

Minimal model of permutation symmetry in unsupervised learning

Tianqi Hou et al 2019 J. Phys. A: Math. Theor. 52 414001

Permutation of any two hidden units yields invariant properties in typical deep generative neural networks. This permutation symmetry plays an important role in understanding the computational performance of a broad class of neural networks with two or more hidden units. However, a theoretical study of the permutation symmetry is still lacking. Here, we propose a minimal model with only two hidden units in a restricted Boltzmann machine, which aims to address how the permutation symmetry affects the critical learning data size at which concept formation (or spontaneous symmetry breaking (SSB) in physics language) starts, and moreover semi-rigorously prove a conjecture that the critical data size is independent of the number of hidden units once this number is finite. Remarkably, we find that the embedded correlation between the two receptive fields of the hidden units reduces the critical data size. In particular, weakly correlated receptive fields have the benefit of significantly reducing the minimal data size that triggers the transition, given less noisy data. Inspired by the theory, we also propose an efficient fully distributed algorithm to infer the receptive fields of hidden units. Furthermore, our minimal model reveals that the permutation symmetry can also be spontaneously broken following the SSB. Overall, our results demonstrate that unsupervised learning is a progressive combination of SSB and permutation symmetry breaking, which are both spontaneous processes driven by data streams (observations). All these effects can be analytically probed based on the minimal model, providing theoretical insights towards understanding unsupervised learning in a more general context.

Analysis of overfitting in the regularized Cox model

Mansoor Sheikh and Anthony C C Coolen 2019 J. Phys. A: Math. Theor. 52 384002

The Cox proportional hazards model is ubiquitous in the analysis of time-to-event data. However, when the data dimension p is comparable to the sample size N, maximum likelihood estimates for its regression parameters are known to be biased or to break down entirely due to overfitting. This prompted the introduction of the so-called regularized Cox model. In this paper we use the replica method from statistical physics to investigate the relationship between the true and inferred regression parameters in regularized multivariate Cox regression with L2 regularization, in the regime where both p and N are large but their ratio p/N is fixed. We thereby generalize a recent study from maximum likelihood to maximum a posteriori inference. We also establish a relationship between the optimal regularization parameter and the ratio p/N, allowing for straightforward overfitting corrections in time-to-event analysis.
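
For orientation, the objective behind the regularized Cox model discussed above is the L2-penalized partial log-likelihood, written here in standard notation (event indicators δ_i, event times t_i, covariates x_i, regression parameters β, regularization strength η; the paper's symbols may differ):

```latex
\hat{\boldsymbol{\beta}} = \arg\max_{\boldsymbol{\beta}}
\sum_{i=1}^{N} \delta_i \Big[ \boldsymbol{\beta}^{\top}\mathbf{x}_i
  - \log\!\sum_{j:\, t_j \ge t_i} e^{\boldsymbol{\beta}^{\top}\mathbf{x}_j} \Big]
- \frac{\eta}{2}\,\lVert\boldsymbol{\beta}\rVert_2^2 .
```

The maximum a posteriori inference analysed in the paper corresponds to this penalized maximization, with the unregularized case η = 0 recovering the maximum likelihood estimator.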

Open access
Statistical mechanical analysis of learning dynamics of two-layer perceptron with multiple output units

Yuki Yoshida et al 2019 J. Phys. A: Math. Theor. 52 184002

The plateau phenomenon, wherein the loss value stops decreasing during the process of learning, is troubling. Various studies suggest that the plateau phenomenon is frequently caused by the network being trapped in the singular region on the loss surface, a region that stems from the symmetrical structure of neural networks. However, these studies all deal with networks that have a one-dimensional output, and networks with a multidimensional output are overlooked. This paper uses a statistical mechanical formalization to analyze the dynamics of learning in a two-layer perceptron with multidimensional output. We derive order parameters that capture macroscopic characteristics of connection weights and the differential equations that they follow. We show that singular-region-driven plateaus diminish or vanish with multidimensional output, in a simple setting. We found that the more non-degenerative (i.e. far from one-dimensional output) the model is, the more plateaus are alleviated. Furthermore, we showed theoretically that singular-region-driven plateaus seldom occur in the learning process in the case of orthogonalized initializations.