Table of contents

Papers

Classical statistical mechanics, equilibrium and non-equilibrium

123201

The yielding transition of amorphous materials is studied with a two-dimensional Hamiltonian model that allows both shear and volume deformations. The model is investigated as a function of the relative value of the bulk modulus B with respect to the shear modulus μ. When the ratio B/μ is small enough, the yielding transition becomes discontinuous, yet reversible. If the system is driven at constant strain rate in the coexistence region, a spatially localized shear band is observed while the rest of the system remains blocked. The crucial role of volume fluctuations in the origin of this behavior is clarified in a mean field version of the model.

123202

We study the large deviations of the power injected by the active force for an active Ornstein–Uhlenbeck particle (AOUP), free or in a confining potential. For the free-particle case, we compute the rate function analytically in d-dimensions from a saddle-point expansion, and numerically in two dimensions by (a) direct sampling of the active work in numerical solutions of the AOUP equations and (b) Legendre–Fenchel transform of the scaled cumulant generating function obtained via a cloning algorithm. The rate function presents asymptotically linear branches on both sides and it is independent of the system's dimensionality, apart from a multiplicative factor. For the confining potential case, we focus on two-dimensional systems and obtain the rate function numerically using both methods (a) and (b). We find a different scenario for harmonic and anharmonic potentials: in the former case, the phenomenology of fluctuations is analogous to that of a free particle, but the rate function might be non-analytic; in the latter case the rate functions are analytic, but fluctuations are realised by entirely different means, which rely strongly on the particle-potential interaction. Finally, we check the validity of a fluctuation relation for the active work distribution. In the free-particle case, the relation is satisfied with a slope proportional to the bath temperature. The same slope is found for the harmonic potential, regardless of activity, and for an anharmonic potential with low activity. In the anharmonic case with high activity, instead, we find a different slope which is equal to an effective temperature obtained from the fluctuation–dissipation theorem.
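
As a hedged illustration of sampling route (a), the sketch below simulates free AOUPs with Euler–Maruyama, accumulates the active work, and builds a crude rate-function estimate from the histogram; the equation conventions and all parameter values are assumptions, not the paper's exact setup.

```python
# Hedged sketch: direct sampling of the active work for a free AOUP in 2D
# (route (a) in the abstract). Conventions and parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(0)
dim, dt, steps, samples = 2, 1e-2, 2000, 20000
tau, D_a, D_t = 1.0, 1.0, 1.0          # persistence time, active and thermal diffusivities

f = rng.normal(0.0, np.sqrt(D_a / tau), size=(samples, dim))  # stationary active force
work = np.zeros(samples)
for _ in range(steps):
    noise_x = rng.normal(size=(samples, dim))
    noise_f = rng.normal(size=(samples, dim))
    dx = f * dt + np.sqrt(2 * D_t * dt) * noise_x             # free particle: no potential
    work += np.sum(f * dx, axis=1)                            # active-work increment f . dx
    f += (-f / tau) * dt + np.sqrt(2 * D_a / tau**2 * dt) * noise_f

tf = steps * dt
w = work / tf                                                 # time-averaged active work
hist, edges = np.histogram(w, bins=60, density=True)
centers = 0.5 * (edges[1:] + edges[:-1])
rate = -np.log(hist + 1e-300) / tf                            # crude rate-function estimate
print(centers[np.argmin(rate)], "is close to the typical active work")
```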

123203

We consider one-dimensional discrete-time random walks (RWs) in the presence of finite-size traps, over which the RWs can jump. We study the survival probability of such RWs when the traps are periodically distributed and separated by a distance L. We obtain exact results for the mean first-passage time and the survival probability in the special case of a double-sided exponential jump distribution. While such RWs typically survive longer than if they could not leap over traps, their survival probability still decreases exponentially with the number of steps. The decay rate of the survival probability depends in a non-trivial way on the trap length and exhibits an interesting regime in which it tends to the ratio of the trap length to L, which is reminiscent of strongly chaotic deterministic systems. We generalize our model to continuous-time RWs, where we introduce a power-law distributed waiting time before each jump. In this case, we find that the survival probability decays algebraically with an exponent that is independent of the trap length. Finally, we derive the diffusive limit of our model and show that, depending on the chosen scaling, we obtain either diffusion with uniform absorption, or diffusion with periodically distributed point absorbers.
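
A minimal Monte Carlo sketch of the setting: discrete-time walkers with double-sided exponential jumps, traps of length a repeated with period L, and an exponential fit of the surviving fraction. Trap placement, the starting point and all parameters are assumptions for illustration.

```python
# Hedged Monte Carlo sketch: survival of a discrete-time random walk with
# double-sided exponential jumps and periodic traps of length a, period L.
# Trap placement, starting point and parameter values are assumptions.
import numpy as np

rng = np.random.default_rng(1)
L, a, b = 10.0, 1.0, 1.0          # period, trap length, jump scale (Laplace)
n_walkers, n_steps = 200000, 200

x = np.full(n_walkers, (a + L) / 2.0)     # start midway between two traps
alive = np.ones(n_walkers, dtype=bool)
survival = np.empty(n_steps)
for t in range(n_steps):
    jumps = rng.laplace(scale=b, size=n_walkers)
    x[alive] += jumps[alive]
    in_trap = np.mod(x, L) < a            # traps occupy [kL, kL + a)
    alive &= ~in_trap
    survival[t] = alive.mean()

# Exponential decay rate from the tail of the survival curve
tail = slice(n_steps // 2, n_steps)
rate = -np.polyfit(np.arange(n_steps)[tail], np.log(survival[tail] + 1e-300), 1)[0]
print("estimated decay rate per step:", rate)
```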

123204

We propose a method to exactly generate Brownian paths xc(t) that are constrained to return to the origin at some future time tf, with a given fixed area ${A}_{f}={\int }_{0}^{{t}_{f}}\mathrm{d}t\enspace {x}_{c}(t)$ under their trajectory. We derive an exact effective Langevin equation with an effective force that accounts for the constraint. In addition, we develop the corresponding approach for discrete-time random walks, with arbitrary jump distributions including Lévy flights, for which we obtain an effective jump distribution that encodes the constraint. Finally, we generalise our method to other types of dynamical constraints such as a fixed occupation time on the positive axis ${T}_{f}={\int }_{0}^{{t}_{f}}\mathrm{d}t\enspace {\Theta}\left[{x}_{c}(t)\right]$ or a fixed generalised quadratic area ${\mathcal{A}}_{f}={\int }_{0}^{{t}_{f}}\mathrm{d}t\enspace {x}_{c}^{2}(t)$.
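
The sketch below generates area-constrained Brownian bridges by conditioning the Gaussian increments on the two linear constraints (zero endpoint and fixed area); this is a direct Gaussian-conditioning construction, not the paper's effective Langevin equation, and the discretization and the values of tf and Af are assumptions.

```python
# Hedged sketch: Brownian bridges with a prescribed area via exact Gaussian
# conditioning on linear constraints (an alternative to the paper's effective
# Langevin equation). Discretization and parameters are assumptions.
import numpy as np

rng = np.random.default_rng(2)
n, tf, A_f = 1000, 1.0, 0.3
dt = tf / n

z = rng.normal(0.0, np.sqrt(dt), size=n)          # unconstrained Brownian increments

# Both constraints are linear in the increments z_k:
#   endpoint: sum_k z_k = 0
#   area:     dt * sum_j x_j = sum_k dt*(n - k) z_k = A_f   (x_j = cumulative sum of z)
C = np.vstack([
    np.ones(n),
    dt * np.arange(n, 0, -1),
])
v = np.array([0.0, A_f])

# Exact conditioning of an isotropic Gaussian on C z = v
z_c = z + C.T @ np.linalg.solve(C @ C.T, v - C @ z)
x = np.cumsum(z_c)
print("endpoint:", x[-1], " area:", dt * x.sum())  # approximately 0 and A_f
```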

123205

For a given inhomogeneous exclusion process on N sites between two reservoirs, the trajectory probabilities allow one to identify the relevant local empirical observables and to obtain the corresponding rate function at level 2.5. In order to close the hierarchy of the empirical dynamics that appear in the stationarity constraints, we consider the simplest approximation, namely the mean-field approximation for the empirical density of two consecutive sites, in direct correspondence with the previously studied mean-field approximation for the steady state. For a given inhomogeneous totally asymmetric model, this mean-field approximation yields the large deviations for the joint distribution of the empirical density profile and of the empirical current around the mean-field steady state; the further explicit contraction over the current allows one to obtain the large deviations of the empirical density profile alone. For a given inhomogeneous asymmetric model, the local empirical observables also involve the empirical activities of the links and of the reservoirs; the further explicit contraction over these activities yields the large deviations for the joint distribution of the empirical density profile and of the empirical current. The consequences for the large deviations properties of time-additive space-local observables are also discussed in both cases.

123206

The Poisson–Nernst–Planck (PNP) diffusional model is a successful theoretical framework to investigate the electrochemical impedance response of insulators containing ionic impurities to an external ac stimulus. Apparent deviations of the experimental spectra from the predictions of the PNP model in the low frequency region are usually interpreted as an interfacial property. Here, we provide a rigorous mathematical analysis of the low-frequency limiting behavior of the model, analyzing the possible origin of these deviations in terms of bulk properties. The analysis points toward the necessity to consider a bulk effect connected with the difference in the diffusion coefficients of cations and anions (ambipolar diffusion). The ambipolar model does not continuously reach the behavior of the one mobile ion diffusion model when the difference in the mobility of the species vanishes, for a fixed frequency, in the cases of ohmic and adsorption–desorption boundary conditions. The analysis is devoted to the low frequency region, where the electrodes play a fundamental role in the response of the cell; thus, different boundary conditions, intended to mimic the non-blocking character of the electrodes, are considered. The new version of the boundary conditions in the limit in which one of the mobilities tends to zero is deduced. According to the analysis in the dc limit, the phenomenological parameters related to the electrodes are frequency dependent, indicating that the exchange of electric charge from the bulk to the external circuit, in the ohmic model, is related to a surface impedance, and not simply to an electric resistance.

Disordered systems, classical and quantum

123301

We study the probability of stability of a large complex system of size N within the framework of a generalized May model, which assumes a linear dynamics of each population size ni (with respect to its equilibrium value): $\frac{\mathrm{d}{n}_{i}}{\mathrm{d}t}=-{a}_{i}{n}_{i}-\sqrt{T}{\sum }_{j}{J}_{ij}{n}_{j}$. The ai > 0's are the intrinsic decay rates, Jij is a real symmetric (N × N) Gaussian random matrix and $\sqrt{T}$ measures the strength of pairwise interaction between different species. Our goal is to study how inhomogeneities in the intrinsic damping rates ai affect the stability of this dynamical system. As the interaction strength T increases, the system undergoes a phase transition from a stable phase to an unstable phase at a critical value T = Tc. We reinterpret the probability of stability in terms of the hitting time of the level b = 0 of an associated Dyson Brownian motion (DBM), starting at the initial position ai and evolving in 'time' T. In the large N limit, using this DBM picture, we are able to completely characterize Tc for arbitrary density μ(a) of the ai's. For a specific flat configuration ${a}_{i}=1+\sigma \frac{i-1}{N}$, we obtain an explicit parametric solution for the limiting (as N → ∞) spectral density for arbitrary T and σ. For finite but large N, we also compute the large deviation properties of the probability of stability on the stable side T < Tc using a Coulomb gas representation.
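
A hedged numerical check of the stability criterion: the system is stable when the smallest eigenvalue of diag(a) + √T J is positive. The GOE normalization, system size and sampled values of T below are assumptions.

```python
# Hedged sketch: Monte Carlo estimate of the probability of stability of the
# generalized May model. GOE normalization and parameters are assumptions.
import numpy as np

rng = np.random.default_rng(3)
N, sigma, n_samples = 200, 1.0, 200
a = 1.0 + sigma * np.arange(N) / N          # the 'flat' configuration a_i = 1 + sigma*(i-1)/N

def stability_prob(T):
    stable = 0
    for _ in range(n_samples):
        G = rng.normal(size=(N, N)) / np.sqrt(N)
        J = (G + G.T) / np.sqrt(2)          # GOE-like symmetric matrix
        lam_min = np.linalg.eigvalsh(np.diag(a) + np.sqrt(T) * J).min()
        stable += lam_min > 0               # stable iff -diag(a) - sqrt(T) J has negative spectrum
    return stable / n_samples

for T in (0.05, 0.15, 0.25, 0.35):
    print(f"T = {T:.2f}  P(stable) ~ {stability_prob(T):.2f}")
```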

123302

Thermodynamic uncertainty relations unveil useful connections between fluctuations in thermal systems and entropy production. This work extends these ideas to the disparate field of zero temperature quantum mesoscopic physics where fluctuations are due to coherent effects and entropy production is replaced by a cost function. The cost function arises naturally as a bound on fluctuations, induced by coherent effects—a critical resource in quantum mesoscopic physics. Identifying the cost function as an important quantity demonstrates the potential of importing powerful methods from non-equilibrium statistical physics to quantum mesoscopics.

Interdisciplinary statistical mechanics

123401

To understand the dynamics on complex networks, measurement of correlations is indispensable. In a motorway network, it is not sufficient to collect information on fluxes and velocities on all individual links, i.e. parts of the freeways between ramps and interchanges. The interdependencies and mutual connections are also of considerable interest. We analyze correlations in the complete motorway network in North Rhine-Westphalia, the most populous state in Germany. We view the motorway network as a complex system consisting of road sections which interact via the motion of vehicles, implying structures in the corresponding correlation matrices. In particular, we focus on collective behavior, i.e. coherent motion in the whole network or in large parts of it. To this end, we study the eigenvalue and eigenvector statistics and identify significant sections in the motorway network. We find collective behavior in these significant sections and further explore its causes. We show that collectivity throughout the network cannot directly be related to the traffic states (free, synchronous and congested) in Kerner's three-phase theory. Hence, the degree of collectivity provides a new, complementary observable to characterize the motorway network.
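
The following sketch illustrates the kind of eigenvalue and eigenvector analysis described above on synthetic section time series containing one common driver; the data model is an assumption, used only to show how a large leading eigenvalue with a nearly uniform eigenvector signals collectivity.

```python
# Hedged sketch: eigen-analysis of a correlation matrix of traffic-like time series.
# The synthetic data with one common 'collective' mode are an assumption.
import numpy as np

rng = np.random.default_rng(4)
n_sections, n_times = 100, 2000
common = rng.normal(size=n_times)                         # common (collective) driver
data = 0.4 * common[None, :] + rng.normal(size=(n_sections, n_times))

# Normalize each section's time series and build the correlation matrix
data = (data - data.mean(axis=1, keepdims=True)) / data.std(axis=1, keepdims=True)
C = data @ data.T / n_times

evals, evecs = np.linalg.eigh(C)
print("largest eigenvalue:", evals[-1])                   # well above the bulk: collectivity
print("leading eigenvector spread:", evecs[:, -1].std())  # nearly uniform components: coherent motion
```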

123402

In real communication or transportation systems, loss of agents is very common due to finite storage capacity. We study the traffic dynamics in finite-buffer networks and propose a routing strategy motivated by a heuristic algorithm to alleviate packet loss. Under this routing strategy, the traffic capacity is further improved compared to the shortest-path routing strategy and the efficient routing strategy. We then investigate the effect of this routing strategy on the betweenness of nodes. Through dynamic routing changes, the maximum node betweenness of the network is greatly reduced, and the final betweenness of each node is almost the same. Therefore, the routing strategy proposed in this paper can balance the node load, thereby effectively alleviating packet loss.
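
As a hedged illustration of betweenness-aware routing (not the paper's specific heuristic), the sketch below penalizes edges incident to high-betweenness nodes and recomputes shortest paths with networkx; the graph model and penalty weights are assumptions.

```python
# Hedged sketch: weight-adapted shortest-path routing that penalizes high-betweenness
# nodes, illustrating load balancing in the spirit of the abstract.
import networkx as nx

G = nx.barabasi_albert_graph(200, 3, seed=0)      # illustrative scale-free topology

bc = nx.betweenness_centrality(G)
# Penalize traversal of nodes that already carry a large share of shortest paths
for u, v in G.edges():
    G[u][v]["weight"] = 1.0 + 10.0 * (bc[u] + bc[v])

path = nx.shortest_path(G, source=0, target=50, weight="weight")
print("route avoiding hubs:", path)
print("max betweenness before rerouting:", max(bc.values()))
```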

123403

The Maki–Thompson rumor model is defined by assuming that a population represented by a graph is subdivided into three classes of individuals; namely, ignorants, spreaders and stiflers. A spreader tells the rumor to any of its nearest ignorant neighbors at rate one. At the same rate, a spreader becomes a stifler after a contact with other nearest neighbor spreaders, or stiflers. In this work we study the model on random trees. As usual, we define a critical parameter of the model as the critical value around which the rumor either becomes extinct almost surely or survives with positive probability. We analyze the existence of a phase transition regarding the survival of the rumor, and we obtain estimates for the mean range of the rumor. The applicability of our results is illustrated with examples on random trees generated from some well-known discrete distributions.

123404

This paper introduces a new framework to quantify distance between finite sets with uncertainty present, where probability distributions determine the locations of individual elements. Combining this with a Bayesian change point detection algorithm, we produce a new measure of similarity between time series with respect to their structural breaks. First, we demonstrate the algorithm's effectiveness on a collection of piecewise autoregressive processes. Next, we apply this to financial data to study the erratic behavior profiles of 19 countries and 11 sectors over the past 20 years. Our measure provides quantitative evidence that there is greater collective similarity among sectors' erratic behavior profiles than those of countries, which we observe upon individual inspection of these time series. Our measure could be used as a new framework or complementary tool for investors seeking to make asset allocation decisions for financial portfolios.
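
A minimal stand-in for the set distance: the modified Hausdorff distance between two finite sets of detected change points. The paper's measure additionally accounts for uncertainty in the elements' locations; the change-point sets below are illustrative assumptions.

```python
# Hedged sketch: a simple (modified Hausdorff) distance between two finite sets of
# detected change points, standing in for the paper's uncertainty-aware set distance.
import numpy as np

def modified_hausdorff(A, B):
    A, B = np.asarray(A, float), np.asarray(B, float)
    d_ab = np.abs(A[:, None] - B[None, :])          # pairwise |a - b|
    return max(d_ab.min(axis=1).mean(), d_ab.min(axis=0).mean())

cp_series_1 = [120, 410, 780]        # change points detected in series 1 (illustrative)
cp_series_2 = [130, 400, 790, 950]   # change points detected in series 2 (illustrative)
print("distance between structural-break profiles:",
      modified_hausdorff(cp_series_1, cp_series_2))
```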

123405

The faster-is-slower (FIS) effect is an interesting phenomenon in crowd dynamics. However, the validity of the FIS effect has not been universally accepted. A series of experiments was conducted with a group of young students evacuating a room through a narrow exit at two locations, i.e. a center exit and a corner exit. The mean time intervals of two consecutive persons passing through the center exit were 1.14 ± 0.09, 1.31 ± 0.43 and 1.42 ± 0.93 s at low, medium and high competitiveness, respectively, i.e. the FIS effect. However, the mean time intervals of two consecutive persons passing through the corner exit were 1.04 ± 0.07 and 0.85 ± 0.17 s at low and high competitiveness, respectively, which is contrary to the FIS effect. Furthermore, two series of circulation movements at high competitiveness were studied, in which all students were required to re-enter the room from another opening after getting out of the room, and this process continued until the end of a test. The mean time interval of consecutive persons passing through the exits was around 2.39 ± 4.29 and 0.77 ± 0.25 s for the center exit and corner exit respectively, and the flow rate of the corner exit was around 3 times that of the center exit. The complementary cumulative probability distribution of the time intervals Δt between consecutive students was studied and it follows a power law, i.e. $P(\Delta t)\sim \Delta t^{-\alpha}$. However, the study showed that α alone cannot fully represent the efficiency of an evacuation. The experiment demonstrated that the FIS effect can be avoided by relocating the exit to the corner and that the flow rate can be greatly improved, particularly under high-competitiveness conditions.
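
The power-law exponent of the complementary cumulative distribution can be estimated by maximum likelihood (a Hill-type estimator); the sketch below exercises it on synthetic Pareto-distributed gaps, with the cutoff and true exponent chosen as assumptions.

```python
# Hedged sketch: maximum-likelihood (Hill) estimate of the exponent alpha in
# P(dt) ~ dt^(-alpha) for the complementary cumulative distribution of time gaps.
# The synthetic data are an assumption used only to exercise the estimator.
import numpy as np

rng = np.random.default_rng(5)
dt_min, alpha_true = 0.3, 1.8
# Pareto-distributed gaps whose CCDF decays as dt^(-alpha_true)
gaps = dt_min * (1.0 - rng.random(5000)) ** (-1.0 / alpha_true)

x = gaps[gaps >= dt_min]
alpha_hat = len(x) / np.sum(np.log(x / dt_min))   # MLE for the CCDF exponent
print("estimated alpha:", alpha_hat)              # close to alpha_true
```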

Special Issue Articles

124001

We consider the problem of learning structures and parameters of continuous-time Bayesian networks (CTBNs) from time-course data under minimal experimental resources. In practice, the cost of generating experimental data poses a bottleneck, especially in the natural and social sciences. A popular approach to overcome this is Bayesian optimal experimental design (BOED). However, BOED becomes infeasible in high-dimensional settings, as it involves integration over all possible experimental outcomes. We propose a novel criterion for experimental design based on a variational approximation of the expected information gain. We show that, for CTBNs, a semi-analytical expression for this criterion can be calculated for structure and parameter learning. By doing so, we can replace sampling over experimental outcomes by solving the CTBN master equation, for which scalable approximations exist. This alleviates the computational burden of integrating over possible experimental outcomes in high dimensions. We employ this framework in order to recommend interventional sequences. In this context, we extend the CTBN model to conditional CTBNs in order to incorporate interventions. We demonstrate the performance of our criterion on synthetic and real-world data.

124002

A recent line of research has highlighted the existence of a 'double descent' phenomenon in deep learning, whereby increasing the number of training examples N causes the generalization error of neural networks (NNs) to peak when N is of the same order as the number of parameters P. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when N is equal to the input dimension D. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when NNs are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at N = P is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in NNs). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at N = D is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep NNs.
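
A hedged numerical sketch of the sample-wise sweep: random-feature ridge regression on a noisy linear teacher with weak regularization, where test-error peaks typically appear near N = D and N = P. Dimensions, activation and noise level are assumptions chosen for illustration.

```python
# Hedged sketch: sample-wise test error of random-feature ridge regression, swept
# through N = D and N = P. All parameters are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(6)
D, P, lam, noise = 30, 120, 1e-6, 0.5
W = rng.normal(size=(P, D)) / np.sqrt(D)          # frozen random features
beta = rng.normal(size=D) / np.sqrt(D)            # linear teacher

def sample(n):
    X = rng.normal(size=(n, D))
    y = X @ beta + noise * rng.normal(size=n)
    return np.tanh(X @ W.T), y                    # nonlinear random features

Phi_test, y_test = sample(4000)
for N in (10, 30, 60, 120, 300, 1000):            # sweep through N = D and N = P
    Phi, y = sample(N)
    a = np.linalg.solve(Phi.T @ Phi + lam * np.eye(P), Phi.T @ y)  # ridge fit
    err = np.mean((Phi_test @ a - y_test) ** 2)
    print(f"N = {N:5d}   test MSE = {err:.3f}")
```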

124003

We show that a variety of modern deep learning tasks exhibit a 'double-descent' phenomenon where, as we increase model size, performance first gets worse and then gets better. Moreover, we show that double descent occurs not just as a function of model size, but also as a function of the number of training epochs. We unify the above phenomena by defining a new complexity measure we call the effective model complexity and conjecture a generalized double descent with respect to this measure. Furthermore, our notion of model complexity allows us to identify certain regimes where increasing (even quadrupling) the number of train samples actually hurts test performance.

124004

We consider the problem of estimating the input and hidden variables of a stochastic multi-layer neural network (NN) from an observation of the output. The hidden variables in each layer are represented as matrices with statistical interactions along both rows as well as columns. This problem applies to matrix imputation, signal recovery via deep generative prior models, multi-task and mixed regression, and learning certain classes of two-layer NNs. We extend a recently developed algorithm, multi-layer vector approximate message passing, to this matrix-valued inference problem. It is shown that the performance of the proposed multi-layer matrix vector approximate message passing algorithm can be exactly predicted in a certain random large-system limit, where the dimensions N × d of the unknown quantities grow as N → ∞ with d fixed. In the two-layer neural-network learning problem, this scaling corresponds to the case where the number of input features as well as training samples grow to infinity but the number of hidden nodes stays fixed. The analysis enables a precise prediction of the parameter and test error of the learning.

124005

Neural networks have been shown to perform incredibly well in classification tasks over structured high-dimensional datasets. However, the learning dynamics of such networks is still poorly understood. In this paper we study in detail the training dynamics of a simple type of neural network: a single hidden layer trained to perform a classification task. We show that in a suitable mean-field limit this case maps to a single-node learning problem with a time-dependent dataset determined self-consistently from the average node population. We specialize our theory to the prototypical case of linearly separable data and a linear hinge loss, for which the dynamics can be explicitly solved in the infinite dataset limit. This allows us to address in a simple setting several phenomena appearing in modern networks such as slowing down of training dynamics, crossover between rich and lazy learning, and overfitting. Finally, we assess the limitations of mean-field theory by studying the case of a large but finite number of nodes and of training samples.

124006

This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples n, their dimension p, and the dimension of the feature space N are all large and comparable. In this regime, the random RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when N → ∞ alone), but it still has a tractable behavior that is captured by our analysis. This analysis also provides accurate estimates of training and test regression errors for large n, p, N. Based on these estimates, a precise characterization of two qualitatively different phases of learning, including the phase transition between them, is provided; and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not depend on strong assumptions on the data distribution, and they perfectly match empirical results on real-world data sets.

124007

Pairwise models like the Ising model or the generalized Potts model have found many successful applications in fields like physics, biology, and economics. Closely connected is the problem of inverse statistical mechanics, where the goal is to infer the parameters of such models given observed data. An open problem in this field is the question of how to train these models in the case where the data contain additional higher-order interactions that are not present in the pairwise model. In this work, we propose an approach based on energy-based models and pseudolikelihood maximization to address these complications: we show that hybrid models, which combine a pairwise model and a neural network, can lead to significant improvements in the reconstruction of pairwise interactions. We show these improvements to hold consistently when compared to a standard approach using only the pairwise model and to an approach using only a neural network. This is in line with the general idea that simple interpretable models and complex black-box models are not necessarily a dichotomy: interpolating between these two classes of models can allow one to keep some advantages of both.
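
For context, the sketch below implements the pairwise baseline only: pseudolikelihood gradient ascent for the couplings of an Ising model on crudely Gibbs-sampled data (the hybrid neural-network term of the paper is not included). Data generation and hyperparameters are assumptions.

```python
# Hedged sketch: pseudolikelihood gradient ascent for the couplings of a pairwise
# (Ising) model, the baseline the paper's hybrid model builds on.
import numpy as np

rng = np.random.default_rng(7)
n_spins, n_samples, lr, epochs = 20, 4000, 0.05, 300

# Ground-truth sparse couplings and crudely Gibbs-sampled configurations (heat-bath sweeps)
J_true = np.triu(rng.normal(0, 0.4, (n_spins, n_spins)) * (rng.random((n_spins, n_spins)) < 0.2), 1)
J_true = J_true + J_true.T
S = rng.choice([-1.0, 1.0], size=(n_samples, n_spins))
for _ in range(50):
    for i in range(n_spins):
        p_up = 1.0 / (1.0 + np.exp(-2.0 * S @ J_true[i]))
        S[:, i] = np.where(rng.random(n_samples) < p_up, 1.0, -1.0)

J = np.zeros((n_spins, n_spins))
for _ in range(epochs):
    H = S @ J                                     # local fields (diagonal of J kept at 0)
    grad = S.T @ (S - np.tanh(H)) / n_samples     # gradient of the log-pseudolikelihood
    grad = (grad + grad.T) / 2.0
    np.fill_diagonal(grad, 0.0)
    J += lr * grad

print("mean absolute coupling error:", np.abs(J - J_true).mean())
```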

124008

We analyze in a closed form the learning dynamics of the stochastic gradient descent (SGD) for a single-layer neural network classifying a high-dimensional Gaussian mixture where each cluster is assigned one of two labels. This problem provides a prototype of a non-convex loss landscape with interpolating regimes and a large generalization gap. We define a particular stochastic process for which SGD can be extended to a continuous-time limit that we call stochastic gradient flow. In the full-batch limit, we recover the standard gradient flow. We apply dynamical mean-field theory from statistical physics to track the dynamics of the algorithm in the high-dimensional limit via a self-consistent stochastic process. We explore the performance of the algorithm as a function of the control parameters shedding light on how it navigates the loss landscape.

124009

For a certain scaling of the initialization of stochastic gradient descent (SGD), wide neural networks (NNs) have been shown to be well approximated by reproducing kernel Hilbert space (RKHS) methods. Recent empirical work showed that, for some classification tasks, RKHS methods can replace NNs without a large loss in performance. On the other hand, two-layer NNs are known to encode richer smoothness classes than RKHS, and we know of special examples for which SGD-trained NNs provably outperform RKHS. This is true even in the wide network limit, for a different scaling of the initialization. How can we reconcile the above claims? For which tasks do NNs outperform RKHS? If covariates are nearly isotropic, RKHS methods suffer from the curse of dimensionality, while NNs can overcome it by learning the best low-dimensional representation. Here we show that this curse of dimensionality becomes milder if the covariates display the same low-dimensional structure as the target function, and we precisely characterize this tradeoff. Building on these results, we present the spiked covariates model, which can capture in a unified framework both behaviors observed in earlier work. We hypothesize that such a latent low-dimensional structure is present in image classification. We test this hypothesis numerically by showing that specific perturbations of the training distribution degrade the performance of RKHS methods much more significantly than that of NNs.

124010

Natural gradient descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. Nevertheless, it remains unclear from the theoretical perspective why and under what conditions such heuristic approximations work well. In this work, we reveal that, under specific conditions, NGD with approximate Fisher information achieves the same fast convergence to global minima as exact NGD. We consider deep neural networks in the infinite-width limit, and analyze the asymptotic training dynamics of NGD in function space via the neural tangent kernel. In function space, the training dynamics with the approximate Fisher information are identical to those with the exact Fisher information, and they converge quickly. The fast convergence holds for layer-wise approximations: for instance, in the block-diagonal approximation, where each block corresponds to a layer, as well as in the block tri-diagonal and K-FAC approximations. We also find that a unit-wise approximation achieves the same fast convergence under some assumptions. All of these different approximations have an isotropic gradient in function space, and this plays a fundamental role in achieving the same convergence properties in training. Thus, the current study gives a novel and unified theoretical foundation with which to understand NGD methods in deep learning.

124011

Graph neural networks (GNNs) extend the functionality of traditional neural networks to graph-structured data. Similar to CNNs, an optimized design of graph convolution and pooling is key to success. Borrowing ideas from physics, we propose path integral-based GNNs (PAN) for classification and regression tasks on graphs. Specifically, we consider a convolution operation that involves every path linking the message sender and receiver with learnable weights depending on the path length, which corresponds to the maximal entropy random walk. It generalizes the graph Laplacian to a new transition matrix that we call the maximal entropy transition (MET) matrix derived from a path integral formalism. Importantly, the diagonal entries of the MET matrix are directly related to the subgraph centrality, thus leading to a natural and adaptive pooling mechanism. PAN provides a versatile framework that can be tailored for different graph data with varying sizes and structures. We can view most existing GNN architectures as special cases of PAN. Experimental results show that PAN achieves state-of-the-art performance on various graph classification/regression tasks, including a new benchmark dataset from statistical mechanics that we propose to boost applications of GNN in physical sciences.
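
A hedged sketch of the path-integral idea behind the MET matrix: a weighted sum of adjacency-matrix powers, row-normalized into a transition matrix, whose diagonal is compared to the classical subgraph centrality. The fixed path-length weights and the toy graph are assumptions (in PAN the weights are learnable).

```python
# Hedged sketch: a truncated path-sum ('MET-like') matrix from powers of the adjacency
# matrix, with its diagonal compared to subgraph centrality. Weights and graph are assumptions.
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(8)
n, p, max_len = 12, 0.3, 4
A = np.triu((rng.random((n, n)) < p).astype(float), 1)
A = A + A.T                                          # random undirected toy graph

weights = np.exp(-np.arange(max_len + 1))            # path-length weights (learnable in PAN)
M = sum(w * np.linalg.matrix_power(A, l) for l, w in enumerate(weights))
M = M / M.sum(axis=1, keepdims=True)                 # row-normalize into a transition matrix

subgraph_centrality = np.diag(expm(A))               # classical subgraph centrality
print("corr(diag(M), subgraph centrality):",
      np.corrcoef(np.diag(M), subgraph_centrality)[0, 1])
```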

124012

Forecasting complex dynamical phenomena in settings where only partial knowledge of their dynamics is available is a prevalent problem across various scientific fields. While purely data-driven approaches are arguably insufficient in this context, standard physical modeling-based approaches tend to be over-simplistic, inducing non-negligible errors. In this work, we introduce the APHYNITY framework, a principled approach for augmenting incomplete physical dynamics described by differential equations with deep data-driven models. It consists of decomposing the dynamics into two components: a physical component accounting for the dynamics for which we have some prior knowledge, and a data-driven component accounting for errors of the physical model. The learning problem is carefully formulated such that the physical model explains as much of the data as possible, while the data-driven component only describes information that cannot be captured by the physical model; no more, no less. This not only guarantees the existence and uniqueness of the decomposition, but also ensures interpretability and benefits generalization. Experiments made on three important use cases, each representative of a different family of phenomena, i.e. reaction–diffusion equations, wave equations and the non-linear damped pendulum, show that APHYNITY can efficiently leverage approximate physical models to accurately forecast the evolution of the system and correctly identify relevant physical parameters. The code is available at https://github.com/yuan-yin/APHYNITY.

124013

We study generalised linear regression and classification for a synthetically generated dataset encompassing different problems of interest, such as learning with random features, neural networks in the lazy training regime, and the hidden manifold model. We consider the high-dimensional regime and using the replica method from statistical physics, we provide a closed-form expression for the asymptotic generalisation performance in these problems, valid in both the under- and over-parametrised regimes and for a broad choice of generalised linear model loss functions. In particular, we show how to obtain analytically the so-called double descent behaviour for logistic regression with a peak at the interpolation threshold, we illustrate the superiority of orthogonal against random Gaussian projections in learning with random features, and discuss the role played by correlations in the data generated by the hidden manifold model. Beyond the interest in these particular problems, the theoretical formalism introduced in this manuscript provides a path to further extensions to more complex tasks.

124014

Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems is still an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDEs) under heavy-tailed gradient noise has recently shed light on several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning-theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a Feller process, which defines a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of 'capacity metric'. We support our theory with experiments on deep neural networks illustrating that the proposed capacity metric accurately estimates the generalization error, and that it does not necessarily grow with the number of parameters, unlike the existing capacity metrics in the literature.

124015

The properties of flat minima in the empirical risk landscape of neural networks have been debated for some time. Increasing evidence suggests they possess better generalization capabilities with respect to sharp ones. In this work we first discuss the relationship between alternative measures of flatness: the local entropy, which is useful for analysis and algorithm development, and the local energy, which is easier to compute and was shown empirically in extensive tests on state-of-the-art networks to be the best predictor of generalization capabilities. We show semi-analytically in simple controlled scenarios that these two measures correlate strongly with each other and with generalization. Then, we extend the analysis to the deep learning scenario by extensive numerical validations. We study two algorithms, entropy-stochastic gradient descent and replicated-stochastic gradient descent, that explicitly include the local entropy in the optimization objective. We devise a training schedule by which we consistently find flatter minima (using both flatness measures), and improve the generalization error for common architectures (e.g. ResNet, EfficientNet).

124016

We introduce JAX MD, a software package for performing differentiable physics simulations with a focus on molecular dynamics. JAX MD includes a number of physics simulation environments, as well as interaction potentials and neural networks that can be integrated into these environments without writing any additional code. Since the simulations themselves are differentiable functions, entire trajectories can be differentiated to perform meta-optimization. These features are built on primitive operations, such as spatial partitioning, that allow simulations to scale to hundreds-of-thousands of particles on a single GPU. These primitives are flexible enough that they can be used to scale up workloads outside of molecular dynamics. We present several examples that highlight the features of JAX MD including: integration of graph neural networks into traditional simulations, meta-optimization through minimization of particle packings, and a multi-agent flocking simulation. JAX MD is available at https://www.github.com/google/jax-md.
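
To illustrate the end-to-end differentiability that JAX MD builds on, the sketch below uses plain JAX (not the JAX MD API itself) to differentiate the relaxed energy of a small soft-sphere system with respect to the particle diameter, through the whole trajectory; the potential, integrator and parameters are assumptions.

```python
# Hedged sketch in plain JAX (not the JAX MD API): differentiate through a short
# soft-sphere relaxation trajectory with respect to a potential parameter.
import jax
import jax.numpy as jnp

def energy(positions, sigma):
    d = positions[:, None, :] - positions[None, :, :]
    # add the identity on the diagonal so sqrt is smooth at zero self-distance
    r = jnp.sqrt(jnp.sum(d ** 2, axis=-1) + jnp.eye(positions.shape[0]))
    overlap = jnp.maximum(1.0 - r / sigma, 0.0)
    return 0.5 * jnp.sum(jnp.triu(overlap ** 2, k=1))     # soft-sphere pair energy

def final_energy(sigma, positions, steps=100, dt=1e-2):
    def step(pos, _):
        pos = pos - dt * jax.grad(energy)(pos, sigma)     # overdamped relaxation step
        return pos, None
    pos, _ = jax.lax.scan(step, positions, None, length=steps)
    return energy(pos, sigma)

key = jax.random.PRNGKey(0)
positions = jax.random.uniform(key, (32, 2))
# Sensitivity of the relaxed energy to the particle diameter, through the whole trajectory
print(jax.grad(final_energy)(0.15, positions))
```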

124017

Graphical models are useful tools for describing structured high-dimensional probability distributions. The development of efficient algorithms for learning graphical models from the least amount of data remains an active research topic. Reconstruction of graphical models that describe the statistics of discrete variables is a particularly challenging problem, for which the maximum likelihood approach is intractable. In this work, we provide the first sample-efficient method based on the interaction screening framework that allows one to provably learn fully general discrete factor models with node-specific discrete alphabets and multi-body interactions, specified in an arbitrary basis. We identify a single condition related to model parametrization that leads to rigorous guarantees on the recovery of model structure and parameters in any error norm, and is readily verifiable for a large class of models. Importantly, our bounds make an explicit distinction between parameters that are proper to the model and priors used as an input to the algorithm. Finally, we show that the interaction screening framework includes all models previously considered in the literature as special cases, and that for these our analysis shows a systematic improvement in sample complexity.
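
A minimal sketch of the interaction screening idea in its simplest (pairwise Ising) instance: the node-wise screening objective is minimized with an l1 penalty to recover one node's couplings. The paper treats far more general discrete factor models; data generation and hyperparameters here are assumptions.

```python
# Hedged sketch: the interaction-screening objective for one node of an Ising model,
# minimized by plain (sub)gradient descent with an l1 penalty.
import numpy as np

rng = np.random.default_rng(9)
n_spins, n_samples, lam, lr, iters = 15, 5000, 0.01, 0.1, 500

# Ground-truth couplings of node 0 and configurations with the correct conditional law
theta_true = np.zeros(n_spins - 1)
theta_true[:3] = [0.6, -0.5, 0.4]
S_rest = rng.choice([-1.0, 1.0], size=(n_samples, n_spins - 1))
p_up = 1.0 / (1.0 + np.exp(-2.0 * S_rest @ theta_true))
s0 = np.where(rng.random(n_samples) < p_up, 1.0, -1.0)

theta = np.zeros(n_spins - 1)
for _ in range(iters):
    ise = np.exp(-s0 * (S_rest @ theta))                  # interaction-screening terms
    grad = -(S_rest * (s0 * ise)[:, None]).mean(axis=0)   # gradient of the empirical objective
    theta -= lr * (grad + lam * np.sign(theta))           # sub-gradient step for the l1 penalty

print("recovered couplings of node 0:", np.round(theta, 2))
```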