
Hausdorff dimension, heavy tails, and generalization in neural networks*


Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Umut Şimşekli et al J. Stat. Mech. (2021) 124014. DOI 10.1088/1742-5468/ac3ae7


Abstract

Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems remains an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed light on several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a Feller process, which defines a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of 'capacity metric'. We support our theory with experiments on deep neural networks, illustrating that the proposed capacity metric accurately estimates the generalization error and, unlike existing capacity metrics in the literature, does not necessarily grow with the number of parameters.


1. Introduction

Many important tasks in deep learning can be represented by the following optimization problem,

$\underset{w\in {\mathbb{R}}^{d}}{\mathrm{min}}\enspace f(w){:=}\frac{1}{n}\sum _{i=1}^{n}{f}^{(i)}(w)\qquad$ (1)

where $w\in {\mathbb{R}}^{d}$ denotes the network weights, n denotes the number of training data points, f denotes a non-convex cost function, and f(i) denotes the cost incurred by a single data point. Gradient-based optimization algorithms, with stochastic gradient descent (SGD) perhaps being the most popular one, have been the primary algorithmic choice for attacking such optimization problems. Given an initial point w0, the SGD algorithm is based on the following recursion,

${w}_{k+1}={w}_{k}-\eta \nabla {\tilde{f}}_{k}({w}_{k}),\qquad \nabla {\tilde{f}}_{k}(w){:=}\frac{1}{\mathrm{B}}\sum _{i\in {\tilde{B}}_{k}}\nabla {f}^{(i)}(w)\qquad$ (2)

where η is the step-size, and $\nabla {\tilde{f}}_{k}$ is the unbiased stochastic gradient with batch size $\mathrm{B}=\vert {\tilde{B}}_{k}\vert $ for a random subset ${\tilde{B}}_{k}$ of {1, ..., n} for all $k\in \mathbb{N}$, |⋅| denoting cardinality.
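
For concreteness, a minimal NumPy sketch of the recursion (2) is given below; the quadratic per-sample cost used at the end is only an illustrative placeholder (it is not from the paper), and the function names are ours.

```python
import numpy as np

def sgd(grad_fi, w0, n, eta=0.1, batch_size=32, num_iters=1000, seed=0):
    """Run the SGD recursion (2): w_{k+1} = w_k - eta * (1/B) * sum_{i in B_k} grad f^(i)(w_k)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    trajectory = [w.copy()]
    for _ in range(num_iters):
        batch = rng.choice(n, size=batch_size, replace=False)         # random subset of {1, ..., n}
        stoch_grad = np.mean([grad_fi(w, i) for i in batch], axis=0)  # unbiased stochastic gradient
        w = w - eta * stoch_grad
        trajectory.append(w.copy())
    return np.array(trajectory)

# Illustrative placeholder problem: f^(i)(w) = 0.5 * ||w - x_i||^2, so grad f^(i)(w) = w - x_i.
d, n = 10, 1000
X = np.random.default_rng(1).normal(size=(n, d))
traj = sgd(lambda w, i: w - X[i], w0=np.zeros(d), n=n)
```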

In contrast to the convex optimization setting, where the behavior of SGD is fairly well-understood (see e.g. [DDB20, SSBD14]), the generalization properties of SGD in non-convex deep learning problems are an active area of research [AZL19, AZLL19, PBL19]. In the last decade, there has been considerable progress around this topic, where several generalization bounds have been proven in different mathematical setups [AZLL19, DR17, KL17, Lon17, MWZZ17, NHD+19, NTS15, RRT17, ZLZ19]. While these bounds are useful for capturing the generalization behavior of SGD in certain cases, they typically grow with the dimension d, which contradicts empirical observations [NBMS17].

An important initial step toward developing a concrete generalization theory for the SGD algorithm in deep learning problems is to characterize the statistical properties of the weights ${\left\{{w}_{k}\right\}}_{k\in \mathbb{N}}$, as they might provide guidance for identifying the constituents that determine the performance of SGD. A popular approach for analyzing the dynamics of SGD, mainly borrowed from statistical physics, is based on viewing it as a discretization of a continuous-time stochastic process that can be described by a stochastic differential equation (SDE). For instance, if we assume that the gradient noise, i.e. $\nabla {\tilde{f}}_{k}(w)-\nabla f(w)$, can be well-approximated by a Gaussian random vector, we can represent (2) as the Euler–Maruyama discretization of the following SDE,

$\mathrm{d}{\mathrm{W}}_{t}=-\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{\Sigma}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}\qquad$ (3)

where Bt denotes the standard Brownian motion in ${\mathbb{R}}^{d}$, and ${\Sigma}:{\mathbb{R}}^{d}{\mapsto}{\mathbb{R}}^{d\times d}$ is called the diffusion coefficient. This approach has been adopted by several studies [CS18, HLLL17, JKA+17, MHB16, ZWY+19]. In particular, based on the 'flat minima' argument (cf [HS97]), Jastrzebski et al [JKA+17] illustrated that the performance of SGD on unseen data correlates well with the ratio η/B.
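
To make the connection explicit, the Euler–Maruyama discretization of (3) with step-size η reads ${\mathrm{W}}_{(k+1)\eta }\approx {\mathrm{W}}_{k\eta }-\eta \nabla f({\mathrm{W}}_{k\eta })+{\Sigma}({\mathrm{W}}_{k\eta })\left({\mathrm{B}}_{(k+1)\eta }-{\mathrm{B}}_{k\eta }\right)$ with ${\mathrm{B}}_{(k+1)\eta }-{\mathrm{B}}_{k\eta }\sim \mathcal{N}(0,\eta {\mathrm{I}}_{d})$; writing the SGD update (2) as ${w}_{k+1}={w}_{k}-\eta \nabla f({w}_{k})+\eta \left(\nabla f({w}_{k})-\nabla {\tilde{f}}_{k}({w}_{k})\right)$ shows that the Gaussian increment plays the role of the accumulated gradient noise over one step, with Σ absorbing its (possibly state-dependent) covariance.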

More recently, the Gaussian approximation for the gradient noise has come under closer investigation. While Gaussian noise can accurately characterize the behavior of SGD for very large batch sizes [PSGN19], Simsekli et al [SSG19] empirically demonstrated that the gradient noise in fully connected and convolutional neural networks can exhibit heavy-tailed behavior in practical settings. This characteristic was also observed in recurrent neural networks [ZKV+19]. Favaro et al [FFP20] illustrated that the iterates themselves can exhibit heavy tails and investigated the corresponding asymptotic behavior in the infinite-width limit. Similarly, Martin and Mahoney [MM19] observed that the eigenspectra of the weight matrices in individual layers of a neural network can exhibit heavy tails; hence, they proposed a layer-wise heavy-tailed model for the SGD iterates. By invoking results from heavy-tailed random matrix theory, they proposed a capacity metric based on a quantification of the heavy tails, which correlated well with the performance of the network on unseen data. Further, they empirically demonstrated that this capacity metric does not necessarily grow with the dimension d.

Based on the argument that the heavy-tailed behavior of SGD 6 observed in practice cannot be accurately represented by an SDE driven by a Brownian motion, Simsekli et al [SSG19] proposed modeling SGD with an SDE driven by a heavy-tailed process, the so-called α-stable Lévy motion [Sat99]. By using this framework and invoking metastability results proven in statistical physics [IP06, Pav07], SGD has been shown to spend more time around 'wider minima', and the time spent around those minima is linked to the tail properties of the driving process [CWZ+21, NŞGR19, SSG19].

Even though the SDE representations of SGD have provided many insights into several distinguishing characteristics of this algorithm in deep learning problems, a rigorous treatment of their generalization properties in a statistical learning theoretical framework is still missing. In this paper, we aim to take a first step in this direction and prove novel generalization bounds in the case where the trajectories of the optimization algorithm (including but not limited to SGD) can be well-approximated by a Feller process [Sch16], which forms a broad class of Markov processes that includes many important stochastic processes as special cases. More precisely, as a proxy to SGD, we consider the Feller process that is expressed by the following SDE:

$\mathrm{d}{\mathrm{W}}_{t}=-\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{{\Sigma}}_{1}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}+{{\Sigma}}_{2}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}\qquad$ (4)

where Σ1, Σ2 are d × d matrix-valued functions, and ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ denotes the state-dependent α-stable Lévy motion, which will be defined in detail in section 2. Informally, ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ can be seen as a heavy-tailed generalization of Brownian motion, where $\boldsymbol{\alpha }:{\mathbb{R}}^{d}{\mapsto}{(0,2]}^{d}$ denotes its state-dependent tail-indices. In the case αi (w) = 2 for all i and w, ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ reduces to $\sqrt{2}{\mathrm{B}}_{t}$, whereas when αi becomes smaller than 2, the process becomes heavier-tailed in the ith component, whose tails asymptotically obey a power-law decay with exponent αi . The SDEs in [CS18, HLLL17, JKA+17, MHB16, ZWY+19] all appear as special cases of (4) with Σ2 = 0, and the SDE proposed in [SSG19] corresponds to the isotropic setting: Σ2(w) is diagonal and ${\alpha }_{i}(w)=\alpha \in (0,2]$ for all i, w. In (4), we allow each coordinate of ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ to have a different tail-index, which can also depend on the state Wt . We believe that ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ provides a more realistic model based on the empirical results of [ŞGN+19], which suggest that the tail index can take different values at each coordinate and evolve over time.

At the core of our approach lies the fact that the sample paths of Markov processes often exhibit a fractal-like structure [Xia03], and the generalization error over the sample paths is intimately related to the 'roughness' of the random fractal generated by the driving Markov process, as measured by a notion called the Hausdorff dimension. Our main contributions are as follows.

  • (a)  
    We introduce a novel notion of complexity for the trajectories of a stochastic learning algorithm, which we coin as 'uniform Hausdorff dimension'. Building on [Sch98], we show that the sample paths of Feller processes admit a uniform Hausdorff dimension, which is closely related to the tail properties of the process.
  • (b)  
    By using tools from geometric measure theory, we prove that the generalization error can be controlled by the Hausdorff dimension of the process, which can be significantly smaller than the ambient dimension d. In this sense, the Hausdorff dimension acts as an 'intrinsic dimension' of the problem, mimicking the role of Vapnik–Chervonenkis (VC) dimension in classical generalization bounds.

These two contributions collectively show that heavier-tailed processes achieve smaller generalization error, implying that the heavy-tails of SGD incur an implicit regularization. Our results also provide a theoretical justification to the observations reported in [MM19] and [SSG19]. Besides, a remarkable feature of the Hausdorff dimension is that it solely depends on the tail behavior of the process; hence, contrary to existing capacity metrics, it does not necessarily grow with the number of parameters d. Furthermore, we provide an efficient approach to estimate the Hausdorff dimension by making use of existing tail index estimators, and empirically demonstrate the validity of our theory on various neural networks. Experiments on both synthetic and real data verify that our bounds do not grow with the problem dimension, providing an accurate characterization of the generalization performance.

2. Technical background

In this section, we provide the required technical background on stable distributions, Lévy and Feller processes, and the Hausdorff dimension.

2.1. Stable distributions

Stable distributions appear as the limiting distribution in the generalized central limit theorem [Lév37] and can be seen as a generalization of the Gaussian distribution. In this paper, we will be interested in symmetric α-stable distributions, denoted by $\mathcal{S}\alpha \mathcal{S}$. In the one-dimensional case, a random variable X is $\mathcal{S}\alpha \mathcal{S}(\sigma )$ distributed if its characteristic function (chf.) has the following form: $\mathbb{E}[\mathrm{exp}(i\omega X)]=\mathrm{exp}(-\vert \sigma \omega {\vert }^{\alpha })$, where $\alpha \in (0,2]$ is called the tail-index and $\sigma \in {\mathbb{R}}_{+}$ is called the scale parameter. When α = 2, $\mathcal{S}\alpha \mathcal{S}(\sigma )=\mathcal{N}(0,2{\sigma }^{2})$, where $\mathcal{N}$ denotes the Gaussian distribution in $\mathbb{R}$. As soon as α < 2, the distribution becomes heavy-tailed and $\mathbb{E}[\vert X{\vert }^{q}]$ is finite if and only if q < α, indicating that the variance of $\mathcal{S}\alpha \mathcal{S}$ is finite only when α = 2.
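
The following sketch samples symmetric α-stable variables with the Chambers–Mallows–Stuck method and illustrates the loss of finite variance as α drops below 2; the helper name sample_sas is ours.

```python
import numpy as np

def sample_sas(alpha, size, sigma=1.0, rng=None):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck construction."""
    rng = rng or np.random.default_rng()
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit-rate exponential
    if np.isclose(alpha, 1.0):                     # alpha = 1 reduces to the Cauchy distribution
        return sigma * np.tan(U)
    X = (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))
    return sigma * X

rng = np.random.default_rng(0)
for alpha in (2.0, 1.5, 1.0):
    x = sample_sas(alpha, 100_000, rng=rng)
    # For alpha = 2 the sample variance concentrates around 2*sigma^2 = 2;
    # for alpha < 2 it blows up, reflecting E|X|^q < infinity only for q < alpha.
    print(alpha, np.var(x))
```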

There are multiple ways to extend $\mathcal{S}\alpha \mathcal{S}$ to the multivariate case. In our experiments, we will be mainly interested in the elliptically-contoured α-stable distribution [ST94], whose chf. is given by $\mathbb{E}[\mathrm{exp}(i\langle \omega ,X\rangle )]=\mathrm{exp}(-{\Vert}\omega {{\Vert}}^{\alpha })$ for $X,\omega \in {\mathbb{R}}^{d}$, where ⟨⋅, ⋅⟩ denotes the Euclidean inner product. Another common choice is the multivariate α -stable distribution with independent components for a vector $\boldsymbol{\alpha }\in {\mathbb{R}}^{d}$, whose chf. is given by $\mathbb{E}[\mathrm{exp}(i\langle \omega ,X\rangle )]=\mathrm{exp}(-{\sum }_{i=1}^{d}\vert {\omega }_{i}{\vert }^{{\alpha }_{i}})$. Essentially, the ith component of X is distributed with $\mathcal{S}\alpha \mathcal{S}$ with parameters αi and σi = 1. Both of these multivariate distributions reduce to a multivariate Gaussian when their tail indices are 2.

2.2. Lévy and Feller processes

We begin by defining a general Lévy process (also called Lévy motion), which includes the Brownian motion Bt and the α-stable motion ${\mathrm{L}}_{t}^{\alpha }$ as special cases 7 . A Lévy process ${\left\{{\mathrm{L}}_{t}\right\}}_{t\geqslant 0}$ in ${\mathbb{R}}^{d}$ with initial point L0 = 0 is defined by the following properties:

  • (a)  
    For $N\in \mathbb{N}$ and t0 < t1 <⋯< tN , the increments $({\mathrm{L}}_{{t}_{i}}-{\mathrm{L}}_{{t}_{i-1}})$ are independent for all i.
  • (b)  
    For any t > s > 0, $({\mathrm{L}}_{t}-{\mathrm{L}}_{s})$ and ${\mathrm{L}}_{t-s}$ have the same distribution.
  • (c)  
    Lt is continuous in probability, i.e. for all δ > 0 and s ⩾ 0, $\mathbb{P}(\vert {\mathrm{L}}_{t}-{\mathrm{L}}_{s}\vert > \delta )\to 0$ as t → s.

By the Lévy–Khintchine formula [Sat99], the chf. of a Lévy process is given by $\mathbb{E}[\mathrm{exp}(i\langle \xi ,{\mathrm{L}}_{t}\rangle )]=\mathrm{exp}(-t\psi (\xi ))$, where $\psi :{\mathbb{R}}^{d}{\mapsto}\mathbb{C}$ is called the characteristic (or Lévy) exponent, given as:

$\psi (\xi )=-\mathrm{i}\langle b,\xi \rangle +\langle \xi ,{\Sigma}\xi \rangle +{\int }_{{\mathbb{R}}^{d}{\backslash}\left\{0\right\}}\left(1-{\mathrm{e}}^{\mathrm{i}\langle x,\xi \rangle }+\mathrm{i}\langle x,\xi \rangle {\mathbb{1}}_{\left\{{\Vert}x{\Vert}\leqslant 1\right\}}(x)\right)\nu (\mathrm{d}x)\qquad$ (5)

Here, $b\in {\mathbb{R}}^{d}$ denotes a constant drift, ${\Sigma}\in {\mathbb{R}}^{d\times d}$ is a positive semi-definite matrix, and ν is called the Lévy measure, which is a Borel measure on ${\mathbb{R}}^{d}{\backslash}\left\{0\right\}$ satisfying ${\int }_{{\mathbb{R}}^{d}}{\Vert}x{{\Vert}}^{2}/(1+{\Vert}x{{\Vert}}^{2})\nu (\mathrm{d}x)< \infty .$ The choice of (b, Σ, ν) determines the law of ${\mathrm{L}}_{t-s}$; hence, it fully characterizes the process Lt by the properties (a) and (b) above. For instance, from (5), we can easily verify that under the choice b = 0, ${\Sigma}=\frac{1}{2}{\mathrm{I}}_{d}$, and ν = 0, with Id denoting the d × d identity matrix, the function exp(−ψ(ξ)) becomes the chf. of a standard Gaussian in ${\mathbb{R}}^{d}$; hence, Lt reduces to Bt . On the other hand, if we choose b = 0, Σ = 0, and $\nu (\mathrm{d}x)=\frac{\mathrm{d}r}{{r}^{1+\alpha }}\lambda (\mathrm{d}y)$, for all $x=ry,(r,y)\in {\mathbb{R}}_{+}\times {\mathbb{S}}^{d-1}$, where ${\mathbb{S}}^{d-1}$ denotes the unit sphere in ${\mathbb{R}}^{d}$ and λ is an arbitrary Borel measure on ${\mathbb{S}}^{d-1}$, we obtain the chf. of a generic multivariate α-stable distribution; hence Lt reduces to ${\mathrm{L}}_{t}^{\alpha }$. Depending on λ, exp(−ψ(ξ)) becomes the chf. of an elliptically contoured α-stable distribution or an α-stable distribution with independent components [Xia03].

Feller processes (also called Lévy-type processes [BSW13]) are a general family of Markov processes, which further extend the scope of Lévy processes. In this study, we consider a class of Feller processes [Cou65], which locally behave like Lévy processes and additionally allow for state-dependent drifts b(w), diffusion matrices Σ(w), and Lévy measures ν(w, dy) for $w\in {\mathbb{R}}^{d}$. For a fixed state w, a Feller process ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$ is defined through the chf. of the random variable Wt − w, given as ${\psi }_{t}(w,\xi )=\mathbb{E}\left[\mathrm{exp}(-i\langle \xi ,{\mathrm{W}}_{t}-w\rangle )\right]$. A crucial characteristic of a Feller process related to its chf. is its symbol Ψ, defined as,

${\Psi}(w,\xi )=-\mathrm{i}\langle b(w),\xi \rangle +\langle \xi ,{\Sigma}(w)\xi \rangle +{\int }_{{\mathbb{R}}^{d}{\backslash}\left\{0\right\}}\left(1-{\mathrm{e}}^{\mathrm{i}\langle x,\xi \rangle }+\mathrm{i}\langle x,\xi \rangle {\mathbb{1}}_{\left\{{\Vert}x{\Vert}\leqslant 1\right\}}(x)\right)\nu (w,\mathrm{d}x)\qquad$ (6)

for $w,\xi \in {\mathbb{R}}^{d}$ [Jac02, Sch98, Xia03]. Here, for each $w\in {\mathbb{R}}^{d}$, ${\Sigma}(w)\in {\mathbb{R}}^{d\times d}$ is symmetric positive semi-definite, and for all w, ν(w, dx) is a Lévy measure.

Under mild conditions, one can verify that the SDE (4) we use as a proxy for the SGD algorithm indeed corresponds to a Feller process with b(w) = −∇f(w), Σ(w) = 2Σ1(w), and an appropriate choice of ν (see [HDS18]). We also note that many other popular stochastic optimization algorithms can be accurately represented by a Feller process, which we describe in appendix B. Hence, our results can be useful in a broader context.

2.3. Decomposable Feller processes

In this paper, we will focus on decomposable Feller processes introduced in [Sch98], which will be useful in both our theory and experiments. Let Wt be a Feller process with symbol Ψ. We call the process Wt 'decomposable at w0', if there exists a point ${w}_{0}\in {\mathbb{R}}^{d}$, such that ${\Psi}(w,\xi )=\psi (\xi )+\tilde{{\Psi}}(w,\xi )$, where ψ(ξ) := Ψ(w0, ξ) is called the sub-symbol and $\tilde{{\Psi}}(w,\xi ){:=}{\Psi}(w,\xi )-{\Psi}({w}_{0},\xi )$ is the remainder term. Here, $\tilde{{\Psi}}$ is assumed to satisfy certain smoothness and boundedness assumptions, which are provided in appendix C. Essentially, the technical regularity conditions on $\tilde{{\Psi}}$ impose a structure on the triplet (b, Σ, ν) around w0 which ensures that, around that point, Wt behaves like a Lévy process whose characteristic exponent is given by the sub-symbol ψ.

2.4. The Hausdorff dimension

Due to their recursive nature, Markov processes often generate 'random fractals' [Xia03] and understanding the structure of such fractals has been a major challenge in modern probability theory [BP17, Kho09, KX17, LG19, LY19, Yan18]. In this paper, we are interested in identifying the complexity of the fractals generated by a Feller process that approximates SGD.

The intrinsic complexity of a fractal is typically characterized by a notion called the Hausdorff dimension [Fal04], which extends the usual notion of dimension (e.g. a line segment is one-dimensional, a plane is two-dimensional) to fractional orders. Informally, this notion measures the 'roughness' of an object (i.e. a set), and in the context of Lévy processes, it is deeply connected to the tail properties of the corresponding Lévy measure [Sch98, Xia03, Yan18].

Before defining the Hausdorff dimension, we need to introduce the Hausdorff measure. Let $G\subset {\mathbb{R}}^{d}$ and δ > 0, and consider all the δ-coverings ${\left\{{A}_{i}\right\}}_{i}$ of G, i.e. each Ai denotes a set with diameter less than δ satisfying G ⊂ ∪i Ai . For any s ∈ (0, ∞), we then denote:

${\mathcal{H}}_{\delta }^{s}(G){:=}\mathrm{inf}\left\{\sum _{i=1}^{\infty }\mathrm{diam}{({A}_{i})}^{s}\right\}\qquad$ (7)

where the infimum is taken over all the δ-coverings. The s-dimensional Hausdorff measure of G is defined as the monotonic limit:

${\mathcal{H}}^{s}(G){:=}\underset{\delta \to 0}{\mathrm{lim}}\enspace {\mathcal{H}}_{\delta }^{s}(G)\qquad$ (8)

It can be shown that ${\mathcal{H}}^{s}$ is an outer measure; hence, it can be extended to a complete measure by the Carathéodory extension theorem [Mat99]. When s is an integer, ${\mathcal{H}}^{s}$ is equal to the s-dimensional Lebesgue measure up to a constant factor; thus, it strictly generalizes the notion of 'volume' to fractional orders. We now proceed with the definition of the Hausdorff dimension.

Definition 1. The Hausdorff dimension of $G\subset {\mathbb{R}}^{d}$ is defined as follows.

${\mathrm{dim}}_{\mathrm{H}}\enspace G{:=}\mathrm{inf}\left\{s > 0:{\mathcal{H}}^{s}(G)=0\right\}\qquad$ (9)

One can show that if dimHG = s, then ${\mathcal{H}}^{r}(G)=0$ for all r > s and ${\mathcal{H}}^{r}(G)=\infty $ for all r < s [EMG90]. In this sense, the Hausdorff dimension of G is the critical order s at which ${\mathcal{H}}^{s}(G)$ drops from ∞ to 0, and we always have 0 ⩽ dimHG ⩽ d [Fal04]. Apart from trivial cases such as ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathbb{R}}^{d}=d$, a canonical example is the well-known Cantor set, whose Hausdorff dimension is (log 2/log 3) ∈ (0, 1). Besides, the Hausdorff dimension of a Riemannian manifold corresponds to its intrinsic dimension, e.g. ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathbb{S}}^{d-1}=d-1$.
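
To see where the Cantor set value comes from, note that the level-k construction of the Cantor set C consists of ${2}^{k}$ intervals of length ${3}^{-k}$, so taking these intervals as a δ-covering with δ = 3−k gives ${\mathcal{H}}_{\delta }^{s}(C)\leqslant {2}^{k}{({3}^{-k})}^{s}={(2\cdot {3}^{-s})}^{k}$. This upper bound stays away from 0 and ∞ as k → ∞ precisely when $2\cdot {3}^{-s}=1$, i.e. s = log 2/log 3; a matching lower bound (e.g. via the natural uniform mass distribution on C) then yields ${\mathrm{dim}}_{\mathrm{H}}\enspace C=\mathrm{log}\enspace 2/\mathrm{log}\enspace 3$.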

We note that, starting with the seminal work of Assouad [Ass83], tools from fractal geometry have been considered in learning theory [DSD19, MSS19, SHTY13] in different contexts. In this paper, we consider the Hausdorff dimension of the sample paths of Markov processes in a learning theoretical framework, which, to the best of our knowledge, has not yet been investigated in the literature.

3. Mathematical setup

In this section, we make precise the mathematical framework that we will use in our theoretical results. Let $\mathcal{Z}=\mathcal{X}\times \mathcal{Y}$ denote the space of data points, with $\mathcal{X}$ being the space of features and $\mathcal{Y}$ the space of labels. We consider an unknown data distribution over $\mathcal{Z}$, denoted by μz . We assume that we have access to a training set with n elements, denoted as S = {z1, ..., zn }, where the elements of S are independently and identically distributed (i.i.d.) according to μz . We will write $S\sim {\mu }_{z}^{\otimes n}$, where ${\mu }_{z}^{\otimes n}$ is the n-times product measure of μz .

To assess the quality of an estimated parameter, we consider a loss function $\ell :{\mathbb{R}}^{d}\times \mathcal{Z}{\mapsto}{\mathbb{R}}_{+}$, such that ℓ(w, z) measures the loss induced by a single data point z for the particular choice of parameter $w\in {\mathbb{R}}^{d}$. We accordingly denote the population risk with

$\mathcal{R}(w){:=}{\mathbb{E}}_{z\sim {\mu }_{z}}\left[\ell (w,z)\right]\qquad$ (10)

and the empirical risk with

$\hat{\mathcal{R}}(w,S){:=}\frac{1}{n}\sum _{i=1}^{n}\ell (w,{z}_{i})\qquad$ (11)

We note that we allow the cost function f in (1) and the loss ℓ to be different from each other, where f should be seen as a surrogate loss function. In particular, we will have different sets of assumptions on f and ℓ. However, as f and ℓ are different from each other, the discrepancy between the risks of their respective minimizers would have an impact on generalization. We leave the analysis of such a discrepancy as future work.

An iterative training algorithm $\mathcal{A}$ (for example SGD) is a function of two variables S and U, where S denotes the dataset and U encapsulates all the algorithmic randomness (e.g. batch indices to be used in training). The algorithm $\mathcal{A}(S,U)$ returns the entire evolution of the parameters in the time frame [0, T], with ${[\mathcal{A}(S,U)]}_{t}={w}_{t}$ being the parameter value returned by $\mathcal{A}$ at time t (e.g. the parameters trained by SGD at time t). More precisely, given a training set S and a random variable U, the algorithm will output a random process ${\left\{{w}_{t}\right\}}_{t\in [0,T]}$ indexed by time, which is the trajectory of iterates. To formalize this definition, let us denote the class of bounded Borel functions defined from [0, T] to ${\mathbb{R}}^{d}$ by $\mathcal{B}([0,T],{\mathbb{R}}^{d})$, and define

$\mathcal{A}:{\mathcal{Z}}^{n}\times {\Omega}{\mapsto}\mathcal{B}([0,T],{\mathbb{R}}^{d}),\qquad (S,U){\mapsto}\mathcal{A}(S,U)\qquad$ (12)

where Ω denotes the domain of U. We will denote the law of U by μu , and without loss of generality we let T = 1.

In the remainder of the paper, we will consider the case where the algorithm $\mathcal{A}$ is chosen to be the trajectories produced by a Feller process W(S) (e.g. the proxy for SGD (4)), whose symbol depends on the training set S. More precisely, given $S\in {\mathcal{Z}}^{n}$, the output of the training algorithm $\mathcal{A}(S,\cdot )$ will be the random mapping $t{\mapsto}{\mathrm{W}}_{t}^{(S)}$, where the symbol of W(S) is determined by the drift bS (w), diffusion matrix ΣS (w), and the Lévy measure νS (w, ⋅) (see (6) for definitions), which all depend on S. In this context, the random variable U represents the randomness that is incurred by the Feller process. In particular, for the SDE proxy (4), U accounts for the randomness due to Bt and ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$.

As our framework requires $\mathcal{A}$ to produce continuous-time trajectories to represent the discrete-time recursion of SGD (2), we can consider the linearly interpolated continuous-time process; an approach which is commonly used in SDE analysis [Dal17, EH20, EMS18, NŞGR19, RRT17]. For a given $t\in [k\eta ,(k+1)\eta )$, we can define the process ${\tilde{\mathrm{W}}}_{t}$ as the linear interpolation of wk and wk+1 (see (2)), such that ${w}_{k}={\tilde{\mathrm{W}}}_{k\eta }$ for all k. In this case, the random variable U represents the randomness incurred by the choice of the random minibatches ${\tilde{B}}_{k}$ over the iterations (2).
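
A minimal sketch of this interpolation (the function and variable names are ours):

```python
import numpy as np

def interpolate_iterates(iterates, eta, t):
    """Evaluate the linearly interpolated process W~_t from discrete iterates w_0, w_1, ...
    For t in [k*eta, (k+1)*eta), return the linear interpolation of w_k and w_{k+1}."""
    k = min(int(np.floor(t / eta)), len(iterates) - 2)  # index of the active segment
    lam = (t - k * eta) / eta                           # position within the segment, in [0, 1)
    return (1.0 - lam) * iterates[k] + lam * iterates[k + 1]

# Sanity check: W~_{k*eta} recovers w_k exactly (up to floating-point error).
eta = 0.1
iterates = np.random.default_rng(0).normal(size=(11, 3))   # 11 iterates of a 3-dimensional w
assert np.allclose(interpolate_iterates(iterates, eta, 3 * eta), iterates[3])
```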

Throughout this paper, we will assume that S and U are independent from each other. In the case of (4), this will entail that the randomness in Bt and ${\mathrm{L}}_{t}^{\alpha }$ does not depend on S, or in the case of the SGD recursion, it will require the random sets ${\tilde{B}}_{k}\subset \left\{1,\dots ,n\right\}$ to be drawn independently from S. 8 Under this assumption, U does not play a crucial role in our analysis; hence, to ease the notation, we will occasionally omit the dependence on U and simply write $\mathcal{A}(S){:=}\mathcal{A}(S,U)$. We will further use the notation ${[\mathcal{A}(S)]}_{t}{:=}{[\mathcal{A}(S,U)]}_{t}$ to refer to wt . Without loss of generality, we will assume that the training algorithm is always initialized with zeros, i.e. ${[\mathcal{A}(S)]}_{0}=\mathbf{0}\in {\mathbb{R}}^{d}$, for all $S\in {\mathcal{Z}}^{n}$. Finally, we define the collection of the parameters given in a trajectory, as the image of $\mathcal{A}(S)$, i.e.

${\mathcal{W}}_{S}{:=}\left\{w\in {\mathbb{R}}^{d}:\exists t\in [0,1],\enspace w={[\mathcal{A}(S)]}_{t}\right\}\qquad$ (13)

and the collection of all possible parameters as the union

$\mathcal{W}{:=}{\bigcup }_{S\in {\mathcal{Z}}^{n}}{\mathcal{W}}_{S}\qquad$ (14)

Note that $\mathcal{W}$ is still random due to its dependence on U.

4. Generalization bounds via Hausdorff dimension

In this section, we present our main contributions, where we derive generalization bounds based on the Hausdorff dimension of the training trajectories.

4.1. Uniform Hausdorff dimension and Feller processes

In this part, we introduce the 'uniform Hausdorff dimension' property for a training algorithm $\mathcal{A}$, which is a notion of complexity based on the Hausdorff dimension of the trajectories generated by $\mathcal{A}$. By translating [Sch98] into our context, we will then show that decomposable Feller processes possess this property.

Definition 2. An algorithm $\mathcal{A}$ has uniform Hausdorff dimension dH if for any $n\in {\mathbb{N}}_{+}$ and any training set $S\in {\mathcal{Z}}^{n}$

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {d}_{\mathrm{H}},\quad {\mu }_{u}\text{-almost surely}.\qquad$ (15)

Since ${\mathcal{W}}_{S}\subset \mathcal{W}\subset {\mathbb{R}}^{d}$, by the definition of Hausdorff dimension, any algorithm $\mathcal{A}$ possesses the uniform Hausdorff dimension property trivially with dH = d. However, as we will illustrate in the sequel, dH can be much smaller than d, which is of our interest in this study.

Proposition 1. Let ${\left\{{\mathrm{W}}^{(S)}\right\}}_{S\in {\mathcal{Z}}^{n}}$ be a family of Feller processes. Assume that for each S, W(S) is decomposable at a point wS with sub-symbol ψS . Consider the algorithm $\mathcal{A}$ that returns ${[\mathcal{A}(S)]}_{t}={\mathrm{W}}_{t}^{(S)}$ for a given $S\in {\mathcal{Z}}^{n}$ and for every t ∈ [0, 1]. Then, we have

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {\beta }_{S}{:=}\mathrm{inf}\left\{\gamma \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}\enspace \frac{\vert {\psi }_{S}(\xi )\vert }{{\Vert}\xi {{\Vert}}^{\gamma }}=0\right\}\qquad$ (16)

μu -almost surely. Furthermore, $\mathcal{A}$ has uniform Hausdorff dimension with

${d}_{\mathrm{H}}=\underset{n\in {\mathbb{N}}_{+}}{\mathrm{sup}}\enspace \underset{S\in {\mathcal{Z}}^{n}}{\mathrm{sup}}\enspace {\beta }_{S}\qquad$ (17)

We provide all the proofs in appendix E. Informally, this result can be interpreted as follows. Thanks to the decomposability property, for each S, the process W(S) behaves like a Lévy motion around wS , and its characteristic exponent is given by the sub-symbol ψS . Because of this locally regular behavior, the Hausdorff dimension of the image of W(S) can be bounded by βS , which only depends on the tail behavior of the Lévy process whose exponent is the sub-symbol ψS .

Example 1. In order to illustrate proposition 1, let us consider a simple example, where ${\mathrm{W}}_{t}^{(S)}$ is taken as the d-dimensional α-stable process with d ⩾ 2, which is independent of the data sample S. More precisely, ${\mathrm{W}}_{t}^{(S)}$ is the solution to the SDE given by $\mathrm{d}{\mathrm{W}}_{t}^{(S)}=\mathrm{d}{\mathrm{L}}_{t}^{\alpha }$ for some $\alpha \in (0,2]$, where ${\mathrm{L}}_{1}^{\alpha }$ is an elliptically-contoured α-stable random vector. As ${\mathrm{W}}_{t}^{(S)}$ is already a Lévy process, it trivially satisfies the assumptions of proposition 1 with βS = α for all n and S [BG60], hence

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant \alpha \qquad$ (18)

μu -almost surely (in fact, one can show that ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}=\alpha $, see [BG60], theorem 4.2). Hence, the 'algorithm' ${[\mathcal{A}(S)]}_{t}={\mathrm{W}}_{t}^{(S)}$ has uniform Hausdorff dimension dH = α. This shows that as the process becomes heavier-tailed (i.e. α decreases), the Hausdorff dimension dH gets smaller. This behavior is illustrated in figure 1.
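
Trajectories like those in figure 1 can be simulated by summing i.i.d. α-stable increments. The sketch below uses scipy's levy_stable with skewness β = 0, which we assume matches the $\mathcal{S}\alpha \mathcal{S}$ convention of section 2.1, and, for simplicity, independent coordinates rather than the elliptically contoured vector; the qualitative picture (increasingly rare but large jumps as α decreases) is the same.

```python
import numpy as np
from scipy.stats import levy_stable

def stable_levy_path(alpha, d=2, T=1.0, n_steps=10_000, seed=0):
    """Simulate L_t^alpha on [0, T] by summing i.i.d. symmetric alpha-stable increments.
    By self-similarity, an increment over a time step dt has scale dt**(1/alpha)."""
    dt = T / n_steps
    increments = dt ** (1.0 / alpha) * levy_stable.rvs(
        alpha, 0.0, size=(n_steps, d), random_state=seed)
    return np.vstack([np.zeros(d), np.cumsum(increments, axis=0)])  # path started at L_0 = 0

paths = {a: stable_levy_path(a) for a in (2.0, 1.5, 1.0)}  # the three settings of figure 1
```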


Figure 1. Trajectories of ${\mathrm{L}}_{t}^{\alpha }$ for α = 2.0, 1.5, and 1.0. The colors indicate the evolution in time. We observe that the trajectories become 'simpler' as dimH Lα [0, T] = α gets smaller.


The term βS is often called the upper Blumenthal–Getoor (BG) index of the Lévy process with exponent ψS [BG60], and it is directly related to the tail behavior of the corresponding Lévy measure. In general, the value of βS decreases as the process gets heavier-tailed, which implies that heavier-tailed processes have smaller Hausdorff dimension; thus, they have smaller complexity.
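
As a concrete instance, for the elliptically contoured $\mathcal{S}\alpha \mathcal{S}$ exponent ${\psi }_{S}(\xi )={\Vert}\xi {{\Vert}}^{\alpha }$ the definition of the BG index (see (C.3) in appendix C) gives ${\beta }_{S}=\mathrm{inf}\left\{\lambda \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}{\Vert}\xi {{\Vert}}^{\alpha -\lambda }=0\right\}=\alpha $, recovering the value used in example 1.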

4.2. Generalization bounds via uniform Hausdorff dimension

This part provides the first main contribution of this paper, where we show that the generalization error of a training algorithm can be controlled by the Hausdorff dimension of its trajectories. Even though our interest is still in the case where $\mathcal{A}$ is chosen as a Feller process, the results in this section apply to more general algorithms. To this end, we will be mainly interested in bounding the following object:

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \qquad$ (19)

with high probability over the choice of S and U. Note that this is an algorithm dependent definition of generalization that is widely used in the literature (see [BE02] for a detailed discussion).

To derive our first result, we will require the following assumptions.

  • H 1  
    ℓ is bounded by B and L-Lipschitz continuous in w.
  • H 2  
    The diameter of $\mathcal{W}$ is finite μu -almost surely. S and U are independent.
  • H 3  
    $\mathcal{A}$ has uniform Hausdorff dimension dH.
  • H 4  
    For μu -almost every $\mathcal{W}$, there exists a Borel measure μ on ${\mathbb{R}}^{d}$ and positive numbers a, b, r0 and s such that $0< \mu (\mathcal{W})\leqslant \mu \left({\mathbb{R}}^{d}\right)< \infty $ and $0< a{r}^{s}\leqslant \mu ({B}_{d}(x,r))\leqslant b{r}^{s}< \infty $ for $x\in \mathcal{W},0< r\leqslant {r}_{0}$.

Boundedness of the loss can be relaxed at the expense of using sub-Gaussian concentration bounds and introducing more complexity into the expressions [MBM16]. More precisely, H 1 can be replaced with the assumption that ∃K > 0, such that ∀p, $\mathbb{E}{[\ell {(w,z)}^{p}]}^{1/p}\leqslant K\sqrt{p}$, and by using sub-Gaussian concentration our bounds will still hold with K in place of B. On the other hand, since we have a finite time-horizon and we fix the initial point of the processes to 0, by using [XZ20] lemma 7.1, we can show that the finite diameter condition on $\mathcal{W}$ holds almost surely, provided that standard regularity assumptions hold uniformly on the coefficients of W(S) (i.e. b, Σ, and ν in (6)) for all $S\in {\mathcal{Z}}^{n}$ and that a countability condition on $\mathcal{Z}$ holds. Finally, H 4 is a common condition in fractal geometry, and ensures that the set $\mathcal{W}$ is regular enough, so that we can relate its Hausdorff dimension to its covering numbers [Mat99] 9 . Under these conditions and an additional countability condition on $\mathcal{Z}$ (see [BE02] for similar assumptions), we present our first main result as follows.

Theorem 1. Assume that H 1 to 4 hold, and $\mathcal{Z}$ is countable. Then, for a sufficiently large n, we have

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant B\sqrt{\frac{2{d}_{\mathrm{H}}\enspace \mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}\qquad$ (20)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu .

This theorem shows that the generalization error can be controlled by the uniform Hausdorff dimension of the algorithm $\mathcal{A}$, along with the constants inherited from the regularity conditions. A noteworthy property of this result is that it does not have a direct dependency on the number of parameters d; on the contrary, we observe that dH plays the role that d plays in standard bounds [AB09], implying that dH acts as the intrinsic dimension and mimics the role of the VC dimension in binary classification [SSBD14]. Furthermore, in combination with proposition 1 that indicates dH decreases as the processes W(S) get heavier-tailed, theorem 1 implies that the generalization error can be controlled by the tail behavior of the process: heavier-tails imply less generalization error.

We note that the countability condition on $\mathcal{Z}$ is crucial for theorem 1. Thanks to this condition, in our proof, we invoke the stability properties of the Hausdorff dimension and we directly obtain a bound on ${\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}$. This bound combined with H 4 allows us to control the covering number of $\mathcal{W}$, and then the desired result can be obtained by using standard covering techniques [AB09, SSBD14].
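
To spell out how H 4 controls the covering number: let $\left\{{x}_{1},\dots ,{x}_{N}\right\}\subset \mathcal{W}$ be a maximal r-separated set with r ⩽ r0. The balls ${B}_{d}({x}_{i},r/2)$ are pairwise disjoint, so summing the lower bound in H 4 gives $N\,a{(r/2)}^{s}\leqslant \sum _{i=1}^{N}\mu ({B}_{d}({x}_{i},r/2))\leqslant \mu ({\mathbb{R}}^{d})$, i.e. $N\leqslant {a}^{-1}\mu ({\mathbb{R}}^{d}){(r/2)}^{-s}$. Since a maximal r-separated set is also an r-cover, the covering numbers of $\mathcal{W}$ grow at most like ${r}^{-s}$, and by theorem 5 in appendix D the exponent s coincides with ${\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}$; this is exactly the quantity that enters the union bound in the proofs.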

Next, we show that the log n dependency in front of dH is not crucial. In the next theorem, we show that the log n factor can be replaced with any increasing function (e.g. log log n) by using a chaining argument, at the expense of having L as a multiplicative factor (instead of log L). Theorem 1 holds for sufficiently large n; however, this threshold is not a priori known, which is a limitation of the result.

Theorem 2. Assume that H 1 to 4 hold, and $\mathcal{Z}$ is countable. Then, for any function $\rho :\mathbb{R}\to \mathbb{R}$ satisfying $\underset{x\to \infty }{\mathrm{lim}}\enspace \rho (x)=\infty $, and for a sufficiently large n, we have

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu , where c is an absolute constant.

4.3. Generalization bounds via non-uniform Hausdorff dimension

In our second main result, we control the generalization error without the countability assumption on $\mathcal{Z}$, and more importantly we will also relax H 3. Our main goal will be to relate the error to the Hausdorff dimension of a single ${\mathcal{W}}_{S}$, as opposed to dH, which uniformly bounds ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{{S}^{\prime }}$ for every ${S}^{\prime }\in {\mathcal{Z}}^{n}$. In order to achieve this goal, we introduce a technical assumption, which lets us control the statistical dependency between the training set S and the set of parameters ${\mathcal{W}}_{S}$.

For any δ > 0, let us consider a finite δ-cover of $\mathcal{W}$ by closed balls of radius δ, whose centers are on the fixed grid

$\left\{\left(\frac{(2{j}_{1}+1)\delta }{2\sqrt{d}},\dots ,\frac{(2{j}_{d}+1)\delta }{2\sqrt{d}}\right):{j}_{1},\dots ,{j}_{d}\in \mathbb{Z}\right\}\qquad$ (21)

and collect the center of each ball in the set Nδ . Then, for each S, let us define the set ${N}_{\delta }(S){:=}\left\{x\in {N}_{\delta }:{B}_{d}(x,\delta )\cap {\mathcal{W}}_{S}\ne \varnothing \right\}$, where ${B}_{d}(x,\delta )\subset {\mathbb{R}}^{d}$ denotes the closed ball centered around $x\in {\mathbb{R}}^{d}$ with radius δ.

  • H 5  
    Let ${\mathcal{Z}}^{\infty }{:=}(\mathcal{Z}\times \mathcal{Z}\times \cdots \enspace )$ denote the countable product endowed with the product topology and let $\mathfrak{B}$ be the Borel σ-algebra generated by ${\mathcal{Z}}^{\infty }$. Let $\mathfrak{F},\mathfrak{G}$ be the sub-σ-algebras of $\mathfrak{B}$ generated by the collections of random variables given by $\left\{\hat{\mathcal{R}}(w,S):w\in \mathcal{W},n\geqslant 1\right\}$ and $\left\{\mathbb{1}\left\{w\in {N}_{\delta }({\mathcal{W}}_{S})\right\}:\delta \in {\mathbb{Q}}_{ > 0},w\in {N}_{\delta },n\geqslant 1\right\}$ respectively. There exists a constant M ⩾ 1 such that for any $A\in \mathfrak{F}$, $B\in \mathfrak{G}$ we have $\mathbb{P}\left[A\cap B\right]\leqslant M\mathbb{P}\left[A\right]\mathbb{P}[B]$.

This assumption is common in statistics and is sometimes referred to as the ψ-mixing condition, a measure of weak dependence often used in proving limit theorems, see e.g. [Bra83]; yet, it is unfortunately hard to verify this condition in practice. In our context H 5 essentially quantifies the dependence between S and the set ${\mathcal{W}}_{S}$, through the constant M > 0: smaller M indicates that the dependence of $\hat{\mathcal{R}}$ on the training sample S is weaker. This concept is also similar to the mutual information used recently in [AAV18, HŞKM21, RZ19, XR17] and to the concept of stability [BE02].

Theorem 3. Assume that H 1, 2 and 5 hold, and that H 4 holds with ${\mathcal{W}}_{S}$ in place of $\mathcal{W}$ for all n ⩾ 1 and $S\in {\mathcal{Z}}^{n}$ (where s, a, b, r0 can potentially depend on n and S). Then, for n sufficiently large, we have

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant 2B\sqrt{\frac{\left[{\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}+1\right]{\mathrm{log}}^{2}(n{L}^{2})}{n}+\frac{\mathrm{log}(7M/\gamma )}{n}}\qquad$ (22)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu .

This result shows that under H 5, we can replace dH in theorem 1 with ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$, at the expense of introducing the coupling coefficient M into the bound. We observe that two competing terms are governing the generalization error: in the case where ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ is small, the error is dominated by the coupling parameter M, and vice versa. On the other hand, in the context of proposition 1, ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {\beta }_{S}$, μu -almost surely, implying again that a heavy-tailed W(S) would achieve smaller generalization error as long as the dependency between S and ${\mathcal{W}}_{S}$ is weak.

5. Experiments

We empirically study the generalization behavior of deep neural networks from the Hausdorff dimension perspective. We use VGG networks [SZ15] as they perform well in practice, and their depth (the number of layers) can be controlled directly. We vary the number of layers from D = 4 to D = 19, resulting in a number of parameters d between 1.3M and 20M. We train the models on the CIFAR-10 dataset [KH09] using SGD with various stepsizes η and batch sizes B. We provide the full range of parameters and additional implementation details in appendix A. The code can be found at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization.

We assume that SGD can be well-approximated by the process (4). Hence, to bound the corresponding ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ to be used in theorem 3, we invoke proposition 1, which relies on the existence of a point wS around which the process behaves like a regular Lévy process with exponent ψS . Considering the empirical observation that SGD exhibits a 'diffusive behavior' around a local minimum [BJSG+18], we take wS to be the local minimum found by SGD and assume that the conditions of proposition 1 hold around that point. This perspective indicates that the generalization error can be controlled by the BG index βS of the Lévy process defined by ψS (ξ); the sub-symbol of the process (4) around wS .

Estimating the BG index for a general Lévy process is a challenging task; however, the choice of the SDE (4) imposes some structure on ψS , which lets us express βS in a simpler form. Inspired by the observation that the tail-index of the gradient noise in a multi-layer neural network differs from layer to layer, as reported in [ŞGN+19], we will assume that, around the local minimum wS , the dynamics of SGD are similar to the Lévy motion with frozen coefficients: ${{\Sigma}}_{2}({w}_{S}){\mathrm{L}}^{\boldsymbol{\alpha }({w}_{S})}$, see (4) for definitions. We will further impose that, around wS , the coordinates corresponding to the same layer l have the same tail-index αl . Under this assumption, the BG index can be analytically computed as ${\beta }_{S}={\mathrm{max}}_{l}\enspace {\alpha }_{l}\in (0,2]$ [Hen73, MX05]. While the range (0,2] might seem narrow at first sight, we note that ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$, hence βS , determines the order of the generalization error, and this parameter gets closer to 0 as more layers are added to the network (see figure 2). Thanks to this simplification, we can easily compute βS by first estimating each αl with the estimator proposed in [MMO15], which can efficiently estimate αl from multiple SGD iterates.
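
A sketch of this estimation step is given below. The function estimate_alpha is our rendering of the moment-based estimator of [MMO15] (as used in [ŞGN+19]), which compares the average log-magnitude of the (centered) observations with that of their block sums; estimate_beta_S then applies it layer-wise and returns $\hat{\beta }_{S}={\mathrm{max}}_{l}\enspace {\hat{\alpha }}_{l}$. The centering and the choice of block size are our assumptions.

```python
import numpy as np

def estimate_alpha(x, K1=None):
    """Tail-index estimator in the spirit of [MMO15]:
    1/alpha_hat = (mean log|block sums of size K1| - mean log|x_i|) / log(K1)."""
    x = np.asarray(x, dtype=float).ravel()
    x = x - x.mean()                                  # assume centered observations
    K = len(x)
    K1 = K1 or int(np.sqrt(K))                        # block size (assumption)
    K2 = K // K1
    x = x[:K1 * K2]
    Y = x.reshape(K2, K1).sum(axis=1)                 # block sums
    inv_alpha = (np.log(np.abs(Y) + 1e-12).mean()
                 - np.log(np.abs(x) + 1e-12).mean()) / np.log(K1)
    return 1.0 / inv_alpha

def estimate_beta_S(layerwise_iterates):
    """layerwise_iterates: dict mapping a layer name to an array of its SGD iterates
    collected over the last epoch. Returns beta_S = max_l alpha_l."""
    return max(estimate_alpha(v) for v in layerwise_iterates.values())
```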


Figure 2. Empirical study of the generalization behavior of VGG [SZ15] networks with various depth values (the number of layers is shown as D). As our theory predicts, the generalization error is strongly correlated with βS . As ${\beta }_{S}\in (0,2]$, estimates exceeding 2 are an artifact of the estimator.


We trained all the models for 100 epochs and computed their βS over the last epoch, assuming that the iterations reach near local minima. We monitor the generalization error, in terms of the difference between the training and test accuracy, with respect to the estimated βS in figure 2(a). We also plot the final test accuracy in figure 2(b). The test accuracy results validate that the models perform similarly to the state-of-the-art, which suggests that our empirical study matches practically relevant application settings. The results in figure 2(a) indicate that, as predicted by our theory, the generalization error is strongly correlated with βS , which is an upper-bound on the Hausdorff dimension. With increasing βS (implying increasing Hausdorff dimension), the generalization error increases, as our theory indicates. Moreover, the resulting behavior validates the importance of considering ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ as opposed to the ambient dimension: for example, the number of parameters in the four-layer network is significantly lower than in the other networks; however, its Hausdorff dimension as well as its generalization error are significantly higher. Even more importantly, there is no monotonic relationship between the number of parameters and ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$. In other words, increasing depth is not always beneficial from the generalization perspective; it is only beneficial if it also decreases ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$. We also observe an interesting behavior: the choice of η and B seems to affect βS , indicating that the choice of the algorithm parameters can impact the tail behavior of the algorithm. In summary, our theory holds over a large selection of depths, step-sizes, and batch sizes when tested on deep neural networks. We provide additional experiments on synthetic models in appendix A.

6. Conclusion

In this paper, we rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon. We established this relationship through the Hausdorff dimension of the SDE approximating the algorithm, and proved a generalization error bound based on this notion of complexity. Unlike the ambient dimension, our bounds do not necessarily grow with the number of parameters in the network, and they solely depend on the tail behavior of the training process, providing an explanation for the implicit regularization effect of heavy-tailed SGD.

Finally, we note that extensions of our framework have shown that the generalization error can be linked to topological data analysis tools [BLGŞ21], and our tools can be used for analyzing discrete-time dynamical systems through the fractal dimensions of their invariant measures [CDE+21].

Acknowledgments

In the early version of the manuscript, which was published at NeurIPS 2020, we identified an imprecision in definition 2, and a mistake in figure 2 and in the statement and the proof of theorem 3, which are now fixed. The authors are grateful to Berfin Şimşek and Xiaochuan Yang for fruitful discussions, and thank Vaishnavh Nagarajan for pointing out the imprecision in definition 2. The contribution of Umut Şimşekli to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project.

Appendix A.: Additional experimental results and implementation details

A.1. Comparison with other generalization metrics for deep networks

In this section, we empirically compare the proposed metric with existing generalization metrics developed for neural networks. Specifically, we consider the 'flat minima' argument of Jastrzebski et al [JKA+17] and plot the generalization error vs η/B, the ratio of the step size to the batch size. As a second comparison, we use the heavy-tailed random matrix theory based metric of Martin and Mahoney [MM19]. We plot the generalization error with respect to each metric in figure A1. As the results suggest, our metric is the one that correlates best with the empirically observed generalization error. The metric proposed by Martin and Mahoney [MM19] fails for small numbers of layers, and the resulting behavior is not monotonic. Similarly, η/B captures the relationship for very deep networks (D = 16 and 19); however, it fails in the other settings.


Figure A1. Empirical comparison to other capacity metrics.


We also note that norm-based capacity metrics [NTS15] typically increase with increasing dimension d; we refer to [NBMS17] for details.

A.2. Synthetic experiments

We consider a simple synthetic logistic regression problem, where the data distribution is a Gaussian mixture model with two components. Each data point ${z}_{i}\equiv ({x}_{i},{y}_{i})\in \mathcal{Z}={\mathbb{R}}^{d}\times \left\{-1,1\right\}$ is generated by simulating the model: yi ∼ Bernoulli(1/2) and ${x}_{i}\vert {y}_{i}\sim \mathcal{N}({m}_{{y}_{i}},100{\mathrm{I}}_{d})$, where the means are drawn from a Gaussian: ${m}_{-1},{m}_{1}\sim \mathcal{N}(0,25{\mathrm{I}}_{d})$. The loss function is the logistic loss, $\ell (w,z)=\mathrm{log}(1+\mathrm{exp}(-y\,{x}^{\top }w))$.

As for the algorithm, we consider a data-independent multivariate stable process: ${[\mathcal{A}(S)]}_{t}={\mathrm{L}}_{t}^{\alpha }$ for any $S\in {\mathcal{Z}}^{n}$, where ${\mathrm{L}}_{1}^{\alpha }$ is distributed with an elliptically contoured α-stable distribution with $\alpha \in (0,2]$ (see section 2): when α = 2, ${\mathrm{L}}_{t}^{\alpha }$ is just a Brownian motion, and as α gets smaller, the process becomes heavier-tailed. By theorem 4.2 of [BG60], $\mathcal{A}$ has the uniform Hausdorff dimension property with dH = α, independently of d, when d ⩾ 2.

We set d = 10 and generate points to represent the whole population, i.e. ${\left\{{z}_{i}\right\}}_{i=1}^{{n}_{\text{tot}}}$ with ntot = 100 K. Then, for different values of α, we simulate $\mathcal{A}$ for t ∈ [0, 1], by using a small step-size η = 0.001 (the total number of iterations is hence 1/η). We finally draw 20 random sets S with n elements from this population, and we monitor the maximum difference ${\mathrm{sup}}_{w\in {\mathcal{W}}_{S}}\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\vert $ for different values of n. We repeat the whole procedure 20 times and report the average values in figure A2. We observe that the results support theorems 1 and 3: for every n, the generalization error decreases with decreasing α, illustrating the role of the Hausdorff dimension.
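
A condensed sketch of this synthetic pipeline follows; it uses scipy's levy_stable with independent $\mathcal{S}\alpha \mathcal{S}$ coordinates in place of the elliptically contoured vector for simplicity, and all names are ours.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
d, n_tot, eta, n = 10, 100_000, 0.001, 1_000

# Gaussian-mixture population: y ~ Bernoulli(1/2) on {-1, 1}, x | y ~ N(m_y, 100 I_d).
m = {-1: rng.normal(0.0, 5.0, d), 1: rng.normal(0.0, 5.0, d)}
y_all = rng.choice([-1, 1], n_tot)
x_all = np.stack([rng.normal(m[y], 10.0) for y in y_all])

def loss(w, x, y):
    return np.logaddexp(0.0, -y * (x @ w))         # numerically stable logistic loss

def generalization_gap(alpha):
    # Data-independent trajectory [A(S)]_t = L_t^alpha, simulated with step eta.
    inc = eta ** (1.0 / alpha) * levy_stable.rvs(alpha, 0.0, size=(int(1 / eta), d),
                                                 random_state=0)
    W_S = np.vstack([np.zeros(d), np.cumsum(inc, axis=0)])
    idx = rng.choice(n_tot, n, replace=False)      # a training set S of size n
    x_S, y_S = x_all[idx], y_all[idx]
    return max(abs(loss(w, x_S, y_S).mean() - loss(w, x_all, y_all).mean()) for w in W_S)

print({alpha: generalization_gap(alpha) for alpha in (2.0, 1.8, 1.5, 1.2)})
```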


Figure A2. Results on synthetic data.


A.3. Implementation details for the deep neural network experiments

In this section, we provide the additional details that are skipped in the main text for the sake of space. We use the following VGG-style neural networks with various numbers of layers:

  • VGG4: Conv(512)–ReLU–MaxPool–Linear
  • VGG6: Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG7: Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG8: Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG11: Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–MaxPool –Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear
  • VGG16: Conv(64)–ReLU–Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear
  • VGG19: Conv(64)–ReLU–Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear

where all convolutions are annotated with the number of filters in parentheses. Moreover, we use the following hyperparameter ranges for the step size of SGD: {1e–2, 1e–3, 3e–3, 1e–4, 3e–4, 1e–5, 3e–5}, with batch sizes {32, 64, 128, 256}. All networks are trained with the cross-entropy loss and ReLU activations, and no additional technique like batch normalization or dropout is used. While computing the empirical BG index over the layers, we only consider the convolutional layers, ignoring the final fully-connected layer. We also release the full source code of the experiments at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization.
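
For illustration, the smallest of these architectures (VGG4) can be written in a few lines of PyTorch; the 3 × 3 kernels with padding 1 and the 2 × 2 max-pooling are our assumptions, chosen so that the parameter count lands near the 1.3M quoted in section 5.

```python
import torch.nn as nn

# VGG4 from the list above: Conv(512)-ReLU-MaxPool-Linear, for 3x32x32 CIFAR-10 inputs.
vgg4 = nn.Sequential(
    nn.Conv2d(3, 512, kernel_size=3, padding=1),   # assumed 3x3 kernels, 'same' padding
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(512 * 16 * 16, 10),                  # 10 CIFAR-10 classes
)
print(sum(p.numel() for p in vgg4.parameters()))   # roughly 1.3M parameters
```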

Appendix B.: Representing optimization algorithms as Feller processes

Thanks to the generality of the Feller processes, we can represent multiple popular stochastic optimization algorithms as a Feller process, in addition to SGD. For instance, let us consider the following SDE:

$\mathrm{d}{\mathrm{W}}_{t}=-{{\Sigma}}_{0}({\mathrm{W}}_{t})\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{{\Sigma}}_{1}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}+{{\Sigma}}_{2}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}\qquad$ (B.1)

where Σ0, Σ1, Σ2 are d × d matrix-valued functions and the tail-index  α (⋅) of ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ is also allowed to change depending on the value of the state Wt . We can verify that this SDE corresponds to a Feller process with b(w) = −Σ0(w)∇f(w), Σ(w) = 2Σ1(w), and an appropriate choice of ν [HDS18]. As we discussed in the main document, the choice Σ0 = Id can represent SGD with state-dependent Gaussian and/or heavy-tailed noise. Besides, we can choose an appropriate Σ0 in order to represent optimization algorithms that use second-order geometric information, such as natural gradient [Ama98] or stochastic Newton [EM15] algorithms. On the other hand, by using the SDEs proposed in [BB18, GGZ18, LPH+17, OKL19, ŞZTG20], we can further represent momentum-based algorithms such as SGD with momentum [Pol64] as Feller processes.
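
As a small numerical illustration of the role of Σ0, the Euler discretization of (B.1) with Σ1 = Σ2 = 0 and a diagonal Σ0 is simply preconditioned gradient descent; the diagonal preconditioner below is a placeholder, standing in for, e.g., an approximate second-order scaling.

```python
import numpy as np

def preconditioned_step(w, grad, precond_diag, eta=0.01):
    """One Euler step of dW_t = -Sigma_0(W_t) grad f(W_t) dt with a diagonal Sigma_0."""
    return w - eta * precond_diag * grad          # elementwise rescaling of the gradient

w = np.zeros(5)
grad = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
precond = 1.0 / (np.abs(grad) + 1.0)              # placeholder diagonal preconditioner
w_next = preconditioned_step(w, grad, precond)
```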

Appendix C.: Decomposable Feller processes and their Hausdorff dimension

In our study, we focus on decomposable Feller processes, introduced in [Sch98]. Let us consider a Feller process expressed by its symbol Ψ. We call the process defined by Ψ decomposable at w0, if there exists a point ${w}_{0}\in {\mathbb{R}}^{d}$ such that the symbol can be decomposed as

${\Psi}(w,\xi )=\psi (\xi )+\tilde{{\Psi}}(w,\xi )\qquad$ (C.1)

where ψ(ξ) := Ψ(w0, ξ) is the sub-symbol and $\tilde{{\Psi}}(w,\xi )={\Psi}(w,\xi )-{\Psi}({w}_{0},\xi )$ is the remainder term. Let $\mathbf{j}\in {\mathbb{N}}_{0}^{d}$ denote a multi-index 10 . We assume that there exist functions $a,{{\Phi}}_{\mathbf{j}}:{\mathbb{R}}^{d}{\mapsto}\mathbb{R}$ such that the following hold:

  • Ψ(x, 0) ≡ 0
  • ${{\Vert}{{\Phi}}_{0}{\Vert}}_{\infty }< \infty $, and ${{\Phi}}_{\mathbf{j}}\in {L}^{1}\left({\mathbb{R}}^{d}\right)$ for all |j| ⩽ d + 1.
  • $\left\vert {\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )\right\vert \leqslant {{\Phi}}_{\mathbf{j}}(w)\left(1+{a}^{2}(\xi )\right)$, for all $w,\xi \in {\mathbb{R}}^{d}$ and |j| ⩽ d + 1.
  • ${a}^{2}(\xi )\geqslant {\kappa }_{0}{\Vert}\xi {{\Vert}}^{{r}_{0}}$, for ||ξ|| large, ${r}_{0}\in (0,2]$, and κ0 > 0.

The Hausdorff dimension of the image of a decomposable Feller process is bounded, due to the following result.

Theorem 4 ([Sch98] theorem 4).

Let Ψ(x, ξ) generate a Feller process, denoted by ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$. Assume that Ψ is decomposable at w0 with the sub-symbol ψ. Then, for any given $T\in {\mathbb{R}}_{+}$, we have

${\mathrm{dim}}_{\mathrm{H}}\enspace \mathrm{W}([0,T])\leqslant \beta ,\quad {\mathbb{P}}^{x}\text{-almost surely, for every}\enspace x\in {\mathbb{R}}^{d},\qquad$ (C.2)

where W([0, T]) := {w: w = Wt , for some t ∈ [0, T]} is the image of the process, ${\mathbb{P}}^{x}$ denotes the law of the process ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$ with initial value x, and β is the upper Blumenthal–Getoor index of the Lévy process with the characteristic exponent ψ(ξ), given as follows:

$\beta {:=}\mathrm{inf}\left\{\lambda \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}\enspace \frac{\vert \psi (\xi )\vert }{{\Vert}\xi {{\Vert}}^{\lambda }}=0\right\}\qquad$ (C.3)

Appendix D.: Additional technical background

In this section, we will define the notions that will be used in our proofs. For the sake of completeness we also provide the main theoretical results that will be used in our proofs.

D.1. The Minkowski dimension

In our proofs, in addition to the Hausdorff dimension, we also make use of another notion of dimension, referred to as the Minkowski dimension (also known as the box-counting dimension [Fal04]), which is defined as follows.

Definition 3. Let $G\subset {\mathbb{R}}^{d}$ be a set and let Nδ (G) be any one of the following quantities:

  • The smallest number of sets of diameter at most δ which cover G
  • The smallest number of closed balls of diameter at most δ which cover G
  • The smallest number of cubes of side at most δ which cover G
  • The number of δ-mesh cubes that intersect G
  • The largest number of disjoint balls of radius δ, whose centers are in G.

Then the lower- and upper-Minkowski dimensions of G are respectively defined as follows:

$\underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G{:=}\underset{\delta \to 0}{\mathrm{lim}\enspace \mathrm{inf}}\enspace \frac{\mathrm{log}\enspace {N}_{\delta }(G)}{\mathrm{log}(1/\delta )},\qquad \bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace G{:=}\underset{\delta \to 0}{\mathrm{lim}\enspace \mathrm{sup}}\enspace \frac{\mathrm{log}\enspace {N}_{\delta }(G)}{\mathrm{log}(1/\delta )}\qquad$ (D.1)

In case $\underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace G$, the Minkowski dimension dimM(G) is their common value.

We always have $0\leqslant {\mathrm{dim}}_{\mathrm{H}}\enspace G\leqslant \underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G\leqslant \bar{{\mathrm{dim}}_{\mathrm{M}}}G\leqslant d$ where the inequalities can be strict [Fal04].
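
As a numerical companion to definition 3, the sketch below estimates the Minkowski (box-counting) dimension of a finite-level approximation of the Cantor set by counting δ-mesh cubes at several scales; the regression slope should come out close to log 2/log 3 ≈ 0.63.

```python
import numpy as np

def cantor_points(level):
    """Left endpoints of the intervals in the level-k construction of the Cantor set."""
    pts = np.array([0.0])
    for _ in range(level):
        pts = np.concatenate([pts / 3.0, pts / 3.0 + 2.0 / 3.0])
    return pts

def box_counting_dimension(points, scales):
    """Estimate dim_M by regressing log N_delta(G) on log(1/delta) over delta-mesh cubes."""
    # The small offset guards against floating-point roundoff at cube boundaries.
    counts = [len(np.unique(np.floor(points / delta + 1e-9))) for delta in scales]
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

pts = cantor_points(12)                            # 2^12 points at resolution 3^{-12}
scales = [3.0 ** (-k) for k in range(2, 10)]
print(box_counting_dimension(pts, scales), np.log(2) / np.log(3))
```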

It is possible to construct examples where the Hausdorff and Minkowski dimensions are different from each other. However, in many interesting cases, these two dimensions often match each other [Fal04]. In this paper, we are interested in such a case, i.e. the case when the Hausdorff and Minkowski dimensions match. The following result identifies the conditions for which the two dimensions match each other, which form the basis of H 4:

Theorem 5 ([Mat99] theorem 5.7). Let A be a non-empty bounded subset of ${\mathbb{R}}^{d}$. Suppose there is a Borel measure μ on ${\mathbb{R}}^{d}$ and there are positive numbers a, b, r0 and s such that $0< \mu (A)\leqslant \mu \left({\mathbb{R}}^{d}\right)< \infty $ and

$a{r}^{s}\leqslant \mu ({B}_{d}(x,r))\leqslant b{r}^{s}\quad \text{for}\enspace x\in A,\enspace 0< r\leqslant {r}_{0}.\qquad$ (D.2)

Then ${\mathrm{dim}}_{\mathrm{H}}\enspace A={\mathrm{dim}}_{\mathrm{M}}\enspace A=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace A=s$.

D.2. Egoroff's theorem

Egoroff's theorem is an important result in measure theory and establishes a condition under which pointwise convergence of measurable functions becomes uniform on a set of almost full measure.

Theorem 6 (Egoroff's theorem [Bog07] theorem 2.2.1). Let $(X,\mathcal{A},\mu )$ be a space with a finite nonnegative measure μ and let μ-measurable functions fn be such that μ-almost everywhere there is a finite limit f(x) := limnfn (x). Then, for every ɛ > 0, there exists a set ${X}_{\varepsilon }\in \mathcal{A}$ such that $\mu \left(X{\backslash}{X}_{\varepsilon }\right)< \varepsilon $ and the functions fn converge to f uniformly on Xɛ .

Appendix E.: Postponed proofs

E.1. Proof of proposition 1

Proof. Let ΨS denote the symbol of the process W(S). Then, the desired result can be obtained by directly applying theorem 4 to each ΨS . □

E.2. Proof of theorem 1

We first prove the following more general result which relies on $\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$.

Lemma 1. Assume that ℓ is bounded by B and L-Lipschitz continuous in w. Let $\mathcal{W}\subset {\mathbb{R}}^{d}$ be a set with finite diameter. Then, for n sufficiently large, we have

$\underset{w\in \mathcal{W}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant B\sqrt{\frac{2\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}\qquad$ (E.1)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$.

Proof. As ℓ is L-Lipschitz, so are $\mathcal{R}$ and $\hat{\mathcal{R}}$. By using the notation ${\hat{\mathcal{R}}}_{n}(w){:=}\hat{\mathcal{R}}(w,S)$, and by the triangle inequality, for any ${w}^{\prime }\in \mathcal{W}$ we have:

$\left\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\right\vert \leqslant \left\vert {\hat{\mathcal{R}}}_{n}(w)-{\hat{\mathcal{R}}}_{n}({w}^{\prime })\right\vert +\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert +\left\vert \mathcal{R}({w}^{\prime })-\mathcal{R}(w)\right\vert \qquad$ (E.2)

$\leqslant 2L{\Vert}w-{w}^{\prime }{\Vert}+\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert .\qquad$ (E.3)

Now since $\mathcal{W}$ has finite diameter, let us consider a finite δ-cover of $\mathcal{W}$ by balls and collect the center of each ball in the set ${N}_{\delta }{:=}{N}_{\delta }(\mathcal{W})$. Then, for each $w\in \mathcal{W}$, there exists a w' ∈ Nδ , such that ||ww'|| ⩽ δ. By choosing this w' in the above inequality, we obtain:

$\left\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\right\vert \leqslant 2L\delta +\underset{{w}^{\prime }\in {N}_{\delta }}{\mathrm{max}}\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert \qquad$ (E.4)

Taking the supremum of both sides of the above equation yields:

Equation (E.5)

Using the union bound over Nδ , we obtain

Equation (E.6)

Equation (E.7)

Further, for δ > 0, since Nδ has finitely many elements, we can invoke Hoeffding's inequality for each of the summands on the right-hand side and obtain

Equation (E.8)
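For concreteness, a typical form of this pointwise bound, under the convention that the loss takes values in an interval of length B (the exact constant in (E.8) depends on this convention), is

$\mathbb{P}\left(\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\vert \geqslant \varepsilon \right)\leqslant 2\,\mathrm{exp}\left(-\frac{2n{\varepsilon }^{2}}{{B}^{2}}\right)\quad \text{for each fixed}\enspace {w}^{\prime }\in {N}_{\delta },$

and the union bound over the |Nδ | centers multiplies the right-hand side by |Nδ |.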

Notice that Nδ is a random set, and choosing ɛ based on |Nδ |, one can obtain a deterministic γ. Therefore, we can plug this back in (E.5) and obtain that, with probability at least 1 − γ

Equation (E.9)

Now since $\mathcal{W}\subset {\mathbb{R}}^{d}$, $\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$ is finite. Then, for any sequence ${\left\{{\delta }_{n}\right\}}_{n\in \mathbb{N}}$ such that ${\mathrm{lim}}_{n\to \infty }{\delta }_{n}=0$, we have: for every ɛ > 0, there exists nɛ > 0 such that n ⩾ nɛ implies

Equation (E.10)

Choosing ${\delta }_{n}=1/\sqrt{n{L}^{2}}$ and ${\epsilon}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$, we have, for all $n\geqslant {n}_{\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}}$,

Equation (E.11)
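For the reader's convenience, the arithmetic behind this choice is as follows, assuming (E.10) is the covering-number form of the upper-Minkowski dimension, i.e. $\log \vert {N}_{{\delta }_{n}}\vert \leqslant \left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}+{\epsilon}\right)\log (1/{\delta }_{n})$ for sufficiently large n. With ${\epsilon}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$ and ${\delta }_{n}=1/\sqrt{n{L}^{2}}$,

$\log \vert {N}_{{\delta }_{n}}\vert \leqslant 2\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\log \sqrt{n{L}^{2}}=\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\log (n{L}^{2}),\qquad L{\delta }_{n}=\frac{1}{\sqrt{n}},$

which is how the log(nL2) term and the vanishing discretization error Lδn enter the bound below.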

Therefore, we obtain with probability at least 1 − γ

Equation (E.12)

Equation (E.13)

for sufficiently large n. This concludes the proof. □

We now proceed to the proof of theorem 1.

Proof of theorem 1. By noticing that ${\mathcal{Z}}^{n}$ is countable (since $\mathcal{Z}$ is countable) and using the property that ${\mathrm{dim}}_{\mathrm{H}}{\cup }_{i\in \mathbb{N}}{A}_{i}={\mathrm{sup}}_{i\in \mathbb{N}}\enspace {\mathrm{dim}}_{\mathrm{H}}\enspace {A}_{i}$ (cf [Fal04], section 3.2), we observe that

Equation (E.14)

μu -almost surely. Define the event ${\mathcal{Q}}_{R}=\left\{\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{m}(\mathcal{W})\leqslant R\right\}$. On the event ${\mathcal{Q}}_{R}$, by theorem 5, we have that ${\mathrm{dim}}_{\mathrm{M}}\enspace \mathcal{W}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}={\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}\leqslant {d}_{\mathrm{H}}$, μu -almost surely.

Now, we observe that

Equation (E.15)

Hence, by defining $\varepsilon =B\sqrt{\frac{2(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W})\mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}$, and using the independence of S and U, lemma 1, and (E.14), we write

Finally, we let R → ∞ and use the dominated convergence theorem to obtain

which concludes the proof. □

E.3. Proof of theorem 2

Similarly to the proof of theorem 1, we first prove a more general result in which $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}$ is bounded by a fixed constant.

Lemma 2. Assume that the loss is bounded by B and L-Lipschitz continuous in w. Let $\mathcal{W}\subset {\mathbb{R}}^{d}$ be a bounded set with $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}\leqslant {d}_{\mathrm{M}}$. For any function $\rho :\mathbb{R}\to \mathbb{R}$ satisfying ${\mathrm{lim}}_{x\to \infty }\rho (x)=\infty $ and for sufficiently large n, with probability at least 1 − γ, we have

where c is an absolute constant.

Proof. We define the empirical process

and we notice that

Recall that a random process ${\left\{G(w)\right\}}_{w\in \mathcal{W}}$ on a metric space $(\mathcal{W},d)$ is said to have sub-Gaussian increments if there exists K ⩾ 0 such that

Equation (E.16)

${\Vert}G(w)-G({w}^{\prime }){{\Vert}}_{{\psi }_{2}}\leqslant K\,d(w,{w}^{\prime })\quad \text{for all}\enspace w,{w}^{\prime }\in \mathcal{W},$

where ${\Vert}\cdot {{\Vert}}_{{\psi }_{2}}$ denotes the sub-Gaussian norm [Ver19].

We verify that ${\left\{{\mathcal{G}}_{n}(w)\right\}}_{w}$ has sub-Gaussian increments with $K=2L/\sqrt{n}$, where the metric is the standard Euclidean metric, d(w, w') = ||w − w'||. To see why this is the case, notice that

which is a sum of i.i.d. random variables that are uniformly bounded by

by the Lipschitz continuity of the loss. Therefore, Hoeffding's lemma for bounded and centered random variables easily implies that

Equation (E.17)

thus, we have ${\Vert}{\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime }){{\Vert}}_{{\psi }_{2}}\leqslant (2L/\sqrt{n}){\Vert}w-{w}^{\prime }{\Vert}$.
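To spell out the computation (a sketch, assuming ${\mathcal{G}}_{n}(w)={\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)$ and ignoring the absolute constant hidden in the definition of the ψ2-norm): write ${\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime })=\frac{1}{n}{\sum }_{i=1}^{n}{Y}_{i}$ with centered summands satisfying |Yi | ⩽ 2L||w − w'||. Hoeffding's lemma applied to each summand, together with independence, gives

$\mathbb{E}\,\mathrm{exp}\left(\lambda \left({\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime })\right)\right)\leqslant \mathrm{exp}\left(\frac{{\lambda }^{2}}{2}\cdot \frac{4{L}^{2}{\Vert}w-{w}^{\prime }{{\Vert}}^{2}}{n}\right),$

i.e. the increment is sub-Gaussian with variance proxy 4L2||w − w'||2/n, which corresponds to $K=2L/\sqrt{n}$.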

Next, define the sequence ${\delta }_{k}={2}^{-k}$ and notice that we have δk ↓ 0. Dudley's tail bound (see for example theorem 8.16 in [Ver19]) for this empirical process implies that, with probability at least 1 − γ, we have
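Recall, for later use, the elementary geometric-sum identity for this dyadic sequence:

${\sum }_{k\geqslant {k}_{0}}{\delta }_{k}={\sum }_{k\geqslant {k}_{0}}{2}^{-k}={2}^{-{k}_{0}+1}=2{\delta }_{{k}_{0}}.$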

Equation (E.18)

where C is an absolute constant and

In order to apply Dudley's lemma, we need to bound the above summation. For that, choose κ0 such that

and any strictly increasing function $\rho :\mathbb{R}\to \mathbb{R}$.

Now since $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}\leqslant {d}_{\mathrm{M}}$, for the sequence ${\left\{{\delta }_{k}\right\}}_{k\in \mathbb{N}}$, and for a sufficiently large n, whenever k ⩾ ⌊ρ(n)⌋, we have

By splitting the entropy sum in Dudley's tail inequality in two terms, we obtain

For the first term on the right-hand side, we use the monotonicity of covering numbers, i.e. $\vert {N}_{{\delta }_{k}}\vert \leqslant \vert {N}_{{\delta }_{l}}\vert $ for kl, and write

For the second term on the right-hand side, we have

Combining these, we obtain

Plugging this bound back in Dudley's tail bound (E.18), we obtain

Now fix ${w}_{0}\in \mathcal{W}$ and apply the triangle inequality:

Clearly for a fixed ${w}_{0}\in \mathcal{W}$, we can apply Hoeffding's inequality and obtain that, with probability at least 1 − γ,

Combining this with the previous result, we have with probability at least 1 − 2γ

Finally, replacing γ with γ/2 and collecting the absolute constants in c, we conclude the proof. □

Proof of theorem 2. The proof follows the same lines as the proof of theorem 1, except that we invoke lemma 2 instead of lemma 1. □

E.4. Proof of theorem 3

Proof. We start by restating our theoretical framework in an equivalent way for mathematical convenience. In particular, consider the (countable) product measure ${\mu }_{z}^{\infty }={\mu }_{z}\otimes {\mu }_{z}\otimes \dots \enspace $ defined on the cylindrical sigma-algebra. Accordingly, denote $\mathbf{S}\sim {\mu }_{z}^{\infty }$ as an infinite sequence of i.i.d. random vectors, i.e. $\mathbf{S}={({z}_{j})}_{j\geqslant 1}$ with zj i.i.d. μz for all j = 1, 2, .... Furthermore, let Sn := (z1, ..., zn ) be the first n elements of S. In this notation, we have $S\stackrel{\mathrm{d}}{=}{\mathbf{S}}_{n}$ and ${\mathcal{W}}_{S}\stackrel{\mathrm{d}}{=}{\mathcal{W}}_{{\mathbf{S}}_{n}}$, where $\stackrel{\mathrm{d}}{=}$ denotes equality in distribution. Similarly, we have ${\hat{\mathcal{R}}}_{n}(w)=\hat{\mathcal{R}}(w,{\mathbf{S}}_{n})$.

Due to the hypotheses and theorem 5, we have $\bar{{\mathrm{dim}}_{\mathrm{M}}}{\mathcal{W}}_{{\mathbf{S}}_{n}}={\mathrm{dim}}_{\mathrm{H}}{\mathcal{W}}_{{\mathbf{S}}_{n}}=:{d}_{\mathrm{H}}(\mathbf{S},n)$, μu -almost surely. It is easy to verify that the particular forms of the δ-covers and Nδ in H 5 still yield the same Minkowski dimension in (D.1). Then by definition, we have for all S and n:

Equation (E.19)

μu -almost surely. Hence for each n

Equation (E.20)

as δ → 0 almost surely, or alternatively, for each n, there exists a set Ωn of full measure such that

Equation (E.21)

for all S ∈ Ωn . Let Ω* := ∩n Ωn . Then for S ∈ Ω* we have that for all n

Equation (E.22)

and therefore, on this set we also have

where αn is a monotone increasing sequence such that αn ⩾ 1 and αn → ∞. To see why, suppose that we are given a collection of functions ${\left\{{g}_{n}(r)\right\}}_{n}$, where r > 0, such that ${\mathrm{lim}}_{r\to 0}{g}_{n}(r)=0$ for each n. We then have that the infinite-dimensional vector ${({g}_{n}(r))}_{n}\to (0,0,\dots \enspace )$ in the product topology on ${\mathbb{R}}^{\infty }$. We can metrize the product topology on ${\mathbb{R}}^{\infty }$ using the metric

where $\mathbf{x}={({x}_{n})}_{n\geqslant 1}$, $\mathbf{y}={({y}_{n})}_{n\geqslant 1}$, and αn ⩾ 1 is monotone increasing. Alternatively, notice that if ${\mathrm{lim}}_{r\to 0}{g}_{n}(r)=0$ for all n, then for any ɛ > 0 we can choose N0 such that 1/αn < ɛ for all n ⩾ N0, and choose r0 such that ${\mathrm{max}}_{n\leqslant {N}_{0}}\vert {g}_{n}(r)\vert < {\epsilon}$ for all r < r0. Then for r < r0 we have

where we used that αn ⩾ 1.
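For concreteness, one metric with the required properties (a plausible instance only; the exact form used above is not reproduced here) is

$d(\mathbf{x},\mathbf{y})=\underset{n\geqslant 1}{\mathrm{sup}}\enspace \frac{\mathrm{min}\left\{\vert {x}_{n}-{y}_{n}\vert ,1\right\}}{{\alpha }_{n}},$

which metrizes the product topology whenever αn ⩾ 1 and αn → ∞, and for which the two bounds above (over the coordinates n ⩽ N0 and n > N0) immediately give $d\left({({g}_{n}(r))}_{n},\mathbf{0}\right)\leqslant \varepsilon $ for r < r0.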

Applying the above reasoning we have that for all S ∈ Ω*

as δ → 0. By applying theorem 6 to the collection of random variables {Fδ (S); δ}, for any δ' > 0 we can find a subset $\mathfrak{Z}\subset {\mathcal{Z}}^{\infty }$, with probability at least 1 − δ' under ${\mu }_{z}^{\infty }$, such that on $\mathfrak{Z}$ the convergence is uniform, that is

where c(r) → 0 as r → 0. Notice that c(r) = c(δ', r), that is it depends on the choice of δ' (for any δ', we have limr→0c(r; δ') = 0).

As U and S are assumed to be independent, all the following statements hold μu -almost surely, hence we drop the dependence on U to ease the notation. We proceed as in the proof of lemma 1:

Equation (E.23)

Notice that on $\mathfrak{Z}$ we have that

so in particular for any sequence {δn ; n ⩾ 0} we have that

or

Let ${({\delta }_{n})}_{n\geqslant 0}$ be a decreasing sequence such that ${\delta }_{n}\in \mathbb{Q}$ for all n and δn → 0. We then have

For ρ > 0 and $k\in {\mathbb{N}}_{+}$, let us define ${J}_{k}(\rho ){:=}(k\rho ,(k+1)\rho ]$ and set ρn := log(1/δn ). Furthermore, for any t > 0 define

Notice that ɛ(t) is increasing in t. Therefore, we have

where we used the fact that dH(S, n) ⩽ d almost surely, and that on the event dH(S, n) ∈ Jk (ρn ) we have ɛ(dH(S, n)) ⩾ ɛ(kρn ).
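Note also that, since dH(S, n) ⩽ d almost surely, only finitely many buckets contribute to this decomposition: the event {dH(S, n) ∈ Jk (ρn )} is null unless kρn < d, so the union over k effectively ranges over at most

$\left\lceil d/{\rho }_{n}\right\rceil =\left\lceil d/\mathrm{log}(1/{\delta }_{n})\right\rceil $

indices, which keeps the subsequent union bound finite.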

Notice that the events

are in $\mathfrak{G}$. To see why, notice first that for any $0< \delta \in \mathbb{Q}$

so $\vert {N}_{\delta }({\mathcal{W}}_{{\mathbf{S}}_{n}})\vert $ is $\mathfrak{G}$-measurable as a finite sum of $\mathfrak{G}$-measurable variables. From (E.20) it can also be seen that dH(S, n) is also $\mathfrak{G}$-measurable as a countable supremum of $\mathfrak{G}$-measurable random variables. On the other hand, the event $\left\{\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\vert \geqslant \varepsilon (k{\rho }_{n})\right\}$ is clearly in $\mathfrak{F}$ (see H 5 for definitions).

Therefore,

Now, notice that the mapping tɛ2(t) is linear, with Lipschitz coefficient

Therefore, on the event {dH(S, n) ∈ Jk (ρn )} we have

Equation (E.24)

Equation (E.25)

Hence,

where we used the fact that ρn = log(1/δn ). Therefore, we have

By the definition of ɛ(t), for any S and n, we have that:

Therefore,

That is, with probability at least 1 − (1 + 2e)δ' we have:

Choosing ${\delta }_{n}=1/\sqrt{n{L}^{2}}$ and αn = log(n), for each δ' > 0, with probability at least 1 − (1 + 2e)δ' we have

Equation (E.26)

where for each δ' > 0 we have ${\mathrm{lim}}_{n\to \infty }c({\delta }^{\prime },{\delta }_{n})=0$. Hence, for sufficiently large n, with probability at least 1 − (1 + 2e)δ' we have:

Equation (E.27)

Setting γ := (1 + 2e)δ' and using 1 + 2e < 7, we obtain the desired result. This concludes the proof. □

Footnotes

  • This article is an updated version of: Simsekli U, Sener O, Deligiannidis G and Erdogdu M A 2020 Hausdorff dimension, heavy tails, and generalization in neural networks Advances in Neural Information Processing Systems vol 33 eds H Larochelle, M Ranzato, R Hadsell, M F Balcan and H Lin (New York: Curran Associates) pp 5138–51.

  • Recently, Gurbuzbalaban et al [GSZ21] and Hodgkinson and Mahoney [HM21] have simultaneously shown that the law of the SGD iterates (2) can indeed converge to a heavy-tailed stationary distribution with infinite variance when the step-size η is large and/or the batch-size B is small. These results form a theoretical basis for the origins of the observed heavy-tailed behavior of SGD in practice.

  • Here ${\mathrm{L}}_{t}^{\alpha }$ is equivalent to ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ with ${\alpha }_{i}(w)=\alpha \in (0,2]$, ∀i ∈ {1, ..., d} and $\forall w\in {\mathbb{R}}^{d}$.

  • Note that this prevents adaptive minibatching algorithms, e.g. [AB99], from being represented in our framework.

  • H 4 ensures that the Hausdorff dimension of $\mathcal{W}$ coincides with another notion of dimension, called the Minkowski dimension, which is explained in detail in appendix D. We note that for many fractal-like sets, these two notions of dimensions are equal to each other (see [Mat99], chapter 5), which include α-stable processes (see [Fal04], chapter 16).

  • We use the multi-index convention j = (j1, ..., jd ) with each ${j}_{i}\in {\mathbb{N}}_{0}$, and we use the notation ${\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )=\frac{{\partial }^{{j}_{1}}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{1}^{{j}_{1}}}\dots \frac{{\partial }^{{j}_{d}}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{d}^{{j}_{d}}}$, and $\vert \mathbf{j}\vert ={\sum }_{i=1}^{d}{j}_{i}$.
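    For instance, under the usual reading of this convention as a mixed partial derivative, taking d = 2 and j = (1, 2) gives $\vert \mathbf{j}\vert =3$ and ${\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )=\frac{{\partial }^{3}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{1}\,\partial {w}_{2}^{2}}$.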
