
Hausdorff dimension, heavy tails, and generalization in neural networks*


Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Umut Şimşekli et al J. Stat. Mech. (2021) 124014. DOI 10.1088/1742-5468/ac3ae7


Abstract

Despite its success in a wide range of applications, characterizing the generalization properties of stochastic gradient descent (SGD) in non-convex deep learning problems remains an important challenge. While modeling the trajectories of SGD via stochastic differential equations (SDE) under heavy-tailed gradient noise has recently shed light on several peculiar characteristics of SGD, a rigorous treatment of the generalization properties of such SDEs in a learning theoretical framework is still missing. Aiming to bridge this gap, in this paper, we prove generalization bounds for SGD under the assumption that its trajectories can be well-approximated by a Feller process, which defines a rich class of Markov processes that includes several recent SDE representations (both Brownian and heavy-tailed) as special cases. We show that the generalization error can be controlled by the Hausdorff dimension of the trajectories, which is intimately linked to the tail behavior of the driving process. Our results imply that heavier-tailed processes should achieve better generalization; hence, the tail-index of the process can be used as a notion of 'capacity metric'. We support our theory with experiments on deep neural networks, illustrating that the proposed capacity metric accurately estimates the generalization error and, unlike existing capacity metrics in the literature, does not necessarily grow with the number of parameters.


1. Introduction

Many important tasks in deep learning can be represented by the following optimization problem,

$\underset{w\in {\mathbb{R}}^{d}}{\mathrm{min}}\enspace f(w){:=}\frac{1}{n}\sum _{i=1}^{n}{f}^{(i)}(w)\qquad$ (1)

where $w\in {\mathbb{R}}^{d}$ denotes the network weights, n denotes the number of training data points, f denotes a non-convex cost function, and f(i) denotes the cost incurred by a single data point. Gradient-based optimization algorithms, with stochastic gradient descent (SGD) perhaps being the most popular one, have been the primary algorithmic choice for attacking such optimization problems. Given an initial point w0, the SGD algorithm is based on the following recursion,

${w}_{k+1}={w}_{k}-\eta \nabla {\tilde{f}}_{k}({w}_{k}),\qquad \nabla {\tilde{f}}_{k}(w){:=}\frac{1}{\mathrm{B}}\sum _{i\in {\tilde{B}}_{k}}\nabla {f}^{(i)}(w)\qquad$ (2)

where η is the step-size, and $\nabla {\tilde{f}}_{k}$ is the unbiased stochastic gradient with batch size $\mathrm{B}=\vert {\tilde{B}}_{k}\vert $ for a random subset ${\tilde{B}}_{k}$ of {1, ..., n} for all $k\in \mathbb{N}$, |⋅| denoting cardinality.
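
For concreteness, a minimal NumPy sketch of the recursion (2) is given below; the quadratic per-sample cost used at the end is only an illustrative placeholder (it is not from the paper), and the function names are ours.

```python
import numpy as np

def sgd(grad_fi, w0, n, eta=0.1, batch_size=32, num_iters=1000, seed=0):
    """Run the SGD recursion (2): w_{k+1} = w_k - eta * (1/B) * sum_{i in B_k} grad f^(i)(w_k)."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    trajectory = [w.copy()]
    for _ in range(num_iters):
        batch = rng.choice(n, size=batch_size, replace=False)         # random subset of {1, ..., n}
        stoch_grad = np.mean([grad_fi(w, i) for i in batch], axis=0)  # unbiased stochastic gradient
        w = w - eta * stoch_grad
        trajectory.append(w.copy())
    return np.array(trajectory)

# Illustrative placeholder problem: f^(i)(w) = 0.5 * ||w - x_i||^2, so grad f^(i)(w) = w - x_i.
d, n = 10, 1000
X = np.random.default_rng(1).normal(size=(n, d))
traj = sgd(lambda w, i: w - X[i], w0=np.zeros(d), n=n)
```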

In contrast to the convex optimization setting, where the behavior of SGD is fairly well-understood (see e.g. [DDB20, SSBD14]), the generalization properties of SGD in non-convex deep learning problems are an active area of research [AZL19, AZLL19, PBL19]. In the last decade, there has been considerable progress around this topic, where several generalization bounds have been proven in different mathematical setups [AZLL19, DR17, KL17, Lon17, MWZZ17, NHD+19, NTS15, RRT17, ZLZ19]. While these bounds are useful for capturing the generalization behavior of SGD in certain cases, they typically grow with the dimension d, which contradicts empirical observations [NBMS17].

An important initial step toward developing a concrete generalization theory for the SGD algorithm in deep learning problems is to characterize the statistical properties of the weights ${\left\{{w}_{k}\right\}}_{k\in \mathbb{N}}$, as they might provide guidance for identifying the constituents that determine the performance of SGD. A popular approach for analyzing the dynamics of SGD, mainly borrowed from statistical physics, is based on viewing it as a discretization of a continuous-time stochastic process that can be described by a stochastic differential equation (SDE). For instance, if we assume that the gradient noise, i.e. $\nabla {\tilde{f}}_{k}(w)-\nabla f(w)$, can be well-approximated by a Gaussian random vector, we can represent (2) as the Euler–Maruyama discretization of the following SDE,

$\mathrm{d}{\mathrm{W}}_{t}=-\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{\Sigma}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}\qquad$ (3)

where Bt denotes the standard Brownian motion in ${\mathbb{R}}^{d}$, and ${\Sigma}:{\mathbb{R}}^{d}{\mapsto}{\mathbb{R}}^{d\times d}$ is called the diffusion coefficient. This approach has been adopted by several studies [CS18, HLLL17, JKA+17, MHB16, ZWY+19]. In particular, based on the 'flat minima' argument (cf [HS97]), Jastrzebski et al [JKA+17] illustrated that the performance of SGD on unseen data correlates well with the ratio η/B.
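
To make the connection explicit, the Euler–Maruyama discretization of (3) with step-size η reads ${\mathrm{W}}_{(k+1)\eta }\approx {\mathrm{W}}_{k\eta }-\eta \nabla f({\mathrm{W}}_{k\eta })+{\Sigma}({\mathrm{W}}_{k\eta })\left({\mathrm{B}}_{(k+1)\eta }-{\mathrm{B}}_{k\eta }\right)$ with ${\mathrm{B}}_{(k+1)\eta }-{\mathrm{B}}_{k\eta }\sim \mathcal{N}(0,\eta {\mathrm{I}}_{d})$; writing the SGD update (2) as ${w}_{k+1}={w}_{k}-\eta \nabla f({w}_{k})+\eta \left(\nabla f({w}_{k})-\nabla {\tilde{f}}_{k}({w}_{k})\right)$ shows that the Gaussian increment plays the role of the accumulated gradient noise over one step, with Σ absorbing its (possibly state-dependent) covariance.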

More recently, the Gaussian approximation for the gradient noise has come under closer investigation. While Gaussian noise can accurately characterize the behavior of SGD for very large batch sizes [PSGN19], Simsekli et al [SSG19] empirically demonstrated that the gradient noise in fully connected and convolutional neural networks can exhibit heavy-tailed behavior in practical settings. This characteristic was also observed in recurrent neural networks [ZKV+19]. Favaro et al [FFP20] illustrated that the iterates themselves can exhibit heavy tails and investigated the corresponding asymptotic behavior in the infinite-width limit. Similarly, Martin and Mahoney [MM19] observed that the eigenspectra of the weight matrices in individual layers of a neural network can exhibit heavy tails; hence, they proposed a layer-wise heavy-tailed model for the SGD iterates. By invoking results from heavy-tailed random matrix theory, they proposed a capacity metric based on a quantification of the heavy tails, which correlated well with the performance of the network on unseen data. Further, they empirically demonstrated that this capacity metric does not necessarily grow with the dimension d.

Based on the argument that the heavy-tailed behavior of SGD 6 observed in practice cannot be accurately represented by an SDE driven by a Brownian motion, Simsekli et al [SSG19] proposed modeling SGD with an SDE driven by a heavy-tailed process, the so-called α-stable Lévy motion [Sat99]. By using this framework and invoking metastability results proven in statistical physics [IP06, Pav07], SGD has been shown to spend more time around 'wider minima', and the time spent around those minima is linked to the tail properties of the driving process [CWZ+21, NŞGR19, SSG19].

Even though the SDE representations of SGD have provided many insights into several distinguishing characteristics of this algorithm in deep learning problems, a rigorous treatment of their generalization properties in a statistical learning theoretical framework is still missing. In this paper, we aim to take a first step in this direction and prove novel generalization bounds in the case where the trajectories of the optimization algorithm (including but not limited to SGD) can be well-approximated by a Feller process [Sch16], which forms a broad class of Markov processes that includes many important stochastic processes as special cases. More precisely, as a proxy to SGD, we consider the Feller process that is expressed by the following SDE:

$\mathrm{d}{\mathrm{W}}_{t}=-\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{{\Sigma}}_{1}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}+{{\Sigma}}_{2}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}\qquad$ (4)

where Σ1, Σ2 are d × d matrix-valued functions, and ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ denotes the state-dependent α-stable Lévy motion, which will be defined in detail in section 2. Informally, ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ can be seen as a heavy-tailed generalization of Brownian motion, where $\boldsymbol{\alpha }:{\mathbb{R}}^{d}{\mapsto}{(0,2]}^{d}$ denotes its state-dependent tail-indices. In the case αi (w) = 2 for all i and w, ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ reduces to $\sqrt{2}{\mathrm{B}}_{t}$, whereas when αi becomes smaller than 2, the process becomes heavier-tailed in the ith component, whose tails asymptotically obey a power-law decay with exponent αi . The SDEs in [CS18, HLLL17, JKA+17, MHB16, ZWY+19] all appear as special cases of (4) with Σ2 = 0, and the SDE proposed in [SSG19] corresponds to the isotropic setting: Σ2(w) is diagonal and ${\alpha }_{i}(w)=\alpha \in (0,2]$ for all i, w. In (4), we allow each coordinate of ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ to have a different tail-index, which can also depend on the state Wt . We believe that ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ provides a more realistic model based on the empirical results of [ŞGN+19], which suggest that the tail index can take different values at each coordinate and evolve over time.

At the core of our approach lies the fact that the sample paths of Markov processes often exhibit a fractal-like structure [Xia03], and the generalization error over the sample paths is intimately related to the 'roughness' of the random fractal generated by the driving Markov process, as measured by a notion called the Hausdorff dimension. Our main contributions are as follows.

  • (a)  
    We introduce a novel notion of complexity for the trajectories of a stochastic learning algorithm, which we coin as 'uniform Hausdorff dimension'. Building on [Sch98], we show that the sample paths of Feller processes admit a uniform Hausdorff dimension, which is closely related to the tail properties of the process.
  • (b)  
    By using tools from geometric measure theory, we prove that the generalization error can be controlled by the Hausdorff dimension of the process, which can be significantly smaller than the ambient dimension d. In this sense, the Hausdorff dimension acts as an 'intrinsic dimension' of the problem, mimicking the role of Vapnik–Chervonenkis (VC) dimension in classical generalization bounds.

These two contributions collectively show that heavier-tailed processes achieve smaller generalization error, implying that the heavy-tails of SGD incur an implicit regularization. Our results also provide a theoretical justification to the observations reported in [MM19] and [SSG19]. Besides, a remarkable feature of the Hausdorff dimension is that it solely depends on the tail behavior of the process; hence, contrary to existing capacity metrics, it does not necessarily grow with the number of parameters d. Furthermore, we provide an efficient approach to estimate the Hausdorff dimension by making use of existing tail index estimators, and empirically demonstrate the validity of our theory on various neural networks. Experiments on both synthetic and real data verify that our bounds do not grow with the problem dimension, providing an accurate characterization of the generalization performance.

2. Technical background

In this section, we provide the required technical background on stable distributions, Lévy and Feller processes, and the Hausdorff dimension.

2.1. Stable distributions

Stable distributions appear as the limiting distribution in the generalized central limit theorem [Lév37] and can be seen as a generalization of the Gaussian distribution. In this paper, we will be interested in symmetric α-stable distributions, denoted by $\mathcal{S}\alpha \mathcal{S}$. In the one-dimensional case, a random variable X is $\mathcal{S}\alpha \mathcal{S}(\sigma )$ distributed if its characteristic function (chf.) has the following form: $\mathbb{E}[\mathrm{exp}(i\omega X)]=\mathrm{exp}(-\vert \sigma \omega {\vert }^{\alpha })$, where $\alpha \in (0,2]$ is called the tail-index and $\sigma \in {\mathbb{R}}_{+}$ is called the scale parameter. When α = 2, $\mathcal{S}\alpha \mathcal{S}(\sigma )=\mathcal{N}(0,2{\sigma }^{2})$, where $\mathcal{N}$ denotes the Gaussian distribution in $\mathbb{R}$. As soon as α < 2, the distribution becomes heavy-tailed and $\mathbb{E}[\vert X{\vert }^{q}]$ is finite if and only if q < α, indicating that the variance of $\mathcal{S}\alpha \mathcal{S}$ is finite only when α = 2.
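
The following sketch samples symmetric α-stable variables with the Chambers–Mallows–Stuck method and illustrates the loss of finite variance as α drops below 2; the helper name sample_sas is ours.

```python
import numpy as np

def sample_sas(alpha, size, sigma=1.0, rng=None):
    """Symmetric alpha-stable samples via the Chambers-Mallows-Stuck construction."""
    rng = rng or np.random.default_rng()
    U = rng.uniform(-np.pi / 2, np.pi / 2, size)   # uniform angle
    W = rng.exponential(1.0, size)                 # unit-rate exponential
    if np.isclose(alpha, 1.0):                     # alpha = 1 reduces to the Cauchy distribution
        return sigma * np.tan(U)
    X = (np.sin(alpha * U) / np.cos(U) ** (1.0 / alpha)
         * (np.cos((1.0 - alpha) * U) / W) ** ((1.0 - alpha) / alpha))
    return sigma * X

rng = np.random.default_rng(0)
for alpha in (2.0, 1.5, 1.0):
    x = sample_sas(alpha, 100_000, rng=rng)
    # For alpha = 2 the sample variance concentrates around 2*sigma^2 = 2;
    # for alpha < 2 it blows up, reflecting E|X|^q < infinity only for q < alpha.
    print(alpha, np.var(x))
```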

There are multiple ways to extend $\mathcal{S}\alpha \mathcal{S}$ to the multivariate case. In our experiments, we will be mainly interested in the elliptically-contoured α-stable distribution [ST94], whose chf. is given by $\mathbb{E}[\mathrm{exp}(i\langle \omega ,X\rangle )]=\mathrm{exp}(-{\Vert}\omega {{\Vert}}^{\alpha })$ for $X,\omega \in {\mathbb{R}}^{d}$, where ⟨⋅, ⋅⟩ denotes the Euclidean inner product. Another common choice is the multivariate α -stable distribution with independent components for a vector $\boldsymbol{\alpha }\in {\mathbb{R}}^{d}$, whose chf. is given by $\mathbb{E}[\mathrm{exp}(i\langle \omega ,X\rangle )]=\mathrm{exp}(-{\sum }_{i=1}^{d}\vert {\omega }_{i}{\vert }^{{\alpha }_{i}})$. Essentially, the ith component of X is distributed with $\mathcal{S}\alpha \mathcal{S}$ with parameters αi and σi = 1. Both of these multivariate distributions reduce to a multivariate Gaussian when their tail indices are 2.

2.2. Lévy and Feller processes

We begin by defining a general Lévy process (also called Lévy motion), which includes the Brownian motion Bt and the α-stable motion ${\mathrm{L}}_{t}^{\alpha }$ as special cases 7 . A Lévy process ${\left\{{\mathrm{L}}_{t}\right\}}_{t\geqslant 0}$ in ${\mathbb{R}}^{d}$ with initial point L0 = 0 is defined by the following properties:

  • (a)  
    For $N\in \mathbb{N}$ and t0 < t1 <⋯< tN , the increments $({\mathrm{L}}_{{t}_{i}}-{\mathrm{L}}_{{t}_{i-1}})$ are independent for all i.
  • (b)  
    For any t > s > 0, $({\mathrm{L}}_{t}-{\mathrm{L}}_{s})$ and ${\mathrm{L}}_{t-s}$ have the same distribution.
  • (c)  
    Lt is continuous in probability, i.e. for all δ > 0 and s ⩾ 0, $\mathbb{P}(\vert {\mathrm{L}}_{t}-{\mathrm{L}}_{s}\vert > \delta )\to 0$ as t → s.

By the Lévy–Khintchine formula [Sat99], the chf. of a Lévy process is given by $\mathbb{E}[\mathrm{exp}(i\langle \xi ,{\mathrm{L}}_{t}\rangle )]=\mathrm{exp}(-t\psi (\xi ))$, where $\psi :{\mathbb{R}}^{d}{\mapsto}\mathbb{C}$ is called the characteristic (or Lévy) exponent, given as:

$\psi (\xi )=-\mathrm{i}\langle b,\xi \rangle +\langle \xi ,{\Sigma}\xi \rangle +{\int }_{{\mathbb{R}}^{d}{\backslash}\left\{0\right\}}\left(1-{\mathrm{e}}^{\mathrm{i}\langle x,\xi \rangle }+\mathrm{i}\langle x,\xi \rangle {\mathbb{1}}_{\left\{{\Vert}x{\Vert}\leqslant 1\right\}}(x)\right)\nu (\mathrm{d}x)\qquad$ (5)

Here, $b\in {\mathbb{R}}^{d}$ denotes a constant drift, ${\Sigma}\in {\mathbb{R}}^{d\times d}$ is a positive semi-definite matrix, and ν is called the Lévy measure, which is a Borel measure on ${\mathbb{R}}^{d}{\backslash}\left\{0\right\}$ satisfying ${\int }_{{\mathbb{R}}^{d}}{\Vert}x{{\Vert}}^{2}/(1+{\Vert}x{{\Vert}}^{2})\nu (\mathrm{d}x)< \infty .$ The choice of (b, Σ, ν) determines the law of ${\mathrm{L}}_{t-s}$; hence, it fully characterizes the process Lt by the properties (a) and (b) above. For instance, from (5), we can easily verify that under the choice b = 0, ${\Sigma}=\frac{1}{2}{\mathrm{I}}_{d}$, and ν = 0, with Id denoting the d × d identity matrix, the function exp(−ψ(ξ)) becomes the chf. of a standard Gaussian in ${\mathbb{R}}^{d}$; hence, Lt reduces to Bt . On the other hand, if we choose b = 0, Σ = 0, and $\nu (\mathrm{d}x)=\frac{\mathrm{d}r}{{r}^{1+\alpha }}\lambda (\mathrm{d}y)$, for all $x=ry,(r,y)\in {\mathbb{R}}_{+}\times {\mathbb{S}}^{d-1}$, where ${\mathbb{S}}^{d-1}$ denotes the unit sphere in ${\mathbb{R}}^{d}$ and λ is an arbitrary Borel measure on ${\mathbb{S}}^{d-1}$, we obtain the chf. of a generic multivariate α-stable distribution; hence Lt reduces to ${\mathrm{L}}_{t}^{\alpha }$. Depending on λ, exp(−ψ(ξ)) becomes the chf. of an elliptically contoured α-stable distribution or an α-stable distribution with independent components [Xia03].

Feller processes (also called Lévy-type processes [BSW13]) are a general family of Markov processes, which further extend the scope of Lévy processes. In this study, we consider a class of Feller processes [Cou65], which locally behave like Lévy processes and additionally allow for state-dependent drifts b(w), diffusion matrices Σ(w), and Lévy measures ν(w, dy) for $w\in {\mathbb{R}}^{d}$. For a fixed state w, a Feller process ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$ is defined through the chf. of the random variable Wt − w, given as ${\psi }_{t}(w,\xi )=\mathbb{E}\left[\mathrm{exp}(-i\langle \xi ,{\mathrm{W}}_{t}-w\rangle )\right]$. A crucial characteristic of a Feller process related to its chf. is its symbol Ψ, defined as,

${\Psi}(w,\xi )=-\mathrm{i}\langle b(w),\xi \rangle +\langle \xi ,{\Sigma}(w)\xi \rangle +{\int }_{{\mathbb{R}}^{d}{\backslash}\left\{0\right\}}\left(1-{\mathrm{e}}^{\mathrm{i}\langle x,\xi \rangle }+\mathrm{i}\langle x,\xi \rangle {\mathbb{1}}_{\left\{{\Vert}x{\Vert}\leqslant 1\right\}}(x)\right)\nu (w,\mathrm{d}x)\qquad$ (6)

for $w,\xi \in {\mathbb{R}}^{d}$ [Jac02, Sch98, Xia03]. Here, for each $w\in {\mathbb{R}}^{d}$, ${\Sigma}(w)\in {\mathbb{R}}^{d\times d}$ is symmetric positive semi-definite, and for all w, ν(w, dx) is a Lévy measure.

Under mild conditions, one can verify that the SDE (4) we use as a proxy for the SGD algorithm indeed corresponds to a Feller process with b(w) = −∇f(w), Σ(w) = 2Σ1(w), and an appropriate choice of ν (see [HDS18]). We also note that many other popular stochastic optimization algorithms can be accurately represented by a Feller process, which we describe in appendix B. Hence, our results can be useful in a broader context.

2.3. Decomposable Feller processes

In this paper, we will focus on decomposable Feller processes introduced in [Sch98], which will be useful in both our theory and experiments. Let Wt be a Feller process with symbol Ψ. We call the process Wt 'decomposable at w0', if there exists a point ${w}_{0}\in {\mathbb{R}}^{d}$, such that ${\Psi}(w,\xi )=\psi (\xi )+\tilde{{\Psi}}(w,\xi )$, where ψ(ξ) := Ψ(w0, ξ) is called the sub-symbol and $\tilde{{\Psi}}(w,\xi ){:=}{\Psi}(w,\xi )-{\Psi}({w}_{0},\xi )$ is the remainder term. Here, $\tilde{{\Psi}}$ is assumed to satisfy certain smoothness and boundedness assumptions, which are provided in appendix C. Essentially, the technical regularity conditions on $\tilde{{\Psi}}$ impose a structure on the triplet (b, Σ, ν) around w0 which ensures that, around that point, Wt behaves like a Lévy process whose characteristic exponent is given by the sub-symbol ψ.

2.4. The Hausdorff dimension

Due to their recursive nature, Markov processes often generate 'random fractals' [Xia03] and understanding the structure of such fractals has been a major challenge in modern probability theory [BP17, Kho09, KX17, LG19, LY19, Yan18]. In this paper, we are interested in identifying the complexity of the fractals generated by a Feller process that approximates SGD.

The intrinsic complexity of a fractal is typically characterized by a notion called the Hausdorff dimension [Fal04], which extends the usual notion of dimension (e.g. a line segment is one-dimensional, a plane is two-dimensional) to fractional orders. Informally, this notion measures the 'roughness' of an object (i.e. a set), and in the context of Lévy processes, it is deeply connected to the tail properties of the corresponding Lévy measure [Sch98, Xia03, Yan18].

Before defining the Hausdorff dimension, we need to introduce the Hausdorff measure. Let $G\subset {\mathbb{R}}^{d}$ and δ > 0, and consider all the δ-coverings ${\left\{{A}_{i}\right\}}_{i}$ of G, i.e. each Ai denotes a set with diameter less than δ satisfying G ⊂ ∪i Ai . For any s ∈ (0, ∞), we then denote:

${\mathcal{H}}_{\delta }^{s}(G){:=}\mathrm{inf}\left\{\sum _{i=1}^{\infty }\mathrm{diam}{({A}_{i})}^{s}\right\}\qquad$ (7)

where the infimum is taken over all the δ-coverings. The s-dimensional Hausdorff measure of G is defined as the monotonic limit:

${\mathcal{H}}^{s}(G){:=}\underset{\delta \to 0}{\mathrm{lim}}\enspace {\mathcal{H}}_{\delta }^{s}(G)\qquad$ (8)

It can be shown that ${\mathcal{H}}^{s}$ is an outer measure; hence, it can be extended to a complete measure by the Carathéodory extension theorem [Mat99]. When s is an integer, ${\mathcal{H}}^{s}$ is equal to the s-dimensional Lebesgue measure up to a constant factor; thus, it strictly generalizes the notion of 'volume' to fractional orders. We now proceed with the definition of the Hausdorff dimension.

Definition 1. The Hausdorff dimension of $G\subset {\mathbb{R}}^{d}$ is defined as follows.

${\mathrm{dim}}_{\mathrm{H}}\enspace G{:=}\mathrm{inf}\left\{s > 0:{\mathcal{H}}^{s}(G)=0\right\}\qquad$ (9)

One can show that if dimHG = s, then ${\mathcal{H}}^{r}(G)=0$ for all r > s and ${\mathcal{H}}^{r}(G)=\infty $ for all r < s [EMG90]. In this sense, the Hausdorff dimension of G is the critical order s at which ${\mathcal{H}}^{s}(G)$ drops from ∞ to 0, and we always have 0 ⩽ dimHG ⩽ d [Fal04]. Apart from trivial cases such as ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathbb{R}}^{d}=d$, a canonical example is the well-known Cantor set, whose Hausdorff dimension is (log 2/log 3) ∈ (0, 1). Besides, the Hausdorff dimension of a Riemannian manifold corresponds to its intrinsic dimension, e.g. ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathbb{S}}^{d-1}=d-1$.
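
To see where the Cantor set value comes from, note that the level-k construction of the Cantor set C consists of ${2}^{k}$ intervals of length ${3}^{-k}$, so taking these intervals as a δ-covering with δ = 3−k gives ${\mathcal{H}}_{\delta }^{s}(C)\leqslant {2}^{k}{({3}^{-k})}^{s}={(2\cdot {3}^{-s})}^{k}$. This upper bound stays away from 0 and ∞ as k → ∞ precisely when $2\cdot {3}^{-s}=1$, i.e. s = log 2/log 3; a matching lower bound (e.g. via the natural uniform mass distribution on C) then yields ${\mathrm{dim}}_{\mathrm{H}}\enspace C=\mathrm{log}\enspace 2/\mathrm{log}\enspace 3$.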

We note that, starting with the seminal work of Assouad [Ass83], tools from fractal geometry have been considered in learning theory [DSD19, MSS19, SHTY13] in different contexts. In this paper, we consider the Hausdorff dimension of the sample paths of Markov processes in a learning theoretical framework, which, to the best of our knowledge, has not yet been investigated in the literature.

3. Mathematical setup

In this section, we make precise the mathematical framework that we will use in our theoretical results. Let $\mathcal{Z}=\mathcal{X}\times \mathcal{Y}$ denote the space of data points, with $\mathcal{X}$ being the space of features and $\mathcal{Y}$ the space of labels. We consider an unknown data distribution over $\mathcal{Z}$, denoted by μz . We assume that we have access to a training set with n elements, denoted as S = {z1, ..., zn }, where the elements of S are independently and identically distributed (i.i.d.) according to μz . We will write $S\sim {\mu }_{z}^{\otimes n}$, where ${\mu }_{z}^{\otimes n}$ is the n-times product measure of μz .

To assess the quality of an estimated parameter, we consider a loss function $\ell :{\mathbb{R}}^{d}\times \mathcal{Z}{\mapsto}{\mathbb{R}}_{+}$, such that ℓ(w, z) measures the loss induced by a single data point z for the particular choice of parameter $w\in {\mathbb{R}}^{d}$. We accordingly denote the population risk with

$\mathcal{R}(w){:=}{\mathbb{E}}_{z\sim {\mu }_{z}}\left[\ell (w,z)\right]\qquad$ (10)

and the empirical risk with

$\hat{\mathcal{R}}(w,S){:=}\frac{1}{n}\sum _{i=1}^{n}\ell (w,{z}_{i})\qquad$ (11)

We note that we allow the cost function f in (1) and the loss ℓ to be different from each other, where f should be seen as a surrogate loss function. In particular, we will have different sets of assumptions on f and ℓ. However, as f and ℓ are different from each other, the discrepancy between the risks of their respective minimizers would have an impact on generalization. We leave the analysis of such a discrepancy as future work.

An iterative training algorithm $\mathcal{A}$ (for example SGD) is a function of two variables S and U, where S denotes the dataset and U encapsulates all the algorithmic randomness (e.g. batch indices to be used in training). The algorithm $\mathcal{A}(S,U)$ returns the entire evolution of the parameters in the time frame [0, T], with ${[\mathcal{A}(S,U)]}_{t}={w}_{t}$ being the parameter value returned by $\mathcal{A}$ at time t (e.g. the parameters trained by SGD at time t). More precisely, given a training set S and a random variable U, the algorithm will output a random process ${\left\{{w}_{t}\right\}}_{t\in [0,T]}$ indexed by time, which is the trajectory of iterates. To formalize this definition, let us denote the class of bounded Borel functions defined from [0, T] to ${\mathbb{R}}^{d}$ by $\mathcal{B}([0,T],{\mathbb{R}}^{d})$, and define

$\mathcal{A}:{\mathcal{Z}}^{n}\times {\Omega}{\mapsto}\mathcal{B}([0,T],{\mathbb{R}}^{d}),\qquad (S,U){\mapsto}\mathcal{A}(S,U)\qquad$ (12)

where Ω denotes the domain of U. We will denote the law of U by μu , and without loss of generality we let T = 1.

In the remainder of the paper, we will consider the case where the algorithm $\mathcal{A}$ is chosen to be the trajectories produced by a Feller process W(S) (e.g. the proxy for SGD (4)), whose symbol depends on the training set S. More precisely, given $S\in {\mathcal{Z}}^{n}$, the output of the training algorithm $\mathcal{A}(S,\cdot )$ will be the random mapping $t{\mapsto}{\mathrm{W}}_{t}^{(S)}$, where the symbol of W(S) is determined by the drift bS (w), diffusion matrix ΣS (w), and the Lévy measure νS (w, ⋅) (see (6) for definitions), which all depend on S. In this context, the random variable U represents the randomness that is incurred by the Feller process. In particular, for the SDE proxy (4), U accounts for the randomness due to Bt and ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$.

As our framework requires $\mathcal{A}$ to produce continuous-time trajectories to represent the discrete-time recursion of SGD (2), we can consider the linearly interpolated continuous-time process; an approach which is commonly used in SDE analysis [Dal17, EH20, EMS18, NŞGR19, RRT17]. For a given $t\in [k\eta ,(k+1)\eta )$, we can define the process ${\tilde{\mathrm{W}}}_{t}$ as the linear interpolation of wk and wk+1 (see (2)), such that ${w}_{k}={\tilde{\mathrm{W}}}_{k\eta }$ for all k. In this case, the random variable U represents the randomness incurred by the choice of the random minibatches ${\tilde{B}}_{k}$ over the iterations (2).
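
A minimal sketch of this interpolation (the function and variable names are ours):

```python
import numpy as np

def interpolate_iterates(iterates, eta, t):
    """Evaluate the linearly interpolated process W~_t from discrete iterates w_0, w_1, ...
    For t in [k*eta, (k+1)*eta), return the linear interpolation of w_k and w_{k+1}."""
    k = min(int(np.floor(t / eta)), len(iterates) - 2)  # index of the active segment
    lam = (t - k * eta) / eta                           # position within the segment, in [0, 1)
    return (1.0 - lam) * iterates[k] + lam * iterates[k + 1]

# Sanity check: W~_{k*eta} recovers w_k exactly (up to floating-point error).
eta = 0.1
iterates = np.random.default_rng(0).normal(size=(11, 3))   # 11 iterates of a 3-dimensional w
assert np.allclose(interpolate_iterates(iterates, eta, 3 * eta), iterates[3])
```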

Throughout this paper, we will assume that S and U are independent from each other. In the case of (4), this will entail that the randomness in Bt and ${\mathrm{L}}_{t}^{\alpha }$ does not depend on S, or in the case of the SGD recursion, it will require the random sets ${\tilde{B}}_{k}\subset \left\{1,\dots ,n\right\}$ to be drawn independently from S. 8 Under this assumption, U does not play a crucial role in our analysis; hence, to ease the notation, we will occasionally omit the dependence on U and simply write $\mathcal{A}(S){:=}\mathcal{A}(S,U)$. We will further use the notation ${[\mathcal{A}(S)]}_{t}{:=}{[\mathcal{A}(S,U)]}_{t}$ to refer to wt . Without loss of generality, we will assume that the training algorithm is always initialized with zeros, i.e. ${[\mathcal{A}(S)]}_{0}=\mathbf{0}\in {\mathbb{R}}^{d}$, for all $S\in {\mathcal{Z}}^{n}$. Finally, we define the collection of the parameters given in a trajectory, as the image of $\mathcal{A}(S)$, i.e.

${\mathcal{W}}_{S}{:=}\left\{w\in {\mathbb{R}}^{d}:\exists t\in [0,1],\enspace w={[\mathcal{A}(S)]}_{t}\right\}\qquad$ (13)

and the collection of all possible parameters as the union

$\mathcal{W}{:=}{\bigcup }_{S\in {\mathcal{Z}}^{n}}{\mathcal{W}}_{S}\qquad$ (14)

Note that $\mathcal{W}$ is still random due to its dependence on U.

4. Generalization bounds via Hausdorff dimension

In this section, we present our main contributions, where we derive generalization bounds based on the Hausdorff dimension of the training trajectories.

4.1. Uniform Hausdorff dimension and Feller processes

In this part, we introduce the 'uniform Hausdorff dimension' property for a training algorithm $\mathcal{A}$, which is a notion of complexity based on the Hausdorff dimension of the trajectories generated by $\mathcal{A}$. By translating [Sch98] into our context, we will then show that decomposable Feller processes possess this property.

Definition 2. An algorithm $\mathcal{A}$ has uniform Hausdorff dimension dH if for any $n\in {\mathbb{N}}_{+}$ and any training set $S\in {\mathcal{Z}}^{n}$

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {d}_{\mathrm{H}},\quad {\mu }_{u}\text{-almost surely}.\qquad$ (15)

Since ${\mathcal{W}}_{S}\subset \mathcal{W}\subset {\mathbb{R}}^{d}$, by the definition of Hausdorff dimension, any algorithm $\mathcal{A}$ possesses the uniform Hausdorff dimension property trivially with dH = d. However, as we will illustrate in the sequel, dH can be much smaller than d, which is of our interest in this study.

Proposition 1. Let ${\left\{{\mathrm{W}}^{(S)}\right\}}_{S\in {\mathcal{Z}}^{n}}$ be a family of Feller processes. Assume that for each S, W(S) is decomposable at a point wS with sub-symbol ψS . Consider the algorithm $\mathcal{A}$ that returns ${[\mathcal{A}(S)]}_{t}={\mathrm{W}}_{t}^{(S)}$ for a given $S\in {\mathcal{Z}}^{n}$ and for every t ∈ [0, 1]. Then, we have

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {\beta }_{S}{:=}\mathrm{inf}\left\{\gamma \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}\enspace \frac{\vert {\psi }_{S}(\xi )\vert }{{\Vert}\xi {{\Vert}}^{\gamma }}=0\right\}\qquad$ (16)

μu -almost surely. Furthermore, $\mathcal{A}$ has uniform Hausdorff dimension with

${d}_{\mathrm{H}}=\underset{n\in {\mathbb{N}}_{+}}{\mathrm{sup}}\enspace \underset{S\in {\mathcal{Z}}^{n}}{\mathrm{sup}}\enspace {\beta }_{S}\qquad$ (17)

We provide all the proofs in appendix E. Informally, this result can be interpreted as follows. Thanks to the decomposability property, for each S, the process W(S) behaves like a Lévy motion around wS , and its characteristic exponent is given by the sub-symbol ψS . Because of this locally regular behavior, the Hausdorff dimension of the image of W(S) can be bounded by βS , which only depends on the tail behavior of the Lévy process whose exponent is the sub-symbol ψS .

Example 1. In order to illustrate proposition 1, let us consider a simple example, where ${\mathrm{W}}_{t}^{(S)}$ is taken as the d-dimensional α-stable process with d ⩾ 2, which is independent of the data sample S. More precisely, ${\mathrm{W}}_{t}^{(S)}$ is the solution to the SDE given by $\mathrm{d}{\mathrm{W}}_{t}^{(S)}=\mathrm{d}{\mathrm{L}}_{t}^{\alpha }$ for some $\alpha \in (0,2]$, where ${\mathrm{L}}_{1}^{\alpha }$ is an elliptically-contoured α-stable random vector. As ${\mathrm{W}}_{t}^{(S)}$ is already a Lévy process, it trivially satisfies the assumptions of proposition 1 with βS = α for all n and S [BG60], hence

${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant \alpha \qquad$ (18)

μu -almost surely (in fact, one can show that ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}=\alpha $, see [BG60], theorem 4.2). Hence, the 'algorithm' ${[\mathcal{A}(S)]}_{t}={\mathrm{W}}_{t}^{(S)}$ has uniform Hausdorff dimension dH = α. This shows that as the process becomes heavier-tailed (i.e. α decreases), the Hausdorff dimension dH gets smaller. This behavior is illustrated in figure 1.
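
Trajectories like those in figure 1 can be simulated by summing i.i.d. α-stable increments. The sketch below uses scipy's levy_stable with skewness β = 0, which we assume matches the $\mathcal{S}\alpha \mathcal{S}$ convention of section 2.1, and, for simplicity, independent coordinates rather than the elliptically contoured vector; the qualitative picture (increasingly rare but large jumps as α decreases) is the same.

```python
import numpy as np
from scipy.stats import levy_stable

def stable_levy_path(alpha, d=2, T=1.0, n_steps=10_000, seed=0):
    """Simulate L_t^alpha on [0, T] by summing i.i.d. symmetric alpha-stable increments.
    By self-similarity, an increment over a time step dt has scale dt**(1/alpha)."""
    dt = T / n_steps
    increments = dt ** (1.0 / alpha) * levy_stable.rvs(
        alpha, 0.0, size=(n_steps, d), random_state=seed)
    return np.vstack([np.zeros(d), np.cumsum(increments, axis=0)])  # path started at L_0 = 0

paths = {a: stable_levy_path(a) for a in (2.0, 1.5, 1.0)}  # the three settings of figure 1
```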


Figure 1. Trajectories of ${\mathrm{L}}_{t}^{\alpha }$ for α = 2.0, 1.5, and 1.0. The colors indicate the evolution in time. We observe that the trajectories become 'simpler' as dimH Lα [0, T] = α gets smaller.


The term βS is often called the upper Blumenthal–Getoor (BG) index of the Lévy process with exponent ψS [BG60], and it is directly related to the tail behavior of the corresponding Lévy measure. In general, the value of βS decreases as the process gets heavier-tailed, which implies that heavier-tailed processes have smaller Hausdorff dimension; thus, they have smaller complexity.
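
As a concrete instance, for the elliptically contoured $\mathcal{S}\alpha \mathcal{S}$ exponent ${\psi }_{S}(\xi )={\Vert}\xi {{\Vert}}^{\alpha }$ the definition of the BG index (see (C.3) in appendix C) gives ${\beta }_{S}=\mathrm{inf}\left\{\lambda \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}{\Vert}\xi {{\Vert}}^{\alpha -\lambda }=0\right\}=\alpha $, recovering the value used in example 1.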

4.2. Generalization bounds via uniform Hausdorff dimension

This part provides the first main contribution of this paper, where we show that the generalization error of a training algorithm can be controlled by the Hausdorff dimension of its trajectories. Even though our interest is still in the case where $\mathcal{A}$ is chosen as a Feller process, the results in this section apply to more general algorithms. To this end, we will be mainly interested in bounding the following object:

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \qquad$ (19)

with high probability over the choice of S and U. Note that this is an algorithm dependent definition of generalization that is widely used in the literature (see [BE02] for a detailed discussion).

To derive our first result, we will require the following assumptions.

  • H 1  
    ℓ is bounded by B and L-Lipschitz continuous in w.
  • H 2  
    The diameter of $\mathcal{W}$ is finite μu -almost surely. S and U are independent.
  • H 3  
    $\mathcal{A}$ has uniform Hausdorff dimension dH.
  • H 4  
    For μu -almost every $\mathcal{W}$, there exists a Borel measure μ on ${\mathbb{R}}^{d}$ and positive numbers a, b, r0 and s such that $0< \mu (\mathcal{W})\leqslant \mu \left({\mathbb{R}}^{d}\right)< \infty $ and $0< a{r}^{s}\leqslant \mu ({B}_{d}(x,r))\leqslant b{r}^{s}< \infty $ for $x\in \mathcal{W},0< r\leqslant {r}_{0}$.

Boundedness of the loss can be relaxed at the expense of using sub-Gaussian concentration bounds and introducing more complexity into the expressions [MBM16]. More precisely, H 1 can be replaced with the assumption that ∃K > 0, such that ∀p, $\mathbb{E}{[\ell {(w,z)}^{p}]}^{1/p}\leqslant K\sqrt{p}$, and by using sub-Gaussian concentration our bounds will still hold with K in place of B. On the other hand, since we have a finite time-horizon and we fix the initial point of the processes to 0, by using [XZ20] lemma 7.1, we can show that the finite diameter condition on $\mathcal{W}$ holds almost surely, provided that standard regularity assumptions hold uniformly on the coefficients of W(S) (i.e. b, Σ, and ν in (6)) for all $S\in {\mathcal{Z}}^{n}$ and that a countability condition on $\mathcal{Z}$ holds. Finally, H 4 is a common condition in fractal geometry, and ensures that the set $\mathcal{W}$ is regular enough, so that we can relate its Hausdorff dimension to its covering numbers [Mat99] 9 . Under these conditions and an additional countability condition on $\mathcal{Z}$ (see [BE02] for similar assumptions), we present our first main result as follows.

Theorem 1. Assume that H 1 to 4 hold, and $\mathcal{Z}$ is countable. Then, for a sufficiently large n, we have

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant B\sqrt{\frac{2{d}_{\mathrm{H}}\enspace \mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}\qquad$ (20)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu .

This theorem shows that the generalization error can be controlled by the uniform Hausdorff dimension of the algorithm $\mathcal{A}$, along with the constants inherited from the regularity conditions. A noteworthy property of this result is that it does not have a direct dependency on the number of parameters d; on the contrary, we observe that dH plays the role that d plays in standard bounds [AB09], implying that dH acts as the intrinsic dimension and mimics the role of the VC dimension in binary classification [SSBD14]. Furthermore, in combination with proposition 1 that indicates dH decreases as the processes W(S) get heavier-tailed, theorem 1 implies that the generalization error can be controlled by the tail behavior of the process: heavier-tails imply less generalization error.

We note that the countability condition on $\mathcal{Z}$ is crucial for theorem 1. Thanks to this condition, in our proof, we invoke the stability properties of the Hausdorff dimension and we directly obtain a bound on ${\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}$. This bound combined with H 4 allows us to control the covering number of $\mathcal{W}$, and then the desired result can be obtained by using standard covering techniques [AB09, SSBD14].
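
To spell out how H 4 controls the covering number: let $\left\{{x}_{1},\dots ,{x}_{N}\right\}\subset \mathcal{W}$ be a maximal r-separated set with r ⩽ r0. The balls ${B}_{d}({x}_{i},r/2)$ are pairwise disjoint, so summing the lower bound in H 4 gives $N\,a{(r/2)}^{s}\leqslant \sum _{i=1}^{N}\mu ({B}_{d}({x}_{i},r/2))\leqslant \mu ({\mathbb{R}}^{d})$, i.e. $N\leqslant {a}^{-1}\mu ({\mathbb{R}}^{d}){(r/2)}^{-s}$. Since a maximal r-separated set is also an r-cover, the covering numbers of $\mathcal{W}$ grow at most like ${r}^{-s}$, and by theorem 5 in appendix D the exponent s coincides with ${\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}$; this is exactly the quantity that enters the union bound in the proofs.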

Next, we show that the log n dependency in front of dH is not crucial. In the next theorem, we show that the log n factor can be replaced with any increasing function (e.g. log log n) by using a chaining argument, at the expense of having L as a multiplicative factor (instead of log L). Theorem 1 holds for sufficiently large n; however, this threshold is not a priori known, which is a limitation of the result.

Theorem 2. Assume that H 1 to 4 hold, and $\mathcal{Z}$ is countable. Then, for any function $\rho :\mathbb{R}\to \mathbb{R}$ satisfying $\underset{x\to \infty }{\mathrm{lim}}\enspace \rho (x)=\infty $, and for a sufficiently large n, we have

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu , where c is an absolute constant.

4.3. Generalization bounds via non-uniform Hausdorff dimension

In our second main result, we control the generalization error without the countability assumption on $\mathcal{Z}$, and more importantly we will also relax H 3. Our main goal will be to relate the error to the Hausdorff dimension of a single ${\mathcal{W}}_{S}$, as opposed to dH, which uniformly bounds ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{{S}^{\prime }}$ for every ${S}^{\prime }\in {\mathcal{Z}}^{n}$. In order to achieve this goal, we introduce a technical assumption, which lets us control the statistical dependency between the training set S and the set of parameters ${\mathcal{W}}_{S}$.

For any δ > 0, let us consider a finite δ-cover of $\mathcal{W}$ by closed balls of radius δ, whose centers are on the fixed grid

$\left\{\left(\frac{(2{j}_{1}+1)\delta }{2\sqrt{d}},\dots ,\frac{(2{j}_{d}+1)\delta }{2\sqrt{d}}\right):{j}_{1},\dots ,{j}_{d}\in \mathbb{Z}\right\}\qquad$ (21)

and collect the center of each ball in the set Nδ . Then, for each S, let us define the set ${N}_{\delta }(S){:=}\left\{x\in {N}_{\delta }:{B}_{d}(x,\delta )\cap {\mathcal{W}}_{S}\ne \varnothing \right\}$, where ${B}_{d}(x,\delta )\subset {\mathbb{R}}^{d}$ denotes the closed ball centered around $x\in {\mathbb{R}}^{d}$ with radius δ.

  • H 5  
    Let ${\mathcal{Z}}^{\infty }{:=}(\mathcal{Z}\times \mathcal{Z}\times \cdots \enspace )$ denote the countable product endowed with the product topology and let $\mathfrak{B}$ be the Borel σ-algebra generated by ${\mathcal{Z}}^{\infty }$. Let $\mathfrak{F},\mathfrak{G}$ be the sub-σ-algebras of $\mathfrak{B}$ generated by the collections of random variables given by $\left\{\hat{\mathcal{R}}(w,S):w\in \mathcal{W},n\geqslant 1\right\}$ and $\left\{\mathbb{1}\left\{w\in {N}_{\delta }({\mathcal{W}}_{S})\right\}:\delta \in {\mathbb{Q}}_{ > 0},w\in {N}_{\delta },n\geqslant 1\right\}$ respectively. There exists a constant M ⩾ 1 such that for any $A\in \mathfrak{F}$, $B\in \mathfrak{G}$ we have $\mathbb{P}\left[A\cap B\right]\leqslant M\mathbb{P}\left[A\right]\mathbb{P}[B]$.

This assumption is common in statistics and is sometimes referred to as the ψ-mixing condition, a measure of weak dependence often used in proving limit theorems, see e.g. [Bra83]; yet, it is unfortunately hard to verify this condition in practice. In our context H 5 essentially quantifies the dependence between S and the set ${\mathcal{W}}_{S}$, through the constant M > 0: smaller M indicates that the dependence of $\hat{\mathcal{R}}$ on the training sample S is weaker. This concept is also similar to the mutual information used recently in [AAV18, HŞKM21, RZ19, XR17] and to the concept of stability [BE02].

Theorem 3. Assume that H 1, 2 and 5 hold, and that H 4 holds with ${\mathcal{W}}_{S}$ in place of $\mathcal{W}$ for all n ⩾ 1 and $S\in {\mathcal{Z}}^{n}$ (where s, a, b, r0 can potentially depend on n and S). Then, for n sufficiently large, we have

$\underset{w\in {\mathcal{W}}_{S}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant 2B\sqrt{\frac{\left[{\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}+1\right]{\mathrm{log}}^{2}(n{L}^{2})}{n}+\frac{\mathrm{log}(7M/\gamma )}{n}}\qquad$ (22)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$ and U ∼ μu .

This result shows that under H 5, we can replace dH in theorem 1 with ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$, at the expense of introducing the coupling coefficient M into the bound. We observe that two competing terms are governing the generalization error: in the case where ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ is small, the error is dominated by the coupling parameter M, and vice versa. On the other hand, in the context of proposition 1, ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}\leqslant {\beta }_{S}$, μu -almost surely, implying again that a heavy-tailed W(S) would achieve smaller generalization error as long as the dependency between S and ${\mathcal{W}}_{S}$ is weak.

5. Experiments

We empirically study the generalization behavior of deep neural networks from the Hausdorff dimension perspective. We use VGG networks [SZ15] as they perform well in practice, and their depth (the number of layers) can be controlled directly. We vary the number of layers from D = 4 to D = 19, resulting in a number of parameters d between 1.3M and 20M. We train the models on the CIFAR-10 dataset [KH09] using SGD with various stepsizes η and batch sizes B. We provide the full range of parameters and additional implementation details in appendix A. The code can be found at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization.

We assume that SGD can be well-approximated by the process (4). Hence, to bound the corresponding ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ to be used in theorem 3, we invoke proposition 1, which relies on the existence of a point wS around which the process behaves like a regular Lévy process with exponent ψS . Considering the empirical observation that SGD exhibits a 'diffusive behavior' around a local minimum [BJSG+18], we take wS to be the local minimum found by SGD and assume that the conditions of proposition 1 hold around that point. This perspective indicates that the generalization error can be controlled by the BG index βS of the Lévy process defined by ψS (ξ); the sub-symbol of the process (4) around wS .

Estimating the BG index for a general Lévy process is a challenging task; however, the choice of the SDE (4) imposes some structure on ψS , which lets us express βS in a simpler form. Inspired by the observation that the tail-index of the gradient noise in a multi-layer neural network differs from layer to layer, as reported in [ŞGN+19], we will assume that, around the local minimum wS , the dynamics of SGD are similar to the Lévy motion with frozen coefficients: ${{\Sigma}}_{2}({w}_{S}){\mathrm{L}}^{\boldsymbol{\alpha }({w}_{S})}$, see (4) for definitions. We will further impose that, around wS , the coordinates corresponding to the same layer l have the same tail-index αl . Under this assumption, the BG index can be analytically computed as ${\beta }_{S}={\mathrm{max}}_{l}\enspace {\alpha }_{l}\in (0,2]$ [Hen73, MX05]. While the range (0,2] might seem narrow at first sight, we note that ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$, hence βS , determines the order of the generalization error, and this parameter gets closer to 0 as more layers are added to the network (see figure 2). Thanks to this simplification, we can easily compute βS by first estimating each αl with the estimator proposed in [MMO15], which can efficiently estimate αl from multiple SGD iterates.
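
A sketch of this estimation step is given below. The function estimate_alpha is our rendering of the moment-based estimator of [MMO15] (as used in [ŞGN+19]), which compares the average log-magnitude of the (centered) observations with that of their block sums; estimate_beta_S then applies it layer-wise and returns $\hat{\beta }_{S}={\mathrm{max}}_{l}\enspace {\hat{\alpha }}_{l}$. The centering and the choice of block size are our assumptions.

```python
import numpy as np

def estimate_alpha(x, K1=None):
    """Tail-index estimator in the spirit of [MMO15]:
    1/alpha_hat = (mean log|block sums of size K1| - mean log|x_i|) / log(K1)."""
    x = np.asarray(x, dtype=float).ravel()
    x = x - x.mean()                                  # assume centered observations
    K = len(x)
    K1 = K1 or int(np.sqrt(K))                        # block size (assumption)
    K2 = K // K1
    x = x[:K1 * K2]
    Y = x.reshape(K2, K1).sum(axis=1)                 # block sums
    inv_alpha = (np.log(np.abs(Y) + 1e-12).mean()
                 - np.log(np.abs(x) + 1e-12).mean()) / np.log(K1)
    return 1.0 / inv_alpha

def estimate_beta_S(layerwise_iterates):
    """layerwise_iterates: dict mapping a layer name to an array of its SGD iterates
    collected over the last epoch. Returns beta_S = max_l alpha_l."""
    return max(estimate_alpha(v) for v in layerwise_iterates.values())
```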


Figure 2. Empirical study of the generalization behavior of VGG [SZ15] networks with various depth values (the number of layers is shown as D). As our theory predicts, the generalization error is strongly correlated with βS . As ${\beta }_{S}\in (0,2]$, estimates exceeding 2 are an artifact of the estimator.


We trained all the models for 100 epochs and computed their βS over the last epoch, assuming that the iterations reach near local minima. We monitor the generalization error, in terms of the difference between the training and test accuracy, with respect to the estimated βS in figure 2(a). We also plot the final test accuracy in figure 2(b). The test accuracy results validate that the models perform similarly to the state-of-the-art, which suggests that our empirical study matches practically relevant application settings. The results in figure 2(a) indicate that, as predicted by our theory, the generalization error is strongly correlated with βS , which is an upper-bound on the Hausdorff dimension. With increasing βS (implying increasing Hausdorff dimension), the generalization error increases, as our theory indicates. Moreover, the resulting behavior validates the importance of considering ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$ as opposed to the ambient dimension: for example, the number of parameters in the four-layer network is significantly lower than in the other networks; however, its Hausdorff dimension as well as its generalization error are significantly higher. Even more importantly, there is no monotonic relationship between the number of parameters and ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$. In other words, increasing depth is not always beneficial from the generalization perspective; it is only beneficial if it also decreases ${\mathrm{dim}}_{\mathrm{H}}\enspace {\mathcal{W}}_{S}$. We also observe an interesting behavior: the choice of η and B seems to affect βS , indicating that the choice of the algorithm parameters can impact the tail behavior of the algorithm. In summary, our theory holds over a large selection of depths, step-sizes, and batch sizes when tested on deep neural networks. We provide additional experiments on synthetic models in appendix A.

6. Conclusion

In this paper, we rigorously tied the generalization in a learning task to the tail properties of the underlying training algorithm, shedding light on an empirically observed phenomenon. We established this relationship through the Hausdorff dimension of the SDE approximating the algorithm, and proved a generalization error bound based on this notion of complexity. Unlike the ambient dimension, our bounds do not necessarily grow with the number of parameters in the network, and they solely depend on the tail behavior of the training process, providing an explanation for the implicit regularization effect of heavy-tailed SGD.

Finally, we note that extensions of our framework have shown that the generalization error can be linked to topological data analysis tools [BLGŞ21], and our tools can be used for analyzing discrete-time dynamical systems through the fractal dimensions of their invariant measures [CDE+21].

Acknowledgments

In the early version of the manuscript, which was published at NeurIPS 2020, we identified an imprecision in definition 2, and a mistake in figure 2 and in the statement and the proof of theorem 3, which are now fixed. The authors are grateful to Berfin Şimşek and Xiaochuan Yang for fruitful discussions, and thank Vaishnavh Nagarajan for pointing out the imprecision in definition 2. The contribution of Umut Şimşekli to this work is partly supported by the French National Research Agency (ANR) as a part of the FBIMATRIX (ANR-16-CE23-0014) project.

Appendix A.: Additional experimental results and implementation details

A.1. Comparison with other generalization metrics for deep networks

In this section, we empirically compare the proposed metric with existing generalization metrics developed for neural networks. Specifically, we consider the 'flat minima' argument of Jastrzebski et al [JKA+17] and plot the generalization error vs η/B, the ratio of the step size to the batch size. As a second comparison, we use the heavy-tailed random matrix theory based metric of Martin and Mahoney [MM19]. We plot the generalization error with respect to each metric in figure A1. As the results suggest, our metric is the one that correlates best with the empirically observed generalization error. The metric proposed by Martin and Mahoney [MM19] fails for small numbers of layers, and the resulting behavior is not monotonic. Similarly, η/B captures the relationship for very deep networks (D = 16 and 19); however, it fails in the other settings.


Figure A1. Empirical comparison to other capacity metrics.


We also note that norm-based capacity metrics [NTS15] typically increase with increasing dimension d; we refer to [NBMS17] for details.

A.2. Synthetic experiments

We consider a simple synthetic logistic regression problem, where the data distribution is a Gaussian mixture model with two components. Each data point ${z}_{i}\equiv ({x}_{i},{y}_{i})\in \mathcal{Z}={\mathbb{R}}^{d}\times \left\{-1,1\right\}$ is generated by simulating the model: yi ∼ Bernoulli(1/2) and ${x}_{i}\vert {y}_{i}\sim \mathcal{N}({m}_{{y}_{i}},100{\mathrm{I}}_{d})$, where the means are drawn from a Gaussian: ${m}_{-1},{m}_{1}\sim \mathcal{N}(0,25{\mathrm{I}}_{d})$. The loss function is the logistic loss, $\ell (w,z)=\mathrm{log}(1+\mathrm{exp}(-y\,{x}^{\top }w))$.

As for the algorithm, we consider a data-independent multivariate stable process: ${[\mathcal{A}(S)]}_{t}={\mathrm{L}}_{t}^{\alpha }$ for any $S\in {\mathcal{Z}}^{n}$, where ${\mathrm{L}}_{1}^{\alpha }$ is distributed with an elliptically contoured α-stable distribution with $\alpha \in (0,2]$ (see section 2): when α = 2, ${\mathrm{L}}_{t}^{\alpha }$ is just a Brownian motion, and as α gets smaller, the process becomes heavier-tailed. By theorem 4.2 of [BG60], $\mathcal{A}$ has the uniform Hausdorff dimension property with dH = α, independently of d, when d ⩾ 2.

We set d = 10 and generate points to represent the whole population, i.e. ${\left\{{z}_{i}\right\}}_{i=1}^{{n}_{\text{tot}}}$ with ntot = 100 K. Then, for different values of α, we simulate $\mathcal{A}$ for t ∈ [0, 1], by using a small step-size η = 0.001 (the total number of iterations is hence 1/η). We finally draw 20 random sets S with n elements from this population, and we monitor the maximum difference ${\mathrm{sup}}_{w\in {\mathcal{W}}_{S}}\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\vert $ for different values of n. We repeat the whole procedure 20 times and report the average values in figure A2. We observe that the results support theorems 1 and 3: for every n, the generalization error decreases with decreasing α, illustrating the role of the Hausdorff dimension.
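
A condensed sketch of this synthetic pipeline follows; it uses scipy's levy_stable with independent $\mathcal{S}\alpha \mathcal{S}$ coordinates in place of the elliptically contoured vector for simplicity, and all names are ours.

```python
import numpy as np
from scipy.stats import levy_stable

rng = np.random.default_rng(0)
d, n_tot, eta, n = 10, 100_000, 0.001, 1_000

# Gaussian-mixture population: y ~ Bernoulli(1/2) on {-1, 1}, x | y ~ N(m_y, 100 I_d).
m = {-1: rng.normal(0.0, 5.0, d), 1: rng.normal(0.0, 5.0, d)}
y_all = rng.choice([-1, 1], n_tot)
x_all = np.stack([rng.normal(m[y], 10.0) for y in y_all])

def loss(w, x, y):
    return np.logaddexp(0.0, -y * (x @ w))         # numerically stable logistic loss

def generalization_gap(alpha):
    # Data-independent trajectory [A(S)]_t = L_t^alpha, simulated with step eta.
    inc = eta ** (1.0 / alpha) * levy_stable.rvs(alpha, 0.0, size=(int(1 / eta), d),
                                                 random_state=0)
    W_S = np.vstack([np.zeros(d), np.cumsum(inc, axis=0)])
    idx = rng.choice(n_tot, n, replace=False)      # a training set S of size n
    x_S, y_S = x_all[idx], y_all[idx]
    return max(abs(loss(w, x_S, y_S).mean() - loss(w, x_all, y_all).mean()) for w in W_S)

print({alpha: generalization_gap(alpha) for alpha in (2.0, 1.8, 1.5, 1.2)})
```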


Figure A2. Results on synthetic data.


A.3. Implementation details for the deep neural network experiments

In this section, we provide the additional details that are skipped in the main text for the sake of space. We use the following VGG-style neural networks with various numbers of layers:

  • VGG4: Conv(512)–ReLU–MaxPool–Linear
  • VGG6: Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG7: Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG8: Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–MaxPool–Linear
  • VGG11: Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–MaxPool –Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear
  • VGG16: Conv(64)–ReLU–Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear
  • VGG19: Conv(64)–ReLU–Conv(64)–ReLU–MaxPool–Conv(128)–ReLU–Conv(128)–ReLU–MaxPool–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–Conv(256)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–Conv(512)–ReLU–MaxPool–Linear

where all convolutions are annotated with the number of filters in parentheses. Moreover, we use the following hyperparameter ranges for the step size of SGD: {1e–2, 1e–3, 3e–3, 1e–4, 3e–4, 1e–5, 3e–5}, with batch sizes {32, 64, 128, 256}. All networks are trained with the cross-entropy loss and ReLU activations, and no additional technique like batch normalization or dropout is used. While computing the empirical BG index over the layers, we only consider the convolutional layers, ignoring the final fully-connected layer. We also release the full source code of the experiments at https://github.com/umutsimsekli/Hausdorff-Dimension-and-Generalization.
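
For illustration, the smallest of these architectures (VGG4) can be written in a few lines of PyTorch; the 3 × 3 kernels with padding 1 and the 2 × 2 max-pooling are our assumptions, chosen so that the parameter count lands near the 1.3M quoted in section 5.

```python
import torch.nn as nn

# VGG4 from the list above: Conv(512)-ReLU-MaxPool-Linear, for 3x32x32 CIFAR-10 inputs.
vgg4 = nn.Sequential(
    nn.Conv2d(3, 512, kernel_size=3, padding=1),   # assumed 3x3 kernels, 'same' padding
    nn.ReLU(),
    nn.MaxPool2d(2),                               # 32x32 -> 16x16
    nn.Flatten(),
    nn.Linear(512 * 16 * 16, 10),                  # 10 CIFAR-10 classes
)
print(sum(p.numel() for p in vgg4.parameters()))   # roughly 1.3M parameters
```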

Appendix B.: Representing optimization algorithms as Feller processes

Thanks to the generality of the Feller processes, we can represent multiple popular stochastic optimization algorithms as a Feller process, in addition to SGD. For instance, let us consider the following SDE:

$\mathrm{d}{\mathrm{W}}_{t}=-{{\Sigma}}_{0}({\mathrm{W}}_{t})\nabla f({\mathrm{W}}_{t})\mathrm{d}t+{{\Sigma}}_{1}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{B}}_{t}+{{\Sigma}}_{2}({\mathrm{W}}_{t})\mathrm{d}{\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}\qquad$ (B.1)

where Σ0, Σ1, Σ2 are d × d matrix-valued functions and the tail-index  α (⋅) of ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ is also allowed to change depending on the value of the state Wt . We can verify that this SDE corresponds to a Feller process with b(w) = −Σ0(w)∇f(w), Σ(w) = 2Σ1(w), and an appropriate choice of ν [HDS18]. As we discussed in the main document, the choice Σ0 = Id can represent SGD with state-dependent Gaussian and/or heavy-tailed noise. Besides, we can choose an appropriate Σ0 in order to represent optimization algorithms that use second-order geometric information, such as natural gradient [Ama98] or stochastic Newton [EM15] algorithms. On the other hand, by using the SDEs proposed in [BB18, GGZ18, LPH+17, OKL19, ŞZTG20], we can further represent momentum-based algorithms such as SGD with momentum [Pol64] as Feller processes.
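
As a small numerical illustration of the role of Σ0, the Euler discretization of (B.1) with Σ1 = Σ2 = 0 and a diagonal Σ0 is simply preconditioned gradient descent; the diagonal preconditioner below is a placeholder, standing in for, e.g., an approximate second-order scaling.

```python
import numpy as np

def preconditioned_step(w, grad, precond_diag, eta=0.01):
    """One Euler step of dW_t = -Sigma_0(W_t) grad f(W_t) dt with a diagonal Sigma_0."""
    return w - eta * precond_diag * grad          # elementwise rescaling of the gradient

w = np.zeros(5)
grad = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
precond = 1.0 / (np.abs(grad) + 1.0)              # placeholder diagonal preconditioner
w_next = preconditioned_step(w, grad, precond)
```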

Appendix C.: Decomposable Feller processes and their Hausdorff dimension

In our study, we focus on decomposable Feller processes, introduced in [Sch98]. Let us consider a Feller process expressed by its symbol Ψ. We call the process defined by Ψ decomposable at w0, if there exists a point ${w}_{0}\in {\mathbb{R}}^{d}$ such that the symbol can be decomposed as

${\Psi}(w,\xi )=\psi (\xi )+\tilde{{\Psi}}(w,\xi )\qquad$ (C.1)

where ψ(ξ) := Ψ(w0, ξ) is the sub-symbol and $\tilde{{\Psi}}(w,\xi )={\Psi}(w,\xi )-{\Psi}({w}_{0},\xi )$ is the remainder term. Let $\mathbf{j}\in {\mathbb{N}}_{0}^{d}$ denote a multi-index 10 . We assume that there exist functions $a,{{\Phi}}_{\mathbf{j}}:{\mathbb{R}}^{d}{\mapsto}\mathbb{R}$ such that the following hold:

  • Ψ(x, 0) ≡ 0
  • ${{\Vert}{{\Phi}}_{0}{\Vert}}_{\infty }< \infty $, and ${{\Phi}}_{\mathbf{j}}\in {L}^{1}\left({\mathbb{R}}^{d}\right)$ for all |j| ⩽ d + 1.
  • $\left\vert {\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )\right\vert \leqslant {{\Phi}}_{\mathbf{j}}(w)\left(1+{a}^{2}(\xi )\right)$, for all $w,\xi \in {\mathbb{R}}^{d}$ and |j| ⩽ d + 1.
  • ${a}^{2}(\xi )\geqslant {\kappa }_{0}{\Vert}\xi {{\Vert}}^{{r}_{0}}$, for ||ξ|| large, ${r}_{0}\in (0,2]$, and κ0 > 0.

The Hausdorff dimension of the image of a decomposable Feller process is bounded, due to the following result.

Theorem 4 ([Sch98] theorem 4).

Let Ψ(x, ξ) generate a Feller process, denoted by ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$. Assume that Ψ is decomposable at w0 with the sub-symbol ψ. Then, for any given $T\in {\mathbb{R}}_{+}$, we have

${\mathrm{dim}}_{\mathrm{H}}\enspace \mathrm{W}([0,T])\leqslant \beta ,\quad {\mathbb{P}}^{x}\text{-almost surely, for every}\enspace x\in {\mathbb{R}}^{d},\qquad$ (C.2)

where W([0, T]) := {w: w = Wt , for some t ∈ [0, T]} is the image of the process, ${\mathbb{P}}^{x}$ denotes the law of the process ${\left\{{\mathrm{W}}_{t}\right\}}_{t\geqslant 0}$ with initial value x, and β is the upper Blumenthal–Getoor index of the Lévy process with the characteristic exponent ψ(ξ), given as follows:

$\beta {:=}\mathrm{inf}\left\{\lambda \geqslant 0:\underset{{\Vert}\xi {\Vert}\to \infty }{\mathrm{lim}}\enspace \frac{\vert \psi (\xi )\vert }{{\Vert}\xi {{\Vert}}^{\lambda }}=0\right\}\qquad$ (C.3)

Appendix D.: Additional technical background

In this section, we will define the notions that will be used in our proofs. For the sake of completeness we also provide the main theoretical results that will be used in our proofs.

D.1. The Minkowski dimension

In our proofs, in addition to the Hausdorff dimension, we also make use of another notion of dimension, referred to as the Minkowski dimension (also known as the box-counting dimension [Fal04]), which is defined as follows.

Definition 3. Let $G\subset {\mathbb{R}}^{d}$ be a set and let Nδ (G) be any one of the following quantities:

  • The smallest number of sets of diameter at most δ which cover G
  • The smallest number of closed balls of diameter at most δ which cover G
  • The smallest number of cubes of side at most δ which cover G
  • The number of δ-mesh cubes that intersect G
  • The largest number of disjoint balls of radius δ, whose centers are in G.

Then the lower- and upper-Minkowski dimensions of G are respectively defined as follows:

$\underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G{:=}\underset{\delta \to 0}{\mathrm{lim}\enspace \mathrm{inf}}\enspace \frac{\mathrm{log}\enspace {N}_{\delta }(G)}{\mathrm{log}(1/\delta )},\qquad \bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace G{:=}\underset{\delta \to 0}{\mathrm{lim}\enspace \mathrm{sup}}\enspace \frac{\mathrm{log}\enspace {N}_{\delta }(G)}{\mathrm{log}(1/\delta )}\qquad$ (D.1)

In case $\underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace G$, the Minkowski dimension dimM(G) is their common value.

We always have $0\leqslant {\mathrm{dim}}_{\mathrm{H}}\enspace G\leqslant \underline{{\mathrm{dim}}_{\mathrm{M}}}\enspace G\leqslant \bar{{\mathrm{dim}}_{\mathrm{M}}}G\leqslant d$ where the inequalities can be strict [Fal04].
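
As a numerical companion to definition 3, the sketch below estimates the Minkowski (box-counting) dimension of a finite-level approximation of the Cantor set by counting δ-mesh cubes at several scales; the regression slope should come out close to log 2/log 3 ≈ 0.63.

```python
import numpy as np

def cantor_points(level):
    """Left endpoints of the intervals in the level-k construction of the Cantor set."""
    pts = np.array([0.0])
    for _ in range(level):
        pts = np.concatenate([pts / 3.0, pts / 3.0 + 2.0 / 3.0])
    return pts

def box_counting_dimension(points, scales):
    """Estimate dim_M by regressing log N_delta(G) on log(1/delta) over delta-mesh cubes."""
    # The small offset guards against floating-point roundoff at cube boundaries.
    counts = [len(np.unique(np.floor(points / delta + 1e-9))) for delta in scales]
    slope, _ = np.polyfit(np.log(1.0 / np.array(scales)), np.log(counts), 1)
    return slope

pts = cantor_points(12)                            # 2^12 points at resolution 3^{-12}
scales = [3.0 ** (-k) for k in range(2, 10)]
print(box_counting_dimension(pts, scales), np.log(2) / np.log(3))
```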

It is possible to construct examples where the Hausdorff and Minkowski dimensions are different from each other. However, in many interesting cases, these two dimensions often match each other [Fal04]. In this paper, we are interested in such a case, i.e. the case when the Hausdorff and Minkowski dimensions match. The following result identifies the conditions for which the two dimensions match each other, which form the basis of H 4:

Theorem 5 ([Mat99] theorem 5.7). Let A be a non-empty bounded subset of ${\mathbb{R}}^{d}$. Suppose there is a Borel measure μ on ${\mathbb{R}}^{d}$ and there are positive numbers a, b, r0 and s such that $0< \mu (A)\leqslant \mu \left({\mathbb{R}}^{d}\right)< \infty $ and

$a{r}^{s}\leqslant \mu ({B}_{d}(x,r))\leqslant b{r}^{s}\quad \text{for}\enspace x\in A,\enspace 0< r\leqslant {r}_{0}.\qquad$ (D.2)

Then ${\mathrm{dim}}_{\mathrm{H}}\enspace A={\mathrm{dim}}_{\mathrm{M}}\enspace A=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace A=s$.

D.2. Egoroff's theorem

Egoroff's theorem is an important result in measure theory and establishes a condition under which pointwise convergence of measurable functions becomes uniform on a set of almost full measure.

Theorem 6 (Egoroff's theorem [Bog07] theorem 2.2.1). Let $(X,\mathcal{A},\mu )$ be a space with a finite nonnegative measure μ and let μ-measurable functions fn be such that μ-almost everywhere there is a finite limit f(x) := limnfn (x). Then, for every ɛ > 0, there exists a set ${X}_{\varepsilon }\in \mathcal{A}$ such that $\mu \left(X{\backslash}{X}_{\varepsilon }\right)< \varepsilon $ and the functions fn converge to f uniformly on Xɛ .

Appendix E.: Postponed proofs

E.1. Proof of proposition 1

Proof. Let ΨS denote the symbol of the process W(S). Then, the desired result can be obtained by directly applying theorem 4 to each ΨS . □

E.2. Proof of theorem 1

We first prove the following more general result which relies on $\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$.

Lemma 1. Assume that ℓ is bounded by B and L-Lipschitz continuous in w. Let $\mathcal{W}\subset {\mathbb{R}}^{d}$ be a set with finite diameter. Then, for n sufficiently large, we have

$\underset{w\in \mathcal{W}}{\mathrm{sup}}\left\vert \hat{\mathcal{R}}(w,S)-\mathcal{R}(w)\right\vert \leqslant B\sqrt{\frac{2\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}\qquad$ (E.1)

with probability at least 1 − γ over $S\sim {\mu }_{z}^{\otimes n}$.

Proof. As ℓ is L-Lipschitz, so are $\mathcal{R}$ and $\hat{\mathcal{R}}$. By using the notation ${\hat{\mathcal{R}}}_{n}(w){:=}\hat{\mathcal{R}}(w,S)$, and by the triangle inequality, for any ${w}^{\prime }\in \mathcal{W}$ we have:

$\left\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\right\vert \leqslant \left\vert {\hat{\mathcal{R}}}_{n}(w)-{\hat{\mathcal{R}}}_{n}({w}^{\prime })\right\vert +\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert +\left\vert \mathcal{R}({w}^{\prime })-\mathcal{R}(w)\right\vert \qquad$ (E.2)

$\leqslant 2L{\Vert}w-{w}^{\prime }{\Vert}+\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert .\qquad$ (E.3)

Now since $\mathcal{W}$ has finite diameter, let us consider a finite δ-cover of $\mathcal{W}$ by balls and collect the center of each ball in the set ${N}_{\delta }{:=}{N}_{\delta }(\mathcal{W})$. Then, for each $w\in \mathcal{W}$, there exists a w' ∈ Nδ , such that ||ww'|| ⩽ δ. By choosing this w' in the above inequality, we obtain:

$\left\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\right\vert \leqslant 2L\delta +\underset{{w}^{\prime }\in {N}_{\delta }}{\mathrm{max}}\left\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\right\vert \qquad$ (E.4)

Taking the supremum of both sides of the above equation yields:

Equation (E.5)

Using the union bound over Nδ , we obtain

Equation (E.6)

Equation (E.7)

Further, for δ > 0, since Nδ has finitely many elements, we can invoke Hoeffding's inequality for each of the summands on the right-hand side and obtain

Equation (E.8)
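For concreteness, a typical form of this pointwise bound, under the convention that the loss takes values in an interval of length B (the exact constant in (E.8) depends on this convention), is

$\mathbb{P}\left(\vert {\hat{\mathcal{R}}}_{n}({w}^{\prime })-\mathcal{R}({w}^{\prime })\vert \geqslant \varepsilon \right)\leqslant 2\,\mathrm{exp}\left(-\frac{2n{\varepsilon }^{2}}{{B}^{2}}\right)\quad \text{for each fixed}\enspace {w}^{\prime }\in {N}_{\delta },$

and the union bound over the |Nδ | centers multiplies the right-hand side by |Nδ |.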

Notice that Nδ is a random set, and choosing ɛ based on |Nδ |, one can obtain a deterministic γ. Therefore, we can plug this back in (E.5) and obtain that, with probability at least 1 − γ

Equation (E.9)

Now since $\mathcal{W}\subset {\mathbb{R}}^{d}$, $\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$ is finite. Then, for any sequence ${\left\{{\delta }_{n}\right\}}_{n\in \mathbb{N}}$ such that ${\mathrm{lim}}_{n\to \infty }{\delta }_{n}=0$, we have: for every ɛ > 0, there exists nɛ > 0 such that n ⩾ nɛ implies

Equation (E.10)

Choosing ${\delta }_{n}=1/\sqrt{n{L}^{2}}$ and ${\epsilon}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$, we have, for all $n\geqslant {n}_{\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}}$,

Equation (E.11)
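For the reader's convenience, the arithmetic behind this choice is as follows, assuming (E.10) is the covering-number form of the upper-Minkowski dimension, i.e. $\log \vert {N}_{{\delta }_{n}}\vert \leqslant \left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}+{\epsilon}\right)\log (1/{\delta }_{n})$ for sufficiently large n. With ${\epsilon}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}$ and ${\delta }_{n}=1/\sqrt{n{L}^{2}}$,

$\log \vert {N}_{{\delta }_{n}}\vert \leqslant 2\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\log \sqrt{n{L}^{2}}=\left(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}\right)\log (n{L}^{2}),\qquad L{\delta }_{n}=\frac{1}{\sqrt{n}},$

which is how the log(nL2) term and the vanishing discretization error Lδn enter the bound below.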

Therefore, we obtain with probability at least 1 − γ

Equation (E.12)

Equation (E.13)

for sufficiently large n. This concludes the proof. □

We now proceed to the proof of theorem 1.

Proof of theorem 1. By noticing that ${\mathcal{Z}}^{n}$ is countable (since $\mathcal{Z}$ is countable) and using the property that ${\mathrm{dim}}_{\mathrm{H}}{\cup }_{i\in \mathbb{N}}{A}_{i}={\mathrm{sup}}_{i\in \mathbb{N}}\enspace {\mathrm{dim}}_{\mathrm{H}}\enspace {A}_{i}$ (cf [Fal04], section 3.2), we observe that

Equation (E.14)

μu -almost surely. Define the event ${\mathcal{Q}}_{R}=\left\{\mathrm{d}\mathrm{i}\mathrm{a}\mathrm{m}(\mathcal{W})\leqslant R\right\}$. On the event ${\mathcal{Q}}_{R}$, by theorem 5, we have that ${\mathrm{dim}}_{\mathrm{M}}\enspace \mathcal{W}=\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W}={\mathrm{dim}}_{\mathrm{H}}\enspace \mathcal{W}\leqslant {d}_{\mathrm{H}}$, μu -almost surely.

Now, we observe that

Equation (E.15)

Hence, by defining $\varepsilon =B\sqrt{\frac{2(\bar{{\mathrm{dim}}_{\mathrm{M}}}\enspace \mathcal{W})\mathrm{log}(n{L}^{2})}{n}+\frac{\mathrm{log}(1/\gamma )}{n}}$, and using the independence of S and U, lemma 1, and (E.14), we write

Finally, we let R → ∞ and use the dominated convergence theorem to obtain

which concludes the proof. □

E.3. Proof of theorem 2

Similarly to the proof of theorem 1, we first prove a more general result in which $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}$ is bounded by a fixed constant.

Lemma 2. Assume that the loss is bounded by B and L-Lipschitz continuous in w. Let $\mathcal{W}\subset {\mathbb{R}}^{d}$ be a bounded set with $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}\leqslant {d}_{\mathrm{M}}$. For any function $\rho :\mathbb{R}\to \mathbb{R}$ satisfying ${\mathrm{lim}}_{x\to \infty }\rho (x)=\infty $ and for sufficiently large n, with probability at least 1 − γ, we have

where c is an absolute constant.

Proof. We define the empirical process

and we notice that

Recall that a random process ${\left\{G(w)\right\}}_{w\in \mathcal{W}}$ on a metric space $(\mathcal{W},d)$ is said to have sub-Gaussian increments if there exists K ⩾ 0 such that

Equation (E.16)

${\Vert}G(w)-G({w}^{\prime }){{\Vert}}_{{\psi }_{2}}\leqslant K\,d(w,{w}^{\prime })\quad \text{for all}\enspace w,{w}^{\prime }\in \mathcal{W},$

where ${\Vert}\cdot {{\Vert}}_{{\psi }_{2}}$ denotes the sub-Gaussian norm [Ver19].

We verify that ${\left\{{\mathcal{G}}_{n}(w)\right\}}_{w}$ has sub-Gaussian increments with $K=2L/\sqrt{n}$, where the metric is the standard Euclidean metric, d(w, w') = ||w − w'||. To see why this is the case, notice that

which is a sum of i.i.d. random variables that are uniformly bounded by

by the Lipschitz continuity of the loss. Therefore, Hoeffding's lemma for bounded and centered random variables easily implies that

Equation (E.17)

thus, we have ${\Vert}{\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime }){{\Vert}}_{{\psi }_{2}}\leqslant (2L/\sqrt{n}){\Vert}w-{w}^{\prime }{\Vert}$.
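To spell out the computation (a sketch, assuming ${\mathcal{G}}_{n}(w)={\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)$ and ignoring the absolute constant hidden in the definition of the ψ2-norm): write ${\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime })=\frac{1}{n}{\sum }_{i=1}^{n}{Y}_{i}$ with centered summands satisfying |Yi | ⩽ 2L||w − w'||. Hoeffding's lemma applied to each summand, together with independence, gives

$\mathbb{E}\,\mathrm{exp}\left(\lambda \left({\mathcal{G}}_{n}(w)-{\mathcal{G}}_{n}({w}^{\prime })\right)\right)\leqslant \mathrm{exp}\left(\frac{{\lambda }^{2}}{2}\cdot \frac{4{L}^{2}{\Vert}w-{w}^{\prime }{{\Vert}}^{2}}{n}\right),$

i.e. the increment is sub-Gaussian with variance proxy 4L2||w − w'||2/n, which corresponds to $K=2L/\sqrt{n}$.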

Next, define the sequence ${\delta }_{k}={2}^{-k}$ and notice that we have δk ↓ 0. Dudley's tail bound (see for example theorem 8.16 in [Ver19]) for this empirical process implies that, with probability at least 1 − γ, we have
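Recall, for later use, the elementary geometric-sum identity for this dyadic sequence:

${\sum }_{k\geqslant {k}_{0}}{\delta }_{k}={\sum }_{k\geqslant {k}_{0}}{2}^{-k}={2}^{-{k}_{0}+1}=2{\delta }_{{k}_{0}}.$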

Equation (E.18)

where C is an absolute constant and

In order to apply Dudley's lemma, we need to bound the above summation. For that, choose κ0 such that

and any strictly increasing function $\rho :\mathbb{R}\to \mathbb{R}$.

Now since $\bar{{\mathrm{dim}}_{\mathrm{M}}}\mathcal{W}\leqslant {d}_{\mathrm{M}}$, for the sequence ${\left\{{\delta }_{k}\right\}}_{k\in \mathbb{N}}$, and for a sufficiently large n, whenever k ⩾ ⌊ρ(n)⌋, we have

By splitting the entropy sum in Dudley's tail inequality in two terms, we obtain

For the first term on the right-hand side, we use the monotonicity of covering numbers, i.e. $\vert {N}_{{\delta }_{k}}\vert \leqslant \vert {N}_{{\delta }_{l}}\vert $ for kl, and write

For the second term on the right-hand side, we have

Combining these, we obtain

Plugging this bound back in Dudley's tail bound (E.18), we obtain

Now fix ${w}_{0}\in \mathcal{W}$ and apply the triangle inequality:

Clearly for a fixed ${w}_{0}\in \mathcal{W}$, we can apply Hoeffding's inequality and obtain that, with probability at least 1 − γ,

Combining this with the previous result, we have with probability at least 1 − 2γ

Finally, replacing γ with γ/2 and collecting the absolute constants in c, we conclude the proof. □

Proof of theorem 2. The proof follows the same lines as the proof of theorem 1, except that we invoke lemma 2 instead of lemma 1. □

E.4. Proof of theorem 3

Proof. We start by restating our theoretical framework in an equivalent way for mathematical convenience. In particular, consider the (countable) product measure ${\mu }_{z}^{\infty }={\mu }_{z}\otimes {\mu }_{z}\otimes \dots \enspace $ defined on the cylindrical sigma-algebra. Accordingly, denote $\mathbf{S}\sim {\mu }_{z}^{\infty }$ as an infinite sequence of i.i.d. random vectors, i.e. $\mathbf{S}={({z}_{j})}_{j\geqslant 1}$ with zj i.i.d. μz for all j = 1, 2, .... Furthermore, let Sn := (z1, ..., zn ) be the first n elements of S. In this notation, we have $S\stackrel{\mathrm{d}}{=}{\mathbf{S}}_{n}$ and ${\mathcal{W}}_{S}\stackrel{\mathrm{d}}{=}{\mathcal{W}}_{{\mathbf{S}}_{n}}$, where $\stackrel{\mathrm{d}}{=}$ denotes equality in distribution. Similarly, we have ${\hat{\mathcal{R}}}_{n}(w)=\hat{\mathcal{R}}(w,{\mathbf{S}}_{n})$.

Due to the hypotheses and theorem 5, we have $\bar{{\mathrm{dim}}_{\mathrm{M}}}{\mathcal{W}}_{{\mathbf{S}}_{n}}={\mathrm{dim}}_{\mathrm{H}}{\mathcal{W}}_{{\mathbf{S}}_{n}}=:{d}_{\mathrm{H}}(\mathbf{S},n)$, μu -almost surely. It is easy to verify that the particular forms of the δ-covers and Nδ in H 5 still yield the same Minkowski dimension in (D.1). Then by definition, we have for all S and n:

Equation (E.19)

μu -almost surely. Hence for each n

Equation (E.20)

as δ → 0 almost surely, or alternatively, for each n, there exists a set Ωn of full measure such that

Equation (E.21)

for all S ∈ Ωn . Let Ω* := ∩n Ωn . Then for S ∈ Ω* we have that for all n

Equation (E.22)

and therefore, on this set we also have

where αn is a monotone increasing sequence such that αn ⩾ 1 and αn → ∞. To see why, suppose that we are given a collection of functions ${\left\{{g}_{n}(r)\right\}}_{n}$, where r > 0, such that ${\mathrm{lim}}_{r\to 0}{g}_{n}(r)=0$ for each n. We then have that the infinite-dimensional vector ${({g}_{n}(r))}_{n}\to (0,0,\dots \enspace )$ in the product topology on ${\mathbb{R}}^{\infty }$. We can metrize the product topology on ${\mathbb{R}}^{\infty }$ using the metric

where $\mathbf{x}={({x}_{n})}_{n\geqslant 1}$, $\mathbf{y}={({y}_{n})}_{n\geqslant 1}$, and αn ⩾ 1 is monotone increasing. Alternatively, notice that if ${\mathrm{lim}}_{r\to 0}{g}_{n}(r)=0$ for all n, then for any ɛ > 0 we can choose N0 such that 1/αn < ɛ for all n ⩾ N0, and choose r0 such that ${\mathrm{max}}_{n\leqslant {N}_{0}}\vert {g}_{n}(r)\vert < {\epsilon}$ for all r < r0. Then for r < r0 we have

where we used that αn ⩾ 1.
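For concreteness, one metric with the required properties (a plausible instance only; the exact form used above is not reproduced here) is

$d(\mathbf{x},\mathbf{y})=\underset{n\geqslant 1}{\mathrm{sup}}\enspace \frac{\mathrm{min}\left\{\vert {x}_{n}-{y}_{n}\vert ,1\right\}}{{\alpha }_{n}},$

which metrizes the product topology whenever αn ⩾ 1 and αn → ∞, and for which the two bounds above (over the coordinates n ⩽ N0 and n > N0) immediately give $d\left({({g}_{n}(r))}_{n},\mathbf{0}\right)\leqslant \varepsilon $ for r < r0.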

Applying the above reasoning we have that for all S ∈ Ω*

as δ → 0. By applying theorem 6 to the collection of random variables {Fδ (S); δ}, for any δ' > 0 we can find a subset $\mathfrak{Z}\subset {\mathcal{Z}}^{\infty }$, with probability at least 1 − δ' under ${\mu }_{z}^{\infty }$, such that on $\mathfrak{Z}$ the convergence is uniform, that is

where c(r) → 0 as r → 0. Notice that c(r) = c(δ', r), that is it depends on the choice of δ' (for any δ', we have limr→0c(r; δ') = 0).

As U and S are assumed to be independent, all the following statements hold μu -almost surely, hence we drop the dependence on U to ease the notation. We proceed as in the proof of lemma 1:

Equation (E.23)

Notice that on $\mathfrak{Z}$ we have that

so in particular for any sequence {δn ; n ⩾ 0} we have that

or

Let ${({\delta }_{n})}_{n\geqslant 0}$ be a decreasing sequence such that ${\delta }_{n}\in \mathbb{Q}$ for all n and δn → 0. We then have

For ρ > 0 and $k\in {\mathbb{N}}_{+}$, let us define ${J}_{k}(\rho ){:=}(k\rho ,(k+1)\rho ]$ and set ρn := log(1/δn ). Furthermore, for any t > 0 define

Notice that ɛ(t) is increasing in t. Therefore, we have

where we used the fact that dH(S, n) ⩽ d almost surely, and that on the event dH(S, n) ∈ Jk (ρn ) we have ɛ(dH(S, n)) ⩾ ɛ(kρn ).
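Note also that, since dH(S, n) ⩽ d almost surely, only finitely many buckets contribute to this decomposition: the event {dH(S, n) ∈ Jk (ρn )} is null unless kρn < d, so the union over k effectively ranges over at most

$\left\lceil d/{\rho }_{n}\right\rceil =\left\lceil d/\mathrm{log}(1/{\delta }_{n})\right\rceil $

indices, which keeps the subsequent union bound finite.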

Notice that the events

are in $\mathfrak{G}$. To see why, notice first that for any $0< \delta \in \mathbb{Q}$

so $\vert {N}_{\delta }({\mathcal{W}}_{{\mathbf{S}}_{n}})\vert $ is $\mathfrak{G}$-measurable as a finite sum of $\mathfrak{G}$-measurable variables. From (E.20) it can also be seen that dH(S, n) is also $\mathfrak{G}$-measurable as a countable supremum of $\mathfrak{G}$-measurable random variables. On the other hand, the event $\left\{\vert {\hat{\mathcal{R}}}_{n}(w)-\mathcal{R}(w)\vert \geqslant \varepsilon (k{\rho }_{n})\right\}$ is clearly in $\mathfrak{F}$ (see H 5 for definitions).

Therefore,

Now, notice that the mapping tɛ2(t) is linear, with Lipschitz coefficient

Therefore, on the event {dH(S, n) ∈ Jk (ρn )} we have

Equation (E.24)

Equation (E.25)

Hence,

where we used the fact that ρn = log(1/δn ). Therefore, we have

By the definition of ɛ(t), for any S and n, we have that:

Therefore,

That is, with probability at least 1 − (1 + 2e)δ' we have:

Choosing ${\delta }_{n}=1/\sqrt{n{L}^{2}}$ and αn = log(n), for each δ' > 0, with probability at least 1 − (1 + 2e)δ' we have

Equation (E.26)

where for each δ' > 0 we have ${\mathrm{lim}}_{n\to \infty }c({\delta }^{\prime },{\delta }_{n})=0$. Hence, for sufficiently large n, with probability at least 1 − (1 + 2e)δ' we have:

Equation (E.27)

Setting γ := (1 + 2e)δ' and using 1 + 2e < 7, we obtain the desired result. This concludes the proof. □

Footnotes

  • This article is an updated version of: Simsekli U, Sener O, Deligiannidis G and Erdogdu M A 2020 Hausdorff dimension, heavy tails, and generalization in neural networks Advances in Neural Information Processing Systems vol 33 eds H Larochelle, M Ranzato, R Hadsell, M F Balcan and H Lin (New York: Curran Associates) pp 5138–51.

  • Recently, Gurbuzbalaban et al [GSZ21] and Hodgkinson and Mahoney [HM21] have simultaneously shown that the law of the SGD iterates (2) can indeed converge to a heavy-tailed stationary distribution with infinite variance when the step-size η is large and/or the batch-size B is small. These results form a theoretical basis for the origins of the observed heavy-tailed behavior of SGD in practice.

  • Here ${\mathrm{L}}_{t}^{\alpha }$ is equivalent to ${\mathrm{L}}_{t}^{\boldsymbol{\alpha }(\cdot )}$ with ${\alpha }_{i}(w)=\alpha \in (0,2]$, ∀i ∈ {1, ..., d} and $\forall w\in {\mathbb{R}}^{d}$.

  • Note that this prevents adaptive minibatching algorithms, e.g. [AB99], from being represented in our framework.

  • H 4 ensures that the Hausdorff dimension of $\mathcal{W}$ coincides with another notion of dimension, called the Minkowski dimension, which is explained in detail in appendix D. We note that for many fractal-like sets, these two notions of dimensions are equal to each other (see [Mat99], chapter 5), which include α-stable processes (see [Fal04], chapter 16).

  • We use the multi-index convention j = (j1, ..., jd ) with each ${j}_{i}\in {\mathbb{N}}_{0}$, and we use the notation ${\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )=\frac{{\partial }^{{j}_{1}}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{1}^{{j}_{1}}}\dots \frac{{\partial }^{{j}_{d}}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{d}^{{j}_{d}}}$, and $\vert \mathbf{j}\vert ={\sum }_{i=1}^{d}{j}_{i}$.
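    For instance, under the usual reading of this convention as a mixed partial derivative, taking d = 2 and j = (1, 2) gives $\vert \mathbf{j}\vert =3$ and ${\partial }_{w}^{\mathbf{j}}\tilde{{\Psi}}(w,\xi )=\frac{{\partial }^{3}\tilde{{\Psi}}(w,\xi )}{\partial {w}_{1}\,\partial {w}_{2}^{2}}$.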
