
Asymptotic learning curves of kernel methods: empirical data versus teacher–student paradigm


Published 21 December 2020 © 2020 IOP Publishing Ltd and SISSA Medialab srl
Citation: Stefano Spigler et al J. Stat. Mech. (2020) 124001. DOI 10.1088/1742-5468/abc61d


Abstract

How many training data are needed to learn a supervised task? It is often observed that the generalization error decreases as $n^{-\beta }$ where n is the number of training examples and β is an exponent that depends on both data and algorithm. In this work we measure β when applying kernel methods to real datasets. For MNIST we find β ≈ 0.4 and for CIFAR10 β ≈ 0.1, for both regression and classification tasks, and for Gaussian or Laplace kernels. To rationalize the existence of non-trivial exponents that can be independent of the specific kernel used, we study the teacher–student framework for kernels. In this scheme, a teacher generates data according to a Gaussian random field, and a student learns them via kernel regression. With a simplifying assumption—namely that the data are sampled from a regular lattice—we derive analytically β for translation-invariant kernels, using previous results from the kriging literature. Provided that the student is not too sensitive to high frequencies, β depends only on the smoothness and dimension of the training data. We confirm numerically that these predictions hold when the training points are sampled at random on a hypersphere. Overall, the test error is found to be controlled by the magnitude of the projection of the true function on the kernel eigenvectors whose rank is larger than n. Using this idea we predict the exponent β from real data by performing kernel PCA, leading to β ≈ 0.36 for MNIST and β ≈ 0.07 for CIFAR10, in good agreement with observations. We argue that these rather large exponents are possible due to the small effective dimension of the data.


1. Introduction

In supervised learning machines learn from a finite collection of n training data, and their generalization error is then evaluated on unseen data drawn from the same distribution. How many data are needed to learn a task is characterized by the learning curve relating generalization error to n. In various cases, the generalization error decays as a power law $n^{-\beta }$, with an exponent β that depends on both the data and the algorithm. In [1] β is reported for state-of-the-art (SOTA) deep neural networks for various tasks: in neural machine translation β ≈ 0.3–0.36 (for fixed model size) or β ≈ 0.13 (for best-fit models at any n); language modeling shows β ≈ 0.06–0.09; in speech recognition β ≈ 0.3; SOTA models for image classification (on ImageNet) have exponents β ≈ 0.3–0.5. Currently there is no available theory of deep learning to rationalize these observations. Recently it was shown that for a proper initialization of the weights, deep learning in the infinite-width limit [2] converges to kernel learning. Moreover, it is nowadays part of the lore that there exist kernels whose performance is nearly comparable to that of deep networks [3, 4], at least for some tasks. It is thus of great interest to understand the learning curves of kernels. For regression, if the target function being learned is simply assumed to be Lipschitz, then the best guarantee is β = 1/d [5, 6], where d is the data dimension. Therefore, for large d, β is very small: learning is completely inefficient, a phenomenon referred to as the curse of dimensionality. As a result, various works on kernel regression make the much stronger assumption that the training points are sampled from a target function that belongs to the reproducing kernel Hilbert space (RKHS) of the kernel (for Gaussian random fields see [7]). With this assumption β does not depend on d (for instance in [8] β = 1/2 is guaranteed). However, RKHS membership is a very strong assumption, which requires the smoothness of the target function to increase with d [6] (for Gaussian random fields see appendix H), and which may not be realistic in large dimensions.

In section 3 we compute β empirically for kernel methods applied to the MNIST and CIFAR10 datasets. We find ${\beta }_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}\approx 0.4$ and ${\beta }_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}\approx 0.1$. Quite remarkably, we observe essentially the same exponents for regression and classification tasks, using either a Gaussian or a Laplace kernel. Thus, the exponents are not as small as 1/d (d = 784 for MNIST, d = 3072 for CIFAR10), but neither are they 1/2 as one would expect under the RKHS assumption. These facts call for frameworks in which assumptions on the smoothness of the data can be intermediate between Lipschitz and RKHS. Here we study such a framework for regression, in which the target function is assumed to be a Gaussian random field of zero mean with translation-invariant isotropic covariance ${K}_{\mathrm{T}}\left(\underline{x}\right)$. The data can equivalently be thought of as being synthesized by a 'teacher' kernel ${K}_{\mathrm{T}}\left(\underline{x}\right)$. Learning is performed with a 'student' kernel ${K}_{\mathrm{S}}\left(\underline{x}\right)$ that minimizes the mean-square error. In general ${K}_{\mathrm{T}}\left(\underline{x}\right)\ne {K}_{\mathrm{S}}\left(\underline{x}\right)$. In this set-up learning is very similar to a technique referred to as kriging, or Gaussian process regression, originally developed in the geostatistics community [9, 10].

To quantify learning, in section 4 we first perform numerical experiments for data points distributed uniformly at random on a hypersphere of varying dimension d, focusing on a Laplace kernel for the student, and considering a Laplace or Gaussian kernel for the teacher. We observe that in both cases β(d) is a decreasing function. In section 5, to derive β(d) we consider the simplified situation where the Gaussian random field is sampled at training points lying on a regular lattice. Building on the kriging literature [10], we show that β is controlled by the high-frequency scaling of both the teacher and student kernels: assuming that the Fourier transforms of the kernels decay as ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)={c}_{\mathrm{T}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)={c}_{\mathrm{S}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}\right)$, we obtain

Equation (1): $\beta =\frac{1}{d}\,\mathrm{min}\left({\alpha }_{\mathrm{T}}-d,\;2{\alpha }_{\mathrm{S}}\right).$

Importantly (i) equation (1) leads to a prediction for β(d) that accurately matches our numerical study for random training data points, leading to the conjecture that equation (1) holds in that case as well. We offer the following interpretation: ultimately, kernel methods are performing a local interpolation, the quality of which depends on the distance δ(n) between adjacent data points. δ(n) is asymptotically similar for random data or data sitting on a lattice. (ii) If the kernel KS is not too sensitive to high frequencies, then learning is optimal as far as scaling is concerned and $\beta =\left({\alpha }_{\mathrm{T}}-d\right)/d$. We will argue that the smoothness index $s\equiv \left({\alpha }_{\mathrm{T}}-d\right)/2$ characterizes the number of derivatives of the target function that are continuous. We thus recover the curse of dimensionality: s needs to be of order d to have non-vanishing β in large dimensions.

We show that in some regimes, the test error for Gaussian data is controlled by an exponent a describing how the coefficients of the true function in the eigenbasis of the kernel decay with rank. We estimate a via kernel principal component analysis (kernel PCA), based on diagonalizing the Gram matrix. This measure yields a prediction for the learning curve exponent β that matches the numerical fit, with ${\beta }_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}\approx 0.36$ and ${\beta }_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}\approx 0.07$. We show in appendix I, using the recent formalism of [11], which does not assume Gaussianity but makes more technical assumptions, that the result of our theorem, equation (1), is recovered, supporting further its validity for real data.

Finally, we discuss the following apparent paradox: β is significant for MNIST and CIFAR10, for which d is a priori very large, leading to a smoothness value s in the hundreds in both cases, which appears unrealistic. In section 7 the paradox is resolved by considering that real datasets actually live on lower-dimensional manifolds. As far as kernel learning is concerned, our findings support that the correct definition of dimension should be based on how the nearest-neighbors distance δ(n) scales with n: $\delta \left(n\right)\sim {n}^{-1/{d}_{\mathrm{e}\mathrm{f}\mathrm{f}}}$. Direct measurements of δ(n) support that MNIST and CIFAR10 live on manifolds of lower dimensions ${d}_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}^{\mathrm{e}\mathrm{ff}}\approx 15$ and ${d}_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}^{\mathrm{e}\mathrm{ff}}\approx 35$.

2. Related works

Part of the literature has investigated the problem of kernel regression from a different point of view, namely the optimal worst-case performance (see for instance [12–14]). The target function is not assumed to be generated by a Gaussian random field, but its regularity is controlled using a source condition that constrains the decay of its coefficients in the eigenbasis of the kernel. For uniform data distributions and isotropic kernels this is similar to controlling how the Fourier transform of the target function decays at high frequency. What we study in the present work is, on the contrary, the typical performance. Indeed, it turns out that both the worst-case and the typical learning curve decay as power laws, and the latter decays faster.

The teacher–student framework for kernel regression was previously introduced in [15, 16], where a formula for the learning curve was derived based on a few uncontrolled approximations. It is easy to show that their results match the predictions of our theorem, although in [16] the case where the performance is limited by the student (${\alpha }_{\mathrm{T}}-d{ >}2{\alpha }_{\mathrm{S}}$) is ignored. More recently, [11] generalized this approach using similar approximations and extended it to kernel regression applied to any target function (or ensemble thereof). Kernel PCA on MNIST was used to support that these approximations hold well on real data. An asymptotic scaling relation between β and a was obtained; again the presence of other regimes was not noted. By contrast we perform an exact calculation for the asymptotic behavior of Gaussian data on a lattice. In appendix I we show that the two approaches are consistent and lead to the same asymptotic predictions for β.

Our set-up of teacher–student learning with kernels is also referred to as kriging, or Gaussian process regression, and it was originally developed in the geostatistics community [9]. In section 5 we present our theorem, which gives the rate at which the test error decreases as the number of training points, assumed to lie on a high-dimensional regular lattice, is increased. Similar results have been previously derived in the kriging literature [10] when sampling occurs on the regular lattice with the exception of the origin, where the inference is made. Here we propose an alternative derivation that some readers might find simpler. We also study a slightly different problem: instead of computing the test error when the inference is carried out at the origin, we compute the average error for a test point sampled uniformly at random, not necessarily on the lattice. Then, in what follows we show, via extensive numerical simulations, that such predictions are accurate even when the training points do not lie on a regular lattice, but are taken at random on a hypersphere. An exact proof of our result in such a general setting is difficult and cannot be found even in the kriging literature. To our knowledge the results closest to this setting are those discussed in [17], where the author studies one-dimensional processes in which the training data are not necessarily evenly spaced.

In this work the effective dimension of the data plays an important role, as it controls how the distance between nearest neighbors scales with the dataset size. Of course, there exists a vast literature [18–24] devoted to the study of effective dimensions, where other definitions are analyzed. The effective dimensions that we find are compatible with those obtained with more refined methods.

3. Learning curve for kernel methods applied to real data

In what follows we apply kernel methods to the MNIST and CIFAR10 datasets, each consisting of a set of images ${\left({\underline{x}}_{\mu }\right)}_{\mu =1}^{n}$. We simplify the problem by considering only two classes whose labels $Z\left({\underline{x}}_{\mu }\right)={\pm}1$ correspond to odd and even numbers for MNIST, and to two groups of five classes in CIFAR10. The goal is to infer the value of the label ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)$ of an image $\underline{x}$ that does not belong to the dataset. The S subscript reminds us that inference is performed using a positive definite kernel KS. We perform inference in both a regression and a classification setting. The following algorithms and associated results can be found in [25].

Regression. Learning corresponds to minimizing a mean-square error:

Equation (2): $\frac{1}{n}{\sum }_{\mu =1}^{n}{\left({\hat{Z}}_{\mathrm{S}}\left({\underline{x}}_{\mu }\right)-Z\left({\underline{x}}_{\mu }\right)\right)}^{2}.$

For algorithms seeking solutions of the form ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)={\sum }_{\mu }{a}_{\mu }{K}_{\mathrm{S}}\left({\underline{x}}_{\mu },\underline{x}\right)\equiv \underline{a}\cdot {\underline{k}}_{\mathrm{S}}\left(\underline{x}\right)$ by minimizing the mean-square loss over the vector $\underline{a}$, one obtains:

Equation (3): $\underline{a}={\mathbb{K}}_{\mathrm{S}}^{-1}\,\underline{Z},\quad \text{i.e.}\quad {\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)={\underline{k}}_{\mathrm{S}}\left(\underline{x}\right)\cdot {\mathbb{K}}_{\mathrm{S}}^{-1}\,\underline{Z},$

where the vector $\underline{Z}$ contains all the labels in the training set, $\underline{Z}\equiv {\left(Z\left({\underline{x}}_{\mu }\right)\right)}_{\mu =1}^{n}$, and ${\mathbb{K}}_{\mathrm{S},\mu \nu }\equiv {K}_{\mathrm{S}}\left({\underline{x}}_{\mu },{\underline{x}}_{\nu }\right)$ is the Gram matrix. The Gram matrix is always invertible if the kernel KS is positive definite. The generalization error is then evaluated as the expected mean-square error on unseen data, estimated by averaging over a test set composed of ntest unseen data points:

Equation (4): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\approx \frac{1}{{n}_{\text{test}}}{\sum }_{i=1}^{{n}_{\text{test}}}{\left({\hat{Z}}_{\mathrm{S}}\left({\underline{x}}_{i}\right)-Z\left({\underline{x}}_{i}\right)\right)}^{2}.$
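
As an illustration of equations (2)–(4), the following is a minimal numpy sketch of this kernel regression with a Laplace kernel. The labels, the kernel width σ = 1000 and the dataset sizes are placeholders standing in for the binary MNIST/CIFAR10 tasks, not the exact settings used in the paper.

```python
import numpy as np
from scipy.spatial.distance import cdist

def laplace_kernel(X1, X2, sigma=1000.0):
    # K(x, x') = exp(-||x - x'|| / sigma), evaluated for all pairs of rows
    return np.exp(-cdist(X1, X2) / sigma)

def kernel_regression(X_train, Z_train, X_test, sigma=1000.0):
    # Equation (3): a = K_S^{-1} Z, then Z_hat(x) = a . k_S(x)
    K = laplace_kernel(X_train, X_train, sigma)         # Gram matrix K_S
    a = np.linalg.solve(K, Z_train)                     # coefficients a
    return laplace_kernel(X_test, X_train, sigma) @ a   # predictions at the test points

# Toy usage, with random vectors standing in for images and random +/-1 labels.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(300, 784)), rng.normal(size=(100, 784))
Z_tr, Z_te = rng.choice([-1.0, 1.0], 300), rng.choice([-1.0, 1.0], 100)
Z_hat = kernel_regression(X_tr, Z_tr, X_te)
print("test MSE:", np.mean((Z_hat - Z_te) ** 2))        # estimator of equation (4)
```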

Classification. We perform kernel classification via the soft-margin SVM algorithm. The details can be found in appendix A. After learning from the training data with a student kernel KS, performance is evaluated via the generalization error, estimated as the fraction of incorrectly predicted labels for data points belonging to a test set with ntest elements.

In figure 1 we present the learning curves for (binary) MNIST and CIFAR10, for regression and classification. Learning is performed both with a Gaussian kernel $K\left(\underline{x}\right)\propto \mathrm{exp}\left(-\vert \vert \underline{x}\vert {\vert }^{2}/\left(2{\sigma }^{2}\right)\right)$ and a Laplace one $K\left(\underline{x}\right)\propto \mathrm{exp}\left(-\vert \vert \underline{x}\vert \vert /\sigma \right)$. Remarkably, the power laws in the two tasks are essentially identical (although the estimated exponent appears to be slightly larger, in absolute value, for classification). Moreover, the two kernels display a very similar behavior, compatible with the same exponent: about −0.4 for MNIST and −0.1 for CIFAR10. The presented data are for σ = 1000; in appendix B we show that the same behavior is observed for different values.
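
The exponent itself is obtained by a straight-line fit of the learning curve in log–log scale over its last decade in n. A small sketch of such a fit is given below; the (n, error) pairs are made up for illustration and are not the measured MNIST or CIFAR10 values.

```python
import numpy as np

def fit_beta(ns, errors, last_decade_only=True):
    """Fit error ~ n^{-beta} by linear regression in log-log space."""
    ns, errors = np.asarray(ns, float), np.asarray(errors, float)
    if last_decade_only:
        keep = ns >= ns.max() / 10.0                    # keep only the last decade in n
        ns, errors = ns[keep], errors[keep]
    slope, _ = np.polyfit(np.log(ns), np.log(errors), 1)
    return -slope                                       # beta is minus the log-log slope

# Hypothetical learning-curve data decaying as n^{-0.4}.
ns = np.array([1000, 2000, 5000, 10000, 20000, 50000])
errors = 2.0 * ns ** -0.4
print(f"fitted beta = {fit_beta(ns, errors):.2f}")      # ~0.4 by construction
```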


Figure 1. Learning curves for regression on MNIST and CIFAR10 (top row); and for classification on MNIST and CIFAR10 (bottom row). Curves are averaged over 400 runs. A power law is plotted to estimate the asymptotic behavior at large n: the exponent is fitted on the last decade on the average of the two curves, since it does not seem to depend significantly on the specific kernel or on the task. In each setting we use both a Gaussian kernel $K\left(\underline{x}\right)\propto \mathrm{exp}\left(-\vert \vert \underline{x}\vert {\vert }^{2}/\left(2{\sigma }^{2}\right)\right)$ and a Laplace one $K\left(\underline{x}\right)\propto \mathrm{exp}\left(-\vert \vert \underline{x}\vert \vert /\sigma \right)$, with σ = 1000.


4. Generalization scaling in kernel teacher–student problems

We study β in a simplified setting where the data is assumed to follow a Gaussian distribution with known covariance. It falls into the class of teacher–student problems, which are characterized by a machine (the teacher) that generates the data, and another machine (the student) that tries to learn from them. The teacher–student paradigm has been broadly used to study supervised learning [15, 16, 26–33]. Here we restrict our attention to kernel methods: we assume that a target function is distributed according to a Gaussian random field $Z\sim \mathcal{N}\left(0,{K}_{\mathrm{T}}\right)$—the teacher—characterized by a translation-invariant isotropic covariance function ${K}_{\mathrm{T}}\left(\underline{x},{\underline{x}}^{\prime }\right)={K}_{\mathrm{T}}\left(\vert \vert \underline{x}-{\underline{x}}^{\prime }\vert \vert \right)$, and that the training dataset consists of the finite set of n observations $\underline{Z}={\left(Z\left({\underline{x}}_{\mu }\right)\right)}_{\mu =1}^{n}$. This is equivalent to saying that the vector of training points follows a centered Gaussian distribution with a covariance matrix that depends on KT and on the location of the points ${\left({\underline{x}}_{\mu }\right)}_{\mu =1}^{n}$:

Equation (5): $\underline{Z}\sim \mathcal{N}\left(\underline{0},{\mathbb{K}}_{\mathrm{T}}\right),\quad {\mathbb{K}}_{\mathrm{T},\mu \nu }\equiv {K}_{\mathrm{T}}\left({\underline{x}}_{\mu },{\underline{x}}_{\nu }\right).$

Once the teacher has generated the dataset, the rest follows as in the kernel regression described in the previous section. We use another translation-invariant isotropic kernel ${K}_{\mathrm{S}}\left(\underline{x},{\underline{x}}^{\prime }\right)$—the student—to infer the value of the field at another point, ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)$, with a regression task, i.e. minimizing the mean-square error in equation (2). The solution is therefore given again by equation (3).

Figures 2(a) and (b) show the mean-square error obtained numerically. In the examples the student is always taken to be a Laplace kernel, and the teacher is either a Laplace kernel or a Gaussian kernel. The points ${\left({\underline{x}}_{\mu }\right)}_{\mu =1}^{n}$ are taken uniformly at random on the unit d-dimensional hypersphere for several dimensions d and for several dataset sizes n. We take σS = σT = d as we observed that with this choice smaller datasets were enough to approach a limiting curve—in appendix C we show the plots for the case σS = σT = 10, which appears to converge to the same limit curve with increasing n, but more slowly. The figure shows that when n is large enough, the mean-square error behaves as a power law (dashed lines) with an exponent that depends on the spatial dimension of the data, as well as on the kernels. The fitted exponents are plotted in figures 2(c) and (d) as a function of the spatial dimension d for different dataset sizes n. In the next section we will discuss the theoretical prediction, which in the figure is plotted as a thick black line. The figure shows that as the dataset gets bigger, the asymptotic exponent tends to our prediction. In appendix D we present the learning curves of Gaussian students with both a Laplace and a Gaussian teacher. When both kernels are Gaussian the test error decays exponentially fast, a result that matches our theoretical prediction. In appendix E we also provide further numerical results for the case where the teacher kernel is a Matérn kernel (as defined therein).
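
A sketch of this experiment is given below: draw points uniformly on the hypersphere, sample the labels $\underline{Z}$ from the teacher field through the Cholesky factor of its Gram matrix, and regress with the student kernel. Here both the teacher and the student are Laplace kernels with σT = σS = d (the case β = 1/d); the dataset sizes and the jitter added for numerical stability are illustrative choices, and swapping in a Gaussian teacher kernel reproduces the other curves.

```python
import numpy as np
from scipy.spatial.distance import cdist

def sphere_points(n, d, rng):
    # n points uniformly distributed on the unit hypersphere in R^d
    x = rng.normal(size=(n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def laplace_kernel(X1, X2, sigma):
    return np.exp(-cdist(X1, X2) / sigma)

def teacher_student_mse(n, n_test, d, rng):
    sigma = d                                           # sigma_T = sigma_S = d, as in this section
    X = sphere_points(n + n_test, d, rng)
    K_T = laplace_kernel(X, X, sigma)                   # teacher Gram matrix (Laplace teacher here)
    # Sample Z ~ N(0, K_T); a small jitter keeps the Cholesky factorization stable.
    Z = np.linalg.cholesky(K_T + 1e-10 * np.eye(n + n_test)) @ rng.normal(size=n + n_test)
    X_tr, X_te, Z_tr, Z_te = X[:n], X[n:], Z[:n], Z[n:]
    K_S = laplace_kernel(X_tr, X_tr, sigma)             # student: Laplace kernel
    Z_hat = laplace_kernel(X_te, X_tr, sigma) @ np.linalg.solve(K_S, Z_tr)
    return np.mean((Z_hat - Z_te) ** 2)

rng = np.random.default_rng(0)
for n in (200, 400, 800, 1600):                         # the MSE should decay roughly as n^{-1/d}
    print(n, teacher_student_mse(n, n_test=200, d=5, rng=rng))
```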


Figure 2. Results for the teacher–student kernel regression problem, where the student is always a Laplace kernel. Data points are sampled uniformly at random on a d-dimensional hypersphere. (Top) Mean-square error versus the size of the training dataset, for Gaussian and Laplace teachers and for multiple spatial dimensions. Dotted lines are the fitted power laws—we fit starting from n = 700. (Bottom) Fitted exponent $-\beta =\mathrm{log}\enspace \mathbb{E}\enspace \mathrm{M}\mathrm{S}\mathrm{E}/\mathrm{log}\enspace n$ against the spatial dimension, for several dataset sizes. We fit from n = 0 to a varying n (written in the legends). The thick black lines are the theoretical predictions.


5. Analytic asymptotics for the kernel teacher–student problem on a lattice

In this section we compute analytically the exponent that describes the asymptotic decay of the generalization error when the number n of training data increases. In order to derive the result we assume that the teacher Gaussian random field lives on a bounded hypercube, $\underline{x}\in \mathcal{V}\equiv {\left[0,L\right]}^{d}$, where L is a constant and d is the spatial dimension. The fields and the kernels can then be thought of as L-periodic along each dimension. Furthermore, to make the problem tractable we assume that the points ${\left({\underline{x}}_{\mu }\right)}_{\mu =1}^{n}$ live on a regular lattice, covering the whole hypercube $\mathcal{V}$. Therefore, the linear spacing between neighboring points is $\delta =L{n}^{-1/d}$. This is a different setting from the one used in the numerical simulations of the previous section, in which the training points are sampled at random on a hypersphere; the agreement between the two indicates that our results below are robust to such differences.

Generalization error is then evaluated via the typical mean-square error

Equation (6): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\equiv \mathbb{E}\,{\left({\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)-Z\left(\underline{x}\right)\right)}^{2},$

where the expectation is taken over both the teacher process and the point $\underline{x}$ at which we estimate the field, assumed to be uniformly distributed in the hypercube $\mathcal{V}$. In appendix F we prove the following:

Theorem 1. Let ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)={c}_{\mathrm{T}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)={c}_{\mathrm{S}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}\right)$ as $\vert \vert \underline{w}\vert \vert \to \infty $, where ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)$ are the Fourier transforms of the kernels ${K}_{\mathrm{T}}\left(\underline{x}\right)$, ${K}_{\mathrm{S}}\left(\underline{x}\right)$ respectively, assumed to be positive definite. We assume that ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)$ have a finite limit as $\vert \vert \underline{w}\vert \vert \to 0$ and that $K\left(\underline{0}\right){< }\infty $. Then,

Equation (7): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\sim {n}^{-\beta }\quad \text{with}\quad \beta =\frac{1}{d}\,\mathrm{min}\left({\alpha }_{\mathrm{T}}-d,\;2{\alpha }_{\mathrm{S}}\right).$

Moreover, in the case of a Gaussian kernel the result holds valid if we take the corresponding exponent to be α = ∞.

Apart from the specific value of the exponent in equation (7), theorem 1 implies that if the student kernel decays fast enough in the frequency domain, then β depends only on the data through the behavior of the teacher kernel at high frequencies. One then recovers $\beta =\left({\alpha }_{\mathrm{T}}-d\right)/d$, also found for the Bayes-optimal setting where the student is identical to the teacher.

Consider the predictions of theorem 1 in the cases presented in figures 2(a) and (b) of Gaussian and Laplace kernels. If both kernels are Laplace kernels then αT = αS = d + 1 and $\mathbb{E}\enspace \mathrm{M}\mathrm{S}\mathrm{E}\sim {n}^{-1/d}$, which scales very slowly with the dataset size in large dimensions. If the teacher is a Gaussian kernel (αT = ∞) and the student is a Laplace kernel then β = 2(1 + 1/d), leading to β → 2 as d → ∞. In figures 2(c) and (d) we compare these predictions with the exponents extracted from figures 2(a) and (b). We plot $\mathrm{log}\enspace \mathbb{E}\enspace \mathrm{M}\mathrm{S}\mathrm{E}/\mathrm{log}\enspace n\equiv -\beta $ against the dimension d of the data, varying the dataset size n. The exponents extracted numerically tend to our analytical predictions when n is large enough.
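
The prediction of theorem 1 can be packaged as a one-line helper; the examples below reproduce the two special cases just discussed. The function name is ours, not the paper's.

```python
import math

def predicted_beta(alpha_T, alpha_S, d):
    # Theorem 1: beta = min(alpha_T - d, 2 * alpha_S) / d
    return min(alpha_T - d, 2 * alpha_S) / d

d = 10
print(predicted_beta(d + 1, d + 1, d))       # Laplace teacher and student: beta = 1/d
print(predicted_beta(math.inf, d + 1, d))    # Gaussian teacher, Laplace student: beta = 2(1 + 1/d)
```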

Notice that, although the theory and the experiments do not assume the same distribution for the sampling points ${\left({\underline{x}}_{\mu }\right)}_{\mu =1}^{n}$, this does not seem to yield any difference in the asymptotic behavior of the generalization error, leading to the conjecture that our predictions are exact even when the training set is random and does not correspond to a lattice. The conjecture can be proven in one dimension following results of the kriging literature [17], but generalization to higher d is a much harder problem. Intuitively, kernel learning performs a local interpolation, the quality of which is governed by the target function smoothness and the typical distance δmin between a point and its nearest neighbors in the training set. Both for random points and for points on a lattice, one has ${\delta }_{\mathrm{min}}\sim {n}^{-1/d}$ when n is large enough, thus both situations lead to the same β. This is shown in figure 4 (left).

Theorem 1 underlines that kernel methods are subject to the curse of dimensionality. Indeed, for appropriate students, one obtains $\beta =\left({\alpha }_{\mathrm{T}}-d\right)/d$. Let us define the smoothness index $s\equiv \left[\left({\alpha }_{\mathrm{T}}-d\right)/2\right]=\beta d/2$, which must be $\mathcal{O}\left(d\right)$ to avoid β → 0 for large d. The two lemmas below, derived in the appendix, indicate that the target function is s times differentiable (in a mean-square sense). Thus learning with kernels in very large dimension can only occur if the target function is $\mathcal{O}\left(d\right)$ times differentiable, a condition that appears very restrictive in large d.

Lemma 1. Let $K\left(\underline{x},{\underline{x}}^{\prime }\right)$ be a translation-invariant isotropic kernel such that $\tilde {K}\left(\underline{w}\right)=c\vert \vert \underline{w}\vert {\vert }^{-\alpha }+o\left(\vert \vert \underline{w}\vert {\vert }^{-\alpha }\right)$ as $\vert \vert \underline{w}\vert \vert \to \infty $ and $\vert \vert \underline{w}\vert {\vert }^{d}\tilde {K}\left(\underline{w}\right)\to 0$ as $\vert \vert \underline{w}\vert \vert \to 0$. If α > d + n for some $n\in {\mathbb{Z}}^{+}$, then $K\left(\underline{x}\right)\in {C}^{n}$, that is, it is at least n-times differentiable. (Proof in appendix G).

Lemma 2. Let $Z\sim \mathcal{N}\left(0,K\right)$ be a d-dimensional Gaussian random field, with KC2n being a 2n-times differentiable kernel. Then Z is n-times mean-square differentiable in the sense that

  • Derivatives of $Z\left(\underline{x}\right)$ are Gaussian random fields;
  • $\mathbb{E}{\partial }_{{x}_{1}}^{{n}_{1}}\dots {\partial }_{{x}_{d}}^{{n}_{d}}Z\left(\underline{x}\right)=0$;
  • $\mathbb{E}{\partial }_{{x}_{1}}^{{n}_{1}}\dots {\partial }_{{x}_{d}}^{{n}_{d}}Z\left(\underline{x}\right)\cdot {\partial }_{{x}_{1}}^{{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}^{\prime }}Z\left({\underline{x}}^{\prime }\right)={\partial }_{{x}_{1}}^{{n}_{1}+{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}+{n}_{d}^{\prime }}K\left(\underline{x}-{\underline{x}}^{\prime }\right){< }\infty $ if the derivatives of K exist.

In particular, $\mathbb{E}{\partial }_{{x}_{i}}^{m}Z\left(\underline{x}\right)\cdot {\partial }_{{x}_{i}}^{m}Z\left({\underline{x}}^{\prime }\right)={\partial }_{{x}_{i}}^{2m}K\left(\underline{x}-{\underline{x}}^{\prime }\right){< }\infty \quad \forall \enspace m{\leqslant}n$. (Proof in appendix G).

Interpretation of theorem 1: when the student does not limit performance, i.e. when $2{\alpha }_{\mathrm{S}}{ >}{\alpha }_{\mathrm{T}}-d$ and $\beta =\frac{{\alpha }_{\mathrm{T}}-d}{d}$, we can interpret the result as follows. An isotropic student kernel corresponds to a Gaussian prior on the Fourier coefficients of the target function being learned. The student puts large (low) power at low (high) frequencies, and it can then reconstruct a number of order n of the largest Fourier coefficients, which corresponds to frequencies $\underline{w}$ of norm $\vert \vert \underline{w}\vert \vert {\leqslant}1/\delta \sim {n}^{1/d}$. Fourier coefficients $\tilde {Z}\left(\underline{w}\right)$ at higher frequencies cannot be learned, and the mean square error is then simply of order of the sum of the squares of these coefficients:

Equation (8): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\sim {\sum }_{\vert \vert \underline{w}\vert \vert \gtrsim {n}^{1/d}}\mathbb{E}\,\vert \tilde {Z}\left(\underline{w}\right){\vert }^{2}\sim {n}^{-\left({\alpha }_{\mathrm{T}}-d\right)/d}.$
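
As a sketch of how this scaling arises (not the exact computation of appendix F): dropping constants and approximating the sum over discrete frequencies by an integral, with ${w}^{d-1}$ the volume factor of the spherical shell of radius w,

$\enspace {\sum }_{\vert \vert \underline{w}\vert \vert \gtrsim {n}^{1/d}}\mathbb{E}\,\vert \tilde {Z}\left(\underline{w}\right){\vert }^{2}\propto {\sum }_{\vert \vert \underline{w}\vert \vert \gtrsim {n}^{1/d}}{\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)\approx {\int }_{{n}^{1/d}}^{\infty }\mathrm{d}w\enspace {c}_{\mathrm{T}}\,{w}^{d-1-{\alpha }_{\mathrm{T}}}\propto {n}^{-\left({\alpha }_{\mathrm{T}}-d\right)/d},$

where the integral converges at its upper end because ${\alpha }_{\mathrm{T}}{ >}d$.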

6. Learning curve exponent of real data

Equation (8) is not readily applicable to real data, which are neither Gaussian nor uniformly distributed. However, it supports the following broader result: kernel methods can predict well the first (of order n) coefficients of the true function in the eigenbasis of the kernel, but not the following ones. For any student kernel KS, let ${\lambda }_{1}{\geqslant}\cdots {\geqslant}{\lambda }_{\rho }{\geqslant}\cdots $ be its eigenvalues (positive and real, because of symmetry and positive definiteness) and ${\phi }_{\rho }\left(\underline{x}\right)$ the associated eigenfunctions:

Equation (9): $\int \mathrm{d}{\underline{x}}^{\prime }\enspace p\left({\underline{x}}^{\prime }\right)\,{K}_{\mathrm{S}}\left(\underline{x},{\underline{x}}^{\prime }\right)\,{\phi }_{\rho }\left({\underline{x}}^{\prime }\right)={\lambda }_{\rho }\,{\phi }_{\rho }\left(\underline{x}\right),$

where $p\left(\underline{x}\right)$ is the density of the data points. Then the kernel can be decomposed in its eigenmodes:

Equation (10): ${K}_{\mathrm{S}}\left(\underline{x},{\underline{x}}^{\prime }\right)={\sum }_{\rho }{\lambda }_{\rho }\,{\phi }_{\rho }\left(\underline{x}\right)\,{\phi }_{\rho }\left({\underline{x}}^{\prime }\right).$

The eigenfunctions of K form a complete basis, and we can write any function $Z\left(\underline{x}\right)$ as

Equation (11): $Z\left(\underline{x}\right)={\sum }_{\rho }{q}_{\rho }\,{\phi }_{\rho }\left(\underline{x}\right).$

The generalization of our result then simply reads:

Equation (12): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\sim {\sum }_{\rho { >}n}{q}_{\rho }^{2}.$

Assuming a power-law behavior ${q}_{\rho }^{2}\sim {\rho }^{-a}$ then leads to β = a − 1.

To extract the exponent a and test this prediction for real data, we first approximate the eigenvalue equation (9) for the student kernel with the diagonalization of its finite-dimensional Gram matrix ${\mathbb{K}}_{\mathrm{S}}$ computed on a large dataset of size $\tilde {n}$:

Equation (13): $\frac{1}{\tilde {n}}\,{\mathbb{K}}_{\mathrm{S}}\,{\underline{\phi }}_{\rho }={\lambda }_{\rho }\,{\underline{\phi }}_{\rho },$

where now we have $\tilde {n}$ eigenvalues ${\lambda }_{1}{\geqslant}\cdots {\geqslant}{\lambda }_{\tilde {n}}$ and the ${\underline{\phi }}_{\rho }$ are $\tilde {n}$-dimensional eigenvectors. Computing this diagonalization for a given training set is referred to as (uncentered) kernel PCA. This procedure is a discretized version of equation (10) and yields only an approximation to the largest $\tilde {n}$ eigenvalues of the kernel, which are exactly recovered as $\tilde {n}\to \infty $ (in figure J1 in appendix J we show that the eigenvalues of the Gram matrix converge when $\tilde {n}$ increases, and that their density displays the power-law behavior that one can extract from the kernel operator with a uniform distribution $p\left(\underline{x}\right)$). The coefficients qρ are then estimated by the scalar products $\left(\underline{Z}\cdot {\underline{\phi }}_{\rho }\right)$, where $\underline{Z}=\left(Z\left({\underline{x}}_{1}\right),\dots ,Z\left({\underline{x}}_{\tilde {n}}\right)\right)$ is the vector of the target function's values on the training set.

Finally, we approximate equation (12) as:

Equation (14): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\sim {\sum }_{\rho =n}^{\tilde {n}}{\left(\underline{Z}\cdot {\underline{\phi }}_{\rho }\right)}^{2}.$

This quantity is plotted in figure 3, where we show that it correlates remarkably well with the true learning curve. Fitting these cumulative curves, whose exponent is 1 − a for asymptotically large $\tilde {n}$ (we plot several curves for growing $\tilde {n}$), we extract the exponents ${a}_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}=1.36$, leading to ${\hat{\beta }}_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}\approx 0.36$, and ${a}_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}=1.07$, leading to ${\hat{\beta }}_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}\approx 0.07$, which are very close to the exponents that we measured in section 3.
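
A sketch of this procedure is given below: diagonalize the Gram matrix of the student kernel on $\tilde {n}$ points, project the labels on its eigenvectors sorted by decreasing eigenvalue, and accumulate the squared projections beyond rank n as in equation (14). The Laplace kernel, the width σ = 1000 and the random placeholder data (which would be replaced by the actual images and their ±1 labels) are assumptions of the sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist

def projection_tails(X, Z, sigma=1000.0):
    """For each rank n, the sum of squared projections of Z on the Gram eigenvectors of rank > n."""
    K = np.exp(-cdist(X, X) / sigma)                     # Laplace student kernel, Gram matrix
    eigvals, eigvecs = np.linalg.eigh(K)                 # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]                    # reorder modes by decreasing eigenvalue
    q = eigvecs[:, order].T @ Z                          # projections q_rho = Z . phi_rho
    return np.cumsum((q ** 2)[::-1])[::-1]               # tail sums, the estimator of equation (14)

rng = np.random.default_rng(0)
X, Z = rng.normal(size=(2000, 784)), rng.choice([-1.0, 1.0], 2000)   # stand-ins for real data
tail = projection_tails(X, Z)
ns = np.arange(1, len(tail) + 1)
slope, _ = np.polyfit(np.log(ns[10:1000]), np.log(tail[10:1000]), 1) # tail ~ n^{1 - a}
a = 1 - slope
print(f"estimated a = {a:.2f}, predicted beta = {a - 1:.2f}")
```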


Figure 3. Several measures of the learning curves for MNIST (left) and CIFAR10 (right). In every plot, the gray solid line is the numerical evaluation of the generalization error (shifted for clarity). Colored lines are computed using equation (14) for several values of $\tilde {n}$, and the dashed black line is a fit of the power-law decay with which we extract the predicted exponents ${\hat{\beta }}_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}\approx 0.36$ and ${\hat{\beta }}_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}\approx 0.07$.


Support for the genericity of equation (12) can be obtained from the recent paper [11], where the authors derived a formula for the generalization error based on the decomposition of the target function on the eigenbasis of the kernel. The formula is derived with uncontrolled approximations, but applies to a generic target function (or ensembles thereof) and a generic data point distribution $p\left(\underline{x}\right)$, and matches well their numerical experiments. In appendix I we show that the asymptotic limit (large n) of their formula yields equation (12). Furthermore, equation (7) is recovered if a power-law decay of the coefficient of the true function in the eigenbasis of the kernel is assumed, thus generalizing our result to non-Gaussian data.

7. Effective dimension of real data

Both our predictions and empirical observations support rather large values of β. From a Gaussian random field point of view, this is surprising: the exponent β avoids the curse of dimensionality only if the smoothness of the teacher is of the order of the data dimension. If these observations were to hold true also for real data, they would seem to imply that the target functions for MNIST and CIFAR10 must be hundreds or thousands of times differentiable. However, there is a simple catch: real data actually live on a manifold of much lower dimensionality. Above we have argued that the quantity that governs the asymptotic learning curve is the typical distance δmin between neighboring points in the training set. A simple way to measure the effective dimension deff of real data therefore consists of plotting the (asymptotic) dependence of δmin on the number of points n in a random subset of the dataset, and fitting

Equation (15): ${\delta }_{\mathrm{min}}\left(n\right)\sim {n}^{-1/{d}_{\mathrm{e}\mathrm{f}\mathrm{f}}}.$

In figure 4 (right) we show that for MNIST and CIFAR10 there is indeed a power-law relation linking δmin to n, and that the effective dimension extracted this way is much smaller than the embedding dimension of the datasets:

Equation (16): ${d}_{\mathrm{M}\mathrm{N}\mathrm{I}\mathrm{S}\mathrm{T}}^{\mathrm{e}\mathrm{ff}}\approx 15,$

Equation (17): ${d}_{\mathrm{C}\mathrm{I}\mathrm{F}\mathrm{A}\mathrm{R}\mathrm{10}}^{\mathrm{e}\mathrm{ff}}\approx 35.$


Figure 4. Average distance from one point to its nearest neighbor as a function of the dataset size n. (Left) For random points on a d-dimensional hypersphere, $\left\langle {\delta }_{\mathrm{min}}\right\rangle \sim {n}^{-1/d}$. Colored solid curves are found numerically, dashed lines are the theoretical asymptotic prediction, and the gray lines are numerical fits (we fitted only starting from n ≈ 6000 to reduce finite-size effects, and the fit has been rescaled to match the data at n = 10). The larger d, the stronger the preasymptotic effects (a larger n is needed to observe the predicted scaling). (Right) Comparison between random data on 15- and 35-dimensional hyperspheres and the MNIST and CIFAR10 datasets. According to this definition of effective dimension, MNIST lives on a 15-dimensional manifold and CIFAR10 on a 35-dimensional one. Data have been rescaled along the y-axis for ease of comparison.


This measure is consistent with previous extrapolations of the intrinsic dimension of MNIST [19, 20, 22, 23].
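
A sketch of this measurement: subsample the dataset at increasing sizes, average the nearest-neighbor distance, and fit the log–log slope as in equation (15). The subset sizes and the use of scikit-learn's NearestNeighbors are incidental choices of the sketch; X would be the flattened MNIST or CIFAR10 images.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def effective_dimension(X, sizes=(500, 1000, 2000, 4000, 8000), seed=0):
    """Estimate d_eff from <delta_min>(n) ~ n^{-1/d_eff} on random subsets of X."""
    rng = np.random.default_rng(seed)
    mean_nn = []
    for n in sizes:
        idx = rng.choice(len(X), size=n, replace=False)
        dists, _ = NearestNeighbors(n_neighbors=2).fit(X[idx]).kneighbors(X[idx])
        mean_nn.append(dists[:, 1].mean())      # column 0 is the distance of each point to itself
    slope, _ = np.polyfit(np.log(sizes), np.log(mean_nn), 1)
    return -1.0 / slope

# Placeholder usage on synthetic ~15-dimensional data; preasymptotic effects can bias the estimate.
X = np.random.default_rng(1).normal(size=(10000, 15))
print(f"d_eff ~ {effective_dimension(X):.1f}")
```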

8. Conclusion

In this work we have shown that, for MNIST and CIFAR10, kernel regression and classification display a power-law decay of the learning curves, quite remarkably with essentially the same exponent β regardless of task and kernel—a fact yet to be explained. These exponents are much larger than the β = 1/d expected for Lipschitz target functions and smaller than the β = 1/2 expected for RKHS target functions.

This observation led us to study a teacher–student framework for regression in which data are modeled as Gaussian random fields of varying smoothness, and in which intermediate values of β are obtained. We find two regimes depending on the respective smoothness of the teacher and student kernels. If the student is smooth enough—i.e. it puts a sufficiently low prior on high-frequency components—then β is entirely controlled by the teacher. We obtain that the smoothness index must scale with the dimension for β not to vanish as d → ∞, recovering the curse of dimensionality.

In our calculations, the dimension enters as the parameter relating the number of points to the nearest-neighbor distance $\delta \sim {n}^{-1/d}$. Thus, in practice the parameter d considered should be the effective dimension deff of the data, which is much smaller than the number of pixels for MNIST and CIFAR data. It explains why β is not very small in these cases.

Finally, for Gaussian fields our result is equivalent to the statement that β is governed by the power of the true function past the first ∼n eigenvectors of the kernel. We test this more general idea both for CIFAR and MNIST and find that it correctly predicts β. Understanding what controls this power in a general setting (which includes the effective dimension of the data and presumably a generalized quantity characterizing smoothness) thus appears necessary to understand how many data are required to learn a task.

Acknowledgments

We acknowledge G Biroli, C Hongler and F Gabriel for the discussions that stimulated this work, and we thank S d'Ascoli, A Jacot, C Pehlevan, L Sagun and M L Stein for discussions. This work was partially supported by the grant from the Simons Foundation (#454953 Matthieu Wyart). M W thanks the Swiss National Science Foundation for support under Grant No. 200021-165509.

Appendix A.: Soft-margin support vector machines

The kernel classification task is performed via the algorithm known as soft-margin support vector machine.

We want to find a function ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)$ such that its sign correctly predicts the label of the data. In this context we model such a function as a linear prediction after projecting the data on a feature space via $\underline{x}\to \phi \left(\underline{x}\right)$:

Equation (18): ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)=\underline{w}\cdot {\underline{\phi }}_{\mathrm{S}}\left(\underline{x}\right)+b,$

where $\underline{w},b$ are parameters to be learned from the training data. The kernel is related to the feature space via ${K}_{\mathrm{S}}\left(\underline{x},{\underline{x}}^{\prime }\right)={\underline{\phi }}_{\mathrm{S}}\left(\underline{x}\right)\cdot {\underline{\phi }}_{\mathrm{S}}\left({\underline{x}}^{\prime }\right)$. We require that $Z\left({\underline{x}}_{\mu }\right){\hat{Z}}_{\mathrm{S}}\left({\underline{x}}_{\mu }\right){ >}1-{\xi }_{\mu }$ for all training points. Ideally we would like all margins to equal 1 (i.e. ξμ = 0), but we allow some of them to be smaller by introducing the slack variables ξμ and penalizing large values. To achieve this the following constrained minimization is performed:

Equation (19): $\underset{\underline{w},b,\underline{\xi }}{\mathrm{min}}\enspace \frac{1}{2}\vert \vert \underline{w}\vert {\vert }^{2}+C{\sum }_{\mu =1}^{n}{\xi }_{\mu }\quad \text{subject to}\quad Z\left({\underline{x}}_{\mu }\right){\hat{Z}}_{\mathrm{S}}\left({\underline{x}}_{\mu }\right){\geqslant}1-{\xi }_{\mu },\enspace {\xi }_{\mu }{\geqslant}0.$

This problem can be expressed in a dual formulation as

Equation (20): $\underset{\underline{a}}{\mathrm{min}}\enspace \frac{1}{2}\,\underline{a}\cdot {\mathbb{Q}}_{\mathrm{S}}\,\underline{a}-{\sum }_{\mu }{a}_{\mu }\quad \text{subject to}\quad 0{\leqslant}{a}_{\mu }{\leqslant}C,\enspace \underline{a}\cdot \underline{Z}=0,$

where ${\mathbb{Q}}_{\mathrm{S},\mu \nu }=Z\left({\underline{x}}_{\mu }\right)Z\left({\underline{x}}_{\nu }\right){K}_{\mathrm{S}}\left({\underline{x}}_{\mu },{\underline{x}}_{\nu }\right)$ and $\underline{Z}$ is the vector of the labels of the training points. Here C ($=10^{4}$ in our simulations) controls the trade-off between minimizing the training error and maximizing the margins 1 − ξμ . For the details we refer to [25]. If ${\underline{a}}^{\star }$ is the solution to the minimization problem, then

Equation (21): $\underline{w}={\sum }_{\mu }{a}_{\mu }^{\star }\,Z\left({\underline{x}}_{\mu }\right)\,{\underline{\phi }}_{\mathrm{S}}\left({\underline{x}}_{\mu }\right),$

Equation (22): $b=Z\left({\underline{x}}_{\nu }\right)-\underline{w}\cdot {\underline{\phi }}_{\mathrm{S}}\left({\underline{x}}_{\nu }\right)\quad \text{for any}\enspace \nu \enspace \text{such that}\enspace 0{< }{a}_{\nu }^{\star }{< }C.$

The predicted label for unseen data points is then

Equation (23): $\mathrm{sign}\left({\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)\right)=\mathrm{sign}\left({\sum }_{\mu }{a}_{\mu }^{\star }\,Z\left({\underline{x}}_{\mu }\right)\,{K}_{\mathrm{S}}\left({\underline{x}}_{\mu },\underline{x}\right)+b\right).$

The generalization error is now defined as the probability that an unseen image has a predicted label different from the true one, and such a probability is again estimated as an average over a test set with ntest elements:

Equation (24): $\frac{1}{{n}_{\text{test}}}{\sum }_{i=1}^{{n}_{\text{test}}}\mathbb{1}\left[\mathrm{sign}\left({\hat{Z}}_{\mathrm{S}}\left({\underline{x}}_{i}\right)\right)\ne Z\left({\underline{x}}_{i}\right)\right].$
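
In practice this soft-margin SVM can be run with an off-the-shelf solver on a precomputed Gram matrix. Below is a sketch using scikit-learn's SVC with C = 10^4 as in the text; the Laplace kernel width and the random placeholder data are assumptions of the sketch.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

def laplace_gram(X1, X2, sigma=1000.0):
    return np.exp(-cdist(X1, X2) / sigma)

def svm_generalization_error(X_train, Z_train, X_test, Z_test, C=1e4, sigma=1000.0):
    clf = SVC(C=C, kernel='precomputed')            # soft-margin SVM on a precomputed Gram matrix
    clf.fit(laplace_gram(X_train, X_train, sigma), Z_train)
    Z_hat = clf.predict(laplace_gram(X_test, X_train, sigma))
    return np.mean(Z_hat != Z_test)                 # equation (24): fraction of misclassified points

# Toy usage with random vectors in place of the binary MNIST/CIFAR10 tasks.
rng = np.random.default_rng(0)
X_tr, X_te = rng.normal(size=(500, 784)), rng.normal(size=(200, 784))
Z_tr, Z_te = rng.choice([-1, 1], 500), rng.choice([-1, 1], 200)
print("test error:", svm_generalization_error(X_tr, Z_tr, X_te, Z_te))
```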

Appendix B.: Different kernel variances

In figure B1 we show the learning curves for kernel regression on the MNIST (parity) dataset—the same setting as in figure 1(a). Several Laplace kernels of varying variance σ are used. The variance ranges over several orders of magnitude and the learning curves all decay with the same exponent, although for σ = 10 the algorithm achieves suboptimal performance and the test errors are increased by some factor.


Figure B1. Learning curves for kernel regression on the MNIST dataset. Regression is performed with several Laplace kernels of varying variance σ ranging from σ = 10 to σ = 10 000.


Appendix C.: Different choice of kernel variances

In figure C1 we show the learning curves for the teacher–student kernel regression problem, with a student kernel that is always Laplace and a teacher that can be either Gaussian or Laplace. We show how the test error decays with the size of the training dataset and how the asymptotic exponent depends on the spatial dimension. Every experiment is run with two different choices of the kernel variances: in one case σT = σS = d and in the other σT = σS = 10. We observed that scaling the variances with the spatial dimension leads more quickly to the results that we predicted in this paper, but overall the choice has little effect on the exponents (both tend towards the prediction as the dataset size is increased).


Figure C1. In these plots we show the results for the teacher–student kernel regression. The student is always a Laplace kernel; the teacher is either Gaussian or Laplace. The four plots on the left depict the mean-square error against the size of the dataset for different spatial dimensions of the data; those on the right show the fitted asymptotic exponent against the spatial dimension for different dataset sizes. For every case we show both the results for σT = σS = d and σT = σS = 10.


Appendix D.: Gaussian students

In this appendix we present the learning curves of Gaussian students: the Fourier transform of these kernels decays faster than any power law and one can effectively consider αS = ∞. If the teacher is Laplace (αT = d + 1) then the predicted exponent is finite and takes the value $\beta =\frac{1}{d}\enspace \mathrm{min}\left({\alpha }_{\mathrm{T}}-d,2{\alpha }_{\mathrm{S}}\right)=\frac{1}{d}\enspace \mathrm{min}\left(1,\infty \right)=\frac{1}{d}$. Such a case is displayed in figure D1 (left) in dimension d = 6. However, if we consider the teacher to be Gaussian as well, then the predicted exponent would be $\beta =\frac{1}{d}\enspace \mathrm{min}\left(\infty ,\infty \right)=\infty $. This case corresponds to figure D1 (center): the test errors decay faster than a power law. In figure D1 (right) we compare the case where both kernels are Gaussian to the case where both kernels are Laplace: while the latter decays as a power law, the former decays much faster.


Figure D1. (Left) The test error of a Laplace teacher (αT = d + 1) with a Gaussian student (αS = ∞) decays as a power law with the predicted exponent $\beta =\frac{1}{d}\enspace \mathrm{min}\left(1,\infty \right)=\frac{1}{6}$ in d = 6 dimensions. (Center) When both the teacher and the student are Gaussian the test error decays faster than any power law as the number n of data is increased. This plot confirms this by showing that the logarithm of the test error decays linearly as a function of ${n}^{\frac{1}{3}}$. (Right) Comparison between the learning curves for the cases where both kernels are either Laplace (top blue line) or Gaussian (bottom orange line). While the former decays algebraically with the predicted exponent, the latter decays exponentially, in agreement with the prediction β = ∞ found within our framework. In all these plots we have taken the variances of both the teacher and student kernels to be equal to the dimension d = 6.


Appendix E.: Matérn teachers

To further test the applicability of our theory, we show here some numerical simulations with a Matérn covariance function as teacher kernel and a Laplace kernel as student. We ran the simulations in 1d: the data points are sampled uniformly on a one-dimensional circle embedded in ${\mathbb{R}}^{2}$. Matérn kernels are parametrized by a parameter ν > 0:

Equation (25): $K\left(\underline{x}\right)=\frac{{2}^{1-\nu }}{{\Gamma}\left(\nu \right)}\,{z}^{\nu }\,{\mathcal{K}}_{\nu }\left(z\right),$

where $z=\sqrt{2\nu }\frac{\vert \vert \underline{x}\vert \vert }{\sigma }$ (σ being the kernel variance), Γ is the gamma function and ${\mathcal{K}}_{\nu }$ is the Bessel function of the second kind with parameter ν. Interestingly we recover the Laplace kernel for ν = 1/2 and the Gaussian kernel for ν = ∞. As one can find in e.g. [34], the exponent αT that governs the decay at high frequency of these kernels is αT = d + 2ν. Varying ν we can change the smoothness of the target function.
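
A minimal implementation of the covariance of equation (25), using scipy's modified Bessel function of the second kind, is sketched below; at ν = 1/2 it reduces to the Laplace kernel, which provides a quick check. The handling of the r = 0 limit is an assumption of the sketch.

```python
import numpy as np
from scipy.special import kv, gamma

def matern(r, nu, sigma):
    """Matern covariance K(r) as a function of the distance r = ||x||."""
    r = np.asarray(r, dtype=float)
    z = np.sqrt(2 * nu) * r / sigma
    K = (2 ** (1 - nu) / gamma(nu)) * z ** nu * kv(nu, z)
    return np.where(r == 0, 1.0, K)      # K_nu(z) diverges at z = 0, but the product tends to 1

# Check against the Laplace kernel at nu = 1/2, where z = r / sigma and K(r) = exp(-r / sigma).
r = np.linspace(0.01, 5, 10)
print(np.allclose(matern(r, 0.5, 2.0), np.exp(-r / 2.0)))   # expected: True
```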

For d = 1 our prediction for the learning curve exponent β is

Equation (26): $\beta =\mathrm{min}\left({\alpha }_{\mathrm{T}}-1,\;2{\alpha }_{\mathrm{S}}\right)=\mathrm{min}\left(2\nu ,\;4\right).$

In figure E1 we verify that our prediction matches the numerical results.


Figure E1. Mean-squared error for Matérn teacher kernels and Laplace students. The variance of the kernels is equal to 2 for all the curves.


Appendix F.: Proof of theorem

We prove here theorem 1:

Theorem 1. Let ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)={c}_{\mathrm{T}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)={c}_{\mathrm{S}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}\right)$ as $\vert \vert \underline{w}\vert \vert \to \infty $, where ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)$ are the Fourier transforms of the kernels ${K}_{\mathrm{T}}\left(\underline{x}\right)$, ${K}_{\mathrm{S}}\left(\underline{x}\right)$ respectively, assumed to be positive definite. We assume that ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)$ have a finite limit as $\vert \vert \underline{w}\vert \vert \to 0$ and that $K\left(\underline{0}\right){< }\infty $. Then,

Equation (27): $\mathbb{E}\,\mathrm{M}\mathrm{S}\mathrm{E}\sim {n}^{-\beta }\quad \text{with}\quad \beta =\frac{1}{d}\,\mathrm{min}\left({\alpha }_{\mathrm{T}}-d,\;2{\alpha }_{\mathrm{S}}\right).$

Moreover, in the case of a Gaussian kernel the result holds valid if we take the corresponding exponent to be α = ∞.

Proof. Our strategy is to compute how the mean-square test error scales with the distance δ between two nearest neighbors on the d-dimensional regular lattice. At the end, we will use the fact that $\delta \sim {n}^{-1/d}$, where n is the number of sampled points on the lattice.

We denote by $\tilde {F}\left(\underline{w}\right)$ the Fourier transform of a function $F:\mathcal{V}\to \mathbb{R}$:

Equation (28): $\tilde {F}\left(\underline{w}\right)={\int }_{\mathcal{V}}\mathrm{d}\underline{x}\enspace F\left(\underline{x}\right)\,{\text{e}}^{-\text{i}\underline{w}\cdot \underline{x}},$

Equation (29): $F\left(\underline{x}\right)=\frac{1}{{L}^{d}}{\sum }_{\underline{w}\in \frac{2\pi }{L}{\mathbb{Z}}^{d}}\tilde {F}\left(\underline{w}\right)\,{\text{e}}^{\text{i}\underline{w}\cdot \underline{x}}.$

If $Z\sim \mathcal{N}\left(0,K\right)$ is a Gaussian field with translation-invariant covariance K then by definition

Equation (30): $\mathbb{E}\,Z\left(\underline{x}\right)\,Z\left({\underline{x}}^{\prime }\right)=K\left(\underline{x}-{\underline{x}}^{\prime }\right).$

Properties of the Fourier transform of a Gaussian field:

Equation (31)

Equation (32)

Equation (33)

Equation (31) comes from the fact that $K\left(\underline{x}\right)$ is an even, real-valued function. The real and imaginary parts of $\tilde {Z}\left(\underline{w}\right)$ are Gaussian random variables. They are all independent except that $\tilde {Z}\left(-\underline{w}\right)=\bar{\tilde {Z}\left(\underline{w}\right)}$. Equation (33) follows from the fact that $Z\left(\underline{x}\right)$ and $K\left(\underline{x}\right)$ are L-periodic functions, and therefore ${\text{e}}^{\text{i}\underline{w}\cdot \underline{x}}\enspace \tilde {K}\left(\underline{w}\right)$ is the Fourier transform of $K\left(\cdot +\underline{x}\right)$ if $\underline{w}\in \frac{2\pi }{L}{\mathbb{Z}}^{d}$. □

The solution equation (3) for kernel regression has two interpretations. In section 4 we introduced it as the quantity that minimizes a quadratic error, but it can also be seen as the maximum-a-posteriori estimate in a Bayesian formulation of the problem [34]. The field $Z\left(\underline{x}\right)$ is assumed to be drawn from a Gaussian distribution with covariance function ${K}_{\mathrm{S}}\left(\underline{x}\right)$: KS therefore plays the role of the prior distribution of the data $\underline{Z}={\left(Z\left({\underline{x}}_{\mu }\right)\right)}_{\mu =1}^{n}$. Inference about the value of the field ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)$ at another location is then performed by maximizing its posterior distribution,

Equation (34)

Such a posterior distribution is Gaussian, and its mean—and therefore also the value that maximizes the probability—is exactly equation (3):

Equation (35): ${\hat{Z}}_{\mathrm{S}}\left(\underline{x}\right)={\underline{k}}_{\mathrm{S}}\left(\underline{x}\right)\cdot {\mathbb{K}}_{\mathrm{S}}^{-1}\,\underline{Z},$

where $\underline{Z}={\left(Z\left({\underline{x}}_{\mu }\right)\right)}_{\mu =1}^{n}$ are the training data, ${\underline{k}}_{\mathrm{S}}\left(\underline{x}\right)={\left({K}_{\mathrm{S}}\left({\underline{x}}_{\mu },\underline{x}\right)\right)}_{\mu =1}^{n}$ and ${\mathbb{K}}_{\mathrm{S}}={\left({K}_{\mathrm{S}}\left({\underline{x}}_{\mu },{\underline{x}}_{\nu }\right)\right)}_{\mu ,\nu =1}^{n}$ is the Gram matrix, that is invertible since the kernel KS is assumed to be positive definite. By Fourier transforming this relation we find

Equation (36): $\tilde {\hat{Z}}_{\mathrm{S}}\left(\underline{w}\right)=\frac{{\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)}{{\tilde {K}}_{\mathrm{S}}^{\star }\left(\underline{w}\right)}\,{\tilde {Z}}^{\star }\left(\underline{w}\right),$

where we have defined ${F}^{\star }\left(\underline{w}\right)\equiv {\sum }_{\underline{n}\in {\mathbb{Z}}^{d}}F\left(\underline{w}+\frac{2\pi \underline{n}}{\delta }\right)$ for a generic function F.

Another way to reach equation (36) is to consider that we are observing the quantities

Equation (37): ${\tilde {Z}}^{\star }\left(\underline{w}\right)={\sum }_{\underline{n}\in {\mathbb{Z}}^{d}}\tilde {Z}\left(\underline{w}+\frac{2\pi \underline{n}}{\delta }\right).$

Given that we know the prior distribution of the Fourier components on the right-hand side in equation (37), we can infer their posterior distribution once their sums are constrained by the value of ${\tilde {Z}}^{\star }\left(\underline{w}\right)$, and it is straightforward to see that we recover equation (36).

The mean-square error can then be written using the Parseval–Plancherel identity,

Equation (38)

By taking the expectation value with respect to the teacher and using equations (31)–(33) we can write the mean-square error as

Equation (39)

where $\mathcal{B}={\left[-\frac{\pi }{\delta },\frac{\pi }{\delta }\right]}^{d}$ is the Brillouin zone.

At high frequencies, ${\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)={c}_{\mathrm{T}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}\right)$ and ${\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)={c}_{\mathrm{S}}\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}+o\left(\vert \vert \underline{w}\vert {\vert }^{-{\alpha }_{\mathrm{S}}}\right)$. Therefore:

Equation (40)

This equation defines the function ψT, and a similar equation holds for the student as well. The hypothesis ${K}_{\mathrm{T}}\left(\underline{0}\right)\propto \int \mathrm{d}\underline{w}\enspace {\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right){< }\infty $ implies αT > d and therefore ${\sum }_{\underline{n}\in {\mathbb{Z}}^{d}{\setminus}\left\{\underline{0}\right\}}\vert \vert \underline{n}\vert {\vert }^{-{\alpha }_{\mathrm{T}}}{< }\infty $ (and likewise for the student). Then, ${\psi }_{{\alpha }_{\mathrm{T}}}\left(\underline{0}\right),{\psi }_{{\alpha }_{\mathrm{S}}}\left(\underline{0}\right)$ are finite; furthermore, the $\underline{w}$'s in the sum in equation (39) are at most of order $\mathcal{O}\left({\delta }^{-1}\right)$, therefore the terms ${\psi }_{\alpha }\left(\underline{w}\delta \right)$ are $\mathcal{O}\left({\delta }^{0}\right)$ and do not influence how equation (39) scales with δ. Applying equation (40), expanding for δ ≪ 1 and keeping only the leading orders, we find

Equation (41)

We have neglected terms proportional to, for instance, ${\delta }^{{\alpha }_{\mathrm{T}}+{\alpha }_{\mathrm{S}}}$, since they are subleading with respect to ${\delta }^{{\alpha }_{\mathrm{T}}}$, but we must keep both ${\delta }^{{\alpha }_{\mathrm{T}}}$ and ${\delta }^{{\alpha }_{\mathrm{S}}}$ since we do not know a priori which one is dominant. The additional term δd in the subleading terms comes from the fact that $\vert \mathbb{L}\cap \mathcal{B}\vert =\mathcal{O}\left({\delta }^{-d}\right)$.

The first term in equation (41) is the simplest to deal with: since $\vert \vert \underline{w}\delta \vert \vert $ is smaller than some constant for all $\underline{w}\in \mathbb{L}\cap \mathcal{B}$ and the function ${\psi }_{{\alpha }_{\mathrm{T}}}\left(\underline{w}\delta \right)$ has a finite limit, we have

Equation (42)

We then split the second term in equation (41) in two contributions:

Small $\vert \vert \underline{w}\vert \vert $. We consider 'small' all the terms $\underline{w}\in \mathbb{L}\cap \mathcal{B}$ such that $\vert \vert \underline{w}\vert \vert {< }{\Gamma}$, where Γ ≫ 1 is $\mathcal{O}\left({\delta }^{0}\right)$ but large. As δ → 0, ${\psi }_{2{\alpha }_{\mathrm{S}}}\left(\underline{w}\delta \right)\to {\psi }_{2{\alpha }_{\mathrm{S}}}\left(\underline{0}\right)$ which is finite because $K\left(\underline{0}\right){< }\infty $. Therefore

Equation (43)

The summand is real and strictly positive because the positive definiteness of the kernels implies that their Fourier transforms are strictly positive. Moreover, as δ → 0, $\mathbb{L}\cap \mathcal{B}\cap \left\{\vert \vert \underline{w}\vert \vert {< }{\Gamma}\right\}\to \mathbb{L}\cap \left\{\vert \vert \underline{w}\vert \vert {< }{\Gamma}\right\}$, which contains a finite number of elements, independent of δ. Therefore

Equation (44)

Large $\vert \vert \underline{w}\vert \vert $. 'Large' $\underline{w}$ are those with $\vert \vert \underline{w}\vert \vert { >}{\Gamma}$: we recall that Γ ≫ 1 is $\mathcal{O}\left({\delta }^{0}\right)$ but large. This allows us to approximate ${\tilde {K}}_{\mathrm{T}}$, ${\tilde {K}}_{\mathrm{S}}$ in the sum with their asymptotic behavior:

Equation (45)

Finally, putting equations (42), (44) and (45) together,

Equation (46)

The proof is concluded by considering that $\delta =\mathcal{O}\left({n}^{-1/d}\right)$.

In the case of a Gaussian kernel $K\left(\underline{x}\right)\propto \mathrm{exp}\left(-\vert \vert \underline{x}\vert {\vert }^{2}/\left(2{\sigma }^{2}\right)\right)$—and therefore $\tilde {K}\left(\underline{w}\right)\propto \mathrm{exp}\left(-{\sigma }^{2}\vert \vert \underline{w}\vert {\vert }^{2}/2\right)$—one has to redo the calculations starting from equation (39), but the final result can be easily recovered by taking the limit α → +∞ (Gaussian kernels decay faster than any power law). □

Appendix G.: Proofs of lemmas

Lemma 1. Let $K\left(\underline{x},{\underline{x}}^{\prime }\right)$ be a translation-invariant isotropic kernel such that $\tilde {K}\left(\underline{w}\right)=c\vert \vert \underline{w}\vert {\vert }^{-\alpha }+o\left(\vert \vert \underline{w}\vert {\vert }^{-\alpha }\right)$ as $\vert \vert \underline{w}\vert \vert \to \infty $ and $\vert \vert \underline{w}\vert {\vert }^{d}\tilde {K}\left(\underline{w}\right)\to 0$ as $\vert \vert \underline{w}\vert \vert \to 0$. If α > d + n for some $n\in {\mathbb{Z}}^{+}$, then $K\left(\underline{x}\right)\in {C}^{n}$, that is, it is at least n-times differentiable.

Proof. The kernel is rotationally invariant in real space ($K\left(\underline{x}\right)=K\left(\vert \vert \underline{x}\vert \vert \right)$) and therefore also in the frequency domain. Then, calling ${\hat{{\epsilon}}}_{1}=\left(1,0,\dots \enspace \right)$ the unit vector along the first dimension x1,

Equation (47)

It follows that

Equation (48)

We want to claim that this quantity is finite if m ⩽ n. Convergence at infinity requires m < α − d, which holds for every m ⩽ n because of the hypothesis of the lemma (α > d + n). Convergence in zero requires that ${w}^{d+m}\vert \tilde {K}\left(w\right)\vert \to 0$, and we want this to hold for all 0 ⩽ m < α − d, the most constraining condition being the one with m = 0. □

Lemma 2. Let $Z\sim \mathcal{N}\left(0,K\right)$ be a d-dimensional Gaussian random field, with KC2n being a 2n-times differentiable kernel. Then Z is n-times differentiable in the sense that

  • Derivatives of $Z\left(\underline{x}\right)$ are Gaussian random fields;
  • $\mathbb{E}{\partial }_{{x}_{1}}^{{n}_{1}}\dots {\partial }_{{x}_{d}}^{{n}_{d}}Z\left(\underline{x}\right)=0$;
  • $\mathbb{E}{\partial }_{{x}_{1}}^{{n}_{1}}\dots {\partial }_{{x}_{d}}^{{n}_{d}}Z\left(\underline{x}\right)\cdot {\partial }_{{x}_{1}}^{{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}^{\prime }}Z\left({\underline{x}}^{\prime }\right)={\partial }_{{x}_{1}}^{{n}_{1}+{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}+{n}_{d}^{\prime }}K\left(\underline{x}-{\underline{x}}^{\prime }\right){< }\infty $ if the derivatives of K exist.

In particular, $\mathbb{E}{\partial }_{{x}_{i}}^{m}Z\left(\underline{x}\right)\cdot {\partial }_{{x}_{i}}^{m}Z\left({\underline{x}}^{\prime }\right)={\partial }_{{x}_{i}}^{2m}K\left(\underline{x}-{\underline{x}}^{\prime }\right){< }\infty \quad \forall \enspace m{\leqslant}n$.

Proof. Derivatives of $Z\left(\underline{x}\right)$ are defined as limits of sums and differences of the field Z evaluated at different points; therefore they are Gaussian random fields too, and furthermore it is straightforward to see that their expected value is always 0 if the field itself is zero centered.

The correlation can be computed via induction. Assume that $\mathbb{E}{\partial }_{{x}_{1}}^{{n}_{1}}\dots {\partial }_{{x}_{d}}^{{n}_{d}}Z\left(\underline{x}\right)\cdot {\partial }_{{x}_{1}}^{{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}^{\prime }}Z\left({\underline{x}}^{\prime }\right)={\partial }_{{x}_{1}}^{{n}_{1}+{n}_{1}^{\prime }}\dots {\partial }_{{x}_{d}}^{{n}_{d}+{n}_{d}^{\prime }}K\left(\underline{x}-{\underline{x}}^{\prime }\right)$ holds true. Then, if we increment n1:

Equation (49)

Of course, by symmetry the same holds when incrementing any other exponent. To conclude the induction we simply recall that by definition $\mathbb{E}Z\left(\underline{x}\right)Z\left({\underline{x}}^{\prime }\right)=K\left(\underline{x}-{\underline{x}}^{\prime }\right)$. □
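The statement of lemma 2 can also be checked numerically. Below is a minimal sketch (illustrative only; the Gaussian covariance, grid and sample sizes are arbitrary choices, not those used elsewhere in the paper) that samples a one-dimensional Gaussian random field with covariance $K\left(\tau \right)={\mathrm{e}}^{-{\tau }^{2}/\left(2{\ell }^{2}\right)}$, differentiates the samples by finite differences, and compares the empirical variance of the derivative with the value 1/ℓ2 obtained by differentiating the covariance twice at the origin.

import numpy as np

# Minimal numerical check of lemma 2 (illustrative sketch; parameters are arbitrary).
# Sample a 1d Gaussian random field with covariance K(tau) = exp(-tau^2 / (2 ell^2)),
# differentiate the samples by finite differences and compare the empirical variance
# of the derivative with the analytic value 1 / ell^2.
rng = np.random.default_rng(0)
ell, n_grid, n_samples = 0.5, 256, 4000
x = np.linspace(0.0, 4.0, n_grid)
dx = x[1] - x[0]

# Covariance matrix on the grid; eigenvalues are clipped at zero for numerical stability.
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / (2 * ell ** 2))
eigval, eigvec = np.linalg.eigh(K)
L = eigvec * np.sqrt(np.clip(eigval, 0.0, None))

Z = L @ rng.standard_normal((n_grid, n_samples))   # realizations of the field
dZ = np.gradient(Z, dx, axis=0)                    # finite-difference derivative

# Restrict to the bulk of the grid to avoid the one-sided differences at the boundary.
print(dZ[n_grid // 4 : 3 * n_grid // 4].var(), 1 / ell ** 2)

The two printed numbers should agree within a few percent for these parameters.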

Appendix H.: RKHS hypothesis and smoothness

It is important to note the high degree of smoothness underlying the RKHS hypothesis. Consider for instance realizations $Z\left(\underline{x}\right)$ of a teacher Gaussian process with covariance KT and assume that they lie in the RKHS of the student kernel KS (notice that they never belong to the RKHS of the same kernel KT), namely

Equation (50)

If the teacher and student kernels decay in the frequency domain with exponents αT and αS respectively, convergence requires αT > αS + d, and ${K}_{\mathrm{S}}\left(\underline{0}\right)\propto \int \mathrm{d}\underline{w}\enspace {\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right){< }\infty $ (true for many commonly used kernels) implies αS > d. Then, using lemmas 1 and 2, we can conclude that the realizations $Z\left(\underline{x}\right)$ must be at least ⌊d/2⌋-times mean-square differentiable in order to belong to the RKHS.
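To make the counting explicit, one can write the expected RKHS norm in the frequency domain (a sketch, assuming that $\mathbb{E}\vert \tilde {Z}\left(\underline{w}\right){\vert }^{2}\propto {\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)$): $\mathbb{E}{\Vert}Z{{\Vert}}_{{K}_{\mathrm{S}}}^{2}\propto \int \mathrm{d}\underline{w}\enspace {\tilde {K}}_{\mathrm{T}}\left(\underline{w}\right)/{\tilde {K}}_{\mathrm{S}}\left(\underline{w}\right)$, whose large-frequency part behaves as $\int \mathrm{d}w\enspace {w}^{d-1}{w}^{{\alpha }_{\mathrm{S}}-{\alpha }_{\mathrm{T}}}$ and is finite only if αT > αS + d. Combined with αS > d this gives αT > 2d; by lemma 1 the teacher kernel then belongs to C2n for any n < (αT − d)/2 > d/2, and by lemma 2 the realizations are n-times mean-square differentiable, hence at least ⌊d/2⌋-times.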

Appendix I.: Asymptotic limit of the PDE approximation

In this appendix we show how to recover the prediction of theorem 1 using the approach presented in [11], which can be applied to a generic target function (not necessarily Gaussian, nor evaluated only on a regular lattice). The authors derive their main formula, which we report below, by computing the contribution of each eigenmode of the (student) kernel to the generalization error. To carry out the calculation they introduce a partial differential equation, which they solve within two different approximations.

In the following we denote by λ1 ⩾ ⋯ ⩾ λρ ⩾ ⋯ the eigenvalues of the kernel, and by ${\phi }_{\rho }\left(\underline{x}\right)$ the corresponding eigenfunctions. In [11] they show that the generalization error can be written as:

Equation (51)

Equation (52)

Equation (53)

Equation (54)

The term $\mathbb{E}{w}_{\rho }^{2}$ is the variance of the coefficients of the target function in the kernel eigenbasis, defined as:

Equation (55)

(the factor in front of the scalar product is there to keep our notation consistent with that of [11] and to help the reader compare the two works). Notice that the variance can be computed with respect to an ensemble of target functions, but this ensemble may also consist of a single deterministic function.
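For a finite dataset these coefficients can be estimated from the eigendecomposition of the Gram matrix, in the spirit of the kernel PCA analysis of the main text. The snippet below is an illustrative sketch (synthetic data, a Laplace kernel, and no attempt to reproduce the normalization factor of equation (55)): it projects a target vector onto the eigenvectors of the Gram matrix and returns the squared projections ordered by decreasing eigenvalue; the decay of these squared projections with the rank is what enters the sum of equation (12).

import numpy as np

def eigencoefficients(X, y, sigma=1.0):
    """Squared projections of the target y on the eigenvectors of a Laplace-kernel
    Gram matrix built from the rows of X, sorted by decreasing eigenvalue.
    Illustrative sketch: the normalization factor of equation (55) is omitted."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    gram = np.exp(-dists / sigma)              # Laplace-kernel Gram matrix
    eigval, eigvec = np.linalg.eigh(gram)      # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1]
    coeffs2 = (eigvec[:, order].T @ y) ** 2    # squared projections on each eigenvector
    return eigval[order], coeffs2

# Example on synthetic data: points on the unit sphere and a smooth deterministic target.
rng = np.random.default_rng(0)
n, d = 500, 10
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
y = np.tanh(X @ rng.standard_normal(d))
lam, c2 = eigencoefficients(X, y)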

In order to compute sums over the eigenmodes we will always replace them with integrals over eigenvalues. To do so, we must also introduce a density of eigenvalues $\mathcal{D}\left(\lambda \right)$: ${\sum }_{\rho }f\left({\lambda }_{\rho }\right)\to \int \mathrm{d}\lambda \enspace \mathcal{D}\left(\lambda \right)f\left(\lambda \right)$. The asymptotic behavior of this density for small eigenvalues can be derived as follows (for a given kernel whose Fourier transform decays with an exponent α):

Equation (56)

where we have defined the exponent $\theta \equiv 1+\frac{d}{\alpha }$. Notice that 1 < θ < 2, and that of course this exponent depends on the kernel. We can use this density also to derive the scaling of small eigenvalues: indeed, the ρth eigenvalue (with ρ ≫ 1) can be estimated by

Equation (57)

The last equation follows from the fact that λρ ≪ λ1 and that θ > 1.
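In compact form, the counting argument behind the last two statements reads (a sketch that identifies the eigenvalues with the Fourier coefficients, $\lambda \sim \vert \vert \underline{w}\vert {\vert }^{-\alpha }$, and the number of modes up to frequency w with wd): the density of eigenvalues is $\mathcal{D}\left(\lambda \right)\propto {w}^{d-1}\vert \mathrm{d}w/\mathrm{d}\lambda \vert \propto {\lambda }^{-1-d/\alpha }={\lambda }^{-\theta }$, and the rank of an eigenvalue is $\rho \approx {\int }_{{\lambda }_{\rho }}^{{\lambda }_{1}}\mathrm{d}\lambda \enspace \mathcal{D}\left(\lambda \right)\propto {\lambda }_{\rho }^{1-\theta }$, which inverted gives ${\lambda }_{\rho }\sim {\rho }^{-\frac{1}{\theta -1}}$.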

We now have to estimate the asymptotic behavior of the implicitly defined function t(n). It is easy to see that this function must go to 0 as n → ∞; we can therefore assume that it is small. Splitting the integral according to whether the denominator in the definition of t(n) is dominated by its first or second term,

Equation (58)

Therefore, $t\left(n\right)\sim {n}^{-\frac{2-\theta }{\theta -1}}$, and with a similar approximation we can also deduce that $\gamma \left(n\right)\sim {n}^{-\frac{3-\theta }{\theta -1}}$. Inserting these estimates into the formula for the generalization error and splitting the integral, we find

Equation (59)

In the second equality we have used the fact that ${\lambda }_{n}\sim {n}^{-\frac{1}{\theta -1}}$ to introduce the nth eigenvalue into the formula. We have then approximated the sum by splitting it into two sums, one over the first n eigenvalues (ρ ⩽ n, for which λρ ⩾ λn) and one over the remaining ones (ρ > n). Notice that the second sum is precisely the sum appearing in equation (12).

Next we assume that $\mathbb{E}{w}_{\rho }^{2}$ behaves asymptotically as a power law with respect to small eigenvalues, $\mathbb{E}{w}_{\rho }^{2}\sim {\lambda }_{\rho }^{q}$, with an exponent q that can be either positive or negative. We can now compute each of the integrals in the previous equation:

Equation (60)

Equation (61)

For the second integral to converge we have assumed that the exponent q is larger than θ − 1. The first integral behaves differently according to whether q > θ or not: if q > θ, the integral scales as ${\lambda }_{n}^{2}\sim {n}^{-\frac{2}{\theta -1}}$; if q < θ, then it scales as ${\lambda }_{n}^{2-\theta +q}\sim {n}^{-\frac{q-\theta +2}{\theta -1}}$. Therefore,

Equation (62)

A consequence of equations (60) and (61) is that if q < θ (which always occurs if the student is smooth enough, so that the exponent α characterizing the decay of its Fourier transform is large and θ is close to 1), then the scaling of the generalization error is given by equation (61) alone, and we recover equation (12) from equation (59), justifying why this equation applies to real, non-Gaussian data.

Notice that if the target function is generated by a teacher Gaussian process, the exponent q takes the value $\frac{\theta -{\theta }_{\mathrm{T}}}{{\theta }_{\mathrm{T}}-1}$, where ${\theta }_{\mathrm{T}}=1+\frac{d}{{\alpha }_{\mathrm{T}}}$ and αT is the exponent characterizing the decay of the Fourier transform of the teacher kernel. With some manipulations we then recover our theorem 1:

Equation (63)
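For completeness, substituting $\theta =1+\frac{d}{{\alpha }_{\mathrm{S}}}$ and ${\theta }_{\mathrm{T}}=1+\frac{d}{{\alpha }_{\mathrm{T}}}$ into the expression for q gives (a short check using only the definitions above) $q=\frac{\theta -{\theta }_{\mathrm{T}}}{{\theta }_{\mathrm{T}}-1}=\frac{{\alpha }_{\mathrm{T}}-{\alpha }_{\mathrm{S}}}{{\alpha }_{\mathrm{S}}}$, and the condition q < θ is equivalent to αT − d < 2αS, i.e. to the requirement, discussed after equation (62), that the student be smooth enough.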

Appendix J.: Convergence of the spectrum of the Gram matrix

See figure J1.


Figure J1. (Left) We plot the first eigenvalues λρ of Gram matrices of size $\tilde {n}$, rescaled by the corresponding eigenvalue of the largest Gram matrix (top row is MNIST, bottom row is CIFAR10). As $\tilde {n}$ increases the eigenvalues are expected to converge, and indeed these ratios asymptote to one. We plot one eigenvalue every 10 among the first 100 eigenvalues. In order to make the plot clearer we have multiplied each curve by a factor ρ, equal to the eigenvalue index. (Right) Density of eigenvalues of the Gram matrix, for several sizes $\tilde {n}$, for MNIST (top) and CIFAR10 (bottom). The density is divided by the predicted asymptotic behavior ${\left({\lambda }_{\rho }^{\mathrm{S}}\right)}^{-\theta }$, with $\theta =1+\frac{{d}_{\mathrm{e}\mathrm{f}\mathrm{f}}}{{\alpha }_{\mathrm{S}}}$. For a Laplace kernel αS = deff + 1, and for the effective dimension we used the values extracted in section 7, resulting in θ ≈ 1.937 for MNIST and θ ≈ 1.972 for CIFAR10. This plot shows that the density of eigenvalues converges as $\tilde {n}$ increases, and that the predicted power law is consistent with observations.
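A minimal numerical analogue of this figure can be obtained on synthetic data (an illustrative sketch, not the code used for MNIST and CIFAR10; the intrinsic dimension, kernel width and matrix sizes are arbitrary choices): draw points uniformly on a sphere of intrinsic dimension d, build Laplace-kernel Gram matrices of increasing size, and compare the decay of the sorted eigenvalues with the predicted power law ${\lambda }_{\rho }\sim {\rho }^{-\frac{1}{\theta -1}}$, i.e. a log–log slope of −(d + 1)/d for αS = d + 1.

import numpy as np

# Illustrative analogue of figure J1 on synthetic data (not the code used in the paper).
# Points are drawn uniformly on a sphere of intrinsic dimension d; the eigenvalues of
# Laplace-kernel Gram matrices of increasing size are compared with the predicted
# power law lambda_rho ~ rho^{-(d+1)/d}, i.e. theta = 1 + d / (d + 1).
rng = np.random.default_rng(0)
d, sigma = 3, 1.0                                  # intrinsic dimension and kernel width

def laplace_gram_eigenvalues(n):
    X = rng.standard_normal((n, d + 1))            # sphere S^d embedded in R^(d+1)
    X /= np.linalg.norm(X, axis=1, keepdims=True)
    sq = np.clip(2.0 - 2.0 * X @ X.T, 0.0, None)   # squared distances between unit vectors
    gram = np.exp(-np.sqrt(sq) / sigma) / n        # 1/n so that the spectrum converges with n
    return np.sort(np.linalg.eigvalsh(gram))[::-1]

for n in (500, 1000, 2000):
    lam = laplace_gram_eigenvalues(n)
    rho = np.arange(1, n + 1)
    sel = slice(20, 200)                           # intermediate ranks, away from both ends
    slope = np.polyfit(np.log(rho[sel]), np.log(lam[sel]), 1)[0]
    print(n, slope, -(d + 1) / d)

The fitted slope should approach the predicted value as n grows, mirroring the convergence of the spectrum shown in the figure.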
