Learning sparse features can lead to overfitting in neural networks

It is widely believed that the success of deep networks lies in their ability to learn a meaningful representation of the features of the data. Yet, understanding when and how this feature learning improves performance remains a challenge: for example, it is beneficial for modern architectures trained to classify images, whereas it is detrimental for fully-connected networks trained for the same task on the same data. Here we propose an explanation for this puzzle, by showing that feature learning can perform worse than lazy training (via random feature kernel or the NTK) as the former can lead to a sparser neural representation. Although sparsity is known to be essential for learning anisotropic data, it is detrimental when the target function is constant or smooth along certain directions of input space. We illustrate this phenomenon in two settings: (i) regression of Gaussian random functions on the d-dimensional unit sphere and (ii) classification of benchmark datasets of images. For (i), we compute the scaling of the generalization error with number of training points, and show that methods that do not learn features generalize better, even when the dimension of the input space is large. For (ii), we show empirically that learning features can indeed lead to sparse and thereby less smooth representations of the image predictors. This fact is plausibly responsible for deteriorating the performance, which is known to be correlated with smoothness along diffeomorphisms.


Introduction
Neural networks are responsible for a technological revolution in a variety of machine learning tasks. Many such tasks require learning functions of high-dimensional inputs from a finite set of examples, and should thus be generically hard due to the curse of dimensionality [1,2]: the exponent that controls the scaling of the generalization error with the number of training examples is inversely proportional to the input dimension d. For instance, for standard image classification tasks with d ranging from 10^3 to 10^5, such an exponent should be practically vanishing, contrary to what is observed in practice [3]. In this respect, understanding the success of neural networks is still an open question. A popular explanation is that, during training, neurons adapt to features in the data that are relevant for the task [4], effectively reducing the input dimension and making the problem tractable [5,6,7]. However, understanding quantitatively whether this intuition is true and how it depends on the structure of the task remains a challenge.

Figure 1: Feature vs. lazy in image classification. Generalization error as a function of the training-set size n for infinite-width fully-connected networks (FCNs) trained in the feature (blue) and lazy (orange) regimes. In the latter case the limit is taken exactly by training an SVC algorithm with the analytical NTK [23]. In the former case, the infinite-width limit can be accurately approximated for these datasets by considering very wide nets (H = 10^3) and performing ensemble averaging over different initial conditions of the parameters, as shown in [24,25]. Panels correspond to different benchmark image datasets [26,27,28]. Results are averaged over 10 different initializations of the networks and datasets.
Recently, much progress has been made in characterizing the conditions that lead to feature learning in the overparameterized setting, where networks generally perform best. When the initialization scale of the network parameters is large [8], one encounters the lazy training regime, where neural networks behave as kernel methods [9,10] (the associated kernel being coined the Neural Tangent Kernel, or NTK) and features are not learned. By contrast, when the initialization scale is small, a feature learning regime is found [11,12,13], in which the network parameters evolve significantly during training. This limit is much less understood beyond very simple architectures, where it can be shown to lead to sparse representations in which a limited number of neurons are active after training [14]. Such sparse representations can also be obtained by regularizing the weights during training [2,15].
In terms of performance, most theoretical works have focused on fully-connected networks. For these architectures, feature learning was shown to significantly outperform lazy training [16,17,18,19,11] for certain tasks, including approximating a function that depends only on a subset or a linear combination of the input variables. However, when such primitive networks are trained on image datasets, learning features is detrimental [20,21], as illustrated in Fig. 1 (see [19, Fig. 3] for the analogous plot in the case of a target function depending on just one of the input variables, where learning features is beneficial). A similar result was observed in simple models of data [22]. These facts are unexplained, yet central to understanding the implicit bias of the feature learning regime.

Our contribution
Our main contribution is to provide an account of the drawbacks of learning sparse representations, based on the following set of ideas. Consider, for concreteness, an image classification problem: (i) the image class varies little under smooth deformations of the image; (ii) because of that, tasks like image classification require a continuous distribution of neurons to be represented; (iii) thus, requiring sparsity can be detrimental for performance. We build our argument as follows.
• In order to find a quantitative description of the phenomenon, we start from the problem of regression of a random target function of controlled smoothness on the d-dimensional unit sphere, and study the properties of the minimizers of the empirical loss with n observations, both in the lazy and the feature learning regimes. More specifically, we consider two extreme limits, the NTK limit and the mean-field limit, as representatives of the lazy and feature regimes, respectively (section 2). Both these limits admit a simple formulation that allows us to predict generalization performance. In particular, our results on feature learning rely on solutions having an atomic support. This property can be justified for one-hidden-layer neural networks with ReLU activations and weight decay. Yet, we also find such sparsity empirically using gradient descent in the absence of regularization, if the weights are initialized small enough.
• We find that lazy training leads to smoother predictors than feature learning. As a result, lazy training outperforms feature learning when the target function is also sufficiently smooth. Otherwise, the performances of the two methods are comparable, in the sense that they display the same asymptotic decay of the generalization error with the number of training examples. Our predictions are obtained from asymptotic arguments that we systematically back up with numerical studies.
• For image datasets, it is believed that diffeomorphisms of images are key transformations along which the predictor function should only mildly vary in order to obtain good performance [29]. From the results above, a natural explanation as to why lazy beats feature for fully-connected networks is that the lazy regime leads to predictors with smaller variations along diffeomorphisms. We confirm that this is indeed the case empirically on benchmark datasets.
Numerical experiments are performed in PyTorch [30], and the code for reproducing experiments is available online at github.com/pcsl-epfl/regressionsphere.

Related Work
The property that training ReLU networks in the feature regime leads to a sparse representation was observed empirically [31]. This property can be justified for one-hidden-layer networks by casting training as an L1 minimization problem [32,2], then using a representer theorem [33,15,34]. This is analogous to what is commonly done in predictive sparse coding [35,36,37,38].
Many works have investigated the benefits of learning sparse representations in neural networks. [2,16,17,18,19,39,40] study cases in which the true function only depends on a linear subspace of input space, and show that feature learning profitably captures this property. Even for more general problems, sparse representations of the data might emerge naturally during deep network training, a phenomenon coined neural collapse [41]. Similar sparsification phenomena have been found, for instance, to allow for learning convolutional layers from scratch [42,43]. Our work builds on this body of literature by pointing out that learning sparse features can be detrimental if the task does not allow for it.
There is currently no general framework to rigorously predict the learning-curve exponent β, defined by ε(n) = O(n^(−β)), for kernels. Some of our asymptotic arguments can be obtained by other approximations, such as assuming that the data points lie on a lattice in R^d [44], or by using the non-rigorous replica method of statistical physics [45,46,47]. In the case d = 2, we provide a more explicit mathematical formulation of our results, which leads to analytical results for certain kernels. We systematically back up our predictions with numerical tests as d varies.
Finally, in the context of image classification, the connection between performance and 'stability' or smoothness toward small diffeomorphisms of the inputs has been conjectured by [29,48]. Empirically, a strong correlation between these two quantities was shown to hold across various architectures on real datasets [49]. In that reference, it was found that fully-connected networks lose their stability over training: here we show that this effect is much less pronounced in the lazy regime.

Problem and notation
Task We consider a supervised learning scenario with n training points {x_i}_{i=1}^n uniformly drawn on the d-dimensional unit sphere S^{d−1}. We assume that the target function f* is an isotropic Gaussian random process on S^{d−1} and control its statistics via its spectrum {c_k}, introduced through the decomposition of f* into spherical harmonics (see App. A for definitions). We assume that all the c_k with odd k vanish apart from c_1: this is required to guarantee that f* can be approximated arbitrarily well by a one-hidden-layer ReLU network with no biases, as discussed in App. A. We also assume that the non-zero c_k decay as a power law of k for k ≫ 1. The exponent ν_t > 0 controls the (weak) differentiability of f* on the sphere (see App. A), as well as the statistics of f* in real space. Examples of such a target function for d = 3 and different values of ν_t are reported in Fig. 2.
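Such a target is easy to sample numerically in d = 2, where spherical harmonics reduce to Fourier modes. Below is a minimal sketch; the spectrum normalization c_k ~ k^(−2ν_t−1) is a hypothetical choice consistent with ν_t controlling the smoothness, while the exact spectrum is fixed in App. A.

```python
import numpy as np

def sample_target_d2(nu_t, k_max=512, rng=None):
    """Sample a Gaussian random function on the unit circle S^1 with a
    power-law Fourier spectrum c_k ~ k^(-2*nu_t - 1), so that nu_t
    controls the (weak) differentiability of the samples."""
    rng = np.random.default_rng(rng)
    ks = np.arange(1, k_max + 1)
    std = ks ** (-(2 * nu_t + 1) / 2)          # sqrt(c_k)
    a = rng.standard_normal(k_max) * std       # cosine coefficients
    b = rng.standard_normal(k_max) * std       # sine coefficients
    def f(x):
        x = np.atleast_1d(x)
        return np.cos(np.outer(x, ks)) @ a + np.sin(np.outer(x, ks)) @ b
    return f

f_star = sample_target_d2(nu_t=1.5, rng=0)
vals = f_star(np.linspace(0, 2 * np.pi, 7))
```

Larger ν_t yields faster spectral decay and hence visibly smoother samples, in line with the two panels of Fig. 2.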
Figure 2: Gaussian random process on the sphere. Two samples of the task introduced in section 2, where the target function f*(x) is defined on the 3-dimensional unit sphere. (a) and (b) show samples with large and small smoothness coefficient ν_t, respectively.
Neural network representation in the feature regime In this regime we aim to approximate the target function f*(x) via a one-hidden-layer neural network of width H,

f^H(x) = (1/H) Σ_{h=1}^H w_h σ(θ_h · x),   (2.3)

where {θ_h}_{h=1}^H (the features) and {w_h}_{h=1}^H (the weights) are the network parameters to be optimized, and σ denotes the ReLU function, σ(x) = max{0, x}. If we assume that the {θ_h, w_h}_{h=1}^H are independently drawn from a probability measure μ on S^{d−1} × R such that the Radon measure γ = ∫_R w μ(·, dw) exists, then, as H → ∞,

f^H(x) → f(x) = ∫_{S^{d−1}} σ(θ · x) dγ(θ)   a.e. on S^{d−1}.   (2.4)
This limit (Eq. 2.4) is the so-called mean-field limit [11,12], and it is then natural to determine the optimal γ via

min_γ ∫_{S^{d−1}} |dγ(θ)|   subject to:   ∫_{S^{d−1}} σ(θ · x_i) dγ(θ) = f*(x_i),  i = 1, ..., n.   (2.5)

In practice, we can approximate this minimization problem by using a network with large but finite width, constraining the features to be on the sphere, |θ_h| = 1, and minimizing the following empirical loss with L1 regularization on the weights,

min_{{θ_h, w_h}} (1/2n) Σ_{i=1}^n (f^H(x_i) − f*(x_i))² + (λ/H) Σ_{h=1}^H |w_h|.   (2.6)

This minimization problem leads to (2.5) as H → ∞ and λ → 0. Note that, by homogeneity of ReLU, (2.6) can be shown to be equivalent to imposing a regularization on the L2 norm of all parameters [32, Thm. 10], i.e. the usual weight decay.
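As a concrete illustration of the finite-width problem (2.6), the sketch below fits only the output weights of a wide ReLU network on the circle with an L1 penalty, via proximal gradient descent (ISTA). The dictionary of fixed feature angles, the target, and all hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

# Fit output weights of a shallow ReLU net on S^1 with L1 penalty (ISTA).
rng = np.random.default_rng(0)
n, H, lam, steps = 32, 256, 1e-4, 5000

x = rng.uniform(0, 2 * np.pi, n)              # training angles on S^1
y = np.cos(x)                                 # a simple smooth target
angles = np.linspace(0, 2 * np.pi, H, endpoint=False)
Phi = np.maximum(np.cos(x[:, None] - angles[None, :]), 0.0)  # sigma(theta . x)

L = np.linalg.eigvalsh(Phi.T @ Phi / n).max()  # Lipschitz constant of the gradient
lr = 1.0 / L
w = np.zeros(H)
for _ in range(steps):
    w = w - lr * Phi.T @ (Phi @ w - y) / n                   # gradient step
    w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)   # soft-thresholding
train_err = np.mean((Phi @ w - y) ** 2)
```

With small λ the solution fits the data closely while the soft-thresholding step drives many weights exactly to zero, mimicking the atomic measures discussed in the text.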
To proceed we will make the following assumption about the minimizer γ*:

Assumption 1. The minimizer γ* of (2.5) is unique and atomic, with n_A ≤ n atoms, i.e. there exist weights {w*_j} and features {θ*_j} such that γ* = Σ_{j=1}^{n_A} w*_j δ_{θ*_j}.

The main component of the assumption is the uniqueness of γ*; if it holds, the sparsity of γ* follows from the representer theorem, see e.g. [33]. Both the uniqueness and sparsity of the minimizer can be justified as holding generically using asymptotic arguments that recast the L1 minimization problem (2.5) as a linear programming one: these arguments are standard (see e.g. [50]) and are presented in App. B for the reader's convenience. In our arguments below to deduce the scaling of the generalization error we will mainly use that n_A = O(n); we confirm this fact numerically even in the absence of regularization, provided the weights are initialized small enough. Notice that from Assumption 1 it follows that the predictor in the feature regime corresponding to the minimizer γ* takes the form

f^n(x) = Σ_{j=1}^{n_A} w*_j σ(θ*_j · x).   (2.8)

Neural network representation in the lazy regime. In this regime we approximate the target function f*(x) via the kernel predictor

f^n(x) = Σ_{i=1}^n g_i K^NTK(x · x_i),   (2.9)

where the weights {g_i}_{i=1}^n solve the linear system given by the interpolation constraints f^n(x_j) = f*(x_j), j = 1, ..., n, and the kernel is

K^NTK(x · y) = E_{(θ,w)∼μ0} [ σ(θ · x) σ(θ · y) + w² σ′(θ · x) σ′(θ · y) (x · y) ].   (2.11)

Here μ0 is a fixed probability distribution which, in the NTK training regime [9], is the distribution of the features and weights at initialization. It is well known [51] that the solution of the kernel ridge regression problem can equivalently be expressed, via the kernel trick, as a minimization over functions g_θ and g_w of the feature and weight displacements, subject to interpolation constraints (Eq. 2.13). Another lazy limit can be obtained by training only the weights while keeping the features at their initialization values. This is equivalent to forcing g_θ(θ, w) to vanish in Eq. 2.13, resulting again in a kernel method. The kernel in this case is called the Random Feature Kernel (K^RFK), and can be obtained from Eq. 2.11 by setting dμ0(θ, w) = δ_{w=0} dμ̃0(θ). The minimizer can then be written as in Eq. 2.9 with K^NTK replaced by K^RFK.
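In the lazy regime the predictor can be computed in closed form. Below is a minimal sketch of ridgeless regression with the analytical NTK of a bias-free one-hidden-layer ReLU network, written in the standard arc-cosine form (up to an overall normalization); the target and problem sizes are illustrative.

```python
import numpy as np

def ntk_relu(u):
    """Analytical NTK of a bias-free one-hidden-layer ReLU network, as a
    function of u = x . y for unit-norm inputs (standard arc-cosine form,
    up to an overall normalization)."""
    u = np.clip(u, -1.0, 1.0)
    k0 = (np.pi - np.arccos(u)) / np.pi
    k1 = (u * (np.pi - np.arccos(u)) + np.sqrt(1.0 - u ** 2)) / np.pi
    return u * k0 + k1

rng = np.random.default_rng(0)
d, n = 3, 200
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # training points on S^{d-1}
y = X[:, 0]                                     # illustrative smooth target

# Solve the interpolation constraints (tiny jitter for numerical stability).
g = np.linalg.solve(ntk_relu(X @ X.T) + 1e-10 * np.eye(n), y)

Xt = rng.standard_normal((500, d))
Xt /= np.linalg.norm(Xt, axis=1, keepdims=True)
test_err = np.mean((ntk_relu(Xt @ X.T) @ g - Xt[:, 0]) ** 2)
```

The predictor is exactly of the form of Eq. 2.9: a sum of n kernel terms centered at the training points.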

Asymptotic analysis of generalization
In this section, we characterize the asymptotic decay of the generalization error ε(n), averaged over several realizations of the target function f*. Denoting by dτ_{d−1}(x) the uniform measure on S^{d−1}, the generalization error reads

ε(n) = E_{f*} ∫_{S^{d−1}} dτ_{d−1}(x) (f^n(x) − f*(x))²,

and ε(n) ∼ n^(−β) means ε(n) ≤ A_d n^(−β) for some constant A_d which might depend on d but not on n. Both in the lazy (see Eq. 2.9) and feature (see Eq. 2.8) regimes, the predictor can be written as a sum of O(n) terms,

f^n(x) = Σ_j g_j ϕ(x · y_j).

In the feature regime, the g_j's (y_j's) coincide with the optimal weights w*_j (features θ*_j), and ϕ with the activation function σ. In the lazy regime, the y_j's are the training points x_j, ϕ is the neural tangent or random feature kernel, and the g_j's are the weights solving Eq. 2.9. We define the density g_n(x) = |S^{d−1}| Σ_j g_j δ(x − y_j) so as to cast the predictor as a convolution on the sphere (Eq. 3.2). Therefore, the projections of f^n onto spherical harmonics factorize as f^n_{k,l} = g^n_{k,l} ϕ_k, where g^n_{k,l} is the projection of g_n(x) and ϕ_k that of ϕ(x · y). For ReLU neurons, the ϕ_k decay as power laws in k (as shown in App. A).

Main Result Consider a target function f* with smoothness exponent ν_t as defined above, with data lying on S^{d−1}. If f* is learnt with a one-hidden-layer network with ReLU neurons in the regimes specified above, then the generalization error follows ε(n) ∼ n^(−β), with an exponent β (Eq. 3.4) that depends on ν_t, d, and the training regime. This is our central result. It implies that if the target function is a smooth isotropic Gaussian field (realized for large ν_t), then lazy beats feature, in the sense that training the network in the lazy regime leads to a better scaling of the generalization error with the number of training points.
Strategy There is no general framework for a rigorous derivation of the generalization error in the ridgeless limit λ → 0: predictions such as that of Eq. 3.4 can be obtained either by assuming that training points (for Eq. 3.4a) or neurons (for Eq. 3.4b) lie on a periodic lattice [44], or (for Eq. 3.4a) by using the replica method from physics [45], as shown in App. F. Here we follow a different route, by first characterizing the form of the predictor for d = 2 (proof in App. C). This property alone allows us to determine the asymptotic scaling of the generalization error. We use it to analytically obtain the generalization error in the NTK case with a slightly simplified function ϕ (details in App. D). This calculation motivates a simple ansatz for the form of g_n(x) entering Eq. 3.2 and of its projections onto spherical harmonics, which extends naturally to arbitrary dimension. We confirm the predictions resulting from this ansatz systematically in numerical experiments.
Properties of the predictor in d = 2 On the unit circle S^1 all points are identified by a polar angle x ∈ [0, 2π). Hence both the target function and the predictor are functions of the angle, and all functions of a scalar product are in fact functions of the difference in angle. In particular, the predictor of Eq. 3.2 becomes a periodic convolution, f^n(x) = ∫ (dy/2π) g_n(y) φ(x − y), where we defined φ(x − y) = ϕ(cos(x − y)). Both in the feature regime and in the NTK limit, the first derivative of φ(x) is continuous except at two values of x (0 and π for lazy, −π/2 and π/2 for feature), so that φ''(x) has a singular part consisting of two Dirac delta functions.
As a result, the second derivative of the predictor f^n has a singular part consisting of many Dirac deltas. Denoting by (f^n)''_r the regular part, obtained by subtracting all the delta functions, we can show the following (see App. C):

Proposition 1. In the large-n limit, the predictor displays a singular second derivative at O(n) points.

Proposition 1 implies that outside of these singular points the second derivative is well defined. Thus, as n gets large and the singular points approach each other, the predictor can be approximated by a chain of parabolas, as highlighted in Fig. 3 and noticed in [47] for a Laplace kernel. This property alone allows one to determine the asymptotic scaling of the error in d = 2. In simple terms, Prop. 1 follows from the convergence of g_n to the function g satisfying f*(x) = ∫ (dy/2π) g(y) φ_r(x − y), which is guaranteed under our assumptions on the target function; a detailed proof is given in App. C.
Decay of the error in d = 2 (sketch) The full calculation is in App. D. Consider a slightly simplified problem where φ has a single discontinuity in its derivative, located at x = 0. In this case, f^n(x) is singular if and only if x is a data point. Consider then the interval x ∈ [x_i, x_{i+1}] and set δ_i = x_{i+1} − x_i. Since the distances δ_i between adjacent singular points are random variables with mean of order 1/n and finite moments, it is straightforward to obtain that ε(n) ∼ n^(−4). Note that for this asymptotic argument to apply to the feature learning regime, one must ensure that the distribution of the rescaled distance nδ_i between adjacent singularities has a finite fourth moment. This is obvious in the lazy regime, where the δ_i's are controlled by the positions of the training points, but not in the feature regime, where the distribution of singular points is determined by that of the neurons' features. Nevertheless, we show that it must be the case in our setup in App. D.

Interpretation in terms of spectral bias From the discussion above it is evident that there is a length scale δ of order 1/n such that f^n(x) is a good approximation of f*(x) over scales larger than δ. In terms of Fourier modes, one has: i) f^n(k) matches f*(k) at long wavelengths, i.e. for k ≪ k_c ∼ n; ii) since the phases exp(ikx_j) become effectively random for k ≫ k_c, g_n(k) = Σ_j g_j exp(ikx_j) becomes a Gaussian random variable with zero mean and fixed variance. For ν_t > 2, one has Σ_j g_j² ∼ n^(−1). It follows (see App. E for details) that the error is dominated by the predictor contribution, hence entirely controlled by the Fourier coefficients f^n(k) at large k. A smoother predictor corresponds to a faster decay of f^n(k) with k, and thus to a faster decay of the error with n. Plugging in the relevant decays yields ε ∼ n^(−4) for the feature regime and for the lazy regime with the NTK, and ε ∼ n^(−6) for the lazy regime with the RFK (which is smoother than the NTK). For ν_t ≤ 2, the predictor and target contributions have comparable magnitude (see App. E), thus ε ∼ n^(−2ν_t).
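The finite-moment claim for the rescaled gaps nδ_i can be checked directly in the lazy regime, where the singularities sit at the training points. A minimal simulation (sizes and trial counts are arbitrary choices):

```python
import numpy as np

# Empirical fourth moment of the rescaled gaps n*delta_i between
# adjacent points, for n points drawn uniformly on the circle S^1.
rng = np.random.default_rng(0)
n, trials = 1000, 200
m4_samples = []
for _ in range(trials):
    x = np.sort(rng.uniform(0, 2 * np.pi, n))
    gaps = np.diff(np.concatenate([x, [x[0] + 2 * np.pi]]))  # n gaps, wrap around
    m4_samples.append(np.mean((n * gaps) ** 4))
m4 = np.mean(m4_samples)   # fourth moment of the rescaled gap
```

For uniform points the rescaled gap is asymptotically exponential with mean 2π, so m4 concentrates near 4!(2π)^4 ≈ 3.7 × 10^4; what matters for the asymptotic argument is only that it is finite.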

Generalization to higher dimensions
The argument above can be generalized to any d by replacing Fourier modes with projections onto spherical harmonics. The characteristic distance between training points scales as n^(−1/(d−1)), thus k_c ∼ n^(1/(d−1)). Our ansatz is that, as in d = 2: i) for k ≪ k_c, the predictor modes coincide with those of the target function, f^n_{k,l} ≈ f*_{k,l} (this corresponds to the spectral-bias result of kernel methods, stating that the predictor reproduces the first O(n) projections of the target in the kernel eigenbasis [45]); ii) for k ≫ k_c, g^n_{k,l} is a sum of uncorrelated terms, and thus a Gaussian variable with zero mean and fixed variance. As shown in App. E, from this ansatz it is straightforward to obtain Eq. 3.4. Notice again that when the target is sufficiently smooth, so that the predictor-dependent term dominates, the error is determined by the smoothness of the predictor. In particular, for d > 2, the feature-regime predictor is less smooth than both the NTK and RFK ones, due to the slower decay of the corresponding ϕ_k.

Numerical tests of the theory
We successfully test our predictions by computing the learning curves of both the lazy and feature regimes when (i) the target function is constant on the sphere, for varying d (see Fig. 4), and (ii) the target is a Gaussian random field with varying smoothness ν_t, as shown in Fig. G.1 of App. G. For the lazy regime, we perform kernel regression using the analytical expression of the NTK [52] (see also Eq. A.19). For the feature regime, we find that our predictions hold in the presence of a small regularization, although it takes prohibitively long times for gradient descent to exactly recover the minimal-norm solution; a more in-depth discussion can be found in App. G. An example of the atomic distribution of neurons found after training, which contrasts with the initial distribution, is displayed in Fig. 5a, left panel.
Another way to obtain sparse features is to initialize the network with very small weights [14], as proposed in [8]. As in the presence of an infinitesimal weight decay, this scheme also leads to sparse solutions.

Evidence for overfitting along diffeomorphisms in image datasets
For fully-connected networks, the feature regime is well adapted to learning anisotropic tasks [16]: if the target function does not depend on a certain linear subspace of input space, e.g. the pixels at the corners of an image, then neurons align perpendicularly to these directions [19]. By contrast, our results highlight a drawback of this regime when the target function is constant or smooth along directions of input space that require a continuous distribution of neurons to be represented. In such a case, the adaptation of the weights to the training points leads to a predictor with a sparse representation. Such a predictor is less smooth than in the lazy regime and thus underperforms.
Does this view hold for images, and does it explain why learning their features is detrimental for fully-connected networks? A first piece of positive empirical evidence is that the distribution of neurons of networks trained on image data indeed becomes sparse in the feature regime, as illustrated in Fig. 5a, right, for CIFAR10 [28]. This observation raises the question of which directions in input space are those i) along which the target should vary smoothly, and ii) that are not easily represented by a discrete set of neurons. An example of such directions are global translations, which conserve the norm of the input and do not change the image class: the lazy-regime predictor is indeed smoother than the feature one with respect to translations of the input (see App. H). Yet, these transformations live in a space of dimension 2, which is small in comparison with the full dimensionality d of the data, and may thus play a negligible role.
A much larger class of transformations believed to have little effect on the target are small diffeomorphisms [29]. A diffeomorphism τ acting on an image is illustrated in Fig. 5b, which highlights that our brain still perceives the content of the transformed image as in the original one. Near-invariance of the task to these transformations is believed to play a key role in the success of deep learning, and in explaining how neural networks beat the curse of dimensionality [48]. Indeed, if modern architectures can become insensitive to these transformations, then the dimensionality of the problem is considerably reduced. In fact, it was found that the architectures displaying the best performance are precisely those which learn to vary smoothly along such transformations [49].
Small diffeomorphisms are likely the directions we are looking for. To test this hypothesis, following [49], we characterize the smoothness of a function along such diffeomorphisms, relative to that along random directions of input space. Specifically, we use the relative sensitivity

R_f = E_{x,τ} ||f(τx) − f(x)||² / E_{x,η} ||f(x + η) − f(x)||².   (5.1)

In the numerator, the average is taken over the test set and over an ensemble of diffeomorphisms, reviewed in App. I. The magnitude of the diffeomorphisms is chosen so that each pixel is shifted by one on average. In the denominator, the average runs over the test set and over vectors η sampled uniformly on the sphere of radius ||η|| = E_{x,τ} ||τx − x||, which fixes the magnitude of the transformations.
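A minimal sketch of how such a relative sensitivity can be estimated is given below. As a hypothetical stand-in for the diffeomorphism ensemble of App. I (not reproduced here), the smooth transformation is a one-pixel circular translation.

```python
import numpy as np

def rel_sensitivity(f, imgs, n_dir=10, rng=None):
    """Estimate Eq. (5.1): mean squared variation of a predictor f under
    a smooth transformation tau, divided by its variation under random
    perturbations eta of matched norm. Here tau is a one-pixel circular
    translation, a stand-in for the diffeomorphisms of App. I."""
    rng = np.random.default_rng(rng)
    num, den = [], []
    for img in imgs:
        tx = np.roll(img, shift=1, axis=1)      # smooth transformation tau
        delta = np.linalg.norm(tx - img)
        num.append((f(tx) - f(img)) ** 2)
        for _ in range(n_dir):
            eta = rng.standard_normal(img.shape)
            eta *= delta / np.linalg.norm(eta)  # matched magnitude
            den.append((f(img + eta) - f(img)) ** 2)
    return np.mean(num) / max(np.mean(den), 1e-30)

imgs = np.random.default_rng(1).standard_normal((5, 16, 16))
R_invariant = rel_sensitivity(lambda im: float(im.mean()), imgs, rng=0)
```

A translation-invariant predictor such as the pixel mean yields R_f ≈ 0, while a predictor that varies strongly along the transformation yields R_f of order one or larger.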
We measure R_f as a function of n for three benchmark datasets of images, as shown in Fig. 6.
We indeed find that R_f is consistently smaller in the lazy training regime, where features are not learned. Overall, this observation supports the view that learning sparse features is detrimental when the data present (near-)invariance to transformations that cannot be represented sparsely by the architecture considered. Fig. 1 supports the idea that, for benchmark image datasets, this negative effect overcomes the well-known positive effects of learning features, e.g. becoming insensitive to pixels at the edges of images (see App. H for evidence of this effect).

Conclusion
Our central result is that learning sparse features can be detrimental if the task presents invariance or smooth variations along transformations that are not adequately captured by the neural network architecture.For fully-connected networks, these transformations can be rotations of the input, but also continuous translations and diffeomorphisms.
Our analysis relies on the sparsity of the features learned by a shallow fully-connected architecture: even in the infinite-width limit, when trained in the feature learning regime such networks behave as O(n) neurons. The asymptotic analysis we perform for Gaussian random fields on the sphere leads to predictions for the learning-curve exponent β in different training regimes, which we verify. Results of this kind are scarce in the literature.
Note that our analysis focuses on ReLU neurons because (i) they are very often used in practice and (ii) in that case β depends on the training regime, allowing for stringent numerical tests. If smooth activations (e.g. softplus) are considered, we expect that learning features will still be detrimental for generalization. Yet, the difference will not appear in the exponent β, but in other aspects of the learning curves (including numerical coefficients and pre-asymptotic effects) that are harder to predict.
Most fundamentally, our results underline that the success of feature learning in modern architectures still lacks a sufficient explanation. Indeed, most of the theoretical studies that previously emphasized the benefits of learning features have considered fully-connected networks, for which learning features can be a drawback in practice. It is tempting to argue that in modern architectures learning features is not a disadvantage because smoothness along diffeomorphisms can be enforced from the start, thanks to the locally connected, convolutional, and pooling layers [53,29]. Yet the best architectures often do not perform pooling and are not stable toward diffeomorphisms at initialization.
During training, learning features leads to more stable and smoother solutions along diffeomorphisms [54,49].Understanding why building sparse features enhances stability in these architectures may ultimately explain the magical feat of deep CNNs: learning tasks in high dimensions.

A Quick recap of spherical harmonics
Spherical harmonics This appendix collects some introductory background on spherical harmonics and dot-product kernels on the sphere [55]; see [56,57] for an expanded treatment. Spherical harmonics are homogeneous harmonic polynomials restricted to the sphere. Given the polynomial degree k ∈ N, there are N_{k,d} linearly independent spherical harmonics of degree k on S^{d−1}, with N_{k,d} ≍ k^{d−2} for k → ∞. Thus, for each k we can introduce a set of N_{k,d} spherical harmonics Y_{k,l}, with l ranging in 1, ..., N_{k,d}, which are orthonormal with respect to the uniform measure dτ(x) on the sphere. Because of the orthogonality of homogeneous polynomials of different degree, this set is a complete orthonormal basis of the space of square-integrable functions on S^{d−1}: any such function f can be expanded in the coefficients f_{k,l} = ∫ dτ(x) f(x) Y_{k,l}(x). Furthermore, spherical harmonics are eigenfunctions of the Laplace-Beltrami operator ∆, which is nothing but the restriction of the standard Laplace operator to the sphere.

Legendre polynomials By fixing a direction y in S^{d−1} one can select, for each k, the only spherical harmonic of degree k which is invariant under rotations that leave y unchanged. This particular spherical harmonic is, in fact, a function of x · y and is called the Legendre polynomial of degree k, P_{k,d}(x · y) (also referred to as Gegenbauer polynomial). Legendre polynomials can be written as a combination of the orthonormal spherical harmonics Y_{k,l} via the addition theorem [56, Thm. 2.9],

P_{k,d}(x · y) = (1/N_{k,d}) Σ_{l=1}^{N_{k,d}} Y_{k,l}(x) Y_{k,l}(y).

Alternatively, P_{k,d} is given explicitly as a function of t = x · y ∈ [−1, 1] via the Rodrigues formula [56, Thm. 2.23]. Here Γ denotes the Gamma function, Γ(z) = ∫_0^∞ x^(z−1) e^(−x) dx. Legendre polynomials are orthogonal on [−1, 1] with respect to the measure with density (1 − t²)^((d−3)/2), which is the probability density function of the scalar product between two points on S^{d−1}. To sum up, given x, y ∈ S^{d−1}, functions of x or y can be expressed as sums of projections onto the orthonormal spherical harmonics, whereas functions of x · y can be expressed as sums of projections onto the Legendre polynomials. The relationship between the two expansions is elucidated by the Funk-Hecke formula [56, Thm. 2.22].

NTK and RFK of one-hidden-layer ReLU networks Let E_θ denote expectation over a multivariate normal distribution with zero mean and unit covariance matrix. For any x, y ∈ S^{d−1}, the RFK of a one-hidden-layer ReLU network (Eq. 2.3) with all parameters initialised as independent Gaussian random numbers with zero mean and unit variance reads

K^RFK(x · y) = E_θ[σ(θ · x) σ(θ · y)].

The NTK of the same network reads, with σ′ denoting the derivative of ReLU, i.e. the Heaviside function,

K^NTK(x · y) = K^RFK(x · y) + (x · y) E_θ[σ′(θ · x) σ′(θ · y)].

As functions of a dot product on the sphere, both the NTK and the RFK admit a decomposition in terms of spherical harmonics as in Eq. A.15. For dot-product kernels this expansion coincides with the Mercer decomposition of the kernel [55], that is, the coefficients of the expansion are the eigenvalues of the kernel. The asymptotic decay of the eigenvalues ϕ^NTK_k and ϕ^RFK_k of such kernels can be obtained by applying Eq. A.16 [58, Thm. 1]. Equivalently, one can notice that K^RFK is proportional to the convolution on the sphere of ReLU with itself, therefore ϕ^RFK_k = (ϕ^ReLU_k)². Similarly, the asymptotic decay of ϕ^NTK_k can be related to that of the coefficients of σ′, the derivative of ReLU. Both methods lead to Eq. 3.3 of the main text.
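The orthogonality of Legendre polynomials with respect to the weight (1 − t²)^((d−3)/2) can be checked numerically, using the fact that P_{k,d} is proportional to the Gegenbauer polynomial C_k^α with α = (d − 2)/2; a minimal sketch for d = 5:

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import eval_gegenbauer

# Orthogonality of Legendre (Gegenbauer) polynomials on [-1, 1] with
# respect to the weight (1 - t^2)^((d-3)/2), the density of the scalar
# product of two points on S^{d-1}.
d = 5
alpha = (d - 2) / 2   # Gegenbauer index: P_{k,d} is proportional to C_k^alpha

def inner(k, l):
    integrand = lambda t: (eval_gegenbauer(k, alpha, t)
                           * eval_gegenbauer(l, alpha, t)
                           * (1 - t ** 2) ** ((d - 3) / 2))
    val, _ = quad(integrand, -1, 1)
    return val

off_diag = inner(2, 3)   # distinct degrees: vanishes
diag = inner(2, 2)       # same degree: strictly positive
```

The same quadrature, applied to a function of x · y against P_{k,d}, gives the projections appearing in the Funk-Hecke formula.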
Gaussian random fields and Eq. 2.2 Consider a Gaussian random field f* on the sphere with covariance kernel C(x · y), (A.20). f* can be equivalently specified via the statistics of its coefficients f*_{k,ℓ}. If c_k decays as a power of k, then that power controls the weak differentiability (in the mean-squared sense) of the random field f*. In fact, this can be read off Eq. A.4, and upon averaging over f* one obtains the corresponding condition on the c_k. In addition, this holds for finite but arbitrary d (see Eq. A.1). Hence the summand on the right-hand side of Eq. A.
Alternatively, one can think of ν_t as controlling the scaling of the difference δf* between inputs separated by a distance δ. From Eq. A.20,
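The role of ν_t can be visualized by sampling such a random field on the ring (d = 2). The sketch below assumes, for illustration only, a spectrum c_k = k^{-2ν_t - 1}; the paper's exact normalization may differ, but the qualitative effect, larger ν_t giving smoother sample paths, is the same:

```python
import numpy as np

def sample_field(x, nu_t, k_max=1000, rng=None):
    """Gaussian random field on the ring with spectrum c_k = k^(-2*nu_t - 1).

    Each Fourier mode gets independent Gaussian coefficients of variance c_k,
    so nu_t controls the mean-squared smoothness of the samples.
    """
    rng = rng or np.random.default_rng(0)
    f = np.zeros_like(x)
    for k in range(1, k_max + 1):
        std = np.sqrt(k ** (-2 * nu_t - 1))
        a, b = rng.normal(0.0, std, size=2)
        f += a * np.cos(k * x) + b * np.sin(k * x)
    return f

x = np.linspace(0, 2 * np.pi, 1024, endpoint=False)
rough = sample_field(x, nu_t=0.5)    # barely continuous sample path
smooth = sample_field(x, nu_t=2.5)   # twice differentiable in the m.s. sense

# Mean-squared increments: the rough field fluctuates much more at small scale.
msd = lambda f: np.mean(np.diff(f) ** 2)
```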

B Uniqueness and Sparsity of the L1 minimizer
Recall that we want to find the γ* that solves Eq. B.1. In this appendix, we argue that the uniqueness of γ*, which implies that it is atomic with at most n atoms, is a natural assumption. We start by discretizing the measure γ into H atoms, with H arbitrarily large. Then the problem Eq. B.1 can be rewritten as Eq. B.2. Given w ∈ R^H, let u = max(w, 0) ≥ 0 and v = -max(-w, 0) ≥ 0, so that w = u - v. It is well known (see e.g. [50]) that the minimization problem in (B.2) can be recast in terms of u and v as a linear programming problem (B.3), with e = [1, 1, . . ., 1]^T. Assuming that this problem is feasible (i.e. there is at least one solution to Φu - Φv = y with u ≥ 0, v ≥ 0), it is known to admit extremal solutions, i.e. solutions such that at most n entries of (u*, v*) (and hence of w*) are non-zero. The issue is whether such an extremal solution is unique. Assume that there are two, say (u_1, v_1) and (u_2, v_2). Then every convex combination t(u_1, v_1) + (1 - t)(u_2, v_2) is also a minimizer of (B.3) for all t ∈ [0, 1], with the same minimum value. Generalizing this argument to the case of more than two extremal solutions, we conclude that all minimizers are global, with the same minimum value, and they live on the simplex where e^T(u + v) = e^T(u_1 + v_1). Therefore, non-uniqueness requires that this simplex have a nontrivial intersection with the feasible set where Φu - Φv = y with u ≥ 0, v ≥ 0. We argue that, generically, this will not be the case, i.e. the intersection will be trivial and the extremal solution unique. In particular, since in our case we are in fact interested in the problem (B.1), we can always perturb slightly the discretization of γ into H atoms to guarantee that the extremal solution is unique. Since this is true no matter how large H is, and any Radon measure can be approximated to arbitrary precision by such discretizations, we conclude that the minimizer of (B.1) should be unique as well, with at most n atoms.
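The LP recast and the sparsity of extremal solutions can be checked with an off-the-shelf solver. A minimal sketch with a random Φ and hypothetical sizes (H atoms, n constraints); the simplex-based solver returns a vertex of the feasible set, hence a solution with at most n non-zero entries:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, H = 5, 50                       # n linear constraints, H atoms
Phi = rng.standard_normal((n, H))
y = rng.standard_normal(n)

# min ||w||_1 s.t. Phi w = y, recast as an LP in (u, v) with w = u - v:
#   min e^T (u + v)  s.t.  Phi u - Phi v = y,  u >= 0, v >= 0
c = np.ones(2 * H)
A_eq = np.hstack([Phi, -Phi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
w = res.x[:H] - res.x[H:]
```

Generically the support of `w` has exactly n entries, matching the "at most n atoms" count above.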

C Proof of Proposition 1
In this section, we provide the formal statement and proof of Proposition 1. Let us recall the general form of the predictor in both the lazy and feature regimes in d = 2. From Eq. 3.6, where n is the number of training points for the lazy regime and the number of atoms for the feature regime and, for x ∈ (-π, π], the functions φ are given in Eq. C.2. All these functions φ have jump discontinuities in some derivative: the first derivative for the feature regime and the NTK, the third for the RFK. If the l-th derivative has jump discontinuities, the (l+1)-th exists only in a distributional sense and can generically be written as the sum of a regular function and a sequence of Dirac masses located at the discontinuities. With m denoting the number of such discontinuities, {x_j}_j their locations, and f^{(l)} the l-th derivative of f,

f^{(l+1)}(x) = f_r^{(l+1)}(x) + Σ_{j=1}^m c_j δ(x - x_j),

for some c_j ∈ R, where f_r denotes the regular part of f. Proposition 2. Consider a random target function f* satisfying Eq. 2.1 and the predictor f^n obtained by training a one-hidden-layer ReLU network on n samples (x_i, f*(x_i)) in the feature or in the lazy regime (Eq. C.1). Then, with \hat{f}(k) denoting the Fourier transform of f(x), one has Eq. C.4, where c is a constant (different for each regime). This result implies that, as n → ∞, (f^n)''(x) converges to a function having finite second moment, i.e.
Proof: Because our target functions are random fields that are in L2 with probability one, and the RKHSs of our kernels are dense in that space, we know that the test error vanishes as n → ∞ [59].
As a result, consider first the feature regime and the NTK lazy regime. In both cases φ has two jump discontinuities in the first derivative, located at x = 0, π for the NTK and at x = ±π/2 for the feature regime; therefore we can write the second derivative as the sum of a regular function and two Dirac masses. As a result, the second derivative of the predictor can be written as the sum of a regular part (f^n)''_r and a sequence of 2n Dirac masses. After subtracting the Dirac masses, both sides of Eq. C.1 can be differentiated twice, yielding, in the Fourier representation, Eq. C.9, where we defined g^n and used \widehat{(φ_r)''}(k) = -k² \hat{φ}_r(k). By universal approximation we have Eq. C.11. As a result, combining Eq. C.9 and Eq. C.11 we deduce the claim. To complete the proof using this result, it remains to estimate the scaling of \hat{φ}_r(k) and \hat{φ}(k) in the large-|k| limit.
For the NTK lazy regime, φ_r and -φ are different functions, but they have similar singular expansions near x = 0 and π; therefore their Fourier coefficients display the same asymptotic decay. More specifically, with t = cos(x) (or x = arccos(t)), so that φ(x) = ϕ(t), one obtains, due to Eq. A.17, that Eq. C.4 is satisfied with c = -5. The same procedure can be applied to the RFK lazy regime, with the exception that it is the fourth derivative of φ_RFK that can be written as a regular part plus Dirac masses; one can still obtain the Fourier coefficients of the second derivative's regular part by dividing those of the fourth derivative's regular part by k².
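The mechanism behind Proposition 2, a jump in the first derivative forcing a k^{-2} decay of Fourier coefficients, can be checked numerically on the simplest kinked function, φ(x) = |x| on (-π, π] (an illustrative stand-in, not the actual φ of Eq. C.2):

```python
import numpy as np

# phi(x) = |x| has a jump in its first derivative at x = 0, so its
# exponential Fourier coefficients decay as k^{-2}: |c_k| = 2/(pi k^2)
# for odd k, while even-k coefficients (k != 0) vanish by symmetry.
N = 2 ** 14
x = np.linspace(-np.pi, np.pi, N, endpoint=False)
c = np.fft.rfft(np.abs(x)) / N      # approximates (1/2pi) ∫ |x| e^{-ikx} dx

k_odd = np.arange(1, 40, 2)
ratio = np.abs(c[k_odd]) * np.pi * k_odd ** 2 / 2   # should be ≈ 1
```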

D Asymptotics of generalization in d = 2
In this section we compute the decay of the generalization error with the number of samples n in the following two-dimensional setting, where the x_j's are the training points (as in the NTK case) and ϕ has a single discontinuity in the first derivative, located at 0.
Let us order the training points clockwise on the ring, such that x_1 = 0 and x_{i+1} > x_i for all i = 1, . . ., n, with x_{n+1} := 2π. On each of the x_i the predictor coincides with the target (Eq. D.2). For large enough n, the difference x_{i+1} - x_i is small enough that, within (x_i, x_{i+1}), f^n(x) can be replaced with its Taylor series expansion up to second order. In practice, the predictor looks like the cable of a suspension bridge, with the pillars located at the training points. In particular, we can consider an expansion around x_i^+ := x_i + ε for some ε > 0 and then let ε → 0 from above. By differentiability of f^n in (x_i, x_{i+1}), the second derivative can be computed at any point inside (x_i, x_{i+1}) without changing the order of approximation in Eq. D.3; in particular, we can replace (f^n)''(x_i^+) with c_i, the mean curvature of f^n in (x_i, x_{i+1}). Moreover, as ε → 0, by introducing the limiting slope m_i^+ := lim_{ε→0^+} (f^n)'(x_i + ε), we can write Eq. D.4. Computing Eq. D.4 at x = x_{i+1} yields a closed form for the limiting slope m_i^+ as a function of the mean curvature c_i, the interval length δ_i := x_{i+1} - x_i, and ∆f_i := f*(x_{i+1}) - f*(x_i) (Eq. D.5). The generalization error can then be split into contributions from all the intervals. If ν_t > 2, a Taylor expansion leads to Eq. D.9, where we used that (i) the integral converges to some finite value, due to Proposition 2: from App. C, this integral can be estimated by an expression that indeed converges for ν_t > 2;
(ii) n^{-1} Σ_{i=1}^n (nδ_i)^5 has a deterministic limit for large n. This is clear for the lazy regime, since the distances δ_i between adjacent singularities follow an exponential distribution of mean ∼ 1/n. We expect this result to hold for the feature regime in our set-up as well. Indeed, in the limit n → ∞, the predictor approaches a parabola between singular points, which generically cannot fit more than three random points. There must thus be a singularity at least every two data points with probability approaching unity as n → ∞, which implies that n^{-1} Σ_{i=1}^n (nδ_i)^5 converges to a constant for large n.
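The claim in (ii) for the lazy regime can be verified directly: for n uniform points on the ring, the normalized spacings nδ_i/(2π) are asymptotically Exp(1), so n^{-1} Σ_i (nδ_i/(2π))^5 converges to the fifth moment of Exp(1), 5! = 120 (the 2π normalization is for convenience only):

```python
import numpy as np

rng = np.random.default_rng(0)

def spacing_moment(n, p=5):
    """n^{-1} sum_i (n delta_i / 2pi)^p for n uniform points on the ring."""
    x = np.sort(rng.uniform(0.0, 2 * np.pi, n))
    delta = np.diff(np.concatenate([x, [x[0] + 2 * np.pi]]))  # ring gaps
    return np.mean((n * delta / (2 * np.pi)) ** p)

vals = [spacing_moment(100_000) for _ in range(10)]
mean = np.mean(vals)   # approaches E[X^5] = 120 for X ~ Exp(1)
```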
Finally, for ν_t < 2, the same decomposition into intervals applies, but a second-order Taylor expansion does not hold. The error is then dominated by the fluctuations of f* on the scale of the intervals, as discussed in the main text.

E Asymptotics of generalization via the spectral bias ansatz
According to the spectral bias ansatz, the first n modes of the predictor, f^n_{k,ℓ}, coincide with the corresponding modes of the target function, f*_{k,ℓ}. Therefore, the asymptotic scaling of the error with n is entirely controlled by the remaining modes. After averaging the error over target functions we get Eq. E.2. Let us recall that, with the predictor having the general form in Eq. 3.2, the y_j's denote the training points for the lazy regime and the neuron features for the feature regime. For k ≪ k_c the modes are reconstructed. For k ≫ k_c, due to the highly oscillating nature of Y_{k,ℓ}, the factors Y_{k,ℓ}(y_j) are essentially decorrelated random numbers with zero mean and finite variance, since the values of (Y_{k,ℓ}(y_j))² are bounded via the addition theorem, Eq. A.5. Let us denote this variance by σ_Y². By the central limit theorem, g^n_{k,ℓ} converges to a Gaussian random variable with zero mean and variance σ_Y² Σ_{j=1}^n g_j². As a result, one obtains Eq. E.6, where we have used the definition of f* (Eq. 2.1) to set the expectation of (f*_{k,ℓ})² to c_k.
Large ν_t case When f* is smooth enough, the error is controlled by the predictor term; if Eq. E.5 holds, the function g^n(x) converges to a square-summable function g. In the lazy regime, Eq. E.5 is satisfied when 2ν_t > 2(d - 1) + 4ν (ν = 1/2 for the NTK and 3/2 for the RFK). In the feature regime, where ϕ_k ∼ k^{-(d-1)/2 - 3/2}, Eq. E.5 is satisfied when 2ν_t > (d - 1) + 3. If g^n(x) converges to a square-summable function, then Σ_{j=1}^n g_j² remains bounded. Hence, if ν_t is large enough that Eq. E.5 is satisfied, the asymptotic decay of the error is given by Eq. E.7.
Small ν_t case If Eq. E.5 does not hold, then g^n(x) is not square-summable in the limit n → ∞. However, for large but finite n, only the modes up to the k_c-th are correctly reconstructed, leading to Eq. E.9. Both for feature and lazy regimes, multiplying the term above by Σ_{k≥k_c} N_{k,d} ϕ_k from Eq. E.7 yields n^{-2ν_t/(d-1)}. This is also the scaling of the target-function term, Eq. E.8, implying that for small ν_t one has

ϵ(n) ∼ n^{-2ν_t/(d-1)} (E.10)

both in the feature and in the lazy regimes.
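The spectral bias ansatz itself can be illustrated with plain kernel regression on the ring. The sketch below uses a translation-invariant kernel whose Fourier coefficients decay as k^{-2} (a hypothetical stand-in for the lazy-regime kernels, built from the identity Σ_k 2cos(kθ)/k² = π²/3 - πθ + θ²/2 on [0, 2π]) and checks that modes well below k_c ∼ n are accurately reconstructed:

```python
import numpy as np

rng = np.random.default_rng(1)

def kernel(x, y):
    """Translation-invariant kernel on the ring, Fourier coefficients 2/k^2."""
    d = np.abs(x[:, None] - y[None, :])
    d = np.minimum(d, 2 * np.pi - d)
    return np.pi ** 2 / 3 - np.pi * d + d ** 2 / 2

n = 256
x_tr = rng.uniform(0.0, 2 * np.pi, n)
f_star = lambda t: np.cos(3 * t) + 0.3 * np.sin(7 * t)   # modes k = 3 and 7
alpha = np.linalg.solve(kernel(x_tr, x_tr) + 1e-6 * np.eye(n), f_star(x_tr))

t = np.linspace(0.0, 2 * np.pi, 4096, endpoint=False)
coeffs = np.fft.rfft(kernel(t, x_tr) @ alpha) / len(t)
# Low modes of the predictor match the target: 2*Re(c_3) ≈ 1, 2*Im(c_7) ≈ -0.3
```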

F Spectral bias via the replica calculation
Due to the equivalence with kernel methods, the asymptotic decay of the test error in the lazy regime can be computed with the formalism of [45], which also provides a non-rigorous justification for the spectral bias ansatz. Ranking the eigenvalues from largest to smallest, so that ϕ_ρ denotes the ρ-th eigenvalue, and denoting by c_ρ the variance of the projection of the target onto the ρ-th eigenfunction, one has Eq. F.1. It is convenient to introduce the eigenvalue density (F.2). After changing variables in the delta function, one finds Eq. F.3. This can be used to infer the asymptotics of κ(n). Once the scaling of κ(n) has been determined, the modal contributions to the error can be split according to whether ϕ_ρ ≫ κ(n) or ϕ_ρ ≪ κ(n). The scaling of ϕ_ρ with the rank ρ is determined self-consistently. Notice that κ(n)² scales as n^{-1} Σ_{k≥k_c} N_{k,d} ϕ_k in Eq. E.7, whereas Σ_{ρ≤n} c_ρ/ϕ_ρ² corresponds to Σ_j g_j² in Eq. E.9, so that the first term on the right-hand side of Eq. F.6 matches that of Eq. E.4. The same matching is found for the second term on the right-hand side of Eq. F.6, so that the replica calculation justifies the spectral bias ansatz.
G Training wide neural networks: does gradient descent (GD) find the minimal-norm solution?
In the main text we provided predictions for the asymptotics of the test error of the minimal-norm solution that fits all the training data. Do these predictions hold when the solutions of Eq. 2.5 and Eq. 2.13 are only approximately found by GD? More specifically, is the solution found by GD the minimal-norm one?
Feature Learning We answer these questions by performing full-batch gradient descent in two settings (further details about the trainings are provided in the experiments.md file of the code repository). 1. Min-L1. Here we update the weights and features of Eq. 2.3, with ξ = 0, by following the negative gradient of the regularized loss (G.1) with λ → 0⁺. The weights w_h are initialized to zero and the features are initialized uniformly and constrained to lie on the unit sphere. 2. α-trick. Following [8], here we minimize the rescaled loss (G.2) with α → 0. This trick forces the dynamics away from the lazy regime, as the weights must grow to O(1/α) in order to fit a target of order 1.
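A minimal PyTorch sketch of the α-trick follows. Sizes, learning rates, and the precise loss normalization of Eq. G.2 are hypothetical choices for illustration; the point shown is only that fitting an O(1) target through an α-scaled output drives the weights to O(1/α):

```python
import torch

torch.manual_seed(0)
d, H, n, alpha = 3, 512, 16, 0.1

# n inputs on the unit sphere and a constant target f*(x) = 1.
X = torch.nn.functional.normalize(torch.randn(n, d), dim=1)
y = torch.ones(n)

theta = torch.nn.functional.normalize(torch.randn(H, d), dim=1).requires_grad_()
w = torch.zeros(H, requires_grad=True)

# Separate learning rates: the alpha-scaling makes gradients w.r.t. w and
# theta live on very different scales.
opt = torch.optim.SGD([{"params": [w], "lr": 500.0},
                       {"params": [theta], "lr": 1e-2}])

for _ in range(3000):
    opt.zero_grad()
    f = alpha / H * torch.relu(X @ theta.T) @ w   # alpha-scaled network output
    loss = ((f - y) ** 2).mean() / alpha ** 2     # rescaled square loss
    loss.backward()
    opt.step()
```

After training, the residual is small while max_h |w_h| exceeds 1/α, as the scaling argument predicts.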
In both cases, the solution found by GD is sparse, in the sense that it is supported on a finite number of neurons; in other words, the measure γ(θ) becomes atomic, satisfying Assumption 1. Furthermore, we find that, for Min-L1, the generalization error prediction holds (Fig. 4). Lazy Learning In this case, the correspondence between the solution found by gradient descent and the minimal-norm one is well established [9]. Therefore, numerical experiments are performed here via kernel regression with the analytical NTK, Eq. A.19: given a dataset {x_i, y_i = f*(x_i)}_{i=1}^n, we define the Gram matrix K ∈ R^{n×n} with elements K_ij = K(x_i, x_j) and the vector of target labels y = [y_1, y_2, . . ., y_n]. The q_i's in Eq. 2.9 can then be recovered by solving the linear system y = (1/n) K q. (G.3) Experiments Numerical experiments are run with PyTorch on NVIDIA V100 GPUs (university internal cluster). Details for reproducing the experiments are provided in the experiments.md file of the code repository. Individual trainings run in 1 minute to 1 hour of wall time. We estimate a total of a thousand hours of computing time for the preliminary and final experiments presented in this work.
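Solving the linear system of Eq. G.3 is a one-liner once the NTK Gram matrix is built. The sketch below uses the standard arc-cosine-type expression for the one-hidden-layer ReLU NTK; its normalization may differ from the paper's Eq. A.19 by constant factors, which drop out of the interpolation:

```python
import numpy as np

def ntk_relu(X, Y):
    """One-hidden-layer ReLU NTK on the unit sphere (arc-cosine form;
    normalization conventions may differ from Eq. A.19)."""
    t = np.clip(X @ Y.T, -1.0, 1.0)
    a = np.arccos(t)
    rfk = (np.sin(a) + (np.pi - a) * np.cos(a)) / (2 * np.pi)
    return rfk + t * (np.pi - a) / (2 * np.pi)

rng = np.random.default_rng(0)
n, d = 64, 3
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)   # inputs on S^{d-1}
y = np.ones(n)                                  # constant target

K = ntk_relu(X, X)
q = np.linalg.solve(K / n, y)   # solve y = (1/n) K q  (Eq. G.3)
```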

I Maximum-entropy model of diffeomorphisms
We briefly review here the maximum-entropy model of diffeomorphisms as introduced in [49].
An image can be thought of as a function of the pixel position. A deformation is specified by a smooth displacement field τ whose coefficients C_ij are Gaussian variables of zero mean and variance T/(i² + j²), with T a parameter controlling the deformation magnitude. Once τ is generated, each pixel is displaced according to τ. See Fig. 5b for an example of such a transformation.
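A sketch of such a deformation in the spirit of [49] is below. The sine-mode expansion, mode cutoff, and nearest-pixel displacement are illustrative assumptions; the exact conventions of the original model may differ:

```python
import numpy as np

def max_entropy_deform(img, T=1e-3, cutoff=10, rng=None):
    """Displace pixels by a random field built from sine modes whose
    Gaussian coefficients C_ij have variance T/(i^2 + j^2)."""
    rng = rng or np.random.default_rng(0)
    npix = img.shape[0]
    s = np.linspace(0.0, 1.0, npix)
    u = np.zeros((npix, npix))
    v = np.zeros((npix, npix))
    for i in range(1, cutoff + 1):
        for j in range(1, cutoff + 1):
            std = np.sqrt(T / (i ** 2 + j ** 2))
            basis = np.outer(np.sin(np.pi * i * s), np.sin(np.pi * j * s))
            u += rng.normal(0.0, std) * basis   # x-displacement field
            v += rng.normal(0.0, std) * basis   # y-displacement field
    rows, cols = np.meshgrid(np.arange(npix), np.arange(npix), indexing="ij")
    r = np.clip(np.round(rows + u * npix).astype(int), 0, npix - 1)
    c = np.clip(np.round(cols + v * npix).astype(int), 0, npix - 1)
    return img[r, c]    # pixels are moved, not re-invented

img = np.arange(28 * 28, dtype=float).reshape(28, 28)
out = max_entropy_deform(img)
```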
where {θ_h}_{h=1}^H (the features) and {w_h}_{h=1}^H (the weights) are the network parameters to be optimized, θ⁰_h and w⁰_h denote their values at initialization, ξ is a parameter fixed to ξ = 0 for feature learning and ξ = 1 for lazy training, and σ(x) denotes the ReLU function, σ(x) = max{0, x}. We assume that {θ⁰_h, w⁰_h}_{h=1}^H are drawn independently from a distribution µ⁰ with the following properties: all moments of µ⁰ exist; µ⁰ is absolutely continuous with respect to the Hausdorff measure on S^{d-1} × R; and µ⁰ is centered, i.e. ∫_{S^{d-1}×R} θ dµ⁰(θ, w) = 0 and ∫_{S^{d-1}×R} w dµ⁰(θ, w) = 0.

Figure 3: Feature vs. Lazy Predictor. Predictor of the lazy (left) and feature (right) regimes when learning the constant function on the ring with 8 uniformly sampled training points.
n_A = O(n), an asymptotic dependence confirmed in Fig. G.3 of App. G. This observation implies that our predictions must apply in that case too, as we confirm in Fig. G.3.

Figure 4: Generalization error for a constant target function f*(x) = 1. Generalization error as a function of the training set size n for a network trained in the feature regime with L1 regularization (blue) and kernel regression corresponding to the infinite-width lazy regime (orange). Numerical results (full lines) are plotted together with the exponents predicted by the theory (dashed). Panels correspond to different input-space dimensions (d = 2, 3, 5). Results are averaged over 10 different initializations of the networks and datasets. For d = 2 and large n, the gap between experiments and predictions in the feature regime is due to the finite training time t; indeed, our predictions become more accurate as t increases, as illustrated in the left panel.
Features sparsification. 1st panel: distribution of the neuron features for the task of learning a constant function on the sphere in 2D. Arrows represent a subset of the network features {θ_h}_{h=1}^H after training in the lazy and feature regimes. Training is performed on n = 8 data points (black dots). 2nd panel: FCN trained on CIFAR10. On the axes, the first two principal components of the features {θ_h}_{h=1}^H after training on n = 32 points in the feature (blue) and lazy (orange) regimes. Similarly to what is observed when learning a constant function, the angular distribution of the θ_h becomes sparse with training in the feature regime. (b) Example of diffeomorphism. Sample of a max-entropy deformation τ [49] applied to a natural image, illustrating that it does not change the image class for the human brain.

Figure 5: Features sparsification and example of a diffeomorphism.

Figure 6: Sensitivity to diffeomorphisms vs number of training points. Relative sensitivity of the predictor to small diffeomorphisms of the input images, in the two regimes, for varying number of training points n and different image datasets. Smaller values correspond to a smoother predictor, on average. Results are computed using the same predictors as in Fig. 1.

(A.21) with c_k denoting the eigenvalues of C in Eq. A.15. Notice that the eigenvalues are degenerate with respect to ℓ because the covariance kernel is a function of x · y: as a result, the random function f* is isotropic in law.
(2(π − |x|) cos(x) + sin(|x|)) / (2π) (lazy regime, NTK), ((π − |x|) cos(x) + sin(|x|)) / (2π) (lazy regime, RFK). (C.2)

For Min-L1 (Fig. 4 and Fig. G.1), the minimal-norm solution is effectively recovered, see Fig. G.2. Such clean results in terms of feature positions are difficult to achieve for large n, because the training dynamics becomes very slow and reaching convergence becomes computationally infeasible. Still, we observe that the test error plateaus and reaches its infinite-time limit much earlier than the parameters do, which allows the scaling predictions to hold. 2. The α-trick, however, does not recover the minimal-norm solution, Fig. G.2. Still, the solution found is of the type (2.7), as it is sparse and supported on a number of atoms that scales linearly with n (Fig. G.3, left). For this reason, we find that our predictions for the generalization error hold in this case as well, see Fig. G.3, right.

Figure H.1: Sensitivity to input transformations vs number of training points. Relative sensitivity of the predictor to (left) random 1-pixel translations and (right) white noise added at the boundary of the input images, in the two regimes, for varying number of training points n, when training on FashionMNIST. Smaller values correspond to a smoother predictor, on average. Results are computed using the same predictors as in Fig. 1. Left: for small translations, the behavior is the same as when applying diffeomorphisms. Right: the lazy regime does not distinguish between noise added at the boundary or on the whole image (R_f = 1), while the feature regime becomes more insensitive to the former.