
Triple descent and the two kinds of overfitting: where and why do they appear?*


Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Stéphane d'Ascoli et al J. Stat. Mech. (2021) 124002. DOI: 10.1088/1742-5468/ac3909


Abstract

A recent line of research has highlighted the existence of a 'double descent' phenomenon in deep learning, whereby increasing the number of training examples N causes the generalization error of neural networks (NNs) to peak when N is of the same order as the number of parameters P. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when N is equal to the input dimension D. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when NNs are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at N = P is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in NNs). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at N = D is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep NNs.


Introduction

A few years ago, deep neural networks (NNs) achieved breakthroughs in a variety of contexts [1–4]. However, their remarkable generalization abilities have so far defied rigorous understanding [5–7]: classical learning theory predicts that the generalization error should follow a U-shaped curve as the number of parameters P increases, and a monotonic decrease as the number of training examples N increases. Instead, recent developments show that deep NNs, as well as other machine learning models, exhibit a starkly different behavior. In the absence of regularization, increasing P and N respectively yields parameter-wise and sample-wise double descent curves [8–13], whereby the generalization error first decreases, then peaks at the interpolation threshold (at which point the training error vanishes), then decreases monotonically again. This peak (see footnote 3) was shown to be related to a sharp increase in the variance of the estimator [9, 18], and can be suppressed by regularization or ensembling procedures [9, 19, 20].

Although double descent has only recently gained interest in the context of deep learning, a seemingly similar phenomenon has been well known for several decades for simpler models such as least squares regression [14, 15, 21–23], and has recently been studied in more detail in an attempt to shed light on the double descent curve observed in deep learning [24–27]. However, in the context of linear models, the number of parameters P is not a free parameter: it is necessarily equal to the input dimension D. The interpolation threshold occurs at N = D, and coincides with a peak in the test loss which we refer to as the linear peak. For NNs with nonlinear activations, the interpolation threshold surprisingly becomes independent of D and is instead observed when the number of training examples is of the same order as the total number of trainable parameters, i.e. N ∼ P: we refer to the corresponding peak as the nonlinear peak.

Somewhere in between these two scenarios lies the case of NNs with linear activations. They have P > D parameters, but only D of them are independent: the interpolation threshold occurs at N = D. However, their dynamical behavior shares some similarities with that of deep nonlinear networks, and their analytical tractability has earned them significant attention [6, 28, 29]. A natural question is the following: what would happen for a 'quasi-linear' network, e.g. one that uses a sigmoidal activation function with a high saturation plateau? Would the overfitting peak be observed both at N = D and N = P, or would it somehow lie in between?

In this work, we unveil the similarities and the differences between the linear and nonlinear peaks. In particular, we address the following questions:

  • Are the linear and nonlinear peaks two different phenomena?
  • If so, can both be observed simultaneously, and can we differentiate their sources?
  • How are they affected by the activation function? Can they both be suppressed by regularizing or ensembling? Do they appear at the same time during training?

Contribution. In modern NNs, the double descent phenomenon is mostly studied by increasing the number of parameters P (figure 1, left), and more rarely, by increasing the number of training examples N (figure 1, middle) [13]. The analysis of linear models is instead performed by varying the ratio P/N. By studying the full (P, N) phase space (figure 1, right), we disentangle the role of the linear and the nonlinear peaks in modern NNs, and elucidate the role of the input dimension D.

Figure 1. (Left) The parameter-wise profile of the test loss exhibits double descent, with a peak at P = N. (Middle) The sample-wise profile can, at high noise, exhibit a single peak at N = P, a single peak at N = D, or a combination of the two (triple descent, see footnote 4) depending on the degree of nonlinearity of the activation function. (Right) Color-coded location of the peaks in the (P, N) phase space.

In section 1, we demonstrate that the linear and nonlinear peaks are two different phenomena by showing that they can co-exist in the (P, N) phase space in noisy regression tasks. This leads to a sample-wise triple descent, as sketched in figure 1. We consider both an analytically tractable model of random features (RF) [30] and a more realistic model of NNs.

In section 2, we provide a theoretical analysis of this phenomenon in the random feature model. We examine the eigenspectrum of random feature Gram matrices and show that whereas the nonlinear peak is caused by the presence of small eigenvalues [6], the small eigenvalues causing the linear peak gradually disappear when the activation function becomes nonlinear: the linear peak is implicitly regularized by the nonlinearity. Through a bias-variance decomposition of the test loss, we reveal that the linear peak is solely caused by overfitting the noise corrupting the labels, whereas the nonlinear peak is also caused by the variance due to the initialization of the random feature vectors (which plays the role of the initialization of the weights in NNs).

Finally, in section 3, we present the phenomenological differences which follow from the theoretical analysis. Increasing the degree of nonlinearity of the activation function weakens the linear peak and strengthens the nonlinear peak. We also find that the nonlinear peak can be suppressed by regularizing or ensembling, whereas the linear peak cannot, since it is already implicitly regularized. Lastly, we note that the nonlinear peak appears much later under gradient descent dynamics than the linear peak, since it is caused by small eigenmodes which are slow to learn.

Related work. Various sources of sample-wise non-monotonicity have been observed since the 1990s, from linear regression [14] to simple classification tasks [31, 32]. In the context of adversarial training, [33] shows that increasing N can help or hurt generalization depending on the strength of the adversary. In the non-parametric setting of [34], an upper bound on the test loss is shown to exhibit multiple descent, with peaks at each $N={D}^{i},i\in \mathbb{N}$.

Two concurrent papers also discuss the existence of a triple descent curve, albeit of a different nature to ours. On the one hand, [19] observes a sample-wise triple descent in a non-isotropic linear regression task. In their setup, the two peaks stem from the block structure of the covariance of the input data, which presents two eigenspaces of different variance; both peaks boil down to what we call 'linear peaks'. Reference [35] pushed this idea to the extreme by designing the covariance matrix in such a way as to make an arbitrary number of linear peaks appear.

On the other hand, [36] presents a parameter-wise triple descent curve in a regression task using the neural tangent kernel of a two-layer network. Here the two peaks stem from the block structure of the covariance of the random feature Gram matrix, which contains a block of linear size in the input dimension (features of the second layer, i.e. the ones studied here), and a block of quadratic size (features of the first layer). In this case, both peaks are 'nonlinear peaks'.

The triple descent curve presented here is of different nature: it stems from the general properties of nonlinear projections, rather than the particular structure chosen for the data [19] or regression kernel [36]. To the best of our knowledge, the disentanglement of linear and nonlinear peaks presented here is novel, and its importance is highlighted by the abundance of papers discussing both kinds of peaks.

On the analytical side, our work directly uses the results for high-dimensional RF models derived in [11, 37] (for the test loss), [38] (for the spectral analysis) and [20] (for the bias-variance decomposition).

Reproducibility. We release the code necessary to reproduce the data and figures in this paper publicly at https://github.com/sdascoli/triple-descent-paper.

1. Triple descent in the test loss phase space

We compute the (P, N) phase space of the test loss in noisy regression tasks to demonstrate the triple descent phenomenon. We start by introducing the two models which we will study throughout the paper: on the analytical side, the random feature model, and on the numerical side, a teacher–student task involving NNs trained with gradient descent.

Dataset. For both models, the input data $\boldsymbol{X}\in {\mathbb{R}}^{N\times D}$ consists of N vectors in D dimensions whose elements are drawn i.i.d. from $\mathcal{N}(0,1)$. For each model, there is an associated label generator ${f}^{\star }$ whose labels are corrupted by additive Gaussian noise: $y={f}^{\star }(\boldsymbol{x})+{\epsilon}$, where the noise variance is inversely proportional to the signal-to-noise ratio (SNR): ${\epsilon}\sim \mathcal{N}(0,1/\mathrm{SNR})$.

1.1. RF regression (RF model)

Model. We consider the RF model introduced in [30]. It can be viewed as a two-layer NN whose first layer is a fixed random matrix $\mathbf{\Theta }\in {\mathbb{R}}^{P\times D}$ containing the P random feature vectors (see figure 2 and footnote 5):

Equation (1)

σ is a pointwise activation function, the choice of which will be of prime importance in this study. The ground truth is a linear model given by ${f}^{\star }(\boldsymbol{x})=\langle \boldsymbol{\beta },\boldsymbol{x}\rangle /\sqrt{D}$. Elements of Θ and β are drawn i.i.d. from $\mathcal{N}(0,1)$.
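For concreteness, the RF predictor described above takes the standard form below (our reconstruction following the convention of [30]; the overall normalization of a in equation (1) may differ):

$f(\boldsymbol{x})={\sum }_{i=1}^{P}{a}_{i}\,\sigma \left(\frac{\langle {\mathbf{\Theta }}_{i},\boldsymbol{x}\rangle }{\sqrt{D}}\right)=\langle \boldsymbol{a},\sigma (\mathbf{\Theta }\boldsymbol{x}/\sqrt{D})\rangle .$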

Figure 2. Illustration of an RF network.

Training. The second-layer weights, i.e. the elements of a, are obtained via ridge regression with a regularization parameter γ:

Equation (2)

Equation (3)
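As a concrete illustration of the full RF pipeline (data generation as in the Dataset paragraph, random projection, ridge regression on the projected features, test loss on fresh samples), here is a minimal numerical sketch. The sizes, the ridge normalization and the helper names (rf_ridge_fit, rf_predict) are our own illustrative choices, not necessarily the exact conventions of equations (2) and (3):

import numpy as np

def rf_ridge_fit(X, y, P, gamma, rng, sigma=np.tanh):
    # fit the second-layer weights a of an RF model by ridge regression
    N, D = X.shape
    Theta = rng.standard_normal((P, D))              # fixed random first layer
    Z = sigma(X @ Theta.T / np.sqrt(D))              # projected features, shape (N, P)
    a = np.linalg.solve(Z.T @ Z / N + gamma * np.eye(P), Z.T @ y / N)
    return Theta, a

def rf_predict(X, Theta, a, sigma=np.tanh):
    return sigma(X @ Theta.T / np.sqrt(X.shape[1])) @ a

rng = np.random.default_rng(0)
D, P, N, snr, gamma = 100, 1000, 300, 0.2, 1e-3      # illustrative sizes
beta = rng.standard_normal(D)
X = rng.standard_normal((N, D))
y = X @ beta / np.sqrt(D) + rng.standard_normal(N) / np.sqrt(snr)   # noisy linear teacher
Theta, a = rf_ridge_fit(X, y, P, gamma, rng)
X_test = rng.standard_normal((10_000, D))
test_loss = np.mean((rf_predict(X_test, Theta, a) - X_test @ beta / np.sqrt(D)) ** 2)
print(f"test loss: {test_loss:.3f}")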

1.2. Teacher–student regression with NNs (NN model)

Model. We consider a teacher–student NN framework where a student network learns to reproduce the labels of a teacher network. The teacher ${f}^{\star }$ is taken to be an untrained ReLU fully-connected network with three layers of weights and 100 nodes per layer. The student f is a fully-connected network with three layers of weights and nonlinearity σ. Both are initialized with the default PyTorch initialization.

Training. We train the student on the mean-square loss using full-batch gradient descent for 1000 epochs with a learning rate of 0.01 and momentum 0.9 (see footnote 6). We examine the effect of regularization by adding weight decay with parameter 0.05, and the effect of ensembling by averaging over 10 initialization seeds for the weights. All results are averaged over these 10 runs.
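A minimal PyTorch sketch of this teacher–student setup is given below. Widths, sample sizes and the label-noise convention follow the descriptions above, but the script is an illustration of ours rather than the exact training code (which is released at the URL given in the Reproducibility paragraph):

import torch
import torch.nn as nn

torch.manual_seed(0)
D, H, N, snr = 196, 100, 1000, 0.2

def mlp(act):
    # three layers of weights with 100 hidden nodes, default PyTorch initialization
    return nn.Sequential(nn.Linear(D, H), act, nn.Linear(H, H), act, nn.Linear(H, 1))

teacher, student = mlp(nn.ReLU()), mlp(nn.Tanh())    # untrained ReLU teacher, Tanh student

X = torch.randn(N, D)
with torch.no_grad():
    y = teacher(X) + torch.randn(N, 1) / snr ** 0.5  # labels corrupted by noise of variance 1/SNR

# full-batch gradient descent; set weight_decay=0.05 for the regularized variant
opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)
loss_fn = nn.MSELoss()
for epoch in range(1000):
    opt.zero_grad()
    loss_fn(student(X), y).backward()
    opt.step()

X_test = torch.randn(10_000, D)
with torch.no_grad():
    print(loss_fn(student(X_test), teacher(X_test)).item())   # test loss against the clean teacher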

1.3. Test loss phase space

In both models, the key quantity of interest is the test loss, defined as the mean-square loss evaluated on fresh samples $\boldsymbol{x}\sim \mathcal{N}(0,1)$: ${\mathcal{L}}_{g}={\mathbb{E}}_{\boldsymbol{x}}\left[{\left(f(\boldsymbol{x})-{f}^{\star }(\boldsymbol{x})\right)}^{2}\right]$.

In the RF model, this quantity was first derived rigorously in [11], in the high-dimensional limit where N, P, D are sent to infinity with their ratios finite. More recently, a different approach based on the replica method from statistical physics was proposed in [37]; we use this method to compute the analytical phase space. As for the NN model, which operates at finite size D = 196, the test loss is computed over a test set of $10^4$ examples.

In figure 3, we plot the test loss as a function of two intensive ratios of interest: the number of parameters per dimension P/D and the number of training examples per dimension N/D. In the left panel, at high SNR, we observe an overfitting line at N = P, yielding parameter-wise and sample-wise double descent. However, when the SNR becomes smaller than unity (middle panel), the sample-wise profile undergoes triple descent, with a second overfitting line appearing at N = D. A qualitatively identical situation is shown for the NN model in the right panel (see footnote 7).

Figure 3. Logarithmic plot of the test loss in the (P, N) phase space. (a) RF model with SNR = 2, γ = 10^{-1}. (b) RF model with SNR = 0.2, γ = 10^{-1}. The solid arrows emphasize the sample-wise profile, and the dashed lines emphasize the parameter-wise profile. (c) NN model. In all cases, σ = Tanh. Analogous results for different activation functions and values of the SNR are shown in appendix A of the SM.

The case of structured data. The case of structured datasets such as CIFAR10 is discussed in appendix C of the SM. The main differences are (i) the presence of multiple linear peaks at N < D due to the complex covariance structure of the data, as observed in [19, 35], and (ii) the fact that the nonlinear peak is located slightly above the line N = P since the data is easier to fit, as observed in [18].

2. Theory for the RF model

The qualitative similarity between the central and right panels of figure 3 indicates that a full understanding can be gained by a theoretical analysis of the RF model, which we present in this section.

2.1. High-dimensional setup

As is usual for the study of RF models, we consider the following high-dimensional limit:

Equation (4)

Then the key quantities governing the behavior of the system are related to the properties of the nonlinearity around the origin:

Equation (5)

As explained in [41], the Gaussian equivalence theorem [11, 41, 42], which applies in this high-dimensional setting, establishes an equivalence with a Gaussian covariate model in which the nonlinear activation function is replaced by a linear term plus a nonlinear term acting as noise:

Equation (6)

Of prime importance is the degree of linearity r = ζ/η ∈ [0, 1], which indicates the relative magnitudes of the linear and the nonlinear terms (see footnote 8).
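For reference, a common set of conventions consistent with the values of r quoted in section 3.1 (r = 0.5 for ReLU, r ≈ 0.92 for Tanh, r = 0 for the absolute value and r = 1 for a linear activation) is the following; it is our reconstruction, and the exact normalizations of equations (4)–(6) may differ. The high-dimensional limit is N, P, D → ∞ with φ = N/D and ψ = P/D held fixed, the coefficients are

$\eta ={\mathbb{E}}_{z\sim \mathcal{N}(0,1)}\left[\sigma {(z)}^{2}\right],\qquad \zeta ={\left({\mathbb{E}}_{z\sim \mathcal{N}(0,1)}\left[z\,\sigma (z)\right]\right)}^{2},$

and, up to constant terms, the Gaussian equivalence replaces the projected features by a linear part of strength $\sqrt{\zeta }$ plus an effective Gaussian noise of strength $\sqrt{\eta -\zeta }$:

$\sigma \left(\mathbf{\Theta }\boldsymbol{x}/\sqrt{D}\right)\simeq \sqrt{\zeta }\,\frac{\mathbf{\Theta }\boldsymbol{x}}{\sqrt{D}}+\sqrt{\eta -\zeta }\,\boldsymbol{W},\qquad \boldsymbol{W}\sim \mathcal{N}(0,{\boldsymbol{I}}_{P}).$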

2.2. Spectral analysis

As expressed by equation (3), RF regression is equivalent to linear regression on a structured dataset $\boldsymbol{Z}\in {\mathbb{R}}^{N\times P}$, which is projected from the original i.i.d. dataset $\boldsymbol{X}\in {\mathbb{R}}^{N\times D}$. In [6], it was shown that the peak which occurs in unregularized linear regression on i.i.d. data is linked to vanishingly small (but non-zero) eigenvalues in the covariance of the input data. Indeed, according to equation (3), the norm of the interpolator needs to become very large to fit the directions associated with small eigenvalues, yielding high variance.

Following this line, we examine the eigenspectrum of $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$, which was derived in a series of recent papers. The spectral density ρ(λ) can be obtained from the resolvent G(z) [38, 43–45]:

Equation (7)

where ${A}_{\phi }(t)=1+(A(t)-1)\phi $ and ${A}_{\psi }(t)=1+(A(t)-1)\psi $. We solve the implicit equation for A(t) numerically; see for example equation (11) of [38].

In the bottom row of figure 4 (see also the middle panel of figure 5), we show the numerical spectrum obtained for various values of N/D with σ = Tanh, and we superimpose the analytical prediction obtained from equation (7). At N > D, the spectrum separates into two components: one with D large eigenvalues, and the other with P − D smaller eigenvalues. The spectral gap (the distance of the left edge of the spectrum to zero) closes at N = P, causing the nonlinear peak [46], but remains finite at N = D.
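The spectra of figure 4 are straightforward to reproduce numerically. The short sketch below (our own, with illustrative sizes) computes the eigenvalues of Σ = Z^⊤Z/N for several values of N/D and reports the smallest nonzero eigenvalue together with the D-th largest one (the left edge of the 'linear' block):

import numpy as np

rng = np.random.default_rng(0)
D, psi = 100, 10                     # psi = P/D
P = psi * D

for phi in (2, 5, 10, 20):           # phi = N/D; the gap should close near N = P (phi = psi)
    N = phi * D
    X = rng.standard_normal((N, D))
    Theta = rng.standard_normal((P, D))
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    eigs = np.sort(np.linalg.eigvalsh(Z.T @ Z / N))
    nonzero = eigs[eigs > 1e-10]     # for N < P, Sigma is rank-deficient and has exact zero modes
    print(f"N/D={phi:3d}  smallest nonzero eig={nonzero[0]:.2e}  D-th largest={eigs[-D]:.2f}")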

Figure 4. Empirical eigenspectrum of the covariance of the projected features $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ at various values of N/D, with the corresponding test loss curve shown above. Analytics match the numerics even at D = 100. We color the top D eigenvalues in gray, which allows us to separate the linear and nonlinear components at N > D. We set σ = Tanh, P/D = 10, SNR = 0.2, γ = 10^{-5}.
Figure 5. Analytical eigenspectrum of Σ for η = 1, P/D = 10, and ζ = 0, 0.92, 1 (a)–(c). We distinguish linear and nonlinear components by using respectively solid and dashed lines. (a) (Purely nonlinear) The spectral gap vanishes at N = P (i.e. N/D = 10). (c) (Purely linear) The spectral gap vanishes at N = D. (b) (Intermediate) The spectral gap of the nonlinear component vanishes at N = P, but the gap of the linear component does not vanish at N = D.

Figure 5 shows the effect of varying r on the spectrum. We can interpret the results from equation (6):

  • 'Purely nonlinear' (r = 0): this is the case of even activation functions such as x ↦ |x|, which satisfy ζ = 0 according to equation (5). The spectrum of ${\mathbf{\Sigma }}_{nl}=\frac{1}{N}{\boldsymbol{W}}^{\top }\boldsymbol{W}$ follows a Marcenko–Pastur distribution of parameter c = P/N, concentrating around λ = 1 at large N/D. The spectral gap closes at N = P.
  • 'Purely linear' (r = 1): this is the maximal value of r, and is achieved only for linear networks. The spectrum of ${\mathbf{\Sigma }}_{l}=\frac{1}{ND}{(\boldsymbol{X}{\mathbf{\Theta }}^{\top })}^{\top }\boldsymbol{X}{\mathbf{\Theta }}^{\top }$ follows a product Wishart distribution [47, 48], concentrating around λ = P/D = 10 at large N/D. The spectral gap closes at N = D.
  • Intermediate (0 < r < 1): this case encompasses all commonly used activation functions such as ReLU and Tanh. We recognize the linear and nonlinear components, which behave almost independently (they are simply shifted to the left by a factor of r and 1 − r respectively), except at N = D where they interact nontrivially, leading to implicit regularization (see below).

The linear peak is implicitly regularized. As stated previously, one can expect to observe overfitting peaks when Σ is badly conditioned, i.e. when its spectral gap vanishes. This is indeed observed in the purely linear setup at N = D, and in the purely nonlinear setup at N = P. However, in the everyday case where 0 < r < 1, the spectral gap only vanishes at N = P, and not at N = D. The reason for this is that a vanishing gap is symptomatic of a random matrix reaching its maximal rank. Since $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{nl}\right)=\mathrm{min}(N,P)$ and $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{l}\right)=\mathrm{min}(N,P,D)$, we have $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{nl}\right)\geqslant \mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{l}\right)$ at P > D. Therefore, the rank of Σ is imposed by the nonlinear component, which only reaches its maximal rank at N = P. At N = D, the nonlinear component acts as an implicit regularization by compensating the small eigenvalues of the linear component. This causes the linear peak to be implicitly regularized by the presence of the nonlinearity.
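The rank counting behind this argument is easy to check numerically; a small sketch of ours (using Gaussian surrogates for the two components) follows:

import numpy as np

rng = np.random.default_rng(1)
D, P, N = 50, 200, 120                                  # a regime with D < N < P
X = rng.standard_normal((N, D))
Theta = rng.standard_normal((P, D))
W = rng.standard_normal((N, P))                         # effective-noise part of the nonlinear component

Sigma_l = (X @ Theta.T).T @ (X @ Theta.T) / (N * D)     # linear component
Sigma_nl = W.T @ W / N                                  # nonlinear component

print(np.linalg.matrix_rank(Sigma_l), min(N, P, D))     # both 50
print(np.linalg.matrix_rank(Sigma_nl), min(N, P))       # both 120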

What is the linear peak caused by? For 0 < r < 1, the spectral gap vanishes at N = P, causing the norm of the estimator || a || to peak, but it does not vanish at N = D due to the implicit regularization; in fact, the lowest eigenvalue of the full spectrum does not even reach a local minimum at N = D. Nonetheless, a soft linear peak remains as a vestige of what happens at r = 1. What is this peak caused by? A closer look at the spectrum of figure 5(b) clarifies this question. Although the left edge of the full spectrum is not minimal at N = D, the left edge of the linear component, in solid lines, reaches a minimum at N = D. This causes a peak in $\Vert {\mathbf{\Theta }}^{\top }\boldsymbol{a}\Vert $, the norm of the 'linearized network', as shown in appendix B of the SM. This, in turn, entails a different kind of overfitting, as we explain in the next section.

2.3. Bias-variance decomposition

The previous spectral analysis suggests that both peaks are related to some kind of overfitting. To address this issue, we make use of the bias-variance decomposition presented in [20]. The test loss is broken down into four contributions: a bias term and three variance terms stemming from the randomness of (i) the random feature vectors Θ (which plays the role of the initialization variance in realistic networks), (ii) the noise ε corrupting the labels of the training set (noise variance) and (iii) the inputs X (sampling variance). We defer to [20] for further details.
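In the RF model this decomposition can also be estimated by brute force: holding the training inputs X fixed, one resamples the random features Θ and the label noise ε independently and applies the law of total variance to the predictions. The sketch below is a simplified illustration of ours (it conditions on a single draw of X, so the sampling variance is not resolved), not the analytical decomposition of [20]:

import numpy as np

rng = np.random.default_rng(0)
D, P, N, snr, gamma = 50, 500, 100, 0.2, 1e-3
n_theta, n_noise, n_test = 20, 20, 2000

beta = rng.standard_normal(D)
X = rng.standard_normal((N, D))                       # training inputs held fixed
X_test = rng.standard_normal((n_test, D))
f_star = X_test @ beta / np.sqrt(D)
clean_y = X @ beta / np.sqrt(D)

preds = np.zeros((n_theta, n_noise, n_test))          # indexed by (feature seed, noise seed, test point)
for i in range(n_theta):
    Theta = rng.standard_normal((P, D))
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    Z_test = np.tanh(X_test @ Theta.T / np.sqrt(D))
    A = Z.T @ Z / N + gamma * np.eye(P)
    for j in range(n_noise):
        y = clean_y + rng.standard_normal(N) / np.sqrt(snr)
        a = np.linalg.solve(A, Z.T @ y / N)
        preds[i, j] = Z_test @ a

bias2 = np.mean((preds.mean(axis=(0, 1)) - f_star) ** 2)
var_init = np.mean(preds.mean(axis=1).var(axis=0))    # variance due to the random features Theta
var_noise = np.mean(preds.mean(axis=0).var(axis=0))   # variance due to the label noise epsilon
print(f"bias^2={bias2:.3f}  var_init={var_init:.3f}  var_noise={var_noise:.3f}")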

Only the nonlinear peak is affected by initialization variance. In figure 6(a), we show such a decomposition. As observed in [20], the nonlinear peak is caused by an interplay between initialization and noise variance. This peak appears starkly at N = P in the high-noise setup, where noise variance dominates the test loss, but also in the noiseless setup (figure 6(b)), where the residual initialization variance dominates: nonlinear networks can overfit even in the absence of noise.

Figure 6. Bias-variance decomposition of the test loss in the RF model for σ = ReLU and P/D = 100. Regularizing (increasing γ) and ensembling (increasing the number K of initialization seeds we average over) mitigate the nonlinear peak but do not affect the linear peak. (a) K = 1, γ = 10^{-5}, SNR = 0.2. (b) Same but SNR = ∞. (c) Same but K = 10. (d) Same but γ = 10^{-3}.

The linear peak vanishes in the noiseless setup. In stark contrast, the linear peak which appears clearly at N = D in figure 6(a) is caused solely by a peak in noise variance, in agreement with [6]. Therefore, it vanishes in the noiseless setup of figure 6(b). This is expected, as for linear networks the solution to the minimization problem is independent of the initialization of the weights.

3. Phenomenology of triple descent

3.1. The nonlinearity determines the relative height of the peaks

In figure 7, we consider RF models with four different activation functions: absolute value (r = 0), ReLU (r = 0.5), Tanh (r ∼ 0.92) and linear (r = 1). We see that increasing the degree of nonlinearity strengthens the nonlinear peak (by increasing initialization variance) and weakens the linear peak (by increasing the implicit regularization). In appendix A of the SM, we present additional results where the degree of linearity r is varied systematically in the RF model, and show that replacing Tanh by ReLU in the NN setup produces a similar effect. Note that the behavior changes abruptly near r = 1, marking the transition to the linear regime.

Figure 7. Numerical test loss of RF models at finite size (D = 100), averaged over 10 runs. We set P/D = 10, SNR = 0.2 and γ = 10^{-3}.

3.2. Ensembling and regularization only affect the nonlinear peak

It is a well-known fact that regularization [19] and ensembling [9, 20, 49] can mitigate the nonlinear peak. This is shown in figures 6(c) and (d) for the RF model, where ensembling is performed by averaging the predictions of ten RF models with independently sampled random feature vectors. However, we see that these procedures only weakly affect the linear peak. This can be understood by the fact that the linear peak is already implicitly regularized by the nonlinearity for r < 1, as explained in section 2.

In the NN model, we perform a similar experiment by using weight decay as a proxy for the regularization procedure, see figure 8. As in the RF model, both ensembling and regularizing attenuate the nonlinear peak much more than the linear peak.

Figure 8. Test loss phase space for the NN model with σ = Tanh. Weight decay with parameter γ and ensembling over K seeds weaken the nonlinear peak but leave the linear peak untouched. (a) K = 1, γ = 0, SNR = 0.2. (b) Same but K = 10. (c) Same but γ = 0.05.

3.3. The nonlinear peak forms later during training

To study the evolution of the phase space during training dynamics, we focus on the NN model (there are no dynamics involved in the RF model we considered, where the second-layer weights were learnt via ridge regression). In figure 9, we see that the linear peak appears early during training and persists throughout, whereas the nonlinear peak only forms at late times. This can be understood qualitatively as follows [6]: for linear regression, the time required to learn a mode of eigenvalue λ in the covariance matrix is proportional to λ^{-1}. Since the nonlinear peak is due to vanishingly small eigenvalues, which is not the case for the linear peak, the nonlinear peak takes more time to form completely.
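To make the timescale argument explicit, consider gradient flow on the least-squares loss with fixed features of covariance Σ (a standard computation, sketched here for continuous-time dynamics rather than the discrete updates used in our experiments). Writing the error in the eigenbasis $\{({\lambda }_{k},{\boldsymbol{v}}_{k})\}$ of Σ and assuming realizable labels,

$\dot{\boldsymbol{a}}(t)=-\mathbf{\Sigma }\left(\boldsymbol{a}(t)-{\boldsymbol{a}}^{{\ast}}\right)\quad {\Rightarrow}\quad \langle \boldsymbol{a}(t)-{\boldsymbol{a}}^{{\ast}},{\boldsymbol{v}}_{k}\rangle \propto {\mathrm{e}}^{-{\lambda }_{k}t},$

so a mode of eigenvalue λ_k is fit on a timescale proportional to ${\lambda }_{k}^{-1}$, and the near-zero eigenvalues responsible for the nonlinear peak are the last to be learned.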

Figure 9. Test loss phase space for the NN model with σ = Tanh, plotted at various times during training. The linear peak grows first, followed by the nonlinear peak.

4. Conclusion

One of the key challenges in solving tasks with network-like architectures lies in choosing an appropriate number of parameters P given the properties of the training dataset, namely its size N and dimension D. By elucidating the structure of the (P, N) phase space, its dependency on D, and distinguishing the two different types of overfitting which it can exhibit, we believe our results can be of interest to practitioners.

Our results leave room for several interesting follow-up questions, among which are the impact of (1) various architectural choices, (2) the optimization algorithm, and (3) the structure of the dataset. For future work, we will consider extensions along those lines, with particular attention to the structure of the dataset. We believe this will provide deeper insight into data-model matching.

Acknowledgments

We thank Federica Gerace, Armand Joulin, Florent Krzakala, Bruno Loureiro, Franco Pellegrini, Maria Refinetti, Matthieu Wyart and Lenka Zdeborova for insightful discussions. G B acknowledges funding from the French Government under management of Agence Nationale de la Recherche as part of the 'Investissements d'avenir' program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and from the Simons Foundation collaboration Cracking the Glass Problem (No. 454935 to G Biroli).

Broader impact

Due to the theoretical nature of this paper, a broader impact discussion is not easily applicable. However, given the tight interaction of data & model and their impact on overfitting regimes, we believe that our findings and this line of research, in general, may potentially impact how practitioners deal with data.

Appendix A.: Effect of signal-to-noise ratio and nonlinearity

A.1. RF model

In the RF model, varying r can easily be achieved analytically and yields interesting results, as shown in figure 10 (see footnote 9).

Figure 10. Analytical parameter-wise (top, N/D = 10) and sample-wise (bottom, P/D = 10) test loss profiles of the RF model. (Left) Noiseless case, SNR = ∞. (Center) Low noise, SNR = 2. (Right) High noise, SNR = 0.2. We set γ = 10^{-1}.

In the top panel, we see that the parameter-wise profile exhibits double descent for all degrees of linearity r and signal-to-noise ratios SNR, except in the linear case r = 1, which is monotonically decreasing. Increasing the degree of nonlinearity (decreasing r) and the noise (decreasing the SNR) simply makes the nonlinear peak stronger.

In the bottom panel, we see that the sample-wise profile is more complex. In the linear case r = 1, only the linear peak appears (except in the noiseless case). In the nonlinear case r < 1, the nonlinear peak is always visible; as for the linear peak, it is regularized away, except in the strong noise regime SNR < 1 when the degree of nonlinearity is small (r > 0.8), where we observe the triple descent.

Notice that in both the parameter-wise and sample-wise cases, the test loss profiles change smoothly with r, except near r = 1 where the behavior changes abruptly, particularly at low SNR.

One can also mimic these results numerically by considering, as in [38], the following family of piecewise linear functions:

Equation (8)

for which

Equation (9)

Here, α parametrizes the ratio of the slope of the negative part to that of the positive part and allows us to adjust the value of r continuously: α = −1 (r = 0) corresponds to a (shifted) absolute value, α = 1 (r = 1) to a linear function, and α = 0 to a (shifted) ReLU. In figure 11, we show the effect of sweeping α uniformly from −1 to 1 (which causes r to range from 0 to 1). As expected, we see the linear peak become stronger and the nonlinear peak become weaker.
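Under the conventions for ζ and η assumed in section 2.1 above (ζ = E[zσ(z)]^2 and η = E[σ(z)^2], which reproduce r = 0, 0.5, ≈0.92 and 1 for the absolute value, ReLU, Tanh and linear activations respectively), the degree of linearity of this piecewise-linear family can be evaluated by Gauss–Hermite quadrature; a short sketch of ours:

import numpy as np

def degree_of_linearity(sigma):
    # r = (E[z sigma(z)])^2 / E[sigma(z)^2] for z ~ N(0, 1), via probabilists' Gauss-Hermite quadrature
    z, w = np.polynomial.hermite_e.hermegauss(100)
    w = w / w.sum()
    zeta = np.sum(w * z * sigma(z)) ** 2
    eta = np.sum(w * sigma(z) ** 2)
    return zeta / eta

def piecewise(alpha):
    # slope 1 on the positive side, slope alpha on the negative side
    return lambda z: np.where(z > 0, z, alpha * z)

for alpha in (-1.0, 0.0, 1.0):                        # absolute value, ReLU, linear
    print(alpha, round(degree_of_linearity(piecewise(alpha)), 3))
print("tanh", round(degree_of_linearity(np.tanh), 3))  # close to the ~0.92 quoted in the main text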

Figure 11. Moving from a purely nonlinear function to a purely linear function (dark to light colors) strengthens the linear peak and weakens the nonlinear peak.

A.2. NN model

In figure 12, we show the effect of replacing Tanh (r ∼ 0.92) by ReLU (r = 0.5). We still distinguish the two peaks of triple descent, but the linear peak is much weaker in the case of ReLU, as expected from the stronger degree of nonlinearity.

Figure 12. Dynamics of the test loss phase space, with weight decay γ = 0.05. (Top) Tanh. (Bottom) ReLU.

Appendix B.: Origin of the linear peak

In this section, we follow the lines of [37], where the test loss is decomposed in the following way (equation (D.6)):

Equation (10)

Equation (11)

As before, β denotes the linear teacher vector and Θ, a respectively denote the (fixed) first and (learnt) second layer of the student. This insightful expression shows that the loss only depends on the norm of the second layer || a ||, the norm of the linearized network || b ||, and its overlap with the teacher, b · β.

We plot these three terms in figure 13, focusing on the triple descent scenario SNR < 1. In the left panel, we see that the overlap of the student with the teacher is monotonically increasing, and reaches its maximal value at a value of N which increases from D to P as we decrease r from 1 to 0. In the central panel, we see that || a || peaks at N = P, causing the nonlinear peak as expected, but nothing special happens at N = D (except for r = 1). However, in the right panel, we see that the norm of the linearized network peaks at N = D, where we know from the spectral analysis that the gap of the linear part of the spectrum is minimal. This is the origin of the linear peak.
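These three quantities can be probed numerically by sweeping N at fixed P/D and recording the norms; a compact sketch of ours follows (the definition b = Θ^⊤a/√D of the linearized network is our assumption, chosen so that b has the same dimension as the teacher β):

import numpy as np

rng = np.random.default_rng(0)
D, P, snr, gamma = 100, 1000, 0.2, 1e-3
beta = rng.standard_normal(D)
Theta = rng.standard_normal((P, D))

for N in (50, 100, 200, 500, 1000, 2000):             # sweeps through N = D and N = P
    X = rng.standard_normal((N, D))
    y = X @ beta / np.sqrt(D) + rng.standard_normal(N) / np.sqrt(snr)
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    a = np.linalg.solve(Z.T @ Z / N + gamma * np.eye(P), Z.T @ y / N)
    b = Theta.T @ a / np.sqrt(D)                      # 'linearized network', compared with the teacher beta
    print(f"N={N:5d}  ||a||={np.linalg.norm(a):.2f}  ||b||={np.linalg.norm(b):.2f}  "
          f"overlap={b @ beta / (np.linalg.norm(b) * np.linalg.norm(beta)):.2f}")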

Figure 13. Terms entering equation (11), plotted at SNR = 0.2, γ = 10^{-1}.

Appendix C.: Structured datasets

In this section, we examine how our results are affected by considering the realistic case of correlated data. To do so, we replace the Gaussian i.i.d. data by MNIST data, downsampled to 10 × 10 images for the RF model (D = 100) and 14 × 14 images for the NN model (D = 196).

C.1. RF model: data structure does not matter in the lazy regime

We refer to the results in figure 14. Interestingly, the triple descent profile is weakly affected by the correlated structure of this realistic dataset: the linear and nonlinear peaks still appear, respectively at N = D and N = P. However, the spectral properties of $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ are changed in an interesting manner: the two parts of the spectrum are now contiguous, with no gap between the linear part and the nonlinear part.

Figure 14. Spectrum of the covariance of the projected features $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ at various values of N/D, with the corresponding loss curve shown above. We set σ = Tanh, γ = 10^{-5}.

C.2. NN model: the effect of feature learning

As shown in figure 15, the NN model behaves very differently on structured data like CIFAR10. At late times, three main differences with respect to the case of random data can be observed in the low SNR setup:

  • Instead of a clearly defined linear peak at N = D, we observe a large overfitting region at N < D.
  • The nonlinear peak shifts to higher values of N during training.
  • The nonlinear peak always scales sublinearly with P, i.e. it is located at N ∼ P^α with α < 1.

Figure 15. Dynamics of test loss phase space on CIFAR10 with SNR = 0.2, for different activation functions.

Behavior of the linear peaks. As emphasized by [35], a non-trivial structure of the covariance of the input data can strongly affect the number and location of linear peaks. For example, in the context of linear regression, [19] considers Gaussian data drawn from a diagonal covariance matrix ${\Sigma}\in {\mathbb{R}}^{30\times 30}$ with two blocks of different strengths: ${\Sigma}_{i,i}=10$ for i ⩽ 15 and ${\Sigma}_{i,i}=1$ for i > 15. In this situation, two linear peaks are observed: one at N1 = 15, and one at N2 = 30. In the setup of structured data such as CIFAR10, where the covariance is evidently much more complicated, one can expect to see multiple peaks, all located at N ≲ D.

This is indeed what we observe: in the case of Tanh, where the linear peaks are strong, two linear peaks are particularly visible at late times, one at N1 = 10^{-1.5} D and the other at N2 = 10^{-2} D. In the case of ReLU, they are obscured at late times by the strength of the nonlinear peak.

Behavior of the nonlinear peak. We observe that during training, the nonlinear peak shifts toward higher values of N. This phenomenon was already observed in [12], where it is explained through the notion of effective model complexity. The intuition goes as follows: increasing the training time increases the expressivity of a given model, and therefore increases the number of training examples it can overfit.

We argue that this sublinear scaling is a consequence of the fact that structured data is easier to memorize than random data [18]: as the dataset grows large, each new example becomes easier to classify since the network has learnt the underlying rule (in contrast to the case of random data where each new example makes the task harder by the same amount).

Footnotes

  • * This article is an updated version of: d'Ascoli S, Sagun L and Biroli G 2020 Triple descent and the two kinds of overfitting: where and why do they appear? Advances in Neural Information Processing Systems vol 33 ed H Larochelle, M Ranzato, R Hadsell, M F Balcan and H Lin (New York: Curran Associates).

  • 3. Also called the jamming peak, due to similarities with a well-studied phenomenon in the statistical physics literature [14–18].

  • 4. The name 'triple descent' refers to the presence of two peaks instead of just one in the famous 'double descent' curve, but in most cases the test error does not actually descend before the first peak.

  • 5. This model, shown to undergo double descent in [11], has become a cornerstone to study the so-called lazy learning regime of NNs, where the weights stay close to their initial value [39]: assuming ${f}_{{\theta }_{0}} =0$, we have ${f}_{\theta }(\mathbf{x})\approx {\nabla }_{\theta }{f}_{\theta }(\mathbf{x}){\vert }_{\theta ={\theta }_{0}}\cdot (\theta -{\theta }_{0})$ [40]. In other words, lazy learning amounts to a linear fitting problem with a random feature vector ${\left.{\nabla }_{\theta }{f}_{\theta }(\mathbf{x})\right\vert }_{\theta ={\theta }_{0}}$.

  • 6. We use full-batch gradient descent with a small learning rate to reduce the noise coming from the optimization as much as possible. After 1000 epochs, all observables appear to have converged.

  • 7. Note that for NNs, we necessarily have P/D > 1.

  • 8. Note from equation (5) that for non-homogeneous functions such as Tanh, r also depends on the variance of the inputs and of the fixed weights, both set to unity here: intuitively, a smaller variance will yield smaller preactivations, which will lie in the linear region of the Tanh, increasing the effective value of r.

  • 9. We focus here on the practically relevant setup N/D ≫ 1. Note from the (P, N) phase space that things can be more complex at N/D ≲ 1.
