
Triple descent and the two kinds of overfitting: where and why do they appear?*


Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Stéphane d'Ascoli et al J. Stat. Mech. (2021) 124002. DOI: 10.1088/1742-5468/ac3909


Abstract

A recent line of research has highlighted the existence of a 'double descent' phenomenon in deep learning, whereby increasing the number of training examples N causes the generalization error of neural networks (NNs) to peak when N is of the same order as the number of parameters P. In earlier works, a similar phenomenon was shown to exist in simpler models such as linear regression, where the peak instead occurs when N is equal to the input dimension D. Since both peaks coincide with the interpolation threshold, they are often conflated in the literature. In this paper, we show that despite their apparent similarity, these two scenarios are inherently different. In fact, both peaks can co-exist when NNs are applied to noisy regression tasks. The relative size of the peaks is then governed by the degree of nonlinearity of the activation function. Building on recent developments in the analysis of random feature models, we provide a theoretical ground for this sample-wise triple descent. As shown previously, the nonlinear peak at N = P is a true divergence caused by the extreme sensitivity of the output function to both the noise corrupting the labels and the initialization of the random features (or the weights in NNs). This peak survives in the absence of noise, but can be suppressed by regularization. In contrast, the linear peak at N = D is solely due to overfitting the noise in the labels, and forms earlier during training. We show that this peak is implicitly regularized by the nonlinearity, which is why it only becomes salient at high noise and is weakly affected by explicit regularization. Throughout the paper, we compare analytical results obtained in the random feature model with the outcomes of numerical experiments involving deep NNs.


Introduction

A few years ago, deep neural networks (NNs) achieved breakthroughs in a variety of contexts [1–4]. However, their remarkable generalization abilities have so far defied rigorous understanding [5–7]: classical learning theory predicts that the generalization error should follow a U-shaped curve as the number of parameters P increases, and a monotonic decrease as the number of training examples N increases. Instead, recent developments show that deep NNs, as well as other machine learning models, exhibit a starkly different behavior. In the absence of regularization, increasing P and N respectively yields parameter-wise and sample-wise double descent curves [8–13], whereby the generalization error first decreases, then peaks at the interpolation threshold (at which point the training error vanishes), then decreases monotonically again. This peak (see footnote 3) was shown to be related to a sharp increase in the variance of the estimator [9, 18], and can be suppressed by regularization or ensembling procedures [9, 19, 20].

Although double descent has only recently gained interest in the context of deep learning, a seemingly similar phenomenon has been well known for several decades for simpler models such as least squares regression [14, 15, 21–23], and has recently been studied in more detail in an attempt to shed light on the double descent curve observed in deep learning [24–27]. However, in the context of linear models, the number of parameters P is not a free parameter: it is necessarily equal to the input dimension D. The interpolation threshold occurs at N = D, and coincides with a peak in the test loss which we refer to as the linear peak. For NNs with nonlinear activations, the interpolation threshold surprisingly becomes independent of D and is instead observed when the number of training examples is of the same order as the total number of trainable parameters, i.e. N ∼ P: we refer to the corresponding peak as the nonlinear peak.

Somewhere in between these two scenarios lies the case of NNs with linear activations. They have P > D parameters, but only D of them are independent: the interpolation threshold occurs at N = D. However, their dynamical behavior shares some similarities with that of deep nonlinear networks, and their analytical tractability has earned them significant attention [6, 28, 29]. A natural question is the following: what would happen for a 'quasi-linear' network, e.g. one that uses a sigmoidal activation function with a high saturation plateau? Would the overfitting peak be observed both at N = D and N = P, or would it somehow lie in between?

In this work, we unveil the similarities and the differences between the linear and nonlinear peaks. In particular, we address the following questions:

  • Are the linear and nonlinear peaks two different phenomena?
  • If so, can both be observed simultaneously, and can we differentiate their sources?
  • How are they affected by the activation function? Can they both be suppressed by regularizing or ensembling? Do they appear at the same time during training?

Contribution. In modern NNs, the double descent phenomenon is mostly studied by increasing the number of parameters P (figure 1, left), and more rarely, by increasing the number of training examples N (figure 1, middle) [13]. The analysis of linear models is instead performed by varying the ratio P/N. By studying the full (P, N) phase space (figure 1, right), we disentangle the role of the linear and the nonlinear peaks in modern NNs, and elucidate the role of the input dimension D.

Figure 1. (Left) The parameter-wise profile of the test loss exhibits double descent, with a peak at P = N. (Middle) The sample-wise profile can, at high noise, exhibit a single peak at N = P, a single peak at N = D, or a combination of the two (triple descent, see footnote 4) depending on the degree of nonlinearity of the activation function. (Right) Color-coded location of the peaks in the (P, N) phase space.

In section 1, we demonstrate that the linear and nonlinear peaks are two different phenomena by showing that they can co-exist in the (P, N) phase space in noisy regression tasks. This leads to a sample-wise triple descent, as sketched in figure 1. We consider both an analytically tractable model of random features (RF) [30] and a more realistic model of NNs.

In section 2, we provide a theoretical analysis of this phenomenon in the random feature model. We examine the eigenspectrum of random feature Gram matrices and show that whereas the nonlinear peak is caused by the presence of small eigenvalues [6], the small eigenvalues causing the linear peak gradually disappear when the activation function becomes nonlinear: the linear peak is implicitly regularized by the nonlinearity. Through a bias-variance decomposition of the test loss, we reveal that the linear peak is solely caused by overfitting the noise corrupting the labels, whereas the nonlinear peak is also caused by the variance due to the initialization of the random feature vectors (which plays the role of the initialization of the weights in NNs).

Finally, in section 3, we present the phenomenological differences which follow from the theoretical analysis. Increasing the degree of nonlinearity of the activation function weakens the linear peak and strengthens the nonlinear peak. We also find that the nonlinear peak can be suppressed by regularizing or ensembling, whereas the linear peak cannot, since it is already implicitly regularized. Lastly, we note that the nonlinear peak appears much later under gradient descent dynamics than the linear peak, since it is caused by small eigenmodes which are slow to learn.

Related work. Various sources of sample-wise non-monotonicity have been observed since the 1990s, from linear regression [14] to simple classification tasks [31, 32]. In the context of adversarial training, [33] shows that increasing N can help or hurt generalization depending on the strength of the adversary. In the non-parametric setting of [34], an upper bound on the test loss is shown to exhibit multiple descent, with peaks at each $N={D}^{i},i\in \mathbb{N}$.

Two concurrent papers also discuss the existence of a triple descent curve, albeit of a different nature to ours. On the one hand, [19] observes a sample-wise triple descent in a non-isotropic linear regression task. In their setup, the two peaks stem from the block structure of the covariance of the input data, which presents two eigenspaces of different variance; both peaks boil down to what we call 'linear peaks'. Reference [35] pushed this idea to the extreme by designing the covariance matrix in such a way as to make an arbitrary number of linear peaks appear.

On the other hand, [36] presents a parameter-wise triple descent curve in a regression task using the neural tangent kernel of a two-layer network. Here the two peaks stem from the block structure of the covariance of the random feature Gram matrix, which contains a block of linear size in the input dimension (features of the second layer, i.e. the ones studied here), and a block of quadratic size (features of the first layer). In this case, both peaks are 'nonlinear peaks'.

The triple descent curve presented here is of different nature: it stems from the general properties of nonlinear projections, rather than the particular structure chosen for the data [19] or regression kernel [36]. To the best of our knowledge, the disentanglement of linear and nonlinear peaks presented here is novel, and its importance is highlighted by the abundance of papers discussing both kinds of peaks.

On the analytical side, our work directly uses the results for high-dimensional RF models derived in [11, 37] (for the test loss), [38] (for the spectral analysis) and [20] (for the bias-variance decomposition).

Reproducibility. We release the code necessary to reproduce the data and figures in this paper publicly at https://github.com/sdascoli/triple-descent-paper.

1. Triple descent in the test loss phase space

We compute the (P, N) phase space of the test loss in noisy regression tasks to demonstrate the triple descent phenomenon. We start by introducing the two models which we will study throughout the paper: on the analytical side, the random feature model, and on the numerical side, a teacher–student task involving NNs trained with gradient descent.

Dataset. For both models, the input data $\boldsymbol{X}\in {\mathbb{R}}^{N\times D}$ consists of N vectors in D dimensions whose elements are drawn i.i.d. from $\mathcal{N}(0,1)$. For each model, there is an associated label generator ${f}^{\star }$ whose labels are corrupted by additive Gaussian noise: $y={f}^{\star }(\boldsymbol{x})+{\epsilon}$, where the noise variance is inversely proportional to the signal-to-noise ratio (SNR): ${\epsilon}\sim \mathcal{N}(0,1/\mathrm{SNR})$.

1.1. RF regression (RF model)

Model. We consider the RF model introduced in [30]. It can be viewed as a two-layer NN whose first layer is a fixed random matrix $\mathbf{\Theta }\in {\mathbb{R}}^{P\times D}$ containing the P random feature vectors (see figure 2 and footnote 5):

Equation (1)

σ is a pointwise activation function, the choice of which will be of prime importance in this study. The ground truth is a linear model given by ${f}^{\star }(\boldsymbol{x})=\langle \boldsymbol{\beta },\boldsymbol{x}\rangle /\sqrt{D}$. Elements of Θ and β are drawn i.i.d. from $\mathcal{N}(0,1)$.
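For concreteness, the RF predictor described above takes the standard form below (our reconstruction following the convention of [30]; the overall normalization of a in equation (1) may differ):

$f(\boldsymbol{x})={\sum }_{i=1}^{P}{a}_{i}\,\sigma \left(\frac{\langle {\mathbf{\Theta }}_{i},\boldsymbol{x}\rangle }{\sqrt{D}}\right)=\langle \boldsymbol{a},\sigma (\mathbf{\Theta }\boldsymbol{x}/\sqrt{D})\rangle .$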

Figure 2. Illustration of an RF network.

Training. The second-layer weights, i.e. the elements of a, are obtained via ridge regression with a regularization parameter γ:

Equation (2)

Equation (3)
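As a concrete illustration of the full RF pipeline (data generation as in the Dataset paragraph, random projection, ridge regression on the projected features, test loss on fresh samples), here is a minimal numerical sketch. The sizes, the ridge normalization and the helper names (rf_ridge_fit, rf_predict) are our own illustrative choices, not necessarily the exact conventions of equations (2) and (3):

import numpy as np

def rf_ridge_fit(X, y, P, gamma, rng, sigma=np.tanh):
    # fit the second-layer weights a of an RF model by ridge regression
    N, D = X.shape
    Theta = rng.standard_normal((P, D))              # fixed random first layer
    Z = sigma(X @ Theta.T / np.sqrt(D))              # projected features, shape (N, P)
    a = np.linalg.solve(Z.T @ Z / N + gamma * np.eye(P), Z.T @ y / N)
    return Theta, a

def rf_predict(X, Theta, a, sigma=np.tanh):
    return sigma(X @ Theta.T / np.sqrt(X.shape[1])) @ a

rng = np.random.default_rng(0)
D, P, N, snr, gamma = 100, 1000, 300, 0.2, 1e-3      # illustrative sizes
beta = rng.standard_normal(D)
X = rng.standard_normal((N, D))
y = X @ beta / np.sqrt(D) + rng.standard_normal(N) / np.sqrt(snr)   # noisy linear teacher
Theta, a = rf_ridge_fit(X, y, P, gamma, rng)
X_test = rng.standard_normal((10_000, D))
test_loss = np.mean((rf_predict(X_test, Theta, a) - X_test @ beta / np.sqrt(D)) ** 2)
print(f"test loss: {test_loss:.3f}")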

1.2. Teacher–student regression with NNs (NN model)

Model. We consider a teacher–student NN framework where a student network learns to reproduce the labels of a teacher network. The teacher ${f}^{\star }$ is taken to be an untrained ReLU fully-connected network with three layers of weights and 100 nodes per layer. The student f is a fully-connected network with three layers of weights and nonlinearity σ. Both are initialized with the default PyTorch initialization.

Training. We train the student on the mean-square loss using full-batch gradient descent for 1000 epochs with a learning rate of 0.01 and momentum 0.9 (see footnote 6). We examine the effect of regularization by adding weight decay with parameter 0.05, and the effect of ensembling by averaging over 10 initialization seeds for the weights. All results are averaged over these 10 runs.
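A minimal PyTorch sketch of this teacher–student setup is given below. Widths, sample sizes and the label-noise convention follow the descriptions above, but the script is an illustration of ours rather than the exact training code (which is released at the URL given in the Reproducibility paragraph):

import torch
import torch.nn as nn

torch.manual_seed(0)
D, H, N, snr = 196, 100, 1000, 0.2

def mlp(act):
    # three layers of weights with 100 hidden nodes, default PyTorch initialization
    return nn.Sequential(nn.Linear(D, H), act, nn.Linear(H, H), act, nn.Linear(H, 1))

teacher, student = mlp(nn.ReLU()), mlp(nn.Tanh())    # untrained ReLU teacher, Tanh student

X = torch.randn(N, D)
with torch.no_grad():
    y = teacher(X) + torch.randn(N, 1) / snr ** 0.5  # labels corrupted by noise of variance 1/SNR

# full-batch gradient descent; set weight_decay=0.05 for the regularized variant
opt = torch.optim.SGD(student.parameters(), lr=0.01, momentum=0.9, weight_decay=0.0)
loss_fn = nn.MSELoss()
for epoch in range(1000):
    opt.zero_grad()
    loss_fn(student(X), y).backward()
    opt.step()

X_test = torch.randn(10_000, D)
with torch.no_grad():
    print(loss_fn(student(X_test), teacher(X_test)).item())   # test loss against the clean teacher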

1.3. Test loss phase space

In both models, the key quantity of interest is the test loss, defined as the mean-square loss evaluated on fresh samples $\boldsymbol{x}\sim \mathcal{N}(0,1)$: ${\mathcal{L}}_{g}={\mathbb{E}}_{\boldsymbol{x}}\left[{\left(f(\boldsymbol{x})-{f}^{\star }(\boldsymbol{x})\right)}^{2}\right]$.

In the RF model, this quantity was first derived rigorously in [11], in the high-dimensional limit where N, P, D are sent to infinity with their ratios finite. More recently, a different approach based on the replica method from statistical physics was proposed in [37]; we use this method to compute the analytical phase space. As for the NN model, which operates at finite size D = 196, the test loss is computed over a test set of $10^4$ examples.

In figure 3, we plot the test loss as a function of two intensive ratios of interest: the number of parameters per dimension P/D and the number of training examples per dimension N/D. In the left panel, at high SNR, we observe an overfitting line at N = P, yielding parameter-wise and sample-wise double descent. However, when the SNR becomes smaller than unity (middle panel), the sample-wise profile undergoes triple descent, with a second overfitting line appearing at N = D. A qualitatively identical situation is shown for the NN model in the right panel (see footnote 7).

Figure 3. Logarithmic plot of the test loss in the (P, N) phase space. (a) RF model with SNR = 2, γ = 10^{-1}. (b) RF model with SNR = 0.2, γ = 10^{-1}. The solid arrows emphasize the sample-wise profile, and the dashed lines emphasize the parameter-wise profile. (c) NN model. In all cases, σ = Tanh. Analogous results for different activation functions and values of the SNR are shown in appendix A of the SM.

The case of structured data. The case of structured datasets such as CIFAR10 is discussed in appendix C of the SM. The main differences are (i) the presence of multiple linear peaks at N < D due to the complex covariance structure of the data, as observed in [19, 35], and (ii) the fact that the nonlinear peak is located slightly above the line N = P since the data is easier to fit, as observed in [18].

2. Theory for the RF model

The qualitative similarity between the central and right panels of figure 3 indicates that a full understanding can be gained by a theoretical analysis of the RF model, which we present in this section.

2.1. High-dimensional setup

As is usual for the study of RF models, we consider the following high-dimensional limit:

Equation (4)

Then the key quantities governing the behavior of the system are related to the properties of the nonlinearity around the origin:

Equation (5)

As explained in [41], the Gaussian equivalence theorem [11, 41, 42], which applies in this high-dimensional setting, establishes an equivalence with a Gaussian covariate model in which the nonlinear activation function is replaced by a linear term plus a nonlinear term acting as noise:

Equation (6)

Of prime importance is the degree of linearity r = ζ/η ∈ [0, 1], which indicates the relative magnitudes of the linear and the nonlinear terms (see footnote 8).
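For reference, a common set of conventions consistent with the values of r quoted in section 3.1 (r = 0.5 for ReLU, r ≈ 0.92 for Tanh, r = 0 for the absolute value and r = 1 for a linear activation) is the following; it is our reconstruction, and the exact normalizations of equations (4)–(6) may differ. The high-dimensional limit is N, P, D → ∞ with φ = N/D and ψ = P/D held fixed, the coefficients are

$\eta ={\mathbb{E}}_{z\sim \mathcal{N}(0,1)}\left[\sigma {(z)}^{2}\right],\qquad \zeta ={\left({\mathbb{E}}_{z\sim \mathcal{N}(0,1)}\left[z\,\sigma (z)\right]\right)}^{2},$

and, up to constant terms, the Gaussian equivalence replaces the projected features by a linear part of strength $\sqrt{\zeta }$ plus an effective Gaussian noise of strength $\sqrt{\eta -\zeta }$:

$\sigma \left(\mathbf{\Theta }\boldsymbol{x}/\sqrt{D}\right)\simeq \sqrt{\zeta }\,\frac{\mathbf{\Theta }\boldsymbol{x}}{\sqrt{D}}+\sqrt{\eta -\zeta }\,\boldsymbol{W},\qquad \boldsymbol{W}\sim \mathcal{N}(0,{\boldsymbol{I}}_{P}).$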

2.2. Spectral analysis

As expressed by equation (3), RF regression is equivalent to linear regression on a structured dataset $\boldsymbol{Z}\in {\mathbb{R}}^{N\times P}$, which is projected from the original i.i.d. dataset $\boldsymbol{X}\in {\mathbb{R}}^{N\times D}$. In [6], it was shown that the peak which occurs in unregularized linear regression on i.i.d. data is linked to vanishingly small (but non-zero) eigenvalues in the covariance of the input data. Indeed, according to equation (3), the norm of the interpolator needs to become very large to fit the directions associated with small eigenvalues, yielding high variance.

Following this line, we examine the eigenspectrum of $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$, which was derived in a series of recent papers. The spectral density ρ(λ) can be obtained from the resolvent G(z) [38, 43–45]:

Equation (7)

where ${A}_{\phi }(t)=1+(A(t)-1)\phi $ and ${A}_{\psi }(t)=1+(A(t)-1)\psi $. We solve the implicit equation for A(t) numerically; see for example equation (11) of [38].

In the bottom row of figure 4 (see also the middle panel of figure 5), we show the numerical spectrum obtained for various values of N/D with σ = Tanh, and we superimpose the analytical prediction obtained from equation (7). At N > D, the spectrum separates into two components: one with D large eigenvalues, and the other with P − D smaller eigenvalues. The spectral gap (the distance of the left edge of the spectrum to zero) closes at N = P, causing the nonlinear peak [46], but remains finite at N = D.
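The spectra of figure 4 are straightforward to reproduce numerically. The short sketch below (our own, with illustrative sizes) computes the eigenvalues of Σ = Z^⊤Z/N for several values of N/D and reports the smallest nonzero eigenvalue together with the D-th largest one (the left edge of the 'linear' block):

import numpy as np

rng = np.random.default_rng(0)
D, psi = 100, 10                     # psi = P/D
P = psi * D

for phi in (2, 5, 10, 20):           # phi = N/D; the gap should close near N = P (phi = psi)
    N = phi * D
    X = rng.standard_normal((N, D))
    Theta = rng.standard_normal((P, D))
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    eigs = np.sort(np.linalg.eigvalsh(Z.T @ Z / N))
    nonzero = eigs[eigs > 1e-10]     # for N < P, Sigma is rank-deficient and has exact zero modes
    print(f"N/D={phi:3d}  smallest nonzero eig={nonzero[0]:.2e}  D-th largest={eigs[-D]:.2f}")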

Figure 4. Empirical eigenspectrum of the covariance of the projected features $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ at various values of N/D, with the corresponding test loss curve shown above. Analytics match the numerics even at D = 100. We color the top D eigenvalues in gray, which allows us to separate the linear and nonlinear components at N > D. We set σ = Tanh, P/D = 10, SNR = 0.2, γ = 10^{-5}.
Figure 5. Analytical eigenspectrum of Σ for η = 1, P/D = 10, and ζ = 0, 0.92, 1 (a)–(c). We distinguish linear and nonlinear components by using respectively solid and dashed lines. (a) (Purely nonlinear) The spectral gap vanishes at N = P (i.e. N/D = 10). (c) (Purely linear) The spectral gap vanishes at N = D. (b) (Intermediate) The spectral gap of the nonlinear component vanishes at N = P, but the gap of the linear component does not vanish at N = D.

Figure 5 shows the effect of varying r on the spectrum. We can interpret the results from equation (6):

  • 'Purely nonlinear' (r = 0): this is the case of even activation functions such as x ↦ |x|, which satisfy ζ = 0 according to equation (5). The spectrum of ${\mathbf{\Sigma }}_{nl}=\frac{1}{N}{\boldsymbol{W}}^{\top }\boldsymbol{W}$ follows a Marcenko–Pastur distribution of parameter c = P/N, concentrating around λ = 1 at large N/D. The spectral gap closes at N = P.
  • 'Purely linear' (r = 1): this is the maximal value of r, and is achieved only for linear networks. The spectrum of ${\mathbf{\Sigma }}_{l}=\frac{1}{ND}{(\boldsymbol{X}{\mathbf{\Theta }}^{\top })}^{\top }\boldsymbol{X}{\mathbf{\Theta }}^{\top }$ follows a product Wishart distribution [47, 48], concentrating around λ = P/D = 10 at large N/D. The spectral gap closes at N = D.
  • Intermediate (0 < r < 1): this case encompasses all commonly used activation functions such as ReLU and Tanh. We recognize the linear and nonlinear components, which behave almost independently (they are simply shifted to the left by a factor of r and 1 − r respectively), except at N = D where they interact nontrivially, leading to implicit regularization (see below).

The linear peak is implicitly regularized. As stated previously, one can expect to observe overfitting peaks when Σ is badly conditioned, i.e. when its spectral gap vanishes. This is indeed observed in the purely linear setup at N = D, and in the purely nonlinear setup at N = P. However, in the everyday case where 0 < r < 1, the spectral gap only vanishes at N = P, and not at N = D. The reason for this is that a vanishing gap is symptomatic of a random matrix reaching its maximal rank. Since $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{nl}\right)=\mathrm{min}(N,P)$ and $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{l}\right)=\mathrm{min}(N,P,D)$, we have $\mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{nl}\right)\geqslant \mathrm{r}\mathrm{k}\left({\mathbf{\Sigma }}_{l}\right)$ at P > D. Therefore, the rank of Σ is imposed by the nonlinear component, which only reaches its maximal rank at N = P. At N = D, the nonlinear component acts as an implicit regularization by compensating the small eigenvalues of the linear component. This causes the linear peak to be implicitly regularized by the presence of the nonlinearity.
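The rank counting behind this argument is easy to check numerically; a small sketch of ours (using Gaussian surrogates for the two components) follows:

import numpy as np

rng = np.random.default_rng(1)
D, P, N = 50, 200, 120                                  # a regime with D < N < P
X = rng.standard_normal((N, D))
Theta = rng.standard_normal((P, D))
W = rng.standard_normal((N, P))                         # effective-noise part of the nonlinear component

Sigma_l = (X @ Theta.T).T @ (X @ Theta.T) / (N * D)     # linear component
Sigma_nl = W.T @ W / N                                  # nonlinear component

print(np.linalg.matrix_rank(Sigma_l), min(N, P, D))     # both 50
print(np.linalg.matrix_rank(Sigma_nl), min(N, P))       # both 120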

What is the linear peak caused by? For 0 < r < 1, the spectral gap vanishes at N = P, causing the norm of the estimator || a || to peak, but it does not vanish at N = D due to the implicit regularization; in fact, the lowest eigenvalue of the full spectrum does not even reach a local minimum at N = D. Nonetheless, a soft linear peak remains as a vestige of what happens at r = 1. What is this peak caused by? A closer look at the spectrum of figure 5(b) clarifies this question. Although the left edge of the full spectrum is not minimal at N = D, the left edge of the linear component, in solid lines, reaches a minimum at N = D. This causes a peak in $\Vert {\mathbf{\Theta }}^{\top }\boldsymbol{a}\Vert $, the norm of the 'linearized network', as shown in appendix B of the SM. This, in turn, entails a different kind of overfitting, as we explain in the next section.

2.3. Bias-variance decomposition

The previous spectral analysis suggests that both peaks are related to some kind of overfitting. To address this issue, we make use of the bias-variance decomposition presented in [20]. The test loss is broken down into four contributions: a bias term and three variance terms stemming from the randomness of (i) the random feature vectors Θ (which plays the role of the initialization variance in realistic networks), (ii) the noise ε corrupting the labels of the training set (noise variance) and (iii) the inputs X (sampling variance). We defer to [20] for further details.
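In the RF model this decomposition can also be estimated by brute force: holding the training inputs X fixed, one resamples the random features Θ and the label noise ε independently and applies the law of total variance to the predictions. The sketch below is a simplified illustration of ours (it conditions on a single draw of X, so the sampling variance is not resolved), not the analytical decomposition of [20]:

import numpy as np

rng = np.random.default_rng(0)
D, P, N, snr, gamma = 50, 500, 100, 0.2, 1e-3
n_theta, n_noise, n_test = 20, 20, 2000

beta = rng.standard_normal(D)
X = rng.standard_normal((N, D))                       # training inputs held fixed
X_test = rng.standard_normal((n_test, D))
f_star = X_test @ beta / np.sqrt(D)
clean_y = X @ beta / np.sqrt(D)

preds = np.zeros((n_theta, n_noise, n_test))          # indexed by (feature seed, noise seed, test point)
for i in range(n_theta):
    Theta = rng.standard_normal((P, D))
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    Z_test = np.tanh(X_test @ Theta.T / np.sqrt(D))
    A = Z.T @ Z / N + gamma * np.eye(P)
    for j in range(n_noise):
        y = clean_y + rng.standard_normal(N) / np.sqrt(snr)
        a = np.linalg.solve(A, Z.T @ y / N)
        preds[i, j] = Z_test @ a

bias2 = np.mean((preds.mean(axis=(0, 1)) - f_star) ** 2)
var_init = np.mean(preds.mean(axis=1).var(axis=0))    # variance due to the random features Theta
var_noise = np.mean(preds.mean(axis=0).var(axis=0))   # variance due to the label noise epsilon
print(f"bias^2={bias2:.3f}  var_init={var_init:.3f}  var_noise={var_noise:.3f}")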

Only the nonlinear peak is affected by initialization variance. In figure 6(a), we show such a decomposition. As observed in [20], the nonlinear peak is caused by an interplay between initialization and noise variance. This peak appears starkly at N = P in the high-noise setup, where noise variance dominates the test loss, but also in the noiseless setup (figure 6(b)), where the residual initialization variance dominates: nonlinear networks can overfit even in the absence of noise.

Figure 6. Bias-variance decomposition of the test loss in the RF model for σ = ReLU and P/D = 100. Regularizing (increasing γ) and ensembling (increasing the number K of initialization seeds we average over) mitigate the nonlinear peak but do not affect the linear peak. (a) K = 1, γ = 10^{-5}, SNR = 0.2. (b) Same but SNR = ∞. (c) Same but K = 10. (d) Same but γ = 10^{-3}.

The linear peak vanishes in the noiseless setup. In stark contrast, the linear peak which appears clearly at N = D in figure 6(a) is caused solely by a peak in noise variance, in agreement with [6]. Therefore, it vanishes in the noiseless setup of figure 6(b). This is expected, as for linear networks the solution to the minimization problem is independent of the initialization of the weights.

3. Phenomenology of triple descent

3.1. The nonlinearity determines the relative height of the peaks

In figure 7, we consider RF models with four different activation functions: absolute value (r = 0), ReLU (r = 0.5), Tanh (r ∼ 0.92) and linear (r = 1). We see that increasing the degree of nonlinearity strengthens the nonlinear peak (by increasing initialization variance) and weakens the linear peak (by increasing the implicit regularization). In appendix A of the SM, we present additional results where the degree of linearity r is varied systematically in the RF model, and show that replacing Tanh by ReLU in the NN setup produces a similar effect. Note that the behavior changes abruptly near r = 1, marking the transition to the linear regime.

Figure 7. Numerical test loss of RF models at finite size (D = 100), averaged over 10 runs. We set P/D = 10, SNR = 0.2 and γ = 10^{-3}.

3.2. Ensembling and regularization only affect the nonlinear peak

It is a well-known fact that regularization [19] and ensembling [9, 20, 49] can mitigate the nonlinear peak. This is shown in figures 6(c) and (d) for the RF model, where ensembling is performed by averaging the predictions of ten RF models with independently sampled random feature vectors. However, we see that these procedures only weakly affect the linear peak. This can be understood by the fact that the linear peak is already implicitly regularized by the nonlinearity for r < 1, as explained in section 2.

In the NN model, we perform a similar experiment by using weight decay as a proxy for the regularization procedure, see figure 8. As in the RF model, both ensembling and regularizing attenuate the nonlinear peak much more than the linear peak.

Figure 8. Test loss phase space for the NN model with σ = Tanh. Weight decay with parameter γ and ensembling over K seeds weaken the nonlinear peak but leave the linear peak untouched. (a) K = 1, γ = 0, SNR = 0.2. (b) Same but K = 10. (c) Same but γ = 0.05.

3.3. The nonlinear peak forms later during training

To study the evolution of the phase space during training dynamics, we focus on the NN model (there are no dynamics involved in the RF model we considered, where the second-layer weights were learnt via ridge regression). In figure 9, we see that the linear peak appears early during training and persists throughout, whereas the nonlinear peak only forms at late times. This can be understood qualitatively as follows [6]: for linear regression, the time required to learn a mode of eigenvalue λ in the covariance matrix is proportional to λ^{-1}. Since the nonlinear peak is due to vanishingly small eigenvalues, which is not the case for the linear peak, the nonlinear peak takes more time to form completely.
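To make the timescale argument explicit, consider gradient flow on the least-squares loss with fixed features of covariance Σ (a standard computation, sketched here for continuous-time dynamics rather than the discrete updates used in our experiments). Writing the error in the eigenbasis $\{({\lambda }_{k},{\boldsymbol{v}}_{k})\}$ of Σ and assuming realizable labels,

$\dot{\boldsymbol{a}}(t)=-\mathbf{\Sigma }\left(\boldsymbol{a}(t)-{\boldsymbol{a}}^{{\ast}}\right)\quad {\Rightarrow}\quad \langle \boldsymbol{a}(t)-{\boldsymbol{a}}^{{\ast}},{\boldsymbol{v}}_{k}\rangle \propto {\mathrm{e}}^{-{\lambda }_{k}t},$

so a mode of eigenvalue λ_k is fit on a timescale proportional to ${\lambda }_{k}^{-1}$, and the near-zero eigenvalues responsible for the nonlinear peak are the last to be learned.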

Figure 9. Test loss phase space for the NN model with σ = Tanh, plotted at various times during training. The linear peak grows first, followed by the nonlinear peak.

4. Conclusion

One of the key challenges in solving tasks with network-like architectures lies in choosing an appropriate number of parameters P given the properties of the training dataset, namely its size N and dimension D. By elucidating the structure of the (P, N) phase space, its dependency on D, and distinguishing the two different types of overfitting which it can exhibit, we believe our results can be of interest to practitioners.

Our results leave room for several interesting follow-up questions, among which are the impact of (1) various architectural choices, (2) the optimization algorithm, and (3) the structure of the dataset. For future work, we will consider extensions along those lines, with particular attention to the structure of the dataset. We believe this will provide deeper insight into data-model matching.

Acknowledgments

We thank Federica Gerace, Armand Joulin, Florent Krzakala, Bruno Loureiro, Franco Pellegrini, Maria Refinetti, Matthieu Wyart and Lenka Zdeborova for insightful discussions. G B acknowledges funding from the French Government under management of Agence Nationale de la Recherche as part of the 'Investissements d'avenir' program, reference ANR-19-P3IA-0001 (PRAIRIE 3IA Institute) and from the Simons Foundation collaboration Cracking the Glass Problem (No. 454935 to G Biroli).

Broader impact

Due to the theoretical nature of this paper, a broader impact discussion is not easily applicable. However, given the tight interaction of data & model and their impact on overfitting regimes, we believe that our findings and this line of research, in general, may potentially impact how practitioners deal with data.

Appendix A.: Effect of signal-to-noise ratio and nonlinearity

A.1. RF model

In the RF model, varying r can easily be achieved analytically and yields interesting results, as shown in figure 10 (see footnote 9).

Figure 10. Analytical parameter-wise (top, N/D = 10) and sample-wise (bottom, P/D = 10) test loss profiles of the RF model. (Left) Noiseless case, SNR = ∞. (Center) Low noise, SNR = 2. (Right) High noise, SNR = 0.2. We set γ = 10^{-1}.

In the top panel, we see that the parameter-wise profile exhibits double descent for all degrees of linearity r and signal-to-noise ratios SNR, except in the linear case r = 1, which is monotonically decreasing. Increasing the degree of nonlinearity (decreasing r) and the noise (decreasing the SNR) simply makes the nonlinear peak stronger.

In the bottom panel, we see that the sample-wise profile is more complex. In the linear case r = 1, only the linear peak appears (except in the noiseless case). In the nonlinear case r < 1, the nonlinear peak is always visible; as for the linear peak, it is regularized away, except in the strong noise regime SNR < 1 when the degree of nonlinearity is small (r > 0.8), where we observe the triple descent.

Notice that in both the parameter-wise and sample-wise cases, the test loss profiles change smoothly with r, except near r = 1 where the behavior changes abruptly, particularly at low SNR.

One can also mimic these results numerically by considering, as in [38], the following family of piecewise linear functions:

Equation (8)

for which

Equation (9)

Here, α parametrizes the ratio of the slope of the negative part to that of the positive part and allows us to adjust the value of r continuously: α = −1 (r = 0) corresponds to a (shifted) absolute value, α = 1 (r = 1) to a linear function, and α = 0 to a (shifted) ReLU. In figure 11, we show the effect of sweeping α uniformly from −1 to 1 (which causes r to range from 0 to 1). As expected, we see the linear peak become stronger and the nonlinear peak become weaker.
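Under the conventions for ζ and η assumed in section 2.1 above (ζ = E[zσ(z)]^2 and η = E[σ(z)^2], which reproduce r = 0, 0.5, ≈0.92 and 1 for the absolute value, ReLU, Tanh and linear activations respectively), the degree of linearity of this piecewise-linear family can be evaluated by Gauss–Hermite quadrature; a short sketch of ours:

import numpy as np

def degree_of_linearity(sigma):
    # r = (E[z sigma(z)])^2 / E[sigma(z)^2] for z ~ N(0, 1), via probabilists' Gauss-Hermite quadrature
    z, w = np.polynomial.hermite_e.hermegauss(100)
    w = w / w.sum()
    zeta = np.sum(w * z * sigma(z)) ** 2
    eta = np.sum(w * sigma(z) ** 2)
    return zeta / eta

def piecewise(alpha):
    # slope 1 on the positive side, slope alpha on the negative side
    return lambda z: np.where(z > 0, z, alpha * z)

for alpha in (-1.0, 0.0, 1.0):                        # absolute value, ReLU, linear
    print(alpha, round(degree_of_linearity(piecewise(alpha)), 3))
print("tanh", round(degree_of_linearity(np.tanh), 3))  # close to the ~0.92 quoted in the main text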

Figure 11. Moving from a purely nonlinear function to a purely linear function (dark to light colors) strengthens the linear peak and weakens the nonlinear peak.

A.2. NN model

In figure 12, we show the effect of replacing Tanh (r ∼ 0.92) by ReLU (r = 0.5). We still distinguish the two peaks of triple descent, but the linear peak is much weaker in the case of ReLU, as expected from the stronger degree of nonlinearity.

Figure 12. Dynamics of the test loss phase space, with weight decay γ = 0.05. (Top) Tanh. (Bottom) ReLU.

Appendix B.: Origin of the linear peak

In this section, we follow the lines of [37], where the test loss is decomposed in the following way (equation (D.6)):

Equation (10)

Equation (11)

As before, β denotes the linear teacher vector and Θ, a respectively denote the (fixed) first and (learnt) second layer of the student. This insightful expression shows that the loss only depends on the norm of the second layer || a ||, the norm of the linearized network || b ||, and its overlap with the teacher, b · β.

We plot these three terms in figure 13, focusing on the triple descent scenario SNR < 1. In the left panel, we see that the overlap of the student with the teacher is monotonically increasing, and reaches its maximal value at a value of N which increases from D to P as we decrease r from 1 to 0. In the central panel, we see that || a || peaks at N = P, causing the nonlinear peak as expected, but nothing special happens at N = D (except for r = 1). However, in the right panel, we see that the norm of the linearized network peaks at N = D, where we know from the spectral analysis that the gap of the linear part of the spectrum is minimal. This is the origin of the linear peak.
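These three quantities can be probed numerically by sweeping N at fixed P/D and recording the norms; a compact sketch of ours follows (the definition b = Θ^⊤a/√D of the linearized network is our assumption, chosen so that b has the same dimension as the teacher β):

import numpy as np

rng = np.random.default_rng(0)
D, P, snr, gamma = 100, 1000, 0.2, 1e-3
beta = rng.standard_normal(D)
Theta = rng.standard_normal((P, D))

for N in (50, 100, 200, 500, 1000, 2000):             # sweeps through N = D and N = P
    X = rng.standard_normal((N, D))
    y = X @ beta / np.sqrt(D) + rng.standard_normal(N) / np.sqrt(snr)
    Z = np.tanh(X @ Theta.T / np.sqrt(D))
    a = np.linalg.solve(Z.T @ Z / N + gamma * np.eye(P), Z.T @ y / N)
    b = Theta.T @ a / np.sqrt(D)                      # 'linearized network', compared with the teacher beta
    print(f"N={N:5d}  ||a||={np.linalg.norm(a):.2f}  ||b||={np.linalg.norm(b):.2f}  "
          f"overlap={b @ beta / (np.linalg.norm(b) * np.linalg.norm(beta)):.2f}")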

Figure 13. Terms entering equation (11), plotted at SNR = 0.2, γ = 10^{-1}.

Appendix C.: Structured datasets

In this section, we examine how our results are affected by considering the realistic case of correlated data. To do so, we replace the Gaussian i.i.d. data by MNIST data, downsampled to 10 × 10 images for the RF model (D = 100) and 14 × 14 images for the NN model (D = 196).

C.1. RF model: data structure does not matter in the lazy regime

We refer to the results in figure 14. Interestingly, the triple descent profile is weakly affected by the correlated structure of this realistic dataset: the linear and nonlinear peaks still appear, respectively at N = D and N = P. However, the spectral properties of $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ are changed in an interesting manner: the two parts of the spectrum are now contiguous, with no gap between the linear part and the nonlinear part.

Figure 14. Spectrum of the covariance of the projected features $\mathbf{\Sigma }=\frac{1}{N}{\boldsymbol{Z}}^{\top }\boldsymbol{Z}$ at various values of N/D, with the corresponding loss curve shown above. We set σ = Tanh, γ = 10^{-5}.

C.2. NN model: the effect of feature learning

As shown in figure 15, the NN model behaves very differently on structured data like CIFAR10. At late times, three main differences with respect to the case of random data can be observed in the low SNR setup:

  • Instead of a clearly defined linear peak at N = D, we observe a large overfitting region at N < D.
  • The nonlinear peak shifts to higher values of N during training.
  • The nonlinear peak always scales sublinearly with P, i.e. it is located at N ∼ P^α with α < 1.

Figure 15. Dynamics of test loss phase space on CIFAR10 with SNR = 0.2, for different activation functions.

Behavior of the linear peaks. As emphasized by [35], a non-trivial structure of the covariance of the input data can strongly affect the number and location of linear peaks. For example, in the context of linear regression, [19] considers Gaussian data drawn from a diagonal covariance matrix ${\Sigma}\in {\mathbb{R}}^{30\times 30}$ with two blocks of different strengths: ${\Sigma}_{i,i}=10$ for i ⩽ 15 and ${\Sigma}_{i,i}=1$ for i > 15. In this situation, two linear peaks are observed: one at N1 = 15, and one at N2 = 30. In the setup of structured data such as CIFAR10, where the covariance is evidently much more complicated, one can expect to see multiple peaks, all located at N ≲ D.

This is indeed what we observe: in the case of Tanh, where the linear peaks are strong, two linear peaks are particularly visible at late times, one at N1 = 10^{-1.5} D and the other at N2 = 10^{-2} D. In the case of ReLU, they are obscured at late times by the strength of the nonlinear peak.

Behavior of the nonlinear peak. We observe that during training, the nonlinear peak shifts toward higher values of N. This phenomenon was already observed in [12], where it is explained through the notion of effective model complexity. The intuition goes as follows: increasing the training time increases the expressivity of a given model, and therefore increases the number of training examples it can overfit.

We argue that this sublinear scaling is a consequence of the fact that structured data is easier to memorize than random data [18]: as the dataset grows large, each new example becomes easier to classify since the network has learnt the underlying rule (in contrast to the case of random data where each new example makes the task harder by the same amount).

Footnotes

  • * This article is an updated version of: d'Ascoli S, Sagun L and Biroli G 2020 Triple descent and the two kinds of overfitting: where and why do they appear? Advances in Neural Information Processing Systems vol 33 ed H Larochelle, M Ranzato, R Hadsell, M F Balcan and H Lin (New York: Curran Associates).

  • 3. Also called the jamming peak, due to similarities with a well-studied phenomenon in the statistical physics literature [14–18].

  • 4. The name 'triple descent' refers to the presence of two peaks instead of just one in the famous 'double descent' curve, but in most cases the test error does not actually descend before the first peak.

  • 5. This model, shown to undergo double descent in [11], has become a cornerstone to study the so-called lazy learning regime of NNs, where the weights stay close to their initial value [39]: assuming ${f}_{{\theta }_{0}} =0$, we have ${f}_{\theta }(\mathbf{x})\approx {\nabla }_{\theta }{f}_{\theta }(\mathbf{x}){\vert }_{\theta ={\theta }_{0}}\cdot (\theta -{\theta }_{0})$ [40]. In other words, lazy learning amounts to a linear fitting problem with a random feature vector ${\left.{\nabla }_{\theta }{f}_{\theta }(\mathbf{x})\right\vert }_{\theta ={\theta }_{0}}$.

  • 6. We use full-batch gradient descent with a small learning rate to reduce the noise coming from the optimization as much as possible. After 1000 epochs, all observables appear to have converged.

  • 7. Note that for NNs, we necessarily have P/D > 1.

  • 8. Note from equation (5) that for non-homogeneous functions such as Tanh, r also depends on the variance of the inputs and of the fixed weights, both set to unity here: intuitively, a smaller variance will yield smaller preactivations, which will lie in the linear region of the Tanh, increasing the effective value of r.

  • 9. We focus here on the practically relevant setup N/D ≫ 1. Note from the (P, N) phase space that things can be more complex at N/D ≲ 1.
