
Generalisation error in learning with random features and the hidden manifold model*

Federica Gerace, Bruno Loureiro, Florent Krzakala, Marc Mézard and Lenka Zdeborová

Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Federica Gerace et al J. Stat. Mech. (2021) 124013. DOI 10.1088/1742-5468/ac3ae6


Abstract

We study generalised linear regression and classification for a synthetically generated dataset encompassing different problems of interest, such as learning with random features, neural networks in the lazy training regime, and the hidden manifold model. We consider the high-dimensional regime and using the replica method from statistical physics, we provide a closed-form expression for the asymptotic generalisation performance in these problems, valid in both the under- and over-parametrised regimes and for a broad choice of generalised linear model loss functions. In particular, we show how to obtain analytically the so-called double descent behaviour for logistic regression with a peak at the interpolation threshold, we illustrate the superiority of orthogonal against random Gaussian projections in learning with random features, and discuss the role played by correlations in the data generated by the hidden manifold model. Beyond the interest in these particular problems, the theoretical formalism introduced in this manuscript provides a path to further extensions to more complex tasks.


1. Introduction

One of the most important goals of learning theory is to provide generalisation bounds describing the quality of learning a given task as a function of the number of samples. Existing results fall short of being directly relevant for the state-of-the-art deep learning methods [1, 2]. Consequently, providing tighter results on the generalisation error is currently a very active research subject. The traditional learning theory approach to generalisation follows for instance the Vapnik–Chervonenkis [3] or Rademacher [4] worst-case type bounds, and many of their more recent extensions [5]. An alternative approach, followed also in this paper, has been pursued for decades, notably in statistical physics, where the generalisation ability of neural networks was analysed for a range of 'typical-case' scenarios for synthetic data arising from a probabilistic model [6–14]. While at this point it is not clear which approach will lead to a complete generalisation theory of deep learning, it is worth pursuing both directions.

The majority of works following the statistical physics approach study the generalisation error in the so-called teacher-student framework, where the input data are element-wise i.i.d. vectors, and the labels are generated by a teacher neural network. In contrast, in most real scenarios the input data do not span the input space uniformly, but rather live close to a lower-dimensional manifold. The traditional focus on i.i.d. Gaussian input vectors is an important limitation that has been recently stressed in [14, 15]. In [14], the authors proposed a model of synthetic data to mimic the latent structure of real data, named the hidden manifold model (HMM), and analysed the learning curve of the one-pass stochastic gradient descent (SGD) algorithm in a two-layer neural network with a small number of hidden units, also known as a committee machine.

Another key limitation of the majority of existing works stemming from statistical physics is that the learning curves were only computed for neural networks with a few hidden units. In particular, the input dimension is considered large, the number of samples is a constant times the input dimension and the number of hidden units is of order one. Tight learning curves were only very recently analysed for two-layer neural networks with more hidden units. These studies addressed in particular the case of networks that have a fixed first layer with random i.i.d. Gaussian weights [12, 13], or the lazy-training regime where the individual weights change only infinitesimally during training, thus not learning any specific features [16–18].

In this paper we compute the generalisation error and the corresponding learning curves, i.e. the test error as a function of the number of samples for a model of high-dimensional data that encompasses at least the following cases:

  • Generalised linear regression and classification for data generated by the HMM of [14]. The HMM can be seen as a single-layer generative neural network with i.i.d. inputs and a rather generic feature matrix [14, 19].
  • Learning data generated by the teacher-student model with a random-features neural network [20], with a very generic feature matrix, including deterministic ones. This model is also interesting because of its connection with the lazy regime, which is equivalent to the random features model with slightly more complicated features [12, 13, 16].

We give a closed-form expression for the generalisation error in the high-dimensional limit, obtained using a non-rigorous heuristic method from statistical physics known as the replica method [21], that has already shown its remarkable efficacy in many problems of machine learning [6, 8, 22, 23], with many of its predictions being rigorously proven, e.g. [24, 25]. While in the present model it remains an open problem to derive a rigorous proof for our results, we shall provide numerical support that the formula is indeed exact in the high-dimensional limit, and extremely accurate even for moderately small system sizes.

1.1. The model

We study high-dimensional regression and classification for a synthetic dataset $\mathcal{D}={\left\{({\boldsymbol{x}}^{\mu },{y}^{\mu })\right\}}_{\mu =1}^{n}$ where each sample μ is created in the following three steps: (i) first, for each sample μ we create a vector ${\boldsymbol{c}}^{\mu }\in {\mathbb{R}}^{d}$ as

Equation (1.1): ${\boldsymbol{c}}^{\mu }\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{d})$, independently for each μ = 1, ..., n.

(ii) We then draw ${\boldsymbol{\theta }}^{0}\in {\mathbb{R}}^{d}$ from a separable distribution Pθ and draw independent labels ${\left\{{y}^{\mu }\right\}}_{\mu =1}^{n}$ from a (possibly probabilistic) rule f0:

Equation (1.2)

(iii) The input data points ${\boldsymbol{x}}^{\mu }\in {\mathbb{R}}^{p}$ are created by a one-layer generative network with fixed and normalised weights $\mathrm{F}\in {\mathbb{R}}^{d\times p}$ and an activation function $\sigma :\mathbb{R}\to \mathbb{R}$, acting component-wise:

Equation (1.3)

We study the problem of supervised learning for the dataset $\mathcal{D}$, aiming at achieving a low generalisation error ${\epsilon}_{g}$ on a new sample $({\boldsymbol{x}}^{\mathrm{new}},{y}^{\mathrm{new}})$ drawn from the same rule as above, where:

Equation (1.4)

with k = 0 for the regression task and k = 1 for the classification task. Here, ${\hat{y}}_{\boldsymbol{w}}$ is the prediction for the new label ${y}^{\mathrm{new}}$, of the form:

Equation (1.5)

The weights $\hat{\boldsymbol{w}}\in {\mathbb{R}}^{p}$ are learned by minimising a loss function with a ridge regularisation term (with λ ⩾ 0), defined as

Equation (1.6)

where $\ell(\cdot ,\cdot )$ can be, for instance, a logistic, hinge, or square loss. Note that although our formula is valid for any f0 and $\hat{f}$, we take ${f}^{0}=\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$ for the classification tasks and ${f}^{0}=\hat{f}=\text{id}$ for the regression tasks studied here. A minimal numerical sketch of this setup is given below; we then describe in more detail the above-discussed reasons why this model is of interest for machine learning.
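To make the setup concrete, here is a minimal numerical sketch of the data-generating process and of the learning task. The 1/√d normalisations, the choice σ = sign with f0 = f̂ = sign, and the use of scikit-learn's ridge-regularised logistic solver are illustrative choices of ours, not prescriptions of the formalism.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p, d = 600, 200, 100                       # samples, input dimension, latent dimension

# (i) latent Gaussian vectors c^mu and teacher weights theta^0
C = rng.standard_normal((n, d))
theta0 = rng.standard_normal(d)

# (ii) labels from a generalised linear rule acting on the latent space
y = np.sign(C @ theta0 / np.sqrt(d))

# (iii) inputs from a one-layer generative network with fixed weights F
F = rng.standard_normal((d, p))
X = np.sign(C @ F / np.sqrt(d))               # sigma = sign, applied component-wise

# learn w by minimising the logistic loss with a ridge penalty (C ~ 1/lambda)
clf = LogisticRegression(penalty="l2", C=1e3, fit_intercept=False, max_iter=5000).fit(X, y)

# estimate the generalisation error on fresh samples drawn from the same rule
C_new = rng.standard_normal((5000, d))
y_new = np.sign(C_new @ theta0 / np.sqrt(d))
X_new = np.sign(C_new @ F / np.sqrt(d))
print("test misclassification:", np.mean(clf.predict(X_new) != y_new))
```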

Hidden manifold model: the dataset $\mathcal{D}$ can be seen as generated from the HMM introduced in [14]. From this perspective, although ${\boldsymbol{x}}^{\mu }$ lives in a p-dimensional space, it is parametrised by a latent d-dimensional subspace spanned by the rows of the matrix F, which are 'hidden' by the application of a scalar non-linear function σ. The labels ${y}^{\mu }$ are drawn from a generalised linear rule defined on the latent d-dimensional subspace via equation (1.2). In modern machine learning parlance, this can be seen as data generated by a one-layer generative neural network, such as those trained by generative adversarial networks or variational auto-encoders, with random Gaussian inputs ${\boldsymbol{c}}^{\mu }$ and a rather generic weight matrix F [19, 26–28].

Random features: the model considered in this paper is also an instance of the random features scheme discussed in [20] as a way to speed up kernel ridge regression. From this perspective, the ${\boldsymbol{c}}^{\mu }\in {\mathbb{R}}^{d}$ are regarded as a set of d-dimensional i.i.d. Gaussian data points, which are projected by a feature matrix $\mathrm{F}={({\boldsymbol{f}}_{\rho })}_{\rho =1}^{p}\in {\mathbb{R}}^{d\times p}$ into a higher dimensional space, followed by a non-linearity σ. In the limit p → ∞ of an infinite number of features, performing regression on $\mathcal{D}$ is equivalent to kernel regression on the ${\boldsymbol{c}}^{\mu }$ with a deterministic kernel $K({\mathbf{c}}^{{\mu }_{1}},{\mathbf{c}}^{{\mu }_{2}})={\mathbb{E}}_{\mathbf{f}}\left[\sigma (\mathbf{f}\cdot {\mathbf{c}}^{{\mu }_{1}}/\sqrt{d})\cdot \sigma (\mathbf{f}\cdot {\mathbf{c}}^{{\mu }_{2}}/\sqrt{d})\right]$, where $\mathbf{f}\in {\mathbb{R}}^{d}$ is sampled in the same way as the columns ${\boldsymbol{f}}_{\rho }$ of F. Random features are also intimately linked with the lazy training regime, where the weights of a neural network stay close to their initial value during training. The training is called lazy as opposed to a 'rich' regime in which the weights change enough to learn useful features. In lazy training, neural networks become equivalent to a random feature model with correlated features [16–18, 29–31].
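As a quick numerical illustration of this kernel limit, the normalised inner product between two feature vectors, $\frac{1}{p}{\sum }_{\rho }\sigma ({\boldsymbol{f}}_{\rho }\cdot {\mathbf{c}}^{{\mu }_{1}}/\sqrt{d})\,\sigma ({\boldsymbol{f}}_{\rho }\cdot {\mathbf{c}}^{{\mu }_{2}}/\sqrt{d})$, converges to $K({\mathbf{c}}^{{\mu }_{1}},{\mathbf{c}}^{{\mu }_{2}})$ as p grows. The sketch below checks this with σ = erf and Gaussian features; the choice of σ and the Monte Carlo estimate of the expectation are ours.

```python
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
d = 50
c1, c2 = rng.standard_normal(d), rng.standard_normal(d)

def rf_inner_product(p):
    """(1/p) sum_rho sigma(f_rho.c1/sqrt(d)) * sigma(f_rho.c2/sqrt(d)) with Gaussian f_rho."""
    F = rng.standard_normal((p, d))
    return np.mean(erf(F @ c1 / np.sqrt(d)) * erf(F @ c2 / np.sqrt(d)))

K = rf_inner_product(200_000)                # Monte Carlo proxy for the limiting kernel K(c1, c2)
for p in (100, 1_000, 10_000):
    print(p, abs(rf_inner_product(p) - K))   # deviation from the kernel shrinks as p grows
```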

1.2. Contributions and related work

The main contribution of this work is a closed-form expression for the generalisation error ${\epsilon}_{g}$, equation (2.1), that is valid in the high-dimensional limit where the number of samples n and the two dimensions p and d are large, but their respective ratios are of order one, and for a generic sequence of matrices F satisfying the following balance conditions:

Equation (1.7)

where ${\left\{{\boldsymbol{w}}^{a}\right\}}_{a=1}^{r}$ are r independent samples from the Gibbs measure (2.7), and ρ1, ρ2, ..., ρq ∈ {1, ..., d}, a1, a2, ..., as ∈ {1, ..., r} are an arbitrary choice of subsets of indices, with $s,q\in {\mathbb{Z}}_{+}$. The non-linearities ${f}^{0},\hat{f},\sigma $ and the loss function can be arbitrary. Our result for the generalisation error stems from the replica method and we conjecture it to be exact for convex loss functions. It can also be useful for non-convex loss functions, but in those cases it is possible that so-called replica symmetry breaking [21] needs to be taken into account to obtain an exact expression. In the present paper we hence focus on convex loss functions and leave the more general case for future work. The final formulas are simpler for nonlinearities σ that average to zero over a centred Gaussian variable, and we hence focus on those cases.

An interesting application of our setting is ridge regression, i.e. taking $\hat{f}(x) = x$ with the square loss, and random i.i.d. Gaussian feature matrices. For this particular case, [13] proved an equivalent expression. Indeed, in this case there is an explicit solution of equation (1.6) that can be rigorously studied with random matrix theory. In a subsequent work, [32] heuristically derived a formula for the special case of random i.i.d. Gaussian feature matrices for maximum margin classification, corresponding to the hinge loss function in our setting, with the difference, however, that the labels yμ are generated from the xμ instead of the variables cμ as in our case.

Our main technical contribution is thus to provide a generic formula for the model described in section 1.1 for any loss function and for fairly generic features F, including for instance deterministic ones.

The authors of [14] analysed the learning dynamics of a neural network containing several hidden units using a one-pass SGD for exactly the same model of data as here. In this online setting, the algorithm is never exposed to a sample twice, greatly simplifying the analysis as what has been learned at a given epoch can be considered independent of the randomness of a new sample. Another motivation of the present work is thus to study the sample complexity for this model (in our case only a bounded number of samples is available, and the one-pass SGD would be highly suboptimal).

An additional technical contribution of our work is to derive an extension of the equivalence between the considered data model and a model with Gaussian covariates, which has been observed and conjectured to hold rather generically in both [14, 32]. While we do not provide a rigorous proof for this equivalence, we show that it arises naturally using the replica method, giving further evidence for its validity.

Finally, the analysis of our formula for particular machine learning tasks of interest allows for an analytical investigation of a rich phenomenology that is also observed empirically in real-life scenarios. In particular

  • The double descent behaviour, as termed in [33] and exemplified in [34], is exhibited for the non-regularised logistic regression loss. The peak of worst generalisation does not correspond to p = n as for the square loss [13], but rather corresponds to the threshold of linear separability of the dataset. We also characterise the location of this threshold, generalising the results of [11] to our model.
  • When using projections to approximate kernels, it has been observed that orthogonal features F perform better than random i.i.d. [35]. We show that this behaviour arises from our analytical formula, illustrating the 'unreasonable effectiveness of structured random orthogonal embeddings' [35].
  • We compute the phase diagram for the generalisation error for the HMM and discuss the dependence on the various parameters, in particular the ratio between the ambient and latent dimensions.

2. Main analytical results

We now state our two main analytical results. The replica computation used here is in spirit similar to the one performed in a number of tasks for linear and generalised linear models [6, 36–38], but requires a significant extension to account for the structure of the data. We refer the reader to the supplementary material section C (https://stacks.iop.org/JSTAT/12/124013/mmedia) for the detailed and lengthy derivation of the final formula. The resulting expression is conjectured to be exact and, as we shall see, observed to be accurate even for relatively small dimensions in simulations. Additionally, these formulas reproduce the rigorous results of [13] in the simplest particular case of a Gaussian projection matrix and a ridge regression task. It remains a challenge to prove them rigorously in broader generality.

2.1. Generalisation error from replica method

Let F be a feature matrix satisfying the balance condition stated in equation (1.7). Then, in the high-dimensional limit where p, d, n → ∞ with α = n/p, γ = d/p fixed, the generalisation error, equation (1.4), of the model introduced in section 1.1, for σ such that its integral over a centred Gaussian variable is zero (so that κ0 = 0 in equation (2.10)), is given by the following easy-to-evaluate integral:

Equation (2.1)

where f0(.) is defined in (1.2), $\hat{f}(.)$ in (1.5) and (ν, λ) are jointly Gaussian random variables with zero mean and covariance matrix:

Equation (2.2)

with ${M}^{\star }={\kappa }_{1}{m}_{s}^{\star }$ and ${Q}^{\star }={\kappa }_{1}^{2}{q}_{s}^{\star }+{\kappa }_{\star }^{2}{q}_{w}^{\star }$. The constants ${\kappa }_{\star }$, ${\kappa }_{1}$ depend on the nonlinearity σ via equation (2.10), and ${q}_{s}^{\star },{q}_{w}^{\star },{m}_{s}^{\star }$ are defined as:

Equation (2.3)

The values of these parameters correspond to the solution of the optimisation problem in equation (1.6), and can be obtained as the fixed point solutions of the following set of self-consistent saddle-point equations:

Equation (2.4)

written in terms of the following auxiliary variables $\xi \sim \mathcal{N}(0,1)$, $z=\frac{\lambda +{\hat{V}}_{w}}{{\hat{V}}_{s}}$ and functions:

Equation (2.5)

where $V={\kappa }_{1}^{2}{V}_{s}+{\kappa }_{\star }^{2}{V}_{w}$, ${V}^{0}=\rho -\frac{{M}^{2}}{Q}$, $Q={\kappa }_{1}^{2}{q}_{s}+{\kappa }_{\star }^{2}{q}_{w}$, M = κ1 ms , ${\omega }_{0}=\left(M/\sqrt{Q}\right)\xi $ and ${\omega }_{1}=\sqrt{Q}\xi $. In the above, we assume that the matrix $\mathrm{F}{\mathrm{F}}^{\top }\in {\mathbb{R}}^{d\times d}$ associated with the feature map F has a well-behaved spectral density μ, and denote by ${g}_{\mu }$ its Stieltjes transform.
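In practice we solve the self-consistent equations (2.4) by damped fixed-point iteration until the overlaps stop moving. The skeleton below is a generic sketch of such a solver; the function update_overlaps, standing in for the right-hand sides of equation (2.4), is a placeholder of our own and not part of the formula.

```python
import numpy as np

def solve_saddle_point(update_overlaps, init, damping=0.5, tol=1e-8, max_iter=10_000):
    """Damped fixed-point iteration x <- (1 - damping) * f(x) + damping * x for the
    self-consistent saddle-point equations; `update_overlaps` must map the current
    overlaps (e.g. q_s, q_w, m_s and their conjugates) to their updated values."""
    x = np.asarray(init, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * np.asarray(update_overlaps(x)) + damping * x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    raise RuntimeError("saddle-point iteration did not converge")
```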

The training loss on the dataset $\mathcal{D}={\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ can also be obtained from the solution of the above equations as

Equation (2.6)

where as before $\xi \sim \mathcal{N}(0,1)$, $y\sim \text{Uni}(\mathbb{R})$ and $\mathcal{Z},\eta $ are the same as in equation (2.5), evaluated at the solution of the above saddle-point equations ${\omega }_{0}^{\star }=\left({M}^{\star }/\sqrt{{Q}^{\star }}\right)\xi $, ${\omega }_{1}^{\star }=\sqrt{{Q}^{\star }}\xi $.

Sketch of derivation—we now sketch the derivation of the above result. A complete and detailed account can be found in section C of the supplementary material. The derivation is based on the key observation that in the high-dimensional limit the asymptotic generalisation error only depends on the solution $\hat{\boldsymbol{w}}\in {\mathbb{R}}^{p}$ of equation (1.6) through the scalar parameters $({q}_{s}^{\star },{q}_{w}^{\star },{m}_{s}^{\star })$ defined in equation (2.3). The idea is therefore to rewrite the high-dimensional optimisation problem in terms of only these scalar parameters.

The first step is to note that the solution of equation (1.6) can be written as the average of the following Gibbs measure

Equation (2.7)

in the limit β → ∞. Of course, we have not gained much, since an exact calculation of πβ is intractable for large values of n, p and d. This is where the replica method comes in. It states that the distribution of the free energy density $f=-\mathrm{log}\enspace {\mathcal{Z}}_{\beta }$ (when seen as a random variable over different realisations of the dataset $\mathcal{D}$) associated with the measure πβ concentrates, in the high-dimensional limit, around a value fβ that depends only on the averaged replicated partition function ${\mathcal{Z}}_{\beta }^{r}$ obtained by taking r > 0 copies of ${\mathcal{Z}}_{\beta }$:

Equation (2.8)

Interestingly, ${\mathbb{E}}_{\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}{\mathcal{Z}}_{\beta }^{r}$ can be computed explicitly for $r\in \mathbb{N}$, and the limit r → 0+ is taken by analytic continuation in r (see section C of the supplementary material). The upshot is that ${\mathcal{Z}}^{r}$ can be written as

Equation (2.9)

where Φβ —known as the replica symmetric potential—is a concave function depending only on the following scalar parameters:

for $\boldsymbol{w}\sim {\pi }_{\beta }$. In the limit p → ∞, this integral concentrates around the extremum of the potential ${{\Phi}}_{\beta }^{(0)}$ for any β. Since the optimisation problem in equation (1.6) is convex, by construction as β → ∞ the overlap parameters $({q}_{s}^{\star },{q}_{w}^{\star },{m}_{s}^{\star })$ satisfying this optimisation problem are precisely the ones of equation (2.3) corresponding to the solution $\hat{\boldsymbol{w}}\in {\mathbb{R}}^{p}$ of equation (1.6).

In summary, the replica method allows us to circumvent the hard-to-solve high-dimensional optimisation problem of equation (1.6) by directly computing the generalisation error in equation (1.4) in terms of a simpler scalar optimisation. Extremising ${{\Phi}}_{\beta }^{(0)}$ and taking β → ∞ leads to the saddle-point equations (2.4).

2.2. Replicated Gaussian equivalence

The backbone of the replica derivation sketched above and detailed in section C of the supplementary material is a central limit theorem type result coined as the Gaussian equivalence theorem (GET) from [14] used in the context of the 'replicated' Gibbs measure obtained by taking r copies of (2.7). In this approach, we need to assume that the 'balance condition' (1.7) applies with probability one when the weights w are sampled from the replicated measure. We shall use this assumption in the following, checking its self-consistency via agreement with simulations.

It is interesting to observe that, when applying the GET in the context of our replica calculation, the resulting asymptotic generalisation error stated in section 2.1 is equivalent to the asymptotic generalisation error of the following linear model:

Equation (2.10)

with ${\kappa }_{0}=\mathbb{E}\left[\sigma (z)\right]$, ${\kappa }_{1}\equiv \mathbb{E}\left[z\sigma (z)\right]$, ${\kappa }_{\star }^{2}\equiv \mathbb{E}\left[\sigma {(z)}^{2}\right]-{\kappa }_{0}^{2}-{\kappa }_{1}^{2}$ for $z\sim \mathcal{N}(0,1)$, and ${\boldsymbol{z}}^{\mu }\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{p})$. We have, for instance, $({\kappa }_{0},{\kappa }_{1},{\kappa }_{\star })\approx \left(0,\frac{2}{\sqrt{3\pi }},0.2003\right)$ for σ = erf and $({\kappa }_{0},{\kappa }_{1},{\kappa }_{\star })=\left(0,\sqrt{\frac{2}{\pi }},\sqrt{1-\frac{2}{\pi }}\right)$ for σ = sign, two cases explored in the next section. This equivalence constitutes a result of interest in its own right, with applicability beyond the scope of the generalised linear task of equation (1.6) studied here.
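These constants are easy to evaluate for any σ; the short Monte Carlo check below (our own, not the paper's code) reproduces the values quoted above.

```python
import numpy as np
from scipy.special import erf

def kappas(sigma, n_samples=10_000_000, seed=0):
    """Monte Carlo estimate of kappa_0 = E[sigma(z)], kappa_1 = E[z sigma(z)] and
    kappa_star = sqrt(E[sigma(z)^2] - kappa_0^2 - kappa_1^2) for z ~ N(0, 1)."""
    z = np.random.default_rng(seed).standard_normal(n_samples)
    s = sigma(z)
    k0, k1 = s.mean(), (z * s).mean()
    return k0, k1, np.sqrt((s ** 2).mean() - k0 ** 2 - k1 ** 2)

print(kappas(erf))      # ~ (0, 2/sqrt(3*pi) ~ 0.6515, 0.2003)
print(kappas(np.sign))  # ~ (0, sqrt(2/pi) ~ 0.7979, sqrt(1 - 2/pi) ~ 0.6028)
```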

Equation (2.10) is precisely the mapping obtained by [13], who proved its validity rigorously in the particular case of the square loss and a Gaussian random matrix F using random matrix theory. The same equivalence arises in the analysis of kernel random matrices [39, 40] and in the study of online learning [14]. The replica method thus suggests that the equivalence actually holds for a much larger class of learning problems, as conjectured as well in [32], and it is confirmed in all our numerical tests. It also potentially allows the analysis in this paper to be generalised to data coming from a learned generative adversarial network, along the lines of [28, 41].

Figure 1 illustrates the remarkable agreement between the result of the generalisation formula, equation (2.1), and simulations both on the data of equation (1.3) with the σ(x) = sign(x) non-linearity, and on the Gaussian equivalent data of equation (2.10), where the non-linearity is replaced by a rescaling by a constant plus noise. The agreement is flawless, as implied by the theory in the high-dimensional limit, testifying that the system size used, d = 200, is sufficiently large for the asymptotic theory to be relevant. We observed similarly good agreement between theory and simulations in all the cases we tested, in particular in all those presented in the following.
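A simulation of this comparison can be set up along the following lines. The sketch assumes the Gaussian equivalent covariates take the form ${\kappa }_{1}{\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }/\sqrt{d}+{\kappa }_{\star }{\boldsymbol{z}}^{\mu }$ (κ0 = 0 for σ = sign); the precise normalisation of equation (2.10), the system sizes and the use of scikit-learn's logistic solver are assumptions of ours for the illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
d, n = 200, 600                              # n/d = 3 as in figure 1
p = 2 * n                                    # one point, p/n = 2, of the learning curve
k1, kstar = np.sqrt(2 / np.pi), np.sqrt(1 - 2 / np.pi)   # sigma = sign, so kappa_0 = 0

F, theta0 = rng.standard_normal((d, p)), rng.standard_normal(d)

def sample(n_samples, equivalent):
    """Draw (X, y) either from the original model or from its Gaussian equivalent."""
    C = rng.standard_normal((n_samples, d))
    y = np.sign(C @ theta0 / np.sqrt(d))
    if equivalent:
        X = k1 * C @ F / np.sqrt(d) + kstar * rng.standard_normal((n_samples, p))
    else:
        X = np.sign(C @ F / np.sqrt(d))
    return X, y

for name, eq in [("original", False), ("Gaussian equivalent", True)]:
    X, y = sample(n, eq)
    X_test, y_test = sample(4000, eq)
    clf = LogisticRegression(C=1e3, fit_intercept=False, max_iter=5000).fit(X, y)
    print(name, "test error:", np.mean(clf.predict(X_test) != y_test))
```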

Figure 1.

Figure 1. Comparison between theory (full line), and simulations with dimension d = 200 on the original model (dots), equation (1.3), with σ = sign, and the Gaussian equivalent model (crosses), equation (2.10), for logistic loss, regularisation λ = 10−3, n/d = 3. Labels are generated as ${y}^{\mu } = \mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$. Both the training loss (green) and generalisation error (blue) are depicted. The theory and the equivalence with the Gaussian model are observed to be very accurate even at dimensions as small as d = 200.


3. Applications of the generalisation formula

3.1. Double descent for classification with logistic loss

Among the surprising observations in modern machine learning is the fact that one can use learning methods that achieve zero training error, yet their generalisation error does not deteriorate as more and more parameters are added to the neural network. The study of such 'interpolators' has attracted growing attention over the last few years [9, 12, 13, 33, 34, 42–44], as it violates basic intuition on the bias-variance trade-off [45]. Indeed, classical learning theory suggests that generalisation should first improve then worsen when increasing model complexity, following a U-shaped curve. Many methods, including neural networks, instead follow a so-called 'double descent curve' [33] that displays two regimes: the 'classical' U-curve found at a low number of parameters is followed at a high number of parameters by an interpolation regime where the generalisation error decreases monotonically. Consequently, neural networks do not drastically overfit even when using many more parameters than data samples [46], as actually observed already in the classical work [45]. Between the two regimes, a 'peak' occurs at the interpolation threshold [9, 22, 34, 47]. It should, however, be noted that the existence of this 'interpolation' peak is a phenomenon independent of the lack of overfitting in highly over-parametrised networks, and indeed in a number of related works these two phenomena were observed separately [9, 22, 45, 47]. Scaling properties of the peak and its relation to the jamming phenomenon in physics are in particular studied in [43].

Among the simple models that allow one to observe this behaviour, random projections, which are related to lazy training and kernel methods, are arguably the most natural ones. The double descent has been analysed in detail in the present model in the specific case of a square loss on a regression task with random Gaussian features [13]. Our analysis allows us to show the generality and the robustness of the phenomenon to other tasks, matrices and losses. In figure 2 we compare the double descent present for the square loss (blue line) with the one for the logistic loss (red line) for random Gaussian features. We plot the value of the generalisation error at small values of the regularisation λ (full line), and for the optimal value of λ (dashed line), for a fixed ratio between the number of samples and the dimension n/d, as a function of the number of random features per sample p/n. We also plot the value of the training error (lower panel) for a small regularisation value, showing that the peaks indeed occur when the training loss goes to zero. For the square loss the peak appears at 1/α = p/n = 1, when the system of n linear equations with p parameters becomes solvable. For the logistic loss the peak instead appears at a value 1/α* at which the data $\mathcal{D}$ become linearly separable and hence the logistic loss can be optimised down to zero. The value 1/α* depends on the ratio n/d, and this dependence is plotted in figure 5. For very large dimension d, i.e. n/d → 0, the data matrix X is close to an i.i.d. random matrix and hence α*(n/d = 0) = 2, as famously derived in the classical work by Cover [48]. For n/d > 0, α* grows (1/α* decreases) as correlations make the data easier to separate linearly, similarly to [11].
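The logistic double descent is easy to reproduce numerically by sweeping the number of features at fixed n and d; the following small-scale sketch (our own experiment, with a weak ridge penalty standing in for λ → 0) shows the training error reaching zero and the test error peaking around the separability threshold before decreasing again.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(3)
d, n = 100, 300                                     # n/d = 3
C, C_test = rng.standard_normal((n, d)), rng.standard_normal((3000, d))
theta0 = rng.standard_normal(d)
y, y_test = np.sign(C @ theta0 / np.sqrt(d)), np.sign(C_test @ theta0 / np.sqrt(d))

for p in (30, 60, 120, 180, 240, 480, 960):         # sweep p/n across the peak
    F = rng.standard_normal((d, p))
    X, X_test = np.sign(C @ F / np.sqrt(d)), np.sign(C_test @ F / np.sqrt(d))
    clf = LogisticRegression(C=1e4, fit_intercept=False, max_iter=20_000).fit(X, y)
    print(f"p/n = {p / n:.2f}  train error = {np.mean(clf.predict(X) != y):.3f}"
          f"  test error = {np.mean(clf.predict(X_test) != y_test):.3f}")
```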

Figure 2.

Figure 2. (Top) Generalisation error evaluated from equation (2.1) plotted against the number of random Gaussian features per sample p/n = 1/α and fixed ratio between the number of samples and dimension n/d = α/γ = 3 for logistic loss (red), square loss (blue). Labels are generated as ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$, data as ${\boldsymbol{x}}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$ for two different values of regularisation λ, a small penalty λ = 10−4 (full line) and a value of λ optimised for every p/n (dashed line). (Bottom) The training loss corresponding to λ = 10−4 is depicted.


Figure 2 also shows that a better error can be achieved with the logistic loss compared to the square loss, both for small and optimal regularisation, except in a small region around the logistic interpolation peak. In the kernel limit, i.e. p/n → ∞, the generalisation error at optimal regularisation saturates at ${\epsilon}_{g}\simeq 0.17$ for the square loss and at ${\epsilon}_{g}\simeq 0.16$ for the logistic loss. Figure 3 then depicts a 3D plot of the generalisation error, also illustrating the position of the interpolation peak.

Figure 3.

Figure 3. Generalisation error of the logistic loss at fixed very small regularisation λ = 10−4, as a function of n/d = α/γ and p/n = 1/α, for random Gaussian features. Labels are generated with ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$, the data ${\boldsymbol{x}}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$. The interpolation peak happening where data become linearly separable is clearly visible here.


3.2. Random features: Gaussian versus orthogonal

Kernel methods are a very popular class of machine learning techniques, achieving state-of-the-art performance on a variety of tasks with theoretical guarantees [49–51]. In the context of neural networks, they have been the subject of renewed interest through the neural tangent kernel [17]. Applying kernel methods to large-scale 'big data' problems, however, poses many computational challenges, and this has motivated a variety of contributions to develop them at scale, see, e.g. [50, 52–54]. Random features [20] are among the most popular techniques to do so.

Here, we want to compare the performance of random projections with structured ones, in particular orthogonal random projections [35] or deterministic matrices such as real Fourier (DCT) and Hadamard matrices used in fast projection methods [55–57]. Up to normalisation, these matrices have the same spectral density. Since the asymptotic generalisation error only depends on the spectrum of FF⊤, all these matrices share the same theoretical prediction when properly normalised, see figure 4. In our computation, left- and right-orthogonal invariance is parametrised by letting F = UDV for $\mathrm{U}\in {\mathbb{R}}^{d\times d}$, $\mathrm{V}\in {\mathbb{R}}^{p\times p}$ two orthogonal matrices drawn from the Haar measure, and $\mathrm{D}\in {\mathbb{R}}^{d\times p}$ a diagonal matrix of rank min(d, p). In order to compare the results with the Gaussian case, we fix the diagonal entries ${d}_{k}=\mathrm{max}(\sqrt{\gamma },1)$ of D such that an arbitrary projected vector has, on average, the same norm as in the Gaussian case.
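The construction of such structured feature matrices can be sketched as follows; the normalisation d_k = max(√γ, 1) follows the text, while the use of scipy's Haar sampler and the chosen sizes are our own.

```python
import numpy as np
from scipy.stats import ortho_group

def orthogonal_features(d, p, seed=0):
    """Sketch of an orthogonally invariant feature matrix F = U D V, with U, V Haar
    distributed and D a rectangular diagonal matrix of rank min(d, p) whose non-zero
    entries are d_k = max(sqrt(gamma), 1), gamma = d / p."""
    U = ortho_group.rvs(d, random_state=seed)
    V = ortho_group.rvs(p, random_state=seed + 1)
    D = np.zeros((d, p))
    k = min(d, p)
    D[np.arange(k), np.arange(k)] = max(np.sqrt(d / p), 1.0)
    return U @ D @ V

F = orthogonal_features(128, 512)
print(F.shape, np.linalg.matrix_rank(F))   # (128, 512), rank 128
```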

Figure 4.

Figure 4. Generalisation error against the number of features per sample p/n, for a regression problem (left) and a classification one (right). Left (ridge regression): we used n/d = 2 and generated labels as ${y}^{\mu }={\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}$, data as ${\boldsymbol{x}}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}(x)=x$. The two curves correspond to ridge regression with Gaussian (blue) versus orthogonal (red) projection matrix F for both λ = 10−8 (top) and optimal regularisation λ (bottom). Right (logistic classification): we used n/d = 2 and generated labels as ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$, data as ${\boldsymbol{x}}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$. The two curves correspond to a logistic classification with again Gaussian (blue) versus orthogonal (red) projection matrix F for both λ = 10−4 and optimal regularisation λ. In all cases, full lines are the theoretical prediction, and points correspond to gradient-descent simulations with d = 256. For the simulations of orthogonally invariant matrices, we show results for Hadamard matrices (dots) and DCT Fourier matrices (diamonds).


Figure 4 shows that random orthogonal embeddings always outperform Gaussian random projections, in line with empirical observations, and that they allow the kernel limit to be reached with a smaller number of projections. Their behaviour is, however, qualitatively similar to that of random i.i.d. projections. We also show in figure 5 that orthogonal projections allow the data to be separated more easily than Gaussian ones, as the phase transition curve delimiting the linear separability for the logistic loss gets shifted to the left.

Figure 5.

Figure 5. The position of the interpolation peak in logistic regression with λ = 10−4, where data become linearly separable, as a function of the ratio between the number of samples n and the dimension d. Labels are generated with ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$, the data ${\boldsymbol{x}}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$. The red line is with Gaussian random features, the blue line with orthogonal features. We see that linear separability requires a smaller number of projections p with orthogonal random features than with Gaussian ones.


3.3. The hidden manifold model phase diagram

In this subsection we consider the HMM where the p-dimensional data x lie on a d-dimensional manifold; we mainly have in mind d < p. The labels y are generated using the coordinates on the manifold, equation (1.2).

In figure 6 we plot the generalisation error of classification with the square loss for various values of the regularisation λ. We fix the ratio between the dimension of the sub-manifold and the dimensionality of the input data to d/p = 0.1 and plot the learning curve, i.e. the error as a function of the number of samples per dimension. Depending on the value of the regularisation, we observe that the interpolation peak, which is at α = 1 at very small regularisation (here the over-parametrised regime is on the left-hand side), decreases for larger regularisation λ. A similar behaviour has been observed for other models in the past, see e.g. [47]. Finally, figure 6 depicts the error for the optimised regularisation parameter as the black dashed line. For a large number of samples we observe the generalisation error at optimal regularisation to saturate in this case at ${\epsilon}_{g}(\alpha )\to 0.0325$. A challenge for future work is to see whether better performance can be achieved on this model by including hidden variables in the neural network.

Figure 6.

Figure 6. Generalisation error against the number of samples per dimension, α = n/p, and fixed ratio between the latent and data dimension, d/p = 0.1, for a classification task with square loss on labels generated as ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$, data ${\boldsymbol{x}}^{\mu }=\mathrm{e}\mathrm{r}\mathrm{f}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ and $\hat{f}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}$, for different values of the regularisation λ (full lines), including the optimal regularisation value (dashed).


Figure 7 then shows the generalisation error for the optimised regularisation λ with the square loss as a function of the ratio between the latent and the data dimensions d/p. In the limit d/p ≫ 1 the data matrix becomes close to an i.i.d. random matrix and the labels are effectively random, thus only bad generalisation can be reached. Interestingly, as d/p decreases to small values, even the simple classification with regularised square loss is able to 'disentangle' the hidden manifold structure in the data and to reach a rather low generalisation error. The figure quantifies how the error deteriorates when the ratio between the two dimensions d/p increases. Rather remarkably, for low d/p a good generalisation error is achieved even in the over-parametrised regime, where the dimension is larger than the number of samples, p > n. In a sense, the square loss linear classification is able to locate the low-dimensional subspace and find good generalisation even in the over-parametrised regime as long as roughly d ≲ n. The observed results are in qualitative agreement with the results of learning with SGD in [14], where for very low d/p a good generalisation error was observed in the HMM, but a rather bad one for d/p = 0.5.

Figure 7.

Figure 7. Heat-map of the generalisation errors as a function of the number of samples per data dimension n/p against the ratio of the latent and data dimension d/p, for a classification task with square loss on labels ${y}^{\mu }=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left({\boldsymbol{c}}^{\mu }\cdot {\boldsymbol{\theta }}^{0}\right)$ and data ${\boldsymbol{x}}^{\mu }=\mathrm{e}\mathrm{r}\mathrm{f}\left({\mathrm{F}}^{\top }{\boldsymbol{c}}^{\mu }\right)$ for the optimal values of the regularisation λ.


Acknowledgments

This work is supported by the ERC under the European Union's Horizon 2020 Research and Innovation Program 714 608-SMiLe, as well as by the French Agence Nationale de la Recherche under Grant ANR-17-CE23-0023-01 PAIL and ANR-19-P3IA-0001 PRAIRIE. We also acknowledge support from the chaire CFM-ENS 'Science des données'. We thank Google Cloud for providing us access to their platform through the Research Credits Application program. B L was partially financed by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior—Brasil (CAPES)—Finance Code 001.

Note

After the completion of this work, the replica formula equation (2.4) for the generalisation error has been rigorously proven using Gordon's Gaussian min-max inequalities in [58]. Likewise, the Gaussian equivalence discussed in section 2.2 was rigorously proven in [59, 60].

Appendix A.: Definitions and notations

In this section we recall the models introduced in the main body of the article, and introduce the notations used throughout the supplementary material.

A.1. The dataset

In this work we study a series of regression and classification tasks for a dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ with labels ${y}^{\mu }\in \mathbb{R}$ sampled identically from a generalised linear model:

Equation (A.1)

where the output-channel ${P}_{y}^{0}\left(\cdot \right)$ is defined as:

Equation (A.2)

for some noise ξμ and for data points ${\boldsymbol{x}}^{\mu }\in {\mathbb{R}}^{p}$ given by:

Equation (A.3)

The vectors ${\boldsymbol{c}}^{\mu }\in {\mathbb{R}}^{d}$ are assumed to be drawn i.i.d. from $\mathcal{N}(0,{\mathrm{I}}_{d})$, and ${\boldsymbol{\theta }}^{0}\in {\mathbb{R}}^{d}$ from a separable distribution Pθ . The family of vectors ${\boldsymbol{f}}_{\rho }\in {\mathbb{R}}^{p}$ and the scalar function $\sigma :\mathbb{R}\to \mathbb{R}$ can be arbitrary.

Although our results are valid for the general model introduced above, the two examples we will be exploring in this work are the noisy linear channel (for regression tasks) and the deterministic sign channel (for classification tasks):

Equation (A.4)

Equation (A.5)

where ${\xi }^{\mu }\sim \mathcal{N}(0,1)$ and Δ > 0.

This dataset can be regarded from two different perspectives.

Hidden manifold model: the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1,\dots ,n}$ is precisely the HMM introduced in [14] to study the dynamics of online learning on a synthetic but structured dataset. From this perspective, although ${\boldsymbol{x}}^{\mu }$ lives in a p-dimensional space, it is parametrised by a latent d-dimensional subspace (d < p) spanned by the basis ${\left\{{\boldsymbol{f}}_{\rho }\right\}}_{\rho =1,\dots ,d}$, which is 'hidden' by the application of a scalar nonlinear function σ acting component-wise. The labels ${y}^{\mu }$ are then drawn from a generalised linear rule defined on the latent d-dimensional space.

Random features model: the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1,\dots ,n}$ is tightly related to the random features model studied in [20] as a random approximation for kernel ridge regression. In this perspective, ${\boldsymbol{c}}^{\mu }\in {\mathbb{R}}^{d}$ is regarded as a collection of d-dimensional data points which are projected by a random feature matrix $\mathrm{F}={({\boldsymbol{f}}_{\rho })}_{\rho =1}^{p}\in {\mathbb{R}}^{d\times p}$ into a higher dimensional space, followed by a non-linearity σ. In the limit of an infinite number of features, d, p → ∞ with fixed ratio d/p, performing ridge regression on the ${\boldsymbol{x}}^{\mu }$ is equivalent to kernel ridge regression with a limiting kernel depending on the distribution of the feature matrix F and on the non-linearity σ.

A.2. The task

In this work, we study the problem of learning the rule from equation (A.1) from the dataset ${\left\{({\boldsymbol{x}}^{\mu },{y}^{\mu })\right\}}_{\mu =1,\dots ,n}$ introduced above with a generalised linear model:

Equation (A.6)

where the weights $\boldsymbol{w}\in {\mathbb{R}}^{p}$ are learned by minimising a loss function with a ridge regularisation term:

Equation (A.7)

for λ > 0.

It is worth stressing that our results hold for general $\ell$, $\hat{f}$ and f0—including non-convex loss functions. However, for the purpose of the applications explored in this manuscript, we will be mostly interested in the cases $\hat{f}(x)={f}^{0}(x)=x$ for regression and $\hat{f}(x)={f}^{0}(x)=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}(x)$ for classification, and we will focus on the following two loss functions:

Equation (A.8)

Note that these loss functions are strictly convex. Therefore, for these losses, the regularised optimisation problem in (A.7) has a unique solution.

Given a new pair ( x new, ynew) drawn independently from the same distribution as ${\left\{({\boldsymbol{x}}^{\mu },{y}^{\mu })\right\}}_{\mu =1}^{n}$, we define the success of our fit through the generalisation error, defined as:

Equation (A.9)

where ${\hat{y}}^{\mathrm{n}\mathrm{e}\mathrm{w}}=\hat{f}({\boldsymbol{x}}^{\mathrm{n}\mathrm{e}\mathrm{w}}\cdot \hat{\boldsymbol{w}})$, and for convenience we choose k = 0 for the regression tasks and k = 1 for the classification task, such that the generalisation error in the latter case counts misclassification. Note that for a classification problem, the generalisation error defined in this way is simply the misclassification rate.

Similarly, we define the training loss on the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ as:

Equation (A.10)

Finally, all the results of this manuscript are derived in the high-dimensional limit, also known as the thermodynamic limit in the physics literature, in which we take p, d, n → ∞ while keeping the ratios α = n/p, γ = d/p fixed.

Appendix B.: Gaussian equivalence theorem

In this section we introduce the replicated Gaussian equivalence (rGE), a central result we will need for our replica calculation of the generalisation error in section 2.1 of the main body. The rGE is a stronger version of the GET that was introduced and proved in [14]. Previously, particular cases of the GET were derived in the context of random matrix theory [39, 40, 61, 62]. The Gaussian equivalence has also been stated and used in [13, 32].

B.1. Gaussian equivalence theorem

Let $\mathrm{F}\in {\mathbb{R}}^{d\times p}$ be a fixed matrix, ${\boldsymbol{w}}^{a}\in {\mathbb{R}}^{p}$, 1 ⩽ ar be a family of vectors, ${\boldsymbol{\theta }}^{0}\in {\mathbb{R}}^{d}$ be a fixed vector and $\sigma :\mathbb{R}\to \mathbb{R}$ be a scalar function acting component-wise on vectors.

Let $\boldsymbol{c}\in {\mathbb{R}}^{d}$ be a Gaussian vector $\mathcal{N}(0,{\mathrm{I}}_{d})$. The GET is a statement about the (joint) statistics of the following r + 1 random variables

Equation (B.1)

in the asymptotic limit where d, p → ∞ with fixed p/d and fixed r. For simplicity, assume that σ(x) = −σ(−x) is an odd function. Further, suppose that in the previously introduced limit the following two balance conditions hold:

Condition 1:

Equation (B.2)

for any ρ.

Condition 2:

Equation (B.3)

for any integers k ⩾ 0, q > 0, for any choice of indices ρ1, ρ2, ..., ρq ∈ {1, ..., d} all distinct from each other, and for any choice of indices a1, a2, ..., ak ∈ {1, ..., r}. Under the aforementioned conditions, the following theorem holds:

Theorem 1. In the limit d, p → ∞ with fixed p/d, the random variables {λa , ν} are jointly normal, with zero mean and covariances:

Equation (B.4)

where:

Equation (B.5)

and

Equation (B.6)

where $z\sim \mathcal{N}(0,1)$.

B.2. Replicated Gaussian equivalence

Note that the GET holds for a fixed family ${\left\{{\boldsymbol{w}}^{a}\right\}}_{a=1}^{r}$ and matrix $\mathrm{F}\in {\mathbb{R}}^{d\times p}$ satisfying the balance condition from equation (B.3). In the replica setting, we will need to apply the GET under an average over r samples (referred to here as replicas) of the Gibbs distribution μβ , introduced in equation (2.7) of the main text. We shall therefore require the assumption that the balance condition of equation (B.3) holds for any sample of μβ . We refer to this stronger version of the GET as the rGE. Although proving this result is beyond the scope of the present work, we check its self-consistency extensively with numerical simulations.

Appendix C.: Replica analysis

In this section we give a full derivation of the result in section 2 in the main manuscript for the generalisation error of the problem defined in appendix A. Our derivation follows from a Gibbs formulation of the optimisation problem in equation (A.7) followed by a replica analysis inspired by the toolbox of the statistical physics of disordered systems.

C.1. Gibbs formulation of problem

Given the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ defined in appendix A.1, we define the following Gibbs measure over ${\mathbb{R}}^{p}$:

Equation (C.1)

for β > 0. When β → ∞, the Gibbs measure peaks at the solution of the optimisation problem in equation (A.7)—which, in the particular case of a strictly convex loss, is unique. Note that in the second equality we defined the factorised distributions Py and Pw , showing that μβ can be interpreted as a posterior distribution of w given the dataset { x μ , yμ }, with Py and Pw being the likelihood and prior distributions respectively.

An exact calculation of μβ is intractable for large values of n, p and d. However, the interest in μβ is that in the limit n, p, d → ∞ with d/p and n/p fixed, the free energy density associated with the Gibbs measure:

Equation (C.2)

can be computed exactly using the replica method, and at β → ∞ gives us the optimal overlaps:

Equation (C.3)

that—as we will see—fully characterise the generalisation error defined in equation (A.9).

C.2. Replica computation of the free energy density

The replica calculation of fβ is based on a large deviation principle for the free energy density. Let

Equation (C.4)

be the free energy density for one given sample of the problem, i.e. a fixed dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$. We assume that the distribution P(f) of the free energy density, seen as a random variable over different samples of the problem, satisfies a large deviation principle, in the sense that, in the thermodynamic limit:

Equation (C.5)

with Φ a concave function reaching its maximum at the free energy density f = fβ , with Φ(fβ ) = 0. This hypothesis includes the notion of self-averaging, which states that the free-energy density is the same for almost all samples in the thermodynamic limit.

The value of fβ can be obtained by computing the replicated partition function

Equation (C.6)

and taking the limit

Equation (C.7)

Although this procedure is not fully rigorous, experience from the statistical physics of disordered systems shows that it gives exact results, and in fact the resulting expression can be verified to match the numerical simulations.

Using the replica method we need to evaluate:

Equation (C.8)

where Pw and Py have been defined in (C.1). In order to compute this quantity, we introduce, for each point μ in the database, the r + 1 variables

Equation (C.9)

Equation (C.10)

Choosing c μ at random induces a joint distribution $P({\nu }_{\mu },{\lambda }_{\mu }^{a})$. In the thermodynamic limit p, d → ∞ with fixed p/d, and for matrices F satisfying the balance condition in equation (B.3), the rGE introduced in appendix B.2 tells us that, for a given μ, the r + 1 variables ${\left\{{\nu }_{\mu },{\lambda }_{\mu }^{a}\right\}}_{a=1}^{r}$ are Gaussian random variables with zero mean and covariance given by:

Equation (C.11)

The elements of the covariance matrix ${M}^{a}$ and ${Q}^{ab}$ are the rescaled versions of the so-called overlap parameters:

Equation (C.12)

where ${\boldsymbol{s}}^{a}=\frac{1}{\sqrt{p}}\mathrm{F}{\boldsymbol{w}}^{a}$. They are thus given by:

Equation (C.13)

where ${\kappa }_{1}={\mathbb{E}}_{z}\left[z\sigma (z)\right]$ and ${\kappa }_{\star }^{2}={\mathbb{E}}_{z}\left[\sigma {(z)}^{2}\right]-{\kappa }_{1}^{2}$ as in equation (B.6). With this notation, the asymptotic joint probability is simply written as:

Equation (C.14)

with ${z}_{\mu }^{0}={\nu }_{\mu }$ and ${z}_{\mu }^{a}={\lambda }_{\mu }^{a}$ for a = 1, ..., r. The average over the replicated partition function (C.8) therefore reads:

Equation (C.15)

C.2.1. Rewriting as a saddle-point problem

Note that after taking the average over x , the integrals involved in the replicated partition function only couple through the overlap parameters. It is therefore useful to introduce Dirac δ-functions to free these parameters, using the decomposition:

Equation (C.16)

Introducing the above in equation (C.15) and exchanging the integration order allows us to factorise the integrals over the d, p, n dimensions and rewrite:

Equation (C.17)

where the integrals over the variables ${m}_{s}^{a}$, ${q}_{s}^{ab}$ and ${q}_{w}^{ab}$ run over $\mathbb{R}$, while those over ${\hat{m}}_{s}^{a}$, ${\hat{q}}_{s}^{ab}$ and ${\hat{q}}_{w}^{ab}$ run over $i\mathbb{R}$. The function ${\Phi}^{(r)}$, which depends on all the overlap parameters, is given by:

Equation (C.18)

where we recall that α = n/p, γ = d/p, and we have introduced:

Equation (C.19)

Note that ${\boldsymbol{s}}^{a}=\frac{1}{\sqrt{p}}\mathrm{F}{\boldsymbol{w}}^{a}$ is a function of w a , and must be kept under the w a integral. In the thermodynamic limit where p → ∞ with n/p and d/p fixed, the integral in equation (C.17) concentrates around the values of the overlap parameters that extremise ${\Phi}^{(r)}$, and therefore

Equation (C.20)

C.2.2. Replica symmetric ansatz

In order to proceed with the r → 0+ limit, we restrict the extremization above to the following replica symmetric ansatz:

Equation (C.21)

Note that, in the particular case of a convex loss function with λ > 0, the replica symmetric ansatz is justified: since the problem admits only one solution, it a fortiori coincides with the replica symmetric one. For non-convex losses, solutions that are not replica symmetric (a situation known as replica symmetry breaking) are possible, and the energy landscape of the free energy needs to be carefully analysed. In the practical applications explored in this manuscript, we focus on convex losses with ridge regularisation, and therefore the replica symmetric assumption is fully justified.

Before proceeding with the limit in equation (C.20), we need to verify that the above ansatz is well defined—in other words, that we have not introduced a spurious order one term in Φ that would diverge. This means we need to check that $\underset{r\to {0}^{+}}{\mathrm{lim}}\enspace {\Phi}=0$.

First, with a bit of algebra one can check that, within our replica symmetric ansatz:

Equation (C.22)

Therefore,

Equation (C.23)

where we have used the fact that Pθ is a factorised distribution to take the p → ∞ limit. In order for this limit to be 0, we need $\hat{\rho }=0$, which also fixes ρ to be a constant given by the second moment of θ0:

Equation (C.24)

We now proceed with the limit in equation (C.20). Let us look first at Ψy . The non-trivial limit comes from the fact that det Σ and ${\Sigma}^{-1}$ are non-trivial functions of r. It is not hard to see, however, that ${\Sigma}^{-1}$ itself has a replica symmetric structure, with components given by:

Equation (C.25)

where M, Q and R are the rescaled overlap parameters in the replica symmetric ansatz, that is:

Equation (C.26)

This allows us to write:

Equation (C.27)

In order to completely factorise the integral above in replica space, we use the Hubbard–Stratonovich transformation:

Equation (C.28)

for $\xi \sim \mathcal{N}(0,1)$, such that

Equation (C.29)
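For the reader's convenience, the identity used in this step is the standard Gaussian linearisation (stated here in our own normalisation; equation (C.28) is the instance of it needed in the present calculation),

$$\mathrm{e}^{{b}^{2}/2}=\int \frac{\mathrm{d}\xi }{\sqrt{2\pi }}\,{\mathrm{e}}^{-{\xi }^{2}/2+b\xi }={\mathbb{E}}_{\xi \sim \mathcal{N}(0,1)}\left[{\mathrm{e}}^{b\xi }\right],$$

which trades a term quadratic in the replica variables for an auxiliary Gaussian field ξ shared by all replicas, so that the integral factorises over the replica index.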

Taking into account the r dependence of the inverse elements and of the determinant, we can take the r → 0+ limit to get:

Equation (C.30)

Finally, making a change of variables and defining:

Equation (C.31)

allows us to rewrite the limit of Ψy —which abusing notation we still denote Ψy —as:

Equation (C.32)

One can follow a very similar approach for the limit of Ψw , although in this case the limit is much simpler, since there is no r dependence on the hat variables. The limit can be written as:

Equation (C.33)

for $\xi ,\eta \sim \mathcal{N}(0,1)$, and we have defined:

Equation (C.34)

and we have defined the shorthands ${\hat{V}}_{w}={\hat{r}}_{w}+{\hat{q}}_{w}$ and ${\hat{V}}_{s}={\hat{r}}_{s}+{\hat{q}}_{s}$.

C.2.3. Summary of the replica symmetric free energy density

Summarising the calculation above, the replica symmetric free energy density reads:

Equation (C.35)

with $\alpha =\frac{n}{p}$, $\gamma =\frac{d}{p}$, and:

Equation (C.36)

The so-called potentials (Ψy , Ψw ) are given by:

Equation (C.37)

Equation (C.38)

where:

Equation (C.39)

C.3. Evaluating Ψw for ridge regularisation and Gaussian prior

Note that as long as the limit in Ψw is well defined, equation (C.35) holds for any Pθ and Pw . However, as discussed in appendix A.1, we are interested in ${\boldsymbol{\theta }}^{0}\sim \mathcal{N}(0,{\mathrm{I}}_{d})$ and ridge regularisation so that ${P}_{w}=\mathrm{exp}\left(-\frac{\beta \lambda }{2}{\Vert}\boldsymbol{w}{{\Vert}}^{2}\right)$. In this case, we simply have:

Equation (C.40)

with:

Equation (C.41)

Therefore the argument of the logarithm in Ψw is just another Gaussian integral we can do explicitly:

Equation (C.42)

where we have defined the shorthand $\boldsymbol{b}=\left(\sqrt{{\hat{q}}_{s}}\xi {\mathbf{1}}_{d}+{\hat{m}}_{s}{\boldsymbol{\theta }}^{0}\right)\in {\mathbb{R}}^{d}$. Inserting back in equation (C.38) and taking the log,

Equation (C.43)

The averages over η, ξ, θ 0 simplify this expression considerably:

Equation (C.44)

Finally, we can combine the two terms:

Equation (C.45)

and write:

Equation (C.46)

Note that Ψw only depends on the spectral properties of the matrix $\frac{1}{p}\mathrm{F}{\mathrm{F}}^{\top }\in {\mathbb{R}}^{d\times d}$, and more specifically on its resolvent in the asymptotic limit. A case of particular interest is when $\frac{1}{p}\mathrm{F}{\mathrm{F}}^{\top }$ has a well-defined spectral measure μ in the p, d → ∞ limit with γ = d/p fixed. In that case, we can write:

Equation (C.47)

where gμ is the Stieltjes transform of μ, defined by:

Equation (C.48)

Similarly, the logarithm term can be expressed as the logarithmic potential of μ—although for the purpose of evaluating the generalisation error we will only need the derivatives of these terms, and therefore only the Stieltjes transform and its derivative.

In what follows, we will mostly focus on two kinds of projection matrices F:

Gaussian projections: for $\mathrm{F}\in {\mathbb{R}}^{d\times p}$ a random matrix with i.i.d. Gaussian entries with zero mean and variance 1, μ is given by the well-known Marchenko–Pastur law, and the corresponding Stieltjes transform is given by:

Equation (C.49)

Orthogonally invariant projection: for F = UDV with $\mathrm{U}\in {\mathbb{R}}^{d\times d}$ and $\mathrm{V}\in {\mathbb{R}}^{p\times p}$ two orthogonal matrices and $\mathrm{D}\in {\mathbb{R}}^{d\times p}$ a rectangular diagonal matrix of rank min(d, p) and diagonal entries dk , the empirical spectral density μp is given by:

Equation (C.50)

Therefore the choice of diagonal elements dk fully characterises the spectrum of FF⊤. In order for the orthogonally invariant case to be comparable to the Gaussian case, we fix dk in such a way that the projected vector F w is of the same order in both cases, i.e.

Equation (C.51)

With this choice, the Stieltjes transform of μ reads:

Equation (C.52)
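These spectral quantities are straightforward to estimate numerically. The sketch below (our own check, with the 1/p normalisation of FF⊤ used above; the sign convention should be matched to the definition in equation (C.48)) computes an empirical Stieltjes transform directly from the eigenvalues.

```python
import numpy as np

def empirical_stieltjes(F, z):
    """Estimate g_mu(z) = int dmu(lambda) / (lambda - z) from the eigenvalues of
    (1/p) F F^T, for z outside the support of the spectrum."""
    d, p = F.shape
    lam = np.linalg.eigvalsh(F @ F.T / p)
    return np.mean(1.0 / (lam - z))

d, p = 300, 900
F = np.random.default_rng(0).standard_normal((d, p))   # Gaussian projections
print(empirical_stieltjes(F, z=-1.0))
```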

C.4. Gaussian equivalent model

It is interesting to note that the average over the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ of the replicated partition function ${\mathcal{Z}}_{\beta }^{r}$ in equation (C.15), obtained after the application of the GET, is identical to the replicated partition function of the same task over the following dual dataset ${\left\{{\tilde{\boldsymbol{x}}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$, where:

Equation (C.53)

where ${\boldsymbol{z}}^{\mu }\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{p})$, and the labels yμ ∼ Py are the same. Indeed, calling ${\tilde{\mathcal{Z}}}_{\beta }^{r}$ the replicated partition function for this equivalent dataset and considering κ0, we have:

Equation (C.54)

Rewriting (I):

Equation (C.55)

It is easy to show that taking (κ0, κ1) to match those from equation (B.6), the variables $\left({\nu }_{\mu },\left\{{\lambda }_{\mu }^{a}\right\}\right)$ are jointly Gaussian variables with correlation matrix given by Σ exactly as in equation (C.11). This establishes the equivalence

Equation (C.56)

from which follows the equivalence between the asymptotic generalisation and test error of these two models.
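As an illustration of how the coefficients entering the equivalent dataset can be evaluated in practice, the sketch below assumes the standard Gaussian-equivalence definitions κ0 = E[σ(z)], κ1 = E[zσ(z)], κ⋆2 = E[σ(z)2] − κ02 − κ12 with z ∼ N(0, 1); the precise definitions are those of equation (B.6), which is not reproduced here, so this should be read as an assumption rather than a statement of the paper's conventions.

```python
import numpy as np

def get_coefficients(sigma, degree=100):
    """Gauss-Hermite estimate of the (assumed) Gaussian-equivalence coefficients
    kappa_0 = E[sigma(z)], kappa_1 = E[z sigma(z)],
    kappa_star^2 = E[sigma(z)^2] - kappa_0^2 - kappa_1^2, with z ~ N(0, 1)."""
    x, w = np.polynomial.hermite_e.hermegauss(degree)  # quadrature with weight exp(-x^2 / 2)
    w = w / np.sqrt(2.0 * np.pi)                       # normalise to the standard Gaussian
    k0 = np.sum(w * sigma(x))
    k1 = np.sum(w * x * sigma(x))
    kstar = np.sqrt(np.sum(w * sigma(x) ** 2) - k0 ** 2 - k1 ** 2)
    return k0, k1, kstar

# Example with a tanh activation.
print(get_coefficients(np.tanh))
```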

Appendix D.: Saddle-point equations and the generalisation error

The upshot of the replica analysis is to exchange the p-dimensional minimisation problem over $\boldsymbol{w}\in {\mathbb{R}}^{p}$ in equation (A.7) for a minimisation problem over the scalar parameters {rs , qs , ms , rw , qw } and their conjugates in equation (C.35). In particular, note that by construction, in the limit β → ∞ the solution $\left\{{q}_{s}^{\star },{m}_{s}^{\star },{q}_{w}^{\star }\right\}$ of equation (C.35) corresponds to:

Equation (D.1)

where $\hat{\boldsymbol{w}}$ is the solution of equation (A.7). As we will see, both the generalisation error defined in equation (A.9) and the training loss can be expressed entirely in terms of these overlap parameters.

D.1. Generalisation error as a function of the overlaps

Let $\{{\boldsymbol{x}}^{\text{new}},{y}^{\text{new}}\}$ be a new sample drawn independently from the same distribution as our data ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$. The generalisation error can then be written as:

Equation (D.2)

where, for convenience of normalisation, we take k = 0 for the regression task and k = 1 for the classification task. Again, we apply the GET from appendix B to write the joint distribution over {ν, λ}:

Equation (D.3)

where $\boldsymbol{z}={(\nu ,\lambda )}^{\top }\in {\mathbb{R}}^{2}$ and Σ is given by

Equation (D.4)

Inserting in equation (D.2) gives the desired representation of the generalisation error in terms of the optimal overlap parameters:

Equation (D.5)

For linear labels $y=\boldsymbol{c}\cdot {\boldsymbol{\theta }}^{0}$ in the regression problem, we simply have:

Equation (D.6)

while for the corresponding classification problem with $y=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left(\boldsymbol{c}\cdot {\boldsymbol{\theta }}^{0}\right)$:

Equation (D.7)

which, as expected, only depend on the angle between $\mathrm{F}\hat{\boldsymbol{w}}$ and θ 0.

D.2. Training loss

Similarly to the generalisation error, the asymptotic value of the training loss, defined on the training data ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ as:

Equation (D.8)

can also be written solely in terms of the overlap parameters. Indeed, it is closely related to the free energy density defined in equation (C.2). A close inspection of this definition tells us that:

Equation (D.9)

Taking the derivative of the free energy with respect to the parameter β and recalling that n = αp, we then get:

Equation (D.10)

As for the contribution of the regulariser, we simply note that in the limit p → ∞ the average concentrates around the overlap parameter ${q}_{w}^{\star }$. As for the contribution of the loss function, we start by explicitly taking the derivative of Ψy in equation (C.32) with respect to β, i.e.:

Equation (D.11)

with ${\mathcal{Z}}_{y}^{\cdot /0}$ defined in equation (C.31). At this point, as explained in more detail in appendix D.4, in the limit β → ∞ it holds that:

Equation (D.12)

with $\eta \left(y,{\omega }_{1}^{\star }\right)$ given in equation (D.21). Combining the two results, we finally get:

Equation (D.13)

D.3. Solving for the overlaps

As shown above, both the generalisation error and the training loss are completely determined by the β → ∞ solution of the extremisation problem in equation (C.35). For strictly convex losses ℓ, there is a unique solution to this problem, which can be found by setting the derivatives of the replica potential to zero. This leads to a set of self-consistent saddle-point equations that can be solved iteratively:

Equation (D.14)

In the case of an F with well-defined spectral density μ, we can be more explicit and write:

Equation (D.15)

where:

Equation (D.16)

We can also slightly simplify the derivatives of Ψy , without losing generality, by applying Stein's lemma, yielding:

Equation (D.17)

with ${f}_{y}^{\cdot /0}(y;\omega ,V)={\partial }_{\omega }\enspace \mathrm{log}\enspace {\mathcal{Z}}_{y}^{\cdot /0}$. For a given choice of spectral density μ (corresponding to a choice of projection F), label rule ${P}_{y}^{0}$ and loss function ℓ, the auxiliary functions $({\mathcal{Z}}^{0},\mathcal{Z})$ can be computed, and from them the right-hand side of the update equations above. The equations can then be iterated until convergence to the fixed point minimising the free energy at fixed (α, γ, β). For convex losses and β → ∞, the fixed point of these equations gives the overlaps corresponding to the estimator solving equation (A.7).
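The iteration itself is a standard damped fixed-point scheme. Below is a minimal Python sketch of that loop, in which the map `update` stands in for the right-hand sides of equations (D.14)-(D.17) (not reproduced here) and is replaced by a toy contraction purely for illustration.

```python
import numpy as np

def fixed_point(update, x0, damping=0.5, tol=1e-8, max_iter=10_000):
    """Damped fixed-point iteration x <- (1 - damping) * update(x) + damping * x,
    stopped when successive iterates differ by less than tol."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = (1.0 - damping) * np.asarray(update(x)) + damping * x
        if np.max(np.abs(x_new - x)) < tol:
            return x_new
        x = x_new
    return x

# Toy illustration only: in the actual computation, `update` would implement the
# right-hand sides of equations (D.14)-(D.17) for the overlaps and their conjugates.
print(fixed_point(np.cos, x0=[1.0]))
```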

D.4. Taking β → ∞ explicitly

Although the saddle-point equations above can be iterated explicitly for any β > 0, it is convenient to take the limit β → ∞ explicitly, since β is an auxiliary parameter that we introduced and that was not present in the original problem defined in equation (A.7).

Since the overlap parameters depend on β only implicitly through ${\mathcal{Z}}_{y}$ and its derivatives, we proceed with the following ansatz for their scaling as β → ∞:

Equation (D.18)

This ansatz can be motivated as follows. Recall that:

Equation (D.19)

Therefore, letting $V={\mu }_{1}^{2}{V}_{s}+{\mu }_{\star }^{2}{V}_{w}$ scale as $V={V}^{\infty }/\beta $, at β → ∞:

Equation (D.20)

where:

Equation (D.21)

For convex losses with λ > 0, this one-dimensional minimisation problem has a unique solution that can be easily evaluated. Intuitively, this ansatz reflects the fact that the variance of our estimator goes to zero as a power law as β → ∞, meaning that the Gibbs measure concentrates around the solution of the optimisation problem in equation (A.7). The other scalings in equation (D.18) follow from analysing the dependence of the saddle-point equations on V.
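For concreteness, and assuming (though equation (D.21) is not reproduced here) that η(y; ω, V) denotes the minimiser of $x\mapsto \frac{{(x-\omega )}^{2}}{2V}+\ell (y,x)$, this one-dimensional problem can be evaluated numerically for any convex loss; for the square loss $\ell (y,x)=\frac{1}{2}{(y-x)}^{2}$ it has the closed form $(\omega +Vy)/(1+V)$, which the sketch below uses as a check.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def prox(loss, y, omega, V):
    """Numerical minimiser of x -> (x - omega)^2 / (2 V) + loss(y, x)."""
    return minimize_scalar(lambda x: (x - omega) ** 2 / (2.0 * V) + loss(y, x)).x

square_loss = lambda y, x: 0.5 * (y - x) ** 2
logistic_loss = lambda y, x: np.log1p(np.exp(-y * x))

y, omega, V = 1.0, 0.3, 2.0
# Square loss: compare with the closed form (omega + V * y) / (1 + V).
print(prox(square_loss, y, omega, V), (omega + V * y) / (1.0 + V))
# Logistic loss: no closed form, numerical value only.
print(prox(logistic_loss, y, omega, V))
```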

The ansatz in equation (D.18) allows us to take the limit β → ∞ and rewrite the saddle-point equations as:

Equation (D.22)

Equation (D.23)

where ${\mathcal{Z}}_{y}^{0}(y;\omega ,V)$ is always evaluated at $(\omega ,V)=\left(\frac{{M}^{\infty }}{\sqrt{{Q}^{\infty }}}\xi ,\rho -\frac{{{M}^{\infty }}^{2}}{{Q}^{\infty }}\right)$, η(y; ω, V) at $\left(\omega ,V\right)=\left(\sqrt{{Q}^{\infty }}\xi ,{V}^{\infty }\right)$ and $z=\frac{\lambda +{\hat{V}}_{w}^{\infty }}{{\hat{V}}_{s}^{\infty }}$.

D.5. Examples

In this section we exemplify our general result in two particular cases for which the integrals on the right-hand side of equation (D.22) can be performed analytically: the ridge regression task with linear labels, and a classification problem with square loss and ridge regularisation. The former example appears in figure 4 (left) and the latter in figure 2 (blue curve) and figures 6 and 7 of the main.

Ridge regression with linear labels: consider the ridge regression task with loss $\ell (y,x)=\frac{1}{2}{\left(y-x\right)}^{2}$ and λ > 0, applied to the linear labels $\boldsymbol{y}=\frac{1}{\sqrt{d}}\mathrm{C}{\boldsymbol{\theta }}^{0}+\sqrt{{\Delta}}\boldsymbol{z}$, with $\boldsymbol{z}\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{n})$ and ${\boldsymbol{\theta }}^{0}\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{d})$. In this case, we have:

Equation (D.24)

and the saddle-point equations for the hat overlap read:

Equation (D.25)

This particular example corresponds precisely to the setting studied in [32].

Classification with square loss and ridge regularisation: consider a classification task with square loss $\ell (y,x)=\frac{1}{2}{\left(y-x\right)}^{2}$ and labels generated as $\boldsymbol{y}=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}\left(\frac{1}{\sqrt{d}}\mathrm{C}{\boldsymbol{\theta }}^{0}\right)$, with ${\boldsymbol{\theta }}^{0}\sim \mathcal{N}(\mathbf{0},{\mathrm{I}}_{d})$. Then the saddle-point equations are simply:

Equation (D.26)

Appendix E.: Numerical simulations

In this section, we provide more details on how the numerical simulations in the main manuscript have been performed.

First, the dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$ is generated according to the procedure described in section 1.1 of the main, which we summarise here for convenience in algorithm 1:

Algorithm 1. Generating dataset ${\left\{{\boldsymbol{x}}^{\mu },{y}^{\mu }\right\}}_{\mu =1}^{n}$.

Input: integer d, parameters $\alpha ,\gamma \in {\mathbb{R}}_{+}$, matrix $\mathrm{F}\in {\mathbb{R}}^{d\times p}$, vector ${\boldsymbol{\theta }}^{0}\in {\mathbb{R}}^{d}$, non-linear functions $\sigma ,{f}^{0}:\mathbb{R}\to \mathbb{R}$
Assign p ← ⌊d/γ⌋, n ← ⌊αp⌋
Draw $\mathrm{C}\in {\mathbb{R}}^{n\times d}$ with entries ${c}_{\rho }^{\mu }\sim \mathcal{N}(0,1)$ i.i.d.
Assign $\boldsymbol{y}{\leftarrow}{f}^{0}\left(\mathrm{C}{\boldsymbol{\theta }}^{0}\right)\in {\mathbb{R}}^{n}$ component-wise
Assign $\mathrm{X}{\leftarrow}\sigma \left(\mathrm{C}\mathrm{F}\right)\in {\mathbb{R}}^{n\times p}$ component-wise
Return: X, y
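For reference, a direct NumPy transcription of algorithm 1 might look as follows. Note that the algorithm as stated contains no $1/\sqrt{d}$ normalisation inside σ or f0; if the conventions of section 1.1 of the main (or of appendix D.5, where the labels carry a $1/\sqrt{d}$ factor) require one, it should be added accordingly.

```python
import numpy as np

def generate_dataset(d, alpha, gamma, F, theta0, sigma, f0, rng=None):
    """Literal transcription of algorithm 1: returns X in R^{n x p}, y in R^n."""
    rng = np.random.default_rng() if rng is None else rng
    p = int(d / gamma)
    n = int(alpha * p)
    C = rng.standard_normal((n, d))   # latent Gaussian vectors c^mu
    y = f0(C @ theta0)                # labels generated on the hidden manifold
    X = sigma(C @ F)                  # observed features
    return X, y
```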

In all the examples from the main, we have drawn ${\boldsymbol{\theta }}^{0}\sim \mathcal{N}(0,{\mathrm{I}}_{d})$. For the regression task in figure 4 we have taken f0(x) = x, while in the remaining classification tasks f0(x) = sign(x). For Gaussian projections, the components of F are drawn from $\mathcal{N}(0,1)$ i.i.d., and for the random orthogonal projections we draw two orthogonal matrices $\mathrm{U}\in {\mathbb{R}}^{d\times d}$, $\mathrm{V}\in {\mathbb{R}}^{p\times p}$ from the Haar measure and we let F = UDV with $\mathrm{D}\in {\mathbb{R}}^{d\times p}$ a diagonal matrix with diagonal entries ${d}_{k}=\mathrm{max}(\sqrt{\gamma },1)$, k = 1, ...,  min(d, p).
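The two projection ensembles can be generated, for instance, as follows; sampling Haar-distributed orthogonal matrices via scipy.stats.ortho_group is an implementation choice, not something specified in the text.

```python
import numpy as np
from scipy.stats import ortho_group

def gaussian_projection(d, p, rng=None):
    """F with i.i.d. N(0, 1) entries."""
    rng = np.random.default_rng() if rng is None else rng
    return rng.standard_normal((d, p))

def orthogonal_projection(d, p):
    """F = U D V with Haar-distributed U, V and d_k = max(sqrt(gamma), 1), as
    stated in the text (any rescaling needed to match equation (C.51) is omitted)."""
    gamma = d / p
    U = ortho_group.rvs(d)
    V = ortho_group.rvs(p)
    D = np.zeros((d, p))
    np.fill_diagonal(D, max(np.sqrt(gamma), 1.0))
    return U @ D @ V
```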

Given this dataset, the aim is to infer the configuration $\hat{\boldsymbol{w}}$ minimising a given loss function with a ridge regularisation term. In the following, we describe how to accomplish this task for both the square and the logistic loss.

Square loss: in this case, the goal is to solve the following optimisation problem:

Equation (E.1)

which has a simple closed-form solution given in terms of the Moore–Penrose inverse:

Equation (E.2)
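As a sketch, and assuming equation (E.1) is the unnormalised objective $\frac{1}{2}{\Vert}\boldsymbol{y}-\mathrm{X}\boldsymbol{w}{{\Vert}}^{2}+\frac{\lambda }{2}{\Vert}\boldsymbol{w}{{\Vert}}^{2}$ (any sample or dimension normalisation in (E.1), not reproduced here, simply rescales λ), the closed-form estimator is $\hat{\boldsymbol{w}}={\left({\mathrm{X}}^{\top }\mathrm{X}+\lambda {\mathrm{I}}_{p}\right)}^{-1}{\mathrm{X}}^{\top }\boldsymbol{y}$ and can be computed as:

```python
import numpy as np

def ridge_estimator(X, y, lam):
    """Closed-form minimiser of 0.5 * ||y - X w||^2 + 0.5 * lam * ||w||^2,
    i.e. w = (X^T X + lam I)^{-1} X^T y (normalisation must match equation (E.1))."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
```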

Logistic loss: in this case, the goal is to solve the following optimisation problem:

Equation (E.3)

To solve the above, we use gradient descent (GD) on the regularised loss. In our simulations, we took advantage of scikit-learn 0.22.1, an out-of-the-box open-source library for machine learning tasks in Python [63, 64]. The library provides the class sklearn.linear_model.LogisticRegression, which implements GD with the logistic loss and an additional ℓ2-regularisation if the parameter 'penalty' is set to 'l2'. GD stops either when the following condition is satisfied:

Equation (E.4)

with ∇w denoting the gradient with respect to w, or if a maximum number of iterations is reached. We set tol to $10^{-4}$ and the maximum number of iterations to $10^{4}$.
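In code, the fit described above can be reproduced along the following lines. Note that scikit-learn parametrises the regularisation strength through C, and the mapping C = 1/λ below is an assumption that depends on the normalisation of the loss in equation (E.3); the absence of a bias term (fit_intercept=False) is likewise assumed from the generalised linear model setup.

```python
from sklearn.linear_model import LogisticRegression

def logistic_estimator(X, y, lam):
    """Fit the l2-regularised logistic regression of equation (E.3), labels in {-1, +1}.
    The mapping C = 1 / lam is an assumption: scikit-learn minimises
    0.5 * ||w||^2 + C * sum of logistic losses, so the exact correspondence
    depends on the normalisation used in equation (E.3)."""
    clf = LogisticRegression(penalty="l2", C=1.0 / lam, fit_intercept=False,
                             tol=1e-4, max_iter=10_000)
    clf.fit(X, y)
    return clf.coef_.ravel()
```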

In both cases described above, the algorithm returns the estimator $\hat{\boldsymbol{w}}\in {\mathbb{R}}^{p}$, from which all the quantities of interest can be evaluated. For instance, the generalisation error can simply be computed by drawing a new, independent sample $\{{\mathrm{X}}^{\text{new}},{\boldsymbol{y}}^{\text{new}}\}$ using algorithm 1 with the same inputs F, σ, f0 and ${\boldsymbol{\theta }}^{0}$, and computing:

Equation (E.5)

with $\hat{f}(x)=x$ for the regression task and $\hat{f}(x)=\mathrm{s}\mathrm{i}\mathrm{g}\mathrm{n}(x)$ for the classification task.

The procedure outlined above is repeated nseeds times, for different and independent draws of the random quantities F and ${\boldsymbol{\theta }}^{0}$, and a simple mean is taken in order to obtain the ensemble average of the different quantities. In most of the examples from the main, we found that nseeds = 30 was enough to obtain very good agreement with the analytical prediction from the replica analysis. The full pipeline for computing the averaged generalisation error is exemplified in algorithm 2.

Algorithm 2. Averaged generalisation error.

Input: integer d, parameters $\alpha ,\gamma ,\lambda \in {\mathbb{R}}_{+}$, non-linear functions $\sigma ,{f}^{0},\hat{f}$ and integer nseeds
Assign p ← ⌊d/γ⌋, n ← ⌊αp⌋
Initialise Eg = 0
for i = 1 to nseeds do
 Draw F, θ 0
 Assign X, y ← algorithm 1
 Compute $\hat{\boldsymbol{w}}$ from equations (E.1) or (E.3) with X, y and λ
 Generate new dataset Xnew, y new from algorithm 1
 Assign ${E}_{g}{\leftarrow}{E}_{g}+\frac{1}{{4}^{k}n}{\Vert}{\boldsymbol{y}}^{\text{new}}-\hat{f}\left({\mathrm{X}}^{\text{new}}\hat{\boldsymbol{w}}\right){{\Vert}}_{2}^{2}$
end for
Return: ${{\epsilon}}_{g}=\frac{{E}_{g}}{{n}_{\text{seeds}}}$
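Putting the pieces together, a minimal transcription of algorithm 2 could read as follows; it reuses the hypothetical helpers generate_dataset, gaussian_projection, ridge_estimator and logistic_estimator sketched above, with k = 0 for regression and k = 1 for classification as in appendix D.1.

```python
import numpy as np

def averaged_generalisation_error(d, alpha, gamma, lam, sigma, f0, fhat, k,
                                  estimator, n_seeds=30, rng=None):
    """Transcription of algorithm 2, reusing the helpers sketched above."""
    rng = np.random.default_rng() if rng is None else rng
    p = int(d / gamma)
    n = int(alpha * p)
    Eg = 0.0
    for _ in range(n_seeds):
        F = gaussian_projection(d, p, rng)        # or orthogonal_projection(d, p)
        theta0 = rng.standard_normal(d)
        X, y = generate_dataset(d, alpha, gamma, F, theta0, sigma, f0, rng)
        w_hat = estimator(X, y, lam)              # ridge_estimator or logistic_estimator
        X_new, y_new = generate_dataset(d, alpha, gamma, F, theta0, sigma, f0, rng)
        Eg += np.sum((y_new - fhat(X_new @ w_hat)) ** 2) / (4 ** k * n)
    return Eg / n_seeds
```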

Footnotes

  • This article is an updated version of: Gerace F, Loureiro B, Krzakala F, Mezard M and Zdeborova L 2020 Generalisation error in learning with random features and the hidden manifold model Proc. 37th Int. Conf. Machine Learning vol 119 ed H D III and A Singh pp 3452–62.
