
A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent*


Published 29 December 2021 © 2021 IOP Publishing Ltd and SISSA Medialab srl
Citation: Zhenyu Liao et al J. Stat. Mech. (2021) 124006. DOI: 10.1088/1742-5468/ac3a77


Abstract

This article characterizes the exact asymptotics of random Fourier feature (RFF) regression, in the realistic setting where the number of data samples n, their dimension p, and the dimension of feature space N are all large and comparable. In this regime, the random RFF Gram matrix no longer converges to the well-known limiting Gaussian kernel matrix (as it does when N → ∞ alone), but it still has a tractable behavior that is captured by our analysis. This analysis also provides accurate estimates of training and test regression errors for large n, p, N. Based on these estimates, a precise characterization of two qualitatively different phases of learning, including the phase transition between them, is provided; and the corresponding double descent test error curve is derived from this phase transition behavior. These results do not depend on strong assumptions on the data distribution, and they perfectly match empirical results on real-world data sets.


1. Introduction

For a machine learning system having N parameters, trained on a data set of size n, asymptotic analysis as used in classical statistical learning theory typically either focuses on the (statistical) population n → ∞ limit, for N fixed, or the over-parameterized N → ∞ limit, for a given n, as in the popular neural tangent kernel (NTK) regime [1]. These two settings are technically more convenient to work with, yet less practical, as they essentially assume that one of the two dimensions is negligibly small compared to the other, and this is rarely the case in practice. Indeed, with a factor of 2 or 10 more data, one typically works with a more complex model. This has been highlighted perhaps most prominently in recent work on neural network models, in which the model complexity and data size increase together. For this reason, the double asymptotic regime where n, N → ∞ with N/n → c, a constant, is a particularly interesting (and likely more realistic) limit, despite being technically more challenging [2–8]. In particular, working in this regime allows for a finer quantitative assessment of machine learning systems, as a function of their relative complexity N/n, as well as for a precise description of the under- to over-parameterized 'phase transition' (that does not appear, e.g. in the N → ∞ alone analysis). This transition is largely hidden in the usual style of statistical learning theory [9], but it is well known in the statistical mechanics approach to learning theory [2–5], and empirical signatures of it have received attention recently under the name 'double descent' phenomena [10–12].

This article considers the asymptotics of random Fourier features (RFFs) [13], and more generally random feature maps, which may be viewed also as a single-hidden-layer neural network model, in this limit. More precisely, let $\mathbf{X}=[{\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{n}]\in {\mathbb{R}}^{p\times n}$ denote the data matrix of size n with data vectors ${\mathbf{x}}_{i}\in {\mathbb{R}}^{p}$ as column vectors. The random feature matrix ΣX of X is generated by pre-multiplying X by some random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ having i.i.d. entries and then passing the result entry-wise through some nonlinear function σ(⋅), i.e. ${\mathbf{\Sigma }}_{\mathbf{X}}\equiv \sigma (\mathbf{W}\mathbf{X})\in {\mathbb{R}}^{N\times n}$. Commonly used random feature techniques such as RFFs [13] and homogeneous kernel maps [14], however, rarely involve a single non-linearity. The popular RFF maps are built with cosine and sine non-linearities, so that ${\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{2N\times n}$ is obtained by cascading the random features of both, i.e. ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}\equiv [\mathrm{cos}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}},\enspace \mathrm{sin}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}}]$. Note that, by combining both non-linearities, RFFs generated from $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ are of dimension 2N.

The large N asymptotics of random feature maps is closely related to their limiting kernel matrices KX. In the case of RFF, it was shown in [13] that entry-wise the Gram matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ converges to the Gaussian kernel matrix ${\mathbf{K}}_{\mathbf{X}}\equiv {\left\{\mathrm{exp}(-{\Vert}{\mathbf{x}}_{i}-{\mathbf{x}}_{j}{{\Vert}}^{2}/2)\right\}}_{i,j=1}^{n}$, as N → ∞. This follows from $\frac{1}{N}{[{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}]}_{ij}=\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{cos}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{cos}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})+\mathrm{sin}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{sin}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})$, for wt independent Gaussian random vectors, so that by the strong law of large numbers, for fixed n, p, ${[{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N]}_{ij}$ goes to its expectation (with respect to $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$) almost surely as N → ∞, i.e.

Equation (1)

with

Equation (2)

(The identification with ${[{\mathbf{K}}_{\mathbf{X}}]}_{ij}$ is easily shown in lemma 1 of appendix A.)

While this result holds in the N → ∞ limit, recent advances in random matrix theory [15, 16] suggest that, in the more practical setting where N is not much larger than n, p and n, p, N → ∞ at the same pace, the situation is more subtle. In particular, the above entry-wise convergence remains valid, but the convergence ${\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N-{\mathbf{K}}_{\mathbf{X}}{\Vert}\to 0$ no longer holds in spectral norm, due to the factor n, now large, in the norm inequality ||A||∞ ⩽ ||A|| ⩽ n||A||∞ for $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ and ||A||∞ ≡ maxij |Aij|. This implies that, in the large n, p, N regime, the assessment of the behavior of ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ via the limiting kernel KX may result in a spectral norm error that blows up with n. As a consequence, for various machine learning algorithms, the performance guarantee offered by the limiting Gaussian kernel is less likely to agree with empirical observations in real-world large-scale problems, when n, p are large [17].
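To make this discussion concrete, the following minimal NumPy sketch (not the authors' code; data and dimensions are illustrative) builds the RFF matrix ΣX, the empirical Gram matrix ΣX^TΣX/N and the limiting Gaussian kernel KX, and then reports both the entry-wise error (small) and the spectral-norm error (not small when N is comparable to n):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, N = 1000, 200, 500                       # comparable dimensions (illustrative)

X = rng.standard_normal((p, n)) / np.sqrt(p)   # toy data; columns have roughly unit norm
W = rng.standard_normal((N, p))                # i.i.d. standard Gaussian weights
WX = W @ X
Sigma_X = np.vstack([np.cos(WX), np.sin(WX)])  # RFF matrix, of size 2N x n

G = Sigma_X.T @ Sigma_X / N                    # empirical RFF Gram matrix

# limiting Gaussian kernel [K_X]_{ij} = exp(-||x_i - x_j||^2 / 2)
sq = np.sum(X**2, axis=0)
K_X = np.exp(-(sq[:, None] + sq[None, :] - 2 * X.T @ X) / 2)

print("entry-wise error   :", np.max(np.abs(G - K_X)))          # vanishes as N grows
print("spectral-norm error:", np.linalg.norm(G - K_X, ord=2))    # does not, for N ~ n
```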

1.1. Our main contributions

We consider the RFF model in the more realistic large n, p, N limit. While, in this setting, the RFF empirical Gram matrix does not converge to the Gaussian kernel matrix, we can characterize its behavior as n, p, N → ∞ and provide asymptotic performance guarantees for RFF on large-scale problems. We also identify a phase transition as a function of the ratio N/n, including the corresponding double descent phenomenon. In more detail, our contributions are the following.

  • (a)  
    We provide a precise characterization of the asymptotics of the RFF empirical Gram matrix, in the large n, p, N limit (theorem 1). This is accomplished by constructing a deterministic equivalent for the resolvent of the RFF Gram matrix. Based on this, the behavior of the RFF model is (asymptotically) accessible through a fixed-point equation that can be interpreted in terms of an angle-like correction induced by the non-trivial large n, p, N limit (relative to the N → ∞ alone limit).
  • (b)  
    We derive the asymptotic training and test mean squared errors (MSEs) of RFF ridge regression, as a function of the ratio N/n, the regularization penalty λ, and the training and test sets (theorems 2 and 3, respectively). We identify precisely the under- to over-parameterization phase transition, as a function of the relative model complexity N/n; we prove the existence of a 'singular' peak of the test error at the N/n = 1/2 boundary; and we characterize the corresponding double descent behavior. Importantly, our results are valid with almost no specific assumption on the data distribution. This is a significant improvement over existing double descent analyses, which fundamentally rely on the knowledge of the data distribution (often assumed to be multivariate Gaussian for simplicity) [12, 18].
  • (c)  
    We provide a detailed empirical evaluation of our theoretical results, demonstrating that the theory closely matches empirical results on a range of real-world data sets (sections 3 and 4). This includes the correction due to the large n, p, N setting, sharp transitions (as a function of N/n) in the aforementioned angle-like quantities, and the corresponding double descent test curves. This also includes an evaluation of the impact of training-test similarity and the effect of different data sets, thus confirming, as stated in (b), that (unlike in prior work) the phase transition and double descent curve hold much more generally with respect to the data distribution.

1.2. Related work

Here, we provide a brief review of related previous efforts.

Random features and limiting kernels. In most RFF work [19–22], non-asymptotic bounds are given on the number of random features N needed for a predefined approximation error, for a given kernel matrix with fixed n, p. A more recent line of work [1, 23–25] has focused on the over-parameterized N → ∞ limit of large neural networks by studying the corresponding NTKs. Here, we position ourselves in the more practical regime where n, p, N are all large and comparable, and we provide asymptotic performance guarantees that better fit large-scale problems compared to the large-N-alone analysis.

Random matrix theory. From a random matrix theory perspective, nonlinear Gram matrices of the type ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}$ have recently received unprecedented research interest, due to their close connection to neural networks [26–29], with a particular focus on the associated eigenvalue distribution. Here we propose a deterministic equivalent [30, 31] analysis for the resolvent matrix that provides access, not only to the eigenvalue distribution, but also to the regression error of central interest in this article. While most existing deterministic equivalent analyses are performed on linear models, here we focus on the nonlinear RFF model. From a technical perspective, the most relevant work is [12, 15]. We improve on their results by considering a generic data model for the popular RFF model.

Statistical mechanics of learning. A long history of connections between statistical mechanics and machine learning models (such as neural networks) exists, including a range of techniques to establish generalization bounds [2–5], and recently there has been renewed interest [7, 8, 32–34]. Their relevance to our results lies in the use of the so-called thermodynamic limit (akin to the large n, p, N limit), rather than the classical limits more commonly used in statistical learning theory, in which case uniform convergence bounds and related techniques can be applied.

Double descent in large-scale learning systems. The large n, N asymptotics of statistical models has received considerable research interest in the machine learning community [18, 35], resulting in a (somewhat) counterintuitive phenomenon referred to as the 'double descent'. Instead of focusing on different 'phases of learning' [2–5, 7], the 'double descent' phenomenon focuses on an empirical manifestation of the phase boundary and refers to the empirical observations of the test error curve as a function of the model complexity, which differs from the usual textbook description of the bias-variance tradeoff [10, 11, 36, 37]. Theoretical investigation into this phenomenon mainly focuses on various regression models [12, 18, 38–41]. In most cases, quite specific (and rather strong) assumptions are imposed on the input data distribution. In this respect, our work extends the analysis in [12] to handle the RFF model and its phase structure on real-world data sets.

1.3. Notations and organization of the paper

Throughout this article, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters. In addition, the notation $(\cdot )^{\mathsf{\text{T}}}$ denotes the transpose operator; the norm ||⋅|| is the Euclidean norm for vectors and the spectral or operator norm for matrices; and $\;\stackrel{a.s.}{\to }\;$ stands for almost sure convergence of random variables.

Our main results on the asymptotic behavior of the RFF resolvent matrix, as well as of the training MSE and testing MSE of RFF ridge regression are presented in section 2, with detailed proofs deferred to the appendix. In section 3, we provide a detailed empirical evaluation of our main results; and in section 4, we provide additional empirical evaluation on real-world data, illustrating the practical effectiveness of the proposed analysis. Concluding remarks are placed in section 5.

2. Main technical results

In this section, we present our main theoretical results. To investigate the large n, p, N asymptotics of the RFF model, we position ourselves under the following assumption.

Assumption 1. As n → ∞, we have

  • (a)  
    0 < lim infn min{p/n, N/n} ⩽ lim supn max{p/n, N/n} < ∞; or, practically speaking, the ratios p/n and N/n are only moderately large or moderately small.
  • (b)  
    lim supn ||X|| < ∞ and lim supn ||y|| < ∞, i.e. the data and targets are both normalized with respect to n.

Under assumption 1, we consider the RFF regression model as in figure 1.


Figure 1. Illustration of an RFF regression model.


For training data $\mathbf{X}\in {\mathbb{R}}^{p\times n}$ of size n, the associated RFFs, ${\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{2N\times n}$, are obtained by computing $\mathbf{W}\mathbf{X}\in {\mathbb{R}}^{N\times n}$, for standard Gaussian random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$, and then applying entry-wise cosine and sine non-linearities on WX, that is

Equation (3)

Given this setup, the RFF ridge regressor $\boldsymbol{\beta }\in {\mathbb{R}}^{2N}$ is given by, for λ ⩾ 0,

Equation (4)

The two forms of β in (4) are equivalent for any λ > 0 and minimize the (ridge-regularized) squared loss $\frac{1}{n}{\Vert}\mathbf{y}-{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2}+\lambda {\Vert}\boldsymbol{\beta }{{\Vert}}^{2}$ on the training set (X, y). Our objective is to characterize the large n, p, N asymptotics of both the training MSE, Etrain, and the test MSE, Etest, defined respectively as

Equation (5)

with ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\equiv [\mathrm{cos}{(\mathbf{W}\hat{\mathbf{X}})}^{\mathsf{\text{T}}},\enspace \mathrm{sin}{(\mathbf{W}\hat{\mathbf{X}})}^{\mathsf{\text{T}}}]\in {\mathbb{R}}^{\hat{n}\times 2N}$ on a test set $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ of size $\hat{n}$, and from this to characterize the phase transition behavior (as a function of the model complexity N/n) as mentioned in section 1. Precisely, in the training phase, the random weight matrix W is drawn once and kept fixed; and the RFF ridge regressor β is given explicitly as a function of W and the training set (X, y), as per (4). In the test phase, for β now fixed, the model takes the test data $\hat{\mathbf{X}}$ as input, and it outputs ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\boldsymbol{\beta }$ that should be compared to the corresponding target $\hat{\mathbf{y}}$ to measure the model test performance, Etest.
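As a rough illustration of this training/test pipeline, here is a minimal NumPy sketch on synthetic data. The helper names rff_features and rff_ridge are ours (hypothetical, for illustration only), and the toy data model is a stand-in rather than anything used in the paper; the sketch computes the two equivalent forms of β in (4) and the training and test MSEs of (5).

```python
import numpy as np

def rff_features(X, W):
    """RFF map of equation (3): stacked entry-wise cos and sin of WX, of size 2N x n."""
    WX = W @ X
    return np.vstack([np.cos(WX), np.sin(WX)])

def rff_ridge(X, y, X_hat, y_hat, N, lam, rng):
    """Fit the RFF ridge regressor of (4) and return the training and test MSEs of (5)."""
    p, n = X.shape
    W = rng.standard_normal((N, p))              # W is drawn once and then kept fixed
    S, S_hat = rff_features(X, W), rff_features(X_hat, W)

    # the two equivalent forms of beta in (4), valid for any lambda > 0
    beta = np.linalg.solve(S @ S.T / n + lam * np.eye(2 * N), S @ y / n)
    beta_dual = S @ np.linalg.solve(S.T @ S / n + lam * np.eye(n), y) / n
    assert np.allclose(beta, beta_dual)

    E_train = np.mean((y - S.T @ beta) ** 2)          # (1/n) ||y - Sigma_X^T beta||^2
    E_test = np.mean((y_hat - S_hat.T @ beta) ** 2)   # (1/n_hat) ||y_hat - Sigma_Xhat^T beta||^2
    return E_train, E_test

# toy usage on synthetic data (illustrative sizes and targets)
rng = np.random.default_rng(0)
p, n = 50, 400
X = rng.standard_normal((p, n)) / np.sqrt(p)
X_hat = rng.standard_normal((p, n)) / np.sqrt(p)
w_star = rng.standard_normal(p)
y, y_hat = np.sign(X.T @ w_star), np.sign(X_hat.T @ w_star)
print(rff_ridge(X, y, X_hat, y_hat, N=300, lam=1e-2, rng=rng))
```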

2.1. Asymptotic deterministic equivalent

To start, we observe that the training MSE, Etrain, in (5), can be written as

Equation (6)

which depends on the quadratic form yT Q(λ)y of

Equation (7)

the so-called resolvent of $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}$ (also denoted Q when there is no ambiguity) with λ > 0. To see this, from (5) we have ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{1}{n}{\Vert}\mathbf{y}-\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}{(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n})}^{-1}\mathbf{y}{{\Vert}}^{2}=\frac{{\lambda }^{2}}{n}{\Vert}\mathbf{Q}(\lambda )\mathbf{y}{{\Vert}}^{2}=-\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\frac{\partial \mathbf{Q}(\lambda )}{\partial \lambda }\mathbf{y}$, with $\frac{\partial \mathbf{Q}(\lambda )}{\partial \lambda }=-{\mathbf{Q}}^{2}(\lambda )$.

In order to assess the asymptotic training MSE, it thus suffices to find a deterministic equivalent for Q(λ), that is, a deterministic matrix that captures the asymptotic behavior of the latter. One possibility is the expectation ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}(\lambda )]$. Informally, if the training MSE Etrain (that is random due to random W for given X, y) is 'close to' some deterministic quantity ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$, in the large n, p, N limit, then ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ must have the same limit as ${\mathbb{E}}_{\mathbf{W}}[{E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}]=-\frac{{\lambda }^{2}}{n}\partial {\mathbf{y}}^{\mathsf{\text{T}}}{\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}(\lambda )]\mathbf{y}/\partial \lambda $ as n, p, N → ∞. However, ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ involves integration (with no closed-form due to the matrix inverse), and it is not a convenient quantity with which to work. Our objective is to find an asymptotic 'alternative' for ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ that is (i) close to ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ in the large n, p, N limit and (ii) numerically more accessible.

In the following theorem, we introduce an asymptotic equivalent for ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$. Instead of being directly related to the Gaussian kernel KX = Kcos + Ksin as suggested by (2) in the large-N-only limit, it depends on the two components Kcos, Ksin in a more involved manner. Importantly, the proposed equivalent $\bar{\mathbf{Q}}$ can be numerically evaluated by running simple fixed-point iterations involving Kcos and Ksin.

Theorem 1. (Asymptotic equivalent for ${\mathbb{E}}_{\mathbf{W}}$[Q]). Under assumption 1, for Q defined in (7) and λ > 0, we have, as n → ∞, ${\Vert}{\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]-\bar{\mathbf{Q}}{\Vert}\to 0$,

for $\bar{\mathbf{Q}}\equiv {\left(\frac{N}{n}(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}})+\lambda {\mathbf{I}}_{n}\right)}^{-1}$, ${\mathbf{K}}_{\mathrm{cos}}\equiv {\mathbf{K}}_{\mathrm{cos}}(\mathbf{X},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}\equiv {\mathbf{K}}_{\mathrm{sin}}(\mathbf{X},\mathbf{X})\in {\mathbb{R}}^{n\times n}$ and

Equation (8)

where (δcos, δsin) is the unique positive solution to

Equation (9)

Proof. See appendix A. □
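Since equations (8) and (9) appear above only as placeholders, the following sketch spells out one way to evaluate $\bar{\mathbf{Q}}$ numerically, under two assumptions of ours: that Kcos and Ksin in (8) are the Gaussian expectations E_w[cos(w^T x_i)cos(w^T x_j)] and E_w[sin(w^T x_i)sin(w^T x_j)] (which indeed sum to the Gaussian kernel KX, consistently with the introduction), and that the fixed point (9) reads δ_σ = tr(K_σ Q̄)/n for σ ∈ {cos, sin}, as the concentration of α_σ around tr(K_σ Q̃)/n in appendix A suggests. This is a sketch of our reading, not a verbatim transcription of the paper's equations.

```python
import numpy as np

def rff_kernels(X):
    """Gaussian expectations E_w[cos(w^T x_i)cos(w^T x_j)] and E_w[sin(w^T x_i)sin(w^T x_j)]
    for w ~ N(0, I_p); their sum is the Gaussian kernel K_X of the introduction."""
    sq = np.sum(X**2, axis=0)
    env = np.exp(-(sq[:, None] + sq[None, :]) / 2)
    G = X.T @ X
    return env * np.cosh(G), env * np.sinh(G)            # taken here as K_cos, K_sin

def deterministic_equivalent(K_cos, K_sin, N, lam, n_iter=500, tol=1e-9):
    """Fixed-point iteration for (delta_cos, delta_sin) and Q_bar of theorem 1,
    assuming (9) reads delta_sigma = tr(K_sigma Q_bar) / n (our reading of the placeholder)."""
    n = K_cos.shape[0]
    d_cos = d_sin = 0.0
    for _ in range(n_iter):
        Q_bar = np.linalg.inv(N / n * (K_cos / (1 + d_cos) + K_sin / (1 + d_sin))
                              + lam * np.eye(n))
        d_cos_new = np.trace(K_cos @ Q_bar) / n
        d_sin_new = np.trace(K_sin @ Q_bar) / n
        if max(abs(d_cos_new - d_cos), abs(d_sin_new - d_sin)) < tol:
            d_cos, d_sin = d_cos_new, d_sin_new
            break
        d_cos, d_sin = d_cos_new, d_sin_new
    return d_cos, d_sin, Q_bar

# illustrative usage (over-parameterized regime, moderate lambda, so the iteration settles quickly)
rng = np.random.default_rng(0)
p, n, N, lam = 50, 400, 400, 1e-2
X = rng.standard_normal((p, n)) / np.sqrt(p)
K_cos, K_sin = rff_kernels(X)
d_cos, d_sin, Q_bar = deterministic_equivalent(K_cos, K_sin, N, lam)
print(d_cos, d_sin)
```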

Remark 1. (Lower and upper bounds). Since

Equation (10)

in the positive definite order, with ${\mathbf{K}}_{\mathbf{X}}\equiv {\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}}$ the Gaussian kernel, $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ is therefore positive definite, if x1, ..., xn are all distinct; see theorem 2.18 in [42].

Remark 2. (Correction to large-N behavior). Taking N/n → ∞, one has δcos → 0, δsin → 0, so that

Equation (11)

for λ > 0 independent of N, n, in accordance with the classical large-N-only prediction. In this sense, the pair (δcos, δsin) introduced in theorem 1 accounts for the 'correction' due to the non-trivial n/N, as opposed to the N → ∞ alone analysis. Also, when the number of features N is large (i.e. as N/n → ∞), the regularization effect of λ flattens out and $\bar{\mathbf{Q}}$ behaves like (a scaled version of) the inverse Gaussian kernel matrix ${\mathbf{K}}_{\mathbf{X}}^{-1}$ (that is well-defined for distinct x1, ..., xn).

Remark 3. (Geometric interpretation). Since $\bar{\mathbf{Q}}$ shares the same eigenspace with $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$, one can geometrically interpret (δcos, δsin) as a sort of 'angle' between the eigenspaces of Kcos, Ksin and that of $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$. For fixed n, as N → ∞, one has $\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{cos}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}\mathbf{X})\to {\mathbf{K}}_{\mathrm{cos}}$, $\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{sin}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}\mathbf{X})\to {\mathbf{K}}_{\mathrm{sin}}$, the eigenspaces of which are 'orthogonal' to each other, so that δcos, δsin → 0. On the other hand, as N, n → ∞ at the same pace, the eigenspaces of Kcos and Ksin 'intersect' with each other, captured by the non-trivial (δcos, δsin).

2.2. Asymptotic training performance

Theorem 1 provides an asymptotically more tractable approximation of ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$. Together with some additional concentration arguments (e.g. from theorem 2 in [15]), this permits us to provide a complete description of the limiting behavior of the random bilinear form ${\mathbf{a}}^{\mathsf{\text{T}}}\mathbf{Q}\mathbf{b}$, for $\mathbf{a},\mathbf{b}\in {\mathbb{R}}^{n}$ of bounded Euclidean norms, in such a way that ${\mathbf{a}}^{\mathsf{\text{T}}}\mathbf{Q}\mathbf{b}-{\mathbf{a}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{b}\;\stackrel{a.s.}{\to }\;0$, as n, p, N → ∞. This, together with the fact that ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbf{Q}{(\lambda )}^{2}\mathbf{y}=-\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\partial \mathbf{Q}(\lambda )\mathbf{y}/\partial \lambda $, leads to the following result on the asymptotic training error.

Theorem 2. (Asymptotic training performance). Under assumption 1, for a given training set (X, y) and the training MSE Etrain defined in (5), we have, as n → ∞,

for $\bar{\mathbf{Q}}$ defined in theorem 1 and

Equation (12)

Proof. See appendix B. □

Remark 4. (First- and second-order corrections). Since ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}^{2}\mathbf{y}$, we can see in the expression of ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ that there is not only a first-order (large n, p, N) correction in the first $\frac{{\lambda }^{2}}{n}{\Vert}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}$ term (which is different from $\frac{{\lambda }^{2}}{n}{\Vert}\mathbf{Q}\mathbf{y}{{\Vert}}^{2}$), but there is also a second-order correction, appearing in the form of $\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }\bar{\mathbf{Q}}$ or $\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }$ for σ ∈ { cos,  sin }, as in the second term. This has a similar interpretation to remark 3, where the pair (δcos, δsin) in $\bar{\mathbf{Q}}$ is (geometrically) interpreted as the eigenspace 'intersection' due to a non-vanishing n/N. In particular, taking N/n → ∞, we have $\bar{\mathbf{Q}}\sim \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1}$, Ω → I2, so that ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=0$ and the model interpolates the entire training set, as expected.

One can show that (i) for a given n and λ > 0, ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ decreases as the model size N increases; and (ii) for a given ratio N/n, ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ increases as the regularization penalty λ grows large.

2.3. Asymptotic test performance

Theorem 2 holds without any restriction on the training set, (X, y), except for assumption 1, since only the randomness of W is involved, and thus one can simply treat (X, y) as known in this result. This is no longer the case for the test error. Intuitively, the test data $\hat{\mathbf{X}}$ cannot be chosen arbitrarily, and one must ensure that the test data 'behave' statistically like the training data, in some 'well-controlled' manner, so that the test MSE is asymptotically deterministic and bounded as $n,\hat{n},p,N\to \infty $. Following this intuition, we work under the following assumption.

Assumption 2. (Data as concentrated random vectors [43]). The training data ${\mathbf{x}}_{i}\in {\mathbb{R}}^{p},i\in \left\{1,\dots ,n\right\}$, are independently drawn (not necessarily uniformly) from one of K > 0 distribution classes μ1, ..., μK. There exist constants C, η, q > 0 such that for any xi ∼ μk , k ∈ {1, ..., K} and any one-Lipschitz function $f:{\mathbb{R}}^{p}\to \mathbb{R}$, we have

Equation (13)

The test data ${\hat{\mathbf{x}}}_{i}\sim {\mu }_{k}$, $i\in \left\{1,\dots ,\hat{n}\right\}$, are mutually independent, but may depend on the training data X, and are such that ${\Vert}\mathbb{E}[\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}})]{\Vert}=O(\sqrt{n})$ for σ ∈ { cos,  sin }.

To facilitate the discussion of the phase transition and the double descent, we do not assume independence between training and test data (but we do assume independence between different data vectors within X and $\hat{\mathbf{X}}$). In this respect, assumption 2 is weaker than the classical i.i.d. assumption, and it permits us to illustrate the impact of training-test similarity on the model performance (section 4.2).

A first example of concentrated random vectors satisfying (13) is the multivariate Gaussian vector $\mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ [44]. Moreover, since the concentration property in (13) is stable over Lipschitz transformations [43], it holds, for any one-Lipschitz mapping $g:{\mathbb{R}}^{d}\to {\mathbb{R}}^{p}$ and $\mathbf{z}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{d})$, that g(z) also satisfies (13). In this respect, assumption 2, although seemingly quite restrictive, represents a large family of 'generative models', including notably the 'fake images' generated by modern generative adversarial networks that are, by construction, Lipschitz transformations of large random Gaussian vectors [45, 46]. As such, from a practical consideration, assumption 2 provides a more realistic and flexible statistical model for real-world data.

With assumption 2, we have the following result on the asymptotic test error.

Theorem 3. (Asymptotic test performance). Under assumptions 1 and 2, we have, for test MSE Etest defined in (5) and test data $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ satisfying ${\mathrm{limsup}}_{\hat{n}}{\Vert}\hat{\mathbf{X}}{\Vert}< \infty $, ${\mathrm{limsup}}_{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}_{\infty }< \infty $ with $\hat{n}/n\in (0,\infty )$ that, as n → ∞,

for Ω defined in (12),

Equation (14)

and $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$, $\hat{\mathbf{\Phi }}\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}$, with ${\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})\in {\mathbb{R}}^{\hat{n}\times n}$ and ${\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}}),{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})\in {\mathbb{R}}^{\hat{n}\times \hat{n}}$ defined as in (8).

Proof. See appendix C. □

Similar to theorem 2 on ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$, here the expression for ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ is also given as the sum of first- and second-order corrections. To see this, one can confirm, by taking $(\hat{\mathbf{X}},\hat{\mathbf{y}})=(\mathbf{X},\mathbf{y})$, that the first term in ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ becomes

and is equal to the first term in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$, where we used the fact that $\frac{N}{n}\mathbf{\Phi }\bar{\mathbf{Q}}={\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$. The same also holds for the second term, so that one obtains ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}={\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$, with $(\hat{\mathbf{X}},\hat{\mathbf{y}})=(\mathbf{X},\mathbf{y})$, as expected. From this perspective, theorem 3 can be seen as an extension of theorem 2, with the 'interaction' between training and test data (e.g. test-versus-test ${\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\hat{\mathbf{X}})$ and test-versus-train ${\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ interaction matrices) summarized in the scalar parameter Θσ defined in (14), for σ ∈ { cos,  sin }.

By taking N/n → ∞, we have that $\bar{\mathbf{Q}}\sim \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1}$, Θσ ∼ N−1, Ω → I2, and consequently

This is the test MSE of classical Gaussian kernel regression, with $\mathbf{K}(\hat{\mathbf{X}},\mathbf{X})\equiv {\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})+{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})\in {\mathbb{R}}^{\hat{n}\times n}$ the test-versus-train Gaussian kernel matrix. As opposed to the training MSE discussed in remark 4, here ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ generally has a non-zero limit (that is, however, independent of λ) as N/n → ∞.
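A quick numerical sanity check of this limit on synthetic data follows (illustrative sizes; our reading of the display above is that the limiting prediction is the classical kernel regression prediction $\mathbf{K}(\hat{\mathbf{X}},\mathbf{X}){\mathbf{K}}_{\mathbf{X}}^{-1}\mathbf{y}$).

```python
import numpy as np

rng = np.random.default_rng(2)
p, n, n_hat, lam = 40, 200, 200, 1e-2
X = rng.standard_normal((p, n)) / np.sqrt(p)
X_hat = rng.standard_normal((p, n_hat)) / np.sqrt(p)
w_star = rng.standard_normal(p)
y, y_hat = np.sign(X.T @ w_star), np.sign(X_hat.T @ w_star)

def gauss_kernel(A, B):
    """Gaussian kernel matrix with entries exp(-||a_i - b_j||^2 / 2)."""
    sq_a, sq_b = np.sum(A**2, axis=0), np.sum(B**2, axis=0)
    return np.exp(-(sq_a[:, None] + sq_b[None, :] - 2 * A.T @ B) / 2)

# classical Gaussian kernel regression prediction K(X_hat, X) K_X^{-1} y
pred_kernel = gauss_kernel(X_hat, X) @ np.linalg.solve(gauss_kernel(X, X), y)
E_test_kernel = np.mean((y_hat - pred_kernel) ** 2)

# RFF ridge regression with N >> n and the same fixed lambda
N = 50 * n
W = rng.standard_normal((N, p))
S = np.vstack([np.cos(W @ X), np.sin(W @ X)])
S_hat = np.vstack([np.cos(W @ X_hat), np.sin(W @ X_hat)])
pred_rff = S_hat.T @ (S @ np.linalg.solve(S.T @ S / n + lam * np.eye(n), y)) / n
E_test_rff = np.mean((y_hat - pred_rff) ** 2)

print(E_test_rff, E_test_kernel)   # the two test MSEs approach each other as N/n grows
```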

3. Empirical evaluations and practical implications

In this section, we provide a detailed empirical evaluation, including a discussion of the behavior of the fixed-point equation in theorem 1, and its consequences in theorems 2 and 3.

In particular, we describe the behavior of the pair (δcos, δsin) that characterizes the necessary correction in the large n, p, N regime, as a function of the regularization penalty λ and the ratio N/n. This explains: (i) the mismatch between the empirical regression errors and the Gaussian kernel predictions (figure 2); (ii) the behavior of (δcos, δsin) as a function of λ (figure 3); (iii) the behavior of (δcos, δsin) as a function of N/n, which clearly indicates two phases of learning and the transition between them (figure 4); and (iv) the corresponding double descent test error curves (figure 5).


Figure 2. Training MSEs of RFF ridge regression on MNIST data (class 3 versus 7), as a function of the regression penalty λ, for p = 784, n = 1000, N = 250, 500, 1000, 2000. Empirical results displayed in blue circles; Gaussian kernel predictions (assuming N → ∞ alone) in black dashed lines; and theorem 2 in red solid lines. Results obtained by averaging over 30 runs.


Figure 3. Behavior of (δcos, δsin) in (15) on MNIST data (class 3 versus 7), as a function of the regularization parameter λ, for p = 784, n = 1 000, N = 250, 1 000, 4 000, 16 000.


Figure 4. Behavior of (δcos, δsin) in (15) on MNIST data set (class 3 versus 7), as a function of the ratio N/n, for p = 784, n = 1 000, λ = 10−7, 10−3, 1, 10. The black dashed line represents the interpolation threshold 2N = n.


Figure 5. Empirical (blue crosses) and theoretical (red dashed lines) test error of RFF regression as a function of the ratio N/n on MNIST data (class 3 versus 7), for p = 784, n = 500, λ = 10−7, 10−3, 0.2, 10. The black dashed line represents the interpolation threshold 2N = n.


3.1. Correction due to the large n, p, N regime

The RFF Gram matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ is not close to the classical Gaussian kernel matrix KX in the large n, p, N regime; and, as a consequence, its resolvent Q, as well as the training and test MSEs, Etrain and Etest (which are functions of Q), behave quite differently from the Gaussian kernel predictions. As already discussed in remark 2 after theorem 1, for λ > 0, the pair (δcos, δsin) characterizes the correction when considering n, p, N all large, compared to the large-N-only asymptotic behavior:

Equation (15)

To start, figure 2 compares the training MSEs of RFF ridge regression to the predictions from Gaussian kernel regression and to the predictions from our theorem 2, on the popular MNIST data set [47]. Observe that there is a huge gap between empirical training errors and the Gaussian kernel predictions, especially when N/n < 1, while our theory consistently fits empirical observations almost perfectly.

Next, from (15) we know that both δcos and δsin are decreasing functions of λ. (See lemma 7 in appendix D for a proof of this fact.) Figure 3 shows that: (i) over a range of different N/n, both δcos and δsin decrease monotonically as λ increases; (ii) the behavior for N/n < 1, which is decreasing from an initial value of δ ≫ 1, is very different from the behavior for N/n ≳ 1, where an initially flat region is observed for small values of λ and we have δ < 1 for all values of λ; and (iii) the impact of regularization λ becomes less significant as the ratio N/n becomes large. This is in accordance with the limiting behavior of $\bar{\mathbf{Q}}\simeq \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1}$ in remark 2 that is independent of λ as N/n → ∞.

Note also that, while δcos and δsin can be geometrically interpreted as a sort of weighted 'angle' between different kernel matrices, and therefore one might expect to have δ ∈ [0, 1], this is not the case for the leftmost plot with N/n = 1/4. There, for small values of λ (say λ ≲ 0.1), both δcos and δsin scale like λ−1, while they are observed to saturate to a fixed O(1) value for N/n = 1, 4, 16. This corresponds to two different phases of learning in the 'ridgeless' λ → 0 case. As we shall see in more detail later in section 4.1, depending on whether we are in the 'under-parameterized' (2N < n) or the 'over-parameterized' (2N > n) regime, the system behaves fundamentally differently.

3.2. Phase transition and corresponding double descent

Both δcos and δsin in (15) are decreasing functions of N, as depicted in figure 4. (See lemma 6 in appendix D for a proof.) More importantly, figure 4 also illustrates that δcos and δsin exhibit qualitatively different behavior: for λ not too small (λ = 1 or 10), we observe a rather 'smooth' behavior, as a function of the ratio N/n, and they both decrease smoothly, as N/n grows large. However, for λ relatively small (λ = 10−3 and 10−7), we observe a sharp 'phase transition' on the two sides of the interpolation threshold 2N = n. (Note that the scale of the y-axis is very different in different subfigures.) More precisely, in the leftmost plot with λ = 10−7, the values of δcos and δsin 'jump' from order O(1) (when 2N > n) to much higher values of the order of λ−1 (when 2N < n). A similar behavior is also observed for λ = 10−3.

As a consequence of this phase transition, different behaviors are expected for training and test MSEs in the 2N < n and 2N > n regimes. Figure 5 depicts the empirical and theoretical test MSEs for different regularization penalties λ. In particular, for λ = 10−7 and λ = 10−3, a double descent behavior is observed, with a singularity at 2N = n, while for larger values of λ (λ = 0.2, 10), a smoother and monotonically decreasing curve for the test error is observed, as a function of N/n. Figure 5 also illustrates that: (i) for a fixed regularization λ > 0, the minimum test error is always obtained in the over-parameterized 2N > n regime; and (ii) the global optimal design (over N and λ) is achieved by a highly over-parameterized system with a (problem-dependent) non-vanishing λ. This is in accordance with the observations in [12] for Gaussian data.
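The qualitative shape of these curves can be reproduced without the real data. The sketch below sweeps N/n on synthetic data with a vanishingly small λ (sizes, data and targets are illustrative, not those of figure 5); with such a λ, the empirical test MSE typically peaks near the interpolation threshold 2N = n before decreasing again.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, n_hat, lam = 100, 256, 256, 1e-7
X = rng.standard_normal((p, n)) / np.sqrt(p)          # synthetic stand-in for real data
X_hat = rng.standard_normal((p, n_hat)) / np.sqrt(p)
w_star = rng.standard_normal(p)
y, y_hat = np.sign(X.T @ w_star), np.sign(X_hat.T @ w_star)   # toy +/-1 targets

for ratio in [0.125, 0.25, 0.5, 0.75, 1.0, 2.0, 4.0]:          # N/n; threshold at 0.5
    N = int(ratio * n)
    W = rng.standard_normal((N, p))
    S = np.vstack([np.cos(W @ X), np.sin(W @ X)])              # 2N x n
    S_hat = np.vstack([np.cos(W @ X_hat), np.sin(W @ X_hat)])  # 2N x n_hat
    beta = S @ np.linalg.solve(S.T @ S / n + lam * np.eye(n), y) / n
    E_test = np.mean((y_hat - S_hat.T @ beta) ** 2)
    print(f"N/n = {ratio:5.3f}: empirical test MSE = {E_test:.3g}")
```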

Remark 5. (On ridge regularization). Performing ridge regularization (with λ as a control parameter) is known to help alleviate the sharp performance drop around 2N = n [12, 18]. Our theorem 3 can serve as a convenient alternative to evaluate the effect of small λ around 2N = n, as well as to determine an optimal λ, for not-too-small n, p, N. In the setup of figure 5, a grid search can be used to find the regularization that minimizes ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$. For this choice of λ (λopt ≈ 0.2), no singular peak at 2N = n is observed.

Remark 6. (Double descent as a consequence of phase transition). While the double descent phenomenon has received considerable attention recently, our analysis makes it clear that in this model (and presumably many others) it is a natural consequence of the phase transition between two qualitatively different phases of learning [7].

4. Additional discussion and results

In this section, we provide additional discussions and empirical results, to complement and extend those of section 3. We start, in section 4.1, by discussing in more detail the two different phases of learning for 2N < n and 2N > n, including the sharp phase transition at 2N = n, for (δcos, δsin), as well as the asymptotic test MSE, in the ridgeless λ → 0 case. Then, in section 4.2, we discuss the impact of training-test similarity on the test MSE by considering the example of test data $\hat{\mathbf{X}}$ obtained by slightly perturbing the training data X. Finally, in section 4.3, we present empirical results on additional real-world data sets to demonstrate the wide applicability of our results.

4.1. Two different learning regimes in the ridgeless limit

We chose to present our theoretical results in section 2 (theorems 1–3) in the same form, regardless of whether 2N > n or 2N < n. This comes at the cost of requiring a strictly positive ridge regularization λ > 0, as n, p, N → ∞. As discussed in section 3, for small values of λ, depending on the sign of 2N − n, we observe totally different behaviors for (δcos, δsin) and thus for the key resolvent $\bar{\mathbf{Q}}$. As a matter of fact, for λ = 0 and 2N < n, the (random) resolvent Q(λ = 0) in (7) is simply undefined, as it involves inverting a singular matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{n\times n}$ that is of rank at most 2N < n. As a consequence, we expect to see $\bar{\mathbf{Q}}\sim {\lambda }^{-1}$ as λ → 0 for 2N < n, while for 2N > n this is not the case.

These two phases of learning can be theoretically justified by considering the ridgeless λ → 0 limit in theorem 1, with the unified variables γcos and γsin introduced below.

  • (a)  
    For 2N < n and λ → 0, we obtain
    Equation (16)
    in such a way that δcos, δsin and $\bar{\mathbf{Q}}$ scale like λ−1. We have in particular $\mathbb{E}[\lambda \mathbf{Q}]\sim \lambda \bar{\mathbf{Q}}\sim {\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{{\gamma }_{\mathrm{sin}}}\right)+{\mathbf{I}}_{n}\right)}^{-1}$ with (γcos, γsin) of order O(1).
  • (b)  
    For 2N > n and λ → 0, we obtain
    Equation (17)
    by taking directly λ → 0 in theorem 1.

As a consequence, in the ridgeless limit λ → 0, theorem 1 exhibits the following two learning phases:

  • (a)  
    Under-parameterized phase: with 2N < n. Here, Q is not well defined (indeed Q ∼ λ−1) and one must consider instead the properly scaled γcos, γsin and $\lambda \bar{\mathbf{Q}}$ in (16). Like δcos and δsin, γcos and γsin also decrease as N/n grows large. In particular, one has ${\gamma }_{\mathrm{cos}},{\gamma }_{\mathrm{sin}},{\Vert}\lambda \bar{\mathbf{Q}}{\Vert}\to 0$ as 2N − n ↑ 0.
  • (b)  
    Over-parameterized phase: with 2N > n. Here, one can consider δcos, δsin and ${\Vert}\bar{\mathbf{Q}}{\Vert}$. One has in particular that ${\delta }_{\mathrm{cos}},{\delta }_{\mathrm{sin}},{\Vert}\bar{\mathbf{Q}}{\Vert}\to \infty $ as 2N − n ↓ 0, and they tend to zero as N/n → ∞.

With this discussion on the two phases of learning, we now understand why:

  • in the leftmost plot of figure 3 with 2N < n, δcos and δsin behave rather differently from other plots and approximately scale as λ−1 for small values of λ; and
  • in the first and second leftmost plots of figure 4, a 'jump' in the values of δ occurs at the transition point 2N = n, and the δ's are numerically of the same order as λ−1 for 2N < n.

To characterize the phase transition from (16) and (17) in the λ → 0 setting, we consider the scaled variables

Equation (18)

An advantage of using these scaled variables is that they are of order O(1) as n, p, N → ∞ and λ → 0. The behavior of (γcos, γsin) is reported in figure 6, in the same setting as figure 4. Observe the sharp transition between the 2N < n and 2N > n regimes, in particular for λ = 10−7 and λ = 10−3, and that this transition is smoothed out for λ = 1. (A 'transition' is also seen for λ = 10, but this is potentially misleading. It is true that γcos and γsin do change in this way, as a function of N/n, but unless λ ≈ 0, these quantities are not solutions of the aforementioned fixed point equations.)


Figure 6. Behavior of (γcos, γsin) in (18) on MNIST data set (class 3 versus 7), as a function of the ratio N/n, for p = 784, n = 1 000, λ = 10−7, 10−3, 1, 10. The black dashed line represents the interpolation threshold 2N = n.


On account of these two different phases of learning (under- and over-parameterized, in (16) and (17), respectively) and the sharp transition of (γcos, γsin) in figure 6, it is not surprising to observe a 'singular' behavior at 2N = n, when no regularization is applied. We next examine the asymptotic training and test errors in more detail.

Asymptotic training MSE as λ → 0. In the under-parameterized regime with 2N < n, combining with (16), we have that both $\lambda \bar{\mathbf{Q}}$ and $\frac{\bar{\mathbf{Q}}}{1+{\delta }_{\sigma }}\sim \frac{\lambda \bar{\mathbf{Q}}}{{\gamma }_{\sigma }},\sigma \in \left\{\mathrm{cos},\mathrm{sin}\right\}$ are well-behaved and are generally not zero. As a consequence, by theorem 2, the asymptotic training error ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ tends to a nonzero limit as λ → 0, measuring the residual information in the training set that is not captured by the regressor $\boldsymbol{\beta }\in {\mathbb{R}}^{2N}$. As 2N − n ↑ 0, we have γcos, γsin → 0 and ${\Vert}\lambda \bar{\mathbf{Q}}{\Vert}\to 0$ so that ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}\to 0$ and β interpolates the entire training set. On the other hand, in the over-parameterized 2N > n regime, one always has ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=0$. This in particular implies that the training error is 'continuous' around the point 2N = n.

Asymptotic test MSE as λ → 0. Again, in the under-parameterized regime with 2N < n, now consider the more involved asymptotic test error in theorem 3. In particular, we will focus here on the case $\hat{\mathbf{X}}\ne \mathbf{X}$ (or, more precisely, they are sufficiently different from each other in such a way that ${\Vert}\mathbf{X}-\hat{\mathbf{X}}{\Vert}\nrightarrow 0$ as n, p, N → ∞ and λ → 0; see further discussion below in section 4.2) so that ${\mathbf{K}}_{\sigma }(\mathbf{X},\mathbf{X})\ne {\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ and $\frac{N}{n}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\ne {\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$. In this case, the two-by-two matrix Ω in ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ diverges to infinity at 2N = n in the λ → 0 limit. (Indeed, the determinant det(Ω−1) scales as λ, per lemma 5.) As a consequence, we have ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}\to \infty $ as 2N → n, resulting in a sharp deterioration of the test performance around 2N = n. (Of course, this holds if no additional regularization is applied as discussed in remark 5.) It is also interesting to note that, while Ω also appears in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$, we still obtain (asymptotically) zero training MSE at 2N = n, despite the divergence of Ω, again due to the prefactor λ2 in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$. If λ ≳ 1, then det(Ω−1) exhibits much more regular properties (figure 7), as one would expect.


Figure 7. Behavior of det(Ω−1) on MNIST data set (class 3 versus 7), as a function of N/n, for p = 784, n = 1000 and λ = 10−7, 10−3, 1, 10. The black dashed line represents the interpolation threshold 2N = n.


4.2. Impact of training-test similarity

Continuing our discussion of the RFF performance in the large n, p, N limit, we can see that the (asymptotic) test error behaves entirely differently, depending on whether $\hat{\mathbf{X}}$ is 'close to' X or not. For $\hat{\mathbf{X}}=\mathbf{X}$, one has ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}={\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ that decreases monotonically as N grows large; while for $\hat{\mathbf{X}}$ 'sufficiently' different from X, ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ diverges to infinity at 2N = n. To have a more quantitative assessment of the influence of training-test data similarity on the test error, we consider the special case $\hat{n}=n$ and $\hat{\mathbf{y}}=\mathbf{y}$. In this case, it follows from theorem 3 that

for σ ∈ { cos,  sin }, ${\Delta}{\mathbf{K}}_{\sigma }={\mathbf{K}}_{\sigma }-{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ and ${\Delta}\mathbf{\Phi }\equiv \hat{\mathbf{\Phi }}-\mathbf{\Phi }$. Since in the ridgeless λ → 0 limit the matrix Ω scales as λ−1 (see figure 7), one must have Θσ scaling as λ so that ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ does not diverge at 2N = n as λ → 0. One example is the case where the test data are a small (additive) perturbation of the training data such that, in the kernel feature space

for ${\mathbf{\Xi }}_{\sigma },{\hat{\mathbf{\Xi }}}_{\sigma }\in {\mathbb{R}}^{n\times n}$ of bounded spectral norms. In this setting, we have ${{\Theta}}_{\sigma }=\frac{\lambda }{N}\enspace \mathrm{t}\mathrm{r}({\mathbf{\Xi }}_{\sigma }+{\hat{\mathbf{\Xi }}}_{\sigma })+O({\lambda }^{2})$ so that the asymptotic test error does not diverge to infinity at 2N = n as λ → 0. This is supported by figure 8, where the test data are generated by adding Gaussian white noise of variance σ2 to the training data, i.e. ${\hat{\mathbf{x}}}_{i}={\mathbf{x}}_{i}+\sigma {\boldsymbol{\varepsilon }}_{i}$, for independent ${\boldsymbol{\varepsilon }}_{i}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p}/p)$. In figure 8, we observe that (i) below the threshold σ2 = λ, the test error coincides with the training error and both are close to zero; and (ii) as soon as σ2 ≳ λ, the test error diverges from the training error and grows large (but linearly in σ2) as the noise level increases. Note also from the two rightmost plots of figure 8 that the training-to-test 'transition' at σ2 ≈ λ is sharp only for relatively small values of λ, as predicted by our theory.
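The experiment of figure 8 can be mimicked as follows, with synthetic stand-ins for the images (only the sizes n = n̂ = 1024 = 2N and p = 784 are taken from the caption); W and β are fixed once on the training data and the test MSE is evaluated on $\hat{\mathbf{X}}=\mathbf{X}+\sigma \boldsymbol{\varepsilon }$ for a range of noise levels σ2, for one value of λ.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, N, lam = 784, 1024, 512, 1e-3        # 2N = n, as in figure 8; data below are synthetic
X = rng.standard_normal((p, n)) / np.sqrt(p)
y = np.sign(rng.standard_normal(n))        # toy +/-1 targets shared by train and test

W = rng.standard_normal((N, p))            # W and beta are fixed once on the training data
S = np.vstack([np.cos(W @ X), np.sin(W @ X)])
beta = S @ np.linalg.solve(S.T @ S / n + lam * np.eye(n), y) / n
E_train = np.mean((y - S.T @ beta) ** 2)

for sigma2 in [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]:
    # test data X_hat = X + sigma * eps with eps_i ~ N(0, I_p / p), as in figure 8
    X_hat = X + np.sqrt(sigma2) * rng.standard_normal((p, n)) / np.sqrt(p)
    S_hat = np.vstack([np.cos(W @ X_hat), np.sin(W @ X_hat)])
    E_test = np.mean((y - S_hat.T @ beta) ** 2)
    # qualitatively, E_test stays near E_train while sigma^2 is well below lambda,
    # and departs from it once sigma^2 exceeds lambda
    print(f"sigma^2 = {sigma2:.0e}: E_train = {E_train:.3e}, E_test = {E_test:.3e}")
```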


Figure 8. Empirical training and test errors of RFF ridgeless regression on MNIST data (class 3 versus 7), when modeling training-test similarity as $\hat{\mathbf{X}}=\mathbf{X}+\sigma \boldsymbol{\varepsilon }$, with ɛ having i.i.d. $\mathcal{N}(0,1/p)$ entries, as a function of the noise level σ2, for N = 512, p = 784, $n=\hat{n}=1024=2N$, λ = 10−7, 10−3, 1, 10. Results obtained by averaging over 30 runs.


4.3. Additional real-world data sets

So far, we have presented results in detail for one particular real-world data set, but we have extensive empirical results demonstrating that similar conclusions hold more broadly. As an example of these additional results, here we present a numerical evaluation of our results on several other real-world image data sets. We consider the classification task on another two MNIST-like data sets composed of 28 × 28 grayscale images: the Fashion-MNIST [48] and the Kannada-MNIST [49] data sets. Each image is represented as a p = 784-dimensional vector and the output targets $\mathbf{y},\hat{\mathbf{y}}$ are taken to have −1, +1 entries depending on the image class. As a consequence, both the training and test MSEs in (5) are approximately 1 for N = 0 and sufficiently small λ, as observed in figures 5 and 11 below. For each data set, images were jointly centered and scaled so as to fall close to the setting of assumption 1 on X and $\hat{\mathbf{X}}$.

In figure 9, we compare the empirical training and test errors with their limiting behaviors derived in theorems 2 and 3, as a function of the penalty parameter λ, on a training set of size n = 1024 (512 images from class 5 and 512 images from class 6) with feature dimension N = 256 and N = 512, on both data sets. A close fit between theory and practice is observed, for moderately large values of n, p, N, demonstrating a wide practical applicability of the proposed asymptotic analyses, particularly compared to the (limiting) Gaussian kernel predictions per figure 2.


Figure 9. MSEs of RFF regression on Fashion-MNIST (left two) and Kannada-MNIST (right two) data (class 5 versus 6), as a function of the regression parameter λ, for p = 784, $n=\hat{n}=1024$, N = 256 and 512. Empirical results displayed in blue (circles for training and crosses for test); and the asymptotics from theorems 2 and 3 displayed in red (solid lines for training and dashed for test). Results obtained by averaging over 30 runs.


In figure 10, we report the behavior of the pair (δcos, δsin) for small values of λ = 10−7 and 10−3. Similar to the two leftmost plots in figure 4 for MNIST, a jump from the under-to over-parameterized regime occurs at the interpolation threshold 2N = n, in both Fashion- and Kannada-MNIST data sets, clearly indicating the two phases of learning and the phase transition between them.


Figure 10. Behavior of (δcos, δsin) in (15), on Fashion-MNIST (left two) and Kannada-MNIST (right two) data (class 8 versus 9), for p = 784, n = 1000, λ = 10−7 and 10−3. The black dashed line represents the interpolation threshold 2N = n.


In figure 11, we report the empirical and theoretical test errors as a function of the ratio N/n, on a training set of size n = 500 (250 images from class 8 and 250 images from class 9), by varying the feature dimension N. An exceedingly small regularization λ = 10−7 is applied to mimic the 'ridgeless' limiting behavior as λ → 0. On both data sets, the corresponding double descent curve is observed, where the test errors first go down and then up, with a singular peak around 2N = n, and then decrease monotonically as N continues to increase when 2N > n.


Figure 11. Empirical (blue crosses) and theoretical (red dashed lines) test error of RFF regression, as a function of the ratio N/n, on Fashion-MNIST (left two) and Kannada-MNIST (right two) data (class 8 versus 9), for p = 784, n = 500, λ = 10−7 and 10−3. The black dashed line represents the interpolation threshold 2N = n.


5. Conclusion

We have established a precise description of the resolvent of RFF Gram matrices, and provided asymptotic training and test performance guarantees for RFF ridge regression, in the limit where n, p, N → ∞ at the same pace. We have also discussed the under- and over-parameterized regimes, where the resolvent behaves dramatically differently. These observations involve only mild regularity assumptions on the data distribution, yielding phase transition behavior and corresponding double descent test error curves for RFF regression that closely match experiments on real-world data. From a technical perspective, our analysis extends to arbitrary combinations of (Lipschitz) non-linearities, such as the more involved homogeneous kernel maps [14]. This opens the door for future studies of more elaborate random feature structures and models. Extended to a (technically more involved) multi-layer setting in the more realistic large n, p, N regime, as in [50], our analysis may shed new light on the theoretical understanding of modern deep neural nets, beyond the large-N-alone NTK limit [1].

Acknowledgments

Z L would like to acknowledge the Fundamental Research Funds for the Central Universities of China (No. 2021XXJS110) and CCF-Hikvision Open Fund (20210008) for providing partial support of this work. R C would like to acknowledge the MIAI LargeDATA chair (ANR-19-P3IA-0003) at University Grenoble-Alpes as well as the HUAWEI LarDist project for providing partial support of this work. M W M would like to acknowledge DARPA, IARPA (Contract W911NF20C0035), NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

Appendix A.: Proof of theorem 1

Our objective is to prove, under assumption 1, the asymptotic equivalence between the expectation (with respect to W, omitted from now on) $\mathbb{E}[\mathbf{Q}]$ and

for ${\mathbf{K}}_{\mathrm{cos}}\equiv {\mathbf{K}}_{\mathrm{cos}}(\mathbf{X},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}\equiv {\mathbf{K}}_{\mathrm{sin}}(\mathbf{X},\mathbf{X})\in {\mathbb{R}}^{n\times n}$ defined in (8), with (δcos, δsin) the unique positive solution to

The existence and uniqueness of the solution to the above fixed-point equation are standard in the random matrix literature and can be established, for instance, with the standard interference function framework [51].

The asymptotic equivalence should be understood in the sense that ${\Vert}\mathbb{E}[\mathbf{Q}]-\bar{\mathbf{Q}}{\Vert}\to 0$ as n, p, N → ∞ at the same pace. We shall proceed by introducing an intermediary resolvent $\tilde{\mathbf{Q}}$ (see the definition in (A.2)) and subsequently showing that

In the sequel, we use o(1) and o||⋅||(1) for scalars or matrices with (almost surely, if random) vanishing absolute value or operator norm as n, p → ∞.

We start by introducing the following lemma.

Lemma 1. (Expectation of ${\sigma }_{1}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w}){\sigma }_{2}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})$ ). For $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ and ${\mathbf{x}}_{i},{\mathbf{x}}_{j}\in {\mathbb{R}}^{p}$ we have (per definition in (8))

Proof of Lemma 1. The proof follows the integration tricks in [15, 52]. Note in particular that the third equality holds in the case of the (cos,  sin) non-linearity but is in general not true for arbitrary Lipschitz (σ1, σ2). □

Let us focus on the resolvent $\mathbf{Q}\equiv {\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}$ of $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{n\times n}$, for RFF matrix ${\mathbf{\Sigma }}_{\mathbf{X}}\equiv \left[\begin{matrix}\hfill \mathrm{cos}(\mathbf{W}\mathbf{X})\hfill \\ \hfill \mathrm{sin}(\mathbf{W}\mathbf{X})\hfill \end{matrix}\right]$ that can be rewritten as

Equation (A.1)

for wi the ith row of $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ with ${\mathbf{w}}_{i}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p}),i=1,\dots ,N$, that is at the core of our analysis. Note from (A.1) that we have

with ${\mathbf{U}}_{i}=\left[\begin{matrix}\hfill \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\in {\mathbb{R}}^{n\times 2}$.

Letting

Equation (A.2)

with

Equation (A.3)

we have, with the resolvent identity (A−1B−1 = A−1(BA)B−1 for invertible A, B) that

for ${\mathbf{Q}}_{-i}\equiv {\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}-\frac{1}{n}{\mathbf{U}}_{i}{\mathbf{U}}_{i}+\lambda {\mathbf{I}}_{n}\right)}^{-1}$ that is independent of Ui (and thus wi ), where we applied the following Woodbury identity.

Lemma 2. (Woodbury). For $\mathbf{A},\mathbf{A}+\mathbf{U}{\mathbf{U}}^{\mathsf{\text{T}}}\in {\mathbb{R}}^{p\times p}$ both invertible and $\mathbf{U}\in {\mathbb{R}}^{p\times n}$, we have

so that in particular ${(\mathbf{A}+\mathbf{U}{\mathbf{U}}^{\mathsf{\text{T}}})}^{-1}\mathbf{U}={\mathbf{A}}^{-1}\mathbf{U}{({\mathbf{I}}_{n}+{\mathbf{U}}^{\mathsf{\text{T}}}{\mathbf{A}}^{-1}\mathbf{U})}^{-1}$.
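The identity of lemma 2 is elementary; the following numerical check (illustrative only, with an arbitrary positive definite A) may help the reader keep track of the dimensions.

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 8, 3
B = rng.standard_normal((p, p))
A = np.eye(p) + B @ B.T            # symmetric positive definite, hence invertible
U = rng.standard_normal((p, n))

lhs = np.linalg.inv(A + U @ U.T) @ U
rhs = np.linalg.inv(A) @ U @ np.linalg.inv(np.eye(n) + U.T @ np.linalg.inv(A) @ U)
print(np.linalg.norm(lhs - rhs))   # ~ 0: (A + U U^T)^{-1} U = A^{-1} U (I_n + U^T A^{-1} U)^{-1}
```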

Consider now the two-by-two matrix

which, according to the following lemma, is expected to be close to $\left[\begin{matrix}\hfill 1+{\alpha }_{\mathrm{cos}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 1+{\alpha }_{\mathrm{sin}}\hfill \end{matrix}\right]$ as defined in (A.3).

Lemma 3. (Concentration of quadratic forms). Under assumption 1, for σ1(⋅), σ2(⋅) two real one-Lipschitz functions, $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$, and $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ independent of w with ||A|| ⩽ 1, we have

for a, b ∈ {1, 2} and some universal constants C, c > 0.

Proof of Lemma 3. Lemma 3 extends lemma 1 in [15]: one observes that the proof there actually holds when two different Lipschitz non-linearities σ1(⋅), σ2(⋅) (and in particular cos and sin) are considered. □

For ${\mathbf{W}}_{-i}\in {\mathbb{R}}^{(N-1)\times p}$ the random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ with its ith row wi removed, lemma 3, together with the Lipschitz nature of the map ${\mathbf{W}}_{-i}{\mapsto}\frac{1}{n}{\sigma }_{a}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})$ for ${\mathbf{Q}}_{-i}={(\frac{1}{n}\enspace \mathrm{cos}{({\mathbf{W}}_{-i}\mathbf{X})}^{\mathsf{\text{T}}}\enspace \mathrm{cos}({\mathbf{W}}_{-i}\mathbf{X})+\frac{1}{n}\enspace \mathrm{sin}{({\mathbf{W}}_{-i}\mathbf{X})}^{\mathsf{\text{T}}}\enspace \mathrm{sin}({\mathbf{W}}_{-i}\mathbf{X})+\lambda {\mathbf{I}}_{n})}^{-1}$, leads to the following concentration result

Equation (A.4)

the proof of which follows the same line of argument as that of lemma 4 in [15] and is omitted here.

As a consequence, applying again the resolvent identity, we write

where we note from (A.4) (and ${\Vert}{\mathbf{Q}}_{-i}{\Vert}\leqslant {\lambda }^{-1}$) that the matrix $\mathbb{E}[{\mathbf{D}}_{i}]={o}_{{\Vert}\cdot {\Vert}}(1)$ (in fact of spectral norm of order $O({n}^{-\frac{1}{2}})$), so that

where we used ${\mathbb{E}}_{{\mathbf{w}}_{i}}[{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}]={\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}}$ by lemma 1 and then lemma 2 in reverse for the last equality. Moreover, since

so that, with the fact that $\frac{1}{\sqrt{n}}{\Vert}\mathbf{Q}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\Vert}\leqslant {\Vert}\sqrt{\mathbf{Q}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}}{\Vert}\leqslant {\lambda }^{-\frac{1}{2}}$, we have for the first term

It thus remains to treat the second term, which, with the relation $\mathbf{A}{\mathbf{B}}^{\mathsf{\text{T}}}+\mathbf{B}{\mathbf{A}}^{\mathsf{\text{T}}}\preceq \mathbf{A}{\mathbf{A}}^{\mathsf{\text{T}}}+\mathbf{B}{\mathbf{B}}^{\mathsf{\text{T}}}$ (in the sense of symmetric matrices) and the same line of argument as above, can be shown to have vanishing spectral norm (of order $O({n}^{-\frac{1}{2}})$) as n, p, N → ∞.

We thus have ${\Vert}\mathbb{E}[\mathbf{Q}]-\tilde{\mathbf{Q}}{\Vert}=O({n}^{-\frac{1}{2}})$, which concludes the first part of the proof of theorem 1.
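This first part of theorem 1 can also be observed numerically: averaging Q over independent draws of W and comparing the result with $\bar{\mathbf{Q}}$ computed from the fixed point already gives a small difference for moderate dimensions. The self-contained sketch below is illustrative only and again assumes the closed kernel forms of (8) and the fixed-point characterization recalled at the beginning of this appendix.

```python
import numpy as np

rng = np.random.default_rng(3)
p, n, N, lam, n_mc = 40, 200, 150, 1.0, 200
X = rng.standard_normal((p, n)) / np.sqrt(p)

# limiting kernels of (8) (assumed closed forms) and deterministic equivalent Qbar
nrm2 = np.sum(X ** 2, axis=0)
expo = np.exp(-0.5 * (nrm2[:, None] + nrm2[None, :]))
K_cos, K_sin = expo * np.cosh(X.T @ X), expo * np.sinh(X.T @ X)
d_cos = d_sin = 1.0
for _ in range(500):
    Qbar = np.linalg.inv(N / n * (K_cos / (1 + d_cos) + K_sin / (1 + d_sin))
                         + lam * np.eye(n))
    d_cos, d_sin = np.trace(K_cos @ Qbar) / n, np.trace(K_sin @ Qbar) / n

# Monte Carlo estimate of E[Q] over the randomness of W
EQ = np.zeros((n, n))
for _ in range(n_mc):
    W = rng.standard_normal((N, p))
    S = np.vstack([np.cos(W @ X), np.sin(W @ X)])
    EQ += np.linalg.inv(S.T @ S / n + lam * np.eye(n)) / n_mc

print(np.linalg.norm(EQ - Qbar, 2))   # small; shrinks as n, p, N grow proportionally
```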

We next show that ${\Vert}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}{\Vert}\to 0$ as n, p, N → ∞. First note from the previous derivation that ${\alpha }_{\sigma }-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\sigma }\tilde{\mathbf{Q}}=O({n}^{-\frac{1}{2}})$ for σ = cos,  sin. To compare $\tilde{\mathbf{Q}}$ and $\bar{\mathbf{Q}}$, it follows again from the resolvent identity that

so that the control of ${\Vert}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}{\Vert}$ boils down to the control of max{|αcos − δcos|, |αsin − δsin|}. To this end, it suffices to write

where we used | tr(AB)| ⩽ ||A|| tr(B) for nonnegative definite B, together with the fact that $\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\sigma }$ is (uniformly) bounded under assumption 1, for σ = cos,  sin.

As a consequence, we have

It thus remains to show

or alternatively, by the Cauchy–Schwarz inequality, to show

To treat the first right-hand-side term (the second can be handled similarly), it follows from | tr(AB)| ⩽ ||A|| ⋅ tr(B) for nonnegative definite B that

where we used the fact that $\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{cos}}}={\mathbf{I}}_{n}-\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{sin}}}-\lambda \bar{\mathbf{Q}}$. This concludes the proof of theorem 1.□

Appendix B.: Proof of theorem 2

To prove theorem 2, it suffices to prove the following lemma.

Lemma 4. (Asymptotic behavior of $\mathbb{E}$[QAQ]). Under assumption 1, for Q defined in (7) and symmetric nonnegative definite $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ of bounded spectral norm, we have

almost surely as n → ∞, with ${\mathbf{\Omega }}^{-1}\equiv {\mathbf{I}}_{2}-\frac{N}{n}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \\ \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]$. In particular, we have

Proof of Lemma 4. The proof of lemma 4 essentially follows the same line of argument as that of theorem 1. Writing

where ≃ denotes equality up to matrices of vanishing spectral norm (i.e. o||⋅||(1)) in the n, p, N → ∞ limit, and we recall the shortcut $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$. Developing the rightmost term with lemma 2 as

so that

Equation (B.1)

by taking A = Kcos or Ksin, we obtain

with $a=1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}$, $b=\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}$, $c=1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}$ and $d=\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}$ such that ${(1+{\delta }_{\mathrm{sin}})}^{2}b={(1+{\delta }_{\mathrm{cos}})}^{2}d$.

for $\mathbf{\Omega }\equiv {\left[\begin{matrix}\hfill a\hfill & \hfill -b\hfill \\ \hfill -d\hfill & \hfill c\hfill \end{matrix}\right]}^{-1}$. Plugging back into (B.1) we conclude the proof of lemma 4. □

Theorem 2 then follows by considering the concentration of the bilinear form $\frac{1}{n}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}^{2}\mathbf{y}$ around its expectation $\frac{1}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[{\mathbf{Q}}^{2}]\mathbf{y}$ (using, for instance, lemma 3 in [15]), together with lemma 4. This concludes the proof of theorem 2. □

Appendix C.: Proof of theorem 3

Recall the definition of ${E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}=\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2}$ from (5) with ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}=\left[\begin{matrix}\hfill \mathrm{cos}(\mathbf{W}\hat{\mathbf{X}})\hfill \\ \hfill \mathrm{sin}(\mathbf{W}\hat{\mathbf{X}})\hfill \end{matrix}\right]\in {\mathbb{R}}^{2N\times \hat{n}}$ on a test set $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ of size $\hat{n}$, and first focus on the case 2N > n where $\boldsymbol{\beta }=\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}\mathbf{y}$ as per (4). By (A.1), we have

where, in analogy with the notation ${\mathbf{U}}_{i}=\left[\begin{matrix}\hfill \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\in {\mathbb{R}}^{n\times 2}$ used in the proof of theorem 1, we denote

As a consequence, we further get

where we similarly denote

Note that, in contrast to the proofs of theorems 1 and 2, where we repeatedly use the fact that ||Q|| ⩽ λ−1 and

so that ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}\leqslant 1$, we do not in general have a simple control of ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}$ when an arbitrary $\hat{\mathbf{X}}$ is considered. Intuitively, this is due to the loss of control of ${\Vert}\frac{1}{n}{({\mathbf{\Sigma }}_{\hat{\mathbf{X}}}-{\mathbf{\Sigma }}_{\mathbf{X}})}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}$ when $\hat{\mathbf{X}}$ can be chosen arbitrarily with respect to X. It was remarked in [15] (remark 1) that in general only an $O(\sqrt{n})$ upper bound can be derived for ${\Vert}\frac{1}{\sqrt{n}}{\mathbf{\Sigma }}_{\mathbf{X}}{\Vert}$ or ${\Vert}\frac{1}{\sqrt{n}}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}{\Vert}$. Nonetheless, this problem can be resolved under the additional assumption 2.

More precisely, in view of

Equation (C.1)

it remains to show that ${\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}-{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}{\Vert}=O(\sqrt{n})$ under assumption 2 to establish ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}=O(1)$, that is, to show that

Equation (C.2)

for σ ∈ { cos,  sin }. Note this cannot be achieved using only the Lipschitz nature of σ(⋅) and the fact that ${\Vert}\mathbf{X}-\hat{\mathbf{X}}{\Vert}\leqslant {\Vert}\mathbf{X}{\Vert}+{\Vert}\hat{\mathbf{X}}{\Vert}=O(1)$ under assumption 1 by writing

Equation (C.3)

where we recall that ${\Vert}\mathbf{W}{\Vert}=O(\sqrt{n})$ and ||W||F = O(n). Nonetheless, from proposition B.1 in [43] we have that the product WX, and thus σ(WX), strongly concentrates around its expectation in the sense of (13), so that

under assumption 2. As a result, we can control $\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}$ and similarly $\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}\mathbf{Q}$ in the same vein as $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}$ in the proofs of theorems 1 and 2 in appendices A and B, respectively.
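Numerically, the contrast between the always-controlled quantity ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}\leqslant 1$ and the train–test cross term can be observed directly. The sketch below (illustrative only) draws train and test data from the same distribution, in the spirit of assumption 2, and checks that both norms remain of order one.

```python
import numpy as np

rng = np.random.default_rng(4)
p, n, n_hat, N, lam = 30, 300, 200, 250, 1.0
X = rng.standard_normal((p, n)) / np.sqrt(p)           # train data
X_hat = rng.standard_normal((p, n_hat)) / np.sqrt(p)   # test data, same distribution

W = rng.standard_normal((N, p))
S_X = np.vstack([np.cos(W @ X), np.sin(W @ X)])
S_Xh = np.vstack([np.cos(W @ X_hat), np.sin(W @ X_hat)])
Q = np.linalg.inv(S_X.T @ S_X / n + lam * np.eye(n))

print(np.linalg.norm(S_X.T @ S_X @ Q / n, 2))   # <= 1 by construction
print(np.linalg.norm(S_Xh.T @ S_X @ Q / n, 2))  # O(1) here, but not controlled for arbitrary X_hat
```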

It thus remains to handle the last term (denoted Z) as follows

where the term Z1 can be treated as

where we apply lemma 4 and recall

Moving on to Z2, we write

For the term Z21, note that Q (unlike Q−i ) depends on Ui (and ${\hat{\mathbf{U}}}_{i}$), such that

where we recall the shortcut $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ and similarly $\hat{\mathbf{\Phi }}\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\in {\mathbb{R}}^{\hat{n}\times n}$. As a consequence, we further have, with lemma 4 that

The last term Z22 can be similarly treated as

where by lemma 2 we deduce

so that, again by lemma 4,

Assembling the estimates for Z1, Z21 and Z22, we get

which, up to further simplifications, concludes the proof of theorem 3.
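As a practical complement (not used in the proof), the empirical quantities characterized by theorems 2 and 3 can be computed directly from their definitions, with $\boldsymbol{\beta }=\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}\mathbf{y}$ as per (4) and Etest as per (5) (and Etrain assumed to be defined analogously on the training set). The sketch below uses synthetic data and a hypothetical linear target purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
p, n, n_hat, N, lam = 30, 400, 200, 300, 1e-2
X = rng.standard_normal((p, n)) / np.sqrt(p)
X_hat = rng.standard_normal((p, n_hat)) / np.sqrt(p)
w_star = rng.standard_normal(p)                      # hypothetical target, for illustration only
y, y_hat = X.T @ w_star, X_hat.T @ w_star

W = rng.standard_normal((N, p))
S_X = np.vstack([np.cos(W @ X), np.sin(W @ X)])      # 2N x n, here 2N > n
S_Xh = np.vstack([np.cos(W @ X_hat), np.sin(W @ X_hat)])
Q = np.linalg.inv(S_X.T @ S_X / n + lam * np.eye(n))
beta = S_X @ Q @ y / n                               # RFF ridge regressor of (4)

E_train = np.mean((y - S_X.T @ beta) ** 2)           # (1/n)     ||y     - Sigma_X^T     beta||^2
E_test = np.mean((y_hat - S_Xh.T @ beta) ** 2)       # (1/n_hat) ||y_hat - Sigma_X_hat^T beta||^2
print(E_train, E_test)
```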

Appendix D.: Several useful lemmas

Lemma 5. (Some useful properties of Ω). For any λ > 0 and Ω defined in (12), we have

  • (a) all entries of Ω are positive;
  • (b) for 2N = n, det(Ω−1), as well as the entries of Ω, scales like λ as λ → 0.

Proof. Developing the inverse we obtain

we have ${[{\mathbf{\Omega }}^{-1}]}_{11}=\frac{1}{1+{\delta }_{\mathrm{cos}}}+\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}+\frac{N}{n}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}} > 0$, ${[{\mathbf{\Omega }}^{-1}]}_{12}< 0$, and similarly ${[{\mathbf{\Omega }}^{-1}]}_{21}< 0$, ${[{\mathbf{\Omega }}^{-1}]}_{22} > 0$. Furthermore, the determinant writes

where we repeatedly use the fact that $\bar{\mathbf{Q}}\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)={\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$. Note that

so that (a) det(Ω−1) > 0 and (b) for 2N = n, det(Ω−1) scales like λ as λ → 0. □
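The sign pattern used in this proof can be checked numerically (again outside the proof): building Ω from the expression of Ω−1 displayed in lemma 4, one recovers positive diagonal and negative off-diagonal entries of Ω−1, a positive determinant, and positive entries of Ω. The self-contained sketch below assumes the closed kernel forms of (8) and the fixed-point characterization of (δcos, δsin) recalled in appendix A.

```python
import numpy as np

rng = np.random.default_rng(6)
p, n, N, lam = 30, 200, 120, 0.5
X = rng.standard_normal((p, n)) / np.sqrt(p)

nrm2 = np.sum(X ** 2, axis=0)
expo = np.exp(-0.5 * (nrm2[:, None] + nrm2[None, :]))
K_cos, K_sin = expo * np.cosh(X.T @ X), expo * np.sinh(X.T @ X)

d_cos = d_sin = 1.0                                  # fixed point of appendix A
for _ in range(500):
    Qbar = np.linalg.inv(N / n * (K_cos / (1 + d_cos) + K_sin / (1 + d_sin))
                         + lam * np.eye(n))
    d_cos, d_sin = np.trace(K_cos @ Qbar) / n, np.trace(K_sin @ Qbar) / n

t = lambda A, B: np.trace(Qbar @ A @ Qbar @ B) / n   # (1/n) tr(Qbar A Qbar B)
Om_inv = np.eye(2) - N / n * np.array([
    [t(K_cos, K_cos) / (1 + d_cos) ** 2, t(K_cos, K_sin) / (1 + d_sin) ** 2],
    [t(K_cos, K_sin) / (1 + d_cos) ** 2, t(K_sin, K_sin) / (1 + d_sin) ** 2]])

print(Om_inv[0, 0] > 0, Om_inv[0, 1] < 0, Om_inv[1, 0] < 0, Om_inv[1, 1] > 0)
print(np.linalg.det(Om_inv) > 0)                     # det(Omega^{-1}) > 0
print(np.all(np.linalg.inv(Om_inv) > 0))             # all entries of Omega positive
```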

Lemma 6. (Derivatives with respect to N). Let assumption 1 hold. For any λ > 0 and

defined in theorem 1, we have that (δcos, δsin) and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ are all decreasing functions of N. Note in particular that the same conclusion holds for 2N > n as λ → 0.

Proof. We write

Equation (D.1)

for Ω defined in (12) and $\mathbf{\Phi }=\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$, which, together with lemma 5, allows us to conclude that $\frac{\partial {\delta }_{\mathrm{cos}}}{\partial N},\frac{\partial {\delta }_{\mathrm{sin}}}{\partial N}< 0$. Further note that

which concludes the proof. □

Lemma 7. (Derivative with respect to λ). For any λ > 0, (δcos, δsin) and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ defined in theorem 1 decrease as λ increases.

Proof. Taking the derivative of (δcos, δsin) with respect to λ > 0, we have explicitly

Equation (D.2)

which, together with the fact that all entries of Ω are positive (lemma 5), allows us to conclude that $\frac{\partial {\delta }_{\mathrm{cos}}}{\partial \lambda },\frac{\partial {\delta }_{\mathrm{sin}}}{\partial \lambda }< 0$. Further considering

and the conclusion for $\bar{\mathbf{Q}}$ follows. □
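Both monotonicity statements (lemmas 6 and 7) are easy to visualize numerically; the sketch below (illustrative only) reuses the gauss_kernels and fixed_point helpers from the sketch at the beginning of appendix A and reports (δcos, δsin) and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ over a grid of λ and of N.

```python
import numpy as np

# Reuses gauss_kernels(...) and fixed_point(...) defined in the sketch of appendix A.
rng = np.random.default_rng(7)
p, n = 30, 200
X = rng.standard_normal((p, n)) / np.sqrt(p)
K_cos, K_sin = gauss_kernels(X)

print("varying lambda (N fixed):")
for lam in [0.1, 0.5, 1.0, 5.0]:
    d_cos, d_sin, Qbar = fixed_point(K_cos, K_sin, N=150, lam=lam)
    print(lam, d_cos, d_sin, np.linalg.norm(Qbar, 2))   # all decrease as lambda grows

print("varying N (lambda fixed):")
for N in [50, 100, 200, 400]:
    d_cos, d_sin, Qbar = fixed_point(K_cos, K_sin, N=N, lam=1.0)
    print(N, d_cos, d_sin, np.linalg.norm(Qbar, 2))     # all decrease as N grows
```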

Footnotes

  • This article is an updated version of: Liao Z, Couillet R and Mahoney M W 2020 A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent Advances in Neural Information Processing Systems vol 33, ed H Larochelle, M Ranzato, R Hadsell, M F Balcan and H Lin (New York: Curran Associates), pp 13939–50.

  • K ⩾ 2 is included to cover multi-class classification problems; and K should remain fixed as n, p → ∞.
