A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent*

Zhenyu Liao; Romain Couillet; Michael W Mahoney

doi:10.1088/1742-5468/ac3a77

1. Introduction

For a machine learning system having N parameters, trained on a data set of size n, asymptotic analysis as used in classical statistical learning theory typically either focuses on the (statistical) population n → ∞ limit, for N fixed, or the over-parameterized N → ∞ limit, for a given n, as in the popular neural tangent kernel (NTK) regime [1]. These two settings are technically more convenient to work with, yet less practical, as they essentially assume that one of the two dimensions is negligibly small compared to the other, and this is rarely the case in practice. Indeed, with a factor of 2 or 10 more data, one typically works with a more complex model. This has been highlighted perhaps most prominently in recent work on neural network models, in which the model complexity and data size increase together. For this reason, the double asymptotic regime where n, N → ∞, with N/n → c, a constant, is a particularly interesting (and likely more realistic) limit, despite being technically more challenging [2–8]. In particular, working in this regime allows for a finer quantitative assessment of machine learning systems, as a function of their relative complexity N/n, as well as for a precise description of the under-to over-parameterized 'phase transition' (that does not appear, e.g. in the N → ∞ alone analysis). This transition is largely hidden in the usual style of statistical learning theory [9], but it is well-known in the statistical mechanics approach to learning theory [2–5], and empirical signatures of it have received attention recently under the name 'double descent' phenomena [10–12].

This article considers the asymptotics of random Fourier features (RFFs) [13], and more generally random feature maps, which may be viewed also as a single-hidden-layer neural network model, in this limit. More precisely, let $\mathbf{X}=[{\mathbf{x}}_{1},\dots ,{\mathbf{x}}_{n}]\in {\mathbb{R}}^{p\times n}$ denote the data matrix of size n with data vectors ${\mathbf{x}}_{i}\in {\mathbb{R}}^{p}$ as column vectors. The random feature matrix Σ_X of X is generated by pre-multiplying some random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ having i.i.d. entries and then passing through some entry-wise nonlinear function σ(⋅), i.e. ${\mathbf{\Sigma }}_{\mathbf{X}}\equiv \sigma (\mathbf{W}\mathbf{X})\in {\mathbb{R}}^{N\times n}$ . Commonly used random feature techniques such as RFFs [13] and homogeneous kernel maps [14], however, rarely involve a single non-linearity. The popular RFF maps are built with cosine and sine non-linearities, so that ${\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{2N\times n}$ is obtained by cascading the random features of both, i.e. ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}\equiv [\mathrm{cos}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}},\enspace \mathrm{sin}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}}]$ . Note that, by combining both non-linearities, RFFs generated from $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ are of dimension 2N.

The large N asymptotics of random feature maps is closely related to their limiting kernel matrices K_X. In the case of RFF, it was shown in [13] that entry-wise the Gram matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ converges to the Gaussian kernel matrix ${\mathbf{K}}_{\mathbf{X}}\equiv {\left\{\mathrm{exp}(-{\Vert}{\mathbf{x}}_{i}-{\mathbf{x}}_{j}{{\Vert}}^{2}/2)\right\}}_{i,j=1}^{n}$ , as N → ∞. This follows from $\frac{1}{N}{[{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}]}_{ij}=\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{cos}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{cos}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})+\mathrm{sin}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{sin}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})$ , for w_t independent Gaussian random vectors, so that by the strong law of large numbers, for fixed n, p, ${[{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N]}_{ij}$ goes to its expectation (with respect to $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ ) almost surely as N → ∞, i.e.

$\begin{align}\hfill {[{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N]}_{ij}& \;\stackrel{a.s.}{\to }\;{\mathbb{E}}_{\mathbf{w}}\left[\mathrm{cos}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w})\mathrm{cos}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})+\mathrm{sin}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w})\mathrm{sin}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\right]\equiv {\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}},\hfill \end{align} \tag{ 1 }$

with

$\begin{align}\hfill {\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}}& \equiv {\text{e}}^{-\frac{1}{2}({\Vert}{\mathbf{x}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}{{\Vert}}^{2})}\left(\mathrm{cosh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})+\mathrm{sinh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\right)={\text{e}}^{-\frac{1}{2}({\Vert}{\mathbf{x}}_{i}-{\mathbf{x}}_{j}{{\Vert}}^{2})}\equiv {[{\mathbf{K}}_{\mathbf{X}}]}_{ij}.\hfill \end{align} \tag{ 2 }$

(The identification with ${[{\mathbf{K}}_{\mathbf{X}}]}_{ij}$ is easily shown in lemma 1 of appendix A.)

While this result holds in the N → ∞ limit, recent advances in random matrix theory [15, 16] suggest that, in the more practical setting where N is not much larger than n, p and n, p, N → ∞ at the same pace, the situation is more subtle. In particular, the above entry-wise convergence remains valid, but the convergence ${\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N-{\mathbf{K}}_{\mathbf{X}}{\Vert}\to 0$ no longer holds in spectral norm, due to the factor n, now large, in the norm inequality ||A||_∞ ⩽ ||A|| ⩽ n||A||_∞ for $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ and ||A||_∞ ≡ max_ij|A_ij|. This implies that, in the large n, p, N regime, the assessment of the behavior of ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ via the limiting kernel K_X may result in a spectral norm error that blows up with n. As a consequence, for various machine learning algorithms, the performance guarantee offered by the limiting Gaussian kernel is less likely to agree with empirical observations in real-world large-scale problems, when n, p are large [17].

1.1. Our main contributions

We consider the RFF model in the more realistic large n, p, N limit. While, in this setting, the RFF empirical Gram matrix does not converge to the Gaussian kernel matrix, we can characterize its behavior as n, p, N → ∞ and provide asymptotic performance guarantees for RFF on large-scale problems. We also identify a phase transition as a function of the ratio N/n, including the corresponding double descent phenomenon. In more detail, our contributions are the following.

(a)
We provide a precise characterization of the asymptotics of the RFF empirical Gram matrix, in the large n, p, N limit (theorem 1). This is accomplished by constructing a deterministic equivalent for the resolvent of the RFF Gram matrix. Based on this, the behavior of the RFF model is (asymptotically) accessible through a fixed-point equation, that can be interpreted in terms of an angle-like correction induced by the non-trivial large n, p, N limit (relative to the N → ∞ alone limit).
(b)
We derive the asymptotic training and test mean squared errors (MSEs) of RFF ridge regression, as a function of the ratio N/n, regularization penalty λ, training as well as test sets (theorems 2 and 3, respectively). We identify precisely the under-to over-parameterization phase transition, as a function of the relative model complexity N/n; we prove the existence of a 'singular' peak of test error at the N/n = 1/2 boundary; and we characterize the corresponding double descent behavior. Importantly, our results are valid with almost no specific assumption on the data distribution. This is a significant improvement over existing double descent analyses, which fundamentally rely on the knowledge of the data distribution (often assumed to be multivariate Gaussian for simplicity) [12, 18].
(c)
We provide a detailed empirical evaluation of our theoretical results, demonstrating that the theory closely matches empirical results on a range of real-world data sets (sections 3 and 4). This includes the correction due to the large n, p, N setting, sharp transitions (as a function of N/n) in the aforementioned angle-like quantities, and the corresponding double descent test curves. This also includes an evaluation of the impact of training-test similarity and the effect of different data sets, thus confirming, as stated in (ii), that (unlike in prior work) the phase transition and double descent curve hold much more generally with respect to the data distribution.

1.2. Related work

Here, we provide a brief review of related previous efforts.

Random features and limiting kernels. In most RFF work [19–22], non-asymptotic bounds are given, on the number of random features N needed for a predefined approximation error, for a given kernel matrix with fixed n, p. A more recent line of work [1, 23–25] has focused on the over-parameterized N → ∞ limit of large neural networks by studying the corresponding NTKs. Here, we position ourselves in the more practical regime where n, p, N are all large and comparable, and we provide asymptotic performance guarantees that better fit large-scale problems compared to the large-N-alone analysis.

Random matrix theory. From a random matrix theory perspective, nonlinear Gram matrices of the type ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}$ have recently received an unprecedented research interests, due to their close connection to neural networks [26–29], with a particular focus on the associated eigenvalue distribution. Here we propose a deterministic equivalent [30, 31] analysis for the resolvent matrix that provides access, not only to the eigenvalue distribution, but also to the regression error of central interest in this article. While most existing deterministic equivalent analyses are performed on linear models, here we focus on the nonlinear RFF model. From a technical perspective, the most relevant work is [12, 15]. We improve their results by considering generic data model on the popular RFF model.

Statistical mechanics of learning. A long history of connections between statistical mechanics and machine learning models (such as neural networks) exists, including a range of techniques to establish generalization bounds [2–5], and recently there has been renewed interest [7, 8, 32–34]. Their relevance to our results lies in the use of the so-called thermodynamic limit (akin to the large n, p, N limit), rather than the classical limits more commonly used in statistical learning theory, in which case uniform convergence bounds and related techniques can be applied.

Double descent in large-scale learning systems. The large n, N asymptotics of statistical models has received considerable research interests in the machine learning community [18, 35], resulting in a (somehow) counterintuitive phenomenon referred to as the 'double descent'. Instead of focusing on different 'phases of learning' [2–5, 7], the 'double descent' phenomenon focuses on an empirical manifestation of the phase boundary and refers to the empirical observations of the test error curve as a function of the model complexity, which differs from the usual textbook description of the bias-variance tradeoff [10, 11, 36, 37]. Theoretical investigation into this phenomenon mainly focuses on various regression models [12, 18, 38–41]. In most cases, quite specific (and rather strong) assumptions are imposed on the input data distribution. In this respect, our work extends the analysis in [12] to handle the RFF model and its phase structure on real-world data sets.

1.3. Notations and organization of the paper

Throughout this article, we follow the convention of denoting scalars by lowercase, vectors by lowercase boldface, and matrices by uppercase boldface letters. In addition, the notation (⋅)^T denotes the transpose operator; the norm ||⋅|| is the Euclidean norm for vectors and the spectral or operator norm for matrices; and $\;\stackrel{a.s.}{\to }\;$ stands for almost sure convergence of random variables.

Our main results on the asymptotic behavior of the RFF resolvent matrix, as well as of the training MSE and testing MSE of RFF ridge regression are presented in section 2, with detailed proofs deferred to the appendix. In section 3, we provide a detailed empirical evaluation of our main results; and in section 4, we provide additional empirical evaluation on real-world data, illustrating the practical effectiveness of the proposed analysis. Concluding remarks are placed in section 5.

2. Main technical results

In this section, we present our main theoretical results. To investigate the large n, p, N asymptotics of the RFF model, we position ourselves under the following assumption.

Assumption 1. As n → ∞, we have

(a)
0 < lim inf _n min{p/n, N/n} ⩽ lim sup_n max{p/n, N/n} < ∞; or, practically speaking, the ratios p/n and N/n are only moderately large or moderately small.
(b)
lim sup_n||X|| < ∞ and lim sup_n||y||_∞ < ∞, i.e. the data and targets are both normalized with respect to n.

Under assumption 1, we consider the RFF regression model as in figure 1.

For training data $\mathbf{X}\in {\mathbb{R}}^{p\times n}$ of size n, the associated RFFs, ${\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{2N\times n}$ , are obtained by computing $\mathbf{W}\mathbf{X}\in {\mathbb{R}}^{N\times n}$ , for standard Gaussian random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ , and then applying entry-wise cosine and sine non-linearities on WX, that is

$\begin{equation}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}=[\mathrm{cos}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}},\enspace \mathrm{sin}{(\mathbf{W}\mathbf{X})}^{\mathsf{\text{T}}}]\quad \text{with}\quad {\mathbf{W}}_{ij}\sim \mathcal{N}(0,1).\end{equation} \tag{ 3 }$

Given this setup, the RFF ridge regressor $\boldsymbol{\beta }\in {\mathbb{R}}^{2N}$ is given by, for λ ⩾ 0,

$\begin{equation}\boldsymbol{\beta }\equiv \frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}{\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}\mathbf{y}\cdot {1}_{2N > n}+{\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}+\lambda {\mathbf{I}}_{2N}\right)}^{-1}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}\enspace \mathbf{y}\cdot {1}_{2N< n}.\end{equation} \tag{ 4 }$

The two forms of β in (4) are equivalent for any λ > 0 and minimize the (ridge-regularized) squared loss $\frac{1}{n}{\Vert}\mathbf{y}-{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2}+\lambda {\Vert}\boldsymbol{\beta }{{\Vert}}^{2}$ on the training set (X, y). Our objective is to characterize the large n, p, N asymptotics of both the training MSE, E_train, and the test MSE, E_test, defined respectively as

$\begin{equation}{E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{1}{n}{\Vert}\mathbf{y}-{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2},\quad {E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}=\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2},\end{equation} \tag{ 5 }$

with ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\equiv [\mathrm{cos}{(\mathbf{W}\hat{\mathbf{X}})}^{\mathsf{\text{T}}},\enspace \mathrm{sin}{(\mathbf{W}\hat{\mathbf{X}})}^{\mathsf{\text{T}}}]\in {\mathbb{R}}^{\hat{n}\times 2N}$ on a test set $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ of size $\hat{n}$ , and from this to characterize the phase transition behavior (as a function of the model complexity N/n) as mentioned in section 1. Precisely, in the training phase, the random weight matrix W is drawn once and kept fixed; and the RFF ridge regressor β is given explicitly as a function of W and the training set (X, y), as per (4). In the test phase, for β now fixed, the model takes the test data $\hat{\mathbf{X}}$ as input, and it outputs ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\boldsymbol{\beta }$ that should be compared to the corresponding target $\hat{\mathbf{y}}$ to measure the model test performance, E_test.

2.1. Asymptotic deterministic equivalent

To start, we observe that the training MSE, E_train, in (5), can be written as

$\begin{equation}{E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{{\lambda }^{2}}{n}{\Vert}\mathbf{Q}(\lambda )\mathbf{y}{{\Vert}}^{2}=-\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\partial \mathbf{Q}(\lambda )\mathbf{y}/\partial \lambda ,\end{equation} \tag{ 6 }$

which depends on the quadratic form y^T Q(λ)y of

$\begin{equation}\mathbf{Q}(\lambda )\equiv {\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}\in {\mathbb{R}}^{n\times n},\end{equation} \tag{ 7 }$

the so-called resolvent of $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}$ (also denoted Q when there is no ambiguity) with λ > 0. To see this, from (5) we have ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{1}{n}{\Vert}\mathbf{y}-\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}{(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n})}^{-1}\mathbf{y}{{\Vert}}^{2}=\frac{{\lambda }^{2}}{n}{\Vert}\mathbf{Q}(\lambda )\mathbf{y}{{\Vert}}^{2}=-\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\frac{\partial \mathbf{Q}(\lambda )}{\partial \lambda }\mathbf{y}$ , with $\frac{\partial \mathbf{Q}(\lambda )}{\partial \lambda }=-{\mathbf{Q}}^{2}(\lambda )$ .

In order to assess the asymptotic training MSE, it thus suffices to find a deterministic equivalent for Q(λ), that is, a deterministic matrix that captures the asymptotic behavior of the latter. One possibility is the expectation ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}(\lambda )]$ . Informally, if the training MSE E_train (that is random due to random W for given X, y) is 'close to' some deterministic quantity ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ , in the large n, p, N limit, then ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ must have the same limit as ${\mathbb{E}}_{\mathbf{W}}[{E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}]=-\frac{{\lambda }^{2}}{n}\partial {\mathbf{y}}^{\mathsf{\text{T}}}{\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}(\lambda )]\mathbf{y}/\partial \lambda$ for n, p, N → ∞. However, ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ involves integration (with no closed-form due to the matrix inverse), and it is not a convenient quantity with which to work. Our objective is to find an asymptotic 'alternative' for ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ that is (i) close to ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ in the large n, p, N → ∞ limit and (ii) numerically more accessible.

In the following theorem, we introduce an asymptotic equivalent for ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ . Instead of being directly related to the Gaussian kernel K_X = K_cos + K_sin as suggested by (2) in the large-N-only limit, it depends on the two components K_cos, K_sin in a more involved manner. Importantly, the proposed equivalent $\bar{\mathbf{Q}}$ can be numerically evaluated by running simple fixed-point iterations involving K_cos and K_sin.

Theorem 1. (Asymptotic equivalent for ${\mathbb{E}}_{\mathbf{W}}$ [Q]). Under assumption 1, for Q defined in (7) and λ > 0, we have, as n → ∞

$\begin{equation*}{\Vert}{\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]-\bar{\mathbf{Q}}{\Vert}\to 0\end{equation*}$

for $\bar{\mathbf{Q}}\equiv {\left(\frac{N}{n}(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}})+\lambda {\mathbf{I}}_{n}\right)}^{-1}$ , ${\mathbf{K}}_{\mathrm{cos}}\equiv {\mathbf{K}}_{\mathrm{cos}}(\mathbf{X},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}\equiv {\mathbf{K}}_{\mathrm{sin}}(\mathbf{X},\mathbf{X})\in {\mathbb{R}}^{n\times n}$ and

$\begin{align}\hfill \begin{aligned}\hfill {\mathbf{K}}_{\mathrm{cos}}{(\mathbf{X},{\mathbf{X}}^{\prime })}_{ij}& ={\text{e}}^{-\frac{{\Vert}{\mathbf{x}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}^{\prime }{{\Vert}}^{2}}{2}}\enspace \mathrm{cosh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j}^{\prime }),\hfill \\ \hfill {\mathbf{K}}_{\mathrm{sin}}{(\mathbf{X},{\mathbf{X}}^{\prime })}_{ij}& ={\text{e}}^{-\frac{{\Vert}{\mathbf{x}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}^{\prime }{{\Vert}}^{2}}{2}}\enspace \mathrm{sinh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j}^{\prime }),\hfill \end{aligned}\end{align} \tag{ 8 }$

where (δ_cos, δ_sin) is the unique positive solution to

$\begin{equation}{\delta }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}),\quad {\delta }_{\mathrm{sin}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}).\end{equation} \tag{ 9 }$

Proof. See appendix A. □

Remark 1. (Lower and upper bounds). Since

$\begin{equation}\frac{{\mathbf{K}}_{\mathbf{X}}}{1+\mathrm{max}({\delta }_{\mathrm{cos}},{\delta }_{\mathrm{sin}})}{\preceq}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}{\preceq}\frac{{\mathbf{K}}_{\mathbf{X}}}{1+\mathrm{min}({\delta }_{\mathrm{cos}},{\delta }_{\mathrm{sin}})}\end{equation} \tag{ 10 }$

in the positive definite order, for K_X ≡ K_cos + K_sin the Gaussian kernel, $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ is therefore positive definite, if x₁ ..., x_n are all distinct; see theorem 2.18 in [42].

Remark 2. (Correction to large-N behavior). Taking N/n → ∞, one has δ_cos → 0, δ_sin → 0 so that

$\begin{equation}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\to {\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}}={\mathbf{K}}_{\mathbf{X}}\quad \text{and}\quad \bar{\mathbf{Q}}\to {\left(\frac{N}{n}{\mathbf{K}}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}\sim \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1},\end{equation} \tag{ 11 }$

for λ > 0 independent of N, n, in accordance with the classical large-N-only prediction. In this sense, the pair (δ_cos, δ_sin) introduced in theorem 1 accounts for the 'correction' due to the non-trivial n/N, as opposed to the N → ∞ alone analysis. Also, when the number of features N is large (i.e. as N/n → ∞), the regularization effect of λ flattens out and $\bar{\mathbf{Q}}$ behaves like (a scaled version of) the inverse Gaussian kernel matrix ${\mathbf{K}}_{\mathbf{X}}^{-1}$ (that is well-defined for distinct x₁ ..., x_n).

Remark 3. (Geometric interpretation). Since $\bar{\mathbf{Q}}$ shares the same eigenspace with $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ , one can geometrically interpret (δ_cos, δ_sin) as a sort of 'angle' between the eigenspaces of K_cos, K_sin and that of $\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ . For fixed n, as N → ∞, one has $\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{cos}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}\mathbf{X})\to {\mathbf{K}}_{\mathrm{cos}}$ , $\frac{1}{N}{\sum }_{t=1}^{N}\mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{t})\mathrm{sin}({\mathbf{w}}_{t}^{\mathsf{\text{T}}}\mathbf{X})\to {\mathbf{K}}_{\mathrm{sin}}$ , the eigenspaces of which are 'orthogonal' to each other, so that δ_cos, δ_sin → 0. On the other hand, as N, n → ∞, the eigenspaces of K_cos and K_sin 'intersect' with each other, captured by the non-trivial (δ_cos, δ_sin).

2.2. Asymptotic training performance

Theorem 1 provides an asymptotically more tractable approximation of ${\mathbb{E}}_{\mathbf{W}}[\mathbf{Q}]$ . Together with some additional concentration arguments (e.g. from theorem 2 in [15]). this permits us to provide a complete description of the limiting behavior of the random bilinear form a^T Qb, for $\mathbf{a},\mathbf{b}\in {\mathbb{R}}^{n}$ of bounded Euclidean norms, in such a way that ${\mathbf{a}}^{\mathsf{\text{T}}}\mathbf{Q}\mathbf{b}-{\mathbf{a}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{b}\;\stackrel{a.s.}{\to }\;0$ , as n, p, N → ∞. This, together with the fact that ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbf{Q}{(\lambda )}^{2}\mathbf{y}=-\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\partial \mathbf{Q}(\lambda )\mathbf{y}/\partial \lambda$ , leads to the following result on the asymptotic training error.

Theorem 2. (Asymptotic training performance). Under assumption 1, for a given training set (X, y) and training MSE, E_train defined in (5), as n → ∞

$\begin{align*}\hfill {E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}-{\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}\;\stackrel{a.s.}{\to }\;0,\quad {\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}& =\frac{{\lambda }^{2}}{n}{\Vert}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}+\frac{N}{n}\frac{{\lambda }^{2}}{{n}^{2}}\left[\begin{matrix}\hfill \frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right]\hfill \end{align*}$

for $\bar{\mathbf{Q}}$ defined in theorem 1 and

$\begin{equation}{\mathbf{\Omega }}^{-1}\equiv {\mathbf{I}}_{2}-\frac{N}{n}\left[\begin{matrix}\hfill \frac{1}{n}\frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{1}{n}\frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \\ \hfill \frac{1}{n}\frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{1}{n}\frac{\mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right].\end{equation} \tag{ 12 }$

Proof. See appendix B. □

Remark 4. (First- and second-order corrections). Since ${E}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=\frac{{\lambda }^{2}}{n}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}^{2}\mathbf{y}$ , we can see in the expression of ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ that there is not only a first-order (large n, p, N) correction in the first $\frac{{\lambda }^{2}}{n}{\Vert}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}$ term (which is different than $\frac{{\lambda }^{2}}{n}{\Vert}\mathbf{Q}\mathbf{y}{{\Vert}}^{2}$ ), but there is also a second-order correction, appearing in the form of $\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }\bar{\mathbf{Q}}$ or $\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }$ for σ ∈ { cos, sin }, as in the second term. This has a similar interpretation to remark 3, where the pair (δ_cos, δ_sin) in $\bar{\mathbf{Q}}$ is (geometrically) interpreted as the eigenspace 'intersection' due to a non-vanishing n/N. In particular, taking N/n → ∞, we have $\bar{\mathbf{Q}}\sim \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1}$ , Ω → I₂ so that ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=0$ and the model interpolates the entire training set, as expected.

One can show that (i) for a given n and λ > 0, ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ decreases as the model size N increases; and (ii) for a given ratio N/n, ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ increases as the regularization penalty λ grows large.

2.3. Asymptotic test performance

Theorem 2 holds without any restriction on the training set, (X, y), except for assumption 1, since only the randomness of W is involved, and thus one can simply treat (X, y) as known in this result. This is no longer the case for the test error. Intuitively, the test data $\hat{\mathbf{X}}$ cannot be chosen arbitrarily, and one must ensure that the test data 'behave' statistically like the training data, in some 'well-controlled' manner, so that the test MSE is asymptotically deterministic and bounded as $n,\hat{n},p,N\to \infty$ . Following this intuition, we work under the following assumption.

Assumption 2. (Data as concentrated random vectors [43]). The training data ${\mathbf{x}}_{i}\in {\mathbb{R}}^{p},i\in \left\{1,\dots ,n\right\}$ , are independently drawn (non-necessarily uniformly) from one of K > 0 distribution classes⁴ μ₁, ..., μ_K. There exist constants C, η, q > 0 such that for any x_i ∼ μ_k, k ∈ {1, ..., K} and any one-Lipschitz function $f:{\mathbb{R}}^{p}\to \mathbb{R}$ , we have

$\begin{equation}\mathbb{P}\left(\left\vert f({\mathbf{x}}_{i})-\mathbb{E}[f({\mathbf{x}}_{i})]\right\vert \geqslant t\right)\leqslant C{e}^{-{(t/\eta )}^{q}},\quad t\geqslant 0.\end{equation} \tag{ 13 }$

The test data ${\hat{\mathbf{x}}}_{i}\sim {\mu }_{k}$ , $i\in \left\{1,\dots ,\hat{n}\right\}$ are mutually independent, but may depend on training data X and that ${\Vert}\mathbb{E}[\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}})]{\Vert}=O(\sqrt{n})$ for σ ∈ { cos, sin }.

To facilitate the discussion of the phase transition and the double descent, we do not assume independence between training and test data (but we do assume independence between different data vectors within X and $\hat{\mathbf{X}}$ ). In this respect, assumption 2 is weaker than the classical i.i.d. assumption, and it permits us to illustrate the impact of training-test similarity on the model performance (section 4.2).

A first example of concentrated random vectors satisfying (13) is the multivariate Gaussian vector $\mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ [44]. Moreover, since the concentration property in (13) is stable over Lipschitz transformations [43], it holds, for any one-Lipschitz mapping $g:{\mathbb{R}}^{d}\to {\mathbb{R}}^{p}$ and $\mathbf{z}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{d})$ , that g(z) also satisfies (13). In this respect, assumption 2, although seemingly quite restrictive, represents a large family of 'generative models', including notably the 'fake images' generated by modern generative adversarial networks that are, by construction, Lipschitz transformations of large random Gaussian vectors [45, 46]. As such, from a practical consideration, assumption 2 provides a more realistic and flexible statistical model for real-world data.

With assumption 2, we have the following result on the asymptotic test error.

Theorem 3. (Asymptotic test performance). Under assumptions 1 and 2, we have, for test MSE E_test defined in (5) and test data $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ satisfying ${\mathrm{limsup}}_{\hat{n}}{\Vert}\hat{\mathbf{X}}{\Vert}< \infty$ , ${\mathrm{limsup}}_{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}_{\infty }< \infty$ with $\hat{n}/n\in (0,\infty )$ that, as n → ∞

$\begin{align*}\hfill {E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}-{\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}\;\stackrel{a.s.}{\to }\;0,\quad {\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}& =\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-\frac{N}{n}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}+\frac{{N}^{2}}{{n}^{2}\hat{n}}\left[\begin{matrix}\hfill \frac{{{\Theta}}_{\mathrm{cos}}}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{{{\Theta}}_{\mathrm{sin}}}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right]\hfill \end{align*}$

for Ω defined in (12),

$\begin{equation}{{\Theta}}_{\sigma }=\frac{1}{N}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\hat{\mathbf{X}})+\frac{N}{n}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }-\frac{2}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X}),\quad \sigma \in \left\{\mathrm{cos},\mathrm{sin}\right\},\end{equation} \tag{ 14 }$

and $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ , $\hat{\mathbf{\Phi }}\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}$ , with ${\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})\in {\mathbb{R}}^{\hat{n}\times n}$ and ${\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}}),{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})\in {\mathbb{R}}^{\hat{n}\times \hat{n}}$ defined as in (8).

Proof. See appendix C. □

Similar to theorem 2 on ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ , here the expression for ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ is also given as the sum of first- and second-order corrections. To see this, one can confirm, by taking $(\hat{\mathbf{X}},\hat{\mathbf{y}})=(\mathbf{X},\mathbf{y})$ , that the first term in ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ becomes

$\begin{equation*}\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-\frac{N}{n}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}=\frac{1}{n}{\Vert}\mathbf{y}-\frac{N}{n}\mathbf{\Phi }\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}=\frac{{\lambda }^{2}}{n}{\Vert}\bar{\mathbf{Q}}\mathbf{y}{{\Vert}}^{2}\end{equation*}$

and is equal to the first term in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ , where we used the fact that $\frac{N}{n}\mathbf{\Phi }\bar{\mathbf{Q}}={\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$ . The same also holds for the second term, so that one obtains ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}={\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ , with $(\hat{\mathbf{X}},\hat{\mathbf{y}})=(\mathbf{X},\mathbf{y})$ , as expected. From this perspective, theorem 3 can be seen as an extension of theorem 2, with the 'interaction' between training and test data (e.g. test-versus-test ${\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\hat{\mathbf{X}})$ and test-versus-train ${\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ interaction matrices) summarized in the scalar parameter Θ_σ defined in (14), for σ ∈ { cos, sin }.

By taking N/n → ∞, we have that $\bar{\mathbf{Q}}\sim \frac{n}{N}{\mathbf{K}}^{-1}$ , Θ_σ ∼ N⁻¹, Ω → I₂, and consequently

$\begin{equation*}\underset{N/n\to \infty }{\mathrm{lim}}\enspace {\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}=\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-\mathbf{K}(\hat{\mathbf{X}},\mathbf{X}){\mathbf{K}}_{\mathbf{X}}^{-1}\mathbf{y}{{\Vert}}^{2}.\end{equation*}$

This is the test MSE of classical Gaussian kernel regression, with $\mathbf{K}(\hat{\mathbf{X}},\mathbf{X})\equiv {\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})+{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})\in {\mathbb{R}}^{\hat{n}\times n}$ the test-versus-train Gaussian kernel matrix. As opposed to the training MSE discussed in remark 4, here ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ generally has a non-zero limit (that is, however, independent of λ) as N/n → ∞.

3. Empirical evaluations and practical implications

In this section, we provide a detailed empirical evaluation, including a discussion of the behavior of the fixed-point equation in theorem 1, and its consequences in theorems 2 and 3.

In particular, we describe the behavior of the pair (δ_cos, δ_sin) that characterizes the necessary correction in the large n, p, N regime, as a function of the regularization penalty λ and the ratio N/n. This explains: (i) the mismatch between empirical regression errors from the Gaussian kernel prediction (figure 2); (ii) the behavior of (δ_cos, δ_sin) as a function of λ (figure 3); (iii) the behavior of (δ_cos, δ_sin) as a function of N/n, which clearly indicates two phases of learning and the transition between them (figure 4); and (iv) the corresponding double descent test error curves (figure 5).

**Figure 2.** Training MSEs of RFF ridge regression on MNIST data (class 3 versus 7), as a function of regression penalty λ, for p = 784, n = 1 000, N = 250, 500, 1 000, 2000. Empirical results displayed in **blue** circles; Gaussian kernel predictions (assuming N → ∞ alone) in **black** dashed lines; and theorem 2 in **red** solid lines. Results obtained by averaging over 30 runs.
Download figure:
Standard image High-resolution image

**Figure 3.** Behavior of (δ_cos, δ_sin) in (15) on MNIST data (class 3 versus 7), as a function of the regularization parameter λ, for p = 784, n = 1 000, N = 250, 1 000, 4 000, 16 000.
Download figure:
Standard image High-resolution image

**Figure 4.** Behavior of (δ_cos, δ_sin) in (15) on MNIST data set (class 3 versus 7), as a function of the ratio N/n, for p = 784, n = 1 000, λ = 10⁻⁷, 10⁻³, 1, 10. The **black** dashed line represents the interpolation threshold 2 N = n.
Download figure:
Standard image High-resolution image

**Figure 5.** Empirical (**blue** crosses) and theoretical (**red** dashed lines) test error of RFF regression as a function of the ratio N/n on MNIST data (class 3 versus 7), for p = 784, n = 500, λ = 10⁻⁷, 10⁻³, 0.2, 10. The **black** dashed line represents the interpolation threshold 2N = n.
Download figure:
Standard image High-resolution image

3.1. Correction due to the large n, p, N regime

The RFF Gram matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}/N$ is not close to the classical Gaussian kernel matrix K_X in the large n, p, N regime; and, as a consequence, its resolvent Q, as well the training and test MSE, E_train and E_test (that are functions of Q), behave quite differently from the Gaussian kernel predictions. As already discussed in remark 2 after theorem 1, for λ > 0, the pair (δ_cos, δ_sin) characterizes the correction when considering n, p, N all large, compared to the large-N-only asymptotic behavior:

$\begin{equation}{\delta }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}},{\delta }_{\mathrm{sin}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}},\quad \bar{\mathbf{Q}}={\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)+\lambda {\mathbf{I}}_{n}\right)}^{-1}.\end{equation} \tag{ 15 }$

To start, figure 2 compares the training MSEs of RFF ridge regression to the predictions from Gaussian kernel regression and to the predictions from our theorem 2, on the popular MNIST data set [47]. Observe that there is a huge gap between empirical training errors and the Gaussian kernel predictions, especially when N/n < 1, while our theory consistently fits empirical observations almost perfectly.

Next, from (15) we know that both δ_cos and δ_sin are decreasing functions of λ. (See lemma 7 in appendix D for a proof of this fact.) Figure 3 shows that: (i) over a range of different N/n, both δ_cos and δ_sin decrease monotonically as λ increases; (ii) the behavior for N/n < 1, which is decreasing from an initial value of δ ≫ 1, is very different from the behavior for N/n ≳ 1, where an initially flat region is observed for small values of λ and we have δ < 1 for all values of λ; and (iii) the impact of regularization λ becomes less significant as the ratio N/n becomes large. This is in accordance with the limiting behavior of $\bar{\mathbf{Q}}\simeq \frac{n}{N}{\mathbf{K}}_{\mathbf{X}}^{-1}$ in remark 2 that is independent of λ as N/n → ∞.

Note also that, while δ_cos and δ_sin can be geometrically interpreted as a sort of weighted 'angle' between different kernel matrices, and therefore one might expect to have δ ∈ [0, 1], this is not the case for the leftmost plot with N/n = 1/4. There, for small values of λ (say λ ≲ 0.1), both δ_cos and δ_sin scale like λ⁻¹, while they are observed to saturate to a fixed O(1) value for N/n = 1, 4, 16. This corresponds to two different phases of learning in the 'ridgeless' λ → 0 case. As we shall see in more detail later in section 4.1, depending on whether we are in the 'under-parameterized' (2N < n) or the 'over-parameterized' (2N > n) regime, the system behaves fundamentally differently.

3.2. Phase transition and corresponding double descent

Both δ_cos and δ_sin in (15) are decreasing functions of N, as depicted in figure 4. (See lemma 6 in appendix D for a proof.) More importantly, figure 4 also illustrates that δ_cos and δ_sin exhibit qualitatively different behavior: for λ not too small (λ = 1 or 10), we observe a rather 'smooth' behavior, as a function of the ratio N/n, and they both decrease smoothly, as N/n grows large. However, for λ relatively small (λ = 10⁻³ and 10⁻⁷), we observe a sharp 'phase transition' on two sides of the interpolation threshold 2N = n. (Note that the scale of the y-axis is very different in different subfigures.) More precisely, in the leftmost plot with λ = 10⁻⁷, the values of δ_cos and δ_sin 'jumps' from order O(1) (when 2N > n) to much higher values of the order of λ⁻¹ (when 2N < n). A similar behavior is also observed for λ = 10⁻³.

As a consequence of this phase transition, different behaviors are expected for training and test MSEs in the 2N < n and 2N > n regime. Figure 5 depicts the empirical and theoretical test MSEs with different regularization penalty λ. In particular, for λ = 10⁻⁷ and λ = 10⁻³, a double descent behavior is observed, with a singularity at 2N = n, while for larger values of λ (λ = 0.2, 10), a smoother and monotonically decreasing curve for test error is observed, as a function of N/n. Figure 5 also illustrates that: (i) for a fixed regularization λ > 0, the minimum test error is always obtained in the over-parameterization 2N > n regime; and (ii) the global optimal design (over N and λ) is achieved by highly over-parametrized system with a (problem-dependent) non-vanishing λ. This is in accordance with the observations in [12] for Gaussian data.

Remark 5. (On ridge regularization). Performing ridge regularization (with λ as a control parameter) is known to help alleviate the sharp performance drop around 2N = n [12, 18]. Our theorem 3 can serve as a convenient alternative to evaluate the effect of small λ around 2N = n, as well as to determine an optimal λ, for not-too-small n, p, N. In the setup of figure 5, a grid search can be used to find the regularization that minimizes ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ . For this choice of λ (λ_opt ≈ 0.2), no singular peak at 2N = n is observed.

Remark 6. (Double descent as a consequence of phase transition). While the double descent phenomenon has received considerable attention recently, our analysis makes it clear that in this model (and presumably many others) it is a natural consequence of the phase transition between two qualitatively different phases of learning [7].

4. Additional discussion and results

In this section, we provide additional discussions and empirical results, to complement and extend those of section 3. We start, in section 4.1, by discussing in more detail the two different phases of learning for 2N < n and 2N > n, including the sharp phase transition at 2N = n, for (δ_cos, δ_sin), as well as the asymptotic test MSE, in the ridgeless λ → 0 case. Then, in section 4.2, we discuss the impact of training-test similarly on the test MSE by considering the example of test data $\hat{\mathbf{X}}$ obtained by slightly perturbing the training data X. Finally, in section 4.3, we present empirical results on additional real-world data sets to demonstrate the wide applicability of our results.

4.1. Two different learning regimes in the ridgeless limit

We chose to present our theoretical results in section 2 (theorems 1–3) in the same form, regardless of whether 2N > n or 2N < n. This comes at the cost of requiring a strictly positive ridge regularization λ > 0, as n, p, N → ∞. As discussed in section 3, for small values of λ, depending on the sign of 2N − n, we observe totally different behaviors for (δ_cos, δ_sin) and thus for the key resolvent $\bar{\mathbf{Q}}$ . As a matter of fact, for λ = 0 and 2N < n, the (random) resolvent Q(λ = 0) in (7) is simply undefined, as it involves inverting a singular matrix ${\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{n\times n}$ that is of rank at most 2N < n. As a consequence, we expect to see $\bar{\mathbf{Q}}\sim {\lambda }^{-1}$ as λ → 0 for 2N < n, while for 2N > n this is not the case.

These two phases of learning can be theoretically justified by considering the ridgeless λ → 0 limit in theorem 1, with the unified variables γ_cos and γ_sin introduced below.

(a)
For 2N < n and λ → 0, we obtain
$\begin{equation}\begin{cases}\lambda {\delta }_{\mathrm{cos}}\to {\gamma }_{\mathrm{cos}}\equiv \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{cos}}{\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{{\gamma }_{\mathrm{sin}}}\right)+{\mathbf{I}}_{n}\right)}^{-1}\quad \hfill \\ \lambda {\delta }_{\mathrm{sin}}\to {\gamma }_{\mathrm{sin}}\equiv \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{sin}}{\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{{\gamma }_{\mathrm{sin}}}\right)+{\mathbf{I}}_{n}\right)}^{-1}\quad \hfill \end{cases},\end{equation} \tag{ 16 }$
in such as way that δ_cos, δ_sin and $\bar{\mathbf{Q}}$ scale like λ⁻¹. We have in particular $\mathbb{E}[\lambda \mathbf{Q}]\sim \lambda \bar{\mathbf{Q}}\sim {\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{{\gamma }_{\mathrm{sin}}}\right)+{\mathbf{I}}_{n}\right)}^{-1}$ with (γ_cos, γ_sin) of order O(1).
(b)
For 2N > n and λ → 0, we obtain
$\begin{equation}\begin{cases}{\delta }_{\mathrm{cos}}\to {\gamma }_{\mathrm{cos}}=\frac{1}{N}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{cos}}{\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\gamma }_{\mathrm{sin}}}\right)}^{-1}\quad \hfill \\ {\delta }_{\mathrm{sin}}\to {\gamma }_{\mathrm{cos}}=\frac{1}{N}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{sin}}{\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\gamma }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\gamma }_{\mathrm{sin}}}\right)}^{-1}\quad \hfill \end{cases},\end{equation} \tag{ 17 }$
by taking directly λ → 0 in theorem 1.

As a consequence, in the ridgeless limit λ → 0, theorem 1 exhibits the following two learning phases:

(a)
Under-parameterized phase: with 2N < n. Here, Q is not well defined (indeed Q ∼ λ⁻¹) and one must consider instead the properly scaled γ_cos, γ_sin and $\lambda \bar{\mathbf{Q}}$ in (16). Like δ_cos and δ_sin, γ_cos and γ_sin also decrease as N/n grows large. In particular, one has ${\gamma }_{\mathrm{cos}},{\gamma }_{\mathrm{sin}},{\Vert}\lambda \bar{\mathbf{Q}}{\Vert}\to 0$ as 2N − n ↑ 0.
(b)
Over-parameterized phase: with 2N > n. Here, one can consider δ_cos, δ_sin and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ . One has particularly that ${\delta }_{\mathrm{cos}},{\delta }_{\mathrm{sin}},{\Vert}\bar{\mathbf{Q}}{\Vert}\to \infty$ as 2N − n ↓ 0 and tend to zero as N/n → ∞.

With this discussion on the two phases of learning, we now understand why:

in the leftmost plot of figure 3 with 2N < n, δ_cos and δ_sin behave rather differently from other plots and approximately scale as λ⁻¹ for small values of λ; and
in the first and second leftmost plots of figure 4, a 'jump' in the values of δ occurs at the transition point 2N = n, and the δ's are numerically of the same order of λ⁻¹ for 2N < n.

To characterize the phase transition from (16) and (17) in the λ → 0 setting, we consider the scaled variables

$\begin{equation}\begin{cases}{\gamma }_{\sigma }=\lambda {\delta }_{\sigma }\quad \hfill & \text{for}\enspace 2\enspace N< n\hfill \\ {\gamma }_{\sigma }={\delta }_{\sigma }\quad \hfill & \text{for}\enspace 2\enspace N > n\hfill \end{cases},\quad \sigma \in \left\{\mathrm{cos},\mathrm{sin}\right\}.\end{equation} \tag{ 18 }$

An advantage of using these scaled variables is that they are of order O(1) as n, p, N → ∞ and λ → 0. The behavior of (γ_cos, γ_sin) is reported in figure 6, in the same setting as figure 4. Observe the sharp transition between the 2N < n and 2N > n regime, in particular for λ = 10⁻⁷ and λ = 10⁻³, and that this transition is smoothed out for λ = 1. (A 'transition' is also seen for λ = 10, but this is potentially misleading. It is true that γ_cos and γ_sin do change in this way, as a function of N/n, but unless λ ≈ 0, these quantities are not solutions of the aforementioned fixed point equations.)

On account of these two different phases of learning (under- and over-parameterized, in (16) and (17), respectively) and the sharp transition of (γ_cos, γ_sin) in figure 6, it is not surprising to observe a 'singular' behavior at 2N = n, when no regularization is applied. We next examine the asymptotic training and test errors in more detail.

Asymptotic training MSE as λ → 0. In the under-parameterized regime with 2N < n, combining (16) we have that both $\lambda \bar{\mathbf{Q}}$ and $\frac{\bar{\mathbf{Q}}}{1+{\delta }_{\sigma }}\sim \frac{\lambda \bar{\mathbf{Q}}}{{\gamma }_{\sigma }},\sigma \in \left\{\mathrm{cos},\mathrm{sin}\right\}$ are well-behaved and are generally not zero. As a consequence, by theorem 2, the asymptotic training error ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ tends to a nonzero limit as λ → 0, measuring the residual information in the training set that is not captured by the regressor $\boldsymbol{\beta }\in {\mathbb{R}}^{2N}$ . As 2N − n ↑ 0, we have γ_cos, γ_sin → 0 and ${\Vert}\lambda \bar{\mathbf{Q}}{\Vert}\to 0$ so that ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}\to 0$ and β interpolates the entire training set. On the other hand, in the over-parameterized 2N > n regime, one always has ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}=0$ . This particularly implies the training error is 'continuous' around the point 2N = n.

Asymptotic test MSE as λ → 0. Again, in the under-parameterized regime with 2N < n, now consider the more involved asymptotic test error in theorem 3. In particular, we will focus here on the case $\hat{\mathbf{X}}\ne \mathbf{X}$ (or, more precisely, they are sufficiently different from each other in such a way that ${\Vert}\mathbf{X}-\hat{\mathbf{X}}{\Vert}\to \hspace{-12pt}/\;\;\;0$ as n, p, N → ∞ and λ → 0; see further discussion below in section 4.2) so that ${\mathbf{K}}_{\sigma }(\mathbf{X},\mathbf{X})\ne {\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ and $\frac{N}{n}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\ne {\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$ . In this case, the two-by-two matrix Ω in ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ diverges to infinity at 2N = n in the λ → 0 limit. (Indeed, the determinant det(Ω⁻¹) scales as λ, per lemma 5.) As a consequence, we have ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}\to \infty$ as 2N → n, resulting in a sharp deterioration of the test performance around 2N = n. (Of course, this holds if no additional regularization is applied as discussed in remark 5.) It is also interesting to note that, while Ω also appears in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ , we still obtain (asymptotically) zero training MSE at 2N = n, despite the divergence of Ω, again due to the prefactor λ² in ${\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ . If λ ≳ 1, then det(Ω⁻¹) exhibits much more regular properties (figure 7), as one would expect.

**Figure 7.** Behavior of det(Ω⁻¹) on MNIST data set (class 3 versus 7), as a function of N/n, for p = 784, n = 1000 and λ = 10⁻⁷, 10⁻³, 1, 10. The **black** dashed line represents the interpolation threshold 2 N = n.
Download figure:
Standard image High-resolution image

4.2. Impact of training-test similarity

Continuing our discussion of the RFF performance in the large n, p, N limit, we can see that the (asymptotic) test error behaves entirely differently, depending on whether $\hat{\mathbf{X}}$ is 'close to' X or not. For $\hat{\mathbf{X}}=\mathbf{X}$ , one has ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}={\bar{E}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ that decreases monotonically as N grows large; while for $\hat{\mathbf{X}}$ 'sufficiently' different from X, ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ diverges to infinity at 2N = n. To have a more quantitative assessment of the influence of training-test data similarity on the test error, we consider the special case $\hat{n}=n$ and $\hat{\mathbf{y}}=\mathbf{y}$ . In this case, it follows from theorem 3 that

$\begin{align*}\hfill {{\Theta}}_{\sigma }& =\frac{1}{N}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\sigma }+{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\hat{\mathbf{X}})-2{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X}))+\frac{2}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\Delta}{\mathbf{\Phi }}^{\mathsf{\text{T}}}{\Delta}{\mathbf{K}}_{\sigma }\hfill \\ \hfill & \quad +\frac{N}{n}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\Delta}{\mathbf{\Phi }}^{\mathsf{\text{T}}}{\Delta}\mathbf{\Phi }\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }+\frac{n}{N}\frac{{\lambda }^{2}}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\mathbf{K}}_{\sigma }\bar{\mathbf{Q}}-\frac{2\lambda }{N}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\Delta}{\mathbf{K}}_{\sigma }-\frac{2\lambda }{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}{\Delta}{\mathbf{\Phi }}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\sigma },\hfill \end{align*}$

for σ ∈ { cos, sin }, ${\Delta}{\mathbf{K}}_{\sigma }={\mathbf{K}}_{\sigma }-{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})$ and ${\Delta}\mathbf{\Phi }\equiv \hat{\mathbf{\Phi }}-\mathbf{\Phi }$ . Since in the ridgeless λ → 0 limit the matrix Ω scale as λ⁻¹ (see figure 7), one must have Θ_σ scaling as λ so that ${\bar{E}}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}$ does not diverge at 2N = n as λ → 0. One example is the case where the test data is a small (additive) perturbation of the training data such that, in the kernel feature space

$\begin{equation*}{\mathbf{K}}_{\sigma }-{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})=\lambda {\mathbf{\Xi }}_{\sigma },\quad {\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\hat{\mathbf{X}})-{\mathbf{K}}_{\sigma }(\hat{\mathbf{X}},\mathbf{X})=\lambda {\hat{\mathbf{\Xi }}}_{\sigma }\end{equation*}$

for ${\mathbf{\Xi }}_{\sigma },{\hat{\mathbf{\Xi }}}_{\sigma }\in {\mathbb{R}}^{n\times n}$ of bounded spectral norms. In this setting, we have ${{\Theta}}_{\sigma }=\frac{\lambda }{N}\enspace \mathrm{t}\mathrm{r}({\mathbf{\Xi }}_{\sigma }+{\hat{\mathbf{\Xi }}}_{\sigma })+O({\lambda }^{2})$ so that the asymptotic test error does not diverge to infinity at 2N = n as λ → 0. This is supported by figure 8, where the test data are generated by adding Gaussian white noise of variance σ² to the training data, i.e. ${\hat{\mathbf{x}}}_{i}={\mathbf{x}}_{i}+\sigma {\boldsymbol{\varepsilon }}_{i}$ , for independent ${\boldsymbol{\varepsilon }}_{i}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p}/p)$ . In figure 8, we observe that (i) below the threshold σ² = λ, test error coincides with the training error and both are close to zero; and (ii) as soon as σ² ≳ λ, the test error diverges from the training error and grows large (but linearly in σ²) as the noise level increases. Note also from the two rightmost plots of figure 8 that, the training-to-test 'transition' at σ² ∼ λ is sharp only for relatively small values of λ, as predicted by our theory.

**Figure 8.** Empirical training and test errors of RFF ridgeless regression on MNIST data (class 3 versus 7), when modeling training-test similarity as $\hat{\mathbf{X}}=\mathbf{X}+\sigma \boldsymbol{\varepsilon }$ , with ɛ having i.i.d. $\mathcal{N}(0,1/p)$ entries, as a function of the noise level σ², for N = 512, p = 784, $n=\hat{n}=1024=2N$ , λ = 10⁻⁷, 10⁻³, 1, 10. Results obtained by averaging over 30 runs.
Download figure:
Standard image High-resolution image

**Figure 8.** Empirical training and test errors of RFF ridgeless regression on MNIST data (class 3 versus 7), when modeling training-test similarity as $\hat{\mathbf{X}}=\mathbf{X}+\sigma \boldsymbol{\varepsilon }$ , with ɛ having i.i.d. $\mathcal{N}(0,1/p)$ entries, as a function of the noise level σ², for N = 512, p = 784, $n=\hat{n}=1024=2N$ , λ = 10⁻⁷, 10⁻³, 1, 10. Results obtained by averaging over 30 runs.
Download figure:
Standard image High-resolution image

4.3. Additional real-world data sets

So far, we have presented results in detail for one particular real-world data set, but we have extensive empirical results demonstrating that similar conclusions hold more broadly. As an example of these additional results, here we present a numerical evaluation of our results on several other real-world image data sets. We consider the classification task on another two MNIST-like data sets composed of 28 × 28 grayscale images: the Fashion-MNIST [48] and the Kannada-MNIST [49] data sets. Each image is represented as a p = 784-dimensional vector and the output targets $\mathbf{y},\hat{\mathbf{y}}$ are taken to have −1, +1 entries depending on the image class. As a consequence, both the training and test MSEs in (5) are approximately 1 for N = 0 and significantly small λ, as observed in figures 5 and 11 below. For each data set, images were jointly centered and scaled so to fall close to the setting of assumption 1 on X and $\hat{\mathbf{X}}$ .

In figure 9, we compare the empirical training and test errors with their limiting behaviors derived in theorems 2 and 3, as a function of the penalty parameter λ, on a training set of size n = 1024 (512 images from class 5 and 512 images from class 6) with feature dimension N = 256 and N = 512, on both data sets. A close fit between theory and practice is observed, for moderately large values of n, p, N, demonstrating a wide practical applicability of the proposed asymptotic analyses, particularly compared to the (limiting) Gaussian kernel predictions per figure 2.

**Figure 9.** MSEs of RFF regression on Fashion-MNIST (left two) and Kannada-MNIST (right two) data (class 5 versus 6), as a function of regression parameter λ, for p = 784, $n=\hat{n}=1024$ , N = 256 and 512. Empirical results displayed in blue (circles for training and crosses for test); and the asymptotics from theorems 2 and 3 displayed in **red** (sold lines for training and dashed for test). Results obtained by averaging over 30 runs.
Download figure:
Standard image High-resolution image

In figure 10, we report the behavior of the pair (δ_cos, δ_sin) for small values of λ = 10⁻⁷ and 10⁻³. Similar to the two leftmost plots in figure 4 for MNIST, a jump from the under-to over-parameterized regime occurs at the interpolation threshold 2N = n, in both Fashion- and Kannada-MNIST data sets, clearly indicating the two phases of learning and the phase transition between them.

In figure 11, we report the empirical and theoretical test errors as a function of the ratio N/n, on a training test of size n = 500 (250 images from class 8 and 250 images from class 9), by varying the feature dimension N. An exceedingly small regularization λ = 10⁻⁷ is applied to mimic the 'ridgeless' limiting behavior as λ → 0. On both data sets, the corresponding double descent curve is observed where the test errors go down and up, with a singular peak around 2N = n, and then go down monotonically as N continues to increase when 2N > n.

5. Conclusion

We have established a precise description of the resolvent of RFF Gram matrices, and provided asymptotic training and test performance guarantees for RFF ridge regression, in the limit of n, p, N → ∞ at the same pace. We have also discussed the under- and over-parameterized regimes, where the resolvent behaves dramatically differently. These observations involve only mild regularity assumptions on the data distribution, yielding phase transition behavior and corresponding double descent test error curves for RFF regression that closely match experiments on real-world data. From a technical perspective, our analysis extends to arbitrary combinations of (Lipschitz) non-linearities, such as the more involved homogeneous kernel maps [14]. This opens the door for future studies of more elaborate random feature structures and models. Extended to a (technically more involved) multi-layer setting in the more realistic large n, p, N regime, as in [50], our analysis may shed new light on the theoretical understanding of modern deep neural nets, beyond the large-N alone NTK limit [1].

Acknowledgments

Z L would like to acknowledge the Fundamental Research Funds for the Central Universities of China (No. 2021XXJS110) and CCF-Hikvision Open Fund (20210008) for providing partial support of this work. R C would like to acknowledge the MIAI LargeDATA chair (ANR-19-P3IA-0003) at University Grenoble-Alpes as well as the HUAWEI LarDist project for providing partial support of this work. M W M would like to acknowledge DARPA, IARPA (Contract W911NF20C0035), NSF, and ONR via its BRC on RandNLA for providing partial support of this work. Our conclusions do not necessarily reflect the position or the policy of our sponsors, and no official endorsement should be inferred.

Appendix A.: Proof of theorem 1

Our objective is to prove, under assumption 1, the asymptotic equivalence between the expectation (with respect to W, omitted from now on) $\mathbb{E}[\mathbf{Q}]$ and

$\begin{equation*}\bar{\mathbf{Q}}\equiv {\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)+\lambda {\mathbf{I}}_{n}\right)}^{-1}\end{equation*}$

for ${\mathbf{K}}_{\mathrm{cos}}\equiv {\mathbf{K}}_{\mathrm{cos}}(\mathbf{X},\mathbf{X}),{\mathbf{K}}_{\mathrm{sin}}\equiv {\mathbf{K}}_{\mathrm{sin}}(\mathbf{X},\mathbf{X})\in {\mathbb{R}}^{n\times n}$ defined in (8), with (δ_cos, δ_cos) the unique positive solution to

$\begin{equation*}{\delta }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}),\quad {\delta }_{\mathrm{sin}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}).\end{equation*}$

The existence and uniqueness of the above fixed-point equation is standard in random matrix literature and can be reached for instance with the standard interference function framework [51].

The asymptotic equivalence should be announced in the sense that ${\Vert}\mathbb{E}[\mathbf{Q}]-\bar{\mathbf{Q}}{\Vert}\to 0$ as n, p, N → ∞ at the same pace. We shall proceed by introducing an intermediary resolvent $\tilde{\mathbf{Q}}$ (see definition in (A.2)) and show subsequently that

$\begin{equation*}{\Vert}\mathbb{E}[\mathbf{Q}]-\tilde{\mathbf{Q}}{\Vert}\to 0,\quad {\Vert}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}{\Vert}\to 0.\end{equation*}$

In the sequel, we use o(1) and o_||⋅||(1) for scalars or matrices of (almost surely if being random) vanishing absolute values or operator norms as n, p → ∞.

We start by introducing the following lemma.

Lemma 1. (Expectation of ${\sigma }_{1}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w}){\sigma }_{2}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})$ ). For $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ and ${\mathbf{x}}_{i},{\mathbf{x}}_{j}\in {\mathbb{R}}^{p}$ we have (per definition in (8))

$\begin{align*}\hfill {\mathbb{E}}_{\mathbf{w}}[\mathrm{cos}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w})\mathrm{cos}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})]& ={\text{e}}^{-\frac{1}{2}({\Vert}{\mathbf{x}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}{{\Vert}}^{2})}\mathrm{cosh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\equiv {[{\mathbf{K}}_{\mathrm{cos}}(\mathbf{X},\mathbf{X})]}_{ij}\equiv {[{\mathbf{K}}_{\mathrm{cos}}]}_{ij}\hfill \\ \hfill {\mathbb{E}}_{\mathbf{w}}[\mathrm{sin}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w})\mathrm{sin}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})]& ={\text{e}}^{-\frac{1}{2}({\Vert}{\mathbf{x}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}{{\Vert}}^{2})}\mathrm{sinh}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\equiv {[{\mathbf{K}}_{\mathrm{sin}}(\mathbf{X},\mathbf{X})]}_{ij}\equiv {[{\mathbf{K}}_{\mathrm{sin}}]}_{ij}\hfill \\ \hfill {\mathbb{E}}_{\mathbf{w}}[\mathrm{cos}({\mathbf{x}}_{i}^{\mathsf{\text{T}}}\mathbf{w})\mathrm{sin}({\mathbf{w}}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})]& =0.\hfill \end{align*}$

Proof of Lemma 1 .The proof follows the integration tricks in [15, 52]. Note in particular that the third equality holds in the case of (cos, sin) non-linearity but in general not true for arbitrary Lipschitz (σ₁, σ₂). □

Let us focus on the resolvent $\mathbf{Q}\equiv {\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}$ of $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\in {\mathbb{R}}^{n\times n}$ , for RFF matrix ${\mathbf{\Sigma }}_{\mathbf{X}}\equiv \left[\begin{matrix}\hfill \mathrm{cos}(\mathbf{W}\mathbf{X})\hfill \\ \hfill \mathrm{sin}(\mathbf{W}\mathbf{X})\hfill \end{matrix}\right]$ that can be rewritten as

$\begin{equation}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}=[\mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{1}),\dots ,\mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{N}),\mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{1}),\dots ,\mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{N})]\end{equation} \tag{ A.1 }$

for w_i the ith row of $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ with ${\mathbf{w}}_{i}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p}),i=1,\dots ,N$ , that is at the core of our analysis. Note from (A.1) that we have

$\begin{equation*}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}=\sum\limits _{i=1}^{N}\left(\mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X})+\mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X})\right)=\sum\limits _{i=1}^{N}{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\end{equation*}$

with ${\mathbf{U}}_{i}=\left[\begin{matrix}\hfill \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\in {\mathbb{R}}^{n\times 2}$ .

Letting

$\begin{equation}\tilde{\mathbf{Q}}\equiv {\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}+\lambda {\mathbf{I}}_{n}\right)}^{-1}\end{equation} \tag{ A.2 }$

with

$\begin{equation}{\alpha }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\mathbb{E}[\mathbf{Q}]),\quad {\alpha }_{\mathrm{sin}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{sin}}\mathbb{E}[\mathbf{Q}])\end{equation} \tag{ A.3 }$

we have, with the resolvent identity (A⁻¹ − B⁻¹ = A⁻¹(B − A)B⁻¹ for invertible A, B) that

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}]-\tilde{\mathbf{Q}}& =\mathbb{E}\left[\mathbf{Q}\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}-\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\right)\right]\tilde{\mathbf{Q}}\hfill \\ \hfill & =\mathbb{E}[\mathbf{Q}]\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}\right)\tilde{\mathbf{Q}}-\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}]\tilde{\mathbf{Q}}\hfill \\ \hfill & =\mathbb{E}[\mathbf{Q}]\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}\right)\tilde{\mathbf{Q}}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\right]\tilde{\mathbf{Q}},\hfill \end{align*}$

for ${\mathbf{Q}}_{-i}\equiv {\left(\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}-\frac{1}{n}{\mathbf{U}}_{i}{\mathbf{U}}_{i}+\lambda {\mathbf{I}}_{n}\right)}^{-1}$ that is independent of U_i (and thus w_i), where we applied the following Woodbury identity.

Lemma 2. (Woodbury). For $\mathbf{A},\mathbf{A}+\mathbf{U}{\mathbf{U}}^{\mathsf{\text{T}}}\in {\mathbb{R}}^{p\times p}$ both invertible and $\mathbf{U}\in {\mathbb{R}}^{p\times n}$ , we have

$\begin{equation*}{(\mathbf{A}+\mathbf{U}{\mathbf{U}}^{\mathsf{\text{T}}})}^{-1}={\mathbf{A}}^{-1}-{\mathbf{A}}^{-1}\mathbf{U}{({\mathbf{I}}_{n}+{\mathbf{U}}^{\mathsf{\text{T}}}{\mathbf{A}}^{-1}\mathbf{U})}^{-1}{\mathbf{U}}^{\mathsf{\text{T}}}{\mathbf{A}}^{-1}\end{equation*}$

so that in particular ${(\mathbf{A}+\mathbf{U}{\mathbf{U}}^{\mathsf{\text{T}}})}^{-1}\mathbf{U}={\mathbf{A}}^{-1}\mathbf{U}{({\mathbf{I}}_{n}+{\mathbf{U}}^{\mathsf{\text{T}}}{\mathbf{A}}^{-1}\mathbf{U})}^{-1}$ .

Consider now the two-by-two matrix

$\begin{align*}\hfill & \quad {\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\hfill \\ \hfill & \qquad =\left[\begin{matrix}\hfill 1+\frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \\ \hfill \frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill 1+\frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\hfill \end{align*}$

which, according to the following lemma, is expected to be close to $\left[\begin{matrix}\hfill 1+{\alpha }_{\mathrm{cos}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 1+{\alpha }_{\mathrm{sin}}\hfill \end{matrix}\right]$ as defined in (A.3).

Lemma 3. (Concentration of quadratic forms). Under assumption 1, for σ₁(⋅), σ₂(⋅) two real one-Lipschitz functions, $\mathbf{w}\sim \mathcal{N}(\mathbf{0},{\mathbf{I}}_{p})$ and $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ independent of w with ||A|| ⩽ 1, then

$\begin{align*}\hfill & \quad \mathbb{P}\left(\left\vert \frac{1}{n}{\sigma }_{a}({\mathbf{w}}^{\mathsf{\text{T}}}\mathbf{X})\mathbf{A}{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}\mathbf{w})-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\mathbf{A}{\mathbb{E}}_{\mathbf{w}}[{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}\mathbf{w}){\sigma }_{a}({\mathbf{w}}^{\mathsf{\text{T}}}\mathbf{X})])\right\vert > t\right)\leqslant C\enspace {\text{e}}^{-cn\mathrm{min}(t,{t}^{2})}\hfill \end{align*}$

for a, b ∈ {1, 2} and some universal constants C, c > 0.

Proof of Lemma 3 .Lemma 3 can be easily extended from lemma 1 in [15], where one observes the proof actually holds when different types of nonlinear Lipschitz functions σ₁(⋅), σ₂(⋅) (and in particular cos and sin) are considered. □

For ${\mathbf{W}}_{-i}\in {\mathbb{R}}^{(N-1)\times p}$ the random matrix $\mathbf{W}\in {\mathbb{R}}^{N\times p}$ with its ith row w_i removed, lemma 3, together with the Lipschitz nature of the map ${\mathbf{W}}_{-i}{\mapsto}\frac{1}{n}{\sigma }_{a}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})$ for ${\mathbf{Q}}_{-i}={(\frac{1}{n}\enspace \mathrm{cos}{({\mathbf{W}}_{-i}\mathbf{X})}^{\mathsf{\text{T}}}\enspace \mathrm{cos}({\mathbf{W}}_{-i}\mathbf{X})+\frac{1}{n}\enspace \mathrm{sin}{({\mathbf{W}}_{-i}\mathbf{X})}^{\mathsf{\text{T}}}\enspace \mathrm{sin}({\mathbf{W}}_{-i}\mathbf{X})+\lambda {\mathbf{I}}_{n})}^{-1}$ , leads to the following concentration result

$\begin{align}\hfill & \quad \mathbb{P}\left(\left\vert \frac{1}{n}{\sigma }_{a}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i}) - \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\left(\mathbb{E}[{\mathbf{Q}}_{-i}]\mathbb{E}[{\sigma }_{b}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i}){\sigma }_{a}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X})]\right)\right\vert > t\right) \leqslant {C}^{\prime }\enspace {\text{e}}^{-{c}^{\prime }n\mathrm{max}({t}^{2},t)}\hfill \end{align} \tag{ A.4 }$

the proof of which follows the same line of argument of lemma 4 in [15] and is omitted here.

As a consequence, we continue to write, with again the resolvent identity, that

$\begin{align*}\hfill & \quad {\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}-{\left[\begin{matrix}\hfill 1+{\alpha }_{\mathrm{cos}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 1+{\alpha }_{\mathrm{sin}}\hfill \end{matrix}\right]}^{-1}\hfill \\ \hfill & \qquad ={\left[\begin{matrix}\hfill 1+\frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \\ \hfill \frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill 1+\frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]}^{-1}\hfill \\ \hfill & \qquad -{\left[\begin{matrix}\hfill 1+{\alpha }_{\mathrm{cos}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill 1+{\alpha }_{\mathrm{sin}}\hfill \end{matrix}\right]}^{-1}\hfill \\ \hfill & \qquad ={\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}\hfill \\ \hfill & \qquad \times \left[\begin{matrix}\hfill {\alpha }_{\mathrm{cos}}-\frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill -\frac{1}{n}\enspace \mathrm{cos}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \\ \hfill -\frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill {\alpha }_{\mathrm{sin}}-\frac{1}{n}\enspace \mathrm{sin}({\mathbf{w}}_{i}^{\mathsf{\text{T}}}\mathbf{X}){\mathbf{Q}}_{-i}\enspace \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\hfill \\ \hfill & \qquad \times \left[\begin{matrix}\hfill \frac{1}{1+{\alpha }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\alpha }_{\mathrm{sin}}}\hfill \end{matrix}\right]\equiv {\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{D}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\alpha }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\alpha }_{\mathrm{sin}}}\hfill \end{matrix}\right],\hfill \end{align*}$

where we note from (A.4) (and ||Q_−i|| ⩽ λ⁻¹) that the matrix $\mathbb{E}[{\mathbf{D}}_{i}]={o}_{{\Vert}\cdot {\Vert}}(1)$ (in fact of spectral norm of order $O({n}^{-\frac{1}{2}})$ ). So that

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}]-\tilde{\mathbf{Q}}& =\mathbb{E}[\mathbf{Q}]\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}\right)\tilde{\mathbf{Q}}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\right]\tilde{\mathbf{Q}}\hfill \\ \hfill & =\mathbb{E}[\mathbf{Q}]\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}\right)\tilde{\mathbf{Q}}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\alpha }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\alpha }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\right]\tilde{\mathbf{Q}}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{D}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\alpha }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\alpha }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\right]\tilde{\mathbf{Q}}\hfill \\ \hfill & =\left(\mathbb{E}[\mathbf{Q}]-\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}[{\mathbf{Q}}_{-i}]\right)\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\alpha }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\alpha }_{\mathrm{sin}}}\right)\tilde{\mathbf{Q}}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[\mathbf{Q}{\mathbf{U}}_{i}{\mathbf{D}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\alpha }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\alpha }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\right]\tilde{\mathbf{Q}},\hfill \end{align*}$

where we used ${\mathbb{E}}_{{\mathbf{w}}_{i}}[{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}]={\mathbf{K}}_{\mathrm{cos}}+{\mathbf{K}}_{\mathrm{sin}}$ by lemma 1 and then lemma 2 in reverse for the last equality. Moreover, since

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}]-\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}[{\mathbf{Q}}_{-i}]& =\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}[\mathbf{Q}-{\mathbf{Q}}_{-i}]\hfill \\ \hfill & =-\frac{1}{n}\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}\left[\mathbf{Q}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\mathbf{Q}\right]\hfill \end{align*}$

so that with the fact $\frac{1}{\sqrt{n}}{\Vert}\mathbf{Q}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\Vert}\leqslant {\Vert}\sqrt{\mathbf{Q}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}}{\Vert}\leqslant {\lambda }^{-\frac{1}{2}}$ we have for the first term

$\begin{equation*}{\Vert}\mathbb{E}[\mathbf{Q}]-\frac{1}{N}\sum\limits _{i=1}^{N}\mathbb{E}[{\mathbf{Q}}_{-i}]{\Vert}=O({n}^{-1}).\end{equation*}$

It thus remains to treat the second term, which, with the relation AB^T + BA^T ⪯ AA^T + BB^T (in the sense of symmetric matrices), and the same line of arguments as above, can be shown to have vanishing spectral norm (of order $O({n}^{-\frac{1}{2}})$ ) as n, p, N → ∞.

We thus have ${\Vert}\mathbb{E}[\mathbf{Q}]-\tilde{\mathbf{Q}}{\Vert}=O({n}^{-\frac{1}{2}})$ , which concludes the first part of the proof of theorem 1.

We shall show next that ${\Vert}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}{\Vert}\to 0$ as n, p, N → ∞. First note from previous derivation that ${\alpha }_{\sigma }-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\sigma }\tilde{\mathbf{Q}}=O({n}^{-\frac{1}{2}})$ for σ = cos, sin. To compare $\tilde{\mathbf{Q}}$ and $\bar{\mathbf{Q}}$ , it follows again from the resolvent identity that

$\begin{equation*}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}=\tilde{\mathbf{Q}}\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}({\alpha }_{\mathrm{cos}}-{\delta }_{\mathrm{cos}})}{(1+{\delta }_{\mathrm{cos}})(1+{\alpha }_{\mathrm{cos}})}+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}({\alpha }_{\mathrm{sin}}-{\delta }_{\mathrm{sin}})}{(1+{\delta }_{\mathrm{sin}})(1+{\alpha }_{\mathrm{sin}})}\right)\bar{\mathbf{Q}}\end{equation*}$

so that the control of ${\Vert}\tilde{\mathbf{Q}}-\bar{\mathbf{Q}}{\Vert}$ boils down to the control of max{|α_cos − δ_cos|, |α_sin − δ_sin|}. To this end, it suffices to write

$\begin{equation*}{\alpha }_{\mathrm{cos}}-{\delta }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{cos}}(\mathbb{E}[\mathbf{Q}]-\bar{\mathbf{Q}})=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\mathrm{cos}}(\tilde{\mathbf{Q}}-\bar{\mathbf{Q}})+O\left({n}^{-\frac{1}{2}}\right)\end{equation*}$

where we used | tr(AB)| ⩽ ||A|| tr(B) for nonnegative definite B, together with the fact that $\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace {\mathbf{K}}_{\sigma }$ is (uniformly) bounded under assumption 1, for σ = cos, sin.

As a consequence, we have

$\begin{equation*}\vert {\alpha }_{\mathrm{cos}}-{\delta }_{\mathrm{cos}}\vert \leqslant \vert {\alpha }_{\mathrm{cos}}-{\delta }_{\mathrm{cos}}\vert \frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\tilde{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{(1+{\delta }_{\mathrm{cos}})(1+{\alpha }_{\mathrm{cos}})}+o(1).\end{equation*}$

It thus remains to show

$\begin{equation*}\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\tilde{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{(1+{\delta }_{\mathrm{cos}})(1+{\alpha }_{\mathrm{cos}})}< 1\end{equation*}$

or alternatively, by the Cauchy–Schwarz inequality, to show

$\begin{align*}\hfill \frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\tilde{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{(1+{\delta }_{\mathrm{cos}})(1+{\alpha }_{\mathrm{cos}})}& \leqslant \sqrt{\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\cdot \frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\tilde{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\tilde{\mathbf{Q}})}{{(1+{\alpha }_{\mathrm{cos}})}^{2}}}< 1.\hfill \end{align*}$

To treat the first right-hand side term (the second can be done similarly), it unfolds from | tr(AB)| ⩽ ||A|| ⋅ tr(B) for nonnegative definite B that

$\begin{align*}\hfill \frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}& \leqslant {\Vert}\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{cos}}}{\Vert}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})}{1+{\delta }_{\mathrm{cos}}}={\Vert}\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{cos}}}{\Vert}\frac{{\gamma }_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\hfill \\ \hfill & \leqslant \frac{{\gamma }_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}< 1\hfill \end{align*}$

where we used the fact that $\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{cos}}}={\mathbf{I}}_{n}-\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}}{1+{\delta }_{\mathrm{sin}}}-\lambda \bar{\mathbf{Q}}$ . This concludes the proof of theorem 1.□

Appendix B.: Proof of theorem 2

To prove theorem 2, it indeed suffices to prove the following lemma.

Lemma 4. (Asymptotic behavior of $\mathbb{E}$ [QAQ]). Under assumption 1, for Q defined in (7) and symmetric nonnegative definite $\mathbf{A}\in {\mathbb{R}}^{n\times n}$ of bounded spectral norm, we have

$\begin{equation*}{\Vert}\mathbb{E}[\mathbf{Q}\mathbf{A}\mathbf{Q}]-\left(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}+\frac{N}{n}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbf{\Omega }\left[\begin{matrix}\hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\hfill \\ \hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\hfill \end{matrix}\right]\right){\Vert}\to 0\end{equation*}$

almost surely as n → ∞, with ${\mathbf{\Omega }}^{-1}\equiv {\mathbf{I}}_{2}-\frac{N}{n}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \\ \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]$ . In particular, we have

$\begin{equation*}{\Vert}\mathbb{E}\left[\begin{matrix}\hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{cos}}\mathbf{Q}\hfill \\ \hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{sin}}\mathbf{Q}\hfill \end{matrix}\right]-\mathbf{\Omega }\left[\begin{matrix}\hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\hfill \\ \hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\hfill \end{matrix}\right]{\Vert}\to 0.\end{equation*}$

Proof of Lemma 4 .The proof of lemma 4 essentially follows the same line of arguments as that of theorem 1. Writing

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}\mathbf{A}\mathbf{Q}]& =\mathbb{E}[\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}]+\mathbb{E}[(\mathbf{Q}-\bar{\mathbf{Q}})\mathbf{A}\mathbf{Q}]\hfill \\ \hfill & \simeq \bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}+\mathbb{E}\left[\mathbf{Q}\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}-\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\right)\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}\right]\hfill \\ \hfill & =\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}+\frac{N}{n}\mathbb{E}[\mathbf{Q}\mathbf{\Phi }\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}]-\frac{1}{n}\sum\limits _{i=1}^{N}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}]\hfill \end{align*}$

where we note ≃ by ignoring matrices with vanishing spectral norm (i.e. o_||⋅||(1)) in the n,, p, N → ∞ limit and recall the shortcut $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ . Developing rightmost term with lemma 2 as

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}]& =\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{A}\mathbf{Q}\right]\hfill \\ \hfill & =\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{A}{\mathbf{Q}}_{-i}\right]\hfill \\ \hfill & \quad -\frac{1}{n}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}\mathbf{A}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right.\hfill \\ \hfill & \left.\quad \times {\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\simeq \mathbb{E}[{\mathbf{Q}}_{-i}\mathbf{\Phi }\bar{\mathbf{Q}}\mathbf{A}{\mathbf{Q}}_{-i}]\hfill \\ \hfill & \quad -\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]\right.\hfill \\ \hfill & \left.\quad \times \left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})\hfill \end{matrix}\right]\right.\hfill \\ \hfill & \left.\quad \times \left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\hfill \end{align*}$

so that

$\begin{align}\hfill \mathbb{E}[\mathbf{Q}\mathbf{A}\mathbf{Q}]& \simeq \bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}+\frac{N}{n}\mathbb{E}\left[\mathbf{Q}\left(\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}{\mathbf{K}}_{\mathrm{cos}}+\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}{\mathbf{K}}_{\mathrm{sin}}\right)\mathbf{Q}\right]\hfill \\ \hfill & =\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}+\frac{N}{n}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{A}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbb{E}\left[\begin{matrix}\hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{cos}}\mathbf{Q}\hfill \\ \hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{sin}}\mathbf{Q}\hfill \end{matrix}\right]\hfill \end{align} \tag{ B.1 }$

by taking A = K_cos or K_sin, we result in

$\begin{align*}\hfill \mathbb{E}[\mathbf{Q}{\mathbf{K}}_{\mathrm{cos}}\mathbf{Q}]\simeq \frac{c}{ac-bd}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}+\frac{b}{ac-bd}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\\ \hfill \mathbb{E}[\mathbf{Q}{\mathbf{K}}_{\mathrm{sin}}\mathbf{Q}]\simeq \frac{a}{ac-bd}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}+\frac{d}{ac-bd}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\end{align*}$

with $a=1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}$ , $b=\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}$ , $c=1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}$ and $d=\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}$ such that ${(1+{\delta }_{\mathrm{sin}})}^{2}b={(1+{\delta }_{\mathrm{cos}})}^{2}d$ .

$\begin{equation*}\mathbb{E}\left[\begin{matrix}\hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{cos}}\mathbf{Q}\hfill \\ \hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{sin}}\mathbf{Q}\hfill \end{matrix}\right]\simeq {\left[\begin{matrix}\hfill a\hfill & \hfill -b\hfill \\ \hfill -d\hfill & \hfill c\hfill \end{matrix}\right]}^{-1}\left[\begin{matrix}\hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\hfill \\ \hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\hfill \end{matrix}\right]\equiv \mathbf{\Omega }\left[\begin{matrix}\hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\hfill \\ \hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\hfill \end{matrix}\right]\end{equation*}$

for $\mathbf{\Omega }\equiv {\left[\begin{matrix}\hfill a\hfill & \hfill -b\hfill \\ \hfill -d\hfill & \hfill c\hfill \end{matrix}\right]}^{-1}$ . Plugging back into (B.1) we conclude the proof of lemma 4. □

Theorem 2 can be achieved by considering the concentration of (the bilinear form) $\frac{1}{n}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}^{2}\mathbf{y}$ around its expectation $\frac{1}{n}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[{\mathbf{Q}}^{2}]\mathbf{y}$ (with for instance lemma 3 in [15]), together with lemma 4. This concludes the proof of theorem 2.□

Appendix C.: Proof of theorem 3

Recall the definition of ${E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}=\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}-{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}\boldsymbol{\beta }{{\Vert}}^{2}$ from (5) with ${\mathbf{\Sigma }}_{\hat{\mathbf{X}}}=\left[\begin{matrix}\hfill \mathrm{cos}(\mathbf{W}\hat{\mathbf{X}})\hfill \\ \hfill \mathrm{sin}(\mathbf{W}\hat{\mathbf{X}})\hfill \end{matrix}\right]\in {\mathbb{R}}^{2N\times \hat{n}}$ on a test set $(\hat{\mathbf{X}},\hat{\mathbf{y}})$ of size $\hat{n}$ , and first focus on the case 2N > n where $\boldsymbol{\beta }=\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}\mathbf{y}$ as per (4). By (A.1), we have

$\begin{equation*}{E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}=\frac{1}{\hat{n}}{{\Vert}\hat{\mathbf{y}}-\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}\mathbf{y}{\Vert}}^{2}=\frac{1}{\hat{n}}{{\Vert}\hat{\mathbf{y}}-\frac{1}{n}\sum\limits _{i=1}^{N}{\hat{\mathbf{U}}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\mathbf{Q}\mathbf{y}{\Vert}}^{2}\end{equation*}$

where, similar to the notation ${\mathbf{U}}_{i}=\left[\begin{matrix}\hfill \mathrm{cos}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \mathrm{sin}({\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\in {\mathbb{R}}^{n\times 2}$ as in the proof of theorem 1, we denote

$\begin{equation*}{\hat{\mathbf{U}}}_{i}\equiv \left[\begin{matrix}\hfill \mathrm{cos}({\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill & \hfill \mathrm{sin}({\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{w}}_{i})\hfill \end{matrix}\right]\in {\mathbb{R}}^{\hat{n}\times 2}.\end{equation*}$

As a consequence, we further get

$\begin{align*}\hfill \mathbb{E}[{E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}]& =\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}^{2}-\frac{2}{n\hat{n}}\sum\limits _{i=1}^{N}{\hat{\mathbf{y}}}^{\mathsf{\text{T}}}\mathbb{E}[{\hat{\mathbf{U}}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}+\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i,j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \\ \hfill & =\frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}^{2}-\frac{2}{n\hat{n}}\sum\limits _{i=1}^{N}{\hat{\mathbf{y}}}^{\mathsf{\text{T}}}\mathbb{E}\left[{\hat{\mathbf{U}}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\mathbf{y}\hfill \\ \hfill & \quad +\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i,j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \\ \hfill & \simeq \frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}^{2}-\frac{2}{n\hat{n}}\sum\limits _{i=1}^{N}{\hat{\mathbf{y}}}^{\mathsf{\text{T}}}\mathbb{E}\left[{\hat{\mathbf{U}}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\mathbf{y}\hfill \\ \hfill & \quad +\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i,j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \\ \hfill & \simeq \frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}^{2}-\frac{2}{\hat{n}}{\hat{\mathbf{y}}}^{\mathsf{\text{T}}}\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\right)\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill & \quad +\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i,j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \end{align*}$

where we similarly denote

$\begin{align*}\hfill {\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})& \equiv {\left\{{\text{e}}^{-\frac{1}{2}({\Vert}{\hat{\mathbf{x}}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}{{\Vert}}^{2})}\mathrm{cosh}({\hat{\mathbf{x}}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\right\}}_{i,j=1}^{\hat{n},n}\hfill \\ \hfill {\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})& \equiv {\left\{{\text{e}}^{-\frac{1}{2}({\Vert}{\hat{\mathbf{x}}}_{i}{{\Vert}}^{2}+{\Vert}{\mathbf{x}}_{j}{{\Vert}}^{2})}\mathrm{sinh}({\hat{\mathbf{x}}}_{i}^{\mathsf{\text{T}}}{\mathbf{x}}_{j})\right\}}_{i,j=1}^{\hat{n},n}\in {\mathbb{R}}^{\hat{n}\times n}.\hfill \end{align*}$

Note that, different from the proof of theorems 1 and 2 where we constantly use the fact that ||Q|| ⩽ λ⁻¹ and

$\begin{equation*}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}={\mathbf{I}}_{n}-\lambda \mathbf{Q}\end{equation*}$

so that ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}\leqslant 1$ , we do not have in general a simple control for ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}$ , when arbitrary $\hat{\mathbf{X}}$ is considered. Intuitively speaking, this is due to the loss-of-control for ${\Vert}\frac{1}{n}{({\mathbf{\Sigma }}_{\hat{\mathbf{X}}}-{\mathbf{\Sigma }}_{\mathbf{X}})}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}$ when $\hat{\mathbf{X}}$ can be chosen arbitrarily with respect to X. It was remarked in [15] (remark 1) that in general only a $O(\sqrt{n})$ upper bound can be derived for ${\Vert}\frac{1}{\sqrt{n}}{\mathbf{\Sigma }}_{\mathbf{X}}{\Vert}$ or ${\Vert}\frac{1}{\sqrt{n}}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}{\Vert}$ . Nonetheless, this problem can be resolved with the additional assumption 2.

More precisely, note that

$\begin{align}\hfill {\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}& \leqslant \frac{1}{n}{\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}+\frac{1}{n}{\Vert}{({\mathbf{\Sigma }}_{\hat{\mathbf{X}}}-{\mathbf{\Sigma }}_{\mathbf{X}})}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}\hfill \\ \hfill & \leqslant 1+\frac{1}{\sqrt{n}}{\Vert}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}-{\mathbf{\Sigma }}_{\mathbf{X}}{\Vert}\cdot \frac{1}{\sqrt{n}}{\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}\hfill \end{align} \tag{ C.1 }$

it remains to show that ${\Vert}{\mathbf{\Sigma }}_{\mathbf{X}}-{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}{\Vert}=O(\sqrt{n})$ under assumption 2 to establish ${\Vert}\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}{\Vert}=O(1)$ , that is, to show that

$\begin{equation}{\Vert}\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}}){\Vert}=O(\sqrt{n})\end{equation} \tag{ C.2 }$

for σ ∈ { cos, sin }. Note this cannot be achieved using only the Lipschitz nature of σ(⋅) and the fact that ${\Vert}\mathbf{X}-\hat{\mathbf{X}}{\Vert}\leqslant {\Vert}\mathbf{X}{\Vert}+{\Vert}\hat{\mathbf{X}}{\Vert}=O(1)$ under assumption 1 by writing

$\begin{equation}{\Vert}\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}}){\Vert}\leqslant {\Vert}\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}}){{\Vert}}_{F}\leqslant {\Vert}\mathbf{W}{{\Vert}}_{F}\cdot {\Vert}\mathbf{X}-\hat{\mathbf{X}}{\Vert}=O(n).\end{equation} \tag{ C.3 }$

where we recall that ${\Vert}\mathbf{W}{\Vert}=O(\sqrt{n})$ and ||W||_F = O(n). Nonetheless, from proposition B.1 in [43] we have that the product WX, and thus σ(WX), strongly concentrates around its expectation in the sense of (13), so that

$\begin{align*}\hfill {\Vert}\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}}){\Vert}& \leqslant {\Vert}\sigma (\mathbf{W}\mathbf{X})-\mathbb{E}[\sigma (\mathbf{W}\mathbf{X})]{\Vert}+{\Vert}\mathbb{E}[\sigma (\mathbf{W}\mathbf{X})-\sigma (\mathbf{W}\hat{\mathbf{X}})]{\Vert}\hfill \\ \hfill & \quad +{\Vert}\sigma (\mathbf{W}\hat{\mathbf{X}})-\mathbb{E}[\sigma (\mathbf{W}\hat{\mathbf{X}})]{\Vert}=O(\sqrt{n})\hfill \end{align*}$

under assumption 2. As a results, we are allowed to control $\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}$ and similarly $\frac{1}{n}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\hat{\mathbf{X}}}\mathbf{Q}$ in the same vein as $\frac{1}{n}{\mathbf{\Sigma }}_{\mathbf{X}}^{\mathsf{\text{T}}}{\mathbf{\Sigma }}_{\mathbf{X}}\mathbf{Q}$ in the proof of theorems 1 and 2 in appendices A and B, respectively.

It thus remains to handle the last term (noted Z) as follows

$\begin{align*}\hfill \mathbf{Z}& \equiv \frac{1}{{n}^{2}\hat{n}}\sum\limits _{i,j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}=\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \\ \hfill & \quad +\frac{1}{{n}^{2}\hat{n}}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}={\mathbf{Z}}_{1}+{\mathbf{Z}}_{2}\hfill \end{align*}$

where Z₁ term can be treated as

$\begin{align*}\hfill {\mathbf{Z}}_{1}& \equiv \frac{1}{{n}^{2}\hat{n}}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}[\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{i}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}\mathbf{Q}]\mathbf{y}\hfill \\ \hfill & =\frac{1}{n\hat{n}}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}\frac{1}{n}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{i}\right.\hfill \\ \hfill & \left.\quad \times {\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\mathbf{y}\hfill \\ \hfill & \simeq \frac{1}{n\hat{n}}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]\left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\hat{\hat{\mathbf{K}}}}_{\mathrm{cos}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\hat{\hat{\mathbf{K}}}}_{\mathrm{sin}}\hfill \end{matrix}\right]\right.\hfill \\ \hfill & \left.\quad \times \left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\right]\mathbf{y}\hfill \\ \hfill & \simeq \frac{N}{n}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[\mathbf{Q}\left(\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}{\mathbf{K}}_{\mathrm{cos}}+\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}{\mathbf{K}}_{\mathrm{sin}}\right)\mathbf{Q}\right]\mathbf{y}\hfill \\ \hfill & \simeq \frac{N}{n}\frac{1}{\hat{n}}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right]\hfill \end{align*}$

where we apply lemma 4 and recall

$\begin{align*}\hfill {\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}})& \equiv {\left\{{\text{e}}^{-\frac{1}{2}({\Vert}{\hat{\mathbf{x}}}_{i}{{\Vert}}^{2}+{\Vert}{\hat{\mathbf{x}}}_{j}{{\Vert}}^{2})}\mathrm{cosh}({\hat{\mathbf{x}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{x}}}_{j})\right\}}_{i,j=1}^{\hat{n}},\hfill \\ \hfill {\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})& \equiv {\left\{{\text{e}}^{-\frac{1}{2}({\Vert}{\hat{\mathbf{x}}}_{i}{{\Vert}}^{2}+{\Vert}{\hat{\mathbf{x}}}_{j}{{\Vert}}^{2})}\mathrm{sinh}({\hat{\mathbf{x}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{x}}}_{j})\right\}}_{i,j=1}^{\hat{n}}.\hfill \end{align*}$

Moving on to Z₂ and we write

$\begin{align*}\hfill {\mathbf{Z}}_{2}& \equiv \frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}\mathbf{Q}\mathbf{y}\hfill \\ \hfill & =\frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\right)}^{-1}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}\mathbf{y}\hfill \\ \hfill & \quad -\frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\right)}^{-1}\hfill \\ \hfill & \quad \times {\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\hat{\mathbf{U}}}_{j}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\right)}^{-1}{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}\mathbf{y}\hfill \\ \hfill & \simeq \frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\right){\mathbf{Q}}_{-j}\mathbf{y}\hfill \\ \hfill & \quad -\frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X}))\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}\mathbf{y}\equiv {\mathbf{Z}}_{21}-{\mathbf{Z}}_{22}.\hfill \end{align*}$

For the term Z₂₁, note that Q_−j ≃Q and depends on U_i (and ${\hat{\mathbf{U}}}_{i}$ ), such that

$\begin{align*}\hfill {\mathbf{Z}}_{21}& \equiv \frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\right){\mathbf{Q}}_{-j}\mathbf{y}\hfill \\ \hfill & \simeq \frac{N}{n}\frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\right)\mathbf{Q}\mathbf{y}\hfill \\ \hfill & =\frac{N}{n}\frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}{\mathbf{Q}}_{-i}\mathbf{y}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}\hfill \\ \hfill & \quad \times {\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{\left({\mathbf{I}}_{2}+\frac{1}{n}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\right)}^{-1}{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\mathbf{y}\hfill \\ \hfill & \simeq \frac{N}{n}\frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\left(\frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\right)}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}{\mathbf{Q}}_{-i}\mathbf{y}\hfill \\ \hfill & \quad -\frac{N}{n}\frac{1}{\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]\frac{1}{n}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\mathbf{y}\hfill \end{align*}$

where we recall the shortcut $\mathbf{\Phi }\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ and similarly $\hat{\mathbf{\Phi }}\equiv \frac{{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{1+{\delta }_{\mathrm{sin}}}\in {\mathbb{R}}^{\hat{n}\times n}$ . As a consequence, we further have, with lemma 4 that

$\begin{align*}\hfill {\mathbf{Z}}_{21}& \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[\mathbf{Q}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\mathbf{Q}\right]\mathbf{y}-\frac{N}{n}\frac{1}{\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}\mathbf{y}\simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[\mathbf{Q}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\mathbf{Q}\right]\mathbf{y}\hfill \\ \hfill & \quad -{\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}\mathbb{E}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbf{Q}\left(\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})\frac{{\mathbf{K}}_{\mathrm{cos}}}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\right.\hfill \\ \hfill & \left.\quad +\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})\frac{{\mathbf{K}}_{\mathrm{sin}}}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\right)\mathbf{Q}\mathbf{y}\hfill \\ \hfill & \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[\mathbf{Q}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\mathbf{Q}\right]\mathbf{y}-{\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\hfill \\ \hfill & \quad \times \left(\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}{(\hat{\mathbf{X}},\mathbf{X})}^{\mathsf{\text{T}}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbb{E}\left[\begin{matrix}\hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{cos}}\mathbf{Q}\hfill \\ \hfill \mathbf{Q}{\mathbf{K}}_{\mathrm{sin}}\mathbf{Q}\hfill \end{matrix}\right]\right)\mathbf{y}\hfill \\ \hfill & \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{\Phi }}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\mathbf{y}+{\left(\frac{N}{n}\right)}^{2}\hfill \\ \hfill & \quad \times \frac{1}{\hat{n}}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{N}{n}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\hat{\mathbf{\Phi }}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{N}{n}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right].\hfill \end{align*}$

The last term Z₂₂ can be similarly treated as

$\begin{align*}\hfill {\mathbf{Z}}_{22}& \simeq \frac{1}{{n}^{2}\hat{n}}\mathbb{E}\sum\limits _{i=1}^{N}\sum\limits _{j\ne i}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}\mathbf{y}\hfill \end{align*}$

where by lemma 2 we deduce

$\begin{align*}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\mathbf{Q}{\mathbf{U}}_{i}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))& \simeq \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\left({\mathbf{Q}}_{-i}{\mathbf{U}}_{i}{({\mathbf{I}}_{2}+{\mathbf{U}}_{i}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-i}{\mathbf{U}}_{i})}^{-1}{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})\right)\hfill \\ \hfill & \simeq \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\left({\mathbf{Q}}_{-i}{\mathbf{U}}_{i}\left[\begin{matrix}\hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{1}{1+{\delta }_{\mathrm{sin}}}\hfill \end{matrix}\right]{\hat{\mathbf{U}}}_{i}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})\right)\hfill \\ \hfill & \simeq \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))\hfill \end{align*}$

so that by again lemma 4

$\begin{align*}\hfill {\mathbf{Z}}_{22}& \simeq \frac{N}{n}\frac{1}{n\hat{n}}\mathbb{E}\sum\limits _{j=1}^{N}{\mathbf{y}}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}{\mathbf{U}}_{j}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill 0\hfill \\ \hfill 0\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]{\mathbf{U}}_{j}^{\mathsf{\text{T}}}{\mathbf{Q}}_{-j}\mathbf{y}\hfill \\ \hfill & \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\mathbb{E}\left[\mathbf{Q}\left(\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{cos}})}^{2}}{\mathbf{K}}_{\mathrm{cos}}\right.\right.\hfill \\ \hfill & \left.\left.\quad +\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{sin}})}^{2}}{\mathbf{K}}_{\mathrm{sin}}\right)\mathbf{Q}\right]\mathbf{y}\hfill \\ \hfill & \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\left(\bar{\mathbf{Q}}\mathbf{\Xi }\bar{\mathbf{Q}}+\frac{N}{n}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{\Xi }\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}\mathbf{\Xi }\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbf{\Omega }\left[\begin{matrix}\hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\hfill \\ \hfill \bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\hfill \end{matrix}\right]\right)\mathbf{y}\hfill \\ \hfill & \simeq {\left(\frac{N}{n}\right)}^{2}\frac{1}{\hat{n}}\left[\begin{matrix}\hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X}))}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right].\hfill \end{align*}$

Assembling the estimates for Z₁, Z₂₁ and Z₂₂, we get

$\begin{align*}\hfill \mathbb{E}[{E}_{\mathrm{t}\mathrm{e}\mathrm{s}\mathrm{t}}]& \simeq \frac{1}{\hat{n}}{\Vert}\hat{\mathbf{y}}{{\Vert}}^{2}-\frac{2}{\hat{n}}{\hat{\mathbf{y}}}^{\mathsf{\text{T}}}\frac{N}{n}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\mathbf{y}+\frac{1}{\hat{n}}{\mathbf{y}}^{\mathsf{\text{T}}}\left(\frac{{N}^{2}}{{n}^{2}}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}\right)\mathbf{y}+{\left(\frac{N}{n}\right)}^{2}\frac{1}{n\hat{n}}\hfill \\ \hfill & \quad \times \left[\begin{matrix}\hfill \frac{\frac{n}{N}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\hat{\mathbf{X}})+\frac{N}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}-2\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{cos}}(\hat{\mathbf{X}},\mathbf{X})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill \frac{\frac{n}{N}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\hat{\mathbf{X}})+\frac{N}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}\hat{\mathbf{\Phi }}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}-2\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}{\hat{\mathbf{\Phi }}}^{\mathsf{\text{T}}}{\mathbf{K}}_{\mathrm{sin}}(\hat{\mathbf{X}},\mathbf{X})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]\hfill \\ \hfill & \quad \times \mathbf{\Omega }\left[\begin{matrix}\hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}\mathbf{y}\hfill \\ \hfill {\mathbf{y}}^{\mathsf{\text{T}}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}\mathbf{y}\hfill \end{matrix}\right]\hfill \end{align*}$

which, up to further simplifications, concludes the proof of theorem 3.

Appendix D.: Several useful lemmas

Lemma 5. (Some useful properties of Ω). For any λ > 0 and Ω defined in (12), we have

(a)
all entries of Ω are positive;
(b)
for 2N = n, det(Ω⁻¹), as well as the entries of Ω, scales like λ as λ → 0;

Proof. Developing the inverse we obtain

$\begin{equation*}\mathbf{\Omega }={\left[\begin{matrix}\hfill 1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill -\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \\ \hfill -\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\hfill & \hfill 1-\frac{N}{n}\frac{\frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}})}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\hfill \end{matrix}\right]}^{-1}\end{equation*}$

we have ${[{\mathbf{\Omega }}^{-1}]}_{11}=\frac{1}{1+{\delta }_{\mathrm{cos}}}+\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}+\frac{N}{n}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}} > 0$ , ${[{\mathbf{\Omega }}^{-1}]}_{12}< 0$ , and similarly ${[{\mathbf{\Omega }}^{-1}]}_{21}< 0$ , ${[{\mathbf{\Omega }}^{-1}]}_{22} > 0$ . Furthermore, the determinant writes

$\begin{align*}\hfill \mathrm{det}({\mathbf{\Omega }}^{-1})& =\left(1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}\right)\hfill \\ \hfill & \quad \times \left(1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}+\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\bar{\mathbf{Q}}\right)\hfill \\ \hfill & \quad +\left(1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right.\hfill \\ \hfill & \left.\quad +\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)\bar{\mathbf{Q}}\right)\hfill \\ \hfill & \quad \times \frac{N}{n}\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}\bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\hfill \end{align*}$

where we constantly use the fact that $\bar{\mathbf{Q}}\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)={\mathbf{I}}_{n}-\lambda \bar{\mathbf{Q}}$ . Note that

$\begin{align*}\hfill 1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}& =\frac{1}{1+{\delta }_{\mathrm{cos}}} > 0,\hfill \\ \hfill 1-\frac{1}{n}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}}\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}& =\frac{1}{1+{\delta }_{\mathrm{sin}}} > 0\hfill \\ \hfill \frac{1}{1+{\delta }_{\mathrm{cos}}}+\frac{1}{1+{\delta }_{\mathrm{sin}}}& =\underline{2-\frac{n}{N}}+\frac{\lambda }{N}\enspace \mathrm{t}\mathrm{r}\enspace \bar{\mathbf{Q}} > 0\hfill \end{align*}$

so that (a) det(Ω⁻¹) > 0 and (b) for 2N = n, det(Ω⁻¹) scales like λ as λ → 0. □

Lemma 6. (Derivatives with respect to N). Let assumption 1 holds, for any λ > 0 and

$\begin{equation*}\begin{cases}{\delta }_{\mathrm{cos}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{cos}}{\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)+\lambda {\mathbf{I}}_{n}\right)}^{-1}\quad \hfill \\ {\delta }_{\mathrm{sin}}=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}({\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}})=\frac{1}{n}\enspace \mathrm{t}\mathrm{r}{\mathbf{K}}_{\mathrm{sin}}{\left(\frac{N}{n}\left(\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}\right)+\lambda {\mathbf{I}}_{n}\right)}^{-1}\quad \hfill \end{cases}\end{equation*}$

defined in theorem 1, we have that (δ_cos, δ_sin) and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ are all decreasing functions of N. Note in particular that the same conclusion holds for 2N > n as λ → 0.

Proof. We write

$\begin{equation}\left[\begin{matrix}\hfill \frac{\partial {\delta }_{\mathrm{cos}}}{\partial N}\hfill \\ \hfill \frac{\partial {\delta }_{\mathrm{sin}}}{\partial N}\hfill \end{matrix}\right]=-\frac{1}{n}\mathbf{\Omega }\left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\left(\bar{\mathbf{Q}}\mathbf{\Phi }\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\right)\hfill \\ \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}\left(\bar{\mathbf{Q}}\mathbf{\Phi }\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\right)\hfill \end{matrix}\right]=-\frac{n}{N}\frac{1}{n}\mathbf{\Omega }\left[\begin{matrix}\hfill {\delta }_{\mathrm{cos}}-\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})\hfill \\ \hfill {\delta }_{\mathrm{sin}}-\frac{\lambda }{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}})\hfill \end{matrix}\right]\end{equation} \tag{ D.1 }$

for Ω defined in (12) and $\mathbf{\Phi }=\frac{{\mathbf{K}}_{\mathrm{cos}}}{1+{\delta }_{\mathrm{cos}}}+\frac{{\mathbf{K}}_{\mathrm{sin}}}{1+{\delta }_{\mathrm{sin}}}$ , which, together with lemma 5, allows us to conclude that $\frac{\partial {\delta }_{\mathrm{cos}}}{\partial N},\frac{\partial {\delta }_{\mathrm{sin}}}{\partial N}< 0$ . Further note that

$\begin{equation*}\frac{\partial \bar{\mathbf{Q}}}{\partial N}=-\frac{1}{n}\bar{\mathbf{Q}}\left(\mathbf{\Phi }-\frac{{\mathbf{K}}_{\mathrm{cos}}}{{(1+{\delta }_{\mathrm{cos}})}^{2}}N\frac{\partial {\delta }_{\mathrm{cos}}}{\partial N}-\frac{{\mathbf{K}}_{\mathrm{sin}}}{{(1+{\delta }_{\mathrm{sin}})}^{2}}N\frac{\partial {\delta }_{\mathrm{sin}}}{\partial N}\right)\bar{\mathbf{Q}}\end{equation*}$

which concludes the proof. □

Lemma 7. (Derivative with respect to λ). For any λ > 0, (δ_cos, δ_sin) and ${\Vert}\bar{\mathbf{Q}}{\Vert}$ defined in theorem 1 decrease as λ grows large.

Proof. Taking the derivative of (δ_cos, δ_sin) with respect to λ > 0, we have explicitly

$\begin{equation}\left[\begin{matrix}\hfill \frac{\partial {\delta }_{\mathrm{cos}}}{\partial \lambda }\hfill \\ \hfill \frac{\partial {\delta }_{\mathrm{sin}}}{\partial \lambda }\hfill \end{matrix}\right]=-\mathbf{\Omega }\left[\begin{matrix}\hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{cos}}\bar{\mathbf{Q}})\hfill \\ \hfill \frac{1}{n}\enspace \mathrm{t}\mathrm{r}(\bar{\mathbf{Q}}{\mathbf{K}}_{\mathrm{sin}}\bar{\mathbf{Q}})\hfill \end{matrix}\right]\end{equation} \tag{ D.2 }$

which, together with the fact that all entries of Ω are positive (lemma 5), allows us to conclude that $\frac{\partial {\delta }_{\mathrm{cos}}}{\partial \lambda },\frac{\partial {\delta }_{\mathrm{sin}}}{\partial \lambda }< 0$ . Further considering

$\begin{equation*}\frac{\partial \bar{\mathbf{Q}}}{\partial \lambda }=\bar{\mathbf{Q}}\left(\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{cos}}}{{(1+{\delta }_{\mathrm{cos}})}^{2}}\frac{\partial {\delta }_{\mathrm{cos}}}{\partial \lambda }+\frac{N}{n}\frac{{\mathbf{K}}_{\mathrm{sin}}}{{(1+{\delta }_{\mathrm{sin}})}^{2}}\frac{\partial {\delta }_{\mathrm{sin}}}{\partial \lambda }-{\mathbf{I}}_{n}\right)\bar{\mathbf{Q}}\end{equation*}$

and thus the conclusion for $\bar{\mathbf{Q}}$ . □

A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent^*

Article metrics

Permissions

Author e-mails

Author affiliations

Author notes

Dates

Abstract

1. Introduction

1.1. Our main contributions

1.2. Related work

1.3. Notations and organization of the paper

2. Main technical results

2.1. Asymptotic deterministic equivalent

2.2. Asymptotic training performance

2.3. Asymptotic test performance

3. Empirical evaluations and practical implications

3.1. Correction due to the large n, p, N regime

3.2. Phase transition and corresponding double descent

4. Additional discussion and results

4.1. Two different learning regimes in the ridgeless limit

4.2. Impact of training-test similarity

4.3. Additional real-world data sets

5. Conclusion

Acknowledgments

Appendix A.: Proof of theorem 1

Appendix B.: Proof of theorem 2

Appendix C.: Proof of theorem 3

Appendix D.: Several useful lemmas

Footnotes

A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent*

Article metrics

Permissions

Share this article

Author e-mails

Author affiliations

Author notes

Dates

Abstract

1. Introduction

1.1. Our main contributions

1.2. Related work

1.3. Notations and organization of the paper

2. Main technical results

2.1. Asymptotic deterministic equivalent

2.2. Asymptotic training performance

2.3. Asymptotic test performance

3. Empirical evaluations and practical implications

3.1. Correction due to the large n, p, N regime

3.2. Phase transition and corresponding double descent

4. Additional discussion and results

4.1. Two different learning regimes in the ridgeless limit

4.2. Impact of training-test similarity

4.3. Additional real-world data sets

5. Conclusion

Acknowledgments

Appendix A.: Proof of theorem 1

Appendix B.: Proof of theorem 2

Appendix C.: Proof of theorem 3

Appendix D.: Several useful lemmas

Footnotes

A random matrix analysis of random Fourier features: beyond the Gaussian kernel, a precise phase transition, and the corresponding double descent^*