When do neural networks outperform kernel methods?*

Behrooz Ghorbani; Song Mei; Theodor Misiakiewicz; Andrea Montanari

doi:10.1088/1742-5468/ac3a81

1. Introduction

In supervised learning we are given data ${\left\{({y}_{i},{\boldsymbol{x}}_{i})\right\}}_{i\leqslant n}\hspace{2pt}{\sim }_{iid}\hspace{2pt}\mathbb{P}\in \mathcal{P}(\mathbb{R}\times {\mathbb{R}}^{d})$ , with ${\boldsymbol{x}}_{i}\in {\mathbb{R}}^{d}$ a covariate vector and ${y}_{i}\in \mathbb{R}$ the corresponding label, and would like to learn a function $f:{\mathbb{R}}^{d}\to \mathbb{R}$ to predict future labels. In many applications, state-of-the-art systems use multi-layer neural networks (NN). The simplest such model is provided by two-layers fully-connected networks:

$\begin{equation}{\mathcal{F}}_{\mathrm{NN}}^{N}{:=}\left\{{\hat{f}}_{\mathrm{NN}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{b},\boldsymbol{W}\hspace{2pt})=\sum\limits _{i=1}^{N}{b}_{i}\sigma (\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle ):{b}_{i}\in \mathbb{R},{\boldsymbol{w}}_{i}\in {\mathbb{R}}^{d},\quad \forall \enspace i\in [N]\right\}.\end{equation} \tag{ 1 }$

${\mathcal{F}}_{\mathrm{NN}}^{N}$ is a non-linearly parametrized class of functions: while nonlinearity poses a challenge to theoreticians, it is often claimed to be crucial in order to learn rich representation of the data. Recent efforts to understand NN have put the spotlight on two linearizations of ${\mathcal{F}}_{\mathrm{NN}}^{N}$ , the random features [1] and the neural tangent [2] classes

$\begin{equation}{\mathcal{F}}_{\mathrm{RF}}^{N}(\boldsymbol{W}\hspace{2pt}){:=}\left\{{\hat{f}}_{\mathrm{RF}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{a}\hspace{-1pt};\boldsymbol{W}\hspace{2pt})=\sum\limits _{i=1}^{N}{a}_{i}\sigma (\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle ):{a}_{i}\in \mathbb{R},\quad \forall \enspace i\in [N]\right\},\end{equation} \tag{ 2 }$

$\begin{equation}{\mathcal{F}}_{\mathrm{NT}}^{N}(\boldsymbol{W}\hspace{2pt}){:=}\left\{{\hat{f}}_{\mathrm{NT}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{S},\boldsymbol{W}\hspace{2pt})=\sum\limits _{i=1}^{N}\langle {\boldsymbol{s}}_{i},\boldsymbol{x}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle ):{\boldsymbol{s}}_{i}\in {\mathbb{R}}^{d},\quad \forall \enspace i\in [N]\right\}.\end{equation} \tag{ 3 }$

${\mathcal{F}}_{\mathrm{RF}}^{N}(\boldsymbol{W}\hspace{2pt})$ and ${\mathcal{F}}_{\mathrm{NT}}^{N}(\boldsymbol{W}\hspace{2pt})$ are linear classes of functions, depending on the realization of the input-layer weights $\boldsymbol{W}={({\boldsymbol{w}}_{i})}_{i\leqslant N}$ (which are chosen randomly). The relation between NN and these two linear classes is given by the first-order Taylor expansion: ${\hat{f}}_{\mathrm{NN}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{b}+\varepsilon \boldsymbol{a},\boldsymbol{W}+\varepsilon \boldsymbol{S})-{\hat{f}}_{\mathrm{NN}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{b},\boldsymbol{W}\hspace{2pt})=\varepsilon {\hat{f}}_{\mathrm{RF}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{a}\hspace{-1pt};\boldsymbol{W}\hspace{2pt})+\varepsilon {\hat{f}}_{\mathrm{NT}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{S}(\boldsymbol{b});\boldsymbol{W}\hspace{2pt})+O({\varepsilon }^{2})$ , where $\boldsymbol{S}(\boldsymbol{b})={({b}_{i}{\boldsymbol{s}}_{i})}_{i\leqslant N}$ . A number of recent papers show that, if weights and stochastic gradient descent (SGD) updates are suitably scaled, and the network is sufficiently wide (N sufficiently large), then SGD converges to a function ${\hat{f}}_{\mathrm{NN}}$ that is approximately in ${\mathcal{F}}_{\mathrm{RF}}^{N}(\boldsymbol{W}\hspace{2pt})+{\mathcal{F}}_{\mathrm{NT}}^{N}(\boldsymbol{W}\hspace{2pt})$ , with W determined by the SGD initialization [2–7]. This was termed the 'lazy regime' in [8].

Does this linear theory convincingly explain the successes of NN? Can the performances of NN be achieved by the simpler NT or RF models? Is there any fundamental difference between the two classes RF and NT? If the weights ${({\boldsymbol{w}}_{i})}_{i\leqslant N}$ are i.i.d. draws from a distribution ν on ${\mathbb{R}}^{d}$ , the spaces ${\mathcal{F}}_{\mathrm{RF}}^{N}(\boldsymbol{W}\hspace{2pt})$ , ${\mathcal{F}}_{\mathrm{NT}}^{N}(\boldsymbol{W}\hspace{2pt})$ can be thought as finite-dimensional approximations of a certain reproducing kernel Hilbert space (RKHS):

$\begin{equation}\mathcal{H}(h){:=}\text{cl}\left(\left\{f(\boldsymbol{x})=\sum\limits _{i=1}^{N}{c}_{i}h(\boldsymbol{x},{\boldsymbol{x}}_{i}):{c}_{i}\in \mathbb{R},{\boldsymbol{x}}_{i}\in {\mathbb{R}}^{d},N\in \mathbb{N}\right\}\right),\end{equation} \tag{ 4 }$

where cl(⋅) denotes closure. From this point of view, RF and NT differ in that they correspond to slightly different choices of the kernel: h_RF( x ₁, x ₂) := ∫σ(⟨ w , x ₁⟩)σ(⟨ w , x ₂⟩)ν(d w ) versus ${h}_{\mathrm{NT}}({\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}){:=}\langle {\boldsymbol{x}}_{1},{\boldsymbol{x}}_{2}\rangle \int {\sigma }^{\prime }({\boldsymbol{w}}^{\mathsf{T}}{\boldsymbol{x}}_{1}){\sigma }^{\prime }({\boldsymbol{w}}^{\mathsf{T}}{\boldsymbol{x}}_{2})\nu (\mathrm{d}\boldsymbol{w})$ . Multi-layer fully-connected NNs in the lazy regime can be viewed as randomized approximations to RKHS as well, with some changes in the kernel h. This motivates analogous questions for $\mathcal{H}(h)$ : can the performances of NN be achieved by RKHS methods?

Recent work addressed the separation between NN and RKHS from several points of view, without providing a unified answer. Some empirical studies on various datasets showed that networks can be replaced by suitable kernels with limited drop in performances [9–16]. At least two studies reported a larger gap for convolutional networks and the corresponding kernels [17, 18]. On the other hand, theoretical analysis provided a number of separation examples, i.e. target functions f_* that can be represented and possibly efficiently learnt using NN, but not in the corresponding RKHS [19–24]. For instance, if the target is a single neuron f_*( x ) = σ(⟨ w _*, x ⟩), then training a neural network with one hidden neuron learns the target efficiently from approximately d log d samples [25], while the corresponding RKHS has test error bounded away from zero for every sample size polynomial in d [19, 21]. Further even in the infinite width limit, it is known that two-layers NN can actually capture a richer class of functions than the associated RKHS, provided SGD training is scaled differently from the lazy regime [26–30].

Can we reconcile empirical and theoretical results?

1.1. Overview

In this paper we introduce a stylized scenario—which we will refer to as the spiked covariates model—that can explain the above seemingly divergent observations in a unified framework. The spiked covariates model is based on two building blocks: (1) target functions depending on low-dimensional projections; (2) approximately low-dimensional covariates.

(a)
Target functions depending on low-dimensional projections. We investigate the hypothesis that NNs are more efficient at learning target functions that depend on low-dimensional projections of the data (the signal covariates). Formally, we consider target functions ${f}_{\ast }:{\mathbb{R}}^{d}\to \mathbb{R}$ of the form ${f}_{\ast }(\boldsymbol{x})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x})$ , where $\boldsymbol{U}\in {\mathbb{R}}^{d\times {d}_{0}}$ is a semi-orthogonal matrix, d₀ ≪ d, and $\varphi :{\mathbb{R}}^{{d}_{0}}\to \mathbb{R}$ is a suitably smooth function. This model captures an important property of certain applications. For instance, the labels in an image classification problem do not depend equally on the whole Fourier spectrum of the image, but predominantly on the low-frequency components.

As for the example of a single neuron f_*( x ) = σ(⟨ w _*, x ⟩), we expect RKHS to suffer from a curse of dimensionality in learning functions of low-dimensional projections. Indeed, this is well understood in low dimension or for isotropic covariates [20, 21].

(b)
Approximately low-dimensional covariates. RKHS behave well on certain image classification tasks [10, 12, 17], and this seems to contradict the previous point. However, the example of image classification naturally brings up another important property of real data that helps to clarify this puzzle. Not only we expect the target function f_*( x ) to depend predominantly on the low-frequency components of image x , but the image x itself to have most of its spectrum concentrated on low-frequency components (linear denoising algorithms exploit this very observation).

More specifically, we consider the case in which x = Uz ₁ + U ^⊥ z ₂, where $\boldsymbol{U}\in {\mathbb{R}}^{d\times {d}_{0}}$ , ${\boldsymbol{U}}^{\perp }\in {\mathbb{R}}^{d\times (d-{d}_{0})}$ , and $[\boldsymbol{U}\hspace{1pt}\vert {\boldsymbol{U}}^{\perp }]\in {\mathbb{R}}^{d\times d}$ is an orthogonal matrix. Moreover, we assume ${\boldsymbol{z}}_{1}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1}({r}_{1}\sqrt{{d}_{0}}))$ , ${\boldsymbol{z}}_{\hspace{1pt}2}\sim \mathrm{Unif}({\mathbb{S}}^{d-{d}_{0}-1}({r}_{2}\sqrt{d-{d}_{0}}))$ , and ${r}_{1}^{2}\geqslant {r}_{2}^{2}$ . We find that, if r₁/r₂ (which we will denote later as the covariates signal-to-noise ratio) is sufficiently large, then the curse of dimensionality becomes milder for RKHS methods. We characterize precisely how the performance of these methods depend on the covariate signal-to-noise ratio r₁/r₂, the signal dimension d₀, and the ambient dimension d.

Notice that the spiked covariate model is highly stylized. For instance, while we expect real images to have a latent low-dimensional structure, this is best modeled in a nonlinear fashion (e.g. sparsity in wavelet domain [31]). Nevertheless the spiked covariate model captures the two basic mechanisms, and provides useful qualitative predictions. As an illustration, consider adding noise to the high-frequency components of images in a classification task. This will make the distribution of x more isotropic, and—according to our theory—deteriorate the performances of RKHS methods. On the other hand, NN should be less sensitive to this perturbation. (Notice that noise is added both to train and test samples.) In figure 1 we carry out such an experiment using FMNIST data (d = 784, n = 60 000, 10 classes). We compare two-layers NN with the RF and NT models. We choose the architectures of NN, NT, RF as to match the number of parameters: namely we used N = 4096 for NN and NT and N = 321 126 for RF. We also fit the corresponding RKHS models (corresponding to $N=\infty \left. \right)$ using kernel ridge regression (KRR), and two simple polynomial models: ${f}_{\ell }(\boldsymbol{x})={\sum }_{k=0}^{\ell }\langle {\boldsymbol{B}}_{k},{\boldsymbol{x}}^{\otimes k}\rangle$ , for ℓ ∈ {1, 2}. In the unperturbed dataset, all of these approaches have comparable accuracies (except the linear fit). As noise is added, RF, NT, and RKHS methods deteriorate rapidly. While the accuracy of NN decreases as well, it significantly outperforms other methods.

**Figure 1.** Test accuracy on Fashion MNIST (FMNIST) images perturbed by adding noise to the high-frequency Fourier components of the images (see examples on the right). (Left) Comparison of the accuracy of various methods as a function of the added noise. (Center) Eigenvalues of the empirical covariance of the images. As the noise increases, the images distribution becomes more isotropic.
Download figure:
Standard image High-resolution image

1.2. Notations and outline

Throughout the paper, we use bold lowercase letters { x , y , z , ...} to denote vectors and bold uppercase letters { A , B , C , ...} to denote matrices. We denote by ${\mathbb{S}}^{d-1}(r)=\left\{\boldsymbol{x}\in {\mathbb{R}}^{d}:{\Vert}\boldsymbol{x}{{\Vert}}_{2}=r\right\}$ the set of d-dimensional vectors with radius r and $\mathrm{Unif}({\mathbb{S}}^{d-1}(r))$ be the uniform probability distribution on ${\mathbb{S}}^{d-1}(r)$ . Further, we let N(μ, τ²) be the Gaussian distribution with mean μ and variance τ².

Let O_d(⋅) (respectively o_d(⋅), Ω_d(⋅), ω_d(⋅)) denote the standard big-O (respectively little-o, big-omega, little-omega) notation, where the subscript d emphasizes the asymptotic variable. We denote by ${o}_{d,\mathbb{P}}(\cdot )$ the little-o in probability notation: ${h}_{1}(d)={o}_{d,\mathbb{P}}({h}_{2}(d))$ , if h₁(d)/h₂(d) converges to 0 in probability.

In section 2, we introduce the spiked covariates model and characterize the performance of KRR, RF, NT, and NN models. Section 3 presents numerical experiments with real and synthetic data. Section 4 discusses our results in the context of earlier work.

2. Rigorous results for kernel methods and NT, RF NN expansions

2.1. The spiked covariates model

Let d₀ = ⌊d^η⌋ for some η ∈ (0, 1). Let $\boldsymbol{U}\in {\mathbb{R}}^{d\times {d}_{0}}$ and ${\boldsymbol{U}}^{\perp }\in {\mathbb{R}}^{d\times (d-{d}_{0})}$ be such that [ U | U ^⊥] is an orthogonal matrix. We denote the subspace spanned by the columns of U by $\mathcal{V}\subseteq {\mathbb{R}}^{d}$ which we will refer to as the signal subspace, and the subspace spanned by the columns of U ^⊥ by ${\mathcal{V}}^{\perp }\subseteq {\mathbb{R}}^{d}$ which we will refer to as the noise subspace. In the case η ∈ (0, 1), the signal dimension ${d}_{0}=\mathrm{dim}(\mathcal{V})$ is much smaller than the ambient dimension d. Our model for the covariate vector x _i is

$\begin{equation*}{\boldsymbol{x}}_{i}=\boldsymbol{U}{\boldsymbol{z}}_{0,i}+{\boldsymbol{U}}^{\perp }{\boldsymbol{z}}_{1,i},\qquad ({\boldsymbol{z}}_{0,i},{\boldsymbol{z}}_{1,i})\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1}(r\sqrt{{d}_{0}}))\otimes \mathrm{Unif}({\mathbb{S}}^{d-{d}_{0}-1}(\sqrt{d-{d}_{0}})).\end{equation*}$

We call z _0,i the signal covariates, z _1,i the noise covariates, and r the covariates signal-to-noise ratio (or covariates SNR). We will take r > 1, so that the variance of the signal covariates z _0,i is larger than that of the noise covariates z _1,i. In high dimension, this model is—for many purposes—similar to an anisotropic Gaussian model ${\boldsymbol{x}}_{i}\sim \mathsf{\text{N}}(0,({r}^{2}-1)\boldsymbol{U}{\boldsymbol{U}}^{\mathsf{T}}+\mathbf{I})$ . As shown below, the effect of anisotropy on RKHS methods is significant only if the covariate SNR r is polynomially large in d. We shall therefore set r = d^κ/2 for a constant κ > 0.

We are given i.i.d. pairs ${({y}_{i},{\boldsymbol{x}}_{i})}_{1\leqslant i\leqslant n}$ , where y_i = f_*( x _i) + ɛ_i, and ɛ_i ∼ N(0, τ²) is independent of x _i. The function f_* only depends on the projection of x _i onto the signal subspace $\mathcal{V}$ (i.e. on the signal covariates z _0,i): ${f}_{\ast }({\boldsymbol{x}}_{i})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}{\boldsymbol{x}}_{i})$ , with $\varphi \in {L}^{2}({\mathbb{S}}^{{d}_{0}-1}(r\sqrt{{d}_{0}}))$ .

For the RF and NT models, we will assume that input layer weights to be i.i.d. ${\boldsymbol{w}}_{i}\sim \mathrm{Unif}({\mathbb{S}}^{d-1}(1))$ . For our purposes, this is essentially the same as w_ij ∼ N(0, 1/d) independently, but slightly more convenient technically.

We will consider a more general model in appendix C, in which the distribution of x _i takes a more general product-of-uniforms form, and we assume a general f_* ∈ L².

2.2. A sharp characterization of RKHS methods

Given $h:[-1,1]\to \mathbb{R}$ , consider the rotationally invariant kernel K_d( x ₁, x ₂) = h(⟨ x ₁, x ₂⟩/d). This class includes the kernels that are obtained by taking the wide limit of the RF and NT models (here expectation is with respect to (G₁, G₂) ∼ N(0, I₂))

$\begin{equation*}{h}_{\mathrm{RF}}(t){:=}\mathbb{E}\left\{\sigma ({G}_{1})\sigma (t{G}_{1}+\sqrt{1-{t}^{2}}{G}_{2})\right\},\qquad {h}_{\mathrm{NT}}(t){:=}t\mathbb{E}\left\{{\sigma }^{\prime }({G}_{1}){\sigma }^{\prime }(t{G}_{1}+\sqrt{1-{t}^{2}}{G}_{2})\right\}.\end{equation*}$

(These formulae correspond to w _i ∼ N(0, I_d), but similar formulae hold for ${\boldsymbol{w}}_{i}\sim \mathrm{Unif}({\mathbb{S}}^{d-1}(\sqrt{d}))$ .) This correspondence holds beyond two-layers networks: under i.i.d. Gaussian initialization, the NT kernel for an arbitrary number of fully-connected layers is rotationally invariant (see the proof of proposition 2 of [2]), and hence is covered by the present analysis.

Any RKHS method with kernel h outputs a model of the form $\hat{f}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{a})={\sum }_{i\leqslant n}{a}_{i}h(\langle \boldsymbol{x},{\boldsymbol{x}}_{i}\rangle /d)$ , with RKHS norm given by ${\Vert}\hat{f}(\cdot \hspace{-1pt};\boldsymbol{a}){{\Vert}}_{h}^{2}={\sum }_{i,j\leqslant n}h(\langle {\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j}\rangle /d){a}_{i}{a}_{j}$ . We consider KRR on the dataset ${\left\{({y}_{i},{\boldsymbol{x}}_{i})\right\}}_{i\leqslant n}$ with regularization parameter λ, namely:

$\begin{equation*}\hat{\boldsymbol{a}}(\lambda ){:=}\mathrm{arg}\underset{\boldsymbol{a}\in {\mathbb{R}}^{N}}{\mathrm{min}}\left\{\sum\limits _{i=1}^{n}{\left({y}_{i}-\hat{f}({\boldsymbol{x}}_{i}\hspace{-2pt};\boldsymbol{a})\right)}^{2}+\lambda {\Vert}\hat{f}(\cdot \enspace ;\boldsymbol{a}){{\Vert}}_{h}^{2}\right\}={(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{y},\end{equation*}$

where $\boldsymbol{H}={({H}_{ij})}_{ij\in [n]}$ , with H_ij = h(⟨ x _i, x _j⟩/d). We denote the prediction error of KRR by

$\begin{equation*}{R}_{\mathrm{KRR}}({f}_{\ast },\lambda )={\mathbb{E}}_{\boldsymbol{x}}\left[{\left({f}_{\ast }(\boldsymbol{x})-{\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{h}(\boldsymbol{x})\right)}^{2}\right],\end{equation*}$

where $\boldsymbol{h}(\boldsymbol{x})={(h(\langle \boldsymbol{x},{\boldsymbol{x}}_{1}\rangle /d),\dots ,h(\langle \boldsymbol{x},{\boldsymbol{x}}_{n}\rangle /d))}^{\mathsf{T}}$ .

Recall that we assume the target function ${f}_{\ast }({\boldsymbol{x}}_{i})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}{\boldsymbol{x}}_{i})$ . We denote ${\mathsf{P}}_{\leqslant k}:{L}^{2}\to {L}^{2}$ to be the projection operator onto the space of degree k orthogonal polynomials, and ${\mathsf{P}}_{ > k}=\mathbf{I}-{\mathsf{P}}_{\leqslant k}$ . Our next theorem shows that the impact of the low-dimensional latent structure on the generalization error of KRR is characterized by a certain 'effective dimension', d_eff.

Theorem 1. Let h ∈ C^∞([−1, 1]). Let $\ell \in {\mathbb{Z}}_{\geqslant 0}$ be a fixed integer. We assume that h^(k)(0) > 0 for all k ⩽ ℓ, and assume that there exists a k > ℓ such that h^(k)(0) > 0. (Recall that h is positive semidefinite whence h^(k)(0) ⩾ 0 for all k.)

Define the effective dimension d_eff = max{d₀, d/r²} = d^{max(1−κ,η)}. If ${\omega }_{d}({d}_{\mathrm{eff}}^{\ell }\enspace \mathrm{log}$ $({d}_{\mathrm{eff}}))\leqslant n\leqslant {d}_{\mathrm{eff}}^{\ell +1-\delta }$ for some δ > 0, then for any regularization parameter λ = O_d(1), the prediction error of KRR with kernel h is

$\begin{equation}\left\vert {R}_{\mathrm{KRR}}({f}_{\ast };\lambda )-{\Vert}{\mathsf{P}}_{ > \ell }{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant {o}_{d,\mathbb{P}}(1)\cdot ({\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}+{\tau }^{2}).\end{equation} \tag{ 5 }$

Remarkably, the effective dimension d_eff = d^{max(1−κ,η)} depends both on the signal dimension $\mathrm{dim}(\mathcal{V})={d}^{\eta }$ and on the covariate SNR r = d^κ/2. Sample size $n={d}_{\mathrm{eff}}^{\ell }$ is necessary to learn a degree ℓ polynomial. If we fix η ∈ (0, 1) and take κ = 0 +, we get d_eff ≈ d: this corresponds to almost isotropic x _i. We thus recover theorem 4 in [21]. If instead κ > 1 − η, then most variance of x _i falls in the signal subspace $\mathcal{V}$ , and we get ${d}_{\mathrm{eff}}={d}^{\eta }=\mathrm{dim}(\mathcal{V})$ : the test error is effectively the same as if we had oracle knowledge of the signal subspace $\mathcal{V}$ and performed KRR on signal covariates ${\boldsymbol{z}}_{0,i}={\boldsymbol{U}}^{\mathsf{T}}{\boldsymbol{x}}_{i}$ . Theorem 1 describes the transition between these two regimes.

2.3. RF and NT models

How do the results of the previous section generalize to finite-width approximations of the RKHS? In particular, how do the RF and NT models behave at finite N? In order to simplify the picture, we focus here on the approximation error. Equivalently, we assume the sample size to be n = ∞ and consider the minimum population risk for M ∈ {RF, NT}

$\begin{equation}{R}_{\mathsf{\text{M}},N}({f}_{\ast };\boldsymbol{W}\hspace{2pt}){:=}\underset{\hat{f}\in {\mathcal{F}}_{\mathsf{\text{M}}}^{N}(\boldsymbol{W}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}\left\{{\left[{f}_{\ast }(\boldsymbol{x})-\hat{f}(\boldsymbol{x})\right]}^{2}\right\}.\end{equation} \tag{ 6 }$

The next two theorems characterize the asymptotics of the approximation error for RF and NT models. We give generalizations of these statements to other settings and under weaker assumptions in appendix C.

Theorem 2 (Approximation error for RF). Assume $\sigma \in {C}^{\infty }(\mathbb{R})$ , with kth derivative ${\sigma }^{(k)}{(x)}^{2}\leqslant {c}_{0,k}\enspace {\mathrm{e}}^{{c}_{1,k}{x}^{2}/2}$ for some c_0,k > 0, c_1,k < 1, and all $x\in \mathbb{R}$ and all k. Define its kth Hermite coefficient ${\mu }_{k}(\sigma ){:=}{\mathbb{E}}_{G\sim \mathsf{\text{N}}(0,1)}[\sigma (G){\mathrm{He}}_{k}(G)]$ . Let $\ell \in {\mathbb{Z}}_{\geqslant 0}$ be a fixed integer, and assume μ_k(σ) ≠ 0 for all k ⩽ ℓ. Define d_eff = d^{max(1−κ,η)}. If ${d}_{\mathrm{eff}}^{\ell +\delta }\leqslant N\leqslant {d}_{\mathrm{eff}}^{\ell +1-\delta }$ for some δ > 0 independent of N, d, then

$\begin{equation}\left\vert {R}_{\mathrm{RF},N}({f}_{\ast };\boldsymbol{W}\hspace{2pt})-{\Vert}{\mathsf{P}}_{ > \ell }{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant {o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{ > \ell }{f}_{\ast }{{\Vert}}_{{L}^{2}}{\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}.\end{equation} \tag{ 7 }$

Theorem 3 (Approximation error for NT). Assume $\sigma \in {C}^{\infty }(\mathbb{R})$ , with kth derivative ${\sigma }^{(k)}{(x)}^{2}\leqslant {c}_{0,k}\enspace {\mathrm{e}}^{{c}_{1,k}{x}^{2}/2}$ , for some c_0,k > 0, c_1,k < 1, and all $x\in \mathbb{R}$ and all k. Let $\ell \in {\mathbb{Z}}_{\geqslant 0}$ , and assume μ_k(σ) ≠ 0 for all k ⩽ ℓ + 1. Further assume that, for all $L\in {\mathbb{Z}}_{\geqslant 0}$ , there exist k₁, k₂ with L < k₁ < k₂, such that ${\mu }_{{k}_{1}}({\sigma }^{\prime })\ne 0$ , ${\mu }_{{k}_{2}}({\sigma }^{\prime })\ne 0$ , and ${\mu }_{{k}_{1}}({x}^{2}{\sigma }^{\prime })/{\mu }_{{k}_{1}}({\sigma }^{\prime })\ne {\mu }_{{k}_{2}}({x}^{2}{\sigma }^{\prime })/{\mu }_{{k}_{2}}({\sigma }^{\prime })$ . Define d_eff = d^{max(1−κ,η)}. If ${d}_{\mathrm{eff}}^{\ell +\delta }\leqslant N\leqslant {d}_{\mathrm{eff}}^{\ell +1-\delta }$ for some δ > 0 independent of N, d, then

$\begin{equation}\left\vert {R}_{\mathrm{NT},N}({f}_{\ast };\boldsymbol{W}\hspace{2pt})-{\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant {o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}{\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}.\end{equation} \tag{ 8 }$

Here, the definitions of effective dimension d_eff is the same as in theorem 1. While for the test error of KRR as in theorem 1, the effective dimension controls the sample complexity n in learning a degree ℓ polynomial, in the present case it controls the number of neurons N that is necessary to approximate a degree ℓ polynomial. In the case of RF, the latter happens as soon as $N\gg {d}_{\mathrm{eff}}^{\ell }$ , while for NT it happens as soon as $N\gg {d}_{\mathrm{eff}}^{\ell -1}$ . If we take η ∈ (0, 1) and κ = 0+, the above theorems, again, recover theorems 1 and 2 of [21].

Notice that NT has higher approximation power than RF in terms of the number of neurons. This is expected, since NT models contain Nd instead of N parameters. On the other hand, NT has less power in terms of number of parameters: to fit a degree ℓ + 1 polynomial, the parameter complexity for NT is $Nd={d}_{\mathrm{eff}}^{\ell }d$ while the parameter complexity for RF is $N={d}_{\mathrm{eff}}^{\ell +1}\ll {d}_{\mathrm{eff}}^{\ell }d$ . While the NT model has p = Nd parameters, only ${p}_{\mathrm{eff}}^{\mathrm{NT}}=N{d}_{\mathrm{eff}}$ of them appear to matter. We will refer to ${p}_{\mathrm{eff}}^{\mathrm{NT}}\equiv N{d}_{\mathrm{eff}}$ as the effective number of parameters of NT models.

Finally, it is natural to ask what are the behaviors of RF and NT models at finite sample size. Denote by R_M,N,n(f_*; W ) the corresponding test error (assuming for instance ridge regression, with the optimal regularization λ). Of course the minimum population risk provides a lower bound: R_M,N,n(f_*; W ) ⩾ R_M,N(f_*; W ). Moreover, we conjecture that the risk is minimized at infinite N, R_M,N,n(f_*; W ) ≳ R_n(f_*; h_M). Altogether this implies the lower bound R_M,N,n(f_*; W ) ≳ max(R_M,N(f_*; W ), R_n(f_*; h_M)). We also conjecture that this lower bound is tight, up to terms vanishing as N, n, d → ∞.

Namely (focusing on NT models), if Nd_eff ≲ n, and ${d}_{\mathrm{eff}}^{{\ell }_{1}}\lesssim N{d}_{\mathrm{eff}}\lesssim {d}_{\mathrm{eff}}^{{\ell }_{1}+1}$ then the approximation error dominates and ${R}_{\mathsf{\text{M}},N,n}({f}_{\ast };\boldsymbol{W}\hspace{2pt})={\Vert}{\mathsf{P}}_{ > {\ell }_{1}}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}+{o}_{d,\mathbb{P}}(1){\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}$ . If on the other hand Nd_eff ≳ n, and ${d}_{\mathrm{eff}}^{{\ell }_{2}}\lesssim n\lesssim {d}_{\mathrm{eff}}^{{\ell }_{2}+1}$ then the generalization error dominates and ${R}_{\mathsf{\text{M}},N,n}({f}_{\ast };\boldsymbol{W}\hspace{2pt})={\Vert}{\mathsf{P}}_{ > {\ell }_{2}}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}+{o}_{d,\mathbb{P}}(1){\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}$ .

2.4. Neural network models

Consider the approximation error for NNs

$\begin{equation}{R}_{\mathrm{NN},N}({f}_{\ast }){:=}\underset{\hat{f}\in {\mathcal{F}}_{\mathrm{NN}}^{N}}{\mathrm{inf}}\enspace \mathbb{E}\left\{{\left[{f}_{\ast }(\boldsymbol{x})-\hat{f}(\boldsymbol{x})\right]}^{2}\right\}.\end{equation} \tag{ 9 }$

Since ${\varepsilon }^{-1}[\sigma (\langle {\boldsymbol{w}}_{i}+\varepsilon {\boldsymbol{a}}_{i},\boldsymbol{x}\rangle )-\sigma (\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle )]\stackrel{\varepsilon \to 0}{\to }\langle {\boldsymbol{a}}_{i},\boldsymbol{x}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle )$ , we have ${\cup }_{\boldsymbol{W}}{\mathcal{F}}_{\mathrm{NT}}^{N/2}(\boldsymbol{W}\hspace{2pt})\subseteq \mathrm{c}\mathrm{l}({\mathcal{F}}_{\mathrm{NN}}^{N})$ , and R_NN,N(f_*) ⩽ inf_W R_NT,N/2(f_*, W ). By choosing $\bar{\boldsymbol{W}}={({\bar{\boldsymbol{w}}}_{i})}_{i\leqslant N}$ , with ${\bar{\boldsymbol{w}}}_{i}=\boldsymbol{U}{\bar{\boldsymbol{v}}}_{i}$ (see section 2.1 for definition of U ), we obtain that ${\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{W}}\hspace{2pt})$ contains all functions of the form $\bar{f}({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x})$ , where $\bar{f}$ is in the class of functions ${\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})$ on ${\mathbb{R}}^{{d}_{0}}$ . Hence if ${f}_{\ast }(\boldsymbol{x})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x})$ , R_NN,N(f_*) is at most the error of approximating φ( z ) on the small sphere $\boldsymbol{z}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1})$ within the class ${\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})$ . As a consequence, by theorem 3, if ${d}_{0}^{\ell +\delta }\leqslant N\leqslant {d}_{0}^{\ell +1-\delta }$ for some δ > 0, then ${R}_{\mathrm{NN},N}({f}_{\ast })\leqslant {R}_{\mathrm{NT},N/2}({f}_{\ast },\bar{\boldsymbol{W}})\leqslant (1+{o}_{d,\mathbb{P}}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}$ .

Theorem 4 (Approximation error for NN). Assume that $\sigma \in {C}^{\infty }(\mathbb{R})$ satisfies the same assumptions as in theorem 3. Further assume that ${\mathrm{sup}}_{x\in \mathbb{R}}\vert {\sigma }^{{\prime\prime}}(x)\vert < \infty$ . If ${d}_{0}^{\ell +\delta }\leqslant N\leqslant {d}_{0}^{\ell +1-\delta }$ for some δ > 0 independent of N, d, then the approximation error of NN models (3) is

$\begin{equation}{R}_{\mathrm{NN},N}({f}_{\ast })\leqslant (1+{o}_{d}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 10 }$

Moreover, the quantity R_NN,N(f_*) is independent of κ ⩾ 0.

As a consequence of theorems 3 and 4, there is a separation between NN and (uniformly sampled) NT models when d_eff ≠ d₀, i.e. κ < 1 − η. As κ increases, the gap between NN and NT becomes smaller and smaller until κ = 1 − η.

3. Further numerical experiments

We carried out extensive numerical experiments on synthetic data to check our predictions for RF, NT, RKHS methods at finite sample size n, dimension d, and width N. We simulated two-layers fully-connected NN in the same context in order to compare their behavior to the behavior of the previous models. Finally, we carried out numerical experiments on FMNIST and CIFAR-10 data to test whether our qualitative predictions apply to image datasets. Throughout we use ReLU activations.

In figure 2 we investigate the approximation error of RF, NT, and NN models. We generate data ${({y}_{i},{\boldsymbol{x}}_{i})}_{i\geqslant 1}$ according to the model of section 2.1, in d = 1024 dimensions, with a latent space dimension d₀ = 16, hence η = 2/5. The per-coordinate variance in the latent space is r² = d^κ, with κ ∈ {0.0, ..., 0.9}. Labels are obtained by ${y}_{i}={f}_{\ast }({\boldsymbol{x}}_{i})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}{\boldsymbol{x}}_{i})$ where $\varphi :{\mathbb{R}}^{{d}_{0}}\to \mathbb{R}$ is a degree-4 polynomial, without a linear component. Since we are interested in the minimum population risk, we use a large sample size n = 2²⁰: we expect the approximation error to dominate in this regime. (See appendix A for further details.)

**Figure 2.** Finite-width two-layers NN and their linearizations RF and NT. Models are trained on 2²⁰ training observations drawn i.i.d. from the distribution of section 2.1. Continuous lines: NT; dashed lines: RF; dot-dashed: NN. Various curves (colors) refer to values of the exponent κ (larger κ corresponds to stronger low-dimensional component). (Right) Curves for RF and NT as a function of the rescaled quantity $\mathrm{log}({p}_{\mathrm{eff}}^{\mathsf{\text{M}}})/\mathrm{log}({d}_{\mathrm{eff}})$ .
Download figure:
Standard image High-resolution image

We plot the normalized risk R_RF,N(f_*, W )/R₀, R_NT,N(f_*, W )/R₀, R_NN,N(f_*)/R₀, ${R}_{0}{:=}{\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}$ , for various widths N. These are compared with the error of the best polynomial approximation of degrees ℓ = 1 to 3 (which correspond to ${\Vert}{\mathsf{P}}_{ > \ell }{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}/{\Vert}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}$ ). As expected, as the number of parameters increases, the approximation error of each function class decreases. NN provides much better approximations than any of the linear classes, and RF is superior to NT given the same number of parameters. This is captured by theorems 2 and 3: to fit a degree ℓ + 1 polynomial, the parameter complexity for NT is $Nd={d}_{\mathrm{eff}}^{\ell }d$ while for RF it is $N={d}_{\mathrm{eff}}^{\ell +1}\ll {d}_{\mathrm{eff}}^{\ell }d$ . We denote the effective number of parameters for NT by ${p}_{\mathrm{eff}}^{\mathrm{NT}}=N{d}_{\mathrm{eff}}$ and the effective number of parameter for RF by ${p}_{\mathrm{eff}}^{\mathrm{RF}}=N$ . The right plot reports the same data, but we rescale the x-axis to be $\mathrm{log}({p}_{\mathrm{eff}}^{\mathsf{\text{M}}})/\mathrm{log}({d}_{\mathrm{eff}})$ . As predicted by the asymptotic theory of theorems 2 and 3, various curves for NT and RF tend to collapse on this scale. Finally, the approximation error of RF and NT depends strongly on κ: larger κ leads to smaller effective dimension and hence smaller approximation error. In contrast, the error of NN, besides being smaller in absolute terms, is much less sensitive to κ.

In figure 3 we compare the test error of NN (with N = 4096) and KRR for the NT kernel (corresponding to the N → ∞ limit in the lazy regime), for the same data distribution as in the previous figure. We observe that the test error of KRR is substantially larger than the one of NN, and deteriorates rapidly as κ gets smaller (the effective dimension gets larger). In the right frame we plot the test error as a function of log(n)/log(d_eff): we observe that the curves obtained for different κ approximately collapse, confirming that d_eff is indeed the right dimension parameter controlling the sample complexity. Notice that also the error of NN deteriorates as κ gets smaller, although not so rapidly: this behavior deserves further investigation. Notice also that the KRR error crosses the level of best degree-ℓ polynomial approximation roughly at log(n)/log(d_eff) ≈ ℓ.

The basic qualitative insight of our work can be summarized as follows. Kernel methods are effective when a low-dimensional structure in the target function is aligned with a low-dimensional structure in the covariates. In image data, both the target function and the covariates are dominated by the low-frequency subspace. In figure 1 we tested this hypothesis by removing the low-dimensional structure of the covariate vectors: we simply added noise to the high-frequency part of the image. In figure 4 we try the opposite, by removing the component of the target function that is localized on low-frequency modes. We decompose each images into a low-frequency and a high-frequency part. We leave the high-frequency part unchanged, and replace the low-frequency part by Gaussian noise with the first two moments matching the empirical moments of the data.

**Figure 4.** Comparison between multilayer NNs and the corresponding NT models under perturbations in frequency domain. (Left) Fully connected networks on FMNIST data. (Right) Comparison of CNN and convolutional neural tangent kernel (CNTK) KRR classification accuracy on CIFAR-10. We progressively replace the lowest frequencies of each image with Gaussian noise with matching covariance structure. Right: accuracy for FMNIST.
Download figure:
Standard image High-resolution image

In the left frame, we consider FMNIST data and compare fully-connected NNs with two or three layers (and N = 4096 nodes at each hidden layer) with the corresponding NT KRR model (infinite width). In the right frame, we use CIFAR-10 data and compare a Myrtle-5 network (a lightweight convolutional architecture [16, 32]) with the corresponding NT KRR. We observe the same behavior as in figure 1. While for the original data NT is comparable to NN, as the proportion of perturbed Fourier modes increases, the performance of NT deteriorates much more rapidly than the one of NN.

4. Discussion

The limitations of linear methods—such as KRR—in high dimension are well understood in the context of nonparametric function estimation. For instance, a basic result in this area establishes that estimating a Sobolev function f_* in d dimensions with mean square error ɛ requires roughly ɛ^−2−d/α samples, with α the smoothness parameter [33]. This behavior is achieved by kernel smoothing and by KRR: however these methods are not expected to be adaptive when f_*( x ) only depends on a low-dimensional projection of x , i.e. ${f}_{\ast }(\boldsymbol{x})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x})$ for an unknown $\boldsymbol{U}\in {\mathbb{R}}^{{d}_{0}\times d}$ , d₀ ≪ d. On the contrary, fully-trained NN can overcome this problem [20].

However, these classical statistical results have some limitations. First, they focus on the low-dimensional regime: d is fixed, while the sample size n diverges. This is probably unrealistic for many machine learning applications, in which d is at least of the order of a few hundreds. Second, classical lower bounds are typically established for the minimax risk, and hence they do not necessarily apply to specific functions.

To bridge these gaps, we developed a sharp characterization of the test error in the high-dimensional regime in which both d and n diverge, while being polynomially related. This characterization holds for any target function f_*, and expresses the limiting test error in terms of the polynomial decomposition. We also present analogous results for finite-width RF and NT models.

Our analysis is analogous and generalizes the recent results of [21]. However, while [21] assumed the covariates x _i to be uniformly distributed over the sphere ${\mathbb{S}}^{d-1}(\sqrt{d})$ , we introduced and analyzed a more general model in which the covariates mostly lie in the signal subspace with dimension d₀ ≪ d, and the target function is also dependent on that subspace. In fact our results follow as special cases of a more general model discussed in appendix C.

Depending on the relation between signal dimension d₀, ambient dimension d, and the covariate signal-to-noise ratio r, the model presents a continuum of different behaviors. At one extreme, the covariates are fully d-dimensional, and RKHS methods are highly suboptimal compared to NN. At the other, covariates are close to d₀-dimensional and RKHS methods are instead more competitive with NN.

Finally, the Fourier decomposition of images is a simple proxy for the decomposition of the covariate vector x into its low-dimensional dominant component (low frequency) and high-dimensional component (high frequency) [34].

Acknowledgments

This work was partially supported by the NSF Grants CCF-1714305, IIS-1741162, DMS-1418362, DMS-1407813 and by the ONR Grant N00014-18-1-2729.

Data availability statement

The code used to produce our results can be accessed at https://github.com/bGhorbani/linearized_neural_networks.

Appendix A.: Details of numerical experiments

A.1. General training details

All models studied in the paper are trained with squared loss and ℓ₂ regularization. For multi-class datasets such as FMNIST, one-hot encoded labels are used for training. All models discussed in the paper use ReLU non-linearity. Fully-connected models are initialized according to mean-field parameterization [26, 35, 36]. All NN are optimized with SGD with 0.9 momentum. The learning-rate evolves according to the cosine rule

$\begin{equation}l{r}_{t}=l{r}_{0}\enspace \mathrm{max}\left(\left(1+\mathrm{cos}\left(\frac{t\pi }{T}\right)\right),\frac{1}{15}\right)\end{equation} \tag{ 11 }$

where lr₀ = 10⁻³ and T = 750 is the total number of training epochs. To ensure the stability of the optimization for wide models, we use 15 linear warm-up epochs in the beginning.

When N ≫ 1, training RF and NT with SGD is unstable (unless extremely small learning-rates are used). This makes the optimization prohibitively slow for large datasets. To avoid this issue, instead of SGD, we use conjugate gradient method (CG) for optimizing RF and NT. Since these two models are strongly convex⁵ , the optimizer is unique. Hence, using CG will not introduce any artifacts in the results.

In order to use CG, we first implement a function to perform Hessian-vector products in TensorFlow [37]. The function handle is then passed to scipy.sparse.cg for CG. Our Hessian-vector product code uses tensor manipulation utilities implemented by [38].

Unfortunately, scipy.sparse.cg does not support one-hot encoded labels. To avoid running CG for each class separately, when the labels are one-hot encoded, we use Adam optimizer [39] instead. When using Adam, the learning-rate still evolves as (11) with lr₀ = 10⁻⁵. The batch-size is fixed at 10⁴ to encourage fast convergence to the minimum.

For NN, RF and NT, the training is primary done in TensorFlow (v1.12) [37]. For KRR, we generate the kernel matrix first and directly fit the model in regular python. The kernels associated with two-layer models are calculated analytically. For deeper models, the kernels are computed using neural-tangents library in JAX [40, 41].

A.2. Synthetic data experiments

The synthetic data follows the distribution outlined in the main text. In particular,

$\begin{equation}{\boldsymbol{x}}_{i}=({\boldsymbol{u}}_{i},{\boldsymbol{z}}_{i}),\qquad {y}_{i}=\varphi ({\boldsymbol{u}}_{i}),\quad {\boldsymbol{u}}_{i}\in {\mathbb{R}}^{{d}_{0}},\enspace {\boldsymbol{z}}_{i}\in {\mathbb{R}}^{d-{d}_{0}},\end{equation} \tag{ 12 }$

where u _i and z _i are drawn i.i.d. from the hyper-spheres with radii $r\sqrt{{d}_{0}}$ and $\sqrt{d}$ respectively. We choose

$\begin{equation}r={d}^{\kappa /2},\qquad {d}_{0}={d}^{\eta },\end{equation} \tag{ 13 }$

where d is fixed to be 1024 and $\eta =\frac{2}{5}$ . We change κ in the interval {0, ..., 0.9}. For each value of κ we generate 2²⁰ training and 10⁴ test observations⁶ .

The function φ is the sum of three orthogonal components ${\left\{{\varphi }_{i}\right\}}_{i=1}^{3}$ with ||φ_i||₂ = 1. To be more specific,

$\begin{equation}{\varphi }_{i}(\boldsymbol{x})\propto \sum\limits _{j=1}^{{d}_{0}-i}{\alpha }_{j}^{(i)}\prod\limits _{k=j}^{j+i}{\boldsymbol{x}}_{k},\qquad {\alpha }_{j}^{(i)}\hspace{2pt}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim }\hspace{2pt}\mathrm{exp}(1).\end{equation} \tag{ 14 }$

This choice of φ_i guarantees that each φ_i is in the span of degree i + 1 spherical harmonics.

In the experiments presented in figure 2, for NN and NT, the number of hidden units N takes 30 geometrically spaced values in the interval [5, 10⁴]. NN models are trained using SGD with momentum 0.9 (the learning-rate evolution is described above). We use batch-size of 512 for the warm-up epochs and batch-size of 1024 for the rest of the training. For RF, N takes 24 geometrically spaced values in the interval [100, 711 680]. The limit N = 711 680 corresponds to the largest model size we are computationally able to train at this scale. All models are trained with ℓ₂ regularization. The ℓ₂ regularization grids used for these experiments are presented in table A.1. In all our experiments, we choose the ℓ₂ regularization parameter that yields the best test performance⁷ . In total, we train approximately 10 000 different models just for this subset of experiments.

Table A.1. Hyper-parameter details for synthetic data experiments.

Experiment	Model	ℓ₂ regularization grid
Approximation error (figure 2)	NN	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−8, −4]
	NT	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{10}$ , α_i uniformly spaced in [−4, 2]
	RF	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{10}$ , α_i uniformly spaced in [−5, 2]
Generalization error (figure 3)	NN	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{25}$ , α_i uniformly spaced in [−8, −2]
Generalization error (figure 3)	NT KRR	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{10}$ , α_i uniformly spaced in [0, 6]

In figure 3 of the main text, we compared the generalization performance of NTK KRR with NN. We use the same training and test data as above to perform this analysis. The number of training data points, n, takes 24 different values ranging from 50 to 10⁵. The number of test data points is always fixed at 10⁴.

A.3. High-frequency noise experiment on FMNIST

In effort to make the distribution of the covariates more isotropic, in this experiment, we add high-frequency noise to both the training and test data.

Let $\boldsymbol{x}\in {\mathbb{R}}^{k\times k}$ be an image. We first remove the global average of the image and then add high-frequency Gaussian noise to x in the following manner:

(a)
We convert x to frequency domain via discrete cosine transform (DCT II-orthogonal to be precise). We denote the representation of the image in the frequency domain $\tilde{\boldsymbol{x}}\in {\mathbb{R}}^{k\times k}$ .
(b)
We choose a filter F ∈ {0, 1}^k×k. F determines on which frequencies the noise should be added. The noise matrix $\tilde{\boldsymbol{Z}}$ is defined as Z ⊙ F where Z ∈ R^k×k has i.i.d. N(0, 1) entries.
(c)
We define ${\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}=\tilde{\boldsymbol{x}}+\tau ({\Vert}\tilde{\boldsymbol{x}}{\Vert}/{\Vert}\tilde{\boldsymbol{Z}}{\Vert})\tilde{\boldsymbol{Z}}$ . The constant τ controls the noise magnitude.
(d)
We perform inverse discrete cosine transform (DCT III-orthogonal) on ${\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}$ to convert the image to pixel domain. We denote the noisy image in the pixel domain as x _noisy.
(e)
Finally, we normalize the x _noisy so that it has norm $\sqrt{d}$ .

In the frequency domain, a grayscale image is represented by a matrix $\tilde{\boldsymbol{x}}\in {\mathbb{R}}^{k\times k}$ . Qualitatively speaking, elements ${(\tilde{\boldsymbol{x}})}_{i,j}$ with small values of i and j correspond to the low-frequency component of the image and elements with large indices correspond to high-frequency components. The matrix F is chosen such that no noise is added to low frequencies. Specifically, we choose

$\begin{equation}{\boldsymbol{F}}_{i,j}=\begin{cases}1\quad \hfill & \quad \text{if}\;{(k-i)}^{2}+{(k-j)}^{2}\leqslant {(k-1)}^{2}\hfill \\ 0\quad \hfill & \quad \text{otherwise.}\hfill \end{cases}\end{equation} \tag{ 15 }$

This choice of F mirrors the average frequency domain representation of FMNIST images (see figure A.1 for a comparison). Figures A.2 and A.4 respectively show the eigenvalues of the empirical covariance of the dataset for various noise levels. As discussed in the main text, the distribution of the covariates becomes more isotropic as more and more high-frequency noise is added to the images.

**Figure A.2.** The eigenvalues of the empirical covariance matrix of the FMNIST training data. As the noise intensity increases, the distribution of the eigenvalues becomes more isotropic. Note that due to the conservative choice of the filter F , noise is not added to all of the low-variance directions. These left-out directions corresponds to the small eigenvalues appearing in the left-hand side of the plot.
Download figure:
Standard image High-resolution image

Figure A.3 shows the normalized squared loss and the classification accuracy of the models as more and more high-frequency noise is added to the data. The normalization factor R₀ = 0.9 corresponds to the risk achievable by the (trivial) predictor ${\left[{\hat{y}}_{j}(\boldsymbol{x})\right]}_{1\leqslant j\leqslant 10}=0.1$ .

A.3.1. Experiment hyper-parameters

For NT and NN, the number of hidden units N = 4096. For RF, we fix N = 321 126. These hyper-parameter choices ensure that the models have approximately the same number of trainable parameters. NN is trained with SGD with 0.9 momentum and learning-rate described by (11). The batch-size for the warm-up epochs is 500. After the warm-up stage is over, we use batch-size of 1000 to train the network. Since CG is not available in this setting, NT and RF are optimized using Adam for T = 750 epochs with batch-size of 10⁴. The ℓ₂ regularization grids used for training these models are listed in table A.2.

Table A.2. Details of regularization parameters used for high-frequency noise experiments.

Dataset	Model	ℓ₂ regularization grid
FMNIST	NN	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−6, −2]
	NT	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−5, 3]
	RF	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−5, 3]
	NT KRR	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−1, 5]
	RF KRR	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−1, 5]
CIFAR-2	NN	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−6, −2]
	NT	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−4, 4]
	RF	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{40}$ , α_i uniformly spaced in [−2, 10]
	NT KRR	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−2, 4]
	RF KRR	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−2, 4]

A.4. High-frequency noise experiment on CIFAR-2

We perform a similar experiment on a subset of CIFAR-10. We choose two classes (airplane and cat) from the ten classes of CIFAR-10. This choice provides us with 10⁴ training and 2000 test data points. Given that the number of training observations is not very large, we reduce the covariate dimension by converting the images to grayscale. This transformation reduces the covariate dimension to d = 1024.

**Figure A.4.** (Left) FMNIST images with various high-frequency noise levels. (Right) CIFAR-2 images with various levels of high-frequency Gaussian noise. The images are converted to grayscale to make the covariate dimension manageable.
Download figure:
Standard image High-resolution image

Figure A.5 demonstrates the evolution of the model performances as the noise intensity increases. In the noiseless regime (τ = 0), all models have comparable performances. However, as the noise level increases, the performance gap between NN and RKHS methods widens. For reference, the accuracy gap between NN and NT KRR is only 0.6% at τ = 0. However, at τ = 3, this gap increases to 4.5%. The normalization factor R₀ = 0.25 corresponds to the risk achievable by the trivial estimator $\hat{y}(\boldsymbol{x})=0.5$ .

A.4.1. Experiment hyper-parameters

For NT and NN, the number of hidden units N = 4096. For RF, we fix N = 4.2 × 10⁶. These hyper-parameter choices ensure that the models have approximately the same number of trainable parameters. NN is trained with SGD with 0.9 momentum and learning-rate described by (11). The batch-size is fixed at 250. NT is optimized via CG with 750 maximum iterations. The ℓ₂ regularization grids used for training these models are listed in table A.2.

A.5. Low-frequency noise experiments on FMNIST

To examine the ability of NN and RKHS methods in learning the information in low-variance components of the covariates, we replace the low-frequency components of the image with Gaussian noise. To be specific, we follow the following steps to generate the noisy datasets:

(a)
We normalize all images to have mean zero and norm $\sqrt{d}$ .
(b)
Let ${\mathcal{D}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ denote the set of training images in the DCT-frequency domain. We compute the mean μ and the covariance Σ of the elements of ${\mathcal{D}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ .
(c)
We fix a threshold $\alpha \in \mathbb{N}$ where 1 ⩽ α ⩽ k.
(d)
Let x be an image in the dataset (test or train). We denote the representation of x in the frequency domain with $\tilde{\boldsymbol{x}}$ . For each image, we draw a noise matrix $\boldsymbol{z}\sim \mathcal{N}(\mu ,{\Sigma})$ . We have
$\begin{equation*}{\left[{\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}\right]}_{i,j}=\begin{cases}{(\boldsymbol{z})}_{i,j}\quad \hfill & \quad \text{if}\;i,j\leqslant \alpha \hfill \\ {\tilde{\boldsymbol{x}}}_{i,j}\quad \hfill & \quad \text{otherwise.}\hfill \end{cases}\end{equation*}$
(e)
We perform IDCT on ${\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}$ to get the noisy image x _noisy.

The fraction of the frequencies replaced by noise is α²/k². Figure A.8 shows several examples of noisy images for different thresholds α.

A.5.1. Experiment hyper-parameters

For NN trained for this experiment, we fix the number of hidden units per-layer to N = 4096. This corresponds to approximately 3.2 × 10⁶ trainable parameters for two-layer networks and 2 × 10⁷ trainable parameters for three-layer networks. Both models are trained using SGD with momentum with learning rate described by (11) (with lr₀ = 10⁻³). For the warm-up epochs, we use batch-size of 500. We increase the batch-size to 1000 after the warm-up stage. The regularization grids used for training our models are presented in table A.3 (figure A.6).

Table A.3. Details of regularization parameters used for low-frequency noise experiments.

Dataset	Model	ℓ₂ regularization grid
FMNIST	NN depth 2	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−6, −2]
	NN depth 3	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{10}$ , α_i uniformly spaced in [−7, −5]
	NTK KRR depth 2	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−1, 5]
	NTK KRR depth 3	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−4, 3]
	Linear model	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{30}$ , α_i uniformly spaced in [−1, 5]
CIFAR-10	Myrtle-5	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{10}$ , α_i uniformly spaced in [−5, −2]
CIFAR-10	KRR (Myrtle-5 NTK)	${\left\{1{0}^{{\alpha }_{i}}\right\}}_{i=1}^{20}$ , α_i uniformly spaced in [−6, 1]

**Figure A.6.** Normalized test squared error (left) and test classification accuracy (right) of the models on FMNIST with low-frequency Gaussian noise.
Download figure:
Standard image High-resolution image

A.6. Low-frequency noise experiments on CIFAR-10

To test whether our insights are valid for convolutional models, we repeat the same experiment for CNNs trained on CIFAR-10. The noisy data is generated as follows:

(a)
Let ${\mathcal{D}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ denote the set of training images in the DCT-frequency domain. Note that CIFAR-10 images have three channels. To convert the images to frequency domain, we apply two-dimensional discrete cosine transform (DCT-II orthogonal) to each channel separately. We compute the mean μ and the covariance Σ of the elements of ${\mathcal{D}}_{\mathrm{t}\mathrm{r}\mathrm{a}\mathrm{i}\mathrm{n}}$ .
(b)
We fix a threshold $\alpha \in \mathbb{N}$ where 1 ⩽ α ⩽ 32.
(c)
Let $\boldsymbol{x}\in {\mathbb{R}}^{32\times 32\times 3}$ be an image in the dataset (test or train). We denote the representation of x in the DCT-frequency domain with $\tilde{\boldsymbol{x}}\in {\mathbb{R}}^{32\times 32\times 3}$ . For each image, we draw a noise matrix $\boldsymbol{z}\sim \mathcal{N}(\mu ,{\Sigma})$ . We have
$\begin{equation*}{\left[{\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}\right]}_{i,j,k}=\begin{cases}{(\boldsymbol{z})}_{i,j,k}\quad \hfill & \quad \text{if}\;i,j\leqslant \alpha \hfill \\ {\tilde{\boldsymbol{x}}}_{i,j,k}\quad \hfill & \quad \text{otherwise.}\hfill \end{cases}\end{equation*}$
(d)
We perform IDCT on ${\tilde{\boldsymbol{x}}}_{\mathrm{n}\mathrm{o}\mathrm{i}\mathrm{s}\mathrm{y}}$ to get the noisy image x _noisy.
(e)
We normalize the noisy data to have zero per-channel mean and unit per-channel standard deviation. The normalization statistics are computed using only the training data.

We use Myrtle-5 architecture for our analysis. The Myrtle family is a collection of simple light-weight high-performance purely convolutional models. The simplicity of these models coupled with their good performance makes them a natural candidate for our analysis. The network only uses convolutions and average pooling. In particular, we do not use any batch-normalization [42] layers in this network (see [16] for details). We fix the number of channels in all convolutional layers to be N = 512. This corresponds to approximately 7 × 10⁶ parameters. Similar to the fully-connected networks, our convolutional models are also optimized via SGD with 0.9 momentum (learning rate evolves as (11) with lr₀ = 0.1 and T = 70). We fix the batch-size to 128. To keep the experimental setting as simple as possible, we do not use any data augmentation for training the network. The results of the numerical experiments are reported in figure A.7.

**Figure A.7.** Performance of Myrtle-5 and KRR with CNTK on noisy CIFAR-10. CNTK is generated from the Myrtle-5 architecture using neural-tangents JAX library. When no noise is present in the data, the CNN achieves 87.7% and the CNTK achieves 77.6% classification accuracy. After randomizing only 1.5% of the frequencies (corresponding to α = 4) CNTK classification performance falls to 58.2% while the CNN retains 84.7% accuracy.
Download figure:
Standard image High-resolution image

**Figure A.8.** The effect of low-frequency noise for various cut-off thresholds, α. The left panel corresponds to the noisy FMNIST images and the right panel corresponds to CIFAR-10 images. In order to plot CIFAR-10 images, we rescale them to the interval [0, 1].
Download figure:
Standard image High-resolution image

Appendix B.: Technical background on function spaces on the sphere

B.1. Functional spaces over the sphere

For d ⩾ 1, we let ${\mathbb{S}}^{d-1}(r)=\left\{\boldsymbol{x}\in {\mathbb{R}}^{d}:{\Vert}\boldsymbol{x}{{\Vert}}_{2}=r\right\}$ denote the sphere with radius r in ${\mathbb{R}}^{d}$ . We will mostly work with the sphere of radius $\sqrt{d}$ , ${\mathbb{S}}^{d-1}(\sqrt{d})$ and will denote by μ_d−1 the uniform probability measure on ${\mathbb{S}}^{d-1}(\sqrt{d})$ . All functions in the following are assumed to be elements of ${L}^{2}({\mathbb{S}}^{d-1}(\sqrt{d}),{\mu }_{d-1})$ , with scalar product and norm denoted as ${\langle \cdot ,\cdot \rangle }_{{L}^{2}}$ and ${\Vert}\cdot {{\Vert}}_{{L}^{2}}$ :

$\begin{equation}{\langle f,g\rangle }_{{L}^{2}}\equiv {\int }_{{\mathbb{S}}^{d-1}(\sqrt{d})}f(\boldsymbol{x})g(\boldsymbol{x}){\mu }_{d-1}(\mathrm{d}\boldsymbol{x}).\end{equation} \tag{ 16 }$

For $\ell \in {\mathbb{Z}}_{\geqslant 0}$ , let ${\tilde{V}}_{\hspace{-2pt}d,\ell }$ be the space of homogeneous harmonic polynomials of degree ℓ on ${\mathbb{R}}^{d}$ (i.e. homogeneous polynomials q( x ) satisfying Δq( x ) = 0), and denote by V_d,ℓ the linear space of functions obtained by restricting the polynomials in ${\tilde{V}}_{\hspace{-2pt}d,\ell }$ to ${\mathbb{S}}^{d-1}(\sqrt{d})$ . With these definitions, we have the following orthogonal decomposition

$\begin{equation}{L}^{2}({\mathbb{S}}^{d-1}(\sqrt{d}),{\mu }_{d-1})=\underset{\ell =0{}}{\overset{\infty }{\bigotimes}}{V}_{d,\ell }.\end{equation} \tag{ 17 }$

The dimension of each subspace is given by

$\begin{equation}\mathrm{dim}({V}_{d,\ell })=B(d,\ell )=\frac{2\ell +d-2}{\ell }\left(\genfrac{}{}{0pt}{}{\ell +d-3}{\ell -1}\right).\end{equation} \tag{ 18 }$

For each $\ell \in {\mathbb{Z}}_{\geqslant 0}$ , the spherical harmonics ${\left\{{Y}_{\ell ,j}^{(d)}\right\}}_{1\leqslant j\in \leqslant B(d,\ell )}$ form an orthonormal basis of V_d,ℓ:

$\begin{equation*}{\langle {Y}_{ki}^{(d)},{Y}_{sj}^{(d)}\rangle }_{{L}^{2}}={\delta }_{ij}{\delta }_{ks}.\end{equation*}$

Note that our convention is different from the more standard one, that defines the spherical harmonics as functions on ${\mathbb{S}}^{d-1}(1)$ . It is immediate to pass from one convention to the other by a simple scaling. We will drop the superscript d and write ${Y}_{\ell ,j}={Y}_{\ell ,j}^{(d)}$ whenever clear from the context.

We denote by ${\mathsf{P}}_{k}$ the orthogonal projections to V_d,k in ${L}^{2}({\mathbb{S}}^{d-1}(\sqrt{d}),{\mu }_{d-1})$ . This can be written in terms of spherical harmonics as

$\begin{equation}{\mathsf{P}}_{k}f(\boldsymbol{x})\equiv \sum\limits _{l=1}^{B(d,k)}{\langle f,{Y}_{kl}\rangle }_{{L}^{2}}{Y}_{kl}(\boldsymbol{x}).\end{equation} \tag{ 19 }$

We also define ${\mathsf{P}}_{\leqslant \ell }\equiv {\sum }_{k=0}^{\ell }{\mathsf{P}}_{k}$ , ${\mathsf{P}}_{ > \ell }\equiv \mathbf{I}-{\mathsf{P}}_{\leqslant \ell }={\sum }_{k=\ell +1}^{\infty }{\mathsf{P}}_{k}$ , and ${\mathsf{P}}_{< \ell }\equiv {\mathsf{P}}_{\leqslant \ell -1}$ , ${\mathsf{P}}_{\geqslant \ell }\equiv {\mathsf{P}}_{ > \ell -1}$ .

B.2. Gegenbauer polynomials

The ℓth Gegenbauer polynomial ${Q}_{\ell }^{(d)}$ is a polynomial of degree ℓ. Consistently with our convention for spherical harmonics, we view ${Q}_{\ell }^{(d)}$ as a function ${Q}_{\ell }^{(d)}:[-d,d]\to \mathbb{R}$ . The set ${\left\{{Q}_{\ell }^{(d)}\right\}}_{\ell \geqslant 0}$ forms an orthogonal basis on ${L}^{2}([-d,d],{\tilde{\mu }}_{d-1}^{1})$ , where ${\tilde{\mu }}_{d-1}^{1}$ is the distribution of $\sqrt{d}\langle \boldsymbol{x},{\boldsymbol{e}}_{1}\rangle$ when x ∼ μ_d−1, satisfying the normalization condition:

$\begin{align}\hfill {\int }_{-d}^{d}{Q}_{k}^{(d)}(t){Q}_{j}^{(d)}(t)\mathrm{d}{\tilde{\mu }}_{d-1}^{1}& =\frac{{w}_{d-2}}{d{w}_{d-1}}{\int }_{-d}^{d}{Q}_{k}^{(d)}(t){Q}_{j}^{(d)}(t){\left(1-\frac{{t}^{2}}{{d}^{2}}\right)}^{(d-3)/2}\enspace \mathrm{d}t\hfill \\ \hfill & =\frac{1}{B(d,k)}{\delta }_{jk},\hfill \end{align} \tag{ 20 }$

where we denoted ${w}_{d-1}=\frac{2{\pi }^{d/2}}{{\Gamma}(d/2)}$ the surface area of the sphere ${\mathbb{S}}^{d-1}(1)$ . In particular, these polynomials are normalized so that ${Q}_{\ell }^{(d)}(d)=1$ .

Gegenbauer polynomials are directly related to spherical harmonics as follows. Fix $\boldsymbol{v}\in {\mathbb{S}}^{d-1}(\sqrt{d})$ and consider the subspace of V_ℓ formed by all functions that are invariant under rotations in ${\mathbb{R}}^{d}$ that keep v unchanged. It is not hard to see that this subspace has dimension one, and coincides with the span of the function ${Q}_{\ell }^{(d)}(\langle \boldsymbol{v},\enspace \cdot \enspace \rangle )$ .

We will use the following properties of Gegenbauer polynomials.

(a)
For $\boldsymbol{x},\boldsymbol{y}\in {\mathbb{S}}^{d-1}(\sqrt{d})$
$\begin{equation}\langle {Q}_{j}^{(d)}(\langle \boldsymbol{x},\cdot \rangle ),{Q}_{k}^{(d)}(\langle \boldsymbol{y},\cdot \rangle ){\rangle }_{{L}^{2}}=\frac{1}{B(d,k)}{\delta }_{jk}{Q}_{k}^{(d)}(\langle \boldsymbol{x},\boldsymbol{y}\rangle ).\end{equation} \tag{ 21 }$
(b)
For $\boldsymbol{x},\boldsymbol{y}\in {\mathbb{S}}^{d-1}(\sqrt{d})$
$\begin{equation}{Q}_{k}^{(d)}(\langle \boldsymbol{x},\boldsymbol{y}\rangle )=\frac{1}{B(d,k)}\sum\limits _{i=1}^{B(d,k)}{Y}_{ki}^{(d)}(\boldsymbol{x}){Y}_{ki}^{(d)}(\boldsymbol{y}).\end{equation} \tag{ 22 }$
(c)
Recurrence formula
$\begin{equation}\frac{t}{d}{Q}_{k}^{(d)}(t)=\frac{k}{2k+d-2}{Q}_{k-1}^{(d)}(t)+\frac{k+d-2}{2k+d-2}{Q}_{k+1}^{(d)}(t).\end{equation} \tag{ 23 }$
(d)
Rodrigues formula
$\begin{equation}{Q}_{k}^{(d)}(t)={(-1/2)}^{k}{d}^{k}\frac{{\Gamma}((d-1)/2)}{{\Gamma}(k+(d-1)/2)}{\left(1-\frac{{t}^{2}}{{d}^{2}}\right)}^{(3-d)/2}{\left(\frac{\mathrm{d}}{\mathrm{d}t}\right)}^{k}{\left(1-\frac{{t}^{2}}{{d}^{2}}\right)}^{k+(d-3)/2}.\end{equation} \tag{ 24 }$

Note in particular that property (b) implies that—up to a constant— ${Q}_{k}^{(d)}(\langle \boldsymbol{x},\boldsymbol{y}\rangle )$ is a representation of the projector onto the subspace of degree-k spherical harmonics

$\begin{equation}({\mathsf{P}}_{k}f)(\boldsymbol{x})=B(d,k){\int }_{{\mathbb{S}}^{d-1}(\sqrt{d})}{Q}_{k}^{(d)}(\langle \boldsymbol{x},\boldsymbol{y}\rangle )f(\boldsymbol{y}){\mu }_{d-1}(\mathrm{d}\boldsymbol{y}).\end{equation} \tag{ 25 }$

B.3. Hermite polynomials

The Hermite polynomials ${\left\{{\mathrm{He}}_{k}\right\}}_{k\geqslant 0}$ form an orthogonal basis of ${L}^{2}(\mathbb{R},\gamma )$ , where $\gamma (\mathrm{d}x)={\mathrm{e}}^{-{x}^{2}/2}\enspace \mathrm{d}x/\sqrt{2\pi }$ is the standard Gaussian measure, and He_k has degree k. We will follow the classical normalization (here and below, expectation is with respect to G ∼ N(0, 1)):

$\begin{equation}\mathbb{E}\left\{{\mathrm{He}}_{j}(G){\mathrm{He}}_{k}(G)\right\}=k!{\delta }_{jk}.\end{equation} \tag{ 26 }$

As a consequence, for any function $g\in {L}^{2}(\mathbb{R},\gamma )$ , we have the decomposition

$\begin{equation}g(x)=\sum\limits _{k=0}^{\infty }\frac{{\mu }_{k}(g)}{k!}{\mathrm{He}}_{k}(x),\qquad {\mu }_{k}(g)\equiv \mathbb{E}\left\{g(G){\mathrm{He}}_{k}(G)\right\}.\end{equation} \tag{ 27 }$

Notice that for functions g that are k-weakly differentiable with g^(k) the kth weak derivative, we have

$\begin{equation}{\mu }_{k}(g)={\mathbb{E}}_{G}[{g}^{(k)}(G)].\end{equation} \tag{ 28 }$

The Hermite polynomials can be obtained as high-dimensional limits of the Gegenbauer polynomials introduced in the previous section. Indeed, the Gegenbauer polynomials are constructed by Gram–Schmidt orthogonalization of the monomials ${\left\{{x}^{k}\right\}}_{k\geqslant 0}$ with respect to the measure ${\tilde{\mu }}_{d-1}^{1}$ , while Hermite polynomial are obtained by Gram–Schmidt orthogonalization with respect to γ. Since ${\tilde{\mu }}_{d-1}^{1}{\Rightarrow}\gamma$ (here ⇒ denotes weak convergence), it is immediate to show that, for any fixed integer k,

$\begin{equation}\underset{d\to \infty }{\mathrm{lim}}\enspace \text{Coeff}\left\{{Q}_{k}^{(d)}(\sqrt{\mathrm{d}}x)B{(d,k)}^{1/2}\right\}=\text{Coeff}\left\{\frac{1}{{(k!)}^{1/2}}\enspace {\mathrm{He}}_{k}(x)\right\}.\end{equation} \tag{ 29 }$

Here and below, for P a polynomial, Coeff{P(x)} is the vector of the coefficients of P.

B.4. Tensor product of spherical harmonics

We will consider in this paper the product space

$\begin{equation}{\mathrm{PS}}^{\boldsymbol{d}}\equiv \prod\limits _{q=1}^{Q}{\mathbb{S}}^{{d}_{q}-1}\left(\sqrt{{d}_{q}}\right),\end{equation} \tag{ 30 }$

and the uniform measure on PS^d, denoted ${\mu }_{\boldsymbol{d}}\equiv {\mu }_{{d}_{1}-1}\otimes \dots \otimes {\mu }_{{d}_{Q}-1}={\bigotimes}_{q\in [Q]}{\mu }_{{d}_{q}-1}$ , where we recall ${\mu }_{{d}_{q}-1}\equiv \mathrm{Unif}({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}}))$ . We consider the functional space of L²(PS^d, μ_d) with scalar product and norm denoted as ${\langle \cdot ,\cdot \rangle }_{{L}^{2}}$ and ${\Vert}\cdot {{\Vert}}_{{L}^{2}}$ :

$\begin{equation*}{\langle f,g\rangle }_{{L}^{2}}\equiv {\int }_{{\mathrm{PS}}^{\boldsymbol{d}}}f(\bar{\boldsymbol{x}}\hspace{2pt})g(\bar{\boldsymbol{x}}\hspace{2pt}){\mu }_{\boldsymbol{d}}(\mathrm{d}\bar{\boldsymbol{x}}).\end{equation*}$

For $\boldsymbol{\ell }=({\ell }_{1},\dots ,{\ell }_{Q})\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , let ${\tilde{V}}_{\boldsymbol{\ell }}^{\boldsymbol{d}}\equiv {\tilde{V}}_{{d}_{1},{\ell }_{1}}\otimes \dots \otimes {\tilde{V}}_{{d}_{Q},{\ell }_{Q}}$ be the span of tensor products of Q homogeneous harmonic polynomials, respectively of degree ℓ_q on ${\mathbb{R}}^{{d}_{q}}$ in variable ${\bar{\boldsymbol{x}}}_{q}$ . Denote by ${V}_{\boldsymbol{\ell }}^{\boldsymbol{d}}$ the linear space of functions obtained by restricting the polynomials in ${\tilde{V}}_{\boldsymbol{\ell }}^{\boldsymbol{d}}$ to PS^d. With these definitions, we have the following orthogonal decomposition

$\begin{equation}{L}^{2}({\mathrm{PS}}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}})=\underset{\boldsymbol{\ell }\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\bigoplus}{V}_{\boldsymbol{\ell }}^{\boldsymbol{d}}.\end{equation} \tag{ 31 }$

The dimension of each subspace is given by

$\begin{equation*}B(\boldsymbol{d},\boldsymbol{\ell })\equiv \mathrm{dim}({V}_{\boldsymbol{\ell }}^{\boldsymbol{d}})=\prod\limits _{q=1}^{Q}B({d}_{q},{\ell }_{q}),\end{equation*}$

where we recall

$\begin{equation*}B(d,\ell )=\frac{2\ell +d-2}{\ell }\left(\genfrac{}{}{0pt}{}{\ell +d-3}{\ell -1}\right).\end{equation*}$

We recall that for each $\ell \in {\mathbb{Z}}_{\geqslant 0}$ , the spherical harmonics ${\left\{{Y}_{\ell j}^{(d)}\right\}}_{j\in [B(d,\ell )]}$ form an orthonormal basis of ${V}_{\ell }^{(d)}$ on ${\mathbb{S}}^{d-1}(\sqrt{d})$ . Similarly, for each $\boldsymbol{\ell }\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , the tensor product of spherical harmonics ${\left\{{Y}_{\boldsymbol{\ell },\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{\ell })]}$ form an orthonormal basis of ${V}_{\boldsymbol{\ell }}^{\boldsymbol{d}}$ , where s = (s₁, ..., s_Q) ∈ [B( d , ℓ )] signify s_q ∈ [B(d_q, ℓ_q)] for q = 1, ..., Q and

$\begin{equation*}{Y}_{\boldsymbol{\ell },\boldsymbol{s}}^{\boldsymbol{d}}\equiv {Y}_{{\ell }_{1},{s}_{1}}^{({d}_{1})}\otimes {Y}_{{\ell }_{2},{s}_{2}}^{({d}_{2})}\otimes \dots \otimes {Y}_{{\ell }_{Q},{s}_{Q}}^{({d}_{Q})}=\underset{q=1{}}{\overset{Q}{\bigotimes}}{Y}_{{\ell }_{q},{s}_{q}}^{({d}_{q})}.\end{equation*}$

We have the following orthonormalization property

$\begin{equation*}\langle {Y}_{\boldsymbol{\ell },\boldsymbol{s}}^{\boldsymbol{d}},{Y}_{{\boldsymbol{\ell }}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}{\rangle }_{{L}^{2}}=\prod\limits _{q=1}^{Q}{\left\langle {Y}_{{\ell }_{q}{s}_{q}}^{({d}_{q})},{Y}_{{\ell }_{q}^{\prime }{s}_{q}^{\prime }}^{({d}_{q})}\right\rangle }_{{L}^{2}\left({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})\right)}=\prod\limits _{q=1}^{Q}{\delta }_{{\ell }_{q},{\ell }_{q}^{\prime }}{\delta }_{{s}_{q},{s}_{q}^{\prime }}={\delta }_{\boldsymbol{\ell },{\boldsymbol{\ell }}^{\prime }}{\delta }_{\boldsymbol{s},{\boldsymbol{s}}^{\prime }}.\end{equation*}$

We denote by ${\mathsf{P}}_{\boldsymbol{k}}$ the orthogonal projections on ${V}_{\boldsymbol{k}}^{\boldsymbol{d}}$ in L²(PS^d, μ_d). This can be written in terms of spherical harmonics as

$\begin{equation}{\mathsf{P}}_{\boldsymbol{k}}f(\bar{\boldsymbol{x}}\hspace{2pt})\equiv \sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\langle f,{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\rangle }_{{L}^{2}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}).\end{equation} \tag{ 32 }$

We will denote for any $\mathcal{Q}\subset {\mathbb{Z}}_{\geqslant 0}^{Q}$ , ${\mathsf{P}}_{\mathcal{Q}}$ the orthogonal projection on ${\bigoplus}_{\boldsymbol{k}\in \mathcal{Q}}{V}_{\boldsymbol{k}}^{\boldsymbol{d}}$ , given by

$\begin{equation*}{\mathsf{P}}_{\mathcal{Q}}=\sum\limits _{\boldsymbol{k}\in \mathcal{Q}}{\mathsf{P}}_{\boldsymbol{k}}.\end{equation*}$

Similarly, the projection on ${\mathcal{Q}}^{c}$ , the complementary of the set $\mathcal{Q}$ in ${\mathbb{Z}}_{\geqslant 0}^{Q}$ , is given by

$\begin{equation*}{\mathsf{P}}_{{\mathcal{Q}}^{c}}=\sum\limits _{\boldsymbol{k}\notin \mathcal{Q}}{\mathsf{P}}_{\boldsymbol{k}}.\end{equation*}$

B.5. Tensor product of Gegenbauer polynomials

We recall that ${\tilde{\mu }}_{d-1}^{1}$ denotes the distribution of $\sqrt{d}\langle \boldsymbol{x},{\boldsymbol{e}}_{d}\rangle$ when $\boldsymbol{x}\sim \mathrm{Unif}({\mathbb{S}}^{d-1}(\sqrt{d}))$ . We consider similarly the projection of PS^d on one coordinate per sphere. We define

$\begin{equation}{\mathrm{ps}}^{\boldsymbol{d}}\equiv \prod\limits _{q=1}^{Q}[-{d}_{q},{d}_{q}],\qquad {\tilde{\mu }}_{\boldsymbol{d}}^{1}\equiv {\tilde{\mu }}_{{d}_{1}-1}^{1}\otimes \dots \otimes {\tilde{\mu }}_{{d}_{Q}-1}^{1}=\underset{q=1{}}{\overset{Q}{\bigotimes}}{\tilde{\mu }}_{{d}_{q}-1}^{1},\end{equation} \tag{ 33 }$

and consider ${L}^{2}({\mathrm{ps}}^{\boldsymbol{d}},{\tilde{\mu }}_{\boldsymbol{d}}^{1})$ .

Recall that the Gegenbauer polynomials ${\left\{{Q}_{k}^{(d)}\right\}}_{k\geqslant 0}$ form an orthogonal basis of ${L}^{2}([-d,d],{\tilde{\mu }}_{d-1}^{1})$ .

Define for each $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , the tensor product of Gegenbauer polynomials

$\begin{equation}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\equiv {Q}_{{k}_{1}}^{({d}_{1})}\otimes \dots \otimes {Q}_{{k}_{Q}}^{({d}_{q})}=\underset{q=1{}}{\overset{Q}{\bigotimes}}{Q}_{{k}_{q}}^{({d}_{q})}.\end{equation} \tag{ 34 }$

We will use the following properties of the tensor product of Gegenbauer polynomials:

Lemma 1 (Properties of products of Gegenbauer). Consider the tensor product of Gegenbauer polynomials ${\left\{{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}$ defined in equation (34). Then

(a)
The set ${\left\{{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}$ forms an orthogonal basis on ${L}^{2}({\mathrm{ps}}^{\boldsymbol{d}},{\tilde{\mu }}_{\boldsymbol{d}}^{1})$ , satisfying the normalization condition: for any $\boldsymbol{k},{\boldsymbol{k}}^{\prime }\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ ,
$\begin{equation}{\left\langle {Q}_{\boldsymbol{k}}^{\boldsymbol{d}},{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\right\rangle }_{{L}^{2}({\mathrm{ps}}^{\boldsymbol{d}})}=\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}.\end{equation} \tag{ 35 }$
(b)
For $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})$ and $\bar{\boldsymbol{y}}=({\bar{\boldsymbol{y}}}^{(1)},\dots ,{\bar{\boldsymbol{y}}}^{(Q)})\in {\mathrm{PS}}^{\boldsymbol{d}}$ , and $\boldsymbol{k},{\boldsymbol{k}}^{\prime }\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ ,
$\begin{align}\hfill & {\left\langle {Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},\cdot \rangle \right\}}_{q\in [Q]}\right),{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{y}}}^{(q)},\cdot \rangle \right\}}_{q\in [Q]}\right)\right\rangle }_{{L}^{2}\left({\mathrm{PS}}^{\boldsymbol{d}}\right)}=\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 36 }$
(c)
For $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})$ and $\bar{\boldsymbol{y}}=({\bar{\boldsymbol{y}}}^{(1)},\dots ,{\bar{\boldsymbol{y}}}^{(Q)})\in {\mathrm{PS}}^{\boldsymbol{d}}$ , and $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ ,
$\begin{equation}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)=\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}\sum\limits _{\boldsymbol{s}\in B(\boldsymbol{d},\boldsymbol{k})}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt}).\end{equation} \tag{ 37 }$

Notice that lemma 1(c) implies that ${Q}_{\boldsymbol{k}}^{\boldsymbol{d}}$ is (up to a constant) a representation of the projector onto the subspace ${V}_{\boldsymbol{k}}^{\boldsymbol{d}}$

$\begin{equation*}[{\mathsf{P}}_{\boldsymbol{k}}f](\bar{\boldsymbol{x}}\hspace{2pt})=B(\boldsymbol{d},\boldsymbol{k}){\int }_{{\mathrm{PS}}^{\boldsymbol{d}}}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)f(\bar{\boldsymbol{y}}\hspace{2pt}){\mu }_{\boldsymbol{d}}(\mathrm{d}\bar{\boldsymbol{y}}).\end{equation*}$

Proof of lemma 1. Part (a) comes from the normalization property (20) of Gegenbauer polynomials,

$\begin{equation*}\begin{aligned}\hfill {\left\langle {Q}_{\boldsymbol{k}}^{\boldsymbol{d}},{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\right\rangle }_{{L}^{2}({\mathrm{ps}}^{\boldsymbol{d}})}& ={\left\langle {Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\sqrt{{d}_{q}}\langle {\boldsymbol{e}}_{q},\cdot \rangle \right\}}_{q\in [Q]}\right),{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\sqrt{{d}_{q}}\langle {\boldsymbol{e}}_{q},\cdot \rangle \right\}}_{q\in [Q]}\right)\right\rangle }_{{L}^{2}\left({\mathrm{PS}}^{\boldsymbol{d}}\right)}\hfill \\ \hfill & =\prod\limits _{q=1}^{Q}{\left\langle {Q}_{{k}_{q}}^{({d}_{q})}\left(\sqrt{{d}_{q}}\langle {\boldsymbol{e}}_{q},\cdot \rangle \right),{Q}_{{k}_{q}^{\prime }}^{({d}_{q})}\left(\sqrt{{d}_{q}}\langle {\boldsymbol{e}}_{q},\cdot \rangle \right)\right\rangle }_{{L}^{2}\left({\mathbb{S}}^{{d}_{q}-1}\left(\sqrt{{d}_{q}}\right)\right)}\hfill \\ \hfill & =\prod\limits _{q=1}^{Q}\frac{1}{B({d}_{q},{k}_{q})}{\delta }_{{k}_{q},{k}_{q}^{\prime }}\hfill \\ \hfill & =\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }},\hfill \end{aligned}\end{equation*}$

where the ${\left\{{\boldsymbol{e}}_{q}\right\}}_{q\in [Q]}$ are unit vectors in ${\mathbb{R}}^{{d}_{q}}$ respectively.

Part (b) comes from equation (21),

$\begin{align*}\hfill & {\left\langle {Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},\cdot \rangle \right\}}_{q\in [Q]}\right),{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{y}}}^{(q)},\cdot \rangle \right\}}_{q\in [Q]}\right)\right\rangle }_{{L}^{2}\left({\mathrm{PS}}^{\boldsymbol{d}}\right)}\hfill \\ \hfill & \qquad =\prod\limits _{q=1}^{Q}{\left\langle {Q}_{{k}_{q}}^{({d}_{q})}\left(\langle {\bar{\boldsymbol{x}}}^{(q)},\cdot \rangle \right),{Q}_{{k}_{q}^{\prime }}^{({d}_{q})}\left(\langle {\bar{\boldsymbol{y}}}^{(q)},\cdot \rangle \right)\right\rangle }_{{L}^{2}\left({\mathbb{S}}^{{d}_{q}-1}\left(\sqrt{{d}_{q}}\right)\right)}\hfill \\ \hfill & \qquad =\prod\limits _{q=1}^{Q}\frac{1}{B({d}_{q},{k}_{q})}{\delta }_{{k}_{q},{k}_{q}^{\prime }}{Q}_{{k}_{q}}^{({d}_{q})}\left(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right)\hfill \\ \hfill & \qquad =\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \end{align*}$

while part (c) is a direct consequence of equation (22). □

B.6. Notations

Throughout the proofs, O_d(⋅) (resp. o_d(⋅)) denotes the standard big-O (resp. little-o) notation, where the subscript d emphasizes the asymptotic variable. We denote ${O}_{d,\mathbb{P}}(\cdot )$ (resp. ${o}_{d,\mathbb{P}}(\cdot )$ ) the big-O (resp. little-o) in probability notation: ${h}_{1}(d)={O}_{d,\mathbb{P}}({h}_{2}(d))$ if for any ɛ > 0, there exists C_ɛ > 0 and ${d}_{\varepsilon }\in {\mathbb{Z}}_{ > 0}$ , such that

$\begin{equation*}\mathbb{P}(\vert {h}_{1}(d)/{h}_{2}(d)\vert > {C}_{\varepsilon })\leqslant \varepsilon ,\quad \forall \enspace d\geqslant {d}_{\varepsilon },\end{equation*}$

and respectively: ${h}_{1}(d)={o}_{d,\mathbb{P}}({h}_{2}(d))$ , if h₁(d)/h₂(d) converges to 0 in probability.

We will occasionally hide logarithmic factors using the ${\tilde{O}}_{d}(\cdot )$ notation (resp. ${\tilde{o}}_{d}(\cdot )$ ): ${h}_{1}(d)={\tilde{O}}_{d}({h}_{2}(d))$ if there exists a constant C such that h₁(d) ⩽ C(log d)^C h₂(d). Similarly, we will denote ${\tilde{O}}_{d,\mathbb{P}}(\cdot )$ (resp. ${\tilde{o}}_{d,\mathbb{P}}(\cdot )$ ) when considering the big-O in probability notation up to a logarithmic factor.

Furthermore, f = ω_d(g) will denote f(d)/g(d) → ∞.

Appendix C.: General framework and main theorems

In this section, define a more general model than the model considered in the main text. In the general model, we will assume the covariate vectors will follow a product of uniform distributions on the sphere, and assume a target function in L² space. We establish more general versions of theorems 1–3 on the two-spheres cases in the main text as theorems 5–7. We will prove theorems 5–7 in the following sections. At the end of this section, we will show that theorems 5–7 will imply theorems 1–3 in the main text.

C.1. Setup on the product of spheres

Assume that the data x lies on the product of Q spheres,

$\begin{equation*}\boldsymbol{x}=\left({\boldsymbol{x}}^{(1)},\dots ,{\boldsymbol{x}}^{(Q)}\right)\in \prod\limits _{q\in [Q]}{\mathbb{S}}^{{d}_{q}-1}({r}_{q}),\end{equation*}$

where ${d}_{q}={d}^{{\eta }_{q}}$ and ${r}_{q}={d}^{({\eta }_{q}+{\kappa }_{q})/2}$ . Let $\boldsymbol{d}=({d}_{1},\dots ,{d}_{q})=({d}^{{\eta }_{1}},\dots ,{d}^{{\eta }_{q}})$ and κ = (κ₁, ..., κ_Q), where η_q > 0 and κ_q ⩾ 0 for q = 1, ..., Q. We will denote this space

$\begin{equation}{\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}=\prod\limits _{q\in [Q]}{\mathbb{S}}^{{d}_{q}-1}({r}_{q}).\end{equation} \tag{ 38 }$

Furthermore, assume that the data is generated following the uniform distribution on ${\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}$ , i.e.

$\begin{equation}\boldsymbol{x}\hspace{2pt}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim }\hspace{2pt}\mathrm{Unif}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}})=\underset{q\in [Q]}{\bigotimes}\mathrm{Unif}\left({\mathbb{S}}^{{d}_{q}-1}({r}_{q})\right)\equiv {\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }}.\end{equation} \tag{ 39 }$

We have $\boldsymbol{x}\in {\mathbb{R}}^{D}$ and || x ||₂ = R where $D={d}^{{\eta }_{1}}+\cdots +{d}^{{\eta }_{Q}}$ and $R={({d}^{{\eta }_{1}+{\kappa }_{1}}+\cdots +{d}^{{\eta }_{Q}+{\kappa }_{Q}})}^{1/2}$ .

We will make the following assumption that will simplify the proofs. Denote

$\begin{equation}\xi \equiv \underset{q\in [Q]}{\mathrm{max}}\left\{{\eta }_{q}+{\kappa }_{q}\right\},\end{equation} \tag{ 40 }$

then ξ is attained on only one of the sphere, whose coordinate will be denoted q_ξ, i.e. $\xi ={\eta }_{{q}_{\xi }}+{\kappa }_{{q}_{\xi }}$ and η_q + κ_q < ξ for q ≠ q_ξ.

Let $\sigma :\mathbb{R}\to \mathbb{R}$ be an activation function and ${({\boldsymbol{w}}_{i})}_{i\in [N]}\hspace{2pt}{\sim }_{iid}\hspace{2pt}\mathrm{Unif}({\mathbb{S}}^{D-1})$ the weights. We introduce the random feature function class

$\begin{equation*}{\mathcal{F}}_{\mathrm{RF}}(\boldsymbol{W}\hspace{2pt})=\left\{{\hat{f}}_{\mathrm{RF}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{a})=\sum\limits _{i=1}^{N}{a}_{i}\sigma (\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle \sqrt{D}/R):{a}_{i}\in \mathbb{R},\quad \forall \enspace i\in [N]\right\},\end{equation*}$

and the neural tangent function class

$\begin{equation*}{\mathcal{F}}_{\mathrm{NT}}(\boldsymbol{W}\hspace{2pt})=\left\{{\hat{f}}_{\mathrm{RF}}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{a})=\sum\limits _{i=1}^{N}\langle {\boldsymbol{a}}_{i},\boldsymbol{x}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle \sqrt{D}/R):{\boldsymbol{a}}_{i}\in {\mathbb{R}}^{D},\quad \forall \enspace i\in [N]\right\}.\end{equation*}$

We will denote ${\boldsymbol{\theta }}_{i}=\sqrt{D}{\boldsymbol{w}}_{i}$ . Notice that the normalization in the definition of the function class insures that the scalar product ⟨ x , θ _i⟩/R is of order 1. This corresponds to normalizing the data.

We consider the approximation of f by functions in function classes ${\mathcal{F}}_{\mathrm{RF}}(\mathbf{\Theta })$ and ${\mathcal{F}}_{\mathrm{NT}}(\mathbf{\Theta })$ .

C.2. Reparametrization

Recall ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ independently. We decompose ${\boldsymbol{\theta }}_{i}=({\boldsymbol{\theta }}_{i}^{(1)},\dots ,{\boldsymbol{\theta }}_{i}^{(Q)})$ into Q sections corresponding to the d_q coordinates associated to the qth sphere. Let us consider the following reparametrization of ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\hspace{2pt}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim }\hspace{2pt}\mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ :

$\begin{equation*}\left({\bar{\boldsymbol{\theta }}}_{i}^{(1)},\dots ,{\bar{\boldsymbol{\theta }}}_{i}^{(Q)},{\tau }_{i}^{(1)},\dots ,{\tau }_{i}^{(Q)}\right),\end{equation*}$

where

$\begin{equation*}{\bar{\boldsymbol{\theta }}}_{i}^{(q)}\equiv \sqrt{{d}_{q}}{\boldsymbol{\theta }}_{i}^{(q)}/{\Vert}{\boldsymbol{\theta }}_{i}^{(q)}{{\Vert}}_{2},\qquad {\tau }_{i}^{(q)}\equiv {\Vert}{\boldsymbol{\theta }}_{i}^{(q)}{{\Vert}}_{2}/\sqrt{{d}_{q}},\quad \text{for}\enspace q=1,\dots ,Q.\end{equation*}$

Hence

$\begin{equation*}{\boldsymbol{\theta }}_{i}=\left({\tau }_{i}^{(1)}\cdot {\bar{\boldsymbol{\theta }}}_{i}^{(1)},\dots ,{\tau }_{i}^{(Q)}\cdot {\bar{\boldsymbol{\theta }}}_{i}^{(Q)}\right).\end{equation*}$

It is easy to check that the variables $({\bar{\boldsymbol{\theta }}}^{(1)},\dots ,{\bar{\boldsymbol{\theta }}}^{(Q)})$ are independent and independent of $({\tau }_{i}^{(1)},\dots ,{\tau }_{i}^{(Q)})$ , and verify

$\begin{equation*}{\bar{\boldsymbol{\theta }}}_{i}^{(q)}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})),\qquad {\tau }_{i}^{(q)}\sim {d}_{q}^{-1/2}\sqrt{\mathrm{B}\mathrm{e}\mathrm{t}\mathrm{a}\left(\frac{{d}_{q}}{2},\frac{D-{d}_{q}}{2}\right)},\quad \text{for}\enspace q=1,\dots ,Q.\end{equation*}$

We will denote ${\bar{\boldsymbol{\theta }}}_{i}\equiv ({\bar{\boldsymbol{\theta }}}_{i}^{(1)},\dots ,{\bar{\boldsymbol{\theta }}}_{i}^{(Q)})$ and ${\boldsymbol{\tau }}_{i}\equiv ({\tau }_{i}^{(1)},\dots ,{\tau }_{i}^{(Q)})$ . With these notations, we have

$\begin{equation*}{\bar{\boldsymbol{\theta }}}_{i}\in \prod\limits _{q\in [Q]}{\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})\equiv {\mathrm{PS}}^{\boldsymbol{d}},\end{equation*}$

where PS^d is the 'normalized space of product of spheres', and

$\begin{equation*}{\left({\bar{\boldsymbol{\theta }}}_{i}\right)}_{i\in [N]}\hspace{2pt}\stackrel{\mathrm{i}.\mathrm{i}.\mathrm{d}.}{\sim }\hspace{2pt}\underset{q\in [Q]}{\bigotimes}\mathrm{Unif}({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}}))\equiv {\mu }_{\boldsymbol{d}}.\end{equation*}$

Similarly, we will denote the rescaled data $\bar{\boldsymbol{x}}\in {\mathrm{PS}}^{\boldsymbol{d}}$ ,

$\begin{equation*}\bar{\boldsymbol{x}}=\left({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)}\right)\sim \underset{q\in [Q]}{\bigotimes}\mathrm{Unif}({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})),\end{equation*}$

obtained by taking ${\bar{\boldsymbol{x}}}^{(q)}=\sqrt{{d}_{q}}{\boldsymbol{x}}^{(q)}/{r}_{q}={d}^{-{\kappa }_{q}/2}{\boldsymbol{x}}^{(q)}$ for each q ∈ [Q].

The proof will proceed as follows: first, noticing that τ^(q) concentrates around 1 for every q = 1, ..., Q, we will restrict ourselves without loss of generality to the following high probability event

$\begin{equation*}{\mathcal{P}}_{d,N,\varepsilon }\equiv \left\{\mathbf{\Theta }\vert {\tau }_{i}^{(q)}\in [1-\varepsilon ,1+\varepsilon ],\quad \forall \enspace i\in [N],\enspace \forall \enspace q\in [Q]\right\}\subset {\mathbb{S}}^{D-1}{(\sqrt{D})}^{N},\end{equation*}$

where ɛ > 0 will be chosen sufficiently small. Then, we rewrite the activation function

$\begin{equation*}\sigma (\langle \cdot ,\cdot \rangle /R):{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}\to \mathbb{R},\end{equation*}$

as a function, for a random τ (but close to (1, ..., 1))

$\begin{equation*}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}:{\mathrm{PS}}^{\boldsymbol{d}}\times {\mathrm{PS}}^{\boldsymbol{d}}\to \mathbb{R},\end{equation*}$

given for $\boldsymbol{\theta }=(\bar{\boldsymbol{\theta }},\boldsymbol{\tau })$ by

$\begin{equation*}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)=\sigma \left(\sum\limits _{q\in [Q]}\frac{{\tau }^{(q)}{r}_{q}}{R}\cdot \frac{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle }{\sqrt{{d}_{q}}}\right).\end{equation*}$

We can therefore apply the algebra of tensor product of spherical harmonics and use the machinery developed in [21].

C.3. Notations

Recall the definitions d = (d₁, ..., d_q), κ = (κ₁, ..., κ_Q), ${d}_{q}={d}^{{\eta }_{q}}$ , ${r}_{q}={d}^{({\eta }_{q}+{\kappa }_{q})/2}$ , $D={d}^{{\eta }_{1}}+\cdots +{d}^{{\eta }_{Q}}$ and $R={({d}^{{\eta }_{1}+{\kappa }_{1}}+\cdots +{d}^{{\eta }_{Q}+{\kappa }_{Q}})}^{1/2}$ . Let us denote ξ = max_q∈[Q]{η_q + κ_q} and q_ξ = argmin_q∈[Q]{η_q + κ_q}.

Recall that ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ independently. Let Θ = ( θ ₁, ..., θ _N). We denote ${\mathbb{E}}_{\boldsymbol{\theta }}$ to be the expectation operator with respect to $\boldsymbol{\theta }\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ and ${\mathbb{E}}_{\mathbf{\Theta }}$ the expectation operator with respect to $\mathbf{\Theta }=({\boldsymbol{\theta }}_{1},\dots ,{\boldsymbol{\theta }}_{N})\sim \mathrm{Unif}{({\mathbb{S}}^{D-1}(\sqrt{D}))}^{\otimes N}$ .

We will denote ${\mathbb{E}}_{\bar{\boldsymbol{\theta }}}$ the expectation operator with respect to $\bar{\boldsymbol{\theta }}\equiv ({\bar{\boldsymbol{\theta }}}^{(1)},\dots ,{\bar{\boldsymbol{\theta }}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}$ , ${\mathbb{E}}_{\bar{\mathbf{\Theta }}}$ the expectation operator with respect to $\bar{\mathbf{\Theta }}=({\bar{\boldsymbol{\theta }}}_{1},\dots ,{\bar{\boldsymbol{\theta }}}_{N})$ , and ${\mathbb{E}}_{\boldsymbol{\tau }}$ the expectation operator with respect to τ (we recall τ ≡ (τ⁽¹⁾, ..., τ^(Q))) or ( τ ₁, ..., τ _N) (where the τ _i are independent) depending on the context. In particular, notice that ${\mathbb{E}}_{\boldsymbol{\theta }}={\mathbb{E}}_{\boldsymbol{\tau }}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}$ and ${\mathbb{E}}_{\mathbf{\Theta }}={\mathbb{E}}_{\boldsymbol{\tau }}{\mathbb{E}}_{\bar{\mathbf{\Theta }}}$ .

We will denote ${\mathbb{E}}_{{\mathbf{\Theta }}_{\varepsilon }}$ the expectation operator with respect to Θ = ( θ ₁, ..., θ _N) restricted to ${\mathcal{P}}_{d,N,\varepsilon }$ and ${\mathbb{E}}_{{\boldsymbol{\tau }}_{\varepsilon }}$ the expectation operator with respect to τ restricted to [1 − ɛ, 1 + ɛ]^Q. Notice that ${\mathbb{E}}_{{\mathbf{\Theta }}_{\varepsilon }}={\mathbb{E}}_{{\boldsymbol{\tau }}_{\varepsilon }}{\mathbb{E}}_{\bar{\mathbf{\Theta }}}$ .

Let ${\mathbb{E}}_{\boldsymbol{x}}$ to be the expectation operator with respect to $\boldsymbol{x}\sim {\mu }_{\boldsymbol{\kappa }}^{\boldsymbol{d}}$ , and ${\mathbb{E}}_{\bar{\boldsymbol{x}}}$ the expectation operator with respect to $\bar{\boldsymbol{x}}\sim {\mu }_{\boldsymbol{d}}$ .

C.4. Generalization error of kernel ridge regression

We consider the KRR solution ${\hat{a}}_{i}$ , namely

$\begin{equation*}\hat{\boldsymbol{a}}={(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{y},\end{equation*}$

where the kernel matrix $\boldsymbol{H}={({H}_{ij})}_{ij\in [n]}$ is assumed to be given by

$\begin{equation*}{H}_{ij}={\bar{h}}_{d}(\langle {\boldsymbol{x}}_{i},{\boldsymbol{x}}_{j}\rangle /{R}^{2})={\mathbb{E}}_{\bar{\boldsymbol{\theta }}\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})}[\sigma (\langle \bar{\boldsymbol{\theta }},\boldsymbol{x}\rangle /R)\sigma (\langle \bar{\boldsymbol{\theta }},\boldsymbol{y}\rangle /R)],\end{equation*}$

and $\boldsymbol{y}={({y}_{1},\dots ,{y}_{n})}^{\mathsf{T}}=\boldsymbol{f}+\boldsymbol{\varepsilon }$ , with

$\begin{equation*}\begin{aligned}\hfill \boldsymbol{f}& ={({f}_{d}({\boldsymbol{x}}_{1}),\dots ,{f}_{d}({\boldsymbol{x}}_{n}))}^{\mathsf{T}},\hfill \\ \hfill \boldsymbol{\varepsilon }& ={({\varepsilon }_{1},\dots ,{\varepsilon }_{n})}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

The prediction function at location x gives

$\begin{equation*}{\hat{f}}_{\lambda }(\boldsymbol{x})={\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{h}(\boldsymbol{x}),\end{equation*}$

with

$\begin{equation*}\boldsymbol{h}(\boldsymbol{x})={[{\bar{h}}_{d}(\langle \boldsymbol{x},{\boldsymbol{x}}_{1}\rangle /{R}^{2}),\dots ,{\bar{h}}_{d}(\langle \boldsymbol{x},{\boldsymbol{x}}_{n}\rangle /{R}^{2})]}^{\mathsf{T}}.\end{equation*}$

The test error of empirical KRR is defined as

$\begin{equation*}{R}_{\mathrm{KRR}}({f}_{d},\boldsymbol{X},\lambda )\equiv {\mathbb{E}}_{\boldsymbol{x}}\left[{\left({f}_{d}(\boldsymbol{x})-{\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{h}(\boldsymbol{x})\right)}^{2}\right].\end{equation*}$

We define the set ${\bar{\mathcal{Q}}}_{\mathrm{KRR}}(\gamma )\subseteq {\mathbb{Z}}_{\geqslant 0}^{Q}$ as follows (recall that ξ ≡ max_q∈[Q](η_q + κ_q)):

$\begin{equation}{\bar{\mathcal{Q}}}_{\mathrm{KRR}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}\leqslant \gamma \right\},\end{equation} \tag{ 41 }$

and the function $m:{\mathbb{R}}_{\geqslant 0}\to {\mathbb{R}}_{\geqslant 0}$ which at γ associates

$\begin{equation*}m(\gamma )=\underset{\boldsymbol{k}\notin {\bar{\mathcal{Q}}}_{\mathrm{KRR}}(\gamma )}{\mathrm{min}}\sum\limits _{q\in [Q]}(\xi -{\kappa }_{q}){k}_{q}.\end{equation*}$

Notice that by definition m(γ) > γ.

We consider sequences of problems indexed by the integer d, and we view the problem parameters (in particular, the dimensions d_q, the radii r_q, the kernel h_d, and so on) as functions of d.

Assumption 1. Let ${\left\{{h}_{d}\right\}}_{d\geqslant 1}$ be a sequence of functions ${h}_{d}:[-1,1]\to \mathbb{R}$ such that H_d( x ₁, x ₂) = h_d(⟨ x ₁, x ₂⟩/d) is a positive semidefinite kernel.

(a)
For γ > 0 (which is specified in the theorem), we denote L = max_q∈[Q]⌈γ/η_q⌉. We assume that h_d is L-weakly differentiable. We assume that for 0 ⩽ k ⩽ L, the kth weak derivative verifies almost surely ${h}_{d}^{(k)}(u)\leqslant C$ for some constants C > 0 independent of d. Furthermore, we assume there exists k > L such that ${h}_{d}^{(k)}(0)\geqslant c > 0$ with c independent of d.
(b)
For γ > 0 (which is specified in the theorem), we define
$\begin{equation*}\bar{K}=\underset{\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{\mathrm{KRR}}(\gamma )}{\mathrm{max}}\vert \boldsymbol{k}\vert .\end{equation*}$
We assume that σ verifies for $k\leqslant \bar{K}$ , ${h}_{d}^{(k)}(0)\geqslant c$ , with c > 0 independent of d.

Theorem 5 (Risk of the KRR model). Let ${\left\{{f}_{d}\in {L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})\right\}}_{d\geqslant 1}$ be a sequence of functions. Assume w_d(d^γ log d) ⩽ n ⩽ O_d(d^m(γ)−δ) for some γ > 0 and δ > 0. Let ${\left\{{h}_{d}\right\}}_{d\geqslant 1}$ be a sequence of functions that satisfies assumption 1 at level γ. Let $\boldsymbol{X}={({\boldsymbol{x}}_{i})}_{i\in [n]}$ with ${({\boldsymbol{x}}_{i})}_{i\in [n]}\sim \mathrm{Unif}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}})$ independently, and y_i = f_d( x _i) + ɛ_i and ɛ_i ∼_iid N(0, τ²) for some τ² ⩾ 0. Then for any ɛ > 0, and for any λ = O_d(1), with high probability we have

$\begin{equation}\left\vert {R}_{\mathrm{KRR}}({f}_{d},\boldsymbol{X},\lambda )-{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant \varepsilon ({\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{\tau }^{2}).\end{equation} \tag{ 42 }$

See appendix D for the proof of this theorem.

C.5. Approximation error of the random features model

We consider the minimum population error for the random features model

$\begin{equation*}{R}_{\mathrm{RF}}({f}_{d},\boldsymbol{W}\hspace{2pt})=\underset{f\in {\mathcal{F}}_{\mathrm{RF}}(\boldsymbol{W}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}\left[{({f}_{\ast }(\boldsymbol{x})-f(\boldsymbol{x}))}^{2}\right].\end{equation*}$

Let us define the sets:

$\begin{equation}{\mathcal{Q}}_{\mathrm{RF}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}< \gamma \right\},\end{equation} \tag{ 43 }$

$\begin{equation}{\bar{\mathcal{Q}}}_{\mathrm{RF}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}\leqslant \gamma \right\}.\end{equation} \tag{ 44 }$

Assumption 2. Let σ be an activation function.

(a)
There exists constants c₀, c₁, with c₀ > 0 and c₁ < 1 such that the activation function σ verifies σ(u)² ⩽ c₀ exp(c₁ u²/2) almost surely for $u\in \mathbb{R}$ .
(b)
For γ > 0 (which is specified in the theorem), we denote L = max_q∈[Q]⌈γ/η_q⌉. We assume that σ is L-weakly differentiable. Define
$\begin{equation*}K=\underset{\boldsymbol{k}\in {\mathcal{Q}}_{\mathrm{RF}}{(\gamma )}^{c}}{\mathrm{min}}\vert \boldsymbol{k}\vert .\end{equation*}$
We assume that for K ⩽ k ⩽ L, the kth weak derivative verifies almost surely σ^(k)(u)² ⩽ c₀ exp(c₁ u²/2) for some constants c₀ > 0 and c₁ < 1.Furthermore we will assume that σ is not a degree- $\lfloor \gamma /{\eta }_{{q}_{\xi }}\rfloor$ polynomial where we recall that q_ξ corresponds to the unique argmin_q∈[Q]{η_q + κ_q}.
(c)
For γ > 0 (which is specified in the theorem), we define
$\begin{equation*}\bar{K}=\underset{\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{\mathrm{RF}}(\gamma )}{\mathrm{max}}\vert \boldsymbol{k}\vert .\end{equation*}$
We assume that σ verifies for $k\leqslant \bar{K}$ , μ_k(σ) ≠ 0. Furthermore we assume that for $k\leqslant \bar{K}$ , the kth weak derivative verifies almost surely σ^(k)(u)² ⩽ c₀ exp(c₁ u²/2) for some constants c₀ > 0 and c₁ < 1.

Assumption 2(a) implies that $\sigma \in {L}^{2}(\mathbb{R},\gamma )$ where $\gamma (\mathrm{d}x)={\mathrm{e}}^{-{x}^{2}/2}\enspace \mathrm{d}x/\sqrt{2\pi }$ is the standard Gaussian measure. We recall the Hermite decomposition of σ,

$\begin{equation}\sigma (x)=\sum\limits _{k=0}^{\infty }\frac{{\mu }_{k}(\sigma )}{k!}{\mathrm{He}}_{k}(x),\qquad {\mu }_{k}(\sigma )\equiv {\mathbb{E}}_{G\sim \mathsf{\text{N}}(0,1)}[\sigma (G){\mathrm{He}}_{k}(G)].\end{equation} \tag{ 45 }$

Theorem 6 (Risk of the RF model). Let ${\left\{{f}_{d}\in {L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})\right\}}_{d\geqslant 1}$ be a sequence of functions. Let $\boldsymbol{W}={({\boldsymbol{w}}_{i})}_{i\in [N]}$ with ${({\boldsymbol{w}}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1})$ independently. We have the following results.

(a)
Assume N ⩽ o_d(d^γ) for a fixed γ > 0. Let σ satisfy assumptions 2(a) and (b) at level γ. Then, for any ɛ > 0, the following holds with high probability:
$\begin{equation}\left\vert {R}_{\mathrm{RF}}({f}_{d},\boldsymbol{W}\hspace{2pt})-{R}_{\mathrm{RF}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})-{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant \varepsilon {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}},\end{equation} \tag{ 46 }$
where $\mathcal{Q}\equiv {\mathcal{Q}}_{\mathrm{RF}}(\gamma )$ is defined in equation (43).
(b)
Assume N ⩾ w_d(d^γ) for some positive constant γ > 0, and σ satisfy assumptions 2(a) and (c) at level γ. Then for any ɛ > 0, the following holds with high probability:
$\begin{equation}0\leqslant {R}_{\mathrm{RF}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})\leqslant \varepsilon {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\end{equation} \tag{ 47 }$
where $\mathcal{Q}\equiv {\bar{\mathcal{Q}}}_{\mathrm{RF}}(\gamma )$ is defined in equation (44).

See appendix E for the proof of the lower bound (46), and appendix F for the proof of the upper bound (47).

Remark 1. This theorems shows that for each $\gamma \notin (\xi -{\kappa }_{1}){\mathbb{Z}}_{\geqslant 0}+\cdots +(\xi -{\kappa }_{Q}){\mathbb{Z}}_{\geqslant 0}$ , we can decompose our functional space as

$\begin{equation*}{L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})=\mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )\oplus {\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma ),\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill \mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )=& \underset{\boldsymbol{k}\in {\mathcal{Q}}_{\mathrm{RF}}(\gamma )}{\bigoplus}{\boldsymbol{V}}_{\boldsymbol{k}}^{\boldsymbol{d}},\hfill \\ \hfill {\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )=& \underset{\boldsymbol{k}\notin {\mathcal{Q}}_{\mathrm{RF}}(\gamma )}{\bigoplus}{\boldsymbol{V}}_{\boldsymbol{k}}^{\boldsymbol{d}},\hfill \end{aligned}\end{equation*}$

such that for N = d^γ, RF model fits the subspace of low degree polynomials $\mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )$ and cannot fit ${\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )$ , i.e.

$\begin{equation*}{R}_{\mathrm{RF}}({f}_{d},\boldsymbol{W}\hspace{2pt})\approx {\Vert}{\mathsf{P}}_{{Q}_{\mathrm{RF}}{(\gamma )}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Remark 2. In other words, we can fit a polynomial of degree $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , if and only if

$\begin{equation*}{d}^{(\xi -{\kappa }_{1}){k}_{1}}\cdot \dots \cdot {d}^{(\xi -{\kappa }_{Q}){k}_{Q}}={d}_{1,\mathrm{eff}}^{{k}_{1}}\dots {d}_{Q,\mathrm{eff}}^{{k}_{Q}}={o}_{d}(N).\end{equation*}$

Each subspace has therefore an effective dimension ${d}_{q,\mathrm{eff}}\equiv {d}^{\xi -{\kappa }_{q}}={d}_{q}^{(\xi -{\kappa }_{q})/{\eta }_{q}}\asymp {D}^{(\xi -{\kappa }_{q})/{\mathrm{max}}_{q\in [Q]}{\eta }_{q}}$ . This can be understood intuitively as follows,

$\begin{equation*}\sigma \left(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R\right)=\sigma \left(\sum\limits _{q\in [Q]}\langle {\boldsymbol{\theta }}^{(q)},{\boldsymbol{x}}^{(q)}\rangle /R\right).\end{equation*}$

The term q_ξ (recall that q_ξ = argmax_q(η_q + κ_q) and $\xi ={\eta }_{{q}_{\xi }}+{\kappa }_{{q}_{\xi }}$ ) verifies $\langle {\boldsymbol{\theta }}^{({q}_{\xi })},{\boldsymbol{x}}^{({q}_{\xi })}\rangle /R={{\Theta}}_{d}(1)$ and has the same effective dimension ${d}_{{q}_{\xi },\mathrm{eff}}={d}^{{\eta }_{{q}_{\xi }}}$ has in the uniform case restricted to the sphere ${\mathbb{S}}^{{d}^{{\eta }_{q}}-1}(\sqrt{{d}^{{\eta }_{q}}})$ (the scaling of the sphere do not matter because of the global normalization factor R⁻¹). However, for η_q + κ_q < ξ, we have $\langle {\boldsymbol{\theta }}^{(q)},{\boldsymbol{x}}^{(q)}\rangle /R={{\Theta}}_{d}({d}^{({\eta }_{q}+{\kappa }_{q}-\xi )/2})$ and we will need ${d}^{\xi -{\kappa }_{q}-{\eta }_{q}}$ more neurons to capture the dependency on the qth sphere coordinates. The effective dimension is therefore given by ${d}_{q,\mathrm{eff}}={d}_{q}\cdot {d}^{\xi -{\kappa }_{q}-{\eta }_{q}}={d}^{\xi -{\kappa }_{q}}$ .

C.6. Approximation error of the neural tangent model

We consider the minimum population error for the random features model

$\begin{equation*}{R}_{\mathrm{NT}}({f}_{d},\boldsymbol{W}\hspace{2pt})=\underset{f\in {\mathcal{F}}_{\mathrm{NT}}(\boldsymbol{W}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}\left[{({f}_{\ast }(\boldsymbol{x})-f(\boldsymbol{x}))}^{2}\right].\end{equation*}$

For $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , we denote by S( k ) ⊆ [Q] the subset of indices q ∈ [Q] such that k_q > 0.

We define the sets

$\begin{equation}{\mathcal{Q}}_{\mathrm{NT}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}< \gamma +\left(\xi -\underset{q\in S(\boldsymbol{k})}{\mathrm{min}}\enspace {\kappa }_{q}\right)\right\},\end{equation} \tag{ 48 }$

$\begin{equation}{\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}\leqslant \gamma +\left(\xi -\underset{q\in S(\boldsymbol{k})}{\mathrm{min}}\enspace {\kappa }_{q}\right)\right\}.\end{equation} \tag{ 49 }$

Assumption 3. Let $\sigma :\mathbb{R}\to \mathbb{R}$ be an activation function.

(a)
The activation function σ is weakly differentiable with weak derivative σ'. There exists constants c₀, c₁, with c₀ > 0 and c₁ < 1 such that the activation function σ verifies σ'(u)² ⩽ c₀ exp(c₁ u²/2) almost surely for $u\in \mathbb{R}$ .
(b)
For γ > 0 (which is specified in the theorem), we denote L = max_q∈[Q]⌈γ/η_q⌉. We assume that σ' is L-weakly differentiable. Define
$\begin{equation*}K=\underset{\boldsymbol{k}\in {\mathcal{Q}}_{\mathrm{NT}}{(\gamma )}^{c}}{\mathrm{min}}\vert \boldsymbol{k}\vert .\end{equation*}$
We assume that for K − 1 ⩽ k ⩽ L, the kth weak derivative verifies almost surely σ^(k+1)(u)² ⩽ c₀ exp(c₁ u²/2) for some constants c₀ > 0 and c₁ < 1.Furthermore, we assume that σ' verifies a non-degeneracy condition. Recall that ${\mu }_{k}(h)\equiv {\mathbb{E}}_{G\sim \mathsf{\text{N}}(0,1)}[h(G){\mathrm{He}}_{k}(G)]$ denote the kth coefficient of the Hermite expansion of $h\in {L}_{2}(\mathbb{R},\gamma )$ (with γ the standard Gaussian measure). Then there exists k₁, k₂ ⩾ 2L + 7[max_q∈[Q] ξ/η_q] such that ${\mu }_{{k}_{1}}({\sigma }^{\prime }),{\mu }_{{k}_{2}}({\sigma }^{\prime })\ne 0$ and
$\begin{equation}\frac{{\mu }_{{k}_{1}}({x}^{2}{\sigma }^{\prime })}{{\mu }_{{k}_{1}}({\sigma }^{\prime })}\ne \frac{{\mu }_{{k}_{2}}({x}^{2}{\sigma }^{\prime })}{{\mu }_{{k}_{2}}({\sigma }^{\prime })}.\end{equation} \tag{ 50 }$
(c)
For γ > 0 (which is specified in the theorem), we define
$\begin{equation*}\bar{K}=\underset{\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )}{\mathrm{max}}\vert \boldsymbol{k}\vert .\end{equation*}$
We assume that σ verifies for $k\leqslant \bar{K}+1$ , μ_k(σ') = μ_k+1(σ) ≠ 0. Furthermore we assume that for $k\leqslant \bar{K}+1$ , the kth weak derivative verifies almost surely σ^(k+1)(u)² ⩽ c₀ exp(c₁ u²/2) for some constants c₀ > 0 and c₁ < 1.

Assumption 3(a) implies that ${\sigma }^{\prime }\in {L}^{2}(\mathbb{R},\gamma )$ where $\gamma (\mathrm{d}x)={\mathrm{e}}^{-{x}^{2}/2}\enspace \mathrm{d}x/\sqrt{2\pi }$ is the standard Gaussian measure. We recall the Hermite decomposition of σ':

$\begin{equation}{\sigma }^{\prime }(x)=\sum\limits _{k=0}^{\infty }\frac{{\mu }_{k}({\sigma }^{\prime })}{k!}{\mathrm{He}}_{k}(x),\qquad {\mu }_{k}({\sigma }^{\prime })\equiv {\mathbb{E}}_{G\sim \mathsf{\text{N}}(0,1)}[{\sigma }^{\prime }(G){\mathrm{He}}_{k}(G)].\end{equation} \tag{ 51 }$

In assumption 3(b), it is useful to notice that the Hermite coefficients of x² σ'(x) can be computed from the ones of σ'(x) using the relation μ_k(x² σ') = μ_k+2(σ') + [1 + 2k]μ_k(σ') + k(k − 1)μ_k−2(σ').

Theorem 7 (Risk of the NT model). Let ${\left\{{f}_{d}\in {L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})\right\}}_{d\geqslant 1}$ be a sequence of functions. Let $\boldsymbol{W}={({\boldsymbol{w}}_{i})}_{i\in [N]}$ with ${({\boldsymbol{w}}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1})$ independently. We have the following results.

(a)
Assume N ⩽ o_d(d^γ) for a fixed γ > 0. Let σ satisfy assumptions 3(a) and (b) at level γ. Then, for any ɛ > 0, the following holds with high probability:
$\begin{equation}\left\vert {R}_{\mathrm{NT}}({f}_{d},\boldsymbol{W}\hspace{2pt})-{R}_{\mathrm{NT}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})-{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\vert \leqslant \varepsilon {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}},\end{equation} \tag{ 52 }$
where $\mathcal{Q}\equiv {\mathcal{Q}}_{\mathrm{NT}}(\gamma )$ is defined in equation (48).
(b)
Assume N ⩾ w_d(d^γ) for some positive constant γ > 0, and σ satisfy assumptions 3(a) and (c) at level γ. Then for any ɛ > 0, the following holds with high probability:
$\begin{equation}0\leqslant {R}_{\mathrm{NT}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})\leqslant \varepsilon {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\end{equation} \tag{ 53 }$
where $\mathcal{Q}\equiv {\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )$ is defined in equation (49).

See appendix G for the proof of lower bound, and appendix H for the proof of upper bound.

Remark 3. This theorems shows that each for each γ > 0 such that ${\mathcal{Q}}_{\mathrm{NT}}{(\gamma )}^{c}\cap {\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )=\varnothing$ , we can decompose our functional space as

$\begin{equation*}{L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})=\mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )\oplus {\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma ),\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill \mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )=& \underset{\boldsymbol{k}\in {\mathcal{Q}}_{\mathrm{NT}}(\gamma )}{\bigoplus}{\boldsymbol{V}}_{\boldsymbol{k}}^{\boldsymbol{d}},\hfill \\ \hfill {\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )=& \underset{\boldsymbol{k}\notin {\mathcal{Q}}_{\mathrm{NT}}(\gamma )}{\bigoplus}{\boldsymbol{V}}_{\boldsymbol{k}}^{\boldsymbol{d}},\hfill \end{aligned}\end{equation*}$

such that for N = d^γ, NT model fits the subspace of low degree polynomials $\mathcal{F}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )$ and cannot fit ${\mathcal{F}}^{c}(\boldsymbol{\beta },\boldsymbol{\kappa },\gamma )$ at all, i.e.

$\begin{equation*}{R}_{\mathrm{NT}}({f}_{d},\boldsymbol{W}\hspace{2pt})\approx {\Vert}{\mathsf{P}}_{{\mathcal{Q}}_{\mathrm{NT}}{(\gamma )}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Remark 4. In other words, we can fit a polynomial of degree $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , if and only if

$\begin{equation*}{d}^{(\xi -{\kappa }_{1}){k}_{1}}\cdot \dots \cdot {d}^{(\xi -{\kappa }_{Q}){k}_{Q}}={d}_{1,\mathrm{eff}}^{{k}_{1}}\dots {d}_{Q,\mathrm{eff}}^{{k}_{Q}}={o}_{d}({d}^{\beta }N),\end{equation*}$

where β = ξ − min_{q∈S(
k
)} κ_q.

C.7. Connecting to the theorems in the main text

Let us connect the above general results to the two-spheres setting described in the main text. We consider two spheres with η₁ = η, κ₁ = κ for the first sphere, and η₂ = 1, κ₂ = 0 for the second sphere. We have ξ = max(η + κ, 1).

Let w_d(d^γ log d) ⩽ n ⩽ O_d(d^γ+δ) with δ > 0 constant sufficiently small, then by theorem 5 the function subspace learned by KRR is given by the polynomials of degree k₁ in the first sphere coordinates and k₂ in the second sphere with

$\begin{equation*}\mathrm{max}(\eta ,1-\kappa ){k}_{1}+\mathrm{max}(\eta +\kappa ,1){k}_{2}< \gamma .\end{equation*}$

We consider functions that only depend on the first sphere, i.e. k₂ = 0 and denote d_eff = d^{max(η,1−κ)}. Then the subspace of approximation is given by the k polynomials in the first sphere such that ${d}_{\mathrm{eff}}^{k}\leqslant {d}^{\gamma }$ . Furthermore, one can check that the assumptions listed in theorem 1 in the main text verifies assumption 1.

Similarly, for w_d(d^γ) ⩽ N ⩽ O_d(d^γ+δ) with δ > 0 constant sufficiently small, theorem 6 implies that the RF models can only approximate k polynomials in the first sphere such that ${d}_{\mathrm{eff}}^{k}\leqslant {d}^{\gamma }$ . Furthermore, assumptions listed in theorem 2 in the main text verifies assumption 2.

In the case of NT, we only consider k = (k₁, 0) and S( k ) = {1}. We get min_{q∈S(
k
)} κ_q = κ. The subspace approximated is given by the k polynomials in the first sphere such that ${d}_{\mathrm{eff}}^{k}\leqslant {d}^{\gamma }{d}_{\mathrm{eff}}$ . Furthermore, assumptions listed in theorem 3 in the main text verifies assumption 3.

Appendix D.: Proof of theorem 5

The proof follows closely the proof of theorem 4 in [21].

D.1. Preliminaries

Let us rewrite the kernel functions ${\left\{{h}_{d}\right\}}_{d\geqslant 1}$ as functions on the product of normalized spheres: for $\boldsymbol{x}={\left\{{\boldsymbol{x}}^{(q)}\right\}}_{q\in [Q]}$ and $\boldsymbol{y}={\left\{{\boldsymbol{y}}^{(q)}\right\}}_{q\in [Q]}\in {\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}$ :

$\begin{align}\hfill {h}_{d}(\langle \boldsymbol{y},\boldsymbol{x}\rangle /{R}^{2})& ={h}_{d}\left(\sum\limits _{q\in [Q]}({r}_{q}^{2}/{R}^{2}\sqrt{{d}_{q}})\cdot \langle {\bar{\boldsymbol{y}}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right)\hfill \\ \hfill & \equiv {h}_{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{y}}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 54 }$

Consider the expansion of h_d in terms of tensor product of Gegenbauer polynomials. We have

$\begin{equation*}{h}_{d}(\langle \boldsymbol{y},\boldsymbol{x}\rangle /{R}^{2})=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{y}}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation*}$

where

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{h}_{\boldsymbol{d}}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right],\end{equation*}$

where the expectation is taken over $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}$ .

Lemma 2. Let ${\left\{{h}_{d}\right\}}_{d\geqslant 1}$ be a sequence of kernel functions that satisfies assumption 1. Assume w_d(d^γ) ⩽ n ⩽ o_d(d^m(γ)) for some γ > 0. Consider $\mathcal{Q}={\bar{\mathcal{Q}}}_{\mathrm{KRR}}(\gamma )$ as defined in equation (41). Then there exists constants c, C > 0 such that for d large enough,

$\begin{equation*}\begin{aligned}\hfill \underset{\boldsymbol{k}\notin \mathcal{Q}}{\mathrm{max}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})& \leqslant C{d}^{-m(\gamma )},\hfill \\ \hfill \underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{min}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})& \geqslant c{d}^{-\gamma }.\hfill \end{aligned}\end{equation*}$

Proof of lemma 2. Notice that by lemma 18,

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})=\left(\prod\limits _{q\in [Q]}{\alpha }_{q}^{{k}_{q}}\right)\cdot R(\boldsymbol{d},\boldsymbol{k})\cdot {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{{k}_{q}}\right)\cdot {h}_{d}^{(\vert \boldsymbol{k}\vert )}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right],\end{equation*}$

where ${\alpha }_{q}={d}_{q}^{-1/2}{r}_{q}^{2}/{R}^{2}=(1+{o}_{d}(1)){d}^{{\eta }_{q}/2+{\kappa }_{q}-\xi }$ . By assumption 1(a), we have

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})B(\boldsymbol{d},\boldsymbol{k})\leqslant C\prod\limits _{q\in [Q]}{d}^{({\kappa }_{q}-\xi ){k}_{q}}.\end{equation*}$

Furthermore, by assumption 1(b) and dominated convergence,

$\begin{equation*}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{{k}_{q}}\right)\cdot {h}_{d}^{(\vert \boldsymbol{k}\vert )}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right]\to {h}_{d}^{(\vert \boldsymbol{k}\vert )}(0)\geqslant c > 0,\end{equation*}$

for $k\geqslant \bar{K}$ . The lemma then follows from the same proof as in lemmas 9 and 10, where we adapt the proofs of lemmas 19 and 20 to h_d. □

D.2. Proof of theorem 5

Step 1. Rewrite the y , E , H , M matrices.

The test error of empirical KRR gives

$\begin{align*}\hfill {R}_{\mathrm{KRR}}({f}_{d},\boldsymbol{X},\lambda )& \equiv {\mathbb{E}}_{\boldsymbol{x}}\left[{\left({f}_{d}(\boldsymbol{x})-{\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{h}(\boldsymbol{x})\left. \right)\right)}^{2}\right]\hfill \\ \hfill & ={\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}{(\boldsymbol{x})}^{2}]-2{\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{E}+{\boldsymbol{y}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{y},\hfill \end{align*}$

where $\boldsymbol{E}={({E}_{1},\dots ,{E}_{n})}^{\mathsf{T}}$ and $\boldsymbol{M}={({M}_{ij})}_{ij\in [n]}$ , with

$\begin{equation*}\begin{aligned}\hfill {E}_{i}& ={\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}(\boldsymbol{x}){h}_{d}(\langle \boldsymbol{x},{\boldsymbol{x}}_{i}\rangle /d)],\hfill \\ \hfill {M}_{ij}& ={\mathbb{E}}_{\boldsymbol{x}}[{h}_{d}(\langle {\boldsymbol{x}}_{i},\boldsymbol{x}\rangle /d){h}_{d}(\langle {\boldsymbol{x}}_{j},\boldsymbol{x}\rangle /d)].\hfill \end{aligned}\end{equation*}$

Let $B={\sum }_{\boldsymbol{k}\in \mathcal{Q}}B(\boldsymbol{d},\boldsymbol{k})$ . Define for any $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ ,

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{D}}_{\boldsymbol{k}}& ={\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}}){\mathbf{I}}_{B(\boldsymbol{d},\boldsymbol{k})},\hfill \\ \hfill {\boldsymbol{Y}}_{\boldsymbol{k}}& ={({Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}))}_{i\in [n],\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\in {\mathbb{R}}^{n\times B(\boldsymbol{d},\boldsymbol{k})},\hfill \\ \hfill {\boldsymbol{\lambda }}_{\boldsymbol{k}}& ={({\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({f}_{d}))}_{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}^{\mathsf{T}}\in {\mathbb{R}}^{B(\boldsymbol{d},\boldsymbol{k})},\hfill \\ \hfill {\boldsymbol{D}}_{\mathcal{Q}}& =\mathrm{diag}\left({\left({\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}}){\mathbf{I}}_{B(\boldsymbol{d},\boldsymbol{k})}\right)}_{\boldsymbol{k}\in \mathcal{Q}}\right)\in {\mathbb{R}}^{B\times B}\hfill \\ \hfill {\boldsymbol{Y}}_{\mathcal{Q}}& ={({\boldsymbol{Y}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{Q}}\in {\mathbb{R}}^{n\times B},\hfill \\ \hfill {\boldsymbol{\lambda }}_{\mathcal{Q}}& ={\left({\left({\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}\right)}_{\boldsymbol{k}\in \mathcal{Q}}\right)}^{\mathsf{T}}\in {\mathbb{R}}^{B}.\hfill \end{aligned}\end{equation*}$

Let the spherical harmonics decomposition of f_d be

$\begin{equation*}{f}_{d}(\boldsymbol{x})=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({f}_{d}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}),\end{equation*}$

and the Gegenbauer decomposition of h_d be

$\begin{equation*}{h}_{\boldsymbol{d}}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right).\end{equation*}$

We write the decompositions of vectors f , E , H , and M . We have

$\begin{equation*}\begin{aligned}\hfill \boldsymbol{f}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}+\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}},\hfill \\ \hfill \boldsymbol{E}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}+\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}},\hfill \\ \hfill \boldsymbol{H}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}},\hfill \\ \hfill \boldsymbol{M}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}^{2}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

From lemma 4, we can rewrite

$\begin{equation*}\begin{aligned}\hfill \boldsymbol{H}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{h}({\mathbf{I}}_{n}+{\mathbf{\Delta }}_{h}),\hfill \\ \hfill \boldsymbol{M}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{u}{\mathbf{\Delta }}_{u},\hfill \end{aligned}\end{equation*}$

where κ_h = Θ_d(1), κ_u = O_d(d^−m(γ)), ${\Vert}{\mathbf{\Delta }}_{h}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ and ${\Vert}{\mathbf{\Delta }}_{u}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(1)$ .

Step 2. Decompose the risk

The rest of the proof follows closely from theorem 4 in [21]. We decompose the risk as follows

$\begin{equation*}{R}_{\mathrm{KRR}}({f}_{d},\boldsymbol{X},\lambda )={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2{T}_{1}+{T}_{2}+{T}_{3}-2{T}_{4}+2{T}_{5},\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill {T}_{1}& ={\boldsymbol{f}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{E},\hfill \\ \hfill {T}_{2}& ={\boldsymbol{f}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{f},\hfill \\ \hfill {T}_{3}& ={\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{\varepsilon },\hfill \\ \hfill {T}_{4}& ={\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{E},\hfill \\ \hfill {T}_{5}& ={\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{f}.\hfill \end{aligned}\end{equation*}$

Further, we denote ${\boldsymbol{f}}_{\mathcal{Q}}$ , ${\boldsymbol{f}}_{{\mathcal{Q}}^{c}}$ , ${\boldsymbol{E}}_{\mathcal{Q}}$ , and ${\boldsymbol{E}}_{{\mathcal{Q}}^{c}}$ ,

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{f}}_{\mathcal{Q}}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}},\hfill & \hfill {\boldsymbol{E}}_{\mathcal{Q}}& ={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}},\hfill \\ \hfill {\boldsymbol{f}}_{{\mathcal{Q}}^{c}}& =\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}},\hfill & \hfill {\boldsymbol{E}}_{{\mathcal{Q}}^{c}}& =\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}.\hfill \end{aligned}\end{equation*}$

Step 3. Term T₂

Note we have

$\begin{equation*}{T}_{2}={T}_{21}+{T}_{22}+{T}_{23},\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill {T}_{21}& ={\boldsymbol{f}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{\mathcal{Q}},\hfill \\ \hfill {T}_{22}& =2{\boldsymbol{f}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{{\mathcal{Q}}^{c}},\hfill \\ \hfill {T}_{23}& ={\boldsymbol{f}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{{\mathcal{Q}}^{c}}.\hfill \end{aligned}\end{equation*}$

By lemma 6, we have

$\begin{equation}{\Vert}n{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}-{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/n{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1),\end{equation} \tag{ 55 }$

hence

$\begin{equation*}\begin{aligned}\hfill {T}_{21}& ={\boldsymbol{\lambda }}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}\hfill \\ \hfill & ={\boldsymbol{\lambda }}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}/{n}^{2}+[{\Vert}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}/n]\cdot {o}_{d,\mathbb{P}}(1).\hfill \end{aligned}\end{equation*}$

By lemma 3, we have (with ${\Vert}\mathbf{\Delta }{{\Vert}}_{2}={o}_{d,\mathbb{P}}(1)$ )

$\begin{equation*}{\boldsymbol{\lambda }}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}/{n}^{2}={\boldsymbol{\lambda }}_{\mathcal{Q}}^{\mathsf{T}}{({\mathbf{I}}_{B}+\mathbf{\Delta })}^{2}{\boldsymbol{\lambda }}_{\mathcal{Q}}={\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}(1+{o}_{d,\mathbb{P}}(1)).\end{equation*}$

Moreover, we have

$\begin{equation*}{\Vert}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}/n={\boldsymbol{\lambda }}_{\mathcal{Q}}^{\mathsf{T}}({\mathbf{I}}_{B}+\mathbf{\Delta }){\boldsymbol{\lambda }}_{\mathcal{Q}}={\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}(1+{o}_{d,\mathbb{P}}(1)).\end{equation*}$

As a result, we have

$\begin{equation}{T}_{21}={\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}(1+{o}_{d,\mathbb{P}}(1))={\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}(1+{o}_{d,\mathbb{P}}(1)).\end{equation} \tag{ 56 }$

By equation (55) again, we have

$\begin{equation*}\begin{aligned}\hfill {T}_{23}& =\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right){(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}\right)\hfill \\ \hfill & =\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right){\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}\right)/{n}^{2}+\left[{{\Vert}\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}{\Vert}}_{2}^{2}/n\right]\cdot {o}_{d,\mathbb{P}}(1).\hfill \end{aligned}\end{equation*}$

By lemma 5, we have

$\begin{align*}\hfill \mathbb{E}\left[\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right){\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}\right)\right]/{n}^{2}& =\sum\limits _{\boldsymbol{u},\boldsymbol{v}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{u}}^{\mathsf{T}}\left\{\mathbb{E}[({\boldsymbol{Y}}_{\boldsymbol{u}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{v}})]/{n}^{2}\right\}{\boldsymbol{\lambda }}_{\boldsymbol{v}}\hfill \\ \hfill & =\frac{B}{n}\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\Vert}{\boldsymbol{\lambda }}_{\boldsymbol{k}}{{\Vert}}_{2}^{2}.\hfill \end{align*}$

Moreover

$\begin{equation*}\mathbb{E}\left[{{\Vert}\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}{\Vert}}_{2}^{2}/n\right]=\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\Vert}{\boldsymbol{\lambda }}_{\boldsymbol{k}}{{\Vert}}_{2}^{2}={\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

This gives

$\begin{equation}{T}_{23}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 57 }$

Using Cauchy Schwarz inequality for T₂₂, we get

$\begin{equation}{T}_{22}\leqslant 2{({T}_{21}{T}_{23})}^{1/2}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}.\end{equation} \tag{ 58 }$

As a result, combining equations (56)–(58), we have

$\begin{equation}{T}_{2}={\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{o}_{d,\mathbb{P}}(1)\cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 59 }$

Step 4. Term T₁. Note we have

$\begin{equation*}{T}_{1}={T}_{11}+{T}_{12}+{T}_{13},\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill {T}_{11}& ={\boldsymbol{f}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{E}}_{\mathcal{Q}},\hfill \\ \hfill {T}_{12}& ={\boldsymbol{f}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{E}}_{\mathcal{Q}},\hfill \\ \hfill {T}_{13}& ={\boldsymbol{f}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{E}}_{{\mathcal{Q}}^{c}}.\hfill \end{aligned}\end{equation*}$

By lemma 7, we have

$\begin{equation*}{\Vert}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}-{\mathbf{I}}_{B}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1),\end{equation*}$

so that

$\begin{equation}{T}_{11}={\boldsymbol{\lambda }}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}={\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}^{2}(1+{o}_{d,\mathbb{P}}(1))={\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{2}^{2}(1+{o}_{d,\mathbb{P}}(1)).\end{equation} \tag{ 60 }$

Using Cauchy Schwarz inequality for T₁₂, and by the expression of $\boldsymbol{M}={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{u}{\mathbf{\Delta }}_{u}$ with ${\Vert}{\mathbf{\Delta }}_{u}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(1)$ and κ_u = O_d(d^−m(λ)), we get with high probability

$\begin{align}\hfill \vert {T}_{12}\vert & =\left\vert \sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{\lambda }}_{\mathcal{Q}}\right\vert \hfill \\ \hfill & \leqslant {{\Vert}\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\Vert}}_{2}{\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}\hfill \\ \hfill & ={\left[\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right){(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right)\right]}^{1/2}{\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}\hfill \\ \hfill & \leqslant {\left[\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right){(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\left(\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{\lambda }}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}\right)\right]}^{1/2}{\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}\hfill \\ \hfill & ={T}_{23}^{1/2}{\Vert}{\boldsymbol{\lambda }}_{\mathcal{Q}}{{\Vert}}_{2}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}.\hfill \end{align} \tag{ 61 }$

For term T₁₃, we have

$\begin{equation*}\vert {T}_{13}\vert =\vert {\boldsymbol{f}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{E}}_{{\mathcal{Q}}^{c}}\vert \leqslant {\Vert}\boldsymbol{f}\hspace{2pt}{{\Vert}}_{2}{\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{E}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}.\end{equation*}$

Note we have $\mathbb{E}[{\Vert}\boldsymbol{f}\hspace{2pt}{{\Vert}}_{2}^{2}]=n{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}$ , and ${\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}\leqslant 2/({\kappa }_{h}+\lambda )$ with high probability, and

$\begin{equation*}\mathbb{E}[{\Vert}{\boldsymbol{E}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}^{2}]=n\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}{\Vert}{\mathsf{P}}_{\boldsymbol{k}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\leqslant n\left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}\right]{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

As a result, we have

$\begin{align}\hfill \vert {T}_{13}\vert & \leqslant {O}_{d}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}{\left[{n}^{2}\enspace \underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}\right]}^{1/2}/({\kappa }_{h}+\lambda )\hfill \\ \hfill & ={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}},\hfill \end{align} \tag{ 62 }$

where the last equality used the fact that n ⩽ O_d(d^m(γ)−δ) and lemma 2. Combining equations (60)–(62), we get

$\begin{equation}{T}_{1}={\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{o}_{d,\mathbb{P}}(1)\cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 63 }$

Step 5. Terms T₃, T₄ and T₅. By lemma 6 again, we have

$\begin{equation*}{\mathbb{E}}_{\boldsymbol{\varepsilon }}[{T}_{3}]/{\tau }^{2}=\text{tr}({(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1})=\text{tr}({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/{n}^{2})+{o}_{d,\mathbb{P}}(1).\end{equation*}$

By lemma 3, we have

$\begin{equation*}\text{tr}({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/{n}^{2})=\text{tr}({\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{Q}})/{n}^{2}=nB/{n}^{2}+{o}_{d,\mathbb{P}}(1)={o}_{d,\mathbb{P}}(1).\end{equation*}$

This gives

$\begin{equation}{T}_{3}={o}_{d,\mathbb{P}}(1)\cdot {\tau }^{2}.\end{equation} \tag{ 64 }$

Let us consider T₄ term:

$\begin{equation*}\begin{aligned}\hfill {\mathbb{E}}_{\boldsymbol{\varepsilon }}[{T}_{4}^{2}]/{\tau }^{2}& ={\mathbb{E}}_{\boldsymbol{\varepsilon }}[{\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{E}{\boldsymbol{E}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{\varepsilon }]/{\tau }^{2}\hfill \\ \hfill & ={\boldsymbol{E}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}\boldsymbol{E}.\hfill \end{aligned}\end{equation*}$

For any integer L, denote $\mathcal{L}\equiv {[0,L]}^{Q}\cap {\mathbb{Z}}_{\geqslant 0}^{Q}$ , and ${\boldsymbol{Y}}_{\mathcal{L}}={({\boldsymbol{Y}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{L}}$ and ${\boldsymbol{D}}_{\mathcal{L}}={({\boldsymbol{D}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{L}}$ . Then notice that by lemmas 3, 6 and the definition of M , we get

$\begin{equation*}\begin{aligned}\hfill {\Vert}{\boldsymbol{D}}_{\mathcal{L}}{\boldsymbol{Y}}_{\mathcal{L}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}{\boldsymbol{Y}}_{\mathcal{L}}{\boldsymbol{D}}_{\mathcal{L}}{{\Vert}}_{\mathrm{op}}& ={\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{L}}{\boldsymbol{D}}_{\mathcal{L}}^{2}{\boldsymbol{Y}}_{\mathcal{L}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}\hfill \\ \hfill & \leqslant {\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}.\hfill \\ \hfill & \leqslant {\Vert}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/n{{\Vert}}_{\mathrm{op}}/n+{o}_{\mathbb{P},d}(1)\cdot /n\hfill \\ \hfill & ={o}_{d,\mathbb{P}}(1).\hfill \end{aligned}\end{equation*}$

Therefore,

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{E}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}\boldsymbol{E}& =\underset{L\to \infty }{\mathrm{lim}}\enspace {\boldsymbol{E}}_{\mathcal{L}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}{\boldsymbol{E}}_{\mathcal{L}}\hfill \\ \hfill & =\underset{L\to \infty }{\mathrm{lim}}\enspace {\boldsymbol{\lambda }}_{\mathcal{L}}^{\mathsf{T}}[{\boldsymbol{D}}_{\mathcal{L}}{\boldsymbol{Y}}_{\mathcal{L}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}{\boldsymbol{Y}}_{\mathcal{L}}{\boldsymbol{D}}_{\mathcal{L}}]{\boldsymbol{\lambda }}_{\mathcal{L}}\hfill \\ \hfill & \leqslant {\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}\cdot \underset{L\to \infty }{\mathrm{lim}}{\Vert}{\boldsymbol{\lambda }}_{\mathcal{L}}{{\Vert}}_{2}^{2}\hfill \\ \hfill & \leqslant {o}_{d,\mathbb{P}}(1)\cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\hfill \end{aligned}\end{equation*}$

which gives

$\begin{equation}{T}_{4}={o}_{d,\mathbb{P}}(1)\cdot \tau {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}={o}_{d,\mathbb{P}}(1)\cdot ({\tau }^{2}+{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}).\end{equation} \tag{ 65 }$

We decompose T₅ using $\boldsymbol{f}={\boldsymbol{f}}_{\mathcal{Q}}+{\boldsymbol{f}}_{{\mathcal{Q}}^{c}}$ ,

$\begin{equation*}{T}_{5}={T}_{51}+{T}_{52},\end{equation*}$

where

$\begin{equation*}\begin{aligned}\hfill {T}_{51}& ={\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{\mathcal{Q}},\hfill \\ \hfill {T}_{52}& ={\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{{\mathcal{Q}}^{c}}.\hfill \end{aligned}\end{equation*}$

First notice that

$\begin{equation*}{\Vert}{\boldsymbol{M}}^{1/2}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}{\boldsymbol{M}}^{1/2}{{\Vert}}_{\mathrm{op}}={\Vert}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation*}$

Then by lemma 6, we get

$\begin{align*}\hfill {\mathbb{E}}_{\boldsymbol{\varepsilon }}[{T}_{51}^{2}]/{\tau }^{2}& ={\mathbb{E}}_{\boldsymbol{\varepsilon }}[{\boldsymbol{\varepsilon }}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{\mathcal{Q}}{\boldsymbol{f}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{\varepsilon }]/{\tau }^{2}\hfill \\ \hfill & ={\boldsymbol{f}}_{\mathcal{Q}}^{\mathsf{T}}{[{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}]}^{2}{\boldsymbol{f}}_{\mathcal{Q}}\hfill \\ \hfill & \leqslant {\Vert}{\boldsymbol{M}}^{1/2}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-2}{\boldsymbol{M}}^{1/2}{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{M}}^{1/2}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{f}}_{\mathcal{Q}}{{\Vert}}_{2}^{2}\hfill \\ \hfill & ={o}_{d,\mathbb{P}}(1)\cdot {T}_{21}\hfill \\ \hfill & ={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{align*}$

Similarly, we get

$\begin{equation*}{\mathbb{E}}_{\boldsymbol{\varepsilon }}[{T}_{52}^{2}]/{\tau }^{2}={o}_{d,\mathbb{P}}(1)\cdot {T}_{23}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

By Markov's inequality, we deduce that

$\begin{equation}{T}_{5}={o}_{d,\mathbb{P}}(1)\cdot \tau ({\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}+{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}})={o}_{d,\mathbb{P}}(1)\cdot ({\tau }^{2}+{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}).\end{equation} \tag{ 66 }$

Step 6. Finish the proof.

Combining equations (59) and (63)–(66), we have

$\begin{equation*}\begin{aligned}\hfill {R}_{\mathrm{KRR}}({f}_{d},\boldsymbol{X},\lambda )& ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2{T}_{1}+{T}_{2}+{T}_{3}-2{T}_{4}+2{T}_{5}\hfill \\ \hfill & ={\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{o}_{d,\mathbb{P}}(1)\cdot ({\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{\tau }^{2}),\hfill \end{aligned}\end{equation*}$

which concludes the proof.

D.3. Auxiliary results

Lemma 3. Let ${\left\{{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}$ be the collection of tensor product of spherical harmonics on PS^d. Let ${({\bar{\boldsymbol{x}}}_{i})}_{i\in [n]}\hspace{2pt}{\sim }_{iid}\hspace{2pt}\mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ . Denote

$\begin{equation*}{\boldsymbol{Y}}_{\boldsymbol{k}}={({Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}))}_{i\in [n],\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\in {\mathbb{R}}^{n\times B(\boldsymbol{d},\boldsymbol{k})}.\end{equation*}$

Assume that n ⩾ w_d(d^γ log d) and consider

$\begin{equation*}\mathcal{R}=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q\in [Q]}{\eta }_{q}{k}_{q}< m(\gamma )\right\}.\end{equation*}$

Denote $A={\sum }_{\boldsymbol{k}\in \mathcal{R}}B(\boldsymbol{d},\boldsymbol{k})$ and

$\begin{equation*}{\boldsymbol{Y}}_{\mathcal{R}}={({\boldsymbol{Y}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{R}}\in {\mathbb{R}}^{n\times A}.\end{equation*}$

Then we have

$\begin{equation*}{\boldsymbol{Y}}_{\mathcal{R}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{R}}/n={\mathbf{I}}_{A}+\mathbf{\Delta },\end{equation*}$

with $\mathbf{\Delta }\in {\mathbb{R}}^{A\times A}$ and $\mathbb{E}[{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}]={o}_{d}(1)$ .

Proof of lemma 3. Let $\mathbf{\Psi }={\boldsymbol{Y}}_{\mathcal{R}}^{\mathsf{T}}{\boldsymbol{Y}}_{\mathcal{R}}/n\in {\mathbb{R}}^{A\times A}$ . We can rewrite Ψ as

$\begin{equation*}\mathbf{\Psi }=\frac{1}{n}\sum\limits _{i=1}^{n}{\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}^{\mathsf{T}},\end{equation*}$

where ${\boldsymbol{h}}_{i}={({Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}))}_{\boldsymbol{k}\in \mathcal{R},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\in {\mathbb{R}}^{A}$ . We use matrix Bernstein inequality. Denote ${\boldsymbol{X}}_{i}={\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}-{\mathbf{I}}_{A}\in {\mathbb{R}}^{A\times A}$ . Then we have $\mathbb{E}[{\boldsymbol{X}}_{i}]=\mathbf{0}$ , and

$\begin{equation*}\begin{aligned}\hfill {\Vert}{\boldsymbol{X}}_{i}{{\Vert}}_{\mathrm{op}}\leqslant {\Vert}{\boldsymbol{h}}_{i}{{\Vert}}_{2}^{2}+1& =\sum\limits _{\boldsymbol{k}\in \mathcal{R}}\hspace{2pt}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{\boldsymbol{x}}}_{i})}^{2}+1\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k}\in \mathcal{R}}B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}_{i}^{(q)},{\bar{\boldsymbol{x}}}_{i}^{(q)}\rangle \right\}}_{q\in [Q]}\right)+1=A+1,\hfill \end{aligned}\end{equation*}$

where we use formula (22) and the normalization ${Q}_{\boldsymbol{k}}^{\boldsymbol{d}}({d}_{1},\dots ,{d}_{Q})=1$ . Denote $V={\Vert}{\sum }_{i=1}^{n}\mathbb{E}[{\boldsymbol{X}}_{i}^{2}]{{\Vert}}_{\mathrm{op}}$ . Then we have

$\begin{align*}\hfill V& =n{\Vert}\mathbb{E}[{({\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}^{\mathsf{T}}-{\mathbf{I}}_{A})}^{2}]{{\Vert}}_{\mathrm{op}}=n{\Vert}\mathbb{E}[{\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}^{\mathsf{T}}{\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}^{\mathsf{T}}-2{\boldsymbol{h}}_{i}{\boldsymbol{h}}_{i}^{\mathsf{T}}+{\mathbf{I}}_{A}]{{\Vert}}_{\mathrm{op}}\hfill \\ \hfill & =n{\Vert}(A-1){\mathbf{I}}_{A}{{\Vert}}_{\mathrm{op}}=n(A-1),\hfill \end{align*}$

where we used ${\boldsymbol{h}}_{i}^{\mathsf{T}}{\boldsymbol{h}}_{i}={\Vert}{\boldsymbol{h}}_{i}{{\Vert}}_{2}^{2}=A$ and $\mathbb{E}[{\boldsymbol{h}}_{i}({\bar{\boldsymbol{x}}}_{i}){\boldsymbol{h}}_{i}^{\mathsf{T}}({\bar{\boldsymbol{x}}}_{i})]={(\mathbb{E}[{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})])}_{\boldsymbol{k}\boldsymbol{s},{\boldsymbol{k}}^{\prime }{\boldsymbol{s}}^{\prime }}={\mathbf{I}}_{A}$ . As a result, we have for any t > 0,

$\begin{align}\hfill \mathbb{P}({\Vert}\mathbf{\Psi }-{\mathbf{I}}_{A}{{\Vert}}_{\mathrm{op}}\geqslant t)& \leqslant A\enspace \mathrm{exp}\left\{-{n}^{2}{t}^{2}/[2n(A-1)+2(A+1)nt/3]\right\}\hfill \\ \hfill & \leqslant \mathrm{exp}\left\{-(n/A){t}^{2}/[10(1+t)]+\mathrm{log}\enspace A\right\}.\hfill \end{align} \tag{ 67 }$

Notice that there exists C > 0 such that $A\leqslant C{\mathrm{max}}_{\boldsymbol{k}\in \mathcal{R}}{\prod }_{q\in [Q]}{d}^{{\eta }_{q}{k}_{q}}\leqslant C{d}^{\gamma }$ (by definition of m(γ) and $\mathcal{R}$ ) and therefore n ⩾ w_d(A log A). Integrating the tail bound (67) proves the lemma. □

Lemma 4. Let σ be an activation function satisfying assumption 1. Let w_d(d^γ log d) ⩽ n ⩽ O_d(d^m(γ)−δ) for some γ > 0 and δ > 0. Then there exists sequences κ_h and κ_u such that

$\begin{equation}\boldsymbol{H}=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{h}({\mathbf{I}}_{n}+{\mathbf{\Delta }}_{h}),\end{equation} \tag{ 68 }$

$\begin{equation}\boldsymbol{M}=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}^{2}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}={\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{m}{\mathbf{\Delta }}_{m},\end{equation} \tag{ 69 }$

where κ_h = Θ_d(1), κ_m = O_d(d^−m(γ)), ${\Vert}{\mathbf{\Delta }}_{h}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ and ${\Vert}{\mathbf{\Delta }}_{m}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(1)$ .

Proof of lemma 4. Define

$\begin{equation*}\begin{aligned}\hfill \mathcal{R}=& \left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q\in [Q]}{\eta }_{q}{k}_{q}< m(\gamma )\right\},\hfill \\ \hfill \mathcal{S}=& \left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q\in [Q]}{\eta }_{q}{k}_{q}\geqslant m(\gamma )\right\},\hfill \end{aligned}\end{equation*}$

such that $\mathcal{R}\cup \mathcal{S}={\mathbb{Z}}_{\geqslant 0}^{Q}$ . The proof comes from bounding the eigenvalues of the matrix ${\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}$ for $\boldsymbol{k}\in \mathcal{R}$ and $\boldsymbol{k}\in \mathcal{S}$ separately. From corollary 1, we have

$\begin{equation*}\underset{\boldsymbol{k}\in \mathcal{S}}{\mathrm{sup}}{\Vert}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}/B(\boldsymbol{d},\boldsymbol{k})-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation*}$

Hence, we can write

$\begin{equation}\sum\limits _{\boldsymbol{k}\in \mathcal{S}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}={\kappa }_{h}({\mathbf{I}}_{n}+{\mathbf{\Delta }}_{h,1}),\end{equation} \tag{ 70 }$

with ${\kappa }_{h}={\sum }_{\boldsymbol{k}\in \mathcal{S}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})B(\boldsymbol{d},\boldsymbol{k})={O}_{d}(1)$ . From assumption 1(b) and a proof similar to lemma 20, there exists k = (0, ..., k, ..., 0) (for k > L at position q_ξ) such that $\mathrm{lim}{\mathrm{inf}}_{d\to \infty }\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})B(\boldsymbol{d},\boldsymbol{k}) > 0$ . Hence, κ_h = Θ_d(1).

From lemma 3 we have for $\boldsymbol{k}\in \mathcal{R}\cap {\mathcal{Q}}^{c}$ ,

$\begin{equation*}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{k}}/n={\mathbf{I}}_{B(\boldsymbol{d},\boldsymbol{k})}+\mathbf{\Delta },\end{equation*}$

with ${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ . We deduce that ${\Vert}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(n)$ . Hence,

$\begin{equation*}{\Vert}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(n{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}}))={o}_{d,\mathbb{P}}(1),\end{equation*}$

where we used lemma 2. We deduce that

$\begin{equation}\sum\limits _{\boldsymbol{k}\in \mathcal{R}\cap {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}={\kappa }_{h}{\mathbf{\Delta }}_{h,2},\end{equation} \tag{ 71 }$

with ${\Vert}{\mathbf{\Delta }}_{h,2}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ where we used ${\kappa }_{h}^{-1}={O}_{d}(1)$ . Combining equations (70) and (71) yields equation (68).

Similarly, we get

$\begin{equation*}\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{D}}_{\boldsymbol{k}}^{2}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}=\sum\limits _{\boldsymbol{k}\in \mathcal{R}\cap {\mathcal{Q}}^{c}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}n]{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}/n+\sum\limits _{\boldsymbol{k}\in \mathcal{S}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}B(\boldsymbol{d},\boldsymbol{k})]{\boldsymbol{Y}}_{\boldsymbol{k}}{\boldsymbol{Y}}_{\boldsymbol{k}}^{\mathsf{T}}/B(\boldsymbol{d},\boldsymbol{k}).\end{equation*}$

Using lemma 2, we have ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}n\leqslant C{d}^{-2m(\gamma )}n={O}_{d,\mathbb{P}}({d}^{-m(\gamma )})$ and ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}B(\boldsymbol{d},\boldsymbol{k})\leqslant C{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})\leqslant {C}^{\prime }{d}^{-m(\gamma )}$ . Hence equation (69) is verified with

$\begin{equation*}{\kappa }_{m}=\sum\limits _{\boldsymbol{k}\in \mathcal{R}\cap {\mathcal{Q}}^{c}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}n+\sum\limits _{\boldsymbol{k}\in \mathcal{S}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({h}_{\boldsymbol{d}})}^{2}B(\boldsymbol{d},\boldsymbol{k}).\end{equation*}$

□

Lemma 5. Let ${\left\{{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}$ be the collection of product of spherical harmonics on L²(PS^d, μ_d). Let ${({\bar{\boldsymbol{x}}}_{i})}_{i\in [n]}\hspace{2pt}{\sim }_{iid}\hspace{2pt}\mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ . Denote

$\begin{equation*}{\boldsymbol{Y}}_{\boldsymbol{k}}={({Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}))}_{i\in [n],\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\in {\mathbb{R}}^{n\times B(\boldsymbol{d},\boldsymbol{k})}.\end{equation*}$

Then for $\boldsymbol{u},\boldsymbol{v},\boldsymbol{t}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and u ≠ v , we have

$\begin{equation*}\mathbb{E}[{\boldsymbol{Y}}_{\boldsymbol{u}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{t}}{\boldsymbol{Y}}_{\boldsymbol{t}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{v}}]=\mathbf{0}.\end{equation*}$

For $\boldsymbol{u},\boldsymbol{t}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , we have

$\begin{equation*}\mathbb{E}[{\boldsymbol{Y}}_{\boldsymbol{u}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{t}}{\boldsymbol{Y}}_{\boldsymbol{t}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{u}}]=[B(\boldsymbol{d},\boldsymbol{t})n+n(n-1){\delta }_{\boldsymbol{u},\boldsymbol{t}}]{\mathbf{I}}_{B(\boldsymbol{d},\boldsymbol{u})}.\end{equation*}$

Proof. We have

$\begin{align}\hfill & \mathbb{E}[{\boldsymbol{Y}}_{\boldsymbol{u}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{t}}{\boldsymbol{Y}}_{\boldsymbol{t}}^{\mathsf{T}}{\boldsymbol{Y}}_{\boldsymbol{v}}]=\sum\limits _{i,j\in [n]}\sum\limits _{\boldsymbol{m}\in [B(\boldsymbol{d},\boldsymbol{t})]}{(\mathbb{E}[{Y}_{\boldsymbol{u},\boldsymbol{p}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})\left({Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{j})\right){Y}_{\boldsymbol{v},\boldsymbol{q}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{j})])}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}\hfill \\ \hfill & =\sum\limits _{i\in [n]}{\left(\mathbb{E}\left[{Y}_{\boldsymbol{u},\boldsymbol{p}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})\left(\sum\limits _{\boldsymbol{m}\in [B(\boldsymbol{d},\boldsymbol{t})]}{Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})\right){Y}_{\boldsymbol{v},\boldsymbol{q}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})\right]\right)}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}\hfill \\ \hfill & \qquad \qquad +\sum\limits _{i\ne j\in [n]}\hspace{2pt}\sum\limits _{\boldsymbol{m}\in [B(\boldsymbol{d},\boldsymbol{t})]}{(\mathbb{E}[{Y}_{\boldsymbol{u},\boldsymbol{p}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{\boldsymbol{t},\boldsymbol{m}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{j}){Y}_{\boldsymbol{v},\boldsymbol{q}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{j})])}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}\hfill \\ \hfill & =B(\boldsymbol{d},\boldsymbol{t})\sum\limits _{i\in [n]}{(\mathbb{E}[{Y}_{\boldsymbol{u},\boldsymbol{p}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}){Y}_{\boldsymbol{v},\boldsymbol{q}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i})])}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}\hfill \\ \hfill & \qquad \qquad +\sum\limits _{i\ne j\in [n]}\hspace{2pt}\sum\limits _{\boldsymbol{m}\in [B(\boldsymbol{d},\boldsymbol{t})]}{({\delta }_{\boldsymbol{u},\boldsymbol{t}}{\delta }_{\boldsymbol{p},\boldsymbol{m}}{\delta }_{\boldsymbol{t},\boldsymbol{v}}{\delta }_{\boldsymbol{q},\boldsymbol{m}})}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}\hfill \\ \hfill & ={(B(\boldsymbol{d},\boldsymbol{t})n{\delta }_{\boldsymbol{u},\boldsymbol{v}}{\delta }_{\boldsymbol{p},\boldsymbol{q}}+n(n-1){\delta }_{\boldsymbol{u},\boldsymbol{t}}{\delta }_{\boldsymbol{t},\boldsymbol{v}}{\delta }_{\boldsymbol{p},\boldsymbol{q}})}_{\boldsymbol{p}\in [B(\boldsymbol{d},\boldsymbol{u})],\boldsymbol{q}\in [B(\boldsymbol{d},\boldsymbol{v})]}.\hfill \end{align} \tag{ 72 }$

This proves the lemma. □

Lemma 6. Let σ be an activation function satisfying assumption 1. Assume ω_d(d^γ log d) ⩽ n ⩽ O_d(d^m(γ)−δ) for some γ > 0 and δ > 0. We have

$\begin{equation*}{\Vert}n{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}-{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/n{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation*}$

Proof of lemma 6. Denote

$\begin{equation}{\boldsymbol{Y}}_{\boldsymbol{k}}={({Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{i}))}_{i\in [n],\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\in {\mathbb{R}}^{n\times B(\boldsymbol{d},\boldsymbol{k})}.\end{equation} \tag{ 73 }$

Denote $B={\sum }_{\boldsymbol{k}\in \mathcal{Q}}B(\boldsymbol{d},\boldsymbol{k})$ , and

$\begin{equation*}{\boldsymbol{Y}}_{\mathcal{Q}}={({\boldsymbol{Y}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{Q}}\in {\mathbb{R}}^{n\times B},\end{equation*}$

and

$\begin{equation*}{\boldsymbol{D}}_{\mathcal{Q}}=\mathrm{diag}({({\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}}){\mathbf{I}}_{B(\boldsymbol{d},\boldsymbol{k})})}_{\boldsymbol{k}\in \mathcal{Q}})\in {\mathbb{R}}^{B\times B}.\end{equation*}$

From lemma 4, we have

$\begin{align*}\hfill & n{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\boldsymbol{M}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}\hfill \\ \hfill & \qquad =n{({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1}({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+{\kappa }_{m}{\mathbf{\Delta }}_{m})\hfill \\ \hfill & \quad \qquad \times {({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1}\hfill \\ \hfill & \qquad ={T}_{1}+{T}_{2},\hfill \end{align*}$

where ${\Vert}{\mathbf{\Delta }}_{h}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ , ${\Vert}{\mathbf{\Delta }}_{u}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(1)$ and κ_m = O_d(d^−m(γ)), and

$\begin{align*}\hfill {T}_{1}& =n{\kappa }_{m}{({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1}{\mathbf{\Delta }}_{m}{({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1},\hfill \\ \hfill {T}_{2}& =n{({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}^{2}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{({\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}+({\kappa }_{h}+\lambda ){\mathbf{I}}_{n}+{\kappa }_{h}{\mathbf{\Delta }}_{h})}^{-1}.\hfill \end{align*}$

Then, we can use the same proof as in lemma 13 in [21] to bound ||T₁||_op (recall n = O_d(d^m(γ)−δ))

$\begin{equation*}{\Vert}{T}_{1}{{\Vert}}_{\mathrm{op}}\leqslant 2n{\kappa }_{m}/{({\kappa }_{h}+\lambda )}^{2}{\Vert}{\mathbf{\Delta }}_{m}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1),\end{equation*}$

and ${\Vert}{T}_{2}-{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}/n{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1)$ , where we only need to check that

$\begin{equation*}{\lambda }_{\mathrm{min}}({\boldsymbol{D}}_{\mathcal{Q}}/[({\kappa }_{h}+\lambda )/n])=\underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{min}}[n{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})]/({\kappa }_{h}+\lambda )={w}_{d}(1),\end{equation*}$

which directly follows from lemma 2. □

Lemma 7. Let σ be an activation function satisfying assumption 1. Assume ω_d(d^γ log d) ⩽ n ⩽ O_d(d^m(γ)−δ) for some γ > 0 and δ > 0. We have

$\begin{equation*}{\Vert}{\boldsymbol{Y}}_{\mathcal{Q}}^{\mathsf{T}}{(\boldsymbol{H}+\lambda {\mathbf{I}}_{n})}^{-1}{\boldsymbol{Y}}_{\mathcal{Q}}{\boldsymbol{D}}_{\mathcal{Q}}-{\mathbf{I}}_{B}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation*}$

Proof of lemma 7. This lemma can be deduced directly from lemma 14 in [21], by noticing that

$\begin{equation*}{\lambda }_{\mathrm{min}}\left(\right. {\boldsymbol{D}}_{\mathcal{Q}}/[({\kappa }_{h}+\lambda )/n]=\underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{min}}[n{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({h}_{\boldsymbol{d}})]/({\kappa }_{h}+\lambda )={\omega }_{d}(1),\end{equation*}$

from lemma 2. □

Appendix E.: Proof of theorem 6(a): lower bound for the RF model

E.1. Preliminaries

In the theorems, we show our results in high probability with respect to Θ. Hence, in the proof we will restrict the sample space to the high probability event ${\mathcal{P}}_{\varepsilon }\equiv {\mathcal{P}}_{d,N,\varepsilon }$ for ɛ > 0 small enough, where

$\begin{equation}{\mathcal{P}}_{d,N,\varepsilon }\equiv \left\{\mathbf{\Theta }\left\vert {\tau }_{i}^{(q)}\in [1-\varepsilon ,1+\varepsilon ],\quad \forall \enspace i\in [N],\enspace \forall \enspace q\in [Q]\right.\right\}\subset {\left({\mathbb{S}}^{D-1}(\sqrt{D})\right)}^{\otimes N}.\end{equation} \tag{ 74 }$

We will denote ${\mathbb{E}}_{{\boldsymbol{\tau }}_{\varepsilon }}$ the expectation over τ restricted to τ^(q) ∈ [1 − ɛ, 1 + ɛ] for all q ∈ [Q], and ${\mathbb{E}}_{{\mathbf{\Theta }}_{\varepsilon }}$ the expectation over Θ restricted to the event ${\mathcal{P}}_{\varepsilon }$ .

Lemma 8. Assume N = o(d^γ) for some γ > 0. We have for any fixed ɛ > 0,

$\begin{equation*}\mathbb{P}({\mathcal{P}}_{\varepsilon }^{c})={o}_{d}(1).\end{equation*}$

Proof of lemma 8. The tail inequality in lemma 16 and the assumption N = o(d^γ) imply that there exists some constants C, c > 0 such that

$\begin{equation*}\mathbb{P}({\mathcal{P}}_{d,N,\varepsilon }^{c})\leqslant \sum\limits _{q\in [Q]}N\mathbb{P}(\vert {\tau }^{(q)}-1\vert > \varepsilon )\leqslant \sum\limits _{q\in [Q]}C\enspace \mathrm{exp}(\gamma \enspace \mathrm{log}(d)-c{d}^{{\eta }_{q}}\varepsilon )={o}_{d}(1).\end{equation*}$

□

We consider the activation function $\sigma :\mathbb{R}\to \mathbb{R}$ . Let $\boldsymbol{\theta }\sim {\mathbb{S}}^{D-1}(\sqrt{D})$ and $\boldsymbol{x}={\left\{{\boldsymbol{x}}^{(q)}\right\}}_{q\in [Q]}\in {\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}$ . We introduce the function ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}:{\mathrm{ps}}^{\boldsymbol{d}}\to \mathbb{R}$ such that

$\begin{align}\hfill \sigma (\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& =\sigma \left(\sum\limits _{q\in [Q]}{\tau }^{(q)}\cdot ({r}_{q}/R)\cdot \langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right)\hfill \\ \hfill & \equiv {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 75 }$

Consider the expansion of σ_{d
,
τ} in terms of tensor product of Gegenbauer polynomials. We have

$\begin{equation}\sigma (\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation} \tag{ 76 }$

where

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right],\end{equation*}$

where the expectation is taken over $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}$ .

Lemma 9. Let σ be an activation function that satisfies assumptions 2(a) and (b). Consider N ⩽ o_d(d^γ) and $\mathcal{Q}={\mathcal{Q}}_{\mathrm{RF}}(\gamma )$ as defined in theorem 6(a). Then there exists ɛ₀ > 0 and d₀ and a constant C > 0 such that for d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation*}\underset{\boldsymbol{k}\notin \mathcal{Q}}{\mathrm{max}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant C{d}^{-\gamma }.\end{equation*}$

Proof of lemma 9. Notice that by assumption 2(b) we can apply lemma 19 to any $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ such that | k | = k₁ + ⋯ + k_Q ⩽ L. In particular, there exists C > 0, ${\varepsilon }_{0}^{\prime } > 0$ and ${d}_{0}^{\prime }$ such that for any $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ with | k | ⩽ L, $d\geqslant {d}_{0}^{\prime }$ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0}^{\prime },1+{\varepsilon }_{0}^{\prime }]}^{Q}$ ,

$\begin{equation*}\left(\prod\limits _{q\in [Q]}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}\right)B(\boldsymbol{d},\boldsymbol{k}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant C< \infty .\end{equation*}$

Furthermore, using that $B(\boldsymbol{d},\boldsymbol{k})={\Theta}({d}_{1}^{{k}_{1}}{d}_{2}^{{k}_{2}}\dots {d}_{Q}^{{k}_{Q}})$ , there exists C' > 0 such that for $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ with | k | ⩽ L,

$\begin{equation}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant {C}^{\prime }\prod\limits _{q\in [Q]}{d}^{({\eta }_{q}+{\kappa }_{q}-\xi ){k}_{q}}{d}_{q}^{-{k}_{q}}={C}^{\prime }\prod\limits _{q\in [Q]}{d}^{({\kappa }_{q}-\xi ){k}_{q}}\leqslant {C}^{\prime }{d}^{-\gamma },\end{equation} \tag{ 77 }$

where we used in the last inequality $\boldsymbol{k}\notin {\mathcal{Q}}_{\mathrm{RF}}(\gamma )$ implies (ξ − κ₁)k₁ + ⋯ + (ξ − κ_Q)k_Q ⩾ γ by definition.

Furthermore, from assumption 2 and lemma 17(b), there exists ${\varepsilon }_{0}^{{\prime\prime}} > 0$ , ${d}_{0}^{{\prime\prime}}$ and C < ∞, such that

$\begin{equation*}\underset{d\geqslant {d}_{0}^{{\prime\prime}}}{\mathrm{sup}}\hspace{2pt}\underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0}^{{\prime\prime}},1+{\varepsilon }_{0}^{{\prime\prime}}]}^{Q}}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]< C.\end{equation*}$

From the Gegenbauer decomposition (76), this implies that for any $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , $d\leqslant {d}_{0}^{{\prime\prime}}$ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0}^{{\prime\prime}},1+{\varepsilon }_{0}^{{\prime\prime}}]}^{Q}$ ,

$\begin{equation*}B(\boldsymbol{d},\boldsymbol{k}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant C.\end{equation*}$

In particular, for | k | = k₁ + ⋯ + k_Q > L = max_q∈[Q]⌈γ/η_q⌉, we have

$\begin{equation}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant \frac{C}{B(\boldsymbol{d},\boldsymbol{k})}\leqslant {C}^{\prime }\prod\limits _{q\in [Q]}{d}^{-{\eta }_{q}{k}_{q}}\leqslant {C}^{\prime }\prod\limits _{q\in [Q]}{d}^{-\gamma {k}_{q}/L}\leqslant {C}^{\prime }{d}^{-\gamma }.\end{equation} \tag{ 78 }$

Combining equations (77) and (78) yields the result. □

E.2. Proof of theorem 6(a): outline

Let $\mathcal{Q}\equiv {\mathcal{Q}}_{\mathrm{RF}}(\gamma )$ as defined in theorem 6(a) and $\mathbf{\Theta }=\sqrt{D}\boldsymbol{W}$ such that ${\boldsymbol{\theta }}_{i}=\sqrt{D}{\boldsymbol{w}}_{i}\hspace{2pt}{\sim }_{iid}\hspace{2pt}\mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ .

Define the random vectors $\boldsymbol{V}={({V}_{1},\dots ,{V}_{N})}^{\mathsf{T}}$ , ${\boldsymbol{V}}_{\mathcal{Q}}={({V}_{1,\mathcal{Q}},\dots ,{V}_{N,\mathcal{Q}})}^{\mathsf{T}}$ , ${\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={({V}_{1,{\mathcal{Q}}^{c}},\dots ,{V}_{N,{\mathcal{Q}}^{c}})}^{\mathsf{T}}$ , with

$\begin{equation}{V}_{i,\mathcal{Q}}\equiv {\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{\mathcal{Q}}{f}_{d}](\boldsymbol{x})\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)],\end{equation} \tag{ 79 }$

$\begin{equation}{V}_{i,{\mathcal{Q}}^{c}}\equiv {\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x})\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)],\end{equation} \tag{ 80 }$

$\begin{equation}{V}_{i}\equiv {\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}(\boldsymbol{x})\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)]={V}_{i,\mathcal{Q}}+{V}_{i,{\mathcal{Q}}^{c}}.\end{equation} \tag{ 81 }$

Define the random matrix $\boldsymbol{U}={({U}_{ij})}_{i,j\in [N]}$ , with

$\begin{equation}{U}_{ij}={\mathbb{E}}_{\boldsymbol{x}}[\sigma (\langle \boldsymbol{x},{\boldsymbol{\theta }}_{i}\rangle /R)\sigma (\langle \boldsymbol{x},{\boldsymbol{\theta }}_{j}\rangle /R)].\end{equation} \tag{ 82 }$

In what follows, we write ${R}_{\mathrm{RF}}({f}_{d})={R}_{\mathrm{RF}}({f}_{d},\boldsymbol{W}\hspace{2pt})={R}_{\mathrm{RF}}({f}_{d},\mathbf{\Theta }/\sqrt{D})$ for the random features risk, omitting the dependence on the weights $\boldsymbol{W}=\mathbf{\Theta }/\sqrt{D}$ . By the definition and a simple calculation, we have

$\begin{align*}\hfill {R}_{\mathrm{RF}}({f}_{d})& =\underset{\boldsymbol{a}\in {\mathbb{R}}^{N}}{\mathrm{min}}\left\{{\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}{(\boldsymbol{x})}^{2}]-2\langle \boldsymbol{a},\boldsymbol{V}\rangle +\langle \boldsymbol{a},\boldsymbol{U}\boldsymbol{a}\rangle \right\}\hfill \\ \hfill & ={\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}{(\boldsymbol{x})}^{2}]-{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}\boldsymbol{V},\hfill \\ \hfill {R}_{\mathrm{RF}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d})& =\underset{\boldsymbol{a}\in {\mathbb{R}}^{N}}{\mathrm{min}}\left\{{\mathbb{E}}_{\boldsymbol{x}}[{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{(\boldsymbol{x})}^{2}]-2\langle \boldsymbol{a},{\boldsymbol{V}}_{\leqslant \ell }\rangle +\langle \boldsymbol{a},\boldsymbol{U}\boldsymbol{a}\rangle \right\}\hfill \\ \hfill & ={\mathbb{E}}_{\boldsymbol{x}}[{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{(\boldsymbol{x})}^{2}]-{\boldsymbol{V}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{\mathcal{Q}}.\hfill \end{align*}$

By orthogonality, we have

$\begin{equation*}{\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}{(\boldsymbol{x})}^{2}]={\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{\mathcal{Q}}{f}_{d}]{(\boldsymbol{x})}^{2}]+{\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}]{(\boldsymbol{x})}^{2}],\end{equation*}$

which gives

$\begin{align}\hfill & \left\vert {R}_{\mathrm{RF}}({f}_{d})-{R}_{\mathrm{RF}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d})-{\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}]{(\boldsymbol{x})}^{2}]\right\vert \hfill \\ \hfill & \qquad =\left\vert {\boldsymbol{V}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{\mathcal{Q}}-{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}\boldsymbol{V}\right\vert \hfill \\ \hfill & \qquad =\left\vert {\boldsymbol{V}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{\mathcal{Q}}-{({\boldsymbol{V}}_{\mathcal{Q}}+{\boldsymbol{V}}_{{\mathcal{Q}}^{c}})}^{\mathsf{T}}{\boldsymbol{U}}^{-1}({\boldsymbol{V}}_{\mathcal{Q}}+{\boldsymbol{V}}_{{\mathcal{Q}}^{c}})\right\vert \hfill \\ \hfill & \qquad =\left\vert 2{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}-{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}\right\vert \hfill \\ \hfill & \qquad \leqslant 2{\Vert}{\boldsymbol{U}}^{-1/2}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}{\Vert}{\boldsymbol{U}}^{-1/2}\boldsymbol{V}{{\Vert}}_{2}+{\Vert}{\boldsymbol{U}}^{-1}{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}^{2}\hfill \\ \hfill & \qquad \leqslant 2{\Vert}{\boldsymbol{U}}^{-1/2}{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}+{\Vert}{\boldsymbol{U}}^{-1}{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}^{2},\hfill \end{align} \tag{ 83 }$

where the last inequality used the fact that

$\begin{equation*}0\leqslant {R}_{\mathrm{RF}}({f}_{d})={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}\boldsymbol{V},\end{equation*}$

so that

$\begin{equation*}{\Vert}{\boldsymbol{U}}^{-1/2}\boldsymbol{V}{{\Vert}}_{2}^{2}={\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}\boldsymbol{V}\leqslant {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

The theorem follows from the following two claims

$\begin{equation}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}/{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}={o}_{d,\mathbb{P}}(1),\end{equation} \tag{ 84 }$

$\begin{equation}{\Vert}{\boldsymbol{U}}^{-1}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}(1).\end{equation} \tag{ 85 }$

This is achieved by the propositions 1 and 2 stated below.

Proposition 1 (Expected norm of V ). Let σ be an activation function satisfying assumptions 2(a) and (b) for a fixed γ > 0. Denote $\mathcal{Q}={\mathcal{Q}}_{\mathrm{RF}}(\gamma )$ . Let ɛ > 0 and define ${\mathcal{E}}_{{\mathcal{Q}}^{c},\varepsilon }$ by

$\begin{equation*}{\mathcal{E}}_{{\mathcal{Q}}^{c},\varepsilon }\equiv {\mathbb{E}}_{{\boldsymbol{\theta }}_{\varepsilon }}[\langle {\mathsf{P}}_{{\mathcal{Q}}^{c},0}{f}_{d},\sigma (\langle \boldsymbol{\theta },\cdot \rangle /R){\rangle }_{{L}^{2}}^{2}],\end{equation*}$

where we recall that ${\mathbb{E}}_{{\boldsymbol{\theta }}_{\varepsilon }}={\mathbb{E}}_{{\boldsymbol{\tau }}_{\varepsilon }}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}$ the expectation with respect to τ restricted to [1 − ɛ, 1 + ɛ]^Q and $\bar{\boldsymbol{\theta }}\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ .

Then there exists a constant C > 0 and ɛ₀ > 0 (depending only on the constants of assumptions 2(a) and (b)) such that for d sufficiently large,

$\begin{equation*}{\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}\leqslant C{d}^{-\gamma }\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Proposition 2 (Lower bound on the kernel matrix). Assume N = o_d(d^γ) for a fixed integer γ > 0. Let ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ independently, and σ be an activation function satisfying assumption 2(a). Let $\boldsymbol{U}\in {\mathbb{R}}^{N\times N}$ be the kernel matrix defined by equation (82). Then there exists a constant ɛ > 0 that depends on the activation function σ, such that

$\begin{equation*}{\lambda }_{\mathrm{min}}(\boldsymbol{U})\geqslant \varepsilon ,\end{equation*}$

with high probability as d → ∞.

The proofs of these two propositions are provided in the next sections.

Proposition 1 shows that there exists ɛ₀ > 0 such that

$\begin{equation*}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}^{2}]=N{\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}\leqslant CN{d}^{-\gamma }{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Hence, by Markov's inequality, we get for any ɛ > 0,

$\begin{equation*}\begin{aligned}\hfill \mathbb{P}({\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}\geqslant \varepsilon \cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}})& \leqslant \mathbb{P}(\left\{{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}\geqslant \varepsilon \cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}\right\}\cap {\mathcal{P}}_{{\varepsilon }_{0}})+\mathbb{P}({\mathcal{P}}_{{\varepsilon }_{0}}^{c})\hfill \\ \hfill & \leqslant \frac{N{\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}}{{\varepsilon }^{2}{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}}+{o}_{d}(1)\hfill \\ \hfill & \leqslant {C}^{\prime }N{d}^{-\gamma }+{o}_{d}(1),\hfill \end{aligned}\end{equation*}$

where we used lemma 8. By assumption, we have N = o_d(d^γ), hence equation (84) is verified. Furthermore equation (85) follows simply from proposition 2. This proves the theorem.

E.3. Proof of proposition 1

We will denote:

$\begin{equation*}{\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt})={f}_{d}(\boldsymbol{x}),\end{equation*}$

such that $\bar{f}$ is a function on the normalized product of spheres PS^d. (Note that we defined ${\mathsf{P}}_{\boldsymbol{k}}{f}_{d}(\boldsymbol{x})\equiv {\mathsf{P}}_{\boldsymbol{k}}{\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt})$ the unambiguous polynomial approximation of f_d with polynomial of degree k .) We have

$\begin{equation*}\begin{aligned}\hfill {V}_{i,{\mathcal{Q}}^{c}}& ={\mathbb{E}}_{\boldsymbol{x}}\left[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x})\sigma \left(\sum\limits _{q\in [Q]}\langle {\boldsymbol{x}}^{(q)},{\boldsymbol{\theta }}_{i}^{(q)}\rangle /R\right)\right]\hfill \\ \hfill & ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt}){\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{\theta }}}_{i}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right].\hfill \end{aligned}\end{equation*}$

We recall the expansion of σ_{d
,
τ} in terms of tensor product of Gegenbauer polynomials

$\begin{equation*}\begin{aligned}\hfill \sigma (\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right].\hfill \end{aligned}\end{equation*}$

For any $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , the spherical harmonics expansion of ${P}_{\boldsymbol{k}}{\bar{f}}_{d}$ gives

$\begin{equation*}{P}_{\boldsymbol{k}}{\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt})=\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{f}}_{d}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}).\end{equation*}$

Using equation (37) to get the following property

$\begin{align}\hfill {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt})\right]& =\frac{1}{B(\boldsymbol{d},{\boldsymbol{k}}^{\prime })}\sum\limits _{{\boldsymbol{s}}^{\prime }\in [B(\boldsymbol{d},\boldsymbol{k})]}{Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}){\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt})\right]\hfill \\ \hfill & =\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}){\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }},\hfill \end{align} \tag{ 86 }$

we get

$\begin{align*}\hfill & {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[[{\mathsf{P}}_{\boldsymbol{k}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt}){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\hfill \\ \hfill & \qquad =\sum\limits _{{\boldsymbol{k}}^{\prime }\geqslant \mathbf{0}}{\lambda }_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},{\boldsymbol{k}}^{\prime })\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{f}}_{d}){\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)\right]\hfill \\ \hfill & \qquad =\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{f}}_{d}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}).\hfill \end{align*}$

Let ɛ₀ > 0 be a constant as specified in lemma 9. We consider

$\begin{align}\hfill {\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}& ={\mathbb{E}}_{\bar{\boldsymbol{\theta }},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{x}}}{\left[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt}){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]}^{2}\right]\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k},{\boldsymbol{k}}^{\prime }\in {\mathcal{Q}}^{c}}{\mathbb{E}}_{\bar{\boldsymbol{\theta }},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[[{\mathsf{P}}_{\boldsymbol{k}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt}){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\right.\hfill \\ \hfill & \quad \left.\times {\mathbb{E}}_{\bar{\boldsymbol{y}}}\left[[{\mathsf{P}}_{{\boldsymbol{k}}^{\prime }}{\bar{f}}_{d}](\bar{\boldsymbol{y}}\hspace{2pt}){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\right]\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k},{\boldsymbol{k}}^{\prime }\in {\mathcal{Q}}^{c}}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}){\lambda }_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})\right]\hfill \\ \hfill & \quad \times \sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\sum\limits _{{\boldsymbol{s}}^{\prime }\in [B(\boldsymbol{d},{\boldsymbol{k}}^{\prime })]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{f}}_{d}){\lambda }_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}({\bar{f}}_{d}){\mathbb{E}}_{\bar{\boldsymbol{\theta }}}[{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}){Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt})]\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}]\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}\hfill \\ \hfill & \leqslant \left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace {\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}]\right]\cdot \sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}\hfill \\ \hfill & =\left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace {\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}]\right]\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}.\hfill \end{align} \tag{ 87 }$

From lemma 9, there exists a constant C > 0 such that for d sufficiently large, we have for any $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ ,

$\begin{equation}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}]\leqslant \underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{sup}}\enspace {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant C{d}^{-\gamma }.\end{equation} \tag{ 88 }$

Combining equation (87) and (88) yields

$\begin{equation*}{\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}\leqslant C{d}^{-\gamma }\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}.\end{equation*}$

E.4. Proof of proposition 2

Step 1. Construction of the activation functions $\hat{\sigma }$ , $\bar{\sigma }$ .

Without loss of generality, we will assume that q_ξ = 1. From assumption 2(b), σ is not a degree ⌊γ/η₁⌋-polynomial. This is equivalent to having m ⩾ ⌊γ/η₁⌋ + 1 such that μ_m(σ) ≠ 0. Let us denote

$\begin{equation*}m=\mathrm{inf}\left\{k\geqslant \lfloor \gamma /{\eta }_{1}\rfloor +1\vert {\mu }_{m}(\sigma )\ne 0\right\}.\end{equation*}$

Recall the expansion of σ_{d
,
τ} in terms of product of Gegenbauer polynomials

$\begin{equation*}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation*}$

where

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right].\end{equation*}$

Denoting $\boldsymbol{m}=(m,0,\dots ,0)\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and using the Gegenbauer coefficients of σ_{d
,
τ}, we define an activation function ${\bar{\sigma }}_{\boldsymbol{d},\boldsymbol{\tau }}$ which is a degree m polynomial in ${\bar{\boldsymbol{x}}}^{(1)}$ and do not depend on ${\bar{\boldsymbol{x}}}^{(q)}$ for q ⩾ 2.

$\begin{equation*}\begin{aligned}\hfill {\bar{\sigma }}_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{{\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)& ={\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{m}){Q}_{\boldsymbol{m}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\hfill \\ \hfill & ={\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B({d}_{1},m){Q}_{m}^{({d}_{1})}(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)}),\hfill \end{aligned}\end{equation*}$

and an activation function

$\begin{equation*}{\hat{\sigma }}_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{{\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)=\sum\limits _{\boldsymbol{k}\ne \boldsymbol{m}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\end{equation*}$

Step 2. The kernel functions u_d, ${\hat{u}}_{d}$ and ${\bar{u}}_{d}$ .

Let u_d, ${\hat{u}}_{d}$ and ${\bar{u}}_{d}$ be defined by

$\begin{align}\hfill & {u}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\hfill \\ \hfill & \qquad ={\mathbb{E}}_{\boldsymbol{x}}[\sigma (\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R)\sigma (\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)]\hfill \\ \hfill & \qquad =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{1}}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{2}})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\hfill \end{align} \tag{ 89 }$

and

$\begin{align}\hfill & {\hat{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\hfill \\ \hfill & \qquad ={\mathbb{E}}_{\boldsymbol{x}}[\hat{\sigma }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R)\hat{\sigma }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)]\hfill \\ \hfill & \qquad =\sum\limits _{\boldsymbol{k}\ne \boldsymbol{m}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{1}}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{2}})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\hfill \end{align} \tag{ 90 }$

and

$\begin{align}\hfill {\bar{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)& ={\mathbb{E}}_{\boldsymbol{x}}[\bar{\sigma }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R)\bar{\sigma }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)]\hfill \\ \hfill & ={\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{1}}){\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{2}})B({d}_{1},m){Q}_{m}^{({d}_{1})}(\langle {\bar{\boldsymbol{\theta }}}_{1}^{(1)},{\bar{\boldsymbol{\theta }}}_{2}^{(1)}\rangle ).\hfill \end{align} \tag{ 91 }$

We immediately have ${u}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}={\hat{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}+{\bar{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}$ . Note that all three correspond to positive semi-definite kernels.

Step 3. Analyzing the kernel matrix.

Let $\boldsymbol{U},\hat{\boldsymbol{U}},\bar{\boldsymbol{U}}\in {\mathbb{R}}^{N\times N}$ with

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{U}}_{ij}& ={u}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\hat{\boldsymbol{U}}}_{ij}& ={\hat{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\bar{\boldsymbol{U}}}_{ij}& ={\bar{u}}_{\boldsymbol{d}}^{{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\hfill \end{aligned}\end{equation*}$

Since $\hat{\boldsymbol{U}}=\boldsymbol{U}-\bar{\boldsymbol{U}}{\succeq}0$ , we immediately have $\boldsymbol{U}{\succeq}\bar{\boldsymbol{U}}$ . In the following, we will lower bound $\bar{\boldsymbol{U}}$ .

By the decomposition of $\bar{\boldsymbol{U}}$ in terms of Gegenbauer polynomials (91), we have

$\begin{equation*}\bar{\boldsymbol{U}}=B({d}_{1},m)\mathrm{diag}\left({\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}})\right)\cdot {\boldsymbol{W}}_{m}\cdot \mathrm{diag}\left({\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}})\right),\end{equation*}$

where ${\boldsymbol{W}}_{m}\in {\mathbb{R}}^{N\times N}$ with ${W}_{m,ij}={Q}_{m}^{({d}_{1})}(\langle {\bar{\boldsymbol{\theta }}}_{i}^{(1)},{\bar{\boldsymbol{\theta }}}_{j}^{(1)}\rangle )$ . From proposition 6 (recalling that by definition of m > γ/η₁, i.e. γ < mη₁, we have $N< {d}^{{\eta }_{1}m-\delta }={d}_{1}^{m-{\delta }^{\prime }}$ for some δ > 0), we have

$\begin{equation*}{\Vert}{\boldsymbol{W}}_{m}-{\mathbf{I}}_{N}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation*}$

Hence we get

$\begin{equation}{{\Vert}\bar{\boldsymbol{U}}-B({d}_{1},m)\mathrm{diag}\left({\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}})}^{2}\right){\Vert}}_{\mathrm{op}}=\underset{i\in [N]}{\mathrm{max}}\left\{B({d}_{1},m){\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}})}^{2}\right\}\cdot {o}_{d,\mathbb{P}}(1).\end{equation} \tag{ 92 }$

From assumption 2(a) and lemma 20 applied to coefficient m , as well as the assumption that μ_m(σ) ≠ 0, there exists ɛ₀ > 0 and C, c > 0 such that for d large enough,

$\begin{equation}\begin{aligned}\hfill & \underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{sup}}\enspace B({d}_{1},m){\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\leqslant C< \infty ,\hfill \\ \hfill & \underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{inf}}\enspace B({d}_{1},m){\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\geqslant c > 0.\hfill \end{aligned}\end{equation} \tag{ 93 }$

We restrict ourselves to the event ${\mathcal{P}}_{{\varepsilon }_{0}}$ defined in equation (74), which happens with high probability (lemma 8). Hence from equations (92) and (93), we deduce that with high probability

$\begin{equation*}\bar{\boldsymbol{U}}=B({d}_{1},m)\mathrm{diag}\left({\lambda }_{\boldsymbol{m}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}})}^{2}\right)+{o}_{d,\mathbb{P}}(1){\succeq}\frac{c}{2}{\mathbf{I}}_{N}.\end{equation*}$

We conclude that with high probability

$\begin{equation*}\boldsymbol{U}=\bar{\boldsymbol{U}}+\hat{\boldsymbol{U}}{\succeq}\bar{\boldsymbol{U}}{\succeq}\frac{c}{2}{\mathbf{I}}_{N}.\end{equation*}$

Appendix F.: Proof of theorem 6(b): upper bound for RF model

F.1. Preliminaries

Lemma 10. Let σ be an activation function that satisfies assumptions 2(a) and (b). Let || w ^(q)||₂ = 1 be unit vectors of ${\mathbb{R}}^{{d}_{q}}$ , for q = 1, ..., Q. Fix γ > 0 and denote $\mathcal{Q}={\bar{\mathcal{Q}}}_{\mathrm{RF}}(\gamma )$ . Then there exists ɛ₀ > 0 and d₀ and constants C, c > 0 such that for d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]\leqslant C< \infty ,\end{equation} \tag{ 94 }$

$\begin{equation}\underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{min}}\enspace {\lambda }_{k,0}^{\boldsymbol{d}}{({\sigma }_{d,\tau })}^{2}\geqslant c{d}^{-\gamma } > 0.\end{equation} \tag{ 95 }$

Proof of lemma 10. The first inequality comes simply from assumption 2(a) and lemma 17(b). For the second inequality, notice that by assumption 2(c) we can apply lemma 19 to any $\boldsymbol{k}\in \mathcal{Q}$ . Hence (using that μ_k(σ)² > 0 and we can choose δ sufficiently small), we deduce that there exists c > 0, ɛ₀ > 0 and d₀ such that for any d ⩾ d₀, $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and $\boldsymbol{k}\in \mathcal{Q}$ ,

$\begin{equation*}\left(\prod\limits _{q\in [Q]}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}\right)B(\boldsymbol{d},\boldsymbol{k}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\geqslant c > 0.\end{equation*}$

Furthermore, using that $B(\boldsymbol{d},\boldsymbol{k})={\Theta}({d}_{1}^{{k}_{1}}{d}_{2}^{{k}_{2}}\dots {d}_{Q}^{{k}_{Q}})$ , there exists c' > 0 such that for any $\boldsymbol{k}\in \mathcal{Q}$ ,

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\geqslant {c}^{\prime }\prod\limits _{q\in [Q]}{d}^{({\eta }_{q}+{\kappa }_{q}-\xi ){k}_{q}}{d}_{q}^{{k}_{q}}={c}^{\prime }\prod\limits _{q\in [Q]}{d}^{({\kappa }_{q}-\xi ){k}_{q}}\geqslant c{d}^{-\gamma },\end{equation*}$

where we used in the last inequality $\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{\mathrm{RF}}(\gamma )$ implies (ξ − κ₁)k₁ + ⋯ + (ξ − κ_Q)k_Q ⩽ γ by definition. □

F.2. Properties of the limiting kernel

Similarly to the proof of theorem 1(b) in [21], we construct a limiting kernel which is used as a proxy to upper bound the RF risk.

We recall the definition of ${\mathrm{PS}}^{\boldsymbol{d}}={\prod }_{q\in [Q]}{\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})$ and μ_d = Unif(PS^d). Let us denote $\mathcal{L}={L}^{2}({\mathrm{PS}}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}})$ . Fix $\boldsymbol{\tau }\in {\mathbb{R}}_{ > 0}^{Q}$ and recall the definition for a given $\boldsymbol{\theta }=(\bar{\boldsymbol{\theta }},\boldsymbol{\tau })$ of ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}(\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},\cdot \rangle /\sqrt{{d}_{q}}\right\})\in \mathcal{L}$ ,

$\begin{equation*}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{q},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)=\sigma \left(\sum\limits _{q\in [Q]}{\tau }^{(q)}({r}_{q}/R)\langle {\bar{\boldsymbol{\theta }}}_{q},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right).\end{equation*}$

Define the operator ${\mathbb{T}}_{\boldsymbol{\tau }}:\mathcal{L}\to \mathcal{L}$ , such that for any $g\in \mathcal{L}$ ,

$\begin{equation*}{\mathbb{T}}_{\boldsymbol{\tau }}g(\bar{\boldsymbol{\theta }}\hspace{2pt})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)g(\bar{\boldsymbol{x}}\hspace{2pt})\right].\end{equation*}$

It is easy to check that the adjoint operator ${\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }:\mathcal{L}\to \mathcal{L}$ verifies ${\mathbb{T}}^{\ast }=\mathbb{T}$ with variables $\bar{\boldsymbol{x}}$ and $\bar{\boldsymbol{\theta }}$ exchanged.

We define the operator ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}:\mathcal{L}\to \mathcal{L}$ as ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}\equiv {\mathbb{T}}_{\boldsymbol{\tau }}{\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}^{\ast }$ . For $g\in \mathcal{L}$ , we can write

$\begin{equation*}{\mathbb{K}}_{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}g({\bar{\boldsymbol{\theta }}}_{1})={\mathbb{E}}_{{\bar{\boldsymbol{\theta }}}_{2}}[{K}_{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2})g({\bar{\boldsymbol{\theta }}}_{2})],\end{equation*}$

where

$\begin{equation*}{K}_{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{1}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{2}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{2}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right].\end{equation*}$

We recall the decomposition of σ_{d
,
τ} in terms of tensor product of Gegenbauer polynomials

$\begin{equation*}\begin{aligned}\hfill {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{{\bar{x}}_{1}^{(q)}\right\}}_{q\in [Q]}\right)& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{{\bar{x}}_{1}^{(q)}\right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{{\bar{x}}_{1}^{(q)}\right\}}_{q\in [Q]}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\sqrt{{d}_{q}}{\bar{x}}_{1}^{(q)}\right\}}_{q\in [Q]}\right)\right].\hfill \end{aligned}\end{equation*}$

Recall that ${\left\{{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{s})]}$ forms an orthonormal basis of $\mathcal{L}$ . From equation (86), we have for any k ⩾ 0 and s ∈ [B( d , k )]

$\begin{equation*}\begin{aligned}\hfill {\mathbb{T}}_{\boldsymbol{\tau }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt})& =\sum\limits _{{\boldsymbol{k}}^{\prime }\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},{\boldsymbol{k}}^{\prime }){\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt})\right]\hfill \\ \hfill & ={\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}),\hfill \end{aligned}\end{equation*}$

where we used

$\begin{equation*}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt})\right]=\frac{{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}}{B(\boldsymbol{d},\boldsymbol{k})}{\boldsymbol{Y}}_{\boldsymbol{k},\boldsymbol{s}}(\bar{\boldsymbol{\theta }}\hspace{2pt}).\end{equation*}$

The same equation holds for ${\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }$ . Therefore, we directly deduce that

$\begin{equation*}{\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt})=({\mathbb{T}}_{\boldsymbol{\tau }}{\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}^{\ast }){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt})={\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{\theta }}\hspace{2pt}).\end{equation*}$

We deduce that ${\left\{{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{s})]}$ is an orthonormal basis that diagonalizes the operator ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}$ .

Let ɛ₀ > 0 be defined as in lemma 10. We will consider $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and restrict ourselves to the subspace ${V}_{\mathcal{Q}}^{\boldsymbol{d}}$ . From the choice of ɛ₀ and for d large enough, the eigenvalues ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }})\ne 0$ for any $\boldsymbol{k}\in \mathcal{Q}$ . Hence, the operator ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}$ is invertible.

F.3. Proof of theorem 6(b)

Without loss of generality, let us assume that {f_d} are polynomials contained in ${V}_{\mathcal{Q}}^{\boldsymbol{d}}$ , i.e. ${\bar{f}}_{d}={\mathsf{P}}_{\mathcal{Q}}{\bar{f}}_{d}$ .

Consider

$\begin{equation*}\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },\boldsymbol{a})=\sum\limits _{i=1}^{N}{a}_{i}\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R).\end{equation*}$

Define ${\alpha }_{\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }}\hspace{2pt})\equiv {\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}(\bar{\boldsymbol{\theta }}\hspace{2pt})$ and choose ${a}_{i}^{\ast }={N}^{-1}{\alpha }_{{\boldsymbol{\tau }}_{i}}({\bar{\boldsymbol{\theta }}}_{i})$ , where we denoted ${\bar{\boldsymbol{\theta }}}_{i}={({\bar{\boldsymbol{\theta }}}_{i}^{(q)})}_{q\in [Q]}$ with ${\bar{\boldsymbol{\theta }}}_{i}^{(q)}={\boldsymbol{\theta }}_{i}^{(q)}/{\tau }_{i}^{(q)}\in {\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})$ and ${\tau }_{i}^{(q)}={\Vert}{\boldsymbol{\theta }}_{i}^{(q)}{{\Vert}}_{2}/\sqrt{{d}_{q}}$ independent of ${\bar{\boldsymbol{\theta }}}_{i}^{(q)}$ .

Let ɛ₀ > 0 be defined as in lemma 10 and consider the expectation over ${\mathcal{P}}_{{\varepsilon }_{0}}$ of the RF risk (in particular, ${\boldsymbol{a}}^{\ast }=({a}_{1}^{\ast },\dots ,{a}_{N}^{\ast })$ are well defined):

$\begin{equation*}\begin{aligned}\hfill {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{R}_{\mathrm{RF}}({f}_{d},\mathbf{\Theta })]& ={\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\underset{\boldsymbol{a}\in {\mathbb{R}}^{N}}{\mathrm{inf}}\enspace {\mathbb{E}}_{\boldsymbol{x}}[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },\boldsymbol{a}))}^{2}]\right]\hfill \\ \hfill & \leqslant {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\boldsymbol{x}}\left[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },{\boldsymbol{a}}^{\ast }(\mathbf{\Theta })))}^{2}\right]\right].\hfill \end{aligned}\end{equation*}$

We can expand the squared loss at a * as

$\begin{align}\hfill {\mathbb{E}}_{\boldsymbol{x}}[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },{\boldsymbol{a}}^{\ast }))}^{2}]& ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2\sum\limits _{i=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}[{a}_{i}^{\ast }\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){f}_{d}(\boldsymbol{x})]\hfill \\ \hfill & \quad \;\;+\sum\limits _{i,j=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}[{a}_{i}^{\ast }{a}_{j}^{\ast }\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\sigma (\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)].\hfill \end{align} \tag{ 96 }$

The second term of the expansion (96) around a * verifies

$\begin{align}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}\left[{a}_{i}^{\ast }\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){f}_{d}(\boldsymbol{x})\right]\right]\hfill \\ \hfill & \qquad ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\alpha }_{\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }}\hspace{2pt}){\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt})\right]\right]\right]\hfill \\ \hfill & \qquad ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\langle {\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d},{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}\rangle }_{{L}^{2}}\right]\hfill \\ \hfill & \qquad ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\hfill \end{align} \tag{ 97 }$

where we used that for each $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , we have ${\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}=\mathbf{I}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}$ .

Let us consider the third term in the expansion (96) around a *: the non diagonal term verifies

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\ne j}{\mathbb{E}}_{\boldsymbol{x}}\left[{a}_{i}^{\ast }{a}_{j}^{\ast }\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\sigma (\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)\right]\right]\hfill \\ \hfill & \qquad =\left(1-{N}^{-1}\right){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2},{\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}}\left[{\alpha }_{{\boldsymbol{\tau }}^{1}}({\bar{\boldsymbol{\theta }}}_{1}){\alpha }_{{\boldsymbol{\tau }}^{2}}({\bar{\boldsymbol{\theta }}}_{2})\right.\hfill \\ \hfill & \qquad \quad \left.\times {E}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{1}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{2}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{2}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\right]\hfill \\ \hfill & \qquad =\left(1-{N}^{-1}\right){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2},{\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}}\left[{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}{\bar{f}}_{d}({\bar{\boldsymbol{\theta }}}_{1}){\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}){\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{T}_{{\boldsymbol{\tau }}^{2}}{\bar{f}}_{d}({\bar{\boldsymbol{\theta }}}_{2})\right]\hfill \\ \hfill & \qquad =\left(1-{N}^{-1}\right){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2}}\left[{\langle {\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}{\bar{f}}_{d},{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{\bar{f}}_{d}\rangle }_{{L}^{2}}\right].\hfill \end{align*}$

For $\boldsymbol{k}\in \mathcal{Q}$ and s ∈ [B( d , k )] and ${\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , we have (for d large enough)

$\begin{equation*}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}=\left({\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}\right)\cdot \left({\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}\right)\cdot {Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}={Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}.\end{equation*}$

Hence for any ${\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , ${\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}=\mathbf{I}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}$ . Hence

$\begin{equation}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\ne j}{\mathbb{E}}_{\boldsymbol{x}}\left[{a}_{i}^{\ast }{a}_{j}^{\ast }\sigma (\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\sigma (\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)\right]\right]=\left(1-{N}^{-1}\right){\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 98 }$

The diagonal term verifies

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\in [N]}{\mathbb{E}}_{\boldsymbol{x}}\left[{({a}_{i}^{\ast })}^{2}\sigma {(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)}^{2}\right]\right]\hfill \\ \hfill & \qquad ={N}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}},\bar{\boldsymbol{\theta }}}\left[{\alpha }_{\boldsymbol{\tau }}{(\bar{\boldsymbol{\theta }}\hspace{2pt})}^{2}{K}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }})\right]\hfill \\ \hfill & \qquad \leqslant {N}^{-1}\left[\underset{\bar{\boldsymbol{\theta }},\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{max}}\enspace {K}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }})\right]\cdot {\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}].\hfill \end{align*}$

We have by definition of ${\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}$

$\begin{equation*}{\mathrm{sup}}_{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}\enspace {K}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }})={\mathrm{sup}}_{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\Vert}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{{\Vert}}_{{L}^{2}}^{2}\leqslant C,\end{equation*}$

for d large enough (using lemma 10). Furthermore

$\begin{equation*}\begin{aligned}\hfill {\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}& =\sum\limits _{\boldsymbol{k}\in \mathcal{Q}}\frac{1}{{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}\hfill \\ \hfill & \leqslant \left[\underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{max}}\enspace \frac{1}{{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}}\right]\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{aligned}\end{equation*}$

From lemma 10, we get

$\begin{equation*}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}]\leqslant C{d}^{\gamma }\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Hence,

$\begin{equation}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\in [N]}{\mathbb{E}}_{\boldsymbol{x}}\left[{({a}_{i}^{\ast })}^{2}\sigma {(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)}^{2}\right]\right]\leqslant C\frac{{d}^{\gamma }}{N}{\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 99 }$

Combining equations (97)–(99), we get

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{R}_{\mathrm{RF}}({f}_{d},\mathbf{\Theta })]\hfill \\ \hfill & \qquad \leqslant {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\boldsymbol{x}}\left[{\left({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },{\boldsymbol{a}}^{\ast }(\mathbf{\Theta }))\right)}^{2}\right]\right]\hfill \\ \hfill & \qquad ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+(1-{N}^{-1}){\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{N}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}},\bar{\boldsymbol{\theta }}}\left[{({\alpha }_{\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }}\hspace{2pt}))}^{2}{K}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }})\right]\hfill \\ \hfill & \qquad \leqslant C\frac{{d}^{\gamma }}{N}{\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{align*}$

By Markov's inequality, we get for any ɛ > 0 and d large enough,

$\begin{align*}\hfill \mathbb{P}({R}_{\mathrm{RF}}({f}_{d},\mathbf{\Theta }) > \varepsilon \cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2})& \leqslant \mathbb{P}(\left\{{R}_{\mathrm{RF}}({f}_{d},\mathbf{\Theta }) > \varepsilon \cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\}\cap {\mathcal{P}}_{{\varepsilon }_{0}})+\mathbb{P}({\mathcal{P}}_{{\varepsilon }_{0}}^{c})\hfill \\ \hfill & \leqslant {C}^{\prime }\frac{{d}^{\gamma }}{N}+\mathbb{P}({\mathcal{P}}_{{\varepsilon }_{0}}^{c}).\hfill \end{align*}$

The assumption N = ω_d(d^γ) and lemma 8 conclude the proof.

Appendix G.: Proof of theorem 7(a): lower bound for NT model

G.1. Preliminaries

We consider the activation function $\sigma :\mathbb{R}\to \mathbb{R}$ with weak derivative σ'. Consider ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }:{\mathrm{ps}}^{\boldsymbol{d}}\to \mathbb{R}$ defined as follows

$\begin{align}\hfill {\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& ={\sigma }^{\prime }\left(\sum\limits _{q\in [Q]}{\tau }^{(q)}\cdot ({r}_{q}/R)\cdot \langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right)\hfill \\ \hfill & \equiv {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 100 }$

Consider the expansion of ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }$ in terms of product of Gegenbauer polynomials. We have

$\begin{equation}{\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation} \tag{ 101 }$

where

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right],\end{equation*}$

where the expectation is taken over $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}$ .

Lemma 11. Let σ be an activation function that satisfies assumptions 3(a) and (b). Define for $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and $\boldsymbol{\tau }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ ,

$\begin{equation}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}={r}_{q}^{2}\cdot [{t}_{{d}_{q},{k}_{q}-1}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q-})+{s}_{{d}_{q},{k}_{q}+1}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q+})],\end{equation} \tag{ 102 }$

with k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q), and

$\begin{equation*}{s}_{d,k}=\frac{k}{2k+d-2},\qquad {t}_{d,k}=\frac{k+d-2}{2k+d-2},\end{equation*}$

with the convention t_d,−1 = 0. Then there exists constants ɛ₀ > 0 and C > 0 such that for d large enough, we have for any $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and $\boldsymbol{k}\in {\mathcal{Q}}_{\mathrm{NT}}{(\gamma )}^{c}$ ,

$\begin{equation*}\frac{{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}\leqslant \begin{cases}C{d}^{\xi }{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})}\quad \hfill & \quad \text{if}\enspace {k}_{q} > 0,\hfill \\ C{d}^{{\eta }_{q}+2{\kappa }_{q}-\xi }{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})}\quad \hfill & \quad \text{if}\enspace {k}_{q}=0,\hfill \end{cases}\end{equation*}$

where we recall S( k ) ⊂ [Q] is the subset of indices corresponding to the non zero integers k_q > 0.

Proof of lemma 11.Let us fix an integer M such that $\mathcal{Q}\subset {[M]}^{Q}$ . We will denote $\mathcal{Q}\equiv {\mathcal{Q}}_{\mathrm{NT}}(\gamma )$ for simplicity. Following the same proof as in lemma 9, there exists ɛ₀ > 0, d₀ and C > 0 such that for any d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , we have for any $\boldsymbol{k}\in {\mathcal{Q}}^{c}\cap {[M]}^{Q}$ ,

$\begin{equation*}\begin{aligned}\hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}& \leqslant C{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})},\hfill \\ \hfill {\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}& \leqslant C{d}^{\xi -{\kappa }_{q}-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})},\hfill \\ \hfill {\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}& \leqslant C{d}^{{\kappa }_{q}-\xi -\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})},\hfill \end{aligned}\end{equation*}$

while for k ∉ [M]^Q, we get

$\begin{equation*}\mathrm{max}\left\{{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2},{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2},{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}\right\}\leqslant C{d}^{-(M-1)\underset{q\in [Q]}{\mathrm{min}}{\eta }_{q}}.\end{equation*}$

Injecting this bound in the formula (102) of ${\boldsymbol{A}}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}$ , we get for d ⩾ d₀, $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and any $\boldsymbol{k}\in {\mathcal{Q}}^{c}\cap {[M]}^{Q}$ : if k_q > 0,

$\begin{equation*}\frac{{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}\leqslant {C}^{\prime }{d}^{{\eta }_{q}+{\kappa }_{q}}{d}^{\xi -{\kappa }_{q}-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})}={C}^{\prime }{d}^{\xi }{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})},\end{equation*}$

while for k_q = 0,

$\begin{equation*}\frac{{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}\leqslant {C}^{\prime }{d}^{{\eta }_{q}+{\kappa }_{q}}{d}^{-{\eta }_{q}}{d}^{{\kappa }_{q}+{\eta }_{q}-\xi -\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})}={C}^{\prime }{d}^{{\eta }_{q}+2{\kappa }_{q}-\xi }{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})},\end{equation*}$

where we used that for k_q ∈ [M], there exists a constant c > 0 such that ${s}_{{d}_{q},{k}_{q}}\leqslant c{d}^{-{\eta }_{q}}$ and ${t}_{{d}_{q},{k}_{q}}\leqslant c$ . Similarly, we get for k ∉ [M]^Q

$\begin{equation*}\frac{{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}\leqslant {C}^{{\prime\prime}}{d}^{{\kappa }_{q}+{\eta }_{q}-(M-1){\mathrm{min}}_{q\in [Q]}{\eta }_{q}},\end{equation*}$

where we used that ${s}_{{d}_{q},k},{t}_{{d}_{q},k}\leqslant 1$ for any $k\in {\mathbb{Z}}_{\geqslant 0}$ . Taking M sufficiently large yields the result. □

G.2. Proof of theorem 7(a): outline

The structure of the proof for the NT model is the same as for the RF case, however some parts of the proof requires more work.

We define the random vector $\boldsymbol{V}={({\boldsymbol{V}}_{1},\dots ,{\boldsymbol{V}}_{N})}^{\mathsf{T}}\in {\mathbb{R}}^{Nd}$ , where, for each j ⩽ N, ${\boldsymbol{V}}_{j}\in {\mathbb{R}}^{D}$ , and analogously ${\boldsymbol{V}}_{\mathcal{Q}}={({\boldsymbol{V}}_{1,\mathcal{Q}},\dots ,{\boldsymbol{V}}_{N,\mathcal{Q}})}^{\mathsf{T}}\in {\mathbb{R}}^{ND}$ , ${\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={({\boldsymbol{V}}_{1,{\mathcal{Q}}^{c}},\dots ,{\boldsymbol{V}}_{N,{\mathcal{Q}}^{c}})}^{\mathsf{T}}\in {\mathbb{R}}^{ND}$ , as follows

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{V}}_{i,\mathcal{Q}}& ={\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{\mathcal{Q}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\boldsymbol{x}],\hfill \\ \hfill {\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}& ={\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\boldsymbol{x}],\hfill \\ \hfill {\boldsymbol{V}}_{i}& ={\mathbb{E}}_{\boldsymbol{x}}[{f}_{d}(\boldsymbol{x}){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\boldsymbol{x}]={\boldsymbol{V}}_{i,\mathcal{Q}}+{\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}.\hfill \end{aligned}\end{equation*}$

We define the random matrix $\boldsymbol{U}={({\boldsymbol{U}}_{ij})}_{i,j\in [N]}\in {\mathbb{R}}^{ND\times ND}$ , where for each i, j ⩽ N, ${\boldsymbol{U}}_{ij}\in {\mathbb{R}}^{D\times D}$ , is given by

$\begin{equation}{\boldsymbol{U}}_{ij}={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle \boldsymbol{x},{\boldsymbol{\theta }}_{i}\rangle /R){\sigma }^{\prime }(\langle \boldsymbol{x},{\boldsymbol{\theta }}_{j}\rangle /R)\boldsymbol{x}{\boldsymbol{x}}^{\mathsf{T}}].\end{equation} \tag{ 103 }$

Proceeding as for the RF model, we obtain

$\begin{align*}\hfill & \left\vert {R}_{\mathrm{NT}}({f}_{d})-{R}_{\mathrm{NT}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d})-{\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\vert \hfill \\ \hfill & \qquad =\left\vert {\boldsymbol{V}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{\mathcal{Q}}-{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}\boldsymbol{V}\right\vert \hfill \\ \hfill & \qquad =\left\vert {\boldsymbol{V}}_{\mathcal{Q}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{\mathcal{Q}}-{({\boldsymbol{V}}_{\mathcal{Q}}+{\boldsymbol{V}}_{{\mathcal{Q}}^{c}})}^{\mathsf{T}}{\boldsymbol{U}}^{-1}({\boldsymbol{V}}_{\mathcal{Q}}+{\boldsymbol{V}}_{{\mathcal{Q}}^{c}})\right\vert \hfill \\ \hfill & \qquad =\left\vert 2{\boldsymbol{V}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}-{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}\right\vert \hfill \\ \hfill & \qquad \leqslant 2{\Vert}{\boldsymbol{U}}^{-1/2}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}+{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}.\hfill \end{align*}$

We claim that we have

$\begin{equation}{\Vert}{\boldsymbol{U}}^{-1/2}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{{\Vert}}_{2}^{2}={\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={o}_{d,\mathbb{P}}({\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}).\end{equation} \tag{ 104 }$

To show this result, we will need the following two propositions.

Proposition 3 (Expected norm of V ). Let σ be a weakly differentiable activation function with weak derivative σ' and $\mathcal{Q}\subset {\mathbb{Z}}_{\geqslant 0}^{Q}$ . Let ɛ > 0 and define ${\mathcal{E}}_{{\mathcal{Q}}^{c},\varepsilon }^{(q)}$ by

$\begin{equation*}{\mathcal{E}}_{{\mathcal{Q}}^{c},\varepsilon }^{(q)}\equiv {\mathbb{E}}_{{\boldsymbol{\theta }}_{\varepsilon }}\left[\langle {\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R){\boldsymbol{x}}^{(q)}],{\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R){\boldsymbol{x}}^{(q)}]\rangle \right],\end{equation*}$

where the expectation is taken with respect to $\boldsymbol{x}=({\boldsymbol{x}}^{(1)},\dots ,{\boldsymbol{x}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }}$ . Then,

$\begin{equation*}{\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}^{(q)}\leqslant \left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace B{(\boldsymbol{d},\boldsymbol{k})}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\right]\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Proposition 4 (Lower bound on the kernel matrix). Let N = o_d(d^γ) for some γ > 0, and ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ independently. Let σ be an activation that satisfies assumptions 3(a) and (b). Let $\boldsymbol{U}\in {\mathbb{R}}^{ND\times ND}$ be the kernel matrix with i, j block ${\boldsymbol{U}}_{ij}\in {\mathbb{R}}^{D\times D}$ defined by equation (103). Then there exists two matrices D and Δ such that

$\begin{equation*}\boldsymbol{U}{\succeq}\boldsymbol{D}+\mathbf{\Delta },\end{equation*}$

with D = diag( D _ii) block diagonal. Furthermore, D and Δ verifies the following properties:

(a)
${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}({d}^{-{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ .
(b)
For each i ∈ [N], we can decompose the matrix D _ii into block matrix form ${({\boldsymbol{D}}_{ii}^{q{q}^{\prime }})}_{q,{q}^{\prime }\in [Q]}\in {\mathbb{R}}^{DN\times DN}$ with ${\boldsymbol{D}}_{ii}^{q{q}^{\prime }}\in {\mathbb{R}}^{{d}_{q}N\times {d}_{{q}^{\prime }}N}$ such that
- For any q ∈ [Q], there exists constants c_q, C_q > 0 such that we have with high probability
  $\begin{align}\hfill 0< {c}_{q}\frac{{r}_{q}^{2}}{{d}_{q}}& ={c}_{q}{d}^{{\kappa }_{q}}\leqslant \underset{i\in [N]}{\mathrm{min}}\enspace {\lambda }_{\mathrm{min}}({\boldsymbol{D}}_{ii}^{qq})\leqslant \underset{i\in [N]}{\mathrm{max}}\enspace {\lambda }_{\mathrm{max}}({\boldsymbol{D}}_{ii}^{qq})\hfill \\ \hfill & \leqslant {C}_{q}\frac{{r}_{q}^{2}}{{d}_{q}}={C}_{q}{d}^{{\kappa }_{q}}< \infty ,\hfill \end{align} \tag{ 105 }$
  as d → ∞.
- For any q ≠ q' ∈ [Q], we have
  $\begin{equation}\underset{i\in [N]}{\mathrm{max}}\enspace {\sigma }_{\mathrm{max}}({\boldsymbol{D}}_{ii}^{q{q}^{\prime }})={o}_{d,\mathbb{P}}({r}_{q}{r}_{{q}^{\prime }}/\sqrt{{d}_{q}{d}_{{q}^{\prime }}}).\end{equation} \tag{ 106 }$

The proofs of these two propositions are provided in the next sections.

From proposition 4, we can upper bound equation (104) as follows

$\begin{equation}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{U}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}{\preceq}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{(\boldsymbol{D}+\mathbf{\Delta })}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}-{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}\mathbf{\Delta }{(\boldsymbol{D}+\mathbf{\Delta })}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}.\end{equation} \tag{ 107 }$

Let us fix ɛ₀ > 0 as prescribed in lemma 11. We decompose the vector ${\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}={({\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}^{(q)})}_{q\in [Q]}$ where

$\begin{equation*}{\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}^{(q)}={\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\boldsymbol{x}}^{(q)}].\end{equation*}$

We denote ${\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}=({\boldsymbol{V}}_{1,{\mathcal{Q}}^{c}}^{(q)},\dots ,{\boldsymbol{V}}_{N,{\mathcal{Q}}^{c}}^{(q)})\in {\mathbb{R}}^{{d}_{q}N}$ . From proposition 3, we have

$\begin{equation*}\frac{{d}_{q}}{{r}_{q}^{2}}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}{{\Vert}}_{2}^{2}]\leqslant \left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace N{d}^{-{\kappa }_{q}}B{(\boldsymbol{d},\boldsymbol{k})}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\right]\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Hence, using the upper bounds on ${A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}$ in lemma 11, we get for $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ with k_q > 0:

$\begin{equation*}N{d}^{-{\kappa }_{q}}B{(\boldsymbol{d},\boldsymbol{k})}^{-1}{\mathbb{E}}_{{\tau }_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\leqslant CN{d}^{-{\kappa }_{q}}{d}^{-\gamma +{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q}}={o}_{d}(1),\end{equation*}$

where we used that N = o_d(d^γ) and κ_q ⩾ min_{q∈S(
k
)} κ_q (we have k_q > 0 and therefore q ∈ S( k ) by definition). Similarly for $\boldsymbol{k}\in {\mathcal{Q}}^{c}$ with k_q = 0:

$\begin{equation*}N{d}^{-{\kappa }_{q}}B{(\boldsymbol{d},\boldsymbol{k})}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\leqslant CN{d}^{{\eta }_{q}+{\kappa }_{q}-\xi }{d}^{-\gamma -(\xi -{\mathrm{min}}_{q\in S(\boldsymbol{k})}{\kappa }_{q})}={o}_{d}(1),\end{equation*}$

where we used that by definition of ξ we have η_q + κ_q ⩽ ξ and min_{q∈S(
k
)} κ_q ⩽ ξ. We deduce that

$\begin{equation*}\frac{{d}_{q}}{{r}_{q}^{2}}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}{{\Vert}}_{2}^{2}]={o}_{d}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\end{equation*}$

and therefore by Markov's inequality that

$\begin{equation}\frac{{d}_{q}}{{r}_{q}^{2}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}{{\Vert}}_{2}^{2}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 108 }$

Notice that the properties (105) and (106) imply that there exists c > 0 such that with high probability λ_min( D ) ⩾ min_i∈[N] λ_min( D _ii) ⩾ c. In particular, we deduce that ||( D + Δ)⁻¹||_op ⩽ c⁻¹/2 with high probability. Combining these bounds and equation (108) and recalling that ${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}({d}^{-{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ show that

$\begin{align}\hfill \vert {\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}\mathbf{\Delta }{(\boldsymbol{D}+\mathbf{\Delta })}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}\vert & \leqslant {\Vert}{\boldsymbol{D}}^{-1}{{\Vert}}_{\mathrm{op}}{\Vert}{(\boldsymbol{D}+\mathbf{\Delta })}^{-1}{{\Vert}}_{\mathrm{op}}\sum\limits _{q\in [Q]}{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}{{\Vert}}_{2}^{2}\hfill \\ \hfill & ={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{align} \tag{ 109 }$

We are now left to show ${\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={o}_{d,\mathbb{P}}(1)$ . For each i ∈ [N], denote ${\boldsymbol{B}}_{ii}={\boldsymbol{D}}_{ii}^{-1}$ and notice that we can apply lemma 24 to B _ii and get

$\begin{equation*}\underset{i\in [N]}{\mathrm{max}}{\Vert}{\boldsymbol{B}}_{ii}^{qq}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}\left(\frac{{d}_{q}}{{r}_{q}^{2}}\right),\qquad \underset{i\in [N]}{\mathrm{max}}{\Vert}{\boldsymbol{B}}_{ii}^{q{q}^{\prime }}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}\left(\frac{\sqrt{{d}_{q}{d}_{{q}^{\prime }}}}{{r}_{q}{r}_{{q}^{\prime }}}\right).\end{equation*}$

Therefore,

$\begin{align}\hfill {\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}& =\sum\limits _{i\in [N]}\sum\limits _{q,{q}^{\prime }\in [Q]}{({\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}^{(q)})}^{\mathsf{T}}{\boldsymbol{B}}_{ii}^{q{q}^{\prime }}{\boldsymbol{V}}_{i,{\mathcal{Q}}^{c}}^{({q}^{\prime })}\hfill \\ \hfill & \leqslant \sum\limits _{q,{q}^{\prime }\in [Q]}{O}_{d,\mathbb{P}}(1)\cdot {\left(\frac{{d}_{q}}{{r}_{q}^{2}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{(q)}{{\Vert}}_{2}^{2}\right)}^{1/2}{\left(\frac{{d}_{{q}^{\prime }}}{{r}_{{q}^{\prime }}^{2}}{\Vert}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{({q}^{\prime })}{{\Vert}}_{2}^{2}\right)}^{1/2}.\hfill \end{align} \tag{ 110 }$

Using equation (108) in equation (110), we get

$\begin{equation}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}^{\mathsf{T}}{\boldsymbol{D}}^{-1}{\boldsymbol{V}}_{{\mathcal{Q}}^{c}}={o}_{d,\mathbb{P}}(1)\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 111 }$

Combining equations (109) and (111) yields equation (104). This proves the theorem.

G.3. Proof of proposition 3

Proof of proposition 3. Let us consider ɛ₀ > 0 as prescribed in lemma 11. We have for q ∈ [Q]

$\begin{equation*}\begin{aligned}\hfill {\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}^{(q)}& ={\mathbb{E}}_{{\boldsymbol{\theta }}_{\varepsilon }}\left[\langle {\mathbb{E}}_{\boldsymbol{x}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{x}){\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R){\boldsymbol{x}}^{(q)}],{\mathbb{E}}_{\boldsymbol{y}}[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}](\boldsymbol{y}){\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{y}\rangle /R){\boldsymbol{y}}^{(q)}]\rangle \right]\hfill \\ \hfill & ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}\left[[{\mathsf{P}}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt})[{\mathsf{P}}_{{\mathcal{Q}}^{c}}\bar{f}](\bar{\boldsymbol{y}}\hspace{2pt}){H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})\right]\right],\hfill \end{aligned}\end{equation*}$

where we denoted ${H}_{\boldsymbol{\tau }}^{(q)}$ the kernel given by

$\begin{align*}\hfill & {H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})={\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\langle {\boldsymbol{x}}^{(q)},{\boldsymbol{y}}^{(q)}\rangle .\hfill \end{align*}$

Then we have

$\begin{equation}{H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation} \tag{ 112 }$

where ${A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}$ is given in lemma 12. Hence we get

$\begin{equation*}\begin{aligned}\hfill {\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}^{(q)}& ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}[{P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt}){P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}(\bar{\boldsymbol{y}}\hspace{2pt}){H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})]]\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]{\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)[{P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt})[{P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{y}}\hspace{2pt})\right].\hfill \end{aligned}\end{equation*}$

We have

$\begin{align*}\hfill & {\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)[{P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{x}}\hspace{2pt})[{P}_{{\mathcal{Q}}^{c}}{\bar{f}}_{d}](\bar{\boldsymbol{y}}\hspace{2pt})\right]\hfill \\ \hfill & \qquad =\sum\limits _{\boldsymbol{l},{\boldsymbol{l}}^{\prime }\in {\mathcal{Q}}^{c}}\hspace{2pt}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{l})]}\hspace{2pt}\sum\limits _{{\boldsymbol{s}}^{\prime }\in [B(\boldsymbol{d},{\boldsymbol{l}}^{\prime })]}{\lambda }_{\boldsymbol{l},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{f}}_{d}){\lambda }_{{\boldsymbol{l}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}({\bar{f}}_{d}){\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{l},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}){Y}_{{\boldsymbol{l}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt})\right]\hfill \\ \hfill & \qquad ={\delta }_{\boldsymbol{k}\in {\mathcal{Q}}^{c}}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\frac{{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}}{B(\boldsymbol{d},\boldsymbol{k})},\hfill \end{align*}$

where we used in the third line

$\begin{align*}\hfill & {\mathbb{E}}_{\bar{\boldsymbol{x}},\bar{\boldsymbol{y}}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{l},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt}){Y}_{{\boldsymbol{l}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt})\right]\hfill \\ \hfill & \qquad ={\mathbb{E}}_{\bar{\boldsymbol{y}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{l},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{x}}\hspace{2pt})\right]{Y}_{{\boldsymbol{l}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt})\right]\hfill \\ \hfill & \qquad =\frac{{\delta }_{\boldsymbol{k},\boldsymbol{l}}}{B(\boldsymbol{d},\boldsymbol{k})}{\mathbb{E}}_{\bar{\boldsymbol{y}}}\left[{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt}){Y}_{{\boldsymbol{l}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}(\bar{\boldsymbol{y}}\hspace{2pt})\right]\hfill \\ \hfill & \qquad =\frac{{\delta }_{\boldsymbol{k},\boldsymbol{l}}{\delta }_{\boldsymbol{k},{\boldsymbol{l}}^{\prime }}{\delta }_{\boldsymbol{s},{\boldsymbol{s}}^{\prime }}}{B(\boldsymbol{d},\boldsymbol{k})}.\hfill \end{align*}$

We conclude that

$\begin{equation*}\begin{aligned}\hfill {\mathcal{E}}_{{\mathcal{Q}}^{c},{\varepsilon }_{0}}^{(q)}& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}\frac{{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k}\in {\mathcal{Q}}^{c}}\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k}\in {\mathcal{Q}}^{c}}\frac{{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]}{B(\boldsymbol{d},\boldsymbol{k})}{\Vert}{P}_{\boldsymbol{k}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\hfill \\ \hfill & \leqslant \left[\underset{\boldsymbol{k}\in {\mathcal{Q}}^{c}}{\mathrm{max}}\enspace B{(\boldsymbol{d},\boldsymbol{k})}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}]\right]\cdot {\Vert}{\mathsf{P}}_{{\mathcal{Q}}^{c}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{aligned}\end{equation*}$

□

Lemma 12. Let σ be a weakly differentiable activation function with weak derivative σ'. For a fixed $\boldsymbol{\tau }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ , define the kernels for q ∈ [Q],

$\begin{equation*}{H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})=\frac{{r}_{q}^{2}}{{d}_{q}}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle .\end{equation*}$

Then, we have the following decomposition in terms of product of Gegenbauer polynomials

$\begin{equation*}{H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{2}}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation*}$

where

$\begin{equation*}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}={r}_{q}^{2}\cdot [{t}_{{d}_{q},{k}_{q}-1}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q-})+{s}_{{d}_{q},{k}_{q}+1}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q+})],\end{equation*}$

with k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q), and

$\begin{equation*}{s}_{d,k}=\frac{k}{2k+d-2},\qquad {t}_{d,k}=\frac{k+d-2}{2k+d-2},\end{equation*}$

with the convention t_d,−1 = 0.

Proof of lemma 12. Recall the decomposition of σ' in terms of tensor product of Gegenbauer polynomials,

$\begin{equation*}\begin{aligned}\hfill {\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{q}}{\bar{x}}_{1}^{(Q)}\right)\right].\hfill \end{aligned}\end{equation*}$

Injecting this decomposition into the definition of ${H}_{\boldsymbol{\tau }}^{(q)}$ yields

$\begin{align*}\hfill & {H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})=\frac{{r}_{q}^{2}}{{d}_{q}}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \hfill \\ \hfill & \qquad =\frac{{r}_{q}^{2}}{{d}_{q}}\sum\limits _{\boldsymbol{k},{\boldsymbol{k}}^{\prime }\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k})B(\boldsymbol{d},{\boldsymbol{k}}^{\prime })\hfill \\ \hfill & \qquad \qquad \times {\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)\right]\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle .\hfill \end{align*}$

Recalling equation (36), we have

$\begin{equation*}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right){Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)\right]={\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}\frac{{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}{B(\boldsymbol{d},\boldsymbol{k})}.\end{equation*}$

Hence,

$\begin{align*}\hfill & {H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})=\frac{{r}_{q}^{2}}{{d}_{q}}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \hfill \\ \hfill & \qquad ={r}_{q}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},\boldsymbol{k})\left[{Q}_{{k}_{q}}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle )\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /{d}_{q}\right]\hfill \\ \hfill & \qquad \qquad \times \prod\limits _{{q}^{\prime }\ne q}{Q}_{{k}_{{q}^{\prime }}}^{\boldsymbol{d}}(\langle {\bar{\boldsymbol{x}}}^{({q}^{\prime })},{\bar{\boldsymbol{y}}}^{({q}^{\prime })}\rangle ).\hfill \end{align*}$

By the recurrence relationship for Gegenbauer polynomials (23), we have

$\begin{equation*}\frac{t}{{d}_{q}}{Q}_{{k}_{q}}^{({d}_{q})}(t)={s}_{{d}_{q},{k}_{q}}{Q}_{{k}_{q}-1}^{({d}_{q})}(t)+{t}_{{d}_{q},{k}_{q}}{Q}_{{k}_{q}+1}^{({d}_{q})}(t),\end{equation*}$

where (we use the convention ${t}_{{d}_{q},-1}=0$ )

$\begin{equation*}{s}_{{d}_{q},{k}_{q}}=\frac{{k}_{q}}{2{k}_{q}+{d}_{q}-2},\qquad {t}_{{d}_{q},{k}_{q}}=\frac{{k}_{q}+{d}_{q}-2}{2{k}_{q}+{d}_{q}-2}.\end{equation*}$

Hence we get,

$\begin{align*}{H}_{\boldsymbol{\tau }}^{(q)}(\bar{\boldsymbol{x}},\bar{\boldsymbol{y}})\hfill & ={r}_{q}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},\boldsymbol{k})\left[{Q}_{{k}_{q}}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle )\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle /{d}_{q}\right]\hfill \\ \hfill & \quad \times \prod\limits _{{q}^{\prime }\ne q}{Q}_{{k}_{{q}^{\prime }}}^{\boldsymbol{d}}(\langle {\bar{\boldsymbol{x}}}^{({q}^{\prime })},{\bar{\boldsymbol{y}}}^{({q}^{\prime })}\rangle )\hfill \\ \hfill & ={r}_{q}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},\boldsymbol{k})\left[{s}_{{d}_{q},{k}_{q}}{Q}_{{k}_{q}-1}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle )+{t}_{{d}_{q},{k}_{q}}{Q}_{{k}_{q}+1}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle )\right]\hfill \\ \hfill & \quad \times \prod\limits _{{q}^{\prime }\ne q}{Q}_{{k}_{{q}^{\prime }}}^{\boldsymbol{d}}(\langle {\bar{\boldsymbol{x}}}^{({q}^{\prime })},{\bar{\boldsymbol{y}}}^{({q}^{\prime })}\rangle )\hfill \\ \hfill & =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{2}}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \end{align*}$

where we get by matching the coefficients,

$\begin{equation*}{A}_{\boldsymbol{\tau },\boldsymbol{k}}^{(q)}={r}_{q}^{2}\cdot [{t}_{{d}_{q},{k}_{q}-1}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q-})+{s}_{{d}_{q},{k}_{q}+1}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })}^{2}B(\boldsymbol{d},{\boldsymbol{k}}_{q+})],\end{equation*}$

with k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q). □

G.4. Proof of proposition 4

G.4.1. Preliminaries

Lemma 13. Let $\psi :{\mathbb{R}}^{Q}\to \mathbb{R}$ be a function such that $\psi ({\left\{\langle {\boldsymbol{e}}_{q},\cdot \rangle \right\}}_{q\in [Q]})\in {L}^{2}({\mathrm{PS}}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}})$ . We will consider for integers $\boldsymbol{i}=({i}_{1},\dots ,{i}_{Q})\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , the associated function ψ⁽ⁱ⁾ given by:

$\begin{equation*}{\psi }^{(\boldsymbol{i})}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right)={\left({\bar{x}}_{1}^{(1)}\right)}^{{i}_{1}}\dots {\left({\bar{x}}_{1}^{(Q)}\right)}^{{i}_{Q}}\psi \left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right).\end{equation*}$

Assume that ${\psi }^{(\boldsymbol{i})}({\left\{\langle {\boldsymbol{e}}_{q},\cdot \rangle \right\}}_{q\in [Q]})\in {L}^{2}({\mathrm{PS}}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}})$ . Let ${\left\{{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}(\psi )\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}$ be the coefficients of the expansion of ψ in terms of the product of Gegenbauer polynomials

$\begin{equation*}\begin{aligned}\hfill \psi \left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right)& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}(\psi )B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}(\psi )& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[\psi \left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right].\hfill \end{aligned}\end{equation*}$

Then we can write

$\begin{equation*}{\psi }^{(\boldsymbol{i})}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},\boldsymbol{i}}(\psi )B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right),\end{equation*}$

where the coefficients ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},\boldsymbol{i}}(\psi )$ are given recursively: denoting i_q+ = (i₁, ..., i_q + 1, ..., i_Q), if k_q = 0,

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\boldsymbol{i}}_{q+}}(\psi )=\sqrt{{d}_{q}}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d},\boldsymbol{i}}(\psi ),\end{equation*}$

and for k_q > 0,

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\boldsymbol{i}}_{q+}}(\psi )=\sqrt{{d}_{q}}\frac{{k}_{q}+{d}_{q}-2}{2{k}_{q}+{d}_{q}-2}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d},\boldsymbol{i}}(\psi )+\sqrt{{d}_{q}}\frac{{k}_{q}}{2{k}_{q}+{d}_{q}-2}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d},\boldsymbol{i}}(\psi ),\end{equation*}$

where we recall the notations k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q).

Proof of lemma 13. We recall the following two formulas for k ⩾ 1 (see appendix B.2):

$\begin{equation*}\begin{aligned}\hfill \frac{x}{d}{Q}_{k}^{(d)}(x)& =\frac{k}{2k+d-2}{Q}_{k-1}^{(d)}(x)+\frac{k+d-2}{2k+d-2}{Q}_{k+1}^{(d)}(x),\hfill \\ \hfill B(d,k)& =\frac{2k+d-2}{k}\left(\genfrac{}{}{0.0pt}{}{k+d-3}{k-1}\right).\hfill \end{aligned}\end{equation*}$

Furthermore, we have ${Q}_{0}^{(d)}(x)=1$ , ${Q}_{1}^{(d)}(x)=x/d$ and therefore therefore $x{Q}_{0}^{(d)}(x)=d{Q}_{1}^{(d)}(x)$ . Similarly to the proof of lemma 6 in [21], we insert these expressions in the expansion of the function ψ. Matching the coefficients of the expansion yields the result.

□

Let $\boldsymbol{u}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to {\mathbb{R}}^{D\times D}$ be a matrix-valued function defined by

$\begin{equation*}\boldsymbol{u}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)\boldsymbol{x}{\boldsymbol{x}}^{\mathsf{T}}].\end{equation*}$

We can write this function as a Q by Q block matrix function $\boldsymbol{u}={({\boldsymbol{u}}^{(q{q}^{\prime })})}_{q,{q}^{\prime }\in [Q]}$ , where ${\boldsymbol{u}}^{(q{q}^{\prime })}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to {\mathbb{R}}^{{d}_{q}\times {d}_{{q}^{\prime }}}$ are given by

$\begin{equation*}{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R){\boldsymbol{x}}^{(q)}{({\boldsymbol{x}}^{({q}^{\prime })})}^{\mathsf{T}}].\end{equation*}$

We have the following lemma which is a generalization of lemma 7 in [21] that shows essentially the same decomposition of the matrix u ( θ ₁, θ ₂) as by integration by part if we had x ∼ N(0, I).

Lemma 14. For q ∈ [Q], there exists functions ${u}_{1}^{(qq)},{u}_{2}^{(qq)},{u}_{3,1}^{(qq)},{u}_{3,2}^{(qq)}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to \mathbb{R}$ such that

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})& ={u}_{1}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\mathbf{I}}_{{d}_{q}}+{u}_{2}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})[{\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{2}^{(q)})}^{\mathsf{T}}+{\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{1}^{(q)})}^{\mathsf{T}}]\hfill \\ \hfill & \quad +{u}_{3,1}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{(q)})}^{\mathsf{T}}+{u}_{3,2}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{2}^{(q)})}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

For q, q' ∈ [Q], there exists functions ${u}_{2,1}^{(q{q}^{\prime })},{u}_{2,2}^{(q{q}^{\prime })},{u}_{3,1}^{(q{q}^{\prime })},{u}_{3,2}^{(q{q}^{\prime })}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to \mathbb{R}$ such that

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})& ={u}_{2,1}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{2}^{({q}^{\prime })})}^{\mathsf{T}}+{u}_{2,2}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{1}^{({q}^{\prime })})}^{\mathsf{T}}\hfill \\ \hfill & \quad +{u}_{3,1}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{({q}^{\prime })})}^{\mathsf{T}}+{u}_{3,2}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{2}^{({q}^{\prime })})}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

Proof of lemma 14. Denote ${\gamma }^{(q)}=\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{\theta }}}_{2}^{(q)}\rangle /{d}_{q}$ . Let us rotate each sphere q ∈ [Q] such that

$\begin{equation}\begin{aligned}\hfill {\boldsymbol{\theta }}_{1}^{(q)}=& \left({\tau }_{1}^{(q)}\sqrt{{d}_{q}},0,\dots ,0\right),\hfill \\ \hfill {\boldsymbol{\theta }}_{2}^{(q)}=& \left({\tau }_{2}^{(q)}\sqrt{{d}_{q}}{\gamma }^{(q)},{\tau }_{2}^{(q)}\sqrt{{d}_{q}}\sqrt{1-{({\gamma }^{(q)})}^{2}},0,\dots ,0\right).\hfill \end{aligned}\end{equation} \tag{ 113 }$

Step 1: u ^(qq) .

Let us start with u ^(qq). For clarity, we will denote (in the rotated basis (113))

$\begin{equation*}\begin{aligned}\hfill {\alpha }_{1}=\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R& =\sum\limits _{q\in [Q]}{\tau }_{1}^{(q)}\sqrt{{d}_{q}}/R\cdot {\bar{x}}_{1}^{(q)},\hfill \\ \hfill {\alpha }_{2}=\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R& =\sum\limits _{q\in [Q]}\left[{\tau }_{2}^{(q)}\sqrt{{d}_{q}}{\gamma }^{(q)}/R\cdot {\bar{x}}_{1}^{(q)}+{\tau }_{2}^{(q)}\sqrt{{d}_{q}}\sqrt{1-{({\gamma }^{(q)})}^{2}}/R\cdot {\bar{x}}_{2}^{(q)}\right].\hfill \end{aligned}\end{equation*}$

Then it is easy to show that we can rewrite

$\begin{align*}\hfill {\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})& ={\mathbb{E}}_{\boldsymbol{x}}\left[\right. {\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){\boldsymbol{x}}^{(q)}{({\boldsymbol{x}}^{(q)})}^{\mathsf{T}}\hfill \\ \hfill & =\left[\begin{matrix}\hfill {\boldsymbol{u}}_{1:2,1:2}^{(qq)}\hfill & \hfill \mathbf{0}\hfill \\ \hfill \mathbf{0}\hfill & \hfill {\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){({x}_{3}^{(q)})}^{2}]{\mathbf{I}}_{{d}_{q}-2}\hfill \end{matrix}\right],\hfill \end{align*}$

with

$\begin{equation*}{\boldsymbol{u}}_{1:2,1:2}^{(qq)}=\left[\begin{matrix}\hfill {\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){({x}_{1}^{(q)})}^{2}]\hfill & \hfill {\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){x}_{1}^{(q)}{x}_{2}^{(q)}]\hfill \\ \hfill {\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){x}_{2}^{(q)}{x}_{1}^{(q)}]\hfill & \hfill {\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }({\alpha }_{1}){\sigma }^{\prime }({\alpha }_{2}){({x}_{2}^{(q)})}^{2}]\hfill \end{matrix}\right].\end{equation*}$

Case (a): ${\boldsymbol{\theta }}_{1}^{(q)}\ne {\boldsymbol{\theta }}_{2}^{(q)}$ .

Given any functions ${u}_{1}^{(qq)},{u}_{2}^{(qq)},{u}_{3,1}^{(qq)},{u}_{3,2}^{(qq)}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to \mathbb{R}$ , we define

$\begin{equation*}\begin{aligned}\hfill {\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})& ={u}_{1}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\mathbf{I}}_{{d}_{1}}+{u}_{2}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})[{\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{2}^{(q)})}^{\mathsf{T}}+{\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{1}^{(q)})}^{\mathsf{T}}]\hfill \\ \hfill & \quad +{u}_{3,1}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{(q)})}^{\mathsf{T}}+{u}_{3,2}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{2}^{(q)})}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

In the rotated basis (113), we have

$\begin{equation*}{\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})=\left[\begin{matrix}\hfill {\tilde{\boldsymbol{u}}}_{1:2,1:2}^{(qq)}\hfill & \hfill \mathbf{0}\hfill \\ \hfill \mathbf{0}\hfill & \hfill {u}_{1}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\mathbf{I}}_{{d}_{q}-2}\hfill \end{matrix}\right],\end{equation*}$

where (we dropped the dependency on ( θ ₁, θ ₂) for clarity)

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{u}}_{11}^{(qq)}& ={u}_{1}^{(qq)}+2{\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}{u}_{2}^{(qq)}+{({\tau }_{1}^{(q)})}^{2}{d}_{q}{u}_{3,1}^{(qq)}+{({\tau }_{2}^{(q)})}^{2}{d}_{q}{({\gamma }^{(q)})}^{2}{u}_{3,2}^{(qq)},\hfill \\ \hfill {\boldsymbol{u}}_{12}^{(qq)}& ={\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{d}_{q}\sqrt{1-{({\gamma }^{(q)})}^{2}}{u}_{2}^{(qq)}+{({\tau }_{2}^{(q)})}^{2}{d}_{q}{\gamma }^{(q)}\sqrt{1-{({\gamma }^{(q)})}^{2}}{u}_{3,2}^{(qq)},\hfill \\ \hfill {\boldsymbol{u}}_{22}^{(qq)}& ={u}_{1}^{(qq)}+{({\tau }_{2}^{(q)})}^{2}{d}_{q}(1-{({\gamma }^{(q)})}^{2}){u}_{3,2}^{(qq)}.\hfill \end{aligned}\end{equation*}$

We see that u ^(qq) and ${\tilde{\boldsymbol{u}}}^{(qq)}$ will be equal if and only if we have the following equalities:

$\begin{align*}\hfill \mathrm{Tr}({\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}))& =\mathrm{Tr}({\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}))\hfill \\ \hfill & ={d}_{q}{u}_{1}^{(qq)}+2{\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}{u}_{2}^{(qq)}+{({\tau }_{1}^{(q)})}^{2}{d}_{q}{u}_{3,1}^{(qq)}+{({\tau }_{2}^{(q)})}^{2}{d}_{q}{u}_{3,2}^{(qq)},\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle & =\langle {\boldsymbol{\theta }}_{1}^{(q)},{\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \\ \hfill & ={\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}{u}_{1}^{(qq)}+{({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}^{2}(1+{({\gamma }^{(q)})}^{2}){u}_{2}^{(qq)}\hfill \\ \hfill & \quad +{({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}^{2}{\gamma }^{(q)}{u}_{3,1}^{(qq)}+{\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}^{2}{\gamma }^{(q)}{u}_{3,1}^{(qq)},\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}\rangle & =\langle {\boldsymbol{\theta }}_{1}^{(q)},{\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}\rangle \hfill \\ \hfill & ={({\tau }_{1}^{(q)})}^{2}{d}_{q}{u}_{1}^{(qq)}+2{({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}^{2}{\gamma }^{(q)}{u}_{2}^{(qq)}\hfill \\ \hfill & \quad +{({\tau }_{1}^{(q)})}^{4}{d}_{q}^{2}{u}_{3,1}^{(qq)}+{({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}^{2}{({\gamma }^{(q)})}^{2}{u}_{3,2}^{(qq)},\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{2}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle & =\langle {\boldsymbol{\theta }}_{2}^{(q)},{\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \\ \hfill & ={({\tau }_{2}^{(q)})}^{2}{d}_{q}{u}_{1}^{(qq)}+2{\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}^{2}{\gamma }^{(q)}{u}_{2}^{(qq)}\hfill \\ \hfill & \quad +{({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}^{2}{({\gamma }^{(q)})}^{2}{u}_{3,1}^{(qq)}+{({\tau }_{2}^{(q)})}^{4}{d}_{q}^{2}{u}_{3,2}^{(qq)}.\hfill \end{align*}$

Hence ${\tilde{\boldsymbol{u}}}^{(qq)}={\boldsymbol{u}}^{(qq)}$ if and only if

$\begin{equation}\left[\begin{matrix}\hfill {u}_{1}^{(qq)}\hfill \\ \hfill {u}_{2}^{(qq)}\hfill \\ \hfill {u}_{3,1}^{(qq)}\hfill \\ \hfill {u}_{3,2}^{(qq)}\hfill \end{matrix}\right]={d}_{q}^{-1}{({\boldsymbol{M}}^{(qq)})}^{-1}\times \left[\begin{matrix}\hfill \mathrm{Tr}({\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}))\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{2}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \end{matrix}\right],\end{equation} \tag{ 114 }$

where

$\begin{equation*}{\boldsymbol{M}}^{(qq)}=\left[\begin{matrix}\hfill 1\hfill & \hfill 2{\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}\hfill \\ \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}(1+{({\gamma }^{(q)})}^{2})\hfill & \hfill {({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}\hfill & \hfill {\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}{\gamma }^{(q)}\hfill \\ \hfill {({\tau }_{1}^{(q)})}^{2}\hfill & \hfill 2{({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{4}{d}_{q}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}{({\gamma }^{(q)})}^{2}\hfill \\ \hfill {({\tau }_{2}^{(q)})}^{2}\hfill & \hfill 2{\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}{({\gamma }^{(q)})}^{2}\hfill & \hfill {({\tau }_{2}^{(q)})}^{4}{d}_{q}\hfill \end{matrix}\right]\end{equation*}$

is invertible almost surely (for ${\tau }_{1}^{(q)},{\tau }_{2}^{(q)}\ne 0$ and γ^(q) ≠ 1).

Case (b): ${\boldsymbol{\theta }}_{1}^{(q)}={\boldsymbol{\theta }}_{2}^{(q)}$ .

Similarly, for some fixed α and β, we define

$\begin{equation*}{\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1})=\alpha {\mathbf{I}}_{{d}_{q}}+\beta {\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{(q)})}^{\mathsf{T}}.\end{equation*}$

Then u ^(qq)( θ ₁, θ ₁) and ${\tilde{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1})$ are equal if and only if

$\begin{equation*}\left[\begin{matrix}\hfill \alpha \hfill \\ \hfill \beta \hfill \end{matrix}\right]={d}_{q}^{-1}{({\boldsymbol{M}}_{{\Vert}}^{(qq)})}^{-1}\times \left[\begin{matrix}\hfill \mathrm{Tr}({\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1}))\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1}){\boldsymbol{\theta }}_{1}^{(q)}\rangle \hfill \end{matrix}\right],\end{equation*}$

where

$\begin{equation*}{\boldsymbol{M}}_{{\Vert}}^{(qq)}=\left[\begin{matrix}\hfill 1\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}\hfill \\ \hfill {({\tau }_{1}^{(q)})}^{2}\hfill & \hfill {({\tau }_{1}^{(q)})}^{4}{d}_{q}\hfill \end{matrix}\right].\end{equation*}$

Step 2: ${\boldsymbol{u}}^{(q{q}^{\prime })}$ for q ≠ q'.

Similarly to the two previous steps, we define for any functions ${u}_{2,1}^{(q{q}^{\prime })},{u}_{2,2}^{(q{q}^{\prime })},{u}_{3,1}^{(q{q}^{\prime })},{u}_{3,2}^{(q{q}^{\prime })}:{\mathbb{S}}^{D-1}(\sqrt{D})\times {\mathbb{S}}^{D-1}(\sqrt{D})\to \mathbb{R}$ ,

$\begin{equation*}\begin{aligned}\hfill {\tilde{\boldsymbol{u}}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})& ={u}_{2,1}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{2}^{({q}^{\prime })})}^{\mathsf{T}}+{u}_{2,2}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{1}^{({q}^{\prime })})}^{\mathsf{T}}\hfill \\ \hfill & \quad +{u}_{3,1}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{({q}^{\prime })})}^{\mathsf{T}}+{u}_{3,2}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}{({\boldsymbol{\theta }}_{2}^{({q}^{\prime })})}^{\mathsf{T}}.\hfill \end{aligned}\end{equation*}$

We can rewrite ${\tilde{\boldsymbol{u}}}^{(q{q}^{\prime })}$ as

$\begin{equation*}{\tilde{\boldsymbol{u}}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})=\left[\begin{matrix}\hfill {\tilde{\boldsymbol{u}}}_{1:2,1:2}^{(q{q}^{\prime })}\hfill & \hfill \mathbf{0}\hfill \\ \hfill \mathbf{0}\hfill & \hfill \mathbf{0}\hfill \end{matrix}\right],\end{equation*}$

where

$\begin{align*}\hfill {\tilde{\boldsymbol{u}}}_{11}^{(q{q}^{\prime })}& ={u}_{2,1}^{(q{q}^{\prime })}{\tau }_{1}^{(q)}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{({q}^{\prime })}+{u}_{2,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}{\gamma }^{(q)}+{u}_{3,1}^{(q{q}^{\prime })}{\tau }_{1}^{(q)}{\tau }_{1}^{({q}^{\prime })}+{u}_{3,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}{\gamma }^{({q}^{\prime })},\hfill \\ \hfill {\tilde{\boldsymbol{u}}}_{12}^{(q{q}^{\prime })}& ={u}_{2,1}^{(q{q}^{\prime })}{\tau }_{1}^{(q)}{\tau }_{2}^{({q}^{\prime })}\sqrt{1-{({\gamma }^{({q}^{\prime })})}^{2}}+{u}_{3,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}\sqrt{1-{({\gamma }^{({q}^{\prime })})}^{2}},\hfill \\ \hfill {\tilde{\boldsymbol{u}}}_{21}^{(q{q}^{\prime })}& ={u}_{2,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}\sqrt{1-{({\gamma }^{(q)})}^{2}}+{u}_{3,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{2}^{({q}^{\prime })}\sqrt{1-{({\gamma }^{(q)})}^{2}}{\gamma }^{({q}^{\prime })},\hfill \\ \hfill {\tilde{\boldsymbol{u}}}_{22}^{(q{q}^{\prime })}& ={u}_{3,2}^{(q{q}^{\prime })}{\tau }_{2}^{(q)}{\tau }_{2}^{({q}^{\prime })}\sqrt{1-{({\gamma }^{(q)})}^{2}}\sqrt{1-{({\gamma }^{({q}^{\prime })})}^{2}}.\hfill \end{align*}$

Case (a): ${\boldsymbol{\theta }}_{1}^{(q)}\ne {\boldsymbol{\theta }}_{2}^{(q)}$ .

We have equality ${\tilde{\boldsymbol{u}}}^{(q{q}^{\prime })}={\boldsymbol{u}}^{(q{q}^{\prime })}$ if and only if

$\begin{equation*}\left[\begin{matrix}\hfill {u}_{2,1}^{(q{q}^{\prime })}\hfill \\ \hfill {u}_{2,2}^{(q{q}^{\prime })}\hfill \\ \hfill {u}_{3,1}^{(q{q}^{\prime })}\hfill \\ \hfill {u}_{3,2}^{(q{q}^{\prime })}\hfill \end{matrix}\right]={({d}_{q}{d}_{{q}^{\prime }})}^{-1}{({\boldsymbol{M}}^{(q{q}^{\prime })})}^{-1}\times \left[\begin{matrix}\hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{({q}^{\prime })}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{({q}^{\prime })}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{2}^{(q)},{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{({q}^{\prime })}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{2}^{(q)},{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{({q}^{\prime })}\rangle \hfill \end{matrix}\right],\end{equation*}$

where ${\boldsymbol{M}}^{(q{q}^{\prime })}$ is given by

$\begin{equation*}\left[\begin{matrix}\hfill {({\tau }_{1}^{(q)})}^{2}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{({q}^{\prime })}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{({\tau }_{1}^{({q}^{\prime })})}^{2}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{1}^{({q}^{\prime })})}^{2}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}{\gamma }^{({q}^{\prime })}\hfill \\ \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{({q}^{\prime })})}^{2}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}{\gamma }^{({q}^{\prime })}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{({q}^{\prime })}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{({\tau }_{2}^{({q}^{\prime })})}^{2}{\gamma }^{(q)}\hfill \\ \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}{\gamma }^{({q}^{\prime })}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}{({\tau }_{1}^{({q}^{\prime })})}^{2}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{({\tau }_{1}^{({q}^{\prime })})}^{2}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{({q}^{\prime })}\hfill \\ \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{({\tau }_{2}^{({q}^{\prime })})}^{2}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{({q}^{\prime })}\hfill & \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\tau }_{1}^{({q}^{\prime })}{\tau }_{2}^{({q}^{\prime })}{\gamma }^{(q)}{\gamma }^{({q}^{\prime })}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}{({\tau }_{2}^{({q}^{\prime })})}^{2}\hfill \end{matrix}\right],\end{equation*}$

which is invertible almost surely (for ${\tau }_{1}^{(q)},{\tau }_{2}^{(q)}\ne 0$ and γ^(q) ≠ 1).

Case (b): ${\boldsymbol{\theta }}_{1}^{(q)}={\boldsymbol{\theta }}_{2}^{(q)}$ .

It is straightforward to check that

$\begin{equation*}{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1})=\beta {\boldsymbol{\theta }}_{1}^{(q)}{({\boldsymbol{\theta }}_{1}^{({q}^{\prime })})}^{\mathsf{T}},\end{equation*}$

where

$\begin{equation*}\beta ={({d}_{q}{d}_{{q}^{\prime }})}^{-1}{({\tau }_{1}^{(q)}{\tau }_{1}^{({q}^{\prime })})}^{-2}\left\langle {\boldsymbol{\theta }}_{1}^{(q)},{\boldsymbol{u}}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{1}){\boldsymbol{\theta }}_{1}^{({q}^{\prime })}\right\rangle .\end{equation*}$

□

G.4.2. Proof of proposition 4

Step 1. Construction of the activation function $\hat{\sigma }$ .

Recall the definition of σ_{d
,
τ} in equation (100) and its expansion in terms of tensor product of Gegenbauer polynomials:

$\begin{equation*}\begin{aligned}\hfill {\sigma }^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right].\hfill \end{aligned}\end{equation*}$

We recall the definition of q_ξ = argmax_q∈[Q]{η_q + κ_q}. Let l₂ > l₁ ⩾ 2L + 5 be two indices that satisfy the conditions of assumption 3(b) and we define l ₁ = (0, ..., 0, l₁, 0, ..., 0) (l₁ at position q_ξ) and l ₂ = (0, ..., 0, l₂, 0, ..., 0) (l₂ at position q_ξ). Using the Gegenbauer coefficients of σ', we define a new activation function ${\hat{\sigma }}^{\prime }$ by

$\begin{equation}{\hat{\sigma }}^{\prime }(\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}{\backslash}\left\{{\boldsymbol{l}}_{1},{\boldsymbol{l}}_{2}\right\}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)\end{equation} \tag{ 115 }$

$\begin{equation}\quad +\sum\limits _{t=1,2}(1-{\delta }_{t}){\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B({d}_{{q}_{\xi }},{l}_{t}){Q}_{{l}_{t}}^{({d}_{{q}_{\xi }})}(\langle {\bar{\boldsymbol{\theta }}}^{({q}_{\xi })},{\bar{\boldsymbol{x}}}^{({q}_{\xi })}\rangle ),\end{equation} \tag{ 116 }$

for some δ₁, δ₂ that we will fix later (with |δ_t| ⩽ 1).

Step 2. The functions $\boldsymbol{u},\hat{\boldsymbol{u}}$ and $\bar{\boldsymbol{u}}$ .

Let u and $\hat{\boldsymbol{u}}$ be the matrix-valued functions associated respectively to σ' and ${\hat{\sigma }}^{\prime }$

$\begin{equation}\boldsymbol{u}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)\boldsymbol{x}{\boldsymbol{x}}^{\mathsf{T}}],\end{equation} \tag{ 117 }$

$\begin{equation}\hat{\boldsymbol{u}}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2})={\mathbb{E}}_{\boldsymbol{x}}[{\hat{\sigma }}^{\prime }(\langle {\boldsymbol{\theta }}_{1},\boldsymbol{x}\rangle /R){\hat{\sigma }}^{\prime }(\langle {\boldsymbol{\theta }}_{2},\boldsymbol{x}\rangle /R)\boldsymbol{x}{\boldsymbol{x}}^{\mathsf{T}}].\end{equation} \tag{ 118 }$

From lemma 14, there exists functions ${u}_{1}^{ab},{u}_{2,1}^{ab},{u}_{2,2}^{ab},{u}_{3,1}^{ab},{u}_{3,2}^{ab}$ and ${\hat{u}}_{1}^{ab},{\hat{u}}_{2,1}^{ab},{\hat{u}}_{2,2}^{ab},{\hat{u}}_{3,1}^{ab},{\hat{u}}_{3,2}^{ab}$ (for a, b ∈ [Q]), which decompose u and $\hat{\boldsymbol{u}}$ along θ ₁ and θ ₂ vectors. We define $\bar{\boldsymbol{u}}=\boldsymbol{u}-\hat{\boldsymbol{u}}$ . Then we have the same decomposition for ${\bar{u}}_{k,j}^{ab}={u}_{k,j}^{ab}-{\hat{u}}_{k,j}^{ab}$ for a, b ∈ [Q], k = 1, 2, 3, j = 1, 2.

Step 3. Construction of the kernel matrices.

Let $\boldsymbol{U},\hat{\boldsymbol{U}},\bar{\boldsymbol{U}}\in {\mathbb{R}}^{ND\times ND}$ with i, jth block (for i, j ∈ [N]) given by

$\begin{equation}{\boldsymbol{U}}_{ij}=\boldsymbol{u}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j}),\end{equation} \tag{ 119 }$

$\begin{equation}{\hat{\boldsymbol{U}}}_{ij}=\hat{\boldsymbol{u}}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j}),\end{equation} \tag{ 120 }$

$\begin{equation}{\bar{\boldsymbol{U}}}_{ij}=\bar{\boldsymbol{u}}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})=\boldsymbol{u}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})-\hat{\boldsymbol{u}}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j}).\end{equation} \tag{ 121 }$

Note that we have $\boldsymbol{U}=\hat{\boldsymbol{U}}+\bar{\boldsymbol{U}}$ . By equations (118) and (120), it is easy to see that $\hat{\boldsymbol{U}}{\succeq}0$ . Then we have $\boldsymbol{U}{\succeq}\bar{\boldsymbol{U}}$ . In the following, we would like to lower bound matrix $\bar{\boldsymbol{U}}$ .

We decompose $\bar{\boldsymbol{U}}$ as

$\begin{equation*}\bar{\boldsymbol{U}}=\boldsymbol{D}+\mathbf{\Delta },\end{equation*}$

where $\boldsymbol{D}\in {\mathbb{R}}^{DN\times DN}$ is a block-diagonal matrix, with

$\begin{equation}\boldsymbol{D}=\mathrm{diag}({\bar{\boldsymbol{U}}}_{11},\dots ,{\bar{\boldsymbol{U}}}_{NN}),\end{equation} \tag{ 122 }$

and $\mathbf{\Delta }\in {\mathbb{R}}^{DN\times DN}$ is formed by blocks ${\mathbf{\Delta }}_{ij}\in {\mathbb{R}}^{D\times D}$ for i, j ∈ [n], defined by

$\begin{equation}{\mathbf{\Delta }}_{ij}=\begin{cases}0,\quad \hfill & \quad i=j,\hfill \\ {\bar{\boldsymbol{U}}}_{ij},\quad \hfill & \quad i\ne j.\hfill \end{cases}\end{equation} \tag{ 123 }$

In the rest of the proof, we will prove that ${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}({d}^{-{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ and the block matrix D verifies the properties (105) and (106).

Step 4. Prove that ${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}({d}^{-{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ .

We will prove in fact that ${\Vert}\mathbf{\Delta }{{\Vert}}_{F}^{2}={o}_{d,\mathbb{P}}({d}^{-2{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ . For the rest of the proof, we fix ɛ₀ ∈ (0, 1) and we restrict ourselves without loss of generality to the set ${\mathcal{P}}_{{\varepsilon }_{0}}$ .

Let us start with ${\bar{\boldsymbol{u}}}^{(qq)}$ for q ∈ [Q]. Denoting ${\gamma }_{ij}^{(q)}=\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle /{d}_{q}< 1$ , we get, from equation (114),

$\begin{align}\hfill \left[\begin{matrix}\hfill {\bar{u}}_{1}^{(qq)}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {\bar{u}}_{2}^{(qq)}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {\bar{u}}_{3,1}^{(qq)}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {\bar{u}}_{3,2}^{(qq)}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \end{matrix}\right]& =\left[\begin{matrix}\hfill {u}_{1}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})-{\hat{u}}_{1}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {u}_{2}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})-{\hat{u}}_{2}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {u}_{3,1}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})-{\hat{u}}_{3,1}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \\ \hfill {u}_{3,2}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})-{\hat{u}}_{3,2}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j})\hfill \end{matrix}\right]\hfill \\ \hfill & ={d}_{q}^{-1}{({\boldsymbol{M}}_{ij}^{(qq)})}^{-1}\times \left[\begin{matrix}\hfill \mathrm{Tr}({\bar{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}))\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\bar{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{1}^{(q)},{\bar{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{1}^{(q)}\rangle \hfill \\ \hfill \langle {\boldsymbol{\theta }}_{2}^{(q)},{\bar{\boldsymbol{u}}}^{(qq)}({\boldsymbol{\theta }}_{1},{\boldsymbol{\theta }}_{2}){\boldsymbol{\theta }}_{2}^{(q)}\rangle \hfill \end{matrix}\right],\hfill \end{align} \tag{ 124 }$

where ${\boldsymbol{M}}_{ij}^{(qq)}$ is given by

$\begin{equation}\left[\begin{matrix}\hfill 1\hfill & \hfill 2{\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\gamma }_{ij}^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}\hfill & \hfill {({\tau }_{2}^{(q)})}^{2}\hfill \\ \hfill {\tau }_{1}^{(q)}{\tau }_{2}^{(q)}{\gamma }_{ij}^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}(1+{({\gamma }_{ij}^{(q)})}^{2})\hfill & \hfill {({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}{\gamma }_{ij}^{(q)}\hfill & \hfill {\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}{\gamma }_{ij}^{(q)}\hfill \\ \hfill {({\tau }_{1}^{(q)})}^{2}\hfill & \hfill 2{({\tau }_{1}^{(q)})}^{3}{\tau }_{2}^{(q)}{d}_{q}{\gamma }^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{4}{d}_{q}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}{({\gamma }_{ij}^{(q)})}^{2}\hfill \\ \hfill {({\tau }_{2}^{(q)})}^{2}\hfill & \hfill {\tau }_{1}^{(q)}{({\tau }_{2}^{(q)})}^{3}{d}_{q}{\gamma }_{ij}^{(q)}\hfill & \hfill {({\tau }_{1}^{(q)})}^{2}{({\tau }_{2}^{(q)})}^{2}{d}_{q}{({\gamma }^{(q)})}^{2}\hfill & \hfill {({\tau }_{2}^{(q)})}^{4}{d}_{q}\hfill \end{matrix}\right].\end{equation} \tag{ 125 }$

Using the notations of lemma 13, we get

$\begin{align*}\hfill \mathrm{Tr}({\boldsymbol{U}}_{ij}^{(qq)})& ={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R){\Vert}{\boldsymbol{x}}^{(q)}{{\Vert}}_{2}^{2}]\hfill \\ \hfill & ={r}_{q}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{i}^{(q)},{\boldsymbol{U}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle & ={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R)\langle {\boldsymbol{\theta }}_{i}^{(q)},{\boldsymbol{z}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)\langle {\boldsymbol{\theta }}_{j}^{(q)},{\boldsymbol{x}}^{(q)}\rangle ]\hfill \\ \hfill & ={r}_{q}^{2}{\tau }_{i}^{(q)}{\tau }_{j}^{(q)}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{i}^{(q)},{\boldsymbol{U}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle & ={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\langle {\boldsymbol{\theta }}_{i}^{(q)},{\boldsymbol{x}}^{(q)}\rangle }^{2}{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)]\hfill \\ \hfill & ={r}_{q}^{2}{({\tau }_{i}^{(q)})}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{j}^{(q)},{\boldsymbol{U}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle & ={\mathbb{E}}_{\boldsymbol{x}}[{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R){\langle {\boldsymbol{\theta }}_{j}^{(q)},{\boldsymbol{x}}^{(q)}\rangle }^{2}]\hfill \\ \hfill & ={r}_{q}^{2}{({\tau }_{j}^{(q)})}^{2}\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\hfill \end{align*}$

where we denoted 1_q = (0, ..., 0, 1, 0, ..., 0) (namely the q'th coordinate vector in ${\mathbb{R}}^{Q}$ ) and 2_q = (0, ..., 0, 2, 0, ..., 0) = 21_q.

We get similar expressions for ${\hat{\boldsymbol{U}}}_{ij}$ with ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })$ replaced by ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\hat{\sigma }}_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })$ . Because we defined σ' and ${\hat{\sigma }}^{\prime }$ by only modifying the l ₁th and l ₂th coefficients, we get

$\begin{align}\hfill \mathrm{Tr}({\bar{\boldsymbol{U}}}_{ij}^{(qq)})& =\mathrm{Tr}({\boldsymbol{U}}_{ij}^{(qq)}-{\hat{\boldsymbol{U}}}_{ij}^{(qq)})\hfill \\ \hfill & ={r}_{q}^{2}\sum\limits _{t=1,2}{\delta }_{t}(2-{\delta }_{t}){\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},{\boldsymbol{l}}_{t}){Q}_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}\left({\left\{{d}_{q}{\gamma }_{ij}^{(q)}\right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 126 }$

Recalling that ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{1}}_{q}}$ only depend on ${\lambda }_{\boldsymbol{k}-{\mathbf{1}}_{q}}^{\boldsymbol{d}}$ and ${\lambda }_{\boldsymbol{k}+{\mathbf{1}}_{q}}^{\boldsymbol{d}}$ , and ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{2}}_{q}}$ on ${\lambda }_{\boldsymbol{k}-{\mathbf{2}}_{q}}^{\boldsymbol{d}}$ , ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}$ and ${\lambda }_{\boldsymbol{k}+{\mathbf{2}}_{q}}^{\boldsymbol{d}}$ , (lemma 13), we get

$\begin{align}\hfill \langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle & ={r}_{q}^{2}{\tau }_{i}^{(q)}{\tau }_{j}^{(q)}\sum\limits _{t=\left\{1,2\right\},\boldsymbol{k}\in \left\{{\boldsymbol{l}}_{t}\pm {\mathbf{1}}_{q}\right\}}{\delta }_{t}(2-{\delta }_{t}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })\hfill \\ \hfill & \quad \times B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{{d}_{q}{\gamma }_{ij}^{(q)}\right\}}_{q\in [Q]}\right),\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle & ={r}_{q}^{2}{({\tau }_{i}^{(q)})}^{2}\sum\limits _{t\in \left\{1,2\right\},\boldsymbol{k}\in \left\{{\boldsymbol{l}}_{t},{\boldsymbol{l}}_{t}\pm {\mathbf{2}}_{q}\right\}}{\delta }_{t}(2-{\delta }_{t}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })\hfill \\ \hfill & \quad \times B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{{d}_{q}{\gamma }_{ij}^{(q)}\right\}}_{q\in [Q]}\right),\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{j}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle & ={r}_{q}^{2}{({\tau }_{j}^{(q)})}^{2}\sum\limits _{t\in \left\{1,2\right\},\boldsymbol{k}\in \left\{{\boldsymbol{l}}_{t},{\boldsymbol{l}}_{t}\pm {\mathbf{2}}_{q}\right\}}{\delta }_{t}(2-{\delta }_{t}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })\hfill \\ \hfill & \quad \times B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{{d}_{q}{\gamma }_{ij}^{(q)}\right\}}_{q\in [Q]}\right),\hfill \end{align} \tag{ 127 }$

where we used the convention ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })=0$ if one of the coordinates verifies k_q < 0.

From lemmas 13, 19 and 20, we get for t = 1, 2 and q ≠ q_ξ:

$\begin{equation}\begin{aligned}\hfill \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},{\boldsymbol{l}}_{t})& =\frac{{\mu }_{{l}_{t}}{({\sigma }^{\prime })}^{2}}{{l}_{t}!},\hfill \\ \hfill \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}+{\mathbf{1}}_{q}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{{\boldsymbol{l}}_{t}+{\mathbf{1}}_{q}}^{\boldsymbol{d},{\mathbf{1}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},{\boldsymbol{l}}_{t}+{\mathbf{1}}_{q})& =\frac{{\mu }_{{l}_{t}}{({\sigma }^{\prime })}^{2}}{{l}_{t}!},\hfill \\ \hfill \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{{\boldsymbol{l}}_{t}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},{\boldsymbol{l}}_{t})& =\frac{{\mu }_{{l}_{t}}{({\sigma }^{\prime })}^{2}}{{l}_{t}!},\hfill \\ \hfill \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}+{\mathbf{2}}_{q}}^{\boldsymbol{d},{\mathbf{2}}_{q}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){\lambda }_{{\boldsymbol{l}}_{t}+{\mathbf{2}}_{q}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{j}}^{\prime })B(\boldsymbol{d},{\boldsymbol{l}}_{t}+{\mathbf{2}}_{q})& =0,\hfill \end{aligned}\end{equation} \tag{ 128 }$

while for q = q_ξ and u ∈ {−1, 1},

$\begin{equation}\begin{aligned}\hfill & \quad \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}+u{\mathbf{1}}_{{q}_{\xi }}}^{\boldsymbol{d},{\mathbf{1}}_{{q}_{\xi }}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){[B(\boldsymbol{d},{\boldsymbol{l}}_{t}+u{\mathbf{1}}_{{q}_{\xi }})({l}_{t}+u)!]}^{1/2}\hfill \\ \hfill & ={\mu }_{{l}_{t}+u+1}({\sigma }^{\prime })+({l}_{t}+u){\mu }_{{l}_{t}+u-1}({\sigma }^{\prime }),\hfill \end{aligned}\end{equation} \tag{ 129 }$

and for v ∈ {−2, 0, 2},

$\begin{equation}\begin{aligned}\hfill & \quad \underset{(d,{\boldsymbol{\tau }}_{i},{\boldsymbol{\tau }}_{j})\to (+\infty ,\mathbf{1},\mathbf{1})}{\mathrm{lim}}\enspace {\lambda }_{{\boldsymbol{l}}_{t}+v{\mathbf{1}}_{{q}_{\xi }}}^{\boldsymbol{d},{\mathbf{2}}_{{q}_{\xi }}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{i}}^{\prime }){[B(\boldsymbol{d},{\boldsymbol{l}}_{t}+v{\mathbf{1}}_{{q}_{\xi }})({l}_{t}+v)!]}^{1/2}\hfill \\ \hfill & ={\mu }_{{l}_{t}+v+2}({\sigma }^{\prime })+(2{l}_{t}+2v+1){\mu }_{{l}_{t}+v}({\sigma }^{\prime })+({l}_{t}+v)({l}_{t}+v-1){\mu }_{{l}_{t}+v-2}({\sigma }^{\prime }).\hfill \end{aligned}\end{equation} \tag{ 130 }$

From lemma 26, we recall that the coefficients of the kth Gegenbauer polynomial ${Q}_{k}^{(d)}(x)={\sum }_{s=0}^{k}{p}_{k,s}^{(d)}{x}^{s}$ satisfy

$\begin{equation}{p}_{k,s}^{(d)}={O}_{d}({d}^{-k/2-s/2}).\end{equation} \tag{ 131 }$

Furthermore, lemma 27 shows that ${\mathrm{max}}_{i\ne j}\vert \langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle \vert ={O}_{d,\mathbb{P}}(\sqrt{{d}_{q}\enspace \mathrm{log}\enspace {d}_{q}})$ . We deduce that

$\begin{equation}{\mathrm{max}}_{i\ne j}\vert {Q}_{{k}_{q}}^{({d}_{q})}(\langle {\bar{\boldsymbol{\theta }}}_{i}^{(q)},{\bar{\boldsymbol{\theta }}}_{j}^{(q)}\rangle )\vert ={\tilde{O}}_{d,\mathbb{P}}({d}_{q}^{-{k}_{q}/2}).\end{equation} \tag{ 132 }$

Plugging the estimates (128) and (132) into equations (126) and (127), we obtain that

$\begin{align}\hfill & \underset{i\ne j}{\mathrm{max}}\left\{\left\vert \mathrm{Tr}({\bar{\boldsymbol{U}}}_{ij}^{(qq)})\right\vert ,\enspace \left\vert \langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle \right\vert ,\enspace \left\vert \langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle ,\enspace \right\vert \left.\langle {\boldsymbol{\theta }}_{j}^{(q)},{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}\rangle \right\vert \right\}\hfill \\ \hfill & \qquad ={\tilde{O}}_{d,\mathbb{P}}({d}^{2\xi }{d}^{-{\eta }_{q}{l}_{1}/2}).\hfill \end{align} \tag{ 133 }$

From equation (125), using the fact that ${\mathrm{max}}_{i\ne j}\vert {\gamma }_{ij}^{(q)}\vert ={O}_{d,\mathbb{P}}(\sqrt{(\mathrm{log}\enspace {d}_{q})/{d}_{q}})$ and Cramer's rule for matrix inversion, it is easy to see that

$\begin{equation}\underset{i\ne j}{\mathrm{max}}\enspace \underset{l,k\in [4]}{\mathrm{max}}\left\vert {({({\boldsymbol{M}}_{ij}^{(qq)})}^{-1})}_{lk}\right\vert ={O}_{d,\mathbb{P}}(1).\end{equation} \tag{ 134 }$

We deduce from (124), (133) and (134) that for a ∈ [3], b ∈ [2],

$\begin{equation}\underset{i\ne j}{\mathrm{max}}\left\{\vert {\bar{u}}_{a,b}^{(qq)}({\boldsymbol{\theta }}_{i}^{(q)},{\boldsymbol{\theta }}_{j}^{(q)})\vert \right\}={\tilde{O}}_{d,\mathbb{P}}({d}^{2\xi }{d}^{-{\eta }_{q}{l}_{1}/2}).\end{equation} \tag{ 135 }$

As a result, combining equation (135) with equation (121) in the expression of ${\bar{u}}^{(qq)}$ given in lemma 14, we get

$\begin{align*}\hfill & \underset{i\ne j}{\mathrm{max}}{\Vert}{\bar{\boldsymbol{U}}}_{ij}^{(qq)}{{\Vert}}_{F}^{2}\hfill & \hfill \\ \hfill & \qquad =\underset{i\ne j}{\mathrm{max}}{\Vert}{\bar{u}}_{1}^{(qq)}{\mathbf{I}}_{{d}_{q}}+{\bar{u}}_{2}^{(qq)}[{\boldsymbol{\theta }}_{i}^{(q)}{({\boldsymbol{\theta }}_{j}^{(q)})}^{\mathsf{T}}+{\boldsymbol{\theta }}_{j}^{(q)}{({\boldsymbol{\theta }}_{i}^{(q)})}^{\mathsf{T}}]+{\bar{u}}_{3,1}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}{({\boldsymbol{\theta }}_{i}^{(q)})}^{\mathsf{T}}+{\bar{u}}_{3,2}^{(qq)}{\boldsymbol{\theta }}_{j}^{(q)}{({\boldsymbol{\theta }}_{j}^{(q)})}^{\mathsf{T}}{{\Vert}}_{F}^{2}\hfill & \hfill \\ \hfill & \qquad \leqslant {\tilde{O}}_{d,\mathbb{P}}({d}^{6\xi }{d}^{-{\eta }_{q}{l}_{1}}).\hfill & \hfill \end{align*}$

A similar computation shows that

$\begin{equation*}\underset{i\ne j}{\mathrm{max}}{\Vert}{\bar{\boldsymbol{U}}}_{ij}^{(q{q}^{\prime })}{{\Vert}}_{F}^{2}\leqslant {\tilde{O}}_{d,\mathbb{P}}({d}^{6\xi }{d}^{-{\eta }_{q}{l}_{1}}).\end{equation*}$

By the expression of Δ given by (123), we conclude that

$\begin{equation*}{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}^{2}\leqslant {\Vert}\mathbf{\Delta }{{\Vert}}_{F}^{2}=\sum\limits _{q,{q}^{\prime }\in [Q]}\sum\limits _{i,j=1,i\ne j}^{N}{\Vert}{\bar{\boldsymbol{U}}}_{ij}^{(q{q}^{\prime })}{{\Vert}}_{F}^{2}={\tilde{O}}_{d,\mathbb{P}}({N}^{2}{d}^{6\xi -{\eta }_{q}{l}_{1}}).\end{equation*}$

By assumption, N = o_d(d^γ). Hence, since by assumption η_q l₁ ⩾ 2γ + 7ξ, we deduce that ${\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}({d}^{-\xi })={o}_{d,\mathbb{P}}({d}^{-{\mathrm{max}}_{q\in [Q]}{\kappa }_{q}})$ .

Step 5. Checking the properties of matrix D .

By lemma 14, we can express ${\bar{\boldsymbol{U}}}_{ii}$ as a block matrix with

$\begin{equation*}{\bar{\boldsymbol{U}}}_{ii}^{(qq)}={\alpha }^{(q)}{\mathbf{I}}_{{d}_{q}}+{\beta }^{(q)}{\boldsymbol{\theta }}_{i}^{(q)}{({\boldsymbol{\theta }}_{i}^{(q)})}^{\mathsf{T}},\qquad {\bar{\boldsymbol{U}}}_{ii}^{(q{q}^{\prime })}={\beta }^{(q{q}^{\prime })}{\boldsymbol{\theta }}_{i}^{(q)}{({\boldsymbol{\theta }}_{i}^{({q}^{\prime })})}^{\mathsf{T}},\end{equation*}$

with coefficients given by

$\begin{equation}\begin{aligned}\hfill \left[\begin{matrix}\hfill {\alpha }^{(q)}\hfill \\ \hfill {\beta }^{(q)}\hfill \end{matrix}\right]& ={[{d}_{q}({d}_{q}-1){({\tau }_{i}^{(q)})}^{4}]}^{-1}\left[\begin{matrix}\hfill {d}_{q}{({\tau }_{i}^{(q)})}^{4}\hfill & \hfill -{({\tau }_{i}^{(q)})}^{2}\hfill \\ \hfill -{({\tau }_{i}^{(q)})}^{2}\hfill & \hfill 1\hfill \end{matrix}\right]\times \left[\begin{matrix}\hfill \mathrm{Tr}({\bar{\boldsymbol{U}}}_{ii}^{(qq)})\hfill \\ \hfill \langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle \hfill \end{matrix}\right],\hfill \\ \hfill {\beta }^{(q{q}^{\prime })}& ={({d}_{q}{d}_{{q}^{\prime }})}^{-1}{({\tau }_{i}^{(q)}{\tau }_{i}^{({q}^{\prime })})}^{-2}\langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(q{q}^{\prime })}({\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{i}){\boldsymbol{\theta }}_{i}^{({q}^{\prime })}\rangle .\hfill \end{aligned}\end{equation} \tag{ 136 }$

Let us first focus on the q = q_ξ sphere. Using equations (126) and (127) with the expressions (129) and (130), we get the following convergence in probability (using that ${\left\{{\tau }_{i}^{(q)}\right\}}_{i\in [N]}$ concentrates on 1),

$\begin{equation}\begin{aligned}\hfill {\mathrm{sup}}_{i\in [N]}\left\vert {r}_{{q}_{\xi }}^{-2}\mathrm{Tr}({\bar{\boldsymbol{U}}}_{ii}^{({q}_{\xi }{q}_{\xi })})-{F}_{1}(\boldsymbol{\delta })\right\vert \stackrel{\mathbb{P}}{\to }& \hspace{2pt}0,\hfill \\ \hfill {\mathrm{sup}}_{i\in [N]}\left\vert {r}_{{q}_{\xi }}^{-2}\langle {\boldsymbol{\theta }}_{i}^{({q}_{\xi })},{\bar{\boldsymbol{U}}}_{ii}^{({q}_{\xi }{q}_{\xi })}{\boldsymbol{\theta }}_{i}^{({q}_{\xi })}\rangle -{F}_{2}(\boldsymbol{\delta })\right\vert \stackrel{\mathbb{P}}{\to }& \hspace{2pt}0,\hfill \end{aligned}\end{equation} \tag{ 137 }$

where we denoted δ = (δ₁, δ₂) (where δ₁, δ₂ first appears in the definition of $\hat{\sigma }$ in equation (115), and till now δ₁, δ₂ are still not determined) and, similarly to the proof of proposition 5 in [21] and letting μ_k ≡ μ_k(σ'), we have

$\begin{equation}{F}_{1}(\boldsymbol{\delta })=\sum\limits _{t\in \left\{1,2\right\}}{\delta }_{t}(2-{\delta }_{t})\frac{{\mu }_{{l}_{t}}^{2}}{{l}_{t}!},\end{equation} \tag{ 138 }$

while, for l₂ ≠ l₁ + 2

$\begin{align*}\hfill {F}_{2}(\boldsymbol{\delta })& =\sum\limits _{t\in \left\{1,2\right\}}\left\{\frac{1}{({l}_{t}-1)!}\left[{({\mu }_{{l}_{t}}+({l}_{t}-1){\mu }_{{l}_{t}-2})}^{2}-{((1-{\delta }_{t}){\mu }_{{l}_{t}}+({l}_{t}-1){\mu }_{{l}_{t}-2})}^{2}\right]\right.\hfill \\ \hfill & \quad +\left.\frac{1}{({l}_{t}+1)!}\left[{({\mu }_{{l}_{t}+2}+({l}_{t}+1){\mu }_{{l}_{t}})}^{2}-{({\mu }_{{l}_{t}+2}+(1-{\delta }_{t})({l}_{t}+1){\mu }_{{l}_{t}})}^{2}\right]\right\},\hfill \end{align*}$

while, for l₂ = l₁ + 2

$\begin{align*}\hfill {F}_{2}(\boldsymbol{\delta })& =\frac{1}{({l}_{1}-1)!}\left[{({\mu }_{{l}_{1}}+({l}_{1}-1){\mu }_{{l}_{1}-2})}^{2}-{((1-{\delta }_{1}){\mu }_{{l}_{1}}+({l}_{1}-1){\mu }_{{l}_{1}-2})}^{2}\right]\hfill \\ \hfill & \quad +\frac{1}{({l}_{1}+1)!}\left[{({\mu }_{{l}_{1}+2}+({l}_{1}+1){\mu }_{{l}_{1}})}^{2}-{((1-{\delta }_{2}){\mu }_{{l}_{1}+2}+(1-{\delta }_{1})({l}_{1}+1){\mu }_{{l}_{1}})}^{2}\right]\hfill \\ \hfill & \quad +\frac{1}{({l}_{2}+1)!}\left[{({\mu }_{{l}_{2}+2}+({l}_{2}+1){\mu }_{{l}_{2}})}^{2}-{({\mu }_{{l}_{2}+2}+(1-{\delta }_{2})({l}_{2}+1){\mu }_{{l}_{2}})}^{2}\right].\hfill \end{align*}$

We have from equation (136),

$\begin{align*}\hfill & {\lambda }_{\mathrm{min}}({\bar{U}}_{ii}^{(qq)})\hfill \\ \hfill & \quad =\mathrm{min}\left\{{\alpha }^{(q)},{\alpha }^{(q)}+{\beta }^{(q)}{d}_{q}{({\tau }_{i}^{(q)})}^{2}\right\}\hfill \\ \hfill & \quad =\mathrm{min}\left\{\frac{1}{{d}_{q}-1}\mathrm{Tr}({\bar{\boldsymbol{U}}}_{ii}^{(qq)})-\frac{1}{{d}_{q}({d}_{q}-1){({\tau }_{i}^{(q)})}^{2}}\langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle ,\frac{1}{{d}_{q}{({\tau }_{i}^{(q)})}^{2}}\langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle \right\}.\hfill \end{align*}$

Hence, using equation (137), we get

$\begin{equation}\underset{i\in [N]}{\mathrm{sup}}\left\vert \frac{{d}_{{q}_{\xi }}}{{r}_{{q}_{\xi }}^{2}}{\lambda }_{\mathrm{min}}({\bar{U}}_{ii}^{({q}_{\xi }{q}_{\xi })})-\mathrm{min}\left\{{F}_{1}(\boldsymbol{\delta }),{F}_{2}(\boldsymbol{\delta })\right\}\right\vert \stackrel{\mathbb{P}}{\to }\hspace{2pt}0.\end{equation} \tag{ 139 }$

Following the same reasoning as in proposition 5 in [21], we can verify that under assumption 3(b), we have ∇F₁(0), ∇F₂(0) ≠ 0 and det(∇F₁(0), ∇F₂(0)) ≠ 0. We can therefore find δ = (δ₁, δ₂) such that F₁( δ ) > 0, F₂( δ ) > 0. Furthermore,

$\begin{equation}\underset{i\in [N]}{\mathrm{sup}}\left\vert \frac{{d}_{{q}_{\xi }}}{{r}_{{q}_{\xi }}^{2}}{\lambda }_{\mathrm{max}}({\bar{U}}_{ii}^{({q}_{\xi }{q}_{\xi })})-\mathrm{max}\left\{{F}_{1}(\boldsymbol{\delta }),{F}_{2}(\boldsymbol{\delta })\right\}\right\vert \stackrel{\mathbb{P}}{\to }\hspace{2pt}0.\end{equation} \tag{ 140 }$

Similarly, we get for q ≠ q_ξ from equations (126) and (127) with the expressions (128) (recalling that ${\left\{{\tau }_{i}^{(q)}\right\}}_{i\in [N]}$ concentrates on 1),

$\begin{equation}\begin{aligned}\hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert {r}_{q}^{-2}\mathrm{Tr}({\bar{\boldsymbol{U}}}_{ii}^{(qq)})-{F}_{1}(\boldsymbol{\delta })\right\vert \stackrel{\mathbb{P}}{\to }& \hspace{2pt}0,\hfill \\ \hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert {r}_{q}^{-2}\langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(qq)}{\boldsymbol{\theta }}_{i}^{(q)}\rangle -{F}_{1}(\boldsymbol{\delta })\right\vert \stackrel{\mathbb{P}}{\to }& \hspace{2pt}0,\hfill \\ \hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert {({r}_{q}{r}_{{q}^{\prime }})}^{-1}\langle {\boldsymbol{\theta }}_{i}^{(q)},{\bar{\boldsymbol{U}}}_{ii}^{(q{q}^{\prime })}{\boldsymbol{\theta }}_{i}^{({q}^{\prime })}\rangle \right\vert \stackrel{\mathbb{P}}{\to }& \hspace{2pt}0.\hfill \end{aligned}\end{equation} \tag{ 141 }$

We deduce that for q ≠ q_ξ and q ≠ q',

$\begin{equation*}\begin{aligned}\hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert \frac{{d}_{q}}{{r}_{q}^{2}}{\lambda }_{\mathrm{min}}({\bar{U}}_{ii}^{(qq)})-{F}_{1}(\boldsymbol{\delta })\right\vert & \stackrel{\mathbb{P}}{\to }\hspace{2pt}0,\hfill \\ \hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert \frac{{d}_{q}}{{r}_{q}^{2}}{\lambda }_{\mathrm{max}}({\bar{U}}_{ii}^{(qq)})-{F}_{1}(\boldsymbol{\delta })\right\vert & \stackrel{\mathbb{P}}{\to }\hspace{2pt}0,\hfill \\ \hfill \underset{i\in [N]}{\mathrm{sup}}\left\vert \frac{{({d}_{q}{d}_{{q}^{\prime }})}^{1/2}}{{r}_{q}{r}_{{q}^{\prime }}}{\sigma }_{\mathrm{max}}({\bar{U}}_{ii}^{(q{q}^{\prime })})\right\vert & \stackrel{\mathbb{P}}{\to }\hspace{2pt}0,\hfill \end{aligned}\end{equation*}$

which finishes to prove properties (105) and (106).

Appendix H.: Proof of theorem 7(b): upper bound for NT model

H.1. Preliminaries

Lemma 15. Let σ be an activation function that satisfies assumptions 3(a) and (c) for some level γ > 0. Let $\mathcal{Q}={\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )$ as defined in equation (49). Define for integer $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ ,

$\begin{align}\hfill {A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}& ={r}_{q}^{2}\cdot \left[{t}_{{d}_{q},{k}_{q}-1}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })B(\boldsymbol{d},{\boldsymbol{k}}_{q-})\right.\hfill \\ \hfill & \quad \left.+{s}_{{d}_{q},{k}_{q}+1}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })B(\boldsymbol{d},{\boldsymbol{k}}_{q+})\right],\hfill \end{align} \tag{ 142 }$

with k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q), and

$\begin{equation*}{s}_{d,k}=\frac{k}{2k+d-2},\qquad {t}_{d,k}=\frac{k+d-2}{2k+d-2},\end{equation*}$

with the convention t_d,−1 = 0.

Then there exists constants ɛ₀ > 0 and C > 0 such that for d large enough, we have for any $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation*}\underset{\boldsymbol{k}\in \mathcal{Q}}{\mathrm{max}}\enspace \frac{B(\boldsymbol{d},\boldsymbol{k})}{{A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}}\leqslant C{d}^{\gamma -{\kappa }_{q}}.\end{equation*}$

Proof of lemma 15. From assumptions 3(a) and (c) and lemma 19, there exists c > 0 and ɛ₀ > 0 such that for any $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and $\boldsymbol{k}\in \mathcal{Q}$ ,

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })\geqslant c\prod\limits _{q\in [Q]}{d}^{{k}_{q}({\kappa }_{q}-\xi )}.\end{equation*}$

Hence for k_q > 0, we get ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })\geqslant c{d}^{-\gamma -\xi +{\kappa }_{q}}$ , and for k_q = 0, we get ${\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })\geqslant c{d}^{-\gamma +\xi -{\eta }_{q}-{\kappa }_{q}}$ . Carefully injecting these bounds in equation (142) yields the lemma. □

H.2. Proof of theorem 7(b): outline

In this proof, we will consider Q sub-classes of functions corresponding to the NT model restricted to the qth sphere:

$\begin{equation*}{\mathcal{F}}_{{\mathrm{NT}}^{(q)}}(\boldsymbol{W}\hspace{2pt})\equiv \left\{f(\boldsymbol{x})=\sum\limits _{i=1}^{N}\langle {\boldsymbol{a}}_{i},{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle /R):{\boldsymbol{a}}_{i}\in {\mathbb{R}}^{{d}_{q}},i\in [N]\right\}.\end{equation*}$

We define similarly the risk associated to this sub-model

$\begin{equation*}{R}_{{\mathrm{NT}}^{(q)}}({f}_{d},\boldsymbol{W}\hspace{2pt})=\underset{f\in {\mathcal{F}}_{{\mathrm{NT}}^{(q)}}(\boldsymbol{W}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{({f}_{d}(\boldsymbol{x})-f(\boldsymbol{x}))}^{2}].\end{equation*}$

and approximation subspace

$\begin{align}\hfill {\bar{\mathcal{Q}}}_{{\mathrm{NT}}^{(q)}}(\gamma )& =\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert {k}_{q} > 0\quad \text{and}\quad \sum\limits _{q\in [Q]}{k}_{q}(\xi -{\kappa }_{q})\leqslant \gamma +(\xi -{\kappa }_{q})\right.\right\}\hfill \\ \hfill & \cup \left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert {k}_{q}=0\quad \text{and}\quad \sum\limits _{q\in [Q]}{k}_{q}(\xi -{\kappa }_{q})\leqslant \gamma -(\xi -{\kappa }_{q}-{\eta }_{q})\right.\right\}.\hfill \end{align} \tag{ 143 }$

Theorem 8. Let ${\left\{{f}_{d}\in {L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})\right\}}_{d\geqslant 1}$ be a sequence of functions. Let $\boldsymbol{W}={({\boldsymbol{w}}_{i})}_{i\in [N]}$ with ${({\boldsymbol{w}}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{D-1})$ independently. Assume N ⩾ ω_d(d^γ) for some positive constant γ > 0, and σ satisfy assumptions 3(a) and (c) at level γ. Then for any ɛ > 0, the following holds with high probability:

$\begin{equation}0\leqslant {R}_{{\mathrm{NT}}^{(q)}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})\leqslant \varepsilon {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\end{equation} \tag{ 144 }$

where $\mathcal{Q}\equiv {\bar{\mathcal{Q}}}_{{\mathrm{NT}}^{(q)}}(\gamma )$ is defined in equation (143).

Remark 5. From the proof of theorem 7(a), we have a matching lower bound for ${\mathcal{F}}_{{\mathrm{NT}}^{(q)}}$ .

We recall

$\begin{equation*}{\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \sum\limits _{q=1}^{Q}(\xi -{\kappa }_{q}){k}_{q}\leqslant \gamma +\left(\xi -\underset{q\in S(\boldsymbol{k})}{\mathrm{min}}\enspace {\kappa }_{q}\right)\right.\right\}.\end{equation*}$

Notice that

$\begin{equation*}{\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )=\bigcup\limits _{q\in Q}{\bar{\mathcal{Q}}}_{{\mathrm{NT}}^{(q)}}(\gamma ).\end{equation*}$

Denote q_k = argmin_{q∈S(
k
)} κ_q, such that $\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{{\mathrm{NT}}^{({q}_{\boldsymbol{k}})}}$ for any $\boldsymbol{k}\in {\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )$ . Furthermore, notice that by definition for any $f\in {L}^{2}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}},{\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }})$ and q ∈ [Q],

$\begin{equation*}{R}_{\mathrm{NT}}(f,\boldsymbol{W}\hspace{2pt})\leqslant {R}_{{\mathrm{NT}}^{(q)}}(f,\boldsymbol{W}\hspace{2pt}).\end{equation*}$

Let us deduce theorem 7(b) from theorem 8. Denote $\mathcal{Q}={\bar{\mathcal{Q}}}_{\mathrm{NT}}(\gamma )$ . We divide the N neurons in $\vert \mathcal{Q}\vert$ sections of size ${N}^{\prime }=N/\vert \mathcal{Q}\vert$ , i.e. $\boldsymbol{W}={({\boldsymbol{W}}_{\boldsymbol{k}})}_{\boldsymbol{k}\in \mathcal{Q}}$ where ${\boldsymbol{W}}_{\boldsymbol{k}}\in {\mathbb{R}}^{{N}^{\prime }\times d}$ . For any ɛ > 0, we get from theorem 8 that with high probability

$\begin{equation*}{R}_{\mathrm{NT}}({\mathsf{P}}_{\mathcal{Q}}{f}_{d},\boldsymbol{W}\hspace{2pt})\leqslant \sum\limits _{\boldsymbol{k}\in \mathcal{Q}}{R}_{{\mathrm{NT}}^{({q}_{\boldsymbol{k}})}}({\mathsf{P}}_{\boldsymbol{k}}f,{\boldsymbol{W}}_{\boldsymbol{k}})\leqslant \sum\limits _{\boldsymbol{k}\in \mathcal{Q}}\varepsilon {\Vert}{\mathsf{P}}_{\boldsymbol{k}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}=\varepsilon {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

H.3. Proof of theorem 8

H.3.1. Properties of the limiting kernel

Similarly to the proof of theorem 6(b), we construct a limiting kernel which is used as a proxy to upper bound the NT^(q) risk.

We recall the definition of ${\mathrm{PS}}^{\boldsymbol{d}}={\prod }_{q\in [Q]}{\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})$ . We introduce $\mathcal{L}={L}^{2}({\mathrm{PS}}^{\boldsymbol{d}}\to \mathbb{R},{\mu }_{\boldsymbol{d}})$ and ${\mathcal{L}}_{{d}_{q}}={L}^{2}({\mathrm{PS}}^{\boldsymbol{d}}\to {\mathbb{R}}^{{d}_{q}},{\mu }_{\boldsymbol{d}})$ . For a given $\boldsymbol{\theta }\in {\mathbb{S}}^{D-1}(\sqrt{D})$ and associated vector $\boldsymbol{\tau }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ , recall the definition of ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\in \mathcal{L}$ :

$\begin{equation*}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)={\sigma }^{\prime }\left(\sum\limits _{q\in [Q]}{\tau }^{(q)}\cdot ({r}_{q}/R)\cdot \langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right).\end{equation*}$

For any $\boldsymbol{\tau }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ , define the operator ${\mathbb{T}}_{\boldsymbol{\tau }}:\mathcal{L}\to {\mathcal{L}}_{{d}_{q}}$ , such that for any $g\in \mathcal{L}$ ,

$\begin{equation*}{\mathbb{T}}_{\boldsymbol{\tau }}g(\bar{\boldsymbol{\theta }}\hspace{2pt})=\frac{{r}_{q}}{\sqrt{{d}_{q}}}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\bar{\boldsymbol{x}}}^{(q)}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)g(\bar{\boldsymbol{x}}\hspace{2pt})\right].\end{equation*}$

The adjoint operator ${\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }:{\mathcal{L}}_{{d}_{q}}\to \mathcal{L}$ verifies for any $h\in {\mathcal{L}}_{{d}_{q}}$ ,

$\begin{equation*}{\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }h(\bar{\boldsymbol{x}}\hspace{2pt})=\frac{{r}_{q}}{\sqrt{{d}_{q}}}{({\bar{\boldsymbol{x}}}^{(q)})}^{\mathsf{T}}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)h(\bar{\boldsymbol{\theta }}\hspace{2pt})\right].\end{equation*}$

We define the operator ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}:{\mathcal{L}}_{{d}_{q}}\to {\mathcal{L}}_{{d}_{q}}$ as ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}\equiv {\mathbb{T}}_{\boldsymbol{\tau }}{\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}^{\ast }$ . For $h\in {\mathcal{L}}_{{d}_{q}}$ , we can write

$\begin{equation*}{\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}h({\bar{\boldsymbol{\theta }}}_{1})={\mathbb{E}}_{{\bar{\boldsymbol{\theta }}}_{2}}[{\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2})h({\bar{\boldsymbol{\theta }}}_{2})],\end{equation*}$

where

$\begin{align*}\hfill & {\mathbb{K}}_{{\boldsymbol{\tau }}_{1},{\boldsymbol{\tau }}_{2}}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2})\hfill \\ \hfill & \qquad =\frac{{r}_{q}^{2}}{{d}_{q}}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\bar{\boldsymbol{x}}}^{(q)}{({\bar{\boldsymbol{x}}}^{(q)})}^{\mathsf{T}}{\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{1}}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}_{2}}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{2}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right].\hfill \end{align*}$

Define ${\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}:\mathcal{L}\to \mathcal{L}$ as ${\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}\equiv {\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }{\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}$ . For $g\in \mathcal{L}$ , we can write

$\begin{equation*}{\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}g({\bar{\boldsymbol{x}}}_{1})={\mathbb{E}}_{{\bar{\boldsymbol{x}}}_{2}}[{\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}({\bar{\boldsymbol{x}}}_{1},{\bar{\boldsymbol{x}}}_{2})g({\bar{\boldsymbol{x}}}_{2})]\end{equation*}$

where

$\begin{align*}\hfill & {\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}({\bar{\boldsymbol{x}}}_{1},{\bar{\boldsymbol{x}}}_{2})\hfill \\ \hfill & \qquad =\frac{{r}_{q}^{2}}{{d}_{q}}{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}_{1}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}_{2}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right]\langle {\bar{\boldsymbol{x}}}_{1}^{(q)},{\bar{\boldsymbol{x}}}_{2}^{(q)}\rangle .\hfill \end{align*}$

We recall the decomposition of ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }$ in terms of tensor product of Gegenbauer polynomials:

$\begin{equation*}\begin{aligned}\hfill {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)})& =\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right),\hfill \\ \hfill {\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime })& ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right].\hfill \end{aligned}\end{equation*}$

Following the same computations as in lemma 12, we get

$\begin{equation*}{\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}({\bar{\boldsymbol{x}}}_{1},{\bar{\boldsymbol{x}}}_{2})=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}_{1}^{(q)},{\bar{\boldsymbol{x}}}_{2}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation*}$

where

$\begin{align}\hfill {A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}& ={r}_{q}^{2}\cdot \left[{t}_{{d}_{q},{k}_{q}-1}{\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{{\boldsymbol{k}}_{q-}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })B(\boldsymbol{d},{\boldsymbol{k}}_{q-})\right.\hfill \\ \hfill & \quad \left.+{s}_{{d}_{q},{k}_{q}+1}{\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }){\lambda }_{{\boldsymbol{k}}_{q+}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime })B(\boldsymbol{d},{\boldsymbol{k}}_{q+})\right],\hfill \end{align} \tag{ 145 }$

with k _q+ = (k₁, ..., k_q + 1, ..., k_Q) and k _q− = (k₁, ..., k_q − 1, ..., k_Q), and convention ${t}_{{d}_{q},-1}=0$ ,

$\begin{equation*}{s}_{{d}_{q},{k}_{q}}=\frac{{k}_{q}}{2{k}_{q}+{d}_{q}-2},\qquad {t}_{{d}_{q},{k}_{q}}=\frac{{k}_{q}+{d}_{q}-2}{2{k}_{q}+{d}_{q}-2}.\end{equation*}$

Recall that for $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and s ∈ [B( d , k )], ${Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}={\bigotimes}_{q\in [Q]}{Y}_{{k}_{q}{s}_{q}}^{({d}_{q})}$ forms an orthogonal basis of $\mathcal{L}$ and that

$\begin{equation*}{\mathbb{E}}_{{\bar{\boldsymbol{x}}}_{2}}\left[{Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}_{1},{\bar{\boldsymbol{x}}}_{2}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{2})\right]=\frac{1}{B(\boldsymbol{d},\boldsymbol{k})}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{1}){\delta }_{\boldsymbol{k},\boldsymbol{s}}.\end{equation*}$

We deduce that

$\begin{align*}\hfill {\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{1})& =\sum\limits _{{\boldsymbol{k}}^{\prime }\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),{\boldsymbol{k}}^{\prime }}^{(q)}{\mathbb{E}}_{{\bar{\boldsymbol{x}}}_{2}}\left[{Q}_{{\boldsymbol{k}}^{\prime }}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{x}}}_{1},{\bar{\boldsymbol{x}}}_{2}\rangle \right\}}_{q\in [Q]}\right){Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{2})\right]\hfill \\ \hfill & =\frac{{A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}({\bar{\boldsymbol{x}}}_{1}).\hfill \end{align*}$

Consider ${\left\{{\mathbb{T}}_{\boldsymbol{\tau }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}\right\}}_{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q},\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}$ . We have:

$\begin{equation*}\begin{aligned}\hfill \langle {\mathbb{T}}_{\boldsymbol{\tau }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}},{\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}{Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}{\rangle }_{{L}^{2}}& =\langle {Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}},{\mathbb{H}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}{Y}_{{\boldsymbol{k}}^{\prime },{\boldsymbol{s}}^{\prime }}^{\boldsymbol{d}}{\rangle }_{{L}^{2}}=\frac{{A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}{\delta }_{\boldsymbol{k},{\boldsymbol{k}}^{\prime }}{\delta }_{\boldsymbol{s},{\boldsymbol{s}}^{\prime }},\hfill \\ \hfill {\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}{\mathbb{T}}_{{\boldsymbol{\tau }}^{{\prime\prime}}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}& ={\mathbb{T}}_{\boldsymbol{\tau }}{\mathbb{H}}_{{\boldsymbol{\tau }}^{\prime },{\boldsymbol{\tau }}^{{\prime\prime}}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}=\frac{{A}_{({\boldsymbol{\tau }}^{\prime },{\boldsymbol{\tau }}^{{\prime\prime}}),\boldsymbol{k}}^{(q)}}{B(\boldsymbol{d},\boldsymbol{k})}{\mathbb{T}}_{\boldsymbol{\tau }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}.\hfill \end{aligned}\end{equation*}$

Hence $\left\{{\mathbb{T}}_{{\boldsymbol{\tau }}^{{\prime\prime}}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{(\boldsymbol{d})}\right\}$ forms an orthogonal basis that diagonalizes ${\mathbb{K}}_{{\boldsymbol{\tau }}^{\prime },{\boldsymbol{\tau }}^{{\prime\prime}}}$ (notice that ${\mathbb{T}}_{\boldsymbol{\tau }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}$ is parallel to ${\mathbb{T}}_{{\boldsymbol{\tau }}^{\prime }}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}$ for any $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {\mathbb{R}}_{\geqslant 0}^{Q}$ ). Let us consider the subspace ${\mathbb{T}}_{\boldsymbol{\tau }}({V}_{\mathcal{Q}}^{\boldsymbol{d}})$ , the image of ${V}_{\mathcal{Q}}^{\boldsymbol{d}}$ by the operator ${\mathbb{T}}_{\boldsymbol{\tau }}$ . From assumptions 3(a) and (b) and lemma 19, there exists ɛ₀ ∈ (0, 1) and d₀ such that for any $\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ and d ⩾ d₀, we have ${A}_{(\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }),\boldsymbol{k}}^{(q)} > 0$ for any $\boldsymbol{k}\in \mathcal{Q}$ , and therefore the inverse ${\mathbb{K}}_{\boldsymbol{\tau },{\boldsymbol{\tau }}^{\prime }}^{-1}{\vert }_{{\mathbb{T}}_{\boldsymbol{\tau }}({V}_{\mathcal{Q}}^{\boldsymbol{d}})}$ (restricted to ${\mathbb{T}}_{\boldsymbol{\tau }}({V}_{\mathcal{Q}}^{\boldsymbol{d}})$ ) is well defined.

H.3.2. Proof of theorem 8

Let us assume that {f_d} is contained in ${\bigoplus}_{\boldsymbol{k}\in \mathcal{Q}}{\boldsymbol{V}}_{\boldsymbol{k}}^{\boldsymbol{d}}$ , i.e. ${\bar{f}}_{d}={\mathsf{P}}_{\mathcal{Q}}{\bar{f}}_{d}$ .

Consider

$\begin{equation*}\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },\boldsymbol{a})=\sum\limits _{i=1}^{N}\langle {\boldsymbol{a}}_{i},{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R).\end{equation*}$

Define ${\boldsymbol{\alpha }}_{\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }}\hspace{2pt})\equiv {\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}(\bar{\boldsymbol{\theta }}\hspace{2pt})$ and choose ${\boldsymbol{a}}_{i}^{\ast }={N}^{-1}{\boldsymbol{\alpha }}_{{\boldsymbol{\tau }}_{i}}({\bar{\boldsymbol{\theta }}}_{i})$ , where we denoted ${\bar{\boldsymbol{\theta }}}_{i}={({\bar{\boldsymbol{\theta }}}_{i}^{(q)})}_{q\in [Q]}$ with ${\bar{\boldsymbol{\theta }}}_{i}^{(q)}={\boldsymbol{\theta }}_{i}^{(q)}/{\tau }_{i}^{(q)}\in {\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}})$ independent of τ _i.

Fix ɛ₀ > 0 as prescribed in lemma 15 and consider the expectation over ${\mathcal{P}}_{{\varepsilon }_{0}}$ of the NT^(q) risk (in particular, ${\boldsymbol{a}}^{\ast }=({\boldsymbol{a}}_{1}^{\ast },\dots ,{\boldsymbol{a}}_{N}^{\ast })\in {\mathbb{R}}^{N{d}_{q}}$ are well defined):

$\begin{equation*}\begin{aligned}\hfill {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}[{R}_{{\mathrm{NT}}^{(q)}}({f}_{d},\mathbf{\Theta })]& ={\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\underset{\boldsymbol{a}\in {\mathbb{R}}^{N{d}_{q}}}{\mathrm{inf}}\enspace {\mathbb{E}}_{\boldsymbol{x}}[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },\boldsymbol{a}))}^{2}]\right]\hfill \\ \hfill & \leqslant {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\boldsymbol{x}}\left[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },{\boldsymbol{a}}^{\ast }(\mathbf{\Theta })))}^{2}\right]\right].\hfill \end{aligned}\end{equation*}$

We can expand the squared loss at a as

$\begin{align}\hfill {\mathbb{E}}_{\boldsymbol{x}}[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}))}^{2}]& ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2\sum\limits _{i=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{a}}_{i},{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){f}_{d}(\boldsymbol{x})]\hfill \\ \hfill & \quad +\sum\limits _{i,j=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{a}}_{i},{\boldsymbol{x}}^{(q)}\rangle \langle {\boldsymbol{a}}_{j},{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)].\hfill \end{align} \tag{ 146 }$

The second term of the expansion (146) around a * verifies

$\begin{align}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i=1}^{N}{\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{a}}_{i}^{\ast },{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){f}_{d}(\boldsymbol{x})]\right]\hfill \\ \hfill & \qquad ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\bar{\boldsymbol{\theta }}}\left[{\boldsymbol{\alpha }}_{\boldsymbol{\tau }}{(\bar{\boldsymbol{\theta }}\hspace{2pt})}^{\mathsf{T}}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\boldsymbol{x}}^{(q)}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\bar{f}}_{d}(\bar{\boldsymbol{x}}\hspace{2pt})\right]\right]\right]\hfill \\ \hfill & \qquad ={\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}\left[{\langle {\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d},{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}\rangle }_{{L}^{2}}\right]\hfill \\ \hfill & \qquad ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2},\hfill \end{align} \tag{ 147 }$

where we used that for each $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , we have ${\mathbb{T}}_{\boldsymbol{\tau }}^{\ast }{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}=\mathbf{I}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}$ .

Let us consider the third term in the expansion (146) around a *: the non diagonal term verifies

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\ne j}{\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{a}}_{i}^{\ast },{\boldsymbol{x}}^{(q)}\rangle \langle {\boldsymbol{a}}_{j}^{\ast },{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)]\right]\hfill \\ \hfill & \qquad =(1-{N}^{-1}){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2},{\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}}\left[{\boldsymbol{\alpha }}_{{\boldsymbol{\tau }}^{1}}{({\bar{\boldsymbol{\theta }}}_{1})}^{\mathsf{T}}{E}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{1}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right)\right.\right.\hfill \\ \hfill & \qquad \quad \left.\left.\times {\sigma }_{\boldsymbol{d},{\boldsymbol{\tau }}^{\prime }}^{\prime }\left({\left\{\langle {\bar{\boldsymbol{\theta }}}_{2}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right){\boldsymbol{x}}^{(q)}{({\boldsymbol{x}}^{(q)})}^{\mathsf{T}}\right]{\boldsymbol{\alpha }}_{{\boldsymbol{\tau }}^{2}}({\bar{\boldsymbol{\theta }}}_{2})\right]\hfill \\ \hfill & \qquad =(1-{N}^{-1}){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2},{\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}}\left[{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}{\bar{f}}_{d}{({\bar{\boldsymbol{\theta }}}_{1})}^{\mathsf{T}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}({\bar{\boldsymbol{\theta }}}_{1},{\bar{\boldsymbol{\theta }}}_{2}){\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{\bar{f}}_{d}({\bar{\boldsymbol{\theta }}}_{2})\right]\hfill \\ \hfill & \qquad =(1-{N}^{-1}){\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{1},{\boldsymbol{\tau }}_{{\varepsilon }_{0}}^{2}}\left[{\langle {\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}{\bar{f}}_{d},{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{\bar{f}}_{d}\rangle }_{{L}^{2}}\right].\hfill \end{align*}$

For $\boldsymbol{k}\in \mathcal{Q}$ and s ∈ [B( d , k )] and ${\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , we have

$\begin{equation*}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}{Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}=\left({\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}\right)\cdot \left({\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}\right)\cdot {Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}={Y}_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}.\end{equation*}$

Hence for any ${\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ , ${\mathbb{T}}_{{\boldsymbol{\tau }}^{1}}^{\ast }{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{1}}^{-1}{\mathbb{K}}_{{\boldsymbol{\tau }}^{1},{\boldsymbol{\tau }}^{2}}{\mathbb{K}}_{{\boldsymbol{\tau }}^{2},{\boldsymbol{\tau }}^{2}}^{-1}{\mathbb{T}}_{{\boldsymbol{\tau }}^{2}}=\mathbf{I}{\vert }_{{V}_{\mathcal{Q}}^{\boldsymbol{d}}}$ . Hence

$\begin{equation}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\ne j}{\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{a}}_{i}^{\ast },{\boldsymbol{x}}^{(q)}\rangle \langle {\boldsymbol{a}}_{j}^{\ast },{\boldsymbol{x}}^{(q)}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)]\right]=(1-{N}^{-1}){\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 148 }$

The diagonal term verifies

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\in [N]}{\mathbb{E}}_{\boldsymbol{x}}[{\langle {\boldsymbol{a}}_{i}^{\ast },{\boldsymbol{x}}^{(q)}\rangle }^{2}{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)]\right]\hfill \\ \hfill & \qquad ={N}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}},\bar{\boldsymbol{\theta }}}\left[{\boldsymbol{\alpha }}_{\boldsymbol{\tau }}{(\bar{\boldsymbol{\theta }}\hspace{2pt})}^{\mathsf{T}}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }}){\boldsymbol{\alpha }}_{\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }}\hspace{2pt})\right]\hfill \\ \hfill & \qquad \leqslant {N}^{-1}\left[\underset{\bar{\boldsymbol{\theta }},\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{max}}{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }}){{\Vert}}_{\mathrm{op}}\right]\cdot {\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}].\hfill \end{align*}$

We have, from lemma 14,

$\begin{equation*}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }})={\alpha }^{(q)}{\mathbf{I}}_{{d}_{q}}+{\beta }^{(q)}{\bar{\boldsymbol{\theta }}}^{(q)}{({\bar{\boldsymbol{\theta }}}^{(q)})}^{2}\end{equation*}$

where

$\begin{align*}\hfill \left[\begin{matrix}\hfill {\alpha }^{(q)}\hfill \\ \hfill {\beta }^{(q)}\hfill \end{matrix}\right]& ={[{d}_{q}({d}_{q}-1){({\tau }^{(q)})}^{4}]}^{-1}\left[\begin{matrix}\hfill {d}_{q}{({\tau }^{(q)})}^{4}\hfill & \hfill -{({\tau }^{(q)})}^{2}\hfill \\ \hfill -{({\tau }^{(q)})}^{2}\hfill & \hfill 1\hfill \end{matrix}\right]\left[\begin{matrix}\hfill {\mathbb{E}}_{\boldsymbol{x}}[\langle {\boldsymbol{x}}^{(q)},{\boldsymbol{x}}^{(q)}\rangle {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }{({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)})}^{2}]\hfill \\ \hfill {\mathbb{E}}_{\boldsymbol{x}}[{({\bar{x}}_{1}^{(q)})}^{2}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}^{\prime }{({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)})}^{2}]\hfill \end{matrix}\right].\hfill \end{align*}$

Hence from lemma 17 and for ɛ₀ small enough, there exists C > 0 such that for d large enough

$\begin{equation*}{\mathrm{sup}}_{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }}){{\Vert}}_{\mathrm{op}}\leqslant C\frac{{r}_{q}^{2}}{{d}_{q}}=C{d}^{{\kappa }_{q}}.\end{equation*}$

Furthermore

$\begin{equation*}\begin{aligned}\hfill {\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}& =\sum\limits _{\boldsymbol{k}\in \mathcal{Q}}\frac{B(\boldsymbol{d},\boldsymbol{k})}{{A}_{(\boldsymbol{\tau },\boldsymbol{\tau }),\boldsymbol{k}}^{(q)}}\sum\limits _{\boldsymbol{s}\in [B(\boldsymbol{d},\boldsymbol{k})]}{\lambda }_{\boldsymbol{k},\boldsymbol{s}}^{\boldsymbol{d}}{({\bar{f}}_{d})}^{2}\hfill \\ \hfill & \leqslant \left[\underset{k\in \mathcal{Q}}{\mathrm{max}}\enspace \frac{B(\boldsymbol{d},\boldsymbol{k})}{{A}_{(\boldsymbol{\tau },\boldsymbol{\tau }),\boldsymbol{k}}^{(q)}}\right]\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{aligned}\end{equation*}$

From lemma 15, we get

$\begin{equation*}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}}}[{\Vert}{\mathbb{K}}_{\boldsymbol{\tau },\boldsymbol{\tau }}^{-1}{\mathbb{T}}_{\boldsymbol{\tau }}{\bar{f}}_{d}{{\Vert}}_{{L}^{2}}^{2}]\leqslant C{d}^{\gamma -{\kappa }_{q}}\cdot {\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Hence,

$\begin{equation}{\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[\sum\limits _{i\in [N]}{\mathbb{E}}_{\boldsymbol{x}}[{\langle {\boldsymbol{a}}_{i}^{\ast },{\boldsymbol{x}}^{(q)}\rangle }^{2}{\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{i},\boldsymbol{x}\rangle /R){\sigma }^{\prime }(\langle {\boldsymbol{\theta }}_{j},\boldsymbol{x}\rangle /R)]\right]\leqslant C\frac{{d}^{\gamma }}{N}{\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\end{equation} \tag{ 149 }$

Combining equations (147)–(149), we get

$\begin{align*}\hfill & {\mathbb{E}}_{{\mathbf{\Theta }}_{{\varepsilon }_{0}}}\left[{\mathbb{E}}_{\boldsymbol{x}}\left[{({f}_{d}(\boldsymbol{x})-\hat{f}(\boldsymbol{x}\hspace{-1pt};\mathbf{\Theta },{\boldsymbol{a}}^{\ast }(\mathbf{\Theta })))}^{2}\right]\right]\hfill \\ \hfill & \qquad ={\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}-2{\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+(1-{N}^{-1}){\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}+{N}^{-1}{\mathbb{E}}_{{\boldsymbol{\tau }}_{{\varepsilon }_{0}},\bar{\boldsymbol{\theta }}}\left[{\boldsymbol{\alpha }}_{\tau }{(\bar{\boldsymbol{\theta }}\hspace{2pt})}^{\mathsf{T}}{\mathbb{K}}_{\tau ,\tau }(\bar{\boldsymbol{\theta }},\bar{\boldsymbol{\theta }}){\boldsymbol{\alpha }}_{\tau }(\bar{\boldsymbol{\theta }}\hspace{2pt})\right]\hfill \\ \hfill & \qquad \leqslant C\frac{{d}^{\gamma }}{N}{\Vert}{\mathsf{P}}_{\mathcal{Q}}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{align*}$

By Markov's inequality, we get for any ɛ > 0 and d large enough,

$\begin{equation*}\begin{aligned}\hfill \mathbb{P}({R}_{{\mathrm{NT}}^{(q)}}({f}_{d},\mathbf{\Theta }) > \varepsilon \cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2})& \leqslant \mathbb{P}(\left\{{R}_{{\mathrm{NT}}^{(q)}}({f}_{d},\mathbf{\Theta }) > \varepsilon \cdot {\Vert}{f}_{d}{{\Vert}}_{{L}^{2}}^{2}\right\}\cap {\mathcal{P}}_{{\varepsilon }_{0}})+\mathbb{P}({\mathcal{P}}_{{\varepsilon }_{0}}^{c})\hfill \\ \hfill & \leqslant {C}^{\prime }\frac{{d}^{\gamma }}{N}+\mathbb{P}({\mathcal{P}}_{{\varepsilon }_{0}}^{c}).\hfill \end{aligned}\end{equation*}$

The assumption that N = ω_d(d^γ) and lemma 8 conclude the proof.

Appendix I.: Proof of theorem 4 in the main text

Step 1. Show that ${R}_{\mathrm{NN},2N}({f}_{\ast })\leqslant {\mathrm{inf}}_{\boldsymbol{W}\in {\mathbb{R}}^{N\times d}}\enspace {R}_{\mathrm{NT},N}({f}_{\ast },\boldsymbol{W}\hspace{2pt})$ .

Define the neural tangent model with N neurons by ${\hat{f}}_{\mathrm{NT},N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{s}\hspace{-1pt};\boldsymbol{W}\hspace{2pt})={\sum }_{i=1}^{N}\langle {\boldsymbol{s}}_{i},\boldsymbol{x}\rangle {\sigma }^{\prime }(\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle )$ and the NN with N neurons by ${\hat{f}}_{\mathrm{NN},N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\boldsymbol{b})={\sum }_{i=1}^{N}{b}_{i}\sigma (\langle {\boldsymbol{w}}_{i},\boldsymbol{x}\rangle )$ . For any $\boldsymbol{W}\in {\mathbb{R}}^{N\times d}$ , $\boldsymbol{s}\in {\mathbb{R}}^{N}$ , and ɛ > 0, we define

$\begin{equation*}\begin{aligned}\hfill {\hat{g}}_{N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon )& \equiv {\varepsilon }^{-1}\left({\hat{f}}_{\mathrm{NN},N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W}+\varepsilon \boldsymbol{s},\mathbf{1})-{\hat{f}}_{\mathrm{NN},N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\mathbf{1})\right),\hfill \\ \hfill \mathcal{E}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon )& ={\hat{g}}_{N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon )-{\hat{f}}_{\mathrm{NT},N}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{s}\hspace{-1pt};\boldsymbol{W}\hspace{2pt}).\hfill \end{aligned}\end{equation*}$

Then by Taylor expansion, there exists ${({\tilde{\boldsymbol{w}}}_{i})}_{i\in [N]}$ such that

$\begin{equation*}\vert \mathcal{E}(\boldsymbol{x}\hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon )\vert =\frac{\varepsilon }{2}\left\vert \sum\limits _{i=1}^{N}\langle {\boldsymbol{s}}_{i},\boldsymbol{x}{\rangle }^{2}{\sigma }^{{\prime\prime}}(\langle {\tilde{\boldsymbol{w}}}_{i},\boldsymbol{x}\rangle )\right\vert .\end{equation*}$

By the boundedness assumption of ${\mathrm{sup}}_{x\in \mathbb{R}}\vert {\sigma }^{{\prime\prime}}(x)\vert$ , we have

$\begin{equation*}\underset{\varepsilon \to 0+}{\mathrm{lim}}{\Vert}\mathcal{E}(\cdot \hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon ){{\Vert}}_{{L}^{2}}^{2}=0,\end{equation*}$

and hence

$\begin{equation*}\underset{\varepsilon \to 0+}{\mathrm{lim}}{\Vert}{f}_{\ast }-{\hat{g}}_{N}(\cdot \hspace{-1pt};\boldsymbol{W},\boldsymbol{s},\varepsilon ){{\Vert}}_{{L}^{2}}^{2}={\Vert}{f}_{\ast }-{\hat{f}}_{\mathrm{NT},N}(\cdot \hspace{-1pt};\boldsymbol{s}\hspace{-1pt};\boldsymbol{W}\hspace{2pt}){{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Note that ${\hat{g}}_{N}$ can be regarded as a function in ${\mathcal{F}}_{\mathrm{NN}}^{2N}$ and ${\hat{f}}_{\mathrm{NT},N}\in {\mathcal{F}}_{\mathrm{NN}}^{N}(\boldsymbol{W}\hspace{2pt})$ , this implies that

$\begin{equation}{R}_{\mathrm{NN},2N}({f}_{\ast })\leqslant \underset{\boldsymbol{W}\in {\mathbb{R}}^{N\times d}}{\mathrm{inf}}\enspace {R}_{\mathrm{NT},N}({f}_{\ast },\boldsymbol{W}\hspace{2pt}).\end{equation} \tag{ 150 }$

Step 2. Give upper bound of ${\mathrm{inf}}_{\boldsymbol{W}\in {\mathbb{R}}^{N\times d}}{R}_{\mathrm{NT},N}({f}_{\ast },\boldsymbol{W}\hspace{2pt})$ . We take $\bar{\boldsymbol{W}}={({\bar{\boldsymbol{w}}}_{i})}_{i\leqslant N}$ with ${\bar{\boldsymbol{w}}}_{i}=\boldsymbol{U}{\bar{\boldsymbol{v}}}_{i}$ , where ${\bar{\boldsymbol{v}}}_{i}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1}({r}^{-1}))$ , and denote $\bar{\boldsymbol{V}}={({\bar{\boldsymbol{v}}}_{i})}_{i\leqslant N}$ . Then we have

$\begin{equation*}{\mathcal{G}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})\equiv \left\{f(\boldsymbol{x})=\bar{f}({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x}):\bar{f}(\boldsymbol{z})=\sum\limits _{i=1}^{N}\langle {\bar{\boldsymbol{s}}}_{i},\boldsymbol{z}\rangle {\sigma }^{\prime }(\langle {\bar{\boldsymbol{v}}}_{i},\boldsymbol{z}\rangle ),{\bar{\boldsymbol{s}}}_{i}\in {\mathbb{R}}^{{d}_{0}},i\leqslant N\right\}\subseteq {\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{W}}\hspace{2pt}).\end{equation*}$

It is easy to see that, when ${f}_{\ast }(\boldsymbol{x})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\boldsymbol{x})$ , we have

$\begin{equation*}\underset{\hat{f}\in {\mathcal{G}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{({f}_{\ast }(\boldsymbol{x})-\hat{f}(\boldsymbol{x}))}^{2}]=\underset{\hat{f}\in {\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{(\varphi (\boldsymbol{z})-\hat{f}(\boldsymbol{z}))}^{2}],\end{equation*}$

where ${\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})$ is the class of neural tangent model on ${\mathbb{R}}^{{d}_{0}}$

$\begin{equation*}{\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})=\left\{\bar{f}(\boldsymbol{z})=\sum\limits _{i=1}^{N}\langle {\bar{\boldsymbol{s}}}_{i},\boldsymbol{z}\rangle {\sigma }^{\prime }(\langle {\bar{\boldsymbol{v}}}_{i},\boldsymbol{z}\rangle ):{\bar{\boldsymbol{s}}}_{i}\in {\mathbb{R}}^{{d}_{0}},i\leqslant N\right\}.\end{equation*}$

Moreover, by theorem 3 in the main text, when ${d}_{0}^{\ell +\delta }\leqslant N\leqslant {d}_{0}^{\ell +1-\delta }$ for some δ > 0 independent of N, d, we have

$\begin{equation*}\underset{\hat{f}\in {\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{(\varphi (\boldsymbol{z})-\hat{f}(\boldsymbol{z}))}^{2}]=(1+{o}_{d,\mathbb{P}}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}\varphi {{\Vert}}_{{L}^{2}}^{2}=(1+{o}_{d,\mathbb{P}}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

As a consequence, we have

$\begin{align*}\hfill \underset{\boldsymbol{W}\in {\mathbb{R}}^{N\times d}}{\mathrm{inf}}\enspace {R}_{\mathrm{NT},N}({f}_{\ast },\boldsymbol{W}\hspace{2pt})& \leqslant \underset{\hat{f}\in {\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{W}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{({f}_{\ast }(\boldsymbol{x})-\hat{f}(\boldsymbol{x}))}^{2}]\leqslant \underset{\hat{f}\in {\mathcal{G}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{({f}_{\ast }(\boldsymbol{x})-\hat{f}(\boldsymbol{x}))}^{2}]\hfill \\ \hfill & =\underset{\hat{f}\in {\mathcal{F}}_{\mathrm{NT}}^{N}(\bar{\boldsymbol{V}}\hspace{2pt})}{\mathrm{inf}}\enspace \mathbb{E}[{(\varphi (\boldsymbol{z})-\hat{f}(\boldsymbol{z}))}^{2}]=(1+{o}_{d,\mathbb{P}}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}.\hfill \end{align*}$

Combining with equation (150) gives that, when ${d}_{0}^{\ell +\delta }\leqslant N\leqslant {d}_{0}^{\ell +1-\delta }$ , we have

$\begin{equation*}{R}_{\mathrm{NN},N}({f}_{\ast })\leqslant (1+{o}_{d}(1))\cdot {\Vert}{\mathsf{P}}_{ > \ell +1}{f}_{\ast }{{\Vert}}_{{L}^{2}}^{2}.\end{equation*}$

Step 3. Show that R _NN,N (f _* ) is independent of κ .

We let $\tilde{r}={d}^{\tilde{\kappa }/2}$ and $\stackrel{-}{r}={d}^{\stackrel{-}{r}/2}$ for some $\tilde{\kappa }\ne \stackrel{-}{\kappa }$ . Suppose we have $\tilde{\boldsymbol{x}}=\boldsymbol{U}{\tilde{\boldsymbol{z}}}_{1}+{\boldsymbol{U}}^{\perp }{\boldsymbol{z}}_{2}$ and $\stackrel{-}{\boldsymbol{x}}=\boldsymbol{U}{\stackrel{-}{\boldsymbol{z}}}_{1}+{\boldsymbol{U}}^{\perp }{\boldsymbol{z}}_{2}$ , where ${\tilde{\boldsymbol{z}}}_{1}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1}(\tilde{r}{\sqrt{d}}_{0}))$ , ${\stackrel{-}{\boldsymbol{z}}}_{1}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{0}-1}(\stackrel{-}{r}{\sqrt{d}}_{0}))$ , and ${\boldsymbol{z}}_{2}\sim \mathrm{Unif}({\mathbb{S}}^{d-{d}_{0}-1}(\sqrt{d-{d}_{0}}))$ . Moreover, we let ${\tilde{f}}_{\ast }(\tilde{\boldsymbol{x}})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\tilde{\boldsymbol{x}}/\tilde{r})$ and ${\stackrel{-}{f}}_{\ast }(\stackrel{-}{\boldsymbol{x}})=\varphi ({\boldsymbol{U}}^{\mathsf{T}}\stackrel{-}{\boldsymbol{x}}/\stackrel{-}{r})$ for some function $\varphi :{\mathbb{R}}^{{d}_{0}}\to \mathbb{R}$ .

Then, for any $\tilde{\boldsymbol{W}}={({\tilde{\boldsymbol{w}}}_{i})}_{i\leqslant N}\subseteq {\mathbb{R}}^{d}$ and $\tilde{\boldsymbol{b}}={({\tilde{b}}_{i})}_{i\leqslant N}\subseteq \mathbb{R}$ , there exists ${({\tilde{\boldsymbol{v}}}_{1,i})}_{i\leqslant N}\subseteq {\mathbb{R}}^{{d}_{0}}$ and ${({\tilde{\boldsymbol{v}}}_{2,i})}_{i\leqslant N}\subseteq {\mathbb{R}}^{d-{d}_{0}}$ such that ${\tilde{\boldsymbol{w}}}_{i}=\boldsymbol{U}{\tilde{\boldsymbol{v}}}_{1,i}+{\boldsymbol{U}}^{\perp }{\tilde{\boldsymbol{v}}}_{2,i}$ . We define ${\stackrel{-}{\boldsymbol{v}}}_{1,i}=\tilde{r}\cdot {\tilde{\boldsymbol{v}}}_{1,i}/\stackrel{-}{r}$ , ${\stackrel{-}{\boldsymbol{w}}}_{i}=\boldsymbol{U}{\stackrel{-}{\boldsymbol{v}}}_{1,i}+{\boldsymbol{U}}^{\perp }{\tilde{\boldsymbol{v}}}_{2,i}$ , $\stackrel{-}{\boldsymbol{W}}={({\stackrel{-}{\boldsymbol{w}}}_{i})}_{i\leqslant N}$ , and $\stackrel{-}{\boldsymbol{b}}=\tilde{\boldsymbol{b}}$ . Then we have

$\begin{equation*}{\mathbb{E}}_{\stackrel{-}{\boldsymbol{x}}}[{({\stackrel{-}{f}}_{\ast }(\stackrel{-}{\boldsymbol{x}})-{f}_{\mathrm{NN},N}(\stackrel{-}{\boldsymbol{x}};\stackrel{-}{\boldsymbol{W}},\stackrel{-}{\boldsymbol{b}}))}^{2}]={\mathbb{E}}_{\tilde{\boldsymbol{x}}}[{({\tilde{f}}_{\ast }(\tilde{\boldsymbol{x}})-{f}_{\mathrm{NN},N}(\tilde{\boldsymbol{x}};\tilde{\boldsymbol{W}},\tilde{\boldsymbol{b}}))}^{2}].\end{equation*}$

On the other hand, for any $\stackrel{-}{\boldsymbol{W}}={({\stackrel{-}{\boldsymbol{w}}}_{i})}_{i\leqslant N}\subseteq {\mathbb{R}}^{d}$ and $\stackrel{-}{\boldsymbol{b}}={({\stackrel{-}{b}}_{i})}_{i\leqslant N}\subseteq \mathbb{R}$ , we can find $\tilde{\boldsymbol{W}}={({\tilde{\boldsymbol{w}}}_{i})}_{i\leqslant N}\subseteq {\mathbb{R}}^{d}$ and $\tilde{\boldsymbol{b}}={({\tilde{b}}_{i})}_{i\leqslant N}\subseteq \mathbb{R}$ such that the above equation holds. This proves that R_NN,N(f_*) is independent of κ.

Appendix J.: Convergence of the Gegenbauer coefficients

In this section, we prove a string of lemmas that are used to show convergence of the Gegenbauer coefficients.

J.1. Technical lemmas

First recall that for q ∈ [Q] we denote ${\tau }^{(q)}\equiv {\Vert}{\boldsymbol{\theta }}^{(q)}{{\Vert}}_{2}/\sqrt{{d}_{q}}$ where θ ^(q) are the d_q coordinates of $\boldsymbol{\theta }\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(\sqrt{D}))$ associated to the qth sphere of PS^d. We show that τ^(q) is (1/d_q)-sub-Gaussian.

Lemma 16. There exists constants c, C > 0 such that for any ɛ > 0,

$\begin{equation*}\mathbb{P}(\vert {\tau }^{(q)}-1\vert > \varepsilon )\leqslant C\enspace \mathrm{exp}(-c{d}_{q}{\varepsilon }^{2}).\end{equation*}$

Proof of lemma 16. Let G ∼ N(0, I_D). We consider the random vector $\boldsymbol{U}\equiv \boldsymbol{G}/{\Vert}\boldsymbol{G}{{\Vert}}_{2}\in {\mathbb{R}}^{D}$ . We have $\boldsymbol{U}\sim \mathrm{Unif}({\mathbb{S}}^{D-1}(1))$ . We denote ${N}_{{d}_{q}}={G}_{1}^{2}+\cdots +{G}_{{d}_{q}}^{2}$ and ${N}_{D}={G}_{1}^{2}+\cdots +{G}_{D}^{2}$ . The random variable τ^(q) has the same distribution as

$\begin{equation*}{\tau }^{(q)}\equiv {\Vert}{\boldsymbol{\theta }}^{(q)}{{\Vert}}_{2}/\sqrt{{d}_{q}}\stackrel{\mathrm{d}}{=}\frac{\sqrt{{N}_{{d}_{q}}/{d}_{q}}}{\sqrt{{N}_{D}/D}}.\end{equation*}$

Hence,

$\begin{equation}\begin{aligned}\hfill \mathbb{P}(\vert {\tau }^{(q)}-1\vert > \varepsilon )& =\mathbb{P}\left(\left\vert \frac{\sqrt{{N}_{{d}_{q}}/{d}_{q}}}{\sqrt{{N}_{D}/D}}-1\right\vert > \varepsilon \right)\hfill \\ \hfill & \leqslant \mathbb{P}\left(\left\vert \sqrt{{N}_{{d}_{q}}/{d}_{q}}-1\right\vert > \varepsilon /2\right)+\mathbb{P}\left(\left\vert \sqrt{{N}_{D}/D}-1\right\vert > \varepsilon /(2+2\varepsilon )\right),\hfill \end{aligned}\end{equation} \tag{ 151 }$

where we used the fact that

$\begin{equation*}\vert a-1\vert \leqslant \frac{\varepsilon }{2}\quad \text{and}\quad \vert b-1\vert \leqslant \frac{\varepsilon }{2+2\varepsilon }{\Rightarrow}\left\vert \frac{a}{b}-1\right\vert \leqslant \varepsilon .\end{equation*}$

Let us first consider ${N}_{{d}_{q}}$ with ɛ ∈ (0, 2]. The ${G}_{i}^{2}$ are sub-exponential random variables with

$\begin{equation*}\mathbb{E}\left[{\mathrm{e}}^{\lambda ({G}_{i}^{2}-1)}\right]\leqslant {\mathrm{e}}^{2{\lambda }^{2}},\quad \forall \enspace \vert \lambda \vert < 1/4.\end{equation*}$

From standard sub-exponential concentration inequality, we get

$\begin{equation}\mathbb{P}\left(\left\vert {N}_{{d}_{q}}/{d}_{q}-1\right\vert > \varepsilon \right)\leqslant 2\enspace \mathrm{exp}\left(-{d}_{q}\varepsilon \enspace \mathrm{min}(1,\varepsilon )/8\right).\end{equation} \tag{ 152 }$

Hence, for ɛ ∈ (0, 2], we have

$\begin{equation*}\mathbb{P}\left(\left\vert \sqrt{{N}_{{d}_{q}}/{d}_{q}}-1\right\vert > \varepsilon /2\right)\leqslant \mathbb{P}\left(\left\vert {N}_{{d}_{q}}/{d}_{q}-1\right\vert > \varepsilon /2\right)\leqslant 2\enspace \mathrm{exp}\left(-{d}_{q}{\varepsilon }^{2}/32\right),\end{equation*}$

while for ɛ > 2,

$\begin{equation*}\begin{aligned}\hfill \mathbb{P}\left(\left\vert \sqrt{{N}_{{d}_{q}}/{d}_{q}}-1\right\vert > \varepsilon /2\right)\leqslant \mathbb{P}\left({N}_{{d}_{q}}/{d}_{q} > {(\varepsilon /2+1)}^{2}\right)& \leqslant \mathbb{P}\left({N}_{{d}_{q}}/{d}_{q}-1 > {\varepsilon }^{2}/4\right)\hfill \\ \hfill & \leqslant \mathrm{exp}\left(-{d}_{q}{\varepsilon }^{2}/32\right).\hfill \end{aligned}\end{equation*}$

In the case of N_D, applying (152) with ɛ/(2 + 2ɛ) ⩽ 1 shows that

$\begin{equation*}\begin{aligned}\hfill \mathbb{P}\left(\left\vert \sqrt{{N}_{D}/D}-1\right\vert > \varepsilon /(2+2\varepsilon )\right)& \leqslant \mathbb{P}\left(\left\vert {N}_{D}/D-1\right\vert > \varepsilon /(2+2\varepsilon )\right)\hfill \\ \hfill & \leqslant 2\enspace \mathrm{exp}\left(-D{\varepsilon }^{2}/(32{(1+\varepsilon )}^{2})\right).\hfill \end{aligned}\end{equation*}$

Combining the above bounds into (151) yields for ɛ ⩾ 0,

$\begin{equation*}\begin{aligned}\hfill \mathbb{P}(\vert {\tau }^{(q)}-1\vert > \varepsilon )& \leqslant 2\enspace \mathrm{exp}\left(-{d}_{q}{\varepsilon }^{2}/32\right)+2\enspace \mathrm{exp}\left(-D{\varepsilon }^{2}/(32{(1+\varepsilon )}^{2})\right)\hfill \\ \hfill & \leqslant 4\enspace \mathrm{exp}\left(-{\varepsilon }^{2}\enspace \mathrm{min}\left({d}_{q},D/{(1+\varepsilon )}^{2}\right)/32\right).\hfill \end{aligned}\end{equation*}$

Notice that $\vert {\tau }^{(q)}-1\vert \leqslant \sqrt{D/{d}_{q}}-1$ and we only need to consider $\varepsilon \in [0,\sqrt{D/{d}_{q}}-1]$ . We conclude that for any ɛ ⩾ 0, we have

$\begin{equation*}\mathbb{P}(\vert {\tau }^{(q)}-1\vert > \varepsilon )\leqslant 4\enspace \mathrm{exp}\left(-{d}_{q}{\varepsilon }^{2}/32\right).\end{equation*}$

□

We consider an activation function $\sigma :\mathbb{R}\to \mathbb{R}$ . Fix $\boldsymbol{\theta }\in {\mathbb{S}}^{D-1}(\sqrt{D})$ and recall that $\boldsymbol{x}=({\boldsymbol{x}}^{(1)},\dots {\boldsymbol{x}}^{(Q)})\in {\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}}$ . We recall that $\boldsymbol{x}\sim \mathrm{Unif}({\mathrm{PS}}_{\boldsymbol{\kappa }}^{\boldsymbol{d}})={\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }}$ while $\bar{\boldsymbol{x}}\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})={\mu }_{\boldsymbol{d}}$ . Therefore, for a given $\bar{\boldsymbol{\theta }}$ , ${\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\sim {\tilde{\mu }}_{\boldsymbol{d}}^{1}$ as defined in equation (33). Therefore we reformulate σ(⟨ θ , ⋅⟩/R) as a function σ_{d
,
τ} from ps^d to $\mathbb{R}$ :

$\begin{align}\hfill \sigma (\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)& =\sigma \left(\sum\limits _{q\in [Q]}{\tau }^{(q)}\cdot ({r}_{q}/R)\cdot \langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right)\hfill \\ \hfill & \equiv {\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle /\sqrt{{d}_{q}}\right\}}_{q\in [Q]}\right).\hfill \end{align} \tag{ 153 }$

We will denote in the rest of this section α_q = τ^(q) r_q/R for q = 1, ..., Q. Notice in particular that ${\alpha }_{q}\propto {d}^{{\eta }_{q}+{\kappa }_{q}-\xi }$ where we recall that ξ = max_q∈[Q]{η_q + κ_q}. Without loss of generality, we will assume that the (unique) maximum is attained on the first sphere, i.e. ξ = η₁ + κ₁ and ξ > η_q + κ_q for q ⩾ 2.

Lemma 17. Assume σ is an activation function with σ(u)² ⩽ c₀ exp(c₁ u²/2) almost surely, for some constants c₀ > 1 and c₁ < 1. We consider the function ${\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}:{\mathrm{ps}}^{\boldsymbol{d}}\to \mathbb{R}$ associated to σ, as defined in equation (75).

Then

(a)
${\mathbb{E}}_{G\sim \mathsf{\text{N}}(0,1)}[\sigma {(G)}^{2}]< \infty$ .
(b)
Let w ^(q) be unit vectors in ${\mathbb{R}}^{{d}_{q}}$ for q = 1, ..., Q. There exists ɛ₀ = ɛ₀(c₁) and d₀ = d₀(c₁) such that, for $\bar{\boldsymbol{x}}=({\boldsymbol{x}}^{(1)},\dots ,{\boldsymbol{x}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }}$ ,
$\begin{equation}\underset{d\geqslant {d}_{0}}{\mathrm{sup}}\underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]< \infty .\end{equation} \tag{ 154 }$
(c)
Let w ^(q) be unit vectors in ${\mathbb{R}}^{{d}_{q}}$ for q = 1, ..., Q. Fix integers $\boldsymbol{k}=({k}_{1},\dots ,{k}_{Q})\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ . Then for any δ > 0, there exists constants ɛ₀ = ɛ₀(c₁, δ) and d₀ = d₀(c₁, δ), and a coupling of G ∼ N(0, 1) and $\bar{\boldsymbol{x}}=({\boldsymbol{x}}^{(1)},\dots ,{\boldsymbol{x}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}^{\boldsymbol{\kappa }}$ such that for any d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$
$\begin{equation}\begin{aligned}\hfill {\mathbb{E}}_{\bar{\boldsymbol{x}},G}\left[{\left(\left[\prod\limits _{q\in [Q]}{\left(1-\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}{\rangle }^{2}/{d}_{q}\right)}^{{k}_{q}}\right]{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)-\sigma (G)\right)}^{2}\right]< \delta .\end{aligned}\end{equation} \tag{ 155 }$

Proof of lemma 17. Part (a) is straightforward.

For part (b), recall that the probability distribution of $\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle$ when ${\bar{\boldsymbol{x}}}^{(q)}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{q}-1}(\sqrt{{d}_{q}}))$ is given by

$\begin{equation}{\tilde{\tau }}_{{d}_{q}-1}^{1}(\mathrm{d}x)={C}_{{d}_{q}}{\left(1-\frac{{x}^{2}}{{d}_{q}}\right)}^{\frac{{d}_{q}-3}{2}}{\mathbf{1}}_{x\in [-\sqrt{{d}_{q}},\sqrt{{d}_{q}}]}\mathrm{d}x,\end{equation} \tag{ 156 }$

$\begin{equation}{C}_{{d}_{q}}=\frac{{\Gamma}({d}_{q}-1)}{{2}^{{d}_{q}-2}\sqrt{{d}_{q}}{\Gamma}{(({d}_{q}-1)/2)}^{2}}.\enspace \end{equation} \tag{ 157 }$

A simple calculation shows that C_n → (2π)^−1/2 as n → ∞, and hence ${\mathrm{sup}}_{n}\enspace {C}_{n}\leqslant \bar{C}< \infty$ . Therefore for τ ∈ [1 − ɛ, 1 + ɛ]^Q, we have

$\begin{align*}\hfill & {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]\hfill \\ \hfill & \qquad ={\int }_{\prod\limits _{q\in [Q]}[-\sqrt{{d}_{q}},\sqrt{{d}_{q}}]}{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right)}^{2}\prod\limits _{q\in [Q]}\left({C}_{{d}_{q}}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{\frac{{d}_{q}-3}{2}}\mathrm{d}{\bar{x}}_{1}^{(q)}\right)\hfill \\ \hfill & \qquad \leqslant {\bar{C}}^{Q}{\int }_{{\mathbb{R}}^{Q}}{c}_{0}\enspace \mathrm{exp}\left({c}_{1}{\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)}^{2}/2\right)\prod\limits _{q\in [Q]}\left(\mathrm{exp}\left(-\frac{{d}_{q}-3}{2{d}_{q}}{({\bar{x}}_{1}^{(q)})}^{2}\right)\mathrm{d}{\bar{x}}_{1}^{(q)}\right)\hfill \\ \hfill & \qquad ={c}_{0}{\bar{C}}^{Q}{\int }_{{\mathbb{R}}^{Q}}\mathrm{exp}\left(-{\bar{\boldsymbol{x}}}_{1}^{\mathsf{T}}\boldsymbol{M}{\bar{\boldsymbol{x}}}_{1}/2\right)\left(\prod\limits _{q\in [Q]}\mathrm{d}{\bar{x}}_{1}^{(q)}\right)\hfill \end{align*}$

where we denoted ${\bar{\boldsymbol{x}}}_{1}=({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)})$ and $\boldsymbol{M}\in {\mathbb{R}}^{Q\times Q}$ with

$\begin{equation*}{M}_{qq}=\frac{{d}_{q}-3}{{d}_{q}}-{c}_{1}^{2}{\alpha }_{q}^{2},\qquad {M}_{q{q}^{\prime }}=-{c}_{1}{\alpha }_{q}{\alpha }_{{q}^{\prime }},\quad \text{for}\enspace q\ne {q}^{\prime }\in [Q].\end{equation*}$

Recalling the definition of α_q = τ^(q) r_q/R, with ${r}_{q}={d}^{({\eta }_{q}+{\kappa }_{q})/2}$ and R = d^ξ/2(1 + o_d(1)). Hence for any ɛ > 0, uniformly on τ ∈ [1 − ɛ, 1 + ɛ]^Q, we have α_q → 0 for q ⩾ 2 and lim sup_d→∞|α₁ − 1| ⩽ ɛ. Hence if we choose ${\varepsilon }_{0}< {c}_{1}^{-1}-1$ , there exists c > 0 such that for d sufficiently large M ⪰c I_Q and for any $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$

$\begin{equation*}{\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]\leqslant {c}_{0}{\bar{C}}^{Q}{\int }_{{\mathbb{R}}^{Q}}\mathrm{exp}\left(-c{\Vert}{\bar{\boldsymbol{x}}}_{1}{{\Vert}}_{2}^{2}/2\right)\left(\prod\limits _{q\in [Q]}\mathrm{d}{\bar{x}}_{1}^{(q)}\right)< \infty .\end{equation*}$

Finally, for part (c), without loss of generality we will take ${\boldsymbol{w}}^{(q)}={\boldsymbol{e}}_{1}^{(q)}$ so that $\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle ={\bar{x}}_{1}^{(q)}$ . From part (b), there exists ɛ > 0 and d₀ such that

$\begin{align*}\hfill & \underset{d\geqslant {d}_{0}}{\mathrm{sup}}\hspace{2pt}\underset{\tau \in [1-\varepsilon ,1+\varepsilon ]}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}},G}\left[\left[\prod\limits _{q\in [Q]}{\left(1-{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle }^{2}/{d}_{q}\right)}^{2{k}_{q}}\right]{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]\hfill \\ \hfill & \qquad \leqslant \underset{d\geqslant {d}_{0}}{\mathrm{sup}}\hspace{2pt}\underset{\tau \in [1-\varepsilon ,1+\varepsilon ]}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}{\left({\left\{\langle {\boldsymbol{w}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right)}^{2}\right]< \infty .\hfill \end{align*}$

Consider G ∼ N(0, I_Q) and an arbitrary coupling between $\bar{\boldsymbol{x}}$ and G . For any M > 0 we can choose σ_M bounded continuous so that for any d and τ ∈ [1 − ɛ, 1 + ɛ]^Q,

$\begin{align}\hfill & {\mathbb{E}}_{\bar{\boldsymbol{x}},\boldsymbol{G}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-{({\bar{x}}_{1}^{(q)})}^{2}/{d}_{q}\right)}^{{k}_{q}}\cdot \sigma \left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right.\right.\hfill \\ \hfill & \qquad \quad \left.{\left.-\prod\limits _{q\in [Q]}{\left(1-{G}_{q}^{2}/{d}_{q}\right)}^{{k}_{q}}\cdot \sigma \left(\sum\limits _{q\in [Q]}{\alpha }_{q}{G}_{q}\right)\right)}^{2}\right]\hfill \\ \hfill & \qquad \leqslant {\mathbb{E}}_{\bar{\boldsymbol{x}},\boldsymbol{G}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-{({\bar{x}}_{1}^{(q)})}^{2}/{d}_{q}\right)}^{{k}_{q}}\cdot {\sigma }_{M}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right.\right.\hfill \\ \hfill & \qquad \quad \left.{\left.-\prod\limits _{q\in [Q]}{\left(1-{G}_{q}^{2}/{d}_{q}\right)}^{{k}_{q}}\cdot {\sigma }_{M}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{G}_{q}\right)\right)}^{2}\right]+\frac{1}{M}.\hfill \end{align} \tag{ 158 }$

It is therefore sufficient to prove the claim for σ_M. Letting ${\boldsymbol{\xi }}_{q}\sim \mathsf{\text{N}}(0,{\mathbf{I}}_{{d}_{q}-1})$ independently for each q ∈ [Q] and independent of G , we construct the coupling via

$\begin{equation}{\bar{x}}_{1}^{(q)}=\frac{{G}_{q}\sqrt{{d}_{q}}}{\sqrt{{G}_{q}^{2}+{\Vert}{\boldsymbol{\xi }}_{q}{{\Vert}}_{2}^{2}}},\qquad {\bar{\boldsymbol{x}}}_{-1}^{(q)}=\frac{{\boldsymbol{\xi }}_{q}\sqrt{{d}_{q}}}{\sqrt{{G}_{q}^{2}+{\Vert}{\boldsymbol{\xi }}_{q}{{\Vert}}_{2}^{2}}},\quad q\in [Q],\end{equation} \tag{ 159 }$

where we set ${\bar{\boldsymbol{x}}}^{(q)}=({\bar{x}}_{1}^{(q)},{\bar{\boldsymbol{x}}}_{-1}^{(q)})$ for each q ∈ [Q]. We thus have $({\bar{x}}_{1}^{(q)},{\bar{\boldsymbol{x}}}_{-1}^{(q)})\to \boldsymbol{G}$ almost surely, hence the limit superior of equation (158) is by weak convergence bounded by 1/M for any arbitrary M. Furthermore, noticing that α_q → 0 uniformly on τ ∈ [1 − ɛ, 1 + ɛ]^Q for q ⩾ 2, we have by bounded convergence

$\begin{equation}\underset{d\to \infty }{\mathrm{lim}}\underset{\boldsymbol{\tau }\in {[1-\varepsilon ,1+\varepsilon ]}^{Q}}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}},\boldsymbol{G}}\left[{\left(\prod\limits _{q\in [Q]}{\left(1-{G}_{q}^{2}/{d}_{q}\right)}^{{k}_{q}}\cdot \sigma \left(\sum\limits _{q\in [Q]}{\alpha }_{q}{G}_{q}\right)-\sigma \left({\alpha }_{1}{G}_{1}\right)\right)}^{2}\right]=0.\end{equation} \tag{ 160 }$

We further have ${\mathrm{lim}}_{(d,{\tau }^{(1)})\to (\infty ,1)}\enspace {\alpha }_{1}=1$ . Hence, by bounded convergence,

$\begin{equation}\underset{(d,{\tau }^{(1)})\to (\infty ,1)}{\mathrm{lim}}\enspace {\mathbb{E}}_{{G}_{1}}\left[{\left(\sigma \left({\alpha }_{1}{G}_{1}\right)-\sigma ({G}_{1})\right)}^{2}\right]=0.\end{equation} \tag{ 161 }$

Combining equation (158) with the coupling (159) and equations (160) and (161) yields the result. □

Consider the expansion of σ_{d
,
τ} in terms of tensor product of Gegenbauer polynomials. We have

$\begin{equation*}\sigma (\langle \boldsymbol{\theta },\boldsymbol{x}\rangle /R)=\sum\limits _{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})B(\boldsymbol{d},\boldsymbol{k}){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left({\left\{\langle {\bar{\boldsymbol{\theta }}}^{(q)},{\bar{\boldsymbol{x}}}^{(q)}\rangle \right\}}_{q\in [Q]}\right),\end{equation*}$

where

$\begin{equation*}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right],\end{equation*}$

with the expectation taken over $\bar{\boldsymbol{x}}=({\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q)})\sim {\mu }_{\boldsymbol{d}}\equiv \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ . We will need the following lemma, which is direct consequence of Rodrigues formula, to get the scaling of the Gegenbauer coefficients of σ_{d
,
τ}.

Lemma 18. Let $\boldsymbol{k}=({k}_{1},\dots ,{k}_{Q})\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and denote | k | = k₁ + ⋯ + k_Q. Assume that the activation function σ is | k |-times weakly differentiable and denote σ^{(|
k
|)} its | k |-weak derivative. Let α_q = τ^(q) r_q/R for q = 1, ..., Q. Then

$\begin{equation}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})=\left(\prod\limits _{q\in [Q]}{\alpha }_{q}^{{k}_{q}}\right)\cdot R(\boldsymbol{d},\boldsymbol{k})\cdot {\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{{k}_{q}}\right)\cdot {\sigma }^{(\vert \boldsymbol{k}\vert )}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right],\end{equation} \tag{ 162 }$

where $\bar{\boldsymbol{x}}\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ and

$\begin{equation*}R(\boldsymbol{d},\boldsymbol{k})=\prod\limits _{q\in [Q]}\frac{{d}_{q}^{{k}_{q}/2}{\Gamma}(({d}_{q}-1)/2)}{{2}^{{k}_{q}}{\Gamma}({k}_{q}+({d}_{q}-1)/2)}.\end{equation*}$

Furthermore,

$\begin{equation}\underset{d\to \infty }{\mathrm{lim}}\enspace B(\boldsymbol{d},\boldsymbol{k})R{(\boldsymbol{d},\boldsymbol{k})}^{2}=\frac{1}{\boldsymbol{k}!},\end{equation} \tag{ 163 }$

where k ! = k₁!...k_Q!.

Proof of lemma 18. We have

$\begin{align}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})\hfill & ={\mathbb{E}}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}\left({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}\right){Q}_{\boldsymbol{k}}^{\boldsymbol{d}}\left(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)},\dots ,\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right]\hfill \\ \hfill & ={\mathbb{E}}_{{\bar{\boldsymbol{x}}}^{(1)},\dots ,{\bar{\boldsymbol{x}}}^{(Q-1)}}\left[{\mathbb{E}}_{{\bar{\boldsymbol{x}}}^{(Q)}}\left[\sigma \left(\sum\limits _{q\in [Q-1]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}+{\alpha }_{Q}{\bar{x}}_{1}^{(Q)}\right){Q}_{{k}_{Q}}^{({d}_{Q})}(\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)})\right]\right.\hfill \\ \hfill & \quad \left.\times \prod\limits _{q\in [Q-1]}{Q}_{{k}_{q}}^{({d}_{q})}(\sqrt{{d}_{q}}{\bar{x}}_{1}^{(q)})\right],\hfill \end{align} \tag{ 164 }$

where we used the definition (34) of tensor product of Gegenbauer polynomials.

Consider the integration with respect to ${\bar{\boldsymbol{x}}}^{(Q)}$ . Denote for ease of notations $u={\alpha }_{1}{\bar{x}}_{1}^{(1)}+\cdots +{\alpha }_{Q-1}{\bar{x}}_{1}^{(Q-1)}$ . We use the Rodrigues formula for the Gegenbauer polynomials (see equation (24)):

$\begin{align}\hfill & {\mathbb{E}}_{{\bar{\boldsymbol{x}}}^{(Q)}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{Q}-1}(\sqrt{{d}_{Q}}))}\left[\sigma \left(u+{\alpha }_{Q}{\bar{x}}_{1}^{(Q)}\right){Q}_{{k}_{Q}}^{({d}_{Q})}\left(\sqrt{{d}_{Q}}{\bar{x}}_{1}^{(Q)}\right)\right]\hfill \\ \hfill & \qquad =\frac{{\omega }_{{d}_{Q}-2}}{{\omega }_{{d}_{Q}-1}}{\int }_{[-1,1]}\sigma \left({\alpha }_{Q}\sqrt{{d}_{Q}}t+u\right){Q}_{{k}_{Q}}^{({d}_{Q})}({d}_{Q}t){(1-{t}^{2})}^{({d}_{Q}-3)/2}\mathrm{d}t\hfill \\ \hfill & \qquad ={(-1/2)}^{{k}_{Q}}\frac{{\Gamma}(({d}_{Q}-1)/2)}{{\Gamma}({k}_{Q}+({d}_{Q}-1)/2)}\cdot \frac{{\omega }_{{d}_{Q}-2}}{{\omega }_{{d}_{Q}-1}}\hfill \\ \hfill & \quad \qquad \times {\int }_{[-1,1]}\sigma \left({\alpha }_{Q}\sqrt{{d}_{Q}}t+u\right){\left(\frac{\mathrm{d}}{\mathrm{d}t}\right)}^{{k}_{Q}}{(1-{t}^{2})}^{{k}_{Q}+({d}_{Q}-3)/2}\mathrm{d}t\hfill \\ \hfill & \qquad ={\alpha }_{Q}^{{k}_{Q}}{2}^{-{k}_{Q}}{d}_{Q}^{{k}_{Q}/2}\frac{{\Gamma}(({d}_{Q}-1)/2)}{{\Gamma}({k}_{Q}+({d}_{Q}-1)/2)}\cdot \frac{{\omega }_{{d}_{Q}-2}}{{\omega }_{{d}_{Q}-1}}\hfill \\ \hfill & \quad \qquad \times {\int }_{[-1,1]}{(1-{t}^{2})}^{{k}_{Q}}{\sigma }^{({k}_{Q})}\left({\alpha }_{Q}\sqrt{{d}_{Q}}t+u\right){(1-{t}^{2})}^{({d}_{Q}-3)/2}\mathrm{d}t\hfill \\ \hfill & \qquad ={\alpha }_{Q}^{{k}_{Q}}\frac{{d}_{Q}^{{k}_{Q}/2}{\Gamma}(({d}_{Q}-1)/2)}{{2}^{{k}_{Q}}{\Gamma}({k}_{Q}+({d}_{Q}-1)/2)}{\mathbb{E}}_{{\bar{\boldsymbol{x}}}^{(Q)}\sim \mathrm{Unif}({\mathbb{S}}^{{d}_{Q}-1}(\sqrt{{d}_{Q}}))}\hfill \\ \hfill & \quad \qquad \times \left[{\left(1-{({\bar{x}}_{1}^{(Q)})}^{2}/{d}_{Q}\right)}^{{k}_{Q}}{\sigma }^{({k}_{Q})}\left({\alpha }_{Q}{\bar{x}}_{1}^{(Q)}+u\right)\right].\hfill \end{align} \tag{ 165 }$

Iterating equation (165) over q ∈ [Q] and equation (164) yield the desired formula (162).

Furthermore, for each q ∈ [Q],

$\begin{equation*}\begin{aligned}\hfill {k}_{q}!B({d}_{q},{k}_{q})& =(2{k}_{q}+{d}_{q}-2)\prod\limits _{j=0}^{{k}_{q}-2}(j+{d}_{q}-1),\hfill \\ \hfill \frac{{\Gamma}(({d}_{q}-1)/2)}{{2}^{{k}_{q}}{\Gamma}({k}_{q}+({d}_{q}-1)/2)}& =\prod\limits _{j=0}^{{k}_{q}-1}\frac{1}{2j+{d}_{q}-1}.\hfill \end{aligned}\end{equation*}$

Combining these two equations yields

$\begin{align}\hfill & {k}_{q}!B({d}_{q},{k}_{q})\frac{{d}_{q}^{{k}_{q}}{\Gamma}{(({d}_{q}-1)/2)}^{2}}{{2}^{2{k}_{q}}{\Gamma}{({k}_{q}+({d}_{q}-1)/2)}^{2}}\hfill \\ \hfill & \qquad =\frac{2{k}_{q}+{d}_{q}-2}{2{k}_{q}+{d}_{q}-3}\cdot \left(\prod\limits _{j=0}^{{k}_{q}-2}\frac{j+{d}_{q}-1}{2j+{d}_{q}-1}\right)\cdot \left(\prod\limits _{j=0}^{{k}_{q}-1}\frac{{d}_{q}}{2j+{d}_{q}-1}\right),\hfill \end{align} \tag{ 166 }$

which converges to 1 when d_q → ∞. We deduce that

$\begin{equation*}\underset{d\to \infty }{\mathrm{lim}}\enspace B(\boldsymbol{d},\boldsymbol{k})R{(\boldsymbol{d},\boldsymbol{k})}^{2}=\frac{1}{\boldsymbol{k}!}.\end{equation*}$

□

J.2. Proof of convergence in probability of the Gegenbauer coefficients

Lemma 19. Let $\boldsymbol{k}=({k}_{1},\dots ,{k}_{Q})\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and denote | k | = k₁ + ⋯ + k_Q. Assume that the activation function σ is | k |-times weakly differentiable and denote σ^{(|
k
|)} its | k |-weak derivative. Assume furthermore that there exist constants c₀ > 0 and c₁ < 1 such that σ^{(|
k
|)}(u)² ⩽ c₀ exp(c₁ u²/2) almost surely.

Then for any δ > 0, there exists ɛ₀ ∈ (0, 1) and d₀ such that for any d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation*}\left\vert \left(\prod\limits _{q\in [Q]}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}\right)B(\boldsymbol{d},\boldsymbol{k}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}-\frac{{\mu }_{\vert \boldsymbol{k}\vert }{(\sigma )}^{2}}{\boldsymbol{k}!}\right\vert \leqslant \delta .\end{equation*}$

Proof of lemma 19. From lemma 18, we have

$\begin{align}\hfill & \left(\prod\limits _{q\in [Q]}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}\right)B(\boldsymbol{d},\boldsymbol{k}){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}\hfill \\ \hfill & \qquad =\left(\prod\limits _{q\in [Q]}{\alpha }_{q}^{2{k}_{q}}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}\right)\cdot [B(\boldsymbol{d},\boldsymbol{k})R{(\boldsymbol{d},\boldsymbol{k})}^{2}]\hfill \\ \hfill & \qquad \quad \times {\mathbb{E}}_{\bar{\boldsymbol{x}}}{\left[\prod\limits _{q\in [Q]}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{{k}_{q}}\cdot {\sigma }^{(\vert \boldsymbol{k}\vert )}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right]}^{2}.\hfill \end{align} \tag{ 167 }$

Recall α_q = τ^(q) r_q/R with ${r}_{q}={d}^{({\kappa }_{q}+{\eta }_{q})/2}$ and R = d^ξ/2(1 + o_d(1)). Hence, we have

$\begin{equation}\underset{(d,\boldsymbol{\tau })\to (\infty ,\mathbf{1})}{\mathrm{lim}}\enspace \prod\limits _{q\in [Q]}{\alpha }_{q}^{2{k}_{q}}{d}^{(\xi -{\eta }_{q}-{\kappa }_{q}){k}_{q}}=1.\end{equation} \tag{ 168 }$

Furthermore, from lemma 18, we have

$\begin{equation}\underset{d\to \infty }{\mathrm{lim}}\enspace B(\boldsymbol{d},\boldsymbol{k})R{(\boldsymbol{d},\boldsymbol{k})}^{2}=\frac{1}{\boldsymbol{k}!}.\end{equation} \tag{ 169 }$

We can apply lemma 17 to the activation function σ^{(|
k
|)}. In particular part (c) of the lemma implies that there exists ɛ₀ ∈ (0, 1) such that for d sufficiently large, we have for any $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation}\left\vert {E}_{\bar{\boldsymbol{x}}}\left[\left(\prod\limits _{q\in [Q]}{\left(1-\frac{{({\bar{x}}_{1}^{(q)})}^{2}}{{d}_{q}}\right)}^{{k}_{q}}\right)\cdot {\sigma }^{(\vert \boldsymbol{k}\vert )}\left(\sum\limits _{q\in [Q]}{\alpha }_{q}{\bar{x}}_{1}^{(q)}\right)\right]-{\mathbb{E}}_{G}[{\sigma }^{(\vert \boldsymbol{k}\vert )}(G)]\right\vert \leqslant \delta /2.\end{equation} \tag{ 170 }$

From equation (28), we have ${\mathbb{E}}_{G}[{\sigma }^{(\vert \boldsymbol{k}\vert )}(G)]={\mu }_{\vert \boldsymbol{k}\vert }(\sigma )$ . Combining equations (168) and (170) into equation (167) yields the result. □

Lemma 20. Let k be a non negative integer and denote $\boldsymbol{k}=(k,0,\dots ,0)\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , where we recall that without loss of generality we choose q = 1 as the unique argmax_q∈[Q]{η_q + κ_q}. Assume that the activation function σ verifies σ(u)² ⩽ c₀ exp(c₁ u²/2) almost surely for some constants c₀ > 0 and c₁ < 1.

Then for any δ > 0, there exists ɛ₀ = ɛ₀(c₁, δ) and d₀ = d₀(c₁, δ) such that for any d ⩾ d₀ and $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation*}\left\vert B({d}_{1},k){\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}{({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})}^{2}-\frac{{\mu }_{k}{(\sigma )}^{2}}{k!}\right\vert \leqslant \delta .\end{equation*}$

Proof of lemma 20. 20. Recall the correspondence (29) between Gegenbauer and Hermite polynomials. Note for any monomial m_l(x) = x^k, we can apply lemma 17(c) to ${m}_{l}({\bar{x}}_{1}^{({q}_{\xi })})\sigma$ and find a coupling such that for any η > 0, there exists ɛ₀ > 0 and

$\begin{equation}\underset{d\to \infty }{\mathrm{lim}}\enspace \underset{\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}}{\mathrm{sup}}\enspace {\mathbb{E}}_{\bar{\boldsymbol{x}},G}\left[{\left({m}_{k}({\bar{x}}_{1}^{({q}_{\xi })}){\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)})-{m}_{k}(G)\sigma (G)\right)}^{2}\right]\leqslant \eta .\end{equation} \tag{ 171 }$

We have

$\begin{equation*}{[B({d}_{1},k)k!]}^{1/2}{\lambda }_{\boldsymbol{k}}^{\boldsymbol{d}}({\sigma }_{\boldsymbol{d},\boldsymbol{\tau }})={\mathbb{E}}_{\bar{\boldsymbol{x}}}[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}){Q}_{k}^{({d}_{1})}(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)}){[B({d}_{1},k)k!]}^{1/2}].\end{equation*}$

Using the asymptotic correspondence between Gegenbauer polynomials and Hermite polynomials (29)

$\begin{equation*}\underset{d\to \infty }{\mathrm{lim}}\enspace \text{Coeff}\left\{{Q}_{k}^{(d)}(\sqrt{\mathrm{d}}x)B{(d,k)}^{1/2}\right\}=\text{Coeff}\left\{\frac{1}{{(k!)}^{1/2}}\enspace {\mathrm{He}}_{k}(x)\right\},\end{equation*}$

and equation (171), we get for any δ > 0, there exists ɛ₀ > 0 such that for d sufficiently large, we have for any $\boldsymbol{\tau }\in {[1-{\varepsilon }_{0},1+{\varepsilon }_{0}]}^{Q}$ ,

$\begin{equation*}\left\vert {E}_{\bar{\boldsymbol{x}}}\left[{\sigma }_{\boldsymbol{d},\boldsymbol{\tau }}({\bar{x}}_{1}^{(1)},\dots ,{\bar{x}}_{1}^{(Q)}){Q}_{k}^{({d}_{1})}(\sqrt{{d}_{1}}{\bar{x}}_{1}^{(1)}){[B({d}_{1},k)k!]}^{1/2}\right]-{\mathbb{E}}_{G}[\sigma (G){\mathrm{He}}_{k}(G)]\right\vert \leqslant \delta ,\end{equation*}$

which concludes the proof. □

Appendix K.: Bound on the operator norm of Gegenbauer polynomials

Proposition 5 (Bound on the Gram matrix). Let $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ and denote γ = ∑_q∈[Q] η_q k_q. Let $n\leqslant {d}^{\gamma }/{\mathrm{e}}^{{A}_{d}\sqrt{\mathrm{log}\enspace d}}$ for any A_d → ∞. Let ${({\bar{\boldsymbol{x}}}_{i})}_{i\in [n]}$ with ${\bar{\boldsymbol{x}}}_{i}=({\left\{{\bar{\boldsymbol{x}}}_{i}^{(q)}\right\}}_{q\in [Q]})\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ independently, and ${Q}_{{k}_{q}}^{({d}_{q})}$ be the k_q'th Gegenbauer polynomial with domain [−d_q, d_q]. Consider the random matrix $\boldsymbol{W}={({\boldsymbol{W}}_{ij})}_{i,j\in [n]}\in {\mathbb{R}}^{n\times n}$ , with

$\begin{equation*}{\boldsymbol{W}}_{ij}={Q}_{\boldsymbol{k}}^{\boldsymbol{d}}({\left\{\langle {\bar{\boldsymbol{x}}}_{i}^{(q)},{\bar{\boldsymbol{x}}}_{j}^{(q)}\rangle \right\}}_{q\in [Q]})=\prod\limits _{q\in [Q]}{Q}_{{k}_{q}}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}_{i}^{(q)},{\bar{\boldsymbol{x}}}_{j}^{(q)}\rangle ).\end{equation*}$

Then we have

$\begin{equation*}\underset{d,n\to \infty }{\mathrm{lim}}\enspace \mathbb{E}[{\Vert}\boldsymbol{W}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}]=0.\end{equation*}$

Corollary 1 (Uniform bound on the Gram matrix). Let $n\leqslant {d}^{\gamma }/{\mathrm{e}}^{{A}_{d}\sqrt{\mathrm{log}\enspace d}}$ for some γ > 0 and any A_d → ∞. Let ${({\bar{\boldsymbol{x}}}_{i})}_{i\in [N]}$ with ${\bar{\boldsymbol{x}}}_{i}=({\left\{{\bar{\boldsymbol{x}}}_{i}^{(q)}\right\}}_{q\in [Q]})\sim \mathrm{Unif}({\mathrm{PS}}^{\boldsymbol{d}})$ independently. Consider for any $\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}$ , the random matrix ${\boldsymbol{W}}_{\boldsymbol{k}}={({({\boldsymbol{W}}_{\boldsymbol{k}})}_{ij})}_{i,j\in [n]}\in {\mathbb{R}}^{n\times n}$ as defined in proposition 5. Denote:

$\begin{equation*}\mathcal{Q}=\left\{\boldsymbol{k}\in {\mathbb{Z}}_{\geqslant 0}^{Q}\left\vert \right.\sum\limits _{q\in [Q]}{\eta }_{q}{k}_{q}< \gamma \right\}.\end{equation*}$

Then we have

$\begin{equation*}{\mathrm{sup}}_{\boldsymbol{k}\in {\mathcal{Q}}^{c}}\enspace \mathbb{E}[{\Vert}{\boldsymbol{W}}_{\boldsymbol{k}}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}]={o}_{d,\mathbb{P}}(1).\end{equation*}$

Proof of corollary 1. For each q ∈ [Q], we consider ${\mathbf{\Delta }}^{(q)}={\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)}-{\mathbf{I}}_{n}$ where ${\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)}={({({\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)})}_{ij})}_{i,j\in [n]}$ with

$\begin{equation*}{({\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)})}_{ij}={Q}_{k}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}_{i}^{(q)},{\bar{\boldsymbol{x}}}_{j}^{(q)}\rangle ).\end{equation*}$

Then, defining γ_q ≡ γ/η_q, we have

$\begin{align*}\hfill & \mathbb{E}\left[\underset{k\geqslant 2{\gamma }_{q}+3}{\mathrm{sup}}{\Vert}{\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}^{2}\right]\leqslant \mathbb{E}\left[\sum\limits _{k\geqslant 2{\gamma }_{q}+3}{\Vert}{\boldsymbol{W}}_{k}^{\hspace{1.5pt}(q)}-{\mathbf{I}}_{n}{{\Vert}}_{F}^{2}\right]\hfill \\ \hfill & \qquad =n(n-1)\sum\limits _{k\geqslant 2{\gamma }_{q}+3}\mathbb{E}[{Q}_{k}^{({d}_{q})}{(\langle {\bar{\boldsymbol{x}}}^{(q)},{\bar{\boldsymbol{y}}}^{(q)}\rangle )}^{2}]=n(n-1)\sum\limits _{k\geqslant 2{\gamma }_{q}+3}B{({d}_{q},k)}^{-1}.\hfill \end{align*}$

For d sufficiently large, there exists C > 0 such that for any p ⩾ m ≡ ⌈2γ_q + 3⌉:

$\begin{equation*}\begin{aligned}\hfill \frac{B({d}_{q},m)}{B({d}_{q},p)}=\prod\limits _{k=m}^{p-1}\frac{(2k+{d}_{q}-2)}{(2k+{d}_{q})}\cdot \frac{(k+1)}{(k+{d}_{q}-2)}\leqslant & \prod\limits _{k=m}^{p-1}\frac{1}{1+({d}_{q}-3)/(k+1)}\hfill \\ \hfill \leqslant & \prod\limits _{k=m}^{p-1}{\mathrm{e}}^{-\frac{m+1}{{d}_{q}-2+m}\cdot \frac{{d}_{q}-2}{k+1}}\leqslant \frac{C}{{p}^{2}}.\hfill \end{aligned}\end{equation*}$

Hence, there exists constant C', such that for large d, we have

$\begin{equation*}\sum\limits _{k\geqslant 2{\gamma }_{q}+3}B{({d}_{q},k)}^{-1}\leqslant {C}^{\prime }\cdot B{({d}_{q},m)}^{-1}.\end{equation*}$

Recalling that $B({d}_{q},m)={{\Theta}}_{d}({d}^{{\eta }_{q}m})={\omega }_{d}({d}^{2\gamma })$ , and n = o_d(d^γ), we deduce

$\begin{equation}\mathbb{E}\left[\underset{k\geqslant 2{\gamma }_{q}+3}{\mathrm{sup}}{\Vert}{\boldsymbol{W}}_{k}^{\hspace{1pt}(q)}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}^{2}\right]={o}_{d}(1).\end{equation} \tag{ 172 }$

Let us now consider Δ = W _k − I_n. We will denote ${\mathbf{\Delta }}^{(q)}={\boldsymbol{W}}_{{k}_{q}}^{(q)}-{\mathbf{I}}_{n}$ . Then it is easy to check (recall the diagonal elements of ${\boldsymbol{W}}_{{k}_{q}}^{({d}_{q})}$ are equal to one) that for any q ∈ [Q]

$\begin{equation*}\mathbf{\Delta }=\left(\underset{{q}^{\prime }\ne q}{\odot }{\boldsymbol{W}}_{{k}_{{q}^{\prime }}}^{({q}^{\prime })}\right)\odot {\mathbf{\Delta }}^{(q)}\end{equation*}$

where A ⊙ B denotes the Hadamard product, or entrywise product, ${(\boldsymbol{A}\odot \boldsymbol{B})}_{i,j\in [n]}={({A}_{ij}{B}_{ij})}_{i,j\in [n]}$ . We recall the following inequality on the operator norm of Hadamard product of two matrices, with A positive definite:

$\begin{equation*}{\Vert}\boldsymbol{A}\odot \boldsymbol{B}{{\Vert}}_{\mathrm{op}}\leqslant \left(\underset{ij}{\mathrm{max}}\enspace {\boldsymbol{A}}_{ij}\right){\Vert}\boldsymbol{B}{{\Vert}}_{\mathrm{op}}.\end{equation*}$

Hence, in particular

$\begin{equation*}{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}\leqslant \left(\prod\limits _{{q}^{\prime }\ne q}\underset{ij}{\mathrm{max}}[{({\boldsymbol{W}}_{{k}_{{q}^{\prime }}}^{({q}^{\prime })})}_{ij}]\right){\Vert}{\mathbf{\Delta }}^{(q)}{{\Vert}}_{\mathrm{op}}.\end{equation*}$

Consider $\mathcal{I}=\left[\right.0,2{\gamma }_{1}+3\left[\right.\times \cdots \times \left[\right.0,2{\gamma }_{Q}+3\left[\right.\cap {\mathbb{Z}}_{\geqslant 0}^{Q}$ . Then, from equation (172), we get directly

$\begin{equation}\underset{\boldsymbol{k}\in {\mathcal{I}}^{c}}{\mathrm{sup}}{\Vert}{\boldsymbol{W}}_{\boldsymbol{k}}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation} \tag{ 173 }$

Furthermore, $\mathcal{I}\cap \mathcal{Q}$ is finite and from proposition 5, we directly get

$\begin{equation}\underset{\boldsymbol{k}\in \mathcal{I}\cap \mathcal{Q}}{\mathrm{sup}}{\Vert}{\boldsymbol{W}}_{\boldsymbol{k}}-{\mathbf{I}}_{n}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}(1).\end{equation} \tag{ 174 }$

Combining bounds (173) and (174) yields the result. □

K.1. Proof of proposition 5

The proof follows closely the proof of the uniform case presented in [21]. For completeness, we copy here the relevant lemmas.

Step 1. Bounding operator norm by moments.

Denote Δ = W − I_n. We define for each q ∈ [Q], ${\boldsymbol{W}}_{{k}_{q}}^{({d}_{q})}={({Q}_{{k}_{q}}^{({d}_{q})}(\langle {\bar{\boldsymbol{x}}}_{i}^{(q)},{\bar{\boldsymbol{x}}}_{j}^{(q)}\rangle ))}_{ij\in [n]}$ and ${\mathbf{\Delta }}^{(q)}={\boldsymbol{W}}_{{k}_{q}}^{({d}_{q})}-{\mathbf{I}}_{n}$ . Then it is easy to check (recall the diagonal elements of ${\boldsymbol{W}}_{{k}_{q}}^{({d}_{q})}$ are equal to one)

$\begin{equation*}\mathbf{\Delta }={\mathbf{\Delta }}^{(1)}\odot \dots \odot {\mathbf{\Delta }}^{(Q)},\end{equation*}$

where A ⊙ B denotes the Hadamard product, or entrywise product, ${(\boldsymbol{A}\odot \boldsymbol{B})}_{i,j\in [n]}={({A}_{ij}{B}_{ij})}_{i,j\in [n]}$ . For any sequence of integers p = p(d), we have

$\begin{equation}\mathbb{E}[{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}]\leqslant \mathbb{E}[\mathrm{Tr}{({\mathbf{\Delta }}^{2p})}^{1/(2p)}]\leqslant \mathbb{E}{[\mathrm{Tr}({\mathbf{\Delta }}^{2p})]}^{1/(2p)}.\end{equation} \tag{ 175 }$

To prove the proposition, it suffices to show that for any sequence A_d → ∞, we have

$\begin{equation}\underset{d,n\to \infty ,n={O}_{d}({d}^{\gamma }\enspace {\mathrm{e}}^{-{A}_{d}\sqrt{\mathrm{log}\enspace d}})}{\mathrm{lim}}\enspace \mathbb{E}{[\mathrm{Tr}({\mathbf{\Delta }}^{2p})]}^{1/(2p)}=0.\end{equation} \tag{ 176 }$

In the following, we calculate $\mathbb{E}[\mathrm{Tr}({\mathbf{\Delta }}^{2p})]$ . We have

$\begin{equation*}\begin{aligned}\hfill \mathbb{E}[\mathrm{Tr}({\mathbf{\Delta }}^{2p})]& =\sum\limits _{\boldsymbol{i}=({i}_{1},\dots ,{i}_{2p})\in {[n]}^{2p}}\mathbb{E}[{{\Delta}}_{{i}_{1}{i}_{2}}{{\Delta}}_{{i}_{2}{i}_{3}}\dots {{\Delta}}_{{i}_{2p}{i}_{1}}]\hfill \\ \hfill & =\sum\limits _{\boldsymbol{i}=({i}_{1},\dots ,{i}_{2p})\in {[n]}^{2p}}\prod\limits _{q\in [Q]}\mathbb{E}[{{\Delta}}_{{i}_{1}{i}_{2}}^{(q)}{{\Delta}}_{{i}_{2}{i}_{3}}^{(q)}\dots {{\Delta}}_{{i}_{2p}{i}_{1}}^{(q)}],\hfill \end{aligned}\end{equation*}$

where we used that ${\bar{\boldsymbol{x}}}^{(q)}$ and ${\bar{\boldsymbol{x}}}^{({q}^{\prime })}$ are independent for q ≠ q'.

We will denote for any i = (i₁, ..., i_k) ∈ [n]^k, define for each q ∈ [Q]

$\begin{equation*}{M}_{\boldsymbol{i}}^{(q)}=\begin{cases}\mathbb{E}[{{\Delta}}_{{i}_{1}{i}_{2}}^{(q)}\dots {{\Delta}}_{{i}_{k}{i}_{1}}^{(q)}]\quad \hfill & \quad k\geqslant 2,\hfill \\ 1\quad \hfill & \quad k=1.\hfill \end{cases}\end{equation*}$

Similarly, we define M_i associated to Δ,

$\begin{equation*}{M}_{\boldsymbol{i}}=\prod\limits _{q\in [Q]}{M}_{\boldsymbol{i}}^{(q)}.\end{equation*}$

To calculate these quantities, we will apply repeatedly the following identity, which is an immediate consequence of equation (21). For any i₁, i₂, i₃ distinct, we have

$\begin{equation*}{\mathbb{E}}_{{\boldsymbol{\theta }}_{{i}_{2}}}[{{\Delta}}_{{i}_{1}{i}_{2}}^{(q)}{{\Delta}}_{{i}_{2}{i}_{3}}^{(q)}]=\frac{1}{B({d}_{q},{k}_{q})}{{\Delta}}_{{i}_{1}{i}_{3}}^{(q)}.\end{equation*}$

Throughout the proof, we will denote by C, C', C'' constants that may depend on k but not on p, d, n. The value of these constants is allowed to change from line to line.

Step 2. The induced graph and equivalence of index sequences.

For any index sequence i = (i₁, i₂, ..., i_2p) ∈ [n]^2p, we defined an undirected multigraph G_i = (V_i, E_i) associated to index sequence i . The vertex set V_i is the set of distinct elements in i₁, ..., i_2p. The edge set E_i is formed as follows: for any j ∈ [2p] we add an edge between i_j and i_j+1 (with convention 2p + 1 ≡ 1). Notice that this could be a self-edge, or a repeated edge: G_i = (V_i, E_i) will be—in general—a multigraph. We denote v( i ) = |V_i| to be the number of vertices of G_i, and e( i ) = |E_i| to be the number of edges (counting multiplicities). In particular, e( i ) = k for i ∈ [n]^k. We define

$\begin{equation*}{\mathcal{T}}_{\star }(p)=\left\{\boldsymbol{i}\in {[n]}^{2p}:{G}_{\boldsymbol{i}}\;\text{does}\;\text{not}\;\text{have}\;\text{self}\;\text{edge}\right\}.\end{equation*}$

For any two index sequences i ₁, i ₂, we say they are equivalent i ₁ ≍ i ₂, if the two graphs ${G}_{{\boldsymbol{i}}_{1}}$ and ${G}_{{\boldsymbol{i}}_{2}}$ are isomorphic, i.e. there exists an edge-preserving bijection of their vertices (ignoring vertex labels). We denote the equivalent class of i to be

$\begin{equation*}\mathcal{C}(\boldsymbol{i})=\left\{\boldsymbol{j}:\boldsymbol{j}\asymp \boldsymbol{i}\right\}.\end{equation*}$

We define the quotient set $\mathcal{Q}(p)$ by

$\begin{equation*}\mathcal{Q}(p)=\left\{\mathcal{C}(\boldsymbol{i}):\boldsymbol{i}\in {[n]}^{2p}\right\}.\end{equation*}$

The following lemma was proved in proposition 3 in [21].

Lemma 21. The following properties holds for all sufficiently large n and d:

(a)
For any equivalent index sequences i = (i₁, ..., i_2p) ≍ j = (j₁, ..., j_2p), we have ${M}_{\boldsymbol{i}}^{(q)}={M}_{\boldsymbol{j}}^{(q)}$ .
(b)
For any index sequence $\boldsymbol{i}\in {[n]}^{2p}{\backslash}{\mathcal{T}}_{\star }(p)$ , we have M_i = 0.
(c)
For any index sequence $\boldsymbol{i}\in {\mathcal{T}}_{\star }(p)$ , the degree of any vertex in G_i must be even.
(d)
The number of equivalent classes $\vert \mathcal{Q}(p)\vert \leqslant {(2p)}^{2p}$ .
(e)
Recall that v( i ) = |V_i| denotes the number of distinct elements in i . Then, for any i ∈ [n]^2p, the number of elements in the corresponding equivalence class satisfies $\vert \mathcal{C}(\boldsymbol{i})\vert \leqslant v{(\boldsymbol{i})}^{v(\boldsymbol{i})}\cdot {n}^{v(\boldsymbol{i})}\leqslant {p}^{p}{n}^{v(\boldsymbol{i})}$ .

In view of property (a) in the last lemma, given an equivalence class $\mathcal{C}=\mathcal{C}(\boldsymbol{i})$ , we will write ${M}_{\mathcal{C}}={M}_{\boldsymbol{i}}$ for the corresponding value.

Step 3. The skeletonization process.

For multi-graph G, we say that one of its vertices is redundant, if it has degree 2. For any index sequence $\boldsymbol{i}\in {\mathcal{T}}_{\star }(p)\subset {[n]}^{2p}$ (i.e. such that G_i does not have self-edges), we denote by $r(\boldsymbol{i})\in {\mathbb{N}}_{+}$ to be the redundancy of i , and by sk( i ) to be the skeleton of i , both defined by the following skeletonization process. Let i ₀ = i ∈ [n]^2p. For any integer s ⩾ 0, if ${G}_{{\boldsymbol{i}}_{s}}$ has no redundant vertices then stop and set sk( i ) = i _s. Otherwise, select a redundant vertex i _s(ℓ) arbitrarily (the ℓth element of i _s). If i _s(ℓ − 1) ≠ i _s(ℓ + 1), then remove i _s(ℓ) from the graph (and from the sequence), together with its adjacent edges, and connect i _s(ℓ − 1) and i _s(ℓ + 1) with an edge, and denote i _s+1 to be the resulting index sequence, i.e. i _s+1 = ( i _s(1), ..., i _s(ℓ − 1), i _s(ℓ + 2), ..., i _s(end)). If i _s(ℓ − 1) = i _s(ℓ + 1), then remove i _s(ℓ) from the graph (and from the sequence), together with its adjacent edges, and denote i _s+1 to be the resulting index sequence, i.e. i _s+1 = ( i _s(1), ..., i _s(ℓ − 1), i _s(ℓ + 1), i _s(ℓ + 2), ..., i _s(end)). (Here ℓ + 1, and ℓ − 1 have to be interpreted modulo | i _s|, the length of i _s.) The redundancy of i , denoted by r( i ), is the number of vertices removed during the skeletonization process.

It is easy to see that the outcome of this process is independent of the order in which we select vertices.

Lemma 22. For the above skeletonization process, the following properties hold

(a)
If i ≍ j ∈ [n]^p, then sk( i ) ≍ sk( j ). That is, the skeletons of equivalent index sequences are equivalent.
(b)
For any i = (i₁, ..., i_k) ∈ [n]^k, and q ∈ [Q], we have
$\begin{equation*}{M}_{\boldsymbol{i}}^{(q)}=\frac{{M}_{\text{sk}(\boldsymbol{i})}^{(q)}}{B{({d}_{q},{k}_{q})}^{r(\boldsymbol{i})}}.\end{equation*}$
(c)
For any $\boldsymbol{i}\in {\mathcal{T}}_{\star }(p)\subset {[n]}^{2p}$ , its skeleton is either formed by a single element, or an index sequence whose graph has the property that every vertex has degree greater or equal to 4.

Given an index sequence $\boldsymbol{i}\in {\mathcal{T}}_{\star }(p)\subset {[n]}^{2p}$ , we say i is of type 1, if sk( i ) contains only one index. We say i is of type 2 if sk( i ) is not empty (so that by lemma 22, G_{sk(
i
)} can only contain vertices with degree greater or equal to 4). Denote the class of type 1 index sequence (respectively type 2 index sequence) by ${\mathcal{T}}_{1}(p)$ (respectively ${\mathcal{T}}_{2}(p)$ ). We also denote by ${\tilde{\mathcal{T}}}_{a}(p)$ , a ∈ {1, 2} the set of equivalence classes of sequences in ${\mathcal{T}}_{a}(p)$ . This definition makes sense since the equivalence class of the skeleton of a sequence only depends on the equivalence class of the sequence itself.

Step 4. Type 1 index sequences.

Recall that v( i ) is the number of vertices in G_i, and e( i ) is the number of edges in G_i (which coincides with the length of i ). We consider $\boldsymbol{i}\in {\mathcal{T}}_{1}(p)$ . Since for $\boldsymbol{i}\in {\mathcal{T}}_{1}(p)$ , every edge of G_i must be at most a double edge. Indeed, if (u₁, u₂) had multiplicity larger than 2 in G_i, neither u₁ nor u₂ could be deleted during the skeletonization process, contradicting the assumption that sk( i ) contains a single vertex. Therefore, we must have ${\mathrm{min}}_{\boldsymbol{i}\in {\mathcal{T}}_{1}}\enspace v(\boldsymbol{i})=p+1$ . According the lemma 22(b), for every $\boldsymbol{i}\in {\mathcal{T}}_{1}(p)$ , we have

$\begin{equation*}{M}_{\boldsymbol{i}}=\prod\limits _{q\in [Q]}{M}_{\boldsymbol{i}}^{(q)}=\prod\limits _{q\in [Q]}1/B{({d}_{q},{k}_{q})}^{v(\boldsymbol{i})-1}=\frac{1}{B{(\boldsymbol{d},\boldsymbol{k})}^{v(\boldsymbol{i})-1}}.\end{equation*}$

Note by lemma 21(e), the number of elements in the equivalence class of i is $\vert \mathcal{C}(\boldsymbol{i})\vert \leqslant {p}^{p}\cdot {n}^{v(\boldsymbol{i})}$ . Hence we get

$\begin{equation}\underset{\boldsymbol{i}\in {\mathcal{T}}_{1}(p)}{\mathrm{max}}\left[\vert \mathcal{C}(\boldsymbol{i})\vert \vert {M}_{\boldsymbol{i}}\vert \right]\leqslant {\mathrm{sup}}_{\boldsymbol{i}\in {\mathcal{T}}_{1}(p)}\left[{p}^{p}{n}^{v(\boldsymbol{i})}/B{(\boldsymbol{d},\boldsymbol{k})}^{v(\boldsymbol{i})-1}\right]={p}^{p}{n}^{p+1}/B{(\boldsymbol{d},\boldsymbol{k})}^{p}.\end{equation} \tag{ 177 }$

Therefore, denoting K = ∑_q∈[Q] η_q k_q,

$\begin{equation}\sum\limits _{\boldsymbol{i}\in {\mathcal{T}}_{1}(p)}{M}_{\boldsymbol{i}}=\sum\limits _{\mathcal{C}\in {\tilde{\mathcal{T}}}_{1}(p)}\vert \mathcal{C}\vert \vert {M}_{\mathcal{C}}\vert \end{equation} \tag{ 178 }$

$\begin{equation}\leqslant \vert \mathcal{Q}(p)\vert {p}^{p}\frac{{n}^{p+1}}{B{(\boldsymbol{d},\boldsymbol{k})}^{p}}\leqslant {(Cp)}^{3p}{n}^{p+1}{d}^{-Kp},\end{equation} \tag{ 179 }$

where in the last step we used lemma 21 and the fact that for q ∈ [Q], $B({d}_{q},{k}_{q})\geqslant {C}_{0}{d}_{q}^{{k}_{q}}$ for some C₀ > 0.

Step 5. Type 2 index sequences.

We have the following simple lemma bounding M_i, copied from proposition 3 in [21]. This bound is useful when i is a skeleton.

Lemma 23. For any q ∈ [Q], there exists constants C and d₀ depending uniquely on k_q such that, for any d ⩾ d₀(k_q), and any index sequence i ∈ [n]^m with 2 ⩽ m ⩽ d_q/(4k_q), we have

$\begin{equation*}\vert {M}_{\boldsymbol{i}}^{(q)}\vert \leqslant {\left(C{m}^{{k}_{q}}\cdot {d}_{q}^{-{k}_{q}}\right)}^{m/2}.\end{equation*}$

Suppose $\boldsymbol{i}\in {\mathcal{T}}_{2}(p)$ , and denote v(i) to be the number of vertices in G_i. We have, for a sequence p = o_d(d), and each q ∈ [Q]

$\begin{align*}\hfill \vert {M}_{\boldsymbol{i}}^{(q)}\vert & \hspace{2pt}\stackrel{(1)}{=}\hspace{2pt}\frac{\vert {M}_{\text{sk}(\boldsymbol{i})}^{(q)}\vert }{B{({d}_{q},{k}_{q})}^{r(\boldsymbol{i})}}\hfill \\ \hfill & \stackrel{(2)}{\leqslant }{\left(\frac{Ce(\text{sk}(\boldsymbol{i}))}{{d}_{q}}\right)}^{{k}_{q}\cdot e(\text{sk}(\boldsymbol{i}))/2}{({C}^{\prime }{d}_{q})}^{-r(\boldsymbol{i}){k}_{q}}\hfill \\ \hfill & \stackrel{(3)}{\leqslant }{\left(\frac{Cp}{{d}_{q}}\right)}^{{k}_{q}\cdot e(\text{sk}(\boldsymbol{i}))/2}{({C}^{\prime }{d}_{q})}^{-r(\boldsymbol{i}){k}_{q}}\hfill \\ \hfill & \stackrel{(4)}{\leqslant }{\left(\frac{Cp}{{d}_{q}}\right)}^{{k}_{q}\cdot v(\text{sk}(\boldsymbol{i}))}{({C}^{\prime }{d}_{q})}^{-r(\boldsymbol{i}){k}_{q}}\hfill \\ \hfill & \stackrel{(5)}{\leqslant }{C}^{v(\boldsymbol{i})}{p}^{{k}_{q}\cdot v(\text{sk}(\boldsymbol{i}))}{d}_{q}^{-(v(\text{sk}(\boldsymbol{i}))+r(\boldsymbol{i}))\cdot {k}_{q}}\hfill \\ \hfill & \stackrel{(6)}{\leqslant }{(Cp)}^{{k}_{q}\cdot v(\boldsymbol{i})}{d}_{q}^{-v(\boldsymbol{i}){k}_{q}}.\hfill \end{align*}$

Here (1) holds by lemma 22(b); (2) by lemma 23, and the fact that sk(i) ∈ [n]^e(sk(i)), together by $B({d}_{q},{k}_{q})\geqslant {C}_{0}{d}_{q}^{{k}_{q}}$ ; (3) because e(sk(i)) ⩽ 2p; (4) by lemma 22(c), implying that for $\mathit{i}\in {\mathcal{T}}_{2}(p)$ , each vertex of G_sk(i) has degree greater or equal to 4, so that v(sk( i )) ⩽ e(sk(i))/2 (notice that for d ⩾ d₀(k_q) we can assume Cp/d_q < 1). Finally, (5) follows since r(i), v(sk(i)) ⩽ v(i), and (6) the definition of r(i) implying r(i) = v(i) − v(sk(i)).

Hence we get

$\begin{equation*}\vert {M}_{\boldsymbol{i}}\vert \leqslant \prod\limits _{q\in [Q]}{(Cp)}^{{k}_{q}\cdot v(\boldsymbol{i})}{d}_{q}^{-v(\boldsymbol{i}){k}_{q}}.\end{equation*}$

Note by lemma 21(e), the number of elements in equivalent class $\vert \mathcal{C}(\boldsymbol{i})\vert \leqslant {p}^{v(\boldsymbol{i})}\cdot {n}^{v(\boldsymbol{i})}$ . Since v( i ) depends only on the equivalence class of i, we will write, with a slight abuse of notation $v(\boldsymbol{i})=v(\mathcal{C}(\boldsymbol{i}))$ . Notice that the number of equivalence classes with $v(\mathcal{C})=v$ is upper bounded by the number multi-graphs with v vertices and 2p edges, which is at most v^4p. Denoting α = max_q∈[Q]{1/η_q}, we have

$\begin{equation}\sum\limits _{\boldsymbol{i}\in {\mathcal{T}}_{2}(p)}{M}_{\boldsymbol{i}}\leqslant \sum\limits _{\mathcal{C}\in {\tilde{\mathcal{T}}}_{2}(p)}\vert \mathcal{C}\vert \vert {M}_{\mathcal{C}}\vert \end{equation} \tag{ 180 }$

$\begin{equation}\leqslant \sum\limits _{\mathcal{C}\in {\tilde{\mathcal{T}}}_{2}(p)}{(C{p}^{\alpha })}^{(K+1)v(\mathcal{C})}{\left(\frac{n}{{d}^{K}}\right)}^{v(\mathcal{C})}\end{equation} \tag{ 181 }$

$\begin{equation}\leqslant \sum\limits _{v=2}^{2p}{v}^{4p}{\left(\frac{Cn{p}^{\alpha (K+1)}}{{d}^{K}}\right)}^{v}.\end{equation} \tag{ 182 }$

Define ɛ = Cnp^α(K+1)/d^K. We will assume hereafter that p is selected such that

$\begin{equation}2p\leqslant -\mathrm{log}\left(\frac{Cn{p}^{\alpha (K+1)}}{{d}^{K}}\right).\end{equation} \tag{ 183 }$

By calculus and condition (183), the function F(v) = v^4p ɛ^v is maximized over v ∈ [2, 2p] at v = 2, whence

$\begin{equation}\sum\limits _{\boldsymbol{i}\in {\mathcal{T}}_{2}(p)}{M}_{\boldsymbol{i}}\leqslant 2pF(2)\leqslant {C}^{p}{\left(\frac{n}{{d}^{K}}\right)}^{2}.\end{equation} \tag{ 184 }$

Step 6. Concluding the proof.

Using equations (179) and (184), we have, for any p = o_d(d) satisfying equation (183), we have

$\begin{equation}\mathbb{E}[\mathrm{Tr}({\mathbf{\Delta }}^{2p})]=\sum\limits _{\boldsymbol{i}=({i}_{1},\dots ,{i}_{2p})\in {[N]}^{2p}}{M}_{\boldsymbol{i}}=\sum\limits _{\boldsymbol{i}\in {\mathcal{T}}_{1}(p)}{M}_{\boldsymbol{i}}+\sum\limits _{\boldsymbol{i}\in {\mathcal{T}}_{2}(p)}{M}_{\boldsymbol{i}}\end{equation} \tag{ 185 }$

$\begin{equation}\leqslant {(Cp)}^{3p}\frac{{n}^{p+1}}{{d}^{Kp}}+{C}^{p}{\left(\frac{n}{{d}^{K}}\right)}^{2}.\end{equation} \tag{ 186 }$

Form equation (175), we obtain

$\begin{equation}\mathbb{E}[{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}]\leqslant C\left\{{p}^{3/2}{n}^{1/(2p)}\sqrt{\frac{n}{{d}^{K}}}+{\left(\frac{n}{{d}^{K}}\right)}^{1/p}\right\}.\end{equation} \tag{ 187 }$

Finally setting $n={d}^{K}\enspace {\mathrm{e}}^{-2A\sqrt{\mathrm{log}\enspace d}}$ and $p=(K/A)\sqrt{\mathrm{log}\enspace d}$ , this yields

$\begin{equation}\mathbb{E}[{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}]\leqslant C\left\{{\mathrm{e}}^{-\frac{A}{4}\sqrt{\mathrm{log}\enspace d}}+{\mathrm{e}}^{-2{A}^{2}/K}\right\}.\end{equation} \tag{ 188 }$

Therefore, as long as A → ∞, we have $\mathbb{E}[{\Vert}\mathbf{\Delta }{{\Vert}}_{\mathrm{op}}]\to 0$ . It is immediate to check that the above choice of p satisfies the required conditions p = o_d(d) and equation (183) for all d large enough.

Appendix L.: Technical lemmas

We put here one technical lemma that is used in the proof of theorem 7(a).

Lemma 24. Let $\boldsymbol{D}={({\boldsymbol{D}}^{q{q}^{\prime }})}_{q,{q}^{\prime }\in [Q]}\in {\mathbb{R}}^{DN\times DN}$ be a symmetric Q by Q block matrix with ${\boldsymbol{D}}^{q{q}^{\prime }}\in {\mathbb{R}}^{{d}_{q}N\times {d}_{{q}^{\prime }}N}$ . Denote B = D ⁻¹. Assume that D satisfies the following properties:

(a)
For any q ∈ [Q], there exists c_q, C_q > 0 such that we have with high probability
$\begin{equation*}0< \frac{{r}_{q}^{2}}{{d}_{q}}{c}_{q}={d}^{{\kappa }_{q}}{c}_{q}\leqslant {\lambda }_{\mathrm{min}}({\boldsymbol{D}}^{qq})\leqslant {\lambda }_{\mathrm{max}}({\boldsymbol{D}}^{qq})\leqslant \frac{{r}_{q}^{2}}{{d}_{q}}{C}_{q}={d}^{{\kappa }_{q}}{C}_{q}< \infty ,\end{equation*}$
as d → ∞.
(b)
For any q ≠ q' ∈ [Q], we have ${\sigma }_{\mathrm{max}}({\boldsymbol{D}}^{q{q}^{\prime }})={o}_{d,\mathbb{P}}({r}_{q}{r}_{{q}^{\prime }}/\sqrt{{d}_{q}{d}_{{q}^{\prime }}})={o}_{d,\mathbb{P}}({d}^{({\kappa }_{q}+{\kappa }_{{q}^{\prime }})/2})$ .

Then for any q ≠ q' ∈ [Q], we have

$\begin{align}\hfill & {\Vert}{\boldsymbol{B}}^{qq}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}\left(\frac{{d}_{q}}{{r}_{q}^{2}}\right)={O}_{d,\mathbb{P}}({d}^{-{\kappa }_{q}}),\qquad \hfill \\ \hfill & {\Vert}{\boldsymbol{B}}^{q{q}^{\prime }}{{\Vert}}_{\mathrm{op}}={o}_{d,\mathbb{P}}\left(\frac{\sqrt{{d}_{q}{d}_{{q}^{\prime }}}}{{r}_{q}{r}_{{q}^{\prime }}}\right)={o}_{d,\mathbb{P}}({d}^{-({\kappa }_{q}+{\kappa }_{{q}^{\prime }})/2}).\hfill \end{align} \tag{ 189 }$

Proof of lemma 24. Let us show the result recursively on the integer Q. Note that the case Q = 1 is direct.

Consider $\boldsymbol{D}={({\boldsymbol{D}}^{q{q}^{\prime }})}_{q,{q}^{\prime }\in [Q]}$ . Denote $\tilde{D}=D-{d}_{Q}$ , $\boldsymbol{A}={({\boldsymbol{D}}^{q{q}^{\prime }})}_{q,{q}^{\prime }\in [Q-1]}\in {\mathbb{R}}^{\tilde{D}N\times \tilde{D}N}$ and $\boldsymbol{C}={[{({\boldsymbol{D}}^{1Q})}^{\mathsf{T}},\dots ,{({\boldsymbol{D}}^{(Q-1)Q})}^{\mathsf{T}}]}^{\mathsf{T}}\in {\mathbb{R}}^{{d}_{Q}N\times \tilde{D}N}$ such that

$\begin{equation*}\boldsymbol{D}=\left[\begin{matrix}\hfill \boldsymbol{A}\hfill & \hfill \boldsymbol{C}\hfill \\ \hfill {\boldsymbol{C}}^{\mathsf{T}}\hfill & \hfill {\boldsymbol{D}}^{QQ}\hfill \end{matrix}\right].\end{equation*}$

Assume that A ⁻¹ verifies equation (189). Denote

$\begin{equation*}\boldsymbol{B}=\left[\begin{matrix}\hfill \boldsymbol{R}\hfill & \hfill \boldsymbol{T}\hfill \\ \hfill {\boldsymbol{T}}^{\mathsf{T}}\hfill & \hfill {\boldsymbol{B}}^{QQ}\hfill \end{matrix}\right].\end{equation*}$

From the two by two blockmatrix inversion, we have:

$\begin{equation*}\begin{aligned}\hfill {\boldsymbol{B}}^{QQ}& ={({\boldsymbol{D}}^{QQ}-{\boldsymbol{C}}^{\mathsf{T}}{\boldsymbol{A}}^{-1}\boldsymbol{C})}^{-1},\hfill \\ \hfill \boldsymbol{T}& =-{\boldsymbol{A}}^{-1}\boldsymbol{C}{\boldsymbol{B}}^{QQ}.\hfill \end{aligned}\end{equation*}$

We have

$\begin{equation*}\begin{aligned}\hfill {{\Vert}{\boldsymbol{C}}^{\mathsf{T}}{\boldsymbol{A}}^{-1}\boldsymbol{C}{\Vert}}_{\mathrm{op}}& \leqslant \sum\limits _{q,{q}^{\prime }\in [Q-1]}{{\Vert}{({\boldsymbol{D}}^{qQ})}^{\mathsf{T}}{({\boldsymbol{A}}^{-1})}_{q{q}^{\prime }}{\boldsymbol{D}}^{{q}^{\prime }Q}{\Vert}}_{\mathrm{op}}\hfill \\ \hfill & =\sum\limits _{q,{q}^{\prime }\in [Q-1]}{o}_{d,\mathbb{P}}\left(\frac{{r}_{q}{r}_{Q}}{\sqrt{{d}_{q}{d}_{Q}}}\right)\cdot {O}_{d,\mathbb{P}}\left(\frac{\sqrt{{d}_{q}{d}_{{q}^{\prime }}}}{{r}_{q}{r}_{{q}^{\prime }}}\right)\cdot {o}_{d,\mathbb{P}}\left(\frac{{r}_{{q}^{\prime }}{r}_{Q}}{\sqrt{{d}_{{q}^{\prime }}{d}_{Q}}}\right)\hfill \\ \hfill & ={o}_{d,\mathbb{P}}({r}_{Q}^{2}/{d}_{Q}),\hfill \end{aligned}\end{equation*}$

where we used in the second line the properties on D and our assumption on A ⁻¹. Hence ${\boldsymbol{D}}^{QQ}-{\boldsymbol{C}}^{\mathsf{T}}{\boldsymbol{A}}^{-1}\boldsymbol{C}{\preceq}({r}_{q}^{2}/{d}_{q})({c}_{q}-{o}_{d,\mathbb{P}}(1))\mathbf{I}$ and ${\Vert}{\boldsymbol{B}}^{QQ}{{\Vert}}_{\mathrm{op}}={O}_{d,\mathbb{P}}({d}_{q}/{r}_{q}^{2})$ .

Furthermore, for q < Q,

$\begin{equation*}{\boldsymbol{B}}^{qQ}=-\sum\limits _{{q}^{\prime }\in [Q-1]}{({\boldsymbol{A}}^{-1})}_{q{q}^{\prime }}{\boldsymbol{C}}_{{q}^{\prime }}{\boldsymbol{B}}^{QQ}.\end{equation*}$

Hence

$\begin{align*}\hfill {{\Vert}{\boldsymbol{B}}^{qQ}{\Vert}}_{\mathrm{op}}& \leqslant \sum\limits _{{q}^{\prime }\in [Q-1]}{{\Vert}{({\boldsymbol{A}}^{-1})}_{q{q}^{\prime }}{\boldsymbol{D}}^{{q}^{\prime }Q}{\boldsymbol{B}}^{QQ}{\Vert}}_{\mathrm{op}}\hfill \\ \hfill & =\sum\limits _{q,{q}^{\prime }\in [Q-1]}{O}_{d,\mathbb{P}}\left(\frac{\sqrt{{d}_{q}{d}_{{q}^{\prime }}}}{{r}_{q}{r}_{{q}^{\prime }}}\right)\cdot {o}_{d,\mathbb{P}}\left(\frac{{r}_{q}{r}_{Q}}{\sqrt{{d}_{{q}^{\prime }}{d}_{Q}}}\right)\cdot {O}_{d,\mathbb{P}}\left(\frac{{d}_{Q}}{{r}_{Q}^{2}}\right)\hfill \\ \hfill & ={o}_{d,\mathbb{P}}\left(\frac{\sqrt{{d}_{q}{d}_{Q}}}{{r}_{q}{r}_{Q}}\right),\hfill \end{align*}$

which finishes the proof. □

L.1. Useful lemmas from [21]

For completeness, we reproduce in this section lemmas proven in [21].

Lemma 25. The number B(d, k) of independent degree-k spherical harmonics on ${\mathbb{S}}^{d-1}$ is non-decreasing in k for any fixed d ⩾ 2.

Lemma 26. For any fixed k, let ${Q}_{k}^{(d)}(x)$ be the kth Gegenbauer polynomial. We expand

$\begin{equation*}{Q}_{k}^{(d)}(x)=\sum\limits _{s=0}^{k}{p}_{k,s}^{(d)}{x}^{s}.\end{equation*}$

Then we have

$\begin{equation*}{p}_{k,s}^{(d)}={O}_{d}({d}^{-k/2-s/2}).\end{equation*}$

Lemma 27. Let N = o_d(d^ℓ+1) for a fixed integer ℓ. Let ${({\boldsymbol{w}}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{d-1})$ independently. Then as d → ∞, we have

$\begin{equation*}\underset{i\ne j}{\mathrm{max}}\vert \langle {\boldsymbol{w}}_{i},{\boldsymbol{w}}_{j}\rangle \vert ={O}_{d,\mathbb{P}}({(\mathrm{log}\enspace d)}^{k/2}{d}^{-k/2}).\end{equation*}$

Proposition 6 (Bound on the Gram matrix). Let $N\leqslant {d}^{k}/{\mathrm{e}}^{{A}_{d}\sqrt{\mathrm{log}\enspace d}}$ for a fixed integer k and any A_d → ∞. Let ${({\boldsymbol{\theta }}_{i})}_{i\in [N]}\sim \mathrm{Unif}({\mathbb{S}}^{d-1}(\sqrt{d}))$ independently, and ${Q}_{k}^{(d)}$ be the k'th Gegenbauer polynomial with domain [−d, d]. Consider the random matrix $\boldsymbol{W}={({\boldsymbol{W}}_{ij})}_{i,j\in [N]}\in {\mathbb{R}}^{N\times N}$ , with ${\boldsymbol{W}}_{ij}={Q}_{k}^{(d)}(\langle {\boldsymbol{\theta }}_{i},{\boldsymbol{\theta }}_{j}\rangle )$ . Then we have

$\begin{equation*}\underset{d,N\to \infty }{\mathrm{lim}}\enspace \mathbb{E}[{\Vert}\boldsymbol{W}-{\mathbf{I}}_{d}{{\Vert}}_{\mathrm{op}}]=0.\end{equation*}$

When do neural networks outperform kernel methods?*

Article metrics

Permissions

Share this article

Author e-mails

Author affiliations

Author notes

Dates

Abstract

1. Introduction

1.1. Overview

1.2. Notations and outline

2. Rigorous results for kernel methods and NT, RF NN expansions

2.1. The spiked covariates model

2.2. A sharp characterization of RKHS methods

2.3. RF and NT models

2.4. Neural network models

3. Further numerical experiments

4. Discussion

Acknowledgments

Data availability statement

Appendix A.: Details of numerical experiments

A.1. General training details

A.2. Synthetic data experiments

A.3. High-frequency noise experiment on FMNIST

A.3.1. Experiment hyper-parameters

A.4. High-frequency noise experiment on CIFAR-2

A.4.1. Experiment hyper-parameters

A.5. Low-frequency noise experiments on FMNIST

A.5.1. Experiment hyper-parameters

A.6. Low-frequency noise experiments on CIFAR-10

Appendix B.: Technical background on function spaces on the sphere

B.1. Functional spaces over the sphere

B.2. Gegenbauer polynomials

B.3. Hermite polynomials

B.4. Tensor product of spherical harmonics

B.5. Tensor product of Gegenbauer polynomials

B.6. Notations

Appendix C.: General framework and main theorems

C.1. Setup on the product of spheres

C.2. Reparametrization

C.3. Notations

C.4. Generalization error of kernel ridge regression

C.5. Approximation error of the random features model

C.6. Approximation error of the neural tangent model

C.7. Connecting to the theorems in the main text

Appendix D.: Proof of theorem 5

D.1. Preliminaries

D.2. Proof of theorem 5

D.3. Auxiliary results

Appendix E.: Proof of theorem 6(a): lower bound for the RF model

E.1. Preliminaries

E.2. Proof of theorem 6(a): outline

E.3. Proof of proposition 1

E.4. Proof of proposition 2

Appendix F.: Proof of theorem 6(b): upper bound for RF model

F.1. Preliminaries

F.2. Properties of the limiting kernel

F.3. Proof of theorem 6(b)

Appendix G.: Proof of theorem 7(a): lower bound for NT model

G.1. Preliminaries

G.2. Proof of theorem 7(a): outline

G.3. Proof of proposition 3

G.4. Proof of proposition 4

G.4.1. Preliminaries

G.4.2. Proof of proposition 4

Appendix H.: Proof of theorem 7(b): upper bound for NT model

H.1. Preliminaries

H.2. Proof of theorem 7(b): outline

H.3. Proof of theorem 8

H.3.1. Properties of the limiting kernel

H.3.2. Proof of theorem 8

Appendix I.: Proof of theorem 4 in the main text

Appendix J.: Convergence of the Gegenbauer coefficients

J.1. Technical lemmas

J.2. Proof of convergence in probability of the Gegenbauer coefficients

Appendix K.: Bound on the operator norm of Gegenbauer polynomials

K.1. Proof of proposition 5

Appendix L.: Technical lemmas

L.1. Useful lemmas from [21]

When do neural networks outperform kernel methods?^*