Error scaling laws for kernel classification under source and capacity conditions

In this manuscript we consider the problem of kernel classification. While worst-case bounds on the decay rate of the prediction error with the number of samples are known for some classifiers, they often fail to accurately describe the learning curves of real data sets. In this work, we consider the important class of data sets satisfying the standard source and capacity conditions, comprising a number of real data sets as we show numerically. Under the Gaussian design, we derive the decay rates for the misclassification (prediction) error as a function of the source and capacity coefficients. We do so for two standard kernel classification settings, namely margin-maximizing support vector machines and ridge classification, and contrast the two methods. We find that our rates tightly describe the learning curves for this class of data sets, and are also observed on real data. Our results can also be seen as an explicit prediction of the exponents of a scaling law for kernel classification that is accurate on some real datasets.


I. INTRODUCTION AND RELATED WORK
A recent line of work [1][2][3][4] has provided empirical evidence that the test error of neural networks often obeys scaling laws with the number of parameters of the model, the training set size, or other model parameters. Because of their implications in terms of relating performance and model size, these findings have been the object of sustained theoretical attention. Authors of [5] relate the decay rate of the test loss with the number of parameters to the intrinsic dimension of the data. This idea is refined by [6] for the case of regression tasks, building on the observation that in a number of settings, the covariance of the learnt features exhibits a power-law spectrum, whose rate of decay controls the scaling of the error. This investigation is closely related to another large body of work: the study of a power-law feature spectrum (and of a target function whose components in the corresponding eigenbasis also decay as a power-law) has a long history in the kernel literature, dating back to the seminal works of [7,8]. The rates governing these power-laws are respectively known as the capacity and source coefficients, and the scaling of the test error with the training set size can be entirely characterized in terms of these two numbers. While the study of kernel ridge regression [7][8][9][10][11][12][13][14][15] therefore offers a rich viewpoint on the question of neural scaling laws with the training set size, little is so far known for kernel classification. Since ascertaining the test error decay under source and capacity conditions would automatically translate into neural scaling laws in classification tasks, similarly to [6] for regression, this is a question of sizeable interest, addressed in the present work.

Related works
a. Neural scaling laws - A number of works [1][2][3][4] have provided empirical evidence of scaling laws in neural networks, with the number of parameters, training samples, compute, or other observables. These findings motivated theoretical investigations of the underlying mechanisms. Authors of [5] show how the scaling of the test loss with the number of parameters is related to the intrinsic dimension of the data. This dimension is further tied to the kernel spectrum by [6], a work that leverages the kernel ridge regression viewpoint to translate, in turn, the decay of the spectrum into test error rates. Authors of [16] similarly study a simple toy model where power-law data is processed through a random features layer. Finally, [17] investigate a toy model of scalar integer data in the context of classification, and ascertain the corresponding scaling law. Relating, in classification settings, the rate of decay of the kernel spectrum to the test error, as [6] did for regression, is still an open question.
b. Source and capacity conditions - The source and capacity conditions are standard regularity assumptions in the theoretical study of kernel methods, as they subsume a large class of learning setups, cf. [7,8,12,13,15,18].
c. Kernel ridge regression - The error rates for kernel ridge regression have been extensively and rigorously characterized in terms of the source/capacity coefficients in the seminal works of [7,8], with a sizeable body of work being subsequently devoted thereto [9][10][11][12][13][14][19]. In particular, in [15] it was shown that rates derived under worst-case assumptions [7][8][9][10][20] are identical to the typical rates computed under the standard Gaussian design assumption [21][22][23]. Crucially, it was observed that many real data-sets satisfy the source/capacity conditions, and display learning rates in very good agreement with the theoretical values [15].
d. Worst-case analyses for SVM - The worst-case bounds for Support Vector Machine (SVM) classification (see e.g. [24,25] for general introductions) are known from the seminal works of [24,26,27]. However, it is not known how tightly the corresponding rates hold for given realistic data distributions, not even for synthetic Gaussian data. We show that, contrary to the case of ridge regression, for classification the worst-case bounds are not tight for Gaussian data. This effectively hinders the ability to predict and understand the error rates for relevant classes of data-sets, and in particular the class of data described by source/capacity conditions, which as mentioned above includes many real data-sets [15]. The key goal of this work is to fill this gap by leveraging the recent work on learning curves for the Gaussian covariate model [28], specialized to data satisfying the capacity and source conditions.

Main contribution
In this work, we investigate the decay rate of the misclassification (generalization) error for noiseless kernel classification, under the Gaussian design and source/capacity regularity assumptions with capacity coefficient α and source coefficient r. Building on the analytic framework of [28], we consider the two most widely used classifiers: margin-maximizing Support Vector Machines (SVMs) and ridge classifiers. We derive in Section III the error rate (describing the decay of the prediction error with the number of samples) for margin-maximizing SVM:

ϵg ∼ n^(−α min(r, 1/2) / (1 + α min(r, 1/2))).
As a consequence, we conclude that the worst-case rates [24,26,27] are indeed loose and fail to describe this class of data. This fact alone is not at all surprising. However, it becomes remarkable in light of the fact that for ridge regression the worst-case bounds and the typical-case rates do agree [15].
We contrast the SVM rate with the rate for optimally regularized ridge classification, which we establish in Section IV to be

ϵg ∼ n^(−α min(r, 1) / (1 + 2α min(r, 1))).
We argue in the light of these findings that the SVM always displays faster rates than the ridge classifier for the classification task considered.
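This comparison can be checked numerically on the two exponents. The sketch below is illustrative (the helper names are ours, not the paper's), and encodes the SVM exponent α min(r, 1/2)/(1 + α min(r, 1/2)) and the optimal ridge exponent α min(r, 1)/(1 + 2α min(r, 1)) as stated above:

```python
import numpy as np

def svm_rate(alpha, r):
    # max-margin SVM exponent: alpha*min(r, 1/2) / (1 + alpha*min(r, 1/2))
    s = alpha * min(r, 0.5)
    return s / (1 + s)

def ridge_rate(alpha, r):
    # optimally regularized ridge exponent: alpha*min(r, 1) / (1 + 2*alpha*min(r, 1))
    s = alpha * min(r, 1.0)
    return s / (1 + 2 * s)

# SVM decays at least as fast as ridge over a grid of capacities alpha > 1
# and sources r >= 0 (equality only in degenerate corners, e.g. r = 0)
for alpha in np.linspace(1.01, 10, 50):
    for r in np.linspace(0.0, 2.0, 50):
        assert svm_rate(alpha, r) >= ridge_rate(alpha, r)
```

For r ≤ 1/2 the two exponents share the same numerator α r, and the extra factor 2 in the ridge denominator makes the SVM strictly faster for any α r > 0.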
Finally, we observe that some real data-sets fall into the same universality class as the considered setting, in the sense that, as illustrated in Section V, their error rates are in very good agreement with the ones above. This work is thus a key step towards theoretically predicting the error rates of kernel classification for a broad range of real data-sets.

A. Kernel classification
Consider a data-set D = {(xμ, yμ)}nμ=1 with n independent samples from a probability measure ν on X × {−1, +1}, with X ⊂ Rd. We will assume that the labels can be expressed as

yμ = sign(f⋆(xμ)), (1)

for some non-stochastic target function f⋆ : X → R. Note that the noiseless setting considered here is outside the validity domain of many worst-case analyses, whose bounds become void without noise [27], whereas a number of real learning settings are well described by a noiseless setup, see section V. Learning to classify D in the direct space X for a linear f⋆ has been the object of extensive studies. In the present work, we focus on the case where f⋆ more generically belongs to the space of square-integrable functions L2(X). To classify D, a natural method is then to perform kernel classification in a p-dimensional Reproducing Kernel Hilbert Space (RKHS) H associated to a kernel K, by minimizing the regularized empirical risk

R(f) = Σnμ=1 ℓ(f(xμ), yμ) + λ‖f‖2H. (2)

The function ℓ(•) is a loss function and λ is the strength of the ℓ2 regularization term. In this paper we shall more specifically consider the losses ℓ(z, y) = max(0, 1 − yz) (hinge classification) and ℓ(z, y) = (y − z)2 (ridge classification), and the case of an infinite-dimensional RKHS (p = ∞). The risk (2) admits a dual rewriting in terms of a standard parametric risk. To see this, diagonalize K in an orthogonal basis of kernel features {ψk(•)}pk=1 of L2(X), with corresponding eigenvalues {ωk}pk=1. It is convenient to normalize the eigenfunctions so that ⟨ψj, ψk⟩L2(X) = ωk δjk, and the kernel K can then be rewritten in simple scalar-product form K(x, x′) = ψ(x)⊤ψ(x′), where we named ψ(x) the p-dimensional vector with components {ψk(x)}pk=1. Furthermore, note that the covariance Σ of the data in feature space with this choice of feature map is simply diagonal, Σ = diag(ω1, ..., ωp). Any function f ∈ H can then be expressed as f(•) = w⊤ψ(•) for a vector w with square-summable components. Using this parametrization, the risk (2) can be rewritten as

R(w) = Σnμ=1 ℓ(w⊤ψ(xμ), yμ) + λ‖w‖22. (6)

Throughout this manuscript we will
refer to the components of the target function in the feature basis as the teacher θ⋆, so that f⋆(•) = θ⋆⊤ψ(•). Note that any f⋆ ∈ L2(X) can be formally written in this form with a certain θ⋆ (allowing for non-square-summable components if f⋆ ∈ L2(X) \ H). Similarly, the minimizer ŵ of the parametric risk (6) is related to the argmin f̂ of (2) by f̂(•) = ŵ⊤ψ(•), and will be referred to as the estimator in the following. We make two further assumptions. First, we work under the Gaussian design, and assume the features ψ(x) to follow a Gaussian distribution with covariance Σ, i.e. ψ(x) ∼ N(0, Σ). Note that this assumption might appear constraining, as the distribution of the data in feature space strongly depends on its distribution in the original space and on the feature map associated to the kernel. In fact, for a large class of data distributions and standard kernels, the Gaussian design assumption does not hold. However, rates derived under Gaussian design can hold more broadly. For instance, the rates established by [15] under Gaussian design were later proven by [29] under weaker conditions on the features. We will moreover discuss in Section V several settings in which our theoretical rates are in good agreement with rates observed for real data.
Second, following [15], we assume that the regularization strength λ decays as a power-law of the number of samples n with an exponent ℓ: λ = n^−ℓ. Note that this form of regularization is natural, since the need for regularization is lesser for larger training sets. Furthermore, it allows us to investigate the classical question of the asymptotically optimal regularization [7,8,15], i.e. the decay ℓ of the regularization yielding the fastest decrease of the prediction error.

B. Source and capacity conditions
Under the above assumptions of Gaussian design with features covariance Σ and existence of a teacher θ⋆ that generates the labels using eq. (1), we can now study the error rates. In statistical learning theory one often uses the source and capacity conditions, which assume the existence of two parameters α > 1, r ≥ 0 (hereafter referred to as the capacity coefficient and the source coefficient respectively) so that

tr Σ^(1/α) < ∞, θ⋆⊤ Σ^(1−2r) θ⋆ < ∞. (7)

As in [13,15,21,30,31], we will consider the particular case where both the spectrum of Σ and the teacher components θ⋆k have exactly a power-law form, satisfying the limiting source/capacity conditions (7):

ωk = k^−α, θ⋆k = k^(−(1+α(2r−1))/2). (8)

The power-law forms (8) have been empirically found in [15], in the context of kernel regression, to be a reasonable approximation for a number of real data-sets, including MNIST [32] and Fashion-MNIST [33], and a number of standard kernels such as polynomial kernels and radial basis functions. Similar observations were also made in the present work and are discussed in F and section V. The capacity parameter α and source parameter r capture the complexity of the data-set in feature space, i.e. after the data is transformed through the kernel feature map into {ψ(xμ), yμ}nμ=1. A large α, for example, signals that the spectrum of the data covariance Σ displays a fast decay, implying that the data effectively lies along a small number of directions and has a low effective dimension. Conversely, a small capacity α means that the data is effectively high-dimensional, and therefore a priori harder to learn. Similarly, a large r signals a good alignment of the teacher θ⋆ with the main directions of the data, and a priori an easier learning task. In terms of the target function f⋆, a larger r corresponds to a smoother f⋆. Note that r > 1/2 implies that f⋆ ∈ H, while r ≤ 1/2 implies f⋆ ∈ L2(X) \ H.
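For numerical experiments, data of this type can be generated directly in feature space: Gaussian features with power-law spectrum ωk = k^−α and labels from a power-law teacher θ⋆k = k^(−(1+α(2r−1))/2). A minimal sketch (the function name and default choices are ours, not the paper's):

```python
import numpy as np

def sample_power_law_data(n, p, alpha, r, seed=0):
    """Sample n Gaussian feature vectors psi(x) ~ N(0, diag(omega)) with
    omega_k = k^(-alpha) (capacity) and noiseless labels from a teacher
    theta_k = k^(-(1 + alpha*(2r - 1))/2) (source)."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, p + 1)
    omega = k ** (-float(alpha))                     # spectrum decay
    theta = k ** (-(1 + alpha * (2 * r - 1)) / 2)    # teacher decay
    Psi = rng.standard_normal((n, p)) * np.sqrt(omega)
    y = np.sign(Psi @ theta)
    return Psi, y, theta

Psi, y, theta = sample_power_law_data(1000, 500, alpha=2.0, r=0.3)
```

With this parametrization both sums in the source/capacity conditions are marginally (logarithmically) divergent, which is the "limiting" case referred to in the text.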
Finally, note that while [18] suggested an alternative definition of the source and capacity coefficients for non-square loss functions, their redefinition is not directly applicable to the hinge loss.

C. Misclassification error
The performance of learning the data-set D using kernel classification (6) is quantified by the misclassification (generalization) error

ϵg = P( sign(ŵ⊤ψ(x)) ≠ y ), (9)

where ŵ is the minimizer of the risk (6). The error (9) corresponds to the probability for the predicted label sign(ŵ⊤ψ(x)) of a test sample x to be incorrect. The rate at which the error (9) decays with the number of samples n in D depends on the complexity of the data-set, as captured by the source and capacity coefficients α, r, eq. (8).
To compute this rate, we build upon the work of [28] who, following a long body of work in the statistical physics literature [31,[34][35][36][37][38], provided and proved a mathematically rigorous closed-form asymptotic characterization of the misclassification error as

ϵg = (1/π) cos⁻¹( m / √(ρq) ), (10)

where ρ is the squared L2(X) norm of the target function f⋆, i.e. ρ = ∫X ν(dx) f⋆(x)2 = θ⋆⊤Σθ⋆, and m, q are the solution of a set of self-consistent equations, which are later detailed and analyzed in Section III for margin-maximizing SVMs and section IV for ridge classifiers. The order parameters m, q are known as the magnetization and the self-overlap in statistical physics and respectively correspond to the target/estimator and estimator/estimator L2(X) correlations:

m = θ⋆⊤Σŵ, q = ŵ⊤Σŵ. (11)

It follows from these interpretations that η = m2/(ρq) has to be thought of as the squared cosine-similarity between the teacher θ⋆ and the estimator ŵ, with perfect alignment (η = 1) resulting in minimal error ϵg = 0 from (10).
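The arccosine form of the characterization reflects a standard Gaussian identity: if two jointly Gaussian, zero-mean variables have correlation c, their signs disagree with probability arccos(c)/π. Here c = m/√(ρq) is the teacher/estimator correlation. A quick Monte Carlo check of this identity (illustrative, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
c = 0.7            # correlation between teacher and estimator pre-activations
N = 1_000_000
u = rng.standard_normal(N)
v = c * u + np.sqrt(1 - c**2) * rng.standard_normal(N)

emp = np.mean(np.sign(u) != np.sign(v))   # empirical sign-disagreement rate
theory = np.arccos(c) / np.pi             # arccos formula, as in (10)
```

As c → 1 the disagreement probability vanishes, matching the statement that perfect alignment gives ϵg = 0.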
Note that while this characterization has formally been proven in [28] in the asymptotic proportional n, p → ∞, n/p = O(1) limit, we are presently using it in the n ≪ p = ∞ limit, thereby effectively working at n/p = 0+. The non-asymptotic rate guarantees of [28] are nevertheless encouraging in this respect, although a finer control of the limit would be warranted to put the present analysis on fully rigorous grounds. Further, [15] also build on [28] in the n/p = 0+ limit, and display solid numerics-backed results, later rigorously proven by [29]. We thus conjecture that this limit can be taken safely in our case as well. Finally, we mention that a recent line of works [39][40][41][42] has explored the connections between kernel regression and Bayesian learning for networks in the n/p = O(1) limit, where p is in this case the width of the network. While this high-dimensional limit is indeed related to the one originally discussed in [28], which we relax here to n/p = 0+, the main object of [39][40][41] was not to study kernel regression per se, but to show how observables in Bayesian regression could be expressed in terms of well-chosen kernels. In the present work, we focus on analyzing kernel classification in the n/p = 0+ regime.
FIG. 1.
Misclassification error ϵg for max-margin classification on synthetic Gaussian features, as specified in (8), for different source/capacity coefficients α, r. In blue, the solution of the closed set of eqs. (13), used in the characterization (10) of the misclassification error, obtained with the g3m package [28]. The dimension p was cut off at 10^4. Red dots correspond to simulations using the scikit-learn SVC package [43], run at vanishing regularization λ = 10^−4 and averaged over 40 instances, for p = 10^4. The green dashed line indicates the power-law rate (15) derived in this work. The light blue dotted line indicates the classical worst-case min(1/2, α/(3 + α)) rate for SVM classification (Theorem 2.3 in [24]) in the cases where the theorem readily applies (r > 1/2) (see also G). The code used for the simulations is available here.

III. MAX-MARGIN CLASSIFICATION
A. Self-consistent equations
In this section we study classification using Support Vector Machines. The risk (6) then reads for the hinge loss

R(w) = Σnμ=1 max(0, 1 − yμ w⊤ψ(xμ)) + λ‖w‖22. (12)

In the following, we shall focus more specifically on the max-margin limit λ = 0+. We show in B that zero regularization is indeed asymptotically optimal for data following eq. (8) when the target function is characterized by a source r ≤ 1/2, i.e. f⋆ ∈ L2(X) \ H. We heuristically expect margin maximization to be a fortiori optimal also for easier, smoother teachers f⋆ ∈ H. For the risk (12) at λ = 0+, the self-consistent equations (13) defining m, q in (11) are given in A. There, r1 should be thought of as the ratio between the norms of the estimator ŵ and the teacher θ⋆, while z can be loosely interpreted as an effective regularization. A detailed derivation of these equations can be found in A.
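The weakly regularized (near max-margin) limit can be probed numerically without any kernel library. The sketch below trains a hinge-loss classifier on synthetic power-law features with a Pegasos-style subgradient scheme at a tiny λ (our choice of optimizer and parameters, not the paper's setup), and evaluates the misclassification error against the teacher:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, alpha, r = 400, 200, 2.0, 0.3
k = np.arange(1, p + 1)
omega = k ** -alpha
theta = k ** (-(1 + alpha * (2 * r - 1)) / 2)
X = rng.standard_normal((n, p)) * np.sqrt(omega)
y = np.sign(X @ theta)

# Pegasos-style subgradient descent on the regularized hinge risk;
# a very small lam emulates the near max-margin limit lambda -> 0+
lam, w, t = 1e-4, np.zeros(p), 0
for _ in range(20):
    for i in rng.permutation(n):
        t += 1
        eta = 1.0 / (lam * t)
        margin = y[i] * (X[i] @ w)
        w *= 1.0 - eta * lam
        if margin < 1:
            w += eta * y[i] * X[i]

# misclassification error on fresh test samples, labels from the teacher
X_test = rng.standard_normal((2000, p)) * np.sqrt(omega)
err = np.mean(np.sign(X_test @ w) != np.sign(X_test @ theta))
```

This kind of experiment, repeated over a range of n, yields the learning curves against which the analytic rates are compared in Fig. 1 (there via scikit-learn's SVC).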

B. Decay rates for max-margin
From the investigation of eqs. (13), as detailed in A, the following scalings are found to hold between the order parameters:

m ∼ √(ρq), with 1 − m2/(ρq) = Θ( n^(−2α min(r, 1/2) / (1 + α min(r, 1/2))) ). (14)
Note that the mutual scaling between m and q also follows intuitively from the interpretation of these order parameters (as the overlap of ŵ with the ground truth and with itself, respectively); see the discussion around eqs. (11) and (13). Since the width of the margin is generically expected to shrink with the number of samples (as more training data are likely to be sampled close to the separating hyperplane), the increase of the norm of ŵ (as captured by q, r1) with n is also intuitive. Finally, an analysis of the subleading corrections to m and q, detailed in A, leads to

ϵg ∼ n^(−α min(r, 1/2) / (1 + α min(r, 1/2))). (15)
The error rate (15) stands in very good agreement with numerical simulations on artificial Gaussian features generated using the model specification (8), see Fig. 1. Two observations can further be made on the decay rate (15). First, the rate is, as expected, an increasing function of α (low dimensionality of the features) and r (smoothness of the target f⋆). Second, for a source r > 1/2 (corresponding to a target f⋆ ∈ H), the rate saturates, suggesting that all functions in H are equally easy to classify, while for rougher targets f⋆ ∈ L2(X) \ H the specific roughness of the target function, as captured by its source coefficient r, matters and conditions the rate of decay of the error.
Finally, for completeness, we briefly discuss in E the more general case where the label distribution (1) includes data noise, and show that the rates display a crossover from the noiseless value (15) to a noisy value, much like what was reported for kernel ridge regression [15].

C. Comparison to classical rates
To the best of the authors' knowledge, there currently exists little work addressing the error rates for data-sets satisfying the source and capacity conditions (7). The closest result is the worst-case bound of [24] for SVM classification, which can be adapted to the present setting provided f⋆ ∈ H (r > 1/2). The derivation is detailed in G and results in an upper bound of min(1/2, α/(3 + α)) on the error rate for max-margin classification, which is always slower than (15). This rate [24] is plotted for comparison in Fig. 1 against numerical simulations and is visibly off, failing to capture the learning curves. It is to be expected that worst-case rates will be loose when compared to rates that assume a specific data distribution. What makes our result interesting is the comparison with the more commonly studied ridge regression where, as discussed in the introduction, the worst-case rates actually match those derived for Gaussian data, see [15]. Importantly, the rates from [24] only hold for source r > 1/2, while real data-sets are typically characterized by sources r < 1/2 (see for instance Fig. 4). The present work therefore fills an important gap in the literature by providing rates (15) which accurately capture the learning curves of data-sets satisfying source and capacity conditions. Further discussions of [24], along with [26,27,44] for completeness, are provided in G. Also note that while [45] report α/(1 + α) rates under Gaussianity assumptions, they rely on assumptions which are too stringent and unfulfilled in our setting.
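The looseness of the worst-case bound can be verified directly on the exponents: for r > 1/2 (where the bound of [24] applies) the derived exponent reduces to α/(2 + α), which exceeds min(1/2, α/(3 + α)) for every α > 1. A small numeric check (helper names are ours):

```python
import numpy as np

def derived_svm_rate(alpha, r):
    # Gaussian-design max-margin exponent from (15)
    s = alpha * min(r, 0.5)
    return s / (1 + s)

def worst_case_rate(alpha):
    # classical worst-case exponent for SVM, valid for sources r > 1/2
    return min(0.5, alpha / (3 + alpha))

# for all capacities alpha > 1 and any source r > 1/2 (bound's validity
# range), the Gaussian-design rate is strictly faster than the worst case
for alpha in np.linspace(1.01, 20, 200):
    assert derived_svm_rate(alpha, 0.75) > worst_case_rate(alpha)
```

Indeed α/(2 + α) > α/(3 + α) always, and for α ≥ 3, where the bound saturates at 1/2, one has α/(2 + α) ≥ 3/5 > 1/2.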

IV. RIDGE CLASSIFICATION
A. Self-consistent equations
Another standard classification method is the ridge classifier, which corresponds to minimizing

R(w) = Σnμ=1 (yμ − w⊤ψ(xμ))2 + λ‖w‖22. (16)

As previously discussed in section II, we consider a decaying regularization λ = n^−ℓ. The self-consistent equations (17) characterizing the quantities (q, m) for the ridge risk (16), along with further details on their derivation from [28], are provided in C. Like (13), eqs. (17) have been formally proven in the proportional n, p → ∞, n/p = O(1) limit in [28], but are expected to hold also in the present n ≪ p = ∞ setting [15,29]. Note that, compared to (13), eqs. (17) correspond to a constant student/teacher norm ratio r1 = √(2/(πρ)) and to a simple r2 = 1 + q − 2m√(2/(πρ)). r2 moreover admits a very intuitive interpretation as the prediction mean squared error (MSE) between the true label y = sign(θ⋆⊤ψ(x)) and the pre-activation linear predictor ŵ⊤ψ(x), i.e. r2 = E[(y − ŵ⊤ψ(x))2].
FIG. 2. Misclassification error ϵg for ridge classification on synthetic Gaussian features, as specified in (8), for different source/capacity coefficients α, r, in the effectively regularized regime ℓ ≤ α (top) and un-regularized regime ℓ > α (bottom). In blue, the solution of eqs. (17), used in the characterization (10) of the misclassification error, obtained with the g3m package [28]. The dimension p was cut off at 10^4. Red dots correspond to simulations averaged over 40 instances, for p = 10^4. The green dashed lines indicate the power-laws (18) (top) and (19) (bottom) derived in this work. The slight increase of the error for larger n in the un-regularized regime (bottom) is due to finite-size effects of the simulations run at p = 10^4 < ∞. Physically, it corresponds to the onset of the ascent preceding the second descent that is present for finite p, see D for further discussion. The code used for the simulations is available here.
Similarly to [15,31], an analysis of eqs. (17) (see C) reveals that, depending on how the decay rate ℓ of the regularization compares to the capacity α, two regimes (called effectively regularized and effectively un-regularized in [15] in the context of ridge regression) can be found:
a. Effectively regularized regime - ℓ ≤ α. In this regime, an analysis of the corrections to the self-overlap q and magnetization m, presented in C, shows that the misclassification error scales like

ϵg ∼ n^(−(1/2) min( 2 min(r, 1) ℓ, (α − ℓ)/α )). (18)

The rate (18) compares very well to numerical simulations, see Fig. 2. Note that the saturation for ridge happens at r = 1, rather than r = 1/2 as for max-margin classification (see discussion in section III): very smooth targets f⋆ characterized by a source r ≥ 1 are all equally easily classified by ridge. For rougher teachers f⋆ characterized by r ≤ 1, however, the rate of decay of the error (18) depends on the specific roughness of the target, even if, in contrast to max-margin, the latter belongs to H (r > 1/2). An important observation should further be made on the rates (18):
• If the regularization remains small (fast decay α > ℓ > α/(1 + 2α min(r, 1))), the decay (18) is determined only by the data capacity α, while the source r plays no role. As a matter of fact, with insufficient regularization, the limiting factor to the learning is the tendency to overfit, which depends on the effective dimension of the data as captured by the capacity α.
b. Effectively un-regularized regime - ℓ > α. As derived in C, the error plateaus and stays of order 1:

ϵg = Θn(1). (19)

This plateau is further elaborated upon in D, and is visible in numerical experiments, see Fig. 2. It corresponds to the first plateau in a double descent curve, with the second descent never happening since p = ∞. Intuitively, this phenomenon is attributable to the ridge classifier overfitting the labels using the small-variance directions of the data (8).
Interestingly, the rates (18) and (19) correspond exactly (up to a factor 1/2) to those reported in [15] for the MSE of ridge regression, where they are respectively called the red, blue and orange exponents. Notably, the plateau (19) at low regularization and the (α − ℓ)/α exponent in (18) only appeared in [15] for noisy cases, in which the labels are corrupted by an additive noise. The fact that they hold in the present noiseless study suggests that model mis-specification (trying to interpolate binary labels using a linear model) effectively plays the role of a large noise.
C. Optimal rates
a. Optimally regularized ridge classification - In practice, the strength of the regularization λ is a tunable parameter. A natural question is then that of the asymptotically optimal regularization, i.e. the regularization decay rate ℓ⋆ leading to the fastest decay of the misclassification error. From the expressions (18) (which hold provided ℓ ≤ α) and (19) (which holds provided ℓ > α), the value of ℓ maximizing the error decay rate is found to be

ℓ⋆ = α / (1 + 2α min(r, 1)), (20)

and the corresponding error rate is

ϵg ∼ n^(−α min(r, 1) / (1 + 2α min(r, 1))), (21)

see the red dashed lines in Fig. 3. Coincidentally, the optimal rate (21) is, up to a factor 1/2, identical to the classical optimal rate known for the rather distinct problem of the MSE of kernel ridge regression on noisy data [7,8]. Like the max-margin exponent (15), the optimal error rate for ridge (21) is an increasing function of both the capacity α and the source r, i.e. of the easiness of the learning task. Note that in contrast to max-margin classification, which is insensitive to the specifics of the target function f⋆ provided it is in H, ridge is sensitive to the source (smoothness) r of f⋆ up to r = 1.
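The optimal decay ℓ⋆ can be cross-checked by scanning ℓ numerically against the regime-dependent exponents. The sketch below uses the rate expressions as reconstructed in this section (a sketch under those assumptions, not the paper's code; the helper name is ours):

```python
import numpy as np

def ridge_exponent(alpha, r, ell):
    """Decay exponent of the ridge misclassification error for lambda = n^(-ell)."""
    if ell > alpha:
        return 0.0  # effectively un-regularized regime: the error plateaus
    return 0.5 * min(2 * min(r, 1.0) * ell, (alpha - ell) / alpha)

alpha, r = 2.0, 0.5
ells = np.linspace(0.0, 1.5 * alpha, 100_001)
best = ells[np.argmax([ridge_exponent(alpha, r, l) for l in ells])]

ell_star = alpha / (1 + 2 * alpha * min(r, 1.0))       # analytic optimum
opt_rate = ridge_exponent(alpha, r, ell_star)          # optimal exponent
```

The scan recovers ℓ⋆ as the crossing point of the two branches, and the exponent there equals α min(r, 1)/(1 + 2α min(r, 1)), consistent with the stated optimal rate.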
We finally briefly comment on support vector proliferation. [46][47][48] showed that in some settings almost every training sample in D becomes a support vector for the SVM. In such settings, the estimators ŵ (and hence the errors ϵg) consequently coincide for the ridge classifier and the margin-maximizing SVM. In the present setting, however, the fact that the SVM and ridge error rates differ (aSVM − aridge > 0) establishes that for features with a power-law decaying spectrum (8) there is no such support vector proliferation. Note that this result does not follow immediately from Theorem 3 in [48]. In fact, the spiked covariance (8), with only a small number of important (large-variance) directions and a tail of unimportant (low-variance) directions, effectively does not offer enough overparametrization [20,47] for support vector proliferation, and the support consists only of the subset of the training set with weakest alignment with the spike.
FIG. 3.
(red) Misclassification error ϵg for ridge classification on synthetic Gaussian features, as specified in (8), for different source/capacity coefficients α, r, at optimal regularization λ⋆. The dimension p was cut off at 10^4 and the regularization λ numerically tuned to minimize the error ϵg for every n. Red dots correspond to simulations averaged over 40 instances, for p = 10^4. Optimization over λ was performed using cross-validation, with the help of the python scikit-learn GridSearchCV package. The red dashed line represents the power-law (21). In blue, the learning curves for max-margin on the same data-set are plotted for reference, along with the corresponding power-law (15) (blue) and the loose classical min(1/2, α/(3 + α)) rate [24] (light blue), see Section III. The code used for the simulations is available here.
V. REMARKS FOR REAL DATA-SETS
TABLE I. Source and capacity coefficients (7) as estimated from the data sets, and the corresponding theoretical error rates for SVM (15) and ridge (21). The details on the estimation procedure can be found in Appendix F.
The source and capacity conditions (8) provide a simple framework to study a large class of structured data-sets. While idealized, we observe, similarly to [15], that many real data-sets seem to fall under this category, and hence display learning curves which are to a good degree described by the rates (15) for SVM and (21) for ridge classification. We present here three examples of such data-sets: a data-set of 10^4 randomly sampled CIFAR-10 [49] images of animals (labelled +1) and means of transport (labelled −1), a data-set of 14000 Fashion-MNIST [33] images of t-shirts (labelled +1) and coats (labelled −1), and a data-set of 14702 MNIST [32] images of 8s (labelled +1) and 1s (labelled −1). On the one hand, the learning curves for max-margin classification and optimally regularized ridge classification were obtained using the python scikit-learn SVC and KernelRidge packages. On the other hand, the spectrum {ωk}k of the data covariance Σ in feature space was computed, and a teacher θ⋆ providing perfect classification of the data-set was fitted using a margin-maximizing SVM. Then, the capacity and source coefficients α, r (8) were estimated for the data-set by fitting {ωk}k and {θ⋆k}k with power-laws, and the theoretical rates (15) and (21) computed therefrom. More details on this method, adapted from [30,31], are provided in F. The results of the simulations are presented in Figure 4 and compared to the theoretical rates (15), (21) computed from the empirically evaluated source and capacity coefficients, for a Radial Basis Function (RBF) kernel and a polynomial kernel of degree 5, with overall very good agreement. We do not compare with the worst-case bounds here because the observed values of r are smaller than 1/2, in which case, as mentioned above, the known results do not apply.
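The estimation of α (and analogously of r from the fitted teacher components) reduces to a linear fit in log-log coordinates. A minimal sketch with a hypothetical helper, checked on a synthetic spectrum of known capacity:

```python
import numpy as np

def fit_power_law_exponent(values):
    """Estimate the decay exponent a of a positive sequence values[k-1] ~ k^(-a)
    by least-squares regression in log-log coordinates."""
    k = np.arange(1, len(values) + 1)
    slope, _ = np.polyfit(np.log(k), np.log(values), 1)
    return -slope

# sanity check on a synthetic spectrum with known capacity alpha = 2
omega = np.arange(1, 1001) ** -2.0
alpha_hat = fit_power_law_exponent(omega)
```

On real data one would apply the same fit to the empirical covariance spectrum in feature space and to the squared components of the fitted teacher, typically restricting the fit to the bulk of the spectrum where the power-law holds.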

VI. CONCLUSION
We compute the generalization error rates as a function of the source and capacity coefficients for two standard kernel classification methods, margin-maximizing SVM and ridge classification, and show that SVM classification consistently displays faster rates. Our results establish that known worst-case upper bounds on the rates of SVM classification fail to tightly capture the rates for the class of data described by the source/capacity conditions. We illustrate empirically that a number of real data-sets fall under this class, and display error rates which are to a very good degree described by the ones derived in this work.
Appendix A: Rates for margin-maximizing SVM
In this appendix, we provide an analytical discussion of the equations (13) motivating the scalings (14) and (15). We remind the reader of the risk for the hinge loss (12), taken at λ = 0+ for max-margin. The predictor is then ŷ = sign(ŵ⊤ψ(x)).
1. Mapping from (Loureiro et al., 2021)
The starting point is the closed-form asymptotic characterization of the misclassification error of [28]. We begin by reviewing the main results of [28], and detail how their setting can be mapped to ours. Consider hinge classification on n independent p-dimensional Gaussian samples D = {xμ, yμ}nμ=1, obtained by minimizing the empirical risk (A2) for some constant regularization strength λ ≥ 0. Suppose that the labels are generated from a teacher/target/oracle θ⋆ ∈ Rp as yμ = sign(θ⋆⊤xμ). Then, provided the assumptions of [28] are satisfied, there exist constants C, c, c′ > 0 such that for all 0 < ϵ < c′ the misclassification error concentrates on the characterization (10), where ρ = θ⋆⊤Σθ⋆/p and m⋆, q⋆ are the solutions of the fixed-point equations (A4), with η = m2/(ρq), and Σ the covariance of the samples x. The limit λ = 0+ can be taken using the rescaling of [50]; the equations (A4) simplify in this limit. Note that the risk studied by [28] (A2) differs from the one we consider (12) by the scaling 1/√p, the missing 1/n in front of the sum, and a factor 2 for the regularization strength. All these scalings can be absorbed in λ ← 2λ/n and Σ ← Σ/p, leading to (13), thereby completing the mapping from the setup of [28] to the present setting.

Equations for max-margin under source and capacity conditions
In the following, we detail the asymptotic scaling analysis of the equations (13). For a diagonal covariance Σ = diag(ω_1, ..., ω_p) and θ⋆ = (θ⋆_k)_k, (13) can be written componentwise. To simplify the resulting equations (A7), we introduce the auxiliary variables z = n/(pV), r1 = m/V and r2 = nq/(pV²). The intuitive meaning of z is loosely that of an effective regularization. In the context of kernel ridge regression, where a similar variable appears, the role of z as an effective regularizing term is quite clear in [15]. We also refer the reader to the discussion in [28], also for ridge regression, where the role of V as parametrizing the gap between training and test error is mentioned. r1 is to be regarded as the ratio between the norm of the estimator ŵ minimizing the risk (12) and the norm of the teacher θ⋆. Introducing these variables in (A7) allows one to take a well-defined p = ∞ limit, which reads (A8). From (A7), z satisfies a self-consistent equation, where a Riemann approximation of the sum was used. We introduce its to-be-determined scaling exponent γ, with C_z denoting the prefactor (A10). In the following, we discuss the scalings and corrections of the order parameters and express them as functions of γ, before determining its value using the numerical solution of (A8). Note that the scaling (A10) implies in particular a corresponding scaling for the integral (A11).
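As a numerical sanity check on the Riemann approximation above, one can verify that for a capacity-type spectrum ω_k = k^(-α) (an assumption matching the power-law ansatz used throughout, not the exact form of (A7)), the sum Σ_k ω_k/(ω_k + z) scales as z^(-1/α) for small z — the small-z analogue of the integral scaling invoked around (A11):

```python
import numpy as np

def spectral_sum(z, alpha, K=1_000_000):
    """Sum_k omega_k / (omega_k + z) with omega_k = k^{-alpha}, truncated at K terms
    (truncation is negligible for the z range probed here)."""
    k = np.arange(1, K + 1, dtype=float)
    omega = k ** (-alpha)
    return np.sum(omega / (omega + z))

alpha = 2.0
zs = np.logspace(-5, -3, 5)
vals = np.array([spectral_sum(z, alpha) for z in zs])
# log-log slope of S(z) vs z should be close to -1/alpha
slope = np.polyfit(np.log(zs), np.log(vals), 1)[0]
```

The fitted slope approaches -1/α as z decreases, consistent with replacing the sum by the integral ∫ dx x^(-α)/(x^(-α) + z) ∝ z^(-1/α).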

First order corrections
We now plug the source and capacity ansatz (7) into the equations (A8). Massaging the expression for m (and remembering z ∼ n^{1−αγ} from (A10)) yields the expansion (A13), where we used the shorthand A^n_m. Massaging by the same token the equation for q yields q ≈ r2 times the expansion (A15), where we defined A^n_q. An important remark on (A13) and (A15) is that A^n_q and A^n_m tend to the same limit as n → ∞ (for r ≥ 1/2, owing to the indicator function appearing in the expression). Finally, remark that the expansions (A13) and (A15) imply that m ∼ r1 ρ and q ∼ r2 ρ, which echoes the intuitive meaning of m, q as ⟨ŵ, θ⋆⟩_{L²(X)}, ||ŵ||²_{L²(X)}; see also the discussion in section II under equation (11).
The expansions (A13) and (A15) can be plugged into the expression of the cosine similarity η (9) to access the decay rate for the misclassification error.
where we used A^∞_q = A^∞_m, meaning the leading order of the n^{−αγ} term cancels out. The scalings of the other order parameters q, m, r1, r2 can at this point also be deduced. Together with the observation (A11), this implies that the equations (B1) can be rewritten accordingly. Specializing to the source/capacity power-law forms (8), one finally reaches the corresponding rates.

Rate analysis for targets outside the Hilbert space
In the following, we provide an asymptotic analysis of eqs. (B6). To that end, we first ascertain the scaling of the effective regularization z using the self-consistent equation (B2). Depending on the scaling of the order parameter V, two regimes can be distinguished.
a. Effectively unregularized regime, V → ∞ as n → ∞. In this regime, the denominator of (B2) scales exactly as the corresponding integral in the vanishing-regularization case λ = 0+, see Appendix A eq. (A11). All the discussion in Appendix A then carries through, and the error rate coincides with the one for margin-maximizing SVM, eq. (15); this is eq. (B8).

b. Effectively regularized regime, V → 0 as n → ∞. The integral in the denominator of (B2) then admits a different scaling (B9). This means in particular that z follows a power law in n (B10), where we introduced, similarly to the max-margin case, the to-be-determined exponent γ, see also Appendix A. Note that it follows from equation (B10) and the definition λ = n^{−ℓ} that q ∼ n^{2(ℓ−αγ)} (B11). Since nλV ∼ (z/n)^{1−1/α}, one also obtains (B12). In particular, it can be seen from (B12) that the assumption V → 0 only holds for large enough regularizations (slow enough decays ℓ < ℓ⋆ for some limiting value ℓ⋆), satisfying condition (B13). The regularization decay ℓ = ℓ⋆ marks the boundary between the effectively regularized and effectively unregularized regimes.
Because of this, the rate of q at ℓ⋆ should coincide with its max-margin rate (14). This, together with (B13), allows us to determine ℓ⋆ as the solution of a system (denoting by γ⋆ the value of γ at ℓ = ℓ⋆) involving the combination 1 + α min(r, 1/2). Summarizing, for ℓ > ℓ⋆, in the effectively unregularized regime, the max-margin scalings (14), (15) hold. In the following, we focus on the new ℓ < ℓ⋆ (effectively regularized) regime.
The expansions for m and q carry over in similar fashion to the max-margin case (see Appendix A), yielding r1 ∼ √q ∼ n^{ℓ−αγ}, while a rescaling of the integrals in the equation for r2 reveals its scaling as well. Note that the mutual scaling of the order parameters m, q, r1 is the same as for max-margin classification (14), which is the mutual scaling directly following from the physical interpretation of m, q as ⟨ŵ, θ⋆⟩_{L²(X)}, ||ŵ||²_{L²(X)} and of r1 as the ratio between the norms of ŵ and θ⋆; see also the discussion in section II under equation (11). Finally, the relevant integrand scales as x^{−1+α(2−2r)}, where we specialized to r ≤ 1/2 in the last step, thereby focusing on target functions f⋆ ∈ L²(X) \ H. At this point, it remains to heuristically determine γ, a rigorous analytical derivation of our results being left for future work. We make the following two (numerically verified) assumptions.

Assumption B.1. As in the max-margin case, 1 − η ∼ q^{−1}.

Assumption B.2. The term of rate 2αγr dominates in (B17).
γ can then be guessed. Note that, consistently, the value γ⋆ of γ at the boundary ℓ = ℓ⋆ with the unregularized regime coincides with its max-margin value (A24). In the effectively regularized regime, under these assumptions, we thus conjecture the error scaling (B19) for r ≤ 1/2. Finally, observe that the rates for the effectively unregularized regime ℓ ≥ ℓ⋆ (B8) and the regularized regime ℓ ≤ ℓ⋆ (B19) can be subsumed in the more compact form (B20), still for r ≤ 1/2. Fig. 5 contrasts the rates (B20) with the numerical solution of the equations (B1) and with numerical simulations, and displays very good agreement. The main conclusion from (B20) is that for any ℓ, the rate is necessarily slower than ℓ⋆r/(1 + r) = αr/(1 + αr), which is the max-margin rate (15). Therefore, the max-margin rate (15) is optimal for r ≤ 1/2, and is achieved for any regularization λ decaying at least as fast as n^{−ℓ⋆}. In particular, it is achieved in the limit case ℓ = ∞, i.e. λ = 0+, a.k.a. the max-margin case, thereby suggesting that no regularization is optimal for rough enough target functions f⋆ ∈ L²(X) \ H. Since regularization is not needed for hard teachers (small source r), we do not a fortiori expect it to help for easier, smoother teachers f⋆ ∈ H, characterized by source r ≥ 1/2. This suggests that λ = 0+ (max-margin) should be optimal for all sources r, i.e. any target f⋆ ∈ L²(X). While this conjecture is observed to hold in numerical simulations, a more thorough theoretical analysis of the error rates for r ≥ 1/2 is nevertheless warranted. We leave this more challenging analysis to future work.
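Since the explicit form of (B20) is not reproduced here, one consistent reconstruction — an assumption on our part, inferred solely from the two boundary facts just stated (the rate saturates at ℓ⋆r/(1 + r) = αr/(1 + αr), and the max-margin value is reached for any ℓ ≥ ℓ⋆) — would read:

```latex
% Hypothetical compact form of (B20), for r <= 1/2, interpolating the two regimes:
\epsilon_g \sim n^{-\frac{\min(\ell,\,\ell^\star)\, r}{1+r}},
\qquad\text{so that}\qquad
\epsilon_g \sim n^{-\frac{\alpha r}{1+\alpha r}} \quad \text{for all } \ell \ge \ell^\star .
```

This form recovers the effectively regularized rate for ℓ < ℓ⋆ and the max-margin rate for ℓ ≥ ℓ⋆, but the reader should consult the original equation (B20) for the authoritative statement.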
Once again, we defined the shorthands A^n_m and A^n_q. We recall that, similarly to the discussion in Appendix A, the sequences A^n_m and A^n_q admit identical limits as n → ∞: A^∞_q = A^∞_m. The consequences of this identity are expounded below. We now focus on ascertaining the scalings of the order parameters m, q. These scalings depend on the regime considered.
a. Regularized regime ℓ < α. In the case ℓ < α, we have γ = ℓ/α < 1. The expansion for q then follows, and the cosine similarity η admits a simple expression, where we used that A^n_q and A^n_m share the same limit. Finally, the scaling of the misclassification error can be accessed, which is equation (18).

Remark C.1. Note that only the classification error (9) tends to zero, while neither the MSE between the label y = sign(θ⋆⊤ψ(x)) and the pre-activation linear predictor ŵ⊤ψ(x) (below denoted MSE1), nor the MSE between the teacher and student pre-activations θ⋆⊤ψ(x), ŵ⊤ψ(x) (below denoted MSE2), tends to zero. Note also the identity between the order parameter r2 and MSE1: r2 = MSE1, discussed in section IV of the main text.

b. Unregularized regime ℓ > α. If ℓ > α, then γ = 1. Since η is now bounded away from 1, the misclassification error ϵg fails to go to 0 asymptotically, and plateaus at a finite value, which is equation (19). This plateau is attributable to ridge classifiers overfitting the binary labels; see Appendix D for further discussion.
This means in particular that the direction of the teacher is perfectly recovered, and that the prediction error ϵg (9) goes to zero in the p ≪ n → ∞ limit.
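A minimal numpy illustration of this finite-dimensional regime (all parameters chosen arbitrarily for the sketch): ridge regression on noiseless sign labels recovers the teacher direction, so the misclassification error vanishes, even though the MSE between the labels and the linear predictor plateaus at a finite value — for isotropic Gaussian inputs and a unit-norm teacher, the best linear fit leaves MSE = 1 − 2/π ≈ 0.36.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n, lam = 20, 50_000, 1e-6

theta = rng.standard_normal(p)
theta /= np.linalg.norm(theta)           # unit-norm teacher
X = rng.standard_normal((n, p))           # isotropic Gaussian inputs, p << n
y = np.sign(X @ theta)                    # noiseless sign labels

# ridge estimator: w_hat = (X^T X + lam I)^{-1} X^T y
w = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

cos = w @ theta / np.linalg.norm(w)       # cosine similarity with the teacher -> 1
Xt = rng.standard_normal((n, p))          # fresh test set
err = np.mean(np.sign(Xt @ w) != np.sign(Xt @ theta))  # misclassification -> 0
mse = np.mean((y - X @ w) ** 2)           # plateaus near 1 - 2/pi, does NOT vanish
```

The contrast between `err` (small) and `mse` (order one) is exactly the point of Remark C.1: classification error and MSE behave very differently on sign labels.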

Infinite dimensional feature space
We now argue why this discussion ceases to hold in the setup of interest, n ≪ p = ∞ (plateau reached after the first descent, see Fig. 7). In this limit, loosely written, the learnt estimator acquires, on top of the teacher component, a term coming from the sum of p random contributions of order 1/√n entailed by the matrix multiplication. In particular, this implies that the teacher fails to be perfectly recovered by ERM in this limit, causing the misclassification error to plateau at a finite limit, see equation (19). This is due to the fact that in an infinite-dimensional space, the ridge ERM (16) always has enough dimensions to overfit the dataset. Another way to look at this limit is through the well-known double-descent phenomenon, which occurs in the finite-dimensional setting p < ∞ discussed in subsection D1. In the p = ∞ limit, the second descent, which commences at n ≈ p, happens at infinite sample complexity and is thus not observed for any finite n. In fact, the 1 ≪ n ≪ p = ∞ limit always corresponds to the plateau following the first descent, see Fig. 7.

Working with the cumulative functions (F3) allows one to smooth out the curves, thus permitting a relatively more precise evaluation. Because the feature space is finite, the functions (F3) fail to be exact power-laws, and exhibit in particular a sharp drop for n approaching m. Nonetheless, we identify for each curve a region of indices k for which the curve looks qualitatively like a power-law, and fit the curves by a power-law using least-squares linear regression to finally extract the coefficients α, r therefrom. These fits are shown in Fig. 10 for the reduced CIFAR 10 dataset discussed in the main text (see section V), for an RBF kernel with inverse variance 10^{-7} and a polynomial kernel of degree 5. The capacity and source were estimated to be α ≈ 1.16, r ≈ 0.10 for the RBF kernel and α ≈ 1.51, r ≈ 0.07 for the polynomial kernel.
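The fitting step above can be sketched as follows. For illustration only, we assume the cumulative function used for the capacity is the tail sum of the sorted spectrum, C(k) = Σ_{j≥k} ω_j ∼ k^{1−α}; the exact definition (F3) is not reproduced here, and the function `estimate_capacity` is a hypothetical helper, not the authors' code.

```python
import numpy as np

def estimate_capacity(eigs, k_min, k_max):
    """Fit a power law to the tail cumulative sums of a sorted spectrum.
    If C(k) = sum_{j>=k} omega_j ~ k^{1-alpha}, the log-log slope is 1 - alpha."""
    eigs = np.sort(eigs)[::-1]
    tail = np.cumsum(eigs[::-1])[::-1]        # tail[k-1] = sum_{j>=k} eigs
    ks = np.arange(1, len(eigs) + 1)
    sel = (ks >= k_min) & (ks <= k_max)       # region qualitatively resembling a power law
    slope = np.polyfit(np.log(ks[sel]), np.log(tail[sel]), 1)[0]
    return 1.0 - slope

# sanity check on a synthetic spectrum omega_k = k^{-1.5}
omega = np.arange(1.0, 1_000_001.0) ** -1.5
alpha_hat = estimate_capacity(omega, k_min=10, k_max=300)
```

In practice one would feed in the eigenvalues of the empirical kernel Gram matrix, restrict the fit window away from the finite-size drop at k ≈ m, and proceed analogously for the source coefficient r.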

Details on numerical simulations
In this subsection, we provide further details on the simulations presented in Fig. 4 of the main text. For each sample complexity n, a subset of size n was randomly sampled without replacement from Dm. As in [15, 28, 51], the whole dataset Dm was used as a test set. The max-margin simulations in Fig. 4 were performed using the scikit-learn SVC package at vanishing regularization λ = 10^{-5}, and averaged over 50 realizations of the training set. The ridge simulations in Fig. 4 were realized using the scikit-learn KernelRidge package, with the optimal λ estimated using scikit-learn GridSearchCV's default 5-fold cross-validation routine over a grid λ ∈ {0} ∪ (10^{-10}, 10^5) with logarithmic step size 0.026. The misclassification error was also averaged over 50 realizations of the training set.

In this Appendix, we provide a detailed comparison of the presently reported rate for noiseless max-margin classification (15) and the rate reported in [24] (equation 7.53, following from Theorem 7.23 and Lemma A.1.7), in the case of a teacher in the Hilbert space f⋆ ∈ H, for which the latter result holds. We alternatively point the reader to section 5 of [45], where the same bound, and its underlying assumptions, are concisely recalled. We first provide a brief reminder of this upper bound, before proceeding to evaluate it in the present setting (8), and show that it yields a slower rate than (15), i.e. that the bound in [24] is loose. For completeness, we finally provide a comparison to the rates reported in [27].
Note that the rate (15) is always faster; in other words, the upper bound of [24] is loose in the present setting when f⋆ ∈ H. While numerical investigations suggest this is also true for f⋆ ∈ L²(X) \ H, we leave a more detailed comparison to [24] in this case to future work.
3. Theorems 3.3 and 3.5 in [27]

Finally, we discuss the rates reported in [27] under margin assumptions as in [26, 44]. Note that in the noiseless setting (1), the regression function η(ψ(x)) ≡ P(y = 1|ψ(x)) has the compact form η(ψ(x)) = sign(θ⋆⊤ψ(x)), and therefore does not belong to any Hölder class with exponent β > 0, rendering Theorems 3.3 and 3.5 in [27] effectively inapplicable. Note that, in addition, the rates would in any case be ambiguous, as both the margin exponent α and the dimension d are infinite in our setting. Finally, remark that [26, 52] provide rates under similar margin conditions in direct space, for the particular case of Gaussian kernels. Relating those conditions to the characterization (8) in feature space used in the present work is beyond the scope of the present manuscript.

FIG. 2. Misclassification error ϵg for ridge classification on synthetic Gaussian features, as specified in (8), for different source/capacity coefficients α, r, in the effectively regularized regime ℓ ≤ α (top) and unregularized regime ℓ > α (bottom). In blue, the solution of the eqs. (13) used in the characterization (9) for the misclassification error, using the g3m package [28]. The dimension p was cut off at 10^4. Red dots correspond to simulations averaged over 40 instances, for p = 10^4. The green dashed lines indicate the power-laws (18) (top) and (19) (bottom) derived in this work. The slight increase of the error for larger n in the unregularized regime (bottom) is due to finite-size effects of the simulations run at p = 10^4 < ∞. Physically, it corresponds to the onset of the ascent preceding the second descent present at finite p; see Appendix D for further discussion. The code used for the simulations is available here.

FIG. 4. Dots: misclassification error ϵg of kernel classification on CIFAR 10 with a polynomial kernel (top left) and an RBF kernel (top right), on Fashion MNIST with an RBF kernel (bottom left), and on MNIST with an RBF kernel (bottom right), for max-margin SVM (blue) and optimally regularized ridge classification (red), using respectively the Python scikit-learn SVC and KernelRidge packages. Dashed lines: theoretical decay rates for the error ϵg, (15) (blue) and (21) (red), computed from the empirically estimated capacity α and source r coefficients (see section V and Appendix F for details). The measured coefficients are summarized in Table I. The code used for the simulations is available here.

FIG. 5. Misclassification error ϵg for hinge classification on synthetic Gaussian data, as specified in (8), for different source/capacity coefficients α, r, for a regularization λ = n^{-ℓ}. In blue, the solution of the closed set of equations (B1) used in the characterization (9) for the misclassification error, using the g3m package [28]. The dimension p was cut off at 10^4. Red dots correspond to simulations using the scikit-learn SVC package, averaged over 40 instances, for p = 10^4. The green dashed line indicates the power-law rate (B20) derived in this work.

FIG. 10. Cumulative functions (F3) for 10^4 CIFAR 10 samples drawn at random, for an RBF kernel (top) and a polynomial kernel (bottom). The slopes, fitted using least-squares linear regression on the regions where the curves qualitatively resemble power-laws, are shown as dashed lines.


TABLE I. Values of the source and capacity coefficients.