Learning curves for the multi-class teacher-student perceptron

One of the most classical results in high-dimensional learning theory provides a closed-form expression for the generalisation error of binary classification with the single-layer teacher-student perceptron on i.i.d. Gaussian inputs. Both Bayes-optimal estimation and empirical risk minimisation (ERM) were extensively analysed for this setting. At the same time, a considerable part of modern machine learning practice concerns multi-class classification. Yet, an analogous analysis for the corresponding multi-class teacher-student perceptron was missing. In this manuscript we fill this gap by deriving and evaluating asymptotic expressions for both the Bayes-optimal and ERM generalisation errors in the high-dimensional regime. For Gaussian teacher weights, we investigate the performance of ERM with both cross-entropy and square losses, and explore the role of ridge regularisation in approaching Bayes-optimality. In particular, we observe that regularised cross-entropy minimisation yields close-to-optimal accuracy. For a binary (Rademacher) teacher, in contrast, we show that a first-order phase transition arises in the Bayes-optimal performance.


Introduction
Starting with the seminal work of Gardner and Derrida [1], the teacher-student perceptron is a broadly adopted and studied model for high-dimensional supervised binary (i.e., two-class) classification. In this model the input data are Gaussian independent identically distributed (i.i.d.) and a single-layer teacher neural network with weights drawn i.i.d. from some distribution generates the labels. A student neural network then uses the input data and labels to learn the teacher function. The corresponding generalisation error as a function of the number of samples per dimension α = n/d was first derived using the replica method from statistical physics in the limit n, d → ∞ for a range of teacher weights distributions (Gaussian and Rademacher being the most commonly considered) and for a range of estimators, e.g., Bayes-optimal or empirical risk minimisation (ERM) with common losses, see reviews [2,3,4] and references therein. Notably, the phase transition in the optimal generalisation error of the teacher-student perceptron with Rademacher teacher weights [5,6] is possibly one of the earliest examples of the so-called statistical-to-computational trade-offs that are currently broadly studied in high-dimensional statistics and inference. More recently, these works on the teacher-student perceptron have been put on rigorous ground in [7] for Bayes-optimal estimation, and in [8] for ERM with convex losses.
In what follows, we will be interested in the problem of learning the teacher target function in the high-dimensional setting, where n, d → ∞ at a fixed rate, or sample complexity, α = n/d, under two estimation procedures: empirical risk minimisation (ERM) and Bayes-optimal estimation. Empirical risk minimisation (ERM) - In the first case, the statistician (or student) is given only the training data (X, Y), and her goal is to learn the teacher weights W* with a multi-class perceptron model ŷ(x) = f̂_out(Wᵀx/√d) by minimising a regularised empirical risk over the training set:

Ŵ ∈ argmin_W [ Σ_{μ=1}^{n} ℓ(y^μ, Wᵀx^μ/√d) + r_λ(W) ].    (2)

The loss function ℓ accounts for the performance of the weight matrix W over a single training point. Two widely used loss functions for multi-class classification are the cross-entropy loss ℓ(y, ŷ) = − Σ_{r=1}^{k} y_r ln( e^{ŷ_r} / Σ_{r'=1}^{k} e^{ŷ_{r'}} ) and the square loss ℓ(y, ŷ) = (y − ŷ)ᵀ(y − ŷ)/2. We will focus on the ridge regularisation function r_λ(W) = λ‖W‖²_F/2, where ‖·‖_F indicates the Frobenius norm.
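For concreteness, a minimal sketch of the two losses and of the ridge penalty acting on a one-hot label y and on the pre-activations Wᵀx/√d (the function names and the explicit softmax rewriting are our own illustration, not part of the analysis):

```python
import numpy as np

def cross_entropy_loss(y, z):
    """Multi-class cross-entropy between a one-hot label y in R^k
    and the pre-activations z = W^T x / sqrt(d) in R^k."""
    z = z - z.max()                      # numerical stabilisation of the softmax
    log_softmax = z - np.log(np.exp(z).sum())
    return -np.dot(y, log_softmax)

def square_loss(y, z):
    """Square loss (y - z)^T (y - z) / 2 between one-hot y and pre-activations z."""
    return 0.5 * np.dot(y - z, y - z)

def ridge_penalty(W, lam):
    """Ridge regularisation r_lambda(W) = lam * ||W||_F^2 / 2."""
    return 0.5 * lam * np.sum(W ** 2)

# tiny usage example
y = np.array([0.0, 1.0, 0.0])
z = np.array([0.2, 1.1, -0.3])
print(cross_entropy_loss(y, z), square_loss(y, z))
```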
Bayes-optimal estimator - In the second case, known as the Bayes-optimal setting, the student has access not only to the training data but also to prior knowledge on the teacher weights distribution P_{w*} and on the model generating the inputs and the labels (1). In the teacher-student setting under consideration, where labels are generated by a noiseless channel, the Bayes-optimal estimator for the label y_new of a previously unseen point x_new can be computed directly from the Bayes-optimal estimator Ŵ^BO of the teacher weights as ŷ_new = f_out((Ŵ^BO)ᵀ x_new). The matrix Ŵ^BO is the minimiser of the mean-squared error with respect to the ground truth W*, i.e.,

Ŵ^BO = argmin_Ŵ E_{X,W*} ‖Ŵ − W*‖²_F = E_{W|(X,Y)}[W].    (3)

Note that computing the Bayesian estimator explicitly requires computing the posterior distribution, which is in general unfeasible in high dimensions. However, as we shall see, its performance can be characterised exactly in this limit. A key quantity in our derivation is the free entropy density Φ_d = (1/d) E_{X,Y} ln Z_d, (4), where Z_d is the partition function, i.e., the normalisation factor of the posterior distribution over the weights (5). In the Bayes-optimal setting, the free entropy density is closely related to the mutual information density between the labels and the weights, see [7] for an explicit discussion of this connection.
Generalisation error - The performance of different optimisation strategies is measured through the generalisation error, i.e. the expected error on a fresh sample. As is commonly done for classification, in this work we will be interested in the misclassification rate (a.k.a. 0/1 error):

ε_gen = E[ 1( ŷ(x_new) ≠ y_new ) ],    (6)

where x_new is a previously unseen data point and y_new is the corresponding label, generated by the teacher as in Eq. (1). Similarly, the estimator ŷ is generated by the weight matrix Ŵ, which in turn depends on the training set. We compare the performance obtained with ERM to the one achieved by the Bayes-optimal estimator from Eq. (3). Note that Eq. (6) for the Bayes-optimal error can be rewritten as

ε^BO_gen = 1 − E_{x, X, W*}[ f_out(W*ᵀ x) · ⟨f_out(Wᵀ x)⟩ ],    (7)

where we have renamed for brevity x = x_new and ⟨•⟩ = E_{W|(X,Y)}[•], and we have used that ‖f_out(•)‖²₂ ≡ 1. Since the distribution of x_new is rotationally invariant, the averaged quantity E_{X, x_new, W*}[ f_out(W*ᵀ x_new) · f_out((Ŵ^BO)ᵀ x_new) ] only depends on the correlation between W* and Ŵ^BO, which as we will discuss later concentrates to the maximiser of the free entropy in Eq. (4) in the high-dimensional limit.
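As an illustration, the misclassification rate of Eq. (6) can be estimated by Monte Carlo on fresh Gaussian inputs. The sketch below assumes the argmax/one-hot teacher channel of Eq. (1) and is only meant to make the definition concrete:

```python
import numpy as np

def one_hot_argmax(z):
    """Channel f_out: one-hot encoding of the argmax of z (applied row-wise)."""
    y = np.zeros_like(z)
    y[np.arange(z.shape[0]), z.argmax(axis=1)] = 1.0
    return y

def generalisation_error(W_star, W_hat, n_test=100_000, rng=None):
    """Monte Carlo estimate of the 0/1 error E[1(y_hat != y_new)] on fresh Gaussian inputs."""
    rng = np.random.default_rng(rng)
    d = W_star.shape[0]
    X = rng.standard_normal((n_test, d))
    y_true = one_hot_argmax(X @ W_star / np.sqrt(d))
    y_pred = one_hot_argmax(X @ W_hat / np.sqrt(d))
    # a sample is correctly classified iff the two one-hot vectors coincide
    return 1.0 - np.mean(np.all(y_true == y_pred, axis=1))

# example usage with a noisy copy of a random Gaussian teacher
rng = np.random.default_rng(0)
W_star = rng.standard_normal((500, 3))
W_hat = W_star + 0.5 * rng.standard_normal((500, 3))
print(generalisation_error(W_star, W_hat, rng=1))
```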

Main theoretical results
From the definition of the generalisation error in (6) and of the teacher model (1), it is easy to see that the generalisation error crucially depends only on the statistics of the k-dimensional quantities (W*ᵀ x_new, Ŵᵀ x_new) ∈ R^k × R^k (a.k.a. local fields) - both for Bayes-optimal estimation and ERM. Therefore, characterising the sufficient statistics of the local fields is equivalent to characterising the generalisation error. Our key theoretical result is that in the high-dimensional limit considered here the local fields are jointly Gaussian, and therefore the generalisation error only depends on the correlation m between the teacher W* and the estimator Ŵ, and the covariances ρ_d and q of the teacher and the estimator respectively (a.k.a. the overlaps):

m_d = W*ᵀŴ/d,   q_d = ŴᵀŴ/d,   ρ_d = W*ᵀW*/d ∈ R^{k×k}.    (8)

Note that we keep the subscript d to emphasise that these definitions are still in finite dimension and to distinguish them from the corresponding overlaps in the high-dimensional limit. As we will show next, these low-dimensional sufficient statistics can be computed explicitly by solving a set of coupled (k − 1) × (k − 1) self-consistent equations.
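A quick numerical sanity check of this statement (purely illustrative, with the overlap definitions as in Eq. (8)): for Gaussian inputs, the empirical covariance of the local fields is given by the overlap matrices already at moderate d.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 2000, 3
W_star = rng.standard_normal((d, k))                   # teacher (Gaussian prior, for illustration)
W_hat = 0.5 * W_star + rng.standard_normal((d, k))     # any estimator correlated with the teacher

# finite-dimensional overlaps, Eq. (8)
rho_d = W_star.T @ W_star / d
m_d = W_star.T @ W_hat / d
q_d = W_hat.T @ W_hat / d

# local fields on fresh Gaussian inputs
n_test = 200_000
X = rng.standard_normal((n_test, d))
h_star = X @ W_star / np.sqrt(d)
h_hat = X @ W_hat / np.sqrt(d)

# the empirical covariance is the block matrix [[rho_d, m_d], [m_d^T, q_d]] up to Monte Carlo error
emp_cov = np.cov(np.hstack([h_star, h_hat]).T)
print(np.allclose(emp_cov[:k, :k], rho_d, atol=2e-2),
      np.allclose(emp_cov[:k, k:], m_d, atol=2e-2),
      np.allclose(emp_cov[k:, k:], q_d, atol=2e-2))
```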

Performance of empirical risk minimization
Our result holds under the following assumptions, in addition to the Gaussian hypothesis on the design matrix X. Assumptions: (A1) the functions L, r_λ are proper, closed, lower-semicontinuous, convex functions. The loss function L is differentiable and pseudo-Lipschitz of order 2 in both its arguments. We assume additionally that the regularisation r_λ is strongly convex, differentiable and pseudo-Lipschitz of order 2; (A2) the dimensions n, d grow linearly according to the finite ratio α = n/d; (A3) the lines of the ground truth matrix W* ∈ R^{d×k} are sampled i.i.d. from a sub-Gaussian probability distribution in R^k.
Theorem 2.1. Under (A1)-(A3), for any pair of pseudo-Lipschitz functions φ₁: R^{d×k} → R and φ₂: R^{n×k} → R of order 2, the estimator Ŵ and Ẑ = XŴ/√d satisfy the asymptotic characterisation given in Eq. (10), where →_P denotes convergence in probability as n, d → ∞ and the parameters (m, q, V) are the solution (assumed to be unique) of the set of self-consistent equations (10), written in terms of the auxiliary parameters (m̂, q̂, V̂). We have made use of the auxiliary functions of Eq. (11), where P_{w*} is the distribution (in R^k) of the teacher weights and prox denotes a proximal operator (here defined with matrix parameters, see Appendix D.3 for more detail). The simplified expressions of the auxiliary functions are provided in Appendix C.1 and depend on the choices of the teacher weights distribution, the regularisation and the loss function.
Proof outline. We now provide a short outline of the proof for the asymptotic performance of the estimator obtained with convex ERM. The idea, pioneered in [16], is to express the estimator Ŵ as the limit of a carefully chosen sequence whose iterates have an exact, rigorous asymptotic characterisation. Such a sequence can be designed using iterations of an approximate message passing (AMP) algorithm [17,18], for which the statistics of the iterates are asymptotically characterised by a closed-form, low-dimensional iteration, the state evolution equations. In our case we need a matrix-valued AMP iteration with separable update functions, rigorously covered in [15,19]. One then needs to show that converging trajectories of the iteration can be systematically found, in order to properly define the limit of the sequence. Here this is made possible by the convexity of the problem. In the end, the performance of the estimator is shown to be characterised by the fixed point of the state evolution equations.
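In practice, both the ERM equations (10) and the Bayes-optimal equations (14) are solved by a damped fixed-point iteration. The skeleton below is a generic sketch: the two update maps (the auxiliary functions of Eq. (11)) are problem-specific and are left here as placeholder assumptions, with a toy scalar example only to exercise the solver.

```python
import numpy as np

def solve_self_consistent(overlap_update, hat_update, q0, damping=0.5,
                          tol=1e-8, max_iter=2000):
    """Damped fixed-point iteration for self-consistent (state-evolution-like) equations.

    hat_update(q) -> q_hat and overlap_update(q_hat) -> q are the two halves of the update;
    in the Bayes-optimal case q is the (k-1)x(k-1) overlap matrix of Eq. (14).
    """
    q = np.array(q0, dtype=float)
    for _ in range(max_iter):
        q_hat = hat_update(q)
        q_new = overlap_update(q_hat)
        q_next = damping * q_new + (1.0 - damping) * q
        if np.max(np.abs(q_next - q)) < tol:
            return q_next
        q = q_next
    return q

# toy usage: an "uninformed" initialisation corresponds to q0 close to zero,
# an "informed" one to q0 close to its maximal value.
q_star = solve_self_consistent(lambda qh: qh / (1.0 + qh),
                               lambda q: 2.0 * q + 0.1,
                               q0=np.array(1e-3))
print(q_star)
```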

Bayes-optimal performance
The sufficient statistics describing the performance of the Bayes-optimal estimator (3) can also be derived in the high-dimensional limit, and are closely related to the free entropy density. Indeed, in Appendix B we show that the Bayes-optimal estimator can be fully characterised by only one overlap matrix q ∈ R^{k×k}, given by the solution of the extremisation problem (13), where (Z*_w, Z*_out) are the same auxiliary functions defined in Eqs. (11), and Φ is the free entropy density defined in Eq. (4). Looking for extremisers of this problem leads to the set of self-consistent equations (14), where, again, (f*_w, f*_out) are the same as defined in Eqs. (11). Note the similarity between the equations (14) and the ones characterising the sufficient statistics of ERM, Eq. (10). Indeed, the equations above can be obtained from the ones of ERM through the mapping (16), which identifies the ERM overlaps with their Bayes-optimal counterparts (in particular m → q, and correspondingly for the conjugate parameters); this identification is known in the context of statistical physics as the Nishimori conditions [20]. Intuitively, the student's additional knowledge of the data generating process is translated by choosing the same set of auxiliary functions as the teacher. These conditions imply that, on average, there is no statistical difference between the ground truth configuration and a configuration sampled uniformly at random from the posterior distribution. Therefore, in the Bayes-optimal setting there is no distinction between the teacher-student overlap and the student self-overlap. This connection is further discussed in Appendix B, where we show that both sets of equations can be derived from a common framework. Despite the close similarity between the two sets of self-consistent equations, note one key difference: the set of extrema in Eq. (14) is not necessarily a single point. This means that, differently from Eqs. (10), the fixed point of the self-consistent equations (14) might not be unique. In this case, it is important to stress that the overlap q corresponding to the Bayes-optimal estimator (3) is, by definition, the one with highest free entropy density. Therefore, the Bayes-optimal generalisation error is evaluated by finding the fixed point of equations (14) that maximises the free entropy (13).
A proof of this claim and of equations (14) and (13) for the Bayes-optimal performance was given in [13] (Theorem 3.1) and [14] in the setting of the committee machine, by an interpolation method that shows the correctness of the replica prediction for the free entropy of the system. Their proof applies to teacher-student committee machines with bounded output channel, prior distribution with finite second moment and Gaussian i.i.d. inputs. Therefore, it applies to the multi-class perceptron of our setting, both with Gaussian and Rademacher prior.

Generalisation error
The characterisation of the generalisation error in the high-dimensional limit is a direct consequence of Theorem 2.1.
Corollary 2.2. In the high-dimensional limit, the asymptotic generalisation error associated to the ERM estimator (2) can be expressed only as a function of the parameters (m, q) obtained by solving the self-consistent equations (10), with ρ = lim_{d→∞} W*ᵀW*/d.
The proof of Corollary 2.2 is straightforward and follows by noticing that, for any Ŵ, the generalisation error depends on the data only through the local fields (W*ᵀ x_new, Ŵᵀ x_new), whose joint Gaussian statistics are characterised by Theorem 2.1. As one can expect from the discussion in Section 2.2, the Bayes-optimal generalisation error is obtained by a similar, but simpler expression depending only on the overlap q obtained by extremising (13).

Approximate message-passing algorithm

In order to illustrate our theoretical results for the performance of the Bayesian (3) and ERM (2) estimators, we would like to compare our asymptotic expressions for the generalisation error with finite-instance simulations. On the one hand, the regularised empirical risk defined in (2) is strongly convex, and therefore it can be readily minimised with any descent-based algorithm such as gradient descent or stochastic gradient descent. Indeed, in the ERM simulations that follow we employ out-of-the-box multi-class solvers from scikit-learn [21] to assess our theoretical result from Theorem 2.1. On the other hand, explicitly computing the Bayesian estimator (3) requires sampling from the posterior, an operation which is prohibitively costly in high dimensions. Instead, in this manuscript we employ an Approximate Message Passing (AMP) algorithm to efficiently approximate the posterior marginals. AMP has several interesting properties which make it a popular tool in the study of random problems. First, it is proven to be optimal among a class of random estimation problems [22], and for this reason it is widely used as a benchmark to assess algorithmic complexity. Second, it admits a set of scalar state evolution equations allowing to track its performance in high dimensions [15]. For the Bayes-optimal estimation problem considered here, AMP is summarised by the pseudo-code in Algorithm 1. It follows the well-known AMP algorithm for generalised linear estimation [17,23], which takes advantage of the high-dimensional limit d → ∞ by approximating the posterior distribution (5) by a multivariate Gaussian distribution through a belief propagation procedure expanded in powers of 1/√d. The difference is that the estimators ŵ_i are k-dimensional vectors and their variances Ĉ_i are k × k dimensional matrices, with i = 1, . . ., d. Both channel and prior update functions, f_out and f_w respectively, are defined in Appendix C.1. For a detailed derivation of the algorithm, see [24].
Several versions of this matrix-valued AMP and the associated state evolution appeared in previous works, e.g. [13]. It can be shown that the state evolution equations associated to Algorithm 1 for Bayes-optimal estimation coincide exactly with the self-consistent equations (14) presented in Section 2.2, starting from an uninformed initialisation q⁰ ≈ 0 [13]. This interesting property implies that when the extremisation problem in Eq. (13) has only one extremiser, AMP provides an exact approximation to the Bayes-optimal estimator in the high-dimensional limit. Instead, when there is more than one maximum in Eq. (13), AMP will converge to the estimator with overlap q closest to the uninformed initial condition. If this is not the global maximum, this corresponds to a situation where AMP differs from the Bayes-optimal estimator. Since AMP provides a bound on the performance of first-order algorithms, this situation is an example of an algorithmic hard phase, where it is conjectured that the statistically optimal performance cannot be achieved by algorithms running in time ∼ O(d²).
We have implemented Algorithm 1 for k = 3 using the mapping presented in Appendix A, which makes the estimators (k − 1)-dimensional vectors and their variances (k − 1) × (k − 1) dimensional matrices. The detailed expressions for the computation of the denoising functions, as well as the integrals to be numerically evaluated, are presented in Appendix E.

[Figure 2 caption fragment: generalisation error as a function of α, evaluated using Eq. (19). The orange points correspond to the error that would be asymptotically reached by the randomly initialised AMP algorithm. The blue points mark the Bayes-optimal generalisation error. The inset depicts the corresponding free entropies as a function of α, their crossing locating the information-theoretic transition to perfect generalisation at α_IT^{(k=3)} = 2.45. AMP reaches perfect generalisation starting from α_alg^{(k=3)} = 2.89.]

Results for 𝑘 = 3 classes
In this section we discuss the consequences of our theoretical results for the particular case of k = 3 classes and compare them with numerical simulations. We investigate the dependence of the generalisation error on the sample complexity α. First, we consider the case of Rademacher teacher weights and show that a first-order phase transition arises in the Bayes-optimal performance. Then, we turn to the case of Gaussian teacher weights and explore the role of the regularisation strength λ in approaching the Bayes-optimal performance with ERM.
Bayes-optimal performance for Rademacher teacher - The main difference between Gaussian and Rademacher teacher is that in the second case perfect generalisation is achievable at finite sample complexity, in line with the results known for the two-class case of [5,6,2]. To compute the optimal information-theoretical performance, we have evaluated the global extremum of the replica free entropy. To that extent, we have run the replica saddle-point iterations, Eqs. (14), with both uninformed and informed initialisations and computed the free entropy (13) of the fixed points (if distinct) reached by the two initialisations. In Fig. 2a we report the generalisation error corresponding to the fixed points reached by the two initialisations, along with their corresponding free entropy in the inset. We found that indeed, for Rademacher teacher weights, the generalisation error decreases continuously for α ≤ α_IT^{(k=3)} ≈ 2.45, and then jumps to zero for all α > α_IT^{(k=3)}. From the statistical physics point of view, this discontinuous transition in the generalisation error corresponds to a first-order phase transition associated to the discontinuous appearance of a second extremum in the free energy potential corresponding to perfect learning. As we have previously discussed, the state evolution of the AMP Algorithm 1 is equivalent to doing gradient descent on the free energy potential (13) starting from an uninformed random initialisation. Therefore, the appearance of a second extremum away from zero implies that AMP is not able to achieve the Bayes-optimal statistical performance. Since AMP is conjectured to be optimal among first-order methods [22], this result is an example of a fundamental statistical-to-algorithmic gap in this problem. For α > α_algo^{(k=3)} = 2.89, we observe that the uninformed minimum disappears, and we can check that this coincides with the sample complexity for which AMP is able to achieve zero generalisation error from random initialisation. This marks the algorithmic threshold, i.e. the sample complexity beyond which perfect generalisation is reachable algorithmically efficiently. Our findings thus suggest the existence of an algorithmic hard phase for α_IT^{(k=3)} < α < α_algo^{(k=3)}, where the theoretically optimal performance is not reachable by efficient algorithms.
We note here the comparison with the canonical perceptron with Rademacher teacher weights and two classes, where the same thresholds are well known to be α_IT^{(k=2)} = 1.249 and α_algo^{(k=2)} = 1.493 [5,6,7]. Naturally, the k = 2 values are roughly half the k = 3 ones, since for k classes the teacher has k − 1 independent d-dimensional binary components that need to be recovered in order to reach perfect generalisation. Comparing more precisely, both k = 3 thresholds (and their difference) are slightly smaller than twice the corresponding k = 2 values: 2 × 1.249 = 2.498 > 2.45 and 2 × 1.493 = 2.986 > 2.89.
Bayes-optimal performance for Gaussian teacher - Figs. 3 and 4 summarise our results for the case of Gaussian teacher weights. The Bayes-optimal generalisation error, computed from Eq. (7), is depicted by the dashed black line in both figures and is a smooth, monotonically decreasing function of the sample complexity. It is interesting to note that, for Gaussian teacher weights, the Bayes-optimal AMP algorithm - described in Sec. 3 and marked by the green symbols in Fig. 3 - achieves the Bayes-optimal performance.

[Figure 4 caption fragments: theory from the fixed points of Eq. (10); the symbols mark the results from numerical simulations at dimension d = 1000, averaged over 250 instances. We also plot the performance of simulations at zero regularisation and theory at λ → 0⁺, for both cross-entropy (teal) and square loss (purple). Different values of α are depicted with different colours. For these values of α the optimal λ for the square loss is close to 1.]
Figure 4: Bayes-optimal and ERM performances in the case of Gaussian teacher weights.
This is highly non-trivial: computing the Bayesian estimator usually requires sampling from the posterior distribution of the weights given the data, and therefore can be prohibitively costly in the high-dimensional regime considered here. For Gaussian weights, AMP provides an exact approximation of the posterior marginals in quadratic time in the input dimension.
Approaching Bayes-optimality with ERM - Instead, how does ERM compare to the Bayesian estimator? Note that the empirical risk in Eq. (2) is convex, and therefore, at variance with the posterior estimation, this problem can be readily simulated using descent-based algorithms such as stochastic gradient descent. The generalisation error obtained by ERM is plotted in Fig. 4a as a function of the sample complexity. The full lines depict our theoretical predictions for the learning curves while the symbols mark the results from numerical simulations performed at finite dimension d = 1000 (more details on the numerics are provided in Appendix F). We find excellent agreement between the two. For both cross-entropy and square losses, we show the performance achieved without regularisation (λ = 0) and with naively optimised λ, obtained by cross-validation in Fig. 4 (c), (d). Interestingly, we find that the optimally regularised cross-entropy loss achieves a close-to-optimal performance, while the square loss maintains a finite gap with respect to the Bayes-optimal error even at fine-tuned regularisation strength. Similar results were obtained for the two-class teacher-student perceptron [8]. The fact that regularised cross-entropy minimisation is so close to optimal is remarkable and worth further investigation in a more general setting.
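The kind of finite-size ERM experiment reported in Fig. 4 can be reproduced along the following lines. This is only a sketch under our reading of the data model; note that scikit-learn parametrises the ridge penalty through its inverse C, and the exact mapping between C and λ depends on whether the empirical risk is summed or averaged over samples, so the value below is indicative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d, k, alpha = 1000, 3, 3.0
n = int(alpha * d)

# teacher-student data: Gaussian inputs, Gaussian teacher, argmax labels
W_star = rng.standard_normal((d, k))
X_train = rng.standard_normal((n, d))
y_train = (X_train @ W_star / np.sqrt(d)).argmax(axis=1)

# ERM with cross-entropy loss and ridge (L2) regularisation
lam = 0.01
clf = LogisticRegression(penalty="l2", C=1.0 / lam, max_iter=5000)
clf.fit(X_train / np.sqrt(d), y_train)

# test error on fresh samples
X_test = rng.standard_normal((100_000, d))
y_test = (X_test @ W_star / np.sqrt(d)).argmax(axis=1)
eps_gen = np.mean(clf.predict(X_test / np.sqrt(d)) != y_test)
print(f"ERM test error at alpha={alpha}: {eps_gen:.3f}")
```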
Large-α behaviour - Fig. 4b considers again a Gaussian teacher prior and explores the behaviour of the generalisation error at large sample complexity. The Bayes-optimal performance is depicted in black and decays as 1/α in the large-α regime. On the other hand, the performance obtained by ERM at fixed λ displays a slower decay, α^{−1/2}. This is again compatible with the behaviour observed in the two-class case [8]. It remains to be analysed whether for k > 2 the optimally regularised ERM achieves the 1/α rate as it does for two classes.
The role of regularisation - The lower panels of Fig. 4 further illustrate the role played by ridge regularisation. We plot the generalisation error as a function of the regularisation strength λ at fixed sample complexity α for the cross-entropy (4c) and the square loss (4d). Different curves represent different values of the sample complexity. We observe that the optimal regularisation depends only very mildly on the sample complexity α for this range of values of α.

Appendix A Prior reduction
In this section we explain the mapping from k to k − 1 dimensions that we apply to evaluate our theoretical results from Eq. (10) as well as to implement Algorithm 1. The intuition is exactly the same as for the binary perceptron: the knowledge of k − 1 components of the one-hot label representation y is enough to determine the remaining component. Nevertheless, for k > 2, shifting the weights in order to reproduce this structure introduces additional correlations that must be taken into account.
We recall that W* is a d × k matrix, and denote by w*_l, 1 ≤ l ≤ k, its columns, each corresponding to a different class. Notice that the label y = e_{argmax_l (w*_lᵀ x)} given by Eq. (1) for a data point x can be equivalently expressed by taking the k-th component, i.e. w*_k, as a reference for comparison and shifting the weights so that w*_k = 0, so that the problem is reduced to k − 1 dimensions. We then replace W* by W̄* ∈ R^{d×(k−1)}. Denoting by 1_k the k-dimensional vector with all entries equal to 1, we present schematically in Figure 5 the prior reduction. Note that this mapping introduces correlations among the columns of W̄*, but not along the extensive dimension: the d components of each vector w̄*_l remain i.i.d. Therefore, the prior over the weights is still factorisable along the extensive dimension d.
Gaussian prior - In the Gaussian prior setting, where w* ∈ R^k denotes a row of the matrix W* (the k class components attached to a single input coordinate), the transformation imposes a new covariance matrix with the elements given in Eq. (A.8b). Thus, since all contributions related to the k-th degree of freedom become zero, the transformation (A.1) allows us to write the mapping explicitly: each row of the reduced matrix W̄* follows a Gaussian distribution with zero mean and covariance matrix given by Eq. (A.8b). Rademacher prior - In the Rademacher setting, where the entries of w* are ±1, the same shift leads to the reduced prior for w̄* ∈ R^{k−1} given in Eq. (A.11). The dimensional reduction simplifies our analysis when it comes to the numerical evaluation both of the Gaussian integrals in Eq. (10) and of the prior and channel updates of AMP, as discussed in Appendix E. The same reformulation applies straightforwardly to the remaining parameters of the problem.
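A small numerical illustration of this reduction for a Gaussian teacher and k = 3 (the shift w̄_l = w_l − w_k used below is our reading of the mapping sketched in Figure 5; the resulting covariance matches the reduced covariance Σ = I_{k−1} + 1 1ᵀ of Appendix C.2.1, i.e. [2, 1; 1, 2] for k = 3):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_rows = 3, 1_000_000

# each row of W* has k i.i.d. N(0,1) entries (Gaussian teacher prior)
W = rng.standard_normal((n_rows, k))

# reduction: use the k-th class as reference, w_bar_l = w_l - w_k  (assumed mapping)
W_bar = W[:, : k - 1] - W[:, [k - 1]]

# the reduced rows are Gaussian with covariance I_{k-1} + 1 1^T (= [[2, 1], [1, 2]] for k = 3)
print(np.cov(W_bar.T))

# the argmax (hence the label) is preserved: argmax(w) == argmax((w_bar, 0))
labels = W.argmax(axis=1)
labels_reduced = np.hstack([W_bar, np.zeros((n_rows, 1))]).argmax(axis=1)
print(np.mean(labels == labels_reduced))
```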

B Replica calculation
In this section, we carry out the (heuristic) replica computation leading to the system of equations (10) in the main text. We consider a general setting where the student has access to a prior distribution P_w over the teacher weights and a model distribution P_out, which can be the true ones or not. This formulation encompasses both the Bayes-optimal and non-Bayes-optimal settings. As we shall see in the following, ERM can be seen as a special case of the latter. The posterior distribution of the student weights is written in terms of the local fields h^a_μ = (W^a)ᵀ x_μ/√d, and the partition function is its normalisation. By using the replica trick, we can compute the free entropy in the high-dimensional limit. We can then rewrite the average in Eq. (B.3) in replicated form, where we have renamed W* = W⁰. In order to account for both the Bayes-optimal and non-Bayes-optimal cases, we keep the distinction between teacher and student distributions by adding an index a to the prior and model distributions. In what follows, P⁰_w = P*_w and P⁰_out = P*_out refer to the teacher, while P^{a>0}_w = P_w and P^{a>0}_out = P_out to the student. Let us denote the covariance tensor of the local fields h^a_μ as Σ, with blocks Σ^{ab} ∈ R^{k×k}. The replicated average can then be rewritten in terms of a trace term, a prior term and an output term, together with the corresponding auxiliary functions. We observe that, upon exchanging the high-dimensional limit and the limit of a vanishing number of replicas, the high-dimensional limit of the free entropy can be computed via a saddle-point method, leading to Eq. (B.17).

B.1 Replica symmetric ansatz
In order to progress in the calculation, we assume that the extremum in Eq. (B.17) is attained at overlap matrices described by a replica symmetric (RS) ansatz [20]. We distinguish between the Bayes-optimal and non-Bayes-optimal cases. Note that in the Bayes-optimal case we can drop the a-index from the prior and model distributions. In the non-Bayes-optimal case, we will denote the teacher distributions by P*_w, P*_out and the student ones simply by P_w, P_out.
Bayes-optimal RS ansatz - In the Bayes-optimal setting we make the RS ansatz for the overlap matrices, under which the trace term simplifies accordingly. The prior and output terms can be simplified by performing a Hubbard-Stratonovich transformation that allows to decouple the replica indices a, b. Indicating the standard Gaussian measure with Dξ, ξ ∼ N(0, I_k), we obtain the decoupled expressions. Since we are interested in the limit of a vanishing number of replicas, it is useful to rewrite these expressions accordingly. Non-Bayes-optimal RS ansatz - In the non-Bayes-optimal setting we make the analogous RS ansatz. The trace term simplifies in the same way, and the prior term follows. In order to compute the output term we need the inverse of the covariance Σ, which has a similar block structure as Σ itself; the components of the inverse can be computed from the relation Σ Σ⁻¹ = I, and the determinant of Σ follows from the same block structure. The above results allow us to rewrite the output term explicitly. As in the Bayes-optimal case, in order to take the limit of a vanishing number of replicas, we rewrite the expressions accordingly.

B.2 Computing the free entropy
At this point, it is straightforward to compute the free entropy from Eq. (B.17). In the Bayes-optimal case, we obtain

Φ = extr_{q, q̂} { −½ Tr[q q̂] + Ψ*_w(q̂) + α Ψ*_out(q) },    (B.37)

while in the non-Bayes-optimal case we obtain the analogous expression, Eq. (B.39).

C Update equations for the overlap parameters
We can now compute the update equations for the overlap parameters, both in the Bayes-optimal and non-Bayes-optimal settings, by taking the derivatives of Eq. (B.37) with respect to (q, q̂) and of Eq. (B.39) with respect to the full set of overlaps and their conjugates, and setting them to zero. In the Bayes-optimal setting the update equations follow directly. In the non-Bayes-optimal setting, we define for simplicity V = Q₀ − q and V̂ = Q̂₀ + q̂, and we find the corresponding set of equations, where in both settings we have made use of the following definitions.

C.1 Definitions of the update functions
The explicit expressions of the auxiliary functions depend on the choice of the teacher and student distributions. We evaluate these expressions for the special cases under consideration in the following sections.

C.2 Bayes-optimal update functions
In this section, we evaluate the Bayes-optimal update functions. We consider directly the expressions obtained after performing the mapping described in Appendix A.

C.2.1 Gaussian prior terms with the dimensional reduction A
In the case of a Gaussian teacher prior, it is straightforward to notice that, after the application of the mapping of Appendix A, the prior over the weights is still Gaussian with covariance Σ = I_{k−1} + 1_{k−1} 1_{k−1}ᵀ. For k = 3, the reduced covariance matrix is given by Σ = [2, 1; 1, 2].

C.2.2 Rademacher prior terms with the dimensional reduction A
Considering k = 3, the reduced prior given by Eq. (A.11) becomes the distribution in Eq. (C.12). The denoising functions for this case are computed numerically, via Monte Carlo sampling of the distribution given in Eq. (C.12).

C.2.3 Output terms with the dimensional reduction A
Considering directly the mapping to dimension k − 1, we can write the Bayes-optimal model distribution accordingly. Therefore, the auxiliary functions Z*_out, and so on, are composed of k − 1 contributions according to the membership of the label y in the argument. For instance, in the case k = 3, we obtain the Gaussian-integral expression of Eq. (C.14).
Similarly, for f*_out we need to change the integration bounds in order to take into account all the possibilities for the label. For each term, we can only compute the inner integral analytically, while we have to estimate the outer ones via Monte Carlo sampling. Therefore, applying the mapping of Appendix A is useful in order to reduce the number of integrals to be performed numerically and to speed up the whole procedure.

C.3 ERM update functions
The update equations for ERM can be derived as a special case of the non-Bayes-optimal equations (C.1) and following. In particular, this can be seen by rewriting the solution of the optimisation problem as the ground state of a Gibbs measure at inverse temperature β, i.e., as the solution in the limit β → ∞. Therefore, we can express the prior and model distributions of a student learning via ERM as exponentials of the regularisation and of the loss, respectively.

C.3.1 Prior terms with the dimensional reduction A
The ERM ridge-regularisation prior can therefore be seen as i.i.d. Gaussian with variance proportional to 1/λ. This means that, applying the mapping of Appendix A, the reduced prior is Gaussian with covariance proportional to ρ, where again ρ is the prior covariance in the reduced setting, i.e. ρ = [2, 1; 1, 2] for k = 3. We then rescale the order parameters and their conjugates with appropriate powers of β. Substituting the expressions above in Eqs. (10), we find that the equations become independent of the parameter β. The above rescaling is consistent and leads to a set of well-defined equations in the β → ∞ limit.

C.3.2 Output terms with the dimensional reduction A
We remind that we have performed the rescaling with β introduced above. In the β → ∞ limit, the ERM output term reduces, by a saddle-point (Laplace) argument, to a Moreau envelope M associated with the loss L. From Eq. (C.9d), the corresponding update can then be obtained using the proximal operator, consistently with the rescaling previously adopted, which leads to the final equations, Eqs. (10), in the main text, holding in the limit β → ∞.
Special case: square loss -The proximal operator for the square loss can be computed analytically:
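As a sanity check, the matrix-parameter proximal of Appendix D.3 can be evaluated numerically and compared with the closed form it admits for the square loss. The sketch below assumes the definition prox_{V, f}(ω) = argmin_z { f(z) + ½ (z − ω)ᵀ V⁻¹ (z − ω) }, under which the square-loss proximal is (I + V)⁻¹(ω + V y); both the definition convention and the resulting formula should be read as an illustration of the computation, not as the paper's exact notation.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
k = 2                                    # reduced dimension for k = 3 classes
V = np.array([[1.5, 0.3], [0.3, 0.8]])   # symmetric positive definite parameter
omega = rng.standard_normal(k)
y = np.array([1.0, 0.0])                 # (reduced) one-hot label
V_inv = np.linalg.inv(V)

def objective(z):
    # square loss + quadratic coupling of the assumed proximal definition
    return 0.5 * (y - z) @ (y - z) + 0.5 * (z - omega) @ V_inv @ (z - omega)

prox_num = minimize(objective, x0=np.zeros(k)).x              # numerical proximal
prox_closed = np.linalg.solve(np.eye(k) + V, omega + V @ y)   # closed form, same definition

print(prox_num, prox_closed)   # the two agree up to solver tolerance
```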

D Proof of the main theorem
In this section we prove the main theorem in a slightly more general setup than the one presented in the main part of the paper. We start by recalling the learning problem defining the ensemble of estimators, with a few auxiliary notations, so that this part is self-contained. The exact match with the replica prediction will be given at the end of the proof.

D.1 The learning problem
We start by recalling the definition of the problem. Consider the following generative model

Y = f_out( X W* / √d ),    (D.1)

where Y ∈ R^{n×k}, X ∈ R^{n×d} has i.i.d. N(0, 1) entries and W* ∈ R^{d×k}. The goal is to learn an estimator of W* using a generalised linear model defined by the optimisation problem

Ŵ ∈ arg min_W { L(Y, XW/√d) + r(W) },    (D.2)

where L, r are convex functions, and we omit the dependence of the regularisation r on the parameter λ for simplicity. We wish to determine the asymptotic properties of the estimator Ŵ in the limit where n, d → ∞ with fixed ratio α = n/d. We now list the necessary assumptions for our main theorem to hold.

Assumptions -
• the functions L, r are proper, closed, lower-semicontinuous, convex functions. The loss function L is differentiable and pseudo-Lipschitz of order 2 in both its arguments. We assume additionally that the regularisation r is strongly convex, differentiable and pseudo-Lipschitz of order 2.
• the dimensions n, d grow linearly with finite ratio α = n/d, and the number of classes k is kept constant.
• the lines of the ground truth matrix W* ∈ R^{d×k} are sampled i.i.d. from a sub-Gaussian probability distribution in R^k.

D.2 Reduction to an AMP iteration
We start by reformulating the optimisation problem (D.2) in order to be able to solve it with an AMP iteration. In particular, it is useful to separate the design matrix X into two contributions: one aligned with the ground truth W* and one independent of the teacher. To do so we condition X on the teacher input, writing X = X P_{W*} + X̃ P⊥_{W*}, where X̃ is an independent copy of the design matrix X, P_{W*} denotes the orthogonal projection on the subspace spanned by the columns of W*, and P⊥_{W*} = I_d − P_{W*}. Furthermore, since we assume that n, d are arbitrarily large and that k remains finite for each instance of the problem, the matrix W* has full column rank and the projector P_{W*} = W*((W*)ᵀW*)⁻¹(W*)ᵀ is always well-defined. We can then rewrite the original problem as an equivalent minimisation in which the teacher-aligned and teacher-orthogonal parts of the argument appear separately. The quantity XW* is an n × k Gaussian matrix with row covariance (W*)ᵀW*, and can be represented as XW* = ξ((W*)ᵀW*)^{1/2}, where ξ is an n × k random matrix with i.i.d. standard normal elements. We then introduce the order parameter m = (1/d) Ŵᵀ W* ∈ R^{k×k} and the teacher covariance (1/d)(W*)ᵀW* ∈ R^{k×k}. We may then rewrite the optimisation problem Eq. (D.26) as an equivalent problem under a constraint enforcing the definition of m, leading to a Lagrangian formulation in which the initial constraint on m automatically enforces the orthogonality constraint with respect to W*. The following lemma then characterises the feasibility sets of Ŵ, m and m̂.
Proof. Consider the optimisation problem defining Ŵ. From the strong convexity assumption on r, there exists a strictly positive constant c_r such that the function r̃(W) := r(W) − (c_r/2)‖W‖²_F is convex (and proper, closed, lower semi-continuous). We can then rewrite the optimisation problem in terms of r̃, which, owing to the convexity of the cost function, yields a bound on the minimiser. The functions L and r̃ are proper, thus their sum is bounded below for any value of their arguments. The pseudo-Lipschitz assumption on L and r then implies that there exist positive constants controlling the right-hand side; in particular, the term involving the labels is bounded since the labels are in {−1, +1} and k is finite. Now, using the definition of Ŵ and the fact that the singular values of P⊥_{W*} are bounded with probability one, there exists a constant C_W such that ‖Ŵ‖_F/√d ≤ C_W. Then, by definition of m and the Cauchy-Schwarz inequality, the norm of m is controlled as well. By assumption the lines of W* are sampled from sub-Gaussian distributions; thus, using Bernstein's inequality for sub-exponential random variables, there exists a positive constant C_{W*} such that, with high probability as n, d → +∞, ‖W*‖²_F/d ≤ C_{W*}. Combining this with the result on Ŵ, there exists a positive constant C_m such that ‖m‖_F ≤ C_m with high probability as n, d → +∞. We finally turn to m̂. The optimality condition for m in problem Eq. (D.10) gives an expression for m̂; the pseudo-Lipschitz assumption on L then implies that m̂ is bounded by quantities whose scaled norms are bounded with high probability, except for the operator norm of the random matrix X̃, which has i.i.d. N(0, 1/d) elements. Existing results in random matrix theory [25] ensure this operator norm is bounded with high probability as n, d → +∞, which concludes the proof of this lemma.
The optimisation problem Eq. (D.12) is convex and feasible. Furthermore, we may reduce the feasibility sets of Ŵ, m to compact spaces, and the cost function is coercive and thus has bounded lower level sets. Strong duality then implies that we can invert the order of minimisation to obtain an equivalent problem, and study the optimisation in W at fixed m, m̂, where the random matrix X̃ with i.i.d. N(0, 1) elements is independent of all other random quantities in the problem. The asymptotic properties of the unique solution to this optimisation problem can now be studied with a non-separable, matrix-valued approximate message passing iteration. The AMP iteration solving problem Eq. (D.26) is given in the following lemma. Lemma 2. Consider the AMP iteration defined by Eqs. (D.31-D.35). Then the fixed point (u^∞, v^∞) of this iteration verifies the corresponding characterisation, where U* is the unique solution to the optimisation problem Eq. (D.26).
Proof. To find the correct form of the non-linearities in the AMP iteration, we match the optimality condition of problem Eq. (D.26) with the generic form of the fixed point of the AMP iteration Eq. (D.101). In the subsequent derivation, we absorb the scaling 1/√d in the matrix X̃, such that its elements are i.i.d. N(0, 1/d), and omit time indices for simplicity. Going back to problem Eq. (D.26), we write its optimality condition. For any pair of k × k symmetric positive definite matrices S, Ŝ, this optimality condition is equivalent to an identity obtained by adding the same quantity on both sides of the equality. For the loss function, we can then introduce the resolvent (formally, a D-resolvent); similarly, for the regularisation, we introduce the corresponding resolvent, where V ∈ R^{k×k} and Ŝ ∈ R^{k×k} are positive definite matrices and v ∈ R^{d×k}. The optimality condition Eq. (D.39) may then be rewritten in terms of these resolvents, which gives the non-linearities of the AMP iteration. The following lemma then gives the exact asymptotics at each time step of the AMP iteration solving problem Eq. (D.26): its state evolution equations. Lemma 3. Consider the AMP iteration Eq. (D.31-D.35). Assume it is initialised with u⁰ such that lim_{d→∞} (1/d)‖u⁰ᵀu⁰‖_F exists, with a positive definite matrix Ŝ⁰, and with the convention that the iterate at time t = −1 is identically zero. Then for any t ∈ N, and any pair of sequences of uniformly pseudo-Lipschitz functions φ_{1,d}: R^{d×k} → R and φ_{2,n}: R^{n×k} → R, the state evolution characterisation holds. Proof. The proof of this lemma is identical to that of Lemma 7 from [9].
Combining these results, we obtain the following asymptotic characterisation of U*.
Lemma 5. For any fixed m and m̂ in their feasibility sets, let U* be the unique solution to the optimisation problem Eq. (D.26). Then, for any sequences (in the problem dimension) of pseudo-Lipschitz functions of order 2, φ_{1,d}: R^{d×k} → R and φ_{2,n}: R^{n×k} → R, the asymptotic characterisation of Eq. (D.62) holds, where G ∈ R^{d×k} and H ∈ R^{n×k} are independent random matrices with i.i.d. standard normal elements, and Q, Q̂, V, V̂ are given by the fixed point of the corresponding set of self-consistent equations. Proof. Combining the results of the previous lemmas, this proof is close to that of Theorem 1.5 in [16].
Returning to the optimisation problem on m, m̂ in Eq. (D.25), the solution at any dimension verifies the zero-gradient conditions on m and m̂. Using Lemma 5, with the assumption that the gradients of L, r are pseudo-Lipschitz, we obtain the asymptotic form of the stationarity condition for m, and similarly for m̂. Using the definition of the D-resolvents, this is equivalent to the expression in Eq. (D.81), which brings us to a set of six self-consistent equations fixing (m, m̂, Q, Q̂, V, V̂). These equations then characterise the asymptotic properties of the quantities Û and XÛ/√d. The properties of Ŵ and XŴ/√d are then obtained by using the decomposition of Ŵ into its components along and orthogonal to W*. To match these equations with the replica ones, we first need to assume that the loss and regularisation are separable. The proximal operators are then separable as well across the lines of their input matrices. All arguments then have i.i.d. lines (Gaussian matrices with k × k covariances, or lines of the teacher matrix, which are i.i.d., multiplied by k × k matrices), and the 1/d averages simplify, leaving the aspect ratio α in the quantities defined over arguments in R^{n×k}. The rest of the matching then boils down to identifying the proximal operators with the replica notation, as done in Appendix A, and to standard Gaussian integration, as done for instance in [8], Appendix III.3.

D.3 Toolbox
In this section, we reproduce part of the appendix of [9] for completeness, in order to give an overview of the main concepts and tools on approximate message passing algorithms which will be required for the proof.
Notations - For a given function f: R^{d×k} → R^{d×k}, we write f(U) row-wise as (f_1(U), ..., f_d(U)), where each f_i: R^{d×k} → R^k, and we denote the corresponding k × k Jacobians accordingly. For a given matrix Σ ∈ R^{k×k}, we write Z ∈ R^{d×k} ∼ N(0, Σ ⊗ I_d) to denote that the lines of Z are sampled i.i.d. from N(0, Σ). Note that this is equivalent to saying that Z = Z̃ Σ^{1/2}, where Z̃ ∈ R^{d×k} is an i.i.d. standard normal random matrix. The notation →_P denotes convergence in probability. We start with some definitions that commonly appear in the approximate message-passing literature, see e.g. [28,15]. The main regularity class of functions we will use is that of pseudo-Lipschitz functions, which roughly amounts to functions with polynomially bounded first derivatives. We include the required scaling w.r.t. the dimensions in the definition for convenience.
Definition D.1 (Pseudo-Lipschitz function). For r, k ∈ N* and any n, m ∈ N*, a function φ: R^{n×m} → R^{p×q} is called pseudo-Lipschitz of order r if there exists a constant L(r, k) such that, for any X, Y ∈ R^{n×m}, the increment ‖φ(X) − φ(Y)‖_F is bounded by L(r, k)(1 + ‖X‖_F^{r−1} + ‖Y‖_F^{r−1})‖X − Y‖_F, with the required scaling in the dimensions, where ‖·‖_F denotes the Frobenius norm. Since k will be kept finite, it can be absorbed in any of the constants.
Moreau envelopes and Bregman proximal operators - In our proof, we will also frequently use the notions of Moreau envelopes and proximal operators, see e.g. [29,30]. These elements of convex analysis are often encountered in recent works on high-dimensional asymptotics of convex problems, and a more detailed analysis of their properties can be found for example in [31,9]. For the sake of brevity, we will only sketch the main properties of these mathematical objects, referring to the cited literature for further details. In this proof, we will mainly use proximal operators acting on sets of real matrices endowed with their canonical scalar product. Furthermore, proximals will be defined with matrix-valued parameters in the following way: for a given convex function f: R^{n×k} → R, a given matrix Ω ∈ R^{n×k} and a given symmetric positive definite matrix V ∈ R^{k×k} with bounded spectral norm, we will consider operators of the type

arg min_{Z ∈ R^{n×k}} { f(Z) + ½ Tr[ (Z − Ω) V⁻¹ (Z − Ω)ᵀ ] }.

This operator can also be written as a standard proximal operator by factoring the matrix V⁻¹ into the arguments of the trace.
Approximate message-passing - Approximate message-passing algorithms are a statistical-physics-inspired family of iterations which can be used to solve high-dimensional inference problems [18]. One of the central objects in such algorithms are the so-called state evolution equations, low-dimensional recursion equations which allow to exactly compute the high-dimensional distribution of the iterates of the sequence. In this proof we will use a specific form of matrix-valued approximate message-passing iteration with non-separable non-linearities. In its full generality, the validity of the state evolution equations in this case is an extension of the works of [15] included in [19]. Consider a sequence of Gaussian matrices A(t) ∈ R^{n×d} with i.i.d. Gaussian entries A_ij(t) ∼ N(0, 1/d). For each t ∈ N, consider two sequences of pseudo-Lipschitz update functions, assume the corresponding initial second moment exists, and recursively define the iteration, where the dimensions of the iterates are u^t ∈ R^{d×k} and v^t ∈ R^{n×k} and the terms in brackets are the Onsager correction terms, defined through the Jacobians of the update functions. We then define the state evolution recursion on two sequences of matrices {Σ^{t,s}} and {Q̂^{t,s}}, initialised with Σ^{0,0} = lim_{d→∞} (1/d) E[u⁰ᵀu⁰].

E AMP implementation: channel and prior updates for k = 3

In this appendix we present the expressions of the integrals numerically computed for the implementation of Algorithm 1 in the present case. For a general and detailed derivation of the algorithm see [18,24]. Apart from the update functions f_w(γ, Λ) and f_out(y, ω, V) of Appendix C.1, the approximate message passing algorithm also requires the variance updates. Using the mapping from Appendix A, the updates are computed through (k − 1) × (k − 1) matrices constructed from the derivatives of the denoising functions. Observe that the mapping from Appendix A has allowed us to reduce the number of integrals to be numerically computed at each AMP iteration to three, given by Eq. (E.4) or Eq. (E.7), depending on the one-hot output representation y. These integrals were solved through the integrate.quad module from SciPy [32]. To speed up the integration, we have also used Numba [33] decorators.
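As an illustration of the kind of quadrature involved (the integrand below is a generic schematic, not the actual channel update of Eq. (E.4)/(E.7)): the inner Gaussian integral is performed analytically via the normal CDF, and the remaining one-dimensional integral is handled by scipy.integrate.quad; in the actual implementation, Numba-compiled integrands can further speed this up.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# schematic channel-type integral:
#   I(omega, V) = ∫ dz N(z; omega_1, V_11) * Phi((z - omega_2) / sqrt(V_22)),
# where the inner integral over the second local field has already been done
# analytically (the CDF Phi), so only the outer integral is evaluated numerically.
def channel_integral(omega, V):
    def integrand(z):
        return norm.pdf(z, loc=omega[0], scale=np.sqrt(V[0, 0])) * \
               norm.cdf((z - omega[1]) / np.sqrt(V[1, 1]))
    value, _abs_err = quad(integrand, -np.inf, np.inf)
    return value

omega = np.array([0.3, -0.2])
V = np.array([[1.0, 0.0], [0.0, 0.5]])
print(channel_integral(omega, V))  # a number in (0, 1) for this schematic example
```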

E.2 Prior updates
Gaussian prior - Under the mapping of Appendix A, the prior partition function is written as in Eq. (E.13). The denoising functions for this case are computed numerically, via Monte Carlo sampling of the distribution given in Eq. (E.13).
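A minimal sketch of such a Monte Carlo denoiser follows. The exponential tilt below is the standard AMP prior update written as an assumption (it is not copied from Eq. (E.13)), and `sample_reduced_prior` draws rows of the reduced Rademacher prior of Appendix A as an example.

```python
import numpy as np

def sample_reduced_prior(n_samples, rng):
    """Draw reduced (k-1)-dimensional teacher rows; here the Rademacher case for k = 3:
    w in {-1,+1}^3 i.i.d., mapped to w_bar = (w_1 - w_3, w_2 - w_3)."""
    w = rng.choice([-1.0, 1.0], size=(n_samples, 3))
    return w[:, :2] - w[:, [2]]

def prior_denoiser_mc(gamma, Lambda, n_samples=200_000, rng=None):
    """Monte Carlo estimate of f_w(gamma, Lambda) = E[w t(w)] / E[t(w)] with the
    (assumed) tilt t(w) = exp(gamma^T w - w^T Lambda w / 2) over the reduced prior."""
    rng = np.random.default_rng(rng)
    w = sample_reduced_prior(n_samples, rng)
    log_t = w @ gamma - 0.5 * np.einsum("ni,ij,nj->n", w, Lambda, w)
    t = np.exp(log_t - log_t.max())          # stabilised weights
    return (w * t[:, None]).sum(axis=0) / t.sum()

gamma = np.array([0.2, -0.1])
Lambda = np.array([[0.5, 0.1], [0.1, 0.5]])
print(prior_denoiser_mc(gamma, Lambda, rng=0))
```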
For more details, see the amp folder on GitHub.

F Details on the numerical simulations
In this section we provide some more details on the numerical simulations implemented to test our theory for the learning curves of ERM (Figure 4). The solution of the convex optimisation problem defined in Eq. (2) can be computed by a standard gradient descent algorithm. We ran simulations using the square loss and the cross-entropy loss. The simulations for the cross-entropy loss have been implemented using the LogisticRegression module of the scikit-learn package [21]. The solution for the square loss is analytical. The results from numerical simulations that we show in the figures are averaged over 250 instances of the problem at dimension d = 1000.
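For the square loss, the minimiser of Eq. (2) is indeed available in closed form. The sketch below assumes our normalisation of the model (pre-activations Wᵀx/√d, one-hot targets, and the regularised risk of Eq. (2)); a different normalisation would change the formula accordingly.

```python
import numpy as np

def ridge_erm_solution(X, Y_onehot, lam):
    """Closed-form minimiser of sum_mu ||y_mu - W^T x_mu / sqrt(d)||^2 / 2 + lam ||W||_F^2 / 2."""
    n, d = X.shape
    A = X.T @ X / d + lam * np.eye(d)
    return np.linalg.solve(A, X.T @ Y_onehot / np.sqrt(d))

# small usage example on synthetic teacher-student data
rng = np.random.default_rng(0)
d, k, n = 200, 3, 600
W_star = rng.standard_normal((d, k))
X = rng.standard_normal((n, d))
labels = (X @ W_star / np.sqrt(d)).argmax(axis=1)
Y = np.eye(k)[labels]                      # one-hot encoding of the labels
W_hat = ridge_erm_solution(X, Y, lam=1.0)
print(W_hat.shape)                         # (d, k)
```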
Generalisation error ε_gen as a function of the sample complexity α, evaluated using Eq. (19).
Diagonal (q_00) and anti-diagonal (q_01) entries of the self-overlap 2 × 2 matrix q as a function of the sample complexity α in the Bayes-optimal setting. The full lines mark the fixed points of the saddle-point equations (14), while the symbols represent the result obtained from the AMP algorithm described in Section 3, averaged over 20 runs. The error bars represent the respective standard deviations.

Figure 2: Bayes-optimal performance in the case of Rademacher teacher weights with k = 3 classes.

Figure 3: AMP for Gaussian teacher prior: generalisation error ε_gen as a function of the sample complexity α. The performance of AMP (averaged over 20 runs), computed using the formula in Eq. (19), is marked by the green symbols (error bars are smaller than the symbols). The dashed black line indicates the Bayes-optimal error. The inset displays the diagonal (q_00) and anti-diagonal (q_01) entries of the self-overlap 2 × 2 matrix as a function of the sample complexity α.

Figure 4 panels: (a) Generalisation error ε_gen as a function of the sample complexity α. The black dashed line depicts the Bayes-optimal performance. The full lines display the performance of ERM with cross-entropy loss (blue) and square loss (orange), each computed at optimised ridge regularisation (λ = 0.01 and λ = 1 respectively, see panels (c) and (d)) from the fixed points of Eq. (10). (b) Large-α behaviour: generalisation error ε_gen as a function of the sample complexity α. We plot our theoretical predictions at large α, in log-log scale for visibility purposes. The black dashed line marks the Bayes-optimal error. The symbols represent the error obtained by ERM at fixed regularisation strength λ = 1. (c) Cross-entropy loss: generalisation error ε_gen as a function of the regularisation strength λ, at fixed sample complexity α. Different values of α are depicted with different colours. The curves are the result of numerical simulations performed at dimension d = 1000, averaged over 250 instances. We conclude that for these values of α the optimal λ is close to 0.01. (d) Square loss: generalisation error ε_gen as a function of the regularisation strength λ, at fixed sample complexity α.

Figure 5: Prior reduction: schematic representation of the prior reduction mapping over the multi-class classification problem depicted in Figure 1.

Proof. Owing to the properties of Bregman proximity operators [26,27], the update functions in the AMP iteration Eq. (D.31-D.35) are Lipschitz continuous. Thus, under the assumptions made on the initialisation, the assumptions of Theorem D.2 are verified, which gives the desired result. Lemma 4. Consider iteration Eq. (D.31-D.35), where the parameters Q, Q̂, V, V̂ are initialised at any fixed point of the state evolution equations of Lemma 3. For any sequence initialised with V⁰ = V and u⁰ such that lim_{d→∞} (1/d) u⁰ᵀu⁰ = Q (D.67), the corresponding limit holds.
Algorithm 1 (Approximate message passing). Input: data matrix X ∈ R^{n×d} and label matrix Y ∈ R^{n×k}. Channel updates: mean f_out,μ ∈ R^k and variance ∂f_out,μ, computed as f_out,μ ← f_out(y_μ, ω_μ, V) and ∂f_out,μ ← ∂_ω f_out(y_μ, ω_μ, V). Prior updates: mean ŵ_i ∈ R^k and variance Ĉ_i ∈ R^{k×k}.

[Displaced fragment from the replica computation of Appendix B:] having introduced both the definitions of the overlaps and of the local fields h^a_μ, we can introduce the Fourier representation of the Dirac δ-functions in the prior term and rewrite the corresponding integrals.

[Displaced fragment from Appendix D.3, on Bregman proximal operators [26,27]:] The proximal operator [26] is defined with the Bregman distance induced by the strictly convex, coercive function (for positive definite V) of Eq. (D.95). Many of the usual or similar properties of standard proximal operators (i.e. firm non-expansiveness, link with Moreau/Bregman envelopes, ...) hold for Bregman proximal operators defined with the function (D.95), see e.g. [26,27]. In particular, we will be using the notion equivalent to firmly non-expansive operators for Bregman proximity operators, called D-firm operators. Consider the Bregman proximal defined with a differentiable, strictly convex, coercive function φ: X → R, where X is a given input Hilbert space.