Fluctuations, bias, variance and ensemble of learners: exact asymptotics for convex losses in high-dimension

From the sampling of data to the initialisation of parameters, randomness is ubiquitous in modern Machine Learning practice. Understanding the statistical fluctuations engendered by the different sources of randomness in prediction is therefore key to understanding robust generalisation. In this manuscript we develop a quantitative and rigorous theory for the study of fluctuations in an ensemble of generalised linear models trained on different, but correlated, features in high dimensions. In particular, we provide a complete description of the asymptotic joint distribution of the empirical risk minimiser for generic convex loss and regularisation in the high-dimensional limit. Our result encompasses a rich set of classification and regression tasks, such as the lazy regime of overparametrised neural networks, or equivalently the random features approximation of kernels. While allowing us to study directly the mitigating effect of ensembling (or bagging) on the bias-variance decomposition of the test error, our analysis also helps disentangle the contribution of statistical fluctuations, and the singular role played by the interpolation threshold, that are at the root of the 'double-descent' phenomenon.

Original Content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Randomness is ubiquitous in Machine Learning. It is present in the data (e.g. noise in acquisition and annotation), in commonly used statistical models (e.g. random features (RFs) (Rahimi and Recht 2007)), or in the algorithms used to train them (e.g. in the choice of initialisation of the weights of neural networks (Narkhede et al 2022), or when sampling a mini-batch in Stochastic Gradient Descent (Bottou 2012)). Strikingly, the fluctuations associated with different sources of randomness can have a major impact on the generalisation performance of a model. For instance, this is the case in least-squares regression with RFs, where it has been shown (D'Ascoli et al 2020, Geiger et al 2020, Jacot et al 2020) that the variance associated with the random projection matrix is responsible for poor generalisation near the interpolation peak (Advani and Saxe 2017, Spigler et al 2019, Belkin et al 2020). As a consequence, this double-descent behaviour can be mitigated by averaging over a large ensemble of learners, effectively suppressing this variance. Indeed, considering an ensemble (sometimes also referred to as a committee (Drucker et al 1994)) of independent learners provides a natural framework to study the contribution of the variance of the prediction to the estimation accuracy. In this manuscript we leverage this idea to provide an exact asymptotic characterisation of the statistics of fluctuations in empirical risk minimisation with generic convex losses and penalties in high-dimensional models. We focus on the case of synthetic datasets, and we apply our results to RF learning in particular.
Although the vectors of parameters {ŵ_k}_{k∈[K]} are trained independently, they correlate through the training data. Statistical fluctuations in the learnt parameters can then arise for different reasons. For instance, a common practice is to initialise the parameters randomly during optimisation, which will induce statistical variability between the different predictors. Alternatively, each predictor could be trained on a subsample of the data, as is commonly done in bagging (Breiman 1996). The statistical model can also be inherently stochastic, e.g. the RFs approximation for kernel methods (Rahimi and Recht 2007). Finally, the predictors could also be jointly trained, e.g. coupling them through the loss or penalty as is done in boosting (Schapire 1990).
Our goal in this work is to provide a sharp characterisation of the statistical fluctuations of the ensemble of parameters {ŵ_k}_{k∈[K]} in a particular, mathematically tractable, class of predictors: generalised linear models of the form

$$\hat y(x) = f\!\left(\frac{1}{\sqrt p}\,\hat w_1^\top u_1(x),\ldots,\frac{1}{\sqrt p}\,\hat w_K^\top u_K(x)\right), \qquad (1)$$

where u_k : R^d → R^p, k ∈ [K], is an ensemble of possibly correlated feature maps and f : R^K → Y is an activation function. For most of this work, we discuss the case in which the predictors are independently trained through regularised empirical risk minimisation:

$$\hat w_k = \underset{w\in\mathbb{R}^p}{\arg\min}\;\sum_{\mu=1}^{n}\ell\!\left(y^\mu,\tfrac{1}{\sqrt p}\,w^\top u_k(x^\mu)\right)+\frac{\lambda}{2}\,\|w\|_2^2, \qquad (2)$$

with a convex loss function ℓ : Y × R → R (e.g. the logistic loss) and a ridge penalty whose strength is given by λ ∈ R_+. However, our analysis also includes the case in which the learners are jointly trained with a generic convex penalty; this case will be further discussed in section 4. In what follows we will also concentrate on the RF case, where u_k(x) = ϕ(F_k x) with ϕ : R → R an activation function acting component-wise and F_k ∈ R^{p×d} a family of independently sampled random matrices. Besides being an efficient approximation for kernels (Rahimi and Recht 2007), RFs are often studied as a simple model for neural networks in the lazy and neural tangent kernel regimes of deep neural networks (Jacot et al 2018, Chizat et al 2019), in which case the matrices F_k correspond to different random initialisations of the hidden-layer weights. Moreover, the RF model displays some of the exotic behaviours of high-dimensional overparametrised models, such as double descent (Gerace et al 2020, Mei and Montanari 2021) and benign overfitting (Bartlett et al 2020), therefore providing an ideal playground to study the interplay between fluctuations and overparametrisation. A broader class of feature maps is also discussed in section 4.
To provide an exact characterisation of the statistics of the estimators in equation (2), we shall assume the data is generated from a target

$$y = f_0\!\left(\frac{1}{\sqrt d}\,\theta^\top x\right), \qquad x\sim\mathcal N(0_d, I_d),\quad \theta\sim\mathcal N(0_d,\rho I_d),$$

with f_0 : R → Y and I_d the d-dimensional identity matrix. The dataset is then constructed by generating n i.i.d. vectors x^µ ∼ N(0_d, I_d), µ ∈ [n], together with the corresponding labels y^µ. An illustration summarising the setting considered here is given in figure 1. Note that such an architecture can be interpreted as a two-layer tree neural network, also known in some contexts as the tree-committee or parity machine (Schwarze and Hertz 1992).
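To fix ideas, here is a minimal numerical sketch of the setting above, with a linear target f_0(x) = x, activation ϕ = erf and ridge regression as in equation (2). All sizes, the regularisation strength and the 1/√d scaling inside the feature map are illustrative assumptions, not the values used in our figures.

```python
# Minimal sketch of the ensemble-of-RF setting (illustrative parameters only).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(0)
d, p, n, K, lam, rho = 100, 200, 400, 4, 1e-3, 1.0

theta = rng.normal(0.0, np.sqrt(rho), size=d)          # teacher weights
X = rng.normal(size=(n, d))                            # inputs x^mu ~ N(0, I_d)
y = X @ theta / np.sqrt(d)                             # linear target f_0(x) = x

F = [rng.normal(size=(p, d)) for _ in range(K)]        # independent RF matrices
U = [erf(X @ Fk.T / np.sqrt(d)) for Fk in F]           # features u_k(x) = phi(F_k x / sqrt(d))

# each learner: ridge regression on its own features (same data, shared labels)
W = [np.linalg.solve(Uk.T @ Uk / p + lam * np.eye(p), Uk.T @ y / np.sqrt(p)) for Uk in U]

# ensemble prediction on a fresh test set: average of the K output fields
X_test = rng.normal(size=(1000, d))
y_test = X_test @ theta / np.sqrt(d)
preds = np.mean([erf(X_test @ Fk.T / np.sqrt(d)) @ wk / np.sqrt(p)
                 for Fk, wk in zip(F, W)], axis=0)
print("ensemble test MSE:", np.mean((y_test - preds) ** 2))
```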

Main contributions.
The results in this manuscript can be listed as follows.
• We provide a sharp asymptotic characterisation of the joint statistics of the ensemble of empirical risk minimisers {ŵ_k}_{k∈[K]} in the high-dimensional limit where p, n → +∞ with n/p kept constant, for any convex loss and penalty. In particular, we show that the pre-activations {ŵ_k^⊤ u_k}_{k∈[K]} are jointly Gaussian, with sufficient statistics obeying a set of explicit closed-form equations. Note that the analysis of ensembling with non-square losses is out of the grasp of the most commonly adopted theoretical tools (e.g. random matrix theory). Therefore, our proof method, based on recent progress on approximate message passing (AMP) techniques (Javanmard and Montanari 2013, Berthier et al 2020, Gerbelot and Berthier 2021), is of independent interest. Different versions of our theorem are discussed throughout the manuscript: first, in section 2 for the particular case of independently trained learners on RFs (theorem 1); later, in section 4 for the general case of jointly trained learners on correlated Gaussian covariates (theorem 2).
• We discuss the role played by fluctuations in the non-monotonic behaviour of the generalisation performance of interpolators (a.k.a. the double-descent behaviour).
In particular, as discussed in Geiger et al (2020) and d'Ascoli et al (2021) for the ridge case, the interpolation peak arises from the model overfitting the particular realisation of the random weights. We show that the test error can be decomposed as ϵ_g(K = 1) = ϵ̄_g + δϵ_g in terms of a fluctuation-free term ϵ̄_g and a fluctuation term δϵ_g responsible for the double-descent behaviour; see figure 2 for the case of max-margin classification.
• In the context of classification, we discuss how majority vote and score averaging, two popular ensembling procedures, compare in terms of generalisation performance.
More specifically, we show that in the setting we study, score averaging consistently outperforms the majority vote predictor. However, for a large number of learners K ≫ 1 the two predictors agree, see figure 5(right).
• Finally, we discuss how ensembling can be used as a tool for uncertainty quantification.
In particular, we connect the correlation between two learners to their probability of disagreement, and show that it decreases with overparametrisation, see figure 5(centre). We provide a full characterisation of the joint probability density of the confidence scores of two independent learners, see figure 5(left).
Related works. The idea of reducing the variance of a predictor by averaging over independent learners is quite old in Machine Learning (Hansen and Salamon 1990, Perrone and Cooper 1993, Perrone 1994, Krogh and Vedelsby 1995), and an early asymptotic analysis of the regression case was given in Krogh and Sollich (1997). A variety of methods to combine an ensemble of learners have appeared in the literature (Opitz and Maclin 1999). In a very inspiring work, Geiger et al (2020) carried out an extensive series of experiments in order to shed light on the generalisation properties of neural networks, and reported many observations and empirical arguments about the role played in the double-descent curve by the variance due to the random initialisation of the weights, using an ensemble of learners. This was a major motivation for the present work.
Closest to our setting are the works of Neal et al (2018), D'Ascoli et al (2020) and Jacot et al (2020), which disentangle the various sources of variance in the process of training deep neural networks. Indeed, here we adopt the model defined by D'Ascoli et al (2020), and provide a rigorous justification of their results for the case of ridge regression. A slightly finer decomposition of the variance in terms of the different sources of randomness in the problem was later proposed by Adlam and Pennington (2020a). Lin and Dobriban (2021) show that such a decomposition is not unique, and can be more generally understood from the point of view of the analysis-of-variance framework. Interestingly, subsequent papers were able to identify a series of triple (and higher-order) descents, e.g. Chen et al (2020), Adlam and Pennington (2020b), d'Ascoli et al (2021).
The RF model was introduced in the seminal work of Rahimi and Recht (2007) as an efficient approximation for kernel methods. Drawing on early ideas of Karoui (2010), Pennington and Worah (2017) showed that the empirical distribution of the Gram matrix of RFs is asymptotically equivalent to that of a linear model with matched second statistics, and in this way characterised memorisation with RF regression. The learning problem was first analysed by Mei and Montanari (2021), who provided an exact asymptotic characterisation of the training and generalisation errors of RF regression. This analysis was later extended to generic convex losses by Gerace et al (2020) using the heuristic replica method, and later proved by Dhifallah and Lu (2020) using convex Gaussian inequalities.
The aforementioned asymptotic equivalence between the RF model and a Gaussian model with matched moments has been named the Gaussian Equivalence Principle (GEP) (Goldt et al 2020). Rigorous proofs in the memorisation and learning setting with square loss appeared in Pennington and Worah (2017), Mei and Montanari (2021), and for general convex penalties in Hu and Lu (2020), Goldt et al (2021). Goldt et al (2021) and Loureiro et al (2021b) provided extensive numerical evidence that the GEP holds for more generic feature maps, including features stemming from trained neural networks.
Most of the previously mentioned works deriving exact asymptotics for the RF model in the proportional limit use either random matrix theory techniques or convex Gaussian inequalities. While these tools have been recently used in many different contexts, they ultimately fall short when considering an ensemble of predictors with generic convex loss and regularisation, along with structured design matrices. Therefore, to prove the results herein we employ an AMP proof technique (Bayati and Montanari 2011a, Donoho and Montanari 2016), leveraging recent progress in Gerbelot and Berthier (2021), Loureiro et al (2021b), which enables us to capture the full complexity of the problem and obtain the asymptotic joint distribution of the ensemble of predictors. LeJeune et al (2020) study ensembles of ordinary least-squares predictors learned from subsamples of a common data matrix, and show their equivalence to an implicit ridge regularisation.

Learning with an ensemble of random features
In this section we give a first formulation of our main result, namely the exact asymptotic characterisation of the statistics of the ensembling estimator introduced in equation (1).
We prove that, in the proportional high-dimensional limit, the statistics of the arguments of the activation function in equation (1) are simply given by a multivariate Gaussian, whose covariance matrix we can completely specify. This result holds for any convex loss, any convex regularisation, and for all models of generative networks u_k : R^d → R^p, as we will show in full generality in section 4. However, for simplicity, in this section and in the following we focus on the setting described in section 1, in which the statistician averages over an independent ensemble of RFs, i.e. u_k(x) = ϕ(F_k x). In this case, our result can be formulated as follows.

Theorem 1 (simplified version). Assume that in the high-dimensional limit where d, p, n → +∞ with α := n/p and γ := d/p kept Θ(1) constants, the Wishart matrix F F^⊤ has a well-defined asymptotic spectral distribution. Then in this limit, for any pseudo-Lipschitz function of order 2, φ : R × R^K → R, we have

$$\mathbb E_{(x,y)}\Big[\varphi\Big(y,\tfrac{1}{\sqrt p}\,\hat w_1^\top u_1(x),\ldots,\tfrac{1}{\sqrt p}\,\hat w_K^\top u_K(x)\Big)\Big]\;\longrightarrow\;\mathbb E_{(\nu,\mu)}\big[\varphi\big(f_0(\nu),\mu\big)\big],$$

where (ν, µ) ∈ R^{1+K} is a zero-mean Gaussian vector with covariance

$$\Sigma=\begin{pmatrix}\rho & m\,1_K^\top\\ m\,1_K & (q_0-q_1)\,I_K+q_1\,1_{K,K}\end{pmatrix},$$

with 1_{K,K} ∈ R^{K×K} and 1_K ∈ R^K a matrix and a vector of ones, respectively. The entries of Σ are solutions of a set of self-consistent equations given in corollary 4.
As discussed in the introduction, the asymptotic statistics of a single learner has been studied in Dhifallah and Lu (2020), Gerace et al (2020), Loureiro et al (2021b). Their result amounts to the analysis of the estimator solving the empirical risk minimisation problem in equation (2), and it is recovered by setting K = 1 in the theorem above.
However, this result alone is not enough to quantify the correlation between different learners induced by training on the same dataset, which is required to compute, e.g., the test error associated with an ensembling predictor as in equation (1). For example, in the simple case where f_0(u) = u and f(v) = (1/K) Σ_k v_k, the mean-squared error on the labels is given by ϵ_g = E_{(x,y)}[(y − ŷ(x))²] = ρ + (q_0 − q_1)K^{−1} + q_1 − 2m, which crucially depends on the average correlation between two independent learners, q_1 := (1/p) E[ŵ_1^⊤ ŵ_2]. Our main result is precisely an exact asymptotic characterisation of this correlation in the proportional limit of the previous theorem. Once m, q_0 and q_1 have been determined, the generalisation error can be computed as ϵ_g = E_{(ν,µ)}[Δ(f_0(ν), f(µ))] for any error measure Δ : Y × Y → R_+.
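The decomposition above can be checked numerically. The sketch below trains K ridge learners on shared data with independent RF maps, estimates ρ, m, q_0, q_1 from a large test set, and compares ϵ_g = ρ + (q_0 − q_1)/K + q_1 − 2m with the directly measured mean-squared error; the sizes, the activation ϕ = erf and the 1/√d scaling inside the features are illustrative assumptions.

```python
# Hedged numerical check of eps_g = rho + (q0 - q1)/K + q1 - 2m (illustrative parameters).
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(1)
d, p, n, K, lam = 100, 150, 300, 5, 1e-2

theta = rng.normal(size=d)
X, X_test = rng.normal(size=(n, d)), rng.normal(size=(20000, d))
y = X @ theta / np.sqrt(d)

F = [rng.normal(size=(p, d)) for _ in range(K)]
feats = lambda A, Fk: erf(A @ Fk.T / np.sqrt(d))
W = [np.linalg.solve(feats(X, Fk).T @ feats(X, Fk) / p + lam * np.eye(p),
                     feats(X, Fk).T @ y / np.sqrt(p)) for Fk in F]

nu = X_test @ theta / np.sqrt(d)                        # teacher field
mu = np.stack([feats(X_test, Fk) @ wk / np.sqrt(p) for Fk, wk in zip(F, W)])

rho, m = np.mean(nu ** 2), np.mean(mu[0] * nu)          # overlaps estimated on one learner
q0, q1 = np.mean(mu[0] ** 2), np.mean(mu[0] * mu[1])    # and one pair of learners
mse_formula = rho + (q0 - q1) / K + q1 - 2 * m
mse_direct = np.mean((nu - mu.mean(axis=0)) ** 2)
print(mse_formula, mse_direct)                          # the two should roughly agree
```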
Consider now the limit of a large ensemble, K → +∞, with labels generated as y = f_0(ν) for some f_0 : R → Y, the activation function of the single learner. In this case we can introduce the random variable µ̄ := lim_{K→+∞} (1/K) Σ_k µ_k, whose joint distribution with ν formally coincides with the joint distribution of the activation fields for K = 1 (Gerace et al 2020), but with q_0 replaced by q_1 ⩽ q_0. The smaller variance is due to the fact that the fluctuations of the activation fields are averaged out by the ensembling process. The test error in the K → +∞ limit is then ϵ̄_g := E_{(ν,µ̄)}[Δ(f_0(ν), f(µ̄))], so that the fluctuation contribution to the test error for K = 1 can be defined as δϵ_g := ϵ_g(K = 1) − ϵ̄_g. The term δϵ_g is, by definition, the contribution suppressed by ensembling, and corresponds to the ambiguity introduced by Krogh and Vedelsby (1995) for the square loss. This contribution expresses the variance in the ensemble, and is responsible for the non-monotonic behaviour of the test error of interpolators, also known as double descent.

Applications
We now consider two relevant examples of separable losses, namely the square (ridge) loss and the logistic loss. In both cases, it is possible to derive an explicit expression for the training loss and the generalisation error in terms of the elements of the correlation matrix introduced above.

Ridge regression
If we assume f_0(x) = x and a quadratic loss of the type ℓ(y, x) = ½(y − x)², it is possible to write down simple recursive equations for m, q_0 and q_1 (see appendix A.3.2). Taking Δ(y, ŷ) = (y − ŷ)², the generalisation error is easily computed as ϵ_g = ρ + (q_0 − q_1)/K + q_1 − 2m. Note that in this case the λ → 0^+ limit gives the minimum ℓ_2-norm interpolator. In figure 3 we compare our theoretical prediction with numerical results for λ = 10^{−6}. The divergence of the test error at the interpolation threshold is only due to the divergence of q_0, whereas the contribution ϵ̄_g, which is independent of q_0, is smooth everywhere. Alongside the interpolation divergence, δϵ_g = q_0 − q_1 has an additional bump at p/n = d/n, which corresponds to the 'linear peak' discussed by d'Ascoli et al (2021).
In the plot we also present the so-called kernel limit, corresponding to the limit n/p = α → 0 at fixed n/d. An explicit manipulation (see appendix A.3.2) shows that q_1 = q_0 ≡ q in this limit. This implies that in the kernel limit ϵ_g^k does not depend on K, being equal to ϵ_g^k ≡ ρ + q − 2m. The generalisation error obtained in the kernel limit coincides with ϵ̄_g for p > n: this is expected, as in ϵ̄_g the fluctuations amongst learners are averaged out, effectively recovering the cost obtained in the case of an infinite number of parameters.
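A minimal sketch of the λ → 0^+ behaviour discussed above: the minimum-ℓ_2-norm interpolator is computed through a pseudo-inverse for several values of p around the interpolation threshold p = n, for a single learner and for an ensemble of K learners. All sizes are illustrative assumptions, and at these small dimensions the peak is only indicative.

```python
# Sketch of the lambda -> 0+ (minimum l2-norm) RF regressor and of the mitigation of the
# interpolation peak by ensembling; parameter values are illustrative assumptions.
import numpy as np
from scipy.special import erf

rng = np.random.default_rng(2)
d, n, K = 50, 100, 10
theta = rng.normal(size=d)
X, Xt = rng.normal(size=(n, d)), rng.normal(size=(5000, d))
y, yt = X @ theta / np.sqrt(d), Xt @ theta / np.sqrt(d)

for p in (50, 90, 100, 110, 200, 400):                  # sweep around the peak p = n
    preds = []
    for _ in range(K):
        F = rng.normal(size=(p, d))
        U, Ut = erf(X @ F.T / np.sqrt(d)), erf(Xt @ F.T / np.sqrt(d))
        w = np.linalg.pinv(U / np.sqrt(p)) @ y          # least l2-norm solution
        preds.append(Ut @ w / np.sqrt(p))
    single = np.mean((yt - preds[0]) ** 2)              # K = 1: peaks near p = n
    ens = np.mean((yt - np.mean(preds, axis=0)) ** 2)   # K = 10: peak is suppressed
    print(p, round(single, 3), round(ens, 3))
```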

Binary classification
Suppose now that we are considering a classification task, with Y = {−1, 1}. For this task we consider f_0(x) = sign(x). A popular choice of loss for this classification task is the logistic loss ℓ(y, x) = ln(1 + e^{−yx}), although other choices, e.g. the hinge loss, can be considered. Since both the logistic and hinge losses depend only on the margin y w^⊤u, the empirical risk minimiser for λ → 0^+ in both cases gives the max-margin interpolator (Rosset et al 2004). We write down the explicit saddle-point equations associated with the logistic and hinge losses in appendix A.3.3, but we will focus our attention on the logistic case for the sake of brevity. For this choice of the loss, we obtained the values of m, q_0 and q_1 shown in figure 4. Using these values, a number of relevant questions can be addressed.

Alignment of learners
Assuming that the predictor of learner k is ŷ_k(x) = sign(ŵ_k^⊤ u_k(x)), in figure 5(centre) we estimate the probability that two learners give opposite classifications. This is analytically given by

$$\mathbb P\left[\hat y_1(x)\neq \hat y_2(x)\right]=\frac{1}{\pi}\arccos\!\left(\frac{q_1}{q_0}\right).$$

Note that by definition the ratio q_1/q_0 is a cosine similarity between two learners in the norm induced by the feature space. Therefore, this provides an interesting interpretation of these sufficient statistics in terms of the probability of disagreement.
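The disagreement probability only involves the sufficient statistics q_0 and q_1, and can be checked directly within the asymptotic Gaussian description; the following Monte Carlo sketch uses arbitrary illustrative values of q_0 and q_1.

```python
# Monte Carlo check that two zero-mean jointly Gaussian fields with variance q0 and
# covariance q1 disagree in sign with probability arccos(q1/q0)/pi (illustrative values).
import numpy as np

rng = np.random.default_rng(3)
q0, q1 = 1.3, 0.8                                       # assumed overlap values
cov = [[q0, q1], [q1, q0]]
mu = rng.multivariate_normal([0.0, 0.0], cov, size=1_000_000)
p_mc = np.mean(np.sign(mu[:, 0]) != np.sign(mu[:, 1]))
p_th = np.arccos(q1 / q0) / np.pi
print(p_mc, p_th)                                       # the two should match closely
```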
In particular, as illustrated in figure 5(centre), overparametrisation promotes agreement between the learners, therefore suppressing uncertainty. More generally, ensembling can be used as a technique for uncertainty estimation (Lakshminarayanan et al 2017). In the context of logistic regression, the pre-activation of the sign function is often interpreted as a confidence score. Indeed, introducing the logistic function σ(x) = (1 + e^{−x})^{−1}, the quantity φ_k(x) = σ(p^{−1/2} ŵ_k^⊤ u_k(x)) expresses the confidence of the kth classifier in associating ŷ = 1 to the input x. Therefore, it is reasonable to ask how reliable the logistic score is as a confidence measure. For instance, what is the variance of the confidence among different learners? This can be quantified by the joint probability density of the confidence scores of two independent learners, which can be readily computed using our theorem 1. Figure 5(left) shows one example at fixed p/n and vanishing λ.

Ensemble predictors
In the previous two points, we discussed how ensembling can be used as a tool to quantify fluctuations. However, ensembling methods are also used in practical settings in order to mitigate fluctuations, e.g. Breiman (1996). An important question in this context is: given an ensemble of predictors {ŵ_k}_{k∈[K]}, what is the best way of combining them to produce a point estimate? In our setting, this amounts to choosing the function f : R^K → Y. Let us consider two popular choices for the estimator f in equation (1) used in practice:

$$\text{(a)}\quad \hat y(x)=\mathrm{sign}\!\left(\frac{1}{K}\sum_{k=1}^K \frac{\hat w_k^\top u_k(x)}{\sqrt p}\right),\qquad\quad \text{(b)}\quad \hat y(x)=\mathrm{sign}\!\left(\frac{1}{K}\sum_{k=1}^K \mathrm{sign}\!\left(\frac{\hat w_k^\top u_k(x)}{\sqrt p}\right)\right).$$

In a sense, (a) provides an estimator based on the average of the output fields, whereas (b), which corresponds to a majority rule if K is odd (Hansen and Salamon 1990), is a function of the average of the estimators of the single learners. For both choices of the estimator we use the misclassification error Δ(y, ŷ) = 1 − δ_{ŷ,y} to measure the test error. In figure 5(right) we compare the test error obtained using (a) and (b) for K = 3 with vanishing regularisation λ = 10^{−4}. It is observed that the estimator (a) performs better than the estimator (b). As previously discussed, in this case logistic regression is equivalent to max-margin estimation, and the superiority of (a) can be intuitively understood in terms of a robust max-margin estimation obtained by averaging the margins associated with different draws of the RFs. In case (a) it is easy to show that the generalisation error takes the form

$$\epsilon_g=\frac{1}{\pi}\arccos\!\left(\frac{m}{\sqrt{\rho\,\bar q_K}}\right),\qquad \bar q_K:=\frac{q_0+(K-1)\,q_1}{K}.$$

This formula is in agreement with numerical experiments, see figure 2(left). Unfortunately, we did not find a similar closed-form expression in case (b). However, we can observe that in the K → +∞ limit the generalisation error in case (a) coincides with the generalisation error in case (b), see figure 2(right). By comparing with the results in figure 5(centre), it is evident that the benefit of ensembling in reducing the test error correlates with the tendency of learners to disagree, i.e. for small values of p/n, as stressed by Krogh and Vedelsby (1995). Finally, we observe a constant value of ϵ̄_g beyond the interpolation threshold, consistently with the numerical results of Geiger et al (2020).
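The comparison between (a) and (b) can also be reproduced directly within the asymptotic Gaussian model of theorem 1, sampling (ν, µ) from the covariance Σ; the overlap values below are arbitrary illustrative numbers, not solutions of the saddle-point equations.

```python
# Sketch comparing score averaging (a) and majority vote (b) for K = 3 within the
# asymptotic Gaussian model; the overlaps rho, m, q0, q1 are assumed values.
import numpy as np

rng = np.random.default_rng(4)
K, rho, m, q0, q1 = 3, 1.0, 0.45, 1.0, 0.6
Sigma = np.zeros((K + 1, K + 1))
Sigma[0, 0] = rho
Sigma[0, 1:] = Sigma[1:, 0] = m
Sigma[1:, 1:] = (q0 - q1) * np.eye(K) + q1 * np.ones((K, K))

z = rng.multivariate_normal(np.zeros(K + 1), Sigma, size=1_000_000)
y, mu = np.sign(z[:, 0]), z[:, 1:]
err_score = np.mean(np.sign(mu.mean(axis=1)) != y)          # (a) average of the fields
err_vote = np.mean(np.sign(np.sign(mu).sum(axis=1)) != y)   # (b) majority vote
print(err_score, err_vote)                                  # (a) is typically smaller
```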

The case of general loss and regularisation
In this section we generalise the results of section 2, relaxing the hypotheses on the loss, on the regularisation and on the properties of the feature maps. In the general setting we are going to consider, we denote by P_y^0(y|x) the probabilistic law by which y is generated. For example, in section 2, P_y^0(y|x) = δ(y − f_0(x)). In the treatment given here, we allow for more general cases (e.g. the presence of noise in the label generation). We make no assumptions on the generative networks u_k: denoting by U(x) ∈ R^{p×K} the matrix having the u_k(x) as concatenated columns, the information about the first layer is contained in the population tensors built from the second moments of U(x) and its cross-moments with the input (the tensors Θ and Ω used in appendix A.3). We aim at learning a rule as in equation (1), adopting a general convex loss l : Y × R^K → R, so that the weights are estimated as

$$\hat W=\underset{W\in\mathbb R^{p\times K}}{\arg\min}\;\sum_{\mu=1}^n l\!\left(y^\mu,\Big\{\tfrac{1}{\sqrt p}\,w_k^\top u_k(x^\mu)\Big\}_{k\in[K]}\right)+r(W), \qquad (18)$$

where r : R^{p×K} → R is a convex regularisation, U^µ ≡ U(x^µ), and Ŵ ∈ R^{p×K} is the matrix of the concatenated columns {ŵ_k}. Here, since the optimisation problem defining the estimator may not be strictly convex, the solution may not be unique. We then denote by Ŵ the unique least-ℓ_2-norm solution of equation (18).
In the most general case, the statistical properties of Ŵ are captured by a finite set of finite-dimensional order parameters. Before stating the result we fix some notation: given two matrices X, Y ∈ R^{p×K} and a tensor B ∈ R^{p×p} ⊗ R^{K×K}, we will use a double-bracket notation for the contraction of X and Y against B over their p-dimensional (upper) indices, yielding a K × K matrix. We can now state our general result.
Theorem 2. Let ξ ∈ R^K and Ξ ∈ R^{K×K} be random quantities with entries distributed as N(0, 1), and consider the high-dimensional limit where d, p, n → +∞ with α := n/p and γ := d/p kept Θ(1) constants. Then in this limit, for any pseudo-Lipschitz functions of order 2, φ_1 : R × R^K → R and φ_2 : R^{K×p} → R, the estimator Ŵ satisfies the asymptotic characterisation stated below, where U ≡ U(x) and (ν, µ) ∈ R^{1+K} are jointly Gaussian random variables with zero mean and covariance matrix expressed in terms of the order parameters, and where we have introduced the proximal operators for the loss and for the regularisation, an auxiliary function, and the associated scalar quantities. The order parameters satisfy the saddle-point equations (24) and (25), in which we use the short-hand notation f := V^{−1}(h − ω).
In the theorem above, the contractions involving tensors run over Latin indices only. Equations (24) are typically called channel equations, because they depend on the form of the loss l. Equations (25), instead, are usually called prior equations, because of their dependence on the prior, i.e. r. In the following corollary, we specify their expression for a ridge regularisation.

Corollary 3 (ridge regularisation). In the hypotheses of theorem 2, if r(W) = ½‖W‖_F², then the prior equations take an explicit closed form, expressed in terms of an auxiliary tensor.

The random feature case and the kernel limit
Theorem 2 is given in a very general setting and, in particular, no assumptions are made on the features u_k. We anticipated in section 2 that, in the case of RFs, the structure of the order parameters simplifies considerably, and the covariance matrix Σ is fully specified by only three scalar order parameters for any K > 1. Here we therefore adapt theorem 2 to the RF setting of section 2, using the notation introduced therein. The purpose of this section is to explicitly present the self-consistent equations that are required to produce the results given in the paper.
Corollary 4. Assume that in the high-dimensional limit where d, p, n → +∞ with α := n/p and γ := d/p kept Θ(1) constants, the Wishart matrix F F^⊤ has a well-defined asymptotic spectral distribution. Then in this limit, for any pseudo-Lipschitz function of finite order φ : R × R^K → R, the estimator Ŵ verifies the convergence stated in theorem 1, where (ν, µ) ∈ R^{K+1} is a jointly Gaussian vector with covariance

$$\Sigma=\begin{pmatrix}\rho & m\,1_K^\top\\ m\,1_K & Q\end{pmatrix},\qquad Q := (q_0 - q_1)\,I_K + q_1\,1_{K,K}.$$

The collection of parameters (q_0, q_1, m) is obtained by solving a set of fixed-point equations involving the auxiliary variables (q̂_0, q̂_1, m̂, v, v̂), in which ω and ω̂ are two correlated Gaussian random variables of zero mean. Finally, ϱ(s) is the asymptotic spectral density of the features covariance matrix, a weighted combination of F F^⊤ and the identity I_p with coefficients κ_1 and κ_* depending on the nonlinearity ϕ (see appendix A.3.1). The previous corollary recovers the results of Dhifallah and Lu (2020), Gerace et al (2020) and Loureiro et al (2021b) when restricted to the K = 1 case by marginalisation.
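The coefficients κ_0, κ_1, κ_* are not reproduced explicitly here. The following sketch computes them under the standard Gaussian-equivalence definitions κ_0 = E[ϕ(z)], κ_1 = E[zϕ(z)], κ_*² = E[ϕ(z)²] − κ_0² − κ_1² with z ∼ N(0, 1), an assumption borrowed from the GEP literature (Mei and Montanari 2021) rather than a restatement of the original display, for the activation ϕ = erf used in our figures.

```python
# Gaussian-equivalence coefficients (assumed definitions) computed by Gauss-Hermite quadrature.
import numpy as np
from numpy.polynomial.hermite_e import hermegauss
from scipy.special import erf

nodes, weights = hermegauss(80)                         # probabilists' Gauss-Hermite rule
weights = weights / np.sqrt(2 * np.pi)                  # normalise to the N(0,1) measure
gauss = lambda f: np.sum(weights * f(nodes))            # E[f(z)], z ~ N(0,1)

phi = erf
k0 = gauss(phi)
k1 = gauss(lambda z: z * phi(z))
k_star = np.sqrt(gauss(lambda z: phi(z) ** 2) - k0 ** 2 - k1 ** 2)
print(k0, k1, k_star)                                   # for erf: k0 = 0, k1 ~ 0.651
```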

The kernel limit
The so-called kernel limit is obtained by taking the limit of an infinite number of parameters, γ → 0 (i.e. p ≫ d and p ≫ n), at fixed ratio α/γ = n/d. To balance the loss term and the regularisation it is convenient to rescale λ → αλ. The saddle-point equations simplify in this special limit after a suitable rescaling of the auxiliary variables. The p → +∞ limit in the prior equations depends on the spectral density ϱ(s). For example, if F has random Gaussian entries with zero mean and unit variance, then ϱ(s) is a shifted Marchenko-Pastur distribution. By means of a series of algebraic manipulations, we obtain in the end, at first order in γ, the equations that complete our set for the kernel limit.
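As a sanity check of the Marchenko-Pastur statement, the following sketch compares the empirical spectrum of the Wishart matrix F F^⊤/d with the Marchenko-Pastur density of ratio c = p/d. It takes p ⩽ d so that there is no atom at zero, and it omits the shift and scale by κ_* and κ_1 discussed above, so it only illustrates the Gaussian-F case.

```python
# Empirical spectrum of F F^T / d versus the Marchenko-Pastur density (Gaussian F, p <= d).
import numpy as np

rng = np.random.default_rng(5)
p, d = 500, 1000
c = p / d
F = rng.normal(size=(p, d))
eigs = np.linalg.eigvalsh(F @ F.T / d)

lam_minus, lam_plus = (1 - np.sqrt(c)) ** 2, (1 + np.sqrt(c)) ** 2
grid = np.linspace(lam_minus + 1e-6, lam_plus - 1e-6, 200)
mp = np.sqrt((lam_plus - grid) * (grid - lam_minus)) / (2 * np.pi * c * grid)

hist, edges = np.histogram(eigs, bins=40, range=(lam_minus, lam_plus), density=True)
centres = 0.5 * (edges[:-1] + edges[1:])
print(np.max(np.abs(hist - np.interp(centres, grid, mp))))  # discrepancy shrinks as p, d grow
```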
In other words, the double brackets ⟪ · ⟫ express the contraction of the upper indices only. We will adopt the same notation in the simple K = 1 case.

A.2. The replica computation
The replica computation relies on the treatment of a Gibbs measure over the weights W which concentrates on the weights Ŵ minimising the loss l when a fictitious 'inverse temperature' parameter β is sent to infinity. Such a measure is proportional to the prior on the weights, P_w(W) = e^{−βλ r(W)}, times the exponentiated loss over the training set. For each µ, the label y^µ has distribution P_y^0(y | d^{−1/2} θ^⊤ x^µ) for some fixed θ ∼ N(0_d, ρ I_d). The array of features U^µ, instead, is obtained as a function of the vector x^µ via a law U : R^d → R^{p×K}, such that U^µ := U(x^µ) ∈ R^{p×K}. As we will show below, the tensors introduced above incorporate the information about the action of the law U. Denoting by Z(β) the normalisation of the Gibbs measure, we proceed by computing the free entropy Φ := E_{(y^µ,x^µ)_µ}[ln Z(β)] using the replica trick, i.e. the identity E[ln Z] = lim_{s→0} ∂_s ln E[Z^s]. Denoting by µ^a ≡ (µ_k^a)_{k∈[K]} the activation fields of the replicated systems, we apply the GEP (Goldt et al 2021), i.e. we assume that P(ν, µ) is a Gaussian whose covariance matrix is parametrised by the overlaps m^a ∈ R^K and Q^{ab} ∈ R^{K×K}. Absorbing the factor −i in the integrals, and denoting n/p = α and d/p = γ, the averaged replicated partition function splits into a channel term, involving ∫ dµ^a P_y(y|µ^a) P(ν, µ), and a prior term, so that in the high-dimensional limit the desired average is the extremum of the functional (1/s)Φ^{(s)} in the s → 0 limit.

A.2.1. Replica symmetric ansatz. In order to take the limit, let us assume as usual a replica-symmetric (RS) ansatz. Observe that lim_{s→0} Φ^{(s)} = 0 by construction, meaning that the conjugate of ρ vanishes, fixing ρ. Before proceeding further, we note that the matrix Q in the RS ansatz can be written in terms of 1_s, the column vector of s elements equal to 1. Following steps similar to the
ones detailed, e.g., in Loureiro et al (2021b), we obtain the channel contribution to the free entropy. Similarly, defining V = R + Q, we can write down the prior channel after performing a Hubbard-Stratonovich transformation. The free entropy is then the sum of the two contributions. We are interested in the extremum of this quantity, and therefore we have to find the order parameters that maximise it by means of a set of saddle-point equations. Defining the proximal quantities for brevity,
a first set of saddle-point equations follows by extremising the free entropy.

A.2.2. Zero-temperature state evolution equations. To obtain a nontrivial β → +∞ limit we rescale the order parameters with appropriate powers of β. After this change of variables, equations (79) remain formally identical. To complete the set of saddle-point equations, let us observe that, after the rescaling, the remaining saddle-point equations keep the form of equation (68), now expressed in terms of the β → +∞ limits of the channel and prior potentials.
As a result, the remaining saddle-point equations follow. In the first equation, the derivative produces in general a 6-index tensor; in the corresponding formula, the contractions involve Latin indices only.
A.2.3. The case of ℓ_2 regularisation. If we assume an ℓ_2 regularisation r(W) = ½‖W‖_F², then Ψ_w is a Gaussian integral that can be computed explicitly before the rescaling in β, yielding the saddle-point equations for V, Q and m.

A.2.4. Training loss and generalisation error. The order parameters introduced to solve the problem allow us to reach our ultimate goal of computing the average errors of the learning process, where h is the proximal introduced above and all overlaps are to be understood as computed at the fixed point.

A.3. Separable loss with ridge regularisation
Let us focus now on the case of separable losses, i.e. losses of the form l(y, u) = Σ_k ℓ(y, u_k), which is a crucial special case in the analysis of our contribution. We will assume a ridge regularisation r(W) = ½‖W‖_F². Let us also assume that the K generative networks are statistically equivalent. This implies a specific structure in the tensors Θ and Ω, where equalities between blocks hold in distribution. Observe that Ω_kk and Ω_kk′ are not uncorrelated quantities. For symmetry reasons, we therefore impose a symmetric ansatz on the order parameters.
Plugging this ansatz into our equations, and introducing η = q_1/q_0, we obtain a reduced set of equations. The new variables ζ_1 and ζ_2 are obtained by a linear transformation of the old ones: they are distributed as two components of a vector built from ξ ∼ N(0_K, I_K). It follows that ζ_1, ζ_2 ∼ N(0, 1), but they are correlated. Moreover, we have introduced the proximal h and the corresponding f obtained using ζ. The remaining equations then follow.
A.3.1. The random-features model for the generative networks. To further simplify these expressions, suppose now that our generative networks are RF maps, u_k(x) = ϕ(F_k x), where F_k ∈ R^{p×d} are (fixed) random matrices extracted from some given distribution and ϕ is a nonlinearity acting elementwise. As anticipated in the main text, we can use the fact that each generative network is equivalent to a Gaussian model with matched moments (Mei and Montanari 2021), for some coefficients κ_0, κ_1 and κ_* depending on ϕ (see theorem 4), and z^µ ∼ N(0_p, I_p). Restricting now, for the sake of simplicity, to the κ_0 = 0 case and assuming that E_θ[θθ^⊤] = I_d, then, once the spectral density ϱ(s) of Ω is introduced, it is immediate to see that the equations for q_0, m and v take the forms given in the main text. The equation for q_1 requires an additional step: introducing a suitable symmetric random matrix, we can rewrite the equation using the fact that the two independent projection matrices are asymptotically free.

A.3.2. Ridge regression.
Let us consider the simple case of ridge regression with f_0(x) = x. The channel equations are obtained straightforwardly in this case.
The kernel limit is also obtained straightforwardly by taking the α → 0 limit and rescaling q_0 → αq_0, q_1 → αq_1.

A.3.3. Binary classification problem. We consider now the case f_0(x) = sign(x), corresponding to a binary classification problem, and we write down the channel equations for this problem in the case of the logistic and hinge losses. If we pick a logistic loss of the form l(y, µ) = Σ_k ln(1 + e^{−yµ_k}), then the proximal h solves a scalar implicit equation, in such a way that f = (h − ω)/v satisfies the corresponding channel equation. If we use instead a hinge loss l(y, µ) = Σ_k max(0, 1 − yµ_k), the proximal admits an explicit piecewise form.
In the case of the hinge loss, the simple form of the proximal allows for a more explicit expression of the channel equations. Introducing suitable auxiliary quantities and making the change of variables ζ → √(q_0 + q_1) z + …, we can rewrite the expression for q_1 in a more explicit form.
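As a concrete illustration of the objects entering these channel equations, the following sketch (not the code used for our figures) evaluates the scalar proximal of the logistic loss, h = arg min_x [ln(1 + e^{−yx}) + (x − ω)²/(2v)], by Newton's method; the input values are arbitrary.

```python
# Minimal sketch of the scalar proximal operator of the logistic loss, solved by Newton.
import numpy as np

def prox_logistic(y, omega, v, iters=50):
    x = omega                                            # starting point
    for _ in range(iters):
        s = 1.0 / (1.0 + np.exp(-y * x))                 # sigma(y x)
        grad = (x - omega) / v - y * (1.0 - s)           # stationarity condition
        hess = 1.0 / v + s * (1.0 - s)                   # always > 0: Newton step is safe
        x -= grad / hess
    return x

h = prox_logistic(y=1.0, omega=-0.3, v=2.0)
print(h, (h - (-0.3)) / 2.0)                             # second value is f = (h - omega)/v
```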

Appendix B. Proof of the main theorem
In this section we prove theorem 2, from which all other analytical results in the paper can be deduced. We start by recalling the learning problem defining the ensemble of estimators, together with a few auxiliary notations, so that this part is self-contained. The proof strategy is the one pioneered in Bayati and Montanari (2011b), Donoho and Montanari (2016), and is the following: the estimator W* is expressed as the limit of a carefully chosen sequence, an AMP iteration (Bayati and Montanari 2011a, Zdeborová and Krzakala 2016), whose iterates can be asymptotically exactly characterised using an auxiliary, closed-form iteration, the state evolution equations. We then show that converging trajectories of such an AMP iteration can be systematically found.

B.1. The learning problem
We start by recalling the definition of the problem. Consider the generative model in which, for any i ∈ [1, n] and k ∈ [0, K], each sample is Gaussian. The predictors interact with each sample in a linear way, i.e. we consider a generalised linear model acting on the ensemble of products {X_k w_k}_{k=1}^K, where L, r_0 are convex functions. We wish to determine the asymptotic properties of the estimator W* in the limit where n, p, d → ∞ with fixed ratios α = n/p, γ = d/p. We now list the necessary assumptions for our main theorem to hold.

B.1.1. Assumptions
• The functions L, r_0 are proper, closed, lower-semicontinuous, convex functions. The loss function L is pseudo-Lipschitz of order 2 in both its arguments and the regularisation r_0 is pseudo-Lipschitz of order 2. The cost function L(X ·) + r(·) is coercive.
We also assume that the matrix Σ is positive definite.
• There exists a positive constant bounding the spectral norm of Σ (and hence those of Ψ, Φ and Ω).
• The dimensions n, p, d grow linearly with finite ratios α = n/p and γ = d/p.
• The ground truth vector w 0 ∈ R d and noise vector ϵ 0 ∈ R n are sampled from subgaussian probability distributions independent from each other and from all other random quantities of the learning problem.
The proof method we will employ involves expressing the estimator W* as the limit of a carefully chosen sequence. In the case of non-strictly convex problems, the estimator may not be unique, making it unclear which estimator is reached by the sequence (at best we know it belongs to the set of zeroes of the subgradient of the cost function). We thus start with the problem of equation (111), in which a ridge regularisation of strength λ_2 > 0 is added to the initial problem to make it strongly convex. We will relax this additional strong convexity constraint later on.

B.2. Asymptotics for the strongly convex problem
We now reformulate the minimisation problem of equation (111) to make it amenable to an AMP iteration. The key feature of this ensembling problem, besides the convexity which will be crucial to control the trajectories of the AMP iteration, is the fact that each predictor only interacts linearly with each design sample, along with the correlation structure of the overall dataset. We are effectively sampling n vectors of size (Kp + d) from the Gaussian distribution with covariance Σ, so that any sample x_0|x_1|…|x_K may be rewritten as a linear transformation of Z ∈ R^{n×(Kp+d)}, a random matrix with i.i.d. N(0, 1) elements. The optimisation problem may then be written, introducing the appropriate scalings, as W* ∈ arg min of the corresponding cost, where we let w̃_0 = Ψ^{1/2} w_0, with scaled norm ρ_{w̃_0} = (1/d)‖w̃_0‖_2², and we introduced an auxiliary function.
In order to isolate the contribution correlated with the teacher, we condition the design matrix A on the teacher distribution y, writing it in terms of an independent copy Ã of A, see Bayati and Montanari (2011a), lemma 11.
The cost function then transforms accordingly, where we introduced c = Φ w_0 ∈ R^{Kp} and ρ_c = (1/p)‖c‖_2². Introducing the corresponding Kp × Kp matrix and the Lagrange multiplier ν associated with the constraint defining m, the optimisation problem can equivalently be written in constrained form.
We now look for an explicit expression of the matrix square root appearing above, where the positivity of κ is ensured by the positive-definiteness of Σ. The problem is then rewritten in terms of Z̃, an independent copy of Z, see Bayati and Montanari (2011a), lemma 11,
where s = Z̃c/‖c‖_2 is an i.i.d. standard normal vector and ρ_c = (1/p)‖c‖_2², such that the optimisation problem takes its final form. Note that the constraint defining m automatically enforces the orthogonality constraint on U with respect to c. The following lemma characterises properties (boundedness with high probability) of the feasibility sets of U, m, ν.
Proof. Consider the optimisation problem defining W*, which, owing to the convexity of the cost function, satisfies the corresponding optimality condition. The functions L and r_0 are assumed to be proper, thus their sum is bounded below for any value of their arguments. The pseudo-Lipschitz assumption on L and r_0 then implies that there exist positive constants C_L and C_{r_0} bounding the cost at the minimiser, where the second step follows from the scaling assumption on the teacher function f_0. Hence, denoting by ‖·‖_op the operator norm of a given matrix, and recalling that A has i.i.d. N(0, 1) elements: by assumption the maximum singular value of Ψ^{1/2} is bounded; the maximum singular value of a random matrix with i.i.d. N(0, 1/d) elements is bounded with high probability as n, p, d → ∞, see e.g. Vershynin (2010); finally, w_0 and ϵ_0 are sampled from subgaussian probability distributions, thus their scaled norms are bounded with high probability as n, p, d → ∞ according to Bernstein's inequality, see e.g. Vershynin (2018). An application of the union bound then leads to the following statement: there exists a constant C_W such that (1/p)‖W‖_2² ⩽ C_W with high probability as n, p, d → ∞. Now, using the definition of U, and the fact that the singular values of P_c^⊥ and Ω^{1/2} are bounded with probability one, there exists a constant C_U such that (1/√p)‖U‖ ⩽ C_U with high probability as n, p, d → ∞. Then, by the definition of m and the Cauchy-Schwarz inequality, combining the results previously established on W and w_0 with the fact that the maximum singular value of Φ is bounded, there exists a positive constant C_m such that ‖m‖_2 ⩽ C_m with high probability as n, p, d → ∞. We finally turn to ν. The optimality condition for m in problem equation (128) gives an expression for ν; the pseudo-Lipschitz assumption on L implies that we can find a constant C_{∂L} bounding its subgradient, and the last bound then follows from arguments similar to those employed above.
The optimisation problem of equation (137) is convex and feasible. Furthermore, we may reduce the feasibility sets of m, ν to compact spaces, and the function of U is coercive and thus has bounded lower level sets. Strong duality then implies we can invert the order of minimisation to obtain an equivalent problem, and study the optimisation problem in U at fixed m, ν:
where we defined suitable auxiliary functions, and the random matrix Z with i.i.d. N(0, 1) elements is independent from all other random quantities in the problem. The asymptotic properties of the unique solution to this optimisation problem can now be studied with a non-separable, matrix-valued AMP iteration. The AMP iteration solving problem equation (152) is given in the following lemma.

Lemma 6. Consider the AMP iteration defined by the update functions below, for any t ∈ N. Then its fixed point yields U*, the unique solution to the optimisation problem equation (152).
Proof. To find the correct form of the non-linearities in the AMP iteration, we match the optimality condition of problem equation (152) with the generic form of the fixed point of the AMP iteration equation (233). In the subsequent derivation, we absorb the relevant scaling factors into the update functions. For any pair of K × K symmetric positive definite matrices S, Ŝ, this optimality condition can be rewritten by adding the same quantity on both sides of the equality. For the loss function, we can then introduce the resolvent, formally a D-resolvent; similarly for the regularisation, we introduce the analogous operator, where S ∈ R^{K×K} is a positive definite matrix, Ŝ ∈ R^{K×K} is a positive definite matrix, and v ∈ R^{d×K}. The optimality condition equation (165) may then be rewritten as a pair of equations, both of which should be satisfied. We can now define update functions based on the previously obtained block decomposition. The fixed point of the matrix-valued AMP equation (233), omitting the time indices for simplicity, can then be matched with the optimality condition equation (170), which suggests the following mapping:
where we redefined û ≡ ûŜ in (168). We are now left with the task of evaluating the resolvents of the transformed functions as expressions of the original functions L, r. Starting with the loss function, we obtain the corresponding non-linearity; moving to the regularisation, the resolvent is again defined through an arg min, which gives the corresponding non-linearity for the AMP iteration. The following lemma then gives the exact asymptotics at each time step of the AMP iteration solving problem equation (152): its state evolution equations.
Lemma 7. Consider the AMP iteration equations (157)-(161). Assume it is initialised with u^0 such that lim_{d→∞} (1/d) e_0(u^0)^⊤ e_0(u^0) exists, with a positive definite matrix Ŝ^0, and with h^{−1} ≡ 0. Then for any t ∈ N, and any pair of sequences of uniformly pseudo-Lipschitz functions ϕ_{1,n} : R^{Kp×K} → R and ϕ_{2,n} : R^{n×K} → R, the state evolution characterisation holds, where G ∈ R^{Kp×K} and H ∈ R^{n×K} are independent random matrices with i.i.d. standard normal elements, and Q_t, Q̂_t, V_t, V̂_t are given by the accompanying state evolution equations.

Proof. Owing to the properties of Bregman proximity operators (Bauschke et al 2003, 2006), the update functions in the AMP iteration equations (157)-(161) are Lipschitz continuous. Thus, under the assumptions made on the initialisation, the assumptions of theorem 11 are verified, which gives the desired result.
Lemma 8. Consider the iteration equations (157)-(161), where the parameters Q, Q̂, V, V̂ are initialised at any fixed point of the state evolution equations of lemma 7. Then, for any sequence initialised with V^0 = V and a suitable u^0, the corresponding convergence statement holds.

Proof. The proof of this lemma is identical to that of lemma 7 of Loureiro et al (2021b).
Combining these results, we obtain the following asymptotic characterisation of U * .
Lemma 9. For any fixed m and ν in their feasibility sets, let U* be the unique solution to the optimisation problem equation (152). Then, for any sequences (in the problem dimension) of pseudo-Lipschitz functions of order 2, ϕ_{1,n} : R^{n×K} → R and ϕ_{2,n} : R^{Kp×K} → R, the asymptotic characterisation holds, where G ∈ R^{Kp×K} and H ∈ R^{n×K} are independent random matrices with i.i.d. standard normal elements, and Q, Q̂, V, V̂ are given by the fixed point (assumed to be unique) of a set of self-consistent equations.

Proof. Combining the results of the previous lemmas, this proof is close to that of theorem 1.5 in Bayati and Montanari (2011b).
Returning to the optimisation problem over m, ν in equation (151), the solution U*, at any dimension, verifies the zero-gradient conditions on m, ν. Using lemma 9, while assuming the subgradients of L, r are pseudo-Lipschitz (we discuss this assumption in section B.4), we obtain the corresponding conditions for m and for ν. Using the definition of D-resolvents, these are equivalent to a set of six self-consistent equations. This set of equations then characterises the asymptotic distribution of the estimator U* in the sense of lemma 9, with the optimal values of m and ν. Using the definition of U* and Z̃U*, along with the definition of the transformed regularisation in terms of the original regularisation function, a tedious but straightforward calculation allows us to reconstruct the asymptotic properties of W* and of the set {X_k w_k*}_{k=1}^K given in the main text.

B.3. Relaxing the strong convexity constraint
Assuming the set of self-consistent equations (212) has a unique fixed point regardless of the strong convexity assumption, this solution defines a unique set of six order parameters for the λ_2 = 0 case. Furthermore, using proposition 12, the unique estimator W*(λ_2) solving problem equation (111) for strictly positive λ_2 converges to the least-norm solution of the convex (but not strongly convex) problem of equation (110). Thus, for any pseudo-Lipschitz observable of U*(λ_2), we have, on the one hand, a continuous function of λ_2 with a unique continuous extension at λ_2 = 0, and, on the other hand, a function of λ_2 prescribed by the expectation taken with respect to the asymptotic Gaussian model parametrised by the state evolution parameters, which is defined for all positive values of λ_2. Since both functions match for any strictly positive λ_2, continuity implies they also match at λ_2 = 0, and we obtain the exact asymptotics of the least-ℓ_2-norm solution of problem equation (110). Regarding the uniqueness of the solution to the fixed-point equations (212), it is shown in Loureiro et al (2021a), for a similar set of equations (although for a vector-valued variable, i.e. no ensembling), that the solution is unique even if the original problem is not strictly convex. This is proven by showing that the fixed-point equations are the solution of a strictly convex problem. We expect this to be true here as well, and leave this part for a longer version of this paper.

B.4. A comment on non-pseudo-Lipschitz subgradients
Provided the subgradients in equation (203) are pseudo-Lipschitz continuous, the proof goes through. However, some convex functions commonly used in machine learning, such as the hinge loss or the ℓ_1 norm for the penalty, have non-pseudo-Lipschitz subgradients.
To circumvent this issue, one can consider the optimisation problem where both the loss and the regularisation are replaced by their Moreau envelopes with strictly positive parameters τ_1, τ_2, as is done in Celentano et al (2020) for the LASSO. Moreau envelopes are everywhere differentiable and have Lipschitz gradients for strictly positive values of their parameter (Bauschke et al 2011), thus the asymptotic characterisation holds. One can then take the parameters to zero, using the fact that the limit at zero in the parameters of Moreau envelopes is well defined (Bauschke et al 2011), recovering the original function. Since proximity operators are defined as strongly convex problems, the sequence of problems defined by the proximal operator of a Moreau envelope with decreasing parameter converges to the proximal operator of the original function when the parameter is taken to zero. Finally, inverting the expectations over random quantities with the limit taking the parameters of the Moreau envelopes to zero can be done by verifying the dominated convergence theorem, using the firm non-expansiveness of proximity operators and the corresponding bounds on their norms, see Bauschke et al (2011), chapter 4, section 1. We leave the details of this part to a longer version of this paper.
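The following minimal sketch illustrates the smoothing argument on the scalar hinge loss f(x) = max(0, 1 − x): its Moreau envelope is computed through the closed-form proximal and converges pointwise to f as the parameter τ is taken to zero. It is an illustration of the mechanism, not part of the proof.

```python
# Moreau-envelope smoothing of the hinge loss and its tau -> 0 limit.
import numpy as np

def hinge(x):
    return np.maximum(0.0, 1.0 - x)

def prox_hinge(x, tau):
    # closed-form proximal operator of the hinge loss with parameter tau
    return np.where(x < 1.0 - tau, x + tau, np.where(x <= 1.0, 1.0, x))

def moreau_hinge(x, tau):
    p = prox_hinge(x, tau)
    return hinge(p) + (x - p) ** 2 / (2.0 * tau)

x = np.linspace(-2.0, 3.0, 11)
for tau in (1.0, 0.1, 0.01):
    print(tau, np.max(np.abs(moreau_hinge(x, tau) - hinge(x))))   # gap shrinks with tau
```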

B.5. Toolbox
In this section, we reproduce part of the appendix of Loureiro et al (2021b) for completeness, in order to give an overview of the main concepts and tools on AMP algorithms which will be required for the proof.
B.5.1. Notations. For a given function ϕ : R^{d×K} → R^{d×K}, we write it row-wise, where each ϕ_i : R^{d×K} → R^K, and we denote the corresponding K × K Jacobians. For a given matrix Q ∈ R^{K×K}, we write Z ∈ R^{n×K} ∼ N(0, Q ⊗ I_n) to denote that the rows of Z are sampled i.i.d. from N(0, Q). Note that this is equivalent to saying that Z = Z̃Q^{1/2}, where Z̃ ∈ R^{n×K} is an i.i.d. standard normal random matrix. The superscript P on an arrow denotes convergence in probability. We start with some definitions that commonly appear in the AMP literature, see e.g. Bayati and Montanari (2011a), Javanmard and Montanari (2013), Berthier et al (2020). The main regularity class of functions we will use is that of pseudo-Lipschitz functions, which roughly amounts to functions with polynomially bounded first derivatives. We include the required scaling w.r.t. the dimensions in the definition for convenience.
Definition 10 (pseudo-Lipschitz function). For k, K ∈ N* and any n, m ∈ N*, a function ϕ : R^{n×K} → R^{m×K} is called pseudo-Lipschitz of order k if there exists a constant L(k, K) such that for any X, Y ∈ R^{n×K},

$$\frac{\|\phi(X)-\phi(Y)\|_F}{\sqrt m}\leqslant L(k,K)\left(1+\Big(\frac{\|X\|_F}{\sqrt n}\Big)^{k-1}+\Big(\frac{\|Y\|_F}{\sqrt n}\Big)^{k-1}\right)\frac{\|X-Y\|_F}{\sqrt n},$$

where ‖·‖_F denotes the Frobenius norm. Since K will be kept finite, it can be absorbed in any of the constants.
B.5.2. Moreau envelopes and Bregman proximal operators. In our proof, we will also frequently use the notions of Moreau envelopes and proximal operators, see e.g. Bauschke et al (2011), Parikh and Boyd (2014). These elements of convex analysis are often encountered in recent works on high-dimensional asymptotics of convex problems, and a more detailed analysis of their properties can be found, for example, in Thrampoulidis et al (2018), Loureiro et al (2021a). For the sake of brevity, we will only sketch the main properties of such mathematical objects, referring to the cited literature for further details. In this proof, we will mainly use proximal operators acting on sets of real matrices endowed with their canonical scalar product. Furthermore, proximals will be defined with matrix-valued parameters in the following way: for a given convex function f : R^{d×K} → R, a given matrix X ∈ R^{d×K} and a given symmetric positive definite matrix V ∈ R^{K×K} with bounded spectral norm, we will consider operators of the type

$$\underset{Z\in\mathbb R^{d\times K}}{\arg\min}\;\Big\{f(Z)+\tfrac12\,\mathrm{tr}\big[(Z-X)\,V^{-1}(Z-X)^\top\big]\Big\}.$$

This operator can either be written as a standard proximal operator, by factoring the matrix V^{−1} in the arguments of the trace, or as a Bregman proximal operator (Bauschke et al 2003) defined with the Bregman distance induced by a strictly convex, coercive function (for positive definite V), which justifies the use of the Bregman resolvent. Many of the usual properties of standard proximal operators (i.e. firm non-expansiveness, link with Moreau/Bregman envelopes, …) hold, in similar form, for Bregman proximal operators defined with the function (224), see e.g. Bauschke et al (2003, 2006). In particular, we will be using the equivalent notion to firmly non-expansive operators for Bregman proximity operators, called D-firm operators. Consider the Bregman proximal defined with a differentiable, strictly convex, coercive function g : X → R, where X is a given input Hilbert space, and let T be the associated Bregman proximal of a given convex function f : X → R; a numerical illustration of such matrix-parameter proximals is sketched after section B.5.4.

B.5.4. Gaussian concentration. Gaussian concentration properties are at the root of this proof. Such properties are reviewed in more detail, for example, in Vershynin (2018).
We refer the interested reader to related works for a more detailed discussion.
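Returning to the matrix-parameter proximal operators of section B.5.2, the following sketch checks numerically, on the illustrative choice f(Z) = (λ/2)‖Z‖_F² (for which the proximal has the closed form Z = X(I + λV)^{−1}), that the definition with the quadratic trace term is well posed; the sizes and the value of λ are arbitrary assumptions.

```python
# Proximal operator with a matrix-valued parameter V: closed form versus generic minimiser.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(6)
d, K, lam = 20, 3, 0.7
X = rng.normal(size=(d, K))
A = rng.normal(size=(K, K))
V = A @ A.T + np.eye(K)                                  # symmetric positive definite
V_inv = np.linalg.inv(V)

def objective(z_flat):
    Z = z_flat.reshape(d, K)
    return 0.5 * lam * np.sum(Z ** 2) + 0.5 * np.trace((Z - X) @ V_inv @ (Z - X).T)

Z_closed = X @ np.linalg.inv(np.eye(K) + lam * V)        # closed-form proximal for ridge f
Z_num = minimize(objective, np.zeros(d * K)).x.reshape(d, K)
print(np.max(np.abs(Z_closed - Z_num)))                  # ~ 0, up to optimiser tolerance
```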
B.5.5. Approximate message-passing. Approximate message-passing algorithms (Donoho et al 2009, Rangan 2011, Donoho and Montanari 2016) are a statistical-physics-inspired (Mézard et al 1987, Zdeborová and Krzakala 2016) family of iterations which can be used to solve high-dimensional inference problems. One of the central objects in such algorithms are the so-called state evolution equations, low-dimensional recursion equations which allow one to exactly compute the high-dimensional distribution of the iterates of the sequence. In this proof we will use a specific form of matrix-valued AMP iteration with non-separable non-linearities. In its full generality, the validity of the state evolution equations in this case is an extension of the works of Javanmard and Montanari (2013), Berthier et al (2020), where the dimensions of the iterates are u^t ∈ R^{d×K} and v^t ∈ R^{n×K}, and the terms in brackets denote the corresponding Onsager correction terms. We now define the state evolution recursion on two sequences of matrices {Q_{r,s}}_{s,r⩾0} and {Q̂_{r,s}}_{s,r⩾1}, initialised with Q_{0,0} = lim_{d→∞} (1/d) e_0(u^0)^⊤ e_0(u^0): for any sequence of uniformly pseudo-Lipschitz functions ϕ_n,

$$\phi_n\big(u^0, v^0, u^1, v^1, \ldots, v^{t-1}, u^t\big)\;\overset{P}{\simeq}\;\mathbb E\Big[\phi_n\big(u^0, Z^0, \hat Z^1, Z^1, \ldots, Z^{t-1}, \hat Z^t\big)\Big], \qquad (238)$$

where (Z^0, …, Z^{t−1}) ∼ N(0, {Q_{r,s}}_{0⩽r,s⩽t−1} ⊗ I_n) and (Ẑ^1, …, Ẑ^t) ∼ N(0, {Q̂_{r,s}}_{1⩽r,s⩽t} ⊗ I_n).

B.5.6. A useful result from convex analysis
Here we recall a result from Bauschke et al (2011) describing the limiting behaviour of regularised estimators for vanishing regularisation.
Proposition 12 (theorem 26.20 from Bauschke et al (2011)). Let f and h be proper, lower semi-continuous, convex functions. Suppose that arg min f ∩ dom(h) ≠ ∅ and that h is coercive and strictly convex. Then h admits a unique minimiser x_0 over arg min f and, for every ϵ ∈ ]0, 1[, the regularised problem

$$\underset{x}{\arg\min}\; f(x) + \epsilon\, h(x) \qquad (239)$$

admits a unique solution x_ϵ. If we assume further that h is uniformly convex on any closed ball of the input space, then lim_{ϵ→0} x_ϵ = x_0.
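A simple numerical illustration of proposition 12 in the setting relevant here: for an underdetermined least-squares problem, the ridge-regularised solution converges to the minimum-ℓ_2-norm minimiser as the regularisation strength is taken to zero (the problem sizes and data below are arbitrary).

```python
# Ridge path converging to the least l2-norm solution as the regularisation vanishes.
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 80                                            # more unknowns than equations
A, y = rng.normal(size=(n, p)), rng.normal(size=n)

x_min_norm = np.linalg.pinv(A) @ y                       # unique least-norm minimiser x_0
for eps in (1e-1, 1e-3, 1e-6):
    x_eps = np.linalg.solve(A.T @ A + eps * np.eye(p), A.T @ y)
    print(eps, np.linalg.norm(x_eps - x_min_norm))       # distance shrinks with eps
```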

Figure 1.
Figure 1. Pictorial representation of the model considered in the paper for K = 2. Two learners with the same architecture (in grey) receive a correlated input generated from the same vector x ∼ N(0_d, I_d). The output ŷ is an average of their outputs. While the study of an ensemble of learners is already interesting per se, it is also pivotal to study the fluctuations between learners, and the error stemming from the difference in the weights in random features and lazy training.

Figure 2.
Figure 2. Left. Test error for logistic regression with λ = 10^{−4} and different values of K as a function of p/n = 1/α, with n/d = 2 and ρ = 1. Dots represent the average of the outcomes of 10³ numerical experiments. Here we adopted ϕ(x) = erf(x) and the estimator f(v) = sign(Σ_k v_k). Right. Decomposition of the K = 1 test error ϵ_g = ϵ̄_g + δϵ_g for the estimator (a), with n/d = 2 and λ = 10^{−4}. We also plot the contribution δϵ_g corresponding to the estimator (b): we numerically observed that the decomposition coincides in the two cases. Note also the presence of a kink in δϵ_g at the interpolation transition.

Figure 3.
Figure 3. Left. Test error for ridge regression with λ = 10^{−6} and different values of K as a function of p/n = 1/α, with n/d = 2 and ρ = 1. Dots represent the average of the outcomes of 50 numerical experiments in which the parameters of the neurons are estimated using min(d, p) = 200. Here we adopted ϕ(x) = erf(x). Right. Decomposition of ϵ_g = ϵ̄_g + δϵ_g in the K = 1 case.

Figure 4.
Figure 4. Analytical estimation of the covariance parameters characterising the correlation with the oracle, m (left), the norm of the predictor in feature space, q_0, and the correlation between learners, q_1 (right) (see equation (5) for the definition), in a classification task using the logistic loss with ridge penalty, λ = 10^{−4}, at fixed n/d = 2, as a function of p/n. In the inset: the ratio q_1/q_0, quantifying the correlation between two learners. In all parameters the interpolation kink is clearly visible.


Figure 5.
Figure 5. Left. Joint probability density of the confidence scores φ_i(x) = (1 + exp(−p^{−1/2} ŵ_i^⊤ u_i(x)))^{−1} of two learners for p/n ≃ 0.13. Centre. Probability that two learners give discordant predictions using logistic regression, as a function of p/n = 1/α, with n/d = 2, ρ = 1, and λ = 10^{−4}. Right. Test error for logistic regression using the estimators in equation (13) and K = 3, with the same parameters. We adopted ϕ(x) = erf(x). We observe that the test error obtained using (a) is always smaller than the one obtained using (b). (Centre and right) Dots represent the average of the outcomes of 10³ numerical experiments.

The order parameters satisfy a set of fixed-point equations. To avoid a proliferation of indices in our formulas, let us introduce some notation. Let a ∈ R^d and b ∈ R^{Kp} be vectors with i.i.d. standard normal components, A ∈ R^{n×d}, B ∈ R^{n×Kp} the corresponding design matrices, and let the covariance matrices be given by Ψ = Σ_00 ∈ R^{d×d}, Φ = (Σ_01|Σ_02|…|Σ_0K) ∈ R^{d×Kp} and Ω ∈ R^{Kp×Kp}, the covariance of the concatenated features.