Stochastic gradient descent with random label noises: doubly stochastic models and inference stabilizer

Random label noise (or observational noise) widely exists in practical machine learning settings. While previous studies primarily focused on the effects of label noise on the performance of learning, our work investigates the implicit regularization effects of label noise under the mini-batch sampling setting of stochastic gradient descent (SGD), with the assumption that the label noise is unbiased. Specifically, we analyze the learning dynamics of SGD over the quadratic loss with unbiased label noise (ULN), where we model the dynamics of SGD as a stochastic differential equation with two diffusion terms (namely, a doubly stochastic model). While the first diffusion term is caused by mini-batch sampling over the (label-noiseless) loss gradients, as in many other works on SGD (Zhu et al 2019 ICML 7654–63; Wu et al 2020 Int. Conf. on Machine Learning (PMLR) pp 10367–76), our model investigates the second noise term of the SGD dynamics, caused by mini-batch sampling over the label noise, as an implicit regularizer. Our theoretical analysis finds that such an implicit regularizer would favor convergence points that stabilize model outputs against perturbations of the parameters (namely, inference stability). Though similar phenomena have been investigated by Blanc et al (2020 Conf. on Learning Theory (PMLR) pp 483–513), our work does not assume SGD to be an Ornstein-Uhlenbeck-like process and achieves a more generalizable result, with the convergence of the approximation proved. To validate our analysis, we design two sets of empirical studies to analyze the implicit regularizer of SGD with unbiased random label noise for deep neural network training and linear regression. Our first experiment studies the noisy self-distillation trick for deep learning, where student networks are trained using the outputs of well-trained teachers with additive unbiased random label noise.
Our experiment shows that the implicit regularizer caused by the label noise tends to select models with improved inference stability. We also carry out experiments on SGD-based linear regression with ULN, where we plot the trajectories of parameters learned in every step and visualize the effects of implicit regularization. The results back up our theoretical findings.


Introduction
Stochastic Gradient Descent (SGD) has been widely used as an effective way to train deep neural networks with large datasets [4]. While the mini-batch sampling strategy was first proposed to lower the cost of computation per iteration, it has been shown to incorporate an implicit regularizer that prevents the learning process from converging to local minima with poor generalization performance [1, 5-8]. To interpret such implicit regularization, one can model SGD as gradient descent (GD) with gradient noises caused by mini-batch sampling [9]. Studies have demonstrated the potential of such implicit regularization, or gradient noises, to improve the generalization performance of learning from both theoretical [10-13] and empirical aspects [1, 7, 8]. In summary, gradient noises keep SGD from converging to sharp local minima that generalize poorly [1, 12, 13] and would select flat minima [14] as the outcome of learning.
In this work, we aim at investigating the influence of random label noises on the implicit regularization under the mini-batch sampling of SGD. To simplify our research, we assume the training dataset is a set of vectors D = {x_1, x_2, x_3, . . ., x_N}. The label ỹ_i for every vector x_i ∈ D is the noisy response of the true neural network f*(x), such that ỹ_i = y_i + ε_i, y_i = f*(x_i), E[ε_i] = 0 and var[ε_i] = σ², where the label noise ε_i is assumed to be an independent zero-mean random variable. In our work, the random label noises can be either (1) drawn from probability distributions before the training steps (but re-sampled by the mini-batch sampling of SGD) or (2) realized by the random variables per training iteration [15]. Thus, learning is to estimate θ in f(x, θ) for approximating f*(x) by minimizing the empirical quadratic loss over the noisy labels, i.e., min_θ (1/N) Σ_{i=1}^N ½(f(x_i, θ) − ỹ_i)².

Table 1: Key Symbols and Definitions

Symbol: Definition (Equation)
x_i and y_i = f*(x_i): the i-th data point and its true label (Eq. (1))
ỹ_i and ε_i: the i-th noisy label and the label noise (Eq. (1))
f(x, θ): the output of the neural network with parameter θ and input x (Eq. (2))
θ̂: the estimator of the parameters of the neural network (Eq. (2))
W_1(t) and W_2(t): two independent Brownian motions over time (Eqs. (16), (17))
z_k and z'_k: two independent standard Gaussian random vectors (Eqs. (9), (17))
Θ^LNL(t): the continuous-time dynamics under the Label-NoiseLess setting (Eq. (20))

Note that we denote L*_i(θ) = ½(f(x_i, θ) − y_i)² as the loss based on a noiseless sample in this work. Inspired by [1, 14], our work studies how unbiased label noises ε_i (1 ≤ i ≤ N) affect the "selection" of θ from the possible solutions, from the viewpoint of the learning dynamics [16] of SGD under mini-batch sampling [12, 17, 18]. For the symbols used in this paper, please refer to Table 1.
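As a concrete illustration of the noisy-label model above, the following sketch generates noisy responses ỹ_i = f*(x_i) + ε_i for a hypothetical linear ground truth f*(x) = ⟨w*, x⟩; the helper name make_noisy_labels and the toy data are ours, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_noisy_labels(f_star, X, sigma2, rng):
    """Noisy responses y~_i = f*(x_i) + eps_i with E[eps_i] = 0, var[eps_i] = sigma2."""
    y = f_star(X)                                          # true labels y_i = f*(x_i)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=y.shape)   # unbiased label noise
    return y + eps

# toy setting: a hypothetical linear ground truth f*(x) = <w*, x>
X = rng.normal(size=(1000, 3))
w_star = np.array([1.0, -2.0, 0.5])
y_noisy = make_noisy_labels(lambda X: X @ w_star, X, sigma2=0.25, rng=rng)
```

Drawing the noise once before training corresponds to setting (1); calling the helper anew at every iteration corresponds to setting (2).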

Backgrounds: SGD Dynamics and Implicit Regularization
To analyze the SGD algorithm solving the problem in Eq. (2), we follow the settings in [17] and consider SGD as an algorithm that, in the k-th iteration with the estimate θ_k, randomly picks a size-b subset of samples from the training dataset, i.e., B_k ⊂ D, estimates the mini-batch stochastic gradient ĝ_k(θ_k) = (1/b) Σ_{i∈B_k} ∇L_i(θ_k), and updates θ_{k+1} = θ_k − η ĝ_k(θ_k), where η refers to the step-size of SGD. Furthermore, we can decompose the mini-batch loss gradient into the full-batch loss gradient and a noise term, such that ĝ_k(θ_k) = ∇L(θ_k) + V_k(θ_k), where V_k(θ_k) refers to the stochastic gradient noise caused by mini-batch sampling. The noise converges to zero with increasing batch size (Eq. (5)). With dt = η → 0 and a constant batch size b = |B_k|, the SGD algorithm diffuses to a continuous-time dynamics Θ(t) given by a stochastic differential equation (SDE), with weak convergence [17, 19]: dΘ(t) = −∇L(Θ) dt + √(η/b) (Σ^SGD_N(Θ))^{1/2} dW(t), where W(t) is a standard Brownian motion in R^d and Σ^SGD_N(Θ) is the sample covariance matrix of the loss gradients ∇L_i(Θ) for 1 ≤ i ≤ N. For the detailed derivations of the above continuous-time approximation and its assumptions, please refer to [17]. We follow [17] and do not make low-rank assumptions on Σ^SGD_N(Θ). Through Euler discretization [11, 17], one can approximate SGD as θ̂_k, such that θ̂_{k+1} = θ̂_k − η∇L(θ̂_k) + (η/√b)(Σ^SGD_N(θ̂_k))^{1/2} z_k, where z_k is a standard Gaussian random vector. The implicit regularizer of SGD is this data-dependent noise term, controlled by the learning rate η and the batch size b [20]. [10-12] discussed SGD for variational inference and enabled novel applications to samplers [21, 22]. To understand the effect on generalization performance, [1, 20] studied the escaping behavior from sharp local minima [8] and the convergence to flat ones. [23] discovered how SGD could find a flat local minimum from information-theoretical perspectives and proposed a novel regularizer to improve the performance. Finally, [24] studied regularization effects on linear DNNs, and our previous work [18] proposed new multiplicative noises to interpret SGD and obtain stronger theoretical properties.
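The Euler-discretized dynamics above can be sketched numerically. The snippet below performs repeated steps θ_{k+1} = θ_k − η∇L(θ_k) + (η/√B)(Σ^SGD_N)^{1/2} z_k for a quadratic loss on synthetic data; it is a minimal sketch (helper names and data are ours, and we use an eigendecomposition-based symmetric square root in place of a Cholesky factor so that rank-deficient covariances are tolerated).

```python
import numpy as np

rng = np.random.default_rng(1)

def euler_sgd_step(theta, X, y, eta, B, rng):
    """One Euler step of the SGD diffusion:
    theta' = theta - eta * grad L(theta) + (eta / sqrt(B)) * (Sigma_N^SGD)^{1/2} z."""
    residual = X @ theta - y
    per_sample_grads = residual[:, None] * X        # gradients of the quadratic losses L_i
    g = per_sample_grads.mean(axis=0)               # full-batch gradient
    Sigma = np.cov(per_sample_grads, rowvar=False)  # sample covariance of loss gradients
    w, V = np.linalg.eigh(Sigma)                    # symmetric PSD square root;
    root = V * np.sqrt(np.clip(w, 0, None)) @ V.T   # tolerates rank-deficient Sigma
    z = rng.standard_normal(theta.shape)
    return theta - eta * g + (eta / np.sqrt(B)) * (root @ z)

X = rng.normal(size=(200, 2))
y = X @ np.array([1.0, 1.0])                        # noiseless labels for this sketch
theta = np.zeros(2)
for _ in range(2000):
    theta = euler_sgd_step(theta, X, y, eta=0.05, B=8, rng=rng)
```

With noiseless labels the gradient noise vanishes near the optimum, so the iterates settle at the minimizer.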

Our Contributions
In this work, we assume that the unbiased random label noises ε_i (1 ≤ i ≤ N) and the mini-batch sampler of SGD are independent. When the random label noises have been drawn from probability distributions prior to the training procedure, SGD re-samples the label noises and generates a new type of data-dependent noise, in addition to the stochastic gradient noises of the label-noiseless losses, by re-sampling the label-noisy data and averaging the label-noisy loss gradients over random mini-batches [25, 26].
Our analysis shows that, under mild conditions, with the gradients of label-noisy losses, SGD incorporates an additional data-dependent noise term, complementing the stochastic gradient noises [17, 18] of the label-noiseless losses, through re-sampling the samples with label noises [26] or dynamically adding noises to the labels over iterations [15]. We consider such noises as an implicit regularization caused by unbiased label noises, and interpret their effects as a solution selector of the learning procedure. More specifically, this work makes the following unique contributions.

Doubly Stochastic Models
We reviewed the preliminaries [12, 17, 18, 27] and extended the analytical framework in [17] to interpret the effects of unbiased label noises as an additional implicit regularizer on top of the continuous-time dynamics of SGD. By discretizing the continuous-time dynamics of label-noisy SGD, we write the discrete-time approximation to the learning dynamics, denoted as θ̂^ULN_k for k = 1, 2, . . ., as θ̂^ULN_{k+1} = θ̂^ULN_k − η (∇L*(θ̂^ULN_k) + ξ*_k(θ̂^ULN_k) + ξ^ULN_k(θ̂^ULN_k)), where L*_i(θ) = ½(f(x_i, θ) − y_i)² refers to the label-noiseless loss function with sample x_i and the true (noiseless) label y_i, and the noise term ξ*_k(θ) refers to the stochastic gradient noise [17] of the label-noiseless losses. We then obtain the new implicit regularizer caused by the unbiased label noises (ULN) for ∀θ ∈ R^d, which can be approximated as ξ^ULN_k(θ) ≈ σ √(η/B) ((1/N) Σ_{i=1}^N ∇_θ f(x_i, θ) ∇_θ f(x_i, θ)ᵀ)^{1/2} z_k, where z_k refers to a random noise vector drawn from the standard Gaussian distribution, θ_k refers to the parameters of the network in the k-th iteration, (·)^{1/2} refers to the Cholesky decomposition of the matrix, ∇_θ f(x_i, θ) = ∂f(x_i, θ)/∂θ refers to the gradient of the neural network output for sample x_i with respect to the parameters, and B and η are the batch size and the learning rate of SGD, respectively. Obviously, the strength of such an implicit regularizer is controlled by σ², B and η.
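One realization of this approximated ULN regularizer, ξ^ULN_k(θ) ≈ σ√(η/B)((1/N) Σ_i ∇f_i ∇f_iᵀ)^{1/2} z_k, can be sampled as below. This is a sketch under that stated approximation; the helper names are ours, and we use an eigendecomposition-based matrix square root in place of a Cholesky factor to handle rank-deficient matrices.

```python
import numpy as np

rng = np.random.default_rng(2)

def uln_noise(grads_f, sigma2, eta, B, rng):
    """One draw of the approximated ULN implicit regularizer:
    xi ~ sigma * sqrt(eta/B) * ((1/N) sum_i grad_f_i grad_f_i^T)^{1/2} z."""
    N, d = grads_f.shape
    C = grads_f.T @ grads_f / N                      # (1/N) sum_i grad_f_i grad_f_i^T
    w, V = np.linalg.eigh(C)                         # PSD square root; a Cholesky
    root = V * np.sqrt(np.clip(w, 0, None)) @ V.T    # factor would also work
    z = rng.standard_normal(d)                       # z_k ~ N(0, I)
    return np.sqrt(sigma2 * eta / B) * (root @ z)

# for a linear model f(x, theta) = x^T theta, the output gradient is grad f = x_i
grads = rng.normal(size=(500, 3))
xi = uln_noise(grads, sigma2=1.0, eta=0.1, B=10, rng=rng)
```

The draws are zero-mean, and their scale shrinks with larger batch size B and grows with σ² and η, matching the stated control of the regularizer's strength.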

Summary of Results
Section 3 formulates the algorithm of SGD with unbiased random label noises as a stochastic dynamics with two noise terms (Proposition 1), derives the continuous-time and discrete-time Doubly Stochastic Models from the SGD algorithm (Definitions 1 and 2), and provides approximation error bounds (Proposition 2). Proofs of the two propositions are provided in Appendices A and B.

Inference Stabilizer as Implicit Regularizer
The regularization effects of unbiased random label noises should be proportional to (ησ²/B) · (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2, where ∇_θ f(x, θ) refers to the gradient of f over θ, and the effects are controlled by the batch size B and the variance of the label noises σ². Similar results have been obtained by assuming that deep learning algorithms are driven by an Ornstein-Uhlenbeck-like process [3], while our work does not rely on such an assumption and is based entirely on our proposed Doubly Stochastic Models.

Summary of Results
Section 4 analyzes the implicit regularization effects of unbiased random label noises for SGD, where we conclude that the implicit regularizer acts as a controller of the average neural network gradient norm (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2 in the dynamics (Proposition 3). We then offer Remarks 2, 3 and 4 to characterize the behaviors of SGD with unbiased label noises: (1) SGD would escape local minima with higher gradient norms, due to the larger perturbation driven by the implicit regularizer; (2) the strength of the implicit regularization effects is controlled by the learning rate η and the batch size B; and (3) it is possible to tune the performance of SGD by adding and controlling unbiased label noises, as low neural network gradient norms usually correspond to flat loss landscapes.
To validate our three remarks, Section 5 presents experiments based on self-distillation with unbiased label noises [28, 29] for deep neural networks, where we show that, under the teacher-student training setting, a well-trained model could escape a local minimum and converge to a new point with a lower neural network gradient norm and better generalization performance by learning from its own noisy outputs (i.e., logit outputs with additive unbiased label noises). Section 6 presents experiments based on SGD-based linear regression with unbiased label noises to visualize the implicit regularization effects and their connection to the learning rate and batch size. Our visualization results show that SGD-based linear regression with unbiased label noises converges to a Gaussian-like distribution centered at the solution of linear regression, and the (co-)variance of the distribution is controlled by the covariance of the data samples as well as the learning rate and batch size. Our experiment results back up our theory.

SGD Implicit Regularization for Ordinary Least Squares (OLS)
The most recent and relevant works in this area are [27, 30], where the same group of authors studied the implicit regularization of gradient descent and stochastic gradient descent for OLS. They investigated an ℓ2-norm-like implicit regularizer on the parameters, which regularizes OLS as a Ridge estimator with a decaying penalty. Prior to these efforts, F. Bach and his group studied the convergence of gradient-based solutions for linear regression with OLS and regularized estimators, under both noisy and noiseless settings, in [31-33].

Langevin Dynamics and Gradient Noises
With similar agendas, [10-12] studied the limiting behaviors of SGD (or the steady state of its dynamics) from the perspective of Bayesian/variational inference. They also promoted novel applications to stochastic gradient MCMC samplers [21, 22]. By connecting Σ^SGD_N(θ) to the loss Hessian (1/N) Σ_i ∇²L_i(θ) in near-convergence regions, [1] studied the escaping behavior from sharp local minima, while [8] discussed this issue in large-batch training settings. Furthermore, [20] discussed how learning rates and batch sizes affect the generalization performance and the flatness of optimization results. Finally, [24] studied the implicit regularization of linear neural networks, and [18] proposed a new multiplicative noise model to interpret the gradient noises with stronger theoretical properties.

Self-Distillation and Noisy Students
Self-distillation [28, 29, 34, 35] has been examined as an effective way to further improve the generalization performance of well-trained models. Such strategies perform knowledge distillation using the well-trained models as teachers, optionally adding noises (e.g., dropout, stochastic depth, label smoothing, or potentially label noises) to the training procedure of the student models.

Discussion on the Relevant Work
Compared to the above works, we make contributions in all three categories. First of all, this work characterizes the implicit regularization effects of label noises on SGD dynamics. Compared to [27, 30], which work on linear regression, our proposed doubly stochastic model can be used to explain the learning dynamics of SGD with label noises for nonlinear neural networks. Even from the linear regression perspective [27, 30, 33], we precisely measure the gaps between SGD dynamics with and without label noises and provide a new example with numerical simulations to visualize the implicit regularization effects.
Compared to [29, 36], our analysis emphasizes the role of the implicit regularizer caused by label noises in model selection, where models with high inference stability would be selected. [37] is the work most relevant to ours, where the authors studied the early stopping of gradient descent under label noises via a neural tangent kernel (NTK) [38] approximation. Our work carries out the analysis for SGD without approximation assumptions such as the NTK.
Beyond the NTK assumption, [3] assumes that deep learning algorithms are driven by an Ornstein-Uhlenbeck (OU)-like process and obtains results similar to our inference stabilizer (the third result of our research), while our work contributes by proposing the Doubly Stochastic Models and reaches the conclusion in a different way. We also provide the first empirical results and evidence, based on commonly-used DNN architectures and benchmark datasets, to visualize the effects of the implicit regularizers caused by unbiased label noises in real-world settings. Please note that an earlier manuscript [39] of ours was put on OpenReview with discussion, where external reviewers raised concerns: part of the results had been investigated in [3], and we did not provide the results in a strong form (e.g., theorems or proofs). Hereby, this work shifts the main contributions from the implicit regularization of label noises to the doubly stochastic models with approximation error bounds and proofs. The implicit regularization effects can be estimated via the doubly stochastic models directly, without the assumption of an OU process. To the best of our knowledge, this work is the first to understand the effects of unbiased label noises on SGD dynamics, addressing technical issues including implicit regularization, OLS, self-distillation, model selection, and inference stability results.

Doubly Stochastic Models for SGD with Unbiased Random Label Noises

In this section, we present SGD with unbiased random label noises, derive the Continuous-time/Discrete-time Doubly Stochastic Models, and prove the convergence of the approximation between the models.

Modeling Unbiased Label Noises in SGD
In our research, SGD with unbiased random label noises refers to an iterative algorithm that updates the estimate incrementally from an initialization θ^ULN_0. With mini-batch sampling and unbiased random label noises, in the k-th iteration, the SGD algorithm updates the estimate θ^ULN_k using the stochastic gradient ĝ_k(θ^ULN_k) through a gradient descent rule, such that θ^ULN_{k+1} = θ^ULN_k − η ĝ_k(θ^ULN_k). Specifically, in the k-th iteration, SGD randomly picks a batch of samples B_k ⊆ D to estimate the stochastic gradient, as ĝ_k(θ) = (1/B) Σ_{i∈B_k} ∇L̃_i(θ) = ∇L*(θ) + ξ*_k(θ) + ξ^ULN_k(θ), where ∇L*_i(θ) for ∀θ ∈ R^d refers to the loss gradient based on the label-noiseless sample (x_i, y_i) with y_i = f*(x_i), ξ*_k(θ) refers to the stochastic gradient noise [17] from mini-batch sampling over the gradients of label-noiseless samples, and ξ^ULN_k(θ) is an additional noise term caused by the mini-batch sampling and the unbiased random label noises, such that ξ^ULN_k(θ) = −(1/B) Σ_{i∈B_k} ε_i ∇_θ f(x_i, θ). Proposition 1 (Mean and Variance of the Two Noise Terms). The noise terms ξ*_k(θ) and ξ^ULN_k(θ) are zero-mean vector-valued functions of θ. The two matrix-valued functions Σ^SGD_N(θ) and Σ^ULN_N(θ) over θ ∈ R^d characterize the variances of the noise vectors. When we assume the label noises and mini-batch sampling are independent, there holds Σ^ULN_N(θ) = (σ²/N) Σ_{i=1}^N ∇_θ f(x_i, θ) ∇_θ f(x_i, θ)ᵀ (Eq. (15)). The two noise terms ξ*_k(θ) and ξ^ULN_k(θ), whose scales are controlled by the learning rate and the batch size, largely influence the SGD dynamics. Please refer to Appendix A for proofs.
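The decomposition of the mini-batch gradient into the full-batch label-noiseless gradient plus two zero-mean noise terms can be checked numerically for a linear model f(x, θ) = xᵀθ. The sketch below uses synthetic data and realizes the label noise per iteration (the dynamic setting of [15]); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)

N, d, B = 400, 2, 8
X = rng.normal(size=(N, d))
beta_star = np.ones(d)                 # a hypothetical true parameter: f*(x) = x^T beta*
sigma2 = 0.5
theta = np.array([0.5, -0.5])          # an arbitrary query point theta

# label-noiseless full-batch gradient grad L*(theta) for the quadratic loss
full_grad = X.T @ (X @ theta - X @ beta_star) / N

def noisy_minibatch_grad(theta, rng):
    """g_hat_k(theta): mini-batch gradient with label noise realized per iteration."""
    idx = rng.choice(N, size=B, replace=False)               # mini-batch B_k
    Xb = X[idx]
    yb = Xb @ beta_star + rng.normal(0, np.sqrt(sigma2), B)  # noisy labels
    return Xb.T @ (Xb @ theta - yb) / B

# xi_k = g_hat_k(theta) - grad L*(theta) combines xi*_k and xi^ULN_k; both are zero-mean
noise = np.stack([noisy_minibatch_grad(theta, rng) - full_grad for _ in range(20000)])
```

Averaging many draws of the combined noise recovers (approximately) the zero mean stated in Proposition 1.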
With the mean and variance of two noise terms, we can easily formulate the learning dynamics of SGD with unbiased label noises as follows.

Doubly Stochastic Models and Approximation
We consider the SGD algorithm with unbiased random label noises in the form of gradient descent with additive data-dependent noises, such that θ^ULN_{k+1} = θ^ULN_k − η ĝ_k(θ^ULN_k). When η → 0, we assume the noise terms are independent; then we can follow the analysis in [19] to derive the diffusion process of SGD with unbiased random label noises, denoted as Θ^ULN(t) over continuous time t ≥ 0. In this way, we define the Doubly Stochastic Models that characterize the continuous-time dynamics of SGD with unbiased label noises as follows.
Obviously, we can obtain the discrete-time approximation [11,17] to the SGD dynamics as follows.
Definition 2 (Discrete-Time Doubly Stochastic Models). We denote θ̂^ULN_k for k = 1, 2, . . . as the discrete-time approximation to the Doubly Stochastic Models for SGD with Unbiased Label Noises, which in the k-th iteration behaves as θ̂^ULN_{k+1} = θ̂^ULN_k − η ∇L*(θ̂^ULN_k) + (η/√B) (Σ^SGD_N(θ̂^ULN_k))^{1/2} z_k + (η/√B) (Σ^ULN_N(θ̂^ULN_k))^{1/2} z'_k, where z_k and z'_k are two independent standard Gaussian random vectors (Eq. (17)).

A1 There exists some M > 0 such that the gradients ∇L_i(θ) and ∇_θ f(x_i, θ) are uniformly bounded by M for all i = 1, 2, ..., N and θ ∈ R^d.

A2 There exists some L > 0 such that ∇L_i(θ) and ∇_θ f(x, θ) for ∀x ∈ D are Lipschitz continuous with a bounded Lipschitz constant L > 0, uniformly for all i = 1, 2, ..., N. Under these assumptions, the continuous-time dynamics of SGD with unbiased label noises (ULN), denoted as Θ^ULN(t) in Eq. (16), is an order-1 strong approximation to the discrete-time SGD dynamics θ̂^ULN_k in Eq. (17). That is, there exists a constant C, independent of η but depending on σ², L and M, such that the approximation error is bounded by Cη. Please refer to Appendix B for proofs.
Remark 1. With the above strong convergence bound for the approximation, we can consider θ̂^ULN_k, the solution of Eq. (17), as a tight approximation to the SGD algorithm with unbiased label noises under the same initialization. A tight approximation to the noise term ξ^ULN_k(θ) (defined in Eq. (27)) is ξ^ULN_k(θ) ≈ σ √(η/B) ((1/N) Σ_{i=1}^N ∇_θ f(x_i, θ) ∇_θ f(x_i, θ)ᵀ)^{1/2} z_k. We use such discrete-time iterations and approximations to the noise term ξ^ULN_k(θ) to interpret the implicit regularization behaviors of the SGD with unbiased label noises algorithm θ^ULN_k accordingly.

Implicit Regularization Effects on Neural Networks
In this section, we use our model to interpret the regularization effects of SGD with unbiased label noises for general neural networks, without assumptions on the structures of neural networks.

Implicit Regularizer Influenced by Unbiased Random Label Noises
Comparing the stochastic gradient with unbiased random label noises, ĝ_k(θ), against the stochastic gradient based on the label-noiseless losses, we find an additional noise term ξ^ULN_k(θ) as the implicit regularizer. To interpret ξ^ULN_k(θ), we first define the diffusion process of SGD based on the Label-NoiseLess losses, i.e., L*_i(θ) for 1 ≤ i ≤ N, as dΘ^LNL(t) = −∇L*(Θ^LNL) dt + √(η/B) (Σ^SGD_N(Θ^LNL))^{1/2} dW(t). By comparing Θ^ULN(t) with Θ^LNL(t), the effect of ξ^ULN_k(Θ) in continuous-time form should be √(η/B) (Σ^ULN_N(Θ))^{1/2} dW(t). Then, in discrete time, we obtain the following results.
Proposition 3 (The implicit regularizer ξ^ULN_k(θ)). The implicit regularizer of SGD with unbiased random label noises can be approximated as ξ^ULN_k(θ) ≈ σ √(η/B) ((1/N) Σ_{i=1}^N ∇_θ f(x_i, θ) ∇_θ f(x_i, θ)ᵀ)^{1/2} z_k. In this way, we can estimate the expected regularization effect of the implicit regularizer as E‖ξ^ULN_k(θ)‖²_2 ≈ (ησ²/B) · (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2. Please refer to Appendix C for proofs.
We thus conclude that the effect of the implicit regularization caused by unbiased random label noises for SGD is proportional to (ησ²/B) · (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2.
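The average squared output-gradient norm (1/N) Σ_i ‖∇_θ f(x_i, θ)‖²_2 can be estimated directly for a small model. Below is an illustrative sketch for a one-hidden-layer tanh network; the architecture and helper names are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(4)

def output_grad(x, W1, w2):
    """grad_theta f(x, theta) for f(x) = w2^T tanh(W1 x), theta = (W1, w2) flattened."""
    h = np.tanh(W1 @ x)
    dW1 = (w2 * (1 - h ** 2))[:, None] * x[None, :]   # df/dW1 via the chain rule
    dw2 = h                                            # df/dw2
    return np.concatenate([dW1.ravel(), dw2])

def avg_sq_grad_norm(X, W1, w2):
    """(1/N) sum_i ||grad_theta f(x_i, theta)||^2, the quantity the ULN term penalizes."""
    return float(np.mean([np.sum(output_grad(x, W1, w2) ** 2) for x in X]))

X = rng.normal(size=(100, 3))
W1 = 0.1 * rng.normal(size=(5, 3))
w2 = 0.1 * rng.normal(size=5)
norm0 = avg_sq_grad_norm(X, W1, w2)
```

Tracking this scalar over training epochs is how we probe the regularization effect in the experiments of Section 5.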

Understanding the Unbiased Label Noises as an Inference Stabilizer
Here we extend the existing results on SGD [1,40] to understand Proposition 3 and obtain remarks as follows.
Remark 2 (Inference Stability). In the partial-derivative form, the gradient norm can be written as (1/N) Σ_{i=1}^N ‖∂f(x_i, θ)/∂θ‖²_2, which characterizes the variation of the neural network output f(x, θ) on the samples x_i (for 1 ≤ i ≤ N) under parameter interpolation around the point θ. A lower value of (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2 leads to higher stability of the neural network outputs f(x, θ) against (random) perturbations of the parameters.
Remark 3 (Escape and Converge). When the noise ξ^ULN_k(θ) is θ-dependent (Section 6 presents a special case with OLS where ξ^ULN_k(θ) is θ-independent), we follow [1] and suggest that the implicit regularizer helps SGD escape from points θ with a high neural network gradient norm (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2, as the scale of the noise ξ^ULN_k(θ) is large there. Reciprocally, we follow [40] and suggest that when SGD with unbiased random label noises converges, the algorithm converges to a point θ* with a small gradient norm. Similar results have been obtained in [3] by assuming that deep learning algorithms are driven by an Ornstein-Uhlenbeck process.
Remark 4 (Performance Tuning). Considering ησ²/B as the coefficient balancing the implicit regularizer and vanilla SGD, one can regularize/penalize the SGD learning procedure with fixed η and B more strongly using a larger σ². More specifically, we can expect to obtain a regularized solution with a lower (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2, i.e., higher inference stability of the neural network, as the regularization effects become stronger when σ² increases.

Experiments on Self-Distillation with Unbiased Label Noises
The goal of this experiment is to validate Proposition 3 and Remarks 2-4.

Experiments Design
To evaluate SGD with unbiased label noises, we design a set of novel experiments based on self-distillation with unbiased label noises. In addition to learning from noisy labels directly, our experiment intends to train a (student) network from the noisy outputs of a (teacher) network under a quadratic regression loss, where the student network is initialized from the weights of the teacher and unbiased label noises are randomly added to the soft outputs of the teacher network.
We aim to directly measure the gradient norm (1/N) Σ_{i=1}^N ‖∇_θ f(x_i, θ)‖²_2 of the neural network after every epoch to test the SGD implicit regularization effects of unbiased label noises (i.e., Proposition 3). The performance comparisons among the teacher network, the student network trained with unbiased label noises, and the student network trained noiselessly demonstrate the advantage of unbiased label noises in SGD for regression tasks (i.e., Remarks 3 and 4).
Particularly, we elaborate how the proposed SGD with unbiased label noises fits the settings of self-distillation with unbiased label noises. Further, we introduce the goal of our empirical experiments with a list of expected evidence, then present the experiment settings for the empirical evaluation. Finally, we present the experiment results with solid evidence to validate the proposals of this work.

Noisy Self-Distillation
Given a well-trained model, self-distillation algorithms [28, 29, 34, 35] intend to further improve its performance by learning from the "soft label" outputs (i.e., logits) of the model (as the teacher). Furthermore, some practices found that self-distillation could be further improved by incorporating certain randomness and stochasticity into the training procedure, namely noisy self-distillation, so as to obtain better generalization performance [29, 34]. In this work, we study two well-known strategies for additive noises as follows.
1. Gaussian Noises. Given a pre-trained model with an L-dimensional logit output, for every iteration of self-distillation, this strategy first draws random vectors from an L-dimensional Gaussian distribution N(0_L, σ² I_L), then adds the vectors to the logit outputs of the model, so that the student model learns from the noisy outputs. Note that our analysis assumes the model output is one-dimensional while, in self-distillation, the logit labels have multiple dimensions; thus, the diagonal matrix σ² I_L refers to the full form of the noise variances and σ² controls their scale.

2. Symmetric Noises. This strategy is derived from [15] and generates noises by randomly swapping values of the logit output among its L dimensions. Specifically, in every iteration of self-distillation, given a swap probability p, every logit output (denoted as y here) from the pre-trained model and every dimension y_l, the strategy with probability p swaps the logit value in the dimension corresponding to y_l with any other dimension y_{m≠l} with equal prior (i.e., with probability (L − 1)^{-1}); with the remaining probability 1 − p, the strategy keeps the original logit value. In this way, the new noisy label ỹ has, in each dimension l, the expectation E[ỹ_l] = (1 − p) y_l + (p/(L − 1)) Σ_{m≠l} y_m. This strategy introduces an explicit bias to the original logit outputs. However, when we consider the expectation E[ỹ] as the new soft label, the random noise around that soft label is still unbiased, as E[ỹ − E[ỹ]] = 0 for all dimensions. Note that this noise is not the symmetric noise studied for robust learning [41]. Thus, our proposed SGD with unbiased label noises setting fits the practice of noisy self-distillation well.
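The two strategies can be sketched as follows. Note that the symmetric-noise implementation below (sequential pairwise swaps over the dimensions) is one plausible reading of the swapping scheme; the exact procedure in [15] may differ in details, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(5)

def gaussian_noisy_logits(logits, sigma2, rng):
    """Strategy 1: add N(0, sigma^2 I_L) noise to the teacher's logit vector."""
    return logits + rng.normal(0.0, np.sqrt(sigma2), size=logits.shape)

def symmetric_noisy_logits(logits, p, rng):
    """Strategy 2: with probability p, swap each logit with a uniformly chosen
    other dimension; otherwise keep it unchanged."""
    y = logits.copy()
    L = len(y)
    for l in range(L):
        if rng.random() < p:
            m = rng.choice([j for j in range(L) if j != l])  # equal prior over others
            y[l], y[m] = y[m], y[l]
    return y

teacher_logits = np.array([2.0, -1.0, 0.5, 0.0])
noisy_g = gaussian_noisy_logits(teacher_logits, sigma2=0.1, rng=rng)
noisy_s = symmetric_noisy_logits(teacher_logits, p=0.3, rng=rng)
```

The Gaussian strategy is unbiased around the original logits, while the symmetric strategy only permutes values, so the multiset of logits is preserved exactly.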

Datasets and DNN Models
We choose ResNet-56 [42], one of the most practical deep models, to conduct the experiments on three datasets: SVHN [43], CIFAR-10 and CIFAR-100 [44]. We follow the standard training procedure [42] for training a teacher model (the original model). Specifically, we train the model from scratch for 200 epochs and adopt the SGD optimizer with batch size 64 and momentum 0.9. The learning rate is set to 0.1 at the beginning of training and divided by 10 at the 100th and 150th epochs. A standard weight decay with a small regularization parameter (10^-4) is applied. As for noiseless self-distillation, we follow the standard procedure [45] for distilling knowledge from the teacher to a student of the same network structure. The basic training setting is the same as for training the teacher model.
For the settings of noisy self-distillation, we divide the original training set into a new training set (80%) and a validation set (20%).For clarity, we also present the results using varying scales of unbiased label noises on all three sets, where the original training set is used for training.

Experiment Results
Figure 1 presents the results of the above two methods with increasing scales of noises, i.e., increasing σ² for Gaussian noises and increasing p for Symmetric noises. In Figure 1(a)-(c), we demonstrate that the gradient norms of the neural networks, (1/N) Σ_i ‖∇_θ f(x_i, θ)‖²_2, decrease with growing σ² and p for both strategies. The results back up our theoretical investigation, meaning the model is awarded high inference stability, as the variation of the neural network outputs against potential random perturbations of the parameters has been reduced by the regularization. In Figure 1(d)-(f) and (g)-(i), we plot the validation and testing accuracy of the models obtained under noisy self-distillation. The results show that (1) student models have lower gradient norms than teacher models, and the gradient norm further decreases with an increasing scale of noises (i.e., σ² and p); (2) some of the models have been improved through noisy self-distillation compared to the teacher model, while noisy self-distillation could obtain better performance than noiseless self-distillation; and (3) it is possible to select noisily self-distilled models using validation accuracy for better overall generalization performance (on the testing dataset). All results here are based on 200 epochs of noisy self-distillation. We show the evolution of the training and test losses during the entire training procedure, and compare the settings of adding no label noises, symmetric noises and Gaussian noises for noisy self-distillation. Figure 2 presents the results on the three datasets, i.e., SVHN, CIFAR-10 and CIFAR-100, with the optimal scales of label noises on the validation sets. It shows that all algorithms finally converge to local minima with training losses near zero, while the local minima found by SGD with Symmetric noises are flatter, with better generalization performance (especially on the CIFAR-100 dataset).

Experiments on Linear Regression with Unbiased Label Noises
To validate our findings in linear regression settings, we carry out numerical evaluations using synthetic data to visualize the dynamics, over the iterations, of SGD with label-noisy OLS and label-noiseless OLS.

Linear Regression with Unbiased Label Noises
We here hope to see how unbiased label noises would influence SGD iterates for ordinary linear regression (OLS), where a simple quadratic loss function is considered for OLS, such that where samples are generated through ỹi = x i β * + ε i , E[ε i ] = 0 and var[ε i ] = σ 2 .Note that in this section, we replace the notation of θ with β to present the parameters of linear regression models.Let us combine Eq. ( 25) and Eq.(11).We write the SGD for Ordinary Least Squares with Unbiased Label Noises as the iterations β ULN k for k = 1, 2, 3 . . .as follow where ∇L * i (β) for ∀θ ∈ R d refers to the loss gradient based on the labelnoiseless sample (x i , y i ) and y i = x i β * , ξ * k (β) refers to stochastic gradient noises caused by mini-batch sampling over the gradients of label-noiseless samples, and ξ ULN k (β) is an additional noise term caused by the mini-batch sampling and the unbiased label noises, such that We denote the sample covariance matrix of N samples as According to Remark 1 for implicit regularization in the general form, we can write the implicit regularizer of SGD with the random label noises for OLS as, z k , and Let us combine Proposition 2 and linear regression settings, we obtain the continuous-time dynamics for linear regression with unbiased label noises, denoted as β ULN (t).According to [33,46], we can see SGD and its continuoustime dynamics for noiseless linear regression (denoted as β LNL k and β LNL (t)) would asymptotically converge to the optimal solution β * .As the additional noise term ξ ULN is unbiased with an invariant covariance structure, when t → ∞, we can simply conclude that lim . By definition of a distribution from a stochastic process, we could conclude β ULN (t) converges to a stationary distribution, such that β ULN (t) ∼ N (β * , ησ 2 B ΣN ), as t → ∞. .Remark 5. 
Thus, as k → ∞, the SGD algorithm for OLS with unbiased label noises converges to a Gaussian-like distribution N(β*, (ησ²/B) Σ_N). The span and shape of this distribution are controlled by σ² and Σ_N when η and B are held constant.
In this experiment, we evaluate the above remark using numerical simulations, to test (1) whether the trajectories of β^ULN_k converge to the distribution N(β*, (ησ²/B) Σ_N); (2) whether the shape of the convergence area is controlled by the sample covariance matrix Σ_N of the data; and (3) whether the size of the convergence area is controlled by the variance of the label noises σ².

Experiment Setups
In our experiments, we use 100 random samples drawn from a two-dimensional Gaussian distribution X_i ∼ N(0, Σ_{1,2}) for 1 ≤ i ≤ 100, where Σ_{1,2} is a symmetric covariance matrix controlling the sample generation. To add noises to the labels, we first draw 100 copies of random noises from a normal distribution with the given variance, ε_i ∼ N(0, σ²), and then set up the OLS problem with (X_i, ỹ_i) pairs using ỹ_i = X_i β* + ε_i, β* = [1, 1], and various settings of σ² and Σ_{1,2}. We run the SGD algorithms with a fixed learning rate η = 0.01, batch size B = 5, and a total of K = 1,000,000 iterations to visualize the complete paths.
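The setup above can be reproduced in a few lines of NumPy. The sketch below is our illustration, not the paper's code: the function name and the choice of sampling without replacement are our assumptions, and the iteration count is reduced for speed.

```python
import numpy as np

def simulate_sgd_uln(sigma2=0.25, cov=None, n=100, eta=0.01,
                     batch=5, iters=50_000, seed=0):
    """SGD on OLS whose labels carry fixed, unbiased Gaussian noise."""
    rng = np.random.default_rng(seed)
    cov = np.eye(2) if cov is None else cov          # Sigma_{1,2}
    beta_star = np.array([1.0, 1.0])
    X = rng.multivariate_normal(np.zeros(2), cov, size=n)
    eps = rng.normal(0.0, np.sqrt(sigma2), size=n)   # label noise, drawn once
    y_noisy = X @ beta_star + eps                    # noise added before training
    beta = np.zeros(2)
    path = np.empty((iters, 2))
    for k in range(iters):
        idx = rng.choice(n, size=batch, replace=False)   # mini-batch sampler
        grad = X[idx].T @ (X[idx] @ beta - y_noisy[idx]) / batch
        beta -= eta * grad
        path[k] = beta
    return path, beta_star
```

Plotting `path` reproduces the trajectory clouds of Figure 3: the iterates drift toward β* and then fluctuate around it.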

Experiment Results
Figure 3 presents the results of the numerical validations. In Figures 3(a)-(d), we gradually increase the variance of the label noises σ² from 0.25 to 2.0, where we observe that (1) SGD over label-noiseless OLS converges quickly to the optimal solution β* = [1.0, 1.0]; (2) SGD over OLS with unbiased random label noises asymptotically converges to a distribution centered at the optimal point; and (3) as σ² increases, the span of the converging distribution becomes larger. In Figures 3(e)-(h), we use four settings of Σ_{1,2}, where we see that (4) no matter how Σ_{1,2} is set for the OLS problem, SGD with unbiased random label noises asymptotically converges to a distribution centered at the optimal point. Comparing the results in (e) with those in (f), we find that when the trace of Σ_{1,2} increases, the span of the converging distribution increases. Furthermore, (5) the shape of the converging distribution depends on Σ_{1,2}. In Figure 3(g), when we place the principal component of Σ_{1,2} on the vertical axis (i.e., Σ_Ver = [[10, 0], [0, 100]]), the distribution lies principally along the vertical axis. Figure 3(h) demonstrates the opposite layout when we set Σ_Hor = [[100, 0], [0, 10]] as Σ_{1,2}. The scale and shape of the converging distribution back up our theoretical investigation in Eq. (29).
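Observation (3), that the spread of the stationary iterates grows with σ², can be checked without plotting. The following is our illustrative sketch (function name and the spread statistic are our choices, not the paper's):

```python
import numpy as np

def tail_spread(sigma2, seed=0, n=100, eta=0.01, batch=5, iters=50_000):
    """Average per-coordinate std of the last 10,000 SGD iterates on noisy OLS."""
    rng = np.random.default_rng(seed)
    beta_star = np.array([1.0, 1.0])
    X = rng.multivariate_normal(np.zeros(2), np.eye(2), size=n)
    y = X @ beta_star + rng.normal(0.0, np.sqrt(sigma2), size=n)
    beta = np.zeros(2)
    tail = []
    for k in range(iters):
        idx = rng.choice(n, size=batch, replace=False)
        beta -= eta * X[idx].T @ (X[idx] @ beta - y[idx]) / batch
        if k >= iters - 10_000:          # keep only near-stationary iterates
            tail.append(beta.copy())
    return np.asarray(tail).std(axis=0).mean()
```

With matched seeds, the spread at σ² = 2.0 comes out clearly larger than at σ² = 0.25, mirroring the growing clouds in Figures 3(a)-(d).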
Note that the unbiased random label noises are added to the labels prior to the learning procedure. In this setting, it is the mini-batch sampler of SGD that re-samples the noises and thereby influences the dynamics of SGD by forming the implicit regularizer.

Discussion and Conclusion
While previous studies primarily focus on the performance degradation caused by label noises or corrupted labels [37, 47], we investigate the implicit regularization effects of random label noises under the mini-batch sampling settings of stochastic gradient descent (SGD). Specifically, we adopt the dynamical-systems interpretation of SGD to analyze the learning procedure based on the quadratic loss with unbiased random label noises. We decompose the mini-batch stochastic gradient based on label-noisy losses into three parts in Eq. (11): (i) ∇L*(θ), the true gradient of the label-noiseless losses; (ii) ξ*_k(θ), the noise caused by mini-batch sampling over the label-noiseless loss gradients; and (iii) ξ^ULN_k(θ), the noise caused by mini-batch sampling over the unbiased label noises.

A Proof of Proposition 1

Proof. As the mini-batch B_k is randomly, independently, and uniformly drawn from the full sample set D, for any θ ∈ R^d and any x_j ∈ B_k the expected per-sample gradient equals the full gradient ∇L*(θ). In this way, we can derive that, for any θ ∈ R^d, the covariance of the sampling noise is Σ^SGD_N(θ), as defined in Eq. (15). Similarly, as the mini-batch B_k is randomly, independently, and uniformly drawn from the full sample set D, for any θ ∈ R^d the label-noise term has zero mean. Throughout, we assume B_k and ε_i for 1 ≤ i ≤ N are independent.
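The zero-mean and covariance claims in this proof can be checked numerically in the OLS specialization, where the label-noise component of the mini-batch gradient reduces to ξ^ULN_k = −(1/b) Σ_{i∈B_k} ε_i x_i. The Monte-Carlo sketch below is our construction (it assumes uniform mini-batch sampling with replacement, which is what makes the covariance exactly (σ²/b) Σ_N):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, b, sigma2, trials = 200, 2, 5, 1.0, 200_000
X = rng.normal(size=(n, d))
Sigma_N = X.T @ X / n                           # sample covariance matrix

# Draw many independent (mini-batch, label-noise) pairs and form the noise term.
idx = rng.integers(0, n, size=(trials, b))      # uniform sampling with replacement
eps = rng.normal(0.0, np.sqrt(sigma2), size=(trials, b))
draws = -(X[idx] * eps[:, :, None]).sum(axis=1) / b   # xi_uln, one row per trial

emp_mean = draws.mean(axis=0)                   # should be ~ 0_d (unbiasedness)
emp_cov = np.cov(draws.T)                       # should be ~ (sigma2 / b) * Sigma_N
theory_cov = (sigma2 / b) * Sigma_N
```

The empirical mean lands near zero and the empirical covariance near (σ²/b) Σ_N, matching the invariant covariance structure used in the main text.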

Notation:
- L*_i(θ) = ½(f(x_i, θ) − y_i)²: the loss based on a noiseless sample. (2)
- L̃_i(θ) = ½(f(x_i, θ) − ỹ_i)²: the loss based on a noisy sample. (2)

SGD without assumptions on label noises:
- B_k: the mini-batch of samples drawn at the k-th step of SGD. (3)
- b = |B_k|: the constant batch size of B_k. (3)
- θ_k: the k-th step of SGD. (3)
- V_k: SGD noise caused by mini-batch sampling of loss gradients.
- θ̂_k: the k-th step of the discrete-time approximation to Θ(t). (7)

SGD with Unbiased Label Noises (ULN):
- θ^ULN_k: the k-th step of SGD with unbiased label noises. (8)
- ξ*_k: SGD noise through mini-batch sampling of TRUE loss gradients. (8)
- ξ^ULN_k: SGD noise through mini-batch sampling of unbiased label noises. (8)
- Σ^SGD_N: the covariance matrix of the TRUE loss gradients.

where z_k and z'_k are two independent d-dimensional random vectors drawn from the standard d-dimensional Gaussian distribution N(0_d, I_d), independently per iteration, and θ̃^ULN_0 = Θ^ULN(0). The approximation θ̃^ULN_k is tight against Θ^ULN(t) at t = kη, with the convergence bound given as follows.

Proposition 2 (Convergence of Approximation). Let T ≥ 0. Let Σ^SGD_N(θ) and Σ^ULN_N(θ) be the two diffusion matrices defined in Eq. (15). Assume that:
A1. There exists some M > 0 such that max_{i=1,2,...,N}

- ‖∇_θ f(x, θ)‖²₂: the average gradient norm of the neural network f(x, θ) over samples.

Fig. 2: Training and Validation Loss per Epoch during the Training Procedure.

The noise term ξ^ULN_k(β) is unbiased, E[ξ^ULN_k(β)] = 0_d, with an invariant covariance structure, and is independent of β (the location) and k (the time).