CoRe optimizer: an all-in-one solution for machine learning

The optimization algorithm and its hyperparameters can significantly affect the training speed and resulting model accuracy in machine learning (ML) applications. The wish list for an ideal optimizer includes fast and smooth convergence to low error, low computational demand, and general applicability. Our recently introduced continual resilient (CoRe) optimizer has shown superior performance compared to other state-of-the-art first-order gradient-based optimizers for training lifelong ML potentials. In this work we provide an extensive performance comparison of the CoRe optimizer and nine other optimization algorithms including the Adam optimizer and resilient backpropagation (RPROP) for diverse ML tasks. We analyze the influence of different hyperparameters and provide generally applicable values. The CoRe optimizer yields best or competitive performance in every investigated application, while only one hyperparameter needs to be changed depending on mini-batch or batch learning.


INTRODUCTION
Machine learning (ML) is a part of the general field of artificial intelligence. ML is employed in a wide range of applications such as computer vision, natural language processing, and speech recognition [1,2]. It involves statistical models whose performance on tasks can be improved by learning from sample data or past experience. ML models typically include a very large number of parameters, the so-called weights. In the learning process, these weights are optimized according to a performance measure. To evaluate this measure, training data or experience are required. In supervised learning, the model is trained on labeled data to obtain a function that maps the data to its label, as in classification and regression tasks. By contrast, in unsupervised learning the model is trained on unlabeled data, for example for categorization. In addition, in reinforcement learning the model is trained through trial and error, aiming to maximize its reward. Hence, ML models make predictions based only on learned patterns in the data and do not require explicit program instructions for these predictions.
The performance measure can be a loss function (also called cost function) that needs to be minimized [3]. This loss function is usually a sum over contributions from the training data points. Instead of calculating it simultaneously for the full training data set (deterministic or batch learning), a (semi-)randomly chosen subset of the training data is often employed (stochastic or mini-batch learning). This approach can accelerate the convergence with respect to the total computation time because the accuracy of the loss function estimate increases only sub-linearly with the batch size. To update the weights of the ML model, first-order gradient-based iterative optimization schemes dominate the field, since the memory demand and computation time per step of second-order optimizers are often too high. In general, the optimization aims at a local minimum of the loss function with respect to the model's weights, because for most ML applications it is sufficient to find weight values with low loss rather than the global minimum.
The optimization algorithm can crucially determine the training speed and final performance of ML models [4]. Therefore, the development of advanced optimizers is an active field of research which can significantly impact the accuracy of predictions in all ML applications. The simplest form of stochastic first-order minimization for high-dimensional parameter spaces is stochastic gradient descent (SGD) [5]. In SGD, the gradient of the loss function with respect to each weight is multiplied by a constant learning rate and the product is subtracted from the respective weight in each update. The loss function gradient is adapted in stochastic gradient descent with momentum (Momentum) [6] and Nesterov accelerated gradient (NAG) [7,8]. These methods aim to improve convergence by a momentum in the weight updates, as the gradients are based on stochastic estimates. In a different fashion, adaptive gradient (AdaGrad) [9], adaptive delta (AdaDelta) [10], and root mean square propagation (RMSprop) [11] apply the ordinary loss function gradient combined with a weight-specific, adapted learning rate. Adaptive moment estimation (Adam) [12], adaptive moment estimation with infinity norm (AdaMax) [12], and our recently developed continual resilient (CoRe) optimizer [13] combine momentum with individually adapted learning rates. In resilient backpropagation (RPROP) [14,15] only the sign of the loss function gradient is employed with individually adapted learning rates.
Apart from these optimizers, which are applied in this work, many more optimizers have been developed for ML applications in recent years. For example, the modification of the first moment estimation of Adam yields Nesterov-accelerated adaptive moment estimation (NAdam) [16], and the modification of the second moment yields AMSGrad [17]. Nesterov momentum is also employed in adaptive Nesterov momentum (Adan) [18]. Moreover, AdaFactor [19], AdaBound [20], AdaBelief [21], AdamW [22], PAdam [23], RAdam [24], AdamP [25], Lamb [26], Gravity [27], and Lion [28] are further examples of the large zoo of optimizers. They often represent incremental improvements of parent algorithms. We note that these optimizers can be used in applications beyond ML as well. Furthermore, second-order optimizers have been proposed such as adaptive estimates of the Hessian (AdaHessian) [29] and second-order clipped stochastic optimization (Sophia) [30]. To acquire an overview of the performance differences among these optimizers, extensive benchmarks are required [31-33]. Statistical averaging and uncertainty quantification are indispensable in these benchmarks for validation.
To ease the burden on an ML practitioner in the optimizer choice, an optimizer is desired which performs well on diverse ML tasks. Moreover, a generally applicable set of optimizer hyperparameters is required which works out of the box, avoiding time-consuming hyperparameter tuning. At most, a single intuitive hyperparameter may need coarse adaptation, and its value should be easy to estimate. Furthermore, the ideal optimizer features fast and smooth convergence to high accuracy with low computational burden.
The Adam optimizer is a viable, though not obviously superior, choice for many ML tasks [33]. Therefore, Adam became the most frequently applied optimizer with adaptive learning rates. Since our CoRe optimizer has outperformed Adam on the task of training a lifelong machine learning potential (lMLP) [13], it is natural to assess its performance on diverse ML tasks and compare the outcome with that of the various aforementioned optimizers. Such a broad performance evaluation further allows us to derive generally valid hyperparameters for the CoRe optimizer and thus to obtain an all-in-one solution.
As a benchmark, we examine a set of fast-running ML tasks provided in PyTorch [34]. The benchmark set spans the range from small mini-batch learning to full batch learning as well as reinforcement learning. Moreover, it includes different tasks, models, and data sets to enable a broad comparison of different optimizers. First, for the MNIST handwritten digits [35] and Fashion-MNIST [36] data sets we run mini-batch learning to do variational auto-encoding (AED and AEF) [37] and image classification (ICD and ICF). The latter is done by convolutional neural networks [38] with rectified linear units (ReLU) [39], dropout [40], max pooling [41], and softmax. Second, for the cart-pole problem [42] we perform naive reinforcement learning (NR) with a feed-forward linear neural network [43], dropout, ReLU, and softmax and reinforcement learning by an actor-critic algorithm (RA) [44]. Third, for the BSD300 data set [45] we carry out single image super-resolution (SR) with upscale factor four by sub-pixel convolutional neural networks [46] employing relatively large mini-batches. Fourth, we run batch learning of the Cora data set [47] for semi-supervised classification (SS) with graph convolutional networks [48] and dropout as well as of a sine wave for time sequence prediction (TS) with a long short-term memory (LSTM) cell [49].
In addition, we evaluate the optimizers in training of a machine learning potential [50-55], i.e., a regression task. A machine learning potential is a representation of the potential energy surface of a chemical system. It can be employed in atomistic simulations to calculate chemical properties and reactivity. One example among many methods is a high-dimensional neural network potential [56,57], which takes as input the chemical element types and atomic coordinates and, where required, atomic charges and spins [58-60] to calculate the energy and atomic forces of systems ranging from organic molecules over liquids to inorganic materials, including multi-component systems such as interfaces [13,61-63]. In this work, we repeat the stationary learning of an lMLP based on an ensemble of ten high-dimensional neural network potentials, which employ element-embracing atom-centered symmetry functions as descriptors [13]. The lMLP is trained on 8600 S$_\mathrm{N}$2 reaction systems with lifelong adaptive data selection.
This work is organized as follows: In Section 2, we summarize the applied optimization algorithms, and in Section 3, we compile the computational details. In Section 4, we analyze the resulting training speed and final accuracy for the PyTorch ML task examples and lMLPs. This work ends with a conclusion in Section 5.

Continual Resilient (CoRe) Optimizer
The CoRe optimizer [13] is a first-order gradient-based optimizer for stochastic and deterministic iterative optimizations. It adapts the learning rates individually for each weight $w_\xi$ depending on the optimization progress. These learning rate adjustments are inspired by the Adam optimizer [12], RPROP [14,15], and the synaptic intelligence method [64].
Exponential moving averages of the loss function gradient and its square, $g_\xi^\tau$ (Equation (1)) and $h_\xi^\tau$ (Equation (2)), with decay rates $\beta_1^\tau, \beta_2 \in [0, 1)$, are employed in minimization in analogy to the Adam optimizer. For maximization, the sign of the loss function gradient in Equation (1) has to be inverted. In the CoRe optimizer, $\beta_1^\tau$ is a function of the individual weight update counter $\tau$ (Equation (3)), whereby $\tau$ can differ from the counter of gradient calculations $t$ if some optimization steps do not update every weight. The initial decay $\beta_1^a \in [0, 1)$ is converted by a Gaussian with width $\beta_1^c > 0$ to the final decay $\beta_1^b \in [0, 1)$. The smaller $\beta_1^\tau$, the stronger the dependence on the current gradient, while a larger $\beta_1^\tau$ leads to a slower decay of previous gradient contributions.
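The moving averages and the decay schedule described above can be sketched as follows. This is a minimal illustration and not the published implementation: the exact Gaussian form in `beta1_schedule` is an assumption chosen so that the decay equals $\beta_1^a$ at $\tau = 1$ and approaches $\beta_1^b$ for large $\tau$, and the function names and numerical values are hypothetical.

```python
import math


def beta1_schedule(tau, beta1_a, beta1_b, beta1_c):
    """Decay rate moving from beta1_a (tau = 1) to beta1_b via a Gaussian of width beta1_c.

    Assumed functional form; only its limits follow from the text.
    """
    return beta1_b + (beta1_a - beta1_b) * math.exp(-(((tau - 1) / beta1_c) ** 2))


def update_moving_averages(g_prev, h_prev, grad, beta1_tau, beta2):
    """Exponential moving averages of the gradient (g) and of its square (h)."""
    g = beta1_tau * g_prev + (1.0 - beta1_tau) * grad
    h = beta2 * h_prev + (1.0 - beta2) * grad * grad
    return g, h


# Illustrative values only (not the recommended hyperparameters).
first = beta1_schedule(1, 0.7, 0.8, 250.0)
late = beta1_schedule(10**6, 0.7, 0.8, 250.0)
g1, h1 = update_moving_averages(0.0, 0.0, 2.0, 0.5, 0.9)
```

A smaller decay rate at the start lets the average follow the first gradients closely, while the larger final value smooths stochastic fluctuations later in the training.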
The Adam-like adaption of the weight-specific learning rates, $u_\xi^\tau$ (Equation (4)), employs the quotient of the moving averages $g_\xi^\tau$ and $(h_\xi^\tau)^{1/2}$, which are corrected with respect to their initialization bias toward zero ($g_\xi^0, h_\xi^0 = 0$). For numerical stability $\epsilon \gtrsim 0$ is added in the denominator. This quotient is invariant to gradient rescaling and introduces a form of step size annealing. Therefore, $u_\xi^\tau$ changes from $\pm 1$ in the first optimization step $\tau = 1$ toward zero in well-behaved optimizations.
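The two properties named above (a value of about $\pm 1$ in the first step and invariance to gradient rescaling) can be checked with a short sketch. It assumes the usual Adam-style bias correction, with the current decay rate raised to the power $\tau$; the function name is hypothetical and the code is not taken from the CoRe software.

```python
import math


def adam_like_quotient(g, h, tau, beta1_tau, beta2, eps=1e-8):
    """Bias-corrected quotient of the gradient average and the root of the squared-gradient average."""
    g_hat = g / (1.0 - beta1_tau**tau)  # assumed bias correction of g
    h_hat = h / (1.0 - beta2**tau)      # assumed bias correction of h
    return g_hat / (math.sqrt(h_hat) + eps)


def first_step_quotient(grad, beta1=0.9, beta2=0.999):
    """Quotient after the very first update, starting from g = h = 0."""
    g = (1.0 - beta1) * grad
    h = (1.0 - beta2) * grad * grad
    return adam_like_quotient(g, h, 1, beta1, beta2)


u_small = first_step_quotient(-3.0)
u_large = first_step_quotient(-300.0)  # rescaled gradient, nearly the same quotient
```

Because numerator and denominator scale with the gradient in the same way, only the sign and the recent consistency of the gradients determine the size of the quotient.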
The plasticity factor $P_\xi^\tau$ (Equation (5)) aims to improve the stability-plasticity balance by regularization in the weight updates. For this purpose, weight groups $\chi$ are specified (for example, a layer in a neural network) and the weight-specific importance scores $S_\xi^{\tau-1}$ (see Equation (8) below) are compared within these groups. When $\tau > t_\mathrm{hist} > 0$, $P_\xi^\tau$ can freeze the weights with the $n_{\mathrm{frozen},\chi} \geq 0$ highest importance scores in their group in update $\tau$ to mitigate forgetting of previous knowledge.
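The freezing rule described above can be sketched for a single weight group as follows. The sketch assumes that the plasticity factor is 0 for frozen weights and 1 otherwise, and that ties are broken by position; the function name is hypothetical.

```python
def plasticity_factors(scores, n_frozen, tau, t_hist):
    """Return 0 for the n_frozen highest-scoring weights of one group and 1 for all others.

    No weight is frozen while tau <= t_hist, i.e. before the importance
    scores are based on a full history.
    """
    if tau <= t_hist or n_frozen <= 0:
        return [1] * len(scores)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    frozen = set(ranked[:n_frozen])
    return [0 if i in frozen else 1 for i in range(len(scores))]


early = plasticity_factors([0.1, 0.9, 0.5], n_frozen=1, tau=10, t_hist=250)
later = plasticity_factors([0.1, 0.9, 0.5], n_frozen=1, tau=300, t_hist=250)
```

Multiplying the weight update by this factor leaves the most important weights of the group unchanged in the respective update.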
The RPROP-like learning rate adaption of the step size $s_\xi^\tau$ (Equation (6)) depends only on the sign of the gradient moving average $g_\xi^\tau$ and not on its magnitude, leading to a robust optimization. Sign inversions from $g_\xi^{\tau-1}$ to $g_\xi^\tau$ often signal a jump over a minimum in the previous update. Hence, the step size $s_\xi^{\tau-1}$ is reduced by the decrease factor $\eta_- \in (0, 1]$ in this case, while it is enlarged by the increase factor $\eta_+ \geq 1$ for constant signs to speed up convergence. The updated step size $s_\xi^\tau$ is bounded by the minimal and maximal step sizes $s_\mathrm{min}, s_\mathrm{max} > 0$. For $g_\xi^{\tau-1} \cdot g_\xi^\tau \cdot P_\xi^\tau = 0$, the step size update is omitted. The initial step size $s_\xi^0 = s_\xi^1$ is a hyperparameter of the optimization.
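The three cases of the step size adaption described above translate directly into a small function. This is a sketch of the stated rule with hypothetical names, not code from the CoRe software.

```python
def update_step_size(s_prev, g_prev, g, p, eta_minus, eta_plus, s_min, s_max):
    """Sign-based step size adaption for one weight.

    s_prev: previous step size, g_prev and g: previous and current gradient
    moving averages, p: plasticity factor (0 or 1).
    """
    product = g_prev * g * p
    if product > 0:   # constant sign: enlarge the step, capped at s_max
        return min(eta_plus * s_prev, s_max)
    if product < 0:   # sign inversion: shrink the step, floored at s_min
        return max(eta_minus * s_prev, s_min)
    return s_prev     # zero product: the step size update is omitted


grown = update_step_size(0.1, 1.0, 2.0, 1, 0.5, 1.2, 1e-6, 1.0)
capped = update_step_size(0.1, 1.0, 2.0, 1, 0.5, 1.2, 1e-6, 0.1)
shrunk = update_step_size(0.1, 1.0, -2.0, 1, 0.5, 1.2, 1e-6, 1.0)
frozen = update_step_size(0.1, 1.0, -2.0, 0, 0.5, 1.2, 1e-6, 1.0)
```

The cap `s_max` limits how far a weight can move in a single update, which is why it acts as the main hyperparameter in the recommendations given later.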
The weight decay enters the weight update (Equation (7)) [...]. The importance score $S_\xi^\tau$ (Equation (8)) ranks the weight importance by taking into account weight-specific contributions to previously estimated loss function decreases. This ansatz is inspired by the synaptic intelligence method. The importance scores make it possible to identify the most important weights in previous updates, which can be frozen by the plasticity factors (Equation (5)) in following updates to improve the stability-plasticity balance. The product of gradient moving average and signed weight update is employed to estimate the loss function decrease. Since the weight update sign is not inverted, the more positive the importance score, the larger the loss function decrease. Starting with $S_\xi^0 = 0$, the mean of $g_\xi^\tau \cdot u_\xi^\tau \cdot P_\xi^\tau \cdot s_\xi^\tau$ over $\tau \leq t_\mathrm{hist}$ is calculated. For $\tau > t_\mathrm{hist}$, the importance score is determined as an exponential moving average with decay $1 - (t_\mathrm{hist})^{-1}$.
We note that the relatively large number of hyperparameters in the CoRe optimizer is, on the one hand, an advantage for obtaining good results even in very difficult or edge cases. On the other hand, the hyperparameter tuning is more complicated. However, a set of generally applicable values, which are provided in this work, can overcome this drawback.
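The running mean and the subsequent exponential moving average of the importance score can be sketched as follows. The incremental form of the mean (adding one term divided by $t_\mathrm{hist}$ per update) is an assumption that reproduces the described mean after exactly $t_\mathrm{hist}$ updates; the function name is hypothetical.

```python
def update_importance(score_prev, g, u, p, s, tau, t_hist):
    """Importance score of one weight after update tau.

    g: gradient moving average, u: Adam-like quotient, p: plasticity factor,
    s: step size. Their product estimates the loss function decrease.
    """
    contribution = g * u * p * s / t_hist
    if tau <= t_hist:
        # Accumulate the mean over the first t_hist updates.
        return score_prev + contribution
    # Afterwards: exponential moving average with decay 1 - 1 / t_hist.
    return (1.0 - 1.0 / t_hist) * score_prev + contribution


# Tiny history length for illustration only.
s1 = update_importance(0.0, 4.0, 1.0, 1, 1.0, tau=1, t_hist=2)
s2 = update_importance(s1, 2.0, 1.0, 1, 1.0, tau=2, t_hist=2)
s3 = update_importance(s2, 0.0, 1.0, 1, 1.0, tau=3, t_hist=2)
```

After the second update the score equals the mean of the two products (4 and 2), and from the third update on older contributions decay exponentially.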

SGD
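SGD [5] subtracts the product of a constant learning rate $\gamma$ and the loss function gradient from the weights $w_\xi^{t-1}$ in the weight updates (Equation (9)), i.e., $w_\xi^t = w_\xi^{t-1} - \gamma \cdot G_\xi^\tau$, with $G_\xi^\tau$ denoting the loss function gradient with respect to $w_\xi^{t-1}$.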

Momentum
An additional momentum (Momentum) [6] can be introduced in SGD by replacing $G_\xi^\tau$ in Equation (9) by [...] with the momentum factor $\mu$ and [...]. [...] Equation (9) is substituted by [...].
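For orientation, one common form of the momentum update is sketched below. It follows the convention of accumulating the gradient into a momentum buffer without dampening, as used in PyTorch's SGD to the best of our understanding; whether this matches the equation lost from the text above is an assumption, and the names are hypothetical.

```python
def sgd_momentum_step(w, grad, m_prev, lr, mu):
    """One SGD update with momentum for a single weight.

    m_prev is the momentum buffer (0 before the first update), lr the
    constant learning rate, and mu the momentum factor.
    """
    m = mu * m_prev + grad   # the buffer replaces the plain gradient
    return w - lr * m, m


w1, m1 = sgd_momentum_step(1.0, 0.5, 0.0, lr=0.1, mu=0.9)
w2, m2 = sgd_momentum_step(w1, 0.5, m1, lr=0.1, mu=0.9)
```

With a constant gradient the effective step grows from update to update, which illustrates how momentum accelerates movement along consistent directions.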

Adam
The algorithm of the Adam optimizer [12] is given by Equations (1) (with constant $\beta_1$), (2), (4), and (9), whereby $G_\xi^\tau$ in Equation (9) is replaced by $u_\xi^\tau$. In comparison to the CoRe optimizer, Adam lacks the $\tau$ dependence of the decay rate $\beta_1$, the plasticity factors $P_\xi^\tau$, the RPROP-like learning rate adaption $s_\xi^\tau$, and the weight decay. The latter can also be introduced in Adam, as well as in many other optimizers, by adding $d \cdot w_\xi^{t-1}$ to the loss function gradient as the second operation of an optimization iteration after the possible sign inversion for maximization. A further alternative is to subtract [...] as in AdamW [22].
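The difference between the two weight decay variants mentioned above can be illustrated with a short sketch. The coupled form follows the description in the text; the decoupled form reflects our understanding of AdamW (shrinking the weight by the learning rate times the decay factor, independent of the gradient statistics) and is an assumption with respect to the truncated sentence. Function names are hypothetical.

```python
def coupled_weight_decay(grad, w, d):
    """Add d * w to the loss function gradient before the optimizer processes it."""
    return grad + d * w


def decoupled_weight_decay(w, lr, d):
    """Shrink the weight directly (AdamW-style); the adaptive statistics never see the decay."""
    return w - lr * d * w


g_decayed = coupled_weight_decay(0.5, 2.0, 0.1)
w_decayed = decoupled_weight_decay(2.0, 0.01, 0.1)
```

In the coupled variant the decay term is rescaled by the adaptive learning rate like any other gradient contribution, whereas the decoupled variant applies the same relative shrinkage to every weight.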

AdaMax
The difference of the AdaMax optimizer [12] compared to Adam is that the term in curly brackets in Equation (4) is replaced by the infinity norm $k_\xi^\tau$, with $k_\xi^0 = 0$.
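As a hedged reference, the infinity norm recursion in its commonly stated form (following the original AdaMax proposal) reads as below. Whether and where $\epsilon$ is added differs between implementations and is left open here; the notation is adapted to this section and the equations are not copied from the source.

```latex
\begin{align}
  k_\xi^\tau &= \max\left(\beta_2 \cdot k_\xi^{\tau-1},\ \left|G_\xi^\tau\right|\right),
  \qquad k_\xi^0 = 0, \\
  u_\xi^\tau &= \frac{g_\xi^\tau}{1 - (\beta_1)^\tau} \cdot \left(k_\xi^\tau\right)^{-1}.
\end{align}
```

Since the maximum is not biased toward zero by the initialization, no bias correction is applied to $k_\xi^\tau$ in this form.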

RMSprop
In RMSprop [11] the loss function gradient is divided by the moving average of its magnitude [...]. Hence, the difference to the Adam optimizer is that the loss function gradient $G_\xi^\tau$ is applied instead of the gradient moving average $g_\xi^\tau$ and the initialization bias correction is omitted.
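Under the two differences to Adam stated above, a plausible form of the RMSprop update is the following. It is a sketch consistent with the description and with common implementations, not the equation from the source, and the placement of $\epsilon$ outside the square root is an assumption.

```latex
\begin{align}
  h_\xi^\tau &= \beta_2 \cdot h_\xi^{\tau-1} + (1 - \beta_2) \left(G_\xi^\tau\right)^2, \\
  w_\xi^t &= w_\xi^{t-1} - \gamma \cdot G_\xi^\tau \left[\left(h_\xi^\tau\right)^{\frac{1}{2}} + \epsilon\right]^{-1}.
\end{align}
```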

AdaDelta
The adaptive learning rate in the AdaDelta optimizer [10] is established by [...] with [...] and $l_\xi^0 = 0$. Hence, in comparison to the RMSprop algorithm the factor $(l_\xi^{\tau-1} + \epsilon)^{\frac{1}{2}}$ is applied additionally in the weight update and the order of adding $\epsilon$ to $h_\xi^\tau$ and taking the square root is inverted.
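The two differences to RMSprop named above can be made concrete with a sketch of one AdaDelta update. It assumes that $l_\xi^\tau$ is the moving average of the squared weight changes, as in the original AdaDelta proposal; the default values and names are illustrative and hypothetical.

```python
import math


def adadelta_step(w, grad, h_prev, l_prev, lr=1.0, rho=0.9, eps=1e-6):
    """One AdaDelta update for a single weight.

    h: moving average of squared gradients, l: moving average of squared
    weight changes (assumed meaning of l), rho: decay rate.
    """
    h = rho * h_prev + (1.0 - rho) * grad * grad
    # eps is added before taking the square root, and the previous change
    # average l_prev supplies the additional factor in the numerator.
    delta = math.sqrt(l_prev + eps) / math.sqrt(h + eps) * grad
    l = rho * l_prev + (1.0 - rho) * delta * delta
    return w - lr * delta, h, l


w_new, h_new, l_new = adadelta_step(1.0, 1.0, 0.0, 0.0)
```

Because the numerator starts from `eps` alone, the first changes are very small and grow only as the average of past changes builds up.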

RPROP
RPROP [14,15] is based on Equation (6), whereby [...]. In addition, a backtracking weight step is applied by setting [...]. The weight update is given by [...].
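Since the equations of this subsection are not recoverable from the text, the following sketch shows one widely used RPROP variant for orientation. It assumes that the plain gradient takes the place of the moving average in the step size rule, that the gradient is set to zero after a sign inversion so that no weight change happens in that update, and that the weight moves by the step size against the gradient sign. This mirrors our understanding of PyTorch's Rprop; the names and defaults are hypothetical.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def rprop_step(w, grad, grad_prev, s_prev,
               eta_minus=0.5, eta_plus=1.2, s_min=1e-6, s_max=50.0):
    """One RPROP update for a single weight; returns (weight, stored gradient, step size)."""
    product = grad_prev * grad
    if product > 0:
        s = min(eta_plus * s_prev, s_max)
    elif product < 0:
        s = max(eta_minus * s_prev, s_min)
        grad = 0.0  # assumed handling of a sign inversion: skip this weight change
    else:
        s = s_prev
    return w - sign(grad) * s, grad, s


same_sign = rprop_step(1.0, 2.0, 1.0, 0.1)
inverted = rprop_step(1.0, -2.0, 1.0, 0.1)
```

Only the gradient sign enters the weight change, so the update size is controlled entirely by the adapted step size.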

COMPUTATIONAL DETAILS
The PyTorch ML task examples [65] were solely modified to embed them in the extensive benchmark without touching the ML models and trainings. The only exception was the removal of the learning rate scheduler in ICD and ICF to assess exclusively the performance of the optimizer. The tasks performed originally only on the MNIST data set (AED and ICD) were also carried out for the Fashion-MNIST data set (AEF and ICF). The batch sizes of the ML tasks AED, AEF, ICD, and ICF were 64 of in total 60000 training data points to obtain test cases for small mini-batch learning (64 was the default value in the ICD PyTorch ML task example). The batch size of SR was 10 of 200 training data points to get an example of a batch size which is a rather large fraction of the total number of data points (5%). The employed scripts with all details on the models, trainings, and error definitions are available on Zenodo [66] alongside the compiled raw results as well as plot and analysis scripts. Moreover, this repository as well as the Zenodo repository [67] contain the CoRe optimizer software, which is compatible with PyTorch. In addition, the lMLP software [68] was extended to integrate all optimizers and is also available in the Zenodo repository [66] alongside lMLP results as well as model and training details. The latter were taken over from Reference [13]. The lMLP training employed lifelong adaptive data selection and a fit fraction per epoch of 10% of all 7740 training structures.
Each ML task was performed for each optimizer setting with 20 different sets of random numbers. For reinforcement learning (NR and RA) even 100 different sets of random numbers were employed, as the fluctuations in the respective results were the largest. These sets were the same for each optimizer and they ensured differently initialized weights (and different selection of training and test data). The mean test set error $E_i^\mathrm{test}$ and its standard deviation $\Delta E_i^\mathrm{test}$ of ML task $i$ were calculated for each set as a function of the training epoch $n_\mathrm{epoch}$ to evaluate convergence. To determine the final accuracy, for the minimal test set error in each of the 20 trainings the mean $E_i^\mathrm{test,min}$ and standard deviation $\Delta E_i^\mathrm{test,min}$ were calculated, i.e., early stopping was applied. For reinforcement learning (NR and RA) the mean number of training episodes until a reward of 475 [69] was taken to quantify $E_i^\mathrm{test}$. The maximum number of training episodes was 2500, which was also used as error of unsuccessful trainings. For AEF 7 of 20 Momentum* trainings, 8 of 20 NAG trainings, and 12 of 20 NAG* trainings failed even for the best learning rate value. These trainings were penalized with a constant error of 1000. For lMLPs the total test loss according to Equation (10) in Reference [13] determined the training epoch with minimal error. In this way, the mean squared error of the energies was weighted with a factor $q^2 = 10.9^2$ in the loss function, while that of the atomic force components was not scaled. We evaluated the mean error based on the errors of all 20 lMLPs in each of the 20 training epochs where an individual lMLP showed minimal error, i.e., 400 error values were included. In this way, the error was still calculated from advanced training states, while it was also sensitive to the smoothness of the training processes, as early stopping is difficult to apply in practice in lifelong machine learning.
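The early-stopping statistic described above (mean and standard deviation of the minimal test set error over repeated trainings) can be sketched as follows. The use of the sample standard deviation and the handling of failed trainings through a constant penalty argument are assumptions; the names and numbers are hypothetical.

```python
import statistics


def early_stopping_summary(test_error_curves, failure_penalty=None):
    """Mean and standard deviation of the per-training minimal test set error.

    test_error_curves: one list of test errors per training (one value per
    epoch); an empty list marks a failed training, which is replaced by
    failure_penalty if that argument is given.
    """
    minima = []
    for curve in test_error_curves:
        if curve:
            minima.append(min(curve))
        elif failure_penalty is not None:
            minima.append(failure_penalty)
    return statistics.mean(minima), statistics.stdev(minima)


# Two illustrative trainings with three epochs each (made-up numbers).
mean_min, std_min = early_stopping_summary([[3.0, 2.0, 4.0], [5.0, 1.0, 2.0]])
```

Taking the minimum of every curve before averaging corresponds to stopping each training at its individually best epoch.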
To compare the final accuracy among different optimizers $k$ for ML task $i$, the inverse of the minimum test set error $E_{i,k}^\mathrm{test,min}$ relative to the result of the best performing optimizer in ML task $i$ was calculated as the accuracy score $A_i(k)$. The uncertainty of the accuracy score was calculated from an error propagation based on the test set error's standard deviation. For comparison of different optimizers $k$ with regard to the overall accuracy, the arithmetic mean $A(k)$ over all $N_\mathrm{tasks}$ ML task accuracy scores $A_i(k)$ was calculated. Its uncertainty was determined by propagating the errors of the independent variables $A_i(k)$.
The PyTorch version 2.0.0 [34] and its default settings were applied for the optimizers AdaDelta, AdaGrad, Adam, AdaMax, Momentum, NAG, RMSprop, RPROP, and SGD (see Tables S1 and S3 in the Supporting Information for all hyperparameter values). The momentum factor in Momentum and NAG was $\mu = 0.9$. In addition, scans of the performance-determining hyperparameters $\beta_1$, $\beta_2$, $\mu$, $\eta_-$, and $\eta_+$ were carried out for the PyTorch ML task examples in order to find their optimal values for this set of ML tasks. If the default values turned out to be the best ones, the second best choice was applied. The optimizers employing these modified hyperparameters (see Tables S1 and S3 in the Supporting Information) are marked with an asterisk (*). Weight decay was by default only applied in the CoRe optimizer. The learning rates $s_\xi^0$ of RPROP, RPROP*, and the CoRe optimizer were set to $10^{-3}$. For the learning rate $\gamma$ of the other optimizers and the maximal step size $s_\mathrm{max}$ of RPROP, RPROP*, and the CoRe optimizer, the values 0.0001, 0.001, 0.01, 0.1, and 1 were tested for each PyTorch ML task example. The value yielding the lowest $\Delta E_i^\mathrm{test,min}$ was employed in the performance evaluation (see Table S2 in the Supporting Information). For lMLP training the two most likely options according to the PyTorch ML task results were tested (see Table S4 in the Supporting Information).
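The accuracy scores and their uncertainties can be sketched as follows. The sketch assumes that the best error of a task is treated as an exact reference in the error propagation and that the task scores are independent when averaging; both are assumptions, since the corresponding equations are not reproduced here. All names and numbers are hypothetical.

```python
import math


def accuracy_scores(errors, stds):
    """Relative accuracy scores of all optimizers for one ML task.

    errors and stds map an optimizer name to its mean minimal test set error
    and the corresponding standard deviation.
    """
    best = min(errors.values())
    scores = {k: best / e for k, e in errors.items()}
    # First-order error propagation of best / e with respect to e only.
    uncertainties = {k: best / errors[k] ** 2 * stds[k] for k in errors}
    return scores, uncertainties


def overall_accuracy(task_scores, task_uncertainties):
    """Arithmetic mean over tasks and its propagated uncertainty."""
    n = len(task_scores)
    mean = sum(task_scores) / n
    uncertainty = math.sqrt(sum(u * u for u in task_uncertainties)) / n
    return mean, uncertainty


scores, unc = accuracy_scores({"A": 1.0, "B": 2.0}, {"A": 0.1, "B": 0.2})
total, total_unc = overall_accuracy([1.0, 0.5], [0.3, 0.4])
```

By construction the best optimizer of a task receives a score of 1, and an optimizer with twice the error receives 0.5.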

General Recommendations for CoRe Optimizer Hyperparameter Values
A generally applicable set of CoRe optimizer hyperparameter values has been obtained from our benchmark on nine ML tasks including seven different models and six different data sets. The training processes span the entire range from learning on small mini-batches to full data set batch learning. Based on this benchmark we generally recommend the hyperparameter values [...] and $t_\mathrm{hist} = 250$. The number of frozen weights per group $n_{\mathrm{frozen},\chi}$ can often be specified as a fraction of frozen weights per group $p_{\mathrm{frozen},\chi}$. Well working values of $p_{\mathrm{frozen},\chi}$ are typically in the interval between 0 (without stability-plasticity balance) and about 10%. The maximal step size $s_\mathrm{max}$ is recommended to be $10^{-3}$ for mini-batch learning, 1 for batch learning, and $10^{-2}$ for intermediate cases. $s_\mathrm{max}$ is the main hyperparameter, like the learning rate $\gamma$ in many other optimizers.

Optimizer Performance Evaluation for Diverse Machine Learning Tasks
To assess the performance of the CoRe optimizer in comparison to nine other optimizers with in total 16 different hyperparameter settings, relative accuracy scores for nine ML tasks were calculated for these optimizers (Figures 1 (a) and (b)). For mini-batch learning on small batch sizes (0.1% for AED, AEF, ICD, and ICF) the popular Adam optimizer and our CoRe optimizer perform best, while especially RPROP yields poor accuracy because it cannot handle stochastic gradient fluctuations well. RPROP is intended for batch learning, which becomes obvious from the high accuracy scores for SS and TS. For these ML tasks, RPROP and the CoRe optimizer achieve the highest accuracy scores. In the intermediate case, i.e., mini-batch learning with rather large batch sizes (5% for SR and 10% for lMLP training (Figure 4)), both Adam and RPROP perform well, with Adam having a small advantage over RPROP. However, the CoRe optimizer outperforms both in this case.
Moreover, the learning speed and reliability of the CoRe optimizer in reinforcement learning (NR and RA) are also better than for the other optimizers (Figures 1 (a) and (b)). RPROP is not able to learn the task in the maximal number of episodes for NR in any training. The CoRe optimizer's convergence speed of the mean test set errors for the other ML tasks is similar to Adam for mini-batch learning and similar to RPROP for batch learning (see Figures S1 to S7 and S9 in the Supporting Information).
In total, the CoRe optimizer achieves the highest final accuracy score in six tasks and lMLP training, Adam* in [...]. In general, for the chosen set of ML tasks, the optimizers which combine momentum and individually adapted learning rates (CoRe, Adam, and AdaMax) perform better than those which only apply individually adapted learning rates (RMSprop, AdaGrad, and AdaDelta) (Figure 2). However, the differences among the CoRe optimizer, Adam, and AdaMax are larger than the difference between RMSprop and AdaMax. The final accuracy obtained by pure SGD is significantly worse than that of the aforementioned optimizers. However, for these nine ML tasks it is still slightly better than that of the optimizers which employ only momentum (Momentum and NAG). The overall accuracy of RPROP is in between those applying individually adapted learning rates and SGD for these ML tasks. However, this order is, of course, dependent on the fraction of mini-batch and batch learning ML tasks.
The best single model performances obtained by the CoRe optimizer are provided in Table S5 and Figures S10 (a) and (b) and S11 in the Supporting Information.
For SS we can compare the final accuracy directly to the original work with 81.5% correct test set classifications [48]. Due to training by the CoRe optimizer, the best graph convolutional network for SS achieves a test set classification accuracy of 84.2%.

Performance Dependence on Hyperparameter Values
The CoRe optimizer's hyperparameters were tuned on this set of ML tasks, while the general hyperparameter recommendations of PyTorch for the other optimizers were not based on this benchmark set. To provide a fair comparison, we also applied hyperparameter values for the other optimizers which were adjusted on this set of ML tasks. The adjusted hyperparameters of AdaDelta*, AdaMax*, Momentum*, and RPROP* yielded an improvement of their overall accuracy scores (Figure 2). However, the gain is not sufficient to reach the overall accuracy scores in the next better class of optimizers described in the last section. Therefore, the choice of the optimization algorithm is confirmed to be crucial for the final accuracy of the ML model. The highest overall accuracy scores of Adam, NAG, and RMSprop were obtained with their generally recommended hyperparameter values. The second best choices of the hyperparameters yielded very similar overall accuracy scores. Another difference between the CoRe optimizer and the other optimizers was the application of a weight decay. However, Figures S13 and S14 in the Supporting Information show that the standard weight decay algorithm of Adam employed with four different hyperparameter values in general reduces the accuracy score for Adam. Only the weight decay algorithm of AdamW can lead to a small increase of the overall accuracy score. However, the gain is only a fraction of the overall accuracy score difference between the CoRe optimizer and the Adam optimizer. The weight decay of the CoRe optimizer only marginally affects the final accuracy on average (Figures S13 and S14 in the Supporting Information).
In the analysis of individual ML task performances, we note that RPROP and the CoRe optimizer show a slow convergence in the initial epochs of SS training (see Figure S6 in the Supporting Information). The reason is that large weight changes are required in the optimization and the initial step size $s_\xi^0$ is only set to 0.001. Higher values of $s_\xi^0$ result in faster convergence to a similar final accuracy, with $s_\xi^0 = 0.1$ yielding a much faster convergence than obtained with Adam (see Figure S7 in the Supporting Information). However, this ML task is an extreme example with few weight updates to adjust $s_\xi^\tau$ in batch learning and the need for large weight changes. Still, as the final accuracy is the same and in most applications $s_\xi^\tau$ is adapted quickly within a relatively small fraction of the weight updates, the initialization of $s_\xi^0$ is in general noncritical. Another edge case can be obtained for high maximal step size values $s_\mathrm{max}$ in the CoRe optimizer. While $s_\mathrm{max} = 1$ yields a high final accuracy in TS training when early stopping is applied, the training can become unstable when continued (see Figure S8 in the Supporting Information). However, reducing $s_\mathrm{max}$ to 0.1 already solves this issue (see Figure S9 in the Supporting Information).

Optimizer Performance in Training Lifelong Machine Learning Potentials
In the training of lMLPs rather large fractions of training data (10%) were employed in the loss function gradient calculation. In line with the results of the PyTorch ML task examples, this kind of training best suits the CoRe optimizer followed by Adam*, Adam, AdaMax*, and RPROP* (Figures 3 (a) and (b) and 4). Moreover, the general trend is confirmed that optimizers which are both adaptive and momentum-based perform best, while purely adaptive optimizers still yield better results than purely momentum-based optimizers. In contrast to the PyTorch ML task examples, where the stability-plasticity balance of the CoRe optimizer with $p_\mathrm{frozen}$ around 0.025 can only marginally improve the accuracy scores for AED, AEF, and SR and worsens the final accuracy for ICD and ICF (see Figure S12 in the Supporting Information), the lMLP training largely benefits from the stability-plasticity balance with $p_\mathrm{frozen} = 0.1$. We note that tuning $p_\mathrm{frozen}$, in addition to the maximal step size, extends the hyperparameter optimization capability for CoRe$_{p_\mathrm{frozen}=0.1}$ compared to CoRe and all other optimizers, for which only the maximal step size/learning rate was adjusted while all other hyperparameters were taken from the PyTorch ML task example results. This higher degree of freedom can also contribute to the optimization performance. However, a significant performance improvement was not obtained for any other hyperparameter tuning with the exception of tuning $\eta_-$.
Moreover, the stability-plasticity balance smoothens the training convergence as shown in the test set root mean square errors (RMSEs) of energies and atomic force components as a function of the training epochs (Figures ML tasks.The adjusted hyperparameters of AdaDelta * , AdaMax * , Momentum * , and RPROP * yielded an improvement of their overall accuracy scores (Figure 2).However, the gain is not sufficient to reach the overall accuracy scores in the next better class of optimizers described in the last section.Therefore, the choice of the optimization algorithm is confirmed to be crucial for the final accuracy of the ML model.The highest overall accuracy scores of Adam, NAG, and RMSprop were obtained with their generally recommended hyperparameter values.The second best choices of the hyperparameters yielded very similar overall accuracy scores.Another difference between the CoRe optimizer and the other optimizers was the application of a weight decay.However, Figures S13 and S14 in the Supporting Information show that the standard weight decay algorithm of Adam employed with four different hyperparameter values in general reduces the accuracy score for Adam.Only the weight decay algorithm of AdamW can lead to a small increase of the overall accuracy score.However, the gain is only a fraction of the overall accuracy score difference between the CoRe optimizer and the Adam optimizer.The weight decay of the CoRe optimizer only marginally affects the final accuracy on average (Figures S13 and S14 in the Supporting Information).
In the analysis of individual ML task performances, we note that RPROP and the CoRe optimizer show a slow convergence in the initial epochs of SS training (see Figure S6 in the Supporting Information).The reason is that large weight changes are required in the optimization and the initial step size s 0 ξ is only set to 0.001.Higher values of s 0 ξ result in faster convergence to a similar final accuracy, with s 0 ξ = 0.1 yielding a much faster convergence than obtained with Adam (see Figure S7 in the Supporting Information).However, this ML task is an extreme example with few weight updates to adjust s τ ξ in batch learning and the need of large weight changes.Still, as the final accuracy is the same and in most applications s τ ξ is fast adapted in a relatively small fraction of weight updates, the initialization of s 0 ξ is in general noncritical.Another edge case can be obtained for high maximal step size values s max in the CoRe optimizer.While s max = 1 yields a high final accuracy in TS training when early stopping is applied, the training can become unstable when continued (see Figure S8 in the Supporting Information).However, reducing s max to 0.1 already solves this issue (see Figure S9 in the Supporting Information).

Optimizer Performance in Training Lifelong Machine Learning Potentials
In the training of lMLPs rather large fractions of training data (10%) were employed in the loss function gradient calculation. In line with the results of the PyTorch ML task examples, this kind of training best suits the CoRe optimizer followed by Adam*, Adam, AdaMax*, and RPROP* (Figures 3 (a) and (b) and 4). Moreover, the general trend is confirmed that adaptive and momentum based optimizers perform best, while only adaptive optimizers still yield better results than only momentum based optimizers. In contrast to the PyTorch ML task examples, where the stability-plasticity balance of the CoRe optimizer with $p_\mathrm{frozen}$ around 0.025 can only marginally improve the accuracy scores for AED, AEF, and SR and worsens the final accuracy for ICD and ICF (see Figure S12 in the Supporting Information), the lMLP training largely benefits from the stability-plasticity balance with $p_\mathrm{frozen} = 0.1$. We note that tuning $p_\mathrm{frozen}$, in addition to the maximal step size, extends the hyperparameter optimization capability for CoRe$_{p_\mathrm{frozen}=0.1}$ compared to CoRe and all other optimizers, for which only the maximal step size/learning rate was adjusted while all other hyperparameters were taken from the PyTorch ML task example results. This higher degree of freedom can also contribute to the optimization performance. However, a significant performance improvement was not obtained for any other hyperparameter tuning with the exception of tuning $\eta_-$.
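The role of $p_\mathrm{frozen}$ can be pictured with a small sketch. It rests on loud assumptions: this section does not define the weight importance score or the exact freezing rule of the CoRe optimizer, so the code below simply treats the scores as given numbers and freezes the fraction `p_frozen` of weights with the highest score in the current update. The helper names `frozen_mask` and `apply_updates` are hypothetical.

```python
def frozen_mask(importance, p_frozen):
    """Mark the fraction p_frozen of weights with the highest importance.

    Assumption for illustration: a weight is frozen if its importance score
    is among the top p_frozen fraction; the actual CoRe rule may differ.
    """
    n_frozen = int(p_frozen * len(importance))
    ranked = sorted(range(len(importance)),
                    key=lambda i: importance[i], reverse=True)
    frozen = set(ranked[:n_frozen])
    return [i in frozen for i in range(len(importance))]


def apply_updates(weights, updates, mask):
    """Apply the updates only to weights that are not frozen."""
    return [w if is_frozen else w + u
            for w, u, is_frozen in zip(weights, updates, mask)]


# Ten weights with made-up importance scores; p_frozen = 0.1 freezes one.
importance = [0.2, 0.9, 0.1, 0.4, 0.3, 0.5, 0.05, 0.6, 0.7, 0.8]
weights = [1.0] * 10
updates = [-0.1] * 10

mask = frozen_mask(importance, p_frozen=0.1)
new_weights = apply_updates(weights, updates, mask)
print(mask)
print(new_weights)
```

In this sketch the frozen weight keeps its value (stability) while all other weights remain free to change (plasticity), which is the trade-off that $p_\mathrm{frozen}$ controls.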
Moreover, the stability-plasticity balance smooths the training convergence as shown in the test set root mean square errors (RMSEs) of energies and atomic force components as a function of the training epochs (Figures [...]) [...] the respective test set RMSE values decreased to only (3.4 ± 0.4) meV atom$^{-1}$ and (92 ± 4) meV Å$^{-1}$.
Finally, the comparison of computation time for training with Adam and the CoRe optimizer shows that not only the final accuracy but also the accuracy-cost ratio of the CoRe optimizer is better than that of Adam. For comparison of multiple trainings with the lMLP software, the time fraction of model fitting in the entire training process (including initialization, descriptor calculation, model fitting (about 87%), final prediction, and finalization) is calculated to reduce the influence of different computers and computation loads. The resulting speed is the same within the uncertainty interval for Adam and the CoRe optimizer. The additional operations in the CoRe optimizer algorithm cause only a small increase in computational cost, which is not significant in comparison to the cost of evaluating the loss function gradient. For the presented lMLP example, an optimizer step requires less than 0.2% of the time needed for a loss function gradient calculation. Since the CoRe optimizer requires only the loss function gradient as input, like Adam and the other optimizers, the computation time per training epoch is similar for all optimizers.

CONCLUSION
The CoRe optimizer combines Adam-like and RPROP-like weight-specific learning rate adaption. Moreover, in the CoRe optimizer step-dependent decay rates are employed in the calculation of Adam-like gradient moving averages, which are the basis of the RPROP-like step size updates. Its weight decay depends on the absolute weight update, and an optional stability-plasticity balance based on a weight importance score can be applied. In this way, the CoRe optimizer combines the high performance of the Adam optimizer in small mini-batch learning and that of RPROP in full data set batch learning, while it is superior to both in intermediate cases. With the general hyperparameter recommendation obtained in this work based on diverse ML tasks, the CoRe optimizer is a well-rounded all-in-one solution with broad applicability and with convergence speed and final accuracy on par with and beyond state-of-the-art first-order gradient-based optimizers.
The performance evaluation has further confirmed a general advantage for optimizers which combine momentum and individually adapted learning rates in terms of convergence speed and final accuracy compared to optimizers which are only adaptive or momentum based or none of these. Moreover, adaptive and/or momentum based methods need only marginally more computation time than simple SGD, which is negligible compared to the time required for loss function gradient calculation.
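As a reminder of what combining momentum with individually adapted learning rates means in practice, the following minimal sketch shows one standard Adam update for a single weight. It is the textbook Adam step with its usual default hyperparameters, written in plain Python for illustration; it is not the CoRe update, and the function name `adam_step` is hypothetical.

```python
import math


def adam_step(w, g, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update for a single weight w with gradient g.

    m and v are the running first and second moment estimates and
    t is the 1-based update counter.
    """
    # Momentum part: exponential moving average of the gradient.
    m = beta1 * m + (1.0 - beta1) * g
    # Adaptive part: exponential moving average of the squared gradient.
    v = beta2 * v + (1.0 - beta2) * g * g
    # Bias correction for the zero-initialized moving averages.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Weight-specific effective learning rate lr / (sqrt(v_hat) + eps).
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# First update of a weight starting at 1.0 with gradient 0.5.
w, m, v = adam_step(w=1.0, g=0.5, m=0.0, v=0.0, t=1)
print(w, m, v)
```

The additional work per weight is a handful of arithmetic operations, which is consistent with the statement that such methods cost only marginally more than simple SGD.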
Besides the general CoRe optimizer hyperparameter recommendation, only the maximal step size $s_\mathrm{max}$ needs to be set depending on the fluctuations in the gradient calculation, which can be estimated easily based on the application of mini-batch (0.001) or batch learning (1) or intermediate cases (0.01). Additionally, the stability-plasticity balance can be enabled by the hyperparameter $p_\mathrm{frozen}$. It can achieve smoother training convergence to even higher final accuracy, yielding a large improvement in the example of lMLP training. We note that hyperparameter fine-tuning for ML tasks can, of course, improve the performance to some degree for all optimizers but comes with the drawback of being very time consuming.
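The recommendation for the maximal step size can be written down as a simple lookup. This is only a convenience sketch: the three values are those stated above, while the regime labels and the helper name `recommended_s_max` are hypothetical and not part of any published interface.

```python
# Maximal step size recommendations stated in the text above.
# The regime labels are hypothetical names chosen for this sketch.
RECOMMENDED_S_MAX = {
    "mini-batch": 0.001,    # strongly fluctuating gradients
    "intermediate": 0.01,   # intermediate cases
    "batch": 1.0,           # full data set batch learning
}


def recommended_s_max(regime):
    """Return the recommended maximal step size for a learning regime."""
    try:
        return RECOMMENDED_S_MAX[regime]
    except KeyError:
        valid = ", ".join(sorted(RECOMMENDED_S_MAX))
        raise ValueError(
            f"unknown regime {regime!r}; expected one of: {valid}"
        ) from None


for regime in ("mini-batch", "intermediate", "batch"):
    print(regime, recommended_s_max(regime))
```

These values are starting points; as noted above, task-specific fine-tuning can still improve the performance to some degree.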

Figure 1: Bar chart of the final accuracy scores $A_i$ (Equation (19)) of various ML tasks $i$ trained by different optimizers. The uncertainty interval (Equation (20)) is shown as cross-hatched bar around the upper edge of the bar which equals the mean of $A_i$ over 20 trainings (100 for NR and RA). 100% corresponds to the highest obtained final accuracy of all optimizer specifications. The acronyms of the ML tasks are explained in Table 1. The optimizers are listed in the legends and are represented by different colors. The learning rate (maximal step size for RPROP, RPROP*, and CoRe) was adjusted, while all other hyperparameters of the optimizers were set to (a) their general recommendation and (b) modified values. Exceptions are AdaGrad and SGD which do not include additional hyperparameters beyond the learning rate. The CoRe results are shown as reference in (b).

Figure 2: Bar chart of the final accuracy score A (Equation (21)) averaged over all ML tasks shown in Figures 1 (a) and (b) for different optimizers. The uncertainty interval (Equation (22)) is shown as cross-hatched bar around the right edge of the bar which equals the value of A. A value of 100% means that the optimizer achieves highest accuracy in every ML task. The bars are labeled and colored according to the respective optimizer.

Figure 3: Bar chart of the final accuracy scores $A_i$ (Equation (19)) of energy and force prediction of lMLPs trained by different optimizers. The uncertainty interval (Equation (20)) is shown as cross-hatched bar around the upper edge of the bar which equals the mean of $A_i$ over 20 trainings. 100% corresponds to the highest obtained final accuracy of all optimizer specifications, i.e., the lowest RMSE in the prediction of energies or atomic force components. The colors of most optimizers are listed in the legends of Figures 1 (a) and (b). The learning rate (maximal step size for RPROP and CoRe specifications) was adjusted, while all other hyperparameters of the optimizers were set to (a) their general recommendation and (b) modified values. Exceptions are AdaGrad and SGD which do not include additional hyperparameters beyond the learning rate. The CoRe results are shown as reference in (b).

Figure 4: Bar chart of the final accuracy score A (Equation (21)) combining energy and force prediction of lMLPs for different optimizers. The uncertainty interval (Equation (22)) is shown as cross-hatched bar around the right edge of the bar which equals the value of A. A value of 100% means that the optimizer achieves highest accuracy in energy and force prediction. The bars are labeled and colored according to the respective optimizer.

Figure 5: Test set RMSEs of (a) energy $E^\mathrm{test}$ and (b) atomic force components $F_{\alpha,n}^\mathrm{test}$ as a function of the training epoch $n_\mathrm{epoch}$ for the lMLP compared to the DFT reference. The results are shown for the eight optimizers yielding highest final accuracy. The less often a line is broken, the lower is the final error. Uncertainty intervals are shown in pale color of the respective line.

Table 1: Overview of the ML tasks and the respective data sets including their acronyms.

[...] in one task (Figures 1 (a) and (b)). However, in the six cases where the CoRe optimizer performs best, the second best optimizer is always within the uncertainty interval of the CoRe optimizer's accuracy score. Still, there is no single optimizer which is always within the uncertainty interval. For example, Adam, RMSprop, RMSprop*, and SGD are within the uncertainty interval for ML task ICF, only AdaMax* for SR, and AdaMax*, RPROP, and RPROP* for TS, whereas the CoRe optimizer is always within the uncertainty interval of the best optimizer for the other three ML tasks. Hence, even if there is no clear dominance for individual ML tasks, the CoRe optimizer is among the best optimizers in all these ML tasks resulting in, on average, the best performance and the broadest applicability. Therefore, the CoRe optimizer is well-rounded and achieves the highest overall accuracy score (Figure 2). The overall accuracy score of Adam is second highest, while those of AdaMax* and Adam* are almost equal to that of Adam. The uncertainty interval of the CoRe optimizer's overall accuracy score overlaps slightly with that of the Adam optimizer. We note that the uncertainty interval of the Adam optimizer's results is also the largest among all results.