
Perspective: new insights from loss function landscapes of neural networks


Published 9 April 2020 © 2020 The Author(s). Published by IOP Publishing Ltd
Citation: Sathya R Chitturi et al 2020 Mach. Learn.: Sci. Technol. 1 023002, DOI 10.1088/2632-2153/ab7aef


Abstract

We investigate the structure of the loss function landscape for neural networks subject to dataset mislabelling, increased training set diversity, and reduced node connectivity, using various techniques developed for energy landscape exploration. The benchmarking models are classification problems for atomic geometry optimisation and hand-written digit prediction. We consider the effect of varying the size of the atomic configuration space used to generate initial geometries and find that the number of stationary points increases rapidly with the size of the training configuration space. We introduce a measure of node locality to limit network connectivity and perturb permutational weight symmetry, and examine how this parameter affects the resulting landscapes. We find that highly-reduced systems have low capacity and exhibit landscapes with very few minima. On the other hand, small amounts of reduced connectivity can enhance network expressibility and can yield more complex landscapes. Investigating the effect of deliberate classification errors in the training data, we find that the variance in testing AUC, computed over a sample of minima, grows significantly with the training error, providing new insight into the role of the variance-bias trade-off when training under noise. Finally, we illustrate how the number of local minima for networks with two and three hidden layers, but a comparable number of variable edge weights, increases significantly with the number of layers, and as the number of training data decreases. This work helps shed further light on neural network loss landscapes and provides guidance for future work on neural network training and optimisation.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

In this report we analyse the structure of the loss function landscape (LFL) for neural networks. Here, the landscape refers to the loss as a function of the trainable parameters (node weights), and we exploit computational tools developed for exploration of energy landscapes (ELs) in molecular science [1]. The principal focus is on the organisation of local minima of the loss function, which correspond to the isomers of a molecule. This organisation is defined by the pathways between local minima mediated by transition states, which are stationary points of Hessian index one, with precisely one negative Hessian eigenvalue [2]. The connection between a molecular EL and a LFL has been developed in previous work, as summarised below. We have previously referred to the LFL as a machine learning landscape (MLL), and we will employ these descriptions interchangeably in the present contribution.

Unfortunately, direct analysis of the loss landscape is challenging due to issues of computational complexity [3]. The high dimensionality employed in deep learning representations produces poorly conditioned problems for optimisation, and leads to slow convergence. Furthermore, the number of stationary points grows exponentially with the dimensionality of the problem [4], as in molecular science [5, 6]. Nevertheless, the power and utility of recent machine learning techniques is remarkable, and part of the motivation for the present work is to understand these advances in terms of the underlying LFL.

Choromanska et al have previously considered the performance of various local minima for neural networks [7]. A good performance, by their metric, corresponds to high accuracy for both an independent training and test set. They show that theoretically, subject to a number of assumptions of independence, neural network optimisation reduces to minimising the energy of the spin-glass Hamiltonian from statistical physics [7]. Based on the spin-glass model, bounds can be derived suggesting that there exists a tight band of local minima lying just above the global minimum, characterised by low training and testing errors. Furthermore, in this model it is exponentially less likely to find a minimum with relatively high testing error as the dimensionality of the neural network grows [7]. These results suggest that almost any local minimum that is found via standard optimisation techniques should perform comparably to any other local minimum on an unseen test set [7].

Wu et al agree with the conclusion that the majority of local minima solutions of the loss landscape tend to have properties similar to those of the global minimum [8]. This work suggests that neural networks may generalise well because they yield simple solutions with small Hessian norm. A theoretical analysis of two-layer networks indicates that these simple solutions occur because the volumes of the basins of attraction for minima with high test error are exponentially dominated by the volumes of the basins of attraction for minima with low test errors. In other words, good solutions lie in large, flat regions of parameter space and bad solutions lie in small, sharp regions [8].

Li et al proposed a filter-wise normalisation scheme to preserve scale-invariant properties of neural networks, which allows for comparison between different architectures and landscapes [9]. Low-dimensional 2D contour plots were created to investigate the loss function along random directions near chosen minima. By studying a variety of different architectures on the CIFAR-10 dataset, Li et al suggest that flat minima tend to generalise better than sharp minima. Furthermore, shallow, wide neural networks have contour surfaces with a convex appearance, which might make them more generalisable. Nguyen et al agree with this description, showing that, if a network has a pyramid-like structure following a very wide layer, then local minima are very close to the global minimum and the surface is much easier to navigate [10].

Some of the assumptions made in the above theoretical models are quite restrictive, and may not hold for examples of practical interest. Furthermore, low-dimensional representations of the landscape can misrepresent the underlying non-convexity present in higher dimensions. In addition, it is possible to manipulate the dataset and optimisation problem to create solutions with very high training accuracies but with arbitrarily low testing accuracies. This scenario can be achieved by adding a tunable attacking term to the cost function or deliberately misassigning labels during training [8]. Furthermore, it is possible to create datasets in which specific initialisation schemes will either not converge or converge to high-lying loss solutions [11].

To avoid the problems of low-dimensional projection and restrictive theoretical assumptions, the present work builds on previous considerations of the LFL as an EL [12–15]. ELs in molecular science [12, 13, 16–18] are defined in terms of the potential energy (PE), with minima corresponding to physically stable structures, which can interconvert via transition states. Minima are defined geometrically, as stationary points with non-negative Hessian eigenvalues. Transition states are defined as stationary points with exactly one negative Hessian eigenvalue (index one saddle points) [2]; the Murrell–Laidler theorem guarantees that the pathway with the lowest barrier between two minima involves only transition states, and not higher index saddles [1, 2]. By investigating the correspondence between a potential energy surface (PES) and the neural network loss function, where the atomic configuration space becomes the neural network parameter space, many of the tools developed in EL research can be used to study neural network landscapes [12, 13].

We have recently compared the landscapes for neural networks with one, two and three hidden layers for a similar number of fitting parameters [18]. In principle, a single hidden layer with enough nodes is sufficient to fit a well-behaved function [19], although the required number of hidden nodes scales exponentially with the number of parameters [20]. In the present contribution we report new results for the properties of such networks to investigate the structure of the underlying LFL. In particular, we consider the effect of systematically removing certain edges from the network to reduce the connectivity, and the effects of training set mislabelling. In addition, we present some results for neural networks with multiple hidden layers for comparison.

Our goal in this research is to understand the behaviour of relatively small neural networks, where the underlying solution landscape can be properly characterised. We hope that the resulting insight will carry over to large networks, where there may be too many parameters to locate even a single local minimum.

2. Defining the network

We begin with a standard single hidden-layer neural network architecture [21] containing input, output, and hidden nodes, plus a bias added to the sum of edge weights used as input to the activation function for each hidden node, ${w}_{j}^{\mathrm{bh}}$, and each output node, ${w}_{i}^{\mathrm{bo}}$. For the classification problem described in §4 the inputs correspond to interatomic distances for starting point geometries in a triatomic cluster, and there are Nout = 4 possible outputs, corresponding to the four local minima that the cluster can adopt, as in previous work [13, 14, 22]. Each training or test data item α comprises ${N}_{\mathrm{in}}$ inputs written as ${{\bf{x}}}^{\alpha }=\{{x}_{1}^{\alpha },\ldots ,\,{x}_{{N}_{\mathrm{in}}}^{\alpha }\}$, and a set of Ndata input data is written as ${\bf{X}}=\{{{\bf{x}}}^{1},\ldots ,\,{{\bf{x}}}^{{N}_{\mathrm{data}}}\}$.

The outputs, $y_i$, were calculated as

${y}_{i}({\bf{W}};{{\bf{x}}}^{\alpha }) = \displaystyle\sum_{j=1}^{{N}_{\mathrm{hidden}}}{w}_{{ij}}^{(1)}\tanh\left(\displaystyle\sum_{k=1}^{{N}_{\mathrm{in}}}{w}_{{jk}}^{(2)}\,{x}_{k}^{\alpha }+{w}_{j}^{\mathrm{bh}}\right)+{w}_{i}^{\mathrm{bo}}, \qquad (1)$

for a given input data item ${{\bf{x}}}^{\alpha }$, and weights ${w}_{{ij}}^{(1)}$ between hidden node j and output i, ${w}_{{jk}}^{(2)}$ between input k and hidden node j, and bias weights ${w}_{j}^{\mathrm{bh}}$ and ${w}_{i}^{\mathrm{bo}}$, collected into the vector W. Softmax probabilities, ${p}_{c}({\bf{W}};{{\bf{x}}}^{\alpha })$, were obtained from the outputs to reduce the effect of outliers:

${p}_{c}({\bf{W}};{{\bf{x}}}^{\alpha }) = \dfrac{\exp[{y}_{c}({\bf{W}};{{\bf{x}}}^{\alpha })]}{\sum_{i=1}^{{N}_{\mathrm{out}}}\exp[{y}_{i}({\bf{W}};{{\bf{x}}}^{\alpha })]}. \qquad (2)$

The loss function, which defines the local minima and transition states of the MLL, was written as the sum of a cross-entropy term and an L2 regularisation term with coefficient λ > 0:

$E({\bf{W}};{\bf{X}}) = -\displaystyle\sum_{\alpha =1}^{{N}_{\mathrm{data}}}\ln {p}_{c(\alpha )}({\bf{W}};{{\bf{x}}}^{\alpha })+\lambda \displaystyle\sum_{w\in {\bf{W}}}{w}^{2}, \qquad (3)$

where c(α) is the known outcome for input data item α in the training set. The regularisation term biases against large values for the weights and shifts any zero eigenvalues of the Hessian (second derivative) matrix, which would otherwise complicate transition state searches [15, 23]. To accelerate computation of the potential, a GPU version [24] of the loss function and gradient was also implemented and is available in the public domain GMIN and OPTIM programmes [25–27].
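For concreteness, the following minimal NumPy sketch combines equations (1)–(3) into a single loss evaluation. It is an illustration rather than the GMIN/OPTIM implementation: the tanh activation, the array shapes, the function name and the absence of any normalisation of the cross-entropy sum are assumptions made here.

```python
import numpy as np

def loss(W1, W2, b_h, b_o, X, labels, lam):
    """Minimal sketch of equations (1)-(3) for a single hidden-layer network.

    W1 : (N_out, N_hidden) hidden-to-output weights  w^(1)
    W2 : (N_hidden, N_in)  input-to-hidden weights   w^(2)
    b_h, b_o : hidden and output bias weights
    X : (N_data, N_in) inputs; labels : (N_data,) integer class labels c(alpha)
    """
    # Equation (1): network outputs y_i for every data item (tanh activation assumed)
    hidden = np.tanh(X @ W2.T + b_h)            # (N_data, N_hidden)
    y = hidden @ W1.T + b_o                     # (N_data, N_out)

    # Equation (2): softmax probabilities (shifted by the row maximum for stability)
    y = y - y.max(axis=1, keepdims=True)
    p = np.exp(y) / np.exp(y).sum(axis=1, keepdims=True)

    # Equation (3): cross-entropy over the known outcomes plus L2 regularisation
    cross_entropy = -np.log(p[np.arange(len(labels)), labels]).sum()
    regulariser = lam * sum((w ** 2).sum() for w in (W1, W2, b_h, b_o))
    return cross_entropy + regulariser
```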

3. Characterisation of the LFL

To train each network we minimise the loss function, E(W; X), with respect to the variables ${w}_{{ij}}^{(1)},{w}_{{jk}}^{(2)},{w}_{j}^{\mathrm{bh}}$ and ${w}_{i}^{\mathrm{bo}}$, written collectively as a vector of weights W. Basin-hopping global optimisation was used [28–30] to search for the global minimum, and all the distinct minima obtained during these searches were saved for later comparison. In this approach we take steps between local minima of the loss function, accepting or rejecting moves according to a simple Metropolis criterion [31] based upon the change in loss function, scaled by a parameter that plays the role of temperature. Downhill moves are always accepted, and the probability of accepting an uphill move depends on the fictitious temperature [28–30]. For the MLLs considered in the present work locating the global minimum is usually straightforward, and the choice of basin-hopping parameters is not critical. A customised LBFGS optimisation routine was employed for local minimisation, based on the limited memory version [32, 33] of the quasi-Newton Broyden [34], Fletcher [35], Goldfarb [36], Shanno [37] (BFGS) procedure.
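The basin-hopping loop described above can be sketched as follows. This is only an illustration: it uses SciPy's L-BFGS-B minimiser as a stand-in for the customised LBFGS routine, and the step size, temperature and rounding tolerance used to distinguish minima are arbitrary choices, not those of the GMIN implementation.

```python
import numpy as np
from scipy.optimize import minimize

def basin_hopping(loss_fn, w0, n_steps=100, step_size=0.5, temperature=1.0, seed=0):
    """Sketch of basin-hopping: perturb the weights, minimise locally, then accept or
    reject the new minimum with a Metropolis criterion on the change in loss."""
    rng = np.random.default_rng(seed)
    res = minimize(loss_fn, w0, method="L-BFGS-B")
    w_current, e_current = res.x, res.fun
    minima = {tuple(np.round(w_current, 6)): e_current}   # distinct minima found so far
    for _ in range(n_steps):
        trial = w_current + rng.uniform(-step_size, step_size, size=w_current.shape)
        res = minimize(loss_fn, trial, method="L-BFGS-B")
        minima[tuple(np.round(res.x, 6))] = res.fun
        # downhill moves always accepted; uphill moves accepted with Boltzmann probability
        accept = res.fun < e_current or rng.random() < np.exp(-(res.fun - e_current) / temperature)
        if accept:
            w_current, e_current = res.x, res.fun
    return minima
```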

Transition state candidates were determined using the doubly-nudged [38, 39] elastic band [40, 41] (DNEB) approach, which involves optimising a series of intermediate atomic configurations (images) connected by a harmonic potential. The transition state candidates were then refined using hybrid eigenvector-following [42–44], which involves systematic energy maximisation along just one Hessian eigenvector. Having determined a candidate transition state, the connected minima are located by minimisation following small displacements along the eigenvector corresponding to the unique negative eigenvalue. This method can be employed to create databases of connected local minima [45], which are analogous to kinetic transition networks [46–49]. Visualisation of the landscape was performed using disconnectivity graphs [50–52]. This approach segregates the EL into disjoint sets of minima that can interconvert within themselves below each energy threshold. Using this topological method, an undirected tree is constructed [53]. For the machine learning analysis, the vertical axis represents the neural network training loss. The branches of the graph correspond to the minima of the loss function. More specifically, each branch represents the vector of parameters containing the node-connectivity weights for the neural network, and terminates at a height on the vertical axis corresponding to the training loss function. The branches join together at regularly spaced intervals on the vertical axis when they can interconvert via pathways mediated by index one saddle points (transition states). At the highest threshold all the minima lie in the same group because there are no infinite barriers on the landscape, and only one vertical branch remains in the graph.
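The grouping that underlies a disconnectivity graph, namely which minima can interconvert below a given loss threshold, can be illustrated with a simple union-find pass over a database of minima and transition states. The data structures below are hypothetical stand-ins for the stationary-point databases produced by OPTIM.

```python
def superbasins(minima_loss, transition_states, threshold):
    """Group minima that can interconvert via transition states below a loss threshold.

    minima_loss       : dict {minimum_id: loss value}
    transition_states : iterable of (ts_loss, min_a, min_b) triples
    Returns a list of sets of minima; each set becomes one group of branches joined
    below this threshold in the disconnectivity graph.
    """
    parent = {m: m for m, e in minima_loss.items() if e <= threshold}

    def find(m):                           # union-find with path compression
        while parent[m] != m:
            parent[m] = parent[parent[m]]
            m = parent[m]
        return m

    for ts_loss, a, b in transition_states:
        if ts_loss <= threshold and a in parent and b in parent:
            parent[find(a)] = find(b)      # minima a and b interconvert below threshold

    groups = {}
    for m in parent:
        groups.setdefault(find(m), set()).add(m)
    return list(groups.values())
```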

Analytic first and second derivatives were programmed for E(W; X) in the public domain GMIN and OPTIM codes for exploration of the corresponding LFLs [25–27]. Further details are provided elsewhere, including a review of the EL perspective in the context of machine learning [12]. Performance of the neural networks was measured using standard area under curve (AUC) metrics. The AUC metric ranges from 0 to 1, with AUC = 0.5 signifying random performance. If the AUC value is <0.5, the model performs worse than a random guess, and if the AUC is >0.5, it performs better. The AUC is calculated by determining the true positive and false positive statistics for the machine learning problem as a function of the threshold probability, P, for predicting convergence to one of the outcomes. Details are provided in the supplementary information (SI), which is available online at stacks.iop.org/MLST/1/023002/mmedia.
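A minimal sketch of this AUC calculation is given below: the threshold probability P is swept from one to zero, true and false positive rates are accumulated, and the resulting ROC curve is integrated. The one-versus-rest formulation, the number of threshold values and the function name are illustrative assumptions.

```python
import numpy as np

def auc_one_vs_rest(p_class, is_class):
    """Sketch of the AUC: sweep the threshold probability P, record true and false
    positive rates, and integrate the resulting ROC curve.

    p_class  : predicted probability of the chosen outcome for each data item
    is_class : boolean array, True where that outcome actually occurred
    """
    p_class, is_class = np.asarray(p_class), np.asarray(is_class, dtype=bool)
    tpr, fpr = [], []
    for P in np.linspace(1.0, 0.0, 101):       # sweep the threshold from 1 down to 0
        predicted = p_class >= P
        tpr.append(np.sum(predicted & is_class) / max(is_class.sum(), 1))
        fpr.append(np.sum(predicted & ~is_class) / max((~is_class).sum(), 1))
    return np.trapz(tpr, fpr)                  # area under the ROC curve
```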

For the single-layered architectures, we use a short-hand, [A,B,C,D,E], to refer to the number of inputs, hidden nodes, outputs, training data and regularisation constant, respectively. For example, in the geometry optimisation classification problem the [2,10,4,1000,0.0001] architecture corresponds to 74 optimisable parameters (two input bond lengths, 10 hidden nodes and 4 output classes with 1000 training points and a regularisation constant of 0.0001), and for the MNIST dataset the [784,10,10,1000,0.1] architecture corresponds to 7960 optimisable parameters.
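The quoted parameter counts follow directly from this shorthand, as the following small helper illustrates (the function name is ours, not part of the paper's code):

```python
def n_parameters(n_in, n_hidden, n_out):
    """Optimisable parameters for a single hidden-layer [A,B,C,...] architecture:
    input-hidden and hidden-output weights plus one bias per hidden and output node."""
    return n_in * n_hidden + n_hidden * n_out + n_hidden + n_out

assert n_parameters(2, 10, 4) == 74        # [2,10,4,1000,0.0001] architecture
assert n_parameters(784, 10, 10) == 7960   # [784,10,10,1000,0.1] MNIST architecture
```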

4. Application to prediction of geometry optimisation outcomes for an atomic cluster

The first classification problem that we consider involves predicting the outcome of local minimisation for a triatomic cluster, as in previous reports [13, 14, 22]. Here we emphasise that we are not using machine learning to perform the optimisation [5456], but instead to predict the outcome from a given starting configuration.

The PES for the cluster is defined by a two-body Lennard-Jones [57] potential and a three-body Axilrod–Teller [58] term, weighted by a coefficient Z. The total PE for this LJAT3 cluster for particle positions ${{\bf{r}}}_{i}$ and separations ${r}_{{ij}}$ is then

$E = 4\varepsilon \displaystyle\sum_{i\lt j}\left[\left(\dfrac{\sigma }{{r}_{{ij}}}\right)^{12}-\left(\dfrac{\sigma }{{r}_{{ij}}}\right)^{6}\right]+Z\,\dfrac{1+3\cos {\theta }_{1}\cos {\theta }_{2}\cos {\theta }_{3}}{{({r}_{12}{r}_{13}{r}_{23})}^{3}}, \qquad (4)$

where the internal angles of the triangle defined by atoms i, j and k are θ1, θ2 and θ3. We choose Z = 2, for which the molecular PES has an equilateral triangle minimum and three linear minima, one with each of the three atoms in the central position. In all the AUC calculations for this problem we chose to refer the threshold probability P to the outcome corresponding to the equilateral triangle minimum.
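A minimal sketch of this potential is given below, assuming reduced units (ε = σ = 1 by default) and using the law of cosines for the internal angles; it is intended only to illustrate equation (4) for three atoms, not to reproduce the GMIN implementation or its analytic gradients.

```python
import numpy as np

def ljat3_energy(coords, Z=2.0, eps=1.0, sigma=1.0):
    """Sketch of the LJAT3 potential: pairwise Lennard-Jones terms plus a single
    Axilrod-Teller triple term weighted by Z (reduced units assumed)."""
    r = np.asarray(coords).reshape(3, 3)          # three atoms, Cartesian coordinates
    r12 = np.linalg.norm(r[0] - r[1])
    r13 = np.linalg.norm(r[0] - r[2])
    r23 = np.linalg.norm(r[1] - r[2])

    def lj(rij):
        x = (sigma / rij) ** 6
        return 4.0 * eps * (x * x - x)

    pair = lj(r12) + lj(r13) + lj(r23)

    # cosines of the internal angles from the law of cosines
    c1 = (r12**2 + r13**2 - r23**2) / (2.0 * r12 * r13)
    c2 = (r12**2 + r23**2 - r13**2) / (2.0 * r12 * r23)
    c3 = (r13**2 + r23**2 - r12**2) / (2.0 * r13 * r23)
    triple = Z * (1.0 + 3.0 * c1 * c2 * c3) / (r12 * r13 * r23) ** 3
    return pair + triple
```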

This cluster, denoted LJAT3, defines a multinomial logistic regression problem. Local minimisation for any starting configuration will terminate in one of the four minima, and we seek to predict this outcome given inputs corresponding to the initial geometry. The configuration is uniquely specified by the three interatomic distances r12, r13, and r23, and given sufficient training data in this three-dimensional space a large enough neural network can make accurate predictions by learning the basins of attraction [1, 59] of the four minima. Here, however, we limit the inputs to two of the three distances, namely r12 and r13. The basins of attraction of the equilateral triangle and the linear minimum with atom 1 in the middle overlap in the space defined by the missing coordinate, r23. Hence the best predictions possible should correspond to networks that learn the marginal probabilities for the different outcomes in the lower dimensional space.

We note that this sort of classification problem is not only a convenient benchmark, but also has practical applications. Knowledge of the relative configuration volumes for the catchment basins of different minima can be used to calculate thermodynamic properties [60], and predicting outcomes without running local minimisation to convergence would provide a way to save computational resources [61].

Two databases (D1 and D2) of initial configurations and outcomes were considered, as in previous work [18]. Starting geometries were generated by randomly distributing the three atoms in a cube of side length L. The datasets involved 200,000 minimisations for cube lengths of $L=2\sqrt{3}\sigma $ (D1) and $L=1.385\sigma $ (D2). A third dataset, D3, with $L=2\sqrt{2}\sigma $, was also created; results for this dataset are reported in the Supplementary Information (SI). In each case the data was divided into two halves for training and testing purposes. All the local minimisations, which define the outcome and classification label for each data item (consisting of the initial r12 and r13 values), were performed using the customised LBFGS algorithm [32, 33] described above. The convergence condition on the root mean square gradient was ${10}^{-10}\varepsilon /\sigma $.
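The data generation can be sketched as follows, reusing the ljat3_energy function above and SciPy's L-BFGS-B minimiser as a stand-in for the customised LBFGS routine; assigning each converged geometry to one of the four minima is left as a comment, and the tolerance is only a rough proxy for the RMS gradient condition.

```python
import numpy as np
from scipy.optimize import minimize

def make_dataset(n_samples, L, seed=0):
    """Sketch of D1/D2-style data generation: random starting geometries for three
    atoms in a cube of side L, the two input distances (r12, r13), and local
    minimisation of the LJAT3 potential to define the outcome."""
    rng = np.random.default_rng(seed)
    inputs, final_coords = [], []
    for _ in range(n_samples):
        coords = rng.uniform(0.0, L, size=9)                 # three atoms in the cube
        r12 = np.linalg.norm(coords[0:3] - coords[3:6])
        r13 = np.linalg.norm(coords[0:3] - coords[6:9])
        res = minimize(ljat3_energy, coords, method="L-BFGS-B", tol=1e-10)
        inputs.append((r12, r13))                            # network inputs
        final_coords.append(res.x)   # the minimised geometry determines the class label
    return np.array(inputs), np.array(final_coords)
```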

4.1. Landscapes subject to dataset mislabelling

Many real datasets of interest have significant label noise [62, 63], arising from difficulties in the data cleaning and acquisition processes, or simply from ambiguous class differentiation criteria. Additionally, to reduce acquisition costs, many practitioners prefer to obtain large amounts of low-quality data, rather than small amounts of high-quality data. While this scenario allows for the creation of a much larger labelled training set, it also has the potential to greatly deteriorate the quality of the dataset [62–65]. In light of the advantages of acquiring cheap data, much effort has been dedicated to improving the robustness of training neural networks under noise. Previously, it has been demonstrated that neural networks can perform well under uniform label noise [62, 64], even retaining predictive capability in regimes where the ratio of noisy data to clean data exceeds 100 to 1. One possible explanation is that this phenomenon is a result of a filtering effect due to favourable gradient cancellation [62]. On the other hand, it is known that neural networks perform poorly for more sophisticated noise models, including both stochastic [64] and adversarial noise [66].

Here we have analysed the uniform mislabelling case to see if the landscape approach can provide insight into how neural networks learn under noise. To study this problem, we permuted a fixed percentage of training outcomes for 1000 input data items. In this scenario, an outcome i would be mapped to any other outcome j with probability $\tfrac{1}{N-1}$, where N is the number of output classes. Specifically, for the D1 and D2 datasets (four outputs), class i could be mislabelled to that of any class $j\ne i$ with equal probabilities of $\tfrac{1}{3}$. Similarly, for the MNIST dataset [67], each of the ten output classes has nine possible options for mislabelling with corresponding probabilities of $\tfrac{1}{9}$. It is important to note that the mislabelling procedure was applied only to the training dataset in order to study relevant properties on a clean unseen testing set.
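A sketch of this uniform mislabelling procedure is given below; the function name and random-number handling are illustrative.

```python
import numpy as np

def mislabel(labels, error_fraction, n_classes, seed=0):
    """Sketch of uniform mislabelling: a fixed fraction of training labels is selected
    and each is reassigned to one of the other n_classes - 1 classes with equal
    probability 1/(n_classes - 1)."""
    rng = np.random.default_rng(seed)
    noisy = np.array(labels).copy()
    n_flip = int(round(error_fraction * len(noisy)))
    for i in rng.choice(len(noisy), size=n_flip, replace=False):
        alternatives = [c for c in range(n_classes) if c != noisy[i]]
        noisy[i] = rng.choice(alternatives)
    return noisy
```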

Previous work by Rolnick et al used a fixed amount of correct training data, rather than a fixed total amount of training data with a specified error percentage. In our analysis, we opt for an error percentage formulation, as the number of stationary points decreases with the amount of training data [18], complicating the interpretation of our disconnectivity graph analysis. Here, we note that the distribution of outcomes varies with the size of the configuration space. For example, the more compact dataset (D2) contains a larger number of equilateral triangle minima (class 0). Since we cannot uncouple the outcome distribution from the choice of configuration space in an unbiased manner, we studied all the relevant properties using the size of the configuration space as an extra variable parameter.

For the D1 and D2 datasets, we were able to perform a (near) exhaustive search of the low-lying minima for landscapes with fixed error percentages of 0%, 10%, 50% and 100% for the [2,10,4,1000,0.0001] neural network architecture. Results for the D3 dataset are presented in the SI.

The number of minima and transition states (Min and Ts), as well as the loss associated with the training global minimum (Gmin Loss), are shown in tables 1 and 2. All training set distributions are available in the SI.

Table 1. Summary of results for the D1 dataset. Min and Ts refer to the number of minima and transition states, while Gmin refers to the minimum with the lowest loss value.

| Error (%) | Min | Ts | Training: Gmin AUC, Loss | Training: $\overline{{\rm{AUC}}}$, σ(AUC) | Incorrect: $\overline{{\rm{AUC}}}$, σ(AUC) | Correct: $\overline{{\rm{AUC}}}$, σ(AUC) | Testing: Gmin AUC, Loss | Testing: $\overline{{\rm{AUC}}}$, σ(AUC) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 122 | 592 | 0.749, 0.850 | 0.746, 0.0035 | n/a | 0.746, 0.0035 | 0.732, 0.891 | 0.733, 0.0025 |
| 10 | 266 | 960 | 0.727, 1.000 | 0.724, 0.0036 | 0.509, 0.015 | 0.747, 0.0034 | 0.720, 0.726 | 0.726, 0.0043 |
| 50 | 394 | 1474 | 0.639, 1.291 | 0.638, 0.0029 | 0.539, 0.0079 | 0.760, 0.0072 | 0.706, 1.131 | 0.699, 0.0083 |
| 100 | 490 | 1395 | 0.589, 1.321 | 0.591, 0.0061 | 0.591, 0.0061 | n/a | 0.336, 1.918 | 0.340, 0.013 |

Table 2. Summary of results for the D2 dataset. Min and Ts refer to the number of minima and transition states, while Gmin refers to the minimum with the lowest loss value.

| Error (%) | Min | Ts | Training: Gmin AUC, Loss | Training: $\overline{{\rm{AUC}}}$, σ(AUC) | Incorrect: $\overline{{\rm{AUC}}}$, σ(AUC) | Correct: $\overline{{\rm{AUC}}}$, σ(AUC) | Testing: Gmin AUC, Loss | Testing: $\overline{{\rm{AUC}}}$, σ(AUC) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 6 | 20 | 0.810, 0.519 | 0.810, 0.00033 | n/a | 0.810, 0.00033 | 0.797, 0.552 | 0.796, 0.00031 |
| 10 | 13 | 66 | 0.730, 0.791 | 0.728, 0.0021 | 0.190, 0.019 | 0.809, 0.0023 | 0.791, 0.622 | 0.791, 0.0020 |
| 50 | 26 | 155 | 0.604, 1.285 | 0.602, 0.0018 | 0.398, 0.010 | 0.779, 0.0054 | 0.741, 0.994 | 0.739, 0.0030 |
| 100 | 20 | 148 | 0.772, 1.236 | 0.768, 0.0047 | 0.768, 0.0047 | n/a | 0.242, 2.771 | 0.245, 0.0043 |

We found that, on average, the number of local minima and transition states increased with the percentage of mislabelled data for both datasets (tables 1 and 2). This observation suggests that a larger number of local minima reflects many competing values for the model parameters, and hence greater uncertainty in the statistical fit. Based on this reasoning, it is unsurprising that noisier datasets lead to greater uncertainty in fitting the training data. The loss value of the global minimum also increased with the percentage of mislabelled data (tables 1 and 2).

In addition, we observed that the larger the molecular configuration space (D1 > D2), the greater the number of minima and transition states; this trend also holds for D3 (SI). This result is expected as there should be greater uncertainty in predicting final outcomes from more diffuse initial molecular configurations. In other words, the diversity of the dataset depends on the size of the configuration space. This interpretation is further supported by the observation that the loss of the global minimum increases with configuration space size.

To study generalisation, we used the AUC value corresponding to the training global minimum (Gmin AUC) as a metric to characterise the performance of the neural network on the D1 and D2 datasets (tables 1 and 2). In both geometry optimisation datasets (D1 and D2), as the percentage of mislabelled data increased, the training and testing AUC for the global training minimum decreased (tables 1 and 2). This trend is consistent with expectations, as randomising labels should increase the generalisation error [65]. For 0% error, we observe relatively high AUC values for both training and testing; in particular, for the D1 and D2 datasets, the training AUCs outperform the corresponding testing AUCs, as expected. Interestingly, however, for 10% and 50% error, the testing AUCs outperform the training AUCs for the global training minimum (tables 1 and 2). This result implies that the neural network learns the structure of the correct data and filters out the noise [62]. Thus, because the training AUC is calculated on the mislabelled dataset, the networks appear to perform poorly, precisely because they have actually learned the correct underlying structure. However, since the testing AUC is calculated on a correctly labelled dataset, the networks perform significantly better. Note that when the error rate is increased to 100%, the training error is relatively low [65], as the network overfits to noise. However, the testing AUC decreases precipitously. This decrease is unsurprising because the neural network is fitted to noise, and thus cannot possibly generalise to an unseen dataset.

In addition to studying the properties of the training global minimum, we also calculate the average ($\overline{{\rm{AUC}}}$) and standard deviation [σ(AUC)] of the AUC values computed over all the training local minima in our database (tables 1 and 2). For the case of 0% error, we observe a tight band of low-lying local minima with high testing accuracies, which agrees with previous work [7, 8]. For increasing error percentages, we also observe the same trends for the average AUC values as those obtained using the training global minima, suggesting a general filtering mechanism for single-layered perceptrons under uniform label noise (tables 1 and 2). We also find that the variance of the testing AUC increases significantly with the percentage of training error (tables 1 and 2).

To further analyse these effects, we investigated the performance of the network on the mislabelled and correctly labelled entries of the (mislabelled) training dataset (tables 1 and 2). For both datasets (D1 and D2), the training AUC values for the correctly labelled components exceeded the corresponding testing AUC values (tables 1 and 2). From these results, it is clear that, even at high training errors, the network can distinguish clean data from noisy data.

To study the structure of the LFLs for single-layered perceptrons under uniform noise, we produced the corresponding disconnectivity graphs [50, 51], coloured by both training and testing AUC values, for the D1 and D2 datasets (figures 1–4).

Figure 1. Disconnectivity graphs for dataset D1, 1000 training points, λ = 0.0001, coloured by training AUC as a function of % label errors, as marked.

Figure 2. Disconnectivity graphs for dataset D1, 1000 training points, λ = 0.0001, coloured by testing AUC as a function of % label errors, as marked.

Figure 3. Disconnectivity graphs for dataset D2, 1000 training points, λ = 0.0001, coloured by training AUC as a function of % label errors, as marked.

Figure 4. Disconnectivity graphs for dataset D2, 1000 training points, λ = 0.0001, coloured by testing AUC as a function of % label errors, as marked.

Interestingly, single-funnelled ELs are observed in each case. Since even the graphs at 100% error have a funnelled appearance, the structure likely arises due to the single-layered feed-forward architecture, not the input data. These results are consistent with previous work on the appearance of single-layered neural network landscapes [8, 12, 13], as well as previous suggestions that noisy landscapes are no harder to train than clean landscapes [65].

As expected, for all error thresholds, low-lying minima correspond to high training AUCs. Furthermore, the training and testing AUC values are reasonably correlated for 0% error. This result is also unsurprising, as the premise of neural network training is that low-lying minima generalise well to unseen data. Interestingly, as the mislabelling percentage increases, the minima with better testing AUCs (green-blue in the graphs) are found at higher loss values, and the low-lying minima can have relatively low testing AUC values. This result highlights the bias-variance trade-off between over-fitting and generalisation. Some low loss minima overfit to noise, leading to high training AUCs and low testing AUCs. However, some high loss training minima can filter the noise more effectively and thus generalise well (i.e. higher testing AUC values). These results are consistent with the hypothesis that it can sometimes be better to converge to local minima, rather than the global minimum, to prevent overfitting [7]. Together, these results help explain why the testing variance for the AUC increases with the percentage of mislabelled training data.

For MNIST data, a similar picture emerges after mislabelling various fixed percentages of the training data. Since the architecture used here, [784,10,10,1000,0.1], has nearly 8000 optimisable parameters, our results are based on samples of low-lying minima (i.e. not exhaustive searching). These calculations were much more computationally expensive, and we used a GPU accelerated implementation for basin-hopping global optimisation [24]. Unlike the D1 and D2 datasets, we do not obtain higher testing accuracies relative to training accuracies as the error threshold is increased (table 3). However, analysis of neural network performance, averaged over database minima on the correct and incorrect portions of the mislabelled dataset, shows that the networks still perform significantly better on the clean segment of the training data, even with large amounts of noise (table 3).

Table 3. Summary statistics for MNIST dataset.

| Error (%) | Training: $\overline{{\rm{AUC}}}$, σ(AUC) | Incorrect: $\overline{{\rm{AUC}}}$, σ(AUC) | Correct: $\overline{{\rm{AUC}}}$, σ(AUC) | Testing: $\overline{{\rm{AUC}}}$, σ(AUC) |
| --- | --- | --- | --- | --- |
| 0 | 0.9996, 0.0027 | n/a | 0.9996, 0.0027 | 0.9687, 0.010 |
| 10 | 0.9783, 0.0070 | 0.7747, 0.050 | 0.9997, 0.0014 | 0.9645, 0.012 |
| 25 | 0.9545, 0.013 | 0.8011, 0.044 | 0.9991, 0.0025 | 0.9472, 0.018 |
| 40 | 0.9429, 0.015 | 0.8440, 0.032 | 0.9976, 0.0042 | 0.9304, 0.022 |
| 50 | 0.9390, 0.016 | 0.8707, 0.029 | 0.9950, 0.0062 | 0.9197, 0.024 |
| 60 | 0.9310, 0.017 | 0.8893, 0.026 | 0.9891, 0.010 | 0.8940, 0.031 |
| 75 | 0.9281, 0.020 | 0.9164, 0.024 | 0.9729, 0.019 | 0.7716, 0.061 |
| 100 | 0.9509, 0.015 | 0.9509, 0.015 | n/a | 0.2333, 0.072 |

It is again worth highlighting that, similar to the D1 and D2 case, there is a systematic trend towards increased testing AUC variance with the increase in dataset error, which indicates a change in the structure of the underlying landscape. Thus, while we do find good minima [62, 64], we also find many bad minima (figure 5).

Figure 5. Box-plots for training (a) and testing (b) AUC values for various error percentages on the MNIST dataset. The box extends from the first (Q1) to third (Q3) quartiles (25th to 75th percentiles, range ${Q}_{3}-{Q}_{1}={IQR}$, the interquartile range) with a band at the median.

Based on these numerical results it appears that the relatively tight band of local minima above the global minimum [7, 8] no longer exists for the mislabelled case. Furthermore, in almost every example, the variance of the testing AUC is greater than the variance of the training AUC. Thus, while it is possible to obtain high testing accuracies under uniform random error [62, 64], the landscape perspective indicates that the probability of finding such solutions diminishes as the error percentage increases. These results indicate that it might be valuable to further analyse the properties of the subset of minima that perform well under high training error. This approach might be particularly helpful in designing new optimisers to preferentially find good solutions when training under noise.

5. Landscapes with reduced connectivity

To investigate the effect of reduced connectivity between the layers of a network we defined locality via a simple distance metric. The nodes in each of the three layers were mapped onto a unit line at positions 0, $1/({N}_{\beta }-1),2/({N}_{\beta }-1),\ldots ,({N}_{\beta }-2)/({N}_{\beta }-1),1$, defining ${N}_{\beta }$ sites separated at intervals of $1/({N}_{\beta }-1)$. The distance between hidden node h and an input node i or output node o was then defined as

${d}_{hi}=\left|\dfrac{h-1}{{N}_{\mathrm{hidden}}-1}-\dfrac{i-1}{{N}_{\mathrm{in}}-1}\right|, \qquad {d}_{ho}=\left|\dfrac{h-1}{{N}_{\mathrm{hidden}}-1}-\dfrac{o-1}{{N}_{\mathrm{out}}-1}\right|, \qquad (5)$

for $1\leqslant h\leqslant {N}_{\mathrm{hidden}},1\leqslant i\leqslant {N}_{\mathrm{in}}$ and $1\leqslant o\leqslant {N}_{\mathrm{out}}$. The distances were sorted and the weights ${w}_{{ij}}^{(1)}$ and ${w}_{{jk}}^{(2)}$ corresponding to a specified number of nearest neighbours were retained. Weights corresponding to connections outside the neighbour cutoff were frozen at zero, with all bias weights retained. When it was necessary to choose between neighbours at the same distance we simply selected the input or output node with the lowest index i or o.
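This construction can be sketched as a boolean mask over each weight matrix; the code below is illustrative, with ties broken by a stable sort so that the lower node index is kept, as described above.

```python
import numpy as np

def connectivity_mask(n_hidden, n_other, n_neighbours):
    """Sketch of the locality scheme: nodes in each layer are placed on a unit line,
    and each hidden node keeps weights only to its n_neighbours nearest input (or
    output) nodes; a stable sort breaks ties in favour of the lower node index."""
    pos_hidden = np.linspace(0.0, 1.0, n_hidden)
    pos_other = np.linspace(0.0, 1.0, n_other)
    mask = np.zeros((n_hidden, n_other), dtype=bool)
    for h in range(n_hidden):
        distances = np.abs(pos_hidden[h] - pos_other)
        keep = np.argsort(distances, kind="stable")[:n_neighbours]
        mask[h, keep] = True        # these weights remain trainable
    return mask                     # entries that are False are frozen at zero
```

For the [2,10,4,...] architecture, connectivity_mask(10, 2, n) and connectivity_mask(10, 4, n) would then mark the retained input-hidden and hidden-output weights for n nearest neighbours, with the complementary entries frozen at zero.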

This scheme is related to the dropout procedure, where nodes are randomly removed during training [68, 69]. Dropout helps to prevent overfitting in large networks, and also reduces the problem of local regions of the network coadapting, which can degrade the predictive capabilities in testing. The present formulation is closer to the DropConnect procedure, which removes connections rather than nodes [70]. However, unlike both DropOut and DropConnect, the architecture remains fixed during training in our analysis. Furthermore, the connections are not removed at random, but instead are omitted to define a locality in the network. This construction is similar to previous work by LeCun et al who systematically reduced network connectivity using a weight saliency metric based on second derivative information [71]. The present analysis is designed to test whether the global connections between adjacent layers of the network are responsible for the single-funnelled appearance of the MLL, which has been observed in previous studies [13, 14, 17]. The potential ELs of atomistic systems generally exhibit more local minima and transition states for short-range forces [72–75]. Introducing reduced connectivity based on locality might have a systematic effect on MLLs, and we wish to investigate this possibility for the present setup. Our formulation also gives an indication of how reducing the capacity of a neural network is manifested in the underlying landscape.

The potential defined in terms of neighbourhood connectivity described above was used to generate databases of minima and transition states for the D1 dataset with the [2,10,4,1000,0.0001] and [2,5,4,1000,0.00001] single-layered architectures; this dataset was chosen because it has a relatively large number of minima. Landscapes for 1, 2, 3 nearest neighbours, and the fully-connected [2,10,4,1000,0.0001] model, corresponding to 40, 20, 10 and 0 frozen weights, were created and visualised using disconnectivity graphs. To study generalisability, all the minima obtained were coloured by testing AUC (figure 6).

Figure 6. Disconnectivity graphs for 1 (top left), 2 (top right) and 3 (bottom left) nearest neighbours for the D1 dataset, compared to the fully-connected architecture (bottom right). The colouring runs from red (low testing AUC) to blue (high testing AUC).

For two and three nearest neighbours, the number of stationary points increased significantly from the fully connected reference. This situation is consistent with previous results for interatomic potentials with short-range forces in molecular systems [72–75]. The present analysis also suggests that strong locality can induce more complex MLLs. This conjecture is supported by recent results for two- and three-layered perceptrons, which can have more locality than the single-layered perceptrons, and exhibit more local minima for a similar number of edge weight variables [18].

Local minima for the two- and three-neighbour networks performed reasonably well on an unseen testing set, with two nearest neighbours even outperforming the fully-connected model (figure 6). One possible reason for this phenomenon is the DropOut argument; i.e. the reduced neural network minimises the problem of local regions of network coadaptation, and instead produces a small number of connections, which are independently good at predicting the correct class [69, 71]. Another possibility is that the new network has broken symmetry and therefore no longer has highly degenerate solutions arising from parameter permutation, which may facilitate expression of more complex fitting functions [76]. This perspective is at least partially substantiated by the observation of much more complicated landscapes for reduced connectivity (figure 6). Interestingly, however, only two poorly performing minima were found for the one nearest neighbour model. This observation likely reflects the fact that the architecture has significantly reduced capacity, since more than half the trainable weights are zero. Overall, our results suggest that in terms of the landscape, optimal architectures may balance sparsity and expressiveness to perform well on unseen testing sets.

Although the reduced-connectivity landscapes obtained for the [2,10,4,1000,0.0001] architecture were significantly more frustrated than the fully-connected model, they were still relatively single-funnelled (figure 6). To determine whether we could obtain glassy or multi-funnelled landscapes, we visualised disconnectivity graphs for two and three nearest neighbours using the [2,5,4,1000,0.00001] architecture, which has a ten-fold reduction in the regularisation constant (figure 7).

Figure 7. Disconnectivity graphs for 2 (left) and 3 (middle) nearest neighbours with reduced connectivity for the D1 dataset, compared to the fully-connected architecture (right). These results are for five nodes in the hidden layer and λ = 0.00001.

Since the regularisation term is a convex L2 penalty, it is possible that part of the single-funnelled appearance of the reduced-connectivity networks is due purely to regularisation; i.e. higher L2 regularisation convexifies the landscape [15]. Again, for the fully-connected case, we observed a single-funnelled appearance, substantiating our previous suggestion that this type of landscape is architecture dependent. However, for the two and three nearest neighbour models, we observe that some additional sub-funnel structure starts to emerge (figure 7). This result highlights the strong effect of locality on single-layered architectures.

6. Landscapes for two and three hidden layers

Here we present some results for the D1 dataset obtained with neural networks containing two (2HL) and three (3HL) hidden layers, to provide comparisons with the single hidden layer results. We consider 2HL with five nodes in each hidden layer, and 3HL with four nodes in each hidden layer, giving 69 and 72 training variables, respectively, for the D1 dataset obtained for the LJAT3 classification problem. Disconnectivity graphs that focus on the lower-lying region of the landscape are shown in figure 8, and the corresponding stationary point databases are described in table 4. These results illustrate two trends, namely, the growth in the number of stationary points with increasing hidden layers, and with decreasing training data, for a comparable number of variable edge weights. The corresponding databases are far from complete for Ndata = 100, but should provide a reasonable coverage of the low-lying region, which is the focus of interest here.

Figure 8. Disconnectivity graphs obtained with 100 (top) and 1000 (bottom) training data for the D1 dataset, λ = 0.0001, and neural networks with two (left) and three (right) hidden layers.

Table 4. Number of minima (Min) and transition states (Ts) for machine learning landscapes with two and three hidden layers, λ = 0.0001, for 100 and 1000 training data drawn from the D1 dataset.

| Hidden layers | 100 training data: D1 (Min, Ts) | 1000 training data: D1 (Min, Ts) |
| --- | --- | --- |
| 2 | 65 591, 90 622 | 3630, 3197 |
| 3 | 193 036, 540 962 | 13 298, 20 777 |

Comparing the lower panels of figure 8 for ${N}_{\mathrm{data}}=1000$ with the top panels for ${N}_{\mathrm{data}}=100$, we see that the uphill barriers corresponding to pathways that lead to the global minimum are significantly smaller. Further analysis also shows that the minima span a wider range of loss function values for the 3HL architecture. These effects are maintained when more training data is included; a systematic analysis will be presented elsewhere [18].

7. Conclusions

Using custom generated high-quality geometry optimisation training data we showed that increasing training diversity (in this case, configuration space volume for an atomic cluster) leads to landscapes with many more stationary points and higher loss values. These results suggest a correspondence between the number of local minima and the statistical uncertainty of the LFL.

In our mislabelling analysis, we found that neural networks can correctly filter uniform noise for very high levels of dataset poisoning and these results remain (empirically) true for averages over the database of local minima. We also find that for mislabelling, a tight band of minima around the global minimum does not occur. Instead, the variance of the testing AUC increases significantly with the training error. Furthermore, we observe that many high loss training minima perform well on unseen testing input, as they do not overfit to noise, highlighting a bias-variance type trade-off. In future work we aim to consider other types of noise. Much of the realistic (and difficult) noise in machine learning datasets is not uniform, but instead highly feature dependent or adversarial [66, 77]. As a first step, we plan to see whether a landscape analysis might reveal why it is more difficult to train under stochastic permutation noise than uniform random noise. We would also like to compare our noise analysis to neural networks with more than one hidden layer, which may be more resilient to labelling noise [62].

We have also explored the landscapes of neural networks with reduced connectivity. For two and three nearest neighbours, the networks retained sufficient expressive capacity. In particular, the network for two nearest neighbours systematically outperformed the fully-connected case on unseen testing data. The networks with reduced connectivity are significantly more complex, due to the effects of stronger locality and the symmetry-broken architecture. For very limited connectivity (one nearest neighbour), we found only a few minima with poor predictive capability, reflecting the reduced capacity of the network. Furthermore, as we reduced the regularisation (convexity) of the landscape, the reduced-connectivity architecture produced much more complex LFLs with emerging subfunnel structure. These results may be helpful in understanding the difference in performance between deep and shallow networks, and in determining architectures with optimal capacity for neural networks (a sparseness versus expressibility trade-off). Future work in this area will likely include a generalised systematic scheme for reduced connectivity in deep neural networks. In particular, the trends observed for LFLs as a function of the number of hidden layers, the number of training data, and the presence of mislabelling, should be investigated to test whether we can legitimately extrapolate to large networks where a detailed analysis of the landscape is not feasible.

Acknowledgments

This research was funded by the EPSRC.

Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.
