Visualizing high-dimensional loss landscapes with Hessian directions

Analyzing the geometric properties of high-dimensional loss functions, such as local curvature and the existence of other optima around a certain point in loss space, can help provide a better understanding of the interplay between neural-network structure, implementation attributes, and learning performance. In this paper, we combine concepts from high-dimensional probability and differential geometry to study how curvature properties in lower-dimensional loss representations depend on those in the original loss space. We show that saddle points in the original space are rarely correctly identified as such in the expected lower-dimensional representations if random projections are used. The principal curvature in the expected lower-dimensional representation is proportional to the mean curvature in the original loss space. Hence, the mean curvature in the original loss space determines if saddle points appear, on average, as either minima, maxima, or almost flat regions. We use the connection between expected curvature in random projections and mean curvature in the original space (i.e., the normalized Hessian trace) to compute Hutchinson-type trace estimates without calculating Hessian-vector products as in the original Hutchinson method. Because random projections are not suitable for correctly identifying saddle information, we propose to study projections along the dominant Hessian directions that are associated with the largest and smallest principal curvatures. We connect our findings to the ongoing debate on loss landscape flatness and generalizability. Finally, for different common image classifiers and a function approximator, we show and compare random and Hessian projections of loss landscapes with up to approximately 7 × 10⁶ parameters.


I. INTRODUCTION
Every deep neural network loss function depends on parameters θ ∈ R^N, which induce a typically high-dimensional loss landscape [1]. The loss landscape associated with a neural network and the performance of optimization procedures that operate on it are each influenced by several factors, including the structural properties of the neural network [2][3][4] and a range of implementation attributes one may choose [5][6][7][8][9]. The exact nature of these factors and how their combinations influence learning performance, however, remain largely unknown.
One way to better understand the interplay between neural network structure, implementation attributes, and learning performance is through a better understanding of the geometric properties of loss landscapes. For example, Keskar et al. [10] analyze the local curvature around candidate minimizers via the spectrum of the underlying Hessian to characterize the flatness and sharpness of loss minima, and Dinh et al. [11] demonstrate that reparameterizations can render flat minima sharp without affecting generalization properties. Another approach, which holds out the promise of visualizing the curvature around a point and the existence, if any, of nearby optima, aims to visualize high-dimensional loss functions by a lower-dimensional (and often random) projection of two or three dimensions [12][13][14][15][16]. Building on this approach, Horoi et al. [16] pursue improvements in learning by dynamically sampling points in projected low-loss regions surrounding local minima during training.
Given the advantages afforded by the promise of visualizing loss landscape optima, one may ask: how do the curvature properties in low-dimensional visualizations of high-dimensional functions depend on the curvature properties of the original, high-dimensional loss space? Specifically, do random two- and three-dimensional loss projections meaningfully represent convexity and concavity properties of high-dimensional loss functions? In short, they do not. Saddle points are generally misrepresented in the expected lower-dimensional representation of an original, high-dimensional loss space. Yet, due to the exponential proliferation of saddle points in high dimensions [17,18], a critical point in high-dimensional loss space is almost certain to be a saddle rather than a minimum. Nevertheless, random projections can be useful to obtain curvature estimates, as we shall demonstrate by establishing a connection between expected curvature in random loss projections and Hessian trace estimates based on Hutchinson's method [19].
Our contributions in this paper are threefold. First, we explain why random projections from a high-dimensional space do not preserve curvature information. This argument is developed in Section II.

In differential geometry, the principal curvatures are the eigenvalues of the shape operator (or Weingarten map¹), a linear endomorphism defined on the tangent space T_p of L at a point p. For the original high-dimensional space, we have (θ, L(θ)) ⊆ R^{N+1} and there are N principal curvatures κ^θ_1 ≥ κ^θ_2 ≥ ⋯ ≥ κ^θ_N. At a non-degenerate critical point θ* where the gradient ∇_θ L vanishes, the matrix of the shape operator is given by the Hessian H_θ with elements (H_θ)_{ij} = ∂²L/(∂θ_i ∂θ_j) (i, j ∈ {1, …, N}).² Some works refer to the Hessian as the "curvature matrix" [28] or use it to characterize the curvature properties of L(θ) [13]. In the vicinity of a non-degenerate critical point θ*, the eigenvalues of the Hessian H_θ are the principal curvatures and describe the loss function in the eigenbasis of H_θ according to

L(θ) ≈ L(θ*) + (1/2) Σ_{i=1}^{N} κ^θ_i θ̃_i²,   (1)

where θ̃_i denotes the coordinate along the i-th eigenvector of H_θ. The Morse lemma states that, if a critical point θ* of L(θ) is non-degenerate, then there exists a chart (θ̃_1, …, θ̃_N) in a neighborhood of θ* such that

L(θ̃) = L(θ*) − θ̃_1² − ⋯ − θ̃_i² + θ̃_{i+1}² + ⋯ + θ̃_N²,   (2)

where θ̃_i(θ*) = 0 for i ∈ {1, …, N}. The loss function L(θ̃) in Eq. (2) is decreasing along the first i directions and increasing along the remaining i + 1 to N directions. Further, the index i of a critical point θ* is the number of linearly independent decreasing directions of L near θ* (i.e., the number of negative eigenvalues of the Hessian H_θ at that point).

1 Some authors distinguish between the shape operator and the Weingarten map depending on whether the change of the underlying tangent vector is described in the original manifold or in Euclidean space (see, e.g., chapter 3 in [27]).

2 A critical point is degenerate if the Hessian H_θ at this point is singular (i.e., det(H_θ) = 0). At degenerate critical points, one cannot use the eigenvalues of H_θ to determine if the critical point is a minimum (positive definite H_θ) or a maximum (negative definite H_θ). Geometrically, at a degenerate critical point, a quadratic approximation fails to capture the local behavior of the function that one wishes to study.
In the standard basis, the Hessian is

(H_θ)_{ij} = ∂²L/(∂θ_i ∂θ_j).   (3)

Closely related to the Hessian is the Fisher information matrix (FIM) [29,30], whose elements F_ij are given by

F_ij = E[ (∂ log p(x|θ)/∂θ_i)(∂ log p(x|θ)/∂θ_j) ],   (4)

where the expectation is taken over the random variable X and p(x|θ) denotes the probability density function associated with X conditioned on the parameter θ. The probability that X falls within the infinitesimal interval [x, x + dx], given a known value of θ, is p(x|θ) dx. Provided that certain regularity conditions, such as the interchangeability of differentiation and integration,

∫ ∂²p(x|θ)/(∂θ_i ∂θ_j) dx = (∂²/∂θ_i ∂θ_j) ∫ p(x|θ) dx = 0,   (5)

are satisfied [29,31], we may rewrite Eq. (4) as

F_ij = −E[ ∂² log p(x|θ)/(∂θ_i ∂θ_j) ].   (6)

Notice the structural similarity between F_ij and the elements of H_θ [see Eq. (3)]. In natural gradient descent [32], the metric associated with a stochastic feedforward neural network is the FIM. Considering a neural network with input-output pairs (x, y) and parameters θ, a possible choice of a model distribution, which takes on the role of the probability density in Eq. (4), is the Gaussian model

p(x, y | θ) ∝ q(x) exp(−(1/2) ∥y − f_θ(x)∥₂²).   (7)

Here, q(x) denotes the distribution of inputs x, f_θ(x) is the output of a neural network given input x, and ∥·∥₂ is the Euclidean norm [33]. The expected values associated with FIM elements F_ij for p(x, y|θ) are calculated with respect to (w.r.t.) input-output pairs (x, y) of the joint distribution p(x, y|θ).
Given the mean squared loss function

L(θ) = (1/(2T)) Σ_{t=1}^{T} Σ_{k=1}^{K} [y_k(t) − f_{θ,k}(t)]²,   (8)

the Hessian H_θ of L(θ) and the empirical FIM F_θ are connected via

(H_θ)_{ij} = (F_θ)_{ij} − (1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} [y_k(t) − f_{θ,k}(t)] ∂²f_{θ,k}(t)/(∂θ_i ∂θ_j),   (9)

where (F_θ)_{ij} = (1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} [∂f_{θ,k}(t)/∂θ_i][∂f_{θ,k}(t)/∂θ_j] and f_{θ,k}(t) is the k-th output given input x(t). Here, y(t) denotes the desired output associated with sample x(t). At the global optimum, which is usually attained for overparameterized neural networks [1], the difference y_k(t) − f_{θ,k}(t) vanishes for all k, t, and the empirical FIM is equal to the Hessian H_θ [34].
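The equality between the Hessian of the mean squared loss and the empirical FIM at a zero-residual optimum can be checked numerically. The sketch below is a minimal illustration with a hypothetical two-parameter model of our own choosing (not one of the networks considered in this paper); it compares a finite-difference Hessian of the loss with the Gauss-Newton matrix J^⊤J at a parameter vector where all residuals vanish:

```python
import numpy as np

def model(theta, x):
    # hypothetical two-parameter scalar model f_theta(x)
    return theta[0] * np.tanh(theta[1] * x)

def loss(theta, x, y):
    r = y - model(theta, x)
    return 0.5 * np.sum(r**2)   # (unnormalized) mean squared loss

def num_hessian(f, theta, eps=1e-5):
    """Central finite-difference Hessian of a scalar function f."""
    n = len(theta)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            tpp = theta.copy(); tpp[i] += eps; tpp[j] += eps
            tpm = theta.copy(); tpm[i] += eps; tpm[j] -= eps
            tmp = theta.copy(); tmp[i] -= eps; tmp[j] += eps
            tmm = theta.copy(); tmm[i] -= eps; tmm[j] -= eps
            H[i, j] = (f(tpp) - f(tpm) - f(tmp) + f(tmm)) / (4 * eps**2)
    return H

def num_jacobian(theta, x, eps=1e-6):
    """Finite-difference Jacobian of the model outputs w.r.t. theta."""
    J = np.zeros((len(x), len(theta)))
    for i in range(len(theta)):
        tp = theta.copy(); tp[i] += eps
        tm = theta.copy(); tm[i] -= eps
        J[:, i] = (model(tp, x) - model(tm, x)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
x = rng.normal(size=20)
theta_star = np.array([1.5, 0.7])
y = model(theta_star, x)                 # zero-residual "global optimum"

H = num_hessian(lambda t: loss(t, x, y), theta_star)
J = num_jacobian(theta_star, x)
F = J.T @ J                              # empirical FIM (Gauss-Newton matrix)
print(np.max(np.abs(H - F)))             # residual terms in Eq. (9) vanish here
```

Away from the zero-residual point, the difference H − F would instead equal the residual-weighted sum of output Hessians in Eq. (9).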
We can also establish a connection between H_θ and F_θ for the cross-entropy loss L = −(1/T) Σ_{t=1}^{T} Σ_{k=1}^{K} y_k(t) log f_{θ,k}(t), where softmax outputs f_{θ,k}(t) are used to approximate binary class labels y_k(t).³ Similarly to Eq. (9), we obtain a decomposition of H_θ into an empirical-FIM term and residual terms that involve the differences y_k(t) − f_{θ,k}(t) [Eq. (10)]. In the scenario in which f_{θ,k}(t) approaches y_k(t) for all k, t, these residual terms vanish and H_θ is equal to the empirical FIM associated with the cross-entropy loss [35].
The neural network examples that we consider in Section IV are trained using cross-entropy loss. In the Appendix, we also consider a function approximation problem in which we train neural networks using a mean squared loss function.

B. Random projections
To graphically explore an N-dimensional loss function L around a critical point θ*, one may wish to work in a lower-dimensional representation. For example, a two-dimensional projection of L around θ* is provided by

L(θ* + αη + βδ),   (12)

where the parameters α, β ∈ R determine the displacement along the direction vectors η, δ ∈ R^N. In high-dimensional spaces, there exist vastly more almost-orthogonal than orthogonal directions. In fact, if the dimension of our space is large enough, random vectors will, with high probability, be sufficiently close to orthogonal [36]. Following this result, many related works [13,15,16] use random Gaussian directions with independent and identically distributed vector elements η_i, δ_i ∼ N(0, 1) (i ∈ {1, …, N}).
We now turn to showing that η, δ are almost orthogonal using a concentration inequality for chi-squared distributed random variables. To do so, we first note that the scalar product of random Gaussian vectors η, δ is a difference of two chi-squared distributed random variables because

η^⊤δ = Σ_{i=1}^{N} η_i δ_i = (1/4) Σ_{i=1}^{N} [(η_i + δ_i)² − (η_i − δ_i)²] = (1/2) Σ_{i=1}^{N} (X_i² − Y_i²),   (13)

where X_i, Y_i ∼ N(0, 1). In the last step of Eq. (13), we rescaled η_i + δ_i and η_i − δ_i, which are normally distributed random variables with variance 2, by a factor of √2. Since Σ_i X_i² and Σ_i Y_i² are both chi-squared distributed with N degrees of freedom and expected value N, the right-hand side of Eq. (13) vanishes in expectation. That is,

lim_{N→∞} E[ η^⊤δ / (∥η∥₂ ∥δ∥₂) ] = 0.   (14)

Hence, the vectors η, δ are orthogonal in the limit N → ∞. For finite N, we can bound Eq. (13) using concentration inequalities for a chi-squared random variable Z_N with N degrees of freedom of the form [37]

Pr(Z_N/N − 1 ≥ 2√(ϵ/N) + 2ϵ/N) ≤ e^{−ϵ},  Pr(1 − Z_N/N ≥ 2√(ϵ/N)) ≤ e^{−ϵ}   (15, 16)

for any ϵ > 0. In the limit ϵ → 0, the above inequalities can be cast in the form

Pr(|Z_N/N − 1| ≥ ϵ) ≤ 2 e^{−Nϵ²/8}   (17, 18)

in the limit of large N. For further details on the derivation of Eq. (18), see Appendix A.

TABLE I. Definitions of the employed Hessians and curvature measures.
Symbol | Definition
H_θ | Hessian in the original loss space
κ^θ_i | principal curvatures in the original loss space, where i ∈ {1, …, N} (i.e., the eigenvalues of H_θ)
H_{α,β} | Hessian in a two-dimensional projection of an N-dimensional loss function
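The near-orthogonality of independent Gaussian directions is easy to verify empirically. The following toy experiment (our own sketch, not part of the paper's numerics) estimates the mean absolute cosine between pairs of random Gaussian vectors as the dimension N grows:

```python
import numpy as np

rng = np.random.default_rng(1)

def abs_cosine(n, rng):
    """|cos| of the angle between two independent standard Gaussian vectors."""
    eta, delta = rng.normal(size=n), rng.normal(size=n)
    return abs(eta @ delta) / (np.linalg.norm(eta) * np.linalg.norm(delta))

# mean |cos| over 200 pairs per dimension; shrinks roughly like 1/sqrt(N)
mean_overlap = {n: np.mean([abs_cosine(n, rng) for _ in range(200)])
                for n in (10, 1000, 100000)}
print(mean_overlap)
```

For N of the order of typical neural-network parameter counts, the overlap is already far below a degree of angular deviation from orthogonality.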

C. Principal curvature
With the form of random Gaussian projections in hand, we now analyze the principal curvatures in both the original and lower-dimensional spaces. The Hessian associated with the two-dimensional loss projection (12) is

H_{α,β} = [ ∂²L/∂α²    ∂²L/(∂α∂β) ]   [ η_i (H_θ)_{ij} η_j   η_i (H_θ)_{ij} δ_j ]
          [ ∂²L/(∂β∂α)   ∂²L/∂β²  ] = [ δ_i (H_θ)_{ij} η_j   δ_i (H_θ)_{ij} δ_j ],   (19)

where we use Einstein notation in the last equality.
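In matrix form, Eq. (19) says that the slice Hessian is obtained by sandwiching H_θ between the two direction vectors. A minimal numerical sketch, using a random symmetric matrix as a stand-in for a real loss Hessian:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 500
A = rng.normal(size=(N, N))
H_theta = (A + A.T) / 2                 # stand-in for the loss Hessian at theta*

eta, delta = rng.normal(size=N), rng.normal(size=N)
# dimension-reduced Hessian of L(theta* + alpha*eta + beta*delta), Eq. (19)
H_ab = np.array([[eta @ H_theta @ eta,   eta @ H_theta @ delta],
                 [delta @ H_theta @ eta, delta @ H_theta @ delta]])
kappa_minus, kappa_plus = np.linalg.eigvalsh(H_ab)   # principal curvatures

# sanity check: for the quadratic slice g(a, b) = 0.5 (a*eta + b*delta)^T H (...),
# the second derivative in alpha equals (H_ab)_11 exactly
g = lambda a, b: 0.5 * (a * eta + b * delta) @ H_theta @ (a * eta + b * delta)
h = 1e-3
d2_alpha = (g(h, 0.0) - 2 * g(0.0, 0.0) + g(-h, 0.0)) / h**2
print(abs(d2_alpha - H_ab[0, 0]))
```

The 2×2 matrix `H_ab` is symmetric by construction, and its two eigenvalues are the principal curvatures κ± of the projected landscape.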
In analogy with Eq. (9), the dimension-reduced Hessian H_{α,β} is connected to the corresponding FIM F_{α,β} according to Eq. (20), where f_{(α,β),k}(t) is the k-th entry of the output of a neural network with parameters θ* + αη + βδ. The operator ∇_{(α,β)} denotes the gradient w.r.t. parameters α, β. Recall that Eq. (9) [and hence Eq. (20)] is based on a mean squared loss function. One may also use Eq. (10) to obtain a connection between the dimension-reduced Hessian and the corresponding cross-entropy FIM. Lower-dimensional approximations H_{α,β} of the Hessian H_θ in the original high-dimensional space might be useful in generalized Gauss-Newton algorithms that are based on positive semidefinite and computationally affordable Hessian approximations [38,39].
Because the elements of δ, η arising in H_{α,β} are distributed according to a standard normal distribution, the second derivatives of the loss function L in Eq. (19) have prefactors that are products of standard normal variables and, hence, can be expressed as sums of chi-squared distributed random variables as in Eq. (13). To summarize, elements of H_{α,β} are sums of second derivatives of L in the original space weighted with chi-squared distributed prefactors.
We now turn to the relationship between random projections and principal curvature. The appeal of random projections is that pairwise distances between points in a high-dimensional space can be nearly preserved by a lower-dimensional linear embedding, affording a low-dimensional representation of mean and variance information with minimal distortion [43]. The question is whether random projections also preserve curvature information and, if so, what the nature of the relationship between random Gaussian directions and principal curvature is. For instance, Li et al. [13] assert that "the principal curvatures of a dimensionality-reduced plot (with random Gaussian directions) are weighted averages of the principal curvatures of the full-dimensional surface". However, our results show that the principal curvatures κ±^{α,β} in a two-dimensional loss projection are weighted averages of the Hessian elements (H_θ)_{ij} in the original space and not weighted averages of the principal curvatures κ^θ_i, which instead are solutions of an N-th degree polynomial. Similar arguments apply to projections with dimension larger than 2.
In Section III we discuss examples that show that lower-dimensional projections of L(θ) can be misleading, since high-dimensional saddle points may appear to be minima, maxima, or almost flat regions depending on the index of the underlying Hessian H θ in the original space.
Returning to principal curvature, since E[η_i η_j] = E[δ_i δ_j] = δ_{ij} (the Kronecker delta) and E[η_i δ_j] = 0 for all i, j,⁴ the expected value of a quadratic form Σ_{i,j} a_{ij} η_i η_j reduces to Σ_i a_{ii}.

4 To show that the expected values of a_{ij} η_i η_j (i ≠ j) or a_{ij} η_i δ_j vanish, one can either invoke independence of η_i, η_j (i ≠ j) and η_i, δ_j or transform both products into corresponding differences of two chi-squared random variables with the same mean [see Eq. (13)].
Hence, the expected, dimension-reduced Hessian (19) is

E[H_{α,β}] = tr(H_θ) I₂ = (Σ_{i=1}^{N} κ^θ_i) I₂.   (22)

The corresponding eigenvalue (or principal curvature) κ̄_{α,β} is therefore given by the sum over all principal curvatures in the original space (i.e., κ̄_{α,β} = Σ_{i=1}^{N} κ^θ_i). Hence, the value of the principal curvature κ̄_{α,β} in the expected dimension-reduced space will be either positive (if the positive principal curvatures in the original space dominate), negative (if the negative principal curvatures in the original space dominate), or close to zero (if positive and negative principal curvatures in the original space cancel each other out). As a result, saddle points will not appear as such in the expected random projection.
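This expectation argument can be checked by direct simulation. The sketch below uses a toy diagonal Hessian with equally many +1 and −1 principal curvatures (a zero-trace saddle of our own construction) and averages the slice Hessian of Eq. (19) over many random projections:

```python
import numpy as np

rng = np.random.default_rng(3)
N, S = 200, 20000
# saddle spectrum: half the principal curvatures +1, half -1, so tr(H_theta) = 0
kappas = np.concatenate([np.ones(N // 2), -np.ones(N // 2)])

acc = np.zeros((2, 2))
for _ in range(S):
    eta, delta = rng.normal(size=N), rng.normal(size=N)
    He, Hd = kappas * eta, kappas * delta        # H_theta is diagonal here
    acc += np.array([[eta @ He,   eta @ Hd],
                     [delta @ He, delta @ Hd]])
mean_H = acc / S
print(mean_H)   # converges to tr(H_theta) * I_2, i.e. the zero matrix here
```

Individual projections have slice-Hessian entries with standard deviations of order √N, so the convergence of the ensemble mean to tr(H_θ) I₂ is slow, which anticipates the sample-size discussion in Section III.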
In addition to the connection between κ̄_{α,β} and the principal curvatures κ^θ_i, we now provide an overview of additional mathematical relations between different curvature measures that are useful for quantifying the curvature properties of high-dimensional loss functions and their two-dimensional random projections.
Invoking Eq. (21), we can relate κ̄_{α,β} to tr(H_θ) and κ±^{α,β}. Because κ₊^{α,β} + κ₋^{α,β} = tr(H_{α,β}) and E[tr(H_{α,β})] = 2 tr(H_θ), we have

κ̄_{α,β} = (1/2) (E[κ₊^{α,β}] + E[κ₋^{α,β}]) = tr(H_θ).   (23)

The mean curvature H in the original space is related to κ̄_{α,β} via

H = tr(H_θ)/N = κ̄_{α,β}/N.   (24)

We summarize the definitions of the employed Hessians and curvature measures in Tab. I.

D. Hessian trace estimates
Finally, we point to a connection between the above curvature measures and existing Hessian trace estimates. The trace (or unnormalized mean curvature) of the Hessian H_θ has already found applications in characterizing loss landscapes of neural networks [21,44]. A common way of estimating tr(H_θ) without explicitly computing all eigenvalues of H_θ is based on Hutchinson's method [19] and randomized numerical linear algebra [45,46]. The basic idea behind this approach is to (i) use a random vector z ∈ R^N with elements z_i that are distributed according to a distribution function with zero mean and unit variance (e.g., a Rademacher distribution with Pr(z_i = ±1) = 1/2), and (ii) compute z^⊤H_θz, an unbiased estimator of tr(H_θ), using Hessian-vector products. That is,

tr(H_θ) = E[z^⊤H_θz] ≈ (1/M) Σ_{k=1}^{M} (z^{(k)})^⊤ H_θ z^{(k)},   (25)

where z^{(k)} denotes the k-th of M random probe vectors. Recall that Eq. (23) shows that the principal curvature of the expected random loss projection, κ̄_{α,β}, is equal to tr(H_θ). Instead of estimating tr(H_θ) using the original Hutchinson method (25), an alternative Hutchinson-type estimate is provided by the mean of the expected values of κ₋^{α,β} and κ₊^{α,β} [see Eq. (23)].
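For reference, a minimal Hutchinson estimator with Rademacher probes can be sketched as follows (a generic illustration on a random symmetric matrix; in a neural-network setting, each z^⊤H_θz would be computed with a Hessian-vector product rather than an explicit matrix):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 300
A = rng.normal(size=(N, N))
H = (A + A.T) / 2                     # stand-in Hessian with known trace

def hutchinson_trace(H, num_samples, rng):
    """Unbiased Hutchinson trace estimate, Eq. (25)."""
    est = 0.0
    for _ in range(num_samples):
        z = rng.choice([-1.0, 1.0], size=H.shape[0])   # Rademacher probe
        est += z @ (H @ z)            # one Hessian-vector product per probe
    return est / num_samples

trace_est = hutchinson_trace(H, 2000, rng)
print(trace_est, np.trace(H))
```

Rademacher probes minimize the estimator variance among zero-mean, unit-variance element distributions, since the diagonal terms z_i² are identically one.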

III. ILLUSTRATIVE EXAMPLES
In this section, we summarize the main concepts derived in Section II in terms of different examples. These examples are intended to aid intuition on how (i) principal and mean curvatures of the original loss function relate to those of random projections, (ii) curvature-based Hessian trace estimates relate to those obtained with the original Hutchinson's method, and (iii) Hessian directions (i.e., the eigenbasis of the Hessian H_θ of L(θ)) can be used for a guided visual analysis of saddle information in high-dimensional spaces.

A. Extracting curvature information
We now study two examples that will help build intuition for the concepts discussed in the previous sections. In the first example, we study a critical point θ* of an N-dimensional loss function L(θ) for which (i) all principal curvatures have the same magnitude and (ii) the number of positive curvature directions is equal to the number of negative curvature directions. The mean curvature of this saddle point is zero. In accordance with Eq. (24), we will show that the mean curvature can be approximated by a function of ensemble means of elements of the dimension-reduced Hessian H_{α,β}, and that the principal curvatures κ±^{α,β} are able to correctly identify the saddle point in the majority of simulated projections. In the second example, we use a loss function associated with an unequal number of negative and positive curvature directions. For the different curvature measures derived in Section II C, we find that random projections cannot identify the underlying saddle point.
Consider first the loss function (26), whose Hessian H_θ at the critical point θ* = (0, …, 0, 1) has positive and negative eigenvalues; the critical point is therefore a saddle. The corresponding principal curvatures are κ^θ_i ∈ {−1, 1} (i ∈ {1, …, N − 1}) and κ^θ_N = 0. In this example, the mean curvature H, as defined in Eq. (24), is equal to 0. According to Eq. (22), the principal curvature κ̄_{α,β} associated with the expected, dimension-reduced Hessian H_{α,β} is also equal to 0, erroneously indicating an apparently flat loss landscape if one were to use κ̄_{α,β} as the main measure of curvature. To compare the convergence of different curvature measures as a function of the number of loss projections S, we will now study the ensemble mean

⟨X⟩ = (1/S) Σ_{k=1}^{S} X^{(k)}   (29)

of different quantities of interest X, such as Hessian elements and principal curvature measures in dimension-reduced space. Here, X^{(k)} is the k-th realization (or sample) of X.
We first study the dependence of ensemble means ⟨(H_{α,β})_{ij}⟩ (i, j ∈ {1, 2}) of elements of the dimension-reduced Hessian H_{α,β} on the number of samples S. According to Eq. (22), the diagonal elements of E[H_{α,β}] are proportional to the mean curvature of the original high-dimensional space and are thus useful for examining curvature properties of high-dimensional loss functions. Figure 1(a) shows the convergence of the ensemble means ⟨(H_{α,β})_{ij}⟩ towards the expected values E[(H_{α,β})_{ij}] as a function of S. Note that E[(H_{α,β})_{ij}] = 0 for all i, j. For a few dozen loss projections, the deviations of the ensemble means from the corresponding expected values reach values larger than 20. A relatively large number of loss projections S between 10³ and 10⁴ is required to keep these deviations smaller than about 2-4. The solid black and red lines in Figure 1(b) show the ensemble means of the principal curvatures and the eigenvalues of the ensemble-averaged dimension-reduced Hessian as a function of S.

FIG. 2. Distributions of the principal curvatures κ±^{α,β}. In panels (a) and (b), the loss functions are given by Eqs. (26) and (31), respectively. We evaluate the corresponding Hessians (28) and (32) at the saddle point θ* = (θ*_1, …, θ*_{2n}, θ*_{2n+1}) = (0, …, 0, 1). In both loss functions, we set n = 500 and in loss function (31) we set ñ = 800. While in panel (a) the probability Pr(κ₊^{α,β} κ₋^{α,β} > 0) that the critical point in the lower-dimensional, random projection does not appear as a saddle is about 0.3, it is 1 in panel (b). Histograms are based on 10,000 random projections that are used to compute κ±^{α,β}. Solid grey lines indicate Gaussian approximations of the empirical distributions.
Since ⟨H_{α,β}⟩ converges to E[H_{α,β}] = tr(H_θ) I₂ [see Eq. (22)], the eigenvalues of the ensemble-averaged Hessian approach κ̄_{α,β} in the limit S → ∞. In the current example, they thus approach κ̄_{α,β} = 0 for large numbers of samples S, represented by the dashed red lines in Figure 1(b). The ensemble means ⟨κ±^{α,β}⟩ of the per-projection principal curvatures, in contrast, converge towards values of opposite sign, indicating a saddle point. For a sample size of S = 10⁴, we show the distribution of the principal curvatures κ±^{α,β} in Figure 2(a). We observe that the distributions are plausibly Gaussian. We also calculate the probability Pr(κ₊^{α,β} κ₋^{α,β} > 0) that the critical point in the lower-dimensional, random projection does not appear as a saddle (i.e., κ₊^{α,β} κ₋^{α,β} > 0). For the example shown in Figure 2(a), we find that Pr(κ₊^{α,β} κ₋^{α,β} > 0) ≈ 0.3. That is, in about 30% of the simulated projections, the lower-dimensional loss landscape wrongly indicates that it does not correspond to a saddle.
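The roughly 30% misidentification rate can be reproduced in a small simulation. The sketch below builds a Hessian with the stated spectrum (n principal curvatures of +1, n of −1, one of 0) rather than the paper's exact loss function (26), and estimates Pr(κ₊κ₋ > 0) over random projections:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
N = 2 * n + 1
# saddle spectrum with zero mean curvature, mimicking the first example
kappas = np.concatenate([np.ones(n), -np.ones(n), [0.0]])

S, miss = 2000, 0
for _ in range(S):
    eta, delta = rng.normal(size=N), rng.normal(size=N)
    He, Hd = kappas * eta, kappas * delta        # diagonal Hessian
    H_ab = np.array([[eta @ He,   eta @ Hd],
                     [delta @ He, delta @ Hd]])
    lo, hi = np.linalg.eigvalsh(H_ab)
    miss += (lo * hi > 0)       # projection shows a minimum or maximum instead
print(miss / S)                 # fraction of projections that hide the saddle
```

The fraction fluctuates around 0.3 for this spectrum, in line with the probability reported above.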
Our first example, which is based on the loss function (26), shows that the principal curvatures in the lower-dimensional representation of L(θ) may capture the saddle behavior in the original space if one computes ensemble means ⟨κ±^{α,β}⟩ in the lower-dimensional space [Figure 1(b)]. However, if one first calculates ensemble means of the elements of the dimension-reduced Hessian H_{α,β} and then computes the principal curvatures of the averaged Hessian, the loss landscape appears to be flat in this example. We thus conclude that different ways of computing ensemble means (either before or after calculating the principal curvatures) may lead to different results with respect to the "flatness" of a dimension-reduced loss landscape.
In the next example, we will show that random projections cannot identify certain saddle points regardless of the underlying averaging process. We consider the loss function (31), where we use the convention Σ_{i=a}^{b} (·) = 0 if a > b. The corresponding Hessian at the critical point (θ*_1, …, θ*_{2n}, θ*_{2n+1}) = (0, …, 0, 1) is given in Eq. (32). As in the previous example, the critical point is again a saddle, but the mean curvature is H = 2(ñ − n)/N > 0.
In the following numerical experiments, we set n = 500 and ñ = 800. Figure 1(c) shows that the ensemble means ⟨(H_{α,β})_{ij}⟩ converge towards the expected value E[(H_{α,β})_{ij}] as the number of samples increases. We again observe that a relatively large number of random loss projections S between 10³ and 10⁴ is required to keep the deviations of ensemble means from their corresponding expected values small. Because of the dominance of positive principal curvatures κ^θ_i in the original space, both the ensemble means ⟨κ±^{α,β}⟩ of the principal curvatures and the eigenvalues of the ensemble-averaged Hessian in the lower-dimensional representation approach positive values [Figure 1(d)]. The distribution of κ±^{α,β} indicates that the probability of observing a saddle in the lower-dimensional loss landscape is vanishingly small [Figure 2(b)]. In this second example, both ways of computing ensemble means, before and after calculating the lower-dimensional principal curvatures, mistakenly suggest that the saddle in the original space is a minimum in dimension-reduced space.

FIG. 3. Estimating the trace of the Hessian H_θ. In panels (a) and (b), the loss functions are given by Eqs. (26) and (31), respectively. We evaluate the corresponding Hessians (28) and (32) at the saddle point θ* = (θ*_1, …, θ*_{2n}, θ*_{2n+1}) = (0, …, 0, 1). In both loss functions, we set n = 500 and in loss function (31) we set ñ = 800. Solid black and dash-dotted red lines represent Hutchinson [⟨z^⊤H_θz⟩; see Eq. (25)] and curvature-based (⟨κ_α⟩) estimates of tr(H_θ), respectively. We compute ensemble means ⟨·⟩ as defined in Eq. (29).
To summarize, for both loss functions (26) and (31), the saddle point θ* = (0, …, 0, 1) in the original loss function L(θ) is often misrepresented in lower-dimensional representations L(θ* + αη + βδ) if random directions are used. Depending on (i) the employed curvature measure and (ii) the index of the underlying Hessian H_θ in the original space, the saddle θ* = (0, …, 0, 1) often erroneously appears as a minimum, a maximum, or an almost flat region.
If the critical point were a minimum or maximum (i.e., a critical point associated with a positive definite or negative definite Hessian H_θ), it would be correctly represented in the corresponding expected random projection because the sign of its principal curvature κ̄_{α,β}, which is proportional to the sum of all eigenvalues κ^θ_i of the Hessian H_θ in the original loss space, would be equal to the sign of the principal curvatures κ^θ_i. However, such points are scarce in high-dimensional loss spaces [17,18].
Finally, because of the confounding factors associated with the inability of random projections to correctly identify saddle information, it does not appear advisable to use "flatness" around a critical point in a lower-dimensional random loss projection as a measure of generalization error [13].

B. Hessian trace
We now use the loss functions (26) and (31) to compare the convergence behavior of the original Hutchinson method (25) and the curvature-based trace estimation (23). Instead of two random directions, we use one random Gaussian direction η and perform quadratic least-squares fits for 50 equidistant values of α in the interval [−0.05, 0.05] to extract estimates of tr(H_θ) from L(θ* + αη). We use the same random Gaussian directions in Hutchinson's method.
Figure 3 shows how the Hutchinson and curvature-based trace estimates converge towards the true trace values, 0 for the loss function (26) and 600 for the loss function (31) with n = 500 and ñ = 800. Given that we use the same random vectors in both methods, their convergence behavior towards the true trace value is similar. With the curvature-based method, one can produce Hutchinson-type trace estimates without computing Hessian-vector products. However, it requires the user to specify an appropriate interval for the quantity α so that the mean curvature can be properly estimated in a quadratic-approximation regime. It also requires a sufficiently large number of points in this interval. Therefore, it may be less accurate than the original Hutchinson method.

FIG. 4. Loss projections for the loss function (31). If the eigenvalues associated with η, δ have different signs, the corresponding loss landscape is a saddle, as depicted in panel (a). If the eigenvalues associated with η, δ have the same sign, the corresponding loss landscape is either a minimum (both signs are positive), as shown in panel (b), or a maximum (both signs are negative), as shown in panel (c). Because there is an excess of ñ − n = 100 positive eigenvalues in H_θ, a projection onto a dimension-reduced space that is spanned by the random directions η, δ is often associated with an apparently convex loss landscape. An example of such an apparent minimum is shown in panel (d). We selected a single pair of random directions η, δ (i.e., no averaging over random directions has been performed).
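A minimal version of this curvature-based estimator can be sketched as follows. The α-interval and quadratic fit mirror the procedure described above, but the loss and its Hessian are our own quadratic stand-ins rather than the loss functions (26) and (31):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 200
A = rng.normal(size=(N, N))
H = (A + A.T) / 2                       # stand-in Hessian with known trace
loss = lambda v: 0.5 * v @ (H @ v)      # toy quadratic loss, critical point at 0

def curvature_trace_sample(rng):
    """Fit L(theta* + alpha*eta) on an alpha-grid; 2*c2 estimates eta^T H eta."""
    eta = rng.normal(size=N)
    alphas = np.linspace(-0.05, 0.05, 50)
    vals = [loss(a * eta) for a in alphas]
    c2, c1, c0 = np.polyfit(alphas, vals, 2)   # coefficients, highest power first
    return 2.0 * c2

est = np.mean([curvature_trace_sample(rng) for _ in range(1000)])
print(est, np.trace(H))                 # curvature-based estimate vs. exact trace
```

For a non-quadratic loss, the fitted second derivative additionally picks up higher-order terms, which is why the α-interval must stay within the quadratic-approximation regime.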

C. Hessian directions
Given the described shortcomings of random projections (e.g., in correctly identifying saddles of high-dimensional loss functions), we suggest using Hessian directions (i.e., the eigenbasis of H_θ) as directions η, δ in L(θ* + αη + βδ). For Eq. (31) with n = 900, ñ = 1000, we show projections along different Hessian directions in Figure 4(a-c). We observe that different Hessian directions indicate different types of critical points in dimension-reduced space. If the eigenvalues associated with the Hessian directions η, δ have different signs, the corresponding lower-dimensional loss landscape is a saddle [Figure 4(a)]. If both eigenvalues have the same sign, the loss landscape is either a minimum [Figure 4(b): both signs are positive] or a maximum [Figure 4(c): both signs are negative]. If one uses a random projection instead, the resulting lower-dimensional loss landscape often appears to be a minimum in this example [Figure 4(d)]. To quantify the proportion of random projections that correctly identify the saddle with n = 900, ñ = 1000, we generated 10,000 realizations of κ±^{α,β} [see Eq. (21)]. We find that the signs of κ±^{α,β} were different in only about 0.5% of all simulated realizations. That is, in this example the principal curvatures κ±^{α,β} indicate a saddle in only about 0.5% of the studied projections.
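The contrast between Hessian directions and random directions can be reproduced with a toy saddle Hessian (our own construction: a few negative and many positive principal curvatures, hidden in a random eigenbasis):

```python
import numpy as np

rng = np.random.default_rng(7)
N = 300
# spectrum with 20 negative and N-20 positive principal curvatures
kappas = np.concatenate([-np.ones(20), 3.0 * np.ones(N - 20)])
Q, _ = np.linalg.qr(rng.normal(size=(N, N)))   # random orthogonal eigenbasis
H_theta = Q @ np.diag(kappas) @ Q.T            # saddle Hessian, non-diagonal

evals, evecs = np.linalg.eigh(H_theta)
eta = evecs[:, 0]        # Hessian direction with the most negative curvature
delta = evecs[:, -1]     # Hessian direction with the most positive curvature
H_ab = np.array([[eta @ H_theta @ eta,   eta @ H_theta @ delta],
                 [delta @ H_theta @ eta, delta @ H_theta @ delta]])
lo, hi = np.linalg.eigvalsh(H_ab)
print(lo, hi)   # opposite signs: the slice correctly shows a saddle
```

Along these two eigenvector directions the 2×2 slice Hessian is diagonal with entries equal to the chosen eigenvalues, so the saddle is identified with certainty, whereas a random pair of directions would overwhelmingly average over the positive curvatures.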
In the next section, we will show that Hessian directions that are associated with the largest-magnitude positive and negative eigenvalues of H_θ are useful to appropriately identify saddle information and visually study the geometric properties of saddle points in high-dimensional loss spaces of image classifiers.

IV. APPLICATIONS TO NEURAL NETWORKS
We now compare projections L(θ* + αη + βδ) using (i) dominant Hessian directions (i.e., eigenvectors that are associated with the largest-magnitude positive and negative eigenvalues of H_θ) and (ii) random directions. Hessian directions were computed using Hessian-vector products (HVPs) without an explicit representation of H_θ (see Algorithm 1). We first compute the largest-magnitude eigenvalue and then use an annihilation method [47] to compute the second largest-magnitude eigenvalue of opposite sign. More details on the annihilation algorithm are provided in Appendix B. Other deflation techniques can be used to find additional Hessian directions.
Instead of explicitly representing the Hessian H_θ, Algorithm 1 evaluates products between H_θ and a vector v ∈ R^N using the identity

H_θ v = ∇_θ [ (∇_θ L)^⊤ v ].   (33)

In the first step, the gradient (∇_θL)^⊤ is computed using reverse-mode autodifferentiation (AD) to then compute the scalar product (∇_θL)^⊤v. In the second step, we again apply reverse-mode AD to the computational graph associated with the scalar product (∇_θL)^⊤v. Because the vector v does not depend on θ (i.e., ∇_θ v = 0), the result is H_θ v. One may also use forward-mode AD in the second step to provide a more memory-efficient implementation.⁵ We used the LinearOperator module that is available in the Python package scipy to represent HVPs. As an alternative to implementing HVPs manually, one may use the hvp function provided in the torch.autograd.functional module. Moreover, the Python package PyHessian [21] can also be used to evaluate Hessians of high-dimensional functions in a distributed manner. For the computation of dominant Hessian directions, we used the scipy function eigsh that is based on the implicitly restarted Lanczos method [48].
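The matrix-free eigenvalue computation can be sketched with scipy's LinearOperator and eigsh, as referenced above. In this minimal sketch, the HVP callback multiplies by an explicit random symmetric stand-in matrix; in a real network it would instead apply the double reverse-mode AD identity of Eq. (33):

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, eigsh

rng = np.random.default_rng(8)
N = 400
A = rng.normal(size=(N, N))
H_dense = (A + A.T) / 2       # stand-in; never materialized in the real setting

def hvp(v):
    # placeholder for the autodiff-based HVP of Eq. (33)
    return H_dense @ v

H_op = LinearOperator((N, N), matvec=hvp, dtype=np.float64)
# largest-magnitude eigenpair via the implicitly restarted Lanczos method
eigval, eigvec = eigsh(H_op, k=1, which='LM')

dense_max = np.max(np.abs(np.linalg.eigvalsh(H_dense)))
print(eigval[0], dense_max)   # matrix-free result vs. dense reference
```

Since eigsh only ever calls `matvec`, the same code scales to Hessians that are far too large to store, at the cost of one HVP per Lanczos iteration.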
The computational cost of Algorithm 1 can be summarized as follows. For the HVPs, the computational cost is O(N) [49], where N is the number of neural-network parameters. To identify extremal eigenvalues along with their corresponding eigenvectors using a Lanczos-type method, one has to perform a certain number of matrix-vector multiplications. The required number of iterations depends on several factors, including the desired accuracy, the characteristics of the underlying Hessian matrix, and the choice of the initial vector.
We first focus on a ResNet-56 architecture that has been trained in [13] on CIFAR-10 using stochastic gradient descent (SGD) with Nesterov momentum. The number of parameters is 855,770. The training and test losses at the local optimum found by SGD are 9.20 × 10⁻⁴ and 0.29, respectively; the corresponding accuracies are 100.00% and 93.66%, respectively. In accordance with [13], we apply filter normalization to random directions. This and related normalization methods are often employed when generating random projections. One reason is that simply adding random vectors to the parameters of a neural network loss function (or parameters of other functions) does not take into account the range of parameters associated with different elements of that function. As a result, random perturbations may be too small or too large to properly resolve the influence of certain parameters on a given function.
When calculating Hessian directions, we directly take into account the parameterization of the underlying functions that we want to visualize. Therefore, there is no need for an additional rescaling of different parts of the perturbation vector. Still, reparameterizations of a neural network can result in changes of curvature properties (see, e.g., Theorem 4 in [11]).
Figure 5 shows the two-dimensional projections of the loss function (cross-entropy loss) around a local critical point. The smallest and largest eigenvalues are −16.4 and 5007.9, respectively. The critical point found is therefore a saddle whose largest-magnitude negative curvature is, in absolute value, more than two orders of magnitude smaller than the largest positive curvature at that point. The saddle point is clearly visible in Figure 5(a,c). We observe in the zoomed inset in Figure 5(c) that the loss decreases along the negative β-axis.
If random directions are used, the corresponding projections indicate that the optimizer converged to a local minimum and not to a saddle point [Figure 5(b,d)]. Overall, the ResNet-56 visualizations that we show in Figure 5 exhibit structural similarities to those that we generated using the simple loss model (31) [Figure 4(d)].
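The qualitative contrast between Hessian and random projections can be reproduced in a toy setting. The sketch below (all names illustrative) builds a quadratic saddle with one negative and nine positive principal curvatures and evaluates L(θ* + αη + βδ) on a grid:

```python
import numpy as np

def loss_projection(loss_fn, theta_star, eta, delta, alphas, betas):
    """Evaluate L(theta* + alpha*eta + beta*delta) on a 2-D grid."""
    return np.array([[loss_fn(theta_star + a * eta + b * delta)
                      for b in betas] for a in alphas])

# Quadratic saddle: one negative and nine positive principal curvatures
eigvals = np.array([-1.0] + [5.0] * 9)
loss = lambda th: 0.5 * th @ (eigvals * th)  # diagonal Hessian
theta_star = np.zeros(10)
alphas = betas = np.linspace(-1.0, 1.0, 21)

# Hessian directions: eigenvectors of the (here diagonal) Hessian
hess_grid = loss_projection(loss, theta_star, np.eye(10)[0], np.eye(10)[1],
                            alphas, betas)
# Along the negative-curvature direction the projected loss decreases
print(hess_grid[-1, 10] < hess_grid[10, 10])  # True: the saddle is visible

# Random Gaussian directions typically hide the saddle (apparent minimum)
rng = np.random.default_rng(0)
rand_grid = loss_projection(loss, theta_star, rng.standard_normal(10),
                            rng.standard_normal(10), alphas, betas)
```

Plotting `hess_grid` and `rand_grid` as surfaces reproduces the qualitative difference between panels (a) and (d) of Figure 4.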
As a second neural-network structure, we consider DenseNet-121, which has also been trained in [13] on CIFAR-10 using SGD with Nesterov momentum. In this neural network, the number of parameters is 6,956,298. Figure 6 shows the projections L(θ* + αη + βδ) of the loss function (cross-entropy loss) around a local optimum. The training and test losses at the local optimum found by SGD are 8.07 × 10^−4 and 1.69 × 10^−1, respectively; the corresponding accuracies are 100.00% and 95.63%. We again observe the typical saddle shape of L if dominant Hessian directions are used and an apparent local minimum for projections along random directions. One should keep in mind that the principal curvatures in the random-direction plot are given by κ_{α,β}^{±} [see Eq. (21)]. At a critical point, the principal curvatures in dimension-reduced, random-projection space are, on average, equal to the sum of the principal curvatures in the original space. If only very few, small-magnitude negative principal curvatures are present in the original space, the expected random projections are associated with an apparent local minimum in dimension-reduced, random-projection space.
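This averaging effect is easy to verify numerically. In the sketch below (spectrum values illustrative), a diagonal Hessian with a few small-magnitude negative eigenvalues yields projected curvatures g^⊤Hg along random Gaussian directions g that are strictly positive, so the saddle looks like a minimum:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 500
# Spectrum with a few small-magnitude negative principal curvatures
eigvals = np.concatenate([np.full(5, -0.1), rng.uniform(0.5, 2.0, N - 5)])

def projected_curvature(g):
    # Curvature of the 1-D projection along g is g^T H g (H diagonal here)
    return np.sum(eigvals * g**2)

samples = np.array([projected_curvature(rng.standard_normal(N))
                    for _ in range(2000)])
print(samples.mean(), eigvals.sum())  # E[g^T H g] = tr(H) = sum of curvatures
print(samples.min())                  # positive: the saddle appears convex
```

Because the expected projected curvature equals the *sum* of all principal curvatures, the handful of weak negative directions is swamped by the many positive ones.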
Analyzing the Hessian H_θ of DenseNet-121 around its local optimum, we find that the minimum and maximum principal curvatures are −109.1 and 1937.5, respectively. In comparison to ResNet-56, the relative difference between these two curvatures is substantially smaller. In particular, the largest-magnitude negative principal curvature of DenseNet-121 is about 7 times larger in magnitude than that of ResNet-56, indicating that more substantial loss improvements may be possible in the former by changing the neural-network parameters θ along the negative curvature direction. In Figure 7, we compare changes in the projected loss along random and dominant Hessian directions for DenseNet-121 and ResNet-56. In addition to evaluating the loss on training data (50,000 images) as in Figures 5 and 6, we also provide a comparison to loss changes on the test dataset (10,000 images). Black crosses in Figure 7 indicate loss minima. For both neural networks, we observe that random projections are associated with loss values that increase along both directions δ, η and for both training and test data [Figure 7(a,b,e,f)]. For these projections, the loss minima are very close to the origin of the loss space. Different observations can be made if Hessian directions are employed in the loss projections. Figure 7(c) shows that the training loss minimum of DenseNet-121 in the Hessian projection is not located at the origin but at (α, β) ≈ (0, 0.2). A value of β ≈ 0.2 means that one has to move along the negative curvature direction to find a loss function value that is smaller than the one at the origin. We find that the value of the loss at that point is more than one order of magnitude smaller than the original loss at the origin or the smallest loss found in a random-projection plot. Similar observations can be made on the test dataset. However, the smallest loss value found in the Hessian-direction plot is associated with a relatively large loss on the test dataset [Figure 7(d)]. Using Hessian directions to improve training and test performance requires one to balance potential performance improvements in the test and training datasets. Because of the relatively small negative principal curvature of ResNet-56, one can only achieve small loss improvements along the corresponding Hessian direction, and differences in minimum losses between random and Hessian projections are less pronounced [Figure 7(e-h)].
In Appendix C, we study random and Hessian loss projections of three additional architectures, namely ResNet-56 without skip connections [13], MobileNetV2 [50], and GoogLeNet [50]. For ResNet-56 without skip connections, we find that the training loss minimum in the Hessian projection is associated with slightly larger training and test accuracies than the training loss minimum in the random projection. For MobileNetV2 and GoogLeNet, we find that the minimum training loss in the Hessian projection is associated with an either slightly larger training or test accuracy compared to the training loss minimum in the random projection. In the examples that we study in Appendix C, the training and test loss minima in the Hessian projections are better aligned than those in the corresponding random projections. Table II in Appendix C summarizes the results for all five architectures.
Image classification architectures and their training are already highly optimized to achieve very good performance in common image classification tasks. Our results show that it is possible to improve training loss values along the dominant negative curvature direction. Smaller loss values may also be achievable with further gradient-based training iterations, so one has to consider tradeoffs between gradient-based and curvature-based optimization methods. Also, assessing the overall performance of dominant Hessian directions on this task is not straightforward. For instance, loss measurements and accuracy measurements are not proportional to one another for this image classification task (see Table II in Appendix C). A further complication is that, since these architectures are highly optimized and the underlying generative processes are unknown, it is difficult to interpret variations in performance that may appear between training loss, test loss, and accuracy. Therefore, to better address the question of whether better training performance achieved via dominant Hessian directions can translate to better test performance, we need to control for some of the suspected confounds that appear in the image classification task. We therefore also study possible loss improvements along the dominant negative curvature direction in a function approximation task [20] that we describe in Appendix D.
Unlike image classification, function approximation involves sampling from a known generative distribution. Our results, summarized in Appendix D, show that both the training loss and the test loss are smaller in our function approximation task for Hessian projections than for random projections. Specifically, for 2-layer and 10-layer fully connected neural networks (FCNNs), with 20,501 and 101,301 parameters, respectively, the training loss for Hessian projections is more than 9% smaller for the 2-layer network and 26% smaller for the 10-layer network than for random projections; the corresponding test losses are about 1% and 6% smaller, respectively. Using this function approximation example, we have also included in the Supplemental Information [51] an animation that shows the evolution of random and Hessian loss projections during training.

V. DISCUSSION AND CONCLUSION
Given that a global understanding of high-dimensional loss landscapes remains very limited [1], one- and two-dimensional projections of such functions are a common tool for studying the geometric properties of loss regions near critical points and along optimizer trajectories [12-16]. For an informed way of projecting high-dimensional functions to lower-dimensional representations, it is important to mathematically interpret the geometric information contained in such visualizations. To put it in the words of Ker-Chau Li [52]: "There are too many directions to project a high-dimensional data set and unguided plotting can be time-consuming and fruitless." Therefore, it is important to understand how the projection directions η, δ affect the resulting loss visualizations. A standard choice in the literature is to use random Gaussian directions to obtain lower-dimensional loss representations L(θ* + αη + βδ). Combining concepts from high-dimensional probability and differential geometry, we show that saddle points in the original space may not be correctly identified as such in lower-dimensional representations if such random projections are used.
As formalized in Eq. (24), the expected principal curvature κ̄_{α,β} = tr(H_θ), obtained by averaging over the elements (H_θ)_{ii} of a dimension-reduced Hessian [see Eq. (22)], is proportional to the mean curvature H = tr(H_θ)/N in the original loss space. The connection between κ̄_{α,β} and tr(H_θ) shows that Hutchinson-type trace estimates can be computed directly from curvature information in lower-dimensional random projections [19].
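A sketch of such a Hutchinson-type estimate follows, assuming only pointwise loss evaluations: a second-order central difference of the one-dimensional projection L(θ* + αg) replaces the explicit Hessian-vector product of the original Hutchinson method. The quadratic toy loss is illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
N, h = 100, 1e-3
A = rng.standard_normal((N, N))
H = (A + A.T) / 2                      # toy symmetric Hessian
loss = lambda th: 0.5 * th @ H @ th    # quadratic loss with Hessian H
theta_star = np.zeros(N)

def curvature_along(g):
    # Central second difference of L(theta* + a g) at a = 0, i.e. ~ g^T H g;
    # only three loss evaluations, no Hessian-vector product needed
    return (loss(theta_star + h * g) - 2 * loss(theta_star)
            + loss(theta_star - h * g)) / h**2

# Averaging projected curvatures over Gaussian directions estimates tr(H)
est = np.mean([curvature_along(rng.standard_normal(N)) for _ in range(5000)])
print(est, np.trace(H))
```

For non-quadratic losses the finite-difference step h trades truncation error against round-off, but the estimator itself is unchanged.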
Depending on the value of the mean curvature in the original loss space, saddle points appear, on average, as either minima (H > 0), maxima (H < 0), or almost flat regions (H ≈ 0). Instead of computing ensemble means of the elements of dimension-reduced Hessians, one may also compute ensemble means of principal curvatures. As illustrated in different numerical experiments, the two ways of averaging are associated with different geometric interpretations, and random projections are generally not able to correctly identify saddle points. We therefore propose to study projections along dominant Hessian directions that are associated with the largest-magnitude positive and negative principal curvatures in the original space. Similar methods have been developed and applied in related work on exploratory data analysis [52]. Thus, our work may be viewed as part of a recent "Hessian turn" in machine learning characterized by computationally tractable methods for unlocking Hessian information in non-linear and high-dimensional settings [21,22]. Projecting high-dimensional loss functions along Hessian directions provides a faithful way of illustrating their geometric properties around critical points. Negative curvature directions may be used as a starting point for further improving the performance of a given neural network. Analyzing principal curvatures (i.e., eigenvalues of H_θ) at critical points θ* is also useful for identifying global minima. It has been shown by Cooper [1] that for any global minimum of L(θ), the corresponding Hessian H_θ has rn positive eigenvalues, N − rn eigenvalues equal to 0, and no negative eigenvalues. Here, n and r are the number of input samples (i.e., data points) and the output dimension, respectively.
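Cooper's counting result can be checked directly for a linear model with realizable targets, where the Hessian of the MSE loss at a zero-loss global minimum is available in closed form. The dimensions below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, r = 3, 8, 2                    # n samples, input dim d, output dim r
X = rng.standard_normal((d, n))
Y = rng.standard_normal((r, d)) @ X  # realizable targets: zero loss attainable

# For L(W) = (1/n) sum_k ||W x_k - y_k||^2, the Hessian with respect to
# vec(W) (rows stacked) is constant: (2/n) I_r kron X X^T, of size N = r*d
H = np.kron(np.eye(r), (2 / n) * X @ X.T)

eig = np.linalg.eigvalsh(H)
n_pos = int((eig > 1e-10).sum())
n_zero = int((np.abs(eig) <= 1e-10).sum())
# r*n positive eigenvalues, N - r*n zeros, none negative, as in Cooper [1]
print(n_pos, n_zero, eig.min())
```

Here X X^⊤ has rank n, so each of the r output blocks contributes n positive eigenvalues, giving rn in total, with the remaining N − rn directions exactly flat.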
Our numerical experiments on different image classifiers and a function approximation task are in accordance with our reasoning and identify marked differences between projections based on Hessian and random directions. Our results also bear on the ongoing debate about the connection between the flatness of a certain loss region and generalizability (i.e., the ability to effectively use trained neural networks on unseen data). Dinh et al. [11] provided a detailed discussion of different mechanisms (e.g., reparameterization of flat regions) indicating that sharp minima can achieve good generalization properties. Their work also highlighted the importance of clear definitions of flatness. Our work shows that flatness in randomly projected loss landscapes might be only apparent and a consequence of (large) principal curvatures of opposite sign that cancel each other out.
There are many interesting and worthwhile directions for future research building on the work reported here. To provide more insights into the geometric properties of high-dimensional loss functions of deep neural networks, future work may study other convolutional architectures [53,54], vision transformers [55], and residual multi-layer perceptrons [56]. Our results suggest that negative curvature directions may be useful for finding better local optima, but further research is needed to gain a clearer understanding of the tradeoffs associated with optimization protocols that are based on gradients and curvature information. More research on learning algorithms that minimize loss functions along negative curvature directions [57] can contribute to the development of more effective optimizers. Such guided optimization algorithms may be of particular use in applications (e.g., neural network-based control [58-60] and function approximation in numerical analysis [20]) where training data can be generated with a high resolution and used to approximate a function as closely as possible. In applications where one wishes to reduce generalization error, the proposed mean-curvature estimation method may also be useful for improving existing curvature regularization approaches [61]. Furthermore, to improve our understanding of the properties of loss landscapes, it may be helpful to devote more attention to the study of connections between random matrix theory and the characterization of high-dimensional functions [62,63]. For example, for critical points of Gaussian fields in high-dimensional spaces, Bray and Dean [62] showed that there is a monotonic relation between the index of the Hessian (i.e., the number of negative eigenvalues) and the loss value: the larger the index, the larger the corresponding loss. Bray and Dean [62] also showed that the eigenvalues of the Hessian at a critical point of such fields are distributed according to Wigner's semicircle law [64] up to an additional shift. However, traditional random-matrix-theory results such as Wigner's semicircle law or the Marčenko-Pastur law [65] are based on various simplifying assumptions, including Gaussian weight, error, and data distributions, which do not hold in practical applications of neural networks [22].

In Table II (Appendix C), we report for each neural network all performance measures after initial training in the first row of each cell. The second and third rows in each cell list the performance measures associated with the minimum training loss in random projections ("rand proj") and Hessian projections ("Hess"), respectively. The column "dHd" lists the principal curvatures associated with the dominant Hessian directions at the initial training optimum.

Appendix D: Function approximation
We compare loss function visualizations that are based on random and Hessian directions in a function approximation task. In accordance with [20], we consider the smooth one-dimensional function

f(x) = log(sin(10x) + 2) + sin(x), (D1)

where x ∈ [−1, 1). To approximate f(x), we use two fully connected neural networks (FCNNs) with 2 and 10 layers, respectively. Each layer has 100 ReLU activations. The numbers of parameters of the 2- and 10-layer architectures are 20,501 and 101,301, respectively. The training data consist of 50 points that are sampled uniformly at random from the interval [−1, 1). We train both neural networks using a mean squared error (MSE) loss function and SGD with a learning rate of 0.1. The 2- and 10-layer architectures are trained for 100,000 and 50,000 epochs, respectively, to reach loss values of less than 10^−4. The best model was saved and evaluated by calculating the MSE loss for 1,000 points that were sampled uniformly at random from the interval [−1, 1).

Figure 9 shows heatmaps of the training and test loss landscapes of both neural networks along random and dominant Hessian directions. Black crosses in Figure 9 indicate loss minima. As in the main text, we observe for both neural networks that random projections are associated with loss values that increase along both directions δ, η and for both training and test data [Figure 9(a,b,e,f)]. For these projections, we find that the loss minima are very close to or at the origin of the loss space. The situation is different in the loss projections that are based on Hessian directions. Figure 9(c) shows that the loss minimum for the 2-layer FCNN is not located at the origin but at (α, β) ≈ (0, 0.04). We find that the value of the training loss at that point is more than 9% smaller than the smallest training loss found in a random-projection plot. The corresponding test loss is about 1% smaller than the test loss associated with the smallest training loss in the random-projection plot. For the 10-layer FCNN, the smallest training loss in the Hessian-direction projection is more than 26% smaller than the smallest training loss in the randomly projected loss landscape [Figure 9(e,g)]. The corresponding test loss in the Hessian-direction plot is about 6% smaller than the corresponding test loss minimum in the random-direction plot [Figure 9(f,h)]. Notice that both the training and test loss minima in the random-direction heatmaps in Figure 9(e,f) are located at the origin, while they are located at (α, β) ≈ (0, −0.05) in the Hessian-direction heatmaps in Figure 9(g,h).
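A compact sketch of the 2-layer setup follows, assuming PyTorch. The layer layout below is one parameterization consistent with the stated 20,501 parameters (an input layer, two hidden 100-unit ReLU layers, and a linear output layer); hyperparameters follow the text, but we run far fewer epochs than the 100,000 used for the reported results:

```python
import torch

torch.manual_seed(0)

# Target function of Eq. (D1)
f = lambda x: torch.log(torch.sin(10 * x) + 2) + torch.sin(x)

# 50 training points sampled uniformly at random from [-1, 1)
x_train = torch.rand(50, 1) * 2 - 1
y_train = f(x_train)

# FCNN matching the 20,501-parameter count of the 2-layer architecture
layers = [torch.nn.Linear(1, 100), torch.nn.ReLU()]
for _ in range(2):
    layers += [torch.nn.Linear(100, 100), torch.nn.ReLU()]
layers.append(torch.nn.Linear(100, 1))
model = torch.nn.Sequential(*layers)

loss_fn = torch.nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)  # learning rate from the text

initial = loss_fn(model(x_train), y_train).item()
for _ in range(2000):  # the text trains for 100,000 epochs to reach < 1e-4
    opt.zero_grad()
    loss = loss_fn(model(x_train), y_train)
    loss.backward()
    opt.step()
final = loss.item()
print(initial, final)
```

The trained `model` parameters would then serve as θ* for the random- and Hessian-direction projections shown in Figure 9.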
In the Supplemental Information [51], we include an animation that shows the evolution of lower-dimensional loss projections of the 10-layer FCNN.

FIG. 4. Dimensionality-reduced loss L(θ* + αη + βδ) of Eq. (31) with n = 900, ñ = 1000 for different directions η, δ. (a-c) The directions η, δ correspond to eigenvectors of the Hessian H_θ of Eq. (31). If the eigenvalues associated with η, δ have different signs, the corresponding loss landscape is a saddle, as depicted in panel (a). If the eigenvalues associated with η, δ have the same sign, the corresponding loss landscape is either a minimum (both signs positive), as shown in panel (b), or a maximum (both signs negative), as shown in panel (c). Because there is an excess of ñ − n = 100 positive eigenvalues in H_θ, a projection onto a dimension-reduced space spanned by random directions η, δ is often associated with an apparently convex loss landscape. An example of such an apparent minimum is shown in panel (d). We selected a single pair of random directions η, δ (i.e., no averaging over random directions has been performed).

FIG. 9. Heatmaps of the mean squared error (MSE) loss along random and dominant Hessian directions for a function approximation task. (a-d) Loss heatmaps for a fully connected neural network (FCNN) with 2 layers and 100 ReLU activations per layer [(a,c): training data, (b,d): test data]. (e-h) Loss heatmaps for an FCNN with 10 layers and 100 ReLU activations per layer [(e,g): training data, (f,h): test data]. Random directions are used in panels (a,b,e,f), while Hessian directions are used in panels (c,d,g,h). Green and red regions indicate small and large MSE loss values, respectively. Black crosses indicate the positions of loss minima in the shown domain. Both neural networks, FCNN 1 and FCNN 2, are trained to approximate the smooth one-dimensional function (D1).

TABLE I. Overview of several quantities that are useful for characterizing the curvature of high-dimensional loss functions. We summarize the employed definitions of Hessians and several derived curvature measures that quantify the curvature of N-dimensional loss functions L : R^N → R and their two-dimensional random projections L(θ* + αη + βδ) at a non-degenerate critical point θ*. The vectors η, δ ∈ R^N are the directions along which we project the original loss function, and α, β ∈ R are the corresponding scale factors. Random loss projections are usually based on Gaussian vectors η, δ, which are almost orthogonal in high-dimensional spaces (i.e., for large N).

TABLE II. Summary of image classification results. We summarize the four performance measures training loss, test loss, training accuracy, and test accuracy for ResNet-56, DenseNet-121, ResNet-56 without skip connections, MobileNetV2, and GoogLeNet.