Near-optimal control of dynamical systems with neural ordinary differential equations

Optimal control problems naturally arise in many scientific applications where one wishes to steer a dynamical system from an initial state x 0 to a desired target state x∗ in finite time T. Recent advances in deep learning and neural network–based optimization have contributed to the development of numerical methods that can help solve control problems involving high-dimensional dynamical systems. In particular, the framework of neural ordinary differential equations (neural ODEs) provides an efficient means to iteratively approximate continuous-time control functions associated with analytically intractable and computationally demanding control tasks. Although neural ODE controllers have shown great potential in solving complex control problems, the understanding of the effects of hyperparameters such as network structure and optimizers on learning performance is still very limited. Our work aims at addressing some of these knowledge gaps to conduct efficient hyperparameter optimization. To this end, we first analyze how truncated and non-truncated backpropagation through time affect both runtime performance and the ability of neural networks to learn optimal control functions. Using analytical and numerical methods, we then study the role of parameter initializations, optimizers, and neural-network architecture. Finally, we connect our results to the ability of neural ODE controllers to implicitly regularize control energy.

Mathematically, optimal control is concerned with finding a control signal u(t) ∈ R m (0 ≤ t ≤ T ) that steers a dynamical system from its initial state x 0 ∈ R n to a desired target state x * ∈ R n in finite time T , minimizing the control energy (1.1) Depending on the application, it may be useful to consider integrated cost functionals that are different from Eq. (1.1).A corresponding example is provided in Appendix A.
The outlined optimal control problem aims at finding (1.2) u * (t) = arg min subject to the constraint where f (•) denotes a vector field that depends on the system state x(t), control input u(t), and time t.Linear systems admit a closed-form solution of u * (t) [44].For Fig. 1.Controlling networked dynamical systems with neural ODEs.(a) A neural ODE controller takes the time t as an input variable and produces a control signal û(t; θ) ∈ R m .A networked dynamical system is then controlled by connecting control inputs u 1 (t; θ), . . ., um(t; θ) to all or a subset of nodes.Activation functions in the neural ODE controller are denoted by σ. (b,c) Evolution of x(t), the state of a one-dimensional dynamical system, and of û(t; θ), the corresponding control function.Solid grey lines represent x(t) and û(t; θ) at different training epochs.Solutions after convergence are shown by solid black lines.
general, non-linear dynamical systems, one typically aims at minimizing (1.4) where L(x(T ), x * ) denotes the distance between the reached state and the desired target state x * , e.g., the mean-squared error (MSE) L(x(T ), x * ) ∝ x(T ) − x * 2 2 .The Lagrange multiplier µ models the impact of control energy on the total loss.Clearly, in the limit µ → ∞, we have u * = 0.
A necessary condition for optimal control is provided by Pontryagin's maximum principle (PMP) [37] while the Hamilton-Jacobi-Bellman (HJB) equation provides a necessary and sufficient condition for optimality [45,6].To solve non-linear optimal control problems, one typically resorts to indirect and direct numerical methods.Examples of indirect optimal control solvers include different kinds of shooting methods [36] that use PMP and a control Hamiltonian to construct a system of equations describing the evolution of state and adjoint variables.In addition to indirect methods, one may also parameterize state and control functions and directly solve the resulting optimization problem.Possible function parameterizations include piecewise constant functions and other suitable basis functions [9].Over the past two decades, pseudospectral methods have emerged as an effective approach for solving non-linear optimal control problems [18] with applications in aerospace engineering [23].However, it has been shown that certain pseudospectral methods are not able to solve standard benchmark control problems [16].In this work, we represent time-dependent control signals by artificial neural networks (ANNs) [24].To parameterize and learn control functions, we use neural ordinary differential equations (neural ODEs) [11,14,4,10].A schematic of the application of neural ODE control (NODEC) to a networked dynamical systems is shown in Fig. 1.
Control methods that are based on ANNs have been applied to both discrete and continuous time dynamical systems [30].To keep calculations associated with gradient updates tractable, applications of ANNs in control have often focused on shallow architectures and linear dynamics.For high-dimensional and non-linear dynamics with intractable gradient updates, "identifier" ANNs are useful to learn and replace the dynamical system underlying a given control task [30].Recent advances in neural ODEs [11] and automatic differentiation [5] allow for the direct application of deep neural network architectures to high-dimensional non-linear models [4], thereby avoiding identifier networks [30] and limitations associated with shallow ANN structures.Other popular uses of ANNs for control are found in the fields of deep reinforcement learning [35] and neural HJB methods [2].Deep reinforcement learning is often applied to model-free control tasks where the model dynamics are unknown or non-differentiable [34].Neural HJB methods have been applied to state-feedback control and rely on the existence of smooth value functions [2].Instead of explicitly minimizing control energy or other loss functionals, neural ODE controllers have been shown to exhibit implicitly regularization properties [10,4].In this work, we study the effect of learning protocols and hyperparameters such as network structure and optimizers on the ability of NODEC to implicitly learn near-optimal control solutions by minimizing the distance between reached and target state.
The remainder of this paper is organized as follows.Section 2 summarizes the basic concepts associated with neural ODE control (NODEC) and discusses two different backpropagation protocols for learning control functions: (i) truncated backpropagation through time (TBPTT) and (ii) backpropagation through time (BPTT) [42,41,17].In Section 3, we discuss how parameter initialization and activation functions affect a neural network's ability to implicitly learn OC solutions.Invoking tools from dynamical systems theory, we discuss how different neural network structures affect the learnability of a constant control solution.We will show that single neurons are able to learn OC solutions only for certain initial weights and biases.Numerical results presented in Sec. 4 indicate that as the depth of the employed neural network increases, initial conditions have a smaller impact on the ability of NODEC to learn a near-optimal control solution.In Section 5, we analyze how gradients in the neural network parameters induce gradient updates in both the control function and control energy.By applying NODEC to a control problem with a time-dependent OC solution, we find that adaptive gradient methods such as Adam [25] are able to approximate OC well while steepest descent (SD) fails to do so.We complement this part of our analysis with one and two-dimensional loss projections that are useful to geometrically interpret the tradeoff between (i) a small distance between reached and target state and (ii) a small control energy.Section 6 concludes our paper.
Our source codes are publicly available at [1].
2. Backpropagation protocols.The basic steps underlying NODEC-based solutions of optimal control problems (1.3) are summarized below.
1. Parameterize the control input û(t; θ) using a sufficiently wide and deep neural network.Here, θ ∈ R N denotes the neural-network parameters.2. Solve the dynamical system (1.3) numerically for a given initial condition x(0) = x 0 .3. Calculate the difference L(x(T ), x * ) between reached state x(T ) and target state x * and backpropagate gradients to update neural-network weights.4. Iterate steps 1-3 until convergence is reached (i.e., until L(x(T ), x * ) is smaller than a certain threshold).We will discuss in the following sections that control solutions with a small control energy E T [u] can be obtained without explicitly including E T [u] in the loss function.Even if the control energy is included directly in the loss function, one may still have to search for a Lagrange multiplier µ [see Eq. (1.4)] that is associated with a desired tradeoff between control energy and distance from the target state.We provide a corresponding example in Appendix A. Identifying the optimal value of a Lagrange multiplier [21] or Lagrange costate might prove challenging or intractable in practice [40].Implicit energy regularization properties that have been reported for NODEC [4,10] may therefore be useful for the design of effective control methods.
To learn control functions that are suitable to solve a certain control problem minimizing the MSE between reached and target state, there exist two main classes of backpropagation protocols: (i) truncated backpropagation and approximating a desired control function within a certain subinterval of [0, T ] and (ii) learning the control input over the whole interval [0, T ] using backpropagation through time (BPTT).The first option is also referred to as truncated BPTT (TBPTT).
In the following two subsections, we will discuss the differences between BPTT and TBPTT with respect to learning control functions.We will also provide an example to compare the performance of BPTT and TBPTT.

Truncated backpropagation through time.
For notational brevity, we will consider a TBPTT protocol for a one-dimensional flow ẋ = f (x(t), û(t; θ), t) where neural-network parameters θ are being updated for a time t in the loss function L = 1/2(x(T ) − x * ) 2 .Gradients are calculated through the loss function if the underlying dynamical system time t is equal to t .Generalizations for multiple evaluation times and higher-dimensional flows follow directly from the following equations.The TBPTT gradient of L w.r.t.θ is where is a small positive number that determines the width of the TBPTT interval.Note that can be absorbed in the learning rate associated with gradient-descent updates of θ.
where the notation I D ûx(T ), J û is used to indicate that the integration associated with the derivative D ûx(T ) := dx(T )/dû has to be applied to J û .Note that ∂L/∂x(T ) is not time-dependent, while the derivative of the loss function L w.r.t.û is an operator that acts on the time-dependent quantity J û .
We will now compare BPTT and TBPTT protocols for controlling the twodimensional linear flow The optimal control u * (t) associated with solving Eq. (2.5) and minimizing E T [u] [see Eq. (1.1)] has been derived in [44].We compare the optimal solution with the solutions obtained using (T)BPTT and the MSE loss Figure 2(a) shows the evolution of x(t) in the (x 1 , x 2 ) plane.Both backpropagation protocols, BPTT and TBPTT, produce solutions after 2,000 epochs of training that are similar to the OC solution.This similarity is also reflected in the evolution of the control signal and control energy [see Fig.  smaller learning rate, but at later stages in training it becomes less stable.On the contrary, TBPTT, converges slower with a higher learning rate, but it is more stable after convergence.Under the employed parametrization, both methods are capable to converge fast to low error values.In terms of computation time, on the same hardware and according to tqdm library measurements, BPTT performs around 65 training epochs per second, whereas TBPTT achieves approximately 125 epochs per second.The difference in total computation time may be attributed to the fewer gradient evaluations required by TBPTT.

Parameter initialization and activation functions.
As shown in Sec. 2 and in previous work [10,4], NODEC is able to learn near-optimal control solutions without explicitly minimizing E T [u].To provide insights into the learning dynamics underlying NODEC, we study two control problems associated with a one-dimensional linear flow.The first dynamical system has a constant OC solution while that of the second dynamical system is time-dependent.In this section, we will show that shallow architectures require one to tune initial weights and biases to learn OC solutions without accounting for a control energy term in the loss function.The next section then provides numerical evidence that deeper architectures are less prone to such weight and bias initialization effects.

Constant control.
We first consider a state-independent, one-dimensional flow Using the control Hamiltonian, we obtain the optimal control u * (t) from PMP according to where we used that the reached state x(T ) is equal to the target state x * .The optimal control is u * = (x * − x 0 )/T .A simple neural network that consists of a single ReLU and uses the time t as an input is, in principle, able to represent the constant optimal control u * .For a neural-network generated control signal û(t Near-optimal control solutions have been obtained representing control functions by neural ODEs and optimizing without explicitly accounting for the integrated cost or control energy (1.1) [10].Under what conditions (e.g., initial weights and biases, optimization algorithm, and neural-network structure) can a neural network learn a control function û(t; θ) that is close to u * (t) by optimizing Eq. (3.5)?As a starting point for addressing this question, we will focus on analytically tractable learning algorithms and neural network structures.In the following sections, we will then use these gained insights to draw conclusions for control problems that involve higher-dimensional neural network structures and generalized optimization algorithms.
3.1.1.Single-neuron structure with linear activation.In the following example, we use simple neural network that consists of a linear activation and a weight and bias parameter [see Fig. 3(a)].The corresponding loss function is Using steepest descent learning and BPTT, weights and biases are iteratively updated according to where w (0) and b (0) denote initial weight and bias, respectively.The quantity η denotes the learning rate.Using the loss function (3.6), the gradients in Eq. (3.7) are The fixed points (w (∞) , b (∞) ) of the iteration (3.8) satisfy For T = 1, x 0 = 0, and x * = −1, the fixed points are given by w Recall that, in this example, the weight and bias of optimal control are w * = 0 and b * = −1.This fixed point can only be reached for specific initializations as shown in Fig. 4(a,b).Other fixed points are associated with certain tradeoffs between small control energies and small losses L(x(T ), x * ) [see Fig. 4(c,d)].
3.1.2.Single-neuron structure with ReLU activation.We now use a slightly different neural-network structure that consists of a single ReLU [see Fig. 3(b)].The loss function in this example is If the weights are positive, the gradient-descent equations are equivalent to Eq. (3.8).Otherwise, we have (3.12) If w (n) < 0 for some n, the fixed point of the gradient descent (3.12) is equivalent to the optimal control b (∞) = b * = x * −x0 T .Figure 5(a,b) shows that the optimalcontrol domain of Eq. (3.11) is substantially larger than that resulting from a gradient descent in Eq. (3.6).The control energy approaches that of optimal control for w (∞) ≤ 0, while larger values of w (∞) are associated with larger control energies [see Fig. 5(c)].As in the previous example, one can observe that tradeoffs between small control energies and small losses are possible.
Equation (3.12) also shows that optimal control of Eq. (3.1) is a global attractor if one would just learn a bias term, reflecting the fact that the optimal control is given by a constant.

Time-dependent control.
To study the ability of NODEC to learn timedependent control functions, we consider the linear dynamical system The corresponding optimal control signal that satisfies Eqs.(1.2) and (1.3) is Note that for a > 0 the magnitude of u * (t) decays exponentially because larger values of t are associated with larger state values of the uncontrolled dynamics ẋ = ax.The opposite holds for a < 0. The evolution of the system state x(t) under the influence of u * (t) is x * (t) = x 0 e at + sinh(at) sinh(aT ) x * − x 0 e aT , As in the prior examples, we use NODEC to learn û(t; θ) by minimizing the MSE loss (3.5).We set x 0 = 0 and T = a = b = x * = 1, such that x * (t) = sinh(t)/ sinh(1), u * (t) = e −t / sinh (1), and E T [u * ] = 1/(e 2 −1) ≈ 0.157.To capture the time-dependence of the control input, we use a neural network with two hidden layers and six ELU units per hidden layer.Numerical experiments with smaller architectures showed that the exponential decay of the OC solution could not be captured well.The number of timesteps in our simulations is 100.If bias and weight values are initialized to 1, NODEC is able to steer the dynamical system towards the desired target state [see Fig. 6(a)].For a learning rate of about 0.12, we observe that NODEC approximates the OC solution by a constant control with a similar control energy [see Fig. 6(b,c)].The relative difference between the two control energies is about 3.5%.Using smaller initial weights and biases can help NODEC approximate the OC solution even more closely [see Fig. 6(d-f)].We find a minimum relative control energy difference of less than 0.2%.Earlier work [10] also suggested that small initial weights and biases can be helpful for learning OC solutions without explicitly regularizing control energy.

4.
Neural network depth and width.After having discussed the effect of parameter initialization on the ability of NODEC to implicitly learn OC solutions, we will now study how different numbers of layers and neurons per layer can help improve learning performance in the context of the examples from Sec. 3.

Constant control.
We first focus on the ability of deeper ANN architectures to implicitly learn a constant control function for the example that we discussed in Sec.3.1.The optimal control signal is u * = −1, and we use an ANN with hyperbolic tangent activations, between one to nine layers, and up to 1,100 neurons.Weights and biases are initially distributed according to where k denotes the number of input features of a certain layer.Figure 7(a) shows that an increase in the number of layers is associated with a decrease in control energy towards the optimal value of 0.5.Even if the number of layers increases, loss values are still small and almost unaffected by the change of the ANN architecture [see Fig. 7(b)].Increasing the number of layers also helps in learning constant controls, as the temporal mean µ û is close to the optimal constant control for most simulated scenarios with many layers [see Fig. 7(c)] while the corresponding variance σ 2 û gets closer to 0 [see Fig. 7(d)].To summarize, we observe that adding more layers is more beneficial than adding more neurons in order to approximate a constant OC function.4.2.Time-dependent control.For learning the time-dependent control signal that we studied in Sec.3.2, we use an ANN with ELU activations, between one to nine layers, and up to 90 neurons.Weights and biases are initially distributed according to where k denotes the number of input features of a certain layer.Figure 8(a) shows that smaller control energies can be achieved on average as the number of layers increases.For one or eight layers and 30 neurons in total, we observe that the ANN controller is able to implicitly learn a solution with a small loss and a control energy that is close to that of the optimal solution.However, in general, the loss may increase with the number of layers [see Fig. 8(b)].Increasing the layers while keeping a low number of neurons seems to affect convergence as most ANNs seem to converge to a non-optimal constant control-the variance in the right side of Fig. 8(d) is close to 0 (darker color).We show in Sec.5.3 that the constant OC approximation of Eq. (3.14) has a control energy of about 0.169 [see Fig . 8].In this example, we find that implicit regularization of control energy can be achieved by selecting appropriate hyperparameters (e.g., one or eight layers and 30 neurons in total ).
In Appendix B, we study the effect of different numbers of layers and neurons on the performance of NODEC in implicitly learning near-optimal control solutions for the time-dependent control signal associated with the two-dimensional flow discussed in Sec. 2.

Induced gradient update.
For the linear one-dimensional flow (3.13), we have (5.2) x(t) = e at x 0 + t 0 be −at û(t ; θ (n) ) dt and Using the BPTT notation introduced in Sec. 2, we find .
(5.4) Invoking Eq. (5.1), the induced gradient update in û(t; . (5.5) Based on Eq. (5.5), we define the weighted total change in û(t; θ (n) ) for linear dynamics (3.13) as where ∆θ (n) can be directly evaluated with standard backpropagation methods, thus avoiding the evaluation of the Jacobian elements (J û) i = ∂ û/∂θ (n) i in Eq. (5.1). Figure 9 shows a comparison of learned control signals û(t; θ (n) ) for linear dynamics (3.13) with x 0 = 0 and T = a = b = x * = 1.The neural network that we use in this example consists of 2 hidden layers with 6 ELU neurons each.Weights and biases are initialized to values of 0.1.Steepest descent fails to approximate the OC signal [see Fig. 9(a)], whereas Adam is able to approach the optimal control function [see Fig. 9(b)].We again emphasize that the learning loss is proportional to the squared difference between reached state and target state (x(T ) − x * ) 2 [see Eq. (3.6)].Learning the OC solution is not a direct learning target.Still, in the discussed example, learning OC is possible with the Adam optimizer.As described by Eqs.(5.1)-(5.5),a gradient descent in θ (n) may induces a gradient update in û(t; θ (n) ).Because the Jacobian J û is usually not straightforward to evaluate, we use ∆ Û (n) to study if the linearization of û(t; (solid red line).The evolution of ∆ Û (n) under Adam (solid blue line) undergoes oscillations.

Control energy regularization.
The induced gradient update in the control input û(t; θ (n) ) and the evolution of the control energy E T [û] during training are connected via (5.7) Equation (5.7) shows that, up to terms of higher order, a gradient-descent update in θ is associated with a control energy update that is equal to and ∇ θ (n) L. If cos(ω) > 0 (i.e., if −π/2 < ω < π/2), a gradient update in θ (n) is associated with a decrease in control energy.
Table 1 provides an overview of the gradient updates associated with the neuralnetwork parameters θ (n) , control function u(t; θ (n) ), and control energy Overview of gradient updates of neural-network parameters, control function, and control energy.
Using random projections of the loss function L [31] can help provide geometric intuition for the interplay between explicit minimization of L and implicit regularization of the control energy E T [û].To obtain two and three-dimensional projections of the 51-dimensional parameter space of the employed neural networks, we set the neural-network parameters θ = θ * + αδ + βη around a local optimum θ * .Here, α, β ⊆ R denote scaling parameters and δ, η ∼ N (0, 1 N ×N ) are random Gaussian  ) ] as a function of n.In the limit n → ∞, the control energy approaches 1/2(e − 1) −2 ≈ 0.17.(c) The MSE loss L(x(T ), x * ) as a function of n.Solid red and blue lines represent steepest descent and Adam solutions, respectively.We set the learning rate η = 10 −2 .

vectors.
For an Adam-based optimization, Fig. 10 shows two and three-dimensional random projections of L(x(T ), x * ), E T [û], and the mean-squared error quantifying the deviation of û from the optimum u * .Here, M is the number of timesteps and t i = i T /M .In the random projection shown in Fig. 10, we observe that Adam produces a local optimum that is associated with a small loss, MSE(u * , û), and control energy.In comparison with the results obtained with steepest descent (see Fig. 11), we observe that Adam produces sharper local optima.Figure 11 also shows that steepest descent produces two "valleys" of local optima separated by a plateau.Escaping from a local optimum located in one valley to another optimum in the second valley may be challenging as even larger learning rates may not be able to overcome the plateau.Additional loss projections for neural networks with fewer parameters are provided in Appendix C.

Constant optimal control approximation.
In situations where the OC function is expected to only vary modestly in time, or when no good guess of the OC solution is available, one may use a constant control approximation as a baseline.For such a baseline, the underlying neural network consists of a bias term (i.e., û(t; θ (n) ) ≡ c (n) ) and Eq.(5.1) becomes (5.10) where ∆c (n) = −η dL dc (n) and (5.11) In the limit n → ∞, the control function approaches c * = 1/(e − 1).
In each gradient-descent step, the control energy changes according to (5.12) For linear dynamics (3.13) with x 0 = 0 and T = a = b = x * = 1, we have (5.13) Figure 12 shows the evolution of û(t; ) ], and L(x(T ), x * ).Solid red and blue lines represent solutions that were obtained with steepest descent and Adam, respectively.Dashed black lines in Fig. 12(a,b) indicate the OC solutions.The convergence behavior of Adam is non-monotonic.Between 60 and 70 training epochs, Adam achieves control energy values smaller than E T [û (∞) ] = 1/2(e − 1) −2 , owing to an increase in the loss L(x(T ), x * ).The ratio between E T [û (∞) ] = 1/2(e − 1) −2 and the OC energy is 1.08, indicating that a constant control approximation achieves a control energy that is close to that of the corresponding OC solution.
The constant control baseline that we derived in this section also appears to be close to possible local optima that are learned by NODEC.For example, for certain learning rates and initial values of neural-network parameters, Adam approaches solutions that are similar to the constant baseline [see Fig. 6(b,e)].We also observe a similar behavior for steepest descent learning [see Fig. 9(a)].Also one of the loss projections in Appendix C shows that steepest descent approaches a local optimum that corresponds to a constant control approximation.A solution that is closer to OC and which was not found by steepest descent is also visible in the shown projections.
6. Discussion.Optimal control problems arise in various applications and can be solved analytically often only under special assumptions (e.g., for linear systems [44]).For non-linear control problems, different direct and indirect numerical methods have been developed to approximate OC solutions.The loss function underlying OC problems consists of a final cost (e.g., difference between reached and target state) and an integrated cost (e.g., control energy).To achieve a certain tradeoff between these two cost contributions, one has to find suitable Lagrange multipliers [see Eq. (1.4)], which might prove challenging in practice.Neural ODE controllers provide an alternative to standard direct and indirect OC solvers in that they are able to implicitly learn near-optimal control functions [4,10] if the underlying hyperparameters are chosen appropriately.The problem of optimizing Lagrange multipliers in OC loss functionals can be recast into a hyperparameter optimization problem, with the goal being that the basin of attraction of the desired OC solution in the underlying loss landscape is sufficiently large.
To study the ability of neural ODE controllers to perform implicit control energy regularization, we have analyzed the influence of backpropagation protocols, parameter initialization, optimizers, and activation functions on the learned control solutions.Using analytical and numerical arguments, we were able to identify different features that can help neural ODE control designers to efficiently perform hyperparameter optimization and learn near-optimal control solutions.
First, truncated backpropagation through time (TBPTT) has been shown to be computationally more efficient in terms of iterations per unit time than its untruncated counterpart.Although TBPTT protocols only provide an approximation of the actual gradient update, our results suggest that they can achieve loss values and control solutions that are similar to those obtained with untruncated gradient updates.Future work may evaluate the computational advantages of TBPTT over BPTT in larger problems.
Second, a useful starting point for certain control problems may be a constant controller that can be represented by a bias term.Constant optimal control approximations appeared as local optima in deeper architectures that we used in our numerical experiments.For high-dimensional control tasks studied in earlier work [10], NODEC also learned a constant optimal control approximation that provided better performance than solutions found by an adjoint-gradient method (i.e., an indirect OC solver).Increasing the width and depth of the underlying ANN can substantially improve the performance of NODEC in implicitly learning near-optimal control solutions as shown in Sec. 4. Between four to six hidden fully-connected layers and about five to ten neurons per hidden layer have shown good performance in different control tasks.In accordance with earlier numerical work [10], our results also highlight the importance of appropriate initialization protocols.Large weight and bias values may often be associated with loss values that are far away from the basin of attraction of a nearoptimal control solution.In this context it may be worthwhile to study combinations of neural-network parameter initialization protocols and momentum schedulers [39].
Third, adapative gradient methods like Adam have been shown to be able to learn near-optimal control functions in problems where steepest descent fails to do so.These results therefore contribute to the discussion of the advantages of adaptive optimizers over standard gradient descent techniques [19,43,12].Two and three dimensional loss projections provide insights into hyperparameter-dependent geometric features of the loss and control energy landscapes and may reveal the existence of better nearby local optima.
There are different possibilities for future work.Studying the ability of neural ODE controllers to implicitly learn control functions for loss functionals different from Eq. (1.4) ("generalized implicit regularization").It would also be interesting to use dynamical systems approaches to study the behavior of adaptive optimizers akin to our steepest-descent analysis in Sec.3.1.Instead of using ANNs to approximate control functions, control problems can function as a tool to better understand optimization dynamics within deep ANNs that are otherwise mainly studied in terms of image classification.respectively.The neural network that we use in this example has H = 64 layers with N H = 6 ELU neurons per layer.Bias terms are included in every layer.
To study the effect of different neural network structures and activations on the learning performance, we first set the number of neurons per layer N H = 6 and vary the number of layers H from 2 to 64.All weights and biases are initially set to a value of 10 −2 .We perform numerical experiments for both ELU and ReLU activations.We train NODEC for 100 epochs using Adam with a learning rate η = 0.5 × 10 −2 , and we evaluate the best model.We show the corresponding loss and MSE values after training in Fig. 13(c,d).For H ≥ 8, the loss approaches values smaller than 10 −7 and the MSE drops to values below 10 −4 .For one layer and varying N H , the neural network is not able approximate the OC solution.Our results are almost identical for both ReLU and ELU activations.
In addition to the prior numerical experiment, we now also directly account for the additional loss term W [x, u] and optimize where µ is a Lagrange multiplier that models the influence of W [•] on the total loss.To initialize weights without additional parameter optimization, we use the Kaiming uniform initializer [20], and we initially set all biases to a value of 10 −2 .The neural network that we use to optimize Eq. (A.3) consists of H = 8 fully connected linear layers with N H = 6 ELU neurons each.This way of setting up the network architecture ensures that NODEC is able to represent the constant control function.We use Adam with a learning rate of 10 −1 to train the neural network for 100 epochs, and we evaluate the best model.In addition to the uniform weight initialization, the learning rate of η = 10 −1 was chosen to model a not fully optimized hyperparameter set.Otherwise, near-optimal solutions would be learned automatically as in the previous example without optimizing the multiplier µ.As shown in Fig. 14(a,b), the loss component L(x(T ), x * ) approaches a minimum for µ ≈ 2 × 10 −3 .The corresponding value of W [x, u] is reasonably close to the optimal solution.However, Fig. 14(a) also shows that the implicit regularization of W [x, u] can achieve smaller loss values and similar values of W [x, u]. Figure 14(c,d) shows different NODEC solutions (solid red lines) that are associated with different multiplier values µ.Although almost all solutions are aligned with the OC solution in terms of the evolution of x(t), variations in µ produce visible deviations of û(t; θ) from u * (t).Appendix B. Architectures for time-dependent control in two dimensions.For learning a time-dependent control, we use an ANN with leaky ReLU activations, between one to nine layers, and up to 90 neurons.Weights and biases are initially distributed according to U(− √ k, √ k), where k denotes the number of input features of a certain layer.Figure 15(a) shows that smaller control energies can be achieved as the number of layers increases.For six layers and 20 neurons in total, we observe that the ANN controller is able to implicitly learn a solution with a small loss and a control energy that is close to that of the optimal solution.However, in general, the loss may increase with the number of layers [see Fig. 15(b)].Increasing the layers while keeping a low number of neurons seems to affect convergence as most ANNs seem to converge to a non-optimal constant control-the variance in the bottom right corner of Fig. 15(d) is close to 0 (darker color).To summarize, implicit regularization of control energy can be achieved by selecting appropriate hyperparameters (e.g., six layers and 20 neurons in total).

Appendix C. Loss projections.
To complement our geometric analysis of implicit energy regularization (see Sec. 5), we provide additional loss, MSE, and control energy visualizations in Figs.16 and 17.The control task is the same as in Sec. 5. We aim at controlling the one-dimensional flow (3.13) with x 0 = 0 and T = a = b = x * = 1.The neural network that we use here consists of 2 hidden layers with 5 ELU neurons each.Weights and biases are initialized to values of 0.1.The total number of neural-network parameters is 46 instead of 51 as in Sec. 5.
Figure 16 shows two and three-dimensional loss projections around a local optimum found by Adam.The local optimum is associated with a small loss, MSE, and  For the same neural network and initial conditions, we observe in Fig. 17 that steepest descent gets stuck in a local optimum that is associated with a larger MSE than the optimum found by Adam.Another optimum that is visible in the projections in Fig. 17 has a significantly smaller MSE and control energy than the one found by steepest descent.However, in the shown example, steepest descent converges towards a constant control approximation (see Sec. 5.3) and not towards the local optimum that is closer to the OC solution.

Fig. 2 .
Fig. 2. Learning control functions with different backpropagation protocols.(a) The evolution of x(t) of the solution of Eq. (2.5) in the (x 1 , x 2 ) plane.(b,c) The evolution of the corresponding control function and control energy.The OC solution is indicated by a dashed grey line.Black disks and red triangles show the solutions associated with BPTT and TBPTT protocols, respectively.The underlying neural network has 2 hidden ELU layers and 14 neurons per hidden layer.We trained the neural network for 1,000 epochs using Adam.The learning rates are η = 3 × 10 −3 (BPTT) and η = 5 × 10 −3 (TBPTT).(d) The loss L(x(T ), x * ) as a function of training epochs n.Both backpropagation protocols can achieve low loss values after a few hundred epochs of training.

Fig. 3 .
Fig. 3. Examples of single-neuron structures.(a) The input is the time t and the output is wt + b.(b) The input is the time t and the output is max(0, wt) + b.

Fig. 4 .
Fig. 4. Approximating constant control with a linear activation function.(a) The mean squared error (MSE) associated with the deviation of the learned weight and bias, (w (∞) , b (∞) ), from the optimal control weight and bias, (w * , b * ) = (0, −1), for different initial weights w (0) and biases b (0) .The solid white line indicates the location of initial weights and biases for which gradientdescent learning converges to optimal control.(b) Convergence of weight and bias values towards the attractor w (∞) = −2(1+b (∞) ).Solid red lines and filled red squares indicate learning trajectories and learned parameters (w (∞) , b (∞) ), respectively.The solid black line represents fixed points of gradientdescent learning [i.e., w (∞) = −2(1 + b (∞) ) as defined in Eq. (3.9)].The fixed point (w * , b * ) = (0, −1) corresponding to optimal control is marked by a black circle.(c) The control energy E T [u] [see Eq. (1.1)] as a function of learned neural-network weights w (∞) .Energies associated with simulation results and gradient-descent fixed points are indicated by filled red disks and a solid black line, respectively.The dashed grey line marks the energy associated with optimal control.(d) Control energy as a function of the loss function (3.6).In all simulations, we set T = 1, x 0 = 0, and x * = −1.We trained the controller with Adam and used a learning rate η = 0.1.

Fig. 5 .
Fig. 5. Approximating constant control with a ReLU activation function.(a) The mean squared error (MSE) associated with the deviation of the learned weight and bias, (w (∞) , b (∞) ), from the optimal control weight and bias, w * ≤ 0, b * = −1, for different initial weights w (0) and biases b (0) .The deep purple region indicates initial weights and biases for which the neural-network learns optimal control.Optimal control cannot be learned in the yellow region.(b) Convergence of weight and bias values towards the attractor w (∞) = −2(1 + b (∞) ).Solid red lines and filled red squares indicate learning trajectories and learned parameters (w (∞) , b (∞) ), respectively.The solid black line represents the attractor of gradient-descent learning [i.e., b * = −1 if w (∞) ≤ 0 and w (∞) = −2(1 + b (∞) ) if w (∞) > 0].The attractor w * ≤ 0, b * = −1 corresponds to optimal control.(c) The control energy E T [u] [see Eq. (1.1)] as a function of learned neural-network weights w (∞) .Energies associated with simulation results and gradient-descent fixed points are indicated by filled red disks and the solid black line, respectively.The dashed grey line marks the energy associated with optimal control.(d) Control energy as a function of the loss function (3.6).In all simulations, we set T = 1, x 0 = 0, and x * = −1.We trained the controller with Adam and used a learning rate η = 0.1.

Fig. 6 . 2 T 0 u
Fig. 6.Approximating time-dependent control.We use NODEC to control Eq. (3.13) for x 0 = 0 and T = a = b = x * = 1.The neural network that we use in this example consists of 2 hidden layers with 6 ELU neurons each.(a-c) Initial weights and biases are set to 1. (d-f ) Initial weights and biases are set to 0.1.Different solid red lines are associated with different learning rates.Solid black lines indicate OC solutions (3.14)-(3.16).We trained NODEC for 10 3 epochs using Adam.

Fig. 7 .
Fig. 7. Effect of variations in number of layers and neurons per layer on the ability of NODEC to learn a constant control signal.The underlying architectures use hyperbolic tangent activations in all hidden layers.Bias terms are not included.Each ANN is trained for 100 epochs using Adam and a learning rate η = 10 −3 .We set the number of neurons per layer to be the greatest integer less than or equal to the maximum number of neurons (vertical axes) divided by the number of layers (horizontal axes).The optimal constant control is −1 and the optimal control energy is 1/2.Heatmaps show the (a) control energy E T [û], (b) loss L(x(T ), x * ), (c) mean control signal over time µ û, and (d) control signal variance over time, σ 2 û.

Fig. 8 .
Fig. 8. Effect of variations in number of layers and neurons per layer on the ability of NODEC to learn the time-dependent control signal (3.14) for x 0 = 0 and T = a = x * = 1.The underlying architectures use bias terms and ELU activations in all hidden layers.Each ANN is trained for 500 epochs using Adam and a learning rate η = 3 × 10 −3 .We set the number of neurons per layer to be the greatest integer less than or equal to the maximum number of neurons (vertical axes) divided by the number of layers (horizontal axes).The optimal control energy is approximately 0.157.Heatmaps show the (a) control energy E T [û], (b) loss L(x(T ), x * ), (c) mean control signal over time µ û, and (d) control signal variance over time, σ 2 û.

Fig. 9 .
Fig. 9. Learning control signals with steepest descent and Adam.We use NODEC to control Eq. (3.13) for x 0 = 0 and T = a = b = x * = 1.(a,b) Learned and optimal control signals are indicated by solid red and black lines, respectively [(a) steepest descent; (b) Adam].Different solid red lines correspond to control functions û(t; θ (n) ) learned after different numbers of epochs n.The learning rate was set to η = 0.15.The neural network that we use in this example consists of 2 hidden layers with 6 ELU neurons each.Initial weights and biases are set to 0.1.Solid black lines indicate the optimal control signal (3.14).(c) The weighted total change ∆ Û (n) [see Eq. (5.6)] as a function of epochs n.Red squares (steepest descent) and the blue solid line (Adam) are based on directly evaluating T 0 û(t; θ (n+1) )e −at dt − T 0 û(t; θ (n) )e −at dt.The solid red line is based on evaluating the right-hand side of Eq. (5.6) using neural-network parameter changes ∆θ (n) associated with steepest descent.

Fig. 12 .
Fig. 12. Approximating optimal control by a constant.(a) The constant control input û(t; θ n ) ≡ c (n) as a function of the number of training epochs n.The dashed black line indicates the optimum c (∞) = 1/(e − 1) ≈ 0.58.(b) The control energy E T [û (n) ] as a function of n.In the limit n → ∞, the control energy approaches 1/2(e − 1) −2 ≈ 0.17.(c) The MSE loss L(x(T ), x * ) as a function of n.Solid red and blue lines represent steepest descent and Adam solutions, respectively.We set the learning rate η = 10 −2 .

Fig. 13 .
Fig. 13.Controlling a moving particle that is subject to friction.(a,b) Evolution of the state x(t) and control function u(t).Optimal control and NODEC-based solutions are shown in grey and red, respectively.(c,d) Loss L(x(T ), x * ) = x(t) − x * 2 2 and MSE(u * , û) for different activations and architectures as a function of the number of neurons.

Fig. 14 .
Fig. 14.Lagrange multiplier optimization.(a,b) The loss L(x(T ), x * ) = x(T )−x * 2 2 and work W [x, u] [see Eq. (A.1)] for different values of µ [see Eq. (A.3)].Red disks are associated with a direct optimization of both L(x(T ), x * ) and W [x, u] while solid red lines indicate implicit regularization solutions.(c,d) Different values of µ are associated with different NODEC-based solutions of x(t) and û(t; θ) (solid red lines).As a reference, we also show the corresponding OC solutions (solid grey line).

Fig. 15 .
Fig. 15.Effect of variations in number of layers and neurons per layer on the ability of NODEC to learn a time-dependent control signal for the two-dimensional flow (2.5).The underlying architectures use bias terms and leaky ReLU activations in all hidden layers.Each ANN is trained for 500 epochs using Adam and learning rate η = 3 × 10 −3 .We set the number of neurons per layer to be the greatest integer less than or equal to the maximum number of neurons (vertical axes) divided by the number of layers (horizontal axes).The optimal control energy is approximately 34.Heatmaps show the (a) control energy E T [û], (b) loss L(x(T ), x * ), (c) mean control signal over time µ û, and (d) control signal variance over time, σ 2 û.