
Reinforcement learning of optimal active particle navigation

Mahdi Nasiri and Benno Liebchen

Published 9 August 2022 © 2022 The Author(s). Published by IOP Publishing Ltd on behalf of the Institute of Physics and Deutsche Physikalische Gesellschaft
Citation: Mahdi Nasiri and Benno Liebchen 2022 New J. Phys. 24 073042. DOI: 10.1088/1367-2630/ac8013

Abstract

The development of self-propelled particles at the micro- and the nanoscale has sparked a huge potential for future applications in active matter physics, microsurgery, and targeted drug delivery. However, while the latter applications provoke the quest on how to optimally navigate towards a target, such as e.g. a cancer cell, there is still no simple way known to determine the optimal route in sufficiently complex environments. Here we develop a machine learning-based approach that allows us, for the first time, to determine the asymptotically optimal path of a self-propelled agent which can freely steer in complex environments. Our method hinges on policy gradient-based deep reinforcement learning techniques and, crucially, does not require any reward shaping or heuristics. The presented method provides a powerful alternative to current analytical methods to calculate optimal trajectories and opens a route towards a universal path planner for future intelligent active particles.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The problem of finding suitable navigation strategies is of great interest to applications ranging from motion planning for autonomous underwater vehicles, ocean gliders [1–4], and aerial vehicles [5–7] to microorganisms searching for food and prey [8, 9] and striving for survival in complex environments [10, 11]. One important class of path planning problems which is currently attracting rapidly increasing attention centers around the quest for the optimal trajectory allowing an active particle, which can freely steer but cannot control its speed, to reach a given target in a complex environment. This active particle navigation (APN) problem is relevant both for biological swimmers such as fish or turtles on the way to their breeding grounds [12, 13] and for future applications of synthetic microswimmers [14] such as targeted drug [15–17] and gene delivery [18, 19] or microsurgery [20].

In contrast to classical navigation problems for vehicles such as ships or airplanes [21, 22], mesoscopic active particles face a variety of new challenges, including fluctuations, hydrodynamic interactions with obstacles and boundaries [23, 24], and highly complex environments [25–28]. Owing to these complications, the optimal (fastest) path generically differs from the shortest one and is highly challenging to determine. In fact, as of now, there is no standard recipe to find the optimal path even for the simplest case of a dry active particle with perfect steering (no fluctuations, no delay) [29] in sufficiently complex environments. While a method to tackle this challenge, which we refer to as the 'minimal APN problem', would be a crucial step forward en route to the ultimate dream of a universal path planner for microswimmers, such a method is not yet available.

In fact, classical path planning algorithms such as A* or Dijkstra cannot be straightforwardly adjusted to account for complex smooth environments (even when using heuristic ingredients), and achieving general solutions with analytical approaches based on variational methods [29], optimal control theory [23, 30], or transformations to the problem of finding geodesics of a Randers metric [31] is highly challenging (if not impossible) if the environment is sufficiently complex. In contrast, recent advancements in the field of artificial intelligence and reinforcement learning (RL) have made it possible to handle the required complexity [32, 33] and can in principle be used to approach the (minimal) APN problem for generic environments. In addition, unlike most classical motion planners, the objectives of deep-RL-based methods can straightforwardly be modified to optimize the path with respect to various targets such as traveling time or energy dissipation.

Corresponding pioneering studies at the interface of RL and active matter [34] have demonstrated, remarkably, that tabular Q-learning algorithms are able to uncover efficient navigation strategies for certain environments [35, 36], including complex fluid flows [27], and even for collective tasks such as flocking [37]. Other RL-based methods have been used to develop efficient navigation strategies in the presence of turbulent and chaotic flows [38, 39] or to learn and model chemotaxis behavior [40]. Deep RL methods have been applied to colloidal robots in unknown environments with random obstacles [41, 42], and Q-learning was used to explore optimal colloidal predator–prey dynamics and strategies [43]. Very recently, a policy gradient-based method was shown to be able to qualitatively reproduce globally optimal predatory strategies [44].

However, despite these successes, the challenge of finding the globally optimal path in generic environments remains. First, there is generally a major risk that agents converge to a local optimum rather than to the global one [45]. Second, even after finding the global optimum for given reward and state-action space definitions, it often remains unclear how the result relates to optimality in physical reality. In some cases this discrepancy can become so large that the final result is nowhere near the optimal trajectory, as we will see below.

To close this gap in the literature between methods which cannot handle the complexity of generic environments and methods which do not (or are not known to) lead to the globally optimal path, in the present work we develop a new approach which not only asymptotically reproduces analytically known optimal solutions for the minimal APN problem but also allows finding the optimal path in very complex environments. In addition, the present work creates a bridge between tools which are used for path planning problems in robotics [32, 33, 46] and navigation tasks in active matter. Based on an explicit comparison with exact results, we demonstrate that these methods provide the fastest trajectory with an accuracy which has not been recognized before. We also show that these methods generalize to very complex environments which have not been explored before (figure 1).

To achieve this, we combine a hybrid discrete-continuum representation of the environment with policy gradient based deep RL agents (advantage actor-critic algorithm) to find the optimal path.


Figure 1. Schematic illustration of the developed machine learning approach. (a) The RL model is designed, trained and fine-tuned to find the optimal trajectory in environments with known analytical solutions. Upon training, the model's early performance (blue curve) converges asymptotically (green dashed curve) to the optimal solution (dashed black curve). Arrows and colors indicate the direction and strength of the force/flow fields acting on the agent. (b) The designed model with proven performance is trained in highly complex environments to provide asymptotically optimal solutions in cases which are analytically inaccessible. As an example, the dashed line shows the learned path in a Gaussian random potential (background colors).


Remarkably, we find that a single, relatively simple neural network architecture is sufficient across different environments to achieve asymptotically optimal trajectories, i.e. trajectories which are optimal for a given discretization. Our results unify, for the first time, asymptotic optimality with the feasibility of handling generic complex environments. They can also be interpreted as an indication that biological agents could learn optimal navigation strategies from generation to generation (e.g. throughout evolution) without requiring direct global knowledge of their environment or even of the location of their target.

2. Model

To define the minimal APN problem, we consider an overdamped dry active particle in 2D at position $\vec{r}(t)=(x(t),y(t))$ which evolves as:

Equation (1): $\dot{\vec{r}}(t)=v_{0}\,\hat{n}(t)+\vec{f}(\vec{r})$

Here, v0 is the constant self-propulsion velocity of the active particle and $\vec{f}(\vec{r})=\vec{F}(\vec{r})/\gamma (\vec{r})+\vec{u}(\vec{r})$ is the overall external field with $\vec{F}(\vec{r}),\vec{u}(\vec{r})$ being the force field and the solvent flow field due to the environment and $\gamma (\vec{r})$ is the Stokes drag coefficient which can vary in space as relevant e.g. for viscotaxis [47]. We assume that the steering direction $\hat{n}(t)=(\mathrm{cos}\,\psi (t),\mathrm{sin}\,\psi (t))$ can be freely controlled by the active particle, as relevant e.g. for biological microswimmers, or by external fields e.g. via feedback control systems [48–50] or external electric, magnetic or phoretic fields [51–53].

While in reality there would of course be some delay in the control [54] as well as thermal fluctuations (for small agents), here we neglect both complications as even in their absence there is no known way to systematically determine or learn globally optimal trajectories in the literature. However, we would like to stress that our approach can be generalized to include both ingredients. To complete the definition of the minimal APN problem, let us now assume that the starting $\vec{r}(0)={\vec{r}}_{\text{start}}$ and target $\vec{r}(T)={\vec{r}}_{\text{end}}$ points as well as $\vec{f}(\vec{r})$ are given and the task is to find the connecting path (and the associated navigation strategy) which minimizes the travelling time.
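For concreteness, the following minimal sketch (not part of the original work; the time step, number of steps, and the constant heading are illustrative assumptions) shows how equation (1) can be integrated with a forward Euler scheme for a prescribed steering protocol:

```python
import numpy as np

def simulate(r0, psi_of_t, f_of_r, v0=1.0, dt=1e-2, n_steps=1000):
    """Forward-Euler integration of the overdamped dynamics of equation (1):
    dr/dt = v0 * n_hat(t) + f(r), with n_hat = (cos psi, sin psi)."""
    r = np.asarray(r0, dtype=float)
    trajectory = [r.copy()]
    for step in range(n_steps):
        psi = psi_of_t(step * dt)                  # steering angle at this instant
        n_hat = np.array([np.cos(psi), np.sin(psi)])
        r = r + dt * (v0 * n_hat + f_of_r(r))      # explicit Euler step of equation (1)
        trajectory.append(r.copy())
    return np.array(trajectory)

# Example: the linear force/flow field f = (k*x, 0) of figure 3(a), steered with a
# constant heading chosen purely for illustration (not an optimal policy).
k = 0.6
traj = simulate(r0=(0.0, 0.0),
                psi_of_t=lambda t: np.pi / 4,
                f_of_r=lambda r: np.array([k * r[0], 0.0]))
```

The navigation problem then amounts to choosing the protocol $\psi(t)$ such that the trajectory reaches ${\vec{r}}_{\text{end}}$ in the least time.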

3. Approach

Here we use RL to solve the minimal APN problem. We train an agent within a complex environment to take actions which maximize a cumulative reward [45, 55]. Within each episode, the agent starts from a certain initial state $s_0$ (initial position) and chooses an action $a_t$ (direction of motion) at each time step t until it either reaches the target point (or region) or meets an exit condition, e.g. by hitting the boundaries of the simulation domain or exceeding the maximum number of allowed steps (see SI (https://stacks.iop.org/NJP/24/073042/mmedia) for details).

For this study, we consider a hybrid interplay of discrete and continuous state spaces for the environment, so that the RL agent can harness the power of working inside an infinite-dimensional continuous state space without dealing with high-dimensional input data (figure 2), which is crucial for the present approach. That is, we account for the exact continuous external field $\vec{f}(\vec{r})$, but discretize the environment, represented as a 2D gridworld observation, when feeding it into the neural networks as the input states $s_t$. In each step, the agent chooses an action from a set of 60 equally spaced orientation angles $\psi ({s}_{t})=\left\{\frac{m\pi }{30}\vert m\in \mathbb{Z},0\leqslant m< 60\right\}$, defining its direction of motion, and receives a fixed negative reward $R_t$. In addition, the agent receives a positive reward of $100\vert R_t\vert$ upon reaching the target region and a very large penalty, amounting to $R_t$ times the maximum number of steps available per episode, upon hitting the environment boundaries. Hence the optimal navigation strategy corresponds to reaching the target with the least number of actions necessary, without hitting the boundaries.
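As a concrete illustration of this setup, a minimal sketch of the action set, the per-step reward, and the discretized observation could look as follows (the numerical values of the step reward, the maximum episode length, and the grid cell size are illustrative assumptions, not the parameters used in the paper):

```python
import numpy as np

N_ACTIONS = 60
ACTIONS = [m * np.pi / 30 for m in range(N_ACTIONS)]   # psi in {m*pi/30 : 0 <= m < 60}

R_STEP = -1.0        # fixed negative reward per step (illustrative value)
MAX_STEPS = 2000     # maximum number of steps per episode (illustrative value)

def reward(reached_target, hit_boundary):
    """Per-step reward: a constant penalty R_STEP, plus a bonus of 100*|R_STEP|
    on reaching the target region, or a penalty of R_STEP*MAX_STEPS on hitting
    the boundary of the simulation domain."""
    r = R_STEP
    if reached_target:
        r += 100.0 * abs(R_STEP)
    elif hit_boundary:
        r += R_STEP * MAX_STEPS
    return r

def to_grid_state(r, cell_size=0.1):
    """Map the continuous position to the discrete gridworld observation that is
    fed into the networks, while the dynamics themselves evolve in continuous space."""
    return (int(np.floor(r[0] / cell_size)), int(np.floor(r[1] / cell_size)))
```

In a setup like this, the continuous position enters the learning problem only through the discretized observation, in line with the hybrid picture of figure 2.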


Figure 2. Illustration of the hybrid combination of continuous and discrete state spaces used by our approach. The actions and the effects of the force fields take place in a continuous environment, and only a discrete observation is fed back into the model. In this hybrid picture, the agent can only adjust its strategy on a per-region basis.


Importantly, in contrast to most previous studies, our approach neither requires any reward shaping [56, 57] nor any heuristics [58, 59]. That is, the agent's strategy hinges on learning an estimate of the temporal trajectory length (number of timesteps), while the reward is independent of any heuristic measures such as the relative temporal or spatial distance of the agent to the target. Thus, the agent learns through pure exploration and is able to develop the optimal strategy.

The main component of our approach is the RL algorithm itself. During our tests it became evident that on-policy methods [45] converge more robustly toward the globally optimal solution than off-policy methods such as Q-learning; hence we selected a policy gradient method (see SI for details). This method allows learning a parametrized policy πθ for choosing actions by optimizing the expected return $E[J(\theta )]=E\left[{\sum }_{t=0}^{T}{R}_{t}\vert {\pi }_{\theta }\right]$ [60–62]. This goal is achieved by updating the set of policy parameters θ, which in our case correspond to the parameters (weights) of the policy network.

The update is performed in the direction of the gradient of the expected return, i.e. such that the expected return is enhanced. The gradient of the total expected return, averaged across K trajectories, can be approximated as [60–62]:

Equation (2): ${\nabla }_{\theta }E[J(\theta )]\approx \frac{1}{K}\sum _{i=1}^{K}\sum _{t=0}^{{T}_{i}}{\nabla }_{\theta }\,\mathrm{log}\,{\pi }_{\theta }({a}_{it}\vert {s}_{it})\left(\sum _{{t}^{\prime }=t}^{{T}_{i}}{R}_{{t}^{\prime }}\right)$

where ${R}_{{t}^{\prime }}$ denotes the reward value at time step t', achieved by policy πθ , and Ti denotes the overall length (number of steps) of the ith trajectory, which can be written in state-action space as ${\xi }_{i}=\left\{({s}_{i0},{a}_{i0}),({s}_{i1},{a}_{i1}),\dots ,({s}_{i{T}_{i}},{a}_{i{T}_{i}})\right\}$. We compute these gradients via the back-propagation algorithm in the training process of the policy network. Equation (2) shows that, as the policy network improves and achieves higher total returns, the gradient of the expected return diminishes, until the agent eventually converges to its final policy. In the following we choose the advantage actor-critic method (A2C) as a specific example of a policy gradient method. Besides the policy network, this method involves a critic network, which assigns to each state a value corresponding to the expected temporal distance to the target and which guides the updating of the parameters (weights) of the policy network in the following way: the total return ${\sum }_{{t}^{\prime }=t}^{{T}_{i}}{R}_{{t}^{\prime }}$ is replaced by a better-behaved term known as the advantage function [60, 63, 64], which, together with the critic network, essentially rates the possible actions by evaluating the benefit of choosing a specific action compared to the average action for a given state. In our approach, we define the advantage function as follows [64]:

Equation (3): $A({s}_{it},{a}_{it})={Q}^{{\pi }_{\theta }}({s}_{it},{a}_{it})-{V}_{w}^{{\pi }_{\theta }}({s}_{it})$

Equation (4): ${Q}^{{\pi }_{\theta }}({s}_{it},{a}_{it})\approx {R}_{t}+\lambda \,{V}_{w}^{{\pi }_{\theta }}({s}_{i,t+1})$

where ${V}_{w}^{{\pi }_{\theta }}$ is the critic network with parameters (weights) w, which estimates the value of a given state, and ${Q}^{{\pi }_{\theta }}({s}_{it},{a}_{it})$ is the state-action value function under policy πθ and discount factor λ, which determines the expected reward when choosing a specific action.

We train the two networks involved in the A2C algorithm (the policy and the critic network) simultaneously, based on the trajectories from multiple past episodes. In each training round, the critic network is updated based on the value (the discounted total return computed under the present policy) of each state and is used to judge the performance of previous episodes through the advantage term, which in turn guides the update of the policy network through equation (3).
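A minimal PyTorch-style sketch of one such training round under these definitions could look as follows (the network objects, optimizer, discount value, and the use of a one-step bootstrapped advantage are assumptions made for illustration; this is not the authors' implementation):

```python
import torch

def a2c_update(policy_net, critic_net, optimizer, episode, lam=0.99):
    """One A2C training round on a recorded episode, given as a list of
    (state, action, reward, next_state) tuples with tensor-valued states.
    policy_net(s) is assumed to return action probabilities for a single state."""
    policy_loss = 0.0
    critic_loss = 0.0
    for s, a, R, s_next in episode:
        value = critic_net(s)                         # V_w(s_t)
        with torch.no_grad():
            value_next = critic_net(s_next)           # V_w(s_{t+1})
            advantage = R + lam * value_next - value  # A = Q - V, cf. equations (3) and (4)
        log_prob = torch.log(policy_net(s)[a])        # log pi_theta(a_t | s_t)
        policy_loss = policy_loss - log_prob * advantage             # advantage-weighted policy gradient
        critic_loss = critic_loss + (R + lam * value_next - value) ** 2  # critic regression target
    optimizer.zero_grad()
    (policy_loss + critic_loss).backward()
    optimizer.step()
```

Here the critic is fitted by regression towards the bootstrapped one-step return, while the policy gradient of equation (2) is weighted by the (detached) advantage instead of the raw return.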

Note that the deep RL approach shows a surprisingly high degree of generalization. Remarkably, all results presented in this study, across different setups with various force fields, are achieved with policy networks containing only two hidden layers and critic networks containing a single hidden layer. This vastly decreases the computational cost of training our model (see SI for details).
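To give an impression of the scale involved, a network pair of this depth might be built as follows (the layer widths, the flattened grid input size, and the softmax output are illustrative assumptions; only the number of hidden layers is taken from the text):

```python
import torch.nn as nn

N_ACTIONS = 60          # discrete steering angles
GRID_CELLS = 64 * 64    # flattened gridworld observation (size chosen for illustration)

# Policy network: two hidden layers, outputs a probability distribution over actions.
policy_net = nn.Sequential(
    nn.Linear(GRID_CELLS, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, N_ACTIONS), nn.Softmax(dim=-1),
)

# Critic network: a single hidden layer, outputs the scalar state value V_w(s).
critic_net = nn.Sequential(
    nn.Linear(GRID_CELLS, 128), nn.ReLU(),
    nn.Linear(128, 1),
)
```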

4. Results

To demonstrate the power of the developed approach, we now consider various environments of increasing complexity, starting with cases for which exact analytical solutions are available [29].

As a first application of our method, we explore an active particle in a linear force (or flow) field $\vec{f}=(kx,0)$. Despite its apparent simplicity, for some combinations of starting and end points the optimal path in this field cannot be straightforwardly obtained by 'optimistic' applications of classical path planning algorithms like Dijkstra or A* [65] (see SI).

In contrast, by applying our RL-based method, we find that after training the model (see SI for details), the agent is able to achieve asymptotically optimal solutions (figures 3(a) and (b) and table 1). That is, any remaining deviations due to the discretization of the problem can be systematically reduced by choosing an even finer discretization. Notably, the algorithm finds the optimal path even for $k=1.0,{\vec{r}}_{\text{start}}=(0,0)$ and ${\vec{r}}_{\text{target}}=(5,5)$ (figure 3(b)) where the drift due to the external field dominates the self-propulsion in a large portion of space such that choosing unsuitable actions at early times can make it impossible for the agent to reach the target even after very long times.


Figure 3. Learned asymptotically optimal trajectories. (a)–(d) The trained active particle replicates theoretical results from [29] (dashed curves), for a linear force/flow field $\vec{f}=(kx,0)$ with (a) k = 0.6, (b) k = 1.0, and for shear flow $\vec{f}=(k[1-{y}^{2}],0)$ with (c) k = −0.5, (d) k = −0.8. Starting and target points are shown with circles and triangles, respectively, and background colors show the learned policy map. Panels (e) and (f) show generalizations to Gaussian random potentials with α = 2 (e) and α = 4 (f), with background colors showing the value of the Gaussian random potential. $\vec{f}$ is measured in units of v0 (such that v0 = 1), and computational parameters are provided in table B.1 in the SI. Comparisons of the trajectory lengths are given in table 1.


Table 1. Temporal length (number of timesteps) of the exact optimal trajectories and the learned ones for a given discretization. The lengths of the analytical cases are recreated using an agent which follows the exact optimal trajectories [29] with a continuous orientation. Parameters: the stepsize and other environment parameters are the same as in table B.1 in the SI. The range of the values (±) is due to the finite size of the target region. Note that all trajectories which end up in the same location of the target region have exactly the same temporal length for a given discretization.

            Linear flow (k = 0.6)   Linear flow (k = 1.0)   Shear flow (k = −0.5)   Shear flow (k = −0.8)
Exact       50 ± 1                  50 ± 1                  101 ± 1                 120 ± 1
RL method   51 ± 1                  51 ± 1                  104 ± 1                 124 ± 1

Let us now focus on a somewhat more complex environment, defined by $\vec{f}=(k(1-{y}^{2}),0)$, representing e.g. shear flow in a pipe [66]. Here the optimal path is S-shaped for sufficiently large k (dashed lines in figures 3(c) and (d)). From the RL point of view, such trajectories are rather non-obvious, because achieving this symmetric S-shaped path in the presence of the strong flow field requires the agent to employ an intricate combination of actions. Accordingly, it is not surprising that simple RL algorithms such as tabular Q-learning tend to fail to find the globally optimal path, or at least require unsystematic hyperparameter fine-tuning for each considered k-value (see SI for more details). Remarkably, also for this setup our RL agent is able to learn the asymptotically optimal path for any k value. That is, also here any remaining deviations from the optimal path can be systematically reduced by choosing a finer discretization for the state and action spaces. Notice, however, that for a given discretization, degenerate trajectories may occur which have very similar (or even identical) temporal costs but differ slightly in shape (see SI for details).

Let us now exploit the very close agreement between theoretically calculated optimal paths and the learned ones, to explore truly complex environments for which optimal trajectories cannot be analytically calculated. For this purpose, we create a Gaussian random potential (GRP) [67, 68] $U(\vec{r})$ with a power spectrum of $\langle \tilde{U}(k)\tilde{U}(-k)\rangle \propto {k}^{-\alpha }$ and determine $\vec{f}=-\vec{\nabla }U$ (see SI for more details).
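As an illustration (the grid size, normalization, and spectral filtering details below are assumptions; the SI specifies the actual construction), such a potential can be generated by filtering Gaussian white noise in Fourier space:

```python
import numpy as np

def gaussian_random_potential(n=256, alpha=4.0, seed=0):
    """Random potential U on an n x n grid with power spectrum ~ k^(-alpha)."""
    rng = np.random.default_rng(seed)
    white_noise = rng.standard_normal((n, n))
    kx = np.fft.fftfreq(n)
    ky = np.fft.fftfreq(n)
    k = np.sqrt(kx[:, None] ** 2 + ky[None, :] ** 2)
    k[0, 0] = np.inf                       # suppress the zero mode (mean of U)
    amplitude = k ** (-alpha / 2.0)        # so that <|U(k)|^2> ~ k^(-alpha)
    U = np.fft.ifft2(np.fft.fft2(white_noise) * amplitude).real
    return U / np.abs(U).max()             # normalize for convenience

U = gaussian_random_potential(alpha=2.0)
fy, fx = np.gradient(-U)                   # force field f = -grad U (grid units, rows = y)
```

In a setup like this, the force acting on the agent at a continuous position could then be obtained by interpolating fx and fy on the grid.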

Here we consider the cases of α = 2 and 4 for two different combinations of starting and end points (see figures 3(e) and (f)). Notably for α = 4 the learned solutions are comparatively smooth curves, while for α = 2 the learned paths involve comparatively sharp turns, reflecting the structure of the underlying potential islands. In all shown cases, we find that the travelling time of the learned optimal trajectories is shorter than the travelling time when following a straight line.

5. Conclusions

We have developed an end-to-end deep RL approach which creates asymptotically optimal solutions for complex navigation problems which are not straightforwardly attainable with existing standard methods. This approach complements the broad literature on methods to find the shortest path and opens a route towards a universal path planner for microswimmers and larger self-driven agents which are subject to continuously varying force or flow fields.

To achieve this, our method can be generalized in many ways, e.g. to find globally optimal paths with respect to fuel consumption or dissipated power, or, after generalization to continuous state-action spaces, to account for microswimmer-specific ingredients such as hydrodynamic interactions with obstacles and fluctuations.

Potential applications of the presented method range from testing future theoretical developments, e.g. regarding the optimal path of active particles in the presence of fluctuations, to the programming of nano- and microscale robots for targeted drug and gene delivery. On larger scales, our approach could also be used to test whether biological agents like turtles or fish manage to find the globally optimal path (based on a comparison of their trajectories with the learned results) or, possibly, even for route planning of macroscopic vehicles like cleaning robots or spacecraft, where it could provide an alternative to optimization methods based on nonlinear programming or meta-heuristics [69].

Acknowledgments

The authors would like to thank Prof. Dr Hartmut Löwen for useful discussions.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.
