
Optimal active particle navigation meets machine learning


Published 24 March 2023. Copyright © 2023 The author(s)

Focus Issue on Statistical Physics of Self-Propelled Colloids

Citation: Mahdi Nasiri et al 2023 EPL 142 17001. DOI: 10.1209/0295-5075/acc270

Abstract

The question of how "smart" active agents, like insects, microorganisms, or future colloidal robots, need to steer to optimally reach or discover a target, such as an odor source, food, or a cancer cell, in a complex environment has recently attracted great interest. Here, we provide an overview of recent developments regarding such optimal navigation problems, from the micro- to the macroscale, and give a perspective by discussing some of the challenges that lie ahead of us. Besides exemplifying an elementary approach to optimal navigation problems, the article focuses on works utilizing machine learning-based methods. Such learning-based approaches can uncover highly efficient navigation strategies even for problems that involve, e.g., chaotic, high-dimensional, or unknown environments and are hardly solvable with conventional analytical or simulation methods.


Published by the EPLA under the terms of the Creative Commons Attribution 4.0 International License (CC-BY). Further distribution of this work must maintain attribution to the author(s) and the published article's title, journal citation, and DOI.

Introduction

Before an airplane takes off, the flight crew typically runs software to plan a path to the target destination. This path aims to represent the best possible compromise between traveling time, fuel consumption, and safety aspects for given environmental conditions, such as the current wind and temperature pattern, as well as airspace occupancy. Similar optimal navigation problems, where an active (self-propelled) agent controls its direction of motion, or its speed, in a complex environment, also occur for robots and autonomously driving cars. These agents are commonly equipped with cameras, processing units, and actuators [1-3], allowing them to locally perceive their environment and to use this information to achieve a goal, such as completing a race track as fast as possible or efficiently collecting waste. Here, the cameras typically create images of the environment that are fed into the processing unit, which then translates the received information into a desired action (motion command) and sends it to an actuator, which initiates the motion. In the animal kingdom, the ability to develop efficient navigation strategies can assist the survival of species. To reach their breeding grounds, some turtles, for example, have to find an efficient path through the ocean over hundreds of kilometers [4,5]. Similarly, insects need strategies to find odor sources by navigating through complex molecular patterns that can even be chaotic due to turbulent air streams [6-9].

Smart microswimmers

Besides macroscale agents, even microorganisms can perceive information from their environment and use it for navigation. For example, the ability of sperm cells to sense gradients in the concentration of chemicals emitted by the egg cell [10,11] is crucial for the survival of many species. Similarly, bacteria possess a remarkable spectrum of biochemical sensors which allow them, e.g., to measure gradients in oxygen, nutrient, or autoinducer concentration [12-14] and to use them for navigation.

Besides biological microswimmers, synthetic microswimmers have become available over the last two decades [15], and both are currently studied together in the research field of active matter [16-35]. Synthetic microswimmers can be steered by external fields [36-53] (or even with feedback control systems [54-56]) and can react to their environment through various forms of taxis [57,58], which may be used in the future to help them navigate through our blood vessels to detect and perhaps repair mutated cells [59,60], transport drugs to cancer cells [61-63], or perform microsurgery [64].

While optimal navigation problems at the macroscale have been studied for decades [65,66], based on methods such as optimal control theory and dynamic programming [67-70], and more recently reinforcement learning [71,72], at the microscale corresponding explorations have started only recently. Here, the smallness of the particles leads to various new challenges: i) Microswimmers are subject to significant fluctuations due to Brownian motion (or errors and delays in the steering protocol); hence they cannot accurately predict the outcome of their navigational maneuvers. ii) Microswimmers interact hydrodynamically with walls, obstacles, and other microswimmers, which can qualitatively change the navigation strategy required to reach a target fastest [73]. iii) The displacement rate of microswimmers due to their environment can exceed their self-propulsion speed (typically $\sim \mu\,\text{m/s}$ ) by orders of magnitude, e.g., in blood vessels, in contrast to, e.g., airplanes in the wind. iv) For many prospective applications, microswimmers will face unknown environments with only local information about their surroundings available and will require transferable navigation strategies (fig. 1).

Fig. 1: Classification of optimal navigation problems for active particles. Examples: Optimal point-to-point navigation in (a) deterministic and (b) fluctuating environments, (c) target search and capture in predator-prey systems, (d) odor search in turbulent streams, optimal collection and localization in (e) known and (f) unknown setups.

Optimal point-to-point navigation

Ernst Zermelo asked in 1931 how a ship needs to steer in a nonuniform wind field to reach its target fastest [74]. Zermelo's problem has since become relevant to a variety of topics ranging from active particles to fish-like underwater vehicles [75] and unmanned balloon navigation in the stratosphere [76]. More generally, in this section, we ask how a self-propelled agent, which is subject to constraints, has to steer to optimally (e.g., fastest, cheapest, or safest) reach a target in an environment comprising complex (e.g., turbulent) flow and force fields, motility fields (as relevant for light-powered microswimmers in nonuniform intensity fields), viscosity fields (such as in the case of viscotaxis [77]), and complex obstacle landscapes. The optimal path (trajectory) and the corresponding navigation strategy of a self-propelled agent can be determined, e.g., based on Pontryagin's principle (for deterministic problems) or Hamilton-Jacobi-Bellman equations (also for stochastic problems) [69,78], geometric approaches [79-81], modern optimization algorithms [82], but also based on (deep) reinforcement learning methods [83-85], which are particularly useful if the environment is unknown (or only partially known) or for high-dimensional and chaotic environments, where exact solutions are difficult to obtain (figs. 2 and 3).

Fig. 2: Optimal active particle navigation. (a) Trajectories (red and cyan) of active particles in the presence of Brownian noise. The particles steer such that they try to follow the optimal path (green curve) and the shortest path (blue line) of the underlying deterministic problem in Taylor-Green flow (arrows and background color) [80]. (b) Snapshot of a learned trajectory (blue curve) and corresponding policy map (arrows) in the presence of virtual obstacles (red cells) in experiments with feedback-controlled colloids moving toward a goal (green cell) [91]. (c) Trajectories of an active particle obtained by Q-learning for point-to-point navigation in a Mexican hat potential (background color) after 2000 (yellow), 3000 (green), and 5000 (blue) training episodes in comparison with the exact optimal trajectory (red) [92]. (d) Time evolution of the total reward (blue curve) and exemplary trajectories (inset) for active particles in turbulent flows [93]. See references for more details.
Fig. 3: (a) Hydrodynamics can qualitatively change the navigation strategy that an active particle needs to follow to reach a target fastest: Optimal trajectory of a dry active particle (blue) and of source dipole microswimmers with source dipole strength $\sigma = -15$ (green) and $\sigma = 7.5$ (red) which interact hydrodynamically with obstacles (grey disks) [73]. (b)-(e) Exact optimal trajectories for active particles (dashed lines) between a given starting and end point (red dots) in a linear force field (b) and in horizontal pipe flow (c) [86] in comparison to machine-learned trajectories (yellow lines) [95]. Background colors show the learned policy map, i.e., the preferred discretized steering "direction" $\psi$ ($\psi\in \{0,\ldots,59\}$, corresponding to $[0,2\pi)$). (d), (e): lines show learned trajectories in Gaussian random potentials (background colors) [95]. See references for more details.

Elementary calculation of exact optimal trajectories

Consider an overdamped dry active particle (no hydrodynamic interactions) in a time-independent and two-dimensional complex environment. The equation of motion for the particle position $\vec{r}(t) = (x(t), y(t))$ can be compactly written as [86]

Equation (1): $\dot{\vec r}(t) = v_0(\vec r)\,\hat{n}(t) + \vec f(\vec r) + \sqrt{2D}\,\vec \eta(t)$

Here $\vec f(\vec r)$ represents a general force, flow, and viscosity field, and D, $\vec \eta$ represent the translational diffusion coefficient and Gaussian white noise of zero mean and unit variance. Let us first focus on the idealized situation where the agent can freely and instantaneously control its self-propulsion direction $\hat{n}(t) = (\cos\psi(t), \sin\psi(t))$ but not its speed $v_0(\vec r)$ , which may depend on space [53,87]. The goal is to find, for a given starting and end point, the connecting path $\vec r(t)$ (equivalently, the steering angle $\psi(t)$ ) that allows the active particle to reach the target fastest. To solve this problem for vanishing noise $(D=0)$ , we now write the traveling time as a functional of the path y(x) and of $y^{\prime}(x) = \mathrm{d} y(x)/\mathrm{d} x$ as

Equation (2): $T[y(x)] = \int_{x_{A}}^{x_{B}} \frac{\mathrm{d} x}{\left|\dot x\left(y(x),y^{\prime}(x),x\right)\right|}$ (with $x_{A}$, $x_{B}$ denoting the x-coordinates of the starting and end point)

and minimize it by solving the Euler-Lagrange equation $\frac{\mathrm{d}}{\mathrm{d} x} \frac{\partial L}{\partial y^{\prime}}-\frac{\partial L}{\partial y}=0$ (boundary value problem) for $L\left(y(x), y^{\prime}(x), x\right)=\frac{1}{|\dot x(y(x),y^{\prime}(x),x)|}$ , where $\dot{x}$ denotes the velocity in the x-direction. Using (1), one obtains the Lagrange function [86]

Equation (3): $L\left(y, y^{\prime}, x\right)=\frac{1+y^{\prime 2}}{f_x + y^{\prime} f_y + \sqrt{v_0^2\left(1+y^{\prime 2}\right)-\left(f_y - y^{\prime} f_x\right)^2}}$

where $\vec f = (f_x,f_y)$ and v0 may depend on x, y.
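To illustrate how eqs. (2) and (3) can be evaluated in practice, the following minimal Python sketch discretizes the path y(x) on a fixed x-grid and minimizes the resulting travel-time sum numerically; the shear-flow field, the grid resolution, and the use of scipy's generic L-BFGS-B optimizer are illustrative assumptions and not the solution method of [86].

```python
import numpy as np
from scipy.optimize import minimize

def travel_time(y_inner, x, y_start, y_end, f, v0):
    """Discretized travel time T = sum_i dx_i / x_dot_i, where x_dot is the
    forward root of the quadratic implied by eq. (1) for D = 0 (i.e., 1/L of eq. (3))."""
    y = np.concatenate(([y_start], y_inner, [y_end]))
    xm = 0.5 * (x[1:] + x[:-1])                # segment midpoints
    ym = 0.5 * (y[1:] + y[:-1])
    yp = np.diff(y) / np.diff(x)               # slope y'(x) on each segment
    fx, fy = f(xm, ym)
    v = v0(xm, ym)
    b = fx + yp * fy
    disc = b**2 + (1.0 + yp**2) * (v**2 - fx**2 - fy**2)
    x_dot = (b + np.sqrt(np.maximum(disc, 1e-12))) / (1.0 + yp**2)
    return float(np.sum(np.diff(x) / x_dot))

# illustrative environment (assumptions): shear flow f = (0.8*y, 0), constant speed v0 = 1
f = lambda x, y: (0.8 * y, np.zeros_like(x))
v0 = lambda x, y: np.ones_like(x)

x = np.linspace(0.0, 2.0, 41)                  # fixed x-grid from start to target
y_straight = np.linspace(0.0, 1.0, 41)         # straight connection as initial guess
res = minimize(travel_time, y_straight[1:-1], args=(x, 0.0, 1.0, f, v0), method="L-BFGS-B")
t_straight = travel_time(y_straight[1:-1], x, 0.0, 1.0, f, v0)
print(f"straight path: T = {t_straight:.3f}, optimized path: T = {res.fun:.3f}")
```

In this illustrative shear flow, the optimized path bends towards regions of stronger drift and its travel time falls below that of the straight connection, in line with the general statements below.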

Active particles and ray-optics: The Euler-Lagrange equation, together with eq. (3), can be readily solved for constant $\vec f$ and constant v0, yielding $y^{\prime}(x)=\text{const}$ , showing that the shortest path is the fastest in any constant field. That is, the active particle steers such that it exactly compensates for the drift due to the environment. In piecewise constant environments, the optimal trajectory is also piecewise straight, yielding a Snell's law for active particles which involves a generalized refractive index that can also be negative, as for light in meta-materials [88,89]. (See also [90].) Other exact solutions of the Euler-Lagrange equation can be obtained by exploiting conservation laws (symmetries), showing that the shortest path is typically not the fastest in complex environments. In rotating flow fields, active particles sometimes even have to initially swim away from the target to reach it fastest. Note that optimizing other quantities, such as the dissipated power along the path, leads to a different Lagrange function and hence, in general, also to a different navigation strategy [86].
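The first of these statements follows from an elementary first-integral argument: for constant $\vec f$ and constant $v_0$, the Lagrange function (3) depends on $y^{\prime}$ only, so the Euler-Lagrange equation reduces to $\frac{\mathrm{d}}{\mathrm{d} x}\frac{\partial L}{\partial y^{\prime}}=0$, i.e., $\frac{\partial L}{\partial y^{\prime}}=\text{const}$. Since $\partial L/\partial y^{\prime}$ is a monotonic function of $y^{\prime}$ on the relevant branch, this implies $y^{\prime}(x)=\text{const}$, i.e., a straight path, along which $\psi$ is chosen such that $v_0\hat{n}+\vec f$ points towards the target.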

Hydrodynamic interactions

Instead of dry active particles, ref. [73] considers microswimmers which hydrodynamically interact with walls and obstacles. One key result was to show that the optimal navigation strategy which microswimmers require can qualitatively differ from the one which leads to optimal trajectories for a dry active particle or a macroscopic vehicle in the same environment (fig. 3(a)).

Reinforcement learning

In very complex environments (e.g., chaotic or unknown cases), optimal navigation problems typically cannot be solved exactly, but methods based on reinforcement learning [83] can still be applied. Some works have used tabular Q-learning for efficient real-time control of self-thermophoretic active particles (fig. 2(b)) [91], or for learning to navigate optimally inside an environment hosting a Mexican hat potential without brim (fig. 2(c)) [92]. Other recent works have used actor-critic methods [93] and also deep reinforcement learning [94,95] to study optimal navigation within increasingly complex environments. In particular, ref. [95] developed a deep reinforcement learning-based method to determine asymptotically optimal trajectories. Here, the key challenge was to develop an approach that is capable of finding the global optimum rather than some locally optimal path. The key "trick" to meet this challenge was to use a policy gradient-based method that "understands" and directly focuses on optimizing the expected total reward (as opposed to off-policy methods such as Q-learning). To benchmark this method, results were compared to exactly known optimal trajectories (fig. 3).
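For concreteness, the following minimal sketch shows tabular Q-learning applied to a fastest-path problem on a small grid with a crude random "drift"; the grid world, reward shaping, and hyperparameters are illustrative assumptions and do not reproduce the setups of [91,92].

```python
import numpy as np

rng = np.random.default_rng(1)
N = 12                                              # grid size (illustrative)
goal = (N - 1, N - 1)
actions = [(1, 0), (-1, 0), (0, 1), (0, -1)]        # discretized steering directions
Q = np.zeros((N, N, len(actions)))                  # tabular action-value function

def step(state, a):
    """One environment step: intended move, replaced by a random 'drift' kick
    with probability 0.2 (a crude stand-in for noise or flow)."""
    dx, dy = actions[a]
    if rng.random() < 0.2:
        dx, dy = actions[rng.integers(len(actions))]
    new = (min(max(state[0] + dx, 0), N - 1), min(max(state[1] + dy, 0), N - 1))
    reward = 0.0 if new == goal else -1.0           # -1 per step => fastest path
    return new, reward, new == goal

alpha, gamma, eps = 0.1, 0.95, 0.1                  # learning rate, discount, exploration
for episode in range(3000):
    s = (0, 0)
    for _ in range(400):
        a = rng.integers(len(actions)) if rng.random() < eps else int(np.argmax(Q[s]))
        s_new, r, done = step(s, a)
        # off-policy temporal-difference (Q-learning) update
        Q[s][a] += alpha * (r + gamma * np.max(Q[s_new]) - Q[s][a])
        s = s_new
        if done:
            break

print("greedy action at the start cell:", actions[int(np.argmax(Q[(0, 0)]))])
```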

Reference [96] treats the microswimmer navigation problem as a Markov decision problem and minimizes a cost function to solve mazes, assuming global knowledge of the environment (see also [97] for maze solving with droplets). Very recently, swarms of intelligent colloidal microrobots were also trained to capture Brownian cargo particles within mazes [98]. Apart from maze solving, a series of recent studies [98-100] has used (deep) reinforcement learning to explore microswimmer navigation in unknown environments containing obstacles that the microswimmer can only probe locally. It was found that smart colloids receiving local sensory input were able to navigate around obstacles to reach a target using deep reinforcement learning [99] and to accomplish complex navigation and localization tasks under time constraints [100].

As opposed to unknown environments, for the problem of optimal navigation in chaotic (turbulent) environments, one can in principle use optimal control theory (Pontryagin's principle) to determine the exact optimal trajectory. However, this problem is not easily solvable numerically with shooting methods: in chaotic environments, a tiny variation of the initial condition commonly results in a completely different endpoint, so that a systematic variation of the initial conditions is of little use and one has to work with an extended target domain instead. This difficulty in evaluating the equation representing the exact optimal solution makes the use of reinforcement learning particularly valuable and, accordingly, ref. [93] has recently used an actor-critic-based reinforcement learning approach for Zermelo's problem in turbulent flow fields (fig. 2(d)).

Reference [94] in turn has explored the effects of accounting for environmental cues (such as vorticity, flow velocity, etc.) within the input features of a deep reinforcement learning method and the resulting strategies for a point-to-point navigation task within a turbulent flow. Reference [101] used an adversarial reinforcement learning method to train microswimmers for time-efficient point-to-point navigation within statistically homogeneous and isotropic turbulent flows; the trained swimmers were able to outperform the naive strategy of always moving in the direction of the target.

While the swimming direction of synthetic microswimmers can typically be controlled with external fields [33], many biological microswimmers autonomously change their swimming direction through suitable shape deformations. In line with that, several recent works have used reinforcement learning to explore the swimming mechanism of deformable agents [102-105]. Related works have used learning approaches to understand how a swimmer needs to deform to swim as fast as possible [106], to follow a predetermined path [107], to exhibit chemotaxis [108], or to achieve optimal point-to-point navigation [109-111]. For example, ref. [110] considers three-link models of (bionic) fish which receive only their orientation and distance to a (moving) target as input data. They learn generic strategies which are then explored in situations that the swimmer has not encountered during training.

In another line of work, smart active particles have been trained to exploit underlying turbulent flows to escape local fluid traps [112], reach target regions with high vorticity [113], or navigate towards the highest achievable altitude [114]. Very recently, smart microswimmers equipped with tabular Q-learning were also shown to develop efficient navigation strategies (while only having access to local information) within environments hosting various motility fields [115].

Searching and capturing targets

Another class of optimal navigation problems concerns the question of how an active agent has to move to efficiently find a target with an unknown location (or unknown dynamics). Here one can distinguish i) problems where the agent does not receive any information from the target, to which we therefore refer as "silent" targets, ii) problems where the target does emit certain information, e.g., in the form of odors that are spread by diffusion or advection (fig. 1(d)), and iii) problems where the agent is aware of the current location of the target (can "see" the prey) but not of its dynamics (fig. 1(c)). Other interesting examples occur if the predator has access only to indirect or partial information about the prey (type ii)), as relevant, e.g., for sharks and rays, which sense their prey via the flow fields it creates, using lateral line sensors [116,117], and for chemotactic bacterial predators [118]. Note that the special case of iii) where the target does not move corresponds to point-to-point navigation as discussed in the previous section.

i) Finding "silent" targets: Problems of class i) have been studied very recently based on the development of an algorithm generalizing transition path sampling to active Brownian particle dynamics searching for a target in a complex environment [119,120]. For self-propelled particles in search of a target located at the center of a circular confining domain, controlled adjustment of parameters such as the self-propulsion velocity and the characteristic rotation time was demonstrated to improve the search efficiency [121]. Later on, ref. [122] also studied the role of environmental characteristics (such as spatial heterogeneity) on the target search dynamics and capabilities of self-propelled particles.

ii) Finding sources: Target search problems have been studied intensively for macroscopic animals searching for odor sources in complex flow fields [123-127]. Such flow fields not only allow for pheromone communication among animals but also distribute odors efficiently over much longer distances than enhanced molecular diffusion would [127]. However, they also make it difficult to predict the location of the source based on the information which an agent receives from its immediate vicinity. A popular strategy for searching with sparse information is infotaxis [128], which has been intensively studied in the context of odor search problems for insects. This strategy aims at locally maximizing the information gain; i.e., infotactic agents essentially move up the (local) information-gain gradient, similarly to chemotactic bacteria moving up a concentration gradient. Infotaxis has recently been tested against methods ranging from value iteration (for partially observable Markov decision problems [129]) [9,130] to reinforcement learning [131,132].
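To make the idea of locally maximizing the information gain concrete, the following heavily simplified Python sketch implements an infotaxis-style searcher on a grid: it maintains a Bayesian posterior over the source location and, at each step, takes the move with the smallest expected posterior entropy. The distance-decaying detection model and all parameters are illustrative assumptions and are far cruder than the turbulent-plume statistics underlying [128].

```python
import numpy as np

rng = np.random.default_rng(2)
N = 21                                              # grid of possible source locations
true_source = np.array([16, 4])                     # hidden source (unknown to the agent)
xs, ys = np.meshgrid(np.arange(N), np.arange(N), indexing="ij")

def hit_prob(pos):
    """P(odor hit at 'pos' | source at each grid cell): an illustrative,
    distance-decaying detection model (assumption of this sketch)."""
    d = np.sqrt((xs - pos[0]) ** 2 + (ys - pos[1]) ** 2)
    return 0.9 * np.exp(-d / 3.0)

def entropy(p):
    p = p[p > 1e-12]
    return -np.sum(p * np.log(p))

posterior = np.full((N, N), 1.0 / N**2)             # uniform prior over the source location
pos = np.array([2, 18])                             # agent start position
moves = [np.array(m) for m in [(1, 0), (-1, 0), (0, 1), (0, -1)]]

for step in range(300):
    # infotaxis-style move selection: minimize the expected posterior entropy
    best_move, best_H = None, np.inf
    for m in moves:
        cand = np.clip(pos + m, 0, N - 1)
        h = hit_prob(cand)
        p_hit = np.sum(posterior * h)               # marginal probability of a hit
        post_hit = posterior * h / max(p_hit, 1e-12)
        post_miss = posterior * (1 - h) / max(1 - p_hit, 1e-12)
        exp_H = p_hit * entropy(post_hit) + (1 - p_hit) * entropy(post_miss)
        if exp_H < best_H:
            best_H, best_move = exp_H, m
    pos = np.clip(pos + best_move, 0, N - 1)
    # draw an actual (binary) observation from the true source and update the posterior
    hit = rng.random() < hit_prob(pos)[true_source[0], true_source[1]]
    likelihood = hit_prob(pos) if hit else 1.0 - hit_prob(pos)
    posterior = posterior * likelihood
    posterior /= posterior.sum()
    if np.array_equal(pos, true_source):
        break

map_estimate = np.unravel_index(np.argmax(posterior), posterior.shape)
print(f"steps taken: {step + 1}, agent at {pos.tolist()}, MAP source estimate: {map_estimate}")
```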

iii) Catching targets moving in an unknown way: Predator-prey problems for active Brownian particles have recently been studied using Q-learning [133,134]. Reference [135] in turn explored the strategies that emerge when adversarial reinforcement learning agents are trained on microswimmer pursuit and evasion tasks. Here, throughout the training, the predator and the prey devised policies to exploit hydrodynamic interactions to out-compete each other with complex sequences of moves and countermoves. Very recently, a deep policy gradient-based method was also demonstrated to qualitatively reproduce the optimal predatory path when chasing a finite-size prey at low Reynolds number [136].

Soaring: Another related class of problems, where the target is not necessarily localized and which has been explored with reinforcement learning methods, concerns the soaring of birds, unmanned air vehicles, or (other) gliders, i.e., the question of how these agents have to steer to find and exploit thermals within a complex landscape [137,138]. Interestingly, it is still unknown how birds achieve this [138].

Collection problems

How does a prospective microswimmer have to move to efficiently collect targets that are distributed in an unknown way, such as toxins or microplastics? This problem, which is closely related to area sweeping tasks in robotics [139-141], has not yet been studied much in the active matter literature (fig. 1).

Existing works largely focus on stochastic search strategies. Several influential works have reported observations of such strategies in the form of Lévy walks (step sizes randomly drawn from a fat-tailed distribution) in the foraging of albatrosses [142], marine predators [143], and bumble bees and T cells [144]. Reference [145] explores the use of Lévy-walking active particles to collect (nonregenerative) sparse targets. While in homogeneous environments a certain combination of diffusive and ballistic motion is believed to be optimal, this work finds that, as the environment becomes increasingly complex due to the presence of barriers, more diffusive strategies tend to lead to better target collection rates. Similarly, run-and-tumble walkers searching for a single target have been studied in ref. [146].
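For illustration, a Lévy-type search path with step lengths drawn from a fat-tailed (power-law) distribution can be generated in a few lines; the exponent, lower cut-off, and two-dimensional setting are illustrative choices and are not taken from the cited works.

```python
import numpy as np

rng = np.random.default_rng(3)

def levy_search_path(n_steps, mu=2.0, l_min=1.0):
    """2D search path with uniformly random directions and step lengths drawn from
    a power law p(l) ~ l**(-mu) for l >= l_min (fat-tailed for mu in (1, 3])."""
    # inverse-transform sampling of the Pareto-type step-length distribution
    lengths = l_min * (1.0 - rng.random(n_steps)) ** (-1.0 / (mu - 1.0))
    angles = rng.uniform(0.0, 2.0 * np.pi, n_steps)
    steps = lengths[:, None] * np.column_stack((np.cos(angles), np.sin(angles)))
    return np.vstack(([0.0, 0.0], np.cumsum(steps, axis=0)))

path = levy_search_path(10_000)
print(f"largest single step: {np.max(np.linalg.norm(np.diff(path, axis=0), axis=1)):.1f}, "
      f"final distance from the origin: {np.linalg.norm(path[-1]):.1f}")
```

The rare, very long steps produced by the fat-tailed distribution are what distinguish such a searcher from a purely diffusive one in homogeneous environments.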

Open questions, challenges and perspectives

How good are machine-learned results?

When learning the solution of a navigation problem (or of another complex control problem), it often remains unclear how close the resulting trajectory or navigation strategy is to the true optimum. Convergence of the reward, of course, does not guarantee optimality: the reward can converge to an arbitrary local optimum [83], and even if it converges to the global optimum, it is often unclear whether this optimum is representative of the (asymptotic) physical optimum or only optimal within the given reward definition, discretization, hyperparameter choice, and learning algorithm. Accordingly, one major open challenge is to develop reinforcement learning approaches that can reproduce exact results and that can be used to go beyond those in [95].

Fluctuations

While we expect that fluctuations can qualitatively change (fig. 1(b)) the navigation strategy required to optimally reach a target (fig. 2(a)) [79] in some problems, as, e.g., in the cliff walking problem [147], they have little effect on the required navigation strategies in others. Accordingly, it would be important to systematically understand and formulate criteria for when fluctuations lead to strong quantitative or even qualitative changes in the required navigation strategy. Regarding the development of reinforcement learning algorithms, fluctuations lead to environments that are only partially observable, making the outcome of the decisions (actions) taken by the agents not accurately predictable. In certain problems, this unpredictability can challenge the robustness of training, which highlights the need for novel methods based on deep reinforcement learning (such as trust region methods [148,149]) that are capable of maintaining robust learning in volatile setups.

Transferability and unknown environments

While microorganisms require strategies to navigate and find food in environments that they have never encountered before and which may change over time in an unpredictable way, many navigation problems for active particles so far hinge on a fixed (or deterministically evolving) environment. An early work that addresses optimal navigation of "colloidal robots" in unknown environments based on deep reinforcement learning is [99]. Developing powerful methods for determining transferable navigation strategies in the future will likely require methods from model-based reinforcement learning [150,151] and, more importantly, world models [152], where the agent strives to learn a model (representation) of the environment (which here can even amount to learning the physics of the setup) and uses this representation to plan its future actions.

Recent developments in machine learning

We are currently witnessing a rapid advancement in the development of new reinforcement learning methods. Accordingly, it is not surprising that some of the most powerful methods have not yet been applied to active matter and related optimal navigation problems. A very promising line of study, which has recently been applied to famous games such as Chess, Go, and Shogi, is the introduction of reinforcement learning algorithms with integrated planning, such as AlphaGo and AlphaZero [153-155]. We believe that, given the robustness of their planning phase (thanks to a built-in Monte Carlo tree search over possible future outcomes), these methods can be very useful for tasks requiring a high degree of accuracy and confidence in the optimality of the learned strategies.

Cross-interactions and communication rules

Recently, several works have focused on motile agents learning collective behaviors [156] such as flocking from simple low-level principles and incentive designs with reinforcement learning techniques [157,158]. A related line of study concerns the application of multi-agent reinforcement learning [159] to microswimmer problems. One can imagine complex tasks such as localizing cancer cells or collecting microplastics while having low environmental awareness (fig. 1(f)), which would require multiple smart microswimmers to cooperate and share their gathered knowledge of the host environment to guarantee success.

Data availability statement: No new data were created or analysed in this study.
