Adaptive active Brownian particles searching for targets of unknown positions

Developing behavioral policies that efficiently solve target-search problems is a crucial issue both in nature and in the nanotechnology of the 21st century. Here, we characterize the target-search strategies of simple microswimmers in a homogeneous environment containing sparse targets of unknown positions. The microswimmers can control their dynamics by switching between passive Brownian motion and active Brownian motion and by selecting the duration of each of the two phases. The specific conduct of a single microswimmer depends on an internal decision-making process determined by a simple neural network associated with the agent itself. Starting from a population of individuals with random behavior, we exploit the genetic algorithm NeuroEvolution of Augmenting Topologies to show how an evolutionary pressure based on the target-search performance of single individuals helps to find the optimal duration of the two phases. Our findings reveal that the optimal policy strongly depends on the magnitude of the particle’s self-propulsion during the active phase and that a broad spectrum of network topology solutions exists, differing in the number of connections and hidden nodes.


INTRODUCTION
Active matter and directed motion have come under the spotlight of several research communities such as biology, biomedicine, robotics, and statistical physics [1-6]. In nature, many micro-organisms are able to convert chemical energy into self-propulsion with the goal of exploring their environment, foraging for nutrients, or running away from toxic substances [2,3,7]. Paradigmatic examples include the swimming behavior of bacteria such as Escherichia coli [8], phagocytes of the immune system performing chemotactic motion during injury or infection [9,10], and sperm cells navigating against chemical gradients to find the egg [11]. At larger length scales, animals constantly have to face challenges such as finding food, a mating partner, or shelter [12,13], and generally solve these issues by adopting strategies involving smart motion. Artificial and biohybrid microswimmers [7,14-16] capable of intelligent self-propulsion have potential for revolutionary applications ranging from active drug delivery [17-20] to assisted fertilization [21] and environmental remediation [22].
Notwithstanding the tremendous progress achieved in this research field in the past decade, central problems such as those regarding optimal navigation and target-search strategies still have to be thoroughly addressed, even at the level of a single agent in a homogeneous environment. Nature has found solutions to these problems over hundreds of millions of years of evolution: Many organisms display robust locomotion performance by adapting their locomotory gaits to the surroundings [23,24], and several microswimmers sense environmental stimuli and exhibit various tactics to achieve effective navigation in biological fluids [7-11]. In a certain sense, even relatively simple microswimmers are smart with respect to the actions they have to perform to reach their biological goals.
To understand how evolution shaped navigation and search strategies, one can use reinforcement learning (RL) [25] and genetic algorithms [26,27] to identify optimal and alternative strategies. Recently it has been demonstrated that agents trained with RL (possibly combined with genetic algorithms) are able to find advantageous swimming strategies in several situations, such as in viscous solutions [28-30], simple energy landscapes [31], steady flows [32-34], turbulent fluids [35-38], and complex motility landscapes [39]. Notwithstanding their merits, in all these studies either the goal of the particle is different from reaching a specific target or, if a target region has to be reached, its position is fixed and thus implicitly learned during the learning process. On the other hand, a crucial question arising in the context of target search is which strategies are optimal for finding sparse targets of unknown positions.
The first theoretical studies on search strategies date back to World War II, when the U.S. Navy tried to rationalize search procedures to efficiently hunt enemy submarines [40]. When looking for sparse small targets, different target-search strategies can be put in place depending on the searcher's abilities and on the space to be explored. In the microscopic world, the agents often have only limited or no spatial memory, and search trajectories can be qualified as stochastic, meaning that some characteristics of the stochastic motion typical at this length scale are tuned to optimize the search time [13]. Among the random strategies that can be used in a homogeneous environment, Lévy walks [12,41,42] and intermittent-search strategies [13,43,44] have been extensively studied in various contexts. In particular, intermittent-search strategies rely on the experimental observation that fast movement degrades perception. Thus, these strategies combine phases of diffusive motion, which allow target detection, with phases of ballistic motion with random orientation, which allow moving quickly to a different region of space but do not allow detecting the target. It has been shown that the mean search time of intermittent random walks can be minimized under broad conditions [45-47]. Recently, Muñoz-Gil and coworkers [48] have proposed the use of RL techniques to study non-intermittent-search strategies and showed how these can outperform Lévy walks.
Here, for the first time, within the framework of intermittent-search strategies, we address the problem of finding a target of unknown position using a genetic-algorithm approach. This is elaborated for the simple case of a homogeneous environment, considering agents equipped with a simple artificial neural network (ANN) which selects, based on the agent state, an action among a set of possible actions. Specifically, the agent can switch its state between a passive and an active Brownian particle and, with intermittent-search strategies in mind, has at its disposal only a limited set of actions allowing it to switch between passive and active motion and to choose the duration of the new phase. In our framework, the agent behavior is then deterministically selected on the basis of its state. Our goal is to characterize and understand the behavioral policies that are optimal in solving the target-search problem and to show that these strategies can be obtained by means of genetic algorithms. A genetic algorithm is here preferred over typical RL methods because a target with a completely unknown position results in a very sparse reward signal for the latter methods. Given the pivotal role played by the reward function in RL, this is a non-trivial problem for anyone wishing to adopt such an approach [25]. On the other hand, genetic algorithms are known to be less sensitive to issues related to the sparseness of the rewards because they evaluate the full behavior of an agent rather than trying to find the value of the state-action pairs.

MODEL
Our environment consists of a two-dimensional square box of size L × L with periodic boundary conditions and of a circular target of radius R placed randomly inside this box. Note that, due to the periodic boundary conditions, this environment is equivalent to an infinite domain with a lattice of targets. We consider an agent able to switch its state s between a passive Brownian particle (BP) and an active Brownian particle (ABP). Similarly to intermittent-search strategies, during the BP phase the agent is allowed to find the target while, in the ABP phase, it can relocate more quickly to a different region of the box but cannot sense the target. Every time the agent finds a target (i.e. the distance between its position and the center of the target is smaller than the target radius R), the target is destroyed and a new one appears at a new random location inside the box. The equations of motion of the ABP model in a homogeneous environment are a set of Langevin equations that, once discretized according to the Itô rule, read

$$\mathbf{r}_{t+\Delta t} = \mathbf{r}_t + v\,\mathbf{u}_t\,\Delta t + \sqrt{2D\Delta t}\,\boldsymbol{\xi}_t, \qquad (1)$$

$$\vartheta_{t+\Delta t} = \vartheta_t + \sqrt{2D_\vartheta\Delta t}\,\eta_t, \qquad (2)$$

where $\Delta t$ is the integration step, $\mathbf{r}_t = (x_t, y_t)$ is the position at time t, and $\mathbf{u}_t = (\cos\vartheta_t, \sin\vartheta_t)$ denotes the instantaneous orientation of the driving velocity with constant modulus v.
$D$ and $D_\vartheta$ are the translational and rotational diffusion coefficients, respectively. Finally, the components of the vector noise $\boldsymbol{\xi}_t = (\xi_{x,t}, \xi_{y,t})$ and of the scalar noise $\eta_t$ are independent random variables, distributed according to a Gaussian with zero average and unit variance. The equations of motion of the standard Brownian particle model are readily recovered by setting v = 0, thereby decoupling the spatial evolution from the orientational diffusion of the self-propulsion. In the following, we fix the length unit as the size L of the square box and the time unit as the typical time $\tau := L^2/4D$ required by a passive particle to cover this distance. The magnitude of the activity and the persistence of motion in the ABP phase are respectively measured by the dimensionless Péclet number, $\mathrm{Pe} := v\tau/L$, and the dimensionless persistence $\ell^* := v/(D_\vartheta L)$. With the intent of keeping the setup as simple as possible, we equip our particle with only a limited set of actions for each state. Each action $a = (a_s, a_\tau)$ is a tuple consisting of two parameters. The first parameter, $a_s \in \{0, 1\}$, is a binary variable determining the next state of the particle: For $a_s = 0$ the next phase is identical to the previous one (BP→BP or ABP→ABP), while for $a_s = 1$ the phase changes (BP→ABP or ABP→BP). The second parameter, $a_\tau$, specifies the duration of the next phase. Here we restrict the agent to select only among $N_\tau = 5$ different durations $\tau_i$ (i = 1, ..., 5), thus implying a total of 10 possible actions to select from in each state. When an active phase begins after a passive one, the direction $\vartheta$ of the self-propulsion velocity is drawn from a uniform distribution in [0, 2π); otherwise it is updated according to Eq. (2). In the following we set the persistence $\ell^* = 1$ and select the durations of the actions such that $\tau_i = \tau/4^i$, see Figure 1a. Thus, in the longest action time, $\tau_1 = \tau/4$, a Brownian particle covers a typical distance of L/2 which, because of the periodic boundary conditions, is comparable to the maximal possible distance between the particle and the target, $L/\sqrt{2}$. The results reported in this manuscript are obtained by following this choice. However, to check that our results are not particular to this choice, the Supplemental Material also reports results obtained by varying the number of allowed durations $N_\tau$.
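The discretized dynamics described above can be illustrated with a minimal sketch of a single integration step, assuming the standard Euler-Maruyama form of the ABP model with translational noise amplitude √(2DΔt) and rotational noise amplitude √(2D_ϑΔt). The function names `abp_step` and `phase_durations` are hypothetical and are not taken from the authors' code.

```python
import math
import random

def abp_step(x, y, theta, v, D, D_theta, dt, L, rng):
    """One Euler-Maruyama step of an ABP in a periodic L x L box.

    Setting v = 0 recovers the passive Brownian particle.
    All argument names are illustrative assumptions.
    """
    # Independent Gaussian noises with zero mean and unit variance.
    xi_x = rng.gauss(0.0, 1.0)
    xi_y = rng.gauss(0.0, 1.0)
    eta = rng.gauss(0.0, 1.0)
    amp = math.sqrt(2.0 * D * dt)
    # Position update: self-propulsion drift plus translational diffusion.
    x_new = x + v * math.cos(theta) * dt + amp * xi_x
    y_new = y + v * math.sin(theta) * dt + amp * xi_y
    # Orientation update: pure rotational diffusion.
    theta_new = theta + math.sqrt(2.0 * D_theta * dt) * eta
    # Periodic boundary conditions.
    return x_new % L, y_new % L, theta_new

def phase_durations(tau, n_tau=5):
    """Allowed phase durations tau_i = tau / 4**i for i = 1, ..., n_tau."""
    return [tau / 4**i for i in range(1, n_tau + 1)]
```

In the units used in the text (L = 1, τ = 1) one would call this with D = 1/4 and v = Pe; the value of the integration step Δt is not specified in this section and has to be chosen by the user.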
To learn optimal policies solving the target-search problem, we exploit the evolutionary algorithm NeuroEvolution of Augmenting Topologies (NEAT) [49,50]. To this end, we equip each agent with an ANN that takes as input the current state (BP or ABP) and outputs an action chosen among the set of actions described above, see Figure 1c. Any ANN is characterized by a certain topology, internal parameters, and activation and response functions, see the Methods section for further details. Depending on its inner structure, a given ANN always returns the same action for a given input state. Starting from a population of randomly generated individuals (ANNs), the NEAT algorithm then iteratively creates new generations relying on biologically inspired operators such as mutation, crossover, and selection based on the fitness of each individual in the population. In very simple words, only the inner traits of the fittest individuals are transmitted from one generation to the next. Finally, the fitness of an individual is defined as the number of targets that it manages to detect in a time equal to $5 \times 10^2\,\tau$, see the Methods section for further details.

RESULTS
We start by investigating how the adaptive particles evolve in the case where the size of the target lies between the typical distance explored in the passive phase in time $\tau_5$ and the typical distance covered in the active phase in the same time, $\sqrt{\tau_5/\tau} < R/L < \mathrm{Pe}\,\tau_5/\tau$, see Figure 1a. With this choice, independently of the phase duration $\tau_i$, during the active phase the particle typically relocates to a distance larger than the target size. On the other hand, when the action duration $\tau_5$ is selected, a particle in the passive phase typically explores a region smaller than the target size. This situation, which represents well the idea of an intermittent search, is here implemented by setting the radius of the target and the Péclet number to R = 0.05L and Pe = 100, respectively.
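As a quick numerical check of the length-scale ordering stated above, the following sketch tabulates the typical passive and active displacements for each allowed duration. It assumes the estimates √(4Dτ_i) = L√(τ_i/τ) for the passive phase and vτ_i = Pe L τ_i/τ for the active phase (a ballistic estimate that neglects reorientation); the function name `typical_distances` is hypothetical.

```python
import math

def typical_distances(pe, n_tau=5):
    """Return (i, passive, active) typical distances in units of L
    for the durations tau_i = tau / 4**i, i = 1, ..., n_tau."""
    rows = []
    for i in range(1, n_tau + 1):
        t = 1.0 / 4**i            # tau_i / tau
        passive = math.sqrt(t)    # sqrt(4 D tau_i) / L, with tau = L^2 / (4 D)
        active = pe * t           # v tau_i / L, with Pe = v tau / L
        rows.append((i, passive, active))
    return rows

rows = typical_distances(pe=100)
_, passive5, active5 = rows[-1]
print(passive5, active5)  # 0.03125 0.09765625
```

With R/L = 0.05, the shortest duration indeed gives 0.03125 < 0.05 < 0.0977, consistent with the condition used in the text.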
The initial population contains $N = 10^3$ randomly generated individuals which, depending on the particular action returned in a given state, can be categorized into three different species: individuals that always behave as a passive particle (BP-like individuals), those that, once they visit their active state, are unable to go back to the passive one (ABP-like individuals), and finally individuals that switch periodically between the two phases with fixed switching times (switching individuals). The latter species can be further split into $N_\tau^2$ sub-species depending on the specific durations of the BP and the ABP phases.
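The categorization above can be sketched by enumerating all deterministic policies, i.e. all pairs of actions (one per state). The sketch assumes that an agent starts in the passive state and that all ten actions are equally likely in a random individual; both are illustrative assumptions, and the names `ACTIONS` and `classify` are hypothetical.

```python
from collections import Counter
from itertools import product

N_TAU = 5
# Each action is (a_s, i): a_s = 1 means "switch phase", i indexes the duration tau_i.
ACTIONS = [(a_s, i) for a_s in (0, 1) for i in range(1, N_TAU + 1)]

def classify(policy):
    """Classify a deterministic policy mapping 'BP' and 'ABP' to an action."""
    switch_from_bp, i_next_abp = policy["BP"]    # duration applies to the next (ABP) phase
    switch_from_abp, i_next_bp = policy["ABP"]   # duration applies to the next (BP) phase
    if switch_from_bp == 0:
        return "BP-like"       # never leaves the passive state
    if switch_from_abp == 0:
        return "ABP-like"      # never returns to the passive state
    return ("switching", i_next_bp, i_next_abp)

counts = Counter()
subspecies = set()
for a_bp, a_abp in product(ACTIONS, repeat=2):
    label = classify({"BP": a_bp, "ABP": a_abp})
    if isinstance(label, tuple):
        counts["switching"] += 1
        subspecies.add(label[1:])
    else:
        counts[label] += 1
```

Under these assumptions, 50 of the 100 policies are BP-like, 25 are ABP-like, and 25 are switching, one for each of the N_τ² = 25 sub-species. The switching fraction of 0.25 is in line with the initial value quoted in the Results.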
Since the NEAT algorithm dynamically produces new generations by selecting for reproduction those individuals having the largest fitness, we expect the switching individuals to become the dominating species for this large value of the Péclet number. Indeed, the fraction of switching individuals rises from the initial value of about 0.25 to about 0.9 already at the second generation and reaches a plateau at about 0.95 at the fifth generation, see Figure 2a.
Further inspection reveals that the large majority of the switching individuals select $\tau_5$ as the duration of both the passive and the active state, corresponding to short passive detection phases alternating with short active relocation phases, see the inset of Figure 2a.
We also evaluate the learning performance by inspecting the evolution of the time required by the agent to reach the target in subsequent generations and comparing it to the same quantity computed for three different benchmark models: i) a fully passive particle; ii) a particle which selects randomly among the possible actions (we will refer to this benchmark as the "casual particle"); iii) an "optimal particle" following the optimal strategy. This optimal strategy is obtained by individually checking the $N_\tau \times N_\tau$ possible combinations that make the particle switch to the active state when it is in the passive phase and to the passive state when it is in the active one.

[Figure 2b: The continuous line with circular points represents the average over the switching individuals, while the box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the switching individuals only. Starting from the 3rd generation, the 10th, 25th, 50th, and 75th percentiles collapse onto the optimal-particle benchmark.]
At the beginning of the learning process, for the switching individuals, the time required to find the target is comparable to that of the casual particle. More precisely, the average searching time of the switching individuals is twice that of the casual particle, while the median is about a factor of 0.6 lower than the latter quantity and closer to the searching time of a completely passive particle. This observation is due to the fact that the phase times $\tau_i$ (i = 1, ..., 5) are logarithmically spaced and that, in the initial population of switching individuals, some spend most of their time in the active phase, where it is not possible to find the target, and others spend most of their time in the passive phase, thus effectively performing as a Brownian particle. If the evolutionary process is successful, after repeated generations this quantity is expected to decrease and reach a value slightly larger than that of the optimal particle. This small difference arises because the NEAT algorithm maintains a certain level of exploration by creating new individuals through mutations and by preserving a certain number of unfit individuals for different evolution scenarios. This is in fact what is observed for R = 0.05L and Pe = 100, see Figure 2b. Furthermore, since more and more switching individuals enter the most adapted subspecies, the spread of the search time also decreases during the learning process, with at least 75% of the switching individuals performing as well as the optimal particle after the third generation, see Figure 2b. However, the average stays above the 75th percentile over the whole evolutionary process because it is dominated by the performance of the worst switching individuals.
Additional insight into how the genetic algorithm encodes a successful policy can be obtained by focusing on the topology of the ANNs corresponding to the fittest individuals, i.e., for Pe = 100, those selecting $\tau_5$ as the duration of both the BP and ABP phases (inset of Figure 2a). In the initial population, all the individuals have two input nodes and four output nodes with no hidden nodes in between, see the Methods section for details. However, as the evolution process progresses, new fit individuals with some hidden nodes emerge, see Figure 3a. The fractions of fit individuals with h hidden nodes are consistent with the expected values obtained by solving a master equation for $N_h(g)$, the number of individuals with h hidden nodes at the g-th generation, in which $p_{\mathrm{add}} = p_{\mathrm{del}} = 0.05$ are the probabilities of adding and deleting a hidden node, respectively. Since hidden nodes thus appear at the rate expected from mutations alone rather than being selected for, this means that the topology of the initial ANN with only two input nodes and four output nodes is already complex enough to provide successful solutions to the target-search problem. In contrast, a similar approach starting from a different initial network topology shows that, with fewer output nodes, the fittest individuals are slightly more likely to have a certain number of hidden nodes that contribute to selecting the optimal actions, see Supplemental Material. This observation agrees with the intuitive expectation that a minimal complexity of the ANN is required to find successful strategies. Looking in more detail at the internal structure of the ANNs associated with the fittest individuals, it appears that there is no preferred topology once the number of hidden nodes h is fixed. In fact, typical topologies of these networks show a very variegated range of active connections and of their weights among the different nodes, see Figures 3b and S2 in the Supplemental Material. In these plots the edges connecting different vertices have a width proportional to the absolute value of the weight of the connection, see the Methods section for details. However, a complete understanding of the functioning of a particular ANN should also include the properties of the single nodes, including their bias and their activation and aggregation functions, see the Methods section for details. Note also that, typically, emergent hidden nodes are connected to only one input and one output.
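The explicit form of the master equation for N_h(g) is not reproduced in this section. As an illustration only, the sketch below iterates a one-step birth-death process in which, at every generation, each individual gains a hidden node with probability p_add and loses one with probability p_del (when h > 0), neglecting selection and crossover. This specific form, the truncation `h_max`, and the function name are assumptions and may differ from the equation used by the authors.

```python
def evolve_hidden_node_counts(n_generations, p_add=0.05, p_del=0.05,
                              h_max=20, n0=1.0):
    """Iterate an assumed birth-death master equation for N_h(g).

    counts[h] is the (expected) number of individuals with h hidden nodes.
    All individuals start with h = 0, as in the initial population.
    """
    counts = [0.0] * (h_max + 1)
    counts[0] = n0
    for _ in range(n_generations):
        new = counts[:]
        for h in range(h_max + 1):
            # Flow h -> h + 1 (node added); blocked at the truncation h_max.
            up = p_add * counts[h] if h < h_max else 0.0
            # Flow h -> h - 1 (node deleted); impossible for h = 0.
            down = p_del * counts[h] if h > 0 else 0.0
            new[h] -= up + down
            if h < h_max:
                new[h + 1] += up
            if h > 0:
                new[h - 1] += down
        counts = new
    return counts
```

With `n0=1.0` the returned list can be read directly as the expected fractions of individuals with h = 0, 1, 2, ... hidden nodes after a given number of generations, which is the kind of quantity compared with Figure 3a.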
An important question is how the target-search strategy depends on the activity of the particle. To address this issue, we carry out a similar analysis for different Péclet numbers. When Pe > 100 we do not expect major differences from the previously considered case (Pe = 100). In fact, for high Péclet numbers, the self-propulsion velocity is large enough to allow relocating to a distance larger than the target size even in the shortest time $\tau_5$. Then, intuitively, alternating between short BP phases and short ABP phases is the most promising strategy. In contrast, for very low Péclet numbers, if the action time is smaller than the time unit $\tau$, the typical distance travelled by simple diffusion is always greater than the distance covered by self-propulsion, see Figure 1. Then, selecting active phases becomes superfluous and the optimal strategy is simply to keep the particle in a BP-like phase. These expectations are confirmed by checking, after about 10 generations, the fraction of individuals in the different subspecies and by comparing the overall performance of the population to those of the passive and of the optimal particle for different Péclet numbers. This is done in Figure 4a, reporting the average search times, and in Figure 4b, showing the relative distribution of individuals across subspecies. In the latter panel, the phase durations associated with the optimal particle are also highlighted with a black frame in the table.
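The comparison between diffusive and self-propelled displacements invoked above can be made explicit with a short estimate. It assumes a purely ballistic active displacement, i.e. it neglects reorientation during the phase, and the crossover time $t^*$ is a symbol introduced here for illustration.

```latex
\begin{align}
d_{\mathrm{BP}}(t)  &\simeq \sqrt{4Dt} = L\sqrt{t/\tau}, \\
d_{\mathrm{ABP}}(t) &\simeq v\,t = \mathrm{Pe}\,L\,t/\tau, \\
d_{\mathrm{ABP}}(t^*) = d_{\mathrm{BP}}(t^*) &\;\Longrightarrow\; t^* = \tau/\mathrm{Pe}^2 .
\end{align}
```

Self-propulsion dominates for $t > t^*$. For Pe = 100 the crossover $t^* = 10^{-4}\tau$ is shorter than the shortest allowed duration $\tau_5 = \tau/4^5 \approx 9.8 \times 10^{-4}\tau$, so every active phase relocates the particle farther than diffusion would. For Pe ≤ 1 one has $t^* \geq \tau$, so diffusion covers the larger distance for every action time below $\tau$.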
More interesting is the behavior for intermediate activity (10 ≲ Pe ≲ 70). In this range, the optimal particle outperforms the simple passive particle and the optimal actions correspond to a behavioral policy that again displays the shortest passive phase (i.e. BP-like phases with duration $\tau_5$) but, to allow for significant relocation, selects a longer duration of the ABP phase, namely the duration $\tau_2$ for the considered cases, see Figure 4b. Consistently, for Pe = 20, 30, 50, and 70, the population produced by NEAT at the 9th generation shows a majority of individuals belonging to the sub-species selecting the same actions as those selected by the optimal particle. However, for these values of activity, the genetic algorithm still retains individuals of other sub-species as possible candidates for the optimal solution, especially those with a short passive phase. This results in the higher spread of the search times reported in Figure 4a. Finally, at Pe = 10 the optimal solution corresponds to an optimal particle selecting $\tau_3$ as the duration of both the passive and the active phase, but the NEAT algorithm in this case develops a population with a majority of BP-like individuals. This small inconsistency is likely due to the fact that, in this case, the average search time of the passive particle is very close to that of the optimal particle and, by chance, the first evolutionary lineage is preferred. A final remark is in order: All results reported so far are obtained by strongly reducing the number of possible ABP and BP phase durations. In particular, we allowed only $N_\tau = 5$ different durations that span a large time range. By doing so, the computational time required to run the NEAT algorithm is comparable with that of directly evaluating the performance of the completely Brownian particle and of the $N_\tau \times N_\tau$ possible switching individuals. However, any integer multiple of the integration time step $\Delta t$ could serve as a possible phase duration and one may consequently increase $N_\tau$, thus letting the ANNs select phase durations from a more fine-grained palette. Indeed, increasing $N_\tau$ is desirable to explore a larger set of switching individuals, thus making it possible to find even better phase durations that improve the target-search performance. However, by doing so, a brute-force check of all possible switching individuals quickly becomes inefficient, with computational costs increasing as $N_\tau^2$, which favors our choice of the NEAT algorithm, whose computational cost, in contrast, is unaltered. The Supplemental Material reports the equivalent of Figures 2 and 4 obtained by varying $N_\tau$, namely for $N_\tau$ = 10, 20, and 50. The learning performance obtained by the NEAT algorithm in these cases is comparable to the case study reported in the main text and, for large Péclet numbers, the average time to reach the target slightly outperforms that of the fittest individual when this is selected among the set of 25 switching individuals used in this section, see Figures S6 and S8 in the Supplemental Material. However, the fluctuations of the average time to reach the target increase with $N_\tau$. Furthermore, with an increasing number of allowed phase durations, more actions become valid candidates for the optimal action selected by the genetic algorithm.

CONCLUSIONS
Our findings demonstrate that genetic algorithms are a powerful tool to address the problem of finding targets of unknown positions for particles able to switch their behavior between a simple passive Brownian particle and an active Brownian particle. In particular, we equipped the particle with a neural network receiving the current state of the particle itself as the only input and returning as an output a decision regarding if and when the agent should switch its phase. We then showed that the algorithm NeuroEvolution of Augmenting Topologies is able to evolve an initial population of neural networks taking random decisions towards a population in which the majority of individuals are optimized to solve the target-search problem.
In principle, similar results on the target-search performance of intermittent passive-active Brownian particles could be obtained by resorting to RL algorithms [25]. However, in the RL framework, setting a target having a different unknown position at the beginning of each target-search episode would result in a very sparse reward function, which is generally a non-trivial problem in RL [25]. More specifically, since in our case the state of the agent is a simple binary variable and the rewards are extremely sparse, the reward signal has only a very low correlation with the particular state-action pair encountered when the target is found, making typical action-value methods such as Q-learning or SARSA [25] fail in learning successful strategies. If one wishes to follow an RL method, algorithms taking into account long sequences of visited state-action pairs should then be preferred because these methods, similarly to the genetic algorithm, seek to maximize the performance by directly evaluating the outcome of a given policy. Possible algorithms of this kind include policy-gradient and actor-critic methods [25] and the projective simulation algorithm [51]. These considerations do not exclude that successful results could be obtained by using more elaborate versions of the above-mentioned action-value methods and/or by defining the states and the actions differently, as proposed in Ref. [48] for non-intermittent searchers. On the other hand, genetic algorithms bypass the problem of reward sparseness by assigning the agents a fitness function that depends on the overall performance in accomplishing some task, and thus they are particularly suited to our case.
In the current setup, the output of any given individual ANN is fixed once the input is given, meaning that, given its current state, a particle always deterministically chooses the same action. This is different from the typical intermittent-search strategies discussed in Ref. [13], where an agent draws the phase durations from a certain distribution. However, similarly to the most general case [13,45], our results also show that there is an optimal duration of the active relocation phase which depends mainly on the magnitude of the activity and that, for very low activity, having an active phase no longer improves the target-finding efficiency. A natural step further in the direction of a standard intermittent-search model would be to adapt our algorithm in such a way that, instead of learning the optimal phase durations, it optimizes the parameters of a certain time distribution. Another possible extension of our work would be to link the output of the ANNs to a transition probability rather than to a deterministic action, with different individuals thus corresponding to different transition matrices.
Our paper provides a first attempt to use machine-learning methods to investigate the problem of finding targets of unknown positions in a simple homogeneous environment and paves the way to further research on this important problem. Having a minimal model with only two distinct phases is a choice that serves as our proof of concept that genetic algorithms are powerful tools to investigate target-search strategies in stochastic systems. However, in nature, searchers may have multiple dynamic modes. For example, dendritic cells searching for infections combine three distinct migration modes in their motion [52] and some DNA-binding proteins also have more than two dynamic states for the search [53]. This provides a solid biological motivation for a first generalization of our work to a case in which three or more distinct phases are considered as possible states of the agent. Other possible topics worth future investigation include problems with multiple and/or motile targets [13], target search with resetting events [54-56], and extensions to more realistic scenarios involving the presence of boundaries, obstacles, and energy barriers [57-60]. To pursue these goals, both the state and the actions of the agent may be made arbitrarily complex: For example, the agent could gather sensorimotor cues from the environment and, based on them, modify its behavior by controlling some motility parameters. Alternatively, similarly to what has been done recently in a different context [48], the agent can be equipped with the ability to sense the duration of its current phase. Finally, agents having a limited memory of the visited locations can also be investigated.

METHODS
To investigate how an evolutionary pressure allows the adaptive particles to develop successful target-search strategies, we resort to the genetic algorithm NeuroEvolution of Augmenting Topologies (NEAT) [49,50,61]. To do so, a simple Artificial Neural Network (ANN) is associated with our adaptive particle. The role of this network is to take as input the state of the particle (s = BP or ABP) and return as output the action to be performed, as described in the Model section. Starting from a population of $10^3$ individuals, the NEAT algorithm then iteratively creates new generations relying on biologically inspired operations such as mutation, crossover, and selection based on the fitness of each individual in the population. This fitness is defined as the number of targets that the individual manages to detect in a time equal to $5 \times 10^2\,\tau$. The evolution process is based on the principle of complexification of existing networks [49]: not only are the node biases and the edge weights adjusted to optimize the individuals' fitness, but this goal is also pursued by changing the topology of the network, i.e. by adding or deleting nodes and enabling or disabling some connections, see text below. The number of individuals in each generation is kept fixed.
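The fitness definition above can be sketched as a direct simulation that counts the targets found by a switching individual within a given time. This is a minimal illustration under stated assumptions, not the authors' implementation: units are L = 1 and τ = 1 (so D = 1/4 and v = Pe), the agent starts in the passive phase, the integration step `dt` and the rotational diffusion coefficient `d_theta` are free parameters, and the function name `count_targets` is hypothetical.

```python
import math
import random

def count_targets(i_bp, i_abp, pe, d_theta, total_time,
                  dt=1e-4, radius=0.05, seed=0):
    """Count targets found by an agent alternating BP phases of duration
    tau / 4**i_bp and ABP phases of duration tau / 4**i_abp."""
    rng = random.Random(seed)
    x, y = rng.random(), rng.random()        # agent position
    tx, ty = rng.random(), rng.random()      # target position
    theta, active = 0.0, False
    phase_left = 4.0 ** (-i_bp)
    amp = math.sqrt(2.0 * 0.25 * dt)         # sqrt(2 D dt) with D = 1/4
    amp_theta = math.sqrt(2.0 * d_theta * dt)
    found, t = 0, 0.0
    while t < total_time:
        v = pe if active else 0.0
        x = (x + v * math.cos(theta) * dt + amp * rng.gauss(0.0, 1.0)) % 1.0
        y = (y + v * math.sin(theta) * dt + amp * rng.gauss(0.0, 1.0)) % 1.0
        theta += amp_theta * rng.gauss(0.0, 1.0)
        if not active:                       # detection only in the passive phase
            dx = min(abs(x - tx), 1.0 - abs(x - tx))   # minimum-image distance
            dy = min(abs(y - ty), 1.0 - abs(y - ty))
            if dx * dx + dy * dy < radius * radius:
                found += 1
                tx, ty = rng.random(), rng.random()    # new random target
        t += dt
        phase_left -= dt
        if phase_left <= 0.0:                # end of phase: switch state
            active = not active
            if active:                       # fresh random orientation
                theta = rng.uniform(0.0, 2.0 * math.pi)
            phase_left = 4.0 ** (-(i_abp if active else i_bp))
    return found
```

For instance, `count_targets(5, 5, pe=100, d_theta=100.0, total_time=2.0)` simulates the sub-species with the shortest durations in both phases for a short run; the paper's fitness corresponds to `total_time = 500` in these units, which is considerably more expensive in pure Python.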
More specifically, we construct an initial population of networks having two input nodes and four output nodes, see Figure 1c. These are the nodes through which the ANN interacts with the external environment and they cannot be created or destroyed. The two input nodes recognize the state s of the particle and pass to the output nodes the signals $\chi_{\mathrm{BP}}(s)$ and $\chi_{\mathrm{ABP}}(s)$, each of which is further multiplied by the weight of the connection between the specific pair of input and output nodes. Each signal is a simple characteristic function, $\chi_S(s) = 1$ if s = S and $\chi_S(s) = 0$ otherwise.
The four output nodes aggregate by summation the signals coming from the various input nodes, add a bias, and return an output value according to a certain activation function. Formally, the output value of node j is given by $f\left(\sum_k w_{kj} x_k + b_j\right)$, where $x_k$ is the signal coming from node k, $w_{kj}$ is the weight of the connection between k and j, $b_j$ is the bias of node j, and f is the activation function, in our case a modified clamped function: $f(x) = 0$ if $x < 0$, $f(x) = x$ if $0 \le x \le 1$, and $f(x) = 1$ otherwise. Each output node j then returns a real output variable $x_j \in [0, 1]$ and these four output values together determine the action taken by the agent as follows. If the current state of the agent is s = BP (ABP), the first (third) node determines the next state, which is again BP (ABP) if the node output value is smaller than 0.5 and changes to ABP (BP) otherwise. The two options correspond respectively to $a_s = 0, 1$, see the Model section. The duration of the next phase, $a_\tau$, is instead determined by the second (fourth) output node, which selects the phase duration $\tau_i$ with i the integer part of $(1 + x_j N_\tau)$, $x_j$ being the output signal of the node and $N_\tau = 5$ the number of allowed phase durations, see the Model section.
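The mapping from network outputs to actions described above can be sketched for the initial topology (two inputs, four outputs, no hidden nodes). The clamp to [0, 1] is an assumed reading of the "modified clamped function", and the `min(..., n_tau)` guard for the boundary case x_j = 1 is an addition made here, since the integer part of (1 + x_j N_τ) would otherwise give i = N_τ + 1. The names `clamp01`, `forward`, and `decode_action` are hypothetical.

```python
N_TAU = 5

def clamp01(x):
    """Assumed activation: clamp the aggregated signal to the interval [0, 1]."""
    return 0.0 if x < 0.0 else (x if x <= 1.0 else 1.0)

def forward(state, weights, biases):
    """Evaluate the four output nodes.

    weights[k][j] is the weight from input node k (0: BP, 1: ABP)
    to output node j (0..3); biases[j] is the bias of output node j.
    """
    # Characteristic-function inputs: chi_BP(s) and chi_ABP(s).
    inputs = [1.0 if state == "BP" else 0.0, 1.0 if state == "ABP" else 0.0]
    return [
        clamp01(sum(weights[k][j] * inputs[k] for k in range(2)) + biases[j])
        for j in range(4)
    ]

def decode_action(state, outputs, n_tau=N_TAU):
    """Translate output values into an action (a_s, i)."""
    # Nodes 1-2 (indices 0, 1) are read in state BP, nodes 3-4 (indices 2, 3) in ABP.
    j_state, j_time = (0, 1) if state == "BP" else (2, 3)
    a_s = 0 if outputs[j_state] < 0.5 else 1
    i = min(int(1 + outputs[j_time] * n_tau), n_tau)  # guard for x_j = 1 (our addition)
    return a_s, i
```

For example, with `weights = [[1.0, 0.9, 0.0, 0.0], [0.0, 0.0, 0.2, 0.1]]` and zero biases, the agent in state BP switches to ABP for the duration τ_5, while in state ABP it stays active for the duration τ_1.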
The individuals in the initial population are selected randomly (i.e. with random biases and random connection weights) but all share the topology just described. However, in subsequent generations individuals with new hidden nodes emerge. These nodes receive the signals coming from the input nodes and possibly from other hidden nodes and, in the same fashion as described for the output nodes, return an output value which is collected as an input signal by the output nodes and possibly by other hidden nodes. During mutations, hidden nodes are generated with a probability $p_{\mathrm{add}}$ and deleted with probability $p_{\mathrm{del}}$. Following standard practice [50,61], we set $p_{\mathrm{add}} = p_{\mathrm{del}} = 0.05$.
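Because hidden nodes are added and deleted with fixed probabilities, the number of hidden nodes h can be pictured as a simple birth-death process. The sketch below is a minimal illustration, not the equations used for the figures: it assumes that in each generation a network gains one hidden node with probability p_add, loses one with probability p_del (only if h > 0), that the two events are mutually exclusive, and that selection does not act on h. The function name `step` and the cutoff `H_MAX` are hypothetical.

```python
P_ADD = 0.05  # probability of adding a hidden node (value from the text)
P_DEL = 0.05  # probability of deleting a hidden node (value from the text)
H_MAX = 10    # ASSUMPTION: truncation of the state space, for illustration only


def step(P, p_add=P_ADD, p_del=P_DEL):
    """One generation of the birth-death master equation for P[h]."""
    new = [0.0] * len(P)
    for h, p in enumerate(P):
        up = p_add if h + 1 < len(P) else 0.0  # no growth beyond the cutoff
        down = p_del if h > 0 else 0.0         # nothing to delete at h = 0
        new[h] += p * (1.0 - up - down)
        if up:
            new[h + 1] += p * up
        if down:
            new[h - 1] += p * down
    return new


# All individuals start without hidden nodes (h = 0).
P = [1.0] + [0.0] * H_MAX
history = [P]
for generation in range(9):
    P = step(P)
    history.append(P)

for h in range(3):
    print(f"h = {h}: fraction after 9 mutation rounds = {history[9][h]:.3f}")
```

Under these assumptions, the fraction of networks without hidden nodes decays slowly from 1, while the fractions with h = 1 and h = 2 grow from 0.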
The initial network topology with two input nodes and four output nodes works particularly well in our target-search problem. However, while in the main text we report only results obtained by starting with this initial setup, we also tested other topologies as well as a different number of phase durations, see Supplemental Material.
Supplemental Material for "Adaptive Active Brownian particles searching for targets of unknown positions"

Harpreet Kaur¹, Thomas Franosch¹, and Michele Caraglio

Here, we consider a different choice for the initial topology of the artificial networks associated with the individuals of the first generation.
More specifically, we construct an initial population of networks having one input node and only two output nodes. The input node recognizes the state s of the particle and passes to the output nodes a signal s = 0 if the particle is in the diffusive Brownian phase and s = 1 if the particle is in the active Brownian phase. The two output nodes together determine the action taken by the agent. In particular, the first node determines the next state, which remains unchanged if the node output value is smaller than 0.5 and switches to the other state otherwise. The two options correspond respectively to $a_s = 0, 1$. The duration of the next phase, $a_\tau$, is instead determined by the second output node, which selects the phase duration $\tau_i$ with i the integer part of $(1 + x_j N_\tau)$, $x_j \in [0, 1]$ being the output signal of the node and $N_\tau = 5$ the number of allowed phase durations, see Model section in the main text.
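A compact sketch of this reduced topology may help to see one of its consequences: with a single input s ∈ {0, 1} and no hidden nodes, the action taken in the Brownian phase (s = 0) is fixed by the two output biases alone, while the weights only matter in the active phase (s = 1). The code assumes a plain clamp to [0, 1] as activation and caps the duration index at $N_\tau$; the names `clamp01` and `policy` and the numerical values are hypothetical.

```python
N_TAU = 5  # number of allowed phase durations, as stated in the text


def clamp01(x):
    """ASSUMPTION: plain clamp to [0, 1], used as a stand-in activation."""
    return min(1.0, max(0.0, x))


def policy(w, b, s):
    """Return (next_state, i) for input signal s (0: BP, 1: ABP).

    w[j] and b[j] are the weight and bias of output node j (j = 0, 1).
    """
    outs = [clamp01(w[j] * s + b[j]) for j in range(2)]
    next_state = 1 - s if outs[0] >= 0.5 else s   # switch if output >= 0.5
    i = min(int(1 + outs[1] * N_TAU), N_TAU)      # ASSUMPTION: cap at N_TAU
    return next_state, i


# Hypothetical parameters, chosen only for illustration.
w = [0.6, -0.4]
b = [0.2, 0.7]

print("BP  ->", policy(w, b, 0))  # depends on the biases only
print("ABP ->", policy(w, b, 1))  # depends on weights and biases
```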
Figures 5 and 6 are respectively the analogs of Figures 2 and 3 of the main text, as obtained by using the initial network topology described above.

In this section we fix $N_\tau = 10$ and define $\tau_i = \tau / 2^i$ with $i = 1, \ldots, N_\tau$, see Model section in the main text for more details.

In this section we fix $N_\tau = 20$ and define $\tau_i = \tau / 1.5^i$ with $i = 1, \ldots, N_\tau$, see Model section in the main text for more details.

In this section we fix $N_\tau = 50$ and define $\tau_i = \tau / 1.16^i$ with $i = 1, \ldots, N_\tau$, see Model section in the main text for more details.
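To compare the three sets of phase durations with the one quoted in the caption of Fig. 1 ($\tau_i = \tau / 4^i$, $i = 1, \ldots, 5$), the short sketch below tabulates the longest and shortest duration of each set in units of τ. The helper `durations` and the dictionary `GRIDS` are hypothetical names introduced for illustration.

```python
def durations(base, n_tau, tau=1.0):
    """Phase durations tau_i = tau / base**i for i = 1, ..., n_tau."""
    return [tau / base**i for i in range(1, n_tau + 1)]


# (base, N_tau) pairs: the first from the Fig. 1 caption, the others from the text above.
GRIDS = {
    "N_tau = 5,  base 4   ": (4.0, 5),
    "N_tau = 10, base 2   ": (2.0, 10),
    "N_tau = 20, base 1.5 ": (1.5, 20),
    "N_tau = 50, base 1.16": (1.16, 50),
}

for label, (base, n_tau) in GRIDS.items():
    d = durations(base, n_tau)
    print(f"{label}: longest = {d[0]:.4f} tau, shortest = {d[-1]:.2e} tau")
```

With these definitions, the shortest duration for $N_\tau = 10$ coincides with that for $N_\tau = 5$ (both equal $\tau / 1024$, since $2^{10} = 4^5$), and the other two sets span a comparable range with a finer subdivision.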

FIG. 2. For R = 0.05L, Pe = 100, and * = 1: (a) Fraction of individuals in each species α as a function of the generation. The inset reports the fraction of individuals in each sub-species $\alpha_{i,j}$, with $\alpha_{i,j}$ referring to switching individuals having a BP phase duration $\tau_i$ and an ABP phase duration $\tau_j$. (b) Average time required to reach the target as a function of the generation. The continuous line with circular points represents the average over the switching individuals, while the box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the switching individuals only. Starting from the 3rd generation, the 10th, 25th, 50th, and 75th percentiles collapse onto the optimal particle benchmark.

FIG. 3. For R = 0.05L, Pe = 100, and * = 1: (a) Fraction of individuals with different numbers of hidden nodes, h, in sub-species $\alpha_{5,5}$ (individuals having both the BP and the ABP phase lasting $\tau_5$) as a function of the generation. Dashed lines are the theoretical expectations obtained by solving the master equations for each group (probability of adding or deleting a node is 0.05). (b) Typical ANN topologies for individuals belonging to the sub-species $\alpha_{5,5}$ and with a number of hidden nodes equal to h = 0, h = 1, and h = 2, respectively. Individuals are extracted from the 9th generation. The width of the edges is proportional to the absolute value of the weight of the connections between different nodes.

FIG. 4. For R = 0.05L and * = 1: (a) Time required to reach the target as a function of the Péclet number at the 9th generation. The box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the whole population. The dashed lines represent the values for the three benchmarks. (b) Fraction of individuals in each sub-species for different Péclet numbers. The box highlighted in black represents the action selected by the optimal particle.

FIG. 5. For R = 0.05L, Pe = $10^2$, and * = 1: (a) Fraction of individuals in each species α as a function of the generation. The inset reports the fraction of individuals in each sub-species $\alpha_{i,j}$, with $\alpha_{i,j}$ referring to switching individuals having a BP phase duration $\tau_i$ and an ABP phase duration $\tau_j$. (b) Time required to reach the target as a function of the generation. The continuous line with circular points represents the average over the switching individuals, while the box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the switching individuals only. The dashed lines represent the values for the three benchmarks.

FIG. 7. For R = 0.05L, Pe = $10^2$, and * = 1: (a) Fraction of individuals in each species α as a function of the generation. The inset reports the fraction of individuals in each sub-species $\alpha_{i,j}$, with $\alpha_{i,j}$ referring to switching individuals having a BP phase duration $\tau_i$ and an ABP phase duration $\tau_j$. (b) Time required to reach the target as a function of the generation. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 2 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 8. For R = 0.05L and * = 1: Time required to reach the target as a function of the Péclet number at the 9th generation. The box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the whole population. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 4 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 9. For R = 0.05L, Pe = 2, and * = 1: (a) Fraction of individuals in each species α as a function of the generation. The inset reports the fraction of individuals in each sub-species $\alpha_{i,j}$, with $\alpha_{i,j}$ referring to switching individuals having a BP phase duration $\tau_i$ and an ABP phase duration $\tau_j$. (b) Time required to reach the target as a function of the generation. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 2 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 10. For R = 0.05L and * = Time required to reach the target as a function of the Péclet number at the 9th generation. The box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the whole population. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 4 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 11. For R = 0.05L, Pe = $10^2$, and * = (a) Fraction of individuals in each species α as a function of the generation. The inset reports the fraction of individuals in each sub-species $\alpha_{i,j}$, with $\alpha_{i,j}$ referring to switching individuals having a BP phase duration $\tau_i$ and an ABP phase duration $\tau_j$. (b) Time required to reach the target as a function of the generation. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 2 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 12. For R = 0.05L and * = Time required to reach the target as a function of the Péclet number at the 9th generation. The box-and-whiskers symbols report the 10th, 25th, 50th, 75th, and 90th percentiles among the whole population. The dashed lines represent the values for the three benchmarks and have the same values as in Fig. 4 of the main text (i.e. no brute-force individual check of the performances of the $N_\tau \times N_\tau$ possible switching individuals has been done).

FIG. 1. (a) Typical distance travelled during action time $\tau_i = \tau / 4^i$ ($i = 1, \ldots, 5$) by a BP and by a particle moving ballistically for three different Péclet numbers. (b) Sketch of a typical trajectory exploring space by alternating between the two different phases (BP and ABP) and ending with encountering the target. (c) The agent is modeled as a neural network taking as input the state of the agent itself and returning as output the action to be performed. Further details are given in the Methods section. Each action first changes (or leaves unchanged) the state of the particle and then integrates the equations of motion for a time $\tau_i$.

¹ Institut für Theoretische Physik, Universität Innsbruck, Technikerstraße 21A, A-6020, Innsbruck, Austria

I. DIFFERENT TOPOLOGY OF THE INDIVIDUALS IN FIRST GENERATION

FIG. 6. Pe = $10^2$ and * = 1: (a) Fraction of individuals with different numbers of hidden nodes, h, in sub-species $\alpha_{5,5}$ (individuals having both the BP and the ABP phase lasting $\tau_5$) as a function of the generation. Dashed lines are the theoretical expectations obtained by solving the master equations for each group (probability of adding or deleting a node is 0.05). (b) Typical ANN topologies for individuals belonging to the sub-species $\alpha_{5,5}$ and with a number of hidden nodes equal to h = 0, h = 1, and h = 2, respectively. Individuals are extracted from the 9th generation. The width of the edges is proportional to the absolute value of the weight of the connections between different nodes.