Optimal foraging strategies can be learned

The foraging behavior of animals is a paradigm of target search in nature. Understanding which foraging strategies are optimal and how animals learn them are central challenges in modeling animal foraging. While the question of optimality has wide-ranging implications across fields such as economics, physics, and ecology, the question of learnability is a topic of ongoing debate in evolutionary biology. Recognizing the interconnected nature of these challenges, this work addresses them simultaneously by exploring optimal foraging strategies through a reinforcement learning (RL) framework. To this end, we model foragers as learning agents. We first prove theoretically that maximizing rewards in our RL model is equivalent to optimizing foraging efficiency. We then show with numerical experiments that, in the paradigmatic model of non-destructive search, our agents learn foraging strategies that outperform some of the best-known strategies, such as Lévy walks, in efficiency. These findings highlight the potential of RL as a versatile framework not only for optimizing search strategies but also for modeling the learning process, thus shedding light on the role of learning in natural optimization processes.

Strategies to search for randomly distributed targets are of paramount importance in many fields. For instance, they are widely used in ecology to model the foraging activities of predators [1], hunters [2], and gatherers [3], as well as the movement of pedestrians in complex environments [4]; other interesting applications are the study of human information search in complex knowledge networks [5] and the improvement of optimization algorithms [6].
A widely used and investigated idealization of such search problems is the model of non-destructive foraging (cf. Fig. 1). The model consists of a two-dimensional environment with randomly, uniformly, and sparsely distributed immobile targets, which can be detected by a walker within a detection radius r. The walker moves through the environment at constant speed in steps (straight segments) of random orientation and varying step length L. As soon as the walker detects a target, its current step is aborted, whereupon it immediately takes its next step from a position displaced in a random direction by a cutoff length l_c > r from the detected target.
The goal of the walker is to maximize the search efficiency

η = n_T / T,   (1)

where T is physical time and n_T is the average number (with respect to target positions) of targets detected in the time interval [0, T]. In several fields, η^(-1) is known as the mean first-passage time (MFPT), since it equals the average time a walker needs to detect the first target [7]. Indeed, MFPT-based approaches are widely used to study and compare different search strategies [8-11]. However, the hardness of computing the MFPT in realistic scenarios poses an obstacle to the theoretical understanding of optimal search strategies.

According to the hypothesis of learning-based movement, biological entities can adapt and learn in order to optimize their search efficiency (i.e., the number of targets collected per unit time) [12]. Different animals (e.g., albatrosses [13,14], bison [15], bumblebees [16,17], deer [18], or bats [19]) may learn different strategies, depending on their cognitive capacities, surrounding environment, or biological pressure [20].
Non-destructive foraging corresponds to the biologically relevant case of foraging replenishable targets, where larger cutoff lengths l_c correspond to longer replenishment times of the targets [21]. In particular, the case of destructive foraging, where targets vanish after detection, can formally be recovered by taking the limit l_c → ∞.
In the foraging model under consideration, simplifying assumptions are made regarding the walker's selection of step lengths. These assumptions state that the walker's choice of step length is independent of its position, of any previous detection events, and of any prior choices of step lengths. This results in a widely investigated class of random walks, in which the walker samples independently from a probability distribution Pr(L) at each step to determine its current step length. This is the class of random walks we consider in this work.
Motivated by observational data [13,22,23] and supported by heuristic arguments [13,24,25], a family of step-length distributions known as Lévy distributions [26,27], Pr(L) ∝ s^µ L^(-(µ+1)), where s is a length-scale parameter and µ ∈ [0, 2], has received major attention, including the discovery of the optimal value of µ in certain limiting cases [7,25], such as l_c → ∞ and l_c → r. While a number of numerical studies have shown that the optimal value of µ typically depends sensitively on the environmental parameters [7,21,25,28,29], it has recently been shown that other families of step-length distributions Pr(L), such as bi- and tri-exponential distributions, outperform Lévy distributions [8,30-32]. In general, it is still an open problem which strategy is optimal for non-destructive foraging, both in terms of its step-length distribution family and its precise parameters.
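As a concrete illustration (not part of the original model specification), a power-law step-length distribution of this form can be sampled by inverse transform, since its survival function Pr(L > x) = (s/x)^µ is analytically invertible; the function name below is our own:

```python
import random

def sample_levy_step(mu, s=1.0):
    """Sample a step length L from a Pareto-type power law
    Pr(L) ∝ L^{-(mu+1)} with lower cutoff s, via inverse-transform
    sampling: Pr(L > x) = (s/x)^mu, so L = s * (1 - u)^{-1/mu}
    for u uniform in [0, 1)."""
    u = random.random()                    # uniform in [0, 1)
    return s * (1.0 - u) ** (-1.0 / mu)    # 1 - u is in (0, 1], so L >= s
```

Smaller values of µ produce heavier tails and hence more frequent very long steps, which is the qualitative feature that makes Lévy walks interesting as search strategies.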
Apart from the exact mathematical description of the search strategies followed by biological entities, another debate concerns the question of how animals come to possess knowledge of particular search strategies [12], which has recently sparked interest from a neurological perspective [33,34]. More generally, the origin of search strategies is usually classified as either emergent or evolutionary [35]. The former considers that an individual walker can learn different strategies via complex interactions with different environments, while the latter proposes that a single strategy has evolved via natural selection in order to optimize search in different environments [20]. Understanding emergent and adaptive processes requires going beyond the study of optimal search strategies based on heuristic ansätze; it calls for a framework that naturally incorporates learning.
This paper presents such a framework, rooted in the principles of reinforcement learning (RL). What sets our model apart from normative approaches, which aim to directly optimize an objective function such as the search efficiency, is its mechanistic nature [36]: in the proposed RL framework, optimization occurs as the result of an interactive learning process which does not assume direct access to the objective function. This allows for valuable insights into the optimization of search strategies and the comprehension of their emergence through the process of learning. In summary, by abstracting away extraneous biological details, we arrive at a concise framework specifically designed to model how biological agents learn to forage.
Non-destructive foraging in the RL framework.- We base our framework on RL, a standard paradigm of machine learning [37]. To this end, we first describe RL in general before we proceed with embedding non-destructive foraging within the framework of RL.
RL considers an agent interacting with an environment (see Fig. 2). During each interaction cycle, called an RL step, the agent perceives a state s ∈ S from the environment and replies with an action a ∈ A. We adopt the common assumption that the agent's decision making is Markovian, which means that the probability π(a|s) that the agent chooses action a depends only on the current state s; π(a|s) is called the policy of the RL agent. Further, we consider the environment to be partially observable, which means that the agent cannot perceive the full state e ∈ E of the environment, that is, s ≠ e.
The agent also receives a reward signal R(e, a) depending on the current action a and the state of the environment e. The goal of the agent is then to optimize its policy such that the expected average reward

R_π(e) = lim_{T→∞} (1/T) E[ Σ_{t=0}^{T-1} R_t ],   (2)

is maximized, where t ∈ N_0 labels the RL steps and E denotes the expectation with respect to all infinite trajectories of agent-environment interactions, given that the agent employs policy π and that the environment is initially in state e ∈ E. In practice, the optimization of Eq. (2) with respect to the policy π is governed by an RL algorithm, which is discussed further below.
In the following, we embed non-destructive foraging within the framework of RL. The idea is to model the walker as an RL agent. A naive approach would be to let the agent choose its step length L in each RL step without ever perceiving a state from the environment. However, this would correspond to an infinite action space A and an empty state space S, which often represents an obstacle to efficient learning.
Therefore, we proceed by reformulating the non-destructive foraging model. As we show below, this is only a reformulation and hence gives rise to exactly the same class of walks, which are fully characterized by sampling independently at each step from a step-length distribution Pr(L). The reformulation consists in discretizing the step length L in units of small steps d, such that at each RL step the agent only has the following choice: either continue in the direction of its previous step or turn in a random direction. This formulation indeed resembles that of Ref. [30], although learning was not considered in that work.
To this end, given the environmental parameters r, l_c, and d, which characterize the search problem, the state of the environment is defined by the positions of the targets, together with the current position, walking direction, and a step counter n ∈ N of the agent. The counter is reset to n = 1 whenever the agent turns or detects a target, and it increases by one whenever the agent continues without detecting a target. In each RL step, the counter is perceived by the agent, i.e., s = n. The two possible actions are continue and turn, symbolically denoted by ↑ and ↱, which correspond to walking for a distance d either in the current or in a random direction, respectively. If the agent detects a target, its position is resampled at a distance l_c from the detected target, according to the non-destructive foraging model. Then, the agent walks a small step of length d in a random direction, and it perceives the counter state n = 1 and a reward R = 1. Otherwise, when no target is detected, R = 0 and the agent continues its walk unimpeded.
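The interaction loop just described can be sketched as a minimal simulation. This is a toy implementation under simplifying assumptions (detection is checked only at the endpoints of the small steps, and periodic images are ignored in the distance check); all names are ours:

```python
import math, random

def run_walk(policy_turn, targets, T=10000, r=0.5, lc=3.0, d=1.0, W=100.0):
    """Toy sketch of one episode of the discretized foraging walk in a
    periodic box of side W. policy_turn(n) gives pi(turn|n), the
    probability of turning at counter n. Returns the total reward
    (number of detections) accumulated over T RL steps."""
    x, y = random.uniform(0, W), random.uniform(0, W)
    theta = random.uniform(0, 2 * math.pi)
    n, rewards = 1, 0
    for _ in range(T):
        if random.random() < policy_turn(n):      # action: turn
            theta = random.uniform(0, 2 * math.pi)
            n = 1
        else:                                      # action: continue
            n += 1
        x = (x + d * math.cos(theta)) % W          # one small step of length d
        y = (y + d * math.sin(theta)) % W
        hit = next((t for t in targets if math.dist((x, y), t) < r), None)
        if hit is not None:
            rewards += 1                           # R = 1 on detection
            phi = random.uniform(0, 2 * math.pi)   # displace by lc from target
            x = (hit[0] + lc * math.cos(phi)) % W
            y = (hit[1] + lc * math.sin(phi)) % W
            theta = random.uniform(0, 2 * math.pi) # resume in a random direction
            n = 1                                  # counter reset
    return rewards
```

With a fixed policy, the accumulated reward divided by the elapsed time directly estimates the search efficiency η of that policy.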
The probability that the agent makes a step of length L = Nd is thus given by

Pr(L = Nd) = π(↱|N) ∏_{n=1}^{N-1} π(↑|n),   (3)

where we use the convention that the empty product, in the case N = 1, equals one. Since there are only two actions, we have π(↱|n) = 1 - π(↑|n).

We proceed by showing that our RL agent maximizes its search efficiency η by maximizing the expected average reward in Eq. (2). First, we note that the average number of targets detected during the first T RL steps, given that policy π is employed and that the environment is initially in state e, can be written as

n_T = E[ Σ_{t=0}^{T-1} R_t ],   (4)

since R = 1 when a target is detected and R = 0 otherwise. By inserting Eq. (4) into the definition of the detection efficiency, Eq. (1), we immediately see that η = R_π(e); therefore, η is maximal when R_π(e) is. This proves an important feature of our RL formulation of foraging: reinforcement learning converges to optimal search strategies.

Since our RL formulation moves the focus from step-length distributions Pr(L) to policies π, we proceed to show which policy reproduces a given step-length distribution. Policy and step-length distribution are related via Eq. (3). Inserting π(↑|n) = 1 - π(↱|n) into Eq. (3) yields a recurrence for π(↱|n), the solution of which we find to be

π(↱|N) = Pr(L = Nd) / Σ_{n=N}^{∞} Pr(L = nd),   (5)

for any N such that the denominator is non-vanishing. If the denominator vanishes, i.e., Pr(L = nd) = 0 for all n ≥ N, then no step can continue beyond length (N - 1)d, which implies π(↱|n) = 1 for some n < N. In this case, the agent would always turn before perceiving states with n ≥ N, and the policy for n ≥ N does not affect the walk.

Some important remarks are in order: (i) while the agent has information on the current step length via the counter state, it is unaware of any previous step lengths, due to the counter reset; (ii) given the former, the motion of our agent classifies as a semi-Markov process, since the points where direction changes occur form a Markov chain, as is the case for most Lévy walks and related diffusion processes [27]; (iii) last, as shown by Eqs. (3) and (5), the family of walks which can be generated by the RL agent is precisely that with Pr(L)-distributed step lengths. In the form presented in this work, agents are neither designed for nor capable of performing walks that are non-Markovian in step length, i.e., walks with step-length distributions of the form Pr(L_t | L_{t-1}, L_{t-2}, ...).
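Eqs. (3) and (5) amount to a pair of mutually inverse transformations between turning probabilities and a discrete step-length distribution. A minimal sketch (function names are ours):

```python
def policy_from_distribution(p):
    """Eq. (5): turning probabilities pi(turn|N) that reproduce a discrete
    step-length distribution p, where p[i] = Pr(L = (i+1)*d).
    pi(turn|N) = Pr(L = N d) / Pr(L >= N d)."""
    pi, tail = [], 1.0          # tail tracks Pr(L >= current length)
    for pn in p:
        pi.append(pn / tail if tail > 0 else 1.0)
        tail -= pn
    return pi

def distribution_from_policy(pi):
    """Eq. (3): Pr(L = N d) = pi(turn|N) * prod_{n<N} (1 - pi(turn|n))."""
    p, survive = [], 1.0        # survive tracks the empty-product convention
    for t in pi:
        p.append(t * survive)
        survive *= 1.0 - t
    return p
```

Round-tripping a distribution through both functions recovers it, which mirrors the statement that policies and step-length distributions are in one-to-one correspondence.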
Finally, from Eq. (5) it can also be seen that the agent must employ a probabilistic policy to reproduce arbitrary step-length distributions; deterministic policies with π(a|s) ∈ {0, 1} can only generate walks with constant step lengths.
Numerical case study.- In the framework of RL, learning consists of updating the policy in order to maximize the obtained rewards (Eq. (2)) in response to agent-environment interactions. From a plethora of RL algorithms, we choose projective simulation (PS) [38,39] (for details, see Appendix B and [40]). We expect that similar results can be found with other RL algorithms such as policy gradient [37].
As benchmarks for the trained RL agents, we use the widely investigated Lévy walks [7,13,25,26] as well as the strongest ansatz considered in the literature so far: a random walk with a bi-exponential step-length distribution [8,31,32] (see Appendix A for details). Their search efficiencies are computed numerically by simulating an agent with an equivalent, fixed policy, obtained via Eq. (5); for short, these are referred to as the Lévy and bi-exponential policies, respectively.
It is important to note that an RL agent is not expected to learn any of the benchmark policies unless these are the only ones that maximize search efficiency (i.e., these strategies are indeed optimal). Nevertheless, we want to demonstrate, from a numerical perspective, that an RL agent based on PS has the capacity to learn any of these policies. To this end, we train agents to imitate the benchmark policies with various parameters and find near-perfect convergence of their learned policies (for further details on imitation learning, see Appendix B 1). These results suggest that, if any of the known foraging strategies is optimal, we can expect a trained agent to converge to it. With this result in mind, we return to the original problem of learning optimal foraging strategies by interacting with an environment with randomly distributed targets, see Fig. 2.
We model the infinite foraging environment by a two-dimensional square box of size W = 100 with periodic boundary conditions, containing 100 randomly distributed targets (corresponding to a target density ρ = 0.01) with a detection radius r = 0.5, where the unit length is defined as the small-step size of the agent, d = 1. In what follows, we consider all lengths to be measured in units of d = 1 and adopt a dimensionless notation. In this work, the agents are always initialized with the policy π_0(↱|n) = 0.01 for all n. Further details on this and the respective parameters of the learning process are presented in Appendix B.
We simulate RL agents in various foraging environments with different cutoff lengths l_c. Fig. 3 shows the learning curves of the agents, i.e., the evolution of the search efficiency over the course of learning (see Appendix B 2 for details). The performance averaged over 10 agents matches or exceeds the best benchmark in most cases, showing the robustness of the proposed RL method. Importantly, in all but one of the tested environments, the best of the 10 trained agents achieves a higher efficiency than the respective best benchmark model (see Fig. 3b), suggesting that there exist optimal walks with step-length distribution Pr(Nd) as given by Eq. (5) beyond the benchmark model families.
In order to better understand the learned policies and the origin of such an advantage, we start by comparing the learned policy with the benchmarks for the example l_c = 0.6, see Fig. 4. For n ≲ 10^2, training significantly changes the initial policy, and the learned policy shows greater similarity to the bi-exponential than to the Lévy policy. For n ≫ 10^2, the learned policy still equals its initial value of 0.01. This is due to practical limitations of the training and can in particular be attributed to a limited training time, as detailed in Appendix B 3. However, note that the policy for large n only affects the agent's behavior at sufficiently long steps, which happen very rarely, and thus the resulting effect on the agent's walk is rather small. Fig. 5a shows the learned policies for environments with l_c ∈ [1, 10], revealing an interesting property of the learned policies: the probability of turning at n = l_c is significantly increased for all environments considered, and the smaller l_c, the more pronounced this effect is (see inset of Fig. 5a). This can be understood by an approximation: in the regime of low target density and small l_c, after the detection of a target, the immediate environment of the agent contains, to a good approximation, only the previously detected target. Then, apart from the possibility of returning to the target by walking straight for l_c, there is a significant probability p_↱ that the agent first misses the target when walking straight for l_c but then detects it in the next step by turning at n = l_c, see inset of Fig. 5a. This explains the added benefit of turning at n = l_c (for details, see Appendix C).
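The single-target argument can be checked numerically with a rough Monte Carlo sketch. This is our own construction, not the calculation of Appendix C: detection is approximated at the endpoints of the small steps, and "within reach after one step" is taken to mean that the walk's endpoint lies within r + d of the target center:

```python
import math, random

def estimate_p_turn(lc, r=0.5, d=1.0, samples=100000, rng=None):
    """Monte Carlo sketch of the single-target scenario: the agent starts
    a distance lc from a target of detection radius r (target at the
    origin, agent at (lc, 0)) and walks straight for ~lc at a uniformly
    random angle, in small steps of length d. Returns (p_straight, p_turn):
      p_straight: the straight walk itself comes within r of the target;
      p_turn: the walk misses, but its endpoint lies within r + d of the
              target, so one suitably-angled extra step would detect it."""
    if rng is None:
        rng = random.Random(0)
    n_steps = max(1, round(lc / d))
    hits_straight = hits_turn = 0
    for _ in range(samples):
        phi = rng.uniform(0, 2 * math.pi)
        hit = False
        for k in range(1, n_steps + 1):        # check each small-step endpoint
            x = lc + k * d * math.cos(phi)
            y = k * d * math.sin(phi)
            if math.hypot(x, y) < r:
                hit = True
                break
        if hit:
            hits_straight += 1
        elif math.hypot(x, y) < r + d:         # missed, but reachable by turning
            hits_turn += 1
    return hits_straight / samples, hits_turn / samples
```

Under these assumptions, turning at n ≃ l_c captures a noticeably larger angular window than the straight return, which is consistent with the learned turning peak at n = l_c.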
Using Eq. (3), we translate the learned policies from Fig. 5a into their corresponding step-length distributions (Fig. 5b). The learned distributions show a pronounced peak at L = l_c, which can be interpreted as the agents learning to forage with a length scale related to the cutoff length. Then, for 10^1 ≲ L ≲ 10^2, we find a region where the probability remains stable, i.e., with a smaller probability the agents also take longer steps, associated with a second and larger length scale. For L > 10^2, Pr(L) increases before it decreases exponentially, which is due to the finite training times (see Appendix B 3). This prevents us from fully characterizing the second length scale (and any other that might appear for larger L), which is commonly related to the average spacing between targets [32].
Indeed, "two-scale" foraging strategies, such as the bi-exponential benchmark, have been widely studied in the literature [20,31,35], and their approximate emergence from our learning model is interesting. It should be noted, however, that this emergence is only approximate: first, we found the learned policies to perform even slightly better than the bi-exponential benchmark in most cases. Second, the learned step-length distributions are not fully characterized by two scales, nor is the problem of non-destructive foraging, since the detection radius r represents a third length scale [32]. Therefore, there are good reasons to hope for learning advantages over benchmark model families, as found in this work.
Discussion.- We introduced a framework based on reinforcement learning (RL) for non-destructive foraging in environments with sparsely distributed immobile targets. The framework is based on analytic results which involve an exact correspondence between the policy π(a|n) and the step-length distribution Pr(L) of the agent, and a proof that learning converges to optimal foraging strategies. This not only makes the proposed RL framework a viable alternative to traditional approaches, but also introduces remarkable benefits.
Firstly, learning is not restricted to a specific ansatz, such as the Lévy or bi-exponential step-length distributions that have been extensively studied in the literature. Even in cases where the latter are optimal, RL could be leveraged to find the suitable parameters of the respective model family. Most importantly, RL together with Eq. (3) paves the way for the discovery of novel, more efficient step-length distributions.
Secondly, the RL framework sidesteps the problem of what knowledge about the environment (e.g., characteristic length scales) should be considered available to the agent; while it is often assumed that foragers do not possess any knowledge about the environment, it is also well-known that informed agents usually outperform their agnostic counterparts [20,32]. With RL, it is not necessary to artificially endow the agent with knowledge about the environment. Instead, the agent implicitly learns such knowledge by interacting with the environment. Interestingly, recent work has shown that RL agents create implicit maps of their environments even when they are not directly programmed to do so [41]. As shown in our numerical case study, RL agents are able to adapt to different length scales of the environment. Therefore, the assumption that foragers do not possess any knowledge about the environment is unrealistic for adaptive, learning foragers.
The proposed framework could also be leveraged to understand how search strategies are learned by foragers in biological scenarios. This is related to the debate on the emergent versus evolutionary hypotheses (see Introduction). Both evolutionary priors and learning could be modelled within our framework, via policy initialization and policy updating, respectively. For example, we showed numerically (Fig. 3) that an agent initialized with a preference for walking straight efficiently learns to perform better than Lévy searches.
Moreover, we expect RL to be especially beneficial when dealing with more complex environments, where exploiting the numerous properties of the environment can lead to better foraging strategies. Some examples include complex topographies [29], non-uniform distributions of targets [20], the presence of gradient forces or obstacles, or even self-interacting foragers [42]. Another interesting perspective involves endowing the agents with a memory which goes beyond a mere step counter. For instance, the agent could remember its past decisions for a given number of steps, similar to what is considered in [43]. Moreover, RL can be used to investigate multi-agent scenarios [39,44] and their connection with known collective phenomena such as flocking [45,46].

Projective simulation

The memory of a PS agent is structured as a directed, weighted graph, the so-called episodic and compositional memory (ECM). In the simplest case, the nodes of this graph are distributed in two layers, one of which represents the states and the other the actions. Each state is connected to all possible actions by directed, weighted edges. The decision making of a PS agent is stochastic: the probability of performing action a ∈ A having perceived state s ∈ S is

π(a|s) = h(s, a) / Σ_{a′∈A} h(s, a′),   (B1)

where h(s, a), called an h-value, is the weight of the edge that connects s with a. The agent learns by updating the h-values, and thus its policy π(a|s), at the end of each RL step. During one interaction with the environment, i.e., one RL step, the agent (i) perceives a state s, (ii) decides on an action a by sampling from its current policy π(a|s) and performs the action, (iii) receives a reward R from the environment, and (iv) updates its memory accordingly. The latter is implemented by updating the matrix H that contains the h-values,

H ← H - γ (H - H_0) + R G,   (B2)

where γ is a damping factor that controls how quickly the agent forgets, H_0 is the initial H matrix, and G is the so-called glow matrix, which allows the agent to learn from delayed rewards, i.e., rewards that are only received several interactions with the environment later. The glow mechanism tracks which edges in the ECM were traversed prior to receiving a reward. To do so, whenever there is a transition from state s to action a, the corresponding edge "glows" with an intensity given by the glow value g(s, a), which is stored in the G matrix. The glow has an initial value of g(s, a) = 0 and increases by 1 every time the edge is traversed (for details on the definition and update of the G matrix, see [49,52]). At the end of each RL step, G is damped as

G ← (1 - η_g) G,   (B3)

where η_g is the damping factor. Both η_g and γ are considered hyperparameters of the model and are adjusted to obtain the best learning performance.
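A minimal tabular implementation of such a PS agent, following the decision and update rules just described (cf. Eqs. (B2) and (B3)), might look as follows. This is a sketch with our own naming; the exact glow bookkeeping of [49,52] may differ in detail:

```python
import random

class PSAgent:
    """Minimal two-layer projective simulation agent: h-values weight
    state->action edges, and a glow matrix tracks traversed edges so
    that delayed rewards can propagate back to earlier decisions."""
    def __init__(self, n_states, actions=('continue', 'turn'),
                 gamma=1e-4, eta_g=0.1):
        self.actions = actions
        self.h = {(s, a): 1.0 for s in range(n_states) for a in actions}
        self.h0 = dict(self.h)                 # H_0: initial h-values
        self.g = {k: 0.0 for k in self.h}      # G: glow values, start at 0
        self.gamma, self.eta_g = gamma, eta_g

    def policy(self, s, a):
        # pi(a|s) = h(s,a) / sum_a' h(s,a')
        z = sum(self.h[(s, b)] for b in self.actions)
        return self.h[(s, a)] / z

    def act(self, s):
        u = random.random()
        for a in self.actions:
            u -= self.policy(s, a)
            if u < 0:
                return a
        return self.actions[-1]

    def update(self, s, a, reward):
        self.g[(s, a)] += 1.0                  # the traversed edge glows
        for k in self.h:
            # damp h toward h0, then add glow-weighted reward
            self.h[k] += -self.gamma * (self.h[k] - self.h0[k]) \
                         + self.g[k] * reward
            self.g[k] *= (1.0 - self.eta_g)    # damp the glow each RL step
```

Repeatedly rewarding one action in a given state raises its h-value relative to the alternative, so the policy concentrates on the rewarded action.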

Imitation learning of benchmarks
In this section, we study whether a PS agent is actually able to learn known foraging strategies within the proposed framework. To do so, we translate the original foraging problem into an imitation problem. Imitation learning is commonly used to simplify reinforcement learning problems with sparse rewards. Instead of learning in the original, complex scenario, the RL agent is trained to imitate an expert already equipped with the optimal policy. The expert is then considered to perform only successful trajectories, i.e., sequences of actions that lead to rewarded states. In this way, the reward sparsity of the original problem can be avoided and learning the optimal policy is greatly simplified. Nonetheless, the RL agent is still required to correctly update its policy; hence, imitation learning is a useful experiment for understanding whether the proposed framework is adequate for an agent that needs to learn optimal foraging strategies.
In what follows, we consider environments for which we assume that certain distributions Pr(L) are optimal. This means that the expert's policy is calculated by inserting Pr(L) into Eq. (5). In the imitation scheme, one RL step proceeds as follows:
1. a step of length L is sampled from Pr(L);
2. the counter n of the RL agent is set to N such that Nd = L;
3. the RL agent performs the turning action, hence having effectively made a step of exactly length L;
4. a reward R = 1 is given to the agent after the action (↱) in state (N);
5. the agent updates its policy via Eq. (B2) based on this reward.
In Fig. 7a-b we show the convergence of the policy of an RL agent trained to imitate an expert equipped with policies calculated from Lévy and bi-exponential distributions with multiple parameters. As shown, the agent's policy correctly converges to the expert's. This ensures that the proposed framework, together with the PS update rule, can adequately accommodate the most typical foraging strategies, if these were to be optimal in the given environment.
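The imitation scheme above can be sketched in a few lines. For brevity, this toy version (our own simplification) replaces the glow mechanism by directly reinforcing the continue edges traversed up to the sampled counter, which drives the ratio h(n, turn)/(h(n, turn) + h(n, up)) toward the expert policy of Eq. (5):

```python
import random

def imitate(expert_p, episodes=20000, rng=None):
    """Toy imitation loop: expert step lengths are sampled from the target
    distribution expert_p, with expert_p[N] = Pr(L = (N+1)d); the agent is
    rewarded for turning at counter N, and the continue ('up') edges it
    traversed to get there are reinforced as a crude stand-in for the glow.
    Returns the learned turning probabilities pi(turn|N)."""
    if rng is None:
        rng = random.Random(1)
    n_max = len(expert_p)
    h = {(n, a): 1.0 for n in range(n_max) for a in ('up', 'turn')}
    for _ in range(episodes):
        # steps 1-2: sample an expert step, set the counter accordingly
        N = rng.choices(range(n_max), weights=expert_p)[0]
        # steps 3-5: reward the turn at N; reinforce the traversed edges
        for n in range(N):
            h[(n, 'up')] += 1.0
        h[(N, 'turn')] += 1.0
    return [h[(n, 'turn')] / (h[(n, 'turn')] + h[(n, 'up')])
            for n in range(n_max)]
```

Since turn edges at N are reinforced in proportion to Pr(L = Nd) and up edges in proportion to Pr(L > Nd), the learned ratio converges to Pr(L = Nd)/Pr(L ≥ Nd), which is exactly the expert policy of Eq. (5).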

Details about the training and the model parameters of the agent
The RL learning process of the PS agent consists of N_ep ≃ 10^4 episodes, each of which contains T RL steps. The agent's memory has a two-layer structure of states and actions, as described in the previous section. Its learning mechanism is governed by Eq. (B2). There are T states, one for each possible value that the counter n can in principle take before the episode is over, and 2 actions: continue in the same direction or turn. At the beginning of each episode, the G matrix of the agent is reset and both the agent's position and the positions of the targets are randomly sampled. The agent keeps updating its policy until N_ep episodes are completed. In order to accurately assess the search efficiency at different stages of training (Fig. 3), we perform a post-training evaluation: we take the policy of a given episode, freeze it, and have the agent employ this policy to perform 10^4 walks consisting of 20000 small steps of length d = 1 in environments with different target distributions. We train 10 independent agents in each scenario (given by different cutoff lengths) and consider the best agent to be the one that achieves the highest average search efficiency in the post-training evaluation using the policy after the last episode.
The learning parameters γ and η_g were adjusted in order to achieve the best performance in each case. The values of all parameters for each scenario are given in Table I. Examples of trainings are presented in the accompanying repository [40].

Policy initialization and finite training times
In this section, we motivate our choice of policy initialization. In the literature, the policy of a PS agent is typically initialized such that π_0(↑|n) = π_0(↱|n) = 0.5 holds for all n. While π_0(↱|n) = 0.01 is used in the main text, in this appendix we first consider the π_0(↱|n) = 0.5 policy initialization instead. That is, the agent initially chooses randomly and uniformly between the two possible actions ↑ and ↱. In PS, this is enforced by setting all h-values in the initial H_0 matrix to h(s, a) = 1 for all s, a. As we see in Fig. 8a, the agent progressively learns to decrease π(↱|n), which in turn leads to longer steps and to the agent reaching larger values of n as the learning progresses. Nonetheless, reaching large n is still exponentially costly. Moreover, to properly train the policy at a given n, the agent needs to first reach such a value and then sample enough rewards. In Fig. 8b we show the frequency with which n is observed in different episodes. As expected, due to the update of the policy, the probability of reaching larger n increases as the training progresses. However, this training requires large computation times, and still the agent does not sample steps longer than L = 40.
As can be seen in Fig. 8a, even after 69000 episodes (yellow line), the policy exhibits a decay back to its initialization value for large n. The policy π(↱|n) at a given n depends not only on the rewards received at that n, but also on the rewards attained for n′ > n, due to the glow (Eq. (B3)). Thus, not reaching n′ > n also hampers the policy update of π(↱|n).
In fact, we find that p_↱ is even larger than p_↑ for the values of l_c plotted in the inset of Fig. 5a.

FIG. 1 .
FIG. 1. Schematic illustration of random-walk based foraging strategies.
FIG. 2. The problem of non-destructive foraging formulated within the framework of RL. An agent moves through an environment with randomly distributed targets and, at each step, chooses between two possible actions: continue in the same direction (↑) or turn (↱) in a random direction. The state perceived by the agent is a counter n, which is the number of small steps of length d that compose the current step of length L. Whenever the agent detects a target, it receives a reward R and resumes its walk at a distance l_c from the detected target, with the counter reset.

FIG. 3 .
FIG. 3. Learning curves and the advantage of learned policies over benchmarks. (a) The search efficiency (averaged over 10 agents, displayed with one standard deviation) is shown over the course of learning (measured in training episodes). Different colors correspond to environments with different cutoff lengths l_c. Efficiencies are normalized by the respective best benchmark efficiency, which turns out to correspond to bi-exponential distributions in all cases. Dashed lines show the efficiency of the best Lévy walk for each case. (b) Comparison between the best agent's search efficiency at the end of learning and that of the best benchmarks, for each environment. The efficiency of the best Lévy walk for l_c = 1 is η_Lévy/η_bi-exp = 0.88. For each agent and benchmark model, the efficiency is averaged over 2×10^8 RL steps. In panel (b), the standard error of the mean for the benchmark models and the best agents is depicted but too small to be visible.

FIG. 5 .
FIG. 5. Analysis of the learned policies for different cutoff lengths. (a) Learned policies π(↱|n) as a function of the counter n. Each point is the average over 10 agents and the shaded area represents one standard deviation. The inset shows, on the left axis, the turning probability at n = l_c averaged over 10 agents; error bars represent one standard deviation. On the right axis, the probability p_↱ of hitting the target when turning at n = l_c is shown (see Appendix C). (b) Step-length distributions corresponding to the policies presented in (a). Each point is the median over 10 agents.

FIG. 7 .
FIG. 7. Comparison of benchmark policies (solid lines) with the policies (colored dots) of a PS agent using imitation learning. Panel (a) shows Lévy distributions with varying exponent β, and panel (b) shows bi-exponential distributions with varying d_1, w_1 = 0.94 and d_2 = 5000 (see Appendix A for details).

FIG. 9
FIG. 9. Schematic illustration of the benefit of turning after nd ≃ l_c steps in a simplified scenario with only a single target, the center of which is represented by a black dot and which has a detection radius r. After the detection of the target, the agent is displaced by a distance l_c from the target (black line). Then, we assume that it continues to walk straight for nd ≃ l_c at a random angle (orange line). In this case, the blue area marks the angle range within which the agent returns to the target when walking straight. The yellow area indicates the angle range at which the target is missed but is still within reach (after one step) if the agent turns by a suitable angle (red area).

TABLE I .
Training parameters used to obtain the data presented in the respective figures.