Predator–prey survival pressure is sufficient to evolve swarming behaviors

Understanding how local interactions give rise to global collective behavior is of central importance in both biological and physical research. Traditional agent-based models often rely on static rules that fail to capture the dynamic strategies of the biological world. Reinforcement learning (RL) has been proposed as a solution, but most previous methods adopt handcrafted reward functions that implicitly or explicitly encourage the emergence of swarming behaviors. In this study, we propose a minimal predator–prey coevolution framework based on mixed cooperative–competitive multiagent RL, and adopt a reward function that is based solely on the fundamental survival pressure: prey receive a reward of −1 if caught by predators, while predators receive a reward of +1. Our analysis of this approach reveals an unexpectedly rich diversity of emergent behaviors for both prey and predators, including flocking and swirling behaviors for prey, as well as dispersion tactics, confusion, and marginal predation phenomena for predators. Overall, our study provides novel insights into the collective behavior of organisms and highlights potential applications in swarm robotics.


I. INTRODUCTION
Swarming behaviors in nature, such as starling flocks, fish schools, and sheep herds [1], have been studied across various research fields, including biology [2], physics [3], and robotics [4]. Various agent-based models have been proposed to understand these collective behaviors. Most models are built on designed static rules, including velocity alignment rules [3], a balance of social forces such as attraction and repulsion [5,6], and vision-based movement decisions [7–10]. Although these models can phenomenologically reproduce swarming behaviors, including complex ring-shaped and swirling patterns [6,11,12], their interaction rules are mostly heuristic and static, and thus fail to capture the adaptive nature of the biological world. To address this issue, recent studies have used evolution-based methods, including neuroevolution [13] and reinforcement learning (RL) methods [14–17], because they offer the potential for adaptable strategies, analogous to the way biological organisms evolve [18–20]. In addition, multi-agent RL makes it possible to model the interactions among individual agents in a swarm and to optimize their collective behaviors [14–17].
However, the interaction mechanisms or reward functions in evolution-based methods [13–17] are task-specific [21,22], meaning that they are handcrafted by designers to fit characteristics implicitly or explicitly associated with swarming behaviors; we refer to such designs as swarm-dependent in this study. For example, the reward function in [16] penalizes losing neighbors: if an agent loses a neighbor, it receives a penalty of −1. The recent work in [17] assumes that prey receive a larger reward as the area of their domain of danger [23] decreases, thus explicitly encouraging prey agents to move closer to each other. The authors in [13] implicitly assume that the attack efficiency is inversely proportional to the number of prey visible to the predator due to confusion. As a result, the number of prey in low-density areas decreases faster than that in high-density areas, resulting in the anticipated clustering phenomenon. A similar confusion mechanism is used in [15].
In this study, we propose a minimal predator-prey coevolution framework based on mixed cooperative-competitive multiagent reinforcement learning, where the reward function is based solely on the motivation to survive for both predators and prey. Specifically, prey receive a reward of −1 if caught by predators, while predators receive a reward of +1. This reward function has no relation to objectives such as increasing neighbors, decreasing sparsity, enhancing alignment, or promoting other characteristics directly associated with swarming behaviors, and is thus swarm-independent. Surprisingly, under the proposed framework, we find that this simple survival pressure is sufficient to evolve flocking and swirling behaviors for prey. Quantitatively, we observe a steady increase in swarming density and group polarization. We also observe an emergent dispersion tactic, confusion, and marginal predation phenomena for predators.

II. MODELING
Environment. We established a physics-based simulation environment where predators and prey interact with each other. The environment is a two-dimensional continuous space with two kinds of boundary conditions. As we will demonstrate in section IV, distinct boundary conditions encourage flocking or swirling behaviors. The first kind is a finite square area, commonly used in previous studies such as [24,25]. In this space, agents are unable to cross the boundaries, which are simulated as walls with a specified contact stiffness. The second kind is a periodic boundary condition, where an agent that passes through one side of the square environment reappears on the opposite side with the same velocity. This treatment wraps the edges of the environment into a torus, thus enabling us to approximate a large or infinite space. The periodic boundary condition is widely used in molecular dynamics simulation [26] and swarm modeling [3,16].
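As a minimal sketch of the periodic boundary condition, the wrap-around can be written per coordinate axis. The function names and the shortest-displacement helper are illustrative assumptions, not part of the original simulator.

```python
def wrap_coordinate(x, x_min, x_max):
    """Map a coordinate that left [x_min, x_max) back onto the periodic axis."""
    length = x_max - x_min
    return x_min + (x - x_min) % length


def periodic_displacement(a, b, length):
    """Shortest signed displacement from a to b on a periodic axis (assumed helper)."""
    d = (b - a) % length
    return d - length if d > length / 2 else d


# An agent that overshoots the right edge re-enters from the left edge
# by the same overshoot distance.
x_new = wrap_coordinate(1.2, -1.0, 1.0)
```

With this convention, two agents on a periodic square of edge length 2 are never farther apart than 1 along each axis.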
Agent Dynamics. An agent, namely a predator or prey, is represented by a circle with a short line segment representing its heading, as shown in Fig. 1(a), where the unit vector $h \in \mathbb{R}^2$, $\|h\| = 1$, is the heading and $v \in \mathbb{R}^2$ is the velocity. The agents are subject to both active and passive forces.
The active force is self-generated propulsion to drive ego-motion, and consists of two components as shown in Fig. 1(a). The first component is the action to move forward along the heading, computed as $a_F h \in \mathbb{R}^2$ where $a_F \in \mathbb{R}$. The second component is the action to rotate the heading, denoted as $a_R \in \mathbb{R}$ and bounded by a threshold value.
The passive forces include the drag force $f_d \in \mathbb{R}^2$, the elastic force between agents in contact $f_a \in \mathbb{R}^2$ as shown in Fig. 1(b), and the elastic force between agents and boundaries $f_b \in \mathbb{R}^2$ as shown in Fig. 1(c). The drag force is simply assumed to act in the direction opposite to the velocity $v$, with its magnitude proportional to $\|v\|$. The elastic forces $f_a$ and $f_b$ follow Hooke's law, and sum up when an agent contacts multiple other agents or boundaries as $f_a = \sum_j f_{a,j}$ and $f_b = \sum_j f_{b,j}$. The elastic forces prevent agents from overlapping and reflect physical collision dynamics. It is remarked that the velocity may not be aligned with the heading direction when a collision happens. In the simulation, the drag coefficient is set as 2 N·s/m and the contact stiffness coefficient is set as 50 N/m.
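Combining the aforementioned active and passive forces, the dynamics of an agent of species $i$ can be summarized as
$$\dot{x} = v, \qquad \dot{v} = \frac{a_F h + f_d + f_a + f_b}{m_i}, \qquad \dot{\theta} = a_R,$$
where $x \in \mathbb{R}^2$ is the position, $m_i \in \mathbb{R}^+$ is the mass, $\theta \in (-\pi, \pi] \subset \mathbb{R}$ is the heading angle, and $h = [\cos\theta, \sin\theta]^T$.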
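The following is a minimal sketch of one integration step for a single agent. It assumes the dynamics take the form $\dot{x} = v$, $m_i \dot{v} = a_F h + f_d + f_a + f_b$, $\dot{\theta} = a_R$, with a linear drag $f_d = -c\,v$. The time step `dt`, the semi-implicit Euler scheme, and all function names are assumptions made for illustration; only the drag coefficient and contact stiffness values come from the text.

```python
import math

DRAG_COEFF = 2.0          # N*s/m, value quoted in the text
CONTACT_STIFFNESS = 50.0  # N/m, value quoted in the text


def contact_force(pos_i, pos_j, radius_i, radius_j, k=CONTACT_STIFFNESS):
    """Hooke's-law repulsion on agent i from an overlapping agent j."""
    dx, dy = pos_i[0] - pos_j[0], pos_i[1] - pos_j[1]
    dist = math.hypot(dx, dy)
    overlap = radius_i + radius_j - dist
    if overlap <= 0.0 or dist == 0.0:
        return (0.0, 0.0)
    return (k * overlap * dx / dist, k * overlap * dy / dist)


def step_agent(x, v, theta, a_f, a_r, mass, f_contact=(0.0, 0.0), dt=0.1):
    """Advance position, velocity and heading by one step (assumed Euler scheme)."""
    h = (math.cos(theta), math.sin(theta))
    # Net force: forward propulsion + linear drag + summed elastic forces.
    fx = a_f * h[0] - DRAG_COEFF * v[0] + f_contact[0]
    fy = a_f * h[1] - DRAG_COEFF * v[1] + f_contact[1]
    v_new = (v[0] + dt * fx / mass, v[1] + dt * fy / mass)
    x_new = (x[0] + dt * v_new[0], x[1] + dt * v_new[1])
    theta_raw = theta + dt * a_r
    # Keep the heading angle within (-pi, pi].
    theta_new = math.atan2(math.sin(theta_raw), math.cos(theta_raw))
    return x_new, v_new, theta_new
```

Because the contact force enters the velocity update independently of the heading, the velocity can deviate from the heading direction after a collision, as remarked above.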
Agent Observation. The observation model of an agent is assumed to depend on both metric and topological distance. Metric dependence means that an agent can only perceive others within its perception range [13,14], which is assumed to be a disk with a pre-defined radius $R \in \mathbb{R}^+$ for simplicity. Topological dependence refers to how many agents, at most, an agent can perceive concurrently, rather than how far away they are. The threshold is set as 6, which means that an agent can perceive at most six allies and six adversaries. This setting is inspired by the work in [27], which found that each bird interacts on average with six neighbors, and is further confirmed in the recent work [12]. The information contained in an observation vector includes the relative positions and headings of observed agents, for both allies and adversaries, and the agent's own position, velocity, and heading. If the number of agents in the perception range exceeds the topological threshold, the farthest ones are removed; if it does not reach the threshold, the remaining part of the observation vector is masked out with zeros. This topological modeling simplifies the structure of the neural networks by fixing the input dimension.
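The metric and topological filtering described above can be sketched as follows for one group of observed agents (allies or adversaries). The per-agent encoding as two relative coordinates plus one heading angle, and the function name, are assumptions for illustration; the original encoding may differ.

```python
import math

MAX_NEIGHBORS = 6  # topological limit quoted in the text


def encode_group(own_pos, others, radius, k=MAX_NEIGHBORS):
    """Return a fixed-length feature list for up to k nearest agents in range.

    others: list of ((x, y), heading_angle) tuples (assumed representation).
    """
    visible = []
    for (ox, oy), heading in others:
        dx, dy = ox - own_pos[0], oy - own_pos[1]
        dist = math.hypot(dx, dy)
        if dist <= radius:  # metric filter: inside the perception disk
            visible.append((dist, dx, dy, heading))
    visible.sort(key=lambda item: item[0])  # nearest first
    features = []
    for _, dx, dy, heading in visible[:k]:  # topological filter: drop the farthest
        features.extend([dx, dy, heading])
    features.extend([0.0] * (3 * k - len(features)))  # zero-mask unused slots
    return features
```

A full observation would concatenate the agent's own state with one such block for allies and one for adversaries, giving a fixed input dimension regardless of how many agents are nearby.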

III. FRAMEWORK OF COEVOLUTION
The optimization regime is set as a mixed cooperative-competitive multiagent reinforcement learning framework, where predators and prey learn and adapt their behaviors concurrently, analogous to coevolution in nature.

Homogeneity. Agents of the same species in a swarm are assumed to be homogeneous. This is reasonable when each agent has the same capability, responsibility, and goal as the others. More importantly, agents in a swarm typically display similar behavior patterns following the same interaction rule [7,8,28].
Homogeneity has been widely adopted in the modeling of swarm systems. For example, works such as those in [3,5,7–12] propose a single control law that governs all the agents in a swarm, while those in [14–17] employ parameter-sharing techniques for the agents' neural networks. As a result, homogeneity leads to two features of the proposed training regime: parameter sharing of actor-critic networks and replay buffer sharing among conspecifics, which are described as follows.
Actor-Critic. Conspecifics share one critic and one actor network. The critic network is used to evaluate the quality of an action taken by the agent by estimating the expected future reward from that action, while the actor determines the agent's action $a = [a_F, a_R]^T$ based on its current observation. Together, the critic and actor work to improve the agent's behavior over time by continuously refining the estimates of expected future rewards and adjusting the policy accordingly.
The critic is designed to be decentralized, allowing agents to evaluate actions based solely on local observations, without knowledge of global states and actions as required by the centralized critics used in [25,29]. This is analogous to the situation where living organisms can only perceive their nearby surroundings with limited abilities. Additionally, a decentralized critic provides better scalability as the number of agents in the system increases, making it particularly beneficial for swarm systems.
The actor is shared among conspecifics but not with their adversaries. Policy sharing has been widely used in traditional agent-based models [3,5,7–12] and in learning-based methods [14–17]. An advantage of policy sharing is that the trained policy can be deployed across any number of agents of the same kind, making it well suited for swarming research. It is remarked that although the parameters of the policy are shared, the actions output by the policy differ from agent to agent because the agents have different observations.
Replay Buffer. When an agent takes an action in a given state, it receives a reward and transitions to a next state. The replay buffer is used to store such transitions of the agent interacting with the environment in the form of tuples: state, action, reward, and next state. The replay buffer serves to reduce correlations between consecutive experiences by exposing the agent to a more diverse set of experiences, thus helping to avoid over-fitting and stabilize the training process.
Two replay buffers $B_0$ and $B_1$ are used for predators and prey, respectively. There is no need to construct a replay buffer for each individual because homogeneity allows the experiences collected by conspecifics to be used interchangeably. Therefore, experiences collected from multiple conspecifics can be pooled into a single replay buffer, resulting in more resilient and effective learning outcomes.
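The species-level buffer sharing can be sketched as below. The class name, the capacity, and the use of uniform sampling are assumptions for illustration; the text only specifies that conspecifics pool their transitions into one buffer per species.

```python
import random
from collections import deque


class SpeciesReplayBuffer:
    """One buffer per species; every conspecific writes into the same storage."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)  # oldest transitions are dropped first

    def add_step(self, observations, actions, rewards, next_observations):
        """Store one transition per agent of the species for the current time step."""
        for transition in zip(observations, actions, rewards, next_observations):
            self.storage.append(transition)

    def sample(self, batch_size):
        """Draw a uniform random mini-batch of pooled transitions."""
        return random.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


# Hypothetical usage: B_0 for predators, B_1 for prey.
buffers = {0: SpeciesReplayBuffer(capacity=10000), 1: SpeciesReplayBuffer(capacity=10000)}
```

Since the stored tuples carry no agent identity, a sampled mini-batch mixes experiences from all conspecifics, which is what the shared actor and critic are trained on.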
Rewards. The reward for prey is simply set as $r = -1$ if it is caught by a predator, where a catch is represented by contact between the two agents. Compared to the swarm-dependent interaction mechanisms or reward functions in [13–17], the proposed reward function solely emphasizes the potential danger of remaining in close proximity to predators, and is thus swarm-independent. Similarly, the reward for a predator is $r = +1$ if it catches prey. It is remarked that prey agents are not removed from the simulation after being caught by predators. The contact between predators and prey can be likened to a continuous process in which predators extract energy from the prey while engaged in "eating" them. Upon separation, the prey's survival reward returns to zero, signifying the cessation of energy transfer, analogous to the termination of "bleeding". A secondary reward that mimics energy consumption due to movement is added to the essential survival reward, and is simply set as $-0.01|a_F| - 0.1|a_R|$. This term penalizes unnecessary movement, which causes the agents to exhibit laziness.
In the special case when boundaries exist in the environment, an additional penalty of −0.1 is added to the reward function when contact between agents and boundaries happens. This setting is designed to simulate either the presence of danger in the outside world or the situation where an agent chooses not to leave a specific location, such as a food-rich coral reef.
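The reward terms described above can be sketched as a pair of per-step functions. The function names and the boolean contact flags are assumptions; the numerical coefficients are those quoted in the text.

```python
def movement_cost(a_f, a_r):
    """Energy-like penalty on the forward and rotational actions."""
    return 0.01 * abs(a_f) + 0.1 * abs(a_r)


def prey_reward(in_contact_with_predator, a_f, a_r, touching_wall=False):
    """Per-step prey reward: -1 while caught, minus movement and wall penalties."""
    reward = -1.0 if in_contact_with_predator else 0.0
    reward -= movement_cost(a_f, a_r)
    if touching_wall:  # only applies in the bounded environment with penalty
        reward -= 0.1
    return reward


def predator_reward(in_contact_with_prey, a_f, a_r, touching_wall=False):
    """Per-step predator reward: +1 while catching, minus movement and wall penalties."""
    reward = 1.0 if in_contact_with_prey else 0.0
    reward -= movement_cost(a_f, a_r)
    if touching_wall:
        reward -= 0.1
    return reward
```

None of these terms refers to neighbors, density, or alignment, which is the sense in which the reward is swarm-independent.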
Algorithm. The algorithm primarily adopts the multiagent deep deterministic policy gradient method proposed in [25,30]. However, certain modifications have been made to the original method, which are summarized as follows: 1) to account for the limited perception capabilities of agents, a decentralized critic is employed in lieu of a centralized one; 2) conspecifics share one actor network, which is not shared with their adversaries; 3) experiences collected by conspecifics are merged into a single replay buffer for more effective learning.
We designate the predators and prey as species 0 and 1, respectively. The dimensions of the observation and action vectors are denoted as $d_o$ and $d_a$, respectively. We denote the discount factor, which determines the weight given to future rewards, as $\gamma$, and the soft update rate, which determines the speed at which a target network is updated towards a learning network, as $\tau$. The proposed algorithm is summarized in Algorithm 1.

IV. RESULTS AND ANALYSIS
Simulation Setup. In the training phase, we instantiate $n_0 = 3$ predators and $n_1 = 10$ prey in the environment. This population size was chosen based on an assessment of hardware performance, as having more agents in the environment requires longer computation time. Nevertheless, the population size can be arbitrary because the policy is decentralized. In the evaluation phase, we deploy the trained policy on 50 prey to yield a more distinct visual effect of swarming behaviors, whilst maintaining the same number of predators.

Algorithm 1
  // i = 0 for predators, i = 1 for prey
  for species i = 0 to 1 do
    Randomly initialize actor $\mu_i$ parametrized by $\theta^{\mu}_i$ and critic $Q_i$ parametrized by $\theta^{Q}_i$;
    Initialize target actor $\mu'_i$ and target critic $Q'_i$;
    Initialize replay buffer $B_i$;
  end
  for each episode do
    Randomly spawn $n_0$ predators and $n_1$ prey; receive observations $o_i \in \mathbb{R}^{n_i \times d_o}$;
    for t = 1 to max-episode-length do
      For all agents in species i, select actions $a_i = \mu_i(o_i) + N_t \in \mathbb{R}^{n_i \times d_a}$, where $N_t$ is Gaussian noise;
      Execute actions $a_i$, receive rewards $r_i \in \mathbb{R}^{n_i}$ and new observations $o'_i$;
      Store $(o_i, a_i, r_i, o'_i)$ in replay buffer $B_i$;
      Sample a random mini-batch from $B_i$;
      Update critics by minimizing the loss;
      Update actors using sampled policy gradient;
      Soft-update target networks with rate $\tau$;
    end
  end
In order to provide a quantitative assessment of swarming behaviors, we introduce two measures: degree of sparsity (DoS) and degree of alignment (DoA). The DoS $\in [0, 1] \subset \mathbb{R}$ is defined as the average normalized distance to the nearest neighbor over all conspecifics in an episode:
$$\mathrm{DoS} = \frac{1}{T N D} \sum_{t=1}^{T} \sum_{j=1}^{N} \|x_j(t) - x_k(t)\|,$$
where $x_j(t)$ is the position of the $j$-th agent at time step $t$, $k = \arg\min_k \|x_j(t) - x_k(t)\|$ with $k \in \{1, 2, \ldots, N\} \setminus j$ is the index of its nearest neighbor, $T \in \mathbb{N}^+$ is the episode length, $N \in \mathbb{N}^+$ is the total number of agents, which is equal to $n_1$ for prey, and $D \in \mathbb{R}^+$ is the environment size, defined as the maximum possible distance between two agents. For example, for a periodic square environment with edge length 2, the largest possible distance is $\sqrt{2}$. The symbol $\|\cdot\| : \mathbb{R}^2 \to \mathbb{R}$ denotes the Euclidean norm. A smaller DoS indicates a denser swarm. An extreme case is when all conspecifics aggregate at the same point, resulting in zero DoS. The DoA $\in [0, 1] \subset \mathbb{R}$ is defined as
$$\mathrm{DoA} = \frac{1}{2 T N} \sum_{t=1}^{T} \sum_{j=1}^{N} \|h_j(t) + h_k(t)\|,$$
where $h_j \in \mathbb{R}^2$ is the heading of the $j$-th agent and $k$ is the same as in the definition of DoS. It is worth remarking that DoA is not equivalent to the mean heading of all conspecifics. Although headings are similar within the same flock, resulting in a high DoA, multiple flocks with different headings may cancel each other out in the mean heading, as shown in Fig. 2(b) and Fig. 8(b). Therefore, calculating the relative quantity within a local neighborhood is more appropriate. A similar conclusion can be drawn for DoS.
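As a sketch of how the two measures could be computed from a recorded episode, the function below assumes that DoS averages the nearest-neighbor distance divided by $D$, and that DoA averages $\|h_j + h_k\|/2$ over each agent $j$ and its nearest neighbor $k$. It uses plain Euclidean distance; in a periodic environment the shortest wrapped distance would be needed instead. The data layout and function name are assumptions.

```python
import math


def sparsity_and_alignment(positions, headings, env_size):
    """Return (DoS, DoA) for one episode.

    positions[t][j] and headings[t][j] are (x, y) tuples for agent j at step t;
    headings are assumed to be unit vectors.
    """
    total_distance = 0.0
    total_alignment = 0.0
    count = 0
    for pos_t, head_t in zip(positions, headings):
        for j, x_j in enumerate(pos_t):
            # Index of the nearest conspecific, excluding the agent itself.
            k = min((i for i in range(len(pos_t)) if i != j),
                    key=lambda i: math.dist(x_j, pos_t[i]))
            total_distance += math.dist(x_j, pos_t[k])
            sum_x = head_t[j][0] + head_t[k][0]
            sum_y = head_t[j][1] + head_t[k][1]
            total_alignment += math.hypot(sum_x, sum_y) / 2.0
            count += 1
    return total_distance / (count * env_size), total_alignment / count
```

Two agents with identical headings contribute an alignment of 1, while two agents with opposite headings contribute 0.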
Emergent Flocking Behavior. Typical scenarios before and after evolution are shown in Fig. 2(a) and Fig. 2(b), respectively. The predators are depicted in orange, while the prey are depicted in blue and at a smaller size. Prior to evolution, the prey move randomly in different directions due to the randomly initialized policy. Similarly, predators exhibit purposeless movements without the intention to pursue prey. The agents' behavior changes substantially after 2000 episodes of coevolution, as depicted in Fig. 2(b). Notably, the prey exhibit cohesive movement patterns and a high degree of alignment within multiple flocks, adhering to the well-known basic flocking rules outlined in [31].
The episodic evolution of the running averages of DoS and DoA is shown in Fig. 3, where the running-average window is 100 episodes and the shaded area indicates the 95% confidence interval. Specifically, the DoS drops steadily from 22% of the environment size to around 19%, suggesting a more cohesive movement of prey. Meanwhile, the DoA increases from 0.65 to approximately 0.82, indicating that any two prey in the vicinity of each other exhibit a higher degree of alignment. It is remarked that the initial value of DoA is about 0.65. This is because, for a uniformly distributed heading angle ranging from $-\pi$ to $\pi$, the expected DoA is given by $E[\cos(\phi/2)] = 2/\pi \approx 0.64$, where $\phi$ denotes the angle formed by two headings. This expected value is quite close to the value read from Fig. 3.
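The expected initial value can be checked numerically. The sketch below assumes that the angle $\phi$ between two headings is uniformly distributed on $(-\pi, \pi)$ and evaluates $\frac{1}{2\pi}\int_{-\pi}^{\pi} \cos(\phi/2)\, d\phi$ with a midpoint rule; the function name and the number of intervals are illustrative choices.

```python
import math


def expected_alignment_uniform(num_intervals=100000):
    """Midpoint-rule estimate of E[cos(phi / 2)] for phi uniform on (-pi, pi)."""
    step = 2.0 * math.pi / num_intervals
    total = 0.0
    for i in range(num_intervals):
        phi = -math.pi + (i + 0.5) * step
        total += math.cos(phi / 2.0)
    # Divide by the interval length 2*pi to turn the integral into an expectation.
    return total * step / (2.0 * math.pi)


estimate = expected_alignment_uniform()
closed_form = 2.0 / math.pi
```

Both the numerical estimate and the closed form $2/\pi$ round to 0.64, slightly below the value of about 0.65 read from the figure.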
The DoS and DoA shown in Fig. 3 are mean values computed over the entire episode, as indicated by their definitions.
It is possible to question whether the observed alignment and distance reduction between neighboring prey is a result of swarming or merely a consequence of prey fleeing from predators in the same direction, which is commonly referred to as herding. To differentiate between these two phenomena, we conducted additional simulations where only trained prey agents are present. A typical scenario is illustrated in Fig. 5(a), where we observe that the prey still exhibit a high degree of alignment even in the absence of predators, which is comparable to the well-known Vicsek model [3]. This finding supports the hypothesis that the swarming behavior, which persists in the absence of predators, is a strategy employed by prey to avoid predators. These findings are noteworthy because they reveal how simple predator-prey interaction, driven solely by the motivation to survive, can give rise to conspicuous flocking behavior that exhibits both cohesion and alignment characteristics. This leads us to hypothesize that the emergent flocking behavior is largely an outcome of passive space extrusion and polarization induced by predators.
Emergent Confusion Effect, Dispersion Tactic and Edge Effect. During the pursuit, predators exhibit confusion in certain situations, as illustrated in Fig. 6(a) and (b) at time steps 30 and 33 of an episode. When the prey agent merges into a flock, the predator gives up the chase, slows down, and stagnates for a while, as evidenced by a shorter path, appearing confused and uncertain about which prey to focus on.
Additionally, as shown in Fig. 7(a), predators often employ a dispersion tactic, in which they first move towards the center of a swarm to disperse it and then focus on capturing isolated individuals [32,33]. The resulting isolation, or the phenomenon that predators frequently catch prey on the periphery of a swarm, is referred to as marginal predation or the edge effect [34,35], as shown in Fig. 7(b). The observed marginal predation behavior suggests that predators may be less confused when targeting prey on the periphery of a swarm, potentially due to a smaller number of prey resulting in an elevated predation rate [33,36], or a higher encounter rate [34]. Overall, the emergent confusion effect and marginal predation behavior highlight the challenges predators face when trying to select the optimal target from a group of prey.
Emergent Swirling Behavior. We now examine the scenario in which the environment is confined, and consider two cases: without and with penalties when prey collide with boundaries. It is remarked that the extra boundary penalty complicates the environment, thus slowing down the learning process. To accelerate the evolution of prey, we endow the predators with a behavioral rule that creates survival pressure by directly moving towards their nearest prey. This rule controls the two active forces: first, the predator rotates its heading to point towards the nearest prey, and then it moves directly towards the target at maximum speed. By following this rule, the predators exert survival pressure on the prey and drive their evolutionary adaptation.
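A minimal sketch of such a rule-based predator is given below. It assumes that the rotation action is an angular velocity clipped to a maximum turn rate and that full forward thrust is applied at the same time as turning, which simplifies the "rotate first, then move" description. The function name, the time step `dt`, and the parameter names are assumptions.

```python
import math


def pursuit_action(pred_pos, pred_theta, prey_positions, max_forward, max_turn, dt=0.1):
    """Return (a_F, a_R) steering the predator towards its nearest prey."""
    target = min(prey_positions, key=lambda p: math.dist(pred_pos, p))
    desired = math.atan2(target[1] - pred_pos[1], target[0] - pred_pos[0])
    # Signed heading error wrapped to (-pi, pi].
    error = math.atan2(math.sin(desired - pred_theta), math.cos(desired - pred_theta))
    # Turn as fast as needed to close the error in one step, within the turn limit.
    a_r = max(-max_turn, min(max_turn, error / dt))
    a_f = max_forward  # assumed: always push forward at the maximum allowed value
    return a_f, a_r
```

A stricter variant could set `a_f` to zero until the heading error falls below a tolerance, which would match the two-stage description more literally.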
Without the penalty, there are no significant differences in flocking behaviors compared to the infinite space case, except for boundary aggregation, as in Fig. 8(a), similar to fish swimming in a fishbowl or pond. With the penalty, an enigmatic swirling behavior emerges, as in Fig. 8(b). Such circular motion has been observed in fish and insects, yet it remains poorly understood in the scientific community [37,38]. We hypothesize that the extra penalty acts as a disincentive for prey to leave a food-rich location, such as a coral reef, or as a response to perceived threats in the outside world. Swirling may be an optimal tactic for prey to remain in the same place while simultaneously evading potential predators.
Effect of Speed Limit Ratio. In the preceding simulations, we assumed a maximum speed ratio $\|v_0\|_{\max} : \|v_1\|_{\max} = 1 : 1$ between predators and prey. This assumption is reasonable when predator and prey species have similar maximum speeds. Here, we investigate the emergent phenomena when their maximum speeds differ markedly.
As shown in Fig. 9, when the speed limit ratio is adjusted to 5 : 3, we observe more pronounced swarming characteristics, evidenced by a smaller DoS dropping from 19% to 18% and a larger DoA rising from 0.82 to around 0.85. Further, tuning the speed limit ratio to 3 : 5 results in a higher convergence rate of DoS and a lower value of around 17%, which we attribute to the prey's athletic ability to evade predators while still maintaining formation.
Effect of Perception Range. In the other simulations, predators and prey are assumed to have a perception range equal to the environment size, that is, $R = D$, which is reasonable when organisms have a perception range much larger than their body length. Here, we intentionally tune the perception range to $R = \tfrac{2}{3}D$ and $R = \tfrac{1}{3}D$ to investigate its effect on flocking behaviors, as shown in Fig. 10.
Smaller perception ranges lead to less pronounced flocking behavior, as indicated by a larger DoS and smaller DoA.This effect is especially notable when the perception range is only one-third of the environment size, where DoS increases from 19% to around 20.5%, and DoA drops from 0.82 to 0.75.These findings suggest that perception range plays a crucial role in facilitating the emergence of flocking behaviors.

Effect of Number of Predators. In the previous simulations, the number of predators present in training is set as $n_0 = 3$. Here, we specifically investigate the special cases where the number of predators during evolution is $n_0 = 0$ and $n_0 = 1$. The episodic evolutions of DoS and DoA are shown in Fig. 11.
Comparing the cases of a single predator and three predators, we observe that the emergence of swarming behaviors is slightly slower with a single predator. This can be explained by the greater survival pressure exerted by multiple predators, which accelerates the prey's evolution. In the special case without predators in the evolution, it can be seen that the DoS and DoA of the prey remain unchanged, indicating that no swarming phenomenon emerges, as exemplified in Fig. 5(b). This evidence, from another perspective, further corroborates our hypothesis that predator-prey survival pressure is sufficient to promote swarming behaviors.

V. CONCLUSION
In this article, a minimal predator-prey coevolution framework based on mixed cooperative-competitive multiagent RL is proposed, with a swarm-independent reward function based solely on the motivation to survive. We have observed emergent flocking and swirling behaviors for prey, and a dispersion tactic, confusion, and marginal predation phenomena for predators. Based on these findings, we hypothesize that swarming behaviors can be largely an outcome of passive space extrusion and polarization induced by predators. At the same time, predators may face challenges when trying to select their optimal target from a group of prey. While the proposed framework may not perfectly match the real evolution mechanism in nature and may not depict the evolution process of all kinds of swarming organisms, it provides a feasible approach for swarming research without the need for handcrafted reward function design. With its design philosophy of minimalism, the proposed framework could serve as a starting point for further experimentation and customization. For instance, one could perform a quantitative analysis of the effect of observation noise, or introduce a third-party species to study emergent behaviors. In the field of robotics, collision penalties can be further incorporated into reward functions to realize collision avoidance simultaneously, enabling robots to exhibit swarming behaviors in a more natural manner without complex handcrafted interaction rules. Overall, the proposed framework and findings could contribute to a better understanding of the swarm intelligence of organisms and physical active matter, and have potential applications in swarm robotics.
Under the periodic boundary condition, the position of an agent that moves beyond an edge is further updated such that it reappears on the opposite edge, as if the edges were connected. For example, if the agent has moved beyond the maximum x-coordinate, we set its x-coordinate to the minimum x-coordinate plus the distance it has moved beyond the maximum x-coordinate.
The agent's actions $a_F$ and $a_R$ are the outputs of the actor neural network, rescaled to fit within the ranges specified in Table I. Specifically, $a_F$ ranges from zero to its maximum linear acceleration, while $a_R$ ranges from negative maximum angular velocity to positive maximum angular velocity. To calculate $f_a$, we first determine whether any two agents collide based on their sizes and distances. If a collision occurs, we apply Hooke's law to calculate the elastic forces resulting from the deformation, and sum them up as $f_a = \sum_j f_{a,j}$ if multiple collisions happen. The procedure for calculating $f_b$ is the same.
The observation vector for an agent has the following form:
  [ agent's own position, velocity, and heading;
    relative positions and headings of observed predators;
    relative positions and headings of observed prey ]
The relative positions and headings are ordered from the nearest to the farthest based on range. This is reasonable because conspecifics are considered homogeneous, as explained in section III. If the number of observed agents exceeds the topological limit, the farthest ones are removed; if it does not reach the limit, the remaining part of the observation vector is masked out with zeros.
Both the critic and the actor are encoded by deep feed-forward neural networks with rectified linear unit (ReLU) activation, with an input dimension $d_o$ equal to the length of the observation vector. Each network consists of three hidden layers with 64 neurons per layer, as shown in Fig. 12, where $d_a = 2$ is the output dimension of the actor network and $a_F$, $a_R$ are the output actions.
FIG. 12: Illustration of critic and actor neural networks.

TABLE I: Environment parameters

For a detailed introduction to RL and multi-agent RL, we refer to [18,39] and [40], respectively. For an introduction to the multiagent deep deterministic policy gradient algorithm, we refer to [25].
In the training, we first randomly initialize the actor network $\mu_i$ parameterized by $\theta^{\mu}_i$ and the critic network $Q_i$ parameterized by $\theta^{Q}_i$ for species $i$. In this study, predators and prey are denoted as species $i = 0$ and $1$, respectively. The corresponding target actor $\mu'_i$ and target critic $Q'_i$ are also initialized to help mitigate the issue of non-stationarity by slowly updating the value functions as the policy improves. At the beginning of each episode, $n_0 \in \mathbb{N}^+$ predators and $n_1 \in \mathbb{N}^+$ prey are placed in the environment at random positions with random headings. Based on the current observations $o_i \in \mathbb{R}^{n_i \times d_o}$, where $d_o$ is the dimension of the observation vector, the agents perform actions $a_i \in \mathbb{R}^{n_i \times d_a}$ according to the current actors, where $d_a$ is the dimension of the action, receive rewards $r_i \in \mathbb{R}^{n_i}$, and obtain new observations $o'_i$. The resulting experience tuples $(o_i, a_i, r_i, o'_i)$ are saved into replay buffer $B_i$.
In learning, agents draw a random mini-batch of $S \in \mathbb{N}^+$ samples from $B_i$, where each sample is denoted as $(o^j_i, a^j_i, r^j_i, o'^j_i) \in (\mathbb{R}^{d_o}, \mathbb{R}^{d_a}, \mathbb{R}, \mathbb{R}^{d_o})$. The critic is updated by minimizing the loss
$$L(\theta^{Q}_i) = \frac{1}{S} \sum_{j=1}^{S} \left( y^j_i - Q_i(o^j_i, a^j_i) \right)^2,$$
where the target value is $y^j_i = r^j_i + \gamma Q'_i(o'^j_i, a'^j_i)$ with $a'^j_i = \mu'_i(o'^j_i)$. To update the actor, the policy gradient theorem [30] is used, and the gradient is approximated as
$$\nabla_{\theta^{\mu}_i} J \approx \frac{1}{S} \sum_{j=1}^{S} \nabla_{\theta^{\mu}_i} \mu_i(o^j_i) \, \nabla_{a} Q_i(o^j_i, a) \big|_{a = \mu_i(o^j_i)}.$$
Finally, the target networks are soft-updated accordingly.
The exploration rate $\epsilon$ is the probability that the agent will choose to explore the environment instead of exploiting it. In this study, $\epsilon$ is set to decrease gradually with each episode as $\epsilon \leftarrow \max(0.05, \epsilon - 5 \times 10^{-5})$. Similarly, the action noise $N_t$ is also set to decrease gradually with each episode as $\max(0.05, N_t - 5 \times 10^{-5})$, with their initial values shown in Table II. The learning process is carried out until a state of dynamic equilibrium is achieved between the predators and prey, such that neither party can obtain higher future rewards by altering its policy. The hyperparameters for the proposed algorithm are summarized in Table II.
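The scalar bookkeeping of the learning step can be sketched without a deep-learning library. The snippet below assumes the standard deterministic policy gradient conventions: a one-step target $r + \gamma Q'$, a mean-squared critic loss, a soft update $\theta' \leftarrow \tau\theta + (1-\tau)\theta'$, and the linear decay with a floor of 0.05 quoted in the text. Function names and the flat-list parameter representation are assumptions.

```python
def td_target(reward, next_q_value, gamma):
    """One-step target: reward plus discounted target-critic value."""
    return reward + gamma * next_q_value


def critic_loss(targets, q_values):
    """Mean squared error between targets and current critic estimates."""
    return sum((y - q) ** 2 for y, q in zip(targets, q_values)) / len(targets)


def soft_update(target_params, source_params, tau):
    """Move each target parameter a fraction tau towards the learned parameter."""
    return [tau * s + (1.0 - tau) * t for t, s in zip(target_params, source_params)]


def decay(value, floor=0.05, step=5e-5):
    """Per-episode linear decay used for the exploration rate and action noise."""
    return max(floor, value - step)
```

In a full implementation the parameter lists would be network weight tensors and the critic loss would be minimized by gradient descent; the arithmetic of each step stays the same.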

TABLE II: Hyper-parameters of algorithm