Synergistic effects of adaptive reward and reinforcement learning rules on cooperation

Cooperative behavior in multi-agent systems has been a focal point of research, particularly in the context of pairwise interaction games. While previous studies have successfully used reinforcement learning rules to explain and predict the behavior of agents in two-agent interactions, multi-agent interactions are more complex, and the impact of reward mechanisms on agent behavior is often overlooked. To address this gap, we propose a framework that combines the public goods game (PGG) with reinforcement learning and adaptive reward mechanisms to better capture decision-making behavior in multi-agent interactions. In this framework, the PGG models the decision-making behavior of multi-agent interactions, self-regarding Q-learning provides an experience-based strategy update, and adaptive reward supplies the adaptability; our main concern is the synergistic effect of the latter two. The simulations demonstrate that while self-regarding Q-learning fails to prevent the collapse of cooperation in the traditional PGG, the fraction of cooperation increases significantly when the adaptive reward strategy is included. Moreover, the theoretical analysis agrees with the simulation results and reveals that a specific reward cost maximizes the fraction of cooperation. Overall, this study provides a novel perspective on establishing cooperative reward mechanisms in social dilemmas and highlights the importance of considering adaptive reward mechanisms in multi-agent interactions.


Introduction
The efficiency of multi-agent systems heavily relies on the cooperation among agents [1,2]. Over the past few years, the emergence of cooperative behaviors in multi-agent systems has been a prominent research topic [3][4][5]. Evolutionary game theory [6,7] provides a framework to model and simulate the evolution of behaviors in multi-agent systems, where each individual in the game is treated as an agent, and behavior evolution is achieved through interaction and learning with other agents. In the context of the minority game model [8] and pairwise interaction games [9][10][11], several studies have successfully used reinforcement learning rules to explain and predict the behavior of agents [12,13]. Q-learning [14] is a typical self-regarding reinforcement learning algorithm. Compared with aspiration-based self-learning [15][16][17][18][19], the optimal behavior in Q-learning is generated by executing the action with the highest expected Q-value. However, it should be noted that Q-learning cannot promote cooperation in the well-mixed prisoner's dilemma game [9]. To address this issue, the extortion strategy [20] and Lévy noise [21] have been introduced to study the behavior evolution of agents in the spatial prisoner's dilemma game, and both have proven effective in promoting cooperation. Nonetheless, given that multi-agent interactions are more prevalent than two-agent interactions in realistic scenarios, it is essential to model and study the dynamics of behaviors in multi-agent interaction systems.
Public goods game (PGG) [22] is a widely used model that reflects the decision-making behavior of multi-agent interactions. In this game, several agents form a PGG group, and each agent must choose whether to cooperate by donating a fixed amount c to a common pool or to defect by donating nothing [23]. The total amount of donations is multiplied by an enhancement factor r and distributed equally among all agents in the group. This creates a classical social dilemma [24][25][26][27][28], in which the individually best strategy is not to donate (i.e. defection), since a defecting agent can benefit from the goods donated by other agents without incurring any cost. However, if all agents make this rational decision, the group gets nothing. Thus, the PGG represents a simplified multi-agent interaction scenario and helps us study the decision making of agents in social dilemmas.
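To make the dilemma concrete, the short Python sketch below (our own illustration, not code from the paper) computes the per-round payoffs of a single PGG group; the function name pgg_payoffs and the default parameter values are hypothetical.

```python
# Illustrative sketch of the basic PGG payoff split (not the paper's code).
# A group of k+1 agents: cooperators donate c, the pool is multiplied by r
# and shared equally, so defectors free-ride on the cooperators' donations.

def pgg_payoffs(strategies, r=3.0, c=1.0):
    """Return each agent's payoff for one public goods game round.

    strategies: list of 'C' (donate c) or 'D' (donate nothing).
    """
    group_size = len(strategies)
    pool = sum(c for s in strategies if s == 'C') * r
    share = pool / group_size
    # Cooperators receive the equal share minus their donation;
    # defectors receive the same share at no cost.
    return [share - c if s == 'C' else share for s in strategies]

# Example: a lone cooperator among five agents earns r*c/5 - c < 0 for
# r < 5, while each defector pockets r*c/5 -- the classical dilemma.
print(pgg_payoffs(['C', 'D', 'D', 'D', 'D']))
```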
It would be intriguing to explore more adaptable strategies that adjust the frequency and amount of reward over time, such as adaptive reward mechanisms [49]. In this approach, the reward activity of agents depends sensitively on the spread of defection behavior. If a sharp spreading of defectors is observed, rewarders give a higher bonus to encourage cooperation, as higher-payoff agents are more likely to survive, averting a potential social dilemma. Conversely, if the fraction of defection is relatively stable, such action may be prohibitively expensive, and reward activity should be reduced accordingly. Incorporating adaptive rewards into spatial PGGs has demonstrated evolutionary advantages and enhanced cooperation by preventing the emergence of cyclic dominance. However, that work focused primarily on the Fermi rule [56] (an imitation-based rule); the evolution of cooperation and the role of adaptive reward under reinforcement learning rules (self-regarding rules) remain unresolved. Investigating the combination of adaptive reward and self-regarding Q-learning has important implications for designing specific incentives. On the one hand, in contrast to the Fermi rule, self-regarding Q-learning emphasizes attention to one's own experience, and because of its built-in randomness, no strategy becomes extinct during the evolutionary process, which is more consistent with realistic scenarios. On the other hand, adaptive reward focuses on adaptability, while self-regarding Q-learning emphasizes an experience-based strategy update. Thus, the combination of the two reflects a scenario in which the agent constantly adapts to environmental changes through continuous learning.
In this paper, we focus on the circumstance of multi-agent interactions, in which a PGG is adopted to model the social dilemma. In addition, a third strategy, namely adaptive reward, is introduced into the game. We are interested in how strategies evolve when adaptive reward and self-regarding Q-learning are taken into account at the same time.
The remainder of this paper is structured as follows. Section 2 introduces the basic model, which includes the PGG, adaptive reward strategy, and self-regarding Q-learning rule. Section 3 contains the simulation results and analysis. Finally, in section 4, we summarize our conclusions.

Model
We consider a three-strategy PGG on an N = L × L square lattice with Von Neumann neighborhoods and periodic boundary conditions, where each vertex x denotes an agent who interacts with its k = 4 nearest neighbors [57]. As a result, each agent belongs to g = 1, . . . , G (G = 5) overlapping groups containing k + 1 agents each, where the first group is centered on agent x, while the remaining G − 1 groups are centered on its neighbors. At the start of the game, each agent x is randomly assigned one of the three strategies, i.e. defection (s x = D), cooperation (s x = C) or adaptive reward (s x = R) (figure 1(a); defection, cooperation and adaptive reward are shown as blue, red and green nodes, respectively). Under the standard protocol, the two cooperating strategies s x = C and s x = R donate c = 1 unit to the common pool, while defecting strategies donate nothing. The sum of all donations in each group is multiplied by an enhancement factor r > 1 and distributed equally among the k + 1 agents, regardless of their strategies [48].
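As an illustration of this group structure, here is a minimal sketch (our own Python with hypothetical helper names; the paper does not publish code) that enumerates the G = 5 overlapping groups of an agent on a periodic L × L lattice.

```python
# Minimal sketch (hypothetical names) of the G = 5 overlapping PGG groups
# on an L x L lattice with periodic boundaries.

def von_neumann_neighbors(x, y, L):
    """Four nearest neighbors of site (x, y) under periodic boundaries."""
    return [((x - 1) % L, y), ((x + 1) % L, y),
            (x, (y - 1) % L), (x, (y + 1) % L)]

def groups_of(x, y, L):
    """The G = 5 groups agent (x, y) belongs to: one centered on itself
    and one centered on each of its k = 4 nearest neighbors."""
    centers = [(x, y)] + von_neumann_neighbors(x, y, L)
    return [[c] + von_neumann_neighbors(c[0], c[1], L) for c in centers]

# Each group contains k + 1 = 5 agents:
print(groups_of(0, 0, L=4))
```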
In this paper, adaptive rewarders are able to adjust the bonus increment u instead of offering a fixed, constant reward. To accommodate adaptive reward [49], each agent is assigned a tag ω x that keeps track of its reward activity. Initially, ω x = 0 for all agents. Subsequently, whenever an agent switches to another strategy, its reward activity is reset to zero (ω x = 0). At the same time, if an agent successfully changes its strategy to defection, all adaptive rewarders in the five groups (figure 1(a)) containing the defecting agent increase their reward activity by one, i.e. ω x = ω x + 1 (figure 1(b)). As a result, cooperators and adaptive rewarders may obtain more than one bonus u from each adaptive rewarder within the interaction neighborhood, which in turn requires adaptive rewarders to bear a correspondingly higher cost α, the parameter that determines the expense of rewarding (self-rewarding is excluded). To avoid unnecessary cost, the reward activity decreases by one (ω x = ω x − 1) after each full Monte Carlo step of the game as long as ω x > 0 (figure 1(c)).
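The reward-activity bookkeeping described above can be summarized in the following sketch (illustrative only; function and variable names such as on_strategy_change and omega are our assumptions).

```python
# Illustrative sketch of the reward-activity (omega) bookkeeping; all
# names are hypothetical, since the paper does not publish its code.

def on_strategy_change(agent, new_strategy, strategy, omega, cohabitants):
    """Update reward activities after `agent` adopts `new_strategy`.

    cohabitants[agent]: all agents sharing at least one of the five
    groups with `agent` (see figure 1(a)).
    """
    strategy[agent] = new_strategy
    omega[agent] = 0  # any strategy switch resets the agent's own activity
    if new_strategy == 'D':
        # every adaptive rewarder sharing a group with the new defector
        # raises its reward activity by one
        for member in cohabitants[agent]:
            if strategy[member] == 'R':
                omega[member] += 1

def decay_step(omega):
    """After each full Monte Carlo step, positive activities relax by one."""
    for agent in omega:
        if omega[agent] > 0:
            omega[agent] -= 1
```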
According to the above description, the payoffs of agent x with different strategies obtained from a group g are as follows [49]:

$$\Pi_D^g = \frac{rc\,(N_C + N_R)}{k+1},$$
$$\Pi_C^g = \frac{rc\,(N_C + N_R + 1)}{k+1} - c + u\sum_{i\in g,\ s_i = R}\omega_i,$$
$$\Pi_R^g = \frac{rc\,(N_C + N_R + 1)}{k+1} - c + u\sum_{i\in g\setminus\{x\},\ s_i = R}\omega_i - \alpha\,\omega_x\,(N_C + N_R), \qquad (1)$$

where N s x denotes the number of other agents in the group with strategy s x , and ω i is the actual reward activity of agent i. α and u are two adaptive reward parameters that determine the cost of one reward activity and the corresponding bonus increment, respectively. The total payoff is the sum of the sub-payoffs from each group g:

$$\Pi_x = \sum_{g=1}^{G}\Pi_x^g. \qquad (2)$$

During the strategy update process, agents adopt self-regarding Q-learning to decide their action by considering the current state and the Q-values. Specifically, an agent x with initial state s is selected at random. Its Q-values are stored in a Q-table that records the relative utility of state-action pairs. In this paper, the state set and the action set are s = a = {D, C, R}, which consists of all the optional strategies. Accordingly, the time-dependent Q-table of each agent, with states as rows and actions as columns, is

$$Q(t) = \begin{pmatrix} Q_{D,D}(t) & Q_{D,C}(t) & Q_{D,R}(t) \\ Q_{C,D}(t) & Q_{C,C}(t) & Q_{C,R}(t) \\ Q_{R,D}(t) & Q_{R,C}(t) & Q_{R,R}(t) \end{pmatrix}, \qquad (3)$$

where Q s,a (t) represents the agent's Q-value for a fixed combination of state s and action a at time step t. In Q-learning, the ε-greedy algorithm is used to strike a balance between exploration and exploitation: an agent randomly chooses an action with probability ε (a small value, set to 0.02 in all simulations) or, with probability 1 − ε, switches to an action with the highest Q-value in the row of the current state s. Note that more than one action may share the highest Q-value in state s, in which case the agent randomly selects one of them. Thus, in each Monte Carlo step t, the agent learns an action a using the ε-greedy algorithm, obtains a payoff Π(t) (defined in equation (2)), and enters a new state s′. After that, the Q-value is updated by the following equation [8,9]:

$$Q_{s,a}(t+1) = Q_{s,a}(t) + \eta\big[\Pi(t) + \gamma\,Q^{\max}_{s',a'}(t) - Q_{s,a}(t)\big], \qquad (4)$$

where η ∈ (0, 1] denotes the learning rate, γ ∈ [0, 1) discounts the expected payoff of the next state s′, and Q max s′,a′ (t) represents the highest Q-value of the next state s′.

The stationary fractions of defection, cooperation and adaptive reward (denoted by ρ D , ρ C and ρ R , respectively) are calculated using a Monte Carlo simulation procedure with the following basic steps:
(1) Simulations are carried out on a square lattice with sizes ranging from L = 200 to 800. Initially, each agent is assigned to play D, C, or R with equal probability, and the Q-table is initialized to zero.
(2) At each step, an agent x is randomly selected, takes an action a via the ε-greedy algorithm, and engages in the PGG with its four neighbors to acquire payoffs according to equation (2).
(3) The Q-value is updated using equation (4), where the selected action a becomes the new state s′; the state s is then changed to s′.
(4) One Monte Carlo step consists of repeating procedures (2) and (3) N times.
(5) The steady state of the system is averaged over the last 2000 of the overall 20 000 steps. Moreover, the final results are averaged over 10 independent runs to eliminate the effect of uncertainties.
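A compact sketch of the ε-greedy action selection and the update in equation (4) might look as follows (our own Python; the dictionary-of-dictionaries Q-table layout is an assumption, not the authors' implementation).

```python
import random

# Hedged sketch of self-regarding Q-learning (equation (4)); parameter
# values follow the text, the Q-table data layout is our assumption.

EPS, ETA, GAMMA = 0.02, 0.8, 0.8
ACTIONS = ['D', 'C', 'R']

def choose_action(Q, state):
    """epsilon-greedy: explore with probability EPS, otherwise pick a
    maximal-Q action in the current state's row (ties broken randomly)."""
    if random.random() < EPS:
        return random.choice(ACTIONS)
    row = Q[state]
    best = max(row.values())
    return random.choice([a for a in ACTIONS if row[a] == best])

def update_q(Q, state, action, payoff, next_state):
    """Q_{s,a} <- Q_{s,a} + eta * (Pi + gamma * max_a' Q_{s',a'} - Q_{s,a})."""
    target = payoff + GAMMA * max(Q[next_state].values())
    Q[state][action] += ETA * (target - Q[state][action])

# A fresh agent starts from an all-zero Q-table:
Q = {s: {a: 0.0 for a in ACTIONS} for s in ACTIONS}
```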

Results
Cost α and bonus increment u are two critical indicators of adaptive reward mechanisms. First, we present the fraction of defection (D), cooperation (C), and adaptive reward (R) as a function of the enhancement factor r for various values of α to investigate the effect of the reward cost. For comparison, figure 2(a) depicts the situation without reward, in which cooperation can hardly be maintained when r is small, whereas for larger r the fraction of cooperation increases with r. Next, figure 2(b) depicts the case where a low-cost reward promotes cooperation (including cooperation and adaptive reward) over the entire interval of r, and the fractions of cooperation and adaptive reward are maintained at a high level even for small r. At the same time, we observe that the fractions of defection, cooperation and adaptive reward remain essentially constant as r increases, indicating that the agents are self-stable. Furthermore, the case in which low r values cannot sustain cooperation reappears in figures 2(c) and (d). In contrast to figure 2(a), figure 2(c) shows that cooperators can emerge earlier due to the presence of adaptive rewarders, resulting in the coexistence of the three strategies. Figure 2(d) indicates that the region of r in which cooperation can be promoted narrows as the reward cost α continues to rise, implying that the reward cannot even compensate for the cost α when r is small. From figures 2(b)-(d), it is apparent that the defection strategy is not dominant when r is large, regardless of the value of α.
Next, figure 3 investigates the impact of the bonus increment u on the evolution of the three strategies. It illustrates how the fraction of each strategy varies with the enhancement factor r for different values of u, showing that a higher bonus increment facilitates cooperation and lowers the threshold of r at which cooperation emerges. It is worth mentioning that this facilitating effect is more pronounced when r is small. However, in contrast to figure 2(a), once r reaches a certain value the facilitating effect almost vanishes, because cooperative behavior emerges naturally in the PGG when the enhancement factor r is sufficiently large.
Clearly, the adaptive reward mechanism is only meaningful when cooperators cannot survive on their own. We therefore systematically investigate in figure 4 how different combinations of the reward cost α and the bonus increment u affect the evolution of the three strategies at r = 2.8 (a relatively low value). From figures 4(a) and (c), we find that the fractions of defection and adaptive reward vary monotonically as the reward cost and bonus increment increase. However, figure 4(b) clearly shows that there exists a specific reward cost that maximizes the fraction of cooperation when u lies in a certain interval. These results show that an appropriate combination of u and α significantly increases the fraction of cooperation.
Furthermore, three representative cross sections of figure 4(b) are provided to quantify more precisely the role of adaptive reward in promoting cooperation. They demonstrate that changing the values of α and u affects the fraction of cooperation, and that there exists a value of α (around 0.36) that maximizes cooperation. For small values of u (i.e. u = 0.5 and u = 1.0), as shown in figures 5(a) and (b), once α grows beyond this threshold, the fraction of cooperation decreases monotonically with α. However, for a large value of u (i.e. u = 1.5), as shown in figure 5(c), the fraction of cooperation stays slightly below the peak and does not decrease as α increases, indicating that cooperation remains stable; the large u apparently offsets the disadvantage of the high cost.
Up to now, we have demonstrated how the combination of adaptive reward and reinforcement learning rules can produce synergistic effects in the context of the PGG. In what follows, we shift our focus to a micro perspective and analyze the evolutionary dynamics of the three strategies. Consider first the low-cost situation in figure 6(a), where we observe an increase in the fraction of adaptive reward and a corresponding decrease in the fraction of defection. This indicates that the incentive effect on the cooperation strategy is being realized. Notably, the three strategies coexist without requiring a cyclic dominance mechanism. Then, as the cost increases (figures 6(b) and (c)), defection takes over and weakens cooperation and adaptive reward. The balance between cooperation and adaptive reward is delicate. As seen in panel (b), the evolution of adaptive reward exhibits a clear enduring+expanding [58] process: during the endure phase, cooperators always outnumber adaptive rewarders, but this is reversed during the expand phase. Then, in panel (c), the number of adaptive rewarders decreases throughout, indicating that the high cost of reward cannot withstand the exploitation by defection, and thus the fraction of cooperation is also unsustainable. Figure 6 illustrates a common trend in the evolution of cooperation: a temporary upward trend is followed by a slight or significant decrease in the fraction of cooperation, which eventually settles at a constant value determined by the fraction of adaptive reward. Interestingly, the three curves gradually separate as the reward cost increases, and the larger the cost, the earlier they diverge.
In order to understand the effect of adaptive reward intuitively, we plot some typical snapshots from a prepared initial state in figure 7. The first row demonstrates how the adaptive reward strategy can overcome defection. Since agents select actions with ε-greedy exploration, the population evolves into a mixed state at the beginning, regardless of the initial strategy distribution. As time goes on, a large number of adaptive reward clusters emerge, and the sustainability of cooperation is improved through their wide distribution, which differs from traditional spatial reciprocity [57]. In the second row, with a higher α (α = 0.5), more and more agents prefer defection, allowing defectors to spread throughout the system and fragmenting cooperators and adaptive rewarders into numerous tiny clusters; eventually, the three strategies coexist peacefully. The third row depicts the high-reward-cost state (α = 0.8), in which cooperators and adaptive rewarders are severely suppressed, with only a few of them remaining in the system.
To explain the strategy update process of agents more clearly, figure 8 records the average Q-values (i.e. $\bar{Q}_{s,a}(t) = \sum_{i=1}^{N} Q^{i}_{s,a}(t)/N$) of agents in the steady state. According to the Q-learning rule, cooperation or adaptive reward actions are more likely to occur if their Q-values are larger than that of defection in the current state. Figure 8(a) reveals that the Q-value of adaptive reward is absolutely dominant, regardless of the current state. Therefore, adaptive rewarders can easily survive and expand when the reward cost is low; meanwhile, their presence also encourages agents to cooperate. We then observe an interesting action selection process at α = 0.5 (figure 8(b)): when the current state is D, agents are more inclined to choose R; when the current state is R, agents prefer C; and when the current state is C, agents are more inclined to choose D. As a result, defectors dominate in this situation, but cooperators and adaptive rewarders can also be maintained. As α increases to a large value (figure 8(c)), the advantages of defection become more obvious.

Figure 7. Snapshots of the spatial distribution of defection (blue), cooperation (red) and adaptive reward (green) obtained for α = 0.2 (first row), α = 0.5 (second row) and α = 0.8 (third row) at different time steps t (from left to right, t = 0, 10, 200, 1000 and 5000, respectively). The system achieves the evolutionarily stable state at t = 5000. All panels are drawn on a 300 × 300 spatial lattice. Other parameters are r = 2.8, u = 0.5, γ = 0.8, η = 0.8, ε = 0.02.

Figure 8. The averages of several different Q s,a for various α values. The current states of agents are divided into different areas, with blue, red, and green representing defection, cooperation, and adaptive reward, respectively. The same colors are used for the actions that agents can take in each area. Other parameters are r = 2.8, u = 0.5, γ = 0.8, η = 0.8, ε = 0.02.

Figure 5 has shown that high cooperation emerges at a specific reward cost (about α = 0.36). It is critical to understand this phenomenon. We define the strategy transition probability from s x to s y as

$$W_{s_x \to s_y} = \frac{N_{s_x \to s_y}}{N_{s_x}}, \qquad (5)$$

where N s x is the number of agents with strategy s x and N s x →s y is the number of agents switching from s x to s y . Figure 9 depicts the strategy transition probabilities (including agents entering and exiting C) as α varies from low to high values. On the one hand, the relationship between W D→C +W R→C and W C→D +W C→R varies as α increases. In figures 9(a) and (b), W D→C +W R→C is larger than W C→D +W C→R when α is small, so that cooperators have a chance to survive. Meanwhile, in figure 9(c), W D→C +W R→C is consistently greater than W C→D +W C→R over the entire interval of α, allowing cooperators to survive (similar to figure 5(c)). On the other hand, W D→C +W R→C reaches its maximum around α = 0.36 in figures 9(a) and (b), corresponding to the maximum fraction of cooperation in figures 5(a) and (b). In addition, the maximum of W D→C +W R→C in figure 9(c) occurs at a value of α greater than 0.36, but the difference between W D→C +W R→C and W C→D +W C→R reaches its maximum at α = 0.36 (as shown in the inset panel), which again leads cooperators to attain their maximum fraction in figure 5(c).
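For reference, equation (5) can be estimated from two consecutive strategy configurations as in the sketch below (our own illustrative code; the agent-to-strategy dictionary layout is an assumption).

```python
from collections import Counter

# Illustrative estimate of the transition probabilities in equation (5):
# W_{s_x -> s_y} = N_{s_x -> s_y} / N_{s_x}, measured between two
# consecutive Monte Carlo steps.

def transition_probabilities(before, after):
    """before/after: dicts mapping agent id -> strategy ('D', 'C' or 'R')."""
    counts = Counter(before.values())                       # N_{s_x}
    switches = Counter((before[a], after[a]) for a in before)
    return {(sx, sy): switches[(sx, sy)] / counts[sx]
            for sx in 'DCR' for sy in 'DCR'
            if sx != sy and counts[sx] > 0}
```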
By utilizing the transition probabilities, we can also understand the dependence of the strategies on α through theoretical analysis (mean-field techniques [59,60]). The motion of ρ C , ρ D and ρ R can be approximated as

$$\dot{\rho}_C = \rho_D W_{D\to C} + \rho_R W_{R\to C} - \rho_C\,(W_{C\to D} + W_{C\to R}),$$
$$\dot{\rho}_D = \rho_C W_{C\to D} + \rho_R W_{R\to D} - \rho_D\,(W_{D\to C} + W_{D\to R}),$$
$$\dot{\rho}_R = \rho_C W_{C\to R} + \rho_D W_{D\to R} - \rho_R\,(W_{R\to C} + W_{R\to D}). \qquad (6)$$

Once the system has attained a steady state, $\dot{\rho}_C = \dot{\rho}_D = \dot{\rho}_R = 0$. Solving these conditions together with the normalization ρ C + ρ D + ρ R = 1 yields the fractions of cooperation, defection, and adaptive reward in the steady state (a simple numerical sketch of system (6) is given at the end of this section). Figure 10 provides the comparison between the simulation results and the theoretical analysis, in which the changes of ρ C , ρ D and ρ R predicted theoretically are well consistent with the numerical simulations.

In addition to the adaptive reward mechanism parameters, the endogenous parameters of the Q-learning rule, namely the learning rate η and the discount factor γ, are also investigated. We only examine the impact of these parameters on the fraction of defection, because adaptive reward is also a cooperative action. Clearly, when a low-cost reward (figure 11(b)) is introduced, the fraction of defection is inhibited compared with figure 11(a), where no reward is present. Moreover, as α increases, figure 11(c) shows that small values of γ allow defection to reach a higher fraction. This finding reveals that emphasizing short-term benefits favors defection even when adaptive reward is available. However, for a high reward cost (figure 11(d)), the impact of adaptive reward on defection is negligible, and the fraction of defection remains almost the same as in figure 11(a). Furthermore, as presented in figures 11(b) and (d), the fraction of defection is almost unaffected by varying the Q-learning parameters over a wide range, which indicates that Q-learning is self-stable.
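As announced above, here is a minimal numerical sketch of the mean-field system (6), assuming measured transition probabilities W as input; the Euler step and the placeholder W values are our own, not data from the paper.

```python
# Minimal Euler integration of the mean-field equations (6); the W
# values below are placeholders, not measurements from the paper.

def mean_field(rho, W, dt=0.01, steps=200_000):
    """rho: {'D': ..., 'C': ..., 'R': ...} initial fractions (summing to 1);
    W: {(sx, sy): probability} strategy transition probabilities."""
    for _ in range(steps):
        # inflow from the other two strategies minus outflow to them
        flow = {s: sum(rho[o] * W[(o, s)] for o in 'DCR' if o != s)
                   - rho[s] * sum(W[(s, o)] for o in 'DCR' if o != s)
                for s in 'DCR'}
        rho = {s: rho[s] + dt * flow[s] for s in 'DCR'}
    return rho

# Example with symmetric placeholder probabilities (stationary at 1/3 each):
W = {(a, b): 0.1 for a in 'DCR' for b in 'DCR' if a != b}
print(mean_field({'D': 1/3, 'C': 1/3, 'R': 1/3}, W))
```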
Finally, the effectiveness of the self-regarding Q-learning rule versus the Fermi rule in fostering cooperation under adaptive reward mechanisms remains an open question. To shed light on this issue, we compare the two evolutionary results in figure 12. With the same parameter settings, the Q-learning rule does not cause the extinction of any strategy, owing to its randomness, unlike the Fermi rule. Moreover, the Q-learning rule outperforms the Fermi rule in promoting cooperation when the enhancement factor r is small.

Conclusion
Our study aims to shift the research focus from two-agent interactions to multi-agent interactions [61] by investigating the evolution of strategies in the spatial PGG, where adaptive reward and the self-regarding Q-learning rule are both taken into consideration. The results reveal that relying solely on self-regarding Q-learning is often inadequate for promoting cooperation. However, incorporating adaptive reward alongside the reinforcement learning rule leads to synergistic effects that drive the system toward novel and intriguing dynamics. In detail, the introduction of the adaptive reward strategy not only increases the fraction of cooperation, but also enables cooperation (including adaptive reward) to dominate the system even when r is small; when r is large, the fractions of the three strategies remain relatively stable. At the same time, one novel finding is that the reward cost α plays a crucial role in determining the fraction of cooperation when the enhancement factor r and the bonus increment u are both fixed: there exists a moderate reward cost that is optimal for the evolution of cooperation. Apart from that, we also analyze the reasons why adaptive reward promotes cooperation from a micro perspective (including the time evolution of strategies, typical snapshots of the system, average Q-values of agents, and strategy transition probabilities). The results indicate that the three strategies can coexist without the need for a cyclic dominance mechanism, and that the wide distribution of adaptive rewarders improves the sustainability of cooperation. Besides, when the reward cost is low, the Q-values of cooperation and adaptive reward are more likely to exceed that of defection, while the advantage of defection becomes more obvious when the reward cost is high. In addition, the strategy transition probabilities explain why α = 0.36 leads cooperation to achieve its maximum, which is validated by the theoretical analysis. Finally, the study of the Q-learning parameters indicates that Q-learning is self-stable, and the comparison with the Fermi rule shows the advantage of Q-learning in promoting cooperation. This work may inspire a series of subsequent works, for example, considering multiple populations [62] or asymmetric interactions [63,64].

Data availability statements
All data that support the findings of this study are included within the article (and any supplementary files).