The emergence of cooperation via Q-learning in the spatial donation game

Models of decision-making often overlook the feedback between agents and the environment. Reinforcement learning is widely employed, through exploratory experimentation, to address problems involving states, actions, rewards, and decision-making in various contexts. This work considers a new perspective, in which individuals continually update their policies based on interactions with the spatial environment, aiming to maximize cumulative rewards and learn the optimal strategy. Specifically, we utilize the Q-learning algorithm to study the emergence of cooperation in a spatial population playing the donation game. Each individual has a Q-table that guides their decision-making in the game. Interestingly, we find that cooperation emerges within this introspective learning framework, and that a smaller learning rate and a higher discount factor make cooperation more likely to occur. Through an analysis of the Q-table evolution, we disclose the underlying mechanism for cooperation, which may provide some insight into the emergence of cooperation in real-world systems.


Introduction
Donation behaviors are prevalent in human society, where individuals are willing to contribute money or goods to support those suffering from natural disasters [1]. In our daily lives, some people opt to regularly donate a portion of their income to charitable organizations to promote social welfare [2]. Nevertheless, 'survival of the fittest' in Darwin's The Origin of Species suggests that individuals have an inherently selfish nature, which makes donation behaviors appear counterintuitive. Why, then, would people willingly donate their money or goods to strangers? The key question is as follows: how does donation behavior spontaneously emerge when there is a conflict of interest between society and individuals [3]?
To understand the emergence and maintenance of cooperative behaviors in donation, evolutionary game theory provides a theoretical framework [4]. As the prototypical game for many social dilemmas, the Prisoner's Dilemma is a two-player, two-strategy game in which individuals can choose either cooperation or defection as their strategy. When two individuals play this game, mutual cooperation results in a reward R for both, while mutual defection leads to a punishment P. If one individual cooperates and the other defects, the cooperator receives the sucker's payoff S, while the defector receives the temptation payoff T. The following constraints are satisfied: T > R > P > S and T + S < 2R, which ensure that mutual cooperation is beneficial to the collective, but defection yields a higher unilateral payoff. Obviously, no matter what strategy the other adopts, the optimal strategy is always to defect; mutual defection is therefore the Nash equilibrium, forming the unavoidable dilemma. The donation game (DG), a variant of the Prisoner's Dilemma, is particularly suitable for studying donation behaviors [5]. Here, cooperation entails an individual paying a cost c to provide a benefit b to the other party, under the condition b > c, while defection provides nothing [6,7]. Therefore, the payoff parameters become P = 0, S = −c, R = b − c, and T = b.
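As a quick check that these payoffs indeed satisfy the dilemma conditions, take the illustrative values b = 3 and c = 1 (numbers of ours, chosen only for concreteness):

$$\begin{pmatrix} R & S \\ T & P \end{pmatrix} = \begin{pmatrix} b-c & -c \\ b & 0 \end{pmatrix} = \begin{pmatrix} 2 & -1 \\ 3 & 0 \end{pmatrix},$$

so that T = 3 > R = 2 > P = 0 > S = −1 and T + S = 2 < 2R = 4, as required.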
During the past two decades, many mechanisms have been proposed to address this dilemma, such as direct [8,9] and indirect reciprocity [10–15], group selection [16,17], spatial and network structures [18], memory effects [19–23], repeated interactions [24–27], social diversity [28–31], and mechanisms that punish or reward cooperative behaviors [32–39]. It is worth noting that most previous studies of the mechanisms of cooperation evolution have adopted the framework of imitation learning, where individuals tend to replicate the strategies of more successful players. This approach focuses on the dynamics of continuous systems. However, human learning is usually far more complex than this framework: people often revise their strategies based on interactions with the environment, draw lessons from their own experiences, and balance present rewards against future ones.
Reinforcement learning (RL), one of the most powerful machine learning methods, is deeply rooted in psychology and neuroscience [40,41]. It is widely employed, through exploratory experimentation, to address problems involving states, actions, rewards, and decision-making in various environments. In recent years, RL has also been applied to solve complex cooperation problems in discrete systems [42–45].
Current related work has mainly developed along two lines. One originates from the Bush–Mosteller (BM) model from psychology [46]. Specifically, if a player's payoff from taking a certain action exceeds the expected level, the likelihood of adopting that strategy is increased; otherwise, it is decreased. A series of expectation-based RL works have emerged [47–49]. Tanabe and Masuda [50] employed a modified BM model, demonstrating that learning enables the evolution of cooperation. Ezaki et al [51] explained the conditional cooperation behavior observed in multiplayer social dilemma games, and reproduced the failure of network reciprocity found in structured population experiments [52]. Horita et al [53] observed emotion-conditioned cooperative behavior in various types of games, which can be elucidated by RL. It is worth noting that in this learning process, participants in the BM model only use information related to their own past choices and payoffs, ignoring information from their opponents [54].
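To make the BM update rule concrete, the following minimal Python sketch implements an aspiration-based probability update in the spirit of the BM model; the tanh-shaped stimulus, the sensitivity parameter beta, and all names are illustrative choices of ours rather than the exact formulation of [46].

```python
import math

def bm_update(p_coop, last_action, payoff, aspiration, beta=0.5):
    """One aspiration-based update of the cooperation probability.

    A payoff above the aspiration level reinforces the action just
    taken; a payoff below it weakens that action. Sketch only.
    """
    stimulus = math.tanh(beta * (payoff - aspiration))  # in (-1, 1)
    if last_action == "C":
        # Satisfying cooperation raises p_coop; disappointing cooperation lowers it.
        return p_coop + (1 - p_coop) * stimulus if stimulus >= 0 else p_coop * (1 + stimulus)
    else:
        # Satisfying defection lowers p_coop; disappointing defection raises it.
        return p_coop * (1 - stimulus) if stimulus >= 0 else p_coop - (1 - p_coop) * stimulus
```

For example, `bm_update(0.5, "C", payoff=3.0, aspiration=1.0)` nudges the cooperation probability upward, since the payoff exceeded the aspiration.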
Another research avenue is pursued within the framework of self-regarding Q-learning [55], where players consider only their own experience. Zhang et al [56,57] explored the emergence of cooperation characterized by explosiveness and cyclic oscillations in two-player games, and found that the herding effect could be effectively eliminated in minority games [58]. Furthermore, the incorporation of other mechanisms, such as adaptive rewards [59], social payoffs [55], extortion strategies [60], and Lévy noise [61], has been shown to significantly improve the level of cooperation. Nonetheless, the self-regarding Q-learning approach, which emphasizes policy updates based on one's own experience, leads to an alignment of the individual state space with the strategy space, thereby overlooking environmental cues. Some studies have linked an individual's state space with the strategies of their neighbors. For instance, Ding et al [62] found that individuals who maintain a high level of cooperation act like win-stay, lose-shift (WSLS) players in two-agent repeated games. Additionally, Zheng et al [63] observed elevated levels of trust in trust games, and Yang et al [64] proposed a reward information-sharing mechanism, indicating that considering neighbors' payoff information can significantly enhance cooperation.
Inspired by previous works, we incorporate more environmental information (neighbors' decisions) to expand the state space, diverging from the two main approaches mentioned above. This allows individuals to tailor their responses to diverse environments more effectively, ultimately enhancing the overall performance of the population. Specifically, we employ the Q-learning algorithm [65] and place individuals on a square lattice to study the evolution of cooperation in the DG. Analysis of the Q-table reveals that the underlying mechanism behind the emergence of cooperation lies in the fact that when the benefit-to-cost ratio is high, the rewards associated with choosing cooperation and with choosing defection are both high. The preference between the two actions thus easily undergoes a reversal, and some individuals continue to choose cooperation in different states, so cooperation can be sustained. Moreover, we investigate the impact of other parameters on the emergence of cooperation, and find that a stronger historical effect and a larger discount factor make cooperation more likely to occur. Comparing the results with imitation dynamics, we find that cooperation is more likely to emerge in the RL framework without introducing any additional mechanism. This integrated approach, which takes into account historical influences, immediate payoffs, and future expectations in learning, has the potential to capture the psychological processes underlying human decision-making. Furthermore, our findings aim to offer a fresh perspective on understanding the emergence of cooperative evolution in RL.
This paper is organized as follows. We introduce our model in section 2. Numerical simulation results on the square lattice are shown in section 3. The mechanism is analyzed in section 4. Finally, we conclude and discuss our work in section 5.

Model
We consider a group of individuals engaging in a DG [5] on an L × L square lattice with periodic boundary conditions. In our work, we adopt the parameter settings P = 0, S = −1, R = r − 1, and T = r to simplify the parameters. Here, the benefit-to-cost ratio is denoted as r = b/c, and the dilemma occurs when r > 1. Since the benefit of mutual cooperation gradually increases with r, a larger ratio r is anticipated to be more conducive to promoting cooperation in the structured population [66–68]. The payoff matrix is given in equation (1), where $\Pi_{aa'}$ is the payoff for an individual taking action a against a neighbor taking action a′:

$$\Pi = \begin{pmatrix} \Pi_{CC} & \Pi_{CD} \\ \Pi_{DC} & \Pi_{DD} \end{pmatrix} = \begin{pmatrix} r-1 & -1 \\ r & 0 \end{pmatrix}. \qquad (1)$$

Each individual engages in the game only with their four nearest neighbors on the square lattice. Specifically, we assume that the number of cooperating neighbors represents the state an individual perceives in the environment. The state space includes all possible surroundings, denoted as $S = \{s_0, s_1, s_2, s_3, s_4\}$, where $s^i_j$ corresponds to having j cooperating neighbors around individual i. Each individual chooses an action based on its current state, and the action set is denoted as A = {C, D}, where C and D represent cooperation and defection, respectively. To employ the Q-learning algorithm [65], each individual maintains a Q-table that stores the estimated Q-values, as shown in table 1; a Q-value represents the expected cumulative reward the individual can obtain by taking a given action in a given state.
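To fix ideas, the following Python sketch (the NumPy layout and all names are our own illustrative assumptions, not from the paper) encodes the lattice, the random initialization, and the state of each site as the number of cooperating nearest neighbors under periodic boundaries.

```python
import numpy as np

L = 100                                   # lattice side, N = L * L individuals
rng = np.random.default_rng(seed=0)

# Actions encoded as integers: 1 = cooperate (C), 0 = defect (D).
actions = rng.integers(0, 2, size=(L, L))

# One Q-table per individual: 5 states (0..4 cooperating neighbors) x 2 actions,
# initialized with random values in [0, 1) as described in the text.
Q = rng.random(size=(L, L, 5, 2))

def states(actions):
    """State of each site: number of cooperating nearest neighbors,
    with periodic boundary conditions."""
    return (np.roll(actions, 1, axis=0) + np.roll(actions, -1, axis=0)
            + np.roll(actions, 1, axis=1) + np.roll(actions, -1, axis=1))
```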
The procedure of the game evolution is as follows. Initially, the strategies of all individuals are randomly assigned as either C or D, and all entries of their Q-tables are assigned random numbers between 0 and 1. At each time step, the individuals select an action based on an exploration–exploitation trade-off scheme. Here we adopt ϵ-greedy exploration: with probability ϵ an action is randomly chosen from the action set A; otherwise, the action with the larger Q-value in the current state is selected. As a common practice, we fix the exploration rate at ϵ = 0.01 [62]. After all individuals make their moves, their average rewards Π can be computed, and the new state $s^i_{j'}$ is also known. Finally, all individuals update their Q-tables according to

$$Q_{s^i_j,a_i}(t+1) = (1-\alpha)\,Q_{s^i_j,a_i}(t) + \alpha\left[\Pi_i(t) + \gamma \max_{a'} Q_{s^i_{j'},a'}(t)\right]. \qquad (2)$$

Here $Q_{s^i_j,a_i}(t)$ represents the Q-value of individual i in state $s^i_j$ with action $a_i$ at time step t, $s^i_{j'}$ is the new state, and $\max_{a'} Q_{s^i_{j'},a'}(t)$ is the maximum Q-value attainable in the new state. The learning rate α ∈ (0, 1] controls the relative weight between the old Q-value and the new contribution, shown respectively in the first and second terms of equation (2). The discount factor γ ∈ [0, 1) determines the importance of future rewards. By iteratively updating the Q-table based on the observed rewards and the transitions between states, each agent gradually learns the policy that maximizes its long-term rewards.
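Building on the sketch above, one synchronous time step, i.e. ϵ-greedy action selection, average-payoff computation, and the update of equation (2), might look as follows (again a sketch under the same assumptions, reusing `Q`, `actions`, and `states` from the previous snippet):

```python
def step(Q, actions, r, alpha=0.1, gamma=0.9, eps=0.01, rng=None):
    """One synchronous time step for the whole lattice (sketch of eq. (2))."""
    rng = rng if rng is not None else np.random.default_rng()
    ii, jj = np.indices(actions.shape)
    s = states(actions)                                # current state of each site

    # Epsilon-greedy: exploit the larger Q-value, explore with probability eps.
    greedy = np.argmax(Q[ii, jj, s], axis=-1)
    explore = rng.random(actions.shape) < eps
    new_actions = np.where(explore, rng.integers(0, 2, actions.shape), greedy)

    # Average donation-game payoff over the four games: each cooperating
    # neighbor contributes r/4 on average, and cooperating costs 1 per game.
    s_new = states(new_actions)
    payoff = r * s_new / 4.0 - new_actions

    # Q-learning update, equation (2).
    q_old = Q[ii, jj, s, new_actions]
    target = payoff + gamma * Q[ii, jj, s_new].max(axis=-1)
    Q[ii, jj, s, new_actions] = (1 - alpha) * q_old + alpha * target
    return new_actions
```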
We employ a synchronous updating scheme, where all individuals update simultaneously at each time step. The system size N = L × L is fixed with L = 100 throughout the study. We use the proportion of cooperators in the population, $f_C = \frac{1}{N}\sum_{i=1}^{N} a_i$ (with $a_i = 1$ for cooperation and $a_i = 0$ for defection), as the primary order parameter describing the macroscopic state of the system.
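With the integer encoding assumed in the sketches above (1 = C, 0 = D), this order parameter is simply the lattice mean:

```python
f_C = actions.mean()   # instantaneous fraction of cooperators
```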

Numerical results
We fix the two learning parameters α = 0.1 and γ = 0.9 and investigate the impact of the benefit-to-cost ratio r = b/c on the cooperation level in the lattice system; see figure 1. We observe that when r is small, the population quickly reaches a stable state in which the fraction of cooperation $f_C$ approaches 0. As r gradually increases, cooperation emerges, since cooperators can obtain more benefit from cooperative neighbors. Furthermore, as r continues to increase, the level of cooperation exceeds 0.5.
To illustrate the overall impact of r on the level of cooperation, we present the phase transition of the cooperation level as a function of r, as shown in figure 2. When r is small, the level of cooperation in the system approaches 0. As r increases gradually, cooperation emerges beyond a critical ratio $r_C \approx 7.0$, and a further increase in r leads to a higher proportion of cooperators. It is worth noting that, without any extra mechanism, the condition for cooperation emergence in the framework of imitation replicator dynamics [69] is around $r_C \approx 14.0$. Obviously, cooperation is more likely to emerge in the RL framework.
To gain some intuition, figure 3 illustrates the spatiotemporal patterns of strategies at different times for r = 10. At the beginning, the strategies C and D are randomly distributed across the domain (see figure 3(a)). Within the first $10^3$ steps, this feature remains largely unchanged (figure 3(b)). From figures 3(c) to (e), as time goes by, a large number of cooperators are replaced by defectors. Eventually, the majority of the population consists of defectors for the given parameters, while cooperators persist in small clusters or in isolation and do not disappear even in the long run, as shown in figure 3(f).

Mechanism analysis
To comprehend the underlying reason for the emergence of cooperation, we now turn our attention to the dynamical mechanism of the evolution. Without loss of generality, we keep the parameters consistent with figure 3.

The evolution of the average Q-value
First, we investigate the temporal evolution of the average relative magnitude of Q-table values across different states, denoted by $\Delta Q_{s_j} = \frac{1}{N}\sum_{i=1}^{N}\big(Q^i_{s_j,C} - Q^i_{s_j,D}\big)$, as depicted in figure 4. When $\Delta Q_{s_j} > 0$, individuals are more inclined to choose cooperation on average; conversely, they are more inclined to defect when $\Delta Q_{s_j} < 0$. During the initial $10^4$ steps, we observe that $\Delta Q_{s_j}$ decreases across all states but still remains close to 0. This indicates that, on the whole, there is no substantial difference in the probabilities of choosing C or D in the early stages.
Afterwards, there is a significant differentiation in $\Delta Q_{s_j}$ across different states, particularly in $s_0$ and $s_1$. Eventually, when the stable state is reached, the values of $\Delta Q_{s_j}$ indicate a strengthened inclination of individuals towards the defection strategy, especially when surrounded by fewer cooperators (i.e. in $s_0$ and $s_1$). The evolution of $\Delta Q_{s_2}$ and $\Delta Q_{s_3}$ exhibits similar trends between the onset of the descent and the arrival at stability, but $\Delta Q_{s_2}$ drops and reaches a steady state earlier than $\Delta Q_{s_3}$, which leads to a longer duration of choosing option C in state $s_3$ compared to state $s_2$, consistent with figure 6(b). It is precisely because of this time lag that some individuals are given space to choose cooperation. Ultimately, the $\Delta Q_{s_j}$ values for these states stabilize around −2, suggesting that although defection remains the dominant strategy on average, some individuals have $\Delta Q_{s_j}$ values larger than 0 in specific realizations, leading them to choose cooperation. As for $\Delta Q_{s_4}$, its value initially decreases and then gradually increases before stabilizing. This can be attributed to the fact that while choosing defection is beneficial when surrounded by cooperators, in the long run, selecting cooperation yields sustained and enduring benefits. Hence, $\Delta Q_{s_4}$ ultimately stabilizes around −2 as well, indicating that some individuals may still opt for cooperation.
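In the array layout assumed in our earlier sketches (action index 1 = C, 0 = D), this population-averaged preference can be measured in one line:

```python
# Average preference for C over D in each state s_0..s_4; shape (5,).
delta_Q = (Q[..., 1] - Q[..., 0]).mean(axis=(0, 1))
```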

The proportion of cooperation in different states
To monitor the evolution of the proportions of cooperators and defectors, we separately calculate the proportions of individuals choosing cooperation and defection in each state over time, as illustrated in figure 5. At the beginning, cooperators and defectors are randomly assigned, so there are few clusters of cooperators in the population, and most individuals are in state $s_2$, as depicted in figure 5(a). As time goes by, the proportions of cooperators in states $s_1$, $s_2$, $s_3$, and $s_4$ decrease, while the cooperator proportion in state $s_0$ temporarily increases. This can be attributed to the early stage of the Q-table evolution, during which the average relative magnitude $\Delta Q_{s_0}$ is close to 0. As a result, the differences between $Q^i_{s_j,C}$ and $Q^i_{s_j,D}$ are small, leading to approximately equal probabilities of selecting C or D. Consequently, despite the negative reward −1 associated with choosing C in state $s_0$, the historical effect (controlled by α) causes slight variations in the values of $Q^i_{s_j,C}$ and $Q^i_{s_j,D}$, resulting in instances where $Q^i_{s_j,C} > Q^i_{s_j,D}$. This, in turn, leads to a temporary accumulation of action C in that particular state. In the long run, however, individuals who choose action C experience continuous losses, leading to a shift towards action D and a further decrease in the proportion of individuals choosing cooperation in state $s_0$. In states $s_3$ and $s_4$, the proportion of individuals choosing action C continuously decreases until reaching stability, which results in a sharp reduction in the density of cooperators in the population. Conversely, in states $s_1$ and $s_2$, the proportion of cooperators experiences a temporary rebound: as long as there are cooperators among the neighbors, selecting action C yields positive rewards, which can lead to instances where $Q^i_{s_j,C} \geqslant Q^i_{s_j,D}$, prompting individuals to choose action C and causing a reversal in the Q-table.

Figure 5(b) shows the same evolution of the proportions, but for defectors. Initially, there are also few clusters of defectors in the population, with the majority of individuals being in state $s_2$. As time goes by, the proportions of defectors in states $s_0$ and $s_1$ increase until reaching a stable state. This is because there are fewer cooperators in their surroundings, and choosing defection yields more positive rewards for them. Consequently, the updates to $Q^i_{s_j,D}$ accelerate (reflected in the drops of $\Delta Q_{s_0}$ and $\Delta Q_{s_1}$), which leads to an increase in the proportion of defectors and an enhanced compactness of defector clusters. Simultaneously, the proportion of defectors in state $s_2$ first increases and then decreases until reaching a stable state. This turnover can be understood as follows: during the initial stages of evolution, selecting action D in state $s_2$ yields relatively high rewards. However, as the proportions of defectors in states $s_0$ and $s_1$ continue to increase and the number of defectors in the surroundings grows, the choice of D in state $s_2$ fails to bring a decent reward, so the corresponding proportion decreases until reaching a stable state. In contrast, the proportions of defectors in states $s_3$ and $s_4$ continuously decrease until reaching a stable state, because, over time, there are fewer instances of the population being in the environments represented by $s_3$ and $s_4$. This results in an increased compactness of defector clusters and the gradual dissolution of cooperator clusters, as shown in figure 3(f).
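The state-resolved proportions plotted in figure 5 can be tallied along the same lines as the earlier sketches; `proportions_by_state` below is a hypothetical helper of ours that reports, for each state, the fraction of the population in that state taking a given action:

```python
def proportions_by_state(actions, chosen=1):
    """Fraction of all individuals that are in state s_j AND take the
    given action (1 = C, 0 = D); returns an array of shape (5,)."""
    s = states(actions)
    return np.array([np.mean((s == j) & (actions == chosen)) for j in range(5)])
```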
To illustrate the proportions of cooperators and defectors within different states and at different stages more clearly, we present the individual proportions of choosing cooperation and defection in different time periods, as shown in figure 6. Here, $\Delta Q_j = Q^i_{s_j,C} - Q^i_{s_j,D}$; if $\Delta Q_j \geqslant 0$, the individual prefers cooperation in the given state, and otherwise prefers defection. It is evident that in the first $10^4$ MCS, the proportions of individuals choosing cooperation and defection within all five states are very close, as depicted in figures 6(a) and (b). After $10^4$ time steps (figure 6(c)), there is a significant difference in the probabilities of individuals choosing cooperation or defection. Additionally, the proportions of individuals in states $s_3$ and $s_4$ decrease noticeably, but within these states the proportions choosing cooperation and defection remain close, contributing to the emergence of cooperation. Therefore, although the proportion of individuals choosing defection is higher than that of individuals choosing cooperation in states $s_0$, $s_1$, and $s_2$, there are still some individuals who adopt cooperation, which is the reason why cooperation emerges.
Interestingly, we also find that as the number of cooperative neighbors increases, the difference between the proportions choosing cooperation and defection becomes small. From the benefit perspective, when there are more cooperators in the neighborhood, the average payoff is high regardless of one's own action. Thus, when updating the Q-table, the likelihood of a reversal between $Q^i_{s_j,C}$ and $Q^i_{s_j,D}$ increases, sustaining cooperation. From $10^4$ to $5 \times 10^5$ time steps, the cooperation evolution reaches a stable state, as shown in figure 6(d), which is similar to figure 6(c).
In brief, the mechanism for the emergence of donation can be summarized as follows. (1) Initially, there is no preference difference between choosing C and D in any state. The ensuing evolution of the Q-value differences exhibits similar drops in some states but with a time lag, which leaves space for some individuals to choose cooperation. (2) While defection always yields a higher reward than cooperation in a single round when there are cooperators nearby, in the long run the whole population gets nothing at all if everyone defects. Instead, individuals learn to choose cooperation from time to time, since this yields sustained and enduring benefits in the long run. It is the trade-off among historical influences, immediate benefits, and future expectations in Q-learning that makes this possible, and it also seemingly captures the psychological changes when we humans play this game.

The impact of two learning parameters
Finally, we investigate the impact of the two learning parameters (α, γ) on the level of cooperation by fixing r = 10 and ϵ = 0.01, as shown in figure 7. It is found that the level of cooperation increases as the discount factor γ increases and the learning rate α decreases.
In short, a smaller learning rate α, a larger discount factor γ, and a larger benefit-to-cost ratio r lead to a higher level of cooperation in the system. This finding is in line with previous studies [62,63]: when individuals appreciate both past experience (small α) and future rewards (large γ), the population is better able to cope with the dilemma it faces.
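A sweep like the one in figure 7 could be organized as in the following sketch; `run_simulation` is a hypothetical wrapper of ours around the `step` function sketched in section 2, with a shortened transient chosen here purely for illustration.

```python
import numpy as np

def run_simulation(r, alpha, gamma, eps=0.01, L=100,
                   transient=10**5, window=10**4, seed=0):
    """Hypothetical wrapper: iterate the step() update sketched in
    section 2 past a transient, then return the time-averaged f_C."""
    rng = np.random.default_rng(seed)
    actions = rng.integers(0, 2, size=(L, L))
    Q = rng.random(size=(L, L, 5, 2))
    for _ in range(transient):
        actions = step(Q, actions, r, alpha, gamma, eps, rng)
    total = 0.0
    for _ in range(window):
        actions = step(Q, actions, r, alpha, gamma, eps, rng)
        total += actions.mean()
    return total / window

# Phase diagram over (alpha, gamma) at fixed r = 10, as in figure 7.
phase = {(a, g): run_simulation(10.0, a, g)
         for a in np.round(np.arange(0.1, 1.0, 0.1), 1)
         for g in np.round(np.arange(0.0, 1.0, 0.1), 1)}
```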

Discussion and conclusion
Note that imitation learning and RL are two fundamentally different modes of learning. The former, rooted in evolutionary theory, focuses on replicating the strategies of high-payoff players among one's neighbors, while the latter, originating from computational algorithms, emphasizes a trial-and-error mechanism based on individual experience to maximize long-term benefits. In this work, we utilize the Q-learning algorithm to study the evolution of cooperation in the DG. Individuals adopting the Q-learning algorithm update their behavior through a Q-table whose states are tied to the number of cooperators around each agent. They therefore do not directly imitate their neighbors' behaviors, but instead adapt their strategies based on the environment.
Specifically, comparing with replicator dynamics under imitation learning, we find that cooperation is more likely to emerge in the RL framework. Analysis of the Q-table evolution reveals that the underlying mechanism behind cooperation lies in the fact that when the benefit-to-cost ratio r is high, the rewards associated with both actions are large. As time goes by, $\Delta Q_j = Q^i_{s_j,C} - Q^i_{s_j,D}$ undergoes a reversal, and some individuals continue to choose cooperation in different states, thereby sustaining cooperation within the population. Furthermore, we investigate the impact of other parameters on the emergence of cooperation, and find that a smaller learning rate α and a larger discount factor γ make cooperation more likely to occur. Besides, the Q-learning algorithm relies on the ability of self-reflection and exploration instead of imitating others. As a result, Q-learning facilitates the emergence of cooperation through the development of distinct spatial patterns, as opposed to the formation of cooperation clusters often observed in imitation learning [55].
Moreover, in contrast to existing works that combine RL with game theory [55,59,70], our model incorporates more environmental information (neighbors' decisions) to expand the state space, which enables individuals to tailor their responses to diverse environments more effectively, ultimately enhancing the overall performance of the population. It may provide a relatively simple explanation for the emergence of cooperation under this introspective learning mechanism. In addition, our model provides a possible mechanism for the emergence of social donation behavior based on individual interactions and feedback from the surrounding environment.

Figure 2. The proportion of cooperators $f_C$ as a function of the benefit-to-cost ratio r for two different learning methods. Each data point is averaged over 10 independent runs (ensemble averages), and in each run $f_C$ is averaged over $10^4$ time steps after a transient period of $10^6$ time steps. The black line represents the results under replicator dynamics, while the red line represents the results under Q-learning. Parameters in Q-learning: α = 0.1, γ = 0.9, and ϵ = 0.01.

Figure 4. The temporal evolution of the population average $\Delta Q_{s_j}$ in different states, where $\Delta Q_{s_j} = \frac{1}{N}\sum_{i=1}^{N}\big(Q^i_{s_j,C} - Q^i_{s_j,D}\big)$.

Table 1. Q-table for individual i. Rows correspond to the five possible states (number of cooperating neighbors) and columns to the two actions.

State    Action C          Action D
s_0      Q_{s_0,C}         Q_{s_0,D}
s_1      Q_{s_1,C}         Q_{s_1,D}
s_2      Q_{s_2,C}         Q_{s_2,D}
s_3      Q_{s_3,C}         Q_{s_3,D}
s_4      Q_{s_4,C}         Q_{s_4,D}