Inferring to cooperate: Evolutionary games with Bayesian inferential strategies

Strategies for sustaining cooperation and preventing exploitation by selfish agents in repeated games have mostly been restricted to Markovian strategies where the response of an agent depends on the actions in the previous round. Such strategies are characterized by lack of learning. However, learning from accumulated evidence over time and using the evidence to dynamically update our response is a key feature of living organisms. Bayesian inference provides a framework for such evidence-based learning mechanisms. It is therefore imperative to understand how strategies based on Bayesian learning fare in repeated games with Markovian strategies. Here, we consider a scenario where the Bayesian player uses the accumulated evidence of the opponent’s actions over several rounds to continuously update her belief about the reactive opponent’s strategy. The Bayesian player can then act on her inferred belief in different ways. By studying repeated Prisoner’s dilemma games with such Bayesian inferential strategies, both in infinite and finite populations, we identify the conditions under which such strategies can be evolutionarily stable. We find that a Bayesian strategy that is less altruistic than the inferred belief about the opponent’s strategy can outperform a larger set of reactive strategies, whereas one that is more generous than the inferred belief is more successful when the benefit-to-cost ratio of mutual cooperation is high. Our analysis reveals how learning the opponent’s strategy through Bayesian inference, as opposed to utility maximization, can be beneficial in the long run, in preventing exploitation and eventual invasion by reactive strategies.


Introduction
Every organism, from a microbe to a human, reacts to its environment with a degree of complexity that depends on its cognitive capability. Some learn to react so as to reap more benefit than others. Such optimal learning is selected over generations until, eventually, the progeny spend almost no time learning to react optimally to the environment; they react instinctively. What was a learnt strategy for the ancestors is now a genetically hardwired instinct for the selected descendants [1–7].
The specific reaction to a situation is contingent on the information about the situation that the organism gathers. In more formal terms, the organism develops a belief about the state of the environment, and its strategy on how to react is based on this belief, which can be expressed as a probability distribution over the possible states of the environment. Note that, even before observing the situation, the organism has some belief (called the prior belief) that transforms into an updated belief (called the posterior belief) in the light of new information. How a posterior belief emerges from a prior belief depends on the exact nature of the update rule.
The optimal update rule itself needs to be learnt and, if it is evolutionarily beneficial, it would be passed on to the progeny. It has been argued that Bayesian updating [8, 9] is a requirement of evolutionary optimality. While Bayesian updating is not the only form of updating seen in the natural world, there are plenty of examples [10, 11] where belief updating occurs in accordance with Bayes' rule. In passing, it is worth pointing out that Bayesian updating is a rational player's normative choice, since the player's actions should be probabilistically coherent so as not to incur losses [12, 13], which rational players always look to avoid. Moreover, an evolutionary basis for Bayesian rationality and for the evolution of Bayes-rational agents has also been put forward [14].
Of course, the simplest strategy for reacting is to randomly adopt an action (out of the set of all possible implementable actions) without even observing the situation at hand. A more sophisticated strategy would involve choosing an action with some probability, based on knowledge about the particular state of the environment in the immediate past. This may be called a reactive strategy [15, 16]. In the context of evolutionary games [17, 18], reactive strategies are most often encountered in the study of the evolution of cooperation [15, 19], which seeks to address the question of how altruistic behaviour can be sustained [20] despite involving a cost that makes the altruistic phenotype less suitable for selection than its selfish counterpart. The environment of a player playing an evolutionary game is the entire population consisting of all other players she can interact with.
A lot of work carried out over the last three decades has shed light on this problem by identifying the critical factors that can facilitate not just the survival but also the dominance of cooperators in an evolving population. A vast amount of literature on evolutionary game theory has focused on reactive and memory-one strategies [21–26], population structure [27–32], players' reputation together with social norms [33–35], and strategy update based on the pairwise comparison rule [36–38] to understand the evolution of cooperation in diverse scenarios. Most of the work in this domain can be broadly classified under the categories of direct and indirect reciprocity [39–48] in either well-mixed or structured populations. Such models focus on identifying strategies that depend only on the action of the interacting partner, or on the actions of both the focal player and her interacting partner (in the case of memory-one strategies), in the last round. Even when strategy update is carried out using the pairwise comparison model, preferential selection of strategies is based on a single factor, namely the payoff of a randomly selected neighbour. Such models constitute a very narrow set of possible strategies that players may employ in deciding whether to be altruistic or selfish. More importantly, these models are characterized by a lack of learning based on new evidence that results from repeated interactions, since the strategy-update rule is fixed at the outset.
In view of the above, it is quite natural to consider not only different strategy update rules [49–55] that are not just dependent on the payoff [51], but also the possibility that individuals may learn the appropriate rule (model for strategy update) from the experience gained through repeated interactions. In other words, a more complex strategy could involve adopting a strategy contingent on the player's belief about the state of her environment. This belief could further be continually updated based on the actual sequence of states observed in the past, until the true frequencies of environmental states are learnt. Such a strategy may be called a learning strategy.
If the learning strategy specifically calls for Bayes' updating rule, it may be called a Bayesian strategy. It is natural to ponder whether a Bayesian strategy is evolutionarily optimal compared to a reactive strategy. To the best of our knowledge, this important question has not yet been investigated in the context of evolutionary games (see, however, [50]). Unlike pairwise-comparison or reinforcement learning mechanisms [56–63] such as Bush-Mosteller [52, 64–69], the Bayesian strategy does not attempt to optimise the received payoff in every round but attempts to infer the reactive strategy of her opponent based on accumulating knowledge of the opponent's actions. In the process of doing so, the accumulated payoff of the Bayesian player in the long term can exceed the payoff of the opponent using a reactive strategy, under certain circumstances.
Within the paradigm of repeated games [15, 70–72], we consider a player employing a Bayesian strategy against an opponent employing an unknown but fixed reactive strategy. The Bayesian player then tries to infer the reactive strategy, i.e. the probabilities $p^{(r)}$ and $q^{(r)}$ with which the reactive player cooperates when her opponent cooperated (played C) and defected (played D), respectively, in the last round. The Bayesian player uses the action of her reactive interacting partner in each round to update her own beliefs about the $(p^{(r)}, q^{(r)})$ values of the partner's reactive strategy using Bayes' rule. A schematic of the belief update process of the Bayesian player is shown in figure 1. She then utilizes her updated belief about the interacting player's $(p^{(r)}, q^{(r)})$ values to tune her own response in three different ways. In the first scenario, we assume that the Bayesian player adopts the maximum $(p^{\max}, q^{\max})$ of her updated belief about the $(p^{(r)}, q^{(r)})$ values of the reactive opponent as her own strategy in the next round. We call such a strategy Bayesian tit-for-tat (BTFT). This ability of an organism to choose a behaviour with a probability equal to the Bayesian-estimated maximum a posteriori (MAP) probability, called probability matching, has been well documented in several animal behaviour experiments [10, 11, 73–77]. In an alternative scenario, the Bayesian player adopts a strategy that is less reciprocal than BTFT, i.e. she chooses to cooperate with a probability that is less than the inferred probabilities of cooperation $(p^{\max}, q^{\max})$ of the BTFT player. In another alternative scenario, the Bayesian player adopts a strategy that is more generous than BTFT, i.e. she chooses to cooperate with a probability that is more than the inferred probabilities of cooperation $(p^{\max}, q^{\max})$ of the BTFT player. The comparative effects of reciprocity and generosity on the evolution of cooperation are a problem of interest to researchers in evolutionary game theory [78–81].

Figure 1. Schematic of the belief update process. The reactive opponent plays a fixed strategy $(p^{(r)}, q^{(r)})$ that is unknown to the Bayesian player. The Bayesian player's objective is to infer the strategy of the reactive player from the latter's actions over several rounds. The Bayesian player is depicted in green and the reactive opponent is denoted by an emoticon. In the first round, the reactive player's action is C while the Bayesian player's action is D. In response, the reactive player defects (D) in round 2 (with probability $1 - q^{(r)}$). The Bayesian player uses the reactive opponent's action at the end of round 1 as evidence to update her belief $(p'_2, q'_2)$ about the reactive opponent's strategy using Bayes' rule. The Bayesian player then uses her updated belief $(p'_2, q'_2)$ to cooperate (C) with the reactive opponent (with probability $p'_2$ in round 2). After round 2, where the reactive player defects, the Bayesian player again uses Bayes' rule to update her belief by determining the maximum $(p'_3, q'_3)$ of a new posterior distribution that uses the posterior distribution estimated at the end of round 1 as prior. She then uses this inferred belief to cooperate (C) with the reactive opponent with probability $q'_3$ in response to the reactive opponent's action (D) in round 2. This updating process continues over rounds (see section 2 for details), with the prior being updated to the posterior distribution at the beginning of each round.
Under the Darwinian paradigm, all the players adopt an action (or a probabilistic mixture of actions) that makes the population evolutionarily resilient against invasion by a mutant with a different action. Such a robust action is known as an evolutionarily stable strategy (ESS) [17, 82]. Here, we study a repeated two-player game between a Bayesian player and a reactive player and ascertain the conditions for strategies to be ESS. Our results show that the success of Bayesian strategies in preventing invasion by opponents with reactive strategies depends on the manner in which the Bayesian player responds to her continuously updated belief about the opponent's reactive strategy.

Model
We consider a repeated two-player, two-action game where each player can play with a reactive strategy or a Bayesian strategy. The two actions of the underlying game are taken as cooperation (C) and defection (D). The reactive strategy is defined by $(p^{(r)}, q^{(r)}, c_1)$. Here $p^{(r)}$ and $q^{(r)}$ are the probabilities of cooperation in the current round of the game given that her opponent cooperated and defected, respectively, in the previous round, while $c_1$ is the reactive player's initial (first round) probability of cooperation.
A player with a Bayesian strategy attempts to infer the true probabilities of cooperation ($p^{(r)}$ and $q^{(r)}$) of the opponent by taking into account the opponent's actions over several rounds of the game. In order to do so, the Bayesian player has to continuously update her belief about the reactive strategy on the basis of evidence obtained in the form of the action ($A \in \{C, D\}$) of the opponent in each round. This update is done using Bayes' rule, which requires a prior probability distribution of the beliefs of the Bayesian inferential player about the reactive strategy adopted by her opponent. We assume a uniform prior distribution $P_1(p, q)$, and at the end of each round $n$ the Bayesian player updates her belief $(p, q) \in [0, 1] \times [0, 1]$ about her opponent's strategy from the estimated posterior probability distribution,

$$P_n(p, q \mid A_n) = \frac{P(A_n \mid p, q)\, P_n(p, q)}{\sum_{\tilde{p}, \tilde{q}} P(A_n \mid \tilde{p}, \tilde{q})\, P_n(\tilde{p}, \tilde{q})}. \qquad (1)$$

The subscript $n$ alludes to the quantities estimated in the $n$th round. The likelihood, $P(A_n \mid p, q)$, is chosen to be

$$P(A_n \mid p, q) = \begin{cases} p\, c'_{n-1} + q\,(1 - c'_{n-1}), & A_n = C, \\ (1 - p)\, c'_{n-1} + (1 - q)(1 - c'_{n-1}), & A_n = D, \end{cases} \qquad (2)$$

where $c'_n$ is the probability of cooperation of the Bayesian player in the $n$th round. Probabilities $p$ and $q$ are sampled from the set of all possible values lying between 0 and 1. That this choice of likelihood function is apposite is clear from the fact [16, 71] that a reactive player with $p^{(r)} = p$ and $q^{(r)} = q$ plays action $A$ in the $n$th round with the probability given by equation (2) when the opponent's cooperation probability is $c'_{n-1}$ in the previous round. For round $n = 1$, the expression for the likelihood requires specification of $c'_0$, for which we set $c'_0 = c'_1$, the initial probability of cooperation of the Bayesian player. The Bayesian player determines the global maximum $(p^{\max}_n, q^{\max}_n)$, i.e. the maximum a posteriori (MAP) estimate, of the updated posterior distribution $P_n(p, q \mid A_n)$ at the end of the $n$th round, which then constitutes her updated belief about the true $(p^{(r)}, q^{(r)})$ values of her reactive opponent. If the posterior distribution has multiple maxima, the Bayesian player selects any of them at random. Subsequently, she uses a function of her updated belief as her new strategy in the next round, that is,

$$(p'_{n+1}, q'_{n+1}) = \left(f(p^{\max}_n),\, g(q^{\max}_n)\right).$$

In this paper, we consider three possible functional forms of $(f, g)$, as defined later. Thus, the evolving strategy of the Bayesian player is determined by her belief $(p, q)$ about the reactive opponent's strategy. This update is recursive; the posterior distribution calculated in the $n$th round becomes the prior in the $(n + 1)$th round, $P_{n+1}(p, q) = P_n(p, q \mid A_n)$, $A_n$ being the action of the reactive player in the $n$th round. Through this updating process, the Bayesian player eventually infers the reactive opponent's true strategy. Had the opponent also been a Bayesian player, the focal Bayesian player would still implement Bayes' rule as given above to update her strategy, even though a true belief about the opponent's strategy does not exist in such an interaction.
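The recursive update described above can be illustrated with a minimal sketch. It assumes a discrete belief grid with spacing 0.1 and uses the likelihood form implied by the text, in which a reactive (p, q) player cooperates with probability p c' + q (1 - c') when the Bayesian player cooperated with probability c' in the previous round. The names `GRID`, `likelihood`, `uniform_prior`, `bayes_update` and `map_estimates` are hypothetical and are not taken from the authors' code.

```python
GRID = [i / 10 for i in range(11)]  # assumed discretisation: 0.0, 0.1, ..., 1.0


def likelihood(action, p, q, c_prev):
    """Probability that a reactive (p, q) player plays `action` when the
    Bayesian player cooperated with probability c_prev in the previous round."""
    coop = p * c_prev + q * (1.0 - c_prev)
    return coop if action == "C" else 1.0 - coop


def uniform_prior():
    """Equal weight on every (p, q) grid point."""
    n = len(GRID)
    return [[1.0 / (n * n)] * n for _ in range(n)]


def bayes_update(prior, action, c_prev):
    """One application of Bayes' rule; the result serves as the next prior."""
    unnorm = [[likelihood(action, p, q, c_prev) * prior[i][j]
               for j, q in enumerate(GRID)]
              for i, p in enumerate(GRID)]
    total = sum(sum(row) for row in unnorm)
    return [[w / total for w in row] for row in unnorm]


def map_estimates(posterior, tol=1e-12):
    """All grid points attaining the posterior maximum (ties are possible)."""
    peak = max(max(row) for row in posterior)
    return [(GRID[i], GRID[j])
            for i, row in enumerate(posterior)
            for j, w in enumerate(row) if peak - w <= tol]


posterior = bayes_update(uniform_prior(), "D", 0.5)
print(map_estimates(posterior))  # a single observed defection puts the MAP at (0.0, 0.0)
```

Returning every maximiser, rather than a single one, leaves the random tie-breaking mentioned in the text to the caller.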

Strategies
The Bayesian player can adopt her strategy in many ways using her belief, but below we describe three intuitive strategies, viz., Bayesian Tit-for-tat, non-reciprocal Bayesian Tit-for-tat, and generous Bayesian Tit-for-tat.
Bayesian Tit-for-tat (BTFT): An obvious choice of a new strategy would be one where the Bayesian player just adopts her updated belief as her new probabilities of cooperation in the next round. Mathematically,

$$p'_{n+1} = p^{\max}_n, \qquad q'_{n+1} = q^{\max}_n.$$

We therefore call such a strategy the Bayesian TFT (BTFT) strategy.

Non-reciprocal Bayesian Tit-for-tat (NBTFT): In some cases, the player may be less cooperative than suggested by her own Bayesian updated belief $(p, q)$ about her opponent's strategy, due to her own self-interest. We call such a class of strategies non-reciprocal BTFT (NBTFT) strategies. If the parameter $\nu$ measures the extent to which the Bayesian player is less cooperative than a BTFT player, then her updated strategy in the subsequent round becomes

$$p'_{n+1} = (1 - \nu)\, p^{\max}_n, \qquad q'_{n+1} = (1 - \nu)\, q^{\max}_n,$$

where $\nu \in [0, 1]$. If $\nu = 0$, then the player is reciprocal, and we get back the BTFT strategy. If $\nu = 1$, the player is maximally non-reciprocal, which amounts to the Always-defect (ALLD) strategy.
Generous Bayesian Tit-for-tat (GBTFT): On the other hand, the player may decide to defect less than suggested by her belief about the opponent's strategy. We call such a class of strategies generous BTFT (GBTFT). If the generosity parameter $\gamma$ measures the extent to which the Bayesian player is less selfish than a BTFT player, then her updated probabilities of defection in the subsequent round are

$$1 - p'_{n+1} = (1 - \gamma)(1 - p^{\max}_n), \qquad 1 - q'_{n+1} = (1 - \gamma)(1 - q^{\max}_n),$$

where $\gamma \in [0, 1]$. When $\gamma = 0$, the BTFT strategy is recovered. On the other hand, if $\gamma = 1$, the player is maximally generous, which amounts to the Always-cooperate (ALLC) strategy.
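The three response rules can be sketched as small functions. This sketch assumes the simplest forms compatible with the limits stated above: NBTFT scales the inferred cooperation probabilities by (1 - ν), and GBTFT scales the inferred defection probabilities by (1 - γ). Other functional forms with the same limits are conceivable; the linear choice and the function names are assumptions made for illustration.

```python
def btft(p_max, q_max):
    """BTFT: adopt the MAP belief unchanged."""
    return p_max, q_max


def nbtft(p_max, q_max, nu):
    """NBTFT (assumed linear form): cooperate less than the belief suggests."""
    if not 0.0 <= nu <= 1.0:
        raise ValueError("nu must lie in [0, 1]")
    return (1.0 - nu) * p_max, (1.0 - nu) * q_max


def gbtft(p_max, q_max, gamma):
    """GBTFT (assumed linear form): defect less than the belief suggests."""
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    return (1.0 - (1.0 - gamma) * (1.0 - p_max),
            1.0 - (1.0 - gamma) * (1.0 - q_max))


belief = (0.8, 0.4)
print(nbtft(*belief, nu=1.0))     # ALLD limit: (0.0, 0.0)
print(gbtft(*belief, gamma=1.0))  # ALLC limit: (1.0, 1.0)
```

Both limiting cases reproduce the ALLD and ALLC strategies named in the text, and setting either parameter to zero returns the BTFT response.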

Payoff
We consider a two-player, two-action game where the players have two choices: cooperate or defect. They receive a reward R for mutual cooperation, whereas mutual defection yields the payoff P. If a player cooperates and the other defects, then the cooperating player gets the payoff S, and the defecting player receives the payoff T. Hence, the resulting payoff matrix (rows: action of the focal player; columns: action of the opponent) can be expressed as

$$\begin{array}{c|cc} & C & D \\ \hline C & R & S \\ D & T & P \end{array}$$

In this game, the degree of cooperation is influenced by the strength of the dilemma, which can be characterized by two scaling parameters [83–86]: the strength of the gamble-intending dilemma, denoted as $D'_g \equiv (T - R)/(R - P)$, and the strength of the risk-averting dilemma, denoted as $D'_r \equiv (P - S)/(R - P)$. These two scaling parameters classify a two-player game into four categories: Prisoner's dilemma (PD), Snowdrift (Hawk-Dove) game, Stag Hunt, and Harmony game. The game is PD when both the parameters are positive. In contrast, the game is Harmony if both are negative. If $D'_g$ is positive but $D'_r$ is negative, the game is Hawk-Dove. On the other hand, if $D'_g$ is negative but $D'_r$ is positive, then the game is Stag Hunt. Note that if both are equal and positive, the game is known as the donor and recipient game (DR game), a special form of the PD game.
We consider here the underlying game to be a DR game, conveniently characterized by two parameters [20, 87], b and c, as follows. If a player cooperates, she incurs a cost c and gets a benefit of either b (if the opponent cooperates) or 0 (if the opponent defects). If a player defects, she gets a benefit b without incurring any cost when the opponent cooperates, but gets nothing when the opponent defects. For convenience, we re-parameterise the resulting payoff matrix by dividing all payoff elements by c. The effective payoff matrix, thus, is

$$\begin{array}{c|cc} & C & D \\ \hline C & r - 1 & -1 \\ D & r & 0 \end{array}$$

where $r \equiv b/c$ is the benefit-to-cost ratio for cooperating. Here, the scaling parameters are $D'_g = D'_r = 1/(r - 1)$. As is customary, the repeated PD game needs to satisfy two conditions, $T > R > P > S$ and $2R > T + S$, which are automatically satisfied in the above form because $r > 1$.
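As a quick check of the rescaled donor-recipient payoffs, the following sketch computes the two dilemma strengths and tests the repeated-PD conditions using exact rational arithmetic. The function names are illustrative and not part of the original work.

```python
from fractions import Fraction


def dr_payoffs(r):
    """(R, S, T, P) of the donor-recipient game after dividing by the cost c."""
    r = Fraction(r)
    return r - 1, Fraction(-1), r, Fraction(0)


def dilemma_strengths(R, S, T, P):
    """Gamble-intending and risk-averting dilemma strengths (D'_g, D'_r)."""
    return (T - R) / (R - P), (P - S) / (R - P)


def is_repeated_pd(R, S, T, P):
    """Conditions T > R > P > S and 2R > T + S."""
    return T > R > P > S and 2 * R > T + S


for r in (2, 10):
    payoffs = dr_payoffs(r)
    print(r, dilemma_strengths(*payoffs), is_repeated_pd(*payoffs))
```

For any r > 1 both strengths equal 1/(r - 1), so the dilemma weakens as the benefit-to-cost ratio grows.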
The shadow of the future [88] is practically inevitable in repeated games, and the standard way of addressing this is to consider the payoff in every subsequent interaction to be discounted by a multiplicative factor, $\delta \in [0, 1]$, called the discount factor. Thus, the accumulated payoff of the player at the end of the game, i.e. when $n = n_f$, is given by [16]

$$\pi(\delta, n_f) = \sum_{n=1}^{n_f} \delta^{n-1} u_n. \qquad (8)$$

Here $u_n \in \{r, -1, r - 1, 0\}$ is the payoff of the focal player in the $n$th round. In all our simulations, $n_f$ is fixed at $n_f = 700$.

As players in a population interact at random, a focal player can meet either a reactive player or a Bayesian player. Therefore, three distinct types of interaction are present in the population: reactive-reactive, reactive-Bayesian, and Bayesian-Bayesian. Hence one can write the payoff matrix,

$$\Pi = \begin{pmatrix} \pi_{RR} & \pi_{RB} \\ \pi_{BR} & \pi_{BB} \end{pmatrix}, \qquad (9)$$

to model the strategic competition between reactive (R) and Bayesian (B) strategies, where $\pi_{XY}$ denotes the accumulated payoff of a focal player of type X against an opponent of type Y. To simplify our notation, we suppress the arguments of the function $\pi$ (see equation (8)) wherever there is no ambiguity.
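The discounted accumulation of stage payoffs can be sketched as follows, assuming that the payoff of round n carries weight δ^(n-1) and that stage payoffs follow the rescaled donor-recipient game. The helper names `stage_payoff` and `accumulated_payoff` are hypothetical.

```python
def stage_payoff(own, other, r):
    """Rescaled donor-recipient payoff: gain r if the other player cooperates,
    pay a cost of 1 if the focal player cooperates."""
    return (r if other == "C" else 0.0) - (1.0 if own == "C" else 0.0)


def accumulated_payoff(own_actions, other_actions, r, delta):
    """Sum of stage payoffs, with round n weighted by delta**(n - 1)."""
    total, weight = 0.0, 1.0
    for own, other in zip(own_actions, other_actions):
        total += weight * stage_payoff(own, other, r)
        weight *= delta
    return total


# Mutual cooperation for 700 rounds at r = 2 and delta = 0.99.
n_f = 700
print(accumulated_payoff("C" * n_f, "C" * n_f, r=2.0, delta=0.99))
```

With δ = 0.99 the weights sum to (1 - 0.99^700)/0.01, slightly below the infinite-horizon value of 100, which gives a sense of how little is lost by truncating the game at 700 rounds.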

Numerical method
In order to determine how a Bayesian strategy fares against an arbitrary reactive strategy, we allow the Bayesian player to play a repeated PD game against a large set of reactive strategies uniformly spread across the entire reactive strategy (p-q) space. The p-q space is uniformly divided into 51 × 51 grid points. We numerically follow games played at every grid point. Three distinct kinds of interaction between the players at a grid point are possible, depending on the nature of the interacting partners: reactive-reactive, reactive-Bayesian, and Bayesian-Bayesian.
When the focal player's strategy is reactive, i.e. $(p^{(r)}, q^{(r)}, c_1)$, we consider repeated games with reactive-reactive and reactive-Bayesian interactions to determine the payoff to the focal reactive player. When the focal player is Bayesian, games with both Bayesian-reactive and Bayesian-Bayesian interactions are considered to determine the payoff to the focal Bayesian player. We use a uniform prior distribution at $n = 1$, represented by an 11 × 11 matrix where each matrix element is assigned equal probability. In other words, in our simulations, p and q are respectively sampled in steps of 0.1 from the ranges 0 ⩽ p ⩽ 1 and 0 ⩽ q ⩽ 1. The posterior distribution is then estimated at the end of each round using equation (1). Therefore, the posterior distribution also becomes a distribution over 11 × 11 grid points. Figure 2 illustrates the convergence of the Bayesian player's belief, indicated by the peak of the posterior distribution, towards the true reactive strategy. As expected, the convergence improves with the number of rounds. Each of the four types of focal-opponent pairings is repeated over $n_f = 700$ rounds to give the cumulative payoff for each pairwise interaction. The final payoffs $\pi_{RR}(\delta, n_f)$, $\pi_{RB}(\delta, n_f)$, $\pi_{BR}(\delta, n_f)$, and $\pi_{BB}(\delta, n_f)$ shown in figure 3 are calculated by averaging over $10^4$ independent trials. Even though the time-evolution of the payoff is expected to be noisy, averaging over $10^4$ trials ensures smooth curves in figure 3 that appear to be devoid of any fluctuations.
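A single trial of the Bayesian-reactive interaction might be simulated along the following lines. This is a sketch under several assumptions that the text does not fully pin down: both players use the same first-round cooperation probability c1, the likelihood for round n uses the Bayesian player's cooperation probability from round n - 1 (with the first two rounds sharing c1), and the Bayesian player follows BTFT. All names are hypothetical and the code is not the authors' implementation.

```python
import random

GRID = [i / 10 for i in range(11)]  # 11 x 11 belief grid, as in the text


def update(belief, action, c_prev):
    """Bayes' rule on the grid; belief[i][j] is the weight on (GRID[i], GRID[j])."""
    new = []
    for i, p in enumerate(GRID):
        row = []
        for j, q in enumerate(GRID):
            coop = p * c_prev + q * (1.0 - c_prev)
            row.append(belief[i][j] * (coop if action == "C" else 1.0 - coop))
        new.append(row)
    total = sum(map(sum, new))
    return [[w / total for w in row] for row in new]


def map_point(belief, rng):
    """Posterior maximum; ties are broken at random."""
    peak = max(map(max, belief))
    ties = [(GRID[i], GRID[j]) for i in range(11) for j in range(11)
            if peak - belief[i][j] <= 1e-12]
    return rng.choice(ties)


def play_btft_vs_reactive(p_r, q_r, r=2.0, delta=0.99, rounds=700,
                          c1=0.5, seed=0):
    """Return (Bayesian payoff, reactive payoff, final MAP belief) for one trial."""
    rng = random.Random(seed)
    belief = [[1.0 / 121] * 11 for _ in range(11)]
    payoff_b = payoff_r = 0.0
    weight = 1.0
    c_b = c_r = c1   # first-round cooperation probabilities (assumed equal)
    c_b_prev = c1    # convention c'_0 = c'_1
    p_b, q_b = c1, c1
    for _ in range(rounds):
        a_b = "C" if rng.random() < c_b else "D"
        a_r = "C" if rng.random() < c_r else "D"
        payoff_b += weight * ((r if a_r == "C" else 0.0) - (a_b == "C"))
        payoff_r += weight * ((r if a_b == "C" else 0.0) - (a_r == "C"))
        weight *= delta
        belief = update(belief, a_r, c_b_prev)
        p_b, q_b = map_point(belief, rng)   # BTFT: adopt the MAP belief
        c_b_prev = c_b
        c_b = p_b if a_r == "C" else q_b    # Bayesian response in the next round
        c_r = p_r if a_b == "C" else q_r    # reactive response in the next round
    return payoff_b, payoff_r, (p_b, q_b)


print(play_btft_vs_reactive(0.7, 0.2))
```

Averaging the returned payoffs over many seeds, and repeating this at every point of the reactive strategy grid, would mirror the procedure described above; NBTFT or GBTFT would only change the line that turns the MAP belief into the next-round strategy.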
In passing, we remark that, in line with the central limit theorem, the standard deviation about any of the average payoff elements is of the order $\sigma/\sqrt{10^4}$, where $\sigma$ is the standard deviation of the independent identical random payoffs. We have found that $\sigma \sim 1$, implying $\sigma/\sqrt{10^4} \sim 10^{-2}$. Hence, we round off the average payoffs corresponding to each strategic interaction to the second decimal place. Furthermore, we have fixed $c_1 = 0.5$, in line with the principle of insufficient reason [9]. We choose one value of δ very close to unity (specifically, δ = 0.99) to somewhat negate the shadow of the future since, as we will see later, it facilitates certain analytical estimations. To see the effect of the discount factor δ on our simulation results, we compared results obtained for δ = 0.75 with those for δ = 0.99 (see figures 4-7 and 9).

Results
Evidently, the purpose of the numerics described above is to calculate Π (see equation (9)) for play at various points of the reactive strategy space. The central idea of this paper is to use these payoff matrices to ascertain the comparative efficiency of the Bayesian strategy. To this end, one can envisage an unstructured population of randomly matched players. The success of the Bayesian strategy against any reactive strategy in a population is determined by the ability of the former to avoid being invaded by a mutant reactive strategy. Therefore, we determine the conditions under which either the Bayesian strategy or the reactive strategy is an ESS and how those conditions are affected by the nature of the reactive strategy, the nature of the update rule (BTFT, NBTFT, or GBTFT) adopted by the Bayesian strategy, the benefit-to-cost ratio of cooperation, and the discount factor (δ). In such an investigation, however, we must distinguish between the cases of finite and infinite populations: in the former case, a single mutant may invade the host monomorphic population, whereas in the latter case, an infinitesimal fraction of mutants is required to invade a non-ESS host population strategy. Accordingly, the definition of ESS differs between finite populations [89] and infinite populations [17].
Therefore, in what follows, we succinctly present our results using what we term an ESS phase diagram. An ESS phase diagram is a pictorial representation showing which strategy is an ESS in which region of the reactive strategy space. The regions with the Bayesian strategy as an exclusive ESS, the reactive strategy as an exclusive ESS, and both strategies as ESS are marked with different colours. The region where a mixed ESS exists, i.e. where a certain non-zero fraction of the population plays Bayesian and the remaining fraction adopts the reactive strategy, is denoted using a colour gradient such that its intensity (redness) indicates the frequency of the reactive strategy. Obviously, the mixed ESS should be absent in the finite population scenario.

Infinite population
Before we start discussing the ESS phase diagrams generated from our simulation, let us recall the condition of ESS for a given payoff matrix Π [18]. The reactive strategy is an ESS in an infinite population if (a) $\pi_{RR} > \pi_{BR}$ or (b) $\pi_{RR} = \pi_{BR}$ and $\pi_{RB} > \pi_{BB}$; similarly, the Bayesian strategy is an ESS if (a) $\pi_{BB} > \pi_{RB}$ or (b) $\pi_{BB} = \pi_{RB}$ and $\pi_{BR} > \pi_{RR}$. Finally, a mixed ESS is implied by the condition $\pi_{RR} < \pi_{BR}$ and $\pi_{BB} < \pi_{RB}$.
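The conditions above translate directly into a small classifier for one grid point of an ESS phase diagram. The sketch below also returns the reactive frequency at a mixed ESS using the standard two-strategy interior equilibrium, x* = (π_RB - π_BB) / ((π_RB - π_BB) + (π_BR - π_RR)); that formula is a textbook replicator-dynamics result rather than something stated in the text, and the function names are hypothetical.

```python
def classify_ess(pi_rr, pi_rb, pi_br, pi_bb):
    """Label a payoff matrix by which strategy (if any) is an ESS."""
    reactive = pi_rr > pi_br or (pi_rr == pi_br and pi_rb > pi_bb)
    bayesian = pi_bb > pi_rb or (pi_bb == pi_rb and pi_br > pi_rr)
    if reactive and bayesian:
        return "both"
    if reactive:
        return "reactive"
    if bayesian:
        return "bayesian"
    if pi_rr < pi_br and pi_bb < pi_rb:
        return "mixed"
    return "neutral"  # remaining degenerate ties


def mixed_reactive_frequency(pi_rr, pi_rb, pi_br, pi_bb):
    """Reactive share at the interior equilibrium (valid for the mixed case)."""
    gain_r = pi_rb - pi_bb  # advantage of a reactive mutant among Bayesians
    gain_b = pi_br - pi_rr  # advantage of a Bayesian mutant among reactives
    return gain_r / (gain_r + gain_b)


print(classify_ess(1.0, 3.0, 2.0, 0.0))              # mixed
print(mixed_reactive_frequency(1.0, 3.0, 2.0, 0.0))  # 0.75
```

The exact equality tests are meaningful only when payoffs are rounded to a fixed precision first, as is done in the numerical method with two decimal places.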

BTFT
In this subsection, we consider the competition between the reactive and BTFT strategies and analyse the ESS phase diagram for a given discount factor and benefit-to-cost ratio. It is clear that the diagram depends on the nature of the reactive strategy competing with the Bayesian strategy. For low values of the benefit-to-cost ratio (r) and the discount factor (δ) (see figure 4(a)), a host population of reactive strategy players can prevent invasion by an infinitesimal fraction of a mutant Bayesian strategy as long as p + q < 1 and p ≲ 0.75. Similarly, the Bayesian strategy is an ESS, and can therefore prevent invasion by any reactive strategy, as long as p + q > 1 and q ≳ 0.25. In this region, a BTFT mutant can invade a reactive strategy.
Increasing either the benefit-to-cost ratio (r) or the discount factor (δ), or both, has similar effects, as can be seen by comparing figures 4(b) and (d) with figure 4(a). Moreover, for a certain range of (p, q) values lying in the region p + q < 1, both the reactive and Bayesian strategies are ESSs, but the size and location of this region vary as r and δ change. Another region, shown in yellow and corresponding to p + q > 1, is characterized by the stable coexistence of the reactive and BTFT strategies: neither of the strategies is an exclusive ESS; rather, a mixed ESS exists.
During the process of inferring the reactive opponent's true strategy, the Bayesian player samples many different (p, q) values as she acquires evidence based on the opponent's actions. While the region of reactive space from which (p, q) values are sampled eventually becomes restricted as evidence accumulates over an increasing number of rounds, initially, when evidence is sparse, the region can be large. The exact stochastic trajectory of the Bayesian player in the reactive space is analytically intractable. However, given the sharp phase boundaries in the ESS phase diagrams, an explanation of why they appear as they do is worth uncovering. Moreover, there are a few additional intriguing features of the ESS phase diagram that are worth understanding, e.g. ALLD may not invade BTFT (see figures 4(b) and (d)), and ALLC is not completely eliminated by a mutant BTFT but can coexist with the latter.
To this end, we present a useful ansatz, in the limit δ → 1 and $n_f \to \infty$: it is helpful to think of the BTFT as an effective reactive strategy with p = q = 0.5. This ansatz is motivated by the fact that the BTFT strategy is likely to sample a large number of (p, q) values during the process of updating her belief. Initially, the Bayesian update rule may not be efficient in inferring the true (p, q) values of the opponent's reactive strategy since evidence is sparse. Hence, the sampled values of both p and q are likely to be equally distributed between 0 and 1, leading to an average of 0.5 for each. We will see that even though this ansatz cannot explain all aspects of the complex evolutionary dynamics between the Bayesian and reactive strategies, it is successful in explaining the aforementioned specific features of the dynamics. It is worth pointing out that the effectiveness of the ansatz is owed to the early dynamics, when the BTFT strategy is sampling the strategy space while updating her beliefs but has not yet converged on to the opponent's reactive strategy. During the latter phase of the strategy update dynamics (i.e. when the BTFT strategy is close to the reactive strategy), the contributions to the accumulated payoffs of both BTFT and reactive strategies are almost the same and, hence, not significant in deciding which one is an ESS.

Now we note that any positive affine transformation of the payoff matrix, equation (9), keeps the ESS invariant. Consequently, the condition for ESS is not affected if we consider the payoff matrix $\bar{\Pi} \equiv (1 - \delta)\Pi$, obtained from equation (9) by dividing each element of the matrix by the factor $(1 - \delta)^{-1}$, the expected length of the repeated game when $n_f \to \infty$; each element of $\bar{\Pi}$ denotes the average payoff of the corresponding interaction. When δ = 1, in the limit $n_f \to \infty$, the average payoff can be written as $\bar{\pi} \equiv \lim_{n_f \to \infty} (1/n_f) \sum_{n=1}^{n_f} u_n$ [71]. It is convenient to calculate the average payoffs of the player with the reactive strategy and of the Bayesian player modelled (using our ansatz) as a reactive player with the strategy (p = 0.5, q = 0.5) by merely recalling the standard result [16]: when two players with reactive strategies $S \equiv (p^{(r)}, q^{(r)})$ and $S' \equiv (p'^{(r)}, q'^{(r)})$ play against each other, the payoff on playing S against S′ is given by

$$\bar{\pi}(S, S') = r\, c'^{\infty} - c^{\infty}. \qquad (10)$$

Here, the superscript ∞ represents the limiting stationary cooperation probability, given by $c^{\infty} = [q^{(r)} + (p^{(r)} - q^{(r)})\, q'^{(r)}] / [1 - (p^{(r)} - q^{(r)})(p'^{(r)} - q'^{(r)})]$ for the strategy S and by the same expression with primed and unprimed quantities interchanged for $c'^{\infty}$ of the strategy S′. When S plays against itself, this reduces to $c^{\infty} = q^{(r)}/(1 - p^{(r)} + q^{(r)})$.

Now we are equipped to explain the features pointed out earlier. First, let us focus on the phase boundaries. We have, at any grid point of the reactive space, two strategies: $R = (p^{(r)} = p, q^{(r)} = q)$ and $B = (p'^{(r)} = 0.5, q'^{(r)} = 0.5)$. The reactive strategy R is an ESS if $\bar{\pi}_{RR} > \bar{\pi}_{BR}$ which, owing to equation (10), leads to the inequality $r(p - q)(p + q - 1) > (p + q - 1)$, i.e.

$$\text{either } p + q > 1 \text{ and } q < p - \frac{1}{r}, \quad \text{or } p + q < 1 \text{ and } q > p - \frac{1}{r}. \qquad (11)$$
This estimate is very promising, as is evident from figures 4(b) and (d): the regions (red and blue) with the reactive strategy as ESS satisfy inequalities (11). The lines p + q = 1 (dashed white) and q = p − 1/r (solid white) are almost precise estimates of the phase boundaries.
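The phase-boundary estimate can be cross-checked numerically. The sketch below uses the standard stationary cooperation probabilities of two reactive strategies playing each other, c = (q + (p - q) q') / (1 - (p - q)(p' - q')), together with the rescaled donor-recipient payoff r c' - c, and compares the ESS condition under the (0.5, 0.5) ansatz with the two boundary lines p + q = 1 and q = p - 1/r. Exact fractions avoid rounding issues on the boundaries; all function names are illustrative.

```python
from fractions import Fraction


def stationary_cooperation(s, s_other):
    """Limiting cooperation probability of reactive strategy s against s_other."""
    (p, q), (p2, q2) = s, s_other
    return (q + (p - q) * q2) / (1 - (p - q) * (p2 - q2))


def average_payoff(s, s_other, r):
    """Average per-round payoff of s against s_other (benefit r, cost 1)."""
    return (r * stationary_cooperation(s_other, s)
            - stationary_cooperation(s, s_other))


def reactive_is_ess(p, q, r):
    """Strict ESS condition with BTFT replaced by the (1/2, 1/2) ansatz."""
    reactive = (p, q)
    ansatz = (Fraction(1, 2), Fraction(1, 2))
    return (average_payoff(reactive, reactive, r)
            > average_payoff(ansatz, reactive, r))


def boundary_condition(p, q, r):
    """Region bounded by the lines p + q = 1 and q = p - 1/r."""
    return (p + q > 1 and q < p - 1 / r) or (p + q < 1 and q > p - 1 / r)


r = Fraction(2)
grid = [Fraction(i, 50) for i in range(51)]  # 51 x 51 grid, as in the text
mismatches = [(p, q) for p in grid for q in grid
              if (p - q) ** 2 != 1  # skip (1, 0) and (0, 1): self-play limit undefined
              and reactive_is_ess(p, q, r) != boundary_condition(p, q, r)]
print(len(mismatches))  # 0
```

The two corner strategies (1, 0) and (0, 1) are excluded because their self-play has no unique stationary cooperation level in the undiscounted limit.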
In order to show the coexistence of ALLC and BTFT, we must show that neither ALLC nor BTFT is an ESS, i.e. $\bar{\pi}_{RR} < \bar{\pi}_{BR}$ and $\bar{\pi}_{BB} < \bar{\pi}_{RB}$. While we just calculated $\bar{\pi}_{BR}$ and $\bar{\pi}_{RB}$, and $\bar{\pi}_{RR} = (r - 1)$, $\bar{\pi}_{BB}$ remains to be estimated. When two BTFT players play against each other, their payoffs depend on whether the action profile in the very first round is (C, C), (D, D), or (C, D) (equivalently, (D, C)), which correspond to the Bayesian players effectively playing ALLC, ALLD, and the reactive strategy with (p, q) = (0.5, 0.5), respectively. Since $c_1 = 0.5$, the three forms of the BTFT should be respectively associated with weight factors 0.25, 0.25 and 0.5 for the calculation of the payoff. The weighted payoff, thus, is given by $\bar{\pi}_{BB} = 0.25(r - 1) + 0.25(0) + 0.5(r - 1)/2 = (r - 1)/2$. Evidently, $\bar{\pi}_{RR} < \bar{\pi}_{BR}$ always holds good, whereas $\bar{\pi}_{BB} < \bar{\pi}_{RB}$ implies r > 2. Of course, this is not a strict bound, given the non-rigorous assumptions made about the BTFT's dynamics and the mean-field nature of the arguments. Nevertheless, it is remarkable how we arrive at a condition on the benefit-to-cost ratio (keeping δ → 1) for which the coexistence of ALLC and BTFT is a distinct possibility.
Finally, the case of ALLD vs. BTFT may be treated in a similar manner to show that ALLD may not invade BTFT. While playing against ALLD, the BTFT acts like ALLD after the first round if ALLD defects in the first round (since the likelihood function P(D|p, q) = (2 − p − q)/2 is maximum at p = q = 0), and she plays like a reactive strategy with (p = 0.5, q = 0.5) if ALLD cooperates in the first round. The weight factors associated with these two roles of BTFT are each equal to 0.5. Hence the weighted payoffs are $\bar{\pi}_{BR} = -1/4$, $\bar{\pi}_{RB} = r/4$ and $\bar{\pi}_{RR} = 0$; $\bar{\pi}_{BB} = (r - 1)/2$ was already estimated in the preceding paragraph. ALLD is an ESS if $\bar{\pi}_{RR} > \bar{\pi}_{BR}$, which is trivially satisfied, and the Bayesian strategy is an ESS if $\bar{\pi}_{BB} > \bar{\pi}_{RB}$, which is satisfied for r > 2. This accounts for the observation (see figures 4(b) and (d)) that both ALLD and Bayesian strategies are ESS's.
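The weighted-payoff bookkeeping of the last two paragraphs can be written out explicitly. This is a minimal sketch under the same mean-field assumptions (donation game with benefit r and cost 1, role weights as stated in the text); the function names `bb_payoff` and `alld_vs_btft` are hypothetical.

```python
def bb_payoff(r):
    """Mean-field BTFT-vs-BTFT payoff.

    Weights: 0.25 (both act like ALLC), 0.25 (both act like ALLD),
    0.5 (both act like the reactive strategy (0.5, 0.5)).
    """
    return 0.25 * (r - 1.0) + 0.25 * 0.0 + 0.5 * (r - 1.0) / 2.0

def alld_vs_btft(r):
    """Weighted average payoffs for ALLD (R) against BTFT (B)."""
    w = 0.5  # weight of each of the two BTFT roles
    # Role 1: BTFT acts like ALLD, so mutual defection gives 0 to both.
    # Role 2: BTFT acts like (0.5, 0.5) and cooperates in half of the rounds.
    pi_br = w * 0.0 + w * (-0.5)     # BTFT pays the cost 1 half of the time
    pi_rb = w * 0.0 + w * (0.5 * r)  # ALLD receives the benefit r half of the time
    pi_rr = 0.0                      # ALLD against ALLD
    return pi_rr, pi_rb, pi_br, bb_payoff(r)

for r in (1.5, 2.0, 3.0, 10.0):
    pi_rr, pi_rb, pi_br, pi_bb = alld_vs_btft(r)
    # First flag: ALLD resists BTFT; second flag: BTFT resists ALLD.
    print(r, pi_rr > pi_br, pi_bb > pi_rb)
```

The first flag is true for every r, while the second turns true only once r exceeds 2, which reproduces the bistability condition estimated above.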

NBTFT
When the Bayesian player chooses not to reciprocate fully, she modifies her strategy to an NBTFT strategy in which she cooperates with probabilities $(p', q')$ that are lower than her belief $(p^{\max}, q^{\max})$ about the opponent's (p, q) value. This way she can potentially exploit the reactive opponent more frequently and thereby acquire a larger cumulative payoff. Hence such a strategy is able to resist invasion by a reactive counterpart over a larger region of reactive strategy space, as can be seen by comparing the cases of r = 2 in figure 5 with those in figure 4. Here the region of dominance of the NBTFT strategy increases with increasing ν, which quantifies the extent of non-reciprocity of the NBTFT strategy compared to the BTFT strategy (figures 5(a) and (b) vs. figures 5(e) and (f)). This advantage is most pronounced for higher (p, q) values and less effective for reactive opponents with low (p, q) values.
However, the advantage of NBTFT decreases as the benefit-to-cost ratio of cooperation increases (compare upper panels with lower panels of figure 5). Specifically, we observe that for r = 10, NBTFT is an ESS only in regions characterized either by small values of both p and q or by small p and large q. This is because the much larger benefit of mutual cooperation (compared to mutual defection) that accrues over time outweighs the occasional advantage of the selfish behaviour exhibited by the NBTFT player. Reactive strategies with large q are more prone to exploitation by NBTFT since they have a higher likelihood of cooperating even when the NBTFT opponent defects.
With increasing δ, the contributions to payoff from later rounds carry almost as much weight as the contributions from earlier rounds. It is also important to note that the NBTFT player keeps updating her belief and, as her belief converges towards the true strategy of her reactive opponent with increasing number of rounds, her ability to exploit the cooperative nature of her reactive opponent is limited to those reactive strategies with higher q values (compare figures 5(a)-(d)). The dominance of NBTFT strategies is lost when both r and δ increase (see figures 5(d) and (h)). This indicates that increased benefits from mutual cooperation over mutual defection, together with the enhanced contribution of later rounds to the total payoff, increasingly favour reactive strategies with q < p: these dominate for q < p, while coexistence of NBTFT and reactive strategies is seen only for q > p. NBTFT is then found to dominate only in small slivers of the region around q = 0 and q = 1.

GBTFT
If the strategic response is more generous than BTFT (i.e. $p'_{n+1} > p^{\max}_n$ and $q'_{n+1} > q^{\max}_n$), the corresponding GBTFT player is easily exploited by the reactive opponent for most values of (p, q) when both r and δ are small (compare figures 4(a) and 6(a)). GBTFT can resist invasion by a reactive strategy only if the reactive opponents are highly cooperative, i.e. both p and q are sufficiently high (see figure 6(a)). As expected, when r is low, for a given δ, the region of (p, q) space where GBTFT is an ESS shrinks even further as γ increases (compare figures 6(a) and (e)). Recall that γ, as defined in equations (6a) and (6b), quantifies the extent of generosity shown by the GBTFT player. As the benefit-to-cost ratio r of cooperation increases, more cooperative strategies gain an advantage from the larger payoff received for mutual cooperation, which offsets the cost of being exploited by the opponent's selfish behaviour. For this reason, reactive strategies with high (p, q) outperform their Bayesian counterpart and are therefore stable against invasion by GBTFT (see figure 6(c)). On the other hand, GBTFT dominates over the reactive counterparts when the probability of reciprocal cooperation, p, of the reactive strategy exceeds q by more than a threshold, p > q + ϵ with ϵ ≈ 0.15 (for r = 10 and δ = 0.75). Against such reactive opponents, GBTFT can reap the large benefit of mutual cooperation while occasionally exploiting the cooperative nature of the reactive opponent. A region of reactive strategy space where both strategies are ESS's also emerges (blue region in figure 6(c)).
With an increase in the discount factor (δ), the region of (p, q) space where GBTFT dominates changes to one characterized by high p and low q. This region grows with increasing benefit-to-cost ratio (compare figure 6(b) with figure 6(d)) and with increasing generosity parameter γ (compare figures 6(d) and (h)), since the enhanced advantage of mutual cooperation carries more weight over longer time scales (due to larger δ).

Finite population
In finite populations, the evolutionary stability of a Bayesian strategy depends on the population size N. Consequently, the ESS condition needs to be appropriately modified [89] to ensure that a single reactive mutant has a lower fitness than the Bayesian strategy and that selection opposes the fixation of the reactive mutant, i.e. the fixation probability $\rho_R$ of the reactive strategy is less than 1/N. The former condition leads to $(N - 1)\pi_{RB} < \pi_{BR} + (N - 2)\pi_{BB}$, while the latter leads to $(N - 2)\pi_{RR} + (2N - 1)\pi_{RB} < (N + 1)\pi_{BR} + (2N - 4)\pi_{BB}$ under the assumption of weak selection. Similarly, the evolutionary stability of a reactive strategy implies that the conditions (i) $(N - 1)\pi_{BR} < \pi_{RB} + (N - 2)\pi_{RR}$ and (ii) $(N - 2)\pi_{BB} + (2N - 1)\pi_{BR} < (N + 1)\pi_{RB} + (2N - 4)\pi_{RR}$ are simultaneously satisfied. In finite populations, there are only two absorbing states, corresponding to a population consisting only of either the Bayesian strategy or the reactive strategy. Hence a mixed phase where both strategies coexist is not possible.
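The two pairs of inequalities above translate directly into predicates. The following is a minimal sketch with hypothetical function names; the four payoffs are supplied by the caller, and the numbers in the example are the mean-field ALLD vs. BTFT estimates for r = 3, used purely for illustration.

```python
def bayesian_is_ess_finite(n, pi_rr, pi_rb, pi_br, pi_bb):
    """Finite-population ESS conditions for the Bayesian strategy (weak selection)."""
    # A single reactive mutant has lower fitness than the Bayesian residents.
    resists_invasion = (n - 1) * pi_rb < pi_br + (n - 2) * pi_bb
    # Selection opposes fixation of the reactive mutant (rho_R < 1/N).
    resists_fixation = ((n - 2) * pi_rr + (2 * n - 1) * pi_rb
                        < (n + 1) * pi_br + (2 * n - 4) * pi_bb)
    return resists_invasion and resists_fixation

def reactive_is_ess_finite(n, pi_rr, pi_rb, pi_br, pi_bb):
    """Same two conditions with the roles of the strategies exchanged."""
    resists_invasion = (n - 1) * pi_br < pi_rb + (n - 2) * pi_rr
    resists_fixation = ((n - 2) * pi_bb + (2 * n - 1) * pi_br
                        < (n + 1) * pi_rb + (2 * n - 4) * pi_rr)
    return resists_invasion and resists_fixation

# Illustrative payoffs only: pi_RR = 0, pi_RB = 0.75, pi_BR = -0.25, pi_BB = 1.
payoffs = (0.0, 0.75, -0.25, 1.0)
for n in (2, 6, 100):
    print(n, bayesian_is_ess_finite(n, *payoffs), reactive_is_ess_finite(n, *payoffs))
```

For n = 2 both conditions in each function collapse to a single comparison between the two cross payoffs, so at most one of the two predicates can be true in that case.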
For N = 2, the game is played between a single Bayesian and a single reactive player, and the aforementioned conditions of ESS simply boil down to $\pi_{RB} > \pi_{BR}$ for the reactive strategy to be an ESS and $\pi_{BR} > \pi_{RB}$ for the Bayesian strategy to be an ESS. In other words, the condition for evolutionary stability depends on which of the two strategies has a larger average payoff when playing against the other. From figure 7, it is clear that reactive strategies dominate over their Bayesian (BTFT) counterparts as long as p + q < 1.
This can be rationalized through the ansatz that the Bayesian strategy may be thought of as an effective reactive strategy with p = q = 0.5. Thus, as before, calculations yield $\bar{\pi}_{RB} = (r - p - q)/2$ and $\bar{\pi}_{BR} = [r(p + q) - 1]/2$, where (p, q) denotes the opponent's reactive strategy. The condition for the reactive strategy to beat the Bayesian one thus gets recast as (r − p − q)/2 > [r(p + q) − 1]/2, which implies the condition p + q < 1. Interestingly, the condition is independent of the benefit-to-cost ratio r, as observed from figure 7.
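The intermediate step can be made explicit by taking the difference of the two payoffs quoted above (a short worked derivation, using only those two expressions):

```latex
\begin{align}
\bar{\pi}_{RB} - \bar{\pi}_{BR}
  &= \frac{r - p - q}{2} - \frac{r(p + q) - 1}{2} \\
  &= \frac{(r + 1) - (r + 1)(p + q)}{2}
   = \frac{(r + 1)(1 - p - q)}{2}.
\end{align}
```

Since r + 1 > 0, the sign of the difference is that of 1 − p − q, so the boundary p + q = 1 does not move with r.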
When the Bayesian player chooses to be more selfish (NBTFT) than is dictated by her perceived belief about her reactive opponent's strategy, she dominates over the reactive strategy over a much larger region of (p, q) space (see figure 7(b)). As δ increases, the size of this region increases, with NBTFT dominating all but the most selfish strategies. The NBTFT player, by virtue of her ability to explore the strategy space in the process of inferring her opponent's strategy, is more effective in exploiting her reactive opponent, leading to a larger average payoff for herself. The situation is reversed when the Bayesian strategy is GBTFT, i.e. more generous (see figure 7(c)). The reactive strategy then dominates over a larger region of strategy space as δ increases, indicating that it is better able to exploit the generosity of the GBTFT strategy to increase its average payoff.
These observations can be explained by noting that, in contrast to the BTFT case (compare figures 8(a) and (b)), the smaller average number of C-C interactions and the larger average number of D-D interactions between the NBTFT and the reactive player ensure that the higher benefits of mutual cooperation do not accrue as much. Similarly, comparing figures 8(a) and (c), we note that even though the number of C-C interactions is larger in the latter case, the significantly larger (on average) number of D-C interactions (indicative of the reactive player more frequently exploiting the GBTFT player) neutralizes the increased benefits of mutual cooperation.
Another interesting observation is that although both BTFT and GBTFT can be ESS (see figures 4 and 6) against the reactive strategies (including the ones close to ALLD in the reactive strategy space) for large r and δ, it is evident from figure 8 that the mutual cooperation rendered by a GBTFT strategy is more than what the BTFT strategy achieves. In other words, the GBTFT strategy is comparatively more capable of increasing the level of cooperation within the population.
As the population size increases, the ESS phase diagram in (p, q) space approaches the infinite population limit, as can be seen by comparing the N = 100 case in figures 9(a)-(c) with the corresponding panels in figures 4-6. It should be noted that the mixed ESS (coexisting reactive and Bayesian players) does not exist in the finite population scenario because only the two absorbing states, each consisting of a single strategy, can be an ESS. Hence, wherever there is a mixed ESS in the infinite population case, a white region (no ESS) appears in the corresponding ESS phase diagram for finite populations.

Discussion
Can a strategy that attempts to learn the fixed reactive strategy of the opponent prevent being out-competed by extremely selfish strategies like (p ∼ 0, q ∼ 0) in a repeated PD game? The answer depends critically on the relative benefit of cooperation (r), the discount factor (δ) and on the nature of the strategy (BTFT, GBTFT, or NBTFT) employed by the Bayesian player to update her actions. For low r and δ, predominantly selfish strategies dominate over Bayesian learning strategies (see figures 4(a), 5(a) and 6(a)). But the situation changes with an increase in the discount factor (see figures 4(b), 5(b) and 6(b)). As r increases, the Bayesian learning strategies are always effective at resisting invasion by selfish reactive strategies, even for low discount factors (see figures 4(c), 5(c) and 6(c)).
Even though the Bayesian player may end up being more cooperative than extremely selfish strategies during the exploration phase of the game when she is trying to learn the strategy of her opponent, she avoids exploitation in the long run by gradually becoming more selfish through effective learning of her opponent's strategy. In general, the success of a Bayesian player depends on the extent to which she can leverage the higher benefits of mutual cooperation against a cooperative opponent while avoiding being exploited by a more selfish opponent.
Reactive strategies form just a subset of the larger class of Markovian memory-one strategies. Our results can be easily extended to see how Bayesian strategies fare against more cognitively sophisticated Markovian strategies [50]. Bayesian inference in evolutionary games provides a powerful learning framework applicable to other social dilemmas that can be modeled through the public goods game. In such situations each individual would take into account the actions of other members of her community to update her belief about the cooperation levels of the group and tune her actions accordingly. It would be interesting to see how Bayesian learning compares with other strategy update mechanisms like pairwise comparison and reinforcement learning in such scenarios. Moreover, we envisage a future study that addresses what happens if a Bayesian learner adopts the best response strategy based on her belief. Another aspect that can be investigated in future involves relaxing the assumptions that the reactive player does not make any error while employing her strategy and that the payoff matrix remains unchanged over rounds. The payoff matrix itself can be subject to fluctuations, as in the case of stochastic games. In such noisy situations, the Bayesian player's belief about the opponent may not converge to the true strategy and may keep fluctuating even at long time scales. Furthermore, the Bayesian player herself can make errors while employing her strategy. The extent of the impact of all such noisy effects is likely to depend on the noise strength and may modify our conclusions only for large noise strength.
In order to better understand the key underlying causes behind altruistic behaviour in the natural world, it is important to take into account realistic ways in which animals learn and take decisions. Decision making is often modulated by learning as well as cognitive constraints in factoring and processing a diverse range of stimuli from the environment. Accounting for those constraints will enable us to build more realistic models for understanding altruistic behaviour in social groups. The Bayesian framework developed in this work is the first step in incorporating sophisticated statistical learning mechanisms like Bayesian learning in altruistic decision-making. We hope to eventually address scenarios in which deviations from Bayesian inference, perhaps induced by cognitive constraints, can also affect patterns of altruistic behaviour in social groups. Such investigations will hopefully make it possible to design and implement protocols that encourage altruistic behaviour leading to greater benefits for society at large.

Figure 1. Schematic diagram showing how a Bayesian player updates her belief about an opponent's reactive strategy $(p^{(r)}, q^{(r)})$ that is unknown to her. The Bayesian player's objective is to infer the strategy of the reactive player from the latter's actions over several rounds. The Bayesian player is depicted in green and the reactive opponent is denoted by an emoticon. In the first round, the reactive player's action is C while the Bayesian player's action is D. In response, the reactive player defects (D) in round 2 (with probability $1 - q^{(r)}$). The Bayesian player uses the reactive opponent's action at the end of round 1 as evidence to update her belief $(p'_2, q'_2)$ about the reactive opponent's strategy using Bayes rule. The Bayesian player then uses her updated belief $(p'_2, q'_2)$ to cooperate (C) with the reactive opponent (with probability $p'_2$ in round 2). After round 2, where the reactive player defects, the Bayesian player again uses Bayes rule to update her belief by determining the maximum $(p'_3, q'_3)$ of a new posterior distribution that uses the posterior distribution estimated at the end of round 1 as prior. She then uses this inferred belief to cooperate (C) with the reactive opponent with probability $q'_3$ in response to the reactive opponent's action (D) in round 2. This updating process continues over rounds (see section 2 for details), with the prior being updated to the posterior distribution at the beginning of each round.

Figure 2. Convergence of belief: Subplots (a) and (b) respectively exhibit the posterior distribution of the Bayesian player's belief about the reactive player's strategy at the end of 50 rounds and 500 rounds, starting from a uniform prior.The black dot at the intersection of the horizontal line and the vertical line, represents the true reactive strategy, viz., (0.8, 0.3, 0.5).The color bar indicates the probability, P(p, q).The p − q space has been divided into 51 × 51 grid points.

Figure 4. ESS phase diagram for BTFT vs. reactive strategies in infinite population: Four subplots (a), (b), (c) and (d) respectively correspond to r = 2 and δ = 0.75, r = 2 and δ = 0.99, r = 10 and δ = 0.75, and r = 10 and δ = 0.99.The color gradient, ranging from zero to unity, marks the frequency of reactive strategy in the mixed ESS.Blue, green, and red colors, respectively, represent three cases: BTFT and reactive strategy are both ESS's, BTFT is the only ESS, and reactive is the only ESS.Few scattered white dots correspond to absence of any ESS.The dashed white line and the solid white line-the analytically estimated boundaries-respectively satisfy equations: q = 1 − p and q = p − 1/r.

Figure 5. ESS phase diagram for NBTFT vs. reactive strategies in infinite population: Two sets of subplots, viz., {a, b, c, d} and {e, f, g, h}, respectively correspond to two non-reciprocity parameter values ν = 0.3 and ν = 0.5. For each set, four subplots are arranged in a 2 × 2 grid corresponding to two different discount factors δ ∈ {0.75, 0.99} and two benefit-to-cost ratio parameter values r ∈ {2, 10}. The color gradient, ranging from zero to unity, marks the frequency of the reactive strategy in the mixed ESS. Blue, green, and red colors, respectively, represent three cases: NBTFT and reactive strategy are both ESS's, NBTFT is the only ESS, and reactive is the only ESS. A few scattered white dots correspond to the absence of any ESS.

Figure 6. ESS phase diagram for GBTFT vs. reactive strategies in infinite population: Two sets of subplots, viz., {a, b, c, d} and {e, f, g, h}, respectively correspond to two generosity parameter values γ = 0.3 and γ = 0.5. For each set, four subplots are arranged in a 2 × 2 grid corresponding to two different discount factors δ ∈ {0.75, 0.99} and two benefit-to-cost ratio parameter values r ∈ {2, 10}. The color gradient, ranging from zero to unity, marks the frequency of the reactive strategy in the mixed ESS. Blue, green, and red colors, respectively, represent three cases: GBTFT and reactive strategy are both ESS's, GBTFT is the only ESS, and reactive is the only ESS. A few scattered white dots correspond to the absence of any ESS.

Figure 7. ESS phase diagram for a repeated 2-player game between the Bayesian strategies and the reactive strategies for a finite population of size N = 2: Subplots (a), (b), and (c) represent three Bayesian strategies; BTFT, NBTFT with non-reciprocity parameter ν = 0.3, and GBTFT with generosity parameter γ = 0.3, respectively.In each subplot, the red color denotes that the reactive strategy is an ESS, green color denotes that the Bayesian strategy is ESS.Few scattered white dots correspond to absence of any ESS for those (p, q) values.

Figure 8. Histogram of different interactions generated in a repeated two-player game with reactive and Bayesian strategies: Subplots (a), (b), and (c), respectively, correspond to the interaction of the three Bayesian strategies, BTFT, NBTFT (with ν = 0.5) and GBTFT (with γ = 0.5), with a reactive strategy defined by (0.5, 0.6, 0.5). The game is played over 700 rounds and the frequency of each interaction pair is obtained by averaging over $10^4$ distinct trials. The first and the second actions correspond to those of the reactive and the Bayesian player, respectively.

Figure 9. ESS phase diagram for Bayesian strategies vs. reactive strategies in finite populations of sizes N = 6 and N = 100: Subplots (a), (b), and (c) represent three Bayesian strategies; BTFT, NBTFT with non-reciprocity parameter ν = 0.3, and GBTFT with generosity parameter γ = 0.3, respectively.In each subplot, while red color denotes that reactive strategy is ESS, green color denotes that Bayesian strategy is ESS.The white color corresponds to absence of any ESS.