Exploitation by asymmetry of information reference in coevolutionary learning in prisoner's dilemma game

Mutual relationships, such as cooperation and exploitation, are the basis of human and other biological societies. The foundations of these relationships are rooted in the decision making of individuals, and whether they choose to be selfish or altruistic. How individuals choose their behaviors can be analyzed using a strategy optimization process in the framework of game theory. Previous studies have shown that reference to individuals' previous actions plays an important role in their choice of strategies and establishment of social relationships. A fundamental question remains as to whether an individual with more information can exploit another who has less information when learning the choice of strategies. Here we demonstrate that a player using a memory-one strategy, who can refer to their own previous action and that of their opponent, can be exploited by a reactive player, who only has the information of the other player, based on mutual adaptive learning. This is counterintuitive because the former has more choice in strategies and can potentially obtain a higher payoff. We demonstrated this by formulating the learning process of strategy choices to optimize the payoffs in terms of coupled replicator dynamics and applying it to the prisoner's dilemma game. Further, we show that the player using a memory-one strategy, by referring to their previous experience, can sometimes act more generous toward the opponent's defection, thereby accepting the opponent's exploitation. Mainly, we found that through adaptive learning, a player with limited information usually exploits the player with more information, leading to asymmetric exploitation.


Introduction
Cooperation, defection, and exploitation are important relationships that universally appear in biological and social systems. While cooperating, individuals are altruistic and achieve benefits for the entire group. In defection, they behave selfishly for their own benefit, which results in demerits for all. In exploitation, selfish individuals receive benefit at the expense of altruistic others. The choice of strategy, i.e., selfish or altruistic behavior, is important in establishing social relationships. Individuals, based on their abilities, sophisticate their strategies through their experiences. Generally, people's ability to choose the best strategies differ. These differences in ability can affect how cooperation is established between people. Now the following question arises: Do individuals with higher abilities exploit those with lower abilities or vice versa?
Game theory is a mathematical framework for analyzing such individual decision-making of strategies [1]. Everyone has a strategy for choosing given actions and receives a reward based on their chosen actions. In particular, the prisoner's dilemma (PD) game (see Fig. 1-(A)) has been used extensively to investigate how people act when competing for benefits. Each of two players chooses either cooperation (C) or defection (D), depending on their own strategies. Accordingly, the single game has four results, namely CC, CD, DC, and DD, where the left and right symbols (C or D) indicate the action taken by oneself and that of the opponent, respectively. The benefit of them is given by R, S, T , and P . The property of PD demands T > R > P > S: Each player receives a larger benefit for choosing D, which may lead to DD, but CC is more beneficial than DD. How to avoid falling into mutual defection, i.e., DD, has been a significant issue.
In this study, we assume the iteration of games, and then each player can refer to the actions of the previous game and change the own choice depending on the observed actions. We consider that a player can change their next action based on the previous actions of the two players, i.e., CC, CD, DC, and DD at the maximum. Thus, a player with more detailed information about a previous round has a higher ability to choose their own optimal action. This ability to observe actions of their own and their opponents is seen in reality, such as intention recognition [2][3][4][5][6][7][8]. Furthermore, in social systems, from the bacterial community to international war, the reference to past actions might play an important role in establishing interpersonal relationships [9]. A representative example is tit-for-tat (TFT) strategy [10,11], in which the player observes and mimics the other's previous action. The monumental and the numerous following studies showed that the players with TFT strategies are selected in the optimization process and establish cooperation. This strategy is classified as a reactive strategy [12][13][14][15][16][17] as the player chooses their actions by referring only to their opponent's previous action. Another type of strategy, the memory-one strategy [17][18][19][20][21][22][23], is introduced when a player refers to both their own previous action as well as their opponent's. The memory-one strategy includes not only the TFT but the Win-Stay-Lose-Shift (WSLS) strategy [18], which generates cooperation even under the error in choice of action. Consequently, it is expected that a player who refers to more information will succeed in receiving a larger benefit.
Although such reference of past actions plays a crucial role to establish interpersonal relationships, how the difference in information for it between the players affects in the player's benefit is unclear yet. Evolution of strategy in a single population [12][13][14][15][16][17][18][19][20][21][22][23] does not generate the difference in payoffs among the players in principle, where the strategy is selected within the same group. Evolution in multi-population [24][25][26][27][28][29] (i.e., intra-group selection of strategies by the inter-group game) and learning among multi-agent [30][31][32][33][34][35][36][37] can generate the exploitation, but so far most studies have focused only on the emergence of cooperation. Recently, the exploitation has been studied as to "symmetry" breaking of the payoffs of players [26,34], where only the game between the same class of strategies is assumed. Here, we revised the coupled replicator model in the previous studies of multi-population evolution [39] and multi-agent learning [40][41][42] so that the player can refer to their own previous actions and those of their opponent and update their strategy accordingly within a class of strategies. In particular, we focused on the reactive and memory-one classes of strategies, as they are basic and have been studied extensively. We then investigated whether players using the memory-one strategy would win the game against opponents using the reactive strategy, by utilizing the extra information provided through the observation of their own previous actions.
The remainder of this paper is organized as follows. In § 2, we formulate the learning dynamics for various strategy classes. In addition, we confirm that in iterated games against an opponent's fixed strategy, the strategy with the higher ability obtains a larger payoff in equilibrium. In § 3, we introduce an example of mutual learning between memory-one and reactive strategies. Then, we demonstrate that the memory-one class, i.e., the player with the higher ability, is, counterintuitively, one-sidedly exploited by the reactive one. In § 4, we analyze how this exploitation is achieved and elucidate that the ability to reference one's own actions leads to generosity and leaves room for exploitation. Finally, in § 5, we show that high-ability players are generally exploited because of their generosity, independent of the strategy class and payoff matrix. A single game has four results: CC (red), CD (yellow), DC (green), and DD (purple). (B) memory-one, reactive, and mixed strategy classes. The memory-one class can refer to all four previous results. The reactive class can only refer to an opponent's result, compressing CC and DC (colored in the same red), and CD and DD (yellow). Furthermore, the mixed class compresses all the previous results into one (all colored in red).
2 Formulation of learning dynamics of strategies

Formulation of class and strategy
Before formulating the learning process, we mathematically define the strategy and class. Recall that a single game can have one of four results: CC, CD, DC, and DD. First, a player using a memory-one strategy can refer to their own and the opponent's previous actions and respond with a different action to each result of the previous game. Thus, the player has four independent stochastic variables x 1 , x 2 , x 3 , and x 4 as the probabilities of choosing C regardless of the outcome of the previous game. Thus, the memory-one class is defined as the possible set of such memory-one strategies, which is denoted as {x 1 , x 2 , x 3 , x 4 } ∈ [0, 1] 4 . The TFT and Win-Stay-Lose-Shift [18,[43][44][45] strategies are examples of the emergence of cooperation as x 1 = x 3 = 1, x 2 = x 4 = 0, and x 1 = x 4 = 1 and x 2 = x 3 = 0.
Second, a player using the reactive strategy can only refer to the opponent's action and therefore cannot distinguish between CC and DC (CD and DD). Thus, the strategy is given by two independent variables x 13 and x 24 , where the former (latter) is the probability of choosing C where the previous result was either CC or DC (CD or DD). Therefore, the reactive class is defined as {x 13 , x 24 } ∈ [0, 1] 2 . Here, the notation x 13 (variable for CC and DC) clearly indicates the integration of x 1 (CC) and x 3 (DC) from the memory-one class. Indeed, all the strategies in the reactive class are included in the memory-one class, as one can set x 1 = x 3 = x 13 and x 2 = x 4 = x 24 for all x 13 and x 24 . Thus, the above TFT strategy can be represented as x 13 = 1 and x 24 = 0, whereas the Win-Stay-Lose-Shift strategy cannot be represented. This makes it clear that the memory-one class is more complex than the reactive one, because the former can include all the strategies of the latter. The ordering of complexity can be defined as a class that includes all the strategies of the other is more complex.
Third, in the classical mixed strategy [46], a player stochastically chooses their action without referencing any actions from the previous game. Thus, the strategy controls only one variable x 1234 ∈ [0, 1], which is the probability of choosing C. This class is the least complex of the three classes.

Analysis of repeated game
In this section, we analyze a repeated game under the condition that the strategy of both players are fixed. We define p := (p CC , p CD , p DC , p DD ) T as the probabilities that (CC, CD, DC, DD) are played in the present round, and the result of the next round is calculated by p ′ = Mp with (1) When none of the strategy variables, x n and y n for n ∈ {1, · · · , 4} are 0 or 1, the repeated game has only one equilibrium state p e := (p CCe , p CDe , p DCe , p DDe ) T . Here, we can directly compute p e as Here the coefficient k is determined by the normalization of the probabilities p CCe + p CDe + p DCe + p DDe = 1.

Learning dynamics of memory-one class
We next consider adaptive learning from past experiences of repeated games. For instance, we assume that the probability of CC being the result of the previous round is p CCe . Then, the next action being C (D) would have the probability of x 1 (x 1 ). Here, we define u CC(C) (u CC(D) ) as the benefit that the player gains by performing action C (D). First, the time evolution of x 1 is assumed to depend on the amount of experience: the previous game's result and the action in the present one must be CC and C, respectively. Thus,ẋ 1 is proportional to p CCe x 1 . Second,ẋ 1 also depends on the benefit of action C and thus is proportional to u CC(C) − (x 1 u CC(C) +x 1 u CC(D) ). To summarize, we getẋ Next, we compute u CC(C) and u CC(D) . When the previous game's result and the present self-action are CC and C, respectively, the present state is given by p = p CC(C) := (y 1 ,ȳ 1 , 0, 0) T . If p CC(C) = p e holds, then the state gradually relaxes to equilibrium with the repetition of the game. Thus, u CC(C) is the total payoff generated by p CC(C) until equilibrium is reached, which is given by Here, we define u := (R, S, T, P ) T as the vector for the payoff matrix. By contrast, when the previous game's result and the present self-action are CC and D, respectively, the present state is given by is computed in the same way using By substituting Eqs. 4 and 5 into Eq. (3), we can write the learning dynamics of x 1 aṡ using only strategy variables, x n and y n for n ∈ {1, · · · , 4}, and the payoff variables (T, R, P, S). Similarly, we can derive the time evolution of the other strategy variables x 2 , x 3 , and x 4 .
Eq. (6) appears complicated at first glance, but it can be simplified aṡ as will be shown in subsection § 2.4. The same equations hold for the learning of x 2 , x 3 , and x 4 , and for opponent player. Notably, this equation reproduces the original coupled replicator model [39,40].

Detailed calculation of learning dynamics
In this section, we prove that Eqs. 6 and 7 are equivalent. First, because p e is the equilibrium state for the Markov transition matrix M, we obtain Here, by a perturbation of strategy δx 1 , the equilibrium state δp e changes accordingly. By substituting x 1 → x 1 + δx 1 and p e → p e + δp e into Eq. (1), we obtain: Here, we use M ∞ ∂p e /∂x 1 = 0 because M has only one eigenvector for which the eigenvalue is 1 as the preservation of probability (i.e., p CCe + p CDe + p DCe + p DDe = 1). Eq. (9) not only provides a simple representation of the time evolution but is also useful for the numerical simulation of Eq. (6). The right-hand side of Eq. (9) requires an approximate numerical calculation for all eigenvectors of M. By contrast, the left-hand side demands only the information on the equilibrium state p e , which is analytically given by Eq. (2).

Learning dynamics of other strategies
In the previous sections, we formulated the learning dynamics of memory-one class strategies against another within the same class. In this section, we consider other cases in which both learned and learning players adopt the reactive class.
First, we consider a case in which a learned player uses the reactive class, and the learning player uses the memory-one class. In this case, the learning is easily given bẏ for n ∈ {1, · · · , 4} because the learned reactive player's strategy is constrained by y 1 = y 3 = y 13 and y 2 = y 4 = y 24 . Second, we consider a case in which the learning player uses the reactive class, and the learned player uses the memory-one class. In this case, the learning player's strategy is given by (x 13 , x 24 ). Recall that the learning speed of our model depends on the amount of experience. Because the frequency of observing the opponent's previous action, C, is the total of both CC and DC, the time evolution of x 13 is the sum of x 1 and x 3 . Thus, we geṫ 3 Numerical result for learning 3

.1 One-sided learning against a fixed strategy
Before investigating the game between the memory-one and reactive classes, we first study the learning of the memory-one and reactive classes against the other fixed strategy. Fig. 2 shows the time series of each strategy's payoff based on the learning dynamics. The payoffs of both classes monotonically increase their payoffs over time because the opponent's strategy is fixed. However, there are two major differences between the two classes in the way the payoff increases. Figure 2: Payoff of the memory-one (blue) and reactive (orange) classes over time when learning with an opponent with a fixed strategy, whose strategy is a reactive one; y 13 = 0.9 and y 24 = 0.1. The payoff is an average of a large number (10000) of initial conditions for x i , which are chosen randomly. The horizontal axis denotes time on a scale of log(t + 1). The rise of the payoff of the reactive class is larger than that of the memory-one class, but finally, the memory-one class has a larger payoff.
First, the reactive class learns faster than the memory-one class. This is because the reactive class is a compressed version of the memory-one model, as the constraints x 1 = x 3 and x 2 = x 4 are postulated, that is, the learning in the cases of CC and DC (CD and DD) are integrated. Recall that the change in strategy is optimized based on the empirical data sampled through the played games. In the reactive class, the number of strategy variables is fewer; therefore, quick optimization can be achieved, as shown in Eq. (11).
Second, the memory-one strategy gains a larger payoff in equilibrium than the reactive one. This is simply because the memory-one strategy contains a reactive strategy. Accordingly, max is derived for all the opponent's fixed strategies y.

Mutual learning between memory-one and reactive classes
In § 3.1, we considered one-sided learning, where a player dynamically optimizes the strategy against their opponent's fixed strategy. In this section, we consider mutual learning, where both players optimize their strategies as the opponent's strategy continues to change.
We excluded the mixed class in this study because the results of matches with the mixed class are trivial. A player with a mixed class must use the same action independently of the opponent's previous actions. Thus, the opponent always receives a higher pay off by choosing D according to the payoff matrix of the PD. When the opponent's choice is always D, the best choice of the mixed class is also D. Thus, only the pure DD will result in equilibrium. Therefore, the mixed class can establish neither cooperative nor exploitative relationships. Therefore, we consider only the game between players using the memory-one and reactive classes. In our model, the dynamics of players' strategies are deterministic. Thus, the equilibrium state is uniquely determined by the initial values of strategies x and y. Here, we take sampling over the initial conditions. A match between each pair of classes was evaluated using a sufficiently large number of initial conditions of x and y. In each sample, the initial values of the strategy were assumed to be given randomly. In other words, when the former player takes a memoryone (reactive) class, the strategy is randomly chosen from Fig. 3 shows the final state of mutual learning for three matches: (A) between two memoryone classes, (B) between the memory-one and reactive classes, and (C) between two reactive classes. (Note that the last case represented in (C) was already studied in [Fujimoto2019].) Here, recall that mutual cooperation satisfies p CCe = 1, mutual defection satisfies p DDe = 1, and exploitation satisfies p CDe = p DCe . First, we studied the matches between the same classes, represented in (A) and (C). In these matches, exploitation with p CDe = p DCe can be in equilibrium. In other words, asymmetry is permanently established between the players depending on their initial strategies, even though both deterministically improve their own strategies to receive a larger payoff. 
Notably, this asymmetry emerges symmetrically between the players when using the same class. In this case, the number of samples that satisfied p CDe > p DCe was equal to those that satisfied p CDe < p DCe . In (C), i.e., the match between the reactive classes, the equilibrium exists as multiple fixed points with p CDe = p DCe (see [34] for the detailed analysis). By contrast, each exploitative state in (A) permanently oscillated to form a limit cycle, whereas the temporal averages of p CDe and p DCe are not equal. There is an infinite number of limit cycles, one of which is achieved depending on the initial conditions. A detailed analysis of these limit cycles will be explained in the next section § 4.
The heterogeneous match (B) between the memory-one and reactive classes has the same exploitative states as match (A). However, the most remarkable difference here is that in this exploitation, the reactive class can receive a larger payoff in the match with the memory-one class, and the reverse never occurs. In other words, only the one-sided exploitation from the reactive class to the memory-one class emerges, regardless of the initial conditions. This result appears paradoxical, when one notes that the memory-one class has more information for the strategy choices and is indeed in a more advantageous position than the reactive one in equilibrium when the other player's strategy is fixed, as already confirmed in § 3.1. We will discuss the origin of this unintuitive result in Section § 4.

Emergence of oscillatory exploitation 4.1 Analysis of exploitation
We first analyzed the exploitation between the memory-one classes, but the analysis is also applicable to the case between memory-one and reactive classes. An example of the trajectory of strategies x i and y i during exploitation is shown in Fig. 4. For all cases, the exploiting player's strategy satisfies On the other hand, the exploited opponent's strategy satisfies Here, note that x 1 and y 1 are neutrally stable, and the asymptotic value continuously varies with the initial condition. Assuming that x 2 = x 4 = y 2 = 0 and y 3 = 1 and inserting them into Eq. (2), the possibility vector p satisfies This equation leads to p DCe > p CDe , which proves that player 1 always receives a larger payoff than player 2. By inserting this into the learning dynamics in Eq. (7), we obtaiṅ These equations indicate that x 1 and y 1 are neutral, as expected. This is because the players do not experience CC in this oscillatory equilibrium of exploitation and do not have the chance to change x 1 or y 1 by learning. Now, the two-variable dynamics x 3 and y 4 are obtained, which leads to oscillation.
The oscillatory dynamics for (x 3 , y 4 ) follow the Lotka-Volterra type equation. Eq. (16) has an infinite number of periodic solutions (cycles), and the cycle that is reached depends on the initial strategy. An example of a trajectory is presented in Fig. 4-(B). In Eq. (16), there is only one fixed point, x * 3 and y * 4 , which is given by Then, the players' expected payoffs, u * e and v * e , are given by This linear stability analysis shows that this fixed point is neutrally stable, that is, the fixed point is not a focus but a center, as in the original Lotka-Volterra equation. Indeed, the time evolution of (x 3 (t), y 4 (t)) has a conserved quantity, given by which is determined by the initial condition and preserved. Furthermore, this Lotka-Volterra type oscillation provides an explanation of the exploitation we observed. The original Lotka-Volterra equation shows the prey-predator relationship, where the predator increases its own population by sacrificing the prey population. Herex 3 (y 4 ) represents the exploiter's defection (cooperation on the exploited side), which is a selfish (altruistic) action in the PD. Fig. 4-(B) shows thatx 3 is larger when x 4 is larger. In other words, the exploiting side learns to use the selfish action with the altruistic action of the exploited. This result means that the exploiting one increases its own payoff at the expense of the exploited side. Thus, the oscillation of the exploitative relationship is interpreted as a prey-predator relationship.

Mechanism of one-sided exploitation: self-reference leads to generosity
In the previous sections, we mathematically showed how the exploitative relationship is maintained. Next, we intuitively interpret the strategies in Eqs. 13 and 14, which implies that exploitation emerges between the narrow-minded and generous players. Here, we also focus on why the memory-one class is exploited one-sidedly by the reactive one.
Before analyzing Eqs. 13 and 14, we present the well-known tit-for-tat (TFT) and two related strategies in Table 1. In the TFT strategy, the player deterministically responds with C to the opponent's previous C, and with D to the opponent's D. A more generous strategy [14] is that the player accepts the opponent's D and probabilistically responds with C. In contrast, in a more narrow-minded strategy [34], the player betrays the opponent's C probabilistically and responds with D. In contrast to the TFT strategy that was adopted to represent the emergence of symmetric cooperation, the generous and narrow-minded TFT strategies represent asymmetric exploitation. However, these strategies do not refer to previous self-actions. Here, we analyze the exploiting and exploited strategies with the generous and narrow-minded TFT, under the constraint that the previous actions of the self are C and D.
As seen in Eq. (13), the exploiting player uses the narrow-minded TFT strategy in both cases where their previous action was C and D. In other words, the strategy is characterized Strategy Action to opponent's C Action to opponent's D TFT Deterministic C Deterministic D Generous TFT Deterministic C Probabilistic C Narrow-minded TFT Probabilistic D Deterministic D Table 1: Summary of TFT and related strategies. The TFT strategy deterministically responds with C (D) to the opponent's C (D). Thus, if the player does not refer to their own action , the strategy is represented by a reactive class with x 13 = 1 and x 24 = 0. The generous TFT's response to the opponent's C is the same as the original TFT but probabilistically cooperates with the opponent's D. This strategy is represented by x 13 = 1 and x 24 > 0. Finally, a narrowminded TFT probabilistically defects the opponent's C, which is different from the original TFT. This strategy is represented by x 13 < 1 and x 24 = 0. by x 1 > 0 and x 2 = 0 for a self C, and x 3 > 0 and x 4 = 0. Therefore, CC never occurs in exploitative equilibrium, where x 1 > 0 can be arbitrarily chosen. Thus, a player has the potential to use the exploiting strategy without referring to their own action, as x 13 = x 1 = x 3 > 0 and x 24 = x 2 = x 4 = 0. Similarly, the exploited player also uses the narrow-minded TFT for a previous self C, that is, y 1 > 0 and y 2 = 0. However, for a self D, the player uses the generous TFT, that is, y 3 = 1 and y 4 > 0. Thus, the exploited player refers to their own action: the player cannot take x 13 = x 1 = x 3 and x 24 = x 2 = x 4 . Although this additional reference to self-action enriches the player's choice of strategy, the player instead tends to be more generous to the opponent's defection and accepts the opponent's exploitation.
In the above, we consider only the equilibrium states given by Eqs. 13 and 14. However, the question remains whether a memory-one class acquires generosity and accepts exploitation during the transient learning process. We attempt to answer this question in the following three steps: First, we classify three equilibrium states in the game between reactive classes: mutual defection, cooperation, and exploitation. Second, by assuming that one of the players adopt a memory-one class instead of the reactive one under these three equilibrium states, we discuss whether the player changes the strategy for each of the three equilibrium states. Third, we consider the one-sided learning process by the above memory-one class under the equilibrium states and examine if the memory-one side's learning increases the opponent's payoff by acquiring generosity.
Step 1. As shown in Fig. 3-(C), there are three cases of equilibria in a match between reactive classes; mutual defection (yellow dots), mutual cooperation (blue dots), exploitation (purple and orange dots). Here, all possible equilibria for exploitation are shown in Fig. 5.
Step 2. Next, we assume that in each of the above states, one player adopts a memory-one class instead of a reactive one. Here, the player can refer to their own actions, after which the state can be unstable. First, for cases of mutual defection and cooperation, the equilibrium states are stable even if one player alternatively adopts a memory-one class. This is easily explained by the analytical result that all the equilibrium states between reactive classes are completely included in those among the memory-one and reactive classes. By contrast, in case of exploitation, the memory-one class receives a larger payoff by releasing constraints x 1 = x 3 = x 13 and x 2 = x 4 = x 24 .
Step 3. Finally, we consider one-sided learning by a memory-one class under the exploitation case. Since we assume that the strategy of the reactive opponent is fixed, the memory-one side receives a larger payoff. For all states, the memory-one class releases the constraints x 1 = x 3 and x 2 = x 4 and learns to be x 1 = x 2 = 0 and x 3 = x 4 = 1, which represents an asymptotic relationship to the exploited strategy of Eq. (14). Then, the learning of the memory-one side increases the opponent's benefit much more than the increase in one's own benefit (see Fig. 5). This result shows that the memory-one class becomes generous toward the opponent's defection by also referring to their own actions.
Thus, we have shown that memory-one class becomes generous for the opponent's defection in equilibrium. If the learning feedback from the opponent reactive class is considered, the one-sided exploitation is generated as seen in Fig. 3-(B).

Generality over different classes
We have demonstrated that the reference to own actions lead to generosity toward the opponent's defection when comparing memory-one and reactive classes. Recall that the reactive class identifies both CC and DC, and CD and DD as the same. Therefore, we can consider two intermediate classes between the memory-one and reactive classes: the player only compresses either CC with DC or CD with DD. These strategies refer to the opponent's action completely but refer to the self-action only when the opponent's action is C or D. In this section, we study the learning dynamics of such extended classes.
Before analyzing the matches, we labeled these new classes according to Fig. 6. Here, we renamed the memory-one class as the "1234" class because the class distinguishes all of CC (1), CD (2), DC (3), and DD (4), and has four strategy variables x 1 , x 2 , x 3 , and x 4 . Further, the reactive class uses two variables x 13 and x 24 , as DC (3) and CC (1) represent one class, and DD (4) and CD (2) represent another. Thus, we renamed it as the "1212" class. In the same way, the newly defined classes are renamed as "1214" and "1232"; the former (latter) combines CC and DC (CD and DD). Among these four classes, complexity can be introduced as the degree to which the self-action is referred to. Fig. 6-(A) shows the ordering of this complexity. 1234 and 1212 are the most complex and simple of the four classes, respectively, whereas 1214 and 1232 lie between the two.
The outcomes of all possible 10 matches between any pairs of the 1234, 1232, 1214, and 1212 classes are shown in Fig. 6-(B). As shown in these figures, 1212, which is the simplest class, can exploit 1212, 1232, and 1234. Class 1214 exploits 1232 and 1234. Class 1232 can exploit 1234. These exploitative relationships are summarized in the bottom right panel. Interestingly, the results show that the simpler classes generally exploit the more complex classes, but the reverse never occurs. Thus, the reference to self-action generally leads to generosity toward the opponent's defection and accepts exploitation. Although, the complexity among these 15 classes does not always indicate the degree of reference to self-action. Interestingly, the one-sided exploitation by a complex class of a simple one is not observed, except for a single case, i.e., the match between the 1232 and 1131 classes.

Generality over different payoff matrices
So far, we have studied the PD game with the standard score (T, R, P, S) = (5, 3, 1, 0). The above results shows that one-sided exploitation by simple classes of complex classes is valid if the payoff satisfies T − R − P + S > 0. This condition is called the "submodular" PD [47,48]; the summation of asymmetric payoffs, T + S, is larger than the summation of symmetric payoffs, R + P .
When the payoff matrix does not satisfy submodularity, mutual learning does not generate an exploitative relationship: in all matches among the 1234, 1214, 1232, and 1212 classes, the players achieve either mutual cooperation or mutual defection, depending on the initial condition.

Summary and discussion
In this study, we investigated how players' payoffs after learning depend on the complexity of their strategies, that is, the degree of reference to previous actions. By extending the coupled replicator model for learning, we formulated adaptive optimization of strategies that learn from previous actions. Focusing on the reactive class, in which the player refers only to the opponent's last action, and the memory-one class, in which the player refers to both their own last action and that of the opponent, we uncovered that the latter, which has more information and includes the former, is exploited by the former, independent of the initial state. Here, the strategies of both the exploited and the exploiting player oscillate permanently, as in predator-prey dynamics, while the exploitative relationship is maintained. The exploiting (exploited) side uses the narrow-minded (generous) TFT when the previous self-action was defection.
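For fixed (non-learning) memory-one strategies, the long-run payoffs discussed above can be computed from the stationary distribution of the Markov chain over the four outcomes CC, CD, DC, DD. The sketch below uses plain power iteration; it illustrates only this standard payoff calculation, not the replicator learning dynamics of the paper:

```python
def stationary_payoffs(p, q, payoff=(5, 3, 1, 0)):
    """Long-run payoffs of two fixed memory-one strategies.

    p, q -- cooperation probabilities for outcomes CC, CD, DC, DD,
            each from that player's own point of view
    """
    T, R, P, S = payoff
    # states from player 1's view; player 2 perceives CD and DC swapped
    q_view = [q[0], q[2], q[1], q[3]]
    M = []
    for s in range(4):
        a, b = p[s], q_view[s]  # next-round cooperation probabilities
        M.append([a * b, a * (1 - b), (1 - a) * b, (1 - a) * (1 - b)])
    v = [0.25] * 4
    for _ in range(10000):  # power iteration toward the stationary vector
        v = [sum(v[s] * M[s][t] for s in range(4)) for t in range(4)]
    u1 = v[0] * R + v[1] * S + v[2] * T + v[3] * P
    u2 = v[0] * R + v[1] * T + v[2] * S + v[3] * P
    return u1, u2
```

For example, two unconditional cooperators settle into CC and each receives R = 3, while two unconditional defectors each receive P = 1.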
By this definition, a player using the memory-one class has a larger set of strategies to choose from than one using the reactive class. The memory-one class includes the extortion strategies [49][50][51][52][53], with which a player holding a fixed strategy gains an advantage over the opponent, that is, receives a larger payoff than the opponent independent of the opponent's strategy. Given this, it is quite surprising that the reactive class unilaterally exploits the memory-one class after mutual learning, regardless of the initial strategy. The results show that even if a richer strategy space offers potential advantages, a player may not realize them through learning when the opponent's strategy continues to change. We also demonstrated that learning with reference to self-action makes the player generous toward an opponent's defection, providing a previously unexplored route to acquiring generosity [51,[54][55][56]. In this way, learning to obtain a higher payoff with more information counterintuitively results in a poorer payoff than that of the opponent, who learns with less information.
It is common for a player to change their next choice depending on their own or the opponent's past choices, as already seen in the reactive and memory-one classes. As briefly mentioned in § 5.1, our formulation can be extended to reference arbitrary information. For instance, we can consider a memory-n strategy, which refers to the actions of more than one previous round. Even the memory-two strategy differs substantially from the memory-one strategy: the player can use a greater variety of strategies, such as tit-for-2-tat [57]. It has been argued that reference to longer memory generates cooperation more efficiently [23]. Our model could be extended to examine whether this holds under mutual learning, or whether the player with more information would exploit their opponent or be exploited.
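As an illustration of the extra expressiveness of memory-two, tit-for-2-tat defects only after two consecutive defections by the opponent, a rule that no memory-one strategy can express. A minimal sketch (our own encoding):

```python
def tit_for_2_tat(opp_prev, opp_prev2):
    """Memory-two strategy: defect only after the opponent
    defected in BOTH of the last two rounds."""
    if opp_prev == "D" and opp_prev2 == "D":
        return "D"
    return "C"

# A single defection is forgiven; two in a row are punished.
print(tit_for_2_tat("D", "C"))  # prints "C"
print(tit_for_2_tat("D", "D"))  # prints "D"
```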
Game theory is often relevant in explaining characteristic human behaviors. The advantage of the TFT strategy suggests a sense of poetic justice in human nature. However, humans also reflect on their own past behavior; for instance, they may be motivated to act beneficially toward others after having betrayed them. This study shows how such behavior can emerge and be preserved through learning. This, however, leaves room for exploitation. Indeed, Eq. (14) shows that a player who refers to their own previous actions becomes generous toward the opponent's defection after having defected in the previous round. Ironically, this generosity can be exploited.

Supplementary Material S1 General formulation of classes
In § 1 of the main manuscript, we defined the memory-one class and the reactive class. In § 4, we renamed these the 1234 and 1212 classes, and additionally defined the 1214 and 1232 classes. In general, however, there are 15 classes in total when the player can refer to at most the previous game. This section provides the general definition of these classes.
Recall that there are four possible outcomes of a single game, CC, CD, DC, and DD, where the left and right letters represent the actions of self and other, respectively. We assign the numbers 1 to 4 to these pairs of actions. The memory-one class is then coded 1234, because it distinguishes all of 1 (CC), 2 (CD), 3 (DC), and 4 (DD) and can cooperate with a different probability for each observed outcome. Next, consider the reactive class. This class refers only to the other's action, so the player cannot distinguish 3 (DC) from 1 (CC), nor 4 (DD) from 2 (CD). When multiple outcomes are merged, we replace the larger numbers in each merged set with the smallest one; here, 3 and 4 are replaced by 1 and 2, respectively, so the reactive class is coded 1212. In the same way, the remaining classes are coded 1134, 1214, 1231, 1224, 1232, 1233, 1114, 1131, 1133, 1211, 1222, 1221, and 1111, giving 15 classes in all. Fig. S1 shows a schematic diagram of these 15 classes.
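Under this coding, each class corresponds to a way of merging the four outcomes, with each outcome relabeled by the smallest outcome number in its merged group. Enumerating all canonical codes of this form recovers the 15 classes; a short check of this correspondence (our own, not from the manuscript):

```python
from itertools import product

def all_class_codes():
    codes = set()
    # try every assignment of the four outcomes to groups, then relabel
    # each outcome by the smallest outcome number sharing its group
    for grouping in product(range(4), repeat=4):
        digits = [min(j for j in range(4) if grouping[j] == grouping[i]) + 1
                  for i in range(4)]
        codes.add("".join(map(str, digits)))
    return sorted(codes)

codes = all_class_codes()
print(len(codes))  # prints 15
```

The count 15 is the number of ways to partition the four outcomes into groups (the fourth Bell number), which is why exactly 15 classes exist.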
In addition, recall the definition of the complexity of a class: when one class includes all the strategies of another, the former is defined to be more complex than the latter. Fig. S1 also shows all complexity relationships among the classes.
FIG. S1: Schematic diagram of the 15 possible classes and the complexity relationships among them. Each colored bar represents one class of strategies, consisting of the four outcomes CC, CD, DC, and DD. When the player responds with the same action to different outcomes, those outcomes are painted in the same color. Red, yellow, green, and purple indicate the numbers 1, 2, 3, and 4, respectively. Solid black lines represent the order of complexity between connected bars: the upper bar is more complex than the lower bar connected to it.

S2 Mutual learning in general classes of strategies
Before analyzing the games among these 15 classes, we omit the 1133 and 1111 classes from the analysis. These two classes do not refer to the other's action; in other words, they do not distinguish X_1 C from X_2 D for any X_1, X_2 ∈ {C, D}. From the rules of the prisoner's dilemma, the other player then learns to choose D deterministically against the 1133 or 1111 class, so there is no equilibrium other than mutual defection. Fig. S2 shows the equilibria of mutual learning for the remaining 13 classes. The figure exhibits a variety of equilibria with various degrees of cooperation and exploitation. Note that several kinds of oscillation appear in the payoffs of both players, similar to the 1234 vs. 1234 case in the main manuscript. Fig. S2 shows that the same type of oscillation is frequently seen in the games among 1234, 1134, 1214, 1231, 1232, 1131, and 1212. Another oscillatory state is seen in the 1232 vs. 1131 case in the upper right triangle. All states other than these two types of oscillation are fixed points of the learning dynamics.
FIG. S2: The possible equilibrium states for all pairs of the 13 classes of strategies. As combinations of 13 classes, there are 13 + 13 × 12/2 = 91 panels of equilibrium states. In each panel, the X (Y) axis indicates the equilibrium payoff of the vertical (horizontal) class. Each panel shows 10000 samples of equilibrium states under mutual learning with payoffs (T, R, P, S) = (5, 3, 1, 0), although some equilibria are not plotted.
We now give several remarks on the computational methods for mutual learning. First, we impose a constant bound on the stochastic strategies, ε ≤ x_i ≤ 1 − ε with ε = 10^−4, in the computation of learning. This avoids false convergence to CC (i.e., u_e = v_e = 3) when CC is a saddle point. Second, we remove equilibria with u_e = 3 and 1 < v_e < 3, or vice versa, from several panels, because these equilibria exist only under the condition R = (T + P)/2. Fig. S3 gives a statistical analysis corresponding to Fig. S2. Panel (A) shows the payoff of each class obtained statistically from numerous ensembles. In principle, a learning player receives at least the payoff P (= 1) in equilibrium; the difference from this minimal payoff thus represents the class's surplus benefit. Interestingly, besides the previously studied 1234 (memory-one) and 1212 (reactive) classes, the 1232 and 1131 classes achieve high scores statistically. Panel (B) shows the difference between the two players' payoffs. Interestingly, no one-sided exploitation of a simpler class by a more complex one is seen, with the exception of 1232 vs. 1131.
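The first remark, the bound ε ≤ x_i ≤ 1 − ε, amounts to clipping every cooperation probability after each learning step; a minimal sketch (function and variable names are our own):

```python
EPS = 1e-4

def clip_strategy(x, eps=EPS):
    # keep every cooperation probability inside [eps, 1 - eps] so the
    # learning dynamics cannot lock onto a pure strategy at a saddle point
    return [min(max(xi, eps), 1.0 - eps) for xi in x]

# probabilities at 0 or 1 are pulled just inside the boundary
clipped = clip_strategy([0.0, 0.5, 1.0])
```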