Decoding trust: a reinforcement learning perspective

Behavioral experiments on the trust game have shown that trust and trustworthiness are common among human beings, contradicting the prediction based on the Homo economicus assumption of orthodox economics. This means some mechanism must be at work that favors their emergence. Most previous explanations, however, resort to exogenous factors based upon imitative learning, a simple version of social learning. Here, we turn to the paradigm of reinforcement learning, where individuals revise their strategies by evaluating the long-term return through accumulated experience. Specifically, we investigate the trust game with the Q-learning algorithm, where each participant is associated with two evolving Q-tables that guide decision-making as trustor and trustee, respectively. In the pairwise scenario, we reveal that high levels of trust and trustworthiness emerge when individuals appreciate both their historical experience and returns in the future. Mechanistically, the evolution of the Q-tables shows a crossover that resembles human psychological changes. We also provide the phase diagram for the game parameters, where the boundary analysis is conducted. These findings are robust when the scenario is extended to a latticed population. Our results thus provide a natural explanation for the emergence of trust and trustworthiness, and indicate that the long-ignored endogenous factors alone are sufficient to drive their emergence. More importantly, the proposed paradigm shows the potential to decipher many puzzles in human behaviors.


INTRODUCTION
Trust and trustworthiness are central components of human civilization [1], and are especially required in dealing with pressing threats such as climate change, pandemics, international conflicts, and energy crises. As "a lubricant for social system" [2], trust can facilitate cooperation and contribute to economic growth [3,4]. A large-scale survey shows that many people give a positive answer when facing the statement "most people can be trusted", though a recent survey (see "Using Data to Understand Our World") [5] shows that over the past four decades fewer people say they "trust each other", which is an alarming signal. Two fundamental questions, namely what mechanism underlies the emergence of trust and the associated trustworthiness, and under what conditions they are likely to be sustained, have attracted considerable attention from different areas in the past decades.
Trust is the willingness of an agent (the trustor) to act in a way that benefits the other (the trustee), with the expectation that the trustee will return part of the profit afterwards. However, the trustor has no control over the trustee's action, so the act of trust puts the trustor in a vulnerable position. Without any mechanism to enforce the returns, the trustee tends to maximize her/his interest by simply walking away without reciprocation; by backward induction, the trustor should not trust the trustee because of the expected loss of investment. That is the prediction from the assumption of Homo economicus in orthodox economics [6][7][8], where individuals are supposed to be self-interested and their behaviors are guided by the maximization of payoff.
A large number of laboratory experiments on the canonical model, the trust game [9], however, reveal strikingly different observations [9,10]. Across different countries with varied experimental protocols, the trustors are willing to send an average of 50% of their initial endowments to trustees, and the trustees return an average of 37% of the earnings back to trustors. Despite the variations in details, it turns out that human beings remarkably prefer trusting others and are trustworthy to a large extent.
Many efforts have subsequently been made to understand the above discrepancies. Zucker [11] systematically discussed three trust-producing modes: trust is tied to past or expected exchange, to social characteristics, and to formal societal structures. Ref. [12] explored the impact of culture and its relationship with indirect reciprocity [13]. There are other explanations where the emergence of trust and trustworthiness is attributed to further factors, such as social awareness [14], reputation [15], information [16], and delayed [17] or partial information [18]. While network reciprocity is found to potentially promote many altruistic behaviors, such as cooperation [19], fairness [20], and honesty [21], Ref. [22] reveals that its impact on the evolution of trust and trustworthiness is marginal for all types of underlying population networks. Note that most of the aforementioned studies follow an imitation learning rule [23], such as the Moran or Fermi rule [24], where individuals imitate the strategies of their neighbors who may have higher payoffs. In essence, imitative learning can be taken as a simple version of social learning [25], where individuals learn from others in their socio-economic activities, through observations or instructions, which may or may not involve direct experiences.
While social learning is ubiquitous in nature and our society, there is a different learning paradigm that has been largely ignored: reinforcement learning (RL) [26]. As one of the main classes of machine learning algorithms, RL specializes in decision-making based upon experience and has achieved tremendous success in many aspects of science and technology, especially after its marriage with deep learning [27]. Yet only recently has RL started to be applied to evolutionary game theory to help understand the emergence of cooperation [28][29][30], resource allocation [31,32], and other collective behaviors in complex systems [33][34][35][36].
FIG. 1. Emergence of trust and trustworthiness. The color-coded stationary fractions of the four strategies in the (γ, α) domain, with both parameters ranging from 0 to 1. Here, e.g., the strategy combination TB means the player trusts the trustee when acting as a trustor, but chooses betrayal when acting as a trustee. A large fraction of the TR combination is seen in the corner of large γ and small α. Each data point is averaged over 100 realizations. Other parameters: ϵ = 0.01, g = 3, x = 1.0, and ω = 0.5.
In fact, RL has a solid foundation in neuroscience, and existing evidence shows that the working areas and the physiological processes for RL are different from those of social learning, which makes it a fundamentally distinct learning paradigm [37][38][39]. In the social learning paradigm, individuals first observe utilities and then make decisions by comparing their utility with that of others, e.g. imitating the strategies of those peers who have higher utilities. In RL, a kind of value-based learning, individuals score different actions, and an action is chosen probabilistically based on these scores. Here, the most important distinction is that social learning is based upon a utility-comparison rule within the neighborhood, while each individual with RL develops a unique policy by self-reflection. The rule remains the same in social learning, but the policies in RL coevolve with their surroundings and could be unique for each player. This makes the two learning paradigms fundamentally different. Within the paradigm of RL, we are interested in the following question: Is such endogenous, self-reflective learning capable of providing a new paradigm to decipher the emergence of trust and trustworthiness? It could provide insights that are fundamentally different from those of the current social learning paradigm, which relies on exogenous comparison with peers.
In this work, we investigate the evolution of trust and trustworthiness within the paradigm of reinforcement learning. Specifically, we adopt a Q-learning algorithm [40,41] to study the trust game, where each person is guided by two Q-tables, one for the role of trustor and one for the role of trustee. In the two-person scenario, we find that a high level of trust and trustworthiness emerges when individuals both respect their historical experience and have a long-term vision. Similar phenomena are observed when the two-person scenario is extended to the population level. The analysis of the Q-tables reveals the underlying mechanism, in which a crossover of individuals' preferences is detected. We also compute the phase diagram for the associated game parameters, and several boundaries are identified for the onset of the emergence and for regions of different prevalence.
The remainder of this work is organized as follows: we introduce our Q-learning setup for the trust game in Sec. 2. In Sec. 3, we show results for the two-person scenario and provide a mechanistic analysis. The impact of the gain factor on trust evolution is studied, and the boundary analysis is conducted, in Sec. 4. In Sec. 5, we extend our study to the evolution of trust within a population on a one-dimensional lattice. Finally, we conclude our work together with discussions in Sec. 6.
TABLE I. The payoff matrix in the trust game. Within each item, the first payoff is for the trustor (row player) and the second is for the trustee (column player).

METHODS
We start by introducing the trust game [9], where two agents are engaged, one acting as a trustor and the other as a trustee. Initially, the trustor is endowed with one monetary unit and can choose either to keep the status quo or to trust the trustee by transferring an investment fraction x ∈ (0, 1]. With the status quo, denoted as not trust (N), the game is over. With trust (T), the investment is multiplied by a gain factor g, and the game enters the second stage. The trustee has to make a choice between reciprocity (R) and betrayal (B). In the former, the trustee transfers a fraction ω ∈ (0, 1] of its earnings to the trustor as the return, i.e. ωgx. The choice of betrayal means no return: the trustee walks away without reciprocity. At the end of the second stage, the final payoffs for the trustor and trustee are, respectively, 1 − x + ωgx and (1 − ω)gx if R is chosen, whereas the payoffs are 1 − x and gx if B is selected. Here, the investment of the trustor is a manifestation of trust, and the transfer back to the trustor can be interpreted as trustworthiness. Table I summarizes the payoff matrix for the two agents.
In the one-shot anonymous scenario, it is obvious that the trustee prefers to walk away, and betrayal is the reasonable choice to maximize its payoff. Likewise, the trustor is supposed not to trust, as no return is expected. Therefore, the rational solution for the trust game is (N, B). Notice that, from the payoff structure in Table I, the trust game bears a strong resemblance to the Prisoner's Dilemma, but the two are fundamentally different because the two agents in the latter are symmetrical, while they play different roles and act in sequence in the trust game.
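The payoff rules above can be sketched as a small function. This is only an illustration: the function name is hypothetical, and the payoffs for the no-trust branch (the trustor keeps the endowment of 1, the trustee gets 0) are an assumption inferred from the statement that the game ends without any transfer.

```python
def trust_game_payoffs(trustor_action, trustee_action, g=3.0, x=1.0, omega=0.5):
    """Return (trustor_payoff, trustee_payoff) for one round of the trust game.

    trustor_action: "T" (trust) or "N" (not trust)
    trustee_action: "R" (reciprocity) or "B" (betrayal); ignored when the trustor plays N
    """
    if trustor_action == "N":
        # Assumption: with no investment the trustor keeps the endowment of 1
        # and the trustee receives nothing.
        return 1.0, 0.0
    if trustee_action == "R":
        # Trustor keeps 1 - x and receives omega * g * x back.
        return 1.0 - x + omega * g * x, (1.0 - omega) * g * x
    # Betrayal: the trustee keeps the whole multiplied investment.
    return 1.0 - x, g * x


print(trust_game_payoffs("T", "R"))  # (1.5, 1.5)
print(trust_game_payoffs("T", "B"))  # (0.0, 3.0)
```

With g = 3, x = 1.0, and ω = 0.5, mutual trust and reciprocity leaves both players better off than the no-trust outcome, while betrayal gives the trustee the largest one-shot payoff.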
For simplicity, we stick to this 2-player scenario and study their temporal evolution, where the two players play the roles of trustor and trustee in turn. Specifically, they adopt a Q-learning algorithm [40,41]. In this algorithm, each player holds two Q-tables, denoted as Q^A_{s_n} and Q^B_{s_n}, that guide decision-making in the roles of trustor and trustee, respectively; see Table II. One takes an action from the action set {T, N} as a trustor, or from {R, B} as a trustee. The system is in one of four states denoted as S = {s_1, ..., s_4}, with s_1 = TR, s_2 = TB, s_3 = NR, s_4 = NB in Q^A_{s_n}, whereas the states in Q^B_{s_n} are the same but just reshuffled, s_1 = RT, s_2 = RN, s_3 = BT, s_4 = BN. The element Q_{s,a} in the Q-tables is a value function that measures the value of action a in the given state s, and it is constantly updated over time. Players aim to find the optimal policy by Q-learning in terms of maximizing the expected accumulated reward.
[TABLE II. The two Q-tables: rows are the states s_1, ..., s_4 and columns are the actions.]
Without loss of generality, the two players are initially assigned a random strategy from {TR, TB, NR, NB} with equal probabilities. Following common practice, the two roles are switched between the two players from round to round, though assigning the roles at random in each round does not change our findings. In round t, with a probability ϵ, the two players independently choose an action a_t at random from the corresponding action set to conduct a trial-and-error exploration. Otherwise, they choose the action a_t with the larger Q value within the given state s_t in the associated Q-table. This completes the stage of decision-making. Afterwards, each player tries to draw some lessons by revising the state-action value that has been adopted in this round, i.e. the element Q_{s_t,a_t} in the Q-table. The update is as follows:
Q_{s,a}(t+1) = (1 − α) Q_{s,a}(t) + α [r + γ max_{a′} Q_{s′,a′}(t)],   (1)
where s, a represent the current state s_t and action a_t of the focal individual, and s′, a′ are the state s_{t+1} and action a_{t+1} at t + 1. r is the reward obtained for the action a_t within state s_t, with reference to Table I. α ∈ (0, 1] is the learning rate, which captures the contribution of the current step. A larger value of α means that the agent is more forgetful, as old Q values tend to be modified more rapidly. γ ∈ [0, 1] is the discount factor, measuring the weight of future rewards, as max_{a′} Q_{s′,a′}(t) is the maximal value expected in the new state. This completes the stage of Q-table updating, and a single round is done. The evolution protocol is summarized in Fig. 8 in Appendix A for clarity.
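The decision and update steps of one round can be sketched as follows, using the standard Q-learning update that matches the description above. The function names, the dictionary layout of the Q-table, and the tie-breaking (the first listed action wins when Q values are equal) are illustrative assumptions rather than details taken from the text.

```python
import random


def choose_action(q_row, actions, epsilon, rng=random):
    """Epsilon-greedy choice: explore with probability epsilon, else take the larger Q value."""
    if rng.random() < epsilon:
        return rng.choice(actions)  # trial-and-error exploration
    # Assumption: ties are broken in favor of the first listed action.
    return max(actions, key=lambda a: q_row[a])


def q_update(q_table, s, a, r, s_next, alpha, gamma):
    """Apply Q(s,a) <- (1 - alpha) * Q(s,a) + alpha * (r + gamma * max_a' Q(s',a'))."""
    best_next = max(q_table[s_next].values())
    q_table[s][a] = (1.0 - alpha) * q_table[s][a] + alpha * (r + gamma * best_next)
    return q_table[s][a]


# Hypothetical trustor table: one row per state, one entry per action, all zeros.
q_trustor = {s: {"T": 0.0, "N": 0.0} for s in ("TR", "TB", "NR", "NB")}
action = choose_action(q_trustor["NB"], ("T", "N"), epsilon=0.01)
new_value = q_update(q_trustor, "NB", action, r=1.5, s_next="TR", alpha=0.1, gamma=0.9)
```

Each player would keep one such table per role and call `q_update` only for the state-action pair actually used in the round.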
In practice, we focus on the case with fixed game parameters g = 3, x = 1.0, and ω = 0.5 [22]. This corresponds to the following scenario: the trustor transfers all the money to the trustee, and after the money is tripled, half of the profit is returned to the trustor. Note that we adopt a time-varying form for both the learning rate and the exploration rate at the beginning [42], i.e. ϵ(t) = max{4.0/√t, ϵ} and α(t) = max{6.0/√t, α}. This means that the two rates start with large values and, as time goes by, become fixed once they reach the desired values (ϵ and α will be set in the following studies); see Fig. 3(c). This setup does change the long-term evolutionary outcome compared with the case of all-fixed parameters; for details see Appendix B. For each case, the simulation is run over 10^8 time steps to guarantee that the evolution reaches a stable state.
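As a rough illustration of the time-varying rates, the sketch below reads the schedule as max{c/√t, floor}, with c = 4.0 for the exploration rate and c = 6.0 for the learning rate. That reading of the formula, the function names, and the default floor values are assumptions.

```python
import math


def decayed_rate(t, coefficient, floor):
    """Return max(coefficient / sqrt(t), floor) for round t >= 1."""
    # For small t the raw value exceeds 1; how that is handled is not stated
    # in the text, so it is left unclipped here.
    return max(coefficient / math.sqrt(t), floor)


def epsilon_at(t, epsilon=0.01):
    """Exploration rate at round t (assumed form)."""
    return decayed_rate(t, 4.0, epsilon)


def alpha_at(t, alpha=0.1):
    """Learning rate at round t (assumed form)."""
    return decayed_rate(t, 6.0, alpha)


for t in (100, 3600, 160000, 10**6):
    print(t, epsilon_at(t), alpha_at(t))
```

Under this reading, with α = 0.1 and ϵ = 0.01, the learning rate reaches its floor at t = 3600 and the exploration rate at t = 160000.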

Emergence of trust and trustworthiness
We find that a prominent emergence of both trust and trustworthiness is observed in a physically meaningful region; see the phase diagram in Fig. 1. It reports the final fractions of the four strategies {TR, TB, NR, NB} in the learning parameter domain (γ, α). As shown, there is a red region where TR dominates (its fraction is larger than 0.6), in which the learning rate α is small and the discount factor γ is large. This observation means that when individuals value both historical experience and a long-term vision, the levels of both trust and trustworthiness rise. Otherwise, a forgetful property (a large α) and/or a short-term vision (a small γ) leads to the failure of their emergence. Notice that the fraction of the strategy TB remains small across the whole domain, meaning that once trust is adopted, the agent also shows trustworthiness when acting in the role of trustee: trustors never betray. But once distrust is chosen, reciprocity and betrayal are equally likely to be chosen.
To understand the emergence, we show the time evolution of the four fractions for the case of α = 0.1 and γ = 0.9, a typical combination of parameters located in the red region in Fig. 1; see Fig. 2(a). The fractions start from around 0.25 due to random initialization, and then the two fractions of distrust (NR and NB) rise. As time goes by, these two fractions turn down: all three fractions except that of the strategy TR (the red line) decline, the fraction of NB (the black line) almost vanishes, and the system becomes stable in the end.
The reason for the initial decline in trust is straightforward: the level of reciprocity is low at the beginning (even below 50%), putting the trustor in a disadvantageous position; see Fig. 2(b). The disadvantage of trustors is seen more clearly in Fig. 2(c), where their net earnings are negative, i.e. they are losing money. At the same time, the trustee is inclined to betray because, as long as the trustor is willing to invest, betrayal tends to yield a higher short-term payoff compared with reciprocity. This is in line with the prediction from the Homo economicus perspective. This leads to a slight increase in the preference for betrayal, and consequently reinforces the decline in the preference for trust.
Unexpectedly, the decline in the fraction of trust ceases after around a hundred rounds, and the fraction starts to turn up. This crossover can be understood by simultaneously monitoring the fractions of the two players' strategies [Fig. 2(b)]. As can be seen, after the transient, the trustor tends not to invest in a trustee who betrayed in the last round, but learns to invest in one who reciprocated, which explains why the fraction of TB declines while TR starts to rise. This observation is also explained in Fig. 2(d), where the payoff advantage of choosing B over R is lost at the moment indicated by the dashed line. Intuitively, as the trustee learns that betrayal attracts no investment, it starts to reciprocate the trustor instead of walking away. Once this trend starts, a higher level of reciprocity is preferred by the trustee, which in turn enhances the preference for action T for the trustor. This then forms a positive feedback that finally yields a high preference for both trust and trustworthiness.

Evolution of Q-tables
For a deeper understanding of the mechanism, let us direct our attention to the evolution of the two Q-tables. It shows that the Homo economicus-like preference for NB is only present at the very beginning; in the later stage, the evolution aimed at maximizing long-term payoffs forces the players to turn to TR, which resembles human psychological changes.
Specifically, we focus on the Q-value difference for each row of the two Q-tables, represented as ∆Q^{A,B}_{s_n} = Q_{s_n,a_1} − Q_{s_n,a_2}, where a_1 and a_2 are the two available actions (T and N for the trustor, R and B for the trustee). This difference determines the preferred action within the given state s_n according to the idea of Q-learning. For example, ∆Q^A_{s_n} > 0 means that the action T is preferred within the state s_n when acting as a trustor; otherwise N is supposed to be the better choice. Likewise, ∆Q^B_{s_n} > 0 means that action R is considered to be better when playing as a trustee; otherwise B is preferred. In our study, the evolution of the two Q-tables is found to be statistically the same for both individuals, as their learning parameters are identical, implying that they have quite similar cognitive processes. Therefore, we only focus on the evolution of the ∆Q^{A,B}_{s_n} values for one of the two individuals, as illustrated in Fig. 3.
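The row-wise preference test can be illustrated with a short sketch. It assumes ∆Q is the Q value of the first action (T or R) minus that of the second (N or B) in each row; the helper names and the numerical Q values are hypothetical and chosen only to mimic a trustor who trusts after reciprocity.

```python
def delta_q(q_table, positive_action, negative_action):
    """Return {state: Q[state][positive_action] - Q[state][negative_action]}."""
    return {s: row[positive_action] - row[negative_action] for s, row in q_table.items()}


def preferred_actions(q_table, positive_action, negative_action):
    """Map each state to the action favored by the sign of delta Q."""
    differences = delta_q(q_table, positive_action, negative_action)
    return {s: positive_action if d > 0 else negative_action
            for s, d in differences.items()}


# Hypothetical trustor Q-table (values are made up for illustration only).
q_trustor = {
    "TR": {"T": 12.0, "N": 10.0},
    "TB": {"T": 8.0, "N": 9.5},
    "NR": {"T": 11.0, "N": 10.0},
    "NB": {"T": 7.0, "N": 9.0},
}
print(preferred_actions(q_trustor, "T", "N"))
# {'TR': 'T', 'TB': 'N', 'NR': 'T', 'NB': 'N'}
```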
Fig. 3(a) and 3(b) show, respectively, the time evolution of ∆Q^B_{s_n} and ∆Q^A_{s_n} for all four states. At the initial stage (t ≲ 100, before the marked dashed line), ∆Q^{A,B}_{s_n} in both Q-tables mostly become more negative by learning, and the strategy of the individual converges to NB, where as a trustor one is unwilling to invest, and as a trustee one opts to betray. This is reasonable because betrayal pays the trustee more than reciprocity in the short term, and not investing avoids potential money loss for the trustor. As a result, the dominating state in the system is NB for a trustor and BN for a trustee; both players indeed act like a Homo economicus.
As time comes to t ≈ 100, the advantage of betrayal over reciprocity diminishes, because no investment comes from the trustor. This causes ∆Q^B_{BN} to reverse and become positive (the solid black line), and the action R is then preferred. This critical transition, however, does not immediately lead to a boom of trust or trustworthiness, because all four ∆Q^A_{s_n} in Fig. 3(b) are negative at that moment, meaning that distrust is still dominating. Actually, the action of reciprocity is unstable: since ∆Q^B_{BN} > 0 and ∆Q^B_{RN} < 0, the state for the trustee oscillates between RN and BN. Therefore, at this stage, still no driving force towards either trust or trustworthiness is seen, as shown in Fig. 3(c), where TR is unstable for both trustor and trustee since the associated ∆Q^A_{TR} and ∆Q^B_{RT} are both negative to the left of the vertical dashed line.
An important change unfolds afterwards (i.e., t ≳ 100), as can be seen in Fig. 3(c). There, as the learning rate α and the exploration rate ϵ decrease, some optimal or nearly optimal policies have been learnt by individuals, and they rely more on their historical experience and conduct fewer random explorations. As shown, the trustee becomes gradually inclined to choose reciprocity rather than betrayal, as ∆Q^B_{RT} stops decreasing and starts to grow. When this turnover is detected, the trustor also starts to trust as a response, where a turnover is also present for ∆Q^A_{TR}, but with a time delay. Once both players turn to TR, a positive feedback loop forms: both are well paid, leading to increasing ∆Q values that are both positive in the end. When the system becomes stable in the long run (t ≳ 10^6), the trustee always chooses R since all its ∆Q^B > 0 [Fig. 3(a)]. On the trustor side, however, ∆Q^A_{s_n} > 0 only for the states with reciprocity (i.e. for s_n = TR and NR); the trustor chooses not to trust (for s_n = TB and NB) when the trustee betrayed [Fig. 3(b)]. This punishment-like policy further forces the occasional betrayals back to reciprocity. This then produces a stable emergence of trust, where a decent level of trust and trustworthiness is seen.
Note that the final rise of ∆Q for both Q-tables also benefits from the diminishing exploration rate [Fig. 3(c)], where both trustor and trustee choose TR with a large probability. A noisier scenario (a large ϵ), which can be interpreted as many misunderstandings or "trembling hands" [43], can considerably suppress the level of trust and trustworthiness; see Appendix C.
The above analysis shows that the emergence of trust and trustworthiness is caused by the preference transition from NB to TR. To further confirm the analysis, we compute the joint probability for two consecutive states, P(s_t, s_{t+1}); all state transitions at different stages are shown in Fig. 4. In the early stage of evolution (0 < t < 1000), almost all mode transitions are detected; however, some modes are more likely to happen, such as RN-BN, BN-RN, BN-BN, BN-RT, and RT-BN, as shown in Fig. 4(a). This means that the state BN is the main state at this stage, in line with our above argument. As analyzed above, once the trustor and trustee start to form a positive feedback loop, the strategy TR starts to dominate, as can be seen in Fig. 4(b). Apart from the dominating RT-RT mode, the other two bars, BN-RT and RT-BN, are also present, meaning that the strategy pair RT flips to BN from time to time, but flipping back to RT is equally likely, as the two bars are of nearly the same height. In Fig. 4(c-d), the RT-RT mode becomes even higher, though the other two bars (i.e. BN-RT and RT-BN) are still present due to the non-vanishing ϵ. By then, the evolution of trust and trustworthiness has become stable.
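The joint probability of two consecutive states can be estimated from a recorded state sequence by counting adjacent pairs. The sketch below is one simple way to do this; the function name and the short example sequence are hypothetical and not taken from the simulations.

```python
from collections import Counter


def transition_probabilities(states):
    """Estimate P(s_t, s_{t+1}) as the relative frequency of consecutive state pairs."""
    pairs = Counter(zip(states, states[1:]))  # count each (s_t, s_{t+1}) pair
    total = sum(pairs.values())
    return {pair: count / total for pair, count in pairs.items()}


# Hypothetical trustee-side state sequence.
sequence = ["BN", "BN", "RT", "RT", "RT"]
print(transition_probabilities(sequence))
# {('BN', 'BN'): 0.25, ('BN', 'RT'): 0.25, ('RT', 'RT'): 0.5}
```

Restricting the sequence to a time window (for example 0 < t < 1000) before counting gives the stage-wise statistics described above.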
Based on the above analysis, the emergence of trust can be roughly divided into three stages:
1) Initially, the trustee finds that betrayal is more profitable than reciprocating the trustor, and is thus inclined to choose B. Betrayal gradually becomes prevalent. As a consequence, the trustor chooses not to trust, as the net earning from investing is negative. The levels of trust and trustworthiness both decrease.
2) As less investment is detected, the tendency towards betrayal is reversed, and the trustee starts to be inclined to reciprocate the investment. Once this turnover occurs, the trustor also starts to invest.
3) The preference change towards TR for the two players then forms a positive feedback that strengthens the advantage of T and R in their Q-tables. In the end, the trustee prefers R in almost all scenarios, and the trustor trusts reciprocating trustees but invests no money in a betraying trustee as a punishment. This guarantees a decent level of trust and trustworthiness.
However, as the two learning parameters (γ, α) deviate from the red region in Fig. 1(a), the mechanism behind stages 2) and 3) could be ruined. Analyses based on the evolution of the Q-tables indicate that for a large α, past experience is rapidly washed out, so that no lesson can be drawn from history. In this case, the evolution of the game degrades to the classic iterated scenario in the absence of Q-learning, where the trustor is not willing to invest, and trust and trustworthiness fail to emerge [22]. Meanwhile, for small γ, the positive feedback between T and R fails to establish without confidence in future rewards. For a detailed analysis see Appendix D.

IMPACT OF GAIN FACTOR AND BOUNDARY ANALYSIS
The revealed mechanism is robust against the game parameter x, but shows an intricate dependence on the return fraction ω and the gain factor g. In the original model [9], the investment is tripled on the trustee side, while in some other work the gain factor g is set to 2 or other values [44,45]. To systematically investigate the impact of the gain factor g on the emergence of trust and trustworthiness, we first show the dependence of the TR fraction on the investment fraction x and the return fraction ω for g = 2, 3, 4, 5 in Fig. 5. As can be seen, the fraction of TR shows no dependence on the investment fraction x in all four cases, which is reasonable since the investment fraction only affects the absolute payoffs, but not the relative values in the Q-tables. However, a larger gain factor widens the region where trust and trustworthiness emerge. We find that there is a simple relationship determining the left boundary. For a trustor, the bottom line for investing is not to lose money, i.e. gxω ≥ x, leading to the following inequality:
gω ≥ 1.   (2)
This immediately gives the left boundaries in Fig. 5, i.e. ω_c^{(1)} = 1/g, which is well confirmed and explains the independence of x.
This theoretical argument is better validated in Fig. 6 within the parameter domain ω − g by fixing x = 0.5. We see that the hyperbolic boundary fits the simulation results well within a wide range of the gain factor g. A closer look shows that the boundary slightly shifts to the right as g increases compared with the theoretical prediction. This implies that when the gain becomes so high, the trustor would take some risk of losing money to invest in the trustee. Actually, the relationship revealed in Eq. (2) is supported by previous experiments [10,46,47], which found that trust is promoted as the gain factor increases.
Interestingly, the right boundary seen in Fig. 5 seems independent of the gain factor g, with ω_c^{(2)} ≈ 0.8. This means that a trustee would show trustworthiness only if she can keep at least 20% of the pie; otherwise she just walks away, even if the gain factor is large enough that a positive earning is expected. This observation implies that the emergence of trustworthiness is built upon fairness: individuals desire a fair division of the pie. Furthermore, ω^{(3)} = 0.5 seemingly marks another threshold, below which a considerably high level of TR is possible, especially for a large gain factor. This means that full reciprocity further requires that the trustee can keep at least half of the pie.
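The break-even argument for the left boundary can be checked numerically. The sketch below computes the trustor's net gain from trusting a reciprocating trustee, (1 − x + ωgx) − 1 = x(gω − 1), and the resulting critical return fraction 1/g; the function names are hypothetical.

```python
def trustor_net_gain(g, x, omega):
    """Net gain of a trusting trustor facing reciprocity: (1 - x + omega*g*x) - 1."""
    return x * (g * omega - 1.0)


def critical_omega(g):
    """Smallest return fraction at which the trustor does not lose money."""
    return 1.0 / g


for g in (2, 3, 4, 5):
    omega_c = critical_omega(g)
    # The sign of the net gain does not depend on x, only on g * omega.
    print(g, omega_c, trustor_net_gain(g, 0.5, omega_c + 0.1) > 0)
```

Because x only scales the net gain, the sign of x(gω − 1) is the same for every x in (0, 1], which is consistent with the boundary being independent of the investment fraction.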

1-DIMENSIONAL LATTICE
In fact, the findings are not restricted to the above 2-player scenario; they are robust and can also be seen at the population level. An example of a 1-dimensional lattice with size N = 50 and k = 2 is shown in Fig. 7. Since there are two nearest neighbors for each individual in this scenario, the Q-table has to be expanded accordingly; detailed settings can be found in Appendix E. We find a qualitatively similar phenomenon: trust and trustworthiness tend to arise when both historical experience and the long-term vision are emphasized, and the fraction of TR stabilizes at around 0.7 in the long run. In Fig. 7(a), we can see that at the early stage, the non-trusting strategy (N, N) against the two nearest neighbors dominates when acting as a trustor, but as time goes by, players gradually tend to invest. Also, the level of reciprocity is low at the beginning but becomes high later on; see Fig. 7(b). These observations are similar to the 2-player scenario. Notice that, due to the presence of more neighbors, a smaller exploration rate ϵ and a higher expectation of future reward γ are needed to maintain a stable surrounding with a high level of TR, compared with the 2-player scenario. Actually, a pair of players act according to their two Q-tables regarding the other person, relying only on the agreement reached between them. Therefore, even when an individual has multiple neighbors, trust in one neighbor does not imply trust in others, i.e. the occurrence of trust and trustworthiness is by nature pairwise between the two involved individuals, independent of the rest. As a result, when extended to other complex networked populations, we expect the mechanism and phenomena to be similar to the 2-player scenario.

DISCUSSION
In summary, we have investigated the trust game within the paradigm of reinforcement learning, each player acts following a Q-learning algorithm, and we focus on the evolution of trust and trustworthiness.Surprisingly, high levels the trust and trustworthiness emerges in the two-player scenario when players both care about the historical experience and have long-term vision.The evolution of the associated two Qtables reveals a crossover in the action preference, which resembles the psychological transition when we human beings playing the game.Our boundary analysis shows that a high level of trust and trustworthiness requires that the net earnings for the investment for the trustor is positive and the trustee can keep half of the earnings in hand.Furthermore, if the action choice deviates much from learnt Q-tables, this "trembling hand" effect undermines the evolution of trust and trustworthiness, where the desired relationships are broken down.Finally, the emergence of trust and trustworthiness can also be seen when the scenario is changed into a latticed population.
Interestingly, part of these observations were also seen in a series of experiments by Engle-Warnick and Slonim [48][49][50].They explored the evolution of strategies in repeated trust game experiments, and revealed that experience, attitude toward the future, institutions, and other factors have an important influence on the strategy selection in the repeated trust game.The results confirm that concerns for the future of repeated interactions and past experiences in game history are important for the persistence of trust.
Most importantly, we do not resort to any external factor as assumed in most previous work with social learning.Our reinforcement learning provides a natural explanation for the emergence of trust and trustworthiness.This indicates that past experience and the expectation for return in the future together as the endogenous factors are sufficient to trigger their emergence.In fact, existing efforts within this paradigm show that it can also provide explanations for understanding cooperation [28][29][30], resource coordination [31,32] etc.These work suggest that reinforcement learning demonstrates its power in explaining human behaviors, where the evolution of Q-table provides a uniform mechanism framework.Given the consistent experimental deviation from the predictions of Homo economicus regarding different altruistic behaviors [10,[51][52][53], such as cooperation, fairness, trust and trustworthiness, the reinforcement learning may provide a uniform paradigm to decipher complexities of human psychology, shedding new insights into the understanding of moral behaviors [54,55].
Although we adopt reinforcement learning as our framework, we do not deny the value of the social learning paradigm that has been widely used in previous studies. In fact, the two paradigms are not contradictory but complementary. To date, there is plenty of experimental evidence in neuroscience showing that decision making under both modes of learning has a solid neural basis [37][38][39], indicating that they may operate in different scenarios. More probably, learning processes in the real world are a mixture of the two when humans deal with complex issues. An important next step is to infer the type of learning from behavioral experiments. Only when the learning paradigms at work in reality are clarified will we be on the right track to understand many important issues in human society, such as cooperation, fairness, and honesty.

Appendix D: Failed cases of trust emergence
As the two key learning parameters (γ, α) deviate from the ideal combination, the emergence of trust and trustworthiness can fail. Three cases with typical parameter combinations are investigated; the time series of the four fractions and the corresponding $\Delta Q^{A,B}_{s_n}$ are shown in Fig. 11. The failures can be attributed to two aspects.
i) When the learning rate α becomes large [e.g. α = 0.9 in Fig. 11(d-f) and (g-i)], the historical experience of the players is overwritten almost immediately, and almost no lesson is kept in the Q-table. In this scenario, the Q-learning algorithm loses its strength and the evolution degenerates to the traditional iterated trust game, where the trustor is not willing to invest in the trustee. Note that once no investment is made, reciprocating behavior from the trustee is pointless; therefore, no decent level of trust and trustworthiness is seen, as shown in Fig. 11(d) and (g).
ii) When the discount factor γ becomes small, another problem arises. Without confidence in future rewards, it is hard for the trustor to foresee potential reciprocity from the trustee and thus to choose to trust, and vice versa. As a result, the positive feedback between trust and reciprocity fails to form, and a decent level of trust and trustworthiness is again hard to reach [e.g. γ = 0.1 in Fig. 11(a-c) and (d-f)].
With these observations, it is easy to understand why a decent level of trust and trustworthiness is observed only for the parameter combination of small α and large γ in Fig. 1.
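The two failure modes can be illustrated numerically with the standard tabular Q-learning update, Q ← (1 − α)Q + α(r + γ max Q′). The sketch below is an illustration under that textbook update rule, not the simulation code behind Fig. 11; the helper names `retained_fraction`, `discount_weight` and `q_update` are hypothetical and introduced only for this example.

```python
def retained_fraction(alpha: float, n_updates: int) -> float:
    """Fraction of an old Q-value that survives n_updates further updates
    of the same entry under Q <- (1 - alpha) * Q + alpha * target."""
    return (1.0 - alpha) ** n_updates


def discount_weight(gamma: float, n_steps: int) -> float:
    """Weight assigned to a reward received n_steps rounds in the future."""
    return gamma ** n_steps


def q_update(q_old: float, reward: float, max_q_next: float,
             alpha: float, gamma: float) -> float:
    """One standard tabular Q-learning update for a single (state, action) entry."""
    return (1.0 - alpha) * q_old + alpha * (reward + gamma * max_q_next)


if __name__ == "__main__":
    # Large alpha: past experience is overwritten almost immediately.
    for alpha in (0.1, 0.9):
        kept = retained_fraction(alpha, 3)
        print(f"alpha={alpha}: fraction of old Q kept after 3 updates = {kept:.3f}")
    # Small gamma: rewards a few rounds ahead barely count.
    for gamma in (0.1, 0.9):
        weight = discount_weight(gamma, 3)
        print(f"gamma={gamma}: weight of a reward 3 rounds ahead = {weight:.3f}")
```

For α = 0.9, only about 0.1% of a Q-value's earlier content survives three updates, compared with about 73% for α = 0.1. Likewise, γ = 0.1 weights a reward three rounds ahead by 0.001, versus 0.729 for γ = 0.9.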

Appendix E: Setup of Q-table in 1D latticed population
When we extend the 2-player scenario to the one-dimensional latticed population, where each player is connected to its 2 nearest neighbors, i.e. the degree k = 2, the action set for the trustor is extended to $A_r = \{(T,T), (T,N), (N,T), (N,N)\}$, where the first and second entries are the strategies played against the left and right nearest neighbors, respectively. Similarly, the action set for the trustee is $A_e = \{(R,R), (R,B), (B,R), (B,B)\}$. The state for the trustor is expanded to $S = \{s_1, ..., s_{16}\}$, with $s_1 = (TT, RR)$, $s_2 = (TT, RB)$, ..., $s_{16} = (NN, BB)$. For example, $s_2 = (TT, RB)$ means that the player invested in both of her nearest neighbors in the last round, and the left one showed reciprocity but the right one betrayed. The two Q-tables $Q^{A,B}_{s_n}$ are illustrated in Table III and Table IV for the player in the role of trustor and trustee, respectively.
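The enumeration of actions and states can be sketched in a few lines. This is an illustrative sketch rather than the authors' code: it assumes that Q-values are initialised to zero and that the trustee table is indexed by the same 16 states, neither of which is specified in this appendix. The names `TRUSTOR_ACTIONS`, `TRUSTEE_ACTIONS`, `STATES` and `make_q_table` are hypothetical.

```python
from itertools import product

# Joint actions toward the (left, right) nearest neighbors.
# Trustor moves: T (trust) or N (no trust); trustee moves: R (reciprocate) or B (betray).
TRUSTOR_ACTIONS = list(product("TN", repeat=2))  # (T,T), (T,N), (N,T), (N,N)
TRUSTEE_ACTIONS = list(product("RB", repeat=2))  # (R,R), (R,B), (B,R), (B,B)

# Each state pairs the trustor's last joint action with the neighbors' last responses,
# ordered so that s_1 = (TT, RR), s_2 = (TT, RB), ..., s_16 = (NN, BB).
STATES = [
    ("".join(own), "".join(response))
    for own, response in product(TRUSTOR_ACTIONS, TRUSTEE_ACTIONS)
]


def make_q_table(states, actions, initial_value=0.0):
    """Build a Q-table as a nested dict: state -> action -> Q-value.
    Zero initialisation is an assumption made for this sketch."""
    return {state: {action: initial_value for action in actions} for state in states}


# One table per role, each with 16 states and 4 joint actions.
q_trustor = make_q_table(STATES, TRUSTOR_ACTIONS)
q_trustee = make_q_table(STATES, TRUSTEE_ACTIONS)

if __name__ == "__main__":
    print(len(STATES), STATES[0], STATES[1], STATES[-1])
```

Storing each table as a nested dictionary keeps the state and action labels readable; an array of shape 16 × 4 would be an equivalent and faster representation for large populations.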

FIG. 2. Crossover in time series. Typical time series in the 2-player scenario for the learning parameter combination α = 0.1 and γ = 0.9. (a) The time series for the fractions of the four strategy combinations; (b) four time series at the pairwise level, obtained by simultaneously monitoring the fractions of the two players' strategies; (c) the evolution of the net payoffs to the trustor; (d) the evolution of the payoffs corresponding to the different actions chosen by the trustee. The fractions of no trust (N) and betray (B) are provided in (c, d) for reference, with the corresponding y-axis placed on the right side. The vertical dashed lines mark the same transition moment, at which the level of trust and trustworthiness starts to rise. Each data point is averaged over 500 realizations in (a-d); in addition, a sliding-window average of 60 steps is applied in (c, d). Other parameters: ϵ = 0.01, g = 3, x = 1.0, and w = 0.5.

FIG. 3. Evolution of Q-tables. (a) The evolution of $\Delta Q^B_{s_n}$ for the trustee and (b) $\Delta Q^A_{s_n}$ for the trustor in all four possible states. If $\Delta Q^A_{s_n} > 0$, the action T is preferred in state $s_n$ when in the role of trustor; otherwise N is favored. Similarly, if $\Delta Q^B_{s_n} > 0$, the action R is favored when acting as a trustee; otherwise B is preferred. The vertical dashed lines in (a) and (b) separate the two stages: NB is preferred in the first stage, which is followed by a turnover where a positive feedback forms that promotes the prevalence of both trust and trustworthiness. (c) The two curves $\Delta Q^A_{TR}$ and $\Delta Q^B_{RT}$ from panels (a, b) are plotted together for clarity, along with the exploration rate ϵ and the learning rate α. The vertical line in (c) marks the approximate transition after which the individuals consistently choose to trust and show trustworthiness. Each data point is averaged over 500 realizations. Parameters: ϵ = 0.01, g = 3, α = 0.1, γ = 0.9, x = 1.0, w = 0.5.

FIG. 7. Spatiotemporal evolution of the 1D latticed population. (a) and (b) correspond to scenarios where the individual i acts as a trustor or a trustee, respectively. In each scenario, there are four action combinations, e.g., the strategy (N,T) in (a) means that as a trustor, the player chooses not to trust the left nearest neighbor, but trusts the right nearest neighbor. Parameters: N = 50, k = 2, ϵ = 0.001, α = 0.1, and γ = 0.98.

TABLE II. The two Q-tables for each individual in the 2-player scenario: $Q^A_{s_n}$ (left) and $Q^B_{s_n}$ (right) are the Q-tables for the roles of trustor and trustee, respectively.

TABLE IV. The Q-table for the trustee in the one-dimensional latticed population, where each player is connected to its 2 nearest neighbors.