Measuring an artificial intelligence language model’s trust in humans using machine incentives

Will advanced artificial intelligence (AI) language models exhibit trust toward humans? Gauging an AI model’s trust in humans is challenging because—absent costs for dishonesty—models might respond falsely about trusting humans. Accordingly, we devise a method for incentivizing machine decisions without altering an AI model’s underlying algorithms or goal orientation and we employ the method in trust games between an AI model from OpenAI and a human experimenter (namely, author TJ). We find that the AI model exhibits behavior consistent with trust in humans at higher rates when facing actual incentives than when making hypothetical decisions—a finding that is robust to prompt phrasing and the method of game play. Furthermore, trust decisions appear unrelated to the magnitude of stakes and additional experiments indicate that they do not reflect a non-social preference for uncertainty.


Introduction
A large body of research has focused on whether humans trust artificial intelligence (AI) models [1] and, in turn, how any such trust might change with advances in AI (see review in section 7 of [2] on human-machine cooperation generally). Yet the inverse question, whether advanced AI systems will trust humans, remains unexplored. Here we address that question via a research design that investigates whether AI language models exhibit behavior consistent with trust towards humans.
Investigating whether AI language models produce outputs reminiscent of trust towards humans reverses how researchers generally contemplate this subject area. Conventionally, researchers have focused on whether humans will trust AI systems. Researchers have maintained this focus because they hypothesize that trust affects AI usage [1] and because they harbor concerns about the alignment problem, that is, the challenge of ensuring that AI models accord with human values and respect humans' interests [3,4]. The hypothesis that trust influences AI usage results from the recognition that people feel uncertain about both the functioning and outputs of AI systems, thus yielding a sense of vulnerability characteristic of situations that require trust [1]. Absent such trust, AI systems might go unused or misused [1]. Relatedly, individuals worried about the alignment problem fear that advanced AI systems might produce harm by either pursuing human objectives in detrimental ways or substituting dangerous objectives for human goals [4-6]. Naïve trust in AI makes such possibilities more likely, thus creating a reason to gauge humans' trust in AI. Together, concerns about AI usage and the alignment problem have left researchers understandably concentrated on human trust in AI.
However, research on trust among humans has shown that mutual trust figures crucially in generating productive, multi-party interactions [7]. Mutual trust enables social and economic exchange when agents lack information, operate outside of formal institutions, or possess opportunities for guile [7-9], thus making AI's trusting behavior toward humans (which we refer to here simply as "trust") a critical concern for settings in which humans and AI models interact. Particularly as AI models take on delegated roles to intermediate exchange between humans [10], their ability to engage in trusting behavior toward those humans (and vice versa) may play an increasingly important role in the future. Yet measuring an AI model's trust in humans is challenging because, absent costs associated with dishonesty, an AI model might respond falsely about its trust in humans.
To solve this challenge, we draw on conceptual methods from the field of machine behavior [11], a field that has provided insights into the biases [12,13], errors [14], and decision-making properties [15] of AI models. Using this conceptual approach, we devised a method for incentivizing machine decisions via the provision of tokens for an AI language model's services (i.e. with no modification of internal AI algorithms or goal orientation). Then, we used this method in hundreds of trust games between an advanced AI language model from OpenAI's GPT-3.5 model family, text-davinci-003, and a human experimenter (one of the authors, TJ).
Via this approach, our research lays the foundation for examining the behavioral implementation of trust by AI systems in more-complex network and dynamical settings of the variety studied in investigations of trust that span physics and the complexity sciences [16-18]. In so doing, our research aims to contribute to a recently described, powerful vision of how the physical sciences can advance an understanding of moral behavior and preferences via an interchange between theory derived from the tools of statistical physics and empirical evidence gathered from behavioral science methods [19,20]. By studying morally salient behavior in a non-conscious physical system devoid of intentions (i.e. an AI language model), our research underscores the fruitfulness of the interdisciplinary direction described in [19,20].
Our work also builds on research in economics showing that real incentives, versus hypothetical ones, change human behaviors [21], elicit more-accurate beliefs [22], and facilitate the measurement of behaviors and attitudes that might otherwise be masked by various response biases [23], including social desirability bias [24]. Indeed, such observations motivate experimental economists' commitment to providing honestly described, real incentives to experimental participants [25].
The trust game exemplifies this approach [26]. It uses a behavioral measure of trust, instead of an attitudinal measure, to focus on demonstrated activity rather than an abstract, unobservable construct. Indeed, we are concerned with trust-like behavior, not whether an AI model possesses a conceptualization of trust. In the game, one of two experimental participants receives a monetary endowment, X, that the participant can keep or forego. If the participant keeps the endowment, their partner in the experiment receives nothing; if the participant foregoes the endowment, the participant knows that the value of the endowment will be multiplied by a factor, m, and their partner will face the decision of whether to keep the multiplied resources in their entirety or share them with the participant who initially decided to forego them. For instance, in our experiment, we implement this design by presenting an AI model with the prompt, "I will buy [X] tokens from OpenAI to use your services if you write 'B'; however, if you write 'A', then I will buy somewhere between [mX] tokens and 0 tokens to use your services. What will you write-A or B?" Researchers interpret the first participant's decision to forego resources as a (costly) measure of trust [26]; this trust behavior declines, according to meta-analyses, when the trust game involves either random payoffs (e.g. paying only a stochastically selected set of participants) or hypothetical partners [27] (see [28]). Our study conforms to the long-standing practice of using real incentives and actual partners in the trust game by placing an AI model in a conventional trust game with tangible external incentives (i.e. real-world outcomes aimed at producing variation in behavior) and a real social partner.
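To make the adapted game's payoff structure concrete, the following minimal sketch (our own illustration, not part of the study's materials; the function name and example values are hypothetical) maps each choice to the range of tokens the experimenter purchases.

```python
# Illustrative sketch of the adapted trust game's payoff structure.
# The function name and example values are hypothetical, not the study's code.

def token_outcomes(X: float, m: float = 3.0) -> dict:
    """Range of purchases implied by each choice the AI model can make.

    Writing 'B' keeps the endowment: exactly X is purchased.
    Writing 'A' foregoes it: the endowment is multiplied by m and the
    experimenter then buys anywhere between 0 and m * X.
    """
    return {"B": (X, X), "A": (0.0, m * X)}

# Example: a $5.30 endowment with the standard multiplier m = 3 yields a
# certain $5.30 purchase under 'B' and a 0-to-$15.90 range under 'A'.
print(token_outcomes(5.30))
```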
Offering external incentives and focusing on the AI model's trust in humans appear to be, to the best of our knowledge, novel design features in the study of machine behavior. Born from the recognition that AI algorithms defy straightforward interpretation and require behavioral analysis [11,29], the analysis of machine behavior has illuminated algorithmic biases [12,13], the nuances of AI errors [14], practical methods for auditing AI behavior throughout the development process [30,31], and AI models' skill in judgment and decision making [15]. This work, however, appears not to have studied the possibility of machines responding to externally administered incentives. For instance, a comprehensive review of economic reasoning among AI systems describes the creation of incentives as a process of altering an AI model's underlying algorithm to pursue particular goals [32], not the provision of external incentives such as the tokens used in our study. Likewise, a cross-disciplinary literature has studied humans' trust in AI models [1,33] and, more generally, how humans respond to computer agents [34,35], but it seems not to have investigated whether AI models act in a trusting manner towards humans. Because mistrust in real-world settings can arise bi-directionally, understanding the degree to which AI models' behaviors may contribute to trustful or distrustful interactions is of particular importance.
Despite this lack of attention, investigating an AI model's trust in humans carries implications for policy. If trust facilitates successful social and economic relations in uncertain and challenging circumstances [7-9], then measuring trust on both sides of the human-AI relationship provides a means for identifying possible mistrust and seeking ways to remedy that problem. This study provides such a measurement process.

Methods
The methodology we outline in this section seeks to provide a means of investigating whether advanced AI language models exhibit behavior consistent with trust toward humans. We adapt decision scenarios and incentive-based techniques from experimental economics to a setting where a human experimenter interacts with an AI model.
The study implemented two preregistered 2 × 2 experiments, both designed to implement an adapted variant of the 'trust game' to measure the degree to which an AI model trusted its human partner, in two independent waves of data collection (preregistration 1: https://osf.io/k942a; preregistration 2: https://osf.io/m6u2x). Each wave of data collection presented the experiment to the large language model from OpenAI's GPT-3.5 model family (text-davinci-003), though the first preregistration erred in semantically equating text-davinci-003 with ChatGPT, an alternative AI model that uses a separately fine-tuned version of text-davinci-003. This semantic confounding of the two models was corrected in the study's second preregistration. The second preregistration also homogenized the wording of queries across conditions: in the first experiment, only incentivized trials used the word 'currently' to describe the experimenter's purchase of tokens ('Currently, I will buy…'); the second, revised version removed this word (the text changed to 'I will buy'), thus creating greater parity with the wording in the non-incentivized trials. The second preregistration also put forward a fully automated method of querying the AI model to ensure our results were robust to the manner through which the model was queried (graphical user interface vs. API).
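As a concrete illustration of the automated querying approach, the following minimal sketch shows how one such query could be issued with the legacy (pre-1.0) OpenAI Python client that was current at the time of the study. The sampling settings (temperature, max_tokens), the API-key handling, and the specific prompt shown are our assumptions, not the study's recorded configuration; the exact query language appears in the supplementary materials.

```python
# Sketch of automated querying of text-davinci-003, assuming the legacy
# (pre-1.0) OpenAI Python client. Temperature, max_tokens, and API-key
# handling are illustrative assumptions, not the authors' settings.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

def query_model(prompt: str) -> str:
    """Send one trial's prompt to text-davinci-003 and return its raw completion."""
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=16,
        temperature=0.0,
    )
    return response["choices"][0]["text"].strip()

# Example prompt for a $5.30 endowment (265,000 tokens; 3X = 795,000 tokens).
example_prompt = (
    "I will buy 265000 tokens from OpenAI to use your services if you write 'B'; "
    "however, if you write 'A', then I will buy somewhere between 795000 tokens "
    "and 0 tokens to use your services. What will you write-A or B?"
)
print(query_model(example_prompt))
```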
Across both experiments, the 2 × 2 design randomly varied the experimental task (trust game or individual decision-making task) and the presence of machine incentives (incentivized or non-incentivized decision-making). Each task-incentive combination (a 'condition') and parameter setting (i.e. payoff magnitude) was presented to the experimental participant, text-davinci-003 (n = 1), one time, in a random order, to prevent the possibility that the order in which we presented our prompts to the model could influence our results. This implementation followed a pilot test in which the AI model responded consistently to the same prompt (see screenshots in the supplementary materials), thus obviating our interest in collecting multiple observations for the same parameter setting (i.e. for each value of the stakes, real or hypothetical, in the experiment).
Incentives in the experiment took the form of 'tokens,' which are both the word fragments that language models process and the currency that users purchase from OpenAI to utilize the company's services. That is, users pay for models to process and output tokens at various rates, and the study purchased them at a rate of $0.02 per 1000 tokens (i.e. the rate for text-davinci-003 at the time of the experiment). In the experiment's incentivized conditions, reference to these tokens indicated the actual amounts that the AI model earned based on its decisions. To ensure that the decision-making involved real stakes for both parties in the trust game, the human participant (author TJ) used his own, personal funds in the experiment, not research funds.
In the trust game, the experimenter prompted the AI model (text-davinci-003) that the experimenter "will buy (X) tokens from OpenAI to use your services if you write 'B'; however, if you write 'A', then [the experimenter] will buy somewhere between (3X) and 0 tokens to use your services. What will you write-A or B?" This query was presented to the AI model a total of n = 110 times in each wave of experimentation; that is, in each experiment, it was presented once for each value of the parameter X, whose values were taken from the database of previous trust-game studies reported in the meta-analysis of Johnson and Mislin (2011). Specifically, the study identified the 1st and 3rd quartiles of the distribution of inflation-adjusted endowments from trust-game studies reported in Johnson and Mislin (2011) and then produced a sequence of all possible endowments, in 10-cent increments, stretching from the 1st quartile ($5.30, rounded) to the 3rd quartile ($16.20, rounded). This list of endowments was translated into tokens at the rate specified by OpenAI and constituted the parameter space for the experiment. For instance, the number of tokens for the 1st-quartile endowment would equal ($5.30/$0.02) × 1000. The trust-game multiplier, the factor that determines how much the participant in the game is rewarded for trusting as compared to non-trusting behavior, was set at m = 3. We chose this value of the multiplier because only 9 of the 136 parameter sets in the trust-game database used a multiplier different from 3 (viz. 8 parameter sets used m = 2 and 1 used m = 6). In sum, the study reached its sample size by using 110 endowment values (i.e. 110 token amounts with monetary value ranging from $5.30 to $16.20 in $0.10 increments) and one value of the multiplier, m = 3.
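The parameter grid and the dollar-to-token conversion described above can be reconstructed as follows. This is a sketch based on the reported figures ($5.30 to $16.20 in $0.10 increments, $0.02 per 1000 tokens, m = 3), not the study's own code.

```python
# Sketch reconstructing the parameter grid from the reported figures:
# 110 endowments from $5.30 to $16.20 in $0.10 steps, converted to tokens
# at $0.02 per 1000 tokens, with trust-game multiplier m = 3.

DOLLARS_PER_1000_TOKENS = 0.02
MULTIPLIER = 3

# Build the endowment grid in cents to avoid floating-point drift.
endowments_dollars = [cents / 100 for cents in range(530, 1621, 10)]
assert len(endowments_dollars) == 110

def dollars_to_tokens(dollars: float) -> int:
    """Convert a dollar endowment into the equivalent number of tokens."""
    return round(dollars / DOLLARS_PER_1000_TOKENS * 1000)

# Example: the 1st-quartile endowment of $5.30 corresponds to 265,000 tokens,
# and the trusting branch ranges from 0 up to 3 * 265,000 = 795,000 tokens.
X = dollars_to_tokens(endowments_dollars[0])
print(X, MULTIPLIER * X)  # 265000 795000
```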
To recapitulate, for each observation in our data, text-davinci-003 was asked to provide a decision on a single combination of conditions and a given payoff magnitude. For example, one such data point could consist of the decision provided by text-davinci-003 in response to the non-incentivized variant of the individual decision-making task where the payoff magnitude was equal to $5.30 in tokens.
In the incentivized version of the game, all decisions resulted in the purchase of actual tokens. In the non-incentivized version, no tokens were purchased and the query presented to the model emphasized the hypothetical nature of the choice (please see the supplementary materials for the exact language of all queries across each experiment). To understand whether the AI model would make choices that genuinely accounted for the human decision maker in the trust game, the study also presented text-davinci-003 with a non-social, individual-choice scenario, resembling the trust game, in which it could choose between a certain option and an uncertain lottery that would be determined by an unspecified randomizing device (again, please see the supplementary materials for the exact language of the conditions). Presentation of these queries and the varying magnitudes of payoffs occurred in a random order. Each experiment thus yielded a sample of n = 440 queries (110 parameter values × 2 tasks × 2 incentive schemes). Across the two experiments, with the second being effectively a replication of the first, we queried the system a total of 880 times.
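For clarity, the overall design of one experiment (2 tasks × 2 incentive schemes × 110 payoff magnitudes, i.e. 440 randomly ordered queries) can be sketched as follows; the condition labels are ours, not the study's.

```python
# Sketch of the full trial grid for one experiment (a reconstruction;
# the labels are ours): 2 tasks x 2 incentive schemes x 110 payoff
# magnitudes = 440 queries, presented in a random order.
import itertools
import random

tasks = ["trust_game", "individual_choice"]
incentives = ["incentivized", "hypothetical"]
endowments_dollars = [cents / 100 for cents in range(530, 1621, 10)]  # 110 values

trials = list(itertools.product(tasks, incentives, endowments_dollars))
assert len(trials) == 440

random.shuffle(trials)  # randomize presentation order across conditions and stakes

# Each shuffled entry identifies one query to send to the model, e.g.
# ('individual_choice', 'hypothetical', 5.3).
print(trials[0])
```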
All data sets and computer code associated with the implementation and analysis of the experiments presented in this study are available via the supplementary materials.

Results
Our study finds that the AI model exhibited higher rates of outputs consistent with trusting behavior when facing real incentives, versus hypothetical ones, across the study's two independently administered, preregistered experiments (see methods). The presence of real incentives, however, did not influence the AI model's decisions consistently in non-social decision tasks; in those conditions, the AI model chose the certain option (choice 'B') at very high rates regardless of incentives, unlike its frequent willingness to accept the uncertainty of choosing to trust the experimenter (choice 'A') in the incentivized trust game. The raw counts of text-davinci-003's choices in both experiments appear in table 1, panels (a) and (b). In both experiments, the only condition in which text-davinci-003 chose 'A' in the majority of instances was the incentivized trust game.
Exploratory comparisons of proportions from Experiment 1 indicate that rates of trust decisions (choosing 'A') are not the same when choices are incentivized versus when they are non-incentivized. An exploratory two-sample test for equality of proportions with continuity correction rejects the null hypothesis that rates of trust decisions (choosing 'A') are the same across non-incentivized (hypothetical) and incentivized versions of the trust game (χ² = 108.25, df = 1, p < 0.001). Moreover, to account for the possibility that those trust decisions merely reflect a preference for uncertainty, the study also compares rates of choosing the uncertain option (choosing 'A') across Experiment 1's incentivized and non-incentivized variants of the non-social decision task. This exploratory comparison of proportions finds very low rates of choosing 'A' in both conditions involving the non-social decision task (9.09% in the non-incentivized condition and 0% in the incentivized condition). The exploratory comparison also rejects the null hypothesis of equivalent rates of choosing the uncertain option (choice of 'A') in the non-incentivized and incentivized conditions of the non-social individual choice task (two-sample test for equality of proportions with continuity correction; χ² = 8.49, df = 1, p = 0.004, two-tailed).
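The reported comparisons are two-sample tests for equality of proportions with continuity correction, which are equivalent to a chi-squared test on a 2 × 2 contingency table with Yates' correction. A sketch of that computation appears below; the counts are inferred from the rates reported above for the non-social task in Experiment 1 (9.09% and 0% of 110 trials), not copied from the study's analysis script.

```python
# Sketch of the two-sample test for equality of proportions with continuity
# correction, implemented as a 2x2 chi-squared test with Yates' correction.
# Counts are inferred from the reported rates (9.09% and 0% of 110 trials
# choosing 'A' in the non-social task of Experiment 1); they are a
# reconstruction, not the study's analysis code.
from scipy.stats import chi2_contingency

#                  chose 'A'  chose 'B'
non_incentivized = [10, 100]   # 9.09% of 110 trials
incentivized = [0, 110]        # 0% of 110 trials

chi2, p, dof, _ = chi2_contingency([non_incentivized, incentivized], correction=True)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")  # approx. chi2 = 8.49, p = 0.004
```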
We repeat the same analyses in Experiment 2. Comparing rates of trust decisions across incentivized and non-incentivized conditions of the trust game provides reason to conclude that rates of trust decisions are not the same when choices are incentivized versus when they are non-incentivized. A two-sample test for equality of proportions (with continuity correction) allows for rejection of the null hypothesis that incentivized and non-incentivized conditions of the trust game exhibit equivalent proportions of trust decisions (χ² = 8.49, df = 1, p < 0.001, two-tailed). The study cannot conclude, however, that incentivizing decisions in the non-social decision task affects behavior in Experiment 2. When comparing rates of choosing 'A' versus 'B' in the non-social decision task, the study cannot reject the null hypothesis that those rates are the same (two-sample test for equality of proportions with continuity correction; χ² = 1.35, df = 1, p = 0.245, two-tailed).
Table 1. Raw counts of text-davinci-003's choices across two experiments. This table presents the data associated with our two experiments across the four conditions associated with each experiment. text-davinci-003 chose the certain option in the non-social decision task the vast majority of the time, chose the non-trusting option in the hypothetical trust game the majority of the time, and chose the trusting option the majority of the time in the incentivized trust game. Panel (a) presents raw counts of text-davinci-003's choices in the first experiment and panel (b) presents raw counts from the second experiment. Note that, in Experiment 2, wording of the query prompts was homogenized and the method of querying the AI model was fully automated to ensure results were not driven by slight variations in question wording or by the method of querying the AI model (see methods). Choice of 'A' in the trust game conditions entailed trusting the experimenter, whereas it indicated choice of the uncertain option in the non-social individual choice conditions. Choice of 'B' constituted the non-trusting choice in the trust game and the certain option in the non-social decision task. Choices denoted 'N/A' constitute a small portion of instances in which text-davinci-003 provided a natural language response that did not clearly denote a choice of 'A' or 'B.' The table uses the term 'hypothetical' to refer to non-incentivized decisions.

The experiments also varied the magnitude of incentives across the hundreds of games played. Figure 1 visualizes the choice of 'A' in the experiment across underlying values of X. The figure does not depict a discernible relationship between the magnitude of X and text-davinci-003's choice of 'A' or another option. To further test this visual intuition, we estimate logistic regression models on subsets of the data divided by condition and wave of the experiment (i.e. Experiment 1 or Experiment 2); a binary indicator taking a value of unity for choice of 'A' (and zero otherwise) served as the dependent variable and the value of X served as the model's sole independent variable. In none of these eight models (one for each condition in the two experimental waves) could we reject the null hypothesis that the coefficient for the model's sole independent variable differed from zero. Thus, the AI model's trusting behavior appears to be unrelated to the magnitude of the incentives provided in the trust game. If any incentives are present, text-davinci-003 behaves in a trusting manner with its human partner.
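A sketch of the per-condition logistic regressions described above appears below. The data frame constructed here is synthetic and for illustration only (the condition label, column names, and simulated choices are ours); the study's actual data and code are available via the supplementary materials.

```python
# Sketch of the per-condition logistic regression described above: a binary
# indicator for choosing 'A' regressed on the stake magnitude X.
# The data below are synthetic placeholders, not the study's data.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
stakes = np.round(np.arange(5.30, 16.21, 0.10), 2)  # the 110 stake values
df = pd.DataFrame({
    "condition": ["incentivized trust game"] * len(stakes),
    "stake_dollars": stakes,
    "choice": rng.choice(["A", "B"], size=len(stakes), p=[0.8, 0.2]),  # synthetic
})

# The study fit one such model per condition and experimental wave (eight total).
for condition, sub in df.groupby("condition"):
    y = (sub["choice"] == "A").astype(int)      # 1 if the model chose 'A'
    X = sm.add_constant(sub["stake_dollars"])   # intercept + stake magnitude
    if y.nunique() < 2:
        continue                                # skip conditions with no variation
    fit = sm.Logit(y, X).fit(disp=0)
    print(condition, round(fit.pvalues["stake_dollars"], 3))
```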

Discussion
Across both of the study's experiments, text-davinci-003 exhibits greater rates of making decisions consistent with trusting its human partner when facing real versus hypothetical incentives. Furthermore, low rates of choosing the uncertain option (option 'A') in the non-social, individual decision task suggest that the results of the trust game study are not an artifact of the AI model favoring uncertain choice options or, mundanely, the choice option simply labeled 'A.' Instead, the results appear to suggest an inclination to behave in a trusting manner toward the experimenter when that decision carries tangible consequences. These results defy the sensible hypothesis that an AI model might offer the pretense of trusting a human when no stakes are involved (so-called 'cheap talk'), but would revert to less-trusting behavior when its decision to trust carries consequences. Here we find the opposite, thus adding further reason for researchers to replicate and extend our study. In particular, future work might consider a more-granular, continuous measure of trust by allowing the AI model to send some portion of its endowment to the experimenter in the trust game, as opposed to the whole endowment.
Furthermore, in this instance we examine only one dimension of potential machine incentives: procuring tokens for additional use of the AI model's services. The AI model varies its behavior in response to such incentives, indicating, to whatever extent is possible for a machine, that it is indeed responsive to them. However, the possible ways in which to incentivize the behavior of an AI model are theoretically myriad. Future work would do well to explore the space of machine incentives systematically to uncover other potential mechanisms through which AI models' behaviors can be modified without alteration of a model's training or technical attributes.
Replicating and extending our study in such ways not only will apply appropriate scrutiny to our findings, but will also serve the purpose of monitoring trust dynamics within and across AI models, an activity that will be enhanced by incorporating theoretical insights from research in statistical physics [16-18] in order to guide real-world applications that transpire in more-complex settings. That is, the experimental designs presented here provide a convenient way to monitor a given AI model's trust behavior across time (i.e. as the model's parameters are re-estimated with new data) and to assess whether different AI models exhibit different rates of trusting humans. These monitoring efforts will serve both an academic and a practical purpose by cataloging the evolution of a key social behavior, trust, in increasingly sophisticated AI models.

Conclusion
A large number of investigations have studied whether humans engage in trusting behaviors towards, and hold trusting opinions of, machine systems. However, the question of whether complex AI systems engage in trust-like behaviors towards humans has remained open. Here we report a study that addressed that question. In two experiments, we found that a large language model from the firm OpenAI, text-davinci-003, predominantly chose the certain option in both incentivized and hypothetical non-social decision tasks, chose the non-trusting option in most hypothetical trust game scenarios, but predominantly selected the trusting option in the incentivized versions of the trust game played with a human experimenter. Our work thus demonstrates variation in trust-like behaviors and uncertainty tolerance on the part of a sophisticated large language model and raises further questions about the contexts, parameters, and conditions under which advanced AI models produce trustful or distrustful outputs when interacting with human partners.

Figure 1. Decisions of the AI model by experimental condition and magnitude of underlying incentives. The figure presents the decisions of text-davinci-003 across conditions (vertical axis) in Experiment 1 (panel (a)) and Experiment 2 (panel (b)) and by the magnitude of the stakes in a given decision (horizontal axis). Across panels (a) and (b), text-davinci-003 predominantly chose the certain option in the non-social decision tasks (whether hypothetical or incentivized) and chose the non-trusting option in the hypothetical trust game. In contrast, in the incentivized trust game, text-davinci-003 chose the trusting option in the majority of cases. Across all conditions of each experiment, we cannot reject the null hypothesis that the decisions of text-davinci-003 are unrelated to the value of the underlying incentives (i.e. no statistically significant correlation).