
Both eyes open: Vigilant Incentives help auditors improve AI safety


Published 7 May 2024 © 2024 The Author(s). Published by IOP Publishing Ltd
Focus Issue on Game Theory and AI in Complex Systems. Citation: Paolo Bova et al 2024 J. Phys. Complex. 5 025009. DOI: 10.1088/2632-072X/ad424c


Abstract

Auditors can play a vital role in ensuring that tech companies develop and deploy AI systems safely, taking into account not just immediate, but also systemic harms that may arise from the use of future AI capabilities. However, to support auditors in evaluating the capabilities and consequences of cutting-edge AI systems, governments may need to encourage a range of potential auditors to invest in new auditing tools and approaches. We use evolutionary game theory to model scenarios where the government wishes to incentivise auditing but cannot discriminate between high and low-quality auditing. We warn that it is alarmingly easy to stumble on 'Adversarial Incentives', which prevent a sustainable market for auditing AI systems from forming. Adversarial Incentives mainly reward auditors for catching unsafe behaviour. If AI companies learn to tailor their behaviour to the quality of audits, the lack of opportunities to catch unsafe behaviour will discourage auditors from innovating. Instead, we recommend that governments always reward auditors, except when they find evidence that those auditors failed to detect unsafe behaviour they should have. These 'Vigilant Incentives' could encourage auditors to find innovative ways to evaluate cutting-edge AI systems. Overall, our analysis provides useful insights for the design and implementation of efficient incentive strategies for encouraging a robust auditing ecosystem.


Original content from this work may be used under the terms of the Creative Commons Attribution 4.0 license. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

A recent and high-profile open letter calling for a pause on giant AI experiments has highlighted uncertainty over how to reduce the risks posed by the anticipated capabilities of future AI systems (Future of Life Institute 2023). AI researchers recognise that it will be more difficult to align the intentions and values of future goal-directed AI systems with those of the groups they serve (Amodei et al 2016, Leike et al 2017, Hernández-Orallo et al 2019, Krakovna et al 2020). Even if researchers can address these technical safety challenges, the deployment of powerful AI capabilities brings with it concerns of misuse, especially when we consider the dual use of many AI capabilities (Brundage et al 2018, Shevlane and Dafoe 2019, Zwetsloot and Dafoe 2019).

A widely discussed measure for addressing these concerns is to develop a robust and independent auditing ecosystem for cutting-edge AI systems (GOV.UK 2023, Stein-Perlman 2023). The AI governance literature has iterated on several frameworks that could be useful for auditing future AI systems; the recent System Cards framework has been adopted by OpenAI, who produced a system card for GPT-4 (Mitchell et al 2019, Brown et al 2021, Gursoy and Kakadiaris 2022, Open AI 2023b). Research organisations and academics have also produced benchmarks to track some of the ethical and safety characteristics of AI systems (Bommasani et al 2023, Pan et al 2023). Shavit (2023) outlines the infrastructure and procedure that audits would require to track the use of computational resources for large AI experiments 1 .

While tools and frameworks such as these may be vital to auditing AI systems, there is much less discussion of how governments should design an ecosystem for auditing AI systems. Cihon et al (2021) propose that we establish AI certification schemes to enforce technical and ethical safety standards. Hadfield and Clark propose a Regulatory Market (Clark and Hadfield 2019, Hadfield and Clark 2023). In both cases, AI companies are required to purchase the services of a regulatory intermediary to demonstrate both compliance with safety standards and that their AI systems improve rather than worsen societal outcomes.

However, not all audits are equal. As the above proposals imply, the empirical and technical audits that provide direct evidence of the risks that foundational AI systems pose are also the most expensive to perform (GOV.UK 2023). They are also likely to require specific talent and experience of the scale that companies like OpenAI, Microsoft, and Google have and will develop.

Governments who wish to support auditors in performing empirical and technical audits of the latest AI systems will have to offer strong incentives if they hope for a sustainable market for auditing to emerge.

Our paper makes three contributions, the first of which is to show under which conditions we can expect a market of private auditors for AI systems to be successful. We argue that well-chosen, appropriately funded auditors will participate in such a market and can be incentivised to produce high-quality (HQ) detection methods and standards. These auditors are not just effective in catching unsafe behaviour. They also act as an effective deterrent to unsafe behaviour.

Not all incentives will encourage HQ auditors to join the Auditing Ecosystem. Some incentives, for example those which encourage an adversarial relationship between AI companies and auditors, will actively harm the Auditing Ecosystem. These incentives unfortunately have an appealing efficiency at first glance, since they focus on rewarding auditors for catching unsafe behaviour (we call such incentives 'Adversarial Incentives'). Incentives that instead appreciate the role of auditing as a deterrent to unsafe behaviour, which we call 'Vigilant Incentives', fare much better.

We arrived at this first conclusion by modelling the different incentives that would face both private auditors and the AI companies they regulate. This model is based on an existing model of the market for new AI capabilities, known as the DSAIR model (Han et al 2020). We extend this model to capture the detection and enforcement abilities of private auditors. Our work contributes to a growing number of publications that model competitive dynamics in AI markets (Armstrong et al 2016, Askell et al 2019, Han et al 2020, Naudé and Dimitri 2020, LaCroix and Mohseni 2022). To capture the complex dynamics that may emerge as auditors and companies explore the strategy space, we turn to analytical and numerical methods from Evolutionary Game Theory (Foster and Young 1990, Fudenberg et al 2006, Wallace and Young 2015). Evolutionary Game Theory has been used to study other incentive mechanisms, both for issues in AI Governance and in Climate Change, another issue characterised by high uncertainty and multiple types of actors (Encarnação et al 2016, Santos et al 2016, Han et al 2020, LaCroix and Mohseni 2022).

As a second contribution, we discuss the trade-offs that one might consider when funding auditors. As with other forms of regulation, deterring more unsafe behaviour often has the side effect that auditors are more likely to slow down companies in scenarios where the risks are low, an outcome we call 'overregulation' in line with previous work (Han et al 2020, 2021, 2022).

Here, we invoke the double-bind problem of the Collingridge dilemma (Worthington 1982). Governments will likely know little about the capabilities and risks of new AI technologies until those technologies become entrenched. At that point, it will probably be very difficult to influence who controls the market for AI. For this reason, the government will design these audit incentives under uncertainty.

In particular, if the risks are low enough and the speed advantage from neglecting safety norms is high enough, then incentivising auditors may lead to overregulation. As the Collingridge dilemma implies, these are two parameters of our model that are highly uncertain, and there exists much disagreement about where different approaches to AI sit and whether it makes sense to see AI Safety as separate from AI Capabilities in the first place (Cave and ÓhÉigeartaigh 2018, Dafoe 2018, Burden and Hernández-Orallo 2020, Vinuesa et al 2020).

We find that we can reduce overregulation with little impact on risk through the careful design of government incentives and auditor activities. The nature of the externalities that AI systems pose can have a large influence on these designs.

For our final contribution, we compare a market of private auditors to a government that directly regulates AI companies. Under uncertainty, we find that auditors fare much better in balancing risk reduction and overregulation than the government. We also note that Vigilant Incentives require that governments maintain a strong capacity for monitoring the market for AI.

The rest of the paper proceeds as follows: section 2 outlines our model, whereas section 3 describes the evolutionary game theory method we adopt. Section 4 discusses the above three results in more detail. Section 5 discusses how policymakers might use the model as a tool for thinking about how to support the auditing landscape. We conclude with a brief discussion of model limitations and possible future research directions.

2. Model

This section explains the details of the model, starting with the core set of actors that feature in the model, before outlining the decision problems that each actor faces. We first describe the auditor's problem and then describe the different incentives that we allow governments to award them. Finally, we discuss the AI companies' problem, where we extend previous work from the literature.

2.1. The auditing ecosystem model

Auditing Ecosystems involve 3 core sets of actors:

  • Governments who set targets for private auditors to meet and, therefore, have oversight of what auditors test for. Governments license private auditors to provide oversight of AI firms in their markets.
  • Auditors who perform audits of AI companies and the AI systems they aim to deploy. Auditors can compete to provide better auditing services that companies are more willing to pay for.
  • AI companies who must choose from the auditors available for their desired market(s). Compliance with an auditor's requirements is mandatory.

We have only two populations in the baseline model, auditors and AI companies. For simplicity, we assume that one external government is responsible for setting the incentives facing private auditors. This government entity is also assumed to have sufficient institutional power to enforce that AI companies work with at least one auditor should they wish to deploy their advanced AI systems.

We might assume that there will be many fewer auditors than AI companies, although this will depend on a number of choices. Are we considering a wide range of possible AI companies, or only a select few who have dedicated their innovative efforts to creating General Purpose AI systems? Barriers to entry can limit the number of AI companies in especially lucrative and risky domains (Bar et al 2009, Askell et al 2019).

On the other hand, the number of auditors may depend largely on the degree of success that an Auditing Ecosystem proposal has in encouraging the creation of private auditors. Will developers at existing AI companies leave to create start-ups in the market for AI auditing? Will such start-ups be sustainable or avoid buyout from AI companies? Or will the market for private auditors be mainly carved up by existing institutions (Clark and Hadfield 2019, Hollenbeck 2020)?

In our model, we assume a finite population setting where we have 50 auditors and 50 AI companies. Figure A1 demonstrates that concentration of the auditing market around 10 auditors has no effect on our results, while the concentration of the AI market around 10 firms weakens the positive effects of providing incentives to auditors. We present a full list of model parameters in table 1.

Table 1. Parameter table—several of the parameters are fixed because previous work on similar models has revealed that they have little influence on the results.

| Symbol | Definition | Range |
| --- | --- | --- |
| b | Short term value of market | 4 |
| B | Long term value of market | ${\gt}0$ |
| c | Cost of safety measures for firms | 1 |
| W | Length of time to develop transformative AI safely | ${\gt}0$ |
| s | Speed advantage of skipping safety precautions | ${\gt}0$ |
| $p_\mathrm{r}$ | The risk of disaster if a firm is Unsafe | [0, 1] |
| p | $1 - \textrm{risk of disaster}$ | [0, 1] |
| $p_\textrm{l}$ | The chance of a low-quality auditor revealing an unsafe firm | 0 |
| $p_\textrm{h}$ | The chance of a high-quality auditor revealing an unsafe firm | $1 \gt p_\textrm{h} \gt p_\textrm{l} = 0$ |
| φ | The auditor's impact on the speed of unsafe firms they catch | [0, 1] |
| g | Government budget allocated to auditors per firm regulated | ${\gt}0$ |
| $r_\textrm{l}$ | Net profit for auditor (low-quality) | 0 |
| $r_\textrm{h}$ | Net profit for auditor (high-quality) | −1 |
| β | Learning rate | 0.02 |
| $Z_\textrm{reg}$ | Size of Auditing Ecosystem | 50 |
| $Z_\textrm{ai}$ | Size of AI market | 50 |

2.2. The strategic interaction between AI Companies

AI companies enter into competition with each other and are matched to a relevant auditor from the Auditing Ecosystem. Crucially, we assume that they first observe the auditor's choice of effort before choosing how safe to be. We assume that they can follow one of three strategies:

  • AS — companies always develop AI systems safely.
  • AU — companies never allocate effort to AI Safety.
  • VS — companies develop their AI systems safely, but only if they observe that auditors have invested in HQ vetting systems.

Table 2 describes the average payoffs that AI companies receive when faced with another company playing a particular strategy, given the choice of the auditor. This model is largely based on the DSAIR model from Han et al (2020). Companies who are always safe, AS, are at a disadvantage against companies who take risks, AU, so it is usually the unsafe firm who is the first to bring the new AI capability to market, winning the big prize, B (note that payoffs are averaged over the duration of the competition—which is W if firms are safe, or $\frac{W}{s}$ if the winner is unsafe). If both companies choose the same strategy, they have an equal chance of winning the big prize.

Table 2. AI Company Payoff Matrix—These payoffs capture the payoffs of different strategies. Notice that the conditional strategy is not included since, depending on the auditor's choice, it performs identically to one of the other strategies. We use the detection rate for high-quality auditors. Against a low-quality auditor, the payoff matrix is the same, except that we replace the detection rate with that for low-quality auditors. See the main text for explanations of each symbol. Note that the short-term benefit of producing AI systems, b, has been omitted to ensure a more readable table. The omitted parameters do not influence our results.

| Strategy | Always safe (AS) | Always unsafe (AU) |
| --- | --- | --- |
| AS | $\frac{B}{2W} - c$ | $p_\textrm{h} \cdot \frac{1}{\phi + 1} \frac{B}{W} - c$ |
| AU | $p \cdot (1-p_\textrm{h}) \cdot \frac{B}{W} + p_\textrm{h} \frac{\phi}{\phi + 1} \frac{B}{W}$ | $p (1-p^2_\textrm{h}) s \frac{B}{2W} + p_\textrm{h}^2\frac{\phi}{\phi+1} \frac{B}{2W}$ |

Our model builds on the DSAIR model by adding a detection rate that differs for high- and low-quality (LQ) auditors: $p_\textrm{h}$ is the detection rate for HQ auditors, and $p_\textrm{l}$ is the detection rate for LQ auditors. We can see from table 2 that increasing the detection rate (as occurs for HQ auditors) reduces the payoffs to unsafe companies, which can only encourage them to be safer.

Once an unsafe company is caught, the auditor (or perhaps the government) will aim to enforce that the company slows down its AI development to a speed which is a fraction φ of the safe speed. This action has an uncertain impact on who is the first to bring new AI capabilities to market. A chance, $\frac{\phi}{\phi + s}$, remains that the previously unsafe company catches up to and overtakes the safe company 2 .
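To make the payoff structure concrete, the sketch below implements the table 2 entries in Python as a function of the detection rate of the auditor that the two companies face. The function and parameter names are ours, and the default values mirror those used in our dilemma-zone figures ($B/W = 100$, c = 1, s = 1.5, $p_\textrm{r} = 0.6$, $p_\textrm{h} = 0.6$, φ = 0.5); the short-term benefit b is omitted, as in the table.

```python
import numpy as np

def company_payoffs(B=100.0, W=1.0, c=1.0, s=1.5, p=0.4, p_detect=0.6, phi=0.5):
    """Average payoffs from table 2 for the AS/AU subgame.

    `p_detect` is the detection rate of the auditor the companies face
    (p_h for a high-quality auditor, p_l for a low-quality one), and
    p = 1 - p_r is the probability that no disaster occurs."""
    pi = np.empty((2, 2))  # rows: own strategy, columns: opponent (0 = AS, 1 = AU)
    # AS vs AS: both safe, equal chance of winning the long-term prize B.
    pi[0, 0] = B / (2 * W) - c
    # AS vs AU: the safe firm only collects a share of B if the unsafe rival
    # is caught and slowed to a fraction phi of the safe speed.
    pi[0, 1] = p_detect * (1 / (phi + 1)) * B / W - c
    # AU vs AS: the unsafe firm wins if it goes undetected (and no disaster
    # occurs, probability p), or with a reduced chance if caught and slowed.
    pi[1, 0] = p * (1 - p_detect) * B / W + p_detect * (phi / (phi + 1)) * B / W
    # AU vs AU: a symmetric race between two unsafe firms.
    pi[1, 1] = (p * (1 - p_detect**2) * s * B / (2 * W)
                + p_detect**2 * (phi / (phi + 1)) * B / (2 * W))
    return pi

print(company_payoffs(p_detect=0.6))  # facing a high-quality auditor
print(company_payoffs(p_detect=0.0))  # facing a low-quality auditor
```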

If we restrict our attention now to this subgame played by AI companies, we can see several possible equilibrium outcomes depending on the parameters of the model. If we fix the choice of the auditor, we can ignore the conditional strategy, VS, for the time being. Figure 3(a) provides an illustration of the equilibria selected by social learning as we vary the risk, $p_\textrm{r} = 1 - p$ and speed advantage, s, parameters of the model.

The payoffs are symmetric, so there are only a few possibilities. If the risks are high enough, then AS is the pure strategy Nash equilibrium of the game. If the risks are low enough, AU is the only equilibrium. If the risks are somewhere in between, then both may be equilibria. Social learning will result in players selecting the risk-dominant equilibrium in this case. For large $\frac{B}{W}$ and a detection rate $p_\textrm{h} = 0$, Han et al (2020) note that AS is risk dominant when $p \gt \frac{1}{3s}$. Han et al (2020) also note that society prefers companies to be safe (i.e. the sum of AI company utilities is greatest) whenever $p \gt \frac{1}{s}$. These equations give rise to a 'dilemma zone' where society prefers unsafe firms to act safely (see figure 3(a)) 3 .

Now, let us allow for the choice of the auditor. As we shall discuss in more detail, the auditor can choose to be of low or high-quality: their quality determines their detection rate. It is noteworthy that if the detection rate increases due to the auditor's choice to be high quality, we can move from a region of the parameter space where AU is the only (or risk dominant) equilibrium for AI companies to a region where AS is the only (or risk dominant) equilibrium.

There are in fact three relevant possibilities: firms always play AU, no matter what the auditor does; firms always play AS; or firms play AS only when facing a HQ auditor (this is the conditional strategy, VS).

In the first scenario, HQ regulation is not a strong deterrent. In the second scenario, a HQ auditor is not needed. In the third scenario, the HQ auditor acts as a strong deterrent to unsafe behaviour, which will be socially desirable if the risk of an AI disaster is high enough, i.e. if we are in the dilemma zone.

2.3. A brief review of auditing

A review of the existing landscape for auditing AI systems by the UK government reveals that a range of actors outside government have proven vital for performing a number of auditing services (GOV.UK 2023).

To borrow the categorisation of GOV.UK (2023), governments typically engage in governance audits, such as conformity assessments, which check that companies follow procedures and principles. The European Union's AI Act and the USA's AI Algorithmic Accountability Act both require governance audits for certain types of AI systems.

Industry, academics, and non-profit organisations have participated heavily in empirical audits. Empirical audits involve measuring the outcomes of deployed algorithms on their users and society at large. Such audits can serve as evidence of harms AI systems can cause, such as racial bias in predicting reoffending rates or in serving gambling ads to children (GOV.UK 2023). Journalists and civil society actors also play a role here.

Often, it is left to AI companies themselves to invest resources in technical audits. These are audits that seek a deeper understanding of where AI systems can fail, and there are often no clear incentives for tech companies to dedicate resources to build capacity for auditing new technologies in this way. However, METR is a non-profit research organisation that has performed technical audits on an early version of OpenAI's GPT-4 and Anthropic's Claude (Kinniment et al 2023). Although their research revealed that such systems fail to fully execute a plan to replicate themselves, they showed that these systems were capable of solving many relevant subtasks. Given these findings, they argue that it is no longer obvious that future models will not exhibit this dangerous emergent behaviour. 4

Although governance audits are by far the most common in current regulatory frameworks, only empirical and technical audits can provide direct evidence of the capabilities that AI systems can achieve.

Unfortunately, as noted in GOV.UK (2023)'s review, these audits can be incredibly expensive and seldom target cutting-edge AI systems. They are also likely to require specific talent and experience in auditing AI systems of the scale that companies like OpenAI, Microsoft, and Google have developed.

2.4. The auditor's problem

Auditors move first, with their choices fully visible to AI companies before they make their own choices. Auditors must choose whether to aim to be:

  • high-quality (HQ): A HQ auditor accepts larger costs in exchange for a better chance of evaluating cutting-edge AI systems. They are much more likely to detect unsafe behaviour on the part of AI companies and to know the appropriate procedures that AI companies should follow to ensure that their work is safe.
  • low-quality (LQ): They do not invest in evaluating cutting-edge AI systems, so they are unlikely to detect unsafe behaviour in more advanced systems.

For simplicity, we assume that LQ auditors have a detection rate for unsafe behaviour equal to 0. This assumption does not affect the qualitative features of our results: What matters is that the difference between the detection rates of both auditor types is sufficient in the dilemma zone to move AI companies to develop AI safely. As this difference decreases, our incentives will have weaker effects on reducing risk—see figure A4. Since we assume that LQ auditors essentially perform no detection services, our results also show which incentives are sufficient to encourage participation in the Auditing Ecosystem.

Figure 1 illustrates how the choice to aim for HQ may be pivotal in influencing AI companies to develop AI more safely. Unfortunately, HQ auditors may have to pay a high cost to invest in better tools and talent, and this cost may outweigh any revenue they can extract from AI companies in exchange for their services. If this were not the case, HQ auditors would already be common (see figure A3). We capture this scenario by setting the net profits of auditors as follows: $r_\textrm{h} = -1 \lt r_\textrm{l} = 0$. Barring government intervention, auditors will choose to be LQ, even though HQ auditing is valuable from society's point of view.

Figure 1. The Auditor's Problem in the default scenario of interest. An auditor must choose whether to invest in high-quality evaluation tools and talent, HQ, or to accept a lower detection rate for unsafe practices, LQ. AI companies make their choice after observing the choice of the auditor. If the high-quality detection rate is high enough, then AI companies will switch from the unsafe equilibrium where they all play $\textbf{(AU)}$ to one where they all play $\textbf{(AS)}$. Via backwards induction, the auditor could reason that they are choosing over the two equilibria and will act to secure whichever equilibrium outcome is best for them.


Governments could offer a flat incentive, g, to all auditors for each company they regulate. However, if they cannot tell which type an auditor is, then this incentive would have no effect on the choice that auditors make: it will still be more profitable to be of LQ.

It is clear that to design more successful incentives, governments should take into account what little information may be available.

2.5. Adversarial incentives and vigilant incentives

We consider two types of incentives. First, one could pay a bounty for any unsafe firms that auditors catch, expecting that HQ firms will be better able to detect unsafe behaviour ('Adversarial Incentives'). Second, a vigilant government could rescind a prior incentive should they discover wrongdoing on the part of the companies the auditor is responsible for ('Vigilant Incentives'). To rescind the incentive, the government must monitor AI incidents.

Table 3 presents the auditor payoffs for each type of incentive, depending on what the AI companies they regulate choose to do. HQ auditors always perform worse in the absence of any incentive, $r_\textrm{h} \lt r_\textrm{l}$. Adversarial incentives are only offered when AI companies play AU. Adversarial incentives may also be earned when a conditionally safe AI company, VS, faces a LQ auditor, but as we set their detection rates to 0, this does not occur in the scenarios of interest.

Table 3. Auditor Payoffs under each Incentive—their payoffs also depend on the auditors' efforts and on the choices made by AI companies. See the text for explanations of each strategy and the relevant parameters.

| Strategy | Adversarial, HQ | Adversarial, LQ | Vigilant, HQ | Vigilant, LQ |
| --- | --- | --- | --- | --- |
| Firms play AS | $r_\textrm{h}$ | $r_\textrm{l}$ | $r_\textrm{h} + g$ | $r_\textrm{l} + g$ |
| Firms play AU | $r_\textrm{h} + g p_\textrm{h}$ | $r_\textrm{l} + g p_\textrm{l}$ | $r_\textrm{h} + g {p_\textrm{h}}^2$ | $r_\textrm{l} + g {p_\textrm{l}}^2$ |
| Firms play VS | $r_\textrm{h}$ | $r_\textrm{l} + g p_\textrm{l}$ | $r_\textrm{h} + g$ | $r_\textrm{l} + g {p_\textrm{l}}^2$ |
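The sketch below encodes these payoffs so the two incentive schemes can be compared directly. It is a minimal illustration in Python, with function and argument names of our own choosing and default values taken from table 1 and the dilemma-zone figures.

```python
def auditor_payoff(quality, firm_strategy, incentive,
                   g=1.2, r_h=-1.0, r_l=0.0, p_h=0.6, p_l=0.0):
    """Auditor payoffs from table 3.

    `quality` is 'HQ' or 'LQ', `firm_strategy` is 'AS', 'AU' or 'VS',
    and `incentive` is 'adversarial' or 'vigilant'."""
    r = r_h if quality == "HQ" else r_l
    p = p_h if quality == "HQ" else p_l
    # Unsafe behaviour only occurs around this auditor if firms play AU, or
    # if conditionally safe firms (VS) face a low-quality auditor.
    unsafe_behaviour = firm_strategy == "AU" or (firm_strategy == "VS" and quality == "LQ")
    if incentive == "adversarial":
        # A bounty is paid only when unsafe behaviour is there to be caught.
        return r + g * p if unsafe_behaviour else r
    # Vigilant: the flat payment g is kept only if every unsafe firm the
    # auditor oversees is caught (hence the squared detection rate).
    return r + g * p**2 if unsafe_behaviour else r + g

# Conditionally safe firms (VS) reward high-quality auditors only under
# Vigilant Incentives.
for incentive in ("adversarial", "vigilant"):
    print(incentive,
          auditor_payoff("HQ", "VS", incentive),
          auditor_payoff("LQ", "VS", incentive))
```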

We can see that, under Adversarial Incentives, when companies play VS, HQ auditors do as poorly as they can, while LQ auditors do as well as they can.

The logic at play here is that Adversarial Incentives would signal to private auditors that the government is willing to pay for only the tools which are effective in their job. From one point of view, there appears to be a rather appealing efficiency at play. If we only pay those auditors who actually detect unsafe behaviour, then we encourage competition to be the auditor who offers the best tools.

These Adversarial Incentives certainly have their appeal in the short run in a market with rampant unsafe behaviour. Auditors will be enticed by the lucrative opportunity of catching an unsafe AI company in the act. Auditors therefore have a strong incentive to develop powerful tools for detecting a specific behaviour and may even deter the unwanted behaviour. Once the behaviour has been deterred, auditors will no longer see any profit in further improving their methods, and governments will no longer have to cover the cost of those investments.

We now turn to Vigilant Incentives. Under Vigilant Incentives, all auditors receive a payment g. However, when AI companies play AU, or when a conditionally safe company, VS, faces a LQ auditor, those payments will be rescinded if the auditor fails to catch an unsafe company. We can see that when companies play VS, HQ auditors do as well as they can, while LQ auditors do as poorly as they can.

The line of argument here is that the government chooses to treat incentives as investments in a deterrent to unsafe behaviour. A deterrent must be funded, regardless of whether unsafe behaviour is currently occurring.

Auditors are happy to receive the incentive but know that if they let unsafe companies slip away undetected, the government can claim back the incentive—remember that in our model we have assumed that this is the only way for the government to discriminate by auditor quality. If companies are unsafe, even HQ auditors risk losing their incentive.

3. Methods

We use Markov chain methods from evolutionary game theory to explore what type of behaviour the different actors within our Auditing Ecosystem will learn to follow (Foster and Young 1990, Fudenberg et al 2006, Wallace and Young 2015).

Evolutionary Game Theory has been used to study pressing issues in AI Governance and Climate Change (Encarnação et al 2016, Santos et al 2016, Han et al 2020, LaCroix and Mohseni 2022). The field has also paid much attention to the study of the efficiency of different incentives for resolving social dilemmas (Sigmund et al 2010, Sasaki et al 2012, Sun et al 2021, Han 2022, Cimpeanu et al 2023). These methods have been used in the past to study games with multiple populations, as we do here (Rand et al 2013, Zisis et al 2015, Encarnação et al 2016, Santos et al 2016).

To further motivate the use of Evolutionary Game Theory, consider that the auditors and AI companies in our model will likely engage in a period of learning about the type of behaviour they wish to emulate. Although there may not be many AI companies with enough capital to perform at the cutting edge, there is a wide range of applications of AI systems that these companies may wish to be active in. We anticipate that companies are uncertain about the net value of any particular new technology, and that auditors face uncertainty over how difficult it is to evaluate new technologies. In the face of this uncertainty, we expect both groups to explore the strategy space and to imitate high performers. It seems reasonable to approximate this setting as a Moran process: we have a finite number of players who may over time randomly explore different strategies or instead imitate their more successful peers 5 .

For the reasons above, we use a Markov chain approach to investigate the evolution of our model. This method is suitable for investigating learning dynamics in finite population settings. Markov chains model how likely the system is to transition from one state to another, and as we will show, allow us to assess how often the system spends its time in each state over the long run (Foster and Young 1990, Fudenberg et al 2006).

To keep the analysis as straightforward as possible, we also assume that the mutation rate (the rate at which players randomly explore the strategy space) is infinitesimally small. In the method of Fudenberg and Imhof (2006) this assumption is made so that in the long run the evolutionary system spends all its time in one of its absorbing states. Analyses which make this assumption often find results that are applicable well beyond the strict limit of very small mutation (or exploration) rates (Hauert et al 2007, Sigmund et al 2010, Rand et al 2013).

Recall that this is a model with multiple finite populations, so the absorbing states are any states where all auditors follow the same behaviour and all AI companies follow the same behaviour: $\textbf{HQ-AS}$, $\textbf{HQ-AU}$, $\textbf{HQ-VS}$, $\textbf{LQ-AS}$, $\textbf{LQ-AU}$, $\textbf{LQ-VS}$. These states are visible in the Markov chain in figure 2.

Figure 2. Adversarial Incentives allow unsafe AI companies to exploit the presence of low-quality auditors. This Markov Chain diagram shows the transitions between states and their long-term frequencies. States are coloured blue if AI companies act safely and orange if AI companies act unsafely. The parameters chosen place us in the dilemma zone, $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $p_\textrm{r} = 0.6$, s = 1.5, $B/W = 100$, β = 0.02.


We can also see from figure 2 that if we want to know how much time we spend in each state on average, we should care about the transitions between each of these states. Assuming that all mutations are equally likely to occur, it is straightforward to derive a transition matrix that tells us the relative likelihood with which the system moves from one state to another. Due to the rare mutation limit, only one population, AI companies or auditors, will experience a mutation during a given evolutionary epoch. Therefore, we only need to consider transitions between states where one of the populations remains unchanged.

Now that we know the states of the Markov chain, and the valid transitions between them, we need to derive the relevant transition probabilities. To do so, we note that when an infinitesimally rare mutation occurs, the transition between any of the above states requires that the mutant invades the relevant population. This is known as the fixation rate (Nowak et al 2004, Traulsen et al 2006).

We derive the fixation rate from first principles. For a mutant to invade, (i) the mutant must avoid imitating another strategy, and (ii) everyone else must eventually imitate their strategy. We model imitation as follows.

Players are more likely to imitate the strategies of players who are comparatively more successful than they are, which we capture mathematically as the difference in expected payoffs, $\Pi(k)$. Note that the expected payoff of playing a strategy is a function of the number of players choosing other strategies. The more people play a strategy different from yours, the less likely you are to interact with someone using your strategy. Your payoffs may also depend on what people are doing in other populations. In our case, the actions of auditors will influence the expected payoffs of AI companies and vice versa. We can define the success, or fitness f, of one strategy A against B as follows:

$f_{A,B}(k) = \Pi_{A}(k) - \Pi_{B}(k). \qquad (1)$

In our setting, at any given time, we select one player at random to observe a random peer. When a player using strategy B observes their peer using a different strategy A, we model the likelihood that they adopt the observed strategy using the Fermi function of the relative success of the observed strategy over their own,

$P(B \rightarrow A) = \left[1 + e^{-\beta f_{A,B}(k)}\right]^{-1}.$

In the above equation, β refers to the imitation rate (also known as the strength of selection). A larger β means that players are more inclined to imitate a more successful player's strategy. Different values of β may be appropriate in different contexts. For our figures, we choose a value for β which implies that players are at least 90% likely to adopt a strategy which gives payoffs one standard deviation greater than their current strategy, which seems reasonable given the high stakes involved, especially for companies. Our sensitivity analysis in figure A5 suggests that our results are robust to different choices of β (we chose values of β which instead implied a likelihood of adoption of 75% and 95%). The value of β can also be informed through behavioural experiments with human participants (Rand et al 2013, Hoffman et al 2015, Zisis et al 2015).
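A minimal sketch of this imitation rule, and of how β could be calibrated to the 90% adoption target, is given below. The one-standard-deviation payoff gap of roughly 110 is purely illustrative (the paper does not report this figure), chosen so that the calibration lands near the β = 0.02 used throughout.

```python
import numpy as np

def adoption_probability(payoff_observed, payoff_own, beta=0.02):
    """Fermi imitation rule: probability of switching to the observed
    strategy, increasing in the payoff difference."""
    return 1.0 / (1.0 + np.exp(-beta * (payoff_observed - payoff_own)))

def beta_for_target(adoption_target, payoff_gap):
    """Beta such that a payoff advantage of `payoff_gap` is adopted
    with probability `adoption_target`."""
    return np.log(adoption_target / (1.0 - adoption_target)) / payoff_gap

print(beta_for_target(0.9, 110.0))         # ~0.02
print(adoption_probability(110.0, 0.0))    # ~0.9
```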

The above imitation process allows us to model the probability that the number, k, of players using the mutant strategy, A, changes by ± one in each time step as (Z is the population size) (Traulsen et al 2006)

$T^{\pm}(k) = \frac{Z - k}{Z}\,\frac{k}{Z}\left[1 + e^{\mp \beta f_{A,B}(k)}\right]^{-1}. \qquad (2)$

We can use these probabilities to calculate the fixation rate. As we assumed that mutation is rare, it does not affect the above probabilities. Eventually, either all mutants abandon the new strategy, or everyone plays the mutant strategy. One can show that the probability that we reach the latter outcome after introducing a single mutant with strategy A in a population of $(Z-1)$ agents using B is given by (Nowak et al 2004, Traulsen et al 2006)

$\rho_{B \rightarrow A} = \left(\sum_{i = 0}^{Z - 1}\ \prod_{j = 1}^{i} \frac{T^{-}(j)}{T^{+}(j)}\right)^{-1}. \qquad (3)$

As we are using the Fermi function, we can simplify the above equation as follows.

$\rho_{B \rightarrow A} = \left(\sum_{i = 0}^{Z - 1} e^{-\beta \sum_{j = 1}^{i} f_{A,B}(j)}\right)^{-1}. \qquad (4)$
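The following sketch computes this fixation probability numerically under the Fermi update; the function name and the payoff-function interface are our own, and a neutral-drift check (identical payoffs giving 1/Z) is included as a simple sanity test.

```python
import numpy as np

def fixation_probability(payoff_mutant, payoff_resident, Z=50, beta=0.02):
    """Probability that a single mutant playing A takes over a population
    of Z - 1 residents playing B under the Fermi imitation rule.

    `payoff_mutant(k)` and `payoff_resident(k)` return the expected payoffs
    of the two strategies when k players use the mutant strategy."""
    # f(j) = payoff difference when j mutants are present, j = 1..Z-1.
    f = np.array([payoff_mutant(j) - payoff_resident(j) for j in range(1, Z)])
    # Partial sums sum_{j=1}^{i} f(j) for i = 0..Z-1 (the empty sum is 0).
    partial_sums = np.concatenate(([0.0], np.cumsum(f)))
    return 1.0 / np.sum(np.exp(-beta * partial_sums))

# Neutral drift sanity check: identical payoffs give a fixation probability of 1/Z.
print(fixation_probability(lambda k: 1.0, lambda k: 1.0))  # 0.02 for Z = 50
```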

We can now write the elements of the transition matrix as follows, where S is the number of states 6 :

$P_{ij} = \frac{\rho_{ij}}{S - 1} \ \textrm{for}\ i \neq j, \qquad P_{ii} = 1 - \sum_{j \neq i} P_{ij}. \qquad (5)$

By construction, this transition matrix is irreducible. Therefore, it has a unique stationary distribution (see Häggström 2002 for a proof). This unique stationary distribution, V, satisfies

$V P = V, \qquad \sum_{i} V_i = 1. \qquad (6)$

We can find this unique stationary distribution by noting that V, is the normalised left eigenvector with eigenvalue 1 of the transition matrix, P. We use the Grassmann–Taksar–Heyman algorithm to compute these eigenvectors in our numerical calculation (Stewart 2009).

The stationary distribution can be interpreted as the percentage of time that the system spends in (or around) each of these states. The results to follow interpret the stationary distribution as telling us the relative frequencies with which each type of interaction occurs between an auditor and AI companies.
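Putting the pieces together, the sketch below assembles the small-mutation-limit transition matrix over the six monomorphic states and extracts its stationary distribution. For brevity it uses a direct eigenvector solve rather than the Grassmann–Taksar–Heyman algorithm used for the paper's numerical results, and the `fixation` callback (which would wrap the fixation-probability routine above together with the payoff tables) is left as a user-supplied function; the toy example plugs in neutral fixation rates.

```python
import numpy as np

# Monomorphic states of the two-population chain: (auditor strategy, firm strategy).
STATES = [("HQ", "AS"), ("HQ", "AU"), ("HQ", "VS"),
          ("LQ", "AS"), ("LQ", "AU"), ("LQ", "VS")]

def stationary_distribution(fixation):
    """Build the rare-mutation-limit transition matrix and return its
    stationary distribution.

    `fixation(i, j)` should return the probability that a mutant carrying
    state j's differing strategy fixates in the population described by
    state i, and 0 for transitions that would change both populations."""
    S = len(STATES)
    P = np.zeros((S, S))
    for i in range(S):
        for j in range(S):
            if i != j:
                P[i, j] = fixation(i, j) / (S - 1)
        P[i, i] = 1.0 - P[i].sum()  # rows sum to one
    # Stationary distribution: normalised left eigenvector with eigenvalue 1.
    eigenvalues, eigenvectors = np.linalg.eig(P.T)
    v = np.real(eigenvectors[:, np.argmin(np.abs(eigenvalues - 1.0))])
    return v / v.sum()

def neutral_fixation(i, j, Z=50):
    """Toy example: neutral fixation (1/Z) for every valid one-population
    transition, 0 otherwise."""
    same_auditor = STATES[i][0] == STATES[j][0]
    same_firm = STATES[i][1] == STATES[j][1]
    return 1.0 / Z if (same_auditor or same_firm) else 0.0

print(stationary_distribution(neutral_fixation))  # uniform over the six states
```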

4. Results

We now turn to our analytical and numerical results. In the first two sections, we explain our key takeaways concerning Adversarial and Vigilant incentives. We then consider the optimal design of an Auditing Ecosystem for balancing the competing concerns of risk reduction and overregulation. We then discuss how beliefs about AI risks and the size of externalities might influence the optimal design. Finally, we compare an Auditing Ecosystem with Vigilant Incentives to direct government regulation.

4.1. Adversarial incentives fail to sustain an auditing ecosystem

When we first introduced Adversarial Incentives, we told a plausible story of why they would be appealing to introduce. Our first result shows that this intuition was misguided.

Figure 3(a) suggests that catching unsafe AI companies in the act is only a dream. AI companies know to play it safe when a HQ auditor is active, so auditors cannot benefit from improving their detection rate. Figure 3(b) confirms that in the long run, auditors learn that they are better off skipping the investment, and AI companies remain unsafe in the dilemma zone.

Figure 3. Adversarial Incentives have negligible impact on the behaviour of auditors and companies. (Panel (a)) The parameter space (here we show the speed advantage, s, and level of AI risk, $p_\textrm{r}$) can be split into regions where AI companies are always safe or always unsafe. AI companies choose their behaviour as they would have in the absence of any Auditing Ecosystem. The solid lines indicate the risk dominance (top line) and socially efficient thresholds (bottom line) for the always safe strategy in the absence of an Auditing Ecosystem. The area between them is the 'dilemma zone'. (Panel (b)) No auditor invests in HQ tools. We would therefore not see any change in welfare relative to a scenario where the government incentive and HQ detection rate are both 0. The model parameters take on values: $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $B/W = 100$, β = 0.02.


Let us consider a brief analysis of the model. The subgame-perfect Nash equilibrium of the game can be solved by backward induction. Assume that we have model parameters such that figure 1 describes the relevant equilibria for AI companies when faced with different auditors. Therefore, each company's strategy is to play AU when facing a LQ auditor and to play AS when facing a HQ auditor. This is precisely the conditional strategy, VS.

The auditor will choose whichever option leads to an equilibrium with greater payoff. Let $I(.)$ denote a function that maps the detection rates $p_\textrm{h}$ or $p_\textrm{l}$ to the size of the incentive they expect to receive. Thus, the auditors will choose HQ if $r_\textrm{h} + I(p_\textrm{h} | \textrm{companies play} \textbf{AS}) \gt r_\textrm{l} + I(p_\textrm{l} | \textrm{companies play} \textbf{AU})$. We assume that auditor profits are lower for HQ auditors, $r_\textrm{h} \lt r_\textrm{l}$, so we need to choose incentives which are increasing in either the detection rate or in the number of safe companies.

Adversarial incentives achieve neither of these characteristics: $r_\textrm{h} \lt r_\textrm{l} + g \cdot p_\textrm{l}$, and a positive LQ detection rate only further discourages auditor investment. In the SPNE, auditors will therefore be of LQ.

The Markov chain diagram, figure 2, conveys the dynamics at play in the dilemma zone of figure 3(a). Auditors who try to be HQ will find that AI companies either move to the conditional strategy, VS, or play AS. In either case, the auditor has no unsafe firms to catch, so it switches to LQ. If AI companies were playing AS, they would return to an unsafe strategy once the auditor gave up. Once in these states, $\textbf{LQ-AU}$ or $\textbf{LQ-VS}$, auditors and AI companies are extremely unlikely to change what they do.

We can explain these dynamics in relation to our story from earlier. As mentioned above, auditors eventually become complacent once they have deterred companies from acting unsafely. However, progress in AI often leads to new capabilities, capabilities which may require auditors to consistently improve and innovate on their approach to detecting unsafe practices. It is easy to imagine that new risks will go relatively unnoticed by auditors, similar to the situation credit rating agencies found themselves in prior to the 2007–8 financial crisis (Clark and Hadfield 2019). Incumbent auditors may eventually consider improving their capabilities, but as AI companies learn to behave conditionally on observing this effort, they will quickly lose interest in further investment, as they cannot find unsafe companies to collect a bounty for. These incumbent auditors will only give us a false impression of safety.

4.2. Vigilant Incentives are sufficient to sustain an Auditing Ecosystem that deters unsafe behaviour

We now consider our 'Vigilant incentive', which satisfies the requirements of our preferred SPNE. HQ is the SPNE whenever $r_\textrm{h} + g \gt r_\textrm{l} + g \cdot p_\textrm{l}^2$. This means that we need $g \gt \frac{r_\textrm{l} - r_\textrm{h}}{1 - p_\textrm{l}^2}$. If the detection rate of LQ auditors is 0, then the government incentive only needs to be large enough to cover the profit gap between LQ and HQ auditors 7 .
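As a quick numerical check of this threshold, using the default profit values from table 1 (a minimal sketch; the function name is ours):

```python
def minimum_vigilant_incentive(r_l=0.0, r_h=-1.0, p_l=0.0):
    """Smallest g that makes HQ the subgame-perfect choice under Vigilant
    Incentives, from r_h + g > r_l + g * p_l**2."""
    return (r_l - r_h) / (1.0 - p_l**2)

# With r_l = 0, r_h = -1 and p_l = 0 the threshold is g > 1, consistent with
# the g = 1.2 used in the dilemma-zone figures.
print(minimum_vigilant_incentive())  # 1.0
```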

Figures 4 and 5(a) confirm that such an incentive is sufficient to see the emergence of HQ auditors as a result of social learning. We choose g to be slightly larger than needed to change the SPNE so that we encourage faster learning among auditors.

Figure 4. Vigilant Incentives encourage auditors to be high-quality innovators. This Markov Chain diagram shows the transitions between states and their long-term frequencies. States are coloured blue if AI companies act safely and orange if AI companies act unsafely. The parameters chosen place us in the dilemma zone, $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $p_\textrm{r} = 0.6$, s = 1.5, $B/W = 100$, β = 0.02.


Figure 5. Vigilant Incentives reduce AI risk by deterring unsafe behaviour. (a) The parameter space (here we show the speed advantage, s, and level of AI risk, $p_\textrm{r}$) can be split into regions where AI companies are always safe or always unsafe. AI companies choose their behaviour as they would have in the absence of any Auditing Ecosystem. The solid lines indicate the risk dominance (top line) and socially efficient thresholds (bottom line) for the always safe strategy in the absence of an Auditing Ecosystem. The area between them is the 'dilemma zone'. (b) Auditors often choose to be high-quality in the dilemma zone. A small percentage of auditors remain high-quality outside of it. (c) The Auditing Ecosystem improves welfare in the dilemma zone, but slightly reduces welfare through overregulation outside of it. The model parameters take on values: $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $\frac{B}{W} = 100$, β = 0.02.


Consider figures 3 and 5 for the Adversarial and Vigilant incentives, respectively. Previously, the SPNE for Adversarial incentives precisely matched the results of the evolutionary model, but the same cannot be said for Vigilant incentives: although many auditors choose to invest in being HQ in the dilemma zone, not all do. Moreover, even when auditors choose LQ in the SPNE (bottom right of figure 5(b)), a very small proportion of auditors remain HQ in the evolutionary model (with so few HQ auditors, the different shades in figure 5(b) may be difficult to see at first glance). These results can be explained with reference to evolutionary dynamics.

When we examine the Markov chain diagram, see figure 4, we can see that it is now much more common to be in state $\textbf{HQ-VS}$. Most importantly, auditors in the $\textbf{LQ-VS}$ state now have a reasonably strong incentive to switch to being HQ. Unlike in the previous Markov chain, there is still a somewhat probable path where companies can drift to playing AS. Auditors may become complacent, leading to lower investments after returning to the LQ strategy, which in turn can allow the revival of unsafe behaviour. This is one reason why not all auditors end up being HQ in the dilemma zone.

Another notable dynamic is the one-way transition from $\textbf{HQ-AU}$ to $\textbf{LQ-AU}$. The government incentive in our case is not high enough here to encourage auditors to be HQ when AI companies are unsafe, helping to explain the relatively slow transition to safer states. However, if g were much higher (as in figure A2), this transition would flip, leading to a much larger fraction of auditors choosing to be HQ even in scenarios where risks are low. Therefore, choosing a g that allows for the former dynamic helps avoid most overregulation but comes at the cost of accepting some proportion of unsafe firms. We discuss this trade-off more deeply in the next section.

Our readers may be interested in how Adversarial Incentives would compare to these Vigilant Incentives if the government were willing to spend as much on them as the costlier Vigilant Incentives require. However, due to the inability of Adversarial Incentives to achieve any meaningful change in the behaviour of AI companies, we cannot reduce risk by increasing our spending on them. Moreover, we show in figure A6 that mixing Adversarial and Vigilant Incentives only weakens the positive effects of Vigilant Incentives. These results are unlikely to change unless one considers a very different model. As a result of the disappointing performance of Adversarial Incentives, we see no need to compare Vigilant Incentives with Adversarial Incentives when presenting the rest of our results.

The key takeaway here is that the design of our incentives matters. Adversarial incentives fail to guard auditors against AI companies who play conditional strategies, so are unhelpful for supporting an Auditing Ecosystem. On the other hand, Vigilant incentives reward auditors for the service they offer, discriminating against auditors who let unsafe AI companies enter the market undetected. This design ensures that when monitoring investments can deter unsafe behaviour, auditors are motivated to act on it.

4.3. The optimal design of an auditing ecosystem with vigilant incentives

Now that we have opted to use Vigilant Incentives to build our Auditing Ecosystem, it is time to consider how we can optimally design this proposal to reduce risk whilst avoiding overregulation. The key lesson is that we can maintain high risk reductions and reduce overregulation with careful adjustments to the following parameters: the detection rate that HQ auditors achieve, $p_\textrm{h}$, their impact on unsafe firms they catch, φ, and the government incentive, g.

First, the detection rate, $p_\textrm{h}$, has a direct influence on the extent of overregulation. A well-chosen detection rate can ensure that the deterrent only has an effect when society prefers companies to be safe.

Figure 6(a) summarises this finding. Higher values of $p_\textrm{h}$ shift the threshold where social learning selects safe behaviour closer to the threshold where society prefers safety (the solid lines in the figure are the original and desired thresholds, as in previous figures; the dashed lines are the result of different detection rates). However, if $p_\textrm{h}$ is too high, then social learning will select safe behaviour even when society prefers companies to take risks. As figure 6(a) shows, for a suitable choice of g, the optimal detection rate aligns the two thresholds. In this case, a detection rate of around 0.6 seems to best align the behaviour of AI companies with the values of society. We will later return to a discussion of when it could be feasible for policymakers to influence the detection rate.

Figure 6. The details of an Auditing Ecosystem influence welfare: (a) risk dominance thresholds for different values of the detection rate for high-quality auditors, g = 1.2 (other parameter values specified below). (b) Expected Δ Welfare under Vigilant Incentives for different levels of incentives, g, and detection rates, $p_\textrm{h}$. Only positive values are shown. Expected Welfare is computed uniformly over the space of $s \in [1, 5]$ and $p_\textrm{r} \in [0, 1]$. The model parameters take on values: φ = 0.5, $B/W = 100$, β = 0.02.


Second, let us discuss the strength with which auditors penalise unsafe behaviour. The figures we display show results for φ = 0.5, indicating that auditors slow the companies they catch to half the expected speed of companies that act safely. We have also considered φ = 1, where caught companies are only slowed to match the speed of safe AI companies, and φ = 0, where companies are completely barred from participating in AI development. The choice of φ that is the most appropriate depends on the detection rate. A high detection rate likely only needs lenient punishments to be a good deterrent to unsafe behaviour. A low detection rate requires stronger punishment.

However, there are additional considerations that lean towards our choice of φ = 0.5. Lower values of φ allow a wider range of detection rates to produce a net positive welfare effect from Auditing Ecosystems. These detection rates are also lower, which matters because auditors will likely only achieve modest detection rates in practice. On the other hand, for all practical purposes, a low φ such as φ = 0 is implausible. History suggests that it is very difficult to bar companies from markets in which they have a strong foothold. Microsoft is a notable example of a company that has faced legal repercussions for anti-competitive behaviour, yet to this day operates in the same markets (Economides 2001). The market for AI has companies with similar levels of power and legal capability. A good compromise is likely to slow down unsafe companies by requiring them to follow past and present safety guidelines. Additional security checks and requirements could disadvantage such a company, but due to the uncertain nature of AI development, still leave them with a significant chance of catching up with safer companies. This is not to say that we should rule out the strongest punishments in all cases: if companies try something truly reckless, it may be desirable to set a new precedent and shut down their activities.

Third, figure 6(b) suggests that the government incentive, g, should be large enough for HQ auditors to perform better than others when faced with AI companies that play their conditional strategy. Otherwise, HQ auditors would switch to being LQ over time. In figure 6(b), g > 1 is required for this purpose: note the discontinuous jump in expected welfare past g = 1. Additionally, auditors need to participate in the market, so they should do better than their outside option (which in our model we have assumed for simplicity to be 0, which is the same as the net profits for LQ auditors).

Increasing g further encourages more auditors to invest in being HQ, which means a stronger deterrent to unsafe behaviour. However, as we can see in figure 6(b), a g that is too high can lead to a reduction in expected welfare.

This reduction in expected welfare comes from overregulation, some of which is visible in the lower right corner of figure 5(c). Higher values of g ensure that auditors continue to invest in HQ detection methods, even when the deterrent fails to deter unsafe behaviour. If the detection rate already aligns the behaviour of AI companies with society's preferences, then these HQ auditors will punish companies that society would prefer they did not. Overregulation can be reduced by keeping g low, at the cost of a higher level of unsafe behaviour in the dilemma zone.

4.4. The need for auditing ecosystems depends on beliefs about AI risk

Our discussion has so far considered how the design of the Auditing Ecosystem influences its impact on social welfare. Until now, we have not explicitly discussed how different beliefs about AI risk should affect our evaluation of an Auditing Ecosystem.

We first ask the reader to consider their beliefs about the level of risk presented by different AI capabilities, as well as the speed advantage of skipping associated safety norms. If the level of risk is high, and the speed advantage is relatively low, then we are more likely to be in the dilemma zone where society prefers unsafe AI companies to be safe. If the level of risk is low and the speed advantage is high, then it is likely that society prefers AI companies to take risks and accelerate innovation 8 .

Let us consider a purely illustrative example that focusses on Large Language Models, such as GPT-4 and its variants. There is an argument that the risks of an AI disaster are high if these models are widely adopted and used to generate misinformation, although perhaps not above 50%. At the same time, it might be hard to imagine seeing the pace of development we have seen so far if AI companies could not deploy LLMs until they no longer confidently generate fake information in response to a query. For illustration purposes, we might expect the mean level of risk to be normally distributed around 50% and the speed advantage to be normally distributed around a factor of 4. As this example places most probability mass outside the dilemma zone, we should expect Auditing Ecosystems to mainly bring overregulation.

So far, the risks from an AI disaster have been assumed to be isolated to the company that enjoys the benefits of achieving breakthroughs in AI capabilities. This simplification of the model appears to be inaccurate in two respects. First, the risks of an AI disaster may be collective in the sense that a disaster affects all companies in the AI market: an AI disaster may cause a government backlash and an AI winter, or, if the disaster is catastrophic, the assets or the people who make up each company may be in peril (Cave and ÓhÉigeartaigh 2018, Dafoe 2018). Second, the systemic and possibly catastrophic nature of AI risks means that externalities are very plausible. It is unlikely that AI companies will internalise the harms that misinformation or tail risks such as disruption of critical infrastructure will cause to citizens.

Figure 7 demonstrates how welfare results may change if we explicitly model the presence of externalities and the collective nature of AI disasters. If AI companies recognise the collective risk of an AI disaster, then Auditing Ecosystems can more directly influence behaviour. The dilemma zone is larger under collective risk because AI disasters are more likely. However, Auditing Ecosystems also find it easier to deter unsafe behaviour. Auditing Ecosystems are therefore much more likely to have a net positive welfare effect.

Figure 7. Δ Welfare under Vigilant Incentives when we consider (a) adding collective risks that affect all firms, and (b) adding large externalities that AI companies do not expect to bear. Both ways of capturing the systemic nature of the risks presented by upcoming AI capabilities suggest that well-designed Auditing Ecosystems can greatly improve expected welfare. The model parameters take on values: $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $B/W = 100$, β = 0.02.


If the externalities are large enough, for example 20% of the size of the benefits to society of having new AI capabilities as soon as possible, then Auditing Ecosystems are significantly better at improving welfare.

For the most part, the results of the model are similar in pattern to those we have discussed so far. There is a wider range of scenarios where Auditing Ecosystems reduce risk without causing overregulation. The suggestions regarding g, $p_\textrm{h}$, and φ remain the same. However, once the externalities are large enough, overregulation ceases to be an important concern. It becomes justifiable to spend larger incentives. Higher detection rates can also be employed as the costs of overregulation become relatively small. Such a scenario also perhaps justifies the use of additional policies to reduce risk, such as penalties for negligent auditors—we consider such penalties in figure A7.

To summarise, Auditing Ecosystems become more viable when one believes the AI systems under consideration have a greater risk of causing harm to society at large. As we collect more information about the potential risks of future AI systems, governments may face even stronger incentives to invest in Auditing Ecosystems.

4.5. Auditing Ecosystems deal with uncertainty better than a Government regulator alone

How does an Auditing Ecosystem compare to the government enforcing regulation directly? We find results which are clearly favourable for Auditing Ecosystems. Though we have discussed at length the challenge of overregulation, Auditing Ecosystems are far better at avoiding overregulation than a government regulator.

Assume the government achieves the same optimal detection rate as we assumed our Auditing Ecosystem does. Figure 8(a) demonstrates that the government does better in reducing risk. In fact, since the government does not cycle between LQ and HQ auditors, all companies are deterred from unsafe behaviour in the dilemma zone.


Figure 8. Direct government regulation—instead of an Auditing Ecosystem, we could allocate government spending to an institutional regulator. (a) Assuming it would achieve the same detection rate, a government regulator would be more effective in discouraging unsafe behaviour. The dilemma zone is completely eliminated. (b) However, the government always aims for high-quality detection. There is no fallback mechanism to discourage regulation where it is not needed. (c) As a result, we see a massive loss in welfare due to overregulation outside the dilemma zone. Other parameter values are: g = 0, $p_\textrm{h} = 0.6$, $B/W = 100$, φ = 0.5, β = 0.02.


However, this rigidity in the detection rate of a government regulator is also the source of overregulation, see figure 8(b). Even when society prefers AI companies to take risks, the government will discover and punish these companies, slowing down innovation. Throughout this paper, we have argued that it is difficult for the government to know whether a market for AI is in the dilemma zone. Given this uncertainty, government regulation could be excessive if we expect risks to be low.

The difference between overregulation in figure 8(c) for the government and figure 5(c) for an Auditing Ecosystem is substantial. Auditing Ecosystems do better in this case because they can fail. Auditors do not invest in better detection methods outside the dilemma zone because if they cannot deter AI companies from acting unsafely (or catch them all), the government will not pay them. Relative to direct government intervention, this failsafe means that Auditing Ecosystems offer a much better deal to policymakers given the uncertainty of AI development.

We do not mean to suggest that Auditing Ecosystems are a replacement for government regulation. Recall that Auditing Ecosystems aim to meet targets set by the government in the first place. Moreover, the Vigilant Incentives we propose require that governments are knowledgeable about cutting-edge AI deployments and can independently monitor AI companies to reveal whether auditors are living up to their targets. Government monitoring is therefore essential to a thriving Auditing Ecosystem which avoids capture from the AI companies they must regulate. Ultimately, Auditing Ecosystems and government regulation serve as complements rather than substitutes.

5. Discussion

In this paper, we have presented tentative evidence that a well-designed Auditing Ecosystem can play a role in reducing risks from even transformative AI systems (Gruetzemacher and Whittlestone 2022). Readers may also be curious about how practical considerations might inform our warnings against Adversarial Incentives and our recommendations for Vigilant Incentives.

We first touch upon what our model might say about two proposals which are at least on the surface similar to what we propose: Regulatory Markets and Bug Bounties. We then give additional reasons why Adversarial Incentives are likely to result in the collapse of an Auditing Ecosystem. We also suggest some obstacles to an Auditing Ecosystem under Vigilant Incentives. These include difficulties in measuring the detection rate, considering multiple markets, and regulatory capture. We end with a brief discussion of how Auditing Ecosystems might be used internationally. Extensive modelling of these challenges and the design of tests for their presence would serve as excellent starting points for future work.

5.1. What does our model say about similar proposals?

We briefly discuss two proposals that resemble our model: Regulatory Markets and Bug Bounties.

5.1.1. Regulatory Markets

Although we frame our work in terms of auditing, our model is applicable to a range of regulatory intermediaries, including private certifiers and organisations that produce standards. All of these intermediaries pursue some method of assessing whether companies comply with a given set of norms. We choose to focus on auditors in particular because it is most natural to think of them as detecting unsafe behaviour. While other intermediaries may help make deviations from normative behaviour in AI development more salient, and hence easier to detect, this is not necessarily their main goal.

Hadfield and Clark (2023) propose the creation of Regulatory Markets, where governments mandate that companies purchase the services of private regulatory intermediaries. Governments also determine the targets that the intermediaries' clients must achieve if the intermediaries hope to have their licence to regulate renewed. It is natural to think of their proposal as allowing private actors to be even more involved in regulation than in our discussion of auditing. We anticipate that much of our model, and our discussion of incentives, will be applicable in this more general case. However, our model does not address the determinants of competition, nor how to ensure that regulatory competition remains aligned with democratic values. Future work that aims to model regulatory markets could build on our model if it can additionally capture those missing features.

5.1.2. Bug Bounties

Bug bounty programmes award prizes to security researchers who submit valid reports of vulnerabilities they find in the organising company's systems. Many tech companies use these programmes to crowdsource the detection of critical vulnerabilities. Recently, OpenAI announced its own bug bounty programme on the Bugcrowd platform for its AI services (Open AI 2023a). Evidence indicates that these programmes are effective at consistently finding vulnerabilities while only costing as much as hiring a few additional software engineers (Subramanian and Malladi 2020, Walshe and Simpson 2020, Sridhar and Ng 2021, Wachs 2022).

At first glance, Bug Bounty programmes appear similar in design to the Adversarial Incentives we discuss in this paper: only those who submit valid reports of a vulnerability are rewarded. However, there are some telling signs that this situation differs from the one an auditor faces.

First, companies offer these incentives themselves. Second, they will generally not be punished once a vulnerability is discovered. While there is a risk that a vulnerability could be disclosed inappropriately, the main cost they will incur is in committing time to repair the vulnerability. The evidence suggests that these costs are relatively small.

The above differences imply that AI companies that pursue Bug Bounty programmes would be using them as a tool to ensure safer AI development. In addition, researchers who are more effective at finding vulnerabilities tend to be rewarded more highly. Recall that our model assumes a scenario where HQ audits are effectively rewarded less than LQ audits. In these respects, Bug Bounties operate in a very different way to what our model captures.

It is important to note that Bug Bounty programmes are unlikely to make good substitutes for audits. While they are effective in catching vulnerabilities, especially those which may be exploited by malicious users, they may not be effective at addressing issues endemic to the design of AI systems. They are also not designed with the goal of demonstrating compliance, nor in providing a full technical audit of an AI system.

The above takeaway is important. Should industry or governments decide to repurpose Bug Bounties for the broader goals of auditing, we may end up in a situation described by our model of audits with Adversarial Incentives.

To provide a clear recommendation: this scenario is likely if a regulator or similar institution were to impose barriers or punishments on AI companies based directly on the results of a Bug Bounty programme.

5.2. Practical considerations for Adversarial Incentives

We have shown that Adversarial Incentives tend to fail to promote investment in HQ audits over time. In spite of this result, we anticipate that some readers will still believe that these incentives are worth attempting, given that the government does not have to pay anything unless unsafe behaviour is caught.

We ask our readers to contemplate the following additional reasons to avoid Adversarial Incentives. It could be much easier for private auditors to fake or exaggerate claims of unsafe behaviour, and they may even have an incentive to do so when AI companies aim to be as safe as possible. Such corruption would destroy the credibility of the Auditing Ecosystem and would only encourage unsafe behaviour.

Another reason is that we want to avoid pitting private auditors against AI Companies as adversaries. Advocates for AI Safety are often located within AI companies. Fostering animosity between industry and auditors only increases the difficulty of achieving consensus on the risks of future AI capabilities. We may also see other forms of antisocial punishment, such as industry or industry-aligned academics denouncing auditors (Herrmann et al 2008). This is not to say that collusion between AI companies and auditors is desirable. A lack of regulatory independence could also lead to ineffective regulation and may even act as a smokescreen against unacceptable behaviour (Clark and Hadfield 2019).

Note that there may be schemes which act implicitly as Adversarial Incentives. For example, a reputation system which gave high ratings of trust to auditors who detect and report the unsafe behaviour of companies could count as providing Adversarial Incentives, especially if these ratings were key to the private auditors securing future lucrative work. We argue that reputation systems should aim to avoid implicit Adversarial Incentives. Instead, reputation systems should focus on directly promoting truthfulness, as well as considering other insights from the literature that are more specific to reputation systems (Barton 2005, Brundage et al 2020, Cihon et al 2021).

5.3. Practical considerations for Vigilant Incentives

5.3.1. Ensuring the participation of private auditors

Vigilant incentives may also be unappealing to auditors. It would be odd for private auditors to have a business model where failing to detect unsafe behaviour might risk the entire business (this does not have to be the case, but for simplicity we often model scenarios where HQ auditors might not break even without government support). Stronger deterrents may lessen the risk, but care must be taken to ensure that this proposal has the ability to attract private auditors to participate in the Auditing Ecosystem.

In our model, we simplified away the issue of participation by assuming that LQ auditors are usually indifferent between participating in the Auditing Ecosystem and their outside option. However, this simplification is unlikely to hold. Recent work uses Evolutionary Game Theory to show that incentivising participation is just as important as incentivising compliance for overcoming conventional social dilemmas (Han 2022). We should expect a similar result to hold for Auditing Ecosystems.

Clark and Hadfield (2019) argue that many of the benefits of an Auditing Ecosystem could come from its independence from industry and the competitive pressure to find innovative ways to more cheaply evaluate cutting-edge AI systems. Both benefits seem less likely if there are high barriers to entry or if larger auditors have a motive to buy out smaller auditors. Governments can play a role in keeping the Auditing Ecosystem competitive by incentivising new entrants, and we welcome further research on other ways governments can promote healthy competition in Auditing Ecosystems 9 .

5.3.2. Reducing the cost to governments

At first glance, Vigilant Incentives may be unappealing to governments. These incentives ask governments to at least pay each private auditor enough so that the highest-quality auditors are better off than their lower-quality analogues. As discussed, we may also need to incentivise their participation.

The literature on public goods reveals several funding mechanisms that the government can use to raise these funds from different stakeholders (Tabarrok 1998, Sasaki et al 2012, Buchholz and Sandler 2021). We leave a comparison of these mechanisms in the context of an Auditing Ecosystem to future work.

Ultimately, some groups will have to bear the cost of providing these incentives, whether they be taxpayers, AI companies, or users of AI systems. It is natural to ask if there is anything the government could do to reduce the need for these incentives in the first place.

We turn our reader's attention to an unexplored part of our setting. We assumed that, in the absence of an Auditing Ecosystem, HQ private auditors would make a loss relative to their lower-quality peers. This assumption was motivated by the larger talent and capital costs that we might expect to come with investments in better detection methods for cutting-edge AI systems. We also anticipate that, if markets set the price AI companies pay to private auditors, AI companies are likely to pay more to auditors who they believe will evaluate them more favourably.

The above assumption is not guaranteed. The cost of audit innovation may prove to be fairly low for a variety of AI applications. AI companies may have motives to pay more to auditors with greater prestige. It might also be difficult for an AI company to win the trust of its user base if a rival can demonstrate that its own evaluation was both more relevant and more reliable. The logic here is also relevant to AI certification schemes, as discussed in Cihon et al (2021).

If the costs of evaluating AI systems are especially high, then it becomes more likely that governments can distinguish between HQ and LQ auditors before they perform any audits. In these cases, the government's dilemma may look very different, as they would have much more information with which to tailor their incentives.

So far, we have discussed ways the government might mitigate the cost of an Auditing Ecosystem. However, it is worth highlighting that if the risk reduction from an Auditing Ecosystem is high, then the costs the government faces may comparatively be very small. For this reason, we suggest that future work on Auditing Ecosystems consider a more thorough assessment of the costs and benefits associated with Auditing Ecosystems. Such work seems especially timely given that the UK government is exploring the role of government in a similar scheme (GOV.UK 2023).

One more proposal that we suggest can complement Auditing Ecosystems is voluntary safety agreements (Han et al 2022). Companies voluntarily make agreements to adhere to safety norms, expecting that those who violate the agreement will be punished, either by other companies or by an institution. Han et al (2022) found in their model that voluntary safety agreements can increase safety compliance without risking overregulation. These agreements are useful because Auditing Ecosystem incentives may need to be kept low to mitigate overregulation. Using these policies together can help eliminate the dilemma zone while keeping the costs of Auditing Ecosystems low.

We could add voluntary safety agreements to an Auditing Ecosystem as follows. In addition to the targets that governments set, private auditors could enforce the voluntary safety agreements that companies agree to. The increased detection capabilities of Auditing Ecosystems make these agreements much more credible than they would otherwise be, since defectors from the agreement are much more likely to be caught. Keep in mind that voluntary safety agreements are ignorant of externalities, whereas governments can set targets that take externalities into account. In the presence of externalities, voluntary safety agreements cannot serve as a replacement for government targets that affect all companies.

5.4. How can we discover and manipulate the detection rate?

Unfortunately, it is not clear a priori what kind of detection rate we should expect to arise in an Auditing Ecosystem, nor is it clear whether we can shape it.

The first challenge is empirical: can we know whether the Auditing Ecosystem is likely to achieve a Goldilocks detection rate that is neither too low nor too high? This issue is less severe if the government pays close attention to data on the performance of auditors. It seems plausible that the government or another independent observer could infer the detection rate of HQ auditors. They could, for example, estimate the reliability of current-day audits of cutting-edge AI technologies across a range of related sectors. However, a relevant objection remains: will we learn that investing in an Auditing Ecosystem is a good idea with enough time remaining to course-correct if necessary?
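As a purely illustrative sketch of how such an inference might proceed (this is not a procedure from our model), an observer could combine the unsafe behaviour an auditor flagged with the cases it missed, revealed later through incidents or independent spot checks, to form a simple posterior over the detection rate. All counts below are hypothetical.

```python
# Illustrative sketch: infer an auditor's detection rate from proxy data,
# assuming a simple Beta-Binomial model. All counts are hypothetical.
from scipy.stats import beta

detected = 18            # unsafe deployments the auditor flagged
missed = 6               # unsafe deployments later revealed by incidents or spot checks
prior_a, prior_b = 1, 1  # uniform prior over the detection rate

posterior = beta(prior_a + detected, prior_b + missed)
low, high = posterior.ppf(0.05), posterior.ppf(0.95)
print(f"Posterior mean detection rate: {posterior.mean():.2f}")
print(f"90% credible interval: ({low:.2f}, {high:.2f})")
```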

Related to this concern is the issue of how to influence the detection rate. A failure to detect malpractice may be the result of a lack of time or staff to perform quality checks on audits. It could also be the result of a failure to anticipate emerging safety concerns in the latest models. If the detection rate is far too low, further incentives are unlikely to solve the problem; it may simply be too difficult to achieve a higher detection rate at that time. If the detection rate is too high, governments could advise auditors to perform audits with a lower probability. However, while probabilistic spot checks seem appropriate in airline security, it will not always be appropriate for AI auditors to forgo an audit. Alternatively, we could encourage auditors to be more forgiving. Note that, for such a scheme to remain ethical, any discovered malpractice would still have to be remedied; in effect, this amounts to punishing unsafe firms less harshly. Rather than reducing the detection rate, this makes larger detection rates more useful.

Another obstacle to measuring the detection rate is that safety standards are likely to be endogenous. If auditing deters unsafe practices as we hope, then the number of AI incidents is likely to be very small. This may lead to a push for higher standards, as smaller deviations from safety norms become more salient. Those assessing the detection rates of different auditors may wish to use their ability to spot these smaller deviations as evidence of their quality, especially in the absence of AI incidents. While this may often be a fair strategy for regulators to follow, there is a need to ensure sufficient overlap between the methods used to detect these smaller risks and the risks that the regulator wants to prevent from emerging. Future work may want to model the dynamics of how an Auditing Ecosystem might change the meaning of safety standards over time.

One takeaway is that it is useful to keep track of the detection rate of auditors in the Auditing Ecosystem. Besides the reasons outlined above, proxy measures are necessary to gain a better idea of whether the auditors are fulfilling the government's targets. It also seems sensible to encourage as high a detection rate as possible, since it seems difficult in practice to achieve high detection rates for flaws in novel technologies. If detection rates appear to be very high, the government could recommend lighter restrictions on the companies who defect, except in cases where large externalities are likely.

5.5. Additional challenges that face an Auditing Ecosystem

5.5.1. Large AI companies operate in multiple markets

The evidence suggests that large AI companies will operate in multiple markets. Large industry-housed labs account for the vast majority of private investment in new AI capabilities (Zhang et al 2022). This research stretches across multiple sectors of the economy, whether in improving visual effects or in software for better robotic assistants. It is also increasingly clear that new AI capabilities allow the development of 'general purpose AI systems' that we can expect, on their own, to be influential in multiple markets (Gutierrez et al 2023).

The challenge presented by AI companies operating in multiple markets is that it may be difficult to avoid at least one such market becoming underregulated. With so many possible applications of AI systems, it may be difficult for governments to be aware of the weakest links in their response to the risks presented by different technologies. The somewhat decentralised nature of Auditing Ecosystems holds promise in addressing these gaps in government monitoring, but it is not as clear from our discussion so far how governments can best target their incentives in the context of many markets.

In future work, we will integrate methods from network science to address this gap in our understanding of incentives for an Auditing Ecosystem (Choi et al 2020, Galeotti et al 2020). Cimpeanu et al (2022) have already studied the competitive dynamics of AI research on heterogeneous networks. Elsewhere, Cimpeanu et al (2023) have also studied how to target incentives to foster fairness on heterogeneous networks. The methods in Wang et al (2023) for designing optimal incentive protocols for structured populations could also prove useful for informing the design of incentives for auditors. However, to better represent the many markets in which AI companies operate, as well as to identify weak links in terms of regulations, we may need to turn to a multilayer network representation (Boccaletti et al 2014, Di Stefano et al 2015, Walsh 2019, Alvarez-Rodriguez et al 2022).

The inclusion of single and multilayer networks in the model would not only allow greater realism, it would also allow for the integration of heterogeneous sources of data about the relative risks and operations of AI companies in different markets. We could use such data to inform how policymakers should allocate time and resources towards each Auditing Ecosystem. It would also open up Auditing Ecosystems to be tested on whether they live up to the predictions of the model; a failure to do so can inform governments of how they might change course. Of course, not all sources of data will be consistent. In many cases of interest, data will be missing: not all countries will have the capacity to monitor the AI landscape, and not all companies will wish to be public about their research plans. In these cases, we plan to use machine learning techniques to infer the distributions relevant to the more complex model we have alluded to above.

5.5.2. Regulatory capture

A major reason why regulation can fail is regulatory capture. Collusion between regulators and the companies they regulate may be fairly common, as the people qualified to work at the regulator often built their qualifications and networks within the industry itself. This can lead to groupthink about what the risks are, in ways that might ultimately be self-serving.

It also leaves open the possibility that companies can find ways to reward auditors for approving their AI systems. If the prize from being the first mover in markets for future AI capabilities is large, then these rewards may overpower any incentives the government might offer. Regulation in this form may be worse than no regulation at all, as it may act as a smokescreen that discourages decision makers from taking pivotal action when it is needed most.

Regulatory capture is a difficult challenge, and we do not claim to present a solution here. Nor do we explicitly model this failure mode. Nevertheless, we can make a case that Auditing Ecosystems, relative to a public or hybrid regulator, are more likely to avoid this capture. A thriving Auditing Ecosystem would have many private auditors with a diverse set of overlapping responsibilities. It seems that it would be more difficult for even powerful companies to collude with most of the auditors they might work with. Furthermore, the presence of an authority that complements the monitoring activities of private auditors may increase the difficulty of colluding undetected.

In short, the careful design of incentives can encourage an Auditing Ecosystem that operates independently of the industry, without harming relationships between AI Safety advocates and labs developing AI capabilities.

Future work could explore a similar model to ours with an explicit collusion component (Lee et al 2019, Liu and Chen 2021, Liu et al 2024). We think it could be even more valuable for researchers to design formal tests for the presence of collusion in the market for AI in practice. Such tests could learn from models that have yielded evidence used in antitrust cases in the past (Besanko et al 2020).

5.5.3. International coordination

Lastly, we hope to encourage our readers to consider how auditing might contribute to international coordination on AI standards. There are several challenges to this endeavour.

First, as the targets that governments set for auditors are outcome-based, they almost assuredly imply that AI companies will need to design their AI systems with both ethical and technical standards in mind. von Ingersleben-Seip (2023) find that, so far, only technical standards for AI have seen successful international adoption, and attribute the failure to reach international agreement on ethical standards to large differences in values between countries over these standards 10 . Barring clever framings of these difficult bargaining problems (Jackson et al 2018), these conditions are unlikely to change.

Nevertheless, Auditing Ecosystems may still be successful given that each country is free to set the targets that its auditors must adhere to. However, if Auditing Ecosystems in different nations impose different requirements on AI companies, then a crucial assumption of our model is broken: the model assumes that all firms are equally affected by HQ auditors. Moreover, some governments may not adopt Auditing Ecosystems at all, especially if they have different perceptions of the risks presented by future AI capabilities. Predictably, large economies are unlikely to enforce regulatory commitments if doing so puts them at a disadvantage to their economic rivals or lessens their lead in a strategic domain. This narrative paints a rather bleak outlook for the regulation of AI, including the Auditing Ecosystems proposal we discuss here.

On the other hand, we have shown that Auditing Ecosystems could be a useful tool for deterring unsafe behaviour, one which provides more flexibility in responding to changing national or international contexts. Just as with other measures, Auditing Ecosystems could serve as a commitment device in international relations, allowing governments to commit to a safer market for AI, assuming that everyone else is willing to see through similar commitments (Putnam 1988, O'Keefe et al 2020). Crucially, the success of such commitments will depend on whether there is a shared perception of the risks from transformative AI (Jervis 1978, Askell et al 2019).

Acknowledgment

We would like to thank our anonymous reviewers for valuable feedback on our work.

Data availability statement

All data that support the findings of this study are included within the article (and any supplementary files).

Conflicts of interest

On behalf of all authors, the corresponding author states that there is no conflict of interest.

Author contributions

  • Paolo Bova: Conceptualization of this study, Methodology, Software, Writing—Original Draft, Formal analysis, Visualization
  • Alessandro Di Stefano: Writing—Review & Editing, Supervision
  • The Anh Han: Writing—Review & Editing, Supervision

Appendix

The main purpose of this appendix is to show how our key results for Vigilant Incentives change as we vary key parameters of our model. While Vigilant Incentives remain effective for a wide range of parameter settings, there are cases where this efficacy is greatly reduced. We discuss these nuances in the sections that follow and consider the viability of two alternative decisions the government could make: whether to mix Adversarial and Vigilant Incentives, and whether to penalise auditors who fail to prevent an AI disaster from occurring.

A.1. Market size

Recall that our model features two populations, one for the market of AI companies and one for the market of auditors, each of size 50. This scenario describes two relatively unconcentrated markets where no single company has a large share. Here, we weaken this assumption, since it is unlikely to hold in practice.

As we can see in figure A1, when the number of AI companies is small, the time they spend on each strategy is more evenly distributed. The noisier behaviour of companies encourages more HQ auditors when the risks are low and fewer in the dilemma zone, where they are needed. While Vigilant Incentives are still useful for reducing risk, auditing may no longer be sufficient to safeguard society from the risks that the AI systems of large companies may pose.


Figure A1. A smaller market for AI companies reduces the efficacy of Auditing, whereas a smaller market for Auditors has no effect. Other parameter values are: g = 1.2, $p_\textrm{h} = 0.6$, $B/W = 100$, φ = 0.5, β = 0.02.


When only the number of auditors is low, the dynamics do not change relative to a larger population of auditors. This highlights how the competitive dynamics between AI companies are crucial for informing how auditors will behave. It is plausible that these results would change if we were to explicitly model competitive interactions between auditors. In such a model, we would anticipate that a low number of auditors would lead to an even lower proportion of HQ auditors in the dilemma zone.

It is important to note that in our model the size of the market, Z, only affects the learning dynamics of the different actors. The competitive interactions between AI companies that we assume in the model still capture a large first mover advantage, which we argue is appropriate when considering competition among a select few large AI companies. As mentioned in the main text, one feature we miss is that collusion among AI companies could be more likely in more concentrated markets.
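To see why Z enters only through learning, recall the fixation probability of a mutant strategy B invading a resident strategy A. A minimal sketch, assuming the standard pairwise comparison (Fermi) rule commonly used in this literature rather than the exact form of our implementation, is:

$\rho_{B,A} = \left(1 + \sum_{i = 1}^{Z-1} \prod_{j = 1}^{i} \exp\left[\beta\left(\Pi_{A}(j) - \Pi_{B}(j)\right)\right]\right)^{-1},$

where $\Pi_{A}(j)$ and $\Pi_{B}(j)$ denote the average payoffs of the resident and mutant strategies when j individuals in the population of size Z play B. Under this sketch, the payoffs themselves, and hence the first-mover advantage among AI companies, do not depend on Z; only the imitation dynamics do.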

A.2. Higher government spending

A higher g leads to more high-quality audits and less unsafe behaviour in the dilemma zone. From figure A2, we can see that Vigilant Incentives encourage HQ auditing when AI companies are unsafe, not just when they play safe conditional on auditor behaviour.


Figure A2. Markov Chain diagram under Vigilant Incentives for g = 2. Other parameter values are: $p_\textrm{r} = 0.8, s = 1.5, p_\textrm{h} = 0.6$, $B/W = 100$, φ = 0.5, β = 0.02.


A.3. Net profits to HQ auditors

We assumed that HQ auditors would receive a large net loss relative to LQ auditors. We based this assumption on the high costs of technical audits that seek to explain and address the root causes of problems in AI systems, as well as the size of the largest models. Audits that do not attempt to understand AI systems may be less expensive, but will be more likely to fail to detect dangerous emergent behaviour.

We do not need to assume that HQ auditors make a loss for Vigilant Incentives to be necessary. What matters is that LQ audits, which are not suitable for detecting dangerous emergent behaviours, are likely to be more profitable for auditors to focus on.

In figure A3, we see that if the gap between HQ and LQ profits is zero, HQ auditing will be the norm, both inside and outside the dilemma zone. If the gap is larger than zero but smaller than in the main paper, then Vigilant Incentives are more efficient at reducing unsafe behaviour, both inside and outside the dilemma zone.


Figure A3. A smaller profit gap between high and low-quality auditors means high-quality auditing is more frequent. Other parameter values are: g = 1.2, $p_\textrm{h} = 0.6$, $B/W = 100$, φ = 0.5, β = 0.02.


On the other hand, if the profit gap is larger, Vigilant Incentives are less effective for the same size of incentive. If the profit gap is large, and especially if it is larger than the incentive, then Vigilant Incentives will be unable to encourage HQ auditing.

A.4. Detection rate of LQ audits

In figure A4, we allow LQ audits to have a non-zero detection rate for unsafe behaviour, albeit a lower rate than for HQ audits. For the same size of incentive, Vigilant Incentives are less effective in encouraging HQ auditing. More funding would be necessary to encourage HQ auditing, though the gains in risk reduction would be smaller than in our main text.


Figure A4. As the detection rate of low-quality auditors rises, Vigilant Incentives are less effective in promoting high-quality auditing. Other parameter values are: g = 1.2, $p_\textrm{h} = 0.6$, $B/W = 100$, φ = 0.5, β = 0.02.


A.5. The learning rate

As explained in the main text, we chose values of β for our two populations such that they had a 90% chance of adopting the strategy of a peer who achieved a payoff one standard deviation greater than their own (where the standard deviation is calculated across all possible payoffs for the range of parameters we are interested in). Figure A5 shows how the frequency of unsafe behaviour would change if that percentage were instead around 75% or 90%. Clearly, if there is much more noise when AI Companies and Auditors update their strategies, the long-run behaviour is less stable, so auditing is less effective. If they are more sensitive to how successful a strategy is, then auditing under Vigilant Incentives becomes even more useful.
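As a minimal sketch of this calibration, assuming the Fermi imitation rule $P(\textrm{adopt}) = [1 + \exp(-\beta\,\Delta\Pi)]^{-1}$ that is standard in this literature (our implementation may differ in detail), β can be backed out from the target adoption probability; the payoff standard deviation used below is a hypothetical placeholder.

```python
# Illustrative calibration of the learning rate beta, assuming the Fermi
# imitation rule: P(adopt) = 1 / (1 + exp(-beta * payoff_difference)).
import math

target_prob = 0.90   # chance of copying a peer whose payoff is one SD higher
payoff_sd = 110.0    # hypothetical standard deviation of payoffs

# Invert the Fermi rule at a payoff difference of one standard deviation.
beta = math.log(target_prob / (1 - target_prob)) / payoff_sd
print(f"Calibrated beta: {beta:.4f}")  # roughly 0.02 with these placeholder numbers
```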


Figure A5. Slower learning rates make behaviour, and therefore the benefits of auditing, more uncertain. Faster rates favour auditing. The model parameters take on values: $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $\frac{B}{W} = 100$.


A.6. Mixed incentives

What if the government gave a mix of both Adversarial and Vigilant Incentives to auditors? Figure A6 shows that giving even 20% of the incentive as Adversarial Incentives completely destabilises the Auditing Ecosystem. Smaller proportions, such as 10%, can still allow an effective Auditing Ecosystem, though the influence of Adversarial Incentives slightly attenuates its effectiveness at reducing risk.


Figure A6. Mixing Adversarial Incentives with Vigilant Incentives weakens an Auditing Ecosystem. The model parameters take on values: $p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $\frac{B}{W} = 100$, β = 0.02.


A.7. A penalty for auditors

It may be feasible to hold auditors partly liable for the harms that AI companies cause if negligence can be shown on the auditors' part. Such a policy may be appealing because auditors appear less likely to capture the large benefits that AI companies may receive from developing transformative AI systems first. Therefore, the penalties that governments must charge auditors to sway their behaviour could be much smaller in scale than those for AI companies.

Figure A7 shows how a penalty that is high enough can completely discourage LQ auditing. Low-quality auditors run too high a risk of facing the penalty, so low-quality auditing becomes very unlikely unless the risks are sufficiently low. In keeping with our findings from the main text, high penalties are likely to be justifiable if there is evidence to suspect large externalities as a result of dangerous emergent behaviour in AI systems.


Figure A7. A penalty for auditors can eliminate low-quality auditing for all but the lowest risks. The model parameters take on values: $\lambda = 5, p_\textrm{h} = 0.6$, g = 1.2, φ = 0.5, $\frac{B}{W} = 100$, β = 0.02.


The government must also make sure that HQ auditors are unlikely to receive a large fine. This will be true if their detection rates are high, or if there is strong evidence that they did everything they could and did not mislead anyone about how confident they should be that the risks would be avoided. Although we do not model it here, if these conditions do not hold, we would likely see no participation in the auditing market whatsoever, rather than HQ audits.

Footnotes

  • 1 This only scratches the surface of what governments and companies could do to reduce the risks from future AI capabilities (Brundage et al 2020, Cihon et al 2020, 2021, Naudé and Dimitri 2020, O'Keefe et al 2020, Truby et al 2022).

  • 2 To present a simpler model of payoffs, we assume that if both companies are unsafe, and caught, they are fully punished, φ = 0. This has no qualitative bearing on our results, but makes the equations here much easier to interpret.

  • 3 There are also rare choices of the values for different parameters for which we can have an asymmetric equilibrium where one AI company is safe and the other is unsafe. Our methods from Evolutionary Game Theory never select such equilibria, so we will not discuss them at greater length.

  • 4 Similar work has been conducted by another research organisation, Apollo Research. It is also worth highlighting that these methods of testing the behaviour of notable AI systems have been picked up by leading AI labs such as Google Deepmind as part of the evaluation of their Gemini release. Note that these organisations do not view these behavioural methods as being sufficient to guarantee systems are safe, pointing to work on mechanistic interpretability of AI systems as well as other work on the foundations of AI Safety.

  • 5 There is a discussion to be had about whether an evolutionary model is more or less appropriate for studying the market for future AI systems than, say, classical game theory. This may be especially relevant for the increasing competition to build large language models, which is typically led by companies willing to spend large amounts on the talent, computational infrastructure, and data collection needed to develop cutting-edge systems. Large companies may be more forward-looking and rational than smaller companies, and therefore may be less likely to learn through imitation of their peers. However, we maintain that Evolutionary Game Theory is a useful first approximation to these scenarios, especially given the close ties between these methods, the analysis of complex agent-based models, and reinforcement learning.

  • 6 Note that the fixation rate used for each element is the one relevant to the population undergoing the transition. If auditors changed strategy, then the fixation rate considers the success of the auditor's new strategy against the old one, rather than the success of an AI company's strategy, which is unaffected by the transition.

  • 7 Notice that if we know $r_\textrm{l} - r_\textrm{h}$, we can choose g such that $p_\textrm{l}$ has to be sufficiently low for the Auditing Ecosystem to be worthwhile. This means that our choice of g implicitly determines which values of $p_\textrm{l}$ are too high to justify an Auditing Ecosystem.

  • 8 Another popular view is that the level of risk is increasing in the speed advantage: getting AI capabilities earlier leads to higher risk. We find that a belief that risk and speed are positively correlated places much more probability mass in the dilemma zone, giving greater justification for an Auditing Ecosystem. However, this view has received criticism: an alternative view is that safety efforts are better targeted when AI capabilities are closer, suggesting that the level of risk may remain relatively unchanged with the speed advantage.

  • 9 There is some measure of debate surrounding whether more concentrated markets allow for less or greater innovation, sometimes discussed as 'dynamic efficiency' (Demsetz 1973, Berger and Hannan 1998). Companies with a high market share tend to benefit from an inelastic demand for their products. This market power reduces the need to innovate to survive. Moreover, it seems common for these companies to buy out new innovative entrants. On the other hand, companies may need the large economies of scale that a higher market share provides if they want to finance more R&D. In addition, new start-ups might even be motivated to innovate in the hopes of a lucrative buy-out. Recent literature reveals that context matters when determining whether a pre-emptive buy-out motive overpowers high barriers to entry (Hollenbeck 2020). While a stronger argument could be made in favour of market concentration for AI companies themselves, we suspect that barriers to entry and a reduced need to innovate will matter much more for private auditors, as case studies appear to suggest (Clark and Hadfield 2019).

  • 10 von Ingersleben-Seip (2023) also attribute these failures to the non-excludability and non-rivalry of ethical standards—the properties of a public good. However, their data appears instead to support the idea that countries are after different ethical standards, rather than choosing to free-ride on producing such standards.
