Randomized Benchmarking with Confidence

Randomized benchmarking is a promising tool for characterizing the noise in experimental implementations of quantum systems. In this paper, we prove that the estimates produced by randomized benchmarking (both standard and interleaved) for arbitrary Markovian noise sources are remarkably precise by showing that the variance due to sampling random gate sequences is small. We discuss how to choose experimental parameters, in particular the number and lengths of random sequences, in order to characterize average gate errors with rigorous confidence bounds. We also show that randomized benchmarking can be used to reliably characterize time-dependent Markovian noise (e.g., when noise is due to a magnetic field with fluctuating strength). Moreover, we identify a necessary property for time-dependent noise that is violated by some sources of non-Markovian noise, which provides a test for non-Markovianity.

Proof. We first prove the lower bound and show that it is saturated. We have To see that the above inequality is saturated, simply let D =  d . To see that this bound is saturated, let Δ(A) = |0〉〈0|A|0〉〈0|. ,

Introduction
One of the key obstacles to realizing large-scale quantum computation is the need for error correction and fault tolerance [1], which require the coherent implementation of unitary operations to high precision. Characterizing the accuracy of an experimental implementation of a unitary operation is therefore an important prerequisite for constructing a large-scale quantum computer.
It is possible to completely characterize an experimental implementation of a unitary using full quantum process tomography [2,3]. However, this approach has several major deficiencies when applied to large quantum systems. Firstly, it is provably exponential in the number of qubits of the system for any procedure that can identify general noise sources and hence it cannot be performed practically for even intermediate numbers of qubits, despite improvements such as compressed sensing [4,5]. Secondly, it is sensitive to state preparation and measurement (SPAM) errors, which create a noise floor below which an accurate process estimation becomes impossible [6]. Finally, it does not capture any notion of systematic, timedependent errors that can arise from applying many unitaries in sequence.
One can avoid the exponential scaling by accepting a partial characterization of an experimental implementation. A partial characterization of, for example, the average error rate and/or the worst-case error rate compared to a perfect implementation of a target unitary is typically enough to determine whether an experimental implementation of a unitary is sufficient for achieving fault-tolerance in a specific scheme for fault-tolerant quantum computation. Such partial characterizations can be obtained efficiently (in the number of quantum systems) using either randomized benchmarking [7][8][9][10][11][12] or direct fidelity estimation [13,14].
While direct fidelity estimation gives an unconditional and assumption-free estimate of the average gate fidelity, it is prone to SPAM errors, which leads to conflation of noise sources. Thus, a key advantage of randomized benchmarking is that it is not sensitive to SPAM errors. Unfortunately, however, current proposals for randomized benchmarking assume that the noise is time-independent, although time-dependence can be partially characterized by a deviation from the expected fidelity decay curve [9,10]. Furthermore, experimental implementations of randomized benchmarking typically use on the order of 100 random sequences of Clifford gates, which is three orders of magnitude smaller than the number of sequences suggested by the rigorous bounds in [10] to obtain an accuracy comparable to the claimed experimental accuracies [12,15]. Numerical investigations of a variety of noise models have shown that between 10-100 random sequences for each length are sufficient to provide a tight estimate of the average gate fidelity [16]. Ideally, one would like to combine the advantages of both randomized benchmarking and direct fidelity estimation to achieve a method that is insensitive to SPAM, requires few measurements, is nearly assumption-free (i.e., does not assume a specific noise model), and comes with rigorous guarantees on the errors involved.
In this paper, we provide a new analysis of randomized benchmarking which brings it closer in line with this ideal. We first show that the standard protocol can be modified to provide a means of estimating the time-dependent average gate fidelity (which characterizes the average error rate), provided that the gate-dependent fluctuations at each time step are sufficiently small. Under the assumption that the noise is Markovian (that is, that the noise can be written as a sequence of noisy channels acting on the system of interest), all the time-dependent parameters that are estimated by our procedure are upper-bounded by 1, so if some of the parameters are observed to be greater than 1, the experimental noise must be non-Markovian.
We then provide a rigorous justification for taking a small number of random sequences at each length that is on the same order as used in practice by obtaining bounds on the variance due to sampling gate sequences. Our work complements the approach of [16], where it was shown that the width of the confidence interval for the parameters extracted from randomized benchmarking is on the order of the square root of the variance. Our work therefore proves that this confidence interval is generally very narrow, that is, the parameters extracted from randomized benchmarking are determined with high precision.
Numerically, we observe that our bounds (at least for qubits) are saturated and so cannot be improved without further assumptions on the noise (e.g., that the noise is diagonal in the Pauli basis). Therefore any experiments using fewer random sequences than justified by our analysis (unless there is solid evidence that the noise has a specific structure) will potentially underestimate the error due to sampling random sequences.
As a particular example, our results provide a rigorous proof that for single-qubit noise with an average error rate of 10 −4 , the error for randomized benchmarking with 100 random sequences of 100 random gates will be less than 0.9% with 99% confidence. If we use the parameters estimated in the experiment of [15], with 100 random sequences of length 987 at an average error rate of × − 2 10 5 , we find the error is less than 0.8% with 99% confidence. We emphasize that our results are solely in terms of the number of random gate sequences, and a given sequence must still be repeated many times to gather statistics about expectation values of an observable. This is of course an unavoidable consequence of quantum mechanics. However, these statistical fluctuations in the estimates of expectation values can be analyzed separately with standard statistical tools for binomial distributions or with the recent Bayesian methods introduced in [17] and combined seamlessly with our results.
In order to give a rigorous statement of results, we will first review the randomized benchmarking protocol.

The randomized benchmarking protocol
The goal of randomized benchmarking is to efficiently but partially characterize the average noise in an experimental implementation of a group of operations acting on a d-dimensional quantum system. In order to characterize the average noise in an implementation of  using randomized benchmarking, we require  to be a unitary two-design (e.g., the Clifford group on n qubits for = d 2 n ), meaning that sampling over  reproduces the second moments of the Haar measure [18,19]. To accomplish this, the following protocol is implemented.
• Choose a random sequence  = … ∈ s s s m G m 1 of m integers chosen uniformly at random from • Prepare a d-dimensional system in some state ρ (usually taken to be the pure state | 〉 0 ). . Alternatively, to perform interleaved randomized benchmarking for the gate for ≠ t 0 and, as before, (In general, each gate must be compiled into a sequence of elementary gates as well.) }for some E (usually taken to be | 〉〈 | 0 0 ) and repeat with the sequence s sufficiently many times to obtain an estimate of the probability to a suitable precision.
We can regard the probability F m s , as a realization of a random variable F m . We will denote the variance of the distribution for a fixed m by σ m 2 . Averaging F m s , over a number of random sequences will give an estimate F m of F m , the average of F m s , over all sequences s of fixed length m (that is, F m is the expectation of the random variable F m ). The accuracy of this estimate will be a function of the number of random sequences and σ m 2 .
Obtaining estimates F m for multiple m and fitting to the model give an estimate of f provided that the noise does not depend too strongly on the target gate [10], where [20] avg is the average gate fidelity of a noise channel  with respect to the identity channel and ψ d is the uniform Haar measure over all pure states. The average gate fidelity of  gives the average probability that preparing a state ψ, applying  and then measuring }will give the outcome ψ, averaged over all pure states ψ.
For standard randomized benchmarking,  is the error channel per operation, averaged over all operations in . For interleaved benchmarking,  is the error channel on a composite channel, namely, the interleaved channel composed with an element of , averaged over all . We note in passing that separating the error in the interleaved channel from the error in the composite channel is one of the key difficulties in obtaining meaningful results from interleaved benchmarking [21], though we do not address this issue here.

Statement of results and paper outline
The first principal contribution of this paper is to show that the number of random sequences that need to be averaged is comparable to the number actually used in contemporary experiments (compared to previous best estimates, which require three orders of magnitude more random sequences than currently used). The second principal contribution is to show that randomized benchmarking can be used to characterize time-dependent fluctuations in the noise strength.
In more detail, and in order of appearance, we show the following.
• We use the results derived later in the paper to obtain explicit confidence intervals for the estimates F m when is the average gate infidelity (section 4.1).
• Again, using results derived later, we show that a more thorough analysis of randomized benchmarking data can be used to characterize time-dependent Markovian noise, and consequently as a sufficient condition for the presence of non-Markovian noise in a system (section 4.2). • We review representation theory and the Liouville representation of quantum channels and prove some elementary results (section 5). We give an explicit proof of bounds on the diamond norm (which characterizes the worst-case error rate) in terms of the average gate fidelity (which characterizes the average error rate). These give slight improvements over previously stated (but unproven) bounds (section 5.4).
• We derive an expression for the mean of the randomized benchmarking distribution with time-dependent noise (section 6.1). where δ quantifies the deviation from preparations and measurements in a Pauli eigenstate. We use this improved bound to derive confidence intervals that rigorously justify the use of a small number of random sequences for qubits in the regime ≪ mr 1.
• For the special case of single-qubit noise that is diagonal in the Pauli basis, we further improve the upper bound to which is independent of preparations and measurements. • We show that the variance for unital (but nonunitary) channels decays exponentially to zero asymptotically, while the variance for nonunital noise converges exponentially to a positive constant proportional to the degree of nonunitality (as suitably quantified). • We prove that our results are robust under gate-dependent noise, which is one of the key assumptions under which randomized benchmarking produces a meaningful result. Furthermore, since our results apply to interleaved randomized benchmarking, gate dependence can be experimentally tested and used to bound the contribution from gatedependent terms.

Analyzing data from randomized benchmarking with finite sampling
In this section, we summarize the implications of our results for analyzing the data obtained from randomized benchmarking experiments. In particular, we derive confidence intervals for the estimates F m of F m and show how randomized benchmarking can be used to characterize time-dependent noise.

Confidence interval for randomized benchmarking
For a fixed sequence length m, randomized benchmarking provides an estimate F m of F m , which is exact in the limit when all random sequences are sampled. We will only consider the variance σ m 2 due to sampling a finite number K m of random sequences of length m, and we ignore the random fluctuations resulting from the use of a finite number of measurements to estimate a probability.
In [10], the variance-independent form of Hoeffdingʼs inequality was used to estimate the number of sequences K m required to obtain a given level of accuracy. The estimate in [10] erroneously restricted the range of the random variable in Hoeffdingʼs inequality. That is, they assumed that all the probabilities F m s , lay in a strict subset of [0, 1]. This assumption, while valid for depolarizing noise, is not valid in general. A simple counterexample is where the noise is a single-qubit preparation channel into the | 〉〈 | 0 0 state and ρ = = | 〉〈 | E 0 0 . Then any sequence of m gates ending in an identity gate or a z-axis rotation has = F 1 m s , , while any sequence ending in an X gate gives . Correcting for this (which does not change any of the conclusions of [10]), the variance-independent form of Hoeffdingʼs inequality requires 10 5 samples to ensure that the estimate F m is within × − 5 10 3 of the true mean F m with 99% probability. However, many experimental implementations of randomized benchmarking only use 30-100 sequences for each value of m [12,15,22].
One of the principal contributions of this paper is to provide a theoretical justification for choosing a relatively small number of sequences by showing that the variance is small for the short sequences that are of practical relevance. For the special case of qubits, we show that even for small m (e.g. ≈ m 100) the variance is at most × − 4 10 4 for currently achievable gate infidelities ≈ − r 10 4 , which is comparable to the numerical estimates presented in figure 1. Utilizing this very small variance gives substantial improvements over the previous rigorous 2 as a function of the sequence length m for randomized benchmarking with 100 randomly generated, timeindependent noisy qubit channels. The noise was sampled from the set of extremal qubit channels, characterized in [24], with average gate infidelity ≲ × − r 2.69 10 4 . Each channel was evolved for increasing sequence lengths to track the behavior of the variance as a function of m, which is why the data points track parabolic curves (furthermore, the spread in the parabolic curves is generated by the spread in the infidelity of the samples). The green curve is the upper bound σ = + m r m mr 2 2 2 7 4 2 , where we neglect the corrections to the bound at order O m r ( ) 2 3 and corrections due to measurement imprecision. Note that our bound is almost optimal. Our analytic results show that the variance σ m 2 will increase with m, at least until some threshold sequence length where the exponential decay for generic channels proven in theorem 17 begins to dominate.
bounds obtained in [10,21]. However, our bound on the variance (which is numerically almost optimal for qubits) implies that K m should scale quadratically with m to make the variance is independent of m.
Our upper bound σ ⩽ + + m r mr O m r ( ) m 2 2 2 7 4 2 2 3 (for qubits, neglecting the negligible δr terms) can be used together with a stronger version of Hoeffdingʼs inequality [23] to obtain a rigorous confidence interval comparable to the standard errors of the mean reported in current experiments [15]. The stronger version of Hoeffdingʼs inequality implies that where K is the number of randomly sampled sequences of length m and m 2 random sequences is sufficient to obtain an absolute precision of ϵ with probability δ − 1 . Since r is determined by the fitting procedure, which in turn depends on the uncertainties, this procedure would be applied recursively with an initial upper bound on r. Similarly, for qudits, a straightforward generalization of the above argument can be used, but with 2 (ignoring the higher-order terms). Then our bound shows that K = 145 random sequences suffices. This is an improvement by orders of magnitude over the previous best rigorously justifiable upper bound of 10 5 using the varianceindependent Hoeffding inequality [10].
Importantly, however, we note that the quadratic scaling with m in the regime ≪ mr 1 seems to be necessary (see figure 1). Even in the optimal case of noise that is diagonal in the Pauli basis, K m would still need to scale linearly with m to make the variance independent of m (where having the variance depend on m would generally cause less weight to be assigned to larger m when fitting). The linear scaling can be understood intuitively as following from the fact that there are m places for an error in a sequence of length m, and the errors could add up in the worst case. Therefore, a corollary of our result is that longer sequence lengths should be averaged over more random sequences in this regime.
Furthermore, we prove in section 7 that there are noise sources such that the variance due to sampling random sequences is constant (or decays on an arbitrarily long timescale). If such noise sources (including nonunital noise and any unitary noise, such as over-and underrotations) are believed to be present, substantially more random sequences need to be sampled. As such, randomized benchmarking is most reliable in the regime ≪ mr 1, although, since the next lowest order terms in our bound are δm r 2 2 and m r 2 3 , the lowest order bounds on the variance should be approximately valid for ≈ mr 0.1.

Characterizing time-dependent noise
The original presentation of randomized benchmarking assumed that the noise was approximately time independent (i.e., independent of the time step at which the gate is applied), with any Markovian time-dependence being partially characterized by deviations from the time-independent case [10]. However, in many practical applications there may be a nonnegligible time dependence, which it would be desirable to characterize more fully.
We show that randomized benchmarking can also be used to characterize time-dependent noise, provided the gate-dependence is negligible (in the sense established in theorem 18) and that the time-dependent noise is identically distributed between different experiments. However, the number of random sequences of length m will typically need to be increased relative to the number required for time-independent noise. In particular, we will show in theorem 8 that where A and B are constants that depend only upon the preparation and measurement procedures (and so account for SPAM) and the average gate fidelity at time t is , where d is the dimensionality of the system being benchmarked. In the case of time-independent noise, f t is a constant and equation (11) reduces to the standard equation for the fidelity decay curve.
By performing randomized benchmarking for a set of sequence lengths m 1 and m 2 , we can estimate F m j with associated uncertainties δ j . Combining these estimates with a procedure for obtaining an estimate Â of A with associated uncertainty δ A [25], we can estimate the ratio Therefore we can estimate the average gate infidelity r over the time interval + m m [ 1, ] 1 2 . From our rigorous analysis, we can infer that δ 1 and δ 2 will be small for small m r j , while δ A will be determined only by finite measurement statistics. Furthermore, when and so the above method gives a reliable method of characterizing the time-dependent gate fidelity.
We also note that if there are no temporal correlations in the noise, than all of the parameters r t (where r t is the average gate infidelity at time r) are lower-bounded by zero. Therefore any negative values (or average values) of r t are an indicator of temporal correlations in the noise, that is, of non-Markovian behavior.

Mathematical preliminaries
Randomized benchmarking involves composing random sequences of quantum channels that are sampled in a way which approximates a group average. For this reason, it is natural to consider both the representation theory of groups and the structure of quantum channels, especially the composition of channels. In this section we collect several mathematical results in this vein that we will need to prove our main results. We begin by considering group representation theory, and in particular prove a proposition showing how the tensor product of certain representations couple together. Most of this material is standard and can be found in any textbook on the subject, e.g. [26].

Representation theory and some useful lemmas
, where V is vector space known as the representation space (which we always take to be  d or  d for different values of d) and and ϕ W denotes the restriction of ϕ to the subspace W. We sometimes refer to a space, subspace, or homomorphism as being a rep or subrep, with the complementary ingredients understood from the context.
A rep is called irreducible or an irrep if the only subreps are ∅ and V. Since the reps we consider are unitary reps of compact groups, if W is a subrep of V then the orthogonal complement ⊥ W is a subrep as well. Therefore any rep can be decomposed into a direct sum of irreps, which may occur with some multiplicity. Any basis that decomposes a rep into a direct sum of irreps is called a Schur basis.
The simplest rep is the trivial rep  (1, ), which is also an irrep. The trivial rep is defined for any group  and take any element of  to 1. While the trivial rep deserves its name, it frequently appears as a subrep of tensor powers of other reps and so will appear throughout this paper.
The randomized benchmarking protocol is designed so that the sequence of operators applied to a system correspond to noise channels conjugated by uniformly random elements of a group . Given a rep ϕ V ( , ) of a group , a matrix The uniform average of this action on A is called the  ̵ twirl of A, and is given by Note that, for notational convenience, the map ϕ is left implicit but will always be obvious given the dimensionality of the matrix being twirled. An important property of  A is that it commutes with the action of  for any rep ϕ V ( , ) (reducible or not) since ϕ is a homomorphism and  is a group. That is,  . Expressions for the expected value F m and variance σ m 2 for the randomized benchmarking protocol for a fixed value of m will be obtained using the following propositions. 11 . Proposition 2. If ϕ V ( , ) is an irreducible representation of a finite group  with a real-valued character χ ϕ , then the trivial representation is a subrepresentation of ϕ ⊗ V ( , ) 2 with multiplicity 1.
Proof. As the rep is irreducible, Schurʼs orthogonality relations [26] give and that the character for the trivial representation is Quantum channels can be represented in a variety of equivalent ways, with different representations naturally suited to particular applications.
In this paper, we will primarily use the Liouville representation because it is defined so that quantum channels compose under matrix multiplication. We occasionally also use the Choi representation in order to apply results from the literature, but we will introduce it only as required.

States and measurements.
We begin by introducing the Liouville representation (also called the transfer matrix representation) of quantum states and measurements. States and measurement effects (i.e., elements of a positive-operator valued measure, or POVM) can be viewed as channels from    → : d and    → : d respectively, hence they can be treated on the same footing as any other quantum channel. However, we introduce them separately for pedagogical and notational clarity.
In the standard formulation of quantum mechanics in terms of density operators and POVMs, a quantum state  ρ ∈ d is any Hermitian, positive semi-definite operator such that ρ = Tr 1. In addition, we always have ρ ∈ Tr [0, 1] 2 . We can always choose a basis Tr( ) † . We can expand any density operator relative to such a basis as ρ ρ . Throughout this paper we , and makes all other A j for ≠ j 0 traceless. We can then identify a density operator ρ with a corresponding column vector Here we make the important distinction between the density operator itself, ρ, and the representation of ρ in terms of the column vector ρ | ). Note that ρ | ) is just a generalized version of a Bloch vector (with a different normalization) for ⩾ d 2. The conditions for ρ to correspond to a density operator now translate into geometric conditions on ρ | ). In particular, we will use the fact that ρ ρ is the standard isotropic Euclidean norm.
Measurements in the standard formulation correspond to POVMs, that is, to sets of Hermitian, positive semidefinite operators . As with quantum states, we can expand an element E of a POVM (an effect) as We then identify an effect E with a row vector which must satisfy similar conditions to ρ | ). In this formalism, the probability of observing an effect E given that the quantum state ρ was prepared is Transformations. For simplicity, we will only consider quantum channels that are either states, measurements or completely positive and trace-preserving (CPTP) maps    → : d d . We do not consider channels that reduce the trace or change the dimension because, while conceptually no more difficult, they require cumbersome additional notation and we do not use any such channels.
A quantum channel  maps a density operator ρ to another density operator  ρ ( ). We want to determine the map  between the corresponding vectors ρ | ) and  ρ | ( )). Since quantum channels are linear ) where we abuse notation slightly and define  as the matrix such that . That is, we use  to denote both the abstract operator as well as its representation as a matrix acting on vectors ρ | ). In this representation, the identity channel is represented by  d 2, the composition of two channels is given by matrix multiplication, and furthermore, the conjugate channel of a unitary channel  is given by  † . In particular, these properties imply that the Liouville representation of the unitary channels is a faithful and unitary representation of U d ( ) (though technically, it is a projective representation since a global phase is lost). Given our choice of  (recall that we have fixed A 0 ) and the fact that we consider only trace-preserving channels, we will always write the matrix representation of a quantum channel as A channel  is unital (i.e., the identity is a fixed point of the channel) iff  α = ( ) 0. Therefore we can regard  α ( ) as quantifying the nonunitality of  . [19], which will play a crucial role in our analysis of randomized benchmarking since it allows us to use tools from representation theory such as Schurʼs lemma. Note also that the representation that is a unitary two-design is also irreducible by the same argument (which can be regarded as a defining property of a unitary 2-design [19]). Therefore we can also use tools from representation theory when considering channels twirled over a unitary two-design. This fact allows randomized benchmarking to be performed efficiently because unitary two-designs can be efficiently sampled while the full unitary group cannot [7,18].
The representation φ g ( ) of  will be one of the basic tools we use in this paper. As such, whenever g appears in a matrix multiplication, it will refer to φ g ( ). Randomized benchmarking will allow for the estimation of which corresponds to the average gate fidelity of  with the identity channel. We will sometimes omit the argument of α, φ and f, or indicate the argument via a subscript. However, in all cases the argument will be clear from the context.

Properties of channels in the Liouville representation.
Since the Liouville representation associates a unique matrix to each channel, we can characterize properties of quantum channels by properties of the corresponding matrix. In particular, we will consider the spectral radius , which we will use to obtain bounds on valid quantum channels. Proposition 3. Let  be a completely positive map. Then the adjoint channel  † is also completely positive.
Proof. Any map can be written as j j T where the superscript T denotes the transpose and the K j and L j are Kraus operators for  . We then have . In particular consider   d ( ), which must be a density operator since  d is a density operator and  is a quantum channel. Then gives the desired result.
Now let  be a unital channel. Then  † and thus   † are also channels by corollary 4.
gives the improved bound.

Representing noisy channels
An attempt to physically implement a quantum channel  will generally result in some other channel ′, with the aim being, loosely speaking, to make ′ as close to  as possible. We will now outline how noisy channels can be related to the intended channel in the linear representation.
Consider an attempt to implement a target unitary channel  that results in some (noisy) channel  . Then since  is a real square matrix, it can be written as  = LQ, where L is a lower triangular matrix and Q is an orthogonal matrix. Since  is an orthogonal matrix, we can always write . Similarly, we can always write    Λ = ( ) pre . While the difference between these expressions is trivial for any single channel, it can cause confusion when comparing channels. Since , the notion of 'the' noise in an implementation of  depends on which expression is used. (We will fix a representation below to avoid this ambiguity.) The convergence of randomized benchmarking depends crucially upon the assumption that the noise is approximately independent of the target. However, in a general scenario, at most one of Λ post or Λ pre will be approximately independent of the target, with the specific choice depending upon the physical implementation. As a specific example, consider amplitude damping for a single qubit, which can be written as ⎛ , where ∈ g [0, 1] determines the strength of the damping. Assume that this noise is applied independently from the left (i.e., Λ Δ = post ), independently of the target. Then for X and Z pre pre which is only independent of the target when g = 1 (i.e., when there is no noise).
In this work, we write noise operators as pre-multiplying the target rather than postmultiplying the target as in [9]. The reason for this change is so that the residual noise term that is not averaged is in the first time step rather than the last and so is independent of the sequence length. While this simplifies the analysis, all the results of this paper can be derived for the other form with small modifications.

Measures of noise
The fidelity and trace distance between two quantum states are defined as 1 respectively 3 , where the one-norm (or trace norm) is given by ∥ ∥ = X XX Tr 1 † . These two quantities are related by the Fuchs-van de Graaf inequalities [30] where the right-hand inequality is always saturated when both states are pure. When one of the states is a pure state ψ, the left-hand inequality in equation (26) can be sharpened to Both of these quantities for quantum states can be promoted to distance measures for quantum channels [31]. Two such measures are the average gate fidelity and the diamond distance.
The average gate fidelity between a channel  and a unitary  is defined to be avg † avg which is technically the average gate fidelity between   † and . The diamond distance between two quantum channels  1 and  2 with    → : The norm in the above definition is indeed a valid norm, called the diamond norm, and it extends naturally to any hermiticity-preserving linear map between operators. The factor of 1 2 is to ensure that the diamond distance between two channels is bounded between 0 and 1. The diamond distance is useful for several reasons. Firstly, allowing for larger entangled inputs does not change the value of the diamond distance, hence it is stable. Secondly, it has an operational meaning as determining the optimal success probability for distinguishing two unknown quantum channels  1 and  2 [32]. Equivalently, the diamond distance gives the worst-case error rate between the pair of channels. Although we will not be able to measure the diamond distance directly, we will be able to bound it in terms of measurable quantities obtainable via randomized benchmarking.
To obtain upper and lower bounds on the diamond norm, we will use the following two lemmas to relate the average gate fidelity to the trace norm of the corresponding Choi matrix and then to the diamond norm. Recall that the Choi matrix of a linear map Δ is given by is the maximally entangled state. The first of these lemmas was proven in [20,33]. Lemma 6. ( [20,33]) The average fidelity of a CPTP map  is related to its Choi matrix  J ( ) by Proof. We first prove the lower bound and show that it is saturated. We have To see that the above inequality is saturated, simply let  Δ = d . To prove the upper bound, we write To see that this bound is saturated, let Δ be the projector onto | 〉〈 | 0 0 . □ We note that it would be interesting to see if the previous bounds are still saturated when restricting the input Δ to be a difference of channels.

Time-dependent gate-independent errors in randomized benchmarking
We consider the ideal case in which the noise depends only upon the time step. For such types of noise, we derive expressions for the mean F m and variance σ m 2 of the randomized for fixed m. In particular, we will show that for unital but nonunitary noise, σ m 2 decreases exponentially with m, while for non-unital noise, σ m 2 converges to a constant dependent on the strength of the non-unitality. We will also upper-bound the variance for small m, which enables the derivation of rigorous confidence intervals for the estimate of the average gate infidelity in section 4.1. We will also show that our results are stable under gate-dependent perturbations in the noise in section 8.
In order to present our results in as clear a form as possible, we will only explicitly consider the original proposal for randomized benchmarking. Interleaved benchmarking can also be treated in an almost identical manner, except that the noise is conjugated by the interleaved gate and is redefined to absorb the noise term for the interleaved gate.
Denoting the noise at time step t by Λ t , the sequence of operations applied to the system in the randomized benchmarking experiment with sequence for all ∈ t m (0, ). Uniformly sampling the g t is equivalent to uniformly sampling the h t since  is a group; the exception is h 0 and g 0 , which are chosen so that the product of all the gates is the identity (see section 2). We can then rewrite equation (36) as where we incorporate the first noise term into the preparation by setting ρ Λ ρ ← 0 . This redefinition of ρ is independent of the sequence length because we write the noise as pre-rather than post-multiplying the target. (Note that if the noise post-multiplied the target, then incorporating the final noise term in E would make E depend on the sequence length m.) The probability of observing the outcome E for the sequence S s is as the realizations of a random variable with mean F m and variance σ m 2 . Randomized benchmarking then corresponds to randomly sampling from the distribution F { } m s , (which we henceforth refer to as the randomized benchmarking distribution) to approximate the mean F m .

Mean of the benchmarking distribution
We now derive an expression for F m for general CPTP maps with time-dependent noise. A similar expression was derived for time-independent noise in [9]. We will then show how F m can be used to approximate quantities of experimental interest, namely, the SPAM error, average time-dependent gate fidelity and the worst-case error due to the noise.
The parameters ρ E 0 0 and ρ ⃗ ⃗ E directly characterize the quality of the state and measurement procedure (with the caveat that ρ has been redefined to include a noise term), since . This can be viewed as an instance of gate set tomography using a limited number of combinations of gates [36].
The parameters f t that give the mean of a randomized benchmarking distribution are closely related to an operational characterization of the amount of noise, namely, the average gate infidelity [10,12] (which gives the average error rate), as The randomized benchmarking protocol will enable the estimation of ∏ f t t , which can then be used to estimate the average gate infidelity averaged over arbitrary time intervals (as shown in section 4.2).
We now show that the average gate infidelity provides an upper and a lower bound on  Λ − ⋄ 1 2 , which gives the worst-case error introduced by using Λ instead of . An upper bound of the same form was stated without proof in [22], however, the bound here is a factor of two better. The following relation between the diamond distance and the average gate fidelity can also be applied at each time step to relate the time-averaged average gate fidelity to the average diamond distance from the identity channel. Note that the following bound is very loose in the regime ≪ mr 1 (since in that regime, ≪ r r ), which is also the regime in which we will typically use it.
be the average error rate for Λ. Then 1 2 Proof. Applying lemma 7 to Recalling that Φ, the maximally entangled state, is a pure state and using equation (26) and (27) gives 1 . Substituting this into the above expression and combining the inequalities completes the proof. □

Upper bounds on the variance
We now consider the variance σ m for fixed m. It has been observed that the standard error of the mean (and hence the sample variance) can be remarkably small in experimental applications of randomized benchmarking using relatively few random sequences [12,22]. In this section, we will prove that the variance due to sampling random sequences is indeed small in scenarios of practical interest (i.e., ≪ mr 1) by obtaining an upper bound on σ m 2 in terms of mr. For the special case of a qubit, we will also obtain a significantly improved upper bound in terms of m and r.
We begin by obtaining a general bound on σ m 2 that depends only on m, r and the dimension d of the system being benchmarked. In order to present results in a simple form, we assume that the noise is time-and gate-independent, however, the results in this section can readily be generalized to time-dependent noise.
As a first attempt at obtaining a good bound on the variance, we use only the fact that when The value of all realizations of F m (i.e., the probabilities F m s , ) are all in the unit interval. Since the distribution with the largest variance that has mean F m and takes values in the unit interval is the binomial distribution with that mean, we then have While simple to obtain, this bound has a constant off-set term that depends upon the SPAM which seems to be unavoidable. This term would be zero in the absence of SPAM, and could even be eliminated if the probabilities F m s , could be restricted to the interval ]. However, as illustrated in section 4.1, this cannot be done in general. Moreover, we expect that the above argument substantially overestimates the variance because it ignores the possibility that many sequences may have F m s , closer to F m . We now obtain an alternative bound that has a larger coefficient for r, but no constant term. We note from the outset that the following bound is not tight in general (and the previous bound suggests that the dimensional factor is an artifact of the proof technique), though by improving one of the steps we will be able to obtain a tight bound for qubits. To facilitate our analysis, we use the identity Proof. We write and so all the first-order terms and the second-order terms where the Δ act at different times will cancel. Therefore the only secondorder terms are the m terms with Δ ⊗2 and so the variance is . Therefore the only contributions of O r ( ) 2 or greater are from k = 3 and k = 4.
For k = 3, the only terms that will not cancel are products whose only nontrivial terms are a  2 2 4 . □ While the bound in theorem 10 is promising, it is not sufficiently small to justify the sequence lengths chosen in many experimental implementations of randomized benchmarking for a single qubit, since ≈ − mr 10 2 in many such experiments and so the contribution to standard error of the mean due to sampling random gate sequences is expected to be on the order of − K 0.1 1 2 , where K is the number of random sequences of length m that are sampled. One of the loosest approximations in theorem 10 is the use of the triangle inequality to upper-bound the contribution from terms of the form Avoiding this is difficult in general, however, for the case of a single qubit, we can significantly improve the following bound by understanding the irrep structure of the representation ⊗ g g. This irrep structure will depend on the choice of two-design, so we now fix the two-design to be the single qubit Clifford group,  2 and work in the Pauli basis   = X Y Z { , , , } 2 (where the factor of 2 makes the basis trace-orthonormal). In particular, we will work in the block basis ⎛ where σ⃗ = X Y Z { , , } 2. Restricting the Liouville representation to each of the blocks in the above basis will give a rep of  2 , where the first three reps have already been characterized. We now characterize the final subrep, .
Proof. The proof follows from a direct application of Schurʼs orthogonality relations, which imply where λ n is the multiplicity of the irrep λ in the rep ϕ ⊗2 . The character is given by Since the elements of  2 permute Paulis (up to signs), the diagonal elements of  in the Pauli basis are either 1 or −1 and there are 0, 1 or 3 diagonal elements that can contribute to g Tr . There are eight elements of  2 with no diagonal elements, namely, the eight permutations → ± → ± X Y Z and → ± → ± X Z Z. There is one element with all diagonal elements equal, namely, the identity (note that  − is antiunitary so is not in the Clifford group). All other 15 elements of the Clifford group have χ = ± ϕ g ( ) 1 since the diagonal elements cannot sum to any values in ± ± {0, 2, 3}. Given that the multiplicity of an irrep must be a nonnegative integer, there are two possibilities. Either there are four inequivalent irreps or the rep ϕ ⊗2 contains two equivalent irreps. By proposition 2, ϕ ⊗2 contains a trivial irrep with multiplicity 1 and so cannot contain two equivalent irreps, therefore there are 4 inequivalent irreps.
The following bases of operators S T 1 2 span the four irreps. □ The fact that ⊗ g g is a direct sum of four inequivalent irreps will allow us to use Schurʼs lemma on the unital block of Δ. To account for the nonunital component, we use the following bound. Proof. Any trace-preserving qubit channel as   3 3 and, since α 2 is invariant under the unitary transformations in equation (58), α = t 2 . Therefore the only remaining problem is to bound δ 3 (note that at this point, we could accept the trivial bound δ ⩽ 6 3 ).
, then complete positivity implies

. □
Combining the irrep structure of ⊗ g g and the bound on the nonunital component allows us to improve the bound in theorem 10 for the special case of one qubit. As discussed in section 4, the following bound provides a rigorous justification of current experiments and allows values of K m to be chosen that are substantially smaller then previously justified rigorously, that is, ≈ K 145 m as opposed to ≈ × K 7 10 m 4 as estimated in [10]. Proof. To prove the theorem, we will derive an exact expression for the variance and then approximate it in the relevant regimes.
We begin by noting that in the basis and  α φ ∑ ⊗ ∈ g g g ( ) 2 for any ⃗ E by, for example, considering a basis for the space of φʼs. We note in passing that this property is not a general property of two-designs, in that it does not hold for the single-qutrit Clifford group. By propositions 11 and 1 where the P R are the projectors onto the irreps from proposition 11 and λ φ = ⊗ P P Tr Tr . From equation (47), together with the orthogonality of the projectors P R , we have We can bound the first term using Similarly, the eigenvalues can be calculated to be  2 where we have omitted λ S since it will not contribute to the variance since any symmetric vector (such as ⃗ ⊗ E 2 ) will be orthogonal to P S . We now consider general noise with ⃗ E and ρ⃗ as in equation (65), where, without loss of generality, we set ⃗ = ⃗ w z. We begin by considering the case δ = 0, for which . Then a simple calculation using proposition 12 gives We now consider the correction when δ > 0 in equation (65), which will realistically always be the case since ρ incorporates a residual noise term. Then we define functions where the final inequality follows since δ δ δ = ⃗ ⃗ ⩽ ab | · |, 1 2 It is worth noting that one could in principle fill in the implicit constants given in the big-O notation by following the previous argument with sufficient care. To have a truly rigorous confidence region, one would need to take this into account, but for current parameter regimes of interest, the terms really are negligible, so it hardly seems worth optimizing this concern.
We also note that δ ρ will typically have entries of order r even without SPAM, since the off-diagonal terms for generic noise are of order r and there is a residual noise term that has been incorporated into ρ. However, the corresponding entries in δ E will generally be smaller (or at least, are determined only by SPAM).
We now show that the variance can be even further improved (by a factor of m and with no dependence on the state and measurement) for noise that is diagonal in the Pauli basis. by equation (70). □ One consequence of the above corollary is that the variance of the randomized benchmarking distribution will typically depend strongly upon the choice of two-design even for gate independent noise. This observation follows from the above theorem by noting that the unital block can be perturbed by an arbitrarily small amount to allow it to be unitarily diagonalized. Performing randomized benchmarking in the basis where the unital block is diagonalized (i.e., setting   = U 2 ) will give variances of order mr 2 , while randomized benchmarking in other bases will give variances of order m r 2 2 .

Asymptotic variance of randomized benchmarking
We now consider the variance σ m 2 of the distribution F { } m s , as → ∞ m . While not directly relevant to current experiments, the asymptotic behavior is nevertheless interesting in that it may provide a method of estimating the amount of nonunitality.
We will prove that, for the class of channels defined below called n-contractive channels (which are generic in the space of CPTP channels), σ m 2 decays exponentially in m to a constant that quantifies the amount of nonunitality. Unfortunately, we will not be able to provide a bound on the decay rate. In fact, no such bound is possible without further assumptions since the channel ] for any unitary U and two-contractive channel  will have an eigenvalue ( ) 1corresponding to the trivial subrep (this can be seen by following the proof of proposition 16). This eigenvalue will result in a variance that decays as has at most one eigenvalue of modulus 1.
We now prove that all unital but nonunitary channels are two-contractive with respect to any finite two-design. We conjecture that all nonunitary channels are in fact two-contractive with respect to any unitary two-design. An equivalent statement for trace-preserving channels Λ is that is strongly irreducible whenever Λ is not unitary [38]. As a corollary of the following proposition, this conjecture holds for qubits, since, for qubits, the projection onto the unital part of a CPTP map is also a CPTP map [21]. However, proving it for higher dimensions remains an open problem.
Proposition 16. Let Λ be a completely positive, trace-preserving and unital channel and  a unitary two-design. Then Λ is two-contractive with respect to  if and only if it is nonunitary.
contains the trivial rep as a subrep with multiplicity 1 by proposition 2. Therefore for any U ∈ U d ( ) and in a fixed Schur basis (i.e., independent of U), for some homomorphism  . Therefore any vector v in the (one-dimensional) trivial representation is a+1eigenvector of φ ⊗ U ( ) 2 for any U and consequently is a +1-eigenvector of . We now show that for all completely positive, trace-preserving and unital Λ Recall that one of the equivalent definitions of the spectral norm is Expanding the averages over  gives   In [10] it was shown that the mean of the benchmarking distribution is robust under gatedependent perturbations. We now show that the variance is also stable under gate-dependent perturbations. Let us write σ m,0 2 for the gate-averaged variance, and δ σ ( ) m 2 as the correction due gate-dependent perturbations. Then we have the following theorem. where, to get from the second to the third line, we note that for a fixed order a there are at most + a 1 quantum channels (i.e., products of Λ t ʼs and the elements of , which are channels) interleaved by the a different Δ t g , and we have also used the submultiplicativity of the spectral norm and proposition 5.
We can then substitute the above sequence-independent upper bound into equation (101) □ We remark that this result can surely be improved, though we have not attempted to do so. In particular, there should certainly be a factor of at least r, the average infidelity, bounding the change in the variance.

Conclusion
We have proven that the randomized benchmarking protocol can be applied to experimental scenarios in which the noise is time dependent in an efficient and reliable manner. Moreover, the ability to estimate time-dependent average gate fidelities using randomized benchmarking provides an indicator for non-Markovianity over long timescales.
In particular, we have proven that the variance is small for short sequences and asymptotically decays exponentially to a (small) constant, that, in the case of unital noise, is zero. The fact that the variance is remarkably small (e.g., on the order of × − 4 10 4 for currently achievable noise levels) enables experimental realizations of randomized benchmarking to be accurate even when using a small number of random sequences (e.g., 145 sequences compared to the 10 5 proposed in [10]).
Our results show rigorously that randomized benchmarking with arbitrary Markovian noise is generically almost as accurate as has previously been estimated in experiments [11,12,15] and numerics [16]. However, we find that K m should scale with m so that the