Demonstration of quantum volume 64 on a superconducting quantum computing system

We improve the quality of quantum circuits on superconducting quantum computing systems, as measured by the quantum volume, with a combination of dynamical decoupling, compiler optimizations, shorter two-qubit gates, and excited state promoted readout. This result shows that the path to larger quantum volume systems requires simultaneous improvements in coherence, control gate fidelities, and measurement fidelities, together with smarter software that takes hardware details into account, thereby demonstrating the need to continue to co-design the software and hardware stack for the foreseeable future.


I. INTRODUCTION
Quantum computing is a new kind of computing, using the same physical rules that atoms follow in order to manipulate information. At this fundamental level, quantum computers execute quantum circuits (like a classical computer's logical circuits), but now using the physical phenomena of superposition, entanglement, and interference to implement mathematical calculations that are out of reach for even our most advanced supercomputers.
As we progress towards machines capable of implementing circuits with a quantum advantage, meaning certain information processing tasks can be performed more efficiently or cost effectively than with classical circuits, quantum volume (QV) [1] serves as a holistic benchmark for quantum systems indicating the size of the quantum circuits that can be run on them. Sensitive to improvements in many aspects of device performance, quantum volume includes gate errors, measurement errors, the quality of the circuit compiler, and spectator errors. In Ref. [1] and later in Ref. [2], QV16 was measured on ibmq_johannesburg and a Honeywell quantum system, respectively. In Ref. [3], QV8 was measured for the Rigetti Aspen-4 quantum system. We recently increased ibmq_johannesburg to QV32 [4] by improving our physical understanding of the two-qubit cross-resonance gate and using rotary echo pulses to reduce gate and spectator errors. Finally, in unpublished work, Honeywell has claimed to measure QV64 [5].
Here we demonstrate an increase in the quantum volume of an IBM quantum system by improving the Qiskit compiler [6], implementing excited state promoted (ESP) readout, using shorter two-qubit gates, and adding dynamical decoupling to the idle qubits. These last three demonstrate the need for timing and pulse control in cloud quantum systems [7]. While none of these improvements alone is enough to allow ibmq_montreal to reach QV64, when combined we achieve QV64 with a heavy output probability (HOP) of 0.701 ± 0.031 (> 2/3 ± 2σ) with a confidence interval of 98.744% (z = 2.25), see Fig. 2(a). The quantum volume test requires exceeding 2/3 HOP by a 97.725% (z = 2) confidence interval.
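The pass criterion quoted above can be illustrated numerically. The following is a minimal sketch, assuming a one-sided Gaussian confidence level and assuming that the quoted ± 0.031 corresponds to 2σ; the function names are hypothetical and not part of any library.

```python
import math


def one_sided_confidence(z):
    """One-sided Gaussian confidence level for a z-score (standard normal CDF)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def hop_z_score(hop_mean, sigma, threshold=2.0 / 3.0):
    """Number of standard deviations by which the mean HOP exceeds the threshold."""
    return (hop_mean - threshold) / sigma


def passes_qv(hop_mean, sigma, z_required=2.0):
    """QV test: the mean HOP must exceed 2/3 by more than z_required sigma."""
    return hop_z_score(hop_mean, sigma) > z_required


# z = 2 corresponds to the 97.725% level required by the test.
print(f"{one_sided_confidence(2.0):.5f}")  # 0.97725

# Assumption: the quoted +/- 0.031 is a 2-sigma interval, so sigma = 0.0155.
print(passes_qv(0.701, 0.031 / 2.0))
```

Because the published values are rounded, the z-score recomputed this way need not match the quoted z exactly.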
In Section II we give an overview of the ibmq_montreal device, which is a 27-qubit IBM Quantum Falcon processor; in Section III we discuss the improvements to the compiler; in Section IV we discuss the dynamical decoupling protocol; in Section V we discuss the faster implementation of the direct CNOT gate, which extends the improved pulse control of [4]; in Section VI we discuss the improvement in measurement fidelity by using a control pulse to promote the excited state to a higher level before measurement [8]. Finally, in Section VII we conclude the paper.

II. QUANTUM SYSTEM: ibmq_montreal
The device studied in this work is from the recent series of IBM Quantum Falcon processors, which consist of 27 qubits arranged in a lattice designed for a distance-3 hybrid Bacon-Shor-surface code [9]. A photo of this processor is shown in Fig. 1(a), and a schematic of its connectivity is shown in Fig. 1(b), where dashed lines indicate collections of qubits that are multiplexed together for readout (labeled R1 to R6). A high-connectivity layout, such as 'all-to-all', is preferable for random quantum circuits (such as QV circuits) in order to minimize the average qubit-qubit distance; however, additional edges in the connectivity increase the chance of frequency collisions, cross-talk, and spectator errors. The IBM Quantum Falcon processor is a compromise, preserving a connectivity efficient for a logical qubit while simultaneously reducing detrimental effects of collisions and cross-talk without excessive insertion of SWAP gates to emulate 'all-to-all' connectivity. In these and related systems, by using the techniques described in [4], we have measured a QV of 32 on the last 7 deployed systems [10], demonstrating the reliability of this architecture.
The qubits are fixed-frequency transmons with frequencies ≈ 5 GHz. Single-qubit gates are driven resonantly with a microwave pulse of duration τ_sq = 21.33 ns. A DRAG pulse envelope [12] corrects σ_z errors and signal dispersion due to wiring. Two-qubit gates are based on a cross-resonance scheme [13], [14], [15] with a target rotary pulse [4] and an additional offset pulse shape on the target for implementing a direct (echoless) CNOT as described later. Two-qubit gate lengths are τ_tq = 199-309 ns.

III. COMPILER
Circuit compilation is a substantial part of quantum computation. Here we report improvements in the state-of-the-art Qiskit compiler that reduce the number of gates and thereby yield circuits with shorter depths. The compilation of a quantum volume circuit for a superconducting processor can be roughly broken down into two stages. The first stage is to map the circuit to the hardware's qubit connectivity constraints. At the conclusion of this stage, each circuit will consist of a series of SU(4) gates on the available links, as well as the overhead of routing qubit information on the physical fabric, usually in the form of SWAPs. The second stage consists of local expansions to the native gates of the hardware and optimizations. We introduce new compiler passes to improve both stages, and leverage existing passes in the Qiskit compiler throughout to achieve further reductions where possible: approximate synthesis, commutative cancellation, and peephole optimization of single-qubit and two-qubit chains of gates.
It is worth noting that the particular passes reported here have general utility beyond QV. Qubit mapping and routing is ubiquitous in compiling for limited-connectivity architectures, and SU(4) synthesis has broad use in peephole optimization of sequential two-qubit gates.
a) Qubit layout and routing via Binary Integer Programming: We formulate qubit layout and routing as a binary integer programming (BIP) problem, which we are able to solve to optimality. We choose as the cost function, C, the effective fidelity, modeled as the product of the fidelity of all the implemented gates:
C = K^{d} \prod_{j \in G} F^{best}_{j} \prod_{j \in \bar{G}} \bar{F}^{best}_{j} \prod_{j \in S} F_{b}^{3},
where K is a factor penalizing circuits with high depth d; G [\bar{G}] is the set of gates that are mapped directly [mapped with mirroring, i.e. combining a SWAP with a gate]; and S is the set of added SWAP gates. Here, F_b is the gate fidelity of the available entangling gate (which must be applied 3 times to implement a SWAP), F^{best}_j [\bar{F}^{best}_j] is the modeled fidelity of the best approximation to the target unitary making i = 0, ..., 3 uses of the entangling gate, and F^{avg}_{i,j} is the average gate fidelity due to approximating the j-th gate with i uses of the entangling gate [1, Appendix B].
The freedom to implement either a gate or its mirror allows elimination of many explicit SWAP gates, and by restricting the number of candidate SWAP insertion sites we are able to reduce the size of the BIP problem such that it can be solved to optimality in around one second per circuit, using optimization software such as CPLEX [16]. Figure 3 shows the performance of this BIP pass in comparison to the state-of-the-art SABRE algorithm [17] available in Qiskit, showing substantial improvement in both the mean and maximum number of uses of the entangling gate.
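The effective-fidelity cost described above can be sketched in a few lines. This is an illustration only, under stated assumptions: the depth penalty is taken to enter as K**d, the best per-gate fidelity is taken as the maximum over i of F_b**i times F_avg[i], and the F_avg numbers in the usage example are made up. All function names are hypothetical; the BIP formulation itself is not reproduced.

```python
def best_gate_fidelity(f_avg, f_b):
    """Best modeled fidelity of one SU(4) gate.

    f_avg[i] is the average fidelity of approximating the gate with i uses
    of the entangling gate (i = 0..3); each use costs a factor f_b.
    """
    return max(f_b ** i * f for i, f in enumerate(f_avg))


def effective_fidelity(depth, direct, mirrored, n_swaps, f_b, k):
    """Product-of-fidelities cost for one candidate layout and routing.

    direct / mirrored: lists of per-gate f_avg tables for gates mapped
    directly or mapped with mirroring. Assumes the depth penalty is k**depth.
    """
    cost = k ** depth
    for f_avg in direct:
        cost *= best_gate_fidelity(f_avg, f_b)
    for f_avg in mirrored:
        cost *= best_gate_fidelity(f_avg, f_b)
    cost *= (f_b ** 3) ** n_swaps  # each explicit SWAP uses the entangling gate 3 times
    return cost


# Illustrative (made-up) table: two entangling gates already approximate well.
example_gate = [0.4, 0.7, 0.995, 1.0]
print(best_gate_fidelity(example_gate, f_b=0.99))
```

In this made-up example the two-gate approximation wins over the exact three-gate expansion, because the extra entangling gate costs more fidelity than the approximation error it removes.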
b) Pulse-efficient SU(4) decomposition: The ibmq_montreal device has the following native gate set for achieving universal quantum computation: Ctrl-X (CX), Sqrt-X (SX) and Phase(θ). The CX gate itself can be implemented directly or be created using an Echo Cross-Resonance (ECR) pulse [18] (cf. Section V). The Phase gate can be achieved with zero time and error [19]. We refer to any gate that is one pulse (i.e. equivalent to an SX up to a pre-/post-phase) as a single-qubit (SQ) gate (e.g. Hadamard). A generic single-qubit operation (U) can be achieved with at most 2 SQ pulses.
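As a small check of the statement that a Hadamard is a single SX pulse up to pre- and post-phases, the following NumPy sketch verifies the identity numerically. The convention Phase(θ) = diag(1, e^{iθ}) and the helper names are assumptions made for illustration.

```python
import numpy as np

# Sqrt-X gate and Hadamard in the computational basis.
SX = 0.5 * np.array([[1 + 1j, 1 - 1j], [1 - 1j, 1 + 1j]])
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)


def phase(theta):
    """Phase gate, assumed here to be diag(1, exp(i*theta))."""
    return np.diag([1.0, np.exp(1j * theta)])


def equal_up_to_global_phase(a, b, tol=1e-9):
    """For unitaries, |Tr(a^dagger b)| equals the dimension iff a = e^{i phi} b."""
    return abs(abs(np.trace(a.conj().T @ b)) - a.shape[0]) < tol


# One SX pulse sandwiched between two virtual phase gates.
candidate = phase(np.pi / 2) @ SX @ phase(np.pi / 2)
print(equal_up_to_global_phase(candidate, H))  # True
```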
Given the {CX, SQ} or {ECR, SQ} set of native pulses, we aim to minimize the number of pulses used during the expansion of each SU(4) and SWAP. A second goal is to expand them in a way that creates further opportunities for optimization. It is known that any SU(4) can be implemented using at most 3 CX gates [20], and 2 CX gates suffice for many useful approximations (e.g. at 99% fidelity) [1] (cf. Figure 4(a)). ECR is locally equivalent to CX, so it has the same requirements. While the question of "optimal" SU(4) decomposition has been extensively studied, the optimality criterion has usually been the number of 2-qubit gates [20], [21]. To extract ultimate performance, we are also interested in minimizing the total number of pulses and the duration.
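The three-CX bound is saturated by the SWAP gate, whose textbook expansion uses three alternating CX gates. A minimal NumPy check of that standard identity, with basis ordering |q0 q1⟩ = 00, 01, 10, 11 chosen here for illustration:

```python
import numpy as np

# CX with control on the first qubit (q0) and target on the second (q1).
CX_01 = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0]])

# CX with control on the second qubit (q1) and target on the first (q0).
CX_10 = np.array([[1, 0, 0, 0],
                  [0, 0, 0, 1],
                  [0, 0, 1, 0],
                  [0, 1, 0, 0]])

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]])

# Textbook expansion: three CX gates with alternating direction.
swap_from_cx = CX_01 @ CX_10 @ CX_01
print(np.array_equal(swap_from_cx, SWAP))  # True
```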
Our approach is based on three strategies:
1. Circuit simplification to reduce redundant pulses: starting from a Qiskit synthesis of an arbitrary SU(4), we apply repeated circuit identities to the result to reduce its cost. This gives us a constructive SU(4) decomposition, depicted in Figure 4(b), which is optimal in the number of pulses (by a simple parameter counting argument). This decomposition has another advantage, in that 8 out of 10 single-qubit pulses are placed on the outside of the structure. Given that 2 SQ pulses suffice for any aggregate single-qubit operation, this creates an opportunity for merging with preceding and following layers of SU(4) in the circuit. One surprising consequence of this decomposition is that for the special case of a SWAP operation, the decomposition is locally less efficient than a textbook expansion; however, globally it is more efficient as it creates more opportunities for cancellation (Figure 4(c)). We arrive at similar pulse-efficient decompositions targeting the ECR gate, and also for approximated SU(4)s that use 2 CX instead of 3 (omitted for brevity). Both contribute to shorter durations. We assume basis gate fidelity F_b = 0.99 for the approximate SU(4) expansion in all cases. If the native gate is ECR (rather than direct CX), we get an additional 7% reduction in mean duration by targeting the native gate and absorbing local pre-rotations.

2. Decomposition in the natural gate direction: a CX gate has a natural direction in terms of speed and error on the hardware. The other direction is achieved by local pre- and post-rotations. The same is true for ECR gates. By querying the device for its natural direction, we can expand each SU(4) and SWAP in the correct direction in the compiler, avoiding further cost down the road. To synthesize a general SU(4) when the logical and physical directions are mismatched, we employ a trick of double mirroring (adding SWAPs before and after the SU(4)). The doubly-mirrored SU(4) implements a different operator, where the middle two rows and middle two columns are swapped. We perform a pulse-efficient synthesis on the doubly-mirrored operator, but apply it in the circuit with the reverse order of qubits. This ensures that the original operator is implemented, but now with the correct physical gate direction (Figure 4). Double mirroring creates a locally equivalent gate, so any approximation to the original SU(4) still holds with the same error bounds.
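The double-mirroring statement can be verified numerically. The sketch below, with hypothetical helper names, checks on a random two-qubit unitary that sandwiching it between SWAPs exchanges the middle two rows and columns, that applying the mirrored operator on reversed qubits restores the original, and that a CX direction can be flipped by local Hadamards before and after.

```python
import numpy as np

SWAP = np.array([[1, 0, 0, 0],
                 [0, 0, 1, 0],
                 [0, 1, 0, 0],
                 [0, 0, 0, 1]], dtype=complex)
CX_01 = np.array([[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]], dtype=complex)
CX_10 = np.array([[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)


def random_unitary(dim, seed=0):
    """Random unitary from the QR decomposition of a complex Gaussian matrix."""
    rng = np.random.default_rng(seed)
    a = rng.normal(size=(dim, dim)) + 1j * rng.normal(size=(dim, dim))
    q, _ = np.linalg.qr(a)
    return q


def double_mirror(u):
    """SWAP before and after a two-qubit operator."""
    return SWAP @ u @ SWAP


U = random_unitary(4)
V = double_mirror(U)

perm = [0, 2, 1, 3]  # exchange the middle two basis states
rows_cols_swapped = np.allclose(V, U[np.ix_(perm, perm)])

# Applying V with the qubit order reversed is another SWAP sandwich.
restored = np.allclose(double_mirror(V), U)

# Local Hadamards before and after flip the direction of a CX.
HH = np.kron(H, H)
direction_flipped = np.allclose(HH @ CX_01 @ HH, CX_10)

print(rows_cols_swapped, restored, direction_flipped)  # True True True
```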
3. Decomposition to native gate: If a direct CX is not available, we compile to the fundamental two-qubit interaction available. In the case of ECR, this saves us the extra single-qubit pulses involved in creating a CX. This demonstrates the benefit of removing simplifying abstraction barriers in the exposed gate set to gain efficiency in compiling [22], [23], [24].

IV. DYNAMICAL DECOUPLING
When quantum circuits are mapped to physical hardware, not all physical gates can be performed simultaneously. Gate execution times can vary significantly, not only between single- and two-qubit gates, but also between individual qubits and qubit pairs. In addition, architecture-specific gate schemes and connectivity determine which and how many gates can be executed in parallel. An analysis of QV64 circuits mapped to a line of transmon qubits reveals idle times that are a significant portion of the total circuit duration (Fig. 5). Two main effects create these idle slots. Firstly, a line configuration with nearest-neighbor gates requires a total of 7.3 SWAPs on average per QV circuit. In the optimal layout and routing choice (Section III) for reducing the number of SWAP gates, SWAPs are not executed at once over the entire quantum register, as shown in Fig. 3(c). When the basis two-qubit gate is a local equivalent of CNOT, this creates "idle holes" for the duration of three two-qubit gates (Fig. 5). Secondly, "idle holes" can still arise even if no SWAP operations are required. While single-qubit gates are tuned with identical durations across the entire register, two-qubit gate durations depend on qubit frequencies and coupling, differing by a factor of 1.5 to 2 between the fastest and the slowest gates. Given that two-qubit gates are ≈ 10× longer than single-qubit gates, these differences accumulate over the course of the computation, opening up additional temporal gaps when individual qubits sit idle.
Ideally, idle qubits would evolve under the identity operation; however, this is executed far from perfectly in realistic architectures. While thermal relaxation and white noise dephasing lead to dissipative information loss, cross-talk and unwanted non-local spectator interactions lead to local and non-local unitary errors, respectively. In addition, non-Markovian noise sources such as charge noise lead to non-white dephasing. All three error sources are detrimental as circuits become larger, i.e., wider and deeper. Dynamical decoupling is a thoroughly discussed error mitigation technique [25], [26], [27], and in its simplest form, can be a single Hahn echo pulse [28], refocusing the low-frequency noise spectrum acting on a unitary. Various decoupling sequences have been proposed [29], [30], [31], some with self-correcting properties [32], [33], others with non-equidistant temporal spacing [34], and hybrids combining both [35], [36], [37], in order to optimize the effective filter function. Recently, dynamical decoupling has been shown to improve single-qubit states and an entangled two-qubit state on Rigetti and IBM quantum computers [38].
For the successful QV64 measurement presented here, we used a sequence of X_p and X_m echo pulses inserted into each idle period, where τ^{q,i}_idle is the ith idle length on qubit q, T_{X_{p/m}} is the duration of one echo pulse, and X_{p,m} is a π-pulse around the x-axis with positive/negative sense of rotation. Figure 6 shows a comparison of identical QV circuits run with (DD) and without (Idle) dynamical decoupling. Dynamical decoupling with X_p-X_m sequences improves 72.8% of all circuits in this run, i.e. HOP_DD > HOP_Idle, with an average HOP increase of 0.0178. The interplay between various DD sequences and random circuits, such as QV circuits, is an open research focus.
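How an X_p-X_m pair might be placed into one idle window can be sketched as follows. The exact delay spacing is not stated in the text above; the symmetric quarter, half, quarter split of the remaining free time used here is an assumption, as are the function name and the tuple representation of the schedule.

```python
def insert_xp_xm(idle_ns, pulse_ns):
    """Fill one idle window with delay - Xp - delay - Xm - delay.

    idle_ns: length of the idle window; pulse_ns: duration of one echo pulse.
    Assumed spacing: the free time left after the two pulses is split
    1/4, 1/2, 1/4. Windows too short for two pulses are left idle.
    """
    free = idle_ns - 2.0 * pulse_ns
    if free < 0:
        return [("delay", idle_ns)]
    return [
        ("delay", free / 4.0),
        ("Xp", pulse_ns),
        ("delay", free / 2.0),
        ("Xm", pulse_ns),
        ("delay", free / 4.0),
    ]


# Example: an idle hole lasting three 199 ns two-qubit gates, 21.33 ns pulses.
for op, duration in insert_xp_xm(idle_ns=3 * 199.0, pulse_ns=21.33):
    print(f"{op:>5s} {duration:8.2f} ns")
```

The two pulses have opposite sense of rotation, so their net action on the idle qubit is the identity while low-frequency phase errors accumulated in the delays are refocused.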

V. DIRECT CX GATE
Even with state-of-the-art compiling, QV64 circuits consist of a total of 57 two-qubit gates and 146 single-qubit gates on average. Any improvement in gate speed can significantly reduce the circuit duration compared to the coherence times. However, the optimal gate speed for running a circuit is in general not the speed that maximizes the fidelity of the individual gates. In particular, qubits experience idle times in a multi-qubit circuit (see Section IV), and the fidelity of the identity operation during these idle times is not captured in the single-qubit or two-qubit randomized benchmarking fidelities often used to characterize quantum systems. Finding the optimal trade-off between individual gate fidelity and circuit fidelity is currently open research, in addition to characterizing which errors are enhanced by driving gates faster. Here we focus on techniques to reduce two-qubit gate durations, but note that small increases in the speed of either single- or two-qubit gates can significantly impact the performance of QV64 circuits.
As mentioned in Section III, an immediate way to "speed up" two-qubit gates is to incorporate into the circuit compilation any pre-/post-single-qubit rotations needed to get from the native ECR gate to a CX or CNOT. We compare the standard echoed cross-resonance gate ECR CX, shown at the top of Fig. 7(a), to an ECR gate in which single-qubit rotations are compiled separately, reducing the two-qubit gate duration to only the entangling portion of the gate. The errors of ECR CX and ECR, measured by two-qubit randomized benchmarking, are shown in Fig. 7(b) as a function of the two-qubit gate duration.
Two-qubit gates can be further sped up by finding high-fidelity alternatives to the echo pulse sequence, effectively removing another single-qubit gate from the total two-qubit gate duration. We compare an example of a "direct" echo-free CX pulse sequence, shown at the bottom of Fig. 7(a), to ECR and ECR CX. This sequence demonstrates an improvement over previous direct CNOT attempts [15] by leveraging our understanding of target rotary pulsing [4]. The resonant drive of the target is implemented as the sum of two parts, an active cancellation tone and a target rotary tone that are symmetric and antisymmetric over the CR pulse, respectively. The active cancellation tone cancels IX terms in the native CR Hamiltonian and any IY terms due to classical crosstalk, while the target rotary pulse can be used to reduce unwanted ZZ and ZY.
The impact of reducing the total gate duration is clearly evidenced by a reduction of two-qubit gate error, as shown in Fig. 7(b). All gate sequences (ECR CX, ECR, and direct CX) experience a sudden loss of fidelity with increasing pulse amplitude, but the direct CX experiences this breakdown at a much shorter gate time. We note that reducing the gate duration below that which minimizes the two-qubit gate error as measured by randomized benchmarking can increase the HOP of a QV circuit, showing the importance of balancing circuit optimization with gate optimization. For our successful demonstration of QV64 we used a direct CX gate duration of 199 ns, which is shorter than that which minimizes the two-qubit gate error.

VI. STATE INITIALIZATION AND READOUT
Qubit-state initialization to a fiducial simple state and qubit-specific measurement are two out of five (plus two) necessary DiVincenzo criteria for quantum computation [39]. While certain metrics are designed specifically to be insensitive to "state preparation and measurement" (SPAM) errors, e.g. randomized benchmarking [40], [41] and gate set tomography [42], [43], quantum volume was developed as a holistic system measure and hence is sensitive to SPAM-errors.
In its simplest form, qubit initialization or reset is done passively by waiting multiple T_1 relaxation times before every new computational cycle in order to let the qubit thermalize with its surrounding bath. With ever-increasing coherence times, thermal relaxation protocols impractically limit the computational repetition rate. Various active reset schemes have been proposed and experimentally demonstrated [44], [45]. IBM Quantum systems implement a similar unconditional reset scheme [46]. By measuring the readout matrix (Fig. 8(a)) we can infer a reset error of E_RS = 2.8 × 10^-2 for the six-qubit ground state |0...0⟩.
Fig. 8. Comparison of readout assignment matrices. Color map indicates assignment error. Y-axis: prepared six-qubit state vector encoded in black (|0⟩) and white (|1⟩). X-axis: assigned six-qubit state vector. Left matrix: standard procedure of the deployed system with |000000⟩-state reset error E_RS = 2.8 × 10^-2 and a total assignment error E_SP = 0.10. Right matrix: state-of-the-art excited state promotion (ESP) readout with E_ESP = 3.5 × 10^-2 and E_RS = 3.7 × 10^-2.
Single qubits are dispersively read out by transversely coupled transmission line cavities [47]. The I-Q trajectories of each measurement signal are integrated with a filter function weighting the initial signal more heavily, hence reducing the sensitivity to T_1 events during measurement [48]. The signal is amplified with a quantum-limited travelling wave parametric amplifier followed by a classical amplification chain. This standard procedure (SP) for typically deployed systems gives a total assignment error of E_SP = 0.10 for all 2^6 states.
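For illustration, an assignment matrix and a summary error can be computed from labeled shots as sketched below. The precise definition of the total assignment error is not spelled out in the text; this sketch assumes it is one minus the average diagonal of the row-normalized assignment matrix. The function names and the toy data are hypothetical.

```python
import numpy as np


def assignment_matrix(prepared, assigned, n_states):
    """Row-normalized matrix M[p, a] = P(assigned = a | prepared = p).

    Assumes every state 0..n_states-1 was prepared at least once.
    """
    prepared = np.asarray(prepared)
    assigned = np.asarray(assigned)
    counts = np.zeros((n_states, n_states))
    np.add.at(counts, (prepared, assigned), 1)  # accumulate repeated index pairs
    return counts / counts.sum(axis=1, keepdims=True)


def mean_assignment_error(matrix):
    """Assumed definition: 1 - average probability of correct assignment."""
    return 1.0 - np.trace(matrix) / matrix.shape[0]


# Toy single-qubit data: one of four |0> shots is misassigned as |1>.
prepared = [0, 0, 0, 0, 1, 1, 1, 1]
assigned = [0, 0, 0, 1, 1, 1, 1, 1]
M = assignment_matrix(prepared, assigned, n_states=2)
print(M)
print(mean_assignment_error(M))  # 0.125
```

For the six-qubit groups discussed here, n_states would be 2^6 = 64, with states indexed by the integer value of the prepared or assigned bitstring.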
In order to further boost readout we have implemented excited state promotion (ESP) by applying an additional π-pulse between the first and second excited transmon states, |1⟩ → |f⟩, before each measurement pulse [8], [49], where |f⟩ is the second excited transmon state. The advantage of this population transfer is threefold. Firstly, the dispersive χ-shift between |0⟩ ↔ |f⟩ is stronger, leading to a larger separation of the signals in the I-Q plane. Secondly, even though the |f⟩ state has a lifetime half that of the |1⟩ state [50], the qubit excitation has to decay twice, |f⟩ → |1⟩ → |0⟩ (while a two-photon decay |f⟩ → |0⟩ is strongly suppressed [51]). This scheme effectively extends the |1⟩ qubit-state lifetime and further reduces false |0⟩ assignment due to T_1 decays. Lastly, the relaxation through an intermediate state leads to sub-Poissonian relaxation statistics, improving the readout error even in the absence of an increase in signal-to-noise ratio [52]. State discrimination is set with a linear discriminant analysis (LDA) between the states |0⟩ and |f⟩ in the I-Q plane. In order to reset the extended qutrit system, we adapt our reset protocol in the following way: reset - π_{|f⟩→|1⟩} - reset. This state-of-the-art readout reduces the total assignment error to E_ESP = 3.5 × 10^-2 with an initialization error of E_RS = 3.7 × 10^-2, measured with the assignment matrix (Fig. 8(b)).
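The benefit of the two-step decay can be made concrete with a simple rate-equation model. This is a sketch under stated assumptions: the |f⟩ lifetime is taken to be exactly T_1/2, decay is purely sequential (|f⟩ → |1⟩ → |0⟩, no direct |f⟩ → |0⟩ channel), and measurement-induced transitions are ignored. Under these assumptions the probability of reaching |0⟩ from |f⟩ within a time t is the square of the probability of reaching |0⟩ from |1⟩. The function names and the numbers in the example are illustrative.

```python
import math


def p_ground_from_1(t, t1):
    """Probability that a qubit prepared in |1> has decayed to |0> by time t."""
    return 1.0 - math.exp(-t / t1)


def p_ground_from_f(t, t1):
    """Probability that a qubit prepared in |f> has reached |0> by time t.

    Sequential decay with rates 2/t1 (f -> 1) and 1/t1 (1 -> 0); the
    populations follow from solving the two coupled rate equations.
    """
    g = 1.0 / t1
    pop_f = math.exp(-2.0 * g * t)
    pop_1 = 2.0 * (math.exp(-g * t) - math.exp(-2.0 * g * t))
    return 1.0 - pop_f - pop_1


# Illustrative numbers: 1 us measurement window, T1 = 100 us.
t, t1 = 1.0, 100.0
print(p_ground_from_1(t, t1))  # roughly 1e-2
print(p_ground_from_f(t, t1))  # roughly 1e-4
```

In this model a false |0⟩ assignment caused by relaxation during measurement is suppressed quadratically when the excited state is promoted to |f⟩ first.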

VII. CONCLUSION
In this paper we have shown an improvement in the quantum volume of a state-of-the-art superconducting quantum system.
We reached a quantum volume of 64 through a combination of four factors: improvements of the Qiskit compiler, refinements to the two-qubit gate and its calibration, addition of dynamical decoupling to mitigate noise affecting idle qubits, and introduction of excited state promoted readout. The last three techniques were developed by having lower-in-the-stack access to how the pulses and gates that comprise quantum circuits are defined before being sent to control the qubits. Furthermore, we note that optimizing the fidelity of quantum circuits is not equivalent to optimizing the gates, which confirms the need for circuit benchmarks like quantum volume. This type of hardware-aware approach to improving circuit performance is a hallmark of the current era of noisy quantum systems, which we expect to continue until we can achieve error rates in the range of 10^-4.