Compiling quantum algorithms for architectures with multi-qubit gates

In recent years, small-scale quantum information processors have been realized in multiple physical architectures. These systems provide a universal set of gates that allow one to implement any given unitary operation. The decomposition of a particular algorithm into a sequence of these available gates is not unique. Thus, the fidelity of the implementation of an algorithm can be increased by choosing an optimized decomposition into available gates. Here, we present a method to find such a decomposition, where a small-scale ion trap quantum information processor is used as an example. We demonstrate a numerical optimization protocol that minimizes the number of required multi-qubit entangling gates by design. Furthermore, we adapt the method for state preparation, and quantum algorithms including in-sequence measurements.

Quantum technologies open new possibilities that are inaccessible with current classical devices, ranging from cryptography [1,2] to efficient simulation of physical systems [3][4][5].To utilize the full computational power of quantum systems, one needs a universal quantum computer : a device able to implement arbitrary unitary operations, or at least to approximate them to arbitrary accuracy.However, in any specific physical system, only a certain set of operations is readily available.Therefore, it is necessary to decompose the desired unitary operation as a sequence of these experimentally available gates.An available set of gates is known as universal if it is possible to find such a decomposition for an arbitrary unitary quantum operation acting on the qubit register.
A canonical universal set of gates consists of two-qubit CNOT gates and arbitrary single qubit rotations.There exist deterministic algorithms that provide near-optimal decompositions of unitaries in terms of these gates [6].However, the set of gates that yields the highest fidelities depends on the particular experimental implementation.In particular, two-qubit CNOT gates may not be the most efficient to implement.Architectures like trapped ions [7,8] or atom lattices [9] include in their toolboxes high-fidelity multi-qubit gates that act on the entire qubit register (see section I A).Implementing two-qubit gates in terms of these requires refocusing [10] or decoupling [7] techniques, and thus increases the overhead.Therefore it is desirable to find a direct decomposition of the target unitary into the available operations.In general, the number of multi-qubit gates needs to be minimized, since these are more prone to errors than local gates.
Compiling unitaries using multi-qubit gates that act on the whole qubit register is more challenging than using two-qubit gates.Even if a sequence implements correctly a unitary for N qubits, it might not work for N +1 qubits, since additional "spectator" qubits will also be affected by the sequence instead of being left unchanged [11].Therefore, one has to define a qubit register of interest where the unitary will be compiled, and the experimental implementation of the resulting sequence has to be limited to this subregister, as explained in Section I A. Moreover, the existing analytical methods for finding decompositions of unitaries in terms of two-qubit gates (see for instance Ref. [12]) do not seem to apply to multi-qubit gates.Therefore, in this work we employ an approach based on numerical optimization.
A similar algorithm for finding multi-qubit gate decompositions has been studied in Ref. [11], where optimal control techniques are used to find a pulse sequence for a given target unitary operation.The procedure described there starts with long sequences and then removes pulses, if possible.This often results in sequences with more entangling operations than actually required.In this work we present an algorithm designed to produce decompositions with a minimal number of entangling gates.In addition, we introduce a deterministic algorithm for finding decompositions of local unitaries.We also extend the algorithm to operations required for state preparation or measurement, which are particular cases of more general operations known as isometries [13,14].
The paper is organized as follows: in Section I A we describe precisely which gates we will consider as part of our experimentally available toolbox, and review some architectures for quantum information processing to which the methods described in this work can be applied.In Section II we show an analytic algorithm to compile local unitaries, which can be used to find efficient implementations of state and process tomographies.Finally, in Section III we describe and analyze an algorithm to compile fully general unitaries which relies on numerical optimization.

A. Experimental toolbox
Several quantum information processing experiments based on atomic and molecular systems have similar toolsets of quantum operations at their disposal.Often, it is convenient to apply collective rotations on an entire qubit register.These collective (yet local) gates, combined with addressed operations (typically rotations around the Z axis) allow one to implement arbitrary local unitaries, as we show in Section II.Together with suitable multi-qubit operations, arbitrary quantum unitaries can be implemented.In this work we consider the following set of gates: • Collective rotations of the whole qubit register about any axis on the equator of the Bloch sphere C(θ, φ).Here θ is the rotation angle and φ is the phase, so that: where S x,y = σ x,y N are the total spin projections on the x or y axes, and σ x,y,z j are the respective Pauli operators corresponding to qubit j.For the sake of brevity we also define rotations around the X and Y axes as: • Single qubit rotations around the Z axis Z n (θ), where θ is the rotation angle, and n is the qubit index: with σ z n being the Pauli Z operator applied to the n-th ion.
• Entangling Mølmer-Sørensen (MS) gates [15], with arbitrary rotation angle and phase MS φ (θ).Here θ is the rotation angle and φ is the phase of the gate, resulting in: where S x,y = σ x,y N are the total spin projections on the x or y axes, as before.For φ = 0 or φ = π/2 we obtain gates that act around the X or Y axes, which we will denote: x,y /4 .
As mentioned before, it is desirable to be able to restrict the action of the MS gate to a particular qubit subset.This can be done experimentally by spectroscopically decoupling the rest of the qubits from the computation [7], or by addressing the MS gate only on the relevant subset of the qubits [16].
This set of gates, or equivalent ones, are available in several trapped-ion experiments [7,17,18].A similar toolbox of operations is available for architectures based on trapped-ion hyperfine qubits.For example, Ref. [8] describes high-fidelity microwave gates applied to a single hyperfine 43 Ca + qubit.In a multi-qubit system, these gates would drive collective rotations like the ones described above.In addition, Ref. [19] describes a Ramandriven σ z ⊗σ z phase gate on two qubits, which applied to a many-qubit register would act analogously to the MS gate already described.
Recently, an implementation of high-fidelity gates in a 2D array of neutral atom qubits was reported [9].The toolbox described there consists of global microwavedriven gates and single-site Stark shifts on the atoms, which are completely equivalent to the local operations described before for the trapped ion architecture.A multi-qubit CNOT gate, equivalent to the MS gate, could also be implemented by means of long-range Rydberg blockade interactions [20].

II. COMPILATION OF LOCAL UNITARIES
Local unitaries can be written as a product of singlequbit unitaries.In this section we show a fully deterministic algorithm that produces decompositions of any local unitary as a sequence of collective equatorial rotations and addressed Z rotations, as described in section I A. The decompositions presented here are optimal in the number of pulses.These techniques are particularly useful for the implementation of state and process tomographies, as exemplified in figure 1, since both require only local operations at the beginning and end of the algorithm.
Let us consider a register of N qubits, and a local unitary U = U 1 ⊗U 2 ⊗• • •⊗U N to be applied to them, where U i is the action of the unitary on the i-th qubit.If the same operation has to be applied to more than one qubit (U i = U j ), we can replace both with a single instance of the operation, and then apply the same addressed rotations on all qubits subject to the same operation U i .Therefore, we only have to consider the case where every U i is unique.
In order to apply a general local unitary to each qubit we need to have at least three degrees of freedom per qubit [6], so the decomposition must have at least 3N free parameters.During the sequence at least N − 1 of the qubits must eventually be addressed, since a different unitary has to be applied to each qubit.Therefore, a sequence of addressed operations of the form ) must be included in the decomposition.These provide N − 1 parameters, so 2N + 1 more degrees of freedom are required.The most economic way to provide these is by means of collective gates C(θ i , φ i ), which have two degrees of freedom each, so the shortest sequence possible must include at least N global operations, for a total of 3N − 1 free parameters.One additional degree of freedom remains, so we must add a last gate.This can be either an addressed operation on qubit N or a collective gate.If we add an addressed gate Z N , we obtain a sequence of the form: where ) are collective and single-qubit rotations respectively, as explained in Section I A. Such a sequence is useful for compiling local unitaries up to arbitrary phases, as explained in Appendices A 3 and A 4. The second alternative is to add a collective rotation C N : which is the type of sequence we consider in this section.
For particular unitaries, some of the C i and Z i in Eq. ( 8) may actually be the identity, in which case the sequence is simpler.Since the decomposition depends on the ordering of the qubits, by reordering them a simpler sequence might be obtained.For small numbers of qubits, one can compile the unitary for every possible permutation, although this becomes inefficient for large numbers of qubits.However, let us remember that, for the purposes of the compilation, the qubits are grouped together according to which of them experience the same single-qubit unitary U i .For an application such as state tomography, there are only three possible unitaries to be applied to each qubit in the register (shown in Figure 1), since one only wants to perform a measurement in one of three different bases.Therefore, effectively we only need to consider three qubits, in which case trying out all the permutations is perfectly feasible.
We will describe now how to compile a generic local unitary U = U 1 ⊗ U 2 ⊗ • • • ⊗ U N exactly, using a decomposition of the form (8). Let us first note that the unitaries in Eq. ( 8) act on the N -qubit Hilbert space, which is the tensor product of the single-qubit Hilbert spaces.For the sake of simplifying the notation, we will now refer to these unitaries as Ci , Zi , and will reuse the notations C i and Z i for their action on the single qubits, so that: where 1 is the 2 × 2 identity matrix, and Z i appears at the i-th place (since it only addresses the i-th qubit).
In terms of these single-qubit unitaries, factoring Eq. ( 8) for each qubit we obtain N equations: . . .
From the last equation we can determine C N C N : and eliminating this factor from the remaining equations we obtain: We solve each equation in ( 12) consecutively.To solve the first equation in (12), let us notice that its left-hand side is a known unitary, which can be written as: where α 1 is the angle of the rotation and u 1 its generator.The right-hand side is simply a rotation around Z and a change of basis.Therefore, the rotation angle of Z 1 must be equal to α 1 , and the change of basis must be such that: We show in Appendix A 1 how to find the generator and angle of the collective rotation C 1 .
Having determined C 1 , we can write the second equation in (12) as: As before, the left-hand side of this equation is a known unitary, and the right-hand side consists of a rotation around Z and a change of basis, so the rotation angle θ 2 and generator of the change of basis C 2 can be found as for the previous equation.This procedure can be repeated until all of the C k and Z k with k ≤ N − 1 are determined.The last collective operations C N and C N can be determined from equation (11).For this we need to decompose an arbitrary unitary into a product of two equatorial rotations; this can be done as explained in appendix A 2.
We have shown so far how to compile a local unitary exactly.However, in certain cases the constraints on the target unitary are weaker, so that it can be implemented with a simpler sequence.For instance, a unitary that is followed by global gates whose phase can be freely adjusted must only be specified up to a collective Z rotation afterwards, since this rotation can be absorbed into the phase.This removes one free parameter from the sequence, thus simplifying its implementation.The details of this procedure are presented in appendix A 3. Another case of interest is when the target unitary is specified up to arbitrary independent Z rotations afterwards, for instance when the unitary is followed by a projective measurement on the Z basis.This is particularly useful for tomographic measurements; details are shown in Appendix A 4.

III. COMPILATION OF GENERAL UNITARIES
In section II we studied how to compile local unitaries in terms of collective and addressed rotations.However, a universal quantum computer also requires entangling unitaries, which must be compiled into the experimentally available local and entangling gates.For example, in figure 2 we show a decomposition of a Toffoli gate into a sequence of local and entangling gates applied consecutively.In this section, we present an algorithm to find such decompositions for arbitrary unitaries.We seek decompositions directly in terms of multiqubit entangling gates, since these are often more efficient than decompositions in terms of two-qubit gates.For example, a Toffoli gate can be implemented using only 3 Mølmer-Sørensen (MS) gates [11], while 6 CNOT gates are needed to implement it [21]), and a Fredkin gate can be implemented using 4 MS gates [22], while the least number of two-qubit gates required is 5 [23].As described in section I A, many equivalent types of entangling gates are experimentally available.We will consider MS gates, but the methods shown here are applicable to any entangling gate that forms a universal set together with local operations.

A. Compilation in layers
In many quantum information processing experiments the most costly operations in terms of fidelity are entangling gates.Therefore, when trying to compile a unitary we seek to minimize the number of those.A straightforward way to do this is to use pulse sequences where layers of local unitaries and entangling gates are applied consecutively, as shown in figure 3.

FIG. 3. Sequence with layers of local and entangling gates applied consecutively.
Any unitary can be decomposed in terms of singlequbit gates and two-qubit CNOT gates [24]: where L i denotes an arbitrary local unitary on the whole qubit register and CNOT i denotes a gate between some two qubits.A two-qubit CNOT gate can be implemented in an arbitrary N -qubit register as a sequence of local unitaries and MS x (π/8) gates [11].Therefore, the following decomposition is always possible: However, some of the local unitaries L i in a decomposition of the form (17) may actually be identity, so after removing them the resulting sequence has the following structure: where the k i are integers, and the number M of entangling gates is the same or less than in Eq. (17).It is enough to consider 0 ≤ k ≤ 7, since MS x (π) is either the identity for an odd number of qubits, or a π rotation around X for an even number of qubits.We now seek to further simplify sequence (18).Every single-qubit unitary U i on qubit i can be written as a composition of rotations around two different fixed axes [6], which means that we can always choose α i1 , α i2 and α i3 such that: Any local unitary L = N i=1 U i can therefore be written as: where the product goes over the N qubits in the register.Since unitaries acting on different qubits commute, we can write this as: where X and Z denote arbitrary products of rotations around the X or Z axes for all qubits.Therefore, the sequence in ( 18) can be written as: and commuting the X rotations with the MS gates we obtain a sequence of the form: Every odd local unitary (except for the last one) is a product of Z rotations on all qubits, and the even local unitaries can be grouped as L i = X i Zi Xi .Moreover, a collective Z rotation can be extracted from each even local unitary L i and absorbed into the phase of the subsequent MS gates and collective operations to simplify the implementation of L i .Therefore the sequence can be written as: We have thus shown that any N -qubit unitary U can be decomposed into a sequence of the form shown in (25).These sequences always have the same structure, which makes it easier to identify patterns if one wants to compile families of unitaries, i.e. unitaries that depend on some tunable parameter.

B. Numerical optimization
We have described a general form of a sequence of local operations and global entangling gates that implements any desired target unitary.It remains to find the actual sequence parameters, that is, the rotation angles and phases of the gates.However, we do not know a priori how many entangling gates will be needed for a given unitary.Therefore we suggest the following algorithm: 1. Propose a sequence with M = 0 entangling gates.

Search numerically for the sequence parameters
that maximize the fidelity with the target unitary.
3. If the sequence has converged to the desired unitary (i.e. the fidelity equals 1), stop.Otherwise increase M by 1 and go back to step 2.
When performing the numerical optimization in step 2 there might be a number of local optima in addition to the true global optimum, making fully deterministic optimization methods difficult to apply.We apply therefore a repeated local search, where an efficient deterministic optimization method is iterated, each time using randomly determined initial conditions.The initial conditions are chosen randomly for every optimization run, as experience has shown us that starting close to previously found local minima does not offer any improvement.The search is finalized whenever the fidelity with the target unitary is above some predefined threshold, or when a maximum number of tries is exceeded.An advantage of this method is that, since each optimization run starts from random initial conditions, these are easy to perform in parallel.
The algorithm chosen for each numerical optimization is the quasi-Newton method of Broyden, Fletcher, Goldfarb, and Shanno (BFGS) [25].The function to be maximized is the fidelity of the unitary resulting from the pulse sequence with the target unitary.Since we are interested in exact solutions, we reject those solutions whose fidelity (normalized to the maximum value possible) is not equal to 1 within some tolerance threshold (usually 1%).The gradient of the fidelity can be calculated analytically as a function of the sequence parameters, which speeds up the computation as compared to using several evaluations of the fidelity function.
A previously used approach to this optimization problem was a combination of local gradient descent and simulated annealing (SA) [11], which also helps to avoid local maxima.However, this method did not make use of the analytic expression for the fidelity gradient, which speeds up the search.Moreover, its performance depends on the "topography" of the optimization space and requires manual tuning of the search parameters to achieve optimal results.We have compared the BFGS and simulated annealing approaches by compiling 100 random unitaries uniformly distributed in the Haar measure as explained in [26] for different numbers of qubits.In our experience, we find that the BFGS method scales better with the number of qubits than simulated annealing (see Figure 4).The median number of search repetitions needed to find the global optimum was 1 in all the cases.Average time required to find the global optimum for 100 unitaries randomly distributed in the Haar measure with the BFGS and simulated annealing methods, using an Intel©Core i5-4670s CPU 550 @ 3.10 GHz x 4 (one processing thread per optimization run).No data was obtained for the simulated annealing approach for 4 qubits owing to the excessive time required.
The exponential scaling of the optimization problem complexity depends on the number of entangling gates required to compile a given unitary, which is an intrinsic property of the unitary and does not depend on the search algorithm.It is already known that it is not possible to efficiently implement an arbitrary unitary in terms of two-qubit gates [6]; our numerical results suggest a similar result for N -qubit gates.In the two-qubit case the compilation always succeeded with 3 entangling gates, and not less (using 200 search repetitions).This was to be expected, since for two qubits an MS gate is equivalent to a CNOT gate, and it is known that 3 CNOT gates are enough (and in general necessary) to implement an arbitrary two-qubit unitary [27,28].In the three-qubit case, the optimization always succeeded with 8 entangling gates, and never with fewer (also using 200 repetitions).For 4 qubits, the optimization always succeeded for 25 entangling gates, and succeeded only 4% of the time with 24 entangling gates.However, we did only 4 optimization repeats in the four-qubit case, owing to the increased time it takes for these to converge.Therefore, it might be the case that given enough optimization runs, more unitaries would have been compiled with only 24 gates.We are not aware of any result in the literature concerning the number of N -qubit global entangling gates required for implementing a general N -qubit unitary for more than N = 2 qubits.From our numerical results, we conjecture that any three-qubit unitary can be implemented using at most 8 MS gates, and any fourqubit unitary using at most 24 or 25 MS gates.
A particularly interesting group of unitaries are Clifford gates, which find applications in quantum error correction [29], randomized benchmarking [30], and state distillation protocols [6].To explore the difficulty of compiling such gates, we have tested our algorithm with randomly generated Clifford gates, as explained in Ref. [31].We show in Figure 5 the distribution of the optimal number of entangling gates required for compliging two-, three-and four-qubit unitaries.Our results agree with the literature [32] for the two-qubit case, since MS gates are then equivalent to controlled-Z (or CNOT) gates.For larger numbers of qubits, the performance of our algorithm in terms of number of multi-qubit gates required is also similar to that of algorithms based on two-qubit gates [32].

FIG. 5.
Optimal number of entangling gates needed to compile random Clifford operations, for sample sizes {1000, 200, 100} for N = {2, 3, 4} qubits.Each Clifford gate was generated using 10N 8 steps of the random walk described in Ref. [31].Error bars correspond to one standard deviation.

C. Compilation of isometries
A particular case of interest is the compilation of a unitary whose action only matters on certain input states.This happens, for instance, when one is interested in state preparation starting from some fixed input state.Such operations belong to the more general class of operations known as isometries [14,30].In this case, the problem to be solved has less constraints than when fully specifying the target unitary, so a simpler sequence may be found.In this section we focus on compiling a unitary that is only specified in a particular subspace of the input states, for example: where the columns marked as 'free' are left unspecified.
In this case, a suitable fidelity function for the numerical optimization is: where U | S is a rectangular matrix with the components of the unitary in the restricted subspace.
A more general case is where some of the relative phases of the projections of the unitary acting on different subspaces of the whole Hilbert space are irrelevant.For example, suppose that one wants to apply a unitary to map some observable onto an ancilla qubit and then measure the ancilla, as shown in figure 6.Since the input state of qubit 3 is known to be |0 , only the subspace of input states spanned by {|000 , |010 , |100 , |110 } is relevant.Moreover, the measurement will project the state of the system onto either the subspace spanned by {|000 , |010 , |100 }, or that spanned by {|111 }, and all phase coherence between these alternatives will be lost.Therefore, the compiled sequence can be sought such that it matches the desired unitary in each of the subspaces but allowing an arbitrary phase φ between them: In this case (figure 6) it is possible to find a simpler implementation than in the fully constrained case (figure 2), owing to the additional degrees of freedom available, namely arbitrary outputs for the |ψ 3 = |1 input states and an arbitrary relative phase between the two possible measurement outcomes.
In the general case considered here we want to maximize the fidelity in each subspace, without regard to the relative phases between these.Therefore we can seek to maximize the function f consisting of the sum of the fidelity functions (27) corresponding to each subspace: where the sum goes over all the subspaces with different relative phases, and U | Sj is a rectangular matrix with the components of the unitary in the j-th subspace.

D. Compensation of systematic errors
Owing to systematic errors, the operations experimentally applied may still be unitary but deviate from the intended ones.An example of this is addressing crosstalk due to laser light leaking onto adjacent qubits.If it is possible to characterize the actual experimental operations being applied, then they can be taken into account for the compilation by adapting our optimization procedure: 1. Compile the target unitary in terms of the ideal gates.
2. Replace the ideal gates by the experimentally characterized operations.
3. Add operations to obtain a higher fidelity with the ideal target unitary.
As an example we show that excessive crosstalk can be corrected in an implementation of a Toffoli gate. Figure 7 depicts experimental data corresponding to the action of the Toffoli gate on the 8 input basis states.It can be seen that, by adding just two pulses, the output fidelity for each input state increased in some cases by up to 20%.The sequence with 11 pulses is actually only an approximate correction to the uncorrected case.The exact correction requires 14 pulses, and actually yields a lower fidelity than the approximate one, since it requires more pulses and each of these has a non-zero error probability.

IV. CONCLUSIONS AND OUTLOOK
In this work we have shown methods to compile quantum unitaries into a sequence of collective rotations, addressed rotations and global entangling operations.For local unitaries, we have demonstrated an analytic approach that produces the shortest possible sequences in the general case, and adapted the method to simplify the resulting sequences if some constraints on the unitary are lifted.For arbitrary unitaries, we have presented an approach that produces sequences of layered local and entangling operations.This approach is based on a numerical optimization procedure that is faster than previously used ones, and the sequences obtained are by design optimal with respect to the number of entangling gates.Our numerical results suggest upper bounds on the number of N -qubit gates required to implement arbitrary threeand four-qubit unitaries.
The results of this paper show that in many cases one may obtain more efficient implementations by considering operations more general than two-qubit entangling gates.However, the exponentially growing complexity of decompositions as the number of qubits increases points to the necessity of keeping the register size V. ACKNOWLEDGEMENTS and therefore: We have shown how to find suitable rotations Z that fulfill condition (A28).Once these are found, all the lefthand sides of (A26) are known unitaries and the system can be solved as before.The last collective rotation C N can be determined from (A24) as shown in appendix A 3. We have thus shown how to compile the sought unitary U into a sequence of the form:

FIG. 2 .
FIG. 2. Decomposition of a Toffoli gate into a pulse sequence of collective equatorial rotations, addressed Z rotations and entangling Mølmer-Sørensen (MS) gates.
FIG. 4. Average time required to find the global optimum for 100 unitaries randomly distributed in the Haar measure with the BFGS and simulated annealing methods, using an Intel©Core i5-4670s CPU 550 @ 3.10 GHz x 4 (one processing thread per optimization run).No data was obtained for the simulated annealing approach for 4 qubits owing to the excessive time required.

FIG. 6 .
FIG.6.Left: a unitary mapping (Toffoli gate) is applied, after which qubit 3 is measured.Right: a pulse sequence for implementing the circuit on the left.This implementation is simpler than in the fully constrained case (figure2) because of the additional degrees of freedom when compiling.