Basic circuit compilation techniques for an ion-trap quantum machine

We study the problem of compiling quantum algorithms into optimized physical-level circuits executable in a quantum information processing (QIP) experiment based on trapped atomic ions. We report a complete strategy. Starting with an algorithm in the form of a quantum computer program, we compile it into a high-level logical circuit. That circuit goes through multiple stages of decomposition into progressively lower-level circuits until we reach the physical execution-level specification. We skip the fault-tolerance layer, as it is not within the scope of this work. The stages are structured to best assist the overall optimization while taking into account numerous optimization criteria: minimizing the number of expensive two-qubit gates, minimizing the number of less expensive single-qubit gates, optimizing the runtime, minimizing the overall circuit error, and optimizing classical control sequences. Our approach allows trading off circuit runtime against quantum error. It can also accommodate future changes in the optimization criteria, which are likely to arise as the physical-level control of the experiment improves.


Introduction
The interest in quantum computing is rooted in the ability of quantum algorithms to solve certain computational problems more efficiently than any known classical algorithm [1]. To take advantage of these quantum algorithms, a suitable quantum information processing (QIP) system needs to be developed, specifically one that can efficiently execute the protocols prescribed by the respective quantum algorithms [2]. At the time of this writing, fully programmable quantum computational devices with a few to several qubits included those built using superconducting circuits [3,4] and trapped ions [5,6,7].
Since this paper focuses on computing with the trapped-ion QIP platform, we next briefly describe how it works; for details specific to this paper, also see [6]. In the trapped-ion QIP, the qubits are stored in the spins of individual ions (¹⁷¹Yb⁺ in [6]). The ions are suspended in free space (vacuum) by electromagnetic fields. When confined in two dimensions, the ions form a line along the remaining spatial dimension, and weak confinement in this third dimension maintains the linear structure of the ion crystal. The observed qubit coherence time of 0.5 s is long enough that it is not currently a limiting factor on the size of the computation that can be executed. Furthermore, it is expected to be improved by orders of magnitude in the future [8]. Lasers are used both to initialize the system to the simple state |00...0⟩ via a process called optical pumping, and to read out the […] where d is the duration of the above single-qubit rotation and e models the experimental error due to laser pulse-area fluctuations. These fluctuations, caused by laser intensity and timing jitter, lead to random over- and under-rotations of the qubit. The formula for e_1 is constructed to reflect that the slope of the Rabi oscillation is smallest for full π rotations when the gate is applied to a quantum state close to a computational basis state [9]. For an unknown qubit state, the slope of the Rabi oscillation cannot be determined, so the e_1 error model becomes inaccurate. The formula for e_2 is designed to reflect that the error should be proportional to the rotation angle [10]. However, e_2 has its own limitations. For example, it predicts that the error of the single-pulse circuit R(π/2, 0) is smaller than that of the gate R(π, 0) (both applied to a computational basis state). Experiment shows the opposite: R(π, 0) is more accurate than R(π/2, 0). Thus e_2 is also inaccurate.
We note that while a proper explanation and accurate modeling of experimental errors are very important, narrowing down a complete error model is not the focus of this paper; for our purposes, any error model is acceptable. Our goal is to illustrate that trapped-ion experiments can be optimized across a combination of (two) conflicting optimization criteria using the techniques reported here, and to highlight the inner workings of such an optimization approach. With this in mind, we select the e_1 model. Because e_2, like the duration, is proportional to the rotation angle, optimizing over e_2 would be equivalent to optimizing the duration, reducing the number of optimization criteria to just one.
Presently, single-qubit rotations and two-qubit gates are implemented serially. As a result, the overall runtime of a computational experiment, as described by its circuit, equals the sum of the runtimes of the individual gates. Depending on the desired properties of the circuit, one may choose to optimize any of the following:
• the overall runtime;
• the overall error;
• the overall number of gates (including separate counts of single-qubit and two-qubit gates);
• any combined figure of the above.
In this paper, we describe the overall cost of an implementation as a length-2 vector (d, e), whose components are the overall duration and the overall error. The error component is the list (written as a linear combination) of all errors from all gates in the circuit, per the error model introduced for the individual gates. This definition does not correspond to the actual error observed in the experiment. Rather, it shows the influence and sources of errors within a given implementation. We try to minimize the cost vector (d, e), focusing separately on the duration and the error. One may choose to focus on other optimization criteria, such as minimizing the gate count; this does not affect the overall optimization strategy or the steps taken to arrive at the optimized solution.
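The cost bookkeeping described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' tooling:
• It assumes that a pulse R(θ, φ) costs (|θ|τ_1q/π, ε|sin θ|) under the e_1 model, the same form as the RY(θ) cost stated in the next section.
• As a hypothetical simplification, every XX gate costs (τ_2q, E).
• The timing values are those quoted in the benchmark section.
Errors are kept symbolic as a linear combination of ε and E, and durations simply add up because gates run serially.

```python
import math
from collections import Counter

TAU_1Q = 20.0   # microseconds per pi pulse (value quoted in the benchmark section)
TAU_2Q = 235.0  # microseconds per XX gate (assumed independent of chi in this sketch)


def r_cost(theta):
    """(duration, error) of one pulse R(theta, phi) under the e_1 model (assumed form)."""
    duration = abs(theta) * TAU_1Q / math.pi
    return duration, Counter({"eps": abs(math.sin(theta))})


def xx_cost():
    """Hypothetical simplification: every XX gate costs (tau_2q, E)."""
    return TAU_2Q, Counter({"E": 1.0})


def circuit_cost(gates):
    """Cost vector (d, e) of a serially executed gate list.

    gates: tuples ("R", theta) or ("XX",). The duration is a plain sum; the
    error is returned as a linear combination {symbol: coefficient}.
    """
    total_d, total_e = 0.0, Counter()
    for gate in gates:
        d, e = r_cost(gate[1]) if gate[0] == "R" else xx_cost()
        total_d += d
        total_e.update(e)  # adds coefficients symbol by symbol
    return total_d, dict(total_e)


# One XX gate plus four pi/2 pulses: the gate counts of the CNOT in Figure 1.
cnot_like = [("R", math.pi / 2), ("XX",), ("R", math.pi / 2),
             ("R", math.pi / 2), ("R", math.pi / 2)]
print(circuit_cost(cnot_like))
```

Keeping the error as a symbolic linear combination, rather than a single number, mirrors the text: it records which gates contribute error and how much, without committing to numeric values of ε and E.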
For future discussions, we will need the following relations:
• $R(\theta, \phi)^{-1} = R(-\theta, \phi) = R(\theta, \phi + \pi)$, which can be used to construct the inverse of the R(θ, φ) gate at the same cost as the original gate;
• $R(\theta, \phi) = -R(\theta - 2\pi, \phi)$, which makes it possible to limit the duration of any one R(θ, φ) gate to at most τ_1q, since the global phase does not matter.
Both identities are easy to verify directly.
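Both identities can also be checked numerically. The sketch below assumes the standard matrix form R(θ, φ) = [[cos(θ/2), −i e^{−iφ} sin(θ/2)], [−i e^{iφ} sin(θ/2), cos(θ/2)]], under which RX(θ) = R(θ, 0) and RY(θ) = R(θ, π/2). The identities it tests are R(θ, φ)^{−1} = R(−θ, φ) = R(θ, φ + π) and R(θ, φ) = −R(θ − 2π, φ). The helper `shorten` is a hypothetical name. It shows how the second identity caps any pulse at |θ| ≤ π, that is, at a duration of at most τ_1q.

```python
import cmath
import math
import random


def R(theta, phi):
    """Single-qubit pulse R(theta, phi) as a 2x2 matrix (standard convention, assumed)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s],
            [-1j * cmath.exp(1j * phi) * s, c]]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-9):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


def negate(m):
    return [[-x for x in row] for row in m]


def shorten(theta, phi):
    """Equivalent pulse with |theta| <= pi: each 2*pi shift only flips the global sign."""
    return math.remainder(theta, 2 * math.pi), phi


def check_identities(trials=200, seed=1):
    rng = random.Random(seed)
    identity = [[1, 0], [0, 1]]
    for _ in range(trials):
        theta, phi = rng.uniform(-4 * math.pi, 4 * math.pi), rng.uniform(0, 2 * math.pi)
        # Inverse at the same cost: R(theta, phi + pi) = R(-theta, phi) undoes R(theta, phi).
        if not close(matmul(R(theta, phi + math.pi), R(theta, phi)), identity):
            return False
        if not close(R(-theta, phi), R(theta, phi + math.pi)):
            return False
        # R(theta, phi) = -R(theta - 2*pi, phi): equal up to a global phase of -1.
        if not close(R(theta, phi), negate(R(theta - 2 * math.pi, phi))):
            return False
    return True


print(check_identities())
```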

The RX, RY, and RZ rotations
The single-qubit rotations around the basis axes must be expressed in terms of the physical-level R gate to be implementable in an experiment.

RY: R(θ, π/2) obtains the rotation about the Y axis by the angle θ. In particular,
$\mathrm{RY}(\theta) = \begin{pmatrix} \cos\frac{\theta}{2} & -\sin\frac{\theta}{2} \\ \sin\frac{\theta}{2} & \cos\frac{\theta}{2} \end{pmatrix} = R(\theta, \pi/2). \quad (3)$
As a result, the cost of RY(θ) is (|θ|τ_1q/π, ε|sin θ|). The implementation (3) is also well known. The costs of the RX and RY gates with the same rotation angle are thus the same. RZ is more difficult to obtain.
RZ: The RZ rotation is defined as follows,
$\mathrm{RZ}(\theta) = \begin{pmatrix} e^{-i\theta/2} & 0 \\ 0 & e^{i\theta/2} \end{pmatrix}.$
It is easy to show that RZ cannot be obtained via a single physical R pulse, and thus requires a circuit with two or more R gates. Firstly, we found the following circuit implementing the RZ gate,
$\mathrm{RZ}(\theta) = \mathrm{RY}(-v\pi/2).\mathrm{RX}(v\theta).\mathrm{RY}(v\pi/2), \quad (4)$
where v ∈ {−1, +1} is a variable that allows the sign of the first or the last RY rotation to be set arbitrarily. Here '.' denotes matrix multiplication; recall that the order of gates in the circuit is the reverse of the order of matrices in the matrix product. The implementation with v = 1 was known in [6]. As we will show later, the ability to choose v, which is our contribution to the above circuit, is very important in circuit optimization. The cost of this implementation is (τ_1q + |θ|τ_1q/π, ε(2 + |sin θ|)).
Secondly, the RZ(θ) gate may be obtained as
$\mathrm{RZ}(\theta) \equiv R(\pi, x + \theta/2).R(\pi, x) \quad (5)$
up to an undetectable global phase of −1 (equality up to a global phase is henceforth denoted by '≡'), where the parameter x may be set arbitrarily. The cost of this implementation is (2τ_1q, 2ε|sin π|) = (2τ_1q, 0). Observe that, at the price of a slightly longer execution time, this second realization has a smaller error. This appears preferable in physical experiments when the gate is implemented by itself, as opposed to as part of a larger computation. The freedom to set x in this implementation helps optimize quantum circuits that use RZ. Varying x lets either RX(π) or RY(π) be the first or the last gate in (5), and such gates may cancel with other gates in the circuit. Varying x also helps optimize classical control. Choosing x equal to the phase parameter of the R gate that implements the previous single-qubit gate keeps the phase of the Raman beatnote constant. This, however, is only a minor improvement to the classical control sequences. Implementation (4) may be more desirable for circuit optimization, since it relies on the efficiently optimizable sequence of RX and RY gates.
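Both RZ realizations can be checked numerically. This sketch assumes the standard R(θ, φ) convention, with RX(θ) = R(θ, 0) and RY(θ) = R(θ, π/2). It also assumes that (4) reads RZ(θ) = RY(−vπ/2).RX(vθ).RY(vπ/2) and that (5) reads RZ(θ) ≡ R(π, x + θ/2).R(π, x), with global phase −1. Matrices are multiplied in the '.' order of the text, so the rightmost factor acts first. The function names are hypothetical. `cost_4` and `cost_5` evaluate the e_1 cost vectors with τ_1q = 1 and ε = 1 as units.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def RZ(t):
    return [[cmath.exp(-0.5j * t), 0], [0, cmath.exp(0.5j * t)]]


def dot(*ms):
    """Matrix product m1 . m2 . ... (the rightmost matrix is the first gate in time)."""
    out = [[1, 0], [0, 1]]
    for m in ms:
        out = [[sum(out[i][k] * m[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return out


def close(a, b, tol=1e-9):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


def rz_via_4(theta, v):
    """In time: RY(v*pi/2), then RX(v*theta), then RY(-v*pi/2)."""
    return dot(RY(-v * math.pi / 2), RX(v * theta), RY(v * math.pi / 2))


def rz_via_5(theta, x):
    """Two pi pulses; equals -RZ(theta) for every x."""
    return dot(R(math.pi, x + theta / 2), R(math.pi, x))


def cost_4(theta, tau=1.0):
    """(duration, error/eps): two pi/2 pulses plus RX(v*theta)."""
    return tau + abs(theta) * tau / math.pi, 2 + abs(math.sin(theta))


def cost_5(theta, tau=1.0):
    """(duration, error/eps): two pi pulses, each with |sin(pi)| = 0, for any theta."""
    return 2 * tau, 0.0
```

Setting x = 0 makes the first pulse in time equal to RX(π), while x = −θ/2 places RX(π) last. This is the freedom the text uses for cancellations.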
The RZ gate may also be implemented directly, without a circuit-level composition of pulses. This is done by individually addressing the qubits with laser beams that shift the qubit energy levels through the Stark effect [11,12]. A physical-level RZ gate has an advantage over the physical-level R gate when one wants to construct RZ(θ) on its own: RZ(θ) then requires two physical-level R gates but only one physical-level RZ gate. However, when RZ(θ) is an internal gate of the circuit (and most gates in interesting quantum computations are internal), it can be written per (5) as either R(π, 0).R(π, −θ/2) or R(π, θ/2).R(π, 0). Here R(π, 0) = RX(π) can be commuted past the two-qubit XX gate (discussed in Section 3) to either the left or the right, as desired. This effectively allows an internal RZ(θ) gate to be implemented with just a single physical-level R gate. Furthermore, in the optimized implementations of all circuits we tried, we could always ensure that no more than two sequential internal R gates were applied to a given qubit. While this is unlikely to scale to arbitrary quantum computations, it may well be that, in practical designs, most often no more than one R gate is required between a pair of two-qubit XX gates acting on a given qubit. This illustrates the expressive power of R gates when used in conjunction with the two-qubit XX gates. To conclude this discussion, we believe the ability to implement RZ directly may not be in high demand unless the properties of such a physical-level RZ gate, including its duration and error, are superior to those of the R gate. Recall that the RX, RY, and RZ gates do not commute with one another, but rotations about the same axis combine additively: G(a)G(b) = G(a + b) when G is any one of RX, RY, or RZ. This is important for the circuit optimization technique discussed later.

Other common single-qubit rotations
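A common single-qubit gate that cannot be expressed as a rotation about one of the basis axes is the Hadamard gate,
$H = \frac{1}{\sqrt{2}}\begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}.$
It can be implemented, up to a global phase, by either of the following two circuits:
$H \equiv \mathrm{RX}(\pi).\mathrm{RY}(\pi/2) \equiv \mathrm{RY}(-\pi/2).\mathrm{RX}(\pi). \quad (6)$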
The cost of each of the above circuits is (1.5τ_1q, ε). As such, the Hadamard gate is, roughly speaking, as expensive as the RZ rotation, and more expensive than either RX or RY. The ability to choose which of the RX/RY gates in the decomposition of the Hadamard gate comes first and which comes second is important to the optimization of quantum circuits.
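The two Hadamard circuits can be checked numerically. This sketch assumes that (6) reads H ≡ RX(π).RY(π/2) ≡ RY(−π/2).RX(π) in the '.' matrix-product notation. It uses the e_1 pulse cost with τ_1q = 1 and ε = 1 as units. It also confirms that merely swapping the two gates of the first circuit, without changing a sign, does not give H.

```python
import cmath
import math


def R(theta, phi):
    """Pulse R(theta, phi); assumed convention RX(t) = R(t, 0), RY(t) = R(t, pi/2)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def dot(*ms):
    """m1 . m2 . ...; the rightmost matrix is the first gate in time."""
    out = [[1, 0], [0, 1]]
    for m in ms:
        out = [[sum(out[i][k] * m[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return out


def equal_up_to_phase(a, b, tol=1e-9):
    """True if a = exp(i*gamma) * b for some real gamma."""
    i, j = max(((r, c) for r in range(2) for c in range(2)), key=lambda rc: abs(b[rc[0]][rc[1]]))
    phase = a[i][j] / b[i][j]
    return abs(abs(phase) - 1) < tol and all(
        abs(a[r][c] - phase * b[r][c]) < tol for r in range(2) for c in range(2))


def pulse_cost(theta, tau=1.0):
    """(duration, error/eps) of one pulse under the e_1 model."""
    return abs(theta) * tau / math.pi, abs(math.sin(theta))


H = [[1 / math.sqrt(2), 1 / math.sqrt(2)], [1 / math.sqrt(2), -1 / math.sqrt(2)]]
ry_first = dot(RX(math.pi), RY(math.pi / 2))   # RY(pi/2) in time, then RX(pi)
rx_first = dot(RY(-math.pi / 2), RX(math.pi))  # RX(pi) in time, then RY(-pi/2)
```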

Arbitrary single-qubit rotations
An arbitrary single-qubit unitary gate can be written as a matrix
$U = e^{id}\begin{pmatrix} e^{ia}\cos b & e^{ic}\sin b \\ -e^{-ic}\sin b & e^{-ia}\cos b \end{pmatrix}$
of four real-valued parameters a, b, c, and d. We found an implementation of U as a circuit with at most two physical-level R gates, equation (7). The cost of this implementation is (τ_1q + |2b mod π|τ_1q/π, ε|sin(2b)|). Observe that equation (7) uses the minimal number of physical-level R gates required to implement an arbitrary single-qubit unitary. This follows from counting real-valued degrees of freedom. The 2 × 2 unitary matrices up to a global phase have three, whereas a single R gate has only two (θ and φ), so one R gate cannot suffice. Equation (7) furthermore gives rise to the following Lemma, which provides a guarantee on the cost of quantum physical-level circuits.
Lemma 1. Any quantum physical-level circuit over n qubits with G two-qubit XX gates can be reduced to an equivalent one with no more than 2(n + 2G) single-qubit R gates. Across all single-qubit gates used, these contribute no more than 2τ_1q(n + 2G) to the runtime and a term of no more than (n + 2G) × ε to the error.
Proof: First, count the number of "pieces of wire". A piece of wire is a maximal segment of a single qubit's evolution that is not interrupted by any two-qubit gate. This number is n + 2G, as is easy to verify by induction on G, the number of two-qubit gates. Indeed, a quantum circuit with no two-qubit gates has n pieces of wire. Appending a two-qubit gate at the end of the circuit increases the number of pieces of wire by two, one on each of the qubits the gate operates on. The single-qubit operations are confined to individual pieces of wire. On each piece, equation (7) replaces them with at most two R gates, whose total duration is at most 2τ_1q and whose error is at most ε. The definitions of the duration and error conclude the proof.
The above Lemma gives an upper bound on the number of single-qubit gates. It can serve as a baseline for evaluating the efficiency of the optimizing compiler developed and tested in Sections 4-5. We furthermore observe that the Lemma shows the 5-qubit, 80-gate QFT5 circuit reported in [6], 10 of whose gates are two-qubit gates, to be suboptimal. By Lemma 1, an equivalent circuit needs at most 2(5 + 2 × 10) = 50 single-qubit gates, for a total of at most 60 gates.
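The counting argument of Lemma 1 can be sketched directly. The helper names `pieces_of_wire` and `lemma1_bounds` are hypothetical. For the QFT5 illustration, we assume one two-qubit gate per pair of the five qubits, giving the 10 two-qubit gates mentioned above. The exact gate placement does not matter, since only the count G enters the bound.

```python
from itertools import combinations


def pieces_of_wire(n, two_qubit_gates):
    """Count wire segments that no two-qubit gate interrupts.

    Every qubit starts with one piece; each two-qubit gate closes the current
    piece on both of its qubits and opens a new one, so each gate adds two.
    """
    per_qubit = [1] * n
    for a, b in two_qubit_gates:
        if a == b:
            raise ValueError("a two-qubit gate needs two distinct qubits")
        per_qubit[a] += 1
        per_qubit[b] += 1
    return sum(per_qubit)


def lemma1_bounds(n, num_xx, tau_1q):
    """Upper bounds from Lemma 1 (at most two R pulses per piece of wire)."""
    pieces = n + 2 * num_xx
    return {
        "single_qubit_gates": 2 * pieces,
        "total_gates": 2 * pieces + num_xx,
        "runtime_from_1q": 2 * tau_1q * pieces,
        "error_eps_coefficient": pieces,
    }


qft5_pairs = list(combinations(range(5), 2))  # assumed layout: one gate per qubit pair
print(pieces_of_wire(5, qft5_pairs), lemma1_bounds(5, len(qft5_pairs), tau_1q=20.0))
```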

Physical-level two-qubit gate
The physical-level two-qubit gate available to us is the so-called XX(χ) gate, whose parameter χ depends on the pair of ions the gate is applied to. The gate is defined by the following unitary matrix [6]:
$XX(\chi) = \begin{pmatrix} \cos\chi & 0 & 0 & -i\sin\chi \\ 0 & \cos\chi & -i\sin\chi & 0 \\ 0 & -i\sin\chi & \cos\chi & 0 \\ -i\sin\chi & 0 & 0 & \cos\chi \end{pmatrix}.$
The magnitude |χ| can be set to an arbitrary real number between 0 and π/2 by varying the laser power used in the experiment [6]. The sign of χ depends on the laser detuning. The detuning is chosen based on the normal modes with which a particular pair of ions interacts most strongly, and hence depends on which qubits the gate is applied to [6]. The sign for each two-qubit gate is thus fixed experimentally and becomes an input parameter that constrains how circuits may be constructed.
The XX(χ) gate implements the well-known Mølmer-Sørensen gate [13]. The latter is known to generate the CNOT gate (for χ = ±π/4) together with single-qubit operations on the input and output sides. The CNOT gate is an important computational primitive: the ability to obtain it, coupled with the ability to implement any single-qubit gate (see equation (7)), gives computational universality [14]. However, the ability to set χ to values other than ±π/4 allows more efficient implementations of important quantum gates. Once computational universality is obtained, efficiency becomes the next important concern.
Figure 1: Implementation of the CNOT gate using physical-level gates, where s = ±1 is the sign of the interaction parameter χ, specified by the ions the gate applies to (s is a parameter that cannot be varied), and v = ±1 may be chosen arbitrarily.
The following property of the XX(χ) gate is important for future discussions: XX(χ) commutes with any single-qubit RX(θ) rotation. However, XX(χ) commutes with neither RY(θ) nor RZ(θ) when θ ≢ 0 (mod 2π).
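The commutation property can be confirmed numerically. This sketch assumes XX(χ) = cos χ I − i sin χ X⊗X (that is, exp(−iχ X⊗X)), together with the standard R(θ, φ) convention. The RX, RY, or RZ rotation acts on one qubit and the identity on the other.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def RZ(t):
    return [[cmath.exp(-0.5j * t), 0], [0, cmath.exp(0.5j * t)]]


I2 = [[1, 0], [0, 1]]


def kron(a, b):
    """Kronecker product; the first factor acts on the more significant qubit."""
    n = len(b)
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(len(a) * n)]
            for i in range(len(a) * n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def XX(chi):
    """cos(chi) I - i sin(chi) X(x)X, i.e. exp(-i chi X(x)X)."""
    c, s = math.cos(chi), -1j * math.sin(chi)
    return [[c, 0, 0, s], [0, c, s, 0], [0, s, c, 0], [s, 0, 0, c]]


def commute(a, b, tol=1e-9):
    ab, ba = matmul(a, b), matmul(b, a)
    return all(abs(ab[i][j] - ba[i][j]) < tol for i in range(len(a)) for j in range(len(a)))
```

The same check explains why an RX(π) produced by (5) can be moved through neighbouring XX gates, as used in the discussion of internal RZ gates.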

Cost function variables
The two-qubit XX(χ) gate has the vector-valued cost (d, e) given in [9]. Here, d is the duration of the two-qubit XX interaction. The error e either models fluctuations in the experiment, analogously to the single-qubit case (e_1), or simply takes a constant value E independent of the rotation angle (e_2). The current experiment [6] is set up so that two-qubit XX gates can be applied to any pair of qubits in a serial fashion. Compared to the single-qubit gate R, the two-qubit gate has a substantially longer runtime and a higher error. As a result, efficient circuit implementations should prioritize minimizing the number of two-qubit XX gates over minimizing the number of single-qubit gates.

Constructing the CNOT gate
The CNOT gate can be implemented as shown in Figure 1, up to the global phase $(-1)^{-vs/4}$; the construction depends on the sign s of the interaction χ for the pair of qubits to which the CNOT is applied. Observe that v may be chosen freely from the set {−1, +1}; it sets the signs of the RY rotation angles at the beginning and at the end of the implementation. In particular, v may be chosen so that the first RY rotation has the positive angle +π/2, in which case the second RY has the negative angle −π/2, or vice versa. The ability to choose the sign is particularly important for cancelling single-qubit RY gates when decomposing logical-level circuits with multiple CNOT gates into efficient physical-level circuits.
Besides using two fewer single-qubit pulses than [15], our CNOT implementation allows the signs of the RY rotations to be set arbitrarily. It also allows further transformations and optimizations via template (12), discussed later. Our CNOT implementation likewise saves two single-qubit pulses (one RX(±π/2) and one RY(±π/2)) over the one reported in [6].
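Figure 1 itself is not reproduced here, so the following sketch checks one gate sequence that matches its description:
• on the control: RY(vπ/2), then XX(sπ/4), then RX(−sπ/2), then RY(−vπ/2);
• on the target: RX(−vsπ/2) after the XX gate.
This placement of the RX rotations is our assumption, derived to be consistent with the stated global phase (−1)^{−vs/4} and the count of 1 XX gate plus 4 single-qubit pulses. It is not a transcription of the figure. The check uses XX(χ) = exp(−iχ X⊗X), with the control as the first tensor factor.

```python
import cmath
import math


def R(theta, phi):
    """Pulse R(theta, phi) (standard convention, assumed)."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


I2 = [[1, 0], [0, 1]]


def kron(a, b):
    """Kronecker product; the first factor acts on the control qubit."""
    n = len(b)
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(len(a) * n)]
            for i in range(len(a) * n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def XX(chi):
    """exp(-i * chi * X(x)X)."""
    c, s = math.cos(chi), -1j * math.sin(chi)
    return [[c, 0, 0, s], [0, c, s, 0], [0, s, c, 0], [s, 0, 0, c]]


CNOT = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]


def cnot_circuit(s, v):
    """Matrix of the assumed sequence; s = sign of chi (fixed), v = freely chosen sign."""
    layers = [  # in time order
        kron(RY(v * math.pi / 2), I2),
        XX(s * math.pi / 4),
        kron(RX(-s * math.pi / 2), RX(-v * s * math.pi / 2)),
        kron(RY(-v * math.pi / 2), I2),
    ]
    u = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    for layer in layers:
        u = matmul(layer, u)  # a later gate multiplies from the left
    return u


def matches_cnot(s, v, tol=1e-9):
    """Check (-1)^(-vs/4) * circuit == CNOT entry by entry."""
    phase = cmath.exp(-1j * math.pi * v * s / 4)
    u = cnot_circuit(s, v)
    return all(abs(phase * u[i][j] - CNOT[i][j]) < tol for i in range(4) for j in range(4))


for s in (-1, 1):
    for v in (-1, 1):
        print(s, v, matches_cnot(s, v))
```

All four combinations of s and v pass. This is consistent with the text's claim that v is free while s is dictated by the ion pair.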

Constructing controlled-roots of Paulis
Controlled roots of axial rotations (Pauli gates) play an important role in quantum circuits. For instance, the n-qubit quantum Fourier transform is best viewed as a circuit with n(n−1)/2 controlled roots of Pauli-Z gates [1, Figure 5.1]. The controlled-sqrt-NOT gate is used to build an efficient circuit with five two-qubit gates implementing the Toffoli gate [14, Lemma 6.1]. If the proper root is not available and CNOT is the only directly constructible two-qubit gate, the Toffoli gate instead requires six two-qubit physical gates [16]. It is furthermore known that each controlled unitary gate can be implemented with two CNOT gates, or equivalently two Mølmer-Sørensen gates, along with some single-qubit gates [14, Lemma 5.1]. We next show that a single XX(χ) gate suffices to implement any controlled root of a Pauli gate, an improvement by a factor of two. In particular, the controlled-X^α, α ∈ ℝ, −1 ≤ α ≤ 1, may be obtained by the circuits (9), whose form depends on the sign s of χ. The inverse of the controlled-X^α may be obtained from the above circuit by constructing the controlled-X^{−α} instead. Observe that this requires selecting the decomposition associated with the sign opposite to the physically available sign of χ. Other controlled roots of Paulis, such as the controlled-Y^α and the controlled-Z^α, are related to the controlled roots of the NOT gate by formulas (10) and (11), respectively. As a result, all controlled roots of Paulis can be constructed using at most one physical-level two-qubit XX gate.
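Formula (9) is not reproduced in this text. The sketch below is our own derivation of a single-XX construction with the properties described: one XX gate, and RY(±π/2) on the control whose sign is forced by the sign s of χ. It is checked numerically and should not be read as the exact form of (9). It assumes the principal root X^α = (I + X)/2 + e^{iπα}(I − X)/2 and the standard R(θ, φ) and XX(χ) = exp(−iχ X⊗X) conventions. It sets χ = −vπα/4 with v = −s·sign(α), so that χ has the physically available sign s. For α = ±1 it reduces to the CNOT sequence assumed for Figure 1.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


I2 = [[1, 0], [0, 1]]


def kron(a, b):
    n = len(b)
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(len(a) * n)]
            for i in range(len(a) * n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def XX(chi):
    c, s = math.cos(chi), -1j * math.sin(chi)
    return [[c, 0, 0, s], [0, c, s, 0], [0, s, c, 0], [s, 0, 0, c]]


def close(a, b, tol=1e-9):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(len(a)) for j in range(len(a)))


def controlled_x_alpha(alpha):
    """|0><0| (x) I + |1><1| (x) X^alpha, control = first tensor factor."""
    p = cmath.exp(1j * math.pi * alpha)
    a, b = (1 + p) / 2, (1 - p) / 2
    return [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, a, b], [0, 0, b, a]]


def one_xx_construction(alpha, s):
    """Hypothetical one-XX circuit for controlled-X^alpha on a pair whose chi has sign s.

    Returns (matrix with the global phase removed, chi used by the XX gate).
    """
    if alpha == 0:
        raise ValueError("alpha = 0 is the identity; no XX gate is needed")
    v = -s if alpha > 0 else s       # the RY sign is forced here, not free
    chi = -v * math.pi * alpha / 4   # sign s, magnitude pi*|alpha|/4
    layers = [  # in time order
        kron(RY(v * math.pi / 2), I2),
        XX(chi),
        kron(RX(v * math.pi * alpha / 2), RX(math.pi * alpha / 2)),
        kron(RY(-v * math.pi / 2), I2),
    ]
    u = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    for layer in layers:
        u = matmul(layer, u)
    phase = cmath.exp(1j * math.pi * alpha / 4)
    return [[phase * x for x in row] for row in u], chi
```

Unlike the CNOT case, v is not free here, which matches the text's remark that these decompositions are uniquely defined once the sign of χ is known.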
In our constructions, we favoured decompositions featuring RY(±π/2) gates whose rotation sign can be selected arbitrarily. This lets the choice of specific signs, or of the relations between signs, drive RY gate cancellations. In some cases, the results of such cancellations can be dramatic, as illustrated in the following Lemma.
Lemma 2. Circuits over arbitrary RZ rotations (including Pauli-Z, Phase, and T gates) and controlled-Z gates can be written as an efficient trapped-ion physical-level implementation as follows:
1. The layer of RY gates, defined as the set of gates RY(v_i π/2) applied to every qubit i, where v_i ∈ {−1, +1} can be selected arbitrarily.
2. The layer of RX, where qubit i experiences a single RX rotation. Its angle is determined by t, the aggregate angle of the RZ rotations applied to this qubit; by k, the number of CZ gates that apply to qubit i; and by s_{ix_1}, s_{ix_2}, ..., s_{ix_k}, the signs of the respective interactions.
Proof: Construct the controlled-Z gate using formula (11). Decompose the RZ and controlled-Z gates into RX/RY/XX circuits per formulas (4) and (11). Observe that each qubit in each RZ/CZ gate then experiences RY(vπ/2) at the beginning and RY(−vπ/2) at the end. Therefore, setting the parameter v equal to the previously used v makes every pair of RY gates between any two RZ/CZ gates in the target circuit cancel out. However, every qubit that experiences a gate retains a layer of RY at the beginning of the circuit and a layer of RY with opposite signs at the end. Next, notice that all internal gates (all gates except the outer RY layers) are RX and XX, and therefore they all commute. This means we need only one layer of RX, and the angle on each qubit is the sum of the RX angles contributed by formulas (4) and (11). Each CZ(a, b) in the original circuit is furthermore represented by XX_ab(s_ab π/4), per formula (11). Observe that requiring the layers of RY gates in Lemma 2 to share one sign across all RY gates used (v_i = v_j for every pair of qubits i and j) allows their implementation with a single global pulse. This capability is currently not supported by the existing hardware [6], but may potentially be implemented in the future.
[Footnote: This corresponds to the model where the errors are independent, and is consistent with the understanding of the physics and sources of errors in the R gates [6,9].]
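To make Lemma 2 concrete, the sketch below assumes one CZ decomposition consistent with the proof. It is our derivation, not necessarily formula (11):
CZ ≡ (RY(−v_0π/2).RX(−s v_1 π/2) ⊗ RY(−v_1π/2).RX(−s v_0 π/2)) . XX(sπ/4) . (RY(v_0π/2) ⊗ RY(v_1π/2)).
Under this assumption, the RX angle on qubit i is v_i t − (π/2) Σ_j s_{ij} v_j, summed over its CZ partners j. `lemma2_rx_angle` is a hypothetical helper implementing that rule. The check confirms, for two qubits, that the layered form RY layer, RX layer, XX, RY layer reproduces "RZ(t_0) ⊗ RZ(t_1), then CZ" up to a global phase for every choice of signs.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def RZ(t):
    return [[cmath.exp(-0.5j * t), 0], [0, cmath.exp(0.5j * t)]]


def kron(a, b):
    n = len(b)
    return [[a[i // n][j // n] * b[i % n][j % n] for j in range(len(a) * n)]
            for i in range(len(a) * n)]


def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]


def XX(chi):
    c, s = math.cos(chi), -1j * math.sin(chi)
    return [[c, 0, 0, s], [0, c, s, 0], [0, s, c, 0], [s, 0, 0, c]]


CZ = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, -1]]


def equal_up_to_phase(a, b, tol=1e-9):
    n = len(a)
    i, j = max(((r, c) for r in range(n) for c in range(n)), key=lambda rc: abs(b[rc[0]][rc[1]]))
    phase = a[i][j] / b[i][j]
    return abs(abs(phase) - 1) < tol and all(
        abs(a[r][c] - phase * b[r][c]) < tol for r in range(n) for c in range(n))


def lemma2_rx_angle(i, t, v, partners, sign):
    """Hypothetical helper: RX angle of qubit i under the assumed CZ decomposition.

    t: aggregate RZ angle on qubit i; v: {qubit: RY sign}; partners: CZ partners
    of i (one entry per CZ); sign(i, j): sign of chi for the pair (i, j).
    """
    return v[i] * t - (math.pi / 2) * sum(sign(i, j) * v[j] for j in partners)


def lemma2_two_qubit(t0, t1, v0, v1, s):
    """Layered form of 'RZ(t0) (x) RZ(t1), then CZ(0, 1)' on a pair with chi sign s."""
    v = {0: v0, 1: v1}
    a0 = lemma2_rx_angle(0, t0, v, [1], lambda i, j: s)
    a1 = lemma2_rx_angle(1, t1, v, [0], lambda i, j: s)
    layers = [  # in time order; the RX layer and the XX gate commute
        kron(RY(v0 * math.pi / 2), RY(v1 * math.pi / 2)),
        kron(RX(a0), RX(a1)),
        XX(s * math.pi / 4),
        kron(RY(-v0 * math.pi / 2), RY(-v1 * math.pi / 2)),
    ]
    u = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    for layer in layers:
        u = matmul(layer, u)
    return u


def original_circuit(t0, t1):
    return matmul(CZ, kron(RZ(t0), RZ(t1)))
```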

Useful single-qubit circuit identity
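Recall that a quantum template is a quantum circuit with n gates that evaluates to the identity, G_0 G_1 ... G_{n−1} = Id [17]. A template can be used to construct a number of circuit identities,
$G_i G_{i+1} \cdots G_{i+k-1} = G_{i-1}^{-1} G_{i-2}^{-1} \cdots G_{i+k}^{-1}$ (indices taken modulo n),
for arbitrary i and k, 0 ≤ i, k ≤ n. These identities may, in turn, be used to optimize quantum circuits by matching gates on the left-hand side of the above equation and replacing them with the gates on the right-hand side. Here we report one such template that is particularly useful in our constructions. Specifically, for any a, b ∈ ℝ there exist c, d ∈ ℝ such that
$\mathrm{RX}(a).\mathrm{RY}(b).\mathrm{RX}(a) = R(c, d), \quad (12)$
where $\cos\frac{c}{2} = \cos a \cos\frac{b}{2}$ and $d = \operatorname{atan2}\left(\sin\frac{b}{2},\ \sin a \cos\frac{b}{2}\right)$. The above template may be used in multiple ways.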
• Firstly, it allows the replacement of RX(a)RY (b)RX(a) with R(c, d).The latter circuit always has a smaller duration and a higher overall fidelity (smaller contribution to the overall error), see Figure 2(a)(c).
• This template may also be used to replace RX(a)RY(b) with R(c, d)RX(−a), trading runtime for error. For example, suppose RX(π/2)RY(−π/2) in the circuit in Figure 1 were replaced with R(π, −π/4)RX(−π/2). The cost vector of that part of the computation would then change from (τ_1q, 2ε) to (1.5τ_1q, ε). This increases the runtime by 0.5τ_1q in exchange for increasing the fidelity from (1 − ε)^2 to 1 − ε. Since this rule applies to any pair of RX and RY gates, it allows substantial flexibility in exchanging runtime for error.
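Template (12) can be checked numerically, together with the rewrite RX(a)RY(b) = R(c, d)RX(−a) derived from it. This sketch assumes the standard R(θ, φ) convention and the '.' matrix order. The parameters come from expanding the product:
RX(a).RY(b).RX(a) = cos a cos(b/2) I − i[sin a cos(b/2) X + sin(b/2) Y],
which we read off as cos(c/2) I − i sin(c/2)[cos d X + sin d Y]. The name `template_cd` is hypothetical. It uses `atan2` instead of `arccos` for numerical stability; the two are equivalent here.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def dot(*ms):
    """m1 . m2 . ...; the rightmost matrix is the first gate in time."""
    out = [[1, 0], [0, 1]]
    for m in ms:
        out = [[sum(out[i][k] * m[k][j] for k in range(2)) for j in range(2)] for i in range(2)]
    return out


def close(a, b, tol=1e-9):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


def template_cd(a, b):
    """(c, d) such that RX(a).RY(b).RX(a) = R(c, d) exactly."""
    w = math.cos(a) * math.cos(b / 2)   # coefficient of I
    x = math.sin(a) * math.cos(b / 2)   # coefficient of -iX
    y = math.sin(b / 2)                 # coefficient of -iY
    return 2 * math.atan2(math.hypot(x, y), w), math.atan2(y, x)


def pulse_cost(theta, tau=1.0):
    """(duration, error/eps) of one pulse under the e_1 model."""
    return abs(theta) * tau / math.pi, abs(math.sin(theta))


c, d = template_cd(math.pi / 2, -math.pi / 2)
print(c / math.pi, d / math.pi)  # approximately 1.0 and -0.25, i.e. R(pi, -pi/4)
```

The printed pair reproduces the R(π, −π/4) used in the example above. With τ_1q = ε = 1, `pulse_cost` reproduces the change from (1, 2) to (1.5, 1).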
In our approach to the optimization of quantum circuits, we favour gate decompositions relying on RX, RY, and XX gates. Their efficiency is evidenced by short elementary gate decompositions (Z and its roots, Hadamard, CNOT, controlled roots of Paulis) and by favourable in-circuit gate cancellations, such as those illustrated in Lemma 2. Template (12) enables the next set of optimizations. Specifically, in a circuit over RX, RY, and XX gates, template (12) allows one to "commute" an arbitrary RX to the right (or left) of every RY it meets. This is done by replacing RX(a)RY(b) with the properly defined R(c, d)RX(−a). Such "commutation" changes the sign of the RX parameter, so it is a special kind of commutation. Recall that RX(a)RX(b) = RX(a + b mod 2π) and that RX commutes with XX. Together, these let us "commute" all RX gates to the end (or beginning) of the circuit by replacing RY gates with R gates. This reduces the number of RX gates to at most one per qubit. This result is summarized in the following Lemma.
Lemma 3. Any quantum physical-level circuit over n qubits with G_XX two-qubit XX gates, G_RY single-qubit RY gates, and G_RX single-qubit RX gates can be reduced to an equivalent one with no more than G_XX two-qubit XX gates, G_RY single-qubit R gates, and at most n single-qubit RX gates.
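The RX "commutation" behind Lemma 3 can be sketched for a single wire as a linear-time pass. This sketch assumes template (12) in the form RX(a).RY(b).RX(a) = R(c, d), so that RX(a).RY(b) = R(c, d).RX(−a). The pass scans the circuit from its last gate to its first, carrying one pending RX toward the start. Each RY it meets becomes one R gate. The names `push_rx_to_start`, `gate_matrix`, and `unitary` are hypothetical. Across a whole circuit, the pending RX could also be carried through XX gates, since they commute, but that is not modelled here.

```python
import cmath
import math


def R(theta, phi):
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s], [-1j * cmath.exp(1j * phi) * s, c]]


def RX(t):
    return R(t, 0.0)


def RY(t):
    return R(t, math.pi / 2)


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def close(a, b, tol=1e-9):
    return all(abs(a[i][j] - b[i][j]) < tol for i in range(2) for j in range(2))


def template_cd(a, b):
    """(c, d) with RX(a).RY(b).RX(a) = R(c, d)."""
    w, x, y = math.cos(a) * math.cos(b / 2), math.sin(a) * math.cos(b / 2), math.sin(b / 2)
    return 2 * math.atan2(math.hypot(x, y), w), math.atan2(y, x)


def push_rx_to_start(gates):
    """gates: ('RX', a) / ('RY', b) tuples in time order.

    Returns an equivalent list: at most one leading ('RX', a), then ('R', c, d) pulses.
    """
    pending = 0.0  # RX angle being moved toward the start of the circuit
    out = []       # collected from the end of the circuit backwards
    for kind, angle in reversed(gates):
        if kind == "RX":
            pending += angle  # RX(a)RX(b) = RX(a + b)
        elif kind == "RY":
            # Locally: RY(b), then RX(pending). RX(p).RY(b) = R(c, d).RX(-p), so the
            # R pulse stays here and RX(-p) keeps moving toward the start.
            c, d = template_cd(pending, angle)
            out.append(("R", c, d))
            pending = -pending
        else:
            raise ValueError(f"unsupported gate {kind!r}")
    if pending != 0.0:
        out.append(("RX", pending))
    out.reverse()
    return out


def gate_matrix(g):
    if g[0] == "RX":
        return RX(g[1])
    if g[0] == "RY":
        return RY(g[1])
    return R(g[1], g[2])


def unitary(gates):
    u = [[1, 0], [0, 1]]
    for g in gates:  # later gates multiply from the left
        u = matmul(gate_matrix(g), u)
    return u
```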

Compiling quantum algorithms into physical-level circuits
Define two circuit cost metrics, a coarse-grain one and a fine-grain one. The coarse-grain metric counts the number of two-qubit controlled roots of Paulis in a quantum circuit. The fine-grain metric is the cost vector (time, error), where time sums the runtimes of all gates used, and error combines all errors. Once a control apparatus capable of executing gates in parallel is developed, the definition of time will change to the sum of the times along the critical path. Other changes to the above cost metrics may also be accommodated, depending on improvements in the control apparatus and/or adjustments to the error model.
The following reports all steps taken by our overall design approach that maps a quantum algorithm into an optimized physical-level experiment.
1. Choose an algorithm and map it into a high-level logical circuit with the help of a quantum programming language, if needed.
2. Synthesize all arithmetic and oracle parts, if not explicitly supplied. Arithmetic circuits are chosen from known libraries, and oracles are synthesized using known reversible logic synthesis algorithms [18]. Optimize the resulting implementation over the coarse-grain cost metric [19].
3. Decompose the multiple-control Toffoli gates into smaller gates, such as three-qubit Toffoli gates and small relative-phase Toffoli gates [20]. Optimize the circuits over the coarse-grain cost metric using peephole optimization [21] and templates [17].
4. Break down all gates into two-qubit controlled roots of Paulis and arbitrary single-qubit gates using optimal implementations [1], and optimize the resulting decompositions using templates [17] over the coarse-grain metric. At the end of this stage, we should have reached the limit of optimization for the number of the most expensive, two-qubit gates. We therefore switch gears and employ the fine-grain metric from here on.
5. Map logical qubits to physical qubits so as to minimize the use of the least desirable interactions (those with the lowest two-qubit XX gate fidelities) and to maximize gate cancellations during further optimization. To accomplish the latter, record the positions and signs of all RY(±π/2) gates in the expansions of the controlled roots of Paulis. These signs cannot be varied, as they depend on the sign of χ (9). Favour physical qubit mappings that result in RY cancellations on the control qubits. A cancellation happens between two controls when there are no gates between them and the signs of χ for the two corresponding XX rotations are equal. As long as the number of qubits remains small, this search can be done exhaustively. Otherwise, a mix of subgraph isomorphism and greedy heuristics can be employed (greedy heuristics make a local choice at each stage in the hope that this has little effect on global optimality).

6. Decompose further into the physical-level circuit and optimize the resulting implementation:
(a) Perform the controlled-root-of-Pauli gate substitutions; these are now uniquely defined per formulas (9). Decompose all single-qubit gates except Hadamard and RZ into circuits over RX/RY. Choose the CNOT (Figure 1), controlled-Y (10), controlled-Z (11), Hadamard (6), and RZ (4) gate decompositions so as to maximize the cancellations of RY(±π/2) with RY(∓π/2). Cancel these RY pairs and combine the parameters of neighbouring RX/RY/XX gates.
(b) Apply template (12) to reduce the selected figure of merit. To optimize gate count and error, "commute" RX to one side of the circuit. (Incidentally, although the error model was chosen so that runtime and error conflict as optimization goals, error optimization also reduces the runtime. This is perhaps not surprising: RX "commutation" significantly reduces the RX gate count, and both errors and runtime, as modelled, originate from the application of gates.) To optimize the duration, find RY(b) gates such that RX gates with equal or similar parameters can be commuted to them from both the left and the right, and replace RX(a)RY(b)RX(a) with the appropriate R(c, d).
(c) Perform further balancing of runtime vs error using the template (12) until the desired balance is found or no more improvement can be achieved.
(d) If single-qubit gate sequences of length 3 or more are found, replace them with sequences of two R gates, per formula (7). Rewrite all remaining single-qubit RX and RY gates as physical-level R pulses.
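The exhaustive search mentioned in step 5 can be sketched for a handful of qubits. This is a simplified illustration that models only the first goal of step 5. It minimizes the summed two-qubit error over all logical-to-physical assignments and ignores the RY-cancellation criterion. The error table is hypothetical and is not the calibration data of [6]. `best_mapping` is an illustrative name.

```python
import math
from itertools import permutations


def best_mapping(n_logical, interactions, pair_error):
    """Exhaustive logical-to-physical qubit assignment (feasible for a few qubits).

    interactions: one (i, j) logical pair per two-qubit gate in the circuit.
    pair_error:   {frozenset({p, q}): error of an XX gate on physical pair p, q}.
    """
    physical = sorted({p for pair in pair_error for p in pair})
    best, best_cost = None, math.inf
    for image in permutations(physical, n_logical):
        cost = sum(pair_error[frozenset((image[i], image[j]))] for i, j in interactions)
        if cost < best_cost:
            best, best_cost = dict(enumerate(image)), cost
    return best, best_cost


# Hypothetical error table: every pair 0.04 except a better-calibrated triangle {2, 4, 5}.
errors = {frozenset(p): 0.04 for p in [(1, 2), (1, 3), (1, 4), (1, 5), (2, 3),
                                       (2, 4), (2, 5), (3, 4), (3, 5), (4, 5)]}
for p in [(2, 4), (2, 5), (4, 5)]:
    errors[frozenset(p)] = 0.02
# Two-qubit gates of the 5-gate Toffoli circuit on logical qubits a=0, b=1, c=2.
toffoli_pairs = [(1, 2), (0, 1), (1, 2), (0, 1), (0, 2)]
print(best_mapping(3, toffoli_pairs, errors))
```

The search visits 5!/2! = 60 assignments here. Its cost grows factorially with the number of qubits, which is why the text falls back on subgraph isomorphism and greedy heuristics for larger machines.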
The bulk of work and the optimization not previously explicitly considered in the literature falls into steps 5 and 6 of the above approach.The complexity of the algorithms employed in step 6 is described by a low degree polynomial in the number of gates in the circuit (the degree of the polynomial is one in the simplest case of greedy algorithms).The complexity of the algorithms used in step 5 depends on the efficiency of heuristics employed.Steps 1-4 rely on a combination and a modification of the known techniques.

Benchmark results
The physical trapped-ion machine we have access to [6] currently has the following parameters: τ_1q = 20 µs and τ_2q = 235 µs. The single-qubit and two-qubit gate errors are approximately ε ∼ 0.01 and E ∼ 0.04, respectively. The value of E may vary slightly depending on the pair of qubits the gate is applied to. The sign of χ_{i,j} depends on the particular interaction between qubits i and j. Listing pairs of qubit numbers from the set of five, {1, 2, 3, 4, 5}, χ is positive for the interactions 12, 14, 23, 25, 34, 35, and 45, and negative for the interactions 13, 15, and 24. Given the physical error rates, we expect to be able to run circuits with no more than about 15 two-qubit gates. Therefore, the bulk of the work in this section is devoted to developing and optimizing experiments that satisfy this condition. In particular, our goal in this section is to propose experiments that make maximal use of the capabilities of the specific trapped-ion machine [6]. We also aim to design those experiments with maximal efficiency while following exactly the procedures described in the previous sections. In the following subsection, we illustrate how the above circuit compilation techniques implement Boolean multiplication, and then report the design of more advanced experiments. The majority of the computations proposed here scale beyond those previously demonstrated in experiments.
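The sign table and the "about 15 two-qubit gates" budget can be made concrete with a back-of-envelope sketch. The signs follow the list above. The success estimate simply multiplies per-gate fidelities, assuming independent errors and the quoted ε and E. This is a rough illustration, not the paper's error model or an experimental prediction. The function names are hypothetical.

```python
from itertools import combinations

NEGATIVE_PAIRS = {(1, 3), (1, 5), (2, 4)}  # from the list above; all other pairs are positive


def chi_sign(i, j):
    """Sign of chi_{i,j} for physical qubits i != j in {1, ..., 5}."""
    return -1 if (min(i, j), max(i, j)) in NEGATIVE_PAIRS else +1


def rough_success(n_two_qubit, n_single_qubit=0, eps=0.01, E=0.04):
    """Product of per-gate fidelities, assuming independent errors (a rough assumption)."""
    return (1 - E) ** n_two_qubit * (1 - eps) ** n_single_qubit


positive = [p for p in combinations(range(1, 6), 2) if chi_sign(*p) > 0]
print(positive)
print(round(rough_success(15), 3))
```

Under this crude estimate, 15 two-qubit gates alone already bring the expected success probability close to one half, before any single-qubit errors are included.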
Related work on circuit optimization for trapped-ion technology was performed in [22], which relied on a gradient-ascent-type algorithm to optimize gate sequences. Our optimization is based on establishing relations between gate decompositions and applying template (12). We therefore expect both the computational complexity and the practical efficiency of our algorithms to be better than those of gradient-ascent-type techniques. Specifically, the algorithm employed in our work for single-qubit circuit optimization, aimed at reducing error or gate count, runs in time linear in the number of gates. The algorithm for two-qubit gate optimization is at most cubic, and its complexity can be reduced to linear with minimal loss in output quality [17]. The physical-level gates used in our work appear to differ from those employed in [22], so no direct comparison can be made. We note, however, the following gate counts:
• Our CNOT implementation (Figure 1) contains 1 XX gate and 4 single-qubit pulses, whereas [22, Section IV.A] reports an implementation with 2 XX gates and 8 single-qubit pulses.
• Our physical-level implementation of the circuit CNOT […] contains 2 XX gates and 4 single-qubit pulses, whereas [22, Section IV.A] reports an implementation with 2 XX gates and 7 single-qubit pulses.
A recent paper [23] reports a numerical optimization approach to designing trapped ions circuits over global Mølmer-Sørensen gates (defined as XX_{i,j}(χ) applied to every pair of ions i and j, with a controllable parameter χ), global RX and RY gates, and local single-qubit RZ rotations. In contrast, our work focuses on quantum computations by local gates. The two (local vs. global) are very different types of control, and the corresponding circuits are incomparable: for instance, our circuits work independently of the number of qubits the computation runs over, whereas circuits in [23] in general depend on the number of qubits used. That said, global control can be used in a way that is independent of the number of qubits involved in the computation. Specifically, an arbitrary local entangling CNOT gate can be constructed using at least two global Mølmer-Sørensen interactions and some number of single-qubit gates, which makes it possible to express all quantum algorithms (the majority of which are, in fact, described in terms of local operations) using global entangling gates. However, the number of global Mølmer-Sørensen gates used would then be twice the number of local Mølmer-Sørensen gates required to accomplish the same task using local control.

Implementing Boolean multiplication
To implement the Boolean multiplication on the trapped ions machine, our algorithm takes the following steps.

1-4.
Steps 1-3 identify that the computation we wish to perform is given by the Toffoli gate, TOF[a, b; c], and step 4 finds a circuit with 5 two-qubit gates implementing the Toffoli gate [14] using CNOT and controlled-√NOT gates.
5. Physical qubits from the set {1, 2, 3, 4, 5} are mapped onto logical qubits: we choose 2, 4, and 5 (top to bottom) so as to rely on the high-fidelity interactions [6, Table 1]. The controlled roots-of-NOT gates do not feature neighbouring controls, which imposes no further conditions on the mapping.
6. a. Now that the signs of χ_{i,j} are known, mark the controls of the controlled-V/V† gates with +/− according to the sign of the RY rotation that appears on them when they are decomposed into physical pulses using formulas (9), and then choose the CNOT gate decompositions (Figure 1) so as to maximize RY cancellations. The above choice of CNOT decompositions allows two pairs of RY(π/2)/RY(−π/2) gates on the first qubit to cancel, as evidenced by the "meeting" plus and minus signs.
b. Perform gate substitutions and cancellations to obtain the circuit shown in Figure 3(a).
c. The rest of the single-qubit optimization algorithm is based on the template (12) and works differently depending on the criteria for the remaining optimization. If runtime is the goal, the algorithm tries to find triples of gates RX−RY−RX that may be replaced with the R gate so as to minimize the overall duration. This allows us to construct the circuit pictured in Figure 3(b). If minimizing the error is preferred, the algorithm "commutes" all RX gates to the left. The result of this optimization may be found in Figure 3(c). We conclude the optimization by layering the single-qubit gates sharing the same duration, moving RX gates, whenever possible, so as to allow their sequential execution; this helps to optimize the classical control sequences.
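The five-gate construction found in step 4 relies on V² = NOT for the square root of NOT, V. The sketch below checks one standard ordering of such a circuit (two controlled-V gates, one controlled-V†, two CNOTs) by brute-force simulation on all eight basis states in pure Python, comparing amplitudes and phases. The gate order in the paper's own circuit, which is not reproduced here, may differ, and the helper names are hypothetical.

```python
from itertools import product

# Square root of NOT, V = (1+i)/2 [[1, -i], [-i, 1]], so that V @ V = X.
V = [[(1 + 1j) / 2, (1 - 1j) / 2],
     [(1 - 1j) / 2, (1 + 1j) / 2]]
V_DAG = [[V[c][r].conjugate() for c in range(2)] for r in range(2)]
X = [[0, 1], [1, 0]]
A, B, C = 0, 1, 2  # qubit labels; basis index = 4a + 2b + c


def controlled(u, control, target, n=3):
    """Gate applying the 2x2 matrix u to `target` whenever `control` is 1."""
    cmask, tmask = 1 << (n - 1 - control), 1 << (n - 1 - target)

    def apply(state):
        new = list(state)
        for i in range(len(state)):
            if (i & cmask) and not (i & tmask):
                j = i | tmask  # partner index with the target bit set
                new[i] = u[0][0] * state[i] + u[0][1] * state[j]
                new[j] = u[1][0] * state[i] + u[1][1] * state[j]
        return new

    return apply


# Time order, left to right: CV(b,c), CNOT(a,b), CV^dagger(b,c), CNOT(a,b), CV(a,c).
TOFFOLI_5 = [controlled(V, B, C), controlled(X, A, B), controlled(V_DAG, B, C),
             controlled(X, A, B), controlled(V, A, C)]


def is_toffoli(circuit, tol=1e-12):
    """Compare every output column (amplitudes and phases) with TOF[a, b; c]."""
    for a, b, c in product((0, 1), repeat=3):
        state = [0j] * 8
        state[4 * a + 2 * b + c] = 1
        for gate in circuit:
            state = gate(state)
        expected = 4 * a + 2 * b + (c ^ (a & b))
        if any(abs(state[k] - (k == expected)) > tol for k in range(8)):
            return False
    return True


print(is_toffoli(TOFFOLI_5))
```

Since the target picks up V^b V^{−(a⊕b)} V^a, it receives V² = NOT exactly when a = b = 1 and the identity otherwise; replacing the last gate by a controlled-V† breaks this, which the check detects.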
Observe that at the stage of the decomposition of the two-qubit logical gates into implementable physical-level gates, the single-qubit pulse count was 20. The parametrized CNOT implementation per Figure 1, along with the algorithm for joint gate decomposition and the commutation of RX gates through the application of template (12), allowed us to reduce the single-qubit pulse count from the original 20 down to 9 (Figure 3(c)). Had the previously known circuitry implementing the CNOT gate with two more single-qubit pulses [6,15] been used, the original unoptimized single-qubit gate count would have been 30. The upper bound on the number of single-qubit gates in a circuit of this size, as given by the application of Lemma 1, is 26. Had the efficient controlled-root-of-Pauli implementation introduced in this paper not been used, the resource count would have been substantially higher, affecting not only the single-qubit gates but also the two-qubit gates [16]. This illustrates the power of the approach reported in this paper.
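Replacing an RX−RY−RX triple by a single R pulse is possible because, when the two outer angles are equal, the product has no Z component: RX(a)RY(b)RX(a) = cos(b/2)cos(a) I − i[cos(b/2)sin(a) X + sin(b/2) Y]. The sketch below derives (θ, φ) in closed form and checks it numerically. It assumes the convention R(θ, φ) = exp(−iθ(cos φ X + sin φ Y)/2), with RX(θ) = R(θ, 0) and RY(θ) = R(θ, π/2); it illustrates the algebra only and is not a transcription of template (12).

```python
import cmath
import math


def r_gate(theta, phi):
    """R(theta, phi) = exp(-i theta/2 (cos(phi) X + sin(phi) Y)) as a 2x2 list."""
    c, s = math.cos(theta / 2), math.sin(theta / 2)
    return [[c, -1j * cmath.exp(-1j * phi) * s],
            [-1j * cmath.exp(1j * phi) * s, c]]


def rx(theta):
    return r_gate(theta, 0.0)


def ry(theta):
    return r_gate(theta, math.pi / 2)


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def merge_rx_ry_rx(a, b):
    """(theta, phi) such that R(theta, phi) equals RX(a) RY(b) RX(a) exactly.

    Expanding the product gives w I - i (x X + y Y) with
    w = cos(b/2) cos(a), x = cos(b/2) sin(a), y = sin(b/2); there is no Z term.
    """
    w = math.cos(b / 2) * math.cos(a)
    x = math.cos(b / 2) * math.sin(a)
    y = math.sin(b / 2)
    theta = 2 * math.atan2(math.hypot(x, y), w)  # theta/2 in [0, pi]
    phi = math.atan2(y, x)
    return theta, phi


def max_abs_diff(p, q):
    return max(abs(p[i][j] - q[i][j]) for i in range(2) for j in range(2))


for a, b in [(0.3, 1.1), (math.pi / 2, -math.pi / 2), (-2.0, 2.5)]:
    triple = matmul(matmul(rx(a), ry(b)), rx(a))
    theta, phi = merge_rx_ry_rx(a, b)
    print(f"a={a:+.3f} b={b:+.3f} -> theta={theta:.4f}, phi={phi:+.4f}, "
          f"error={max_abs_diff(triple, r_gate(theta, phi)):.1e}")
```

Because the triple is palindromic, matrix order and circuit order coincide. Multiplying by RX(−a), which undoes RX(a), also gives the pair replacements RX(a)RY(b) = R(θ, φ)RX(−a) and RY(b)RX(a) = RX(−a)R(θ, φ).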
Our Toffoli gate implementation, Figure 3(b), takes 1285 µs, which compares favourably to the 1500 µs reported in [24]. The advantage in duration, however, is attributable to the different hardware [6] that we rely on in our work. The difference in fidelity between our implementation and the one reported in [24] needs to be established experimentally. The major difference between our implementation and the one reported in [24] is that our circuit is a true Toffoli gate that can be used (and in fact is used in the next subsection) as a primitive in the implementation of quantum algorithms, whereas [24] reports a Toffoli gate implemented up to a relative phase. Such an implementation may not be used in quantum algorithms directly; specifically, it would give an incorrect answer if used within Grover's search [25].

Advanced experiments
Table 1 reports the result of applying the above techniques to the design and optimization of circuits implementing advanced quantum computational experiments. None of the implementations proposed in Table 1 has yet been demonstrated in an experiment. Table 1 lists the name of the algorithm/function, the number of qubits the developed implementation uses, the number of physical-level single-qubit/two-qubit pulses, the overall runtime of the circuit, and all sources of errors. All implementations reported in Table 1 are true implementations of the respective algorithms/functions, in that we report exact unitaries (as opposed to unitaries up to a relative phase), treat black boxes as black boxes (no optimizations crossing black-box boundaries), and precisely follow the formulation of the respective algorithms and specifications. Our computations thus scale properly to larger numbers of qubits. Furthermore, the quantum state is initialized to |00000⟩ (as opposed to a state containing partial results of a computation) before any of the circuits are applied.
For Grover's algorithm, we considered the scenario where the three-bit Boolean function f(x) = f(x_1, x_2, x_3), implemented as Grover's oracle |x, y⟩ → |x, y ⊕ f(x)⟩, marks two items in the database. There are 28 such functions. The superscript in the name of the Grover function in Table 1 indicates the bit strings encoded by the oracle, that is, the items the reported search looks for. The efficiency of the Grover circuit is determined by the efficiency of the implementation of the oracle, which in turn is determined by the Hamming distance between the items it marks. We treat the oracle as a black box and do not allow optimizations across its boundary. While we can implement Grover's algorithm over any oracle marking two items, only a few representative circuits are included in the table.
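The count of 28 is the number of ways to choose the two marked items among the 2³ = 8 inputs, C(8, 2) = 28. Because the text ties the oracle cost to the Hamming distance between the marked items, the short enumeration below (a sketch; the names are illustrative) also shows how the 28 functions split by that distance.

```python
from collections import Counter
from itertools import combinations, product

# A 3-bit function marking exactly two items is determined by the unordered
# pair of bit strings it marks.
items = ["".join(bits) for bits in product("01", repeat=3)]
pairs = list(combinations(items, 2))


def hamming(u, v):
    """Number of positions in which bit strings u and v differ."""
    return sum(a != b for a, b in zip(u, v))


by_distance = Counter(hamming(u, v) for u, v in pairs)
print(len(pairs), sorted(by_distance.items()))
```

The enumeration yields 12 pairs at Hamming distance 1, 12 at distance 2, and 4 at distance 3.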
To implement Grover's algorithm over a Boolean function marking one item, the oracle needs to be a Toffoli-4 gate (the triply-controlled Toffoli, TOF[a, b, c; d]). We calculated that a single Grover iteration then requires 16 two-qubit gates. The algorithmic probability of reading out the correct answer after applying a single Grover iterate is 0.78125, which beats the classical probability of finding the answer with a single query, 0.125; therefore, it may be interesting to run such an experiment as well.
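The success probability of 0.78125 for one Grover iterate over one marked item out of eight follows from the ideal amplitude dynamics: with sin²θ = 1/8, one iterate gives sin²(3θ) = 25/32 = 0.78125. The sketch below tracks amplitudes only, ignoring gate errors and the physical circuit; the function name is hypothetical.

```python
import math


def grover_success(n_qubits, marked, iterations=1):
    """Ideal (noise-free) probability of measuring a marked item after the given
    number of Grover iterates, starting from the uniform superposition."""
    size = 2 ** n_qubits
    amp = [1 / math.sqrt(size)] * size
    for _ in range(iterations):
        # Oracle: flip the sign of every marked amplitude.
        amp = [-a if i in marked else a for i, a in enumerate(amp)]
        # Diffusion operator: inversion of every amplitude about the mean.
        mean = sum(amp) / size
        amp = [2 * mean - a for a in amp]
    return sum(amp[i] ** 2 for i in marked)


print(grover_success(3, {0b101}))            # one item out of 8
print(grover_success(3, {0b011, 0b110}))     # two items out of 8
print(grover_success(4, {0b1110, 0b1111}))   # two items out of 16
```

The same function gives probability 1 for two marked items out of eight after a single iterate, consistent with one iterate sufficing for the 2-out-of-8 searches, and 0.78125 again for two marked items out of sixteen, matching the 4-qubit search mentioned in the next paragraph.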
Grover's algorithm may furthermore be implemented using the oracle function g that computes the unknown Boolean function f into the phase, g : |x⟩ → (−1)^{f(x)}|x⟩ [1]. For the 2-out-of-8 search, the function g can be thought of as a Z/CZ circuit (see Lemma 2, which outlines the construction of efficient Z/CZ circuits). The number of CZ gates used ranges between one and three. This means that the entire Grover search with such a phase oracle can be implemented using between 6 and 8 two-qubit XX gates. A 1-out-of-8 search with a single Grover iterate and algorithmic success probability of 0.78125 over a phase oracle can be accomplished with 10 two-qubit XX gates. Finally, some 4-qubit Grover searches may be possible to demonstrate. Specifically, the search for the phase-marked items {1110, 1111} can be accomplished by a trapped ions circuit with 16 two-qubit XX gates and algorithmic success probability of 0.78125.
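One way to see why a two-item phase oracle needs between one and three CZ gates: over GF(2), the indicator function of two bit strings u and v has no cubic term (the two x_1x_2x_3 terms cancel), and the coefficient of x_ix_j equals u_k ⊕ v_k for the remaining index k. Each quadratic monomial of f becomes one CZ in (−1)^{f(x)}, so the CZ count equals the Hamming distance between u and v. The sketch below checks this with a binary Möbius transform; it illustrates the counting argument and is not the construction of Lemma 2. Function names are hypothetical.

```python
from itertools import combinations


def anf(truth_table):
    """Algebraic normal form via the binary Moebius transform: coef[m] == 1
    means the monomial prod_{i in bits(m)} x_i appears in f (over GF(2))."""
    coef = list(truth_table)
    n = len(coef).bit_length() - 1
    for i in range(n):
        step = 1 << i
        for x in range(len(coef)):
            if x & step:
                coef[x] ^= coef[x ^ step]
    return coef


def phase_oracle_terms(marked, n=3):
    """Split (-1)^f, with f the indicator of `marked`, by monomial degree:
    degree 1 -> Z gate, degree 2 -> CZ gate, degree 3 -> CCZ gate
    (the constant term contributes only a global phase)."""
    coef = anf([1 if x in marked else 0 for x in range(2 ** n)])
    terms = {d: [] for d in range(1, n + 1)}
    for m in range(1, 2 ** n):
        if coef[m]:
            terms[bin(m).count("1")].append(m)
    return terms


for u, v in combinations(range(8), 2):
    terms = phase_oracle_terms({u, v})
    assert not terms[3]                            # no CCZ is ever needed
    assert len(terms[2]) == bin(u ^ v).count("1")  # CZ count = Hamming distance
print("all 28 two-item phase oracles need 1 to 3 CZ gates and no CCZ")
```

For example, marking {110, 111} gives f = x_1x_2, a single CZ with no Z gates.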
The QFT5 circuit reported in Table 1 can be compared head-to-head with the one found in [6]. Specifically, both circuits are true [1, Figure 5.1] (as opposed to semiclassical [26]) implementations of the 5-qubit QFT, featuring 10 two-qubit gates. Our circuit benefited from careful design, and as a result its single-qubit gate count is just 22, compared to 70 in [6]. This constitutes a reduction of the single-qubit gate count by a factor of more than three, clearly illustrating the benefits of our approach.
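The two-qubit count of 10 comes from the structure of the textbook circuit [1, Figure 5.1]: each qubit receives one controlled phase rotation from every later qubit, for n(n−1)/2 = 10 controlled rotations when n = 5. The sketch below generates that gate list with hypothetical gate tuples and omits the final bit-reversal swaps, assuming they are absorbed into a relabelling of the qubits; it says nothing about the single-qubit pulse counts, which depend on how each gate is compiled.

```python
import math


def textbook_qft(n):
    """Gate list of the textbook QFT circuit: on each qubit, a Hadamard followed
    by controlled phase rotations R_k = diag(1, exp(2*pi*i / 2**k)) from every
    later qubit. The final qubit-reversal swaps are omitted."""
    gates = []
    for target in range(n):
        gates.append(("H", target))
        for k, control in enumerate(range(target + 1, n), start=2):
            gates.append(("CP", control, target, 2 * math.pi / 2 ** k))
    return gates


gates = textbook_qft(5)
n_h = sum(1 for g in gates if g[0] == "H")
n_cp = sum(1 for g in gates if g[0] == "CP")
print(n_h, n_cp)  # 5 Hadamards and 10 = 5*4/2 controlled phase rotations
```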
To our knowledge, the Toffoli-4 gate circuit reported in Table 1 is the first such circuit containing no more than 11 two-qubit gates. The previous best result is 12 CNOTs [20].

Conclusion
In this paper we reported a complete strategy for the automatic execution of quantum algorithms on a trapped ions quantum machine. Our contributions include the design of the complete data flow from the algorithm level down to the physical level; algorithms for circuit cost optimization, specifically through combining gate decompositions so as to enforce gate cancellations and through single-qubit gate optimization; and physical-level designs of quantum logical gates and computational primitives. In particular, we demonstrated simplified designs of computational primitives suitable for a range of QIP proposals relying on the physical control provided by the R and XX gates, and designed optimized computational experiments suitable for execution on a specific machine available in the lab. We furthermore note that the circuit optimization techniques developed in this paper are basic and generic (the key limitation is the reliance on the specific gate library), and thus can be modified and applied in conjunction with numerous optimization criteria, or even over a different QIP platform.
Our results help to bridge the gap between quantum computational experiments and a fully-fledged quantum computer: indeed, our approach allows one to automatically design and execute quantum algorithms on existing hardware, which may be described as programming a quantum computer.

RX(π), RY(π), and RZ(π) implement the Pauli-X, Pauli-Y, and Pauli-Z gates up to an undetectable global phase of −i. RX(π/2) implements the square-root-of-NOT gate $V := \frac{1+i}{2}\begin{pmatrix} 1 & -i \\ -i & 1 \end{pmatrix}$ up to a global phase. The rotation RZ(π/2) implements the quantum Phase gate (commonly referred to as P or S) up to a global phase. The quantum π/8 gate, also known as the T gate, is obtained as RZ(π/4), up to a global phase.
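These identities are easy to confirm numerically. The sketch below assumes the usual conventions RX(θ) = exp(−iθX/2), RY(θ) = exp(−iθY/2), and RZ(θ) = exp(−iθZ/2); under them, RX(π), RY(π), and RZ(π) carry the phase −i, RX(π/2) = e^{−iπ/4}V, RZ(π/2) = e^{−iπ/4}S, and RZ(π/4) = e^{−iπ/8}T. The helper `global_phase` is hypothetical.

```python
import cmath
import math


# Assumed conventions: R_P(t) = exp(-i t P / 2) for P in {X, Y, Z}.
def rx(t):
    return [[math.cos(t / 2), -1j * math.sin(t / 2)],
            [-1j * math.sin(t / 2), math.cos(t / 2)]]


def ry(t):
    return [[math.cos(t / 2), -math.sin(t / 2)],
            [math.sin(t / 2), math.cos(t / 2)]]


def rz(t):
    return [[cmath.exp(-1j * t / 2), 0], [0, cmath.exp(1j * t / 2)]]


X = [[0, 1], [1, 0]]
Y = [[0, -1j], [1j, 0]]
Z = [[1, 0], [0, -1]]
V = [[(1 + 1j) / 2, (1 - 1j) / 2], [(1 - 1j) / 2, (1 + 1j) / 2]]  # sqrt(NOT)
S = [[1, 0], [0, 1j]]
T = [[1, 0], [0, cmath.exp(1j * math.pi / 4)]]


def global_phase(u, w, tol=1e-12):
    """Return p with |p| = 1 and u == p * w entrywise, or None if none exists."""
    i, j = max(((r, c) for r in range(2) for c in range(2)),
               key=lambda rc: abs(w[rc[0]][rc[1]]))
    p = u[i][j] / w[i][j]
    if abs(abs(p) - 1) > tol:
        return None
    if all(abs(u[r][c] - p * w[r][c]) <= tol for r in range(2) for c in range(2)):
        return p
    return None


checks = {
    "RX(pi) ~ X": (rx(math.pi), X),
    "RY(pi) ~ Y": (ry(math.pi), Y),
    "RZ(pi) ~ Z": (rz(math.pi), Z),
    "RX(pi/2) ~ V": (rx(math.pi / 2), V),
    "RZ(pi/2) ~ S": (rz(math.pi / 2), S),
    "RZ(pi/4) ~ T": (rz(math.pi / 4), T),
}
for name, (u, w) in checks.items():
    p = global_phase(u, w)
    print(f"{name}: phase = {cmath.phase(p) / math.pi:+.3f} pi")
```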

3. The set of XX gates, where each controlled-Z gate CZ(a, b) in the original circuit is represented by XX_{ab}(s_{ab}π/4).
4. The layer of RY gates, with RY(−v_i π/2) applied to the qubit i.

Figure 2: (a) change in the duration (old: yellow, new: blue; lower is better) when replacing RX(a)RY(b)RX(a), a, b ∈ [−π, π], by R(c, d); (b) change in the duration (old: yellow, new: blue; lower is better) when replacing RX(a)RY(b), a, b ∈ [−π, π], by R(c, d)RX(−a); (c) change in the fidelity (old: yellow, new: blue; higher is better) when replacing RX(a)RY(b)RX(a), a, b ∈ [−π, π], by R(c, d); (d) change in the fidelity (old: yellow, new: blue; higher is better) when replacing RX(a)RY(b), a, b ∈ [−π, π], by R(c, d)RX(−a). For the purpose of this illustration, τ_1q was set to 20 µs and ε = 0.01, roughly corresponding to the values seen in a specific experiment [6]. The fidelity of a circuit spanning a single qubit with the gates G_1, G_2, ..., G_k featuring the individual errors e_1, e_2, ..., e_k is calculated as the product ∏_{i=1}^{k} (1 − e_i). This corresponds to the model where the errors are independent, and is consistent with the understanding of the physics and sources of errors in the R gates [6, 9].
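The fidelity model in the caption is a plain product. The sketch below evaluates it for a pulse sequence; to keep the example minimal it assigns the same error ε = 0.01 to every pulse, which is a simplifying assumption, since the error models discussed earlier depend on the rotation angle. The function name is hypothetical.

```python
import math
from typing import Iterable


def sequence_fidelity(errors: Iterable[float]) -> float:
    """Fidelity of a single-qubit gate sequence with independent gate errors e_i:
    the product over i of (1 - e_i)."""
    return math.prod(1.0 - e for e in errors)


EPS = 0.01  # per-pulse error used for Figure 2 (assumed uniform here)
three_pulses = sequence_fidelity([EPS, EPS, EPS])  # e.g. RX(a) RY(b) RX(a)
one_pulse = sequence_fidelity([EPS])               # a single R(c, d) pulse
print(f"{three_pulses:.6f} vs {one_pulse:.6f}")
```

With a uniform error, fewer pulses always means higher fidelity; with angle-dependent errors, the comparison can change, which is why the figure plots it over the full range of angles.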
(a)RY(b) with R(c, d)RX(−a) and RY(b)RX(a) with RX(−a)R(c, d). This allows one to trade off runtime for error; see Figure 2(b)(d) for an illustration of the changes in runtime and error. While the runtime always increases, it is sometimes possible to improve the fidelity. For instance, if RX(π/2)RY(−π/2