Floating point representations in quantum circuit synthesis

We provide a non-deterministic quantum protocol that approximates the single-qubit rotation $R_x(2\phi_1^2\phi_2^2)$ using $R_x(2\phi_1)$ and $R_x(2\phi_2)$ and a constant number of Clifford and T operations. We then use this method to construct a 'floating point' implementation of a small rotation, wherein we use the aforementioned method to construct the exponent part of the rotation and also to combine it with a mantissa. This causes the cost of the synthesis to depend more strongly on the relative (rather than absolute) precision required. We analyze the mean and variance of the T-count required to use our techniques and provide new lower bounds for the T-count for ancilla-free synthesis of small single-qubit axial rotations. We further show that our techniques can use ancillas to beat these lower bounds with high probability. We also discuss the T-depth of our method and see that the vast majority of the cost of the resultant circuits can be shifted to parallel computation paths.


Introduction
The ability to implement very small rotations is vitally important to quantum computation. The ability to economically implement small rotations is essential for the quantum Fourier transform, which is an essential part of Shor's factoring algorithm. In quantum computer simulations of local Hamiltonians (which encompasses simulations of quantum chemistry in second quantized form), the time evolution operator is simulated by breaking up the evolution time into a sequence of short time evolutions using Trotter-Suzuki formulas [1,2,3,4,5]. The implementation of each timestep requires performing a single-qubit Z-rotation through a very small angle. In practice, rotations of $10^{-3}$ radians or smaller may be needed in order to ensure that upper bounds on the simulation error are appropriately small [5]. As the error tolerance shrinks for the simulation, these rotation angles must shrink as well. This issue is problematic because existing algorithms for designing fault-tolerant circuits to implement these small angle rotations can be very costly, both in the number of gates required and in the classical computational time required to find the appropriate gate sequences [6,7].
The Solovay-Kitaev theorem [8] is often used to estimate the cost of synthesizing the rotation gates using a finite gate library at cost polylogarithmic in the error tolerance. Although polylogarithmic, the cost of performing gate synthesis using the Solovay-Kitaev theorem is polynomially greater than the lower bound of logarithmic scaling [9]. In recent months, great progress has been made to reduce the cost of synthesizing single-qubit unitaries, and now methods for synthesizing these rotations have been proposed that are polynomially more efficient than the Solovay-Kitaev theorem [10,11,12]. Another novel approach that has recently been proposed uses non-deterministic algorithms that consume pre-programmed ancilla states to perform these rotations [13,14,15,16], rather than utilizing a complicated circuit synthesis method. A major advantage of these ancilla-assisted synthesis methods is that the resource states can be prepared before the algorithm is executed, substantially reducing the depth of the circuit; furthermore, any leftover states can also be used as resources in subsequent runs or even in other quantum algorithms. Such methods may be preferable to using traditional circuit synthesis methods in parallel quantum computation where fast classical feed-forward is available.
Our key innovation in this work is a quantum protocol that refines large X-rotations into smaller rotations. In particular, given the ability to enact the rotations $R_x(\phi_1)$ and $R_x(\phi_2)$, our method provides a way to implement a rotation that is approximately $R_x(\phi_1^2\phi_2^2)$ if $\phi_1\phi_2 \ll 1$. We further show that, with high probability, this approach generates small single-qubit rotations more efficiently than the best possible ancilla-free circuit synthesis method (using the {Clifford, T} gate library). This is significant not only because it shows that ancillas are a powerful resource for single-qubit circuit synthesis, but also because it allows much more sophisticated computations to be performed on a rudimentary quantum computer.
This ability to generate small rotations and multiply the rotation angles of two operations naturally opens the possibility of employing a "floating point" implementation of the rotation. A floating point number is broken up into two parts: the mantissa and the exponent. Both the mantissa and the exponent are encoded as integers, and they represent a number $\phi$ as $\phi = m \times 10^{e}$, where $m$ is the mantissa and $e$ is the exponent. A major advantage of this representation is that extraneous digits of precision are not used to represent very small, or very large, numbers. Our non-deterministic circuit can then be used to construct $e^{-i\phi X}$ by combining a mantissa unitary $U_m$ and an exponent unitary $U_e$. For example, if $\phi \ll 1$ then we could combine $U_m = e^{-i\sqrt{m}X}$ and $U_e = e^{-i 10^{e/2} X}$ to approximate $e^{-i\phi X}$. We make this intuition precise in Section 4.

Figure 1. […], which implements a small rotation on the input state $|\psi\rangle$ given that each measurement outcome is 0.

Figure 2. Circuit for multiplying mantissa rotation $U_m$ with exponent rotation $U_e$. This circuit is a special case of that in Figure 1 for the case where […]
This floating point representation is significant because the cost of implementing the rotation depends more strongly on the relative precision needed for the rotation than on the absolute precision, unlike what would be found using traditional circuit synthesis techniques. A further benefit of our approach is that the vast majority of the cost involves preparing resource states that are then consumed to perform the desired rotation. These preparations can be performed offline and in parallel, which substantially reduces the online cost of our algorithm. Finally, these ancilla preparations are universal resource states and hence unconsumed resource states can be used in subsequent computations.
Our paper is laid out as follows. We introduce our non-deterministic circuit in Section 2 and show how to use it recursively to generate small rotations in Section 3, where we also compute the mean and the variance of the number of T gates required to execute our circuits. We then combine these ideas in Section 4 to produce the floating point representation of the desired rotation. Section 5 gives an example of floating point synthesis that shows that it substantially reduces the number of T gates needed to approximate $\exp(-i\pi Z/2^{16})$ relative to optimal ancilla-free synthesis. We finally show in Section 6 that our method for generating small single-qubit rotations is more efficient than the best possible circuit synthesis method that uses only single-qubit Clifford and T gates, and we provide an explicit construction for this optimal synthesis method.

The Gearbox Circuit
The "gearbox circuit" is the central object that underlies our entire method. The role of the circuit is to perform a rotation through an angle that is the product of the squares of the off-diagonal matrix elements of a series of single-qubit unitary operations $U_1, \ldots, U_d$ acting on ancilla qubits. We refer to this circuit as a gearbox circuit because it transforms coarse rotations into much finer rotations, in analogy to a gearbox. The circuit is denoted, in the case of $d$ control qubits, as $C^{(d)}(U_1, \ldots, U_d)$ and is given in Figure 1. The circuit is equivalent to those used in [17,18] to implement linear combinations of unitary operations in the case where one of the unitary operations is the identity.
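The action of the gearbox circuit can be checked numerically on small instances. The sketch below is a minimal simulation under an assumed circuit layout, since Figure 1 is not reproduced here: each ancilla starts in $|0\rangle$ and is prepared with $U_j$, a multiply controlled $-iX$ acts on the target, the preparations are undone with $U_j^\dagger$, and every ancilla is measured and post-selected on 0. The function names are illustrative and not taken from the paper.

```python
import numpy as np
from functools import reduce

I2 = np.eye(2, dtype=complex)
X = np.array([[0, 1], [1, 0]], dtype=complex)
H = np.array([[1, 1], [1, -1]], dtype=complex) / np.sqrt(2)
T = np.diag([1, np.exp(1j * np.pi / 4)])


def gearbox_kraus(unitaries):
    """2x2 operator applied to the target when every ancilla is measured as 0.

    Assumed layout (not taken from the paper's figure): ancillas first,
    target last; prepare with U_j, apply a multiply controlled -iX,
    undo with U_j^dagger, post-select the ancillas on |0...0>.
    """
    dim = 2 ** len(unitaries)
    prep = reduce(np.kron, unitaries)              # U_1 (x) ... (x) U_d
    all_ones = np.zeros((dim, dim), dtype=complex)
    all_ones[-1, -1] = 1.0                         # projector onto |1...1>
    controlled = (np.kron(np.eye(dim) - all_ones, I2)
                  + np.kron(all_ones, -1j * X))
    full = np.kron(prep.conj().T, I2) @ controlled @ np.kron(prep, I2)
    return full[:2, :2]                            # ancilla block <0...0| . |0...0>


def angle_and_probability(kraus):
    """Rotation angle phi (about X) and success probability of the outcome."""
    probability = float(np.real(np.trace(kraus.conj().T @ kraus)) / 2)
    angle = float(np.arctan2(-kraus[1, 0].imag, kraus[0, 0].real))
    return angle, probability


U = H @ T @ H                                      # off-diagonal magnitude sin(pi/8)
phi, p = angle_and_probability(gearbox_kraus([U]))
print(phi, p)
```

Under these assumptions the post-selected operator has the form $(1-s)I - i s X$, where $s$ is the product of the squared off-diagonal magnitudes $|(U_j)_{1,0}|^2$, so the implemented angle satisfies $\tan\phi = s/(1-s)$. For a single ancilla with $U = HTH$ the success probability evaluates to $\cos^4(\pi/8) + \sin^4(\pi/8)$, the expression quoted for $C^{\circ 1}(U)$ in the proof of Lemma 1.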
We use the circuit for three purposes: to multiply the rotation angles generated by $U_m$ and $U_e$, to reduce the spacing between the rotations that our circuits produce, and to generate $U_e$. Our cost analysis assumes that Clifford operations (H, S and CNOT) are inexpensive whereas the non-Clifford operation T is expensive. This cost model is motivated by the fact that T gates are very expensive to perform in many error correcting codes because multiple rounds of magic state distillation may be required to obtain sufficiently accurate T gates [19]. The following theorem shows that the gearbox circuit can be used to convert modestly small rotations into very small rotations non-deterministically, and furthermore that the gearbox circuit will always succeed, because the rotation implemented when the circuit fails to give the desired rotation can be inverted using Clifford operations, which we assume are inexpensive.
The T-count required to produce a rotation angle, given that each measurement outcome is 0 and that a decomposition of the Toffoli gate requiring 7 T gates is used, is […]

Figure 3. Here we compare the mean T-count for implementing $C^{(d)}(S_j)$, estimated using 500 samples per angle, to the scaling achievable using the synthesis method of Selinger. We find that our result compares favorably for $j \ge 4$. Legend: T-count of $C^{(d)}(S_7)$, $C^{(d)}(S_4)$, $C^{(d)}(S_1)$, and of Selinger's method.
Similarly, if a circuit with minimal T-depth is desired then an implementation of the Toffoli gate can be used that only has T-depth 2 (but requires an ancilla qubit) [20]. If these gates are used then the T-depth of the circuit becomes […]

Any failures that occur in implementing $C^{(d)}(S_j)$ can be corrected by applying Clifford operations and attempting the rotation again, because $e^{i\pi X/4}$ is itself a Clifford operation, up to a global phase. These estimates of the T-count and T-depth also approximately hold in cases where the rotation is attempted until success is obtained, because Theorem 1 predicts that the failure probability will be very small if $\theta \ll 1$. Figure 3 shows the number of operations needed to synthesize a rotation of angle $\theta$ using $C^{(d)}(S_j)$ as a function of the rotation angle generated, $\theta(d)$, where $S_j$ is a unitary that yields the minimum value of $|(S_j)_{1,0}|$ over all H, T circuits consisting of at most $j + 1$ H gates. For $j \ge 4$, the non-deterministic circuit manages to outperform a lower bound proven by Selinger [11] for the number of T gates needed to synthesize an arbitrary Z-rotation using the {Clifford, T} library and no ancilla qubits. This result is significant because Selinger's circuit synthesis method is known to be optimal, meaning that there exist Z-rotations that require a number of T gates that saturates the scaling predicted by the lower bound. We will see in Section 6 that our non-deterministic circuits can in fact surpass the efficiency of any single-qubit circuit synthesis method that uses our gate library and does not employ ancillary qubits.
It may be natural to suspect from Figure 3 that the efficiency with which small angle rotations can be synthesized increases as $|U_{1,0}|$ decreases. We find, however, that using longer circuits to synthesize unitaries with smaller values of $|U_{1,0}|$ does not necessarily yield a more efficient method for generating small rotations. Figure 5 contains results found by fitting the T-count for $C^{(d)}(S_j)$ to a logarithmic function of the form $a(j)\log_2(\theta^{-1}) + b(j)$. Using values of $j$ ranging from 1 to 98, we find strong evidence that $a(j) \approx 2$ is possible with this method. This is superior to the method of Selinger, which gives $a(j) = 4$, and we will show later in Figure 9 that this is also smaller than the optimal value of $a(j) \approx 3$ that arises from ancilla-free circuit synthesis using the gate library {Clifford, T}. It should be noted, however, that $C^{(d)}(S_j)$ does not necessarily provide as fine control over the resultant rotation angle as these other circuit synthesis methods (especially for large $j$), although our results show that it is more efficient at generating small rotation angles than the optimal ancilla-free circuit synthesis method.

Figure 4. […] Note that every non-Clifford operation except the rightmost $U^\dagger$ can be implemented using ancillas containing […]

The Composed Gearbox Circuit
Figure 5 shows that a direct application of the gearbox circuit requires a T-count that scales at least as $2\log_2(1/\theta)$, implying that a different approach is needed to further improve the scaling. A natural way to improve on the prior method is to use the gearbox circuit recursively by taking $U$ to be the rotation yielded by another gearbox circuit. This process can be repeated many times and the resulting circuit forms a tree-like structure, as seen in Figure 4. We formally define the recursive construction of the "composed gearbox circuit" below.
Let $C^{\circ 1}(U)$ be the circuit formed by taking $U_1 = U$ in $C^{(1)}$; then, for any integer $d > 1$, […]

We then show in the following corollary that $C^{\circ d}(U)$ generates a rotation angle that scales as $\tan^{2^d}(\theta_0)$ in the limit of small $\theta$ (where sin […]

Proof. We will first prove using induction that $C^{\circ d}(U)$ implements $e^{-i\tan^{-1}(\tan^{2^d}(\theta_0))X}$, given that the outcome of each measurement in the tree is 0, and then use Theorem 1 to verify the claimed success probability. The base case for our inductive proof, $C^{\circ 1}(U)$, has already been demonstrated by Theorem 1 for the case where d = 0. Now let us assume that […] as claimed.
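The doubly exponential shrinkage of the angle under composition is easy to tabulate. The following sketch assumes the rotation angle $\tan^{-1}(\tan^{2^d}(\theta_0))$ for $C^{\circ d}(U)$, which is the form used for $C^{\circ d}(HTH)$ with $\theta_0 = \pi/8$ elsewhere in the text; the helper names are hypothetical.

```python
import math


def composed_angle(theta0, d):
    """Closed form assumed for the angle of the d-level composed gearbox."""
    return math.atan(math.tan(theta0) ** (2 ** d))


def composed_angle_iterative(theta0, d):
    """Same angle obtained by squaring the tangent once per level."""
    angle = theta0
    for _ in range(d):
        angle = math.atan(math.tan(angle) ** 2)
    return angle


def levels_needed(theta0, target):
    """Smallest d >= 1 whose composed angle is at most `target`."""
    if not 0 < theta0 < math.pi / 4:
        raise ValueError("theta0 must lie in (0, pi/4) so the angle shrinks")
    if target <= 0:
        raise ValueError("target must be positive")
    d = 1
    while composed_angle(theta0, d) > target:
        d += 1
    return d


for d in range(1, 5):
    print(d, composed_angle(math.pi / 8, d))
print(levels_needed(math.pi / 8, 1e-6))
```

Because the exponent of the tangent doubles at every level, the number of levels grows only as $\log\log(1/\theta)$ in the target angle, which is the source of the doubly logarithmic depth discussed later in this section.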
Figure 5. […] for different values of $j$, where $\theta(d)$ is the rotation angle generated by $C^{(d)}(S_j)$. We see evidence from the data that the efficiency of generating a small angle rotation using $C^{(d)}(S_j)$ increases at first as a function of $j$ and then saturates.
One of the most remarkable features of $C^{\circ d}(U)$ is that almost all of the computational steps in the circuit can be thought of as preparations of ancilla states, either of the form $|\theta_j\rangle := C^{\circ j}(U)|0\rangle$ for $j = 1, \ldots, d-1$ or $U|0\rangle$. In fact, all but one application of $U^\dagger$ can be implemented as ancilla preparations that are performed offline. This means that the ancilla preparations can be performed prior to attempting the rotation, potentially by using multiple quantum information processors working in parallel. In contrast, the final application of $U^\dagger$ cannot be performed in this manner and hence is an online cost. We do not discuss the success probability in Corollary 1 because the success probability will vary depending on whether ancillas containing $|\theta_j\rangle$ are provided or not. We show below that if such ancillas are provided then the success probability is bounded below by a constant for all $d$. In contrast, we will see that if no ancillas are provided then multiple rounds of error correction will, with high probability, be needed for the algorithm to succeed.
Lemma 1. For all integer $d > 0$ and $\theta_0 < \pi/4$, if ancilla qubits of the form $U|0\rangle$ and $|\theta_j\rangle := C^{\circ j}(U)|0\rangle$ for $j = 1, \ldots, d-1$ are provided then $C^{\circ d}(U)$ can be implemented with failure probability at most […]

Proof. We know from Theorem 1 that the probability of successfully implementing $C^{\circ 1}(U)$ is $\cos^4(\theta_0) + \sin^4(\theta_0)$. Corollary 1 similarly tells us that the probability that the $j$-th measurement is successful, given that $j \ge 2$ and all prior measurements were successful, is […] Therefore the probability of failure at step $j$, given success at all previous steps, obeys […] The probability of a failure occurring is at most the sum of the probabilities of failing at any given step and hence […]

The lower bound on the success probability implied by Lemma 1 can be used to estimate the number of times the circuit needs to be attempted in cases where ancillas are provided, since assuming the presence of ancilla states that contain $|\theta_j\rangle$ for $j = 1, \ldots, d-1$ is equivalent to assuming that all previous computational steps have already been successfully implemented. We expand on this reasoning in the following corollary.
Corollary 2. For integer $d > 0$ and $\theta_0 < \pi/4$, the number of ancilla states of each type and the number of $U^\dagger$ operations, $N_d$, that must be performed online to execute the circuit $C^{\circ d}(U)$ successfully follows a probability distribution with mean and variance obeying […]

Proof. The number of times the measurement has to be repeated, $N_d$, is geometrically distributed with mean $1/P_d$ and variance $(1 - P_d)/P_d^2$, where $P_d$ is the probability of the measurement succeeding. Since the mean and the variance are monotonically increasing functions of $P_{\rm fail} = 1 - P_d$, upper bounds for $E(N_d)$ and $V(N_d)$ can be found by substituting (7) into them, because at most one of each of these types of resources is needed to attempt to implement $C^{\circ d}(U)$. The proof of the corollary then follows by simplifying the result of this substitution.
As an example, we find from substituting $\theta_0 = \pi/8$ into (8) that the number of trials needed to implement $C^{\circ d}(HTH)$ follows a distribution with $E(N_d) < 5/4$ and $V(N_d) < 1/3$. Chebyshev's inequality then implies that if we define $X$ to be the number of trials needed to achieve a successful rotation then […] This implies that, with high probability, the number of each type of resource consumed in implementing the successful rotation is a constant. If the cost of each of these resources is assumed to be identical, then the cost of the algorithm is $O(d) = O(\log\log(\theta^{-1}))$ and the online cost of implementing the circuit is bounded above by a constant, with high probability.
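The repeat-until-success statistics used in Corollary 2 can be reproduced with a few lines of arithmetic. The sketch below evaluates the geometric-distribution mean $1/P$ and variance $(1-P)/P^2$ and the resulting Chebyshev tail bound. The success probability 0.8 is only an illustrative value, chosen because it sits on the boundary of the quoted bounds $E(N_d) < 5/4$ and $V(N_d) < 1/3$; it is not a number taken from the paper, and the function names are hypothetical.

```python
def geometric_trials(p_success):
    """Mean and variance of the number of attempts until the first success."""
    if not 0 < p_success <= 1:
        raise ValueError("p_success must lie in (0, 1]")
    mean = 1.0 / p_success
    variance = (1.0 - p_success) / p_success ** 2
    return mean, variance


def chebyshev_tail_bound(mean, variance, threshold):
    """Chebyshev upper bound on P(X >= threshold), valid for threshold > mean."""
    gap = threshold - mean
    if gap <= 0:
        return 1.0                      # the inequality gives no information here
    return min(1.0, variance / gap ** 2)


mean, var = geometric_trials(0.8)       # illustrative success probability
bound = chebyshev_tail_bound(mean, var, threshold=5)
print(mean, var, bound)
```

Any success probability above 0.8 gives a smaller mean and variance than the printed values, so the probability of needing five or more attempts is then at most a few percent by this bound.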
The mean and the variance of the number of $U$ and $U^\dagger$ operations used to implement the rotation can also be computed in cases where no precomputed ancillas are provided. In fact, the number of $U$ and $U^\dagger$ gates that are needed to implement $C^{\circ d}(U)$ with high probability scales as $O(2^d)$. We state this result in the following theorem.
Theorem 2. Let $P_q = \sin^4(\phi_q) + \cos^4(\phi_q)$, where $\phi_q := \tan^{-1}(\tan^{2^{q-1}}(\theta_0))$ for all integer $q \ge 1$, and let $n_d$ be a random variable representing the number of applications of $U$ or $U^\dagger$ used to enact $C^{\circ d}(U)$ in a given attempt. Then the expectation value of $n_d$ is […] and for $\theta_0 < \pi/4$ the variance of $n_d$ obeys […]

To prove Theorem 2, we think about our non-deterministic circuits as ones that always succeed, but require a random number of steps to do so. We introduce two random variables to describe the number of measurements required for the measurement at the $n$-th level of our tree to succeed: one describes the number of attempts needed to successfully execute the branch before the controlled $-iX$ at the $n$-th level, and the other describes the number of attempts needed for the branch after the controlled $-iX$ and before all measurements. We then express the mean and variance of the number of attempts required to execute the $n$-th level of the tree in terms of the mean and variance of the variables introduced to describe the number of attempts needed to succeed on the $(n-1)$-st level. We get a recursive relation for the mean and variance that we then unfold and simplify using simple upper bounds. The same idea can be used to analyze more complicated tree-like non-deterministic circuits. The proof is given in Appendix A.
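The per-level success probabilities $P_q$ defined in Theorem 2 can be evaluated directly, which gives a feel for how quickly failures become rare at deeper levels of the tree. This is only a sketch of the two definitions in the theorem statement; it does not attempt to reproduce the lost expressions for the mean and variance of $n_d$, and the function names are illustrative.

```python
import math


def level_angle(theta0, q):
    """phi_q = arctan(tan(theta0) ** (2 ** (q - 1))), as defined in Theorem 2."""
    return math.atan(math.tan(theta0) ** (2 ** (q - 1)))


def level_success_probability(theta0, q):
    """P_q = sin^4(phi_q) + cos^4(phi_q), as defined in Theorem 2."""
    phi = level_angle(theta0, q)
    return math.sin(phi) ** 4 + math.cos(phi) ** 4


probs = [level_success_probability(math.pi / 8, q) for q in range(1, 6)]
for q, p_q in enumerate(probs, start=1):
    print(q, p_q)
```

For $\theta_0 = \pi/8$ the first level is the least reliable, and the values approach 1 doubly exponentially fast in $q$ because $\phi_q$ shrinks by squaring the tangent at each level.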
Theorem 2 shows that the mean and the standard deviation of the number of applications of $U$ and $U^\dagger$ used to implement $C^{\circ d}(U)$ scale as $\Theta(2^d)$ and $O(2^d)$ respectively for $\theta_0 \le \pi/8$. This follows from the fact that for $\theta_0 \le \pi/8$, […] Chebyshev's inequality therefore implies (similarly to the case discussed above where precomputed ancillas are used) that, with high probability, the number of $U$ and $U^\dagger$ gates needed to implement the rotation will also scale as $O(2^d)$. This procedure is efficient because $d$ scales doubly-logarithmically with the desired rotation angle. The […]

Figure 6. Here we compare the mean T-count for our composition based method, given by $C^{\circ d}(HTH)$, to Selinger's method and also to directly using the gearbox circuit $C^{(d)}(S_7)$. The dashed lines give the upper and lower limits of a 95% confidence interval for the T-count that arises from using the composition method to implement $U_e$. We see that the composition method offers superior performance to that of the circuit $C^{(d)}(S_j)$ and Selinger's method. 500 samples were used to compute the expectation values of the scalings for both non-deterministic methods.
This implies that, on average, the number of T gates required to implement […] where $\theta = \tan^{-1}(\tan^{2^d}(\pi/8))$. This estimate results from the use of several inequalities and it is therefore reasonable to expect the actual expectation value of the T-count to be smaller. The data in Figure 6 suggest that the mean value for the T-count (which is proportional to $n_d$ for $U = HTH$) actually obeys […] for $d \in \Theta(\log(\log(\theta^{-1})))$. The resultant T-count is smaller than that of [11] (which is known to give optimal scaling in cases where $\theta$ is chosen adversarially and no ancilla bits are permitted) or those that arise from a direct application of the gearbox circuit. We will see shortly that this scaling is in fact better than the best possible scaling achievable in any circuit synthesis method using only H, T and CNOT gates. Furthermore, the slopes of the 2.5th and 97.[…]

[…] where […] By using a binary expansion and a Taylor series expansion of the trigonometric functions, it can be seen that the circuit implements $e^{-i\phi X}$ for $\phi = \tan^{4q}(\pi/8) + O(\tan^{12q}(\pi/8))$ and integer $q$. This allows us to address the problems posed by using our composition method to construct the rotation angle, at the cost of additional T gates. Figure 7 contains a plot of the rotation angles generated by combining the rotations generated using our composition method via the gearbox circuit. We see in the figure that the rotation angles obtained approximately decrease by factors of 0.031, as anticipated by the prior discussion. We also find that the expectation value of the T-count of this algorithm scales roughly as $a\log_2(1/\theta) + 22$, where $a \approx 1.14$ gives the line of best fit and [1.08, 1.20] gives a 95% confidence interval for $a$. The typical overhead from using $C^{(d)}(C^{\circ D_1}, \ldots, C^{\circ D_d})$ to implement the rotation is minimal because the cost of implementing a small rotation using $C^{\circ d}(HTH)$ followed a similar scaling with $a \approx 1.11$, which falls within the 95% confidence interval for the value of $a$ corresponding to $C^{(d)}(C^{\circ D_1}, \ldots, C^{\circ D_d})$.

Constructing the Floating Point Representation
The preceding discussion shows how we can use our composition method in conjunction with the gearbox circuit to implement a given $U_e$. Our next goal is to use this idea to implement an arbitrary X-rotation by using this method to generate the exponent of our floating point representation, $U_e$, and another technique to implement the mantissa $U_m$. The circuit that implements the necessary rotation is given in Figure 8.
Theorem 1 implies that, conditioned on the successful implementation of the $C^{\circ D_j}(HTH)$, the circuit will implement $e^{-i\phi X}$ for […] where $\phi(D)$ is defined in (14).
We describe the process involved in using this floating point implementation of the desired rotation below.

Algorithm 1
[…]
Set $\phi_{\rm rem} = \phi_{\rm in} - k\pi/4$.
7: Find the smallest value of $\phi(D)$, and the corresponding values of […] $\tan\phi_{\rm rem} / (1 + \tan\phi_{\rm rem})$ […]
[…]
9: return […]

The algorithm can be seen to output the desired rotation via the following argument. It is easy to see that steps 1-4 will return a distance $\delta$ approximation to the desired rotation, given that the desired rotation obeys $\min_{k\in\mathbb{Z}} |\phi_{\rm in} - k\pi/4| \le \delta$. The remaining cases can then be handled by implementing $e^{-ikX\pi/4}$ using Clifford operations and synthesizing a rotation that implements $e^{-i(\phi_{\rm in} - k\pi/4)X}$ within precision $O(\delta \times 10^{-\gamma})$.
We have from Theorem 1 that the rotation angle implemented, for the ideal choice […] (16)

We are constrained, however, to have $|(U_m)_{1,0}| \le 1$ in our solution. We find the range of physically allowable solutions by setting $|(U_m)_{1,0}| = 1$ and then solving for $\phi(D)$ to find that a valid solution exists if […] which is guaranteed by Step 7. Then, given any such choice of $D$, we solve (16) for the corresponding value of $|(U_m)_{1,0}|$ and find that […] The x-rotation chosen in Step 8 yields the desired rotation and hence the algorithm will as well, modulo the error incurred in the synthesis of $U_m$. We have already established in (13) that $\phi_D$ will be within a constant factor of $\phi_{\rm rem}$, and hence $\phi_D \in \Theta(10^{-\gamma/2})$. We then see from Taylor's theorem that […] which verifies that the error is $O(\delta \times 10^{-\gamma})$ as required.

Figure 8. This circuit gives the floating point implementation of a rotation for a given mantissa unitary $U_m$. Unlike Figure 2, this circuit uses $d$ different composed gearbox circuits to form the exponent part rather than just one. This provides greater control over the rotation than would be possible with just one composed gearbox. Note that the multiply controlled $-iX$ gate can be implemented using $7(2d - 1)$ T gates, as discussed in (3).
A cost analysis of the floating point method is given in Appendix B, wherein we show that the T-count required by the floating point method approximately scales as $1.14\log_2(1/\theta)$ for constant precision. Similarly, the circuit depth and the online T-count scale as $O(\log\log(1/\theta))$. This implies that floating point synthesis is not only less expensive than traditional synthesis methods (as measured by the T-count) but also that much of this cost can be distributed over parallel quantum information processors.
Table 1. This table compares the T-counts that result from synthesizing $e^{-iZ\pi/2^{16}}$ using our floating point method to those that arise from optimal synthesis using the gate library {Clifford, T}. $V_1$ and $V_2$ are the two shortest circuits that provide a better approximation to the rotation than $e^{-iZ\pi/2^{16}} \approx I$. The mean and confidence intervals were calculated using 500 samples and the mean value agrees with the result of Theorem 2 within statistical error.

We will now give an illustrative example of our floating point technique for synthesizing the operation $e^{-i\pi Z/2^{16}}$. This rotation is significant because it appears in the quantum Fourier transform. We have found, by using techniques described in the subsequent section, that the T-optimal circuit that estimates this rotation more accurately than $e^{-i\pi Z/2^{16}} \approx I$ consists of 57 T gates. The next shortest circuit contains 60 T gates. This implies that the cost of synthesizing the rotation using an optimal circuit synthesis method and the {Clifford, T} gate library changes abruptly when an approximation to the rotation with even one digit of precision is needed. First, note that $R_z(\theta) = H R_x(\theta) H$ and hence the x-rotations that naturally arise from our method can be easily translated to z-rotations using Clifford operations (which we assume are inexpensive). This implies that the problem of synthesizing the rotation reduces to that of synthesizing $e^{-i\pi X/2^{16}}$. Following Algorithm 1, we choose $U_e$ to be $C^{(2)}(\pi/8)$ because $\tan^{-1}(\tan^4(\pi/8)) > \pi/2^{16}$. We then find numerically that the mantissa part of the rotation must satisfy […]
Finally, we exhaustively search for the two shortest circuits that give a unitary that has off-diagonal matrix elements of comparable magnitude to the ideal value and examine the performance of our floating point method for both these choices of U m by performing a Monte-Carlo simulation of the T -counts required to use our floating point method.The results of this Monte-Carlo simulation are given in Table 1.
We see from the data in Table 1 that circuits derived from the floating point method require, with high probability, nearly half the T gates required by the optimal synthesis method in order to produce non-trivial approximations with comparable relative error. The floating point circuits also have the benefit of requiring a substantially smaller online cost. For the cases considered above, these costs are approximately 11 and 14 T gates, and the majority of the online cost is incurred in implementing the Toffoli gate and $U_m^\dagger$ ($U_m$ can be implemented offline).
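The split of a target angle into an exponent and a mantissa can be explored numerically. The sketch below is a simplified reading of the procedure, not the paper's Algorithm 1: it assumes a single exponent unitary $U_e = e^{-i\phi_D X}$ with $\phi_D = \tan^{-1}(\tan^{2^D}(\pi/8))$, and it assumes that a gearbox acting on $U_m$ and $U_e$ implements an angle $\phi$ with $\tan\phi = p/(1-p)$, where $p = |(U_m)_{1,0}|^2 \sin^2(\phi_D)$. It then picks the largest $D$ for which the required mantissa magnitude does not exceed 1. All function names are hypothetical.

```python
import math


def exponent_angle(theta0, D):
    """Assumed exponent angle phi_D = arctan(tan(theta0) ** (2 ** D))."""
    return math.atan(math.tan(theta0) ** (2 ** D))


def required_product(phi):
    """Value p solving tan(phi) = p / (1 - p), i.e. p = tan(phi) / (1 + tan(phi))."""
    t = math.tan(phi)
    return t / (1.0 + t)


def choose_exponent_and_mantissa(phi_target, theta0=math.pi / 8, max_levels=16):
    """Largest D with a valid mantissa, and the needed |(U_m)_{1,0}|.

    Simplified model (an assumption of this sketch): one exponent unitary,
    exact mantissa, no synthesis error.
    """
    p = required_product(phi_target)
    best = None
    for D in range(1, max_levels + 1):
        exponent_weight = math.sin(exponent_angle(theta0, D)) ** 2
        if exponent_weight < p:          # mantissa magnitude would exceed 1
            break
        best = (D, math.sqrt(p / exponent_weight))
    if best is None:
        raise ValueError("target angle too large for this exponent family")
    return best


D, mantissa = choose_exponent_and_mantissa(math.pi / 2 ** 16)
print(D, mantissa)
```

Under these assumptions the target $\pi/2^{16}$ selects two levels of composition, and the exponent then carries most of the smallness of the angle while the mantissa magnitude stays of order one tenth to one. The mantissa value printed here is a consequence of the simplified model and should not be read as the value found numerically in the paper.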

Optimal Ancilla-Free Single-Qubit Synthesis of Small Rotations
In this section, we extend methods described in [21] to find circuits chosen from the {Clifford,T} library with the smallest possible (non-zero) off-diagonal entries.
The algorithm described guarantees optimality of the circuits found. The result of this section shows that gearbox circuits involving ancillary qubits and measurement reduce the T-count below the best possible T-count in a purely unitary single-qubit construction.
More precisely, the problem we are interested in is the following: amongst all circuits with optimal T-count $n$, find one that corresponds to a unitary with a minimal possible off-diagonal entry. We say that a circuit has optimal T-count $n$ if any other circuit drawn from the {Clifford, T} library implementing the same unitary requires at least $n$ T gates. We reduce the problem to searching for unitaries over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$ with a certain property that we discuss in detail later in this section. It is known that any circuit over the {Clifford, T} library corresponds to a unitary over $\mathbb{Z}[i, 1/\sqrt{2}]$; furthermore, the results presented in [22] show that there is a tight connection between the optimal T-count and the entries of the unitary. The notion of the smallest denominator exponent (sde) allows us to express the connection formally. For numbers of the form […] we define sde as a minimal possible $m$, $m_{\min}$, such that the number can be written in the form $(a$ […]

Let $u$ be an off-diagonal entry of a unitary $U$ over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$ and let $\mathrm{sde}(|u|^2) = m$. It was shown in Appendix B in [22] that the optimal T-count for the circuit implementing the unitary $U$ can only be $m-2$, $m-1$ or $m$. It turns out that for given $|u|^2$ there always exists a circuit with optimal T-count $m-2$. Indeed, by multiplying $U$ from the right or the left by some power of T we can always achieve optimal T-count $m-2$ (see Appendix B in [22]). On the other hand, multiplying a unitary by powers of T leaves the absolute value of its off-diagonal entries unchanged.
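The smallest denominator exponent is simple to compute with exact integer arithmetic. The sketch below assumes the reading that $\mathrm{sde}(z)$ is the smallest $m$ for which $z = (A + B\sqrt{2})/\sqrt{2}^{\,m}$ with integers $A$, $B$, and it writes the entry as $u = (a + b\omega + c\omega^2 + d\omega^3)/\sqrt{2}^{\,k}$ with $\omega = e^{i\pi/4}$; both conventions are assumptions made for this example because the defining formulas did not survive in the text. The function names are illustrative.

```python
import math


def abs_squared(a, b, c, d):
    """Integers (A, B) with |a + b*w + c*w^2 + d*w^3|^2 = A + B*sqrt(2), w = exp(i*pi/4)."""
    return a * a + b * b + c * c + d * d, a * b + b * c + c * d - a * d


def sde(A, B, k):
    """Smallest m such that (A + B*sqrt(2)) / sqrt(2)**k has denominator sqrt(2)**m."""
    if A == 0 and B == 0:
        raise ValueError("sde is not defined for zero")
    # (A + B*sqrt(2)) / sqrt(2) = B + (A/2)*sqrt(2) is integral exactly when A is even.
    while A % 2 == 0:
        A, B = B, A // 2
        k -= 1
    return k


# Off-diagonal entry of HTH: u = (1 - w) / 2 = (1 - w) / sqrt(2)**2,
# so |u|^2 = (A + B*sqrt(2)) / sqrt(2)**4.
A, B = abs_squared(1, -1, 0, 0)
m = sde(A, B, 4)
print(A, B, m)
```

For $HTH$ this gives $\mathrm{sde}(|u|^2) = 3$, and the T-count of $HTH$ is 1, which matches the smallest of the three values $m-2$, $m-1$, $m$ allowed by the statement above.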
To find a circuit implementing the unitary we apply the exact synthesis algorithm of [22], which produces a circuit with an optimal number of T gates. The algorithm is based on the fact that $\mathrm{sde}(|\cdot|^2)$ defines the complexity of the circuit that the unitary implements. The algorithm works by multiplying the unitary by $HT^l$, choosing $l$ to reduce $\mathrm{sde}(|\cdot|^2)$ of the resulting unitary entries. The algorithm repeats this greedy approach until it reaches $\mathrm{sde}(|\cdot|^2) = 3$ and then looks up the optimal circuit in a small database. A more detailed description of the algorithm and the proof of T-optimality of the circuits produced can be found in [22].
Based on the discussion above we can restate the initial problem as: for fixed $m$, find a unitary with a minimal (but non-zero) off-diagonal entry $u$ such that $\mathrm{sde}(|u|^2) = m$. The simplest approach is to go through all elements of the set $S_m = \{u \in \mathbb{Z}[i, 1/\sqrt{2}] : \mathrm{sde}(|u|^2) = m,\ \exists v \in \mathbb{Z}[i, 1/\sqrt{2}] : |u|^2 + |v|^2 = 1\}$ and find its element with minimal absolute value. The condition $|u|^2 + |v|^2 = 1$ assures that there exists a unitary with off-diagonal entry $u$. Therefore going through the set above is the same as going through all unitaries over the ring $\mathbb{Z}[i, 1/\sqrt{2}]$. As a side note, the condition $\exists v \in \mathbb{Z}[i, 1/\sqrt{2}] : |u|^2 + |v|^2 = 1$ must be explicitly enforced because there exists $u \in \mathbb{Z}[i, 1/\sqrt{2}]$ such that $|u| < 1$, but $u$ is not an entry of any unitary over the ring […] To iterate through all elements of $S_m$ it suffices to go through all $u \in \mathbb{Z}[i, 1/\sqrt{2}]$ with $\mathrm{sde}(|u|^2) = m$ and check the second condition $|u|^2 + |v|^2 = 1$. For $v$ expressed as […] the condition can be written as […] The algorithm for solving such equations is known and is a part of several computer algebra systems. We use PARI/GP [23] to check the existence of the solution for given $A$, $B$.
There is a systematic way to go through $u \in \mathbb{Z}[i, 1/\sqrt{2}]$ with $\mathrm{sde}(|u|^2) = m$. Each $u$ can be described by five integers $a, b, c, d, \kappa$ and written as $(a + b\omega$ […] The condition that $\mathrm{sde}(|u|^2) = m$ implies that we can choose $\kappa = m/2$. In addition, $u$ is required to be an entry of a unitary, therefore $|u|^2 + |v|^2 = 1$ for some $v$. Multiplying the equality by $2^{m/2}$ and collecting integer terms results in the inequality […] In summary, to go through all $u$ such that $\mathrm{sde}(|u|^2) = m$ it suffices to go through integers $a, b, c, d$ satisfying the inequality. The complexity of such a search procedure is exponential in $m$. In the second part of this section we describe a search procedure that is still exponential, but is more efficient and allows us to reach $m$ high enough to be interesting for our purposes. Note that to get the minimal absolute value $\delta$ of the off-diagonal entries found we need to consider $m$ that is in $O(\log(1/\delta))$; the complexity of both the simple and the improved search procedures is polynomial in $1/\delta$.

The improved search procedure uses additional information to shrink the search space. In particular, we require that an upper bound $\varepsilon$ on $|u|^2$ for given $m$ is provided as an input. This bound can be taken to be the minimal value of $|u|^2$ for $m - 1$. The procedure fails if the bound is too tight and an error message is returned, allowing the user to specify a less stringent error tolerance or increase the value of $m$. Now we show how to use the upper bound $\varepsilon$ to shrink the search space. For our current purpose it is more convenient to represent $u$ as […] The savings are the most significant when $2^{\kappa}\varepsilon \le 1/4$; in this case $a_j$ is uniquely defined by $b_j$ because $|a_0 + b_0\sqrt{2}| \le 1/2$ and $a_0$ must be equal to $-\lfloor b_0\sqrt{2}\rceil$, the nearest integer to $-b_0\sqrt{2}$. Our algorithm operates in this regime starting from $m \ge 9$.
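The observation that $a_0$ is fixed by $b_0$ once $|a_0 + b_0\sqrt{2}| \le 1/2$ can be turned into a short enumeration. The sketch below computes the nearest integer to $b\sqrt{2}$ with exact integer arithmetic and collects the resulting small values of $|a + b\sqrt{2}|$ in ascending order. It illustrates only this one step; the range limit `b_max`, the cutoff `bound`, and the function names are hypothetical, and the full representation of $u$ used by the search is not reproduced.

```python
import math


def nearest_int_to_b_sqrt2(b):
    """Nearest integer to b*sqrt(2), computed without floating point."""
    # round(x) = floor((2x + 1) / 2) with 2x = sqrt(8 b^2); isqrt gives floor(2x).
    n = (math.isqrt(8 * b * b) + 1) // 2
    return n if b >= 0 else -n


def small_combinations(b_max, bound):
    """Triples (a, b, |a + b*sqrt(2)|) with 0 <= b <= b_max and value <= bound,
    sorted by the third entry. Requires bound <= 1/2 so that a is unique."""
    if not 0 <= bound <= 0.5:
        raise ValueError("bound must lie in [0, 1/2]")
    triples = []
    for b in range(b_max + 1):
        a = -nearest_int_to_b_sqrt2(b)
        # Floating point is adequate for moderate b; very large b would need
        # exact comparison instead.
        value = abs(a + b * math.sqrt(2))
        if value <= bound:
            triples.append((a, b, value))
    triples.sort(key=lambda t: t[2])
    return triples


for a, b, value in small_combinations(12, 0.1):
    print(a, b, value)
```

Because only one candidate $a$ is tried for each $b$, the enumeration visits a number of pairs that is linear in the range of $b$ rather than quadratic, which is the kind of saving the text attributes to the regime $2^{\kappa}\varepsilon \le 1/4$.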
In the first stage of our search the algorithm builds a list $L$ of triples $(a, b, |a + b\sqrt{2}|)$, restricted by the bound $\varepsilon$, and sorts it in ascending order by the third element. This allows the algorithm to efficiently build a second list, $L_{[0,\delta]}$, for the chosen interval $[0, \delta]$. The algorithm again sorts this list in ascending order by the last element and finds the first element whose corresponding $u$ can be an entry of a unitary. If it fails to find such an element, the algorithm restarts the procedure for a new list $L_{[\delta,2\delta]}$. It keeps increasing the list bounds until it either succeeds or reaches the point where the lower bound for the list exceeds $2^{\kappa}\varepsilon$. In the second case, it reports that the initial bound was too tight.
Table 2 shows the results of running the described algorithm. For some values of $N_T$ (the optimal T-count) the minimal absolute values of off-diagonal matrix entries are not included in the table: for example, there are no entries for optimal T-counts of eight and nine. This means that we can achieve smaller absolute values of off-diagonal entries using unitaries with optimal T-count seven than using unitaries with optimal T-count eight or nine. The same holds for all other intermediate values of the optimal T-count that are not included in Table 2. The dependence of the optimal T-count on the minimal absolute value of off-diagonal matrix entries is plotted in Figure 9.
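The control flow of the windowed search described in the first stage can be sketched as follows. This is a hedged reconstruction rather than the authors' implementation: the rule for combining two triples into one candidate (adding their third elements to obtain the score) is an assumption of this sketch, `accept` is a hypothetical callback standing in for the check that a candidate can be an entry of a unitary, and `limit` plays the role of $2^{\kappa}\varepsilon$.

```python
from bisect import bisect_left


def windowed_search(triples, delta, limit, accept):
    """Scan score windows [k*delta, (k+1)*delta) in increasing order.

    triples: iterable of (a, b, val); val is the quantity the list is sorted by.
    Returns the first accepted candidate (a0, b0, a1, b1, score); raises
    ValueError once the lower end of the window exceeds `limit`.
    """
    triples = sorted(triples, key=lambda t: t[2])
    vals = [t[2] for t in triples]
    k = 0
    while k * delta <= limit:
        lo, hi = k * delta, (k + 1) * delta
        window = []
        for a0, b0, v0 in triples:
            if v0 >= hi:
                break                       # later triples are larger still
            # Partners whose value puts the combined score inside [lo, hi).
            start = bisect_left(vals, lo - v0)
            stop = bisect_left(vals, hi - v0)
            for a1, b1, v1 in triples[start:stop]:
                window.append((a0, b0, a1, b1, v0 + v1))
        window.sort(key=lambda cand: cand[4])
        for cand in window:
            if accept(cand):
                return cand
        k += 1
    raise ValueError("initial bound too tight: no admissible entry found")
```

Because the windows are visited in increasing order and each window is sorted, the first accepted candidate has a score that is minimal up to the ordering inside its own window; the sorted base list keeps the construction of each window to a pair of binary searches per triple.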
In summary, we have demonstrated a practical algorithm for finding single qubit unitaries drawn from the gate library consisting of single qubit Clifford gates and T that have the smallest possible absolute values of off-diagonal entries for values of the optimal T-count ranging from seven to one hundred. We see from the data in Figure 9 and (12) that using ancillas and classical feedback for this task leads to an improvement by approximately a factor of three in the T-count. To the best of our knowledge, this is the first example of a single qubit circuit synthesis task for which circuits including ancillas initialized to $|0\rangle$ and measurements with classical feedback require a lower T-count than the optimal results involving only unitary operations.
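As a rough numerical illustration of this comparison, the sketch below evaluates two least-squares fits quoted in the captions of this paper: the ancilla-free fit from Table 2, $2.98\log_2(1/|u|) - 1.064$, and the fit from Figure 7, $1.14\log_2(1/\theta) + 22$. Equation (12) is not used here, so the Figure 7 fit serves only as a stand-in for the ancilla-assisted cost, and identifying $|u|$ with $\theta$ for small angles is an assumption made purely for illustration. The function names are hypothetical.

```python
import math


def ancilla_free_fit(u_abs):
    """T-count fit from the Table 2 caption (ancilla-free, optimal T-count)."""
    return 2.98 * math.log2(1.0 / u_abs) - 1.064


def floating_point_fit(theta):
    """Mean T-count fit from the Figure 7 caption (ancilla-assisted)."""
    return 1.14 * math.log2(1.0 / theta) + 22.0


if __name__ == "__main__":
    # Assumption: |u| is identified with theta for small rotation angles.
    for exponent in (10, 20, 40):
        x = 2.0 ** -exponent
        print(exponent, ancilla_free_fit(x), floating_point_fit(x))
```

Under these assumptions the additive constant of 22 dominates for large angles, and the ratio of the two fits tends to the ratio of the slopes, about 2.6, as the angle shrinks.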

Conclusion
Our work provides a new method for non-deterministically synthesizing small single qubit rotations. We use this approach to construct a floating point representation of the rotation that can lead to substantial reductions in the T-count, T-depth and online T-count used to perform the rotations; furthermore, we show that the number of operations required to synthesize these rotations is less than lower bounds for the cost of synthesizing single qubit rotations in {Clifford, T} in cases where ancilla qubits are not used.
There are several avenues of future inquiry that are suggested by our work. Our results can be generalized by using different recursion relations at different depths in the recursive definition of our composed gearbox circuit. Such generalizations allow modified versions of our circuits to closely approximate a much larger set of rotation angles and may lead to increased efficiency in certain cases. Another important application of our work is in quantum simulation, where implementing terms that are nearly negligible in a Trotter-Suzuki expansion is a common problem. This application will be considered in subsequent work.
As mentioned in Section 3, the majority of the operations in $C^{\circ D_j}$ can be thought of as ancilla preparations. This means that any such ancilla preparation steps can be shifted offline and performed in parallel. In essence, this reduces the depth of the circuit exponentially in exchange for a logarithmic increase in the circuit width. This can easily be seen using Theorem 2. The only online operation that must be performed is $HTH$, which, according to Lemma 1, will only have to be performed a constant number of times before $C^{\circ D_j}$ is implemented with high probability. This implies that
$$E(T_{\rm depth})(C^{\circ D_j}) \le D_j + K \in O(\log\log(1/\theta)) \in O(\log\log(10^{\gamma})), \tag{B.3}$$
where $K \approx 12$ is a constant that arises from having to repeat the online step a fixed number of times.
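To make the doubly logarithmic scaling in (B.3) concrete, the following sketch computes how many compositions are needed to reach a target angle and the resulting bound on the expected T-depth. It assumes that the angle generated by $d$ compositions is approximately $\tan^{2^d}(\pi/8)$, the scaling quoted elsewhere in this paper for $C^{\circ d}(HTH)$, and it uses $K = 12$ for the online repetitions. The function names are hypothetical, and the result is exact only up to floating-point rounding.

```python
import math

K_ONLINE = 12  # approximate constant K quoted in the text (K ≈ 12)


def compositions_needed(theta):
    """Smallest d with tan(pi/8)**(2**d) <= theta, for 0 < theta < 1."""
    if not 0.0 < theta < 1.0:
        raise ValueError("theta must lie strictly between 0 and 1")
    # tan(pi/8)**(2**d) <= theta  <=>  2**d >= log(theta) / log(tan(pi/8))
    ratio = math.log(theta) / math.log(math.tan(math.pi / 8))
    return max(0, math.ceil(math.log2(ratio)))


def expected_t_depth_bound(theta):
    """Upper bound D + K on the expected T-depth, following (B.3)."""
    return compositions_needed(theta) + K_ONLINE
```

Reducing the target angle from $10^{-3}$ to $10^{-10}$ raises the required number of compositions only from 3 to 5, which is the doubly logarithmic behaviour stated above.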
The synthesis of $U_m$ using Selinger's method requires a T-depth that equals the T-count of the circuit. This cost depends on $\delta$ and on a constant $K$. In our analysis this cost is assumed to be constant because the number of digits of precision, and in turn $\delta$, is assumed to be constant. The controlled $-iX$ gate $\Lambda_{d+1}(-iX)$ can be implemented using a circuit of depth $2\log_2 d + 1$ [24]. This implies that the expected T-depth obeys, for some constant $K'$,
$$E(T_{\rm depth}) \le 2\left(\max_j D_j\right) + 8\log(1/\delta) + 2\log_2 d + 1 + K' \in O(\log\log(10^{\gamma})), \tag{B.5}$$
since $d \le \max_j D_j$ and $\max_j D_j \in \Theta(\log\log(10^{\gamma}))$. Therefore the circuit depth varies doubly logarithmically with $\theta^{-1}$, and it is easy to see that the online cost follows a similar scaling. This shows that another strong advantage of floating point synthesis is that it can easily exploit parallelism to reduce the time required to execute the circuits, given that a fixed number of digits of precision is required.

Figure 5. Here we plot the fit parameter $a(j)$ for a least-squares fit of the T-count as a function of $d$ for $C^{(d)}(S_j)$ to $a(j)\log_2(1/\theta(d)) + b(j)$ for different values of $j$, where $\theta(d)$ is the rotation angle generated by $C^{(d)}(S_j)$. We see evidence from the data that the efficiency of generating a small angle rotation using $C^{(d)}(S_j)$ increases at first as a function of $j$ and then saturates.

Figure 6. Here we compare the mean T-count for our composition based method, given by $C^{\circ d}(HTH)$, to Selinger's method and also to directly using the gearbox circuit $C^{(d)}(S_7)$. The dashed lines give the upper and lower limits of a 95% confidence interval for the T-count that arises from using the composition method to generate $U_e$. We see that the composition method offers superior performance to that of the circuit $C^{(d)}(S_j)$ and Selinger's method. 500 samples were used to compute the expectation values of the scalings for both non-deterministic methods.

Figure 7. Here we plot the mean T-count for $C^{d}(C^{\circ j_1}(HTH), \ldots, C^{\circ j_d}(HTH))$ as a function of the rotation angle generated by the circuit. The dashed lines give the upper and lower limits of a 95% confidence interval for the T-count, and the + symbols show the average T-count. The data scales approximately as $a\log_2(1/\theta) + 22$, where the value of $a$ that gives the least-square error is 1.14 and $a \in [1.08, 1.20]$ with probability 0.95. 500 samples were used to find the distribution of the T-count for each value of $\theta$.

5th percentile of the T-count are approximately 1.04 and 1.18 respectively. This suggests that small rotations generated by $C^{\circ d}(HTH)$ will have, with high probability, smaller T-counts than existing methods. A drawback of using $C^{\circ d}(HTH)$ as opposed to $C^{(d)}(HTH)$ to generate $U_e$ is that $C^{\circ d}$ generates small rotation angles that scale as $\tan^{2^d}(\pi/8)$, which does not give fine control over the rotation angle if only the variable $d$ is used to control the rotation. The drawback of having poor control over the rotation angle used for $U_e$ can be addressed, at a modest cost, by using the gearbox circuit $C^{(d)}$ in tandem

Table 2. Minimal absolute values of non-zero off-diagonal entries $u$ of unitaries with optimal T-count equal to $N_T$. Here we show the optimal T-count as a function of the smallest absolute value $|u|$ of off-diagonal entries of the unitary. The data scales approximately as $a\log_2(1/|u|) - 1.064$, where the value of $a$ that gives the least-square error is 2.98 and $a \in [2.95, 3.03]$ with probability 0.95.