Supervised learning of time-independent Hamiltonians for gate design

We present a general framework to tackle the problem of finding time-independent dynamics generating target unitary evolutions. We show that this problem is equivalently stated as a set of conditions over the spectrum of the time-independent gate generator, thus translating the task into an inverse eigenvalue problem. We illustrate our methodology by identifying suitable time-independent generators implementing Toffoli and Fredkin gates without the need for ancillae or effective evolutions. We show how the same conditions can be used to solve the problem numerically, via supervised learning techniques. In turn, this allows us to solve problems that are not amenable, in general, to direct analytical solution, providing at the same time a high degree of flexibility over the types of gate-design problems that can be approached. As a significant example, we find generators for the Toffoli gate using only diagonal pairwise interactions, which are easier to implement in some experimental architectures. To showcase the flexibility of the supervised learning approach, we give an example of a non-trivial four-qubit gate that is implementable using only diagonal, pairwise interactions.

Let us consider the synthesis of a quantum operation G (a gate) from the underlying dynamics of a quantum system. Unitarity of G implies the existence of a Hermitian generator H G such that G = e iH G (we assume units such that the simulation time t is dimensionless, and rescale it so that the desired gate is successfully achieved at t = 1). Such generator will typically contain highly non-local interactions, which can be difficult to realize in a given physical setup. However, for a choice of the platform to use for the implementation of G, it is generally possible to single out a sub-set Γ of physical interactions that can be realised relatively easily and inexpensively. The question that we address here is thus: is it possible to synthesise G from an H G comprising operations drawn only from an assigned Γ?
This question is very relevant, for instance, in the context of quantum simulation, where the problem of general reachability of a target dynamics given a set of allowed physical interactions is pivotal [1]. However, it is also important for the realization of large-scale quantum computation [2,3], which relies on the capability of implementing entangling gates between many qubits and with high fidelity. A notable case is the quantum Toffoli gate, a universal reversible logic three-qubit gate [4] that is optimal for quantum error correction [5][6][7][8], and is a key component for reversible arithmetic operations such as modular exponentiation [9]. Unfortunately, the natural dynamics generating a Toffoli gate involves non-local three-qubit interactions, which are not easily implemented in experimental architectures. Other important gates, such as the Fredkin gate, share the same problem. Common strategies to overcome these limitations include a suitable use of the additional processing power offered by larger Hilbert spaces encompassing ancillary information carriers [10], and the embedding of quantum control techniques [11]. The identification of suitable alternatives to such expensive strategies for gate synthesis and simulation would represent a significant contribution to the ongoing effort towards the translation of theoretical protocols to the production line of quantum technologies [12].
In this Letter we show how the interplay between analytical results and efficient numerics -as enabled by the supervised learning paradigm -provides a powerful and flexible tool to explore a wide variety of gate design and simulation scenarios.
More specifically, we identify three analytic conditions that, when met, provide a HamiltonianH comprising only operations drawn from Γ and such that G = exp(iH) for a given target G. While these conditions do in principle allow to solve the problem in its generality, the corresponding equations soon become practically intractable. Even in this case, however, our framework provides an improved ansatz, which can be used to perform numerical optimization via supervised learning.
For the numerical optimization, we showcase an original application of supervised machine learning that allows to obtain results more efficiently than alternative methods. The context in which supervised learning is commonly explored is roughly the following: given a training dataset consisting of pairs (x i , i ) ∈ X × L and a parameterized function (usually referred to as the model) f λ : X → L, find λ 0 such that f λ 0 (x i ) i , ∀i. This is usually done by minimising the empirical loss function, that is, by solving the optimisation problem argmin λ { i L( f λ (x i ), i )} for some loss function L quantifying the distance between expected and obtained outputs (typical choices being the L 2 distance and the delta function). In other words, the goal is to approximate an unknown g : X → L from a finite number of samples. In this sense, supervised learning is akin to function fitting. Notably, this is not the kind of problem naturally encountered in gate design. The design of a time-independent Hamiltonian generating a target G, given a parametrisationH λ for the Hamiltonian, fits into the standard minimisation scenario of solving argmin λ {L(λ)} with L(λ) the operatorial distance between exp iH and G. However, as we will show in this paper, supervised learning techniques can be fruitfully applied even for this kind of problem.
Optimizing L(λ) is a daunting task, as the evaluation of e iH can require up to O(d 3 ) operations [13] for each iteration (d is the dimension of the Hilbert space). Nonetheless, it turns out that L(λ) can be expressed as a formal average over random states [14]. This suggests the use of stochastic optimization techniques, commonly employed in the machine learning literature, where one optimizes an estimated empirical loss function, rather than the exact one. For gate design, the resulting estimated loss is L(λ) i D exp iH λ |ψ i , G |ψ i with D a state distance and |ψ i a finite set of random states. As in supervised learning, the latter are used to train the quantum system so that its dynamics reproduces the expected action G |ψ i . As typicallyH is sparse, the computation of e iH λ |ψ i is efficient [15], unlike the full operator exponential. Moreover, stochasticity also helps to avoid local minima. When an analytic solution is not available, our framework gives a lower-dimensional representation of the loss function, which is then optimized via Stochastic Gradient Descent (SGD) [16,17]. Such representation is then embedded into an Automatic Differentiation (AD) protocol [18][19][20][21], to efficiently compute the gradients without resorting to numerical approximations. Remarkably, this significantly increases the efficiency of the optimisation, de facto allowing for the exploration of previously out-of-reach scenarios, and a systematic analysis of gate design and simulation problems. While the idea of using an SGD protocol to tackle gate design problems was also present in [14], the present work presents major steps forwards both from an analytical and from a numerical point of view. In particular, no analytical framework was developed in [14], and thus exact solutions could not be found. Moreover, our use of AD represents a drastic improvement in efficiency, which makes it possible to sistematically explore scenarios such as the generation of Toffoli and Fredkin gates without ancillae and with only experimentally feasible interactions, and the training of four-qubit gates.
To demonstrate such framework, we apply it to find exact and approximated gate-design strategies for problems of physical interest. In particular, we use it to devise exact Hamiltonians generating Toffoli and Fredkin gates using only pairwise interactions. On the other hand, the same conditions provide enhanced numerical ansatz for a speedy design of arbitrary N-qubit gates. We present a supervised-learning optimisation technique to train qubit networks, and demonstrate algorithmtraining instances of three-qubit Toffoli and Fredkin gates. We find that the training algorithm can very quickly find solutions with almost-perfect fidelities, when allowing only pairwise interactions and using our framework to reduce the number of free parameters. We then present an extensive exploration of training scenarios with more restrictive sets of interactions, finding approximate generators for Toffoli and Fredkin gates with good fidelities. We go beyond the three-qubit scenario by designing a four-qubit gate using only two-qubit interactions. A significant boost in performance is here made possible by the use of AD [18][19][20][21], which allows to speed-up gradientdescent-based optimization techniques in a flexible way, at the same time avoiding numerical errors and instabilities arising from numerical differentiation techniques. We also discuss the implications of our framework for problems extending beyond the field of quantum computing, and address in Ref. [22] its potential applications to quantum communication via perfect state-transfer approaches [23,24]. Finally, we present a systematic exploration of the possibility of finding generators for Toffoli and Fredkin gates with experimentally viable interactions, focusing on circuit-QED platforms. Remarkably, even in this more restrictive scenario, we find generators with good fidelities that use interaction strengths compatible with the capabilities of state of the art technologies [25].
General methodology.-We start our analysis by computing the Hamiltonian H G that generates the target gate G = e iH G . Using the spectral decomposition of G, we have H G = −iU Log(Λ)U † , where G = UΛU † , Log denotes the principal branch of the logarithm, and Λ is a diagonal matrix with the eigenvalues of U. Fixing a branch for the logarithm makes it single-valued, and H G uniquely determined from G. In general, the generator H G will contain both physical interactions, that can be realised easily in a given physical setup, and unphysical ones, i.e. dynamics that are not naturally achieved in the chosen experimental platform of the problem. Our goal is to construct a new HamiltonianH G , comprising only physical interactions, such that G = exp(iH G ) = exp(iH G ).
We assume thatH G depends on a quantity λ that parametrizes the set of physical interactions to be used. For instance, in a spin system, the physical interactions can be a certain subset of the possible two-body and single-body interactions, like the set of Heisenberg coupling strengths and local magnetic fields. In general,H G (λ) may also model a system where the original register is coupled to auxiliary degrees of freedom, though we will here focus on the case without ancillary qubits. The following three conditions are necessary and sufficient forH G to satisfy our requirements Requirements (1b) and (1c) ensure that G = exp(iH G − iH G ) exp(iH G ) and exp(iH G − iH G ) = 1, while condition (1a) reiterates the constraints we are imposing onH G . While condition (1b) may seem excessively restrictive, this turns out to not be the case. To see this, consider the spectral decomposition G = k λ k j P k j of a general gate G. Here, P k j is the j th trace-1 projector in the k th degenerate subsector of the eigenspace [26]. It follows that H G andH G can be written as H G = −i k Log(λ k )I k andH G = H G + 2π k, j ν k j P k j . Here I k = j P k j , which is not (in general) an identity (or diagonal) matrix, and we have used the expression log λ = Log λ + 2πiν (ν ∈ Z), applied to every term of the spectral decomposition of G. For anyH G we thus have [H G , H G ] = 0. Eqs. (1a)-(1c) considerably simplify the problem of gate synthesis, and can be used in several ways. On one hand, they can be analytically solved in at least some situations of physical interest. On the other hand, they can be used to produce an efficient starting point for numerical optimization techniques. The general procedure is to start from a parameterized expression forH G (λ) satisfying condition (1a), and then use (1b) to both significantly reduce the set of possible interactions and impose constraints on the parameters. The problem then reduces to the enforcing of the constraints on the spectrum of the generator set by Eq. (1c). This is the non-trivial step in the procedure, which we will however show to be analytically solvable in at least some cases. This strategy thus reduces the task of constrained gate design into an inverse eigenvalue problem, a topic well studied in the field of numerical analysis [27].
More generally, we develop a numerical supervised learning technique to avoid to directly solve the non-trivial eigenvalue problem posed by Eq. (1c). It is worth noting that, whileH G produces the same unitary evolution given by H G at time t = 1, the dynamics will in general be different at 0 < t < 1.
Applications: Toffoli and Fredkin gates.-The quantum Toffoli gate U Toff is a control-control-NOT that flips the state of the target qubit (qubit 3 in our notation) when the state of the two controls (qubits 1 and 2) is |1 1 ⊗ |1 2 , and acts trivially on qubit 3 otherwise. Its realization is an important step towards the construction of quantum computers [8,[28][29][30]. A timeindependent two-body Hamiltonian that simulates U Toff with four qubits has been obtained in [14] using a numerical optimization technique, while three qubits have only been found to make approximate and classical Toffoli gates [31]. Here, following the construction in Eq. (1), we find an analytic solution that requires as few as three qubits. Its generator, obtained by taking the principal value of the logarithm of U Toff , is whose only three-qubit term is σ z 1 σ z 2 σ x 3 , where σ α i is the α Pauli matrix on qubit i. We now parameterizeH Toff as The above expression, containing 37 parameters, automatically satisfies condition (1a) in that it corresponds to anH Toff without three-qubit interactions. Imposing condition (1b) further removes 12 parameters, leaving us with 25 [22]. These are still too many to easily solve the inverse eigenvalue problem embodied by condition (1c). We thus impose some physically plausible assumptions on the coefficients, in order to obtain a generator with a small enough number of parameters for which condition (1c) can be satisfied and the resulting equations are simple enough to be solvable. In particular, we impose J xz 12 = J zx 12 = J xx 13 = J xx 23 = 0, J zx 13 = J zx 23 = π/8, J zz 23 = −J zz 13 and h z 1,2 = −π/8. The rationale behind these assumptions is to look for a generator that is diagonal with respect to the first two qubits, does not use σ y i operators, and does not introduce new off-diagonal interactions, on top of the ones already in the principal generator. This last assumption is useful because it implies a reduced number of parameters in H Toff ≡H Toff − H Toff , which is the operator on which we have to impose Eq. (1c). Note that the above does not uniquely identify the set of assumptions, and different assumptions leading to different classes of solutions are possible. In Ref. [22] we present another example of generator, obtained via different assumptions. Imposing the above constraints we obtain The problem is now to find values for the coefficients in Eq. (4) such that exp(iH Toff ) = 1, which is equivalent to finding coefficients such that all the eigenvalues of H Toff are integer multiples of 2π. Solving for h 0 , h x 3 , J zz 12 , J zz 13 gives a family of solutions parameterized by the integer coefficients {ν 1 , . . . , ν 4 }.
The full expression is given in Ref. [22]. A simpler family of solutions depending on a single integer parameter is obtained imposing ν 1 = ν 2 = ν 3 = 0 and ν 4 = ν, and reads Eq. (5) satisfies exp(iH Toff (ν)) = U Toff for any non-zero integer value of ν. We remark that Eq. (5) can also be deduced directly by using only properties of the Pauli matrices, as shown in Ref. [22] [while this is harder in the case of the generator for the Fredkin in Eq. (7)]. Thus, a highly non-trivial gate such as Toffoli's, which in principle requires three-body interactions as in Eq. (2), can be obtained exactly without three-qubit interactions. Notably, although the generators for the Toffoli gate found here contains non-diagonal σ z -σ x interactions, which may be hard to implement in general, we will show later on how this can be overcome using a supervised learning approach. On a similar note, it is possible to use the framework provided by Eq. (1a)-(1c) to find a Hamiltonian that does not contain three-qubit interaction terms, and generates the Fredkin gate at suitable times. The quantum Fredkin gate U Fred is a three-qubit gate which swaps two qubits conditionally to the first qubit being in the |1 state, and is of use for a number of quantum information protocols [32,33]. A time-independent two-body Hamiltonian that simulates U Fred with 4 qubits has been found in [14] using numerical optimization. We find an analytic solution that requires as few as three qubits. Explicitly where qubit 1 is the control. Eq. (6) contains both two-and three-body interactions. We now write down the general parameterized expressionH G (λ) for a 3-qubit Hamiltonian containing only pairwise diagonal interactions, and imposing Eq. (1b) we cut the number of parameters λ down to 22. Imposing some physically plausible additional conditions, like the symmetry of second and third qubit, we finally manage to reduce the number of parameters enough to solve the eigenvalue problem, finding the following solutioñ This proves that also a non-trivial gate of crucial relevance such as the Fredkin gate can be implemented without timedependent dynamics using only at most two-qubit interactions. The physical reason behind this simplification can be understood from the study of the spectral properties. For example, the gate U Fred has only two eigenvalues, λ ± = ±1 with λ + having a sevenfold degeneracy due to the symmetries of the gate.
De facto, such degeneracy makes the propagator generated by H Fred operate in a two-level subspace. On the other hand, the spectrum ofH Fred − H Fred is {−4π, −2π, 0, 0, 0, 2π, 2π, 4π}, showing that the degeneracy in the spectrum of H Fred is partially lifted when consideringH Fred − H Fred , and the dynamics thus occurs in an larger effective Hilbert space. Although exp(itH Fred ) is non-symmetric for most of the evolution times, all the symmetries are restored at t = 1 and any subsequent integer times. This shows that breaking the symmetries of G and exploiting its degenerate space can help the gate gesign when restricting the set of viable interactions. Supervised learning approach-We now describe a different methodology to solve the difficult part of Eq. (1), that is, imposing the condition on the eigenvalues represented by Eq. (1c). While the direct algebraic approach fails as soon as we consider more than a few parameters, and for example already fails to find solutions for the Toffoli gate with only diagonal interactions, the method we present here scales much better with the number of interactions and is easily generalized to any kind of structure of the qubit network. The idea is to adopt a supervised learning approach to solve the optimization problem of finding the set of Hamiltonian parameters generating a target evolution.
The problem we address is a generalizion of the one presented in the introduction. Given a target gate G and a pa- a set of real parameters and O i are Hermitian operators, we look for the set λ 0 such that exp(iH(λ 0 )) = G. This can be reframed as an optimization problem by considering the fidelity function F (λ, ψ) ≡ ψ|G † exp(iH(λ))|ψ , for an arbitrary state |ψ . Clearly, F (λ 0 , ψ) = 1 for all |ψ iff exp(iH(λ 0 )) = G. A possible approach to find such λ 0 is to consider the average fidelity functionF (λ), defined as the average of F (λ, ψ) over all ψ [34]. Explicit formulas forF are known [35][36][37], so that standard optimisation methods can be used. This method is however inefficient for the problem at hand, due to the size of the underlying parameter space. We thus turn to a different technique, exploiting how the fidelity landscape changes when changing the state |ψ [14]. We can indeed use the fact that the only values of λ for which the fidelity is 1 regardless of |ψ are those corresponding to our solution. By employing an SGD technique [16,17,38], we implement the iterative procedure described in Algorithm 1, whose detailed presentation is provided in Ref. [22]. To find the interaction parameters implementing a Toffoli gate, using only one-qubit evolutions and two-qubit diagonal interactions (i.e. interactions of the form σ α 1 σ α 2 ), we start the numerical training from the Hamiltonian obtained by imposing Eq. (1b) on the parametrized Hamiltonian containing the required interactions, which has the form (9) Starting the training from Eq. (9), many different solutions can be found, depending on the chosen initial conditions and the random states that are used at each run. In Fig. 1 (a), several different solutions are shown, proving that it is indeed possible to implement a Toffoli gate using only pairwise diagonal interactions. This analysis can be extended to a Fredkin gate, Algorithm 1: Stochastic gate design 1: Choose values for the set λ 0 forH G , such that Eq. (1a) is satisfied. 2: Use Eq. (1b) to find a reduced set λ (linearly related to λ 0 ). 3: repeat Iteratively solve Eq. (1c) 4: Generate a random set of N b input states {|ψ k } with N b the size of the mini-batches chosen beforehand. 5: For each k, compute ∇ λ F (λ, ψ k ). Machine learning frameworks like Theano [39], TensorFlow [40], or PyTorch [41] (among others) enables the calculation of gradients automatically from the chain rule. This avoids numerical errors arising from numerical differentiation. 6: Update the coupling strengths λ. We do this using the socalled momentum gradient descent method [42], corresponding to the following updating rule Here the learning rate η and the momentum term γ are hyperparameters to be chosen beforehand. The value of η should decrease with the iteration number. 7: until a satisfactory value of the fidelity is obtained. whose generator with only diagonal pairwise interactions and commuting with the principal generator of the Fredkin is found to have at most two-body terms. Using this model as starting point for the training we again obtain several solutions, some of which are shown in Fig. 1 (b). We then test our supervised learning framework by exploring numerically the possibility of implementing Toffoli and Fredkin gates using a more restrictive set of interactions. The results provide a series of physically-feasible interaction parameters that realise Toffoli and Fredkin gates with good fidelities [22].
More generally, there is no need to limit ourselves to the training of three-qubit networks. To illustrate this, we provide yet another example of successful application of our framework, this time to implement a non-trivial unitary evolution over four qubits. In particular, we successfully train a fourqubit network to implement the doubly-controlled Fredkin gate U FF , defined as U FF ≡ |0 0| ⊗ U Fred + |1 1| ⊗ U Fred , where U Fred denotes a Fredkin gate in which the control qubit is the third one, and the first two are the target ones. It turns out that such four-qubit gate can be implemented using no more than two-qubit interactions, and that this set can be furthermore restricted to only consider diagonal ones. Some examples of such solutions are shown in Fig. 1 (c). Note that these results require no ad-hoc approach or ansatz, differently from the approach used to derive Eqs. (5) and (7). This, in particular, makes it easy to test any hypothesis such as "can gate X be implemented using only a set of interaction Y" without having to go through extra ad-hoc calculations.
Conclusions-We have presented a general framework to approach constrained gate-synthesis problems. We have showed that the procedure is amenable to direct analytical solution, providing time-independent Hamiltonians generating Toffoli and Fredkin gates using only undemanding diagonal interactions and no ancillary qubits. To our knowledge, no previous attempt at such a decomposition has been reported so far. Generality can be added to our approach by powerful techniques of supervised learning of the interaction parameters, which Acknowledgements-This work was supported by the Euro-

Supplementary Material
In Section I of this Supplementary Materials we explore further the solution framework introduced in the main text. All presented results are readily reproducible via the Mathematica code freely accessible in the GitHub repository at lucainnocenti/quantum-gate-learning-1803.07119-Mathematica-code. In Section II are given the details of the supervised learning approach to the gate design problem. Again, all results are reproducible using the code available in the GitHub repository at lucainnocenti/quantum-gate-learning-1803.07119.

I. ANALYTICAL APPROACH
In this section we provide a detailed description of how to apply the framework, introduced in the main text, to approach gate design problems analytically.
In Section I A we show how our framework can be applied to tackle generic perfect state transfer problems. In Section I B we prove that a CNOT gate cannot be implemented using only single-qubit interactions. While this is an obvious result, the calculation can be interesting to show, in a simple case, how the eigenvalue conditions given in Eq. (1c) can be used to rule out that a gate can be implemented with a specific set of interactions. In Section I C we show how to use our framework to derive Hamiltonians implementing the Toffoli gate, using only one-and two-qubit interactions. In Section I D we show how one of the solutions obtained in Section I C could have been obtained via direct analytical reasoning, without any knowledge of the conditions given in the main text. This provides some insight into how the solutions actually generate the Toffoli gate, and illustrates the kind of ad-hoc non-trivial reasoning that, without the use of Eq. (1), would have been necessary to find such solutions.
For notational convenience, we will in this section denote Pauli matrices with X i , Y i , Z i , instead of σ x i , σ y i , and σ z i as in the main text.

A. Perfect state transfer
A one-dimensional quantum walk is described by the Hamil- where N is the length of the lattice upon which the walk takes place, J k the transition rates between adjacent sites, B k the local energies and |k defines the state where the "walker" is at the k-th site. A quantum walk Hamiltonian H W admits perfect state transfer (PST) at time t, i.e. the initial state of the walker initially at site k is perfectly retrieved at site where Ξ k j = δ k,N− j+1 is the reflection matrix. Necessary and sufficient conditions for PST are well understood [23,24]: firstly, H W has to be "mirror-symmetric", that is [H W , Ξ] = 0, and secondly the eigenvalues {E k } of H W should to satisfy the condition e iE k t = (±1) k .
We show now that finding the parameters {J k , B k } for PST is a particular case of the Hamiltonian design problem for gate simulation. Specifically, the conditions for PST can be obtained from the construction in Eq. (1). In a 1D quantum walk, the "physical" couplings are the nearest neighbour interactions, but , where the physical interactions in H Ξ have been reabsorbed into the definition of H W , one finds that: (i) condition (1b) is equivalent to the mirror-symmetry request [H W , Ξ] = 0; (ii) from conditions (1c) and from the definitions of H W and H Ξ , one finds that the spectrum of H W satisfies e iE k t = (±1) k -indeed the eigenvalues of Ξ are {0, π}, so E k = π(2n k + 1) for integer n k . It is straigthforward to extend the above proof for more general transfers, such as with long-range interactions [44], or for perfect fractional revivals [45,46].

B. Proof that CNOT needs 2-qubit terms
We here show how to use our framework to prove that a CNOT gate cannot be implemented using only one-qubit interactions. While this result is trivial, it is nonetheless interesting to show how the framework can be used to obtain this kind of impossibility results.
The spectral decomposition of the CNOT reads where we made a canonical choice for the basis of the threefold degenerate eigenspace corresponding to the eigenvalue +1, and defined Z ± i ≡ (1 ± Z i )/2, and similarly for X ± i . More explicitly, we are considering the following basis set of trace-1 projectors for the degenerate space: The corresponding principal Hamiltonian H CNOT , obtained by directly taking the logarithm of Eq. (10), is: Let us now consider what happens when the multivaluedness of the logarithm is taken into account, but no rotation of the degenerate eigenspace is performed. Considering only the factors with two-qubit interactions, the following expression is found: where here ν i j ≡ ν i − ν j , and ν i ∈ Z is the integer produced by application of the logarithm to the i-th projector. Note how the Z 1 X 2 factor cannot be removed by any choice of ν i , which could be interpreted as a proof that two-qubit interaction terms are indeed necessary to implement the CNOT gate. This, however, does not in principle preclude the possibility that an appropriate rotation of the degenerate space allows to obtain a generator with only local terms. To verify that this is not the case, we would have to consider a generic rotation R of the degenerate space, that is, an operator of the form R = 3 i, j=1 r i j +1 i +1 j , with |+1 i the i-th eigenvector in a fixed base of the degenerate space. The problem is then that of finding a unitary R and integers ν i such that does not contain 2-qubit interactions. The solution of this problem is non-trivial, mostly due to the many (9 in this case) parameters characterising a general unitary R. To avoid searching solutions for such a system, we try a different approach to the problem. Let us denote withH a generator with the required properties: one that generates the same unitary as H CNOT and contains only 1-qubit interaction terms. Its general form will be:H As shown in the main text, forH to correctly generate CNOT, it must commute with the principal generator H CNOT . Imposing this commutativity removes most of the parameters h α i , leaving us with the following simplified expression: The only missing step is now to impose the eigenvalues of H CNOT −H to be integer multiples of 2π. This gives the following system of equations: 2πν 1 = (−π + 4h 0 − 4h 1,3 − 4h 2,1 )/4, 2πν 2 = (+π + 4h 0 + 4h 1,3 − 4h 2,1 )/4, 2πν 3 = (+π + 4h 0 − 4h 1,3 + 4h 2,1 )/4, 2πν 4 = (−π + 4h 0 + 4h 1,3 + 4h 2,1 )/4, with ν i ∈ Z. The above system can be seen to have no solution for h 0 , h 1,3 , h 2,1 , therefore definitively proving that there is no rotation of the degenerate space, and integer parameters ν i , that allow to generate the CNOT gate using only one-qubit interactions.

C. Toffoli gate: derivation through conditions
We show here the details of how, using Eqs. (1), (2) and (3) in the main text, we can obtain a family of Hamiltonian generators for the Toffoli gate, containing only single-and two-qubit interactions.
The Toffoli gate can be written as where we defined Z ± i ≡ (1 ± Z i )/2, and similarly for X ± i . The principal generator of U Toff is H Toff = πZ − 1 Z − 2 X − 3 , that is, highlighting the three-qubit interaction term, We start by writing down the general parametrisation of an Hamiltonian containing only one-and two-qubit interactions: Assuming for simplicity thatH Toff does not contain Y i components, and imposing the commutativity with H Toff , we get the following expression: As can be directly verified, the above satisfies [H Toff , H Toff ] = 0 while also containing only one-and two-qubit interactions, and no Pauli Y matrices. We could now directly try and find the set of parameters making the eigenvalues ofH Toff − H Toff all integer multiples of 2πi, but the associated calculations are made hard by the many parameters involved. It turns out however that we can make a number of further assumptions on the form ofH Toff , and still obtain a viable family of solutions. One such set of assumptions, that leads to a satisfying family of solutions, is J xz Finally, we impose that the generator is diagonal on the first two qubits, that is, J xx 13 = 0. Using this simplified expression, H Toff =H Toff − H Toff becomes With the above simplified expression it is now possible to directly solve the eigenvalue problem. This results in the following family of solutions: with for all integer values of ν i such that c(ν 1 , ν 2 , ν 3 , ν 4 ) ≥ 0. The corresponding spectrum of H Toff =H Toff − H Toff is while the spectrum ofH Toff changes only in that λ 2 = 2π(ν 1 + 1/2). Consistently with this, λ 2 is also the eigenvalue corresponding to the non-degenerate eigenspace of H Toff , while all the other eigenvalues correspond to eigenvectors orthogonal to this one. More specifically, we have where It is worth noting that the orthogonality of these eigenvectors follows from the easily verified property of the above coefficients: a 2 −b 2 = 1. Furthermore, we note that c(ν 1 , ν 2 , ν 3 , ν 4 ) ≥ 0 cannot be satisfied unless ν 3 ν 4 . This in turn, looking at Eq. (25), reveals that all the solutions are made possible by a non-trivial lifting of the degeneracy of the subspaces |0, 1 0, 1| and |1, 0 1, 0|. Let us now try to understand how and why the derived H Toff works. Let us use the notation P i ≡ |λ i λ i |, and consider the projector over the last two eigenvectors. Highlighting the 3-qubit terms, we find The term in the Hamiltonian to which these two projectors contribute is 2πν 3,4 (P 7 + P 8 ), with ν 3,4 = ν 3 + |ν 3 − ν 4 |. A little algebra reveals that the 3-qubit terms in P 7 + P 8 are Recalling the definitions of a, b, N 5 , N 6 , we see that the coefficient of Z 1 Z 2 Z 3 vanishes, and the resulting expression becomes Substitution of the appropriate values of ν i shows that the above term can be used to generate the 3-qubit factor π/8 Z 1 Z 2 X 3 , without introducing additional 3-qubit factors. In Box 1 are given the full expressions for the projectors and the found solutions for the Toffoli gate. It is also interesting to note that all of the above still holds if the X i operators are replaced with Y i operators. That is, the expressions found solving for the Toffoli, by simple substitution X i → Y i , also give a generator with only 2-qubit interactions for the CCY gate (that is, the gate that applies Y to the third qubit conditionally to the first 2 qubits being in the |1 state). A different way to understandH Toff is to analyse the four two-dimensional subspaces on the main diagonal, exploiting the fact thatH Toff acts diagonally on the first two qubits. Straightforward calculations lead to It can be verified that for all values of ν 1 , ν 2 , ν 3 , ν 4 , the twodimensional identity and X are correctly generated in the |00 and |11 spaces, respectively. On the other hand, in the |01 and |10 spaces, the two-dimensional identity is generated as long as ν 3 ν 4 , as was also derived before.
In particular, the class of solutions given by ν 1 = ν 2 = ν 3 = 0 is for all ν 4 0. It is interesting to look at the explicit form of the exponentials generated by this class generators. Computing exp iHt withH as in Eq. (31), we get the following unitary: where Box 1: Toffoli .
It is easily verified from the above that so that the sum of the projectors gives the identity as it should. On the other hand, multiplying by the appropriate ν i factors, we get 2π [ν 1 (P 1 + P 2 ) + ν 2 (P 3 + P 4 )] = (...) + π 2 (ν 2 − ν 1 )Z 1 Z 2 X 3 , 2π [ν 3 (P 5 + P 6 ) + ν 4 (P 7 + P 8 )] = (...) with the last identity holding for ν 3 ν 4 . and For large (in modulus) values of ν 4 , a + b → e 2iπtν 4 , a − b → 1 and c → 0, so that the exponential becomes which very closely resembles the matrix obtained by exponen- A different solution derived from Eq. (22) is Moreover, it is worth noting that Eq. (22) is only one possible family of solutions, and that different assumptions will lead to different ones. For example, a similar reasoning as above, starting however from the assumptions J zz 23 = J zz 13 will lead to solutions such as (note the use of (Z 1 + Z 2 ) terms here, making this solution not derivable from Eq. (22)): D. Toffoli gate: an example of direct a posteriori derivation We will here show a line of thought that could have conceivably led to Eq. (31) (in the case ν 4 = 1), by direct analysis, and without using any of the tools shown in the paper. It will be useful to keep in mind the expressions of Z 1 ± Z 2 : Given that we want to generate a CC-X gate, and remembering that exp iπ 2 (1 − X) = X, it is reasonable to start building our Hamiltonian as which however will generate an X evolution both in the |00 and in the |11 sectors, while we want it only in the latter sector: We can remove the term in the |00 sector exploiting the sign difference introduced by Z 1 + Z 2 , by directly adding an appropriate 1-qubit interaction term: where we remember that exp(iπX − ) = X. Equation (42) now correctly reproduces the evolution on |00 and |11 , but also wrongly evolves |01 and |10 . To remove these additional terms while at the same time leaving the others unaffected we use the fact that exp(iπ(1 ± σ)) = 1, for any normalised vector of sigma matrices: σ ≡ 3 i=1 n i σ i with n 2 1 + n 2 2 + n 2 3 = 1. To convert the central terms in Eq. (42) into something like this we observe that we can rewrite the second term in the above equation as Remembering that Z 1 Z 2 = diag(1, −1, −1, 1), we substitute the above with π/8(5 − 3Z 1 Z 2 − 2X 3 ). This change affects only the central terms, converting the expression into: π diag(0, 1 − X/4, 1 − X/4, X − ). The reason this form is preferable is that we can now simply add a factor in the central terms to convert them into an expression of the form 1 − σ. Adding an interaction of the form πα(Z 2 − Z 1 )Z 3 gives For the central terms to exponentiate to the identity we need them to become of the form 1 − σ with normalised σ. This is easily achieved by choosing α = ± √ 15/8. The final expression is therefore: Note that instead of πα(Z 2 − Z 1 )Z 3 we could have equivalently chosen πα(Z 2 − Z 1 )O 3 for any O 3 = aY 3 + bZ 3 and a 2 + b 2 = 1. The above reasoning explains the origin of the weird √ 15 factor: it comes as the coordinate necessary to make the vector unitary: for X/4 + xZ/4 to be normalized, x = √ 15 must be satisfied.

II. SUPERVISED LEARNING APPROACH
We here study more in depth the following problem: given a target gate G and a parametrised Hamiltonian H(λ) = k λ k σ k , with λ k ∈ R and σ k Hermitian operators, what is the set of parameters λ 0 such that exp(iH(λ 0 )) = G? We present a supervised learning approach to numerically solve this problem, considerably extending the ideas presented in Ref. [14]. Thanks to a number of numerical optimisation techniques, and in particular the use of AD [18][19][20][21], we can explore a variety of different scenarios, optimising over potentially hundreds of Hamiltonian parameters. On top of this, condition (1b) in the main text is used to further speed-up the numerical training, removing many interaction parameters that are known not to lead to the target gate.

A. Supervised learning
Supervised learning is the task of inferring or approximating a function, given a set of pre-labeled data [16,47]. A supervised learning algorithm starts with some model -a functional relation g λ parametrised by a set of parameters λ -and finds a λ 0 making g λ 0 as close as possible to a target function f . To do this, a set of pre-labeled training data {(x 1 , y 1 ), (x 2 , y 2 ), ...} is used, where here y k = f (x k ) is the output that we want the algorithm to associate to the input x k .
A NN is trained by optimising its parameters using a dataset of pre-labelled data. A common way to do this is use variations of a gradient-descent-based technique named SGD. Gradient descent algorithms aim to optimize a problem function f (x), starting from an initial point x 0 and performing a number of small steps towards the direction of maximum slope (that is, ∇ f (x)). The optimal point x opt is thus obtained via a sequence of small perturbations of the point x, which starting from x 0 reaches the nearest local stationary point by following the slope. In the simplest version of the algorithm, the update rule is simply x → x − η∇ f (x), with η a small real parameter commonly referred to as learning rate. SGD, on the other hand, is suitable for a situation in which one is given a parametrised functional relationship of the form f (x; w), and asked for a set of "parameters" w 0 such that f (x; w 0 ) is minimum (maximum) for all inputs x. Such a case can be handled via SGD, which in its simplest form involves picking a random x 1 , executing a number of gradient descent iterations over w, then picking a new x 2 and iterating the procedure. The updating rule for SGD is therefore of the form While standard gradient descent, being a local optimisation algorithm, is liable to getting stuck in local minima, SGD can at least partially avoid this issue, in that generally a local minimum for an input x is not a local minimum for a different input x . Many variations of SGD are used in different circumstances. For example, in the so-called mini-batch SGD, instead of updating with a single input x, one uses a batch of inputs {x 1 , ..., x M }, and updates the parameters using the averaged gradient: w → w − η M k=1 ∇ w f (x k ; w)/M. More sophisticated updating rules are used to increase the training efficiency in different circumstances. Common techniques involve dynamically updating the learning rate, or using momentum gradient descent [61,62] techniques.
To see how this class of optimisation problems is relevant to us, consider the fidelity function F defined as with G the target gate, λ the set of parameters, and ψ an input state. The gate design problem is then equivalent to finding λ such that F (λ, ψ) is maximised (that is, equal to 1) for all ψ.
One possibility to solve this problem is to consider the average fidelityF (λ), for which explicit formulas are known [35][36][37]. Standard optimisation methods, like standard gradient descent or differential evolution, can be applied directly onF (λ). This, however, reveals to be impractical, due to the complexity of the associated parameter landscapes. On the other hand, SGD allows to use a simple and efficient local maximisation method, while at the same time being less prone to getting stuck in local maxima. This works particularly well in this case, because we know that the sets of parameters corresponding to the target gate are all and only those such that for all inputs ψ the fidelity is unitary.
A crucial step, efficiency-wise, in gradient descent algorithms, is the evaluation of the gradient. Numerically approximating the gradient, as done in previous works [14], is generally inefficient and scales badly with the number of optimised parameters. Here we will instead make use of the powerful technique of Automatic Differentiation (AD) [19,21], described in Section II B. AD dramatically improves the training efficiency, allowing to explore a richer variety of circumstances.

B. Backpropagation
The gradient evaluation phase is efficiency-wise crucial for the training of a neural network. Computing the partial derivatives of the cost function with a standard method, like finite differences, has a complexity O(N 3 w ), with N w the number of parameters to differentiate [16]. This inefficiency can however be avoided using error backpropagation via AD. With this technique, the complexity of the gradient evaluation phase can be cut down to O(N 2 w ) [16]. This works by first decomposing the cost function of the model in terms of elementary operations, that is, functions the gradient of which is known analytically. In this way the computational graph representing the functional relation between input and output is built. A computational graph is a directed acyclic graph, whose nodes represent the operations, and edges the flowing direction of inputs into outputs (see Fig. 2). Once the computational graph is built, the gradients with respect to the model parameters can be computed efficiently. This happens in two stages, as schematically illustrated in Fig. 2. At first, every node of the computational graph is progressively computed, starting from the inputs (the current values of the model parameters) up to the final value of the error function. During this process, the intermediate values of the elementary operations are cached. This is the so-called feed-forward phase. The second phase (so-called backpropagation phase) starts from the output, and consists of progressively computing the gradients of the error function with respect to the independent variables.
To better understand AD, let us consider a simple example. Suppose the error function of the model is of the form g(w) ≡ f ( f (2) ( f (1) (w))), where w is a set of parameters, and f (i) are intermediate "elementary" functions, the gradients of which are supposed to be known analytically. Making use of the chain rule, the gradient of g reads where y (2) = f (2) ( f (1) (w)) and y (1) = f (1) (w). During the feedforward phase the values of y (1) and then y (2) are progressively computed and cached. Using y (2) , and the known expression for ∂ k f , ∂ k f (y (2) ) is then efficiently computed. The process continues by evaluating ∇ f (2) k , which is written as Again, being y (1) already computed during the feed-forward, ∂ j f (2) k (y (1) ) is readily computed. The last component needed for the full gradient is ∂ i f (1) j (w), all parts of which are known. This method therefore allows to efficiently evaluate numerical the gradient of complicated functions, without approximating the derivatives.
In the context of training neural networks, the function to be derived is the cost function of the network, that is, roughly speaking, the (euclidean) distance between the result obtained for an input and the corresponding training output. For the gate design problem, we will use another notion of distance between output obtained and output expected. For quantum states, the fidelity between these turns out to work well.

C. Implementation details
We used python as language of choice for the implementation of the supervised learning. Being Python language of widespread use in the machine learning community, many libraries and frameworks are available to build computational graphs over which AD can be used. In particular, we used Theano [39], together with the QuTiP library for the simulation of the dynamics of quantum systems [63,64].
Our implementation allows the training of an arbitrary target gate, parametrised via a time-independent Hamiltonian H(λ). The parametrisation is completely arbitrary (provided the dependence on the parameters is linear), so that the Hamiltonian can be chosen as H(λ) = i λ i A i for any set of matrices A i and number of parameters λ i . This is made possible by the flexibility of AD, which allows to automatically build an efficiently differentiable computational graph, without needing to hardcode the structure of the Hamiltonian.
The goal of the algorithm is, given a target gate G and a parametrisation for the Hamiltonian H(λ), find the λ 0 such that exp(iH(λ 0 )) = G. We use for the purpose mini-batch SGD with momentum. The mini-batch version of SGD involves computing the gradient, at every iteration, averaging over the gradients computed for a number of states. Making such batches of states larger or smaller allows to enhance or decrease the variance of the gradients with respect to the input state. The use of momentum [61,62] involves using a modified version of Eq. (46). The updating rule is instead given by where here η is the learning rate and γ the momentum. The use of the auxiliary parameter v during the training discourages sudden changes of direction, and can make the training significantly more efficient [61]. While the cost function F is always real, some of the intermediate calculations needed to compute it involve complex numbers. While this poses no fundamental problems, many of the Machine Learning (ML) libraries do not support AD over functions with complex inputs or outputs. We worked around this problem using a similar trick to the one reported in [65]. In particular, to use the existing framework, we mapped the problem into one involving only real numbers. To do this, we map complex matrices into real ones via the bijection A → Re(A) ≡ 1 ⊗ A R − iσ y ⊗ A I , where A R and A I are the real and imaginary parts of A, respectively. At the same time, state vectors are to be mapped to Ψ → Re(Ψ) ≡ (Ψ R , Ψ I ) T . It is easy to verify that with this mapping AΨ → Re(AΨ) = Re(A)Re(Ψ), so that all calculations can be equivalently be carried out with the real versions of matrices and vectors.
More specifically, the employed algorithm involves the following steps: 1. Choose an initial set of parameters λ (randomly, or specific values if one has an idea of where a solution might be). A number of other hyperparameters have to be decided at this step, depending on the exact SGD method used. In particular, for mini-batch SGD with momentum and decreasing learning rate, one has to decide the momentum γ, the initial value of η, the rate at which η decreases during the training, and the size N b of the batches of states used for every gradient descent step.
2. Repeat the following loop N e times, or until a satisfactory result is obtained. Each such iteration is conventionally named an epoch. Another hyperparameter to be chosen beforehand is the number of training states N tr to be used in each epoch. Once this is fixed, every epoch will involve a number N tr /N e of gradient descent steps, each one using N e states for a single gradient calculation. N e random training states are sampled, to be used during the epoch. (d) Return to point (a).

D. Results
A sample of training results for Toffoli, Fredkin, and "double Fredkin" gates are given in Fig. 1 (a), (b), and (c) in the main text. In Figs. 3 to 5 are shown the training histories of the parameters for eight different solutions for Toffoli, Fredkin and double Fredkin, respectively. These illustrate how quickly the networks converge for different initial values of the parameters. In all the shown cases the target gates are obtained with unit fidelity up to numerical precision (that is, all fidelities are between 1 − 10 −16 and 1). Different sets of optimisation hyperparameters are found to give acceptable solutions. For the trainings shown in this paper we used a dynamically updated learning rate given, for the k th epoch, by η = 1/(1 + kα) with the decay rate α = 0.005. The other hyperparameters were chosen as γ = 0.5, N b = 2, N tr = 200. Different initial values for the parameters were tested, but in most cases we started the training with either vanishing or random (following a normal distribution) parameters. For the training of the four-qubit gate we found the network to converge sooner to a solution when the parameters were initialised to a positive value (often with all parameters initialised to 4).
In Figs. 6 to 8 we report the behaviour of the fidelity upon changes of the learnt Hamiltonian parameters, for Toffoli, Fredkin and double Fredkin gates, respectively. As shown in these plots, the stability of the implemented gates with respect to variations of time and interactions values greatly varies between different solutions, as well as between different parameters in the same solutions.
To assess the feasibility of nontrivial gates in more restrictive experimental scenarios, we performed a systematic analysis of the reachability of Fredkin and Toffoli gates when allowing FIG. 2. Examples of AD in backpropagation mode. (a) Schematic representation of AD of a function with one output and two inputs. Starting from numerical values for x 1 and x 2 , one computes g(x 1 , x 2 ) and then f (g(x 1 , x 2 )). To get ∇ f (g(x 1 , x 2 )), one then computes f (g(x 1 , x 2 ))∂ i g(x 1 , x 2 ). Note that all components of this expression are known: f and ∂ i g are known by assumption, and the value of g(x 1 , x 2 ) has been computed and cached in the forward propagation phase. (b) Example of application of AD to compute the gradient of cos(x 1 x 2 ). (c) Using the same example function as (b), we give an example of the actual number computed at all stages when the inputs are (x 1 , x 2 ) = (π/2, 2). only for single-qubit and X i X j + Y i Y j two-qubit interactions, and in the less restrictive setting of allowing for all X i X j and Y i Y j interactions. The results are shown in Figs. 9 to 12. For the Fredkin gate, in the more restrictive XX interactions setting, the biggest fidelity obtained was F 0.94, while when allowing for all XX and YY interactions the maximum fidelity obtained was F 0.999. For the Toffoli gate, the maximum fidelity obtained in the XX scenario was F 0.94 as well, while when allowing for all XX and YY interactions the best training results corresponded to F 0.98. To have more consistent results, in all the training instances shown here all the hyperparameters, except for the interaction parameters' initial values, were chosen to have the same value. In particular, each training instance was run for 200 epochs, each one using 200 random quantum states as inputs, divided in batches of 2 elements. This choice of hyperparameters is mostly empirical, and it is possible for different values to provide better results.
The above provides further evidence for the flexibility of the supervised learning approach, which can produce solutions with good fidelities even in more restrictive scenarios, closer to the capabilities of state of the art experimental architectures. Furthermore, the values of the interaction strengths for many of the presented solutions are found to be compatible with the capabilities of state of the art circuit-QED architectures with gate times of the order of tens of nanoseconds [25].
Additional solutions and data, as well as the code used to produce them, is available in the GitHub repository lucainnocenti/quantum-gate-learning-1803.07119. This repository contains all the code used to reproduce the solutions presented in this paper, as well as to train arbitrary gates on arbitrary numbers of qubits. Even more generally, arbitrary (linearly) parametrised matrices can be used as training model, allowing a high degree of flexibility.  i j X i X j + J (2) i j Y i Y j . The initial conditions are chosen as in Fig. 9. The maximum fidelities obtained are F 0.98, obtained in the c = 4 and c = 5 sectors.