Universal and optimal coin sequences for high entanglement generation in 1D discrete time quantum walks

Entanglement is a key resource in many quantum information applications and achieving high values independently of the initial conditions is an important task. Here we address the problem of generating highly entangled states in a discrete time quantum walk irrespective of the initial state using two different approaches. First, we present and analyze a deterministic sequence of coin operators which produces high values of entanglement in a universal manner for a class of localized initial states. In a second approach, we directly optimize the sequence of coin operators using a reinforcement learning algorithm. While the amount of entanglement produced by the deterministic sequence is fully independent of the initial states considered, the optimized sequences achieve in general higher average values of entanglement that do however depend on the initial state parameters. Our proposed sequence and optimization algorithm are especially useful in cases where the initial state is not fully known or entanglement has to be generated in a universal manner for a range of initial states.


I. INTRODUCTION
Entanglement plays a fundamental role in quantum information processing [1] and maximising it is an important goal for reaching high fidelity operations. Developing and understanding methods for creating, maximising and, more generally, engineering entanglement are therefore currently important research topics. While most often entanglement between different states of the same degree of freedom is considered, more recently so-called hybrid entanglement between different degrees of freedom has attracted interest [2]. If these degrees of freedom belong to the same particle, more information can be encoded at the single-particle level, which can help to reduce the required resources. Hybrid entanglement has recently been created experimentally in photonic architectures [3,4] and has also been studied in neutrons [5] and bosonic atoms [6].
Here, we discuss the generation of hybrid entangled states in discrete time quantum walks, where the motion of a particle that moves in a high-dimensional discrete space depends on an internal, two-dimensional coin degree of freedom [7][8][9]. The evolution itself consists of the recurrent application of a coin and a shift operator, which in general leads to entanglement between the walker and the coin. Quantum walks have already been realized in a variety of physical systems such as cold atoms [10,11], trapped ions [12,13], superconducting qubits [14,15], neutral atoms [16,17], nuclear magnetic resonance systems [18,19] and photonic architectures [20][21][22]. Hybrid entanglement generation has been observed as well [23,24].
Recently, different approaches to enhance the entanglement between the walker and coin have been explored. It was shown that disorder in the coin can increase the amount of hybrid entanglement created by the walk [25][26][27][28], and that randomly choosing the coin operator at each time step of the quantum walk can lead to maximally entangled states in the asymptotic limit, independent of the initial state [25]. However, the large number of steps this strategy requires makes the scheme unrealistic for current experiments. As a possible solution, the optimization of the coin operator sequence was suggested, and it was shown that this can reduce the number of steps to fewer than 10 [29,30]. However, the entanglement that can be generated by optimising is highly dependent on the initial state, and potentially requires the full set of possible coin operators to be realized experimentally. In an alternative approach, Wang et al. suggested restricting the set of possible coin operators to just the Hadamard and Fourier coins and showed that certain sequences give rise to highly entangled states with as few as 20 steps [24]. However, the optimal sequences they found were also highly dependent on the initial state.
In this work we present and discuss deterministic coin sequences that allow the creation of large amounts of hybrid entanglement in a quantum walk. To be experimentally realistic, we restrict ourselves to Hadamard and Fourier coins only and aim at a minimal number of steps. It is worth noting that deterministic sequences of coin operators have already been studied in the context of the localization-delocalization transition in quantum walks [31][32][33], where the considered sequences range from periodic cases [31] to aperiodic ones like the Thue-Morse, Rudin-Shapiro, and Fibonacci sequences [32,33].
The first sequence we discuss is designed to create the same, large amount of entanglement independently of the localized initial state, as long as it has a vanishing relative phase. The structure of the sequence allows it to work for any odd number of time steps, and it is therefore also useful in experimental settings that allow for small numbers of steps only. We further show that the amount of achievable entanglement can be controlled by replacing the Hadamard coin in the sequence with a more general rotation operator. For localized initial states with nonzero relative phases the same universal entangling behavior can be observed if the coin operators in the sequence are modified slightly.
In the second part of this work, we ask if higher values of hybrid entanglement can be achieved through a direct optimization of the coin operator sequence and choose a reinforcement learning (RL) based approach for the optimization. Machine learning has already achieved remarkable results in various areas of physics [34,35], and different machine learning approaches have been combined with quantum walks for exploring quantum speed-up [36,37] and graph structures [38]. On the other hand, RL has been successfully applied to challenging problems in quantum physics including quantum state preparation [39,40], quantum optimal control [41,42], and quantum error correction [43,44].
In this work, we use an RL technique to tackle the optimization of entanglement in quantum walks. In contrast to previously employed optimization schemes, RL allows us to find the optimal sequence of coin operators not only for a specific initial state but also for classes of initial states. The resulting optimized coin sequences achieve equal or higher average values of entanglement than the deterministic sequence discussed in the first part of this work. However, the amount of entanglement created this way is not independent of the initial state.
The paper is organized as follows. In Sec. II we briefly review the discrete time quantum walk and introduce the reinforcement learning framework as well as the Q-learning algorithm used for optimization. In Sec. III we present and analyze the universal entangling coin sequence and compare it to the results of the RL optimization. Finally, we conclude with a summary and outlook in Sec. IV.

II. MODEL AND METHODS

A. The quantum walk
The discrete time quantum walk in one dimension is realized on the tensor product of two Hilbert spaces H = H_w ⊗ H_c [7][8][9]. The space corresponding to the position of the walker, H_w, is high-dimensional and spanned by {|k⟩ : k ∈ Z}, while the coin space H_c is two-dimensional and spanned by {|↑⟩, |↓⟩}. We assume that the walker is initially localised on one site in an arbitrary superposition of the coin states,

|ψ_0⟩ = |0⟩ ⊗ ( cos(θ/2) |↑⟩ + e^{iφ} sin(θ/2) |↓⟩ ),    (1)

where θ ∈ [0, π] and φ ∈ [0, 2π]. The evolution consists of n applications of a unitary operator U = SC, where S is a translation and C is a local rotation. The translation S moves the walker either to the left or to the right depending on the internal coin state and has the form

S = Σ_k ( |k+1⟩⟨k| ⊗ |↑⟩⟨↑| + |k−1⟩⟨k| ⊗ |↓⟩⟨↓| ).    (2)

The coin operator C rotates the inner degree of freedom and in its most general form can be expressed as

C(α, ξ, ζ) = [[e^{iξ} cos α, e^{iζ} sin α], [e^{−iζ} sin α, −e^{−iξ} cos α]],    (3)

where ξ, ζ ∈ [0, 2π] and α ∈ [0, π/2] are the parameters of the SU(2) rotation [45]. In our case, we restrict the coin operators to be either the Hadamard coin H [α = π/4, ξ = 0, ζ = 0] or the Fourier coin F [α = π/4, ξ = π/2, ζ = π], which have the explicit forms

H = (1/√2) [[1, 1], [1, −1]],    F = (1/√2) [[i, −1], [−1, i]],    (4)

where F equals, up to an irrelevant global phase, the commonly used form (1/√2) [[1, i], [i, 1]].
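As a minimal numerical illustration of these definitions, the following sketch simulates the walk; the array conventions and function names (`initial_state`, `walk`) are ours, not from the paper:

```python
import numpy as np

# Coin operators; F follows the parametrization in the text and equals
# (1/sqrt(2)) [[1, i], [i, 1]] up to an irrelevant global phase.
H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
F = np.array([[1j, -1], [-1, 1j]]) / np.sqrt(2)

def initial_state(n_steps, theta, phi=0.0):
    """Walker at the origin, coin in cos(theta/2)|up> + exp(i phi) sin(theta/2)|down>."""
    psi = np.zeros((2 * n_steps + 1, 2), dtype=complex)  # sites -n..n, coin on last axis
    psi[n_steps] = [np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)]
    return psi

def walk(coins, psi):
    """Apply U = SC once for each coin in `coins` (first entry acts first)."""
    for C in coins:
        psi = psi @ C.T                   # coin rotation of the internal state
        shifted = np.zeros_like(psi)
        shifted[1:, 0] = psi[:-1, 0]      # |up> component moves one site right
        shifted[:-1, 1] = psi[1:, 1]      # |down> component moves one site left
        psi = shifted
    return psi

# Two Hadamard steps starting from |0>|up>: position distribution (1/4, 1/2, 1/4)
psi = walk([H, H], initial_state(2, theta=0.0))
prob = (np.abs(psi) ** 2).sum(axis=1)     # marginal over the coin
```

The two-step Hadamard walk is a standard check: the walker sits at sites −2, 0, +2 with probabilities 1/4, 1/2, 1/4.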

B. Reinforcement learning
Reinforcement learning (RL) is a subfield of machine learning in which a trainable agent interacts with an environment, takes actions, observes states, and obtains rewards [46]. The objective of the RL agent is to choose actions at each time step that maximize the expected future reward. Here, we implement the off-policy Q-learning algorithm with the goal of maximizing the hybrid entanglement between the walker and the coin degree of freedom [47].
Each training episode consists of a fixed number of time steps n of the quantum walk. Since we are interested in maximizing the entanglement after the evolution is complete, we set all rewards at intermediate time steps to zero and allow for a nonzero reward only at the final time step. We use the Schmidt norm as a measure of entanglement, and therefore the reward R at the end of each episode i can be defined as

R = Σ_{k=1}^{K} λ_k,    (5)

where λ_k are the Schmidt coefficients and K = min(d_c, d_w), with d_c and d_w being the dimensions of the coin and walker subsystems, respectively. Since the coin space is always two dimensional, we have K = 2 and the rewards take values between 1 (product state) and √2 (maximally entangled state).

At each time step of the quantum walk, the agent can choose between two actions defined as A ∈ {H, F}, where H and F correspond to the Hadamard and Fourier coin operator, respectively (see Eq. (4)). Moreover, for a better trade-off between exploration and exploitation we use an ε-greedy action selection, where ε decays exponentially after each training episode i,

ε_i = ε_fin + (ε_init − ε_fin) e^{−i/τ},    (6)

with ε_init and ε_fin being the initial and final value of ε, respectively, and τ a decay constant.

Finally, the states within the RL framework have to be defined. Since the quantum state of a quantum walk is essentially a vector of continuous complex numbers, it cannot be straightforwardly employed in tabular (discrete) RL settings. However, the dynamics of the system are deterministic, and therefore we can use the history of the action sequence to encode the states. Specifically, for a given number of time steps n and a specific initial state ψ, there are 2^n possible sequences. Having defined the states S, the actions A and the rewards R, we can use temporal difference learning to update the Q values after each environment step according to

Q(S_i, A_i) ← Q(S_i, A_i) + α [ R_{i+1} + γ max_A Q(S_{i+1}, A) − Q(S_i, A_i) ],    (7)

where α ∈ [0, 1] is the learning rate and γ the discount factor. During training the agent chooses actions according to the ε-greedy policy, i.e. it acts randomly with probability ε and otherwise takes the action A_i which maximizes the Q value in the current state: A_i = argmax_A Q(S_i, A). Once training has successfully converged, the optimal policy is given by a fully greedy action selection.
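The scheme above can be sketched as a minimal tabular Q-learning loop. The hyperparameter values below are illustrative placeholders, not the ones from the Appendix, and we train on a single fixed initial state (θ = π/2, φ = 0) for brevity:

```python
import itertools
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
F = np.array([[1j, -1], [-1, 1j]]) / np.sqrt(2)
COINS = {"H": H, "F": F}

def schmidt_norm(labels, theta, phi=0.0):
    """Schmidt norm (sum of Schmidt coefficients) after a walk with the given coins."""
    n = len(labels)
    psi = np.zeros((2 * n + 1, 2), dtype=complex)   # sites -n..n, coin on last axis
    psi[n] = [np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)]
    for label in labels:
        psi = psi @ COINS[label].T                  # coin rotation
        shifted = np.zeros_like(psi)
        shifted[1:, 0] = psi[:-1, 0]                # |up> moves right
        shifted[:-1, 1] = psi[1:, 1]                # |down> moves left
        psi = shifted
    return np.linalg.svd(psi, compute_uv=False).sum()   # between 1 and sqrt(2)

def train(n_steps=3, episodes=3000, alpha=0.5, eps_init=1.0, eps_fin=0.05, seed=0):
    """Tabular Q-learning over action histories; nonzero reward only at the final step."""
    rng = np.random.default_rng(seed)
    Q = {}                                          # keys: (history tuple, action)
    for i in range(episodes):
        eps = eps_fin + (eps_init - eps_fin) * np.exp(-i / (episodes / 5))
        hist = ()
        for t in range(n_steps):
            if rng.random() < eps:                  # epsilon-greedy exploration
                a = "HF"[rng.integers(2)]
            else:
                a = max("HF", key=lambda x: Q.get((hist, x), 0.0))
            nxt = hist + (a,)
            if t == n_steps - 1:                    # terminal: reward = Schmidt norm
                target = schmidt_norm(nxt, theta=np.pi / 2)
            else:                                   # bootstrap with gamma = 1
                target = max(Q.get((nxt, x), 0.0) for x in "HF")
            q = Q.get((hist, a), 0.0)
            Q[(hist, a)] = q + alpha * (target - q)
            hist = nxt
    hist = ()                                       # fully greedy rollout
    for t in range(n_steps):
        hist += (max("HF", key=lambda x: Q.get((hist, x), 0.0)),)
    return hist, schmidt_norm(hist, theta=np.pi / 2)

best_seq, best_val = train()
brute_best = max(schmidt_norm(s, np.pi / 2) for s in itertools.product("HF", repeat=3))
```

Because the environment is deterministic and the state space for 3 steps contains only 2^3 = 8 leaf sequences, the greedy policy found by this toy loop should match the brute-force optimum.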

III. RESULTS

A. Universal entangling coin sequence
We are interested in generating highly entangled states during a quantum walk independent of the initial state. Since the final amount of entanglement cannot be fully independent for all possible initial states [48,49], we restrict the initial states to the class of localized states with zero relative phase, φ = 0 (see Eq. (1)). Hence, the problem reduces to finding a sequence of coin operators in time that generates entanglement independent of the initial state parameter θ. For this we propose the sequence

seq*(2m+1) = [(H, F)^m, F],    m = 1, 2, 3, …,

for a quantum walk with 2m+1 time steps. This sequence consists of an alternating application of the Hadamard and Fourier coin with an additional Fourier coin applied at the final time step, and hence always describes a quantum walk with an odd number of steps. In Fig. 1 we plot the Schmidt norm at the end of the quantum walk evolution with the proposed sequence for several different numbers of time steps as a function of the parameter θ. One can easily see that the value of entanglement is always very close to the maximal amount possible and is indeed independent of θ for each sequence. However, it depends on the number of steps taken, and we explore this behaviour by plotting the average Schmidt norm of the final state for all odd numbers of time steps up to 99 in Fig. 2. One can see that the amount of entanglement created varies for small numbers of steps, but settles close to the value of S/√2 ≈ 0.99 when the sequences contain more steps. Each point is obtained after averaging over 1000 random angles θ, and the vanishing variances confirm that the Schmidt norm is independent of the parameter θ. Therefore, from now on we will refer to the sequence seq* as a universal entangler for the class of initial states defined by φ = 0.
In the following we give an intuitive explanation of how the universal behavior emerges from this sequence. Generally, the Schmidt norm can be calculated from the reduced density matrix of the coin degree of freedom after tracing out the walker states. Representing the reduced density matrix ρ on the Bloch sphere,

ρ = (1/2)(I + a·σ),    (8)

where a is the Bloch vector and σ is the vector of Pauli matrices, the Schmidt norm can be expressed in the form

S = √(1 + √(1 − |a|²)).    (9)

In order to better understand the behavior of the universal entangling sequence, we explore the roles of the two coin operators H and F. The Fourier operator seems to be of significant importance for generating highly entangled states. Generally, it increases the localization of the quantum state [50], which has been associated with an enhancement of the entanglement [28]. On the other hand, the Hadamard operator belongs to the class of rotation matrices [51], and we have found that replacing it with a more general unbalanced operator does not change the universal behavior of the sequence. The generalized Hadamard operator H̃ is given by

H̃(ω) = [[cos ω, sin ω], [sin ω, −cos ω]],    (10)

which reduces to the Hadamard coin for ω = π/4. Fig. 3 shows the Schmidt norm after a 5, 7, and 15 step quantum walk as a function of the parameter ω for initial states with zero relative phase. Each data point was obtained after averaging over 1000 random angles θ of the initial state, and the variance again vanishes in all cases. Therefore the amount of entanglement created is still independent of θ. Moreover, the plot suggests that by properly choosing the parameter ω for a given length of the sequence, the performance of the universal entangling sequence can be improved and a state close to a maximally entangled state can be reached.

Let us finally note that the effect of a nonzero relative phase φ in the initial state can be cancelled out in two ways using the phase operator Z given by

Z = [[1, 0], [0, e^{−iφ}]].    (11)

The phase operator can be applied either directly to the initial state or to the coin operators. In the latter case, the H and F operators are altered to HZ and FZ, respectively. However, this requires that the relative phase of the initial state is known beforehand, which can be the case if the creation process of the initial state is deterministic.
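Both the Bloch-sphere expression for the Schmidt norm and the cancellation of the relative phase by Z can be verified numerically. The sketch below uses our own helper names and conventions; `Z(phi)` implements diag(1, e^{−iφ}) and is applied here directly to the initial state:

```python
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
F = np.array([[1j, -1], [-1, 1j]]) / np.sqrt(2)
PAULIS = [np.array([[0, 1], [1, 0]]),
          np.array([[0, -1j], [1j, 0]]),
          np.array([[1, 0], [0, -1]])]

def Z(phi):
    """Phase operator diag(1, exp(-i phi)) cancelling the relative phase phi."""
    return np.diag([1.0, np.exp(-1j * phi)])

def final_state(coins, theta, phi=0.0, pre=None):
    """Evolve a localized initial state; `pre` is an optional rotation applied first."""
    n = len(coins)
    psi = np.zeros((2 * n + 1, 2), dtype=complex)
    psi[n] = [np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)]
    if pre is not None:
        psi = psi @ pre.T
    for C in coins:
        psi = psi @ C.T
        shifted = np.zeros_like(psi)
        shifted[1:, 0] = psi[:-1, 0]
        shifted[:-1, 1] = psi[1:, 1]
        psi = shifted
    return psi

def schmidt_norm(psi):
    return np.linalg.svd(psi, compute_uv=False).sum()

def schmidt_from_bloch(psi):
    """S = sqrt(1 + sqrt(1 - |a|^2)) with a the Bloch vector of the reduced coin state."""
    rho = psi.T @ psi.conj()                     # trace out the walker
    a2 = sum(np.trace(rho @ s).real ** 2 for s in PAULIS)
    return np.sqrt(1 + np.sqrt(max(1.0 - a2, 0.0)))

seq = [H, F, H, F, F]
theta, phi = 1.1, 0.7                            # arbitrary test angles
baseline = schmidt_norm(final_state(seq, theta))                    # phi = 0 reference
corrected = schmidt_norm(final_state(seq, theta, phi, pre=Z(phi)))  # Z undoes the phase
psi_check = final_state(seq, theta, phi)
```

Since Z exactly removes the factor e^{iφ} from the |↓⟩ component, the corrected walk reproduces the φ = 0 result exactly, and the SVD and Bloch-vector routes to the Schmidt norm agree for any pure state.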

B. Optimal coin sequences
Let us next address the question of whether we can find coin sequences that perform better on average than the universal entangling sequence, i.e. that generate higher values of entanglement across all initial states. To solve this optimization problem efficiently we employ the Q-learning algorithm described in Sec. II B. We emphasize that for a given number of steps n, the goal is to find, out of the 2^n possible sequences, the optimal sequence of coins that maximizes the Schmidt norm (the reward) for all initial states. Our RL framework allows us to solve for this objective due to the agent's ignorance of the quantum state. Even though different initial quantum states are used for each episode, the agent has access only to the states defined by the history of actions, and hence no information about the quantum state is used for training. For a better comparison to the previous section, we again restrict the initial states to the subspace defined by φ = 0. For each episode of training, the remaining initial state parameter θ is sampled from a uniform distribution, such that each episode is initialized with a different quantum state. The details of the training and the list of hyperparameters used can be found in the Appendix.
As an example we show the results of the RL optimization obtained for a 5, 7, and 15 step quantum walk in Fig. 4. The Schmidt norm achieved by the optimal sequence is plotted as a function of the parameter θ . Dashed lines of the same color correspond to the respective universal entangling sequence from the last section. Notice that in the case of a 5 step quantum walk the universal sequence and the optimal sequence coincide, i.e. the RL agent finds [H, F, H, F, F] to be optimal. For the cases of a 7 and 15 step quantum walk the optimal sequences differ from the universal ones and the obtained Schmidt norm is not independent of the initial state anymore. However, in both cases the amount of entanglement exceeds that of the universal sequence for all initial state parameters θ .
In order to validate this result, we compared the reinforcement learning algorithm with a simple brute-force method for the case of the 5 step quantum walk. The brute-force algorithm explores all of the 2^5 = 32 possible coin sequences for 1000 random initial states and computes the average Schmidt norm for each sequence. We find that the policy giving rise to the highest average entanglement is indeed the sequence the RL algorithm suggested previously: [H, F, H, F, F]. While for quantum walks with only a few steps a simple brute-force method such as this is able to identify optimal policies, the RL algorithm becomes advantageous for larger numbers of time steps. The number of possible coin sequences grows exponentially with the number of steps and hence quickly becomes intractable for any brute-force method.

Finally, we train an RL agent on completely random initial states, where both φ and θ are uniformly sampled at the beginning of each episode. For a five step walk the optimal sequence suggested by the RL agent is [F, F, H, H, H], and in Fig. 5 we show the values of the achieved Schmidt norm as a function of the initial state parameters. One can see that the final amount of entanglement depends somewhat more strongly on the initial state than in the previous cases, where we only considered initial states with φ = 0. This is not surprising, since it is known that quantum walks of only a few steps cannot generate highly entangled states in a fully universal way for all initial states at the same time [48,49]. However, the RL algorithm is still able to identify a sequence that, at least on average, performs better than others.
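The brute-force comparison described above can be sketched as follows (a self-contained re-implementation in our conventions; we average over a deterministic θ grid rather than 1000 random samples):

```python
import itertools
import numpy as np

H = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
F = np.array([[1j, -1], [-1, 1j]]) / np.sqrt(2)
COINS = {"H": H, "F": F}

def schmidt_norm(labels, theta, phi=0.0):
    """Schmidt norm after a walk with the coin sequence given by `labels`."""
    n = len(labels)
    psi = np.zeros((2 * n + 1, 2), dtype=complex)
    psi[n] = [np.cos(theta / 2), np.exp(1j * phi) * np.sin(theta / 2)]
    for label in labels:
        psi = psi @ COINS[label].T
        shifted = np.zeros_like(psi)
        shifted[1:, 0] = psi[:-1, 0]
        shifted[:-1, 1] = psi[1:, 1]
        psi = shifted
    return np.linalg.svd(psi, compute_uv=False).sum()

# Average Schmidt norm over theta for every 5-step coin sequence
thetas = np.linspace(0.0, np.pi, 181)
avg = {labels: np.mean([schmidt_norm(labels, t) for t in thetas])
       for labels in itertools.product("HF", repeat=5)}
best = max(avg, key=avg.get)
```

With all 32 sequences enumerated, the sequence [H, F, H, F, F] should sit at (or tie for) the top of the ranking.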

IV. CONCLUSION
We have discussed two different approaches for generating hybrid entanglement in a quantum walk. We first presented and studied an entangling sequence consisting of a deterministic string of Hadamard and Fourier coin operators that creates a universal amount of entanglement for all initial states with zero relative phase. Since this sequence works for any odd number of steps, it is valuable for experimental settings where the number of possible steps is limited. The second method was based on a direct optimization of the coin sequence using reinforcement learning (RL), a technique that also allows one to determine longer sequences for which brute-force optimisation is not possible. We have shown that this method can find coin sequences that yield high average values of entanglement over many initial states, which is particularly useful in cases where the initial state is not fully known or very noisy, or whenever it is required to generate highly entangled states independent of the initial state.
Our work therefore extends existing results that either achieve state independent entanglement only in the asymptotic limit of an infinite quantum walk [25] or that can achieve maximal entanglement in a short sequence, but not independently of the initial state [29,30]. Furthermore, the RL scheme we have presented can be useful in a variety of other, experimentally relevant settings. For example, the class of initial states that is optimized over can be restricted to match the experimental problem, such as a fixed initial state with noise. The RL objective can also be altered in different ways. One could for example choose to maximize the fidelity between the final state and a given target state. Another option is to apply the techniques to higher dimensional quantum walks [52], quantum walks on graphs [53], or quantum walks involving more than one particle [54]. Additionally, one could move to the continuous case and use deep reinforcement learning to directly optimize the parameters in the coin operator.
Let us finally note that the universal entangling and optimized sequences give rise to qualitatively different probability distributions of the walker. The universal sequence produces a delocalised distribution, whereas the optimised sequences generate more localised ones. This effect will be a topic of future research.

ACKNOWLEDGMENTS
This work was supported by OIST Graduate University and we are grateful for the help and support provided by the Scientific Computing section at OIST. We would also like to thank Alexander Dauphin for his helpful comments on the manuscript. A.G. acknowledges financial support from the Spanish Ministry MINECO (National Plan 15 Grant: FISICATEAMO No. FIS2016-79508-P,