Efficient and Robust Entanglement Generation with Deep Reinforcement Learning for Quantum Metrology

Quantum metrology exploits quantum resources and strategies to improve the measurement precision of unknown parameters. One crucial issue is how to prepare a quantum entangled state suitable for high-precision measurement beyond the standard quantum limit. Here, we propose a scheme to find the optimal pulse sequence that accelerates the one-axis twisting dynamics for entanglement generation with the aid of deep reinforcement learning (DRL). We consider the pulse train as a sequence of $\pi/2$ pulses along one axis or two orthogonal axes, and the operation at each step is determined by maximizing the quantum Fisher information using DRL. Within a limited evolution time, the ultimate precision bounds of the prepared entangled states follow the Heisenberg-limited scalings. These states can also be used as the input states for Ramsey interferometry, and the final measurement precisions still follow the Heisenberg-limited scalings. While the pulse train along only one axis is simpler and more efficient, the scheme using a pulse sequence along two orthogonal axes shows better robustness against atom number deviation. Our protocol with DRL is efficient and easy to implement in state-of-the-art experiments.


I. INTRODUCTION
Quantum metrology studies how to exploit quantum resources and strategies to improve the estimation precision of unknown parameters [1,2]. Generally, the information of an unknown parameter is encoded into a phase which can be precisely estimated via interferometric techniques in experiments [3-5]. For interferometry with individual atoms, the sensitivity of the estimated phase can reach the so-called standard quantum limit (SQL) [6], i.e., $\Delta\phi = O(N^{-1/2})$ with $N$ the atom number. However, this bound is not fundamental and can be surpassed by using multi-particle entanglement [7-10]. Recent developments in quantum metrology focus on how to generate metrologically useful quantum entangled states and utilize them for phase estimation.
One representative kind of entangled quantum state that can provide sub-SQL phase sensitivity is the spin-squeezed state [11]. Spin-squeezed states can be prepared through the one-axis twisting (OAT) interaction, which is widely realized by light-mediated interactions [12-15] or atom-atom interactions within Bose-condensed atoms [3,16-19], and the phase sensitivity can scale as $\Delta\phi = O(N^{-2/3})$ [5,20]. Apart from OAT, spin squeezing can be generated by the two-axis counter-twisting (TACT) interaction, and the phase sensitivity can be improved to the Heisenberg limit, $\Delta\phi = O(N^{-1})$. However, this kind of spin squeezing is challenging to realize in experiments. In addition to spin-squeezed states, non-Gaussian entangled states such as the twin Fock state and the spin cat state are also promising candidates for achieving Heisenberg-limited phase sensitivity [1,20,21].
The main obstacle to applying quantum entangled states in practice is the entanglement generation in realistic experiments. Several theoretical schemes for preparing quantum entangled states have been developed, such as adiabatic sweeping [8,22-24], shortcuts to adiabaticity [25-27], and optimal control [23,28-30]. However, these schemes are either time-consuming or too complicated to implement, and are therefore hard to realize in state-of-the-art experiments. Hence, developing fast and effective approaches for creating quantum entanglement is of great importance.
One promising way is to make use of machine learning, which has already attracted much attention [31]. In particular, deep reinforcement learning (DRL) [31,32], which can provide optimal decision strategies or policies based upon a well-defined target, is gradually being applied in quantum physics [31,33-40]. It can provide a machine learning (ML) model, often a neural network, that is capable of optimizing a certain objective function by providing a well-designed time sequence of control procedures. It is particularly suitable for seeking the optimal preparation of desired quantum states [41-49]. Recently, it was proposed that extreme spin squeezing can be achieved with the OAT interaction using a sequence of rotation pulses designed via DRL [42]. Although spin squeezing is a good metrological quantum resource, the most metrologically useful states are usually identified by the quantum Fisher information (QFI) $F_Q$ [50,51]. Can we find an experimentally feasible scheme to prepare the optimal quantum entangled state that maximizes $F_Q$ via DRL? Is the prepared quantum entangled state suitable for practical quantum phase estimation?
In this work, we propose a scheme for preparing metrologically useful entangled states based on the OAT interaction with a sequence of rotation pulses designed via DRL. In our scheme, the OAT interaction, which is the key to entanglement generation, is present throughout the state preparation. Our scheme is inspired by the so-called twist-and-turn dynamics [30,52], which is capable of generating spin squeezing efficiently. In order to prepare the optimal quantum entangled state within a limited time $T$, a carefully designed train of $\pi/2$ pulses is applied [42]. The time sequence of the pulse train is obtained by maximizing $F_Q$ with the aid of DRL.
When considering $\pi/2$ pulses along only one axis, we find that only a few pulses suffice to drive the system to a highly entangled state that enables the Heisenberg-limited scaling. However, this protocol is sensitive to the atom number of the initial state. In experiments, the atom number may not be well defined, and it may deviate from the atom number used in the DRL algorithm for designing the pulses. This kind of atom number deviation may drive the prepared state away from the optimal one and hence degrade the ultimate measurement precision scaling. To strengthen the robustness, we consider $\pi/2$ pulses along two orthogonal axes. We find that, although more pulses are required, this scheme is more robust against atom number deviation. To validate our scheme for phase estimation, we use the entangled states obtained by DRL as the input states for Ramsey interferometry. The associated phase measurement precision $\Delta\phi$ still displays the Heisenberg-limited scaling. Moreover, the scheme with $\pi/2$ pulses along two axes again provides better robustness against atom number deviation. Our scheme via DRL provides a straightforward way to efficiently prepare optimal entangled states for quantum metrology, and its robustness against atom number deviation makes it feasible in realistic experiments.

A. Preparation of quantum entangled state
We consider an ensemble of $N$ identical two-level atoms whose Hamiltonian ($\hbar = 1$) is given by $H = \chi \hat J_z^2 + \Omega \hat J_\gamma + \delta \hat J_z$. Here, $\hat J_\gamma = \hat J_x \cos\gamma + \hat J_y \sin\gamma$, and $\hat J_\alpha = \sum_l \sigma_\alpha^{(l)}/2$ ($\alpha = x, y, z$) are the collective spin operators with the Pauli matrices $\sigma_\alpha^{(l)}$ for the $l$-th atom [3]. The system state can be expanded in the Dicke basis, $\hat J_z |m\rangle = m |m\rangle$ with $m = -N/2, -N/2+1, \ldots, N/2$. The Hamiltonian contains three terms. The first term $\chi \hat J_z^2$ denotes the atom-atom interaction, which is the key to realizing the one-axis twisting (OAT) dynamics [3,19]. The second term $\Omega \hat J_\gamma$ is the coupling between the two atomic levels. The third term $\delta \hat J_z$ is the bias or detuning. The Hamiltonian $H$ can be applied to Bose-condensed atoms occupying two hyperfine states [53,54] or a single-component condensate trapped in a double-well potential [55-57]. The parameters $\chi$, $\Omega$, and $\delta$ can be well controlled via external fields in experiments [5,58].
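As an illustration of the Hamiltonian above, the collective spin operators can be written as $(N+1)\times(N+1)$ matrices in the Dicke basis. The following is a minimal NumPy sketch, not the code used in this work; the function names `collective_spin_ops` and `hamiltonian` and the basis ordering $m = -N/2, \ldots, N/2$ are assumptions made for the example.

```python
import numpy as np

def collective_spin_ops(N):
    """Return (Jx, Jy, Jz) for spin J = N/2 in the Dicke basis.

    Assumed basis ordering: index i corresponds to m = -N/2 + i.
    """
    J = N / 2.0
    m = np.arange(N + 1) - J                      # eigenvalues of Jz
    Jz = np.diag(m).astype(complex)
    # Raising operator: <m+1| J+ |m> = sqrt(J(J+1) - m(m+1))
    c = np.sqrt(J * (J + 1) - m[:-1] * (m[:-1] + 1))
    Jp = np.diag(c, k=-1).astype(complex)
    Jx = (Jp + Jp.conj().T) / 2
    Jy = (Jp - Jp.conj().T) / 2j
    return Jx, Jy, Jz

def hamiltonian(N, chi=1.0, Omega=0.0, gamma=0.0, delta=0.0):
    """H = chi*Jz^2 + Omega*(Jx cos(gamma) + Jy sin(gamma)) + delta*Jz, with hbar = 1."""
    Jx, Jy, Jz = collective_spin_ops(N)
    J_gamma = Jx * np.cos(gamma) + Jy * np.sin(gamma)
    return chi * Jz @ Jz + Omega * J_gamma + delta * Jz
```

Working in the symmetric Dicke subspace keeps the matrix dimension at $N+1$ rather than $2^N$, which is what makes simulations with thousands of atoms tractable.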
The first and crucial step for quantum metrology is the entangled state preparation. Initially, the system is usually prepared in a coherent spin state (CSS) $|\psi_0\rangle$ [59,60], which is obtained by applying a $\pi/2$ pulse along the $y$ axis [5,61] to the state $|\pi, 0\rangle_{\mathrm{CSS}} = |\uparrow\rangle^{\otimes N}$ with all $N$ atoms in $|\uparrow\rangle$. The OAT dynamics can squeeze the CSS into a spin squeezed state, and there exists an optimal evolution time $T_{\mathrm{os}}$ at which extreme spin squeezing is achieved [42]. Apart from spin squeezing, the metrological ability of a quantum state can also be characterized by the QFI. Generally, maximizing $F_Q$ yields the optimal input state for attaining the best precision bound [2,10,62]. Thus, we use the QFI as the metric to find the optimal input state for phase estimation below. For an input state $|\psi\rangle$, the QFI for phase estimation can be defined as [5]
$$F_Q = 4\left( {}_t\langle\psi'(\theta)|\psi'(\theta)\rangle_t - \left| {}_t\langle\psi'(\theta)|\psi(\theta)\rangle_t \right|^2 \right), \quad (2)$$
where $|\psi(\theta)\rangle_t = e^{-i\theta \hat J_z}|\psi\rangle$ and $|\psi'(\theta)\rangle_t = -i\hat J_z|\psi(\theta)\rangle_t$. Therefore, the ultimate precision bound is given by $\Delta\phi \geq 1/\sqrt{F_Q}$ [50,51]. To speed up the entanglement generation, we apply pulses during the state preparation stage, and the system therefore obeys the Hamiltonian
$$H(t) = \chi \hat J_z^2 + \Omega_x(t)\hat J_x + \Omega_y(t)\hat J_y, \quad (3)$$
where $\Omega_x(t)$ and $\Omega_y(t)$ are time-dependent functions describing the applied pulses. We take the total evolution time $T$ to be around $T_{\mathrm{os}}$ and divide $T$ equally into $n_t$ intervals, each of length $\delta t = T/n_t$. In each interval, one can choose either to apply a $\pi/2$ pulse along the $x$ or $y$ axis, with $\int_t^{t+\delta t} \Omega_{x,y}(t')\,dt' = \pi/2$, or to turn off the coupling, $\Omega_{x,y} = 0$, and let the state evolve under the OAT interaction alone.
Our goal is to find the optimized pulse train that generates, from the initial CSS $|\psi_0\rangle$ and within $T$, the input state $|\psi\rangle$ that maximizes $F_Q$. To accomplish this goal, we adopt techniques from machine learning (ML): the optimization process is guided by an ML model obtained from DRL. In the following, we introduce the DRL algorithm and show the optimization results in detail.
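For a pure state and the generator $\hat J_z$, the QFI defined through $|\psi(\theta)\rangle = e^{-i\theta \hat J_z}|\psi\rangle$ reduces to four times the variance of $\hat J_z$, $F_Q = 4(\langle \hat J_z^2\rangle - \langle \hat J_z\rangle^2)$. The sketch below evaluates this in the Dicke basis for two reference states: a CSS on the equator, which gives the SQL value $F_Q = N$, and a spin cat state $(|-N/2\rangle + |N/2\rangle)/\sqrt{2}$, which gives the Heisenberg value $F_Q = N^2$. The helper name `qfi_pure` and the chosen $N$ are illustrative assumptions.

```python
import numpy as np
from math import comb

def qfi_pure(psi, m):
    """QFI of a pure state for the generator Jz: F_Q = 4 * Var(Jz).

    psi : amplitudes in the Dicke basis, m : matching Jz eigenvalues.
    """
    p = np.abs(psi) ** 2
    mean = np.sum(p * m)
    mean_sq = np.sum(p * m ** 2)
    return 4.0 * (mean_sq - mean ** 2)

N = 100                                   # illustrative atom number
m = np.arange(N + 1) - N / 2              # Jz eigenvalues -N/2, ..., N/2

# CSS along +x: binomial amplitudes in the Dicke basis
css_x = np.sqrt(np.array([float(comb(N, k)) for k in range(N + 1)]) / 2.0 ** N)

# Spin cat state: equal superposition of the two extreme Dicke states
cat = np.zeros(N + 1)
cat[0] = cat[-1] = 1 / np.sqrt(2)

fq_css = qfi_pure(css_x, m)               # standard-quantum-limit reference
fq_cat = qfi_pure(cat, m)                 # Heisenberg-limit reference
```

These two values bracket what any pulse sequence can achieve, so they are convenient sanity checks for an optimizer.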

B. DRL algorithm
To obtain the optimal control, the optimization process is guided by a DRL algorithm. Briefly, the DRL algorithm requests certain information about the current state $|\psi_t\rangle$ at the $t$-th time step ($t \in [1, n_t]$) and determines the evolution in the next, $(t+1)$-th, time step according to an optimal policy. Among the DRL algorithms, we adopt the so-called Asynchronous Advantage Actor-Critic (A3C) algorithm [63]. It is based on a common actor-critic algorithm but designed in an asynchronous structure, as sketched in Fig. 1 (a). Generally, it uses neural networks to find an appropriate decision. The network parameters are updated via the adaptive moment estimation (Adam) gradient-descent method [64]. The asynchronous structure of A3C benefits the stability of the learning process and speeds up convergence. The learning process also becomes more efficient because the local networks are naturally processed in parallel, which takes full advantage of the multiple processing units in the computing hardware.
Next, we show how to find the optimized pulse train in the framework of the DRL algorithm. As shown in Fig. 1 (b), at every time step $t$ the algorithmic state $s_t$ fed into the algorithm consists of expectation values of the evolved quantum state $|\psi_t\rangle$; $s_t$ is encoded as a tuple of six expectation values. It should be mentioned that these six expectation values are intermediate variables of the algorithm. They are only calculated numerically [42] and do not need to be measured in experiments. The action $a_t$ is then obtained after receiving $s_t$; it is an evolution operator $\hat U_t$ chosen from an action pool containing three candidates: $\hat U_0$ (evolution under the OAT interaction without a pulse), $\hat U_1$ (with a $\pi/2$ pulse along the $x$ axis), and $\hat U_2$ (with a $\pi/2$ pulse along the $y$ axis). Finally, a reward $r_t$ related to the QFI $F_Q^{(t)}$ of the evolved state is calculated. The reward is described below.
In this work, we consider two schemes, "only-$J_x$" and "both-$J_x$, $J_y$". The former uses only $\pi/2$ pulses along the $x$ axis, so that $\hat U_t$ is chosen only from $\hat U_0$ and $\hat U_1$. In the latter, $\pi/2$ pulses along both the $x$ and $y$ axes are considered, i.e., $\hat U_t \in \{\hat U_0, \hat U_1, \hat U_2\}$. The unitary evolution $|\psi_{t+1}\rangle = \hat U_t |\psi_t\rangle$ is then performed, and the resulting state $|\psi_{t+1}\rangle$ enters the evolution at the next time step $t+1$. Thus, the final prepared state can be written as
$$|\psi_T\rangle = \hat U |\psi_0\rangle = \hat U_{n_t} \cdots \hat U_2 \hat U_1 |\psi_0\rangle, \quad (5)$$
where the initial state $|\psi_0\rangle$ is given by Eq. (1). To maximize the $F_Q$ of $|\psi_T\rangle$, in each step we numerically calculate the QFI $F_Q^{(t)}$ of $|\psi_t\rangle$ to obtain the reward $r_t$ of the $t$-th step. The total reward $R_{\mathrm{tot}}$ is then calculated after $n_t$ evolution steps. Finally, a specific pulse sequence $(\hat U_1, \hat U_2, \ldots, \hat U_{n_t})$ can be generated from the optimal policy within the DRL algorithm.
The total reward $R_{\mathrm{tot}}$ is originally the accumulated reward over the $n_t$ time steps, $R_{\mathrm{tot}} = \sum_{t=0}^{n_t} r_t$ [42]. In our DRL algorithm, by contrast, the $n_t$ rewards are assigned all at once after the total evolution time $T$: the reward of the $t$-th step is set to the largest reward among the remaining steps after time $t$. This non-step-wise reward design allows us to assign every $r_t$ after knowing $F_Q^{(0 \sim n_t)}$, which benefits the stability, efficiency, and convergence of the training. Another advantage of this definition (6) is that in each training epoch the DRL algorithm can recognize that the optimization task is to be fulfilled within $n_t$ steps, so that the ML model reaches a similar optimum once $n_t$ is large enough; see Fig. 2. In addition, we use two separate neural networks as the actor and critic networks. The benefit of this separation is that the different magnitudes of $F_Q$ for different atom numbers $N$ can be well balanced. The parameters of our algorithm, including the structure of the neural networks and the learning rate, do not need to be adjusted for different atom numbers, and convergence is achieved at the same rate; see Fig. 1 (d).
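The step-by-step evolution under a given pulse sequence can be sketched as follows. This is a hedged illustration, not the implementation used in this work: it assumes that each action is an instantaneous $\pi/2$ rotation (action 1 about $x$, action 2 about $y$, action 0 no pulse) followed by OAT evolution for one interval $\delta t$, and that the initial CSS points along $+x$. The function names and the example sequence are hypothetical.

```python
import numpy as np

def spin_ops(N):
    """Jx, Jy and the Jz eigenvalues m in the Dicke basis (m = -N/2 ... N/2)."""
    J = N / 2.0
    m = np.arange(N + 1) - J
    c = np.sqrt(J * (J + 1) - m[:-1] * (m[:-1] + 1))
    Jp = np.diag(c, k=-1).astype(complex)
    return (Jp + Jp.conj().T) / 2, (Jp - Jp.conj().T) / 2j, m

def rotation(H, angle):
    """exp(-i * angle * H) for a Hermitian matrix H, via eigendecomposition."""
    w, V = np.linalg.eigh(H)
    return (V * np.exp(-1j * angle * w)) @ V.conj().T

def run_sequence(N, actions, T, chi=1.0):
    """Apply the pulse sequence and return (final state, F_Q after each step)."""
    Jx, Jy, m = spin_ops(N)
    dt = T / len(actions)
    oat = np.exp(-1j * chi * m ** 2 * dt)        # OAT propagator, diagonal in Jz
    pulses = {1: rotation(Jx, np.pi / 2), 2: rotation(Jy, np.pi / 2)}
    w, V = np.linalg.eigh(Jx)
    psi = V[:, -1]                                # assumed initial CSS along +x
    fq = []
    for a in actions:
        if a in pulses:                           # action 0 applies no pulse
            psi = pulses[a] @ psi
        psi = oat * psi
        p = np.abs(psi) ** 2
        fq.append(4 * (np.sum(p * m ** 2) - np.sum(p * m) ** 2))
    return psi, np.array(fq)

# Hypothetical sequence for illustration only, not a DRL-optimized one
psi_T, fq_trace = run_sequence(N=20, actions=[1, 0, 0, 1, 0, 0, 2, 0, 0, 1], T=0.3)
```

Because the OAT propagator is diagonal in the Dicke basis, each step costs only one matrix-vector product, so the QFI trace needed for the rewards is cheap to evaluate.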
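The non-step-wise reward can be sketched as a reverse running maximum over the QFI trace of one episode. This is only an illustration under stated assumptions: the raw QFI is used as the per-step quantity without any normalization, and the maximum is taken over the current and all later steps; the function name `non_stepwise_rewards` is hypothetical.

```python
import numpy as np

def non_stepwise_rewards(fq_history):
    """Assign to step t the largest value found at step t or any later step.

    fq_history : per-step QFI values of one complete episode.
    Assumptions: no normalization, maximum inclusive of step t.
    """
    f = np.asarray(fq_history, dtype=float)
    # Running maximum taken from the end of the episode backwards
    return np.maximum.accumulate(f[::-1])[::-1]

rewards = non_stepwise_rewards([1.0, 3.0, 2.0, 5.0, 4.0])
```

With this assignment, early steps are credited with the best value eventually reached, so an action that lowers the QFI temporarily is not penalized if it enables a larger value later in the episode.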

C. Results with DRL
In our numerical simulations, we choose $\chi = 1$ and $n_t = 50$. The total evolution time $T$ is chosen near the optimal squeezing time, which can be determined numerically. The relation between $T$ and $N$ is shown in Fig. 1 (c), roughly an exponential dependence. For example, for $N = 100$ and $1000$ we have $T = 0.13$ and $0.015$, respectively. Starting from an initial state $|\psi_0\rangle$ with a fixed $N$, we can obtain the maximized $F_Q$ and the corresponding prepared quantum state $|\psi_T\rangle$ with the help of DRL. Here, we display results for two representative cases ($N = 100$ and $1000$) using the only-$J_x$ scheme and the both-$J_x$, $J_y$ scheme; see Fig. 1 (e)-(h) and (i)-(l), respectively. In Fig. 1 (d), the learning curves of DRL for both schemes with $N = 100$ and $1000$ are given. After 8000 trials of learning, the $F_Q$ of the final states $|\psi_T\rangle$ are optimized and converge to saturated values, indicating a successful optimization.
The associated pulse trains optimized by our DRL algorithm for $N = 100$ and $1000$ are shown as histograms in Fig. 1 (e) for the only-$J_x$ scheme and in Fig. 1 (i) for the scheme with $\pi/2$ pulses along both the $x$ and $y$ axes. The corresponding time evolutions of $F_Q$ are shown in Fig. 1 (f) and (j). The $F_Q$ of the optimal prepared states $|\psi_T\rangle$ are highlighted by red dots, and the associated distributions of $|\psi_T\rangle$ are shown in Fig. 1 (g) and (k).
FIG. 3. (a) Sketch of Ramsey interferometry with time-reversal operations for phase estimation. An entangled state $|\psi_T\rangle$ is produced by the operation $\hat U$, which is obtained by our DRL algorithm. The state $|\psi_T\rangle$ is then used as the input for Ramsey interferometry, where a time-reversed operation $\hat U^\dagger$ is applied after the phase accumulation. Finally, by applying a $\pi/2$ pulse and measuring the half relative population $\hat J_z$, one can extract the information of the estimated phase $\phi$. Here, we consider the phase to be in the vicinity of $\phi = 0$. The measurement precision scaling of the estimated phase versus atom number $N$ obtained by (b) the only-$J_x$ scheme and (c) the both-$J_x$, $J_y$ scheme. The black dashed lines represent the Heisenberg limit $N^{-1}$, while the colored dashed lines are the Heisenberg-limited scalings obtained by fitting the numerical results.
The optimized $F_Q$ of the states prepared with the only-$J_x$ scheme and the both-$J_x$, $J_y$ scheme are nearly the same, with the latter mostly being slightly larger than the former. The final prepared states $|\psi_T\rangle$ become non-Gaussian, with two humps appearing near $|m = \pm N/2\rangle$; see the Husimi distribution on the generalized Bloch sphere and the probability distribution. However, the probability distribution of $|\psi_T\rangle$ obtained with the both-$J_x$, $J_y$ scheme is more rugged than that obtained with the only-$J_x$ scheme. Importantly, we find that the scaling of $F_Q$ versus $N$ approaches the Heisenberg limit for both schemes. Here, we use the least-squares method to fit the results, and the fitting formulas are displayed in the legends. Again, the both-$J_x$, $J_y$ scheme outperforms the only-$J_x$ scheme with a slightly smaller constant. It is evident that the method with the DRL algorithm is promising for developing Heisenberg-limited metrology protocols.
On the other hand, the optimized pulse trains of the two schemes are very different. For both $N = 100$ and $1000$, only four $\pi/2$ pulses along the $x$ axis are needed in the only-$J_x$ scheme. With a final pulse applied at the last time step, the state abruptly evolves to the optimal one and the corresponding $F_Q$ suddenly jumps to a large value. For the both-$J_x$, $J_y$ scheme, more $\pi/2$ pulses along the $x$ axis together with a few $\pi/2$ pulses along the $y$ axis are needed. Thus, the pulse trains of the only-$J_x$ scheme are much sparser and simpler, which makes them more feasible in realistic experiments. For a fixed $N$, with either the only-$J_x$ scheme or the both-$J_x$, $J_y$ scheme, we can find the optimal control for preparing the optimal state within $T$ with the help of the DRL algorithm. However, the optimized pulse trains differ for different $N$ and $T$. Thus, we need to know the atom number $N$ roughly in advance to design the corresponding optimal pulse sequence.
The number of intervals $n_t$ into which we divide the total evolution time $T$ may slightly influence the optimization results. The resultant $F_Q$ of the final states for different $n_t$ are shown in Fig. 2. More pulses can push the optimization further, but the gain diminishes when $n_t > 50$, especially for large $N$. Thus, $n_t = 50$ is a balanced choice, provided that the structure of the two networks and the hyperparameters of our DRL algorithm remain unchanged. Although the $F_Q$ of the prepared state may be slightly larger with increasing $n_t$, this requires more carefully designed algorithm parameters and increases the operational complexity.
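A scaling exponent of this kind can be extracted with a least-squares fit of a power law $y = a N^{b}$. The sketch below performs the fit as a straight line in log-log space, which is one common choice and an assumption here, since the fitting variant is not specified above. The data are synthetic placeholders generated inside the example, not results of this work, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

def fit_power_law(N_values, y_values):
    """Least-squares fit of y = a * N**b, done linearly in log-log space.

    Returns the prefactor a and the exponent b.
    """
    b, log_a = np.polyfit(np.log(N_values), np.log(y_values), 1)
    return np.exp(log_a), b

# Synthetic placeholder data with an exact N^2 scaling (not data from this work)
N_values = np.array([10.0, 100.0, 1000.0, 10000.0])
synthetic_fq = 0.4 * N_values ** 2

a_fit, b_fit = fit_power_law(N_values, synthetic_fq)
```

An exponent $b$ close to 2 for $F_Q$, or equivalently close to $-1$ for $F_Q^{-1/2}$, indicates Heisenberg-limited scaling, and the prefactor $a$ is the constant compared between the two schemes.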

III. PHASE ESTIMATION VIA TIME-REVERSAL RAMSEY INTERFEROMETRY
Generally, the QFI only sets the ultimate bound on the measurement precision, which may not always be attained. To validate the metrological usefulness of the states prepared via DRL, we implement Ramsey interferometry for phase estimation [3,4,65] with the prepared states $|\psi_T\rangle$ as input.
For conventional Ramsey interferometry, the whole process consists of a phase accumulation sandwiched between two $\pi/2$ pulses [10,66]. Since we start from an initial CSS, a time-reversal protocol is suitable. Here, we apply a disentangling operation $\hat U^\dagger$ after the phase accumulation process [67,68], implemented as the reverse of $\hat U$ in Eq. (5). As sketched in Fig. 3 (a), the final state $|\psi_\phi\rangle$ after Ramsey interferometry is thus obtained from $|\psi_0\rangle$ by applying, in order, the entangling operation $\hat U$, the phase accumulation, and the disentangling operation $\hat U^\dagger$. The time-reversal operation can be achieved by changing the sign of the entangling Hamiltonian [68]. This can be realized in various synthetic quantum systems, such as atom-cavity systems [69] and cold-atom systems [70].
Then the measurement precision of $\phi$ can be calculated by using the error propagation formula [58], $\Delta\phi = (\Delta \hat J_z)_\phi / |\partial \langle \hat J_z \rangle_\phi / \partial\phi|$, where $(\Delta \hat J_z)_\phi = \sqrt{\langle \hat J_z^2 \rangle_\phi - \langle \hat J_z \rangle_\phi^2}$ and the subscript $\phi$ indicates the expectation with respect to $|\psi_\phi\rangle$. Here, we consider the estimated phase to be small, in the vicinity of $\phi = 0$.
The corresponding scalings of the measurement precision versus $N$ are shown in Fig. 3 (b) and (c). The resultant phase measurement precisions are given as blue points (only-$J_x$ scheme) and red points (both-$J_x$, $J_y$ scheme), respectively. Although the scaling deviates slightly from the ultimate bounds set by $F_Q$ in Fig. 1 (h) and (l), the phase measurement precisions of the only-$J_x$ and both-$J_x$, $J_y$ schemes still show Heisenberg-limited scaling, as expected. This suggests that the optimized entangled states we prepare by using the DRL algorithm also have great potential for Heisenberg-limited phase estimation with Ramsey interferometry.
The only-$J_x$ scheme shows a smoother scaling that is closer to the Heisenberg limit, $2.0/N$ compared with the $3.7/N$ obtained by the both-$J_x$, $J_y$ scheme. This may result from the addition of $\hat U_2$ pulses in Eq. (4). In the next section, however, we will see that the participation of $\hat U_2$ contributes to better robustness against the deviation of the atom number $N$.
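The error propagation formula, taken here in its standard form $\Delta\phi = (\Delta \hat J_z)_\phi / |\partial_\phi \langle \hat J_z \rangle_\phi|$, can be checked numerically on the simplest case. The sketch below is an illustration under stated assumptions rather than the protocol of this work: it uses a plain CSS along $+x$ without entangling or time-reversal operations, takes $e^{-i\phi \hat J_z}$ for the phase accumulation, applies a final $\pi/2$ rotation about $x$, and estimates the slope by a central finite difference. For this input the result should reproduce the SQL, $\Delta\phi = 1/\sqrt{N}$.

```python
import numpy as np

def jx_and_m(N):
    """Jx (real symmetric) and the Jz eigenvalues m in the Dicke basis."""
    J = N / 2.0
    m = np.arange(N + 1) - J
    c = np.sqrt(J * (J + 1) - m[:-1] * (m[:-1] + 1))
    return (np.diag(c, 1) + np.diag(c, -1)) / 2, m

def ramsey_delta_phi(N, phi=0.0, eps=1e-5):
    """Delta phi = std(Jz) / |d<Jz>/d phi| for a CSS input (no entangler)."""
    Jx, m = jx_and_m(N)
    w, V = np.linalg.eigh(Jx)
    psi0 = V[:, -1].astype(complex)                     # CSS along +x
    Rx = (V * np.exp(-1j * np.pi / 2 * w)) @ V.T        # final pi/2 pulse about x

    def moments(ph):
        psi = Rx @ (np.exp(-1j * ph * m) * psi0)        # phase accumulation, then pulse
        p = np.abs(psi) ** 2
        return np.sum(p * m), np.sum(p * m ** 2)

    mean, mean_sq = moments(phi)
    slope = (moments(phi + eps)[0] - moments(phi - eps)[0]) / (2 * eps)
    return np.sqrt(mean_sq - mean ** 2) / abs(slope)

dphi = ramsey_delta_phi(20)
```

Replacing the CSS input by an entangled state and inserting the disentangling operation before the final pulse would give the corresponding precision of the time-reversal protocol.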

IV. ROBUSTNESS AGAINST ATOM NUMBER DEVIATION
Finally, we discuss the robustness of our schemes against atom number deviation. As mentioned in Sec. II, the optimal pulse sequence obtained by DRL depends on the atom number $N$ and the total evolution time $T$. In our numerical calculations, the initial state $|\psi_0\rangle$ is assumed to be a pure state with a well-defined atom number $N$. In practice, $T$ can be precisely controlled, but the estimate of the atom number $N$ may be inaccurate, so the atom number in the experiment may deviate from the one set in the DRL algorithm for designing the pulses. Therefore, it is necessary to assess the robustness of our scheme in the presence of this kind of atom number deviation.
We evaluate the robustness by applying the pulse train optimized for atom number $N$ to systems with other atom numbers in the range $[0.8N, 1.2N]$, i.e., the deviation of the atom number is assumed to be up to $\pm 20\%$.
The results for $N = 100$, $500$, $1000$, and $5000$ are shown in Fig. 4, including the $F_Q$ and the phase measurement precision $\Delta\phi$ via time-reversal Ramsey interferometry for the only-$J_x$ scheme and the both-$J_x$, $J_y$ scheme. The red dashed lines are added for reference; they represent the Heisenberg-limited scalings passing through the results of the only-$J_x$ scheme without deviation. Ideally, the results should be close to the red dashed lines. It turns out that the deviation of $N$ degrades the ultimate precision bound $F_Q^{-1/2}$, and the results for $\Delta\phi$ also become worse. Compared with the only-$J_x$ scheme, the both-$J_x$, $J_y$ scheme shows better robustness against atom number deviation. As shown in Fig. 4 (a)-(d), the $F_Q$ of the two schemes are at the same level when there is no deviation of $N$, and under deviation the $F_Q$ obtained with the both-$J_x$, $J_y$ scheme decreases much less than that obtained with the only-$J_x$ scheme. The results for $\Delta\phi$ are shown in Fig. 4 (e)-(h) and display the same pattern of degradation for the two schemes. Although the phase measurement precision of the both-$J_x$, $J_y$ scheme is worse than that of the only-$J_x$ scheme for most $N$, as shown in Fig. 3 (b) and (c), the robustness of the former scheme is better than that of the latter.
This suggests that the pulse trains optimized by the DRL algorithm are practicable even when the atom number $N$ of the system cannot be estimated accurately. If the atom number deviation is small in the experiment, one may prefer the only-$J_x$ scheme for phase estimation. Otherwise, the both-$J_x$, $J_y$ scheme, which shows better robustness against atom number deviation, may become favorable.
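The robustness evaluation can be sketched as a scan in which a pulse train designed for one atom number is replayed, unchanged, on systems with other atom numbers. The example below is a simplified illustration with several assumptions: only $x$ pulses are included, each pulse is an instantaneous $\pi/2$ rotation followed by OAT evolution for one interval, the total time $T$ is kept fixed, and the pulse sequence is an arbitrary placeholder rather than a DRL-optimized one. The names `final_qfi` and `robustness_scan` are hypothetical.

```python
import numpy as np

def final_qfi(N, actions, T, chi=1.0):
    """F_Q of the state prepared by the given x-pulse sequence (1 = pulse, 0 = none)."""
    J = N / 2.0
    m = np.arange(N + 1) - J
    c = np.sqrt(J * (J + 1) - m[:-1] * (m[:-1] + 1))
    Jx = (np.diag(c, 1) + np.diag(c, -1)) / 2
    w, V = np.linalg.eigh(Jx)
    rot_x = (V * np.exp(-1j * np.pi / 2 * w)) @ V.T    # pi/2 rotation about x
    oat = np.exp(-1j * chi * m ** 2 * T / len(actions))
    psi = V[:, -1].astype(complex)                      # assumed CSS along +x
    for a in actions:
        if a == 1:
            psi = rot_x @ psi
        psi = oat * psi
    p = np.abs(psi) ** 2
    return 4 * (np.sum(p * m ** 2) - np.sum(p * m) ** 2)

def robustness_scan(N_design, actions, T, rel_dev=0.2, points=5):
    """Replay the same sequence for atom numbers within +/- rel_dev of N_design."""
    grid = np.linspace((1 - rel_dev) * N_design, (1 + rel_dev) * N_design, points)
    atom_numbers = np.unique(np.round(grid).astype(int))
    return {int(n): final_qfi(int(n), actions, T) for n in atom_numbers}

# Placeholder sequence and time, for illustration only
scan = robustness_scan(N_design=20, actions=[1, 0, 0, 1, 0, 0, 0, 1, 0, 1], T=0.3)
```

Comparing each value in `scan` with the value at the design atom number gives a simple measure of how much the prepared state degrades under atom number deviation.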

V. CONCLUSION AND DISCUSSION
We have presented an efficient and robust scheme for preparing entangled states with a DRL algorithm and demonstrated the metrological usefulness of these states with Ramsey interferometry for phase estimation. We implemented the quantum state preparation through the only-$J_x$ scheme or the both-$J_x$, $J_y$ scheme, referring to the OAT dynamics with a pulse sequence along only one axis or along two orthogonal axes, respectively. The system starts from a CSS and then reaches an optimal entangled state under a pulse train optimized by DRL. The quantum state preparation is accomplished within a short time, and the ultimate precision bounds exhibit the Heisenberg-limited scaling. Further, the Heisenberg-limited scaling is maintained when performing Ramsey interferometry, which verifies the usefulness of our schemes in experiments. We use the A3C algorithm [63] with separately established actor and critic networks. This makes our algorithm equally effective and efficient for different atom numbers from $N = 10$ to $10000$ without modifying the neural networks or the parameters of the DRL algorithm. In addition, the non-step-wise reward design makes the training process feasible and stable, and similarly successful whenever the total number of pulses $n_t$ is sufficient.
The only-$J_x$ scheme and the both-$J_x$, $J_y$ scheme have different advantages. On the one hand, the pulse trains of the only-$J_x$ scheme provided by the DRL algorithm are much simpler, and the scaling of the phase measurement precision is better than that of the both-$J_x$, $J_y$ scheme. On the other hand, we find that the entangled states prepared by the both-$J_x$, $J_y$ scheme are more robust against atom number deviation. Therefore, the only-$J_x$ scheme can be used when one wants to simplify the state preparation and the deviation of the atom number can be well controlled, while the both-$J_x$, $J_y$ scheme is preferable when robustness against atom number deviation matters more.
Our algorithm can be used for offline optimization of quantum entangled state preparation in synthetic many-body quantum systems, such as cold atoms [3,19] and trapped ions [71]. Online optimization is also feasible when the QFI is extractable [72], although it consumes a large amount of time; this might be mitigated by starting from results provided by sufficient offline optimization. In the future, the effects of decoherence and imperfect pulse shapes can also be taken into account, which will make the scheme more feasible for practical experiments.