Closed-loop control of a noisy qubit with reinforcement learning

The exotic nature of quantum mechanics differentiates machine learning applications in the quantum realm from classical ones. Stream learning is a powerful approach that can be applied to extract knowledge continuously from quantum systems in a wide range of tasks. In this paper, we propose a deep reinforcement learning method that uses streaming data from a continuously measured qubit in the presence of detuning, dephasing, and relaxation. The model receives streaming quantum information for learning and decision-making, providing instant feedback on the quantum system. We also explore the agent’s adaptability to other quantum noise patterns through transfer learning. Our protocol offers insights into closed-loop quantum control, potentially advancing the development of quantum technologies.


Introduction
Quantum computation and quantum information [1] are no longer just promising research fields; they have become current realities whose applicability is expected to grow in the next decade, in particular in areas related to computational intelligence [2,3]. Quantum computation relies on quantum bits (qubits), which are the quantum generalization of classical bits. The two basis states of a qubit are |0⟩ and |1⟩, corresponding to the states zero and one of a classical bit, respectively. However, a qubit |Ψ⟩ has the unique feature of allowing states formed by the superposition of |0⟩ and |1⟩, namely |Ψ⟩ = α|0⟩ + β|1⟩, where α and β are complex coefficients. When a qubit is in a superposition state, a measurement collapses it to one of its basis states, but it is impossible to determine in advance which one. The only available information is that the probability of |0⟩ is |α|² and the probability of |1⟩ is |β|², hence |α|² + |β|² = 1. The primary operation when dealing with qubits is the unitary transformation U. Applying U to a superposition state results in another superposition state that superposes all basis vectors, which is known as quantum parallelism. This feature can be employed to evaluate a function f(x) for many different inputs x at the same time. The unitary transformation $U(t, t_0)$ evolves a qubit state $|\Psi(t_0)\rangle$ to $|\Psi(t)\rangle = U(t, t_0)|\Psi(t_0)\rangle = \mathcal{T}\exp\left[-(i/\hbar)\int_{t_0}^{t} H(t')\,dt'\right]|\Psi(t_0)\rangle$, where $H(t')$ is its Hamiltonian and $\mathcal{T}$ denotes time ordering. Consequently, quantum control arises as a central problem in realizing quantum computation. Its goal is to design a time-dependent Hamiltonian H(t) that drives the qubit to its target state through a unitary transformation. Simple solutions exist, such as the quantum NOT gate, which can be achieved with a resonant pulse $H = \hbar\Omega\sigma_x/2$ and an operation time $T = \pi/\Omega$, but they are not robust and are far from optimal. Any slight systematic error T → T + δT, or equivalently Ω → Ω + δΩ, leads to fidelity loss.
Furthermore, qubits cannot be perfectly isolated from the external environment, where quantum noise induces decoherence. Therefore, optimal quantum control is necessary to achieve high-fidelity and highly robust gate operations, which are milestones toward fault-tolerant universal quantum computation [4][5][6].
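The sensitivity of the resonant π-pulse to amplitude errors can be checked numerically. The following is a minimal sketch, assuming units with ℏ = 1 and the drive $H = \Omega\sigma_x/2$, so that the ideal flip occurs at ΩT = π; the function name `flip_fidelity` and the chosen error values are illustrative and not part of the original work.

```python
import numpy as np

# Pauli-X matrix and computational basis states (units with hbar = 1).
SIGMA_X = np.array([[0, 1], [1, 0]], dtype=complex)
KET0 = np.array([1, 0], dtype=complex)
KET1 = np.array([0, 1], dtype=complex)


def flip_fidelity(omega, duration):
    """Return |<1|U|0>|^2 for H = omega * sigma_x / 2 applied for `duration`."""
    theta = omega * duration / 2.0
    # Closed form: exp(-i*theta*sigma_x) = cos(theta)*I - i*sin(theta)*sigma_x
    u = np.cos(theta) * np.eye(2) - 1j * np.sin(theta) * SIGMA_X
    return float(abs(KET1.conj() @ u @ KET0) ** 2)


if __name__ == "__main__":
    omega, duration = np.pi, 1.0  # ideal pi-pulse: omega * duration = pi
    for rel_err in (0.0, 0.02, 0.05, 0.10):
        fidelity = flip_fidelity(omega * (1 + rel_err), duration)
        print(f"amplitude error {rel_err:4.0%}: fidelity = {fidelity:.5f}")
```

In this noiseless model the fidelity equals sin²(ΩT/2), so any deviation of ΩT from π reduces it, which is the fidelity loss referred to above.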
Physicists have proposed several protocols for achieving quantum control objectives, such as adiabatic quantum evolution [7], composite pulses [8][9][10], pulse-shaping engineering [11][12][13][14], and shortcuts to adiabaticity [15][16][17]. In particular, machine learning (ML) algorithms can be combined with them for further optimization [18][19][20][21][22]. It is also natural to consider applying reinforcement learning (RL) on its own to quantum control tasks [22][23][24][25][26][27][28][29][30][31][32][33]. In recent years, deep reinforcement learning (DRL) has successfully addressed pulse design for fast and robust quantum state preparation [34][35][36], gate operation [37], and the quantum Szilard engine [38]. However, as highlighted in our previous research [20,21], the full potential of RL for quantum control has yet to be realized because of the challenge of quantum measurement. RL requires observing the state in order to output an action, which conflicts with the fundamental feature of quantum mechanics that the state is destroyed by direct quantum measurement. Most RL models for quantum control are trained in numerical environments rather than on real quantum devices to save resources, and they deliver fixed pulses after observation and evaluation. These fixed pulses hardly outperform gradient-based optimization methods. Another approach combines the model with a quantum environment for evaluation. It allows the RL model to output an instant action after observing a state, even though the state is destroyed by direct measurement. However, this approach is inefficient: the historical actions must be stored and replayed to retrieve the last destroyed state, which takes n(n + 1)/2 steps in total, where n is the maximum number of time steps in each episode.
In this work, we present an RL approach to closed-loop quantum control. In this paradigm, the qubit's wave function is no longer destroyed but only slightly perturbed when information is extracted via weak measurement. The model observes a state that contains weak values, which carry partial information about the qubit with less confidence, and outputs an action that evolves the quantum environment to the next timestep. Our scheme reflects the spirit of stream learning once each timestep is sufficiently short, resembling the dynamics of continuous measurement. It also enables transfer learning by adapting the model to the environment during evaluation while the external noise patterns are changing. We expect that our protocol can enhance the performance of quantum computing and quantum information processing in real-time experiments, accelerating their development from noisy intermediate-scale devices to the next level.

Open quantum system
The dynamics of isolated quantum systems are governed by the Schrödinger equation $i\hbar\,\partial_t|\psi(t)\rangle = H(t)|\psi(t)\rangle$, where the operator H(t) is the Hamiltonian of the quantum system, whose expectation value has units of energy. The von Neumann equation, $\dot{\rho}(t) = -(i/\hbar)[H(t), \rho(t)]$, is equivalent to the Schrödinger equation, with the pure-state wave function |ψ(t)⟩ extended to the density matrix ρ(t) = |ψ(t)⟩⟨ψ(t)|. However, isolated quantum systems are theoretical constructs that do not exist in the real world. External environments always affect the quantum system by coupling to it, inducing undesired dynamics such as decoherence.
Generally speaking, one can always write down the total Hamiltonian $H_T = H_S + H_E + H_I$, including the system Hamiltonian $H_S$, the environment Hamiltonian $H_E$, and the coupling interaction Hamiltonian $H_I$. The dynamics of the total system are described by the von Neumann equation, and one retrieves information about the original system by tracing out the environmental subsystem, $\rho = \mathrm{Tr}_E(\rho_T)$, resulting in the Lindblad master equation $\dot{\rho} = -\frac{i}{\hbar}[H_S, \rho] + \sum_n \frac{1}{2}\left[2C_n\rho C_n^\dagger - \rho C_n^\dagger C_n - C_n^\dagger C_n\rho\right]$, where $C_n = \sqrt{\gamma_n}A_n$ are the collapse operators, $A_n$ are the operators that couple the system to the environment in $H_I$, and $\gamma_n$ are the corresponding rates. The density matrix is assumed to be initially in the product state $\rho_T(0) = \rho(0) \otimes \rho_E(0)$, i.e. the original system and the environment are not correlated at t = 0. They remain approximately separable, $\rho_T(t) \approx \rho(t) \otimes \rho_E$, during the evolution since the environment does not evolve significantly. The environment is considered to be Markovian, which requires its correlation functions to decay much faster than those of the system. It is also worth mentioning that the evolution of an open quantum system is no longer unitary. Therefore, the density matrix of the quantum system is in general not a pure state but a mixed state $\rho = \sum_n p_n|\psi_n\rangle\langle\psi_n|$, with $p_n$ being the classical probability of being in the state $|\psi_n\rangle$.
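To make the role of the collapse operators concrete, the sketch below integrates a Lindblad master equation of the standard form for a single qubit with a fixed-step Runge-Kutta scheme. It is an illustration under stated assumptions (ℏ = 1, a rate γ = 0.5 chosen arbitrarily, no drive), not the simulation code used for the results in this paper; the helper names `lindblad_rhs` and `evolve` are hypothetical.

```python
import numpy as np


def lindblad_rhs(rho, hamiltonian, collapse_ops):
    """Right-hand side of the Lindblad master equation (hbar = 1)."""
    drho = -1j * (hamiltonian @ rho - rho @ hamiltonian)
    for c in collapse_ops:
        c_dag = c.conj().T
        drho += c @ rho @ c_dag - 0.5 * (rho @ c_dag @ c + c_dag @ c @ rho)
    return drho


def evolve(rho, hamiltonian, collapse_ops, duration, steps=1000):
    """Integrate the master equation with a fixed-step 4th-order Runge-Kutta scheme."""
    dt = duration / steps
    for _ in range(steps):
        k1 = lindblad_rhs(rho, hamiltonian, collapse_ops)
        k2 = lindblad_rhs(rho + 0.5 * dt * k1, hamiltonian, collapse_ops)
        k3 = lindblad_rhs(rho + 0.5 * dt * k2, hamiltonian, collapse_ops)
        k4 = lindblad_rhs(rho + dt * k3, hamiltonian, collapse_ops)
        rho = rho + (dt / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)
    return rho


# Illustrative example: pure dephasing of the |+> state with no drive.
gamma = 0.5  # assumed rate, chosen only for illustration
projectors = [np.diag([1, 0]).astype(complex), np.diag([0, 1]).astype(complex)]
collapse_ops = [np.sqrt(gamma) * p for p in projectors]
rho0 = 0.5 * np.ones((2, 2), dtype=complex)  # |+><+|
rho_t = evolve(rho0, np.zeros((2, 2), dtype=complex), collapse_ops, duration=1.0)
print("coherence |rho_01|:", abs(rho_t[0, 1]))
print("purity Tr(rho^2):", np.trace(rho_t @ rho_t).real)
```

With equal rates on both projectors, the off-diagonal element decays as exp(−γt) while the populations are untouched, so the purity drops below one: the state becomes mixed, as described above.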

Weak measurement and continuous measurement
One of the major difficulties in applying ML algorithms in the quantum regime is caused by measurement, which is usually costless in the classical realm. The act of measurement destroys the state of a quantum system, projecting it onto an eigenstate once quantum information is extracted. Measuring a wave function with an operator $\hat{A}$ outputs eigenvalues, whose expectation follows $\langle\hat{A}\rangle = \langle\psi|\hat{A}|\psi\rangle$. In the language of density matrices, this reads $\langle\hat{A}\rangle = \mathrm{Tr}(\rho\hat{A})$. Aharonov's work [39] proposed an extension that extracts partial information from the quantum system without destroying it. The weak value $A_w = \langle\psi_f|\hat{A}|\psi_i\rangle/\langle\psi_f|\psi_i\rangle$ is no longer a real eigenvalue of the operator; it can take exotic or even complex values, where $|\psi_i\rangle$ and $|\psi_f\rangle$ are the pre- and post-selected states. The post-selection operation does not always succeed, and the wave function is discarded once the operation fails. To address this issue, we couple the quantum system to a pointer for entanglement and measure the pointer projectively for a weak value, which is actually the original framework proposed by Aharonov rather than the later developed pre/post-selection formalism. Specifically, a Gaussian pointer $|\Phi\rangle = \int (2\pi\sigma^2)^{-1/4}\exp(-q^2/4\sigma^2)|q\rangle\,dq$ is coupled to the qubit $|\Psi\rangle = [\cos(\alpha/2), \sin(\alpha/2)]^T$ through the interaction Hamiltonian $H_{\mathrm{int}} = g(t)\,\hat{p}\otimes\hat{A}$, where σ is the standard deviation of the pointer's position, $\hat{p}$ is its conjugate momentum operator, and g(t) is the coupling strength. A non-correlated initial state $|\Phi(q)\rangle\otimes|\Psi\rangle$ evolves under this Hamiltonian into the entangled state $\cos(\alpha/2)|\Phi(q - a_1)\rangle\otimes|a_1\rangle + \sin(\alpha/2)|\Phi(q - a_2)\rangle\otimes|a_2\rangle$ if $\int_0^{t_0} g(t)\,dt = 1$, where $a_i$ and $|a_i\rangle$ are the eigenvalues and eigenstates of the operator $\hat{A}$ to be weakly measured, respectively. For example, if one performs a weak measurement along the Z direction, i.e. with the Pauli-Z operator $\hat{A} = \hat{\sigma}_z$, the measured positions of the pointer follow a probability distribution whose mean is displaced by the expectation value $\langle\Psi|\hat{\sigma}_z|\Psi\rangle = \cos\alpha$. Correspondingly, the wave function of the qubit is slightly perturbed, conditioned on the weak value $q_0$ obtained as feedback from the projective measurement of the pointer. Additionally, quantum information can be continuously extracted from the quantum system, allowing for continuous measurement as the information obtained per measurement approaches zero. In this framework, the total operation time is divided into intervals of length ∆t, and a weak measurement is performed in each interval. The limit ∆t → 0 results in continuous measurement, with stochastic differential equations governing its dynamics [40,41]. In figure 1, we illustrate the dynamics of stochastic Schrödinger equations, used to flip a qubit with a fixed resonant π-pulse, for varying time intervals ∆t. It can be observed that if one weakly and continuously measures the qubit while applying a resonant π-pulse, without taking any action based on the feedback, the final state is likely to deviate from the target state that the π-pulse, the time-optimal open-loop solution, would otherwise reach. The more frequently we measure, the larger the expected deviation. Hence, for closed-loop quantum control, feedback must be exploited to control the system. We explore this idea further by studying the design of an RL algorithm to solve the closed-loop control of a noisy qubit.
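The perturbation caused by a single weak measurement can be written out explicitly by projecting the pointer of the entangled state onto the readout $|q_0\rangle$. The following worked derivation is a sketch that follows only from the Gaussian pointer and the entangled state given above; its notation (the primed state and the readout density $P(q_0)$) is introduced here for illustration.

```latex
% Gaussian pointer wave function, as defined in the text
\Phi(q) = (2\pi\sigma^2)^{-1/4}\exp\!\left(-\frac{q^2}{4\sigma^2}\right)

% Qubit state after the pointer is found at position q_0
|\Psi'\rangle = \frac{1}{\sqrt{P(q_0)}}
  \left[\cos\frac{\alpha}{2}\,\Phi(q_0 - a_1)\,|a_1\rangle
      + \sin\frac{\alpha}{2}\,\Phi(q_0 - a_2)\,|a_2\rangle\right]

% Probability density of the readout q_0
P(q_0) = \cos^2\frac{\alpha}{2}\,|\Phi(q_0 - a_1)|^2
       + \sin^2\frac{\alpha}{2}\,|\Phi(q_0 - a_2)|^2

% For \hat{A} = \hat{\sigma}_z, with a_1 = +1 and a_2 = -1
\langle q_0\rangle = \int q_0\,P(q_0)\,dq_0
  = \cos^2\frac{\alpha}{2} - \sin^2\frac{\alpha}{2} = \cos\alpha

% Relative reweighting of the two amplitudes
\frac{\Phi(q_0 - 1)}{\Phi(q_0 + 1)} = \exp\!\left(\frac{q_0}{\sigma^2}\right)
```

Since typical readouts satisfy $|q_0| \sim \sigma$, the last ratio approaches one for σ ≫ 1: a broad pointer reweights the two amplitudes only slightly, which is why the measurement is weak.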

Physical system and task
In section 2, we introduced Lindblad master equations as the governing equations for quantum systems under noise. For pure dephasing, the diagonal Lindblad operators are given by $C_n = \sqrt{\gamma_n}|n\rangle\langle n|$, yielding the master equation $\dot{\rho} = -\frac{i}{\hbar}[H_S, \rho] + \sum_n \gamma_n\left(|n\rangle\langle n|\rho|n\rangle\langle n| - \frac{1}{2}\{|n\rangle\langle n|, \rho\}\right)$ with $\gamma_0 = \gamma_1$, which degrades the coherence by reducing the off-diagonal elements of the density matrix. For relaxation, we consider the energy dissipation from the qubit to the external environment on the X direction, modeled by $C = \sqrt{\gamma}\sigma_x$. The non-unitary evolution due to the Lindblad terms in the master equation leads to a mixed-state density matrix, where the classical probability $p_n$ of being in $|\psi_n\rangle$ cannot be retrieved. Therefore, the perturbed system after weak measurement cannot be analytically calculated by equation (3) [42]. To extend our analysis to the case of the density matrix, we consider a Gaussian pointer in the pure state $\rho_p = |\Phi\rangle\langle\Phi|$, coupled to a two-level system in the mixed state ρ through the interaction Hamiltonian $H_{\mathrm{int}} = g\,\delta(t - t')\,\hat{p}\otimes\hat{\sigma}_z$. The collective system evolves from the initial state $\rho_{\mathrm{ini}} = \rho_p \otimes \rho$ to $\rho_{\mathrm{fin}}$ after the coupling, which shifts the pointer by $\langle\hat{\sigma}_z\rangle = \mathrm{Tr}(\rho\hat{\sigma}_z)$ when g = 1. One retrieves the state of the pointer after the coupling by tracing out the qubit. The measurement of the pointer's position projects the pointer onto its eigenstate $|q_0\rangle$, where the projection operator of the collective system reads $\hat{P} = |q_0\rangle\langle q_0| \otimes I$. In this way, we obtain the qubit's density matrix after the weak-value feedback $q_0$ by applying the projection operator and tracing out the pointer. Having clarified the calculation of the state perturbation in terms of the density matrix, we can now formulate the specific task to be studied by RL. We aim to study the optimal control of a continuously measured qubit within the operation time T by an ML algorithm. The goal is to flip the qubit from the state |0⟩ to |1⟩ using a sequence of pulses on the X direction.
Each pulse lasts a small interval ∆t and is described by the driving Hamiltonian $H = \Omega\sigma_x$, followed by a weak measurement on the Z direction. We assume that the measurement process is impulsive, meaning that the coupling and the projective measurement on the pointer are instantaneous and independent of the dynamical evolution. Meanwhile, the control pulses may also be imprecise, including a slight detuning $H = \Omega\sigma_x + \Delta\sigma_z$ and an amplitude error Ω → Ω + δΩ. The weak value and the last pulse amplitude are fed to the ML model as streaming data. Accordingly, the model's instant feedback then controls the quantum system for the next timestep.
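For an impulsive coupling with g = 1, projecting the pointer onto $|q_0\rangle$ and tracing it out is equivalent to applying the diagonal measurement operator $M(q_0) = \mathrm{diag}[\Phi(q_0 - 1), \Phi(q_0 + 1)]$ to the qubit, followed by renormalization. The sketch below implements this update for a density matrix. It assumes a continuous pointer position rather than the discretized position space used in our numerical setup, and the names `pointer_amplitude` and `weak_measure_z` are illustrative.

```python
import numpy as np


def pointer_amplitude(q, sigma):
    """Gaussian pointer wave function Phi(q) with position spread sigma."""
    return (2 * np.pi * sigma**2) ** -0.25 * np.exp(-(q**2) / (4 * sigma**2))


def weak_measure_z(rho, sigma, rng):
    """Apply one impulsive weak sigma_z measurement (g = 1) to a 2x2 density matrix."""
    # The readout follows a mixture of two Gaussians centred at the eigenvalues
    # +1 and -1, weighted by the populations of |0> and |1>.
    p_up = float(np.clip(rho[0, 0].real, 0.0, 1.0))
    centre = 1.0 if rng.random() < p_up else -1.0
    q0 = rng.normal(centre, sigma)
    # Operator acting on the qubit once the pointer is projected onto |q0>.
    m = np.diag([pointer_amplitude(q0 - 1.0, sigma),
                 pointer_amplitude(q0 + 1.0, sigma)])
    post = m @ rho @ m.conj().T
    return q0, post / np.trace(post).real


rng = np.random.default_rng(seed=1)
rho_plus = 0.5 * np.ones((2, 2), dtype=complex)  # |+><+|
q0, rho_post = weak_measure_z(rho_plus, sigma=5.0, rng=rng)
print("weak value q0:", q0)
print("perturbed populations:", rho_post[0, 0].real, rho_post[1, 1].real)
```

A large σ corresponds to a weak measurement: the readout is dominated by pointer noise and the populations move only slightly away from their previous values.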

Numerical setup
We apply the DRL method to our task. The environment consists of a qubit that is continuously measured, perturbed by the extraction of weak values, and controlled by the agent's pulses. The agent is implemented as an artificial neural network (ANN) that takes the RL state as input and outputs an action for the control problem. The ANN is trained by deep learning algorithms to approximate the optimal policy function π(a|s). Upon receiving the action from the agent, the environment evolves to the next timestep, computes the new RL state, and provides a corresponding reward. It is worth noting that an environment in the quantum realm differs from other physical environments. Quantum information, e.g. density matrix elements or the fidelity, is encoded in the RL state, which requires numerical simulation because the density matrix elements cannot be directly obtained from the qubit without destroying it. Hence, one has to compute the density matrix from the weak values and the control pulses, making the quantum environment non-trivial and computationally demanding.
In practice, we set the tunable range of the Rabi frequency (pulse amplitude) as the action, Ω ∈ [0, 3π] in dimensionless units, which is then renormalized to $\tilde{\Omega} \in [0, 1]$ to fit the neuron. The total operation time T = 1 is uniformly divided into n = 100 control pulses, each driving the qubit for a time interval of ∆t = 0.01. To save computational resources, we limit the position space of the pointer to q ∈ [−50, 50] with a uniform spacing of ∆q = 1. Consequently, the momentum operator $\hat{p}$ is constructed from $[\hat{q}, \hat{p}] = i\hbar$ with boundary conditions. The density matrix of the collective quantum system has a size of 202 × 202. The coarse-grained position space leads to integer weak values $q_0$, which are renormalized to $\tilde{q}_0 = (q_0 + 50)/100 \in [0, 1]$. Thereby, the RL state is defined to include the last action as the renormalized pulse, the renormalized weak value, the current system time, and the elements of the density matrix. The RL state is observed by the agent, an ANN with three fully connected hidden layers of 64 neurons activated by ReLU; the numerical simulation part of the environment then receives an action from the agent and evolves to the next state. We show the schematic flow diagram of the RL environment in figure 2 for a better understanding.
Figure 2. Schematic diagram of the environment without a reward function. The qubit is repeatedly and weakly coupled to the apparatus for information extraction. Its last state after the measurement is characterized by the density matrix $\rho(t_{i-1})$, which is driven by the last action $\Omega(t_i)$ to a new unperturbed state $\tilde{\rho}(t_i)$. The apparatus weakly measures the qubit for the feedback $q_0(t_i)$ and perturbs it to the state $\rho(t_i)$. The absolute values of its elements, together with the last action $\Omega(t_i)$, the feedback $q_0(t_i)$, and the renormalized system time $t_i = i/n$, are defined to be the RL state $s(t_i)$, which is observed by the RL agent (an artificial neural network), resulting in the corresponding action $a(s_i)$ for driving the qubit to its next state.
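The renormalizations and dimensions described above can be summarized in a few lines. The sketch below is illustrative only: the ordering of the entries in the state vector and the inclusion of all four density-matrix elements are assumptions, and `build_rl_state` is a hypothetical helper rather than a function from our code base.

```python
import numpy as np

OMEGA_MAX = 3 * np.pi          # upper bound of the Rabi frequency
Q_MIN, Q_MAX, DQ = -50, 50, 1  # pointer position grid
N_STEPS = 100                  # control pulses per episode, so dt = 1 / N_STEPS

# 101 pointer positions, so the qubit-pointer density matrix is 202 x 202.
q_grid = np.arange(Q_MIN, Q_MAX + DQ, DQ)
joint_dim = 2 * q_grid.size


def build_rl_state(omega, q0, step, rho):
    """Assemble a flat RL state from the last action, weak value, time, and |rho|."""
    omega_tilde = omega / OMEGA_MAX            # renormalized pulse in [0, 1]
    q0_tilde = (q0 - Q_MIN) / (Q_MAX - Q_MIN)  # (q0 + 50) / 100, in [0, 1]
    t_i = step / N_STEPS                       # renormalized system time i / n
    return np.concatenate(([omega_tilde, q0_tilde, t_i], np.abs(rho).ravel()))


rho = np.array([[0.9, 0.3j], [-0.3j, 0.1]])
state = build_rl_state(omega=1.5 * np.pi, q0=3, step=20, rho=rho)
print("joint dimension:", joint_dim)
print("RL state:", state)
```

Every entry of the resulting vector lies in [0, 1], which keeps the inputs of the network on a common scale.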

Training of the agent and results
We train three separate models for driving the qubit in the presence of detuning, dephasing, and relaxation on the X direction, respectively. The agents approximate the optimal policy, which maximizes the accumulated artificial reward. We keep the design of the reward functions general since we have no specific preference for any pulse shape. For the task of flipping the qubit, |0⟩ → |1⟩, we give the agent a negative reward $r(t_i) = -|\rho_{22}(t_i) - 1|$ per timestep, aiming at a fast flipping operation. The agent receives an extra reward of 1000 if $\rho_{22}$ exceeds the threshold, $|\rho_{22}| > 0.99$, in which case the episode terminates early and the total reward is calculated. We also notice that a penalty of 100 if $|\rho_{11}| > 0.05$ at the final timestep helps the convergence of the model. Figure 3 shows the high-fidelity closed-loop quantum control under various errors or noises. For relaxation on the X direction, we modify the terminal condition to $|\rho_{22}(t_i)| > 0.99$ for four neighboring timesteps, to prevent the model from converging on trivial resonant π-pulses. We use the Proximal Policy Optimization (PPO) method [43] to train the agent, with a learning rate of $1 \times 10^{-3}$ and a batch size of 20. PPO is a well-known baseline algorithm for DRL that converges reliably in most cases. All other hyperparameters are set to the default values in Tensorforce v0.5.3 [44]. Moreover, we introduce a random error on the action, characterized by a centered Gaussian distribution with a standard deviation of 0.02, which emulates the time-varying systematic error in the quantum system. The models give control pulses that are robust against systematic errors. It is important to note that a trade-off between fidelity and robustness often exists. We obtain the models in figure 3 after about 2000, 3000, and 8000 episodes for controlling the system under detuning, dephasing, and relaxation on the X direction, respectively.
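The reward scheme for the detuning and dephasing cases can be sketched as a single function. This is a hedged illustration: how the bonus and penalty are combined with the per-timestep reward is an assumption, the four-timestep terminal condition used for relaxation is not covered, and `step_reward` and its parameter names are hypothetical.

```python
def step_reward(rho22, rho11, step, n_steps=100,
                threshold=0.99, bonus=1000.0, penalty=100.0, rho11_limit=0.05):
    """Return (reward, done) for one timestep of the qubit-flipping task."""
    # Negative reward per timestep, pushing the agent toward a fast flip.
    reward = -abs(rho22 - 1.0)
    # Early termination with a bonus once the target population is reached.
    if abs(rho22) > threshold:
        return reward + bonus, True
    # Penalty at the final timestep if too much population remains in |0>.
    if step >= n_steps:
        if abs(rho11) > rho11_limit:
            reward -= penalty
        return reward, True
    return reward, False


print(step_reward(rho22=0.50, rho11=0.50, step=50))    # mid-episode
print(step_reward(rho22=0.995, rho11=0.005, step=60))  # early success
print(step_reward(rho22=0.60, rho11=0.40, step=100))   # failed episode
```

Because every timestep before success carries a negative reward, the accumulated reward is larger for shorter successful episodes, which encodes the preference for fast operations without prescribing a pulse shape.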

Transfer of the agent
An ML model goes online for service after being trained for a particular task, such as flipping a qubit under $\sigma_x$ relaxation as in figure 3(c). One can evaluate the model by querying information from the environment to check its validity after it is online. The flipped qubit can go on to further tasks, which are independent of the model's duty. In this way, the performance of the model can be evaluated by checking the results of these additional tasks without querying the environment or the model. If the performance of a well-trained model deviates from its expected behavior, one can conclude that the qubit in the environment has changed and that the quantum errors or noises have shifted to other patterns. It is then necessary to develop another model for precise control in the new environment. Instead of discarding the current model and training a new one, which would be inefficient, the agent can be transferred to the new environment to explore its capability to adapt to the new conditions with minimal effort. We test the proposal by starting from the trained agent in figure 3(c). When the agent is evaluated directly in a new environment with detuning ∆ = 0.1Ω, dephasing rate Γ = 0.05, and $\sigma_x$ relaxation rate γ = 0.05, the average final state deviates significantly from the previous result (cf figure 4(a)), and the fidelity decreases accordingly. To recover the performance, we continue training the agent with the same settings for about 2000 episodes; within this additional 20% of the total episodes, the agent recovers the performance it had before the environment shift (cf figure 4(b)).

Discussion
Based on the numerical experiments in section 3, we have demonstrated that DRL can be employed to investigate closed-loop quantum control. The fidelity can be further improved by fine-tuning the DRL agent in another training environment with different thresholds and reward function designs. Interestingly, we have found that the optimized policy of the agent is interpretable to some extent, as figures 3(c) and 4 show. Specifically, the maximal tunable Rabi frequency is 3π, which is 1.5 times the π-pulse for an operation time of T = 1. The agent first drives the qubit with a relatively high frequency to reach a large $\rho_{22}$ as quickly as possible. This is understandable since continuous measurement can be described in the language of superoperators, affecting the dynamics like quantum noise, whose effect can be suppressed by a reduced operation time. Later on, the pulse strength decreases significantly once $\rho_{22}$ is large enough, converging to a small constant value for more precise operations. Accordingly, the weak measurement, rather than the control pulse, predominantly governs the state evolution. This behavior is similar to the quantum Zeno effect, which locks a wave function onto an eigenstate by repeatedly performing projective measurements.
We now discuss this topic further in light of the results above. In section 3, we explained that the RL environment consists of a qubit and a numerical simulation part. The qubit can be physical, e.g. constructed in superconducting circuits, trapped ions, or photonics, or simulated by classical computers as in our numerical experiments. We emphasize again that the numerical simulation of the qubit dynamics is compulsory if we include quantum information, such as $\rho_{ii}$ or the fidelity, in the RL state and reward.
Although we can perform the weak measurement, extracting partial quantum information and converting it to the weak value $q_0$ without destroying the quantum state, it is still impossible to retrieve the total information of the density matrix from a single shot of measurement. We cannot treat the qubit as a black box, as we usually do in classical scenarios, where the observation of the RL state is instant and costless. By contrast, we have to calculate the qubit dynamics based on the actions and feedback, deducing $\rho_{ii}$ without operating on the qubit. This becomes an obstacle for stream learning in the quantum realm since simulating the quantum dynamics is time-consuming, e.g. about 7 s for an episode in our numerical experiment. Implementation on real quantum devices would require a simulation speedup of about $10^5$ times (compared to the T1 time of state-of-the-art superconducting qubits). A possible solution is to train another ANN to mimic the dynamics of the quantum system, taking the available information as input and outputting the quantum information to be deduced without measurements. The training of such an ANN needs plenty of training data and adequate training methods, which goes beyond the scope of this work.
Another method to avoid the black-box problem is to exclude the quantum information from the RL state. The RL state may then contain the weak value $q_0$ and other classical information such as the last action and the system time. However, this approach comes with challenges. Since the threshold criteria for early termination are no longer available in this paradigm, the training environment can only reward the agent with a constant at the end of each episode once a projective measurement on the target state succeeds. The agent struggles to learn precise control because of the low signal-to-noise ratio of $q_0$. The reward criterion also needs a large ensemble (batch size) to evaluate the fidelities of the final quantum states. In this way, the problem becomes more difficult, which makes it a candidate benchmark for evaluating RL algorithms.
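The need for a large ensemble can be illustrated with simple binomial statistics: if each episode ends with one projective measurement that succeeds with probability equal to the fidelity, the fidelity estimate has a standard error that shrinks only as the inverse square root of the number of episodes. The sketch below is a generic statistical illustration with hypothetical names and arbitrarily chosen values; it does not reproduce any experiment from this work.

```python
import numpy as np


def estimate_fidelity(true_fidelity, shots, rng):
    """Estimate a fidelity from `shots` end-of-episode projective measurements."""
    successes = rng.binomial(shots, true_fidelity)
    estimate = successes / shots
    # Binomial standard error of the estimated success probability.
    std_err = np.sqrt(estimate * (1.0 - estimate) / shots)
    return estimate, std_err


rng = np.random.default_rng(seed=0)
for shots in (20, 1000, 100000):
    estimate, std_err = estimate_fidelity(0.99, shots, rng)
    print(f"{shots:>6} shots: fidelity ~ {estimate:.4f} +/- {std_err:.4f}")
```

Under these assumptions, resolving a fidelity difference of order 0.01 requires on the order of a thousand episodes or more, far beyond a batch of 20.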

Conclusion
In summary, we have studied the closed-loop quantum control of a noisy qubit using DRL. We have employed a Gaussian apparatus to extract the quantum information from the qubit through weak coupling.
In the presence of detuning, dephasing, and relaxation, which are typical systematic errors and quantum noises, we have developed the corresponding models for the bit-flipping task with high fidelity. Moreover, we have shown that transfer learning can be used to adapt a model to a new noise pattern instead of training from scratch, once the performance decay resulting from the changed noises and errors is observed. To facilitate reproducibility, we have made all source code for the simulation of quantum dynamics, the ML models, and the evaluation scripts available on an open-source platform.