Quantum generative adversarial imitation learning

Investigating quantum advantage in the NISQ era is a challenging problem, and quantum machine learning has become the most promising application to resort to. However, no proposal has yet been investigated for the arguably challenging task of inverse reinforcement learning to demonstrate the potential advantage. In this work, we propose a hybrid quantum–classical inverse reinforcement learning algorithm based on the variational quantum circuit within the generative adversarial framework. We find an important connection between the quantum gradient anomaly and performance degradation, which suggests a gradient clipping strategy to stabilize the training process. Using this algorithm, we study three classic control problems and Hamiltonian parameter estimation in quantum sensing with shallow quantum circuits. The numerical results show that the control-enhanced quantum sensor can saturate the quantum Cramér-Rao bound with only a single variational layer, empirically demonstrating a parameter complexity advantage over classical learning control. The proposed generative adversarial reinforcement learning algorithm achieves state-of-the-art performance in classical and quantum sensor control in terms of the required number of parameters.


Introduction
Quantum computation attracts intensive attention from academia and industry for its unique characteristics, such as quantum superposition and entanglement, which may provide substantial speedup over classical computation [1,2]. Demonstrating quantum computational advantage has been a great challenge over the past decade. Numerous quantum algorithms have been proposed to demonstrate quantum advantage based on different theoretical models. The most encouraging progress comes from the experimental studies of random circuit sampling [3] and Boson sampling [4], where quantum advantage was first verified in practical superconducting and photonic circuits.
The strong computational capability of fault-tolerant quantum computers has stimulated research interest in using quantum computers to speed up machine learning algorithms [5][6][7][8]. Most quantum machine learning (QML) algorithms are quantum versions of classical statistical learning models built on quantum linear algebra. These QML algorithms are assumed to process logical qubits using logical quantum gates based on the quantum oracle model [9], and are therefore hard to realize on noisy intermediate-scale quantum (NISQ) devices to show quantum advantage [10,11]. NISQ machine learning instead focuses on using the variational quantum circuit (VQC) as the core algorithmic component, and the VQC model is the promising candidate for demonstrating quantum advantage [12,13]. Previous seminal VQC-based QML algorithms concentrate on classification [14,15] and generative modeling [16] to demonstrate their advantage in handling artificial data [17,18].
A few studies on quantum reinforcement learning (QRL) have shown its learning capability and benchmarked the results against classical models [19][20][21]. While RL requires a reward function to be defined for an agent to learn from, inverse RL (IRL) allows us to infer a reward function from expert demonstrations, which helps when the reward function is difficult to define manually [22]. IRL imitates human behavior, which is particularly important for applications where the agent interacts with humans, such as healthcare or customer service [23]. Besides, IRL can improve the robustness of agents to changes in the environment, as the inferred reward function is often more generalizable than a manually specified one. IRL can also learn from large-scale datasets of expert demonstrations, which can be more efficient than collecting rewards manually. Generative adversarial imitation learning (GAIL) is a representative IRL algorithm that uses a generative adversarial network (GAN) to learn a reward function from expert demonstrations. GAIL has been applied to a variety of domains, including robotics [24], game playing [25], and autonomous driving [26], and has shown promising results in learning human-like behaviors. To our knowledge, there is no study of an inverse QRL (IQRL) algorithm that exploits quantum advantage. Consequently, IQRL needs to be investigated to showcase its learning capability and potential advantage in classical control and the more crucial quantum control problems.
In this work, we propose a model-free IQRL algorithm called quantum generative adversarial imitation learning (QGAIL). QGAIL inherits the architecture of GAIL, in which, unlike conventional RL methods, no reward function needs to be designed. The quantum agent in QGAIL is trained against a discriminator network, to which the expert trajectories are input as the supervised reward signal. The quantum agent imitates the behavior in the expert trajectories so that the discriminator network cannot distinguish the strategy of the agent from that of the expert. Our QGAIL algorithm adopts an actor-critic architecture in which the quantum policy network is trained with the proximal policy optimization (PPO) method. Furthermore, QGAIL is naturally suited to learning discrete distributions by sampling from quantum circuits, which may be useful in some complex discrete control problems. Based on QGAIL, we provide a range of training demonstrations in the OpenAI Gym environments to show its feasibility and parameter complexity advantage: the required number of parameters is polynomially smaller than for classical RL methods. More significantly, we apply QGAIL to quantum sensing to estimate the parameters of a quantum Hamiltonian. The precision of the estimated parameter can saturate the quantum Cramér-Rao bound (QCRB) through quantum controls provided by QGAIL with a single variational layer. To our knowledge, this is the first study of inverse QRL in quantum parameter estimation for quantum sensing. The learning capability and parameter complexity advantage of inverse QRL are highlighted for quantum control.
The work is organized as follows. In section 2, the related works are discussed. In section 3, the physical model and the hybrid quantum-classical (QC) algorithm are analyzed. In section 4, we introduce two typical applications of QGAIL including classical and quantum sensor controls. In section 5, we present the simulation results based on QGAIL for classical and quantum controls, respectively. Section 6 summarizes the work.

Related work
While there has been significant research on VQC-based QML, the investigation of VQC-based RL has been limited. However, there have been several recent developments in this area. For instance, Chen et al [27] proposed a QRL algorithm that employs VQC to estimate the value function for discrete state spaces. Lockwood and Si [28] extended this VQC-based QRL to continuous state spaces, and in [29], the authors demonstrated that simple VQC-based Q-networks are insufficient for solving Atari games like Pong and Breakout. Additionally, Jerbi et al [30] investigated a hybrid QC algorithm for value-based RL, utilizing an energy-based neural network such as a quantum Boltzmann machine. However, these studies were restricted to value-based QRL methods and evaluated only on classic problems. Jerbi et al [20] proposed a hybrid QC policy-based QRL for classic problems and revealed that QRL, as opposed to RL, can solve supervised learning problems based on discrete logarithmic hardness. Furthermore, in [31], the authors presented a hybrid QC policy-based QRL approach to address real-world problems such as vehicle routing. Sequeira et al [32] explored a hardware-efficient VQC-based QRL approach for both classical and quantum control problems and demonstrated that VQC requires a smaller number of parameters to solve quantum control problems. Moving on to the full quantum setting, Wu et al [33] studied a deterministic policy-based RL method in which both the environment and agent are quantum. They suggested that VQC-based RL can solve quantum control problems with fewer optimizations. Meanwhile, Jerbi et al [34] studied the quantum policy gradient algorithm to demonstrate the quantum advantage of the full quantum setting, with VQC potentially providing quadratic speed-ups in sample complexity. 
Finally, Yun et al [35] proposed a quantum multi-agent RL approach based on VQC and demonstrated that it can improve the total reward in a single-hop environment where edge agents offload packets to clouds. In our work, we investigate QGAIL for classical and quantum control problems in the IRL setting. We employ a hardware-efficient shallow VQC to approximate a policy and examine the learning capability of QGAIL and its advantage in parameter complexity. Additionally, we examine the relationship between gradient anomalies and performance.
Figure 1. The schematic of QGAIL. The algorithm adopts the expert trajectories as the supervised reward to train the policy and value networks. The environment can be a classic control task or quantum sensor control, in which the emulator is characterized by classical or quantum dynamics. The value and discriminator networks are classical; they evaluate the policy and discriminate the policy from the expert trajectories, respectively. The policy network is constructed from the variational quantum circuit with the data re-uploading technique, in which the observations are encoded into each variational layer, shown as the blue squares in the variational quantum circuit. The parameters of the value and policy networks are optimized via the proximal policy optimization method.

Physical model and algorithm
The basic structure of RL consists of two core parts: the agent and the environment, each of which can be classical or quantum. When a quantum agent interacts with a classical environment, the setting is generally referred to as QC RL. When the environment is also quantum, it is referred to as QQ RL. The CQ and CC settings are defined accordingly. There are many works in CC and CQ, such as AlphaGo playing the game of Go against a human player [36] and classical deep RL methods handling quantum tasks [37][38][39]. Here we handle QC and QQ tasks based on QGAIL.
A general RL algorithm considers observations, actions, and rewards. The states S refer to the set of positions of the agent at a specific time step in the (C/Q) environment. The rewards R are the numerical values the agent receives when it performs an action from the action set A given an observation of the state. When the environment receives the action, it internally updates to a new state. The probability that the agent moves from one state to its successor state is called the state transition probability, obeying the distribution p(·) with which the environment updates the states; p(·) depends on the action the environment receives. A Markov decision process (MDP) is defined by exactly this rule: given an action, one state moves to another with probability p(·), and the environment provides a reward value at the same time. An MDP can be described by a tuple (S, A, p(·), γ), where γ ∈ (0, 1) denotes the discount rate that balances the importance of current and future rewards.
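To make the role of the discount rate γ concrete, the following minimal Python sketch computes the discounted return of a single episode. The function name `discounted_return` and the sample reward list are illustrative assumptions and do not come from the original text.

```python
def discounted_return(rewards, gamma):
    """Return sum_t gamma**t * r_t for one episode of rewards."""
    total = 0.0
    # Accumulate backwards so each earlier step discounts everything after it.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


if __name__ == "__main__":
    # Three unit rewards with gamma = 0.5: 1 + 0.5 + 0.25
    print(discounted_return([1.0, 1.0, 1.0], 0.5))
```

A value of γ close to 1 weights future rewards almost as much as the current one, while a small γ makes the agent short-sighted.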

Hybrid QC actor-critic network
QGAIL is a quantum version of classical GAIL in which a quantum policy neural network replaces the classical policy neural network. The quantum policy neural network is chosen as a parameterized quantum circuit (PQC) composed of interleaved rotation and entangling layers, as can be seen in figure 1. The PQC has been proved to be universal in approximating arbitrary functions [41]. However, the structure of the PQC has an impact on the final performance in practical situations. In this work, QGAIL consists of one classical value network, one quantum policy network, and one discriminator network. The three networks cooperate to constitute the hybrid QC neural network.
We adopt the data re-uploading technique in the PQC to enhance the capability of the model [42], where the classical input s ∈ S is encoded by local unitary rotations $R_x$, $R_y$, $R_z$. The final quantum state is produced by the quantum policy network, which alternates encoding and variational layers. Here $U_{\mathrm{var}}(\nu^{(l)})$ denotes the lth variational quantum layer, $U_{\mathrm{enc}}(s, \lambda_l)$ represents the lth data encoding layer with scaling parameters λ, the trainable parameters are $\nu = \{\nu^{(1)}, \cdots, \nu^{(L)}\}$, $\nu \in [0, 2\pi]^{|\nu|}$, with L denoting the number of layers, and the trainable scaling parameters are $\lambda = \{\lambda_1, \cdots, \lambda_{L-1}\}$, $\lambda \in \mathbb{R}^{|\lambda|}$. Encoding and variational parameters are denoted as θ = (ν, λ). Each variational quantum layer is composed of local qubit rotations and two-local entangling operations, where n is the number of qubits or the size of the state space of the RL environment. In practice, we can also choose a CNOT gate or a rotatable ZZ gate as the entangling operator. The two-local structure can be linear, circular, or full. In figure 1, we show a circular entanglement structure. In the data encoding layer, $\lambda^{l}_{i,1/2}$ denotes the trainable scaling parameters of the lth encoding layer for qubit i. The state or observation from the C/Q RL environment is $s = (s_1, s_2, \cdots, s_n)$ with $s_i \in \mathbb{R}$. Finally, the observable $O_a$ is chosen to measure the quantum state. An arbitrary Hermitian operator $O_a$ can be decomposed into $O_a = \sum_i w_{a,i} h_{a,i}$, where $h_{a,i}$ denotes a sub-Hamiltonian in the n-qubit Pauli group, $h_{a,i} \in P_n$. Then the expectation of the action observable via quantum expectation estimation is given by
$$\langle O_a \rangle_{s,\theta} = \sum_i w_{a,i} \langle h_{a,i} \rangle_{s,\theta} = w_a \cdot h^{a}_{s,\theta}, \qquad (5)$$
where we have defined $h^{a}_{s,\theta} = \langle h_a \rangle_{s,\theta}$ as the expectation vector obtained by observing each local sub-Hamiltonian. Therefore, the action observation can be obtained by post-processing the local observations with a classical trainable neural layer w. For clarity, we let θ = (ν, λ, w) denote all the trainable parameters in the hybrid QC policy network. The PQC with trainable parameters is displayed in figure 2. The number of training parameters is polynomially reduced compared to classical neural networks.
Figure 2. A typical PQC with two qubits and two layers. The entangling layer of ZZ gates is trainable with parameters $\nu_e$. The architecture is composed of alternating layers of encoding unitaries $U_{\mathrm{enc}}(s, \lambda)$ and variational unitaries $U_{\mathrm{var}}(\nu)$. The entangling layer can also consist of CZ gates without training parameters. For L layers and n ⩾ 3 qubits, the number of training parameters is (6L − 3)n for the circular ZZ structure and (5L − 2)n for CZ entangling gates.
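The parameter counts quoted in the figure 2 caption can be tabulated with a few lines of Python. This is only an arithmetic sketch: `pqc_param_count` encodes the two stated formulas, while `mlp_param_count` and the 4-64-64-2 layer sizes are a hypothetical classical baseline chosen for illustration, not a network taken from the paper.

```python
def pqc_param_count(num_layers, num_qubits, entangler="zz"):
    """Trainable-parameter count of the PQC as stated in the figure 2 caption.

    entangler="zz": circular trainable ZZ gates -> (6L - 3) * n
    entangler="cz": parameter-free CZ gates     -> (5L - 2) * n
    """
    if num_qubits < 3:
        raise ValueError("the quoted formulas assume n >= 3 qubits")
    if num_layers < 1:
        raise ValueError("at least one layer is required")
    if entangler == "zz":
        return (6 * num_layers - 3) * num_qubits
    if entangler == "cz":
        return (5 * num_layers - 2) * num_qubits
    raise ValueError("entangler must be 'zz' or 'cz'")


def mlp_param_count(layer_sizes):
    """Weights plus biases of a fully connected network (hypothetical baseline)."""
    return sum((fan_in + 1) * fan_out
               for fan_in, fan_out in zip(layer_sizes, layer_sizes[1:]))


if __name__ == "__main__":
    for layers in (1, 2, 3, 5):
        print(layers, pqc_param_count(layers, 4, "zz"), pqc_param_count(layers, 4, "cz"))
    print("MLP 4-64-64-2:", mlp_param_count([4, 64, 64, 2]))
```

With n = 4 qubits, the count grows linearly in L, whereas the illustrative MLP already carries several thousand parameters.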
For a discrete action space, we adopt the softmax operation with a tunable temperature β to obtain the final quantum policy,
$$\pi_\theta(a|s) = \frac{e^{\beta \langle O_a \rangle_{s,\theta}}}{\sum_{a'} e^{\beta \langle O_{a'} \rangle_{s,\theta}}}.$$
The softmax operation is necessary to obtain the quantum policy for discrete actions. For a continuous action space, the softmax operator is not applicable because it maps the action observation into the range [0, 1]. Since each local observable is in the Pauli group, we have $\|h_{s,\theta}\| \le 1$, i.e. the expectation value of each local observable is limited to the range [−1, 1]. We therefore use a trainable weight to map the observation value to an arbitrary range, which is viewed as the mean of the quantum policy, and we randomly initialize a trainable parameter $\sigma_a^2$ as the variance of the quantum policy. Consequently, the quantum policy for continuous control is the Gaussian distribution with this mean and variance; we choose the Gaussian as the policy distribution according to common practice. We adopt a multi-layer perceptron (MLP) as the classical value network that evaluates the observation from the environment, where σ is the activation function, $L_v$ is the number of layers, $W^{(l)}_v \in \mathbb{R}^{H \times n}$ and $b^{l_v} \in \mathbb{R}^{H}$ denote the trainable weights and biases, and H is the number of hidden neurons. Here we train a separate value network rather than a branch of the quantum policy network.
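The two policy heads described above can be sketched in plain Python. The sketch assumes the expectation values ⟨O_a⟩ have already been measured and are passed in as floats; the function names, the use of β as a multiplier inside the exponential, and the sample numbers are illustrative assumptions rather than the authors' implementation.

```python
import math
import random


def softmax_policy(expectations, beta=1.0):
    """Map action expectation values <O_a> to probabilities with scale beta."""
    scaled = [beta * value for value in expectations]
    shift = max(scaled)  # subtract the maximum for numerical stability
    weights = [math.exp(value - shift) for value in scaled]
    norm = sum(weights)
    return [weight / norm for weight in weights]


def gaussian_mean(local_expectations, weights):
    """Weighted sum of local expectations in [-1, 1], used as the policy mean."""
    return sum(w * h for w, h in zip(weights, local_expectations))


def sample_continuous_action(mean, sigma, rng=random):
    """Draw one continuous action from a Gaussian policy N(mean, sigma**2)."""
    return rng.gauss(mean, sigma)


if __name__ == "__main__":
    print(softmax_policy([0.3, -0.2], beta=2.0))
    mu = gaussian_mean([0.5, -0.1], [2.0, 1.0])
    print(sample_continuous_action(mu, 0.1, random.Random(0)))
```

A larger β sharpens the discrete policy toward the action with the largest expectation value, while β = 0 gives a uniform policy.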

QGAIL
In the previous part, we presented the hybrid QC actor-critic architecture used to characterize the quantum policy. In the IRL setting, it is assumed that the agent has no access to the environment reward. Classical GAIL adopts generative adversarial learning as the framework to directly train the classical policy network, obtaining the expert reward by feeding the expert trajectories into a discriminator. More details can be found in appendix A.3. QGAIL shares the same algorithmic framework as GAIL, but the policy is quantum. QGAIL consists of two training phases: training the discriminator and training the actor-critic. Training the discriminator is accomplished by optimizing a min-max game between the quantum policy generator $G_\pi$ and the discriminator D, where $\pi_E$ is the expert policy and D is used to distinguish the trajectories of the expert from those generated by the quantum policy. D can be approximated by an MLP, where $W^{(l_d)}_d$ and $b^{l_d}$ represent the trainable weights and biases, $L_d$ denotes the number of hidden layers, and the Cat(·) operation concatenates the observation and action into a vector.
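As a rough illustration of the discriminator phase, the sketch below evaluates a binary cross-entropy loss on discriminator outputs and converts an output into a reward. It assumes the labeling convention in which D should output 1 on policy-generated pairs and 0 on expert pairs, which is the convention under which the reward −log D (the expert reward used in the PPO step) is large when the agent looks like the expert. The function names and the labeling convention are assumptions, not statements about the authors' code.

```python
import math

_EPS = 1e-12  # guards the logarithm against D = 0 or D = 1


def discriminator_bce(d_policy, d_expert):
    """Binary cross-entropy with label 1 for policy pairs and 0 for expert pairs.

    d_policy: discriminator outputs D(s, a) on pairs sampled from the policy
    d_expert: discriminator outputs D(s, a) on pairs from expert trajectories
    """
    policy_term = -sum(math.log(max(d, _EPS)) for d in d_policy) / len(d_policy)
    expert_term = -sum(math.log(max(1.0 - d, _EPS)) for d in d_expert) / len(d_expert)
    return policy_term + expert_term


def expert_reward(d_value):
    """Reward e_t = -log D(s_t, a_t) handed to the policy optimizer."""
    return -math.log(max(d_value, _EPS))


if __name__ == "__main__":
    print(discriminator_bce([0.9, 0.8], [0.1, 0.3]))
    print(expert_reward(0.2), expert_reward(0.9))
```

An undecided discriminator (D = 0.5 everywhere) gives a loss of 2 log 2, and the reward grows as the discriminator becomes less able to flag a pair as policy-generated.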
The inner maximization can use the binary cross-entropy cost function to calculate the loss and then the gradients that update the parameters of D. The outer minimization, in turn, can adopt the PPO algorithm to update the quantum actor-critic network. PPO is a simplified version of the TRPO algorithm and achieves state-of-the-art performance in numerous RL games [43]. First, PPO calculates the probability ratio between the new and old policies at time step t,
$$r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t|s_t)}.$$
Then, PPO imposes a constraint by forcing $r_t(\theta)$ to stay within a small interval around 1, precisely
$$1 - \epsilon \le r_t(\theta) \le 1 + \epsilon,$$
where ϵ is a hyperparameter. The clipped objective is given by
$$J^{\mathrm{CLIP}}_Q(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\, A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t),\ \mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)\, A^{\pi_{\theta_{\mathrm{old}}}}(s_t, a_t)\right)\right],$$
where the function $\mathrm{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon)$ clips the ratio to be no more than 1 + ϵ and no less than 1 − ϵ. The advantage function under the old policy is the difference between the state-action value function and the state value function,
$$A^{\pi_{\theta_{\mathrm{old}}}}(s, a) = Q^{\pi_{\theta_{\mathrm{old}}}}(s, a) - V^{\pi_{\theta_{\mathrm{old}}}}(s).$$
In practice, we can use a truncated version of generalized advantage estimation,
$$\hat{A}_t = \delta_t + (\gamma\xi)\,\delta_{t+1} + \cdots + (\gamma\xi)^{T-t-1}\,\delta_{T-1}, \qquad \delta_t = e_t + \gamma V(s_{t+1}) - V(s_t),$$
where ξ is a hyperparameter similar to γ, T is the maximum time step, and the expert reward is estimated by $e_t = -\log D_{W_d}(s_t, a_t)$. Then, we maximize $J^{\mathrm{CLIP}}_Q(\theta)$ via the stochastic gradient ascent method, which requires the policy gradient of the PQC. The value network is trained by minimizing the mean square error $J_V$ between the accumulated expert reward and the state value over a set of trajectories $\mathcal{D} = \{\tau_i\}$. The procedure of the QGAIL algorithm is presented in Algorithm 1.
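The clipped objective and the truncated generalized advantage estimation described above can be sketched as follows. The sketch uses the standard PPO definitions with the expert reward in place of the environment reward; the function names, default hyperparameter values, and the convention that `values` carries one extra bootstrap entry are assumptions made for illustration.

```python
def clipped_surrogate(ratios, advantages, eps=0.2):
    """Batch mean of min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    total = 0.0
    for ratio, advantage in zip(ratios, advantages):
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        total += min(ratio * advantage, clipped_ratio * advantage)
    return total / len(ratios)


def truncated_gae(rewards, values, gamma=0.99, xi=0.95):
    """Truncated generalized advantage estimates for one trajectory.

    rewards: expert rewards e_0 ... e_{T-1}
    values:  state values V(s_0) ... V(s_T), i.e. one more entry than rewards
    """
    horizon = len(rewards)
    advantages = [0.0] * horizon
    running = 0.0
    for t in reversed(range(horizon)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * xi * running
        advantages[t] = running
    return advantages


if __name__ == "__main__":
    adv = truncated_gae([0.7, 0.6, 0.9], [0.5, 0.4, 0.6, 0.0])
    print(adv)
    print(clipped_surrogate([1.3, 0.9, 1.05], adv))
```

The clipping removes any incentive to push the ratio beyond 1 + ϵ when the advantage is positive, which is what keeps each policy step small.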
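The quantum policy step in equation (23) involves calculating the derivative of the CLIP objective function, which follows from the derivative of the probability ratio,
$$\nabla_\theta r_t(\theta) = r_t(\theta)\, \nabla_\theta \log \pi_\theta(a_t|s_t).$$
The gradient of the log-policy for a discrete action space is given by
$$\nabla_\theta \log \pi_\theta(a|s) = \beta \left( \nabla_\theta \langle O_a \rangle_{s,\theta} - \sum_{a'} \pi_\theta(a'|s)\, \nabla_\theta \langle O_{a'} \rangle_{s,\theta} \right).$$
For a continuous action space, the derivative of the log-policy follows from the Gaussian form of the policy. The partial derivative of equation (5) with respect to the observable weights trivially gives $h^{a}_{s,\theta}$. The derivatives with respect to the variational and scaling parameters ν, λ, however, can be estimated by the parameter-shift rule [44]:
$$\partial_i \langle O_a \rangle_{s,\theta} = \frac{1}{2} \left( \langle O_a \rangle_{s,\theta + \frac{\pi}{2} e_i} - \langle O_a \rangle_{s,\theta - \frac{\pi}{2} e_i} \right),$$
where $e_i$ is the unit vector along the ith parameter and $\partial_i \langle O_a \rangle_{s,\theta}$ is also called the quantum gradient, i.e. the partial derivative of the measured observable with respect to the variational parameters. We remark that the parameter-shift rule is a standard method to estimate the gradient of a PQC with respect to its trainable parameters on a real quantum device. However, when simulating the quantum circuit on a classical computer, back-propagation and the adjoint method execute faster than the parameter-shift rule. The parameters of the value and discriminator networks are updated with classical training techniques.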
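The parameter-shift rule can be checked on the smallest possible example. The sketch below assumes a single qubit prepared in |0⟩, rotated by R_y(θ), and measured in Z, so that ⟨Z⟩ = cos θ and the exact derivative is −sin θ. The circuit, the function names, and the test angle are illustrative choices and not the policy circuit of the paper.

```python
import math


def expectation_z(theta):
    """<Z> after R_y(theta) acts on |0>, computed from the state amplitudes."""
    amp0 = math.cos(theta / 2.0)  # amplitude of |0>
    amp1 = math.sin(theta / 2.0)  # amplitude of |1>
    return amp0 * amp0 - amp1 * amp1


def parameter_shift_gradient(expectation, theta, shift=math.pi / 2.0):
    """Gradient from two shifted circuit evaluations: (f(t + s) - f(t - s)) / 2."""
    return 0.5 * (expectation(theta + shift) - expectation(theta - shift))


if __name__ == "__main__":
    angle = 0.7
    print(parameter_shift_gradient(expectation_z, angle), -math.sin(angle))
```

Unlike a finite difference, the two evaluations are taken at a macroscopic shift of π/2, so the estimate is exact for this gate up to sampling noise on hardware.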

Algorithm 1. Quantum GAIL.
Input: Expert trajectories $\tau_E \sim \pi_E$, initial quantum policy network parameters $\theta_0$, value network parameters $w^v_0$ and discriminator network parameters $w^d_0$, learning rate α
1: for i = 0, 1, 2, · · · do
2:   Sample a batch of trajectories $\mathcal{D} = \{\tau_i\}$, $\tau_i \sim \pi_{\theta_i}$
3:   Update the discriminator parameters from $w^d_i$ to $w^d_{i+1}$ with the gradient of the binary cross-entropy loss
4:   Take a quantum policy step from $\theta_i$ to $\theta_{i+1}$, using the PPO algorithm with expert reward $\{e_t\}$. Specifically, update the parameters by gradient ascent, $\theta_{i+1} = \theta_i + \alpha \nabla_\theta J^{\mathrm{CLIP}}_Q(\theta_i)$
5:   Take a value step to minimize the error function $J_V$ by using the gradient descent algorithm. Specifically, $w^v_{i+1} = w^v_i - \alpha \nabla_{w^v} J_V(w^v_i)$
6: end for

Classic and quantum controls
Classic control is mainly concerned with finding the control signals that complete the games in the environment. We adopt three tasks as the classical environments: CartPole-v1, MountainCar-v0 and Acrobot-v1. These environments require a discrete policy to maximize the accumulated reward. We use the proposed QGAIL method to handle these classic games by finding optimal control signals. The detailed specifications, including the observation and action spaces of the environments, can be found in appendix B. In general, classic environments are described by classical dynamics, which can be well characterized by an MDP. The quantum agent learns a good policy, producing a control sequence that drives the classical environment to complete the task.
On the other hand, we choose a quantum sensor as the representative quantum environment to study the performance of QGAIL in producing optimal quantum control signals. A quantum sensor consists of two critical parts: the quantum evolution that senses the unknown parameters, and a quantum or classical processing unit (QPU or CPU) that generates the optimal control signals. The quantum evolution is characterized by the quantum Lindblad equation
$$\partial_t \rho(t) = -i[H(\omega), \rho(t)] + \mathcal{L}[\rho(t)],$$
where ρ denotes the density matrix of the quantum sensor, $\mathcal{L}$ is the Lindblad operator that characterizes the Markovian noise process, and H(ω) is the Hamiltonian of the quantum sensor used to sense the unknown parameters ω. In general, the quantum sensor Hamiltonian H(ω) is
$$H(\omega) = H_0(\omega) + \sum_{j=1}^{p} \mu_j(t) H_j,$$
where $H_0(\omega)$ is the time-independent Hamiltonian of the sensor that encodes the unknown parameters, $H_j$ denotes the jth control Hamiltonian, $\mu_j$ is its control field, and p is the number of control fields. Note that $H_0(\omega)$ can also be time-dependent. In a quantum sensor, one important goal is to achieve the most accurate parameter estimation and reach the Heisenberg limit (HL). A key quantity relating the HL to parameter estimation is the quantum Fisher information (QFI), which characterizes the maximum information that can be extracted from the quantum state of the sensor. The QFI can be calculated by
$$F(t) = \mathrm{Tr}\left[\rho(t) L_s^2(t)\right],$$
where $L_s(t)$ is the symmetric logarithmic derivative operator, which is obtained by solving the equation
$$\partial_\omega \rho(t) = \frac{1}{2}\left[\rho(t) L_s(t) + L_s(t) \rho(t)\right].$$
In practice, we can estimate $\partial_\omega \rho(t)$ through a numerical first-order difference over ω. According to the Cramér-Rao bound [45], the QFI provides a saturable lower bound on the achievable estimation precision,
$$\Delta\hat{\omega} \ge \frac{1}{\sqrt{M F(t)}},$$
where $\Delta\hat{\omega}$ denotes the standard deviation of an unbiased estimator $\hat{\omega}$ and M denotes the number of repeated measurements. The goal of the quantum sensor is to find an optimal control sequence that maximizes the QFI, thus minimizing the standard deviation.
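For a single qubit, the symmetric-logarithmic-derivative QFI has a closed form in terms of the Bloch vector r, namely F = |∂r|² + (r·∂r)²/(1 − |r|²), which reduces to |∂r|² for pure states. The sketch below uses this form together with a central finite difference over ω. It assumes a free Hamiltonian H = ω σ_z/2, a probe state |+⟩, and an ω-independent coherence factor that crudely mimics dephasing; these modeling choices and all function names are illustrative assumptions, not the simulation used in the paper.

```python
import math


def bloch_vector(omega, time, coherence=1.0):
    """Bloch vector of |+> after precessing about z for `time` under H = omega * sigma_z / 2."""
    return (coherence * math.cos(omega * time),
            coherence * math.sin(omega * time),
            0.0)


def qubit_qfi(omega, time, coherence=1.0, step=1e-5):
    """Single-qubit QFI from the Bloch vector and its finite-difference derivative."""
    r = bloch_vector(omega, time, coherence)
    r_plus = bloch_vector(omega + step, time, coherence)
    r_minus = bloch_vector(omega - step, time, coherence)
    dr = [(p - m) / (2.0 * step) for p, m in zip(r_plus, r_minus)]
    dr_sq = sum(x * x for x in dr)
    r_sq = sum(x * x for x in r)
    if 1.0 - r_sq < 1e-9:  # pure state: the second term vanishes
        return dr_sq
    r_dot_dr = sum(a * b for a, b in zip(r, dr))
    return dr_sq + r_dot_dr ** 2 / (1.0 - r_sq)


def cramer_rao_bound(qfi, repetitions):
    """Lower bound 1 / sqrt(M * F) on the standard deviation of an unbiased estimator."""
    return 1.0 / math.sqrt(repetitions * qfi)


if __name__ == "__main__":
    ideal = qubit_qfi(omega=1.0, time=5.0)
    noisy = qubit_qfi(omega=1.0, time=5.0, coherence=0.5)
    print(ideal, noisy, cramer_rao_bound(ideal, repetitions=100))
```

In the noiseless case the sketch gives F ≈ T², and any loss of coherence lowers the QFI and therefore loosens the bound.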
The quantum sensor evolution is continuous in time, which is not directly suitable for an RL environment. We therefore discretize the continuous time evolution into steps of duration ∆t. To simplify the notation, let $\mathcal{L}_t$ denote the Lindblad superoperator at time t, in which $A_i(t)$ denotes a quantum noise operator such as decoherence or phase damping, and the rates $\eta_i \geqslant 0$ mean that the noise is Markovian; the non-Markovian case is beyond the scope of our work. Equation (25) can then be rewritten in discretized form over the total evolution time T as equation (31), which can be used to simulate the quantum evolution in RL style. The observation for the kth interaction with the quantum sensor has entries indexed by $i, j \in [1, 2^n]$, where n is the number of qubits of the quantum sensor. The actions of the quantum agent are $a = \{\mu_j\}$.
Since we consider imitation learning, the reward from the quantum sensor is not required. However, to produce the expert trajectories with classical RL methods, we must design a reward function to train the agent. In this reward function, $F_{nc}$ denotes the QFI without control signals, and for the final time step T/∆t the reward signal is amplified by a factor C, i.e. r ← C × r. This reward function differs from those of the classical environments, in which the reward is sparse. The reward function of equation (33) uses the QFI as feedback to evaluate the quality of the quantum controls. Intuitively, maximizing the cumulative reward leads to a larger QFI, which meets our goal of parameter estimation in the quantum sensor. Moreover, we amplify the reward by a factor of 10 so that the gradient norm of the neural network is also amplified when updating the parameters, which benefits the training process. We remark that the quantum sensor is regarded as the quantum environment that interacts with the quantum agent in our work. There are other typical quantum environments, such as quantum gate control aiming to maximize the gate fidelity and quantum circuit optimization aiming to reduce the quantum resource overhead. These quantum environments are highly important, and our proposed algorithm can also be applied to these tasks to demonstrate the feasibility and potential advantage of QRL.
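One simple way to step a single-qubit Lindblad evolution in discrete time is a first-order Euler update, sketched below with plain Python lists. This is an illustrative approximation under stated assumptions: the paper's discretized equation is not reproduced here, a production simulation would more likely exponentiate the superoperator for each step, and all names (`lindblad_rhs`, `euler_step`, `simulate`) are hypothetical.

```python
def matmul(a, b):
    """Product of two 2x2 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def dagger(a):
    """Conjugate transpose of a 2x2 matrix."""
    return [[complex(a[j][i]).conjugate() for j in range(2)] for i in range(2)]


def combine(*terms):
    """Linear combination sum_k c_k * M_k of 2x2 matrices."""
    return [[sum(c * m[i][j] for c, m in terms) for j in range(2)]
            for i in range(2)]


def lindblad_rhs(rho, hamiltonian, noise):
    """-i[H, rho] + sum_i eta_i (A_i rho A_i^dag - {A_i^dag A_i, rho} / 2)."""
    out = combine((-1j, matmul(hamiltonian, rho)), (1j, matmul(rho, hamiltonian)))
    for eta, op in noise:
        op_dag = dagger(op)
        op_dag_op = matmul(op_dag, op)
        out = combine(
            (1.0, out),
            (eta, matmul(matmul(op, rho), op_dag)),
            (-0.5 * eta, matmul(op_dag_op, rho)),
            (-0.5 * eta, matmul(rho, op_dag_op)),
        )
    return out


def euler_step(rho, hamiltonian, noise, dt):
    """One first-order time step rho <- rho + dt * L[rho]."""
    return combine((1.0, rho), (dt, lindblad_rhs(rho, hamiltonian, noise)))


def simulate(rho0, hamiltonians, noise, dt):
    """Apply one Euler step per entry of `hamiltonians` (one control setting per step)."""
    rho = rho0
    for hamiltonian in hamiltonians:
        rho = euler_step(rho, hamiltonian, noise, dt)
    return rho


SIGMA_Z = [[1, 0], [0, -1]]
ZERO = [[0, 0], [0, 0]]

if __name__ == "__main__":
    plus_state = [[0.5, 0.5], [0.5, 0.5]]
    final = simulate(plus_state, [ZERO] * 10, [(0.1, SIGMA_Z)], dt=0.1)
    print(final[0][1])  # coherence decays under dephasing along z
```

In an RL loop, each call to `euler_step` would correspond to one agent interaction, with the Hamiltonian rebuilt from the control amplitudes chosen at that step.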

QGAIL for classic control
To begin with, we conduct numerical simulations to analyze the performance of QGAIL in classic discrete environments. To train the quantum agent with QGAIL, we first collect the expert trajectories: for Acrobot-v1 and CartPole-v1, a classical policy optimized with PPO interacts with the environments to collect the state-action pairs. PPO is not suitable for training on MountainCar-v0 because of its extreme reward sparsity. Therefore, we use a deep Q-network (DQN) to train the agent that collects the expert trajectories, as DQN saves successful trajectories into the replay buffer so that the agent can learn from the buffer over multiple epochs. For each environment, a total of 100 expert trajectories are collected, and each trajectory consists of state-action pairs up to the largest horizon of the environment. We use three Adam optimizers to update the parameters ν, w, λ, respectively, and another two Adam optimizers to update the parameters of the value and discriminator networks. It is important to remark that the encoding and weight parameters should have learning rates different from that of the variational parameters. In appendix D, we show that when the parameters in the PQC share the same large learning rate, the model breaks and does not converge. These results may relate to the small norm of the quantum gradients. In figure 3, we present the numerical simulation results of QGAIL for three representative classic control tasks. In figures 3(a)-(c), we present the best performance of QGAIL, where the number of layers is adjusted to perform well. QGAIL converges faster than other QRL methods [19,20], which is likely caused by the intrinsic advantage of the IRL algorithm. We remark that the classical GAIL algorithm can hardly work well in the MountainCar-v0 environment because of the highly sparse reward during interaction with the environment. The inefficiency is amplified in a generative adversarial regime.
Online policy gradient methods such as PPO and REINFORCE behave poorly in this scenario. However, QGAIL performs well in these environments, which empirically demonstrates its feasibility and efficiency.
We further present the results for different hyperparameter settings of QGAIL in figures 3(d)-(f). In figure 3(d), we find that quantum neural networks (QNNs) with different numbers of layers perform slightly differently in the CartPole-v1 environment. In particular, fixing or training the tuning parameters λ makes no distinct difference in performance. Besides, even a single-layer QNN, i.e. L = 1, quickly saturates the maximum reward. Compared to the classical neural network, the QNN exhibits superior performance in terms of the number of trainable parameters. In figure 3(e), the overall performance of QGAIL is less stable than for CartPole-v1. More layers perform better, indicating that more variational parameters give a more powerful model, and fixing the tuning parameters performs poorly. In figure 3(f), we find that L = 5 layers is not guaranteed to show the best performance; L = 2, 3 in general demonstrate more stable and better performance than other numbers of layers. Here, fixing the tuning parameters is beneficial for the behavior learning process.
In general, more layers, i.e. more variational parameters, do not necessarily yield better performance of QGAIL in classic discrete control environments. We argue that when the QNN is 'fat' (the number of qubits is large), the number of layers should be chosen within a moderate range such as L = 2, 3, i.e. a shallow quantum circuit. In contrast, when the QNN is 'thin' (the number of qubits is small), the number of layers can be chosen up to 5 or 6. This empirical observation also implies that the number of trainable parameters should be kept moderate: too many or too few variational parameters are likely to lead to poor performance in classic RL environments. This feature is distinct from classical neural networks and may be caused by the barren plateau (BP) phenomenon in optimizing Haar-random unitaries. The optimization of the variational parameters of hardware-efficient QNNs remains a challenging problem, and many efforts aim to tackle it. We also note that fixing the tuning parameters brings no significant improvement across the classic environments. For different RL environments, many simulations are required to choose the optimal hyperparameters.
During the numerical simulation, we find that the training process is not as stable as that of a classical NN. We speculate that this instability is not a problem of principle: the simulation software limits the maximum floating-point precision when calculating the gradient of the variational parameters (especially the entangling parameters). When the gradient explodes during training, the performance drops greatly, as figure 3(f) shows. In appendix D, we illustrate the gradient norm of the different parameter groups during the training process and find that the gradient anomaly is related to the drop in the averaged return. Therefore, we propose a gradient clipping strategy that mitigates the gradient anomaly and thereby avoids the return drop. We argue that the gradient clipping strategy increases the stability of the learning process. Besides, our QNN does not suffer from BP problems, since we keep the number of variational parameters in a moderate range. The performance drop associated with the gradient anomaly can be eliminated by using the gradient clipping strategy.
Classical GAIL can also obtain the optimal rewards, as figures 3(a)-(c) show. Since classical GAIL is a mature and extensively investigated algorithm, we do not expect our QGAIL to surpass it in terms of the final rewards. However, figure 3(b) shows that QGAIL converges faster in the MountainCar-v0 environment, an environment with highly sparse rewards. In addition, the number of parameters used to train the QNN is notably smaller than for the classical neural network.
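A gradient clipping step of the kind discussed above can be sketched as follows. The text does not specify whether clipping is applied by norm or by value, so this sketch assumes clipping by the global L2 norm of one parameter group; the function name and the threshold are hypothetical.

```python
import math


def clip_by_global_norm(gradients, max_norm):
    """Rescale a gradient vector so that its L2 norm does not exceed max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping,
    which is useful for logging gradient anomalies during training.
    """
    norm = math.sqrt(sum(g * g for g in gradients))
    if norm <= max_norm or norm == 0.0:
        return list(gradients), norm
    scale = max_norm / norm
    return [g * scale for g in gradients], norm


if __name__ == "__main__":
    clipped, raw_norm = clip_by_global_norm([30.0, -40.0], max_norm=1.0)
    print(clipped, raw_norm)
```

Rescaling preserves the direction of the update while bounding its size, so a single anomalous gradient cannot move the variational parameters far from a good policy.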

QGAIL for quantum control
In the quantum environment, we analyze the performance of QGAIL for parameter estimation in quantum sensors. We use the PPO and A3C algorithms to train the quantum agent. The expert trajectories are collected by training the classical PPO and A3C algorithms in the quantum sensor environment. The parameters of the quantum sensor environment determine the quantum evolution. The dephasing and spontaneous emission of the qubit are regarded as the quantum noise effects, which make the purity of the density matrix smaller than 1. We note that the quantum noise also leads to a linear shrinkage of the QFI. Quantum control in this work is continuous, and we assume the actions are Gaussian distributed.
When we consider qubit dephasing noise, the evolution can be described by a master equation, where the control operators $\sigma_i$, $i \in \{1, 2, 3\}$, denote the Pauli-X, Y, and Z operators. An external control field combining these operators is sufficient to implement an arbitrary single-qubit gate. The dephasing direction is given by $\mathbf{n} = (\sin\vartheta\cos\phi, \sin\vartheta\sin\phi, \cos\vartheta)$. The parameter to be estimated is $\omega_0$, and we take $\omega_0^{-1} = 1$ as our time unit. The optimal probe state given by standard metrology theory is the superposition state $(|0\rangle + |1\rangle)/\sqrt{2}$. The measurement is chosen as the projective measurement in the Pauli-X basis, with measurement operator $\Pi = |+\rangle\langle+|$; this optimal measurement extracts the largest QFI given the optimal quantum control signals. We set the dephasing direction to $\vartheta = \pi/4$, $\phi = 0$ to simulate the quantum noise in the evolution. We consider the total evolution time $T = 5$ and the time step $\Delta t = 0.1$, which gives 50 time steps per episode. The ideal QFI attainable during the quantum sensing process is expressed in terms of $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$, the largest and smallest eigenvalues of the operator. The largest QFI in our case is thus $T^2$, which is related to the quantum speed limit. However, owing to the dephasing noise, the largest QFI cannot be reached even when the optimal control is applied. For spontaneous emission noise, we use a second master equation, where $\sigma_\pm = (\sigma_1 \pm i\sigma_2)/2$ and the relaxation rates are taken as $\gamma_+ = 0.1$, $\gamma_- = 0$ throughout our discussion. Since our free Hamiltonian contains only the Pauli-Z term and the spontaneous emission noise only affects the Pauli-X and Y terms, we only consider Pauli-X and Y controls. The number of control terms is therefore 2, compared with 3 for dephasing noise. Similarly, the ideal QFI cannot be reached in the spontaneous emission case even with the optimal control.
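As a sanity check on the noiseless limit $T^2$, the following minimal sketch evaluates the pure-state QFI, $4(\langle\partial\psi|\partial\psi\rangle - |\langle\psi|\partial\psi\rangle|^2)$, for the probe $(|0\rangle + |1\rangle)/\sqrt{2}$. It assumes a free Hamiltonian $\omega_0\sigma_3/2$ with no control and no noise; this form is an assumption chosen to be consistent with the stated limit, and the function names are hypothetical.

```python
import numpy as np

def pure_state_qfi(psi, dpsi):
    """QFI of a pure state: 4 * (<dpsi|dpsi> - |<psi|dpsi>|^2)."""
    return 4.0 * (np.vdot(dpsi, dpsi).real - abs(np.vdot(psi, dpsi)) ** 2)

def noiseless_probe(omega0, T):
    """Probe state and its omega0-derivative after free evolution for time T.

    Assumption: H = omega0 * sigma_z / 2, no control field, no noise.
    """
    plus = np.array([1.0, 1.0], dtype=complex) / np.sqrt(2.0)
    phases = np.array([np.exp(-0.5j * omega0 * T), np.exp(0.5j * omega0 * T)])
    psi = phases * plus
    # d/d(omega0) exp(-i*omega0*T*sigma_z/2)|+> = (-i*T/2) * sigma_z |psi>
    dpsi = (-0.5j * T) * np.array([1.0, -1.0]) * psi
    return psi, dpsi

T = 5.0
psi, dpsi = noiseless_probe(omega0=1.0, T=T)
qfi = pure_state_qfi(psi, dpsi)   # equals T**2 in this noiseless setting
n_steps = round(T / 0.1)          # number of control steps per episode
```

Under these assumptions the QFI does not depend on the value of `omega0`, only on the evolution time.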
In both cases, although optimal quantum control cannot recover the largest QFI, it enhances the QFI compared to the case without control. For a fair comparison, the optimal noisy control generated by the GRAPE algorithm serves as the benchmark. Figure 4 shows that QGAIL can produce the optimal control policy that maximizes the final QFI of the quantum parameter estimation. The convergence is notably fast, taking roughly 100 iterations. The red line in figures 4(a)-(c) denotes the benchmark result of the GRAPE algorithm, and the dotted green lines represent the baseline in which no control signals are fed into the quantum sensor. Under spontaneous emission noise, the Pauli-Z control is not necessary, so we only show two control amplitudes in figure 4(d). Overall, quantum control signals enhance the precision of quantum parameter estimation compared to the no-control case. QGAIL is also competitive with classical GAIL in terms of convergence speed (slightly faster, as figures 4(a)-(c) show). QGAIL has a single variational layer, and its number of parameters is significantly smaller than that of the classical algorithm. We also fix the tuning parameters to reduce the number of trainable parameters; the fixed QC architecture does not affect the performance of the QGAIL algorithm. Besides, the learning process is stable and shows no return drop, since we apply gradient clipping during training to avoid the gradient anomaly. In appendix D, we present additional results for different L, architectures, and longer evolution times. Moreover, expert trajectories enhance the learning progress of the quantum policy network through adversarial training. The simulation results imply that a hardware-efficient QNN is powerful enough to show practical capability in quantum control.
Compared to QGAIL in classic controls, the performance is more surprising, since a single quantum layer achieves optimality. By integrating generative adversarial training into RL, we can train the quantum agent without reward signals from interaction with the quantum sensor environment. Designing the reward of an environment is a complicated task and is hard to engineer in practice, whereas expert trajectories are sometimes easy to obtain, for example in robotics [46]. The QGAIL algorithm can thus relax the dependence on reward design. Moreover, it is related to the quantum generative adversarial network, which is widely studied to showcase the potential quantum advantage of variational quantum circuits [47][48][49]. A QGAIL with amplitude encoding can use logarithmically fewer parameters than classical models. The number of parameters cannot be large if the QNN is to be executed on current devices. Our proposed QGAIL algorithm does not require deep quantum circuits: a single layer is sufficient to saturate the optimal noisy QFI. We summarize the parameter complexity of the QNN for classical and quantum environments in table 1.
The number of training parameters mainly determines memory consumption, as figure 5 shows. In GAIL, the parameter complexity of a neural network is commonly viewed as scaling as $O(n^2)$ [27] for classic controls. Since we can treat the quantum states as classical information, the complexity is still $O(n^2)$. It is not known whether quantum samples with classical processing can show a provable quantum advantage. In figure 5, we simulate GAIL with one hidden layer and 32, 64, 64, 128 neurons for the four observation dimensions. Current deep RL methods generally choose more hidden layers, but we find that a single hidden layer is adequate for our problems. In our simulation, the number of parameters exceeds $O(n^2)$, since the latter describes the parameter scaling when $n$ is large. For QGAIL with classical data, the complexity scales as $O(n)$, since $L$ is generally viewed as a constant [16], as figure 5 shows: the number of parameters increases linearly with the observation dimension. For the quantum environment, the complexity scales as $O(n)$, as our simulation results demonstrate, or $O(\log n)$ [50]. Moreover, the scaling may be lower than that of QGAIL for classic controls. QGAIL with amplitude encoding has a logarithmic scaling, but there is no known efficient algorithm for encoding arbitrary classical data into quantum memory in superposition [51]. The parameter complexity advantage in VQC-based supervised learning has also been observed empirically in [52][53][54][55]. These quantum algorithms demonstrate the parameter complexity advantage of VQCs in QML for classical tasks. For quantum sensor control, which is based on quantum input, QGAIL shows a lower parameter requirement than for classical tasks.
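To make the classical side of this comparison concrete, the sketch below counts the weights and biases of a fully connected network with one hidden layer. The hidden widths 32, 64, 64, 128 are those quoted above; their pairing with observation dimensions 2, 4, 6, 8 and the action counts are assumptions made for illustration, and the resulting numbers are not the values plotted in figure 5.

```python
def mlp_param_count(n_obs, n_hidden, n_actions):
    """Weights plus biases of a fully connected net with one hidden layer."""
    hidden = n_obs * n_hidden + n_hidden        # input -> hidden
    output = n_hidden * n_actions + n_actions   # hidden -> output
    return hidden + output

# (observation dimension, hidden width, number of actions); the pairing
# and the action counts are illustrative assumptions.
settings = [(2, 32, 3), (4, 64, 2), (6, 64, 3), (8, 128, 3)]
counts = [mlp_param_count(n, h, a) for n, h, a in settings]
```

Because the hidden width is much larger than the observation dimension in these small problems, the count is dominated by the hidden width rather than by $n^2$.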

Conclusions
In summary, we propose a generative adversarial quantum imitation learning algorithm based on shallow VQCs. We find a critical association between the gradient anomaly and the performance drop, which may exist widely in variational QML models. We argue that gradient clipping is an efficient method to eliminate the gradient anomaly and stabilize the learning process in QML. Our algorithm architecture is flexible and can produce many different IRL algorithms by choosing classical or quantum versions of the value, discriminator, and policy networks. Based on QGAIL, we test three representative classic control tasks in classical environments. It turns out that the total number of training parameters of the QNN should be restricted to a moderate range to obtain better performance. We also apply QGAIL to quantum sensing, where the learning model generates the optimal controls that steer the state evolution so that the final quantum state can be measured to obtain the maximum QFI. QGAIL is robust against quantum noise such as dephasing and spontaneous emission. More surprisingly, QGAIL requires only a single variational layer to produce the optimal quantum control signals so that the parameter estimation can saturate the QCRB. QGAIL achieves a polynomial reduction in parameter complexity compared with its classical RL counterparts. We further reason that QML may be better suited for quantum problems. Quantum sensing is highly likely to be the most promising application of quantum RL algorithms, since the resources it requires are not large. Therefore, our proposed algorithm is feasible for current NISQ devices. We highlight the importance of generative QML models, which are widely investigated to exploit the advantage in learning discrete distributions.
IQRL may be an interesting area, since it can exploit expert information to train the agent; it can be used to enhance the sample efficiency of RL and in safety-critical situations such as autonomous driving. In future work, we will study QGAIL in multi-parameter quantum sensing and more complex automatic control tasks.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).

A.1. Classical DQN
Here, we briefly present the classical DQN algorithm. DQN is an off-policy RL algorithm that uses an experience replay memory to store historical trajectories. The goal of RL is to maximize the cumulative reward (also referred to as the return) at time step $t$, written as $R_t = \sum_{t'=t}^{T} \gamma^{t'-t} r_{t'}$. The optimal action-value function is defined by maximizing the expected return of equation (14) by finding an optimal policy $\pi^\star$. The optimal Q function obeys the Bellman equation, which is based on the following intuition: if the optimal value $Q^\star(s', a')$ of the sequence $s'$ at the next time step were known for all possible actions $a'$, then the optimal strategy is to choose the action $a'$ maximizing the expected value of $r + \gamma Q^\star(s', a')$, with
$$Q^\star(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q^\star(s', a') \,\middle|\, s, a \right].$$
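The return defined above can be computed for every time step of a finite episode with a single backward pass, using the recursion $R_t = r_t + \gamma R_{t+1}$. The following is a minimal sketch; the function name and the example rewards are illustrative.

```python
def discounted_returns(rewards, gamma):
    """Return R_t = sum_{t' >= t} gamma**(t' - t) * r_{t'} for every t."""
    returns = [0.0] * len(rewards)
    running = 0.0
    # Walk backwards so that each step reuses the return of the next step.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

example = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
```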
RL estimates the Q function by using the Bellman equation as an iterative update rule, given by
$$Q_{i+1}(s, a) = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q_i(s', a') \,\middle|\, s, a \right].$$
Crucially, such value iteration algorithms converge to $Q^\star$ theoretically [56] as $i \to \infty$. However, in practice we use a function approximator such as a neural network to estimate the Q function, i.e. $Q(s, a; \theta) \approx Q^\star(s, a)$. The neural network is referred to as the Q network. A Q network can be trained by minimizing a sequence of loss functions $L^Q_i(\theta_i)$ that changes at each iteration $i$,
$$L^Q_i(\theta_i) = \mathbb{E}_{s, a \sim \rho(\cdot)}\left[ \left( y_i - Q(s, a; \theta_i) \right)^2 \right],$$
where $y_i = \mathbb{E}_{s'}\left[ r + \gamma \max_{a'} Q(s', a'; \theta_{i-1}) \,\middle|\, s, a \right]$ is the target calculated from the trajectories at iteration $i$ and $\rho(s, a)$ denotes the behavior distribution. We note that $\theta_{i-1}$ is held fixed when optimizing the loss function at iteration $i$. We can use stochastic gradient descent to optimize the Q network by calculating the gradient of $L^Q_i(\theta_i)$ with respect to $\theta_i$. The DQN algorithm is model-free and solves RL tasks directly using samples obtained by interacting with the environment. It is also off-policy: it learns about the greedy strategy that maximizes the Q function. The DQN algorithm is well suited for sparse-reward environments such as MountainCar. We remark that a quantum DQN algorithm has also been proposed, in which the Q network is realized with a quantum neural network, i.e. a PQC. Our work can also be used to exploit the quantum DQN algorithm.
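A minimal sketch of the squared temporal-difference loss on a mini-batch is given next. It assumes that the Q values of the current network and of the frozen previous-iteration network are supplied as arrays; the `dones` mask, which removes the bootstrap term at terminal states, is a common practical addition and an assumption here, as are the function name and the example numbers.

```python
import numpy as np

def dqn_loss(q_online, q_frozen_next, actions, rewards, dones, gamma):
    """Mean squared TD error for a batch of transitions.

    q_online:      Q(s, .; theta_i), shape (batch, n_actions)
    q_frozen_next: Q(s', .; theta_{i-1}), shape (batch, n_actions)
    dones:         1.0 where s' is terminal, else 0.0 (assumed convention)
    """
    q_online = np.asarray(q_online, dtype=float)
    q_frozen_next = np.asarray(q_frozen_next, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    dones = np.asarray(dones, dtype=float)
    # Target y = r + gamma * max_a' Q(s', a'; theta_{i-1}), frozen parameters.
    targets = rewards + gamma * (1.0 - dones) * q_frozen_next.max(axis=1)
    # Q value of the action actually taken in each transition.
    chosen = q_online[np.arange(len(actions)), np.asarray(actions)]
    return float(np.mean((targets - chosen) ** 2))

loss = dqn_loss(
    q_online=[[1.0, 2.0], [0.5, 0.0]],
    q_frozen_next=[[0.0, 1.0], [2.0, 3.0]],
    actions=[1, 0],
    rewards=[1.0, 0.0],
    dones=[0.0, 1.0],
    gamma=0.9,
)
```

In the example, the first transition has target 1.9 and the second, being terminal, has target 0.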

A.2. Classical A3C
Asynchronous advantage actor-critic (A3C) is a representative policy gradient method with a special focus on parallel training. In the task of quantum sensor control, A3C is leveraged to generate optimal control sequences and showcase the advantage of RL over the conventional GRAPE algorithm [57]. In A3C, the critics learn the value function while multiple actors are trained in parallel and kept in sync with the global parameters. Hence, A3C works well for parallel training, since multiple actors can explore more possible actions. In quantum optimal control, the reward function is QFI-related and is therefore not as direct a measure as the rewards in classic environments, so the additional exploration provided by A3C helps the agent search for optimal control signals. In addition, A3C has the same theoretical algorithmic framework as the actor-critic in GAIL. We present a schematic of A3C in quantum sensor environments in figure A1.

A.3. Classical GAIL
Classical GAIL is a representative algorithm in imitation learning, where the agent learns a policy without interacting with the expert or accessing a reinforcement signal. There are two main approaches to imitation learning: behavioral cloning (BC) [58], which learns a policy as a supervised learning problem over state-action pairs from expert trajectories; and IRL [59], which learns a cost function under which the expert is uniquely optimal. BC suffers from compounding error caused by covariate shift and requires large amounts of data to perform well. IRL, on the other hand, learns a cost function that prioritizes entire trajectories over others, so compounding error is not an issue. However, many IRL algorithms are extremely expensive to run, requiring RL in an inner loop [22]. GAIL integrates generative adversarial learning with IRL to overcome the extensive overhead of IRL by directly learning the policy without learning a cost function.
The basic IRL procedure aims to find a cost function under which the expert performs better than all other policies. With a cost regularizer $\psi$, it is defined as
$$\mathrm{IRL}_\psi(\pi_E) = \arg\max_{c \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}} \; -\psi(c) + \left( \min_{\pi} -H(\pi) + \mathbb{E}_{\pi}[c(s, a)] \right) - \mathbb{E}_{\pi_E}[c(s, a)], \qquad \text{(A.5)}$$
where $\mathbb{E}_{\pi}[c(s, a)] = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t c(s_t, a_t)\right]$ denotes the expectation with respect to the trajectory that $\pi$ generates, $c$ denotes the cost function, $s_0 \sim p_0$, $a_t \sim \pi(\cdot|s_t)$, and $s_{t+1} \sim p(\cdot|s_t, a_t)$ for $t \ge 0$. The expert policy is denoted as $\pi_E$, and $H(\pi)$ is the $\gamma$-discounted causal entropy of the policy $\pi$, given by
$$H(\pi) = \mathbb{E}_{\pi}[-\log \pi(a|s)]. \qquad \text{(A.6)}$$
Maximum causal entropy IRL seeks a cost function $c \in \mathcal{C}$ that assigns low cost to the expert policy and high cost to other policies. Therefore, the expert policy can be found via one RL procedure given by
$$\mathrm{RL}(c) = \arg\min_{\pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s, a)],$$
which maps a cost function to high causal entropy policies that minimize the expected cumulative cost. Designing an appropriate $\psi$ leads to different learning regimes. For example, if $\psi$ is a constant function, equation (A.5) becomes conventional IRL, which can find the optimal cost function but has a large overhead. If instead $\psi$ is chosen to be $\delta_{\mathcal{C}}$, where $\delta_{\mathcal{C}}(c) = 0$ if $c \in \mathcal{C}$ and $+\infty$ otherwise, equation (A.5) becomes apprenticeship learning, which may fail to find a cost function that recovers the expert behavior but is more efficient than classic IRL. GAIL designs a new cost regularizer to achieve a tradeoff, given by
$$\psi_{\mathrm{GA}}(c) = \begin{cases} \mathbb{E}_{\pi_E}[g(c(s, a))] & \text{if } c < 0, \\ +\infty & \text{otherwise,} \end{cases} \qquad g(x) = \begin{cases} -x - \log(1 - e^{x}) & \text{if } x < 0, \\ +\infty & \text{otherwise.} \end{cases}$$
The regularizer places a small penalty on cost functions $c$ that assign an amount of negative cost to expert state-action pairs. However, if $c$ approaches zero and thus assigns large costs to the expert, then $\psi_{\mathrm{GA}}$ heavily penalizes $c$. Note that $\psi_{\mathrm{GA}}$ builds the connection between generative adversarial learning and imitation learning. To solve equation (A.5), one can instead optimize the min-max game given by equation (9). The discriminator D aims to distinguish between the distribution of state-action pairs generated by the policy generator (G) and the expert state-action pairs.
When D fails to distinguish the state-action pairs generated by G from the expert state-action pairs, G has successfully learned the distribution of the expert state-action pairs. The generator has then successfully imitated the behavior in the expert trajectories. In a practical setting, we generally use neural networks as function approximators to represent G and D. The minimization phase over the policy generator can make use of the PPO or TRPO algorithms, which prevent the policy from changing too much due to noise in the policy gradient. For the detailed training process, we refer to the training tricks of GANs. We present the pseudo-code of GAIL in Algorithm 2.
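The adversarial objective can be estimated from samples as in the following minimal sketch. It assumes the convention in which the discriminator output $D(s, a) \in (0, 1)$ is pushed towards 1 on policy samples and towards 0 on expert samples, so that $\log D(s, a)$ acts as the cost seen by the policy; the entropy term is omitted, and the function names and sample values are illustrative.

```python
import numpy as np

def gail_objective(d_policy, d_expert):
    """Sample estimate of E_pi[log D] + E_piE[log(1 - D)] (entropy omitted).

    d_policy: discriminator outputs on policy state-action pairs
    d_expert: discriminator outputs on expert state-action pairs
    The discriminator ascends this value; the policy descends it.
    """
    d_policy = np.asarray(d_policy, dtype=float)
    d_expert = np.asarray(d_expert, dtype=float)
    return float(np.mean(np.log(d_policy)) + np.mean(np.log(1.0 - d_expert)))

def policy_cost(d_policy):
    """Per-sample cost log D(s, a) handed to the PPO or TRPO policy step."""
    return np.log(np.asarray(d_policy, dtype=float))

# A discriminator that cannot tell policy from expert outputs 0.5 everywhere.
confused = gail_objective([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates them well gives a larger objective value.
separated = gail_objective([0.9, 0.8], [0.1, 0.2])
```

At the point where the discriminator is fully confused, the objective takes the value $-2\log 2$, which is the equilibrium value of the corresponding GAN game.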

4: Take a policy step from $\theta_i$ to $\theta_{i+1}$, using the TRPO or PPO algorithm with cost function $\log(D_{w_{i+1}}(s, a))$. Specifically, take a natural gradient step with

Appendix B. Classical environments description
The classic control tasks in this work are three typical games: CartPole-v1, Acrobot-v1, and MountainCar-v0. To begin with, CartPole-v1 aims to find a good discrete policy that produces controls (moving the cart) to keep the pole on the cart balanced. The control, or action, is moving left or moving right. The observation consists of the position of the cart on the track, the angle of the pole with the vertical, the cart velocity, and the rate of change of the angle. When the agent moves right or left and the pole is still stable (the angle is smaller than the threshold), a reward of +1 is given. The agent should interact with the environment and find a policy that maximizes the reward. Acrobot-v1 aims to swing up a two-link robot. The system consists of two links connected linearly to form a chain, with one end of the chain fixed. The joint between the two links is actuated. The goal is to apply torques on the actuated joint to swing the free end of the chain above a given height, starting from the initial state of hanging downwards. All steps that do not reach the goal incur a reward of -1. Achieving the target height results in termination with a reward of 0.
MountainCar-v0 is a deterministic MDP that consists of a car placed stochastically at the bottom of a sinusoidal valley. The possible actions are the accelerations that can be applied to the car in either direction. The goal of the MDP is to strategically accelerate the car to reach the goal state on top of the right hill. We summarize the observation space, action space, horizon, and optimal reward of the environments in table B1. We note that these classic environments are well described by MDPs, so RL can solve them with good performance. In addition, these classic environments obey classical dynamics and the controls are classical.
The environment of quantum sensor control is also summarized in table B1. We only consider single-qubit evolution, where the number of density matrix elements is 8 and the number of Pauli control terms is p = 3. The reward function can be calculated by equation (33). The maximum horizon is determined by the total evolution time T and the time step ∆t. The goal is to achieve the maximum QFI and thereby obtain the most accurate parameter estimation.

Appendix C. Expert trajectory generation of the toy games and quantum parameter estimation
Expert trajectories are viewed as the reinforcement signals used to train the QC critic-actor networks. In principle, the number of trajectories can affect the final performance of QGAIL. Intuitively, more expert trajectories yield better average returns at the cost of more training episodes. In our work, we generate 100 trajectories for each of the three classic environments. Each trajectory consists of state-action pairs up to the maximum horizon. Therefore, there are 50 000 state-action pairs for CartPole-v1, 20 000 state-action pairs for MountainCar-v0, and 50 000 state-action pairs for Acrobot-v1. The generation of expert trajectories is based on different deep RL methods. Specifically, the expert trajectories of MountainCar-v0 are collected by training the DQN algorithm, and the expert trajectories of CartPole-v1 and Acrobot-v1 are collected by training the PPO algorithm. The training parameters, such as the learning rate and the number of neurons and layers, can be found in appendix E.
For quantum environments, i.e. quantum sensor control, we adopt the A3C+PPO algorithm to produce the optimal quantum control sequences. The maximum horizon is T/∆t. Thus, the number of expert state-action pairs is 100T/∆t. In practice, we choose two typical evolution times T = 5, 10 with ∆t = 0.1 to simulate the quantum sensor evolution. We also consider two typical quantum noise processes, qubit dephasing and spontaneous emission, to study the noisy evolution. The hyperparameters of the A3C+PPO algorithm can also be found in appendix E.
Figure D1. The performance of a single Adam optimizer in the Acrobot-v1 environment. The blue curve denotes the average return with a fixed learning rate. The yellow curve denotes a large initial learning rate with an exponential learning rate decay.

Appendix D. Additional results
We present additional simulation results of QGAIL in controlling classic and quantum environments. In the main text, we have briefly discussed that a single optimizer for different parameter groups performs worse. The average return is displayed in figure D1. In both situations, a single optimizer shared by the different parameter groups does not perform well, even with an exponential learning rate decay.
Consequently, we conjecture that the entanglement and variational parameters $\theta_{\mathrm{var}}$, $\theta_{\mathrm{ent}}$ follow an update rule distinct from that of the input and output parameters. The gradient over 'quantum' parameters generally has a smaller norm. As a rule of thumb, the 'quantum' parameters (those showing quantum features) may be given a smaller learning rate, such as 0.01 or 0.001, whereas the 'classical' parameters, such as the input and output parameters, can have a larger learning rate. A recent study shows that QNNs perform better with a small learning rate [60].
Here, we present the gradient norm of the QNN during training for the different groups of variational parameters. The gradient norms of the parameter set $\theta = (\theta_{\mathrm{var}}, \theta_{\mathrm{ent}}, \lambda, w)$ are recorded separately. Each parameter group has a distinct optimizer and initial learning rate. We do not conduct simulations with learning rate decay, but we do not exclude the possibility that it performs well. In general variational quantum circuits, a deep, randomly initialized QNN suffers from the BP problem, i.e. the gradient vanishes exponentially with the number of qubits. In such a situation, the QNN cannot be trained well. In figure D2, the gradient norm $\|\theta\|_2$ is recorded for the four groups separately. In figures D2(a) and (b), the Acrobot-v1 environment is simulated and the averaged return of QGAIL shows a drop, as marked by the blue box. Interestingly, the gradient norms of the entanglement and tuning parameters also take very large values at the same training iteration. We call this sudden large gradient norm over the entanglement parameters the gradient anomaly. When the gradient anomaly occurs, the QNN undergoes a large parameter update, which leads to model collapse. Note that the gradient anomaly does not always occur, owing to the randomness of the simulation environments. To avoid model collapse, we can use gradient clipping to restrict the gradient norm to a bounded range, for example by setting the maximum gradient norm of a specific variational parameter group to $\|\theta\|_2 = 3$. We also observe that the gradient norm of the output parameters $w$ is very small in both figures D2(a) and (c) during the whole training process, so the output parameters receive no large updates. In figures D2(c) and (d), there is no gradient anomaly, and thus the averaged return shows no drop. In figure D2, the gradient norm of each parameter group is well distributed in a bounded range.
The gradient does not vanish exponentially, since we keep the number of parameters in a moderate range. Therefore, the QNN can be trained well to generate classic control signals with a small number of episodes. Note that each time the averaged return increases to a larger value, the gradient norm over $\theta_{\mathrm{var}}$ also increases, leading to a subsequent return drop. When we apply gradient clipping to restrict the gradient to a specified range, the gradient anomaly does not occur and the averaged return is stable during the whole training process, as figure D3 shows.
In the quantum sensor environment, we consider two architectures, named classical-quantum-classical (CQC) and QC. The CQC architecture can be applied when the observation space is large, in which case the quantum policy network cannot be simulated classically and efficiently. The first classical network can be viewed as performing feature extraction or dimension reduction. The extracted features are subsequently processed by a quantum network. The other architecture is QC, where the observations are directly embedded into the multi-layer quantum network. The last classical network in both architectures produces the continuous quantum control signals. This hybrid QC architecture is flexible and quantum hardware-efficient, and may be commonly used in heterogeneous computing with CPUs and QPUs.
Figure D4 illustrates the numerical performance of the CQC architecture with different numbers of variational layers L. Both networks obtain the optimal QFI, but more layers (or parameters) show no advantage in terms of convergence or optimality. On the contrary, the performance with L = 10 layers is worse than with L = 4 layers, demonstrating that 4 layers are powerful enough to generate optimal quantum controls. Fixing the tuning parameters reduces the total number of training parameters but does not affect the convergence and optimality, as figures D4(a) and (b) show.
Besides, we simulate QGAIL for quantum parameter estimation with the QC architecture under different L to fully demonstrate the power of the quantum networks, as figure D5 shows. Figure D5 consists of 12 subplots that show the performance of QGAIL under different settings. Surprisingly, all quantum agents with quantum policy networks can approach or surpass the optimal noisy QFI generated by GRAPE. In figure D5(a), a single layer of the QNN already achieves optimality in producing the quantum control signals. Training the tuning parameters has no clear advantage compared to fixing λ, as in figure 4(a). Further, we simulate more layers, L = 4, 6, 10, to study the performance. We find that increasing the number of layers neither accelerates the training process nor yields a larger QFI, which demonstrates that a single variational layer is powerful enough to generate the optimal control signals. We also find that for L = 6, 10, as in figures D5(d)-(g), the convergence speed is even worse than with fewer layers, implying that more variational parameters need more episodes (data samples) to train. For T = 10, the QFI generated by the quantum policy surpasses the optimal noisy QFI even with a single variational layer. Besides, more layers do not show the BP problem, which suggests that generative adversarial training in RL is likely to avoid the BP problem. This may be related to the cost function: in supervised learning, a global cost function readily leads to the BP problem as the number of layers grows. In the IRL scenario, however, the supervised label is replaced with the reward signal, which is a weaker label. Combined with the PPO algorithm, which restricts the policy improvement to a specified region, the gradient decreases stably and the optimization is not easily trapped in local minima.
When we simulate the quantum agent controlling quantum sensors under spontaneous emission, the final QFI also approaches the benchmark for T = 5. The control-enhanced QFI is slightly larger than in the case of no control. However, when we simulate longer evolution times, the gap between the control-enhanced QFI and the QFI without control is amplified. In this case, the quantum agent can produce the optimal control signals so that the final QFI saturates the limit. Through extensive simulations, we find that the QNN may be better suited for controls in a quantum environment than for the classic controls in the main text. In a quantum sensing application in particular, the information about the density matrix is fed into the QNN to learn a policy that maximizes the QFI. We conjecture that our data samples, being quantum states, may be better suited for a QNN to extract feature information. We remark that although the exact form of the density matrix is not used as the input of the QNN (amplitude encoded as the initial state), the amplitude and phase information are still obtained and encoded into the QNN. As demonstrated in [50,61], the quantum samples need not be transformed into quantum states; supplying only the full information of the quantum state is still feasible and beneficial.

Appendix E. Hyperparameters specifications
The hyperparameters used in this work, including those of the quantum and classical neural networks, are presented in table D1. Since GAIL requires expert trajectories produced by classical RL methods, as discussed in the previous section, we also present the hyperparameters of the classical RL methods. For classical environments, we conduct numerical simulations of the DQN and PPO methods to generate expert trajectories. For quantum environments, we use A3C+PPO for quantum sensor control. During the training process, we find that the choice of hyperparameters is not unique but is crucial for the final performance. For example, the mini-batch size and the number of epochs in one iteration should be tuned jointly to control the convergence speed and final performance. In general, we choose a larger batch size for discrete control than for continuous control. The mini-batch size for the QNN should not be too large, since current quantum machine learning simulation packages do not support large-batch computation efficiently. On practical NISQ devices, large-batch computation can be done in two ways: (1) run the same quantum circuit simultaneously on many NISQ devices, or (2) run the quantum circuit sequentially, once for each sample in the batch. Both ways are resource-consuming. We propose training QGAIL on classical computers and then transferring the parameters to the quantum device, with fine-tuning to adapt to the practical quantum imperfections. This transfer learning proposal can reduce the quantum resource consumption and highlights the advantage of fast inference on NISQ devices. In classic and quantum environments, the observables used to estimate the quantum policy are different. Numerical simulation shows that the choice of observables has no vital impact if there are post-processing classical layers.
In the Acrobot-v1 environment, the observable is $O_a = [w_i (Z_1 \cdots Z_6), \cdots]_{1 \le i \le 3}$, where the $w_i$ are training parameters. In the MountainCar-v0 environment, the observable is $O_a = [w_i (Z_1 Z_2), \cdots]_{1 \le i \le 3}$. In the CartPole-v1 environment, the observable is $O_a = [w_i (Z_1 Z_2 Z_3 Z_4), \cdots]_{1 \le i \le 2}$. In continuous quantum control, the observable is $O_a = [w_i \cdot (Z_1, \cdots, Z_8)^T]_{1 \le i \le 3}$, where the final trainable linear layer is $W \in \mathbb{R}^{8 \times 3}$. The final classical layer is smaller than in the purely classical models. In the continuous control situation, we also need the variance of the policy. Therefore, we also design the same number of training parameters

Appendix F. Software specifications
Quantum simulation on classical computers is not efficient, but in some RL applications the observation space is small enough that the qubit encoding technique remains feasible on classical computers. In this study, we make use of TensorFlow Quantum [62] to simulate the QNN in discrete classic controls. The classical neural network is based on the Keras implementation. The classical and quantum networks form a hybrid QC quantum machine learning model, in which the data pipeline automatically calculates the classical and quantum gradients with backpropagation. In continuous quantum control, we make use of Tencent tensorcircuit [63] to build the hybrid model, where the backend of the quantum circuit is based not on a state vector or density matrix but on a tensor network. Therefore, tensorcircuit is more efficient than TensorFlow Quantum in terms of simulation complexity. Moreover, tensorcircuit supports