Taking Full Advantage of Demonstration Data in Deep Reinforcement Learning

Deep reinforcement learning (DRL) algorithms have achieved great breakthroughs in many tasks, yet they still suffer from problems such as random exploration and sparse rewards. Recently, reinforcement learning from demonstration (RLfD) methods have shown promise in overcoming these problems. However, they usually require a considerable number of demonstrations, and the demonstration data may not be optimal, which means exploitation is left on the table. To address this, we propose a novel algorithm in which a behavior cloning (BC) network first learns the demonstration data and is then used to guide the reinforcement learning agent. Whenever the agent explores and finds a better trajectory, the BC network learns from that trajectory to improve itself, so the BC network and the agent become better at the same time.


Introduction
Deep reinforcement learning (RL) has made it possible to build end-to-end learning agents without human-labelled data, and in recent years it has performed well in various challenging domains such as MOBA games. These feats make deep RL a strong candidate for complex real-world sequential decision-making applications. However, while deep RL algorithms enable the use of large models, using large datasets for real-world RL is conceptually challenging: most RL algorithms collect new data online every time a new policy is learned, which limits the size and diversity of the datasets available for RL.
An off-policy RL agent needs to explore the environment to gather training samples, which makes the learning process potentially costly, or even unable to converge, in many real-world applications. For this reason, much of the recent work in the area aims at reducing the number of samples required during learning.
Deep Q-learning from Demonstrations (DQfD) [1] leverages expert demonstrations to pre-train a deep Q-learning agent (the benefit of demonstrations is also discussed in [9] and [10]), and then uses the pre-trained model to interact with the environment and collect self-generated data. This paradigm has had success in many challenging robotics applications. However, it still faces some problems: the demonstration data is hard to collect and may not be optimal [6], and the demonstration data is not reused in the second phase [7].
To solve these problems, in this paper we propose a novel paradigm that reuses the demonstration data and takes advantage of the agent's exploration at the same time. We use a behavior cloning (BC) network, which can be viewed as a map from states to actions, to record the demonstration data, and use this network to give the agent advice. We first evaluate the demonstration data and the behavior cloning network; when the agent learns a better policy that surpasses the behavior cloning network, the behavior cloning network learns from that specific trajectory. The RL agent and the BC network thus learn from each other and make progress at the same time, which makes the final result outperform the baselines.

Markov Decision Process
An MDP [3] is defined by a tuple ⟨S, A, R, T, γ⟩, which consists of a set of states S, a set of actions A, a reward function R(s, a), a transition function T(s, a, s') = P(s' | s, a), and a discount factor γ. In each state s ∈ S, the agent takes an action a ∈ A. After taking this action, the agent receives a reward r from R(s, a) and reaches a new state s', determined by the probability distribution P(s' | s, a). A policy π specifies, for each state, which action the agent will take. The goal of the agent is to find the policy π mapping states to actions that maximizes the expected discounted total reward. The value Q^π(s, a) of a given state-action pair (s, a) is an estimate of the expected future reward that can be obtained from (s, a) when following policy π. The optimal value function Q*(s, a) provides maximal values in all states and is determined by solving the Bellman equation:

Q*(s, a) = E_{s' ∼ P(· | s, a)} [ R(s, a) + γ max_{a'} Q*(s', a') ].

The optimal policy π* is:

π*(s) = argmax_{a ∈ A} Q*(s, a).
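The Bellman equation above can be solved by repeatedly applying it as an update rule (value iteration). The following sketch does this on a tiny hypothetical deterministic MDP (the two-state transition and reward tables are illustrative, not from the paper):

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP: T[s, a] is the (deterministic)
# next state and R[s, a] the reward for taking action a in state s.
T = np.array([[0, 1], [1, 0]])          # deterministic transitions
R = np.array([[0.0, 1.0], [0.0, 0.0]])  # reward table
gamma = 0.9                              # discount factor

# Value iteration: Q*(s, a) = R(s, a) + gamma * max_a' Q*(s', a')
Q = np.zeros((2, 2))
for _ in range(1000):
    Q = R + gamma * Q[T].max(axis=-1)    # Q[T][s, a] = Q-row of next state

# The optimal policy is greedy with respect to Q*.
pi = Q.argmax(axis=1)
```

Because the Bellman operator is a γ-contraction, the loop converges to the unique fixed point Q*; here the optimal policy always takes the rewarding action.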

Deep Q Learning
DQN [2] uses a deep neural network, or Q-network, to approximate Q(s, a). In addition, DQN uses a policy network to generate actions in any state, and a target network Q′, a copy of the Q-function (the previous policy network) that is held constant for some fixed number of timesteps to serve as a stable target for learning. The policy network is trained using the loss

L(θ) = E_{(s, a, r, s')} [ ( r + γ max_{a'} Q′(s', a') − Q(s, a; θ) )² ].

Every fixed number of timesteps, the target network is synchronized with the policy network. To reuse self-generated data, DQN introduces a replay memory buffer; during the learning phase, the agent uses samples from this memory. This also prevents action values from oscillating or diverging, and can be viewed as an essential mechanism in DQN.
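The TD target and squared-error loss above can be sketched numerically. Here both networks are stand-in lookup tables over a toy 4-state, 2-action space (all values are illustrative; in practice both would be neural networks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "networks" as tables over 4 states x 2 actions.
q_net = rng.normal(size=(4, 2))
target_net = q_net.copy()   # periodically synchronized copy
gamma = 0.99

def td_loss(batch):
    """Mean squared TD error: (r + gamma * max_a' Q'(s', a') - Q(s, a))^2."""
    losses = []
    for s, a, r, s_next, done in batch:
        # Terminal transitions bootstrap nothing: target is just r.
        target = r if done else r + gamma * target_net[s_next].max()
        losses.append((target - q_net[s, a]) ** 2)
    return np.mean(losses)

# A tiny sampled batch of (s, a, r, s', done) transitions.
batch = [(0, 1, 1.0, 2, False), (2, 0, 0.0, 3, True)]
loss = td_loss(batch)
```

In a real implementation the gradient of this loss flows only through Q(s, a; θ), while the target term is held fixed, which is exactly why the frozen copy Q′ stabilizes learning.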

Proposed Approach
Consider a person beginning to learn something new, such as a new game, a new sport, or a new skill; usually there is a guide or teacher. At the very beginning, the learner watches how the experts do it, then tries it himself, and continues to observe the experts until he surpasses them. Once the experts notice that someone else is better, they learn from that better experience to improve themselves. We use the same scenario in our algorithm: the behavior cloning network acts as the expert in the first phase. As the RL agent keeps learning, it will collect some extremely useful experience, but it may fail to learn from it because that experience may not be sampled during the learning stage. The next time the agent faces the same state, it may choose some other action, which is not what we want; at this point, our BC network gives an alternative action advice learned from the old trajectory.
As our agent becomes better and better and can perform well by itself, it no longer relies on the BC network, so we decrease the probability of using the BC's advice. Conversely, at the very beginning, the agent prefers to take the BC's advice. However, the BC is not a perfect guide: we also decrease the guidance probability when a piece of advice does not perform well in the trajectory, making the agent more independent and encouraging it to explore by itself. In other words, when facing a bad guide, the agent is inclined to explore or learn by itself rather than learning from someone else.
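The action-selection rule described above can be sketched as follows (class and parameter names are our own; the decay step of 0.01 and ε of 0.2 match the experiment configuration later in the paper):

```python
import random

class GuidedAgent:
    """Sketch: take the BC guider's advice with probability p_guide,
    otherwise fall back to the agent's own epsilon-greedy choice."""

    def __init__(self, p_guide=1.0, decay=0.01, epsilon=0.2):
        self.p_guide = p_guide      # probability of following the guider
        self.decay = decay          # per-update reduction of p_guide
        self.epsilon = epsilon      # exploration rate when acting alone

    def act(self, bc_action, greedy_action, n_actions=2):
        if random.random() < self.p_guide:
            return bc_action                     # follow the BC guider
        if random.random() < self.epsilon:
            return random.randrange(n_actions)   # explore randomly
        return greedy_action                     # exploit own Q-values

    def reduce_guidance(self):
        # Called as the agent improves, or when a piece of advice
        # performed poorly; makes the agent progressively independent.
        self.p_guide = max(0.0, self.p_guide - self.decay)
```

With `p_guide = 1.0` the agent learns entirely from the BC network; as `reduce_guidance` is called, control shifts smoothly to the agent's own ε-greedy policy.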
To separate the experience that is collected under the BC network's guidance, we use another replay buffer to store the guided data. We treat the guided data as prioritized, so when using it we compute the loss with a different weighting.

Experiment Configuration
In this paper, we use some simple environments for interaction. The demonstration data was collected manually. In the game MountainCar-v0, a car is on a one-dimensional track, positioned between two "mountains"; the agent gets a -1 reward at each timestep until it reaches the flag position, at which point it gets a 0 reward, and the maximum number of timesteps for the environment is 200 by default.
In the CartPole-v0 game, a pole is attached by an un-actuated joint to a cart that moves along a frictionless track; the agent gets a +1 reward at each timestep, and the maximum number of timesteps for the environment is 200 as well. The actions for both environments are the same: left and right, or zero and one. [Figure: observations of MountainCar-v0 (left) and CartPole-v0 (right)] To evaluate learning performance in this domain, we use DQN as the base learning algorithm. At the very beginning, we play the game for several episodes to collect the demonstration data; limited by our own skill at the game, the demonstration data we feed the agent is not perfect, yet our method still reduces random exploration to some extent.
Then we use the demonstration data to train the BC network first, and use it to interact with the environment. In these simple environments, the BC network can sometimes be very strong, achieving a score of 200 in CartPole-v0, but it can also score 0 because it cannot handle some states. We therefore use the mean of its rewards as the baseline in the pseudo-code above [Figure 1].
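Computing this baseline is a one-liner; the sketch below averages a hypothetical set of BC evaluation returns (the reward values are made up for illustration, spanning the unstable range described above):

```python
def bc_baseline(rewards):
    """Baseline score for the BC guider: the mean episode reward over
    several evaluation runs, smoothing out its unstable performance."""
    return sum(rewards) / len(rewards)

# Hypothetical CartPole-v0 evaluation returns, from failures to perfect runs.
baseline = bc_baseline([200, 0, 150, 200, 50])
```

Trajectories whose total reward exceeds this baseline are then considered better than the guider and fed back into the BC network.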
We train the agent for 1000 game episodes in total, and evaluate the agents (with exploration and updates turned off) for 10 episodes after every 10 learning episodes. Each episode takes 200 action timesteps; in other words, each game episode takes 200 actions. Every 4 timesteps, we sample some data from the replay buffers. We set the ( , ) to 0.8 for both environments, and set ε to 0.2, which means the agent explores the environment with probability 0.2. We reduce the guidance probability every time we sample from the buffer to update the parameters of the Q-network. Once a trajectory obtains a total reward greater than the baseline, we use this specific trajectory to train our BC network k times; here we set k = 50. Each time we reduce the probability of taking an action from the BC network, it decreases by 0.01, and we set the initial probability to 1, which means the agent at first learns entirely from the BC; as the probability decreases, the agent uses an ε-greedy algorithm to balance exploration and exploitation.
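The overall schedule can be sketched as the following skeleton, with trivial stub classes standing in for the real agent, BC network, and environment (only the control flow, the 0.01 decay, and k = 50 come from the text; everything else is a placeholder):

```python
import random

class StubAgent:
    """Minimal stand-in for the RL agent (real components assumed)."""
    def __init__(self):
        self.p_guide = 1.0
    def play_episode(self, guider):
        # Pretend to play 200 timesteps; return (trajectory, total_reward).
        traj = [(s, random.randrange(2)) for s in range(200)]
        return traj, random.uniform(0, 200)
    def update_q_network(self):
        pass  # would sample from the replay buffers and do a TD update
    def reduce_guidance(self):
        self.p_guide = max(0.0, self.p_guide - 0.01)

class StubBC:
    """Minimal stand-in for the behavior cloning network."""
    def __init__(self):
        self.fit_calls = 0
    def fit(self, trajectory):
        self.fit_calls += 1  # would do one supervised update on trajectory

def train(agent, bc_net, baseline, episodes=1000, k=50):
    """Skeleton of the schedule: whenever an episode beats the BC
    baseline, retrain the BC network on that trajectory k times and
    raise the baseline; guidance probability decays by 0.01."""
    for _ in range(episodes):
        trajectory, total_reward = agent.play_episode(guider=bc_net)
        agent.update_q_network()
        agent.reduce_guidance()
        if total_reward > baseline:
            for _ in range(k):
                bc_net.fit(trajectory)
            baseline = total_reward
    return baseline
```

The returned baseline only ever increases, which is what lets the BC network and the agent ratchet each other upward.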

Experimental Results
We use DQN as our baseline algorithm; since DQfD is closely related, we include DQfD as well. Restricted by the structure of our demonstration data, we cannot apply all of the loss functions in DQfD, but we still follow DQfD's training method.
Tables and figures for the three algorithms are given below. For the first environment, MountainCar-v0, the results are as follows. [Table/figure: MountainCar-v0 results] For the other environment, CartPole-v0, the results are as follows. [Table/figure: CartPole-v0 results] We observe that both our algorithm and DQfD learn quickly with demonstration data, and that ours outperforms the others.

Experimental Analysis
The figures and tables above show that our method works, though it still has some problems; our method may not be flawless, but in these experiments it demonstrates its ability to accelerate the learning process. In particular, compared to DQfD, it needs no pre-training process and runs just like a normal DQN. In the CartPole-v0 environment, the total reward oscillates around episode 150, which is the guider's responsibility: when the BC network chooses an action in an unknown state, it may choose an absolutely wrong action, and neither the agent nor the BC network can perceive this. In the MountainCar environment, benefiting from the BC network's guidance, the agent does not spend much time taking random actions and learns quickly; with the BC, the agent can adapt within just a few games, which would matter in a real-world environment.

Conclusions
In this paper, we developed an approach to guide a reinforcement learning agent to interact efficiently with the environment, following the ideology of [5] and using manually collected demonstration data. Our study can be viewed as a combination of taking advice [8] and taking demonstrations, but we focus on how to use the demonstration, rather than when to give the advice.
Since previous work has found that how to use demonstration data can be a big challenge, we use a BC network: behavior cloning requires much less computation than RL, as it only learns a map from states to actions, and it can learn a good policy very quickly while ignoring the sparse reward. At the same time, the RL agent, using an off-policy learning strategy, first learns from the imperfect BC network; once the agent is familiar with the environment, it can output actions by itself and collect better trajectory experience to learn from. Meanwhile, the BC network, taking advantage of its low computation cost and easy convergence, learns from those better trajectories, becoming a better guider and improving sample efficiency.