Mobile robot navigation based on Deep Reinforcement Learning: A brief review

Navigation is a key capability of mobile robots and the foundation for other advanced behaviours. Compared with traditional navigation techniques, applying Deep Reinforcement Learning (DRL) to artificial agents to achieve mobile robot navigation is currently a major academic focus. DRL follows an end-to-end approach, transforming high-dimensional and continuous inputs into an optimal policy that guides the mobile robot, forming an advanced perception-and-control system. In this article, DRL is first compared with traditional navigation technology and SLAM, and its application advantages are elucidated. Then, the background and classic algorithm models of standard reinforcement learning and DRL are systematically elaborated. Finally, the application of DRL in different scenarios and research fields is introduced.


Introduction
Mobile robot navigation (perceptual recognition of the environment) is the foundation for other advanced behaviors of an artificial agent (a robot entity or software algorithm that can independently execute commands and actions in real or simulated environments). In conventional navigation technologies, the navigation algorithms carried by mobile robots achieve path-planning tasks based on information provided by external sensors (lasers, cameras, radar, etc.). Research has shown that this traditional technology is widely used in a variety of industrial applications and has achieved great success, for example in the Global Navigation Satellite System (GNSS) [1]. Nonetheless, these technologies still have significant limitations: they cannot be applied in complex, unknown, and dynamically changing environments. A map-based navigation system needs to survey and map the environment in advance, which consumes a lot of time and places high demands on the accuracy of environment modeling. Faithfully reproducing complex and changeable environments is therefore a huge challenge for traditional navigation. Likewise, limited sensor accuracy and the accumulation of noise during the modeling process also affect the robustness of the navigation algorithm.
To address this predicament, some researchers have combined traditional navigation systems with Simultaneous Localization and Mapping (SLAM) to build maps of uncharted and complex environments [2]. Based on a positioning algorithm that anchors the mobile robot's real-time location, the robot plans an effective path to the destination and completes the navigation task.
SLAM algorithms can be divided into two categories: laser SLAM and visual SLAM. Both have mature applications and solutions in Two-Dimensional (2D) and Three-Dimensional (3D) environments. Laser SLAM relies on a laser sensor to scan the surroundings and processes the laser point-cloud data to extract environmental features, which are then used to model the unknown environment. There are many mainstream laser SLAM solutions, such as GMapping, Karto SLAM, Lidar Odometry and Mapping (LOAM), and the SegMap algorithm [3]. Laser SLAM offers high precision, but its high cost hinders its popularization, and its accuracy degrades sharply in extreme weather such as heavy rain and snow. Compared with laser SLAM, visual SLAM is cheaper to deploy and retains the corresponding environmental semantic attributes. For example, LSD-SLAM, presented in 2014, and ORB-SLAM, proposed in 2015, are both classic visual SLAM algorithms [3]. However, visual SLAM also faces challenges such as weak image feature extraction, low robustness under high-speed camera motion, and sensitivity to environmental lighting.
With the successful application of deep learning (DL) in various fields, researchers are also trying to employ deep reinforcement learning (DRL) to accomplish robot navigation. DRL algorithms work end-to-end and can effectively extract observation features from discrete or continuous environments, simplifying the training process [4]. During training, DRL can update all parameters of the training network and does not depend strongly on pre-collected environmental maps or on sensor accuracy. It thus mitigates the error-accumulation weakness of traditional navigation and SLAM's heavy dependence on sensors. Therefore, this paper systematically reviews the application of DRL in mobile robot navigation, so as to help researchers interested in this field understand current development trends and better deploy and improve the corresponding research results.
The remainder of the paper is arranged as follows: Section II systematically outlines and reviews the background knowledge and fundamental models of DRL. Section III discusses the application of DRL in specific scenarios, and Section IV concludes the paper.

Deep Reinforcement Learning
Generally, standard reinforcement learning (RL) is applied in discrete environments, where the artificial agent interacts with the environment at each training step. At each step, the agent records an observation to obtain the corresponding state features and then selects an action according to its policy and value function, forming a mapping between the current state and the action. Given the current state and the applied action, the environment returns a reward and the next state. Training loops until the set number of episodes is reached or the goal is achieved, maximizing the cumulative reward and finding the optimal policy. DRL combines RL and DL to form an end-to-end control system, solving the perceptual decision-making problem of complex systems. This section introduces the basics of DRL in four parts: first, the framework of the standard RL algorithm is systematically reviewed; then three families of DRL algorithms and their representative models are introduced.

Framework of reinforcement learning
The RL algorithm encourages the agent to perform trial-and-error learning under unsupervised conditions in the environment (the environment refers to all objects other than the agent) and provides rewards based on the feedback obtained when the agent interacts with the environment. Different from supervised learning, RL does not prescribe the correct action but evaluates the agent's behavior to guide it toward better actions. The agent thus eventually learns the mapping between environment states and actions, improving its behavior to adapt to the environment and to maximize the accumulated reward. RL is formulated as a Markov Decision Process (MDP), which is built on a set of interacting objects, the artificial agent and the environment [5]. An MDP is described by the tuple $\langle S, A, R, P, \gamma \rangle$, where $S$ is the set of environment states, $A$ the set of actions the agent can perform, $R$ the reward value returned by the reward function, and $P$ the state-action transition function $P(s_{t+1} \mid s_t, a_t)$, i.e., the probability of reaching the next state $s_{t+1}$ at time $t+1$ when action $a_t$ is applied in state $s_t$; $\gamma \in [0,1]$ is the discount factor expressing the weight of future rewards relative to the present state. The cumulative reward for RL can therefore be expressed as

$$G_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k}.$$

At the same time, RL satisfies the Markov property of the MDP model: the future state depends only on the present state and the action taken, and is not affected by the past,

$$P(s_{t+1} \mid s_t, a_t) = P(s_{t+1} \mid s_1, a_1, \dots, s_t, a_t).$$

The value function in RL represents the model's prediction of the expected cumulative future reward. Unlike the instantaneous reward, the value function evaluates long-term benefit and is suitable for assessing the quality of a state. RL has two classes of value functions: the state-value function and the action-value function.
The state-value function

$$V_\pi(s) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s\right]$$

measures the quality of state $s$ when following policy $\pi$. The policy that returns the largest reward corresponds to the optimal state-value function:

$$V^*(s) = \max_\pi V_\pi(s).$$

Similarly, the action-value function is defined as

$$Q_\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k} \,\middle|\, s_t = s,\, a_t = a\right],$$

which evaluates the value obtained by executing action $a$ in state $s$ and then following policy $\pi$. Likewise, the optimal action-value function is

$$Q^*(s, a) = \max_\pi Q_\pi(s, a).$$

On this basis, reinforcement learning can be divided into three models, depending on the relationship between value and policy:
• Value-Based: Optimizing the value function.
• Policy-Based: Optimizing the policy function directly, without reference to the value function.
• Actor-Critic (AC) Based: Considering both the policy function $\pi(s)$ and the value function.
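To make the discount factor and the cumulative reward above concrete, the following minimal Python sketch computes the discounted return $G_t$ for one rollout; the reward values and the discount factor are illustrative assumptions, not taken from any cited work.

```python
# Illustrative computation of the discounted return G_t = sum_k gamma^k * r_{t+k}.
# The reward sequence and gamma below are made-up values for demonstration only.
GAMMA = 0.9                       # discount factor in [0, 1]
rewards = [1.0, 0.0, 0.0, 5.0]    # r_t, r_{t+1}, ... collected along one rollout

def discounted_return(rewards, gamma):
    """Accumulate rewards backwards so each step is discounted once more than the next."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return(rewards, GAMMA))  # 1 + 0.9**3 * 5 = 4.645
```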

Value-based method
In the value-based approach, the quality of the overall system is assessed by the expected return of the actions taken by the agent, and this assessment is used to guide the agent. The action that achieves the optimal action-value function is considered the optimal policy. Q-learning is one of the most typical value-based RL algorithms [6]. Through the Bellman equation, the value function can be decomposed into the immediate reward $r_t$ and the discounted value of the successor state-action pair, $Q(s_{t+1}, a_{t+1})$, so the action-value function can be rewritten as

$$Q_\pi(s_t, a_t) = \mathbb{E}\!\left[r_t + \gamma\, Q_\pi(s_{t+1}, a_{t+1})\right].$$

As the number of training iterations approaches infinity, $t \to \infty$, the value of $Q$ gradually converges and ultimately becomes stable. The optimal policy is then obtained by acting greedily with respect to the optimal action-value function:

$$\pi^*(s) = \arg\max_a Q^*(s, a).$$

Because the Q value is updated in every iteration, Q-learning maintains a Q-table to record the changes in the Q value and updates it with the Temporal Difference (TD) method. Unlike the plain Bellman equation, Q-learning also introduces a learning rate $\alpha$ into the iterative update of the Q value, thereby reducing the influence of errors on the iteration. The Q function can thus be estimated from the rewards of observed state transitions, without knowing the state transition model, via the iterative update rule

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha\left[r_t + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)\right].$$

Nevertheless, the limitation of Q-learning is that it records, updates, and iterates based on the Q-table. The Q values stored in the Q-table for each state-action pair form a discrete and limited set, so the method can only deal with low-dimensional, discrete state and action spaces; it can hardly handle the high-dimensional, continuous situations encountered in reality. When the Q-table cannot accurately store and represent the Q value, a parameterized function must be used to approximate it:

$$Q(s, a; \theta) \approx Q^*(s, a).$$

Deep Q-Network (DQN) is the groundbreaking work in the DRL field. It introduces DL into RL and constructs an end-to-end architecture from perception to decision. DeepMind first presented DQN at NIPS 2013, followed by an improved version in Nature in 2015 [7]. DQN combines RL and DL so that Q-learning can use high-dimensional data directly to train and learn the corresponding control policy. Compared with Q-learning, DQN makes significant improvements in three aspects (a minimal sketch of these mechanisms follows the list below):
(1) A convolutional neural network replaces the Q-table to approximate the Q value iteratively. This enables Q-learning to take high-dimensional observations as direct input and to train agents with large numbers of continuous states and actions.
(2) DQN uses experience replay. Different from RL, DL is supervised learning: it must be trained with labeled sample data and compute a loss function, so that the weights of the neural network can be updated by gradient descent and error backpropagation. To address this, DQN stores experienced transitions in a replay buffer and later draws random samples from it for training. This experience replay mechanism effectively improves data utilization and reduces both the correlation between consecutive samples and the variance of the updates, ultimately improving the convergence speed and stability of the algorithm.
(3) DQN deals with TD errors through a target network, designed specifically to stabilize the TD algorithm. Two DNNs with the same structure but different parameters (i.e., different weights and biases in each layer) are used: one with parameters $\theta^-$ and the other with parameters $\theta$. In each iteration the algorithm updates $\theta$ rather than $\theta^-$, and every $n$ steps it sets $\theta^- = \theta$. The network with parameters $\theta^-$ is called the target network; this design mitigates the TD error problem and prevents the training of the neural network from becoming unstable.
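The following PyTorch sketch illustrates how these three ingredients fit together in one update step: a neural network in place of the Q-table, random sampling from a replay buffer, and a frozen target network that supplies the TD target. The network sizes, hyperparameters, and transition format are illustrative assumptions, not details taken from the DQN papers.

```python
import random
from collections import deque

import torch
import torch.nn as nn

# Illustrative dimensions and hyperparameters (assumptions, not from [7]).
STATE_DIM, N_ACTIONS, GAMMA = 4, 2, 0.99

# (1) Neural networks replace the Q-table: an online network and a target network.
q_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net = nn.Sequential(nn.Linear(STATE_DIM, 64), nn.ReLU(), nn.Linear(64, N_ACTIONS))
target_net.load_state_dict(q_net.state_dict())      # initialize theta^- = theta
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)

# (2) Experience replay: transitions (s, a, r, s', done) are stored and sampled at random.
replay_buffer = deque(maxlen=10_000)

def dqn_update(batch_size=32):
    """One gradient step on the TD error, using the frozen target network for the target."""
    if len(replay_buffer) < batch_size:
        return
    batch = random.sample(replay_buffer, batch_size)        # breaks sample correlation
    states, actions, rewards, next_states, dones = zip(*batch)
    s = torch.tensor(states, dtype=torch.float32)
    a = torch.tensor(actions, dtype=torch.int64)
    r = torch.tensor(rewards, dtype=torch.float32)
    s_next = torch.tensor(next_states, dtype=torch.float32)
    done = torch.tensor(dones, dtype=torch.float32)

    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)    # Q(s, a; theta)
    with torch.no_grad():                                    # (3) target computed with theta^-
        target = r + GAMMA * target_net(s_next).max(dim=1).values * (1.0 - done)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    # Every n steps the caller would copy theta into theta^-:
    # target_net.load_state_dict(q_net.state_dict())
```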

Policy-based method
Generally, the value-based method requires the action and state spaces to be discrete and limited, and the learned policy is deterministic, which is unsuitable for problems whose optimal policy is stochastic. The policy-based method models the policy $\pi$ directly and, for a given state, directly outputs the selected action $a$:

$$\pi_\theta(s, a) = P(a \mid s, \theta).$$

The agent iteratively improves the policy, i.e., it updates the parameter $\theta$ until the expected cumulative return is maximal; the policy corresponding to the parameter at that point is the optimal policy. The objective can therefore be expressed as

$$J(\theta) = \rho(\pi) = \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],$$

where $\rho(\pi)$ is the average return. The policy gradient method is used to maximize this objective, with a parameter $\alpha$ introduced as the update step size:

$$\theta \leftarrow \theta + \alpha \nabla_\theta J(\theta).$$

This policy-based algorithm is also called the REINFORCE algorithm, which uses the Monte-Carlo (MC) method to perform policy gradient iteration on the policy $\pi$ [8]. However, the defect of the policy gradient algorithm is that it can oscillate and suffer from gradient cliffs, making convergence difficult during policy iteration. Therefore, Trust Region Policy Optimization (TRPO) was proposed in 2015; it constrains adjacent policies to prevent the instability caused by excessively large policy changes [9] and ensures a monotonic improvement across policy iterations. The goal of TRPO can be described as follows:

$$\max_\theta \ \hat{\mathbb{E}}_\tau\!\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_{old}(a_t \mid s_t)}\, \hat{A}_t\right]$$

$$\text{s.t.}\quad \hat{\mathbb{E}}_\tau\!\left[ KL\!\left[\pi_{old}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\right]\right] \le \delta$$
By setting the trust-region threshold δ, each policy iteration is kept within the constraint range, which guarantees that the optimized policy is no worse than the current policy until a local or global optimum is reached.
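As a concrete illustration of the Monte-Carlo policy gradient described above, the sketch below performs one REINFORCE update from a single finished episode. The small network, the state and action dimensions, and the hyperparameters are illustrative assumptions; TRPO's trust-region constraint is not implemented here.

```python
import torch
import torch.nn as nn

# Illustrative policy network over 4-dimensional states and 2 discrete actions (assumed sizes).
policy = nn.Sequential(nn.Linear(4, 64), nn.Tanh(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)   # alpha: the update step size
GAMMA = 0.99

def reinforce_update(episode):
    """episode: list of (state, action, reward) tuples from one Monte-Carlo rollout."""
    # Compute the discounted return G_t for every time step, working backwards.
    returns, g = [], 0.0
    for _, _, r in reversed(episode):
        g = r + GAMMA * g
        returns.append(g)
    returns.reverse()

    # Accumulate -log pi(a_t|s_t) * G_t; minimizing it ascends the policy-gradient objective.
    loss = torch.tensor(0.0)
    for (s, a, _), g in zip(episode, returns):
        logits = policy(torch.tensor(s, dtype=torch.float32))
        log_prob = torch.log_softmax(logits, dim=-1)[a]
        loss = loss - log_prob * g

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()      # theta <- theta + alpha * grad_theta J(theta)
```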

Actor-critic-based method
Two different RL algorithms have been discussed above: the value-based method and the policy-based method. The disadvantage of the former is that it is hard to implement in high-dimensional and continuous spaces; the disadvantage of the latter is that policy evaluation is inefficient and convergence is slow. The AC method combines the two: the policy-based part plays the role of the actor that chooses actions, and the value-based part is treated as the critic that evaluates the value and the advantage of the adopted action. The actor revises the probability of choosing an action in accordance with the critic's score. Thus, AC can not only handle both continuous and discrete problems but also perform single-step updates, improving learning and training efficiency. The AC method can be expressed as

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\!\left[\nabla_\theta \log \pi_\theta(a_t \mid s_t)\,\big(Q(s_t, a_t) - V(s_t)\big)\right],$$

where $\pi_\theta(a_t \mid s_t)$ refers to the actor, the critic corresponds to $Q(s_t, a_t)$, and $V(s_t)$ serves as the baseline $b$ for the algorithm. Comparing the value function with the baseline measures the quality of the action.
To improve learning efficiency and make the best use of computing resources, the Asynchronous Advantage Actor-Critic (A3C) algorithm was presented in 2016 by Mnih and his team [10]. Built on the actor-critic method, A3C adds asynchronous training, realizing multithreaded parallel computation: during training, the algorithm assigns tasks to agents on multiple threads at the same time, which improves the overall running speed and efficiency. After each learning iteration, every thread pushes its parameter updates to the global network, and the global parameters are then shared with all threads for the next learning episode. Furthermore, other algorithms usually use only the one-step return, $V(s_0) \to r_0 + \gamma V(s_1)$, i.e., the immediate reward obtained from the current transition $(s_0, a_0, r_0, s_1)$. A3C instead adopts the n-step return, $V(s_0) \to r_0 + \gamma r_1 + \gamma^2 r_2 + \dots + \gamma^n V(s_n)$, so that the value approximation is propagated over n steps, which makes the operation more efficient. According to the original paper, the training performance of A3C on Atari games is four times that of DQN.
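To make the n-step return and the actor-critic roles above concrete, the sketch below performs a single-worker n-step advantage actor-critic update; in A3C, many such workers would apply their gradients asynchronously to a shared global network. The network sizes, rollout format, and loss weighting are illustrative assumptions rather than details of [10].

```python
import torch
import torch.nn as nn

# Illustrative actor (policy) and critic (value) networks for 4-dim states and 2 actions.
actor = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))    # pi(a|s)
critic = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 1))   # V(s)
optimizer = torch.optim.Adam(list(actor.parameters()) + list(critic.parameters()), lr=1e-3)
GAMMA = 0.99

def n_step_update(rollout, bootstrap_state):
    """rollout: list of n (state, action, reward) tuples collected by one worker thread."""
    with torch.no_grad():
        R = critic(torch.tensor(bootstrap_state, dtype=torch.float32)).item()  # V(s_n)

    actor_loss = torch.tensor(0.0)
    critic_loss = torch.tensor(0.0)
    # Work backwards so R becomes r_t + gamma*r_{t+1} + ... + gamma^n * V(s_n) at each step.
    for s, a, r in reversed(rollout):
        R = r + GAMMA * R
        s = torch.tensor(s, dtype=torch.float32)
        value = critic(s).squeeze()
        advantage = R - value.item()                     # critic's value acts as the baseline
        log_prob = torch.log_softmax(actor(s), dim=-1)[a]
        actor_loss = actor_loss - log_prob * advantage   # actor follows the critic's score
        critic_loss = critic_loss + (R - value) ** 2     # critic regresses toward the n-step return

    optimizer.zero_grad()
    (actor_loss + 0.5 * critic_loss).backward()
    optimizer.step()   # in A3C these gradients would update the shared global parameters
```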

Applications of DRL in navigation

Local obstacle avoidance
Almost all mobile robot task scenarios involve local obstacle avoidance. On the basis of obstacle avoidance, the robot can be extended to more complicated and advanced applications. In conventional navigation schemes, the artificial potential field (APF), A-star, and ant colony optimization are classic algorithms for path-planning based on global or local map information. However, their limitations lie in the memory and computing power required by high-precision grid maps, the complexity of dynamic environments, and the dependence on sensors. Applying DRL to local obstacle avoidance and using a deep convolutional neural network to process the large volume of input information can significantly save computing resources and allow robots to process large amounts of data in real time.
Cimurs et al. designed a system that takes the depth image and the relative position (current position relative to the target point) as network input for training, controlling the mobile robot to achieve local obstacle avoidance. Moreover, they realized sim-to-real transfer, carrying the policy trained in the simulator over to the real environment and completing map-less navigation in reality. To complete the navigation task in continuous space, they combined Deep Deterministic Policy Gradient (DDPG), an actor-critic DRL algorithm, with a deep convolutional network to process high-dimensional, continuous depth-image input [11]. Experiments show that the trained robot has good obstacle avoidance performance in both static and dynamic environments.
In terms of reducing the dependence on sensors, the research team of Shi et al. optimized the A3C algorithm and used sparse laser ranging as the sensor input to the training network, reducing the navigation cost and the dependence on high-precision sensors [12]. The error between the actual state and the ideal state is also introduced into the reward function to encourage the robot to explore independently and to improve training efficiency. Furthermore, Sangiovanni et al. presented a mixed-mode architecture to reduce control cost, which lets the robot use traditional path-planning methods in simple environments [13]. When the robot is recognized to have entered a complex environment, it switches to the DRL control strategy to complete the corresponding obstacle avoidance tasks.

Indoor navigation
Indoor navigation describes the navigation tasks of mobile robots in complex surroundings with dynamic obstacles. Research on local obstacle avoidance is often conducted in a single, simple, static environment, whereas the indoor navigation setting is more complicated, such as a maze. DeepMind built a huge 3D virtual maze in 2017 that provides many complex structures for testing DRL performance [14]. During training, the navigation result is regarded as a by-product of the agent adopting a reward-maximizing policy. Moreover, the characteristics of the indoor environment differ somewhat from local obstacle avoidance: first-person-view sensors are often used instead of positioning systems that provide target coordinates, so feature extraction poses a major challenge. Surmann et al. fused 2D laser scan data with images from a 3D RGB camera as the sample input, providing data with 3D perception of environmental features, and applied it to A3C for training [15].
Another challenge of indoor navigation lies in sparse rewards. Owing to the complex structure of the indoor exploration space, the agent can usually obtain rewards only in specific states; no rewards are set for the intermediate process, and the global rewards are delayed and sparse. It is therefore difficult for the agent to obtain the positive rewards needed for continuous learning during training. Zhu et al. proposed goal-driven visual indoor navigation, feeding the navigation goal into the network to generate a flexible training set and reward pattern [16]. Adding constraints to the environment by setting auxiliary tasks alleviates the problem of sparse rewards and improves the learning efficiency of the agent.

Crowd navigation
Crowd navigation refers to the navigation tasks performed by mobile robots in crowded social places, for example shopping malls, restaurants, and playgrounds. Such application scenarios impose stronger requirements on safety and reliability. To avoid collisions with crowds, mobile robots must be able to predict and judge human behavior. The difficulty is that human behavior is unpredictable and hard to standardize and quantify, which makes crowd navigation more challenging and widens the gap between the simulated environment and reality.
Chen et al. applied the CADRL algorithm to social scenarios and modeled humans as uncontrolled agents, thereby simplifying crowd navigation into a multi-agent collision avoidance problem [17]. They constructed dual-agent and multi-agent models, applied a LiDAR sensor to obtain the position, velocity, and volume of the dynamic obstacles (the crowd) as samples, and conducted training and optimization through CADRL. Samsani et al. trained the agent with crowd information captured by sensors so that the robot understands human behavior and delineates danger zones based on human behavior and speed; the agent achieves safe and reliable navigation by avoiding these danger zones [18].
Crowd navigation also faces the problem of a limited field of view: the mobile robot can only observe a limited range and limited information in a crowded environment, so the agent typically needs high-cost LiDAR to perceive its surroundings. Choi et al. attempted to replace LiDAR with a lower-cost depth camera and applied an LSTM agent with the Local-Map Critic algorithm to solve the corresponding problem [19].

Conclusion
Navigation capability is crucial for mobile robots. The application of DRL in mobile robot navigation has greatly promoted research in this field and has attracted the attention of more and more scholars. This paper systematically reviews the advantages of DRL over traditional navigation and SLAM in robot navigation and sorts out the corresponding background and classic algorithm models of DRL. In addition, the corresponding scenarios of current DRL navigation applications are discussed. It is hoped that this article can help researchers interested in this field to improve the current research results and introduce DRL into more practical applications.