Research on Inertial Space Intercept Game based on Deep Reinforcement Learning

Aiming at the intercept-game problem in inertial space, this paper builds a model of the basic actions of both sides of the game and explores the applicability of Deep Reinforcement Learning to solving it. Based on Proximal Policy Optimization (PPO), the inertial space game problem is solved and optimal solutions are obtained with rewards designed around cumulative miss distance and minimum distance, respectively. Finally, the effectiveness of the PPO-based algorithm is verified through game simulation and comparison of results.


Introduction
As a traditional method of studying the inertial space game, Differential Game Theory has, through many years of continuous development, produced many important theories and methods. Based on Dynamic Game Theory, a guidance law with an optimal cooperative strategy was proposed [1]. Based on Cooperative Differential Game Theory, the optimal cooperative tracking strategy and the optimal target avoidance strategy of interceptors were designed [2]. With the application of cooperative guidance, research on three-body guidance has deepened. Based on the relationships among target, interceptor, and defender, researchers studied and designed the optimal control strategy of the interceptor, the cooperative maneuver strategy of the target and defender, and the optimal avoidance strategy of the defender [3][4][5]. These theories greatly promote research on the inertial space game. However, applying Differential Game Theory to game confrontation in inertial space also has defects. Differential Game Theory requires complete state information, reflected in a strict observation requirement that almost amounts to full observability, and some methods need the complete game matrix, which is effectively equivalent to knowing all strategies in advance. Moreover, solutions based mainly on Differential Games often fall into local optima, or multiple Nash equilibrium solutions exist.
Deep Reinforcement Learning (DRL) is an algorithm that combines Deep Learning and Reinforcement Learning to realize end-to-end learning from state to action. In 2013, Google DeepMind put forward Deep Reinforcement Learning for the first time [6]: combining Reinforcement Learning and Deep Learning, an agent relying on pure image input learned to play Atari games entirely through self-learning. In 2015, Google DeepMind published an article in Nature [7], using DRL to achieve human-level control on Atari games. In 2016, AlphaGo, developed by the Google DeepMind team based on Monte Carlo tree search and DRL, defeated the Go world champion Lee Sedol, attracting wide attention. DRL has many advantages for solving game confrontation. Deep convolutional networks can automatically learn abstract representations of high-dimensional input data, effectively solving the problem of domain knowledge representation and acquisition in complex tasks. At the same time, the strategy is expressed as a mapping from state to action, and the optimal strategy is selected through random sampling and other methods, without knowing the game strategy in advance. Some Reinforcement Learning methods can find better strategies through training and exploration under partial observability. In addition, DRL can be solved by a variety of methods; provided the reward function is suitable, multiple solutions and non-convergence rarely occur. In this paper, the basic actions of both sides of the inertial space game are modeled, the applicability of DRL to solving inertial space game confrontation is explored, and the penetration effect in the inertial space game is verified by simulation.
EKV guidance equation [10][11][12]:

    a_c = K * V_c * q_dot + (K / 2) * T_a

where K is the proportional guidance coefficient, V_c is the closing velocity, q_dot is the line-of-sight angular rate, and T_a is the maneuvering acceleration of the target.

Guidance Law modeling for interceptor missiles
Command acceleration of TPN:

    a_c = K * V_c * q_dot

where V_c is the closing velocity and q_dot is the line-of-sight angular rate.
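The TPN and augmented (EKV) proportional navigation commands above can be sketched numerically for a planar engagement. This is an illustrative sketch, not the paper's implementation; the function names and the two-dimensional geometry are assumptions.

```python
import numpy as np

def los_rate(r_rel, v_rel):
    """Line-of-sight angular rate q_dot for a planar engagement.
    r_rel, v_rel: 2-D relative position/velocity (target minus missile)."""
    # q = atan2(y, x); differentiating gives q_dot = (x*vy - y*vx) / |r|^2
    return (r_rel[0] * v_rel[1] - r_rel[1] * v_rel[0]) / np.dot(r_rel, r_rel)

def closing_velocity(r_rel, v_rel):
    """Closing velocity V_c = -d|r|/dt (positive when the range shrinks)."""
    return -np.dot(r_rel, v_rel) / np.linalg.norm(r_rel)

def tpn_command(K, r_rel, v_rel):
    """True proportional navigation: a_c = K * V_c * q_dot."""
    return K * closing_velocity(r_rel, v_rel) * los_rate(r_rel, v_rel)

def apn_command(K, r_rel, v_rel, T_a):
    """Augmented PN: adds (K/2) * T_a to compensate the target maneuver."""
    return tpn_command(K, r_rel, v_rel) + 0.5 * K * T_a
```

With the target 1000 m ahead, a closing speed of 300 m/s, and 50 m/s of lateral relative velocity, `tpn_command(3.0, ...)` yields 45 m/s² of commanded acceleration, and a 10 m/s² target maneuver adds 15 m/s² under APN.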

Application of Deep Reinforcement Learning
In this paper, Proximal Policy Optimization (PPO) is adopted. PPO is a Policy Gradient algorithm. Policy Gradient methods are very sensitive to the step size, yet an appropriate step size is difficult to choose: if the old and new strategies differ too much during training, learning suffers. PPO proposes a new objective function that realizes small-batch updates over multiple training steps, thus solving the problem that the step size is difficult to determine in Policy Gradient. The most commonly used gradient estimator has the form

    g_hat = E_t[ grad_theta log pi_theta(a_t | s_t) * A_hat_t ]

where the estimator g_hat is obtained by differentiating the objective

    L^PG(theta) = E_t[ log pi_theta(a_t | s_t) * A_hat_t ]

If the agent maneuvers so that the missile-target distance at the final intersection exceeds a certain threshold, it receives a corresponding positive reward, so that L in formula (10) is driven to its maximum.
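For concreteness, PPO's characteristic clipped surrogate objective, min(r_t * A_t, clip(r_t, 1-eps, 1+eps) * A_t), which realizes the small-step updates described above, can be sketched in NumPy. The function name and batch layout are illustrative assumptions.

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped PPO surrogate, averaged over a batch.

    ratio:     pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t), one value per sample
    advantage: estimated advantage A_hat_t, one value per sample
    eps:       clipping range; keeps the new policy close to the old one
    """
    ratio = np.asarray(ratio, dtype=float)
    advantage = np.asarray(advantage, dtype=float)
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    # Taking the elementwise minimum gives a pessimistic bound, so a large
    # policy change can never look better than a small one.
    return np.minimum(unclipped, clipped).mean()
```

With a positive advantage, a ratio of 1.5 is clipped to 1.2, capping the incentive to move the policy too far; with a negative advantage, the minimum keeps the more pessimistic of the two terms.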

Simulation Setting
The observable quantities of the Intelligent Missile are the position of the Missile relative to the TPN interceptor, the TPN interceptor's velocity, and the Missile's velocity. We assume the Intelligent Missile makes a decision only once in each iteration; the decision covers maneuvering position, maneuvering duration, maneuvering angle, and maneuver force. The initial action is designed as a discrete vector with an action space of size 2: action 0 means no maneuver, while action 1 means the Intelligent Missile takes a maneuvering action. Training parameters are shown in Tab.1.
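The observation layout and the two-valued discrete action described above can be sketched as follows. This is a minimal sketch under assumptions: the flat observation ordering, the state dictionary, and the maneuver model (a single lateral acceleration) are illustrative, not the paper's exact implementation.

```python
import numpy as np

NO_MANEUVER, MANEUVER = 0, 1  # the size-2 discrete action space

def make_observation(missile_pos, tpn_pos, tpn_vel, missile_vel):
    """Observation: position of the Missile relative to the TPN interceptor,
    TPN velocity, and Missile velocity, stacked into one flat vector."""
    rel_pos = np.asarray(missile_pos) - np.asarray(tpn_pos)
    return np.concatenate([rel_pos, np.asarray(tpn_vel), np.asarray(missile_vel)])

def apply_action(action, state, maneuver_accel):
    """Apply the discrete action: 0 keeps the nominal trajectory,
    1 triggers the maneuver, modeled here as a lateral acceleration."""
    if action == MANEUVER:
        state = dict(state, accel=maneuver_accel, maneuvering=True)
    return state
```

A decision taken once per iteration then reduces to calling `apply_action` with the policy's 0/1 output at the chosen maneuvering position.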

Analysis of results
After training, the Intelligent Missile is expected to find a maneuver position around 0.7F. Training results are shown in Fig.2:

Fig.2 Training results
The green track is the trajectory of the agent after the maneuver, while the blue track is the trajectory without maneuver. The visualization result shows that the maneuver of the agent makes the penetration successful.
The result indicates that the Intelligent Missile chooses to maneuver at the very beginning, which reflects poor training: the agent only ever selects the endpoint value of the range, a typical sparse-reward problem in Reinforcement Learning [13]. With an improved reward factor, the agent makes partially correct decisions, which solves the endpoint-only problem, but it still performs only a single maneuver. In fact, the agent can maneuver multiple times, even though multiple maneuvers are penalized. After analysis and research, we designed a set of mutually exclusive maneuvers.
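The paper does not detail how the mutually exclusive maneuvers are enforced, so the following is a hypothetical sketch of one common approach: an action mask in which committing to one maneuver slot disables the slots that conflict with it, preventing the agent from stacking maneuvers in the same window. The function name and conflict-table representation are assumptions.

```python
def exclusive_mask(chosen, conflicts, n_actions):
    """Return a 0/1 availability mask over n_actions maneuver slots.

    chosen:    indices of maneuvers the agent has already taken
    conflicts: dict mapping a maneuver index to the indices it excludes
    n_actions: total number of maneuver slots
    """
    mask = [1] * n_actions
    for c in chosen:
        mask[c] = 0  # a maneuver cannot be repeated
        for other in conflicts.get(c, []):
            mask[other] = 0  # mutually exclusive slots become unavailable
    return mask
```

During action selection, the policy's probabilities for masked-out slots are zeroed (or their logits set to a large negative value) before sampling, so only mutually compatible maneuvers remain reachable.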
After retraining with this improvement, the mean reward of the agent stabilizes within 12,000 training iterations and the standard deviation of the reward tends to 0.
Tab.2 shows some local optimal solutions found by the PPO-based agent with cumulative miss distance as the reward value and a discrete action space. Tab.3 shows the corresponding local optimal solutions found with minimum distance as the reward value and the same discrete action space.

Tab.2 Local optimal solutions with cumulative miss distance as the reward value
Tab.3 Local optimal solutions with minimum distance as the reward value

Conclusion
Through simulation, the PPO model based on Deep Reinforcement Learning shows good applicability to the inertial space intercept game. With cumulative miss distance and minimum distance as reward values, optimal solutions close to expert level can be obtained, verifying the effectiveness of Deep Reinforcement Learning.