Reentry trajectory design of a hypersonic vehicle based on reinforcement learning

In this research, we investigate control of a hypersonic vehicle (HV) following its reentry into the Earth's atmosphere, using deep reinforcement learning (DRL) in a continuous space. We incorporate the basic kinematic and force equations of motion for a vehicle in atmospheric flight to formulate a reentry trajectory that satisfies the boundary constraints and multiple mission-related process constraints. The aerodynamic model of the vehicle emulates the properties of a common aero vehicle (CAV-H), while the atmospheric model of the Earth follows the US Standard Atmosphere 1976, with significant simplification of the planetary model. In an unpowered flight, we then control the vehicle's trajectory by perturbing its angle of attack and bank angle to achieve the desired objective, where the control problem is based on different actor-critic frameworks that utilize neural networks (NNs) as function approximators to select and evaluate control actions in continuous state and action spaces. We train the model with each of two methods: on-policy proximal policy optimization (PPO) and off-policy twin delayed deep deterministic policy gradient (TD3). From the trajectories generated, we select a nominal trajectory for each algorithm that satisfies our mission requirements based on the reward model.


Introduction
In hypersonic flight (Mach 5 or above), a vehicle that is either returning from space or reentering the planetary atmosphere following a low-altitude orbital trajectory combines orbital and atmospheric flight mechanics, while needing to maintain a delicate balance between deceleration, heating, and landing accuracy or impact to ensure safe and reliable navigability [1][2].
Hypersonic reentry of a vehicle into a planetary atmosphere can be classified as either ballistic or lifting entry. In a ballistic reentry, the fundamental restraining force is drag, which is directed opposite to the line of flight. In contrast, in a lifting entry the lift force, which acts perpendicular to the flight path, is the primary decelerating force. Manoeuvrability of a lifting vehicle during hypersonic reentry depends on the lift-to-drag ratio, which requires variation of the angle of attack and possibly the bank angle [2][3][4]; this can be achieved in different ways, by propulsion in a powered flight, or by reaction control or aerodynamic deflections of the vehicle control surfaces in an unpowered flight. The key motivation behind this research is to explore control of hypersonic reentry trajectories for a manoeuvrable vehicle in an unpowered flight, and to develop a possible trajectory design policy using reinforcement learning (RL) under mission-specific constraints, with the primary focus being Earth atmospheric reentry.
RL was first demonstrated in an aerospace project to control an autonomous helicopter flight [5]. Researchers [6] applied deep reinforcement learning (DRL) to command a simulated fleet of wildfire surveillance aircraft. DRL has also been applied to spacecraft orbit control and transfers [7], and to approximate optimal guidance paths for pinpoint planetary landing [8]. Chai et al. [9] presented an integrated reentry trajectory planning and attitude control framework based on a deep neural network (DNN). Proximal policy optimization (PPO) has been utilized in hovering and docking with rotating targets [10], in avoidance of on-orbit detection by ground-based sensors [11], and to simulate mid-course exo-atmospheric interception of manoeuvring targets [12]. Miller and Linares, and Zavoli and Federici, presented PPO's utility for low-thrust trajectory optimization in interplanetary missions [13,14]. Wang et al. demonstrated the feasibility of the DDPG algorithm in an autonomous rendezvous mission [15]. Gao et al. used deep reinforcement learning to solve the reentry optimization problem of a reentry vehicle (RV) based on the deep deterministic policy gradient (DDPG) [16]. In a recent work, Hovell and Ulrich proposed a guidance policy for 3-DOF proximity operations using the distributed distributional deep deterministic policy gradient (D4PG) [17].

Problem formulation
We begin by formulating the reentry dynamics, followed by the definition of the necessary mission-related constraints, including boundary conditions, to construct our trajectory design problem.

Reentry dynamic model
Under the simplifying assumption of a point-symmetric spherical Earth rotating with a constant angular velocity $\Omega$, the reentry equations of motion for a point-mass vehicle are

$$
\begin{aligned}
\dot{r} &= V\sin\gamma,\\
\dot{\theta} &= \frac{V\cos\gamma\sin\psi}{r\cos\phi},\\
\dot{\phi} &= \frac{V\cos\gamma\cos\psi}{r},\\
\dot{V} &= -\frac{D}{m} - g\sin\gamma + \Omega^{2}r\cos\phi\,(\sin\gamma\cos\phi - \cos\gamma\sin\phi\cos\psi),\\
\dot{\gamma} &= \frac{L\cos\sigma}{mV} - \Bigl(\frac{g}{V} - \frac{V}{r}\Bigr)\cos\gamma + 2\Omega\cos\phi\sin\psi + \frac{\Omega^{2}r}{V}\cos\phi\,(\cos\gamma\cos\phi + \sin\gamma\sin\phi\cos\psi),\\
\dot{\psi} &= \frac{L\sin\sigma}{mV\cos\gamma} + \frac{V}{r}\cos\gamma\sin\psi\tan\phi - 2\Omega(\tan\gamma\cos\phi\cos\psi - \sin\phi) + \frac{\Omega^{2}r}{V\cos\gamma}\sin\phi\cos\phi\sin\psi,
\end{aligned}
\tag{1}
$$

where $r$ is the radial distance from the Earth's centre, $\theta$ and $\phi$ are the longitude and latitude, $V$ is the Earth-relative velocity, $\gamma$ is the flight path angle, $\psi$ is the heading angle, $\alpha$ is the angle of attack, $\sigma$ is the bank angle, $m$ is the vehicle mass, $g$ is the gravitational acceleration, and $L$ and $D$ are the lift and drag forces. The lift and drag coefficients are

$$
C_L = C_{L0} + C_{L1}\,\alpha + C_{L2}\,e^{C_{L3}\,Ma}, \qquad
C_D = C_{D0} + C_{D1}\,\alpha^{2} + C_{D2}\,e^{C_{D3}\,Ma},
$$

where the coefficients $C_{L0}$, $C_{L1}$, $C_{L2}$, $C_{L3}$, $C_{D0}$, $C_{D1}$, $C_{D2}$, and $C_{D3}$ are adopted from the vehicle model of a special class of high-performance common aero vehicle (CAV-H) developed by Lockheed-Martin [18], and $Ma$ is the Mach number of the vehicle.
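To make the aerodynamic model concrete, the following Python sketch evaluates the lift and drag coefficients and the resulting forces. The numerical coefficient values, the reference area, and the function names are placeholders for illustration only; the actual CAV-H fit constants come from the vehicle model in [18].

```python
import numpy as np

# Placeholder fit constants -- the actual CAV-H values are taken from [18].
CL0, CL1, CL2, CL3 = -0.20, 0.030, 0.05, -0.04
CD0, CD1, CD2, CD3 = 0.10, 2.0e-4, 0.03, -0.05

def aero_coefficients(alpha_deg, mach):
    """Lift and drag coefficients as functions of angle of attack (deg) and Mach number."""
    cl = CL0 + CL1 * alpha_deg + CL2 * np.exp(CL3 * mach)
    cd = CD0 + CD1 * alpha_deg**2 + CD2 * np.exp(CD3 * mach)
    return cl, cd

def aero_forces(rho, v, s_ref, alpha_deg, mach):
    """Lift and drag forces from the dynamic pressure and a reference area s_ref."""
    cl, cd = aero_coefficients(alpha_deg, mach)
    q = 0.5 * rho * v**2  # dynamic pressure
    return q * s_ref * cl, q * s_ref * cd
```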

Process constraints
The process constraints during an atmospheric reentry include the necessary allowable boundaries for aerodynamic heating, dynamic pressure, and aerodynamic load (normal load or g-load) to ensure mission safety.

Aerodynamic heating.
The heating rate $\dot{Q}$ is generally specified at the stagnation point on the nose of the vehicle, and depends on the atmospheric density $\rho$ and the Earth-relative velocity $V$ of the vehicle:

$$
\dot{Q} = k_{Q}\,\rho^{1/2}\,V^{n} \le \dot{Q}_{\max}.
$$

Here, $k_{Q}$ is the heating rate normalization constant, $n$ usually takes the value of 3 or 3.15, and $\dot{Q}_{\max}$ is the maximum allowable heating rate.
Dynamic pressure.
The dynamic pressure constraint of the reentry vehicle is represented as

$$
q = \tfrac{1}{2}\rho V^{2} \le q_{\max},
$$

where $q_{\max}$ is the maximum allowable dynamic pressure.

Aerodynamic load.
The aerodynamic load (normal load or g-force) is another necessary path constraint that must be maintained throughout the trajectory. The load constraint is defined as

$$
n = \frac{\sqrt{L^{2} + D^{2}}}{m\,g_{0}} \le n_{\max}.
$$

Here, $g_{0}$ is the gravitational acceleration at mean sea level, $L$ and $D$ are the lift and drag forces, and $n_{\max}$ is the maximum allowable normal load.
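All three process constraints can be checked together at every integration step, as in the Python sketch below. The limit values, the heating constant, and the function signature are illustrative assumptions, not the mission values given later in tables 2 and 3.

```python
import numpy as np

G0 = 9.81  # sea-level gravitational acceleration, m/s^2

def process_constraints(rho, v, lift, drag, mass,
                        k_q=1.1e-4, n_exp=3.15,
                        qdot_max=1.2e6, qbar_max=4.0e5, n_max=4.0):
    """Evaluate heating rate, dynamic pressure, and normal load, and flag violations.
    All limit values here are placeholders for illustration."""
    qdot = k_q * np.sqrt(rho) * v**n_exp              # stagnation-point heating rate
    qbar = 0.5 * rho * v**2                           # dynamic pressure
    nload = np.sqrt(lift**2 + drag**2) / (mass * G0)  # normal (g) load
    satisfied = (qdot <= qdot_max) and (qbar <= qbar_max) and (nload <= n_max)
    return qdot, qbar, nload, satisfied
```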

Boundary constraints
We define the boundary constraints by specifying each of the dynamical state parameters $(r, \theta, \phi, V, \gamma, \psi)$ and control parameters $(\alpha, \sigma)$ at the initial time $t_{0}$, together with the terminal values of each at the final time $t_{f}$. The values of all these parameters are generally restricted to a bounded domain for each category,

$$
\begin{aligned}
& r_{\min} \le r(t) \le r_{\max}, \quad \theta_{\min} \le \theta(t) \le \theta_{\max}, \quad \phi_{\min} \le \phi(t) \le \phi_{\max},\\
& V_{\min} \le V(t) \le V_{\max}, \quad \gamma_{\min} \le \gamma(t) \le \gamma_{\max}, \quad \psi_{\min} \le \psi(t) \le \psi_{\max},\\
& \alpha_{\min} \le \alpha(t) \le \alpha_{\max}, \quad \sigma_{\min} \le \sigma(t) \le \sigma_{\max}, \quad |\dot{\alpha}| \le \dot{\alpha}_{\max}, \quad |\dot{\sigma}| \le \dot{\sigma}_{\max}.
\end{aligned}
$$

The min and max values denote the minimum and maximum allowable range for each state and control. $\dot{\alpha}$ and $\dot{\sigma}$ are the rates of change of the angle of attack and bank angle respectively, bounded by the maximum rates of change $\dot{\alpha}_{\max}$ and $\dot{\sigma}_{\max}$. A minimal sketch of how these bounds and rate limits can be enforced at each control step is given below.
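The sketch assumes degree-valued controls and per-step perturbation limits; all numeric bounds are placeholders, not the values of table 1.

```python
import numpy as np

def apply_rate_limited_controls(alpha, sigma, d_alpha, d_sigma,
                                alpha_bounds=(-10.0, 20.0), sigma_bounds=(-80.0, 80.0),
                                d_alpha_max=3.0, d_sigma_max=15.0):
    """Advance angle of attack and bank angle by one control step while respecting
    both the per-step rate limits and the absolute range limits (placeholder values)."""
    d_alpha = np.clip(d_alpha, -d_alpha_max, d_alpha_max)
    d_sigma = np.clip(d_sigma, -d_sigma_max, d_sigma_max)
    alpha = float(np.clip(alpha + d_alpha, *alpha_bounds))
    sigma = float(np.clip(sigma + d_sigma, *sigma_bounds))
    return alpha, sigma
```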

Research methodology
In this section, we present the implementation method adopted for a reinforcement learning based reentry trajectory control framework for a hypersonic vehicle. A reinforcement learning agent (the learner or decision maker) and its environment (everything outside the agent) interact over a sequence of discrete time steps. Under the Markov decision process (MDP) formalism, we construct the environment by defining the state, action, and reward model used to compute the trajectory of a chosen reentry vehicle (RV). We adopt different on/off-policy actor-critic frameworks based on the policy gradient theorem, which are particularly suitable for continuous action and state spaces. Utilizing neural networks as function approximators, we employ on-policy proximal policy optimization (PPO) [19] and off-policy twin delayed deep deterministic policy gradient (TD3) [20]. In the reward model, $\Delta h$ represents the difference between the objective/target radial distance $r_{obj}$ and the terminal radial distance $r(t_{f})$, $\Delta v$ the difference between the objective/target velocity $V_{obj}$ and the terminal velocity $V(t_{f})$, $\Delta\theta$ the difference between the objective/target longitude $\theta_{obj}$ and the terminal longitude $\theta(t_{f})$, and $\Delta\phi$ the difference between the objective/target latitude $\phi_{obj}$ and the terminal latitude $\phi(t_{f})$.
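The training setup described here can be reproduced with any standard actor-critic implementation. As one possible choice (the implementation library is not specified in this text), a minimal Stable-Baselines3 sketch is shown below, assuming a hypothetical Gym-compatible ReentryEnv that wraps the dynamics, constraints, and reward model of this section.

```python
# Minimal training sketch; ReentryEnv is a hypothetical Gym-compatible
# environment implementing the state, action, and reward model of section 3.1.
from stable_baselines3 import PPO, TD3

env = ReentryEnv()

# On-policy PPO agent
ppo_agent = PPO("MlpPolicy", env, verbose=1)
ppo_agent.learn(total_timesteps=1_000_000)

# Off-policy TD3 agent
td3_agent = TD3("MlpPolicy", env, verbose=1)
td3_agent.learn(total_timesteps=1_000_000)
```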

RL environment
One of the two core simulation frameworks implemented in our study for trajectory control is built on the definition of the state, action, and reward model, with which the other core, a compact actor-critic RL architecture, can interact.

State space.
In our study, the environmental state at time $t$, denoted $s_{t}$, includes the radial distance $r$, longitude $\theta$, latitude $\phi$, velocity $V$, flight path angle $\gamma$, and heading angle $\psi$ of the RV. Future states $s_{t+1}$ are derived from the vehicle dynamics by numerically integrating the set of kinematic and force equations of motion specified in equation (1) using the Euler method. Hence,

$$
s_{t} = [\,r, \theta, \phi, V, \gamma, \psi\,]^{T}.
$$

Action space.
Conventionally, in an unpowered hypersonic reentry flight, control of the vehicle is achieved by actions taken by the agent in the form of aerodynamic deflections, depending on the perturbation of the angle of attack $\alpha$ and bank angle $\sigma$, to follow a desired trajectory. The action $a_{t}$ at time $t$ can be written in vector form as

$$
a_{t} = [\,\Delta\alpha, \Delta\sigma\,]^{T}.
$$
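A skeleton of such an environment, combining the state vector, the perturbation actions, and explicit Euler integration of equation (1), might look as follows; the class layout, names, and default values are illustrative assumptions rather than the exact implementation used in this work.

```python
import numpy as np

class ReentryEnv:
    """Minimal RL environment skeleton: s_t = [r, theta, phi, V, gamma, psi],
    a_t = [d_alpha, d_sigma] (perturbations of angle of attack and bank angle)."""

    def __init__(self, dynamics, reward_fn, dt=1.0):
        self.dynamics = dynamics    # returns d(state)/dt given (state, alpha, sigma)
        self.reward_fn = reward_fn  # returns (reward, done) for the new state
        self.dt = dt
        self.state = None
        self.alpha = 0.0            # angle of attack
        self.sigma = 0.0            # bank angle

    def reset(self, initial_state, alpha0=10.0, sigma0=0.0):
        self.state = np.asarray(initial_state, dtype=float)
        self.alpha, self.sigma = alpha0, sigma0
        return self.state.copy()

    def step(self, action):
        d_alpha, d_sigma = action   # agent's control perturbations
        self.alpha += d_alpha
        self.sigma += d_sigma
        # Explicit Euler integration of the equations of motion, equation (1)
        self.state = self.state + self.dt * self.dynamics(self.state, self.alpha, self.sigma)
        reward, done = self.reward_fn(self.state)
        return self.state.copy(), reward, done, {}
```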

Results and discussion
In our effort to investigate unpowered trajectory control of a hypersonic reentry vehicle in the Earth's atmosphere, we have applied reward-based actor-critic reinforcement learning methods to approximate the desired policy that returns the best reward under certain preconditions. Table 1 shows the designed scenario with initial values for each state and action parameter, with the primary objective of reaching the positional target of longitude, latitude, and radial distance (the velocity, flight path angle, and heading objectives can be considered soft goals). A non-thrusting control action achieved through the perturbation of angle of attack and bank angle is assumed and adopted along with the necessary system parameters.

The agents are trained for a total of 1 million timesteps each, and each path taken by the vehicle has been analyzed based on its average or total return. The reward model discussed in the previous section has been utilized to form a nominal trajectory. The process constraints considered are given in table 2, followed by the related necessary constants in table 3. To train the model we have adopted the relevant hyperparameters and neural network architecture for each actor-critic method according to the original papers cited earlier. Each training session returned several policies, from which we have chosen the nominal trajectories that returned the maximum reward (maximum average rewards were also similar for both) from the PPO and TD3 algorithmic frameworks. The outcomes from the reproduction of the selected trained policies are summarized in table 4 and the successive plots. The range to target is calculated according to the haversine formula, which evaluates the great-circle distance.

In the graphs, we first present the performance of the PPO and TD3 trained policies across all six states during the flight (figure 1), followed by the corresponding three-dimensional trajectory representation in achieving our primary objective of reaching the target position (figure 2). The aerodynamic constraint measures are shown in figure 3. These results (table 4 and figures 1-3) clearly show that the PPO trajectory reached closest to the target in terms of both altitude and range to target. The trajectory taken by TD3, however, came closest to one of our key soft goals of terminal velocity and also maintained the safest aerodynamic heating, pressure, and load values.
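For reference, the range-to-target computation via the haversine (great-circle) formula can be sketched as below; the function name and the mean Earth radius value are illustrative.

```python
import numpy as np

R_EARTH = 6_371_000.0  # mean Earth radius, m

def haversine_range(lon1, lat1, lon2, lat2):
    """Great-circle distance (m) between two (longitude, latitude) points in radians."""
    d_lat = lat2 - lat1
    d_lon = lon2 - lon1
    a = np.sin(d_lat / 2.0) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(d_lon / 2.0) ** 2
    return 2.0 * R_EARTH * np.arcsin(np.sqrt(a))
```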

Conclusion
During our study, we have found some key features of implementing RL in a reentry scenario, which include but are not limited to: selection of suitable initial conditions for the desired terminal conditions is paramount; tuning the reward model accordingly may require an extensive trial-and-error process; and generalization of the policies, meaning their applicability to unseen scenarios, is not yet viable. Reinforcement learning offers a multitude of opportunities for trajectory control of an HV in a planetary reentry environment with extensive training, given well-defined state and action boundaries. Further research should be conducted to better understand the mechanics of RL algorithms and unlock their full potential in hypersonic trajectory control.

Figure 2. Three-dimensional trajectories for the selected PPO and TD3 policies.

Figure 3. Aerodynamic heating rate (a), dynamic pressure (b), and aerodynamic load (c) for the selected trajectories.
Reward function.
The reward signal consists of a primary performance reward based on the differences between the current and terminal states, an angular direction reward/penalty, a sparse trajectory corridor reward/penalty, a terminal reward/penalty, and some additional performance bonuses ($b$). In the general representations of these terms, $\measuredangle_{r}$ and $\measuredangle_{d}$ are the real and directional objective angles respectively, $N$ is the number of steps, the index takes the values $0, 1, \ldots, 9$, and $\Delta\theta_{obj} = \theta_{obj} - \theta_{0}$ is the difference in longitude between the initial and objective values. The total reward per episode is the accumulation of these reward components over the episode.
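As an illustration of how a performance-type reward can be shaped from the terminal-error quantities defined in section 3, a minimal sketch is given below; the linear penalty form and the weight values are assumptions, not the exact reward expressions used in this work.

```python
def performance_reward(state, target, weights=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative shaping term penalizing the distance of the current state from the
    objective radial distance, velocity, longitude, and latitude (placeholder weights)."""
    r, theta, phi, v, _, _ = state            # [r, theta, phi, V, gamma, psi]
    r_obj, theta_obj, phi_obj, v_obj = target
    d_h, d_v = abs(r_obj - r), abs(v_obj - v)
    d_theta, d_phi = abs(theta_obj - theta), abs(phi_obj - phi)
    w_h, w_v, w_t, w_p = weights
    return -(w_h * d_h + w_v * d_v + w_t * d_theta + w_p * d_phi)
```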

Table 1. Mission scenario with initial states and action values along with objective states, and state and action space minimum and maximum values.

Table 4. Terminal values of all the six states and final cross-range distance.