Angle-only Autonomous Terminal Guidance and Navigation Algorithm for Asteroid Defense based on Meta-reinforcement Learning

This paper presents a robust angle-only guidance and navigation algorithm for asteroid defense missions based on meta-reinforcement learning. A recurrent neural network, trained via proximal policy optimization, maps the line-of-sight angles captured in real time by the onboard camera to the optimal thrust. The neural network effectively replaces the navigation and guidance system while removing the dependence on dynamic and observation models. The guidance and navigation model is tested in numerical simulations of a mission to asteroid Bennu. The objective is to enable the spacecraft to hit the asteroid precisely despite scattered initial conditions, uncertain model parameters, thruster control error, and attitude control and measurement error.


Introduction
Near-Earth asteroids pose a potential danger: their close approaches to our planet risk collisions that could threaten human safety and survival. Numerous scholars have put forth strategies to avert asteroid impacts, with kinetic defense emerging as a particularly noteworthy approach [1,2]. Kinetic defense employs a spacecraft traveling at high relative velocity to collide directly with an asteroid, transferring momentum to alter the asteroid's orbit or cause it to fragment.
The final stage of a kinetic defense mission demands precise navigation and guidance capabilities. This is particularly crucial because, compared with the vast interplanetary distances involved, the asteroid's small size and the spacecraft's high closing velocity mean that even the slightest deviation from the intended trajectory can determine whether the target is hit or missed. Moreover, accurately modeling dynamical systems in deep-space environments is often challenging. Consequently, an onboard autonomous guidance and navigation (G&N) system with sufficient robustness to handle navigation errors and model uncertainties becomes a vital mission requirement [3].
In general, spacecraft primarily rely on line-of-sight (LOS) information extracted from images to correct orbital deviations. Many researchers first apply state estimation techniques based on LOS angle information and then feed the estimates to proportional or predictive guidance laws [4-6]. These methods rely heavily on dynamic or observation models, and model uncertainty can significantly impact the mission. In the realm of G&N, numerous papers have addressed the use of artificial intelligence for spacecraft G&N [7,8]. Deep learning approaches use a deep neural network (DNN), often referred to as a policy network, as a parametric model. This DNN directly maps raw measurements obtained from onboard sensors to real-time control actions on the flight hardware. Consequently, the DNN can fulfill both roles of the G&N system, eliminating the explicit state estimation from sensory information required by traditional controllers. The DNN can be trained using reinforcement learning (RL) methods. By introducing various uncertainties into simulated mission scenarios, training data can be collected through the interaction between the agent (the DNN) and the simulated environment. The collected data are then used to train the DNN offline, solving the optimal control problem. This approach avoids reliance on dynamic and observation models while handling multiple sources of uncertainty with improved robustness.
In this paper, an angle-only autonomous G&N method for asteroid kinetic defense is designed using a meta-RL approach. A recurrent neural network is used as the navigation and guidance policy network, and the proximal policy optimization algorithm is used for training. The result is a robust approach to navigation and guidance for asteroid defense missions.

Orbital Equations of Motion
During the mission, the orbital motion of the spacecraft is affected by solar gravity, the gravity of other planets, the gravity of the target, solar radiation pressure, the control force, and disturbance forces caused by attitude control and other factors. According to Newton's law, the orbital dynamics of the mission spacecraft in the J2000 heliocentric-ecliptic coordinate system can be expressed as

$$\ddot{\mathbf{r}} = -\frac{\mu_S}{r^3}\mathbf{r} + \sum_{i}\mu_i\left(\frac{\mathbf{r}_i - \mathbf{r}}{\|\mathbf{r}_i - \mathbf{r}\|^3} - \frac{\mathbf{r}_i}{r_i^3}\right) + \mathbf{a}_{\mathrm{srp}} + \frac{\mathbf{f}}{m} + \mathbf{a}_d \tag{1}$$

where the first term is the central-body point-mass gravity, the second term is the direct and indirect third-body point-mass gravity, the third term is the solar radiation pressure, the fourth term is the control force acting on the spacecraft, and the last term is the disturbance force on the spacecraft. The dynamic model of the asteroid is similar to that of the spacecraft. The orbital dynamics of the target asteroid in the J2000 heliocentric coordinate system can be expressed as

$$\ddot{\mathbf{r}}_a = -\frac{\mu_S}{r_a^3}\mathbf{r}_a + \sum_{i}\mu_i\left(\frac{\mathbf{r}_i - \mathbf{r}_a}{\|\mathbf{r}_i - \mathbf{r}_a\|^3} - \frac{\mathbf{r}_i}{r_i^3}\right) + \mathbf{a}_{\mathrm{srp},a} \tag{2}$$
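To make the force model concrete, the sketch below propagates an equation of this form with SciPy over one control step. The single third body, the cannonball SRP coefficient, and all numerical values are illustrative assumptions, not the paper's exact model.

```python
# Minimal sketch of propagating the heliocentric equations of motion (Eq. (1)).
# Assumptions: one fixed third body, radial cannonball SRP, constant thrust.
import numpy as np
from scipy.integrate import solve_ivp

MU_SUN = 1.32712440018e20    # m^3/s^2, solar gravitational parameter
MU_EARTH = 3.986004418e14    # m^3/s^2, example third body

def dynamics(t, y, f_ctrl, f_dist, mass, r_third, k_srp):
    r, v = y[:3], y[3:]
    rn = np.linalg.norm(r)
    a = -MU_SUN * r / rn**3                                  # central-body gravity
    d = r_third - r                                          # direct + indirect third-body terms
    a += MU_EARTH * (d / np.linalg.norm(d)**3
                     - r_third / np.linalg.norm(r_third)**3)
    a += k_srp * r / rn**3                                   # radial SRP, ~1/r^2 magnitude
    a += (f_ctrl + f_dist) / mass                            # control and disturbance forces
    return np.concatenate([v, a])

# Propagate one 5 s control step with the thrust held constant.
y0 = np.array([1.0e11, 5.0e10, 0.0, -2.0e4, 2.5e4, 0.0])     # position (m), velocity (m/s)
sol = solve_ivp(dynamics, (0.0, 5.0), y0, rtol=1e-9,
                args=(np.array([0.0, 3.0, -2.0]), np.zeros(3),
                      800.0, np.array([1.5e11, 0.0, 0.0]), 1e17))
```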

Sensor model
Assume that the angle sensor is strapped down to the mission spacecraft, with the main optical axis of the sensor coinciding with the X-axis of the body frame; the sensor outputs the target LOS angles in the spacecraft body coordinate system, namely the yaw angle and the pitch angle. The measurement coordinate system of the optical camera coincides with the body coordinate system. The pitch and yaw angles are defined as

$$q_\vartheta = \arctan\frac{b_z}{\sqrt{b_x^2 + b_y^2}}, \qquad q_\psi = \arctan\frac{b_y}{b_x} \tag{3}$$

where $q_\vartheta$ and $q_\psi$ respectively represent the pitch and yaw angles of the target, and $b_x$, $b_y$, and $b_z$ represent the components of the relative position between the spacecraft and the target asteroid along the axes of the body frame.
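A small sketch of this measurement follows; the arctangent sign convention is an assumption consistent with Eq. (3).

```python
import numpy as np

def los_angles(b):
    """Pitch and yaw of the target LOS from the body-frame relative position
    b = [bx, by, bz] (camera boresight along body +X). The sign convention
    is an assumption consistent with Eq. (3)."""
    bx, by, bz = b
    q_pitch = np.arctan2(bz, np.hypot(bx, by))  # elevation out of the body X-Y plane
    q_yaw = np.arctan2(by, bx)                  # azimuth about the body Z axis
    return q_pitch, q_yaw
```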

Controller model
Assume that the mission spacecraft's orbit-control thruster can provide thrust in the Y and Z directions of the body coordinate system, and that the thrust $\mathbf{f}$ satisfies

$$\mathbf{f} = [\,0,\ f_y,\ f_z\,]^T, \qquad |f_y| \le f_{\max}, \quad |f_z| \le f_{\max} \tag{4}$$

where $f_y$ and $f_z$ represent the thrust in the Y and Z directions, respectively, and $f_{\max}$ represents the maximum thrust of the orbit-control thruster.
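A one-line sketch of this actuator constraint, assuming the per-axis bound of Eq. (4) and the 5 N limit quoted later in the mission scenario:

```python
import numpy as np

def apply_thrust_limit(f_yz, f_max=5.0):
    """Saturate the commanded body-frame Y/Z thrust per axis (Eq. (4));
    a norm-bounded thruster would rescale the vector instead."""
    return np.clip(np.asarray(f_yz, dtype=float), -f_max, f_max)
```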

Navigation and guidance issues
The time step $t_h$ is indexed by $h$; the corresponding state is $\mathbf{x}_h$; the observation of state $\mathbf{x}_h$ is $\mathbf{y}_h$; the closed-loop control policy is $\pi$; and the control given by the policy at $t_h$ is $\mathbf{u}_h$. The navigation and guidance problem can then be formulated as a discrete-time Markov decision process with transition

$$\mathbf{x}_{h+1} = F(\mathbf{x}_h, \mathbf{u}_h) \tag{5}$$

where $\mathbf{u}$ is the bounded control. In this article, the state includes the attitude of the mission spacecraft, as well as the relative position and velocity of the mission spacecraft and the target asteroid in the J2000 heliocentric-ecliptic coordinate system.
The LOS angles output by the sensor are expressed in the body coordinate system, which changes with the spacecraft attitude; this is not conducive to the convergence of the model. Therefore, the observation is designed using the LOS angles in the J2000 coordinate system. The LOS angles carry only position information, so the expected intercept time and the time step are further introduced as observation measurements:
$$\mathbf{y}_h = \left[\, t_{go,h},\ dt,\ q_{\vartheta,h}^{J},\ q_{\psi,h}^{J} \,\right] \tag{6}$$

where $t_{go,h}$ is the estimated remaining time at the $h$-th time step, approximately calculated from the initial values provided by ground measurement and control; $dt$ is the time step, determined by the measurement period of the mission spacecraft's angle sensor; and $q_{\vartheta,h}^{J}$ and $q_{\psi,h}^{J}$ are the LOS angles of the target asteroid in the J2000 coordinate system. Since the agent cannot directly obtain the complete state $\mathbf{x}_h$ and can only obtain the observation $\mathbf{y}_h$, the process is a partially observable Markov decision process.
The actual thrust $\mathbf{f}_h$ of the spacecraft is a function of the state $\mathbf{x}_h$ and the control variable $\mathbf{u}_h$:

$$\mathbf{f}_h = f(\mathbf{x}_h, \mathbf{u}_h)$$

To ensure that the target always remains within the sensor's field of view, the spacecraft is kept in attitude-maintenance mode during flight. Therefore, the state update mainly involves the positions and velocities of the mission spacecraft and the target asteroid. The state at the next time step, $\mathbf{x}_{h+1}$, is obtained by integration starting from the state $\mathbf{x}_h$ at time $t_h$ over the fixed control step $dt$, assuming that the thrust $\mathbf{f}_h$ remains constant throughout the step; the state transition can thus be abstracted as $\mathbf{x}_{h+1} = F(\mathbf{x}_h, \mathbf{u}_h)$. Because the future state depends only on the current state, the decision process satisfies the Markov property. The termination condition is defined as

$$t_f = \min\{\, t : d\mathbf{r} \cdot d\mathbf{v} \ge 0 \,\}$$

where $t_f$ represents the termination time of the task, and $d\mathbf{r}$ and $d\mathbf{v}$ represent the relative position and velocity of the spacecraft and the target asteroid.
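As a concrete instance of this formulation, the sketch below implements the step logic in the Gym style the paper cites later [12]. Straight-line relative motion stands in for the full heliocentric dynamics, and all numerical values are illustrative assumptions.

```python
import numpy as np

class TerminalGuidanceSketch:
    """Minimal, self-contained sketch of the partially observable guidance
    problem. Straight-line relative motion replaces the full heliocentric
    dynamics; initial conditions and scales are illustrative assumptions."""

    def __init__(self, dt=5.0, f_max=5.0, mass=800.0, seed=0):
        self.dt, self.f_max, self.mass = dt, f_max, mass
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Scattered initial relative state: roughly 15 minutes to intercept.
        self.dr = np.array([9.0e6, 0.0, 0.0]) + self.rng.normal(0.0, 1.0e4, 3)
        self.dv = np.array([-1.0e4, 0.0, 0.0]) + self.rng.normal(0.0, 5.0, 3)
        self.t = 0.0
        self.t_hit = -self.dr[0] / self.dv[0]   # ground-supplied intercept-time estimate
        return self._obs()

    def _obs(self):
        # Observation of Eq. (6): remaining time, time step, and LOS angles.
        t_go = max(self.t_hit - self.t, 0.0)
        q_pitch = np.arctan2(self.dr[2], np.hypot(self.dr[0], self.dr[1]))
        q_yaw = np.arctan2(self.dr[1], self.dr[0])
        return np.array([t_go, self.dt, q_pitch, q_yaw])

    def step(self, u):
        # u in [-1, 1]^2 commands body Y/Z thrust, held constant over the step.
        f = self.f_max * np.clip(np.asarray(u, dtype=float), -1.0, 1.0)
        a = np.array([0.0, f[0], f[1]]) / self.mass
        self.dr += self.dv * self.dt - 0.5 * a * self.dt**2  # dr = r_target - r_sc
        self.dv -= a * self.dt
        self.t += self.dt
        done = bool(np.dot(self.dr, self.dv) >= 0.0)         # past closest approach
        reward = 0.0                                          # see the reward discussion below
        return self._obs(), reward, done, {}
```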
The main goal of the guidance is to reduce the final intercept deviation, which is evaluated only at the end of the task. Guidance tasks are hard to explore, and rewarding only the main objective leads to a lack of feedback signals and learning difficulties, known as the sparse reward problem. This paper overcomes the sparse reward problem by adding auxiliary rewards: a zero-effort-miss (ZEM) deviation reward, a cumulative velocity-increment reward, and a maneuver reward, where the ZEM is calculated as described in [4]. The overall reward function $R_h$ can be expressed as

$$R_h = \begin{cases} -0.02\,\mathrm{ZEM}_h, & t_h < t_f \\ 50, & t_h = t_f \ \text{and}\ dr_f \le 50 \\ 0, & t_h = t_f \ \text{and}\ dr_f > 50 \end{cases}$$

The goal of the agent is to find a control policy $\pi^*$ that maximizes the expected sum of rewards collected along a trajectory $\tau$; a ZEM computation is sketched below.
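Here is the standard zero-effort-miss computation under straight-line relative motion. The paper computes ZEM as in [4], so this particular form is an assumption.

```python
import numpy as np

def zero_effort_miss(dr, dv):
    """Predicted closest-approach distance if no further control is applied,
    assuming straight-line relative motion (the paper follows [4], which may
    differ in detail)."""
    dv2 = np.dot(dv, dv)
    if dv2 == 0.0:
        return float(np.linalg.norm(dr))
    t_go = max(-np.dot(dr, dv) / dv2, 0.0)   # time of closest approach
    return float(np.linalg.norm(dr + dv * t_go))
```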
A trajectory is the sequence of observations, controls, and rewards

$$\tau = \{\mathbf{y}_0, \mathbf{u}_0, R_1, \mathbf{y}_1, \mathbf{u}_1, R_2, \ldots\}$$

With the transition (5), the observation (6), the reward function $R_h$, and the termination condition defined above, the description of the partially observable Markov decision process for the terminal guidance problem is complete.

Policy Network
The core of solving the navigation and guidance problem through RL is to construct a DNN $\pi_\theta$ by determining the parameter set $\theta$, comprising the connection weights between neurons and the neuron biases, so as to approach the exact control policy and maximize the average reward.
For discrete-time systems, at each step $t_h$ the current observation $\mathbf{y}_h$ is fed to the neural network $\pi_\theta$, which returns the mean and standard deviation of a multivariate Gaussian distribution over the control. The symbol $\pi_\theta$ therefore also denotes the probability of returning a given control $\mathbf{u}$ when $\mathbf{y}$ is observed. To ensure extensive exploration of the action space, during training the actual control is sampled from this distribution:

$$\mathbf{u}_h \sim \pi_\theta(\cdot \mid \mathbf{y}_h)$$

During final policy deployment or evaluation, the agent no longer explores and returns the optimal (mean) control for the given observation. Observations based on the target LOS angle have a drawback: the current observation cannot provide all the information the agent needs to make an informed decision on the next control action. In fact, observing only the target LOS angle is not sufficient to capture the full system state at a given step in an asteroid defense scenario; in particular, velocity-related information is missing. For this reason, policy-gradient RL methods usually struggle with such partially observable scenarios when only a standard fully connected network is used as the control policy. Therefore, a recurrent neural network is used in the policy network [9]; this is a meta-RL method.
The policy network $\pi_\theta$ used in this article is composed of a multi-layer perceptron (MLP) and a long short-term memory (LSTM) layer. Observations are first fed into an MLP with three fully connected layers to increase the nonlinearity of the network and allow complex relationships to be represented. The output of the fully connected layers and the previous control $\mathbf{u}_{h-1}$ are then fed into the LSTM, a recurrent layer that captures the temporal relationships in the observation sequence and plays a key role in the policy network. The unrolled connections are shown in Figure 1.
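A hedged PyTorch sketch of such a recurrent Gaussian policy follows. The PyTorch framework, the hidden widths, and the tanh activations are assumptions; the paper specifies only an MLP with three fully connected layers feeding an LSTM, with the previous control appended.

```python
import torch
import torch.nn as nn

class RecurrentGaussianPolicy(nn.Module):
    """Sketch of the MLP + LSTM policy unrolled in Figure 1. Hidden sizes,
    activations, and the framework itself are illustrative assumptions."""

    def __init__(self, obs_dim=4, act_dim=2, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(                     # three fully connected layers
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        # The MLP output and the previous control u_{h-1} feed the LSTM.
        self.lstm = nn.LSTM(hidden + act_dim, hidden, batch_first=True)
        self.mean = nn.Linear(hidden, act_dim)        # Gaussian mean head
        self.log_std = nn.Parameter(torch.zeros(act_dim))

    def forward(self, obs, prev_u, hc=None):
        # obs: (batch, seq, obs_dim); prev_u: (batch, seq, act_dim)
        z = self.mlp(obs)
        z, hc = self.lstm(torch.cat([z, prev_u], dim=-1), hc)
        return self.mean(z), self.log_std.exp(), hc

def act(policy, obs, prev_u, hc=None, deterministic=False):
    """Sample u_h ~ pi_theta(.|y_h) during training; return the mean at
    deployment, when the agent no longer explores."""
    mean, std, hc = policy(obs, prev_u, hc)
    u = mean if deterministic else torch.distributions.Normal(mean, std).sample()
    return u, hc
```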

Proximal Policy Optimization Algorithm
In this article, proximal policy optimization (PPO), one of the state-of-the-art policy-gradient algorithms, is used for training [10]. The predecessor of PPO is trust region policy optimization (TRPO), which suffers from a complicated implementation and an expensive KL-divergence computation. The clipped variant of PPO removes the KL divergence from the objective function; the objective to maximize is

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_h\left[\min\left(r_h(\theta)\hat{A}_h,\ \mathrm{clip}\!\left(r_h(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_h\right)\right]$$

where $r_h(\theta)$ is the probability ratio between the new and old policies and $\hat{A}_h$ is the advantage estimate. The operator $\min$ selects the smaller of the two terms. The clip function bounds the probability ratio: if $r_h(\theta)$ is below $1-\epsilon$, it outputs $1-\epsilon$; if it is above $1+\epsilon$, it outputs $1+\epsilon$; otherwise it outputs $r_h(\theta)$ unchanged. $\epsilon$ is a hyperparameter, usually set to 0.1 or 0.2. By directly clipping the objective used for the policy gradient, the PPO-clip method yields more conservative updates and stable learning performance.
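A minimal sketch of this clipped surrogate, negated for gradient descent; the advantage estimator and minibatching are omitted, and PyTorch is an assumption.

```python
import torch

def ppo_clip_loss(logp_new, logp_old, adv, eps=0.2):
    """Negative clipped surrogate objective: maximizing L^CLIP equals
    minimizing this loss. eps = 0.2 is the common setting noted above."""
    ratio = torch.exp(logp_new - logp_old)                  # probability ratio r_h(theta)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    return -torch.min(unclipped, clipped).mean()
```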

Mission Scenario
The asteroid Bennu is selected as a case study in this paper; its orbital data are shown in Table 1.
Table 1. Bennu Kepler elements (epoch: 59000.0 MJD) [11]

The orbit of the mission vehicle is designed based on the nominal orbit of the asteroid. The mission begins on 2036 Sept 18, 00:00:00 UTC, and the rendezvous takes place on 2037 Jan 16, 00:00:00 UTC. The mass of the mission vehicle is 800 kg, the maximum thrust of the orbital control thruster is 5 N, and the control period is 5 s. In this case, terminal guidance starts operating about 15 minutes before the rendezvous.

Training Behavior
In this paper, the PPO algorithm is implemented in Python 3.7. The simulation environment is a custom Python class derived from the OpenAI Gym formalization [12].
After an extensive trial-and-error tuning procedure on the problem, suitable values for the PPO hyperparameters were identified. The learning rate decreases linearly over the course of training. The corresponding values are documented in Table 3. Training the policy takes about 8 hours on an AMD Ryzen 5 5600X CPU @ 3.70 GHz and an NVIDIA GeForce RTX 3060 Ti GPU.
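The linear decay can be realized as below; the initial rate and total update count are placeholders, not the Table 3 values.

```python
def linear_lr(update_idx, total_updates, lr0=3.0e-4):
    """Linearly anneal the learning rate from lr0 to 0 over training.
    lr0 and total_updates are placeholder values, not those of Table 3."""
    return lr0 * max(1.0 - update_idx / total_updates, 0.0)
```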
Figure 2 illustrates a consistent and smooth learning process throughout the training session. The average reward improved steadily in the training environment, without notable fluctuations, until the end of the optimization. The curve's non-zero slope near the final iteration suggests that there is still room for further performance improvement.

Monte-Carlo Simulation Analysis
The effectiveness of the trained policy was verified through Monte Carlo simulation. The trained policy interacted with the environment 1000 times; the distribution of rendezvous points in the target plane is shown in Figure 3. The average miss distance is 4.41 m with a standard deviation of 2.51 m, and the maximum miss distance is 10.91 m. The major axis of the 99%-confidence error ellipse is 13.68 m and the minor axis is 6.85 m. The mean miss distance is 2.26 m in the Y direction and 4.51 m in the Z direction. The policy trained using RL effectively corrects the trajectory deviation at the end of the mission, enabling the spacecraft to achieve an accurate rendezvous with the target asteroid.
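Statistics of this kind can be reproduced from the Monte-Carlo samples as sketched below; interpreting the 99% ellipse as a two-degree-of-freedom chi-square bound on the sample covariance is an assumption about the paper's definition.

```python
import numpy as np

def miss_statistics(yz, chi2_99_2dof=9.210):
    """Summarize N x 2 target-plane rendezvous points (Y, Z offsets in m):
    mean/std/max miss distance and the full axis lengths of the
    99%-confidence error ellipse, assuming Gaussian scatter."""
    yz = np.asarray(yz, dtype=float)
    miss = np.linalg.norm(yz, axis=1)
    cov = np.cov(yz, rowvar=False)
    eig = np.sort(np.linalg.eigvalsh(cov))[::-1]          # major, minor variances
    axes = 2.0 * np.sqrt(chi2_99_2dof * eig)              # full major/minor axes
    return miss.mean(), miss.std(), miss.max(), axes
```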

Conclusion
This paper presented a novel angle-only autonomous guidance and navigation algorithm for asteroid defense based on meta-RL. A recurrent neural network, trained via proximal policy optimization, maps the line-of-sight angles captured in real time by the onboard camera to the optimal thrust. The neural network effectively replaces the navigation and guidance system while removing the dependence on dynamic and observation models. The G&N model was tested in numerical simulations of a mission to asteroid Bennu, with the objective of hitting the asteroid precisely despite scattered initial conditions, uncertain model parameters, thruster control error, and attitude control and measurement error.
The final Monte Carlo simulations demonstrated an average miss distance of 4.41 m with a standard deviation of 2.51 m and a maximum miss distance of 10.91 m. The recurrent G&N neural network is capable of compensating for the considered uncertainties and of making the spacecraft hit the target asteroid. This paper provides a methodological reference for the design of navigation and guidance systems for asteroid defense missions.

Figure 1. Unrolling of the policy network

Figure 2. Episode reward change curve

Figure 3. Distribution of rendezvous points in the target plane

Table 2. Error sources and error values considered in this paper

Table 3. PPO hyper-parameters