A Deep Learning Method for Target Pose Estimation on Planetary Surfaces

The key technical challenge of 6D target pose estimation from RGB-D images is making full use of both the 2D image and the 3D point cloud derived from depth. Previous pose estimation methods either extract information from the RGB image and the depth map separately, or rely on expensive post-processing steps, which limits their performance in highly cluttered scenes and real-time applications. This paper therefore proposes a method for estimating the 6D pose of a set of known objects from RGB-D images. The network is a heterogeneous architecture that processes the two data modalities separately and uses a new dense fusion network to extract pixel-level dense feature embeddings, from which the target position and orientation are estimated. In addition, an end-to-end iterative pose refinement procedure is integrated to further improve the estimate, while the point cloud is visualized in real time together with rendered bounding boxes. Experimental results show that the method is effective on open-source datasets.


1.Introduction
6D pose estimation of a rigid body refers to estimating the object's 6D pose in the camera coordinate system, i.e. its 3D position and 3D orientation. The object's own coordinate system can be regarded as the world coordinate system, so the task amounts to recovering the rigid transformation [R|t] between the object's world frame and the camera frame. The significance of 6D pose estimation is that the precise pose of an object supports fine-grained manipulation of it. Any system that must interact with objects in its environment requires 6D pose estimation, and the problem will remain a research hotspot.
According to the input data used, rigid-body 6D pose estimation can be divided into 2D image-based and 3D point-cloud-based methods. Early image-based methods handled richly textured objects: salient feature points are extracted, strongly discriminative descriptors are constructed to obtain matched point pairs, and the PnP method recovers the object's 6D pose. For weakly textured or textureless objects, template-based methods retrieve the 6D pose associated with the most similar template image, or the optimal pose is learned with machine-learning-based voting.
With the emergence of cheap depth sensors, represented by the Kinect in 2011, 2.5D depth images can be acquired alongside RGB images to assist image-based methods. To avoid being affected by texture at all, one can also operate purely in 3D space, where the problem becomes part-to-whole registration from the acquired single-view point cloud to the complete object point cloud. If the object has rich geometric detail, salient 3D feature points can be extracted and descriptors constructed to obtain 3D correspondences, with the initial pose solved by least squares. Random sample consensus (RANSAC) can also generate a large number of candidate 6D poses and select the one with the smallest error. Since 2012, deep learning has taken the lead in 2D vision and has been introduced into object 6D pose estimation, showing excellent performance whether based on pure RGB images, RGB plus depth images, or 3D point clouds alone.
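The least-squares pose recovery from 3D correspondences mentioned above can be sketched with the classic SVD (Kabsch) solution. This is a minimal illustration, not the paper's code; the function name and array shapes are assumptions.

```python
import numpy as np

def rigid_transform_3d(src, dst):
    """Least-squares rigid transform (R, t) mapping src onto dst.

    src, dst: (N, 3) arrays of corresponding 3D points.
    Returns R (3x3 rotation) and t (3,) such that dst ~= src @ R.T + t.
    Classic Kabsch solution via SVD of the cross-covariance matrix.
    """
    src_c = src - src.mean(axis=0)           # centre both clouds
    dst_c = dst - dst.mean(axis=0)
    H = src_c.T @ dst_c                      # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))   # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = dst.mean(axis=0) - R @ src.mean(axis=0)
    return R, t
```

In a RANSAC loop, this solver would be run on minimal subsets of candidate matches, keeping the pose with the smallest error over all correspondences.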
In planetary surface exploration, the terrain must be understood first; rocks, for example, become obstacles while a rover is driving. Accurately and efficiently obtaining the 6D pose of a target from image analysis in advance, so that a robotic arm can carry out grasping experiments and obstacles can be avoided ahead of time, is an important link in deep-space exploration missions. Accurate position and orientation data for small targets are difficult to obtain: teleoperation over deep-space distances carries great uncertainty, and 2D images change dramatically under varying illumination. In relatively complex situations (such as occlusion and low illumination), pose estimation of known objects is only feasible by combining color images with depth images. However, these two kinds of images come from different spaces and dimensions, and fusing features from such heterogeneous sources is a challenge for current techniques.
Extraterrestrial spacecraft missions demand high precision and face irregularly shaped targets, such as loose soil, metallic minerals, irregular rocks, and moving probes. The authors analyze the correlated features of target images and multi-view point-cloud data, design an end-to-end autonomous-localization neural network architecture based on multi-source information, and propose a spatial localization method for detected targets built on dense pixel-level feature fusion, advancing localization technology that fuses perceived image features with depth information for targets detected in dynamic scenes by multi-source perception. In this paper, therefore, the feature vectors of the RGB image and of the point cloud are fused at the level of each pixel, an iterative refinement scheme is trained end to end, and the target pose is estimated rapidly.

2.Implementation Tools
The algorithm was implemented in Python and C++ using PyCharm and Visual Studio. OpenCV, originally developed by the Visual Interaction Group of Intel's Microprocessor Research Lab, provides built-in frame extraction and standard image-processing routines for a wide range of image and video formats.

3.Theory and Methods
The overall network framework consists of two stages. First, the target whose pose is to be estimated is separated by semantic segmentation; the segmentation mask locates the target in the depth map, and the depth values inside the target region are converted into a point cloud. Second, the pose is estimated from the input 2D image and the 3D point cloud [1]. The implementation is as follows:
(1) Semantic segmentation. The semantic segmentation network is an encoder-decoder structure that takes an image as input and generates an (N+1)-channel segmentation map. Each channel is a binary mask whose active pixels mark an object of one of the N possible known classes.
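The hand-off between the two stages, converting masked depth pixels to a camera-frame point cloud with the known intrinsics, can be sketched as a minimal pinhole back-projection. Function and parameter names here are illustrative, not from the paper's code.

```python
import numpy as np

def depth_to_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to camera-frame 3D points.

    depth: (H, W) depth map in metres; mask: (H, W) boolean object mask
    from the semantic-segmentation stage; fx, fy, cx, cy: pinhole camera
    intrinsics. Returns an (N, 3) point cloud for the N masked pixels.
    """
    v, u = np.nonzero(mask)          # pixel rows (v) and columns (u)
    z = depth[v, u]
    x = (u - cx) * z / fx            # invert pinhole model: u = fx*x/z + cx
    y = (v - cy) * z / fy            #                       v = fy*y/z + cy
    return np.stack([x, y, z], axis=1)
```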
(2) Dense feature extraction. First, the segmented depth pixels are converted into a 3D point cloud using the known camera intrinsics, and a PointNet-like structure then extracts geometric features. A geometric embedding network generates dense point features by mapping each of the P segmented points into a d_geo-dimensional feature space; the variant of the PointNet architecture used here [2] applies average pooling as its symmetric reduction. In parallel, an image embedding network maps each pixel of the input RGB crop to a d_rgb-dimensional vector representing the appearance of the input image at that position [3].
(3) Pixel-level dense fusion. This paper describes a new pixel-level dense fusion network whose principle is the one-to-one mapping between the 2D image and the 3D point cloud. First, each 3D point is projected onto the image plane using the known camera intrinsics, and its geometric feature is associated with the image feature of the corresponding pixel, establishing the correspondence between the 3D point cloud and positions in the 2D image. The paired features are then concatenated and passed to a further network, and a symmetric reduction function generates a fixed-size global feature. This global dense-fusion feature enriches each dense per-pixel feature with global semantic information, and each enriched pixel feature is fed into the final network to predict the object's 6D pose.
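The fusion step above, pairing each point's geometric feature with the image feature of its corresponding pixel and appending a global feature, can be sketched as follows (shapes and names are illustrative):

```python
import numpy as np

def dense_fusion(color_feat, geo_feat):
    """Pixel-level dense fusion of appearance and geometry features.

    color_feat: (N, d_rgb) image embeddings at the N pixels that map
    one-to-one to the N selected 3D points; geo_feat: (N, d_geo)
    geometric embeddings of the same points. Each fused feature is
    [color | geometry | global], where the global feature is an
    average-pool symmetric reduction over the per-pixel pairs.
    """
    pixelwise = np.concatenate([color_feat, geo_feat], axis=1)  # (N, d_rgb+d_geo)
    global_feat = pixelwise.mean(axis=0)                        # fixed-size global feature
    global_rep = np.broadcast_to(global_feat,
                                 (pixelwise.shape[0], global_feat.size))
    return np.concatenate([pixelwise, global_rep], axis=1)      # (N, 2*(d_rgb+d_geo))
```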
(4) 6D target pose estimation. The loss compares the object model's point cloud transformed by the ground-truth pose (rotation matrix R and translation vector t, saved when the images were captured) with the same points transformed by the pose predicted at each dense pixel. For the prediction [R̂_i | t̂_i] of the i-th dense pixel, the per-pixel loss is

L_i^p = (1/M) Σ_{j=1}^{M} ||(R x_j + t) − (R̂_i x_j + t̂_i)||,   (1)

where x_j is the j-th of M points sampled from the object model. Every sampled model point is transformed by each predicted pose, and the combined loss is minimized.
The above loss is well defined for well-textured, asymmetric objects, but for symmetric, textureless objects it makes the learning target ambiguous. For symmetric objects, the distance is instead taken to the closest transformed model point:

L_i^p = (1/M) Σ_{j=1}^{M} min_k ||(R x_j + t) − (R̂_i x_k + t̂_i)||.

During network learning, the per-pixel losses are combined and the total loss is minimized. Since a confidence c_i is also predicted for each dense pixel along with its pose, each per-pixel loss is weighted by its confidence with a regularization term:

L = (1/N) Σ_i (L_i^p c_i − w log c_i),

where w is a balancing hyperparameter.
Iterative refinement. The previously predicted pose is taken as an estimate of the target object's canonical frame, and the input point cloud is transformed into this estimated frame. The transformed point cloud is then fed back into the network, and a residual pose is predicted relative to the previous estimate. A dedicated pose-residual estimator network is trained to refine the initial estimate produced by the main network: in each iteration, the image embeddings from the main network are reused and densely fused with the geometric features computed from the newly transformed point cloud, and the pose-residual estimator takes the resulting global feature as input [4]. After K iterations, the final pose estimate is the concatenation of the per-iteration estimates:

p̂ = [R_K | t_K] · [R_{K−1} | t_{K−1}] ⋯ [R_0 | t_0].
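The two pose losses above can be written down directly on M sampled model points; the sketch below uses illustrative names and assumes ground-truth and predicted poses as (R, t) pairs.

```python
import numpy as np

def add_loss(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Asymmetric-object loss: mean distance between corresponding
    model points under the ground-truth and predicted poses."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    return np.linalg.norm(gt - pred, axis=1).mean()

def add_s_loss(model_pts, R_gt, t_gt, R_pred, t_pred):
    """Symmetric-object variant: each ground-truth point is matched to
    its nearest predicted point, so poses that are indistinguishable
    under the object's symmetry are not penalised."""
    gt = model_pts @ R_gt.T + t_gt
    pred = model_pts @ R_pred.T + t_pred
    # (M, M) pairwise distances, then nearest-neighbour per gt point
    d = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=2)
    return d.min(axis=1).mean()
```

For a point set that is symmetric under a 180° rotation, the symmetric loss of that rotated pose is zero while the point-to-point loss is not, which is exactly the ambiguity the min-over-k formulation removes.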

4.Experimental simulation results and analysis
The accuracy of the pose estimation algorithm was tested on the LineMOD dataset; the target pose point cloud and pose data were obtained, and the bounding box was drawn.
Training took 3 days, 2 hours and 53 minutes, covering 104 epochs.

5.Conclusion
For target recognition and pose estimation on planetary surfaces, this paper presents a deep-learning method for 6D target pose estimation that combines 2D and 3D images. Building on the original method [5], it adds real-time visualization of the point cloud and related information, displaying the bounding box and its coordinates. Experimental results show that the method can recover a target's 6D pose and display its bounding box [6], providing a basis for target pose estimation and subsequent robotic-arm grasping in deep-space exploration missions. The accuracy and effectiveness of the method are verified on open-source datasets, laying a foundation for rovers to recognize surface obstacles in future deep-space exploration missions.