Using Skeleton-Based Graph Convolutional Network Pedestrian Intention Estimation Models for Trajectory Prediction

In autonomous driving scenarios, pedestrian trajectory prediction is an important research direction. We propose a new pedestrian trajectory prediction algorithm based on the spatial temporal graph convolutional neural network. The new algorithm constructs a series of new models around pedestrian intention estimation. The estimation algorithm considers the following aspects: the contextual information of pedestrians and the surrounding environment; the "pedestrian ego-vehicle" interaction combined with vehicle speed estimation; and the pedestrian's own skeletal structure and body language, including the relative structural relationship between the head joints and the torso joints, such as whether they lie out of the same plane or are rotated. Skeleton features are extracted and constructed with a graph convolutional neural network, which structures the pedestrian's joints as a graph in non-Euclidean space; a spatial temporal graph convolutional network is further adopted for feature extraction and learning. The new method is named the "head-torso"-based spatial temporal graph convolutional network (HT-STGCN). On the PIE dataset, the novel method achieves substantial improvements over mainstream methods. Experimental results show that combining HT-STGCN with observed actions can improve trajectory prediction.


Introduction
In autonomous driving scenarios and in Advanced Driver Assistance Systems (ADAS) for urban environments, it is important to predict pedestrian intent. Based on the predicted intent, the pedestrian trajectory can be further estimated to anticipate the pedestrian's next actions, which can greatly reduce the risk of accidents. Ref. [1] proposed an image-based 2D pose estimation algorithm for pedestrian intention estimation; relevant behaviours include crossing the road in front of the ego-vehicle, stopping suddenly before entering the road, and so on. They evaluated the effectiveness of the 2D pose estimation algorithm on the JAAD dataset [2], and the proposed pipeline provides satisfactory results.
Based on two-dimensional pose estimation, Fang et al. [3] proposed an algorithm for pedestrian and cyclist intention recognition. Ref. [4] proposed the Composite Fields method, which can be effectively used for human pose estimation.
We focus on two types of areas: intersections and their surroundings, and bus stops and their surroundings. These are high-incidence areas for pedestrians crossing the road, so pedestrian behaviour and intentions there are the focus of our attention. Pedestrian behaviour in these areas can be of several types: pedestrians who simply stay in the area and do not intend to cross the road; pedestrians who wander in the area and intend to cross the road but have not yet started crossing; pedestrians who intend to cross the road and do cross it; and pedestrians near the bus platform who look around for an arriving bus and have no intention to cross the road, yet still require high alertness from driving vehicles. The specific details are shown in figure 1. Action recognition methods include RGB-based methods and optical flow-based methods, and many scholars have proposed a variety of optimization algorithms in this line of research [5][6][7][8][9].
Action recognition based on skeleton and joint features has obvious advantages. For example, when lighting conditions change significantly, the RGB information of the same action changes significantly, but the skeleton and joint information still has distinctive features, which ensures good generalization performance. In a variety of application scenarios, action recognition algorithms based on joint trajectories and skeletons have excellent robustness [10].
Under existing technical conditions, it is no longer difficult to obtain skeleton and joint trajectory features [10]. Joint features can be obtained with professional depth-image capture equipment, with related pose estimation algorithms, or by other means.

Skeleton-Based Action Recognition
Shotton et al. proposed a method for real-time human pose recognition based on features of body parts in a single depth image [11], which uses depth information, without temporal features, to predict the 3D positions of human joints from a single image. The pose estimation problem is transformed into a pixel classification problem, and good generalization ability is achieved relative to accurate full-skeleton nearest-neighbour matching. Du et al. [12] proposed a hierarchical recurrent neural network based on skeletal features to achieve better action recognition performance.
Using skeleton features, Yan et al. proposed an action recognition algorithm based on spatial temporal graph convolutional networks [13]. This algorithm can automatically learn spatio-temporal features from data, surpassing previous methods in this regard, and has stronger generalization ability. On the Kinetics dataset and the NTU-RGBD dataset [14][15][16], the novel algorithm achieved better performance than other advanced methods. Spatial temporal graph convolutional networks [17,18] have been extensively and intensively studied recently and have been applied in many fields with satisfactory results. Using actional-structural graph convolutional networks, Li et al. proposed a skeleton-based action recognition method [15]. In their method, an encoder-decoder structure is introduced. This method can capture richer dependencies; the module captures action-specific latent dependencies, including actional links.

Graph Representation of Joint Nodes
Ref. [13] introduced a method of constructing a spatial temporal graph convolutional network, using human skeleton joints as modeling objects and combining the concepts of joints and body parts to achieve more efficient algorithm performance [14,19,20]. The construction of the skeleton-based graph convolutional neural network model in Ref. [13] can be divided into two steps: first, the design of the spatial graph convolution network; second, the extension of the spatial graph to the temporal dimension. In the design of the spatial graph convolution network, each frame in the video is taken as the research object. The purpose is to extract the spatial feature information of every frame, especially the features of the pedestrian object. The concept of "body parts" is used to impose local limitations on joint trajectories, which effectively realizes a hierarchical representation of skeletal sequences and further enables automatic extraction of object features through convolutional networks.
Skeleton information can be obtained using the OpenPose toolbox [10] for 2D pose estimation. For each pedestrian in each frame, we usually extract 18 joint points to form a set of joint coordinates. Each joint acts as a node. The node set can be expressed as

V = { v_{ti} | t = 1, ..., T; i = 1, ..., N },

where N is the number of nodes. The natural connections between the joints of the body are regarded as edges. The edge set consists of two subsets: the subset of edges within a frame and the subset of edges between consecutive frames. The intra-frame edge subset is the set of skeletal edges formed by the N joint points of the frame at time t, which can be expressed as

E_S = { v_{ti} v_{tj} | (i, j) ∈ H },

where H is the set of natural connections of human joints. The inter-frame edge subset connects the same joint across consecutive frames, expressed as

E_F = { v_{ti} v_{(t+1)i} },

where v_{ti} represents the node of the i-th joint at time t and v_{(t+1)i} represents the node of the same i-th joint at time t + 1. The nodes and edges defined in this way constitute the spatial temporal graph of the human skeleton.
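The node and edge sets above can be sketched in code. This is an illustrative toy (not the authors' implementation); the joint indices and edge list H follow the 18-joint OpenPose layout, and only a subset of the skeletal edges is listed for brevity.

```python
import numpy as np

N = 18  # joints per pedestrian per frame
T = 4   # frames in this toy sequence

# Intra-frame edges H: natural skeletal connections (subset shown for brevity).
H = [(0, 1), (1, 2), (2, 3), (3, 4), (1, 5), (5, 6), (6, 7),
     (1, 8), (8, 9), (9, 10), (1, 11), (11, 12), (12, 13)]

def spatial_adjacency(n_joints, edges):
    """Symmetric adjacency matrix for the intra-frame skeletal edge set E_S."""
    A = np.zeros((n_joints, n_joints))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0
    return A

A = spatial_adjacency(N, H)

# Inter-frame edges E_F connect the same joint i at times t and t+1; over a
# sequence of T frames they form (T-1) * N temporal edges.
temporal_edges = [((t, i), (t + 1, i)) for t in range(T - 1) for i in range(N)]

print(A.sum())              # 2 * len(H): each edge counted in both directions
print(len(temporal_edges))  # (T - 1) * N
```

The same adjacency pattern is reused for every frame, which is what lets the spatial graph convolution share weights across time.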

Neighbor Set and Sampling Function
The neighbour set of a node can be represented as

B(v_{ti}) = { v_{tj} | d(v_{tj}, v_{ti}) ≤ D },

where d(v_{tj}, v_{ti}) denotes the shortest distance between the two nodes. Generally, if D = 1 is set, then the sampling function has the form

p(v_{ti}, v_{tj}) = v_{tj},

which samples directly on the neighbour set. For the spatial temporal graph with the above structure, the definition of the weight function is based on the methods introduced in Refs. [21,22]. The neighbour set B(v_{ti}) of a joint node is divided into K subsets, and the K subsets are numbered separately. The obtained mapping is expressed as

l_{ti} : B(v_{ti}) → {0, ..., K − 1}.

Then, by indexing a (c, K)-dimensional tensor, the weight function can be implemented as w(v_{ti}, v_{tj}) = w′(l_{ti}(v_{tj})).
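A minimal sketch of the D = 1 neighbour set and the label map on a toy chain skeleton (illustrative only; the K = 2 labelling shown is the Distance-partitioning special case, with 0 for the root joint and 1 for its 1-hop neighbours):

```python
import numpy as np

N = 5
A = np.zeros((N, N))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4)]:  # a toy chain skeleton
    A[i, j] = A[j, i] = 1.0

def neighbor_set(A, i):
    """B(v_ti) with D = 1: the root joint plus its directly connected joints."""
    return {j for j in range(len(A)) if A[i, j] > 0 or j == i}

def label_map(A, i):
    """Map each neighbour to one of K = 2 subsets: 0 for the root joint
    itself, 1 for its 1-hop neighbours."""
    return {j: (0 if j == i else 1) for j in neighbor_set(A, i)}

B = neighbor_set(A, 2)   # {1, 2, 3}
l = label_map(A, 2)      # {1: 1, 2: 0, 3: 1}
# With K subsets, the weight function indexes a (c, K) tensor:
# w(v_ti, v_tj) = W[:, l(v_tj)].
```

The label map is what turns an irregular neighbourhood into a fixed-size index, so that a shared (c, K) weight tensor can play the role of the convolution kernel.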

Method
The new method we constructed is a skeleton-based spatial temporal graph convolutional network. We propose a "head-torso" partitioning strategy to achieve efficient pedestrian intent estimation and further improve the performance of pedestrian trajectory prediction models.

Pose Estimation Based on Skeleton
The input to the model is the 2D skeleton of a pedestrian, represented as 2D joint coordinates. These coordinates are obtained by first detecting the pedestrian in the image and then applying a deep-neural-network pose estimation algorithm, following Ref. [10]: we use the OpenPose toolbox to obtain 2D pose estimation results, which yield 18 joint coordinates for each pedestrian. Based on the detected joint coordinates, nodes and edges can be further defined to construct the spatial temporal graph.

Head-Torso Partitioning Strategy
Building on the skeleton-based spatial temporal graph above, further study of Ref. [13] shows that, in the representation of the spatial temporal graph convolutional network, the partition strategy is very important to study and optimize, since it directly determines the label map. Ref. [13] introduced three common partitioning strategies: Uni-labeling, Distance partitioning, and Spatial configuration partitioning. In the third strategy, Spatial configuration partitioning, the number of subsets is K = 3 and the mapping simplifies to l_{ti} : B(v_{ti}) → {0, 1, 2}. Among the three, Spatial configuration partitioning distinguishes a centripetal group and a centrifugal group and achieves the best performance. Based on a comparative analysis of these three strategies, we propose a "head-torso" spatial configuration partitioning model and a head-torso spatial temporal graph convolutional network, called HT-STGCN, as shown in figure 3. We construct a "head joint group" (the head joints) and a "torso joint group" (the joints outside the head). The connections between head joint nodes are expressed as the adjacency matrix A_head, and the connections between torso joint nodes as the adjacency matrix A_torso. This "head-torso" partitioning strategy specifically examines the positional relationship of the face orientation relative to the body posture and can provide steering and action information when pedestrians try to cross the road. For example, if the plane of the head joint group and that of the torso joint group are not the same, indicating that the pedestrian is looking to the left or right, there is reason to judge that the possibility of crossing the road has increased significantly.
The ST-GCN based on the "head-torso" partitioning strategy is implemented following the construction methods of Refs. [13,23]. Within a single frame, the connections between the joints of each pedestrian can be expressed as an adjacency matrix A, with an identity matrix I representing self-connections. For the "head-torso" partitioning, A is additionally split into the partitioned adjacency matrices A_head and A_torso, and the layer computes

f_out = Σ_j Λ_j^{−1/2} A_j Λ_j^{−1/2} f_in W_j,

where A + I = Σ_j A_j, the adjacency matrix being split into the matrices A_0 = I, A_head and A_torso; Λ_j is the degree matrix of A_j; W_j represents a learnable weight matrix; f_in represents an input feature map; and f_out represents an output feature map.
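The split of the adjacency matrix into head and torso partitions can be sketched as follows. This is a toy illustration assuming the OpenPose 18-joint indexing, where joints {0, 14, 15, 16, 17} (nose, eyes, ears) form the head group; the edge-to-partition assignment rule (an edge is "head" if either endpoint is a head joint) is our assumption.

```python
import numpy as np

N = 18
HEAD = {0, 14, 15, 16, 17}
edges = [(0, 1), (0, 14), (0, 15), (14, 16), (15, 17),      # head-related
         (1, 2), (1, 5), (1, 8), (1, 11), (2, 3), (3, 4),
         (5, 6), (6, 7), (8, 9), (9, 10), (11, 12), (12, 13)]

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
I = np.eye(N)

# Assign each edge to the head partition if either endpoint is a head joint,
# otherwise to the torso partition; self-connections form A_0 = I.
A_head = np.zeros_like(A)
A_torso = np.zeros_like(A)
for i, j in edges:
    tgt = A_head if (i in HEAD or j in HEAD) else A_torso
    tgt[i, j] = tgt[j, i] = 1.0

# The partition must reproduce the original graph: A + I = A_0 + A_head + A_torso.
print(np.allclose(A + I, I + A_head + A_torso))  # True
```

Because every skeletal edge lands in exactly one partition, the sum of the partitioned matrices recovers A + I, as required by the layer formula above.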
The importance of joints differs across body parts. The learnable weight matrix is used to differentiate the importance of joints in various parts of the body and thus improve the performance of the model. This "head-torso" partitioning design assigns different importance weights to edges; for example, the "head and neck" joints describe whether a pedestrian intends to turn towards the road. This can effectively improve the performance of pedestrian intent estimation. On top of a good pedestrian intent estimation algorithm, HT-STGCN further performs skeleton-based action recognition and is expected to improve pedestrian trajectory prediction.
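A toy forward pass of one partitioned spatial graph convolution, following the normalised update f_out = Σ_j Λ_j^{−1/2} A_j Λ_j^{−1/2} f_in W_j with a learnable elementwise mask M_j expressing per-edge importance. The partitions, feature sizes, and the degree clipping are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
N, C_in, C_out = 6, 3, 4

def normalise(A):
    """Symmetric normalisation Lambda^{-1/2} A Lambda^{-1/2}
    (degrees clipped to >= 1 to avoid division by zero)."""
    d = np.maximum(A.sum(1), 1.0)
    Dinv = np.diag(d ** -0.5)
    return Dinv @ A @ Dinv

# Self-loops plus two edge partitions (stand-ins for the head and torso groups).
partitions = [np.eye(N)]
A1 = np.zeros((N, N)); A1[0, 1] = A1[1, 0] = 1.0
A2 = np.zeros((N, N)); A2[2, 3] = A2[3, 2] = A2[3, 4] = A2[4, 3] = 1.0
partitions += [A1, A2]

f_in = rng.standard_normal((N, C_in))
W = [rng.standard_normal((C_in, C_out)) for _ in partitions]  # per-partition weights
M = [np.ones((N, N)) for _ in partitions]                     # learnable edge importance

f_out = sum(normalise(A * m) @ f_in @ w for A, m, w in zip(partitions, M, W))
print(f_out.shape)  # (6, 4)
```

During training the masks M_j would be optimised jointly with W_j, letting the network upweight informative edges such as the head-neck connections discussed above.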

Pre-trained Model of HT-STGCN for Pedestrian Trajectory Prediction
We train HT-STGCN on the PIE dataset [24] according to the 6 types of actions annotated there (walking, standing, looking, not looking, crossing, not crossing). In addition, based on the PIE dataset, we added a dataset of Chinese street scenes and annotated the pedestrian action classes. The expanded dataset enhanced the model's generalization to Chinese pedestrian crossing styles.
The training result is used as a pre-trained model that feeds the Decoding module of the "Intention Estimation". The model architecture is shown in figure 4. The HT-STGCN module structure draws on the construction method of "Semi-Supervised GCN" [23]. The model is divided into three functional modules: vehicle speed prediction, intention estimation, and trajectory prediction. The first and third modules adopt the same structure as the algorithm in Ref. [24]. The first module is vehicle speed prediction; its input is the current speed sequence of the ego-vehicle, expressed as {s_{t−m+1}, ..., s_{t−1}, s_t}, and its output is the predicted ego-vehicle speed sequence {s_{t+1}, s_{t+2}, ..., s_{t+n}}. The predicted speed is input into the third, trajectory prediction module. The second module is intention estimation, where the Encoder part is improved by combining two modules, Encoder-1 and Encoder-2. The input of Encoder-1 is the pedestrian joint-coordinate tensor based on the "head-torso" partition. The input of Encoder-2 is a sequence of square image crops around the pedestrian. The results of Encoder-1 and Encoder-2 are combined, then merged with the pedestrian bounding-box coordinates and used as the Encoder result. The Encoder result is input both to the Decoder part of the intention estimation module and to the trajectory prediction module. The third module is trajectory prediction; its other input is the pedestrian bounding-box coordinates. Based on the predicted ego-vehicle speed sequence, the output is the predicted sequence of pedestrian trajectories {l_{t+1}, l_{t+2}, ..., l_{t+n}}.
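The data flow through the three modules can be sketched at the shape level. The module bodies below are stand-ins that only illustrate how the sequences are wired together; the horizon lengths m and n, the crop size, and the stand-in logic are all assumptions, not the authors' code.

```python
import numpy as np

m, n = 15, 45            # observed / predicted horizon (assumed values)
N_JOINTS = 18

speed_obs = np.zeros(m)                    # {s_{t-m+1}, ..., s_t}
joints_obs = np.zeros((m, N_JOINTS, 2))    # "head-torso" joint coordinates
crops_obs = np.zeros((m, 64, 64, 3))       # square crops around the pedestrian
boxes_obs = np.zeros((m, 4))               # pedestrian bounding boxes

def predict_speed(seq, horizon):
    """Module 1 stand-in: hold the last observed speed for the horizon."""
    return np.repeat(seq[-1], horizon)

def estimate_intention(joints, crops, boxes):
    """Module 2 stand-in: Encoder-1 (joints) + Encoder-2 (crops) + boxes -> P(crossing)."""
    return 0.5

def predict_trajectory(boxes, intention, speed_pred):
    """Module 3 stand-in: emit {l_{t+1}, ..., l_{t+n}} from boxes, intention, speed."""
    return np.tile(boxes[-1], (len(speed_pred), 1))

speed_pred = predict_speed(speed_obs, n)
p_cross = estimate_intention(joints_obs, crops_obs, boxes_obs)
traj = predict_trajectory(boxes_obs, p_cross, speed_pred)
print(speed_pred.shape, traj.shape)  # (45,) (45, 4)
```

The point of the sketch is the wiring: the intention estimate and the predicted speed both feed the trajectory module, matching the architecture of figure 4.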

Experiments
The structure of the model and the format of input and output data have been described in detail in the previous section, which will not be repeated here.
The part of the model that needs detailed introduction is the intention estimation module. It consists of three parts: Encoder-1, Encoder-2 and the Decoder. Encoder-1 implements the HT-STGCN; Encoder-2 implements a CNN + Conv-LSTM; the Decoder implements an LSTM and outputs the intent estimation result.
The implementation of Encoder-1 is similar to the structure described in Ref. [13]. Its input is the pedestrian joint coordinates. The input data are preprocessed as follows: the resolution of the input video is adjusted to 340 × 256 and the frame rate to 30 FPS, and the coordinates of the 18 joints of each pedestrian are estimated with the OpenPose toolbox [10]. The HT-STGCN model consists of 7 layers of spatial temporal graph convolution operators. The number of output channels in these 7 layers is 64, 64, 128, 128, 128, 256 and 256, respectively. The temporal kernel size is set to 9 for these layers, and the dropout rate is 0.6. Based on the results of Ref. [13], we first pre-train the HT-STGCN model for action classification on the PIE dataset [24]. The implementation of Encoder-2 is similar to the structure described in Ref. [24]. Encoder-2 contains 64 Convolutional LSTM filters with a kernel size of 2 × 2 and a stride of 1. The input of Encoder-2 is pedestrian-centric and contains the objects in the surrounding area; therefore, the input image is cropped to twice the size of the pedestrian's bounding box. In the comparison of intention estimation, one configuration uses the skeleton coordinates of the pedestrian, another the pedestrian intention estimation of the PIE model [24], and HT-STGCN represents our novel method. Our HT-STGCN model performs pedestrian intention estimation and trajectory prediction well, as shown in figure 5.
Figure 5. Pedestrian intent estimation and real-time display.
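The 7-layer stack described for Encoder-1 can be written down as a configuration, and the temporal resolution traced through it. The strides are our assumption (following Ref. [13], which halves the temporal resolution where the channel count doubles); the channel counts, kernel size and dropout are as stated above.

```python
# (out_channels, temporal_stride) for the 7 spatial temporal graph conv layers.
LAYERS = [
    (64, 1), (64, 1), (128, 2), (128, 1), (128, 1), (256, 2), (256, 1),
]
TEMPORAL_KERNEL = 9
DROPOUT = 0.6

def output_length(t_in, kernel, stride, padding=None):
    """Temporal length after one layer ('same' padding, then stride)."""
    padding = (kernel - 1) // 2 if padding is None else padding
    return (t_in + 2 * padding - kernel) // stride + 1

t = 30  # e.g. one second of skeleton frames at 30 FPS
for channels, stride in LAYERS:
    t = output_length(t, TEMPORAL_KERNEL, stride)
print(t)  # 8: two stride-2 layers reduce 30 frames to 8 temporal steps
```

Tracing the shapes this way is a quick sanity check that the observed window is long enough to survive the stride-2 layers before global pooling.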
The horizontal bar above the pedestrian's head indicates the pedestrian's intention: red bars indicate that the pedestrian is about to cross or is crossing the road, and green bars indicate no intention of crossing. The white translucent ball around the pedestrian's waist indicates the direction of the pedestrian's trajectory: if the white ball is to the right of the pedestrian, the pedestrian is moving to the right, and vice versa. Figure 6 shows the pedestrian intention estimation and trajectory prediction in four scenarios. Table 2 shows the performance comparison between our HT-STGCN method and the PIE method [24] in trajectory prediction. In table 2, location, intention and vehicle speed are provided by the PIE dataset.
In the notation of table 2, one input stands for the pedestrian intention estimation of the PIE model, another for the speed of the ego-vehicle predicted by the PIE model, and a third for the pedestrian intention estimation predicted by our HT-STGCN model. Clearly, the performance of pedestrian intent estimation is significantly improved by the graph-network and skeleton-based method, and the pedestrian trajectory prediction is improved in turn.

Conclusion
On the PIE dataset, we studied pedestrian intent estimation, performed trajectory prediction based on pedestrian intent, and improved upon the latest pedestrian trajectory prediction algorithms. We propose a graph convolutional network algorithm based on the head-torso partitioning strategy. The pedestrian body structure is refined into joint points in the form of graphs in non-Euclidean space, which overcomes the limitation of Euclidean-space images as input. Our new model is superior in performance to the latest mainstream methods.