DeepGCNs-Att for Point Cloud Semantic Segmentation

Compared with traditional CNNs, Graph Convolutional Networks (GCNs), whose neural network structure operates directly on graphs, can better characterize non-Euclidean data. Furthermore, as the number of network layers increases, deep GCNs demonstrate outstanding performance in mining the partial relationships among the point cloud's local features. However, current deep GCN algorithms cannot sufficiently exploit the point cloud's global characteristics for semantic segmentation. This paper proposes a novel network structure called DeepGCNs-Att to efficiently aggregate global context features. Moreover, to speed up the computation, we add an attention layer after the GCN Backbone Block to mutually enhance the connections between distant points of the non-Euclidean space. Our model is tested on the standard benchmark S3DIS. Compared with other deep GCN algorithms, our DeepGCNs-Att's mIoU is at least two per cent higher than that of other models, and it even shows excellent results in space complexity and computational complexity under the same number of GCN layers.


Introduction
In recent years, deep learning has been applied to various image processing applications, such as image classification [1] and semantic segmentation [2]. However, due to the lack of depth information, two-dimensional image data has certain limitations and cannot fully perceive the surrounding environment, which has further facilitated the rapid development of three-dimensional sensors. By transforming natural scenes into three-dimensional point clouds, researchers can use every single point to represent a 3D geometric coordinate, RGB colour, normal vector, and other information. However, the irregular format of point clouds distinguishes 3D datasets from 2D ones, making it difficult for convolutional neural networks [3] to work efficiently on point cloud data.
To solve this problem, the methods that researchers have explored can be divided into four mainstream directions: 3D convolution, multi-view projection onto images followed by 2D convolution, 1D/2D convolution or Multi-Layer Perceptrons (MLPs) on the point cloud, and Graph Convolutional Networks (GCNs). PointNet [4], using MLPs and max-pooling, is concise and effective and is widely used in feature extraction for point cloud processing. However, PointNet's shortcomings are conspicuous: it extracts only global features and ignores the influence of local features, which limits its accuracy in more complex scenes. PointNet++ [5] then proposed improvements that construct local and global features through sampling and grouping layers; multi-scale grouping and multi-resolution grouping are designed to capture different features in densely sampled regions. At the same time, too many empirical hyperparameters that must be set manually lead to limited results. Besides, because graphs can efficiently represent point clouds, more and more networks use GCNs for point cloud processing. As the number of GCN layers increases, the problems of vanishing gradients and over-smoothing appear. DeepGCNs [6], inspired by dilated convolutions and DGCNN [7], construct a dilated graph and solve the problems above. However, DeepGCNs use only one max-pooling layer at the end to aggregate global features, without considering the connection between each point and the global information.
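The dilated graph that DeepGCNs borrow from dilated convolutions can be sketched as follows: take each point's k·d nearest neighbours and keep every d-th one, enlarging the receptive field without adding edges. This is a minimal numpy illustration; the function name and shapes are our own, not taken from the paper's code.

```python
import numpy as np

def dilated_knn(points, k, d):
    """For each point, find its k*d nearest neighbours and keep every
    d-th one, giving k neighbours with an enlarged receptive field."""
    # pairwise squared distances, shape (n, n)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)
    # sort neighbours by distance; column 0 is the point itself
    order = np.argsort(dist, axis=1)
    # skip self, take the k*d nearest, then subsample with stride d
    return order[:, 1:k * d + 1:d]

rng = np.random.default_rng(0)
pts = rng.standard_normal((32, 3))
idx = dilated_knn(pts, k=4, d=2)
print(idx.shape)  # (32, 4)
```

With d = 1 this reduces to an ordinary k-NN graph, so dilation is a free parameter for trading locality against receptive-field size.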
In this paper, we use ResGCN as the GCN Backbone Block, utilize a Multi-Layer Perceptron (MLP) for dimensionality reduction in the output layer of the network, and then apply a dual attention module, including Spatial-wise Attention and Channel-wise Attention, which adaptively aggregates global features. Our contributions can be summarized as follows: (1) We propose a novel neural network model with higher accuracy and faster computational speed than others under the same number of GCN layers.
(2) In the output layer of ResGCN, dimensionality reduction is performed through an MLP, and the attention layer directly outputs the classification of each point instead of a max-pooling layer.
(3) The experiments on point cloud segmentation indicate that our model is robust to sampling density variation and achieves high accuracy, with better OA and mIoU than DeepGCNs under the same number of network layers.

Methodology
Point cloud segmentation takes a 3D point cloud as input and outputs the classification of each point. Our novel model consists of two parts, namely the GCN Backbone Block and the Prediction Block, which together form an end-to-end point cloud semantic segmentation network based on GCNs. Our model is shown in figure 1.

GCN Attention Backbone
The point cloud is difficult to process with traditional convolutional neural networks because of its spatial disorder, so ResGCN uses GCNs to represent the point cloud as an undirected graph G = (V, E), as shown in figure 2, where V and E are the sets of n nodes and e edges of each layer. Each layer of graph convolution performs the following update and aggregation operations, as shown in equation (1) [6]:

G_{l+1} = F(G_l, W_l) = Update(Aggregate(G_l, W_l^agg), W_l^update),    (1)

where, in the l-th layer, F(·) is the graph convolution operation, G_l and G_{l+1} represent the input and output of each layer, respectively, and W_l^agg and W_l^update are the learnable aggregation and update weights. At the end of each GCN layer, each vertex's information is updated through a 1×1 convolution. When a graph is used as input, the aggregation function aggregates the feature information of each vertex's neighbourhood, and the update function renews the representation of each node through a nonlinear transformation to learn new features, as shown in equation (2):

h_{v_{l+1}} = φ(h_{v_l}, ρ({h_{u_l} | u_l ∈ N(v_l)}, h_{v_l}, W_ρ), W_φ),    (2)

where ρ and φ are the aggregation and update functions, respectively, and W_ρ and W_φ contain their learnable parameters, which are essential in a GCN. In most GCNs, the algorithm only updates the vertices' features without updating the edges of each layer. This structure allows a node to learn only its closer neighbours' features. Nevertheless, when points that are far apart in the non-Euclidean space have similar characteristics, a traditional GCN is likely to lose this critical information during learning. EdgeConv [7] can update the graph in each layer by searching for each node's neighbouring points. The aggregation function is as follows:

ρ(h_{v_l}) = max({h_{u_l} − h_{v_l} | u_l ∈ N(v_l)}).    (3)

As shown in figure 2, ResGCN subtracts the vertex from its neighbours to distinguish different features and uses max-pooling for feature aggregation.
As the number of network layers increases, each node gains information about its distant neighbours, capturing global feature information. ResGCN then utilizes an MLP as the update function, with batch normalization to speed up gradient descent, and uses ReLU as the activation function to achieve the nonlinear transformation.
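The EdgeConv-style aggregation and MLP update described above can be sketched in a few lines of numpy. This is an illustrative simplification under our own naming, assuming per-vertex features of shape (n, c) and precomputed neighbour indices; the real model operates on batched TensorFlow tensors with batch normalization.

```python
import numpy as np

def edgeconv_aggregate(h, nbr_idx):
    """EdgeConv-style aggregation: for every vertex, form edge features
    h_u - h_v over its neighbours and max-pool them."""
    # h: (n, c) vertex features; nbr_idx: (n, k) neighbour indices
    edge_feat = h[nbr_idx] - h[:, None, :]   # (n, k, c) differences
    return edge_feat.max(axis=1)             # (n, c) max-pooled

def mlp_update(h, w, b):
    """Shared MLP / 1x1-conv update with a ReLU nonlinearity."""
    return np.maximum(h @ w + b, 0.0)

rng = np.random.default_rng(1)
h = rng.standard_normal((8, 4))
nbr = rng.integers(0, 8, size=(8, 3))
out = mlp_update(edgeconv_aggregate(h, nbr), rng.standard_normal((4, 4)), np.zeros(4))
print(out.shape)  # (8, 4)
```

Because the same weight matrix is applied to every vertex, the layer is permutation-equivariant, which is what makes it suitable for unordered point clouds.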
Since the algorithm uses ResGCN as the GCN Backbone Block, it can be stacked in multiple layers through residual connections, which solves the vanishing gradient and over-smoothing problems of graph convolutional neural networks.
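The residual stacking itself is simple: each layer adds its input back to its output, h_{l+1} = F(h_l) + h_l. A toy sketch (our own stand-in layer, not the paper's implementation):

```python
import numpy as np

def gcn_layer(h, w):
    """Toy stand-in for a graph-conv layer F: shared linear map + ReLU."""
    return np.maximum(h @ w, 0.0)

def res_gcn_forward(h, weights):
    """Residual stacking h_{l+1} = F(h_l) + h_l, which keeps gradients
    flowing through deep GCN stacks (the ResGCN idea)."""
    for w in weights:
        h = gcn_layer(h, w) + h  # skip connection around each layer
    return h

rng = np.random.default_rng(2)
h0 = rng.standard_normal((16, 8))
ws = [rng.standard_normal((8, 8)) * 0.1 for _ in range(7)]
h7 = res_gcn_forward(h0, ws)
print(h7.shape)  # (16, 8)
```

Since the identity path bypasses every layer, a 7- or 14-layer stack can still propagate gradients to the first layer, which is why deeper ResGCN variants remain trainable.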

Prediction Block
In the prediction block layer, the model's function is to predict the category of all points in the point cloud through a neural network, so our method needs to fuse global features with local features. As shown in figure 3, the prediction block first utilizes an MLP to reduce the dimensionality of each vertex's features to obtain global features and then uses the attention module to learn them. Finally, the classification result of each point is obtained.

Attention Prediction
Unlike the traditional method of directly using softmax in the network's output layer to obtain the classification of each node, our model utilizes the dual attention module [8] in the output layer to strengthen the global context information of each point. Comparing the prediction part with DeepGCNs, we find that our model has fewer parameters and faster prediction speed, and it enriches the graph's output vertex information to obtain better classification results.

Spatial-Wise Attention
The role of Spatial-wise Attention is to adaptively aggregate global spatial feature information in the non-Euclidean space, as shown in equation (4):

H_j = α Σ_{i=1}^{n} s_{ji} A_i + A_j,    (4)

where H_j represents the weighted sum of the spatial feature dependencies s_{ji} between each pair of vertices and the original feature A_j, and α is a learnable scale factor. The Spatial-wise Attention module can adaptively learn vertices with similar features that are distant in space.
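A minimal numpy sketch of equation (4) follows. For brevity it folds the query/key/value projections of the dual attention module [8] into the identity, so the attention map is just the softmax of pairwise feature similarities; in the real module these come from separate learned transforms and alpha is learned.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(a, alpha=1.0):
    """Spatial-wise attention: each vertex's output is a similarity-weighted
    sum over all vertices' features plus a residual of its own feature.
    a: (n, c) vertex features."""
    s = softmax(a @ a.T, axis=1)   # (n, n) attention map s_{ji}
    return alpha * (s @ a) + a     # weighted sum + residual

rng = np.random.default_rng(3)
a = rng.standard_normal((10, 6))
h = spatial_attention(a)
print(h.shape)  # (10, 6)
```

Because s_{ji} spans all vertex pairs, two distant points with similar features reinforce each other, which is exactly the long-range dependency a local GCN neighbourhood misses.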

Channel-Wise Attention
Channel-wise Attention is similar to Spatial-wise Attention. In each layer's output of ResGCN, the high-dimensional channel feature map can be regarded as a map of classes, so the self-attention module utilizes the interdependence between channels to learn a global feature representation. So far, our model has made full use of the rich local features in the GCN Backbone Block while considering both the spatial correlation between features and the interdependence between feature maps. Thus, our neural network can produce more accurate classification results in complex three-dimensional scenes.
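The channel branch can be sketched the same way, with the attention map taken over the c channels instead of the n vertices (again a simplification of the dual attention module [8], with the learnable scale exposed as a plain parameter):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_attention(a, beta=1.0):
    """Channel-wise attention: a (c, c) map models the interdependence
    between feature channels, treating each channel as a class response.
    a: (n, c) vertex features."""
    x = softmax(a.T @ a, axis=1)   # (c, c) channel affinity map
    return beta * (a @ x.T) + a    # reweight channels + residual

rng = np.random.default_rng(4)
a = rng.standard_normal((10, 6))
h = channel_attention(a)
print(h.shape)  # (10, 6)
```

The two branches are complementary: the spatial map is n×n while the channel map is only c×c, so adding the channel branch costs little when c is much smaller than the number of points.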

Dataset
The S3DIS dataset covers six different areas obtained by scanning indoor environments, including 271 rooms, 13 object classes (ceiling, floor, wall, beam, column, window, door, table, chair, sofa, bookcase, board and clutter) and 11 scene types (offices, meeting rooms, corridors, auditoriums, open spaces, lobbies, lounges, pantries, copy rooms, storage rooms and toilets), which form a rich three-dimensional indoor structure. Compared with other datasets, S3DIS has a more complex spatial structure, making semantic segmentation more challenging.

Hardware Configuration
We utilized the TensorFlow framework to implement the network model. Our computational device was equipped with an Intel(R) Core(TM) i9-9900K CPU @ 3.60 GHz and two NVIDIA RTX 2080 Ti GPUs.

Implementation
To describe our model more clearly, we denote the DeepGCNs-Att models with different numbers of layers as ResGCN-Att-7 and ResGCN-Att-14. The Adam optimizer is used for network training, with an initial learning rate of 0.01. The batch sizes of ResGCN-Att-7 and ResGCN-Att-14 are set to 8 and 6, respectively. On area 5 and over the 6-fold evaluation of S3DIS, we compared our model with PointNet [4], SEGCloud [9], RSNet [10], MS+CU [11], G+RCU [11], PointNet++ [5] and 3DRNN+CF [12]. Due to hardware limitations, the batch size of ResGCN-Att-28 could only be set to 2, which directly caused serious gradient oscillation and inaccurate final test results. We therefore did not test ResGCN-Att-28, but the analysis of ResGCN-Att-7 and ResGCN-Att-14 suggests that ResGCN-Att-28 would achieve an even better result.
The overall accuracy (OA) and mean intersection over union (mIoU) are utilized to evaluate the semantic segmentation performance of the network model. The results of semantic segmentation can be divided into four categories: true positive (TP), false positive (FP), true negative (TN) and false negative (FN). The mIoU measures the ratio of the intersection and union of the sets of true and predicted values, that is, mIoU = TP / (TP + FP + FN).
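The IoU formula above is computed per class and then averaged. A small self-contained sketch (classes absent from both prediction and ground truth are skipped so they do not distort the mean):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Per-class IoU = TP / (TP + FP + FN), averaged over the classes
    that occur in the prediction or the ground truth."""
    ious = []
    for c in range(num_classes):
        tp = np.sum((pred == c) & (gt == c))
        fp = np.sum((pred == c) & (gt != c))
        fn = np.sum((pred != c) & (gt == c))
        denom = tp + fp + fn
        if denom > 0:
            ious.append(tp / denom)
    return float(np.mean(ious))

pred = np.array([0, 0, 1, 1, 2, 2])
gt   = np.array([0, 1, 1, 1, 2, 0])
print(round(mean_iou(pred, gt, 3), 4))  # 0.5
```

Note that TN does not appear in the IoU of a class, which is why mIoU penalizes both over- and under-segmentation while staying insensitive to the dominant background.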

Flops and Params
The calculation of a neural network can be regarded as the input and output of data: the memory occupied by the model corresponds to its Params, and the number of Flops is an essential indicator for evaluating the neural network's overall performance; together they correspond to the algorithm's computational and space complexity. The results of OA and mIoU are shown in table 3. In area 5, our model's mIoU is very close to that of ResGCN-28 and exceeds the existing network models, such as PointNet [4] (41.09), SEGCloud [9] (48.92) and RSNet [10] (51.93). Over the 6-fold evaluation, our model achieves accuracy close to the original network model and surpasses the current models on mIoU, while having fewer parameters and faster computational speed. To test the overall model performance on the whole dataset, we performed 6-fold cross-validation. In