Feature Relation based Graph Convolution for 3D Point Cloud Analysis

3D point cloud recognition remains a challenging task, since the shape implied by irregular points is difficult to capture. Standard convolution is inherently limited for such tasks due to its isotropy with respect to features. In this paper, a novel graph convolutional network is introduced for point cloud classification and segmentation. The proposed convolution adds an additional layer on top of the relation-shape convolutional neural network, which obtains more information and makes the representation of the point cloud more robust. At the same time, a feature relation method is proposed to replace the coordinate relation used in the relation-shape convolutional neural network. Experimental results on challenging classification and segmentation datasets show that the proposed method can learn discriminative features for recognition and semantic segmentation.


Introduction
In recent years, applications that rely on 3D point cloud data, e.g., autonomous driving, robotic manipulation and virtual reality, have attracted much attention [1]. Thus, there is a critical need for approaches that process 3D point clouds effectively and efficiently. However, this task is very challenging, since inferring the underlying shape formed by irregular points is difficult.
To solve this problem, many works focus on replicating the remarkable success of convolutional neural networks (CNNs) on regular grid data in the setting of irregular point cloud processing [3]. They can be divided into two categories. Some works transform the point cloud into regular voxels [4] or multi-view images [5] so that classic grid CNNs can be applied directly. However, this kind of method induces a loss of information as well as excessive memory consumption and high computation cost.
The other category processes point clouds directly. PointNet [6] learns on each point independently and gathers the final features into a global representation, but it ignores local structures, which have been proven important for abstracting high-level visual concepts in images. Thus, much follow-up work has been proposed. Nevertheless, such methods rely heavily on effective inductive learning of local subsets, which is quite intractable to achieve.
Liu et al. [7] introduce the Relation-Shape Convolutional Neural Network (RS-CNN) for 3D point cloud analysis. The key idea of RS-CNN is learning from relation, i.e., the geometric topology constraint among points, which can encode meaningful shape information in a 3D point cloud. However, because downsampling is applied from the first layer onward, RS-CNN loses some important information prematurely. To address this problem, an additional layer is used to extract feature information in advance; this feature information is then fed into RS-CNN, so that more feature information can be extracted. The experimental results show that the proposed extra layer improves the performance of RS-CNN.

Method
In this section, RS-CNN [7] and our improvement are introduced.

RS-CNN
RS-CNN learns a contextual shape-aware representation through a novel relation-shape convolution (RS-Conv), which extends regular grid CNN to irregular configurations. Given n points x_i in a point cloud, RS-Conv can be formulated as:

f(x_i) = σ( A( { M(h_ij) · f_{x_j}, ∀ x_j ∈ N(x_i) } ) )    (1)

where x_i is a 3D point, f_{x_j} is the feature vector of a neighbor x_j ∈ N(x_i), and M, A and σ are a shared multi-layer perceptron (MLP), an aggregation function and a nonlinear activator, respectively. The core part, the relation h_ij, represents the three-dimensional Euclidean distance between points. The flow of RS-CNN is shown in Fig. 1 [7].
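A minimal NumPy sketch of one RS-Conv-style step may clarify formula (1). This is an illustration, not the authors' implementation: the relation h_ij is built from the Euclidean distance and coordinate offset, M is a tiny two-layer shared MLP (the weights W1, W2 are hypothetical names), A is channel-wise max pooling, and σ is ReLU; these concrete choices are assumptions.

```python
import numpy as np

def rs_conv(points, feats, neighbors, W1, W2):
    """One RS-Conv-style step on a point cloud (illustrative sketch).

    points:    (n, 3)  xyz coordinates
    feats:     (n, c)  per-point features f
    neighbors: (n, k)  indices of the k nearest neighbors of each point
    W1, W2:    weights of a tiny shared MLP M mapping the relation
               vector h_ij to a (c,)-dim weight for each neighbor feature
    """
    n, k = neighbors.shape
    out = np.empty_like(feats)
    for i in range(n):
        nbr = neighbors[i]                        # indices x_j in N(x_i)
        offset = points[nbr] - points[i]          # x_j - x_i, shape (k, 3)
        dist = np.linalg.norm(offset, axis=1, keepdims=True)  # Euclidean distance
        h = np.concatenate([dist, offset], axis=1)            # low-level relation h_ij
        w = np.maximum(h @ W1, 0) @ W2            # shared MLP M(h_ij), shape (k, c)
        out[i] = (w * feats[nbr]).max(axis=0)     # A: channel-wise max aggregation
    return np.maximum(out, 0)                     # sigma: ReLU
```

In the real network M, A and σ are learned/chosen per layer and the loop is vectorized on the GPU; the sketch keeps only the structure of formula (1).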

Our improvement of RS-CNN
Although RS-CNN has achieved excellent results, it uses only coordinates and spatial information to learn shape information and ignores feature information. Thus, formula (1) is modified so that the relation is computed from the feature vectors rather than from the point coordinates:

f(x_i) = σ( A( { M(h(f_{x_i}, f_{x_j})) · f_{x_j}, ∀ x_j ∈ N(x_i) } ) )    (2)

Relative to formula (1), the coordinates are replaced with more powerful features, which learn the relationships between objects better. Compared with using only coordinates and spatial information, features contain richer information, and the learning results are more robust.
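One plausible instantiation of the feature relation can be sketched by mirroring the coordinate relation: the per-neighbor feature difference together with its norm. The exact form used in the network is not spelled out here, so this particular construction is an assumption.

```python
import numpy as np

def feature_relation(feats, i, nbr):
    """A plausible feature relation h(f_xi, f_xj): the per-neighbor
    feature difference f_xj - f_xi concatenated with its Euclidean norm,
    mirroring the coordinate relation of the original RS-Conv."""
    diff = feats[nbr] - feats[i]                        # (k, c)
    dist = np.linalg.norm(diff, axis=1, keepdims=True)  # (k, 1)
    return np.concatenate([dist, diff], axis=1)         # (k, c + 1)
```

Such a relation vector would replace h_ij as the input of the shared MLP M in formula (2), leaving the rest of the convolution unchanged.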
Moreover, in order to capture more feature information, we add an extra layer of convolution. As shown in Fig. 2, our classification network has four layers, while RS-CNN has only three. The L0 layer learns feature information from the coordinates and serves as the input of the next layer. The purpose is to obtain a stronger representation, because the expressive power of features is much greater than that of raw coordinates. In the segmentation task, the learned features are interpolated back to the finest scale layer by layer to obtain a feature map with the same number of points as the original input.
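The layer-by-layer interpolation back to the finest scale can be sketched as inverse-distance-weighted feature propagation over the k nearest coarse points, in the style popularized by PointNet++; the paper does not detail its exact scheme, so this is an assumed, minimal version.

```python
import numpy as np

def interpolate_features(coarse_pts, coarse_feats, fine_pts, k=3, eps=1e-8):
    """Propagate features from a downsampled layer back to a denser one
    by inverse-distance weighting over the k nearest coarse points
    (PointNet++-style feature propagation; an illustrative assumption)."""
    # pairwise distances from each fine point to each coarse point: (m, n)
    d = np.linalg.norm(fine_pts[:, None] - coarse_pts[None], axis=-1)
    idx = np.argsort(d, axis=1)[:, :k]                    # k nearest coarse points
    w = 1.0 / (np.take_along_axis(d, idx, axis=1) + eps)  # inverse-distance weights
    w = w / w.sum(axis=1, keepdims=True)                  # normalize per fine point
    return (coarse_feats[idx] * w[..., None]).sum(axis=1) # (m, c)
```

Applying this step layer by layer, from the coarsest scale up to the input resolution, yields a per-point feature map with as many points as the original input, as required for segmentation.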

Results
We evaluate our method on object classification and shape part segmentation. All experiments are run on a GTX 1070 GPU with CUDA 11.3 and cuDNN 7.6.5. The experimental environment is Python 3.6.5 and PyTorch 0.4.1.

Classification on ModelNet40 classification benchmark
The ModelNet40 classification benchmark [8] is composed of 9843 training models and 2468 test models in 40 classes, with 1024 points sampled uniformly from each model [6]. Following [9], during training the input data is augmented with random anisotropic scaling in the range [0.66, 1.5] and translation in the range [-0.2, 0.2]. Meanwhile, dropout with a 50% ratio is applied in the FC layers, and ten voting tests with random scaling are performed to average the predictions [6]. As shown in Table 1, the proposed method achieves an accuracy of 92.6%, an improvement of 0.4%. The final result with ten voting tests is 92.8%, which is competitive.

Table 1. Shape classification results (%) on the ModelNet40 benchmark ("vot.": with ten voting tests, "no vot.": without ten voting tests)

Methods           Input               Accuracy
DGCNN [10]        1k points           92.2
SO-Net [11]       2k points           90.9
PointNet++ [12]   5k points + normal  91.9
PCNN [13]         1k points           92.3
FPConv [14]       1k points           92.5
KPConv [15]       1k points           92.9
RS-CNN [7]
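The augmentation and voting protocol described above can be sketched as follows. The function names and the generic `model` callable are hypothetical; the scaling and translation ranges are those stated in the text.

```python
import numpy as np

def augment(points, scale_lo=0.66, scale_hi=1.5, shift=0.2, rng=None):
    """Random anisotropic scaling in [scale_lo, scale_hi] and translation
    in [-shift, shift], drawn independently per axis."""
    rng = rng if rng is not None else np.random.default_rng()
    s = rng.uniform(scale_lo, scale_hi, size=(1, 3))
    t = rng.uniform(-shift, shift, size=(1, 3))
    return points * s + t

def vote_predict(model, points, n_votes=10, rng=None):
    """Average class scores over several randomly scaled/translated
    copies of the input cloud, then take the argmax (voting test)."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = [model(augment(points, rng=rng)) for _ in range(n_votes)]
    return int(np.mean(scores, axis=0).argmax())
```

At test time `model` would be the trained classifier returning per-class scores; averaging over augmented copies smooths out the sensitivity of a single forward pass to the sampled points.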

Segmentation on ShapeNet part benchmark
The ShapeNet part benchmark contains 16881 shapes in 16 categories, labelled with 50 parts in total.
In the experiments, 2048 points are randomly sampled as the input [12]. The part segmentation results are shown in Table 2. The proposed method still achieves reasonable performance, since it can learn discriminative information.

Conclusion
In this paper, an improved RS-CNN is proposed for 3D point cloud recognition and semantic segmentation. The proposed network can learn much more structural information in the recognition and semantic segmentation tasks. Compared with single-scale RS-CNN, the performance of our method on the classification dataset is improved by 0.4%. Moreover, the proposed method also achieves reasonable performance on the part segmentation task. Although the proposed method obtains an improvement, the experiments are performed on a simulated dataset in which each object is complete. Thus, applying this method to an incomplete dataset is left for future work.