A Panoramic Segmentation Network for Point Cloud

Scene segmentation mainly consists of semantic segmentation and instance segmentation. Recent research points out that combining the two segmentation methods into panoramic segmentation yields a better understanding of the scene. Point clouds contain rich spatial information, but panoramic segmentation in this field is rarely discussed. The key to point cloud panoramic segmentation is obtaining the results of instance segmentation and semantic segmentation within a unified model framework. In this paper, we propose a panoramic segmentation network for point clouds. In the feature encoding stage, we introduce potential correlation information among points to improve feature extraction. Then, an output module is presented that combines the results of two decoders and uses objective distance to enhance both semantic and instance segmentation. Experiments show that our model performs well on the panoramic segmentation task for point clouds.


Introduction
A point cloud is a set of points that expresses the spatial distribution and surface characteristics of a target in a common spatial coordinate system. It carries rich spatial information, with which we can accurately locate objects and describe their contours. Point cloud data are widely used in 3D scene segmentation, which refers to dividing 3D data into groups of regions with specific meanings, labeling those regions, and finally obtaining 3D views with information annotations. Accurate and efficient scene segmentation is useful in many applications, such as autonomous driving, indoor navigation, virtual reality, and augmented reality. There are two main tasks in 3D scene segmentation: semantic segmentation and instance segmentation. Semantic segmentation studies how to segment the diverse objects in a scene, assigning a class label to each pixel or point. Instance segmentation, based on the study of things, draws boundaries for all objects in the scene and uses masks to separate different instances; the usual method is to generate bounding boxes by object detection and then segment the points within each bounding box.
In fact, there is a close relationship between semantic segmentation and instance segmentation: an object must belong to one class, and the points in the same object belong to the same class. Instance segmentation largely depends on the performance of semantic segmentation; similarly, using effective instance segmentation to delineate object boundaries can feed back into semantic segmentation and improve its accuracy. Thus, panoramic segmentation is proposed to combine the two segmentation methods. Panoramic segmentation uses a unified network model to predict the category label and instance ID of each point, achieving unified, global segmentation. Approaches based on graphical models [1], [2], [26], [27] have been proposed to combine the two segmentation methods, but there is little research on panoramic segmentation models for point clouds.
In recent years, research on point clouds has mainly focused on point feature extraction and semantic segmentation. PointNet [7] proposed a network structure that uses MLPs to extract features directly from points and achieved good results in 3D object recognition, 3D object part segmentation, and semantic segmentation. Later, PointNet++ [8] used metric space distances to learn local features at increasing contextual scales. Further algorithms [9]-[12] were then proposed and achieved good results. However, the above methods ignore the potential connections between points.
To achieve efficient panoramic segmentation on point cloud data, the following two problems need to be solved. (1) How to efficiently extract point cloud features: both semantic segmentation and instance segmentation are based on point feature extraction, and only effective feature extraction can improve accuracy. (2) How to combine the semantic and instance outputs to improve panoramic segmentation accuracy: experiments on images [19], [20] have shown that combining semantic and instance segmentation improves their respective performance and yields better panoramic segmentation results.
Based on the above analysis, we design a feature extraction network for points. In the encoding layer, we use the vectors formed between sampling points and their surrounding points as another set of important parameters, in addition to local point features, and integrate them into the per-point feature to improve feature extraction on the raw point cloud. To obtain semantic and instance results at the same time, ASIS [3] inspired us to use two parallel branches for instance segmentation and semantic segmentation respectively and then fuse the two results as output. However, [3] only considers points that are similar in feature space, not in Euclidean space. To address this problem, ASIS is modified in the panoramic segmentation output module of our model. Finally, experiments on different datasets demonstrate the superiority of our backbone network in feature extraction. Compared with the original ASIS model, our panoramic segmentation output module better realizes win-win results for semantic and instance segmentation. Our model shows clear advantages in panoramic segmentation compared with PointNet++ [8] and ASIS [3].

Related Work
With the application of deep learning in computer vision, semantic segmentation has developed rapidly. For 2D images, great breakthroughs have been made in semantic segmentation [4]-[6] based on fully convolutional networks. In recent years, research on deep learning for point clouds has been increasing. PointNet [7] is the pioneering work successfully applying deep learning to 3D point clouds. Taking the raw point cloud as the input of a deep neural network, PointNet provides a unified architecture for classification, part segmentation, and scene semantic segmentation, but it does not consider local feature extraction. PointNet++ [8] performs farthest point sampling and region partitioning on the point cloud and extracts local features by applying PointNet within each region, improving the feature extraction effect. A series of algorithms based on PointNet [7] and PointNet++ [8] have since been proposed. SpiderCNN [11] introduced the Taylor expansion into the convolution process, further improving feature extraction. PointConv [12] borrows the convolution scheme of images and constructs a multi-layer deep convolutional network that achieves the same transformation invariance as 2D convolutional networks. However, the above methods only consider information about local points and ignore the potential relevance between points.
Deep neural networks have also made good progress in image instance segmentation [13]-[15]. The basic idea is to frame different instances with object detection and then mark pixels inside each instance region with a semantic segmentation method. Many deep learning models have been proposed for point cloud object detection [16]-[19], but there is little research on point cloud instance segmentation. 3D-SIS [20] uses 2D convolutions to generate features from RGB-D images and projects them into 3D grids; a 3D RPN (Region Proposal Network) and RoI (Region of Interest) layer are used to infer object boundaries. SGPN [21] uses a similarity matrix to reflect the similarity of different points in the feature space, estimating the similar region of each point to achieve instance segmentation.
Panoramic segmentation [22] was first proposed jointly by FAIR and Heidelberg University in Germany. Panoramic segmentation achieves both semantic segmentation and instance segmentation, but it faces two challenges: first, the predicted segments must be non-overlapping; second, not only the objects but also the background pixels or points must be labeled. For images, the exploration of panoramic segmentation has begun, aiming to reconcile things and stuff. Panoptic Segmentation [22] proposes the PQ (Panoptic Quality) index for image segmentation. OANet [2] first proposed an end-to-end, occlusion-aware algorithm for panoramic segmentation and introduced a novel spatial sorting module to resolve the ambiguity of overlaps. Panoptic Feature Pyramid Networks [1] combines an FCN for semantic segmentation with Mask R-CNN for instance segmentation into Panoptic FPN, providing a strong baseline for panoramic segmentation of images. These studies show that combining semantic segmentation and instance segmentation can effectively improve the accuracy of both. In three-dimensional space, objects overlap less than in images, but small distances between objects can still make it impossible to accurately separate two different instances. On point clouds, ASIS [3] organically combines the output branches originally used for semantic and instance segmentation, introducing instance awareness into semantic segmentation and semantic awareness into instance segmentation, and provides an effective end-to-end training model. However, [3] only considers the similarity of instance embedding features in semantic awareness and ignores the fact that the points of an instance tend to be close together in reality.
Based on the above research, this work aims to establish a panoramic segmentation model for point clouds that effectively obtains the semantic labels and instance IDs of points in the scene, so as to better segment it.

Our Method
Our network architecture is shown in Fig. 1, consisting of an encoder, two decoders, and a panoramic segmentation output module. Through the combination of encoder and decoder, instance features and semantic features are obtained respectively. The panoramic segmentation output module fuses the two features to improve the efficiency of instance and semantic segmentation. Finally, the semantic label and instance ID of each point are predicted.

Feature encoding

Let f_k be the feature of the k-th point around a sampling point p_i, where K is the number of neighboring points. From the coordinates of each point, we generate a set of vectors {v_k | k = 0, 1, 2, ..., K-1}, where v_k is the vector from p_i to p_k, used as a potential correlation between points. Similar to PointNet, our model uses multi-layer perceptrons (MLPs) and symmetric functions to fit the edge feature extraction functions. Finally, the point features and edge features are fused to make feature extraction more effective. It should be added that, since the radius query contains the sampling point itself, a zero vector representing the line from the sampling point to itself is included in the vector set, but this has little impact on the model. The above process is as follows:

F_p = g_1(h_1(f_0), h_1(f_1), ..., h_1(f_{K-1}))
F_v = g_2(h_2(v_0), h_2(v_1), ..., h_2(v_{K-1}))
F = h_3(F_p, F_v)

where F_p is the feature formed by the local points, F_v is the feature formed by the local vectors, and F is the feature fused from F_p and F_v. We approximate g_1, g_2 by a max pooling function and h_1, h_2, h_3 by multi-layer perceptron networks. The specific structure is shown in Fig. 2.
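As an illustration, the fused point and edge feature encoding can be sketched in numpy. The single-layer random weights below are stand-ins for the learned MLPs (the real model trains multi-layer networks), and max pooling plays the role of the symmetric function:

```python
import numpy as np

def encode_with_edge_vectors(points, features, idx):
    """Sketch of fusing per-point features with edge-vector features.

    points:   (N, 3) xyz coordinates
    features: (N, C) per-point features
    idx:      (M, K) neighbor indices for M sampling points (e.g. from a
              radius query); idx[i, 0] is assumed to be the point itself.
    The single-layer weights W_p, W_v are illustrative stand-ins for the
    shared MLPs in the paper.
    """
    rng = np.random.default_rng(0)
    C = features.shape[1]
    W_p = rng.standard_normal((C, C))   # stand-in for the point-feature MLP
    W_v = rng.standard_normal((3, C))   # stand-in for the edge-vector MLP

    centers = points[idx[:, 0]]                   # (M, 3) sampling points
    neigh_xyz = points[idx]                       # (M, K, 3) neighbor coords
    vectors = neigh_xyz - centers[:, None, :]     # (M, K, 3) edge vectors
    # the self-vector (all zeros) is kept, as noted in the text

    f_point = np.maximum(features[idx] @ W_p, 0)  # per-neighbor point feats
    f_edge = np.maximum(vectors @ W_v, 0)         # per-neighbor edge feats

    # symmetric function: max pooling over the K neighbors
    F_p = f_point.max(axis=1)                     # (M, C)
    F_v = f_edge.max(axis=1)                      # (M, C)
    return np.concatenate([F_p, F_v], axis=1)     # fused feature, (M, 2C)
```

Pooling over neighbors keeps the result invariant to the ordering of points inside each query ball, which is why a symmetric function is required here.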

Panoramic segmentation output module
Panoramic segmentation requires that each point be assigned a category label and an instance ID. Two parallel decoders, one for point-level semantic prediction and the other for extracting point-to-point instance relationships, are applied in our network model. Through these two decoders, the semantic labels and instance IDs of points can be obtained by classifying and clustering the points. Following [3], the model can be trained by constructing the loss function

L = L_sem + L_ins

where L_sem is the semantic segmentation loss, using the classic cross-entropy loss between the ground truth labels y and the predicted labels y* to supervise the semantic branch; L_ins is the instance segmentation loss, for which we adopt the class-agnostic instance embedding learning strategy of ASIS [3]: it pulls the embedding of each point toward the mean embedding of its instance (an instance i containing N_i points), pushes the mean embeddings of different instances apart, and adds a regularization term on the mean embeddings, with a ratio term balancing semantic segmentation against instance segmentation performance. During testing, the final instance labels are obtained by clustering on the instance embeddings, and the semantic labels of the points within the same instance determine its category. The model structure of panoramic segmentation is thus realized. ASIS [3] proves that there is a potential relationship between instance segmentation and semantic segmentation: an instance object must have a category label, and the category labels of the points inside the same instance are consistent. Through the effective combination of instance features and semantic features, the accuracy of both results can be improved, thereby improving the efficiency of the final panoramic segmentation. However, when combining instances with semantics, ASIS [3] only considers the similarity of instance features to build the semantic features of points.
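A minimal numpy sketch of such a class-agnostic instance embedding loss follows the pull/push/regularize form used by ASIS; the margin and weight values here mirror the experiment section, but the exact weighting in the original may differ:

```python
import numpy as np

def instance_embedding_loss(emb, inst_ids, delta_v=0.5, delta_d=1.5,
                            reg_w=0.001):
    """Sketch of a class-agnostic instance embedding loss.

    emb:      (N, D) point embeddings
    inst_ids: (N,) ground-truth instance labels
    delta_v / delta_d are the pull / push margins.
    """
    ids = np.unique(inst_ids)
    # mean embedding of each instance
    mus = np.stack([emb[inst_ids == i].mean(axis=0) for i in ids])

    # pull term: draw points toward their instance mean
    l_var = 0.0
    for mu, i in zip(mus, ids):
        d = np.linalg.norm(emb[inst_ids == i] - mu, axis=1)
        l_var += np.mean(np.maximum(d - delta_v, 0.0) ** 2)
    l_var /= len(ids)

    # push term: drive different instance means apart
    l_dist = 0.0
    if len(ids) > 1:
        for a in range(len(ids)):
            for b in range(len(ids)):
                if a != b:
                    d = np.linalg.norm(mus[a] - mus[b])
                    l_dist += np.maximum(2 * delta_d - d, 0.0) ** 2
        l_dist /= len(ids) * (len(ids) - 1)

    # regularization term: keep the embeddings bounded
    l_reg = np.mean(np.linalg.norm(mus, axis=1))
    return l_var + l_dist + reg_w * l_reg
```

When every point sits exactly on its instance mean and the means are farther apart than twice the push margin, only the small regularization term remains, which is the regime the clustering step at test time relies on.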
After feature learning, points with similar features are not necessarily close in space, which is inconsistent with the objective fact that points of the same instance are spatially close. Therefore, this paper takes spatial location relationships as an important factor in feature combination to improve the final segmentation result.
KNN is used to search for a fixed number of neighboring points for each point in the instance feature space. Using the resulting index matrix, their semantic features are looked up in the semantic feature matrix and grouped, and the semantic features of each point are fused by max pooling. A ball query is used to search the neighboring points within a spatial distance, and max pooling is likewise used to fuse their semantic features. After combining the semantic features from the two different search methods, the final semantic features are generated. This process can be expressed as:

F'_sem,i = h(pool({F_sem,j | j ∈ KNN(i)}), pool({F_sem,k | p_k ∈ Ball(p_i, r)}))

where KNN(i) is the set of neighbors of point i in the instance feature space, Ball(p_i, r) is the set of points within radius r of p_i in Euclidean space, pool is the max pooling operation, and h is the fusion function.
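The dual-neighborhood fusion can be sketched as follows; the choice of sum as the final fusion function and the default k and radius are illustrative assumptions, not the paper's fixed settings:

```python
import numpy as np

def fuse_semantic_features(xyz, inst_feat, sem_feat, k=30, radius=0.5):
    """Sketch of dual-neighborhood semantic fusion: KNN in the instance
    feature space plus a ball query in Euclidean space, each max-pooled
    and then combined (by summation here, as a stand-in fusion).

    xyz:       (N, 3) point coordinates
    inst_feat: (N, D) instance embeddings
    sem_feat:  (N, C) semantic features
    """
    n = xyz.shape[0]
    fused = np.empty_like(sem_feat)
    for i in range(n):
        # KNN in the instance embedding space
        fd = np.linalg.norm(inst_feat - inst_feat[i], axis=1)
        knn = np.argsort(fd)[:k]
        f_knn = sem_feat[knn].max(axis=0)
        # ball query in Euclidean space (always contains the point itself)
        ball = np.flatnonzero(np.linalg.norm(xyz - xyz[i], axis=1) <= radius)
        f_ball = sem_feat[ball].max(axis=0)
        fused[i] = f_knn + f_ball
    return fused
```

The Euclidean term is what restores the spatial-proximity prior that a purely feature-space KNN lacks: two spatially adjacent points contribute to each other's semantic features even when their learned embeddings drift apart.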

Experiment
The experiment is divided into the following parts. First, the performance of the encoder and decoder in our model is verified on different 3D recognition tasks. Second, the performance of our panoramic segmentation output module is demonstrated on the S3DIS dataset. Finally, the structural characteristics of the model are analyzed through controlled experiments. All experiments are implemented in TensorFlow on an RTX 2080 Ti GPU.

Backbone network
To verify the performance of our point cloud encoder network, we compare it against point-based classification and segmentation methods. The datasets used are ModelNet40 [24] and ShapeNet [25].

Experiment on classification: ModelNet40 contains 12,311 CAD models of 40 categories, split into 9,843 models for training and 2,468 for testing. The models are uniformly sampled into point sets as in [7]. During training, we randomly rotate the point cloud along the z-axis and jitter each point with Gaussian noise of mean 0 and standard deviation 0.02. We use the same Adam optimizer for all models and train for 250 epochs. As shown in TABLE I, our encoder outperforms the original PointNet++ [8]. In this comparison, our model only adds vector information in the encoding layer, which shows that introducing edge information into point features is effective. Compared with some other algorithms, our encoder still has shortcomings in performance; next, we will try to introduce edge features into those algorithms to see whether their performance can be improved.

Experiment on part segmentation: the results on ShapeNet are shown in TABLE II. The analysis shows that our model, although built on PointNet++, performs no worse than current advanced algorithms. In this experiment, in addition to replacing the encoding layer, we also use the iterative algorithm in the decoding layer. The result is a significant improvement over PointNet++, which proves that our design ideas are useful.
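The training augmentation described above (random z-axis rotation followed by Gaussian jitter with standard deviation 0.02) can be sketched as:

```python
import numpy as np

def augment(points, sigma=0.02, rng=None):
    """Random rotation about the z-axis plus Gaussian jitter on each point.

    points: (N, 3) xyz coordinates; sigma is the jitter standard deviation.
    """
    if rng is None:
        rng = np.random.default_rng()
    theta = rng.uniform(0.0, 2.0 * np.pi)
    c, s = np.cos(theta), np.sin(theta)
    # rotation matrix about the z-axis
    rot = np.array([[c, -s, 0.0],
                    [s,  c, 0.0],
                    [0.0, 0.0, 1.0]])
    # rotate, then perturb every coordinate with zero-mean Gaussian noise
    return points @ rot.T + rng.normal(0.0, sigma, points.shape)
```

Rotating only about z keeps the gravity direction fixed, which matches how scanned scenes and upright CAD models are oriented.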

Panoramic segmentation results
The panoramic segmentation experiment is based on the S3DIS [23] dataset. During data preprocessing, each room is still divided into several overlapping 1m×1m blocks on the ground plane, each containing 4,096 points. The ASIS source code is then used to attach semantic labels and instance IDs to each point in the S3DIS dataset. For a fair comparison with ASIS, the same training settings are used: the two margins of the instance embedding loss are set to 0.5 and 1.5; for instance fusion, the KNN search uses K = 30; we train for 100 epochs with the Adam optimizer, with a base learning rate of 0.001 halved every 300,000 steps. For testing, the bandwidth of mean shift clustering is set to 0.6. The instance segmentation results are shown in TABLE III. With PointNet++ as the backbone, comparing our output module with ASIS, mCov increases by 0.6 and mRec by 1.0, achieving a better instance segmentation effect. For semantic segmentation, as shown in TABLE IV, both mAcc and mIoU of our output module are improved. Finally, following [1], we use PQ as the result index of panoramic segmentation, as shown in TABLE V. Compared with the original model based on PointNet++ [7] and ASIS, our model improves PQ by 0.3 and achieves a good panoramic segmentation result. To better demonstrate the segmentation effect of our network, we present the final results in the form of semantic segmentation and instance segmentation, as shown in Fig. 3.

Figure 3. Panoramic segmentation results of our model
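For reference, the PQ index from [1], [22] reduces to a short computation once predicted and ground-truth segments have been matched (a pair counts as a true positive when its IoU exceeds 0.5); this per-class sketch assumes the matching has already been done:

```python
def panoptic_quality(matched_ious, n_fp, n_fn):
    """PQ for one class: sum of IoUs over matched (TP) segment pairs,
    divided by TP + FP/2 + FN/2.

    matched_ious: IoU values of matched prediction/ground-truth pairs
    n_fp:         unmatched predicted segments (false positives)
    n_fn:         unmatched ground-truth segments (false negatives)
    """
    tp = len(matched_ious)
    denom = tp + 0.5 * n_fp + 0.5 * n_fn
    return sum(matched_ious) / denom if denom > 0 else 0.0
```

The overall PQ reported in TABLE V would then be the average of this quantity over classes; the factor 0.5 on FP and FN is what penalizes spurious and missed segments symmetrically.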

Conclusion
This paper proposes a model framework for panoramic segmentation of point clouds. To better capture point features, we introduce correlation information between points in the encoding layer. In the output module of the panoramic segmentation network, we consider the effect of objective distance to improve segmentation performance. Experiments show that our model outperforms many similar models. Our encoder and decoder can also be applied to other feature extraction networks. We hope that our work can contribute to the panoramic segmentation of point clouds, and that more research in the future will make point cloud panoramic segmentation as efficient as that of images.