Multi-level interactions for RGB-D salient object detection

To make efficient use of high-level information and depth information in RGB-D salient object detection, multi-level fusion is studied. Unlike existing methods, which ignore the dilution of features as they are transmitted downward, a multi-level interactive fusion method is designed and compared with five advanced models on four evaluation metrics. The experimental results show that the proposed model performs competitively.


Introduction
Salient object detection uses computers to imitate human vision in order to detect or extract the most interesting object or region in an image. At present, salient object detection is widely used in object segmentation and recognition, image editing, visual tracking and other computer vision tasks. However, in complex scenes (where the foreground and background are extremely similar, or the image contains multiple objects), a model based only on the three-channel red-green-blue color image (RGB) cannot obtain good results. One of the main reasons is the lack of depth information reflecting the spatial structure of the scene. With the emergence of low-cost but high-performance depth sensors, researchers began to combine depth maps for saliency detection. Each pixel of the depth map reflects the distance of the object from the sensor. This approach of combining the depth map is called RGB-D saliency detection.
To effectively fuse RGB and depth maps across modalities, Chen [1] proposed an RGB-D saliency detection algorithm with multi-scale residual coarse prediction. Feng [2] extracted RGB features at each stage and then combined them with depth features. Ji [3] proposed a collaborative learning framework for saliency detection.
Although the above methods perform well in saliency detection, there are still some deficiencies. First, they do not take into account the difference between the RGB and depth modalities, fusing them by simple addition or channel concatenation. In addition, most top-down cross-modal fusion methods ignore the dilution of high-level features as they are continuously transmitted downward, which degrades the effect of feature fusion.
To solve the above problems, this paper proposes a new cross-modal fusion module based on an attention mechanism, embeds it into an encoder-decoder network structure, and simultaneously transmits high-level information downward multiple times. Compared with other recently published advanced models, our model achieves better detection performance on five widely used public datasets under multiple evaluation metrics.

The proposed method
The model proposed in this paper is an encoder-decoder dual-stream structure, as shown in Figure 1. The encoder is composed of two VGG-16 feature extraction networks that extract features from the two modalities. A hierarchical fusion architecture is used to fuse multi-scale features and perform side-output estimation. This section gives the specific operations of the fusion method.

Figure 1. Network structure

Network structure
As shown in Figure 1, the RGB and depth maps are fed separately into two identical backbone branch networks. To improve the computational efficiency of the model, the relatively shallow VGG-16 is used as the backbone to extract the relevant features. Each VGG-16 backbone is divided into five blocks, and the features extracted from the last convolutional layer of each block are taken for the cross-modal feature fusion operation. The RGB stream extracts the main feature information of the image, such as color, position and other low-level features, as well as high-level semantic information and contextual features. The depth stream mainly captures spatial information, making saliency detection more accurate and complete. To better integrate the two, a fusion module is designed for the fusion stage. At the same time, high-level features are transmitted to every lower fusion module to counteract the dilution that occurs during feature transmission, and each result of the backbone branches is decoded as a side output.
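The multi-level feedback of high-level features to every lower fusion level can be illustrated with a minimal NumPy sketch. This is not the actual network: the all-ones arrays stand in for the five VGG-16 block outputs, the function names (`resize_to`, `multilevel_feedback`) are hypothetical, and `resize_to` stands in for whatever resampling (pooling or upsampling) matches the spatial sizes before the pixel-level summation.

```python
import numpy as np

def resize_to(x, size):
    """Resize a square (H, H, C) map to (size, size, C).

    Downsamples by average pooling or upsamples by repetition;
    assumes power-of-two spatial sizes, as in VGG-16 block outputs.
    """
    h = x.shape[0]
    if h == size:
        return x
    if h > size:  # average pooling
        k = h // size
        return x.reshape(size, k, size, k, -1).mean(axis=(1, 3))
    k = size // h  # nearest-neighbor upsampling
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def multilevel_feedback(features):
    """For each level i, add the resized, pixel-wise sum of all
    higher-level (deeper) features, mimicking the feedback paths
    that counteract feature dilution.

    `features` is ordered shallow -> deep (decreasing resolution).
    """
    fused = []
    for i, f in enumerate(features):
        # Sum of all features above the current layer i (0 if none).
        high = sum(resize_to(g, f.shape[0]) for g in features[i + 1:])
        fused.append(f + high)
    return fused

# Five toy "block outputs" with halving resolution: 32, 16, 8, 4, 2.
feats = [np.ones((32 // 2**k, 32 // 2**k, 4)) for k in range(5)]
out = multilevel_feedback(feats)
```

With all-ones inputs, each level's output equals one plus the number of deeper levels feeding into it, which makes the feedback paths easy to verify.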

Implementation details
Because of the attribute differences between the modalities, the complementary information of RGB and depth cannot be effectively mined and fully used simply by adding or concatenating them with equal weights. To improve the compatibility and merging of RGB and depth features, and to filter out unnecessary information, this paper uses an attention mechanism in the fusion module to extract more effective features. First, the RGB features, the depth features and the high-level features obtained by feedback are added at the pixel level; then successive attention operations are performed; finally, a convolution is applied to extract further features.
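The fusion steps above can be sketched as follows. This is a hedged NumPy illustration, not the authors' implementation: the paper does not specify the exact attention form, so a simple channel attention (global average pooling followed by a sigmoid gate) stands in for each of the successive attention operations, and the final convolution is only indicated by a comment.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def channel_attention(x, w):
    """One possible attention form: squeeze then gate.

    x: (H, W, C) feature map; w: (C, C) weight matrix (random here,
    learned in a real network).
    """
    squeeze = x.mean(axis=(0, 1))   # (C,) global channel descriptor
    gate = sigmoid(squeeze @ w)     # (C,) channel weights in (0, 1)
    return x * gate                 # reweight channels

def fuse(f_rgb, f_depth, f_high, w1, w2):
    """Sketch of the fusion module: pixel-level addition of RGB,
    depth and fed-back high-level features, then two successive
    attention operations."""
    x = f_rgb + f_depth + f_high    # pixel-level addition
    x = channel_attention(x, w1)    # first attention operation
    x = channel_attention(x, w2)    # second attention operation
    return x                        # a convolution would follow here

rng = np.random.default_rng(0)
h, w, c = 8, 8, 16
out = fuse(rng.normal(size=(h, w, c)),
           rng.normal(size=(h, w, c)),
           rng.normal(size=(h, w, c)),
           rng.normal(size=(c, c)),
           rng.normal(size=(c, c)))
```

Because the sigmoid gate lies in (0, 1), each attention step can only suppress channels, which is one way to realize the filtering of unnecessary information mentioned above.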
$F_i^{r}$ and $F_i^{d}$ represent the results generated by the last convolutional layer of block $i$ of the RGB and depth branches, and $i$ represents the layer number. The feedback of high-level features can be written as

$F_i^{h} = \sum_{j>i} \mathrm{Pool}(F_j) \qquad (2)$

where $F_i^{h}$ represents the pooled, pixel-level summation of all feature results higher than the current layer $i$, $F_i^{f}$ represents the result of the transition operation in the fusion module, and $S_i$ represents the prediction result of the current layer $i$.

Training
In the training stage, each side output is supervised, and the loss function of the whole model is the sum of the loss functions of the side outputs.
$L = -\sum \left[ y \log x + (1 - y)\log(1 - x) \right] + \left( 1 - \frac{|A \cap B|}{|A \cup B|} \right) \qquad (3)$

where $y$ represents the ground-truth category of a pixel, $x$ represents the predicted value, $A$ represents the predicted region, and $B$ represents the ground-truth region.
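The symbols in equation (3) suggest a binary cross-entropy term plus an IoU term (the intersection and union over the predicted and ground-truth regions). A minimal NumPy sketch under that reading, not the authors' exact implementation:

```python
import numpy as np

def bce_iou_loss(pred, gt, eps=1e-8):
    """Side-output loss sketch: binary cross-entropy plus an IoU term
    1 - |A n B| / |A u B|, computed on soft saliency maps.

    pred, gt: arrays of the same shape with values in [0, 1].
    """
    bce = -np.mean(gt * np.log(pred + eps)
                   + (1 - gt) * np.log(1 - pred + eps))
    inter = np.sum(pred * gt)                 # soft |A n B|
    union = np.sum(pred + gt - pred * gt)     # soft |A u B|
    iou = 1.0 - inter / (union + eps)
    return bce + iou

def total_loss(side_outputs, gt):
    """The model loss is the sum of the losses of all side outputs."""
    return sum(bce_iou_loss(p, gt) for p in side_outputs)
```

A perfect prediction drives both terms toward zero, while the IoU term keeps the loss sensitive to the overlap of whole regions rather than only to per-pixel errors.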
The same training data as [4] is used. Each input image is resized to 352×352 and the batch size is 4.

Experimental results
To demonstrate the reliability of the whole model, this paper follows [4][5][6][7] and uses five datasets: NLPR, SIP, DUT-RGBD, STERE and RGBD135. The model is compared with current advanced RGB-D saliency detection models, including CPFP [7], CMW [8], D3Net [5], ICNet [9] and DCMF [10]. Four evaluation metrics are used: S-measure, F-measure, MAE and avgF.
S-measure mainly evaluates the structural similarity between the saliency map and the binary ground truth:

$S = \alpha \cdot S_o + (1 - \alpha) \cdot S_r \qquad (4)$

where $S_o$ represents object-aware similarity, $S_r$ represents region-aware similarity, and $\alpha$ is a balance parameter, generally set to 0.5. F-measure mainly computes the weighted harmonic mean of the precision and recall of the binarized saliency map:

$F_\beta = \frac{(1+\beta^2)\,\mathrm{Precision} \cdot \mathrm{Recall}}{\beta^2\,\mathrm{Precision} + \mathrm{Recall}} \qquad (5)$

where $\beta^2$ is a hyperparameter used to assign different weights to precision and recall. avgF is the average of the F-measure. MAE evaluates the mean absolute error over all pixels between the saliency map and the ground truth:

$\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |S_i - G_i| \qquad (6)$

where $N$ represents the total number of pixels, $S_i$ represents a saliency-map pixel, and $G_i$ represents a ground-truth pixel.

Tables and visual results
This paper compares the detection results with those of other advanced models and presents them intuitively through tables and visual images.

Figure 3. Visual results

From the data and images above, it can be seen that the proposed method not only maintains advantages over other methods across the various metrics, but also produces better detection results in complex scenes.

Conclusion
To solve the problem of feature dilution in the top-down process, the high-level information is fed back to the low-level features multiple times to achieve optimal fusion of the two. An attention mechanism is used to extract information during fusion, realizing the interaction between high-level and low-level information. Experimental comparison effectively demonstrates the superiority of the proposed method.