Variance-aware spatial decoupling attention for stereo matching

Stereo matching algorithms based on feature matching build a cost volume to regress disparity. Such a cost volume exhibits a regular data distribution along the disparity dimension: well-matched pixels show a single-peak distribution, while pixels in ill-conditioned regions show a multi-peak distribution. Using the variance of the data along the disparity dimension, ill-conditioned regions can be identified. In this paper, an attention mechanism based on variance features is proposed, which helps the network handle complicated regions. By decoupling the spatial dimensions of the image, an attention mechanism of linear complexity is realized. It can be embedded into feature-matching-based algorithms with only a slight increase in inference time, achieving a suitable trade-off between precision and speed. By embedding the module into BGNet, the accuracy of the designed model is improved by 10% while competitive performance is maintained.


Introduction
Stereo matching, as an important branch of 3D reconstruction, has broad applications in robot vision, AR, autonomous driving, and so on. The stereo-matching task recovers the depth information of a 3D scene by calculating the disparity of pixels on the same horizontal epipolar line in a pair of rectified images. Traditional stereo matching pipelines include four procedures: extracting features, constructing a cost volume, cost aggregation, and regressing disparity [1]. However, even though stereo matching algorithms are quite mature in their processing flow, they are still limited in ill-conditioned regions, i.e., untextured and repetitively textured regions, etc. Additionally, existing traditional algorithms and high-precision deep-learning-based algorithms consume considerable computation time.
The introduction of deep learning has lifted the accuracy of stereo-matching algorithms to a new level. Time-series models can effectively establish long-range dependencies, which makes pixel matching more precise [2,3]. RAFT-Stereo [2] designs a multilevel recurrent field using GRU [4] modules, and the results computed by the network retain more detail. Convolutional neural networks can efficiently extract and match the information of left and right image pairs. Different from GCNet [5], which concatenates the feature maps of the left and right images to build a 4D cost volume, DispNetC [6] computes the inner product of the left and right image features. Both of them use 3D convolution for cost aggregation. Experiments [5,6,7,8] show that 3D convolution can effectively aggregate the spatial information of the image, which yields better results in the disparity regression stage. However, both time-series models and deep models based on 3D convolution suffer from high computation cost and long computation time, which seriously limits the industrial deployment of these algorithms.
On the other hand, real-time algorithms in the field of stereo matching are constantly evolving. By trilinearly interpolating a low-resolution (e.g., 1/8) cost volume, StereoNet [9] obtains a high-resolution cost volume from which disparity can be computed quickly. However, due to the low resolution, much image information is lost, so only rough results can be obtained. AANet [10] constructs cost volumes at different scales (e.g., 1/3, 1/6, 1/12) and aggregates inter-scale and intra-scale information, achieving satisfactory performance in terms of both time consumption and accuracy. BGNet [11] designs a learnable bilateral grid so that the up-sampling of the cost volume produces more accurate results without significantly increasing the computation time. However, the designed bilateral grid uses a fixed structure in the slicing stage, which hurts the accuracy of sampling on the cost volume. We therefore design an adaptive up-sampling framework, which gives the network a stronger fitting ability without increasing the inference time.
Table 1. Results on the Scene Flow and KITTI 2015 benchmarks.
Methods         EPE (Scene Flow)   D1 Non-occ (KITTI 2015)   D1 All-pixels (KITTI 2015)   Time (ms)
StereoNet [9]   1.10               -                         4.83                         15
MADNet [12]     -                  4.27                      4.66                         20
DispNetC [13]   1.68               4.05                      4.34                         60
AANet [10]      0.87               2.32                      2.55                         62
DecNet [14]     0.84               2.16                      2.37                         50
BGNet [11]      1.17

Because real-time networks have to take both time overhead and accuracy into account, fewer network layers are preferred. This makes such algorithms more susceptible to ill-conditioned regions. If the network pays equal attention to normal regions and ill-conditioned regions, it may perform poorly when dealing with complicated regions. This problem can be addressed by introducing an attention mechanism. However, global attention has O(H²W²) computational complexity and is not suitable for real-time stereo matching networks. To better identify and handle ill-conditioned regions, we calculate the variance along the disparity dimension of the cost volume to obtain a 2D feature map that retains disparity information. Attention scores are then computed along the horizontal and vertical directions of this feature map, realizing linear-complexity attention that effectively distinguishes the tractable regions from the unmanageable ones.
This paper designs a stereo-matching network that achieves a superb trade-off between accuracy and speed.
• A variance-aware spatial decoupling attention module is designed to effectively identify ill-conditioned regions by calculating the variance along the disparity dimension of the cost volume, so that the model can better handle these regions during disparity regression.
• An adaptive slicing framework is designed so that the cost volume retains more information during up-sampling, without increasing the inference time of the model.
• Based on the variance-aware spatial decoupling attention module and the adaptive slicing framework, we design a real-time model that achieves a superb trade-off between performance and precision.

Method
Cost volumes based on feature matching struggle with ill-conditioned regions, including occlusion, textureless areas, reflections, and so on. When the cost volume is constructed, an ideal feature match shows a unimodal distribution along the disparity dimension, while feature matching in ill-conditioned regions usually presents a multi-modal or uniform distribution [15]. Therefore, the data distribution of an ill-conditioned region along the disparity dimension should have smaller variance. Based on this, we design a variance-aware spatial decoupling attention (SDA) module: the 3D cost volume (ignoring the channel dimension) is pooled into a 2D feature map by computing the variance along the disparity dimension. The feature map is decoupled along the horizontal and vertical directions, global attention scores are calculated for each direction, and the spatial attention is obtained. Finally, the features of the 3D cost volume are masked via the Hadamard product, introducing the attention into the whole cost volume. Thanks to the decoupling operation, the whole calculation requires only minimal computational complexity.
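The core observation, namely that a uniform matching-cost profile has lower variance along the disparity dimension than a sharply peaked one, can be sketched as follows (an illustrative example, not the paper's code):

```python
import numpy as np

disp_levels = 64

# Well-matched pixel: matching cost concentrates in a single sharp peak.
unimodal = np.full(disp_levels, 0.1)
unimodal[20] = 5.0

# Ill-conditioned pixel (e.g. textureless): near-uniform matching cost.
uniform = np.full(disp_levels, 0.5)

# The variance along the disparity dimension separates the two cases:
# the unimodal profile has far larger variance than the uniform one.
print(unimodal.var(), uniform.var())
```

Thresholding (or, as in the SDA module, attending over) this variance map is what lets the network treat the two kinds of pixels differently.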

Variance-aware spatial decoupling attention
The calculation process of the attention module is as follows. First, the variance along the disparity dimension is computed for each pixel to obtain a feature map of shape (C, H, W) that retains the disparity distribution. The obtained map is then pooled along each spatial direction, producing the compressed feature maps F_H and F_W of shapes (C, H, 1) and (C, 1, W). Self-attention is then computed for each of the two feature maps. Note that, to effectively reduce the number of model parameters as well as the computation time, we use a 1×1 pointwise convolution layer to map the features into the three feature spaces Q, K, and V. The specific self-attention formula is shown in Equation 1. We further fuse the two feature maps into one feature map of shape (C, H, W) by a matrix product.
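Equation 1 is not reproduced in this excerpt; assuming the module follows the standard scaled dot-product formulation of self-attention, it would read

\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V, \tag{1}
\]

where \(d_k\) is the channel dimension of \(K\) and the softmax is taken over the positions of the pooled axis. This is an assumption based on the Q/K/V mapping described above, not a quotation of the paper's equation.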
Figure 1.The architecture of variance aware spatial decoupling attention module.The 3D cost volume is squeezed by the variance pooling handle to generate a 2D feature map.The map is further pooled by a mean pooling layer along the horizontal and vertical directions.After computing selfattention, the attention map will be multiplied by the original input.
After obtaining the self-attention feature map, we use the Hadamard product to skip-connect the old feature map to the new one. The new feature map contains global attention information and focuses on the areas with high attention scores in both the H and W directions. Because the data distribution in the attention feature maps becomes too sparse after this multiplication, we use Batch Normalization to re-normalize the data. In this way, the cost volume can screen out the regions we need to focus on. The designed architecture is shown in Figure 1, and Figure 2 shows how to embed it.
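Putting the pieces together, a minimal NumPy sketch of the module might look like the following. The learned 1×1 convolutions are replaced with random projection matrices, and fusing the two axis attentions into a (C, H, W) map via an outer product is our reading of the "matrix dot product" above; all shapes and layer choices are assumptions, not the authors' implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def axis_self_attention(f):
    # f: (C, L) features compressed along one spatial axis.
    # The paper maps features to Q, K, V with 1x1 convolutions; here random
    # projection matrices stand in for those learned layers.
    C, L = f.shape
    rng = np.random.default_rng(0)
    Wq, Wk, Wv = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))
    Q, K, V = Wq @ f, Wk @ f, Wv @ f          # each (C, L)
    scores = softmax((Q.T @ K) / np.sqrt(C))  # (L, L) attention over positions
    return V @ scores.T                       # (C, L)

def sda_module(cost_volume):
    # cost_volume: (C, D, H, W). Variance pooling over the disparity dim D
    # yields a 2D map encoding how peaked each pixel's matching cost is.
    var_map = cost_volume.var(axis=1)              # (C, H, W)
    f_h = var_map.mean(axis=2)                     # (C, H), pooled along W
    f_w = var_map.mean(axis=1)                     # (C, W), pooled along H
    a_h = axis_self_attention(f_h)                 # (C, H)
    a_w = axis_self_attention(f_w)                 # (C, W)
    attn = a_h[:, :, None] * a_w[:, None, :]       # (C, H, W) outer product
    # Broadcast the spatial attention over D and mask the cost volume
    # (Hadamard product); the paper additionally applies BatchNorm here.
    return cost_volume * attn[:, None, :, :]

out = sda_module(np.random.default_rng(1).standard_normal((8, 16, 12, 24)))
```

Because attention is computed only along H and then along W, the cost of the attention maps is O(H² + W²) per channel rather than the O(H²W²) of full spatial attention, which is the linear-complexity property claimed above.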

Adaptive bilateral grid
The cost volume obtained after cost aggregation is used for disparity regression. The learnable bilateral grid designed by BGNet [11] allows disparity regression to better retain edge information in the up-sampling stage. However, the grid designed by BGNet [11] uses the same interval throughout the disparity space, which may limit the fitting ability of the network. We therefore design an adaptive bilateral grid. Different from BGNet's [11] design, we let the network automatically update the interval between the layers of the bilateral grid. This gives the grid layers higher degrees of freedom in their spacing and more potential to deal with complex scenes. The specific design is shown in Figure 3.
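The difference between fixed and learned layer spacing can be illustrated with a small 1-D sketch (our interpretation of the idea; `grid_positions` and the softmax-over-logits parameterization are hypothetical, not BGNet's or this paper's actual code):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def grid_positions(num_layers, max_disp, interval_logits=None):
    """Return the disparity positions of the grid layers in [0, max_disp].

    The vanilla grid places its layers at fixed, equal intervals; the
    adaptive grid derives the intervals from learned logits, so the network
    can concentrate layers where the disparity space needs them.
    """
    if interval_logits is None:                    # vanilla: equal spacing
        intervals = np.full(num_layers, 1.0 / num_layers)
    else:                                          # adaptive: learned spacing
        intervals = softmax(interval_logits)       # positive, sums to 1
    return np.cumsum(intervals) * max_disp

vanilla = grid_positions(8, 64)
adaptive = grid_positions(8, 64,
                          interval_logits=np.array([2., 1., 0., 0., 0., 0., 1., 2.]))
```

Here `vanilla` yields evenly spaced layers, while `adaptive` spends more of its layer budget near the ends of the disparity range; during training the logits would be updated by backpropagation like any other parameter.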

Introduction of Dataset and Metric
Scene Flow A massive, densely annotated synthetic dataset offering 35,454 training image pairs and 4,370 testing pairs. The dataset uses end-point error (EPE) as the evaluation metric. KITTI 2015 A real driving-scenario dataset whose sparse ground truth comes from a LiDAR detector. The dataset uses the percentage of erroneous disparity pixels in the first frame (D1) as the metric.

Experiment manipulation details
We train and run inference on NVIDIA RTX 2080Ti GPUs and carry out the experiments with PyTorch. The optimizer is AdamW with β1 = 0.9 and β2 = 0.999. For Scene Flow, we pre-train the model for a total of 64 epochs. The learning rate is initialized to 0.001, halved after 20 epochs, and then halved again every 10 epochs. For KITTI 2015, we fine-tune the pre-trained model for 300 epochs using the KITTI 2012 (a previous version of the dataset) and KITTI 2015 training sets. Before submitting the result to the benchmark, we train the fine-tuned model for another 300 epochs on the corresponding dataset. For all datasets, we augment the data with color transforms and random crops of size 256 × 512. For a fair comparison, all running times are measured by inference on the KITTI 2015 test set.
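A small helper reflecting our reading of the stated schedule — 0.001 initially, halved after epoch 20 and then again every 10 epochs (the exact boundary handling is an assumption):

```python
def learning_rate(epoch, base_lr=1e-3):
    """Learning rate for a given 0-indexed epoch under the described schedule."""
    if epoch < 20:
        return base_lr
    # One halving at epoch 20, then one more every further 10 epochs.
    halvings = 1 + (epoch - 20) // 10
    return base_lr / (2 ** halvings)

schedule = [learning_rate(e) for e in (0, 19, 20, 29, 30, 63)]
```

So the final pre-training epochs (epoch 63 of 64) run at base_lr / 32 under this reading.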

Ablation study results
The ablation experiments were carried out on the Scene Flow dataset, as shown in Table 2. When we replace the vanilla grid with the adaptive bilateral grid, the EPE decreases effectively while the running time stays the same. The introduction of attention significantly improves the performance of the network, with only a slight increase in time consumption. On the other hand, simple normalization of BGNet's [11] feature maps does not hurt the performance of the network, but it is necessary for our design.

Benchmark performance
We evaluated the EPE metric on the Scene Flow test set, as well as the D1 Non-occ and D1 All-pixels metrics on the KITTI 2015 validation set. The metric and running-time comparisons of the stereo-matching networks are shown in Table 1.

Scene Flow
We achieve 1.04 in the EPE metric; this may be because the SDA module is better at dealing with ill-conditioned regions. Some networks perform better than ours due to the construction of large-scale cost volumes, but we have a clear advantage in inference time.
KITTI 2015 We achieve competitive performance on the KITTI 2015 dataset. As shown in Table 1, we achieve 2.25 in the D1-All metric. Furthermore, there is only a small gap between D1-Non-occ and D1-All, which shows that the model handles occluded areas well. The visualization results are shown in Figure 4. Benefiting from the adaptive bilateral grid, the model performs satisfactorily on both foreground and background.

Conclusion
In this article, two novel modules have been proposed to aggregate the cost volume and refine the disparity, which improve the accuracy of the model without an obvious increase in complexity. The design makes the model pay attention to ill-conditioned regions with few extra parameters and little extra inference time. Experimental results show that the modules are beneficial when inserted into a network. In all, an ideal balance between inference speed and accuracy has been achieved. However, the real world contains more complex scenarios, so we plan to investigate domain generalization in future work.

Figure 2.
Figure 2. The way the module is embedded in the network. The grouped volume features are fed into the SDA module and then aggregated in the hourglass-shaped convolution layers.

Figure 3.
Figure 3. The structure of the vanilla bilateral grid (left) and the adaptive bilateral grid (right). Compared with the vanilla structure, the adaptive bilateral grid has more flexible spacing.

Figure 4.
Figure 4. Experimental visualization results on KITTI 2015. The bottom row shows the error maps. In the first and second columns, our error maps are better in occluded regions. The third column shows that our results have smaller errors in reflective areas.

Table 2.
The ablation results were tested on Scene Flow. Norm, SDA, and AdaBG denote feature normalization, the variance-aware spatial decoupling attention module, and the adaptive bilateral grid, respectively.