An improved semi-global matching method with optimized matching aggregation constraint

Stereo dense image matching (DIM) is a key technique in generating dense 3D point clouds at low cost, among which semi-global matching (SGM) is one of the best compromise between the matching accuracy and the time cost. Most commercial or open-source DIM software packages therefore adopt SGM as the core algorithm for the 3D point generation, which computes matching results in 2D image space by simply aggregating the matching results of multi-directional 1D paths. However, such aggregations of SGM did not consider the disparity consistency between adjacent pixels in 2D image space, which will finally decrease the matching accuracy. To achieve higher-accuracy while keep the high time efficiency of SGM, this paper proposes an improved SGM with a novel matching aggregation optimization constraint. The core algorithm formulates the matching aggregation as the optimization of a global energy function, and a local solution of the energy function is utilized to impose the disparity consistency between adjacent pixels, which is capable of removing noises in the matching aggregation results and increasing the final matching accuracy at low time cost. Experiments on aerial image dataset show that the proposed method outperformed the traditional SGM method and another improved SGM method. Compared with the traditional SGM, our proposed method can increase the average matching accuracy by at most 11%. Therefore, our proposed method can applied in some smart 3D applications, e.g. 3D change detection, city-scale reconstruction, and global survey mapping.


Introduction
Dense image matching (DIM) is to find pixelwise correspondences between stereo pairs. DIM is a traditional but still interesting topic in remote sensing and photogrammetry communities. DIM is able to compute dense 3D point clouds at low cost, thus having fueled many smart 3D applications, e.g. global survey mapping, city-scale reconstruction, virtual reality, 3D change detection, robot navigation, and some RGB-D data-based works [1][2][3][4].
Semi-global matching (SGM) is the most popular DIM method, which formulates DIM as the optimization of a global energy function, then breaks the global optimization into multi-directional 1D 2020 The Third International Workshop on Environment and Geoscience IOP Conf. Series: Earth and Environmental Science 569 (2020) 012050 IOP Publishing doi: 10.1088/1755-1315/569/1/012050 2 optimizations along 1D paths in image space and finally computing the approximate solutions of the global energy function by simply aggregating the matching results of the 1D optimizations [5]. In general, the workflow of SGM concludes: (1) matching cost computation, which formulates the appearance (e.g. intensities, textures) similarity of a matcher between stereo pairs as matching cost; (2) matching cost propagation, which conducts pixelwise matching by 1D optimization in a dynamic programming function; (3) matching results aggregation, which simply accumulates 1D matching results from multiple 1D paths; and (4) matching results computation and refinement, which compute the matching results with the minimum aggregation results. If epipolar image pairs are available in stereo dense image matching, the matching results of each pixel are their disparities (or the parallax in the epipolar space). Though proposed in 2007, SGM is still one of the best compromise between the matching accuracy and the time cost. Some state-of-the-art matching algorithms [6][7][8] achieved higher matching accuracy than SGM, but they also have much higher matching time cost. The high time cost of these state-of-the-art algorithms limit their applications in city-scale reconstruction. Therefore, most image-based 3D reconstruction software packages, e.g. Photoscan, SURE, ENVI, Erdas, Micmac, still adopt SGM as the core algorithm for city-scale 3D reconstruction [9]. However, the 1D optimization results of SGM are not robust, since they lack the matching constraints in the perpendicular direction of the 1D path.
To improve the matching accuracy of SGM, some work has been done to improve the workflow of SGM. The matching cost computation step of SGM can be improved by using convolutional neural network (CNN) [10][11], combined cost metrics [12][13], or local filtering methods [14][15][16] to compute more accurate matching cost. The matching cost propagation of SGM is unreliable, due to its 1D optimization paths. The unreliability problem of the 1D optimization paths of SGM can addressed by more global cost propagation paths [17][18] or more robust 1D optimization strategy [19][20]. The final matching results of SGM can also be improved through post processing. They either remove the noises on the surface of the matching results [21][22] or refine the building boundaries in the matching results [23][24]. All the above methods are able to improve the matching accuracy of SGM. However, few work payed attention to the matching results aggregation step. Since traditional SGM only keep the local disparity consistency along the 1D paths, the matching aggregation results of adjacent pixels in 2D image space may not be consistent, which will decrease the final matching accuracy.
To further improve the matching accuracy of SGM while keep its high time efficiency, this paper focuses on improving the matching aggregation of SGM, and proposed an efficient matching optimization aggregation method. The core algorithm formulates the matching aggregation problem as the optimization of a global energy function, and utilized a local solution to impose disparity consistency between adjacent pixels in 2D image space, which is able to improve the final matching accuracy at low time cost.
The remaining of the paper is organized as follows. Section 2 introduces the methodology of the proposed method, Section 3 show the comparison experiments between SGM and the proposed method, and Section 4 concludes the whole paper.

Brief introduction to sgm
SGM [5] is one of the most famous matching methods in remote sensing and photogrammetry communities. It formulates the matching problem as the optimization of a global energy function, and breaks the global optimization into sub-optimization in multi-directional 1D paths, as shown in Figure  1. In each 1D path, it applies a dynamic programming in Equation (1) for matching cost aggregation, where some window-based matching cost metrics (e.g. census [25] in this paper) can be used as matching cost due to their robust performance against luminance variation.  Figure 1. 1D optimizations of SGM in multiple paths. P is a pixel in image, which has eight cost propagation results from the eightdirectional paths [5].
where, ( ) is the cost propagation results of pixel at disparity in the direction ; is a disparity in epipolar space. Given epipolar stereo pairs, the matcher only have differences in column coordinates: with , being the column coordinates in the left and the right images. ( ) is the matching cost of pixel at disparity ; is the previous pixel of along the 1D path in direction ; and are penalty terms that penalize disparity variations between adjacent pixels, where ; , are any disparity values. After the matching cost propagation in multi-paths, SGM simply aggregates the cost propagation results ( ) in all directions , as shown in Equation (2). The pixelwise matching results are then derived from the matching aggregation results by using winner-takes-all strategy. ( where, ( ) is the matching aggregation results of pixel at disparity .

Matching aggregation optimization
2.2.1. Problem formulation. SGM only imposes local disparity consistency between adjacent pixels along 1D paths, therefore, it often meets streak problems in the matching results of each single path, as shown in Figure 2. Figure 2 shows that the direction of the streaks is consistent with the cost propagation directions, which will greatly decrease the matching accuracy. In the matching aggregation step, SGM assign equal weights to the cost propagation results of all directions. Therefore, the final disparity of adjacent pixels may not be consistent in the 2D image space, due to the ambiguity among the cost propagation results of all directions.
(a) Horizontal direction (b) Vertical direction (c) Diagonal direction  To solve the ambiguity in the matching aggregation step, we impose disparity smoothness constraints over the whole image, and formulate the matching aggregation problem as the optimization of a global energy function, as shown in Equation (3).
where, ( ) is the global energy function with being the set of disparities of all pixels; is the set of all adjacent pixels in 2D image space; is an adjacent pixel of ; is the disparity of ; is a penalty term. The global energy function means that the final matching results should have small aggregation results and consistent disparities with their adjacent pixels. (3) is a typical NP-hard problem. We can use bilateral filtering [16] to compute its local optima. Mozerov et al. [26] also proved such local filter is the approximate solution of the global energy function in Equation (3). However, the performance of the bilateral filtering depends on the filtering window sizes. Large window will bring high computational cost, while small window may not improve the final matching accuracy. To compute the solution of Equation (3) in a polynomial time while obtain high matching accuracy, we finally utilize a fast bilateral filtering strategy [27] in the matching aggregation optimization. In general, fast bilateral filtering formulates the traditional bilateral filtering as the non-local cost propagations in orthogonal scanning lines. The cost in the fast bilateral filtering is the matching aggregation results of SGM. In each scanning line, the cost propagation is conducted with the underlying assumptions that adjacent pixels with similar intensities should have the same disparities. Fast bilateral filtering only considers the intensity factor in the matching optimization. However, in intensity homogenous but disparity inconsistent regions, such assumption of the fast bilateral filtering will cause mismatches. Therefore, we add a spatial control factor in the propagation function of the fast bilateral filtering so that the matching of each pixel will not be influenced by the ones that are far away. The fast bilateral filtering function with the spatial control factor is shown as follows.

Approximate solution. The global energy function in Equation
where, is the direction of propagation path (e.g. the horizontal direction); is the propagated cost in direction ; is a pixel in the basic image; is the previous pixel of in direction ; ( ) is the intensity weight, which is inversely proportional to the intensity difference between and -. ( ) is normally defined as Gaussian function. is a spatial control factor, which is set as 0.7 in all experiments.
In fast bilateral filtering, the cost of each pixel is propagated through non-local, orthogonal scanning lines. The cost propagation firstly proceeds in both horizontal ( ) and the corresponding inverse direction (termed as -) in Equation (4), and then the propagated cost in these directions are aggregated to get the sub-optimization in horizontal directions, as shown in Equation (5).
where and are propagated cost in and -directions; is the aggregated cost which is the sum of propagated cost in both directions. Since both

Experiment
Our proposed method was tested on aerial dataset in Vaihingen, which consists of 6 stereo pairs, as shown in Figure 3. The dataset was provided by the German Society for Photogrammetry, Remote Sensing and Geoinformation (DGPF) in the international society for photogrammetry and remote sensing (ISPRS) benchmark, captured by an Intergraph / ZI digital mapping camera (DMC) (GSD: 0.08m) and a Leica airborne laser system (ALS)50 system (point density: 4 points/m2). True disparities of corresponding stereo pairs were generated by filtering inconsistent parts of light detection and ranging (LiDAR) points (e.g. occlusions, temporal changes) in a dense matching framework [28] and then projecting the consistent LiDAR points onto epipolar space.
In general, we respectively used the traditional SGM [5], an improved SGM which provided more global propagation paths for SGM [18], and the proposed method to estimate the disparities of the six stereo pairs. We then evaluated their matching performances by (1) computing the percentage of matches whose matching errors are smaller than 2 pixels (also termed as Acc-2), (2) averaging these matching errors (also termed as Acc-avg), and (3) counting their running time. The matching accuracy of the three methods are shown in Figure 4. In all experiments, the above three methods proceeded with only one single CPU configuration (i7 @2.60GHz). Figure 3. Testing data in our experiment. Acc-avg measures the average matching errors for each stereo pair, thus lower Acc-avg represents higher matching accuracy. On the other hand, Acc-2 measures the percentages of good pixels whose matching errors were smaller than 2 pixels, thus higher Acc-2 means higher matching accuracy. In Figure 4, SGM performed worst in both Acc-avg and Acc-2 metrics, since it only used 1D optimization paths and did not consider the ambiguity in the matching aggregation. The more global SGM used more global propagation paths for matching, thus achieving higher matching accuracy than SGM. However, it still assign equal weights to all propagation directions in the matching aggregations. The ambiguity in cost propagation results of each direction will decrease the final matching accuracy. The proposed method adopted the same propagation paths as SGM (i.e. the eight cost propagation paths), while it efficiently optimized the matching aggregation results to reduce the ambiguity, which is able to improve the final matching accuracy. In both metrics, the proposed method outperformed    Figure 5 shows that SGM achieved the lowest time cost. Since the proposed method additionally optimized the matching aggregation results through the fast bilateral filtering technique, it had a little higher time cost than SGM. In the DIM of stereo pairs (pair 2 -6) with the image size as 1500x1500 pixels, the average additional filtering time is only 8.20 s under only one single CPU configuration. If a parallel framework can be applied, the filtering time can be further reduced by several times. The more global SGM method achieved the highest time cost, since it optimized matching results through more global and more complicated paths. In overall consideration of the Figure 4 and Figure 5, the proposed method is able to achieved the best matching accuracy and the second best time cost. Therefore, it is a good compromise between the matching accuracy and the time cost.
For more comprehensive comparisons between SGM and the proposed method, we selected the matching results of SGM and the proposed method in some typical regions (i.e. grounds and buildings), as shown in Figure 6.  Figure 3. Figure 6 shows that the surfaces in the SGM matching results are noisy, especially in the grounds and the gable roofs. It is because the multi-directional propagation results contained ambiguities, due to their local disparity smoothness constraints along the 1D paths. Our proposed method locally optimized the matching aggregation results to reduce the ambiguity, thus achieving much smoother surfaces in the matching results.
However, the proposed method is essentially a pixelwise matching algorithm, which can get high matching accuracy in textured regions. However, in weak textured regions, the matching window of a single pixel may contain high matching uncertainties, which will decrease the final matching accuracy. Therefore, the proposed method may perform badly in some weak textured regions, e.g. rivers, lakes, and snow.