Research advances in VSLAM for dynamic environments

Simultaneous Localization and Mapping (SLAM), which has been widely utilized in a number of different fields, including unmanned vehicles, path planning, and robotics, has long been a topic of significant concern to the computer vision community. Visual SLAM (VSLAM) relies only on cameras as sensors. Compared with lidar SLAM, its data collection is convenient and low-cost, making it the most representative technical direction in SLAM research. Traditional VSLAM mainly targets static or slow-moving objects. In real life, however, most objects are moving, so traditional VSLAM performs poorly in dynamic environments. Benefiting from the development of deep learning, VSLAM in dynamic scenes has made breakthroughs in both accuracy and robustness. In this paper, the concept of SLAM and the development history of VSLAM are briefly introduced, and the common methods used for dynamic region detection in semantic VSLAM are then described in detail; these mainly include methods based on deep learning, methods based on optical flow/scene flow, and methods based on multi-view geometry. In addition, the existing datasets and evaluation metrics are introduced. At the end of the paper, the current problems and shortcomings of semantic VSLAM are pointed out, and future directions are discussed.


Introduction
Simultaneous Localization and Mapping (SLAM) has always been a major topic in the fields of robotics and computer vision; it aims to locate an agent in space while perceiving the surrounding environment to build a map. SLAM is widely used in daily life, for example in automatic navigation, autonomous driving, and path planning. Among all SLAM technologies, visual SLAM (VSLAM), which uses only cameras for positioning and mapping, is the most widely used. Compared with lidar SLAM, the data acquisition of VSLAM is convenient and low-cost. Therefore, VSLAM has gradually become the most active direction in SLAM research.
As shown in Fig. 1, a traditional VSLAM system consists of front-end visual odometry, back-end nonlinear optimization, loop closure detection, and mapping. The visual odometry estimates the camera pose from visual input alone. The back-end optimization receives the visual odometry's measurements of the camera pose at different times, together with the loop closure detection results, optimizes them, and obtains a globally consistent trajectory and map. Loop closure detection judges whether the camera has returned to a previously visited location, which helps eliminate accumulated drift. Classic SLAM algorithms assume that all objects in the environment are static. Currently, the standard design principle for SLAM systems in dynamic environments is to regard dynamic objects as outliers, separate them from the scene, and then process the remainder with traditional SLAM methods. However, this often discards the context of moving objects across image frames, which seriously restricts the large-scale practical application of dynamic SLAM systems, especially front-end registration in dynamic environments. Specifically, point-to-point/point-to-feature/point-to-grid/NDT registration methods are mostly based on the static assumption; in practice, if the proportion of dynamic points is too high, trajectory accuracy declines. Dynamic points must therefore be identified and removed in real time before or during registration. Traditional point recognition generally eliminates points that deviate too far during the registration iterations. Currently, it is more typical to detect dynamic targets directly from the point cloud using deep-learning-based dynamic region identification.
Commonly used dynamic region detection methods fall into three main categories: those based on multi-view geometry, those based on optical flow/scene flow, and those based on deep learning. Widely used deep learning approaches include instance segmentation and object detection: object detection finds all objects of interest in an image and marks their locations with bounding boxes, while instance segmentation delineates each object at the pixel level. Methods based on multi-view geometry work by constraining the poses across several frames and removing feature points with significant errors. Methods based on optical flow incorporate semantic information to create an initial static background, and then use the optical flow field between two images to calculate the average direction of motion. The scene flow method is a three-dimensional extension of the optical flow method.
Focusing on the above dynamic region detection methods, this paper introduces representative dynamic SLAM algorithms in detail, including their main design ideas, basic frameworks, advantages, and disadvantages. We further summarize the existing issues in dynamic SLAM systems and discuss their possible future developments.

Methods based on multi-view geometry
Multi-view geometry is a commonly used approach to detecting dynamic objects: it exploits the relationships between images taken from different viewpoints to study the geometry between cameras or between features. In visual SLAM, the most common tool is the epipolar constraint, which mainly checks whether a matched feature point lies close to its epipolar line. When the feature point is sufficiently far from the epipolar line, the point is drifting and can be identified as a dynamic feature point. Fig. 2 illustrates the fundamental concept of the epipolar constraint. Assume the camera observes the same spatial point P from two different viewpoints; according to the pinhole camera model, the pixel coordinates $p_1$ and $p_2$ of P in the two images should satisfy equations (1) and (2):

$$p_1 = KP \qquad (1)$$

$$p_2 = K(RP + t) \qquad (2)$$
where K is the camera intrinsic matrix, t is the translation vector, and R is the rotation matrix. In this ideal case, the coordinates of the matching points in the two images should satisfy the constraint in equation (3):

$$p_2^T F p_1 = 0, \qquad F = K^{-T} [t]_{\times} R K^{-1} \qquad (3)$$
where F is the fundamental matrix. Nevertheless, the images captured by the camera contain considerable distortion and noise, which prevents points in subsequent frames from lying exactly on the corresponding epipolar line.
The epipolar line $l_1$ in the current frame can be expressed as equation (4), and the distance D from the point $p_2$ to the epipolar line $l_1$ is given by equation (5):

$$l_1 = F p_1 = [X, Y, Z]^T \qquad (4)$$

$$D = \frac{|p_2^T F p_1|}{\sqrt{X^2 + Y^2}} \qquad (5)$$
If the distance D exceeds a threshold, the point is considered to violate the epipolar constraint and is treated as dynamic. The epipolar constraint also combines well with the optical flow method: since no depth information is required for the feature points, all feature points can be checked.
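As a minimal sketch of this check, assuming pts1 and pts2 are matched pixel coordinates from two consecutive frames, the fundamental matrix can be estimated robustly with OpenCV and each match tested against equation (5):

```python
import numpy as np
import cv2

def detect_dynamic_points(pts1, pts2, dist_thresh=1.0):
    """Flag matches whose distance to the epipolar line exceeds a
    threshold, marking them as candidate dynamic feature points.

    pts1, pts2: (N, 2) float32 arrays of matched pixel coordinates
    in the previous and current frame."""
    # RANSAC tolerates a moderate fraction of dynamic (outlier) matches.
    F, _ = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)

    # Epipolar line in the current frame: l1 = F p1 = [X, Y, Z]^T.
    ones = np.ones((pts1.shape[0], 1))
    lines = (F @ np.hstack([pts1, ones]).T).T            # (N, 3)

    # Point-to-line distance D = |p2^T F p1| / sqrt(X^2 + Y^2), eq. (5).
    num = np.abs(np.sum(lines * np.hstack([pts2, ones]), axis=1))
    den = np.sqrt(lines[:, 0] ** 2 + lines[:, 1] ** 2)
    return (num / den) > dist_thresh                     # True = dynamic
```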

Methods based on optical flow/scene flow
2.2.1. Optical flow. Gibson first proposed optical flow in 1950. The optical flow approach determines the correspondence between the previous frame and the current frame from the correlations between subsequent frames and the pixel changes over time in the image sequence, and from this it calculates the velocity of objects between consecutive frames [1]. At present, the Lucas-Kanade algorithm (LK algorithm for short) is the most widely used. The LK algorithm rests on three premises: 1) consistent brightness between adjacent frames; 2) temporal continuity between adjacent frames, i.e., minimal object motion between them; and 3) spatial consistency within a neighborhood.
Assuming the displacement is constant over time in the neighborhood of point P, equations (6)-(8) should be satisfied:

$$I(x, y, t) = I(x + dx, y + dy, t + dt) \qquad (6)$$

$$I(x + dx, y + dy, t + dt) \approx I(x, y, t) + I_x\,dx + I_y\,dy + I_t\,dt \qquad (7)$$

$$I_x u + I_y v + I_t = 0 \qquad (8)$$

where $I_x$, $I_y$, $I_t$ are the partial derivatives of the image intensity and $(u, v) = (dx/dt, dy/dt)$ is the optical flow of the point.
Stacking equation (8) over all pixels in the neighborhood yields the overdetermined linear system of equation (9), which is then solved by least squares using $A^T A v = A^T b$, i.e. $v = (A^T A)^{-1} A^T b$, as equation (10):

$$A v = b, \qquad A = \begin{bmatrix} I_{x_1} & I_{y_1} \\ \vdots & \vdots \\ I_{x_n} & I_{y_n} \end{bmatrix}, \quad b = -\begin{bmatrix} I_{t_1} \\ \vdots \\ I_{t_n} \end{bmatrix} \qquad (9)$$

$$v = (A^T A)^{-1} A^T b \qquad (10)$$
The optical flow approach has the benefit of precisely identifying the position of a moving object without any prior knowledge of the scene. Additionally, it conveys three-dimensional structural information about the scene and the moving object. Existing methods combine the classical spatial pyramid formulation with deep learning to compute optical flow: at each pyramid level, one image of a pair is warped with the current flow estimate and an update to the flow is computed, with a deep network trained at each level. Large motions are handled by the pyramid, which is more efficient and well suited to embedded applications.
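As a minimal sketch of pyramidal LK tracking, the OpenCV call below solves the least-squares system of equations (9)-(10) on each level of an image pyramid; the two input frames are placeholders:

```python
import cv2

# Placeholder consecutive grayscale frames.
prev_gray = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)
cur_gray = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)

# Corners to track in the previous frame.
prev_pts = cv2.goodFeaturesToTrack(prev_gray, maxCorners=500,
                                   qualityLevel=0.01, minDistance=7)

# Pyramidal LK: winSize is the spatial-consistency neighborhood,
# maxLevel the pyramid depth used to cope with large motions.
cur_pts, status, err = cv2.calcOpticalFlowPyrLK(
    prev_gray, cur_gray, prev_pts, None,
    winSize=(21, 21), maxLevel=3)

ok = status.flatten() == 1
flow = cur_pts[ok] - prev_pts[ok]   # per-feature motion vectors (u, v)
```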

Scene flow.
Vedula first proposed the concept of scene flow in 1999, and Spies later proposed the flow constraint, which brings depth into the brightness constraint. The commonly used variants are binocular-vision scene flow and scene flow computed from RGB-D images. As shown in Fig. 4, the system first calibrates the binocular camera and then rectifies the original images according to the calibration result, which makes the two images parallel to each other on the same plane; it then performs pixel matching on the rectified images and finally, from the matching results, computes the depth of each pixel to produce a depth map. The RGB-D scene flow method is based on the three basic assumptions of optical flow and extends the two-dimensional optical flow to three dimensions.
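The following is a minimal sketch of the binocular pipeline described above, assuming the input pair is already rectified; the focal length and baseline are placeholder calibration values:

```python
import cv2
import numpy as np

def stereo_depth(left_gray, right_gray, focal_px=700.0, baseline_m=0.12):
    """Dense depth from a rectified grayscale stereo pair."""
    # Semi-global block matching produces a disparity map.
    matcher = cv2.StereoSGBM_create(
        minDisparity=0,
        numDisparities=128,   # must be divisible by 16
        blockSize=5)
    # SGBM returns 16x fixed-point disparities.
    disparity = matcher.compute(left_gray, right_gray).astype(np.float32) / 16.0

    # Standard stereo relation: Z = f * b / d.
    with np.errstate(divide="ignore"):
        depth = focal_px * baseline_m / disparity
    depth[disparity <= 0] = 0.0   # mark unmatched pixels invalid
    return depth
```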

Methods based on deep learning

2.3.1. Object detection.
Object detection finds all potential targets in an image and marks each object's position with a bounding box. Representative object detection methods include SSD, YOLOv3, and YOLOv4.
Wei Liu et al. proposed the Single Shot MultiBox Detector (SSD) [2] in 2016, a method that detects objects in images with a single deep neural network. As shown in Fig. 4, SSD uses VGG16 as the base model, appends convolutional detection layers, and detects on multi-scale feature maps: small feature maps are responsible for detecting large targets, while larger feature maps detect smaller ones. The detection outcomes are extracted directly from the different feature maps by a fully convolutional network; prior (default) boxes are set, and each prior box of each cell outputs an independent set of detection values. SSD detects on six feature maps of different scales, which improves detection accuracy while running fast.

YOLOv3 adopts a new network structure, Darknet-53, built from residual modules. Each residual module consists of two convolutional layers plus a shortcut connection, and the modules are repeated 1, 2, 8, 8, and 4 times at successive stages; there is no fully connected layer. Setting a convolution's stride to 2 downsamples the network, halving the image size each time it passes through such a layer. Compared with the previous version, YOLOv3 also makes use of multi-scale feature maps to increase detection precision. Continuing the k-means clustering of YOLOv2, three prior boxes are set for each downsampling scale, for a total of nine clustered prior-box sizes. YOLOv3's advantages are its C-language implementation, which significantly increases running speed, and the multi-scale feature maps, which enhance detection precision.

Building on YOLOv3, YOLOv4 adds several fresh ideas to enhance performance. YOLOv4 uses CSPDarknet-53 as the backbone; each CSP block can be split into two parts, where Part 1 stacks residual blocks and Part 2 resembles a shortcut edge that, after a little processing, is attached directly to the end. Its purpose is to improve the CNN's capacity for learning and lower the algorithm's memory requirements. YOLOv4 also incorporates an SPP module to fuse feature maps of different scales, which effectively enlarges the receptive field of the backbone features and separates out significant context features. YOLOv4's advantages are that it requires less computation, improves precision, and further increases running speed.
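As a concrete illustration of running such a one-stage detector, the sketch below loads a Darknet-format YOLO model through OpenCV's dnn module and decodes its multi-scale outputs; the file names and thresholds are placeholders, not fixed by the papers above:

```python
import cv2
import numpy as np

# Hypothetical paths: any Darknet-format config/weights pair works here.
net = cv2.dnn.readNetFromDarknet("yolov4.cfg", "yolov4.weights")

img = cv2.imread("frame.png")                # placeholder input frame
h, w = img.shape[:2]
blob = cv2.dnn.blobFromImage(img, 1 / 255.0, (416, 416),
                             swapRB=True, crop=False)
net.setInput(blob)
# One output tensor per detection scale (the multi-scale feature maps).
outs = net.forward(net.getUnconnectedOutLayersNames())

boxes, scores = [], []
for out in outs:
    for det in out:                          # [cx, cy, bw, bh, obj, cls...]
        cls_scores = det[5:]
        conf = float(det[4] * cls_scores.max())
        if conf > 0.5:
            cx, cy = det[0] * w, det[1] * h
            bw, bh = det[2] * w, det[3] * h
            boxes.append([int(cx - bw / 2), int(cy - bh / 2),
                          int(bw), int(bh)])
            scores.append(conf)

# Non-maximum suppression keeps one box per object.
keep = cv2.dnn.NMSBoxes(boxes, scores, score_threshold=0.5,
                        nms_threshold=0.4)
```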

2.3.2. Semantic segmentation.
Semantic segmentation, which classifies every pixel in an image, is widely applied in fields such as autonomous driving and medicine. At present, most semantic segmentation algorithms are built on deep neural networks (DNNs). There are two main kinds of semantic segmentation methods. One simply classifies each pixel and is called pixel-level semantic segmentation. The other must additionally separate pixels of the same class into distinct instances, and is called instance-level semantic segmentation. In the field of VSLAM, common pixel-level algorithms include SegNet [3] and PSPNet [4], and common instance-level algorithms include Mask R-CNN [5] and YOLACT.
SegNet is a semantic segmentation network obtained by modifying VGG16 on the basis of FCN. Its architecture has two main parts: an encoder and a decoder. The encoder network drops the last three layers of VGG16 and keeps the first 13 convolutional layers; each encoder layer is matched by a decoder layer. Finally, a softmax multi-class classifier outputs the maximum-probability class for each pixel to complete the segmentation. Badrinarayanan et al. proposed using the max-pooling indices computed in the pooling step of the corresponding encoder layer to perform nonlinear upsampling, which makes the computation much more efficient. In the encoder's pooling operation, the location of each maximum is recorded; in the decoder network, the corresponding indices are used for nonlinear upsampling, so nothing needs to be learned at the upsampling stage. The upsampled result is a sparse feature map, which a convolution then turns into a dense feature map, and the process is repeated. SegNet's main point of comparison is FCN, which uses a deconvolution operation to obtain the feature map while decoding and produces its output by adding the corresponding encoder feature map. SegNet's advantages are that it does not need to store the feature maps of the whole encoder, only the max-pooling indices, which saves memory, and that it needs no deconvolution and no learning in the upsampling phase, although a convolution follows each upsampling. SegNet does not improve on other networks in accuracy, but considering its actual memory and time consumption, its overall performance is quite good.
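The max-pooling-indices trick is easy to show in isolation. Below is a minimal PyTorch sketch (PyTorch is an assumption, not SegNet's original implementation) of an encoder pooling step whose recorded indices drive the decoder's non-learned upsampling:

```python
import torch
import torch.nn as nn

# Encoder pooling that remembers where each maximum came from.
pool = nn.MaxPool2d(kernel_size=2, stride=2, return_indices=True)
# Decoder upsampling that reuses those indices (nothing to learn).
unpool = nn.MaxUnpool2d(kernel_size=2, stride=2)

x = torch.randn(1, 64, 32, 32)        # an encoder feature map
pooled, indices = pool(x)             # (1, 64, 16, 16) + argmax positions
sparse = unpool(pooled, indices)      # sparse map: values only at argmaxes
# A trailing convolution densifies the sparse upsampled map.
dense = nn.Conv2d(64, 64, kernel_size=3, padding=1)(sparse)
```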
On the basis of FCN, Zhao et al. presented the pyramid pooling module (PPM) in 2017, inserting it between the encoder and decoder modules of FCN. The principle is that after the encoder produces the convolutional feature map, kernels of different sizes are used for pooling; bilinear interpolation then upsamples each pooled map back to the size of the feature map before pooling, and concatenation followed by convolution yields the result. The authors adopted dilated convolution and global average pooling in the network to enlarge the receptive field, capturing not only the shallow features but also the deep features of the image. In addition, the authors improved on ResNet-101: besides the final softmax classification loss, an auxiliary loss was added in the fourth stage; the two losses are propagated together with different weights to jointly optimize the parameters and accelerate convergence.

He et al. extended Faster R-CNN and proposed Mask R-CNN to perform instance segmentation. Mask R-CNN adds a branch to the original design, an FCN applied to each RoI, which segments each pixel at little overall overhead. Training is similar to Faster R-CNN; compared with it, Mask R-CNN appends a mask head in parallel with the classification and regression heads. A key feature is the RoIAlign layer, which replaces RoIPool to prevent the pixel-level misalignment caused by spatial quantization; the authors also use ResNeXt-101 with a Feature Pyramid Network (FPN) as the backbone to increase accuracy and speed. With FPN, the region proposal network uses 5 anchor scales and 3 aspect ratios, keeping the original loss function. Mask R-CNN adds an extra branch for instance segmentation at small overhead; the algorithm is simple and flexible to train and generalizes well to keypoint detection, human pose estimation, and other applications. However, it still falls short of real-time performance (>30 FPS).
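The pyramid pooling module described above can be sketched compactly in PyTorch (an assumed framework; the bin sizes 1, 2, 3, 6 follow the PSPNet paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidPooling(nn.Module):
    """Pool at several bin sizes, reduce channels with 1x1 convs,
    upsample bilinearly back to the input size, and concatenate."""
    def __init__(self, in_ch, bins=(1, 2, 3, 6)):
        super().__init__()
        out_ch = in_ch // len(bins)
        self.stages = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(b),
                          nn.Conv2d(in_ch, out_ch, 1, bias=False))
            for b in bins)

    def forward(self, x):
        h, w = x.shape[2:]
        feats = [x] + [
            F.interpolate(stage(x), size=(h, w), mode="bilinear",
                          align_corners=False)
            for stage in self.stages]
        return torch.cat(feats, dim=1)   # shallow + multi-scale context

# e.g. a (1, 2048, 60, 60) backbone map becomes (1, 4096, 60, 60).
```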
In 2019, Bolya et al. from the University of California proposed YOLACT, a real-time instance segmentation architecture based on a fully convolutional model. Unlike Mask R-CNN, YOLACT builds on a one-stage detector: it appends a mask branch for instance segmentation and gives up the explicit feature-localization step. YOLACT accomplishes instance segmentation mainly by combining the results of two simple parallel tasks. One branch uses an FCN to generate prototype masks of the same size as the original image, while the other computes per-instance mask coefficients by adding a head network to the object detection branch. The final bounding boxes are selected by the non-maximum suppression (NMS) algorithm, and the predicted masks are obtained as a linear combination of the two branches' outputs. In terms of speed, on a single Titan Xp, YOLACT achieves 33 FPS and 29.8 mAP on the MS COCO instance segmentation task. The experimental data show that YOLACT is a fast, end-to-end instance segmentation model.
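The linear combination at the heart of YOLACT reduces to a single matrix product; the sketch below shows just that step (cropping by the predicted box and thresholding, which YOLACT also applies, are omitted):

```python
import numpy as np

def assemble_masks(prototypes, coeffs):
    """prototypes: (H, W, k) prototype masks from the FCN branch.
    coeffs: (n, k) per-instance mask coefficients from the head.
    Returns (H, W, n) soft instance masks."""
    logits = prototypes @ coeffs.T           # linear combination
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid -> soft masks
```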

Common datasets for VSLAM
To compare the effectiveness of different VSLAM algorithms, public datasets are needed for testing. Below are some commonly used datasets for dynamic scenarios.
The most prevalent dataset in the autonomous driving setting is KITTI [6], created jointly by the Karlsruhe Institute of Technology in Germany and the Toyota Technological Institute at Chicago. The dataset includes radar scans, high-precision GPS information, IMU acceleration information, and other modalities; the authors gathered 6 hours of real traffic scenes. On the dataset website, the authors also offer benchmarks for optical flow, object detection, depth estimation, and other tasks.
The TUM RGB-D dataset contains 39 sequences collected in diverse indoor settings and provides a variety of data for different uses. Within it, the Dynamic Objects category contains nine sequences with ground truth, as well as validation sequences for each trajectory. The dataset covers four camera motion patterns (xyz, rpy, halfsphere, static) and two degrees of dynamics (sitting, walking). This dynamic dataset is widely used to evaluate the positioning accuracy of dynamic SLAM systems. It is worth mentioning that the dataset website provides an online evaluation tool: by uploading your own trajectory, you can obtain the various metrics.
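TUM trajectories (both the ground truth and estimates submitted to the evaluation tool) use a simple text format, one pose per line: timestamp tx ty tz qx qy qz qw. A minimal loader:

```python
import numpy as np

def load_tum_trajectory(path):
    """Read a TUM RGB-D trajectory file.
    Returns (N,) timestamps and an (N, 7) array of
    [tx, ty, tz, qx, qy, qz, qw] poses."""
    data = np.loadtxt(path, comments="#")   # '#' lines are headers
    return data[:, 0], data[:, 1:8]
```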

Evaluation metrics
When assessing a SLAM algorithm's performance, there are many factors to consider, including accuracy, complexity, and running time. Accuracy is the one people care about most, and two accuracy indices inevitably arise: ATE and RPE. These widely used evaluation metrics were first defined in the TUM dataset benchmark.
(1) RPE: relative pose error. The RPE corresponds to the direct measurement error of the odometry and primarily indicates the accuracy of the pose difference over a fixed time interval $\Delta$. The RPE of frame i is defined as equation (11):

$$E_i = \left(Q_i^{-1} Q_{i+\Delta}\right)^{-1} \left(P_i^{-1} P_{i+\Delta}\right) \qquad (11)$$

where $Q_i$ and $P_i$ are the ground-truth and estimated poses of frame i. The overall error can be computed with the RMSE over all $m = n - \Delta$ relative errors, as equation (12):

$$\mathrm{RMSE}(E_{1:n}, \Delta) = \left(\frac{1}{m} \sum_{i=1}^{m} \left\lVert \operatorname{trans}(E_i) \right\rVert^2\right)^{1/2} \qquad (12)$$

where $\operatorname{trans}(E_i)$ denotes the translational component of the relative pose error. Some authors do not use the RMSE and instead describe the relative error directly with the mean or even the median. Note that besides the translation error, the RPE also contains a rotation error, but the translation error is generally sufficient for evaluation. The rotation-angle error can be computed in the same way through equation (13):

$$\operatorname{angle}(E_i) = \arccos\!\left(\frac{\operatorname{trace}(\operatorname{rot}(E_i)) - 1}{2}\right) \qquad (13)$$

Because evaluating all pairs is computationally expensive and time-consuming, the TUM benchmark estimates the final result from a fixed number of RPE samples.
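A minimal sketch of the translational RPE of equations (11)-(12), assuming two time-synchronized lists of 4x4 homogeneous pose matrices:

```python
import numpy as np

def rpe_translation_rmse(gt_poses, est_poses, delta=1):
    """gt_poses, est_poses: sequences of 4x4 pose matrices Q_i, P_i."""
    errs = []
    for i in range(len(gt_poses) - delta):
        gt_rel = np.linalg.inv(gt_poses[i]) @ gt_poses[i + delta]
        est_rel = np.linalg.inv(est_poses[i]) @ est_poses[i + delta]
        e = np.linalg.inv(gt_rel) @ est_rel      # E_i from equation (11)
        errs.append(np.linalg.norm(e[:3, 3]))    # trans(E_i)
    return float(np.sqrt(np.mean(np.square(errs))))   # equation (12)
```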
(2) ATE: absolute trajectory error. The ATE directly indicates the accuracy of the algorithm and the global consistency of the trajectory. After aligning the estimated trajectory to the ground truth with a rigid-body transformation S, the ATE of frame i is defined as equation (14):

$$F_i = Q_i^{-1} S P_i \qquad (14)$$

Similar to RPE, the RMSE can be used to aggregate the ATE, as equation (15):

$$\mathrm{RMSE}(F_{1:n}) = \left(\frac{1}{n} \sum_{i=1}^{n} \left\lVert \operatorname{trans}(F_i) \right\rVert^2\right)^{1/2} \qquad (15)$$

The mean, median, etc. can also be used to report the ATE.
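A corresponding sketch of the ATE RMSE of equation (15), assuming the estimated trajectory has already been aligned to the ground truth (the transformation S of equation (14), e.g. via Horn's method) and both are sampled at matching timestamps:

```python
import numpy as np

def ate_rmse(gt_xyz, est_xyz):
    """gt_xyz, est_xyz: (N, 3) aligned, synchronized positions."""
    diff = gt_xyz - est_xyz                  # trans(F_i)
    return float(np.sqrt(np.mean(np.sum(diff ** 2, axis=1))))
```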

We further select several representative VSLAM algorithms from recent years and summarize their performance on the TUM dataset in terms of ATE, translational error, and rotational error. The results cover five sequences. In the sequence names, "s" and "w" stand for "sitting" and "walking", respectively. The word after the second underscore denotes the camera's motion, of which there are four types: (1) xyz: the camera moves along the x, y, and z axes; (2) rpy: the camera rotates about the principal axes (roll-pitch-yaw); (3) static: the camera is held still manually; (4) halfsphere: the camera moves on a small half-sphere with a diameter of around one meter.
1) Table 2 shows that, on average, the error of fr3_s_static is lower than that of fr3_w_static, but in the RMSE of DynaSLAM and the SD of DeepLabv3+_SLAM, the error of fr3_s_static is larger than that of fr3_w_static. The error of fr3_w_rpy is the largest among all dynamic sequences. The algorithms are still not as stable on dynamic sequences as on static ones. In all scenarios, the error of ORB_SLAM3 is larger than that of the other algorithms, up to 0.8161, which is 2-3 times that of the others.
2) As can be observed in Table 3, for the translational RPE the overall tendency is similar to that of the ATE in Table 2, but the error of fr3_s_static is larger than that of fr3_w_static for DynaSLAM, the opposite of the other algorithms. In the ORB_SLAM algorithm, the error of each sequence is smaller than its ATE.
3) It can be seen from Table 4 that the rotational RPE errors all increase significantly compared with the first two cases, and the gap is large.

1) Table 1 demonstrates that RGB-D is the most commonly used sensor. After the monocular camera was abandoned, RGB-D has made great progress. There are three main types of mainstream depth cameras according to their principles: binocular cameras, structured light, and time-of-flight (TOF). The three schemes do not differ much in detection range, accuracy, or field of view, but their shortcomings are also obvious: they suit indoor environments and have strict lighting requirements. Because they rely on cameras, they are not applicable in dim environments or under strong light, and their resolution is limited. Lidar can overcome these shortcomings, but it is difficult to popularize due to its high cost.
To make up for these shortcomings of RGB-D, it is now common to fuse an IMU into RGB-D based SLAM. When the motion is too fast, the camera image blurs, or the overlap between two images is too small to measure the motion, the IMU can provide a more reliable estimate of (R, t). In addition, the IMU can avoid motion misjudgment when the camera appears still in a dynamic scene, it works normally in low-texture scenes, and it can compensate for part of the lighting effects. The cost of an IMU is low, which is another reason why pairing RGB-D with an IMU is common.
2) YOLACT splits instance segmentation into two parallel tasks in order to handle instance segmentation in real time. Although speed and stability are improved, because the localization step is abandoned, when multiple targets appear at one point in the scene the network may not be able to localize every object: it will localize some objects resembling the foreground mask, but not every target of instance segmentation.
However, this problem has been alleviated in YOLACT++ [15]. First, deformable convolution is added to the backbone; this free-form sampling replaces the rigid grid sampling of traditional CNNs and improves the accuracy of detection and segmentation. Second, the prediction head branch and its anchors are optimized: each FPN level uses three anchor scales, which effectively triples the number of anchors.
3) In its first stage, Mask R-CNN creates regions of interest and then feeds the region proposals into the object recognition and bounding-box regression stage. Although its accuracy is high, it is usually slow. Single-stage models such as YOLO and SSD have lower accuracy but run faster than two-stage ones. By dividing pictures into easy and difficult ones, an image difficulty predictor can be utilized to recognize objects both quickly and accurately: after the split, the fast single-stage detector handles the easy images while the exact two-stage detector handles the challenging ones [16].
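The routing idea can be expressed in a few lines; this is a hypothetical sketch, with the three callables standing in for models that [16] does not pin to any specific API:

```python
def detect(image, difficulty_predictor, fast_detector, accurate_detector,
           threshold=0.5):
    """Route easy images to a fast single-stage detector (e.g. YOLO/SSD)
    and hard ones to an accurate two-stage detector (e.g. Mask R-CNN)."""
    if difficulty_predictor(image) < threshold:   # predicted "easy"
        return fast_detector(image)
    return accurate_detector(image)
```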

Conclusion
This paper introduces three main types of dynamic region detection methods: methods based on multi-view geometry, methods based on optical flow/scene flow, and methods based on deep learning. The deep learning methods are further divided into two categories, whose algorithms and detection approaches are introduced in detail. After that, the common datasets and metrics are introduced, the metric data are analyzed, and some problems and corresponding solutions are proposed.

Table 1. Classic VSLAM algorithms based on dynamic detection approaches.

Table 2. Absolute trajectory error (ATE) on the TUM dataset (m).

Table 3. Translational relative pose error (RPE) on the TUM dataset (m).

Table 4. Rotational relative pose error (RPE) on the TUM dataset.