Research advances in visual SLAM methods for indoor environments

Simultaneous localization and mapping (SLAM) has become a hot research topic in the computer vision community; it aims to locate an agent while perceiving the surrounding environment to build a map. In visual SLAM, the camera is the external sensor used to create an environment map while the platform localizes itself. Compared with radar and other rangefinders, visual sensors are usually small and have low power consumption, which has made them the mainstream choice in SLAM research. This paper starts from a system overview of visual SLAM, introduces its landmark achievements and research hotspots, and discusses the key issues and three research methods of visual SLAM in indoor environments. It also surveys applications of visual SLAM in dynamic scenes and large indoor environments. Finally, future development trends are discussed.


Introduction
Simultaneous localization and mapping (SLAM) incrementally estimates the location of a mobile platform while creating a map of the surrounding environment [1]. SLAM has become an important prerequisite for robots operating autonomously in unknown environments because of its ability to localize and perceive the environment without external infrastructure [2]. Visual SLAM uses cameras as its input sensors and has a wide range of applications in various indoor environments. In contrast to radar and other sensors, visual sensors are usually small and have low power consumption, and they provide the mobile platform with richer information about environment texture. As a result, visual SLAM is gaining more and more attention from the research community. Because it uses cameras as its only external sensors, visual SLAM is prized for its low cost, light weight, ease of installation on commercial hardware, and the rich information contained in images. Visual SLAM can be divided into three types according to the visual sensor chosen: monocular visual SLAM, in which a single camera is the only external sensor; stereo vision SLAM, which uses multiple cameras, binocular stereo vision being the most widely used; and RGB-D SLAM, which combines a monocular camera with an infrared depth sensor.
As shown in Fig. 1, the classic visual SLAM system consists of five modules: sensor data reading, front end, back end, loop detection, and map construction. The front end estimates the camera motion and local maps by tracking image features across frames. A point-line feature-based front-end visual odometry uses a binocular or monocular camera to capture RGB images of the surrounding environment and extracts and describes the point and line features of each image [3]. By matching features between consecutive frames, the camera pose and a local map can be estimated. The back end optimizes the initial values provided by the front end, which mathematically amounts to solving a nonlinear optimization problem. After the pose information is combined with the closed-loop constraints, it is optimized to ensure global consistency. Back-end optimization is mainly achieved by two methods, nonlinear optimization and filtering, represented by graph optimization and the Extended Kalman Filter (EKF) respectively [4]. Loop detection, also known as closed-loop detection, identifies previously visited scenes and eliminates accumulated error by computing image similarity. Mapping constructs sparse or dense maps according to different needs; the resulting map mainly serves positioning, navigation, obstacle avoidance, and environment reconstruction. Loop detection presupposes that the camera carrier can return to a previously seen scene, and the resulting loop constraints are used to eliminate large accumulated errors. Traditional loop detection mainly adopts the bag-of-words model, realized as follows: local features are extracted from the images, and K-means clustering is used to construct a lexicon of K words. According to the number of times each word appears, an image is represented as a K-dimensional vector, which is used to measure the difference between scenes and thereby recognize revisited ones.
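As a concrete illustration of the bag-of-words loop detection just described, the sketch below builds a K-word lexicon from ORB descriptors with K-means and represents each image as a normalized K-dimensional word-frequency vector. It is a minimal sketch assuming OpenCV and scikit-learn, not the implementation of any particular SLAM system:

```python
# Minimal bag-of-words loop-detection sketch (illustrative only).
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(images, k=256):
    """Cluster ORB descriptors from training images into a k-word lexicon."""
    orb = cv2.ORB_create()
    descs = []
    for img in images:
        _, d = orb.detectAndCompute(img, None)
        if d is not None:
            descs.append(d.astype(np.float32))
    return KMeans(n_clusters=k, n_init=5).fit(np.vstack(descs))

def bow_vector(image, vocab):
    """Represent an image as a normalized k-dimensional word-frequency vector."""
    orb = cv2.ORB_create()
    _, d = orb.detectAndCompute(image, None)
    v = np.zeros(vocab.n_clusters)
    if d is not None:
        for w in vocab.predict(d.astype(np.float32)):
            v[w] += 1.0
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def scene_similarity(v1, v2):
    """Cosine similarity of two BoW vectors; a high score suggests a revisit."""
    return float(np.dot(v1, v2))
```

A revisited scene is then declared when `scene_similarity` against a stored keyframe exceeds a threshold; the threshold and vocabulary size are tuning choices.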

Figure 1. Diagram of classic Visual SLAM system framework
The landmark achievement of visual SLAM is MonoSLAM [5], proposed by Andrew Davison, the first monocular SLAM system based on the EKF. It runs in real time, creating a sparse map online within a probabilistic framework, but it cannot determine the amount of drift. Building on MonoSLAM, a great deal of effort has been invested, greatly improving the accuracy and generalization ability of SLAM. Representative visual SLAM algorithms are introduced in Section 2. We further introduce the application of SLAM in dynamic scenes and large indoor environments in Section 3, where quantitative and visual results of the different SLAM algorithms are compared in detail. Finally, the research hotspots in this field are discussed in Section 4.

Landmark achievements of SLAM
DTAM is a monocular SLAM algorithm based on the direct method, proposed in 2011. The 6-degree-of-freedom camera pose relative to the dense model is obtained by aligning the entire image at frame rate, achieving real-time performance on a GPU. PTAM, developed by Georg Klein, is the first multithreaded SLAM method; it separates tracking and mapping into two distinct jobs executed in two concurrent threads. KinectFusion is the first technique to produce dense 3D maps on a GPU in real time using the Kinect: it computes the sensor pose and creates a precise 3D model of the surroundings using only the depth data collected by the Kinect camera. LSD-SLAM, a direct monocular SLAM technique introduced in 2014, operates directly on image pixels. Compared with prior monocular visual odometry, it can not only estimate its own pose but also build a semi-dense global map of the environment. Its tracking runs in real time on a CPU and works directly on Sim(3), correctly detecting scale drift. ORB-SLAM [6], proposed in 2015, is a fairly complete keyframe-based monocular SLAM algorithm. The system consists of three threads: tracking, local mapping, and loop closing. Through place recognition, sparse map construction, and feature extraction and matching based on ORB features, it achieves very high localization accuracy.

Feature Extraction Based on Point and Line Features
Feature extraction is an indispensable step when a SLAM system perceives the environment and builds a map. Widely used frameworks mainly include feature extraction and matching based on point features and feature extraction and matching based on line features.

Point Feature Extraction and Matching.
Image corners and edges are called point features. A point feature typically comprises keypoint data describing the point's location, direction, and scale, together with a descriptor, usually in the form of a vector, that encodes the local information around the point [7]. Current image point feature extraction algorithms include SIFT, SURF, and ORB; the ORB algorithm is a typical point feature extraction algorithm, with superior feature extraction and matching performance as well as running time. For feature point matching, a two-layer scheme is mainly used: brute-force rough matching is carried out first, and then false matches are eliminated with the RANSAC algorithm to achieve accurate matching. The LSD method, which characterizes line features with the LBD descriptor, is currently the most widely used line feature extraction algorithm. The LSD algorithm consists of the following steps: scale the image by Gaussian subsampling to remove aliasing effects, then compute the gradient magnitude and direction at each pixel using equation (1):

$G(x, y) = \sqrt{g_x(x, y)^2 + g_y(x, y)^2}$    (1)

where i(x, y) is the gray level of the pixel at coordinate (x, y), g_x and g_y are the horizontal and vertical finite differences of i(x, y), and G(x, y) is the gradient magnitude at that pixel; the gradient direction is derived from g_x and g_y. The pixels are then sorted into a status chain, the pixel with the largest gradient magnitude is selected as the seed direction of a line segment, the gradient directions of neighboring pixels are compared with the region's gradient direction, and the loop is repeated to obtain all the line segments in the image. As Fig. 2(b) shows, line features remain well represented even in scenes such as a white wall.
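The two-layer point matching scheme described above can be sketched in Python with OpenCV as follows; realizing the RANSAC layer through fundamental-matrix estimation is one common choice, not necessarily the one used in the cited systems:

```python
# Two-layer matching sketch: brute-force ORB matching, then RANSAC filtering.
import cv2
import numpy as np

def match_orb_ransac(img1, img2):
    """Return ORB keypoints and RANSAC-filtered inlier matches between frames."""
    orb = cv2.ORB_create(nfeatures=1000)
    kp1, des1 = orb.detectAndCompute(img1, None)
    kp2, des2 = orb.detectAndCompute(img2, None)
    # Layer 1: exhaustive Hamming-distance matching with cross-check.
    bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = bf.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # Layer 2: RANSAC on the epipolar constraint removes false matches
    # (needs at least 8 rough matches to estimate F).
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    if mask is None:
        return kp1, kp2, []
    inliers = [m for m, ok in zip(matches, mask.ravel()) if ok]
    return kp1, kp2, inliers
```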

Visual SLAM optimization based upon point and line features.
Technology based on point and line features is a popular area of study for visual SLAM systems. The point and line features of the image are extracted, camera motion is estimated by maximum likelihood, and three methods, point features only, line features only, and a point-line combination, are compared; motion estimation accuracy is highest with the point-line fusion method. Reference [8] reduces the reprojection error of point and line features on the basis of a Gaussian error distribution matrix, weighting the point and line reprojection errors to perform binocular visual odometry with the point-line approach. The three-patch ORB algorithm [9] extracts point features from local gray-level differences, while the MLSD algorithm extracts line features in the front end alongside the point features, improving positioning accuracy and producing a more accurate map. Reference [10] applies the point-line feature fusion method to a binocular visual-inertial SLAM system, combining point and line features with IMU data to correct the visual positioning and obtain a more accurate dense map.
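For concreteness, the objective such point-line systems typically minimize can be written as a weighted sum of point and line reprojection errors. The following is a generic, illustrative formulation rather than the exact objective of [8], [9], or [10]:

```latex
% Generic point-line bundle-adjustment objective (illustrative):
% T: camera pose; \pi: camera projection; X_i: 3D points; L_j: 3D lines;
% p_i, l_j: observed 2D points and lines; d(\cdot,\cdot): point-to-line
% distance of the projected line's endpoints; \rho: robust kernel (e.g. Huber).
\min_{T}\; \sum_{i} \rho\!\left( \left\lVert p_i - \pi(T X_i) \right\rVert_{\Sigma_p}^{2} \right)
        + \sum_{j} \rho\!\left( d\!\left( l_j,\, \pi(T L_j) \right)_{\Sigma_l}^{2} \right)
```

Weighting the two residual terms by their covariances $\Sigma_p$ and $\Sigma_l$ corresponds to the Gaussian error weighting used in [8].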

Dynamic Scene.
In today's highly developed era of robotics, robots typically obtain accurate data and function effectively in static indoor scenes, but they frequently suffer delayed positioning updates, poor precision, and insufficient robustness in dynamic indoor scenes. This issue arises because most visual SLAM algorithms assume the background is static rather than dynamic. A new RGB-D dynamic SLAM approach based on points, lines, semantic information, and optical flow has been suggested to address these issues. Studies show that the enhanced RGB-D SLAM technique significantly increases the system's accuracy and robustness in dynamic indoor scenes. The RGB-D camera can acquire photometric images and the depth of their corresponding pixels at low cost. The algorithm uses the RGB-D camera as its only external sensor and uses only the depth information to pre-eliminate dynamic objects.
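A minimal sketch of such depth-based pre-elimination is shown below; the 0.10 m threshold and the assumption that the two depth maps are already registered (camera motion compensated) are illustrative choices, not values from the cited work:

```python
import numpy as np

def dynamic_mask(depth_prev, depth_curr, thresh=0.10):
    """Flag pixels whose depth jumps between two registered RGB-D frames.

    depth_prev, depth_curr: HxW depth maps in meters, already aligned so
    that camera motion is compensated; thresh: tolerated change in meters.
    Returns a boolean mask of dynamic-object candidate pixels.
    """
    valid = (depth_prev > 0) & (depth_curr > 0)   # zero depth = no reading
    diff = np.abs(depth_curr - depth_prev)
    return valid & (diff > thresh)
```

Features falling inside the mask would then be excluded from pose estimation before the point-line-semantic pipeline runs.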

Large Indoor Environments (LIE).
Large indoor environments (LIE) are an important application setting for SLAM (simultaneous localization and mapping) technology. The goal when mapping large indoor environments is to be efficient while maintaining a high level of accuracy. Here, three SLAM approaches for large indoor environments, either mainstream or novel, are introduced in detail. The first constructs real-time maps of medium- to large-scale indoor settings using a 3D SLAM system based on RGB-D sensors. It uses a keyframe auto-correlogram database together with an adaptive thresholding method for closed-loop detection, and a sparse-feature-based alignment method to estimate interframe motion. The keyframe database indexes keyframes by the image auto-correlogram [11], which describes the spatial correlation of colors in an image. To index the auto-correlograms, a k-means priority search tree is used [12]; it enables robust, fast collection of loop closure candidates and clusters keyframes into clusters of size k. Based on their frame alignment techniques, RGB-D mapping methods can be divided into feature-based and dense categories: feature-based approaches extract and match sparse keypoints between frames and align them by keypoint matching, whereas dense approaches minimize a pixel-level error over all the frame data.
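To make the auto-correlogram index concrete, the sketch below computes a simple color auto-correlogram over quantized images and retrieves loop-closure candidates; a plain KD-tree stands in for the k-means priority search tree of [12], and all sizes are placeholders:

```python
import numpy as np
from sklearn.neighbors import KDTree

def auto_correlogram(img_q, n_colors=64, distances=(1, 3, 5, 7)):
    """Color auto-correlogram: for each color bin c and distance d, the
    probability that a pixel at offset d from a c-colored pixel is also
    colored c. img_q: HxW array of quantized color-bin indices."""
    h, w = img_q.shape
    feat = np.zeros((len(distances), n_colors))
    for k, d in enumerate(distances):
        # Compare each pixel with its 4 axis-aligned neighbors at offset d.
        for dy, dx in ((d, 0), (-d, 0), (0, d), (0, -d)):
            a = img_q[max(dy, 0):h + min(dy, 0), max(dx, 0):w + min(dx, 0)]
            b = img_q[max(-dy, 0):h + min(-dy, 0), max(-dx, 0):w + min(-dx, 0)]
            same = (a == b)
            for c in range(n_colors):
                sel = (a == c)
                n = sel.sum()
                if n:
                    feat[k, c] += (same & sel).sum() / n
    return (feat / 4.0).ravel()   # average over the 4 neighbor directions

# Example: index random quantized keyframes and retrieve loop candidates.
rng = np.random.default_rng(0)
keyframes = [rng.integers(0, 64, (120, 160)) for _ in range(10)]
feats = np.vstack([auto_correlogram(kf) for kf in keyframes])
tree = KDTree(feats)                   # stand-in for the k-means tree of [12]
_, idx = tree.query(feats[:1], k=3)    # top-3 candidates for keyframe 0
```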
Over the past 20 years, SLAM systems have made extensive use of this graph-based methodology. The motion of the robot between frames is estimated from sparse visual cues. Auto-correlograms of frames are indexed and searched hierarchically for loop closure identification, and outliers are then efficiently eliminated via adaptive thresholding. Once the graph has been optimized and all data frames processed, the environment map is produced. Fig. 3 shows the system's main structure. The method's major benefit over others is that its computational cost is consistent in both large and small mapping environments: regardless of map size, it recognizes loop closures quickly. Because loop closures are searched through keyframe auto-correlogram trees, the cost of the search operation remains small even for huge environments and long sensor trajectories, whereas the loop closure detection cost would ordinarily grow with map size as more observations need to be searched. In earlier dynamic SLAM techniques, dynamic objects were either detected directly via semantic segmentation or presumed to occupy less of the view than the static background, making them easy to eliminate as outliers. The second approach is a novel dense RGB-D SLAM method that makes simultaneous multi-object tracking, camera localization, and background reconstruction feasible for planar dynamic environments. Its fundamental pipeline is as follows: a) In the current frame t, the input image is represented as a hybrid of planes and pixels; the frame is matched against the last frame, and ORB features are extracted.
b) M planar rigid bodies are formed by grouping planes with identical rigid-body motions, and their associated egocentric motions are estimated independently. At this stage, however, it is not yet known which planar rigid bodies belong to the static background. c) The static background is therefore unified from planes and pixels, while the camera motion is estimated by frame alignment.
d) The static part is used for background reconstruction and camera motion refinement. e) Dynamic planar rigid bodies are matched to the planar rigid bodies of the previous frame, and non-planar dynamic superpixels are deleted as outliers. RANSAC is used to track the ORB features and plane parameters of matched planar rigid bodies.
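Step e) tracks plane parameters with RANSAC; the following is a minimal, generic RANSAC plane-fitting sketch of that kind, with an illustrative inlier tolerance:

```python
import numpy as np

def ransac_plane(points, iters=200, tol=0.02, rng=None):
    """Fit a dominant plane n.x + d = 0 to Nx3 points with RANSAC.

    tol is the inlier distance in meters. Returns (n, d, inlier_mask)."""
    rng = rng if rng is not None else np.random.default_rng()
    best_inliers = np.zeros(len(points), dtype=bool)
    best_n, best_d = None, None
    for _ in range(iters):
        p = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p[1] - p[0], p[2] - p[0])
        norm = np.linalg.norm(n)
        if norm < 1e-9:                  # degenerate (collinear) sample
            continue
        n = n / norm
        d = -np.dot(n, p[0])
        inliers = np.abs(points @ n + d) < tol
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_n, best_d = inliers, n, d
    return best_n, best_d, best_inliers
```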
This technique produces improved positioning and mapping when several dynamic items are visible to the camera. It cannot track a dynamic object that is occluded by another item, although the object may be identified again after reappearing in the camera's field of view. Re-detection of dynamic objects based on learned models, to facilitate long-term object tracking, is one potential future line of research. Adapting the approach to non-planar situations, enabling the independent tracking of numerous large non-planar rigid bodies, is another.
The third approach, Environmental Stimulus Localization (ESL), is a recent method for the global positioning of mobile robots; it proposes that commonly known facts about the environment around the robot can be regarded as procedural stimuli. Two parallel particle filters give the method its robustness: the primary particle filter estimates and tracks the location of the robot, while the secondary filter is ignited by stimuli from the environment, which helps lower the influence of measurement faults and allows early recovery after a localization failure.
This strategy is modeled on the natural behavior of people trying to find their own position. At any time, a person can be reasonably certain of being in a specific location; however, if a landmark in the area contradicts the estimate and is significant enough to raise doubt, the person begins to consider another hypothesis. The current estimate and alternative hypotheses can be maintained at the same time, and if the next perception raises enough doubt, the person may revise the estimated position. Figure 4 depicts the architecture of this technique and illustrates the main elements of the ESL positioning approach in detail. The approach is built on two simultaneous particle filters, one of which responds to overhead information extracted from the environment. The BIM method can correlate locations, which makes the approach very useful in search and rescue missions, and it has been used to localize robots working in large, complex environments. Intensive experimental work was carried out with real robots on different routes: in ninety percent of the trials, after nine meters of movement the method gave the correct robot position, in spite of continuous perception faults, map faults, and unmodeled robot motion. These results can be explained and verified by several properties of the proposed algorithm.
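To make the two-filter structure concrete, the toy sketch below runs a primary and a secondary one-dimensional particle filter and reseeds the primary from the secondary when its confidence collapses. The sensor model, thresholds, and motion values are placeholders meant only to show the primary/secondary interplay, not details of the ESL implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

class ParticleFilter:
    """Minimal 1D particle filter: position along a corridor of given span."""
    def __init__(self, n, span):
        self.x = rng.uniform(0.0, span, n)    # particle positions
        self.w = np.ones(n) / n               # particle weights

    def predict(self, motion, noise=0.05):
        self.x += motion + rng.normal(0.0, noise, len(self.x))

    def update(self, z, meas_fn, sigma=0.1):
        self.w *= np.exp(-0.5 * ((z - meas_fn(self.x)) / sigma) ** 2)
        self.w /= self.w.sum() + 1e-12

    def confidence(self):
        # Effective-sample-size ratio: near 1 = diffuse, near 1/N = collapsed.
        return 1.0 / (len(self.x) * (self.w ** 2).sum())

    def estimate(self):
        return float(np.sum(self.w * self.x))

    def reseed(self, other):
        # Adopt the other filter's hypotheses (recovery after failure).
        idx = rng.choice(len(other.x), len(self.x), p=other.w)
        self.x = other.x[idx].copy()
        self.w = np.ones(len(self.x)) / len(self.x)

# Primary filter tracks the pose; the secondary keeps exploring hypotheses.
primary, secondary = ParticleFilter(500, 10.0), ParticleFilter(500, 10.0)
meas = lambda x: 10.0 - x                     # placeholder sensor model
for motion, z in [(0.5, 9.4), (0.5, 8.9), (0.5, 8.4)]:
    for f in (primary, secondary):
        f.predict(motion)
        f.update(z, meas)
    if primary.confidence() < 0.2:            # estimate contradicted: recover
        primary.reseed(secondary)
print(primary.estimate())
```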

Experiments on TUM dataset
The image dataset used in the experiments is the TUM dataset, designed specifically for evaluating SLAM algorithms. It was acquired with a Kinect depth camera at 640×480 resolution and consists of 39 data sequences of indoor scenes covering various dynamic environments. In the sitting sequences, as shown in Fig. 5, two people sit on chairs in front of a desk making small movements and gestures, which is considered a "low dynamic scene"; in the walking sequences, two people move around in front of a desk, which is considered a "high dynamic scene". There are four camera motion patterns: (1) static: the camera is fixed in place;
(2) xyz: the camera moves along the x, y, and z axes;
(3) rpy: the camera rotates about the roll, pitch, and yaw axes; (4) halfsphere: the camera moves along a hemispherical trajectory with a diameter of about 1 m. Statistics show that the translational root-mean-square error (RMSE) and rotational RMSE of the proposed algorithm are greatly reduced compared with DVO and BaMVO in both low and high dynamic scenes. In low dynamic scenes, compared with DVO, the translational RMSE of the proposed algorithm is reduced by 86.6% at most and 47.1% at least, and the rotational RMSE by 88.1% at most and 56.0% at least. Compared with BaMVO, the translational RMSE is reduced by 87.6% at most and 22.6% at least, and the rotational RMSE by 88.1% at most and 61.1% at least. In low dynamic scenes, compared with the literature, the proposed algorithm places different emphasis on translation and rotation accuracy, but the error values are very close. In high dynamic scenes, the translation and rotation accuracy of the proposed algorithm are greatly improved compared with DVO, BaMVO, and the literature: across the four high-dynamic datasets, the translational RMSE is reduced by 92.0%, 77.1%, and 83.7% at most, and by 57.5%, 35.5%, and 41.2% at least, while the rotational RMSE is reduced by 91.0%, 71.6%, and 81.7% at most, and by 59.7%, 51.0%, and 37.0% at least, respectively.
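For reference, the translational RMSE quoted above is computed from time-associated, aligned trajectories as in the sketch below; this is the standard TUM-style absolute trajectory metric, not code from the evaluated systems:

```python
import numpy as np

def translational_rmse(est, gt):
    """RMSE between estimated and ground-truth camera positions.

    est, gt: Nx3 arrays of positions, already associated by timestamp and
    aligned (e.g., by a least-squares rigid alignment such as Horn's method)."""
    err = est - gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

# Example: a reported 47.1% reduction means rmse_new = 0.529 * rmse_old.
```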

Experiments on FR3
We further conduct experiments on three representative sequences (FR3/sitting_xyz, FR3/walking_xyz, and FR3/walking_halfsphere), as shown in Figure 6. Panels (a), (b), and (c) show the absolute trajectory errors of ORB-SLAM2 on FR3/sitting_xyz, FR3/walking_xyz, and FR3/walking_halfsphere, respectively, while (d), (e), and (f) show the absolute trajectory errors of the proposed algorithm on the corresponding sequences. ORB-SLAM2 extracts ORB feature points in the front end to recover the camera pose and removes outliers through the reprojection error in the back end. It achieves high accuracy in low dynamic scenes but cannot adapt to high dynamic scenes. Both the proposed algorithm and ORB-SLAM2 achieve very high accuracy in low dynamic scenes, whereas in high dynamic scenes the accuracy of the proposed algorithm is clearly improved compared with ORB-SLAM2.
The DVO algorithm is based on the static background assumption, so its translational and rotational RMSE are small on the "sitting" sequences and the FR2/desk_with_person dataset, which contain only small movements. In the "walking" sequences, however, both errors increase significantly because of the large moving objects. Compared with DVO, the BaMVO algorithm eliminates some motion feature points according to depth changes at the edges of moving objects, but it inevitably retains motion features away from those edges, which interferes with camera pose estimation. Because some motion feature points are removed, BaMVO is more accurate than DVO in both high and low dynamic scenes, but large errors remain, and neither method can adapt to highly dynamic environments.
The algorithm in the literature uses geometric constraints to eliminate dynamic feature points, but because the extracted feature points are not fixed across experiments, new moving feature points may still be detected during camera pose recovery even after the advance elimination, and this effect is more obvious in high dynamic scenes. Compared with the BaMVO algorithm, its translational RMSE is 40.5% and 9.8% higher on FR3/walking_static and FR3/walking_halfsphere respectively, and its rotational RMSE is 53.7% higher on FR3/walking_static. Compared with the literature, the proposed algorithm places different emphasis on translation and rotation accuracy in low dynamic scenes, but the error values are very close. The only exception appears on the FR3/sitting_halfsphere dataset, because the proposed algorithm removes small moving objects together with connected regions of the same depth.
As a result, feature loss occurs in some frames where the person to whom the small moving object belongs carries most of the frame's features. In high dynamic scenes, the proposed algorithm achieves the smallest translational and rotational RMSE on all four datasets, so its accuracy is significantly improved.

Discussion
Although a great deal of SLAM research has been carried out, some issues remain to be solved, mainly including the following. (1) Keyframe selection. Because errors occur continually in the pose estimation process, frame-by-frame alignment generates significant cumulative error. Keyframe-based SLAM approaches have been proposed to lessen the error caused by frame-to-frame alignment. Keyframe selection can be done in a variety of ways. In one reference, a frame is added to the map as a keyframe if all of the following conditions are satisfied: at least N map points are observed in the current frame, the pose estimate is sufficiently accurate, and at least N frames have passed since the preceding keyframe. In another reference, a new keyframe is generated when the number of feature points shared by both images falls below a predetermined threshold.
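A keyframe test of the kind described above might be encoded as follows; the sketch combines the two cited criteria as alternatives purely for illustration, and every numeric default is a placeholder:

```python
def is_new_keyframe(n_visible_map_points, frames_since_last_kf,
                    shared_feature_ratio,
                    min_points=50, min_gap=20, share_thresh=0.35):
    """Illustrative keyframe decision.

    Criterion (a): enough map points observed and enough frames elapsed
    since the last keyframe; criterion (b): feature overlap with the last
    keyframe has dropped below a threshold. Either triggers a new keyframe."""
    enough_support = (n_visible_map_points >= min_points
                      and frames_since_last_kf >= min_gap)
    low_overlap = shared_feature_ratio < share_thresh
    return enough_support or low_overlap
```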
(2) Map optimization. For robots working in complex dynamic environments, rapid generation of 3D maps is very important, and the resulting maps play a crucial role in subsequent positioning, path planning, and obstacle avoidance, so accurate map generation is paramount. After closed-loop detection succeeds, the closed-loop constraint is appended to the map and the loop is corrected. The closed-loop problem can be regarded as a large-scale bundle adjustment problem that optimizes the camera poses and the three-dimensional coordinates of all map points. Nevertheless, this optimization is so computationally complex that it can hardly achieve real-time performance. In the RGB-D SLAM algorithm proposed in one reference, each edge in the pose graph has a weight, so that, to account for the inaccuracy, the edge with the highest uncertainty is allowed to vary more than the edge with the lowest uncertainty. Additional loop closure checks are then performed on each vertex in the graph, and the entire graph is re-optimized. To effectively correct rotation, translation, and scale drift, a pose graph optimization technique is applied in the closed-loop correction phase (a toy sketch follows below). When closed-loop detection succeeds, another reference builds an essential graph and optimizes it for poses [12]. (3) Multi-sensor fusion. When lighting changes dramatically, motion is vigorous, or texture is weak, visual SLAM with a single camera is not very robust and can easily suffer tracking failure, localization failure, and map construction failure. The system can become more reliable and accurate through the fusion of data from several sensors. Many researchers have attempted to combine multiple sensors in visual SLAM systems, using typical multi-sensor fusion configurations such as camera + inertial measurement unit (IMU) and camera + lidar.
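Returning to item (2), the toy example below optimizes a drifted one-dimensional pose chain against a stiff, highly weighted loop-closure edge; it is a deliberately tiny stand-in for full SE(3) pose-graph optimization, showing how edge weights let uncertain edges absorb more of the accumulated error:

```python
import numpy as np
from scipy.optimize import least_squares

# Toy pose graph: four 1D poses linked by odometry edges, plus one loop
# closure edge between pose 0 and pose 3.
edges = [               # (i, j, measured displacement, weight)
    (0, 1, 1.00, 1.0),
    (1, 2, 1.00, 1.0),
    (2, 3, 1.00, 1.0),
    (0, 3, 2.70, 10.0),  # loop closure: the drifted chain claims 3.0
]

def residuals(x):
    # Anchor pose 0 near the origin; penalize each edge's measurement error,
    # scaled by sqrt(weight) so stiffer (more certain) edges change less.
    res = [10.0 * x[0]]
    for i, j, z, w in edges:
        res.append(np.sqrt(w) * ((x[j] - x[i]) - z))
    return res

x0 = np.array([0.0, 1.0, 2.0, 3.0])   # initial (drifted) trajectory
sol = least_squares(residuals, x0)
print(sol.x)  # poses pulled toward consistency with the loop closure
```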
(4) Combining deep learning with SLAM. With the great success of deep learning in computer vision, there is great interest in applying deep learning to robotics. SLAM is a large system with several submodules, such as closed-loop detection and stereo matching, which can be improved by deep learning. One reference proposes a stereo matching method based on deep learning, using a convolutional neural network (CNN) to learn the similarity between small image blocks and to initialize the stereo matching cost. Another uses convolutional neural networks and large-scale maps to achieve real-time place recognition by integrating new optimization techniques for locality-sensitive hashing and semantic spatial segmentation. A third uses convolutional neural networks to learn the best visual features and visual odometry, and a fourth presents a relocalization system that uses Bayesian convolutional neural networks to compute the six-degree-of-freedom camera pose and its uncertainty from a single color image.

Conclusion
In this article, we explain the key SLAM algorithms and examine how well they work on various datasets. A visual SLAM system based on point and line features offers good positioning accuracy, especially in scenes with uneven lighting and weak texture, and can further improve positioning accuracy and operating speed by matching feature points and edge points more precisely. It is a promising way of improving visual SLAM systems and deserves further research.
To address sensor limitations in indoor environments, future indoor algorithms will primarily focus on multi-sensor fusion, with the fusion of laser, vision, and IMU offering the greatest advantages. Through information fusion, various sensors can adapt more effectively to the complicated environments mobile robots face and to the variety of functional demands placed on them. SLAM algorithms for industrial robots are already moving toward multi-sensor fusion. However, several issues remain in multi-sensor fusion research, including real-time data processing and reducing the computational complexity of multi-sensor collaborative operation.

Figure 2. Comparison of different feature extraction methods: (a) ORB feature extraction [4]; (b) LSD feature extraction [5].

Line Feature Extraction and Matching.
Point feature extraction and matching technology is well established and simple to parameterize, which makes it popular for frame tracking in visual SLAM systems. However, single-point feature tracking often performs poorly in low-texture areas. Line features, by contrast, are common in man-made indoor environments, and the two feature types can successfully compensate for each other's deficiencies.
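In practice, LSD segments can be extracted with OpenCV as sketched below; note that createLineSegmentDetector availability depends on the OpenCV build (it was disabled in some 4.x releases for licensing reasons and later restored), and the input filename is hypothetical:

```python
# Line-feature extraction sketch with OpenCV's LSD detector.
import cv2

img = cv2.imread("indoor_scene.png", cv2.IMREAD_GRAYSCALE)  # hypothetical file
lsd = cv2.createLineSegmentDetector()
lines, _, _, _ = lsd.detect(img)            # Nx1x4 array: x1, y1, x2, y2
vis = cv2.cvtColor(img, cv2.COLOR_GRAY2BGR)
if lines is not None:
    vis = lsd.drawSegments(vis, lines)      # overlay detected segments
cv2.imwrite("lsd_lines.png", vis)
```

The detected segments would then be described with LBD descriptors for matching, as discussed above.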

Figure 5. Visualization of feature extraction effects in different scenarios [11].