Enhancing Autonomous Navigation: A Visual SLAM Approach

An autonomous vehicle can simultaneously map its environment and identify its own position within it by employing a technique called "Simultaneous Localisation and Mapping" (SLAM). Autonomous mobility requires estimating the locations of nearby landmarks and objects, as well as the vehicle's own position, using an appropriate technique. Monocular SLAM systems often face challenges related to depth perception and scale ambiguity, leading to trajectory drift over time. In contrast, stereo SLAM systems utilize dual cameras to overcome these limitations. The purpose of this work is to assess how well visual SLAM systems perform by comparing their trajectory estimates with ground truth obtained from simulation. The findings indicate that stereo visual SLAM algorithms offer more accurate camera trajectory estimation than monocular SLAM, making them the preferable choice for applications demanding precise camera localization and mapping in autonomous vehicles.


Introduction
"Simultaneous Localisation and Mapping", or SLAM, is a method by which the automobile simultaneously maps its surroundings and determines its location within them.It entails figuring out the locations of nearby landmarks and objects as well as the vehicle's position, usually with the help of sensors and geometric and Bayesian algorithms.By examining the differences between the vehicle's consecutive camera images, visual odometry estimates the car's autonomous movement (location and orientation) step by step without creating a detailed map.Autonomous vehicles are also known to use this technology to navigate and understand their surroundings.Using this primitive approach, a car may map its environment and determine its location in respect to the map.In order for autonomous vehicles to operate properly and safely in real-world situations, this is required [1] [2].Monocular SLAM, or simultaneously localising and mapping with only one camera, is essential to autonomous vehicles' ability to navigate and understand their environment.Monocular SLAM enables autonomous cars to detect their precise location within their surroundings.The vehicle can determine its position and orientation in real time by evaluating the video stream and by detecting known landmarks or objects.This information is critical for accurate navigation of the ego vehicle.Scale ambiguity, drift, feature selection, tracking issues, and the need for real-time performance all of them pose hurdles to monocular SLAM.Without additional sensors, determining the absolute scale of the map is difficult, and slight inaccuracies in feature tracking can lead to significant drift over time.Autonomous vehicles use a technique called stereo SLAM, for mapping the surroundings and estimate their own position inside them using stereo vision.Stereo vision involves capturing a 3D image of the environment using two cameras (or other imaging sensors) set at a specified distance apart.A system can determine the depth Filtering was previously used to tackle the monocular SLAM problem [3, 4, 5, and 6].The filter is used to process each frame to predict the positions of map features and the camera's pose simultaneously.Among its shortcomings are the compute waste during the processing of consecutive frames with little additional data.However, keyframe-based methods [7], [8] estimate the map using only a portion of frames (keyframes) because mapping is not limited by frame rate, allowing for more expensive but accurate BA (Bundle Adjustment) optimizations.According to Strasdat et al. [9] to diminish the computing overload filtering on basis of keyframes can be used.According to Mouragon et al. 
[7], the field of ocular odometry was the first to employ bundle adjustment (BA) in real-time.Klein and Murray's ground-breaking work on parallel tracking and mapping (PTAM) in SLAM [8] followed afterwards.Despite its limitations on small scale operations the discussed system provides efficient methods for selection of keyframes, matching of features, triangulation of points, localisation of the camera at each key frame as well as relocalisation with tracking failure.However its application is restricted due to improper loop closure and occlusion management and due to the requirement of human involvement during the map bootstrapping.In the [3] the authors introduced the very first monocular SLAM with the help of a real time handheld camera.Its main contributions were to initialize by tracking known targets, match features "actively" based on uncertainty, and use huge image patches as features.But MonoSLAM could only be used in small workspaces and could not perform loop closures.The authors Ranftl et al. [10] uses two consecutive frames to create a dense depth map.The surrounding area and the moving objects are both faithfully reproduced.Experimental findings suggested that this method works better for monocular depth estimation as compared to the previous methods [10].The first continuous dense monocular SLAM was proposed by Stuhmer et al. [11].The authors introduced a novel variational method for "real-time dense depth map" estimate from numerous images.They showed how the resulting depth maps are far more resilient to noise in the input photos when numerous images are integrated.In order to achieve sequential convex optimization, they modified a primal-dual technique that was first suggested for optical flow to address the depth map estimation problem.They also demonstrated that a complex thresholding strategy may be used to solve the primal update in closed form.Dense depth maps can be produced at 24 frames per second by using the GPU to estimate the depth map in a coarse-tofine way while the CPU manages the camera motion.Engel et al. [12] combined several of the current concepts into their LSD-SLAM, the first direct monocular SLAM system that could do a consistent global semi-dense reconstruction.The approach, as detailed in [13], combines direct image alignment with filtering-based estimates of semi-dense depth maps.This graph skillfully incorporates changes in the environment's scale and makes it possible to identify and fix accumulated drift.Even on a modern smartphone, the framework can operate in real-time and be executed on the CPU [14].Through experimentation, the authors [12] demonstrated the approach's robustness, flexibility, and versatility by successfully tracking and mapping even difficult hand-held trajectories longer than 500 meters.Within a single series, there are enormous rotations and notable scale variations (mean reciprocal depth ranging from below 20 centimeters to over 10 meters).To recap, multiple picture integration, probabilistic depth estimation, and variational optimization were the main ideas presented by early publications like [11] and DTAM (Dense Tracking And Mapping).Globally consistent mapping and reconstruction were made possible by the extension of these ideas in later methods, including "LSD-SLAM".In the end, largẹ-scale, precise monocular densẹ SLAM was demonstrated via entirely direct approaches."Direct Sparse Mapping (DSM)" [15] was first presented by Zubizarreta et al. 
as a persistent map-based direct sparse monocular SLAM system that supports point reobservations.
Numerous studies in the SLAM community have examined line features. Notably, the vertical line was the feature used in the original vision-based SLAM [16], in which a robot equipped with wheel odometry and a monocular camera used an extended Kalman filter variant to build a two-dimensional map. The authors of [17] implemented a stereo ORB-SLAM2 system based on the graph optimization framework and randomly allocated 3-D lines as landmarks. The only sensor in their proposed system is a stereo rig, and it makes use of sophisticated methods such as motion estimation, pose optimization, and bundle adjustment. They tested the system on sequences from both indoor and outdoor environments and found that, in line-rich scenarios, it performs better than a point-based SLAM system. A novel omnidirectional stereo-vision-based method for simultaneous localisation and mapping was presented by Trung et al. [18]. The technique combines approximate spatial information about the environment with the motion of the vehicle to efficiently match stereo images. By utilizing two Extended Kalman Filter (EKF) estimators, a more precise robot trajectory and a more extensive environment map with more landmarks are produced: the first, binocular, estimator focuses primarily on the vehicle trajectory, whereas the second, monocular, estimator focuses on map construction. Their results showed how successful the stereo-matching strategy was, with convergence accuracy and speed much better than those of the Extended Kalman Filter - Structure From Motion (EKF-SFM) method.

Methodology
We now compare two visual SLAM algorithms: monocular visual SLAM, specifically ORB (Oriented FAST and Rotated BRIEF) SLAM, and stereo visual SLAM. First, we discuss the methodology behind monocular SLAM, as shown in Figure 1.

Figure 1: Methodology of Monocular SLAM
The process of monocular SLAM involves several key components, as shown in Figure 1:

• Map Initialization: Monocular SLAM usually starts from a sparse feature map and an initial estimate of the camera location. These can be obtained using methods such as Structure from Motion (SfM) or sensor fusion with Inertial Measurement Unit (IMU) data.
• Feature Selection and Tracking: Key points or distinguishable characteristics within the camera's field of view are detected and tracked over the course of multiple frames. These features, such as corners, edges, or texture-rich areas, serve as reference points for estimating the movement of the camera.
• Mapping: By triangulating the positions of the tracked features, the system builds a 3-D map of the surroundings as the camera travels. This process is also known as Structure from Motion.
• Estimation of camera position and orientation: Using feature correspondences between frames, the system estimates the camera's pose, i.e. its position and orientation relative to the environment.
• Loop Closure: To correct for accumulated errors and drift over time, loop closure techniques are applied. These identify when the camera revisits a previously seen location and help refine the map.
• Optimization: Through optimization techniques such as bundle adjustment, the error between predicted and observed feature positions is minimized, improving both the 3-D map and the estimated camera trajectory (a brief code sketch of these core steps is given below).

To configure our simulation environment, we use the Simulation 3D Scene Configuration block. As seen in Figure 2, we have chosen the built-in Large Parking Lot scene, which contains a number of parked cars. Because the visual SLAM algorithm compares features between successive images, we have added more parked cars to the scene using the Parked Vehicles subsystem in order to increase the number of possible feature matches.
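To make the listed steps concrete, the following is a minimal, illustrative Python sketch of the feature-tracking, pose-estimation, and triangulation components using OpenCV. It is not the visual SLAM implementation used in our simulation; the intrinsic matrix K and the two grayscale frames are assumed inputs.

import cv2
import numpy as np

def relative_pose_and_points(frame1, frame2, K):
    """Estimate camera motion between two grayscale frames and triangulate
    sparse 3-D map points (monocular: translation is up to scale only)."""
    orb = cv2.ORB_create(nfeatures=2000)            # Oriented FAST + Rotated BRIEF
    kp1, des1 = orb.detectAndCompute(frame1, None)
    kp2, des2 = orb.detectAndCompute(frame2, None)

    # Match binary descriptors; cross-checking discards asymmetric matches.
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])

    # Robustly estimate the essential matrix and keep RANSAC inliers only.
    E, mask = cv2.findEssentialMat(pts1, pts2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    pts1, pts2 = pts1[mask.ravel() == 1], pts2[mask.ravel() == 1]

    # Decompose E into rotation R and a UNIT-LENGTH translation t:
    # the unknown length of t is exactly the monocular scale ambiguity.
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K)

    # Triangulate inlier correspondences into 3-D points (the mapping step).
    P1 = K @ np.hstack([np.eye(3), np.zeros((3, 1))])  # first camera at origin
    P2 = K @ np.hstack([R, t])                         # second camera pose
    pts4d = cv2.triangulatePoints(P1, P2, pts1.T, pts2.T)
    pts3d = (pts4d[:3] / pts4d[3]).T                   # dehomogenize to Nx3
    return R, t, pts3d

Note that the recovered translation has unit length: without an external reference (or a stereo rig), the absolute scale of the motion and of the triangulated points remains undetermined.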

Figure 2: Image of the Parking Lot
To generate the image data of the parking lot, the methodology depicted in Figure 3 is used.

Figure 3: Methodology to generate image data of the parking lot
Next, using the Vehicle Simulation with Ground Tracking component, the ego vehicle is configured to follow the chosen reference path. The Camera Subsystem provides two kinds of camera sensor, monocular and stereo; in both variants, the camera is mounted at the center of the vehicle's roof. The simulation is then executed to view and record sensor data. The camera sensor's output image is displayed on the Video Monitor block, and the actual location and orientation of the camera sensor are recorded through the workspace subsystem. We then employ the monocular visual SLAM function, which takes in the camera images and ground truth, to generate the environment map along with the vehicle's trajectory.

The working of the stereo SLAM process is shown in Figure 4. The general workflow for monocular vSLAM and stereo vSLAM is quite similar. The primary difference is that, instead of using two images from separate frames during the Map Initialization step, three-dimensional map points are created from a pair of images captured by two cameras separated by a known distance. vSLAM can be performed with a monocular camera, but because depth cannot be precisely computed with a single camera, the map's scale and the estimated trajectory are unknown and drift over time. Furthermore, because the initial map cannot be triangulated from the first frame alone, many views are needed to construct it. Stereo vision reduces these limitations of the monocular camera and provides a more dependable vSLAM solution. The methodology of stereo SLAM shown in Figure 4 is as follows:

• Map Initialization: Using the disparity map, the pipeline first creates a preliminary map of 3-D points from two stereo images. The left image is saved as the first keyframe.
• Tracking: Once the map is initialized, the camera's pose for each consecutive stereo pair is estimated by matching features in the left image against the latest keyframe. The estimated camera pose is refined by tracking the local map.
• Local Mapping: If the current left image is identified as a keyframe, new 3-D map points are computed from the disparity of the stereo pair. Bundle adjustment is then applied to refine the camera poses and 3-D points by minimizing reprojection errors.
• Loop Closure: Loops are detected with the Bag of Features (BoF) method, which checks each keyframe against all preceding keyframes. Once a loop closure is detected, the pose graph is optimized to refine the trajectory and the camera pose of each keyframe.

After this, the process remains the same as discussed for monocular visual SLAM; the main difference is that the image data are generated with the stereo camera. We then employ the stereo visual SLAM function, which takes in the camera images and ground truth, to generate the environment map along with the vehicle's trajectory. Finally, we compare the ground truth of the simulation with the optimized camera trajectory for both monocular and stereo visual SLAM. A short sketch of the stereo depth computation that underpins this pipeline is given below.
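To make the stereo advantage concrete, the following is a small illustrative sketch, separate from the simulation toolchain used in this work, of how a rectified stereo pair yields metric depth. It assumes OpenCV is available; the focal length f_px (pixels) and baseline baseline_m (meters) are hypothetical placeholders for the parameters of the simulated rig.

import cv2
import numpy as np

def depth_from_stereo(left_gray, right_gray, f_px, baseline_m):
    """Compute a dense metric depth map from a rectified stereo pair."""
    # Semi-global block matching over a 128-pixel disparity search range.
    sgbm = cv2.StereoSGBM_create(minDisparity=0, numDisparities=128, blockSize=7)
    # SGBM returns fixed-point disparities scaled by 16.
    disparity = sgbm.compute(left_gray, right_gray).astype(np.float32) / 16.0

    depth = np.full(disparity.shape, np.inf, dtype=np.float32)
    valid = disparity > 0                     # non-positive disparities are invalid
    # Pinhole stereo geometry: Z = f * b / d. Because the baseline b is known,
    # the depth is metric, with none of the monocular scale ambiguity.
    depth[valid] = f_px * baseline_m / disparity[valid]
    return depth

Because depth is available from a single stereo frame, the initial map can be triangulated immediately, which is why the stereo pipeline does not need multiple views for bootstrapping.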

Results and Discussion
We evaluate the optimized camera trajectory's accuracy by comparing it with ground truth data derived from the simulation. Recall that, because the images are produced by a monocular camera, the trajectory can only be recovered up to a scale factor; using the actual trajectory data, this scale can be roughly estimated, simulating the input of an external sensor. The final map produced by the monocular SLAM algorithm is shown in Figure 5. Based on Figure 5, the trajectory created by the monocular ORB SLAM has an error of 1.94 meters. Accurately calculating depth with a single camera is difficult for a monocular visual SLAM system; this results in an uncertain map scale and an approximate trajectory, which drifts over time. Furthermore, bootstrapping the system often requires numerous views to generate an initial map, because map points can rarely be created from the initial frame alone. The use of a stereo camera, on the other hand, provides a more reliable visual SLAM solution and is one way to address these problems. The primary difference between stereo and monocular visual SLAM is that, instead of using images taken in separate frames, the stereo pipeline creates 3-D map points during map initialization using a pair of stereo images obtained in the same frame. The final map produced by the stereo SLAM algorithm is shown in Figure 6. The RMSE for the keyframe trajectory of the stereo SLAM algorithm is remarkably improved, with an absolute RMSE of 0.25 meters. The camera trajectory estimate produced by the stereo visual SLAM algorithm is thus substantially more accurate than that of the monocular system, an advantage primarily attributable to the enhanced depth perception and the elimination of scale ambiguity enabled by stereo vision.
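The following is a minimal NumPy sketch of this comparison, not the exact evaluation routine used in our experiments: it aligns the estimated keyframe positions to ground truth with a closed-form similarity transform (Umeyama's method), which recovers the unknown monocular scale, and then reports the absolute RMSE. Both trajectories are assumed to be given as Nx3 arrays of corresponding positions.

import numpy as np

def absolute_rmse(estimated, ground_truth):
    """Similarity-align an estimated Nx3 trajectory to ground truth
    (Umeyama closed form) and return the absolute RMSE in meters."""
    n = len(estimated)
    mu_e = estimated.mean(axis=0)
    mu_g = ground_truth.mean(axis=0)
    Ec, Gc = estimated - mu_e, ground_truth - mu_g

    # SVD of the cross-covariance gives the best-fit rotation.
    U, D, Vt = np.linalg.svd(Gc.T @ Ec / n)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(U) * np.linalg.det(Vt))])
    R = U @ S @ Vt

    # The optimal scale absorbs the unknown monocular map scale.
    scale = np.trace(np.diag(D) @ S) * n / (Ec ** 2).sum()
    t = mu_g - scale * R @ mu_e

    aligned = scale * estimated @ R.T + t
    return np.sqrt(np.mean(np.sum((aligned - ground_truth) ** 2, axis=1)))

For the stereo trajectory, whose scale is already metric, the same alignment can be applied with the scale fixed to 1, i.e. a rigid-body fit.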

Conclusion
The study found that the trajectory accuracy of visual SLAM systems is highly dependent on the camera configuration. The Root Mean Square Error (RMSE) of 1.94 meters reflects the larger error of the monocular visual SLAM system. Monocular SLAM has lower computational complexity, but it is more susceptible to scale drift and to the accumulation of errors over time, because its capacity to infer depth depends on feature matching and geometric constraints, both of which may not always be accurate; consequently, the mapping solution loses dependability over time. The stereo visual SLAM method overcomes these limitations by utilizing the extra depth information provided by the two simultaneous stereo images captured in the same frame. This leads to more accurate feature matching and pose estimation, which reduces the likelihood of drift and errors. Consequently, the stereo SLAM system produces a much better keyframe trajectory estimate, with an absolute RMSE of 0.25 meters. Stereo SLAM thus clearly outperforms monocular ORB SLAM in terms of trajectory accuracy: the RMSE decreases by 87.11%, from 1.94 meters to 0.25 meters, indicating a notable improvement in localization accuracy.

Figure 5: Map generated by the monocular SLAM algorithm

Figure 6: Map generated by the Stereo SLAM Algorithm