Study on the method of SLAM initialization for monocular vision

Visual SLAM is considered a key technology for the autonomous positioning and navigation of mobile robots, and is a research hotspot in fields such as autonomous driving, augmented reality, and smart home. Monocular-camera-based visual SLAM is one of the most active topics within the field. Among today's mainstream visual SLAM systems, PTAM requires the user to manually select two images to complete the estimation of the initial keyframe poses and map points, which limits the practical application of the system and reduces the success rate of initialization. In contrast, the ORB_SLAM system adopts a statistical model-selection method to realize a fully automatic initialization process. This paper first introduces the definitions of the homography matrix and the fundamental matrix together with their estimation algorithms, and then elaborates on how a scoring strategy is used to select between the two models for initialization.


Introduction
Simultaneous Localization and Mapping (SLAM) estimates the motion trajectory of a sensor moving through an unknown environment from the information the sensor perceives, while simultaneously constructing a structural map of that environment [1]. The goal of monocular visual SLAM map initialization is to construct the initial 3D map points and the initial keyframe poses. Since relative depth information cannot be recovered from a single image, two or more frames must be selected from the image sequence to estimate the camera pose and reconstruct the initial 3D map points [2].
There are two common ways to initialize the map. The first assumes that a planar object exists in the scene: two images taken at different positions are selected, and the pose is estimated by computing the homography matrix [3]. The second computes the fundamental matrix from feature-point matches between the two frames and estimates the pose from it [4]; this method requires the existence of non-coplanar feature points. Both methods have limited applicable scenarios, so Mur-Artal proposed a statistics-based model-selection method. It gives priority to the second method and is expected to automatically fall back to the first when the scene degenerates [5]. If the selected pair of frames satisfies neither model, both frames are discarded and initialization is restarted. This is the initialization method studied in this paper.
The flow chart of the statistics-based automatic model-selection initialization method is shown in figure 1. The first step of map initialization is to compute the homography matrix H and the fundamental matrix F. The homography matrix is computed from the two pre-processed images, generally with the normalized direct linear transformation (DLT) algorithm, while the fundamental matrix can be computed with the normalized eight-point algorithm. The reprojection error of each point under each matrix is then evaluated and compared against the corresponding chi-square threshold, and the total inlier score of each matrix is accumulated. According to a selection criterion, the fundamental matrix, the homography matrix, or neither is chosen. Since each estimate of the homography or fundamental matrix uses at most 8 point pairs, a single estimate carries great uncertainty; the RANSAC scheme is therefore introduced to repeat the computation many times and further eliminate outliers.
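The scoring and selection step above can be sketched as follows. This is a minimal illustration in the spirit of ORB_SLAM's heuristic, not the system's actual code: the chi-square thresholds (5.991 for 2 degrees of freedom, 3.841 for 1, both at 95% confidence) are standard table values, while the 0.45 ratio and all function names are assumptions made for illustration.

```python
import numpy as np

def model_score(sq_errors, chi2_threshold, score_cap):
    """Sum (score_cap - e) over points whose squared transfer error e
    passes the chi-square test; non-inliers contribute nothing."""
    e = np.asarray(sq_errors, dtype=float)
    inlier = e < chi2_threshold
    return float(np.sum(np.where(inlier, score_cap - e, 0.0)))

def select_model(errors_h, errors_f, th_h=5.991, th_f=3.841, ratio=0.45):
    """Pick 'H' when the homography's relative score dominates, else 'F'."""
    s_h = model_score(errors_h, th_h, th_h)
    s_f = model_score(errors_f, th_f, th_h)
    r_h = s_h / (s_h + s_f) if (s_h + s_f) > 0 else 0.0
    return "H" if r_h > ratio else "F"
```

In a full pipeline these error lists would come from the symmetric transfer errors of the RANSAC-refined H and F over all matched pairs.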

Homography matrix
The homography matrix describes the correspondence between two images of three-dimensional points lying on the same plane in space; what needs to be emphasized here is the same plane, as shown in figure 2. The homography matrix can be applied to image rectification, image registration, viewpoint conversion, and the computation of the camera motion (rotation and translation) between two images.

Fig.2 The same plane viewed from different angles

From the basic principle of camera imaging, the transformation from the world coordinate system through the camera coordinate system to the pixel coordinates gives the following formula:

$$ s\begin{pmatrix} u \\ v \\ 1 \end{pmatrix} = \begin{pmatrix} f_1 & 0 & c_x \\ 0 & f_2 & c_y \\ 0 & 0 & 1 \end{pmatrix} \begin{pmatrix} R & \mathbf{t} \end{pmatrix} \begin{pmatrix} x_b \\ y_b \\ z_b \\ 1 \end{pmatrix} = P \begin{pmatrix} x_b \\ y_b \\ z_b \\ 1 \end{pmatrix}, $$

where (u, v, 1) is the pixel coordinate in the image coordinate system and (x_b, y_b, z_b, 1) is the homogeneous coordinate in the world coordinate system; f_1 and f_2 are the focal lengths in the x and y directions, which are generally the same, and (c_x, c_y) is the position of the optical center. Here P is a 3×4 matrix. If the spatial points lie in the same plane, we can set z_b = 0, so that P reduces to a 3×3 matrix. For two different cameras, the pixel coordinates and spatial point coordinates can then be written as

$$ s_1\mathbf{x}_1 = P_1\mathbf{X}, \qquad s_2\mathbf{x}_2 = P_2\mathbf{X}. $$

Combining these two equations (with the plane-restricted 3×3 matrices) gives

$$ \mathbf{x}_2 \sim P_2 P_1^{-1}\mathbf{x}_1 = H\mathbf{x}_1. $$

H is the homography matrix; on its two sides are the matching point pairs of the two images. In other words, the homography matrix H maps the points of the same plane in three-dimensional space to the imaging coordinates of the two cameras.
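As a concrete illustration of how H acts on matched pixels, the following numpy sketch maps points through a homography with the usual homogeneous normalization (the function name is ours, not from the paper):

```python
import numpy as np

def apply_homography(H, pts):
    """Map an Nx2 array of points through a 3x3 homography H:
    lift to homogeneous coordinates, multiply, divide by the third component."""
    pts = np.asarray(pts, dtype=float)
    ph = np.hstack([pts, np.ones((len(pts), 1))])  # (u, v) -> (u, v, 1)
    q = ph @ H.T
    return q[:, :2] / q[:, 2:3]
```

For a pure 2D translation, e.g. H = [[1, 0, 3], [0, 1, -2], [0, 0, 1]], the point (1, 1) maps to (4, -1).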

Calculation of homography matrix
If the two adjacent images are induced by a certain plane in space, the corresponding homography matrix can be computed from the 2D point correspondences of the two images. Writing the mapping as x' = Hx, a simple linear solution for H can be derived. Since x' and Hx are homogeneous vectors pointing in the same direction, their cross product vanishes:

$$ \mathbf{x}'_i \times H\mathbf{x}_i = \mathbf{0}. $$

This can be given explicitly as a linear constraint on the entries of H:

$$ A_i\mathbf{h} = \begin{pmatrix} \mathbf{0}^{T} & -w_i'\mathbf{x}_i^{T} & v_i'\mathbf{x}_i^{T} \\ w_i'\mathbf{x}_i^{T} & \mathbf{0}^{T} & -u_i'\mathbf{x}_i^{T} \\ -v_i'\mathbf{x}_i^{T} & u_i'\mathbf{x}_i^{T} & \mathbf{0}^{T} \end{pmatrix}\mathbf{h} = \mathbf{0}, $$

where x_i' = (u_i', v_i', w_i')^T, A_i is a 3×9 matrix (only two of its rows are independent), and h is the 9-dimensional vector composed of the elements of the matrix H:

$$ \mathbf{h} = (h_1, \ldots, h_9)^{T}, \qquad H = \begin{pmatrix} h_1 & h_2 & h_3 \\ h_4 & h_5 & h_6 \\ h_7 & h_8 & h_9 \end{pmatrix}. $$

Stacking the constraints from n ≥ 4 correspondences gives Ah = 0, which is solved in the least-squares sense under ‖h‖ = 1 by the singular vector of A corresponding to its smallest singular value.
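The normalized DLT estimation described above can be sketched in numpy as follows. This is a minimal sketch (Hartley's point normalization plus an SVD null-space solve); the function names are ours, not the paper's.

```python
import numpy as np

def normalize(pts):
    """Translate centroid to origin and scale mean distance to sqrt(2);
    return normalized homogeneous points and the 3x3 transform T."""
    pts = np.asarray(pts, dtype=float)
    c = pts.mean(axis=0)
    d = np.sqrt(((pts - c) ** 2).sum(axis=1)).mean()
    s = np.sqrt(2) / d
    T = np.array([[s, 0, -s * c[0]],
                  [0, s, -s * c[1]],
                  [0, 0, 1.0]])
    ph = np.hstack([pts, np.ones((len(pts), 1))])
    return ph @ T.T, T

def dlt_homography(src, dst):
    """Estimate H with dst ~ H src from n >= 4 correspondences (Nx2 arrays)."""
    x, T1 = normalize(src)
    xp, T2 = normalize(dst)
    rows = []
    for (u, v, w), (up, vp, wp) in zip(x, xp):
        # the two independent rows of the 3x9 cross-product block A_i
        rows.append([0, 0, 0, -wp * u, -wp * v, -wp * w, vp * u, vp * v, vp * w])
        rows.append([wp * u, wp * v, wp * w, 0, 0, 0, -up * u, -up * v, -up * w])
    A = np.asarray(rows)
    _, _, Vt = np.linalg.svd(A)
    Hn = Vt[-1].reshape(3, 3)        # null-space solution, ||h|| = 1
    H = np.linalg.inv(T2) @ Hn @ T1  # undo the normalization
    return H / H[2, 2]
```

With exact correspondences from four or more points in general position, the smallest singular value is numerically zero and H is recovered up to scale.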

Fundamental Matrix
The correspondence between image points and epipolar lines described by epipolar geometry can be represented by the fundamental matrix. In other words, the fundamental matrix is an algebraic representation of epipolar geometry.
It is assumed that the two camera matrices are Q and Q', and that the image planes of the two cameras are U and U'. The parametric equation of the back-projection (inverse projection) ray of an image point m in U is

$$ \mathbf{X}(\lambda) = Q^{+}\mathbf{m} + \lambda\,\mathbf{C}, $$

where Q^+ is the generalized inverse of Q and C is the optical center of the first camera (QC = 0). Projecting this ray into the second image yields the epipolar line of m, from which the expression of the fundamental matrix between the two cameras can be derived:

$$ F = [\mathbf{e}']_{\times}\, Q' Q^{+}, $$

where e' = Q'C is the epipole in the second image and [·]_× denotes the skew-symmetric cross-product matrix.
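When both camera matrices are known, this expression can be evaluated directly. The sketch below (names ours) recovers the optical center C as the null vector of Q and applies the formula; it is an illustration of the closed form, not part of the initialization pipeline itself.

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix so that skew(v) @ w == np.cross(v, w)."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def fundamental_from_cameras(Q1, Q2):
    """F = [e']_x Q2 Q1^+, with e' = Q2 C and C the null vector of Q1."""
    _, _, Vt = np.linalg.svd(Q1)
    C = Vt[-1]                      # homogeneous camera centre, Q1 C = 0
    e2 = Q2 @ C                     # epipole in the second image
    return skew(e2) @ Q2 @ np.linalg.pinv(Q1)
```

For Q1 = [I | 0] and Q2 = [R | t] this reduces to the familiar [t]_× R, and every projected point pair satisfies x2ᵀ F x1 = 0.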

Calculation of fundamental matrix
The fundamental matrix is defined by the following equation:

$$ \mathbf{x}'^{T} F \mathbf{x} = 0, $$

where x and x' are any pair of corresponding points in the two images. As long as enough matching point pairs are selected, the fundamental matrix F can be computed. Writing f for the 9-vector of the entries of F, each correspondence (x_i, x_i') = ((u_i, v_i, 1), (u_i', v_i', 1)) contributes one linear equation, and n matched pairs give the linear system

$$ A\mathbf{f} = \begin{pmatrix} u_1'u_1 & u_1'v_1 & u_1' & v_1'u_1 & v_1'v_1 & v_1' & u_1 & v_1 & 1 \\ \vdots & & & & & & & & \vdots \\ u_n'u_n & u_n'v_n & u_n' & v_n'u_n & v_n'v_n & v_n' & u_n & v_n & 1 \end{pmatrix}\mathbf{f} = \mathbf{0}, $$

which is solved, with at least 8 correspondences, by the eight-point algorithm: an SVD null-space solve followed by enforcing the rank-2 constraint on F.
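The normalized eight-point algorithm that solves this system can be sketched as follows. This is a minimal numpy sketch following the standard recipe (point normalization, null-space solve, rank-2 projection, denormalization), not the ORB_SLAM implementation; function names are ours.

```python
import numpy as np

def eight_point_fundamental(x1, x2):
    """Estimate F with x2^T F x1 = 0 from n >= 8 matches (Nx2 arrays)."""
    def norm(p):
        # centroid to origin, mean distance sqrt(2) (Hartley normalization)
        p = np.asarray(p, dtype=float)
        c = p.mean(axis=0)
        s = np.sqrt(2) / np.sqrt(((p - c) ** 2).sum(axis=1)).mean()
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1.0]])
        return np.hstack([p, np.ones((len(p), 1))]) @ T.T, T

    p1, T1 = norm(x1)
    p2, T2 = norm(x2)
    # each match contributes one row of A f = 0 (row-major entries of F)
    A = np.stack([np.outer(b, a).ravel() for a, b in zip(p1, p2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # enforce the rank-2 constraint by zeroing the smallest singular value
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    F = T2.T @ F @ T1               # denormalize
    return F / F[2, 2]
```

With noisy real matches this estimate would be wrapped in the RANSAC loop described earlier, re-estimating from random 8-point samples and scoring inliers.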