A 3D reconstruction method for image sequences based on deep learning

Camera calibration, image feature detection, and matching have become barriers that traditional 3D reconstruction methods struggle to overcome. Deep learning, with its proven strength in data detection and classification, has become a research hotspot at home and abroad for addressing these problems. In this paper, a 3D reconstruction method for image sequences based on deep learning is proposed. First, the principles and methods of 3D reconstruction are introduced. Then, a new image-based 3D reconstruction method is studied and discussed in combination with deep learning theory. Finally, conclusions and prospects are given.


Introduction
With the development of technology and culture, the two-dimensional display of images can no longer meet human needs, so 3D reconstruction has received wide attention. 3D reconstruction originates from computer vision, and 3D reconstruction based on image sequences is an important branch of it. Its traditional implementation can be summarized as follows: extract the depth information of an image sequence through a series of technical means, and then rebuild the original 3D scene. Practice shows that in-depth study of this technology plays an important role in medicine, the military, aerospace, and other fields.
Traditional 3D reconstruction methods need to extract depth information for scene reconstruction, calibrate camera parameters, and perform feature detection and stereo matching on image sequences, all of which place very high demands on the underlying techniques. These problems have long been key difficulties for researchers, and so far no particularly good solution exists [1].
In recent years, deep learning theory has been widely applied to image detection, recognition, and other computer vision tasks. With the continuous development of deep learning technology, many 3D reconstruction methods based on deep learning have been proposed with improved reconstruction quality, and this has gradually become a research focus in the field. In this paper, a 3D reconstruction method for image sequences based on deep learning is proposed.

Overview
With the rapid development of digital image processing, computer graphics, and computer vision, naked-eye 3D display has gradually become a reality, and 3D reconstruction technology is especially prominent in the field of optical 3D display. The visual characteristics of the human eye allow the three-dimensional scenes of the real world to be mapped into the brain. Inspired by this, researchers try to use computers and digital sensing devices to obtain the 3D geometric information of a scene; this process is called 3D reconstruction. 3D reconstruction falls mainly into two categories: model-based 3D reconstruction and image-based 3D reconstruction.
3D reconstruction based on model.
Model-based 3D reconstruction refers to pre-defining a 3D model of the object, obtaining the 3D coordinates, colors, textures, and other information on the model surface through methods such as ray tracing, and then reproducing the model with a computer. This method is suitable for 3D reconstruction of static objects. Although it can be displayed dynamically using the animation attributes of the algorithm, it cannot stay consistent with the dynamic process of the actual scene.

3D reconstruction based on image.
Image-based 3D reconstruction refers to using certain technical means to extract 3D information from one or more images and reconstruct the 3D scene. The main technologies fall into active methods and passive methods.
An active method uses equipment to emit energy pulses and judges depth information from the difference in reflected energy received by a sensor. Such methods reconstruct the three-dimensional scene well, but they apply only to limited scenes and the equipment is expensive. A passive method uses algorithms to extract depth information directly from the images and construct the three-dimensional scene. Such methods are flexible, widely applicable, and cheap. In this paper, a passive method is used to realize image-based 3D reconstruction.
As mentioned earlier, image-based methods can reconstruct from a single image or from multiple images. Reconstruction from a single image collects the depth, color, texture, mapping, and other information of the scene from one image and then rebuilds the scene. This approach has the advantages of low cost, high speed, and a simple process, but it must extract a large amount of information from very few constraints, so the reconstruction quality is poor. There are many ways to reconstruct from multiple images [2]. Among them, the shading-based method estimates the three-dimensional information of an object and reconstructs the scene by studying the brightness of the object surface in the image; binocular stereo reconstruction uses a specific camera calibration method to quickly detect and match feature pixels or regions; and the structure from motion (SfM) method analyzes camera motion information to obtain the structure of the 3D scene, which also requires detecting and matching image features.
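The binocular stereo idea mentioned above can be made concrete: for a rectified stereo pair, the depth of a matched point follows from triangulation as Z = f·B/d, where f is the focal length in pixels, B is the baseline between the two cameras, and d is the disparity of the matched pixel. A minimal sketch (the numeric values are illustrative, not from the paper):

```python
def depth_from_disparity(f_px, baseline_m, disparity_px):
    """Depth of a matched point for a rectified stereo pair.

    f_px:         focal length in pixels
    baseline_m:   distance between the two camera centers (meters)
    disparity_px: horizontal shift of the point between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return f_px * baseline_m / disparity_px

# Example: f = 700 px, baseline = 0.1 m, disparity = 35 px -> depth = 2.0 m
print(depth_from_disparity(700, 0.1, 35))
```

Note the inverse relationship: nearby points shift more between the two views (large disparity, small depth), which is also why distant scenery appears almost identical to both eyes.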
In this paper, a 3D reconstruction method based on multiple images is adopted [3].

Principle and theoretical basis
The basic principle of passive 3D reconstruction from an image sequence is as follows: input the image sequence into the algorithm and preprocess it so that the image ordering and parameter properties meet the requirements of the subsequent feature detection; perform feature detection and stereo matching on the images, eliminating mismatches along the way; calibrate the camera array to obtain the internal and external parameters of the cameras while reducing calculation error (if the cameras were calibrated before the image sequence was captured, this step can be omitted); and, after computing the basic transformation matrices, display the three-dimensional point cloud information from the images in the form of a grid. The pipeline may also include the transformation from a sparse point cloud to a locally dense point cloud, which is one of the mainstream 3D display methods. This paper uses point cloud data for display; the specific content will be introduced in the next section. The basic theory of 3D reconstruction based on image sequences mainly includes the following contents:
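The matching and calibration steps above hinge on the epipolar constraint x'ᵀFx = 0, which every correct correspondence must satisfy once the fundamental matrix F is known; candidate matches with a large residual can be rejected as mismatches. A minimal pure-Python sketch (the example F corresponds to a camera translating horizontally, an assumption chosen only for illustration):

```python
def epipolar_residual(F, x, x_prime):
    """Residual x'^T F x for homogeneous pixel coordinates x and x'.

    A correct correspondence gives a residual of (nearly) zero;
    a large residual flags a mismatch.
    """
    Fx = [sum(F[i][j] * x[j] for j in range(3)) for i in range(3)]
    return sum(x_prime[i] * Fx[i] for i in range(3))

# F for a pure horizontal translation: F = [t]_x with t = (1, 0, 0)
F = [[0, 0, 0],
     [0, 0, -1],
     [0, 1, 0]]

good = epipolar_residual(F, (120, 80, 1), (150, 80, 1))  # same row: valid match
bad = epipolar_residual(F, (120, 80, 1), (150, 95, 1))   # row shifted: mismatch
print(good, bad)
```

For this F the constraint simply says that matched points must lie on the same image row, which is exactly the geometry of a horizontally translating (rectified) camera.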

Stereoscopic vision.
The human eye is the most important organ for perceiving three-dimensional information in the real world. Because there is a certain distance between the two eyes, each eye observes the same scene from a slightly different viewpoint, forming a binocular perspective. The human brain also has prior knowledge of real-world objects, so when the two views reach the brain, a person perceives the scene in three dimensions.

Camera calibration and feature extraction and matching.
Camera calibration plays an important role in traditional 3D reconstruction, and its accuracy directly affects the reconstruction quality. The basic principle of camera calibration is to solve for the internal and external parameters of the camera from points with known positions in space. Image feature extraction and stereo matching are the basis of computer 3D reconstruction, and the related techniques are widely used in image retrieval, stitching, target recognition, and other key areas of computer vision. Their basic principle is to extract image feature points with locally invariant feature detection and description operators, and to match the pixels or regions that represent the same spatial position across images, preparing for the reconstruction of 3D scene information. Researchers continue to improve the related algorithms, but disastrous problems such as mismatching remain unavoidable in practice.
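The parameters that calibration recovers are exactly those of the pinhole model: the internal parameters (focal length f and principal point (cx, cy)) and the external pose (R, t), which map a world point to a pixel via (Xc, Yc, Zc) = R·X + t, u = f·Xc/Zc + cx, v = f·Yc/Zc + cy. A minimal sketch, with numbers chosen only for illustration:

```python
def project_point(X, R, t, f, cx, cy):
    """Project world point X through a pinhole camera with pose (R, t)."""
    # world -> camera coordinates: Xc = R @ X + t
    Xc = [sum(R[i][j] * X[j] for j in range(3)) + t[i] for i in range(3)]
    if Xc[2] <= 0:
        raise ValueError("point is behind the camera")
    # perspective division plus intrinsics
    u = f * Xc[0] / Xc[2] + cx
    v = f * Xc[1] / Xc[2] + cy
    return u, v

# Identity rotation, zero translation, f = 100 px, principal point (50, 50)
I = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(project_point((1.0, 2.0, 5.0), I, [0, 0, 0], 100, 50, 50))
```

Calibration runs this mapping in reverse: given many known world points and their observed pixels, it solves for f, (cx, cy), R, and t that best explain the observations.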
In traditional 3D reconstruction based on image sequences, how well the above steps are implemented has a decisive impact on the generated 3D scene. Although these theories are relatively mature, researchers still lack an optimal algorithm that provides strong support in concrete implementations. In other words, these algorithms are still being optimized, and whether a final solution exists remains unknown.
In this paper, deep learning theory is applied to explore the realization of 3D reconstruction based on image sequences. Deep learning excels at detection and classification, so the scheme proposed in this paper can avoid technical problems such as camera calibration and image feature detection and matching. The specific implementation is discussed below.

Theory
Deep learning originates from the artificial neural network (ANN), but it differs from the shallow feature extraction of a conventional ANN: it achieves deeper feature extraction ability, which is not merely a matter of having multiple hidden layers. Through function mappings from low-level signals to high-level features, deep learning builds a hierarchical model of the implicit relationships in the training data, imitating the visual cognitive reasoning process of the human brain so that the learned features have stronger generalization and expressive ability [4].
Before feature extraction, a deep learning model must be trained in advance. The more comprehensive the training, the better the feature extraction and the better the final result. Because the training data must cover all or most features of the scene information, the requirements on the quantity and quality of training data differ greatly between displaying simple regular objects and displaying complex irregular scenes. Training can be divided into supervised training and unsupervised layer-by-layer training. The division between the two is not strict, and in many practical applications they are combined: first, unsupervised layer-by-layer training is used for broad training of the framework, and then supervised training is used to fine-tune the details.
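The "unsupervised layer-by-layer training followed by supervised fine-tuning" schedule can be illustrated with a deliberately tiny model: a one-weight autoencoder is first pretrained without labels to reconstruct its input, and its learned encoding is then reused by a supervised output layer. All numbers here are illustrative; real systems use multi-layer networks:

```python
# Phase 1: unsupervised pretraining -- learn w, v so that v*w*x reconstructs x.
xs = [1.0, 2.0, 3.0]          # unlabeled training data
w, v = 0.5, 0.5               # encoder and decoder weights
lr = 0.005
for _ in range(500):
    grad_w = sum(-2 * (x - v * w * x) * v * x for x in xs)
    grad_v = sum(-2 * (x - v * w * x) * w * x for x in xs)
    w -= lr * grad_w
    v -= lr * grad_v
# after pretraining, the reconstruction gain v*w is close to 1

# Phase 2: supervised fine-tuning -- fit an output weight c on labeled pairs
# (x, y) with y = 2*x, reusing the pretrained encoding h = w*x
# (closed-form least squares for this one-parameter "layer").
pairs = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
c = sum((w * x) * y for x, y in pairs) / sum((w * x) ** 2 for x, _ in pairs)

def predict(x):
    return c * (w * x)   # encoder (pretrained) + output layer (fine-tuned)

print(round(v * w, 3), round(predict(4.0), 3))
```

The point of the two-phase split is visible even at this scale: the expensive, label-free phase shapes the representation, while the cheap supervised phase only has to fit the task-specific output on top of it.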

Application
As mentioned above, a deep learning method trained in advance on a suitable dataset can extract deep features from new input data and then classify, identify, fuse, and display the data. Combining deep learning with the theory of 3D reconstruction from image sequences, the image sequence is taken as the input of the deep learning system [5]. Without camera calibration to extract depth information from the two-dimensional images, and without complex feature detection and matching between images, the 3D scene information can be obtained directly through the shallow detection and deep extraction of the deep learning mechanism, and a realistic three-dimensional display of the scene can be produced through existing fusion technology.
Wu J. et al. proposed MarrNet in 2017. First, normal, depth, and silhouette sketches of the object are recovered from RGB images; then the 3D model is recovered from these 2.5D sketches; finally, a reprojection consistency loss is used to ensure the alignment of the estimated 3D model with the 2.5D sketches [6].
Wang N. et al. proposed Pixel2Mesh, based on a graph convolutional network (GCN), in 2018. Using features extracted from the input image, it generates the correct three-dimensional shape by gradually deforming an ellipsoid represented as a mesh, realizing end-to-end reconstruction of a triangular-mesh 3D model from a single color image. A 3D mesh is a geometry of vertices, edges, and faces. This method lets the network start training with a small number of vertices, learn to place the vertices at the most representative positions, and increase the number of vertices during forward propagation to add local detail. During training, four kinds of losses constrain the output shape and the deformation process: the chamfer loss constrains the positions of the mesh vertices, the normal loss enforces consistency of the model surface normals, Laplacian regularization keeps the relative positions of adjacent vertices unchanged during deformation, and edge-length regularization prevents outliers [7].
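Of the four Pixel2Mesh losses, the chamfer loss is the easiest to make concrete: it takes, for each point in one set, the squared distance to its nearest neighbor in the other set, and averages in both directions. A minimal pure-Python sketch (brute-force nearest neighbor; real implementations are batched and vectorized):

```python
def chamfer_distance(P, Q):
    """Symmetric chamfer distance between two 3D point sets.

    For every p in P, find the squared distance to its nearest q in Q
    (and vice versa), then average each direction and sum the two.
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    p_to_q = sum(min(sq_dist(p, q) for q in Q) for p in P) / len(P)
    q_to_p = sum(min(sq_dist(q, p) for p in P) for q in Q) / len(Q)
    return p_to_q + q_to_p

A = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(chamfer_distance(A, A))                                   # identical sets
print(chamfer_distance([(0.0, 0.0, 0.0)], [(3.0, 4.0, 0.0)]))   # 25 + 25
```

Because each point only needs *some* nearest neighbor, the chamfer loss requires no correspondence between predicted and ground-truth vertices, which is what makes it usable for comparing a deforming mesh against a sampled point cloud.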

Realization
Based on the above, this paper proposes a new method of 3D reconstruction from image sequences, whose basic flow is as follows: (1) before generating the three-dimensional scene, pre-train the deep learning system with a large amount of prior knowledge and basic data, using "unsupervised training of the framework + supervised training for optimization" to tune the expected reconstruction quality; (2) take the two-dimensional image sequence as the input of the deep learning system, and use the system's layer-by-layer feature extraction to quickly find the 3D scene information to be retrieved; (3) add a fusion processing module to the deep learning system to normalize the retrieved 3D scene information and complete an omnidirectional view fusion of the scene using interpolation, estimation, and other techniques; (4) plan the weights of feature points and regions according to the preset display expectation, so that the final display does not lose significant information; (5) finally, use display technologies to obtain a naked-eye display of the 3D point cloud scene. The specific display technology is beyond the scope of this paper and will not be discussed here.
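The flow described above can be sketched as a pipeline skeleton. Every function body here is a trivial stand-in (the paper does not specify concrete algorithms), so all names and placeholder logic are assumptions used only to show the data flow between the stages:

```python
def pretrain(prior_data):
    # stage 1 (placeholder): "unsupervised framework + supervised optimization"
    return {"trained_on": len(prior_data)}

def extract_scene_info(model, image_sequence):
    # stage 2 (placeholder): layer-by-layer feature extraction from the images
    return [{"frame": i, "points": [(i, i, i)]} for i, _ in enumerate(image_sequence)]

def fuse_views(scene_info):
    # stage 3 (placeholder): normalize and fuse per-frame info into one cloud
    return [p for frame in scene_info for p in frame["points"]]

def weight_points(point_cloud, keep=lambda p: True):
    # stage 4 (placeholder): keep the points whose display weight matters
    return [p for p in point_cloud if keep(p)]

def display(point_cloud):
    # stage 5 (placeholder): hand the cloud to a display technology
    return f"displaying {len(point_cloud)} points"

model = pretrain(["prior_1", "prior_2"])
info = extract_scene_info(model, ["img_0", "img_1", "img_2"])
cloud = weight_points(fuse_views(info))
print(display(cloud))
```

The design point the skeleton captures is that the learned system replaces the calibration and matching stages of the classical pipeline, while the fusion, weighting, and display stages remain conventional post-processing.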

Conclusion
In this paper, the principles and methods of image-based 3D reconstruction are studied and analyzed, and the application of deep learning to image classification and detection is explored. The proposed method avoids the camera calibration and image feature detection and matching problems of traditional 3D reconstruction, and uses a point cloud to display the 3D scene.
In general, deep learning provides a new way to solve the problems of image-based 3D reconstruction, and some research results have been achieved. However, its own problems remain, and it is still a key research direction for the future.