Single Image 3D Reconstruction Algorithm Based on a Parallel Dual-Branch Pyramid Stacking Network

The limited visual information contained in a single image and the complex motion of objects can lead to severe fragmentation and confusion of the model backbone in 3D reconstruction. To fully extract the feature information in a single image and reduce the noise interference caused by environmental factors, a Parallel Dual-Branch Pyramid Stacking Network (PDB-PSN) model is proposed. A parallel dual-branch network is used to construct an encoder-decoder framework based on cascaded feature extraction, which encodes and decodes the high- and low-resolution features in a single image; the 2D features are then converted to 3D through implicit functions to reconstruct the target object. A cascaded feature extraction network serves as the low-resolution branch and extracts the global features of the object. In the high-resolution branch, three serial hourglass networks are used, with dilated convolutions inside the hourglass units to enlarge the receptive field and capture more global information, so as to keep the limbs of the reconstructed object intact. A threshold processing module removes irrelevant features, preserving the integrity of the global information while reducing interference from irrelevant noise. Simulation experiments on a self-built terracotta figurine dataset show that the PDB-PSN model can completely reconstruct the 3D model from a single image and effectively eliminate fragmentation in the reconstruction results.


Introduction
Three-dimensional reconstruction can obtain rich texture information from few image inputs, and is therefore widely applied in autonomous driving, virtual reality, behaviour analysis, and cultural heritage preservation, where single or multiple two-dimensional images are converted into three-dimensional models. As one of the important media for understanding historical and cultural change, cultural relics have always been a focus of cultural protection and restoration. With the development of deep learning, digital modeling technology has gradually been applied to the restoration of figural cultural relics such as pottery figurines and sculptures. Current three-dimensional reconstruction mainly handles relic models with simple poses; for models with complex poses, problems such as severe fragmentation and confusion of the model backbone may occur.
Based on the form of the processed data, deep learning-based three-dimensional reconstruction can be classified into voxel-based, point cloud-based, and mesh-based methods. Eigen et al. [1] extended 2D convolution to 3D for three-dimensional reconstruction, used neural networks to directly recover depth maps, divided the network into global coarse estimation and local fine estimation, and used a scale-invariant loss function for regression. Choy et al. [2] proposed the 3D-R2N2 model, which uses an Encoder-3DLSTM-Decoder structure to establish a mapping from 2D images to 3D voxel models and achieves voxel-based single-view and multi-view three-dimensional reconstruction. However, voxel-based methods have the disadvantage that improving accuracy requires increasing the resolution, which greatly prolongs computation time. Point cloud-based three-dimensional reconstruction uses a simple and easy-to-learn structure. Fan et al. [3] designed a point cloud sampling component in the PSG model that selects a trustworthy point cloud from multiple predicted three-dimensional point clouds according to the extracted image features, and the model shows good universality. Qi et al. [4] proposed PointNet, which takes point clouds as input, extracts shape features from three-dimensional point clouds, and provides a segmentation label for each point. Experiments show that PointNet yields a lightweight architecture, but the points in a cloud are independent of each other and cannot provide rich surface texture information, so the reconstructed object surface is not smooth. By contrast, mesh-based three-dimensional reconstruction can represent rich shapes through the connectivity of adjacent points. Wang et al. [5] proposed Pixel2Mesh, which uses an end-to-end neural network to generate a mesh representation with three-dimensional information from a single image.
However, the current mainstream methods mainly reconstruct three-dimensional models without texture features [6] and have difficulty adapting to the pose diversity caused by the complexity of the human body [7]. To solve these issues, three-dimensional reconstruction based on implicit function representation has been proposed. Saito et al. [8] proposed the Pixel-Aligned Implicit Function (PIFu), which predicts an implicit function based on the z value of a 3D query point and its projected 2D image feature, generating plausible surfaces from a single image. PIFuHD [9] further improves on PIFu in recovering fine geometric detail by using predicted normal maps and a higher resolution. However, PIFu cannot predict accurate spatial positions, and depth ambiguity makes the reconstructed three-dimensional model prone to limb overlap and fragmentation.
Common three-dimensional reconstruction methods currently have the following disadvantages: model representation is uncertain; voxel-based representations need more resources, and their computational complexity increases exponentially with data volume [10]; surfaces reconstructed from point cloud-based representations are rough, and the points are mutually independent and cannot present rich surface information; mesh-based representations can overcome the shortcomings of the above two methods, but current feature extraction networks cannot extract complete and rich image features well.
To address the above problems, a parallel dual-branch pyramid stacking network model is proposed, in which two parallel networks extract high-resolution and low-resolution feature information from a single image to fully reconstruct the three-dimensional model of the object. The primary contributions of this paper are summarized as follows: (1) A parallel encoder-decoder architecture is used to extract high-resolution and low-resolution feature information from single images. A lightweight hourglass network extracts local high-resolution details, while a cascaded feature extraction network obtains global low-resolution features. An implicit function converts the 2D features to 3D features.
(2) The high-resolution feature extraction network is improved by introducing dilated convolution modules and threshold processing modules. The dilated convolution module enlarges the receptive field of feature extraction, while the threshold processing module removes irrelevant features, reducing noise interference and minimizing feature information loss while the receptive field is increased.
(3) We propose a new 3D terracotta warrior model dataset, which includes 2D images from different observation angles, 3D models of varying accuracy, model material texture maps with their mask images, and normal maps of the texture features. The dataset contains 70 sets of 3D terracotta warrior models and 102,340 3D data resources of different attributes in total.

Single-image 3D reconstruction method based on PDB-PSN
A single-image 3D reconstruction method based on PDB-PSN is proposed; the algorithm flowchart is shown in Figure 1. The algorithm adopts a parallel architecture. First, a cascaded feature extraction network is introduced to construct low-resolution feature information and extract the object's main framework and key point features. Then, a lightweight hourglass network extracts high-resolution feature information and the detail features of the model surface. Next, the feature values extracted by the two branches are fused to construct an implicit function that maps the 2D image to 3D feature points. From the coordinates of adjacent 3D feature points, the normal vectors of the surface they compose are calculated to obtain a smoother 3D model surface. Finally, the effectiveness of the proposed algorithm is verified on a self-built 3D terracotta warrior dataset.
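As an overview, the parallel flow described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's published code: the branch modules, the `sample_bilinear` helper, and the `project` callable are placeholders for the components detailed in the following subsections.

```python
# Illustrative sketch of the PDB-PSN forward pass; all names are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

def sample_bilinear(feat, uv):
    """Bilinearly sample per-point image features; uv in [-1, 1], (B, N, 2)."""
    out = F.grid_sample(feat, uv.unsqueeze(2), align_corners=True)  # (B, C, N, 1)
    return out.squeeze(3).transpose(1, 2)                           # (B, N, C)

class PDBPSN(nn.Module):
    def __init__(self, low_res_branch, high_res_branch, implicit_mlp):
        super().__init__()
        self.low_res_branch = low_res_branch    # cascaded pyramid: global features
        self.high_res_branch = high_res_branch  # stacked hourglass: local details
        self.implicit_mlp = implicit_mlp        # fused features + depth -> occupancy

    def forward(self, image, points, project):
        # image: (B, 3, H, W); points: (B, N, 3) 3D query points;
        # project: maps 3D points to 2D coords uv and camera depth z.
        feat_lo = self.low_res_branch(image)
        feat_hi = self.high_res_branch(image)
        uv, z = project(points)
        fused = torch.cat([sample_bilinear(feat_lo, uv),
                           sample_bilinear(feat_hi, uv),
                           z.unsqueeze(-1)], dim=-1)
        return torch.sigmoid(self.implicit_mlp(fused))  # P(inside) per point
```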

Cascade pyramid feature extraction network
The lack of 3D depth information in 2D images easily leads to uncertainty in the target's action information and to problems such as self-occlusion of the target, so the 3D models reconstructed by many current methods cannot represent the true features of the real model well.
To extract global features of the target object in a single image, a feature extraction model based on cascaded pyramids is proposed, as shown in Figure 2. First, the 2D image is gradually shrunk into multiple sub-images of varying scales, and feature information is extracted from each sub-image separately, ensuring that the information at different scales does not interfere. Next, the extracted multi-level features are connected and fused across scales to construct a multi-level feature map, achieving top-down extraction of the main feature points of the human body. Finally, the feature maps of the various levels are adjusted bottom-up to a unified scale and normalized through a 1×1 convolution layer to output the extracted main features of the target object.
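A minimal sketch of this cascaded pyramid idea is given below, assuming PyTorch; the number of scales, channel widths, and per-scale extractor depth are illustrative choices, not the paper's exact configuration.

```python
# Sketch: independent per-scale extraction, then fusion at a unified scale.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CascadedPyramidExtractor(nn.Module):
    def __init__(self, in_ch=3, feat_ch=64, num_scales=3):
        super().__init__()
        # One independent extractor per scale, so scales do not interfere.
        self.extractors = nn.ModuleList(
            nn.Sequential(
                nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(inplace=True),
            ) for _ in range(num_scales)
        )
        # 1x1 convolution normalizes the fused multi-level features.
        self.fuse = nn.Conv2d(feat_ch * num_scales, feat_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = []
        for s, extractor in enumerate(self.extractors):
            # Top-down: progressively shrink the image into sub-images.
            scaled = F.interpolate(x, scale_factor=1 / (2 ** s),
                                   mode='bilinear', align_corners=False)
            f = extractor(scaled)
            # Bottom-up: resize every level back to the unified scale.
            feats.append(F.interpolate(f, size=(h, w),
                                       mode='bilinear', align_corners=False))
        return self.fuse(torch.cat(feats, dim=1))
```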

Lightweight hourglass network
The traditional hourglass network consists of hourglass units with four-order down-sampling and up-sampling, together with intermediate supervision, achieving multi-level feature extraction through repeated bottom-up and top-down processing. However, because of the large amount of data in 3D reconstruction, the four-order hourglass unit has a high memory requirement. Therefore, a lightweight serial hourglass network composed of three serial second-order hourglass units is proposed: reducing the order of a single hourglass unit while increasing the number of serial units preserves feature information extraction. Additionally, a different type of residual module is used in each hourglass layer to improve training speed and reduce the loss of feature information. The network structure is shown in Figure 3. First, a basic residual module in the first hourglass layer extracts all feature information from the 2D image. Then, a large-receptive-field residual module in the second layer effectively associates global and local information by enlarging the receptive field. Finally, a threshold processing residual module in the third layer eliminates interference from irrelevant features by setting a threshold.
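The following sketch illustrates a second-order hourglass unit and the three-unit serial stack, again in PyTorch; `BasicResidual` stands in for the per-layer residual variants (basic, dilated, threshold) described in the next subsections.

```python
# Sketch of a second-order hourglass unit and a three-unit serial stack.
import torch.nn as nn
import torch.nn.functional as F

class BasicResidual(nn.Module):
    """Pre-activation residual block (the 'basic residual module')."""
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
    def forward(self, x):
        return x + self.body(x)

class Hourglass2(nn.Module):
    """Second-order hourglass: two down-/up-sampling steps with skips."""
    def __init__(self, ch, block=BasicResidual):
        super().__init__()
        self.skip1, self.skip2 = block(ch), block(ch)
        self.down1, self.down2 = block(ch), block(ch)
        self.bottleneck = block(ch)
        self.up1, self.up2 = block(ch), block(ch)

    def forward(self, x):
        s1 = self.skip1(x)
        d1 = self.down1(F.max_pool2d(x, 2))   # first down-sampling
        s2 = self.skip2(d1)
        d2 = self.down2(F.max_pool2d(d1, 2))  # second down-sampling
        u2 = self.up2(F.interpolate(self.bottleneck(d2), scale_factor=2) + s2)
        return self.up1(F.interpolate(u2, scale_factor=2) + s1)

def lightweight_hourglass(ch, blocks):
    """Three serial second-order units, one residual block type per layer."""
    return nn.Sequential(*(Hourglass2(ch, b) for b in blocks))
```

Following the paper's layer assignment, `blocks` would be the basic, dilated, and threshold residual modules in that order (sketched below).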

Dilated residual module
The original residual module mainly consists of convolution layers, normalization layers, pooling layers, and up-sampling layers, as shown in Figure 4(a). In this module, a dilated convolution with a dilation rate of 1 and an effective size of 5×5 is introduced to replace the original two ordinary 3×3 convolutions, which enlarges the receptive field of the hourglass network, enables it to learn more global information, and further improves the correlation between global and local features, as shown in Figure 4(b). Enlarging the receptive field through dilated convolution improves global information extraction; however, dilated convolution may lose local information near the global information after multiple convolutions, so the dilated convolution module is added only to the residual module of the second hourglass layer.
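A sketch of the dilated residual block is given below. One assumption to note: the "5×5 dilated convolution with dilation rate 1" is read here as a 3×3 kernel with one inserted gap (PyTorch `dilation=2`), which yields an effective 5×5 receptive field; dilation-rate conventions vary between papers.

```python
# Sketch of the dilated (large receptive field) residual block.
import torch.nn as nn

class DilatedResidual(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            # dilation=2 with padding=2 keeps the spatial size while
            # enlarging the receptive field to an effective 5x5.
            nn.Conv2d(ch, ch, kernel_size=3, padding=2, dilation=2),
        )
    def forward(self, x):
        return x + self.body(x)
```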

Threshold processing residual module
In order to reduce the interference of irrelevant noise caused by the abundance of information during feature extraction, a threshold processing residual module is introduced in the third layer of the hourglass network. The module consists of a global average pooling layer, a normalization layer, and a sigmoid activation function; the feature map learns a set of thresholds through this module, as shown in Figure 4(c). The threshold scaling is computed through the sigmoid function

$y = \frac{1}{1 + e^{-k}}$

where k represents the channel feature value and y represents the output of the threshold processing module. Based on the threshold, features far from the model surface are removed and the interference of noise is reduced, which decreases the fragmentation of the reconstructed model.
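The sketch below is consistent with the module's stated components (global average pooling, normalization, sigmoid) and follows the common residual-shrinkage soft-threshold formulation; the exact shrinkage rule is an assumption, since only the sigmoid step survives in the text.

```python
# Hedged sketch of the threshold processing residual block (soft threshold).
import torch
import torch.nn as nn

class ThresholdResidual(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.conv = nn.Sequential(
            nn.BatchNorm2d(ch), nn.ReLU(inplace=True),
            nn.Conv2d(ch, ch, 3, padding=1),
        )
        self.fc = nn.Sequential(
            nn.Linear(ch, ch), nn.BatchNorm1d(ch), nn.ReLU(inplace=True),
            nn.Linear(ch, ch), nn.Sigmoid(),
        )

    def forward(self, x):
        y = self.conv(x)
        abs_mean = y.abs().mean(dim=(2, 3))      # GAP of |features|, (B, C)
        tau = abs_mean * self.fc(abs_mean)       # learned per-channel threshold
        tau = tau.unsqueeze(-1).unsqueeze(-1)    # (B, C, 1, 1)
        # Soft thresholding: shrink small (irrelevant) responses to zero.
        y = torch.sign(y) * torch.clamp(y.abs() - tau, min=0)
        return x + y
```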

Pixel-aligned implicit function
An implicit function describes the mapping between variables through an implicit equation and plays an important role in computer vision and computer graphics. In 3D reconstruction, it maps 2D data to a high-dimensional space, processing and analyzing data by learning the mapping from input to output. A method of constructing implicit functions by combining 2D feature information with camera depth values is proposed: the high-resolution and low-resolution features extracted from the 2D image are combined with the camera depth value to obtain the probability that the corresponding 3D spatial feature point exists; the larger the value, the more likely the corresponding point exists in the 3D model. Compared with other methods, the proposed implicit function offers higher accuracy and robustness and can better represent the 3D model of objects in different poses. The pixel-aligned implicit function is constructed as

$f\big(F(x), z(X)\big) = P, \quad P \in [0, 1]$

where X represents a point in three-dimensional space, x represents the projection of X onto the two-dimensional image, F(x) represents the feature vector at that image point, z(X) represents the distance between the three-dimensional point and the origin of the camera coordinate system, and P represents the probability that X lies inside the model: P closer to 1 indicates a higher probability, and P closer to 0 a lower one. By combining the feature vector in the two-dimensional image with the corresponding camera depth value z(X), the probability of a point lying inside the model can be determined, so three-dimensional points with a high probability of being inside the model can be selected.
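A hedged sketch of the pixel-aligned query is given below, assuming a PIFu-style calibration matrix that maps 3D points into normalized image coordinates; the paper's exact camera model is not specified. This function plays the role of `project` in the earlier overview sketch.

```python
# Sketch: project 3D query points X to image coords x and camera depth z(X).
import torch

def pixel_aligned_query(points, calib):
    """points: (B, N, 3) 3D query points; calib: (B, 4, 4) camera matrix
    assumed to map into normalized [-1, 1] image coordinates."""
    homo = torch.cat([points, torch.ones_like(points[..., :1])], dim=-1)
    cam = torch.einsum('bij,bnj->bni', calib, homo)  # points in camera space
    uv = cam[..., :2]   # normalized image coordinates x for feature sampling
    z = cam[..., 2]     # depth z(X) along the camera axis
    return uv, z
```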

Loss function
The loss function consists of two parts: the probability of a 3D point estimated by the network, and the true value of that point in the corresponding 3D model. The squared difference between the two, averaged over the number of points in 3D space, forms the loss function:

$L = \frac{1}{n} \sum_{i=1}^{n} \left| f\big(F(x_i), z(X_i)\big) - f^{*}(X_i) \right|^2$

where i represents the index of a point, n represents the number of points in the 3D model space of the cultural relic, x_i represents the coordinates of the i-th point in the 2D image of the cultural relic, X_i represents the coordinates of the i-th point in the 3D model space, and f* represents the true value function, defined as

$f^{*}(X) = \begin{cases} 1, & X \text{ is inside the model} \\ 0, & X \text{ is outside the model} \end{cases}$
Using the loss function L, the pixel-aligned implicit function f is continuously fitted to the true value function f*, so that the pixel-aligned implicit function reflects the mapping from the 2D image to the 3D feature points well. After obtaining the point set in 3D space, the normal vectors of the surfaces composed of adjacent 3D feature points are calculated from their coordinates; a smoother 3D model surface is then obtained, completing the conversion from 2D image to 3D model.
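The loss reduces to a mean squared error over sampled points, as in the short sketch below (tensor shapes are illustrative).

```python
# Sketch of the occupancy loss L above, assuming PyTorch tensors.
import torch

def occupancy_loss(pred, gt):
    """pred: (B, N) predicted probabilities f(F(x_i), z(X_i));
    gt: (B, N) ground-truth occupancy f*(X_i) in {0, 1}."""
    return ((pred - gt) ** 2).mean()
```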

Data collection and preprocessing
The dataset used in this research consists of the ceramic figurine artworks of Wang Qian, a master of arts and crafts in Shaanxi Province. It includes more than 70 digital ceramic figurine models and 102,340 three-dimensional data resources of different attributes. The experimental equipment used to collect the three-dimensional models is listed in Table 1. First, an MP80 scanner collects 360° pose and texture data of each terracotta figurine model, and the collected data are synthesized to generate the model surface and color mapping. Meanwhile, two-dimensional images of each figurine are collected from the front, back, left, right, top, and bottom viewing angles, completing the two-dimensional and three-dimensional data collection of the terracotta figurine models.
To make the key points in the three-dimensional model more accurate, the three-dimensional point cloud in the collected data is simplified, reducing the number of points to 30% of the original model; the simplified point cloud then forms a simplified surface. At the same time, the texture file of the model is masked to obtain a mask image for each texture area. Finally, each model set in the dataset includes six two-dimensional images from different angles, the model material files with their mask images, and two three-dimensional model files of different accuracies.
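The 30% simplification step might be performed as follows with a recent Open3D version; the paper does not name its tooling, and the file names here are hypothetical.

```python
# Hedged sketch of the 30% point cloud simplification step.
import open3d as o3d

pcd = o3d.io.read_point_cloud("figurine_scan.ply")        # hypothetical input
simplified = pcd.random_down_sample(sampling_ratio=0.3)   # keep ~30% of points
o3d.io.write_point_cloud("figurine_simplified.ply", simplified)
```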

Model evaluation metrics
The following evaluation metrics are used. Intersection over Union (IoU) represents the proportion of the overlapping region of two models within their union; the larger the IoU, the more the two models overlap. Precision represents the ratio of true positives (points predicted inside the model that are actually inside) to all points predicted inside the model, reflecting the model's ability to distinguish negative samples. Recall represents the ratio of true positives to all points actually inside the model, reflecting the model's ability to distinguish positive samples. The two are combined into the F-score:

$F\text{-}score = \frac{2 \cdot Precision \cdot Recall}{Precision + Recall}$

The higher the F-score, the better the performance of the model. C2C avg-distance represents the average distance between the reconstructed point clouds, calculated by dividing the sum of distances between point cloud pairs by the total number of pairs. Cloud distribution (CD) represents the overall distribution of distances between point cloud pairs, which can be shown as distance intervals and reflects the overall performance of the model.
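These metrics can be computed as in the following NumPy sketch, assuming boolean inside-model masks for IoU/precision/recall/F-score and point arrays for the cloud-to-cloud distance (a brute-force nearest-neighbour version; a KD-tree would be used at scale).

```python
# Sketch of the evaluation metrics; inputs are assumed non-empty.
import numpy as np

def occupancy_metrics(pred, gt):
    """pred, gt: boolean arrays of the same shape (inside-model masks)."""
    tp = np.logical_and(pred, gt).sum()          # true positives
    iou = tp / np.logical_or(pred, gt).sum()
    precision = tp / pred.sum()
    recall = tp / gt.sum()
    f_score = 2 * precision * recall / (precision + recall)
    return iou, precision, recall, f_score

def c2c_avg_distance(recon_pts, ref_pts):
    """Average nearest-neighbour distance from reconstructed points (M, 3)
    to reference points (K, 3), brute force."""
    d = np.linalg.norm(recon_pts[:, None, :] - ref_pts[None, :, :], axis=-1)
    return d.min(axis=1).mean()
```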

Training

Five sets of models are selected as training data. The complete training process includes two phases: pose reconstruction and color reconstruction. Based on the network framework above, the pose reconstruction is trained with a batch size of 3, a learning rate of 0.0001, and 50 training epochs; the color reconstruction is trained with a batch size of 4, a learning rate of 0.001, and 50 training epochs. The variation of the IoU value during training is shown in Figure 5, where the horizontal axis represents the training epoch and the vertical axis the IoU value. After about 40 epochs the IoU gradually stabilizes, and the overlap between the reconstructed model and the original model settles at around 0.93. Subsequently, a 500×500 2D terracotta warrior image was input into the trained network to reconstruct the corresponding 3D terracotta warrior model, and the point cloud distance distribution of the reconstructed model was compared with the original 3D model, as shown in Figure 6, where the horizontal axis represents the distance intervals between point clouds and the vertical axis the number of point cloud pairs in each interval. Among the 65,602 points compared, over 99.9% of the point cloud distances are within 0.1; the closer a distance is to 0, the more accurate the corresponding point position, and more than half of the distances are within 0.01, indicating a high degree of overlap between the point clouds of the reconstructed model and those of the original model.

Comparison

The terracotta figurine models reconstructed by the PIFu and PIFuHD algorithms were compared with those obtained by the proposed method. Three models reconstructed by each method were selected for frontal comparison, and the results are shown in Figure 7. Compared with the PIFu algorithm, the proposed method achieves better completeness with no fragmentation; compared with the PIFuHD algorithm, the pose structure is not disrupted, and the size and pose of the reconstructed model are closer to the original. The IoU value, the four observed metrics from the training process, and the per-epoch training time (s/epoch) were selected as performance indicators, with the comparison results shown in Table 2. From Table 2, it can be seen that the proposed method coincides more closely with the original 3D terracotta warrior model, achieves the minimum average distance between point clouds, and yields a point cloud distribution more concentrated near the original model, so the reconstructed model better reflects the 3D features. However, the training time per iteration increased significantly due to the enlarged network structure, and lightweight optimization has not been achieved because model optimization is slow.

Conclusion
A parallel dual-branch pyramid stacking network model is proposed. A threshold processing operation is introduced into the hourglass network to eliminate irrelevant features, and dilated convolution is introduced to enlarge the receptive field of feature learning. The improved hourglass modules are connected in series to form the high-resolution feature extraction branch; a cascaded feature extraction network serves as the low-resolution branch, and the two branches are connected in parallel to extract image features. Based on the extracted features, a mapping function between 2D images and 3D models is constructed, from which the feature point information in 3D space can be obtained. Finally, the normal vectors of the surfaces composed of adjacent 3D feature points are calculated from their coordinates, and a 3D model with texture information is generated. Experiments on the collected 3D terracotta warrior data show that the proposed method is superior to existing models in model completeness and minimizes the fragmentation of reconstructed models, although training speed remains a limitation. Future work will further speed up the training process, improve the correlation between 3D pose data and the corresponding color data, and verify these advantages on more datasets.
