Anthropometric Parameter Measurement from Equivariant Multi-view Images

In this paper, we propose an anthropometric parameter measurement method in which any customized parameter can be measured online from endpoints pre-selected on 3D body models reconstructed from equivariant multi-view images. The method comprises three steps: 3D body model reconstruction, anthropometric parameter measurement, and parameter modification. In the reconstruction step, we detect and segment the human body from the background and reconstruct a generative 3D body model from each segmented image with deep learning. We then measure the anthropometric parameters on the reconstructed 3D body model of each view. Because all vertices of the reconstructed body model are ordered, the endpoints associated with all anthropometric parameters can be manually pre-selected on the model before measurement. However, the information in a single-view image is insufficient, and the measurement result varies regularly as the view changes. To improve measurement accuracy, we design a convolutional neural network in the last step that regresses more accurate anthropometric parameters from the equivariant multi-view measurements. Experimental results on a representative dataset demonstrate that the proposed method can measure planar and spatial anthropometric parameters automatically with comparable performance.


Introduction
The measurement of anthropometric parameters has a wide range of applications in virtual try-on and health monitoring. Traditional tape-based methods are not only time-consuming but also infeasible on some occasions. With the growth of technology, computer-based anthropometric parameter measurement has attracted the attention of many researchers. Computer-based measurement methods can be divided into 2D (image-based) measurement, where images of the person are taken, and 3D (scan-based) measurement, where the person's complete body is scanned to reconstruct an exact model. Compared to an image-based measurement system, scan-based measurement is more accurate, but it is expensive and not portable in actual production [1].
An image-based measurement system usually has four steps: image acquisition, image segmentation, feature point extraction, and parameter measurement. Image acquisition is the primary step, which employs a camera to take photos of the person standing in front of a backboard. Image segmentation extracts the person from the background; common methods exploit edge detection or the HSV color space [2], [3]. Recently, powerful convolutional neural networks (CNNs) have been applied to detect and extract the person from the background. Feature points determine the start and end points of the related anthropometric parameter, and the distances between feature points are used to measure the parameter [4]. Common methods of feature point extraction are

Measurement from images
To obtain accurate anthropometric parameters of the human body in images, we propose a systematic strategy in this section. Before parameter measurement, we reconstruct 3D body models from multi-view segmented images. Anthropometric parameters are then measured by the Euclidean distance and the geodesic distance according to the endpoints pre-selected on the reconstructed 3D body models. Moreover, a convolutional neural network is employed to modify the multi-view measurement results, since the information in a single-view image is insufficient.

Figure 1 shows an overview of the proposed framework: we segment multi-view images I and reconstruct 3D body models M from the segmented images S. Then M and the pre-selected endpoints F are imported into the measurement step to obtain multi-view anthropometric parameters P. Finally, the modification step refines P to obtain more accurate anthropometric parameters P* of the person in the image.

The goal of this section is to obtain multi-view images I in an indoor scene and reconstruct 3D body models M from the segmented images S. To capture multi-view images and control the rotation angles, we design the image acquisition equipment shown in figure 2. The equipment mainly includes a turntable, a mobile phone, and a computer. The turntable rotates by a fixed angle each step; a green backboard is set behind the turntable, and a mobile phone mounted on a tripod is placed at a moderate distance from it. The multi-view images taken by the mobile phone are defined as follows:

I = {i_1, i_2, ..., i_N}   (1)

where I is the set of N view images and i_k is the kth view image.

3D body model reconstruction

Image segmentation and 3D body reconstruction are the two key steps that avoid background interference and reconstruct generative 3D body models. Firstly, the multi-view images are segmented by DeepLab V3+ [12]. DeepLab V3+ combines a spatial pyramid pooling module with an encoder-decoder structure to encode multi-scale contextual information and capture sharper object boundaries. The segmentation process is defined as:

s_k = D(i_k),  S = {s_1, s_2, ..., s_N}   (2)

where s_k is the segmented image of the kth view, D(.) is the DeepLab V3+ network, and S is the set of multi-view segmented images. The other key step is 3D body reconstruction, for which HMR [13] is exploited. HMR is an end-to-end network that can reconstruct a full 3D mesh of a human body from a single RGB image. The reconstruction process is defined as:

m_k = H(s_k),  M = {m_1, m_2, ..., m_N}   (3)

where m_k is the reconstructed 3D body model in the kth view, H(.) is the HMR network, and M is the set of multi-view reconstructed body models.
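The segmentation-and-reconstruction pipeline described above can be sketched as follows. The stubs D and H below are placeholders standing in for the pretrained DeepLab V3+ and HMR networks, and the image dictionaries are illustrative assumptions, not the real data format.

```python
def D(image):
    """Stub for DeepLab V3+: returns a 'segmented' image (placeholder)."""
    return {"view": image["view"], "segmented": True}

def H(segmented):
    """Stub for HMR: returns a 'reconstructed' 3D body model (placeholder).
    The real HMR output is an SMPL mesh with 6890 ordered vertices."""
    return {"view": segmented["view"], "num_vertices": 6890}

N = 24                                   # number of views, one per turntable step
I = [{"view": k} for k in range(N)]      # multi-view images, eq. (1)
S = [D(i_k) for i_k in I]                # segmented images, eq. (2)
M = [H(s_k) for s_k in S]                # reconstructed 3D models, eq. (3)
```

Because every view is processed independently, the per-view loop is trivially parallelizable across the N images.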

Anthropometric parameters measurement
The measurement step reads in the 3D body models of the multi-view images and the pre-selected endpoints, and outputs multi-view anthropometric parameters. The step includes endpoint selection and anthropometric parameter computation. The pre-selected endpoints are imported into the measurement process, and all anthropometric parameters measured in our work are defined according to [14]. We measure a total of 17 representative anthropometric parameters in the experiment, which can meet the needs of general clothing manufacturing, and they can be customized by the user if necessary. The definitions and names of all anthropometric parameters are shown in figure 3(a) and table 1. All endpoints are selected in advance, since the vertices of the reconstructed 3D body model are ordered. To determine the positions of the endpoints, we design a graphical interface that shows all vertices on the reconstructed 3D body model, as shown in figure 4(a). The endpoints are defined as follows:

F = {f_1, f_2, ..., f_V}

where V is the total number of selected endpoints (V = 32), and the position and index of each endpoint f_j are shown in figure 3(b).
Each reconstructed model is a set of ordered vertices m_k = {v_1, v_2, ..., v_n}, where n is the number of vertices in m_k and n = 6890 in the SMPL model. For a linear length l_p^k, we employ the Euclidean distance to compute the length between its endpoints. Compared to the linear length l_p^k, the measurement of a circumference length c_p^k is more complicated, and we employ the geodesic distance to compute all circumference lengths. The geodesic distance is a common way to obtain the shortest path between points on the surface of a 3D model; in this paper we exploit the Subdivision Dijkstra algorithm, which improves on the Dijkstra algorithm. The Subdivision Dijkstra algorithm mainly includes two steps: vertex insertion and shortest-path searching.
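As a minimal sketch of the two distance computations, the snippet below measures a linear length as the straight-line distance between two endpoints and approximates a geodesic as a Dijkstra shortest path over the mesh edge graph. The mesh representation (vertex list plus index-pair edges) is an assumption for illustration; the paper's Subdivision Dijkstra additionally subdivides edges first.

```python
import heapq
import math

def euclidean(p, q):
    """Straight-line distance between two 3D points (linear lengths l_p^k)."""
    return math.dist(p, q)

def dijkstra_geodesic(vertices, edges, start, goal):
    """Approximate geodesic distance on a mesh: Dijkstra over the edge graph.
    `vertices` is a list of 3D points, `edges` a list of (i, j) index pairs."""
    adj = {i: [] for i in range(len(vertices))}
    for i, j in edges:
        w = euclidean(vertices[i], vertices[j])
        adj[i].append((j, w))
        adj[j].append((i, w))
    dist = {start: 0.0}
    heap = [(0.0, start)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == goal:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # goal unreachable
```

Restricting the path to existing mesh edges overestimates the true surface geodesic, which is precisely why the vertex-insertion step below refines the graph before searching.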
The first step of circumference length measurement is vertex insertion, which inserts λ equally spaced vertices along each edge of the edge set E_k: for an edge (v_a, v_b), the inserted vertices are

v'_i = v_a + (i / (λ + 1)) (v_b - v_a),  i = 1, 2, ..., λ.

Anthropometric parameter modification
Even though the anthropometric parameters of each single-view image have been obtained above, we find that the information in a single-view image is insufficient and the measurement results change regularly with the view. Therefore, we design a modification step that combines the measurement results P of the multi-view images to get a more reliable result P* for the human body. We propose a convolutional neural network with several convolutional layers to extract feature maps from the multi-view measurement results P.
The network consists of six convolutional layers with 256 channels each, each followed by a batch normalization layer and a ReLU layer, and a fully-connected layer with 256 neurons is connected behind all convolutional layers, as shown in figure 5. Moreover, to reduce the number of parameters and alleviate over-fitting during training, the outputs of the second, fourth, and sixth convolutional layers are max pooled. The process of each layer is defined as:

x_{l+1} = ReLU(BN(W_l * x_l))

where W_l denotes the kernels of the lth convolutional layer, * is the convolution operation, and BN(.) is batch normalization.

Experiment setup

1) Datasets:
We collect data from 15 people who are representative in height and shape, including 4 females and 11 males, with heights ranging from 161 cm to 187 cm. Moreover, we set N = 24 in the experiment, which means the turntable rotates 15 degrees at a time and a total of 24 images are taken per turn. The person, wearing tights, should stand still and keep the arms slightly away from the torso, which helps determine the circumference lengths. The ratio between the training dataset and the validation dataset is 4:1.
2) Parameter Settings: To balance the accuracy and the cost of computation in the measurement process, we set λ = 3 in formula (8). The learning rate of the CNN is 0.001 and the optimizer is RMSProp [15]. The network is trained for 2000 epochs and validated every 10 epochs; the batch size is 5 in both training and validation. The stride of the max pooling function is 2, which means the output size of each pooled convolutional layer drops to half of its input size. Besides, the kernel size of the convolutional layers is 3 and the padding is 1.
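The settings above fully determine how the feature size evolves through the network; the trace below uses the standard convolution and pooling size formulas. The starting size of 24 (one value per view) is an assumption for illustration, not stated in the paper.

```python
def conv_out(size, kernel=3, padding=1, stride=1):
    """Spatial output size of a convolution (standard formula)."""
    return (size + 2 * padding - kernel) // stride + 1

def pool_out(size, kernel=2, stride=2):
    """Spatial output size of max pooling."""
    return (size - kernel) // stride + 1

# Trace the feature size through six conv layers (kernel 3, padding 1),
# with max pooling after the 2nd, 4th, and 6th layers, as described above.
size = 24
for layer in range(1, 7):
    size = conv_out(size)      # kernel 3 with padding 1 preserves the size
    if layer % 2 == 0:
        size = pool_out(size)  # each pool halves the size
```

With kernel 3 and padding 1 every convolution is size-preserving, so only the three pooling layers shrink the feature map: 24 → 12 → 6 → 3.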

Method of evaluation
The evaluation metrics in this paper are the accuracy and the mean difference: the accuracy judges the distribution of the results, and the mean difference evaluates the deviation between the results and the true values measured manually. The accuracy is defined as follows:

Accuracy = |{i : |y_i - t_i| <= α}| / n

where α is a threshold that determines the scope of the ground truth, y_i is the measured value, t_i is the true value, and n is the number of measurements. Moreover, the mean absolute difference (MAD) and the mean relative difference (MRD) are

MAD = (1/n) Σ |y_i - t_i|,   MRD = (1/n) Σ |y_i - t_i| / t_i.

Discussion and result

1) Performance on Test Samples: To demonstrate the overall performance of our method, we choose three samples from the validation dataset and compare the measurement results with the values measured manually. The results are shown in table 2, where t_i is the true anthropometric parameter of the ith person and y_i is the output of our method. The experimental results show that the proposed method has higher accuracy in height, head length, and knee height, for which y_i and t_i differ by less than 1 cm. We find that the MAD of head length and knee height is the smallest in both the training and test datasets, and their MRD is also acceptable. To sum up, the average MAD is 0.996 cm on the training dataset and 1.94 cm on the test dataset, which is very small for an image-based measurement system. The proposed method can measure spatial anthropometric parameters from images without any image calibration, and the results are comparable with other methods. The overall performance of our method is acceptable according to the experimental results, even though some parameters, such as the chest circumference, are not ideal. There are two reasons why the average MAD is about 2 cm and most measurement results fall within 3 cm of the ground truth in our work.
(1) The ground truth of the anthropometric parameters in our work is measured manually, which introduces artificial error. (2) The training dataset of the modification network is insufficient.
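A minimal sketch of these evaluation metrics, assuming the accuracy counts results within a threshold α of the manually measured ground truth:

```python
def accuracy(y, t, alpha):
    """Fraction of measurements within `alpha` of the manually measured truth."""
    hits = sum(1 for yi, ti in zip(y, t) if abs(yi - ti) <= alpha)
    return hits / len(y)

def mad(y, t):
    """Mean absolute difference between measured and true values."""
    return sum(abs(yi - ti) for yi, ti in zip(y, t)) / len(y)

def mrd(y, t):
    """Mean relative difference between measured and true values."""
    return sum(abs(yi - ti) / ti for yi, ti in zip(y, t)) / len(y)
```

Reporting accuracy at several thresholds (e.g. α = 1, 2, 3 cm) exposes the distribution of errors that a single MAD value hides.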

Conclusions
We have proposed an anthropometric parameter measurement method in which any customized parameter can be measured online from endpoints pre-selected on 3D body models reconstructed from equivariant multi-view images. The method has two key components: the measurement and the modification of anthropometric parameters. In the measurement step, we pre-select the endpoints associated with all anthropometric parameters before measuring. Moreover, we design a convolutional neural network in the modification step to regress a more precise result from the measurement results of the multi-view images, since the information in a single-view image is insufficient. Experimental results demonstrate that the proposed method can measure spatial parameters from multi-view images with comparable performance.