Comparison of video-based algorithms for 2D human kinematics estimation: a preliminary study

Many research efforts have been devoted to developing robust video-based algorithms for human pose estimation. Our goal was to compare video-based pose estimation algorithms for gait analysis. We conducted an experiment with a healthy subject performing walking sessions on a treadmill at three different speeds: slow (3.6 km/h), medium (5 km/h), and high (7 km/h). An RGB 4k camera was placed laterally on the sagittal plane. Four algorithms were compared: (i) colour threshold filtering with blob analysis, and three deep learning-based markerless algorithms, (ii) TC-Former, (iii) FastPose and (iv) Blazepose. For the colour threshold filtering with blob analysis, six magenta passive markers were placed over the joint centres of the subject's lower limb. All selected deep learning-based markerless algorithms are supported by open-source pose estimation toolboxes and are pre-trained on whole-body keypoint datasets. The 2D trajectories of the joint centres were compared in terms of root mean square error and Pearson's correlation coefficient. Preliminary results showed high correlations between marker-based and markerless algorithms for all walking speeds. TC-Former generally performed best, with root mean square errors on trajectories below 35 mm, and did not suffer from self-occlusion issues.


Introduction
In the field of computer vision, many research efforts have been made to develop robust and cost-effective video-based algorithms for human pose estimation (HPE) [1]. Human pose estimation is an important tool for gait analysis and for assessing a person's health status. The evaluation of kinematic gait parameters can be used not only for patient diagnosis but also as an indicator in various fields such as sports [2], rehabilitation [3] and biometric recognition [4].
Over the years, many systems and algorithms have been used for gait analysis [5]. Infrared marker-based motion capture (MoCap) systems are still considered the gold standard for human pose estimation and gait analysis, but they have some limitations. They are expensive, can alter the naturalness of human movement, and require trained operators and a confined environment [6]. Wearable inertial sensors, consisting of accelerometers, gyroscopes and magnetometers, provide a low-cost, non-invasive alternative for performing gait analysis that is suitable for both indoor and outdoor evaluations [7]. These sensors have been shown to be sufficiently reliable for measuring lower limb kinematics [8]. The limitations of wearable systems concern sensor positioning on the human body and the number of sensors required for a complete kinematic evaluation [9,10]. The clinical potential of a video-based marker system was explored in 2013 by Ugbloe et al. [11], who developed an augmented video system (AVPS) based on multi-bull's-eye marker tracking through blob analysis. Using a single camera positioned laterally in the sagittal plane and multi-bull's-eye markers placed on the human body, they were able to measure kinematic variables in accordance with those gathered by a gold standard. In 2019, Zult et al. [12] validated a similar low-cost 2D motion capture system on subjects with central vision loss.
This study too demonstrated the potential of such systems, reporting high test-retest reliability. In 2018, Prakash et al. [13] compared the marker-based method with a different markerless approach based on image segmentation. The comparison confirmed the efficiency of the proposed techniques for identifying joint trajectories and the reliability of the blob-analysis method as a cost-effective reference for comparing new algorithms.
In recent years, several researchers have adopted deep learning algorithms to improve markerless human pose estimation. In the majority of cases, researchers have focused on developing novel neural architectures and new training datasets [14]. The first whole-body keypoint dataset was the OpenPose dataset [15], which in 2019 combined the COCO dataset [16] with a new foot dataset of 15,000 annotations. Subsequently, in 2020, the COCO dataset was re-annotated (250,000 images), extending the annotations from 17 to 133 keypoints [17]. In 2022, Fang et al. [18] created a new dataset, Halpe Full-Body, with 50,000 images annotated with 136 keypoints. Training neural networks on these new datasets enables the creation of full-body pose estimators that can be more reliable in real-world applications. Several studies have been conducted to validate these markerless algorithms. In 2020, Moro et al. [19] used DeepLabCut [20] as a deep learning pipeline and evaluated joint centre trajectories on chronic stroke survivors during walking, reporting a maximum error of 20 mm. In 2021, Mroz et al. [21] compared Blazepose [22] to OpenPose and found the former to be often less accurate, due to self-occlusion issues. A comparison between four human pose estimators was carried out by Docekal et al. [23] in 2022 in the context of hand and torso gestures in close-proximity view. Another comparison was performed by Mundt et al. [24], who evaluated ground reaction forces with AlphaPose, OpenPose, and Blazepose.
Although several studies in the literature have investigated the performance of vision systems, a comparative analysis of markerless systems based on deep learning algorithms for the assessment of 2D lower limb kinematics is still lacking. In this study, a comparison between three markerless algorithms for 2D whole-body human pose estimation (TC-Former, FastPose and Blazepose) was performed. This study aims to identify the most accurate and suitable algorithms for the evaluation of lower limb parameters using a single camera.

Participant
In accordance with the Declaration of Helsinki, one healthy adult subject was involved in this study. The subject was a 25-year-old male, 78 kg in weight and 176 cm in height, who gave informed consent to participate in the experiment.

Experiment design and equipment
The kinematics of the subject was recorded by a Logitech BRIO 4k Stream Edition camera at a resolution of 1920 x 1080 px, as reported in Figure 1a. The frame rate of the camera was set to 60 Hz. Recordings were made using OBS Studio, with autofocus disabled. The subject performed four trials of one-minute walking on a treadmill at three different speeds: a slow speed of 3.6 km/h, a medium speed of 5 km/h, and a fast speed of 7 km/h. Given the size of the treadmill (2400 x 800 mm) and the scale of the experiment, the point s0 (600 x 400 mm) was set as the reference point, as reported in Figure 1b. The camera was positioned laterally at 1500 mm from s0, pointing at the sagittal plane of the subject. All data were post-processed using Matlab v.2022b. The Matlab app Single Camera Calibrator (Computer Vision Toolbox 10.3) and the Matlab app Color Thresholder (Image Processing Toolbox 11.6) were used for camera calibration and for the colour mask filter of the blob-analysis method, respectively. For running the deep learning algorithms, we used Python 3.9.13 with a stable version of PyTorch (torch v.1.12.1+cu113, Torchvision v.0.13.1+cu113), CUDA v.11.3, MMPose 0.29, Mediapipe 0.90 and Alphapose 0.6.0.

Figure 1. a) Anatomical reference points P1 (hip), P2 (knee), P3 (ankle), P4 (heel), P5 (5th metatarsus), and P6 (1st metatarsus); b) camera and treadmill position.
The selected markerless algorithms are deep neural networks that process an image or video and return the coordinates of the anatomical reference points in pixels, together with a visibility score between 0 and 1. This score can be used as a threshold to evaluate the self-occlusion problem and discard poorly visible points; a minimal filtering sketch is given after the dataset list below. In this preliminary analysis, we selected the anatomical reference points that identify the most visible lower limb joints on the sagittal plane useful for gait analysis, namely the pelvis, right knee, right ankle, heel, first metatarsal and fifth metatarsal of the right foot (Figure 1a). Each markerless algorithm was trained on a different dataset, annotated by different human annotators. In this comparison, we used:
• The COCO Whole-Body dataset [17] is composed of 250k images; training was performed on 8 Tesla V100 GPUs (32 GB) with Adam as the optimiser, defining the initial learning rate and decay coefficient, with a batch size of 32 per GPU for 120 epochs. The 133 annotations that make up the human model, consisting of 17 keypoints for the body, 6 for the feet, 68 for the face and 42 for the hands, are organised in 4 bounding boxes.
• The Halpe Full-Body dataset [18] is composed of 50k images taken from the HICO-DET dataset [25]. Training was performed on 8 Nvidia 2080Ti GPUs with Adam as the optimiser, defining the initial learning rate and decay coefficient, with a batch size of 32 per GPU for 270 epochs. For every person, 136 keypoints were annotated: 20 for the body, 6 for the feet, 42 for the hands and 68 for the face. For the latter, a different approach based on 3D Dense Face Alignment [26] was used.
• The Blazepose dataset [22] is composed of 60k images with one or a few people in common poses, plus 25k images of a single person performing fitness exercises. The authors proposed a pose estimation approach that uses 33 points on the human body, with only 4 for the feet (heel and 1st metatarsus). They restricted the dataset to cases where the whole person was visible or where the hip and shoulder points could be accurately identified, and used extensive augmentation techniques to simulate strong occlusions that may not be present in the dataset.
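As an illustration of the filtering step mentioned above, the following minimal Python sketch discards poorly visible points before further processing; the keypoint array layout and the 0.5 threshold are our own illustrative assumptions, not values prescribed by any of the toolboxes.

```python
import numpy as np

# Hypothetical output of a pose estimator for one frame:
# each row is (x_px, y_px, visibility), with visibility in [0, 1].
keypoints = np.array([
    [812.4, 431.7, 0.98],   # hip
    [804.1, 610.2, 0.95],   # knee
    [798.6, 792.9, 0.41],   # ankle, partially occluded
])

VISIBILITY_THRESHOLD = 0.5  # assumed cut-off, tuned per application

def filter_visible(kpts, thr=VISIBILITY_THRESHOLD):
    """Return keypoint coordinates whose visibility exceeds thr,
    marking the others as NaN so they can be ignored or interpolated later."""
    out = kpts[:, :2].copy()
    out[kpts[:, 2] < thr] = np.nan
    return out

print(filter_visible(keypoints))
```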

Video-based algorithms
In this study, three markerless video-based algorithms were evaluated in comparison with a blob-analysis method.An overview of the three markerless methods and the blob analysis method is given in the following subsections.

Blob-analysis method
We used an algorithm based on a colour threshold filter in the HSV colour space, followed by blob analysis (BA); the main steps are shown in Figure 2. With the help of a physiotherapist, six reflective magenta markers were placed on the subject to identify the anatomical reference points of the right lower limb: hip (P1), right knee (P2), right ankle (P3), right heel (P4), 1st metatarsus (P6) and 5th metatarsus (P5). A mask filter was then applied to select only a specific colour range, namely different shades of magenta. The result was an image consisting of 6 areas on a black background, as reported in Figure 2b. Blob analysis was applied and the corresponding centroids were determined. For the comparison, the offset between the keypoints of the markerless algorithms and the centroids of the blob analysis was removed. Unlike AprilTag markers [27], where each tag's code uniquely identifies the marker, the passive markers required a numbering routine: once the centroids were obtained, the anatomical positioning of the markers was used to assign each centroid to its respective joint centre.
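The sketch below illustrates this kind of processing with OpenCV; the HSV bounds for magenta and the minimum blob area are illustrative assumptions, since the actual mask was tuned with the Matlab Color Thresholder app.

```python
import cv2
import numpy as np

def magenta_centroids(frame_bgr, min_area=30):
    """Colour-threshold filter in HSV followed by blob analysis:
    returns the (x, y) pixel centroids of the detected magenta markers."""
    hsv = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2HSV)
    # Assumed HSV range covering different shades of magenta (OpenCV hue is 0-179).
    lower = np.array([140, 80, 80])
    upper = np.array([170, 255, 255])
    mask = cv2.inRange(hsv, lower, upper)
    # Connected-component (blob) analysis on the binary mask.
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    # Label 0 is the background; keep blobs above a minimum area.
    return [tuple(centroids[i]) for i in range(1, n)
            if stats[i, cv2.CC_STAT_AREA] >= min_area]
```

The returned centroids can then be assigned to P1-P6 from their relative anatomical positions, as described above.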

Markerless methods
The selected markerless algorithms follow the so-called "top-down" approach (Figure 3) and consist of two different neural networks: (1) the first serves as a detector, responsible for identifying the person and enclosing it in an anchor box; (2) the second acts as a pose estimator and, starting from the anchor box, identifies the keypoints in pixels and estimates the human pose. A minimal sketch of this two-stage pipeline is given below.
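The following sketch illustrates the two-stage logic; the detector comes from torchvision purely for illustration, and the pose estimator is a placeholder standing in for any of the networks described in the next subsections.

```python
import torch
import torchvision
from torchvision.transforms.functional import to_tensor

# Stage 1: person detector (a torchvision Faster R-CNN, used here only for illustration).
detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()

def detect_person(image):
    """Return the first confident 'person' bounding box (COCO class 1) as (x1, y1, x2, y2).
    `image` is a PIL image or HxWx3 array."""
    with torch.no_grad():
        pred = detector([to_tensor(image)])[0]
    for box, label, score in zip(pred["boxes"], pred["labels"], pred["scores"]):
        if label.item() == 1 and score.item() > 0.8:
            return box.tolist()
    return None

def estimate_pose(image, box):
    """Placeholder for stage 2: a pose estimator (TC-Former, FastPose, Blazepose, ...)
    that receives the anchor box and returns keypoints in pixel coordinates."""
    raise NotImplementedError("plug in one of the pose estimators described below")
```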

Figure 3. A visual representation of the "top-down" approach
The architecture can be traced back to an encoder-decoder structure [28], with the addition of skip connections [29], as reported in Figure 4. Skip connections were introduced to skip some layers of the residual neural network and provide an alternative path for gradient flow during error backpropagation. The underlying idea is to speed up the feature extraction process by reducing the size of the input image through downsampling (encoder) in the convolutional layers, followed by upsampling (decoder) in the deconvolutional layers. However, this process can lead to a loss of important spatial information contained in the first layers. For this reason, skip connections directly connect the encoder layers, which hold high-quality spatial information, to the decoder layers; a minimal sketch of this hourglass structure is given below. In this study, we selected a detector and a pose estimator from different open-source libraries, each offering different pre-trained neural networks for detection and pose estimation. The main difference between the chosen pose estimators lies in pose decoding, which is the real challenge in human pose estimation.
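A minimal PyTorch sketch of this hourglass idea follows; layer counts and channel sizes are purely illustrative and do not correspond to any of the compared networks.

```python
import torch
import torch.nn as nn

class TinyHourglass(nn.Module):
    """Encoder-decoder with a skip connection: the encoder downsamples the image,
    the decoder upsamples it back, and the skip connection re-injects the
    high-resolution features lost during downsampling."""
    def __init__(self, channels=16, n_keypoints=6):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1), nn.ReLU())
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(channels, channels, kernel_size=2, stride=2), nn.ReLU())
        self.skip = nn.Conv2d(3, channels, kernel_size=1)            # high-resolution shortcut
        self.head = nn.Conv2d(channels, n_keypoints, kernel_size=1)  # one heatmap per keypoint

    def forward(self, x):
        encoded = self.encoder(x)                   # downsampling (encoder)
        decoded = self.decoder(encoded)             # upsampling (decoder)
        return self.head(decoded + self.skip(x))    # skip connection restores spatial detail

heatmaps = TinyHourglass()(torch.randn(1, 3, 256, 256))  # -> shape (1, 6, 256, 256)
```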
MMPose
The OpenMMLab project aims to provide efficient open-source computer vision tools. For pose estimation, OpenMMLab offers MMPose (MMP), a PyTorch-based open-source toolbox. From its database of pre-trained whole-body pose estimators, we selected a novel vision transformer, TC-Former [30]. TC-Former is a Token Clustering Transformer that divides the compressed image into a grid of tokens, from which the pose decoder sets the keypoints. Instead of using deconvolution layers for upsampling, an attention transformer is used that can merge tokens that are not relevant to the task, such as the background, while maintaining high resolution for tokens that contain important information. MMPose provides many pre-trained human detectors and pose estimators. We chose the Faster R-CNN model [31], with a ResNet-50-FPN backbone, implemented in MMDetection, as the detector and TC-Former as the pose estimator.
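As an illustrative example, a detector plus pose estimator inference with the MMPose 0.x API could look like the sketch below; the config and checkpoint file names are placeholders, not the exact files used in this study.

```python
from mmdet.apis import init_detector, inference_detector
from mmpose.apis import (init_pose_model, inference_top_down_pose_model,
                         process_mmdet_results)

# Placeholders: substitute the actual MMDetection/MMPose config and checkpoint files.
det_model = init_detector('faster_rcnn_r50_fpn_coco.py',
                          'faster_rcnn_r50_fpn_coco.pth', device='cuda:0')
pose_model = init_pose_model('tcformer_coco_wholebody.py',
                             'tcformer_coco_wholebody.pth', device='cuda:0')

img = 'frame_0001.png'
# Stage 1: detect people and keep the person bounding boxes (category id 1).
mmdet_results = inference_detector(det_model, img)
person_results = process_mmdet_results(mmdet_results, cat_id=1)
# Stage 2: top-down pose estimation inside each detected box.
pose_results, _ = inference_top_down_pose_model(
    pose_model, img, person_results, bbox_thr=0.3, format='xyxy',
    dataset='TopDownCocoWholeBodyDataset')
# Each entry of pose_results contains 'keypoints': rows of (x, y, score).
```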

Alphapose
AlphaPose (AP) [18] is an open-source toolbox for pose estimation. Its authors provide a new full-body dataset, Halpe Full-Body, and propose a new pose estimator, FastPose (FP), which consists of 5 layers. The first is a ResNet used as the backbone, acting as an encoder and feature extractor. Next, three DUC (dense upsampling convolution) layers [32] provide enhanced upscaling. A 1x1 convolutional layer is then used to turn the output into heatmaps; a minimal sketch of this layout is given below. AlphaPose, like MMPose, offers various human detectors and pose estimators. We chose YOLO-X [33] as the detector and FastPose as the pose estimator.
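The sketch below is a minimal PyTorch rendition of this layout (ResNet backbone, three DUC stages and a final 1x1 convolution producing heatmaps); the channel sizes are illustrative and do not reproduce the exact AlphaPose implementation.

```python
import torch
import torch.nn as nn
import torchvision

class DUC(nn.Module):
    """Dense upsampling convolution: a convolution followed by pixel shuffle,
    which rearranges channels into a 2x higher-resolution feature map."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch * 4, kernel_size=3, padding=1)
        self.shuffle = nn.PixelShuffle(2)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.shuffle(self.conv(x)))

class FastPoseLike(nn.Module):
    """Backbone (encoder) + 3 DUC layers (upscaling) + 1x1 conv to heatmaps."""
    def __init__(self, n_keypoints=136):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])   # feature extractor
        self.duc = nn.Sequential(DUC(2048, 1024), DUC(1024, 512), DUC(512, 256))
        self.head = nn.Conv2d(256, n_keypoints, kernel_size=1)         # heatmap per keypoint

    def forward(self, x):
        return self.head(self.duc(self.backbone(x)))

heatmaps = FastPoseLike()(torch.randn(1, 3, 256, 192))  # -> shape (1, 136, 64, 48)
```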

Mediapipe
With MediaPipe (MP) [34], Google proposes Blazepose (BZ), a hybrid between top-down and bottom-up approaches, as its pose estimation solution. BZ follows a top-down logic (detector/tracker) for the first frame, while it follows a bottom-up logic for the subsequent frames, where only the tracker predicts the keypoints. The biggest advantages of Blazepose are its light weight and execution speed: it can run at a high frame rate on almost any device or hardware, including smartphones and CPUs. Blazepose offers three pose estimator models: Blazepose Lite, Blazepose Full and Blazepose Heavy. We chose the heaviest and most accurate one, Blazepose Heavy.
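Running Blazepose Heavy through the MediaPipe Python API can be sketched as follows; the frame source is assumed.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

# model_complexity=2 selects the Blazepose Heavy variant.
with mp_pose.Pose(model_complexity=2, static_image_mode=False) as pose:
    frame = cv2.imread('frame_0001.png')                        # assumed frame source
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        h, w = frame.shape[:2]
        for lm in results.pose_landmarks.landmark:
            # Landmarks are normalised; convert to pixels and keep the visibility score.
            print(lm.x * w, lm.y * h, lm.visibility)
```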

Camera Calibration
Camera calibration was performed using a checkerboard with 8 rows and 11 columns and the Matlab Single Camera Calibrator toolbox [35]. Each square had a side of 20 mm. A set of photos was taken with the checkerboard positioned on line A (the outermost side of the tape) and on line B through reference point s0, as reported in Figures 5a and 5b. The mean reprojection errors were less than 0.15 px on both planes (Figure 5c). A corner detection algorithm was applied to determine the pixel coordinates of all corner points of the checkerboard squares. It was therefore possible to determine the pixel-to-mm conversion on plane A and plane B by relating the pixel distance between the detected corner points to the known distance of 20 mm. Assuming that the measured movement of the subject's lower limbs occurred on a sagittal plane between planes A and B, the average was taken between the values obtained on plane A and those obtained on plane B, giving a mean conversion factor of 0.8 mm/px.
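The pixel-to-mm conversion described above can be sketched as follows, using OpenCV corner detection in place of the Matlab toolbox; the interior-corner grid size and the image file names are assumptions.

```python
import cv2
import numpy as np

SQUARE_MM = 20.0     # side of each checkerboard square
PATTERN = (10, 7)    # assumed interior-corner grid (columns, rows)

def mm_per_pixel(image_path, pattern=PATTERN, square_mm=SQUARE_MM):
    """Estimate the mm/px scale on one calibration plane from checkerboard corners."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if not found:
        raise RuntimeError('checkerboard not detected')
    corners = corners.reshape(pattern[1], pattern[0], 2)
    # Mean pixel distance between adjacent corners along each row.
    px_dist = np.linalg.norm(np.diff(corners, axis=1), axis=2).mean()
    return square_mm / px_dist

# Average the scale obtained on plane A and plane B (the study reports about 0.8 mm/px).
scale = 0.5 * (mm_per_pixel('plane_A.png') + mm_per_pixel('plane_B.png'))
```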

Evaluation parameters
Evaluations were performed offline. All tests were performed at the same brightness level, and the algorithms were evaluated on the same hardware, an Intel i7-10750H CPU. The 2D trajectories of the joint centres of the markerless algorithms were compared with the 2D trajectories of the joint centres of the blob-analysis algorithm, computing Pearson's correlation coefficient and the root mean square error (RMSE). Pearson's coefficient values were interpreted as follows: 0.0-0.2 negligible correlation; 0.2-0.4 weak correlation; 0.4-0.7 moderate correlation; 0.7-1.0 strong correlation [36].
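The two metrics can be computed as in the sketch below, assuming the blob-analysis and markerless trajectories have already been aligned, resampled to the same frames and converted to millimetres.

```python
import numpy as np
from scipy.stats import pearsonr

def rmse(reference, estimate):
    """Root mean square error between two 1-D trajectories of equal length (in mm)."""
    reference, estimate = np.asarray(reference), np.asarray(estimate)
    return np.sqrt(np.mean((reference - estimate) ** 2))

def compare_trajectories(blob_xy, markerless_xy):
    """Per-axis RMSE and Pearson's r for one joint centre; inputs are (N, 2) arrays."""
    metrics = {}
    for axis, name in enumerate(('x', 'y')):
        r, p = pearsonr(blob_xy[:, axis], markerless_xy[:, axis])
        metrics[name] = {'rmse_mm': rmse(blob_xy[:, axis], markerless_xy[:, axis]),
                         'r': r, 'p': p}
    return metrics
```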

Results and Discussion
Table 2 shows the RMSE results for TCF-MMP, FP-AP and BZ-MP for the six lower limb joint centre keypoints. The highest mean RMSE values were found for Blazepose in both directions: in the x-direction at P6 (133.8 ± 76.7 mm) and in the y-direction at P4 (40.6 ± 34.3 mm). The lowest RMSE values were reported for TCF-MMP in the x-direction at P4 (4.5 ± 7.0 mm) and for FP-AP at P2 in the y-direction (3.9 ± 2.7 mm). FP-AP and TCF-MMP performed significantly better than BZ-MP for all trajectories, and their results are comparable for P1, P2 and P3. In general, better results were obtained for P1, P2 and P3 than for P4, P5 and P6, especially for FP-AP and BZ-MP, whereas TCF-MMP recorded similar results for all trajectories. Even though the blurring effect was not consistent, identifying the keypoints of the right foot was still challenging for FP-AP and BZ-MP. However, the trajectories generated by FP-AP almost matched those of BA. As can be seen in Figure 7, some artefacts on the x-trajectories increased the RMSEx and its standard deviation for FP-AP. These spikes reflect the self-occlusion problem and were caused by the overlap of the two legs during walking. Figures 8 and 9 show the x and y directions for the 6 keypoints. TCF-MMP and FP-AP reported a very strong degree of correlation (r > 0.8 in both x and y) for all trajectories except the y-trajectory of P6. This occurred because the passive marker was not positioned sagittally in relation to the centre of the first metatarsal joint but above it; even after removing the offset between the compared human models, the y-trajectory remained somewhat distorted in this setup. BZ-MP had a weak to moderate degree of correlation for the y-trajectories (0.18-0.6) and a moderate to strong degree of correlation for the x-trajectories (0.6-0.8).
The results obtained with TCF-MMP were very promising: it had the lowest RMSE and the highest Pearson's correlation coefficients, and it did not suffer from the self-occlusion problem. Moro et al. [19] estimated the mean Euclidean distance between positions obtained with a marker-based system and their markerless approach, finding a mean error of 10 mm for the hip, right knee and left knee, and 15 mm for the ankle and 5th metatarsus. Our results reported lower values of mean error for TCF.
For the comparison with a MoCap system, Docekal et al. [23] reported similar results for AlphaPose and MMPose. The results we obtained were in line with the literature: BZ-MP performed worse [21,24], while FP-AP and TCF-MMP were similar in the evaluation of the whole-body human pose. However, our results showed that FP-AP, trained on Halpe Full-Body, suffered from self-occlusion during the midstance phase. With increasing speed, Pearson's coefficient deteriorated slightly, due to the greater blurring of the image. Similar results were obtained by D'Antonio et al. [37], who recorded higher mean absolute errors for knee and pelvic ROM values as the test speed increased.

Conclusion
In this preliminary study, a comparison of algorithms based on vision systems for estimating human kinematics was carried out for a walking task. Among the compared markerless algorithms, the MMPose pipeline, with Faster R-CNN (implemented in MMDetection) as the detector and TC-Former as the pose estimator, showed the best performance. However, it should be noted that one of the limitations of this preliminary study is the use of the blob-analysis algorithm as a reference, which is not comparable to an optoelectronic system in terms of accuracy. Another limitation is the small sample involved in the study.

Figure 2.
Figure 2. Blob analysis method: a) 6 magenta markers placed on the right lower limb; b) applied colour threshold filter based on HSV colour space; c) ordered centroids.

Figure 4.
Figure 4. Encoder-decoder (hourglass) structure.

Figure 5.
Figure 5. a) Camera session on plane B; b) calibration setup; c) reprojection error for the camera session on plane B.

Figure 6.
Figure 6. Self-occlusion problem due to the overlap of the right and left legs during walking.

Figure 7.
Figure 7. The horizontal component of the trajectories computed during the task at medium speed by means of the blob-analysis and deep learning algorithms.

Figure 8.
Figure 8. The vertical component of the trajectories computed during the task at medium speed by means of the blob-analysis and deep learning algorithms.

Pearson's correlation coefficients are shown in Table 3. The p-values were below 0.01 for all keypoints and all algorithms.

Table 1.
Selected pre-trained detector and pose estimator for each Open-Source Toolbox

Table 2.
Results of the root mean square errors of TCF-MMP, FP-AP and BZ-MP for all 6 trajectories in the x and y directions at slow, medium and high speed.

Table 3.
Pearson's correlation coefficients of TCF-MMP, FP-AP and BZ-MP for all 6 trajectories in the x and y directions at different speeds.