Enhancing person re-identification on RGB-D data with noise-free pose-regularized color and skeleton distance features

Noisy data may introduce irrelevant or incorrect features that lead to misclassifications and lower accuracy. This is especially problematic in tasks such as person re-identification (ReID), where subtle differences between individuals must be accurately captured and distinguished. However, existing ReID methods directly use noisy and limited multimodality features for similarity measures. It is therefore crucial to use robust features and pre-processing techniques to reduce the effects of noise and ensure accurate classification. As a solution, we employ a Gaussian filter in the pre-processing stage to remove Gaussian noise from the RGB-D data. For the similarity measure, color descriptors are computed from the top eight peaks of a 2D histogram constructed over pose-regularized partition grid cells, and eleven different skeleton distances are considered. The proposed method is evaluated on the BIWI RGBD-ID dataset, which comprises a still set (front-view images) and a walking set (images with varied pose and viewpoint). The obtained recognition rates of 99.15% and 94% on the still and walking sets demonstrate the effectiveness of the proposed approach for the ReID task in the presence of pose and viewpoint variations. Furthermore, the method is evaluated on the RGBD-ID dataset and achieves improved performance over existing techniques.


Introduction
In the recent decade, the person re-identification (ReID) task has gained increasing attention from the research community [1]. Anticipating terrorist attacks, criminal investigation, monitoring, suspect retrieval, and tracking [2] in busy areas such as railway stations and shopping malls necessitate innovative and robust approaches to the ReID task. ReID is a cross-camera image retrieval task that aims to retrieve images of the same person at different physical sites and time intervals by measuring the similarity between query and gallery images [3]. In the past, ReID research predominantly focused on RGB appearance-based methods, but recent years have seen a shift towards other sensor-based approaches, such as those utilizing depth (D) and infrared (IR) cameras. Over the years, numerous ReID techniques have been presented to overcome the limitations of conventional RGB cameras, such as occlusions, low image resolution, and background clutter. However, in extremely poor or dark lighting conditions, RGB cameras may not capture the information required for surveillance. The emergence of technologies such as Microsoft's Kinect sensor sparked a revolution in computer vision. However, depth camera sensors naturally possess inherent noise that affects the resulting RGB-D images. Factors such as the camera's construction and technical specifications, like focal length and lens quality, significantly impact image quality and thereby the noise observed in depth images. Additionally, in active acquisition setups, the quality of the projected IR light, including its intensity and collimation, contributes significantly to image noise. The level of noise in depth images is closely tied to the depth acquisition approach. Stereo and structured light systems calculate depth by establishing correspondences between points across various views and interpolating between them. This interpolation process, however, introduces depth errors. Time-of-Flight (ToF) methods, which rely on phase calculations, are vulnerable to phase ambiguity and demodulation errors, leading to inaccurate depth estimates.
In the context of re-identification, where the goal is to accurately match individuals across different instances, noise poses a significant challenge. Noise reduction is essential in the ReID task because noise can introduce random variations in pixel values, leading to distorted color information. This distortion may cause misalignment between corresponding features in different images of the same person, decreasing their similarity. Furthermore, noisy images tend to exhibit increased variability in color values, making it challenging to establish consistent feature patterns. This variability can lead to higher dissimilarity scores between images of the same person, as the noisy components may be erroneously interpreted as distinguishing features. In some cases, noise may coincidentally align in a way that creates false similarities between images, causing two individuals to be matched as the same person.
Denoising plays a crucial role in improving the quality of the color features extracted for the ReID task, leading to more accurate and reliable assessments of similarity between individuals. Denoising removes random fluctuations and irrelevant variations, resulting in a more precise representation of the underlying color features, so that the extracted features better reflect the true characteristics of the individuals in the images. It also promotes consistency in color information across images, making it easier to establish reliable correspondences between features and contributing to a more stable re-identification that is less influenced by random noise. Conducting the ReID task on denoised data enhances the discriminative power of color features. It also mitigates the risk of false positives by eliminating random noise that might be incorrectly interpreted as significant features, yielding a more robust similarity measure that is less prone to erroneous matching. Overall, noise reduction produces more reliable representations for re-identification, enhancing the system's ability to correctly match and identify subjects in diverse scenarios.
Despite the fact that RGB-D data is noisy, RGB-D cameras are preferred over RGB cameras in surveillance networks due to their cost-effectiveness, smaller size, and high resolution [4]. In addition, an RGB-D camera captures data from different modalities, such as 3D point clouds (three-dimensional groupings of color points), skeleton data, and depth data. The use of such multimodal data for the ReID task adds extra discriminative ability. However, data collected directly from camera devices exhibits Gaussian noise [5]. Since the RGB-D data is taken directly from the camera without pre-processing, it also carries Gaussian noise, and the direct use of such data degrades the overall performance of the ReID task. Thus, denoising plays an essential role before the data is used for the ReID task.
Another major challenge in the ReID task is variation in pose and viewpoint [6]. There is a substantial risk that the ReID system retrieves the wrong person whenever an individual appears with a different pose (e.g. head, leg, or hand movement) or viewpoint (e.g. a person in front, side, or back view). State-of-the-art methods address this challenge using single-modality or multimodality approaches [7,8]. In a single-modality approach, appearance [9], skeleton [10], or depth [11] features are utilized individually. However, single features alone are not always sufficient to handle pose and viewpoint variations. Therefore, many state-of-the-art methods propose multimodality approaches that integrate multiple features to improve ReID accuracy. However, these multimodality features are extracted directly from noisy RGB-D data, and the number of color descriptors and skeletal features considered is insufficient to tackle pose and viewpoint variation in real-time applications. We address these issues by using a Gaussian filter to denoise the RGB-D data, and by considering additional color and skeleton information to improve discriminative ability.
The proposed method is intended for short-term re-identification in indoor scenarios (our focus is on pose and viewpoint variation rather than appearance changes). We assume that people wear the same clothes but appear with varied poses and viewpoints. Color descriptors play a pivotal role in tasks such as content-based image retrieval (CBIR) [12] and person re-identification, offering essential information to differentiate and match images or individuals based on their color attributes. Common color descriptors in these applications include color histograms, color moments, and texture features. The choice of a suitable color descriptor depends on factors such as the specific needs of the application, image complexity, and available computational resources. Moreover, the effectiveness and robustness of the representation can often be enhanced by combining multiple color descriptors or integrating them with other visual descriptors. Thus, color descriptors are particularly helpful when people wear similar clothing patterns, while skeleton distances help when a person appears with a different pose or viewpoint. The main contributions of this work are as follows:
1. We demonstrate that combining color descriptors extracted from the top eight peaks of a 2D histogram with skeleton distances yields a more robust similarity measure than using the color descriptors alone.
2. We show that denoising the RGB-D images with a Gaussian filter is more effective than using a median filter.
3. The effectiveness of the proposed method to handle pose and viewpoint variations is validated by evaluating its performance on both still and walking set images.
The remainder of the paper is structured as follows. Section 2 reviews related work. Section 3 presents the proposed approach for person ReID. Experimental results and discussions are given in Section 4. Finally, conclusions are drawn in Section 5.

Related work
Most current ReID approaches concentrate on matching individuals based on RGB appearance features. However, with the inception of new RGB-D sensors, investigators have exploited various modalities (depth, skeleton, and IR images) to enhance ReID performance. Recent works have proposed combining appearance features with depth and/or skeleton data to mine robust representations. Based on the modalities used, ReID is categorized into three types. In this section, we provide an overview of the different multimodality combinations and their performance on various benchmark datasets.

ReID using RGB and Depth data
Over the past few years, there has been an increasing trend of utilizing RGB-D sensor-based methods to incorporate supplementary information and boost the effectiveness of ReID systems. Some recent studies [13,14] have integrated appearance features with illumination-invariant anthropometric features extracted from processed depth images. Other ReID techniques use both raw RGB and depth data [8,15,16]. Additionally, depth images are sometimes utilized as body segmentation masks, which are useful for removing cluttered backgrounds from scenes, as in the approaches proposed by [11,17]. The majority of ReID methods that rely on RGB-depth images are based on deep learning and employ score-level [18] or feature-level fusion [19] techniques. Ren et al [13] introduced a multimodal uniform deep learning (MMUDL) technique in which appearance and anthropometric representations were extracted from the depth and RGB frames using CNNs. The features were then combined using a uniform latent variable containing depth-specific, RGB-specific, and sharable parts. The effectiveness of their approach was assessed on two benchmark datasets, namely RGBD-ID and KinectREID. Subsequently, Ren et al [14] introduced a novel deep learning approach, referred to as 'uniform and variational deep learning' (UVDL), that integrates both depth and appearance features. They developed a multimodal autoencoder that maps the variables onto a common space and achieved outstanding recognition rates of 99.4% and 76.7% on the KinectREID and RGBD-ID datasets, respectively. Although the method led to a 2.4% enhancement in rank-1 accuracy on KinectREID, it did not yield any improvement on the RGBD-ID dataset. A few methods, such as [8,15], employ CNN techniques to combine global and local representations extracted from RGB and raw-depth frames. Lejbolle et al [15] proposed a multimodal CNN, which fuses features from the RGB and depth modalities to provide improved ReID performance. Similarly, in [8], the authors proposed a multimodal attention framework (MAT) using RGB and depth data. The MAT jointly used a CNN and an attention module to mine local features and fuse them with globally extracted features. The primary objective of these methods was to overcome various challenges, including variation in lighting conditions and occlusion; to achieve this, they relied on overhead RGB-D cameras. The authors in [16] presented a technique for combining RGB and RGBD image-based scores in a heterogeneity space for ReID. Two separate models were trained, one employing RGB images and the other RGBD images. The feature embeddings of each model were used to calculate heterogeneity scores between the query and gallery images, which were then merged in the heterogeneity space to obtain the final accuracy. However, the method has high algorithmic complexity. The aforementioned methods that utilize RGB and depth data have shown impressive performance, but they suffer from certain limitations. For instance, the feature-level fusion strategy employed by these methods may result in overfitting of the deep-learning model when fusing diverse features. Moreover, the Microsoft Kinect's operational range is restricted to 0.8 to 4 meters, which can lead to a decrease in performance due to inaccurate depth frames captured by the depth sensors. This, in turn, affects the ability to train effective models for ReID systems.

ReID using RGB and skeleton data
Geometrical features from depth data, known as anthropometric measures, can provide descriptive information about a person. These measures are based on the joint points of an individual's skeleton. Although RGB appearance features are commonly employed in ReID, they may not be robust in low-light situations; in contrast, color-invariant anthropometric features remain dependable in such conditions. To improve re-identification accuracy, some researchers have recently fused color-invariant anthropometric and appearance contents [9,20,21]. Pala et al [21] proposed a dissimilarity-based framework for ReID that combines anthropometrics estimated from 20 joint points with appearance features. In contrast to the commonly used score-level fusion technique, this framework reduces the similarity-measure time among dissimilarity vectors computed from various modalities by using Multiple Component Dissimilarity (MCD) descriptors. The researchers selected three 'Multiple Part-Multiple Component' (MPMC) descriptors, namely SDALF [22], eBiCov [23], and MCMimpl [24]. The proposed technique was evaluated on the KinectREID and RGBD-ID datasets.
The authors in [20] proposed an online framework for the ReID task that utilizes appearance and skeleton information obtained from RGB-D sensors. This approach addresses the limitations of offline training methods, which cannot adjust to dynamic environments. Instead of relying on offline training, the authors used an online metric model to continually update the facial information. Subsequently, they combined the appearance and skeleton information through a distinctive 'feature funnel model' (FFM). The effectiveness of this approach was demonstrated through experiments on the publicly available BIWI RGBD-ID [25] and IAS-Lab [26] datasets.
In contrast to the aforementioned techniques, Patruno et al [9] introduced a new ReID approach that leverages 'Skeleton Standard Postures' (SSPs) and color descriptors. Their method utilized 3D skeleton data to normalize the posture of each person and employed a partition grid based on the SSPs to categorize the point cloud samples by position. This depiction of the point clouds offered visual details of the individual, including their color appearance. The combination of color and depth information was used to create an effective signature of the person, which improved ReID performance. The efficacy of this technique was assessed on three publicly available datasets, namely BIWI RGBD-ID, KinectREID, and RGBD-ID, and the authors obtained superior results compared to the method presented by Pala et al in their earlier work [21].
In summary, the aforementioned prior-art ReID methods attained a moderate level of accuracy by employing a combination of skeleton and RGB data. The use of skeleton information offers an advantage, as it remains invariant even under challenging conditions such as extreme lighting or changes in clothing.

ReID using depth and skeleton data
To address the challenge of individuals appearing in various poses across a distributed camera network, certain approaches warp the point clouds of a person into a standardized pose [25,26]. In more recent studies, a holistic representation of an individual's body shape has been achieved by combining depth-shape descriptors with features derived from the skeleton [11,27,28].
In [29], the authors split their method into two phases: first, they extracted skeleton-based and surface-based features; then, the combined features were used for person re-identification. Evaluation using the normalized area under the curve (nAUC) showed varying performance (52.8% to 88.1%) on the Collaborative and Walking2 datasets. Subsequent use of the Walking1, Walking2, and Backward groups improved the nAUC scores.
Munaro et al [25] compared skeleton-based features and global body-shape descriptors for one-shot person re-identification. They calculated limb lengths and ratios using the 3D locations of body joint points, transforming point clouds into a standard pose in real time. Experiments on the RGBD-ID and BIWI RGBD-ID datasets, using CMC and nAUC for evaluation, showed that their method with a skeleton descriptor achieved a rank-1 recognition rate of 26.6% and an nAUC of 89.7% for the Still group, and 21.1% and 86.6% for the Walking group. Point-cloud matching outperformed the skeleton-based descriptor, with rank-1 accuracies of 32.5% for the Still group and 22.4% for the Walking group. Their method achieved the best results with NN and Generic-SVM classifiers when training on Walking1 and testing on Walking2, with a rank-1 of 28.6% and an nAUC of 89.9% for NN, and 35.7% and 92.8% for Generic SVM. In [26], they further improved this work by creating 3D models from transformed point clouds for re-identification.
In [11], the authors introduced a depth-based re-identification method employing depth-voxel covariance (DVCov) and local rotation-invariant depth-shape descriptors to characterize pedestrian body shapes. They utilized skeleton-based features (SKL) extracted from joint points and merged them with the depth-shape descriptors to form the re-identification framework. Evaluation on three datasets (RGBD-ID, BIWI RGBD-ID, and IAS-Lab RGBD-ID) showed high performance on RGBD-ID, achieving rank-1 recognition rates of 67.64% for single shots and 71.74% for multi-shots, surpassing other depth- and skeleton-based methods. However, its performance dropped on the BIWI and IAS-Lab datasets due to varying viewing angles, which pose more challenges than the RGBD-ID dataset.
In [28], Imani et al presented a short-term Re-ID method treating depth images as complex networks. They introduced novel features, the histogram of edge weight (HEW) and the histogram of node strength (HNS), and further extracted HNSs from single frames and multi-frames, labeled as histograms of special node strength (HSNS) and histograms of temporal node strength (HTNS). These features were combined with skeleton features using score-level fusion. Evaluation on the RGBD-ID and KinectREID datasets showed the proposed approach achieving rank-1 performances of 58.35% and 64.43% for single and multi-shots, respectively, on KinectREID, and 62.43% and 72.35% on RGBD-ID.
In their subsequent work [27], Imani et al introduced a person re-identification method employing three local pattern descriptors (LBP, LDP, and LTrP) and anthropometric measures obtained from Kinect sensors. Designed for short-term Re-ID, their approach was tested on the RGBD-ID and KinectREID datasets. They divided the RGBD-ID dataset into four groups (Backward, Walking1, Walking2, and Collaborative) based on frontal and rear views, filtering out individuals who changed clothes. These were further merged into the Walking2Backward and Walking1Collaborative databases. Fusion of SGLTrP3 (third-order LTrP with Gabor features) with anthropometric measures achieved rank-1 accuracies of 76.58% and 72.58% for Walking1Collaborative and Walking2Backward, respectively, with the former benefiting from frontal views. The recognition rate on the KinectREID dataset using SGLTrP3 was 66.08%.
Depth- and skeleton-based Re-ID may exhibit lower performance than color-based methods. However, its effectiveness shines in scenarios where RGB-based approaches struggle: low-light conditions or instances of clothing changes.

ReID using RGB, skeleton and depth data
Imani et al [30] leveraged the benefits of RGB-D sensors, which can acquire RGB, depth, and skeleton data concurrently. To derive features from various regions of the RGB and depth modes, the authors introduced a 'local pattern descriptor' called the local vector pattern (LVP). The RGB and depth frames were partitioned into three parts: head, torso, and legs. Additionally, skeleton distances were estimated from 20 joint points of a skeleton, specifically using Euclidean distances. The features extracted from the different modalities were combined using score-level fusion, as illustrated in figure 8, by double and triple combinations. However, the triple combination, which includes the RGB, depth, and skeleton modalities, did not yield better results than the double combinations (i.e. RGB and depth, depth and skeleton, or RGB and skeleton).
The primary difficulty in utilizing a tri-modal approach (i.e. RGB, depth, and skeleton) is fusing features from the three modalities to achieve improved performance. It has been observed that combining RGB and skeleton features yields higher performance than the tri-modal approach, indicating a limitation of that method. In the multimodality approach, a combination of two distinctive features, such as color and skeleton, achieves better performance. However, traditional RGB-D multimodality ReID approaches evaluate their methods on the RGB-D sensor-generated data as-is, without denoising, and only limited skeleton distances are considered alongside color features for person discrimination. In this work, we propose to use the top eight histogram peaks to compute the color descriptor, which enhances color dynamics and helps in better re-identification in indoor scenarios. We also employ eleven skeleton distances to boost the discriminative capability of the features. By leveraging the enhanced color descriptors and additional skeleton distances, we develop a ReID system that is robust to pose and viewpoint variations. Although conceptually related, our approach differs from the previous work presented in [31] in two aspects. First, [31] uses a median filter to eliminate Gaussian noise in the RGB-D data, while our approach uses a Gaussian filter for image denoising without dropping structural information. Second, our approach uses color descriptors extracted from the top eight peaks of the histogram to increase color dynamics and eleven skeleton distances for the similarity measure between two images, whereas [31] measures similarity using only the top five peaks. Moreover, the method in [31] is evaluated only on still images of the BIWI RGBD-ID dataset. In contrast, we evaluate our method on both still and walking set images to show the robustness of the proposed work under different pose and viewpoint variations.

Denoised multimodality feature discrimination
As depicted in figure 1, the proposed method takes the dense point cloud data of an individual and their 3D skeletal information as input. Prior to alignment with the camera viewpoint, the point cloud undergoes a preliminary processing stage to eliminate noise from the input data. The alignment is achieved with the aid of the 3D skeleton joints. The framework comprises two primary modules, namely the Query Signature Generator and the Gallery Signature Generator. The former estimates the color descriptor and skeleton distances of the individual under scrutiny to create a query signature. The latter re-projects the partition grid of the query to compute color descriptors; to generate the gallery person's signature, the skeleton distances are combined with the color descriptor. Subsequently, the signature matching module uses the Euclidean distance to determine the similarity score, which is used to retrieve the relevant person from the gallery. A comprehensive explanation of the proposed approach is presented in the following subsections.
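As a minimal sketch of the signature matching step, the snippet below ranks gallery signatures by their Euclidean distance to a query signature. The array shapes and names here are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def match_query(query_sig, gallery_sigs):
    """Rank gallery identities by Euclidean distance to the query signature.

    query_sig: 1-D signature array (color descriptor + skeleton distances).
    gallery_sigs: 2-D array, one signature per row.
    The smallest distance gives the retrieved (rank-1) identity.
    """
    dists = np.linalg.norm(gallery_sigs - query_sig, axis=1)
    ranking = np.argsort(dists)  # best match first
    return ranking, dists

# Toy gallery of three 2-D signatures (real signatures are 1 x 1955).
gallery = np.array([[0.0, 0.0], [1.0, 1.0], [0.2, 0.1]])
ranking, dists = match_query(np.array([0.15, 0.1]), gallery)
```

In the full pipeline the same call would operate on the 1 × 1955 signatures, and `ranking` would drive the CMC evaluation.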

Point cloud pre-processing
Point clouds captured with range cameras are susceptible to noise and outliers. While segmenting point clouds, sparse outliers known as shadow points may emerge at the periphery of the captured subjects. Additionally, secondary reflections or highly absorbing targets can exacerbate the noise in the point clouds. Therefore, a pre-processing stage is necessary to eliminate outliers and/or reduce the number of data points. To enhance the recognition rate of the ReID task, we assess the efficacy of median and Gaussian filters in reducing noise in the point cloud data. The filters are applied separately and their performance is evaluated; for the median filter, the RGB image is denoised with a filter size of 3 × 3.
The choice of a 3 × 3 window for the median filter and a sigma value of 2 for the Gaussian filter is grounded in empirical evidence from [32,33], which systematically varied filter sizes and sigma values to assess their impact on denoising performance in terms of image quality, feature preservation, and computational efficiency. We also cross-validated this choice by testing window sizes of 2 × 2, 3 × 3, 5 × 5, and 7 × 7: as the filter size increased, we observed a trade-off between denoising effectiveness and structural preservation, and the 3 × 3 window consistently yielded the best recognition rate as well as superior peak signal-to-noise ratio (PSNR) and structural similarity index (SSIM) values [33]. Similarly, we investigated the impact of different sigma values on the recognition rate; since the standard deviation of the Gaussian filter determines how much of the structural information required for the ReID task is retained, and considering the degree of smoothing and time complexity [32], a sigma value of 2 consistently outperformed the other values. The pre-processed point cloud data is then sent to the point cloud alignment module for further data preparation.
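The denoising step can be sketched as follows, assuming `scipy.ndimage` filters applied channel-wise to an RGB image; the synthetic test image and the mean-squared-error metric are illustrative only.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, median_filter

def denoise_rgb(image, method="gaussian", sigma=2.0, size=3):
    """Denoise an H x W x 3 image channel-wise.

    method='gaussian' applies a Gaussian filter (sigma=2, as in the text);
    method='median' applies a 3 x 3 median filter for comparison.
    """
    out = np.empty_like(image)
    for c in range(image.shape[2]):
        if method == "gaussian":
            out[..., c] = gaussian_filter(image[..., c], sigma=sigma)
        else:
            out[..., c] = median_filter(image[..., c], size=size)
    return out

def mse(a, b):
    return float(np.mean((a - b) ** 2))

# Synthetic example: a clean gradient corrupted with Gaussian noise.
rng = np.random.default_rng(0)
clean = np.tile(np.linspace(0, 1, 64), (64, 1))[..., None].repeat(3, axis=2)
noisy = clean + rng.normal(0, 0.1, clean.shape)
```

Comparing `mse(noisy, clean)` with `mse(denoise_rgb(noisy), clean)` shows the Gaussian filter substantially reducing the reconstruction error on this toy image.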

Point cloud alignments
The poses and orientations of individuals captured by different cameras can vary, resulting in point clouds that cannot be compared directly. Therefore, it is crucial to align all point clouds to a common reference system to enable comparison. The skeleton joints and point clouds are represented in a global reference system whose origin is set to the camera's optical center, with the Z-axis aligned to its optical axis. The torso joint is translated to the origin of the reference system, and the body normal is aligned along the Z-axis to obtain a frontal view of the individual. The point cloud is then translated and rotated using the approach proposed by Moller et al [34]. Figure 2 illustrates the point cloud alignment, in which 2(a), 2(b), and 2(c) show the unaligned skeleton, aligned skeleton, and aligned point cloud, respectively.
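A minimal stand-in for this alignment, using the standard axis-angle (Rodrigues) rotation rather than the exact construction of Moller et al [34], could look like:

```python
import numpy as np

def align_point_cloud(points, torso, normal):
    """Translate the torso joint to the origin and rotate the cloud so
    the body normal points along the camera's Z-axis (frontal view).

    points: N x 3 array; torso: 3-vector; normal: body-normal 3-vector.
    """
    z = np.array([0.0, 0.0, 1.0])
    n = normal / np.linalg.norm(normal)
    axis = np.cross(n, z)
    s, c = np.linalg.norm(axis), np.dot(n, z)
    if s < 1e-9:  # already aligned (or exactly opposite)
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        k = axis / s
        K = np.array([[0, -k[2], k[1]],
                      [k[2], 0, -k[0]],
                      [-k[1], k[0], 0]])
        R = np.eye(3) + s * K + (1 - c) * (K @ K)  # Rodrigues' formula
    return (points - torso) @ R.T

# Body normal along +X gets rotated onto +Z; torso moves to the origin.
cloud = np.array([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
aligned = align_point_cloud(cloud, torso=np.array([1.0, 0.0, 0.0]),
                            normal=np.array([1.0, 0.0, 0.0]))
```

After this step every cloud sits in the same frontal reference frame, so per-cell color statistics become comparable across cameras.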

Query signature generator (QSG)
The aligned point cloud and skeleton data are used as inputs to the query signature generator, which produces the person signature by leveraging the color descriptor and skeleton distances. Each module of the QSG block is described in detail in the following sections.

Skeleton standard posture
The purpose of generating the Skeleton Standard Posture (SSP) is to obtain the shape of each person under investigation, which is crucial for creating a unique partition grid. The SSP is constructed using 15 aligned skeleton joints, as depicted in figure 1. During this process, each joint is labeled with an integer (j = 1, ..., 15), and the torso joint is regarded as the origin of the reference system. All consecutive joints are aligned along the XY plane, with the arm joints placed along the X-axis and the leg joints aligned to the Y-axis at +45 and −45 degree angles. We preserve the skeleton joint distances obtained from the camera during SSP generation. Figure 3 illustrates the SSP in the 3D plane, which is used to create a partition grid that separates the point cloud into multiple sections, each with its own set of discriminative features. The visualization shown in figure 3 follows the methodology established by Patruno et al [9], upon which our research builds; in estimating the SSP, we adopt a methodology that aligns with their approach. The green marker in figure 3 denotes the origin point, guiding readers to the reference point within the visual context.

Partition grid
In generating the partition grid, we use the SSP to create 81 bins for color descriptor computation, as shown in figure 4(a). Figure 4(b) demonstrates how the SSP is used to construct a 9 × 9 partition grid. The grid consists of three vertical bands (w1, w2, w3) and four horizontal bands (h1, h2, h3, h4), determined by the positions of the arm joints and of the head, neck, and leg joints, respectively. The w2 and h2 bands are subdivided equally into five and four parts, respectively, to collect more data from areas that are likely to be informative. The w1, w3 and h3, h4 bands are each divided into two parts, resulting in a 9 × 9 partition grid. It is important to note that each person has a unique partition grid that depends on their shape. Previous work by Patruno et al [9] employed a partition grid and assigned Gaussian weights to each cell, based on the assumption that upper cells exhibit less movement than corner and lower cells. However, their method was only applicable to front-facing (still) images without pose or viewpoint variations. In contrast, our work considers both still and walking images, which exhibit frequent pose and viewpoint changes. It is therefore not practical to assign Gaussian weights to each cell, as the body-part distribution in each cell can vary significantly with pose and viewpoint. As a result, we treat all grid cells equally, without assigning weights, in this study.
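The band subdivision above can be sketched numerically. The joint coordinates below are hypothetical; only the subdivision pattern (w2 into five, w1/w3 into two each; h2 into four, h3/h4 into two each) follows the text.

```python
import numpy as np

def subdivide(edges, parts):
    """Split each band [edges[i], edges[i+1]] into parts[i] equal cells,
    returning the full sequence of cell boundaries."""
    out = [edges[0]]
    for lo, hi, n in zip(edges[:-1], edges[1:], parts):
        out.extend(np.linspace(lo, hi, n + 1)[1:])
    return np.array(out)

# Vertical bands w1|w2|w3 bounded by (hypothetical) arm-joint x-positions;
# horizontal bands h1..h4 bounded by head, neck and leg joint y-positions.
x_edges = subdivide(np.array([-0.6, -0.2, 0.2, 0.6]), [2, 5, 2])
y_edges = subdivide(np.array([1.8, 1.6, 0.9, 0.45, 0.0]), [1, 4, 2, 2])
# 10 boundaries each way -> 9 columns x 9 rows = 81 person-specific cells.
```

Because the band boundaries come from each person's own joints, the resulting 81-cell grid is unique to that person's shape, as stated above.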

Color descriptor computation using CIE Lab
The partition grid will have a point cloud distribution consisting of RGB values in 3D space. For signature computation, the RGB color values are transformed to another color model that helps to extract improved color features. Unlike other color models, the CIE L*a*b* color model is highly suitable for image processing techniques due to its approximation of the human visual system and uniform color distribution [9]. It was also empirically found that the CIE L*a*b* color model achieved a better recognition rate than other models (HSV, YUV). Thus, the RGB color space is transformed into the CIE L*a*b* color space in this work. The CIE L*a*b* model has three channels: luminance (L*) and the chromatic channels a* and b*. Since the proposed work focuses on color features, the chromatic channels a* and b* are considered for color descriptor computation. A 2D histogram is constructed for each cell of the partition grid, and the top 8 peaks of the histogram are considered for color descriptor computation, as they carry more color dynamics. Each peak contributes a channel a* count, a channel b* count, and an n count (n denotes the normalized number of occurrences of a* and b*). The color descriptor is therefore a 1 × 1944 one-dimensional array (8 peaks × 3 counts per peak × 81 cells). Patruno et al [9] used five peaks per cell for color descriptor computation. As shown in figure 5, we computed color descriptors using the top 5, 6, 7, 8, 9, and 10 histogram peaks separately for the still and walking sets and empirically found that the recognition rate was best when the top 8 peaks were considered. The recognition rate declined for the top 9 and 10 peaks. As a result, the maximum peak value has been set to 8.
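A per-cell descriptor along these lines could look like the sketch below. The histogram bin count (32) is an assumption, and the a* and b* components of each peak are represented here by their bin indices; this is an illustrative approximation, not the authors' exact encoding.

```python
import numpy as np

def cell_descriptor(a_vals, b_vals, n_peaks=8, bins=32):
    """Top-k peaks of the 2D (a*, b*) histogram for one grid cell.
    Each peak contributes (a-bin, b-bin, normalized count), giving
    n_peaks * 3 = 24 values per cell; 81 cells -> a 1 x 1944 descriptor."""
    hist, _, _ = np.histogram2d(a_vals, b_vals, bins=bins)
    hist = hist / max(hist.sum(), 1)            # normalized occurrences
    flat = hist.ravel()
    top = np.argsort(flat)[::-1][:n_peaks]      # indices of the top-k peaks
    ai, bi = np.unravel_index(top, hist.shape)
    return np.column_stack([ai, bi, flat[top]]).ravel()
```

Concatenating the 24-value descriptors of all 81 cells yields the 1944-dimensional color descriptor described above.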

Skeleton distances
Since skeleton distances (computed using the Euclidean distance between joints) are robust to viewpoint variations [35], we consider the following eleven skeleton distances, as depicted in figure 6, for the computation of the person signature. Two ratios, d10 and d11, are also considered as in [36], where d10 is the ratio between torso and legs and d11 is the ratio between torso and arms. In the works [21] and [36], seven and nine skeleton distances are considered, respectively. However, these works are not specific about the arm and leg lengths considered, and are limited to at most nine skeleton distances. The proposed approach considers each of the distances corresponding to the legs and the left and right arms. The eleven skeleton distances considered in this work therefore offer better discrimination than the existing ones. The query signature is a 1 × 1955 one-dimensional array (the concatenation of the color descriptor of size 1944 and the 11 skeleton distances), which is used to compute the similarity score with the gallery person signature.
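A sketch of the distance computation follows. The assignment of d1 through d9 to specific limb and torso segments is hypothetical (the exact segments are shown in figure 6, which is not reproduced here); only the d10 and d11 ratios follow directly from the text.

```python
import numpy as np

def dist(p, a, b):
    """Euclidean distance between two named joints in a joint dictionary."""
    return float(np.linalg.norm(np.asarray(p[a]) - np.asarray(p[b])))

def skeleton_distances(p):
    """Eleven skeleton features from 3D joints. The segment choices for
    d1..d9 below are illustrative assumptions; d10 and d11 are the
    torso-to-legs and torso-to-arms ratios as described in the text."""
    torso = dist(p, "neck", "torso")
    upper_leg = dist(p, "l_hip", "l_knee")
    lower_leg = dist(p, "l_knee", "l_foot")
    upper_arm = dist(p, "l_shoulder", "l_elbow")
    forearm = dist(p, "l_elbow", "l_hand")
    d = [
        dist(p, "head", "neck"), torso,
        upper_arm, forearm,
        dist(p, "r_shoulder", "r_elbow"), dist(p, "r_elbow", "r_hand"),
        upper_leg, lower_leg,
        dist(p, "r_hip", "r_knee"),
    ]
    d.append(torso / (upper_leg + lower_leg))  # d10: torso-to-legs ratio
    d.append(torso / (upper_arm + forearm))    # d11: torso-to-arms ratio
    return np.array(d)
```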

Gallery signature generator (GSG)
In the Gallery Signature Generator (GSG), the first task is to compute the color descriptors of the gallery images. The partition grid of the query image is re-projected onto all the gallery images, followed by color descriptor computation using the CIE L*a*b* model, which gives the color descriptor of each gallery image. If there exist multiple instances of a person in the gallery, we average their color descriptors to form a final signature. The benefit of re-projecting the partition grid of a query image onto the gallery image is demonstrated in figure 7. When the partition grid of a person (image on the left side) is re-projected onto the point cloud of another person (image on the right side), the cells at the same index positions (cells highlighted in red) show different point cloud distributions. As a result, it is easier to discern any two individuals.
The second task is to measure the skeleton distances for the gallery images. The GSG module accepts the aligned skeleton of each gallery image and computes the eleven skeleton distances (using the Euclidean distance between joints). In the case of multiple instances of the same ID, the averaged value is considered for each skeleton distance. Finally, the color and skeleton distances are concatenated to generate the gallery person signature.
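The per-ID averaging step is straightforward; a minimal sketch, assuming per-instance signatures are already computed:

```python
import numpy as np

def gallery_signatures(signatures_by_id):
    """Average the 1x1955 signatures of all instances of each gallery ID.
    signatures_by_id maps a person ID to the list of that person's
    per-instance signatures (hypothetical container layout)."""
    return {pid: np.mean(np.stack(sigs), axis=0)
            for pid, sigs in signatures_by_id.items()}
```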

Signature matching
The generated query and gallery signatures are used to measure their similarity. This work measures the Euclidean distance between two signatures to compute the similarity score. The similarity S(x, y) between two person signatures is measured using equation (1), where x and y are two individuals and d_i is the signature statistic at the ith position (i ranges from 1 to n = 1955). The gallery person ID with the highest similarity score is retrieved as the matched ID.
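The matching step can be sketched as a nearest-neighbor search over the gallery, where the smallest Euclidean distance corresponds to the highest similarity:

```python
import numpy as np

def match(query_sig, gallery):
    """Return the gallery ID whose signature is nearest to the query in
    Euclidean distance (smallest distance = highest similarity).
    gallery maps person IDs to 1-D signature arrays."""
    best_id, best_d = None, np.inf
    for pid, sig in gallery.items():
        d = np.linalg.norm(query_sig - sig)
        if d < best_d:
            best_id, best_d = pid, d
    return best_id, best_d
```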

Results and discussion
Experimental setup
BIWI RGBD-ID dataset [25]: This dataset is designed for the ReID task using RGB-D cameras. Featuring 50 unique subjects, the dataset is segmented into 50 training and 56 testing sequences. Each dataset component, captured at a maximum resolution of 1280 × 960 pixels with a Microsoft Kinect for Windows, encompasses synchronized RGB images, depth images, segmentation maps of individuals, skeletal data, and ground plane coordinates. The videos were captured at an approximate frame rate of 10 frames per second (fps). In the training section, subjects were recorded executing predefined motion routines in front of the camera, such as rotations around the vertical axis, diverse head movements, and two frontal walks. Among the 50 subjects in the training set, 28 were additionally recorded in two testing videos each. Each person in the testing set was captured in two separate sequences: a 'still' sequence and a 'walking' sequence. In the 'still' sequence, subjects either remain predominantly motionless or display slight movements in place. In the 'walking' sequence, each subject executes two frontal walks and two diagonal walks in relation to the Kinect camera sensor. In the 'still' sequences, individuals are situated at a distance of 2-3 meters from the camera. In the 'walking' sequences, although they approach closer, the subjects maintain a distance greater than approximately 1.5 meters, frequently presenting side views of their faces. The proposed approach considers an indoor scenario where the person's appearance does not change. Therefore, only training images are considered in our study. The training dataset is divided into two sets. One is the still set, containing 11,780 front-facing images obtained by pruning and discarding frames in which the person is captured from the back. The other is the walking set, which includes the complete training set images so as to capture the effect of different pose and viewpoint variations.
RGBD-ID [29]: Further, the performance is evaluated on the RGBD-ID dataset. This dataset encompasses RGB and depth data for 79 individuals, with each individual having four acquisitions: walking1, walking2, collaborative, and backwards. For each person, there is one rear view (backwards) and three frontal views (walking1, walking2, and collaborative). Within each acquisition, four or five RGB frames and the corresponding 3D frames (3D point clouds) are available for each individual. In total, 769 video sequences from the RGBD-ID dataset were included in the study.
The dataset is split into two parts: one contains the individuals to be re-identified (called the source set), while the other includes labeled users (referred to as the reference set). These subsets are created using a k-fold cross-validation method with k set to 10. In each iteration, roughly 10% of the instances form the reference set, while the rest compose the source set. It is important to note that the contents of these sets change with each iteration of the k-fold cross-validation. Using the SSP and the partition grid of the particular user being examined from the source set, the method constructs all the reference set's signatures. These signatures are then averaged to produce 50 mean signatures, each corresponding to a user class. These mean signatures are unique to the specific user being analyzed from the source set. The proposed approach is implemented in MATLAB 2020a (64-bit) on an Intel Core i5-8400 CPU running at 2.80 GHz with 16 GB of RAM.

Analysis on point cloud pre-processed data
To examine the image smoothness and the level of retained structural information after applying the median and Gaussian filters on the BIWI RGBD-ID dataset, we subtract the median and Gaussian filtered images from the original image separately. Figure 8 shows the original image (left), the median filtered image (top middle), and the Gaussian filtered image (bottom middle). The resulting image (bottom right) shows that the subtraction of the Gaussian filtered image from the original is significantly smoother and retains structural information better than the subtraction result of the original and median filtered images (top right). Therefore, the Gaussian filter removes noise from the BIWI RGBD-ID dataset more efficiently than the median filter.
Further, the noise reduction ability of the median and Gaussian filters was assessed on the BIWI RGBD-ID dataset through a quantitative analysis using the PSNR and SSIM metrics. The sigma values for the Gaussian filter and the kernel sizes for the median filter were systematically varied, and the resulting PSNR and SSIM values are presented in table 1. The analysis suggests that the Gaussian filter outperforms the median filter in terms of both PSNR and SSIM. The higher PSNR for the Gaussian filter implies superior noise reduction capabilities, while the elevated SSIM values indicate better preservation of structural information compared to the median filter.
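For reference, the PSNR metric used in this comparison is defined as follows; a minimal sketch assuming 8-bit images (peak value 255):

```python
import numpy as np

def psnr(reference, denoised, peak=255.0):
    """Peak signal-to-noise ratio in dB between a reference image and a
    denoised image; higher values indicate the denoised image is closer
    to the reference."""
    mse = np.mean((reference.astype(float) - denoised.astype(float)) ** 2)
    if mse == 0:
        return np.inf  # identical images
    return 10.0 * np.log10(peak ** 2 / mse)
```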

Experimental analysis on BIWI RGBD-ID dataset
The evaluation is performed using k-fold cross-validation [37]. Each of the still and walking set images is divided into query and gallery sets. These query and gallery sets are defined using a k-fold cross-validation partitioning, where k has been set to 10. For each iteration, about 10% of the database instances are taken for testing, whereas all the remaining instances in the database define the gallery set.
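The partitioning scheme can be sketched as follows; the shuffling seed is an assumption made for reproducibility of the illustration:

```python
import numpy as np

def kfold_splits(n_items, k=10, seed=0):
    """Yield (query_idx, gallery_idx) pairs: roughly 1/k of the instances
    form the query set in each fold, and the rest form the gallery."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n_items)
    for fold in np.array_split(idx, k):
        gallery = np.setdiff1d(idx, fold)
        yield fold, gallery
```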
The ReID performance is analysed separately on median and Gaussian filtered images from the still and walking sets. First, the ReID performance is measured by computing the color descriptor features alone from the input data. Table 2 shows the recognition rate on median and Gaussian filtered still and walking set images considering only color descriptors. The results show that Gaussian filtered input images are better suited for ReID on still and walking set images than median filtered input images. Patruno et al [9] achieved a 97% recognition rate on median filtered still images using the top five histogram peaks. The proposed approach achieved a recognition rate of 98.48% (on the still set) using median filtered images as input, demonstrating the importance of counting extra histogram peaks in the color descriptor computation. The additional color dynamics (peaks) enhanced the performance by 1.48% compared to the work by Patruno et al [9]. We also observed a performance drop on walking set images, which reveals that color descriptor features alone are insufficient when every person exhibits pose and viewpoint variations.
The goal of adding skeleton distances to the color descriptors for signature computation is to make the ReID system robust to pose and viewpoint changes. From table 2, it is evident that the recognition rate has improved on walking set images, which shows the significance of multimodality features in the ReID task. Due to the lack of pose and viewpoint heterogeneity in the still set, the identification rate on still images remains consistent even after adding skeleton distances.

Ablation study
The ReID performance of the proposed approach is evaluated by varying the color model for color descriptor calculation, and the obtained results are listed in table 3. The results reveal that the CIE L*a*b* color model is more effective than the YUV and HSV color models.
We also evaluate the ReID performance for the single-modality approach (performance evaluated separately for color features and skeleton distances) and the multimodality approach (color features and skeleton distances combined). The results in table 4 show that the color features have more discriminative capability than the skeleton distance-based features. Therefore, when both are combined, the color features contribute more towards improving the recognition rate, and adding the skeleton distance-based features improves the overall performance. The upper and lower parts of the body have different effects on person re-identification. The impact of excluding the lower skeleton distances (D5, D4, D1, D10, D11) and the upper skeleton distances (D2, D3, D6, D7, D8, D9) on the person recognition rate is analysed, and the results are summarized in table 5. The recognition rates for the upper body parts are consistently higher than those for the lower body parts. This could imply that the features (skeleton distances) used in the study are more distinctive or easily identifiable for the upper body than for the lower body. However, there is a noticeable decrease in recognition rates when the persons are walking, suggesting that movement introduces complexity or variability in the features, making accurate recognition more challenging. Nevertheless, combining the eleven skeleton distances adds better discriminative ability in distinguishing two individuals in both the still and walking set scenarios.

Comparison with state-of-the-art techniques
BIWI RGBD-ID: Table 6 demonstrates the superior overall performance of our proposed approach on the BIWI RGBD-ID dataset in comparison to previous methods. This underscores the effectiveness of color descriptors extracted from the top eight peaks of the pose regularized partition grid, coupled with the incorporation of skeleton distances for measuring the similarity between two individuals. Notably, among the existing methods, our proposed approach stands out as a significant contender, achieving impressive recognition rates of 99.01% and 94.81% on still and walking set images, respectively.
The earlier work proposed by Liu et al [20] used both skeleton and appearance features extracted using the HSV color model and achieved 91.6% accuracy. Later, Hafner et al [38] extracted structural features taking RGB and depth images as input to re-identify persons from still images, achieving 94.75%. In our work, the color descriptors are extracted using the CIE Lab model, which is less sensitive to varying lighting conditions. Further, since the ReID task requires precise discrimination between similar colors, especially in situations where human observers may struggle to distinguish between them, the perceptual uniformity of CIE Lab has contributed to the improved performance. In our study, using appearance features alone, we achieved 99.15%, which shows that extracting color descriptors from the CIE Lab model is more effective than from the HSV model. Similar to our method, Patruno et al [9] employed color descriptors extracted from a pose regularized partition grid using the CIE Lab color model. However, they chose to extract color descriptors from images denoised with a median filter. In our empirical study, we have demonstrated that a Gaussian filter outperforms the median filter in terms of image denoising. Additionally, Patruno et al [9] computed color descriptors based on only the top 5 peaks of the histogram, while our approach considers the top eight peaks. Moreover, we incorporate skeleton distances as part of the similarity measure. These collective factors contributed to a noteworthy 2.01% improvement in performance compared to the method proposed by Patruno et al [9].
RGBD-ID dataset: The performance of the proposed method on the RGBD-ID dataset is compared with the state-of-the-art methods, and the results are reported in table 7. Among the different baseline models, the feature-level fusion model in [16] causes an overfitting problem because of the direct fusion of two CNN features extracted from the two different modalities, RGB and depth. Blindly fusing heterogeneous noisy features may not increase discrimination power, as different features have different reliability. In our proposed method, by contrast, the data is denoised first and then the multimodality features are integrated, which helps to improve the performance further. Our work extends the method of [9], and we achieved increased performance after the integration of skeleton distances with the enlarged set of color features. The results obtained on the RGBD-ID dataset reveal the significance of denoising the multimodality features in discriminating two individuals effectively. The recognition performance of the proposed method on the RGBD-ID dataset is depicted in figure 9, showing its superiority over other methods in terms of recognition rates, particularly in the top ranks. The proposed method achieves a rank-1 rate of 92.21%, outperforming the second-best methods, APC-USG by Patruno et al [9] and Fusion by Uddin et al [16], both with a rate of 89.34%, which is an improvement of about 2.87% over [9]. The rank-1 rates (%) of the compared methods are:

Imani et al [30]: ReID utilizing information from RGB, depth, and skeleton, 85.5
Ren et al [14]: Uniform and variational features using deep learning, 76.7
Patruno et al [9]: Skeleton Standard Postures and color descriptors (APC-USG), 89.34
Imani et al [27]: Two novel histogram feature-based ReID, 76.58
Uddin et al [16]: Fusion in a dissimilarity space (Fusion), 89.34
Proposed: Denoised color descriptors and skeleton distances, 92.21

Conclusion and future work
This work addresses the most common challenge of person ReID under pose and viewpoint variation from noisy RGB-D data. A Gaussian filter is applied to the data captured by the RGB-D camera, resulting in a noise-free 3D point cloud. The comparative results on the Gaussian and median filters demonstrate that the Gaussian filter effectively eliminates the noise while retaining significant structural information, which aids in improving the ReID performance. Since the proposed ReID is constrained to an indoor scenario where a person maintains the same cloth pattern, the color descriptor computed with extra histogram peaks is found to be effective. Furthermore, because the skeleton distances are invariant to pose and viewpoint variations, the consideration of eleven skeleton distances along with the color descriptors outperformed state-of-the-art ReID techniques. The experimental results on still and walking set images in the publicly available RGBD-ID and BIWI RGBD-ID datasets make evident the robustness of our method to pose and viewpoint variations.
In our study, we concentrated on ensuring reliable re-identification within indoor scenarios, with a specific emphasis on mitigating challenges associated with pose and viewpoint changes. However, the color features we considered have demonstrated limitations in effectively handling alterations in appearance due to variations in illumination, instances of occlusion, and changes in cloth patterns. Additionally, the study identified difficulties in accurately estimating skeleton distances when confronted with frequent changes in posture and motion patterns. These limitations highlight areas for potential improvement and refinement in our approach. Re-identification systems can affect individuals' privacy by enabling continuous tracking and identification. Individuals might become more conscious of their actions, potentially impacting the natural flow of public life and social interactions. The deployment of re-identification systems is often justified by the aim to enhance security and public safety; the impact can be positive if the technology helps prevent and solve crimes.

Figure 1 .
Figure 1.The conceptual framework of the proposed method.

Figure 4 .
Figure 4. Partition grid using SSP: (a) Generated partition grid (b) Selection of joints from SSP for partition.

Figure 5 .
Figure 5. Recognition rate versus number of peaks: (a) Still set (b) Walking set.

Figure 7 .
Figure 7. Application of the same partition grid of a person having different SSP.

Table 1 .
Quantitative results of Median and Gaussian filters to assess the denoising ability.

Table 2 .
BIWI RGBD-ID dataset: Recognition rate on still and walking images using median and Gaussian filtered multimodality features; best results are in bold.

Table 3 .
Experimentation with different color models.

Table 4 .
Performance comparison of proposed method with different feature types.

Table 6 .
BIWI RGBD-ID dataset: performance comparison with the state-of-the-art methods.

Table 7 .
RGBD-ID dataset: Performance comparison with the state-of-the-art methods.