Human Posture Recognition in Intelligent Healthcare

With the rapid growth of medical technology, intelligent healthcare has gradually become part of daily life, and one of its core technologies is the detection of human motion states. According to the basic tools used, we divide this technology into two categories, vision-based detection and sensor-based detection, and take fall monitoring as an example for elaboration. Finally, the different methods are compared and analyzed, the difficulties that remain to be solved in this field are summarized, and future trends are discussed.


INTRODUCTION
Intelligent healthcare refers to the interaction between patients and medical institutions, realized by building a medical information platform around health records by means of the Internet of Things, so that healthcare is gradually informatized. In the near future, the healthcare industry will incorporate more technologies to achieve real intelligence and informatization, and one of the most important enabling technologies is motion detection. The motion detection technology currently applied to intelligent healthcare, at home and abroad, roughly falls into two types: vision-based detection and sensor-based detection [1]. The former makes its judgment by applying playback and video localization techniques to the saved video frames. Multiple cameras are installed in the environment of the monitored subject, computer vision techniques relate the motion process to changes in human body morphology, and the motion state is then determined by analyzing changes in the posture of the human body. In a dynamic environment, 3D simulated pictures of the subject's motion can be displayed in real time, and the judgment results are more accurate. Sensor-based motion detection determines the motion state by analyzing and fusing the human posture data measured by sensors. Measurement nodes are attached to the major joints of the human body, the fused node data are used to analyze the posture information, and the motion state is then judged. Such detection is low in cost and convenient to carry, and it makes real-time monitoring of the motion state feasible.
In this paper, the vision-based and sensor-based motion detection methods are sorted and summarized respectively, taking fall detection as an example. The first chapter introduces the research background and the research status at home and abroad. The second and third chapters introduce vision-based and sensor-based motion state detection according to the different detection methods. The fourth chapter analyzes and compares the two detection approaches, and the fifth chapter looks forward to future research trends in this field based on the research status of fall detection.

Target Detection
Moving target detection aims to find the moving regions in a video and provide a reference region for subsequent action analysis and other tasks. The most traditional target detection algorithms include the inter-frame difference method, the background subtraction method and the optical flow method.
The basic principle of the inter-frame difference method [2] is to obtain a difference image by subtracting two successive frames, binarize it through threshold processing, and finally acquire the location of the target contour through morphological processing. Some scholars have proposed the three-frame difference method based on the traditional two-frame difference method: difference operations are performed on the two pairs of consecutive frames among three neighboring frames, and the two difference results are then combined to obtain the foreground object.
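The idea reduces to a few array operations. The following is a minimal, illustrative sketch assuming grayscale frames as numpy arrays; the threshold of 25 and the synthetic 5x8 frames are arbitrary choices for demonstration.

```python
import numpy as np

def three_frame_difference(f1, f2, f3, thresh=25):
    """Three-frame difference: AND of the two consecutive difference
    masks, which localizes the moving object in the middle frame.

    f1, f2, f3: consecutive grayscale frames as uint8 arrays.
    Returns a binary mask (uint8, 0/255) of the moving region.
    """
    d1 = np.abs(f2.astype(np.int16) - f1.astype(np.int16))
    d2 = np.abs(f3.astype(np.int16) - f2.astype(np.int16))
    m1 = (d1 > thresh).astype(np.uint8)
    m2 = (d2 > thresh).astype(np.uint8)
    return (m1 & m2) * 255

# A synthetic "scene": a bright 2x2 block moves two pixels per frame.
f1 = np.zeros((5, 8), np.uint8); f1[1:3, 0:2] = 200
f2 = np.zeros((5, 8), np.uint8); f2[1:3, 2:4] = 200
f3 = np.zeros((5, 8), np.uint8); f3[1:3, 4:6] = 200
mask = three_frame_difference(f1, f2, f3)   # highlights the object at f2
```

In practice a morphological opening/closing step would follow to clean up the mask, as the text describes.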
The basic principle of the background subtraction method [3] is to compute the difference between a preselected background and the current frame. If the difference is greater than a threshold, the pixel is considered part of the target area; if it is smaller, it is considered background. On top of background subtraction, further algorithms such as linear Kalman filtering [4], the statistical average method [5], the Gaussian Mixture Model (GMM) [6] and ViBe [7] have been proposed, which improve the detection efficiency and render more accurate detection results.
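A minimal sketch of the statistical (running) average variant, assuming grayscale frames; the learning rate alpha and threshold are illustrative values, not taken from the cited works.

```python
import numpy as np

def update_background(bg, frame, alpha=0.05):
    """Statistical (running) average background model update."""
    return (1.0 - alpha) * bg + alpha * frame.astype(np.float32)

def foreground_mask(bg, frame, thresh=30):
    """Pixels whose difference from the background exceeds the
    threshold are labelled target (255); the rest are background (0)."""
    diff = np.abs(frame.astype(np.float32) - bg)
    return np.where(diff > thresh, 255, 0).astype(np.uint8)

# Static scene of grey value 50; a bright object enters the frame.
bg = np.full((6, 6), 50.0, np.float32)
frame = np.full((6, 6), 50, np.uint8)
frame[2:4, 2:4] = 220
mask = foreground_mask(bg, frame)
bg = update_background(bg, frame)   # the model slowly absorbs the change
```

The slow update is what lets the model adapt to gradual illumination changes while still flagging sudden foreground objects.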
The optical flow method [8] represents the motion state of an object in the image by calculating the instantaneous velocity of a moving pixel in the video sequence. The basic calculation is as follows: suppose that the pixel at (x, y, t) in a certain frame of the video moves to (x + Δx, y + Δy, t + Δt) after time Δt. Since the motion of the pixel does not change its grey level, I(x, y, t) = I(x + Δx, y + Δy, t + Δt), and differentiating with respect to time yields the optical flow constraint of the pixel:

(∂I/∂x)·(Δx/Δt) + (∂I/∂y)·(Δy/Δt) + ∂I/∂t = 0 (1)

where Δx/Δt and Δy/Δt represent the instantaneous velocity of the pixel along the X and Y directions respectively. Most existing optical flow methods are computationally complex and greatly affected by noise, light sources, occlusion and other factors, so ordinary hardware cannot meet real-time requirements. Therefore, the optical flow method is generally used in conjunction with other algorithms.
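Equation (1) has two unknowns per pixel, so a common approach (in the spirit of Lucas-Kanade, not the only optical flow method) solves it in least squares over a small patch. The sketch below uses a synthetic intensity surface translated by a known (1.0, 0.5) pixel displacement; the patch size and test data are illustrative.

```python
import numpy as np

def lucas_kanade_patch(Ix, Iy, It):
    """Least-squares solution of Ix*u + Iy*v + It = 0 over a patch.

    Ix, Iy, It: spatial and temporal image derivatives on the patch.
    Returns the instantaneous velocities (u, v) = (dx/dt, dy/dt).
    """
    A = np.stack([Ix.ravel(), Iy.ravel()], axis=1)
    b = -It.ravel()
    (u, v), *_ = np.linalg.lstsq(A, b, rcond=None)
    return u, v

# Synthetic quadratic intensity surface translated by (1.0, 0.5) pixels:
# the temporal derivative then satisfies It = -(Ix*1.0 + Iy*0.5).
ys, xs = np.mgrid[0:5, 0:5].astype(float)
Ix, Iy = 2.0 * xs, 2.0 * ys
It = -(Ix * 1.0 + Iy * 0.5)
u, v = lucas_kanade_patch(Ix, Iy, It)   # recovers the displacement
```

Solving over a patch rather than a single pixel is what resolves the two-unknowns-one-equation ambiguity (the aperture problem), provided the patch contains varied gradients.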

Feature Extraction
The feature expression of an image is a process that maps the pixels of the original image to distinguishable spatial data of a certain dimensionality, a step of great importance for bridging the gap between the underlying pixels and the high-level semantics. Judging from whether the features can be obtained through self-learning, they can be divided into features based on artificial design and features based on learning.

2.2.1 Feature Expression Based on Artificial Design

Feature expression based on artificial design refers to features extracted by manual design. Owing to the differences between visual features and visual computing, they are roughly divided into four types: gradient features, pattern features, shape features and color features. The first type describes a moving target by acquiring gradient information close to specific key points, with excellent properties such as rotation and scale invariance; typical algorithms include SIFT [9] and HOG [10]. Pattern features are descriptions obtained by analyzing the relative disparity between local areas of an image, and are usually used to represent image texture information; typical algorithms include Gabor [11], LBP [12] and Haar-like [13]. Shape features support model-based target detection and generally describe the target contour, with classical algorithms such as DPM [14], Shape context [15] and KASs. Color features are obtained by calculating the probability distribution of local image attributes (such as grayscale and color); classical algorithms include Color names [16] and entropy-based salient features [17].
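To make the gradient-feature idea concrete, the sketch below computes a simplified single-cell HOG histogram (orientation bins weighted by gradient magnitude, then L2-normalized). This is a toy reduction of HOG, not the full block-normalized descriptor of [10]; the 9-bin layout and test patch are illustrative.

```python
import numpy as np

def hog_cell_histogram(patch, n_bins=9):
    """Simplified HOG for one cell: an orientation histogram over
    0-180 degrees, weighted by gradient magnitude, L2-normalized."""
    gy, gx = np.gradient(patch.astype(np.float32))
    mag = np.hypot(gx, gy)
    ang = np.rad2deg(np.arctan2(gy, gx)) % 180.0
    bin_w = 180.0 / n_bins
    idx = np.minimum((ang // bin_w).astype(int), n_bins - 1)
    hist = np.zeros(n_bins)
    np.add.at(hist, idx.ravel(), mag.ravel())
    return hist / (np.linalg.norm(hist) + 1e-6)

# A vertical edge: all gradient energy points horizontally (0 degrees),
# so the first orientation bin should dominate.
patch = np.zeros((8, 8), np.float32)
patch[:, 4:] = 255.0
hist = hog_cell_histogram(patch)
```

The normalization step is what gives the descriptor a degree of invariance to illumination changes.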

2.2.2 Feature Expression Based on Machine Learning

The application of machine learning to vision-based target detection and tracking mainly takes two forms: regression and detection.
The regression result is the exact coordinate of a human joint. Alexander Toshev et al. [18] first formulated the estimation of human posture as a regression problem for deep learning, and used a DNN [19] to locate all the human joints in a large space. Given a human motion image as input, an initial coordinate for each joint is obtained by a convolutional neural network transformed by a regression function, and the posture is then estimated more accurately by means of a cascade. Although only local views are used to estimate the joints and the practical benefit of the method is limited, it played a notable role in promoting this line of research.
In detection by heatmap analysis, different human parts are given different response values in the corresponding heatmap. The higher the value, the higher the probability that the pixel is the target joint. This approach has been dominant in present human posture recognition research. The main research direction lies in how to obtain more accurate heatmaps that are close to the ground-truth annotations, especially under undesirable conditions or when distinguishing multiple postures. The main tool of these detection methods is the CNN [20], a multi-layer neural network adept at processing images, especially machine learning problems involving large images. Through a series of mechanisms, the convolutional network successfully reduces the dimensionality of image recognition problems with large data sizes, and thereby keeps its training tractable. A single-layer convolutional neural network contains two processes, convolution and subsampling, whose implementation procedure is shown in Figure 2. In the convolution process, different features of the signal are extracted by introducing different convolution kernels to observe specific patterns of the input signal. Dimensionality reduction of the feature map is performed in the subsampling step, generally in the form of average or maximum pooling. Single-layer convolutional neural networks are stacked layer by layer to form the CNN, where the output features of a lower layer serve as the input signals of the layer above.
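The two processes of a single layer can be sketched in a few lines of numpy. This is a bare illustration of valid convolution followed by 2x2 max pooling, without the learned weights, nonlinearities or padding of a real CNN.

```python
import numpy as np

def conv2d(img, kernel):
    """Valid 2-D convolution (no padding, stride 1) of a single-channel
    image with one kernel, producing one feature map."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Subsampling by maximum pooling over non-overlapping windows."""
    h2, w2 = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h2 * size, :w2 * size].reshape(h2, size, w2, size).max(axis=(1, 3))

# A 6x6 input convolved with a 3x3 averaging-style kernel, then pooled:
# 6x6 -> 4x4 feature map -> 2x2 pooled map, showing the dimensionality
# reduction the text describes.
fmap = conv2d(np.ones((6, 6)), np.ones((3, 3)))
pooled = max_pool(fmap)
```

Stacking such layers, with the pooled output of one layer feeding the convolution of the next, yields the CNN structure described above.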

Fall Detection
Once the feature expression is extracted, a target or background appearance model is established on the basis of the feature expression, and classification is then performed to obtain the different motion states. There are generally two classification methods: one uses threshold classification and the other uses a classifier.

Thresholding
The key of thresholding [21] is to judge a fall by setting one or multiple threshold values. If the received signal exceeds (or drops below) the preset value, the state is judged to be a fall, with the main expression as below:

f(x, y) = 1, if the received signal surpasses the preset value, and the state is judged to be a fall;
f(x, y) = 0, if the received signal does not exceed the preset threshold, and the state is judged not to be a fall.

The selection of the threshold value is diverse and is set as required. Medrano et al. [22] set thresholds based on the feature mean and variance; Magnenat-Thalmann et al. [23] obtained data thresholds based on human height and then derived velocity thresholds; Planinc et al. [24] set thresholds based on the joint points and the directions of the ground and the spine. The judging process of thresholding is simple and fast, but is greatly affected by the selection of the threshold value: an inappropriate threshold will result in misjudgment. Thresholding is therefore usually used jointly with other methods.
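As a sketch, the indicator function and one way of choosing the threshold (from the feature mean and variance, in the spirit of [22], with an illustrative factor k) might look like:

```python
import numpy as np

def fall_indicator(value, threshold):
    """f = 1: the received value surpasses the preset threshold
    (judged as a fall); f = 0 otherwise."""
    return 1 if value > threshold else 0

def adaptive_threshold(samples, k=3.0):
    """Set the threshold from the feature mean and standard deviation
    of reference samples; k is an illustrative tuning factor."""
    samples = np.asarray(samples, dtype=float)
    return samples.mean() + k * samples.std()

th = adaptive_threshold([1.0, 1.0, 1.0, 1.0])   # mean 1, std 0 -> 1.0
```

This makes the sensitivity to threshold selection visible: any sample exceeding the mean by more than k standard deviations is flagged, so k directly trades false alarms against missed falls.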

2.3.2 Classifier

Classification refers to the construction of a classification function or model based on existing data. This function or model maps the data records in a database to one of the given categories, and is applicable to data forecasting. The classifiers currently in use include the Support Vector Machine (SVM), decision trees, logistic regression, Naive Bayes and neural networks.
The main idea of the support vector machine is to generate a hyperplane in the feature space and translate it until it separates the samples of the different categories on its two sides with the largest margin. Capezuti et al. [25] used a hierarchical SVM to differentiate the fall action from other actions according to the height variation of the intermediate node over consecutive frames in the detected video, and further subdivided the way of falling into falling while standing and falling off a chair. Bian et al. [26] used an SVM to determine the falling directions of the head, body, hip and trunk on the basis of RDT.
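A toy sketch of the SVM approach, assuming scikit-learn is available. The two features (height drop of an intermediate joint, vertical speed) and the synthetic training clusters are hypothetical stand-ins for the kinds of per-clip features the cited works extract, not their actual data.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D features per video clip:
# [height drop of the intermediate joint, vertical speed].
falls = rng.normal(loc=[0.9, 1.5], scale=0.1, size=(50, 2))    # label 1
others = rng.normal(loc=[0.1, 0.3], scale=0.1, size=(50, 2))   # label 0
X = np.vstack([falls, others])
y = np.array([1] * 50 + [0] * 50)

# A linear-kernel SVM finds the maximum-margin separating hyperplane.
clf = SVC(kernel="linear").fit(X, y)
pred_fall = clf.predict([[0.85, 1.4]])[0]    # fall-like clip
pred_walk = clf.predict([[0.15, 0.25]])[0]   # ordinary-action clip
```

With well-separated clusters like these, the learned hyperplane cleanly classifies new clips; real features overlap far more, which is why hierarchical or kernelized variants are used in practice.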
The main idea of Naive Bayes (NB) is to compute the probability of each category given the item to be classified, and to take the category with the largest probability as the category of that item. Athitsos et al. [27] used Bayes to judge the height, speed and frame duration of the head, and conducted a probability calculation on the positive and negative characteristics to judge the fall. The authors of [28] first used the moving average method to extract the length-width ratio of the human bounding box, and further presented a fall detection method with a lower computational load by using a KNN classifier.

SENSOR-BASED MOTION MONITORING
Sensor-based motion detection largely relies on inertial sensors. Node sampling plus fusion of the measured data is performed, with the sensors attached by bands or other means to the main joints of the human body (such as the head, neck, waist and limbs), to detect the motion state. The basic process is shown in Figure 3 below:

Analysis of Human Posture
In the process of analyzing human posture, minor mechanical vibrations usually introduce high-frequency noise. To suppress this high-frequency noise, it is necessary to filter the collected raw data. By means of recursive average filtering, Kalman filtering and digital filtering, errors and faults in real-time processing and computer operation can be reduced.
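As a minimal sketch of the first of these, a recursive (sliding) average filter can be written in a few lines; the window length is an illustrative choice that trades smoothing against responsiveness.

```python
def recursive_average(samples, window=3):
    """Recursive average filter: each output is the mean of the most
    recent `window` input samples, suppressing high-frequency spikes."""
    buf, out = [], []
    for s in samples:
        buf.append(s)
        if len(buf) > window:
            buf.pop(0)
        out.append(sum(buf) / len(buf))
    return out

# A single vibration spike of amplitude 10 is spread and attenuated.
flat = recursive_average([1.0, 1.0, 1.0, 1.0], window=2)
smoothed = recursive_average([0.0, 0.0, 10.0, 0.0, 0.0], window=3)
```

A Kalman filter does the same job adaptively by weighting measurements against a motion model, at the cost of more computation.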
Body posture analysis with an inertial sensor studies the kinematic parameters of different human actions by establishing a Cartesian coordinate system OXYZ. With the center of mass of the human body as the origin, the front is defined as the positive direction of the X-axis, the left-hand side as the positive direction of the Y-axis, and the vertical direction as the positive direction of the Z-axis. At the same time, the deflection angle around the X-axis is defined as roll, that around the Y-axis as pitch, and that around the Z-axis as yaw. The changes in body posture resulting from different human actions are regarded as rotations and translations of the human body in the OXYZ coordinate system [29]. The coordinate system of body posture is shown in Figure 4. Since different kinematic parameters emerge in different parts of the human body in motion, the corresponding acceleration, angular velocity and magnetic deviation are obtained from the inertial sensor in preparation for posture analysis.

3.1.1 Analysis of Static Posture

The analysis of static posture is based on the fact that in a static state or low-velocity movement the only acceleration acting on the human body is gravity (1 g), while the acceleration in other directions is almost 0, so the roll angle θ and pitch angle γ can be obtained by calculating the components of gravity along the different axes. The main formulas are as follows:

θ = arctan(a_y / a_z), γ = arcsin(−a_x / g)

where a_x, a_y and a_z are the acceleration components measured along the three axes. From the two angles a posture matrix between the carrier coordinate system b and the geographic coordinate system n is obtained, which relates the magnetic deviation components output by the inertial sensor in the two coordinate systems. Since the basic rotation matrix is a unit orthogonal matrix, its inverse equals its transpose. Combining the above formulas with the tilt-compensated magnetometer readings, the angle of rotation Φ (yaw) in the static state is obtained [30].
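A minimal sketch of the roll/pitch computation, assuming a static accelerometer reading normalized so that gravity has magnitude 1 g and following the axis definitions above (X forward, Z vertical); sign conventions vary with sensor mounting, so treat the signs as an assumption.

```python
import math

def static_attitude(ax, ay, az):
    """Roll and pitch (radians) from the gravity components measured by
    a static accelerometer. Yaw additionally needs the magnetometer."""
    roll = math.atan2(ay, az)                       # rotation about X
    pitch = math.atan2(-ax, math.hypot(ay, az))     # rotation about Y
    return roll, pitch

# Lying flat (gravity entirely on Z): both angles are zero.
r0, p0 = static_attitude(0.0, 0.0, 1.0)
# Rolled 90 degrees (gravity entirely on Y).
r1, p1 = static_attitude(0.0, 1.0, 0.0)
```

Using atan2/hypot rather than a bare arcsin keeps the result well-defined over the full range of orientations.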
3.1.2 Analysis of Quaternion Posture

The static state only accounts for a small part of human motion; most of the time the human body is in irregular dynamic motion, so the analysis of dynamic posture is the most frequently used. The quaternion method [31] is the most common in dynamic posture analysis. It gathers the raw data of the gyroscope and performs the calculation with quaternions. Based on the collected accelerometer and magnetometer signals, and by virtue of the direction cosines of the gravity field and geomagnetic field between the geographic coordinate system and the carrier coordinate system, the corresponding posture angles are calculated.
A quaternion q = [q0, q1, q2, q3] with |q|² = q0² + q1² + q2² + q3² = 1 is defined, and the posture matrix is expressed in terms of its components. By means of data fusion of the accelerometer, gyroscope and magnetometer, a more accurate posture angle is solved. Every time the posture algorithm runs, the quaternion is updated according to the fused sensor data, and the posture angles γ, θ and Φ are calculated from the updated quaternion.
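The final step, recovering the posture angles from an updated unit quaternion, can be sketched with the standard conversion formulas (the fusion/update step itself, e.g. a complementary or Kalman filter, is omitted here).

```python
import math

def quat_to_euler(q0, q1, q2, q3):
    """Posture angles (roll, pitch, yaw) in radians from a unit
    quaternion q = [q0, q1, q2, q3] with q0 the scalar part."""
    roll = math.atan2(2 * (q0 * q1 + q2 * q3), 1 - 2 * (q1 * q1 + q2 * q2))
    # Clamp guards against round-off pushing the argument outside [-1, 1].
    pitch = math.asin(max(-1.0, min(1.0, 2 * (q0 * q2 - q3 * q1))))
    yaw = math.atan2(2 * (q0 * q3 + q1 * q2), 1 - 2 * (q2 * q2 + q3 * q3))
    return roll, pitch, yaw

# Identity quaternion: no rotation.
r0, p0, y0 = quat_to_euler(1.0, 0.0, 0.0, 0.0)
# 90-degree rotation about the vertical (Z) axis: pure yaw.
s = math.sqrt(0.5)
r1, p1, y1 = quat_to_euler(s, 0.0, 0.0, s)
```

Note that the quaternion representation itself has no singularity; it is only this final conversion to Euler angles that reintroduces gimbal lock near pitch = ±90 degrees.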
In comparison with static posture analysis, the computational load of the quaternion method is small and it has no singularities, so it is widely used in the real-time posture calculation of motion control systems. The quaternion method is widely applied as an analytical tool. Pierononi et al. [32] designed a system to calculate posture and the height of the center of gravity using MEMS sensors such as the MPU6050, HMC5883L and MS5611-01BA, with Kalman filtering for data processing. The human body acceleration data are feature-extracted by an SVM so that the quaternion is obtained and converted into Euler angles, thereby making the detection of the human directional posture possible. Rico-Azagra et al. [33] used an explicit complementary filter (ECF) expressed with quaternions to fuse accelerometers and gyroscopes and achieve accurate attitude estimation of an unmanned ground vehicle employing low-cost inertial measurement units. Brigante et al. [34] put forward a wearable modular system that captures real-time human motion and estimates attitude and gyroscope bias by using a quaternion-based extended Kalman filter.

Fall Detection
The data from human posture analysis are used for motion detection, and a prominent use of motion detection is to judge the falling state, determining whether the body falls by calculating features. The mainstream fall detection algorithms fall into three kinds: multi-level thresholding, intelligent classification (mainly based on the support vector machine) and machine learning-based detection, where the third kind incorporates two branches, the KNN algorithm and the BP neural network. The KNN branch needs to store the entire training dataset before classification, which makes processing extremely difficult and prone to misjudgment. The BP branch offers no way to observe the learning process, rendering the output results unexplainable, affecting their reliability, and possibly even failing to achieve the learning purpose. Therefore, this section mainly studies the first two methods, namely multi-level thresholding and the support vector machine.

3.2.1 Multi-Level Thresholding Detection

In multi-level thresholding, general human actions are first divided into violent actions (jogging, falling, jumping) and ordinary actions based on the minimum value of the resultant acceleration. A threshold acc_t on the minimum resultant acceleration is determined: in the case of acc > acc_t, the action is ordinary; otherwise it is violent. The resultant angular velocity and resultant attitude angle then classify the violent actions. Their maximum thresholds, gyr_t and angle_t, are determined: in the case of gyr > gyr_t or angle > angle_t, the action is a fall; otherwise it is jumping or jogging. To improve the discrimination between actions, the intersection of the judgments of the three thresholds is taken. Recording the resultant acceleration, resultant angular velocity and resultant attitude angle at time t as acc(t), gyr(t) and angle(t) respectively, the basis for judging an upcoming fall is: acc(t) < acc_t and gyr(t) > gyr_t and angle(t) > angle_t [35][36], as shown in Figure 5. Thresholding is widely used. Lindeman [37] proposed three threshold-based detection criteria: the sum of the acceleration vectors in the XY-plane is greater than 2 g; the sum of the velocity vectors of all spatial components prior to collision is greater than 0.7 m/s; and the sum of the acceleration vectors of all spatial components is greater than 6 g. This added basic thresholds to sensor-based fall detection and laid the theoretical foundation for the multi-threshold detection method. Bourke [38] designed a triaxial accelerometer combined with a biaxial gyroscope, and set threshold values on it to judge whether the wearer is in the falling state. Li Na et al. [39] used a Kalman filter to identify steady and unsteady states, and adopted self-adaptive thresholding to identify running and walking actions in the steady state.
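The decision cascade above can be sketched as a short function. The threshold values here are purely illustrative placeholders, not values from [35][36]; the units assumed are g for acceleration, rad/s for angular velocity and degrees for the attitude angle.

```python
def classify_action(acc, gyr, angle, acc_t=0.6, gyr_t=3.0, angle_t=60.0):
    """Multi-level thresholding sketch.

    acc, gyr, angle: resultant acceleration, angular velocity and
    attitude angle at time t. A fall requires the intersection of all
    three threshold judgments: acc < acc_t, gyr > gyr_t, angle > angle_t.
    """
    if acc > acc_t:
        return "ordinary"          # level 1: ordinary action
    if gyr > gyr_t and angle > angle_t:
        return "fall"              # level 2+3: intersection -> fall
    return "jump/jog"              # violent, but not a fall

# The brief weightlessness of a fall (low acc) combined with a large
# rotation distinguishes it from jumping or jogging.
```

Requiring the intersection rather than the union of the three judgments is what reduces false alarms from single-threshold crossings.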

3.2.2 Support Vector Machine Detection

The SVM makes full use of external factors for statistical analysis and supplements them with weighted adjustment factors, keeping the growth of the algorithm's complexity roughly linear.
The classification interface in Figure 6 correctly separates the two classes, and the distance from the optimal plane to the two nearest points is the largest, as shown by H_1 and H_2 in the figure. By casting the sample data into the classification interval, separating the samples and establishing margins, the overall empirical risk reaches its minimum, and more nonlinear data can be handled when the slack factors are enlarged [40]. The support vector machine is also very commonly used, and many researchers have improved and optimized it on the basis of ordinary SVM detection. Liu Yong et al. [41] proposed a fall detection technique based on changes in acceleration and dip angle, training SVM data samples to detect human posture by making use of feature vectors. Yang et al. [42] used the one-class support vector machine (OCSVM), a semi-supervised learning algorithm, to detect unobserved falling events. Zhang Jinqiao et al. [43] used an ADXL345 sensor and an MSP430F149 single-chip microcontroller as the data acquisition and data processing modules, proposing a multi-level fall detection algorithm fusing acceleration and attitude angle. This study supplemented the theoretical research of Liu Yong et al. [44], which was not complete in its analysis of ADLs data.

COMPARISON AND ANALYSIS
In general, the main detection processes of vision-based and sensor-based motion state detection systems are roughly the same. Both first collect the action features of the human body, then analytically screen the collected results and information and extract the corresponding feature values, and finally distinguish human motion behaviors through a series of algorithms. In the information extraction stage, the former mainly uses one or more cameras for video image collection, with either a conventional camera or a Kinect somatosensory camera [45]; the latter mainly relies on inertial sensors, using a single three-axis acceleration sensor or a nine-axis sensor to measure accelerations and angular velocities, with corresponding detection algorithms to process the data. In the action classification stage, the main methods for both approaches are similar, generally falling into the thresholding method and the support vector machine. The detection effect of the support vector machine is superior to that of the multi-level thresholding method, but its response time lags behind. As processor performance is further optimized and lifted, operation speed and efficiency increase, and the support vector machine proves to be an excellent fall detection algorithm, whose main development tends toward reducing the sensitivity to missing data and finding a general solution to nonlinear problems.
Among the various human detection systems, the vision-based detection system has unique advantages: it is able to detect falls with the help of the increasingly popular home intelligent monitoring devices [46]. Relative to the human motion information acquired by sensors, its accuracy is higher, its design is flexible and its information is easily obtained; since the captured images contain plentiful information, this user-friendly system is convenient to use. The main problems facing this method include [47]: (1) designs based on a 3D human model are complex and computationally intensive, while the other fall detection methods, although relatively light in computation, are easily interfered with by body occlusion, illumination variation and shadows; (2) the detection of actions similar to falling, or performed parallel to the camera, needs to be improved, so certain measures must be taken to reduce the misjudgment rate.
The motion detection method based on wearable sensors realizes fall detection relying on human motion information gathered by a motion sensor device at a relatively low cost. Its wearing convenience enables the user to monitor and measure human motion information in real time, it poses no restrictions on the environment, and it does not reveal the user's privacy, so it is worth promoting. A single three-axis acceleration sensor is simple and easy to use, requiring only a simple method to collect a few human posture parameters; by optimizing and comparing algorithms, the detection accuracy can be improved. A combined nine-axis sensor is able to capture more human parameters and render a more accurate and faster analysis. The main disadvantages are as follows: (1) the relatively low cost of the sensors restricts their manufacturing process, affecting the accuracy of the collected data.
Therefore, to guarantee data accuracy, in addition to exploiting the complementary advantages of the different sensors at a later stage, it is necessary to compensate and calibrate the data at the initial stage of data acquisition [48]. (2) During the measurement process, mechanical vibration and the installation environment inevitably introduce noise into the sensors [49].

CONCLUSION AND PROSPECT
Motion state detection technology serves as a crucial component of the smart home medical monitoring field. The two major implementation technologies today are vision-based and sensor-based motion state detection, both of which are maturing toward a dominant position. A basic comparison is shown in Table 1.

Table 1. Comparison of vision-based and sensor-based detection.

Criterion                       Vision-based                        Sensor-based
Information acquisition speed   Fast                                Fast
Fault tolerance                 High                                Low
Computation load                Large                               Small
User experience                 Relatively poor (privacy exposed)   Relatively good (easy to wear)
Acquisition precision           High                                Low
The sensor-based detection method mostly adopts the thresholding method, and is thus convenient to design and use, low in cost, and easy and flexible to install. It is also suitable to carry, has a wide application range, and is unlikely to expose the privacy of users. However, the sensor needs to be worn for a long time, posing a certain impact on human activities. The vision-based detection method does not require long-time wearing and acquires abundant human motion information. With sophisticated image processing technology, a variety of fall detection methods with high accuracy have been designed, which can detect falls with the help of commonly used intelligent monitoring equipment. However, it is computationally intensive, obtaining relatively accurate results requires a large amount of calculation or training, and the judgment may be inaccurate in an environment with occlusion. On the whole, each of the two methods is a double-edged sword and should be matched to specific scenarios.
The future development of sensor-based detection technology mainly lies in three aspects [50]. The first is to improve the manufacturing process of the sensors, and with it their measurement accuracy and stability. The second is to fuse multi-sensor data: targeting the problem that single-sensor data cannot achieve acceptable recognition accuracy, various sensors including acceleration sensors, magnetic sensors, gravitational acceleration sensors, gyro sensors and air pressure sensors are integrated to collect the indoor motion state data of people and process them jointly. The third is to combine the sensors with multi-layer neural networks for repeated learning on the multi-sensor data, adjusting the network parameters to form an indoor motion state recognition model with strong recognition capability.
Besides, the future development of computer vision-based detection technology also mainly lies in three aspects. The first is to fuse multi-scale information in feature extraction, deeply mining the hidden connections within the data and improving the accuracy of judgment. The second is to add the spatial-temporal information of feature occurrence when using classification algorithms, considering the fusion of multi-source complementary information at different levels, such as features and decisions, and from different dimensions, such as time and reasoning, to improve the detection accuracy. The third is to link the technology to the environmental layout to solve the state detection problem when the object is occluded.
Moreover, with the popularization of the smartphone and the growth of elderly users, smartphone-based fall detection has become a possible future trend [51]. Collecting motion state data through the smartphone is very effective and close to daily life, both in experiments and in real environments. Although the sensor accuracy in smartphones is low, it is convenient for gathering a large amount of motion state information, and high-accuracy motion states can be identified by extracting features from the different motion states. As artificial intelligence advances, the smartphone will make greater progress and the quantity of data it can compute will grow; consequently, on-line identification of the motion state will eventually be realized. With the smartphone, processes such as data acquisition, processing, model training and classification recognition can be integrated to form a complete motion state identification system. What is more, machine learning is also a tendency for future development in the field of fall detection, and the computing resources required by its application pose a severe challenge for the processor and memory of mobile phones. How to implement an efficient algorithm with low resource demands on the smartphone is also a task to be probed.