Finger gesture recognition with smart skin technology and deep learning

Finger gesture recognition (FGR) has been extensively studied in recent years for a wide range of human-machine interface applications. Surface electromyography (sEMG), in particular, is an attractive, enabling technique in the realm of FGR, and both low- and high-density sEMG have been studied previously. Despite the clear potential, cumbersome electrode wiring and electronic instrumentation force contemporary sEMG-based finger gesture recognition to be performed under unnatural conditions. Recent developments in smart skin technology provide an opportunity to collect sEMG data under more natural conditions. Here we report on a novel approach based on a soft 16-electrode array, a miniature wireless data acquisition unit and neural network analysis, aimed at achieving gesture recognition under natural conditions. FGR accuracy values as high as 93.1% were achieved for 8 gestures when the training and test data were from the same session. For the first time, high accuracy values are also reported for training and test data from different sessions and for three different hand positions. These results demonstrate an important step towards sEMG-based gesture recognition in non-laboratory settings, such as gaming or the Metaverse.


Introduction
Finger gesture recognition (FGR) is a widely studied domain in human-machine interfaces (HMIs). Applications include virtual games, where finger gestures can replace a joystick for an improved user experience; medical uses, where FGR can help distinguish between normal and abnormal movements [1]; and sign-language translation [2,3], to name just a few examples [4]. Several approaches have been explored in recent decades for FGR, including video analysis [5,6], smart gloves [7,8], smart bands [9] and surface electromyography (sEMG) [10][11][12][13][14][15]. sEMG in particular is an attractive approach, as it records the electrical activity of arm muscles located away from the fingers, so that finger movements are not restricted. Moreover, it does not require a visual pathway, allowing operation in a dark environment or during movement. sEMG is also sensitive to applied force, even without any apparent movement (isometric muscle activation).
Despite the great potential of sEMG, such measurements pose a number of technical and computational challenges. First, under dynamic activity, motion artifacts are very common. Second, electrode position may vary from session to session or from subject to subject, complicating the analysis. Owing to these challenges, current sEMG-based studies concerned with the detection of hand movements are mostly performed in a controlled environment, with the hand held at a fixed position [11][12][13]. Also, most studies report only on intra-session classification [12,13,15], or on inter-session classification with degraded performance [11,16]. Furthermore, to ensure good electrical contact between the electrodes and the skin, wet electrodes are commonly used [11,12]. These electrodes severely limit the usability of the technology, as they limit session duration and electrode number (and therefore separation capacity). Wires, cumbersome amplification and recording instrumentation, and relatively large electrode arrays [11,13] further limit the technology, restricting it to clinical or laboratory use and mandating skilled personnel for electrode placement and system operation. Previous studies concerning the identification of finger gestures from sEMG signals were performed in a controlled environment, using relatively bulky electrode arrays that require the use of a conductive gel. Such measurements do not allow continuous real-time tracking, and are far from practical use in HMI applications. Flexible electrodes were recently demonstrated to have great potential in classifying hand gestures in a natural environment [17].
In this investigation, we demonstrate FGR under natural conditions for real life applications by addressing the following requirements: First, the system should be compatible with hours of use. Second, recognition of finger gestures should be achieved, regardless of the general position of the hand. Third, system performance must be invariant to precise electrode placement in repeated sessions (removal of the wearable device and re-application on a later day).
To achieve these requirements, we used novel printed electrode arrays. The electrodes are printed on a thin and soft substrate (figure 3(a)) and were studied previously for various applications [18,19]. The electrode arrays used in this study were designed specifically to capture arm muscle activity. Moreover, the arrays were designed with an internal ground for simple and quick placement and robustness against mechanical artifacts. The thinness and elasticity of the arrays allow excellent mechanical coupling to the skin. In this study we also used, for the first time, a new miniature wearable sensing system that allows continuous sEMG measurement, even during dynamic movement. Using such a small, convenient and non-invasive system is an important step towards hand gesture recognition in freely behaving humans. Finally, using deep learning, we demonstrate FGR under different hand positions and invariance to precise electrode placement (in repeated sessions).

Wireless sEMG system
The electrode arrays we tested in this study were purchased from X-trodes Inc. The dry carbon electrodes are 4.5 mm in diameter, organized in a 4 by 4 arrangement, and can be quickly applied to the skin with a built-in adhesive film. Their fabrication is based on a technology previously described in [19]. Briefly, carbon electrodes and silver traces are screen printed on a thin and soft PU film. A second, double-sided adhesive PU film is used for passivation and skin adhesion. Data were recorded with a miniature wireless data acquisition unit (DAU, X-trodes Inc.), which was developed to allow electrophysiological measurements under natural conditions. The DAU supports up to 16 unipolar channels (2 µV root-mean-square (RMS) noise, 0.5-700 Hz) with a sampling rate of 4000 S s−1, 16-bit resolution and an input range of ±12.5 mV. A 620 mAh battery supports DAU operation for up to 16 h. A Bluetooth (BT) module is used for continuous data transfer. The DAU is controlled by an Android application, and the data are stored on a built-in SD card and on the cloud for further analysis. The DAU also includes a 3-axis inertial sensor to measure the acceleration of the hand during the measurements.

Data collection
Eight healthy subjects (aged 18-30) completed two recording sessions with good signal-to-noise ratio (SNR). Electrode arrays were placed on the region of the extensor digitorum muscle of the dominant hand. Muscle location was identified by applying strong abduction of the fingers. During the recording, each subject sat or stood in front of a table (depending on the position of the hand being examined). An instructional video displayed on a computer was used to guide the subjects.
The experiment consisted of two steps: first, the hand was supported on a table, followed by a second stage in which the hand was not supported. The protocol was structured as follows: subjects were first shown a short video presenting different hand gestures. The subjects were then instructed to perform specific gestures, through both voice and visual instructions presented on a computer screen (for 3 s), and then to stop and rest (another 3 s), after which they were instructed to repeat the gesture. The instructions continued until each gesture had been performed 10 times. Altogether, sEMG was recorded for 10 different finger gestures: stretching two fingers, stretching three fingers, stretching all the fingers (abduction), making a fist, as well as six movements that represent letters in the Hebrew sign language: 'Bet', 'Gimel', 'Het', 'Tet', 'Kaf' and 'Nun'. Throughout the performance of the movements, a representative of the research team monitored that the process was conducted as planned. In addition, a Python script was used to send annotations to the Android application, marking the times at which the subject was instructed to start and finish each gesture, to assist the analysis stage.

Data analysis
Data analysis flow is depicted in figure 1.

Filtering
Raw sEMG data were first filtered using 50 Hz and 100 Hz comb filters to reduce power-line interference. A 20-400 Hz 4th-order Butterworth bandpass filter was applied to attenuate non-sEMG components.

Segmentation
Segmentation into time intervals for each gesture (denoted as active time windows) was performed manually; namely, the time window in which each gesture was performed was identified using the annotations made during the recording.

Classification
Each active time window identified in the segmentation stage was divided into 200 ms sub-windows. For each sub-window, the RMS value per channel was derived, resulting in 16 values per sub-window. These 16 values were arranged on a grid according to the spatial locations of the electrodes in the array, resulting in an activation map. The sequence of maps for each active time window was then fed into a classification algorithm. Several algorithmic solutions were explored, as detailed below.
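The map-construction step above can be sketched as follows (a minimal sketch assuming NumPy, an identity channel-to-grid ordering, and L2 normalization of each map; the actual ordering and normalization used in the study are assumptions here):

```python
import numpy as np

FS = 4000          # sampling rate, S/s
SUB_WIN_S = 0.2    # 200 ms sub-windows
N_CH = 16          # 4 x 4 electrode array

def activation_maps(window, grid_order=None):
    """Turn one active time window (n_samples x 16 array) into a
    sequence of 4 x 4 RMS maps, one per 200 ms sub-window.

    `grid_order` maps channel index to grid position; identity order
    is assumed here for illustration.
    """
    samples_per_sub = int(FS * SUB_WIN_S)            # 800 samples
    n_sub = window.shape[0] // samples_per_sub
    if grid_order is None:
        grid_order = np.arange(N_CH)
    maps = []
    for i in range(n_sub):
        sub = window[i * samples_per_sub:(i + 1) * samples_per_sub]
        rms = np.sqrt(np.mean(sub ** 2, axis=0))     # one value per channel
        rms = rms / (np.linalg.norm(rms) + 1e-12)    # normalized map
        maps.append(rms[grid_order].reshape(4, 4))
    return np.stack(maps)
```

At 4000 S/s a 200 ms sub-window spans 800 samples, so a 3 s gesture yields 15 maps per active window.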

Convolutional neural network (CNN)
In this approach, each map is fed separately into a CNN, which outputs a classification. By conducting majority voting between the classifications obtained for different sub-windows of the same active window, a final classification is obtained. The CNN architecture used in this work is depicted in figure 2. This neural network (NN) consists of two convolutional layers, followed by three fully connected layers. Such an architecture was favored as it has a relatively small number of parameters, making it suitable for our relatively small data set and for the small size of the activation maps. Each layer, apart from the last fully connected one, was followed by a ReLU activation [20] and batch normalization [21]. The network was trained with 500-2000 epochs (the number of epochs required for the CNN training loss to converge in the different tasks), with a learning rate between 0.0005 and 0.001 (using the Adam optimizer [22]), a weight decay of 0.0001 and a dropout [23] of up to 0.3.
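The majority-voting step over per-sub-window predictions can be sketched as follows (the function name is hypothetical, and the tie-breaking rule, the smallest label wins, is an arbitrary choice not specified in the text):

```python
import numpy as np

def majority_vote(per_map_labels):
    """Final gesture label for one active window, given the CNN's
    predicted label for each of its 200 ms sub-window maps.

    Ties resolve to the smallest label (np.argmax returns the first
    maximum), an arbitrary convention for this sketch.
    """
    labels = np.asarray(per_map_labels)
    counts = np.bincount(labels)   # votes per gesture label
    return int(np.argmax(counts))
```

For example, predictions [2, 2, 3, 2, 1] over five sub-windows yield the final label 2.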

Recurrent neural network (RNN)
In this approach, the sequence of maps for a certain action is treated as a time series of dimension 16, which is fed into an LSTM-based RNN, and the RNN outputs a classification. The architecture of the RNN consisted of one LSTM layer of 12 units, followed by one fully connected layer. A small number of layers was again favored to match the relatively small data set. The network was trained with 1000-2000 epochs (the number of epochs required for the RNN training loss to converge in the different tasks), using a learning rate between 0.005 and 0.01 (with the Adam optimizer [22]), a weight decay of 0.0001 and a dropout [23] of up to 0.1.
We note that the different hyper-parameters relate to different tasks. For each task (see table 1), hyper-parameters are fixed across all subjects.

Classical algorithms
In this approach, the classification pipeline follows the same steps as in CNN classification, except that the CNN is replaced by either a k-nearest neighbor (KNN) classifier or a multi-class support vector machine (SVM) classifier. For KNN, we set the number of nearest neighbors used for voting to k = 1. For SVM, we used a soft-margin SVM with a radial basis function kernel and C = 100 (where C is the weight given to the slack variables in the SVM loss).
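The k = 1 KNN baseline can be sketched in a few lines (a minimal NumPy sketch: maps are flattened and compared by Euclidean distance, which is an assumption, as the metric is not stated in the text):

```python
import numpy as np

def knn1_predict(train_maps, train_labels, test_map):
    """Classify one 4 x 4 activation map with the 1-nearest-neighbour
    rule: return the label of the closest training map.

    train_maps: (n, 4, 4) array; train_labels: (n,) array of labels.
    """
    X = train_maps.reshape(len(train_maps), -1)        # flatten maps
    d = np.linalg.norm(X - test_map.ravel(), axis=1)   # Euclidean distances
    return train_labels[int(np.argmin(d))]
```

With k = 1 no voting is needed; the single closest training example decides the label.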
We used classical algorithms to examine which cases indeed require neural networks, and in which cases the simpler, classical algorithms still achieve satisfactory results. Specifically, KNN was chosen as a reference model, being one of the simplest classification algorithms, and SVM was chosen as an advanced and popular classical classification algorithm.

Enhancing training quality using hidden Markov model (HMM)
To improve classification accuracy, the training data set was augmented with artificial data, generated as follows: for each of the 10 gestures, a Gaussian HMM with c components was defined. Each HMM was trained for I iterations, using the sequences of activation maps belonging to the active windows of the training data set. From each trained HMM, new sequences of activation maps were generated and added to the training data set. We empirically set the parameters to c = 4 and I = 10.
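The generation step can be sketched as follows, with the HMM parameters assumed to be already fitted (in practice a library such as hmmlearn could fit the c = 4 state model to the real sequences; the stand-alone sampler below is a hypothetical illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_hmm(startprob, transmat, means, covs, length):
    """Sample one artificial sequence of 16-dim activation-map vectors
    from a Gaussian HMM (one Gaussian emission per hidden state).

    startprob: (c,) initial state probabilities
    transmat:  (c, c) state transition matrix
    means/covs: per-state Gaussian emission parameters
    """
    n_states = len(startprob)
    seq = np.empty((length, means.shape[1]))
    state = rng.choice(n_states, p=startprob)
    for t in range(length):
        seq[t] = rng.multivariate_normal(means[state], covs[state])
        state = rng.choice(n_states, p=transmat[state])  # next hidden state
    return seq
```

Each sampled sequence mimics the sequence of 16-value RMS vectors of one active window and is added to the training database.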

A note on the fixed-size 200 ms time windows
While gesture-dependent window sizes may allow better utilization of signal features, fixed-size windows provide the classification model with data of less variable inherent noise, and support real-time operation. If we consider random noise, the length of the time window determines the uniformity of the mean noise amplitude across multiple repetitions (the longer the window, the more uniform the mean noise amplitude across repetitions). Thus, a fixed time window provides classification inputs with similar noise amplitudes, facilitating the classification process. In addition, a gesture-dependent window size requires processing to start only after the gesture has been fully performed, enlarging the delay between gesture and recognition.
The length of the time window is a compromise between several considerations. The EMG frequency range is 20-400 Hz, requiring a window larger than 50 ms to capture more than one cycle of the signal; moreover, considering the random nature of the signal, one cycle would not be sufficient. This implies a trade-off: a larger window decreases randomness, but provides fewer maps for each gesture. The latter makes the majority voting procedure, applied at the end of the classification process for some algorithms (as described above), less efficient. After testing several window sizes on an initial dataset, we found 200 ms to be a good compromise.

Results
To demonstrate reliable FGR with the soft electrodes, we used sEMG data collected from the arms of healthy volunteers to train and test several different classification models. sEMG data were collected and analyzed following these steps: (1) sEMG recording during hand gesturing; (2) sEMG data segmentation; (3) construction of RMS maps for each segment; and finally (4) classification of each segment with a trained NN. Part of the collected data was used for training (see figure 3 for a schematic presentation of the data flow).
Throughout the text, we use the following definitions: For each subject, there are two sessions, each recorded on a different date with a new electrode array placed approximately at the same location on the hand. For each one of these sessions, there are three hand positions (see figure 3), and for each one, there are 100 events, corresponding to ten different gestures, repeated ten times each. In total, we collected 300 events for each session.

sEMG signals
Soft 16-electrode arrays were placed at the region of the extensor digitorum muscle (figure 3(a)). Healthy volunteers performed finger gestures (see section 2.2) while the sEMG activity of the muscle was recorded. The soft nature of the electrodes, along with the small dimensions of the wireless DAU, allowed subjects to perform natural gestures while recording almost artifact-free sEMG data. sEMG data were collected during ten different hand gestures (figure 3(b)) in three hand positions: with the arm supported on a table (position I), with the elbow on the table and the arm bent to 90 degrees (position II), and standing with the arm next to the body (position III). Typical filtered sEMG signals of five gestures at three different electrodes are presented in figure 3(c), demonstrating some degree of variability between different gestures. For each such segment, RMS activation maps were generated. These 4 by 4 matrices provide a normalized representation of sEMG activity in the electrode space and are used as input to the NN.
We selected ten gestures based on their physiological link with the location of the electrode array. This link is already apparent in the filtered sEMG data (figure 3(b)), but it is particularly conspicuous when examining the RMS maps (for clarity, the RMS intensity for each electrode was calculated over the entire duration of each action) (figure 4). Consecutive activation maps of the same gesture appear consistent within the same hand position, while varying between gestures. Importantly, the same gesture appeared to have different maps when the arm position was changed. This result reflects the complexity of the sEMG data. From close examination of other maps, it is evident that small differences in electrode placement (arising either from repeated use by the same subject or from use by different subjects) result in different activation maps. It is therefore important that the classification be robust against these differences, so that network training does not have to be fully repeated for each electrode placement, especially for use by the same individual.

Constructing train and test databases
Each session, for each hand position, contributed 100 events: ten repetitions of each of the ten hand gestures. From each event, we generated normalized RMS maps similar to those described in figure 4. For each subject, the maps obtained from eight of the ten events of each gesture were assigned to the training database, and the remaining maps were added to the test database.

Figure 4. RMS maps. An example of RMS maps of one subject. Each row represents one hand gesture, and the columns represent different repetitions of the same gesture. The first five columns were recorded as the subject's arm was placed on the table, and the next five columns were recorded with the elbow placed on the table and the arm bent to 90 degrees.
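The per-gesture 8/2 split can be sketched as follows (the data layout and function name are hypothetical; the randomized assignment mirrors the repeated evaluations described later):

```python
import numpy as np

rng = np.random.default_rng(0)

def split_events(events_by_gesture, n_train=8):
    """Assign 8 of the 10 events of each gesture to the training set
    and the remaining 2 to the test set.

    events_by_gesture: dict mapping gesture label -> list of 10 events
    (each event being, e.g., a sequence of activation maps).
    """
    train, test = [], []
    for gesture, events in events_by_gesture.items():
        idx = rng.permutation(len(events))         # shuffle event order
        train += [(gesture, events[i]) for i in idx[:n_train]]
        test += [(gesture, events[i]) for i in idx[n_train:]]
    return train, test
```

With ten gestures this yields 80 training events and 20 test events per session and hand position, with every gesture represented in both sets.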

Classification models
Based on the results discussed above, we set out to realize an NN-based classification which is not only accurate but also requires minimal tuning for newly acquired data (different recording sessions). We implemented both convolutional and recurrent NNs. For comparison, we also implemented two classical classification algorithms: KNN and SVM (for a detailed explanation of the algorithms and the classification pipeline, see section 2.3). In addition to the original training data acquired, the training data set of the networks included artificial data, generated from the training data using hidden Markov models (see section 2.3, Enhancing training quality using HMM, for further elaboration).

Evaluating the classification models
We examined the performance of the models in classifying the sEMG data, focusing on the ability to overcome variability between hand positions and sessions. These capabilities are essential for FGR under natural conditions. In order to test these capabilities, we designed three classification tasks (see table 1), each performed separately with each subject: task 1 examined classification within a single session and hand position, task 2 examined classification across hand positions, and task 3 examined classification across sessions. In task 3, the model was additionally fine-tuned with a small amount of data from the new session, i.e. only two repetitions of each gesture, and then tested with 60 events from session II, hand position II. Overall, we evaluated four classification models (KNN, SVM, CNN and RNN). Each evaluation was repeated five times, where in each repetition the data were randomly divided into training, validation and test sets, and the model was trained and tested.
Obtained accuracy values (averaged over N = 8 subjects and five repetitions for each evaluation) are presented in table 2. For tasks 1 and 2, the best results were obtained with SVM, while for task 3, the best results were obtained with CNN. The best average accuracy values obtained for 10-gesture classification are 90.4% for task 1, 87.8% for task 2, and 78.2% for task 3. Reducing the number of gestures to 8 (by removing the two gestures classified with the lowest accuracy, 'Tet' and 'Nun') increases the best accuracy obtained for task 1 to 93.1%. In addition, confusion matrices of the three tasks using CNN are shown in figure 5. They were obtained by accumulating test-data confusion matrices over N = 8 subjects and five evaluation repetitions, as explained above.
For the sake of completeness, we conducted another task dealing with inter-subject classification. The models we examined did not manage to classify gestures in that scenario, apparently due to large inter-subject variability. Further elaboration on that task can be found in the appendix.

Discussion and conclusions
In this investigation, we demonstrated automated classification of sEMG data recorded using a novel user-friendly wireless system. We presented a NN based algorithm which can classify finger gestures under natural scenarios. Specifically, we demonstrated the ability to perform gesture classification which is insensitive to the position of the hand with an accuracy of 87.8%, and to classify hand gestures from a new recording session with an accuracy of 78.2%, using only a short calibration step. For tasks 1 and 2 used in the reported investigation, the SVM-based model outperforms CNN and RNN based models, as well as the classical model of KNN. However, for the more complex task, task 3, the CNN-based model outperforms RNN and the classical models of KNN and SVM. This might suggest that while simpler classification tasks can be performed by classical algorithms, a deep learning approach outperforms classical approaches when dealing with more complicated classification tasks.
An important element contributing to the high performance of the system described here is the soft electrode arrays. We have previously found that the sEMG SNR of these electrode arrays meets the criteria for recording high-quality sEMG signals. Implementing an internal ground and using a new miniature wireless system further contributed to our ability to perform sEMG recordings with almost no mechanical artifact, even under natural conditions and in different hand positions.
The classification accuracy of the system described here is close to the state of the art, achieved with significantly fewer electrodes. Several recent studies reported on sEMG-based FGR (table 3). Atzori et al [12] established the NinaPro database for sEMG-based hand movement classification, and used linear discriminant analysis, KNN, SVM and a multi-layer perceptron (MLP) to classify gestures. Using ten carefully placed electrodes, they were able to distinguish between 52 hand gestures with an accuracy of 76%. In a later study, a CNN was used for classification of the same database, achieving an accuracy of 66.59 ± 6.40% [14]. Other studies focused on high-density EMG (HD-EMG) recordings. Rojas-Martínez et al [10] used activation maps obtained from HD-EMG recordings of the forearm muscles to classify between 12 hand gestures, achieving an accuracy of 90%. Amma et al [11] characterized another database for sEMG-based FGR (CSL-HDEMG), using a 192-electrode array and a naive Bayes classifier to discriminate 27 gestures, achieving an accuracy of up to 90%. Geng et al [13] introduced a new database (CapgMyo), consisting of 8 gestures recorded using a 128-electrode array. Using a CNN, they were able to reach an accuracy of up to 99.5%. They also achieved recognition accuracies of 96.8% and 77.8% on the CSL-HDEMG and NinaPro databases, respectively. Later studies achieved improved results using these databases. Wei et al [15] used a multi-stream (MS) CNN, reaching accuracies of 99.8%, 95.4% and 85% on the CapgMyo, CSL-HDEMG and NinaPro databases, respectively, while Padhy [16] proposed a multilinear singular value decomposition (MLSVD) approach, resulting in accuracies of 98.0% and 98.6% for CapgMyo, and CSL-HDEMG databases, respectively. The first classification task introduced in this work is similar to classification tasks described in previous studies. A comparison is provided in table 3.

Table 3. Comparison between various classification models applied to position I.

Study               Electrodes   Gestures  Method       Accuracy (%)
Atzori et al [12]   10 OttoBock  52        SVM          76.0
Amma et al [11]     192 dry      27        Naive Bayes  90.4
Geng et al [13]     10 OttoBock  52        CNN          77.8
Geng et al [13]     192 dry      27        CNN          96.8
Geng et al [13]     128 dry      8         CNN          99.5
Wei et al [15]      10 OttoBock  52        MS CNN       85.0
Wei et al [15]      192 dry      27        MS CNN       95.4
Wei et al [15]      128 dry      8         MS CNN       99.8
Padhy [16]          192 dry      27        MLSVD        98.0
Padhy [16]          128 dry      8         MLSVD        —
From this comparison, it is apparent that our system achieves an accuracy similar to previous methods, while utilizing a much smaller and more convenient electrode array, without the need for careful electrode placement. Classification tasks equivalent to the three tasks presented in this work were also examined by Moin et al [17]. Using a flexible 64-electrode array and an adaptive machine learning approach, they presented classification of 13 gestures from the same hand position with an accuracy of 97.12%. They also showed the ability to classify gestures in new contexts, including a different hand position, a recording made on a new day, and a recording session performed after the device had been worn for two hours during daily activities. After the classification model was updated with some of the new-context data, the accuracy degraded on average by only 2.39%. However, the larger number of electrodes provides much more information for analysis. When applying their suggested method to data obtained from only 16 electrodes, Moin et al [17] showed that the accuracy for same-hand-position classification drops to around 84%, and the accuracy for different-hand-position classification drops below 84%. In the investigation reported here, data were processed off-line. For many FGR applications, on-line analysis is desired, and can be achieved if data transfer and analysis times are fast enough. Although sEMG requires electrode placement in close proximity to the muscle, the use of 16-electrode arrays and CNN analysis negates the need for very precise placement, allowing accurate classification despite the variability of multiple sessions.
In this investigation we used 16-electrode arrays. These 4 by 4 arrays clearly provide more information than low-resolution sEMG, contributing to effective discrimination between gestures. A higher electrode density may contribute to improved discrimination, in particular if more gestures need to be distinguished. It is important to note, however, that increasing the electrode count may increase data analysis time and DAU dimensions, interfering with the ultimate goal of real-time analysis under natural conditions. sEMG signals also contain information on applied force, which can be important in many applications. In the current study, we did not utilize this feature: RMS maps were normalized and, as such, amplitude information was discarded.
Moreover, spectral analysis was not implemented to gain additional information about the applied force. These topics remain for future investigations.
As we demonstrated here, sEMG has many important benefits compared with video analysis and smart gloves. On the downside, sEMG gesture separation is limited to the specific degrees of freedom associated with the targeted muscle. sEMG may gain from combination with other technologies, such as video or smart gloves, to improve network training and resolution. For example, in this study we did not exploit the three-dimensional acceleration data recorded by the 3-axis inertial sensor built into the wireless DAU.
To conclude, the results we presented here demonstrate an important step in achieving FGR in natural conditions using sEMG signals. The advantage compared to other studies concerns the combination of a minimally interfering wearable system (owing to the small number of dry electrodes and the wireless nature of the system), the real-life scenarios which were examined, and the results which are close to the state of the art, despite the challenges mentioned above. Such advantages make the system a possible candidate for gaming or Metaverse applications.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://datadryad.org/stash/share/khYfwOcRgshRlaRgDivyUxlkznzcledB-fI-Ywwz8Fs.

Conflicts of interest
Adi Ben Ari and Liron Ben Ari report no financial or non-financial conflicts of interest. Cheni Hermon was an employee of X-trodes Ltd. Yael Hanein declares a financial interest in X-trodes Ltd, which commercialized the screen-printed electrode technology used in this paper. She has no other relevant financial involvement with any organization or entity with a financial interest in or financial conflict with the subject matter or materials discussed in the manuscript, apart from those disclosed.

Appendix. Inter-subject classification
In order to test the ability of the suggested models to perform inter-subject classification, we split the original training set into new training and test sets: the inter-subject training set, composed of all gestures from 6 randomly chosen subjects, and the inter-subject test set, composed of all gestures from the remaining 2 subjects. These training and test sets constitute the inter-subject task. We evaluated each model on this task in the same way as in the first three tasks (see section 3.4), and the obtained accuracy values (averaged over 5 repetitions for each evaluation) are presented in table 4. As evident from these results (which are around chance level), the examined models are not capable of performing inter-subject classification. The reason for this shortcoming is physiological differences between subjects, which result in entirely different RMS maps; these differences cannot be handled by the models.