Unveiling EMG semantics: a prototype-learning approach to generalizable gesture classification

Objective. Upper limb loss can profoundly impact an individual's quality of life, posing challenges to both physical capabilities and emotional well-being. Decoding electromyography (EMG) signals offers a path to restoring limb function, but existing methods suffer from limited generalization across subjects due to the diverse nature of individual muscle responses, impeding seamless applicability in broader populations. In this paper, we present a novel deep prototype learning method for accurate and generalizable EMG-based gesture classification. Approach. Leveraging deep prototype learning, we introduce a method that goes beyond direct output prediction: it matches new EMG inputs to a set of learned prototypes and predicts the corresponding labels. Main results. This methodology significantly enhances the model's classification performance and generalizability by discriminating subtle differences between gestures, making it more reliable and precise in real-world applications. Our experiments on four Ninapro datasets suggest that our deep prototype learning classifier outperforms state-of-the-art methods in both intra-subject and inter-subject classification accuracy in gesture prediction. Significance. The results of our experiments validate the effectiveness of the proposed method and pave the way for future advancements in EMG gesture classification for upper limb prosthetics.


Introduction
Over two million people in the USA alone suffer from the devastating disability of upper limb loss, and millions more are at risk of amputation [1]. Amputees face severe difficulties in their daily activities due to the loss of a limb, especially those missing an upper limb [2]. To restore function and independence for these patients, there has been ongoing development of prosthetic hands, arms, and other assistive devices that serve both rehabilitation and robotic purposes. Powered myoelectric prostheses, designed for rehabilitation, aim to provide amputees with intuitive motor control and sensory feedback. These prosthetic limbs are controlled by electromyographic (EMG) signals from the user's muscles, enabling basic hand grasp and arm movements. However, current myoelectric prostheses offer limited dexterity, which hinders the full restoration of natural hand function. As a result, accurate and generalizable EMG pattern recognition techniques are needed to enhance the functionality of these prostheses, allowing for nuanced and precise movements. Emphasizing the generalizability of these techniques across diverse user populations is essential for real-world application, ensuring the benefits reach individuals with varying physiological characteristics. By employing advanced machine learning algorithms and signal processing methods, there is an opportunity to refine EMG-based pattern recognition, enabling a more comprehensive range of gestures and intuitive control, thus bridging the gap between amputees' needs and current prosthetic capabilities.
Smooth and responsive control of a multifunctional myoelectric hand requires advanced signal processing to make accurate predictions of user intent from surface EMG (sEMG) recordings. Conventional EMG-based gesture classification typically involves multiple stages of handcrafted feature extraction and pattern recognition. In particular, raw EMG signals are segmented into discrete gestures, and then handcrafted features are engineered to characterize the relevant patterns of each segment. Finally, these features are used by classifiers such as linear discriminant analysis [3], hidden Markov models [4,5], Gaussian mixture models [6,7], or support vector machines (SVMs) [8,9] to predict gesture labels. While these standard approaches have enabled basic myoelectric control, they lack the flexibility to discover sophisticated patterns in the data and often lead to lower prediction accuracy. Therefore, achieving highly accurate and generalizable EMG-based control requires machine learning methods that can model the complex and non-linear relationships in the EMG signals.
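To make this conventional pipeline concrete, the following minimal sketch, written under illustrative assumptions (random stand-in data, a 200 ms window at an assumed 100 Hz sampling rate, and mean absolute value as the only feature), segments a recording into windows and fits an SVM with scikit-learn; it is a sketch of the general workflow, not the configuration of any cited study.

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def segment(emg, win, step):
    """Slice a (samples, channels) recording into overlapping windows."""
    starts = range(0, emg.shape[0] - win + 1, step)
    return np.stack([emg[s:s + win] for s in starts])   # (n_windows, win, channels)

def mav(windows):
    """Mean absolute value per channel: one classic time-domain feature."""
    return np.abs(windows).mean(axis=1)                 # (n_windows, channels)

# Hypothetical data: 10 s of 8-channel sEMG at 100 Hz with per-sample gesture labels.
rng = np.random.default_rng(0)
emg = rng.standard_normal((1000, 8))
labels = rng.integers(0, 5, size=1000)

windows = segment(emg, win=20, step=1)      # 200 ms windows at 100 Hz
X = mav(windows)
y = labels[19:]                             # label each window by its last sample

clf = make_pipeline(StandardScaler(), SVC(kernel='rbf'))  # a standard SVM classifier
clf.fit(X, y)
```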
In recent years, EMG pattern recognition for gesture classification using deep neural networks (DNNs) has gained increasing attention. These deep learning methods can discover complex patterns in the data that enable accurate predictions of user intent. State-of-the-art DNN-based approaches have achieved over 90% accuracy in classifying up to 53 different hand gestures on public benchmark datasets [10,11]. However, a key limitation, poor generalization across subjects, prohibits the practical use of these methods in real-world applications.
Generalizability in EMG gesture classification for prosthetic devices using deep learning is a vital challenge that needs to be resolved. It ensures adaptability across diverse users and environments, facilitating wider applicability without extensive, individualized training. A model with strong generalization and reliable classification accuracy performs dependably on new data, which is crucial for real-world scenarios where users exhibit varied muscle activities. This ease of deployment simplifies fitting the device to different users, improving the overall user experience by offering consistent control. Additionally, it drives technological advancements, potentially leading to more natural movements and increased functionality, ultimately enhancing users' quality of life. Existing methods perform well when trained and tested on the same subject, but most perform poorly when tested on new subjects. Thus, high intra-subject performance on benchmark EMG datasets does not guarantee general and dependable control that can translate into practical prosthetic devices, and whether these devices can generalize to different users remains an open question.
To address these challenges, we propose a novel approach to learning generalizable representations based on prototype learning. The proposed method allows the DNN model to characterize the relationships within EMG sensor data by clustering similar gesture examples into prototypes. The deep prototype learning classifier is trained on a large dataset with synchronized digital glove recordings to group similar EMG patterns into a set of gesture prototypes corresponding to joint movements. To perform prediction on new EMG inputs, the classifier first measures their similarity to each prototype and then predicts the gesture class based on these similarity measures. This two-stage approach, from learning the prototypes to predicting labels, enables the model to be highly general while still discriminating subtle differences between gestures. The set of prototypes essentially encodes a generalizable representation of the EMG patterns for each gesture. The prototypes can be visualized or analyzed to gain insight into the features that are important for the classification of a particular class.
With this new approach, we conduct comprehensive experiments to compare different EMG features and prototypes. The results show that deep prototype learning classifiers consistently outperform state-of-the-art methods across multiple public datasets. Most importantly, while previous methods struggle to generalize across subjects due to inter-subject variability, our prototype approach achieves improved accuracy on new subjects by matching their EMG inputs to its gestural prototypes. In sum, the contributions of this paper are summarized as follows:
• We propose a deep prototype learning method for EMG gesture classification. By predicting the gesture class of an input EMG signal based on its closest gestural prototypes, this approach demonstrates generalizability to new subjects.

Related work
In recent years, there has been growing interest in developing machine learning methods for EMG-based hand gesture classification. Below, we summarize the related works with an outline of EMG feature extraction methods. Additionally, we provide a summary of both conventional machine learning methods and deep learning-based approaches.

EMG feature extraction
Many studies [12-14] have extensively compared the quality of processed features, usually based on the performance of the machine learning or deep learning gesture classifiers. The feature selection process can significantly affect classifier performance, as each processed feature creates a different learning space, which ultimately changes the learning of the optimal decision boundaries for the given gestures. Furthermore, it is crucial to note that feature selection has implications for computational cost, as redundant features introduce unnecessary overhead. Thus, selecting high-quality feature candidates is one of the most important aspects of designing an effective and efficient EMG-based system. Feature extraction involves capturing relevant information from the raw EMG signals and representing it as feature vectors that can be processed by the classifiers. The feature vectors can be obtained in different domains, including the time domain [15-17], frequency domain [16-18], time-frequency domain [17,19], and spatial domain [17,20]. In the time domain, features such as the mean, variance, root mean square, and waveform length provide information about the amplitude and shape of the signal over time. In the frequency domain, features derived from the power spectrum, such as spectral moments or the spectral centroid, capture the distribution of signal energy across different frequency components. In the time-frequency domain, techniques such as the short-time Fourier transform (STFT) or wavelet transform can be employed to obtain features that reveal how the frequency content of the signal changes over time. In the spatial domain, features related to electrode placement or muscle activation patterns are often calculated.
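As an illustration of these domains, the sketch below computes one representative feature from each for a single window; the sampling rate, window length, and random stand-in signal are assumptions.

```python
import numpy as np
from scipy.signal import stft

fs = 2000                                            # assumed sampling rate (Hz)
x = np.random.default_rng(1).standard_normal(400)    # one 200 ms window of one channel

# Time domain: amplitude and shape descriptors.
rms = np.sqrt(np.mean(x ** 2))                       # root mean square
wl = np.sum(np.abs(np.diff(x)))                      # waveform length
zc = np.sum(np.diff(np.signbit(x).astype(int)) != 0) # zero crossings

# Frequency domain: distribution of energy over frequencies.
spectrum = np.abs(np.fft.rfft(x)) ** 2               # power spectrum
freqs = np.fft.rfftfreq(x.size, d=1 / fs)
mean_freq = np.sum(freqs * spectrum) / np.sum(spectrum)  # spectral centroid

# Time-frequency domain: how frequency content evolves within the window.
f, t, Z = stft(x, fs=fs, nperseg=64)                 # short-time Fourier transform
tf_energy = np.abs(Z) ** 2                           # (freq bins, time frames) energy map
```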

Machine learning to EMG-based gesture classification
In the field of EMG signal classification, various machine learning algorithms have been used [21], including naive Bayes (NB), k-nearest neighbors (k-NN), SVMs, decision trees, and models developed from these basic models [21-23]. These classifiers take the EMG signal as input and predict the gesture classes. However, these models often achieve lower classification accuracy for EMG-based gesture classification than deep learning-based approaches [17]. In recent years, the integration of deep learning approaches into EMG-based gesture classification systems has gained considerable attention; for instance, Atzori et al [10] applied convolutional neural networks to classify hand gestures on the Ninapro benchmarks. Prototype learning has also emerged as a complementary paradigm, and several notable works have demonstrated its effectiveness in various domains [27-30]. Prototype learning methodologies have recently found application in the field of EMG gesture classification, primarily serving two overarching objectives: mitigating computational expenses [31,32] and detecting unknown gestures [33,34]. A seminal study by Sziburis et al [31,32] employed prototype-based classification to reduce the computational cost of gesture recognition.

Methodology
Our proposed deep prototype learning method consists of prototype learning and gesture classification processes. Prototype learning aims to learn the prototypes with a multi-label classification paradigm to characterize the movements of hand joints. These prototypes encode joint-level kinematic activation related to the gestures and serve as critical components of gesture classification. We train a gesture classifier by transferring prototype parameters and relevant knowledge from the prototype learning stage. This section outlines the EMG-based gesture classification problem and presents our proposed prototype learning and prototype-based classification methodologies.

Problem statement
Let X_s be the 2D vector of raw EMG signals directly curated from electrode s. Generally, the standard deep learning paradigm operates on preprocessed data, as illustrated in equation (1), where the function f_1 preprocesses X_s into X_(δ,s) using various techniques (e.g., denoising with filters and downsampling) together with time-window computations, with δ denoting the total number of time windows:

$$X_{(\delta,s)} = f_1(X_s). \qquad (1)$$

Subsequently, the function f_2 in equation (2) maps X_(δ,s) to a set of features λ, each reflecting a distinct property such as statistical measures (e.g., variance), frequency characteristics (e.g., average power spectrum), or general knowledge (e.g., zero crossings). f_2 returns a 4D tensor output, where each tensor is the union of single-window features:

$$X_{(\forall\lambda,\forall\delta,\forall s)} = f_2(X_{(\delta,s)}). \qquad (2)$$

To predict the classification label from the processed features, a classifier M(·) is defined as

$$\hat{y} = \sigma(M(X_{(\forall\lambda,\forall\delta,\forall s)};\, W)), \qquad (3)$$

where σ is the softmax activation function returning the soft probability of each gesture class, W represents the trainable parameters, and ŷ refers to the final output. The objective of gesture classification is to minimize the cross-entropy loss defined in equation (4), where i denotes the index of the samples and y_i represents the class label of the sample with index i:

$$\mathcal{L} = -\sum_{i} y_i \log \hat{y}_i. \qquad (4)$$

Existing methods for upper-limb gesture classification based on EMG signals [4-13, 15] mainly employ end-to-end models that implicitly learn the correlations between the EMG-based input features X_(δ,s) and the gesture labels ŷ. The main challenges with such black-box networks are their limited ability to generalize to new data and their complexity, which hinders adaptability and transparency. This compromises interpretability, making it difficult to address overfitting or bias. To overcome these challenges and improve generalization, we propose a deep prototype learning framework. This framework decomposes EMG patterns into atomic prototypes, enhancing interpretability and aiding generalization across different conditions and datasets for practical applications.
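For concreteness, a toy sketch of the f_1/f_2 pipeline and the resulting tensor shapes might look as follows; all sizes, strides, and the three features chosen are illustrative assumptions rather than the paper's exact configuration.

```python
import numpy as np

S, T = 10, 20000            # electrodes and raw samples per electrode (assumed)
WIN, STEP = 400, 20         # window length and stride in samples (assumed)

X_raw = np.random.default_rng(2).standard_normal((S, T))

def f1(X):
    """Preprocess and segment: returns (delta windows, S electrodes, WIN samples)."""
    starts = range(0, T - WIN + 1, STEP)
    return np.stack([X[:, s:s + WIN] for s in starts])        # (delta, S, WIN)

def f2(Xw):
    """Map each window to lambda features per electrode."""
    feats = [np.abs(Xw).mean(-1),                    # mean absolute value
             np.sqrt((Xw ** 2).mean(-1)),            # root mean square
             np.abs(np.diff(Xw, axis=-1)).sum(-1)]   # waveform length
    return np.stack(feats, axis=1)                   # (delta, lambda, S)

X_win = f1(X_raw)
X_feat = f2(X_win)
# Stacking N such recordings yields the 4D classifier input
# X_(forall lambda, forall delta, forall s) of shape (N, lambda, delta, S).
print(X_win.shape, X_feat.shape)
```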

Prototype learning method
Establishing a coherent connection between high-level semantics and low-level EMG features is crucial for generalization and interpretation in gesture classification. Integrating abstract gesture concepts with detailed muscle activity enhances the model's ability to generalize across various scenarios and provides deeper insight into the learned representations. To achieve this, we employ semantic prototypes. These prototypes embody meaningful and distinctive gesture patterns, making them valuable in deciphering the underlying intentions of EMG signals. For example, they may represent EMG signals encoding patterns of hand movements such as the flexion or extension of a finger joint. Prototype-based gesture classification is then achieved by matching EMG features to the prototypes of different gestures.
As shown in figure 1, our prototype learning method is based on a baseline CNN architecture, with a novel prototype layer that learns to factorize gestures into prototypical bases. Different from conventional methods that classify gestures based on the penultimate-layer features O, we decompose gestures with trainable prototypes P and utilize the combinations of their matched prototypes for classification. To achieve this, we compute the similarity between the deep features O and the prototypes P as

$$\alpha_p = \delta(O \, P^{\top}), \qquad (5)$$

where δ is the sigmoid activation function for normalization. Therefore, by strategically delving into the penultimate-layer output of the baseline, the prototypes can encapsulate semantic-level insights rather than low-level details of the EMG signals.
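A minimal PyTorch sketch of such a prototype layer, assuming the similarity in equation (5) is an inner product between the deep features O and the prototypes P squashed by a sigmoid; the feature dimension and prototype count are illustrative.

```python
import torch
import torch.nn as nn

class PrototypeLayer(nn.Module):
    """Matches deep features O against trainable prototypes P.

    A sketch assuming an inner-product similarity followed by a sigmoid,
    as in equation (5); sizes are illustrative assumptions.
    """
    def __init__(self, feat_dim=512, n_prototypes=100):
        super().__init__()
        # One trainable prototype per fine-grained joint movement.
        self.P = nn.Parameter(torch.randn(n_prototypes, feat_dim) * 0.01)

    def forward(self, O):                    # O: (batch, feat_dim)
        sim = O @ self.P.t()                 # (batch, n_prototypes)
        return torch.sigmoid(sim)            # alpha_p in [0, 1]

layer = PrototypeLayer()
alpha_p = layer(torch.randn(4, 512))         # (4, 100) prototype activations
```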
Figure 1. Overview of the proposed prototype learning method. The prototypes are learned through the multi-label classification of binary-coded joint movements. This approach establishes an association between prototypes and joint movements, thereby facilitating the interpretable classification of gestures using joint-kinematic information. By leveraging this transferred knowledge, our method not only learns representative prototypes but also provides insight into the underlying joint movements associated with each gesture type.

To obtain the semantic meanings of the prototypes, we leverage the Cyberglove [35] data synchronously collected with the EMG signals and perform multi-label classification [36,37] to learn prototypes from the data. The goal is to extract meaningful information from the fine-grained joint movements and learn discriminative prototypes that capture the relationship between joint movements and performed gestures. Specifically, the multi-label classification output y_multi is predicted based on the α_p in equation (5) to fit the ground-truth joint movement labels obtained from the digital glove data, where the classification labels (i.e., binary multi-labels) indicate the flexion or extension of each joint, decided by comparing the joint angles between the resting state and the gesture state. We train the network with a standard binary cross-entropy (BCE) loss for multi-label classification and extract the prototypes P after training. This method factorizes the gestures encoded in the EMG signals into a set of bases formed by the prototypes, so the learned prototypes encode important semantics representing a variety of joint movements.
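The binary joint-movement targets and the BCE objective described above might be sketched as follows; the 5-degree tolerance, the linear prediction head, and the use of 21 joints (the 22 Cyberglove sensors minus the excluded sensor 11) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def joint_movement_labels(rest_angles, gesture_angles, tol=5.0):
    """Binary multi-labels: 1 where a joint flexes/extends away from rest.

    rest_angles, gesture_angles: (n_joints,) Cyberglove readings in degrees.
    The 5-degree tolerance is an illustrative assumption.
    """
    return (torch.abs(gesture_angles - rest_angles) > tol).float()

n_prototypes, n_joints = 100, 21            # 21 joints after excluding sensor 11
head = nn.Linear(n_prototypes, n_joints)    # predicts per-joint movement from alpha_p
criterion = nn.BCEWithLogitsLoss()          # standard BCE for multi-label training

# One training step on a batch of 4 hypothetical samples.
alpha_p = torch.rand(4, n_prototypes)                            # prototype activations
rest = torch.zeros(n_joints)
gesture = torch.randn(n_joints) * 10                             # stand-in glove angles
targets = joint_movement_labels(rest, gesture).repeat(4, 1)      # binary-coded movements
loss = criterion(head(alpha_p), targets)
loss.backward()
```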

Gesture classification with prototypes
With our prototypes constructing a joint-level semantic space bridging different gestures, we further leverage the learned prototypes to adaptively incorporate the distinctive correlation between joint movements and gesture classes. Figure 2 provides an overview of the proposed gesture classification model with prototype matching. Instead of predicting the gesture labels directly from the deep features O, it takes advantage of the joint-level patterns encoded in the learned prototypes by computing the similarity measure α_p as described in equation (5). It then predicts the gesture label ŷ from α_p. The network architecture is similar to the one used in prototype learning, with the final classification layer changed to predict the gesture labels. The model parameters were initialized with those from the prototype learning network and fine-tuned using the gesture labels, while the prototypes remained static. With this approach, the network associates joint movements based on their relationships with distinct prototypes and adaptively integrates the relevant prototypes to predict the gestures.
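The transfer step can be sketched as below, reusing the PrototypeLayer from the earlier sketch: the backbone is initialized from the prototype-learning network, the prototypes are frozen, and a new head predicts the gesture classes; the module names and class count are assumptions.

```python
import torch.nn as nn

class GestureClassifier(nn.Module):
    """Second-stage classifier: frozen prototypes, new gesture head."""
    def __init__(self, backbone, proto_layer, n_gestures=53):
        super().__init__()
        self.backbone = backbone                   # weights from prototype learning
        self.proto_layer = proto_layer
        self.proto_layer.P.requires_grad_(False)   # prototypes remain static
        self.head = nn.Linear(proto_layer.P.shape[0], n_gestures)  # new head

    def forward(self, x):
        O = self.backbone(x)                 # deep features
        alpha_p = self.proto_layer(O)        # similarity to the frozen prototypes
        return self.head(alpha_p)            # gesture logits, fine-tuned with CE loss
```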

Experiments
In this section, we present the implementation details and carry out experiments to analyze the proposed method. First, we present the experiment setups, including the datasets, compared methods, evaluation protocols, and implementation details. Second, we evaluate the gesture classification performance of our proposed deep prototype learning method by comparing it to state-of-the-art methods, the baseline network, and different variants of our method. Next, we assess the generalizability of our proposed methods by conducting an inter-subject evaluation. Finally, we present qualitative examples to visualize and interpret the learned semantic prototypes.
Table 1. Joints and abbreviations of the Cyberglove sensors [38]. The F and A in the 'Notation' column refer to flexion and abduction, and the numbers 1-5 indicate the following: 1 (thumb), 2 (index), 3 (middle), 4 (ring), and 5 (pinky). This joint notation denotes the specific movement captured by each Cyberglove sensor, aiding the interpretation of the kinematics of each joint during the qualitative analysis in section 4.5.

Experiment setups

Datasets
The primary objective of our experiments is to evaluate the performance and generalizability of our method for EMG-based gesture classification. To achieve this, we conducted experiments using four well-established and publicly available EMG datasets: Ninapro DB1, DB2, DB3, and DB5. These datasets cover a wide range of experimental settings, including both basic and complex gestures recorded from intact and amputee subjects. They provide synchronized information from multiple data sources, including surface EMG data collected from up to 16 locations on the users' dominant forearm and biceps. The gesture categories within each dataset include basic finger movements, simple hand movements and grips on the wrist and forearm, as well as more intricate grasping motions involving various objects.
All four datasets provide hand gesture data collected using the Cyberglove device, an effective tool for capturing the kinematic information of hand gestures [39]. The Cyberglove data comprise readings from 22 joint-angle sensors that record the rotation trajectories of distinct joints; these kinematic data are synchronized with the EMG signals. Table 1 provides the terminology for the specific joints measured with the Cyberglove [38]. For the multi-label classification in DB3, we used data from only subjects 5 and 6, excluding the other subjects due to noisy sensor data. Furthermore, we excluded the data of Cyberglove sensor 11 due to similar noise issues [40].
Detailed specifications of these datasets are provided in table 2. To maintain consistency, we split the training and test trials following the identical settings used in MV-CNN [41]. Spanning these diverse datasets, and employing the described exclusions and partial implementations when necessary, our experiments comprehensively evaluate the effectiveness and applicability of our proposed method for EMG-based gesture classification across various scenarios and challenges.

Compared methods
We conduct a comprehensive comparison of our proposed gesture classification method with existing CNN-based gesture classification models, including GengNet [42], Cheng et al [43], Wei et al [44], E2CNN [45], Yang et al [46], Zhai et al [25], Ding et al [47], Chen et al [26], Vitale et al [48], Peng et al [49], AtzoriNet [50], CNNLM [51], EVCNN [24], Hu et al [52], Pizzolato et al [53], MSCNet [54], DVMSCNN [55], and MV-CNN [41]. These models, like ours, are CNN-based classifiers that independently classify each frame of EMG data. Many of them have achieved state-of-the-art performance on the Ninapro datasets. For example, E2CNN [45] and Yang et al [46] have both achieved over 90% classification accuracy on the Ninapro DB1 dataset. Particular emphasis is given to the comparison with MV-CNN [41], as it has been extensively evaluated on diverse public datasets and demonstrated consistently high accuracy over all of them. The comparison with these models aims to highlight the superiority of our deep prototype learning method over the state-of-the-art methods and showcase the added value of the proposed prototype learning in improving gesture classification accuracy and generalizability.

Furthermore, to provide a meaningful evaluation, we compare the performance of our deep prototype learning method with two baseline models. The first baseline, referred to as Baseline-CNN, is a plain CNN model without the prototype layer, as illustrated in figure 3. The network starts with a batch normalization (BN) layer, which normalizes the input data to accelerate the training process and improve the model's generalization. This is followed by two 2D convolution (Conv2D) layers, each followed by a BN layer. The network further processes the data using two locally connected 2D layers (LC2D), which allow the model to capture more fine-grained spatial information; a BN layer follows each locally connected layer. To prevent overfitting and enhance the model's generalization, a dropout layer with a rate of 0.5 is used after the BN layer. This is followed by two fully connected layers with 512 neurons each, which learn high-level abstract representations of the input data. A dropout layer with a rate of 0.65 is applied to regularize each fully connected layer. The feature outputs of the fully connected layers, denoted as O, focus on high-level gesture patterns instead of low-level details in the EMG data [56,57]. Finally, the last fully connected layer computes the predicted output from the features O. Table 3 presents a comprehensive list of the baseline network layers. The second baseline, Baseline-Proto, follows our proposed architecture but is trained directly on the gesture labels rather than learned with the joint movement labels. These baselines help demonstrate the effectiveness of incorporating joint movement labels into the learning process and its impact on the overall performance of the model.
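A sketch of the described Baseline-CNN in PyTorch follows. Because PyTorch has no built-in locally connected layer, a small unshared-weight module is implemented with unfold; the input size, channel counts, kernel sizes, and ReLU activations are illustrative assumptions, with table 3 carrying the exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocallyConnected2d(nn.Module):
    """Conv2d-style sliding windows, but with no weight sharing across locations."""
    def __init__(self, in_ch, out_ch, in_hw, kernel):
        super().__init__()
        h, w = in_hw
        self.k = kernel
        self.oh, self.ow = h - kernel + 1, w - kernel + 1
        self.weight = nn.Parameter(                      # one filter per output location
            torch.randn(self.oh * self.ow, out_ch, in_ch * kernel * kernel) * 0.01)
        self.bias = nn.Parameter(torch.zeros(out_ch, self.oh, self.ow))

    def forward(self, x):
        patches = F.unfold(x, self.k).transpose(1, 2)    # (B, locations, C*k*k)
        out = torch.einsum('blf,lof->blo', patches, self.weight)
        return out.transpose(1, 2).reshape(x.size(0), -1, self.oh, self.ow) + self.bias

def baseline_cnn(in_ch=1, n_classes=53):
    """BN -> 2x(Conv2D+BN) -> 2x(LC2D+BN) -> dropout -> FC stack, as described;
    assumes an 11x10 input (11 features x 10 channels) for illustration."""
    return nn.Sequential(
        nn.BatchNorm2d(in_ch),
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        nn.Conv2d(32, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU(),
        LocallyConnected2d(32, 16, (11, 10), 3), nn.BatchNorm2d(16), nn.ReLU(),
        LocallyConnected2d(16, 16, (9, 8), 3), nn.BatchNorm2d(16), nn.ReLU(),
        nn.Dropout(0.5),
        nn.Flatten(),
        nn.Linear(16 * 7 * 6, 512), nn.ReLU(), nn.Dropout(0.65),
        nn.Linear(512, 512), nn.ReLU(), nn.Dropout(0.65),
        nn.Linear(512, n_classes),
    )

logits = baseline_cnn()(torch.randn(2, 1, 11, 10))   # (2, 53) gesture logits
```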

Evaluation
To assess the performance of our models, we conducted both intra-subject and inter-subject evaluations. For the intra-subject evaluation, we followed the experimental settings in [41], where the training and test datasets were split according to the details presented in table 2. For Ninapro DB1, we trained the models using the 1st, 3rd, 4th, 6th, 7th, 8th, and 9th trials, and then tested them on the remaining trials for each subject. Similarly, for Ninapro DB2, DB3, and DB5, the models were trained using the 1st, 3rd, 4th, and 6th trials and tested on the remaining trials for each subject. The overall gesture classification accuracy was then averaged across all subjects in each database.
For the inter-subject evaluation on DB1, we employed leave-one-subject-out cross-validation (LOSOCV). In this approach, the data from each subject were used as the test set, and the models were trained using data from all other subjects. This process was repeated for each subject, and the gesture classification accuracy was averaged across all subjects to obtain a comprehensive evaluation of model performance. For the inter-subject experiments on DB2, DB3, and DB5, due to their larger numbers of subjects, we employed four-fold cross-validation by randomly splitting the subjects into four blocks and using three blocks for training and the remaining one for testing. These settings also follow the experimental protocol of the MV-CNN study [41].
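Both inter-subject protocols map directly onto scikit-learn's group-aware splitters, as in the sketch below; the feature array X, labels y, and per-window subject IDs are hypothetical stand-ins.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, GroupKFold

rng = np.random.default_rng(3)
X = rng.standard_normal((600, 11))   # hypothetical per-window features
y = rng.integers(0, 5, 600)          # gesture labels
groups = rng.integers(0, 27, 600)    # subject ID per window (DB1 has 27 subjects)

# DB1: leave-one-subject-out cross-validation.
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
    pass  # train on 26 subjects, test on the held-out subject

# DB2/DB3/DB5: four-fold cross-validation over subject blocks.
for train_idx, test_idx in GroupKFold(n_splits=4).split(X, y, groups):
    pass  # three subject blocks for training, one for testing
```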
To quantify the performance of our models, we utilized gesture classification accuracy as the primary evaluation metric. This metric indicates the overall percentage of correctly predicted gestures in the test set, providing a reliable measure of model effectiveness. In addition to accuracy, we also report the confusion matrix for each dataset. The confusion matrix is based on the normalized sum of the intra-subject gesture classification results. It offers valuable insights into the model's performance in distinguishing between different gesture classes and helps identify any patterns of misclassification. Together, these evaluation methods and metrics comprehensively assess the accuracy and generalization capabilities of our proposed models across different databases and subjects, providing a robust analysis of their performance in EMG-based gesture classification.

Dataset preprocessing
During the experimental procedures, the EMG feature set denoted Phin_FS1 [58] was derived from the Ninapro DB1, DB2, DB3, and DB5 datasets. Conventional sEMG-based gesture recognition frameworks [3, 6, 9-11, 14-19, 21-23] using machine learning employ a sliding-window data processing strategy to reduce high-dimensional signals into discriminative features, aiming to decrease computational complexity, emphasize signal characteristics, attenuate noise, and enhance generalization capabilities for building a practical gesture classification model. To ensure uniformity and adhere to established protocols, each dataset underwent feature extraction using 200 ms time windows with an increment of 10 ms, preprocessing with a low-pass filter with a cut-off frequency of 2 kHz, and downsampling, following the experimental protocol outlined in [41]. The Phin_FS1 feature set represents a widely acknowledged feature extraction method within the classical limb gesture recognition domain, incorporating effective time and frequency domain analyses [59], as illustrated in table 1 of the prior work [41]. The resultant feature set comprises 11 distinct features: mean absolute value (MAV), MAV slope, waveform length, Willison amplitude, zero crossings, four autoregression coefficients, mean frequency, and the power spectrum ratio. These were extracted from each EMG channel, resulting in 11 features per channel for each time window. To prepare the data as input to the compared models for gesture classification, we transformed the extracted features into signal images. By arranging the 11 features into a 2D matrix for each time window, the resulting signal images effectively captured the internal temporal patterns. For instance, the MAV variations, waveform length, and preprocessed EMG signal for sensor channel 1 are visually represented in figure 4, expressing similar patterns in the time domain and aiding the identification of nuanced temporal dynamics.
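A sketch of this windowing and signal-image construction, using the stated 200 ms window and 10 ms increment, follows; the sampling rate, channel count, and the placeholder feature function are assumptions standing in for the full Phin_FS1 implementation.

```python
import numpy as np

fs = 2000                                   # assumed sampling rate (Hz)
win, step = int(0.2 * fs), int(0.01 * fs)   # 200 ms windows, 10 ms increment
n_channels, n_feats = 12, 11                # e.g., 12 sEMG channels, 11 Phin_FS1 features

def window_features(x):
    """Placeholder for the 11 Phin_FS1 features of one channel window
    (MAV, MAV slope, WL, Willison amplitude, ZC, 4 AR coefficients,
    mean frequency, power spectrum ratio); illustrative only."""
    return np.full(n_feats, np.abs(x).mean())

emg = np.random.default_rng(4).standard_normal((20000, n_channels))
images = []
for s in range(0, emg.shape[0] - win + 1, step):
    w = emg[s:s + win]                                                    # (win, channels)
    img = np.stack([window_features(w[:, c]) for c in range(n_channels)], axis=1)
    images.append(img)                                                    # (11, channels)

images = np.asarray(images)   # (n_windows, 11 features, n_channels) signal images
print(images.shape)
```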

Model hyperparameter setting
For our experiments, we set the number of prototypes to 100, aligning it with the number of fine-grained joint movements for effective representation.
The network was trained using the Adam optimizer [60] for a varying number of epochs: 100 epochs for DB1 and DB5, and 400 epochs for DB2 and DB3. The batch size was set to 128 for DB1 and DB5, and 32 for DB2 and DB3. The learning rate was reduced by 50% every 20 epochs for DB1 and DB5, and every 40 epochs for DB2 and DB3. The initial learning rate for all datasets was 0.002. All experiments were performed in a GPU environment with an RTX 3080 Ti GPU with 16 GB of VRAM and an RTX 3070 Ti GPU with 8 GB of VRAM.
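The stated optimization settings map onto a standard step scheduler, sketched below for the DB1/DB5 configuration with a placeholder model.

```python
import torch
import torch.nn as nn

model = nn.Linear(11, 53)                  # placeholder for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=0.002)
# Halve the learning rate every 20 epochs (DB1/DB5); step_size=40 for DB2/DB3.
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)

for epoch in range(100):                   # 100 epochs for DB1/DB5, 400 for DB2/DB3
    x = torch.randn(128, 11)               # batch size 128 for DB1/DB5, 32 for DB2/DB3
    y = torch.randint(0, 53, (128,))
    loss = nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```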

Intra-subject performance
We begin by demonstrating the effectiveness of our proposed method in the intra-subject evaluation. The results presented in table 4 clearly show that our method outperforms all compared methods in predicting gestures for the same subject. Even when compared with the state-of-the-art MV-CNN [41], which leverages multiple feature sets and a multi-view architecture, our method surpasses it by a significant margin. It is worth noting that DB3 poses a challenge for all methods, as evidenced by the relatively lower accuracy scores compared to the other databases. However, our method still demonstrates superior performance on this database, indicating its capability to handle the complexities of the gestures present in DB3. This indicates the effectiveness of our proposed approach in accurately recognizing gestures from EMG signals.
Comparing our method with the baseline CNN and prototype learning methods, we observe significant improvements in classification accuracy. First, even without the prototype layer introduced in the proposed method, the Baseline-CNN performs better than most state-of-the-art methods on DB1, DB2, and DB5; however, it exhibits lower accuracy on DB3. The comparison between Baseline-CNN and Baseline-Proto demonstrates that the addition of a prototype layer improves the overall classification accuracy by 2%-7%. This result suggests that the prototype layer effectively captures unique joint patterns, serving as key indicators for accurate classification. Furthermore, our method surpasses Baseline-Proto, most notably in the substantial improvement of classification accuracy from 63.25% to 76.78% on DB3. In essence, our model's proficiency in capturing discriminative joint movement patterns in EMG signals significantly contributes to its elevated classification performance, as evidenced by the consistent improvements across diverse databases.
The confusion matrices generated by our methodology are presented in figure 5, offering a comprehensive analysis of the relationships between the predicted and ground-truth gesture classes for the first subject of each dataset in the intra-subject classification task. While the diagonal pattern in the confusion matrices suggests good overall accuracy across the different gesture classes, a closer look reveals that specific gestures pose more difficulty for the model than others. Specifically, in the case of DB2, gestures 42 through 50 ('Tip pinch grasp' to 'Cut with knife') were more prone to classification errors than the others. For DB3, gestures 42 through 46 and 48 ('Tip pinch grasp' to 'Extension type grasp', and 'Tripod grasp') were prone to being misclassified as gesture 49 ('Turn a screw'), while for DB5, gestures 25 ('Wrist flexion') and 29 ('Wrist extension with closed hands') manifested considerably elevated error rates. In the context of DB1, marked by notably higher classification accuracies, no consistent misclassification patterns are observed. This analysis demonstrates the overall high accuracy of gesture classification within subjects and highlights areas for further model refinement.

Inter-subject performance
Assessing the generalizability of a gesture classification model is of paramount importance, as it ensures its effectiveness in recognizing the diverse patterns exhibited by different subjects. In this study, we conducted an inter-subject evaluation to determine the extent to which our proposed deep prototype learning model can effectively generalize its learned knowledge to new subjects.
The results presented in table 5 demonstrate the superiority of our proposed method over state-of-the-art alternatives such as MV-CNN and the two baseline methods. The MV-CNN and Baseline-CNN methods achieve relatively low accuracy on all databases, which can be attributed to the challenges of generalizing a model's knowledge to new subjects. The low accuracy may stem from individual differences in muscle size, shape, and electrode placement affecting EMG signal generation, making models learned on one subject poorly generalizable to others. Variations in skin impedance, sweating, and electrode-skin contact can also introduce noise and artifacts, impacting classification accuracy. Despite these challenges, the Baseline-Proto method exhibits improved accuracy on all datasets. The introduction of the prototype layer helps boost performance compared to the baseline CNN, particularly on DB1 and DB3, indicating its ability to handle individual variability.

Further performance analyses
In this section, we delve further into additional performance metrics to build a more comprehensive picture of our prototype learning model. We report metrics for both the intra- and inter-subject classification tasks on the same datasets, including precision, recall, F1 score, total training time, the number of epochs required to train the model until convergence, and the inference time, in table 6. Note that the 'Training time' in the table measures the time required to train each data sample in a single epoch, and the 'Inference time' measures the time required to predict the class of a single test data sample. The prolonged training duration for inter-subject DB1 can be attributed to the utilization of LOSOCV, which requires training a separate model for each held-out subject.

Qualitative results
The results presented in the previous sections highlight the effectiveness of our method in learning semantic prototypes that encapsulate a wide range of fine-grained gesture patterns, leading to improved generalizability across diverse subjects and datasets. By providing visualizations of the important joint movements in gesture classification, we aim to gain further insight into the reasoning behind the model's decisions. As shown in figure 6, we integrate the prototype knowledge from the models associated with each gesture class into the joint movements. This transformation involves computing averages across all subjects within the Ninapro DB5 dataset, and the resultant predictions are compared with the ground-truth joint movements for analysis. This analysis facilitates a comprehensive understanding of the model's interpretative capacity and its proficiency in capturing fine-grained patterns in gesture classification. For instance, gestures 1 (index finger flexion) and 2 (index finger extension) unveil a notable alignment between the joint movements extracted from the multi-label classification model and the joint degrees measured from the raw Cyberglove dataset. Specifically, Cyberglove sensors 6 (PIP2_F) and 7 (IP2_F) consistently exhibit high activation (indicated in red), emphasizing the significant contribution of the index finger joints. Additionally, sensors 2 (MCPI_F), 3 (IP1_F), and 4 (CMC1_A), covering the thumb joints, display simultaneously high degrees of contribution due to the influence of proximate index finger activity. The prototypes employed in the second stage of gesture classification consistently emphasize similar joints, indicating their significant relevance in gesture identification. These consistent activation patterns underscore the model's ability to discern the joints crucial for distinguishing between gestures, which further reinforces the model's accuracy in gesture identification. These findings offer the potential for enhancing the development of more precise gesture classification algorithms, thus contributing to ongoing advancements in gesture classification systems.

Discussions
This section delves into the key insights garnered from our deep prototype learning approach for EMG-based gesture classification. Focusing on generalizability and interpretability, we discuss the technical novelty and advantages of our method. We then explore the real-world implications of these advancements and discuss the limitations of this work, along with potential future directions to push the boundaries of gesture classification in EMG applications.

Technical novelty and advantages
Conventional deep learning methods in EMG-based gesture classification often rely on black-box models, learning abstracted representations only from the gesture classes. These methods, while effective, can struggle with inter-subject generalization and lack interpretability. To address this challenge, our study introduces a novel form of prototype learning that decomposes gestures into biomedically relevant joint components, which benefits the generalizability and interpretability of the models. The advantages of our method are two-fold. (1) Our method leverages biomechanically relevant joint components, capturing core features less susceptible to inter-subject variations. This translates to superior performance across diverse subjects, as shown in tables 5 and 6. (2) Unlike black-box approaches, we enable visualization of the joint activation patterns for each prototype, as shown in figure 6. This unveils the contribution of individual joints to gesture classification, offering valuable insights for model improvement and personalization. These distinct advantages, enhanced generalizability and actionable interpretability, position our novel prototype learning approach as a significant step forward for EMG-based gesture classification. In the following, we discuss the broader impacts of generalizable and interpretable models, the limitations of this work, and future directions.

Generalizability and real-world significance
Inter-subject variability remains a significant bottleneck in achieving robust gesture classification for real-world applications [17]. Anatomical differences, such as muscle size and electrode placement, lead to diverse EMG signal patterns even for identical gestures across individuals [61,62]. Additionally, personalized movement strategies and health conditions further complicate the issue [16,17]. These factors significantly impact model generalizability, often rendering models ineffective when applied to users outside the training data. Aligning with research advocating for generalizable models in prosthetics [63-65], our work presents a novel deep prototype learning approach that tackles these challenges head-on. Supported by the experimental results reported in this paper (see tables 5 and 6), our deep prototype learning approach offers a significant step toward practical deployment. This improvement in generalizability has a profound impact on real-world applications. For example, in prosthetics, users with varying anatomies or muscle control can experience more reliable and intuitive control of their devices [17,66]. Beyond prosthetics, generalizable gesture classification can benefit various human-computer interfaces [67], allowing individuals with diverse needs to seamlessly interact with their surroundings using intuitive gestures.

Interpretability impacts
A major challenge in deep learning is the opacity of complex models. Users struggle to understand how these models arrive at their decisions, hindering trust and limiting potential improvements. This lack of interpretability is particularly concerning in healthcare and medical domains, where transparency is crucial [68,69]. While previous EMG-based gesture classification models often prioritized optimizing intra-subject performance through complex architectures [3, 9-11, 15-19, 21-26, 42-55], our study emphasizes the crucial role of interpretability. Beyond accuracy, it offers significant real-world benefits. By visualizing how each joint contributes to the overall gesture classification (as shown in figure 6), clinicians and researchers can directly understand the rationale behind our model's decisions. In prosthetics, for example, identifying the specific joint activations associated with misinterpreted gestures can guide targeted rehabilitation interventions [70,71]. This allows therapists to focus on areas where users might be struggling with muscle control or movement patterns. Additionally, interpretability can help highlight potential biases in the training data, ensuring the model is generalizable and effective across diverse user populations. The ability to visualize joint activation patterns can also be a valuable tool for researchers studying human movement and neuromuscular control [72]. By understanding how different gestures activate various muscle groups, researchers can gain deeper insights into motor function and rehabilitation strategies.

Limitations and future directions
While significantly better than existing methods, as shown in table 5, our method experiences a noticeable decline in inter-subject performance when trained on data from amputee subjects (DB3) compared to intact subjects (DB1, DB2, DB5). This reduction can be attributed to a variety of factors inherent to amputee populations, such as variations in the residual forearm after amputation, diverse experiences and adaptations accumulated over the years since amputation, individual health conditions affecting EMG signal measurements, and the historical extent of utilization of the amputated forearm, including differing levels of experience with prosthetic and other assistive devices. These factors collectively contribute to a highly non-IID (independent and identically distributed) dataset across amputee subjects, where the data distributions differ significantly between individuals. This inherent heterogeneity presents a major challenge for model generalization across the broader amputee population. Our future work will explore advanced techniques to address this issue, aiming to generalize our model more effectively across diverse amputee individuals and improve its real-world applicability. Future work could also explore personalized training protocols based on the individual joint activation patterns identified through our interpretability methods. This holds promise for tailoring rehabilitation strategies to specific patient needs and improving prosthetic control efficacy.

Conclusion
In conclusion, this paper addresses the challenges in EMG-based gesture classification by proposing a novel deep prototype learning method. The approach achieves improved generalizability to new subjects by allowing the model to classify gestures based on a set of learned prototypes representing fine-grained joint movement patterns encoded in the EMG data. Comprehensive experiments across multiple public datasets demonstrate the classification performance and generalizability of the proposed approach. The proposed method's capacity for generalization and prototype learning empowers both clinical practitioners and users to grasp the underlying rationale driving the model's predictions. This broad applicability is crucial for building trust in the technology and facilitating its adoption in real-world applications. However, it is important to acknowledge that while the results demonstrate high intra-subject accuracy, a significant performance gap remains between intra-subject and inter-subject performance.

Figure 2. Overview of the proposed gesture classification model based on the learned prototypes. It measures the similarity between the input features and the prototypes to predict the gesture label. By leveraging the concept of prototype-based classification, our model effectively captures the key features of gestures, offering robust classification outcomes.

Figure 3. The proposed deep prototype learning classifier adopts a baseline CNN architecture consisting of two 2D convolution layers, followed by two locally connected layers and three fully connected layers dedicated to gesture classification. It is designed to effectively capture and extract hierarchical patterns from the preprocessed EMG features, facilitating accurate and robust gesture classification.

Figure 4. The similarity in temporal patterns between the extracted features and the original EMG patterns. The features were extended by the reversed length of the down-sampling process, utilizing data from subject 1 of the Ninapro DB5 dataset. For effective visualization, the amplitude of the MAV was enlarged three times, whereas the waveform length values were reduced to 10% of their original amplitude. The temporal span considered in this representation ranges from 2000 ms to 12 000 ms.

Figure 5. Confusion matrices of the intra-subject results for the first subject of Ninapro DB1, DB2, DB3, and DB5. These matrices offer a visual representation of the performance of the classification models in discerning between different hand gestures across the four databases. Notably, the performance, as indicated by the overall classification accuracy and the distribution of classification errors, demonstrates consistently high levels of accuracy across the databases, underscoring the efficacy of the classification algorithms employed in this study. Note that the values presented are normalized to the range 0 to 1, which corresponds to a percentage scale of 0 to 100%.

Figure 6. Qualitative visualization of the classified gestures in three aspects: the gestures themselves (row 1), the underlying joint movements (row 2), and the top five influential prototypes (row 3, represented by their dominant joints). Colors indicate the joint activation level, aiding the interpretation of how the prototypes contribute to gesture classification.

Table 2. Details of the four Ninapro databases used in this study.

Table 3. Neural network architecture of the proposed method.

Table 4. Intra-subject gesture classification accuracy (%), where the best performance is highlighted in bold.

Table 5. Inter-subject gesture classification accuracy (%), where the best performance is highlighted in bold.

Table 6. Extended evaluation of our prototype learning model on the Ninapro datasets. The metrics, including precision, recall, F1 score, training time, epochs to convergence, and inference time, provide insights into the model's performance and adaptability across diverse tasks. Notably, datasets like DB1 and DB5 showcase rapid convergence and minimal errors, suggesting efficient training and effective prototype extraction.