Spatio-temporal warping for myoelectric control: an offline, feasibility study

Milad Jabbari; Rami Khushaba; Kianoush Nazarpour

doi:10.1088/1741-2552/ac387f

1. Introduction

The electromyographic (EMG) signal, recorded from the stump muscles, is a valuable, rich, complex, and dynamic source of information for prosthesis control. Research shows that providing natural EMG-based feedback with different approaches including grasp/wrist force and movement estimation [1–4], continuous finger trajectory decoding [5], and discrete movement classification [6, 7] could achieve promising results for desirable control of the prosthesis. Machine learning is a candidate tool in mapping motor intent to prosthesis control [8–10]. Extracting reliable features from the EMG signals plays a vital role in the control of upper limb prostheses with pattern recognition [8, 9, 11]. A wide range of factors affect the accuracy of machine learning-based methods [12–17]. The separability of the features can have a greater influence on the performance of EMG decoding than the type of the classifier [12–14]. Therefore, research on the development and real-time testing of more robust features is on-going.

Current EMG feature extraction methods suffer from two intrinsic limitations:

(a)
they are cross-sectional [18–21]; that is, they cannot extract the inter-temporal dependencies that exist between feature extraction windows;
(b)
they merely concatenate features extracted from individual channels [15, 22–24], and so do not capture the synergistic and spatial patterns of muscle activity.

These limitations are depicted on the left side of figure 1. Therefore, the development of feature extraction methods that can capture the temporal dynamics of the EMG signals in a spatially-aware way, such as the one shown on the right side of figure 1, has received increased attention recently [15, 21, 25–38].

To address the former, many hand-crafted features [15, 25–27] and deep learning-based approaches [21, 28–30] have been introduced. The fusion of time-domain descriptors (fTDD) and deep long short-term memory (LSTM) networks yield higher classification accuracy than the conventional time-domain (TD) features [19, 21, 39, 40]. These methods excel at extracting temporal information but, because they simply concatenate features, they are blind to spatial information.

To address the second limitation, deep learning methods, e.g. convolutional neural networks (CNNs), construct spatial feature sets and capture the relationship between EMG channels [31–35]. Convolutional structures do not typically consider within-channel inter-temporal information of the EMG signals. Therefore, there has been a surge of research in combining CNNs and LSTM to capture the spatio-temporal features of the EMG signals [36–38]. However, the prohibitively large number of model parameters, e.g. 34–95 k in [41], 104 k in [32], and 30–549 k in [42], may impose a barrier to real-time implementation for myoelectric control. Traditional features have also been used to extract spatial information [15, 23, 43, 44], but that conversely leads to the loss of temporal information.

Hand-crafted features such as the temporal-spatial descriptors (TSDs) [15], fTDD [19], and time domain with auto-regressive (AR) model parameters have been recently used to complement different CNN and LSTM models [21, 45, 46]. Hence, they can facilitate the adoption of recent deep learning models for real-time clinical implementations given their reduced computational burden in comparison to deep learning models.

We address both limitations with a novel feature extraction method. For the spatial aspect, dynamic time warping (DTW) was employed to efficiently capture the nonlinear similarity between the EMG signals. For the temporal aspect, the DTW features were fused across time; we named the resulting novel feature spatio-temporal warping (STW). Figure 1 summarises the challenge and the proposed approach.

2. Methods

Our STW approach has two building blocks. The first is the DTW algorithm [47]. The second component temporally fuses the extracted DTW features with that of a previous time window and that of a global trend. Figure 2 depicts this method.

**Figure 2.** The flow diagram of the proposed STW feature extraction algorithm.

2.1. Dynamic time warping (DTW)

The DTW method measures the similarity between two sequences of different temporal dynamics. It has been used in speech recognition [47], data mining [48], gesture recognition [49], robotics [50], and myoelectric control [51–53]. Previous research utilising DTW in myoelectric control considered time-series similarity across training and testing EMG sequences from individual channels only. Using DTW, we created a similarity matrix between all pairs of EMG signals. DTW can also be applied across a set of signals together.

Assume two temporal sequences of length $l$: $\mathbf{a} = \left\{a_1,a_2,\ldots, a_l\right\}$ and $\mathbf{b} = \left\{b_1,b_2,\ldots, b_l\right\}$, representing two EMG signals. We define $\mathbf{D}(\mathbf{a},\mathbf{b})$ as an $l \times l$ distance matrix between $\mathbf{a}$ and $\mathbf{b}$, with $\mathbf{D}_{ij} = (a_i - b_j)^2$. For the Euclidean distance, we set $i = j$, that is, the distance is taken along the same index of the two time series. The path for the Euclidean distance is hence the diagonal of the matrix $\mathbf{D}$. Although Euclidean distance is widely used owing to its simplicity and efficiency, it suffers from two main problems: first, it cannot be applied when the sequences differ in length and, second, it is sensitive to time shifts. Finding an optimal path using DTW can address these challenges. In DTW, the distance is calculated along a warping path $P$ generated by traversing the matrix $\mathbf{D}$ along ordered pairs of positions: $P = \langle (e_1,f_1), (e_2,f_2),\ldots, (e_s,f_s)\rangle$, where $s$ is the number of points on the path and $e_i \in [1:l]$ and $f_i \in [1:l]$ are the positions that make up the warping path. A valid warping path must satisfy the conditions $(e_1, f_1) = (1,1)$, $(e_s, f_s) = (l, l)$, $0 \leq e_{i+1} - e_i \leq 1$ and $0 \leq f_{i+1} - f_i \leq 1$ for all $i < s$.

A warping limit is typically imposed on the path, that is $\left|e_i -f_i\right| \le w \cdot l$, $\forall(e_i,f_i) \in P$. The value of $w$ is the maximum fraction of the sequence length by which the warping path is allowed to deviate from the diagonal. According to [54], the distance along any path $P$ is given by $D_P(\mathbf{a},\mathbf{b}) = \sum_{i = 1}^{s} p_i$, where $p_i = \mathbf{D}_{e_i,f_i}$ is the distance between the element at position $e_i$ of $\mathbf{a}$ and the element at position $f_i$ of $\mathbf{b}$ for the $i$th pair of points in $P$. Denoting the space of all valid paths by $\mathcal{P}$, the DTW path $P^*$ is the one with the minimum distance: $P^* = \arg\min_{P \in \mathcal{P}} D_P(\mathbf{a},\mathbf{b})$.
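In practice the minimum-distance path is usually found by dynamic programming rather than by enumerating all paths. A sketch of the standard recurrence is given below; the accumulated-cost matrix $\mathbf{C}$ is a symbol introduced here for illustration and does not appear in the original formulation. Entries outside the warping limit, and entries with a zero index, are taken to be $+\infty$.

```latex
\begin{aligned}
C_{1,1} &= \mathbf{D}_{1,1}, \\
C_{i,j} &= \mathbf{D}_{i,j} + \min\left\{ C_{i-1,j},\; C_{i,j-1},\; C_{i-1,j-1} \right\},
          \qquad |i - j| \le w \cdot l, \\
D_{P^*}(\mathbf{a},\mathbf{b}) &= C_{l,l}.
\end{aligned}
```

Each step of the recurrence advances one or both indices by one, which mirrors the step conditions on a valid warping path.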

We computed the aforementioned distances between all pairs of the $NC$ EMG channels, where $NC$ is the number of channels, i.e. $(NC \times (NC-1))/2$ pairs, as shown in figure 2. Inspired by [19, 25], the DTW distances between all pairs of the first and second derivatives of the EMG signals were also computed. Hence, our method extracts $3 \times (NC \times (NC-1))/2$ features in total. All calculated DTW distances were then concatenated to form a feature vector and passed through logarithmic scaling for improved accuracy, similar to [55].
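The pairwise feature construction can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation (the study used MATLAB); the function names, the use of simple first differences as derivatives, and the `log(1 + d)` form of the logarithmic scaling are assumptions made here, since the text does not specify them.

```python
import math
from itertools import combinations


def dtw_distance(a, b, w=1.0):
    """DTW distance between two equal-length sequences with squared differences.

    w is the warping limit as a fraction of the sequence length:
    w=1.0 leaves the path unconstrained, w=0.0 forces the diagonal.
    """
    l = len(a)
    band = int(w * l)
    inf = float("inf")
    # cost[i][j]: minimum accumulated distance of a path ending at (i, j), 1-based
    cost = [[inf] * (l + 1) for _ in range(l + 1)]
    cost[0][0] = 0.0
    for i in range(1, l + 1):
        for j in range(max(1, i - band), min(l, i + band) + 1):
            d = (a[i - 1] - b[j - 1]) ** 2
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[l][l]


def diff(x):
    """First difference, used here as a simple discrete derivative (assumption)."""
    return [x[k + 1] - x[k] for k in range(len(x) - 1)]


def dtw_features(window, w=1.0):
    """window: list of NC channels, each a list of samples from one analysis window.

    Returns 3 * NC * (NC - 1) / 2 log-scaled DTW distances computed on the
    signals and on their first and second differences.
    """
    first = [diff(ch) for ch in window]
    second = [diff(ch) for ch in first]
    feats = []
    for signals in (window, first, second):
        for p, q in combinations(range(len(signals)), 2):
            # log(1 + d) is an assumed form of the logarithmic scaling
            feats.append(math.log1p(dtw_distance(signals[p], signals[q], w)))
    return feats
```

With `w=0.0` the function reduces to the sum of squared differences along the diagonal, which makes the contrast with the Euclidean path easy to check on small examples.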

2.2. Spatio-temporal warping (STW)

We first multiplied the feature vector extracted from the current window, $DTW_t$, with that extracted from a previous window, $DTW_{t-n}$; in this study we used the immediately preceding window ($n = 1$). To account for the long-term memory, our algorithm borrowed the cell state ($c$) concept of the LSTM method. In an LSTM, the cell state is the information highway that passes through all the cells across all time steps to evolve a recursive representation of the features; it is therefore continuously updated within each cell. At the same time, the contents of each cell were updated with a weighted contribution of the cell state through the parameter β. The value of β was selected empirically, with the objective of achieving the best classification performance across several datasets.

$$f_t = DTW_t \odot DTW_{t-1} + \beta c \tag{1}$$

where $f_t$ represents the features extracted at the current time $t$, the symbol $\odot$ denotes element-wise multiplication, and $c$ is the cell state.

The STW algorithm also includes a dynamic normalization mechanism borrowed from the attention mechanism of neural machine translation [56]. Attention allows dependencies to be modelled regardless of their distance in the input or output sequences. We used this mechanism within each cell to normalise the features of the current window and, before adding it to those features, to normalise the long-term memory component. The final feature representation is

$${f_t}_i = \frac{{f_t}_i}{\sum_j {f_t}_j} + \log \left(1 + \frac{c_i}{\sum_j c_j}\right) \tag{2}$$

where the subscript $i$ indexes the features at time $t$, across which the normalization is applied.
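Equations (1) and (2) can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the authors' code: the value `beta=0.1` is arbitrary, the small `eps` guard against division by zero is added here, and the running-sum update of the cell state in `stw_sequence` is hypothetical, because the text does not give the update rule.

```python
import math


def stw_step(dtw_t, dtw_prev, c, beta=0.1, eps=1e-12):
    """One STW update for a single analysis window.

    dtw_t, dtw_prev: DTW feature vectors of the current and previous window.
    c: cell-state vector (long-term memory), same length as the features.
    Returns (normalised features, fused features before normalisation).
    """
    # Equation (1): element-wise product plus the weighted cell state
    f = [x * y + beta * ci for x, y, ci in zip(dtw_t, dtw_prev, c)]
    # Equation (2): sum-normalise the fused features and the cell state
    f_sum = sum(f) + eps  # eps avoids division by zero (assumption)
    c_sum = sum(c) + eps
    out = [fi / f_sum + math.log(1.0 + ci / c_sum) for fi, ci in zip(f, c)]
    return out, f


def stw_sequence(dtw_windows, beta=0.1):
    """Apply stw_step to consecutive windows, pairing each with its predecessor."""
    c = [0.0] * len(dtw_windows[0])
    outputs = []
    for prev, cur in zip(dtw_windows, dtw_windows[1:]):
        out, f = stw_step(cur, prev, c, beta)
        outputs.append(out)
        # Hypothetical cell-state update (not specified in the text): running sum
        c = [ci + fi for ci, fi in zip(c, f)]
    return outputs
```

With an all-zero cell state the logarithmic term vanishes and the output is simply the sum-normalised product of the two windows' features.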

2.3. EMG data sets

Five EMG datasets were utilised to test the performance of the proposed feature extraction algorithm as described in table 1.

Table 1. Details of the utilised EMG datasets.

Dataset	Subjects	Channels	Samp. freq (Hz)	Classes
DB1	27	10 (Otto Bock)	100	52
DB5	10	16 (2 MYOs)	200	53
DB7	22	12	2000	41
3DC	22	10 (3DC), 8 (MYO)	1000 (3DC), 200 (MYO)	11
SC-EMG	5	10	1000	7

We used database-1 (DB1), database-5 (DB5), and database-7 (DB7) of the Ninapro repository [57–60], comprising, respectively, 27 limb-intact subjects (52 movements), 10 limb-intact subjects (52 movements and rest), and 20 limb-intact subjects and two amputees (40 movements and rest). EMG signals were recorded using 10 Otto Bock electrodes in DB1 and two wearable MYO armbands on the forearm in DB5. In DB7, 12 Trigno (Delsys, USA) EMG sensors were placed on the limb. The 3DC dataset [41] included data from 22 limb-intact subjects, each performing eleven hand/wrist gestures using two armbands. The selective classification (SC-EMG) dataset included data from five traumatic long trans-radial amputees [61], each performing seven classes of motion. Ten electrodes were placed around the forearm at the point of the largest diameter.

2.4. Feature extraction

The following features were extracted from windows of 150 ms at 50 ms increments:

HTD: Hudgins' TD feature set [18], comprising: mean absolute value (MAV), MAV slope, zero crossings (ZCs), slope sign changes, and waveform length (WL).
AR-RMS: The 6th-order AR coefficients and the root-mean-square (RMS).
LSF9: The lower sampling rate features defined in [22], comprising L-scale, maximum fractal length, the mean value of the square root, Willison amplitude, ZC, RMS, integrated absolute value, difference absolute standard deviation value, and variance.
ATD: Combined AR with TD as defined in [60], comprising MAV, WL, 4th-order AR coefficients and log-variance (LogVar).
STFS: Spatio-temporal features from [62], comprising integral square descriptor, normalised root-square coefficient of first and second differential derivatives, mean log-kernel, an estimate of mean derivative of the higher-order moments per sliding window, and a measure of spatial muscle information.
fTDD: The fusion of TD features from [19] with six features representing the first three even power spectrum moments, with an irregularity factor, a sparsity measure, and the ratio of WL of the first derivative to that of the second derivative. The step size was 15.
TSD: The temporal spatial-descriptors from [15].
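The windowing scheme used for all of the feature sets above (150 ms windows advanced in 50 ms increments) can be sketched as follows, together with two of the simpler Hudgins features. This is an illustrative Python sketch, not the authors' MATLAB code; the function names are hypothetical, and the window and step lengths in samples are assumed to be rounded to the nearest integer.

```python
def sliding_windows(signal, fs, win_ms=150, step_ms=50):
    """Split a single-channel signal into overlapping analysis windows.

    fs is the sampling frequency in Hz; win_ms and step_ms are in milliseconds.
    """
    win = int(round(fs * win_ms / 1000))
    step = int(round(fs * step_ms / 1000))
    return [signal[s:s + win] for s in range(0, len(signal) - win + 1, step)]


def mean_absolute_value(x):
    """MAV: the mean of the absolute sample values in a window."""
    return sum(abs(v) for v in x) / len(x)


def waveform_length(x):
    """WL: the cumulative absolute difference between consecutive samples."""
    return sum(abs(x[k + 1] - x[k]) for k in range(len(x) - 1))
```

At 1000 Hz this gives 150-sample windows with a 50-sample step; at 200 Hz the same settings give 30-sample windows with a 10-sample step.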

To ensure a fair comparison, the dimensionality of all feature sets was reduced to c − 1, where c is the number of classes, using the spectral regression feature projection method [63].

2.5. Classifiers

The following classifiers were chosen: linear discriminant analysis (LDA), extreme learning machine (ELM), k-nearest neighbor (KNN), and support vector machines (SVMs). In the ELM classifier, one hidden layer with 1250 neurons was utilised. The parameters of the SVM were optimised for each dataset, as its performance is sensitive to the kernel function parameter γ and the regularization parameter C, while for KNN the parameter K was empirically set to K = 5. Additionally, three deep learning models were implemented: LSTM, CNN, and a combination of CNN and LSTM. These models, shown in figure 3, were designed empirically to achieve the best classification accuracy using the raw EMG data. Raw EMG inputs to models like CNN (or even CNN + LSTM) are more successful than spectrogram images [31, 41]. The RMS of the raw EMG signal in each channel was therefore calculated, generating $NC$ scalar values for each analysis window of 150 ms. These values were then turned into pseudo-images by multiplying each generated vector of size $NC \times 1$ by its transpose, resulting in an $NC \times NC$ image. The images were scaled logarithmically and then provided as inputs to the CNN and CNN + LSTM models. For the LSTM model, the raw EMG samples for each 150 ms window of the $NC$ channels were used as features.
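The pseudo-image construction described above (per-channel RMS, outer product with its transpose, logarithmic scaling) can be sketched as follows. The natural logarithm and the small `eps` offset that avoids `log(0)` for a silent channel are assumptions made for this illustration; the text does not state the exact scaling.

```python
import math


def rms(x):
    """Root-mean-square of one channel within an analysis window."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def pseudo_image(window, eps=1e-12):
    """window: list of NC channels, each a list of samples.

    Returns an NC x NC nested list: the log-scaled outer product of the
    per-channel RMS vector with itself.
    """
    r = [rms(ch) for ch in window]
    # eps is an assumed guard against log(0), not part of the described method
    return [[math.log(ri * rj + eps) for rj in r] for ri in r]
```

Because the image is an outer product of one vector with itself, it is symmetric, and its diagonal carries the (log-scaled) squared RMS of each channel.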

**Figure 3.** The schematic architectures of the utilised deep learning models: (A) CNN model including two convolutional layers; (B) LSTM network consisting of two LSTM layers; (C) CNN + LSTM structure with one convolutional layer and one LSTM layer. The output of all models is connected to a softmax layer.

2.6. Statistical analysis

The Wilcoxon signed rank test was applied to verify the statistical significance of the achieved results, with results considered significant for a p-value smaller than 0.05. The size of the observed differences was measured using Cohen's effect size d for paired samples, defined as the difference between two group means divided by the standard deviation [22]. Thresholds of 0.2, 0.5, 0.8, 1.2, and > 2 are usually employed to denote small, medium, large, very large, and huge effects, respectively. Additionally, when reporting Cohen's d for one method versus another, the direction of the effect was calculated by subtracting the mean of the latter from that of the former. MATLAB 2020a was utilised for all experiments on a laptop with an i7 processor, 16 GB of RAM, and a GPU (NVIDIA GeForce RTX 2060).
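The effect-size calculation can be illustrated with a short Python sketch. The text does not say which standard deviation is used in the denominator; the sketch assumes the sample standard deviation of the paired differences, a common choice for paired designs. The function names and the "negligible" label for values below 0.2 are introduced here for illustration.

```python
import statistics


def cohens_d_paired(x, y):
    """Cohen's d for paired samples, reported as x versus y.

    The direction follows the text: the mean of the latter (y) is subtracted
    from that of the former (x). The denominator is the sample standard
    deviation of the paired differences (an assumption).
    """
    diffs = [a - b for a, b in zip(x, y)]
    return statistics.mean(diffs) / statistics.stdev(diffs)


def effect_label(d):
    """Map |d| onto the thresholds 0.2, 0.5, 0.8, 1.2, and > 2."""
    m = abs(d)
    if m > 2.0:
        return "huge"
    if m >= 1.2:
        return "very large"
    if m >= 0.8:
        return "large"
    if m >= 0.5:
        return "medium"
    if m >= 0.2:
        return "small"
    return "negligible"  # label assumed here; not named in the text
```

A positive value of `cohens_d_paired(x, y)` means the first method has the larger mean, so for error rates a positive d indicates that the second method performed better.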

3. Results

3.1. Results of DB5 dataset

The average classification error rates across all folds and subjects are shown in figure 4(a). The corresponding bar plots with standard deviation values are shown in figure 1 of the supplementary materials (available online at stacks.iop.org/JNE/18/066028/mmedia). Among the deep models, CNN + LSTM (denoted CNNL, for brevity) performed significantly better than the individual CNN and LSTM models (p < 0.001 for both tests, d = 3.15 for CNN vs. CNN + LSTM and d = 3.55 for LSTM vs. CNN + LSTM). CNN also significantly outperformed LSTM ($p < 0.001, d = 2.21$). This finding indicates that the combined temporal-spatial information captured by CNN + LSTM is more informative than that captured by either CNN or LSTM alone, while the spatial information captured by CNN appears more important on this dataset than the temporal information captured by LSTM.

Among the traditional hand-crafted feature extraction algorithms, as is clear from figure 4(a), HTD and AR-RMS were the worst performers on the 53-class problem in DB5, with all other methods significantly outperforming them, except for LSTM. When comparing HTD and LSTM, no difference was found ($p = 0.517,\ d = 0.10$), which is in line with the findings in [21].

We observed significant differences between LSF9 and ATD ($p = 0.030,\ d = 0.26$), and large differences between LSF9 and STFS ($p < 0.001,\ d = 0.80$) and between LSF9 and TSD ($p < 0.001,\ d = 0.80$). These results also show that fTDD achieved significantly lower error rates than all other traditional and deep learning feature extraction methods, including CNN + LSTM, except the proposed STW features, which significantly outperformed fTDD ($p < 0.001, d = 2.64$). STW significantly outperformed all other feature extraction and/or learning methods, including CNN + LSTM, with all tests having d > 2. Overall, the proposed approach yielded an average decrease in classification error of 19.81% ± 4.36% in comparison to all other methods considered in this work. Additionally, it showed a significantly lower standard deviation.

In addition to the classification accuracy, we compared the proposed STW method with the other hand-crafted feature extraction algorithms using three further metrics: F-score, recall, and precision [64]. The results are illustrated in figure 3 of the supplementary materials. The average F-score, recall, and precision values, as well as the statistical tests, show that STW outperformed all conventional hand-crafted algorithms. Furthermore, to compare STW quantitatively with the conventional feature extraction methods in terms of cluster separability, the Davies-Bouldin index (DBI) was used [65]. DBI values averaged across test folds and all subjects are shown in figure 4 of the supplementary materials. STW clearly achieved the lowest DBI values, indicating the best feature separability among the feature extraction methods considered.
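For readers unfamiliar with the DBI, the following Python sketch computes it in its standard form: for each cluster, the worst-case ratio of summed within-cluster scatter to between-centroid distance, averaged over clusters. Lower values indicate more compact, better separated clusters. Treating each movement class as one cluster and using Euclidean distance are assumptions made here; the text does not describe the configuration used in the study.

```python
import math


def euclid(p, q):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a list of feature vectors."""
    n = len(points)
    return [sum(p[k] for p in points) / n for k in range(len(points[0]))]


def davies_bouldin(clusters):
    """clusters: list of clusters, each a list of feature vectors."""
    centres = [centroid(c) for c in clusters]
    # Scatter: mean distance of each point to its own cluster centroid
    scatter = [
        sum(euclid(p, m) for p in c) / len(c)
        for c, m in zip(clusters, centres)
    ]
    total = 0.0
    for i in range(len(clusters)):
        # Worst-case similarity of cluster i to any other cluster
        total += max(
            (scatter[i] + scatter[j]) / euclid(centres[i], centres[j])
            for j in range(len(clusters))
            if j != i
        )
    return total / len(clusters)
```

For two clusters with unit scatter whose centroids are four units apart, the index is (1 + 1) / 4 = 0.5.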

3.2. Results of the SC-EMG dataset

The average classification error rates on the SC-EMG dataset are shown in figure 4(b) for all traditional hand-crafted and deep learning models (bar plots of the results are provided in figure 1 of the supplementary materials). Among the deep models, it is interesting to see that LSTM performed better than CNN in terms of the average error rates across all amputees. This could be attributed in part to the fact that amputation affects muscle morphology and synergy; the performance of CNN, which relies on capturing spatial information between signals, is therefore affected because these signals may vary (i.e. warp) in timing as a result of amputation. In this direction, a recent work by Tallec and Ollivier [66] shows that LSTM networks have the capability to learn to warp input sequences. Despite the differences in the average classification errors of the deep models, the Wilcoxon signed rank test revealed no statistically significant difference between LSTM and CNN, between LSTM and CNN + LSTM, or between CNN and CNN + LSTM ($p > 0.05$ for all tests). Among the hand-crafted methods, fTDD performed significantly worse than TSD (p < 0.001), while the proposed STW significantly outperformed all other hand-crafted and deep models ($d = 2.54, 2.09, 1.87, 2.08, 1.87, 0.89$, and 2.05 for STW vs. HTD, AR-RMS, STFS, LSF9, ATD, TSD, and fTDD, respectively, and p < 0.001 for all comparisons). The performance of the STW method can be further explained by its ability, through the DTW stage, to approximate the warping path that best aligns the two signals of interest in the time domain. Such a path is more representative of the distance between the signals than the Euclidean distance, despite nonlinear variations in speed. The interested reader is referred to [47], where the details of the DTW method can be found.

3.3. Results of the DB7 dataset

The same pattern of results was observed for the DB7 dataset. Average classification error rates are shown in figure 4(c), with the corresponding bar plots in figure 1 of the supplementary materials. To avoid repetition, we used only an LDA classifier for decoding. HTD and AR-RMS performed worst. LSTM performed better than both HTD and AR-RMS on DB7 (p < 0.01 for both tests, d = 2.06 for HTD vs. LSTM, d = 1.15 for AR-RMS vs. LSTM). Both CNN + LSTM and CNN significantly outperformed LSTM (p < 0.001 for both tests, d = 2.97 for LSTM vs. CNN + LSTM and d = 2.36 for LSTM vs. CNN), with CNN + LSTM also significantly outperforming CNN ($p < 0.001, d = 1.18$). Finally, both fTDD and the proposed STW significantly outperformed all other methods, including CNN + LSTM, while STW also significantly outperformed fTDD ($p < 0.001, d = 1.32$).

3.4. Results of the 3DC dataset

The average classification results achieved for the 3DC dataset are illustrated in figures 4(d) and (e), as well as in figure 1 of the supplementary materials. As two different armbands (MYO and 3DC) were used in this dataset, separate analyses were conducted. Instead of raw EMG signals, the extracted STW features were fed into the deep models. KNN achieved the lowest average classification error, but statistical tests revealed no significant differences. Similar to the SC-EMG dataset, LSTM achieved the best performance compared to CNN + LSTM and CNN in terms of average classification error rates. The Wilcoxon signed rank test revealed significant differences between LSTM and CNN and also between LSTM and CNN + LSTM (p < 0.001). With respect to the 3DC armband, similar results were obtained, as LSTM significantly outperformed CNN and CNN + LSTM (p < 0.01).

3.5. Class-wise standard deviation

Reporting only the average classification error or accuracy of an off-line analysis is biased, and comparing the results achieved by a specific method between off-line and real-time schemes is therefore not fair. To partially address this challenge, we used the class-wise accuracy standard deviation (CWS) [67]. Confusion matrices averaged across all folds and subjects were calculated, and the standard deviation of the main diagonal of each matrix was taken as the CWS. In this analysis, DB5 and 3DC were considered.
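The CWS computation described above can be sketched in Python as follows. The sketch assumes that each confusion matrix is row-normalised, so that its diagonal holds per-class accuracies, and that the population standard deviation is used; neither detail is stated in the text, and the function name is hypothetical.

```python
import statistics


def class_wise_std(confusions):
    """confusions: list of confusion matrices (nested lists), one per fold/subject.

    Averages the matrices element-wise along the main diagonal, then returns
    the standard deviation of the averaged per-class accuracies.
    """
    n = len(confusions)
    k = len(confusions[0])
    mean_diag = [sum(cm[i][i] for cm in confusions) / n for i in range(k)]
    # Population standard deviation (assumption; the text does not specify)
    return statistics.pstdev(mean_diag)
```

A CWS of zero means every class is recognised equally well on average; larger values indicate that some classes are decoded much less reliably than others, which the average accuracy alone would hide.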

Averaged confusion matrices across ten subjects and six folds using ELM, KNN, LDA, and SVM classifiers for the DB5 dataset are illustrated in figure 5. The proposed STW method achieved the lowest CWS values among all feature sets. The results of the CWS analysis for the 3DC dataset are presented in figure 6.

**Figure 5.** Average confusion matrices of the 10 participants across six folds for the DB5 dataset using different feature extraction methods as input of the four classifiers. Confusion matrices achieved by using various feature extraction algorithms are shown in different rows, respectively. The numbers above the confusion matrices indicate the CWS values achieved by feeding different feature sets into the classifiers.

**Figure 6.** Average confusion matrices of the 22 participants across eight folds for the 3DC dataset using the STW feature extraction method as input of the four classifiers: ELM, KNN, LDA, and SVM. Confusion matrices achieved for the 3DC and MYO armbands are shown in (A) and (B), respectively. The numbers above the confusion matrices indicate the CWS values achieved by feeding the STW feature into the classifiers.

3.6. Window size

For the DB5 and 3DC, we varied the feature extraction window size from 50 to 250 ms. The average classification error rates across all folds and subjects for DB5, 3DC with 3DC armband, and 3DC with MYO armband are shown in figures 7(a)–(c), respectively. Results show the classification errors decrease with larger windows, as expected. However, the results also show that varying the window size had a medium to large effect between consecutive windows, which could potentially indicate the robustness of the method against varying window sizes.

**Figure 7.** Average STW classification error rates with different window sizes (50–250 ms), achieved by ELM, LDA, KNN, and SVM classifiers. (A) Average classification error rates for the DB5 dataset. (B) Average classification error rates for the 3DC dataset using the 3DC armband. (C) Average classification error rates for the 3DC dataset using the MYO armband.

3.7. Computation time

The computation time for each of the feature extraction methods, except the deep learning models, was calculated on randomly generated data with 150 samples across ten dimensions, equivalent to 150 ms of data from ten channels sampled at 1000 Hz. The time required to extract the features was recorded and the analysis was repeated 1000 times. Average results are displayed in table 2.

Table 2. Time required to extract handcrafted features reported as average ± standard deviation.

Feature set	Time (ms)
LSF9	$0.8053 \pm 0.0620$
TSD	$0.4940 \pm 0.0411$
STFS	$0.4986 \pm 0.0381$
STW	$\boldsymbol{0.4248} \pm 0.0632$
ATD	$0.2115 \pm 0.0237$
HTD	$0.1421 \pm 0.0180$
AR-RMS	$0.1660 \pm 0.0509$
fTDD	$0.0125 \pm 0.0354$
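The timing procedure behind table 2 can be sketched as a small Python harness. It is illustrative only: the study used MATLAB, so absolute timings from this sketch are not comparable with the values in the table. The function name and the use of Gaussian random data are assumptions.

```python
import random
import statistics
import time


def time_feature_extractor(extract, n_samples=150, n_channels=10, repeats=1000):
    """Time a feature extractor on random windows.

    extract: callable taking a window (list of channels, each a list of samples).
    Returns (mean, standard deviation) of the extraction time in milliseconds.
    """
    times_ms = []
    for _ in range(repeats):
        # Fresh random window each repeat; data generation is not timed
        window = [
            [random.gauss(0.0, 1.0) for _ in range(n_samples)]
            for _ in range(n_channels)
        ]
        start = time.perf_counter()
        extract(window)
        times_ms.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(times_ms), statistics.stdev(times_ms)
```

Any callable that accepts one analysis window can be passed as `extract`, for example a function returning the mean absolute value of each channel.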

3.8. Comparison with other studies

To evaluate the proposed STW method against recently published state-of-the-art features and classifiers, a comprehensive comparison was performed on DB1 of the Ninapro database. For a fair comparison, the same train-test split as in [58, 68] was used: repetitions 1, 3, 4, 6, 8, 9, and 10 were used to train the classifiers, and repetitions 2, 5, and 7 were used to evaluate the methods. Moreover, the window length was set to 200 ms. The average classification accuracy achieved by STW + SVM is compared with that of previous traditional features and deep learning models in table 3. Moreover, the averaged DBI values achieved by the conventional feature extraction methods and the proposed STW algorithm are illustrated in figure 4 of the supplementary materials. The results show that STW achieved the lowest DBI value among these methods.

Table 3. Comparison of the proposed STW method with previous works on DB1 dataset.

Method	Classification accuracy (200 ms window)
Traditional-RF [58]	75.3%
AtzortNet [69]	66.6%
GengNet [31]	77.8%
RNN Module with raw-signal [68]	79.8%
CNN Module with raw-image1 [68]	83.5%
CNN Module with feature-signal-image1 [68]	86.3%
Hybrid CNN-RNN with raw-image1 [68]	84.7%
Hybrid CNN-RNN with feature-signal-image1 [68]	86.7%
Attention-based hybrid CNN-RNN with raw-image1 [68]	84.8%
Attention-based hybrid CNN-RNN with feature-signal-image1 [68]	87.0%
STW + SVM	87.3%
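The repetition-based split used for the DB1 comparison can be expressed in a few lines of Python. The tuple layout `(features, label, repetition)` and the function name are hypothetical; only the repetition numbers come from the text.

```python
# Repetition indices for the DB1 train/test protocol described in section 3.8
TRAIN_REPS = {1, 3, 4, 6, 8, 9, 10}
TEST_REPS = {2, 5, 7}


def split_by_repetition(samples):
    """samples: iterable of (features, label, repetition) tuples.

    Returns (train, test), each a list of (features, label) pairs.
    """
    samples = list(samples)
    train = [(x, y) for x, y, r in samples if r in TRAIN_REPS]
    test = [(x, y) for x, y, r in samples if r in TEST_REPS]
    return train, test
```

Splitting by repetition rather than by random window keeps overlapping windows from the same repetition out of both sets at once.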

4. Discussion

We presented a new hand-crafted feature extraction algorithm that borrows concepts from deep learning models and mixes these with the spatial information concept implemented by DTW. DTW was previously utilised in the literature of myoelectric control to compare training and testing templates [51–53], but it was not used to capture the spatial similarity concurrent with temporal information. The proposed STW feature was obtained by combining spatial information with long- and short-term memory components and an attention normalization step. The STW feature outperformed state-of-the-art feature extraction and learning algorithms such as STFS, LSF9, LSTM, CNN, and CNN + LSTM across several datasets. Our results corroborate the recent literature showing that the traditional features, e.g. HTD and AR-RMS, can no longer compete with new features [7, 13, 19, 23, 25, 45].

LSTM and CNN were significantly outperformed by their combination, CNN + LSTM, on DB7 and DB5, which demonstrates the power of spatio-temporal feature learning. Also, CNN significantly outperformed LSTM, which indicates the significance of the spatial information captured by CNN and CNN + LSTM. Interestingly, LSTM could only compete with HTD and AR-RMS features, which is in line with previous results from the literature suggesting that LSTM could potentially function better when mixed with hand-crafted features [21]. Compared with the large number of model parameters of LSTM, CNN, and CNN + LSTM [32, 41, 42], STW extracts only a small number of features, making it more suitable for future real-time implementations. This is further supported by the computation time required for STW, which falls within the range of the hand-crafted features, as shown in table 2.

The fTDD feature, with a step parameter of 15, showed a good performance on many datasets. This means that fTDD must store the features extracted from the 15th previous window to achieve this performance, which may impact the controller delay during real-time tests. In contrast, the design of STW allows looking back at the features from any window. This paper showed the simplest example of looking at the features extracted from the preceding window only, which makes STW more attractive for real-time implementations than CNN + LSTM, CNN, LSTM, and fTDD.

Research shows how to extract optimal performance [70] and mine a trillion time series sub-sequences using DTW and the potential use of DTW for real-time problems [71, 72]. A previous study conducting 35 million experiments, on 85 datasets, with dozens of rival methods concluded that nearest neighbor DTW (NN-DTW) is very hard to beat. When NN-DTW can be outperformed, it is typically by a very small margin and at the cost of a huge effort in coding/complexity of implementation, and a large time and space overhead [54]. All of this makes DTW an attractive algorithmic choice for our analysis in this paper. However, previous research utilising DTW in myoelectric control only considered time-series similarity across training and testing EMG sequences or templates from individual channels [51–53]. In comparison, we utilise DTW as a spatial feature extraction step to calculate the warped similarity between multiple EMG channels and then temporally evolve the estimated similarity across time while considering long and short-term memories.

A similar processing step was also suggested in the fTDD algorithm in [19]: fTDD fuses the features extracted from the current analysis windows (from all channels) with the same features extracted from a previous window, leaving the temporal index of that previous window as an empirical parameter for the user to optimise. In comparison to fTDD, we restricted the design of this stage to fuse the features extracted from the current set of windows at time step t with those from a nearby set of windows (1st previous, 2nd previous, 3rd previous, etc) to account for the short-term memory component. To simplify the real-time design of the algorithm, we connected each module to the first previous module. This approach limited the focus of the method to short-term memory. Furthermore, a comparison of TSD and fTDD reveals that their relative classification accuracy is not stable across datasets (figure 4). There are three main reasons for this. First, DB5 and DB7 are from limb-intact people (except two subjects in DB7), whereas the SC-EMG dataset includes data from five trans-radial amputees. In fact, the original paper that proposed TSD showed that TSD can outperform fTDD on data from amputees. This could be related to the change in the morphology of the signals after amputation and the fact that the temporal patterns of the EMG signals become harder to separate. The second reason is related to how TSD and fTDD features are extracted. fTDD focuses on the short-term temporal component only, while TSD considers both temporal and spatial components. This issue motivated us to develop STW to enjoy the best of both of these methods and provide more robust performance across all databases. Moreover, TSD does not extract the inter-temporal dependencies that exist between feature extraction windows. The third reason is the step size parameter in the fTDD algorithm, which should be optimised for data from amputees. As shown in figure 2 of the supplementary materials, the average classification error rates decrease as the step size increases. This parameter plays a key role in achieving acceptable accuracy. In this feasibility study, we used a fixed step of 15. In contrast, our proposed STW method can outperform both TSD and fTDD features for intact subjects as well as amputees with only one step.

In figure 4, we evaluated the proposed STW method in comparison with the conventional methods as well as deep learning models for the DB5, DB7, and SC-EMG datasets. Raw EMG signals were used to evaluate the performance of the deep learning models, whereas conventional feature sets and STW were considered as inputs for traditional classifiers. Figure 4 and figure 1 of the supplementary materials show that the combination of the STW feature and traditional classifiers outperforms the combination of other features and traditional classifiers as well as the decoding of raw EMG signals with deep learning classifiers. For the 3DC dataset, we decided to use only STW as input for both traditional classifiers and deep learning models. We aimed to investigate whether using the proposed STW feature as input for deep learning models can outperform the combination of the STW feature and traditional classifiers. For the MYO armband, the results show that feeding an SVM with the STW feature achieves the lowest classification error rate (8.06 ± 6.22%). For the 3DC armband, the combination of the STW features and the LSTM decoder resulted in the lowest error rate (5.10 ± 6.35%).

Attention mechanisms relating different positions of a single sequence have become an integral part of sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences. A softmax function is usually employed for attention weights. We instead computed the attention component by dividing each extracted feature by the total sum of the features, which ensures that all features sum up to unity.

The major difference between the proposed method and LSTM is that LSTM learns a recurrent feature representation from the data, while STW extracts a set of hand-crafted DTW features. Additionally, STW blocks have a simpler design than LSTM and are more suitable for real-time implementations.

One possible limitation of this study is that the DTW algorithm may prove too complex for real-time implementation and may require additional hardware for parallel processing. However, our results show that the time required to extract STW features is smaller than that for the LSF9, TSD, and STFS features. Additionally, whether using an advanced machine learning method could enhance the quality of prosthesis control in real-life settings is an open question and falls outside the scope of this study. It is worth mentioning that in this study we investigated a wide range of TD features and did not evaluate spectral or time-frequency features, as their enhanced separability compared to the TD features is unknown and, furthermore, it has been shown that TD features can outperform time-frequency features [73]. Beyond computational complexity, we do not envisage any challenges in using the STW method for high-density EMG. We did not investigate the robustness of STW to outliers because the utilised databases are almost outlier-free, as they were recorded in controlled laboratory conditions.

In conclusion, we proposed a new paradigm for grasp level control of prosthetic hands. We demonstrated the feasibility of decoding EMG signals in both able-bodied and below-elbow amputees. Our algorithm warrants further investigation with real-time, user-in-the-loop experiments with people with upper-limb difference.

Acknowledgment

This work is supported by EPSRC, UK under grant EP/R004242/2.

Data availability statement

No new data were created or analysed in this study. The code to extract the STW feature can be downloaded from https://github.com/RamiKhushaba.
