Deep learning-based artificial vision for grasp classification in myoelectric hands

Objective. Computer vision-based assistive technology solutions can revolutionise the quality of care for people with sensorimotor disorders. The goal of this work was to enable trans-radial amputees to use a simple, yet efficient, computer vision system to grasp and move common household objects with a two-channel myoelectric prosthetic hand. Approach. We developed a deep learning-based artificial vision system to augment the grasp functionality of a commercial prosthesis. Our main conceptual novelty is that we classify objects with regard to the grasp pattern without explicitly identifying them or measuring their dimensions. A convolutional neural network (CNN) structure was trained with images of over 500 graspable objects. For each object, 72 images, at 5° intervals, were available. Objects were categorised into four grasp classes, namely: pinch, tripod, palmar wrist neutral and palmar wrist pronated. The CNN setting was first tuned and tested offline and then in realtime with objects or object views that were not included in the training set. Main results. The classification accuracy in the offline tests reached 85% for the seen and 75% for the novel objects, reflecting the generalisability of grasp classification. We then implemented the proposed framework in realtime on a standard laptop computer and achieved an overall score of 84% in classifying a set of novel as well as seen but randomly-rotated objects. Finally, the system was tested with two trans-radial amputee volunteers controlling an i-limb Ultra™ prosthetic hand and a Motion Control™ prosthetic wrist, augmented with a webcam. After training, subjects successfully picked up and moved the target objects with an overall success of up to 88%. In addition, we show that with training, subjects' performance improved in terms of time required to accomplish a block of 24 trials despite a decreasing level of visual feedback. Significance. The proposed design constitutes a substantial conceptual improvement for the control of multi-functional prosthetic hands. We show for the first time that deep-learning based computer vision systems can enhance the grip functionality of myoelectric hands considerably.

Supplementary material for this article is available online. Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Prosthetic hands can provide a route to functional rehabilitation of upper-limb amputees and people with congenital motor deficit. According to recent statistics, in the UK alone, there are 473 new upper-limb (133 trans-radial) referrals every year, of which 245 are in the age range of 15 to 54 years [1]. Lifetime care for this group can be remarkably expensive. Trauma is the most prevalent cause of limb loss at ∼30% [1]. In the US, there are around 500 000 upper-limb amputees [2]. Advanced prosthetic hands can dramatically improve users' quality of life by enabling them to carry out daily living activities.
Current commercial prosthetic hands are typically controlled via myoelectric signals, that is, the electrical activity of muscles recorded from the skin surface of the stump [3,4]. Despite considerable technical advances and improvements in the mechanical features, e.g. size and weight, of the prosthetic hands, the control of these systems is still limited to one or two degrees of freedom [4,5]. In addition, the process of switching a prosthetic hand into an appropriate grip mode, e.g. pinch, can be cumbersome or would require an ad-hoc solution, such as using a mobile application or an electrocutaneous menu [6].
For several decades, research on prosthetic control has focused on myoelectric pattern recognition [3,4]. Classification and proportional control of myoelectric signals have been extensively studied for discrete decoding of wrist and elbow movements [7–14], grasp type [15–17], as well as individuated finger movements [18], with accuracies as high as 90% [19] in amputee subjects. Although reasonable classification accuracies are achieved, there is still a considerable gap between laboratory-based research and the widespread clinical use of pattern recognition-based systems. Lack of robustness, the number and movement of the electrodes, as well as modulation of the electromyogram (EMG) signal activation patterns with varying force and orientation of the arm may be the main reasons [4,14,20]. To become fully integrated into an amputee's sensorimotor repertoire, the performance of hand prostheses must still improve greatly [21–24]. The COAPT system is the first commercial myoelectric controller unit to employ pattern recognition.
Specifically, in the case of using computer vision, it was shown that object shapes can be quantised such that appropriate grasp types and sizes can be determined. Došen et al [29,30] demonstrated a dexterous hand with an integrated vision-based control system. The user controlled the prosthetic hand and the activation of the camera with myoelectric signals. A simple object detection method was used, in conjunction with distance information estimated via ultrasound. This structure allowed them to approximate the size of the object of interest. The calculated size was then passed to a rule-based reasoning algorithm to select the appropriate grasp accordingly.
They achieved 84% accuracy in estimating the grasp type and size for a limited set of 13 objects (93%, grasp only). To achieve this level of accuracy, each trial took on average ∼4 s on a dual-core 2 GHz PC, since classification of 10 consecutive snapshots was required for each decision. Marković et al [31] demonstrated a semi-autonomous control mechanism in which stereo-vision provided depth information. In addition, their solution offered artificial proprioceptive feedback, via visual feedback to the user, about the grip aperture size by using augmented reality (AR). They incorporated sophisticated algorithms for image segmentation, 3-dimensional point cloud generation and geometrical model fitting. These algorithms, however, used a rule-based model similar to that proposed earlier by Došen et al [29,30]. With such improvements, the process of identifying the object size and the appropriate grasp became significantly faster, about 1 s, on an Intel i5 core (2.73 GHz) laptop with 8 GB of RAM. They achieved an overall accuracy of 81% for the successful accomplishment of the task (∼94% in grasp identification). However, without the AR feedback, this accuracy dropped to 73%. In [29–31], the authors included four grasp types, namely, palmar, lateral, tri-digit (here: tripod) and bi-digit (here: pinch).
Marković et al [33] further exploited a data fusion technique to control a prosthetic hand. A plethora of modalities, namely, myoelectric recording, computer vision, inertial measurements and embedded prosthesis sensors (position and force) were utilised to provide realtime simultaneous, proportional and semi-autonomous control. The shape, the size and the orientation of objects were estimated with RGB-D imaging and integrated with prosthesis orientation and user behaviour via inertial sensing. Such a sophisticated architecture led to less than 1% cumulative trial failure rate. This setting was integrated into a prosthetic wrist, but only palmar and lateral grasps were considered.
Computer vision has been widely used in robotic grasp and object manipulation [35–38]. Saxena et al [35] pioneered the field by providing the capability of grasping novel (unseen) objects for robotic hands by utilising a stereo camera. Without building a 3-dimensional model, they estimated the 3-dimensional location of the best grasp by triangulation. The grasp location estimator algorithm was trained on synthetic images in a supervised learning regime. Kootstra et al [36] developed an early cognitive vision architecture for grasping unknown objects. Without any segmentation or preprocessing, they were able to generate two- and three-finger grasps based on contours and surface structure provided by stereo cameras. With the advancement of the deep learning structures [39], robotic grasp research has been radically upgraded. For instance, Lenz et al [37] introduced RGB-D images to a two-step cascade deep learning system. Given the image of an object to grasp, firstly a small deep network determined the suitable grasping points for the object, based on its position, size and orientation. Then, a second network was trained to pick the best candidate among the grasping spots that were identified by the first network. Group regularisation was utilised to balance learning with respect to information extracted from different modalities, such as the colour of the object, depth and surface normals. Similarly, Kopicki et al [38] provided a one-shot learning mechanism for recognising the most appropriate grasp for novel objects. They generated thousands of grasp candidates for images taken by a depth camera and optimised the combination of two learned model types: a contact model and a hand-configuration model. Table 1 shows a summary of structures that utilised vision in prosthetic and robotic applications.
We set out to translate the advances in deep learning in the robotics and computer vision research for control of hand prostheses. Benefiting from the flexibility that a deep learning structure offers, we developed an inexpensive vision-based system suitable for use in artificial hands. This solution can identify the appropriate grasp type for objects according to a learned abstract representation of the object rather than the explicitly-measured object dimensions. This key concept is illustrated in figure 1. In this way, objects are not classified based on the object category or identity, but based on the suitable grasp pattern. A key question would therefore be whether this deep learning-based approach generalises to unseen objects. We predict that a deep network trained for grasp recognition can extract high-level and grasp-related features from objects and discard other unnecessary details. These features could include object size and orientation. This approach is therefore conceptually different from object recognition in which object details matter.
To learn this abstract representation, we use a convolutional neural network (CNN) architecture [39]. There is mounting evidence that CNN-based structures can learn and classify visual patterns efficiently if provided with a large number of labelled training samples [40–45]. The components of the CNN structure, namely, local connectivity, parameter sharing and pooling, make it reasonably invariant against object shift, scale and distortion. These features make the CNN structure a suitable candidate for upper-limb prosthetics applications. We therefore trained a CNN structure to identify the appropriate grasp for a database of household objects. The CNN structure, or in fact any other supervised learning architecture in which there exists a set of predefined labels, lacks the ability to generalise to novel objects that do not belong to any of the defined output object categories. Therefore, during testing, unseen objects will be misclassified to one of the existing classes. However, identification of novel objects is crucial in prosthetic applications, since in everyday life people effortlessly pick up a variety of objects that they have never seen before. Moreover, the number of categories of household objects can be excessively large, making object identification for grasp selection impractical.

Methods
In this section, we give a detailed description of the equipment and methods that we used offline and in the realtime experiments, both in computer-based tests and when amputee users controlled the prosthesis. To enhance clarity and in the interest of brevity, we merge the description of the methods that were common in all experiments.

Image databases
To train the CNN structure, we used the Amsterdam library of object images (ALOI) [46]. The ALOI database offers a rich set of images of household objects. To enable realtime testing, we augmented the ALOI dataset with our own dataset, which we call the Newcastle Grasp Library (made freely available online, see Acknowledgements). In the following, we describe both image libraries.

Amsterdam library of object images (ALOI).
The ALOI database [46] includes the images of 1000 common objects. Within this library, 250 objects have been photographed at a second zoom rate. We discarded these 250 objects. For each of the remaining 750 objects, the database includes 72 pictures, taken at 5° intervals against a black background. The camera was at 124.5 cm distance and 30 cm altitude from the objects. The camera resolution was 768 × 576 pixels. We subjectively selected 473 of the objects in four different classes of pinch, tripod, palmar wrist neutral and palmar wrist pronated. Other objects were either not graspable or could be picked with more than one grasp type. All images were first converted to grey-scale. They were then downsampled to a resolution of 48 × 36 pixels using the imresize function in MATLAB®.
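The grey-scale conversion and downsampling step can be illustrated with a minimal Python sketch. It is not the MATLAB® code used in this study: the luma weights are the common ITU-R BT.601 values (an assumption, since the conversion formula is not stated here), and plain block averaging stands in for imresize, whose interpolation differs. The function names `to_grey` and `block_mean_downsample` are hypothetical.

```python
def to_grey(rgb):
    """Convert an image given as rows of (r, g, b) tuples to grey-scale floats."""
    # BT.601 luma weights: an assumption, not taken from the paper.
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb]


def block_mean_downsample(img, factor):
    """Downsample by averaging non-overlapping factor x factor blocks.

    Stand-in for MATLAB's imresize; the interpolation scheme is different.
    """
    height, width = len(img), len(img[0])
    if height % factor or width % factor:
        raise ValueError("image size must be a multiple of the factor")
    out = []
    for r in range(0, height, factor):
        row = []
        for c in range(0, width, factor):
            block = [img[rr][cc]
                     for rr in range(r, r + factor)
                     for cc in range(c, c + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


# A 768 x 576 pixel image (576 rows, 768 columns) reduced by a factor of 16
# gives the 48 x 36 pixel resolution used for training.
image = [[(10, 20, 30)] * 768 for _ in range(576)]
small = block_mean_downsample(to_grey(image), 16)
```

The factor of 16 follows directly from the two resolutions quoted above (768/48 = 576/36 = 16).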

Newcastle grasp library.
Access to the same objects that were used to create the ALOI database was not possible. Therefore, to enable realtime analysis, 71 objects in four grasp classes were selected for photography. We synchronised a Crayfish 55 turntable (Seabass, UK) with a Canon Kiss X4 DSLR camera (resolution 18 Megapixel, 5184 × 3456 pixels) to take 72 pictures of each object (at 5° intervals) against a black background. Table 2 indicates the number of objects in each grasp group that we used for further analysis.
To ensure object size is taken into account, we positioned the camera at a fixed distance from objects when collecting the images. The distance between the camera and the object was 60 cm and the camera was 15 cm higher than the object. With this setting we could achieve images of objects that were comparable in size with those available in the ALOI database. All images were converted to grey-scale and downsampled to a resolution of 48 × 36 pixels to train the CNN setting. Figure 2(A) represents some of the objects we selected from the ALOI database. Figure 2(B) shows all the additional objects included in the Newcastle Grasp Library. A list of all the objects that we chose and the corresponding grip [...]
In the above equations, $I_{n,m}$ denotes the intensity of pixel $(n, m)$.
For classification of images into grasp groups, we examined two CNN architectures, one-layer and two-layer, and explored the trade-off between accuracy, generalisability and computational complexity. We first explain briefly the setting of the developed CNN structure. In the following, all equations are presented in the vectorised format.
Assume that there are $m_l$ input maps of size $R_l \times U_l$ in each of the CNN layers $l$, with $m_0$ denoting the number of images in the 0-th layer. $k_l$ features are extracted at each layer by convolution with $C_l \times D_l$ ($C_l < R_l$, $D_l < U_l$) kernels according to [...] and the $j$-th kernel in the $l$-th layer ($W_j^l$) and adding the bias $b_j^l$. The output of layer $l$ is then calculated by elementwise application of the activation function $a(\cdot)$. In the above equation, the asterisk sign $*$ refers to a valid convolution, that is, a convolution performed inside the image borders. Finally, we tested a range of activation functions, namely, the logistic, hyperbolic tangent and rectified linear unit (ReLU) functions. We empirically found that the ReLU function results in the highest performance and hence we used it in this study.
The ReLU activation function $a(\cdot)$ can be written as $a(z_{r,u}) = \max(0, z_{r,u})$, where $z_{r,u}$ denotes an element of $Z$ [47]. Our one-layer CNN comprised one convolution ($C_1$) and one sub-sampling ($S_1$) sub-layer. In the two-layer CNN architecture, however, we had two convolution stages ($C_1$ and $C_2$) and one sub-sampling stage ($S_2$), of which the latter two were in the second layer.
In both CNN settings, we used five kernels ($W_j^1$, $j = 1, \dots, 5$) of size 5 × 5 and the resultant feature maps were sub-sampled by max-pooling [48] by a factor of two. We applied the max-pooling operation to ensure that salient elements in each feature map are retained. With the max-pooling operation, each sub-region is replaced with the maximum value of that sub-region. Figure 3 illustrates the two-layer CNN setting with all details in terms of kernels and dimensions that we used in this study. This setting was adopted after extensive empirical testing with different numbers of layers and filters, filter and pooling sizes and activation functions. Among these, we selected the setting in figure 3 that maximised the overall classification performance, especially in identifying the appropriate grasp for novel objects.
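The building blocks described above (valid convolution with a bias, ReLU and max-pooling by a factor of two) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the implementation used in the study: it handles a single input map and a single kernel, assumes that a 48 × 36 pixel image is stored as 36 rows by 48 columns, and the function names are hypothetical.

```python
def conv2d_valid(image, kernel, bias=0.0):
    """Valid 2-D convolution: the kernel never leaves the image borders."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for r in range(rows - k_rows + 1):
        out_row = []
        for u in range(cols - k_cols + 1):
            total = bias
            for c in range(k_rows):
                for d in range(k_cols):
                    # Flipping the kernel makes this a true convolution
                    # rather than a cross-correlation.
                    total += (image[r + c][u + d]
                              * kernel[k_rows - 1 - c][k_cols - 1 - d])
            out_row.append(total)
        out.append(out_row)
    return out


def relu(feature_map):
    """Elementwise rectified linear unit, max(0, z)."""
    return [[max(0.0, z) for z in row] for row in feature_map]


def max_pool(feature_map, factor=2):
    """Replace each non-overlapping factor x factor sub-region by its maximum."""
    rows = len(feature_map) // factor
    cols = len(feature_map[0]) // factor
    return [[max(feature_map[r * factor + i][c * factor + j]
                 for i in range(factor) for j in range(factor))
             for c in range(cols)]
            for r in range(rows)]


# One 36 x 48 input map, one 5 x 5 kernel, then ReLU and 2 x 2 max-pooling.
image = [[1.0] * 48 for _ in range(36)]
kernel = [[0.04] * 5 for _ in range(5)]
conv = conv2d_valid(image, kernel)   # 32 x 44 feature map
pooled = max_pool(relu(conv))        # 16 x 22 feature map
```

Under these assumptions, a 5 × 5 valid convolution shrinks each side by four pixels and pooling by a factor of two then halves it, which is why the feature map sizes fall so quickly with depth.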
Having $m$ examples $x^{(i)}$ and their corresponding class labels $y^{(i)}$ in a training set as $\{(x^{(1)}, y^{(1)}), \dots, (x^{(m)}, y^{(m)})\}$, the matrix $X$ has sample $x^{(i)}$ in its $i$-th column. The matrix of model parameters $\theta$ can be estimated by optimising the following cost function [...] where $1\{\cdot\}$ is the 'indicator function', that is, $1\{\text{a true statement}\} = 1$ and $1\{\text{a false statement}\} = 0$ [51].
[...] where $(\cdot)^{T}$ denotes the vector transpose operation. Training was carried out through back propagation using the mini-batch momentum gradient descent algorithm [52] for optimising the learned filters within each iteration. We avoided over-fitting by using Tikhonov regularisation in the final cost function during training the CNN structure where the matrix $W_j^l$ in the last layer is optimised.
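As an illustration of the training objective, the sketch below evaluates a regularised classification cost and one momentum update. It rests on assumptions that the surviving text does not confirm: the cost is taken to be the standard softmax (multinomial logistic) negative log-likelihood, which the indicator-function notation suggests; the Tikhonov term is applied to all entries of $\theta$; and the learning rate, momentum coefficient and function names are hypothetical.

```python
import math


def softmax_cost(theta, samples, labels, lam=0.0):
    """Mean negative log-likelihood of a softmax classifier plus an L2 penalty.

    theta   : list of K weight vectors, one per class (assumed layout)
    samples : list of m feature vectors x^(i)
    labels  : list of m class indices y^(i) in 0..K-1
    lam     : Tikhonov (L2) regularisation weight, an assumed hyperparameter
    """
    cost = 0.0
    for x, y in zip(samples, labels):
        scores = [sum(t * xi for t, xi in zip(row, x)) for row in theta]
        top = max(scores)  # subtract the maximum for numerical stability
        log_norm = top + math.log(sum(math.exp(s - top) for s in scores))
        # The indicator 1{y = j} keeps only the score of the true class.
        cost -= scores[y] - log_norm
    cost /= len(samples)
    cost += 0.5 * lam * sum(t * t for row in theta for t in row)
    return cost


def momentum_step(params, grads, velocity, lr=0.1, mu=0.9):
    """One momentum gradient descent update on a flat list of parameters."""
    new_velocity = [mu * v - lr * g for v, g in zip(velocity, grads)]
    new_params = [p + v for p, v in zip(params, new_velocity)]
    return new_params, new_velocity


# With all-zero parameters every class is equally likely, so the cost
# equals log(K); here K = 4 grasp classes.
zero_theta = [[0.0, 0.0] for _ in range(4)]
initial_cost = softmax_cost(zero_theta, [[1.0, 2.0], [3.0, 4.0]], [0, 3])
```

In mini-batch training, `momentum_step` would be called once per batch with gradients computed on that batch only.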

Cross-validation
To verify the generalisability and robustness of grasp classification, we examined two forms of cross-validation: within- and between-object cross-validation. In the following we introduce them and provide the rationale for using them. Both CNN settings (one- or two-layer) were tested in both of the cross-validation schemes below. All images were converted to grey-scale and downsampled before further analysis.

Within-object cross-validation (WOC).
Firstly, we evaluated the ability of the proposed structure in classifying previously seen objects. The training set included 90% (65 of 72) of the views for each object in each grasp class. The remaining 10% of the views for each object were allocated to the testing set. We randomly selected 10 different training and testing sets to quantify the sensitivity of the classifier to the choice of views. Figure 4 illustrates an example of splitting images of one object into the training and testing sets.
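The within-object split can be sketched as below: for one object, 65 of the 72 views go to training and the remaining 7 to testing. This is a minimal sketch; whether the same views were held out for every object or drawn separately per object is not stated, so the per-call random draw, the seed and the function name are assumptions.

```python
import random


def within_object_split(n_views=72, n_train=65, seed=0):
    """Randomly split the view indices of one object into train and test sets."""
    rng = random.Random(seed)
    views = list(range(n_views))
    rng.shuffle(views)
    return sorted(views[:n_train]), sorted(views[n_train:])


train_views, test_views = within_object_split()
# View index v corresponds to a turntable rotation of 5 * v degrees.
test_angles = [5 * v for v in test_views]
```

Repeating the call with ten different seeds would give ten train/test partitions of the kind used to gauge sensitivity to the choice of views.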

Between-object cross-validation (BOC).
To be able to identify the appropriate grasps for unseen objects, we carried out the BOC test. In the BOC scheme, an object and its views were either wholly seen or unseen.
For the ALOI database, the training set included ∼90% of all the object categories in all grasp groups with all of their different poses; for instance all 124 objects of the 'palmar wrist pronated' class with all their 72 poses were selected for training. The remaining ∼10% of the object categories were allocated to the testing set, that is, 13 objects in this class.
An example for random selection of 4 objects from the Newcastle Grasp Library in the 'palmar wrist pronated' class for the test set is illustrated in figure 5. The above procedure was repeated 10 times independently. Table 3 reports the exact number of objects selected for training and testing from each database in the BOC test.
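In contrast to the within-object scheme, the between-object split holds out whole objects, so that none of the 72 views of a test object is seen in training. A minimal sketch follows, assuming that roughly one tenth of the objects in each grasp class is held out using integer division; the function name and the seed are hypothetical.

```python
import random


def between_object_split(objects_by_class, seed=0):
    """Hold out about 10% of the objects in every grasp class for testing."""
    rng = random.Random(seed)
    train, test = {}, {}
    for grasp, objects in objects_by_class.items():
        shuffled = list(objects)
        rng.shuffle(shuffled)
        n_test = max(1, len(shuffled) // 10)  # assumed rounding rule
        test[grasp] = shuffled[:n_test]
        train[grasp] = shuffled[n_test:]
    return train, test


# 137 objects in one class split into 124 for training and 13 for testing,
# matching the ALOI 'palmar wrist pronated' example in the text.
library = {"palmar wrist pronated": list(range(137))}
train_objects, test_objects = between_object_split(library)
```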

Statistical analysis
A two-way repeated-measures ANOVA was conducted that examined the main effects of cross-validation type (BOC versus WOC) and number of layers (1 versus 2) in the CNN structure on the offline classification results. In this analysis, each of the 10 folds of cross-validation was treated as an independent sample. In the realtime experiments with amputee subjects, we compared the average block accomplishment times in blocks 1 and 6 with a paired t-test, for each participant independently. All tests were performed in SPSS® 22.
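For reference, the paired t-test statistic used to compare two blocks can be computed as in the sketch below. The analysis in this study was run in SPSS®; the code only illustrates the statistic, the function name is hypothetical and the numbers in the example are made up for illustration.

```python
import math


def paired_t(first, second):
    """Return the paired t statistic and its degrees of freedom."""
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean = sum(diffs) / n
    variance = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(variance / n), n - 1


# Made-up trial times (in seconds) for four trials in two blocks.
t_stat, dof = paired_t([3.0, 5.0, 7.0, 9.0], [1.0, 2.0, 3.0, 4.0])
```

The p-value is then read from a t distribution with `dof` degrees of freedom, a step left out here because it needs a statistics library.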

Computer-based realtime performance analysis
We implemented the introduced deep-learning based system in realtime. We carried out the realtime experiments with the learned CNN parameters of the BOC setting. This was because in the real-life cases, it is likely that novel objects are encountered.
We deliberately included this stage before the realtime experiment with amputee subjects to marginalise the effect of the users' behaviour on the image acquisition step. One potential influence is the distance between the camera and the object, which can be changed by the user during the realtime experiment.
To perform this test, we used an inexpensive web camera (Logitech Quickcam® Chat), instead of the high-resolution Canon DSLR camera that we used previously to make the Newcastle dataset. The webcam was attached to a photography tripod stand. The distance between the webcam and the object was fixed at 60 cm and the webcam was 15 cm higher than the target object, such that we could take pictures in the same way as for the Newcastle Grasp Library. The camera was connected to the recording laptop through a USB link. The imaging resolution was set to 640 × 480 pixels. By clicking on a command button on a MATLAB®-based graphical user interface (GUI), an image was acquired and a series of image processing operations were executed to detect the object in the scene and remove the background. Figure 6 illustrates all of the preprocessing steps. The output of the final step, that is G, was resized to 48 × 36 pixels and then normalised according to section 2.2, before feature extraction and classification. Preprocessing was required to remove the background.
We used the trained two-layer CNN for the realtime tests. The test process was repeated for 7 different random views of 24 objects (6 in each grasp group). In this analysis, 16 out of the 24 (67%) objects were not seen by the trained CNN and hence were novel.
All offline and computer-based realtime tests were implemented in MATLAB® on a personal computer with an Intel Core i5-47670 CPU (3.4 GHz), running a 64-bit Windows 7 operating system, with 32 GB RAM.
The study was approved by the Newcastle University ethics committee and carried out at the School of Electrical and Electronic Engineering. Participants signed the experiment consent form.
As in the computer-based realtime experiments, participants sat such that the prosthesis webcam was roughly 60 cm away from and 15 cm higher than the object. Before the start of the experiment, we confirmed that at this distance, they could maintain a comfortable posture to take a picture with the camera and reach the objects readily. In each trial subjects reached one object. As such, in none of the trials was the object of interest occluded by any other object.

2.7.2. Overall control structure and system components.
A general flow diagram for our realtime experiment is shown in figure 7(A). Figure 7(B) illustrates the implemented programme. In the following, we describe the main components of the programme flow. As we will fully describe in section 2.8, the realtime experiment consisted of 6 blocks. The main difference between blocks 1-5 and block 6 was that in the last block an error correction routine was enabled. This additional feature was achieved with the linkages and operations within the grey box in figure 7(B). These connections were inactive in blocks 1-5. Otherwise, all blocks used the same programme for controlling the prosthesis.
A short (300 ms) flexion of wrist muscles was required to trigger the webcam to take a snapshot. After a grasp was identified, the prosthesis was controlled proportionally according to the input EMG signals recorded from the wrist flexor and extensor muscle groups. Long (3 s) extensions reset the grasp and opened the prosthesis. In block 6, if the user did not approve the classifier output, they could re-aim the prosthesis at the object and issue a long (2 s) flexion of the wrist muscle to re-open the prosthesis, reset the grasp and take a new snapshot. From that point onwards, the control mechanism was exactly as it was in blocks 1-5. The user could repeat this error correction approach until an appropriate grasp was identified.

Myoelectric control.
The EMG signals were recorded with two Delsys® Trigno™ lab wireless EMG electrodes. The electrodes were placed on the wrist flexor and extensor muscle groups on the forearm after skin preparation. Surface EMG signals were band-pass filtered between 20 Hz and 450 Hz before sampling at 2 kHz via a Trigno Digital SDK, executed under MATLAB®.
The EMG signals were then transformed into analogue control signals such that 0 and 1 represented the EMG at rest and at a comfortable level of contraction (typically 10-15% of the maximum voluntary contraction, MVC) respectively. Generating muscle activity at this low amplitude may be sensitive. However, as we have demonstrated earlier [10,24,53–55], with practice participants can learn to contract their muscles reliably at this low level of the MVC to perform a computer task or to control a prosthesis. One reason may be that the magnitude of the signal-dependent motor noise at such low percentages of the MVC is very small [56].
For each EMG channel, a control signal $c$ was computed every 100 ms by smoothing (with a rectangular window) the preceding 500 ms of rectified EMG after correction for offset according to [...] where $|\mathrm{EMG}_k(t)|$ denotes the rectified activity of muscle $k$ at time $t$. The coefficient $\alpha_k$ normalises the control signal by muscle activity at the comfortable contraction level. During a short (15 min) 'familiarisation and calibration' block, subjects were provided with visual feedback of the raw EMG data in two channels and asked to imagine flexion and extension of the wrist alternately. We ensured that both participants were able to contract the two muscle groups independently before further calibration. To that end, we empirically determined a separate threshold activity for the two control signals. With provision of realtime feedback on a computer screen, we asked the participants to activate one muscle group and raise the corresponding control signal above the threshold whilst keeping the control signal of the other muscle group below its threshold. More details with regard to the calibration can be found in our earlier work [53,54]. For subject D, the control signal was recalibrated due to a posture change half-way through the experiment.
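The control signal computation can be sketched as a moving average of the rectified EMG, updated every 100 ms over the preceding 500 ms. The exact form of the offset correction and of the normalising coefficient is not recoverable from the text, so this sketch assumes that the resting level is subtracted from the smoothed rectified signal and that the coefficient equals the reciprocal of the difference between the comfortable-contraction and resting levels. The function and parameter names are hypothetical.

```python
def control_signal(emg, rest, comfortable, fs=2000, window_s=0.5, step_s=0.1):
    """Map raw EMG samples of one channel to a control signal.

    emg         : raw EMG samples from one channel
    rest        : mean rectified EMG at rest (assumed offset)
    comfortable : mean rectified EMG at the comfortable contraction level
    fs          : sampling rate in Hz (2 kHz in this study)
    """
    window = int(round(window_s * fs))  # 500 ms rectangular window
    step = int(round(step_s * fs))      # one new value every 100 ms
    alpha = 1.0 / (comfortable - rest)  # assumed form of the normalisation
    signal = []
    for end in range(window, len(emg) + 1, step):
        segment = emg[end - window:end]
        mean_rectified = sum(abs(v) for v in segment) / window
        signal.append(alpha * (mean_rectified - rest))
    return signal


# One second of synthetic EMG held at the comfortable level gives a
# control signal of about 1; values at rest would give about 0.
emg = [0.5 if i % 2 == 0 else -0.5 for i in range(2000)]
c = control_signal(emg, rest=0.1, comfortable=0.5)
```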

Experimental protocol
The realtime experiment comprised 6 blocks of a pick and place task. In each block, subjects grasped, moved and placed 24 objects. The order of objects in blocks was pseudorandomised. This order however remained unchanged between blocks and subjects. In each trial the experimenter placed the object at the standard distance on the table in front of the participant. In blocks 1 and 2, subjects had realtime visual feedback of the measured raw EMGs as well as the calculated control signals on a computer screen. In addition, they could see the webcam video stream, the snapshot that they took and the classification outcome. In blocks 3 and 4, only the raw EMG signals and the control signals were presented as feedback. In block 5, subjects had no computer-based visual feedback at all. Finally, in block 6, similar to block 5, subjects had no visual feedback. They however could reject the grasp identified by the classifier by re-aiming the webcam at the object to take a new picture. This allowed the CNN structure to classify the new image and identify the correct grasp. Due to technical reasons, subject D could not use the error correction function.
With this arrangement of blocks, we combined the familiarisation and testing steps such that the experiment was as short as possible. We therefore analysed the data from familiarisation blocks 1 to 4 as well as data in blocks 5 and 6.
For the experiment with subject M, we allocated a fixed 3 s interval at the beginning of each trial to provide enough time for the participant to settle into the trial before activating the muscles. After the first few trials, we realised that this was a suboptimal approach because the subject enthusiastically flexed the wrist flexor muscles very early to take a picture, before the end of this period. In the experiment with subject D, the protocol was changed slightly such that an audio beep cued the start of the trial, instructing the subject to flex the wrist flexor muscles to activate the webcam. In addition, we made the prosthesis preshaping period shorter to improve responsiveness. As will be seen in the results, the choice of the trial start protocol and preshaping time affected the total trial duration. However, they did not influence the answer to the main question of this work, that is, whether deep learning structures can be used to offer grasp classification without explicitly measuring object dimensions. The realtime test was implemented in MATLAB® on a Lenovo laptop with an Intel Core i7-4559U CPU (2.10 GHz), running a 64-bit Windows 7 operating system, with 8 GB RAM. Table 5 summarises the datasets used in different experimental conditions for training and testing the CNN structure.

Results
In this section, three categories of results are presented. The first set of results are offline grasp classification scores. For this analysis we used the images of the ALOI database together with the high-resolution images collected for the Newcastle Grasp Library. The aim of this analysis was to test the idea of grasp identification with CNN, fine-tune the CNN structure and identify the most effective classification architecture for the realtime experiments. The second set includes the classification results of the computer-based realtime experiments in which all images were taken with the webcam. The third set of results reports the performance achieved by the amputees in using the proposed deep learning-based vision system for prosthetic grasp in the realtime scenario.
Figure 8 shows the results of the WOC and BOC cross-validation schemes. Both were performed on the combined ALOI and Newcastle libraries. We compared the results of the one-layer and two-layer CNN structures.

Offline grasp classification
A repeated-measures two-way ANOVA revealed no statistical difference between the classification scores achieved by using a one-layer (80.0%) or a two-layer (79.9%) CNN feature extraction structure ($n = 10$, $F_{1,9} = 0.001$, $p = 0.98$). Figure 8(C), however, shows that the difference between the average classification scores for the main effect of the cross-validation type (WOC: 85.29% versus BOC: 74.74%) was statistically significant ($n = 10$, $F_{1,9} = 32.08$, $p < 10^{-3}$). This was predictable since generalisation across views of an object would be less challenging than generalisation to novel objects in the BOC case.
Specifically, in the BOC setting, the two-layer CNN structure led to a 0.7 percentage point higher classification score (1-layer: 74.38%, 2-layer: 75.10%) when compared to the one-layer CNN setting. This difference was not statistically significant (post-hoc analysis with a paired t-test, $t_9 = 0.28$, $p = 0.78$). For the following realtime experiments, we chose to proceed with the two-layer CNN setting, due to better average performance in three of four grasp classes, figure 8(B).
Figure 9 demonstrates the classification performance achieved in a realtime, but computer-based, setting. For these realtime experiments, one of the ten aforementioned trained CNN structures, which presented a reasonable grasp classification of novel objects during offline BOC tests, was selected. As such, we adopted the CNN parameters that resulted in an average performance of ∼70%, from within a range of settings that gave performances between 64% and 75%. Having six distinct objects in each grasp group and examining seven random views of each enabled us to simulate a real scenario closely before bringing the variability caused by the user into account. In figure 9, the proposed grasp for each object and view is shown. In an ideal case, that is 100% correct grasp classification, each bar would be in a single colour. Emergence of different colours indicates incorrect classification. With this computer-based test, we quantified the time taken for the system to identify a grasp (correct or incorrect) from a low-resolution input image. The time periods needed for feature extraction with the CNN structure and classification were 78 ± 6 ms and 3 ± 0.03 ms, respectively.

Realtime test platform with an amputee user in the loop
We tested the whole system with two trans-radial amputee volunteers. Figure 10 shows a few representative trials, including the recorded myoelectric signals, the acquired images and the classification results. These data are from the experiment with subject M. Figure 10(A) illustrates a trial in which the participant oriented the hand such that a reasonable image of the object was acquired; image preprocessing and grasp classification were performed accurately and the correct grasp was identified. In this trial, the subject exhibited an average performance in the pick and place operation (∼7 s). Figure 10(B) shows a trial in which an incorrect classification took place, that is a palmar wrist pronated instead of a tripod. The participant, however, accepted the incorrect grasp and accomplished the trial. Figure 10(C) shows a trial in which the classification was initially incorrect because the hand was not oriented such that the object was fully in the scene. Repeated attempts by the participant were unsuccessful until the fourth snapshot. Once the correct grasp, that is palmar wrist neutral, was selected, the participant completed the trial.
In the realtime experiments with amputee subjects, we included 8 seen, but randomly-rotated, objects as well as 16 novel objects. With this setting, we tested both within- and between-object generalisation in realtime. Figure 11 summarises all results of the realtime experiment for the two volunteers, M (left column) and D (right column). Figure 11(A) shows the classification accuracy achieved in each block with respect to individual classes. Importantly, in blocks 1 to 5, we report the percentages of correct classification, that is, we only consider trials for which the identified grasp matched exactly the label that we assigned to that particular object. For block 6, the same terms apply except that the classification results are reported for the final attempt that the user made in each trial, as error correction was enabled. Figure 11(B) shows the overall accuracy in blocks 1 to 6 and in block 6 only, across all grasps. In addition, we report the percentage of trials in which the classification was incorrect and the subject either accepted the offered grasp and finished the trial successfully or did not accept the offered grasp. In the latter case, if the subject could not complete the trial, the experimenter stopped the trial. Participants were on average more successful in block 6 than across all blocks: 79% versus 73% for subject M and 86% versus 73% for subject D. When acceptable errors (error subtype 1, as explained in figure 11(B)) were included, subjects M and D accomplished 88% and 87% of all trials over the 6 blocks, respectively. Figure 11(C) shows the average trial accomplishment time for each block for participants M and D. For both subjects, block 1 took the longest. For subject M, the reduction in the accomplishment time (across the 24 trials) in block 6 versus block 1 was only marginally significant (block 1: 21.4 ± 8.1 s, block 6: 16.7 ± 9.3 s, paired t-test, n = 24, t 23 = 1.81, p = 0.08).
This reduction for subject D, however, was statistically significant (block 1: 30.7 ± 17.2 s, block 6: 19.3 ± 25.7 s, paired t-test, n = 24, t 23 = 2.26, p = 0.03). The accomplishment time decreased despite the increasing difficulty of the task. As mentioned before, in blocks 3 and 4, the webcam output was not shown on the screen, and in blocks 5 and 6, visual feedback on the screen was withheld entirely.
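The paired t-test used above for the block 1 versus block 6 comparison can be sketched as follows. This is a minimal illustration, not the analysis code used in the study: the function name `paired_t` and the four trial times are made up for the example. With the 24 paired trials per block reported in the text, the degrees of freedom would be 23.

```python
import math
from statistics import mean, stdev


def paired_t(before, after):
    """Return (t statistic, degrees of freedom) for two paired samples."""
    if len(before) != len(after):
        raise ValueError("paired samples must have equal length")
    # Work on the per-trial differences, since the same trials are compared.
    diffs = [b - a for b, a in zip(before, after)]
    n = len(diffs)
    # t is the mean difference divided by its standard error.
    t = mean(diffs) / (stdev(diffs) / math.sqrt(n))
    return t, n - 1


# Made-up completion times in seconds, for illustration only (NOT study data).
block1 = [20.0, 25.0, 18.0, 30.0]
block6 = [15.0, 21.0, 17.0, 22.0]

t, dof = paired_t(block1, block6)
print(f"t({dof}) = {t:.2f}")
```

The p-value then follows from the t distribution with `dof` degrees of freedom. Note that the block means and standard deviations alone are not sufficient to reproduce the reported t values, because the test depends on the trial-by-trial differences.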
We quantified the time taken for the system to identify a grasp (correct or incorrect) from a low-resolution input image in realtime within our graphical user interface. With the laptop that was used in the realtime experiments, the average times needed for pre-processing and classification were 110 ms and 40 ms, respectively. As mentioned in the Methods section, to take a picture with the camera, subjects had to make a short flexion above the flexion threshold for 300 ms, whilst the activity of the extensor muscle group remained below its threshold. As such, a correct classification could be achieved within 450 ms. All time stamps are shown in figure 11(D).
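The 450 ms figure is the sum of the three stages reported above. The short sketch below makes that budget explicit; the constant and function names are chosen for illustration and do not come from the original implementation.

```python
# Stage durations in milliseconds, as reported in the text.
TRIGGER_HOLD_MS = 300  # flexion held above threshold to take a snapshot
PREPROCESS_MS = 110    # average image pre-processing time
CLASSIFY_MS = 40       # average classification time within the GUI


def total_latency_ms(*stages_ms):
    """Sum sequential stage durations into an end-to-end latency."""
    return sum(stages_ms)


total = total_latency_ms(TRIGGER_HOLD_MS, PREPROCESS_MS, CLASSIFY_MS)
print(total)
```

The sketch assumes the stages run strictly one after another, which is how the text presents them.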
Finally, we assessed the ability of the proposed structure to generalise to novel objects during the realtime experiments. To that end, for each volunteer we split the results of the realtime experiment for seen and unseen objects in table 6. Out of the 24 objects in each block, eight were seen and 16 were unseen by the trained two-layer CNN. Results showed that it was not possible to predict whether classification would be more successful for seen or unseen objects.
[Figure caption fragment: All images were converted to grey-scale and downsampled before further analysis. Objects shown with a dashed black box around them were novel to the classifier. All other objects were seen by the classifier; however, they were rotated randomly for this test. In the case of 100% correct classification, each bar would be in a single colour.]

Concluding remarks
We augmented a commercial prosthetic hand with a webcam and a deep learning-based structure to improve the grasp ability of amputees. This setting was examined with two trans-radial amputee participants after a comprehensive series of offline and realtime, but computer-based, experiments. We showed that after about an hour of practice, the participants could accomplish up to 88% of trials successfully.
To switch between grasp types in current commercial prosthetic hands, the user has to learn various co-contractions, move the prosthetic hand in certain trajectories or have objects in their environment labelled with RFID tags. These workaround techniques have emerged mainly because the promised EMG pattern recognition-based methods have not proved robust, or even feasible, for grasp classification clinically. The non-intuitiveness and shortcomings of the aforementioned approaches have encouraged the emergence of techniques that advocate the use of sensing modalities other than the conventional EMG signals, such as accelerometry or, more generally, inertial measurements [14,25,26,57], RFID tags [28], and artificial vision, including standard cameras as well as Kinect [29][30][31][32][33][34]. In almost all multi-modal approaches to the control of limb prostheses, it is argued that the incorporation of two or more sources of information can reduce the users' cognitive burden and enhance functionality in terms of accuracy.
In this work, the user could effectively pick up objects with four different grasp types by capturing a single picture of the object of interest. We adapted and trained a standard CNN architecture to extract abstract grasp-related features from a single low-resolution input object image in realtime.

Database
In order to train the CNN structure for grasp recognition, a database including a large number of object images was required. Identifying a database with an ample number of household objects is a challenge since most of the accessible databases, e.g. ImageNet [40], include a large variety of objects, many of which are not graspable.
In our previous pilot experiment [34], we used the COIL100 database [58], which includes 100 categories of graspable objects. The overall classification performance in the WOC test was 97%. For the BOC test, the classification accuracy was 55%, with the lowest results in the 'palmar wrist pronated' group. We attributed this poor BOC performance to an insufficient number of training objects, since the other grasp classes, which had enough training objects, achieved significantly higher accuracies. Therefore, we used the ALOI database [46] instead, which provided us with more training data, in the range of 1000 objects, of which we selected 500 objects for analysis. Because the number of objects varied between grasp groups, and to enable realtime testing, for which the original objects are not available, a database of 71 objects was collected at Newcastle University. These 71 objects were distributed between the grasp groups such that each group had sufficient samples for training. This augmented image database was used for CNN training in this work.

Object classification versus grasp identification
We adapted the CNN architecture for grasp recognition rather than object identification. Supervised learning systems (e.g. [59,60]), including the CNN setting (e.g. [61,62]), lack the capability of generalisation to novel objects, which is a crucial requirement in prosthetic hand applications. To address this issue, we either need a very large amount of training data or we can capitalise on the flexibility of deep learning systems to generalise by learning abstract representations of different classes of training data. Forming large image libraries can be challenging since it requires advanced hardware for photography and computing facilities for data handling and storage. Instead, we approached the problem by noting that, rather than having a large number of object classes, we can group the objects according to their most appropriate grasp type. In this way, the output space includes only a small number of grasps. Consequently, the detection task can be generalised to unknown objects, and objects of any type can in principle be assigned an appropriate grasp.
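The relabelling idea described above, replacing object identities with grasp classes so that the output space stays small, can be sketched as below. The object names and their grasp assignments are hypothetical and serve only to illustrate the mapping; they are not taken from the training database.

```python
from collections import Counter

# The four grasp classes used in this work.
GRASPS = ("pinch", "tripod", "palmar wrist neutral", "palmar wrist pronated")

# Hypothetical object-to-grasp assignments, for illustration only.
OBJECT_TO_GRASP = {
    "marble": "pinch",
    "pen": "tripod",
    "bottle": "palmar wrist neutral",
    "mug": "palmar wrist neutral",
    "stapler": "palmar wrist pronated",
}


def to_grasp_labels(samples, object_to_grasp):
    """Replace each object identity with its grasp class."""
    return [(image_id, object_to_grasp[obj]) for image_id, obj in samples]


# Each sample is (image identifier, object name); identifiers are made up.
samples = [("img_000", "marble"), ("img_001", "pen"), ("img_002", "bottle"),
           ("img_003", "mug"), ("img_004", "stapler")]

labelled = to_grasp_labels(samples, OBJECT_TO_GRASP)
class_counts = Counter(grasp for _, grasp in labelled)
print(class_counts)
```

However many objects are added to the mapping, the classifier still has only `len(GRASPS)` outputs, which is what allows a novel object to be assigned a grasp without being identified.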

The CNN design considerations
The difference between the two CNN structures, that is 1- and 2-layer, was very small. Despite there being no statistical difference between the two cases in the offline analysis, we chose to use the 2-layer structure as it showed a slightly better performance in three of the four grasp classes (figure 8(B)). The realtime performance of the two structures on a laptop was similar, with differences in the nanosecond range. In principle, one could avoid over-fitting by using a smaller network. We avoided over-fitting in the 2-layer network by using Tikhonov regularisation in training the CNN structure where the matrix W j l in the last layer is optimised. Tuning and training of a CNN structure may be very time-consuming. However, once trained, it offers a very fast response time. Typical times for training the proposed 2-layer CNN structure were about 2 hours without a GPU. In fact, the slowest component of the proposed approach is the image pre-processing block, which takes ∼110 ms to carry out all the steps that were introduced in figure 6. For the realtime experiments, we used standard MATLAB instructions without additional GPU hardware. With the advent of fast GPU chips, thanks to the mobile phone industry, we believe that realtime implementation of our standard image preprocessing tasks will become much faster.
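For reference, Tikhonov (L2) regularisation adds a penalty on the size of the weights to the training objective. The generic form is sketched below; the symbols are assumptions introduced here for illustration (W for the last-layer weight matrix, a data-fit loss, a regularisation strength λ, features X and targets Y) and the exact loss used in this work may differ. The closed-form solution in the second line holds only for the least-squares special case.

```latex
% Generic Tikhonov (L2) penalty on the last-layer weights W
\[
  J(W) = \mathcal{L}(W) + \lambda \lVert W \rVert_F^2, \qquad \lambda > 0
\]
% Least-squares special case with features X and targets Y (illustrative)
\[
  W^{*} = \arg\min_{W} \; \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_F^2
        = \left( X^{\top} X + \lambda I \right)^{-1} X^{\top} Y
\]
```

A larger λ shrinks the weights more strongly towards zero, which trades some fit on the training data for less over-fitting.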
Although not included in the results section, we tested the hypothesis that using a pre-trained CNN, for example one trained with all images in the ImageNet database [40], could enhance the classification accuracy. We therefore re-tuned a ResNet18 [61], an 18-layer network pre-trained with ImageNet, with the combined ALOI and Newcastle images and then repeated the BOC test. We observed a large reduction in the classification scores to ∼50%. Such poor performance may be because many of the objects in the ImageNet database are not graspable, e.g. an airplane or a tree. In addition, most images include significant clutter and have various backgrounds, e.g. an ambulance on a street. Whilst these results are not in favour of using a pre-trained network, we do not rule out the possibility that pre-trained architectures can be used to enhance the generalisation performance. Perhaps the use of networks pre-trained with a large number of graspable objects can lead to higher performance.
We sought to understand whether object-specific patterns were extracted by the CNN structure for grasp classification or whether the size and orientation of the objects enabled the CNN setting to generalise. Figure 12 illustrates two examples per grasp class. The 25 maps per object are the outcome of the second convolution layer of the CNN architecture after the ReLU stage, as introduced in figure 3. This preliminary visualisation suggests that the determining factors for classification and generalisation are the size and the orientation of the object. Further work is needed to verify the consistency of this finding in a larger number of objects and object views. The output of the Softmax classifier provides the probability for each class. Therefore, when the most probable grasp is not suitable, other grasp types of decreasing probability may be offered. This can serve as an automatic error correction mechanism. Although we find this approach interesting and feasible from an engineering point of view, we decided not to use it for two main reasons:
• The addition of an artificial vision system to artificial hands makes the prosthesis more autonomous [31] and less under the control of the subject. Our initial and unbiased briefing of the subjects with regards to the experiment suggested that they both would like to have a degree of control over the function of the prosthesis. As such, we decided to test the manual error-correction method only. In this approach, in block 6, subjects could restart the process by resetting the prosthesis and taking a new snapshot.
• Our volunteers were both naïve to the concept of the experiment and neither used a myoelectric prosthesis in daily life. In addition, our experiment was already rather long (more than 2 hours). Therefore, the addition of another condition to the experiment, in which errors are automatically corrected, would have tired the participants.
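The automatic error-correction idea mentioned above, offering grasps in order of decreasing Softmax probability, was not implemented in this work. A minimal sketch of how such a ranking could be formed is given below; the function names and the raw scores are made up for illustration.

```python
import math

GRASPS = ("pinch", "tripod", "palmar wrist neutral", "palmar wrist pronated")


def softmax(scores):
    """Convert raw class scores into probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ranked_grasps(scores, labels=GRASPS):
    """Return (grasp, probability) pairs, most probable first."""
    probs = softmax(scores)
    return sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)


# Made-up raw scores for one image, for illustration only.
scores = [0.2, 2.0, 1.1, -0.5]
ranking = ranked_grasps(scores)

for grasp, prob in ranking:
    print(f"{grasp}: {prob:.2f}")
```

If the user rejected the first suggestion, a controller built on this ranking would offer the next entry rather than requiring a new snapshot.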

Feasibility of more grasp types
We limited the number of grasp types to four. The number of grasps, however, can be increased readily upon availability of training data. For instance, the lateral grasp is not currently included within the grasp types. Objects that are grasped with a lateral grip, e.g. a credit card, typically present a distinctive flat shape which is different from our training images. Therefore, by augmenting the existing database with images of objects requiring a lateral grasp, we can include the lateral grasp as an additional grasp class. Whether prosthesis users would use more than four or five grasps remains to be investigated.

Performance in the presence of clutter
Identification and segmentation of an object in a cluttered scene or when the object lies on an arbitrary background can be an extremely challenging computer vision task. In this proof-of-principle work, we tested the use of a deep learning algorithm in a clutter-free environment. Previous work, such as [30,31], incorporated 3D point clouds (in addition to colours and edges) to facilitate segmentation of the scene. One interesting study would be to combine the two approaches and use the 3D features as inputs into the CNN system.

Realtime performance: computer-based versus human experiments
With the computer-based realtime experiments, we simulated a grasp classification scenario without having the user in the loop. We believe that it was an appropriate practice since it gave us an indication of realtime performance without biases induced by the user, e.g. camera view and distance to the object. The computer-based results, with an average performance of 84%, were higher than the accuracy achieved in the realtime experiments with amputee subjects in the loop, especially in early blocks. With training, both subjects improved their performance, yet they fell short of the score that was achieved in the computer-based experiment. We believe that the higher performance in the computer-based experiment was because the camera view and distance to the objects were fixed during testing. Other intrinsic factors, such as physical and mental fatigue, can deteriorate realtime performance. Further investigations are needed to identify the underlying sources of error and inaccuracy during realtime experiments, be it in laboratory, clinical or real-life settings.
In the realtime experiments with amputee subjects, out of the 24 objects in each block, eight were seen and 16 were unseen by the trained two-layer CNN. Results did not show that generalisation to unseen objects was necessarily less successful than classification of seen objects. This is in contrast to what we observed in the offline experiments, in which the BOC performance was lower than the WOC performance. However, this finding corroborates earlier work in [29][30][31] supporting the hypothesis that users' behaviour could play an important role in the accuracy of vision-based prosthesis control architectures.

User training with full or partial visual feedback
The webcam was mounted on the dorsum of the i-limb hand. This was in line with earlier work on vision-based prosthesis control [29]. However, in more recent work, Marković et al [31] placed the sensors on the user to facilitate targeting the object. We showed that with training in a step-by-step approach (blocks 1-6) subjects can learn to target the object accurately such that all of the object boundaries are in the scene. This is particularly important for tall objects in the palmar wrist neutral group (figure 10(C)). Following the familiarisation block and the first two measurement blocks, the visual feedback from the webcam output was withheld; however, the performance did not drop. The available proprioceptive feedback from the arm and the trunk muscles may have facilitated accurate targeting.

User feedback
Both subjects provided positive feedback on the use of the proposed vision-enabled prosthetic control approach. For instance, subject D said: 'Just getting the routine was difficult at the beginning but once this was established it became much easier. If it would be further refined (in terms of positioning of camera) I would certainly use this and always give feedback'. Subject M tested the proposed approach and a novel pattern recognition system on the same day. When asked which of the two approaches he would prefer, he replied: 'I'd like the pattern recognition better, when it works perfectly! For the time being, the vision-based system seems to be a good solution. I liked its responsiveness very much'.

Directions for further development
In the proposed setting, misclassification could stem from inaccurate object detection or from insufficient feature extraction by the CNN structure. Advanced image processing techniques may be used to address the former. The latter problem may be dealt with by fine-tuning the CNN parameters according to an objective criterion. Beyond these challenges, one critical issue that can affect the performance of any vision-based prosthetic control system is the distance between the object of interest and the camera. Previous work incorporated additional sensors, e.g. sonar [29] or stereovision [31], to alleviate the uncertainty with regards to the true object sizes. Our current work includes using inertial measurements of the movement during reach to approximate the distance from the target object and rescale the images before giving them to the CNN structure.