
PVGAN: a generative adversarial network for object simplification in prosthetic vision


Published 7 September 2022 © 2022 IOP Publishing Ltd
Citation: Reham H Elnabawy et al 2022 J. Neural Eng. 19 056007, DOI 10.1088/1741-2552/ac8acf


Abstract

Objective. By means of electrical stimulation of the visual system, visual prostheses provide a promising solution for blind patients through partial restoration of their vision. Despite the great success achieved so far in this field, the limited resolution of the perceived vision using these devices hinders the ability of visual prostheses users to correctly recognize viewed objects. Accordingly, we propose a deep learning approach based on generative adversarial networks (GANs), termed prosthetic vision GAN (PVGAN), to enhance object recognition for implanted patients by representing objects in the field of view using a corresponding simplified clip art version. Approach. To assess the performance, an axon map model was used to simulate prosthetic vision in experiments involving normally-sighted participants. In these experiments, four types of image representation were examined. The first and second types comprised presenting a phosphene simulation of the real image containing the actual high-resolution object, and presenting a phosphene simulation of the real image followed by the clip art image, respectively. The other two types were utilized to evaluate the performance in the case of electrode dropout, where the third type comprised presenting phosphene simulation of only clip art images without electrode dropout, while the fourth type involved clip art images with electrode dropout. Main results. The performance was measured through three evaluation metrics: the accuracy of the participants in recognizing the objects, the time taken by the participants to correctly recognize the object, and the confidence level of the participants in the recognition process. Results demonstrate that representing objects using clip art images generated by the PVGAN model yields a significant enhancement in the speed and confidence of the subjects in recognizing the objects. Significance. These results demonstrate the utility of GANs in enhancing the quality of images perceived using prosthetic vision.


1. Introduction

Vision is considered the most crucial sense that human beings rely on to function independently. Loss of vision affects the confidence and independence of impaired individuals when performing daily activities and during social interactions. Blindness can occur for many reasons, including diseases such as retinitis pigmentosa (RP) and age-related macular degeneration (AMD) [1]. RP causes progressive loss of peripheral vision, whereas AMD affects central vision [2, 3]. These diseases mainly affect the photoreceptors that are responsible for transducing light into electrical signals, leaving most of the other parts of the retina and the visual system intact. In addition, injuries in other areas along the visual pathway can also lead to blindness.

To compensate for the loss of vision, visual prostheses, aka bionic eyes, have been introduced to artificially induce visual percepts via electrical stimulation of the visual pathway [4]. The most successful example of visual prostheses is of the retinal implant type, namely the Argus II device [5]. A typical visual prosthesis device stimulates the functional parts of the retina (i.e. the surviving neurons such as retinal ganglion cells and bipolar cells), bypassing the damaged part (i.e. the photoreceptors) [6]. In such a system, a tiny camera is mounted in the middle of a pair of eyeglasses to capture images and subsequently send them to a video processing unit (VPU) through a wired connection. This VPU converts the images into electrical signals, sending them back wirelessly to electrodes implanted either epiretinally (such as Argus II and the Intelligent Retinal Implant System), subretinally (such as Alpha-IMS and the Photovoltaic Retinal Implant bionic vision system) or suprachoroidally (such as Bionic Vision Australia) [7–9]. Other types of artificial vision devices have been proposed to compensate for vision loss, such as the cortical and the thalamic types [10, 11].

Prosthetic vision is composed of spots of light, called phosphenes, which represent the pixels of the visual field [12]. Any visual prosthetic device is limited to a certain number of implanted electrodes, which limits the resolution of the perceived image [13]. The vision perceived using a prosthetic device is known to be of extremely low spatial and radiometric resolution [14]. This means that an implanted patient will not be able to perceive the full details of an image, unlike a normally-sighted person. Implanted patients take a long time to get used to interpreting images perceived in the form of phosphenes, a way of seeing that is totally different from the normal vision they had before their loss of vision [15]. This results in a number of challenges associated with the use of visual prostheses, such as psychological challenges that arise due to over-expectations of full restoration of normal vision [6]. In addition, implanted patients typically have very limited ability to recognize objects in a scene in real life, which hinders their independence and confidence [16]. This limited ability deteriorates further in many cases as a result of electrode dropout, leading to missing information in the visual field [17]. These challenges might drive implanted patients to abandon the device after implantation.

To tackle the aforementioned challenges, we introduce a deep learning approach that aims at enhancing the ability of visual prosthesis implanted patients to recognize perceived objects. Given the low resolution of the perceived prosthetic image, we propose a novel deep learning approach for object simplification to enhance the ability of visual prostheses' users in object recognition. Deep learning is a promising tool for learning in neural networks which not only enables better image representation, but also provides better object recognition [18]. Recently, generative adversarial networks (GANs) have been introduced to generate data that resembles a target input [19]. A GAN is composed of two models: a generator and a discriminator. The generator produces candidates, whereas the discriminator assesses them. The training of both models aims at making the generator better at creating realistic samples (i.e. generating images that are similar to the actual input images) and making the discriminator more skilled at detecting fake samples [20].

In this article, we introduce the use of GANs to generate a simplified image from a high-resolution one. A popular GAN variant is the conditional GAN, where both the generator and the discriminator are conditioned on additional information such as class labels, which enables the conditional generation of images by the generator model [21]. Image-conditional models have been successfully demonstrated for image prediction from a normal map [22], product photo generation [23] and future frame prediction [24]. Moreover, strong performance on future state prediction [25], inpainting [26], image manipulation guided by user constraints [27], super-resolution [28] and style transfer [29] has been achieved using conditional GANs. One popular model that exploits the conditional GAN is Pix2Pix, which performs image-to-image translation for paired images, mapping an input image from one domain to a target image in another domain [30]. In this article, we propose prosthetic vision GAN (PVGAN), a model trained using images of objects belonging to general classes that visual prostheses users could encounter in their daily life. The model performs novel adjustments to the Pix2Pix input data to generate clip art images from real high-resolution images. The generated clip art image is then used as input to the visual prosthesis system to enhance object recognition for implanted patients. The proposed approach is evaluated using simulated prosthetic vision experiments involving normally-sighted participants, demonstrating its efficacy.

2. Methods

2.1. PVGAN model overview

The proposed framework is illustrated in figure 1. In the training phase, an input picture that comprises a clip art picture concatenated with the corresponding high-resolution picture is provided to the generator. The generator then learns to generate a clip art that is similar to the clip art in the input picture in order to trick the discriminator. On the other hand, the discriminator takes the clip art from the input picture in addition to the generated clip art and decides whether the generated clip art is real (i.e. a generated image that is indistinguishable from the input clip art) or fake (i.e. a generated image that is not similar to the input clip art). The generator keeps enhancing its ability to produce clip arts until the discriminator cannot differentiate between the original and the generated clip arts. Once the generator is trained, it can be used to generate clip art for input images that are either new images from the same classes used in the training phase or, more generally, images from new classes other than the training classes. Finally, the generated clip art image is used as an input to a phosphene simulation procedure to produce an image that contains phosphenes matching the resolution of visual prostheses, simulating the image perceived by visual prostheses users.


Figure 1. Block diagram showing the flow of the proposed model. (a) Training the model. (b) Testing the model and phosphene simulation.


2.2. Pix2Pix GAN

2.2.1. General GAN architecture

Figure 2 shows the GAN architecture, in which the discriminator D distinguishes between real data and the fake data generated by the generator. The generator G takes an n-dimensional noise input z and outputs G(z), which is subsequently fed as an input to the discriminator [31]. For the real data, the output of the discriminator is D(x), while for the fake data, the output is D(G(z)). These predictions are represented in the form of probabilities P: if P is close to zero, the input was fake, whereas if it is closer to one, the input was real. The discriminator aims to make the probability of x as large as possible and the probability of G(z) as small as possible. The generator, in contrast, aims to make the discriminator's probability for G(z) as large as possible in order to fool the discriminator into treating fake data as real. This sets up an adversarial network.


Figure 2. GAN architecture.


To obtain the loss function, the log of both D(x) and D(G(z)) is taken and the mathematical expectation of both terms is computed [31]. Solving this directly is complicated because the discriminator seeks to maximize ${E_{x \sim p(x)}}\log \left( {D\left( x \right)} \right)$ while minimizing ${E_{z \sim p(z)}}\log \left( {D\left( {G\left( z \right)} \right)} \right)$, whereas the generator tries to maximize the second term. This can be simplified by replacing D(G(z)) with 1 − D(G(z)) in the second term, so that the discriminator maximizes the whole objective while the generator minimizes it, giving rise to the following loss function for GANs [31]

Equation (1): $\min_G \max_D V(D,G) = {E_{x \sim p(x)}}\log D(x) + {E_{z \sim p(z)}}\log \left( 1 - D\left( G\left( z \right) \right) \right)$
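As an illustration (our own sketch, not the authors' code), the objective in equation (1) is commonly implemented with binary cross-entropy losses for the two players; the TensorFlow/Keras framework and the function names below are assumptions.

```python
# Illustrative sketch of the GAN objective in equation (1) using binary
# cross-entropy; framework choice and names are ours, not the paper's.
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)

def discriminator_loss(d_real, d_fake):
    # D maximizes E[log D(x)] + E[log(1 - D(G(z)))]:
    # push real logits towards 1 and fake logits towards 0.
    return bce(tf.ones_like(d_real), d_real) + bce(tf.zeros_like(d_fake), d_fake)

def generator_loss(d_fake):
    # G minimizes E[log(1 - D(G(z)))], implemented in the usual
    # non-saturating form of pushing D(G(z)) towards 1.
    return bce(tf.ones_like(d_fake), d_fake)
```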

2.2.2. Pix2Pix model overview

The Pix2Pix GAN uses a conditional GAN for image-to-image translation. The difference between conditional GANs and GANs is that in GANs there is no control over the modes of the generated data, whereas a conditional GAN generates images conditioned on a class label [32]. The generator of Pix2Pix is based on U-Net, which is an encoder-decoder model with skip connections between mirrored layers in the encoder and decoder stacks [33]. This permits low-level information to shortcut across the network by skipping some of the layers, feeding the output of one layer as an input to a later layer [34]. On the other hand, a PatchGAN classifier is utilized within the discriminator, where rather than deciding whether an entire generated picture is real or fake, the discriminator decides whether each N × N patch of the generated picture is real or fake [35]. An advantage of PatchGAN is that a fixed-size patch discriminator can be applied to arbitrarily large pictures. The inner structures of the generator and the discriminator are as follows. The generator comprises eight convolutional layers for the down-sampling part of the U-Net with 64, 128, 256, 512, 512, 512, 512, and 512 filters, respectively, while the up-sampling part is composed of seven convolutional layers with 512, 512, 512, 512, 256, 128, and 64 filters. In each layer, filters of size 4 × 4 and a stride of 2 are used with batch normalization and leaky rectified linear unit (ReLU) activation [36]. Moreover, skip connections are formed by concatenating layers in the down-sampling part with the corresponding layers in the up-sampling part. In addition, the binary cross-entropy loss function is utilized for the generator and the discriminator. The discriminator comprises three convolutional layers with 64, 128, and 256 filters, each followed by batch normalization and leaky ReLU. At that point, zero-padding is used to maintain the required shape of the output. Another convolutional layer with 512 filters and a leaky ReLU layer are included, in conjunction with another zero-padding layer. Similar to the U-Net architecture, all filters are of size 4 × 4 with a stride of 2.
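The layer configuration described above can be summarized in the following sketch. This is our reconstruction of the stated filter counts, kernel sizes and strides, not the authors' code; padding, the output activation and the strides of the final discriminator layers are assumptions based on the standard Pix2Pix design.

```python
# Sketch of the U-Net generator and PatchGAN discriminator described above
# (TensorFlow/Keras). Filter counts follow the text; other details are assumed.
import tensorflow as tf
from tensorflow.keras import layers

def down(filters):
    return tf.keras.Sequential([
        layers.Conv2D(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU()])

def up(filters):
    return tf.keras.Sequential([
        layers.Conv2DTranspose(filters, 4, strides=2, padding="same", use_bias=False),
        layers.BatchNormalization(),
        layers.LeakyReLU()])

def build_generator():
    inp = layers.Input(shape=(256, 256, 3))
    skips, x = [], inp
    for f in [64, 128, 256, 512, 512, 512, 512, 512]:          # down-sampling path
        x = down(f)(x)
        skips.append(x)
    for f, skip in zip([512, 512, 512, 512, 256, 128, 64],
                       reversed(skips[:-1])):                   # up-sampling path
        x = up(f)(x)
        x = layers.Concatenate()([x, skip])                     # U-Net skip connection
    out = layers.Conv2DTranspose(3, 4, strides=2, padding="same",
                                 activation="tanh")(x)          # assumed output layer
    return tf.keras.Model(inp, out)

def build_discriminator():
    inp = layers.Input(shape=(256, 256, 3))                     # input photo
    tar = layers.Input(shape=(256, 256, 3))                     # real or generated clip art
    x = layers.Concatenate()([inp, tar])
    for f in [64, 128, 256]:
        x = down(f)(x)
    x = layers.ZeroPadding2D()(x)
    x = layers.Conv2D(512, 4, use_bias=False)(x)                # stride 1 assumed here
    x = layers.LeakyReLU()(x)
    x = layers.ZeroPadding2D()(x)
    x = layers.Conv2D(1, 4)(x)                                  # PatchGAN patch scores
    return tf.keras.Model([inp, tar], x)
```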

Finally, the Adam optimizer, which is an extension of stochastic gradient descent, is utilized for both the generator and the discriminator. Figure 3 illustrates the network design of the Pix2Pix GAN, where the generator G takes as input a combined picture that comprises two pictures side-by-side from two distinct domains [37]. G then produces a picture that the discriminator D takes as input alongside the input picture to decide whether the generated picture is real or fake compared to the real target picture. Next, the loss is calculated between the produced picture and the target picture until there is no distinction between the two, at which stage the generator succeeds in tricking the discriminator with produced pictures that are indistinguishable from the target pictures. The weights are updated through back-propagation using the gradient-based optimizer [38].


Figure 3. Block diagram showing the Pix2Pix GAN network architecture.


2.2.3. Adversarial loss

The adversarial loss, known as ${L_{{\text{cGAN}}}}\left( {G,D} \right)$, is expressed as

Equation (2): ${L_{{\text{cGAN}}}}\left( {G,D} \right) = {E_{x,y}}\left[ \log D\left( {x,y} \right) \right] + {E_{x,z}}\left[ \log \left( 1 - D\left( {x,G\left( {x,z} \right)} \right) \right) \right]$

where ${L_{{\text{cGAN}}}}\left( {G,D} \right)$ is the adversarial loss between the generator and the discriminator, and ${E_{x,y}}$ and ${E_{x,z}}$ are the mathematical expectations of $\log D\left( {x,y} \right)$ and $\log \left( 1 - D\left( {x,G\left( {x,z} \right)} \right) \right)$, respectively. The generator G tries to reduce the adversarial loss, while the discriminator D tries to increase it (i.e. ${G^*} = \arg \min_G \max_D {L_{{\text{cGAN}}}}\left( {G,D} \right)$). Here, x and y are the input picture and the target output picture, respectively [39]. This objective alone only encourages the generator to fool the discriminator, whereas the core goal is to generate a picture that is close to the ground-truth output. Thus, we include regularization to penalize the network each time it produces an undesired output.

To assess the necessity of conditioning the discriminator, an unconditional variation in which the discriminator does not observe x was introduced [40]

Equation (3): ${L_{{\text{GAN}}}}\left( {G,D} \right) = {E_{y}}\left[ \log D\left( y \right) \right] + {E_{x,z}}\left[ \log \left( 1 - D\left( {G\left( {x,z} \right)} \right) \right) \right]$

Since the L1 distance preserves the details in the output images and helps the generator not only to trick the discriminator but also to stay close to the ground-truth output, it was used as

Equation (4): ${L_{L1}}\left( G \right) = {E_{x,y,z}}\left[ {\left\| {y - G\left( {x,z} \right)} \right\|_1} \right]$

So, the final objective is:

Equation (5): ${G^*} = \arg \min_G \max_D {L_{{\text{cGAN}}}}\left( {G,D} \right) + \lambda {L_{L1}}\left( G \right)$

where λ regulates the relative significance of the two goals. With this objective, the generator learns to generate outputs that are indistinguishable from the real target images [22].
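Assuming the binary cross-entropy form of the adversarial term described in section 2.2.2, the combined generator objective of equations (2)-(5) can be sketched as follows; the value of LAMBDA is illustrative and not reported in the paper.

```python
# Sketch of the combined Pix2Pix generator objective (equations (2)-(5)).
import tensorflow as tf

bce = tf.keras.losses.BinaryCrossentropy(from_logits=True)
LAMBDA = 100  # relative weight of the L1 term; assumed value

def pix2pix_generator_loss(d_fake, generated, target):
    adversarial = bce(tf.ones_like(d_fake), d_fake)      # cGAN term seen by G
    l1 = tf.reduce_mean(tf.abs(target - generated))      # L1 term, equation (4)
    return adversarial + LAMBDA * l1                     # total objective, equation (5)
```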

2.3. Dataset preparation

The PVGAN model proposed in this article utilizes the same design as that of the Pix2Pix GAN. However, the training set has been developed to assist in ideal clip art production for improving object recognition when showing an image of relatively low resolution via visual prostheses.

2.3.1. Dataset collection

The dataset developed to train the model was obtained from different studies [41–47] in addition to the Kaggle website. The dataset is composed of images belonging to 17 classes representing food (apple, banana, lemon and pizza), vehicles (bus and car), animals (bird, dog and zebra), furniture (chair, bed, door and table), plants (flower and tree) and personal belongings (bag and laptop). These classes were selected to ensure diversity across the objects used to train the model. The dataset is composed of only real high-resolution pictures; to obtain their corresponding clip art images, an algorithm is proposed that gathers clip art pictures and adjusts their color and orientation. To match the input required by the Pix2Pix model, the input pictures are entered as pairs of pictures concatenated together, where the left picture is the clip art form of the right real high-resolution picture. All the real high-resolution pictures contain only one object of interest to train the model more precisely.

2.3.2. Training input data pre-processing

Before training the model, pre-processing for the input dataset is performed. Normalization of the pictures was performed by rescaling the pixel value calculated as

Equation (6)

where R is the original pixel intensity value. Then, all the pictures are resized to the same size (i.e. 256 × 256). Next, we performed random horizontal flipping to improve the generalization ability of the model. The total number of images in the dataset is thus 5100 images after adding the flipped versions of the images.
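A minimal sketch of this pre-processing pipeline is shown below. Since equation (6) is not reproduced here, the rescaling to the [0, 1] range is an assumption, and the function names are our own.

```python
# Sketch of the input pre-processing: rescaling, resizing to 256 x 256 and
# horizontal-flip augmentation. The /255 rescaling is an assumption.
import tensorflow as tf

def preprocess(image):
    image = tf.image.resize(image, (256, 256))       # unify the image size
    return tf.cast(image, tf.float32) / 255.0        # rescale pixel intensities

def augmented_pair(image):
    # Return the original and its horizontally flipped copy,
    # doubling the dataset from 2550 to 5100 images.
    return image, tf.image.flip_left_right(image)
```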

2.3.2.1. Photos collection

To prepare the photos (i.e. high-resolution images) to be used for training the model, the following steps were performed. First, we downloaded photos related to various categories, such as dogs, cars, etc., from different resources. Second, we identified photos that include exactly one object of interest centered in the middle of the image so that they can be easily processed in our proposed algorithm. Third, we renamed each of the identified photo files according to the identity of the object present, such as 'dog1.jpg', 'car5.jpg', etc. The total number of images in the dataset collected and refined is 2550 images, with 150 images per training class. Fourth, we iterated over each of the photos and parsed the filename up to the first digit. We then used that name to search automatically on Google Images and retrieve the first ten clip art images shown in the search result. These ten clip art images were downloaded into a new folder for future use. This continues until the first ten clip art images corresponding to each photo are downloaded. Fifth, all the clip art images were resized to a unified size (i.e. 256 × 256). Sixth, by means of the histogram of oriented gradients (HOG), each photo was compared with its corresponding ten clip art images and the best-matching clip art image was saved in a new folder [48]. In our analysis, we used a HOG block size of 2 × 2 to suppress local variations in the HOG features, a HOG cell size of 8 × 8 to preserve tiny details, and nine orientation bins to encode finer orientation details. The best clip art image is the one with the minimum Euclidean distance between the feature vector of the query image (i.e. the photo) and the feature vector of the clip art image. In this way, the clip art images that best match the shape of the photos are saved.
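A sketch of the HOG-based matching step is given below (our reconstruction); the scikit-image parameters mirror the cell size, block size and orientation bins stated above, while the file names are hypothetical.

```python
# Select the clip art whose HOG feature vector is closest (in Euclidean
# distance) to that of the query photo.
import numpy as np
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.transform import resize
from skimage.feature import hog

def hog_vector(path):
    img = resize(rgb2gray(imread(path)), (256, 256))
    return hog(img, orientations=9, pixels_per_cell=(8, 8), cells_per_block=(2, 2))

def best_clipart(photo_path, clipart_paths):
    query = hog_vector(photo_path)
    dists = [np.linalg.norm(query - hog_vector(p)) for p in clipart_paths]
    return clipart_paths[int(np.argmin(dists))]

# Hypothetical usage:
# best = best_clipart("dog1.jpg", ["dog_clipart_%d.png" % i for i in range(10)])
```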

2.3.2.2. Clip art adjustment

To generate a clip art of an object that resembles the properties of the object in the real image, we modify the clip art image to match the color and orientation of the real object. While color would not affect the identity of an object when simulated using phosphenes, since the number of gray levels available in visual prostheses is limited, color adjustment gives an indication of the brightness level that a certain object might have [49]. In this approach, the average color of the object of interest within the real high-resolution picture is first calculated. Second, the average color of the object of interest within the clip art picture is calculated. The difference between both averages is then obtained, and the color of the clip art object is changed by adding that difference to its pixel intensities. Finally, values below 0 or above 255 are clipped to 0 or 255, respectively.
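A sketch of the color adjustment step is shown below. For brevity it averages over whole images rather than over a segmented object-of-interest mask, which is an approximation of the procedure above.

```python
# Shift the clip art colors by the difference of the average colors,
# then clip the result to the valid [0, 255] range.
import numpy as np

def adjust_color(clipart, photo):
    # clipart, photo: uint8 RGB arrays
    diff = photo.reshape(-1, 3).mean(axis=0) - clipart.reshape(-1, 3).mean(axis=0)
    shifted = clipart.astype(np.float32) + diff
    return np.clip(shifted, 0, 255).astype(np.uint8)
```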

To adjust the pose of a certain clip art to match that of the real object, we first apply a median filter of size 5 × 5 to eliminate any noise from the real picture [50]. We then extract the object of interest within a bounding box and place it on another picture with only a black background and no other objects. Next, we binarize the new high-resolution picture (i.e. the dark picture with the object of interest extracted onto it) by means of Otsu thresholding [51] in preparation for applying the morphological skeletonization operator. Skeletonization is performed by means of the hit-or-miss transform to preserve the shape topology [52], and is repeated until the image no longer changes. After skeletonization, we calculate the gradient orientations of the new high-resolution picture using the Prewitt operator to determine the rotation angle [53], and take the average of all of the gradient orientations of the skeleton. We then apply the same method to the clip art picture to obtain the corresponding average gradient orientation of its skeleton. Next, we subtract the average gradient orientation of the clip art from that of the real high-resolution picture and rotate the clip art object of interest by that difference so that it has the same orientation as the real high-resolution picture. This will help visual prostheses users to know the actual orientation of a certain object. Finally, we merge the best identified clip art image with its corresponding photo (as previously shown in figure 1(a)) to be ready for training the model.
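The pose adjustment pipeline can be sketched as follows. This is an approximation of the procedure above: scikit-image's skeletonize is used in place of an explicit hit-or-miss implementation, the bounding-box extraction is omitted, and the orientation convention of the Prewitt gradients is assumed.

```python
# Estimate the mean skeleton gradient orientation of each image and rotate
# the clip art by the orientation difference.
import numpy as np
from skimage.filters import median, threshold_otsu, prewitt_h, prewitt_v
from skimage.morphology import skeletonize
from skimage.transform import rotate

def mean_skeleton_orientation(gray):
    smoothed = median(gray, np.ones((5, 5)))          # 5 x 5 median filter
    binary = smoothed > threshold_otsu(smoothed)      # Otsu binarization
    skeleton = skeletonize(binary)                    # iterated until stable
    gy, gx = prewitt_h(gray), prewitt_v(gray)         # Prewitt gradients
    angles = np.degrees(np.arctan2(gy[skeleton], gx[skeleton]))
    return angles.mean()

def align_pose(clipart_gray, photo_gray):
    diff = mean_skeleton_orientation(photo_gray) - mean_skeleton_orientation(clipart_gray)
    return rotate(clipart_gray, diff, preserve_range=True)
```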

2.4. Phosphene simulation

The generated clip art image is further processed to be expressed in terms of phosphenes to simulate prosthetic vision. The grid used in the simulation is a square grid [54, 55]. The limited number of electrodes in visual prostheses limits the number of pixels perceived by visual prostheses users. To accommodate the number of electrodes expected in implants, based on the developments in the number of electrodes planned for future versions of the Argus device, Alpha IMS and Polyretina, a square lattice of size 32 × 32 pixels was used [56, 57]. The brightness of the phosphenes changes as a function of stimulation intensity across electrodes; thus, the brightness of phosphenes elicited by an individual electrode should scale appropriately with luminance [58]. The number of gray levels reported by visual prostheses implanted patients in multiple studies ranges from 4 to 12 levels [49, 55, 59, 60]. In addition, other studies reported that 8–16 gray levels could be achieved [51]. Therefore, we used eight gray levels in our simulations to match the mid-range of the values reported in the literature [61]. The axon map phosphene simulation model was used to assess the effect of using the clip art generated by PVGAN [62]. Recent studies have reported that phosphene shapes, especially in epiretinal implants, have spatial and temporal properties resulting in distortions and temporal fading. This distorted shape could follow the shape of the axons of the retinal ganglion cells; hence the name axon map. Therefore, we examined this model to assess the efficacy of the proposed PVGAN when applied to this type of phosphene simulation. In this model, we used the pulse2percept Python library with eight gray levels [63, 64].
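A minimal sketch of this simulation step with the pulse2percept library is given below; the electrode spacing and the axon map parameters (rho, axlambda) are illustrative assumptions and the input file name is hypothetical.

```python
# Simulate the axon map percept of a PVGAN-generated clip art on a 32 x 32
# electrode grid quantized to eight gray levels.
import numpy as np
import pulse2percept as p2p
from skimage.io import imread
from skimage.color import rgb2gray
from skimage.transform import resize

N_LEVELS, GRID = 8, (32, 32)

img = rgb2gray(imread("clipart_apple.png"))                  # hypothetical file
img = resize(img, GRID, anti_aliasing=True)
img = np.round(img * (N_LEVELS - 1)) / (N_LEVELS - 1)        # quantize to 8 gray levels

implant = p2p.implants.ProsthesisSystem(
    p2p.implants.ElectrodeGrid(GRID, 400))                   # 400 um spacing assumed
implant.stim = img.ravel()                                   # one amplitude per electrode

model = p2p.models.AxonMapModel(rho=150, axlambda=500)       # assumed parameters
model.build()
percept = model.predict_percept(implant)
```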

2.5. Experiments

All the experimental procedures were approved by the Faculty of Media Engineering and Technology, German University in Cairo, and are in accordance with the ethical standards of the Declaration of Helsinki. All participants involved in the experiments signed an informed consent form. Experiments were conducted on corrected-vision/normally-sighted subjects seated on a chair facing a 15 inch computer screen at a 1 m distance. This results in a 20° simulated field of view, which is the limit for legal blindness [56], as shown in figure 4. In these experiments, different metrics were measured to evaluate the performance of the subjects when presented with phosphene simulations of the clip art images generated by the PVGAN model. A set of 24 generated clip art images, displayed after phosphene simulation, was used to simulate the perception of a visual prosthesis user. The set was divided into two equal groups of 12 test images each, where one group contains new images from the classes that PVGAN was trained on, and the other contains test images from new classes that the PVGAN model had not seen before. All subjects had no prior experience with simulated prosthetic vision. The subjects did not receive any payment for participating in the study.


Figure 4. Experimental setup.


The axon map phosphene simulation model was used to take into account the spatial and temporal fading and distortions that happen to the phosphenes. A set of 24 test images was used in this experiment with the same subjects. A total of ten subjects were involved, five males and five females, aged from 19 to 57 years (34.1 ± 13.17 years). Each subject was presented with 24 images belonging to 4 types, with 6 images per type. These types were 'Real Image', 'Real before Clip Art', 'Clip Art no Dropout', and 'Clip Art with Dropout'. For the 'Real Image' type, the phosphene simulation of each of the six images was displayed for 10 s. In the second type, 'Real before Clip Art', the phosphene simulation of the real image (i.e. photo) was displayed for 5 s followed by the phosphene simulation of the corresponding GAN-generated clip art image for another 5 s, so that the total duration of 10 s per real image and its corresponding clip art matches the time given to the other types. In the third type, 'Clip Art no Dropout', only the GAN-generated clip art images were displayed using phosphene simulation, each for 10 s, to match the duration per image in the 'Real Image' and 'Real before Clip Art' types. Finally, in the fourth type, 'Clip Art with Dropout', the GAN-generated clip art images were displayed with dropout added to the phosphene simulation to test whether the subjects would still be able to recognize the object after clip art generation or whether the dropout would hinder recognition. Each of these images was also displayed for 10 s to match the duration used in the other three types. The order of images displayed throughout the experiment was randomly shuffled across subjects. To mimic the electrode dropout that might occur over time in an implant, we manually added dropout to the axon map phosphene simulation after image generation, as the pulse2percept library used for this simulation does not implement dropout [64].
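The manual dropout can be sketched as randomly silencing a fraction of the electrode amplitudes before the percept is rendered; the dropout fraction below is an assumption, as the exact rate is not reported here.

```python
# Zero out a random subset of electrode amplitudes to mimic electrode dropout.
import numpy as np

def apply_dropout(stim_values, dropout_frac=0.1, seed=0):
    rng = np.random.default_rng(seed)
    out = np.array(stim_values, dtype=float, copy=True)
    dead = rng.random(out.shape) < dropout_frac   # randomly failing electrodes
    out[dead] = 0.0                               # dropped electrodes contribute nothing
    return out
```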

2.6. Evaluation metrics

The subjects in the experiment were asked to recognize the objects displayed in front of them. Responses were logged by the experimenter during the course of the experiment. Subjects were informed in advance that there would be 24 images in the experiment, and they were informed about the available themes of the objects. Three evaluation metrics were used to evaluate the performance of the subjects in each experiment. The first metric is the time taken by the subject to recognize the object, measured in seconds relative to the time at which the image is displayed. The second metric is the confidence level of the subject in the recognition on a scale from 1 to 5, where 1 indicates that the subject is extremely unsure about the answer, whereas 5 indicates the contrary. Finally, the third metric is the recognition accuracy, measured using the sensitivity measure d', which takes into account the hit and false alarm rates for each class (i.e. an object in this case), as opposed to other measures of accuracy that focus on the hit rate only [65]. For a given object obj, the hit rate $\text{HitRate}_{\text{obj}}$ is calculated as

Equation (7): $\text{HitRate}_{\text{obj}} = \dfrac{\text{Hits}_{\text{obj}}}{S}$

where $\text{Hits}_{\text{obj}}$ is the number of participants who were able to correctly identify obj and $S$ is the total number of subjects, which is equivalent to the number of times a certain object was displayed in the whole experiment (i.e. ten subjects). While a hit is normally either 1 for a correct identification or 0 for an incorrect one, a value of 0.5 was used when a subject recognized the general theme of the object but not the object itself. For example, a value of 0.5 was assigned if the subject recognized a couch as a chair, given that both have the same general theme.

In addition to the hit rate, we computed the false alarm rate $\text{FARate}_{\text{obj}}$ for an object obj, calculated as

Equation (8): $\text{FARate}_{\text{obj}} = \dfrac{\sum\nolimits_{r \ne \text{obj}} \text{FA}_{\text{obj}}^{(r)}}{S\left( O - 1 \right)}$

where $\text{FA}_{\text{obj}}^{(r)}$ is the number of times obj was falsely reported when a different object r was actually displayed, $S$ is the total number of subjects in the experiment and $O$ is the total number of objects. Then, for a given object, $d'_{\text{obj}}$ is calculated as

Equation (9): $d'_{\text{obj}} = z\left( \text{HitRate}_{\text{obj}} \right) - z\left( \text{FARate}_{\text{obj}} \right)$

where $z\left( \text{HitRate}_{\text{obj}} \right)$ and $z\left( \text{FARate}_{\text{obj}} \right)$ are the z-scores of $\text{HitRate}_{\text{obj}}$ and $\text{FARate}_{\text{obj}}$, respectively.
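The accuracy metric can be sketched as follows; the clipping of the rates away from 0 and 1 (to avoid infinite z-scores) and the example counts are our own assumptions.

```python
# Compute d' from hit and false alarm counts (equations (7)-(9)).
import numpy as np
from scipy.stats import norm

def d_prime(hits, false_alarms, n_hit_trials, n_fa_trials, eps=1e-3):
    hit_rate = np.clip(hits / n_hit_trials, eps, 1 - eps)
    fa_rate = np.clip(false_alarms / n_fa_trials, eps, 1 - eps)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)   # z(HitRate) - z(FARate)

# Hypothetical example: 9.5 hits over 10 subjects, 2 false alarms over the
# trials in which other objects were displayed.
print(d_prime(9.5, 2, 10, 230))
```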

3. Results

3.1. Image pre-processing and data preparation

We first demonstrate the outcome of the image pre-processing stages before training the PVGAN model. Figure 5 shows an input photo (a lemon) to which a median filter is applied first to remove any possible noise. A bounding box that surrounds the lemon is shown, followed by binarization using Otsu thresholding before applying the skeletonization stage. This initial pre-processing is applied to both the real high-resolution images and the clip art images. Following the initial pre-processing, a pose adjustment procedure is applied to the clip art images to adjust the orientation of the corresponding object to match that of the object in the real high-resolution image. Figure 6 demonstrates the outcome of the pose adjustment. This was done by first computing the average gradient orientation of the skeletonized images. Next, the difference between the average gradient orientation of the real high-resolution image and that of the clip art image is computed. Finally, the object in the clip art image is rotated based on the computed difference. This is expected to achieve better training of the model in addition to providing better object localization for visual prostheses users.


Figure 5. Pre-processing stages preceding orientation adjustment.


Figure 6. Clip art pose adjustment based on skeletonization.


We next illustrate the HOG process for best clip art selection. The criterion for the best match is the similarity in both shape and orientation of the object in the clip art image compared to that of the real high-resolution image. Figure 7 shows a sample photo of a lemon for which the corresponding clip art representation is obtained, based on HOG, from a sample of three of the ten candidate clip art images after applying orientation adjustment.


Figure 7. Clip art selection based on HOG.


Finally, we demonstrate the color adjustment procedure, which gives the clip art object a similar color to that of the actual object in the real high-resolution image. This was done by subtracting the average color of the clip art image from that of the real high-resolution image and adding the resulting difference to each pixel of the clip art image. Accordingly, the clip art images generated by PVGAN are automatically color adjusted to match the input high-resolution image. Figure 8 shows the output of PVGAN for a sofa that takes its shape from the fauteuil (armchair) that PVGAN was trained on and takes its color from the object in the input high-resolution image.


Figure 8. Sample for test image from new class.


3.2. Phosphenes simulation outcome

We first demonstrate the outcome of PVGAN when applied to sample images. Table 1 shows the outcome of applying PVGAN to a sample of new test images that either belong to the training classes or belong to new classes. It also shows the axon map phosphene simulation of the real high-resolution image in addition to that of the corresponding clip art version generated by the PVGAN generator. Two sample images are shown that belong to the training classes, representing new images other than those the PVGAN model was trained on (an apple and a car). The last two columns of table 1 demonstrate that the axon map phosphene simulation of clip art images is more illustrative and easier to recognize compared to that of the real high-resolution images, due to the simple (i.e. abstract) representation of the object without unnecessary details. This is clear in the case of the apple and the car. More importantly, objects in the test images that do not belong to the training classes were also successfully converted to their corresponding clip art representation, which can be easily recognized when displayed in phosphene simulation, as in the case of the horse and the sofa. In this case, the two sample images took their shape from the most relevant objects PVGAN was trained on, where the most similar objects from the training set are the zebra and the chair, respectively. Table 1 also demonstrates the impact of the adjustments applied to the clip art images. For the car, the clip art generated by PVGAN has the same orientation (pose) and color as the real high-resolution image. The last column of table 1 shows the manually added dropout, which reflects the black spots at certain locations in the visual field that could occur due to the malfunction of some of the implanted electrodes. However, despite the electrode dropout, the objects are still recognizable. None of the subjects was presented with all the phosphene simulations appearing on the same row of the table.

Table 1. Axon map phosphene simulation.

Test image | Real image | Generated clip art | Clip art no dropout | Clip art with dropout
New images from the training classes
Images from new classes not belonging to training classes

3.3. Experimental results

To quantify the performance of the proposed approach, we conducted an experiment involving ten subjects in which each subject was presented with 24 images comprising the four types of images previously shown in table 1. The four types comprise displaying the axon map phosphene simulation of the real images (as a control group), displaying the axon map phosphene simulation of the real images followed by that of the clip art images, displaying the axon map phosphene simulation of clip art images without dropout, and displaying the axon map phosphene simulation of clip art images with dropout. First, we examined the performance of the subjects when presented with images that belong to the same classes used in training PVGAN. Figure 9(a) illustrates the recognition time for each type. The figure demonstrates that the shortest recognition time is achieved when using the clip art representation in comparison to using the real image counterpart (Real Image: 9.63 ± 0.64 s, Real before Clip Art: 7.47 ± 0.45 s, Clip Art no Dropout: 4.37 ± 1.07 s, Clip Art with Dropout: 4.1 ± 1.05 s, P < 1 × 10−07, n = 12, two-sample t-test). The 'Real before Clip Art' approach resulted in an intermediate recognition time since the real image is displayed first, which spans 5 s, before the corresponding clip art image is displayed. In addition, no significant difference can be observed between presenting the subjects with clip art without dropout versus clip art with dropout. This indicates that the occurrence of electrode dropout might not hinder the recognition ability of the subjects if the clip art representation generated by PVGAN is used.


Figure 9. Results for test images from training classes. (a) Recognition time. (b) Recognition accuracy measured using d'. (c) Confidence level. All values reported as mean ± std. *P < 0.05, **P < 1 × 10−04, ***P < 1 × 10−07, two-sample t-test.


Figure 10. Results for test images from new classes. (a) Recognition time. (b) Recognition accuracy measured using d'. (c) Confidence level. All values reported as mean ± std. *P < 0.05, **P < 1 × 10−04, ***P < 1 × 10−07, two-sample t-test.


We also evaluated the ability of the subjects to correctly recognize the presented objects using the sensitivity measure d'. Figure 9(b) shows the recognition accuracy for each of the four aforementioned presentation types. The figure demonstrates that the types in which phosphene simulations of clip art images were presented resulted in similar accuracies and outperformed the phosphene simulation of the 'Real Image' group (Real Image: 0.65 ± 0.93, Real before Clip Art: 0.93 ± 0.15, Clip Art no Dropout: 0.97 ± 0.08, Clip Art with Dropout: 0.98 ± 0.05, P < 1 × 10−04, n = 12, two-sample t-test). This is consistent with the reduced recognition time observed in figure 9(a).

We finally quantified the confidence of the subjects in their recognition. Figure 9(c) shows that the confidence level of the subjects when presented with clip art was significantly higher than that of the 'Real Image' approach (Real Image: 1.37 ± 0.06, Real before Clip Art: 4.4 ± 0.35, Clip Art no Dropout: 4.73 ± 0.25, Clip Art with Dropout: 4.53 ± 0.06, P < 1 × 10−07, n = 12, two-sample t-test).

We then analyzed test images from new classes that were not encountered during PVGAN training to measure the ability of the model to generalize beyond the training classes. Figure 10 shows the results for the four types using test images from new classes according to the three metrics. The performance of the subjects across the four types remained consistent with the performance reported in figure 9. The figure shows that the axon map phosphene simulation of a real image, as shown in 'Real Image' and 'Real before Clip Art', can barely be identified due to the variety of details in the real image, whereas much better performance is achieved when clip art is used. This is consistent across the recognition time (Real Image: 9.8 ± 0.35 s, Real before Clip Art: 7.17 ± 0.06 s, Clip Art no Dropout: 3.87 ± 0.15 s, Clip Art with Dropout: 3.7 ± 0.46 s, P < 1 × 10−07, n = 12, two-sample t-test), the recognition accuracy measured using d' (Real Image: 0.1 ± 0.2, Real before Clip Art: 0.75 ± 0.35, Clip Art no Dropout: 0.8 ± 0.14, Clip Art with Dropout: 0.95 ± 0.07, P < 1 × 10−07, n = 12, two-sample t-test), and the confidence level (Real Image: 2.1 ± 0.98, Real before Clip Art: 4.67 ± 0.29, Clip Art no Dropout: 4.27 ± 0.49, Clip Art with Dropout: 4.43 ± 0.21, P < 1 × 10−07, n = 12, two-sample t-test).

4. Discussion

We proposed PVGAN, a novel deep learning approach for object simplification that could enable easier object recognition for visual prostheses users. The model was specifically trained using images of objects that are relevant to visual prostheses users. To prepare the data used to train PVGAN, pose and color adjustments were applied to match each clip art to the corresponding real object. As a result, the model generates clip art images even for newly seen objects that PVGAN was not trained on, correctly oriented according to the orientation of the high-resolution object. This is critical to help visual prostheses users figure out the direction of moving objects such as, for instance, vehicles in the street.

Introducing the use of GANs for object simplification represents a novel direction compared to previous efforts that attempted to enhance object representation for visual prostheses users. For instance, background subtraction was proposed to enhance the quality of the perceived image; however, the details of the objects in the image cannot be conveyed because the limited number of electrodes hinders those details from being recognized [66]. Moreover, in [67], segmentation was used to separate foreground objects from the background to allow an enhanced perception of images. This enables objects to be segmented by means of a certain property, such as region-based or color-based segmentation. However, those objects are still represented with their full details, which are difficult to discern in the perceived low-resolution images. Another approach proposed by Sanchez-Garcia et al used grayscale histogram equalization to improve the contrast of the input images [68]. Edge enhancement was also introduced by Dowling et al to enhance the recognition of objects displayed in a low-resolution environment, which reduces the amount of information needed for recognition [69]. While these approaches represent different enhancement directions, they all rely on representing the full details of the object. In PVGAN, on the other hand, we aim to simplify the object representation for better visualization to match the low-resolution representation perceived by visual prostheses users.

PVGAN utilizes Pix2Pix, a well-known GAN architecture for image-to-image translation. While other models in the literature could be used for the same task, Pix2Pix seemed the most appropriate given that it is relatively simple and capable of generating large high-quality images across a variety of image translation tasks. For instance, CycleGAN also performs image-to-image translation, but from another viewpoint: unpaired pictures are used to map an image from one category to an image in another category in a cyclic manner. However, this does not guarantee the correct mapping of images between the two domains; instead, it allows for two-way image translation [70]. Another model, StyleGAN, maps points from the latent space to an intermediate latent space to control the style at each point in the generator model and adds noise as a source of variation at every point in the generator [71]. However, StyleGAN is designed for style-controlled synthesis within a single domain (most commonly human faces) rather than paired image-to-image translation, which is why it was not used in our proposed work, as it would not accomplish the generation of clip art images from real images.

To analyze the phosphene simulation of the different types of images (i.e. real high-resolution images and clip art images), the axon map phosphene simulation model was used. This model represents the type of phosphenes that have been reported by implanted patients. The phosphene simulation of the real high-resolution images resulted in ambiguous and undefined objects. This represents the current perception available in visual prosthetic devices, and thus a limitation of current visual prostheses, which could be attributed to the limited number of electrodes used to represent the visual field. The performance of the subjects confirmed our hypothesis that a clip art representation is better at representing the objects. Subjects were able to recognize the objects quickly and their confidence in identifying the identity of an object was extremely high. This is due to the fact that it is easier to see an image that contains only a simple representation of the object of interest as opposed to a real image that is full of irrelevant details. Moreover, even in the case of electrode dropout, understanding and recognizing a clip art image was not significantly affected, indicative of the efficacy of the proposed approach. This could enable visual prostheses users to better interpret the viewed scenes, giving them more independence and confidence.

While the results achieved here using simulated prosthetic vision indicate the significance of using clip art representation as input to visual prostheses, the approach remains to be tested on real patients. Feedback from patients would enable better tuning of the model. For instance, despite the positive responses from the participants in the experiments when representing the clip arts using the axon map phosphene simulation, phosphenes have been reported by patients to have variable sizes due to spatial and temporal distortions [62]. In addition, using PVGAN in real time could be examined to measure the delay caused by PVGAN in generating clip arts. Using GANs, and deep learning in general, is known to induce some delays [72]. It remains to be tested whether such delays would have an impact on the experience of visual prostheses users. Our proposed system could be used in future visual prosthetic systems, as it is expected to provide high recognition ability to implanted patients as a result of using the simple clip art representation of objects. Moreover, for future visual prosthetic systems, the number of electrodes is predicted to be larger than the current number and could reach 32 × 32, enabling better perception of the images. One way our system could be used in a real visual prosthesis system is by displaying the real scene followed by the object of interest lit up and all other objects darkened to enable object localization. This could be followed by the display of the clip art representation of that object. All the objects of interest in the scene would be treated in the same manner until the clip art representations of all the objects have been displayed in order.

5. Conclusion

Visual prostheses have demonstrated significant success in recent years in compensating for the loss of vision in blind patients. However, multiple challenges have been observed that need to be addressed for better adoption of this technology. While some of these challenges could be addressed through the development of the implant hardware, image processing can still be utilized to provide better perception using the current technology. In this article, we proposed a GAN-based model that generates a clip art image for a given real high-resolution image to improve object recognition for visual prostheses users. An experiment was conducted utilizing the axon map phosphene simulation model to evaluate the proposed approach, with performance measured through three evaluation metrics (time to decision, recognition accuracy and confidence level). The introduction of clip art in the phosphene simulation led to easier, faster and more accurate recognition compared to real high-resolution images. The results obtained showed that the generated clip art images used in the phosphene simulation gave outstanding results in object recognition both with and without dropout added to the phosphene simulation. This work could be extended by examining other GAN architectures in addition to testing the approach on implanted patients.

Data availability statement

The data that support the findings of this study are available upon reasonable request from the authors.

Conflict of interest

The authors report no competing interests.
