Capsule Endoscopy Image Deblurring Method Based on Multi-scale Recurrent Network

During capsule endoscopy image acquisition, operator error can introduce motion blur into the captured images. A multi-scale recurrent attention network is proposed to address this motion blur. Because over-exposure during acquisition produces highlights that degrade the network's output, an image preprocessing module is incorporated into the network to remove these highlights from the input images. To improve sampling efficiency, the encoder is equipped with a Convolutional Block Attention Module (CBAM) to extract blur features more effectively, and depthwise separable convolution is employed to reduce the parameter count. Since capsule endoscopy images are structurally simple and stable, the mean square error, which is highly sensitive to data errors, is chosen as the loss function. Experimental results demonstrate that the proposed network achieves a Peak Signal-to-Noise Ratio (PSNR) of 31.096 and a Structural Similarity Index Measure (SSIM) of 0.925, a significant improvement over DeblurGAN-v2 and Pix2pix.


Introduction
With the advent of intelligent medical treatment, an increasing number of intelligent medical devices have entered the market. These products integrate computer-aided technology with medical devices, significantly enhancing the healthcare experience. As an intelligent medical device, capsule endoscopy offers convenience, painlessness, and flexibility in use, making it a preferred option for many patients. However, motion blur often occurs while capturing images with capsule endoscopy. There are two primary approaches to mitigating motion blur: mathematical optimization and deep learning. The former can restore image details effectively even when they are unclear, and exhibits strong versatility. For example, Lee et al. [1] proposed a novel deblurring algorithm that first models light-field blur mathematically, then models motion blur as the integral of the latent sub-aperture images over the shutter-open interval, and jointly estimates the latent sharp image from these models. Based on the principles of variational Bayesian inference, Yang and Hui [2] proposed a variational expectation-maximization algorithm that uses adaptive edge selection to restore images without prior knowledge of their latent blur kernels. However, mathematical optimization lacks task specificity and often falls short of deep learning in certain image-processing scenarios.
To enhance the visual experience, Sim and Kim [3] proposed a motion-deblurring kernel-learning network that performs pixel-wise adaptive convolution with the learned blur kernel and introduces a residual module to generate a residual image, which is added to the adaptive-convolution output to reconstruct the sharp image collaboratively. Zheng et al. [4] proposed an edge-heuristic multi-scale Generative Adversarial Network (GAN) to recover images in an end-to-end manner.
These existing research schemes greatly inspired the approach of this paper. Given the random motion blur characteristic of capsule endoscopy images, we note that Tao et al.'s Scale-Recurrent Network DeblurNet (SRN-DeblurNet) [5] exhibits excellent performance across multiple datasets. Building on SRN-DeblurNet, this paper proposes a multi-scale recurrent attention network for capsule endoscopy image deblurring. To address the highlights often present in capsule endoscopy images, an image preprocessing module is placed at the front of the network to remove highlights before the images enter the recurrent network. An attention mechanism is also incorporated into the recurrent-network encoder to help the ResBlock [6] modules extract blurred image details more accurately. To improve prediction efficiency, depthwise separable convolutions replace standard convolutions in both the encoder and the decoder. Finally, since capsule endoscopy images are relatively uniform and structurally stable, we adopt the mean square error as the network's loss function because of its high sensitivity to data errors.

Input preprocessing module
Considering the common issue of highlights in capsule endoscope images, this paper proposes an input preprocessing module to enhance image quality. The structure of the module is illustrated in Figure 1.

Figure 1. Preprocessing module
The module first resizes the input image to a fixed, user-defined square size that does not exceed the original image size. It then locates the highlight points in the image. Following Meslouhi et al. [7], the luminance Y of a highlight region always exceeds the chromaticity y. The RGB image is converted into CIE-XYZ space to obtain Y, and y is then derived using Equation (1); any region where Y > y is treated as a highlight region.
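As a concrete illustration of the detection step, the sketch below converts an RGB image to CIE-XYZ and flags pixels whose luminance Y exceeds the chromaticity y, as stated in the text. Since Equation (1) is not reproduced here, the exact form of the comparison is an assumption, and `highlight_mask` is an illustrative name, not the paper's implementation.

```python
import numpy as np

# Linear-RGB -> CIE-XYZ conversion matrix (sRGB primaries, D65 white point)
RGB2XYZ = np.array([[0.4124, 0.3576, 0.1805],
                    [0.2126, 0.7152, 0.0722],
                    [0.0193, 0.1192, 0.9505]])

def highlight_mask(rgb):
    """Return a boolean mask of candidate highlight pixels.

    rgb: float array in [0, 1] with shape (H, W, 3).
    Converts to CIE-XYZ, takes the luminance Y, derives the
    chromaticity y = Y / (X + Y + Z), and flags pixels with Y > y.
    """
    xyz = rgb @ RGB2XYZ.T                    # per-pixel XYZ coordinates
    X, Y, Z = xyz[..., 0], xyz[..., 1], xyz[..., 2]
    y = Y / np.maximum(X + Y + Z, 1e-8)      # CIE chromaticity y
    return Y > y                             # highlight where Y dominates

# toy example: one overexposed (white) pixel and one dark pixel
img = np.array([[[1.0, 1.0, 1.0], [0.05, 0.05, 0.05]]])
mask = highlight_mask(img)
```

The bright pixel is flagged while the dark one is not; in practice the mask would then be passed to the inpainting stage as the positioning map.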
After locating the highlight regions, the module feeds a positioning map of those regions into an improved Criminisi algorithm [8], which fills and restores them. The algorithm reconstructs large regions effectively because it prioritizes the filling order, giving precedence to larger highlight points. The improved Criminisi algorithm proceeds as follows:
Algorithm: Improved Criminisi
Step 1: Compute the patch priority along the boundary of each highlight region and select the patch with the highest priority.
Step 2: Determine the best filling patch.
Step 3: Update the confidence term.
Step 4: Repeat Steps 1 to 3 until all highlight regions are filled.
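The four steps above can be sketched as a greedy fill loop. This is a heavily simplified stand-in for the improved Criminisi algorithm, under stated assumptions: priority is approximated by the count of known neighbours in the patch, and the fill value is a local mean rather than a best-matching exemplar patch.

```python
import numpy as np

def inpaint_priority(img, mask, patch=3):
    """Greedy priority-driven filling in the spirit of Steps 1-4.

    img:  2-D float array; mask: boolean array, True = pixel to fill.
    Simplified sketch: priority = number of known neighbours, fill
    value = mean of known pixels in the patch window.
    """
    img, mask = img.astype(float).copy(), mask.copy()
    r = patch // 2
    while mask.any():
        # Step 1: pick the unfilled pixel with the highest priority
        ys, xs = np.nonzero(mask)
        best, best_score = None, -1
        for y, x in zip(ys, xs):
            known = (~mask[max(0, y - r):y + r + 1,
                           max(0, x - r):x + r + 1]).sum()
            if known > best_score:
                best, best_score = (y, x), known
        y, x = best
        # Step 2: fill from the surrounding known pixels
        region = img[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        known = ~mask[max(0, y - r):y + r + 1, max(0, x - r):x + r + 1]
        img[y, x] = region[known].mean() if known.any() else 0.0
        # Step 3: update confidence (here: mark the pixel as known)
        mask[y, x] = False
        # Step 4: the while-loop repeats until every region is filled
    return img

# toy example: restore a single highlight pixel inside a uniform image
img = np.ones((5, 5)); img[2, 2] = 0.0
mask = np.zeros((5, 5), bool); mask[2, 2] = True
filled = inpaint_priority(img, mask)
```

The full algorithm additionally searches the whole image for the best-matching exemplar patch, which is what gives it its strong reconstruction ability on large regions.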

Encoder-decoder structure
In this study, the encoder is constructed from one convolutional layer, three ResBlocks [6], and one CBAM [9]; the convolutional kernel size is 5×5. ResBlocks mitigate vanishing gradients while enhancing the propagation of image features. Incorporating CBAM into the encoder focuses computation on blurred image regions, reducing resource consumption in other regions and improving the sampling effect of the ResBlocks. The decoder mirrors the encoder, comprising three ResBlocks and a deconvolution layer. Figure 2 illustrates the structure of both the encoder and the decoder. To increase network speed, depthwise separable convolution is used in both the encoder and the decoder instead of conventional convolution: each channel is convolved separately and a pointwise convolution then mixes the outputs, effectively reducing the parameter count without significant feature loss.
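The parameter saving from depthwise separable convolution can be seen in a short PyTorch sketch. The 5×5 kernel matches the encoder described above; the channel counts (64 in, 128 out) are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """Depthwise separable convolution: a per-channel (depthwise)
    convolution followed by a 1x1 pointwise convolution that mixes
    the channels, as described in the text."""
    def __init__(self, in_ch, out_ch, kernel_size=5):
        super().__init__()
        # groups=in_ch makes each input channel convolve independently
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def n_params(m):
    return sum(p.numel() for p in m.parameters())

standard = nn.Conv2d(64, 128, 5, padding=2)        # 64*128*25 + 128 weights
separable = DepthwiseSeparableConv(64, 128, 5)     # far fewer parameters
```

For these channel counts the separable version needs roughly 5% of the standard convolution's parameters while producing an output of the same spatial size, which is the trade-off the encoder-decoder exploits.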

Network architecture
The proposed network architecture comprises an input preprocessing module, an encoder, a decoder, Long Short-Term Memory networks (LSTM) [10], and other key components, as illustrated in Figure 3. The input blurred image is preprocessed to adjust its size and remove highlight points before being fed into the encoder. The encoder then samples and encodes the image at scales of 32×32, 64×64, and 128×128. After sampling completes in the upper-scale network, CBAM generates attention features that are combined with the sampled codes for use in the lower-scale network. Finally, the encoded sample is passed to the decoder, whose output is fed back into the encoder for multiple iterations until the loop terminates. Equation (2) describes the deblurring process at each scale.

I^i, h^i = G(I_B^i, I^(i+1), h^(i+1))    (2)

where i is the scale index (a larger i corresponds to a smaller scale); I^i and h^i are the generated image and the recurrent-layer output at scale i; I_B^i is the blurred image at scale i; and G(·) is the function expression of the single-scale network. In the network architecture, the encoder and decoder are connected via an LSTM module, allowing selective integration of feature information from both the large- and small-scale networks to enhance memory capacity.
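The coarse-to-fine recurrence of Equation (2) can be sketched as a plain loop. This is a structural illustration only: `G` stands in for the single-scale network and is passed as a function, the pyramid is built by simple striding rather than learned downsampling, and upsampling the previous scale's outputs to the current resolution is left to `G`.

```python
import numpy as np

def deblur_multiscale(blurred, G, n_scales=3):
    """Run I^i, h^i = G(I_B^i, I^(i+1), h^(i+1)) from the coarsest
    scale to the finest.

    blurred:  2-D array, the full-resolution blurred image I_B.
    G:        callable (I_B_i, I_prev, h_prev) -> (I_i, h_i),
              a placeholder for the single-scale network.
    """
    # Build the pyramid: index 0 = finest scale, larger i = coarser,
    # matching "scale value inversely proportional to scale size".
    pyramid = [blurred]
    for _ in range(n_scales - 1):
        pyramid.append(pyramid[-1][::2, ::2])
    I_prev, h_prev = pyramid[-1], None        # coarsest scale starts cold
    for i in range(n_scales - 1, -1, -1):     # coarse -> fine
        I_prev, h_prev = G(pyramid[i], I_prev, h_prev)
    return I_prev

# toy run with a trivial stand-in network that just returns ones
out = deblur_multiscale(np.zeros((8, 8)),
                        lambda b, I, h: (np.ones_like(b), None))
```

In the actual network the hidden state h carries the LSTM memory between scales, which is what lets features selected at the coarse scale guide reconstruction at the fine scale.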

Loss function
Mean Square Error (MSE) is selected as the loss function in this paper. The smaller the MSE, the closer the predicted value is to the true value, so MSE effectively reflects the relationship between predicted and true values. However, MSE is sensitive to outliers: if the samples are highly dispersed, outliers may distort the loss function and reduce model accuracy. The images processed in this paper were obtained from capsule endoscopy and have a relatively simple, stable structure with minimal sample dispersion, so the MSE loss can accurately evaluate the error of the predicted image values and better assess the prediction model. The mean square error loss is given in Equation (3):

MSE = (1/n) Σᵢ (yᵢ − ŷᵢ)²    (3)

where yᵢ is the true value for sample i and ŷᵢ is the predicted value for sample i.
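Equation (3) reduces to a one-line computation; the example below works it through on a tiny image where every pixel is off by 0.1, which squares to an MSE of 0.01.

```python
import numpy as np

def mse_loss(y_true, y_pred):
    """Mean square error of Equation (3): the average of the squared
    per-pixel differences between the true values y and the
    predictions y-hat."""
    y_true = np.asarray(y_true, float)
    y_pred = np.asarray(y_pred, float)
    return np.mean((y_true - y_pred) ** 2)

# worked example: a constant error of 0.1 per pixel gives MSE = 0.01
sharp = np.full((2, 2), 0.5)
pred = np.full((2, 2), 0.6)
```

Because the error is squared, one badly mispredicted pixel contributes disproportionately, which is exactly the outlier sensitivity the text discusses.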

Dataset creation
In this study, the gastroenteroscopy dataset in [11] was used to construct the training and testing sets. The dataset was meticulously compiled and annotated by clinical experts in gastroenterology and covers various categories of pathological findings, gastrointestinal anatomical landmarks, and endoscopic surgical procedures. Pathological findings include polyps, esophagitis, and ulcerative colitis, while anatomical landmarks include the Z-line, pylorus, and cecum. Image resolutions range from 720×576 to 1920×1072 pixels. For this study, 500 images were randomly selected across categories. Of these, 300 sharp images formed the training set, and 300 corresponding blurred images were generated by adding random motion blur at angles of 10-40° to each sharp image. The remaining 200 images, subjected to random motion blur over the same 10-40° range, formed the test set.
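The blur synthesis described above can be sketched by rasterising a linear motion-blur kernel at a given angle and convolving it with the sharp image. The 10-40° angle range follows the dataset construction; the kernel length and the rasterisation scheme are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

def motion_blur_kernel(length, angle_deg):
    """Linear motion-blur kernel: a line of the given pixel length at
    the given angle, normalised to sum to 1 so convolution preserves
    overall brightness."""
    k = np.zeros((length, length))
    c = (length - 1) / 2                      # kernel centre
    rad = np.deg2rad(angle_deg)
    dy, dx = np.sin(rad), np.cos(rad)         # line direction
    # rasterise the line through the centre, oversampling to avoid gaps
    for t in np.linspace(-c, c, 2 * length):
        y, x = int(round(c + t * dy)), int(round(c + t * dx))
        if 0 <= y < length and 0 <= x < length:
            k[y, x] = 1.0
    return k / k.sum()

# a kernel at 25 degrees, inside the paper's 10-40 degree range
k = motion_blur_kernel(9, 25)
```

A blurred training image would then be produced by 2-D convolution of each channel with such a kernel (e.g. via `scipy.ndimage.convolve`), with the angle drawn uniformly from 10-40° per image.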

Experimental environment and parameter setting
In this paper, the model runs on Ubuntu 20.04, Python 3.8, CUDA 11.3, and PyTorch 1.11.0. The hardware comprises an RTX A5000 (24 GB) graphics card, an AMD EPYC 7543 CPU, a 100 GB hard disk, and 30 GB of memory. The initial learning rate is set to 10⁻⁴, the batch size to 16, and the number of epochs to 10,000. All input images were randomly cropped to 128×128, and the model converged in about 40 h.

Experimental analysis and results
To verify the effectiveness and robustness of the proposed network, this paper evaluates it on the aforementioned dataset using SSIM and PSNR as indicators. SSIM evaluates an image based on the brightness, contrast, and structure of two images; its value ranges between 0 and 1, with higher values indicating better image quality. PSNR measures the level of image distortion: the greater the distortion, the smaller the PSNR value. In the same system environment, the proposed network is compared quantitatively with two deblurring networks, DeblurGAN-v2 [12] and Pix2pix [13]. The experimental results for images generated by the three networks are presented in Table 1.

The data in Table 1 show that the proposed network outperforms the other two networks on both SSIM and PSNR. The improvement in SSIM over both counterparts is significant, and both Pix2pix and the proposed network achieve substantially higher PSNR than DeblurGAN-v2. On the dataset in this paper, DeblurGAN-v2 exhibited some image distortion, although it outperformed the other two networks in running speed. Since the objective of this study is to restore images acquired by capsule endoscope with high fidelity, the proposed network uses depthwise separable convolution to reduce the parameter count, and its running time falls within an acceptable range. Figure 4 shows examples of images generated by the three networks. Pix2pix processes image boundaries relatively smoothly, with less distinct boundaries. The images generated by the proposed network are visually similar to Pix2pix's but with clearer boundary processing and more prominent texture details. In conclusion, the proposed network delivers a significant enhancement in image quality for capsule endoscopy.

Conclusion
To address the motion blur that arises during capsule endoscopy image acquisition, this paper proposes a multi-scale recurrent attention network. In the proposed network, an image preprocessing module is incorporated to eliminate highlights from the collected images. The encoder-decoder structure serves as the primary component of the network, with LSTM connecting the encoder and decoder to enhance information exchange across multiple scales and give the network memory capacity. Depthwise separable convolution within the encoder-decoder architecture reduces the parameter count and improves network efficiency, while CBAM in the encoder improves the extraction of blurred image features. Experimental results demonstrate that the proposed network achieves PSNR and SSIM values of 31.096 and 0.929, respectively, surpassing DeblurGAN-v2 and Pix2pix and indicating a more targeted approach. The proposed approach therefore holds significant reference and practical value for image deblurring in capsule endoscopy.

Figure 4. Images generated by different networks
The original motion-blurred image is shown in Figure 4(a), while the results of DeblurGAN-v2, Pix2pix, and the proposed network are shown in Figures 4(b)-(d), respectively. Although DeblurGAN-v2 achieves some improvement over the original image, it still suffers from overall blurriness and color distortion. Both Pix2pix and the proposed network improve the generated images significantly compared with the original.