Neural image enhancement and restoration for time-lapse SPM images

This paper presents methods for enhancing and restoring scanning probe microscopy (SPM) images. We focus on image super-resolution as enhancement and on image denoising and deblurring as restoration. We assume that nearly identical time-lapse images are captured in the same area of each specimen. In contrast to single-image approaches, our proposed methods use a recurrent neural network that merges the time-lapse images into a single enhanced/restored image. However, subtle deformations between the time-lapse SPM images, as well as degraded pixels such as noisy and blurred pixels, prevent the network from merging the images successfully. For a successful merge, our methods spatially align the time-lapse images and detect degraded pixels based on the characteristic properties of SPM images. Experimental results demonstrate that our methods can reconstruct sharp, super-resolved images and clean, noiseless images.


Introduction
Scanning probe microscopy (SPM) allows us to inspect the surface of a specimen at the nanoscale. 1,2) However, SPM poses several problems for visual inspection. For example, the limited resolution of SPM makes it difficult to inspect a specimen precisely, although new scanning devices have been developed to improve the resolution. 3,4) In addition, SPM images are degraded for various reasons, such as noise and blur 5) due to probe degradation. To resolve these problems, we enhance and restore SPM images by machine learning, in particular deep neural networks. Specifically, this paper proposes methods for image super-resolution (SR) as enhancement and for image denoising and deblurring as restoration.
Image SR upscales an input low-resolution (LR) image to a high-resolution (HR) image. With upscaled SR images, a variety of applications can be realized, for example, distant-object detection, [6][7][8] tiny-defect inspection, 9) remote sensing, 10) wide-angle image analysis, 11) and cell image analysis. 12) As in these examples, the upscaled SR image of a specimen allows us to visually inspect its details and to develop automatic inspection systems. As with many other image enhancement technologies, [13][14][15][16][17] SR has been significantly improved by convolutional neural networks (CNNs). [18][19][20][21] In general SR methods using CNNs, 22) an SR network is trained so that the difference between a reconstructed SR image and its HR (ground-truth) image is minimized in the training process. A huge number of training images is given so that the SR network is applicable to any new image whose domain is similar to that of the training images.
However, SR from a single input image is difficult because SR is a typically ill-posed problem in which the number of pixels in the input LR image is smaller than that in the output SR image. To alleviate this difficulty, multiple LR images are used as a set of input images in video SR [23][24][25] and burst SR. 26) Our proposed method likewise improves the quality of SR by employing time-lapse SPM images that capture the same location of a specimen.
Image denoising and deblurring reduce noisy and blurred pixels, respectively. Many of these degradations are caused by the degraded tip-apex of a cantilever. The cantilever of SPM is shown in Figs. 1(a) and 1(b). Its tip-apex is expected to be as sharp as possible, as shown in Fig. 1(a). However, the tip-apex is worn out after repeated scanning, and material sometimes adheres to the tip-apex, as shown in Fig. 1(b). This produces pixel degradations in an SPM image, as shown in Figs. 1(c) and 1(d). Such noise and blur are specific to SPM images. For restoring an image degraded by this SPM-specific noise and blur, it is better to design a restoration method optimized for these degradations than to use generic image restoration methods [e.g. denoising [27][28][29] and deblurring [30][31][32]]. To cope with this SPM-specific degradation problem, the 3D shape of the degraded tip-apex is estimated in Ref. 33. With this estimated shape of the tip-apex, we can restore the captured surface of a specimen. However, naive blind shape estimation using only the captured surface is not very reliable. To improve the reliability, machine learning with a large amount of extra training data is utilized in Ref. 34 to classify each captured surface as clean or degraded. However, the captured surfaces are only classified, with no restoration, in Ref. 34.
The goal of our work is to restore images with machine learning such as CNNs. The basic strategy of denoising 35) and deblurring 36) using CNNs is the same as that of SR. That is, a restoration network is trained so that the error between an output restored image and its ground-truth image is minimized. In previous methods of microscopy image restoration, [37][38][39][40] a single image is fed into a network in order to restore that image. In contrast, our proposed method, as with the SR mentioned above, employs time-lapse SPM images to improve the quality of the output image.
Different from SR, however, denoising and deblurring have their own technical issue. For any supervised machine learning, a set of pairs of input data and ground-truth data is required. For SR, an HR image as the ground truth and its LR image as the input form the paired data. For denoising/deblurring, a clean image as the ground truth and its noisy/blurred image form the pair. Any HR SPM image can be easily and almost uniquely downscaled to its LR image in order to make the pair for SR. On the other hand, it is difficult to assume that a clean image is always available as the ground truth at every location. In the denoising method for STEM images, 40) temporal images are captured and averaged to approximately produce a clean image at each location. However, such an averaged image is known to be blurred. In addition, noise and blur produced by the degradation of a cantilever tend to be observed continuously at the same location in such temporal images; this noise and blur cannot be reduced by the averaging process.
Furthermore, the noise and blur that are specific to SPM images are unknown and difficult to simulate. Therefore, even if a clean image is available, it is difficult to synthesize realistic noisy and blurred images from it. That is, it is difficult to obtain a pair of a clean image and its SPM-specific degraded image for training a restoration model. In previous methods 37,38) for denoising microscopy images, empirically produced synthetic degradations are applied in order to artificially obtain the paired images. However, if these artificially reproduced degradations differ from those observed in real SPM images, this difference makes it difficult to train a restoration model that is applicable to real SPM images. Our proposed method avoids the unavailability of clean images in the training process by (i) selecting the cleanest image in each set of noisy time-lapse images as the ground truth, if this cleanest image is sufficiently clean, and (ii) using the other images, with their real degradations, as input images.
In this paper, time-lapse SPM images of carbon nanotube (CNT) are used for validating the effectiveness of our proposed methods for the aforementioned SR, denoising, and deblurring of time-lapse SPM images. In what follows, the details of "how to use time-lapse SPM images for SR" and "how to select the cleanest and other input images for denoising and deblurring" are described in Sects. 2.1 and 2.2, respectively.

2.1. Time-lapse image SR
As mentioned in Sect. 1, given a pair of an LR image and its HR image as training images, the LR image is fed into the SR network. Its output SR image is then compared with the HR image in order to train the network. The SR network is trained so that the difference between the SR and HR images (denoted by S and H, respectively) is minimized. In our method, this difference is represented by the following mean square error (MSE) loss function:

MSE = (1 / (|X||Y|)) Σ_{x∈X} Σ_{y∈Y} (H(x, y) − S(x, y))²,    (1)

where x ∈ X and y ∈ Y denote the x and y coordinates in an image, respectively, and H(x, y) and S(x, y) are the pixel values at (x, y) in the HR and SR images, respectively. By minimizing this MSE over a number of training images, we can train the SR network properly.
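As a concrete illustration, the MSE loss above can be computed as follows (a minimal NumPy sketch, not the authors' implementation):

```python
import numpy as np

def mse_loss(hr: np.ndarray, sr: np.ndarray) -> float:
    """Mean squared error between an HR (ground-truth) image and an SR output.

    Both images are expected to have the same shape (height, width).
    """
    assert hr.shape == sr.shape
    diff = hr.astype(np.float64) - sr.astype(np.float64)
    # Average the squared per-pixel differences over all |X| * |Y| pixels.
    return float(np.mean(diff ** 2))
```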
In the aforementioned basic strategy for training the SR network, each HR image is given and downscaled to its LR image by a standard image scaling algorithm such as bicubic interpolation. While image upscaling such as SR is difficult because it is an ill-posed problem, it is easy to downscale any image with little error. It can therefore be assumed that we can easily collect pairs of LR and HR images for training, as long as HR images are given.
For further improving the quality of SR, our proposed method employs a set of time-lapse images instead of a single input image. Our assumption is as follows. The time-lapse SPM images (e.g. 10 images) in each set are captured in the same area of each specimen. Under this capturing condition, all of these images should be almost the same as each other. Among all the images, therefore, only one image needs to be super-resolved, and the quality of this SR image is improved by merging all the other images in the set.
Our proposed network for SR using time-lapse SPM images is the recurrent neural network shown in Fig. 2. The basic structure of this network is provided by our video SR network. 41) Image features extracted in convolutional layers (indicated by "Conv" in the figure) are upscaled in the projection modules. In addition to the aforementioned processes, we have to resolve the following problem when employing time-lapse SPM images for SR. SPM is often employed for observing a variety of deformable specimens such as CNTs and biological samples, and the time-lapse images of such deformable objects change over time. It is more difficult to merge deformed images than similar images. To make it easier to merge the deformed images, we add the following processes:
• The deformations between images [i.e. F_i, where i ∈ {2, ⋯, n} in Fig. 2] are estimated and fed into the network, as shown in Fig. 2. F_i represents the spatial pixel displacements from I_1 to I_i, which is in general called an optical flow image. We expect that F_i fills the gap between I_1 and I_i. In our method, F_i is estimated by a simple and fast optical flow estimation algorithm. 42) While previous methods 43,44) also cope with deformations in SPM images, the interpolation scheme 43) produces a blurry image, and the highly non-convex minimization 44) is slow and unstable.
• Even with the aforementioned optical flow, it is not easy to fill a large gap between I_1 and I_i. If the deformation is too large, the images become too difficult to merge. To avoid this problem, such over-deformed images are removed from each time-lapse set. This removal is achieved as follows: (i) the mean over all time-lapse images in each set is computed, and (ii) images that are significantly different from the mean image are removed. This difference is expressed as the MSE between each image and the mean image; if the MSE is larger than a pre-defined threshold (denoted by T_r), the image is removed.
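The over-deformed image removal described above can be sketched as follows (a simplified illustration; the `threshold` value plays the role of T_r and is hypothetical):

```python
import numpy as np

def remove_over_deformed(images, threshold):
    """Drop time-lapse images whose MSE against the per-set mean image
    exceeds the threshold T_r.

    Step (i): compute the mean image over the set.
    Step (ii): remove images far (in MSE) from that mean.
    """
    stack = np.stack([im.astype(np.float64) for im in images])
    mean_image = stack.mean(axis=0)
    kept = [im for im, f in zip(images, stack)
            if np.mean((f - mean_image) ** 2) <= threshold]
    return kept
```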

2.2. Time-lapse image denoising and deblurring
2.2.1. Basic strategy for time-lapse image restoration. As with SR, our proposed method for image restoration employs time-lapse SPM images. The basic structure of our neural network for this restoration is the same as that in Fig. 2. The training of the image restoration network is also the same as that of the SR network, in that the MSE loss, Eq. (1), is minimized. On the other hand, different from SR, where an input image is upscaled in the projection modules, image restoration does not change the image size at its output. In order to hold the image size, the projection module is designed to keep the size of its output feature map.
Another important difference is that image denoising and deblurring must explicitly cope with degraded pixels. This is because denoising and deblurring have to reduce these degraded pixels, whereas SR processes degraded pixels as they are. Although many time-lapse SPM images are degraded, pixels that are degraded in some images can be captured with less noise in other images. In the sample time-lapse images shown in Fig. 3, red and green circles indicate clean and degraded pixels, respectively. Our goal is to merge all clean pixels in order to restore one of the input time-lapse images.
The basic training strategy mentioned in Sect. 2.1 applies not only to SR but also to most other image enhancement and restoration methods. That is, given a pair of an input image and its ground-truth image as training data, the network is trained so that its output gets close to the ground-truth image. To train a neural network for denoising and deblurring, degraded images are given as inputs and clean images serve as their ground truths.

2.2.2. Image selection depending on the degradation score. While our SR and restoration problems are similar to each other, as mentioned in Sect. 2.2.1, there are several differences between the two tasks. Due to these differences, we face the following three problems:
1. Training sets: In our SR task, every training set can be used. In our image restoration task, however, a training set cannot be used if all images in the set have a number of degraded pixels. This is because at least one clean image is required as the ground truth for the training process.
2. Ground-truth image for training: In each training set, both the input images and a ground-truth image are required. In our SR task, any image can serve as the ground truth in each set. In our image restoration task, however, the ground-truth image must be clean.
3. Input images: The input images should also be as clean as possible for better restoration. In our image restoration task, however, all images are degraded in some time-lapse sets.
In what follows, our solution for each problem is described.
1. Training sets: In order to successfully train the network, we propose to split all training sets into clean sets and others. Each clean set includes at least one clean image, which can be regarded as the ground-truth image. If no clean image is captured in a set, that set is not used for training the network.
2. Ground-truth image for training: In each training set, the cleanest image among all clean images is selected and regarded as the ground-truth image used for training the network.
3. Input images: Among all input images, the cleanest image is selected as I_1 shown in Fig. 2. This is because the primary input of our network is I_1, which is improved by merging clean pixels from the other input images, I_2, ⋯, I_N.
In all three solutions mentioned above, we assume that the cleanness level of each image can be measured; for example, the cleanest image must be selected in each training set. In our proposed method, the noise level, which is the opposite measure of the cleanness level, is defined using image derivatives. The discrete image derivatives are approximately computed by subtraction between neighboring pixel values as follows:

I_h(x, y) = I(x + 1, y) − I(x, y),    (2)
I_v(x, y) = I(x, y + 1) − I(x, y),    (3)

where I(x, y) denotes the pixel value at (x, y) in image I. In the image derivatives, boundary lines are activated, as shown in Fig. 4. Figures 4(b) and 4(c) are the horizontal and vertical image derivatives of the SPM image shown in Fig. 4(a), respectively. We assume that boundary lines in a microscopy image are oriented uniformly at random. If such an image with random boundary lines has no noise, "the sum of the vertical derivatives, Eq. (3), over all pixels (denoted by D_v)" minus "the sum of the horizontal derivatives, Eq. (2), over all pixels (denoted by D_h)" is expected to be zero. In SPM images, however, stripe noise is observed horizontally along the scanning direction of the microscope, as shown in the examples in Figs. 4(d), 4(e), and 4(f).
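The derivative sums D_h and D_v, and the resulting noise score S_n = D_v − D_h, can be sketched as follows (an illustration that assumes absolute derivative values are accumulated, which is one plausible reading of the definition):

```python
import numpy as np

def noise_score(image: np.ndarray) -> float:
    """Noise score S_n = D_v - D_h from the discrete image derivatives.

    Horizontal stripe noise produces large vertical derivatives, so a
    noisier SPM image yields a larger (more positive) score.
    """
    img = image.astype(np.float64)
    d_h = np.abs(img[:, 1:] - img[:, :-1]).sum()   # horizontal derivatives, Eq. (2)
    d_v = np.abs(img[1:, :] - img[:-1, :]).sum()   # vertical derivatives, Eq. (3)
    return float(d_v - d_h)
```

Selecting the cleanest image in a set then amounts to taking the image with the minimum score.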
Due to this property, the sum of the vertical derivatives becomes larger than the sum of the horizontal derivatives. This difference is used as the noise score, S_n, in our method:

S_n = D_v − D_h.    (4)

2.2.3. Image division for data augmentation. As shown in Fig. 3, many images might be degraded by noise and blur. If all images in a set of time-lapse images are significantly degraded, this set cannot be used in the training process of our method. This is critical because machine learning cannot work well if the amount of training data is insufficient. In general, however, this noise and blur are observed only partially, as indicated by the red circles in Fig. 5. Exploiting this property, we propose to divide each SPM image into sub-images and to feed the sub-images to our restoration network. In the example shown in Fig. 5, each SPM image is divided into four sub-images, each of which is enclosed by a colored rectangle. Green and blue rectangles contain no degraded pixels and several degraded pixels, respectively. While neither SPM image 1 nor 2 in Fig. 5 can be used as the ground-truth image due to noisy pixels, their sub-images with no noisy pixels can be employed as clean ground-truth data.
Since the aforementioned method feeds a sub-image to the network, its output is also a sub-image. If the output restored image is required at its original size, all sub-images generated from each SPM image are tiled into one image.
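The division into sub-images and the final tiling can be sketched as follows (image dimensions are assumed to be multiples of the sub-image size):

```python
import numpy as np

def split_into_subimages(image, size=64):
    """Divide an image into non-overlapping size x size sub-images
    (row-major order)."""
    h, w = image.shape
    return [image[r:r + size, c:c + size]
            for r in range(0, h, size)
            for c in range(0, w, size)]

def tile_subimages(subs, rows, cols):
    """Reassemble sub-images (row-major order) into one full-size image."""
    return np.block([[subs[r * cols + c] for c in range(cols)]
                     for r in range(rows)])
```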

Dataset
We captured the following specimen for our experiments:
• Specimen: CNT.
• UV ozonation: 1 h.
This specimen was captured in the following preparation and measurement conditions:
Training sets: For verifying the noise robustness of the proposed method, the 143 training sets are divided into 67 and 76 sets so that, in each of the latter 76 sets, the cleanest time-lapse image is less noisy. That is, with the latter sets, our network can be trained properly to reduce more noise. We call the latter 76 training sets Dataset-NL. All training sets are used in all experiments except for the one for Table I(f), in which only Dataset-NL is used for training.
If the amount of training data is not sufficient, the network cannot perform well. In reality, the 143 training sets in our experiments are much smaller than general datasets for image processing and recognition, and it is difficult to train a huge nonlinear model such as a neural network with only such a small amount of data. To relieve this difficulty of training with a small amount of data, fine-tuning from a pre-trained model is generally employed. For fine-tuning, the model is pre-trained in advance with a huge amount of data whose domain differs from the target domain (i.e. SPM images in our case). By using this pre-trained model as an initial model, the network is then trained with the small amount of target-domain data. In all experiments shown in this paper, the network is trained with this fine-tuning strategy. In our experiments, the network is pre-trained with general-purpose video datasets [i.e. Vimeo90K, 45) SPMCS, 46) and Vid4 47)], which contain general landscape photos.
Test sets: For evaluating the performance of noise removal on a test set, a noiseless version of the images in the test set is required. It is, however, difficult to capture such a perfectly noiseless SPM image. For the quantitative evaluation in this paper, the least noisy image in each test set is selected and regarded as the ground truth. For this selection, the amount of noise is measured by Eq. (4).

Implementation details
The implementation details of our networks are as follows. The sub-image size is 64 × 64 pixels. The number of input frames is seven.
In addition to the proposed image division process for data augmentation, the training time-lapse images are also augmented by general data augmentation: (i) random brightness adjustment using gamma correction with 0.8 ⩽ γ ⩽ 1.1 and (ii) random horizontal and vertical image flipping. Note that all time-lapse images in each set are augmented in the same manner so that the images remain temporally consistent.
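The set-consistent augmentation can be sketched as follows (a simplified illustration for 8-bit images; one random gamma and one flip decision are drawn per set and applied to every image in it):

```python
import numpy as np

def augment_set(images, rng=None):
    """Apply one random gamma (0.8 <= gamma <= 1.1) and one random
    horizontal/vertical flip decision to ALL images in a time-lapse set,
    so that the set stays temporally consistent."""
    if rng is None:
        rng = np.random.default_rng()
    gamma = rng.uniform(0.8, 1.1)
    flip_h = rng.random() < 0.5
    flip_v = rng.random() < 0.5
    out = []
    for im in images:
        # Gamma correction on [0, 255] pixel values.
        a = (im.astype(np.float64) / 255.0) ** gamma * 255.0
        if flip_h:
            a = a[:, ::-1]
        if flip_v:
            a = a[::-1, :]
        out.append(a)
    return out
```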
In the training process, Adam (adaptive moment estimation) 48) is used as the optimizer. The hyperparameters used in training our neural networks are as follows: the learning rate is 10 −4 , and the mini-batch size is 12.
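For reference, a single Adam parameter update with the learning rate above can be sketched as follows (β₁, β₂, and ε are the common defaults, assumed here rather than taken from the paper):

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=1e-4, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update for parameters `param` with gradient `grad`.

    m, v are the running first/second moment estimates; t is the
    1-based step index used for bias correction.
    """
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v
```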

Time-lapse image SR
In the literature (e.g. Refs. 18, 49), the performance of SR is evaluated by the peak signal-to-noise ratio (PSNR) and the structural similarity index measure (SSIM). 50) PSNR compares the HR and SR images (denoted by H and S) pixelwise as follows:

PSNR = 10 log_10 (255² / MSE),    (5)

where MSE is defined in Eq. (1). SSIM compares two images x and y as follows:

SSIM(x, y) = ((2 μ_x μ_y + c_1)(2 σ_xy + c_2)) / ((μ_x² + μ_y² + c_1)(σ_x² + σ_y² + c_2)),    (6)

where (μ_x, μ_y), (σ_x², σ_y²), and σ_xy denote the means of x and y, the variances of x and y, and the covariance of x and y, respectively. c_1 = (0.01 × 255)² and c_2 = (0.03 × 255)² denote regularization constants. SSIM takes its maximum value (i.e. 1) if x = y.
Experimental results are shown in Fig. 6. We conducted experiments with several thresholds T_r for removing time-lapse images from each training and test set. In both results [i.e. (a) PSNR and (b) SSIM], the following observations can be made:
• In terms of the threshold for the test images, a lower threshold achieves better results in all cases. That is, among all bars in the same cluster (i.e. the six differently colored bars that are grouped together along the horizontal axis), a lower threshold achieves better performance than a higher one.
• Against our expectations, on the other hand, a lower threshold for the training images does not improve the SR quality. That is, among all bars with the same color, a lower threshold along the horizontal axis does not achieve better performance than a higher one.
In addition, the performance is also quantified by the difference between the noise-removed image and its ground-truth clean image. The image difference is, in general, evaluated by PSNR and SSIM, as mentioned in Sect. 3.3. However, a pixelwise metric such as PSNR is inappropriate for evaluating SPM images of a deformable specimen, because different images of the same specimen are mutually deformed; therefore, only SSIM is used for evaluating noise removal.
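The two metrics can be sketched as follows (a global, single-window SSIM is shown for illustration; the standard SSIM averages this quantity over local windows):

```python
import numpy as np

def psnr(h, s, peak=255.0):
    """Peak signal-to-noise ratio between images h and s (8-bit range)."""
    mse = np.mean((h.astype(np.float64) - s.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return float(10.0 * np.log10(peak ** 2 / mse))

def ssim_global(x, y, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global (whole-image) SSIM; equals 1 when x == y."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return float((2 * mx * my + c1) * (2 * cov + c2) /
                 ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))
```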
b. Deblur. The result of deblurring is considered better if the resulting image is sharper. Image sharpness is evaluated by the following Laplacian filter:

I_{3×3}(x, y) = I(x + 1, y) + I(x − 1, y) + I(x, y + 1) + I(x, y − 1) − 4 I(x, y),    (7)

where I_{3×3}(x, y) denotes the output value of the 3 × 3 Laplacian filter at (x, y) in an image. The Laplacian filter is a standard image processing filter used for edge detection. Since the Laplacian filter outputs larger values at sharper, less blurred edge pixels and smaller values at blurry pixels (as shown in Fig. 7), it can be used for evaluating how much each image is deblurred. The sharpness score, S_s, is expressed with Eq. (7) as follows:

S_s = (1 / (|X||Y|)) Σ_{x∈X} Σ_{y∈Y} |I_{3×3}(x, y)|.    (8)

The results of our method are shown in Table I as (b). For comparison, the scores of the ground-truth cleanest image are shown in Table I as (a). The visual results of our method are shown in Fig. 8. We can see that (i) the stripe noise observed in the input images is relieved in the output image and (ii) the output image gets close to the cleanest image in Fig. 8(a). In Fig. 8(b), on the other hand, our method fails to relieve the noise in the input images. This is because all of the input images are noisy; in such a case, it is impossible to select clean images from among them.
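The Laplacian filtering and a sharpness score built from it can be sketched as follows (taking the mean absolute filter response over the image interior; the exact aggregation used for the paper's score is an assumption here):

```python
import numpy as np

# 3x3 Laplacian kernel: center weight -4, 4-neighborhood weight 1.
LAPLACIAN = np.array([[0.0, 1.0, 0.0],
                      [1.0, -4.0, 1.0],
                      [0.0, 1.0, 0.0]])

def sharpness_score(image):
    """Sharpness S_s: mean absolute Laplacian response over the interior."""
    img = image.astype(np.float64)
    h, w = img.shape
    resp = np.zeros((h - 2, w - 2))
    # Correlate with the 3x3 kernel using shifted slices (no SciPy needed).
    for dy in range(3):
        for dx in range(3):
            resp += LAPLACIAN[dy, dx] * img[dy:dy + h - 2, dx:dx + w - 2]
    return float(np.abs(resp).mean())
```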
In Table I, (c), (d), and (e) verify the effect of the number of input images in our method. While our network, whose results are shown in (b), accepts seven input images in the experiments shown above, nine, four, and one image(s) are fed into our network for comparison. In Table I, (f) shows the results of our network trained only on Dataset-NL. While the number of training sets in Dataset-NL is smaller than the total number of training sets (i.e. 76 versus 143), all training images in Dataset-NL are cleaner. In Table I, (g) and (h) verify the effect of the size of the sub-images. While the default size is 64 × 64 pixels, smaller and larger sizes (i.e. 32 × 32 pixels and 96 × 96 pixels) are used for evaluation. We interpret these results in Sect. 4.

4.1. Time-lapse image SR
As shown in Fig. 6, a lower threshold T_r for the test image sets improves the SR quality, whereas a lower threshold for the training image sets does not. The former is a natural consequence, because the test images selected by the lower threshold are more similar to each other than all the test images in each set, and similar images are easier for our recurrent network to merge for improving the output quality. We expected that a lower threshold for the training images would also intrinsically improve the performance. However, the SR quality is not improved in our experiments, probably because we used only a small number of training images. Verifying this assumption requires experiments with more training images.

4.2. Time-lapse image denoising and deblurring
a. Effect of our method. While the noise score S_n of (b) is better than that of (a), as we aim at noise reduction, the sharpness score S_s of (b) is worse than that of (a). In terms of S_s, not only (b) but also all of the other results (c)-(h) are inferior to (a). This might be because the MSE loss, Eq. (1), tends to produce blurry results, as validated in the literature. 18)
In Table I(b), SSIM = 0.474 is not high enough as a score of image similarity. This might be because of large deformations between the time-lapse images. It is well known that a sharper image can be reconstructed by imposing additional constraints when training a neural network [e.g. minimizing the difference not only between images but also between deep features 51)]. However, such constraints tend to reconstruct fake details. When images are viewed for appreciation, they should be sharp (i.e. visually better), and realistic fakes are not problematic. On the other hand, such fakes are not acceptable for verifying experimental results in materials science. Due to this difference between general image reconstruction for appreciation and reconstruction for the verification of experimental results, in future work we aim at a trade-off between "noise removal and image sharpening" and "fake removal."
b. Effect of the number of input images. While the noise score S_n is best in (c) (i.e. with the maximum number of input images), the sharpness score S_s is best in (e) (i.e. with the minimum number of input images). This might be because (i) noise can be reduced better with more input images, while (ii) more input images aggravate the aforementioned blurring effect. 18)
c. Effect of noiseless training images. No significant difference is observed in the noise and sharpness scores between (b) and (f), while SSIM is relatively improved in (f). That is, SSIM (i.e. the similarity between the output image and its ground-truth image) is improved even with a small number of training images.
This might be because of overfitting, a well-known effect in which a model trained with a limited amount of training data performs well only on test data that are similar to the training data. For clearly verifying the effect of noiseless training images, experiments with more noiseless training images are required.
d. Effect of sub-image size. The noise score is better for smaller sub-images. This might be because it is more difficult to remove stripe noise (in other words, to produce a realistic noiseless image) in a larger sub-image. On the other hand, the sharpness score gets worse for both smaller and larger sub-images. Compared with the differences in the noise score, the sharpness score differs largely depending on the sub-image size. Therefore, the default sub-image size in our method is set to 64 × 64 pixels in our experiments so that the sharpness score is emphasized.

Conclusions
This paper proposed image SR, noise removal, and deblurring methods using time-lapse SPM images. These methods are designed with deep neural networks. Different from methods using a single input image, our methods employ multiple input images (i.e. time-lapse images) to produce one output image.
Future work includes more realistic image reconstruction with fewer fake details and experiments with more training images. Since it is not easy to capture a huge number of time-lapse SPM images due to the mechanical constraints of SPM, efficient time-lapse image capturing is also an important issue.

Acknowledgments
This work was partly supported by JSPS KAKENHI (Grant Nos. 19K12129 and 22H03618).