High-resolution imaging in acoustic microscopy using deep learning

Acoustic microscopy is a cutting-edge label-free imaging technology that allows us to see the surface and interior structure of industrial and biological materials. The acoustic image is created by focusing high-frequency acoustic waves on the object and then detecting the reflected signals. However, the resolution of the acoustic image is influenced by the signal-to-noise ratio, the scanning step size, and the frequency of the transducer. This paper proposes deep learning-based high-resolution imaging in acoustic microscopy. To demonstrate a fourfold resolution improvement in acoustic images, five distinct models are used: SRGAN, ESRGAN, IMDN, DBPN-RES-MR64-3, and SwinIR. The performance of the trained models is assessed by calculating the PSNR (Peak Signal to Noise Ratio) and SSIM (Structural Similarity Index) between the network-predicted and ground truth images. To prevent the models from over-fitting, transfer learning was incorporated during training. SwinIR had average SSIM and PSNR values of 0.95 and 35, respectively. The model was also evaluated using a biological sample from a reindeer antler, yielding an SSIM score of 0.88 and a PSNR score of 32.93. Our framework is relevant to a wide range of industrial applications, including electronics production, material micro-structure analysis, and biological applications in general.


Introduction
High-resolution (HR) acoustic images can be used to facilitate biomedical or materials research and to investigate, measure, or determine the mechanical or bio-mechanical properties of samples. The scanning acoustic microscope (SAM) also provides abundant quantitative information about the objects under inspection. The capabilities of SAM include noninvasive micro-structural characterization of materials, characterization of surface and subsurface mechanical properties of piezoelectric materials, structural health monitoring of composite structures, detection of surface defects on polymer circuits, and studies of anisotropic phonon propagation [1][2][3][4][5][6][7]. In HR SAM, a resolution of 200 nm has been achieved using a 4.4 GHz acoustic transducer [8]. Acoustic microscopy is applied in many areas, such as medical imaging (inspection of bones or internal structures like articular cartilage) and inspection of manufactured products like electronic chips and circuits [4,9,10]. SAM typically uses a wide range of frequencies (10 MHz to 1.2 GHz) to produce visible images of the surface or sub-surface of an object without damaging the sample. Furthermore, by thoroughly studying the internal layers and structures of these objects, defects can be analyzed and detected efficiently. The microelectronics and semiconductor industries form a demanding and highly competitive market. SAM plays a vital role in the development of improved molded flip chip packages. It is also capable of dealing with the complexity of miniaturized assemblies such as chip-scale packages and 3D IC stacks.
However, because acoustic microscopes scan objects point by point (pixel by pixel), it takes a long time to conduct a thorough scan of even a tiny sample. An HR image needs a high-frequency transducer and a smaller step size during data acquisition. Because acquiring low-resolution (LR) ultrasound data requires scanning fewer points on the object, the time required to execute such a scan is dramatically reduced. Thus, HR ultrasound images can be obtained with minimal effort by applying a super-resolution model to LR data. The patterns in naturally occurring images are quite perceptible to deep learning-based algorithms. Hence, deep neural networks are common in computer vision and are found to produce better results than shallow networks. However, when it comes to ultrasound imaging, the patterns in the data are not as obvious. Hence, we believe very deep neural networks would not help us in our purpose.
With the rise of new technologies and the ability to generate high-quality images, old methods of generating images are becoming obsolete. Hence, a large amount of research is being carried out on HR imaging in multiple domains [11][12][13][14][15]. In each of these works, the authors applied super-resolution techniques to real-world problems, especially in the biomedical field. Another such example is brain magnetic resonance imaging (MRI) [16]. Because of restrictions such as patient comfort and long sampling periods, a typical MRI picture lacks appropriate resolution. As a result, a few researchers suggested super-resolution of brain magnetic resonance images using autoencoders, an unsupervised neural network approach that involves scanning fewer points on the object, substantially reducing the time required to execute such a scan [17,18]. Another MRI study proposed edge-enhanced super-resolution generative adversarial networks (EE-SRGAN) for MRI super-resolution in the slice-select direction [19], because an LR MRI in the slice direction leads to information loss and hence improper diagnosis.
This work aims to demonstrate HR imaging from LR acoustic images using deep learning. There has been previous work on high resolution in SAM imaging using deep learning [20]. Its authors used a four-layered U-Net-inspired architecture and achieved a peak signal to noise ratio (PSNR) score of 28.4 and an NRMSE score of 0.05. Our work achieves a better PSNR score; we did not use NRMSE since it does not capture structural fidelity, which the structural similarity index (SSIM) does. Also, instead of a smooth L1 loss, a combination of pixel loss, GAN loss, and perceptual loss was used to improve the quality of the output images. In the work [21], the authors used the SR-net architecture to achieve a two-fold image super-resolution. Their architecture, however, is primarily made up of CNN layers. Some of the models we investigated use CNNs, but transformer-based approaches have recently outperformed CNNs in many cases. Our article explores the use of the recently popular transformer [22] networks, which can learn to focus on important image regions by exploring the global interactions between different regions. Due to their better performance [23,24], they can also be used for image restoration. Swin transformers [24] share the advantage of CNNs in processing images of large size because of the local attention mechanism. They also share the advantage of transformers in modeling long-range dependencies through the shifted window scheme. Figure 1 demonstrates the overall strategy applied in this paper.
The remainder of the article is organized as follows: the deep learning approach is described in section 2. Section 3 contains a brief description of the material and methods used in the experiments. The results of the approaches can be found in section 4. Section 5 concludes the article and section 6 provides a brief future direction of the presented work.

Original data-set and curation for supervised learning
The data set consists of ultrasound images of 17 different coins. Because a small data set with many similarities is unsuitable for training a DNN, and coins from the same country have similar patterns, coins from different countries were used to introduce variation into the data set. The dimension of each image is 400 × 400 pixels, and the pixel size (50 µm) is kept constant for both transducers (20 and 50 MHz). All coins have a circular shape. Each image is cropped into multiple images to create diversity and hence more robust training of the network. Cropping is done by starting from the top-left corner of the image and striding by 48 pixels in the x-direction and y-direction. Each resulting cropped image has a dimension of 120 × 120 pixels. After cropping, the final coin data set consists of 850 images, of which 800 are used for training the network and 50 are reserved for validation.
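The cropping step described above can be sketched as a sliding window. This is a minimal illustration, not the authors' code: the function name `sliding_crops` is hypothetical, the input is a random array standing in for one 400 × 400 scan, and the sketch assumes that windows which would extend past the image border are simply discarded, which the text does not specify.

```python
import numpy as np

def sliding_crops(image, crop=120, stride=48):
    """Return all crop x crop patches taken with the given stride.

    Windows start at the top-left corner; windows that would run past the
    right or bottom edge are skipped (an assumption of this sketch).
    """
    h, w = image.shape[:2]
    patches = []
    for top in range(0, h - crop + 1, stride):
        for left in range(0, w - crop + 1, stride):
            patches.append(image[top:top + crop, left:left + crop])
    return patches

# Hypothetical 400 x 400 scan standing in for one coin image.
scan = np.random.default_rng(0).random((400, 400))
patches = sliding_crops(scan)
```

Applying the same window positions to the 20 MHz and 50 MHz scans of a coin would keep each low-resolution patch aligned with its ground truth patch.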

Architecture and training parameters
SRGAN [25], ESRGAN [26], IMDN [27,28], DBPN-RES-MR64-3 [29] and SwinIR [30] models are used for both training and testing. The following section contains information on the training hyperparameters. SwinIR outperforms the others when it comes to generating HR images and comparing them to the ground truth.

Shallow feature extraction
For a low-quality input $I_{LQ}$, a 3 × 3 convolution layer $H_{SF}(\cdot)$ is used to extract the shallow feature $F_0 = H_{SF}(I_{LQ})$. This produces stable optimization and maps the input image space to a higher dimensional feature space [30].

Deep feature extraction
Following shallow feature extraction, the deep feature $F_{DF}$ is extracted using $K$ residual Swin transformer blocks (RSTB) and a 3 × 3 convolutional layer. Here, the intermediate features $F_1, F_2, \ldots, F_K$ and the output deep feature $F_{DF}$ are extracted block by block, given by

$$F_i = H_{RSTB_i}(F_{i-1}), \quad i = 1, 2, \ldots, K, \qquad F_{DF} = H_{CONV}(F_K),$$

where the $i$th RSTB and the last convolutional layer are denoted by $H_{RSTB_i}(\cdot)$ and $H_{CONV}(\cdot)$, respectively. Each RSTB divides the feature map into patches and applies self-attention and feedforward layers to them.
Each layer combines self-attention and feedforward neural networks. It starts with patch processing, resulting in $X_i$, the patch embeddings at layer $i$. Multi-head self-attention (MSA), followed by residual connections and layer normalization, produces $\mathrm{Attention}_i(X_i)$. A feedforward neural network then processes this output further to capture intricate image features, yielding the embeddings passed on to the next layer. By stacking multiple Swin Transformer blocks, deeper and more complex features are extracted from images. The convolutional layer at the end of feature extraction brings the inductive bias of the convolution operation into the network.

Image reconstruction
The high-quality image $I_{RHQ}$ is reconstructed by aggregating shallow and deep features as

$$I_{RHQ} = H_{REC}(F_0 + F_{DF}),$$

where $H_{REC}(\cdot)$ is the function of the high-quality image reconstruction module. While shallow features mainly contain low frequencies, deep features focus on recovering lost high frequencies. Due to the long skip connection, the model can transmit low-frequency information directly to the high-quality image reconstruction module. To implement the reconstruction module, a sub-pixel convolutional layer is used to upsample the feature. Also, residual learning is used to reconstruct the residual between the low-quality and high-quality image instead of the high-quality image itself, given by

$$I_{RHQ} = H_{SwinIR}(I_{LQ}) + I_{LQ},$$

where $H_{SwinIR}(\cdot)$ is the SwinIR function. The parameters of SwinIR are optimized by minimizing the $L_1$ pixel loss,

$$\mathcal{L} = \lVert I_{RHQ} - I_{HQ} \rVert_1,$$

where $I_{RHQ}$ is obtained by taking $I_{LQ}$ as the input of SwinIR, and $I_{HQ}$ is the corresponding ground truth high-quality image.

The RSTB is a residual block with Swin transformer layers (STL) and convolutional layers. For the input feature $F_{i,0}$ of the $i$th RSTB, the intermediate features $F_{i,1}, F_{i,2}, \ldots, F_{i,L}$ are extracted as

$$F_{i,j} = H_{STL_{i,j}}(F_{i,j-1}), \quad j = 1, 2, \ldots, L,$$

where $H_{STL_{i,j}}(\cdot)$ is the $j$th STL in the $i$th RSTB. The STL is based on the standard MSA of the original transformer layer. The main differences are local attention and the shifted window mechanism. For a local window feature $X$, the query, key, and value matrices $Q$, $K$, and $V$ are computed as

$$Q = X P_Q, \quad K = X P_K, \quad V = X P_V,$$

where $P_Q$, $P_K$, and $P_V$ are projection matrices that are shared across windows. Using the self-attention mechanism, the attention matrix is computed as

$$\mathrm{Attention}(Q, K, V) = \mathrm{SoftMax}\left(\frac{Q K^{T}}{\sqrt{d}} + B\right) V,$$

where $B$ is the learnable relative positional encoding and $d$ is the dimension of the query and key vectors. The attention function is performed $h$ times in parallel and the results are concatenated for MSA. Also, a multi-layer perceptron (MLP) having two fully connected layers with GELU non-linearity between them is used for further feature transformations. Before both MSA and MLP, a LayerNorm (LN) layer is added, and a residual connection is employed for both modules. Hence,

$$X = \mathrm{MSA}(\mathrm{LN}(X)) + X, \qquad X = \mathrm{MLP}(\mathrm{LN}(X)) + X.$$

There is no connection across local windows when different layers have fixed partitions. Regular and shifted window partitioning are therefore used alternately to enable cross-window connections. Here, shifted window partitioning refers to shifting the features by (⌊M/2⌋, ⌊M/2⌋) pixels before partitioning.
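As a rough illustration of the window attention and shifted partitioning just described, the following NumPy sketch computes single-head attention inside each M × M window and then forms the shifted windows. It is not the SwinIR implementation: the helper names are hypothetical, the bias `B` is a dense random matrix rather than a learned relative position table, the shift is realised as a cyclic roll, and the attention masking that usually accompanies a cyclic shift is omitted.

```python
import numpy as np

def window_partition(x, M):
    """Split an (H, W, C) feature map into non-overlapping (M*M, C) windows."""
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C).transpose(0, 2, 1, 3, 4)
    return x.reshape(-1, M * M, C)

def window_attention(X, P_Q, P_K, P_V, B):
    """Single-head SoftMax(Q K^T / sqrt(d) + B) V for one window X of shape (M*M, C)."""
    Q, K, V = X @ P_Q, X @ P_K, X @ P_V
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d) + B
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
M, C, d = 4, 8, 8                      # illustrative sizes, not the paper's settings
feat = rng.standard_normal((8, 8, C))  # toy feature map
P_Q, P_K, P_V = (rng.standard_normal((C, d)) for _ in range(3))
B = rng.standard_normal((M * M, M * M))  # stand-in for the learned positional bias

# Regular partitioning: the same projections are applied to every window.
windows = window_partition(feat, M)
out = np.stack([window_attention(w, P_Q, P_K, P_V, B) for w in windows])

# Shifted partitioning: shift features by floor(M/2) pixels before partitioning.
shifted = np.roll(feat, shift=(-(M // 2), -(M // 2)), axis=(0, 1))
shifted_windows = window_partition(shifted, M)
```

After the shift, each window straddles four of the original windows, which is what allows information to flow across window borders when regular and shifted layers alternate.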
The model architecture is given in figure 2.
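The sub-pixel convolutional upsampling used in the reconstruction module ends with a rearrangement of channels into space, often called pixel shuffle. The sketch below shows only that rearrangement, assuming a preceding convolution has already produced C·r² channels; the function name and the 30 × 30 input size are illustrative choices, with r = 4 matching the fourfold enhancement targeted in this paper.

```python
import numpy as np

def pixel_shuffle(x, r):
    """Rearrange a (C*r*r, H, W) tensor into (C, H*r, W*r).

    Output pixel (h*r + i, w*r + j) of channel c is taken from
    input channel c*r*r + i*r + j at position (h, w).
    """
    c_r2, H, W = x.shape
    C = c_r2 // (r * r)
    x = x.reshape(C, r, r, H, W)    # axes: (C, i, j, H, W)
    x = x.transpose(0, 3, 1, 4, 2)  # axes: (C, H, i, W, j)
    return x.reshape(C, H * r, W * r)

rng = np.random.default_rng(0)
r = 4
features = rng.standard_normal((1 * r * r, 30, 30))  # hypothetical conv output
upsampled = pixel_shuffle(features, r)               # one channel, 120 x 120
```

Because the upsampling is done by reshaping learned channels rather than by interpolation, the network itself decides how each low-resolution position is expanded into an r × r block.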
Finally, the SwinIR model displayed the best results regarding both PSNR and SSIM scores. The G_loss of the SwinIR training is displayed in figure 4.

Experimental setup
There are two operational modes in SAM, namely reflection and transmission mode. A detailed description of the working principles of both modes can be found elsewhere [32]. In this manuscript, we use reflection mode to scan the samples. To focus acoustic energy via a coupling medium (in this case, water), a concave spherical sapphire lens rod is generally utilized. Ultrasound signals are generated by the signal generator and transmitted toward the sample, and the reflected waves are recorded. The digitized signal from the sample is called an A-scan or amplitude scan. To acquire a C-scan of the sample, this procedure is repeated at various points in the XY plane. In other words, a C-scan can be regarded as the collection of A-scans in two dimensions.
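The relation between A-scans and a C-scan can be sketched as follows. The reduction of each A-scan to a single pixel value is an assumption of this sketch (peak absolute amplitude inside a time gate); the text does not state which reduction was used, and the function name, array sizes, and gate are all hypothetical.

```python
import numpy as np

def c_scan(a_scans, gate):
    """Collapse a (ny, nx, nt) stack of A-scans into a 2D image.

    Each pixel is the peak absolute amplitude of its A-scan within
    the sample-index window gate = (start, stop).
    """
    start, stop = gate
    return np.abs(a_scans[:, :, start:stop]).max(axis=-1)

rng = np.random.default_rng(0)
ny, nx, nt = 4, 5, 256                       # toy scan grid and record length
a_scans = 0.01 * rng.standard_normal((ny, nx, nt))
a_scans[2, 3, 100] = 1.0                     # strong echo at one scan position
image = c_scan(a_scans, gate=(80, 120))
```

Gating restricts the image to echoes from a chosen depth range, for example the top surface of a coin.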
For this experiment, the data acquisitions were performed on a custom-built SAM (figure 5), integrated with a Standa (8MTF-200-Motorized XY Microscope Stage) high-precision scanning stage controlled by a LabVIEW [33] program. A similar experimental setup was employed earlier by our group to determine and correct the inclination of a sample [34,35]. The acoustic microscopy functions were implemented using National Instruments' PXIe FPGA modules and FlexRIO hardware, enclosed in a PXIe chassis (PXIe-1082) that includes an arbitrary waveform generator (AT-1212). For acoustic imaging, the excitation signals (Mexican hat) were delivered into an RF amplifier (AMP018032-T) for further amplification before exciting the transducer [36]. The acoustic reflections caused by the transmitted signal on the surface of the sample were picked up and fed into a custom-designed amplifier. The role of this amplifier is to convert the currents into an amplified output potential. These signals were then amplified with a custom-designed pre-amplifier and digitized with a 12-bit high-speed (1.6 GS s −1 ) digitizer (NI-5772).
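For readers unfamiliar with the Mexican hat excitation mentioned above, the sketch below generates a Ricker (Mexican hat) wavelet. The peak frequency of 50 MHz is an assumption chosen to match the ground truth transducer, and the sampling rate simply reuses the 1.6 GS/s digitizer rate quoted in the text for illustration; the actual waveform generator settings are not given.

```python
import numpy as np

def mexican_hat(t, f0):
    """Ricker (Mexican hat) wavelet with peak frequency f0 in Hz, t in seconds."""
    a = (np.pi * f0 * t) ** 2
    return (1.0 - 2.0 * a) * np.exp(-a)

fs = 1.6e9                    # samples per second (illustrative, see lead-in)
f0 = 50e6                     # assumed peak frequency in Hz
t = np.arange(-64, 65) / fs   # symmetric time axis centred on zero
pulse = mexican_hat(t, f0)
```

The wavelet is short and has zero mean in the continuous limit, which makes it a compact broadband pulse.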
For ground truth, an Olympus 50 MHz focused transducer with an aperture of 6.35 mm and a focal length of 12 mm was used to scan all the coins and also the biological specimen. With the acoustic energy focused on the coin's top surface, the sample was scanned in the x and y directions with 50 µm steps. The low-resolution images, on the other hand, were acquired with a 20 MHz transducer (focal length 50 mm). All experiments were performed in distilled water, and the room temperature was kept constant at around 22 °C during the experiments. For testing the models, a biological sample was imaged in the SAM. The sample used for this experiment was a discarded reindeer antler that was collected from the jungle of Tromsø (Norway). The moss on the antler was first removed by cleaning it with lukewarm water and 96% ethanol. After cleaning, the sample was diced and boiled in distilled water at 100 °C in order to remove any unwanted biological substance from the antler. The sample was thereafter put on the sample holder and allowed to dry before being scanned.
UiT The Arctic University of Norway focuses its research on topics and concerns related to Arctic life, such as reindeer. The local Arctic tribes consider that a reindeer's antlers indicate the health and well-being of the animal. Therefore, investigating them is a long-term interest for us. To study such biological samples, the above-mentioned preparation step is generally required, which includes slicing the antler to view it under the microscope. Potentially, in the long term, our method can provide a more scientifically rooted study of this conjecture.

Results and discussion
The models correctly reproduced numerals, letters, and patterns in the coins. The SwinIR model produced the best results when compared to SRGAN, ESRGAN, DBPN-RES-MR64-3, and IMDN. It recovered letters, digits, and patterns more effectively than the other models and also has the highest PSNR and SSIM scores on the data set, with an average SSIM of 0.92 and a PSNR of 35.13. Figure 6 displays the outputs of the various models and the corresponding ground truth images.
The PSNR and SSIM scores of the images across the models are presented in tables 1 and 2 for the 10 sample inputs displayed in figure 6. The SwinIR model has the highest PSNR and SSIM scores (in bold letters) for all ten inputs. Although the DBPN-RES-MR64-3 model comes in second place, the SwinIR model's SSIM and PSNR scores are superior in all cases. We can observe from figure 6 that the SwinIR model output is the most accurate and closest to the ground truth when compared to the other models.
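For reference, the two metrics can be sketched in a few lines of NumPy. The PSNR follows its standard definition. The SSIM shown here is a deliberately simplified variant that uses whole-image statistics in a single window, so its values will differ from windowed SSIM implementations and should not be compared with the scores reported in this paper; both function names and the `data_range` convention (images scaled to [0, 1]) are assumptions of the sketch.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal to noise ratio in dB."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(data_range ** 2 / mse)

def global_ssim(x, y, data_range=1.0):
    """Simplified SSIM computed from global image statistics (one window)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

rng = np.random.default_rng(0)
truth = rng.random((120, 120))   # stand-in for a ground truth patch
pred = truth + 0.1               # stand-in for a network output with a constant error
score_psnr = psnr(truth, pred)
score_ssim = global_ssim(truth, pred)
```

PSNR depends only on the pixel-wise error, whereas SSIM compares luminance, contrast, and structure, which is why the two metrics are reported together.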

Effect of transfer learning strategy
Initially, 17 coins from different countries and with different dimensions were scanned at 20 MHz (low resolution) and 50 MHz (ground truth). The scans were then cropped into several images, eventually yielding a training data set of 800 images. When trained on this data set alone, the models were prone to overfitting and produced low PSNR and SSIM scores (7.92 and 0.064, respectively). Hence, transfer learning was used: models pre-trained on data sets such as the CelebA data set [37] and the DIV2K data set [38][39][40][41] were fine-tuned using the coin data set. This significantly increased the PSNR and SSIM scores and improved the overall results of the models to a large extent.
Figure 7 shows some more examples of outputs we got across models without using the transfer learning approach.

Comparison with conventional digital resolution enhancement techniques
Popular digital resolution enhancement techniques include nearest neighbor interpolation, bi-linear interpolation, and cubic convolution interpolation. However, when applied to images, all of these algorithms have drawbacks, and errors occur when the picture is overly enlarged. Nearest neighbor interpolation results in significant image quality loss, as well as visible mosaic and jagged artifacts. Because of the simple design of its interpolation function, bi-linear interpolation suffers from quality degradation and low calculation accuracy. Cubic convolution interpolation requires a significant amount of calculation and is also complicated and time-consuming. Interpolation-based algorithms also have issues with computational complexity, noise amplification, and blurry images. Deep learning methods, on the other hand, have advanced in recent years, and thus deep learning-based SR models are used here. These methods frequently achieve state-of-the-art performance on various resolution enhancement benchmarks.
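To make the comparison concrete, the two simplest baselines can be written directly in NumPy. This is an illustrative sketch with hypothetical function names; the bi-linear version assumes pixel-centre alignment and clamps coordinates at the borders, which is one of several common conventions.

```python
import numpy as np

def upscale_nearest(img, r):
    """Nearest neighbor upscaling: every pixel becomes an r x r block."""
    return np.repeat(np.repeat(img, r, axis=0), r, axis=1)

def upscale_bilinear(img, r):
    """Bi-linear upscaling with pixel-centre alignment and edge clamping."""
    H, W = img.shape
    ys = np.clip((np.arange(H * r) + 0.5) / r - 0.5, 0, H - 1)
    xs = np.clip((np.arange(W * r) + 0.5) / r - 0.5, 0, W - 1)
    y0 = np.floor(ys).astype(int)
    x0 = np.floor(xs).astype(int)
    y1 = np.minimum(y0 + 1, H - 1)
    x1 = np.minimum(x0 + 1, W - 1)
    wy = (ys - y0)[:, None]   # vertical interpolation weights
    wx = (xs - x0)[None, :]   # horizontal interpolation weights
    top = img[y0][:, x0] * (1 - wx) + img[y0][:, x1] * wx
    bottom = img[y1][:, x0] * (1 - wx) + img[y1][:, x1] * wx
    return top * (1 - wy) + bottom * wy

rng = np.random.default_rng(0)
low_res = rng.random((30, 30))            # stand-in for a low-resolution patch
nearest = upscale_nearest(low_res, 4)
bilinear = upscale_bilinear(low_res, 4)
```

Neither method can add detail that is absent from the input: nearest neighbor copies pixels into blocks, and bi-linear interpolation only averages neighbors, which is the source of the blockiness and blur noted above.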

Testing on the unknown biological sample (Reindeer Antler)
A biological sample (reindeer antler) of unknown size, shape, and surface morphology was employed to validate the reproducibility of the suggested model. This experiment was conducted with a discarded reindeer antler gathered from the local jungle. To remove the moss, the antler was cleaned with lukewarm water and ethanol. Following cleaning, the antler was diced and boiled at 100 °C in distilled water for 30 minutes to eliminate any undesired biological substances. The sample was then dried and mounted on the sample holder in preparation for scanning. Figure 8 shows the result obtained after testing the unknown biological sample. The original image was taken to be the ground truth, and an LR image was generated from it (figure 8). The LR image was given as input to the final SwinIR model to generate the output image. The result showed a PSNR value of 31.88 and an SSIM value of 0.8406. These values are lower than the average SSIM and PSNR scores obtained on the test data set because the model was trained on a data set consisting only of coin images. Also, it is to be noted that the image is evaluated only quantitatively.

Conclusion
In this paper, we developed an acoustic microscopy system that uses deep learning to improve the image resolution of industrial and biological samples by four times. Acoustic image acquisition was carried out on a custom-developed SAM, equipped with a high-precision scanning stage.

Future directions
Although this work has established that resolution enhancement can be used to improve the digital resolution of scanning acoustic microscopy, more work is needed to further mature our technique. According to standard signal processing theory, Nyquist sampling is mandatory in digitization, while the acoustic resolution sets the bandwidth of the measurement. While digital oversampling is routine for high-quality and high-definition imaging, it is reasonable to expect that sampling at barely the Nyquist rate and applying our technique should be able to support high-quality, high-definition imaging. However, the improvement in perceptual quality must saturate, and performing digital resolution enhancement on Nyquist-sampled images may present an optimum between the learnability of our approach and the extent of resolution enhancement sought. It is also likely that beyond a certain point, the learning becomes imprecise and the model introduces artifacts in the HR images. This aspect therefore needs to be studied extensively and benchmarked. It is further interesting to consider that deep learning can learn features from large data priors during training, which may not all be available at test time when the data are actually presented for resolution enhancement. This provides an opportunity to consider whether we can sample at a rate below the Nyquist criterion and use the pre-learned data priors in the deep-learned model to compensate for the deficiency in measurement. This possibility, however, requires careful assessment and benchmarking and constitutes a future direction. We would also like to consider the effect of the sample material and inhomogeneity on the achievable resolution enhancement, and the value of transfer learning our models to different types of samples or different scanning acoustic microscopy instruments.
It is interesting that the proposed method does not require any modification of the instrument, sample, or measurement protocol. On the one hand, our technique can therefore be directly applied to existing systems; on the other hand, the cost of a new instrument remains unchanged. It is thus of interest to evaluate the value of our technique in practical terms. Our technique can be of prime importance where time is critical and any advantage in scanning time translates to monetary or non-monetary value. Examples of such situations include material integrity or failure analysis during the process of implanting or checking medical implants, or at a site of infrastructural failure for tactical decision-making. Moreover, our technique improves the throughput of existing scanning acoustic microscopes by orders of magnitude, which in commercial settings translates to more volume or lower operating costs.
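To put rough numbers on the sampling discussion, the sketch below estimates the acoustic wavelength in water and a rule-of-thumb lateral resolution for the 50 MHz transducer. Several things here are assumptions rather than values from the paper: the speed of sound (about 1480 m/s), the approximation of lateral resolution as wavelength times F-number (the exact prefactor depends on the definition used), and the convention that a Nyquist-compliant step is half the resolution. The aperture of the 20 MHz transducer is not given in the text, so only its wavelength is computed.

```python
C_WATER = 1480.0   # m/s, assumed approximate speed of sound in water
STEP = 50e-6       # m, scan step size quoted in the text

def wavelength(freq_hz, c=C_WATER):
    """Acoustic wavelength in metres."""
    return c / freq_hz

def lateral_resolution(freq_hz, focal_length, aperture, c=C_WATER):
    """Rule-of-thumb estimate: wavelength times F-number (focal length / aperture)."""
    return wavelength(freq_hz, c) * focal_length / aperture

lambda_50 = wavelength(50e6)
lambda_20 = wavelength(20e6)
res_50 = lateral_resolution(50e6, focal_length=12e-3, aperture=6.35e-3)
nyquist_step_50 = res_50 / 2   # assumed convention: two samples per resolution element
```

Comparing `nyquist_step_50` with `STEP` indicates where the acquisition sits relative to the Nyquist criterion under these assumptions, which is the kind of benchmark the discussion above calls for.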

Figure 1. This figure depicts the overall strategy used in this paper. First, acoustic data are acquired at both high (50 MHz) and low (20 MHz) resolution. This is followed by deep learning-based high-resolution imaging. High-resolution ground truth images and the corresponding low-resolution input images are fed into the network for training. The error between the network output and the ground truth is backpropagated to train the network.

Figure 2. Diagram depicting the network architecture. This entire architecture represents the 'Network Under Training' in figure 1. The network consists of shallow feature extraction, deep feature extraction, and high-quality (HQ) image reconstruction modules. While the shallow feature extraction produces stable optimization and maps the input image space to a higher dimensional feature space, the model also has the deep feature extraction, consisting of K residual Swin transformer blocks (RSTB) and a 3 × 3 convolutional layer.

Figure 3. The evolution of PSNR and SSIM with the number of training epochs for the SwinIR model. Both the SSIM and PSNR attain maxima at 420 epochs, after which they start to decrease. The SSIM declines more rapidly than the PSNR.

Figure 4. Loss versus epoch for the SwinIR model. As the training progresses over more epochs, the loss gradually decreases.

Figure 5. Labeled image of the SAM used for image acquisition as discussed in this paper. The experimental setup shows all the fundamental components that constitute a SAM.

Figure 6. The figure depicts the results obtained after training the different models, for ten example input images. Visually, SwinIR (marked with a blue box) seems to provide the best results, and it also has the best SSIM and PSNR scores. The input images have been chosen such that they contain text, digits, or patterns. All the images are 6.4 mm × 6.4 mm.

Figure 7. The figure depicts the results obtained after training the SRGAN model without using transfer learning. The approach showed extremely poor results. The images had PSNR scores in the range of 7-8 and SSIM scores in the range of 0.03-0.07. All the individual images are 6.4 mm × 6.4 mm.

Figure 8. The figure depicts the results obtained after testing the unknown biological sample on the SwinIR model. A PSNR score of 31.88 and an SSIM score of 0.8406 were obtained on this sample. The comparatively low scores are due to the fact that the training data set consisted only of coin images, which are different from this input image. All three images have dimensions of 12 mm × 12 mm.
Deep learning was used to improve the lateral resolution of the SAM images. SRGAN, ESRGAN, IMDN, SwinIR, and DBPN-RES-MR64-3 were the models compared in this study. All five models were trained and tested on 17 different coin images, and the results were reported in terms of PSNR and SSIM scores. The SwinIR model is made up of modules for shallow feature extraction, deep feature extraction, and high-quality image reconstruction. The model's long skip connections allow it to send low-frequency information directly to the high-quality image reconstruction module. Because only 800 images were used for training, transfer learning was employed to prevent the models from overfitting. Methods that did not use transfer learning were also implemented, but their results were poor in terms of PSNR and SSIM. The SwinIR model achieved an average SSIM of 0.92 and a PSNR of 35.13. The SwinIR model was also tested on an unknown biological sample, giving an SSIM score of 0.8406 and a PSNR score of 31.88. Deep learning methods for resolution enhancement have many advantages over traditional digital resolution enhancement techniques, and the SwinIR model performed the best of the five deep learning techniques investigated in this paper. Deep learning-based models, specifically SwinIR, are found to closely approximate the ground truth image even with extremely limited training data.

Table 1. PSNR scores of the images across the various models for the ten example inputs shown in figure 6. The SwinIR model has the best PSNR scores (written in bold letters) for all ten inputs.

Table 2. SSIM scores of the images across the different models for the ten example inputs shown in figure 6. The SwinIR model has the best SSIM scores (written in bold letters) for all ten inputs.