Enhanced Global Convolution Networks for Defect Detection in Full Wavefield Imaging

Non-Destructive Evaluation (NDE) techniques based on ultrasonic guided waves can provide relevant information about the health status of an inspected structure. The analysis of wavefields recorded through Scanning Laser Doppler Vibrometers (SLDVs) has been used successfully in NDE systems to detect damage in planar isotropic materials, such as interlaminar fractures in composites. However, the high spatial point resolution required to ensure an accurate localization and quantification of defects typically implies a time-consuming acquisition process, which limits the applicability of such approaches. In this work, we exploit the potential of convolutional neural networks (CNNs) to perform segmentation on full wavefield images, obtained by reconstruction from a low-resolution grid with a spatial sampling rate below the Nyquist rate, for the purpose of detecting and localising delaminations in carbon fibre reinforced plastic (CFRP) plates. In particular, we trained an improved version of Global Convolutional Networks (GCNs) using a public dataset containing 475 simulated cases of full wavefield Lamb wave propagation in a CFRP plate, generated by an actuator with a carrier frequency of 50 kHz. We adopted channel and spatial attention mechanisms to improve the accuracy of the networks and applied our method to (i) the image given by the root mean square (RMS) value in time for each spatial position and (ii) the 3-dimensional animation representing the full wavefield propagation. Our networks accurately locate target damage with a spatial resolution 8 times higher than the dimension of the adopted sampling grid, achieving an Intersection over Union score equal to 0.78 with a number of scanning points more than 60 times lower than the number of pixels in the output segmentation mask.


Introduction
Ultrasonic Guided Wave (GW) inspection methods based on active sensing have been widely adopted in the field of Non-Destructive Evaluation (NDE). Because of their high sensitivity to damage and long propagation distances, GWs can indeed provide relevant information about the presence of defects in plate-like structures such as laminated composite materials [1; 2].
The latter are prone to damage in the form of delamination due to weak transverse tensile and interlaminar shear strength, which can alter the compression strength of plates [3]. For this reason, the detection of delaminations in their early stages has the potential to avoid structural collapses or other critical consequences. In particular, by analyzing how the waves generated by various kinds of transducers propagate over the plate, it is possible to spot the presence of potential defects [4]. In this work, we focus on a particular type of GW, the so-called Lamb waves, travelling in structures whose thickness is comparable with the wavelength; hence, they are particularly suited for the inspection of thin-walled components.
The term guided ultrasonic wavefield imaging generally refers to the acquisition of a series of images representing the time evolution of guided ultrasonic waves propagating across an inspected portion of a certain structure and, possibly, their interaction with defects [5]. The measurement of the full wavefield can be performed relying on non-contact point-by-point scanning technologies such as scanning laser Doppler vibrometers (SLDVs), which operate by measuring the out-of-plane velocity of a point addressed by a focused laser beam, using the Doppler shift between the incident light and the scattered light returning to the instrument [6].
Since the quality of the imaging greatly affects the subsequent processing, high spatial point resolution is required in order to ensure an accurate localization and characterization of defects. However, such procedures typically imply a time-consuming acquisition process, further lengthened by the extensive waveform averaging required to improve the signal-to-noise ratio (SNR) of the gathered images, a constraint which limits the applicability of the above-mentioned approaches [4; 5; 7].
To overcome these drawbacks, several methods have been proposed in the literature to obtain informative SLDV-based imaging while minimizing the number of scanning points. One of the most promising solutions builds upon Compressive Sensing (CS) theory [8], which can be exploited to reconstruct the full-resolution scan grid from a random subsampling scheme. CS relies on the assumption that a signal s ∈ R^n has a K-sparse representation α produced by a specific basis Ψ, where s = Ψα, i.e., α has K or fewer nonzero elements in such a domain of representation. If certain conditions are satisfied, s can be recovered from an acquired signal y ∈ R^m with m << n, obtained by applying to s a random subsampling operator represented by a compression matrix Ω, namely y = Ωs [4; 8]. Unfortunately, these methods require computationally onerous iterative processes to solve ill-posed problems; hence, they are often unable to accurately recover image patches where delaminations are located, i.e., where abrupt changes in the wave propagation characteristics occur [9].
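As a toy illustration of the acquisition model y = Ωs, the following sketch (an assumption-laden example, not the authors' implementation; the signal, its length and the variable names are arbitrary) builds a random row-selector matrix Ω and applies it to a signal that is sparse in the Fourier basis:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D signal s in R^n with a sparse Fourier representation
n, m = 1024, 128                               # compressed length m << n
s = np.sin(2 * np.pi * 5 * np.arange(n) / n)   # single Fourier component

# Random subsampling operator Omega: an m x n row selector
idx = np.sort(rng.choice(n, size=m, replace=False))
Omega = np.zeros((m, n))
Omega[np.arange(m), idx] = 1.0

# Compressed measurements y = Omega s
y = Omega @ s
```

Recovering s from y is then the ill-posed inverse problem that CS solvers tackle iteratively under the sparsity assumption.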
In the last few years, Deep Learning (DL) models - in particular, Convolutional Neural Networks (CNNs) or, more properly, Fully Convolutional Networks (FCNs) - have been widely adopted to address image-to-image problems such as super-resolution [10] and semantic segmentation [11], due to their ability to capture relevant information and hidden patterns across signals. Accordingly, FCNs can be exploited to detect the presence of flaws with high precision simply relying on low-resolution wavefield images obtained by scanning the inspected structure in a small number of points, resulting in a much faster acquisition procedure. Compared with conventional CS-based approaches, performing super-resolution with this kind of strategy implies no assumption on the existence of a sparse representation for the acquired signals, while the reconstruction criterion is learned directly from data in a totally data-driven paradigm.
The aim of this work is to exploit the potential of FCNs to perform automatic delamination detection in isotropic plates. Such defects are localized with a far higher resolution with respect to the number of sampling points in which the wavefield is measured. The main contributions are the following:
• We compare two different strategies for combining the multiple wavefield frames related to the same test;
• Our methods can provide an Intersection over Union score equal to 0.78 and a Dice coefficient of 0.87 with a number of scanning points more than 60 times lower than the number of pixels in the output segmentation mask.
In Section 2, data preprocessing and model architectures will be described extensively. The training dataset and the validation setup are introduced in Section 3, while Section 4 presents the achieved results. Conclusions and future outlooks end the paper.

Related works
Several works in the literature have addressed similar problems adopting different strategies and tools. Kudela et al. [13] analyzed the interaction of Lamb waves with impact-induced damage at various impact energies, studying the limitations of the wavenumber adaptive image filtering method [14; 15]. In [16], the authors developed an imaging method achieving precise localization and sizing of damage in metallic plates and composite laminates.
The speed-up of SLDV-based inspection by means of image reconstruction techniques was previously discussed by Di Ianni et al. [4], who applied CS theory to the super-resolution of full wavefield images acquired by means of an SLDV, comparing the results obtained with several projection bases. Moreover, by subtracting wavefield reconstructions produced relying on different representation domains, the authors of [9] were able to infer damage locations across inspected plates. The effectiveness of deep learning techniques in this application has been investigated by Esfandabadi et al. in [5], where a CS algorithm was combined with a Super-Resolution CNN (SRCNN) and a Very Deep Super-Resolution (VDSR) network [10] in order to further improve performance. However, in these works, the downsampling which still permits an accurate wavefield reconstruction is relatively modest (no less than 10% of the scan points are preserved).
In [3], [17] and [18], the authors addressed the problem of automatic delamination detection in CFRP plates by means of FCN architectures, but the effect of spatial downsampling was not examined in these works. Instead, super-resolution of full wavefield images relying merely on deep learning strategies was performed in [7].

General workflow
The general processing flow adopted in this work is illustrated in Figure 1. First of all, high-resolution frames representing the time evolution of the guided wavefield are undersampled on a square grid at a lower resolution with respect to the original one. Then, a non-trainable reconstruction logic is applied in order to match the spatial resolution desired in the output segmentation mask, i.e., the one associated with the original images. Bicubic interpolation has been chosen as the super-resolution method, as it combines simplicity and effectiveness. It should be noted that the undersampling operator is applied only along the space dimensions, since obtaining high time resolution is not a critical task in non-destructive testing with SLDVs. In the following discussion, the ratio between the number of pixels per image at high resolution and the one at low resolution will be referred to as the Compression Rate (CR).
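The undersampling-plus-reconstruction step can be sketched as follows. This is only a minimal illustration, assuming a 512 × 512 frame and a stride-8 grid (a 64:1 CR), with `scipy.ndimage.zoom` standing in for the bicubic interpolator:

```python
import numpy as np
from scipy.ndimage import zoom

full = np.random.rand(512, 512)   # stand-in for one high-resolution frame

step = 8                          # keep one point out of 8 along each axis
low = full[::step, ::step]        # simulated 64 x 64 SLDV scan grid

# Compression Rate: pixels at high resolution over pixels at low resolution
cr = full.size / low.size         # 64:1 in this configuration

# Bicubic interpolation (order=3 spline) back to the mask resolution
rec = zoom(low, step, order=3)
```

The reconstructed frame `rec` has the same 512 × 512 resolution as the desired segmentation mask, and is what the network receives as input.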
Two different ways of combining multiple frames related to the same test are compared, both of them previously adopted in the literature [3; 17; 18]. In particular: (i) The FCN is applied to the Root Mean Square (RMS) value over time per spatial location of the overall wavefield animation, expressed as:

W_RMS(x, y) = √( (1/N) Σ_{i=1}^{N} Ŵ_i(x, y)² )   (1)

Figure 1: Proposed workflow: after the subsampled grid is acquired, dimensions are increased in order to match the desired spatial resolution before the segmentation model is applied. The reconstruction is performed with bicubic interpolation.
where N is the total number of frames, Ŵ_i(x, y) is the reconstruction by bicubic interpolation of the original i-th frame W_i(x, y), while x and y are the spatial coordinates. Moreover, the top and bottom 1% of pixel intensities in the resulting images have been saturated in order to enhance the contrast, simplifying their visual interpretation. An example of an RMS image obtained from full-resolution frames is reported in Figure 2a, while the corresponding ground truth is shown in Figure 2b. In the next sections, we will refer to this model as the RMS model. (ii) By means of a Time Distributed layer, the network is applied to each frame of the full animation separately; each time sequence of output maps, composed of 8 frames spaced by 3 timestamps of stride (window length of 24 frames), is then processed with a convolutional Long Short-Term Memory (LSTM) [19], a well-known recurrent NN suitable for processing sequences of images, as illustrated in Figure 3. Finally, the segmentation masks computed by the 8-timestep LSTM for each j-th window, denoted with h_8 in Equation 2 and Figure 3, are combined using the RMS operator:

Y(x, y) = √( (1/M) Σ_{j=1}^{M} h_8({Ŵ_i(x, y)}_{i=j:3:j+23})² )   (2)

where M = N − 23 and each j-th window of reconstructed frames Ŵ_i(x, y) is designated with the notation {Ŵ_i(x, y)}_{i=j:3:j+23}. In the following, this model will be referred to as the Full Animation model.
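The RMS combination of frames, together with the 1% contrast saturation, can be sketched in a few lines of NumPy; the random stack below is only a stand-in for the reconstructed frames Ŵ_i(x, y):

```python
import numpy as np

# Stand-in stack of N reconstructed frames, shape (N, height, width)
frames = np.random.rand(256, 512, 512)

# RMS value over time for each spatial position
rms = np.sqrt(np.mean(frames ** 2, axis=0))

# Contrast enhancement: saturate the top and bottom 1% of intensities
lo, hi = np.percentile(rms, [1, 99])
rms_img = np.clip(rms, lo, hi)
```

The resulting `rms_img` is the single-channel input of the RMS model.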

Network architecture
The adopted model consists of a Global Convolutional Network (GCN), proposed for the first time by Peng et al. in [12]. Previous research has demonstrated the superiority of such architectures with respect to other FCNs in segmenting delaminations in CFRP plates using guided wavefield imaging [17]. The key aspect of this kind of NN is the utilization of large kernels, which make the network aware of the full context by introducing dense connections to the feature maps, providing the same benefits of cone-shaped classification models while performing segmentation tasks [12]. The overall architecture is illustrated in Figure 4. Like the most popular deep learning models aimed at this kind of application, it consists of a U-shaped encoder-decoder structure, where the former acts as a feature extractor, generating progressively higher levels of representation for the input data by exponentially increasing the number of channels and decreasing the spatial resolution, while the latter combines the global information associated with high-level features with the local information contained in low-level feature maps, which preserve precision in localization. As proposed in [17], the encoder stack is composed of residual convolutional blocks (ResBlock in Figure 4) with batch normalization, interleaved with MaxPooling layers, while the upsampling is performed in the decoder by 2D deconvolutional operators with a fixed number of channels equal to 64.
The two main building blocks introduced with this kind of network are reported in Figure 5. They consist of:
• Global Convolutional Network block (GCN): it performs convolution using large K × K kernels, where K is fixed at 15, a value close to the spatial resolution of the feature maps at the bottleneck, i.e., the highest level of representation. As in the original proposal [12], such an operation is decomposed into a sequence of K × 1 and 1 × K convolutions: adopting this method, the authors found that, with respect to the trivial solution, it is possible to obtain better results relying on a reduced number of trainable parameters. The GCN block structure is illustrated in Figure 5a; the number of output feature maps C is fixed at 64 for every instance.
• Boundary Refinement block (BR): it relies on small kernels, with size N = 3, in order to improve the segmentation performance in boundary regions, as experimentally demonstrated in [12]. The BR block architecture is shown in Figure 5b.
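A minimal Keras sketch of the two blocks, under the stated hyperparameters (C = 64, K = 15, N = 3), might look as follows. This is an illustrative reading of the block diagrams, not the authors' exact code, and omits details such as weight initialisation:

```python
from tensorflow import keras
from tensorflow.keras import layers

def gcn_block(x, C=64, K=15):
    """GCN block: a large K x K convolution factorised into a sequence of
    K x 1 and 1 x K convolutions, reducing the trainable parameters."""
    x = layers.Conv2D(C, (K, 1), padding="same")(x)
    x = layers.Conv2D(C, (1, K), padding="same")(x)
    return x

def br_block(x, N=3):
    """Boundary Refinement block: a small residual convolutional branch
    that sharpens the segmentation in boundary regions."""
    C = x.shape[-1]
    y = layers.Conv2D(C, (N, N), padding="same", activation="relu")(x)
    y = layers.Conv2D(C, (N, N), padding="same")(y)
    return layers.Add()([x, y])
```

Chaining `br_block(gcn_block(x))` reproduces the GCN-then-BR pattern used throughout the decoder stack.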

Attention modules
In addition to the standard implementation, the GCN architecture has been enhanced in this work by means of spatial and channel attention modules. The spatial attention module (SA) consists of an Attention Gate (AG) similar to the one proposed in [20]. The details of the AG are shown in Figure 6a. As can be observed, it performs a calibration of the low-level feature maps extracted by the encoder stack (X_l) via the global information contained in the high-level features produced by the decoder (X_h): this is done by multiplying point-by-point the former with a score map (attention map), which is generated by summing the input tensors after applying 1 × 1 convolutional operators with a number of output channels C equal to the one of tensor X_l; finally, the output scores are computed by means of a K × K 1-channel convolution. Differently from [20], K is not unitary and is set to 5, since this has empirically led to better results, probably because it reduces the noisiness of the attention maps. Hence, the role of the AG modules is to strengthen the contribution of feature values in the most important spatial regions of each single input image before it is processed by the GCN block.
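A possible Keras realisation of the described attention gate is sketched below. It assumes X_l and X_h have already been brought to the same spatial size, and the ReLU between the sum and the scoring convolution follows the gate of [20]; the rest is an illustrative reconstruction from the text:

```python
from tensorflow import keras
from tensorflow.keras import layers

def attention_gate(x_l, x_h, K=5):
    """Spatial attention gate: recalibrates low-level encoder features x_l
    using high-level decoder features x_h (same spatial size assumed)."""
    C = x_l.shape[-1]
    f = layers.Conv2D(C, 1)(x_l)     # 1x1 projections to C channels
    g = layers.Conv2D(C, 1)(x_h)
    a = layers.Activation("relu")(layers.Add()([f, g]))
    # K x K single-channel convolution producing the attention map
    a = layers.Conv2D(1, (K, K), padding="same", activation="sigmoid")(a)
    return x_l * a                   # point-by-point spatial weighting
```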
After concatenation, the same channel attention module (CA) adopted in [21] is inserted to weigh the importance of the different feature channels in the following layers, suppressing the redundant or noisy ones according to the input data. The CA module architecture is illustrated in Figure 6b. First, two values are attributed to each feature map by means of global MaxPooling and AveragePooling operators, obtaining two 1 × 1 × C vectors, where C is the total number of input channels. Then, these vectors are processed by the same Multilayer Perceptron (MLP) [22] with a number of hidden features equal to C/2. Finally, the corresponding outputs are summed together before applying a sigmoid activation. A residual connection has also been added, to improve convergence during training.
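The CA module described above can be sketched as follows (an illustrative implementation with a reduction ratio of 2, matching the stated C/2 hidden size; the residual connection is the one mentioned in the text):

```python
from tensorflow import keras
from tensorflow.keras import layers

def channel_attention(x, reduction=2):
    """Channel attention (after [21]): a shared MLP scores each channel
    from global max- and average-pooled descriptors."""
    C = x.shape[-1]
    shared_mlp = keras.Sequential([
        layers.Dense(C // reduction, activation="relu"),  # hidden size C/2
        layers.Dense(C),
    ])
    avg = layers.GlobalAveragePooling2D()(x)   # two 1 x 1 x C descriptors
    mxp = layers.GlobalMaxPooling2D()(x)
    w = layers.Activation("sigmoid")(shared_mlp(avg) + shared_mlp(mxp))
    w = layers.Reshape((1, 1, C))(w)
    return x + x * w                           # residual connection
```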

Dataset and model deployment
Model training and validation have been performed relying on a synthetic dataset [23] publicly available on the platform Zenodo (https://zenodo.org/record/5414555#.YkF03bMK3C), composed of 475 simulated cases of full wavefield propagation in cross-ply CFRP laminates of dimensions 500 × 500 mm. Each one of these simulations represents a delamination with a different location, shape, and size. The original resolution was set to 500 × 500 pixels, but for the sake of simplicity each image has been resized using bicubic interpolation to the nearest power of 2 before any kind of processing was applied, obtaining a set of 512 × 512 images.
The excitation consists of a toneburst sine signal modulated by a Hann window, with a carrier frequency of 50 kHz and a modulation frequency of 10 kHz. The actuation has been performed by simulating a piezoelectric transducer placed at the center of the plate. The total duration of the experiment was set to 0.75 ms, so that the guided wave can propagate to the plate edges and back to the actuator twice. In each delamination case, 512 frames were generated to visualise the propagation of Lamb waves. However, only the first 256 were used in this work, comprising a single back-and-forth propagation to the plate edges, since this subset contains all the useful information required in order to correctly detect the presence of flaws.
The shortest wavelength associated with the A0 Lamb wave mode is λ_min = 21.2 mm. According to the Nyquist theorem, the maximum permissible distance between two sampling points in ideal conditions is given by:

d_max = λ_min / 2   (3)

Since in a 2D uniform square grid the longest distance between points is on the diagonal, the number of Nyquist sampling points along each edge of the plate is approximately equal to:

n_N ≈ √2 L / d_max = 2√2 L / λ_min   (4)

where L is the length of the edge. The performances of the algorithms were evaluated at different CRs by means of two popular metrics commonly used in segmentation problems, respectively the Intersection over Union (IoU):

IoU = |Y ∩ Ŷ| / |Y ∪ Ŷ|   (5)

where Y stands for the output of the model, while Ŷ denotes the correct ground truth mask, and the Dice coefficient:

Dice = 2 |Y ∩ Ŷ| / (|Y| + |Ŷ|)   (6)

As usual, an additive smoothing term equal to 1 has been inserted at the numerator and denominator in the computation of both metrics, preventing a zero denominator. By doing so, Equations 5 and 6 are expressed as:

IoU = (|Y ∩ Ŷ| + 1) / (|Y ∪ Ŷ| + 1)   (7)

and

Dice = (2 |Y ∩ Ŷ| + 1) / (|Y| + |Ŷ| + 1)   (8)

Training has been performed differently for the two models according to their respective requirements. In the case of the RMS model, since the whole dataset is composed of only 475 elements, data augmentation techniques based on horizontal, vertical and diagonal random flips have been applied in order to increase the variability of the training images. On the other hand, the full animation model has been trained using random time windows starting from the first interaction of the guided wave with the delamination, as the previous frames do not provide any useful features, while during validation tests the overall output mask is calculated according to Equation 2 relying on the full stack of images. The first part of the animation, which lasts from the beginning of the actuation to the moment in which the wave reaches the defect, is responsible for an overall decrease of the scores in the output mask after the application of the averaging operator, since no delaminations are detectable despite being present, resulting in an unequal comparison between the two models if Equations 5 and 6 are used. For this reason, during tests, a threshold of 0.5 has been adopted after the sigmoid activation, obtaining a binary mask.

AIVELA-2023, Journal of Physics: Conference Series 2698 (2024) 012006, IOP Publishing, doi:10.1088/1742-6596/2698/1/012006
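The smoothed metrics can be written compactly in NumPy. This is a sketch for binary masks, with the additive smoothing term of 1 at numerator and denominator as described:

```python
import numpy as np

def iou_score(y_pred, y_true, smooth=1.0):
    """Smoothed Intersection over Union between binary masks."""
    inter = np.sum(y_pred * y_true)
    union = np.sum(y_pred) + np.sum(y_true) - inter
    return (inter + smooth) / (union + smooth)

def dice_score(y_pred, y_true, smooth=1.0):
    """Smoothed Dice coefficient between binary masks."""
    inter = np.sum(y_pred * y_true)
    return (2.0 * inter + smooth) / (np.sum(y_pred) + np.sum(y_true) + smooth)
```

Note that, thanks to the smoothing, two empty masks score 1.0 instead of producing a division by zero.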
Moreover, since the adoption of the full animation model is highly memory-consuming, in this second case the central symmetry of the testing setup is exploited by cropping the target images to one of the four 256 × 256 quarters, depending on which one of these contains the delamination. Then, the images are rotated in order to place the actuation point in the bottom-right corner. Accordingly, the kernel size of the GCN block has been reduced to the value of 7, since the resolution at the bottleneck of the network is now equal to 8. It is important to specify that, in the remainder of the manuscript, all reported resolutions refer to the full image and not to the cropped one.
The Dice loss, expressed by the following equation, is used as the objective function for the training processes:

L_Dice = 1 − (2 |Y ∩ Ŷ| + 1) / (|Y| + |Ŷ| + 1)   (9)

The two networks have been implemented using the Keras API (https://keras.io/). Both of them have been trained until convergence adopting the Adam optimizer [24] by means of data generators, using batches of 8 samples. The training processes have been repeated for each examined compression rate, and the number of required training batches varies accordingly. The learning rate was set to 1e-5 for the RMS model and to 1e-6 for the full animation one.
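A Dice loss consistent with the description (one minus the smoothed Dice coefficient) can be sketched in TensorFlow as follows; this is an illustrative implementation, not the authors' code:

```python
import tensorflow as tf

def dice_loss(y_true, y_pred, smooth=1.0):
    """Dice loss: one minus the smoothed Dice coefficient."""
    inter = tf.reduce_sum(y_true * y_pred)
    denom = tf.reduce_sum(y_true) + tf.reduce_sum(y_pred)
    return 1.0 - (2.0 * inter + smooth) / (denom + smooth)
```

Such a function can be passed directly as `loss=dice_loss` when compiling a Keras model.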
The validation has been performed relying on a 5-fold cross-validation with an 80-20% split, partitioning the 475 different delamination cases.

Results
The IoU and Dice scores obtained with the RMS model at 4 different CRs are reported in Table 1. As can be observed from the results, the metrics start to decrease considerably as the spatial sampling resolution falls below the number of Nyquist sampling points obtained with Equation 4 (IoU: -13.1%). For this reason, in the following, only CRs equal to 64:1 and 256:1 will be considered. Figure 7 shows an example of the output mask obtained at full resolution (Figure 7b) and at two different compression rates (64:1 in Figure 7c and 256:1 in Figure 7d), with each relative input image and the corresponding ground truth (Figure 7a). Even when the resolution is equal to 32 × 32 (CR equal to 256:1), the correct position of the delamination is clearly detectable by observing the spatial distribution of the average power of the propagating wavefield. In fact, a blacked-out region, which produces a pronounced asymmetry in the overall image, appears in the top part, where the flaw is interposed with respect to the actuator location.
Table 2 shows the results related to the full animation model. In this second case, it is possible to observe a significant decrease for a CR of 256:1 (-60.9% in IoU and -53.8% in Dice score). By looking at Figure 8, and in particular at Figure 8d, it is clear that, in these circumstances, the network is no longer able to perform an accurate recovery of the delamination shape and dimension, while the information related to the position seems to be preserved.
Finally, in Table 3, a comparison between the two adopted strategies is reported. Moreover, the same RMS model implemented without any of the attention modules (RMS without spatial and channel attention, or RMS w/o S&CA in Table 3) is evaluated in order to assess their effectiveness. As already observed, while the recurrent network processing the full animation stack obtains noticeably better results at higher resolutions, the RMS model strongly outperforms it on a 32 × 32 sampling grid. Furthermore, the adoption of spatial and channel attention improves performance in every case, at the price of a negligible increase in the total number of parameters. Last but not least, it is worth mentioning the substantial differences in the memory footprint imposed by the training processes of the two architectures. With a fixed batch size, equal to 8 in both cases, the adoption of recursion involves a considerable increase in the RAM consumption of the GPU executing the backpropagation algorithm, which is more than doubled. On the other hand, the total number of parameters is reduced because of the smaller dimension of the convolutional kernels in the GCN blocks, as explained in Section 3.

Conclusions and future works
In this work, deep learning models suited for segmentation problems have been adopted for the automatic detection of delaminations in CFRP isotropic plates by elaborating full wavefield images scanned by an SLDV. Two different kinds of networks have been considered: a feedforward architecture, taking as input the RMS value over time for each spatial location, and a recurrent architecture processing the full stack of wavefield frames.

Figure 2: RMS image related to a single delamination in a square CFRP plate with the actuator placed at its center (a) and the corresponding ground truth mask (b). The delamination position is highlighted in red.

Figure 3: Processing flow related to the full animation model.

Figure 4: Full model architecture. The spatial resolution of the feature maps and the corresponding number of channels are reported in the encoder stack.
(a) GCN block architecture with a number of channels equal to C and K × K kernels. (b) BR block architecture with N × N kernels.

Figure 5: GCN block (a) and BR block (b) architectures. C_in denotes the number of input channels.
(a) Spatial attention module. (b) Channel attention module.

Figure 6: Spatial attention (a) and channel attention (b) module architectures.

Figure 7: An example of the results obtained with the RMS model at different resolutions (b, c and d), with the corresponding ground truth mask (a).

Figure 8: An example of the results obtained with the full animation model at different resolutions (b, c and d), with the corresponding ground truth mask (a).

Table 1: Results related to the RMS model at different compression rates. Metric drops with respect to the full resolution are reported in parentheses.

Table 2: Results related to the full animation model at different compression rates.