Quantifying predictive uncertainty in damage classification for nondestructive evaluation using Bayesian approximation and deep learning

Magnetic flux leakage (MFL) is a widely used nondestructive evaluation (NDE) method for inspecting pipelines to prevent potential long-term failures. However, during field testing, uncertainties can affect the accuracy of the inspection and the decision-making process regarding damage conditions; it is therefore essential to identify and quantify these uncertainties to ensure the reliability of the inspection. This study focuses on the uncertainties that arise during the inverse NDE process from the dynamic magnetization process, which is affected by the relative motion of the MFL sensor and the material being tested. Specifically, the study investigates the uncertainties caused by sensor liftoff, which can affect the output signal of the sensing system. Because the forward uncertainty propagation process is difficult to describe explicitly, this study compares two typical machine learning (ML)-based approximate Bayesian inference methods, a convolutional neural network with dropout and a deep ensemble, to address the input uncertainty carried by the MFL response data. In addition, an autoencoder is applied to tackle the shortage of experimental training data by augmenting the dataset, which is constructed with a pre-trained model based on transfer learning. Prior knowledge learned from a large set of simulated MFL signals is used to fine-tune the autoencoder, which enhances the subsequent learning on experimental MFL data with faster generalization. The augmented data from the fine-tuned autoencoder are further applied for ML-based defect size classification. This study conducts prediction accuracy and uncertainty analysis with calibration, which evaluates the prediction performance and reveals the relation between liftoff uncertainty and prediction accuracy. Further, to strengthen the trustworthiness of the prediction results, an uncertainty-guided decision-making process is applied to provide valuable insights into the reliability of the final prediction results.
Overall, the proposed framework for uncertainty quantification offers valuable insights into the assessment of reliability in MFL-based decision-making and inverse problems.

Original Content from this work may be used under the terms of the Creative Commons Attribution 4.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

Introduction
Pipeline infrastructure plays a vital role in our society: more than 2.6 million miles of natural gas, petroleum, and hazardous liquid pipelines are operated across all states, according to the Pipeline and Hazardous Materials Safety Administration (PHMSA) of the United States Department of Transportation (DOT) [1]. Pipelines are vulnerable to damage from various factors such as internal and external corrosion, cracks, and manufacturing defects, and these issues can lead to severe leakages and significant safety concerns. Given the potential for significant leakage and the safety risks arising from pipeline damage, it is imperative to have a dependable assessment scheme in place for monitoring and maintaining this critical infrastructure. Cracks, which are caused by different mechanisms, can appear in a pipeline at any stage during manufacturing, installation, or service. The shape of a crack is highly variable, especially in the pipe body and seam welds. Cracks are initially small, but with inadequate inspection they may develop into larger cracks that cause ultimate structural failure [2].
In the field of nondestructive evaluation (NDE), the inverse process refers to obtaining quantitative estimations and detailed insights into the specific nature and properties of defects [3]. The accurate assessment of cracks contributes to monitoring crack growth and thus ensuring the reliability, safety, and efficiency of various applications. Therefore, various NDE methods have been applied to detect and characterize crack size, shape, and orientation [4]. Ultrasonic testing methods are considered suitable for structural integrity monitoring systems and for characterizing surface cracks by applying Rayleigh waves and acoustic emission [5][6][7]. In addition, a variety of electromagnetic (EM) techniques, such as eddy current testing, microwave testing, and magnetic flux leakage (MFL), are highly developed for identifying and detecting surface and subsurface cracks in metals and polymers based on EM principles [7][8][9][10][11]. In MFL-based pipe inspection, much effort has been devoted to the metal loss defect inversion problem, which characterizes length, width, and wall loss (% WL) from the measured three-axis MFL signals in terms of magnetic flux density [12][13][14]. For example, a set of multiple defects has been recovered from MFL signals with a Gaussian process model [15]. Li et al applied a highly sensitive magnetic field sensor with extensive finite element analysis to improve the defect characterization capabilities of existing MFL systems for defects of irregular geometries [16]. In [17], a fast defect reconstruction framework from MFL signals is proposed by combining key physics-based parameters and a least-squares support vector machine. Typically, metal cracking is affected by applied stress, material defects, and operating conditions, while corrosion is one of the biggest contributors to leaks and ruptures of pipelines. Corrosion allowance is an important inspection criterion for pipelines, which represents
the available corrosion/crack depth of material in a pipe or vessel without affecting the pressure-containing integrity [18][19][20]. The crack depth is an important index to evaluate structural integrity and verify the structure's durability performance. In ISO 21457, an internal corrosion allowance of 1 mm-1.5 mm is specified for non-corrosive service, while 3 mm is specified for mild service and 6 mm for severe service [21].
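As a rough illustration of how these allowance values could be turned into a depth check, the following sketch maps a measured depth to the service band it still satisfies. The function name, the string labels, and the band-style interpretation are our own illustrative assumptions, not part of ISO 21457:

```python
def severity_class(depth_mm):
    """Illustrative mapping of a crack/corrosion depth (mm) to the
    allowance bands quoted above: 1-1.5 mm (non-corrosive service),
    3 mm (mild service), 6 mm (severe service). Hypothetical helper,
    not an ISO-defined procedure."""
    if depth_mm <= 1.5:
        return "within non-corrosive allowance"
    elif depth_mm <= 3.0:
        return "within mild-service allowance"
    elif depth_mm <= 6.0:
        return "within severe-service allowance"
    return "exceeds all allowances"
```

In practice such thresholds would be applied per service type rather than as cumulative bands; the sketch only shows the numeric comparison.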
However, for the widely used MFL inspection method, the collected signal quality is greatly affected by various uncertainty factors, such as material properties, inspection variations (sensor liftoff, scanning speed, etc), shape irregularity, and noise, which further affect prediction and integrity assessment capabilities. Therefore, uncertainty analysis is critical to assess the reliability of results in NDE and materials state awareness (MSA). These measurement-associated uncertainties can be systematic or random, and they reflect the influence of these factors on the confidence level of the NDE and MSA results [22]. A quantified uncertainty estimate in NDE is therefore indispensable for ensuring inspection quality. Deep learning (DL) is an ideal technique for scaling to massive NDE data while remaining tractable without prior physics knowledge [23][24][25]. However, standard DL methods are not able to measure the reliability of a prediction, whereas Bayesian-based methods can estimate uncertainty by introducing Bayesian posterior inference over the neural network (NN) parameters [26]. Alternatively, without relying on Bayes' theorem, Lakshminarayanan et al showed that deep ensembles (DEs) can provide a simple and scalable predictive uncertainty estimation [27]. The probability output obtained from ensembles is proven efficient in estimating the uncertainty of classification and regression predictive modeling problems. Both methods have shown great advances in uncertainty estimation.
In this NDE damage characterization problem, the severity of crack depth is crucial for ensuring the structural integrity of the material. In our previous work, the efficiency of machine learning (ML) was validated by quantifying the uncertainty in a numerical-simulation-based MFL problem [28]. This study extends that work by investigating the impact of the liftoff distance on the classification of crack depth and validates it using extensive experimental studies. Due to limitations in the size of the experimental dataset, an autoencoder network was created to enlarge it. To reduce the computational costs of data augmentation, we employ transfer learning: the original NN model is first fine-tuned using simulated MFL data, and the pre-trained encoder is then applied with the decoder to generate a new experimental dataset. Next, to address the associated uncertainties in defect classification, an approximate Bayesian inference-based NN and a DE are used to compare the results and analyze the uncertainty associated with each method. The classification performance is shown to be related to the uncertainties embedded in the MFL data. Also, the uncertainties from each method are well defined and compared, which further benefits the assessment of MFL inspection quality. The remaining sections of the paper are organized as follows: section 2 provides background on MFL inspection and the sources of uncertainty. Section 3 explains the Bayesian approximation theory, which supports the use of convolutional NNs (CNNs) and DE for defect classification and uncertainty estimation. The autoencoder network is proposed in section 4 to address the problem of insufficient experimental data using transfer learning concepts. The augmentation and classification performance, with uncertainty estimation, are evaluated in section 5, where the process of using the proposed uncertainty feature index to guide and evaluate the final classification decision is also explained. An overview of the paper's structure is presented in figure 1.

MFL principle and simulation
MFL is a magnetic NDE method that utilizes a powerful magnet to magnetize a conductive material under test. When there are defects, such as corrosion or material loss, the magnetic field 'leaks' from ferromagnetic materials such as steel and cast iron. MFL probes contain a magnetic detector positioned between the poles of the magnet, which can detect the leakage field. During the inspection, a magnetic circuit is formed between the probe and the material under test (MUT), here the piping material. The magnetic field induced in the MUT saturates it until it can no longer hold any more flux. The flux overflows and leaks out of the pipe wall, and strategically placed sensors can accurately measure the three-dimensional (3D) vector of the leakage field [29]. Therefore, MFL signals have been widely applied to offer useful insights into detecting, identifying, and understanding the nature of metal loss defects [30][31][32].
For understanding MFL physics, Maxwell's equations are used to simulate and analyze both the electric and magnetic fields in MFL systems. In the simulation study, the magnetic field is generated using permanent magnets. To simulate this, the simplified Maxwell's equations governing magnetostatic phenomena are utilized, as demonstrated below:

∇ × (μ⁻¹ ∇ × A) = J,  B = ∇ × A,
where μ, A, J, and B represent the magnetic permeability constant, magnetic vector potential, equivalent current density of the permanent magnet, and magnetic flux density vector, respectively. The field equations are supplemented by a constitutive relation that describes the behavior of EM materials. In the permanent magnet region, this relation takes into account the magnetic properties of the material and how they contribute to the overall magnetic field:

B = μ0 (H + M0),
where M0 denotes the permanent intrinsic magnetization vector. The other regions are governed by the linear relation B = μ0 μr H. When it comes to practical calculations, obtaining a direct solution for the above EM model can be quite challenging. As a result, a numerical technique known as the finite element model (FEM) is usually employed to calculate the magnetic flux density distribution for the system. A typical FEM model, as illustrated in [31], is built with COMSOL software to generate simulation data. The parameters of the applied simulation model are described in table 1. Specifically, the magnetic circuit in this model consists of several components: a yoke, magnets, brushes, and a specimen with a rectangular defect positioned at its center. The two permanent magnets are constructed from NdFeB material, serving as the source of magnetic flux induction. Both the yoke and brushes are made from mild steel, which has a relative permeability of 186 000. The specimen itself is composed of stainless steel 416. In solving the FEM model, the magnetization clearance (the clearance between brush and specimen) is set equal to the sensor liftoff. The most fundamental 3D element is a tetrahedron; to ensure precise results, the elements close to the flaw have been refined. To enrich the variability of the simulated MFL data, the influence of eddy currents under different moving speeds of 3, 5, and 7 m s−1 is considered. Besides, under each velocity scenario, the impact of uncertainty brought by different liftoff distances ranging from 0.2 mm to 1 mm is considered. Under these variations, it is feasible to conduct simulated magnetic field measurements for defect depth detection.
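The simulation variations described above (three scan speeds, liftoff distances from 0.2 mm to 1 mm) can be enumerated as a parameter grid for the FEM runs. The sketch below is our own illustration; in particular, the 0.2 mm liftoff step is an assumption, since the paper only states the range:

```python
from itertools import product

# Values quoted in the text; the grid construction itself is a sketch.
speeds_m_per_s = [3.0, 5.0, 7.0]
liftoffs_mm = [0.2, 0.4, 0.6, 0.8, 1.0]  # assumed 0.2 mm step within 0.2-1 mm

# One simulation case per (speed, liftoff) combination, per defect geometry.
cases = [{"speed": v, "liftoff": l} for v, l in product(speeds_m_per_s, liftoffs_mm)]
print(len(cases))  # 15 FEM runs per defect geometry
```

Sweeping such a grid is what makes it possible to expose the learning models to liftoff and velocity variability during training.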

Uncertainty sources of MFL
The process of uncertainty quantification (UQ) entails recognizing and classifying the different types of uncertainty that can impact prognostics and sensitivity analysis. This is a critical stage in ensuring that models and simulations accurately account for these uncertainties. Broadly, uncertainties are categorized into two main types: aleatoric (statistical) uncertainty and epistemic (systematic) uncertainty [33][34][35]. Aleatoric uncertainty pertains to the variability and inherent uncertainty that arises from the natural variability of the physical system. Epistemic uncertainty, on the other hand, is induced by a lack of knowledge or understanding of the system or the process involved; it is considered a systematic or modeling uncertainty that can be reduced by acquiring more information.
To develop a reliable NDE-based inspection system, it is essential to have a comprehensive knowledge of the system and a precise understanding of influential parameters and their impact. The NDE system is characterized using the forward and inverse modeling process, as explained in [36]. During the forward stage of this process, variations in the geometric parameters (e.g. defect size and shape) and material properties (e.g. hardness and strength) are considered as the aleatoric uncertainty in applications related to material characterization and damage detection [37][38][39][40]. On the other hand, the processing parameters related to simulation (e.g. mesh parameters, boundary conditions) and experimental testing (e.g. setup process, experimental noise) are considered as epistemic uncertainty [28, 41, 42]. The variability mentioned above is considered the input uncertainty, which will be integrated into the subsequent inversion stage. In this stage, modeling and analysis are used to derive the predicted parameters that describe the system based on observed measurements or simulated output from the forward procedure [43]. During the inversion process, epistemic uncertainty is introduced, which is related to the learning model parameters and the model itself.
During MFL in-line inspections, surface irregularities such as changes in coating thickness, welds, or hardness deposits can cause fluctuations in the liftoff distance, adding complexity to the inspection process. These fluctuations affect the amplitude of MFL signals, which can alter the detection sensitivity [44]. Thus, the liftoff distance is a crucial uncertainty factor to investigate for MFL inspection. Other uncertainty factors are also considered, such as sensor velocity [45], defect size and shape [31], and microstructural changes and mechanical properties [46]. Additionally, during the inversion process, the results of NDE field inspections are often sensitive to environmental conditions and signal processing methods [47]. In this study, uncertainties from the liftoff are investigated through the inverse process for defect classification, during which the uncertainty from the MFL data and the ML model is quantified with the approximate Bayesian inference modeling process.

Learning-based uncertainty estimation in MFL
ML techniques have made significant progress in aiding NDE decisions by utilizing comprehensive NDE data, since they do not rely on the development of models based on complicated physics knowledge. However, because NDE data can be complicated, massive, inconsistent, and noisy, it is crucial to develop UQ techniques together with appropriate ML models to efficiently handle the uncertainty existing in the system and improve the safety of the inspection system. A good predictive uncertainty score can measure the reliability of the model's prediction, which serves as a sound basis for assessing the model's performance. The predictive probability obtained from deep learning assists in the probabilistic interpretation of predictions and in quantifying their uncertainties to accomplish statistical inference [48].

Bayesian inference for uncertainty estimation
Bayesian theory is considered the primary approach to address uncertainties through the 'learning' model, which aims to comprehend and describe uncertainty in the inverse solution based on observed data and other sources of information (e.g. prior distributions). As a result, probabilistic predictions can be made under the addressed uncertainties to assist in optimizing experimental design.
Bayes' theorem is applied to infer a parameter from observed training input, which is generated from a probability distribution depending on an unknown parameter ω. In this application, the posterior distribution p(ω | X_mfl, D) describes the relationship between the input MFL image data X_mfl and their associated defect classes D, D ∈ {1, 2, 3}, and the uncertainty in this process can be obtained from the variance of the predictive posterior distribution. By Bayes' theorem,

p(ω | X_mfl, D) = p(D | X_mfl, ω) p(ω) / p(D | X_mfl),

in which p(D | X_mfl, ω) is the likelihood of the model, denoting the probability distribution of the observed classes given X_mfl and the parameter ω, and p(ω) serves as the prior on ω describing the learning model, independent of any observation. Both the prior and the likelihood are considered known parts of the assumed model in the modeling of p(ω | X_mfl, D). The predictive distribution of the output class d* for a new MFL input image x* can then be expanded as

p(d* | x*, X_mfl, D) = ∫ p(d* | x*, ω) p(ω | X_mfl, D) dω.

As learning this posterior distribution is intractable in higher dimensions, approximation techniques are used to fit the true posterior with a tractable approximating variational distribution q_θ(ω) parametrized by variational parameters θ. Kullback-Leibler (KL) divergence is a universal index to measure the closeness between q_θ(ω) and the true posterior, and the optimal θ is chosen to minimize the KL divergence:

θ* = argmin_θ KL(q_θ(ω) ∥ p(ω | X_mfl, D)).

Moreover, KL divergence minimization can be further represented as the maximization of the evidence lower bound [50]:

L(θ) = ∫ q_θ(ω) log p(D | X_mfl, ω) dω − KL(q_θ(ω) ∥ p(ω)).

This process provides a good basis for developing approximate Bayesian inference techniques, known as variational inference (VI). Practically, several Bayesian
approximation techniques have been applied to constitute a possible solution to overcome the computational difficulties related to Bayesian inference [49]. Markov chain Monte Carlo (MCMC) has been used to approximate inference by drawing random samples from a probability distribution without an exact assumed model. By making repeated stochastic transitions, the corresponding outputs converge in distribution to the true posterior. The efficiency of MCMC-based uncertainty propagation methods has been validated in various applications, such as NDE-based structural health monitoring [51], damage detection [52], flaw characterization [53], and image reconstruction [54]. Another alternative approach is to utilize VI techniques, which aim to identify the best approximation of a complex target probability distribution from a given family of models. VI involves an optimization problem that seeks the optimal approximation of a distribution within a parametrized family, achieving a balance between the family function and the approximation output so that the approximation is both efficient and of high quality. Because of this optimization process, VI is suitable for handling large-scale problems with high efficiency in applications related to Bayesian NNs and their variants [55][56][57]. Besides, Gal and Ghahramani have proved that the dropout function applied before the weighted layers is able to approximate VI [58]. The regularization term with weight decay in the dropout process is conceived as the optimization process. Several studies have quantified uncertainty with dropout and its variants in image segmentation [59], defect detection [60], remaining life prediction [61], etc. Moreover, in [27], ensemble learning is demonstrated as an alternative technique to dropout, where the optimal final predictions are determined by considering the outputs of multiple NNs, also known as ensembles. Ensemble learning techniques are able to provide a more
simple and scalable predictive uncertainty estimation in various applications, such as defect estimation and localization [62] and pattern recognition [63]. Beyond the above-mentioned methods, ML techniques such as Bayesian active learning [64], Laplace approximations [65], and Bayes by Backprop [66] are also able to estimate uncertainty for various applications.
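As a minimal illustration of the KL criterion that VI minimizes, the sketch below computes KL(q ∥ p) for two categorical distributions (the distribution values are hypothetical; real VI minimizes this over network-weight distributions, not three-class vectors):

```python
import numpy as np

def kl_categorical(q, p, eps=1e-12):
    """KL(q || p) between two categorical distributions: the closeness
    measure minimized in variational inference. eps guards log(0)."""
    q = np.asarray(q, dtype=float)
    p = np.asarray(p, dtype=float)
    return float(np.sum(q * (np.log(q + eps) - np.log(p + eps))))

# KL is zero when q matches p and grows as q diverges from p.
print(kl_categorical([0.2, 0.3, 0.5], [0.2, 0.3, 0.5]))
print(kl_categorical([0.9, 0.05, 0.05], [1/3, 1/3, 1/3]))
```

The same asymmetric divergence, applied to q_θ(ω) and the true posterior, drives the choice of the variational parameters θ.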
As discussed before, sensor liftoff is the main aleatoric uncertainty source in this experimental MFL inspection. To investigate how liftoff affects the inspection performance, two typical deep learning methods, CNN with dropout and DE, are proposed and compared to estimate predictive uncertainty as realizations of approximate Bayesian inference.

CNN with dropout
In the realm of ML-based networks, the dropout technique, which randomly discards some of the model units during training, is not only effective in avoiding overfitting but can also serve as an approximation of the Bayesian process [58]. Specifically, for a network with L layers, the L2 regularization term with weight decay λ presents the optimization objective of dropout:

L_dropout = (1/K) Σ_{i=1}^{K} E(d_i, d̂_i) + λ Σ_{m=1}^{L} (∥W_m∥² + ∥b_m∥²),

where K is the number of input MFL samples and L is the number of network layers; d̂_i and d_i are the predictive probability and true class, respectively, of input x_i (i = 1, ..., K); and E(·) is the applied loss function with weight matrices W_m and bias vectors b_m for each layer m (m = 1, ..., L). Several different dropout rules can be applied for different networks [67], which may result in different unit-drop probabilities for each layer, such as Bernoulli dropout, Gaussian dropout, and Bernoulli DropConnect. For CNN models, the efficiency of Bernoulli and Gaussian dropout has been demonstrated in [67]; therefore, in this CNN-based application, the conventional Bernoulli dropout is applied, sampling each unit output with a fixed probability. Besides, since this is a classification problem, the predictive probability P(d | x, ω) is a categorical distribution corresponding to the softmax likelihood [68]:

P(d = c | x, ω) = exp(f_c^ω(x)) / Σ_{c′} exp(f_{c′}^ω(x)),

where f^ω(x) represents the NN output. As proven in [31, 69], the sampling process from q_θ(ω) is the same as the dropout operation, so the Bayesian inference objective L addressed in equation (8) can be approximated as L_dropout. Therefore, a CNN model with dropout is applied as the realization of the approximate Bayesian learning process for addressing the uncertainty in this defect classification application. The detailed learning process of the proposed CNN model is shown in figure 2, where convolutional layers with max pooling are used to extract features from the MFL input
image. The following fully connected layers with the dropout layer are employed to combine the extracted high-level features for classification purposes. The outputs from these layers are then passed through the softmax activation function, which assigns probabilities to each class label. To obtain the posterior probability distribution for further UQ, T repeated predictions are made for each MFL sample. During the model learning process, all parameters are simultaneously optimized by minimizing the misclassification error, thus providing a reliable output probability distribution for the subsequent uncertainty estimation. Therefore, for classification-based NDE inverse problems, this approximation process has great potential to serve as a universal and effective approach to address associated uncertainties within ML frameworks, leading to improved decision-making and risk mitigation in various NDE applications. For obtaining the aleatoric and epistemic uncertainty in this work, as presented in [31, 69], the uncertainty is equivalent to the variance of the prediction probability of the network. Decomposing the prediction variance leads to a meaningful interpretation of the uncertainty: the aleatoric part represents the randomness of the predicted defect class D, while the epistemic part represents the variability coming from the proposed CNN model. Given a new MFL input image x*, T predictions are made to generate the corresponding predictive probabilities d̂*_t (t = 1, ..., T). Equation (11) introduces the correlation between the variance of the prediction probability and the uncertainty, which represents the total prediction uncertainty comprising the aleatoric and epistemic parts:

Var(d*) ≈ (1/T) Σ_{t=1}^{T} [diag(d̂*_t) − d̂*_t (d̂*_t)ᵀ] + (1/T) Σ_{t=1}^{T} (d̂*_t − d̄*)(d̂*_t − d̄*)ᵀ,

where diag(·) denotes the diagonal matrix and d̄* = (1/T) Σ_t d̂*_t. Since the softmax output is one-hot coded in expectation, the square of E_p[d*] in the variance can be simplified as diag(E_p[d*]). As addressed in [69], the first term is an expectation over q_θ that captures the inherent randomness of the output defect classes, while the second term is related only to the network weight parameter ω. Therefore, equation (11) can be split into the aleatoric part A_t and epistemic part E_t of d̂*_t:

A_t = (1/T) Σ_{t=1}^{T} [diag(d̂*_t) − d̂*_t (d̂*_t)ᵀ],  E_t = (1/T) Σ_{t=1}^{T} (d̂*_t − d̄*)(d̂*_t − d̄*)ᵀ.

With increasing prediction repetition T, the sum of A_t and E_t converges in probability to the system variance. Specifically, the aleatoric uncertainty is attributed to the liftoff variance, while the epistemic uncertainty is related to the parameters of the proposed model. Therefore, for evaluating the aleatoric and epistemic uncertainty in this application, the dropout layer is adopted with the softmax activation function to generate the prediction. During the prediction stage, each testing MFL image is predicted T = 10 times to derive the variability distribution of the output. As a result, for each testing MFL sample there are ten aleatoric uncertainty results and ten epistemic uncertainty results, which provide a manageable distribution to describe both types of uncertainty.
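The aleatoric/epistemic split of the predictive variance can be sketched in numpy for T stochastic softmax outputs. This is an illustrative implementation of the variance decomposition attributed to [69], not the paper's own code; the Dirichlet samples below merely stand in for T = 10 dropout forward passes over C = 3 defect-depth classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def decompose_uncertainty(probs):
    """Split the predictive variance of T softmax outputs into aleatoric
    and epistemic covariance matrices. probs: (T, C) array where each
    row is one stochastic (dropout) softmax prediction."""
    p_bar = probs.mean(axis=0)  # mean prediction over the T passes
    # Aleatoric: average per-pass categorical covariance diag(p) - p p^T.
    aleatoric = np.mean([np.diag(p) - np.outer(p, p) for p in probs], axis=0)
    # Epistemic: spread of the per-pass predictions around their mean.
    epistemic = np.mean([np.outer(p - p_bar, p - p_bar) for p in probs], axis=0)
    return aleatoric, epistemic

# Stand-in for T = 10 dropout predictions of one MFL test image.
probs = rng.dirichlet([8, 2, 1], size=10)
A, E = decompose_uncertainty(probs)
total = A + E  # total predictive covariance; diagonal gives per-class variance
```

A useful sanity check: the diagonal of A + E equals p̄(1 − p̄) for the mean prediction p̄, i.e. the variance of a categorical variable with those mean probabilities.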

DE
Different from conventional deep learning models, which require a significant amount of computation and data, DE is able to effectively integrate the complementary information of multiple algorithms into a cohesive framework to enhance model performance [70]. By training multiple models and combining their predictions, ensemble learning can not only reduce the variance of predictions but also produce predictions that are better than those of any single model. DEs are considered a more straightforward alternative to traditional Bayesian NNs, as they are easier to implement, with simple hyperparameter tuning procedures [27]. Furthermore, DE offers the advantage of interpretable uncertainty estimates, which are essential for gaining insights into model behavior. By ensembling multiple models with diverse parameter settings, we can encompass a wide range of potential predictions and their corresponding uncertainties. This approach proves especially valuable in scenarios where the model is uncertain in its predictions. For example, in tasks like image classification or object detection, where a single input may have multiple plausible labels, an ensemble can capture this diversity, leading to a more comprehensive understanding of the data [71][72][73].
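The variance-reduction effect of averaging independent models can be demonstrated with a toy numpy sketch (the Gaussian noise model is purely illustrative and not how real ensemble members differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# M independent noisy "models" each estimate the same class probability 0.7;
# the ensemble mean should have roughly 1/M of a single model's variance.
M, trials = 10, 5000
single = 0.7 + 0.1 * rng.standard_normal(trials)           # one model's estimates
ensemble = 0.7 + 0.1 * rng.standard_normal((trials, M))    # M independent models
ens_mean = ensemble.mean(axis=1)                           # ensemble prediction

print(single.var(), ens_mean.var())  # the second is ~10x smaller
```

Real ensemble members are not fully independent, so the reduction in practice is smaller, but the direction of the effect is the same.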
Boosting and bagging are two popular ensemble learning techniques used for UQ [70]. Boosting involves a sequential combination of multiple weak learners and tends to focus more on reducing bias and improving prediction accuracy than on explicitly quantifying uncertainty [74]. In boosting, the focus is on the instances in the training data that are misclassified by the previous models; these instances are given more weight in the training of the next model. Boosting aims to reduce bias and improve model accuracy, and it can lead to more complex and expressive models as the sequence progresses [75]. An example of boosting in DEs is AdaBoost, which adapts the weight of each instance to emphasize the samples that are difficult to classify. In contrast, bagging forms an ensemble of models by independently training them on different bootstrap samples from the original dataset. In the context of DEs, bagging involves training multiple deep NNs (DNNs) independently on different random subsets of the training data, which provides a natural way to capture and estimate the variability in the final predictions [76]. Each DNN is trained on a bootstrap sample, a random subset of the training dataset drawn with replacement; this means that some data points may be duplicated in the subset, while others may be omitted. Bagging aims to reduce the variance of the model by generating diverse models and averaging their predictions, making the model more robust and less prone to overfitting [77]. An example of bagging in DEs is the random forest, an ensemble of decision trees.
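Bootstrap sampling with replacement, as described above, can be sketched in a few lines of numpy (the dataset size and seed are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

n = 50
indices = np.arange(n)

# A bootstrap sample draws n indices *with replacement*: some rows appear
# more than once, others are left out entirely ("out-of-bag"). Real bagging
# would train one model per such sample; this sketch only builds the sample.
boot = rng.choice(indices, size=n, replace=True)
oob = np.setdiff1d(indices, boot)  # rows never drawn into this sample

print(len(np.unique(boot)), len(oob))
```

On average, a bootstrap sample contains about 63% of the distinct rows, and the out-of-bag remainder is often used for honest performance estimates.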
For the classification problem in DEs [27], the basic procedure is: (1) the NNs parametrize a predictive distribution using softmax; (2) cross-entropy is applied as the scoring rule; (3) the training data are augmented for adversarial training to enhance network robustness (optional); (4) an ensemble of M networks is trained with random initialization; (5) predictions are combined at test time. In our work, we follow a similar idea: a random forest algorithm is built with a randomization procedure (typically, resampling methods) for estimating the predictive uncertainty in the system. Random forest guarantees that the behavior of each individual tree in the model is not strongly correlated with the behavior of any other tree, which helps improve the diversity and robustness of the ensemble. Since the aleatoric uncertainty brought by the MFL signal is the focus of this work, the model uncertainty that comes from the hyperparameters is reduced by fixing the number of subtrees (M) to ten. First, to achieve random initialization, we use k-fold cross-validation with a specific number of repetitions (r). This technique divides the training and testing data into k folds using a uniform probability distribution and randomized subsamples as the new training data, ensuring unbiased performance estimation. Unlike random train-test splits, where a given example may be used to evaluate a model many times, this method is less biased because each example in the dataset is used only once in the test dataset to estimate model performance. Besides, to address the limitation of k-fold cross-validation, where the models tend to be highly similar in subsequent ensemble learning, random forest employs a bootstrapping technique to create a different sub-dataset for each tree, which involves selecting examples randomly with replacement. Replacement refers to the practice of metaphorically returning the same example to the pool of candidate rows. This means that a specific example can be
selected again, possibly multiple times, within a single sample drawn from the training dataset. Specifically, a set of decision trees is trained from randomly selected subsets of the new training data, which helps to reduce the correlation among the prediction results of the subtrees. Each decision tree is then grown using only a random subset of features at each split. This diversity enhances the model's performance. Finally, the random forest averages the output of each decision tree to determine the final result. Specifically, k = 10 in the DE model: the first nine folds are used to train a model, the held-out fold is used as the test set, and each fold is given an opportunity to serve as the holdout test set. In total, ten models are fit and evaluated with three repetitions (r = 3), and the final performance of the model is calculated as the mean of these runs. The training process of the random forest-based DE is illustrated in algorithm 1; its final step is: (3) obtain and combine the calibrated predictive distribution $P_{ckrm}(y|x)$ of the test data and evaluate it with the scoring rule.
end for; end for. Combine and average the calibrated prediction probabilities to estimate the final predictive distribution of each class n of the model across all repeats and folds: $P_{ckrn}(y|x) = M^{-1}\sum_{m=1}^{M} P_{ckrm}(y|x)$, with $M = 10$. end for; return $P_{ckrn}$ ▷ the goal
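The repeated k-fold procedure around a calibrated random forest can be sketched as follows with scikit-learn; the synthetic three-class data here merely stand in for the MFL features, and the exact feature dimensions are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.metrics import accuracy_score

# Synthetic stand-in for the MFL feature data (3 defect classes).
X, y = make_classification(n_samples=300, n_features=20, n_informative=10,
                           n_classes=3, random_state=0)

# Repeated k-fold: k = 10 folds, r = 3 repetitions, as in the text.
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=3, random_state=0)

scores = []
for train_idx, test_idx in cv.split(X, y):
    # M = 10 bootstrap-trained subtrees, wrapped in a sigmoid (Platt) calibrator.
    rf = RandomForestClassifier(n_estimators=10, bootstrap=True, random_state=0)
    model = CalibratedClassifierCV(rf, method="sigmoid", cv=3)
    model.fit(X[train_idx], y[train_idx])
    scores.append(accuracy_score(y[test_idx], model.predict(X[test_idx])))

# Final performance = mean over all folds and repeats (10 x 3 = 30 runs).
mean_acc = float(np.mean(scores))
```

Each of the 30 runs fits a fresh calibrated forest, mirroring the per-fold, per-repeat training loop of algorithm 1.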

Predictive uncertainty estimation scoring index
Scoring rules are essential to evaluate the quality of predictive uncertainty, which is realized through a proper loss function in ML models. The training criterion of both CNN and DE is to minimize the cross-entropy loss, optimizing the predictive model to find the optimal model parameter ω. Cross entropy shares the same formula as the log loss, which can be presented as: $L_{\log} = -\sum_{c} d_c \log \hat d_c$, where $\hat d$ and $d$ are the predictive probability and the true distribution, respectively, of input x. Each predicted probability is compared to the actual class output value (0 or 1), and a score is calculated that penalizes the probability. Besides, the Brier score is a popular scoring rule that calculates the mean squared error between the predictive probabilities and the true classes, which can be expressed as: $BS = \frac{1}{K}\sum_{i=1}^{K}\sum_{c}(\hat d_{ic} - d_{ic})^2$, where K is the number of samples. Both the log loss and the Brier score act as evaluations of the predictive uncertainty; specifically, the higher the log loss and the Brier score, the more uncertainty is buried in the system. In ML modeling, the probability scores are overconfident or under-confident in some cases, which biases predictions that should be near zero or one and affects the subsequent averaged prediction result. Therefore, calibration of the predictions is an essential step to improve the reliability of the probability estimates. Calibration adjusts the predictive probability output of each ensemble member, which is set as a uniformly weighted mixture model; it is a scaling operation that adjusts the obtained probability distribution to match the expected distribution observed in the data [78]. The random forest method is especially affected: because of the feature subsetting, the base-level trees are trained with relatively high variance, which again biases predictions that should be near zero or one. Therefore, calibration of the log loss and the Brier score is indispensable.
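Both scoring rules reduce to a few lines of numpy; the sketch below evaluates a confident and an uncertain three-class prediction and shows that both scores rise with uncertainty.

```python
import numpy as np

def log_loss(d_true, d_pred, eps=1e-12):
    """Cross-entropy / log loss between one-hot labels and predicted probs."""
    d_pred = np.clip(d_pred, eps, 1.0)
    return -np.mean(np.sum(d_true * np.log(d_pred), axis=1))

def brier_score(d_true, d_pred):
    """Mean squared error between predicted probs and one-hot labels."""
    return np.mean(np.sum((d_pred - d_true) ** 2, axis=1))

# Two 3-class predictions: one confident and correct, one uncertain.
d_true = np.array([[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0]])
confident = np.array([[0.9, 0.05, 0.05],
                      [0.05, 0.9, 0.05]])
uncertain = np.array([[0.4, 0.3, 0.3],
                      [0.3, 0.4, 0.3]])

# Higher scores indicate more uncertainty buried in the system.
assert log_loss(d_true, uncertain) > log_loss(d_true, confident)
assert brier_score(d_true, uncertain) > brier_score(d_true, confident)
```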
Platt scaling (Platt calibration) is a typical calibration method, which transforms predictions into posterior probabilities by passing them through a sigmoid. Each calibrated predictive distribution can be presented as: $P(y_i = 1 \mid \hat y_i) = \frac{1}{1 + \exp(A\hat y_i + B)}$, where $\hat y_i$ is the uncalibrated predictive output of the true label of sample i, and A and B are real numbers determined when fitting the regressor via maximum likelihood. The calibrated predictions are further applied to obtain the calibrated scoring indices, which estimate the total predictive uncertainty in terms of prediction accuracy, log loss, and Brier score for evaluating and comparing the performance of the proposed CNN and DE models.
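A minimal sketch of Platt scaling on synthetic scores, fitting A and B by gradient descent on the negative log-likelihood (the production route would typically use a library calibrator; this hand-rolled fit is for exposition only):

```python
import numpy as np

def platt_fit(scores, labels, lr=0.1, n_iter=2000):
    """Fit A, B of p = 1 / (1 + exp(A*s + B)) by maximum likelihood
    (gradient descent on the log loss)."""
    A, B = -1.0, 0.0
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(A * scores + B))
        err = p - labels            # dNLL/dz with z = -(A*s + B)
        A += lr * np.mean(err * scores)   # chain rule: dz/dA = -s
        B += lr * np.mean(err)            # chain rule: dz/dB = -1
    return A, B

def platt_apply(scores, A, B):
    return 1.0 / (1.0 + np.exp(A * scores + B))

rng = np.random.default_rng(0)
# Uncalibrated scores: positives cluster at +2, negatives at -2.
scores = np.concatenate([rng.normal(2, 1, 500), rng.normal(-2, 1, 500)])
labels = np.concatenate([np.ones(500), np.zeros(500)])

A, B = platt_fit(scores, labels)
p = platt_apply(scores, A, B)   # calibrated posterior probabilities
```

After fitting, the calibrated probabilities are high for the positive cluster and low for the negative one, as expected from the sigmoid mapping.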

Autoencoder with transfer learning for data augmentation
The use of ML algorithms for NDE experimental data analysis is often hampered by the demanding and expensive nature of data collection procedures. To address this problem, one potential solution is to apply data augmentation methods that can expand the existing dataset by generating more diverse and comprehensive training data. An autoencoder is a typical unsupervised multilayer NN that is employed to compress and decompress input data, realizing the data augmentation. The goal of an autoencoder is not to recreate the input data perfectly; some degree of error is expected. In fact, introducing a controlled amount of error or noise can benefit the subsequent training process for defect characterization, helping it capture more generalized features and patterns in the data and thus prevent overfitting.
This is similar to traditional data or image augmentation techniques such as introducing noise or applying blurring [79]. The capability of autoencoder NNs has been demonstrated in areas such as image reconstruction [80], feature extraction [81], data augmentation for anomaly detection [82], and noise reduction in medical images [83]. An autoencoder consists of an encoder-decoder pair: the encoder generates a compact representation of the whole dataset, which is then passed to the decoder to reconstruct the original data from this simplified representation with high fidelity [84]. The goal is to train the network to minimize the discrepancies between the input data and the reconstructed data with a proper loss function, while retaining a certain similarity between the original input and the recreated output to enrich the original dataset.
To enhance the learning efficiency of the autoencoder on MFL experimental data, the network is first pre-trained on simulation data, using transfer learning to gain experience from similar MFL classification tasks. Transfer learning is an ML-based optimization process in which knowledge gained from training on one task is applied to a different but related task [85, 86]. This approach involves using a pre-trained model as a starting point and fine-tuning it for a new task rather than training a model from scratch. It is a useful technique when there is limited data for the new task, as it leverages the knowledge learned from the original task to improve performance on the new one. It has been widely applied in ML studies, such as translation and image recognition [87], image classification [88], etc. It is worth highlighting that the similarities in format and sensing method between the experimental and simulated MFL data, coupled with the larger size of the simulated dataset, add weight to the effectiveness of transfer learning in this study. Through transfer learning, the performance of pre-trained autoencoder models on our experimental dataset can be enhanced, reducing the need for a large number of experiments.
The applied autoencoder model architecture and the transfer learning process are illustrated in figure 3. Specifically, two pairs of convolutional layers with max pooling operations are employed in the encoder stage, activated by the non-linear rectified linear unit (ReLU) for capturing more useful representations. Further, a dropout layer is applied as regularization to reduce overfitting. The number of kernels is chosen so that the overall number of activations does not decrease from one layer to the next. These two convolution-plus-pooling stages act as feature extractors, mapping the input to the compressed feature representation space Z. The parameters of the encoder are first initialized by pre-training on the large simulated MFL dataset, allowing the encoder network to extract the general features of MFL signals. Owing to the intrinsic connection between simulation and experimental data, this pre-training provides a good basis for the network's subsequent learning of the specific features present in the experimental data. When processing MFL experimental data, the pre-trained layers' weights are used as initial values and kept frozen in subsequent training to prevent any loss of the valuable information they carry. The subsequent decoding process is added as trainable layers that mirror the encoder and learn to reconstruct the original images. Specifically, in the decoder, each convolutional layer is followed by an upsampling layer to map Z back to the same size as the original image. After updating the decoder layers, the whole autoencoder is fine-tuned with the experimental data again to construct the fine-tuned model. Therefore, the new autoencoder model can turn the learned MFL signal features into prediction training with the experimental dataset. The transfer learning process allows the
fine-tuned model to adapt to the unique characteristics of the experimental data, resulting in better performance than training a model from scratch. Further, once the autoencoder model has been fine-tuned, it can augment the experimental MFL data. This is accomplished by feeding experimental MFL images into the autoencoder for reconstruction; the resulting reconstructed images are then appended to the original experimental dataset. Specifically, the initial experimental dataset is referred to as 'OR', while the combined new experimental MFL dataset is denoted as 'GE'. The objective of this process is to expand the training and testing datasets for the previously mentioned learning-based networks, thereby enriching the available data for improved classification and prediction performance. The performance of the applied transfer learning and data augmentation is discussed in the following section.
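The pretrain-freeze-fine-tune-augment pipeline can be sketched with a deliberately tiny linear autoencoder in numpy; the convolutional architecture, the image data, and the loss schedule of the actual model are replaced here by low-rank synthetic signals, purely to make the data flow concrete.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, dim_z=4, lr=0.01, epochs=500, W_enc=None, freeze_enc=False):
    """Tiny linear autoencoder trained by gradient descent on the MSE.
    Passing a pre-trained W_enc with freeze_enc=True mimics the transfer-learning
    step: keep the encoder fixed and train only the decoder."""
    n, d = X.shape
    W_enc = rng.normal(0, 0.1, (d, dim_z)) if W_enc is None else W_enc.copy()
    W_dec = rng.normal(0, 0.1, (dim_z, d))
    for _ in range(epochs):
        Z = X @ W_enc                  # encode
        X_hat = Z @ W_dec              # decode
        err = X_hat - X                # reconstruction error
        W_dec -= lr * (Z.T @ err) / n
        if not freeze_enc:
            W_enc -= lr * (X.T @ (err @ W_dec.T)) / n
    mse = float(np.mean((X @ W_enc @ W_dec - X) ** 2))
    return W_enc, W_dec, mse

# Low-rank signals standing in for simulated ('sim') and experimental ('exp') MFL data.
basis = rng.normal(0, 0.5, (4, 16))
X_sim = rng.normal(0, 1, (500, 4)) @ basis
X_exp = rng.normal(0, 1, (60, 4)) @ basis + rng.normal(0, 0.05, (60, 16))

# Pre-train on simulation data, then fine-tune the decoder on experimental data
# with the encoder frozen.
W_enc, _, _ = train_autoencoder(X_sim)
_, W_dec, mse_ft = train_autoencoder(X_exp, W_enc=W_enc, freeze_enc=True)

# Reconstructions of the experimental data serve as augmented samples ('GE').
X_aug = X_exp @ W_enc @ W_dec
```

The reconstructions are deliberately imperfect copies of 'OR'; appending them yields the enlarged 'GE' dataset, in the spirit of the augmentation described above.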

Performance evaluation and discussion
An MFL experiment was conducted on a stainless steel sample containing three kinds of conical defects, whose sizes are presented in table 2. The horizontal opening is described by 'Diameter', and the vertical direction by 'Depth'. The sensor liftoff during data collection was set to 1 mm, 2 mm, and 3 mm for each kind of defect. Specifically, each defect was tested 60 times under each liftoff scenario. This results in 180 MFL images for each type of defect, with each image having dimensions of 217 × 217 pixels in RGB format. Consequently, 'OR' consists of 540 experimental MFL data for subsequent analysis.

Performance evaluation for autoencoder-based transfer learning
To initiate the process of fine-tuning the network weights through pre-training, 1500 MFL simulation images from the simulation model are employed for depth classification. Three defect depths are considered, equally divided over the range from 2 mm to 10 mm. In training the pre-trained model, 70% of the total simulated MFL data is used for training, while the remaining 30% is set aside for validation. After updating the network layers and obtaining the optimally compressed representations, the autoencoder model is further fine-tuned with the experimental data, of which 66.6% is allocated for training. Before evaluating the performance of the applied autoencoder in data augmentation, the effectiveness of the proposed autoencoder-based transfer learning approach in this application is addressed. The mean squared error (MSE) loss is a commonly used measure when training and testing NNs; it quantifies the average of the squared differences between the predicted values and the actual target values, making it a fundamental measure for assessing a model's predictive accuracy. In this section, we compute the MSE loss of our proposed autoencoder network applied to MFL data and compare it in two cases. In the first case, the autoencoder is directly trained and tested on the experimental dataset; the second case involves pre-training the autoencoder model on the larger simulated dataset and then fine-tuning it with the experimental data, allowing it to adapt to the specifics of the real-world data. The results are illustrated in figure 4.
Both models exhibit a similar progression over time, starting with relatively high loss values and gradually improving, with substantial reductions particularly after 25 epochs. Remarkably, during the validation stage, the lower MSE observed on the testing data in both cases indicates the model's ability to generalize effectively to new, hitherto unseen data. This finding highlights the autoencoder model's capacity to make accurate predictions beyond the scope of the training dataset. Compared to direct training of the autoencoder, transfer learning leads to faster convergence to an acceptable loss level within a relatively small number of training epochs. The visual representations provide a clear basis for comparing the models' performance, emphasizing the autoencoder's proficiency in learning valuable features from MFL data. Furthermore, the adoption of transfer learning enhances performance, highlighting the advantages of leveraging existing knowledge to expedite the learning process and improve the model's ability to generalize. With the optimal pre-trained autoencoder network fine-tuned on the experimental training data (OR), the reconstructed images from the 33.3% testing 'OR' experimental MFL data are used as the augmented dataset, which is further combined with the original experimental dataset (OR) to form the newly generated dataset (GE). Therefore, an additional 180 newly generated data are added to 'OR', and the result is denoted 'GE'. To assess the effect of the data augmentation, the relationship between the 'GE' and 'OR' datasets can be evaluated by analyzing the direction and strength of the results obtained from the proposed CNN and DE models, respectively.
The directional relation can be assessed through the covariance, which is expressed as: $\mathrm{cov}(S_{OR}, S_{GE}) = \frac{1}{n}\sum_{i=1}^{n}\big(S_{OR}(i) - \bar S_{OR}\big)\big(S_{GE}(i) - \bar S_{GE}\big)$, where i denotes the index of the liftoff variance and S(.) is the averaged scoring index. Moreover, the correlation indicator is applied to determine how strongly the two variables are related, which can be written as: $\rho = \frac{\mathrm{cov}(S_{OR}, S_{GE})}{\sigma_{S_{OR}}\,\sigma_{S_{GE}}}$, where $\sigma_{S(.)}$ denotes the standard deviation of the scoring index.
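Both indicators are one-liners in numpy; the scoring-index values below are hypothetical stand-ins for the per-liftoff accuracies of the 'OR' and 'GE' datasets, not the paper's measured numbers.

```python
import numpy as np

# Hypothetical averaged scoring-index values (e.g. accuracy) at the three
# liftoff levels for the original ('OR') and augmented ('GE') datasets.
S_or = np.array([0.95, 0.90, 0.78])
S_ge = np.array([0.94, 0.89, 0.80])

cov = np.cov(S_or, S_ge)[0, 1]        # directional relation
rho = np.corrcoef(S_or, S_ge)[0, 1]   # strength of the relation

# A positive covariance with correlation near 1 means the two datasets
# respond to liftoff changes in the same direction with similar strength.
assert cov > 0
assert rho > 0.9
```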
Based on the 'OR' and 'GE' MFL data, the corresponding relation is evaluated in terms of classification accuracy and the uncertainty scoring indices (calibrated log loss and Brier score), which are presented in table 3. The results from both the CNN and DE models show that all the scoring indices' covariance indicators are positive and their correlation indicators are close to 1. This confirms a strong positive correlation between the 'OR' and 'GE' datasets, which further supports the feasibility and efficiency of using a pre-trained autoencoder network to address data deficiency in MFL experimental scenarios. Therefore, the final augmented MFL dataset 'GE' is used for further learning-based defect detection and uncertainty estimation.

Comparison performance of CNN and DE with uncertainty estimation
As discussed before, network calibration is beneficial for improving the prediction reliability of modeling. For multiclass scenarios, the static calibration error (SCE) is usually applied to evaluate calibration performance by measuring the difference between the confidence and accuracy of a model [89]. Specifically, the model predictions are divided into N equally spaced bins separately for each class j, and the calibration error is computed within each bin; the final result is obtained by averaging the calibration error across all bins. For each bin $B_{ij}$, the accuracy $acc(B_{ij})$ represents the fraction of correct predictions, while the confidence $conf(B_{ij})$ corresponds to the mean of the maximum probability over the data points in the bin. The SCE can be described as: $\mathrm{SCE} = \frac{1}{M}\sum_{j=1}^{M}\sum_{i=1}^{N}\frac{|B_{ij}|}{K}\,\big|acc(B_{ij}) - conf(B_{ij})\big|$, where N and M denote the number of bins and the total number of classes, respectively, and K is the total number of data points. The corresponding calibration comparison results on the MFL experimental classification problem for CNN and DE are presented in table 4.
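A compact sketch of the SCE computation (per-class binning with occupancy weighting, following the usual formulation; the exact binning details of the paper's implementation are assumed):

```python
import numpy as np

def static_calibration_error(probs, labels, n_bins=10):
    """Static calibration error (SCE): bin the predicted probability of each
    class separately, compare per-bin confidence with per-bin accuracy, and
    average over classes, weighting each bin by its occupancy."""
    K, M = probs.shape
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    sce = 0.0
    for j in range(M):                        # one reliability sweep per class
        p_j = probs[:, j]
        hit = (labels == j).astype(float)     # 1 if class j is the true label
        for i in range(n_bins):
            in_bin = (p_j > edges[i]) & (p_j <= edges[i + 1])
            if not in_bin.any():
                continue
            conf = p_j[in_bin].mean()         # mean predicted probability
            acc = hit[in_bin].mean()          # empirical frequency
            sce += (in_bin.sum() / K) * abs(conf - acc)
    return sce / M

# Perfectly calibrated toy predictions: probabilities match frequencies,
# so the SCE should be (numerically) zero.
probs = np.array([[0.5, 0.5], [0.5, 0.5]])
labels = np.array([0, 1])
```

For these toy predictions the predicted 0.5 matches the empirical frequency exactly, so the SCE vanishes; overconfident predictions would raise it.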
It can be seen that Platt scaling leads to a noticeable decrease in prediction errors for both models.This reduction in prediction errors can help decrease uncertainty within the system and ultimately enhance the model's reliability, which provides a good basis for investigating the uncertainties brought by the liftoff variance.
To compare the CNN and DE in terms of uncertainty estimation, results from the proposed CNN and DE on the augmented MFL experimental data (GE) are evaluated through the calibrated scoring index S with respect to increasing liftoff variance, as presented in figure 5. Specifically, 66% of the GE data were randomly assigned to the training set, while the remaining 34% were allocated to the test set within each iteration; this helps mitigate biases during model evaluation. In figure 5(a), the monotonic decrease in prediction accuracy with increasing liftoff variation indicates that both models are sensitive to liftoff changes in the input data. However, the CNN model appears more robust than the DE model, as its classification accuracy remains consistently higher regardless of the level of liftoff variation. Considering the total predictive uncertainty in figures 5(b) and (c), an increase in liftoff variation is observed to increase the total predictive uncertainty in both the CNN and DE methods, indicating their ability to evaluate uncertainties in this application. Additionally, the result plots reveal that the disparity in accuracy and predictive uncertainty between liftoff 2 and liftoff 3 is significantly larger than that between liftoff 1 and liftoff 2, indicating that the classification capability deteriorates increasingly rapidly with liftoff changes. Although the Brier score suggests less uncertainty in CNN compared to DE, the overall results demonstrate that CNN achieves higher prediction accuracy and lower log loss than DE. This suggests that CNN is better at capturing the unique characteristics of the experimental MFL dataset and resisting the impact of uncertainty.
Furthermore, as discussed in the previous section, the CNN model with dropout can separate the uncertainty components arising from the learning model and from the data by identifying the epistemic and aleatoric uncertainty during the prediction stage. The results are shown in figure 6. They demonstrate that the aleatoric uncertainty exhibits a significant increase from 0.04 to 0.08 under the liftoff variation. Although there are some fluctuations in the epistemic uncertainty, the aleatoric uncertainty remains around four times larger than the epistemic uncertainty. Therefore, the uncertainty attributed to the model is negligible compared to the data uncertainty, and the overall uncertainty is mainly driven by the variation in the data.
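One common way to separate these components from Monte Carlo dropout passes is the entropy decomposition sketched below (total = aleatoric + epistemic); this is a standard formulation and not necessarily the exact estimator used in the paper.

```python
import numpy as np

def entropy(p, axis=-1, eps=1e-12):
    return -np.sum(p * np.log(p + eps), axis=axis)

def decompose_uncertainty(mc_probs):
    """Split the total predictive uncertainty of T Monte Carlo dropout passes
    (array of shape T x classes) into aleatoric and epistemic parts:
    total entropy = mean per-pass entropy (aleatoric) + mutual information
    between passes (epistemic)."""
    total = entropy(mc_probs.mean(axis=0))   # entropy of the mean prediction
    aleatoric = entropy(mc_probs).mean()     # mean entropy of each pass
    epistemic = total - aleatoric            # disagreement between passes
    return total, aleatoric, epistemic

# Passes that agree but are individually uncertain -> mostly aleatoric.
agree = np.tile([0.4, 0.3, 0.3], (20, 1))
# Passes that are individually confident but disagree -> mostly epistemic.
disagree = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]] * 10)

_, alea1, epis1 = decompose_uncertainty(agree)
_, alea2, epis2 = decompose_uncertainty(disagree)
```

When the dropout passes all agree, the epistemic term vanishes and only data (aleatoric) uncertainty remains, matching the situation reported in figure 6 where the data term dominates.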

Uncertainty guided defect classification evaluation
As the efficiency of the proposed CNN model has been established in the previous sections through the uncertainty estimation scoring indices, we now explore its ability to guide decision-making for new, unseen MFL data. Because collecting MFL data is time-consuming, four groups of new MFL data were acquired, consisting of 36, 72, 108, and 144 MFL images, respectively. The liftoff variance was averaged across all samples in each group. The four sample groups are fed into the proposed CNN model to evaluate its classification performance on defect size. Figure 7(a) shows the percentage of wrongly classified samples in each group with respect to different liftoff variances. It is noticeable that larger liftoff values result in higher prediction bias. On average, the percentage of incorrect classifications under different levels of uncertainty from liftoff 1 to 3 is 25%, 30%, and 42%, respectively. It should be noted that only four sample sets were used in this analysis, and a clearer and more stable trend can be expected with a larger number of samples. However, a relatively consistent relation between the liftoff variance and the misclassification percentage can be observed; therefore, the results for sample 4 can be viewed as a reliable representation of the true underlying trends in this application. Further, the receiver operating characteristic curves and their areas under the curve (AUC ROC) under each liftoff uncertainty are depicted in figure 7(b) to assess the overall discriminatory power of the proposed learning model. As this is a multi-class problem, the macro-averaging method is applied, which involves aggregating the predictions for each class and then calculating an AUC ROC curve for the aggregated forecasts. The AUC serves as a measure of the model's ability to differentiate between defect classes amid uncertainties. It can be observed that with increased liftoff variance, the model's
capability to distinguish defect size decreases from 0.96 to 0.86.
Further, to gain better insight into the predictions under the different liftoffs of sample 4, the confusion matrix is used to provide a detailed breakdown of the model's predictions. The results are presented in figure 8, which shows how many instances were correctly or incorrectly classified for each class when predicting new, unseen MFL data. From the confusion matrix, we can compute the F1 score, a metric that considers both false positives and false negatives and is the harmonic mean of precision and recall. Among the three results, LO1 demonstrates the highest F1 score at 74.72%, indicating that increased uncertainty reduces the model's capacity to accurately identify positive instances. Specifically, with increasing liftoff uncertainty, class 1 defects maintain a high classification rate of 87.5%, whereas the other two defect classes, particularly class 3, are more prone to misclassification. Larger defects may overlap with background noise, and the presence of uncertainty can distort the edges of these defects, making them more challenging to accurately identify and classify. As a result, the boundaries between a defect and its surroundings become less distinct, leading to a higher likelihood of misclassification or erroneous characterization. Overall, both the ROC curve and the confusion matrix support the previous observation in section 5.2.2 that higher liftoff changes introduce more ambiguity into the measurements and features used by the model and therefore deteriorate the classification capability.
Further, in order to propose a quantitative way to evaluate and determine the reliability of the classification of new inputs, two feature indexes are considered:
• Confidence index (CI): indicates the degree of confidence in the classification performance, with higher CI values indicating lower uncertainty and vice versa [39]. The formula can be expressed as: $CI = L_2 - L_1$. As each predictive probability is generated from the softmax function, a probability is assigned to each class for one prediction; $L_1$ is the negative log-likelihood of the probability of the correctly classified class, while $L_2$ is the negative log-likelihood of the maximum probability among the other, wrong classes.
• Weighted predictive uncertainty U: the log loss, the Brier score, and the aleatoric uncertainty are all capable of revealing the uncertainty in the classification to varying degrees. To combine these uncertainty indexes, the minimum redundancy maximum relevance (mRMR) method is used to rank them by finding the optimal feature set that effectively represents the response variable while minimizing redundancy between features [90]. The resulting ranking determines the importance weight of each uncertainty index and thus generates the weighted total predictive uncertainty, which can be expressed as: $U = \sum_j w_j S_j$, where $w_j$ is the importance weight of the jth uncertainty index $S_j$.
To evaluate the performance of defect classification on new MFL data, the corresponding feature indexes are extracted from sample 4 and presented in figure 9. The correctly classified data points are marked as red dots, while the incorrectly classified ones are marked as green dots. Two boundaries can be established to improve the uncertainty-guided decision-making process, based on the CI and the weighted total uncertainty, respectively. The CI threshold, marked with a brown line, serves as the first boundary in uncertainty-guided decision-making and is determined by the CI of the wrongly predicted sample with the highest CI. Any samples with a CI higher than this threshold are considered to have a
trustworthy classification, regardless of their uncertainty index; otherwise, samples with a CI lower than this threshold are evaluated against the uncertainty decision boundary, drawn as a pink curve. This boundary is generated with quadratic discriminant analysis, a statistical algorithm that classifies data into groups by modeling the distributions of the independent variables (predictors) for each group with a quadratic function [91]. Therefore, when a new classification is made, the evaluation steps summarized in algorithm 2 should be followed to determine the reliability of the classification. Based on these steps, example new MFL images with correct and wrong predictions are presented in figures 10 and 11, with the corresponding feature indexes CI and U listed alongside the true and predicted classes.
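The two-stage decision rule can be sketched as below; note that reading the confidence index as the difference $L_2 - L_1$ is an assumption inferred from the text (a confident, correct prediction then yields a large positive CI), and the thresholds here are hypothetical.

```python
import numpy as np

def confidence_index(probs, true_class):
    """CI = L2 - L1, where L1 is the negative log-likelihood of the correct
    class and L2 that of the largest probability among the wrong classes.
    This difference form is an assumption based on the text's description."""
    p = np.asarray(probs, dtype=float)
    l1 = -np.log(p[true_class])
    wrong = np.delete(p, true_class)
    l2 = -np.log(wrong.max())
    return l2 - l1

def is_trustworthy(ci, u, ci_threshold, u_boundary):
    """Two-stage rule: accept on high CI; otherwise require low uncertainty."""
    if ci > ci_threshold:
        return True           # confident classification, accept directly
    return u < u_boundary     # fall back to the uncertainty decision boundary

# Hypothetical softmax outputs and thresholds for illustration.
ci_conf = confidence_index([0.9, 0.05, 0.05], true_class=0)
ci_unsure = confidence_index([0.4, 0.35, 0.25], true_class=0)

assert ci_conf > ci_unsure    # a more confident prediction gives a larger CI
assert is_trustworthy(ci_conf, u=0.5, ci_threshold=1.0, u_boundary=0.2)
```

In practice the CI threshold and the uncertainty boundary would come from the wrongly predicted calibration samples and the quadratic discriminant analysis described above, not from fixed constants.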

Conclusion
This paper applied a Bayesian approximation-based learning model as a comprehensive and practical solution for uncertainty estimation in experimental MFL defect classification. The main contributions of this work can be summarized as follows: (1) The proposed framework addresses the issue of insufficient data quantity by utilizing an efficient autoencoder network with transfer learning. The use of transfer learning illustrates the advantages of exploiting existing knowledge from simulated MFL signals to speed up learning and improve the model's generalization ability. The pre-trained model is further used to fine-tune the autoencoder with experimental MFL data. The effectiveness of the proposed autoencoder in learning important features from experimental MFL data has been demonstrated, and the autoencoder is further used to enrich the original dataset. (2) Moreover, two Bayesian approximation-based ML networks, CNN with dropout and DE, are applied for defect classification with uncertainty estimation. The comparison results confirm the monotonic relationship between the liftoff variance and the prediction accuracy when total uncertainty estimation is applied. Comparatively, the proposed CNN outperforms the DE, achieving higher prediction accuracy and lower overall prediction uncertainty.
(3) Further, we proposed guidance for determining the reliability of the classification of new, unseen MFL data using two feature indexes: CI and the weighted total uncertainty. The incorporation of the proposed uncertainty-guided decision-making offers valuable insights into the prediction results, thereby enhancing the reliability of the classification outcomes.
The existing criteria for uncertainty guidance rely on a limited amount of experimental data, suggesting possibilities for future research. It would be beneficial to incorporate larger datasets with greater variations to assess the performance of the proposed uncertainty analysis. Another interesting direction would be exploring a more advanced learning-based network that can significantly enhance resilience to uncertainty. Such a network should be designed with a focus on improving UQ, further strengthening the model's performance and reliability in the face of uncertainty.
Overall, this work has introduced a valuable research framework aimed at classifying defects and enhancing prediction reliability by incorporating transfer-learning-assisted autoencoder-based data augmentation, learning-based defect classification, and uncertainty analysis. These key elements collectively contribute to the robustness and effectiveness of this work. The approach demonstrates significant potential for extension to other engineering challenges within NDE.

Data availability statement
The data cannot be made publicly available upon publication due to legal restrictions preventing unrestricted public distribution.

Figure 1 .
Figure 1. Flow chart of the structure of the paper.

Figure 2 .
Figure 2. Schematic representation of the CNN process.

Algorithm 1 .
Pseudo-code of the random forest-based DE training process in MFL.
for n = 1:3 do ▷ defect classes
  for r = 1:3 do ▷ repeated k-fold cross-validation
    for k = 1:10 do
      1. Random model parameter (θ_rkm, m = 1:M) initialization of the M subtrees with bootstrap-based sub-datasets. ▷ bootstrap
      2. Train each ensemble member of the M networks in the random forest classifier with a calibrated classifier to minimize the cross-entropy loss.

Figure 3 .
Figure 3. Schematic representation of the autoencoder architecture and transfer learning process.

Figure 4 .
Figure 4. Comparative loss analysis with and without transfer learning: (a) train loss; (b) validation loss.

Figure 5 .
Figure 5. Comparison performance in terms of mean (asterisk) and variance (shadowed bounds) for CNN and DE with respect to different uncertainty: (a) prediction accuracy; (b) log loss; (c) Brier.

Figure 7 .
Figure 7. Prediction performance for new MFL data: (a) wrong prediction percentage for the different new samples; (b) ROC curve for sample 4.

Figure 9 .
Figure 9. New MFL sample distribution based on the confidence index and weighted total uncertainty with noted CI threshold (brown) and uncertainty decision boundary (pink).

Algorithm 2 .
Uncertainty-guided classification reliability evaluation.
Given: CI_i, U_i ▷ input: x
if CI_i > CI threshold then
  Classification is correct;
else
  if U_i < uncertainty decision boundary then
    Uncertainty is low; classification is correct.
  else
    Uncertainty is high; classification might be wrong.
  end if
end if
Based on the aforementioned steps, figures 10 and 11 show examples of new MFL images with correct and wrong predictions, respectively. The corresponding CI and U, as well as the true class and predicted class, are listed. These examples emphasize the significance of considering both factors in the prediction process, demonstrating the effectiveness and practicality of the proposed uncertainty guidance process.

Figure 10 .
Figure 10. Examples of correctly predicted MFL images: first-row images are of high CI, while the second row is of low uncertainty, even with low CI.

Figure 11 .
Figure 11. Examples of wrongly predicted MFL images, which are of low CI and high uncertainty.

Table 3 .
Performance evaluation on augmented MFL experimental data.

Table 4 .
The effects of calibration on the mean of SCE.