A limited labeled data bearing fault diagnosis method based on self-supervised learning

This study addresses several key issues in bearing fault diagnosis and presents an innovative solution. Conventional supervised approaches to bearing fault diagnosis typically require large labeled data sets, which can be time-consuming or infeasible to obtain. To address this problem, a growing body of research has explored fault diagnosis methods that work with limited labeled data. In this study, we introduce a bearing fault diagnosis framework that combines the wavelet transform with self-supervised learning. The framework takes vibration signals and transforms them into time-frequency spectrograms as inputs. To extract features, we employ the Swin Transformer as an encoder. Furthermore, we adopt a self-supervised learning approach named MoBy to address the challenge of limited labeled samples. Encouragingly, with a well-trained encoder and a simple linear classification layer, our approach achieves a diagnostic accuracy of 96.4% using only 1% labeled samples, demonstrating outstanding performance under limited labels. To validate the proposed approach, we conducted experiments on two rolling bearing fault datasets and achieved significant results.


Introduction
Bearings and gears are critical components of rotating machinery. Because they endure prolonged heavy loads and high-speed operation in modern industries, they are susceptible to failures [1]. Hence, there is a strong demand for accurate and timely bearing fault diagnosis to enhance machinery reliability [2]. Traditional diagnosis methods rely on physical models that employ signal processing techniques to analyze specific fault components. However, such models are ill-suited to increasingly complex real-world conditions and require manual identification of fault features, which exposes the limitations of traditional methods.
In recent years, owing to advances in data acquisition and computational power, machine learning-based, data-driven intelligent fault diagnosis methods have gained increasing recognition. Machine learning (ML) in particular offers high accuracy while requiring minimal prior knowledge [3]. ML models such as Support Vector Machines and Self-Organizing Maps have been widely utilized. However, these methods still rely on manually designed feature extraction algorithms, which limits their adaptability to different fault patterns and operating conditions. There is therefore a need for an adaptive feature extraction method to overcome these limitations in complex bearing fault diagnosis.
With its capacity for deep representation learning, deep learning (DL) has overcome the limitation of traditional machine learning's reliance on manually designed features. Various DL architectures and enhanced techniques, such as Convolutional Neural Networks and Recurrent Neural Networks, have shown promising results in fault diagnosis. Notably, Ding et al. [4] introduced a transformer-based TFT model for rolling bearing fault diagnosis and demonstrated its effectiveness experimentally. Deep learning-based fault diagnosis methods have made significant advances in recent years [5].
However, most of the aforementioned DL approaches rely on sufficient labeled data, which necessitates time-consuming and costly annotation. In real-world engineering scenarios, obtaining and annotating all the data is challenging. Moreover, supervised learning methods may not perform well in complex systems and nonlinear problems due to their reliance on predefined models and features. To address these challenges, researchers have turned to fault diagnosis methods that exploit unlabeled data, such as self-supervised learning.
We propose a self-supervised learning-based approach for bearing fault diagnosis using limited labeled data, addressing the scarcity of labeled data in real-world monitoring datasets. In this framework, the original signals are first transformed into time-frequency spectrograms using the wavelet transform. The Swin Transformer [10] is employed as the feature extractor. The proposed method is experimentally validated on two fault datasets in which only a small fraction (1%) of the data is labeled. The experiments demonstrate a test accuracy exceeding 96%.

Bearing fault diagnosis model construction
Our method consists of three main stages: (1) preprocessing of the original vibration signals; (2) self-supervised learning based on the Swin Transformer; and (3) few-shot classification. We define the training dataset (T), which includes unlabeled data (U) and limited labeled data (L), and the test dataset (Y). The overall flowchart is depicted in Figure 1. In this section, we describe the proposed method in detail.

Data preprocessing
In practical operating conditions, the acquisition of time-domain signals may be affected by sensor and environmental noise. Additionally, owing to speed fluctuations and faults, the obtained vibration signals are often non-stationary. In such cases, it becomes essential to examine both the frequency and the temporal characteristics of the signals and to convert the original vibration signals into a representation that captures both time and frequency information. To this end, we employ wavelet analysis [13]. The transformation process is shown in Figure 2. The wavelet transform (WT) is defined as

WT(a, τ) = (1/√a) ∫ x(t) ψ*((t − τ)/a) dt,

where a is the scale parameter, τ is the time-translation parameter, and ψ is the wavelet basis function of the WT (ψ* denotes its complex conjugate).
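As a minimal illustration of this preprocessing step, the continuous wavelet transform above can be sketched in pure NumPy with a Morlet mother wavelet. The paper does not state which wavelet basis it uses, so the Morlet wavelet, the scale grid, and the toy signal below are illustrative assumptions, not the authors' exact pipeline; the convolution-based implementation is an approximation of the integral formula.

```python
import numpy as np

def morlet(t, w0=6.0):
    """Simplified complex Morlet mother wavelet (correction term omitted)."""
    return np.pi ** -0.25 * np.exp(1j * w0 * t) * np.exp(-t ** 2 / 2)

def cwt_scalogram(x, scales, fs):
    """Continuous wavelet transform of a 1-D signal via direct convolution.

    Approximates WT(a, tau) = (1/sqrt(a)) * integral x(t) psi*((t - tau)/a) dt
    and returns the magnitude scalogram, shape (len(scales), len(x)).
    """
    n = len(x)
    coeffs = np.empty((len(scales), n), dtype=complex)
    t = (np.arange(n) - n // 2) / fs          # symmetric time grid for the wavelet
    for i, a in enumerate(scales):
        # (1/sqrt(a)) * psi*(t/a), correlated with x over the shift tau
        psi = np.conj(morlet(t / a)) / np.sqrt(a)
        coeffs[i] = np.convolve(x, psi[::-1], mode="same") / fs
    return np.abs(coeffs)

# toy vibration signal: a 50 Hz tone plus a high-frequency burst, 12 kHz sampling
fs = 12_000
t = np.arange(0, 0.1, 1 / fs)
x = np.sin(2 * np.pi * 50 * t) + (t > 0.05) * np.sin(2 * np.pi * 2000 * t)
scalogram = cwt_scalogram(x, scales=np.geomspace(0.0005, 0.02, 32), fs=fs)
print(scalogram.shape)  # (32, 1200)
```

In the actual framework the resulting scalogram would be rendered and normalized as a time-frequency image before being fed to the Swin Transformer.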

Self-supervised learning methods based on swin transformers
The input image is augmented to produce two distinct views, which are processed by the upper and lower branch networks (the online and target networks) to generate Queries and Keys, respectively. The Keys are continuously appended to a fixed-length queue, and the contrastive loss is then computed between the Queries and the Queue.

Swin Transformer.
The Swin Transformer is an advanced and versatile backbone for computer vision, known for its excellent performance. It is essentially a hierarchical Transformer that computes its representations with a shifted-window scheme. Building on the Vision Transformer, it downsamples the feature maps hierarchically and restricts self-attention to non-overlapping local windows, while the shifted windows allow connections across windows, thus reducing computational cost and increasing efficiency [12]. Notably, we use the tiny version (Swin-T) as the feature extractor.

MoBy.
MoBy [11] combines two effective self-supervised learning methods: MoCo v2 [6], which builds a dynamic dictionary and contrasts an image with other images, and BYOL [7], which consists of an online network and a target network. As shown in Figure 3(a), the workflow is as follows: input images (the unlabeled data U) undergo image augmentation to generate two different views; these views are fed into the online and target branch networks, respectively, to generate Queries and Keys; Keys are continuously added to a fixed-length Queue; finally, the contrastive loss is computed from the Queries and the Queue. Specifically, U is first input, and two data augmentations are applied to obtain views V and V'. The augmentations include random cropping and resizing, horizontal flipping, color jitter, conversion to grayscale with a given probability (0.2), and Gaussian blur. In addition, the V' view undergoes solarization (inversion of pixel values) with a given probability (0.2).
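The two-view augmentation above can be sketched in a few lines of NumPy. This is a simplified stand-in for the real pipeline (which in practice would use torchvision transforms): color jitter and Gaussian blur are omitted for brevity, the nearest-neighbour resize and the 64×64 output size are illustrative choices, and the input array merely stands in for a wavelet spectrogram image.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(img, out_size):
    """Crop a random square covering 8-100% of the area, resize (nearest-neighbour)."""
    h, w = img.shape[:2]
    side = int(np.sqrt(rng.uniform(0.08, 1.0) * h * w))
    side = max(1, min(side, h, w))
    top = rng.integers(0, h - side + 1)
    left = rng.integers(0, w - side + 1)
    crop = img[top:top + side, left:left + side]
    idx = np.arange(out_size) * side // out_size
    return crop[np.ix_(idx, idx)]

def augment(img, out_size=64, view2=False):
    v = random_resized_crop(img, out_size)
    if rng.random() < 0.5:                      # horizontal flip
        v = v[:, ::-1]
    if rng.random() < 0.2:                      # grayscale with p = 0.2
        v = np.repeat(v.mean(axis=2, keepdims=True), 3, axis=2)
    if view2 and rng.random() < 0.2:            # solarization only on the V' view
        v = np.where(v > 0.5, 1.0 - v, v)       # invert bright pixel values
    return v

img = rng.random((128, 128, 3))                 # stands in for a spectrogram image
v1, v2 = augment(img), augment(img, view2=True)
print(v1.shape, v2.shape)  # (64, 64, 3) (64, 64, 3)
```

The two views v1 and v2 would then be passed to the online and target branches, respectively.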
In this approach, the augmented data yields two distinct views, which are processed by the online encoder and the target encoder, respectively (Figure 3(b)). The target encoder consists of a Swin Transformer and a projection head (a 2-layer MLP). The online encoder has the same backbone and projection head, with an additional prediction head (a 2-layer MLP) stacked on top. The prediction module is similar in structure to the projection module and consists of linear layers, batch normalization layers, and ReLU activation functions. The online encoder, which generates the queries, is updated by gradient descent, while the target encoder, which generates the keys, is updated at each training iteration as a momentum-based moving average of the online encoder. A key and a query that originate from two views of the same image form a positive pair; the queue stores keys from a large number of other images, which form negative pairs with the query. The contrastive loss used in this study is defined as

L_q = −log [ exp(q·k₊/τ) / (exp(q·k₊/τ) + Σ_{i=1}^{K} exp(q·k_i/τ)) ],

where q is the query from the online view, k₊ is the target feature from the alternative view of the same image, k_i are the target features stored in the key queue, τ denotes the temperature parameter, and K denotes the capacity of the key queue (set to a default value of 4096). During training, like most Transformer-based methods, we use the AdamW optimizer.
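The contrastive loss and the momentum update can be written out directly from the definitions above. This is a minimal NumPy sketch, not the authors' implementation: the temperature value, feature dimension, and the toy "parameters" are illustrative assumptions, and real encoders replace the random vectors.

```python
import numpy as np

def l2norm(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def contrastive_loss(q, k_pos, queue, tau=0.2):
    """InfoNCE loss for one query against its positive key and the key queue."""
    q, k_pos, queue = l2norm(q), l2norm(k_pos), l2norm(queue)
    pos = np.exp(q @ k_pos / tau)            # similarity to the positive key
    neg = np.exp(queue @ q / tau).sum()      # similarities to the queued negatives
    return -np.log(pos / (pos + neg))

def momentum_update(online, target, m=0.99):
    """Target parameters track an exponential moving average of the online ones."""
    return {name: m * target[name] + (1 - m) * online[name] for name in online}

rng = np.random.default_rng(0)
d, K = 128, 4096                             # feature dim; queue size (default 4096)
q = rng.normal(size=d)                       # query from the online encoder
k_pos = q + 0.1 * rng.normal(size=d)         # key from another view of the same image
queue = rng.normal(size=(K, d))              # keys from other images (negatives)
loss = contrastive_loss(q, k_pos, queue)
print(loss > 0)  # True

# one momentum step on toy "parameters"
online = {"w": rng.normal(size=4)}
target = momentum_update(online, {"w": np.zeros(4)})
```

The loss is minimized by gradient updates to the online branch only; the target branch changes solely through `momentum_update`, which stabilizes the keys over training.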

Achieving few-shot classification
The MoBy process is illustrated in Figure 3. It is updated iteratively as training progresses. When training is complete, the trained Swin Transformer is extracted as the backbone and its parameters are frozen. A linear layer is then added, mapping the input feature dimension to the number of classes. This layer applies a weight matrix and a bias, using a linear transformation to convert the input features into a probability distribution over the output classes, thereby acting as a classifier. The limited labeled data L is then used for fine-tuning, followed by testing on the test dataset Y. The workflow diagram is presented in Figure 4.
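The linear classification head described above amounts to a single affine map followed by a softmax over the frozen backbone's features. The sketch below assumes a Swin-T feature width of 768 and 10 classes for illustration; the random arrays stand in for actual extracted features and learned weights.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear_classifier(features, W, b):
    """Map frozen backbone features to class probabilities via softmax."""
    logits = features @ W + b
    e = np.exp(logits - logits.max(axis=1, keepdims=True))  # stable softmax
    return e / e.sum(axis=1, keepdims=True)

n_classes, feat_dim = 10, 768           # e.g. 10 fault classes, Swin-T feature width
W = 0.01 * rng.normal(size=(feat_dim, n_classes))  # weight matrix (trainable)
b = np.zeros(n_classes)                            # bias (trainable)

feats = rng.normal(size=(4, feat_dim))  # stands in for frozen Swin-T features
probs = linear_classifier(feats, W, b)
print(probs.shape)  # (4, 10)
```

During fine-tuning only W and b receive gradients; the backbone stays fixed, which is what makes 1% labeled data sufficient.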

Summary
In this section, we have introduced a self-supervised learning-based approach for bearing fault diagnosis using limited labeled data. The workflow is illustrated in Figure 1, and the key steps are as follows: (1) Data preprocessing: raw data from the rolling bearing sensors is collected and preprocessed. The preprocessing applies a wavelet transform and normalization to the raw vibration signals to convert them into wavelet-transformed images.
(2) The MoBy pipeline: unlabeled data is used for self-supervised learning with the Swin-T model and the MoBy algorithm.
(3) Classification workflow: the parameters of the model network are frozen and saved. Using the limited labeled data, the classifier is fine-tuned, and the fine-tuned model is then used to classify and output the fault diagnosis results.

Experiment
In the experiments, we used the CWRU dataset from Case Western Reserve University [12] and the DIRG dataset from Politecnico di Torino [8]. Both datasets are publicly available and have been widely used by researchers. To validate the feasibility of the fault diagnosis model, we compared the two datasets and two baseline algorithms. The hardware environment consisted of an Intel(R) Xeon(R) Gold 5218R CPU @ 2.10 GHz and two NVIDIA GeForce RTX 3090 graphics cards; we used Python 3.8 with PyTorch 1.10 as the deep learning framework and CUDA 11.3 for GPU acceleration.

Datasets description
The first dataset comes from Case Western Reserve University and is depicted in Figure 5(a) [12]. It contains drive-end bearing data collected at a sampling frequency of 12 kHz under zero load and a motor speed of 1797 r/min, comprising the normal state and nine fault conditions: IR007_0, B007_0, OR007 (fault diameter 0.1778 mm on the inner ring, rolling elements, and outer ring, respectively), IR014_0, B014_0, OR014 (fault diameter 0.3556 mm), and IR021_0, B021_0, OR021 (fault diameter 0.5334 mm). Only 1% of the data in our experiment was labeled. The second dataset consists of fault data from the laboratory of Politecnico di Torino; Figure 5(b) presents the experimental setup [8]. We used seven types of data: the normal state at a nominal speed of 100 Hz; damaged inner ring with indentation diameters of 450 μm, 250 μm, and 150 μm; and damaged roller with indentation diameters of 450 μm, 250 μm, and 150 μm. Again, only 1% of the data is labeled.

Evaluation of experimental results
Our experiment consists of two stages. In the first stage, the model network is trained on the unlabeled data U of the CWRU and DIRG datasets using the MoBy algorithm. Swin-T is trained for 300 epochs, including a 5-epoch warm-up stage. In the second stage, the backbone weights are frozen and a linear classifier is added; the limited labeled data L is used to train the classifier. The top-1 accuracy with central cropping is reported on the validation set. During training, following MoCo [9], we employ random crop and horizontal flip as data augmentations, with the crop scale sampled from [0.08, 1]. Training runs for 10 epochs, including a 5-epoch linear warm-up. The learning rate is set to the best value from the grid search {0.5, 0.75, 1.0, 1.25} for each pretrained model, and the weight decay is set to 0.1. After training, the highest top-1 accuracies on the test dataset Y are 96.4% on CWRU and 100% on DIRG.
To analyze the diagnostic results for each fault class in detail, we evaluated the CWRU and DIRG datasets with confusion matrices. The results are shown in Figure 6, where the columns represent the true fault types and the rows the predicted fault types. On the CWRU dataset, the proposed method achieves good diagnosis accuracy for most fault modes, with the majority of misclassifications concentrated in the OR014 category. This can be explained by the generally weaker fault signatures of the outer-ring faults. Beyond the confusion matrix, evaluating the extracted features via dimensionality reduction is also instructive; here we use the t-distributed stochastic neighbor embedding (t-SNE) algorithm to reduce the dimensionality of the data. Based on these evaluations, covering confusion matrices and feature visualizations on both datasets, this section validates the effectiveness of the proposed method. The model network together with the MoBy algorithm yields feature vectors with strong generalization capability and achieves high diagnostic accuracy with only 1% labeled data.

Comparison algorithm
To further validate the superiority of the proposed method, we conducted a comparative analysis against other fault diagnosis methods. First, we built a convolutional neural network-based autoencoder. The autoencoder is an unsupervised model that learns a compressed representation of the data, from which the input is reconstructed. It consists of two parts: an encoder and a decoder. The encoder uses two convolutional layers with pooling layers to gradually reduce the spatial dimensions of the images and introduces non-linearity through ReLU activations. The decoder employs deconvolutional layers to progressively restore the spatial dimensions, decoding the low-dimensional representation back into the original data. In the forward pass, the input is encoded and then decoded and reconstructed; the output consists of the encoded representation and the reconstruction. Through training, the autoencoder acquires a compact encoding of the input data and can reconstruct the data from it. The encoder is then extracted as a feature extractor, a classification layer is added, and this layer is fine-tuned while the other layers remain frozen. The second comparative algorithm is a Vision Transformer-based autoencoder, which uses the Vision Transformer structure to handle image data.
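The convolutional baseline above can be sketched in PyTorch (the framework the experiments use). The channel widths, kernel sizes, and input resolution below are illustrative assumptions; the paper only specifies two conv+pool encoder stages, ReLU non-linearities, and a deconvolutional decoder.

```python
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    """Two conv+pool encoder stages and a deconv decoder (sketch of the baseline)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(16, 3, 2, stride=2), nn.Sigmoid(),
        )

    def forward(self, x):
        code = self.encoder(x)            # compressed representation
        return code, self.decoder(code)   # code and reconstruction

model = ConvAutoencoder()
x = torch.rand(2, 3, 64, 64)              # stands in for spectrogram images
code, recon = model(x)
print(code.shape, recon.shape)  # torch.Size([2, 32, 16, 16]) torch.Size([2, 3, 64, 64])
```

After reconstruction pretraining (e.g. with an MSE loss), the encoder would be kept as the feature extractor and a classification layer fine-tuned on top, mirroring the protocol used for the proposed method.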
We ran both baseline algorithms on the two datasets, performing multiple experiments under consistent conditions and reporting the highest accuracy. The top-1 accuracies on the CWRU dataset were 71.1% and 86.9%, and on the DIRG dataset 67.1% and 70.9%, respectively. In addition, to test the robustness of the MoBy framework, we conducted ten few-shot fault diagnosis experiments; each time, the training and test sets were randomly re-divided with the same proportions and the feature extractor was retrained. As shown in Figure 8, our method achieves significantly higher accuracy than the other methods.

Conclusion
To overcome the heavy reliance of traditional supervised fault diagnosis methods on extensive labeled data, we propose a limited labeled data bearing fault diagnosis method based on the wavelet transform and self-supervised learning. On the CWRU and DIRG datasets, with a pre-trained feature encoder for feature extraction and fine-tuning on only 1% of labeled samples, the method achieves testing accuracies of 96.4% and 100%, respectively.

Figure 1. The overall framework of the proposed method.

Figure 2. Preprocessing of the original vibration signals.

Figure 3. (a) The pipeline of MoBy. (b) Structure of the online and target networks.
Figure 7 displays the two-dimensional t-SNE visualization results of the feature extraction on the CWRU and DIRG datasets.

Figure 8. Comparison of the results of the three methods.