Unsupervised Deep Transfer Learning Method for Rolling Bearing Fault Diagnosis Based on Improved Convolutional Neural Network

Rolling bearings are critical components widely used in mechanical equipment, and implementing intelligent fault diagnosis for them can improve equipment reliability. In this paper, a method named JMMD-CKDSCNet is proposed to address the task of fault diagnosis under unsupervised domain discrepancy scenarios. First, the convolutional kernel dropout (CKD) mechanism is introduced in the convolutional layers: a random mask sets part of the convolutional kernel weights to be inactive during training. Second, skip connections (SC) fuse the features of multiple shallow layers to preserve and transfer the original features. Finally, domain alignment is achieved using joint maximum mean discrepancy (JMMD), which measures the discrepancy between the joint feature distributions of different domains when the target domain lacks labeled data. The experimental results demonstrate that CKDSCNet exhibits superior generalization and outperforms other models in diagnostic accuracy. Compared with other domain adaptation methods, JMMD shows significant superiority, proving the application value of JMMD-CKDSCNet.


Introduction
Rolling bearings serve as fundamental components in machine tool spindles, motor rotors, and various other mechanical equipment. Bearing faults can lead to downtime and damage to the entire machine. Intelligent fault diagnosis (IFD) can be used to monitor bearing anomalies, predict potential faults, and take preventive measures, which is important for reducing maintenance costs and improving equipment reliability [1].
The development of big data and artificial intelligence technologies has allowed deep learning to be used for fault diagnosis in industry [2]. The convolutional neural network (CNN) has gained immense popularity in the field of IFD because of its versatility in handling diverse input types and dimensions, along with its efficient capability to extract multiscale features through convolutional layers [3][4][5][6]. However, a CNN with a deep, complex structure has enough parameters to fit the training data too closely, which can weaken its generalization performance through overfitting. In addition, CNNs mainly extract local features and cannot capture global features effectively. A further problem shared by CNNs and other deep neural networks is that backpropagation may cause the gradient to vanish, preventing the model from converging.
Training deep neural networks for fault diagnosis requires a substantial amount of labeled data, so current IFD research mainly uses supervised learning [7]. Most supervised learning methods assume that the training and test samples share the same distribution. However, machines often operate under varying working conditions or environments, producing discrepancies in the data distribution and features. Many scholars have employed transfer learning methods to address this issue [8,9], relying on fine-tuning the model with a small amount of labeled data from the target domain. However, unknown device failures and the high cost of labeling make it difficult to obtain labeled target-domain data in real engineering scenarios, which leads to the unsupervised domain discrepancy problem.
In this paper, JMMD-CKDSCNet is proposed to solve the above problems. The convolutional kernel dropout mechanism and skip connection structure are introduced to improve the CNN, while unsupervised domain adaptation is achieved using JMMD. The convolutional kernel dropout mechanism employs a random mask to deactivate part of the convolutional kernel weights, which reduces the number of kernel parameters involved in each convolution operation and improves the model's capacity for generalization. Skip connections give the model access to features at different levels from different network layers, helping it capture global features in the input data and alleviating the gradient vanishing issue. JMMD is an improved domain adaptation method based on maximum mean discrepancy (MMD); it removes the need to fine-tune the pretrained model with labeled target-domain data by measuring the joint distribution of the domains in feature space and aligning features without any labels in the target domain.

Convolutional Neural Network
The 1D-CNN is well-suited for processing sequential data, making it widely applicable in fields that collect one-dimensional time-domain data, such as fault diagnosis and quality prediction. In contrast, the 2D-CNN is better suited for processing image data and excels in tasks like image classification. In this paper, a fault diagnosis model is constructed based on the 1D-CNN.
Convolutional and pooling layers are the characteristic structures of a CNN. Convolutional layers operate by convolving one or multiple convolutional kernels over the input data to extract local features. Each convolutional kernel can capture different feature information, and the backpropagation algorithm adjusts the parameters of the convolutional kernels to reduce the value of the loss function. When the input features are processed by the $l$-th convolutional layer, the output features can be expressed as:

$$\mathbf{x}^{l} = \varphi\left(\mathbf{K}^{l} * \mathbf{x}^{l-1} + \mathbf{b}^{l}\right)$$

where $\mathbf{K}^{l}$ and $\mathbf{b}^{l}$ represent the weights and biases of the convolutional kernels in layer $l$, $\mathbf{x}^{l-1}$ is the output feature from layer $l-1$, $*$ denotes the convolution operation, and $\varphi(\cdot)$ is the activation function. The pooling layer down-samples the input features to reduce the feature dimensions while retaining the main feature information, which helps represent the input as higher-level features and reduces the computational load of the next layer. In this study, we employ the max pooling layer to construct the diagnostic model. Max pooling retains the maximum value in each pooled region, which keeps the model focused on the most salient features of the data. When the input features are processed by the $l$-th pooling layer, the output features can be expressed as:

$$x_{i}^{l} = \max_{j \in R_{i}} x_{j}^{l-1}$$

where $R_{i}$ denotes the $i$-th pooling region.
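As a concrete illustration of these two operations, the minimal PyTorch sketch below applies a 1D convolution followed by max pooling to a batch of signals; the layer sizes are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# Illustrative sizes only: 1 input channel (raw vibration signal),
# 16 convolutional kernels of width 3, pooling window of 2.
conv = nn.Conv1d(in_channels=1, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool1d(kernel_size=2)
relu = nn.ReLU()

x = torch.randn(64, 1, 1024)    # batch of 64 signals, 1024 samples each
features = pool(relu(conv(x)))  # x^l = maxpool(phi(K^l * x^{l-1} + b^l))
print(features.shape)           # torch.Size([64, 16, 512])
```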

Convolutional Kernel Dropout
A deep CNN can capture subtle features of the data, but the network may overfit when it is very deep and has many parameters. The CKD mechanism is designed to prevent the network from overlearning anomalous features in the training data. The dropout layer was originally designed for fully connected layers; with the CKD mechanism, dropout is instead applied to the convolutional kernels, setting part of the kernel weights to zero via a random mask so that the deactivated weights are not involved in the convolution operation. During the training phase, the dropout rate $p$ for the convolutional kernels follows a uniform distribution:

$$p \sim U\left(p_{\min}, p_{\max}\right)$$

The CKD mechanism lets the model select different convolutional kernel weights to drop based on the dropout rate, so each convolution operation uses a different subset of weights. This reduces the model's dependence on specific weights and enhances its feature extraction capacity by accommodating varied weights. Additionally, the CKD mechanism effectively reduces the number of active model parameters, thereby mitigating the potential for overfitting.
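A minimal sketch of how such a mechanism could be implemented in PyTorch is shown below; the per-weight masking granularity and the bounds of the uniform distribution are assumptions, since the paper does not specify them:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CKDConv1d(nn.Module):
    """1D convolution with convolutional kernel dropout: during training,
    a random binary mask deactivates part of the kernel weights so they
    do not take part in the convolution operation."""
    def __init__(self, in_ch, out_ch, kernel_size, p_min=0.0, p_max=0.5, **kwargs):
        super().__init__()
        self.conv = nn.Conv1d(in_ch, out_ch, kernel_size, **kwargs)
        self.p_min, self.p_max = p_min, p_max  # assumed bounds of U(p_min, p_max)

    def forward(self, x):
        weight = self.conv.weight
        if self.training:
            # Sample the dropout rate from a uniform distribution, then
            # zero out that fraction of kernel weights at random.
            p = torch.empty(1).uniform_(self.p_min, self.p_max).item()
            mask = (torch.rand_like(weight) >= p).float()
            weight = weight * mask
        return F.conv1d(x, weight, self.conv.bias,
                        stride=self.conv.stride, padding=self.conv.padding,
                        dilation=self.conv.dilation, groups=self.conv.groups)
```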

Skip Connection
The convolution operation in a CNN is based on local receptive fields, meaning that each convolutional kernel can capture only a small portion of local features from the input data. Furthermore, the hierarchical architecture of CNNs means they may not fully grasp the global structure and contextual information of the entire input. In addition, the gradient gradually decreases during backpropagation, leaving the weights in the bottom layers of the network almost un-updated and making it difficult for the model to converge.
In this paper, the skip connection structure is introduced into the CNN. Specifically, the skip connection fuses the output features of the first and second layers along the channel dimension and uses the fused features as input to the third convolutional layer. The merged features of the first and second layers are then fused with the output features of the third layer, and the result is used as input to the fourth convolutional layer.
This integration mechanism allows the CNN to fuse feature information from multiple shallower layers into the deeper layers, which helps alleviate gradient vanishing. In addition, skip connections let the model fuse local and global features at different levels, so the network can learn information at different scales, as shown in the sketch below.
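The wiring described above could be sketched in PyTorch as follows; the channel counts and kernel sizes are illustrative assumptions, and `padding=1` is used so that all feature maps share the same length and can be concatenated:

```python
import torch
import torch.nn as nn

class SCBlock(nn.Module):
    """Sketch of the skip-connection wiring: the outputs of conv1 and
    conv2 are concatenated along the channel dimension and fed to conv3;
    that concatenation is then fused with conv3's output before conv4."""
    def __init__(self, c1=16, c2=32, c3=64, c4=64):
        super().__init__()
        self.conv1 = nn.Conv1d(1, c1, 3, padding=1)
        self.conv2 = nn.Conv1d(c1, c2, 3, padding=1)
        self.conv3 = nn.Conv1d(c1 + c2, c3, 3, padding=1)
        self.conv4 = nn.Conv1d(c1 + c2 + c3, c4, 3, padding=1)

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(f1)
        f12 = torch.cat([f1, f2], dim=1)    # fuse shallow features (channel dim)
        f3 = self.conv3(f12)
        f123 = torch.cat([f12, f3], dim=1)  # fuse shallow and deeper features
        return self.conv4(f123)
```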

Structure of CKDSCNet
By incorporating the convolutional kernel dropout mechanism and the skip connection architecture, we refine the structure of the CNN, resulting in the CKDSCNet model depicted in figure 1. The model primarily comprises four convolutional layers, two pooling layers, and three fully connected layers.
The output features are normalized after each convolution operation using a batch normalization (BN) layer, which expedites the convergence of the model during training. Given the computational complexity of most activation functions, such as sigmoid and tanh, due to their exponential computations, the computationally efficient ReLU activation function is selected to model the non-linear relationships in the data. The adaptive max pooling layer gives the model greater flexibility by adapting to inputs of different sizes, allowing it to extract critical features among inputs of varying size, which enhances the generalization of the network. Two fully connected layers with the same number of output channels are employed in the CKDSCNet model: the first extracts and combines the features from the previous layers at a higher level, and the second further maps the output features of the first.
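Putting these pieces together, a skeleton of the model might look like the sketch below; it reuses the `SCBlock` and the assumptions from the earlier sketches, and the channel counts, pooling size, and fully connected widths are illustrative guesses rather than the paper's exact configuration:

```python
import torch.nn as nn
import torch.nn.functional as F

class CKDSCNet(nn.Module):
    """Sketch of the CKDSCNet layout described above; sizes are assumptions."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.block = SCBlock()               # conv1..conv4 with skip connections
        self.bn = nn.BatchNorm1d(64)         # BN after the convolutional stage
        self.pool = nn.AdaptiveMaxPool1d(4)  # adaptive max pooling
        self.fc1 = nn.Linear(64 * 4, 128)
        self.fc2 = nn.Linear(128, 128)       # two FC layers, same output width
        self.out = nn.Linear(128, n_classes)

    def forward(self, x):
        f = self.pool(F.relu(self.bn(self.block(x))))
        f = f.flatten(1)
        f = F.relu(self.fc2(F.relu(self.fc1(f))))
        return f, self.out(f)                # penultimate features and logits
```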
Maximum Mean Discrepancy
Domain adaptation methods, such as adaptive batch normalization (AdaBN) [10] and correlation alignment (CORAL) [11], reduce the discrepancy between different domains and improve the performance of pretrained models in the target domain. Maximum mean discrepancy (MMD) [12] measures the distance between the mean embeddings of two distributions in a reproducing kernel Hilbert space. MMD is added to the loss function to realize feature alignment, and the loss can be written as [13]:

$$\mathcal{L}_{\mathrm{MMD}} = \left\| \mathbb{E}_{P}\left[\phi\left(\mathbf{x}^{s}\right)\right] - \mathbb{E}_{Q}\left[\phi\left(\mathbf{x}^{t}\right)\right] \right\|_{\mathcal{H}_{k}}^{2}$$

where $\mathcal{H}_{k}$ is the reproducing kernel Hilbert space induced by the kernel $k$, $\phi(\cdot)$ represents the mapping function, and $\mathbb{E}$ denotes the mathematical expectation.
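To make the estimate concrete, the sketch below computes a biased empirical estimate of the squared MMD between two feature batches using a Gaussian kernel; the single fixed bandwidth `sigma` is a simplifying assumption (multi-kernel variants such as MK-MMD combine several bandwidths):

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    """Gaussian RBF kernel matrix between two [batch, dim] feature batches."""
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2 * sigma ** 2))

def mmd_loss(source, target, sigma=1.0):
    """Biased empirical estimate of the squared MMD."""
    k_ss = gaussian_kernel(source, source, sigma).mean()
    k_tt = gaussian_kernel(target, target, sigma).mean()
    k_st = gaussian_kernel(source, target, sigma).mean()
    return k_ss + k_tt - 2 * k_st
```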

Joint Maximum Mean Discrepancy
The limitation of MMD is that it emphasizes the marginal distributions of the datasets: MMD is specifically designed to tackle the marginal distribution shift $P\left(\mathbf{x}^{s}\right) \neq Q\left(\mathbf{x}^{t}\right)$. However, the discrepancies between samples and labels still remain in the activations of multiple higher network layers, so the joint distributions of these activations should also be aligned. Joint maximum mean discrepancy (JMMD) is designed to measure the joint distribution discrepancy between datasets, allowing it to accurately measure the differences between domains in a transfer learning task.
It is assumed that the source dataset with $n_{s}$ labeled samples is defined as $\mathcal{D}_{s} = \left\{\left(\mathbf{x}_{i}^{s}, y_{i}^{s}\right)\right\}_{i=1}^{n_{s}}$, and the target dataset with $n_{t}$ unlabeled samples is defined as $\mathcal{D}_{t} = \left\{\mathbf{x}_{j}^{t}\right\}_{j=1}^{n_{t}}$. JMMD is added to the loss function to realize feature alignment, and the loss can be written as [13]:

$$\mathcal{L}_{\mathrm{JMMD}} = \left\| \mathbb{E}_{P}\left[\bigotimes_{l \in L} \phi^{l}\left(\mathbf{z}^{s l}\right)\right] - \mathbb{E}_{Q}\left[\bigotimes_{l \in L} \phi^{l}\left(\mathbf{z}^{t l}\right)\right] \right\|_{\bigotimes_{l \in L} \mathcal{H}^{l}}^{2}$$

where $\phi^{l}(\cdot)$ is the feature mapping of layer $l$, $L$ is the set of higher network layers, $|L|$ is the number of these layers, and $\mathbf{z}^{s l}$ and $\mathbf{z}^{t l}$ denote the activations of the $l$-th layer generated by the source domain and target domain, respectively.
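Reusing `gaussian_kernel` from the MMD sketch above, an empirical joint MMD estimate can be formed by multiplying the per-layer kernel matrices elementwise before averaging, which realizes the tensor-product kernel in the formula; this is a sketch under the assumption of equal source and target batch sizes, not the paper's exact implementation:

```python
def jmmd_loss(source_feats, target_feats, sigma=1.0):
    """Sketch of an empirical joint MMD estimate. source_feats and
    target_feats are lists of [batch, dim] tensors, one per adapted
    layer (e.g. the last feature layer and the softmax outputs)."""
    k_ss = k_tt = k_st = 1.0
    for s, t in zip(source_feats, target_feats):
        # Elementwise product of per-layer kernels corresponds to the
        # tensor-product kernel over the joint activations.
        k_ss = k_ss * gaussian_kernel(s, s, sigma)
        k_tt = k_tt * gaussian_kernel(t, t, sigma)
        k_st = k_st * gaussian_kernel(s, t, sigma)
    return k_ss.mean() + k_tt.mean() - 2 * k_st.mean()
```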

Fault Diagnosis Method based on JMMD-CKDSCNet
In this paper, JMMD is applied to the CKDSCNet diagnostic model, and the structure of JMMD-CKDSCNet is shown in figure 2, where $\mathcal{L}_{c}$ is the cross-entropy loss.
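One training step of such a model might look like the following sketch, which reuses the `CKDSCNet` and `jmmd_loss` sketches above; the data loaders, the trade-off weight `lam`, and the choice of adapted layers are hypothetical settings used only for illustration:

```python
import torch
import torch.nn.functional as F

model = CKDSCNet()  # sketch above; returns (features, logits)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
lam = 1.0           # assumed trade-off weight between the two losses

for (xs, ys), (xt, _) in zip(source_loader, target_loader):
    fs, logits_s = model(xs)   # labeled source batch
    ft, logits_t = model(xt)   # unlabeled target batch
    loss_c = F.cross_entropy(logits_s, ys)
    loss_j = jmmd_loss([fs, logits_s.softmax(dim=1)],
                       [ft, logits_t.softmax(dim=1)])
    loss = loss_c + lam * loss_j  # total loss: L_c + lambda * L_JMMD
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```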

Dataset Description and Model Training Environment
The dataset is constructed from part of the drive-end data sampled at 12 kHz in the Case Western Reserve University (CWRU) bearing dataset. As shown in table 2, four health states (inner race fault (IF), ball fault (BF), outer race fault (OF), and normal state (NA)) are divided into 10 classes of labeled data according to fault diameter. As shown in table 3, to set up the transfer learning tasks, four working conditions with discrepancies (0HP-3HP) are used as four different domains.
The initial learning rate is set to 0.001, and the batch size is set to 64. In the source dataset, 80% of the data is used as the training set and 20% as the validation set. The target dataset is divided into a domain adaptation set and a test set in the same proportion. Training runs for 300 epochs. All experiments are conducted on a system running Windows 11 with PyTorch 1.7; the hardware configuration includes an Intel Core i5-10210U CPU and a GeForce MX250 GPU.

For task 0-3, the input layer and output layer feature visualization results of JMMD-CKDSCNet in the third experiment are depicted in figure 5 and figure 6, respectively. The features in the input layer are chaotic and disordered, while the boundaries of the 10 classes in the output layer are well-defined, indicating that the model has strong adaptive feature extraction and classification capabilities. As shown in figure 7, the confusion matrix of the third experiment in task 0-3 shows the performance of JMMD-CKDSCNet in each category, further confirming that JMMD-CKDSCNet performs well in the cross-domain diagnosis task.

Conclusion
In real engineering scenarios, where significant differences exist between the source and target domains due to working conditions and labeled data is absent in the target domain, JMMD-CKDSCNet can be employed for fault diagnosis tasks and achieves high diagnostic accuracy under such circumstances. The performance of JMMD-CKDSCNet is verified by several experiments, and the following conclusions are drawn:
• The convolutional kernel dropout mechanism enhances the model's ability to extract local features. Additionally, dynamically reducing the number of active convolutional kernel parameters improves the model's generalization performance, which is particularly beneficial for fault diagnosis under varying working conditions.
• The skip connection structure fuses multiple shallow features into the deeper layers of the network to capture global feature information. This structural improvement alleviates gradient vanishing during training and raises the model's proficiency in extracting global features.
• JMMD realizes feature alignment by measuring the joint distribution differences between domains when the target dataset lacks labeled data. The application of JMMD allows the trained model to be transferred directly to other complex diagnostic tasks.

Figure 5. Visualization of input features.

Figure 6. Visualization of output features.

Figure 7. Confusion matrix of classification results.

In addition, to further verify that the JMMD method can effectively realize feature alignment, JMMD is compared with AdaBN, CORAL, and MK-MMD in comparison experiments. As shown in figure 8 and figure 9, regardless of whether the model is transferred directly for testing on task 0 or task 3, the average diagnostic accuracy of JMMD-CKDSCNet exceeds 99%. This indicates that JMMD aligns features across domains more effectively than the other domain adaptation methods.

Table 2. Class labels of the CWRU dataset.

Table 3. Four working conditions in the CWRU dataset.

To verify that the CKD mechanism can improve the feature extraction ability and generalization of the model, four experiments are executed on SCNet and CKDSCNet, where SCNet is the same model without the CKD mechanism. The diagnostic accuracy of SCNet and CKDSCNet is shown in table 4. Training on the same source domain and validating on different target domains, CKDSCNet consistently outperforms SCNet in accuracy, indicating that CKDSCNet possesses better feature extraction capabilities. Under varying operating conditions, the diagnostic accuracy of CKDSCNet also fluctuates less than that of SCNet, demonstrating a more stable performance across different target-domain datasets. Consequently, CKDSCNet possesses superior generalization capabilities.

Table 4. Accuracy of SCNet and CKDSCNet.

Comparative Experiments
To further verify that the structural improvements of CKDSCNet are effective, the model is compared with other fault diagnosis models. The model is pre-trained on task 0 and task 3, and then transferred directly for testing on task 3 and task 0, respectively. The fault diagnosis accuracy of DCNN, ResNet, and CKDSCNet under the different transfer tasks is shown in figure 3 and figure 4, respectively. DCNN is a convolutional neural network consisting of four convolutional layers without the CKD mechanism and SC architecture; ResNet is a residual neural network with four convolutional blocks. The experimental results show that, regardless of whether JMMD is applied, CKDSCNet consistently achieves higher diagnostic accuracy than DCNN and ResNet, indicating that the convolutional kernel dropout mechanism and skip connection structure enhance the model's feature extraction and generalization capabilities.