A novel bearing fault diagnosis method under small samples using time-frequency multi-scale convolution layer and hybrid attention mechanism module

Deep neural networks for bearing fault diagnosis have become a focus of research in recent years owing to their excellent feature extraction capability. However, diagnosis under small samples remains an open problem in industrial applications, because bearings rarely operate in a fault state in practice, which makes fault data scarce. To solve this problem, this paper proposes a new diagnosis model, a time-frequency multi-scale attention network, whose structure allows the original signal and its transformed spectrum to be used as inputs in parallel. A multi-scale convolutional layer is designed to extract information from the signal at different scales and thereby enhance the feature extraction capability of the network. In addition, a hybrid attention mechanism is added to integrate redundant features and realize complementarity between features. Experimental results for seven bearing diagnosis cases from two bearings show that the proposed method achieves high diagnostic accuracy under small samples, which demonstrates its superiority. The model was also trained with the time domain signal alone and with the frequency domain signal alone as input. Comparing these accuracies with the accuracy obtained from the combined time-frequency input demonstrates the advantage of the combined input.


Introduction
Bearings are among the most important components of modern industrial equipment and are widely used in rotating machinery [1]. However, owing to long-term operation under high speed, heavy load, and strong impact, bearings are prone to wear, spalling, and other failures. If these failures are not handled in time to restore the bearings to a healthy state, the performance of the mechanical equipment will decline, and safety accidents with huge losses may even result [2][3][4]. Driven by such industrial needs, mechanical fault diagnosis technology has achieved significant results in the past few decades. Signal processing methods such as the Fourier transform (FT), empirical mode decomposition, wavelet transform, and variational mode decomposition have produced a series of academic achievements and industrial applications in dealing with strong noise interference [5][6][7][8]. In recent years, with the development of artificial intelligence technologies such as machine learning (ML) and deep learning (DL), intelligent fault diagnosis (IFD) methods for bearings have made great progress and have become a popular means of identifying bearing faults [9][10][11]. However, to achieve high diagnostic ability, these traditional IFD methods often require a large amount of labeled data so that the models can be adequately trained.
Data are the basis for implementing intelligent fault diagnosis methods. Broadly speaking, three kinds of data can be collected: simulation data, laboratory data, and engineering monitoring data. Although simulation data and laboratory data provide sufficient fault data, they can hardly reflect the complex characteristics of actual machines directly. Meanwhile, in actual engineering applications, data with valid labels are very difficult to obtain and few in number [12]. Therefore, this paper specifically investigates the problem of few samples from the perspective of engineering applications.
Because data are difficult to obtain, there is often not enough labeled data to meet the training requirements of the model when facing actual industrial problems. When only a few samples are available for training, the deep network cannot learn the most effective fault features and is prone to over-fitting. Over-fitting reduces the model's generalization performance, which lowers the accuracy of fault pattern recognition and poses great challenges for IFD methods.
The problem of fault diagnosis under small samples has attracted the attention of many researchers in recent years, and some methods have been proposed. These methods can be divided into two main categories according to different optimization objects: data-based methods (DBMs) and model-based methods (MBMs).
DBMs focus on reducing the scarcity of information under small samples to improve the model's generalization performance during learning, for example through data augmentation (DA) and transfer learning (TL). Hu et al proposed a DA algorithm that uses a resampling technique to simulate data under different rotating speeds and working loads, which can be regarded as a solution both for few-shot learning and for enhancing the generalization ability of models [13]. Pei et al proposed an enhanced few-shot Wasserstein auto-encoder (fs-WAE) motivated by optimal transport cost, which promotes the diversity and authenticity of the generated samples [14]. Zhang et al proposed a DL-based synthetic over-sampling method in which generative adversarial networks (GANs) were used to generate additional realistic fake samples and expand the available dataset [15]. Zhou et al proposed another GAN-based method to synthesize fault instances, in which an auxiliary loss of triplet form was introduced into the original loss function to enhance the quality of the generated samples [16].
TL methods generate additional beneficial knowledge by learning a source task similar to the target task, thereby improving model performance on the target task with few samples. Chen et al proposed a hierarchy-guided TL framework for fault recognition with few-shot samples, which extracted and transferred fault knowledge between similar tasks via TL techniques [17]. Zhang et al pretrained the model on source domain samples to obtain a good feature encoder, fixed the encoder, and then fine-tuned the classifier module with a small amount of target domain data, which is a typical TL method [18]. Wu et al constructed a few-shot TL method utilizing meta-learning for few-shot diagnosis under variable conditions, which transferred knowledge from artificial-fault bearings to natural-fault bearings [19].
The focus of this article is MBMs, whose purpose is to optimize the network structure so as to improve the feature extraction ability of the model and the results of fault diagnosis. Ren et al proposed a capsule auto-encoder model, which extracted multiple meaningful feature capsules, fused them by the dynamic routing algorithm, and reduced the dependence on the number of samples [20]. Zhang et al developed a Siamese neural network model based on deep convolutional neural networks with wide first-layer kernels (WDCNN), which can acquire better feature representations [21]. Ye et al proposed a novel U-Net with CapsNet (UN-CN), which reduced the loss of features in the pooling process and ensured the integrity of the features to achieve better fault diagnosis results [22]. An et al proposed a few-shot fault diagnosis method for rolling bearings using local descriptors, which made full use of descriptors with low discriminability to improve the distinguishing ability [23]. To extract more effective and discriminative features, Lv et al introduced squeeze-and-excitation networks as an attention module that enhances effective features and weakens invalid features [24]. Wang et al proposed a one-dimensional convolutional neural network (CNN) with an attention mechanism (AM), which made the CNN pay more attention to the informative parts of the fault signals to extract discriminative features [25]. Chen et al proposed a transformer-based network with shifted windows, which computed self-attention within each non-overlapping window to improve the recognition accuracy of the model [26].
Although the above methods for IFD under small samples have made a series of achievements, some problems and challenges remain. On the one hand, the characteristics of bearing failures appear at different scales. Some global features need to be detected from a large-scale perspective, whereas some local changes require a small-scale perspective to be found in time. However, deep neural networks (DNNs) represented by the CNN typically use convolution kernels of a single size in each layer, which is ill-suited to the multi-scale features contained in the signal. Using single-size kernels in each layer cannot extract comprehensive fault features and degrades the diagnostic performance of IFD models. On the other hand, the time domain is only one of the angles from which signals can be observed. The frequency domain can fully express the periodic characteristics of rotating components such as bearings. Extracting appropriate features from a single signal domain is more difficult than from multiple signal domains and places greater demands on the learning ability of the model.
To solve the above-mentioned problems, this study proposes a new time-frequency multi-scale attention network (TFMSAN) for bearing fault diagnosis under small samples. TFMSAN takes both the time domain representation and the frequency domain representation of the signal as input, which makes the input more comprehensive. This kind of input improves the feature extraction ability of the model, because the model can learn effective features more easily when the input is more comprehensive. To extract multi-scale features from the input, a multi-scale parallel architecture is designed in the TFMSAN, which performs convolution with kernels of different sizes at the same time. In addition, a hybrid attention framework (hybrid AM) is constructed to integrate the redundant multi-scale features of the time and frequency domains. The hybrid AM includes both intra-domain and inter-domain AMs. Both are added to the TFMSAN simultaneously, realizing effective complementarity between different domains and ensuring the generalization ability of the TFMSAN. In general, the contributions of this paper are as follows: (1) A TFMSAN is proposed for handling the IFD problem under small samples. The background and the principle of the proposed method are introduced in detail in section 2, and the structural framework of the proposed method is elaborated in section 3. The experimental layout and the analysis of the results are described in section 4, while section 5 summarizes the results and the outlook for the future.

Convolution layer
The CNN was first proposed by LeCun et al in 1989 [27]. CNNs are widely used in computer vision and natural language processing (NLP) because of three characteristics, namely sparse interactions, parameter sharing, and equivariant representations, which greatly improve the network's ability to extract deep features [28]. Because the structure of a CNN can automatically mine deep abstract features of the input data, it has recently also been used in IFD. The convolution layer is the core part of the CNN, in which the kernel slides over the input and a convolution is computed at each position. Assuming a 2D input I and a 2D kernel K, the process of convolution can be expressed as:
$$S(i, j) = (I * K)(i, j) = \sum_{m} \sum_{n} I(m, n)\, K(i - m, j - n) \tag{1}$$
Figure 1 shows the differences in convolution operations for kernels of different sizes. It can be seen that large convolution kernels have a larger receptive field and can capture features at larger scales, while small kernels can find subtle features.
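The effect of kernel size described above can be illustrated with a minimal sketch in plain Python. The function name `conv2d_valid`, the 6 × 6 test image, and the averaging kernels are illustrative assumptions, not part of the original method. The sketch only shows that a larger kernel aggregates a larger neighborhood of the input per output value.

```python
def conv2d_valid(image, kernel):
    """Discrete 2D convolution of `image` with `kernel` (hypothetical helper).

    Only positions where the kernel fits entirely inside the image are
    computed, so the output shrinks by (kernel size - 1) in each dimension.
    """
    ih, iw = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(ih - kh + 1):
        row = []
        for j in range(iw - kw + 1):
            acc = 0.0
            for m in range(kh):
                for n in range(kw):
                    # The kernel is indexed in reverse, which is what
                    # distinguishes convolution from cross-correlation.
                    acc += image[i + m][j + n] * kernel[kh - 1 - m][kw - 1 - n]
            row.append(acc)
        out.append(row)
    return out


# Illustrative 6 x 6 input whose entries are 0, 1, ..., 35.
image = [[float(r * 6 + c) for c in range(6)] for r in range(6)]
small = [[1.0 / 4] * 2 for _ in range(2)]    # 2 x 2 averaging kernel
large = [[1.0 / 16] * 4 for _ in range(4)]   # 4 x 4 averaging kernel

out_small = conv2d_valid(image, small)   # 5 x 5 output, each value sees 4 inputs
out_large = conv2d_valid(image, large)   # 3 x 3 output, each value sees 16 inputs
```

Each value of `out_large` summarizes a 4 × 4 region, whereas each value of `out_small` responds to a 2 × 2 region only, which mirrors the large-scale versus subtle-feature distinction in the text.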

AM
Since the AM was formally proposed in 2014 [29], it has made great progress in the field of artificial intelligence, especially in NLP. The AM allows neural networks to pay more attention to the relevant information in the input and less attention to unrelated information. Because of this advantage, the AM has also attracted the attention of many scholars in the field of fault diagnosis [30]. As shown in figure 2(b), there are three core concepts in the AM: query, key, and value. An attention function can be described as mapping a query and a set of key-value pairs to an output, where the query, keys, values, and output are all vectors. The output is computed as a weighted sum of the values, where the weight assigned to each value is computed by a compatibility function of the query with the corresponding key [31]. In practice, researchers usually do not calculate the compatibility of each query and key separately, but pack multiple queries into a matrix for calculation. The AM can be expressed as formula (2):
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{T}}{\sqrt{d_k}}\right) V \tag{2}$$
where Q is the matrix of queries, K is the matrix of keys, V is the matrix of values, and $\sqrt{d_k}$ is a scale factor. In this formula, the compatibility between the query and the key is calculated in the form of a dot product. Whereas a single AM offers a limited representation of the information, the multi-head AM shown in figure 2(c) introduces multiple attention functions, enabling the model to discover information of interest from multiple perspectives and obtain a richer representation. As shown in figure 2(a), self-attention [32] is a variant of the AM that generates the query, key, and value autonomously without relying on external information, allowing the model to notice correlations between different parts of the whole input. Self-attention is widely used in this study to mine efficient fault features.
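The scaled dot-product attention described above can be sketched in plain Python as follows. The toy matrix `X` and the use of identity projections (so that the queries, keys, and values are all `X` itself) are simplifying assumptions for illustration. A practical self-attention layer would first apply learned linear projections to obtain Q, K, and V.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(Q, K, V):
    """Return (outputs, weights) for lists of query, key, and value vectors."""
    d_k = len(K[0])
    outputs, weights = [], []
    for q in Q:
        # Compatibility of this query with every key: dot product / sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Output is the weighted sum of the value vectors.
        outputs.append([sum(wj * v[c] for wj, v in zip(w, V))
                        for c in range(len(V[0]))])
    return outputs, weights


# Self-attention on a toy input: Q, K, and V all come from the same X
# (identity projections are an assumption made to keep the sketch short).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = scaled_dot_product_attention(X, X, X)
```

Each row of `weights` sums to one, and the first query `[1, 0]` attends more strongly to the first and third inputs, which share its nonzero component, than to the second.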

Time-frequency multi-scale framework
The time domain signal provides an intuitive representation of the measurement results of the physical quantity, which accurately reflects the change in the physical quantity over time.
The frequency domain provides an additional perspective for observing the signal and describes the frequency structure of the signal in detail. The frequency domain representation of a signal can be obtained from the time domain representation by the FT, as shown in equation (3):
$$X(f) = \int_{-\infty}^{\infty} x(t)\, e^{-j 2 \pi f t}\, \mathrm{d}t \tag{3}$$
where x(t) is the time domain signal and X(f) is its frequency domain representation. When analyzing a signal, it is necessary to link the characteristics of the time domain and the frequency domain and to strike a trade-off between them. Compared with using the time domain or the frequency domain alone, using them together as the object of analysis provides a more comprehensive understanding of the signal. This study constructs a time-frequency parallel architecture that uses the original time domain signal and the spectrum obtained by the fast FT (FFT) as the network's input. The architecture provides comprehensive time-frequency information to the network, making it easier for the network to extract sensitive fault features during training, although the features may be redundant.
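A minimal sketch of how the two parallel inputs could be prepared is given below. It uses a direct discrete FT written in plain Python, which yields the same spectrum as an FFT but more slowly. The function name, the 64 Hz sampling rate, and the 8 Hz test tone are illustrative assumptions, and keeping only the one-sided magnitude spectrum is likewise an assumption about the preprocessing.

```python
import cmath
import math


def magnitude_spectrum(signal):
    """One-sided magnitude spectrum via a direct DFT (same result as an FFT)."""
    n = len(signal)
    spectrum = []
    for k in range(n // 2):
        acc = sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n)
                  for t in range(n))
        spectrum.append(abs(acc) / n)
    return spectrum


fs = 64          # hypothetical sampling frequency in Hz
n = 64           # number of samples, so the bin spacing is fs / n = 1 Hz
signal = [math.sin(2 * math.pi * 8 * t / fs) for t in range(n)]  # 8 Hz tone

spectrum = magnitude_spectrum(signal)
peak_bin = max(range(len(spectrum)), key=spectrum.__getitem__)

# The two representations would be fed to the two parallel branches.
time_branch_input = signal
freq_branch_input = spectrum
```

For this pure tone the periodic structure that is spread over all 64 time samples collapses into a single spectral peak at bin 8, which illustrates why the frequency view is well suited to periodic fault signatures.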
Signals collected from mechanical equipment are usually complex and varied, consisting of a large number of different components and noise. The fault information of the equipment is contained in components at multiple scales, which makes it difficult to mine appropriate fault features with convolution kernels of a single scale. In this paper, a multi-scale kernel convolution network is established to obtain feature extraction results at multiple scales by convolving the input signal with kernels of different sizes simultaneously. The multi-scale convolution process can be expressed as:
$$y = \mathrm{Concat}\left(x * \omega_1,\; x * \omega_2,\; \ldots,\; x * \omega_M\right) \tag{4}$$
where x is the input of the multi-scale convolution (MSConv) layer, y is its output, $\omega_m$ is the m-th kernel, and M is the number of kernel types with different sizes. In the MSConv layer, the input x is first convolved with the M kernels, and the results are concatenated into a single tensor as the output of the MSConv layer. In general, this paper proposes a time-frequency multi-scale framework, namely TFMSF, which includes the time-frequency parallel architecture and the MSConv layer, as shown in figure 3. First, the original signal is transformed by FFT to obtain its spectrum, and then the time domain signal and the spectrum are used as the input of the MSConv layer. There are three modules in MSConv: convolutional operations of different scales, a batch normalization (BN) layer, and a maximum pooling layer. Features from different domains and different scales, extracted by the MSConv layer, are finally combined into a feature vector in TFMSF. This framework can be used to extract multi-scale information of the time domain and frequency domain from the original fault signal, which ensures the completeness of the information and enhances the network's ability to extract fault information under small samples.
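The convolve-then-concatenate step of the MSConv layer can be sketched in plain Python as below. The helper names, the toy impulse signal, and the fixed moving-average kernels of length 3, 5, and 7 are assumptions for illustration; in the network the kernels are learned, and the BN and max pooling modules are omitted here. As in common DL frameworks, the sliding product is computed without flipping the kernel.

```python
def conv1d_same(x, kernel):
    """Slide `kernel` over zero-padded `x` so the output keeps len(x) samples."""
    k = len(kernel)
    left = (k - 1) // 2
    right = k - 1 - left
    padded = [0.0] * left + list(x) + [0.0] * right
    return [sum(padded[i + j] * kernel[j] for j in range(k))
            for i in range(len(x))]


def msconv(x, kernels):
    """Apply every kernel to x and stack the results as separate channels."""
    return [conv1d_same(x, w) for w in kernels]


# Toy signal with two impulses, loosely mimicking periodic impacts.
x = [0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0, 0.0]

# Hypothetical kernels of three sizes (M = 3).
kernels = [[1.0 / 3] * 3, [1.0 / 5] * 5, [1.0 / 7] * 7]

features = msconv(x, kernels)   # M channels, each with len(x) samples
```

Because every branch preserves the input length, the M outputs can be stacked directly along the channel axis, which is one straightforward way to realize the concatenation.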

TFMSAN
Although a network guided by the TFMSF can mine multi-scale fault features more comprehensively, the extracted features are often redundant because the information contained in the time and frequency domains overlaps. In addition, when mechanical equipment fails, some of the generated characteristics are discontinuous and periodic and are not always present in the signal. For example, when there is a single-point defect in the outer ring of the bearing, the balls pass over the defect and produce an impact vibration in every revolution. The traditional CNN pays the same attention to the data at different moments and lacks the ability to capture the segments that carry fault information, which leads to the extraction of some features unrelated to the fault. The proposed TFMSAN introduces a hybrid AM containing multiple AMs, namely intra-domain AMs (intra-AM) and an inter-domain AM (inter-AM). The original signal and its spectrum are first fed into the MSConv layer to obtain multi-scale features, after which the intra-AM is used to mine the fault-sensitive features. The resulting time-domain and frequency-domain features are then jointly used as the input of the inter-AM to achieve fusion and enhancement, which improves the generalization ability of the network. Finally, a classifier module made of two fully connected linear layers identifies the fault classes based on the extracted multi-scale features. The classification error is calculated using the common cross-entropy loss function, as shown in equation (5), and the network is trained by updating the parameters with back-propagation.
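For reference, the cross-entropy loss for a single sample can be sketched in plain Python as follows. The function name and the example logits are illustrative assumptions, with the four logits standing for four classes. The sketch uses the standard softmax cross-entropy with an integer class label, which may differ in notation from equation (5).

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one sample: -log softmax(logits)[target]."""
    m = max(logits)
    # log-sum-exp computed stably by subtracting the maximum first.
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]


# Four classes; the true class index is 2 in both examples.
loss_uniform = cross_entropy([0.0, 0.0, 0.0, 0.0], 2)     # no preference
loss_confident = cross_entropy([0.1, 0.2, 5.0, 0.3], 2)   # favors class 2
```

An uninformative prediction over four classes gives a loss of log 4, and the loss falls toward zero as the logit of the true class dominates.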

Experiment
In this section, seven bearing diagnostic experimental scenarios from two bearings are used to demonstrate the performance of the proposed TFMSAN. In TFMSAN, the convolutional layers are divided into three types according to kernel size, namely 5 × 1, 9 × 1, and 16 × 1, and the number of convolution kernels is set to 32 × 1. To ensure that outputs of the same size are obtained at the different convolution scales, the stride of the convolution is set to 1 and the padding is set to 2, 4, and 8, respectively. The number of heads and the embedding dimension of the intra-AM are set to 2 and 32, respectively, while those of the inter-AM are set to 4 and 32.
The classifier consists of two fully connected linear layers with (input size, output size) of (1920, 300) and (300, 4). The leaky rectified linear unit (LeakyReLU) activation is adopted for the whole network, and BN is used for normalization. The full structural parameters of the network are shown in table 1.
During training, 3 (or 5) samples of each fault type are randomly selected as the training set, and the remaining samples are used as the test set to satisfy the hypothesis of the IFD problem under small samples. Training is performed using a stochastic gradient descent optimizer with 1000 epochs per experiment. The initial learning rate (LR) is set to 0.001, and to ensure the convergence of the model, the LR decreases exponentially as training progresses, as shown in equation (6), where initial_lr is the initial LR and epoch is the number of completed training epochs. β and γ are two parameters that control the rate of change of the LR and are empirically set to 0.01 and −0.75.
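The relation between kernel size, padding, and output length can be checked with the usual formula for a strided convolution, sketched below. The function name is hypothetical, and the input length of 2560 is taken from the window length used in the dataset description. The formula is the standard one and is assumed here to match the implementation.

```python
def conv_output_length(n, kernel, padding, stride=1):
    """Standard output length of a 1D convolution with symmetric zero padding."""
    return (n + 2 * padding - kernel) // stride + 1


n = 2560  # window length used for each sample
lengths = {k: conv_output_length(n, k, p)
           for k, p in [(5, 2), (9, 4), (16, 8)]}
# lengths == {5: 2560, 9: 2560, 16: 2561}
```

Under this formula the odd kernels 5 and 9 preserve the length exactly, while the even kernel of size 16 with symmetric padding 8 yields one extra sample. An implementation following these settings would presumably trim that sample or pad asymmetrically before concatenation, but how this is handled is not stated in the text.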
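Equation (6) itself is not reproduced in the text above, so the following sketch assumes one commonly used decay form that is consistent with the named symbols, lr = initial_lr × (1 + β × epoch)^γ. This form is an assumption rather than a confirmed statement of the schedule, and strictly speaking it is a power-law decay. The parameter values are those given in the text.

```python
def learning_rate(epoch, initial_lr=0.001, beta=0.01, gamma=-0.75):
    """ASSUMED schedule: initial_lr * (1 + beta * epoch) ** gamma.

    The functional form is a guess consistent with the symbols
    initial_lr, epoch, beta, and gamma; only the values come from the text.
    """
    return initial_lr * (1.0 + beta * epoch) ** gamma


# LR at a few points of the 1000-epoch training run.
schedule = [learning_rate(e) for e in (0, 100, 500, 1000)]
```

With these values the LR starts at 0.001 and decreases monotonically, falling to roughly 0.0006 after 100 epochs under the assumed form.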
To validate the superiority of the proposed TFMSAN, the k-nearest neighbors (KNN), support vector machine (SVM), random forest (RF), CNN, and wide kernel CNN (WKCNN) were selected for comparison. Among them, KNN, SVM, and RF use the mature versions from the scikit-learn module.
The CNN uses a convolutional structure similar to that of the TFMSAN, containing four convolution layers, but using only the original signal as the input. WKCNN utilizes a wider convolutional kernel to extract one-dimensional signal features more efficiently [33]. All these comparison methods were fine-tuned to achieve the best experimental results on the data set used in this paper.

Dataset description.
The data sets used in this study were collected from two different bearing failure simulation test benches to validate the proposed method. Each data set includes four data types, namely bearing data under the healthy condition and bearing data under inner ring failure, outer ring failure, and ball failure. The first data set is from the Case Western Reserve University Bearing Data Center. As shown in figure 5(a), the bearing test bench includes a two-horsepower motor, a torque sensor, a power meter, and an electronic control system. The bearing type under test is 6205-2RS JEM SKF. The bearing faults include inner ring failures, outer ring failures, and ball failures, which are single-point artificial failures machined at the corresponding locations. The motor load ranges from 0 to 3 hp and the speed lies between 1720 and 1797 rpm. The vibration signal used in this paper was collected from the drive-side bearing measurement point with a sampling frequency of 12 kHz. More detailed information can be found in [34].
The second data set is from the SpectraQuest (SQ) test bench, as shown in figure 5(b). It consists of a single-phase asynchronous motor as the power output, a heavy load of 5 kg, and the test bearing on the right side of the rig. The type of bearing under test is ER16K. Inner ring failures, outer ring failures, and rolling element failures were manufactured manually in the test bearing, as in the first data set. Three speeds were used, 300, 600, and 900 rpm, and the sampling frequency is 51.2 kHz.
To verify the effectiveness of the proposed method under small samples, the data from the two test benches were divided into seven cases depending on speed and load. Each case includes normal bearing data, inner ring fault data, outer ring fault data, and rolling element fault data. All of the data were segmented by time windows of length 2560 with a step of 400. In each case, 3 (or 5) samples of each fault type were randomly selected as the training set, and the remaining samples were used as the test set to simulate small-sample scenarios. The details of the experimental data are shown in table 2.
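The segmentation and few-shot split described above can be sketched as follows. The helper names, the record length of 12 000 points, the class labels, and the zero-valued placeholder records are illustrative assumptions. Only the window length of 2560, the step of 400, and the choice of three training samples per class come from the text.

```python
import random


def sliding_windows(signal, length=2560, step=400):
    """Cut a 1D record into overlapping windows of `length` every `step` points."""
    return [signal[s:s + length]
            for s in range(0, len(signal) - length + 1, step)]


def few_shot_split(samples_by_class, n_train, seed=0):
    """Randomly pick n_train samples per class for training; the rest is test."""
    rng = random.Random(seed)
    train, test = [], []
    for label, samples in samples_by_class.items():
        idx = list(range(len(samples)))
        rng.shuffle(idx)
        train += [(samples[i], label) for i in idx[:n_train]]
        test += [(samples[i], label) for i in idx[n_train:]]
    return train, test


record_length = 12000                          # hypothetical record length
classes = ["normal", "inner", "outer", "ball"]
data = {c: sliding_windows([0.0] * record_length) for c in classes}

train, test = few_shot_split(data, n_train=3)
```

With these placeholder records each class yields 24 windows, so the split gives 12 training samples and 84 test samples in total. The heavy overlap between neighboring windows (step 400 against length 2560) is what allows many samples to be drawn from a short record.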

Performance of models under three training samples.
During the experiment, five ML and DL algorithms, such as RF and CNN, were tested as controls. The results of TFMSAN and the comparison methods under the seven cases are shown in figure 6. To avoid inaccurate comparisons caused by the particular training samples selected, the final results report the mean and variance of five experiments. As can be seen from the figure, the proposed method achieves the best diagnostic results for these seven different bearing failure cases, which demonstrates its effectiveness. In case 2 and case 3, there is no significant difference between SVM and the proposed method because both are close to 100%, while in the other cases the proposed method has a significantly better diagnostic effect than SVM, especially for the SQ bearings.
Note to table 1: Conv1D (in channels, out channels, kernel size) denotes a 1D convolution layer in which the number of input channels equals in channels, the number of channels produced by the convolution equals out channels, and the size of the convolving kernel equals kernel size.
RF is the second-best performing method in cases 4, 5, 6, and 7 after the proposed method, presumably because the ensemble property of RF makes the model more resistant to over-fitting, which yields better results under significant noise. The proposed method obtained the best diagnostic results in these cases, which also demonstrates its excellent interference resistance and generalization performance. CNN and WKCNN perform similarly in these cases, significantly worse than the proposed method and traditional ML methods such as RF and SVM. We speculate that this is because CNN and WKCNN are DNN-based and extract features from raw signals only, so the learned features generalize poorly when sufficient diagnostic knowledge is difficult to obtain in small-sample scenarios.

Performance of models under five training samples.
The diagnostic results of the methods on the seven cases when training with five samples are shown in figure 7. The figure shows that in case 2 and case 3, KNN, RF, and SVM achieve results similar to those of the proposed method, all very close to 100%. In the other cases, the proposed method achieves the best diagnostic results. In particular, in cases 4, 5, 6, and 7, the performance of the comparison methods declines more noticeably, while the proposed method still achieves good diagnostic accuracy, exceeding 90% in all cases, which reflects its excellent feature extraction ability under different conditions. The experimental results under both training sample sizes therefore demonstrate the superiority of TFMSAN. We attribute this to the structure of the network: the time domain and frequency domain signals are each fed into the MSConv layer to obtain multi-scale features, effective fault features are then mined through the intra-domain AM, and the time domain and frequency domain features are combined, fused, and enhanced through the inter-domain AM, so that classification can be based on the multi-scale features. Thus, the proposed method has the highest accuracy.

Performance of using time and frequency domain signals as input signals respectively.
Table 3 and figure 8 show the experimental results of using the time domain signal, the frequency domain signal, and the combined time and frequency domain signal as the input under three training samples. The results show that using the combined time-frequency signal as input improves accuracy by 5.0826% on average compared with using only the time-domain signal, and by 2.9313% on average compared with using only the frequency-domain signal. This indicates that using the combined time-frequency signal as input provides multi-dimensional features for fault diagnosis, thereby improving the model's feature extraction ability under small-sample conditions. It follows that time-frequency signals have good local properties and adaptability to different scales, and can characterize the time-domain and frequency-domain characteristics of signals simultaneously.

Performance of the model with and without hybrid AM.
Table 4 and figure 9 show the experimental results with and without the addition of the hybrid AM. As can be seen from the two bar charts, regardless of whether the time domain signal or the frequency domain signal is used as input, the diagnostic accuracy of most models without the hybrid AM is lower than that of models with the hybrid AM. These results show that, under appropriate conditions, the hybrid AM can effectively improve the feature extraction performance for time domain, frequency domain, and time-frequency domain signals, and thus improve the fault diagnosis accuracy of the model under small samples.

Conclusion
In this paper, a new IFD model, TFMSAN, is proposed to solve the problem of poor generalization ability under small samples in real industrial environments. A neural network framework, namely TFMSF, is constructed to realize the parallel extraction of time- and frequency-domain multi-scale features and thereby enhance the information extraction ability of the model. The framework uses multiple kernels of different sizes to build the MSConv layer and achieve convolution at different scales. In addition, two structurally identical branches in the framework are used to extract features in the time and frequency domains respectively. Meanwhile, a hybrid AM is introduced into the model to mine more effective and focused features. Intra-AM and inter-AM are used for feature fusion within one domain and across domains, respectively, to integrate the redundant features and realize complementarity between features. The experimental results demonstrate the superiority of the proposed method, and the following four conclusions can be drawn.
(1) The comparison results with the five other methods prove the superiority of TFMSAN for extracting efficient fault features, and show the effectiveness of the proposed method for fault diagnosis with small samples.
(2) Experimental results for seven bearing diagnosis cases from two bearings demonstrate the reliability and generalization of the proposed method, as the proposed TFMSAN achieves the best or near-best fault diagnosis accuracy in all these cases and is generally better than the comparison methods.
(3) Comparing the diagnostic accuracy obtained with the time-domain signal or the frequency-domain signal as input against that obtained with the time-frequency signal as input shows that the model can obtain more comprehensive and effective fault features when the time-frequency signal is used as input.
(4) The comparison results of the model with AM and the model without AM show the role of the hybrid AM, which can significantly improve the efficient representation and generalization performance of features.
In this study, it is assumed that all failure types have the same number of samples. In actual industry, however, the various failure types do not occur with the same probability, which results in an imbalance between the sample classes. Moreover, the proposed method still requires data from all categories, and it cannot yet effectively diagnose data sets that lack a certain fault category, a situation that is common in industrial applications. How to make full use of unbalanced small-sample data to ensure the validity of the model is the focus of the next study.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).