An End-to-end Domain Generalization Fault Diagnosis Method Based on Adaptive Weighted Domain Adversarial Learning

Domain adaptation (DA) based fault diagnosis methods have attracted great attention for a long time; they transfer domain-invariant knowledge from a source domain (SD) to a specific target domain (TD). In practice, however, there is usually neither a single SD nor a specific, available TD. Data are mostly generated from multiple domain distributions due to varied working conditions, and the TD is not always available during model training. Hence, how to use such multi-domain data to build a model that generalizes domain knowledge to an "unseen" TD is an emerging and challenging issue. In this paper, a novel domain generalization (DG) based fault diagnosis method is proposed. Multiple domains are considered jointly to extract generalized domain-invariant features. A structure of multiple residual blocks is adopted as the main feature extractor, and an additional extractor is added to obtain prior knowledge from statistical features. Besides, a k-means based adaptive weighted domain adversarial learning scheme is designed to realize multi-domain confusion in a latent space, while the Wasserstein distance is calculated to assist in narrowing the discrepancy between different domain distributions. Experiments on multi-domain tasks from two datasets are conducted to evaluate the proposed method, and the results verify its effectiveness.


Introduction
Deep transfer learning (DTL) is a popular approach for fault diagnosis; it aims to solve the diagnosis problem that arises when domain bias occurs due to changing working conditions [1]. Domain adaptation (DA) is an extensively researched family of DTL algorithms [2]. DA assumes that there exist a source domain (SD) and a different but related target domain (TD), and transfers domain knowledge from the SD to the TD. DA mainly consists of model-based, feature-based and sample-based methods, among which feature-based methods are the most widely used. For fault diagnosis, deep DA is applied to problems involving different working conditions, such as motor speed and working load. Metrics such as maximum mean discrepancy (MMD), correlation alignment (CORAL) and Kullback-Leibler (KL) divergence are inserted into a neural network to minimize the distribution shift between SD and TD [3][4][5]. Different from the above methods, domain adversarial learning narrows the gap between the SD and TD distributions in a latent space in an adversarial manner rather than through explicit metrics [6].

However, monitoring data may be generated from multiple distributions due to varied working conditions, which differs from the single SD assumed by DA. Besides, it is difficult to guarantee that a specific and available TD always exists in practical industrial scenarios. Mostly, the diagnosis model is required to be robust to an "unseen" TD, because the TD is not available during model training. Hence, domain generalization (DG) has been proposed to solve such diagnosis problems [7].

In this paper, an end-to-end adaptive weighted domain adversarial DG fault diagnosis method is proposed. The main contributions and innovations are as follows:
(1) A k-means based weighted adversarial learning scheme is designed to control the degree of domain shift between different domain pairs during training (a minimal sketch of the gradient reversal mechanism underlying the adversarial part is given after this list). The Wasserstein distance is adopted to assist in building the diagnosis model at the same time.
(2) A residual feature extractor is designed to realize end-to-end feature extraction. Moreover, five statistical features are calculated, and an additional prior-knowledge extractor is added to provide the model with comprehensive information.
(3) Two different datasets, the Paderborn bearing dataset and our own bearing dataset, are adopted, and twelve DG tasks are conducted to verify the performance of the proposed method. Several existing methods are adopted for comparison.
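The adversarial part of contribution (1) relies on gradient reversal. Below is a minimal sketch, assuming a PyTorch implementation; the class and function names are illustrative, not the paper's code.

```python
import torch


class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; scales the gradient by -alpha backward."""

    @staticmethod
    def forward(ctx, x, alpha=1.0):
        ctx.alpha = alpha
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversing the gradient drives the feature extractor toward domain
        # confusion while the domain discriminator is trained normally.
        return grad_output.neg() * ctx.alpha, None


def grad_reverse(x, alpha=1.0):
    return GradReverse.apply(x, alpha)
```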

Domain Generalization
Suppose there are $N$ source domains $S = \{S_1, S_2, \ldots, S_N\}$ drawn from different distributions, while the target domain $T$ is not available during training. The goal of DG is to learn a diagnosis model $f: X \to Y$ from the $N$ source domains only, which can generalize domain-invariant knowledge from the multiple SDs to the "unseen" TD.

Wasserstein Distance
Wasserstein distance is usually used to measure the discrepancy between two distributions P and Q. In this paper, the 1-D Wasserstein distance is adopted, defined as equation (1):

$$W(P, Q) = \int_0^1 \left| F_P^{-1}(t) - F_Q^{-1}(t) \right| \, \mathrm{d}t \qquad (1)$$

where $F_P^{-1}$ and $F_Q^{-1}$ denote the inverse cumulative distribution functions (quantile functions) of P and Q.

Adaptive Weighted Domain Adversarial Learning
Let $Cls(S_i, S_j)$ denote the regular classification cost, a cross-entropy loss over the labeled data of the domain pair $\{S_i, S_j\}$, and let $Dis(S_i, S_j)$ denote the domain cost, a binary cross-entropy loss between $S_i$ and $S_j$ realized by gradient reversal (GR), which reverses the direction of optimization during training to reach domain confusion. The unweighted adversarial objective over all domain pairs is given in equation (2):

$$L_{adv} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} Dis(S_i, S_j) \qquad (2)$$

Since the domain costs of different domain pairs are not equal, they contribute unequally to the parameter updates during back propagation. To address this issue, the k-means algorithm is adopted to weight $Dis(S_i, S_j)$. Let the cluster centers of the domain pair $(S_i, S_j)$ be $(C_i, C_j)$, and let the distance between them be denoted $\mu_{ij}$ (clearly $\mu_{ij} = \mu_{ji}$). When the domain confusion of a pair is poor, $Dis(S_i, S_j)$ is large, and $\mu_{ij}$ is large as well for the same reason. In this case, $\mu_{ij}$ can be used as a weight to enhance the effect of $Dis(S_i, S_j)$, and vice versa. Hence, equation (2) can be updated to equation (3), and the total loss function of the proposed method is defined as equation (4):

$$L_{adv}^{w} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \mu_{ij} \, Dis(S_i, S_j) \qquad (3)$$

$$L_{total} = \sum_{i=1}^{N} \sum_{j=i+1}^{N} \left[ Cls(S_i, S_j) + \mu_{ij} \, Dis(S_i, S_j) + \lambda \, W(S_i, S_j) \right] \qquad (4)$$

where $N$ is the number of source domains and $\lambda$ is a trade-off coefficient for the Wasserstein term.
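For concreteness, here is a minimal sketch of the two quantities above, assuming latent features are available as numpy arrays. `wasserstein_1d` approximates equation (1) on a fixed quantile grid, and `pair_weight` realizes one plausible reading of the k-means weighting (k = 2 on the pooled features of a domain pair); both names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans


def wasserstein_1d(p_samples, q_samples, grid=256):
    """Approximate equation (1): mean |F_P^{-1}(t) - F_Q^{-1}(t)| over a
    quantile grid. scipy.stats.wasserstein_distance gives an exact version."""
    p = np.asarray(p_samples, dtype=float).ravel()
    q = np.asarray(q_samples, dtype=float).ravel()
    t = (np.arange(grid) + 0.5) / grid
    return float(np.mean(np.abs(np.quantile(p, t) - np.quantile(q, t))))


def pair_weight(feat_i, feat_j):
    """mu_ij: distance between the two k-means centers of a domain pair's
    pooled latent features (an assumed reading of the paper's scheme)."""
    pooled = np.vstack([feat_i, feat_j])
    centers = KMeans(n_clusters=2, n_init=10).fit(pooled).cluster_centers_
    return float(np.linalg.norm(centers[0] - centers[1]))
```

If the two domains are well confused in the latent space, the two cluster centers nearly coincide and $\mu_{ij}$ shrinks, which matches the decreasing weights observed in figure 6.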

Results and Analysis
The experimental results are shown in figure 3. It can be found that the baseline deep learning method TICNN performs worse when dealing with "unseen" distribution data. When domain shift is taken into consideration, as in TICNN-MMD, TICNN-CORAL and TICNN-Adver, better performance than the baseline can be obtained. It is worth noting, however, that TICNN-Adver performs worse than TICNN on some tasks, such as A, B, D→C and E, G, H→F. A possible reason is that the degree of domain adversarial learning differs considerably across domain pairs, and the pairs influence each other during training. Hence, it is necessary to control the degree of domain adversarial learning for each domain pair. The proposed-naive method reaches good performance, even better than TICNN-MMD, TICNN-CORAL or TICNN-Adver on some tasks, such as F, G, H→E and I, J, K→L. This indicates the superiority of the designed model architecture, especially the residual feature extractor and the prior feature extractor. Finally, the proposed method reaches the best performance on almost all tasks; in particular, on task B, C, D→A it achieves nearly 15% higher accuracy than the baseline.

To illustrate the effectiveness of the adaptive weighted multi-domain adversarial learning and the Wasserstein term, the testing accuracies of the proposed method and the proposed-naive method on two tasks are compared in figure 4. In figure 4(a), the accuracy of the proposed method keeps rising after 15 epochs, whereas the accuracy of the proposed-naive method stays nearly constant and even decreases slightly. In figure 4(b), although both methods nearly saturate, the proposed method still reaches a higher limit and converges faster than the proposed-naive method. This proves that the special design of the proposed method enhances the robustness of the model when dealing with "unseen" distribution data.
To illustrate the superiority of the proposed method more visually, the feature distributions are visualized by t-Distributed Stochastic Neighbour Embedding (t-SNE) [10]; task B, C, D→A is taken as an example in figure 5. From figure 5(a), it can be seen that the original three data distributions with different labels overlap and can hardly be separated from each other. In figure 5(b), data from different distributions but with the same label are clustered into the same region; for instance, the OF data (marked with squares in figure 5(b)) from the three distributions fall into one region. Further, when the diagnosis model trained on distributions B, C and D deals with the "unseen" distribution A, a similar effect is observed in figure 5(c): the data of A with different labels are separated into different regions, just as in figure 5(b). From the three enlarged graphs on the right of figure 5(c), the data of A are categorized correctly as well, which illustrates that the proposed model can extract domain-invariant features from multiple domain distributions and generalize them to an "unseen" domain distribution.
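A minimal sketch of the kind of t-SNE projection used for figure 5, assuming extracted features and integer class labels as numpy arrays; the perplexity and colour map are illustrative choices, not the paper's settings.

```python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE


def plot_tsne(features, labels, title):
    """Project extracted features to 2-D and colour points by class label."""
    emb = TSNE(n_components=2, perplexity=30, init="pca").fit_transform(features)
    plt.scatter(emb[:, 0], emb[:, 1], c=labels, cmap="tab10", s=8)
    plt.title(title)
    plt.show()
```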

Conclusion
In this paper, an end-to-end adaptive weighted domain generalization fault diagnosis method is proposed, which aims to build a strongly generalized model that deals with "unseen" distribution data by training with multiple distribution data. The main conclusions are as follows.
(1) An end-to-end domain generalization fault diagnosis method is proposed, in which a k-means based adaptive domain adversarial learning scheme is designed to weight different domain pairs during training. The Wasserstein distance is also adopted to narrow the distribution discrepancy. Besides, a multiple residual feature extractor is proposed to deal with the original data, while a prior feature extractor is designed to handle five statistical values so that prior knowledge is considered.
(2) Twelve tasks on two different datasets were conducted. The results show that the proposed method achieves performance improvements over the other methods. Moreover, the feature distributions were visualized to illustrate the performance of the proposed method when dealing with "unseen" distribution data. Finally, the weights of different domain pairs were plotted to verify their effectiveness.


Figure 1. Structure of the proposed end-to-end weighted domain generalization model.
Dataset Description
Two datasets, the Paderborn bearing dataset and our local dataset, are adopted to verify the performance of the proposed method [8]; their test rigs are shown in figure 2. The Paderborn bearing dataset contains two types of data: real damage (RD) bearing data and artificial damage (AD) bearing data. The twelve designed DG tasks and the experimental data details are listed in table 3. The code N09_M07_F10 denotes motor speed 1500 rpm, load torque 0.7 Nm and radial force 1000 N in the Paderborn bearing dataset, and N05 denotes motor speed 500 rpm in our dataset. Task A, B, C→D means that datasets A, B and C are utilized for training while dataset D is not available during training and is used only for testing. Moreover, the length of each sample is 2048, and zero-mean, unit-standard-deviation normalization is utilized for pre-processing (a sketch is given below).
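A minimal sketch of this pre-processing, assuming a 1-D vibration signal and non-overlapping windows (whether the windows overlap is not stated in the paper).

```python
import numpy as np


def make_samples(signal, length=2048):
    """Cut a 1-D signal into non-overlapping windows and z-score each window."""
    signal = np.asarray(signal, dtype=float)
    n = len(signal) // length
    samples = signal[: n * length].reshape(n, length)
    mean = samples.mean(axis=1, keepdims=True)
    std = samples.std(axis=1, keepdims=True)
    return (samples - mean) / (std + 1e-12)
```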

Figure 2. (a) Test rig of the Paderborn dataset; (b) test rig of our dataset.

Figure 3. Accuracy of different methods on the twelve DG tasks.

Figure 4. Comparison between the proposed method and the proposed-naive method on different tasks: (a) B, C, D→A; (b) A, C, D→B.

Figure 5. Visualization of data distributions: (a) original data distribution of B, C and D; (b) feature distribution of B, C and D extracted by the proposed method; (c) feature distribution of B, C, D and the "unseen" A extracted by the proposed method.

Furthermore, the change of the weight $\mu$ in task E, F, G→H is taken as an example to explain the effectiveness of the k-means based weighting, as shown in figure 6. It can be found that $\mu$ differs considerably across domain pairs before training: it is 201 for domain pair (E, F), 162 for (E, G) and 152 for (F, G). This indicates that the degree of domain shift differs between domain pairs. Besides, at the beginning of training $\mu$ is large, which imposes a stronger weighting on domain adversarial learning. As training proceeds, the $\mu$ of different domain pairs decreases gradually over the epochs, indicating that the degree of domain shift becomes smaller through iteration. Finally, the $\mu$ of all domain pairs reaches small values of about 10 when training ends.

Figure 6. The change of the weight $\mu$ in task E, F, G→H during training: (a) original graph; (b) enlarged view of epochs 20 to 100.

Table 1. Five statistical features definition.
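The definitions in table 1 are not recoverable from the extracted text. As a placeholder, the sketch below computes five statistical features commonly used for bearing vibration signals; these exact choices (RMS, kurtosis, skewness, crest factor, shape factor) are an assumption, not the paper's table.

```python
import numpy as np
from scipy.stats import kurtosis, skew


def stat_features(x):
    """Five common statistical features of a vibration segment (assumed set)."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    return np.array([
        rms,                        # root mean square
        kurtosis(x),                # kurtosis
        skew(x),                    # skewness
        np.max(np.abs(x)) / rms,    # crest (peak) factor
        rms / np.mean(np.abs(x)),   # shape factor
    ])
```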

Table 2. Parameters of the proposed model.

Table 3. Dataset description.

Several methods are adopted for comparison with the proposed method. The TICNN model proposed in [9] is adopted as the deep learning (DL) baseline. Details of all comparison methods are listed in table 4. For all methods, the number of epochs is 100, the batch size is 128, the learning rate is 1e-5, and the optimizer is Adam (a configuration sketch is given below).
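A minimal sketch of this shared training configuration, assuming a PyTorch implementation; `model` stands for any of the compared networks and is not defined here.

```python
import torch

EPOCHS = 100      # training epochs for all methods
BATCH_SIZE = 128  # batch size for all methods
LR = 1e-5         # learning rate for all methods


def build_optimizer(model: torch.nn.Module) -> torch.optim.Optimizer:
    # Adam, as stated for every compared method.
    return torch.optim.Adam(model.parameters(), lr=LR)
```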