Bearing fault diagnosis based on optimized variational mode decomposition and 1D convolutional neural networks

Due to the fact that measured vibration signals from a bearing are complex and non-stationary in nature, and that impulse characteristics are always immersed in stochastic noise, it is usually difficult to diagnose fault symptoms manually. A novel hybrid fault diagnosis approach is developed for denoising signals and fault classification in this work, which combines successfully variational mode decomposition (VMD) and a one-dimensional convolutional neural network (1D CNN). VMD is utilized to remove stochastic noise in the raw signal and to enhance the corresponding characteristics. Since the modal number and penalty parameter are very important in VMD, a particle swarm mutation optimization as a novel optimization method and the weighted signal difference average as a new fitness function are proposed to optimize the parameters of VMD. The reconstructed signals of mode components decomposed by optimized VMD are used as the input of the 1D CNN to obtain fault diagnosis models. The performance of the proposed hybrid approach has been evaluated using sets of experimental data on rolling bearings. The experimental results demonstrate that the VMD can eliminate signal noise and strengthen status characteristics, and the proposed hybrid approach has a superior capability for fault diagnosis from vibration signals of bearings.


Introduction
Rolling bearings are key components of rotating machines. Once there is a serious failure, it may lead to unexpected downtime and thereby result in huge financial losses or safety issues. According to [1], bearing-related failures account for over 30% of all rotary machines. Therefore, the goal of accurate and effective fault diagnosis of bearings has attracted much attention to ensure high production efficiency and to improve the safety and reliability of rotary machines in industrial processes. Vibration signal analysis is the most commonly used approach due to its easy measurement and high correlation with structural dynamics [2][3][4][5][6][7]. Ahmed and Nandi have explored the combined compressive sampling method of signals, and achieved good results in bearing fault diagnosis [8,9]. Also, they have explored the compressive sensing method to produce highly compressed measurements of bearing vibration signals [10,11]. But the measured vibration signals are complex and non-stationary in nature, and meanwhile impulse characteristics of rolling bearings are usually immersed in stochastic noise. So, there are two challenging issues in bearing fault diagnosis from vibration signals. The questions are (a) how one should effectively eliminate noise and (b) how one should extract valid fault features.
Some signal decomposition techniques, as powerful nonlinear and non-stationary signal processing tools, have been developed and immediately attracted much attention in the field of fault diagnosis, such as empirical mode decomposition (EMD), ensemble EMD (EEMD), local mean decomposition (LMD), singular value decomposition (SVD), empirical wavelets transform (EWT), etc. Xu proposed a new bearing fault diagnosis method combining SVD and the squared envelope spectrum [12]. Longqing et al put forward a fault diagnosis method of bearing clearance of a reciprocating compressor based on LMD sample entropy and SVM [13]. However, these methods still have their drawbacks and limited application. EMD has the drawbacks of mode mixing and end effects, although it has good intrinsic locally adaptive properties. Rilling et al [14] first completely probed these issues of EMD. EEMD, an extended version of EMD, is proposed to address the mode mixing of EMD [15] but creates a large computational burden [16]. EWT can alleviate the sensitivity of EMD to noise and sampling, but it cannot avoid invalid decomposition caused by the concentration of boundaries. LMD can make up for the shortcomings of EMD, but both are types of recursive mode decomposition, which is affected by mode aliasing, end effects, and sampling frequency [17].
Variational mode decomposition (VMD) as a non-recursive signal processing method has been recently proposed by Dragomiretskiy and Zossso [18]. Xin et al [19] proposed a rolling bearing fault diagnosis method based on VMD and SVM. The energy features are extracted from the intrinsic modal components decomposed by VMD and used as the input of a support vector machine (SVM) to judge the working state and fault type of the bearing. Gu et al [20] proposed a new fault diagnosis method based on statistical characteristics such as variational pattern decomposition (VMD), SVM and variance contribution rate, energy entropy (EE), and permutation entropy (PE). Liu et al [21] proposed a feature extraction method based on parameter optimization of VMD and sample entropy, and further used to support SVM for fault diagnosis. The performance of VMD is more powerful than EMD in tone-separation [22] and EWT in feature extraction [23], because the VMD can avoid the cumulative error and end effects. When VMD is used to decompose signals, the number of decomposition components is always preset by experience, but it directly affects the decomposition effectivity, and so does the penalty parameter. In order to solve the problems of choosing the values of VMD parameters, a 1.5-dimensional diagnostic method based on the optimized VMD with a genetic algorithm (GA) is proposed for fault diagnosis [24]. However, a single GA cannot handle the complexity in an efficient way. Zhang et al [25] use the improved particle swarm optimization (PSO) algorithm to optimize the parameters K and α of VMD by using the maximum value of weighted autocorrelation function as the optimization objective function. But PSO can easily get trapped in a local optimum when solving a complex multimodal problem [26], and the choice of fitness function also affects the optimization [27]. In short, VMD can be used to eliminate the noise of the non-stationary vibration signal of bearings by decomposing signals, but the problem of how to obtain the optimal parameters of VMD efficiently still needs research.
The second challenging issue is the question of how one should extract feature values. It is well known that the feature extraction is a bottleneck problem and deep learning methods can avoid this problem because they integratefeature extraction and classification operations into a single machine learning body to optimize jointly the classification performances [28,29]. Convolutional neural networks (CNNs) are widely used tools for deep learning which are different from the traditional feed-forward ANN because of the three architectural properties of the visual cortex cell: local receptive regions, shared weights, and subsampling. CNNs are most frequently used with two-dimensional (2D) data, such as images. Dong et al [30] proposed a new bearing fault diagnosis method based on multiple noise reduction of SVD and EMD and an improved CNN for bearing weak fault identification. Wang et al [31] uses a CNN and AE-DNN to detect and classify faults of an MMC-HVDC system. Li et al [32] uses a CNN to classify the time-frequency samples obtained by a shorttime Fourier transform for bearing vibration signals. However, recently, as an alternative, a modified version of 2D CNNs, called 1D CNNs (1D-CNNs), has been developed [33][34][35]. In many applications, 1D-CNNs are preferable to their 2D counterparts in dealing with 1D signals due to the reduced computational complexity and the same performance with shallower architecture. The main difference between 2D and 1D CNNs is the usage of 1D arrays instead of 2D matrices for both kernels and feature maps [36]. A classifier based on a 1D CNN has the ability to process the raw signal directly and to extract the representative features more precisely, without explicit feature extraction or manual selection [37]. Therefore, in order to solve these two challenging issues in bearing fault diagnosis, a novel hybrid fault diagnosis approach is developed for the denoising of signals and fault classification in this work. This combines successfully with the VMD and 1D CNN. To obtain the appropriate decomposition mode, it is necessary to find a method to optimize the parameters of VMD and design an excellent fitness function.
The main contributions of this paper are summarized as follows.
(a) A novel hybrid fault diagnosis approach is proposed to solve the two challenging issues of non-stationary vibration signal from bearings. (b) A novel PSMO method is proposed. The key parameters of VMD are optimized by using the PSMO method which has the advantages of both PSO and GA. Meanwhile, a weighted signal difference average (WSDA) is proposed as the fitness function. (c) The denoising signals are reconstructed by the decomposition modes and the reconstructed signals are estimated by the index of PE. (d) The effectiveness of the proposed method is verified using real bearing data. The diagnosis results are analyzed from the diagnosis accuracy, confusion matrix, and data distribution, as well as comparison with other methods.
The rest of this paper is arranged as follows. In section 2 we propose VMD and a novel PSMO method. The 1D CNN is introduced in section 3. A novel hybrid fault diagnosis approach is proposed in section 4. The feasibility and performance of the proposed approach are discussed in section 5. Conclusions are drawn in section 6.

VMD and a novel PSMO method
In 2014, Dragomiretskiy and Zosso [18] introduced a new adaptive signal processing method called VMD, which can effectively decompose non-stationary nonlinear signals. In contrast to EMD and LMD, VMD can effectively solve the mode aliasing problem and has significant superior anti-noise performance and higher computational efficiency. But, values of its decomposition parameters need to be artificially set, which is prone to over-decomposition or under-decomposition phenomena. In order to overcome the problem of VMD, PSO and GA have been combined in a novel way to optimize the VMD parameters.

Brief introduction to VMD
VMD can non-recursively decompose a multi-component input signal into a discrete set of quasi-orthogonal bandlimited intrinsic mode functions (IMFs). Each IMF component u k has a center frequency and a finite bandwidth, and the corresponding constrained variation model is described as follows [18]: where u k = (u 1 , u 2 , · · · , u k ) is a set of modal component functions, their sum is the original function f, ∂ t is the partial derivative of time t, ω k = (ω 1 , ω 2 , · · · , ω k ) is the center frequency set of the modal component, δ (t) is the unit pulse function, j is the imaginary unit, and * represents the convolution operation.
In order to solve equation (1), the quadratic penalty factor α and Lagrangian multiplier λ (t) are introduced to transform the constrained variational problem into the following unconstrained variational problem: In detail, the implementation process of the VMD is described as follows.
Step 3. For all ω ⩾ 0, update u k , ω k and λ k u n+1 Step 4. Repeat steps (2) to (3), until the iteration stop condition is satisfied Step 5. Stop the iterations and obtain the IMF components. In order to estimate the effectiveness of VMD, PE is used as it is a measure of the complexity of a time series by capturing the order relations and extracting a probability distribution of the ordinal patterns. PE was presented by Bandit and Pompe [38] for the complexity analysis of time domain data by using the comparison of neighboring values. Yan et al [39] used PE as one nonlinear statistical measure for status characterization of rotating machines. Their study has shown promising results that PE could effectively detect and amplify the dynamic change of vibration signals, and characterize the bearing working status under different operating conditions.
The PE method also has the advantages of simplicity, extremely fast calculation, robustness, and invariance with respect to nonlinear monotonic transformations. PE is defined as the Shannon entropy associated with the probability distribution where D is the embedding dimension which controls the row numbers of the new matrix. The new matrix is formed by partitioning the one-dimensional time series which need to set the embedding time delay τ and the embedding dimension D. π i is the frequencies associated with the i possible permutation patterns, i = 1, · · · , D!.

A novel PSMO method
In order to decompose accurately the vibration signal of bearings by using the VMD, to prevent the phenomenon of overdecomposition or under-decomposition, and to obtain accurate bandwidth and central frequency for each component, it is necessary to optimize two parameters of the VMD, which are the modal number k and the penalty factor α.
Evolutionary algorithms, as an effective method for solving difficult optimization problems, have gained much attention. PSO and GA are quite similar, which means that PSO and GA change from one set of points to another set of points within an iteration with visible improvement from the previous values using some probabilistic and deterministic rules. The PSO has the advantages of strong local search ability and fast convergence speed but easily falls into a local optimum and the search path is more complicated. GA has the advantages of strong global search ability but cannot handle complexity in an efficient way. Therefore, in this paper we introduce the mutation idea of the GA algorithm into PSO and propose a new hybrid algorithm of PSMO by adding mutation during the update iteration of the PSO algorithm.
The definition is as follows: suppose that in a D-dimensional search space there is a population of M particles X = (X 1 , X 2 , X 3 , · · · , X M ). Each of these particles has a corresponding position and velocity; for example, the position of the ith particle is The local extremum of the particle is P i = (p i1 , p i2 , p i3 , · · · , p iD ), and the global optimal value of the corresponding population is G = (g i1 , g i2 , g i3 , · · · , g iD ). On this basis, the idea of GA mutation is added, and the mutation probability is q, and the random number r d ∈ [0, 1]. If r d > q, the particles do not mutate, and each particle is updated according to formula (8), and the position and speed of the next generation are updated by individual local extremum and global extremum. If r d ⩽ q, the position after the mutation is M i = rand (lbd, ubd). The specific update formula is shown in (9).
where ω is the inertia weight, r d is a random number between 0 and 1, and c 1 and c 2 are the learning factors, which represent the learning ability of the local extremum and global extremum respectively. A fitness function is needed to optimize the parameters of VMD using PSMO. In this paper, the WSDA is proposed based on the signal difference average (SDA) where K is the modal number and β is the penalty parameter. The SDA method calculates the difference between the signals for each data point within the signals itself. A small value of SDA indicates that the signals have high similarities whereas a large value of SDA indicates that there is a lot of information loss from the signal. Equation (11) describes the SDA method, where y IMFS is a sum of VMFs (equation (12)), y s is an input signal, and N is the data point in the signals: The flowchart of the proposed PSMO algorithm is shown in figure 1.

1D CNNs
More and more pattern recognition of 1D signals uses a 1D CNN, instead of a 2D CNN, because of its reduced computing burden and often better performance. A 1D CNN uses weight sharing, which requires fewer parameters to converge than traditional neural network models. This guarantees the convergence of the 1D CNN earlier and faster.
The configuration of a 1D CNN is composed of the input layer, convolutional layer, pooling layer, full connect layer, and output layer, which are shown in figure 2. Among these layers, there are two basic layers in CNN, which are the convolutional layer and the pooling layer. The convolution operation implements the first two properties, that are local receptive regions and shared weights. The pooling operation implements the subsampling property [40]. The input layer is a passive layer that receives the raw 1D signals and the output layer is  an MLP layer with the number of neurons equal to the number of classes.
A convolutional layer consists of neurons that connect to small regions of the input and operate the convolution computation. Every kernel detects specific characteristics in any location on the input feature map. The output feature map of the convolutional layer can be written as where K is the ith convolution kernel of the lth layer. X l(r j ) is the local area of the lth layer input, and * represents the convolution operator.
Pooling layers perform down-sampling operations. Pooling functions usually include max-pooling and average-pooling, In this paper, the max-pooling function is applied and it outputs the maximal values of rectangular regions of its input. In a fully connected layer, neurons between two adjacent layers are fully pairwise connected but neurons within the same layer share no connections. Then the Softmax function is commonly adopted for classification tasks in the output layer. It is calculated as: The loss function can use the mean squared error function and the cross-entropy function. In this paper, we used the cross-entropy function, which is given by where t ij is the indicator that the ith example belongs to the jth class, and y ij is the output for example i, which is the value from the Softmax function.

A novel fault diagnosis method
When the rolling bearings are in operation, bearings with different types and different degrees of failure will generate vibrations which are typical non-stationary and nonlinear multiple-component vibration signals. The vibration signal collected by the acceleration sensor maybe the superposition of multiple vibration signals which include the interference signals. In the process of signal analysis and fault diagnosis, the interference signal makes it more difficult to extract valuable information. The VMD is an adaptive and non-recursive signal decomposition method that can reveal the weak transient impulse from complex vibration signals which has the advantages of effectively reducing pseudo-components and modal aliasing. The 1D CNN is a typical deep learning method that provides a structure in which both feature extraction and prediction are performed together in a single block, like genetic programming but unlike other traditional methods. A 1D CNN is superior to a 2D CNN in terms of compu- tation burden and similar performance with shallower architectures. This paper combines the above methods for fault diagnosis. The flowchart of the fault diagnosis method is shown in figure 3. The specific steps of the proposed method are described as follows.
Step 1. The vibration signal data set is divided into training examples and testing examples.
Step 2. The modal number and the penalty factor of VMD are optimized by PSMO.
Step 3. Training examples and testing examples are decomposed by optimized VMD, and some sets of IMF components are obtained.
Step 4. Signals are reconstructed by the VMF components obtained in step 3.
Step 5. The 1D CNN is trained using the reconstructed training examples.
Step 6. The effectiveness of the proposed fault diagnosis model is evaluated by experimental data of rolling bearings.

Verification and analysis
To verify the effectiveness of the proposed method in bearing fault diagnosis, two open-source datasets are used for research in this paper, including Case Western Reserve University (CWRU) Bearing Data Center dataset [41] and the Society for Mechanical Failure Prevention Technology (MFPT) dataset [42]. Rolling bearing test rig [41]. Reproduced from [41]. CC BY 4.0. The CWRU data is the experimental data of rolling bearings from the electrical engineering laboratory of CWRU in this experiment [41]. The bearing model is a drive-end bearing (6205-2RSJEM SKF, deep groove ball bearing). As shown in figure 4, the test bed consists of a 2 hp motor (left), a torque transducer and encoder (center), a dynamometer (right), and control electronics (not shown). The dynamometer is controlled so that desired torque load levels can be achieved. The test bearings support the motor shaft. Single point faults were introduced to the test bearings using electricaldischarge machining with fault diameters of 1.778 × 10 -4 m, 3.556 × 10 -4 m, and 5.334 × 10 -4 m, and the fault depth is 2.794 × 10 -4 m.
Vibration signals were collected using a 16-channel DAT recorder with a sample rate of 12 kHz. Each data set is made up of 1.2 × 10 5 points. The experimental rotating frequency is about 30 Hz. The experimental data (table 1) were collected vibration signals for normal and three different fault types (ball fault, inner race fault, and outer race fault) with three different degrees (the different fault diameters correspond to the different fault degrees). Each of the ten conditions in the data sets is composed of 600 samples and the sample length of each signal is 4096. In order to eliminate the noise of vibration signals of bearings, the VMD method is used to decompose the signals. To obtain appropriate mode components, the parameters such as the modal number and the penalty parameter of VMD need to be optimized. We propose a novel PSMO method which has the advantages of the global optimum of GA and the convergence speed of PSO. The parameters of PSMO are set as shown in table 2. G max is the maximum evolution algebra, M is the population size, c 1 and c 2 are learning factors, ω represents inertia weight, and q is the mutation probability. The WSDA is taken as the fitness function. The smaller the WSDA value is, the better is the decomposition result. There are two termination conditions which are the number of iterations t and the WSDA value ε. As long as either ε ⩽ 1 × 10 −7 or t = 10, the optimization result is output. In this paper, we select the data with 0.1778 inner race damage, which has 4096 sample length. In the fourth iteration, the minimum average difference is 0.0000425, and the corresponding optimal parameter combination is K = 8 and α = 6300.

Signal decomposition.
The vibration signals of ten states are decomposed via the VMD method with optimized parameters obtained from 5.1.2.1. Because of limited space, we introduce only the signal decomposition and reconstruction process of th inner ring signal with a damage diameter of 0.14 feet. Figure 5 shows the time domain and frequency domain diagram of the original signal of the inner ring fault. Figure 6 is the decomposition of the inner race fault signal.

Signal reconstruction.
Signals are reconstructed according to the maximum kurtosis criterion, and the IMF component whose kurtosis is greater than the average kurtosis is selected for signal reconstruction. Figure 7 shows the time domain and frequency domain diagram of the reconstructed fault signal of the inner ring with a damage diameter of 0.14 feet. Compared with figure 4, it can be seen that the reconstructed signal removes the disorder components in the original signal and removes some noise signals. For other samples, there are similar differences between the other original signals and the reconstructed signals.
To illustrate the effectiveness of the VMD, we adopt the index of PE to compare the original signal with the reconstructed signal. The parameters of PE need to be set; the embedding time delay is set as 5 and the embedding dimension is set as 4.
The smaller the value of PE is, the more regular and more deterministic the time series is. Conversely, the greater the value of PE is, the more noisy and random the time series is. From figure 8, it can be seen that the PE value of the reconstructed signal is smaller than that of the original signal. This indicates that the reconstructed signal has more information of working states and the original signal has more noise and chaos.

Parameter setting of 1D CNN.
Reconstructed signals with 4096 data points are used as the input of the 1D CNN. Therefore, the size of the input bearing signals is (4096 × 1). The 1D CNN model has a total of seven layers, including five convolutional and pooling layers, a fully connected layer and a Softmax layer. The five layers of convolutional and pooling layers and their parameters are set as shown in figure 9. The Relu function [43] is used as the activation functionin all layers except the Softmax layer. In table 3, 'same' means that the output feature map has the same spatial dimensions as the input feature map. Zero padding is introduced to make the shapes match as needed, equally on every side of the input map. Here, 'same' tries to pad evenly left and right, but if the number of columns to be added is odd it will add the extra column to the right. Here, 'valid' means no padding and only ever drops the right-most columns (or bottom-most rows).
As the Adam optimization algorithm (Adam) may reduce the oscillations along the path of the steepest descent towards the optimum that is sometimes caused by the stochastic gradient descent algorithm [44], we use the Adam algorithm to  (20) where m t and v t are the first-order moment estimation and second-order moment estimation of the gradient, respectively, and m t and v t are the corrections of m t and v t , which can be approximated to the unbiased estimation of expectation. The parameter t represents the number of times; β 1 are β 2 are constants, controlling exponential attenuation. Also, α is called the learning rate or step size factor, which controls the update ratio of weights, and ϵ is a very small number in order to prevent division by zero in the implementation. Here, we set the momentum β 1 at 0.9 and β 2 at 0.99, the learning rate α at 0.001, ϵ at 10 × 10 −8 and the maximum number of epochs to use for training at 10.

Diagnosis results and analysis.
The 5000 samples are taken as the training set and 1000 samples are taken as the testing set in the experiment. The original signal and reconstructed signal are used respectively as the input layer of the 1D CNN to verify the diagnosis accuracy of the proposed method. The average diagnosis accuracy for running 10 times was 99.6% for reconstruction signals and 98.7% for original signals. Table 4 records the diagnosis accuracy for original signals and for reconstructed signals at each running time. From the diagnosis accuracy, we can see that at every running time the proposed method can obtain a higher diagnosis accuracy, which indicates that the VMD method is very important in the process of fault diagnosis to obtain higher diagnosis accuracy.
Because of limited space, we only provide a confusion matrix of the classification results for each condition with testing data at the fourth run, which is shown in figure 9. The reason for selecting the results of the fourth run is that the diagnosis accuracy of the fourth run is near to the average accuracy. In figure 9, the 1D CNN misclassified 1% of the testing examples of the B1 condition as the N condition, and misclassified 10% In order to describe the overall perspective of the confusion matrix under ten runs in limited space, the numbers of misclassifications in the confusion matrix under ten runs are counted in tables 5 and 6 respectively for original signals and for reconstructed signals.
It can be seen from table 5 that out of the ten runs, the condition of B1 is misclassified as B0 once; the condition of B2 is misclassified as B0 seven times, as B1 once, and as O1 once; the condition of I1 is misclassified as B0 twice, as B2 once, as I0 once and as O1 three times. By contrast, the content of table 6 is simpler than that of table 5. Only the conditions of B2  I0 I1 I2 O0 O1 O2 N   True B0  B1  B2  6  3  I0  3  I1  I2  O0  O1  O2  N and I0 incurred misclassification. The condition of B2 is misclassified as B0 six times and B1 three times. The condition of I0 is misclassified as I1 three times.
It is known that the classification accuracy is influenced by the distribution of samples. To observe more clearly whether there is a difference of distribution between the original signals and the constructed signals, we use the t-SNE [45] dimension reduction algorithm to visualize the distribution of highdimensional datasets at the full connection layer.
It can be seen from figure 10 that, for the original signals, there are some overlaps among three different types of dataset, which are the conditions of B0, B1, and B2. These three different fault degrees of rolling element fault type are more easily confused in the process of diagnosis. It is consistent with figure 9 that the conditions of B0, B1 and B2 are easily misclassified. it is obvious that for the reconstructed signals there are clear boundaries between different types of data set. This partly explains why the diagnosis accuracy of the reconstructed signal is higher than that of the original signal.

Comparison.
To verify further the effectiveness of the proposed method, we use long short-term memory (LSTM), random forest (RF), and SVM to classify the faults of the original data and the faults of the reconstructed data. The parameters are set as follows: for LSTM, the number of layers is set to two, and the neuron numbers of the two layers are set to 16 and 32 respectively; for RF, the number of trees in the forest is set to 100. Information entropy is used to measure the performance of splitting quality, the maximum depth of the random tree is set to none mode, and the meaning nodes are expanded until all leaves are pure. For SVM, the radial basis function is introduced as the kernel function of SVM. The above three methods have been used for ten runs each to classify the original data and classify the reconstructed data. Table 7 contains the results of the three methods. Compared with table 4, we can see that the proposed method is better than other methods, on both the original data and the reconstructed data. We also conclude that the diagnosis accuracy using the reconstructed data is better than the same with the original data, no matter whether 1D CNN, LSTM, RF or SVM is used. This demonstrates that the optimized VMD is essential to improve the diagnosis accuracy.   consisting of 4096 sampling data points, and 500 samples are included in each condition of bearing data.

Vibration signal denoising.
To eliminate the noise of vibration signals of bearings, first we optimize the modal number and the penalty parameter of VMD by PSMO, then decompose signals by VMD, and finally reconstruct signals of bearing data.

Parameter optimization of VMD based on PSMO.
To obtain appropriate mode components, the parameters such as the modal number and the penalty parameter of VMD need to be optimized by PSMO. The parameters of PSMO are set as shown in table 9. G max is the maximum evolution algebra, M is the population size, c 1 and c 2 are learning factors, ω represents inertia weight, and q is the mutation probability. The WSDA is taken as the fitness function. The smaller the WSDA value is, the better is the decomposition result. There are two termination conditions, which are the number of iterations t and the WSDA value ε. As long as either ε ⩽ 1 × 10 −7 or t = 10, the optimization result is output. In this paper, we select the data with 0 lb inner race damage, which has 4096 sample length. In the fourth iteration, the minimum average difference is 0.0000994, and the corresponding optimal parameter combination is K = 6 and α = 3623.

Signal decomposition.
The vibration signals of five states are decomposed via the VMD method with K = 6 and α = 3623. Because of limited space, we introduce only the signal decomposition and reconstruction process of the inner ring signal with load of 0 lb. Figure 11 shows the time domain and frequency domain diagram of the original signal of the inner ring fault. Figure 12 is the decomposition of the inner race fault signal.

Signal reconstruction.
Signals are reconstructed according to the maximum kurtosis criterion, and the IMF component whose kurtosis is greater than the average kurtosis is selected for signal reconstruction. Figure 13 shows the time domain and frequency domain diagram of the reconstructed fault signal of the inner ring with a damage 0 lb load.
Compared with figure 11, it can be seen that the reconstructed signal removes the disorder components in the original signal and removes some noise signals. For other samples, there  are similar differences between the other original signals and the reconstructed signals.
To illustrate the effectiveness of the VMD, we adopt the PE index to compare the original signal with the reconstructed signal. The embedding time delay and the embedding dimension of PE are set to 5 and 4 respectively.
From figure 14, it can be seen that the PE value of the reconstructed signal is smaller than that of the original signal. This indicates that the reconstructed signal has more information of working states compared with the original signal.  Reconstructed signals with 4096 data points are used as the input of the 1D CNN. Therefore, the size of the input bearing signals is (4096 × 1). The 1D CNN model has a total of seven layers, including five convolutional and pooling layers, a fully connected layer and a Softmax layer. The five convolutional and pooling layers and their parameters are set as shown in table 10. The Relu function [42] is used as the activation function in all layers except the Softmax layer, and the Adam algorithm [43] is used to update the parameters of the deep NN.

Diagnosis results and analysis.
The 2000 samples are taken as the training set and 500 samples are taken as the testing set in the experiment. The original signal and reconstructed signal are used respectively as the input layer of the 1D CNN to verify the diagnosis accuracy of the proposed method. The average diagnosis accuracy for running 10 times was 96.1% for reconstruction signals and 91.2% for original signals. Table 11 records the diagnosis accuracy for original signals and for reconstructed signals at each running time. From the diagnosis accuracy, we can see that at every running time the proposed method can obtain a higher diagnosis accuracy, which indicates that the VMD method is very important in the process of fault diagnosis to obtain higher diagnosis accuracy. Figure 15 provides a confusion matrix of the classification results for each condition with testing data at the sixth run. The reason for selecting the results of the sixth run is that the diagnosis accuracy of the fouth run is near to the average accuracy.
In figure 15, for the original data, 1D CNN misclassified 13% of the testing examples of the I1 condition as the I2 condition, and misclassified 18% of the testing examples of the O1 condition as the O2 condition for the original signals; For the reconstructed data, 1D CNN misclassified 5% of the testing examples of the I2 condition as the N condition, and misclassified 10% of the testing examples of the O1 condition as the O2 condition.
It is well known that the classification performance is influenced by the distribution of samples. In order to observe more clearly whether there is a difference of distribution between the original signals and the constructed signals, we use the t-SNE [44] dimension reduction algorithm to visualize the data distribution of high-dimensional datasets at the full connection layer.
It can be seen from figure 16 that, for the original signals, there is some overlap between I1 and I2 and some overlap between O1 and O2, which is consistent with figure 15. For the reconstructed signals, there are nearly clear boundaries between different types of data set. To some extent, this explains why the diagnosis accuracy of the reconstructed signal is higher than that of the original signal.

Comparison.
To verify further the effectiveness of the proposed method, we use LSTM, RF, and SVM to classify the faults of the original data and the faults of the reconstructed data, then we compare them with the results of our proposed method.
The parameters are set as follows: for LSTM, the number of layers is set to two, and the neuron numbers of the two layers are set to 16 and 32 respectively; for RF, the number of trees in the forest is set to 100. Information entropy is used to measure the performance of splitting quality, the maximum depth of the random tree is set to none mode, and the meaning nodes are expanded until all leaves are pure. For SVM, the radial basis function is introduced as the kernel function. The above three methods have been used for ten runs on the original data and on the reconstructed data. The results are shown in table 12.
Compared with table 11, we can see that the proposed method is better than other methods, both on the original data and the reconstructed data. We also conclude that the diagnosis accuracy using the reconstructed data is better than the same with the original data, no matter whether 1D CNN, LSTM, RF or SVM is used. This demonstrates that the optimized VMD is essential to improve the diagnosis accuracy.

Conclusions
In this paper, to address two challenging issues of how to effectively eliminate noise and extract valid fault features from the vibration signals for bearing fault diagnosis, a novel hybrid fault diagnosis method, the optimized VMD by PSMO and 1D CNN, has been proposed to realize fault diagnosis of rolling bearings. PSMO, which has the advantages of strong global search ability and fast convergence speed by combining PSO and GA, and WSDA as the fitness function, are proposed to realize the optimization of K and α in VMD. From the index of PE, the PE value of the reconstructed signal is smaller than that of the original signal, which indicates that the reconstructed signal has more information of working states than the original signal.
1D CNN, which fuses feature extraction and classification operations into a single machine learning body to optimize jointly the classification performances, has been adopted to realize fault diagnosis of the reconstructed vibration signals. For the CWRU data and MFPT data, our method has better classification accuracy, 99.6% for CWRU data and 96.1% for MFPT data, compared with LSTM, RF, and SVM. Meanwhile, the optimized VMD denoising signals by PSMO is very important in the process of fault diagnosis, because the reconstructed data have been better adapted for this task than the original data.

Data availability statement
All data that support the findings of this study are included within the article (and any supplementary files).