Remaining useful life prediction for bearing based on automatic feature combination extraction and residual multi-Head attention GRU network

Jiawen He; Xu Zhang; Xuechang Zhang; Jie Shen

doi:10.1088/1361-6501/ad1652

1. Introduction

The Industrial Internet of Things and modern technology have made prognostics and health management (PHM) common in industrial systems [1]. As an important mechanical component, rolling bearings are widely used in rotating machinery across fields such as the electrical industry, machine tools, and automobiles. Long-term use may degrade bearing performance and increase noise. If not detected and dealt with in time, the bearings may suddenly fail during use, causing equipment damage or even endangering personnel safety [2]. The remaining useful life (RUL) of a rolling bearing is the time for which the bearing can continue to work safely before failure under the current working condition. Rolling bearing RUL prediction emerged so that possible faults and damage can be diagnosed in advance, maintenance and replacement can be carried out, and accidents or production losses caused by equipment downtime can be avoided [3, 4]. Therefore, RUL prediction of rolling bearings is a significant technology in modern intelligent manufacturing.

Regarding the RUL prediction of rolling bearings, current research methods mainly include model-based methods, data-driven methods, and hybrid methods [5]. Model-based methods establish mathematical models of bearing operation to analyze the factors that influence bearing damage and failure, facilitating accurate RUL prediction [6]. For example, the finite element method and system identification technology are used to establish bearing operation models, which are calibrated and verified with experimental data [7, 8]. In the early stages of industrial development, model-based methods demonstrated better results. However, as industrial systems become increasingly complex, these methods struggle to meet the needs of industrial applications. Data-driven methods use machine learning algorithms to learn the inherent relationship between bearing operational status and lifespan from massive bearing data, and use the learned model for RUL prediction [9]. For instance, support vector machines [10], artificial neural networks (ANNs) [11], and hidden Markov models [12] are commonly employed for feature extraction and modeling. Recent breakthroughs in communication technology and networks have significantly lowered the threshold for data-driven methods, which also require relatively little professional knowledge and expert experience, so these methods are gradually receiving attention in the research field. Hybrid methods combine the characteristics of the other two, but their application to industrial systems still faces challenges [13, 14]. This paper therefore adopts the data-driven approach, which is more suitable for the current era.

Currently, data-driven RUL prediction methods can be mainly divided into two-stage and end-to-end approaches. The two-stage rolling bearing RUL prediction method decomposes the RUL prediction task into two stages for processing [15, 16]. Firstly, the rolling bearing degradation state is modeled to obtain its corresponding feature representation; then, the RUL prediction model takes the feature representation as input and produces the RUL prediction output of the rolling bearing. End-to-end RUL prediction methods for rolling bearings directly use raw data as input to the prediction model, simultaneously achieving degradation state representation and RUL prediction tasks [17, 18]. End-to-end models are often intricate, requiring an extensive amount of time and data for training. Conversely, two-stage methods break down the task into two parts for processing, enabling the use of fewer data and computing resources at each stage to improve the model robustness and generalizability. Moreover, the flexibility of the two-stage methods lies in their ability to adopt different feature extraction and preprocessing methods to suit various data types and distributions. Therefore, a data-driven two-stage framework is adopted in this paper to research the RUL prediction of rolling bearings.

As one of the most classical data-driven methods, deep learning has demonstrated remarkable performance in numerous prediction tasks [19, 20]. The common deep neural networks for RUL prediction are mainly recurrent neural networks (RNNs) and convolutional neural networks (CNNs). Research on CNNs is extensive, including traditional convolutional networks and temporal convolutional networks [21]. Cao et al [22] proposed a temporal convolutional network with a residual self-attention mechanism and verified it on multiple datasets. However, only the marginal spectrum is extracted in this study, which contains relatively limited degradation information. Similarly, Song et al [23] proposed a temporal convolutional network based on distributed attention for RUL prediction of aircraft engines. It uses a single attention mechanism, so it is doubtful whether the information contained in multi-modal data can be fully exploited. On the other hand, Jiang et al [24] proposed a CNN for RUL prediction of rolling bearings based on a dual attention mechanism. The feature combination used in this study includes both time domain and frequency domain features, but no screening process was applied to the feature combination. Yang et al [25] proposed an intelligent RUL prediction method based on a dual CNN model architecture, which does not need to extract features in advance, but directly inputs the original vibration signal into the network and uses the powerful feature extraction ability of CNN to learn degradation features. However, large-scale raw vibration signal data may lead to a heavy computational burden for network training. Another approach involves the use of RNNs, a type of neural network capable of modeling and learning from time series data. As rolling bearing RUL prediction has strong temporal characteristics, RNNs are widely used in RUL prediction [26, 27].
Guo et al [28] proposed a predictive model based on multi-scale gated CNN and Transformer to address the rolling bearing RUL prediction problem. This method employs CNN and gated recurrent unit (GRU) for feature extraction, followed by prediction using a Transformer model. However, it trains relatively slowly. Another study, by Shen et al [29], combines multi-head attention (MHA) with bidirectional long short-term memory (Bi-LSTM) for bearing RUL research. In terms of feature selection, degradation feature combinations are selected by considering monotonicity and trend. Furthermore, Liu et al [30] introduced a multi-head neural network model with asymmetric constraints. This research combines Bi-GRU with self-attention mechanisms and employs an ensemble of multiple subnetworks for learning. While this method is effective at focusing on degradation features with diverse transformations, it has relatively long training times. Finally, Qin et al [31] proposed a GRU neural network with dual attention gates for rolling bearing RUL prediction. However, this study only considered the root mean square of the signals as the network's input, which may not be sufficient for extracting degradation features comprehensively. According to the above literature, the existing bearing RUL prediction methods still have some limitations.

(1) Most of the current deep learning-based methods for predicting bearing RUL mainly concentrate on feature extraction and modeling, while ignoring the importance of feature combination. Feature combination can effectively integrate multiple types of features and improve the representation ability of features, thereby enhancing the accuracy of RUL prediction.
(2) Some methods rely on one type of manually crafted features, which may not effectively capture the complex patterns in vibration signals. Other methods only consider single-type attention mechanisms, failing to fully utilize the information contained in multimodal data. These methods have limitations in capturing the complex and dynamic characteristics of rolling bearings.

A RUL prediction method is presented in this paper, which integrates an automatic feature combination extraction mechanism and a GRU network with residual MHA to mitigate the aforementioned issues. The main contributions of this paper are as follows:

(1) A method for automatically extracting feature combinations is proposed in this paper. The method calculates a series of candidate features and then automatically selects a combination of features that better expresses bearing degradation according to a random forest regression-based method.
(2) A GRU network model with residual MHA is proposed. The residual MHA weights the GRU network's output, emphasizing the contribution of important hidden features to RUL prediction. By employing a residual connection, the GRU network's deficiency in extracting long-term sequence features is addressed.
(3) By integrating the above two approaches, a new method for RUL prediction in rolling bearings is proposed. To validate the proposed method's effectiveness and accuracy, experimental predictions are conducted on a dataset of rolling bearing degradation.

The rest of the paper is arranged as follows: section 2 introduces the theoretical basis required for the study, section 3 details the proposed methodology, and section 4 validates the proposed method through a comparative experiment. Finally, a conclusion is drawn in section 5.

2. Theoretical basis

2.1. Hilbert–Huang transform (HHT)

HHT has been extensively applied in bearing PHM research owing to its excellent analysis and processing capabilities for nonlinear and non-stationary signals [32]. By decomposing the original signal into various Intrinsic Mode Functions (IMFs), which indicate different vibration modes, this method accomplishes signal decomposition and analysis. Then the instantaneous frequency and amplitude of each IMF can be obtained by calculating the Hilbert transform. This allows for the energy distribution of the signal at different frequencies to be obtained. The specific calculation steps are as follows:

The first step applies empirical mode decomposition to the original signal a(t) to obtain several IMFs, so that the signal can be written as:

$\begin{eqnarray} &&a\left(t\right) = \sum\limits_{i = 1}^n {{I_i}\left(t\right) + Re\left(t\right)} \end{eqnarray} \tag{ 1 }$

Where, ${{I_i}(t)}$ is the ith IMF and Re(t) is the residual term.

Subsequently, the Hilbert transform of each ${{I_i}(t)}$ is calculated, which is defined as:

$\begin{eqnarray} &&{H_i}\left(t\right) = \frac{1}{\pi }\mathop \int \nolimits_{- \infty }^{+ \infty } \frac{{{I_i}\left(\tau\right)}}{{t - \tau}}\textrm{d}\tau. \end{eqnarray} \tag{ 2 }$

Thus, the analytic signal ${{s_i}(t)}$ of each ${{I_i}(t)}$ is defined as:

$\begin{eqnarray} &&{s_i}\left(t\right) = {I_i}\left(t\right) + j{H_i}\left(t\right) = {B_i}\left(t\right){\textrm{e}^{\,j{\phi _i}\left(t\right)}} \end{eqnarray} \tag{ 3 }$

where ${B_i}(t)$ is the instantaneous amplitude and ${\phi _i}(t)$ is the instantaneous phase, which are defined as:

$\begin{eqnarray} &&{B_i}\left(t\right) = \sqrt {I_i^2\left(t\right) + H_i^2\left(t\right)} \end{eqnarray} \tag{ 4 }$

$\begin{eqnarray} &&{\phi _i}\left(t\right) = \arctan \frac{{{H_i}\left(t\right)}}{{{I_i}\left(t\right)}}. \end{eqnarray} \tag{ 5 }$

The instantaneous frequency of each analytic signal may be derived by differentiating the instantaneous phase with respect to time, which can be calculated by:

$\begin{eqnarray} &&{\omega _i}\left(t\right) = \frac{{\textrm{d}{\phi _i}\left(t\right)}}{{\textrm{d}t}}. \end{eqnarray} \tag{ 6 }$

According to equations (1), (3) and (6), the original signal a(t) can be rephrased as:

$\begin{eqnarray} &&a\left(t\right) = \sum\limits_{i = 1}^n {{B_i}\left(t\right){\textrm{e}^{\,j\int {{\omega_i}\left(t\right)} \textrm{d}t}} + Re\left(t\right)}. \end{eqnarray} \tag{ 7 }$

The marginal spectrum is an application of HHT, which is employed to analyze the instantaneous frequency and amplitude of time–frequency signals [33]. It can be calculated by:

$\begin{eqnarray} &&M\left({\,f}\right) = \int_0^T {H\left(f,t\right)\textrm{d}t} \end{eqnarray} \tag{ 8 }$

where $H(f,t)$ denotes the Hilbert spectrum of the signal obtained by the HHT at frequency f and time t, and T, the upper bound of integration, is the time horizon of the integration.
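As a rough illustration of equations (2)–(6), the sketch below builds a discrete analytic signal with a one-sided DFT and reads off the envelope, phase and instantaneous frequency. It is a minimal pure-Python sketch and not the implementation used in this paper: the function names `analytic_signal` and `envelope_and_frequency`, the sampling rate `fs`, the O(N²) DFT and the finite-difference approximation of equation (6) are all assumptions made for the example, and the input is a test tone rather than an IMF produced by empirical mode decomposition.

```python
import cmath
import math


def analytic_signal(x):
    """Discrete analytic signal x + j*H[x], built with a one-sided DFT (O(N^2))."""
    n = len(x)
    # Forward DFT of the real input.
    spectrum = [
        sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
        for k in range(n)
    ]
    # Keep DC (and Nyquist when n is even) once, double the positive
    # frequencies and zero the negative ones.
    gain = [0.0] * n
    gain[0] = 1.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        gain[k] = 2.0
    # Inverse DFT of the one-sided spectrum.
    return [
        sum(gain[k] * spectrum[k] * cmath.exp(2j * math.pi * k * m / n)
            for k in range(n)) / n
        for m in range(n)
    ]


def envelope_and_frequency(x, fs):
    """Return envelope B(t), unwrapped phase phi(t) and frequency in Hz."""
    s = analytic_signal(x)
    amplitude = [abs(v) for v in s]          # equation (4)
    wrapped = [cmath.phase(v) for v in s]    # equation (5), computed with atan2
    # Remove 2*pi jumps so the phase can be differentiated.
    phase = [wrapped[0]]
    for p in wrapped[1:]:
        step = p - phase[-1]
        step -= 2 * math.pi * round(step / (2 * math.pi))
        phase.append(phase[-1] + step)
    # Equation (6) by finite differences, converted from rad/s to Hz.
    freq = [(phase[i + 1] - phase[i]) * fs / (2 * math.pi)
            for i in range(len(phase) - 1)]
    return amplitude, phase, freq
```

For a pure tone that completes a whole number of cycles in the window, the envelope returned by this sketch is constant and the instantaneous frequency equals the tone frequency.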

2.2. Gated recurrent unit network

In the field of time series modeling and prediction, GRU has become a popular choice due to its effectiveness. It is a type of recurrent neural network that is derived from LSTM [34]. Compared with LSTM, GRU uses only an update gate and a reset gate to control the selection and retention of memory, thus significantly reducing the number of parameters and computational complexity. The structure of a single GRU cell is shown in figure 1.

**Figure 1.** GRU cell structure.

A GRU network can predict future states by learning the historical information of sequential data. The computation of a GRU neuron can be described as follows: at time step t, given the input x_t , the GRU cell computes the reset gate r_t and the update gate z_t . The reset gate controls the forgetting of historical information, while the update gate controls the fusion of new information. Next, the candidate hidden state ${\tilde h_t}$ is calculated and interpolated with the previous hidden state ${h_{t - 1}}$ to obtain the current hidden state ${h_t}$ . Finally, the hidden state h_t at time step t is output and used for prediction and for the computation at the next time step. The specific calculation formulas are as follows:

$\begin{align} {h_t} &= {{{\mathcal H}}^G}\left( {{h_{t - 1}},{x_t}} \right)\nonumber \\ &= \left\{\begin{array}{l} {z_t} = \sigma \left( {{V_z} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {p_z}} \right)\\ {r_t} = \sigma \left( {{V_r} \cdot \left[ {{h_{t - 1}},{x_t}} \right] + {p_r}} \right)\\ {{\tilde h}_t} = \tanh \left( {{V_h} \cdot \left[ {{r_t} \odot {h_{t - 1}},{x_t}} \right] + {p_h}} \right)\\ {h_t} = \left( {1 - {z_t}} \right) \odot {h_{t - 1}} + {z_t} \odot {{\tilde h}_t} \end{array} \right. \end{align} \tag{ 9 }$

where, ${{\mathcal H}^G}$ is the non-linear transformation function of the GRU neuron. The weights V_z , V_r and V_h are the weights of the current input x_t and the recurrent input $h_{t-1}$ . The biases p_z , p_r , and p_h are the bias weights of the corresponding nodes. σ is the sigmoid function, $\textrm{tanh}$ is the hyperbolic tangent function, and $\odot$ is element-wise multiplication. z_t and r_t are the outputs of the update gate and reset gate of the GRU cell unit. ${\tilde h_t}$ is the candidate hidden state at the current time step, h_t is the hidden state at the current time step, and ${h_{t - 1}}$ is the hidden state at the previous time step.
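Equation (9) can be traced line by line in a few lines of code. The following is a minimal pure-Python sketch under stated assumptions: the helper names `sigmoid`, `affine` and `gru_step`, and the dictionary `params` holding the matrices V_z , V_r , V_h and biases p_z , p_r , p_h as nested lists, are hypothetical and chosen only for illustration.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def affine(weights, vec, bias):
    """Return weights @ vec + bias for a matrix stored as a list of rows."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weights, bias)]


def gru_step(h_prev, x_t, params):
    """One GRU update following equation (9)."""
    concat = h_prev + x_t                                  # [h_{t-1}, x_t]
    z = [sigmoid(v) for v in affine(params["V_z"], concat, params["p_z"])]
    r = [sigmoid(v) for v in affine(params["V_r"], concat, params["p_r"])]
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x_t   # [r_t * h_{t-1}, x_t]
    h_cand = [math.tanh(v)
              for v in affine(params["V_h"], gated, params["p_h"])]
    # Interpolate between the previous state and the candidate state.
    return [(1.0 - zi) * hi + zi * ci
            for zi, hi, ci in zip(z, h_prev, h_cand)]
```

With all weights and biases set to zero, both gates equal 0.5 and the candidate state is zero, so the step simply halves the previous hidden state, which is a convenient sanity check of the interpolation in the last line of equation (9).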

2.3. MHA

The attention mechanism is a structure specifically designed to handle the interdependence between input and output data, which can automatically learn and calculate the weight of each input data's contribution to the output data [35]. In the field of deep learning, attention mechanisms have become an important technology widely used in various tasks, especially in natural language processing. The MHA is an algorithm derived from the attention mechanism, which was first applied to the transformer model and proposed by Vaswani et al [36]. It captures key information in different subspaces through parallel computation and combines them into the final output. The state of rolling bearings is usually affected by multiple factors, such as vibration, temperature, pressure, etc. Compared with the common attention mechanism, the use of MHA mechanism allows the model to focus on multiple different key information at the same time, and thus improve the accuracy and reliability of predictions [37]. The MHA can be calculated as follows:

$\begin{eqnarray} &&\textrm{MHA}\left(Q,K,V\right) = \textrm{Concat}\left( \alpha_1,\alpha_2, \ldots ,\alpha_n \right){W} \end{eqnarray} \tag{ 10 }$

$\begin{eqnarray} &&\alpha_i = \textrm{softmax}\left( {\frac{{Q_i{K^\intercal_i}}}{{\sqrt {{\textrm{d}_k}} }}} \right)V_i \end{eqnarray} \tag{ 11 }$

$\begin{eqnarray} &&{Q_i} = QW_i^{\,Q}, {K_i} = KW_i^K, {V_i} = VW_i^V,i = 1, \ldots ,n. \end{eqnarray} \tag{ 12 }$

where Q, K, and V denote the query, key, and value, respectively. n is the number of heads of MHA. W is the learnable output weight matrix. α_i is the output of the ith attention head. Q_i , K_i , V_i are the query subspace, key subspace, and value subspace respectively, which can be obtained by multiplying Q, K, V with the learnable weight parameters $W_i^{\,Q}$ , $W_i^K$ , $W_i^V$ respectively. $\sqrt {{\textrm{d}_k}}$ is the scaling factor, where d_k is the dimension of the key vectors.
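Equations (10)–(12) can be sketched as follows. This is a minimal pure-Python illustration, not the implementation used in this paper: matrices are nested lists with one row per time step, and the names `matmul`, `softmax`, `attention_head` and `multi_head_attention`, together with the argument names `w_q`, `w_k`, `w_v` and `w_out`, are assumptions introduced for the example.

```python
import math


def matmul(a, b):
    """Matrix product of two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def softmax(row):
    m = max(row)                         # subtract the max for stability
    e = [math.exp(v - m) for v in row]
    total = sum(e)
    return [v / total for v in e]


def attention_head(q, k, v):
    """Scaled dot-product attention of one head, equation (11)."""
    d_k = len(k[0])
    k_t = [list(col) for col in zip(*k)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(q, k_t)]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def multi_head_attention(q, k, v, w_q, w_k, w_v, w_out):
    """Equations (10) and (12): project, attend per head, concatenate, mix."""
    heads = [
        attention_head(matmul(q, wq), matmul(k, wk), matmul(v, wv))
        for wq, wk, wv in zip(w_q, w_k, w_v)
    ]
    # Concatenate the head outputs along the feature axis, row by row.
    concat = [sum((head[t] for head in heads), []) for t in range(len(q))]
    return matmul(concat, w_out)
```

Each head works in its own projected subspace, so different heads can weight the same time steps differently before the output matrix mixes them.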

2.4. Residual connection

Deep neural networks have powerful representation capabilities, but they often encounter the issue of gradient vanishing during training. To mitigate this problem, the residual connection can be used, which introduces direct connections across layers in the network to ensure that the output contains information from the input, thus allowing gradients to propagate to deeper layers [38]. The residual connection is defined as follows:

$\begin{eqnarray} &&C\left(a,\theta_l\right) = f\left(a,\theta_l\right)+a \end{eqnarray} \tag{ 13 }$

where a is the input vector, θ_l is the parameter of the lth layer, $f(a,\theta_l)$ is the transformation of the input a.

Residual connections can be used to build deep neural networks by stacking multiple residual blocks. During training, residual connections allow gradients to flow without losing information. Compared to traditional network structures, residual connections can better mitigate the issue of vanishing gradients and expedite model convergence.
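Differentiating equation (13) makes the gradient argument explicit. For a training loss $\mathcal{L}$ (a symbol introduced here only for illustration), the chain rule gives:

$\begin{eqnarray} &&\frac{\partial \mathcal{L}}{\partial a} = \frac{\partial \mathcal{L}}{\partial C}\left(\frac{\partial f\left(a,\theta_l\right)}{\partial a} + I\right) \end{eqnarray}$

where I is the identity matrix. The identity term passes $\partial \mathcal{L}/\partial C$ back to the input unchanged, so a gradient path to earlier layers remains even when $\partial f/\partial a$ is close to zero.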

3. Methodology

This section introduces the proposed framework for estimating the RUL of rolling bearings, depicted in its entirety in figure 2. The approach comprises two primary modules: automatic feature combination extraction (AFCE) and the residual multi-head attention gated recurrent unit (RMAGRU). AFCE includes two parts: feature extraction and automatic feature combination selection. The feature extraction part uses feature parameters to define the computation of features. Next, an enhanced random forest algorithm is applied to the extracted features to select feature combinations automatically. The selected combinations have scores that meet a specific threshold and represent features with more pronounced degradation. Finally, RMAGRU is developed and applied to predict the RUL of rolling bearings. The following sections provide a detailed description of each module.

3.1. Automatic feature combination extraction

3.1.1. Feature extraction.

Although deep learning methods such as GRU have strong nonlinear modeling capabilities and can learn features autonomously, the high sampling frequency of the sensors produces an excessive volume of raw bearing data. Using a GRU network directly for feature extraction from the raw data and subsequent RUL prediction would inflate the number of network parameters and significantly increase the computational burden of network training [39]. In rolling bearing prognostics, engineers typically have a thorough understanding of bearing operating principles and features, so they can select and define feature parameters relevant to the problem. Selecting appropriate feature parameters reduces the computational load on the network and makes the model less susceptible to noise and irrelevant information, so that it adapts better to diverse datasets.

Based on the above considerations, we calculated 23 of the rolling bearing features defined in [40]. The f₂₄ of [40] was omitted because its calculation may produce complex numbers, which are inconvenient to handle computationally. These features represent various aspects of the signal in both the time and frequency domains and are important for predicting RUL. In addition, the signal envelope feature R and the sum of marginal spectra have been shown to characterize signal performance effectively [41]. Consequently, we computed the signal envelope feature R and the sum of marginal spectra, designating them as f₂₄ and f₂₅. The specific calculation process is as follows:

Initially, the Hilbert transform of the raw signal is computed by equation (2), with the raw signal a(t) taking the place of ${{I_i}(t)}$ .

Next, the envelope B(t) of the signal is extracted by equation (4).

Then, the envelope feature R of the signal is calculated as f₂₄ by:

$\begin{eqnarray} &&{f_{24}} = R = \left| {\frac{{E\left[{B^4}\left(t\right)\right] - {E^{\,2}}\left[{B^2}\left(t\right)\right]}}{{{E^{\,2}}\left[{B^2}\left(t\right)\right]}}} \right| \end{eqnarray} \tag{ 14 }$

where, $E[{B^4}(t)]$ is the fourth moment, and ${E^{\,2}}[{B^2}(t)]$ is the square of the second moment.

According to equation (8), the sum of marginal spectra of the HHT transform is calculated as f₂₅ by:

$\begin{eqnarray} &&{f_{25}} = \sum_{k = 1}^K{M\left(f_{r_k}\right)} \end{eqnarray} \tag{ 15 }$

Where, K is the number of spectral lines, $f_{r_k}$ is the frequency value of the kth spectral line.
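Once the envelope and the Hilbert spectrum have been sampled, equations (8), (14) and (15) reduce to a few sums. The sketch below is a minimal illustration under stated assumptions: the expectations in equation (14) are replaced by sample means, the integral in equation (8) is approximated by a rectangle rule with time step `dt`, the Hilbert spectrum is supplied as a nested list `hilbert_spectrum[k][t]`, and all function names are hypothetical.

```python
def envelope_feature(envelope):
    """f24 of equation (14), with expectations taken as sample means."""
    n = len(envelope)
    m2 = sum(b ** 2 for b in envelope) / n    # E[B^2(t)]
    m4 = sum(b ** 4 for b in envelope) / n    # E[B^4(t)]
    return abs((m4 - m2 ** 2) / m2 ** 2)


def marginal_spectrum(hilbert_spectrum, dt):
    """Equation (8): integrate H(f, t) over time for every spectral line."""
    return [sum(row) * dt for row in hilbert_spectrum]


def marginal_spectrum_sum(hilbert_spectrum, dt):
    """f25 of equation (15): sum of the marginal spectrum over all lines."""
    return sum(marginal_spectrum(hilbert_spectrum, dt))
```

A constant envelope gives f₂₄ = 0, so under this reading f₂₄ grows as the envelope becomes more impulsive.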

The final 25 features are shown in table 1, where $f_1-f_{11}$ are time domain features, $f_{12}-f_{23}$ are frequency domain features, and $f_{24}-f_{25}$ are time–frequency domain features.

Table 1. The feature parameters.

Feature	Equation	Feature	Equation	Feature	Equation
f₁	$\frac{\sum_{i = 1}^Na_i}N$	f₁₀	$\frac{f_4}{\frac1N\sum_{i = 1}^N\left|a_i\right|}$	f₁₉	$\sqrt{\frac{\sum_{k = 1}^Kf_{r_k}^{\,4}s_k}{\sum_{k = 1}^Kf_{r_k}^{\,2}s_k}}$
f₂	$\sqrt{\frac{\sum_{i = 1}^{N}(a_i-f_1)^2}{N-1}}$	f₁₁	$\frac{f_5}{\frac1N\sum_{i = 1}^N\left|a_i\right|}$	f₂₀	$\frac{\sum_{k = 1}^{K}f_{r_k}^{\,2}s_k}{\sqrt{\sum_{k = 1}^{K}s_k\sum_{k = 1}^{K}f_{r_k}^{\,4}s_k}}$
f₃	$\left(\frac{\sum_{i = 1}^N\sqrt{\left|a_i\right|}}{N}\right)^2$	f₁₂	$\frac{\sum_{k = 1}^{K}s_k}{K}$	f₂₁	$\frac{f_{17}}{f_{16}}$
f₄	$\sqrt{\frac{\sum_{i = 1}^N(a_i)^2}N}$	f₁₃	$\frac{\sum_{k = 1}^K(s_k-f_{12})^2}{K-1}$	f₂₂	$\frac{\sum_{k = 1}^K(f_{r_k}-f_{16})^3s_k}{Kf_{17}^{\,3}}$
f₅	$\max\left|a_i\right|$	f₁₄	$\frac{\sum_{k = 1}^K(s_k-f_{12})^3}{K(\sqrt{f_{13}})^3}$	f₂₃	$\frac{\sum_{k = 1}^K(f_{r_k}-f_{16})^4s_k}{Kf_{17}^{\,4}}$
f₆	$\frac{\sum_{i = 1}^N(a_i-f_1)^3}{(N-1)f_2^3}$	f₁₅	$\frac{\sum_{k = 1}^{K}(s_k-f_{12})^{4}}{Kf_{13}^{\,2}}$	f₂₄	$\left| {\frac{{E[{B^4}(t)] - {E^{\,2}}[{B^2}(t)]}}{{{E^{\,2}}[{B^2}(t)]}}} \right|$
f₇	$\frac{\sum_{i = 1}^{N}(a_i-f_{1})^{4}}{(N-1)f_{2}^{4}}$	f₁₆	$\frac{\sum_{k = 1}^Kf_{r_k}s_k}{\sum_{k = 1}^Ks_k}$	f₂₅	$\sum_{k = 1}^K {M(f_{r_k})}$
f₈	$\frac{f_5}{f_4}$	f₁₇	$\sqrt{\frac{\sum_{k = 1}^{K}(f_{r_k}-f_{16})^{2}s_k}{K}}$
f₉	$\frac{f_5}{f_3}$	f₁₈	$\sqrt{\frac{\sum_{k = 1}^{K}f_{r_k}^{\,2}s_k}{\sum_{k = 1}^{K}s_k}}$

where a_i is the signal sequence for $i = 1,2,\ldots ,N$ and N is the number of data points. s_k is the spectrum for $k = 1,2,\ldots ,K$ and K is the number of spectral lines. $f_{r_k}$ is the frequency value of the kth spectral line.
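The time domain entries of table 1 translate directly into code. The following pure-Python sketch computes f₁ to f₁₁ for one signal segment; the function name `time_domain_features` and the dictionary keys are assumptions made for illustration, and the input is assumed to be a non-constant segment so that no denominator is zero.

```python
import math


def time_domain_features(a):
    """Compute f1..f11 of table 1 for one signal segment a."""
    n = len(a)
    abs_a = [abs(v) for v in a]
    mean_abs = sum(abs_a) / n

    f1 = sum(a) / n
    f2 = math.sqrt(sum((v - f1) ** 2 for v in a) / (n - 1))
    f3 = (sum(math.sqrt(v) for v in abs_a) / n) ** 2
    f4 = math.sqrt(sum(v * v for v in a) / n)
    f5 = max(abs_a)
    f6 = sum((v - f1) ** 3 for v in a) / ((n - 1) * f2 ** 3)
    f7 = sum((v - f1) ** 4 for v in a) / ((n - 1) * f2 ** 4)
    f8 = f5 / f4
    f9 = f5 / f3
    f10 = f4 / mean_abs
    f11 = f5 / mean_abs

    values = [f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11]
    return {"f%d" % (i + 1): v for i, v in enumerate(values)}
```

Applying the function separately to the horizontal and vertical vibration segments gives the time domain part of each channel's feature vector.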

3.1.2. Feature combination automatic selection.

The raw vibration signal monitored by the sensors includes horizontal and vertical vibration signals, which carry different information about the degradation of the bearing. Extracting features from the horizontal vibration signal yields 25 features denoted $F_{h_1}$ to $F_{h_{25}}$ . Similarly, extracting features from the vertical vibration signal yields another 25 features denoted $F_{v_1}$ to $F_{v_{25}}$ . Consequently, a total of 50 features are obtained. However, some of these features have similar properties, while others may not show significant degradation behavior. Feeding all of these features into the neural network for training would increase the computational burden and may even cause loss of degradation information, thus reducing the prediction accuracy of RUL. Moreover, selecting an optimal combination of features is a challenging task, and random or subjective selection may not achieve the best results. To address this issue, this paper proposes a feature combination algorithm based on random forest automatic selection. This algorithm evaluates the importance of each feature combination and selects the best feature combination, whose score meets the preset threshold.

Random forest obtains the final classification or regression result through the voting of multiple decision trees. In terms of feature selection, random forest has been shown to have good performance [42]. Random forest algorithms typically use classification and regression trees (CART) as decision trees. The traditional CART tree uses the Gini impurity function as the criterion for splitting nodes, but this criterion may not be sensitive enough to the degradation characteristics of bearings. To better adapt to the bearing degradation characteristics, we propose a weighted Gini impurity function. This modified function uses the bearing degradation correlation to assist the node splitting optimization of the CART tree. The calculation formula is given in equation (16):

$\begin{align} I_\textrm{G}\left(x,v\right) = \frac{1}{N_\textrm{s}}\left(\sum_{y_i\in X_{\textrm{left}}}w_i\left(y_i-\bar{y}_{\textrm{left}}\right)^2+\sum_{y_j\in X_{\textrm{right}}}w_j\left(y_j-\bar{y}_{\textrm{right}}\right)^2\right) \end{align} \tag{ 16 }$

Where x is a splitting variable, v is a splitting value of the splitting variable, $N_\textrm{s}$ is the number of all training samples in the current node. X_left and X_right are the training sample sets of the left and right child nodes, respectively. $\bar{y}_{\textrm{left}}$ and $\bar{y}_{\textrm{right}}$ are the average values of the target variables of the left and right node samples, respectively. y_i , y_j are the current sample target variables of the left and right nodes, respectively. w_i , w_j denote the Spearman's rank correlation coefficient between y_i , y_j and the current RUL true value, respectively.
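As written, equation (16) is a weighted sum of squared deviations from the child-node means, divided by the number of samples in the node. A minimal pure-Python sketch is given below; the function name `weighted_impurity` is hypothetical, and the weights are assumed to be supplied as precomputed per-sample values because their computation is not reproduced here.

```python
def weighted_impurity(y_left, w_left, y_right, w_right):
    """Equation (16) for one candidate split of a node."""

    def weighted_sse(y, w):
        # Weighted squared deviations from the (unweighted) child mean.
        if not y:
            return 0.0
        mean = sum(y) / len(y)
        return sum(wi * (yi - mean) ** 2 for yi, wi in zip(y, w))

    n_s = len(y_left) + len(y_right)    # samples in the current node
    return (weighted_sse(y_left, w_left)
            + weighted_sse(y_right, w_right)) / n_s
```

With unit weights the expression reduces to the usual squared-error split criterion of a regression tree, so the weights only change how strongly each sample's deviation counts.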

The improved random forest algorithm can evaluate the importance of each feature by calculating its importance in multiple CART trees and further select feature combinations. The specific process of the automatic feature combination selection model studied in this paper is shown in figure 3.

**Figure 3.** Automatic feature combination extraction model.

The automatic feature combination algorithm studied in this paper follows the process outlined below:

(1) For an original dataset X with dimensions (m, n), where m is the number of samples and n is the number of features, random sampling with replacement is performed to construct the sub-datasets $x_1,x_2,\ldots ,x_s$ , where the dimension of x_i is (m, k), $i = 1,2,\ldots ,s$ and k < n.
(2) A CART model is trained on each sub-dataset. The optimal split variable and split point are found by minimizing the Gini impurity function $I_\textrm{G}$ , as represented in equation (17). Repeating this process for every sub-dataset generates multiple CART models.
$\begin{eqnarray} &&\left(x^*,v^*\right) = \operatorname{argmin}_{x_i,v_{ij}}I_\textrm{G}\left(x_i,v_{ij}\right) \end{eqnarray} \tag{ 17 }$
where $x^*$ and $v^*$ are the split variable and the associated split value that solve the minimization problem, x_i is a candidate bearing degradation feature, and v_ij is a candidate split value of x_i .
(3) All the trained CART models are used to predict each feature.
(4) To evaluate the importance of features, the mean of the values predicted by all CART models is used as the score of each feature.
(5) Based on the scores calculated in step 4, all features are sorted in descending order.
(6) By establishing a feature selection threshold, feature combinations meeting the threshold criterion are automatically chosen in sequential order.
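Two parts of the procedure above lend themselves to a short sketch: the split search of equation (17) for a single candidate feature, and the ranking and thresholding of steps (5) and (6). The code below is a hedged pure-Python illustration, not the algorithm as implemented in this paper. The names `best_split` and `select_features` are hypothetical, the per-sample weights `w` are assumed to be given, and the rule "keep every feature whose score is at least the threshold" is one possible reading of step (6).

```python
def best_split(x, y, w):
    """Search the split value of one feature x that minimizes equation (16)."""

    def weighted_sse(idx):
        mean = sum(y[i] for i in idx) / len(idx)
        return sum(w[i] * (y[i] - mean) ** 2 for i in idx)

    best_v, best_score = None, float("inf")
    # The largest value is skipped: it would leave the right child empty.
    for v in sorted(set(x))[:-1]:
        left = [i for i, xi in enumerate(x) if xi <= v]
        right = [i for i, xi in enumerate(x) if xi > v]
        score = (weighted_sse(left) + weighted_sse(right)) / len(x)
        if score < best_score:
            best_v, best_score = v, score
    return best_v, best_score


def select_features(scores, threshold):
    """Steps (5) and (6): rank by score, keep those reaching the threshold."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score >= threshold]
```

A usage example with made-up scores: `select_features({"F_h1": 0.30, "F_v4": 0.05, "F_h12": 0.45}, 0.2)` keeps `F_h12` and `F_h1`, in that order.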

3.2. RMAGRU

To address overfitting and the limited ability of GRU to model long-term dependencies when dealing with high-dimensional data, this paper proposes a new GRU-derived model called RMAGRU. It incorporates both MHA and residual connections to enhance the model's focus on input information and its ability to model long-term dependencies. The network architecture of RMAGRU is illustrated in figure 4. RMAGRU can adaptively fuse input features and aggregate relevant information, better describing the relationships between data. Essentially, the hidden-state input of each layer undergoes a residual multi-head attention calculation, capturing important degradation characteristics from the previous layer's hidden state.

**Figure 4.** RMAGRU structure.

The internal structure of the residual multi-head attention mechanism (RMHA) is demonstrated in figure 5. In the operation of RMHA, the MHA is used to calculate the attention weighted representation of different parts of the hidden state input. These attention-weighted representations are added to the initial hidden state input, thus establishing a residual connection to preserve the original input information. The output of the residual connection goes through the BN layer for batch normalization and is used as the output of RMHA for the subsequent calculation.

The calculation of the residual MHA, based on equations (10)–(13), can be defined as follows:

$\begin{eqnarray} &&\textrm{RMHA} = \textrm{BN}\left( {{h_t} + \textrm{MHA}\left(Q,K,V\right)} \right) \end{eqnarray} \tag{ 18 }$

where RMHA is the output of the residual MHA and BN is batch normalization.

Incorporating equations (9)–(13), (18), the calculation process of an RMAGRU neuron at time t can be described as follows:

$\begin{align} {h_t} &= {\mathcal H^R}\left( {{h_{t - 1}},{x_t}} \right)\nonumber\\ & = \left\{\begin{array}{l} {Q_i}_{t-1} = {h_{t - 1}}W_i^{\,Q},{K_i}_{t-1} = {h_{t - 1}}W_i^K,{V_i}_{t-1} = {h_{t - 1}}W_i^V,\\i = 1, \ldots ,n\\ {\alpha_i}_{t-1} = \textrm{softmax}\left( {\frac{{{Q_i}_{t-1}{K_i}_{t-1}^\intercal}}{{\sqrt {{\textrm{d}_k}} }}} \right){V_i}_{t-1},i = 1, \ldots ,n\\ \textrm{MHA}_{t-1} = \textrm{Concat}\left( {\alpha_1}_{t-1},{\alpha_2}_{t-1}, \ldots ,{\alpha_n}_{t-1} \right){W}\\ h_{t - 1}^* = \textrm{BN}\left({h_{t - 1}} + \textrm{MHA}_{t-1}\right)\\ {z_t} = \sigma \left( {{V_z} \cdot \left[ {h_{t - 1}^*,{x_t}} \right] + {p_z}} \right)\\ {r_t} = \sigma \left( {{V_r} \cdot \left[ {h_{t - 1}^*,{x_t}} \right] + {p_r}} \right)\\ {{\tilde h}_t} = \tanh \left( {{V_h} \cdot \left[ {{r_t} \odot h_{t - 1}^*,{x_t}} \right] + {p_h}} \right)\\ {h_t} = \left( {1 - {z_t}} \right) \odot h_{t - 1}^* + {z_t} \odot {{\tilde h}_t} \end{array} \right. \end{align} \tag{ 19 }$

where $\mathcal H^R$ is the non-linear transformation function of the RMAGRU neuron. ${Q_i}_{t-1}$ , ${K_i}_{t-1}$ , and ${V_i}_{t-1}$ are the query subspace, key subspace, and value subspace of the ith attention head at the previous time step, respectively. ${\alpha_i}_{t-1}$ is the output of the ith attention head at the previous time step. $h_{t - 1}^*$ is the hidden state from the previous time step after it has passed through the RMHA in the current layer.

As equation (19) indicates, the operation of RMAGRU can be divided into two parts. Firstly, the RMAGRU model is able to effectively guide the hidden state information from the previous time step and enhance the network's memory capacity by utilizing the residual MHA. Secondly, similar to GRU, RMAGRU is capable of effectively extracting degradation information from the hidden state and adjusting it based on update criteria. Compared to traditional RNN models, RMAGRU's hidden state contains more degradation information and the problem of gradient vanishing is alleviated through residual connections.
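
The two parts of equation (19) can be sketched in NumPy as a single time step. This is a minimal illustration rather than the authors' implementation, and it rests on several assumptions that the text does not fix: the hidden state is a matrix with one row per sample, attention is computed across those rows, BN is a plain per-feature standardization without learned scale or shift, weights multiply from the right (row-vector convention), and all names (`rmha`, `rmagru_step`, the parameter dictionary `p`) and sizes are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def batch_norm(x, eps=1e-5):
    # Assumption: standardize each feature over the rows, no learned scale/shift.
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

def rmha(h, Wq, Wk, Wv, Wo):
    # h: (rows, d); Wq, Wk, Wv: (n_heads, d, d_k); Wo: (n_heads * d_k, d)
    heads = []
    for wq, wk, wv in zip(Wq, Wk, Wv):
        q, k, v = h @ wq, h @ wk, h @ wv
        d_k = wk.shape[1]
        heads.append(softmax(q @ k.T / np.sqrt(d_k)) @ v)
    mha = np.concatenate(heads, axis=1) @ Wo
    return batch_norm(h + mha)  # residual connection, then BN (equation (18))

def rmagru_step(h_prev, x_t, p):
    # Part 1: refine the previous hidden state with residual multi-head attention.
    h_star = rmha(h_prev, p["Wq"], p["Wk"], p["Wv"], p["Wo"])
    # Part 2: ordinary GRU gating applied to the refined state.
    hx = np.concatenate([h_star, x_t], axis=1)
    z = sigmoid(hx @ p["Vz"] + p["pz"])            # update gate
    r = sigmoid(hx @ p["Vr"] + p["pr"])            # reset gate
    rhx = np.concatenate([r * h_star, x_t], axis=1)
    h_tilde = np.tanh(rhx @ p["Vh"] + p["ph"])     # candidate state
    return (1.0 - z) * h_star + z * h_tilde

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
rows, d, d_in, n_heads, d_k = 4, 8, 3, 2, 4
p = {
    "Wq": rng.normal(size=(n_heads, d, d_k)),
    "Wk": rng.normal(size=(n_heads, d, d_k)),
    "Wv": rng.normal(size=(n_heads, d, d_k)),
    "Wo": rng.normal(size=(n_heads * d_k, d)),
    "Vz": rng.normal(size=(d + d_in, d)), "pz": np.zeros(d),
    "Vr": rng.normal(size=(d + d_in, d)), "pr": np.zeros(d),
    "Vh": rng.normal(size=(d + d_in, d)), "ph": np.zeros(d),
}
h_prev = rng.normal(size=(rows, d))
x_t = rng.normal(size=(rows, d_in))
h_t = rmagru_step(h_prev, x_t, p)
```

The new hidden state `h_t` has the same shape as `h_prev`, so the step can be applied repeatedly along a sequence.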

The input of feature data goes through multiple layers of RMAGRU to learn and extract relevant information. The output of the previous RMAGRU at each time step is used as the input to the current time step, which is divided into two branch outputs via RMHA. One branch continues the forward propagation and performs the RMAGRU internal computation, while the other branch multiplies the output of RMHA with the weight coefficient β_t for the current time step to obtain the attention weight γ_t for the current time step. The value of β_t is determined by equation (20), which makes $\beta_t\gt 1$ , in line with the property that the importance of the degraded features increases during the degradation process.

$\begin{eqnarray} \beta_{t+1} = \begin{cases}\beta_t\times D\left(g_t,g_{t-1}\right),&\textrm{if} \quad g_t\geqslant g_{t-1}\\\beta_t,&\textrm{else}\end{cases} \end{eqnarray} \tag{ 20 }$

Where,

$\begin{eqnarray} &&D\left(g_t,g_{t-1}\right) = 1+\textrm{sigmoid}\left(\sqrt{g_t-g_{t-1}}\right) \end{eqnarray} \tag{ 21 }$

$\begin{eqnarray} &&g_t = \left({o_t} - {{h}_t}\right)^2 \end{eqnarray} \tag{ 22 }$

Where, o_t is the RUL true value at time t.
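
As a worked illustration of equations (20)–(22), the following sketch updates the weight coefficient over three time steps. The initial value of β, the sample values, and the function names are assumptions made for the example; the text does not specify them.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def squared_error(o_t, h_t):
    # Equation (22): g_t = (o_t - h_t)^2
    return (o_t - h_t) ** 2

def growth_factor(g_t, g_prev):
    # Equation (21); only called when g_t >= g_prev, so the square root is real.
    return 1.0 + sigmoid(math.sqrt(g_t - g_prev))

def update_beta(beta_t, g_t, g_prev):
    # Equation (20): scale beta when the error did not shrink, otherwise keep it.
    if g_t >= g_prev:
        return beta_t * growth_factor(g_t, g_prev)
    return beta_t

beta = 1.0  # assumed initial value, not stated in the text
pairs = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.65)]  # hypothetical (o_t, h_t) values
errors = [squared_error(o, h) for o, h in pairs]
history = [beta]
for g_prev, g_t in zip(errors, errors[1:]):
    beta = update_beta(beta, g_t, g_prev)
    history.append(beta)
```

Because the sigmoid of a non-negative argument lies in [0.5, 1), the factor D lies in [1.5, 2), so under these equations β never decreases: it grows whenever the squared error does not shrink and is otherwise left unchanged.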

All RMAGRU units are used to compute the attention weights γ for all time steps, which are then normalized and concatenated. Finally, the concatenated attention weights are multiplied by the output h_out of the last hidden layer and subjected to batch normalization. This process is defined by equations (23)–(25).

$\begin{eqnarray} &&\gamma_{t} = \beta_{t}\cdot \textrm{RMHA}_{t} \end{eqnarray} \tag{ 23 }$

$\begin{eqnarray} &&\gamma = \textrm{Concat}\left[\frac{\gamma_1}{\sum_{{i = 1}}^t\gamma_i},\frac{\gamma_2}{\sum_{{i} = 1}^t\gamma_i},\ldots ,\frac{\gamma_t}{\sum_{{i} = 1}^t\gamma_i}\right] \end{eqnarray} \tag{ 24 }$

$\begin{eqnarray} &&h_{bn} = \textrm{BN}\left(\gamma\cdot h_{\textrm{out}}\right) \end{eqnarray} \tag{ 25 }$

The final process of predicting RUL from the fully connected layer output can be defined as equation (26).

$\begin{eqnarray} &&\hat o = \varphi \left( {h_{bn}} \right) \end{eqnarray} \tag{ 26 }$

Where, $\hat o$ is the predicted value and $\varphi$ is the activation function, for which LeakyReLU is chosen in this paper.

The RMAGRU network uses the mean squared error (MSE) as the loss function, defined as:

$\begin{eqnarray} &&\textrm{Loss}\left(o,\hat o\right) = \frac{1}{N}\sum\limits_{i = 1}^N {{{\left({o_i} - {{\hat o}_i}\right)}^2}} \end{eqnarray} \tag{ 27 }$

where, N is the length of the output vector, and o is the true RUL label.

The RUL label in this study is defined as the remaining running time of the bearing, from the current time until it fails, expressed as a percentage of its total running time. It can be calculated from equation (28),

$\begin{eqnarray} &&o = \frac{{T - t_{\textrm{cr}}}}{T} \times 100\% \end{eqnarray} \tag{ 28 }$

where, T is the total running time of the bearing from integrity to failure, and t_cr is the current running time of the bearing.
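
Equations (27) and (28) can be illustrated with a few lines of Python. The total running time and the predicted values below are hypothetical numbers chosen only to make the arithmetic easy to follow.

```python
def rul_label(t_cr, total):
    # Equation (28): percentage of the total running time that remains.
    return 100.0 * (total - t_cr) / total

def mse(o, o_hat):
    # Equation (27): mean squared error between labels and predictions.
    return sum((a - b) ** 2 for a, b in zip(o, o_hat)) / len(o)

total = 100  # hypothetical total running time in seconds
labels = [rul_label(t, total) for t in range(0, total + 1, 10)]
# labels runs linearly from 100.0 (healthy bearing) down to 0.0 (failure)

predictions = [v + 2.0 for v in labels]  # hypothetical constant over-estimate
loss = mse(labels, predictions)
```

With a constant offset of 2 percentage points between label and prediction, the loss is 4.0.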

3.3. General steps of the proposed method

The proposed RUL prediction method combines automatic feature combination extraction and RMAGRU network. The detailed steps are as follows:

(1) Collect vibration data of the monitored bearing using sensors. The sampling interval is t_s and the total sampling time is t. Assume there are n samples in total.
(2) Extract features from the vibration data of all datasets to obtain a set of features $F = \left[ {{F_1},{F_2},\ldots ,{F_{50}}} \right]$ that reflects the degradation information of the bearing, where ${F_i} = {[{f_{i1}},{f_{i2}},\ldots ,{f_{in}}]^\intercal},i = 1,2,\ldots ,50$ .
(3) Use the feature combination automatic selection method to select the feature combinations $F_{\textrm{final}} = [{F_1},{F_2}, \cdots ,{F_m}]$ with the strongest importance from the feature set F, where m is the number of finally selected features.
(4) Normalize the F_final selected in the previous step.
$\begin{eqnarray} &&F_{norm} = \frac{F_{\textrm{final}}-\textrm{min}\left(F_{\textrm{final}}\right)}{\textrm{max}\left(F_{\textrm{final}}\right)-\textrm{min}\left(F_{\textrm{final}}\right)} \end{eqnarray} \tag{ 29 }$
(5) Define the input matrix Z of the RMAGRU network and the corresponding RUL label $Y = {[\begin{array}{*{20}{c}} {{y_1}}&{{y_2}}& \cdots &{{y_m}} \end{array}]^\intercal}$ .
$\begin{eqnarray} Z = F_{norm} = \left[ {\begin{array}{*{20}{c}} {{F_{11}}}&{{F_{21}}}& \cdots &{{F_{m1}}}\\ {{F_{12}}}&{{F_{22}}}& \cdots &{{F_{m2}}}\\ \vdots & \vdots & \ddots & \vdots \\ {{F_{1n}}}&{{F_{2n}}}& \cdots &{{F_{mn}}} \end{array}} \right] = \left[ \begin{array}{l} {z_1}\\ {z_2}\\ \vdots \\ {z_n} \end{array} \right] \end{eqnarray} \tag{ 30 }$
Where, ${z_i} = \left[ {\begin{array}{*{20}{c}} {{F_{1i}}}&{{F_{2i}}}& \cdots &{{F_{mi}}} \end{array}} \right]$ .
(6) Build the RMAGRU network to learn and predict the RUL of the bearing.
(7) Set the input to be the first k vectors of matrix Z with the output label $y_{k+1}$ , set the sliding step length to 1, use MSE as the loss function during training, use backpropagation to train the model and optimize its parameters, and calculate the predicted RUL for the validation dataset.
(8) Repeat Step 7 in a continuous iterative optimization process until the desired performance is achieved to obtain the best training model.
(9) The test set feature combination F_final is fed into the trained RMAGRU model, which predicts the test set RUL and compares the predicted RUL with the actual RUL.
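
The data preparation in steps (4), (5) and (7) can be sketched as follows. This is an illustration under stated assumptions: equation (29) is applied to each feature separately, a constant feature is mapped to zero, and the feature values, labels, window length k, and function names are all hypothetical.

```python
def min_max_normalise(column):
    # Equation (29), applied to one feature (assumed to be per-feature scaling).
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # assumption: constant feature maps to 0
    return [(v - lo) / (hi - lo) for v in column]

def sliding_windows(Z, Y, k):
    # Step (7): k consecutive rows of Z predict the label that follows them,
    # and the window advances by one row at a time.
    inputs, targets = [], []
    for start in range(len(Z) - k):
        inputs.append(Z[start:start + k])
        targets.append(Y[start + k])
    return inputs, targets

# Two hypothetical selected features observed over n = 5 samples.
features = [
    [2.0, 4.0, 6.0, 8.0, 10.0],
    [5.0, 5.0, 7.0, 9.0, 9.0],
]
normalised = [min_max_normalise(f) for f in features]

# Equation (30): each row z_i holds the values of all features at sample i.
Z = [list(row) for row in zip(*normalised)]
Y = [100.0, 75.0, 50.0, 25.0, 0.0]  # hypothetical RUL labels in percent

inputs, targets = sliding_windows(Z, Y, k=3)
```

With n = 5 and k = 3 this yields two training pairs: rows 1 to 3 with the fourth label, and rows 2 to 4 with the fifth.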

4. Experiment

4.1. Description of experimental dataset

The dataset used in the experiments is from the PRONOSTIA bearing dataset provided by the FEMTO-ST Research Institute, which includes several accelerated bearing degradation experiments performed on the PRONOSTIA platform [43]. Detailed experimental apparatus for the bearing test stand can be found in figure 6.

**Figure 6.** PRONOSTIA platform.

Three different operating conditions are set in the PRONOSTIA rolling bearing accelerated degradation experiment. The operating conditions are defined based on the torque applied on the bearing, the speed of rotation of the shaft, and the instantaneous measurements of the radial force applied on the bearing. The characterization of bearing degradation is reflected through data collected by a vibration sensor. The vibration sensor consists of two micro-accelerometers that are perpendicular to each other, with one aligned along the vertical axis and the other along the horizontal axis. Both accelerometers are radially located on the bearing's outer ring. The vibration signal is sampled at a frequency of 25.6 kHz, with a duration of 0.1 s for each sample and a sampling interval of 10 s. The experiment is terminated once the amplitude of the vibration data surpasses 20 g. This dataset provides six datasets that ran until failure and eleven datasets that did not fully degrade to failure for testing the model. Additionally, the actual RUL of the test signal is provided. Table 2 presents detailed description of the vibration data collected from 17 rolling bearings.

Table 2. Datasets of IEEE 2012 PHM prognostic challenge.

	Conditions 1	Conditions 2	Conditions 3
Load (N)	4000	4200	5000
Speed (RPM)	1800	1650	1500
Training set	Bearing1_1	Bearing2_1	Bearing3_1
	Bearing1_2	Bearing2_2	Bearing3_2
Test set	Bearing1_3	Bearing2_3	Bearing3_3
	Bearing1_4	Bearing2_4
	Bearing1_5	Bearing2_5
	Bearing1_6	Bearing2_6
	Bearing1_7	Bearing2_7

The raw vibration signal of bearing 1_1 is shown in figure 7. It is evident that the raw data volume is exceptionally large, and directly inputting it into a neural network for training would inevitably affect training speed and might even make training infeasible. Therefore, feature extraction in advance is necessary.

Figure 8 shows 25 kinds of features extracted from the raw horizontal vibration signal of bearing 1_1 according to table 1. It is noticeable that in the extracted time-domain and frequency-domain features, some features have similar degradation characteristics, while others may have less pronounced degradation traits or even potentially interfere with the learning of degradation features. Hence, feature selection is imperative. In this article, bearings 1_1, 1_2, 2_1, 2_2, 3_1, 3_2 are used to train the model, and the remaining 11 datasets are used as the test set.

4.2. Evaluation index

In this paper, the proposed model is evaluated using three metrics: root mean squared error (RMSE), mean absolute error (MAE) and a scoring function named SCORE. MAE and RMSE both indicate the accuracy of the predicted value compared to the actual value, where a lower error signifies a more precise prediction. SCORE is a comprehensive metric, and higher values indicate greater accuracy of the prediction. They are defined as follows:

$\begin{eqnarray} &&\textrm{MAE} = \frac{1}{n}\sum\limits_{i = 1}^n {\left| {{\textrm{TBR}_i} - {{\textrm{PBR}}_i}} \right|} \end{eqnarray} \tag{ 31 }$

$\begin{eqnarray} &&\textrm{RMSE} = \sqrt {\frac{1}{n}\sum\limits_{i = 1}^n {{{\left({\textrm{TBR}_i} - {{\textrm{PBR}}_i}\right)}^2}} } \end{eqnarray} \tag{ 32 }$

$\begin{eqnarray} &&\textrm{Score} = \frac{1}{{n - 1}}\sum\limits_{i = 1}^{n - 1} {{A_i}} \end{eqnarray} \tag{ 33 }$

$\begin{eqnarray} &&{A_i} = \left\{\begin{array}{l} {e^{- \ln \left(0.5\right) \times \left(E{r_i}/5\right)}},\quad \textrm{if} \quad E{r_i} \leqslant 0\\ {e^{+ \ln \left(0.5\right) \times \left(E{r_i}/20\right)}},\quad \textrm{if} \quad E{r_i} > 0 \end{array} \right. \end{eqnarray} \tag{ 34 }$

$\begin{eqnarray} &&E{r_i} = 100 \times \frac{{{\textrm{TBR}_i} - {{\textrm{PBR}}_i}}}{{{\textrm{TBR}_i}}} \end{eqnarray} \tag{ 35 }$

Where, $\textrm{PBR}_i$ is the predicted value of the model, ${\textrm{TBR}_i}$ is the actual value, n is the number of samples in the testing dataset.
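
The metrics in equations (31)–(35) can be computed as in the sketch below. The function names and sample values are hypothetical. Two simplifications are assumptions of the example: the score is averaged over every pair passed in rather than over n − 1 terms, and all true values are taken to be nonzero so that equation (35) is defined.

```python
import math

def mae(tbr, pbr):
    # Equation (31)
    return sum(abs(t - p) for t, p in zip(tbr, pbr)) / len(tbr)

def rmse(tbr, pbr):
    # Equation (32)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(tbr, pbr)) / len(tbr))

def percent_error(t, p):
    # Equation (35); t must be nonzero.
    return 100.0 * (t - p) / t

def accuracy_term(er):
    # Equation (34): over-estimates of RUL (er <= 0) decay faster than
    # under-estimates (er > 0).
    if er <= 0:
        return math.exp(-math.log(0.5) * (er / 5.0))
    return math.exp(math.log(0.5) * (er / 20.0))

def score(tbr, pbr):
    # Equation (33), averaged over all supplied pairs (assumption).
    terms = [accuracy_term(percent_error(t, p)) for t, p in zip(tbr, pbr)]
    return sum(terms) / len(terms)

true_rul = [10.0, 20.0]   # hypothetical actual values
pred_rul = [12.0, 16.0]   # hypothetical predictions
results = (mae(true_rul, pred_rul), rmse(true_rul, pred_rul), score(true_rul, pred_rul))
```

The asymmetry of equation (34) is visible in the half-value points: an over-estimate of 5% and an under-estimate of 20% both give A_i = 0.5, and a perfect prediction gives A_i = 1.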

4.3. Comparing framework settings

The experimental study is conducted on a computer with an Intel(R) Core(TM) i5-13600K CPU @ 3.5 GHz, 32 GB RAM, and an NVIDIA GeForce RTX 3070 Ti graphics card. The operating system is Windows 10, and the RUL prediction is implemented in PyTorch 1.7 with Python 3.9. To demonstrate the effectiveness of our method, several recent two-stage RUL prediction frameworks are employed for comparison. In the first stage, the proposed AFCE method is compared with five other feature extraction methods: time domain feature extraction (TDF), frequency domain feature extraction (FDF), joint time–frequency domain feature extraction (TFDF), hybrid feature extraction (MF), and convolutional autoencoder feature extraction (CAE). In the second stage, the frameworks are grouped according to their neural network structure: LSTM-based, GRU-based, TCN-based, and RMAGRU-based frameworks. Accordingly, the experiments in this comparative study are divided into two groups.

For the first group of experiments, consider TDF first. Since the bearing vibration signal is sampled at 25.6 kHz, each sample lasts 0.1 s, and the sampling interval is 10 s, the 2560 data points in each sample reflect the degradation state of the bearing at that time. We therefore calculate and normalize the time-domain features of each signal segment according to $f_1-f_{11}$ in table 1. For each vibration signal segment of size $2560\times1$ , 11 feature values are obtained and merged into a feature vector of size $1\times11$ . Similarly, for frequency domain feature extraction, we extract 12 normalized frequency-domain features from each $2560\times1$ vibration signal according to $f_{12}-f_{23}$ in table 1, and combine them into a feature vector of size $1\times12$ . Time–frequency joint domain feature extraction uses $f_{24}$ and $f_{25}$ of table 1 to extract two joint-domain features, resulting in a feature vector of size $1\times2$ for each $2560\times1$ vibration signal. Hybrid feature extraction extracts all 25 time-domain, frequency-domain, and joint time–frequency domain features according to $f_1-f_{25}$ of table 1, generating a feature vector of size $1\times25$ for each $2560\times1$ vibration signal. The CAE extracts features from the collected data using a one-dimensional convolutional network with a convolution kernel of 16 and a step size of 1, and compresses each $2560\times1$ vibration signal into a feature vector of size $1\times1$ . The automatic feature combination extraction method is the one presented in section 3.

For the second group of experiments, we evaluate the performance of our RMAGRU model in comparison to three other commonly used deep learning methods: LSTM, GRU, and TCN. To ensure fairness, all hyperparameters are kept consistent across all networks during training. During model training, a feature selection threshold of 36 is used. Backpropagation is employed for learning, and the optimizer is Adam with a learning rate of $1\times 10^{-5}$ . The batch size is fixed at 32, and training runs for a total of 300 epochs. Details of all the hyperparameters used are presented in table 3.

Table 3. The hyperparameters of model training.

Hyperparameter	Value
Learning rate	$1\times 10^{-5}$
Batch Size	32
Training epochs	300
Sequence length	64
Early stop	30
Number of layers	4
Number of hidden units	28
Optimizer	Adam
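
The interaction between the "Training epochs" and "Early stop" entries of table 3 can be illustrated with a small sketch. It assumes that the early-stop value of 30 is a patience, that is, the number of consecutive epochs without improvement in validation loss after which training halts; the text does not state this explicitly. The function name and the loss sequence are hypothetical.

```python
def run_early_stopping(val_losses, max_epochs=300, patience=30):
    # Track the best validation loss and stop after `patience` epochs
    # without improvement (assumed meaning of "Early stop" in table 3).
    best_loss, best_epoch, waited, epochs_run = float("inf"), None, 0, 0
    for epoch, loss in enumerate(val_losses[:max_epochs]):
        epochs_run = epoch + 1
        if loss < best_loss:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                break
    return best_epoch, best_loss, epochs_run

# Hypothetical validation curve: improves for 50 epochs, then plateaus.
val_losses = [1.0 / (e + 1) for e in range(50)] + [0.5] * 250
best_epoch, best_loss, epochs_run = run_early_stopping(val_losses)
```

In this example the best model is found at epoch index 49 and training stops after 80 epochs instead of running the full 300.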

4.4. Comparison of methods

In this section, we demonstrate the figures and experimental outcomes of the aforementioned frameworks, and analyze the impact of various key hyperparameters on RUL prediction performance.

In the first group of experiments, six different feature extraction methods are used to extract bearing degradation features, which are then input into three prediction models. Taking Bearing1_3 and Bearing1_4 as examples, figure 9 shows the prediction outcomes of the five baseline feature extraction methods and our proposed feature extraction method, each combined with the GRU model. Table 5 presents the prediction results of the GRU model on the test set for all 11 bearings. For some individual test bearings, other feature extraction methods obtain the best result. Overall, however, the AFCE method performs best, with average RMSE and MAE values of 17.21% and 14.89%, respectively, while every other feature extraction method has an average RMSE above 21% and an average MAE above 18%. To further verify the effectiveness of the AFCE method, the feature extraction methods are also tested with the LSTM model and the TCN model. Table 4 reports the prediction results obtained using the LSTM model. Of the 11 test bearings, the AFCE method achieves the best results in 8, with an overall average RMSE of 18.19% and an overall average MAE of 15.22%. The other methods have average RMSE values between 21% and 23% and average MAE values between 18.25% and 19.55%. The prediction results under the TCN model are shown in table 6. In most test bearings the error of the AFCE method is the smallest, and its average error over all test bearings is lower than that of the other feature extraction methods.

**Figure 9.** Prediction results of different feature extraction methods in GRU model.

Table 4. The results of different feature extraction methods in LSTM model.

Model	Long short-term memory (LSTM)
Feature	TDF		FDF		TFDF		MF		CAE		AFCE (Proposed)
Metric	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE
Bearing1_3	23.87	19.99	21.39	18.99	22.47	19.02	19.94	17.15	23.25	19.59	19.36	16.99
Bearing1_4	26.40	22.83	17.96	14.10	26.50	22.97	15.97	13.12	26.72	23.13	15.04	12.50
Bearing1_5	26.36	22.80	34.44	26.18	27.40	23.66	27.55	23.35	26.21	22.69	26.09	21.75
Bearing1_6	26.44	22.88	21.61	15.59	26.50	22.93	28.23	23.85	26.39	22.86	12.18	9.46
Bearing1_7	24.21	19.82	17.64	11.99	23.62	18.71	15.52	12.11	23.13	18.99	12.53	9.09
Bearing2_3	24.65	20.16	9.31	7.55	23.42	19.01	21.61	17.38	23.55	19.21	12.11	9.99
Bearing2_4	21.72	18.61	20.23	17.49	21.80	18.80	23.23	20.05	21.36	18.40	10.80	9.57
Bearing2_5	24.90	21.38	35.32	28.04	24.99	21.47	31.19	25.75	24.54	21.15	28.31	22.94
Bearing2_6	21.51	18.44	18.86	15.84	22.49	19.21	21.80	18.41	21.05	18.12	16.28	12.54
Bearing2_7	13.46	11.62	25.78	24.72	13.53	11.59	19.20	15.72	14.02	12.01	30.80	30.06
Bearing3_3	19.16	16.50	24.18	21.21	19.10	16.81	15.07	13.81	19.20	16.63	16.59	12.57
Average	22.97	19.55	22.43	18.34	22.89	19.47	21.76	18.25	22.67	19.34	18.19	15.22

Table 5. The results of different feature extraction methods in GRU model.

Model	Gate recurrent unit (GRU)
Feature	TDF		FDF		TFDF		MF		CAE		AFCE (Proposed)
Metric	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE
Bearing1_3	22.02	18.78	14.87	11.19	22.61	19.22	21.48	17.05	25.91	21.43	10.22	8.05
Bearing1_4	26.36	22.85	20.49	17.90	25.34	22.06	27.46	23.43	33.13	27.35	12.62	10.80
Bearing1_5	26.73	23.07	18.88	15.99	22.11	19.17	30.16	25.15	22.38	19.05	20.15	17.43
Bearing1_6	26.58	22.99	20.35	15.25	24.46	21.25	28.87	24.23	27.20	23.43	16.87	14.41
Bearing1_7	23.31	18.58	19.25	15.06	20.90	18.14	13.56	10.72	16.31	12.92	24.52	19.10
Bearing2_3	24.31	19.74	27.19	25.61	23.25	19.53	23.27	17.94	21.20	17.21	7.44	6.39
Bearing2_4	21.40	18.48	28.20	26.94	19.37	16.36	24.14	20.04	23.28	19.17	23.65	21.49
Bearing2_5	25.23	21.53	15.27	13.14	23.94	20.76	35.55	28.87	23.49	19.93	20.59	16.55
Bearing2_6	21.79	18.59	31.31	26.86	16.98	14.50	24.31	20.23	27.22	22.47	11.86	9.14
Bearing2_7	13.36	11.42	37.11	36.87	13.31	11.20	21.92	18.15	17.59	14.48	27.03	26.84
Bearing3_3	18.92	16.49	12.43	7.16	22.35	17.27	20.09	17.17	27.34	22.53	14.34	13.59
Average	22.73	19.32	22.30	19.27	21.33	18.13	24.62	20.27	24.10	20.00	17.21	14.89

Table 6. The results of different feature extraction methods in TCN model.

Model	Temporal convolutional network (TCN)
Feature	TDF		FDF		TFDF		MF		CAE		AFCE (Proposed)
Metric	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE	RMSE	MAE
Bearing1_3	20.82	17.63	7.79	5.84	23.28	19.48	73.49	43.33	22.87	19.40	23.17	19.53
Bearing1_4	26.27	22.71	17.65	15.19	26.62	22.99	23.34	15.62	27.25	23.41	17.53	11.75
Bearing1_5	27.98	23.86	28.11	22.61	17.35	13.48	15.13	11.26	26.12	22.49	26.91	23.31
Bearing1_6	29.66	24.91	23.49	18.76	26.85	23.40	28.15	20.74	26.49	22.91	17.48	15.26
Bearing1_7	30.59	25.86	20.30	15.88	24.58	21.95	13.03	11.86	21.72	18.29	8.35	6.87
Bearing2_3	38.18	34.59	21.90	19.23	22.89	18.45	15.01	11.73	22.20	18.17	26.59	25.63
Bearing2_4	21.15	18.64	11.54	9.87	19.14	16.65	16.72	14.47	20.95	18.27	8.72	7.85
Bearing2_5	27.97	23.36	34.70	30.16	26.48	23.35	23.21	19.92	24.07	20.82	21.90	18.86
Bearing2_6	31.30	26.49	27.72	25.42	11.96	9.56	13.68	12.14	20.78	17.88	20.39	18.56
Bearing2_7	15.02	12.43	30.56	30.07	17.87	16.35	54.73	53.34	15.12	12.53	13.65	11.70
Bearing3_3	27.74	22.82	14.93	13.34	20.09	16.65	26.14	24.81	19.77	16.91	13.42	11.68
Average	26.97	23.03	21.70	18.76	21.56	18.39	27.51	21.75	22.49	19.19	18.01	15.55

Based on the findings presented in tables 4–6, our proposed AFCE method outperforms the other feature extraction methods in RUL prediction across different deep network structures. This result indicates that AFCE retains features that contain more detailed degradation information, enabling deep network models to make fuller use of the features for autonomous learning and to achieve more accurate RUL prediction. It should be noted that the performance of our automatic feature combination extraction method in the GRU model is slightly inferior. This may be because the GRU model handles long sequences less well than other deep network models and also suffers from vanishing gradients. Overall, our AFCE approach substantially enhances the performance of the entire predictive model, which confirms the effectiveness of the automatic feature combination extraction method.

In the second group of experiments, we use the AFCE method to extract feature combinations and input them into four different prediction models. To provide a comprehensive assessment of model performance, we add a new evaluation metric named SCORE. Taking Bearing2_6 and Bearing3_3 as examples, figure 10 and table 7 show the prediction results of the deep models based on AFCE.

**Figure 10.** Prediction results of different network models with AFCE.

Table 7. Quantitative results of various models under AFCE.

Model	LSTM			GRU			TCN			RMAGRU (Proposed)
Metric	RMSE	MAE	SCORE	RMSE	MAE	SCORE	RMSE	MAE	SCORE	RMSE	MAE	SCORE
Bearing1_3	8.57	6.35	59.85	28.87	24.15	30.54	21.67	19.68	34.59	16.09	14.31	42.05
Bearing1_4	15.21	12.81	40.00	18.68	16.47	31.86	12.21	8.16	49.40	11.92	7.83	51.20
Bearing1_5	32.59	27.40	20.34	30.46	25.75	21.85	16.81	14.16	38.36	19.34	17.02	31.89
Bearing1_6	12.89	10.69	37.59	22.78	19.11	29.52	17.69	13.87	28.00	12.59	9.22	44.02
Bearing1_7	18.27	15.33	42.20	19.54	18.71	37.67	14.73	11.12	44.32	13.65	10.69	48.15
Bearing2_3	24.40	20.21	41.70	16.03	13.30	48.01	20.39	19.33	38.24	9.86	7.76	50.37
Bearing2_4	10.33	8.49	47.28	13.52	11.26	45.58	9.76	7.14	48.26	16.26	15.20	40.41
Bearing2_5	34.69	29.05	21.19	26.65	22.98	25.33	23.03	20.42	27.97	20.99	17.63	33.30
Bearing2_6	24.82	21.37	33.82	16.18	14.08	41.24	11.05	10.10	41.45	6.63	5.61	51.27
Bearing2_7	26.47	24.15	4.56	19.14	16.94	10.44	34.40	34.30	0.05	18.35	17.63	3.42
Bearing3_3	21.06	15.69	49.34	15.34	13.81	34.36	10.18	8.43	45.45	8.76	7.66	50.38
Average	20.85	17.41	36.17	20.65	17.87	32.40	17.45	15.16	36.01	14.04	11.87	40.59

Figure 11 illustrates the absolute prediction error of the results in figure 10. Several trends can be observed from the figure. Firstly, the prediction error of GRU is not stable: it fluctuates noticeably, being small at some times and large at others. Secondly, LSTM exhibits relatively large absolute errors in the early prediction stages, reaching as high as 0.4, whereas in the later prediction stages its error gradually decreases and its performance becomes more stable. In contrast, both the TCN and RMAGRU methods show relatively stable absolute prediction errors throughout the entire degradation phase of the bearing. The average absolute prediction error of RMAGRU is lower than that of TCN, indicating that RMAGRU outperforms TCN in the RUL prediction task. As indicated in table 7, the AFCE-RMAGRU model outperforms the other frameworks in predictive accuracy, which further demonstrates the strong performance of this framework in RUL prediction for rolling bearings.

After comparing the prediction results between the GRU and RMAGRU models, it is discovered that in the majority of cases, the RMAGRU model is able to predict more accurately than the GRU model. This suggests that the RMAGRU model has a stronger capacity in handling lengthy sequences and mitigating gradient vanishing, which highlights its exceptional performance advantages.

In our experiments, four key parameters are optimized: the threshold of the number of features, the number of attention heads, the sequence length, and the learning rate. We obtain the average RUL prediction results on the test set by adjusting different parameter values and display them in figure 12. The horizontal axis in the figure represents different parameter values, the left vertical axis represents the predicted RMSE, while the right vertical axis represents the predicted SCORE. A smaller RMSE and a higher score indicate a higher prediction accuracy, from which the optimal parameters can be inferred.

According to figure 12(a), when the feature threshold value is small, the prediction error is large, and as the feature threshold increases, the prediction error decreases. When the feature threshold value is 36, the RUL prediction effect reaches the best. Figures 12(b) and (c) suggest that appropriate sequence length and attention head number can enhance the model's predictive capabilities, but the relationship between these two hyperparameters and performance is not monotonically increasing. Excessive sequence length and attention head number can also lead to a decrease in computational efficiency and overfitting risk. After comprehensive consideration, the optimal attention head number is 8, and the optimal sequence length is 64. At the same time, according to figure 12(d), within the parameters of this experiment, the influence of varying learning rates on RUL prediction is small. Therefore, while ensuring accuracy, we chose the optimal learning rate of $1\times 10^{-5}$ to minimize training time costs.

4.5. Results and discussion

In this section, the proposed RMAGRU network-based RUL prediction framework is compared with other cutting-edge approaches. In the comparison, we use the RNN-based gated dual attention unit method (GDAU) [31], the LSTM method based on macro-micro attention (MMALSTM) [27], and the temporal convolutional network based on soft-thresholding and attention (TCN-SA) [21] as controls. We employ MAE and RMSE as evaluation metrics and test all methods on the same dataset. As shown in table 8, the proposed framework surpasses the other methods in both prediction accuracy and robustness. The comparison results further indicate that the RMAGRU network effectively leverages the information in the input data and hidden layer states, and has a broad range of potential applications in RUL prediction.

Table 8. Performance comparison of different prediction networks.

Metric	GDAU	MMALSTM	TCN-SA	RMAGRU
MAE	14.8	19.6	15.7	11.87
RMSE	18.7	23.7	19.7	14.04

According to the experimental results in the previous section, the feature combination module and the residual MHA are very important for improving prediction accuracy and robustness. The feature combination module can extract different features from the input data and automatically select better degraded feature combinations, which helps the GRU network to better capture the complex and dynamic features of rolling bearings. By utilizing residual multi-head attention, the RMAGRU network can enhance its ability to model long-term dependencies in time series data and ultimately bolster the model's robustness.

5. Conclusion

In this paper, a rolling bearing RUL prediction framework based on residual multi-head attention GRU network with automatic feature combination extraction is proposed. The framework extracts a series of bearing degradation features, designs a new feature combination extractor to obtain the optimal bearing degradation feature combination, and develops an RMAGRU network that combines residual MHA with GRU network to better capture the complex and dynamic features of bearings. The experimental outcomes demonstrate that the suggested framework surpasses other contemporary techniques in terms of prediction accuracy and robustness. Furthermore, it can achieve high-precision predictions under various operating conditions and fault degrees. Therefore, the framework has important industrial application value, which can help enterprises to better maintain and manage equipment, and improve equipment reliability and safety.

Future work will focus on optimizing the network structure to reduce model training time and exploring the bearing life prediction problem under different working condition data distributions. To improve the practical applicability of the model when faced with changing working conditions, we plan to combine transfer learning with RMAGRU to better adapt to different data features under variable operating conditions. The implementation of these works will further improve the precision and resilience, making the proposed framework more practical and reliable.

Acknowledgments

This work is supported by National Natural Science Foundation of China (Grant No. 62101314), and the Ningbo Major Science and Technology Project (20212ZDYF020021).

Data availability statement

The data supporting this study are publicly available in the IEEE PHM 2012 Data Challenge at http://www.femto-st.fr/en/Research-departments/AS2M/Research-groups/PHM/IEEE-PHM-2012-Data-challenge.php.

Ethics approval

Since the data used for this research does not involve human or animal participants, it does not violate any moral ethics.

Conflicts of interest

No potential conflict of interest was reported by the authors.
