
A novel ensemble convex hull-based classification model for bevel gearbox fault diagnosis


Published 16 December 2022 © 2022 IOP Publishing Ltd
Citation: Xin Kang et al 2023 Meas. Sci. Technol. 34 035017. DOI: 10.1088/1361-6501/aca8c1


Abstract

The kernel-based geometric learning model has been successfully applied in bevel gearbox fault diagnosis. However, because of its shallow architecture and its sensitivity to noise and outliers, its generalization ability and robustness need to be further improved. Ensemble learning can improve the classification accuracy of sub-classifiers, but it is effective only when the sub-classifiers are simultaneously diverse and accurate. Geometric learning models, however, are strong classifiers, and it is difficult to derive diverse sub-classifiers from them. To solve these problems, this study proposes a novel ensemble model, the ensemble convex hull (CH)-based (EnCH) classification model. The CH has a clear geometric meaning and is easy to deform. This paper considers the clustering characteristics of the sample points in the feature space, i.e. both distance and density, and performs differential shrinkage deformation on the original CH. On the one hand, this produces differentiated CHs from which differentiated sub-classifiers can be built for the ensemble. On the other hand, it suppresses the interference of noise and outliers to improve robustness. The results of our experiments on a bevel gearbox fault dataset indicate that the EnCH classification model can improve the generalization of the geometric learning model and has excellent tolerance to noise and outliers.


1. Introduction

Bevel gearboxes are essential components of modern industrial products and the corresponding production tools. Their normal and stable operation brings great economic benefits and productivity gains [1, 2]. Therefore, it is important to study how to automatically and accurately identify the pattern and extent of faults before they develop into failures [3–5].

Thanks to the development of hardware, data and algorithms, fault diagnosis methods based on machine learning have been developed and successfully applied in the field of intelligent fault diagnosis for machinery [6]. Although deep learning architectures have been successfully applied in many fields, from classification and detection to natural language processing, many problems remain in their application to fault diagnosis [7]. Achieving autonomous feature extraction and an end-to-end diagnosis mode usually requires huge amounts of data to prevent overfitting. However, in practical engineering it is difficult, or even impossible, to obtain a large number of samples, for reasons such as cost and safety [8–10]. Moreover, mechanical equipment often operates under variable load and rotational speed conditions. Achieving fault diagnosis with the same model (backbone and model parameters) under variable conditions requires sophisticated and specialized transfer learning techniques [11, 12], which in turn need a large amount of source-domain data and time. In addition, issues such as non-stationary vibration signals and interference from noise and outliers have limited the performance of deep learning models [13, 14]. Therefore, in the field of bevel gearbox fault diagnosis, it is still necessary and meaningful to study shallow learning models, with their advantages of strong generalization on small samples and stability across working conditions.

Based on the principle of structural risk minimization, kernel-based geometric learning models are typical representatives of shallow learning models, with good generalization on small samples and moderate robustness. They use geometric models to describe the sample distribution boundary, and then divide the maximum margin hyperplane according to the support vectors of different categories as the decision function. Combined with kernel tricks, they have a powerful nonlinear classification capability and have achieved great success in the field of fault diagnosis. Ma et al proposed a bevel gearbox fault diagnosis method that combines multivariate multiscale fuzzy distribution entropy and SVM [15], and achieved good results. In fact, SVM transforms the problem of finding the maximum margin between two sample sets into the problem of solving the nearest point pair of the convex hulls (CHs) of the two sample sets [16]. Besides using the CH model to estimate the sample distribution, Cheema et al and Cevikalp et al independently proposed maximum margin classification based on an affine hull (AH) [17, 18]. Compared with the loose AH model, Turkoz et al used the minimum-volume hypersphere to model the boundary of the target data [19], while Cevikalp et al investigated a tighter hyperdisk (HD) model and obtained better experimental results [20]. To further improve the generalization ability and robustness of models, ensemble learning has been introduced [21, 22]. Wang applied ensemble learning with differentiated probabilistic neural networks to the fault diagnosis of rotary machinery, and achieved excellent results [23]. Han and Li achieved the detection of out-of-distribution samples by ensembling multiple deep neural networks [13]. Kim et al used the bootstrap method for sampling and integrated the obtained SVM sub-classifiers [24]; their experimental results show that the ensemble can improve the performance of SVM.

The above research has shown the successful application of ensemble learning in various fields. However, unlike deep learning models based on the principle of empirical risk minimization, geometric learning models based on structural risk minimization make it hard to generate different sub-classifiers by downsampling the original dataset. Therefore, in order to combine geometric learning models, which have the advantages of strong small-sample generalization and fast computing speed, with ensemble learning, which can improve the performance of a single classifier [25, 26], the following problems need to be solved: (a) the conventional bagging strategy cannot make the kernel-based geometric learning model produce different sub-classifiers, which contradicts the requirement that ensemble learning needs different sub-classifiers; (b) there is no universal criterion for selecting sub-classifiers in ensemble learning, and as the number of sub-classifiers increases the model becomes computationally expensive, while redundant sub-classifiers degrade the performance of the learning system [27, 28]; and (c) the geometric learning model is inherently sensitive to noise and outliers, which makes it less robust in actual fault diagnosis.

The CH classification model uses CHs to estimate the sample distributions in the feature space and then determines the optimal hyperplane, which minimizes the structural risk, by finding the nearest point pair between the two CHs. The classification principle is similar to that of SVM, but compared with SVM the CH model has a clearer geometric meaning, which makes it convenient to deform the CH as required. In this paper, unlike the conventional bagging strategy, a balanced-distance-density constant is used to easily obtain different deformed CHs (sub-classifiers). This advantage makes the CH model suitable for ensemble learning. On the one hand, the deformed CHs can replace the process of random downsampling to construct the subspaces and the corresponding sub-classifiers. On the other hand, the CHs are deformed according to the state of the sample points in the feature space (distance and density), which suppresses outliers and noise and improves generalization performance.

In this paper, a new strong classifier ensemble model (EnCH) is proposed to solve the above problems and is applied to gearbox fault diagnosis. The experimental results verified that the proposed model can significantly improve the generalization ability of sub-classifiers, and has strong resistance to noise and outliers. The main innovations of this paper are as follows: (a) unlike the conventional bagging strategy of ensemble learning, this paper uses a balanced-distance-density constant to obtain different deformed CHs as sub-classifiers; (b) according to the selective ensemble theory [28], using special balanced-distance-density constants guides the generation of the deformed CHs to avoid the random selection of sub-classifiers; (c) considering the clustering characteristics of the sample points in the feature space, the original CH model is deformed according to the distance and the local density of the samples to suppress the influence of noise and outliers.

2. Related work

2.1. Basis of CH classification model

Geometric learning models originated from vector space geometry. They all aim to describe a set with a linear combination of specific vectors. By imposing restrictions on the weights of the combinations, different geometric models such as the CH, AH, and HD are formed. The CH model requires that the sum of the combination weights is 1 and that each weight is non-negative. When applied to pattern recognition, the CH model can be understood as the smallest representative element of all samples in the feature space.

Suppose the sample set of one class is $S = \{ {s_i}\} ,i = 1, \ldots ,n$, where ${s_i} \in {R^d}$ is a $d$-dimensional feature vector; the CH of $S$ is:

Equation (1)

where $n$ is the number of samples and ${c_i}$ denotes the combination coefficient of the $i{\text{th}}$ sample.
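For reference, a standard convex-hull definition consistent with these constraints (a hedged reconstruction; the published form of equation (1) is not reproduced here) is:

$$\text{CH}(S) = \left\{ \sum\limits_{i = 1}^{n} c_i s_i \;:\; \sum\limits_{i = 1}^{n} c_i = 1,\; c_i \geqslant 0 \right\}.$$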

From the definition of the CH, we can see that the shape of the CH model depends on the combination coefficients and the distribution of the samples in the feature space. That means we can control the shape of the CH by adjusting the combination coefficients, which makes it convenient to adjust the CH boundary using prior knowledge to obtain different decision functions. This is why the CH has the advantage of a clear geometric meaning.

Taking binary classification as an example, the essential idea of the CH classification model is to find the maximum margin hyperplane between two CHs, $f = w \bullet s + b$, where $w$ and $b$ denote the normal vector and the bias respectively. The problem is therefore transformed into solving for the nearest point pair of the two CHs. Assume that the positive sample set is ${S_ + } = \{ {s_{i + }}\} ,i = 1, \ldots ,{n_ + }$ and the negative sample set is ${S_ - } = \{ {s_{i - }}\} ,i = 1, \ldots ,{n_ - }$; then the two CH models can be expressed as:

Equation (2)

Equation (3)

The objective function is:

Equation (4)

This can be transformed into a quadratic programming (QP) problem; the optimal normal vector ${w^*}$ and bias constant ${b^*}$ can be expressed in terms of the solution of the QP problem $(c_{1 + }^*, \ldots ,c_{n + }^*,c_{1 - }^*, \ldots ,c_{n - }^*)$:

Equation (5)

where $\left\langle {{s_i},{s_j}} \right\rangle $ denotes the inner product of two feature vectors. For a given test sample $s$, the prediction function of the CH model can be defined as:

Equation (6)
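A hedged sketch of the standard nearest-point formulation and the resulting decision rule, consistent with the notation above (the published equations (4)-(6) are not reproduced here):

$$\min_{c_+ ,\, c_-}\ \left\| \sum\limits_{i = 1}^{n_+} c_{i+} s_{i+} - \sum\limits_{i = 1}^{n_-} c_{i-} s_{i-} \right\|^2 \quad \text{s.t.}\ \sum\limits_{i = 1}^{n_+} c_{i+} = 1,\ \sum\limits_{i = 1}^{n_-} c_{i-} = 1,\ c_{i \pm } \geqslant 0,$$

with the prediction $f(s) = \text{sign}\left( \left\langle w^*, s \right\rangle + b^* \right)$, where $w^*$ is the difference of the two nearest points.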

2.2. Theoretical basis of ensemble learning

Combining the predictions of several models has proved to be an effective approach for increasing the performance of a single model, and this is known as ensemble learning. Ensemble learning completes the classification task by building multiple sub-classifiers that are diverse and sufficiently accurate, and uses some strategy to integrate them. Hansen and Salamon proved that if the accuracy of each classifier is above 0.5 and the classifiers are independent, the accuracy of average voting approaches 1 [29]. The reasons for the success of ensemble learning can be summarized as statistical, computational and representational, together with bias-variance decomposition and strength-correlation analysis [30–32].

The benefit of ensemble learning can be illustrated with the Hoeffding inequality. Take binary classification as an example: $y \in \left\{ { - 1, + 1} \right\}$ denotes the label, $f$ stands for the true function and ${h_i}$ denotes a sub-classifier. The error probability of each sub-classifier is $\varepsilon $, i.e. $P({h_i}(x) \ne f(x)) = \varepsilon $, and there are T sub-classifiers. A simple voting method is used to integrate the sub-classifiers, as given by equation (7); the ensemble result is correct if more than half of the sub-classifiers classify correctly.

Equation (7)

Assuming that each sub-classifier is independent, the ensemble error probability can be given as follows:

Equation (8)

According to equation (8), the ensemble error probability decreases exponentially as the number of sub-classifiers increases.
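As a quick numerical illustration of this behaviour (a sketch under the independence assumption, not the authors' code; the majority-vote error is the binomial tail and the exponential bound is the standard Hoeffding form):

```python
# Majority-vote error vs. a Hoeffding-style bound for T independent
# sub-classifiers, each with error probability eps (assumed value).
import math
from scipy.stats import binom

eps = 0.3
for T in (1, 5, 11, 21, 51):
    # The ensemble is wrong when at most floor(T/2) sub-classifiers are correct.
    exact = binom.cdf(T // 2, T, 1.0 - eps)
    bound = math.exp(-0.5 * T * (1.0 - 2.0 * eps) ** 2)
    print(f"T={T:3d}  exact error={exact:.4f}  bound={bound:.4f}")
```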

3. EnCH classification model and its application

Ensemble learning can improve the generalization performance of sub-classifiers if two conditions are met: (a) each sub-classifier has a certain accuracy; and (b) the sub-classifiers are different. However, in practical applications, because each sub-classifier is trained for the same task, it is hard for the sub-classifiers to be independent of each other, and the accuracy and diversity of sub-classifiers are in tension. Therefore, the key to an effective EnCH classification model is how to build CH sub-classifiers that meet the requirements of both accuracy and diversity.

3.1. Construction of EnCH sub-classifiers

Since the CH model has a clear geometric meaning and is easy to deform, a method is proposed to generate differentiated sub-classifiers by deforming the original CH. Considering the requirements of differential deformation and the clustering characteristics of samples in the feature space, a distance-based contraction factor and a k-nearest neighbor density-based contraction factor are introduced to deform the original CH model. Figure 1 depicts the process of deforming the original CH model, where $\lambda $ is the distance-density-balance constant. The detailed process is as follows.

Figure 1. The deformation process of the original CH.

Step 1: Shrink the sample point ${{\text{s}}_i}$ towards the barycenter of the CH ${s_{{\text{cent}}}}$. ${\hat s_i}$ denotes the sample after shrinking, which is defined by equation (9):

Equation (9)

where ${s_{{\text{cent}}}} = \frac{1}{n}\mathop \sum \limits_{i = 1}^n {s_i}$, and ${\mu _i}$ indicates the degree of shrinkage. It is determined by the distance of the sample from the barycenter and by its local density: the greater the distance and the lower the local density, the greater the degree of shrinkage. ${\mu _i}$ is defined by equation (10):

Equation (10)

where $\lambda $ represents the distance-density-balance constant. The constant gives the tradeoff between distance and density. The distance-based contraction factor ${\mu _{i\_{\text{dis}}}}$ and k-nearest neighbor density-based contraction factor ${\mu _{i\_{\text{den}}}}$ are defined as follows:

Equation (11)

The function $\varphi ({s_i},{s_{i\_k}})$ represents the average distance from the sample point ${s_i}$ to its $k$ nearest neighbor points, and $\max (\varphi (s,{s_k}))$ represents the maximum of these average distances over all samples.

A series of different deformed CHs can thus be produced according to different values of $\lambda $, resulting in a series of sub-classifiers. In effect, we build the subspaces of the original dataset in a different way from the traditional downsampling method: we leverage the balanced-distance-density constant $\lambda $ to obtain different subspaces (deformed CHs). Considering the sparsity of samples in the feature space, the subspaces change considerably as $\lambda $ changes. By setting different values of $\lambda $ we can therefore guarantee differentiated sub-classifiers, especially in the presence of outliers and noise, so not many sub-classifiers are needed. In this paper, considering the trade-off between cost and performance, we set $\lambda = \{ 0,0.5,1\} $: $\lambda = 0$ means shrinkage according to density only, $\lambda = 0.5$ means shrinkage considering both distance and density, and $\lambda = 1$ means shrinkage according to distance only. In this way, large differences among the deformed CHs are ensured. Besides, the deformation suppresses the interference of noise and outliers to obtain a more reliable description of the data boundary. In summary, this strategy guarantees that the sub-classifiers are different and robust. Including the original CH, there are finally four sub-classifiers in the ensemble.
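A minimal sketch of this shrinkage step (the exact forms of equations (9)-(11) are not reproduced in the text, so the factor definitions and the update rule below are plausible assumptions rather than the authors' exact formulas):

```python
import numpy as np

def shrink_convex_hull(S, lam=0.5, k=3, rate=0.5):
    """Shrink each sample of one class towards the class barycenter.

    S    : (n, d) array of samples in the feature space
    lam  : distance-density-balance constant (lambda in the paper)
    k    : number of nearest neighbours for the density factor
    rate : global shrink rate (assumed; not specified in the excerpt)
    """
    s_cent = S.mean(axis=0)                                # barycenter of the CH

    # Distance-based contraction factor: farther points shrink more.
    dist = np.linalg.norm(S - s_cent, axis=1)
    mu_dis = dist / dist.max()

    # k-NN density-based contraction factor: sparser points shrink more.
    pairwise = np.linalg.norm(S[:, None, :] - S[None, :, :], axis=2)
    knn_avg = np.sort(pairwise, axis=1)[:, 1:k + 1].mean(axis=1)
    mu_den = knn_avg / knn_avg.max()

    # Balanced contraction degree (equation (10)-style combination).
    mu = lam * mu_dis + (1.0 - lam) * mu_den

    # Move each sample towards the barycenter (assumed equation (9)-style form).
    return S - rate * mu[:, None] * (S - s_cent)
```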

Step 2: Swell the deformed CHs to eliminate the possible influence of unbalanced samples. Expansion factor ${\phi _s}$ is given by equation (12):

Equation (12)

Step 3: Construct the sub-classifiers. Taking the binary classification as an example, the deformed CH models can be expressed by equations (13) and (14):

Equation (13)

Equation (14)

The objective function of the corresponding sub-classifier is given in equation (15):

Equation (15)

Formula (15) can be expanded into formula (16), and the problem can be converted into a QP problem:

Equation (16)

To solve equation (16), we use the Lagrange multiplier approach on the resulting standard convex QP problem. Equation (16) can be transformed into a standard QP problem as follows:

Equation (17)

where $x = {[{\hat c_{1 + }},{\hat c_{2 + }}, \ldots ,{\hat c_{n + }},{\hat c_{1 - }},{\hat c_{2 - }}, \ldots ,{\hat c_{n - }}]^T}$ and $H = \begin{pmatrix} {K_{11}} & { - {K_{12}}} \\ { - K_{12}^T} & {{K_{22}}} \end{pmatrix}$ is the Hessian matrix, with ${K_{11}} = {[{\hat s_{1 + }},{\hat s_{2 + }}, \ldots ,{\hat s_{n + }}]^T}[{\hat s_{1 + }},{\hat s_{2 + }}, \ldots ,{\hat s_{n + }}]$, ${K_{12}} = {[{\hat s_{1 + }},{\hat s_{2 + }}, \ldots ,{\hat s_{n + }}]^T}[{\hat s_{1 - }},{\hat s_{2 - }}, \ldots ,{\hat s_{n - }}]$ and ${K_{22}} = {[{\hat s_{1 - }},{\hat s_{2 - }}, \ldots ,{\hat s_{n - }}]^T}[{\hat s_{1 - }},{\hat s_{2 - }}, \ldots ,{\hat s_{n - }}]$. The equality constraints are $A_{\mathrm{eq}} = \begin{pmatrix} 1 & \cdots & 1 & 0 & \cdots & 0 \\ 0 & \cdots & 0 & 1 & \cdots & 1 \end{pmatrix}_{2 \times ({n_ + } + {n_ - })}$ and $B_{\mathrm{eq}} = {[1,1]^T}$. The Hessian matrix $H$ is symmetric and can be written as a Gram matrix, so it is positive semidefinite and the QP problem has a global optimal solution. We can therefore use the Karush–Kuhn–Tucker conditions of the Lagrange multiplier approach to obtain the solution. The optimal solution $(\hat c_{1 + }^*, \ldots ,\hat c_{n + }^*,\hat c_{1 - }^*, \ldots ,\hat c_{n - }^*)$ is obtained by solving equation (17), and the optimal hyperplane of the sub-classifier is expressed as $f(s) = {\text{sign}}\left\{ {\langle {{\hat w}^*},s\rangle + {{\hat b}^*}} \right\}$. The normal vector ${\hat w^*}$ and bias constant ${\hat b^*}$ are given by equation (18):

Equation (18)

So far, the construction of the sub-classifier is complete. Because equation (18) only involves inner product operations between feature vectors, the original feature space can be mapped to a high-dimensional Hilbert space by introducing a kernel trick. The maximum margin hyperplane can then be constructed in the high-dimensional space to handle linearly inseparable problems. In this paper, the Gaussian kernel function is used:

Equation (19)

where $\sigma > 0$ represents the bandwidth of the Gaussian kernel. For multi-classification problems, the 'one-against-one' strategy is adopted to extend the model to a multi-classifier.
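The following is a minimal sketch of this kernelised nearest-point QP (equations (16)-(19)). The block structure of the Hessian and the equality constraints follow the text; the mapping onto the cvxopt solver and the helper names are our assumptions, not the authors' implementation:

```python
import numpy as np
from cvxopt import matrix, solvers

solvers.options['show_progress'] = False

def gaussian_kernel(X, Y, sigma):
    """Gram matrix K[i, j] = exp(-||x_i - y_j||^2 / (2 * sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def nearest_point_qp(S_pos, S_neg, sigma=1.0):
    """Solve for the convex-combination coefficients of the two (deformed) CHs."""
    n_pos, n_neg = len(S_pos), len(S_neg)
    K11 = gaussian_kernel(S_pos, S_pos, sigma)
    K12 = gaussian_kernel(S_pos, S_neg, sigma)
    K22 = gaussian_kernel(S_neg, S_neg, sigma)
    H = np.block([[K11, -K12], [-K12.T, K22]])        # Hessian of the QP

    n = n_pos + n_neg
    Aeq = np.zeros((2, n))
    Aeq[0, :n_pos] = 1.0                               # coefficients of CH+ sum to 1
    Aeq[1, n_pos:] = 1.0                               # coefficients of CH- sum to 1

    sol = solvers.qp(matrix(H), matrix(np.zeros(n)),               # minimise 0.5 x^T H x
                     matrix(-np.eye(n)), matrix(np.zeros(n)),      # x >= 0
                     matrix(Aeq), matrix(np.array([1.0, 1.0])))
    c = np.array(sol['x']).ravel()
    return c[:n_pos], c[n_pos:]                        # coefficients of each hull
```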

3.2. Construction of the EnCH classification model

We have obtained the set of sub-classifiers with differences, $\left\{ {{f_m}} \right\},m = 1, \ldots ,M$. The output of EnCH will be obtained by utilizing a bagging algorithm, which integrates the sub-classifiers by weight. The final decision function of the EnCH is defined by equation (20):

Equation (20)

where ${\varepsilon _m}$ is the weight of the $m{\text{th}}$ sub-classifier, and ${f_m}(s)$ represents the prediction results of the $m{\text{th}}$ sub-classifier. The basic structure of EnCH is presented in figure 2. The complete algorithm of EnCH is presented in algorithm 1.
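A minimal sketch of this weighted-vote combination (equation (20)), with the weights ${\varepsilon _m}$ obtained on the validation set as in algorithm 1 (the helper name is ours):

```python
import numpy as np

def ench_predict(sub_classifiers, weights, s):
    """sub_classifiers: list of callables f_m(s) -> -1 or +1; weights: eps_m."""
    votes = np.array([f(s) for f in sub_classifiers], dtype=float)
    return int(np.sign(np.dot(weights, votes)))
```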

Figure 2. The basic structure of the EnCH classification model.
Algorithm 1. EnCH classification model.
Input: Training set $S:\{ {s_i},{y_i}\} _{i = 1}^n,{y_i} \in \{ - 1, + 1\} $, validation set $V:\{ {v_i},{y_i}\} _{i = 1}^n,{y_i} \in \{ - 1, + 1\} $ and parameter set $\left\{ {k = 3,\;\lambda = \left[ {0,0.5,1} \right]} \right\}$.
Output: Decision functions of the EnCH classification model.
Procedure:
1. Normalize the data set;
2. Get M groups of new samples with different degrees of shrinkage, by equations (9)–(11);
3. Get M groups of deformed CHs, by equations (12)–(14);
4. Construct the classification hyperplane between the deformed CHs of different classes in each group, and get M sub-classifiers ${f_m},m = 1,2, \ldots ,M$, according to equations (15)–(18);
5. Get the weight ${\varepsilon _m}$ on the validation set, integrate all sub-classifiers, and the final decision function is $F(s) = {\text{sign}}\left(\mathop \sum \nolimits_{m = 1}^M {\varepsilon _m} * {f_m}(s)\right)$.

3.3. The procedure of bevel gearbox fault diagnosis using the EnCH classification model

When the fault modes of a bevel gearbox are different, the time-domain waveform and frequency-domain waveform of the vibration signal change, and the corresponding time-domain and frequency-domain statistical parameters also change. Therefore, taking time-domain and frequency-domain statistical parameters as the feature vectors has been widely used in the field of fault diagnosis. In this paper, 12 time-domain statistical parameters (mean, root mean square, square root amplitude, mean amplitude, maximum peak, standard deviation, skewness, kurtosis, crest factor, clearance factor, shape factor, and impulse factor) and eight frequency-domain statistical parameters (spectral amplitude mean, spectral amplitude standard deviation, spectral standard deviation frequency, spectral amplitude skewness, spectral frequency skewness, spectral amplitude kurtosis, spectral frequency kurtosis, and spectral gravity frequency) are selected to form the feature vectors of the samples. The specific calculation formulas of these parameters can be found in [16].

These features are selected because they reflect the characteristics of the time-domain and frequency-domain waveforms comprehensively, and the failure modes of machines are closely related to them. For instance, skewness and kurtosis reflect the degree to which the signal deviates from a normal distribution, and are therefore sensitive to crack and spalling failures. The crest factor, clearance factor, shape factor and impulse factor are non-dimensional parameters that remain stable under different working conditions. The situation in the frequency domain is similar to that in the time domain.
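As an illustration, a short sketch of how a few of the 20 statistical features above could be computed (the formulas follow common definitions; the authors' exact formulas are given in [16] and may differ in detail):

```python
import numpy as np
from scipy.stats import kurtosis, skew

def example_features(x, fs=4096):
    """Compute a subset of the time- and frequency-domain features for one segment."""
    x = np.asarray(x, dtype=float)
    rms = np.sqrt(np.mean(x ** 2))
    feats = {
        'mean': x.mean(),
        'rms': rms,
        'std': x.std(),
        'skewness': skew(x),
        'kurtosis': kurtosis(x),
        'crest_factor': np.max(np.abs(x)) / rms,
        'impulse_factor': np.max(np.abs(x)) / np.mean(np.abs(x)),
    }
    # Frequency-domain statistics from the amplitude spectrum.
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), d=1.0 / fs)
    feats['spectral_amplitude_mean'] = spec.mean()
    feats['spectral_amplitude_std'] = spec.std()
    feats['spectral_gravity_frequency'] = np.sum(freqs * spec) / np.sum(spec)
    return feats
```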

The fault diagnosis procedure based on EnCH is graphically presented in figure 3, and the main steps are as follows:

  • (a)  
    Signal acquisition and processing: use the sensor to obtain vibration signals of the bevel gear box under different fault modes.
  • (b)  
    Feature extraction in the time domain and frequency domain: analyze the vibration signal in the time domain and frequency-domain to extract the corresponding time-domain statistical parameters and the corresponding frequency-domain statistical parameters to form the feature vectors.
  • (c)  
    Division of the dataset: divide the data set into a training set and a test set with a certain proportion, and form the original CH according to the training set.
  • (d)  
    Training of the EnCH: according to the original CH, the distance-based contraction factor and k-nearest neighbor density-based contraction factor of each sample are calculated to form a series of deformed CHs. The deformed CHs are then trained to obtain a series of sub-classifiers. Finally, the sub-classifiers are integrated.
  • (e)  
    Validation of the effect: input the test samples into the trained EnCH model to validate the effect.

Figure 3. The framework of fault diagnosis.

4. Experiment verification and discussion

4.1. Description of experimental platform and dataset

The vibration signal dataset used in this paper is collected from a drivetrain dynamic simulator test bench in our laboratory, shown in figure 4. The bevel gear pair uses a gear with 18 teeth as the input end and a gear with 36 teeth as the output end. Before the experiment, wire-cut electrical discharge machining (EDM) was used to cut the driving bevel gear at different depths to simulate five different gear crack degrees. During the experiment, the input-shaft speed is set to 1500 r min−1 and the brake load is set to 4 N m under each fault condition. The sampling frequency is set to 4096 Hz.

Figure 4. Drivetrain dynamic simulator test bench.

The vibration signal dataset obtained contains six different modes: the normal mode and five fault modes. Each mode contains 60 samples, and each sample contains 4096 sampling points. We then extract the features of each sample in the time domain and frequency domain to form a 20-dimensional feature vector and obtain the experimental dataset. Specific information about the experimental dataset is summarized in table 1. Figure 5 displays the time-domain waveforms of the vibration signals under the six different fault modes.

Figure 5. The original vibration signals of six fault modes.

Table 1. Description of dataset.

Number of samples    Fault condition                  Class label    Dimensions of sample
60                   Healthy                          1              20
60                   Crack of drive gear (0.6 mm)     2              20
60                   Crack of drive gear (0.8 mm)     3              20
60                   Crack of drive gear (1.0 mm)     4              20
60                   Crack of drive gear (1.2 mm)     5              20
60                   Crack of drive gear (1.4 mm)     6              20

4.2. Experiment 1

Experiment 1 is used to verify the following three points: (a) the proposed model can improve the generalization of the sub-CH classification model; (b) the proposed model is superior to other geometric learning models; and (c) the deep learning methods will suffer performance degradation when they are facing small sample tasks.

Four typical geometric learning models, SVM, CH, AH, and HD, are selected for comparative experiments. In the experiment, 18 samples are randomly selected from the 60 samples of each fault mode as the training dataset, and the rest are used as the test dataset. All of the models used for comparison adopt five-fold cross-validation to select their optimal parameters; the candidate ranges of the Gaussian kernel width $\sigma $ and the penalty parameter $\xi $ of SVM are set as $\sigma \in \left\{ {{2^{ - 3}},{2^{ - 2.5}},{2^{ - 2}}, \ldots ,{2^{10}}} \right\}$ and $\xi \in \left\{ {0,5,10, \ldots ,50} \right\}$ respectively. The optimal parameters of all models are displayed in table 2, and all models involved in the experiment are run ten times with randomly selected training data under the optimal parameters to avoid contingency in the experiment.
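A sketch of how such a five-fold grid search could be run for the SVM baseline (scikit-learn parameterisation assumed, with gamma = 1/(2σ²); C must be positive, so the 0 candidate is skipped; this is not the authors' code):

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

sigmas = 2.0 ** np.arange(-3.0, 10.5, 0.5)          # candidate kernel widths
param_grid = {
    'gamma': list(1.0 / (2.0 * sigmas ** 2)),
    'C': list(range(5, 55, 5)),                     # {5, 10, ..., 50}
}
search = GridSearchCV(SVC(kernel='rbf'), param_grid, cv=5)
# search.fit(X_train, y_train); search.best_params_   # X_train, y_train: features and labels
```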

Table 2. The optimal hyperparameters of all models.

                                    HD           AH         SVM          CH    EnCH
Gaussian kernel width ($\sigma$)    $2^{1.5}$    $2^{5}$    $2^{2.5}$    2     ${\sigma _1} = 2^{4}$, ${\sigma _2} = 2^{4.5}$, ${\sigma _3} = 2^{4}$, ${\sigma _4} = 2$
Penalty parameter ($\xi$)           -            -          20           -     -

First, according to the selection strategy for the distance-density-balance constant $\lambda $ in section 3.1, we obtain the EnCH classification model with four sub-classifiers. Table 3 shows the average classification accuracy over the ten experiments for each sub-classifier, as well as the average classification accuracy and the ensemble classification accuracy of the four sub-classifiers under the optimal parameters. The ensemble result is better than the accuracy of each sub-classifier and their average accuracy, indicating that the proposed model can improve the generalization of the original CH model. To further illustrate the difference among the sub-classifiers, taking the first trial as an example, the prediction labels of the test dataset under each sub-classifier are shown in figure 6, where the accuracies of sub-classifier 1, sub-classifier 2, sub-classifier 3 and sub-classifier 4 are 98.01%, 98.41%, 97.62% and 97.22% respectively, while the final ensemble accuracy is 98.81%. The accuracy of sub-classifier 3 is close to the ensemble classification accuracy, but it is worth noting that sub-classifier 3 does not correspond to the original CH classifier (from table 4, the accuracy of the original CH is 96.15%); it corresponds to a deformed CH classifier, so this still reflects that the proposed method can improve the performance of the base model. That is to say, although the accuracy of an individual sub-classifier may sometimes be close to the ensemble accuracy, it is unpredictable which sub-classifier will obtain excellent results, so the use of ensemble learning improves the stability of the model and consistently obtains a better result.

Figure 6. The prediction labels of sub-classifiers.

Table 3. The accuracy of different sub-classifiers and ensemble.

                Sub-classifier 1    Sub-classifier 2    Sub-classifier 3    Sub-classifier 4    Average        Ensemble
Accuracy (%)    97.34 ± 0.98        97.82 ± 0.88        97.89 ± 0.95        96.15 ± 0.95        97.3 ± 0.94    98.01 ± 0.87

Table 4. The average accuracy of ten trials.

                HD              AH              SVM             CH              EnCH
Accuracy (%)    96.59 ± 1.01    96.26 ± 1.56    96.55 ± 1.41    96.15 ± 0.95    98.01 ± 0.87

However, we can also see that when the dataset is pure, because each sub-classifier has achieved high accuracy, the difference among sub-classifiers is not obvious. Subsequent experiments (sections 4.3 and 4.4) find that the worse the dataset conditions are, the better the performance of the proposed model will be.

The classification accuracies of all learning models in every trial are shown in figure 7, and table 4 shows the average accuracy of the ten repeated experiments for each model. The experimental results show that each geometric learning model achieves relatively high classification accuracy under the few-sample condition, but the performance of the proposed classification model is more stable. Moreover, the proposed model obtains the highest classification accuracy in every trial except the fourth, which verifies its superiority.

Figure 7. Comparison of different models.

In addition, in order to test the stability of the performance of the proposed model, different proportions (0.1, 0.2, 0.3, 0.4, 0.5) of training samples are used for training. The average accuracies of the ten repeated experiments with randomly selected training data in every trial are shown in figure 8. The results show that the classification accuracy of the different models improves as the number of training samples increases. The performance of SVM with few samples is the worst, but it improves significantly as the proportion of training samples increases. The stability of the proposed model is the best, and it obtains the highest accuracy under every training sample proportion. It can be inferred that the deformation operation on the original CH makes the boundary estimation of the dataset in the feature space more reliable, so the performance is more stable.

Figure 8. The classification accuracy with different proportions.

To further prove the necessity and superiority of the proposed method compared with deep learning methods under few-sample conditions, we use a convolutional neural network (CNN) for comparison experiments. The structure and parameters of the CNN are presented in table 5. Batch normalization is applied in the necessary modules to mitigate overfitting. The training batch size is 16, the SGD optimizer and cross-entropy loss are used, and the learning rate is set to 0.001.

Table 5. Structure and parameters of the CNN.

Layers     Module               Hyperparameters
Layer 1    Conv1d               kernel_size = 15, num_filters = 16
           BatchNorm1d
           ReLU
Layer 2    Conv1d               kernel_size = 3, num_filters = 32
           BatchNorm1d
           ReLU
           MaxPool1d            kernel_size = 2, stride = 2
Layer 3    Conv1d               kernel_size = 3, num_filters = 64
           BatchNorm1d
           ReLU
Layer 4    Conv1d               kernel_size = 3, num_filters = 128
           BatchNorm1d
           ReLU
           AdaptiveMaxPool1d    output_size = 4
Layer 5    Linear               128*4, 256
           ReLU
           Linear               256, 64
           ReLU
           Linear               64, 6
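A PyTorch sketch of this comparison network (the raw single-channel vibration segment is assumed as the input; padding and other unstated details are our assumptions):

```python
import torch.nn as nn

class DiagnosisCNN(nn.Module):
    def __init__(self, num_classes=6):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15), nn.BatchNorm1d(16), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=3), nn.BatchNorm1d(32), nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),
            nn.Conv1d(32, 64, kernel_size=3), nn.BatchNorm1d(64), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=3), nn.BatchNorm1d(128), nn.ReLU(),
            nn.AdaptiveMaxPool1d(4),
        )
        self.classifier = nn.Sequential(
            nn.Linear(128 * 4, 256), nn.ReLU(),
            nn.Linear(256, 64), nn.ReLU(),
            nn.Linear(64, num_classes),
        )

    def forward(self, x):              # x: (batch, 1, signal_length)
        x = self.features(x)
        return self.classifier(x.flatten(1))
```

Training then follows the settings above (SGD, learning rate 0.001, cross-entropy loss, batch size 16).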

First, we set the proportion of the training set to 0.3. The evolution of the training and test loss and accuracy during training is shown in figure 9. Overfitting is severe: while the training loss approaches 0 and the training accuracy approaches 1, the test loss and accuracy remain poor. The best diagnostic accuracy is only 76.85%, far lower than that of the proposed method, and training takes longer. This illustrates that deep learning models may suffer performance degradation when facing few-sample tasks.

Figure 9. Loss and accuracy during the training process with a training data proportion of 0.3.

To observe how the performance changes with the proportion of the training set, we increase the training set proportion to 0.5. Figure 10 shows the training result. The performance of the CNN model has not improved significantly: overfitting still exists, and the best accuracy only improves to 81.11%. This is because a proportion of 0.5 (30 training samples per class) is still not enough for a deep learning model.

Figure 10. Loss and accuracy during the training process with a training data proportion of 0.5.

4.3. Experiment 2

It is inevitable that the vibration signal is mixed with noise during collection. This noise deforms the boundary of the dataset in the feature space, interferes with the decision of the maximum margin hyperplane of the geometric learning model, and reduces the diagnosis accuracy. In the proposed EnCH classification model, each sub-classifier benefits from shrinking the original CH model on the basis of both distance and density. Theoretically, the deformed sub-CH model can better represent the essential distribution of the dataset, so it should suppress the interference of noise well. Therefore, experiment 2 is carried out to prove the following conclusions: (a) the EnCH classification model is robust to noise and performs better than other geometric learning models; and (b) under noisy conditions, the difference among sub-classifiers is more obvious, and the corresponding ensemble improvement is larger, than in experiment 1.

In the experiment, Gaussian white noise with different signal-to-noise ratios (SNRs) is used to simulate the noise mixed into an actually collected signal. First, we add white noise with SNRs ranging from −6 dB to 2 dB to the original vibration signal, and then calculate the features in the time domain and frequency domain. In this way we obtain datasets with different degrees of noise. The following experiments are carried out on these noisy datasets; the proportion of the training dataset and the selection of parameters are the same as in experiment 1. Figure 11 shows the time-domain waveform of the vibration signal (1.0 mm crack in the drive gear) at noise intensities of −6 dB, −4 dB, −2 dB, 0 dB and 2 dB, together with the original signal.
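A short sketch of how white noise at a prescribed SNR could be added to a vibration segment before the features are recomputed (a common construction, assumed rather than taken from the authors' code):

```python
import numpy as np

def add_white_noise(x, snr_db, rng=None):
    """Add Gaussian white noise so that the resulting SNR (in dB) equals snr_db."""
    rng = np.random.default_rng() if rng is None else rng
    signal_power = np.mean(np.asarray(x, dtype=float) ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=np.shape(x))
    return x + noise
```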

Figure 11. The time-domain waveform of the vibration signal with different SNRs.

The test results are presented in figure 12. The performances of the SVM, CH, AH and HD models are poor under strong noise interference, illustrating that, for geometric classification models, noise does affect the decision of the hyperplane. Because of AH's loose estimation of the sample distribution boundary, the influence of noise on AH decreases gradually as the noise level is reduced, and its classification accuracy improves rapidly. Although Gaussian white noise produces a great change in the CH's boundary, the shape of the CH, which is the key factor in the decision of the hyperplane, changes little. Therefore, the CH model is more resistant to noise interference than the other geometric learning models, and its performance is also better. As an improvement of the original CH model, the EnCH classification model not only retains the advantages of the original CH model but also significantly improves the generalization ability; at the same time, it has stronger noise robustness and higher classification accuracy than the other geometric learning models.

Figure 12. The classification accuracy with noise of different SNR.

When the noise intensity is −6 dB, taking the first trial as an example, the confusion matrices of the classification results of each sub-classifier and of the EnCH classifier are shown in figure 13. The classification accuracies of sub-classifier 1, sub-classifier 2, sub-classifier 3, sub-classifier 4 and the EnCH classifier are 88.49%, 87.69%, 87.69%, 88.09% and 91.27% respectively. The difference among sub-classifiers is more obvious, and the improvement in generalization ability is more significant, than in experiment 1.

Figure 13. The confusion matrices of the first trial when SNR = −6 dB.

To further demonstrate that the proposed method can suppress the influence of noise, we show the distribution of the samples in the feature space before and after the deformation of the CH. Noise with SNR = −6 dB is added to the original data. The result is given in figure 14. As the result shows, the deformed CHs yield different distributions of the samples, and CH1 and CH2 obviously have better clustering results than the original CH.

Figure 14. Visual feature distribution before and after the deformation of CH using t-SNE.

4.4. Experiment 3

The kernel-based geometric learning model relies on support vectors to determine the decision function. However, outliers are very likely to be boundary points (support vectors) of the geometric model, which greatly interferes with the decision of the maximum margin hyperplane and greatly reduces the corresponding classification accuracy. The EnCH classification model considers the clustering characteristics of sample points in the feature space and introduces a local contraction factor for each sample point. Theoretically, because outliers are far away from the barycenter of the CH and the local density at outliers is low, their degree of shrinkage is greater. This restrains the influence of outliers on the boundary of the CH, and the obtained boundary fits the actual distribution boundary better, which improves the classification accuracy. Experiment 3 is therefore used to prove these hypotheses: (a) the EnCH classification model can automatically identify outliers and give them a large degree of shrinkage; and (b) when the dataset contains outliers, the classification accuracy of the EnCH classification model is also significantly better than that of other geometric classification models.

For a specific fault mode, the samples in other fault modes are taken as the outliers of this fault mode, and these outliers are used to replace the normal samples to ensure that the number of samples in the training dataset and test dataset does not change. The experiment tests the classification accuracy of all models with the number of outliers being $l(l = 1,2,3,4,5)$. The training dataset proportion of each fault mode is 0.3, that is, 18 training samples, and the rest are set as test samples. The optimal parameters are the same as those in experiment 1. The experiment was repeated 10 times with a randomly selected training dataset.

In order to observe the contraction degree of each sample before training, table 6 records the order numbers of the outliers in the training dataset when five outliers are inserted into each fault mode. The contraction degree of all training samples in the first kind of deformed CH is shown in figure 15. It can be found that, with few exceptions, the outliers receive the largest degrees of shrinkage among the 18 training samples. This proves that the proposed model can identify outliers and give them a large degree of shrinkage, and verifies that the proposed model can suppress the influence of outliers.

Figure 15. The contraction degree of the training samples.

Table 6. The order number of outliers in different fault modes.

Fault mode (sample numbers)    Order numbers of the outliers
1 (1–18)                       5, 9, 13, 14, 16
2 (19–36)                      24, 29, 31, 34, 35
3 (37–54)                      42, 43, 46, 52, 53
4 (55–72)                      55, 65, 68, 60, 70
5 (73–90)                      75, 79, 81, 83, 85
6 (91–108)                     94, 102, 103, 104, 107

Table 7 presents the classification accuracy of each model with different numbers of outliers. The results show, first, that the classification accuracy of all classification models decreases significantly as the number of outliers increases, illustrating that geometric learning models are sensitive to outliers. Second, although SVM greatly suppresses the interference of outliers through its penalty parameter and achieves classification accuracy similar to that of the proposed model, its time and computational costs are huge considering the complex parameter-optimization process. At the same time, it is impossible to predict whether the dataset obtained in an actual fault diagnosis process is polluted by outliers, so it is hard to guide the selection of the SVM parameters, whereas the proposed model can adapt to different datasets with the same parameters. Therefore, the performance of the EnCH classification model is better in general. Third, compared with the original CH model, the generalization performance of the proposed model is significantly improved, and compared with the other geometric learning models it also achieves higher classification accuracy.

Table 7. The classification accuracy of each model with different number of outliers.

Model    Number of outliers (proportion)
         1 (5.6%)        2 (11.1%)       3 (16.7%)       4 (22.2%)       5 (27.8%)
HD       95.12 ± 0.43    94.05 ± 1.10    92.26 ± 2.22    91.75 ± 3.00    91.58 ± 4.82
AH       94.52 ± 5.88    93.41 ± 4.55    92.62 ± 5.42    90.67 ± 6.13    88.49 ± 5.90
SVM      95.19 ± 0.77    94.44 ± 1.14    93.68 ± 1.85    92.33 ± 3.26    91.19 ± 4.53
CH       92.74 ± 1.25    90.32 ± 1.93    87.74 ± 2.96    86.47 ± 4.77    82.26 ± 5.24
EnCH     96.75 ± 0.63    95.63 ± 0.75    94.76 ± 0.91    93.73 ± 0.63    92.66 ± 1.29

To further demonstrate the effect of the shrinking operation on outliers, we give the distribution of the samples in the feature space before and after the deformation of CH. Before shrinking, 18 outliers are added to the original CH. The result is shown in figure 16, which shows that CH1 and CH3 can obviously confine the outliers to obtain a more realistic distribution.

Figure 16. Visual feature distribution before and after the deformation of CH using t-SNE.

4.5. Experiment 4

The three experiments above prove that the proposed method has better generalization ability and robustness than shallow models. However, the necessity and advantages of the proposed method compared with other ensemble learning methods still need to be verified. Experiment 4 is therefore used to prove these hypotheses: (a) the proposed method can use fewer sub-classifiers to achieve higher diagnostic accuracy than ensemble methods based on the bagging strategy; and (b) because geometric learning models are strong classifiers, it is hard to obtain differentiated sub-classifiers from them, so ensemble learning does not work on geometric learning models without special processing.

The typical ensemble learning algorithm random forest is chosen for comparison to examine the relationship between the number of sub-classifiers and the classification accuracy. In this experiment, the proportion of the training dataset and the hyperparameters of EnCH are the same as those in experiment 1. For the random forest, the number of sub-classifiers (decision trees) is set equal to that of EnCH (four sub-classifiers) for the sake of fairness. The down-sampling rate is set to 0.8, which was found to be optimal experimentally. Different proportions (0.1, 0.2, 0.3, 0.4, 0.5) of training samples are used for training. The average accuracies of the ten repeated experiments are shown in figure 17. The result shows that EnCH obtains better results at all training sample proportions.
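A sketch of this baseline configuration in scikit-learn (the max_samples parameter is assumed to realise the 0.8 down-sampling rate; this is not the authors' code):

```python
from sklearn.ensemble import RandomForestClassifier

# Four trees, each grown on a bootstrap subsample of 80% of the training set.
rf = RandomForestClassifier(n_estimators=4, bootstrap=True, max_samples=0.8)
# rf.fit(X_train, y_train); rf.score(X_test, y_test)
```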

Figure 17. The classification accuracy with different proportions.

In addition, we illustrate the relationship between the number of sub-classifiers and the classification accuracy in a general ensemble learning method based on the bagging strategy. Figure 18 shows the variation of the classification accuracy of the random forest with an increasing number of sub-classifiers when the proportion of training samples is 0.3. The classification accuracy increases slightly with the number of sub-classifiers, but the best accuracy is still lower than that of EnCH.

Figure 18. The accuracy of random forest with different numbers of sub-classifiers.

To verify the second hypothesis, SVM is selected as the base model of ensemble learning to see the result of ensemble learning based on strong classifiers. In this experiment, the hyperparameters of EnCH and the ensemble SVM are the same as in experiment 1. The down-sampling rate is set to 0.8, and the number of sub-classifiers (sub-SVMs) is also set equal to that of EnCH for the sake of fairness. From the result in figure 19, we can see that the classification accuracy is even worse than that of the single SVM in experiment 1, so there is reason to believe that ensemble learning does not work on geometric learning models without special processing.
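A sketch of such a bagged-SVM baseline in scikit-learn (the estimator keyword is base_estimator in older scikit-learn versions; the configuration is assumed, not taken from the authors' code):

```python
from sklearn.ensemble import BaggingClassifier
from sklearn.svm import SVC

# Four RBF-SVM sub-classifiers, each trained on a random 80% subsample.
bagged_svm = BaggingClassifier(estimator=SVC(kernel='rbf'),
                               n_estimators=4, max_samples=0.8)
# bagged_svm.fit(X_train, y_train); bagged_svm.score(X_test, y_test)
```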

Figure 19. The classification accuracy with different proportions.

5. Conclusions

In this paper, a novel ensemble model of strong classifiers based on deformed CHs (the EnCH classification model) is proposed and applied to the fault diagnosis of bevel gears. The focus of this study is to use ensemble learning to improve the generalization ability of the geometric classification model and its robustness to noise and outliers. The main conclusions are as follows:

  • (a)  
    According to the clustering characteristics of the sample points in the feature space, the balanced-distance-density constant introduced can produce different deformed CHs. The EnCH classification model can be obtained by integrating the CH sub-classifiers, and it has better generalization ability compared with other geometric learning models.
  • (b)  
    When the dataset contains noise and outliers, the EnCH classification model obtains a more reliable boundary of data description by introducing the shrinkage factor, which greatly suppresses the misleading effects of noise and outliers for determining the maximum margin hyperplane, and improves the robustness. Experiments verify the effectiveness of the EnCH classification model.

Acknowledgments

This research is supported by National Natural Science Foundation of China (51875183 and 51975193).

Data availability statement

The data generated and/or analysed during the current study are not publicly available for legal/ethical reasons but are available from the corresponding author on reasonable request.
