Fault Diagnosis of an Industrial Chemical Process using Machine Learning Algorithms: Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA)

Fault diagnosis plays a very important role in today's complex industrial chemical processes. Intelligent fault diagnosis (IFD) refers to the application of machine learning to the diagnosis of process faults. Over the past two to three decades, this promising approach has attracted considerable interest because it reduces the reliance on manual effort and automatically recognizes the health status of a process. Detecting a fault and the variable associated with its cause is highly significant, as it reduces the waste of resources and ensures production safety. The goal of this research was fault diagnosis of the Tennessee Eastman Process (TEP) using two machine learning algorithms: Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA). PCA and KPCA were applied, in combination with a Support Vector Machine (SVM), to the collected data to produce a classifier for the different faults in the chemical process. The classification results of the two methods were then compared.


Introduction
Over the past decades, technological advancements in the process industries have led to an increase in the complexity of processes, systems, and products. As a result, contemporary studies take into account the difficulties of managing and designing them for successful operation [1]. In today's sophisticated industrial chemical processes, fault diagnosis is a critical component. The term "fault diagnosis" refers to the process of locating unusual system conditions, where a fault is defined as an unpermitted departure of at least one property or variable of the system from acceptable, common, and standard behaviour. Finding the fault and the variable related to its cause is crucial, since it prevents resource waste and guarantees the safety of the production process. Monitoring techniques are essential for finding potential faults [2][3], and a great deal of research has been done on the analysis of chemical data to detect faults in chemical processes. Numerous multivariate statistical methods have been created and are currently in use for process analysis and fault detection. These strategies are helpful because producing high-quality products and ensuring operational safety are among the key goals in industrial applications [4][5].

Data visualization tools have been used to spot faults manually, but they are too slow for real-time fault detection in streaming data. Researchers have therefore proposed many automated statistical and machine learning techniques, including nearest neighbour, clustering, minimum volume ellipsoid, convex peeling, neural network classifiers, decision trees, and support vector machine classifiers. Principal component analysis, one of the most widely used statistical techniques for modeling and fault detection, also provides linear combinations of variables that highlight significant trends in the data [6]. PCA is a popular multivariate statistical technique for industrial process monitoring: it is a linear transformation that projects high-dimensional, noisy, and correlated data into a lower-dimensional subspace [7][8]. However, most industrial processes exhibit nonlinear behaviour, and PCA may overlook crucial information in nonlinear systems. Numerous nonlinear PCA techniques have been developed to address this issue; among them, Kernel PCA (KPCA) stands out for its simplicity and elegance [9].

This paper aims to diagnose faults of the Tennessee Eastman Process (TEP) using Principal Component Analysis (PCA) and Kernel Principal Component Analysis (KPCA), integrated with a Support Vector Machine (SVM) and Recursive Feature Elimination (RFE). The techniques are applied to simulated data from the Tennessee Eastman chemical plant simulator, which was designed to reproduce a wide variety of faults occurring in a chemical plant based on a facility at Eastman Chemical. The performance of both models is compared in terms of accuracy, and the models rely heavily upon the number of chosen principal components.

Literature Review
Recent progress in improving process control and in integrating and managing industrial data makes it possible to make sound, data-driven decisions. Many authors have shown how combining industrial data can be used to regulate and closely monitor industrial processes. For example, using semantic technologies to combine knowledge in chemical process engineering makes it possible to find interesting opportunities in the flow of chemical processes. Over the past decade, data-driven methods have yielded a great deal of useful information by creating custom markers to track abnormal behaviour [4][5]. In [4], the authors examined various recurrent and convolutional architectures on the publicly accessible simulated extended Tennessee Eastman Process (TEP) dataset to detect faults in chemical processes. They selected the most suitable architecture for the problem and introduced a new temporal CNN1D2D architecture that outperforms all the other methods considered in terms of overall performance on the dataset. The authors of [7] introduced a new technique for adaptive neighbourhood-preserving embedding, along with an online fault-detection approach based on it. This method integrates the constraint of approximate linear dependency with neighbourhood-preserving embedding; based on the described update method, the algorithm produces an adaptive model that enables online fault detection of processes. The work in [3] presents the shortcomings of the standard fault detection technique and proposes an Enhanced Multi-Scale Principal Component Analysis (EMSPCA) fault detection algorithm that incorporates a novel wavelet thresholding criterion, which improves the identification of defects in the residual space and the determination of the threshold for the fault detection statistic. In a simulated study, EMSPCA showed a notable 30% improvement in detection rate while maintaining equivalent false alarm levels. The above literature review shows that, despite a gradually growing body of knowledge on fault detection in industrial processes, no research so far has used PCA and KPCA together with a focus on fault detection of the TEP. To bridge this gap, we applied PCA and KPCA in a supervised machine learning approach to detect and diagnose TEP faults.

Tennessee Eastman Process (TEP)
The Eastman Chemical Company developed the TE process to meet the demand for a realistic industrial process that could be used to evaluate various process control and monitoring strategies. The TE plant is made up of five units and eight components [10]. A, C, D, and E are gaseous reactants, while B is an inert. F is a liquid byproduct of the process, and G and H are the liquid products. The reaction producing H has less energy than the reaction producing G, and the G reaction is more temperature-sensitive. The process was first proposed by Downs and Vogel at the American Institute of Chemical Engineers (AIChE). It has seen extensive use in a variety of control and optimization applications, including process control, fault diagnosis, statistical process monitoring, and data-driven studies, among many others. The dataset has evolved into a standard data source for these research directions [11].

Data Collection
The data for this research were collected from the Tennessee Eastman Process. The TE process has 21 pre-programmed faults, and each observation consists of 41 process measurements and 12 manipulated variables. The training dataset for normal operating conditions has 500 samples and the corresponding testing dataset has 960 samples; all samples in both are obtained under normal operating conditions. The training dataset for each fault consists of 480 samples and the testing dataset for each fault consists of 960 samples. In the training dataset for each fault, all samples are in the faulty condition, while in the testing dataset for each fault the first 160 samples are in normal operating condition and the rest are in the faulty condition. For classification, samples in normal operating condition are labelled 1 and samples in faulty condition are labelled 0. The details of the TE data for each fault are shown in Table 1.
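As a minimal illustration, the sketch below shows how training and testing sets with the labelling convention above could be assembled. It assumes the TE data are stored as one whitespace-delimited file per case; the file names (`d00.dat`, `d01.dat`, `d01_te.dat`) are placeholders for whatever layout a given copy of the dataset uses.

```python
import numpy as np

# Hypothetical file layout: one whitespace-delimited file per case,
# e.g. "d00.dat" (normal training), "d01.dat" (Fault 1 training) and
# "d01_te.dat" (Fault 1 testing); adjust names to the local copy.
X_train_normal = np.loadtxt("d00.dat")   # 500 samples, all normal
X_train_fault = np.loadtxt("d01.dat")    # 480 samples, all faulty
X_test = np.loadtxt("d01_te.dat")        # 960 samples: 160 normal, then faulty

# Labels follow the convention in the text: 1 = normal, 0 = faulty.
X_train = np.vstack([X_train_normal, X_train_fault])
y_train = np.concatenate([np.ones(len(X_train_normal)),
                          np.zeros(len(X_train_fault))])
y_test = np.concatenate([np.ones(160), np.zeros(len(X_test) - 160)])
```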

Feature Scaling
Feature scaling using standardization is a very important step in machine learning. Standardization rescales the features so that they have the properties of a standard normal distribution, with a mean of zero and a standard deviation of one [12]. Standardization is even more important for dimensionality reduction techniques such as PCA and KPCA, because PCA seeks the components that maximize the variance. In large multivariate datasets, different features typically vary on very different scales, so if the features are not scaled, PCA might conclude that the direction of maximal variance corresponds simply to the feature with the largest raw variance. For this reason, feature scaling is an important step before performing PCA or KPCA.
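As a brief sketch, standardization can be carried out with scikit-learn's `StandardScaler`. The scaler is fitted on the training data only, and the learned parameters are reused for the test data; the array names follow the loading sketch above.

```python
from sklearn.preprocessing import StandardScaler

# Fit the scaler on the training data only, then reuse the learned
# mean and standard deviation for the test data, so no information
# from the test set leaks into the model.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```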

Principal Component Analysis (PCA)
Principal component analysis (PCA) is one of the most widely used machine learning algorithms for data exploration and analysis across scientific disciplines. PCA belongs to the family of dimension reduction methods and is particularly helpful when the available data are extensive, substantial, and highly correlated. When dealing with such high-dimensional data, the objective is to find a smaller set of features that accurately represents the primary data in a lower-dimensional subspace while sacrificing as little information as possible [13][14]. After feature scaling, PCA can be applied to the training dataset. PCA reduces the dimension of a given dataset by projecting the data into a lower-dimensional space, which helps to accurately monitor processes with high-dimensional data. It produces the lower-dimensional space in a way that preserves the correlation structure between the process variables by capturing the variability in the data: the first principal component captures the highest possible variability, and each succeeding component captures as much of the remaining variability as possible [15]. The number of principal components to retain can be chosen by several methods. One is the percent variance test, which determines the number of components by selecting the smallest number of loading vectors that explain a minimum percentage of the total variance; this minimum percentage is chosen arbitrarily [4]. For this research, we set the minimum percentage of explained variance to 85%. Figure 1 shows the cumulative explained variance for Fault 1. For Fault 1, we retained 24 principal components, which explained approximately 95% of the variance, and for Fault 5 we retained 35 principal components, which also explained approximately 95% of the total variance. So, by applying PCA we reduced the dimension of our data significantly.
The number of principal components chosen for each fault and the amount of variance explained is documented in Table 2.
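A minimal sketch of the percent variance test with scikit-learn is shown below; the 85% threshold follows the text, and `X_train_scaled` is assumed from the scaling step.

```python
import numpy as np
from sklearn.decomposition import PCA

# Fit a full PCA, then keep the smallest number of components whose
# cumulative explained variance reaches the chosen threshold (85%).
pca_full = PCA().fit(X_train_scaled)
cum_var = np.cumsum(pca_full.explained_variance_ratio_)
n_components = int(np.argmax(cum_var >= 0.85)) + 1

pca = PCA(n_components=n_components)
T_train = pca.fit_transform(X_train_scaled)  # scores in the reduced space
T_test = pca.transform(X_test_scaled)
```

Note that scikit-learn's `PCA` also accepts a fractional `n_components` (e.g. `PCA(n_components=0.85)`), which applies the same criterion directly.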

Kernel Principal Component Analysis (KPCA)
As a monitoring method for nonlinear processes, kernel principal component analysis (KPCA), which is essentially a nonlinear extension of principal component analysis (PCA), has recently attracted a substantial amount of interest [16]. While standard PCA performs linear dimension reduction, as discussed earlier, KPCA is a more effective technique for capturing nonlinear structure in the data. KPCA is helpful for data that have complicated structures and are not linearly separable: it projects the data into a high-dimensional feature space in which they become linearly separable. KPCA uses the kernel trick to carry out this projection implicitly, so that nonlinearly separable data can be handled in the higher-dimensional space. The primary appeal of the kernel algorithm is its capacity to function without any nonlinear optimization, in contrast to the operation of other nonlinear methods. The input variables are transformed and then used as independent principal component variables. When conducting any kind of factor analysis, the Kaiser-Meyer-Olkin test is one of the statistics most frequently used to evaluate the suitability of the data [16][17].
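A minimal KPCA sketch with scikit-learn follows; the RBF kernel, the component count, and the `gamma` value are illustrative choices rather than the settings reported in this paper.

```python
from sklearn.decomposition import KernelPCA

# A radial-basis-function (Gaussian) kernel is a common choice for
# nonlinear process data; gamma controls the kernel width and would
# normally be tuned, the value here is only illustrative.
kpca = KernelPCA(n_components=24, kernel="rbf", gamma=1e-3)
Z_train = kpca.fit_transform(X_train_scaled)
Z_test = kpca.transform(X_test_scaled)
```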

Applying Principal Component Analysis (PCA)
After choosing the number of principal components to retain, PCA projects the data onto a new model with that number of components. A classification algorithm is then applied to this projected model to classify and detect faults. For this research, we used a Support Vector Machine (SVM) as the classification tool. The SVM, a classification method developed in the 1990s, projects low-dimensional samples into a higher-dimensional feature space, so that observations that cannot be separated in the lower-dimensional space become linearly separable.
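A sketch of this PCA-plus-SVM classifier using a scikit-learn pipeline is given below; the 24 components correspond to the Fault 1 choice reported above, and the SVM hyperparameters are left at their defaults rather than the paper's settings.

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Chain scaling, PCA, and the SVM so that all three steps are fitted
# on the training data in a single call; 24 components is the Fault 1
# choice reported in the text.
pca_svm = make_pipeline(StandardScaler(), PCA(n_components=24), SVC())
pca_svm.fit(X_train, y_train)
y_pred_pca = pca_svm.predict(X_test)
```

Placing the scaler inside the pipeline guarantees that the test data are transformed with statistics learned from the training data only.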

Applying Kernel Principal Component Analysis (KPCA)
After the detection of faults with the PCA classifier, we move on to Kernel Principal Component Analysis. After applying KPCA and projecting the training data into a linearly separable higher-dimensional space, the SVM classification model is fitted to the training set and the test results are predicted. The performance of the KPCA classifier is obtained from the confusion matrix. Figure 3 shows the confusion matrix of the KPCA classifier for Faults 2 and 6.
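The same pipeline with `KernelPCA` substituted for `PCA` yields the KPCA classifier; the sketch below also computes the confusion matrix used to assess performance (kernel and component count are again illustrative).

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import KernelPCA
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, accuracy_score

# Same pipeline as for PCA, with KernelPCA substituted in.
kpca_svm = make_pipeline(StandardScaler(),
                         KernelPCA(n_components=24, kernel="rbf"),
                         SVC())
kpca_svm.fit(X_train, y_train)
y_pred_kpca = kpca_svm.predict(X_test)

print(confusion_matrix(y_test, y_pred_kpca))  # rows: true, columns: predicted
print(accuracy_score(y_test, y_pred_kpca))
```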

Fault Diagnosis
After detecting the faults for each fault case with the PCA and KPCA classifiers respectively, the task of diagnosing the possible cause of each fault is addressed with the help of recursive feature elimination.
Recursive Feature Elimination (RFE) repeatedly constructs a model and selects either the best or the worst performing feature. It then sets that feature aside and repeats the process with the remaining features, continuing until all features of the model are exhausted. RFE then ranks the features according to when they were eliminated [6]. The RFE algorithm thus produces a feature ranking list for each fault. These rankings identify the most relevant variables for each fault, from which the cause of the fault can be analyzed and proper diagnostic measures applied. From the feature ranking list, we found that variables 1, 2, 7, 15, 17, 42, 44, 49, 51, and 52 are ranked 1. By comparing these variables under faulty and normal operating conditions, we can determine the variable most likely responsible for the fault. From the feature ranking list, it is observed that variable 51 has a step change, shown in Figure 4: variable 51 in the testing data exhibits a step change, whereas under normal conditions it does not. Since the value of variable 51 experiences a step change, it is the most relevant variable for the cause of Fault 4 in the chemical process, and proper diagnostic measures should be taken. The most relevant variable for each fault, obtained from RFE, is listed in Table 3. From Table 3 we can see that the most relevant variable for Fault 1 is variable 19, and from Figure 5 we can see that variable 19 has a step change in the testing data. The classification model can learn this step change from the training data and accurately predict the fault scenarios, so variable 19 is the most relevant variable for Fault 1.
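A minimal RFE sketch with scikit-learn is shown below; a linear-kernel SVM supplies the coefficients needed for the ranking, and `n_features_to_select=10` is an illustrative choice under which the ten surviving features all receive rank 1, mirroring the rank-1 set reported above.

```python
from sklearn.feature_selection import RFE
from sklearn.svm import SVC

# RFE needs a base estimator that exposes feature weights, so a
# linear-kernel SVM is used for the ranking. With ten features kept,
# the ten survivors of the elimination all receive rank 1.
rfe = RFE(estimator=SVC(kernel="linear"), n_features_to_select=10, step=1)
rfe.fit(X_train_scaled, y_train)

# 1-based variable numbers of the rank-1 (most relevant) features.
relevant = [i + 1 for i, rank in enumerate(rfe.ranking_) if rank == 1]
print("variables ranked 1:", relevant)
```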
From Table 3 we can see that the most relevant variable for Fault 10 is variable 50. The fault type of Fault 10 is random variation, and the plot of variable 50 for Fault 10 shows that random variation occurs in its observations, which causes a fault in the C feed temperature; Figure 8 shows this plot. The fault type for Fault 18 is unknown. From Figure 9 it is visible that variable 4 in Fault 18 changes in the testing data compared to normal operating conditions. The classifier can learn this change from the training data and predict the fault condition accordingly, so variable 4 is the most relevant variable for Fault 18. Similar results hold for all the faults in the TE process; the most relevant variables for each fault condition are listed in Table 3. The most relevant variables are the most likely causes of each faulty situation: the variables most likely responsible for the faults show changes in the testing set, the classifier learns these changes from the training set, and it predicts the fault scenarios in the test set. The other process variables did not show any considerable change compared to the data from normal operating conditions. After finding the variables that cause the faults, proper diagnostic measures can be taken to address their causes. It can be seen from Table 4 that the performance of the classification models using PCA and KPCA is satisfactory. In most cases, the classification performance of KPCA is almost the same as that of PCA, and in some cases KPCA performs better. It is also seen that, for some of the faults, the classification model performs poorly, with very low accuracy rates.
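As an illustrative check of a suspected root-cause variable, one could plot it under normal and faulty operation and look for the step change or drift described above. The array names follow the earlier loading sketch, and column indexing is 0-based, so variable 51 is column 50.

```python
import matplotlib.pyplot as plt

# Compare a suspected root-cause variable under normal and faulty
# operation; indexing is 0-based, so variable 51 is column 50.
fig, ax = plt.subplots()
ax.plot(X_train_normal[:, 50], label="normal operation")
ax.plot(X_test[:, 50], label="Fault 4 test data")
ax.set_xlabel("sample")
ax.set_ylabel("variable 51")
ax.legend()
plt.show()
```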

Conclusion
PCA and KPCA classifiers, combined with an SVM, were implemented for the collected TE process data in this research. The performance of both models was effective in most of the fault cases, although in some cases the classification performance was not satisfactory. The performance of the two models was compared in terms of accuracy and relies heavily upon the number of chosen principal components: if this number is not chosen effectively, important information may be lost. After classification, the most relevant variables for each fault case were determined with the help of Recursive Feature Elimination (RFE). Knowing the variables most likely responsible for causing the faults allows the faults to be effectively diagnosed.

Figure 1. Cumulative Explained Variance for Fault 1

SVM creates a maximum separating hyperplane in the higher-dimensional space, and since the hyperplane is constructed using support vectors, SVM is a very good solution for data with high dimensions. A confusion matrix summarizes the performance of a classification algorithm; it is a summary of the prediction results in a classification problem. The diagonal elements of the confusion matrix are the true positives and true negatives. True positives are the correct positive predictions, in our case the number of faults correctly predicted by the PCA classifier. True negatives are the correct negative predictions, in our case the number of normal operating samples correctly predicted by the PCA classifier. False positives and false negatives are the incorrect predictions. From this matrix the accuracy of the model can be obtained, and we can evaluate the model's performance. The confusion matrix for Faults 2 and 6 is shown in Figure 2.

Figure 2. Confusion Matrix for Faults 2 and 6 respectively for PCA Classifier

Figure 3. Confusion Matrix for Faults 2 and 6 respectively for KPCA Classifier

Figure 6. Plot of Variable 30 in Fault 2

From Figure 7, it is seen that variable 1 in Fault 5 causes the fault in the condenser cooling water temperature. The classifier can learn the change from the training set and predict the faulty condition.

Figure 7. Plot of Variable 1 in Fault 5

Table 1. Faults of the Tennessee Eastman Process

Table 2. Number of Principal Components for Each Fault

Table 3. The Most Relevant Variable for Each Fault