Research on transformer fault intelligent diagnosis technology based on improved random forest algorithm

Transformer oil can dissolve a certain amount of gas, which provides an important basis for transformer fault diagnosis. The relationship between the characteristic quantities of dissolved gases in transformers and the type of transformer fault was analyzed, and the main influencing factors were extracted. We studied a transformer fault diagnosis method based on an improved random forest algorithm, combined with the requirements for identifying abnormal data in multiple parts of transformers. We improved the random forest in four aspects: dataset construction, bootstrap sampling, decision tree generation, and multi-node voting. We established an improved random forest diagnosis model with the oil chromatogram characteristic gas ratios as input parameters and analyzed different input parameter combinations. The effectiveness of the improved random forest diagnosis model under different numbers of training runs was compared with that of various machine learning methods. The experimental results show that the diagnostic accuracy of the improved random forest model is significantly higher than that of the other machine learning methods.


Introduction
The reliable operation of transformers is crucial for the overall reliability of the power system. As a key piece of equipment, the quality of transformers directly affects the power supply reliability and stability of the power system. There are dissolved gases in transformer oil, which are mainly caused by the decomposition of oil, aging of insulation materials, and mechanical failures caused by thermal and electrical stress during transformer operation [1]. The presence of these gases has a significant impact on the safety and performance of transformers, so monitoring and analyzing dissolved gases in transformer oil is one of the key activities in maintaining transformers. The following are some common dissolved gases in transformer oil and their effects: (1) Hydrogen is the most common gas in transformer oil, produced by the decomposition of hydrocarbon compounds in the oil. The presence of hydrogen may indicate potential hazards in the transformer, such as partial discharge, overheating, or other faults. The accumulation of a large amount of hydrogen gas may lead to an explosion hazard, so it must be detected and handled promptly. (2) Methane and ethane are typically associated with the decomposition of hydrocarbons in the oil. Their presence may indicate aging of the insulation material or the presence of partial discharge. These gases may also lead to the formation of oil foam, reducing the insulation performance. (3) Acetylene is another common gas, usually produced by the decomposition of insulating materials. The presence of acetylene may indicate potential insulation issues, as it is usually associated with arc discharge and may cause fires. (4) Carbon monoxide and carbon dioxide are typically associated with the oxidation of carbon-containing materials in the oil.
Their presence is usually not particularly dangerous and can be used to analyze the degree of oxidation of the insulation oil. Monitoring and analyzing dissolved gases in transformer oil is a routine maintenance practice [2]. By regularly sampling and analyzing oil samples, potential problems can be detected and appropriate repair and maintenance measures can be taken to ensure the reliability and safety of transformers.
Dissolved Gas Analysis (DGA) is a widely used technology for transformer monitoring and fault diagnosis. It detects dissolved gases in transformer insulation oil to obtain important information about transformer operation status, potential faults, and maintenance needs [3]. DGA uses various techniques to analyze dissolved gases in transformer oil. Common techniques include gas chromatography (GC), spectroscopy, and electrical induction analysis (EIA). These technologies can quantitatively measure various gases, such as hydrogen, methane, ethane, and acetylene, and infer the operating condition of transformers based on the proportions and variation patterns of the different gases. The main goal of DGA is to detect and identify faults in transformers. By analyzing the types, concentrations, and trends of dissolved gases, different types of faults can be identified, such as partial discharge, overheating, insulation aging, and mechanical faults. The gas patterns generated by each type of fault have specific characteristics; by studying and identifying these patterns, the type of transformer fault can be determined. With the advancement of sensor technology and data analysis capabilities, DGA is moving towards intelligence and automation. Modern DGA systems can not only monitor dissolved gases in real time but also associate the data with other monitoring parameters (such as temperature and vibration), using machine learning and artificial intelligence algorithms for fault prediction and diagnosis [4]. This provides a more efficient and accurate means for transformer operation and fault management. Overall, DGA plays an important role in transformer monitoring and has become a major technology for transformer fault diagnosis and maintenance. With the continuous maturity of machine learning algorithms, the application of various algorithms to intelligent diagnosis of transformer faults is also being studied in depth [5]. At the same time, different algorithms are used in transformer fault diagnosis.
Neural networks perform well in nonlinear relationship modeling and are capable of handling complex data patterns and large-scale datasets. However, neural networks typically require a large amount of labeled data for training, which may not be easily obtained in the transformer field [6]. Support vector machines perform well on high-dimensional datasets and are suitable for binary and multi-class classification problems, but for large-scale datasets, their training time and computational costs are high. Bayesian networks can represent dependencies between variables and handle uncertainty, but building accurate Bayesian networks requires a large amount of domain knowledge and data, and complex inference algorithms may be required for complex problems. Methods such as decision trees and random forests are easy to understand and interpret and can handle mixed data types [7]. A single decision tree is prone to overfitting and needs to be improved through ensemble methods; for high-dimensional datasets, a large number of trees may be required to achieve good performance. Principal component analysis can reduce data dimensions and extract important features. Overall, power transformer fault diagnosis is a complex field, and the application of machine learning methods and intelligent algorithms provides powerful tools for improving diagnostic accuracy. However, the successful application of these methods requires in-depth domain knowledge, extensive data support, and appropriate algorithm selection and parameter tuning [8].
Random forest (RF), as an advanced ensemble learning algorithm [9], is also widely used in the field of anomalous data detection. RF is an ensemble algorithm based on the decision tree (DT) [10]. It not only retains the characteristics of simple implementation, high detection accuracy, and good scalability, but also overcomes overfitting problems and improves generalization ability and high-dimensional data processing ability by integrating decision trees and random sampling. In [11], a random forest ensemble algorithm is used to detect and diagnose anomalies in photovoltaic data, and model parameters are optimized through a grid search, achieving high overall detection and diagnostic performance.
Taking into account the boundary features and derived features of transformer status monitoring data, a fast intelligent identification method for abnormal transformer data is proposed based on the improved random forest algorithm. Abnormal data identification requires improving random forests in multiple aspects, such as dataset construction, bootstrap sampling, and decision tree generation. Multiple node sample matrices are constructed based on sample data and derived data, and the information gain rate is introduced to validate the optimal features and optimize the decision tree. A subset of decision trees and the corresponding decision matrix are formed, and the detection results are determined by row or column voting. Secondly, the algorithm's operation process is analyzed in parallel, and parallelization strategies are designed based on data parallelism and task parallelism to achieve rapid identification. Finally, we optimize algorithm parameters such as the number of decision trees and the number of selected features through simulation experiments. Experiments show that the method proposed in this paper can achieve fault diagnosis based on abnormal transformer data, improving both accuracy and efficiency.

Improved random forest algorithm
Random forest is a supervised machine learning algorithm that uses decision trees as the base classifier and can solve classification and regression problems. As an ensemble learning algorithm, random forest overcomes the problems of overfitting and low generalization performance in decision trees and has high-dimensional data processing capabilities and excellent scalability. It has been applied in multiple fields such as image processing, fault detection, and biomedicine. The construction process of a random forest includes bootstrap sampling, generation of decision trees, and voting on classification results. The specific steps are as follows: (1) Bootstrap sampling is used to generate subsets of data for training the decision trees, which essentially involves random sampling of the sample data with replacement. A fixed number of samples is drawn from the sample set, and after subsampling we obtain N training sets of that size.
(2) We use CART and other decision tree algorithms to construct the N decision trees: a) Feature selection: a subset of candidate features is generated through random sampling; b) Node splitting: the essence of decision tree node splitting is to split the training dataset based on the optimal splitting feature values. The purity or certainty of the training subsets after splitting is higher than that of the pre-splitting dataset, so sample classification is achieved during the continuous splitting process. The basis for node splitting depends on the decision tree algorithm chosen; c) Node splitting end condition: when the end condition for node splitting is met, splitting stops. Common conditions include the decision tree depth, the number of samples contained in a leaf node, the Gini coefficient threshold, etc.
(3) The classification result of the random forest algorithm is obtained by counting all decision tree classification results and applying the majority voting principle. The classification result of decision tree i for test sample A can be expressed as: and the output of the random forest classification model is: where R represents the decision-tree-based classifier; lab represents the classification result of sample A for the decision tree; lab = 1 indicates that the identification result is normal; lab = 2 indicates that the identification result is abnormal; RRF is the classification result of the random forest; and N represents the number of decision trees in the random forest.
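The three construction steps above (bootstrap sampling with replacement, training N decision trees on random feature subsets, and majority voting over their outputs) can be sketched as follows. This is a minimal illustrative implementation built on scikit-learn's DecisionTreeClassifier, not the authors' code; the class name and parameters are assumptions.

```python
# Minimal random forest sketch: bootstrap sampling, per-tree random
# feature subsets, and majority voting over N decision trees.
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier

class SimpleRandomForest:
    def __init__(self, n_trees=10, max_features="sqrt", random_state=0):
        self.n_trees = n_trees
        self.max_features = max_features  # random feature subset per split
        self.rng = np.random.default_rng(random_state)
        self.trees = []

    def fit(self, X, y):
        n = len(X)
        for _ in range(self.n_trees):
            # Bootstrap sampling: draw n samples with replacement.
            idx = self.rng.integers(0, n, size=n)
            tree = DecisionTreeClassifier(max_features=self.max_features)
            tree.fit(X[idx], y[idx])
            self.trees.append(tree)
        return self

    def predict(self, X):
        # Majority vote over the N decision trees.
        votes = np.array([t.predict(X) for t in self.trees])
        return np.array([Counter(votes[:, j]).most_common(1)[0][0]
                         for j in range(X.shape[0])])
```

In a real deployment the per-tree feature subsetting and stopping conditions (tree depth, leaf size, Gini threshold) would be tuned as discussed above.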

Decision tree based on information gain
A decision tree based on information gain is a type of decision tree that selects attributes according to their information gain. The generation algorithm generally selects the attribute with the larger information gain value as the splitting attribute of a node and divides the training dataset based on the splitting values of that attribute, until the samples in the training dataset are completely classified. The most typical decision tree based on information gain is the ID3 decision tree. Assume that the dataset is D and the segmentation attribute is A; the information gain then represents the change in information entropy before and after segmenting Dataset D with Attribute A. The calculation is shown in Equation (3). Decision trees based on information gain are generally applied when the attributes in a dataset take relatively few values. When an attribute has many values, the information entropy of the data after segmentation is relatively small, which increases the information gain. In that case, when information gain is used as the evaluation indicator for attribute selection, the probability that attributes with more values are chosen as segmentation attributes increases. Decision trees based on the information gain ratio are generally used when attributes in the dataset take many values. When an attribute has few values, the value of the penalty parameter is large, which increases the information gain ratio. In that case, when the information gain ratio is used as the evaluation indicator for attribute selection, the probability that attributes with fewer values are chosen as segmentation attributes increases.
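The entropy, information gain, and gain-ratio quantities discussed above can be sketched as follows. This is a hedged illustration of the standard definitions (not the paper's Equations (3)-(5) verbatim); function names are assumptions.

```python
# Information gain and gain ratio for a categorical attribute.
import math
from collections import Counter

def entropy(labels):
    # H(D): information entropy of a label set.
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(attr_values, labels):
    # H(D) - H(D|A): entropy reduction after splitting on attribute A.
    n = len(labels)
    cond = 0.0
    for v in set(attr_values):
        subset = [lab for a, lab in zip(attr_values, labels) if a == v]
        cond += len(subset) / n * entropy(subset)
    return entropy(labels) - cond

def gain_ratio(attr_values, labels):
    # The "penalty parameter" is the split information: the entropy of
    # the attribute's own value distribution. Many-valued attributes
    # have large split information, which shrinks the gain ratio.
    split_info = entropy(attr_values)
    return information_gain(attr_values, labels) / split_info if split_info else 0.0
```

For example, an attribute whose values perfectly separate a two-class label set yields an information gain equal to the full label entropy, and the gain ratio divides that by the attribute's split information.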

Optimization of dataset construction
To address the shortcomings of traditional random forest algorithms in scenarios with constantly increasing data, this paper proposes adding incremental learning strategies to the traditional random forest algorithm. Incremental learning refers to the ability of machine learning models to utilize existing knowledge to quickly learn the newly added parts when processing new samples and to gradually improve learning accuracy as the sample set accumulates, thus forming a continuous learning process.
Random forests are composed of multiple decision trees, so the incremental learning strategy for random forests mainly improves the decision trees in the forest so that they have the ability to learn incrementally. In incremental decision tree generation, the decision tree is first established from historical data, in the same way as the traditional decision tree construction algorithm. When a new sample instance arrives, the leaf node where the sample instance lands is updated according to the model update algorithm. The general steps for construction and updating are shown in Figure 1. The improved random forest incremental algorithm first establishes an initial random forest fault prediction model based on transformer historical fault data. On this basis, using continuously updated data, the initial random forest fault prediction model is used to obtain the predicted state. Data whose predicted state does not match the actual state is selected and stored in the corresponding decision tree node. When the new data in a node changes the category of that node, the data in the node is re-split and the node information in the decision tree is updated.
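The leaf-update step described above (store mismatched samples at the leaf they reach, and refresh the leaf when the stored data changes its category) can be sketched as follows. This is an illustrative simplification, not the authors' implementation; a full version would re-split the node rather than only relabel it, as the comment notes.

```python
# Illustrative incremental-leaf update: mismatched samples are stored
# at the leaf, and the leaf's class is refreshed when the stored
# data's majority category changes.
from collections import Counter

class IncrementalLeaf:
    def __init__(self, label, samples=None):
        self.label = label                   # current class of this leaf
        self.samples = list(samples or [])   # (features, actual_label) pairs

    def update(self, features, actual_label):
        # Store the new sample, then re-evaluate the majority class.
        self.samples.append((features, actual_label))
        majority = Counter(lab for _, lab in self.samples).most_common(1)[0][0]
        if majority != self.label:
            # In the full algorithm this is where the node would be
            # re-split; here we only refresh the leaf's label.
            self.label = majority
        return self.label
```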

Screening of key influencing factors
By utilizing the characteristics of transformers with different faults under different gas contents, this paper analyzes the relationship between the contents of the characteristic gases, and their ratios, and the different types of faults present in transformers. Using the Spearman correlation analysis method, as in Equation (1), the size of the impact factor is calculated, mainly covering the gas ratios of nine different combinations, to identify the quantities related to the fault properties. The ratio diagnosis method for transformer faults determines the type of fault using the gas ratios related to the nature of the fault. The nine ratios are: CH4/H2, C2H4/C2H2, C2H4/C2H6, C2H2/(C1+C2), H2/(H2+C1+C2), C2H4/(C1+C2), CH4/(C1+C2), C2H6/(C1+C2), and (CH4+C2H4)/(C1+C2). Here, C1 represents the first-order hydrocarbon, CH4, and C2 represents the second-order hydrocarbons C2H6, C2H4, and C2H2. Using the above ratios as the characteristic parameters of the diagnostic model, different combinations can fully explore the correlations in the data and better mine the key influencing factors.
where R(x) and R(y) are the rank positions of x and y, respectively, R̄(x) and R̄(y) represent the average rank positions, and n represents the total number of observed samples.
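The feature construction and screening above can be sketched as follows: compute the nine characteristic gas ratios from raw DGA concentrations, then rank-correlate a ratio with the fault indicator. The concentrations and function names are illustrative assumptions, and the Spearman helper ignores tied ranks for simplicity.

```python
# Nine DGA characteristic gas ratios plus a simple Spearman
# correlation (Pearson correlation between rank positions,
# ties ignored for simplicity).
import numpy as np

def gas_ratios(h2, ch4, c2h6, c2h4, c2h2):
    c1 = ch4                   # first-order hydrocarbon (CH4)
    c2 = c2h6 + c2h4 + c2h2    # second-order hydrocarbons
    tot = c1 + c2
    return {
        "CH4/H2": ch4 / h2,
        "C2H4/C2H2": c2h4 / c2h2,
        "C2H4/C2H6": c2h4 / c2h6,
        "C2H2/(C1+C2)": c2h2 / tot,
        "H2/(H2+C1+C2)": h2 / (h2 + tot),
        "C2H4/(C1+C2)": c2h4 / tot,
        "CH4/(C1+C2)": ch4 / tot,
        "C2H6/(C1+C2)": c2h6 / tot,
        "(CH4+C2H4)/(C1+C2)": (ch4 + c2h4) / tot,
    }

def spearman(x, y):
    # Ranks R(x), R(y) via double argsort, then Pearson correlation.
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return np.corrcoef(rx, ry)[0, 1]
```

In practice, `scipy.stats.spearmanr` handles ties with average ranks and would be preferable on real monitoring data.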

Fault type classification
The types of transformer faults are mainly divided into six types: low-temperature overheating, medium-temperature overheating, high-temperature overheating, partial discharge, low-energy discharge, and high-energy discharge. Long-term discharge faults cause an increase in the insulation oil temperature of the transformer, leading to equipment overheating faults. As a result, the oil chromatogram data exhibits both discharge and overheating fault characteristics; if such data is not distinguished, it will inevitably affect the diagnosis of fault types. Fault types are encoded separately using the one-hot encoding rule, and since transformer fault prediction is a nonlinear multi-class classification problem, a softmax classifier is used to output the diagnostic results.
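The output stage described above can be sketched as follows: one-hot encoding of the six fault types and a softmax over class scores. The scores in the example are illustrative, not model outputs, and the English fault-type labels are assumptions.

```python
# One-hot encoding of the six fault types and a numerically stable
# softmax that turns class scores into a probability distribution.
import numpy as np

FAULT_TYPES = ["low-temp overheating", "mid-temp overheating",
               "high-temp overheating", "partial discharge",
               "low-energy discharge", "high-energy discharge"]

def one_hot(fault_type):
    vec = np.zeros(len(FAULT_TYPES))
    vec[FAULT_TYPES.index(fault_type)] = 1.0
    return vec

def softmax(scores):
    e = np.exp(scores - np.max(scores))  # subtract max for stability
    return e / e.sum()
```

The predicted fault type is then the class with the largest softmax probability.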

Experimental environment
The relevant configuration of the experimental environment in this article is as follows: the processor is a twelve-core Intel Core i7-13650HX @ 4.90 GHz; the graphics card is an Nvidia GeForce GTX 950M (8 GB/Lenovo); and the memory is 8 GB (Samsung DDR4, 2400 MHz).

Data preparation
The data used by the improved random forest algorithm proposed in this article consists of two parts: fault cases collected during transformer operation and monitoring data on the dissolved gas content in oil from the real-time transformer monitoring system. The transformer equipment fault cases are composed of experimental data from transformer manufacturers and operation records collected during transformer operation. The experimental data on transformer faults, collected by various research institutions, includes the types and values of dissolved characteristic gases in transformer oil and the status types of the transformer equipment, and is used as the dataset of transformer fault cases. The data structure is shown in Table 1.
Table 1. Dataset of transformer equipment fault cases.
One transformer in the dataset, for example, was manufactured in 2009 and tested multiple times in 2011, with no abnormalities found. We evaluate the performance of the proposed detection algorithm using accuracy (Acc), precision (Prec), recall rate (Rec), and false positive rate (FPR) to assess its effectiveness, stability, sensitivity, and specificity. The calculation of Acc is shown in Equation (7); it represents the correct classification ratio. When Acc is 1, both abnormal and normal samples are all correctly classified. Prec is calculated through Equation (8); it represents the correct classification ratio among the samples classified as abnormal. When Prec is 1, no normal sample is misjudged as abnormal. Equation (9) shows the calculation of Rec, which represents the probability that abnormal samples are correctly classified. When Rec is 1, no abnormal sample is misjudged. The FPR in Equation (10) represents the probability that normal data is identified as abnormal.
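The four metrics above can be written in terms of confusion-matrix counts, where "positive" means an abnormal sample. This is a hedged sketch of the standard definitions corresponding to Equations (7)-(10); the function name is an assumption.

```python
# Acc, Prec, Rec, and FPR from confusion-matrix counts
# (tp/fp/tn/fn), where "positive" means an abnormal sample.
def metrics(tp, fp, tn, fn):
    acc = (tp + tn) / (tp + fp + tn + fn)  # Eq. (7): overall correct ratio
    prec = tp / (tp + fp)                  # Eq. (8): correct among predicted abnormal
    rec = tp / (tp + fn)                   # Eq. (9): abnormal correctly found
    fpr = fp / (fp + tn)                   # Eq. (10): normal flagged as abnormal
    return acc, prec, rec, fpr
```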
Multiple models were used for comparative testing, and the test results are shown in Table 2. We compare and analyze the proposed method against a random forest, a support vector machine, and a multi-layer neural network to verify the effectiveness of the improved random forest method. In the experiment, the decision trees of the unmodified random forest selected the optimal features based on the principle of minimizing the Gini coefficient; the neural network consisted of one input layer, two hidden layers, and one output layer; and the support vector machine used a Gaussian radial basis function as its kernel. The radial basis function is a nonlinear kernel suitable for transformer fault diagnosis. From Table 2, it can be seen that the accuracy, precision, recall rate, and false positive rate of the improved random forest method proposed in this article reach 89.25%, 89.34%, 88.93%, and 89.27%, respectively. All four indicators are superior to those of the other machine learning methods, verifying the accuracy and effectiveness of this method in transformer fault diagnosis.
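The baseline models named above might be configured in scikit-learn as sketched below. This is an assumption about the setup, not the authors' configuration: the paper specifies the Gini criterion, the RBF kernel, and a two-hidden-layer network, but not the hidden-layer sizes, which are illustrative here.

```python
# Illustrative scikit-learn configuration of the three baselines:
# a Gini-criterion random forest, an RBF-kernel SVM, and an MLP
# with two hidden layers (sizes are assumed, not from the paper).
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier

baselines = {
    "random forest": RandomForestClassifier(criterion="gini"),
    "SVM (RBF kernel)": SVC(kernel="rbf"),
    "neural network": MLPClassifier(hidden_layer_sizes=(32, 16)),
}
```

Each baseline would then be trained and scored on the same fault-case dataset with the four metrics above.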

In Equation (3), H(D) represents the information entropy of Dataset D, as shown in Equation (4), and H(D|A) represents the information entropy of Dataset D after segmentation by Attribute A, as shown in Equation (5). |D| represents the number of data in Dataset D, k represents the total number of categories in Dataset D, |Ci| represents the number of data of the i-th category Ci in Dataset D, n represents the number of data subsets after Dataset D is divided by Attribute A, |Dj| represents the number of data in the j-th data subset Dj, and |Cji| represents the number of data with Category Ci in subset Dj.

Conclusion
The random forest algorithm proposed in this article has been improved in multiple aspects, and the improved algorithm shows excellent recognition performance in transformer fault diagnosis, making it well suited to diagnosing faults in transformers in operation. The experimental results demonstrate that the accuracy, precision, recall rate, and false positive rate of the improved random forest method proposed in this paper reach a high level, verifying the accuracy and effectiveness of this method.

Table 2. Comparative test results of the different diagnostic models.