Feeder Fault Warning of Distribution Network Based on XGBoost

In this paper, historical observation data of the power grid are used to build a predictive model of power outages in the distribution network using machine learning methods. By judging whether the distribution transformer network is about to fail, maintenance and troubleshooting of the distribution network can be carried out in advance, thereby fundamentally reducing the occurrence of power faults in distribution transformers. The data cover several dimensions, such as distribution network loads, equipment ledgers, historical faults, and weather. The experiments show that the proposed XGBoost-based method is valid and efficient for feeder fault early warning.


1. Introduction
More than 80% of feeder faults occur on the distribution side of the power system. For a long time, sorting out the risk factors of distribution network blackouts and identifying hazard sources has relied mainly on the experience of professional technical managers and operators. This experience is neither comprehensive nor objective, and effective methods and scientific tools for identifying the subtle signs of hidden blackout risks are lacking. It is therefore difficult to identify and control risks in time, before a power outage occurs. Because part of the knowledge about hidden hazard characteristics is not mastered in advance, the hazards cannot be eradicated after an outage, which results in frequent power outages.
To effectively solve the problem of repeated power outages, improve the reliability of power supply in distribution networks, and reduce customer complaints, it is urgent to uncover the hidden factors that cause frequent power outages and to study the mechanisms by which they cause them. An active early warning application function provides a scientific basis for the active prevention and control of frequent power outages.
In recent years, many methods, such as rough sets [1,2] and Bayesian networks [3], have been proposed to solve the problem of power distribution network fault diagnosis. These methods are mainly grounded on the actions of protection devices or on a particular electrical feature. With the development of massive power grid information systems, big data techniques such as support vector machines (SVM) [4] and neural networks [5,6] have become available for mining and forecasting. However, the methods mentioned above do not perform satisfactorily when dealing with power system faults. A further limitation is that the decision rules obtained by these black-box models cannot be understood from a human perspective.
XGBoost [7] is a scalable variant of the Gradient Boosting Machine (GBM), which has proved to be an excellent tool among artificial intelligence methods owing to its easy parallelism and high prediction accuracy. In [8], an XGBoost-based algorithm is used for post-fault transient stability status prediction of a power system, which indicates that such machine learning methods can be combined with power system data to locate power system faults.

2. Model for feeder fault warning

Features of frequent power outages
The key to the technology is to build a set of features that comprehensively and accurately represents the status of the distribution network. Power outages in an actual distribution network are affected not only by power supply facilities such as distribution equipment, overhead lines, poles, and towers, but also by multiple other factors, such as external force damage and improper installation and maintenance, all of which affect the stability of the entire distribution network power supply. Therefore, omitting any of these links from the early warning of frequent power outages will lead to deviations in the results. Before the model was built, the technical reasons for historical faults, the causes of responsibility, the text descriptions of historical faults in the distribution network, and the existing system data were analyzed, and the following major features were extracted: feeder characteristics, equipment account characteristics, equipment failure information, historical failure information, grid load data, and weather data. Starting from this comprehensive feature set, the GBDT (Gradient Boosting Decision Tree) method is used to reselect the features, with the aim of retaining the relevant features and removing those that have low correlation with the target or a negative impact on the accuracy of the model.

Construction of feeder fault prediction model
Based on a priori business understanding and data exploration, the factors that may be related to power outages are extracted to form the features of the machine learning model. The model takes feeder fault prediction as its research goal. By analyzing the causes of the feeder failure outages that occurred in the distribution network from January to June 2017, features describing the weather, feeder operation period, feeder load, and historical number of feeder failures are extracted. Among them, the weather is the main feature that affects power outages, and this feature is taken into account when data compression is performed. Internal data are extracted from various business systems of the power grid, and external weather data are obtained from the Meteorological Bureau.
Data processing is required before the model can be tuned, including outlier and missing value handling, normalization, and over/under sampling. The algorithm adopted in this project is XGBoost, tuned by a brute-force search over the space of hyperparameter combinations. Each hyperparameter combination is substituted into the learning function as a new model and evaluated by K-fold cross-validation. The evaluation criterion is the AUC (Area Under Curve) of the ROC (Receiver Operating Characteristic) curve. The vertical axis of the ROC curve is the TPR (True Positive Rate), which is the proportion of minority-class samples that are correctly predicted among all minority-class samples. The horizontal axis is the FPR (False Positive Rate), which is the proportion of majority-class samples that are mispredicted among all majority-class samples. The further the TPR exceeds the FPR along the curve, the larger the area under the ROC curve, indicating that the model works well. The whole flow of the proposed model is shown in Fig. 1.
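The brute-force search with K-fold cross-validation described above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the study. The function names, the grid values, and in particular `toy_fit_and_score` are hypothetical: the latter is a placeholder for the step where an XGBoost model would be trained on the training folds and its AUC measured on the validation fold.

```python
import itertools
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the sample indices and split them into k nearly equal folds."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def grid_search(param_grid, n_samples, k, fit_and_score):
    """Score every hyperparameter combination by its mean K-fold validation score."""
    folds = k_fold_indices(n_samples, k)
    names = sorted(param_grid)
    results = []
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        scores = []
        for i, valid_idx in enumerate(folds):
            # All folds except fold i form the training set.
            train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
            scores.append(fit_and_score(params, train_idx, valid_idx))
        results.append((sum(scores) / k, params))
    best_score, best_params = max(results, key=lambda r: r[0])
    return best_params, best_score, results


def toy_fit_and_score(params, train_idx, valid_idx):
    # Hypothetical stand-in: a real run would train a model with `params`
    # on train_idx and return its AUC on valid_idx.
    return 1.0 - 0.01 * abs(params["max_depth"] - 4) - abs(params["learning_rate"] - 0.1)


# Assumed grid values, chosen only for illustration.
param_grid = {"max_depth": [2, 4, 6], "learning_rate": [0.05, 0.1]}
best_params, best_score, results = grid_search(
    param_grid, n_samples=100, k=5, fit_and_score=toy_fit_and_score
)
print(best_params, best_score)
```

The cost of this search grows with the product of the grid sizes times K, which is why the grid is usually kept coarse.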

XGBoost
GBDT is a boosting algorithm in ensemble learning. Its working mechanism is to train several classification decision trees, each of which is a weak classifier. The linear weighted combination of these weak classifiers constitutes a strong classifier.
Because an algorithm with regularization terms was preferred, logistic regression was selected as the representative single classifier. XGBoost, in turn, adds a regularization term to GBDT (Gradient Boosting Decision Tree).
The principle of the XGBoost algorithm is to train base classifiers on sub-data sets drawn at random from the original data set, and then to sum the results of these weak classifiers with certain weights to produce the final prediction. By adding a regularization term, XGBoost can effectively prevent overfitting, so it is suitable for this study. The target loss function of the GBDT algorithm is

$$\min_{\beta,\,\theta} \sum_{i=1}^{N} L\big(y_i,\; f_{m-1}(x_i) + \beta\, T(x_i; \theta)\big)$$

In the formula, $y_i$ represents the true value of the sample, $i$ is the sample number, $N$ is the total number of samples, $f_{m-1}$ represents the current classifier, $\beta$ is the weight, $T(x; \theta)$ represents the decision tree trained in the current iteration, $\theta$ represents its parameters, and $x$ represents the samples.

On this basis, XGBoost defines the structural complexity function $\Omega(f)$ by introducing the number of leaf nodes $T$ and the leaf score function $\omega$ of the decision tree:

$$\Omega(f) = \gamma T + \frac{1}{2} \lambda \lVert \omega \rVert^2$$

so that the target loss function is transformed into:

$$Obj = \sum_{i=1}^{N} L(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)$$

In the formula, $\hat{y}_i$ represents the sample prediction value, $\gamma$ is the penalty coefficient on the number of leaf nodes (the L1-type regularization term), and $\lambda$ is the penalty coefficient of the L2 regularization on the leaf scores.
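As a worked illustration of the regularized objective, the sketch below evaluates the complexity term as gamma times the number of leaves plus half of lambda times the sum of squared leaf scores, and adds it to a data loss. It assumes binary log loss as the loss function L, which the text does not specify, and all leaf scores, labels, and probabilities are made-up values for illustration only.

```python
import math


def log_loss(y_true, y_prob):
    """Sum of binary cross-entropy losses over all samples (assumed choice of L)."""
    eps = 1e-12  # guards against log(0)
    return -sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, y_prob)
    )


def tree_complexity(leaf_scores, gamma, lam):
    """Omega(f) = gamma * T + 0.5 * lambda * sum(w_j^2), with T the number of leaves."""
    return gamma * len(leaf_scores) + 0.5 * lam * sum(w * w for w in leaf_scores)


def regularized_objective(y_true, y_prob, trees, gamma, lam):
    """Data loss plus the complexity penalty of every tree in the ensemble."""
    penalty = sum(tree_complexity(t, gamma, lam) for t in trees)
    return log_loss(y_true, y_prob) + penalty


# Hypothetical ensemble: each tree is represented only by its leaf scores.
trees = [[0.4, -0.2], [0.1, -0.1, 0.3]]
y_true = [1, 0, 1]
y_prob = [0.8, 0.3, 0.6]

omega_first = tree_complexity(trees[0], gamma=1.0, lam=2.0)
objective = regularized_objective(y_true, y_prob, trees, gamma=1.0, lam=2.0)
print(omega_first, objective)
```

With gamma = 1 and lambda = 2, the first tree contributes 1 * 2 + 0.5 * 2 * (0.16 + 0.04) = 2.2 to the penalty, so a tree with more leaves or larger leaf scores must reduce the data loss by more than its penalty to be worthwhile.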

Oversampling
An unbalanced dataset is one in which the target variable has more observations in one specific class than in the others. The difference in the number of samples makes the information that the different classes provide to the training algorithm asymmetric, which degrades the performance of the machine learning model. Because minority-class samples are relatively scarce, their influence on the classifier is too low. In this project, a feeder that has a power fault during a specific day is a positive sample, and the rest are negative samples. The sample time window covers 2016 to 2017. As of 2017, there are more than 16,000 feeders in Fujian province.
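One simple way to rebalance such a dataset is random oversampling, in which minority-class samples are duplicated at random until both classes are the same size. The text does not state which oversampling technique was used, so the sketch below is an assumed stand-in using only the Python standard library. The function name and the toy data are hypothetical.

```python
import random


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority samples until both classes are equal in size."""
    rng = random.Random(seed)
    pos = [s for s, y in zip(samples, labels) if y == 1]
    neg = [s for s, y in zip(samples, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must contain at least one sample")
    minority, minority_label = (pos, 1) if len(pos) < len(neg) else (neg, 0)
    n_extra = max(len(pos), len(neg)) - len(minority)
    # Draw with replacement from the minority class.
    extra = [rng.choice(minority) for _ in range(n_extra)]
    return list(samples) + extra, list(labels) + [minority_label] * n_extra


# Toy data: 2 positive (fault) samples and 8 negative samples.
samples = list(range(10))
labels = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
new_samples, new_labels = random_oversample(samples, labels)
print(sum(new_labels), len(new_labels) - sum(new_labels))
```

Oversampling should be applied to the training folds only, so that duplicated samples never appear in the validation data.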

Correlation analysis of consecutive fault-free days
The number of consecutive fault-free days has the highest feature importance score. It is discretized using the maximum information gain method, and the failure rates of the resulting intervals are compared. It can be seen from the table that the longer a feeder has run without a fault, the better its overall stability and the lower its probability of a power outage. There are about 16,000 feeders in Fujian province, with an average of about 78 faults per day (based on data from 2016 to 2017), so the average daily fault rate is about 0.5%. For feeders that have had no outage for more than 600 days, the fault rate is about 0.12%, which is well below the average fault rate.
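The arithmetic and the discretization step above can be sketched as follows. The first line reproduces the average daily fault rate from the figures in the text. The rest is a minimal illustration of choosing one cut point by maximum information gain. The fault-free day counts and labels are invented for illustration and are not taken from the study.

```python
import math

# Average daily fault rate from the figures in the text: 78 faults over 16,000 feeders.
daily_rate = 78 / 16000  # 0.004875, i.e. about 0.5%


def entropy(labels):
    """Binary entropy (in bits) of a list of 0/1 labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(values, labels):
    """Return the (threshold, gain) pair that maximizes information gain."""
    base = entropy(labels)
    n = len(labels)
    best = (None, 0.0)
    candidates = sorted(set(values))
    # Candidate thresholds are midpoints between consecutive distinct values.
    for lo, hi in zip(candidates, candidates[1:]):
        thr = (lo + hi) / 2
        left = [y for v, y in zip(values, labels) if v <= thr]
        right = [y for v, y in zip(values, labels) if v > thr]
        gain = base - len(left) / n * entropy(left) - len(right) / n * entropy(right)
        if gain > best[1]:
            best = (thr, gain)
    return best


# Hypothetical data: consecutive fault-free days and whether a fault followed.
days = [5, 20, 40, 300, 650, 700, 900, 1000]
fault = [1, 1, 1, 0, 0, 0, 0, 0]
threshold, gain = best_split(days, fault)
print(daily_rate, threshold, gain)
```

Applying `best_split` recursively to each side of the chosen threshold yields several intervals, whose failure rates can then be compared.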

Analysis of the influence of feeder commissioning period and duration
Several features in the system are related to the commissioning period of the equipment, namely: transformer average commissioning period, cable average commissioning period, lightning arrester average commissioning period, pole tower average commissioning period, and conductor average operation period. Taking feeder average commissioning years as an example, the relationship between feeder commissioning years and power outages is studied, and the value of transformer average commissioning years is likewise discretized based on the maximum information gain. It can be seen from the table that at first the failure rate rises as the equipment age increases. After a certain age is reached, the failure rate begins to decline as the age increases. The reason may be that quality problems are gradually exposed after the equipment is first put into use, and the defective equipment is replaced accordingly. Equipment with a longer service life is therefore of better quality and has a lower probability of failure.
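Once the commissioning years have been discretized, the failure rate of each interval can be computed directly. The sketch below is a minimal illustration. The bin edges, ages, and fault labels are hypothetical values chosen to show a rate that first rises and then falls with age, and they are not results from the study.

```python
from bisect import bisect_right


def failure_rate_by_bin(ages, faults, edges):
    """Failure rate per interval; bin i holds ages in [edges[i-1], edges[i])."""
    counts = [0] * (len(edges) + 1)
    fails = [0] * (len(edges) + 1)
    for age, y in zip(ages, faults):
        b = bisect_right(edges, age)  # index of the interval containing `age`
        counts[b] += 1
        fails[b] += y
    # Empty bins have no defined rate.
    return [f / c if c else None for f, c in zip(fails, counts)]


# Hypothetical average commissioning years and fault indicators.
ages = [1, 2, 4, 6, 7, 12, 15, 20]
faults = [0, 0, 1, 1, 0, 0, 0, 0]
edges = [3, 10]  # assumed cut points: under 3 years, 3 to 10 years, 10 years or more

rates = failure_rate_by_bin(ages, faults, edges)
print(rates)
```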

Receiver Operating Characteristic (ROC)
The samples are sorted according to the prediction scores of the learner. Following this order, the samples are predicted as positive examples one by one, and the TPR and the FPR are calculated each time.
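The sorting procedure above can be sketched in a few lines of standard-library Python. The labels and scores are made-up values for illustration, and the area is computed with the trapezoidal rule. Samples with equal scores are processed together so that ties do not produce an artificial step in the curve.

```python
def roc_curve_points(y_true, scores):
    """Return the (FPR, TPR) points obtained by lowering the threshold sample by sample."""
    pairs = sorted(zip(scores, y_true), key=lambda p: -p[0])
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    tp = fp = 0
    points = [(0.0, 0.0)]
    i = 0
    while i < len(pairs):
        score = pairs[i][0]
        # Mark every sample with this score as positive before recording a point.
        while i < len(pairs) and pairs[i][0] == score:
            if pairs[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / n_neg, tp / n_pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum(
        (x1 - x0) * (y0 + y1) / 2
        for (x0, y0), (x1, y1) in zip(points, points[1:])
    )


# Hypothetical labels (1 = fault) and predicted scores.
y_true = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

points = roc_curve_points(y_true, scores)
area = auc(points)
print(points[-1], area)
```

In this toy example, 7 of the 9 positive-negative pairs are ranked correctly, which matches the computed area of 7/9.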
The area under the ROC curve is called the AUC. The value of the AUC varies between 0 and 1, and the larger the better. Generally, an AUC greater than 0.7 means that the model is valid. The AUC of the model we built is 0.8899, indicating that the model is valid.

The relationship between the features and the occurrence of power outages is revealed to a certain extent. The data cover dimensions such as distribution network loads, equipment ledgers, historical faults, and weather. The correlations of consecutive fault-free days and of the feeder commissioning period with the fault rate are analyzed. The AUC of 0.8899 indicates that the model is valid and efficient for feeder fault early warning.