Business Anomaly Detection Method of Power Dispatching Automation System Based on Clustering Under-Sampling in the Boundary Region

Timely detection of business anomalies in the power dispatching automation system is significant for the steady operation of the power grid. Although imbalanced binary classification in machine learning is an effective way to detect business anomalies in the system, the overlap of boundary samples is an urgent issue affecting the classification effect. An under-sampling method that removes the clustering noise of the majority samples in the boundary region is proposed. First, KNN is used to search the adjacent points of each majority sample, and the boundary region and the safety region are divided according to the proportion of majority samples among those adjacent points. Second, DBSCAN is used to cluster the majority samples in the boundary region, and the noise points are removed. Finally, the method is combined with the ensemble method based on model dynamic selection driven by data partition hybrid sampling (DPHS-MDS). In this way, the overlap degree of boundary samples is reduced, the dataset is balanced, and the classification effect is improved. Experimental results show that the proposed method is superior to relevant mainstream methods under F-measure and G-mean.


Introduction
The power dispatching automation system is of great significance for the steady operation of the power grid, and it is very important to detect system anomalies in time. Because the system runs stably, the numbers of normal and abnormal samples differ greatly, and imbalanced binary classification in machine learning is one of the effective means to solve this problem [1]. In recent years, one research hotspot has been the combination of data preprocessing and algorithm improvement; relevant methods include RUSboost [2], SMOTEboost [3], EasyEnsemble [4], BalancedBagging [5], BRAF [6] and DTE-SBD [7]. The ensemble imbalanced classification method based on model dynamic selection driven by data partition hybrid sampling (DPHS-MDS) proposed by Xin Gao is typical [8]. However, its BMW-SMOTE step samples only the minority class in the boundary region and does not reduce the overlap of samples there. Therefore, this paper proposes an under-sampling method that removes the clustering noise of the majority class in the boundary region. Combined with DPHS-MDS [8], it reduces the overlap degree of samples in the boundary region and balances the dataset, so as to improve the classification effect.

Analysis on the Improvement of the Method
In 2020, Xin Gao proposed the ensemble imbalanced classification method DPHS-MDS [8]. This method conducts BMW-SMOTE sampling only for the minority samples in the boundary region and does not reduce the overlap degree of samples in that region [8].
To solve this problem, this paper proposes an under-sampling method that removes the clustering noise of the majority class in the boundary region. On the basis of BMW-SMOTE over-sampling of the minority class in the boundary region, we aim to reduce the sample overlap by under-sampling the majority class [8]. However, random under-sampling may discard important samples, so the density-based clustering algorithm DBSCAN is used instead [9]. The samples that cannot be clustered are regarded as clustering noise points. We argue that in the neighborhood of such a point the presence of majority samples is not of universal significance and its relative importance is low; removing such points therefore retains the relatively important majority points while balancing the dataset. The method consists of two steps: first, the KNN algorithm is used to identify the majority class samples in the boundary region [10]; then, the majority points in the boundary region are clustered by DBSCAN and the clustering noise points are removed [9].

Dividing Majority Class Samples into the Boundary Region Based on KNN
Suppose the distribution of some samples of the dataset in two-dimensional space is as shown in figure 1a. Taking k = 5 as an example, find the k nearest points of each majority class point and count how many of them belong to the majority class. If all adjacent points are of the majority class, the sample point is assigned to the safety region; otherwise it is assigned to the boundary region.
The specific algorithm steps are as follows: according to the imbalanced dataset D and the majority class label Lmaj, the majority class dataset Dmaj is extracted. Traversing Dmaj, the KNN algorithm is used to search the k adjacent points of each majority class point di, and the number of majority samples Ni- among the adjacent points is counted [10]. If Ni- = k, the majority sample point di is assigned to the majority class safety region Dsafe-; otherwise it is assigned to the majority class boundary region Dborder-.
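The steps above can be sketched as follows. This is a minimal illustration using scikit-learn's NearestNeighbors; the function name and the label convention (maj_label marking the majority class) are our own and not from the paper.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def split_majority_regions(X, y, maj_label=0, k=5):
    """Divide majority samples into a safety region and a boundary region.

    A majority point whose k nearest neighbors (excluding itself) are all
    majority points goes to the safety region; otherwise it goes to the
    boundary region. Illustrative sketch, not the authors' code.
    """
    maj_idx = np.where(y == maj_label)[0]
    # Query k+1 neighbors because each point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh = nn.kneighbors(X[maj_idx])
    safe, border = [], []
    for row, i in zip(neigh, maj_idx):
        neigh_labels = y[row[1:]]  # drop the query point itself
        if np.all(neigh_labels == maj_label):
            safe.append(i)
        else:
            border.append(i)
    return np.array(safe), np.array(border)
```

A majority point surrounded only by majority points is "safe"; a majority point with at least one minority neighbor lies near the class boundary.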

Under-Sampling of the Majority Clustering Noise in the Boundary Region Based on DBSCAN
After obtaining the majority class samples in the boundary region, the DBSCAN algorithm is used for clustering [9]. The basic principle is as follows: for each sample point, search whether other sample points exist in its neighborhood; if not, mark it as a clustering noise point. As shown in figure 1b, if two points lie in each other's neighborhood, they are defined as density-connected. A series of density-connected points are clustered into one class, and the clustering noise points that cannot be clustered are removed to achieve the effect of under-sampling.
The specific algorithm steps are as follows: first, according to the dataset D and the sample labels Lmaj and Lmin, the number of majority class samples N+ and the number of minority class samples N- are calculated, and the imbalance rate r of the dataset is obtained. The appropriate neighborhood threshold eps is then calculated according to equation (1). Using these parameters, DBSCAN assigns clustering labels c_label to the majority class samples in the boundary region Dborder-, and the sample points whose clustering label is the noise label Lnoise are deleted. Finally, the processed majority class dataset of the boundary region Dborder_us- is obtained.
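The noise-removal step can be sketched with scikit-learn's DBSCAN, which labels unclusterable points -1. In the paper eps comes from equation (1); here it is left as a plain parameter, and the function name is our own.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def undersample_boundary_noise(X_border, eps, min_samples=2):
    """Cluster majority boundary samples with DBSCAN and drop noise points.

    Points that cannot be clustered get the label -1 (DBSCAN's noise label);
    removing them under-samples the majority class while keeping the
    density-connected, relatively important points. Illustrative sketch.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_border)
    keep = labels != -1          # True for clustered (density-connected) points
    return X_border[keep], keep
```

With min_samples=2, any point that has at least one neighbor within eps is density-connected; an isolated boundary point is treated as noise and removed.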

DPHS-cnus-MDS Overall Framework
The overall framework of the proposed method (DPHS-cnus-MDS) is as follows. As shown in figure 1c, samples in the minority class noise region are removed from the original dataset to form the filtered dataset. After under-sampling the clustering noise of the majority class in the boundary region, the minority class in the boundary region is over-sampled by BMW-SMOTE, and the majority class in the safety region is clustered and then randomly under-sampled. The minority class samples in the safety region are retained, and the balanced dataset is formed by merging the processed samples from the above regions. The original random forest model and the biased ones are generated from the filtered dataset and the balanced dataset respectively, and the base classifiers (decision trees) of the two models are integrated into a hybrid random forest model. Finally, the type of each test point is judged according to the proportion of majority class training samples in its neighborhood, and a suitable model is dynamically selected for each type of test point to classify it [8].
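The final dynamic-selection step can be illustrated as follows. This sketch assumes a simple rule, choosing between two pre-trained models by the majority proportion among a test point's k nearest training samples; the function name, threshold parameter, and selection rule are our simplification of the MDS stage, not the paper's exact procedure.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import NearestNeighbors

def dynamic_select_predict(x, X_train, y_train, model_orig, model_biased,
                           maj_label=0, k=5, threshold=0.5):
    """Classify one test point with a dynamically selected model.

    Compute the proportion of majority class samples among the k nearest
    training points; if the test point lies in a majority-dominated
    neighborhood, use the original model, otherwise the biased model.
    Hypothetical sketch of the model dynamic selection idea.
    """
    nn = NearestNeighbors(n_neighbors=k).fit(X_train)
    _, idx = nn.kneighbors(x.reshape(1, -1))
    maj_ratio = np.mean(y_train[idx[0]] == maj_label)
    model = model_orig if maj_ratio >= threshold else model_biased
    return model.predict(x.reshape(1, -1))[0]
```

The intent is that points deep in the majority region are handled by the model trained on the filtered data, while points near the boundary or the minority region are handled by the model trained on the balanced data.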

Experiment
The experimental data come from UCI, KEEL and power dispatching business. To fully compare the proposed method with relevant methods, the Friedman test was carried out [11]. F-measure and G-mean are used as evaluation indexes to assess the classification effect of the method.

Experiments on Public Datasets
To verify that the proposed method (DPHS-cnus-MDS) effectively improves the classification of imbalanced data, seven relevant mainstream methods are compared: RUSboost [2], SMOTEboost [3], EasyEnsemble [4], BalancedBagging [5], BRAF [6], DTE-SBD [7] and DPHS-MDS [8]. Tables 1 and 2 show the F-measure and G-mean of DPHS-cnus-MDS and the above methods on the 8 selected public datasets. According to the analysis of tables 1 and 2, the proposed method is superior to the other methods under F-measure and G-mean, achieving the best result on 7 of the 8 datasets, and its average ranks of 1.13 (F-measure) and 1.25 (G-mean) place it first in the Friedman test. This shows that the proposed method not only ensures the classification accuracy of the minority class but also reduces the misjudgment of the majority class.

Business Anomaly Detection in the Power Dispatching Automation System
The effectiveness of the proposed method is further verified on three kinds of business anomaly data of power dispatching: application disconnection (AD), data jump (DJ) and telemetry table not refreshing (TTNR) [12]. The experimental results are shown in table 3. The proposed method achieves the best results in F-measure and G-mean, which shows that it has advantages in business anomaly detection for the power dispatching automation system.

Conclusion
To solve the problem of sample overlap in the boundary region, an under-sampling method that removes the majority clustering noise in the boundary region is proposed. Combined with DPHS-MDS, the proposed method is compared with relevant methods. Experimental results on 8 imbalanced datasets show that the proposed method performs better on F-measure and G-mean, which are improved by 5.4% and 9.1% respectively compared with DPHS-MDS; its Friedman ranks of 1.13 and 1.25 on F-measure and G-mean respectively are the first. Finally, the proposed method is applied to the business anomaly datasets of the power dispatching automation system and achieves a good detection effect, with F-measure and G-mean both above 0.99. In summary, the proposed method provides a new idea for balancing datasets; applying it to more hybrid sampling methods and combining it with different classifiers to verify its robustness will be studied in the future.