Research on feature extraction method of power grid online data based on big data

The development of artificial intelligence technology for power system simulation requires massive amounts of open sample data, and constructing such samples from the characteristics of online data is becoming a trend. Online power grid data is rich in information, but its features are underutilized. To address this problem, and focusing on the information features of generators, this paper proposes LTTB dimension reduction combined with DBSCAN + L2 clustering, which reduces the complexity of feature extraction from time series data. The method is verified on actual power grid data and achieves preliminary results.


Introduction
With the wide application of online security analysis and early warning systems, online calculation data and stability results form a set of cross-section data every 5 minutes. Each set is about 10 MB, so one year of data amounts to about 1 TB. This massive online data is rich in information, but its features are underutilized. To address this problem, this paper proposes different dimension reduction and clustering methods for extracting generator and load features from online data. Because online power grid data is time series data, online feature extraction is feature extraction from time series data. First, the common dimension reduction methods are introduced and their advantages and disadvantages are analyzed. Second, the common clustering methods for time series data are introduced and their advantages and disadvantages are analyzed. Finally, preliminary results are obtained on actual power grid data [1].

Dimension reduction of time series data
Time series data is data distributed over the time domain. Commonly used dimension reduction techniques for time series data include PAA (Piecewise Aggregate Approximation) and LTTB (Largest Triangle Three Buckets), among others [2].

PAA
PAA uses a sliding window to reduce the dimension of a time series. The values in each window are aggregated by their mean, and that mean represents all the values in the window.
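The windowed averaging described above can be sketched as follows. This is a minimal illustration, assuming non-overlapping windows of nearly equal width; the function name `paa` and the parameter `n_segments` are hypothetical and do not come from the original experiments.

```python
from typing import List, Sequence


def paa(series: Sequence[float], n_segments: int) -> List[float]:
    """Reduce `series` to `n_segments` values by averaging each window.

    Assumption: windows do not overlap and differ in size by at most one point.
    """
    n = len(series)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and len(series)")
    reduced = []
    for i in range(n_segments):
        # Integer window bounds, so every window holds at least one point.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        window = series[start:end]
        reduced.append(sum(window) / len(window))
    return reduced


# Six samples reduced to three window means.
print(paa([1.0, 3.0, 2.0, 4.0, 10.0, 12.0], n_segments=3))  # [2.0, 3.0, 11.0]
```

Because each window is replaced by its mean, short spikes inside a window are smoothed away, which is the main limitation of PAA compared with point-selection methods.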

LTTB
LTTB reduces and fits time series data based on the maximum effective area. The Largest Triangle Three Buckets algorithm divides the original time series into equal segments (buckets). For each bucket, the algorithm selects the most important point (denoted the f point) to represent all points in that bucket, which achieves the dimension reduction. PAA obtains the f point with the mean method, whereas LTTB selects the f point with the maximum effective area (MEA). The effective area (EA) of a point is defined as the area of the triangle that the point forms with its two adjacent points, as illustrated in Figure 2.
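The bucket-wise selection can be sketched as follows. This is a minimal illustration that assumes the commonly published LTTB formulation, in which the two "adjacent points" of a candidate are the point selected from the previous bucket and the mean of the next bucket; the paper does not spell out this detail, and the names `lttb`, `triangle_area`, and `n_out` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def triangle_area(a: Point, b: Point, c: Point) -> float:
    """Area of the triangle spanned by three (x, y) points."""
    return abs(a[0] * (b[1] - c[1]) + b[0] * (c[1] - a[1]) + c[0] * (a[1] - b[1])) / 2.0


def lttb(points: Sequence[Point], n_out: int) -> List[Point]:
    """Keep `n_out` points: the two endpoints plus one point per interior bucket."""
    n = len(points)
    if n_out < 3:
        raise ValueError("n_out must be at least 3")
    if n_out >= n:
        return list(points)  # nothing to reduce
    inner = n_out - 2  # number of interior buckets
    selected = [points[0]]
    prev = 0  # index of the most recently selected point
    for i in range(inner):
        start = i * (n - 2) // inner + 1
        end = (i + 1) * (n - 2) // inner + 1
        # The "next bucket" of the last interior bucket is just the final point.
        nxt = points[end:min((i + 2) * (n - 2) // inner + 1, n)]
        avg = (sum(p[0] for p in nxt) / len(nxt), sum(p[1] for p in nxt) / len(nxt))
        # Select the candidate with the largest effective area.
        best = max(range(start, end),
                   key=lambda j: triangle_area(points[prev], points[j], avg))
        selected.append(points[best])
        prev = best
    selected.append(points[-1])
    return selected


# A flat series with one spike at x = 3; the spike survives the reduction.
series = [(float(x), 5.0 if x == 3 else 0.0) for x in range(10)]
print(lttb(series, n_out=4))
```

Unlike PAA, the output consists of original samples, so extreme values such as the spike in the example are retained rather than averaged out.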

Clustering method for time series data
A clustering pattern is an abstract algorithmic framework. A clustering algorithm first computes the similarity between samples. By comparison, groups of samples with higher similarity are assigned to the same category, and the process iterates until all samples have been categorized, which completes the clustering task. The clustering process can therefore be decomposed into a similarity calculation combined with a clustering pattern.

K-means clustering
K points in the data set are selected as the initial cluster centers according to some strategy. Each remaining data point is then assigned to the cluster whose center is nearest, so that the data is divided into K clusters. This completes one partition, but the resulting clusters are not necessarily the best partition. Therefore, the center of each new cluster is recalculated, and the partition is repeated until the result no longer changes [3].
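The assign-and-recompute loop can be sketched as follows. This is a minimal illustration, assuming Euclidean distance and the simplest possible initialization strategy (the first K points become the initial centers); the paper does not state which strategy was used, and the function name `kmeans` is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def kmeans(data: Sequence[Vector], k: int,
           max_iter: int = 100) -> Tuple[List[int], List[List[float]]]:
    """Return (labels, centers) once the partition stops changing."""
    if not 1 <= k <= len(data):
        raise ValueError("k must be between 1 and len(data)")
    # Assumed initialization strategy: take the first k points as centers.
    centers = [list(p) for p in data[:k]]
    labels = [-1] * len(data)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                      for p in data]
        if new_labels == labels:
            break  # partition unchanged, so the algorithm has converged
        labels = new_labels
        # Update step: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(data, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
labels, centers = kmeans(points, k=2)
print(labels, centers)
```

For time series, each curve would be passed as one vector whose length equals the reduced dimension. The result depends on the initial centers, which is one reason the choice of K and of the starting strategy matters.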

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) clustering algorithm
DBSCAN is a typical density-based clustering algorithm that tolerates noise. Unlike K-means, DBSCAN can be applied to both convex and non-convex sample sets. It takes two parameters: the neighborhood threshold and MinPts, the minimum number of points required to form a core point (a high-density region or cluster center). Let $P = \{p_1, \ldots, p_n\}$ be the set of points that have not yet been processed. The algorithm alternates between two steps, find and broadcast.
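The two alternating steps can be sketched as follows. This is a minimal illustration of the standard DBSCAN procedure, assuming Euclidean distance; reading "find" as locating an unprocessed core point and "broadcast" as spreading its cluster label to density-reachable points is an interpretation, and the names `dbscan`, `region_query`, `eps`, and `min_pts` are hypothetical.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]
NOISE = -1


def region_query(data: Sequence[Vector], i: int, eps: float) -> List[int]:
    """Indices of all points within `eps` of point i (including i itself)."""
    return [j for j, q in enumerate(data) if math.dist(data[i], q) <= eps]


def dbscan(data: Sequence[Vector], eps: float, min_pts: int) -> List[int]:
    """Return one label per point; NOISE (-1) marks points in no cluster."""
    labels: List[Optional[int]] = [None] * len(data)  # None = not yet processed
    cluster = -1
    for i in range(len(data)):
        if labels[i] is not None:
            continue
        # "Find": check whether the next unprocessed point is a core point.
        neighbors = region_query(data, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        # "Broadcast": spread the label to every density-reachable point.
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: labeled but not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(data, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point, keep expanding
    return [lab if lab is not None else NOISE for lab in labels]


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The number of clusters is not fixed in advance: it follows from the two parameters, and isolated points are reported as noise instead of being forced into a cluster.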

Distance calculation
Similarity reflects how alike two samples are. In clustering, distance is usually used to represent the dissimilarity between samples, and the distance between sample points in space can be calculated in different ways. Distance is inversely related to similarity, and zero distance means that the samples are identical. The distance measures considered here are the L1 norm (Manhattan distance, Eq. (2)) and the L2 norm (Euclidean distance, Eq. (3)).
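The two distance measures can be sketched as follows, using their standard definitions: the L1 distance sums the absolute coordinate differences, and the L2 distance is the square root of the summed squared differences. The function names are hypothetical, and both samples are assumed to have the same length.

```python
import math
from typing import Sequence


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


print(l1_distance([0.0, 0.0], [3.0, 4.0]))  # 7.0
print(l2_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

For two reduced time series of equal length, each series is treated as one vector, so either function yields a single dissimilarity value per pair of curves.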

Verification of dimension reduction methods on real power grid data
Based on the QS data of the 40,000 current nodes of State Grid dispatching, the DC voltage (VDC) and AC active power (PAC) data for North China are extracted and reduced with PAA and LTTB respectively. Two reduction scales, nearly 100-fold and nearly 200-fold, are compared and analyzed. In the experiment, the annual DC voltage and AC active power series have nearly 80,000 dimensions, and the LTTB and PAA algorithms reduce them to 365 and 1095 dimensions. The VDC results are as follows [4]; they are summarized in Table 1 (the smaller the average loss, the better) [5,6]. The experiment shows that the LTTB algorithm captures both the overall and the detailed characteristics of DC line voltage and power.

Verification of clustering methods on real power grid data
Based on more than 10,000 load series from the 40,000 nodes in North China, the hyperparameter selection strategy must be analyzed after the dimension reduction of the time series data and before clustering. In the clustering process, different clustering patterns are combined with different similarity measures, and the results are compared to evaluate each combination. 500 randomly sampled time series curves are used for the elbow method, the best K value for each combination is selected for the experiment, and the silhouette coefficient corresponding to the best cohesion is recorded. The silhouette coefficients (a value above 0.5 generally indicates good clustering) are shown in Table 2.
The table shows that DBSCAN + L2 performs better than the other combinations [7,8]. Taking 10,495 load curves as an example, time series curves with different change patterns (original load data, black) are separated, and characteristic curves (cluster centers, red) are extracted from the similar curves to represent their overall change. The effect of the DBSCAN + L2 pattern is shown in Figure 5.
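The silhouette coefficient used to compare the combinations can be sketched as follows. This is a minimal illustration of the standard definition with Euclidean (L2) distance: for each sample, a is the mean distance to the other members of its own cluster, b is the smallest mean distance to any other cluster, and the sample's score is (b - a) / max(a, b). The function name `silhouette_score` is hypothetical, and how noise points produced by DBSCAN were treated in the experiments is not stated, so this sketch simply treats every label as a cluster.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence

Vector = Sequence[float]


def silhouette_score(data: Sequence[Vector], labels: Sequence[int]) -> float:
    """Mean silhouette coefficient over all samples (range -1 to 1)."""
    clusters: Dict[int, List[int]] = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    if len(clusters) < 2:
        raise ValueError("at least two clusters are required")
    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # a singleton cluster contributes a score of 0
        # a: mean distance to the other members of the same cluster (cohesion).
        a = sum(math.dist(data[i], data[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(sum(math.dist(data[i], data[j]) for j in members) / len(members)
                for other, members in clusters.items() if other != label)
        denom = max(a, b)
        if denom > 0:
            total += (b - a) / denom
    return total / len(labels)


points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
print(silhouette_score(points, [0, 0, 1, 1]))  # well separated: close to 1
print(silhouette_score(points, [0, 1, 0, 1]))  # mixed clusters: negative
```

Computing this score for each candidate clustering of the sampled curves gives a single number per combination, which is how entries such as those in Table 2 can be compared against the 0.5 threshold.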

Conclusion
In recent years, with the gradual development of artificial intelligence technology in power grid simulation analysis, the demand for sample data has become increasingly urgent, which raises the question of how to use online data to produce sample data. To address the problem of extracting features from massive online data, this paper proposes a feature extraction method based on dimension reduction and clustering of time series data. Different dimension reduction and clustering methods are verified on power grid data, and preliminary results are obtained. In the future, the feature extraction method in this paper can be used to construct a virtual grid, and the structure of the virtual grid together with online data features can be used to generate massive simulation samples, providing data support for the application of artificial intelligence technology in power systems. Some shortcomings remain. The handling of missing data before dimension reduction and clustering still needs to be explored. As a next step, more clustering and dimension reduction methods can be applied to the different characteristics of online data.