A Method for Improving the Data Availability of Distributed Photovoltaic Power Plants

Distributed photovoltaics are growing rapidly, and problems have emerged such as low-voltage distributed photovoltaics feeding power back to higher voltage levels for short periods, voltage limit violations, and inadequate operation and maintenance. Poor data analysis affects both the operation of the power grid and the income of the power station. This article analyses the reasons for the limited data quality of existing distributed photovoltaic monitoring platforms and proposes a method to improve the data availability of distributed photovoltaic power plants. By identifying and repairing abnormal data, the data can be applied efficiently. In a case study of 25,535 distributed photovoltaic power stations, the method identifies 6 types of abnormal data and completes their correction. After correction, comparison with data from nearby power stations gives a root mean square deviation of 8.25%.


Introduction
By the end of 2022, China had added 87.4 GW of new photovoltaic power stations to the grid, including 51.11 GW of distributed photovoltaic power stations and 36.29 GW of centralized photovoltaic power stations. The newly added capacity of distributed photovoltaic power stations has thus exceeded that of centralized photovoltaic power stations. Against the background of the "dual carbon" goals and the construction of new power systems, distributed photovoltaics continue to develop on a large scale. However, because distributed photovoltaic power stations are small and geographically dispersed, most of them have a low level of information technology and low data collection quality, which hinders information mining and deeper application of the data. As a result, the data cannot meet the needs of distributed photovoltaic information monitoring, output prediction, operation management, and operation and maintenance management.
Research on this topic is limited both domestically and internationally. Existing studies mostly propose data identification methods tailored to specific needs such as power prediction. In [1], it is proposed that improving the accuracy of power prediction requires strengthening data quality control, deepening research on data pre-processing, and exploiting photovoltaic data characteristics such as spatial correlation to eliminate bad data. In [2], data anomaly testing is divided into data rationality testing, constant value testing, missing data testing, and state testing, which identify abnormal data in terms of limit violations, dead (constant) values, missing values, and cross-state checks. Research on distributed photovoltaic data correction is also limited. In [3], a weather state pattern recognition model based on SVM is established to identify the weather type of historical data with missing weather type information, but little evidence is given for the feature space used for classification. In [4][5][6], the theory of spatial region similarity and correlation is adopted, and data from correlated reference photovoltaic power plants are used to correct and repair missing data of the target photovoltaic power plant.

Data identification and correction system
We establish a distributed photovoltaic data identification and correction method, which forms a standardized process for dispersed core master data and realizes data identification and correction. It enhances the value that can be mined from distributed photovoltaic data.

Data problem analysis
According to [7][8][9], the main static data problems are geographic coordinate deviations, misreported installed capacity, and missing data. These are mainly resolved through manual checks or semi-automatic methods and are not elaborated here. This article focuses on the identification and correction of dynamic data anomalies, which arise for the following reasons.
(1) Collection device issues. As noted in [10], incorrect parameter size and category number settings, equipment failures, electromagnetic noise generated by inverters, and inadequate operation and maintenance often cause abnormal data such as missing values, dead (constant) values, sudden changes, and limit violations.
(2) Communication network issues. Distributed photovoltaic installations are numerous and widely distributed; most use wireless or carrier communication, and few use optical fiber. If the collection frequency is high, such as minute-level collection, a large amount of data flows to the collection platform in parallel at the same time. Because of limited network bandwidth, abnormal communication equipment, and external interference, wireless transmission is unstable. Data interruption and interference occur frequently, resulting in abnormal data such as missing values, dead values, and mutations.
(3) Platform issues. There are many manufacturers of information collection equipment for distributed photovoltaic power plants, with varying information collection capabilities, and many protocols need to be adapted for access. During data access, parsing, storage, and processing, the platform may suffer from failed data access, parsing and storage errors, logic design flaws in the code, platform instability, and insufficient queue resources, which produce abnormal data such as missing data, limit violations, and logical anomalies.

Basic framework
The basic framework for data identification and correction is shown in Figure 1.

Abnormal identification
(1) Integrity identification. Integrity identification checks the completeness of the data, including the structure of the original data and the length of the data record. It judges whether the number of records in the original data equals the expected number and whether the data record covers the expected start and end times. Collected values that are empty, "NULL", "NONE", etc. are identified and labelled accordingly.
(2) Extreme value identification. Extreme value identification judges whether the electrical data of distributed photovoltaic power generation lie within the range of electrical operating parameters and whether the meteorological resource data lie within the reasonable range of standard climatological parameters. Examples are voltage, current, and output data that exceed their limits, and meteorological data that fall outside reasonable ranges.
(3) Mutation identification. Mutation identification judges whether the electrical data of photovoltaic power generation conform to the laws of electrical operation and whether the meteorological resource data conform to the statistical laws of climatology. Examples are short-term changes in output data that exceed the limit and short-term changes in meteorological resource data that exceed the limit.
(4) Logical anomaly identification. Logical anomaly identification is the mutual verification of data with correlated characteristics. Examples are output and irradiance that trend in opposite directions, and output data of nearby locations in the same meteorological zone that trend in opposite directions.
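The first three checks above can be sketched in Python as follows. This is a minimal illustration, not the platform's implementation: the label names, the function names, and the parameters `lower`, `upper`, and `max_step` are all assumptions, and the limits would have to come from the station's electrical parameters. Logical anomaly identification needs a second, correlated series and is not covered here.

```python
import math

# Hypothetical label names, chosen only for this sketch.
OK, MISSING, EXTREME, MUTATION = "ok", "missing", "extreme", "mutation"


def is_missing(v):
    """True for None, NaN, or the empty / "NULL" / "NONE" markers."""
    if v is None:
        return True
    if isinstance(v, str):
        return v.strip().upper() in {"", "NULL", "NONE"}
    return isinstance(v, float) and math.isnan(v)


def is_complete(values, expected_count):
    """Integrity check on record length: did we receive the expected number of samples?"""
    return len(values) == expected_count


def label_series(values, lower, upper, max_step):
    """Label every sample of one series.

    lower, upper: assumed limits, e.g. 0 and the installed capacity.
    max_step: assumed largest plausible change between two accepted samples.
    """
    labels = []
    prev = None  # last sample that passed all checks
    for v in values:
        if is_missing(v):
            labels.append(MISSING)
            continue
        x = float(v)
        if x < lower or x > upper:
            labels.append(EXTREME)
            continue
        # Design choice: compare with the last accepted sample, so a single
        # spike does not also flag the normal sample that follows it.
        if prev is not None and abs(x - prev) > max_step:
            labels.append(MUTATION)
            continue
        labels.append(OK)
        prev = x
    return labels
```

Comparing each sample with the last accepted one keeps a single spike from contaminating its neighbour, at the cost of flagging every sample after a genuine step change; a production check would also account for the time elapsed across missing samples.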

Data correction
(1) Accuracy calculation. We calculate the accuracy of each type of data using the anomaly identification method described above (Section 3.2.1).
(2) Self-correction. For cases where abnormal data is less than 30% of the sample size during the statistical period, the linear interpolation correction method is used. Let $x_t, \dots, x_{t+i}, \dots, x_{t+n}$ be a consecutive time series ($i = 1, 2, \dots, n-1$), where $x_{t+i}$ is an unknown value and $x_t$ and $x_{t+n}$ are known values. The unknown values are filled in by interpolating between $x_t$ and $x_{t+n}$:
$$x_{t+i} = \frac{x_{t+n} - x_t}{n+1} \times i + x_t$$
(3) Mutual correction. For cases where abnormal data exceeds 30% of the sample size during the statistical period, the reference data with the highest accuracy ranking is used to replace the abnormal data, using the time series difference correction method. Let $x$ be the reference data, with $N$ time series values, and $y$ the data to be corrected, with $n$ time series values, where $n < N$ and the $n$ time steps are contained in the $N$ time steps. The data to be corrected can then be extended to $N$ time steps:
$$y_N = \bar{y}_n + r \, \frac{e_x}{e_y} \, (x_N - \bar{x}_n)$$
where $y_N$ is the corrected value of the data to be corrected; $x_N$ is the element value of the reference data in the $N$ time series; $\bar{x}_n$ and $\bar{y}_n$ are the average values of the $n$ time series elements of the reference data and the data to be corrected, respectively; $e_x$ and $e_y$ are the standard deviations of the element values over the $n$ time series for the reference data and the data to be corrected, respectively; and $r$ is the correlation coefficient over the $n$ time series between the reference data and the data to be corrected. The correlation coefficient is calculated as follows:
$$r = \frac{\sum_{i=1}^{n} (a_i - \bar{a})(b_i - \bar{b})}{\sqrt{\sum_{i=1}^{n} (a_i - \bar{a})^2 \, \sum_{i=1}^{n} (b_i - \bar{b})^2}}$$
where $r$ is the correlation coefficient; $a_i$ and $b_i$ are the $i$-th values of sequences $a$ and $b$, respectively; $\bar{a}$ and $\bar{b}$ are the arithmetic means of sequences $a$ and $b$, respectively; and $a$ and $b$ can be the reference data and the data to be corrected.
The reference data should be selected in priority order according to the following principles: the reference station should be located nearby, no more than 9 km away, and the correlation should pass the significance test at the 0.05 level.
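The selection of a reference station can be sketched in Python as follows, under several assumptions that are not in the text: stations are plain dictionaries with hypothetical keys `lat`, `lon`, and `series`; distance is the great-circle distance on a sphere of radius 6371 km; the series are already aligned on the common time steps; and candidates within 9 km are ranked by correlation coefficient, whereas the text ranks by accuracy. The significance test is left out of this sketch.

```python
import math


def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    var_a = sum((ai - mean_a) ** 2 for ai in a)
    var_b = sum((bi - mean_b) ** 2 for bi in b)
    if var_a == 0 or var_b == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_a * var_b)


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km (assumed Earth radius: 6371 km)."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    h = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))


def pick_reference(target, candidates, max_km=9.0):
    """Return (station, r) for the best-correlated candidate within max_km.

    Returns (None, None) if no candidate is close enough.
    Ranking by r is an assumption of this sketch.
    """
    best, best_r = None, None
    for c in candidates:
        d = haversine_km(target["lat"], target["lon"], c["lat"], c["lon"])
        if d > max_km:
            continue  # violates the distance principle
        r = pearson_r(target["series"], c["series"])
        if best_r is None or r > best_r:
            best, best_r = c, r
    return best, best_r
```

To apply the 0.05 significance criterion as well, one option is `scipy.stats.pearsonr`, which returns a p-value alongside the coefficient; candidates with a p-value of 0.05 or more would then be discarded before ranking.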

Data label
Both data anomalies and data corrections need to be labelled, and the original and corrected data both need to be stored. The specific design is shown in Table 1.

Example calculation
We take a distributed photovoltaic monitoring platform as an example, analyze the output data of 25,535 distributed photovoltaic power stations on a certain day, identify the various abnormal situations between sunrise and sunset, and handle them accordingly.
Station A shows the output data under normal conditions, with good sunlight on that day and normal power generation output. The missing rate of the station B data is as high as 57%, and the data repair is completed using the mutual correction method. The missing rate of the station C data is 14%, and the self-correction method is used to complete the data repair. The abnormal rates of the station D and station G data are 7%, caused by data mutation and extreme values exceeding limits, respectively, and the self-correction method is used to complete the data repair. Station E and station F have high data anomaly rates caused by logical anomalies. Station H and station I, with correlation coefficients of 95.2% and 96.1%, are used as references for mutual correction to complete the data repair. The root mean square deviation before and after correction is 8.25%. The data before and after correction are detailed in Table 2 and Table 3.
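The choice between the two correction methods for these stations, and the self-correction step itself, can be sketched in Python as follows. The function names are hypothetical. Two assumptions are made: a rate of exactly 30% is sent to mutual correction, since the text only covers rates below and above 30%; and the interpolation divides by the index distance between the two valid neighbours so that both known endpoints are reproduced, which is an assumption about the intended indexing of the interpolation formula.

```python
def choose_method(abnormal_rate, threshold=0.30):
    """Pick the correction method from the abnormal-data rate of one station."""
    return "self" if abnormal_rate < threshold else "mutual"


def self_correct(values, bad):
    """Fill flagged samples by linear interpolation between valid neighbours.

    values: list of samples; entries at flagged positions are ignored.
    bad: list of bools, True where the sample was flagged as abnormal.
    A gap that touches the start or end of the series has no valid
    neighbour on one side and is left unchanged.
    """
    out = list(values)
    n = len(out)
    i = 0
    while i < n:
        if not bad[i]:
            i += 1
            continue
        start = i  # first flagged index of this gap
        while i < n and bad[i]:
            i += 1
        end = i  # first valid index after the gap, or n
        if start == 0 or end == n:
            continue
        left, right = out[start - 1], out[end]
        span = end - (start - 1)  # index distance between the valid neighbours
        for k in range(start, end):
            out[k] = left + (right - left) * (k - (start - 1)) / span
    return out


if __name__ == "__main__":
    # Abnormal rates quoted in the example for stations B, C and D.
    for station, rate in {"B": 0.57, "C": 0.14, "D": 0.07}.items():
        print(station, choose_method(rate))
```

With the rates quoted above, station B is routed to mutual correction and stations C and D to self-correction, matching the handling described in the example.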

Conclusions
The research in this article provides an engineering foundation for building digital platforms that support the high-proportion grid connection of future distributed photovoltaic power plants. Methods are proposed both for identifying data anomalies and for correcting them. The work provides a basis for further analysis of the causes of data anomalies and for establishing an anomaly knowledge base in engineering practice. The anomaly identification and correction methods presented in this article are simple and easy to use. In the future, more artificial intelligence methods can be integrated to improve the accuracy of identification and correction.

Figure 1. Basic framework for data identification and correction

Table 1. Data identification design

Table 2. Data before correction

Table 3. Data after correction