Distributed Energy Grid-Connected Dense Data Forecasting Technology Based on Federated Learning

Photovoltaic power generation is currently one of the main clean energy generation technologies and plays an important role in daily production and life. However, photovoltaic systems are easily affected by various factors, and their output power can be unstable in practical operation, which reduces power generation efficiency. In this paper, a forecasting method for distributed energy grid-connected dense data based on federated learning is constructed. The method not only realizes short-term forecasting of distributed photovoltaic power generation data, but also allows the data to be modeled in encrypted form, thus addressing the “digital island” problem. The model evaluation shows that the method performs well in short-term photovoltaic power generation forecasting and can predict the short-term generation of different photovoltaic power stations with high accuracy. The method is of significance for improving the management and scheduling ability and the energy utilization rate of distributed photovoltaic power generation systems.

However, unlike traditional power generation technology, photovoltaic power generation is unstable and its output power is not controllable. Changes in the surrounding environment, such as light, temperature and sunshine duration, have a severe impact on the photovoltaic power generation system. At present, therefore, one of the main obstacles to the large-scale use of photovoltaic power generation is how to integrate it into the power grid with stable power.
In recent years, distributed photovoltaic power generation has become an important development trend. Some problems have also gradually come to light, such as unreasonable voltage when distributed photovoltaic generation is connected to a distribution network, potential problems of distributed photovoltaic islanding protection, and the conflict between the interests of distributed photovoltaic owners on the user side and the operation of the power grid.
Xiang [3] proposed introducing voltage regulation technology, in which a voltage regulator controls the electric power so as to reduce the voltage fluctuation range. Based on the running state of the power generation system, a voltage regulator was added in the energy storage stage to control the voltage fluctuation. At the same time, islanding protection technologies were studied in order to prevent a short-term loss of connection between the two devices during grid connection, which would leave the photovoltaic power grid unable to operate independently. If an unplanned islanding operation occurs, the influence of the power grid on electrical equipment and staff should be analyzed so that a feasible pre-control plan can be formulated.
Various factors, such as the economic development, the prosperity of industry, tourism and the climate of a region, have an impact on electric power generation. Analysis of power generation data in a region shows that there is a nonlinear relationship between load changes and the influencing factors, and that the power generation data itself also has a time series relationship. Therefore, the basic idea of power generation forecasting is to mine both the nonlinear relationship between power generation data and environmental data and the time series relationship within the power generation data itself.
Since the 1970s, research interest in power generation forecasting has been increasing at home and abroad. By the 1980s, China's demand for electric power was growing while the energy supply was becoming increasingly tight, and shortages of electric power supply occurred from time to time. Power generation forecasting therefore gradually became a necessary daily task of power companies. In the 1990s, with the deepening of power marketization and the rapid development of power-related technologies, various new forecasting methods emerged, which provided strong support for research on power generation forecasting in power systems. In recent years, with the development of mathematical models and artificial intelligence, many novel forecasting methods have appeared. They can be roughly divided into two categories: classical forecasting methods from mathematical statistics, such as regression analysis, trend extrapolation and time series methods, and newer artificial intelligence methods, such as the expert system method popular in the late 1980s and the artificial neural network method developed in the late 1990s [4]. Because the relationship between influencing factors and power generation is nonlinear, and intelligent algorithms such as artificial neural networks and support vector machines have good nonlinear fitting ability, intelligent algorithms account for a large proportion of domestic and foreign research on power generation forecasting.
At present, many scholars at home and abroad use machine learning and deep learning to study photovoltaic power generation forecasting. According to the time scale, current research can be divided into short-term, medium-term and long-term forecasting [5]. Short-term forecasting models need high precision so that they can work with conventional generating units to optimize power system scheduling and management. Long-term forecasting models generally need a large amount of data, and their time scale is generally one year or longer. Long-term forecasting is mainly used to evaluate the future economic benefit of photovoltaic power generation systems and to support site selection decisions for photovoltaic power stations [6]. According to the forecasting method, existing research can be divided into single-model research and hybrid-model research. Dou et al. [7] combined principal component analysis (PCA) with an Elman neural network and applied it to photovoltaic power generation forecasting. Li et al. [8] proposed a hybrid deep learning method based on a convolutional neural network (CNN) and a long short-term memory (LSTM) recurrent neural network. Compared with the back-propagation neural network (BPNN) and the radial basis function neural network (RBFNN), the mean absolute error (MAE) of this method is reduced by 0.852 and 0.613 respectively. Han et al. [9] considered the influence of seasons on power generation and constructed a photovoltaic power generation forecasting model based on the extreme learning machine (ELM). Zhou et al. [10] also used ELM to correct short-term photovoltaic power generation forecasts, which effectively reduced the short-term forecasting error.
To sum up, the research status of distributed photovoltaic power generation and power generation forecasting at home and abroad has been analyzed above. To address the problems of grid connection, islanding protection and power generation forecasting faced by distributed photovoltaic power generation, this paper constructs a forecasting method for distributed energy grid-connected dense data based on federated learning, so as to explore effective solutions to these problems.

The algorithm of this paper
Federated learning is a distributed machine learning technology. Its core idea is to build a global model on virtually fused data by training distributed models among multiple data sources with their local data, without exchanging local individual or sample data. This achieves a balance between shared computation on data and data privacy protection, namely a new application paradigm in which "data is available but invisible" and "the data stays put while the model moves" [11].
In this paper, the XGBoost algorithm and a trusted XGBoost algorithm based on homomorphic encryption are used to integrate multiple weak learners into one strong learner, with multiple trees making decisions together to improve the effect of the whole model. The joint modeling is completed without revealing the data of either party, so that both computing performance and model performance are taken into account.

XGBoost algorithm principle
XGBoost is an efficient, parallel machine learning algorithm based on ensemble learning, proposed by Chen and Guestrin [12] in 2016. Its basic idea is to take the second-order Taylor expansion of the objective function, use the second-derivative information to train the tree model, and add the complexity of the tree model to the optimization objective as a regularization term, so that the learned model generalizes better. XGBoost fits the real data by training on residuals and performs efficient calculation based on gradient histograms, which allows boosting trees to be computed in parallel at very large scale and makes it one of the fastest and most widely used open-source boosting tree frameworks.
XGBoost is a tree boosting model that integrates multiple weak learners into one strong learner. A tree represents a function, and adding a tree is equivalent to learning a new function without changing the original model: the new function fits the error between the predicted value and the real value of the previous trees, and the error is reduced by continuous iteration. After training, the number of trees is $t$. When the score of a sample is predicted, the corresponding leaf node on each tree is found according to the features of the sample, and the predicted value is obtained by adding the scores of these leaf nodes. The specific principle is as follows. The output for the $i$-th sample in the initial treeless model is

$$\hat{y}_i^{(0)} = 0.$$

Every time a tree is added, a new function is learned without changing the original model. Its output is

$$\hat{y}_i^{(t)} = \sum_{k=1}^{t} f_k(x_i) = \hat{y}_i^{(t-1)} + f_t(x_i).$$

The model is trained by minimizing a loss $l$ between the real and predicted values together with a regularization function $\Omega$, which defines the complexity of the model. The smaller the regularization function, the lower the complexity of the model, the stronger its generalization ability, and the less likely it is to over-fit. The two functions are as follows:

$$f_t(x) = w_{q(x)}, \qquad \Omega(f_t) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2,$$

where $q(x)$ is the index of the leaf node to which sample $x$ is assigned, and $T$, $\gamma$, $\lambda$ and $w_j$ respectively represent the number of leaf nodes in the tree, the penalty coefficient of $T$, the penalty coefficient of L2 regularization, and the weight of leaf node $j$.
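The additive idea described above can be illustrated with a minimal sketch. It is not the XGBoost implementation: it assumes a single feature, uses one-split trees (stumps), and fits plain residuals, which corresponds to squared-error loss without the second-order and regularization terms. The function names and the default `learning_rate` are hypothetical choices for illustration.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a one-split regression tree (a stump) to the current residuals.

    Assumes xs contains at least two distinct values.
    """
    best = None
    # The largest value is skipped because it would leave the right side empty.
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_w, right_w = mean(left), mean(right)  # leaf scores
        sse = (sum((r - left_w) ** 2 for r in left)
               + sum((r - right_w) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_w, right_w)
    return best[1:]


def stump_score(stump, x):
    """Return the score of the leaf that x falls into."""
    threshold, left_w, right_w = stump
    return left_w if x <= threshold else right_w


def boost(xs, ys, n_trees, learning_rate=0.5):
    """Add trees one by one; each new tree fits the error left by the others."""
    preds = [0.0] * len(xs)  # treeless model: every prediction starts at 0
    trees = []
    for _ in range(n_trees):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        trees.append(stump)
        preds = [p + learning_rate * stump_score(stump, x)
                 for p, x in zip(preds, xs)]
    return trees


def predict(trees, x, learning_rate=0.5):
    """The prediction is the sum of the leaf scores over all trees."""
    return sum(learning_rate * stump_score(tree, x) for tree in trees)


def training_mse(trees, xs, ys):
    """Mean squared error of the boosted model on the training data."""
    return mean((y - predict(trees, x)) ** 2 for x, y in zip(xs, ys))
```

Calling `boost` with a larger `n_trees` on the same data should give a training error no larger than with fewer trees, which mirrors the iterative error reduction described in the text.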
The objective function is:

$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega(f_t) + C_0,$$

where $l$ is the loss function, $y_i$ is the real value of the $i$-th sample, $\hat{y}_i^{(t-1)}$ is its prediction after $t-1$ trees, $f_t$ is the $t$-th tree, $\Omega$ is the regularization function, and $C_0$ is a constant term. In this paper, the XGBoost algorithm is used for practical training. When the $t$-th tree is built, XGBoost uses a greedy method to split the tree nodes: when a node is split, there are many candidate split points. The general steps to find the best split point are as follows:
(1) First, all possible values of each feature at the node are traversed;
(2) The values of each feature are sorted to form the candidate split points;
(3) A linear scan is then used to find the best split value of each feature as a candidate;
(4) Finally, among the best split points of all features, the overall best split point is chosen, that is, the feature and value with the largest gain after splitting.
In this method, all candidate split points must be traversed in a global scan. When the amount of data is very large and its distribution is very scattered, the efficiency of the greedy algorithm becomes extremely low and it is basically unusable.
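The greedy search can be sketched as follows. The gain expression is the one given in the XGBoost paper cited as [12], $\frac{1}{2}\left[\frac{G_L^2}{H_L+\lambda} + \frac{G_R^2}{H_R+\lambda} - \frac{(G_L+G_R)^2}{H_L+H_R+\lambda}\right] - \gamma$, where $G$ and $H$ are sums of first and second derivatives of the loss; it is not written out in this paper, so its use here is an assumption. For the squared-error loss $\frac{1}{2}(y - \hat{y})^2$ the derivatives are $g_i = \hat{y}_i - y_i$ and $h_i = 1$. The function names are hypothetical.

```python
def split_gain(GL, HL, GR, HR, lam, gamma):
    """Gain of splitting a node into left/right children (XGBoost paper form)."""
    def score(G, H):
        return G * G / (H + lam)
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR)) - gamma


def best_split(X, g, h, lam=1.0, gamma=0.0):
    """Exact greedy search: sort every feature and scan every split position.

    X is a list of rows; g and h hold the per-sample first and second
    derivatives of the loss. Returns (gain, feature_index, threshold);
    feature_index is None if no split has positive gain.
    """
    G, H = sum(g), sum(h)
    best = (0.0, None, None)
    for j in range(len(X[0])):
        order = sorted(range(len(X)), key=lambda i: X[i][j])  # step (2)
        GL = HL = 0.0
        for pos in range(len(order) - 1):                     # step (3)
            i, nxt = order[pos], order[pos + 1]
            GL += g[i]
            HL += h[i]
            if X[i][j] == X[nxt][j]:
                continue  # cannot split between two equal feature values
            gain = split_gain(GL, HL, G - GL, H - HL, lam, gamma)
            if gain > best[0]:                                # step (4)
                best = (gain, j, (X[i][j] + X[nxt][j]) / 2)
    return best
```

The cost of this scan grows with the number of samples times the number of features at every node, which is why it becomes impractical on large, scattered data.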
Based on this, XGBoost proposed a series of schemes, built on the weighted quantile sketch algorithm, to speed up the search for the best split point:
(1) Pre-sorting of features and caching: Before training, XGBoost sorts each feature by its values in advance and saves the sorted result in memory as a block structure for the quantile sketch, which is reused repeatedly later to reduce redundant operations.
(2) Quantile method (histogram): After each feature is sorted by its values, only a limited number of feature values are selected by the quantile method (histogram) as representative split points of the feature, and the best split is sought among these candidates.
(3) Parallel search: Note that the trees are built in series, and only the features are processed in parallel. Because each feature has been stored in a block structure in advance, XGBoost uses a pool of multiple threads to calculate the best split point of each feature in parallel. This not only greatly improves the splitting speed of nodes, but also makes it much easier to scale to large training sets and better supports distributed architectures.
The use of the above methods in this paper can overcome the problem of low efficiency in the case of a large amount of data and scattered data distribution, and greatly improve the efficiency of finding the best split point.
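The quantile candidates and the per-feature parallel search described above can be sketched together. This is an illustration under simplifying assumptions, not XGBoost's implementation: candidates are taken at evenly spaced ranks (XGBoost weights the ranks by the second derivative, which is equivalent when every $h_i = 1$), the gain formula is the one from the XGBoost paper [12] with $\gamma = 0$, and all function names are hypothetical. In CPython, threads do not speed up pure-Python arithmetic, so the thread pool only shows the structure of the computation.

```python
from bisect import bisect_right
from concurrent.futures import ThreadPoolExecutor


def quantile_candidates(values, n_bins):
    """Pick up to n_bins - 1 candidate thresholds at evenly spaced ranks."""
    ordered = sorted(values)
    candidates = []
    for k in range(1, n_bins):
        c = ordered[(k * len(ordered)) // n_bins]
        if not candidates or c != candidates[-1]:
            candidates.append(c)
    return candidates


def best_split_for_feature(column, g, h, n_bins=4, lam=1.0):
    """Histogram-based search on one feature: returns (gain, threshold)."""
    candidates = quantile_candidates(column, n_bins)
    G_hist = [0.0] * (len(candidates) + 1)
    H_hist = [0.0] * (len(candidates) + 1)
    for v, gi, hi in zip(column, g, h):
        b = bisect_right(candidates, v)  # bucket b: candidates[b-1] <= v < candidates[b]
        G_hist[b] += gi
        H_hist[b] += hi
    G, H = sum(g), sum(h)
    best_gain, best_threshold = 0.0, None
    GL = HL = 0.0
    for b, threshold in enumerate(candidates):
        GL += G_hist[b]  # left child now holds every value < threshold
        HL += H_hist[b]
        gain = 0.5 * (GL ** 2 / (HL + lam)
                      + (G - GL) ** 2 / (H - HL + lam)
                      - G ** 2 / (H + lam))
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_gain, best_threshold


def best_split_parallel(X, g, h, n_bins=4, lam=1.0, max_workers=4):
    """Search all features concurrently; returns (feature, threshold, gain)."""
    columns = list(zip(*X))  # one tuple of values per feature
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(
            lambda col: best_split_for_feature(col, g, h, n_bins, lam), columns))
    feature = max(range(len(results)), key=lambda j: results[j][0])
    gain, threshold = results[feature]
    return feature, threshold, gain
```

Only `n_bins - 1` thresholds are evaluated per feature instead of every distinct value, which is the source of the speed-up on large data sets.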

Trusted XGBoost algorithm based on homomorphic encryption
The essence of federated learning is a distributed machine learning technology or framework based on data privacy protection. Its goal is to realize joint modeling without loss of model quality while ensuring data privacy, security and legal compliance, thereby improving the effect of the AI model and supporting business applications. In terms of modeling, the approaches that have become well known in industry in recent years can be roughly divided into gradient boosting decision trees (GBDT) and neural networks. Because federated learning must protect the privacy of users' features and labels, privacy-preserving computation methods such as homomorphic encryption, secret sharing and differential privacy are needed to ensure security. This brings a big challenge: the complex operations of neural networks, such as exponentials and logarithms, are very difficult to carry out under current hardware and software encryption technology. SecureBoost, by contrast, needs only simple homomorphic operations and achieves the same modeling effect as XGBoost. The SecureBoost algorithm used in this paper is based on XGBoost and uses privacy-preserving computation to operate on encrypted data, so as to achieve joint modeling with privacy protection and to balance computing performance and model performance.
In the vertical federated learning scenario studied in this paper, parallel training of the distributed model in the XGBoost manner may also face the risk of feature and label leakage. SecureBoost aims to protect the features of both data parties from being leaked and to protect the labels of the label-holding party, completing joint modeling without revealing the data of either side. Many privacy-preserving computation methods are available, including homomorphic encryption, secret sharing, garbled circuits and oblivious transfer, and different methods suit different scenarios.
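The "simple homomorphic operation" mentioned above can be illustrated with a toy additively homomorphic scheme. The sketch below uses textbook Paillier encryption with very small primes; the choice of Paillier, the primes, the fixed-point `SCALE` and the function names are all assumptions for illustration, and the code is not secure or representative of any production library. It shows the property that tree building relies on: one party can add encrypted gradient values without seeing them, and only the key holder can decrypt the sum.

```python
import math
import random

SCALE = 1000  # fixed-point scale for encoding real-valued gradients (illustrative)


def keygen(p=293, q=433):
    """Toy Paillier key pair; real deployments use far larger primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # valid because the generator is fixed to n + 1
    return n, (lam, mu)


def encrypt(n, value):
    """Encrypt a real number under public key n."""
    m = round(value * SCALE) % n  # negative values wrap around modulo n
    while True:
        r = random.randrange(1, n)
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n * n) * pow(r, n, n * n)) % (n * n)


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, c):
    """Decrypt and decode back to a real number."""
    lam, mu = private_key
    m = ((pow(c, lam, n * n) - 1) // n) * mu % n
    if m > n // 2:  # upper half of the range encodes negative values
        m -= n
    return m / SCALE
```

With these toy parameters the encoded sum must stay below `n // 2` in absolute value. In a SecureBoost-style protocol, the label holder would encrypt per-sample gradients, the other party would sum the ciphertexts that fall on each side of a candidate split, and the label holder would decrypt only the sums.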

Experimental dataset
In this experiment, fine-grained short-term power generation forecasting is conducted on power generation data from Shandong. The data set contains a total of 590 records of real-time power generation data collected from power supply terminals by power supply companies in Jinan City, Shandong Province. The records cover May 1, 2021, to May 31, 2021, at a time granularity of one hour. The data includes all historical data and power generation status data about photovoltaic power generation. The data fields are shown in Table 1. In order to better evaluate the experimental results, the data set is divided into a training set and a testing set at a ratio of 8:2. The training set is used to build the federated model, and the testing set is used to test the training results of the model. The experimental environment is a CentOS 7.6 operating system with a 2-core CPU and 8 GB of storage space. All code and models are run in a Linux environment.
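The 8:2 division can be sketched as below. The paper does not state whether the split is random or chronological; the sketch assumes a chronological split, which is a common choice for time series because it keeps later records out of the training set. The function name is hypothetical.

```python
def chronological_split(records, train_parts=8, total_parts=10):
    """Split time-ordered records into a leading training part and a trailing test part."""
    n_train = len(records) * train_parts // total_parts  # integer arithmetic avoids rounding issues
    return records[:n_train], records[n_train:]
```

For 590 records this yields 472 training records and 118 testing records.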

Experimental process
Because the acquired data is coarse-grained and contains missing and abnormal values, and because future power generation must be modeled and predicted from historical power generation data, a series of data analysis and preprocessing operations are required before model training. The technical roadmap of this experiment is shown in Figure 1.
In this study, the power generation data is analyzed by federated forecasting. Power generation data is typical time series data: adjacent values are closely related in time order, and the data as a whole shows periodic changes. Therefore, the federated learning model can forecast power generation based on three-phase voltage, three-phase current, weather data and the time series of power generation itself, which has autoregressive characteristics. Single-step or multi-step forecasting can be used according to actual business needs, and different sliding time windows can be used to adapt to different business scenarios. This experiment adopts the SecureBoost algorithm in the FATE framework to forecast future power generation. In the data preprocessing stage, Python is used to handle the missing and abnormal values of the power generation data and to fill in the missing values. Because the power generation data is only one-dimensional, it must be converted into multi-dimensional data for the time series forecasting model in order to realize multi-step forecasting. Because this experiment adopts the vertical federated learning framework, the data set must also be split by features into two data sets for simulation, and an ID field must be added to each data set as the basis for encrypted alignment. Because SecureBoost is a tree model whose splits depend only on the ordering of feature values, there is no need to normalize the data. The preprocessed power generation data is input into the SecureBoost model for training, and the model is tuned through its learning rate, the number of leaf nodes and the number of trees. Finally, the visualization results of the model training are obtained.
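The preprocessing steps described above can be sketched in plain Python. The paper does not specify how missing values are filled, how long the sliding window is, or which fields go to which party, so forward filling, the `n_lags` and `horizon` parameters, and all field names in the usage note are assumptions for illustration.

```python
def forward_fill(series):
    """Replace None entries with the most recent observed value.

    A leading None has no earlier value and is left unchanged.
    """
    filled, last = [], None
    for v in series:
        if v is None:
            v = last
        filled.append(v)
        last = v
    return filled


def make_windows(series, n_lags, horizon=1):
    """Turn a 1-D series into (lag features, target) pairs.

    Each row uses the n_lags values before time t as features and the value
    horizon steps ahead (t + horizon - 1) as the regression target.
    """
    rows = []
    for t in range(n_lags, len(series) - horizon + 1):
        rows.append((series[t - n_lags:t], series[t + horizon - 1]))
    return rows


def vertical_split(rows, label_party_fields, other_party_fields):
    """Split each record by feature between two parties, sharing only an id."""
    party_a, party_b = [], []
    for sample_id, row in enumerate(rows):
        party_a.append({"id": sample_id, **{k: row[k] for k in label_party_fields}})
        party_b.append({"id": sample_id, **{k: row[k] for k in other_party_fields}})
    return party_a, party_b
```

For example, `make_windows([1, 2, 3, 4, 5, 6], n_lags=3)` pairs `[1, 2, 3]` with `4`, `[2, 3, 4]` with `5` and `[3, 4, 5]` with `6`. A record such as `{"lag_1": 1.0, "temperature": 20.0, "y": 2.0}` could then be split so that one party holds `lag_1` and the label `y` while the other holds `temperature`, with the shared `id` used for alignment.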

Feature importance
After the SecureBoost modeling is completed, the model calculates the contribution of each feature, that is, the feature importance, through the tree model according to the marginal effect of the features, as shown in Figure 3. As can be seen from the figure, the state features of the power generation equipment are very important and play a key role in power generation forecasting. The three-phase voltage has the greatest influence among them: the feature importance of the phase C voltage is 165, and that of the phase B voltage is 142. The influence of the three-phase current data and the active power data on the forecast is not significant. In addition, regarding the time series relationship of the power generation data, the value closest in time to the forecast target has the highest feature importance, 480. This is because, for each time series, the adjacent power generation values reflect the overall trend of the series. The experiment therefore indicates that, for each time series, the power generation of adjacent time points strongly affects the forecast of subsequent power generation. At the same time, in order to explore the influence of weather features on photovoltaic power generation data, this paper also uses a comparative data set in which weather data for the same time dimension is added, and obtains the feature importance by training the model, as shown in Figure 4.
The comparative experiment shows that, when weather data is added, the overall ranking of feature importance in the model changes. The newly added weather data, namely, surface temperature, surface temperature and relative humidity, rank at the forefront in feature importance, with values of 128, 120 and 114 respectively. The importance of the power generation state features is lower than that of the weather features, which indicates that weather features have a positive effect on power generation forecasting and are more important than the power generation state features.

Model optimization
The trained prediction model is evaluated by the evaluation component, and the test set is used to output the evaluation results of the model. The common regression indexes MAE, MSE and Root_MSE are selected for the evaluation. MAE is the mean absolute error, that is, the average of the absolute errors between the predicted values and the real values. The smaller the value, the better. The formula is as follows:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|,$$

where $n$ is the number of samples, $y_i$ is the real value and $\hat{y}_i$ is the predicted value.
MSE is the mean square error, that is, the average of the squared errors between the predicted values and the real values. The formula is as follows:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2.$$

Root_MSE is the root mean square error, which is the square root of MSE:

$$\mathrm{Root\_MSE} = \sqrt{\mathrm{MSE}}.$$

The smaller the value, the better. In an ablation experiment based on the above three indexes, the models trained on the two contrasting data sets were evaluated on the test set. The evaluation results are shown in Table 2. In the table, Eval indicates the data without weather features, and Eval_wt indicates the data with weather features. Comparing the evaluation results with and without weather features shows clearly that, when weather features are added to the data set, the indexes of the regression model are better and the errors of the model evaluation are reduced. Therefore, the follow-up experiments continue to use the data with weather features to optimize the model and to verify the influence of the learning rate and the number of decision trees on the training results.
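The three evaluation indexes described above can be computed with a few lines of Python. This is a minimal sketch using the standard definitions; the function names are hypothetical and are not those of the evaluation component used in the experiment.

```python
import math


def mae(y_true, y_pred):
    """Mean absolute error: average of |real - predicted|."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean square error: average of (real - predicted) squared."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def root_mse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))
```

For real values `[3, 5, 2]` and predictions `[2, 5, 4]`, the absolute errors are 1, 0 and 2, so the MAE is 1.0 and the MSE is 5/3.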
Based on the above three indexes, this paper verifies through experiments the influence of different learning rates on the training of the time series model. The experimental data set containing weather features was used for training with learning rates of 0.2, 0.3 and 0.5, and the test set was evaluated. The evaluation results are shown in Table 3. The results show that, as the learning rate increases, the evaluation results of the three models improve steadily, and the best results are achieved when the learning rate is 0.5. This paper also verifies through experiments the influence of different numbers of trees on the training of the time series model. The experimental data set containing weather features was used for training with 3, 5 and 10 trees, and the test set was evaluated. The evaluation results are shown in Table 4. The results show that, as the number of trees increases, the evaluation results of the three models improve steadily, and the best results are achieved when the number of trees is 10.

Summary and discussion
To address the "digital island" problem that different power supply companies face when building power generation forecasting models, this paper realizes accurate short-term forecasting of distributed photovoltaic power generation data based on a vertical federated learning model, and at the same time realizes the encrypted exchange of power generation data. Based on this research and analysis, the following conclusions are drawn:
1) Combining the characteristics of short-term photovoltaic power generation with the analysis of the model's feature importance shows that the three-phase voltage is very important for power generation forecasting, while the three-phase current and active power data have only a weak influence. Forecasting based on these findings can therefore reduce the number of parameters to be considered and improve forecasting efficiency. At the same time, within the power generation time series itself, the data at both ends of the series are also relatively important: they largely determine the trend of the power generation time series and thus directly affect the forecast.
2) Establishing a multi-step forecasting model of photovoltaic power generation based on SecureBoost can reduce the data error of the subsequent forecasting model and improve the accuracy of the overall forecast. The photovoltaic forecasting model constructed in this paper transforms the data structure and improves the flexibility of the model by converting the original power generation data into multi-step data on which regression can be performed directly. The model has a simple and intuitive structure and efficient data input and output, requires no data normalization, and tolerates complex data with large cardinality well.
3) The verification results show that the prediction method in this paper has a high accuracy, which is consistent with the actual situation and data trend changes, and its overall algorithm performance is excellent and practical.

Figure 2. Flow chart of SecureBoost model training.

Figure 3. Importance of power generation state characteristics.

Figure 4. Importance of weather features.
In the tree function $f(x) = w_{q(x)}$, $w$ is the weight of leaf nodes; $q(x)$ represents the output leaf node number.

Table 1. Description of Power Generation Data Set Fields.

Table 2. Evaluation results of the comparative data sets.

Table 3. Evaluation results of different learning rates.

Table 4. Evaluation results of different tree numbers.