Research on distributed photovoltaic power prediction method based on reinforcement learning

The distributed photovoltaic power prediction is of great significance for reasonably adjusting and optimizing the power generation plan in the dispatching system. When using machine learning algorithms to construct photovoltaic power prediction models, there are usually problems such as poor prediction accuracy under default hyperparameters and high cost of human experience parameter adjustment experiments. Therefore, a reinforcement learning algorithm based on the long short-term memory network model (SARSA-LSTM) is proposed. Firstly, the reinforcement learning SARSA algorithm is used to tune the hyperparameters of LSTM automatically. Then, the optimal hyperparameter combination is used for regression prediction of photovoltaic power. The experiment compared the model results under default hyperparameters, Bayesian optimization, and grid search optimization hyperparameters. The results showed that the SARSA-LSTM proposed in this paper has better training efficiency and prediction performance compared to other models, which can meet the needs of practical prediction applications.


Introduction
The intermittency, randomness, and volatility of photovoltaic power generation significantly affect the safe operation of the grid once the generation is connected to it. Photovoltaic power prediction can effectively help the power grid dispatching department reasonably adjust and optimize the power generation plan to ensure the stable operation of the power system [1]. Studying high-performance distributed photovoltaic power prediction methods is therefore of great significance.
At present, many scholars have applied artificial intelligence algorithms to the prediction of distributed photovoltaic power, for example genetic algorithms [2], BP neural networks [3], Long Short-Term Memory (LSTM) networks [4][5], and support vector machines [6]. The above literature assumes that the hyperparameters of the prediction model are set in advance, yet the values of the hyperparameters directly affect the prediction performance of the model.
Therefore, finding an automatic hyperparameter tuning method has become significant [7]. Among existing methods, manual parameter adjustment is highly subjective and rarely finds the globally optimal hyperparameter combination. The grid search method requires enumerating all possible values of the hyperparameters and therefore requires a large amount of computation. In [8], the Bayesian optimization algorithm is used to optimize the hyperparameters of an ultra-short-term wind power prediction model. This method can achieve better optimization results, but the optimization process takes longer as the number of iterations increases. In [9], a particle swarm optimization algorithm is proposed to automatically optimize the hyperparameters of LSTM. It has good robustness and adaptability, but the encoding is complex and requires a large amount of computational resources.
In response to the current application status of the above methods, this paper proposes a distributed photovoltaic power prediction model (SARSA-LSTM) based on the reinforcement learning SARSA algorithm. The hyperparameter combination optimization problem is abstracted as a sequential decision problem in which each decision optimizes only a single parameter, so that the hyperparameters of the photovoltaic power prediction model can be optimized automatically and efficiently. This method fully considers the importance and complexity of model hyperparameter tuning and provides a solution for improving model performance. Moreover, experimental results show that the accuracy of the model using this method is significantly improved compared to traditional LSTM models. The distributed photovoltaic power generation data used in this article is time series data, and the LSTM model has outstanding advantages in time series analysis [10]. Therefore, this article selects LSTM and proposes an LSTM model optimized by the SARSA (State-Action-Reward-State-Action) algorithm to predict photovoltaic power data.

Overall structure of the model
The optimization process of the SARSA algorithm uses the four key hyperparameters of LSTM (epochs, batch size, learning rate, and L1 regularization coefficient) as optimization variables, to maximize the accuracy of the model on the test set. Through the intelligent agent, the hyperparameter variables are gradually updated until the model converges, and the optimal hyperparameter combination is obtained.
The basic implementation framework of the SARSA-LSTM prediction model is shown in Figure 1.

LSTM algorithm and its hyperparameters
Long Short-Term Memory Network (LSTM) is a special type of recurrent neural network that carries information across the entire time series through a "cell state" channel. The unit structure of LSTM consists of a forget gate, an input gate, an output gate, and a cell state. The forget gate determines how much of the previous cell state needs to be preserved until the current time, the input gate determines how much of the input data of the network needs to be preserved until the current time, and the output gate controls how much of the current cell state needs to be output to the current output value. Overall, LSTM controls information transmission through a gating state, forgetting unimportant information and retaining information that requires long-term memory. Its basic structure is shown in Figure 2.
3) We calculate the information to be retained at the current time through the input gate, giving $i_t$ and $\tilde{C}_t$.
4) We update the cell state $C_t$ and calculate the output $h_t$ at the current time through the output gate.
5) We repeat the above calculation steps until the training is completed.
Here, $W$ and $b$ represent the weight matrix and bias vector of the input of each control gate, $\sigma$ and $\tanh$ represent the sigmoid and tanh functions respectively, $x$ represents the input, and $h$ represents the output.
In addition to its network structure parameters, the LSTM model also needs a large number of hyperparameters to be set before model training. In this paper, four key hyperparameters, namely the number of training iterations (epochs), the training batch size, the learning rate, and the regularization coefficient (L1), were selected for optimization from the aspects of training strategy, optimizer strategy, and generalization performance, to improve the performance of the photovoltaic power prediction model. Among them, too many or too few epochs may lead to overfitting or underfitting of the model. The batch size affects the training speed and memory consumption of the model: too large a batch may cause vanishing or exploding gradients, and too small a batch may slow down training. Too large a learning rate may cause the parameters to oscillate back and forth on both sides of the optimal solution, while too small a learning rate greatly reduces the convergence speed and increases the training time. A reasonable setting of L1 can effectively adjust the degree of overfitting and underfitting of the model. Whether the hyperparameter settings are reasonable therefore directly affects the prediction performance of the LSTM model, so this paper adopts the SARSA algorithm to optimize the hyperparameters.
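For reference, the gate computations described earlier in this section are usually written as follows in the standard LSTM formulation. This is the textbook form, offered as an assumption about what the calculation steps refer to; the subscripted names $W_f$, $W_i$, $W_C$, $W_o$ (and the matching biases), the output gate symbol $o_t$, and the concatenation $[h_{t-1}, x_t]$ are conventional notation rather than symbols taken from this paper.

```latex
\begin{align}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{(forget gate)} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{(input gate)} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{(candidate cell state)} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{(cell state update)} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{(output gate)} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{(output)}
\end{align}
```

Here $\odot$ denotes element-wise multiplication.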

Optimization of hyperparameters using the SARSA algorithm
The SARSA algorithm is a method for agents to learn and make decisions based on a pattern of state-action-reward-next state-next action. Its essence is to establish a Q table that reflects the relationship between states and actions, to learn the optimal decision through continuous trial and error, correction, and a reward and punishment mechanism, and to update the Q table until the optimal result is found.
The optimization process of the SARSA algorithm applied to the LSTM hyperparameter optimization scenario is as follows:
1) We establish the state-action matrix SA and the Q table, and initialize them. S represents the hyperparameter list, initialized to the default hyperparameter values of the model, and A represents the amplitude of the hyperparameter adjustment action, with two types of actions: positive adjustment and negative adjustment. The Q table is a matrix with the same form as SA whose entries are Q values, and it is initialized as a zero matrix.
2) We establish a reward and punishment mechanism R. This article takes the change in the prediction error of the test set before and after hyperparameter adjustment as the basis for the reward value: the more the prediction error decreases after adjustment, the greater the reward value is. Here, the error associated with the $i$-th executed action is the prediction error of the test set after that action has adjusted the hyperparameter state value.
3) The intelligent agent randomly selects a hyperparameter state $s$ and selects an adjustment action $a$ based on the ε-greedy strategy. One branch of the strategy selects an adjustment action at random under the hyperparameter state $s$, and the other selects the adjustment action corresponding to the maximum Q value under the state $s$ in the Q table; $r$ is a random number drawn from [0, 1], and ε is the threshold, which is set to 0.9 in this paper.
4) We execute the action $a$, and update the hyperparameter value and the reward value R.
5) We select the next hyperparameter state $s'$ based on the reward value and select the next action $a'$ based on the ε-greedy strategy.
6) We update the Q table based on the above results.

$$Q(s, a) = Q(s, a) + \alpha \times \left[ R + \gamma Q(s', a') - Q(s, a) \right]$$
where $\alpha$ is the learning rate, set to 0.01 in this paper, and $\gamma$ is the discount coefficient, set to 0.9 in this paper.
7) $s$ is updated to $s'$, and $a$ is updated to $a'$.
8) We iteratively execute Steps 4 to 7 until the model prediction error converges. At this point, the resulting list of hyperparameter state values is the optimal hyperparameter combination.
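The eight steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the tuning space `SPACE`, the step sizes, and `toy_error` (a stand-in for "train the LSTM and measure the test-set error") are hypothetical. The reward is assumed to be the plain decrease in error, the greedy branch is assumed to be taken when the random number is below ε, and the rule for choosing the next state (stay on the same hyperparameter while the reward is positive, otherwise pick one at random) is also an assumption, since the text does not spell these details out. A fixed step budget replaces the convergence check of Step 8.

```python
import random

# Hypothetical tuning space: (default, step size, lower bound, upper bound).
SPACE = {
    "epochs":        (10,    5,     5,     100),
    "batch_size":    (32,    16,    16,    256),
    "learning_rate": (0.001, 0.001, 0.001, 0.05),
    "l1":            (0.0,   0.001, 0.0,   0.02),
}
NAMES = list(SPACE)
ACTIONS = (+1, -1)                       # positive and negative adjustment
ALPHA, GAMMA, EPSILON = 0.01, 0.9, 0.9   # values stated in the text


def toy_error(params):
    """Hypothetical stand-in for training the LSTM and returning the test error.

    Assumption: a smooth bowl around a made-up optimum, for illustration only.
    """
    target = {"epochs": 50, "batch_size": 64, "learning_rate": 0.01, "l1": 0.005}
    return sum(((params[k] - target[k]) / (SPACE[k][3] - SPACE[k][2])) ** 2
               for k in NAMES)


def choose_action(q_row, rng):
    """Epsilon-greedy choice; taking the greedy branch when r < EPSILON is assumed."""
    if rng.random() < EPSILON:
        return max(range(len(ACTIONS)), key=lambda a: q_row[a])
    return rng.randrange(len(ACTIONS))


def sarsa_tune(error_fn, steps=200, seed=0):
    rng = random.Random(seed)
    params = {k: SPACE[k][0] for k in NAMES}     # Step 1: S starts at the defaults
    q = [[0.0] * len(ACTIONS) for _ in NAMES]    # Step 1: Q table of zeros
    err = error_fn(params)
    best_params, best_err = dict(params), err

    s = rng.randrange(len(NAMES))                # Step 3: random state, then action
    a = choose_action(q[s], rng)
    for _ in range(steps):
        # Step 4: apply the adjustment, clipped to the tuning space.
        name = NAMES[s]
        _, step, lo, hi = SPACE[name]
        params[name] = min(hi, max(lo, params[name] + ACTIONS[a] * step))
        new_err = error_fn(params)
        reward = err - new_err                   # Step 2 (assumed form): error decrease
        err = new_err
        if err < best_err:
            best_params, best_err = dict(params), err

        # Step 5 (assumed rule): keep tuning this hyperparameter while it pays off.
        s_next = s if reward > 0 else rng.randrange(len(NAMES))
        a_next = choose_action(q[s_next], rng)

        # Step 6: SARSA update of the Q table.
        q[s][a] += ALPHA * (reward + GAMMA * q[s_next][a_next] - q[s][a])
        s, a = s_next, a_next                    # Step 7
    return best_params, best_err, q
```

Calling `sarsa_tune(toy_error)` returns the best hyperparameter combination seen, its error, and the learned Q table. In a real setting, `toy_error` would be replaced by a function that trains the LSTM with the given hyperparameters and returns its test-set error, which makes each step far more expensive than in this sketch.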

Preparation of experimental data
Dataset: This article conducts model experiments using power generation data and weather data (wind speed, wind direction, temperature, pressure, humidity, and irradiance) collected from a photovoltaic power station in Tianjin from January 1, 2021 to December 31, 2022. The sampling interval of the dataset is 5 minutes, and the data is divided into a training set and a test set in a 7:3 ratio. The training set is used for LSTM model construction and training in the experiment, and the test set is used to validate the prediction performance of the model after SARSA has tuned the hyperparameters.
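The 7:3 division can be sketched as below. The text does not say whether the split is chronological or shuffled; this sketch assumes a chronological split, which is a common choice for time series because it keeps future samples out of the training set. The function name `chronological_split` is hypothetical.

```python
def chronological_split(samples, train_ratio=0.7):
    """Split an ordered sequence into a leading training part and a trailing test part."""
    if not 0.0 < train_ratio < 1.0:
        raise ValueError("train_ratio must be strictly between 0 and 1")
    cut = round(len(samples) * train_ratio)
    return samples[:cut], samples[cut:]


# A 5-minute sampling interval gives 288 samples per day.
SAMPLES_PER_DAY = 24 * 60 // 5
```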
Hyperparameter tuning space: The experiment mainly optimizes the four hyperparameters of LSTM and the tuning space is shown in Table 1.

Analysis of experimental results
This article selects the Mean Absolute Percentage Error (MAPE) as the measure of the model's predictive performance. The smaller the MAPE is, the better the model's predictive performance is. A larger decrease in MAPE after the SARSA algorithm has adjusted the hyperparameters indicates that the method proposed in this article is more effective.

$$\mathrm{MAPE} = \frac{1}{N} \sum_{i=1}^{N} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \times 100\%$$

where $y_i$ represents the actual value of photovoltaic power generation, $\hat{y}_i$ represents the predicted value of photovoltaic power generation, and $N$ indicates the number of predicted samples.
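A small Python sketch of the MAPE computation is given below. Skipping samples whose actual value is zero is an assumption made here so that the ratio stays defined (photovoltaic output is zero at night); the text does not say how such samples are handled.

```python
def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent.

    Assumption: samples whose actual value is zero are skipped.
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    pairs = [(y, p) for y, p in zip(actual, predicted) if y != 0]
    if not pairs:
        raise ValueError("no samples with a non-zero actual value")
    return 100.0 * sum(abs((y - p) / y) for y, p in pairs) / len(pairs)
```

For example, actual values of 100 and 200 with predictions of 110 and 180 are each off by 10%, so the MAPE is about 10.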

Validation of SARSA algorithm effectiveness.
To verify that the SARSA algorithm can effectively improve the prediction performance of the LSTM model by adjusting its hyperparameters, this experiment constructed a single-step prediction model and multi-step prediction models for photovoltaic power (2, 3, and 4 steps ahead), and recorded the trend of the LSTM model's MAPE on the test set during the iterative learning and hyperparameter adjustment process of SARSA. The specific results are shown in Figure 3. The figure shows that as the SARSA algorithm continues to adjust the hyperparameters, the error rate of the photovoltaic power prediction model decreases and gradually stabilizes, verifying the effectiveness of this method for model tuning.
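The single-step and multi-step prediction targets mentioned above can be framed as supervised samples with a sliding window. The sketch below is one common way to do this and is an assumption, not a description of the authors' pipeline: the `lookback` length is hypothetical, and each model is assumed to predict directly the value `horizon` steps ahead (1 for single-step; 2, 3, or 4 for multi-step).

```python
def make_windows(series, lookback, horizon=1):
    """Build (inputs, target) pairs from an ordered series.

    inputs: the `lookback` most recent values.
    target: the value `horizon` steps after the last input value.
    """
    if lookback < 1 or horizon < 1:
        raise ValueError("lookback and horizon must be at least 1")
    samples = []
    for start in range(len(series) - lookback - horizon + 1):
        inputs = series[start:start + lookback]
        target = series[start + lookback + horizon - 1]
        samples.append((inputs, target))
    return samples
```

With a 5-minute sampling interval, `horizon=4` corresponds to predicting 20 minutes ahead.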

Verification of hyperparameter optimization effect.
To visually demonstrate the performance improvement that the proposed method brings to the photovoltaic power prediction model, the SARSA-LSTM model was compared with the default hyperparameter model (LSTM), the grid search optimization-based model (GridSearch-LSTM), and the Bayesian optimization-based model (Bayesian-LSTM). Because grid search is exhaustive, its hyperparameter tuning space was reduced based on experience. The number of iterations, the resulting hyperparameter combinations, and the average prediction error on the test set are shown in Table 2. The order of hyperparameter combinations in the table is [epochs, batch_size, learning_rate, L1]. The power prediction curve of the one-step-ahead prediction model from 8:00 to 18:00 on December 31, 2022 is shown in Figure 4. The comparison between the data and the prediction results in the table shows that, for both single-step and multi-step prediction, the three optimization methods, especially SARSA-LSTM, obtain hyperparameter combinations with better prediction performance than the default parameter model. Specifically, in terms of the number of hyperparameter optimization iterations, SARSA-LSTM requires far fewer than GridSearch-LSTM, indicating that this method helps improve the training efficiency of the model. In terms of model prediction accuracy, SARSA-LSTM is higher than or close to the other models, demonstrating the feasibility of applying hyperparameter optimization based on the reinforcement learning SARSA algorithm to improve the performance of distributed photovoltaic power prediction.

Conclusion
To improve the performance of distributed photovoltaic power prediction, this paper proposes a hyperparameter optimization method based on reinforcement learning from the perspective of model hyperparameter tuning. Firstly, a photovoltaic power prediction model with default parameters is constructed using LSTM. Secondly, the agent is constructed based on the SARSA learning method, and the hyperparameters of the LSTM model are adjusted step by step to automatically select the optimal combination of hyperparameters. Finally, the photovoltaic power prediction results of the LSTM model under the optimal hyperparameter combination are output.
To verify the effectiveness of the proposed method, experiments were conducted to compare it with LSTM models whose hyperparameters were optimized by grid search and Bayesian optimization. The results demonstrate that SARSA-LSTM has better training efficiency and prediction performance, which eases the difficulty of manual parameter tuning and gives the method practical value.

Figure 3. MAPE change trend chart of LSTM model test set.

Figure 4. Prediction curve of photovoltaic power generation.

Table 2. Comparison of photovoltaic power prediction errors among different models.