Energy System Optimization Research Based on PEC-DDPG Algorithm

Factors such as the stochastic nature of loads make it difficult to optimize the operation of integrated energy systems. To address this problem, an energy system economic optimization scheme based on the PEC-DDPG algorithm is proposed. First, an exponential moving average (EMA) is introduced into the deep deterministic policy gradient (DDPG) algorithm; prioritized experience replay (PER) is added so that experiences in the replay pool are sampled by priority, improving the learning efficiency of the algorithm; and the overestimation inherent in a single Critic network is mitigated with a multi-Critic structure. Next, the energy system optimization model is constructed, and appropriate observation states, decision actions, and reward functions are selected. Finally, simulations using energy system data from a region show that PEC-DDPG achieves better operational optimization than the DDPG algorithm.


Introduction
Toward the dual-carbon goal, building a new power system centered on renewable energy is the direction of future development [1]. The uncertainty of the energy optimization process has a great impact on the operational optimization of such a system, and the problems of energy system operation optimization research are becoming increasingly prominent.
Model optimization and operation strategies are currently the main research directions for the operational optimization of integrated energy systems [2]. Regarding model optimization, Ref. [3] built a model in which compressors adjust the pressure of natural gas pipelines, increasing system flexibility. Ref. [4] considered the dynamic characteristics of the pipeline network and added integrated demand response to the day-ahead scheduling model to improve the flexibility and economy of system operation. Methods of this type model the energy system in detail but approximate its nonlinear, nonconvex problems with linearizations of limited accuracy.
Regarding operation strategies, Ref. [5] applied the binary particle swarm algorithm and the particle swarm algorithm to the mixed-integer MO-PBCUCP problem. Ref. [6] used an improved second-order oscillating particle swarm algorithm and an adaptive ε-dominated multi-objective particle swarm algorithm, respectively, to solve a two-layer optimization model of an active distribution network, reducing voltage offset and operating cost. These studies rely on traditional algorithms to achieve operational optimization, which limits the scheduling plan and demands highly accurate forecasts of renewable generation and other data.
With the continuous advancement of artificial intelligence, reinforcement learning has been applied to the operational optimization of integrated energy systems [7]. Ref. [8] used Q-learning to achieve real-time optimal dispatch of distribution systems under perturbations such as wind speed. Ref. [9] used a reinforcement-learning differential evolution method to optimize energy management for multi-energy system scheduling. These studies use reinforcement learning for optimal energy scheduling, but for continuous tasks the number of variable dimensions grows exponentially, causing the curse of dimensionality [10].
To address the curse of dimensionality, deep reinforcement learning algorithms capable of handling continuous tasks have been applied to integrated energy system optimization [11]. Ref. [12] proposed a micro-energy management model and used a dual deep expectation Q-network combining a Bayesian neural network with deep reinforcement learning to find the optimal policy, significantly improving system economy and convergence speed. Ref. [13] applied an approach based on a soft actor-critic framework to an integrated electricity-gas energy system optimization problem. Ref. [14] proposed dynamic energy conversion and management strategies to reconcile the economic cost and load shifting objectives, using the DDPG algorithm to obtain the operating strategy, which improved system profit and wind power utilization. Ref. [15] used an improved algorithm and compared the operational optimization problem across weekday and weekend scenarios to improve the economy of optimal system scheduling. Nevertheless, the complexity of integrated energy systems makes the study of economically optimal operation strategies difficult, and this remains an important open issue.
In this context, an energy economic optimization scheme based on the PEC-DDPG algorithm is proposed. First, an energy system optimization model is constructed, and suitable observation states, decision actions, and reward functions are selected; second, the solution process of the DDPG algorithm is improved to enhance the effectiveness of energy system optimization; finally, the validity and economy of the scheme are demonstrated through simulations on regional energy system data.

Fundamentals
The DDPG algorithm uses an agent whose goal is to maximize the long-term return. The agent interacts with the environment according to the Markov decision process, and immediate rewards serve as feedback for learning the optimal strategy. The Critic network estimates the expected return of taking decision action $a_t$ in observed state $s_t$ and is used to judge the merit of $a_t$ under $s_t$ in order to find the optimal strategy:

$$Q(s_t, a_t \mid \theta^Q) = \mathbb{E}\left[\sum_{k=0}^{\infty} \gamma^k r_{t+k}\right]$$

where $\gamma$ is the discount factor, $r_t$ is the value of the reward function at time $t$, and $\mathbb{E}(\cdot)$ denotes expectation. Relying on the judgment of the Critic network, the optimal strategy $\pi^*$ is the strategy that maximizes the Critic value:

$$\pi^* = \arg\max_{\pi} Q(s_t, a_t \mid \theta^Q)$$

The Actor network is a mapping from observed states to decision actions:

$$a_t = \pi(s_t \mid \theta^{\pi})$$

where $\theta^{\pi}$ is the Actor network parameter. The Critic network parameters are optimized by minimizing the loss function

$$L(\theta^Q) = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - Q(s_i, a_i \mid \theta^Q)\right)^2, \qquad y_i = r_i + \gamma Q'\!\left(s_{i+1}, \pi'(s_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right)$$

where $\theta^{Q'}$ is the target Critic network parameter and $\theta^{\pi'}$ is the target Actor network parameter. The gradient of the loss with respect to the parameters is $\nabla_{\theta^Q} L(\theta^Q)$, and the Critic parameters are updated according to the gradient rule

$$\theta^Q \leftarrow \theta^Q - \sigma_Q \nabla_{\theta^Q} L(\theta^Q)$$

where $\sigma_Q$ is the Critic network learning rate. For the Actor parameters, $\nabla_a Q(s_i, a_i \mid \theta^Q)$ determines the optimization direction and $\nabla_{\theta^{\pi}} J(\theta^{\pi})$ determines the parameter update:

$$\nabla_{\theta^{\pi}} J(\theta^{\pi}) = \frac{1}{N}\sum_{i=1}^{N} \nabla_a Q(s_i, a_i \mid \theta^Q)\Big|_{a_i = \pi(s_i)} \nabla_{\theta^{\pi}} \pi(s_i \mid \theta^{\pi})$$

The Actor network updates its parameters according to the gradient rule

$$\theta^{\pi} \leftarrow \theta^{\pi} + \sigma_{\pi} \nabla_{\theta^{\pi}} J(\theta^{\pi})$$

where $\sigma_{\pi}$ is the Actor network learning rate. The Actor and Critic networks update their target network parameters through a soft update:

$$\theta' \leftarrow \tau \theta + (1 - \tau)\theta'$$

where $\tau$ is the soft update coefficient, $0 < \tau \ll 1$.
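For concreteness, the update equations above can be sketched in a few lines of PyTorch. This is a minimal illustration under assumed network classes, optimizers, and batch format, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def ddpg_update(actor, critic, target_actor, target_critic,
                actor_opt, critic_opt, batch, gamma=0.99, tau=0.005):
    s, a, r, s_next = batch  # tensors drawn from the replay pool

    # Critic target: y_i = r_i + gamma * Q'(s_{i+1}, pi'(s_{i+1}))
    with torch.no_grad():
        y = r + gamma * target_critic(s_next, target_actor(s_next))

    # Minimize L(theta^Q) = mean_i (y_i - Q(s_i, a_i))^2
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Deterministic policy gradient: ascend Q(s, pi(s)) by minimizing its negative
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft update: theta' <- tau * theta + (1 - tau) * theta'
    for net, tgt in ((critic, target_critic), (actor, target_actor)):
        for p, p_tgt in zip(net.parameters(), tgt.parameters()):
            p_tgt.data.mul_(1.0 - tau).add_(tau * p.data)
```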

Prioritized experience replay
Prioritized experience replay first ranks the stored samples by priority and then draws higher-priority samples more frequently. The TD error is usually used as the indicator of sample priority:

$$\delta_i = r_i + \gamma Q'\!\left(s_{i+1}, \pi'(s_{i+1} \mid \theta^{\pi'}) \mid \theta^{Q'}\right) - Q(s_i, a_i \mid \theta^Q)$$

The sampling probability of each sample is set according to its absolute TD error. To ensure that samples with a low TD error can still be drawn, a small constant is added to preserve sampling diversity:

$$P(i) = \frac{p_i^{\alpha}}{\sum_k p_k^{\alpha}}, \qquad p_i = \lvert \delta_i \rvert + \varepsilon$$

where $P(i)$ is the probability that the $i$-th sample is drawn, $\varepsilon$ is a small positive constant, and $\alpha$ is the tuning parameter controlling the degree of prioritization.
Under this priority ranking, the frequent sampling of high-priority samples during training introduces bias that can destabilize convergence. Therefore, importance-sampling weights are introduced:

$$w_i = \frac{\left(O \cdot P(i)\right)^{-\chi}}{\max_j \left(O \cdot P(j)\right)^{-\chi}}$$

where $O$ is the number of samples in the experience pool and $\chi$ is the correction parameter. Because of the uncertainty of states and the volatility of state quantities in the energy system, the uniform sampling strategy of the standard DDPG algorithm suffers from sampling bias, which makes it difficult for the Actor network to learn the optimal policy during training. Therefore, the uniform sampling strategy is replaced with prioritized experience replay to improve the quality of experience sampling, enhance the accuracy of the trained parameters, and find the optimal scheduling strategy.
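A minimal proportional PER buffer consistent with the equations above might look as follows. A sum-tree is the usual O(log N) implementation; the flat-array version here is for clarity only, and the class and method names are illustrative:

```python
import numpy as np

class PrioritizedReplay:
    """Proportional prioritized replay. alpha and beta correspond to
    the exponents alpha and chi in the equations above."""
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.data, self.prios, self.pos = [], np.zeros(capacity), 0

    def add(self, transition):
        # New samples get the current maximum priority so each is seen at least once.
        max_p = self.prios.max() if self.data else 1.0
        if len(self.data) < self.capacity:
            self.data.append(transition)
        else:
            self.data[self.pos] = transition
        self.prios[self.pos] = max_p
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        # P(i) = p_i^alpha / sum_k p_k^alpha
        p = self.prios[:len(self.data)] ** self.alpha
        p /= p.sum()
        idx = np.random.choice(len(self.data), batch_size, p=p)
        # Importance weights w_i = (O * P(i))^(-beta), normalized by the maximum.
        w = (len(self.data) * p[idx]) ** (-beta)
        w /= w.max()
        return [self.data[i] for i in idx], idx, w

    def update_priorities(self, idx, td_errors):
        # p_i = |delta_i| + eps
        self.prios[idx] = np.abs(td_errors) + self.eps
```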

Exponential moving average
The exponential moving average is used in fields such as deep learning, finance, and optimization, where it reacts quickly to changes in a tracked quantity:

$$V_t = \beta V_{t-1} + (1 - \beta) W_t$$

where $\beta$ is the smoothing exponent, $V_{t-1}$ is the updated value of the previous period, $W_t$ is the current actual value, and $V_t$ is the current updated value.
In the energy system optimization application, $V_t$ is not an integer and cannot be used directly in the update process, so it is passed through a floor function and mapped to an integer $n_t$ in the range 1 to 3, which governs the dynamic adjustment of the network update schedule.
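One plausible realization of the EMA tracking and the floor-based integer mapping is sketched below. The clamp form, and the use of $n_t$ as the number of Critic updates per Actor update, are assumptions based on the description above, not details given in the text:

```python
import math

def ema_update(v_prev, w_t, beta=0.9):
    """V_t = beta * V_{t-1} + (1 - beta) * W_t"""
    return beta * v_prev + (1.0 - beta) * w_t

def to_update_interval(v_t, low=1, high=3):
    """Assumed mapping: floor V_t, then clamp into {1, 2, 3}.
    Interpreted here as the number of Critic updates per Actor update."""
    return max(low, min(high, math.floor(v_t)))
```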

Multi-Critic structure
In solving the operational optimization of the energy system, the Actor network's policy learning relies mainly on the guidance of the Critic network. The complexity of the energy system degrades the training accuracy of the Critic network and aggravates the overestimation problem of a single Critic. Because the network update process is very sensitive to parameter changes, this in turn degrades the accuracy of the Actor network parameters. Therefore, the single Critic network is replaced with a multi-Critic structure, and the minimum value over the Critics is used to guide Actor learning and improve the accuracy of the network parameters.
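The pessimistic target implied by taking the minimum over Critics can be sketched as follows (the twin-Critic idea of TD3 generalized to K Critics); the function names and the number of Critics are illustrative:

```python
import torch

def multi_critic_target(target_critics, target_actor, r, s_next, gamma=0.99):
    """y_i = r_i + gamma * min_k Q'_k(s_{i+1}, pi'(s_{i+1})).
    The minimum over K target Critics counters single-Critic overestimation."""
    with torch.no_grad():
        a_next = target_actor(s_next)
        q_all = torch.stack([q(s_next, a_next) for q in target_critics])
        return r + gamma * q_all.min(dim=0).values

def actor_loss(critics, actor, s):
    """The Actor is likewise guided by the minimum over the Critics."""
    a = actor(s)
    return -torch.stack([q(s, a) for q in critics]).min(dim=0).values.mean()
```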

Energy system optimization model
Under the premise of stable operation, an electricity-gas-heat energy system optimization model is constructed. As shown in Figure 1, the energy purchase side of the model includes the electricity grid (Grid) and natural gas (Gas); the energy storage devices include battery energy storage (BES) and heat storage (HS); the conversion devices include power-to-gas (P2G), an electric boiler (EB), a gas boiler (GB), and combined heat and power (CHP); and the generation devices include a wind turbine (Wind) and photovoltaics (PV).

Objective function
Under the condition of ensuring the safety and reliability of the integrated energy system, the objective function $F$ minimizes the total operating cost of the system:

$$\min F = \sum_{t=1}^{T}\Big[ c_{\mathrm{Grid}} P_{\mathrm{Grid},t} + c_{\mathrm{Gas}}\Big(\frac{P_{\mathrm{CHP},t}}{\eta_{\mathrm{CHP}}} + \frac{H_{\mathrm{GB},t}}{\eta_{\mathrm{GB}}} - \eta_{\mathrm{P2G}} P_{\mathrm{P2G},t}\Big) + c_{\mathrm{BES}} \lvert P_{\mathrm{BES},t}\rvert + c_{\mathrm{HS}} \lvert H_{\mathrm{HS},t}\rvert + c_{\mathrm{P2G}} P_{\mathrm{P2G},t} + c_{\mathrm{EB}} P_{\mathrm{EB},t} + c_{\mathrm{CHP}} P_{\mathrm{CHP},t} + c_{\mathrm{GB}} H_{\mathrm{GB},t} \Big]$$

where $t$ is the current scheduling period, $T$ is the total number of scheduling periods, $P_{\mathrm{Grid},t}$ is the Grid interaction power, $P_{\mathrm{CHP},t}$ and $P_{\mathrm{P2G},t}$ are the electric powers of the CHP and P2G, $P_{\mathrm{BES},t}$ and $P_{\mathrm{EB},t}$ are the charging (discharging) power of the BES and the electric power of the EB, $H_{\mathrm{GB},t}$ and $H_{\mathrm{HS},t}$ are the thermal power of the GB and the charging (discharging) heat power of the HS, $c_{\mathrm{Grid}}$ and $c_{\mathrm{Gas}}$ are the electricity purchase (sale) price and the gas purchase price, $c_{\mathrm{BES}}$, $c_{\mathrm{HS}}$, $c_{\mathrm{P2G}}$, $c_{\mathrm{EB}}$, $c_{\mathrm{CHP}}$, and $c_{\mathrm{GB}}$ are the operating costs of the BES, HS, P2G, EB, CHP, and GB, respectively, $\eta_{\mathrm{CHP}}$ is the CHP conversion efficiency, and $\eta_{\mathrm{GB}}$ and $\eta_{\mathrm{P2G}}$ are the conversion efficiencies of the GB and P2G, respectively.
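Expressed as code, the per-period cost (the negative of which would serve as the reward in the reinforcement learning formulation) might be computed as below. The dictionary keys and the netting of P2G gas production against CHP and GB consumption are assumptions consistent with the terms listed above, not the paper's stated formulation:

```python
def step_cost(p, price, c_om, eff):
    """Cost of one scheduling period t; reward_t = -step_cost.
    p: dispatch decisions; price: energy prices; c_om: unit O&M costs;
    eff: conversion efficiencies. All names are illustrative."""
    # Gas purchased: CHP and GB consumption minus gas produced by P2G (assumed).
    gas_bought = (p["P_chp"] / eff["chp"] + p["H_gb"] / eff["gb"]
                  - eff["p2g"] * p["P_p2g"])
    energy = price["grid"] * p["P_grid"] + price["gas"] * max(gas_bought, 0.0)
    om = (c_om["bes"] * abs(p["P_bes"]) + c_om["hs"] * abs(p["H_hs"])
          + c_om["p2g"] * p["P_p2g"] + c_om["eb"] * p["P_eb"]
          + c_om["chp"] * p["P_chp"] + c_om["gb"] * p["H_gb"])
    return energy + om
```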

Constraints
Depending on the energy system configuration, the capacity range of each energy device (BES, CHP, and so on) must be respected:

$$B_S^{\min} \le B_{S,t} \le B_S^{\max}$$

where $S$ is the type of energy device, $B$ is the power type of the device, and $B_S^{\min}$ and $B_S^{\max}$ are the lower and upper power limits of the device.
According to the principle of energy conservation, the electric and thermal power balance constraints are

$$P_{\mathrm{Grid},t} + P_{\mathrm{PV},t} + P_{\mathrm{Wind},t} + P_{\mathrm{CHP},t} = P_{\mathrm{Load},t} + P_{\mathrm{BES},t} + P_{\mathrm{P2G},t} + P_{\mathrm{EB},t}$$

$$H_{\mathrm{CHP},t} + H_{\mathrm{GB},t} + H_{\mathrm{EB},t} = H_{\mathrm{Load},t} + H_{\mathrm{HS},t}$$

where $P_{\mathrm{Load},t}$ is the electric load power, $P_{\mathrm{PV},t}$ and $P_{\mathrm{Wind},t}$ are the PV and Wind output powers, respectively, and $H_{\mathrm{Load},t}$ and $H_{\mathrm{EB},t}$ are the thermal load power and the thermal power of the EB.
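The capacity and balance constraints can be enforced or checked along the following lines; the key names are illustrative, and how any residual imbalance is penalized in the reward is a design choice the text does not specify:

```python
import numpy as np

def clamp_action(raw, bounds):
    """Project each device's raw action into its capacity range
    [B_S^min, B_S^max] (box constraints)."""
    return {s: float(np.clip(v, *bounds[s])) for s, v in raw.items()}

def balance_residuals(p):
    """Electric and thermal balance residuals; both are zero when
    the balance constraints above hold exactly."""
    elec = (p["P_grid"] + p["P_pv"] + p["P_wind"] + p["P_chp"]
            - p["P_load"] - p["P_bes"] - p["P_p2g"] - p["P_eb"])
    heat = p["H_chp"] + p["H_gb"] + p["H_eb"] - p["H_load"] - p["H_hs"]
    return elec, heat
```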

Experiments and results
To verify the applicability of PEC-DDPG to energy system optimization, simulations are carried out on a regional energy system case study. The experimental data set consists of nine days of data from December 22, divided into training and testing sets to facilitate the subsequent study, as shown in Figure 2.
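A sketch of the data split under an assumed layout (nine days of hourly records); the file name is hypothetical, and since the paper does not state the train/test ratio, the 7/2 split below is purely illustrative:

```python
import numpy as np

# Hypothetical layout: nine days x 24 hourly records x feature columns.
data = np.load("regional_energy_9days.npy")
days = data.reshape(9, 24, -1)
train_set, test_set = days[:7], days[7:]  # illustrative 7/2 split
```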

Analysis of results
To verify generalizability, the saved parameters are loaded and applied to energy optimization; the resulting electricity and thermal dispatch are shown in Figure 3 and Figure 4. At low tariffs, the electric load demand is met mainly by purchased power, supplemented by wind generation in a few hours; the heat load demand is met mainly by the gas boiler and electric boiler; and because electricity is cheap, both the electric storage and the heat storage are mainly charging. At peak tariffs, the electric load demand is met mainly by photovoltaic and wind generation, the electric storage discharges, purchased power maintains the load balance, and the heat load demand is met by the electric boiler in cooperation with the heat storage. At normal tariffs, the heat load demand is met by the CHP unit, the gas boiler, and the heat storage in cooperation, while the electric load demand is met mainly by purchased power, which is reduced when photovoltaic power is available. Thus, the PEC-DDPG algorithm achieves operational optimization of the energy system while satisfying the power balance.

Comparison of results
To confirm the validity and economy of the algorithm, PEC-DDPG, CPLEX, DDPG, PEC-DDPG without PER, PEC-DDPG without EMA, and PEC-DDPG without the multi-Critic structure were each applied to the energy optimization model, and the optimization results were compared, as shown in Table 1. As shown in Table 1, the average dispatch cost of the CPLEX solver is higher than that of PEC-DDPG because CPLEX is prone to falling into local optima on complex tasks. The average dispatch cost of DDPG is 61.0% higher than that of the proposed algorithm due to exploration bias, which degrades operational optimization. The cost of the algorithm without the multi-Critic structure is 15.6% higher than that of PEC-DDPG because the Critic network cannot adequately learn the environment space, and the resulting error degrades parameter accuracy. The cost of the algorithm without PER is 10.8% higher than that of PEC-DDPG because a uniform sampling strategy cannot guarantee sampling quality or efficient policy learning. The cost of the algorithm without EMA is 6.8% higher than that of PEC-DDPG because the complexity of the energy system biases Critic learning, and the fixed network update schedule interferes with Actor learning. The comparison shows that the PEC-DDPG algorithm further improves the economy of energy optimization.

Conclusion
In this paper, an energy system economic optimization scheme based on the PEC-DDPG algorithm is proposed. First, the fixed network update schedule of the algorithm is made dynamically adjustable by introducing the exponential moving average, so that the Critic network is updated faster than the Actor network. Second, for the complex energy system, PER is introduced to prioritize the algorithm's experience pool, improving sampling quality and facilitating optimization. Third, a multi-Critic structure is set up to reduce the overestimation error of the algorithm and better learn the energy system environment. Finally, the energy optimization model is constructed, and the simulation results are compared and analyzed. In conclusion, the PEC-DDPG algorithm can effectively enhance the optimization of system operation.

Figure 4. Thermal dispatch.

Table 1. Comparison of the average dispatch costs of different algorithms.