Wind power anomaly data detection based on unsupervised methods

During the actual operation of a wind turbine, a large amount of abnormal data is generated due to environmental or human factors, which has a significant impact on condition assessment and output prediction. To make wind energy a reliable source of energy, it is essential to establish an efficient and accurate wind power detection model, and identifying abnormal data is a prerequisite for a precise evaluation of wind turbine performance. Based on the K-means clustering algorithm from data mining, this paper introduces an unsupervised abnormal wind power detection algorithm that incorporates a variational autoencoder (VAE) model. The method centers on the reconstruction error, from which an abnormality score for the wind power data is computed; this score is then used to determine whether abnormal wind power data are present. Finally, the method is tested on wind power data from 2021.


Introduction
Given the rapid progress in wind power technology, daily condition monitoring of wind turbines has become an important research direction for ensuring their normal operation. However, due to anemometer faults, power generation equipment failures, wind power curtailment, human factors, and other causes, wind turbines frequently produce a substantial volume of abnormal wind power data during practical operation. These abnormal conditions are often difficult to identify, which not only interferes with subsequent data mining work but also hinders the development of intelligent operation, maintenance, and management. Therefore, it is necessary to propose a method for the anomaly detection of wind power data.
Researchers at home and abroad have carried out extensive research on anomaly detection for the output data of wind farms and wind turbines. Over the past few years, as data mining technology has advanced rapidly, anomaly detection algorithms have likewise seen rapid progress and extensive adoption. Yan et al. [1] used the DBSCAN clustering algorithm to identify abnormal data; their research shows that DBSCAN performs relatively poorly in identifying densely clustered abnormal data. Matsui et al. [2] introduced a lightning-induced damage detection approach for wind turbine blades that relies on SCADA data and employs a Gaussian mixture model (GMM) anomaly detection model. For real-time detection of abnormal wind power data, Dong et al. [3] introduced a detection framework tailored to this purpose, integrating a semi-supervised learning mechanism into the robust random forest algorithm. Dao [4] used nonparametric statistical tests to automatically detect the operating status and faults of wind turbines; the results relied on statistical hypothesis tests and were verified on a wind turbine fault data set provided by SCADA. Wang et al. [5] developed an anomaly removal technique based on the distribution pattern of abnormal data, leveraging a Bayesian change point-quartile combination algorithm to identify and handle anomalies. Zhou et al. [6] introduced an algorithm that cleans up abnormal data in the low and high wind speed ranges by combining quartile analysis and clustering techniques. HyunYong Lee et al. [7] used the correlation between sensors to construct an unsupervised anomaly detection model for wind turbine systems.
Building on the research of Li et al. [8], this paper combines clustering techniques and a variational autoencoder to construct a wind power anomaly detection model, which is then used to detect abnormal wind power data in the test set.

Algorithmic structure
The structure of the algorithm is shown in Figure 1; it consists of three parts. First, the unlabeled input data are clustered using the K-means method to form an initial training set. Second, feature extraction is performed on the initial training set using a variational autoencoder (VAE), and the reconstruction error is computed; the bottleneck features are concatenated with the reconstruction error to generate new features. Third, the new features are periodically re-clustered to re-form the training set, until two consecutive clustering results are consistent.
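As a sketch of the first step, K-means clustering of the unlabeled data and selection of low-variance clusters as the initial training set, the following Python snippet illustrates the idea. The cluster count k and anomaly percentage p0 are illustrative assumptions, not values from the paper.

```python
# Illustrative sketch of forming an initial "normal" training set:
# cluster unlabeled features with K-means, then keep only the clusters
# whose variance falls below the (100 - p0)-th percentile of the cluster
# variances. k and p0 are assumed values.
import numpy as np
from sklearn.cluster import KMeans

def select_normal_candidates(X, k=5, p0=10, seed=0):
    labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
    # Mean feature variance within each cluster
    variances = np.array([X[labels == c].var() for c in range(k)])
    threshold = np.percentile(variances, 100 - p0)
    keep = np.where(variances < threshold)[0]   # low-variance clusters
    mask = np.isin(labels, keep)
    return X[mask], mask

np.random.seed(0)
X = np.vstack([np.random.randn(200, 20) * 0.1,  # tight "normal" samples
               np.random.randn(20, 20) * 5.0])  # scattered anomalies
X_train, mask = select_normal_candidates(X)
```

High-variance clusters, which are more likely to contain scattered anomalies, are excluded from the candidate training set.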

Data processing
To understand the energy distribution of wind power data in different frequency ranges, and drawing on the research of Zhan et al. [9], this paper uses the Short-Time Fourier Transform (STFT) to preprocess the data, converting the wind power data from the time domain to the frequency domain. For the STFT of each time window, the coefficients of the first 20 frequency components are extracted as features, and anomaly detection is performed on these features to identify potentially abnormal frequency components. This is shown in Equations (1) and (2).

STFT(t, f) = ∫ x(τ) w(t − τ) e^(−j2πfτ) dτ  (1)

where STFT(t, f) represents the STFT value at time t and frequency f, x(τ) denotes the wind power signal, w(t − τ) is the window function, and e^(−j2πfτ) is the complex exponential factor that transforms the signal from the time domain to the frequency domain.
Ck = |STFT(t, fk)|  (2)

Strain = C(Kmeans(X), p0)  (3)

where C represents the selection process that relies on the clustering output: its threshold is set to the (100 − p0)-th percentile of the cluster variances, and the clusters whose variance is less than this threshold are selected. p0 represents the assumed percentage of anomalies, which controls which clusters are accepted into the training set.
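A minimal sketch of this preprocessing step using SciPy's `stft`; the sampling rate and window length are assumptions, since only the number of retained frequency components (20) is specified above.

```python
# Sketch of the STFT preprocessing in Equations (1)-(2): compute the
# short-time Fourier transform of a wind power series and keep the
# amplitudes of the first 20 frequency components of each window.
# fs and nperseg are illustrative assumptions.
import numpy as np
from scipy.signal import stft

def stft_features(power_series, fs=1.0, nperseg=64, n_components=20):
    f, t, Z = stft(power_series, fs=fs, nperseg=nperseg)
    amplitudes = np.abs(Z)              # C_k = |STFT(t, f_k)|
    return amplitudes[:n_components].T  # one 20-dim feature row per window

x = np.sin(2 * np.pi * 0.05 * np.arange(1024))
features = stft_features(x)
```

Each row of `features` is one time window's 20-dimensional amplitude vector, which then serves as an input data point for clustering.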

Variational autoencoders.
The variational autoencoder (VAE) is an unsupervised generative model that can approximately reconstruct the encoded data. A VAE has two parts: an encoder and a decoder. The hidden variable produced by the encoder follows a normal distribution: the encoder learns a function that outputs the parameters of this distribution, from which the decoder reconstructs the underlying vector. The VAE learns the encoding function Fen by training on Strain = {S1, S2, …, SM}, generating the potential vector feature zc. This is shown in Equation (4).
zc = Fen(s; θen), s ∈ Strain  (4)

where θen represents the learning parameters of the encoder and zc represents the potential vector feature of dimension kbn.
For the decoding part, the reconstructed feature x' is obtained by the decoder of the VAE. This is shown in Equation (5).
x' = Fde(zc; θde)  (5)

where θde represents the learning parameters of the decoder and x' represents the reconstructed features.
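The encode/decode mappings of Equations (4) and (5), together with the reparameterization step that samples the normally distributed latent variable, can be sketched in NumPy. The weights below are random placeholders standing in for the learned parameters {θen, θde}, and the dimensions (20-dimensional input, 8-dimensional latent vector) are assumptions.

```python
# Minimal numpy sketch of the VAE forward pass in Equations (4)-(5).
# Weights are random placeholders; a real VAE learns them by gradient
# descent. Dimensions (20 -> 8 latent -> 20) are assumptions.
import numpy as np

rng = np.random.default_rng(0)
d_in, d_latent = 20, 8
W_mu = rng.normal(size=(d_in, d_latent))
W_logvar = rng.normal(size=(d_in, d_latent))
W_de = rng.normal(size=(d_latent, d_in))

def encode(x):
    # F_en: map the input to the parameters of a normal latent distribution
    return x @ W_mu, x @ W_logvar          # mu, log-variance

def reparameterize(mu, logvar):
    # Sample z_c = mu + sigma * eps so gradients can flow through the sample
    return mu + np.exp(0.5 * logvar) * rng.normal(size=mu.shape)

def decode(z):
    # F_de: map the latent vector back to a reconstruction x'
    return z @ W_de

x = rng.normal(size=(4, d_in))
mu, logvar = encode(x)
x_rec = decode(reparameterize(mu, logvar))
```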
The true distribution is specified as pθ(x|z) and the estimated network distribution as qφ(z|x); the VAE model therefore needs to narrow the gap between them. The loss function of the VAE comprises two components: the reconstruction loss and the Kullback-Leibler (KL) divergence. The reconstruction loss, denoted as Eq[log pθ(x|z)], measures the discrepancy between the generated data and the input data, quantifying how well the VAE reconstructs the original data. The KL divergence quantifies the dissimilarity between the distribution learned during training and the Gaussian distribution p(z) that is typically assumed for the latent variables; it assesses how much the learned latent distribution diverges from the assumed Gaussian prior. This is shown in Equation (6).

L(θ, φ) = −Eqφ(z|x)[log pθ(x|z)] + KL(qφ(z|x) ‖ p(z))  (6)
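Both terms of this loss have a simple closed form when q(z|x) is a diagonal Gaussian and p(z) a standard normal; the mean-squared-error form of the reconstruction term below is an assumption. A minimal sketch:

```python
# The two loss terms of the VAE objective, for a standard-normal prior
# p(z) and a diagonal-Gaussian encoder q(z|x). The mean-squared-error
# reconstruction term is an assumed form.
import numpy as np

def vae_loss(x, x_rec, mu, logvar):
    # Reconstruction loss: how far the decoded output is from the input
    rec = np.mean(np.sum((x - x_rec) ** 2, axis=1))
    # KL(q(z|x) || N(0, I)), closed form for diagonal Gaussians
    kl = np.mean(-0.5 * np.sum(1 + logvar - mu ** 2 - np.exp(logvar), axis=1))
    return rec + kl

x = np.zeros((2, 4))
x_rec = np.zeros((2, 4))
mu = np.zeros((2, 3))
logvar = np.zeros((2, 3))
print(vae_loss(x, x_rec, mu, logvar))  # 0.0 when reconstruction is exact and q matches the prior
```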
d(x, x') = (Σk (xk − x'k)²)^(1/2)  (7)

where d(x, x') denotes the reconstruction error between an input x and its reconstruction x'. The potential vector features Z = {z1, z2, …, zN} are concatenated with the reconstruction errors to generate a new set of features. This is shown in Equation (8).

z'i = [zi, d(xi, x'i)]  (8)
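This step amounts to appending each sample's reconstruction error to its latent feature vector, for example (with an assumed 8-dimensional latent space):

```python
# Sketch of concatenating latent features with reconstruction errors
# to form the new feature set that is re-clustered. The latent
# dimension (8) is an assumption.
import numpy as np

Z = np.random.randn(100, 8)       # latent features z_i
rec_err = np.random.rand(100)     # d(x_i, x_i') per sample
Z_new = np.hstack([Z, rec_err[:, None]])
print(Z_new.shape)  # (100, 9)
```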

Re-select normal candidate.
The training set X is re-labeled using Z as a proxy, and the members of Z that belong to the low-variance clusters are selected. This is shown in Equations (9) and (10).

C^(t+1) = C(Kmeans(Z^(t)), p0)  (9)

Strain^(t+1) = {xi : zi ∈ C^(t+1)}  (10)

The training process concludes when the chosen set of normal samples remains consistent between two consecutive iterations. Following the final iteration, the VAE parameters {θen, θde} are obtained, and the scoring function is constructed from them. This is shown in Equation (11).

h(x) = d(x, x') = d(x, Fde(Fen(x; θen); θde))  (11)

where x' represents the outcome of the encoding and decoding process carried out by the trained variational autoencoder (VAE).
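The stopping rule and the scoring function can be sketched as follows; `encode_decode` is a placeholder for the trained Fde(Fen(·)) pipeline and the Euclidean distance is used for d.

```python
# Sketch of the stopping rule and the scoring function h(x): training
# stops once the selected normal-candidate set is unchanged between two
# consecutive iterations, and h(x) is the distance between x and its
# VAE reconstruction. encode_decode is a placeholder.
import numpy as np

def converged(selected_prev, selected_curr):
    # Consistent membership across two consecutive clusterings ends training
    return set(selected_prev) == set(selected_curr)

def anomaly_score(x, encode_decode):
    x_rec = encode_decode(x)
    return np.sqrt(np.sum((x - x_rec) ** 2))   # h(x) = d(x, x')

identity = lambda v: v                          # placeholder "perfect" VAE
x = np.ones(20)
print(converged([1, 2, 3], [3, 2, 1]))  # True
print(anomaly_score(x, identity))       # 0.0
```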

Dataset
The data employed in this study consist of wind turbine operational data collected from the supervisory control and data acquisition (SCADA) system installed in the wind farm. For training the abnormal-data detection model, wind power data spanning January 2019 to December 2020 are used as the training dataset. The model is then tested on wind power data recorded from January to August 2021 to identify dates characterized by abnormal data patterns.

Algorithm parameters setting
The parameter settings of the algorithm are shown in Table 1.

Training result
In this paper, the training methodology employs nested loops: supervised signals are generated by incorporating clustering techniques, and the process iterates by alternating between assuming candidate subsets of normal data and conducting representation learning. After each clustering step, the VAE is trained iteratively, and the membership of the input set is re-evaluated every r iterations, until the results of two consecutive clusterings are consistent and the cycle ends.
Figure 2 shows the change in the loss function during model training. As shown in Figure 2, after 600 iterations the loss value dropped to its lowest level and gradually stabilized as the number of iterations increased. After 800 iterations, the loss value settled within a certain range, oscillating within a narrow band of about five points. After 1050 iterations, two consecutive clustering results were consistent and training ended.

Testing result
The wind power measurement data are fed into the trained model, and the anomaly score is determined by assessing the disparity between the data generated by the model and the actual input wind power data. When the anomaly score exceeds the threshold, the input wind power data can be judged to contain an anomaly. This paper tests wind power data recorded from January to August 2021, identifying dates containing abnormal wind power data by calculating an anomaly score for each day in this period. The test results are shown in Figure 3. As shown in Figure 3, among the 226 days of wind power data from January 1 to August 14, 2021, a total of 12 anomaly scores exceeding the threshold were detected. From the scoring mechanism, the dates with abnormal wind power data are: January 26, February 8, February 9, February 14, February 15, February 16, March 31, May 9, July 10, July 23, July 24, and August 14.
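The thresholding step described above reduces to flagging the days whose score exceeds the threshold; the scores and threshold below are synthetic values, not the paper's.

```python
# Illustrative thresholding step: flag days whose anomaly score exceeds
# a threshold. Scores and threshold are synthetic placeholders.
import numpy as np

daily_scores = np.array([0.2, 0.9, 0.3, 1.4, 0.1])
threshold = 0.8
abnormal_days = np.where(daily_scores > threshold)[0]
print(abnormal_days)  # days at indices 1 and 3 exceed the threshold
```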

Conclusions
Wind farm power data include a significant volume of anomalous data points, making it challenging to accurately represent the actual wind energy conditions of the wind farm; this hinders the normal development of wind farm condition monitoring and power prediction. To address this problem, this paper constructs a wind power anomaly detection model based on STFT, VAE, and clustering methods. First, the STFT transforms the wind power data from the time domain to the frequency domain. Then, a VAE is used as a generator, with K-means clustering guiding the VAE in adjusting its training parameters. An analysis of the wind power data for 2021 found a total of 12 days with abnormal wind power data up to August 14, 2021.

Figure 2. Changes in loss value during training.
Ck represents the amplitude of the k-th frequency component, and STFT(t, fk) represents the STFT value at frequency fk at time t. X = {xi}, i = 1, 2, …, N, x ∈ ℝ^k, represents the collection of input data points, which contains a certain proportion of abnormal values.