Improving SOH estimation for lithium-ion batteries using TimeGAN

Recently, the xEV market has been expanding as regulations on fossil fuel vehicles are strengthened. It is essential to ensure the safety and reliability of batteries, one of the core components of xEVs, and estimating the battery's state of health (SOH) is therefore critical. There are model-based and data-based methods for SOH estimation. Model-based methods have limitations in linearly modeling the nonlinear internal state changes of batteries. In data-based methods, high-quality datasets containing large quantities of data are crucial. Since obtaining battery datasets through measurement is difficult, this paper supplements insufficient battery datasets using a time-series generative adversarial network (TimeGAN) and compares the improvement rate in SOH estimation accuracy using long short-term memory (LSTM) and gated recurrent unit (GRU) models, both based on recurrent neural networks. According to the results, the average root mean square error of battery SOH estimation improved by approximately 25%, and the learning stability improved by approximately 40%.


Introduction
Recently, regulations on fossil fuel vehicles in the automotive industry have been strengthened owing to emerging environmental issues, and the xEV (electric vehicles, hydrogen fuel vehicles, hybrid vehicles, etc.) market has been expanding. Batteries, one of the core components of xEVs, can affect the efficiency and safety of the system depending on their condition and management methods. As a result, the need has arisen to monitor the battery's condition through metrics such as the state of health (SOH) and the state of charge (SOC) to ensure its safety and reliability.
SOH is defined as the maximum capacity that a fully charged battery can discharge at a specific point in time, expressed as a percentage of its initial (rated) capacity. Because SOH is an important indicator of the aging state of the battery, SOH estimation methods are being actively researched. These methods are broadly divided into model-based estimation methods [1][2][3] and data-based estimation methods [4][5][6].
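As a minimal illustration of this definition, the sketch below computes SOH as a percentage. The function name `soh_percent`, its parameter names, and the capacity values are hypothetical and chosen only for illustration; the reference capacity is assumed here to be the rated capacity.

```python
def soh_percent(current_max_capacity_ah: float, rated_capacity_ah: float) -> float:
    """Return SOH (%) as the present full-charge capacity relative to the rated capacity."""
    if rated_capacity_ah <= 0:
        raise ValueError("rated capacity must be positive")
    return 100.0 * current_max_capacity_ah / rated_capacity_ah


# Illustrative values only: a cell rated at 2.0 Ah that now delivers 1.6 Ah.
print(soh_percent(1.6, 2.0))
```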
Model-based SOH estimation analyzes the physical and chemical principles of batteries, models them mathematically, and applies estimation filters based on the model. The SOH of a battery has been estimated using a Kalman filter (KF) based on the Thevenin battery model [1], the unscented particle filter algorithm [2], and a dual extended KF [3]. However, model-based SOH estimation is limited because complex nonlinear internal state changes must be linearized in order to be modeled.
Data-based SOH estimation studies aim to estimate SOH by reflecting the characteristics of the data without considering the complex state changes inside the battery [7][8][9]. Cui and Joe used a dynamic spatiotemporal attention-based gated recurrent unit (GRU) model [4], Gu et al combined a convolutional neural network (CNN) and a transformer [5], and Li et al used an active state tracking long short-term memory neural network (AST-LSTM NN) for estimating SOH and predicting remaining useful life [6]. Shen et al used an extreme learning machine and voltage estimated through the whale optimization algorithm to assess SOH [8]. Avkhimenia used reinforced training for battery action [10]. Lee et al used five types of CNN models for SOH estimation [11], whereas Lin et al used semi-supervised learning for the same purpose [12]. Various data-based SOH estimation algorithms are being researched to improve the accuracy of SOH estimation, and high-quality datasets containing large quantities of data are imperative to leverage the full potential of these methods.
Research is being conducted to improve the performance of estimation and classification algorithms by generating a large amount of high-quality data using TimeGAN. For example, [20] increased prediction accuracy by generating hard-to-collect tire wear data, and [21] improved prediction accuracy by supplementing heating system data whose characteristics differ depending on the conditions. Additionally, [22] resolved the imbalance of bearing failure data using data generation algorithms to improve failure classification accuracy. Therefore, research is needed on improving estimation and classification accuracy by augmenting open-source datasets that are limited in quantity, such as battery datasets.
This study aimed to improve battery SOH estimation accuracy by using TimeGAN to supplement the dataset, a crucial factor affecting learning. Since it is difficult to collect battery data, we secured a high-quality dataset containing a large quantity of data through TimeGAN and compared the SOH estimation accuracy improvement rate using LSTM [23] and GRU [24] models, both based on recurrent neural networks. In this way, we aimed to verify the usefulness of the technique proposed in this paper, which improves SOH estimation accuracy by securing datasets.

Research outline
The research outline is presented in figure 1. First, synthetic battery data were generated using TimeGAN [19]. Next, two input datasets were prepared, and LSTM and GRU models were used to estimate the SOH from each. One input was the original dataset, and the other was the original dataset with synthetic data added. The accuracy of SOH estimation was compared using the root mean square error (RMSE) metric.

Data preprocessing
The data used in this study were from the NASA Randomized Battery Dataset [25]. We focused on the data of cells RW9, RW10, RW11, and RW12 belonging to Group 1. Since random charge-discharge cycles exhibit nonlinearity and transitional intervals, acquiring SOH data becomes challenging. Therefore, we utilized battery data from a reference charge-discharge cycle to address these limitations [26][27][28][29].
Three preprocessing steps were performed on the original data for training the time series data augmentation algorithm.
First, normalization was performed to achieve a fast optimization speed and improve learning accuracy. Normalization lets the temperature, current, and voltage characteristics used as inputs to the algorithm be reflected evenly, without bias from their scales; without normalization, parameters with large scales have a disproportionate influence. Therefore, this study used min-max normalization, as shown in equation (1),
$$x_{new} = \frac{x - x_{min}}{x_{max} - x_{min}} \tag{1}$$
Here, $x_{new}$ is the normalized data, and $x$ is the original data. $x_{max}$ and $x_{min}$ represent the maximum and minimum values of the original data, respectively.
Second, through sampling, the data were converted into three dimensions to fit the input format of the TimeGAN model, as shown in figure 2. The 3D data reflect temporal and static features by including voltage, current, temperature, and SOH data within a single sequence. Therefore, this preprocessing method allows various characteristics of battery data to be reflected during TimeGAN learning.
Third, 60% of the battery data, along with the synthesized data generated by TimeGAN, were used as the training dataset for the proposed SOH estimation algorithm. The remaining 40% of the data served as the test dataset, enabling the validation of the algorithm's accuracy.
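The three preprocessing steps can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: the function names are hypothetical, the conversion to three dimensions is assumed to be an overlapping sliding window, and the 60/40 split is assumed to be chronological, since the text does not specify how the samples were selected.

```python
from typing import List, Sequence, Tuple

Row = List[float]  # one time step: [voltage, current, temperature, soh]


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale one feature column to [0, 1] with min-max normalization (equation (1))."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_columns(rows: Sequence[Row]) -> List[Row]:
    """Normalize every feature column independently and return rows again."""
    columns = [min_max_normalize(col) for col in zip(*rows)]
    return [list(row) for row in zip(*columns)]


def to_sequences(rows: Sequence[Row], seq_len: int) -> List[List[Row]]:
    """Cut a 2D series into overlapping windows: (n_windows, seq_len, n_features).

    Assumption: a sliding window with stride 1 stands in for the sampling step.
    """
    return [list(rows[i:i + seq_len]) for i in range(len(rows) - seq_len + 1)]


def split_train_test(original: Sequence, synthetic: Sequence,
                     train_fraction: float = 0.6) -> Tuple[list, list]:
    """Train on the first 60% of the original data plus all synthetic data;
    test only on the remaining original data."""
    cut = round(len(original) * train_fraction)
    return list(original[:cut]) + list(synthetic), list(original[cut:])
```

Note that in this sketch synthetic samples are added only to the training set, so the test set contains measured data exclusively.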

TimeGAN
GAN is a representative algorithm for data augmentation [13], from which various algorithms have been derived in the image domain (DCGAN [14], WGAN [15], LSGAN [16], etc.) and the time series domain (C-RNN-GAN [17], RCGAN [18], TimeGAN [19], etc.). The basic GAN model consists of a generator and a discriminator, as shown in figure 3. The generator uses random noise as input to create synthetic data, and the discriminator classifies original data and synthetic data. Equation (2) expresses the roles of the generator and discriminator, and the GAN algorithm aims to generate synthetic data similar to the original data,
$$\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \tag{2}$$
Here, $D$ and $G$ represent the discriminator and generator, respectively, $V$ represents the value function of the discriminator and generator, and $\mathbb{E}$ represents the expected value. When the discriminator $D$ classifies the original data $x$ as true and synthetic data $G(z)$ as false, the value function $V(D, G)$ has its maximum value. The discriminator learns in the direction where the value function is maximized, whereas the generator learns in the direction where it is minimized.
Based on this, the TimeGAN model was proposed to generate time series data reflecting temporal characteristics. The TimeGAN structure, as shown in figure 4, adds an autoencoder architecture to the GAN algorithm, enabling the learning of temporal dynamics in smaller dimensions. In the autoencoder, data are reduced and restored through embedding and recovery functions, and in GAN, data are generated and distinguished in reduced dimensions to create data similar to the original data.
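To make the behavior of the value function concrete, the following sketch estimates the standard GAN value function, V(D, G) = E[log D(x)] + E[log(1 - D(G(z)))], from a handful of discriminator outputs. The function name `gan_value` and the probability values are hypothetical and serve only as an illustration.

```python
import math
from typing import Sequence


def gan_value(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """Sample-average estimate of the GAN value function V(D, G).

    d_real: discriminator outputs D(x) on original samples, each in (0, 1).
    d_fake: discriminator outputs D(G(z)) on synthetic samples, each in (0, 1).
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


confident = gan_value([0.99, 0.98], [0.01, 0.02])  # D separates the two sets well
fooled = gan_value([0.5, 0.5], [0.5, 0.5])         # D cannot tell them apart
print(confident > fooled)
```

The value approaches its maximum of 0 as the discriminator becomes perfect, and it drops to -2 log 2 when the generator fools the discriminator completely, which is the direction in which the generator is trained.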

Embedding and recovery
The autoencoder in figure 4 maps the characteristics of the time series to a lower dimension through the embedding function and recovers the data generated in the lower dimension back to the original dimension through the recovery function. The embedding function in the autoencoder maps the temporal characteristics of the data to the latent space in the reduced dimension. In this case, $s$ represents the static feature space of the data in the original dimension, and $x$ represents the temporal feature space. The static and temporal feature spaces in the latent space are represented by $h_s$ and $h_t$, respectively. The embedding function is shown in equations (3) and (4). The recovery function restores the reduced data to the original dimension for data learning. In detail, the static feature space ($h_s$) and the temporal feature space ($h_t$) expressed in the latent space are restored to the original dimension's static feature space ($s$) and temporal feature space ($x$), respectively. The recovery function is shown in equations (5) and (6).
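For reference, the embedding and recovery functions can be written as follows in the notation of the original TimeGAN formulation [19]. The function names $e_s$, $e_x$, $r_s$, and $r_x$ and the tilde notation for reconstructed features are taken from that formulation and are an assumption about the notation intended in equations (3)-(6).

```latex
\begin{align}
h_s &= e_s(s) \tag{3} \\
h_t &= e_x(h_s, h_{t-1}, x_t) \tag{4} \\
\tilde{s} &= r_s(h_s) \tag{5} \\
\tilde{x}_t &= r_x(h_t) \tag{6}
\end{align}
```

Here the temporal embedding at time $t$ depends on the static latent code, the previous temporal latent code, and the current temporal feature, which is what allows the latent space to carry temporal dynamics.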

Generator and discriminator
As shown in figure 4, GAN generates and discriminates data in the reduced-dimension latent space through the generator and discriminator. The generator function takes random noise as input to create synthetic data. The static feature space in the latent space uses random noise as input. The temporal feature space learns the synthetic static feature space and the synthetic temporal feature space of the previous time point in the latent space, as well as the temporal feature space in the original dimension, as shown in equations (7) and (8). The discriminator function, which classifies synthetic and original data, is shown in equations (9) and (10).
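For reference, the generator and discriminator functions of the original TimeGAN formulation [19] are reproduced below. The symbols are those of [19] and are an assumption about the notation intended in equations (7)-(10): $z_s$ and $z_t$ are random vectors, hats denote synthetic latent codes, tildes denote either real or synthetic latent codes, and $\overrightarrow{u}_t$ and $\overleftarrow{u}_t$ are the forward and backward hidden states of a bidirectional recurrent network. Note that in [19] the per-step input to the temporal generator is the random vector $z_t$.

```latex
\begin{align}
\hat{h}_s &= g_s(z_s) \tag{7} \\
\hat{h}_t &= g_x(\hat{h}_s, \hat{h}_{t-1}, z_t) \tag{8} \\
\tilde{y}_s &= d_s(\tilde{h}_s) \tag{9} \\
\tilde{y}_t &= d_x(\overleftarrow{u}_t, \overrightarrow{u}_t) \tag{10}
\end{align}
```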

SOH estimation
We generated a synthetic dataset using TimeGAN, used it to supplement the dataset, and then compared the improvement in SOH estimation accuracy.We utilized representative recurrent neural network models, LSTM and GRU, to estimate SOH.

LSTM
LSTM is an algorithm designed to solve the vanishing gradient problem of recurrent neural networks, with the structure shown in figure 5 [23]. It uses the cell state to carry information from before the current time point and assigns weights to the data through three gates (forget gate, input gate, and output gate). The forget gate determines the importance of past data and assigns weights. The input gate decides how much of the current input and the previous hidden state to reflect. The output gate determines the weight when transmitting to the hidden state.
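As shown in equation (11), the forget gate assigns appropriate weights to the values of $h_{t-1}$ and $x_t$ and delivers them to cell state $C_{t-1}$. Meaningful data receive weights close to 1, while unimportant data receive weights close to 0,
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f) \tag{11}$$
In the input gate, the new cell state $C_t$ is created by updating the cell state through $i_t$ and $\tilde{C}_t$, as shown in equations (12)-(14),
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i) \tag{12}$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C) \tag{13}$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t \tag{14}$$
In the output gate, the output of $C_t$ is determined, as shown in equations (15) and (16),
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o) \tag{15}$$
$$h_t = o_t \odot \tanh(C_t) \tag{16}$$
Here, $\sigma$ is the sigmoid function, $W$ and $b$ are the weight matrix and bias vector of each gate, and $\odot$ denotes element-wise multiplication.
The LSTM model configuration used for SOH estimation is shown in figure 6 and consists of an input layer, LSTM layer, dropout, and dense layer.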
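The gate computations described above can be sketched as a single LSTM time step in plain Python. This is an illustrative sketch of the standard LSTM cell, not the model used in this study; the function names, the parameter dictionary keys, and the all-zero weights in the usage example are hypothetical.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Return W @ v + b for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t: Vector, h_prev: Vector, c_prev: Vector,
              p: Dict[str, list]) -> Tuple[Vector, Vector]:
    """One LSTM time step; returns (h_t, C_t)."""
    v = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(p["W_f"], p["b_f"], v)]        # forget gate
    i = [sigmoid(a) for a in affine(p["W_i"], p["b_i"], v)]        # input gate
    c_hat = [math.tanh(a) for a in affine(p["W_c"], p["b_c"], v)]  # candidate state
    c_t = [ft * cp + it * ch for ft, cp, it, ch in zip(f, c_prev, i, c_hat)]
    o = [sigmoid(a) for a in affine(p["W_o"], p["b_o"], v)]        # output gate
    h_t = [ot * math.tanh(ct) for ot, ct in zip(o, c_t)]
    return h_t, c_t


# Hidden size 1 and three inputs (voltage, current, temperature); all-zero weights.
zeros = {k: [[0.0] * 4] for k in ("W_f", "W_i", "W_c", "W_o")}
zeros.update({k: [0.0] for k in ("b_f", "b_i", "b_c", "b_o")})
h_t, c_t = lstm_step([0.5, 0.1, 0.3], [0.0], [2.0], zeros)
```

With all-zero weights every gate outputs 0.5 and the candidate state is 0, so the previous cell state is simply halved, which makes the role of the forget gate easy to see.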

GRU
GRU is an algorithm proposed to simplify LSTM [24]. It combines LSTM's cell state and hidden state and integrates the forget gate and input gate, so that it consists of only an update gate and a reset gate, as shown in figure 7. Its characteristic feature is the reduced computational load compared to LSTM. The reset gate determines the weight of the previous information, and the update gate decides how much information to reflect through the tanh function.
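In the reset gate, the weights of the previous hidden state $h_{t-1}$ and the current input $x_t$ are determined, as shown in equations (17) and (18),
$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \tag{17}$$
$$\tilde{h}_t = \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h) \tag{18}$$
The update gate judges the reflection ratio of past and current information and assigns weights, as shown in equations (19) and (20),
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \tag{19}$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \tag{20}$$
Here, $\sigma$ is the sigmoid function, $W$ and $b$ are the weight matrix and bias vector of each gate, and $\odot$ denotes element-wise multiplication.
The GRU model configuration used for SOH estimation is shown in figure 8 and consists of an input layer, GRU layer, dropout, and dense layer.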
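The reset and update gates can likewise be sketched as a single GRU time step in plain Python. This is an illustrative sketch of a standard GRU cell, not the model used in this study; the function names and parameter dictionary keys are hypothetical, and the convention in which the update gate weights the candidate state is an assumption (some references swap the roles of the update gate and its complement).

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def affine(W: Matrix, b: Vector, v: Vector) -> Vector:
    """Return W @ v + b for plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def gru_step(x_t: Vector, h_prev: Vector, p: Dict[str, list]) -> Vector:
    """One GRU time step; returns the new hidden state h_t."""
    v = h_prev + x_t  # list concatenation [h_{t-1}, x_t]
    r = [sigmoid(a) for a in affine(p["W_r"], p["b_r"], v)]  # reset gate
    z = [sigmoid(a) for a in affine(p["W_z"], p["b_z"], v)]  # update gate
    gated = [rt * hp for rt, hp in zip(r, h_prev)] + x_t     # [r_t * h_{t-1}, x_t]
    h_hat = [math.tanh(a) for a in affine(p["W_h"], p["b_h"], gated)]  # candidate
    # Blend the previous hidden state and the candidate using the update gate.
    return [(1.0 - zt) * hp + zt * hh for zt, hp, hh in zip(z, h_prev, h_hat)]


# Hidden size 1 and three inputs (voltage, current, temperature); all-zero weights.
zeros = {k: [[0.0] * 4] for k in ("W_r", "W_z", "W_h")}
zeros.update({k: [0.0] for k in ("b_r", "b_z", "b_h")})
h_new = gru_step([0.5, 0.1, 0.3], [0.8], zeros)
```

Compared with the LSTM, there is no separate cell state: the single hidden state is both the memory and the output, which is the source of the reduced computational load.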

Performance evaluation
The SOH estimation accuracy improvement method proposed in this paper was evaluated in two aspects.We assessed the similarity between the synthetic battery dataset generated through TimeGAN and the original battery dataset and compared the SOH estimation accuracy improvement rate using the synthetic battery dataset similar to the original one.
First, to quantitatively evaluate the similarity of the generated time series data features, the original and synthetic data were represented using t-distributed stochastic neighbor embedding (t-SNE). To compare the similarity of datasets in t-SNE, we applied the rate of change in the correlation coefficient of linear regression and the silhouette coefficient [30], which are quantitative indicators.
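A minimal sketch of the first indicator is given below. It assumes that the points have already been reduced to two dimensions (for example by t-SNE) and that the rate of change is the relative difference, in percent, between the Pearson correlation coefficient of the original points and that of the original points with synthetic points added. This definition and the function names are assumptions made for illustration; the exact formula used in this study is not stated in the text.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def correlation_change_percent(original: Sequence[Point],
                               augmented: Sequence[Point]) -> float:
    """Relative change (%) of the correlation coefficient after augmentation."""
    r_orig = pearson_r(*zip(*original))
    r_aug = pearson_r(*zip(*augmented))
    return abs(r_aug - r_orig) / abs(r_orig) * 100.0


original = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0)]
on_trend = original + [(3.0, 6.0)]   # synthetic point that follows the same line
off_trend = original + [(3.0, 0.0)]  # synthetic point that breaks the trend
print(correlation_change_percent(original, on_trend))
print(correlation_change_percent(original, off_trend))
```

A synthetic point that follows the trend of the original data leaves the coefficient essentially unchanged, whereas a point that departs from the trend changes it markedly.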
Next, to compare the SOH estimation accuracy improvement rate, we used RMSE. RMSE is a representative indicator for estimation accuracy evaluation and is shown in equation (21),
$$RMSE = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2} \tag{21}$$
Here, $y_i$ and $\hat{y}_i$ represent the actual and predicted values, respectively, and $n$ denotes the number of data points.
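The RMSE metric can be computed as in the following short sketch. The function name `rmse` and the sample values are hypothetical.

```python
import math
from typing import Sequence


def rmse(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Root mean square error between actual and predicted SOH values."""
    if len(actual) != len(predicted) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(actual))


# Illustrative SOH values (as fractions) for three test points.
print(rmse([0.95, 0.90, 0.85], [0.94, 0.91, 0.83]))
```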

Results and analysis
We evaluated the proposed SOH estimation accuracy improvement method through TimeGAN in two aspects. First, we applied the rate of change in the correlation coefficient and the silhouette coefficient to evaluate the similarity between the synthetic dataset generated through TimeGAN and the original dataset. Then, we estimated SOH using LSTM and GRU algorithms with the synthetic battery dataset to evaluate the improvement rate of SOH estimation accuracy.

Analysis of synthetic data
Figure 9 compares the synthesized and original battery datasets. It can be observed that, when the SOH values are the same, the generated voltage, current, and temperature are also similar.
We assessed the similarity between the synthetic and original battery datasets using the rate of change in the correlation coefficient of linear regression and the silhouette coefficient. First, we reduced the dimensions using the t-SNE dimensionality reduction technique and applied quantitative evaluation indicators.
In the t-SNE plot in figure 10, the black points represent the original battery dataset, while the red points represent the synthetic battery dataset. The solid black line is the linear regression line of the original dataset, and the solid blue line is the linear regression line of the dataset with synthetic data added to the original dataset.
First, we judged the similarity using the rate of change in the correlation coefficient of linear regression. Before using this indicator, we used the coefficient of determination to check whether the linear regression equation could sufficiently explain the data. Since the battery dataset exhibits linearity in t-SNE, it is possible to evaluate the similarity of the data using the rate of change in the correlation coefficient of linear regression. Accordingly, we calculated the rate of change between the correlation coefficients of the original dataset and the dataset with added synthetic data. According to the results, the rate of change in the correlation coefficient was sufficiently low, confirming similarity with the original dataset.
Next, we evaluated the similarity of the data using the silhouette coefficient. The closer the silhouette coefficient is to 0, the more the synthetic data overlap with, and thus resemble, the original data; the closer it is to 1, the more distinct the two sets of data are. The silhouette coefficient of the original battery dataset and synthetic battery dataset was close to 0, demonstrating that the latter reflects the characteristics of the original dataset.
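The interpretation of the silhouette coefficient can be illustrated with the following sketch, which treats the original and synthetic points as two labeled groups and averages the standard per-point silhouette value. It assumes that the points are already two-dimensional (for example after t-SNE) and that each group contains at least two points; the function names and the sample points are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def mean_distance(p: Point, others: Sequence[Point]) -> float:
    """Mean Euclidean distance from p to every point in others."""
    return sum(math.dist(p, q) for q in others) / len(others)


def silhouette_two_groups(original: List[Point], synthetic: List[Point]) -> float:
    """Mean silhouette coefficient with 'original' and 'synthetic' as the two clusters."""
    scores = []
    for own, other in ((original, synthetic), (synthetic, original)):
        for idx, p in enumerate(own):
            rest = own[:idx] + own[idx + 1:]  # own group without the point itself
            a = mean_distance(p, rest)        # cohesion within its own group
            b = mean_distance(p, other)       # separation from the other group
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


original = [(0.0, 0.0), (2.0, 0.0), (4.0, 0.0), (6.0, 0.0), (8.0, 0.0)]
interleaved = [(1.0, 0.0), (3.0, 0.0), (5.0, 0.0), (7.0, 0.0), (9.0, 0.0)]
far_away = [(x, 100.0) for x, _ in original]
print(silhouette_two_groups(original, interleaved))
print(silhouette_two_groups(original, far_away))
```

When the synthetic points are interleaved with the original ones, the coefficient stays near 0, whereas a synthetic set located far from the original one yields a value near 1.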
We confirmed the similarity between the synthetic and original battery datasets through two evaluation indicators, as shown in table 1, and verified its applicability in battery SOH estimation algorithms.

Analysis of SOH estimation
We applied LSTM and GRU models to verify the impact of the synthetic battery dataset generated through TimeGAN on the improvement of SOH estimation accuracy. We conducted ten repeated tests for each cell to obtain reliable test results. Each plot represents the SOH estimation results for one cell, comparing, in terms of RMSE, the models trained on the original dataset and on the dataset with added synthetic data. We used boxplots to summarize the RMSE by its mean, median, and interquartile range (IQR).
In the boxplots, the green line represents the median, and the green triangle represents the mean. The blue box indicates the IQR of the data, representing the range between the 25th and 75th percentiles. The IQR has a similar meaning to variance but is characterized by being hardly influenced by outliers.
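The boxplot statistics can be computed from repeated RMSE results as in the sketch below. The function name `boxplot_summary` and the sample values are hypothetical, and the quartiles are computed with the inclusive (linear interpolation) method, which is an assumption about how the boxes were drawn.

```python
import statistics
from typing import Dict, Sequence


def boxplot_summary(rmse_values: Sequence[float]) -> Dict[str, float]:
    """Return the mean, median, quartiles, and IQR of repeated RMSE results."""
    q1, median, q3 = statistics.quantiles(rmse_values, n=4, method="inclusive")
    return {
        "mean": statistics.fmean(rmse_values),
        "median": median,
        "q1": q1,
        "q3": q3,
        "iqr": q3 - q1,
    }


typical = boxplot_summary([1.0, 2.0, 3.0, 4.0, 5.0])
with_outlier = boxplot_summary([1.0, 2.0, 3.0, 4.0, 50.0])
print(typical["iqr"], with_outlier["iqr"])
print(typical["mean"], with_outlier["mean"])
```

Replacing one value with an outlier shifts the mean considerably while the IQR is unchanged, which is why the IQR is used here as a measure of learning stability.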
The results of battery SOH estimation accuracy and improvement rate using LSTM and GRU models are shown in figures 11 and 12 and in table 2.
First, the SOH estimation results using LSTM are shown in figure 11. The mean and median RMSE values decreased for all cells, indicating that the synthetic dataset generated through TimeGAN can improve SOH estimation accuracy. Based on the mean values, the highest accuracy improvement rate was 25.43% for RW9 data, while the lowest rate was 5.29% for RW11 data. Moreover, the improved IQR confirmed that the proposed method also affects the stability of SOH estimation results.
Next, the battery SOH estimation results using GRU are shown in figure 12. The mean and median RMSE values decreased for all cells, indicating that the synthetic dataset generated through TimeGAN can improve SOH estimation accuracy. In particular, based on the mean values, the highest accuracy improvement rate was 18.11% for RW12 data, and the lowest rate was 5.22% for RW10 data. Additionally, the IQR either improved or remained similar, confirming its influence on the improvement of learning stability.
As shown in table 2, with the dataset secured through TimeGAN, the estimation accuracy improved for all cells and both algorithms. The accuracy improvement rate varied with the characteristics of the original dataset, and because accuracy also depends on the estimation algorithm and learning environment, the cells with the highest and lowest improvement differed between algorithms. This confirms that, although the most suitable estimation algorithm may differ depending on the data characteristics, SOH estimation accuracy can be improved by securing datasets through data augmentation algorithms. In addition, with the significant increase in data provided by the synthetic dataset, the SOH estimation accuracy improved in all cells, and the stability of the SOH estimation results was also affected.
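The improvement rates reported above can be understood through the following sketch, which assumes that the improvement rate is the relative reduction of the mean RMSE in percent. This definition, the function name, and the sample values are assumptions made for illustration and are not taken from table 2.

```python
def improvement_rate_percent(rmse_original: float, rmse_augmented: float) -> float:
    """Relative RMSE reduction (%) obtained by adding synthetic data to the training set."""
    if rmse_original <= 0:
        raise ValueError("the baseline RMSE must be positive")
    return (rmse_original - rmse_augmented) / rmse_original * 100.0


# Illustrative values only: the mean RMSE drops from 0.040 to 0.030.
print(improvement_rate_percent(0.040, 0.030))
```

Under this definition, a positive rate means that training with the augmented dataset lowered the error, and a negative rate would mean that it raised the error.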

Conclusion
This study investigated a method to improve SOH estimation accuracy by securing a high-quality dataset containing a large quantity of data through TimeGAN. To augment the battery dataset, we applied the TimeGAN model and then verified the similarity between the generated synthetic dataset and the original dataset to ensure high quality. When the confirmed high-quality synthetic battery dataset was added to the original dataset for SOH estimation, the battery SOH estimation accuracy improved for all cells.
(1) A synthetic battery dataset with similar characteristics to the original battery dataset can be generated using TimeGAN.
(2) The addition of a synthetic battery dataset can improve SOH estimation accuracy. It was confirmed that utilizing high-quality synthetic data in the SOH estimation algorithm learning improved the SOH estimation accuracy.
(3) The dataset expanded through synthetic data can improve the stability of SOH estimation algorithm learning. It was confirmed that utilizing a large amount of synthetic data for SOH estimation algorithm learning resolved the problems of overfitting or underfitting, resulting in stable learning outcomes.

Figure 9. Similarity between original and synthetic data.

Figure 10. Linear regression of four battery data in t-SNE.

Table 1. Evaluation of synthetic data.

Table 2. Evaluation of SOH estimation (Bold indicates best performance).