Statistical learning for accurate and interpretable battery lifetime prediction

Data-driven methods for battery lifetime prediction are attracting increasing attention for applications in which the degradation mechanisms are poorly understood and suitable training sets are available. However, while advanced machine learning and deep learning methods promise high performance with minimal data preprocessing, simpler linear models with engineered features often achieve comparable performance, especially for small training sets, while also providing physical and statistical interpretability. In this work, we use a previously published dataset to develop simple, accurate, and interpretable data-driven models for battery lifetime prediction. We first present the "capacity matrix" concept as a compact representation of battery electrochemical cycling data, along with a series of feature representations. We then create a number of univariate and multivariate models, many of which achieve comparable performance to the highest-performing models previously published for this dataset. These models also provide insights into the degradation of these cells. Our approaches can be used both to quickly train models for a new dataset and to benchmark the performance of more advanced machine learning methods.


Battery lifetime prediction has many applications throughout the battery product cycle.
Some examples include screening of new electrodes, electrolytes, and cell designs during research and development; optimizing cell designs, cycling protocols for formation (the final step in cell manufacturing), and fast charging protocols; and predicting cycle life, remaining capacity, and the likelihood of a safety-threatening event in the context of use and reuse in the field. Critically, accurate battery lifetime estimation is needed to set warranty cost estimates for electric vehicle and grid storage applications, as reducing the uncertainty in the warranty cost will reduce the cost of battery deployments. Given the many applications of lifetime prediction within the battery product cycle, this research direction is of crucial importance to the battery community. While improving our first-principles understanding and modeling of battery degradation is an imperative research direction 1,2 , data-driven approaches to battery lifetime prediction 3-10 are increasingly exciting for applications in which the degradation mechanisms are challenging to model and a suitable training dataset is available.
Desirable attributes of data-driven models for battery lifetime prediction include high accuracy, a low number of cycles required for prediction, small training set sizes, and high interpretability. The first three of these criteria are attributes related to model performance, and the final criterion engenders trust in the model. The use of sophisticated machine learning methods is undoubtedly an exciting research direction, as these methods can produce high-performing models with minimal domain expertise (which is required to design predictive features). In particular, advanced machine learning and deep learning methodologies often excel at capturing information from high-dimensional datasets 11 , which in many ways makes them an ideal fit for high-dimensional battery cycling datasets, i.e., voltage vs. capacity as a function of cycle number. However, deep learning models are often brittle, i.e., small changes in input data can lead to dramatic differences in their predictions 12,13 , and generally require large training set sizes. 11 Furthermore, these methods often produce models that are not easily interpretable, as the relationship between the input and output data is convoluted. Interpretability helps confirm the model is behaving reasonably and, in some situations, can even elucidate the underlying physics of the task at hand. Thus, interpretability is a useful property for machine learning models applied to scientific domains, particularly for experimentally generated datasets (which unavoidably have "real-world" issues). While developing frameworks to explain advanced machine learning and deep learning methods is an active area of research 14,15 , explaining black box models has inherent limitations compared to using intrinsically interpretable approaches. 16 For instance, linear models using "engineered" (i.e., curated) input features that are downselected via regularization 17 are often interpretable while still achieving high predictive performance, especially with small- or medium-sized datasets. Throughout this work, we refer to classical methods that produce these types of models as "statistical learning" methods. 17 Models generated via statistical learning methods can be trained quickly and can be used to benchmark the performance of models generated via more complex methods. Additionally, the interpretability of these models can aid in generalizing across training sets and in developing physics-informed data-driven models 18,19 . While different data-driven approaches may be useful in different contexts, statistical learning for battery lifetime prediction remains underexplored in the literature.
Severson et al. 3 developed statistical learning models to predict the lifetime of lithium iron phosphate (LFP)/graphite cylindrical cells undergoing ~10-minute fast charging using a training dataset of 41 cells cycled to failure. Degradation during fast charging in commercially relevant form factors is poorly understood 20,21 and thus challenging to model using first principles, making data-driven approaches a suitable alternative given an available training set.
These machine learning models, which were subsequently used for rapid evaluation of fast-charging protocols 22 , achieved ~9% test error using only the first 100 cycles for prediction (12% of the average cycle life). The features, i.e., transformations of the raw data, were sourced from measurements of voltage vs. capacity, capacity vs. cycle number, internal resistance, and can temperature. The most predictive features were derived from the voltage vs. discharge capacity curves, which are among the richest yet most complex data sources; specifically, these features summarized the difference between the voltage vs. discharge capacity curves of the 100th and 10th cycles, denoted ∆Q100−10(V). In fact, a simple linear model with a single feature derived from ∆Q100−10(V) performed nearly as well as the most complex model containing information from all data sources. This result highlighted the value of using features from the voltage vs. capacity data, as opposed to just the capacity vs. cycle number data. More importantly, the high performance of these simple models demonstrated the power of domain-inspired feature engineering coupled with statistical learning methods.
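As an illustration, the variance-based feature can be sketched in a few lines of Python; the curves below are synthetic stand-ins for real data, and the function name is ours, not from Severson et al. 3

```python
import numpy as np

def log_var_feature(q_100, q_10):
    """log10(var(dQ_{100-10}(V))): the single feature of the simple model,
    computed from two discharge-capacity vectors sampled on the same
    fixed voltage grid."""
    dq = q_100 - q_10              # capacity difference at each voltage point
    return np.log10(np.var(dq))

# Synthetic illustration: a faster-degrading cell shows a larger spread in dQ(V)
v = np.linspace(3.5, 2.0, 1000)                 # voltage grid (V)
q_10 = 1.05 * (1 - np.exp(-3 * (3.5 - v)))      # toy cycle-10 discharge curve (Ah)
q_100_long = q_10 - 0.002 * (3.5 - v)           # long-lived cell: small, flat shift
q_100_short = q_10 - 0.02 * (3.5 - v) ** 2      # short-lived cell: larger, curved shift

assert log_var_feature(q_100_short, q_10) > log_var_feature(q_100_long, q_10)
```

A univariate linear model then maps this single scalar feature to log10-transformed cycle life.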
Since publication of the Severson et al. 3 dataset, others have applied advanced machine learning and deep learning methods, including relevance vector machines 5,8 , gradient-boosted regression trees 23 , Gaussian process regression 8 , recurrent neural networks (including long short-term memory networks) 8,24 , and convolutional neural networks 8,25-28 . Many of these works have explored creative approaches, including data augmentation 25 and the use of differential capacity analysis 28 . However, few of these approaches emphasize interpretability. Our belief is that the development of interpretable statistical learning approaches for battery lifetime prediction should be pursued in tandem with state-of-the-art machine learning and deep learning methods that maximize performance.
In this work, we explore statistical learning approaches for developing accurate and interpretable battery lifetime prediction models. We use the same datasets and objectives as Severson et al. 3 for consistency and comparison. We first present the "capacity matrix" concept for compactly representing the changes in cell capacity with respect to voltage and cycle number. We then present a number of dimensionality reduction and model building approaches and apply them to this dataset. While models as simple as a univariate linear model using a single element of ∆Q100−10(V) perform similarly to the best models from Severson et al. 3 , new approaches using the elements of ∆Q100−10(V) further reduce error and provide interesting insights into the relationship between elements in ∆Q100−10(V) and cycle life. Overall, the high accuracy of these simple, interpretable models highlights the effectiveness of statistical learning.
Additionally, features sourced from other transformations of the capacity matrix are generally outperformed by features from ∆Q100−10(V); throughout this work, we discuss both successful and unsuccessful approaches to reduce prediction error. The methods in this work can apply broadly to other battery datasets and can serve as a benchmarking suite for new datasets.

Summary of dataset and previous work
Here, we summarize the dataset and statistical learning approach of Severson et al. 3 , but we refer the reader to the original publication for more information. Table I also summarizes essential details of this study. The dataset consists of 124 LFP/graphite cells cycled in three "batches", i.e., groups of cells cycled simultaneously in a convection oven. These cells were split into a training set (41 cells), a primary test set (43 cells), and a secondary test set (40 cells; generated after model development). Cells were cycled with one of 72 different fast charging protocols (~9-13 minutes from 0-80% state-of-charge), but all cells were identically discharged at 4C (here, C rate refers to the rate required to (dis)charge a cell in 1 hour). In addition to standard electrochemical data like voltage vs. capacity and capacity vs. cycle number, internal resistance was recorded every cycle, and a thermocouple mounted to the cell can continuously recorded the can surface temperature of each cell. To place each discharge curve on a common basis, the voltage vs. capacity data are fit with splines and evaluated at pre-defined voltage points. Note that linear interpolation is an alternative approach that also avoids the small additional error from the spline fit, but here we use the spline fits for consistency with Severson et al. 3 Lin et al. 35 suggest using a smoothing spline with a cross-validated smoothing parameter for sampling voltage-capacity curves, which is advantageous over manual tuning of filtering parameters. Alternatively, the data acquisition rate could be specified to match the desired basis vector.
Using this data processing approach, each discharge curve has 1000 capacity points at 1000 pre-defined voltage points. Thus, if we restrict the available training data to include only the first 100 cycles, each cell has 100,000 voltage-capacity data points. Of course, many of these data are highly correlated, both within a cycle and at the same voltage position across multiple cycles. As such, a primary goal of this work is to explore feature representations that reduce the dimensionality of these data without losing any information content.
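A minimal sketch of this resampling step, using the linear-interpolation alternative noted above (the voltage window and toy curve shape are our assumptions for illustration):

```python
import numpy as np

def capacity_at_fixed_voltages(voltage, capacity, v_grid):
    """Linearly interpolate one measured discharge curve onto a fixed,
    pre-defined voltage grid (the paper uses spline fits; linear
    interpolation is the simpler alternative mentioned in the text)."""
    order = np.argsort(voltage)           # np.interp needs increasing x
    return np.interp(v_grid, voltage[order], capacity[order])

# 1000 pre-defined voltage points spanning an assumed 2.0-3.5 V window
v_grid = np.linspace(2.0, 3.5, 1000)

# Toy measured discharge curve (voltage decreasing during discharge)
v_meas = np.linspace(3.5, 2.0, 400)
q_meas = 1.1 * (1 - np.exp(-2 * (3.5 - v_meas)))

q_fixed = capacity_at_fixed_voltages(v_meas, q_meas, v_grid)
print(q_fixed.shape)  # (1000,)
```

Applying this to each of the first 100 cycles yields the 1000 × 100 grid of capacity values used below.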
We start by introducing a visualization of these 100,000 features, subsequently termed a "capacity matrix", that we found useful for model development. A similar approach has been developed by at least two previous works 36, 37 . Figure 1 presents graphical representations of a capacity matrix. Figure 1a presents voltage vs. discharge capacity for the first 100 cycles for an example cell in the Severson et al. 3 dataset. Given the high discharge rates (4C), the voltage response is smoothed due to heterogeneity 38 , and diagnostic differential capacity and/or differential voltage analysis would not reveal individual peaks that could be assigned to specific electrodes and failure modes. Overall, the curves shift only subtly as a function of cycle number.
In Figure 1b, we present a capacity matrix representation of the data presented in Figure 1a, i.e., discharge capacity at 4C as a function of voltage and cycle number. A key advantage of this matrix representation is that this high-dimensional feature set is stored in a compact and machine-learning-ready format. However, the trends across cycle number are still challenging to perceive. Figure 1c presents the "baseline-subtracted capacity matrix", denoted Qn − Q2, which is simply the matrix presented in Figure 1b subtracted by the second column, i.e., the cycle 2 discharge capacity. We use cycle 2 here as the data from cycle 1 was unavailable due to a data acquisition error. With this representation, the subtle differences between cycles are much clearer. In Figure 1d, we present Qn − Q2 at two selected voltages, i.e., two rows of the baseline-subtracted capacity matrix. Curiously, the trends in these curves are quite linear, with the exception of the first 10 cycles (which we attribute to the diffusion of lithium from the edges to the center of the graphite electrode, commonly observed after an extended rest 29,30 ) and cycles 50-60 (which we attribute to a temperature excursion in the environmental chamber during cycling of this batch). Other cycle numbers could also be selected for the baseline cycle; for instance, we use cycle 10 as a baseline cycle throughout this work as this cycle number avoids the initial rise in capacity seen in Figure 1d. We also explore the "baseline-divided capacity matrix", Qn/Q2, which is the matrix presented in Figure 1b divided by the discharge capacity of cycle 2. Again, the small differences between cycles are now clearer. The region of maximum contrast occurs at a voltage ~300 mV higher for Qn/Q2 than for Qn − Q2.
Figure 1f presents Qn/Q2 at two selected voltages as a function of cycle number. The trends are linear here as well, as they were in Figure 1d. However, the trends are generally noisier and lower in contrast in Qn/Q2 than in Qn − Q2. Additionally, Qn − Q2 is more physically meaningful than Qn/Q2, as discussed in the following paragraph. For these reasons, we focus on feature extraction from the baseline-subtracted capacity matrix, instead of the baseline-divided capacity matrix, in the remainder of this work.
Capacity matrices are clean, compact, machine-learning-ready representations of capacity vs. voltage and cycle number. Furthermore, baseline subtraction or division helps magnify the changes in the electrochemical response with cycle number due to degradation. Note that the sum of all elements in a column of the baseline-subtracted capacity matrix is proportional to the change in discharge energy between the constant-current portions of the cycle of interest and the baseline cycle (i.e., the integral of capacity over voltage); in fact, a single element is proportional to the change in discharge energy between the cycle of interest and the baseline cycle at a given voltage. While not strictly a requirement, the capacity matrix concept applies most naturally if some part of the cycling protocol is held consistent across cycle numbers; capacity matrices could be developed with the charge data in this work as well, but in this case the electrochemical response would be convoluted by the interaction between the intrinsic degradation and the charging protocol. Furthermore, while we focus on the constant-current portion of the cycling protocol in this work, a similar concept could be applied to voltage vs. time during a constant-voltage hold, an open-circuit rest, or any step that is repeated during cycling. This approach may be especially interesting for lithium plating detection during voltage relaxation. 39-42
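Constructing the baseline-subtracted capacity matrix described above can be sketched as follows (array shapes and the toy fade model are our assumptions):

```python
import numpy as np

def baseline_subtracted_matrix(Q, baseline_cycle=2):
    """Q holds one discharge curve per column on a fixed voltage grid
    (shape: n_voltages x n_cycles, with column 0 = cycle 1). Returns
    Qn - Q_baseline, the baseline-subtracted capacity matrix."""
    baseline = Q[:, [baseline_cycle - 1]]   # keep 2-D for broadcasting
    return Q - baseline

# Toy matrix: 1000 voltages x 100 cycles with a small capacity fade per cycle
n_v, n_c = 1000, 100
base_curve = np.linspace(1.1, 0.0, n_v)[:, None]
fade = -1e-4 * np.arange(n_c)[None, :]
Q = base_curve + fade

dQ = baseline_subtracted_matrix(Q, baseline_cycle=2)
assert np.allclose(dQ[:, 1], 0.0)   # the baseline column is all zeros
```

The baseline-divided matrix is the analogous elementwise division, `Q / Q[:, [baseline_cycle - 1]]`.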

Feature generation and statistical learning approach
We now can represent all electrochemical discharge data for the first 100 cycles of each cell via a single capacity matrix with high dimensionality (1000 voltages × 100 cycles). In a sense, these matrices resemble single-channel (i.e., grayscale) images. One seemingly natural method to apply to these data is neural networks, which perform automatic feature learning and excel at high-dimensional data like images. 11 Indeed, these methods have been previously applied to this dataset. Overall, we maintain the same modeling objectives as Severson et al. 3 Identically to Severson et al. 3 , our objective is to predict the log10-transformed cycle life. We evaluate model performance with root-mean-squared error (RMSE), one of two performance metrics used by Severson et al. 3 While reducing the number of cycles used for prediction is of interest, we also maintain the 100-cycle limit here to compare with previous results; Figure 5 of Severson et al. 3 demonstrates that similar predictive performance can be achieved with a similar modeling approach that uses only the first 60 cycles for prediction. Finally, we maintain the same training set and primary/secondary test sets as Severson et al. 3 ; note that we exclude the outlier battery from the primary test set throughout this work.
A key step in statistical learning is feature generation, i.e., proposing meaningful representations of the input data that predict the objective. In this case, where the number of features (100,000) greatly exceeds the number of observations (41), an associated objective is to reduce the dimensionality of the dataset to its "intrinsic dimensionality". 17 Here, many of the elements of the input data are highly correlated, i.e., data at cycle 50 is correlated with data at cycle 51, and data at 3.00 V is correlated with data at 3.01 V. Dimensionality reduction can be performed manually, by discarding some elements of the input data, or via dimensionality reduction methods that find alternative representations of the input data. Throughout this work, we consider both approaches.
One dimension that is straightforward to reduce is the number of voltage points used per discharge curve, as the choice of 1000 points per discharge curve in Severson et al. 3 was arbitrary. Figure 2 presents the effect of the voltage sampling frequency.

Figure 2. (a) Voltage vs. ∆Q100−10(V) for an example cell as a function of the sampling frequency of the voltage points. The essential shape of the curve is maintained even with a sampling frequency of 80 mV, which is 50x less frequent than the 1.6 mV spacing used in Severson et al. 3 (b) Change in RMSE of the univariate log10(var(∆Q100−10(V))) model vs. the number of points used to evaluate the spline for the training, primary test, and secondary test sets. These RMSEs are expressed relative to the RMSE at the default sampling frequency (1.6 mV, 1000 points/curve). Here, new models are trained on the training set, with ∆Q100−10(V) data sampled at each sampling frequency. The change in the absolute value of RMSE exceeds 1% only after the sampling frequency exceeds 40 mV (40 points) for the training and primary test sets and 160 mV (10 points) for the secondary test set. Note that this result is likely sensitive to the fact that the discharge is performed at 4C; constant-current data at lower rates may require a higher sampling frequency to capture the essential shape of the curve.

Here, we focus on univariate linear models with the intention of finding the simplest yet highest-performing models. We consider fourteen summary statistic functions of ∆Q100−10(V) and four feature transformations, for a total of 56 function-transformation pairings. Most of these functions are summary statistics that are often applied to populations, e.g., mean, range, and variance; these functions map the ∆Q100−10(V) vectors to scalars. We also explore feature transformation, which is a common preprocessing step that can improve model performance by improving the linearity of the relationship between the feature(s) and the output.
The transformations considered here include no transformation, square root, cube root, and log10.
However, one disadvantage of the square root and log10 transformations is that they do not accept negative values as input. Thus, we apply the absolute value function to summary statistics with both positively and negatively signed values for these two transformations. An obvious drawback of this approach is that values with equivalent magnitude but opposite signs are treated identically; we did not extensively explore alternative approaches such as power transforms 43 .
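The function-transformation pairings can be sketched as follows; the statistics shown are an illustrative subset of the fourteen used, and the absolute-value handling of negative inputs mirrors the description above:

```python
import numpy as np

# Summary-statistic functions mapping a dQ(V) vector to a scalar (an
# illustrative subset of the fourteen considered in the text)
SUMMARY_STATS = {
    "var": np.var,
    "mean": np.mean,
    "range": np.ptp,
    "IQR": lambda x: np.percentile(x, 75) - np.percentile(x, 25),
    "IDR": lambda x: np.percentile(x, 90) - np.percentile(x, 10),
}

def transform(value, kind):
    """Apply a feature transformation; log10 and sqrt do not accept
    negative input, so the absolute value is taken first (as in the text)."""
    if kind == "none":
        return value
    if kind == "cbrt":
        return np.cbrt(value)        # defined for negative values
    if kind == "sqrt":
        return np.sqrt(abs(value))
    if kind == "log10":
        return np.log10(abs(value))
    raise ValueError(kind)

# Toy dQ_{100-10}(V): small, mostly negative capacity differences
dq = -0.01 * np.random.default_rng(1).random(1000)

features = {(stat, t): transform(fn(dq), t)
            for stat, fn in SUMMARY_STATS.items()
            for t in ("none", "sqrt", "cbrt", "log10")}
print(len(features))  # 20 function-transformation pairings
```

Each resulting scalar is then used as the single input of a regularized univariate linear model.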
Additionally, we use elastic net regression instead of ordinary least squares regression, even for these univariate models, so that the model coefficients can be regularized (i.e., given lower magnitude) in the interest of building more robust and generalizable models. Finally, we limit our analysis to leveraging cycles 100 and 10, i.e., ∆Q100−10(V), and do not investigate other cycle number combinations.

The results are presented in Figure 3. The log10(IQR) model generally has the lowest error across all three data sets; its error is lower than that of the log10(var) model by 5, 14, and 6 cycles (4.8%, 10%, and 3.2%) for the training, primary test, and secondary test sets, respectively. Note that the test errors are consistently higher than the training error, which is consistent with Severson et al. 3 and possibly reflects differences between the datasets (e.g., the median cycle life of the secondary test set, 964.5 cycles, is 66% higher than that of the primary test set, 580 cycles). 3 All models requiring the use of the absolute value function perform poorly. Interestingly, simply evaluating ∆Q100−10(V) at V = 2.959 V performs similarly to the IQR and variance models (the errors differ from those of the variance model by 8.7%, −13.8%, and 7.1% for the training, primary test, and secondary test sets, respectively). Note that this particular voltage was selected because the minimum of the baseline-subtracted capacity matrix occurs at this voltage (Figure 1c).

Figure 3. Root-mean-square error (RMSE) of univariate models derived from ∆Q100−10(V) as a function of summary statistic (rows) and transformation (columns) for the (a) training set, (b) primary test set, and (c) secondary test set in predicting the log10-transformed cycle life. IDR and IQR represent interdecile range and interquartile range, respectively. The log10(IQR) model generally has the lowest error across all three data sets, even exceeding the log10(var) model. The asterisk for some summary statistics denotes models where the absolute value was applied before the log/square root transformation to ensure positive values were used as input for these transformations. Models with anomalously high error are excluded from these plots and indicated using "0".

We explored two variations of this approach. First, Figure 4 presents the same univariate models in predicting the untransformed cycle life.

Figure 4. RMSE of univariate models derived from ∆Q100−10(V) in predicting the untransformed cycle life (cf. Figure 3). IDR and IQR represent interdecile range and interquartile range, respectively. The log10(IQR) model generally has the lowest error across all three data sets, even exceeding the log10(var) model. However, the errors of these models are uniformly higher than the errors of the corresponding models predicting the log10-transformed cycle life. The asterisk for some summary statistics denotes models where the absolute value was applied before the log/square root transformation to ensure positive values were used as input for these transformations. Models with anomalously high error are excluded from these plots and indicated using "0".

Second, Figure 5 presents local cycle averaging, denoted ∆Q98:100−9:11(V); here, the "98:100" nomenclature denotes inclusively averaging these cycles across voltages. The motivation for cycle averaging was to improve the signal-to-noise ratio inherent to using only one cycle. However, the performance of these models is generally comparable with or without cycle averaging, generally varying by a few cycles in either direction. Thus, the signal-to-noise ratio does not appear to improve with local cycle averaging.

Figure 5. RMSE of univariate models derived from cycle-averaged ∆Q98:100−9:11(V); performance is comparable to models that use ∆Q100−10(V) directly (i.e., without cycle averaging). The asterisk for some summary statistics denotes models where the absolute value was applied before the log/square root transformation to ensure positive values were used as input for these transformations. Models with anomalously high error are excluded from these plots and indicated using "0".
Univariate percentile models.-Given the success of the univariate IQR and IDR models, we explored models using other percentile ranges of ∆Q100−10(V). Figure 6 presents univariate models based on the log10 of different percentile ranges of ∆Q100−10(V) (Figures 6a-6c). For almost all cells in the training and primary test sets, both the upper and lower percentiles happen to be located on the sharp shoulder near 3.2 V, which corresponds to a change in the largest graphite plateau (xgraphite = ~0.5 to ~1.0) 45 . This observation helps rationalize the success of the dispersion-based univariate models of Figure 3: in some sense, this feature is a measure of the length of the shoulder in ∆Q100−10(V), which appears to be more predictive of cycle life than the mean or median of ∆Q100−10(V). Note that both of these percentile ranges of ∆Q100−10(V) generally do not capture the sharp shoulder at ~3.2 V for the secondary test set, which may explain why these models do not perform well on this dataset. Based on this result, an engineered feature that may perform well for all datasets is the magnitude of the shoulder at ~3.2 V, although we do not explore this type of curated feature engineering in this work.
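A hedged sketch of the percentile-range features (the step-shaped toy dQ(V) imitates the shoulder near 3.2 V; function names are ours):

```python
import numpy as np

def percentile_range_feature(dq, lower, upper):
    """log10 of a percentile range of dQ(V); (25, 75) recovers the IQR and
    (10, 90) the IDR, with the percentile pair as the tunable choice."""
    spread = np.percentile(dq, upper) - np.percentile(dq, lower)
    return np.log10(spread)

def percentile_voltages(dq, v_grid, lower, upper):
    """Voltages at which dQ(V) is closest to its lower/upper percentile
    values; in the text these often land on the shoulder near 3.2 V."""
    lo, up = np.percentile(dq, [lower, upper])
    return (v_grid[np.argmin(np.abs(dq - lo))],
            v_grid[np.argmin(np.abs(dq - up))])

# Toy dQ(V) with a step-shaped "shoulder" at 3.2 V
v_grid = np.linspace(2.0, 3.5, 1000)
dq = np.where(v_grid < 3.2, -0.08, -0.01) + 0.001 * np.sin(40 * v_grid)

idr_feature = percentile_range_feature(dq, 10, 90)
v_lo, v_up = percentile_voltages(dq, v_grid, 10, 90)
assert v_lo < 3.2 < v_up    # the percentile pair straddles the shoulder
```

In this toy example, the percentile range effectively measures the height of the step, analogous to the shoulder-length interpretation above.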
Univariate single-element models.-Finally, we consider univariate models derived from single elements of ∆Q100−10(V), i.e., evaluating the difference between two discharge capacity curves at a single voltage. Inspired by the success of the model using only ∆Q100−10(V) at V = 2.959 V, these models use only one element in the entire 100,000-element capacity matrix. Figure 7 displays results from univariate models derived from single elements of ∆Q100−10(V). Here, we find that the voltage that produces the univariate model with the smallest standard-deviation-normalized slope is 3.326 V (Figure 7f), a voltage that corresponds to the beginning of the large shoulder in ∆Q100−10(V); at this voltage, the contrast between cells with low and high cycle life is high. This result is consistent with our results from Figure 6 that indicate the predictive nature of this shoulder. Note that the absolute value of this metric is maximized at voltages between ~3.5 V and ~3.6 V, where the standard deviation is very low, but these models have high RMSEs (Figures 7a-7c). Overall, the high performance of these simple univariate models highlights how easily interpretable models can still maintain high accuracy.
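One reading of the standard-deviation-normalized slope, as a hypothetical reconstruction (not necessarily the exact definition used above), is the fitted slope divided by the feature's standard deviation across the training set:

```python
import numpy as np

def std_normalized_slope(feature, log_cycle_life):
    """Slope of a univariate fit of log10 cycle life on the feature, divided
    by the feature's standard deviation (a hypothetical reconstruction of
    the metric, not the authors' exact definition)."""
    slope = np.polyfit(feature, log_cycle_life, deg=1)[0]
    return slope / np.std(feature)

# Toy data: 41 cells whose log10 cycle life decreases with the feature
rng = np.random.default_rng(3)
x = rng.normal(scale=0.02, size=41)              # toy single-voltage feature
y = 3.0 - 5.0 * x + 0.01 * rng.normal(size=41)   # toy log10 cycle life
metric = std_normalized_slope(x, y)
assert metric < 0
```

Under this reading, voltages where the feature's spread is very small produce large metric magnitudes, consistent with the behavior described for ~3.5-3.6 V.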

Multivariate models from ∆Q100−10(V)
Thus far, we have primarily discussed univariate linear models of the form y = mx + b.
While the results from these models are satisfactory for many applications (e.g., closed-loop optimization 22 ), model performance typically improves as additional features are added, as long as these additional features capture meaningful information about the underlying prediction objective. In this section, we explore using the elements of ∆Q100−10(V) directly as input features for model building. An important outcome of this approach is that the model coefficients can provide some clues as to which features (i.e., capacity differences at a given voltage) have the largest impact on cycle life (either positive or negative). However, a challenge with this approach is the high collinearity of the features, as ∆Q100−10(V) at one voltage will be closely related to the value at a neighboring voltage. Fortunately, this situation is common in other fields like chemometrics and bioinformatics, and many statistical learning methods have been developed for these types of applications.
We consider four statistical learning methods and two nonlinear methods for this task, many of which are recommended for applications with many highly correlated features. 17  Interpreting the MLP predictions is not necessarily straightforward; while various tools have been proposed to interpret otherwise black box models, we use SHapley Additive exPlanations (SHAP) 47 here.
Here, we used 100 features as input by downsampling the 1000-feature set (16 mV sampling frequency), as some methods (specifically, elastic net and random forest) did not converge during model fitting when using the full 1000-feature set as input. For all methods except MLP, the features were standardized during preprocessing, i.e., mean-subtracted and scaled by the standard deviation of the training set; for the MLP, the features were only scaled by the standard deviation of the training set. All models were trained via 5-fold cross validation.
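This preprocessing and cross-validation setup can be sketched with scikit-learn (synthetic data stands in for the 41-cell training set; this is not the authors' implementation):

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Toy stand-in for the real data: 41 training cells, dQ(V) downsampled to
# 100 voltage points (16 mV spacing), predicting log10 cycle life
n_cells, n_features = 41, 100
X = rng.normal(size=(n_cells, n_features))
X += 0.5 * X[:, [0]]                     # make the features mutually correlated
y = 3.0 - 0.2 * X[:, :10].mean(axis=1) + 0.05 * rng.normal(size=n_cells)

# Standardize features on the training set, then fit an elastic net with
# 5-fold cross-validation (mirroring the preprocessing described above)
model = make_pipeline(
    StandardScaler(),
    ElasticNetCV(l1_ratio=[0.1, 0.5, 0.9, 1.0], cv=5, max_iter=10_000),
)
model.fit(X, y)

coef = model.named_steps["elasticnetcv"].coef_
print("nonzero coefficients:", np.count_nonzero(coef))
```

The fitted coefficient vector, one value per voltage point, is what enables the interpretation discussed below.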
Note that we use the untransformed features here, as the square root and log10 transformations do not accept negative values as input. We also considered the cube root transformation, which is not subject to this limitation, but we generally obtained higher errors with this transformation. Figure 8 presents the results of this approach using these methods. Figure 8a presents the RMSEs of each method using ∆Q100−10(V); Figure 8b is identical to Figure 8a but uses ∆Q98:100−9:11(V) (i.e., cycle-averaged) instead of ∆Q100−10(V). The results in Figures 8a and 8b are comparable, suggesting cycle averaging does not help in this case.

Figure 8. Multivariate models built from the elements of ∆Q100−10(V). PLSR represents partial least squares regression (also known as projection to latent structures regression), RF represents random forest regression, and MLP represents multi-layer perceptron regression. (a) RMSE for the training set, primary test set, and secondary test set for each of the five methods using ∆Q100−10(V). PLSR generally has the lowest errors. (b) RMSE for the training set, primary test set, and secondary test set for each of the five methods using ∆Q98:100−9:11(V). PLSR generally has the lowest errors. The errors are comparable to the non-cycle-averaged case. (c) Scaled coefficients of the linear models to predict log10 cycle life for the four methods that produce linear models. These coefficients are generally consistent across all four methods.

In summary, the use of the elements of the baseline-subtracted capacity matrix as input features for statistical learning methods is a promising approach for creating accurate and interpretable lifetime models. Physics-based degradation modeling could perhaps provide a deeper understanding of the trends present in these coefficient vectors. Additionally, a similar approach coupled with low-rate capacity-voltage curves may be more readily interpretable, as specific degradation modes can often be identified 32 .
Furthermore, the use of differential capacity or differential voltage analysis on ∆Q(V) vectors may further aid interpretability, though perhaps at the cost of accuracy due to the noise introduced by taking the numerical derivative. Finally, this approach may aid in developing generalizable models, particularly when applied to recent synthetic and experimental datasets that span multiple chemistries and cycling conditions. 48,49

Multivariate models from capacity matrices

Lastly, we consider four approaches that attempt to capture information from the entire capacity matrix, as opposed to just ∆Q100−10(V). However, all of these approaches produced models with test errors comparable to or larger than those of the other models presented in this work. We briefly discuss these approaches and their results here for completeness.
First, we considered horizontal slices of the capacity matrices. While vertical slices of the capacity matrices (including metrics such as ∆Q100−10(V), which is the difference between two vertical slices) capture information over a range of voltages for fixed cycle number, horizontal slices of the capacity matrices (illustrated in Figure 1d) capture information over a range of cycle numbers for fixed voltage. This approach is motivated by the idea that these linear trends may extrapolate and correlate with cycle life. Here, we first identified the voltage that maximizes the absolute value of the slope (i.e., minimizes ∆Q100−2(V)), found to be 3.003 V. We then fit the horizontal slices to a line and use the slope and intercept as input features into the elastic net. We also explored (a) choosing other voltages for the horizontal slice, (b) "voltage averaging", i.e., local averaging across five neighboring voltages, (c) building models with polynomial features from the slope and intercept (e.g., using the square of the slope), and (d) removing outlier points.
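The slope-and-intercept feature extraction for a horizontal slice can be sketched as follows (toy, perfectly linear data; the real data are noisier and nonlinear):

```python
import numpy as np

def slice_slope_intercept(dQ, v_grid, voltage, cycles):
    """Fit a line to one horizontal slice of a baseline-subtracted capacity
    matrix dQ (shape: n_voltages x n_cycles) at the given voltage and
    return (slope, intercept) as candidate input features."""
    row = np.argmin(np.abs(v_grid - voltage))   # nearest voltage grid point
    slope, intercept = np.polyfit(cycles, dQ[row], deg=1)
    return slope, intercept

# Toy matrix with a perfectly linear capacity fade at every voltage
v_grid = np.linspace(2.0, 3.5, 1000)
cycles = np.arange(1, 101)
dQ = np.outer(np.ones(1000), -2e-4 * (cycles - 2))   # zero at the cycle-2 baseline

m, b = slice_slope_intercept(dQ, v_grid, 3.003, cycles)
assert np.isclose(m, -2e-4) and np.isclose(b, 4e-4)
```

The two scalars per cell (slope and intercept) then serve as the elastic net inputs; as noted below, this compact representation discards the dispersion information in ∆Q(V).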
However, all of these approaches produced models with higher errors than the other approaches explored in this work (e.g., representative training, primary test, and secondary test errors of 190, 233, and 269 cycles, respectively). We attribute the poor performance of this approach to the nonlinear degradation of these cells (see Figure 1a of Severson et al. 3 ), as well as to our results throughout this work demonstrating the value of capturing the dispersion in ∆Q(V), which cannot be captured with so few voltage points.
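For illustration, the horizontal-slice featurization can be sketched as follows. The capacity matrix, voltage grid, fade shape, and noise level below are all invented for the example rather than taken from our dataset:

```python
import numpy as np

rng = np.random.default_rng(2)

# Invented baseline-subtracted capacity matrix for one cell:
# rows index voltage, columns index cycles 2..100.
cycles = np.arange(2, 101)
v_grid = np.linspace(3.5, 2.0, 50)
# Capacity fade grows with cycle number and, in this example, is strongest near 3.0 V.
fade = -1e-4 * (cycles - 2)[None, :] * np.exp(-((v_grid - 3.0) ** 2) / 0.05)[:, None]
cap_matrix = fade + rng.normal(0, 1e-4, (v_grid.size, cycles.size))

# Choose the voltage whose slice fades fastest (most negative at cycle 100).
row = int(np.argmin(cap_matrix[:, -1]))

# Fit that horizontal slice to a line; the slope and intercept then serve
# as input features for a downstream regressor such as the elastic net.
slope, intercept = np.polyfit(cycles, cap_matrix[row], 1)
```

With only two features (slope and intercept), this representation discards the dispersion of ∆Q(V) across voltage, which is one reason it underperforms here.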
Second, we considered models that use the entire capacity matrix for feature generation, using the multivariate methods presented in Figure 8. The capacity matrix was first reshaped into a vector and then downsampled to avoid training on 100,000 features (we experimented with different downsampling frequencies). However, our attempts were unsuccessful; the models were highly overfit, with near-zero training error and large test error. With such a large, collinear feature set, our statistical learning methods were unable to build meaningful models. In some ways, this result demonstrates the value of manual feature engineering for these high-dimensional input datasets when coupled with statistical learning approaches. However, as we discuss below, the use of the entire capacity matrix may find utility in training end-to-end machine learning and deep learning models for battery lifetime prediction.
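The whole-matrix featurization amounts to flattening and downsampling; a minimal sketch follows, where the matrix size and downsampling factor are arbitrary stand-ins:

```python
import numpy as np

rng = np.random.default_rng(3)

# Invented capacity matrix: 1000 voltage points x 99 cycles ~ 10^5 entries.
cap_matrix = rng.normal(0.0, 1e-3, (1000, 99))

# Flatten to one long feature vector per cell...
flat = cap_matrix.ravel()

# ...then keep every 50th entry so the feature count stays tractable
# relative to the few dozen cells available for training.
features = flat[::50]
```

Even after downsampling, the feature count far exceeds the number of training cells, which is consistent with the severe overfitting we observed.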
Third, we built multivariate models using elastic net regression on the log10-transformed ∆Q100−10(V) summary features from Figure 3 (excluding the features requiring the use of the absolute value function). Surprisingly, however, the performance of these models is comparable to that of the univariate models of Figure 3.

[Table caption; beginning truncated] … from Severson et al. 3 and from selected models in this work. Overall, the "discharge model" from Severson et al. 3 has the lowest test errors. The CNN results are averaged over ten runs. Note that the outlier cell from the primary test set is excluded from all rows; additionally, the training RMSE for the log10(var(∆Q100−10(V))) model is reported as 103 cycles in Severson et al. 3 but as 104 cycles in this work.
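The log10-transformed summary features used for this third approach can be computed along these lines, using a synthetic ∆Q100−10(V) vector and a deliberately non-exhaustive feature set; only statistics that are strictly positive, such as the variance and interquartile range, can be log10-transformed directly:

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic Delta Q_{100-10}(V) vector for one cell (entries are negative).
dq = -np.abs(rng.normal(0.05, 0.02, 500))

# log10-transformed summary features; these statistics are strictly
# positive, so no absolute value is needed before the transformation.
features = {
    "log10_var": np.log10(np.var(dq)),
    "log10_iqr": np.log10(np.quantile(dq, 0.75) - np.quantile(dq, 0.25)),
    "log10_range": np.log10(np.max(dq) - np.min(dq)),
}
```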
To provide additional insight into the residual distribution, we present the cumulative distribution of the absolute residuals for the training, primary test, and secondary test datasets for the six multivariate ∆Q100−10(V) models, the variance model from Severson et al. 3 , and the CNN model (Figure 10). In these plots, high-performing models are those with curves close to the upper left corner. The deep learning models (MLP and CNN) both have the best performance on the training set; this trend is expected, as the training error generally decreases with model complexity. Most models perform comparably on the primary test set, although the CNN model has noticeably lower error. On the secondary test set, the CNN model has the best performance for the majority of the cells. However, this model also has very poor performance for two cells (absolute residuals > 700 cycles), which greatly increases its RMSE.
The CNN model has the additional challenge of being the most difficult to interpret, making these errors even more challenging to understand. Overall, these results suggest that deep learning approaches may be suitable for applications in which accuracy is paramount and both training time and interpretability are not; in our view, however, simpler statistical learning approaches are generally preferred to deep learning methods, at least for objectives and datasets similar to ours.

Figure 10. Cumulative distribution of the absolute residuals for the (a) training set, (b) primary test set, and (c) secondary test set for the six multivariate ∆Q100−10(V) models, the variance model from Severson et al. 3 , and the CNN model. In these plots, high-performing models are those with curves close to the upper left corner. While the performance on the training set generally scales with model complexity, the performance on the test sets is model dependent. Notably, the CNN model outperforms other models on the primary test set, but its moderate performance on the secondary test set is largely caused by two cells with high residuals.
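The curves in Figure 10 are empirical cumulative distributions of absolute residuals; computing one such curve is straightforward (the residuals below are synthetic):

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic absolute residuals (in cycles) for one model on one dataset.
abs_resid = np.abs(rng.normal(0.0, 80.0, 40))

# Empirical CDF: after sorting, y[i] is the fraction of cells whose
# absolute residual is at most x[i].  Better models hug the upper left.
x = np.sort(abs_resid)
y = np.arange(1, x.size + 1) / x.size
```

Unlike a single RMSE value, the full curve exposes outlier cells, such as the two high-residual cells that inflate the CNN model's secondary test RMSE.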
The six features selected in the "discharge" model are the minimum, variance, skewness, and kurtosis of ∆Q100−10(V), as well as the discharge capacity of cycle 2 and the difference between the maximum discharge capacity and that of cycle 2. One notable feature in this model that is absent from our work is the second-cycle discharge capacity (i.e., the initial capacity, used because the first-cycle capacity was unavailable for one batch). One disadvantage of the baseline-normalized (e.g., baseline-subtracted or baseline-divided) capacity matrix concept is that the original initial capacity values are lost; this simple feature may help capture cell-to-cell differences due to both manufacturing variation and calendar aging. Note that the constant-current capacity of a cycle can be calculated by taking the sum over a column in the unnormalized capacity matrix. Overall, however, the high performance of our simple and interpretable models illustrates the effectiveness of statistical learning approaches for battery lifetime prediction.
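The column-sum relationship mentioned above can be sketched as follows, assuming for illustration that each entry of the unnormalized capacity matrix holds the charge passed within one voltage bin during the constant-current step of one cycle (the bin count, cycle count, and values are invented):

```python
import numpy as np

rng = np.random.default_rng(6)

# Invented unnormalized capacity matrix: entry (i, j) is the charge (Ah)
# passed in voltage bin i during the constant-current step of cycle j.
n_bins, n_cycles = 100, 99
cap_matrix = np.abs(rng.normal(0.011, 0.001, (n_bins, n_cycles)))

# The constant-current capacity of cycle j is the sum over column j.
cc_capacity = cap_matrix.sum(axis=0)
```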

Conclusions
In this work, we designed a general framework for rapid development of data-driven battery lifetime prediction models, using the dataset and models of Severson et al. 3 as a baseline. Using the voltage curves as our only source of features, we first present the capacity matrix concept for compactly representing and visualizing changes in voltage curves with cycle number. We then explore reducing the voltage sampling frequency, finding that the sampling frequency can be reduced by a factor of 50 without impacting the error. Next, we present a number of univariate models that use summary statistics applied to ∆Q100−10(V) as input. The univariate interquartile range (IQR) model outperformed the univariate variance model from Severson et al. 3 ; we also develop high-performing univariate models using both percentiles of ∆Q100−10(V) and single elements of the capacity matrix. Additionally, the log10 transformation was consistently found to be an effective transformation for both the features and the cycle life.
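As a sketch of how compact these univariate models are, the IQR model reduces to a one-variable fit in log-log space; the data and power-law relationship below are entirely synthetic inventions for illustration:

```python
import numpy as np

rng = np.random.default_rng(7)

# Synthetic training data: per-cell IQR of Delta Q_{100-10}(V) and cycle
# life, related by an invented power law (linear in log-log space).
n_cells = 40
iqr = 10.0 ** rng.uniform(-3.0, -1.5, n_cells)
cycle_life = 10.0 ** (1.5 - 0.6 * np.log10(iqr) + rng.normal(0, 0.02, n_cells))

# Univariate model: log10(cycle life) ~ a + b * log10(IQR).
b, a = np.polyfit(np.log10(iqr), np.log10(cycle_life), 1)

# Training RMSE in cycles.
pred = 10.0 ** (a + b * np.log10(iqr))
rmse = np.sqrt(np.mean((pred - cycle_life) ** 2))
```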
We then investigate multivariate models that use the elements of ∆Q100−10(V) as input, with the PLSR method in particular producing a model with low error and interesting insights into the behavior of ∆Q100−10(V). We also report some approaches that are not effective, including models using multi-cycle averaging and horizontal slices of the capacity matrices. We generally find that the performance of statistical learning methods is comparable to that of more complex deep learning approaches, particularly for generalization, although tailored neural architectures may improve the performance of deep learning models. In summary, the approaches presented in this work produce simple, accurate, and interpretable models for battery lifetime prediction, highlighting the value of domain expertise in feature engineering.
Future work in this space is broad. We hope this work inspires creative feature extraction techniques from capacity matrices. Baseline-normalized capacity matrices can also be applied to other electrochemical data sources like rests and constant-voltage holds. Additionally, as the battery data community develops data storage and representation standards, we propose capacity matrices as one option for compact, machine-learning-ready battery cycling data storage 51 .
Statistical learning approaches can also be applied to other objective functions, such as energy-based cycle life, knee point 5 , and multipoint prediction 51 . We also recommend applying a similar suite of statistical learning models to benchmark future work that uses new datasets and/or more advanced machine learning methods for battery lifetime prediction, as these approaches provide a reasonable starting point for building high-performing models with minimal human input.
Finally, we hope that this work inspires data-driven lifetime prediction models that can generalize to new chemistries and usage conditions; the use of synthetic and experimental datasets that span multiple chemistries and usage conditions 48,49 as training sets may be a step in this direction.