A multifidelity approach to continual learning for physical systems

We introduce a novel continual learning method based on multifidelity deep neural networks. This method learns the correlation between the output of previously trained models and the desired output of the model on the current training dataset, limiting catastrophic forgetting. On its own the multifidelity continual learning method shows robust results that limit forgetting across several datasets. Additionally, we show that the multifidelity method can be combined with existing continual learning methods, including replay and memory aware synapses, to further limit catastrophic forgetting. The proposed continual learning method is especially suited for physical problems where the data satisfy the same physical laws on each domain, or for physics-informed neural networks, because in these cases we expect there to be a strong correlation between the output of the previous model and the model on the current training domain.

In many real-world applications of machine learning, data are received sequentially or in discrete datasets. When used as training data, new information received about the system requires completely retraining a given neural network. Much recent work has focused on how to instead incorporate the newly received training data into the machine learning model without requiring retraining with the full dataset and without forgetting the previously learned model. This process is referred to as continual learning [1]. One key goal in continual learning is to limit catastrophic forgetting, that is, abruptly and completely forgetting the previously trained data.
Many methods have been proposed to limit forgetting in continual learning. In replay (rehearsal), a subset of the training set from previously trained regions is used in training subsequent models, so the method can limit forgetting by reevaluating on the previous regions [2]. However, replay requires access to the previously used training datasets, which demands large storage capabilities for large datasets as well as physical access to the previous data. Because data privacy can limit access to prior datasets, replay may not be a feasible option. An alternative to replay is regularization, where a regularizer is used to assign a weight to each parameter in the neural network, representing that parameter's importance. A penalty is then applied to prevent the parameters with the largest weights from changing. Multiple methods have been proposed for calculating the importance weights. Among the top choices are Synaptic Intelligence [3], elastic weight consolidation (EWC) [4], and memory aware synapses (MAS) [5]. Subsequent work has shown that MAS performs among the best in multiple use cases and is more robust to the choice of hyperparameters, so here we use MAS [6,7]. Finally, a third category of continual learning methods includes those that employ task-specific modules [8], ensembles [9], adapters [10], reservoir computing based architectures [11], slow-fast weights [12,13], and more.
In recent years, a major research focus has been on scientific machine learning methods for physical systems [14,15,16], for example fluid mechanics and rheology [17,18,19,20], metamaterial development [21,22,23], high-speed flows [24], and power systems [25,26,27]. In particular, physics-informed neural networks, or PINNs [28], accurately represent differential operators through automatic differentiation, allowing for finding the solution of PDEs without explicit mesh generation. Work on continual learning for PINNs is limited. While as a first attempt PINNs can be trained on the entire domain, because the issues of data acquisition and privacy do not apply, many systems have been identified for which it is not possible to train a PINN for the entire desired time domain. For example, even the simple examples used in this work, a pendulum and the Allen-Cahn equation, cannot be trained by a PINN for long times. Recent work has looked at improving the training of PINNs for such systems, including applications of the neural tangent kernel [29], but more work remains to be done. The closest work we are aware of for continual learning with PINNs is the backward-compatible PINNs in [30] and incremental PINNs (iPINNs) in [31]. Backward-compatible PINNs train N PINNs on a sequence of N time domains, and in each new domain enforce that the output of the current PINN satisfies the PINN loss function in the current domain and matches the output of the previous model on all previous domains. We note that this is distinct from the replay approach taken with PINNs in the present work, in both the single fidelity and multifidelity cases, because we enforce that the Nth neural network satisfies the residual in all prior domains, rather than matching the output of the previous model. In iPINNs, PINNs are trained to satisfy a series of different equations through a subnetwork for each equation, rather than the same equation over a long time.
We will introduce the multifidelity continual learning method in Sec. 2. We will then show the performance of the method on physics-informed problems in Sec. 3 and on data-informed problems in Sec. 4.

Multifidelity continual learning method
We assume that we have a domain Ω, which we divide into N subdomains, Ω = ∪_{i=1}^{N} Ω_i. We will learn sequential models on each subdomain Ω_i, with the goal that the ith model can provide accurate predictions on the domain ∪_{j=1}^{i} Ω_j. That is, the ith model does not forget the information learned on earlier domains used in training. We will focus on applications to physical systems, where we either have data available or knowledge of the physical laws the system obeys. We will begin this section with a brief overview of physics-informed neural networks (PINNs), then discuss the multifidelity continual learning (MFCL) method, and conclude with a description of methods we use to limit catastrophic forgetting.

Physics-informed neural networks
In this section we give a brief introduction to single-fidelity and multifidelity physics-informed neural networks (PINNs), which were introduced in [28] and have been covered in depth for many relevant applications [32,14]. In these applications, PINNs are generally used for initial-boundary value problems of the form

∂s/∂t (x, t) + O_x[s](x, t) = 0, x ∈ Ω, t ∈ [0, T],
s(x, 0) = u(x), x ∈ Ω,
s(x, t) = g(x, t), x ∈ ∂Ω, t ∈ [0, T],

where Ω ⊂ R^N is an open, bounded domain with boundary ∂Ω, g and u are given functions, and x and t are the spatial and temporal coordinates, respectively. O_x is a general differential operator with respect to x. We wish to find an approximation to s(x, t) by a (series of) deep neural networks with parameters γ, denoted by s_γ(x, t). The neural network is trained by minimizing the loss function

L(γ) = λ_bc L_bc(γ) + λ_ic L_ic(γ) + λ_r L_r(γ) + λ_data L_data(γ),

where the subscripts bc, ic, r, and data denote the terms corresponding to the boundary conditions, initial conditions, residual, and any provided data, respectively. We take N_bc, N_ic, and N_r to be the batch sizes of the boundary, initial, and residual data points, and denote the training data by {(x^i_bc, t^i_bc), g(x^i_bc, t^i_bc)}_{i=1}^{N_bc}, {x^i_ic, u(x^i_ic)}_{i=1}^{N_ic}, and {(x^i_r, t^i_r)}_{i=1}^{N_r}. The boundary and initial collocation points are randomly sampled uniformly in their respective domains. The selection of the N_r residual points will be discussed in Sec. ??. If data representing the solution s are available, we can also consider an additional dataset {(x^i_data, t^i_data), s(x^i_data, t^i_data)}_{i=1}^{N_data}. This term is included to capture the data-based training we will cover in Sec. 4.
The individual loss terms are given by the mean square errors

L_bc(γ) = (1/N_bc) Σ_{i=1}^{N_bc} |s_γ(x^i_bc, t^i_bc) − g(x^i_bc, t^i_bc)|²,
L_ic(γ) = (1/N_ic) Σ_{i=1}^{N_ic} |s_γ(x^i_ic, 0) − u(x^i_ic)|²,
L_r(γ) = (1/N_r) Σ_{i=1}^{N_r} |∂s_γ/∂t (x^i_r, t^i_r) + O_x[s_γ](x^i_r, t^i_r)|²,
L_data(γ) = (1/N_data) Σ_{i=1}^{N_data} |s_γ(x^i_data, t^i_data) − s(x^i_data, t^i_data)|².

The weighting parameters λ_bc, λ_ic, λ_r, and λ_data are chosen before training by the user.
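For concreteness, the weighted composite loss can be sketched in a few lines of Python. This is an illustrative sketch only: the helper names `mse` and `pinn_loss` are ours, and the network evaluations and targets are assumed to be supplied as arrays.

```python
import numpy as np

def mse(pred, target):
    """Mean square error over a batch of points."""
    return float(np.mean((np.asarray(pred) - np.asarray(target)) ** 2))

def pinn_loss(preds, targets, lams):
    """Weighted composite PINN loss: sum of the boundary ('bc'),
    initial ('ic'), residual ('r'), and data ('data') mean square
    error terms, each scaled by its user-chosen weight lambda_k."""
    return sum(lams[k] * mse(preds[k], targets[k]) for k in preds)
```

In practice each entry of `preds` would come from evaluating s_γ (or its derivatives, for the residual term) at the corresponding batch of collocation points, with the residual target identically zero.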
Multifidelity PINNs, as used in this work, are inspired by [33]. We assume we have a low-fidelity model in the form of a deep neural network that approximates a given dataset or differential operator with low accuracy. We then train two additional neural networks to learn the linear and nonlinear correlations between the low-fidelity approximation and a high-fidelity approximation or high-fidelity data. We denote these neural networks by NN_l for the linear correlation and NN_nl for the nonlinear correlation. The output is then s_γ(x, t) = NN_nl(x, t; γ) + NN_l(x, t; γ), where γ denotes all trainable parameters of the linear and nonlinear networks. The loss function includes an additional regularization term, λ_nl Σ_{ij} γ²_{nl,ij}, where {γ_nl,ij} is the set of all weights and biases of the nonlinear network NN_nl; this penalty favors the linear correlation where it suffices. No activation function is used in NN_l, so that the network learns a linear correlation between the previous prediction and the high-fidelity model.
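A minimal NumPy sketch of the two-subnetwork surrogate follows. The architecture sizes are illustrative, not those used in the experiments, and the helper names (`init_mlp`, `forward`, `multifidelity_output`) are ours; note the linear subnet uses the identity activation, so its output is a linear map of its inputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights and zero biases for a fully connected net."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def forward(params, z, activation):
    """Apply the MLP; activation=np.tanh for NN_nl, identity for NN_l
    (no activation, so NN_l learns a purely linear correlation)."""
    for i, (W, b) in enumerate(params):
        z = z @ W + b
        if i < len(params) - 1:
            z = activation(z)
    return z

def multifidelity_output(p_nl, p_l, x_t, u_low):
    """s_gamma(x,t) = NN_nl + NN_l, both subnets fed (x, t, NN_{i-1})."""
    z = np.concatenate([x_t, u_low], axis=1)
    return forward(p_nl, z, np.tanh) + forward(p_l, z, lambda v: v)
```

Both subnets receive the coordinates together with the low-fidelity prediction, and their sum is the high-fidelity output.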

Multifidelity continual learning
In the MFCL method, we exploit correlations between the models previously trained on prior domains and the expected model on the current domain. Explicitly, we use the prior model NN_{i−1} as a low-fidelity model for domain Ω_i. Then, we learn the correlation between NN_{i−1} on domain Ω_i and the data or physics given on that domain. By learning a general combination of linear and nonlinear terms, we can capture complex correlations. Because the method learns only the correlation between the previous model and the new model, we can in general use smaller networks in each subdomain. The procedure requires two initial steps:
1. Train a (single-fidelity) DNN or PINN on Ω_1, denoted by NN_*(x, t; γ_*). This network will approximate the solution in a single domain.
2. Train a multifidelity DNN or PINN in Ω_1, which takes as input the single-fidelity model NN_*(x, t; γ_*) as a low-fidelity approximation. This initial multifidelity network is denoted by NN_1(x, t; γ_1).
Then, for each additional domain Ω_i, we train a multifidelity DNN or PINN in Ω_i, denoted by NN_i(x, t; γ_i), which takes as input the previous multifidelity model NN_{i−1}(x, t; γ_{i−1}) as a low-fidelity approximation. The goal is for NN_i(x, t; γ_i) to provide an accurate solution on ∪_{j=1}^{i} Ω_j, even when data from Ω_j, j < i, are not used in training the multifidelity network NN_i. A diagram of the method is given in Fig. 2.
Figure 2: Diagram of the MFCL method on domain Ω_i. The output from the previously trained neural network, NN_{i−1}(x, t; γ_{i−1}), is used as input to the linear and nonlinear subnets for a point (x, t) ∈ Ω_i, x ∈ R^N. The output of the neural network is the sum of the linear and nonlinear subnetworks.
As we will show, the MFCL method on its own provides more accurate results with less forgetting than single-fidelity training. However, the method can be further improved by several techniques that have been previously developed, both for reducing forgetting in continual learning and for selecting collocation points for training PINNs. These techniques are discussed below.

Memory aware synapses
Memory aware synapses (MAS) is a continual learning method that attempts to limit forgetting by assigning an importance weight to each parameter of the neural network. A penalty term is then added to the loss function to prevent large deviations in the values of important weights when the next networks are trained. The importance weights are found by measuring how sensitive the output of the neural network NN_n is to changes in the network parameters [5]. For each weight and bias γ_ij in the neural network, we calculate the importance weight

Ω_ij = (1/N) Σ_{k=1}^{N} | ∂ ||NN_n(x_k)||²_2 / ∂γ_ij |,

where ||·||²_2 denotes the squared ℓ_2 norm of the output of the neural network NN_n applied at x_k. The loss function in eq. 10 is then modified to read

L̃(γ_n) = L(γ_n) + λ_MAS Σ_{ij} Ω_ij (γ_n,ij − γ_{n−1,ij})².

When applying MAS to multifidelity neural networks, we calculate the MAS terms separately for the two subnetworks,

L̃(γ_n) = L(γ_n) + λ_MAS Σ_{ij} Ω^nl_ij (γ^nl_n,ij − γ^nl_{n−1,ij})² + λ_MAS Σ_{ij} Ω^l_ij (γ^l_n,ij − γ^l_{n−1,ij})²,

where nl denotes the nonlinear network and l denotes the linear network. In this way, the importance of the weights in calculating the linear and nonlinear terms is found separately, instead of determining the importance with respect to the overall output of the sum of the networks. The parameter λ_MAS is kept the same for the linear and nonlinear parts.
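The importance weights can be illustrated with a finite-difference sketch; in practice an autodiff framework would compute the gradient exactly. The helper names below are ours, and `model_out` stands in for an arbitrary network evaluation.

```python
import numpy as np

def mas_importance(model_out, params, xs, eps=1e-6):
    """Finite-difference sketch of the MAS importance weights: the
    average absolute sensitivity of the squared l2 norm of the network
    output at the points x_k to each trainable parameter."""
    norms = lambda p: np.array([np.sum(model_out(p, x) ** 2) for x in xs])
    omega = np.zeros_like(params)
    for j in range(params.size):
        bumped = params.copy()
        bumped.flat[j] += eps
        omega.flat[j] = np.mean(np.abs((norms(bumped) - norms(params)) / eps))
    return omega

def mas_penalty(lam_mas, omega, params, params_prev):
    """Penalty added to the loss: lam_mas * sum_ij Omega_ij *
    (gamma_ij - gamma*_ij)^2, keeping important parameters close to
    their previously trained values."""
    return lam_mas * float(np.sum(omega * (params - params_prev) ** 2))
```

For the multifidelity case, `mas_importance` would be called once on the linear subnet's parameters and once on the nonlinear subnet's, matching the separate terms in the loss above.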

Replay
In replay, a selection of points in the previously trained domains ∪_{i=1}^{n−1} Ω_i is made at each iteration, and the residual loss L_r(γ_n) is evaluated at those points. In this way, the multifidelity training still satisfies the PDE across the earlier trained domains. For PINNs, the replay approach only requires knowledge of the geometry of ∪_{i=1}^{n−1} Ω_i, and not the value of the output of the model on this domain.
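A sketch of the replay point selection for one-dimensional (time) subdomains follows. Consistent with the discussion above, only the interval geometry is required, not stored solution values; the function name is ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_replay_points(prior_domains, n_points):
    """Uniformly sample residual collocation points from the union of
    previously trained 1-D domains, given as (low, high) intervals."""
    lows = np.array([d[0] for d in prior_domains], dtype=float)
    highs = np.array([d[1] for d in prior_domains], dtype=float)
    widths = highs - lows
    # pick an interval with probability proportional to its width,
    # then sample uniformly inside it, so the union is covered uniformly
    idx = rng.choice(len(prior_domains), size=n_points,
                     p=widths / widths.sum())
    return rng.uniform(lows[idx], highs[idx])
```

The sampled points are added to the residual batch so that the PDE residual continues to be enforced on all earlier subdomains.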

Transfer learning
In all cases in this work, the values of the trainable parameters in each subsequent network NN_i, i ≥ 2, are initialized from the final values of the trainable parameters in the previous network, NN_{i−1}. In notation, γ⁰_i = γ_{i−1}. This approach allows for faster training because the network is not initialized randomly. We note that some previous work has found less forgetting by initializing each subsequent network randomly [34]; we leave the exploration of this option for future work.
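The warm start γ⁰_i = γ_{i−1} amounts to a parameter copy rather than a fresh random draw. A minimal sketch, assuming parameters stored as a dict of arrays (an assumption of this illustration):

```python
import numpy as np

def warm_start(prev_params):
    """Transfer-learning initialization gamma_i^0 = gamma_{i-1}: copy
    the previous network's trained parameters so the new network starts
    from the previous solution instead of a random initialization."""
    return {name: np.copy(p) for name, p in prev_params.items()}
```

Copying (rather than aliasing) matters: training the new network must not mutate the frozen predecessor that serves as the low-fidelity model.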

Physics-informed training
In this section, we give examples of applying multifidelity continual learning for physics-informed neural networks in cases where PINNs fail to train. We show that using continual learning in time can improve the accuracy of training a PINN for long-time integration problems, where a single PINN is not sufficient. All hyperparameters used in training are given in Appendix 8.

Pendulum dynamics
In this section, we consider the gravity pendulum with damping from [29]. The system is governed by the ODE system

ds_1/dt = s_2,
ds_2/dt = −(b/m) s_2 − (g/L) sin(s_1),

for t ∈ [0, T]. The initial conditions are given by s_1(0) = s_2(0) = 1. We take m = L = 1, b = 0.05, and g = 9.81, and we take T = 10. We first consider a single PINN trained on t ∈ [0, 10] in Fig. 3. The solution quickly goes to zero, showing that a single PINN cannot capture the long-time dynamics of even this simple system. Similar results were shown in [29]. We note that there are recent advances for improving the training of PINNs for long-time integration problems [29,35,36]. In this section, we explore how continual learning can also allow for accurate solutions over long times by dividing the time domain into subdomains.
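For reference, the long-time dynamics can be checked against a classical Runge-Kutta solution. The right-hand side below assumes the standard damped-pendulum form ds_1/dt = s_2, ds_2/dt = −(b/m)s_2 − (g/L)sin(s_1) from [29]; the exact equations used in the paper may differ in detail.

```python
import numpy as np

def pendulum_rhs(s, b=0.05, m=1.0, g=9.81, L=1.0):
    """Damped gravity pendulum (assumed standard form from [29]):
    s1' = s2, s2' = -(b/m) s2 - (g/L) sin(s1)."""
    s1, s2 = s
    return np.array([s2, -(b / m) * s2 - (g / L) * np.sin(s1)])

def rk4(f, s0, T=10.0, n=10000):
    """Classical 4th-order Runge-Kutta reference solution on [0, T]."""
    h = T / n
    s = np.array(s0, dtype=float)
    traj = [s.copy()]
    for _ in range(n):
        k1 = f(s)
        k2 = f(s + 0.5 * h * k1)
        k3 = f(s + 0.5 * h * k2)
        k4 = f(s + h * k3)
        s = s + (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        traj.append(s.copy())
    return np.array(traj)
```

Such a reference trajectory is what the RMSE comparisons against the "exact solution" rely on.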
We divide the domain into five subdomains, Ω_i = [2(i − 1), 2i].

Table 1: RMSE of the final output NN_5 on the full domain for the pendulum problem. For the MAS cases, the network is trained for six values of λ_MAS, and the case with the lowest RMSE is shown. The replay results have N = 100 neurons in each hidden layer; see Table 2 for cases with varying numbers of neurons in each hidden layer.

For each case, we calculate the root mean square error (RMSE) of the final output NN_5 on the full domain Ω = [0, 10],

RMSE = sqrt( (1/N) Σ_{k=1}^{N} |NN_5(t_k) − s(t_k)|² ),

where s denotes the exact solution. If forgetting is limited, the final solution should have a small RMSE on the full domain. It is clear from Table 1 that replay performs best in both cases, and significantly better than any other approach. It is no surprise that the SF-alone case has a large RMSE, as it does not incorporate any techniques to limit forgetting. This case is shown in Fig. 4a.
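The error metric used throughout the comparisons is a standard RMSE over the full domain; as a sketch:

```python
import numpy as np

def rmse(pred, exact):
    """Root mean square error of a model's predictions against the
    exact solution on a common set of evaluation points."""
    pred, exact = np.asarray(pred, dtype=float), np.asarray(exact, dtype=float)
    return float(np.sqrt(np.mean((pred - exact) ** 2)))
```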
Fig. 5 gives the best MAS results for each of the sets of hyperparameters considered, with λ_MAS = 100 for single fidelity and λ_MAS = 0.001 for multifidelity. As expected given the smaller RMSE, the multifidelity training outperforms the single-fidelity training with MAS.
As shown in Fig. 6, the SF-replay case does appear to outperform the MF-replay case. However, it is interesting to look at the RMSE as we change the network size in Table 2. While the MF-replay case is robust to changes in the network size, the single-fidelity case only achieves a small RMSE with a very specific architecture.

Table 2: RMSE of the final output NN_5 on the full domain for the pendulum problem. The SF case has five hidden layers with N neurons each. In the MF case, each nonlinear network has five hidden layers with N neurons. The multifidelity linear network has one hidden layer with 20 neurons.

Allen-Cahn equation
We consider the Allen-Cahn equation with diffusion coefficient c_1² = 0.0001. The Allen-Cahn equation is notoriously difficult for PINNs to solve by direct application [37,38], see Fig. 7. Modifications of PINNs have successfully been able to solve the Allen-Cahn equation, including by using a discrete Runge-Kutta neural network [28], adaptive sampling of the collocation points [37], and backward-compatible PINNs [30]. In this section we show that we can accurately learn the solution of the Allen-Cahn equation by applying the multifidelity continual learning framework. We divide the domain into four subdomains, Ω_i = [2(i − 1), 2i], and report the relative RMSE of NN_4 on the full domain Ω. When the multifidelity and single-fidelity methods are trained alone, in Fig. 8, they have approximately equal relative RMSEs. MAS and replay both improve the results, in Figs. 9 and 10, respectively. A summary of the results is given in Table 3.

Table 3: Relative RMSE of the final output NN_4 on the full domain for the Allen-Cahn equation. For the MAS cases, the network is trained for seven values of λ_MAS, and the case with the lowest RMSE is shown.
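For intuition about the dynamics the networks must capture, one explicit finite-difference step of the Allen-Cahn equation can be sketched as follows. The reaction term 5(u − u³) is an assumption based on the common PINN benchmark form of the equation; the source fixes only c_1² = 0.0001.

```python
import numpy as np

def allen_cahn_step(u, dx, dt, c1sq=0.0001):
    """One explicit finite-difference step for an assumed benchmark
    form of the Allen-Cahn equation, u_t = c1^2 u_xx + 5 (u - u^3),
    with periodic boundary conditions in x."""
    u_xx = (np.roll(u, -1) - 2.0 * u + np.roll(u, 1)) / dx**2
    return u + dt * (c1sq * u_xx + 5.0 * (u - u**3))
```

The stiff cubic reaction combined with the small diffusion coefficient is what makes this problem hard for a single PINN trained over the whole time domain.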

Data-informed training

Batteries
This is a case where, if an additional dataset is added, it is not clear a priori which subdomain it lies in. Therefore, it is essential that the final model can predict the current accurately for the entire domain without forgetting.
For testing, a vanadium redox-flow battery (VRFB) system was selected to generate datasets. The left image in Fig. 11 shows a typical configuration of a VRFB, which consists of electrodes, current collectors, and a membrane separator. The negative and positive sides each have a storage tank to store the redox couples V^2+/V^3+ and V^4+/V^5+, respectively. We applied the MFCL method to the problem of identifying the applied charge current from a given charge voltage curve. To generate the VRFB charge curve dataset, a highly computationally efficient 2-D analytical model was utilized [39,40]. This model fully resolves the coupled physics of active species transport, electrochemical reaction kinetics, and fluid dynamics within the battery cell, thereby providing a faithful representation of the VRFB system. Further details on the model and its parameters can be found in [39]. Typical charge curves are visualized in the right plot of Fig. 11 for five selected current levels. For a given charge current, the battery voltage (E) is calculated at different state-of-charge (SOC) values to form the charge curve, which is used as input data. The applied charge current I which gives rise to the charge curve is the output quantity we want to predict. We test two network architectures: a wide network, which has two hidden layers with 80 neurons each, and a deeper and narrower network, which has three hidden layers with 40 neurons each. We first train with the single-fidelity and multifidelity approaches alone, see Fig. 12. The multifidelity continual learning results show less forgetting than those from the single-fidelity continual learning.
We then consider the impact of adding MAS. We consider the narrow and wide networks with and without MAS scaling, for a total of four cases. The multifidelity MAS results show significant improvement, see Fig. 13. In Fig. 14, we compare the performance across values of the MAS hyperparameter λ_MAS. We see that the single-fidelity approach is robust, in the sense that its performance is insensitive to the value of λ_MAS; however, it is not very accurate. On the other hand, the multifidelity approach can be substantially more accurate than the single-fidelity approach for most values of λ_MAS. Overall, the multifidelity results significantly outperform the single-fidelity results.

Energy consumption
To provide a second example of data-informed continual learning, we consider the city-scale daily energy consumption dataset from [41]. The dataset consists of daily energy usage for three metropolitan areas, New York, Sacramento, and Los Angeles, along with daily weather data. Three years of data are used as a training set, with an additional year as a test set.
Energy usage depends strongly on the weather, with air conditioner usage in the warmer months and heating in the winter months. Therefore, to provide different tasks to the continual learning training, we divide the three years of training data by quarter: Task 1 has training data from January to March, Task 2 from April to June, Task 3 from July to September, and Task 4 from October to December. The test set for all tasks is to predict the energy usage from July 2018 to June 2019. An illustration of the testing and training data divided into tasks is given in Fig. 15.
We train both single-fidelity and multifidelity networks with and without MAS. We consider a range of λ_MAS ∈ [0.001, 100]. We also compare with training a network without continual learning; in this case, a single-fidelity DNN receives all of the training data from all four tasks to predict the energy usage from July 2018 to June 2019. This case serves as a benchmark for the reasonable level of error we can expect from our model using continual learning. The results are shown in Table 4. We note that in all cases, the multifidelity continual learning approach outperforms the single-fidelity continual learning. Including MAS does improve the results, as shown in Fig. 16. The continual learning methods do perform worse than the case with no continual learning, which is expected because they never have access to all the training data simultaneously. A comparison of the RMSE for each value of λ_MAS tested is given in Fig. 17. We note that, overall, the multifidelity approach with MAS is more robust than the single-fidelity training with MAS, resulting in a smaller RMSE across a range of λ_MAS.

Table 4: RMSE (GWh) of the final output NN_4, by city, on the full test domain for the energy consumption case. For the MAS cases, the network is trained for seven values of λ_MAS, and the case with the lowest RMSE is shown.

Discussion and future work
We have introduced a novel continual learning method based on multifidelity deep neural networks. The premise of the method is the existence of correlations between the output of previously trained models and the desired output of the model on the current training dataset. The discovery and use of these correlations can limit catastrophic forgetting. On its own, the multifidelity continual learning method has shown robustness and limited forgetting across several datasets, for both physics-informed and data-driven training examples. Additionally, it can be combined with existing continual learning methods, including replay and memory aware synapses (MAS), to further limit catastrophic forgetting.
The proposed continual learning method is especially suited for physical problems where the data satisfy the same physical laws on each domain, or for physics-informed neural networks, because in these cases we expect a strong correlation between the output of the previous model and the model on the current training domain. As a result of exploiting the correlation between data in the various domains instead of training from scratch for each domain, the method can afford to continue learning in new domains using smaller networks. Specifically, its training accuracy is more robust to the size of the network employed in the new domain. This can lead to computational savings during both training and inference. The approach is particularly suited for situations where privacy concerns limit access to prior datasets. It can also offer new possibilities in the area of federated learning by allowing the design of new algorithms for processing sensor data in a distributed fashion. These topics are under investigation and results will be reported in a future publication.

Figure 3: Results from training a single PINN to satisfy Eqs. 14 and 15 (solid lines) compared with the exact solution (dotted line) for s_1 (left) and s_2 (right). The results decay to zero quickly, and the learned solution does not agree well with the exact solution.

Figure 4: Results from training the single fidelity (a) and multifidelity (b) models alone to satisfy Eqs. 14 and 15, compared with the exact solution (dash-dotted line) for s_1 (left) and s_2 (right). Of particular importance is the final network, NN_5 (blue solid line), which is trained on Ω_5 = [8, 10]. While the multifidelity results in (b) have significant errors, they are substantially better than the single fidelity results in (a). In the single fidelity training, each network NN_i is only accurate on the subdomain Ω_i, and extrapolation outside Ω_i presents significant difficulties.

Figure 5: Results from training the single fidelity (a) and multifidelity (b) models with MAS to satisfy Eqs. 14 and 15, compared with the exact solution (dash-dotted line) for s_1 (left) and s_2 (right). Of particular importance is the final network, NN_5 (blue solid line), which is trained on Ω_5 = [8, 10]. The simulations plotted here have the smallest RMSEs of NN_5 on Ω of any of the sets of hyperparameters tested. In the single fidelity case, MAS appears to impose training restrictions that are too strict, and later networks NN_i are no longer accurate on their respective domains Ω_i. For the multifidelity training, the solutions are accurate across a wider portion of the full domain, and the RMSE is decreased compared with multifidelity training alone.

Figure 6: Results from training the single fidelity (a) and multifidelity (b) models with replay to satisfy Eqs. 14 and 15, compared with the exact solution (dash-dotted line) for s_1 (left) and s_2 (right). Both cases show very limited forgetting.

Figure 7: Results from training a single PINN for the Allen-Cahn equation. The bottom figures are taken at t = 0.25 (left) and t = 0.75 (right). While the PINN trains well until about t = 0.3, the solution degrades with increasing t.

Figure 8: NN_4 results from training single fidelity and multifidelity PINNs alone for the Allen-Cahn equation. The bottom figures are taken at t = 0.25 (left) and t = 0.75 (right). The multifidelity results have errors about half as large as those of the single fidelity results.

Figure 9: NN_4 results from training single fidelity and multifidelity PINNs with MAS for the Allen-Cahn equation. The bottom figures are taken at t = 0.25 (left) and t = 0.75 (right). These results represent the best MAS results from all sets of hyperparameters considered. The multifidelity results have errors about a quarter as large as those of the single fidelity results.

Figure 10: NN_4 results from training single fidelity and multifidelity PINNs with replay for the Allen-Cahn equation. The bottom figures are taken at t = 0.25 (left) and t = 0.75 (right).

Figure 11: The VRFB system used for battery data generation (left). Sample charge curves at different charge currents (right).

Figure 12: Results from the single fidelity (a) and multifidelity (b) training alone for the battery test case. The left column shows the network outputs of each task on all the tasks, and the right column shows the RMSE of each task tested on each other task. The results in this figure use the narrow architecture, with three hidden layers and 40 neurons per layer.

Figure 13: Results from the single fidelity (a) and multifidelity (b) training with MAS for the battery test case. The single fidelity case struggles to train accurately, while the multifidelity case shows very limited forgetting. The left column shows the network outputs of each task on all the tasks, and the right column shows the RMSE of each task tested on each other task. The results shown represent the best output from the MAS hyperparameters tested. For the single fidelity case, the results are from the narrow network with λ_MAS = 100, and for the multifidelity case, the results are from the wide network with λ_MAS = 0.001.

Figure 14: Comparison of the RMSE with MAS for the redox flow battery test case. The RMSE is lower for almost all the multifidelity test cases in comparison with the single fidelity test cases.

Figure 15: Illustration of the datasets used in the energy consumption example.

Figure 16: Results of the energy consumption problem: (left) results without MAS; (right) results with MAS.

Figure 17: Comparison of the RMSEs generated by training with MAS for the energy consumption problem.