Inverse-Dirichlet Weighting Enables Reliable Training of Physics Informed Neural Networks

We characterize and remedy a failure mode that may arise from multi-scale dynamics with scale imbalances during training of deep neural networks, such as Physics Informed Neural Networks (PINNs). PINNs are popular machine-learning templates that allow for seamless integration of physical equation models with data. Their training amounts to solving an optimization problem over a weighted sum of data-fidelity and equation-fidelity objectives. Conflicts between objectives can arise from scale imbalances, heteroscedasticity in the data, stiffness of the physical equation, or from catastrophic interference during sequential training. We explain the training pathology arising from this and propose a simple yet effective inverse-Dirichlet weighting strategy to alleviate the issue. We compare with Sobolev training of neural networks, providing the baseline of analytically ε-optimal training. We demonstrate the effectiveness of inverse-Dirichlet weighting in various applications, including a multi-scale model of active turbulence, where we show orders of magnitude improvement in accuracy and convergence over conventional PINN training. For inverse modeling using sequential training, we find that inverse-Dirichlet weighting protects a PINN against catastrophic forgetting.


I. INTRODUCTION
Data-driven modeling has emerged as a powerful and complementary approach to first-principles modeling. It was made possible by advances in imaging and measurement technology, computing power, and innovations in machine learning, in particular neural networks. The success of neural networks in data-driven modeling can be attributed to their powerful function approximation properties for diverse functional forms, from image recognition [1] and natural language processing [2] to learning policies in complex environments [3].
The wide application scope of PINNs rests on their parameterization as deep neural networks, which can be tuned by minimizing a weighted sum of data-fitting and equation-fitting errors. However, finding network parameters that impartially optimize for several errors is challenging when errors impose conflicting requirements. For Multi-Task Learning (MTL) problems in machine learning, this issue has been addressed by various strategies for relative weighting of conflicting objectives, e.g., based on uncertainty [26], gradient normalization [27], or Pareto optimality [28]. The PINN community in parallel has developed heuristics for loss weighting with demonstrated gains in accuracy and convergence [29]. However, there is neither consensus on when to use such strategies in PINNs, nor are there optimal benchmark solutions for objectives based on differential equations.
Here, we characterize training pathologies of PINNs and provide criteria for their occurrence.
We explain these pathologies by showing a connection between PINNs and Sobolev training, and we propose a strategy for loss weighting in PINNs based on the Dirichlet energy of the task-specific gradients. We show that this strategy reduces optimization bias and protects against catastrophic forgetting. We evaluate the proposed inverse-Dirichlet weighting by comparing with Sobolev training in a case where provably optimal weights can be derived, and by empirical comparison with two conceptually different state-of-the-art PINN weighting approaches.

II. A PHYSICS INFORMED NEURAL NETWORK IS A MULTI-OBJECTIVE OPTIMIZATION PROBLEM
An optimization problem involving multiple tasks or objectives can be written as

$$\min_{\theta^{sh},\,\theta^k} \; L_k\!\left(\theta^{sh}, \theta^k\right), \quad k = 1, \dots, K. \tag{1}$$
The total number of tasks is K, θ^sh are the parameters shared between the tasks, and θ^k are the task-specific parameters. For conflicting objectives, there is no single optimal solution, and Pareto-optimal solutions can be found using heuristics like evolutionary algorithms [30]. Alternatively, the multi-objective optimization problem can be converted to a single-objective optimization problem by scalarization, e.g., as a weighted sum with real-valued relative weights λ_k associated with each loss objective L_k:

$$L\!\left(\theta^{sh}, \theta^1, \dots, \theta^K\right) = \sum_{k=1}^{K} \lambda_k\, L_k\!\left(\theta^{sh}, \theta^k\right). \tag{2}$$

Often, the λ_k are dynamically tuned along training epochs [28]. The weights λ_k intuitively quantify the extent to which we aim to minimize the objective L_k: large λ_k favor small L_k, and vice versa [31]. For K ≥ 2, manual tuning and grid search become prohibitively expensive for the large parameterizations common in deep neural networks.

The loss function of a PINN is a scalarized multi-objective loss in which each objective corresponds to either a data-fidelity term or a constraint derived from the physics of the underlying problem [4,5]. A typical PINN loss contains five objectives:

$$L(\theta) = \frac{\lambda_1}{N_{int}} \sum_{i=1}^{N_{int}} \left\| \partial_t u_\theta(t_j, x_i^{\Omega}) - \mathcal{N}\!\left(u_\theta, \Sigma\right) \right\|^2 + \frac{\lambda_2}{N_B} \sum_{i=1}^{N_B} \left\| \mathcal{B}\!\left(u_\theta(t_j, x_i^{\partial\Omega})\right) \right\|^2 + \frac{\lambda_3}{N_I} \sum_{i=1}^{N_I} \left\| \mathcal{I}\!\left(u_\theta(0, x_i^{\Omega})\right) \right\|^2 + \frac{\lambda_4}{N_{int}} \sum_{i=1}^{N_{int}} \left\| \mathcal{F}\!\left(u_\theta(t_j, x_i^{\Omega})\right) \right\|^2 + \frac{\lambda_5}{N_{int}} \sum_{i=1}^{N_{int}} \left\| u_\theta(t_j, x_i^{\Omega}) - u_i \right\|^2. \tag{3}$$

We distinguish the neural network approximation u_θ(t_j, x_i^Ω) from the training data u_i = u(t_j, x_i^Ω), sampled at the N_int space-time points (t_j, x_i^Ω) in the interior of the solution domain Ω, N_B points (t_j, x_i^∂Ω) on the boundary ∂Ω, and N_I samples of the initial state at time t = 0. The weights λ_1, λ_2, and λ_3 weight the residuals of the physical equation ∂_t u = N(u, Σ), with right-hand side N and coefficients Σ, the boundary condition B, and the initial condition I, respectively. The objective with weight λ_4 is an additional constraint F on the state variable u, such as divergence-freeness (∇ · u = 0) of the velocity field of an incompressible flow. The weight λ_5 penalizes deviations of the neural network approximation from the training data.
From now on, we simplify the notation by designating space-time points as x_i := (t, x)_i = (t_j, x_k) in time-dependent problems.
PINNs can be used to solve both forward and inverse modeling problems for ordinary differential equations (ODEs) and partial differential equations (PDEs). In the forward problem, a numerical approximation to the solution of an ODE or PDE is to be learned. In this case, θ^sh are the network parameters, and θ^k is the empty set. In the inverse problem, the coefficients of an ODE or PDE are to be learned from data such that the equation describes the dynamics in the data. Then, the task-specific parameters θ^k correspond to those coefficients (ξ ⊆ Σ) of the physical equation that shall be inferred. During training of a PINN, the parameters (θ^sh, θ^k) are determined by minimizing the total loss L for given data u(x_i).

A. Sobolev training is a special case of PINN training
Like most deep neural networks, PINNs are usually trained by first-order parameter updates based on (approximate) loss gradients. The update rule can be interpreted as a discretization of the gradient flow:

$$\theta^{\tau+1} = \theta^{\tau} - \eta(\tau) \sum_{k=1}^{K} \lambda_k \frac{1}{|B|} \sum_{x_i \in B} \nabla_\theta\, \ell_k(x_i). \tag{4}$$

The learning rate η(τ) depends on the training epoch τ, reflecting the adaptive step sizes used in optimizers like Adam [32]. The gradients are batch gradients, where B is a minibatch created by randomly splitting the N training data points into N/|B| distinct partitions, and ℓ_k is one summand (for one data point x_i) of the loss L_k corresponding to the k-th objective. For batch size |B| = 1, minibatch gradient descent reduces to stochastic gradient descent.
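To make the update rule concrete, here is a minimal NumPy sketch of first-order descent on a scalarized loss as in Eq. 2. The toy quadratic objectives and all names (`sgd_step`, `grad_fns`) are illustrative, not from the paper's code.

```python
import numpy as np

def sgd_step(theta, grad_fns, lambdas, lr):
    """One descent step on the scalarized loss sum_k lambda_k * L_k."""
    total_grad = np.zeros_like(theta)
    for lam, grad_fn in zip(lambdas, grad_fns):
        total_grad += lam * grad_fn(theta)  # weighted per-objective gradient
    return theta - lr * total_grad

# Toy conflicting objectives L_1 = ||theta - a||^2, L_2 = ||theta - b||^2:
a, b = np.array([1.0, 0.0]), np.array([0.0, 1.0])
grads = [lambda th: 2 * (th - a), lambda th: 2 * (th - b)]

theta = np.zeros(2)
for _ in range(500):
    theta = sgd_step(theta, grads, lambdas=[0.5, 0.5], lr=0.1)
# with equal weights, the iteration converges to the midpoint (a + b) / 2
```

With equal weights the stationary point is the compromise (a + b)/2; changing the λ_k moves the solution along the Pareto front between the two minimizers.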
Adding derivatives D^m_x, m ≥ 1, of the state variable to the training loss has been shown to improve the data efficiency and generalization power of neural networks [7]. Minimizing the resulting loss using the update rule in Eq. 4 can be interpreted as a finite approximation to the Sobolev gradient flow of the loss functional. The neural network is then trained with access to the (K+2)-tuples

$$\left( x_i,\; u(x_i),\; D_x u(x_i),\; D^2_x u(x_i),\; \dots,\; D^K_x u(x_i) \right), \quad i = 1, \dots, N,$$

leading to the Sobolev loss

$$L = \frac{\lambda_0}{N} \sum_{i=1}^{N} \left\| u_\theta(x_i) - u(x_i) \right\|^2 + \sum_{k=1}^{K} \frac{\lambda_k}{N} \sum_{i=1}^{N} \left\| D^k_x u_\theta(x_i) - D^k_x u(x_i) \right\|^2. \tag{5}$$

Such Sobolev training is an instance of scalarized multi-objective optimization with each objective based on the approximation of a specific derivative. The weights are usually chosen as λ_0 = . . . = λ_K = 1 [7]. The additional information from the derivatives introduces an inductive bias, which has been found to improve data efficiency and network generalization [7]. This is akin to PINNs, which use derivatives from the PDE model to achieve inductive bias. Indeed, the Sobolev loss in Eq. 5 is a special case of the PINN loss in Eq. 3, with each objective based on a PDE with right-hand side N = D^k_x u, k = 1, . . . , K.
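As a sketch of what a Sobolev loss looks like in code, the following evaluates a loss of the form of Eq. 5 on a 1D grid, approximating the derivatives with `np.gradient`. In a real PINN the derivatives of u_θ come from automatic differentiation, so this is an illustration under that simplification; the function name and defaults are assumptions.

```python
import numpy as np

def sobolev_loss(u_pred, u_data, x, K=2, lambdas=None):
    """Weighted sum of L2 errors of the function and its first K derivatives."""
    lambdas = lambdas if lambdas is not None else [1.0] * (K + 1)
    dp, dd = u_pred, u_data
    loss = lambdas[0] * np.mean((dp - dd) ** 2)        # zeroth-order term
    for k in range(1, K + 1):
        # finite-difference k-th derivatives of prediction and data
        dp, dd = np.gradient(dp, x), np.gradient(dd, x)
        loss += lambdas[k] * np.mean((dp - dd) ** 2)
    return loss

x = np.linspace(0, 2 * np.pi, 256)
u = np.sin(x)
# a perfect prediction gives zero loss; a biased prediction does not
```

Each summand of the loop is one objective L_k of the scalarized multi-objective loss.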
B. Vanishing task-specific gradients are a failure mode of PINNs

The update rule in Eq. 4 is known to lead to stiff learning dynamics for loss functionals whose Hessian matrix has large positive eigenvalues [29]. This is the case for Sobolev training, even with K = 1, when the function u(x) is highly oscillatory, leading us to the following proposition, proven in the Supplementary Material:

Proposition B.1. Given a signal u(x) with dominant frequency k_0 and the Sobolev loss with K = 1,

$$L = \frac{\lambda_1}{N} \sum_{i=1}^{N} \left| u_\theta(x_i) - u(x_i) \right|^2 + \frac{\lambda_2}{N} \sum_{i=1}^{N} \left| D^m_x u_\theta(x_i) - D^m_x u(x_i) \right|^2,$$

the rate of change of the residual of the dominant Fourier mode, r̂(k_0) = û_θ(k_0) − û(k_0), under first-order parameter update is

$$\left| \frac{\partial \hat{r}(k_0)}{\partial \tau} \right| \in O\!\left( \eta(\tau) \left( \lambda_1 + \lambda_2 k_0^{2m} \right) k_0^{-2} \left| \hat{r}(k_0) \right| \right).$$

This proposition suggests that losses containing high derivatives m, dominant high frequencies k_0, or a combination of both exhibit dominant gradient statistics. This can lead to biased optimization of the dominant objectives at the expense of the other objectives with relatively smaller gradients. This is akin to the well-known vanishing gradients phenomenon across the layers of a neural network [33], albeit now across the different objectives of a multi-objective optimization problem. We call this phenomenon vanishing task-specific gradients.
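The k_0^{2m} scaling in the proposition can be checked numerically with a one-parameter model u_θ(x) = θ sin(k_0 x), for which the gradient of the m-th derivative objective can be evaluated in closed form. The setup below is an illustrative sketch, not the paper's experiment.

```python
import numpy as np

def grad_magnitude(k0, m, theta=0.5, n=2048):
    """|dL/dtheta| for L = mean((D^m_x u_theta - D^m_x u)^2), u = sin(k0 x)."""
    x = np.linspace(0.0, 2 * np.pi, n, endpoint=False)
    # m-th spatial derivative of sin(k0 x) is k0**m * sin(k0 x + m*pi/2)
    d_target = k0**m * np.sin(k0 * x + m * np.pi / 2)   # D^m_x u
    d_model = theta * d_target                           # D^m_x u_theta
    return abs(np.mean(2 * (d_model - d_target) * d_target))

g1 = grad_magnitude(k0=1, m=2)
g4 = grad_magnitude(k0=4, m=2)
ratio = g4 / g1   # expected to scale like (4/1)**(2*2) = 256
```

The derivative objective for the higher wavenumber dominates the total gradient by a factor k_0^{2m}, so uniform weights effectively ignore the low-frequency objective.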
Proposition B.1 also admits a qualitative comparison to the phenomenon of numerical stiffness in PDEs of the form

$$\frac{\partial u}{\partial t} = \mathcal{L} u + \mathcal{N}(u),$$

with a linear operator L and (possibly) a nonlinear part N. If the real parts of the eigenvalues of L are large, c = Re(λ(L)) ≫ 1, then the solution of the above PDE has fast modes u ∈ O(1) with d/dt ∈ O(c) and slow modes u ∈ O(1/c) with d/dt ∈ O(1). In words, high-frequency modes evolve on shorter time scales than low-frequency modes. The rate amplitude for spatial derivatives of order m is d/dt ∈ O(k^m) for wavenumber k [34]. Due to this analogy, we call the learning dynamics of a neural network with vanishing task-specific gradients stiff. Stiff learning dynamics leads to a discrepancy in the convergence rates of the different objectives of a loss function. Taken together, vanishing task-specific gradients constitute a likely failure mode of PINNs, since their training amounts to an extended version of Sobolev training.

III. STRATEGIES FOR BALANCED PINN TRAINING
Since PINN training minimizes a weighted sum of multiple objectives (Eq. 3), the result depends on appropriately chosen weights. The ratio λ_i/λ_j, i ≠ j, between any two weights defines the relative importance of the two objectives during training. Suitable weighting is therefore essential for finding a Pareto-optimal solution that balances all objectives. While ad hoc manual tuning is still commonplace, it has recently been proposed [29] to determine the weights as moving averages over the inverse gradient-magnitude statistics of a given objective:

$$\hat{\lambda}_k(\tau) = \frac{\max\left\{ \left| \nabla_{\theta^{sh}} L_1(\tau) \right| \right\}}{\overline{\left| \nabla_{\theta^{sh}} L_k(\tau) \right|}}, \qquad \lambda_k(\tau) = \alpha\, \lambda_k(\tau - 1) + (1 - \alpha)\, \hat{\lambda}_k(\tau), \tag{7}$$

where ∇_{θ^sh} L_1 is the gradient of the residual, i.e., of the first objective in Eq. 3, and α is a user-defined learning rate (here always α = 0.5, see Supplementary Material). All gradients are taken with respect to the parameters shared across tasks, and | · | is the component-wise absolute value. The overbar denotes the algebraic mean over the vector components, and the maximum in the numerator is taken over all components of the gradient vector. While this weighting has been shown to improve PINN performance [29], it does not prevent vanishing task-specific gradients.
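A minimal sketch of this max/avg weighting, assuming the moving-average convention λ_k(τ) = α λ_k(τ−1) + (1−α) λ̂_k(τ) (with α = 0.5 the two orderings coincide). The toy gradient vectors stand in for back-propagated gradients with respect to the shared parameters.

```python
import numpy as np

def max_avg_weights(grads, prev_lambdas, alpha=0.5):
    """grads: list of 1D gradient arrays, grads[0] = equation-residual objective."""
    ref = np.max(np.abs(grads[0]))                     # max over components of grad L_1
    lam_hat = [ref / np.mean(np.abs(g)) for g in grads]
    # moving average over training epochs
    return [alpha * lo + (1 - alpha) * lh
            for lo, lh in zip(prev_lambdas, lam_hat)]

grads = [np.array([4.0, -2.0]),    # residual objective
         np.array([0.1, 0.1])]     # weak objective gets boosted
lams = max_avg_weights(grads, prev_lambdas=[1.0, 1.0], alpha=0.5)
```

The weak objective receives a large weight (here λ̂_2 = 4/0.1 = 40 before averaging), but the statistic only uses mean magnitudes, not the spread of the gradient components.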

A. Inverse-Dirichlet weighting
Instead of determining the relative weights proportional to the inverse average gradient magnitude, we here propose to use weights based on the gradient variance. In particular, we propose weights for which the variances over the components of the back-propagated weighted gradients λ_k ∇_θ L_k become equal across all objectives, thus directly preventing vanishing task-specific gradients. This can be achieved in any (stochastic) gradient descent optimizer by using the update rule in Eq. 4 with weights

$$\hat{\lambda}_k = \frac{\max_{t=1,\dots,K} \left\{ \operatorname{std}\!\left( \nabla_\theta L_t \right) \right\}}{\operatorname{std}\!\left( \nabla_\theta L_k \right)}, \quad k = 1, \dots, K. \tag{8}$$

This guarantees that all weighted gradients have the same variance over the gradient components, i.e., that

$$\operatorname{std}\!\left( \hat{\lambda}_k \nabla_\theta L_k \right) = \operatorname{std}\!\left( \hat{\lambda}_t \nabla_\theta L_t \right) \quad \forall\, k, t \in \{1, \dots, K\}.$$

Since the popular Adam optimizer is invariant to diagonal scaling of the gradients [32], the weights in Eq. 8 can in this case efficiently be computed as the inverse of the averaged squared gradients of the task-specific objectives,

$$\hat{\lambda}_k \propto \left( \overline{ \left( \nabla_\theta L_k \right)^2 } \right)^{-1},$$

which is in turn proportional to the inverse of the Dirichlet energy of the objective. We therefore refer to this weighting strategy as inverse-Dirichlet weighting.
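In code, the weight computation reduces to a few lines. The sketch below uses random vectors as stand-ins for the task gradients and checks that, after weighting, both gradient distributions have equal spread; names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def inverse_dirichlet_weights(grads):
    """Weights so that std over components of lambda_k * grad(L_k) is equal."""
    stds = np.array([np.std(g) for g in grads])
    return stds.max() / stds

grads = [10.0 * rng.standard_normal(1000),    # dominant objective
         0.01 * rng.standard_normal(1000)]    # near-vanishing objective
lams = inverse_dirichlet_weights(grads)

# after weighting, both gradient distributions have the same spread
spreads = [lam * np.std(g) for lam, g in zip(lams, grads)]
```

The dominant objective keeps weight 1 while the near-vanishing one is amplified by the ratio of standard deviations, which is exactly the equalization property stated above.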
The weighting strategies in Eqs. 7 and 8 are fundamentally different. Eq. 7 weights objectives in inverse proportion to certainty as quantified by the average gradient magnitude. Eq. 8 uses weights that are inversely proportional to uncertainty as quantified by the variance of the loss gradients. Equation 8 is also different from previous uncertainty-weighted approaches [26] because it measures the training uncertainty stemming from noise in the loss gradients rather than the uncertainty from the model's observational noise.

B. Gradient-based multi-objective optimization
We compare weighting-based scalarization approaches with gradient-based multi-objective approaches that use the Karush-Kuhn-Tucker (KKT) local optimality conditions to find a descent direction that decreases all objectives. Specifically, we adapt the multiple gradient descent algorithm (MGDA) [28], which leverages the KKT conditions

$$\sum_{k=1}^{K} \lambda_k \nabla_{\theta^{sh}} L_k = 0, \qquad \lambda_k \geq 0 \;\; \forall k, \qquad \sum_{k=1}^{K} \lambda_k = 1 \tag{9}$$

to solve an MTL problem. The MGDA is guaranteed to converge to a Pareto-stationary point [35]. Given the gradients ∇_{θ^sh} L_k of all objectives k = 1, . . . , K with respect to the shared parameters, we find weights λ_k that satisfy the above KKT conditions. The resulting λ leads to a descent direction that improves all objectives [28]; see also Suppl. Fig. A.1. We adapt the MGDA to PINNs as described in the Materials and Methods section, and refer to this adapted algorithm as pinn-MGDA.
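For K = 2 objectives, the MGDA subproblem of finding the minimum-norm convex combination of the two task gradients has a well-known closed form (the same two-gradient step that the Frank-Wolfe solver of the general case iterates); the sketch below is illustrative, not the paper's implementation.

```python
import numpy as np

def min_norm_2(g1, g2):
    """Return (gamma, d) with d = gamma*g1 + (1-gamma)*g2 of minimal norm."""
    denom = np.dot(g1 - g2, g1 - g2)
    if denom == 0.0:
        gamma = 0.5                                   # gradients coincide
    else:
        # minimizer of ||g2 + gamma*(g1 - g2)||^2, clipped to the simplex
        gamma = np.clip(np.dot(g2 - g1, g2) / denom, 0.0, 1.0)
    return gamma, gamma * g1 + (1 - gamma) * g2

g1 = np.array([1.0, 0.0])
g2 = np.array([0.0, 1.0])
gamma, d = min_norm_2(g1, g2)
# -d is a common descent direction: d has positive inner product with g1 and g2
```

If d = 0, the current point is Pareto-stationary (the KKT conditions of Eq. 9 hold); otherwise stepping along −d decreases both objectives.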

IV. RESULTS
As we posit above, PINNs are expected to fail for problems with dominant high frequencies, i.e., with high k_0 or high derivatives m, as is typical for multi-scale PDE models. We therefore present results of numerical experiments for such cases and compare the approximation properties of PINNs under different weighting schemes. This showcases learning failure modes originating from vanishing task-specific gradients.
We first consider Sobolev training of neural networks, which is a special case of PINN training, as shown above.
Due to the linear nature of the Sobolev loss (Eq. 5), this test case allows us to compute ε-optimal analytical weights (see Materials and Methods), providing a baseline for our comparison. Second, we consider a real-world example of a nonlinear PDE, using PINNs to model active turbulence.
In the first example, we specifically consider the Sobolev training problem with target function derivatives up to fourth order, i.e., K = 4, and total loss

$$L = \frac{\lambda_0}{N} \sum_{i=1}^{N} \left| u_\theta(x_i) - u(x_i) \right|^2 + \sum_{k=1}^{4} \frac{\lambda_k}{N} \sum_{i=1}^{N} \left| \hat{\xi}_k\, D^k_x u_\theta(x_i) - D^k_x u(x_i) \right|^2. \tag{10}$$

The neural network is trained on the data u(x_i) and their derivatives. In addition to producing an accurate estimate u_θ(x_i) of the function, the network is also tasked with inferring the unknown scalar prefactors ξ̂ = (ξ̂_1, ξ̂_2, ξ̂_3, ξ̂_4), whose true values are chosen as ξ_k = 1. This mimics inverse-modeling scenarios in which unknown coefficients of the PDE model need to be inferred from data. For this loss, ε-optimal weights can be determined analytically (see Materials and Methods) such that all objectives are minimized in an unbiased fashion [36]. This provides the baseline against which we benchmark the different weighting schemes.

A. Inverse-Dirichlet weighting avoids vanishing task-specific gradients
We characterize the phenomenon of vanishing task-specific gradients, demonstrate its impact on the accuracy of a learned function approximation, and empirically validate Proposition B.1. For this, we consider a generic neural network with 5 layers and 64 neurons per layer, tasked with learning a 2D function that contains a wide spectrum of frequencies, along with the unknown coefficients ξ̂ = (ξ̂_1, ξ̂_2, ξ̂_3, ξ̂_4), using Sobolev training with the loss from Eq. 10. The details of the test problem and the training setup are given in the Supplementary Material.
We first confirm that in this test case vanishing task-specific gradients occur when using uniform weights, i.e., when setting λ_k = 1 for all k = 0, . . . , 4. For this, we plot the gradient histograms of the five loss terms in Fig. 1A. The histograms clearly show that training is biased in this case to mainly optimize the objective containing the highest derivative, k = 4. All other objectives are neglected, as predicted by Proposition B.1. This leads to a uniformly high approximation error on the test data (training vs. test data split 50:50, see Supplementary Material), as shown in Fig. 2B, as well as a uniformly high error in the estimated coefficients ξ (Fig. 2C). When determining the weights using the Pareto-seeking pinn-MGDA, the accuracy improves considerably (Fig. 2B,C). However, the phenomenon of vanishing task-specific gradients is still present, as seen in Fig. 1B, this time favoring the objective for k = 0 and neglecting the derivatives. Max/avg weighting, as proposed in Ref. [29] and given in Eq. 7, improves the accuracy for higher derivatives (Fig. 2B), but still suffers from unbalanced gradient distributions (Fig. 1C), which leads to a uniformly high error in the estimated coefficients ξ (Fig. 2C). The inverse-Dirichlet weighting proposed here and detailed in Eq. 8 shows balanced gradient histograms (Fig. 1D), closest to those observed when using ε-optimal analytical weights (Fig. 1E). This leads to orders of magnitude better accuracy in both the function approximation and the coefficient estimates, as shown in Fig. 2B,C, reaching the performance achieved with ε-optimal weights. This is confirmed by the power spectrum of the function approximation u_θ learned by the neural network (Fig. 2D). Inverse-Dirichlet weighting leads to results as good as those achieved with optimal weights in this test case, in particular for approximating high frequencies, as predicted by Proposition B.1.
A closer look at the power spectra learned with the different weighting strategies also reveals that the max/avg strategy fails to capture high frequencies, whereas the pinn-MGDA performs surprisingly well (Fig. 2D) despite its unbalanced gradients (Fig. 1B). This may be problem-specific, as the weight trajectories taken by the pinn-MGDA during training (Suppl. Fig. B.1) differ fundamentally from the behavior of the other methods, presumably because the pinn-MGDA uses a strategy purely based on satisfying the Pareto-stationarity conditions (Eq. 9). As shown in Suppl. Fig. B.2, this leads to a rapid attenuation of the Fourier spectrum of the residual |r̂(k)| over the first few learning epochs in this test case, allowing the pinn-MGDA to escape the stiffness described in Proposition B.1.
In summary, we observe stark differences between different weighting strategies. Of the strategies compared, only the inverse-Dirichlet weighting proposed here is able to avoid vanishing task-specific gradients, performing on par with the ǫ-optimal solution. The pinn-MGDA seems to be able to escape, rather than avoid, the stiffness caused by vanishing task-specific gradients. We also observe a clear correlation between having balanced effective gradient distributions and achieving good approximation and estimation accuracy.
B. Inverse-Dirichlet weighting enables multi-scale modeling using PINNs

We investigate the use of PINNs for modeling multi-scale dynamics in space and time, where vanishing task-specific gradients are a common problem. As a real-world test problem, we consider meso-scale active turbulence as it occurs in living fluids, such as bacterial suspensions [37,38] and multi-cellular tissues [39]. Unlike inertial turbulence, active turbulence has energy injected at small length scales, comparable to the size of the actively moving entities, like the bacteria or individual motor proteins. The spatiotemporal dynamics of active turbulence can be modeled by the generalized incompressible Navier-Stokes equation [37,38]:

$$\partial_t u + \xi_0 (u \cdot \nabla) u = -\nabla p + \xi_1 \nabla |u|^2 - \alpha u - \beta |u|^2 u + \Gamma_0 \nabla^2 u - \Gamma_2 \nabla^4 u, \qquad \nabla \cdot u = 0, \tag{11}$$

where u(t, x) is the mean-field flow velocity and p(t, x) the pressure field. The coefficient ξ_0 scales the contribution from advection and ξ_1 the active pressure. The α and β terms correspond to a quartic Landau-type velocity potential. The higher-order terms with coefficients Γ_0 and Γ_2 model passive and active stresses arising from hydrodynamic and steric interactions, respectively [38]. Supplementary Figure C.1 illustrates the rich multi-scale dynamics observed when numerically solving this equation on a square domain for the parameter regime Γ_0 < 0 and Γ_2 > 0 [37] (see Supplementary Material for details on the numerical methods used). An important characteristic of Eq. 11 is the presence of disparate length and time scales, in addition to fourth-order derivatives and nonlinearities in u, which jointly lead to considerable numerical stiffness [34]. Given such large dominant wavenumbers k_0, any data-driven approach, forward or inverse, is expected to encounter significant learning pathologies as per Proposition B.1. We demonstrate these pathologies by applying PINNs to approximate the forward solution of Eq. 11 in 2D for the parameter regime explored in Ref. [37], in both square (Fig. C.1) and annular (Fig. 3E) domains.
The details of the simulation and training setup are given in the Supplementary Material. The loss function includes four terms, as given in Eq. 3, corresponding to the PDE residual, the initial and boundary conditions, and the divergence-freeness constraint. Figure 3A,B shows the relative L2 errors (averaged over different initializations of the neural network weights) of the vorticity field over training epochs when using the different PINN weighting strategies. As predicted by Proposition B.1, the uniformly weighted PINN fails completely, both in the square (Fig. 3A) and in the annular geometry (Fig. 3B). In this case, optimization is entirely biased towards minimizing the PDE residual, neglecting the initial, boundary, and incompressibility constraints. This imbalance is also clearly visible in the loss-gradient distributions shown in Suppl. Fig. C.4A. Comparing the approximation accuracy after 7000 training epochs confirms the findings from the previous section, with inverse-Dirichlet weighting outperforming the other methods, followed by pinn-MGDA and max/avg weighting. This is made possible by assigning adequate weights to the initial and boundary conditions relative to the equation residual (see Suppl. Fig. C.2A,D).
In Fig. 3C we compare the convergence of the learned approximation u_θ to the ground-truth numerical solution for different spatial resolutions h of the discretization grid. The errors scale as O(h^r), with convergence orders r as identified by the dashed lines. While the pinn-MGDA and max/avg weighting achieve linear convergence (r = 1), the inverse-Dirichlet weighted PINN converges with order r = 4. The uniformly weighted PINN does not converge at all and is therefore not shown in the plot.
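Such convergence orders can be estimated from errors measured at two resolutions via the standard formula r = log(e_1/e_2) / log(h_1/h_2), as in the following sketch (the error values are illustrative, not the paper's measurements):

```python
import numpy as np

def convergence_order(h1, e1, h2, e2):
    """Empirical order r assuming err ~ C * h**r."""
    return np.log(e1 / e2) / np.log(h1 / h2)

# errors of a 4th-order accurate approximation drop 16x when h is halved:
r = convergence_order(0.02, 1.6e-3, 0.01, 1.0e-4)
```

Fitting the dashed reference lines in a convergence plot amounts to applying this formula between successive resolutions.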
Inverse-Dirichlet weighting also perfectly recovers the power spectrum E(k) of the active turbulence velocity field, as shown in Fig. 3D. This confirms that inverse-Dirichlet weighting enables PINNs to capture high frequencies in the PDE solution, corresponding to fine structures in the fields, by avoiding the stiffness problem from Proposition B.1. Indeed, the PINNs trained with the other weighting methods fail to predict small structures in the flow field, as shown by the spatial distributions of the prediction errors in Fig. 3F-I compared with the ground-truth numerical solution in Fig. 3E. The evolution of the Fourier spectra |r̂(k)| of the residuals of both velocity components (u, v) = u is shown in the Supplementary Material.

In summary, our results suggest that avoiding vanishing task-specific gradients can enable the use of PINNs in multi-scale problems previously not amenable to neural-network modeling. Moreover, we observe that seemingly small changes in the weighting can lead to significant differences in the convergence order of the learned approximation.

C. Inverse-Dirichlet weighting protects against catastrophic forgetting
Artificial neural networks tend to "forget" their parameterization when trained on different tasks sequentially [40], a phenomenon referred to as catastrophic forgetting. Sequential training is often used, e.g., when learning to de-noise data and estimate parameters at the same time, as in combined image segmentation and denoising [41], or when learning solutions of fluid mechanics models with additional constraints like divergence-freeness of the flow velocity field [16]. In PINNs, catastrophic forgetting can occur, e.g., in the frequently practiced approach of first training on measurement data alone and adding the PDE model only at later epochs [42]. Such sequential training can be motivated by the computational cost of evaluating the derivatives in the PDE using automatic differentiation: it seems prudent to pre-condition the network on the data-fitting task and introduce the residual later to regress the remaining network parameters.
Using the active turbulence test case described above, we investigate how the different weighting schemes compared here influence the phenomenon of catastrophic forgetting. For this, we consider the inverse problem of inferring unknown model parameters and latent fields like the effective pressure p from flow velocity data u. We therefore train a PINN in three sequential steps: (1) first fitting flow velocity data, (2) later including the additional objective for the incompressibility constraint ∇ · u = 0, (3) finally including also the objective for the PDE model residual. This creates a setup in which catastrophic forgetting occurs when using uniform PINN weights, as confirmed in Fig. 4A (arrows). As soon as additional objectives (divergence-freeness at epoch 1000 and equation residual at epoch 2000) are introduced, the PINN loses the parameterization learned on the previous objectives, leading to large and increasing test error for the field u.
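The mechanism can be illustrated by a toy sketch (not the paper's PINN setup; the quadratic objectives, schedule, and learning rate are invented): a second objective with much steeper gradients is switched on mid-training. With uniform weights the new objective overwrites the first task, whereas inverse-Dirichlet-style rebalancing keeps the solution close to it.

```python
import numpy as np

def train(reweight, epochs=1000, lr=1e-3):
    a, b, c = np.array([1.0, 0.0]), np.array([0.0, 1.0]), 100.0
    theta = a.copy()                       # assume pretrained on objective 1
    for epoch in range(epochs):
        grads = [2 * (theta - a)]          # objective 1: ||theta - a||^2
        if epoch >= 300:                   # objective 2 switched on: c*||theta - b||^2
            grads.append(2 * c * (theta - b))
        if reweight and len(grads) > 1:
            # inverse-Dirichlet-style equalization of gradient spreads
            stds = np.array([np.std(g) + 1e-12 for g in grads])
            lams = stds.max() / stds
        else:
            lams = np.ones(len(grads))     # uniform weights
        theta -= lr * sum(l * g for l, g in zip(lams, grads))
    return theta

theta_uniform = train(reweight=False)   # dragged to b: task 1 forgotten
theta_invdir = train(reweight=True)     # stays near a balanced compromise
```

Uniform weights end essentially at the minimizer of the steep new objective, while the rebalanced run retains most of the first task; this is the toy analogue of the behavior in Fig. 4A.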
Dynamic weighting strategies can protect PINNs against catastrophic forgetting by adequately adjusting the weights whenever an additional objective is introduced. This is confirmed in Fig. 4A, where inverse-Dirichlet weighting and the pinn-MGDA are only minimally affected by catastrophic forgetting. The max/avg weighting strategy [29], however, does not protect the network quite as well. When looking at the test errors for the learned model coefficients (Fig. 4B) and at the accuracy of estimating the latent pressure field p(x, t) (Fig. 4C), for which no training data is given, the inverse-Dirichlet weighting proposed here outperforms the other approaches.
In summary, inverse-Dirichlet weighting can protect a PINN against catastrophic forgetting, making sequential training a viable option. In our tests, it offered the best protection of all compared weighting schemes.

V. DISCUSSION

Training a PINN amounts to a multi-objective optimization problem, which is sensitive to the choice of regularization weights. Trivial uniform weights are known to lead to failure of PINNs, which is empirically found to be exacerbated for physical models with disparate scales or high derivatives. As we have shown here, this failure can be explained by analogy to numerical stiffness of partial differential equations (PDEs) when understanding Sobolev training of neural networks [7] as a special case of PINN training. This connection enabled us to prove a proposition stating that PINN learning pathologies are caused by dominant high-frequency components or high derivatives in the physics prior. The asymptotic arguments of this proposition inspired a novel strategy for dynamically adapting the regularization weights of a PINN during training, which we termed inverse-Dirichlet weighting.
We empirically compared inverse-Dirichlet weighting with the recently proposed max/avg weighting for PINNs [29]. Moreover, we looked at PINN training from the perspective of multi-objective optimization, which avoids a priori choices of regularization weights. For this, we adapted to PINNs the state-of-the-art multiple gradient descent algorithm (MGDA) for training multi-objective artificial neural networks, which has previously been proven to converge to a Pareto-stationary solution [28]. This provided another baseline for comparison. Finally, we compared against analytically derived ε-optimal static weights in a simple linear test problem where those can be computed, providing an absolute baseline. In all comparisons, and across a wide range of problems from 2D Poisson to multi-scale active turbulence and curvature-driven level-set flow, inverse-Dirichlet weighting empirically performed best.
The inverse-Dirichlet weighting proposed here can be interpreted as an uncertainty weighting with training uncertainty quantified by the batch variance of the loss gradients. This differs from previous approaches [26] that used weights quantifying the model uncertainty rather than the training uncertainty. The proposition proven here provides an explanation for why the loss-gradient variance may be a good choice, and our results confirmed that weighting based on the Dirichlet energy of the loss objectives outperforms other weighting heuristics, as well as Pareto-front-seeking algorithms like the pinn-MGDA. The proposed inverse-Dirichlet weighting also performed best at protecting a PINN against catastrophic forgetting in sequential-training settings where loss objectives are introduced one by one, making sequential training of PINNs a viable option in practice.
While we have focused on PINNs here, our analysis applies equally to Sobolev training of other neural networks, and to other multi-objective losses, as long as training uses an explicit first-order stochastic gradient descent (SGD) optimizer, of which Adam [32] is the most commonly used. It would also be interesting to see how the weighting strategies presented in this study generalize to other first-order optimizers [43].
Taken together, we have exploited a connection between PINNs and Sobolev training to explain a likely failure mode of PINNs, and we have proposed a simple yet effective strategy to avoid training failure. The proposed inverse-Dirichlet weighting strategy is easily included in any SGD-type optimizer at almost no additional computational cost (see Suppl. Fig. C.8). It has the potential to improve the accuracy and convergence of PINNs by orders of magnitude, to enable sequential training with less risk of catastrophic forgetting, and to extend the application realm of PINNs to multi-scale problems that were previously not amenable to data-driven modeling.
MATERIALS AND METHODS

ε-optimal weights for Sobolev training

For Sobolev loss functions based on elliptic PDEs, optimal weights λ that minimize all objectives in an unbiased way can be determined by ε-closeness analysis [36].
Definition 1 ($\epsilon$-closeness). A candidate solution $u$ is called $\epsilon$-close to the true solution $\hat{u}$ if it satisfies
$$ \left| D^{\beta} u(x) - D^{\beta} \hat{u}(x) \right| \leq \epsilon $$
for any multi-index $\beta \in \mathbb{N}^d$, $|\beta| \leq K$, and $x \in \Omega \subset \mathbb{R}^d$.

For the loss function typically used in Sobolev training of neural networks, i.e.,
$$ L(u, \theta) = \sum_{k=1}^{K} \lambda_k L_k \,, \qquad L_k = \frac{1}{N} \sum_{i=1}^{N} \left\| D^{k} u_\theta(x_i) - D^{k} \hat{u}(x_i) \right\|_2^2 \,, $$
we can use $\epsilon$-closeness analysis to show that each objective is bounded as
$$ L_k \leq I_k\, \epsilon^2 \,, $$
with constants $I_k > 0$ depending on the derivative order $k$. Using this inequality, we can bound the total loss as
$$ L(u, \theta) \leq \epsilon^2 \sum_{k=1}^{K} \lambda_k I_k \,. \tag{A4} $$
This bound can be used to determine the weights $\{\lambda_k\}_{k=1}^{K}$ that result in the smallest $\epsilon$. The weights $\lambda_k$, $\forall k \in 1, \ldots, K$, that lead to the tightest upper bound on the total loss in Eq. A4 can be found by solving the maximization-minimization problem described in the following corollary:

Corollary 1.1. The maximization-minimization problem can be transformed into a piecewise-linear convex optimization problem by introducing an auxiliary variable $z$, such that
$$ \text{maximize } z \quad \text{subject to} \quad \mathbf{1}^{\top} \lambda^{*} = 1 \,; \quad z \leq \lambda_k I_k \,, \; k = 1, \ldots, K \,. $$
The analytical solution to this problem exists and is given by
$$ \lambda_k^{*} = \frac{\prod_{j \neq k} I_j}{\sum_{k=1}^{K} \prod_{j \neq k} I_j} \,, \qquad k = 1, \ldots, K \,, $$
which is equivalent to $\lambda_k^{*} \propto 1/I_k$. For a Sobolev training problem with the loss given by Eq. 5, the weights that lead to $\epsilon$-optimal solutions are computed from this formula with the corresponding constants $I_k$.
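The closed-form solution can be checked numerically. The sketch below uses hypothetical bound constants $I_k$ and verifies that the resulting weights sum to one, make all constraints $z \leq \lambda_k I_k$ tight simultaneously, and agree with the product form above:

```python
import numpy as np

def epsilon_optimal_weights(I):
    """lambda*_k = prod_{j != k} I_j / sum_k prod_{j != k} I_j,
    which simplifies to lambda*_k = (1/I_k) / sum_j (1/I_j)."""
    inv = 1.0 / np.asarray(I, dtype=float)
    return inv / inv.sum()

# hypothetical bound constants I_k for K = 3 objectives
I = np.array([1.0, 10.0, 100.0])
lam = epsilon_optimal_weights(I)
# at the optimum, all products lambda_k * I_k are equal: the linear-program
# constraints z <= lambda_k * I_k are simultaneously tight
```

The equality of all products $\lambda_k^{*} I_k$ is exactly the condition under which the auxiliary variable $z$ in Corollary 1.1 is maximized.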

PINN Multiple Gradient Descent Algorithm (pinn-MGDA)
We adapt the MGDA to PINNs by formulating the conditions for Pareto stationarity stated in Eq. 9 as a quadratic program:
$$ \min_{\alpha_1, \ldots, \alpha_K} \Big\| \sum_{k=1}^{K} \alpha_k \nabla_{\theta} L_k \Big\|_2^2 \quad \text{subject to} \quad \sum_{k=1}^{K} \alpha_k = 1 \,, \; \alpha_k \geq 0 \;\; \forall k \,. \tag{A5} $$
We solve the quadratic program in Eq. A5 using the Frank-Wolfe algorithm.

Let us assume the Fourier spectrum of the signal is a delta function, i.e., $\hat{u}(k) = \Gamma_0\, \delta(k - k_0)$, with $k_0$ the dominant wavenumber. The residual at training time $t$ is $r(\cdot, t) = u_\theta - u$, and its Fourier coefficient at wavenumber $k$ is $\hat{r}(k, t) = \hat{u}_\theta(k, t) - \hat{u}(k)$. The learning rate of the residual at the dominant frequency follows from the gradient-descent dynamics of the network parameters. From Parseval's theorem [44], we can compute the magnitudes of the gradients of both objectives $L_1$ and $L_2$ with respect to the network parameters $\theta$.
The second step in the above calculation uses the triangle inequality and the assumption that the Fourier spectrum of the signal is a delta function. In a similar fashion, we can repeat the same analysis for the Sobolev part of the objective function. Substituting equations (B2) and (B3) into equation (B1) then yields the result, where we use the fact that the spectral rate of the network function residual is inversely proportional to the wavenumber, i.e., $\partial \hat{u}_\theta(k) / \partial \theta = \mathcal{O}\!\left(k^{-1}\right)$ [45].
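The quadratic program in Eq. A5 can be sketched with a standard Frank-Wolfe solver on the probability simplex. This is an illustrative re-implementation under common MGDA conventions, not the authors' code; the step-size schedule and iteration count are assumptions:

```python
import numpy as np

def mgda_frank_wolfe(grads, iters=500):
    """Minimize || sum_k alpha_k g_k ||^2 over the probability simplex,
    yielding the convex combination of loss gradients with minimal norm."""
    G = np.stack(grads)        # (K, n_params): one loss gradient per row
    M = G @ G.T                # Gram matrix of the loss gradients
    K = len(grads)
    alpha = np.ones(K) / K     # start from the uniform combination
    for t in range(iters):
        # linear minimization oracle: simplex vertex with smallest
        # directional derivative (gradient of the objective is 2 M alpha)
        k = int(np.argmin(M @ alpha))
        step = 2.0 / (t + 2.0)            # standard Frank-Wolfe step size
        alpha = (1.0 - step) * alpha
        alpha[k] += step
    return alpha

# with two orthogonal gradients of equal magnitude, alpha converges to ~[0.5, 0.5]
alpha = mgda_frank_wolfe([np.array([1.0, 0.0]), np.array([0.0, 1.0])])
```

The update preserves the simplex constraint exactly, since each iterate is a convex combination of the previous iterate and a vertex.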

Sobolev training set-up
We use a neural network $f_\theta: (x, y) \mapsto u$ with 5 layers and 64 neurons per layer and $\sin(\cdot)$ activation functions. The target function is a sinusoidal forcing term.

[Figure: evolution of the weights during Sobolev training of the neural networks for the different methods: A) pinn-MGDA, B) $\epsilon$-optimal, C) inverse-Dirichlet, D) max/avg. Different trajectories in the same plot correspond to the weights $\lambda_k$, $k \in 1, \ldots, K$, associated with the different objectives. On closer inspection, the relative weights, or relative importance measures, $(\lambda_i / \lambda_j)$, $\forall i, j \in 1, \ldots, K$, between objectives for inverse-Dirichlet and the $\epsilon$-optimal strategy are approximately of the same order: $\lambda_0/\lambda_1 \approx 10$, $\lambda_0/\lambda_2 \approx 10^2$-$10^3$, $\lambda_0/\lambda_3 \approx 10^3$-$10^4$, $\lambda_0/\lambda_4 \approx 10^4$-$10^5$ (marked by red double-headed arrows).]

The ground-truth solution of the 2D incompressible active fluid is computed in the vorticity formulation of the model. The simulation uses a pseudo-spectral solver, where the derivatives are computed in Fourier space and the time integration is done with the integrating-factor method [34, 47].
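The pseudo-spectral/integrating-factor idea can be illustrated on a 1-D diffusion equation. This is a minimal sketch of the numerical principle, not the 2-D active-fluid solver itself; grid size, viscosity, and time step are arbitrary illustration values:

```python
import numpy as np

# pseudo-spectral / integrating-factor sketch for u_t = nu * u_xx on [0, 2*pi):
# derivatives become multiplications by i*k in Fourier space, and the stiff
# linear term is integrated exactly by the factor exp(-nu * k^2 * dt)
N, L, nu, dt = 64, 2 * np.pi, 0.1, 0.01
x = np.linspace(0.0, L, N, endpoint=False)
k = np.fft.fftfreq(N, d=L / N) * 2 * np.pi   # integer wavenumbers for L = 2*pi

u = np.sin(3 * x)                            # initial condition, wavenumber 3
u_hat = np.fft.fft(u)

for _ in range(100):                         # advance to t = 1.0
    u_hat *= np.exp(-nu * k**2 * dt)         # exact integrating-factor step

u_final = np.real(np.fft.ifft(u_hat))
# analytical solution at t = 1.0: sin(3x) * exp(-nu * 3^2 * 1.0)
```

Because the linear operator is diagonal in Fourier space, the integrating factor reproduces the analytical decay to machine precision; in the full model, only the nonlinear terms would require an explicit update.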

Inverse problem set-up
For studying the inverse problem we resort to the primitive formulation $(u, v, p^*)$ of the active turbulence model, with the effective pressure $p^*$ defined via $-\nabla p^* = -\nabla p + \xi_1 \nabla |u|^2$. We use a neural network $f_\theta: (x, y, t) \mapsto (u, v, p^*)$ with 5 layers and 100 neurons per layer and sin activation functions. Given measurement data of the flow velocities $\{u_i, v_i\}_{i=1}^{N}$ at discrete points, we infer the effective pressure $p^*$ along with the parameters $\Sigma = \{\xi_0, \alpha, \beta, \Gamma_0, \Gamma_2\}$. A total of 204,800 points, sampled from the same square domain as for the forward solution, is used with a 70/30 train/test split, corresponding to 143,360 training points. The network is trained for a total of 7000 epochs using the Adam optimizer, with each epoch consisting of 35 mini-batch updates with batch size $|B| = 4096$. We optimize the model subject to the loss function $L(u, \theta) = \lambda_1 L_{\text{fit}} + \lambda_2 L_{\text{res}} + \lambda_3 L_{\text{div}}$, with $\lambda_1, \lambda_2, \lambda_3$ corresponding to the data-fidelity objective, the equation residual, and the objective that penalises non-zero divergence of the velocities, respectively. The training follows a three-step approach: the first 1000 epochs are used for fitting the measurement data (pre-training, $\lambda_1 = 1$, $\lambda_2 = \lambda_3 = 0$), the next 1000 epochs train with an additional constraint on the divergence ($\lambda_1 \neq 0$, $\lambda_2 = 0$, $\lambda_3 \neq 0$), finally followed by the introduction of the equation residual ($\lambda_1, \lambda_2, \lambda_3 \neq 0$). At every checkpoint introducing a new task, we reset all $\lambda_k$ for the computation of $\hat{\lambda}_k$ (see Eq. 7, main text) of the max/avg variant to $\lambda_k = 1$. For both the forward and inverse problems, we choose an initial learning rate $\eta = 10^{-3}$ and decrease it by a factor of 10 after 3000 and 6000 epochs (see Fig. C.2).
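The three-stage sequential schedule can be sketched as follows. The function name and the choice of 1.0 as the initial value for newly activated weights are hypothetical; in the actual runs, the non-zero weights are then adapted dynamically by the chosen weighting strategy.

```python
def stage_weights(epoch):
    """Hypothetical sketch of the three-stage sequential training schedule:
    stage 1 fits the measurement data, stage 2 adds the divergence-free
    constraint, and stage 3 adds the equation residual."""
    if epoch < 1000:    # stage 1: pre-training on measurement data only
        return {"fit": 1.0, "res": 0.0, "div": 0.0}
    if epoch < 2000:    # stage 2: add the divergence-free constraint
        return {"fit": 1.0, "res": 0.0, "div": 1.0}
    return {"fit": 1.0, "res": 1.0, "div": 1.0}  # stage 3: full inverse problem
```

Introducing the objectives one by one in this way is precisely the sequential setting in which catastrophic forgetting can occur, which is why the weighting strategy applied within each stage matters.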
The parameter $\omega$ controls the frequency of the oscillatory solution $u(x, y)$. The analytical solution $u(x, y) = \cos(\omega x) \sin(\omega y)$ is computed on the domain $[0, 1] \times [0, 1]$ with a grid resolution of $100 \times 100$. The boundary condition is given as $B(x, y) = \cos(\omega x) \sin(\omega y)$, $x, y \in \partial\Omega$. We set up the PINN with two objective functions, one handling the boundary conditions and the other computing the equation residual, with the loss
$$ L(u, \theta) = \lambda_1 L_{\text{res}} + \lambda_2 L_{\text{boundary}} \,. \tag{D2} $$
We sample $N_B = 400$ points uniformly on the four boundaries and randomly sample $N_{\text{int}} = 2500$ points from the interior of the domain. We choose a network $f_\theta: (x, y) \mapsto u$ with 5 layers and 50 nodes per layer and the tanh activation function. The neural network is trained for 30000 epochs using the Adam optimizer. We choose the initial learning rate $\eta = 10^{-3}$ and decrease it by a factor of 10 after 10000, 20000, and 30000 epochs. In Fig. D.1 we show the relative L2 error of the different weighting strategies for varying wavenumber $\omega$. We observe a trend very similar to the previous problems, with the inverse-Dirichlet scheme performing on par with the analytical $\epsilon$-optimal weighting strategy, followed by max/avg and pinn-MGDA.

[Figure D.1: results for the wavenumber $\omega = 6$, and comparison of the relative L2 error of the different weighting strategies for increasing wavenumbers $\omega$.]
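The analytical solution can be checked directly: since $\nabla^2 u = -2\omega^2 u$ for $u = \cos(\omega x)\sin(\omega y)$, it solves the Poisson problem with the sinusoidal forcing $f = -2\omega^2 \cos(\omega x)\sin(\omega y)$. The finite-difference check below is purely illustrative and not part of the original study:

```python
import numpy as np

# verify that u(x, y) = cos(w x) sin(w y) satisfies Laplace(u) + 2 w^2 u = 0
w, n = 6.0, 200
x = np.linspace(0.0, 1.0, n)
y = np.linspace(0.0, 1.0, n)
X, Y = np.meshgrid(x, y, indexing="ij")
u = np.cos(w * X) * np.sin(w * Y)

# second-order central-difference Laplacian on the interior points
h = x[1] - x[0]
lap = (
    u[2:, 1:-1] + u[:-2, 1:-1] + u[1:-1, 2:] + u[1:-1, :-2]
    - 4.0 * u[1:-1, 1:-1]
) / h**2

# PDE residual on the interior; it is O(h^2), small against |2 w^2 u| <= 72
residual = lap + 2.0 * w**2 * u[1:-1, 1:-1]
```

The residual is dominated by the $O(h^2)$ discretization error of the stencil, confirming that $u$ is the exact solution up to truncation error.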