Physics Informed Token Transformer for Solving Partial Differential Equations

Solving Partial Differential Equations (PDEs) is the core of many fields of science and engineering. While classical approaches are often prohibitively slow, machine learning models often fail to incorporate complete system information. Over the past few years, transformers have had a significant impact on the field of Artificial Intelligence and have seen increased usage in PDE applications. However, despite their success, transformers currently lack integration with physics and reasoning. This study aims to address this issue by introducing PITT: Physics Informed Token Transformer. The purpose of PITT is to incorporate the knowledge of physics by embedding governing partial differential equations into the learning process. PITT uses an equation tokenization method to learn an analytically-driven numerical update operator. By tokenizing PDEs and embedding partial derivatives, the transformer models become aware of the underlying knowledge behind physical processes. To demonstrate this, PITT is tested on challenging 1D and 2D PDE neural operator prediction tasks. The results show that PITT outperforms popular neural operator models and has the ability to extract physically relevant information from governing equations.


Introduction
Partial Differential Equations (PDEs) are ubiquitous in science and engineering applications.
While much progress has been made in developing analytical and computational methods to solve the various equations, no complete analytical theory exists, and computational methods are often prohibitively expensive. Recent work has shown the ability to learn analytical solutions using bilinear residual networks 1 and bilinear neural networks [2][3][4] where an analytical solution is available. Machine learning approaches such as mesh optimization, super resolution, and surrogate modeling have also been developed. [12][13][14] While mesh optimization generally allows for using traditional numerical solvers, current methods only improve speed or accuracy by a few percent, or require many simulations during training. Methods for super resolution improve speed, but often struggle to generalize to data resolutions not seen in the training data, with more recent work improving generalization capabilities. 15 Surrogate modeling, on the other hand, has shown a good balance between improved performance and generalization.
Neural operator learning architectures, specifically, have also shown promise in combining super resolution capability with surrogate modeling due to their inherent discretization invariance. 16 Recently, the attention mechanism has become a popular choice for operator learning.
The attention mechanism first emerged as a promising model for natural language processing tasks, [17][18][19][20] especially the scaled dot-product attention. 18 Its success has been extended to other areas, including computer vision tasks 21 and biology, 22 and attention-based architectures have since been applied to operator learning for PDEs. [23][24][25][26][27][28][29][30] Kovachki et al. 31 propose a kernel integral interpretation of attention. Cao 23 analyzes the theoretical properties of softmax-free dot-product attention (also known as linear attention) and further proposes two interpretations of attention, such that it can be viewed as the numerical quadrature of a kernel integral operator or a Petrov-Galerkin projection. OFormer (Operator Transformer) 24 extends the kernel integral formulation of linear attention by adding relative positional encoding 32 and using cross attention to flexibly handle discretization, and further proposes a latent marching architecture for solving forward time-dependent problems. Guo et al. 29 introduce attention as an instance-based learnable kernel for the direct sampling method and demonstrate superiority on boundary value inverse problems. LOCA (Learning Operators with Coupled Attention) 33 uses attention weights to learn correlations in the output domain and enables sample-efficient training of the model. GNOT (General Neural Operator Transformer for Operator Learning) 25 proposes a heterogeneous attention architecture that stacks multiple cross-attention layers and uses a geometric gating mechanism to adaptively aggregate features from query points. Additionally, encoding physics-informed inductive biases has also been of great interest because it allows incorporation of additional system knowledge, making the learning task easier. One strategy to encode the parameters of different instances for parametric PDEs is to add a conditioning module to the model. 34,35 Another approach is to embed governing equations into the loss function, known as Physics-Informed Neural Networks (PINNs). 36 PINNs have shown promise in physics-based tasks, but have some downsides: namely, they lack generalization and are difficult to train. Complex training strategies have been developed in order to account for these deficiencies. 37 While many existing works are successful in their own right, none so far have incorporated entire analytical governing equations. In this work we introduce an equation embedding strategy as well as an attention-based architecture, the Physics Informed Token Transformer (PITT), to perform neural operator learning using equation information, utilizing physics-based inductive bias directly from governing equations (the main architecture of PITT is shown in figure 1). More specifically, PITT fuses equation knowledge into neural operator learning by introducing a symbolic transformer on top of the neural operator.
We demonstrate through a series of challenging benchmarks that PITT outperforms the popular Fourier Neural Operator 13 (FNO), DeepONet, 14 and OFormer 24 and is able to learn physically relevant information from only the governing equations and system specifications.

Methods
In this work, we aim to learn the operator G θ : A → U, where A is our input function space, U is our solution function space, and θ are the learnable model parameters. We use a combination of novel equation tokenization and numerical method-like updates to learn the model operator G θ . Our novel equation tokenization and embedding method is described first, followed by a detailed explanation of the numerical update scheme.

Equation Tokenization
In order to utilize the text view of our data, the equations must be tokenized as input to our transformer. Following Lample et al., 38 each equation is parsed and split into its constituent symbols. The tokens are given in table 1. The initial condition, sampled values, and output simulation time are all separated because each component controls distinct properties of the system. The 2D equations are tokenized so that the governing equations remain intact, because some of the governing equations, such as the continuity equation, are self-contained. All of the tokens are then compiled into a single list, where each token in the tokenized equation is replaced by the index at which it occurs in this list. For example, we have the following tokenization: Derivative(u(x, t), t) = [Derivative, (, u, (, x, ,, t, ), ,, t, )] = [6, 0, 3, 15, 0, 16, 33, 14, 1, 33, 1]. After each equation has been tokenized, the target time value is appended in tokenized form to the equation, and the total equation is padded with a placeholder token so that each text embedding is the same length. Sampled values are truncated at 15 digits of precision.
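A minimal Python sketch of this tokenization scheme is given below; the vocabulary contents and ordering, and the helper names (VOCAB, tokenize, encode), are illustrative assumptions rather than the paper's exact implementation.

```python
# Sketch of equation tokenization: split an equation string into symbols, map each
# symbol to its index in a global token list, append the target time, and pad.
VOCAB = ["(", ")", "=", "u", "x", "t", "Derivative", "sin", "+", "-", "*", "/",
         ".", ",", " ", "e", *[str(d) for d in range(10)]]   # illustrative vocabulary
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
PAD_ID = len(VOCAB)  # placeholder token used to pad every equation to a fixed length


def tokenize(equation: str) -> list[str]:
    """Split an equation string into its constituent symbols."""
    tokens, word = [], ""
    for ch in equation:
        if ch.isalnum():
            word += ch
        else:
            if word:
                # keep known multi-character names (e.g. "Derivative") whole, split the rest
                tokens.extend([word] if word in TOKEN_TO_ID else list(word))
                word = ""
            tokens.append(ch)
    if word:
        tokens.extend([word] if word in TOKEN_TO_ID else list(word))
    return tokens


def encode(equation: str, target_time: float, max_len: int = 500) -> list[int]:
    """Map tokens to vocabulary indices, append the tokenized target time, and pad."""
    ids = [TOKEN_TO_ID[tok] for tok in tokenize(equation)]
    ids += [TOKEN_TO_ID[tok] for tok in tokenize(f"{target_time:.15g}")]
    return ids + [PAD_ID] * (max_len - len(ids))


print(tokenize("Derivative(u(x,t),t)"))
# ['Derivative', '(', 'u', '(', 'x', ',', 't', ')', ',', 't', ')']
print(len(encode("Derivative(u(x,t),t)", 2.5, max_len=50)))  # 50
```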
Data handling code is adapted from PDEBench. 39

Physics Informed Token Transformer
The Physics Informed Token Transformer (PITT) utilizes tokenized equation information to construct an update operator F P , similar to numerical integration techniques: x t+1 = x t + F P (x t ). As seen in figure 1, PITT takes in the numerical values and grid spacing, similar to operator learning architectures such as FNO, as well as the tokenized equation and the explicit time difference between simulation steps. The tokenized equation is passed through the multi-head attention block shown in figure 1a; in our case we use self-attention. 23 The tokens are shifted and scaled to be between -1 and 1 upon input, which significantly boosts performance. This latent equation representation is then used to construct the keys and queries for a subsequent multi-head attention block that is used in conjunction with output from the underlying neural operator to construct the update values for the final input frame. The time difference between steps is encoded, allowing the use of arbitrary timesteps.
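A minimal PyTorch sketch of the equation-encoding step described above; the module structure, dimensions, and rescaling are illustrative assumptions rather than the released PITT implementation.

```python
import torch
import torch.nn as nn


class TokenEncoder(nn.Module):
    """Embed tokenized equations and produce a latent equation representation."""

    def __init__(self, vocab_size: int, d_model: int = 64, n_heads: int = 1):
        super().__init__()
        self.vocab_size = vocab_size
        self.embed = nn.Linear(1, d_model)  # tokens enter as rescaled scalars
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        # token_ids: (batch, seq_len) integer indices into the vocabulary.
        # Shift and scale the raw token values to [-1, 1] before embedding.
        x = 2.0 * token_ids.float() / (self.vocab_size - 1) - 1.0
        x = self.embed(x.unsqueeze(-1))               # (batch, seq_len, d_model)
        latent, _ = self.self_attn(x, x, x)           # single self-attention layer
        return latent                                 # supplies keys/queries for the update block


tokens = torch.randint(0, 40, (8, 500))               # batch of padded token sequences
latent_eq = TokenEncoder(vocab_size=40)(tokens)
print(latent_eq.shape)                                # torch.Size([8, 500, 64])
```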
Intuitively, we can view the model as using a neural operator to pass through the previous state as well as calculate the update, as in numerical methods. The tokenized information is then used to construct an analytically driven update operator that acts as a correction to the neural operator state update. This intuitive understanding of PITT is explored with our 1D benchmarks.
Two different embedding methods are used for the tokenized equations. In the first method, the token attention block first embeds the tokens, T, as keys, queries, and values with learnable weight matrices. We use a single layer of self-attention for the tokens. The update attention blocks seen in figure 1b then use the token attention block output as queries and keys, and the neural operator output as values, embedding them with trainable weight matrices. The output is passed through a fully connected projection layer to match the target output dimension. This update scheme mimics numerical methods and is given in algorithm 1.
Algorithm 1 (the PITT numerical update scheme) requires V 0 , T h1 , T h2 , the time t, and L layers, and loops over layers l = 1, 2, . . ., L. A standard, fully connected multi-layer perceptron is used to calculate the update after concatenating the attention output with an embedding of the fractional timestep. This block uses softmax-free linear attention (LA). 23
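Below is a hedged PyTorch sketch of one layer of this update scheme: softmax-free linear attention driven by the equation latent, an MLP that mixes in an embedding of the fractional timestep, and a residual passthrough of the final input frame. The projection of the equation latent onto the grid length, the layer sizes, and the single-layer structure are assumptions for illustration, not the paper's exact implementation.

```python
import torch
import torch.nn as nn


class PITTUpdate(nn.Module):
    def __init__(self, seq_len: int, n_grid: int, d_model: int = 64):
        super().__init__()
        self.to_grid = nn.Linear(seq_len, n_grid)       # align equation latent with the grid
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.time_embed = nn.Linear(1, d_model)
        self.mlp = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.GELU(),
                                 nn.Linear(d_model, 1))

    def linear_attention(self, q, k, v):
        # Softmax-free (linear) attention: out = Q (K^T V) / N
        return q @ (k.transpose(-2, -1) @ v) / k.shape[-2]

    def forward(self, latent_eq, operator_out, x_last, dt):
        # latent_eq:    (batch, seq_len, d_model) equation embedding
        # operator_out: (batch, n_grid, d_model)  neural operator features
        # x_last:       (batch, n_grid)           final input frame (passthrough)
        # dt:           (batch, 1)                fractional timestep
        eq = self.to_grid(latent_eq.transpose(1, 2)).transpose(1, 2)  # (batch, n_grid, d_model)
        q, k, v = self.w_q(eq), self.w_k(eq), self.w_v(operator_out)
        attn = self.linear_attention(q, k, v)                         # (batch, n_grid, d_model)
        t = self.time_embed(dt).unsqueeze(1).expand_as(attn)          # broadcast time embedding
        update = self.mlp(torch.cat([attn, t], dim=-1)).squeeze(-1)   # (batch, n_grid)
        return x_last + update                                        # x_{t+1} = x_t + F_P(x_t)


update = PITTUpdate(seq_len=500, n_grid=100)
x_next = update(torch.randn(8, 500, 64), torch.randn(8, 100, 64),
                torch.randn(8, 100), torch.rand(8, 1))
print(x_next.shape)  # torch.Size([8, 100])
```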

Data Generation
In order to properly assess performance, multiple data sets that represent distinct challenges are used. In the 1D case, we use the Heat equation, which is a linear parabolic equation, Burgers' equation, and the Korteweg-de Vries (KdV) equation, all written as special cases of equation 1. The forcing term is given by δ(t, x) = Σ_{j=1}^{J} A j sin(ω j t + (2π l j x)/L + ϕ j ), and the initial condition is the forcing term at time t = 0: u(0, x) = δ(0, x). The parameters in the forcing term are sampled as follows: A j ∼ U(−0.5, 0.5), ω j ∼ U(−0.4, 0.4), l j ∼ {1, 2, 3}, ϕ j ∼ U(0, 2π). The parameters (α, β, γ) of equation 1 can be set to define different, famous equations: when α = 0 and γ = 0 we have the Heat equation, when only γ = 0 we have Burgers' equation, and when only β = 0 we have the KdV equation. Each equation has at least one parameter that we modify in order to generate large data sets.
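As a concrete illustration, a small NumPy sketch of the forcing term and initial condition sampling described above; the number of modes J and the domain length L used here are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
J, L = 5, 16.0                                   # number of modes and domain length (assumed)
A     = rng.uniform(-0.5, 0.5, J)                # A_j ~ U(-0.5, 0.5)
omega = rng.uniform(-0.4, 0.4, J)                # omega_j ~ U(-0.4, 0.4)
l     = rng.choice([1, 2, 3], J)                 # l_j sampled from {1, 2, 3}
phi   = rng.uniform(0.0, 2 * np.pi, J)           # phi_j ~ U(0, 2*pi)


def delta(t: float, x: np.ndarray) -> np.ndarray:
    """Forcing term delta(t, x) = sum_j A_j sin(omega_j t + 2*pi*l_j*x/L + phi_j)."""
    return sum(A[j] * np.sin(omega[j] * t + 2 * np.pi * l[j] * x / L + phi[j])
               for j in range(J))


x = np.linspace(0.0, L, 100, endpoint=False)     # 100 grid points, as in the 1D benchmarks
u0 = delta(0.0, x)                               # initial condition u(0, x) = delta(0, x)
```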
For the Heat equation, we generated 10,000 simulations for each β value, for 60,000 total samples. For Burgers' equation, we used advection values of α ∈ {0.01, 0.05, 0.1, 0.2, 0.5, 1} and generated 2,500 simulations for each combination of parameter values, for 90,000 total simulations.
For the KdV equation, we used an advection value of α = 0.01 with γ ∈ {2, 4, 6, 8, 10, 12}, and generated 2,500 simulations for each parameter combination, for 15,000 total simulations. The 1D equation text tokenization is padded to a length of 500; tokenized equations are long here due to the many sampled values.

Navier-Stokes Equation
In 2D, we use the incompressible, viscous Navier-Stokes equations in vorticity form, given in equation 2. Data generation code was adapted from Li et al. 13
where u(x, t) is the velocity field, w(x, t) = ∇ × u(x, t) is the vorticity, w 0 (x) is the initial vorticity, f(x) is the forcing term, and ν is the viscosity parameter. We use viscosities ν ∈ {10 −8 , 2 • 10 −8 , . . ., 10 −5 } and forcing term amplitudes A ∈ {0.001, 0.002, 0.003, . . ., 0.01}, for 370 total parameter combinations. 120 frames are saved over 30 seconds of simulation time. The initial vorticity is sampled according to a Gaussian random field. For each combination of ν and A, 1 random initialization was used for the next-step and rollout experiments and 5 random initializations were used for the fixed-future experiments. The tokenized equations are padded to a length of 100. Simulations are run on a 1x1 unit cell with periodic boundary conditions. The space is discretized with a 256x256 grid for numerical stability, which is evenly downsampled to 64x64 during training and testing.
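For illustration, a hedged NumPy sketch of sampling a periodic Gaussian random field for the initial vorticity; the spectral decay parameters (tau, alpha) and the normalization are assumptions and may differ from the covariance actually used to generate the benchmark data.

```python
import numpy as np

def gaussian_random_field(n: int = 256, tau: float = 7.0, alpha: float = 2.5,
                          seed: int = 0) -> np.ndarray:
    """Sample w0 from a mean-zero GRF with power-law spectral decay on a periodic n x n grid."""
    rng = np.random.default_rng(seed)
    k = np.fft.fftfreq(n, d=1.0 / n)                      # integer wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    spectrum = (4 * np.pi**2 * (kx**2 + ky**2) + tau**2) ** (-alpha / 2.0)
    spectrum[0, 0] = 0.0                                  # remove the mean mode
    noise = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
    w0 = np.fft.ifft2(noise * spectrum).real * n          # back to physical space
    return w0

w0 = gaussian_random_field()                              # 256 x 256 initial vorticity
w0_train = w0[::4, ::4]                                   # evenly downsampled to 64 x 64
```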

Steady-State Poisson Equation
The last benchmark we perform is on the steady-state Poisson equation given in equation 3.
where u(x, y) is the electric potential, −∇u(x, y) is the electric field, and g(x, y) contains boundary condition and charge information.The simulation cell is discretized with 100 points in the horizontal direction and 60 points in the vertical direction.Capacitor plates are added with various widths, x and y positions, and charges.An example of input and target electric field magnitude is given in figure 2.
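To make the setup concrete, a hedged finite-difference sketch of how a steady-state target of this form could be produced: fixed-potential capacitor plates on a 100x60 grid relaxed with Jacobi iterations. The plate positions, potentials, and the solver itself are illustrative assumptions, not the benchmark's data generation code.

```python
import numpy as np

nx, ny = 100, 60
u = np.zeros((ny, nx))                    # electric potential on the 100 x 60 grid
plates = [((20, 10), (20, 40), +1.0),     # ((row, col_start), (row, col_end), potential)
          ((40, 10), (40, 40), -1.0)]

def apply_plates(u):
    """Pin the potential to fixed values along each capacitor plate."""
    for (r, c0), (_, c1), v in plates:
        u[r, c0:c1] = v
    return u

for _ in range(5000):                     # Jacobi relaxation with zero boundary potential
    u = apply_plates(u)
    u[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:])
u = apply_plates(u)

ey, ex = np.gradient(-u)                  # electric field E = -grad(u)
e_mag = np.hypot(ex, ey)                  # field magnitude, as plotted in figure 2
```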

Results
We now compare PITT with both embedding methods against FNO, DeepONet, and OFormer on our various data sets. † indicates our novel embedding method and * indicates standard embedding. All experiments were run with five random splits of the data. Reported results and shaded regions in plots are the mean and one standard deviation of each result, respectively. Experiments were run with a 60-20-20 train-validation-test split. Early stopping is also used, where the epoch with the lowest validation loss is used for evaluation. Note: parameter count represents the total number of parameters; in some cases PITT variants use a smaller underlying neural operator and have a lower parameter count than the baseline model.
Hyperparameters for each experiment are given in the appendix.

1D Next-Step Prediction
Our 1D case is trained by using 10 frames of our simulation to predict the next frame.The data is generated for four seconds, with 100 timesteps, and 100 grid points between 0 and 16.
The final time is T = 4s. Specifically, the task is to learn the operator G θ : a(•, t i )| i=n → u(•, t j )| j=n+1 , where n ∈ [10, 100]. The effect of the neural operator and token transformer modules in PITT can be easily decomposed and analyzed by returning the passthrough and update separately, instead of their sum (figure 1c). Using the pretrained PITT FNO from above, a sample is predicted for the 1D Heat equation; we see the decomposition in figure 3. In rollout, PITT shows both lower final error and improved total error accumulation; the novel embedding error accumulation plot is given in the appendix in figure 9. In these rollout experiments, we used the models trained in the next-step fashion on our 1D benchmarks. We start with the first 10 frames from each trajectory in the test set for the 1D data sets, and only the initial condition for the 2D test data set, and autoregressively predict the entire rollout.
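A minimal sketch of this autoregressive rollout procedure: start from the first 10 frames and repeatedly feed the model's own predictions back in. The model signature (a window of frames mapped to the next frame) is an assumption for illustration.

```python
import torch

@torch.no_grad()
def rollout(model, initial_frames: torch.Tensor, n_steps: int) -> torch.Tensor:
    """initial_frames: (batch, 10, n_grid); returns (batch, 10 + n_steps, n_grid)."""
    frames = initial_frames
    for _ in range(n_steps):
        next_frame = model(frames[:, -10:])          # predict from the last 10 frames
        frames = torch.cat([frames, next_frame.unsqueeze(1)], dim=1)
    return frames

# Example with a stand-in "model" that simply repeats the last frame:
dummy = lambda x: x[:, -1]
trajectory = rollout(dummy, torch.randn(4, 10, 100), n_steps=90)
print(trajectory.shape)   # torch.Size([4, 100, 100])
```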

Appendix Experimental Details
Training and model hyperparameters are given here. In all cases, 3 encoding and decoding convolutional layers were used for FNO.

1D Next-Step Training Details
All models were trained for 200 epochs on all of the 1D data sets.

1D Rollout Comparison
In the 1D rollouts we see the PITT variants of FNO and DeepONet match the ground truth values significantly better for the Heat and Burgers' simulations, and maintain their shape closer to the ground truth for the KdV equation when compared to FNO. Darker lines correspond to longer times in rollout, up to a time of 4 seconds.

Figure 1 :
Figure 1: The Physics Informed Token Transformer (PITT) uses standard multi-head self-attention to learn a latent embedding of the governing equations. This latent embedding is then used to perform numerical updates using linear attention blocks. The equation embedding acts as an analytically-driven correction to an underlying data-driven neural operator.

Figure 2 :
Figure 2: Example setup for the 2D Poisson equation. a) Input boundary conditions and geometry. b) Target electric field output.

Figure 5 :
Figure 5: Rollout results for 2D Navier-Stokes using our novel embedding method.
Attention Weight Response to Parameter Change

Figure 7 :
Figure 7: PITT attention weight response to changing equation tokens. a) PITT attention weights change as we modify the input tokens. From our Navier-Stokes equation, we see the attention weights change differently when we modify the viscosity and forcing term amplitude. This demonstrates that PITT is able to learn equation parameters from the tokenized equations. b) PITT attention weights change as we modify the input token target time. From our Navier-Stokes equation, we see the attention weights change differently when we modify the target time. This demonstrates that PITT is able to learn time evolution from the tokenized equations.
Attention Weight Response to Parameter Change

Figure 8 :
Figure 8: PITT attention weight response to changing equation tokens. a) PITT attention weights change as we modify the input tokens. From our Navier-Stokes equation, we see the attention weights change differently when we modify the viscosity and forcing term amplitude. This demonstrates that PITT is able to learn equation parameters from the tokenized equations. b) PITT attention weights change as we modify the input token target time. From our Navier-Stokes equation, we see the attention weights change differently when we modify the target time. This demonstrates that PITT is able to learn time evolution from the tokenized equations.

Figure 9 :
Figure 9: Error accumulation for rollout experiments. PITT variants have less error accumulation at long rollout times for every benchmark when compared to the baseline models.

Figure 10 :
Figure 10: Comparison of FNO to ground truth data for autoregressive rollout on our 1D data sets.

Figure 11 :
Figure 11: Comparison of PITT FNO using our novel embedding to ground truth data for autoregressive rollout on our 1D data sets.

Figure 12 :
Figure 12: Comparison of PITT FNO using standard embedding to ground truth data for autoregressive rollout on our 1D data sets.

Figure 13 :
Figure 13: Comparison of DeepONet to ground truth data for autoregressive rollout on our 1D data sets.

Figure 14 :
Figure 14: Comparison of PITT DeepONet using our novel embedding to ground truth data for autoregressive rollout on our 1D data sets.

Figure 15 :
Figure 15: Comparison of PITT DeepONet using standard embedding to ground truth data for autoregressive rollout on our 1D data sets.

Figure 19 :
Figure 19: Comparison of Poisson prediction error between PITT variants and baseline models using standard embedding.

Table 1 :
Collection of all tokens used in tokenizing governing equations, sampled values, and system parameters.
∂, Σ, j, A j , l j , ω j , ϕ j , sin, t, u, x, y, +, −, *, /

Table 2 :
Mean Absolute Error (MAE) ×10 −3 for 1D benchmarks. Bold indicates best performance.
A total of 1,000 sampled equations were used in the training set, with 90 frames for each equation. Data was split such that samples from the same equation and forcing term did not appear in both the training and test sets. We see PITT significantly outperforms all of the baseline models across all equations for both embedding methods. Although the lower error often resulted in unstable autoregressive rollout, PITT variants also outperformed their baseline counterparts when simply trained to minimum error. Additionally, PITT is able to improve performance with fewer parameters than FNO and a comparable number of parameters to both OFormer and DeepONet. Notably, PITT uses a single attention head and a single multi-head attention block for the multi-head and linear attention blocks in this experiment.

Figure 3: PITT FNO prediction decomposition for the 1D Heat equation. Left: the FNO module of PITT predicts a large change to the final frame of input data. Middle: the numerical update block corrects the FNO output. Right: the combination of the FNO and numerical update block outputs very accurately predicts the next step. Interestingly, the underlying FNO has learned to overestimate the passthrough of the data in both cases. The token attention and numerical update modules have learned a correction to the FNO output, as expected.

1D Fixed-Future Prediction
In this 1D benchmark, each model is trained on all three equations simultaneously, and performance is compared against training on single equations. Results are shown in table 3 and figure 6 in the appendix. The first 10 frames of each equation are used as input to predict the last frame of each simulation. In total, 5,000 samples from each equation were used for both single equation and multiple equation training. Models trained on the combined data sets are then tested on data from each equation individually. For PITT FNO and PITT OFormer, we see that training on the combined equations using our novel embedding method has the best performance across all data sets. Additionally, for PITT FNO and PITT DeepONet, training using our standard embedding method achieves the best performance across all data sets. This shows PITT is able to improve neural operator generalization across different systems. Interestingly, we also see improvement in FNO and OFormer when training using the combined data sets.

Table 3 :
Mean Absolute Error (MAE) ×10 −3 for 1D benchmarks. Bold indicates best performance.
The 2D benchmarks provide a wider array of settings and tests for each model. In the next-step training and rollout test experiment, we used 200 equations, a single random initialization for each equation, and the entire 121 step trajectory for the data set. The final time is T = 30s. Similar to the 1D case, we are learning the operator G θ : a(•, t i )| i=n → u(•, t j )| j=n+1 , where n ∈ [0, 119]. This benchmark is especially challenging for two reasons. First, there are viscosity and forcing term amplitude combinations in the test set that the model has not trained on. Second, rollout is done starting from only the initial condition, and models are trained to predict the next step using a single snapshot. This limits the time evolution information available to models during training. Although the baseline models perform comparably to PITT variants in terms of error, we note that PITT shows improved accuracy for all variants, and in many cases lower error led to unstable rollout, as in the 1D cases. Despite this, PITT has much better rollout error accumulation, seen in table 6. Further analysis of PITT FNO attention maps from this experiment is given in the appendix in figures 7a, 7b, 8a, and 8b. The attention maps show PITT FNO is able to extract physically relevant information from the governing equations.
For the steady-state Poisson equation, for a given set of boundary conditions we learn the operator G θ : a → u, with boundary conditions u(x) = g(x), ∀x ∈ ∂Ω 0 and n · ∇u(x) = f(x), ∀x ∈ ∂Ω 1 . The primary challenge here is in learning the effect of boundary conditions. Dirichlet boundary conditions are constant, only requiring passing through initial values at the boundary for accurate prediction, but Neumann boundary conditions lead to boundary values that must be learned from the system. Standard neural operators do not offer a way to easily encode this information without modifying the initial conditions, while PITT uses a text encoding of each boundary condition, as outlined in equation tokenization. PITT is able to learn boundary conditions through the text embedding, and performs approximately an order of magnitude better, with the standard embedding improving over our novel embedding by an average of over 50%. 5,000 samples were used during training with random data splitting. All combinations of boundary conditions appear in both the train and test sets. Prediction error plots for our models on this data set are given in the appendix in figures 18 and 19.

Table 4 :
Mean Absolute Error (MAE) ×10 −3 for 2D benchmarks. Bold indicates best performance.
Although PITT variants have overlapping error bars with the base model in the Navier-Stokes benchmark, the PITT variant had lower error on all but one random split of the data for PITT FNO, and on every random split for PITT DeepONet. Lastly, similar to experiments in both Li et al. 13 and Li et al., 24 we can use our models to use the first 10 seconds of data to predict a fixed, future timestep. Including the initial condition, we use 41 frames to predict a single future frame. In this case, we predict the system state at 20 and 30 seconds in two separate experiments. For this experiment, we are learning the operator G θ : u(•, t)| t∈[0,10] → u(•, t)| t=20,30 . We shuffle the data such that forcing term amplitude and viscosity combinations appear in both the training and test set, but initial conditions do not appear in both. Our setup is more difficult than in previous works because we are using multiple forcing term amplitudes and viscosities. The results are given in table 5, where we see PITT variants outperform the baseline model for both embedding methods. Example predictions are given in the appendix in figures 16 and 17.
PITT variants outperform their baseline counterparts, with the exception of PITT DeepONet using our novel embedding method when compared to the baseline model on KdV. Error accumulation is shown in figure 4 for standard embedding.

Table 6 :
Final Mean Absolute Error (MAE) for rollout experiments. Bold indicates best performance when comparing base models to their PITT version. OFormer is omitted due to instability during rollout. Standard embedding PITT DeepONet is bolded here because it outperforms DeepONet for every random split of the data.
FNO does not match small-scale features well. Similarly, PITT DeepONet is able to approximately match large-scale features in magnitude (lighter color), whereas DeepONet noticeably differs even at large scales. 1D rollout comparison plots are given in the appendix.
FNO and DeepONet used a step schedule for the learning rate during training. PITT variants and OFormer used a One Cycle learning rate schedule during training. All models used the Adam optimizer and L1 loss function for all experiments.
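A short PyTorch sketch of this training configuration: Adam with an L1 loss, a step learning-rate schedule for FNO and DeepONet, and a one-cycle schedule for PITT variants and OFormer. The specific learning rates, step size, and decay factor below are assumptions; the actual values are listed in the hyperparameter tables.

```python
import torch
import torch.nn as nn

def make_training_setup(model: nn.Module, use_one_cycle: bool,
                        epochs: int = 200, steps_per_epoch: int = 100):
    """Adam + L1 loss; step LR schedule for FNO/DeepONet, one-cycle for PITT/OFormer."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    if use_one_cycle:                                   # PITT variants and OFormer
        scheduler = torch.optim.lr_scheduler.OneCycleLR(
            optimizer, max_lr=1e-3, epochs=epochs, steps_per_epoch=steps_per_epoch)
    else:                                               # FNO and DeepONet
        scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=50, gamma=0.5)
    return optimizer, scheduler, nn.L1Loss()

opt, sched, loss_fn = make_training_setup(nn.Linear(10, 10), use_one_cycle=True)
```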

Table 7 :
Training Hyperparameters for 1D Next-Step Experiments

Table 8 :
FNO Hyperparameters for 1D Next-Step Experiments

Table 9 :
OFormer Hyperparameters for 1D Next-Step Experiments. Columns: Model, Data Set, Hidden Dim., Numerical Layers, Heads, Input Embedding Dim., Output Embedding Dim., Encoder Depth, Decoder Depth, Latent Channels, Encoder Resolution, Decoder Resolution, Scale.

Table 11 :
Training Hyperparameters for 1D Fixed-Future Experiments

Table 13 :
OFormer Hyperparameters for 1D Fixed-Future Experiments. Columns: Model, Data Set, Hidden Dim., Numerical Layers, Heads, Input Embedding Dim., Output Embedding Dim., Encoder Depth, Decoder Depth, Latent Channels, Encoder Resolution, Decoder Resolution, Scale.

Table 15 :
Training Hyperparameters for the 2D Navier-Stokes Next-Step Experiment

Table 16 :
Model Hyperparameters for the 2D Navier-Stokes Next-Step Experiment

Table 17 :
Model Hyperparameters for the 2D Navier-Stokes Next-Step Experiment

Table 18 :
Model Hyperparameters for the 2D Navier-Stokes Next-Step Experiment

Table 19 :
Training Hyperparameters for the 2D Navier-Stokes Fixed-Future Experiment

Table 20 :
Model Hyperparameters for the 2D Navier-Stokes Fixed-Future Experiment

Table 21 :
Model Hyperparameters for the 2D Navier-Stokes Fixed-Future Experiment

Table 22 :
Model Hyperparameters for the 2D Navier-Stokes Fixed-Future Experiment

All models were trained for 1000 epochs on the 2D Poisson steady-state data.

Table 23 :
Training Hyperparameters for the 2D Poisson Experiment

Table 24 :
Model Hyperparameters for the 2D Poisson Experiment

Table 25 :
Model Hyperparameters for the 2D Poisson Experiment

Table 26 :
Model Hyperparameters for the 2D Poisson Experiment