Governing equation discovery based on causal graph for nonlinear dynamic systems

The governing equations of nonlinear dynamic systems are of great significance for understanding their internal physical characteristics. In order to learn the governing equations of nonlinear systems from noisy observed data, we propose a novel method, governing equation discovery based on causal graph (CG-GED), that combines a spatio-temporal graph convolutional network with governing equation modeling. The essence of our method is to first devise a causal graph encoding based on transfer entropy to obtain an adjacency matrix that captures causally meaningful relationships between variables. Then, the spatio-temporal graph convolutional network is used to obtain approximate solutions for the system variables. On this basis, automatic differentiation is applied to obtain basic derivatives and form a dictionary of candidate algebraic terms. Finally, sparse regression is used to obtain the coefficient matrix and determine the explicit form of the governing equations. We also design a novel cross-combinatorial optimization strategy to learn the heterogeneous parameters, which include both neural network parameters and governing equation coefficients. We conduct extensive experiments on seven datasets from different physical fields. The experimental results demonstrate that the proposed method can automatically discover the underlying governing equations of the systems and is highly robust.


Introduction
It is a common problem in science and engineering to analyze the data of complex dynamic systems in order to understand their qualitative behavior and accurately predict their future states. In other words, studying the internal dynamic mechanisms of complex systems is an important basis for understanding, predicting and controlling systems. Modern machine learning methods have made significant progress in black-box prediction performance on many tasks, but the simplified closed form of the internal governing equations of the systems often remains unclear or partially unknown. Therefore, it is necessary to study how to discover the governing equations of systems from the observed data [1][2][3].
In recent years, with the development of intelligent technology and the improvement of computing power, data-driven methods for learning dynamic equations have developed rapidly. Bongard et al carried out groundbreaking research using symbolic regression optimized by a genetic algorithm to extract the governing equations of nonlinear systems from data [4,5]. However, this method is computationally costly and may suffer from overfitting. Recently, Brunton et al proposed an innovative framework named sparse identification of nonlinear dynamics (SINDy) [6,7]. This framework is based on the fact that the governing equations of many systems contain relatively few terms. Therefore, sparse regression can be used to select the main candidate functions from a complete function dictionary to reveal simple governing equations. In the past few years, SINDy has attracted extensive attention from the academic community and has been widely used to identify the physical equations of low-dimensional systems [8,9]. Improved algorithms based on SINDy have been used to identify various nonlinear dynamic systems, including biological and chemical systems [10][11][12], predator-prey systems [13] and structural systems [14,15]. Other studies have extended the SINDy method to improve the robustness of sparse discovery for high-dimensional dynamics, for example by incorporating physical constraints [16], embedding random sampling [17], and discovering implicit dynamics [15,18]. However, there are still some issues with research based on the SINDy framework. One key bottleneck is that existing methods use ordinary numerical methods, such as finite differences, to calculate the derivatives used to construct the governing equations. When data are scarce and noise is high, the error caused by approximate derivatives reduces the accuracy of SINDy-based methods [19].
Another limitation is that an overcomplete candidate function library leads to information redundancy and computational complexity [20]. Some studies have found that the weak SINDy framework is more robust to noise and can numerically approximate the solution of the original system [21][22][23], but they did not address the large computational complexity caused by redundant candidate function libraries.
Aiming at the issue of algorithm errors caused by numerical differentiation, we draw great inspiration from the physics-informed neural networks (PINNs) proposed by Raissi et al [24] and their successful application in solving forward and inverse partial differential equations [25][26][27]. It is evident that automatic differentiation within the PINN framework can effectively solve the error problem caused by numerically approximated derivatives in SINDy-based methods [28]. The idea of PINN is to use a deep neural network (DNN) as a universal function approximator to obtain a solution constrained by a prior partial differential equation using a small amount of data. Many studies have applied PINNs to scenarios with prior knowledge and achieved remarkable results [29,30]. In a noteworthy work, Raissi introduced an effective algorithm for learning infinite-dimensional dynamic systems using DNNs combined with automatic differentiation [31]. However, this method relies on a black-box solver and cannot reveal the closed form of the system governing equation. Another noteworthy study proposed a deep learning based model discovery algorithm (DeepMoD) within the PINN framework, which can discover the partial differential equations behind the data [32]. However, due to inaccurate coefficient optimization, this method may identify incorrect terms. Thanasutives et al proposed a noise-aware physics-informed machine learning framework [33] that trains a PINN based on the discrete Fourier transform in a multitask learning paradigm [34] to reveal a set of optimal reduced partial differential equations. PDE-Read [35] introduces a two-network variant and a recursive feature elimination algorithm to identify human-readable PDEs from data. Chen et al proposed an alternating direction optimization that combines PINN with STRidge to successfully identify differential equation coefficients [28].
These successful experiences have verified the feasibility of combining deep learning with SINDy to discover governing equations.
In response to the computational complexity caused by the overcomplete candidate function library in SINDy-based methods, we split the governing equation into self-dynamics and interaction dynamics, promoting sparse learning of the interaction-dynamics coefficients in the differential equations by capturing the relationships between variables. Research on graph neural networks (GNNs) gave us the idea of graph encoding of observed temporal data. Wu et al studied the prediction of multivariate time series based on GNNs [36,37], and further introduced a graph learning layer to adaptively capture the relationships between variables from time series data, eliminating the restriction that a prior adjacency matrix is required for GNNs [38]. However, the graph learning layer of this method cannot capture the inherent causal correlation mechanism between variables.
To this end, we take advantage of these advances to design a new framework, CG-GED (governing equation discovery based on causal graph), to learn underlying governing equations from observed data of nonlinear dynamic systems. Firstly, a causal graph encoding method based on transfer entropy is devised to extract causal correlations between variables and learn a stable graph structure from observed data. Then, a spatio-temporal graph convolutional network is used to capture spatial and temporal dependencies simultaneously and obtain the approximate solution of the system. On this basis, derivatives are obtained by automatic differentiation to form an overcomplete dictionary of candidate algebraic terms. Finally, sparse regression is used to obtain the exact expression of the governing equation. Since the proposed framework is composed of a neural network and sparse regression, we use a novel cross-combination optimization strategy to learn the heterogeneous parameters of these two parts. The effectiveness and robustness of the proposed method are demonstrated on various nonlinear dynamic systems.
Our contributions can be summarized in the following three aspects:
• A novel governing equation discovery framework is constructed to obtain sparse representations of governing equations based on causal graph encoding.
• A new cross-combinatorial optimization strategy is implemented to learn heterogeneous parameters, which include the spatio-temporal graph convolutional network parameters and the governing equation coefficients.
• Through experiments on seven datasets from different fields, it is shown that the proposed method can discover precise expressions of governing equations from observed data without prior information.
We note that the adjacency matrix obtained by the transfer-entropy-based graph encoding method, which represents the causal relationships between the observed variables, promotes sparse learning of the interaction-dynamics coefficients in the governing equations. Meanwhile, it serves as a model input for the spatio-temporal graph convolution. On this basis, the spatio-temporal graph convolution can capture the complex local spatio-temporal correlations of the variables and obtain accurate approximations of the system solution.
In the parameter optimization stage, the neural network provides accurate modeling of the approximate solution and derivatives of the system for sparse regression; in turn, the governing equation obtained from sparse regression constrains and guides the network's parameter learning. At the same time, our proposed framework meets interpretability requirements, and the resulting governing equation naturally serves as an explanation of the graph encoding.

Problem formulation
We consider that the governing equation of a nonlinear dynamic system can be expressed as follows:

u_t = F(u, u_x, u_xx, . . .), x ∈ Ω, t ∈ [0, T], (1)

where u is the potential solution of the system observed over time t ∈ [0, T] in space x ∈ Ω. The goal is to determine the governing equation defined by F from the observed data of the system, without a detailed prior description of the dynamic system structure. Equation (1) can be rewritten as:

du_i(t)/dt = F(u_i(t)) + Σ_{j=1}^{n} A_ij G(u_i(t), u_j(t)), (2)

where u_i(t) = (u_i,1(t), u_i,2(t), . . . , u_i,m(t))^T is an observed variable in m dimensions, F(u_i(t)) is a linear or nonlinear function used to describe the self-dynamics of each observed variable, and G(u_i(t), u_j(t)) is designed to capture the dynamic mechanism of interaction between variables. A ∈ R^{n×n} in equation (2) is an adjacency matrix, where n indicates the number of variables in the system and A_ij denotes the influence of variable u_i(t) on u_j(t). Equation (2) can be further expanded as:

du_i(t)/dt = Φ_F(u_i(t)) ξ_F + Σ_{j=1}^{n} A_ij (Φ_G(u_i(t), u_j(t)) ⊗ ξ_G), (3)

where Φ_F is an extended library of function terms describing self-dynamics, and Φ_G is a library of candidate functions describing interaction dynamics. The time-varying matrices Φ_F(u_i(t)) and Φ_G(u_i(t), u_j(t)) are the mappings of the variables into the function spaces Φ_F and Φ_G, respectively, and the symbol ⊗ denotes the Kronecker product. ξ_F and ξ_G are the sparse coefficients to be inferred. At the same time, it is also necessary to determine the adjacency matrix A that characterizes the causal associations of the variables. The research goal of this article is to determine the governing equation defined by equation (3) from system observation data, without prior knowledge describing the governing equation of the nonlinear dynamic system in detail.
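To make the split in equation (3) concrete, the following sketch evaluates the self-dynamics and interaction-dynamics terms for scalar variables. The libraries phi_F and phi_G here are minimal hypothetical examples, not the full dictionaries of table 1.

```python
import numpy as np

def phi_F(u):
    # Illustrative self-dynamics library: [1, u, u^2, u^3, sin(u)]
    return np.array([1.0, u, u**2, u**3, np.sin(u)])

def phi_G(ui, uj):
    # Illustrative interaction library: [uj - ui, uj, ui*uj]
    return np.array([uj - ui, uj, ui * uj])

def rhs(u, A, xi_F, xi_G):
    # Evaluate du_i/dt = Phi_F(u_i) xi_F + sum_j A_ij Phi_G(u_i, u_j) xi_G
    n = len(u)
    du = np.empty(n)
    for i in range(n):
        du[i] = phi_F(u[i]) @ xi_F
        for j in range(n):
            if i != j and A[i, j] != 0:
                du[i] += A[i, j] * (phi_G(u[i], u[j]) @ xi_G)
    return du
```

With sparse coefficient vectors (e.g. xi_F selecting only the linear term, xi_G selecting only diffusive coupling uj − ui), this reproduces simple networked dynamics; the discovery task is the inverse problem of recovering xi_F, xi_G and A from data.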

The CG-GED model: foundation and design
In order to discover the underlying governing equations of the systems, we need to learn the parameters ξ_F, ξ_G and A in equation (3) from the observed data. However, the visible data u_v often cannot cover the complete state space, which requires reconstructing reliable hidden states û_h. The complete reconstructed state û = (u_v, û_h) allows us to calculate the time derivatives and derive accurate governing equations. Based on this, we design a method that integrates deep learning and sparse regression. As shown in figure 1, the proposed framework consists of four parts: causal graph encoding, a spatio-temporal graph convolutional network (STGCN), automatic differentiation, and governing equation discovery. The causal graph encoding is designed to learn the causal correlation information of the observed variables and obtain the adjacency matrix A as a prior input to the STGCN. The STGCN is used to reconstruct the hidden state of the system, and is connected to the governing equation discovery module through automatic differentiation. The governing equations of the nonlinear dynamic systems are then obtained through sparse regression. The core components of our method are explained in detail below.

Causal graph encoding
In order to construct graphs, most existing studies use distance-based metrics, such as dot products and Euclidean distances, to describe the correlation between variables [39]. The adjacency matrix obtained in this way is symmetric and cannot describe uni-directional causal relationships between variables.
In real systems, a change in one state can cause a change in another state, and we expect this uni-directional relationship to be expressed explicitly. Therefore, we propose a causal graph encoding module specifically designed to adaptively learn the adjacency matrix and capture the implicit causal relationships between observed variables. The proposed causal graph encoding module uses transfer entropy to define causal relationships between observed variables. Since its introduction by Schreiber [40], transfer entropy has been considered an important tool for analyzing causal relationships in nonlinear systems [41,42]. Transfer entropy is based on the basic framework of information theory and characterizes the causal relationship between variables in terms of information transfer. Considering the observed variables u_i(t) and u_j(t), the transfer entropy based on conditional mutual information is defined as:

TE_{j→i} = H(u_i,t | u^k_i,t−1) − H(u_i,t | u^k_i,t−1, u^l_j,t−1), (4)

where H(u_i,t | u^k_i,t−1) and H(u_i,t | u^k_i,t−1, u^l_j,t−1) are conditional entropies, e.g.

H(u_i,t | u^k_i,t−1) = −Σ p(u_i,t, u^k_i,t−1) log p(u_i,t | u^k_i,t−1). (5)

Further, the transfer entropy can be calculated as:

TE_{j→i} = Σ p(u_i,t, u^k_i,t−1, u^l_j,t−1) log [ p(u_i,t | u^k_i,t−1, u^l_j,t−1) / p(u_i,t | u^k_i,t−1) ], (6)

where p(u_i,t | u^k_i,t−1, u^l_j,t−1) and p(u_i,t | u^k_i,t−1) are conditional probability density functions. Since calculating the causality between variables requires attention to the states at past moments, k and l are the lengths of the time delays. Assuming that the dynamics are Markovian, theoretically k = l = 1, and equation (6) can be simplified to:

TE_{j→i} = Σ p(u_i,t, u_i,t−1, u_j,t−1) log [ p(u_i,t | u_i,t−1, u_j,t−1) / p(u_i,t | u_i,t−1) ]. (7)

Then, the causal relationship between the variables u_i(t) and u_j(t) can be expressed as follows: u_j(t) is considered to causally influence u_i(t) when

TE_{j→i} > 0. (8)

Using the calculated transfer entropy as prior information of the system, we take the transfer entropy matrix as the adjacency matrix A describing the causal interactions between observed variables. The element A_ij represents the causal relationship between u_i(t) and u_j(t). When i = j, A_ij = 1; when i ≠ j, A_ij is calculated as follows:

A_ij = TE_{i→j} if TE_{i→j} > 0, and A_ij = 0 otherwise. (9)

The advantage of the designed graph learning module is that we can learn the inherent causal relationships between variables from the observed data, which can serve as an explanation of the graph generative model. The adjacency matrix, used as the input of the subsequent spatio-temporal graph convolutional network, promotes the information aggregation of the graph convolution. At the same time, the adjacency matrix advances the sparse learning of the coefficients of the interaction dynamics terms.
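As a minimal sketch of how the Markovian (k = l = 1) transfer entropy can be estimated in practice, the code below uses a simple histogram (binned) approximation of the probability densities. The bin count and function name are illustrative choices, not necessarily the paper's exact estimator.

```python
import numpy as np

def transfer_entropy(x, y, bins=8):
    """Estimate TE_{y->x} with k = l = 1 via binned probability estimates.

    TE_{y->x} = sum p(x_t, x_{t-1}, y_{t-1}) *
                log[ p(x_t | x_{t-1}, y_{t-1}) / p(x_t | x_{t-1}) ]
    """
    xt, xp, yp = x[1:], x[:-1], y[:-1]
    # Joint histogram estimate of p(x_t, x_{t-1}, y_{t-1})
    pxyz, _ = np.histogramdd((xt, xp, yp), bins=bins)
    pxyz /= pxyz.sum()
    pxz = pxyz.sum(axis=2)       # p(x_t, x_{t-1})
    pyz = pxyz.sum(axis=0)       # p(x_{t-1}, y_{t-1})
    pz = pxyz.sum(axis=(0, 2))   # p(x_{t-1})
    te = 0.0
    for i in range(bins):
        for j in range(bins):
            for k in range(bins):
                p = pxyz[i, j, k]
                if p > 0:
                    # p(xt|xp,yp) = pxyz/pyz;  p(xt|xp) = pxz/pz
                    te += p * np.log((p / pyz[j, k]) / (pxz[i, j] / pz[j]))
    return te
```

When y drives x but not vice versa, the estimate TE_{y→x} is clearly larger than TE_{x→y}, which is the asymmetry the causal graph encoding exploits to populate A.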

Spatio-temporal graph convolutional network
To obtain the latent state of the systems, we adopt an STGCN inspired by [37], as shown in figure 1. The STGCN was originally proposed to solve traffic flow prediction problems, where the dependency relationships between nodes are assumed to be known, that is, the adjacency matrix exists as prior information. However, our work is conducted on data generated by real governing equations, where no adjacency information is observed as a prior, so the STGCN cannot be used alone. Therefore, this paper uses the adjacency matrix constructed from transfer entropy, together with the spatio-temporal coordinates {u, t}, as inputs to approximate the latent solution. The STGCN consists of several stacked spatio-temporal graph convolution layers and an output layer; each spatio-temporal layer contains a gated temporal convolution layer (Gated TCN) and a graph convolution layer (GCN).
Gated TCN: We use a gated TCN containing two parallel dilated causal convolutions [43] to capture the dynamic trends of the observed variables. Compared with methods based on recurrent neural networks (RNNs), a TCN can obtain a larger receptive field by stacking more dilated convolutional layers, and its back-propagation path differs from the temporal direction of the input, which avoids the gradient explosion/vanishing problems common in RNNs. The TCN extracts temporal features at different frequencies by concatenating p convolutional filters W^{1×k_i} (i = 1, 2, . . . , p) with convolutional kernels of different sizes 1 × k_i (i = 1, 2, . . . , p). The process is represented as follows:

h_f = tanh(concat(W_1 * x, W_2 * x, . . . , W_p * x)), (11)
h_g = σ(concat(V_1 * x, V_2 * x, . . . , V_p * x)), (12)
h = h_f ⊙ h_g,

where tanh(·) is the hyperbolic tangent activation function, σ(·) is the sigmoid activation function, * is the (dilated) convolution operation, ⊙ represents the element-wise Hadamard product, and concat(·) is the concatenation operation. The parameter d is the dilation factor.

GCN: After extracting the temporal characteristics of the observed variables, and combined with the adjacency matrix obtained by the graph learning module, the input observed variable matrix can be converted into a graph feature matrix. The nodes in the graph represent the hidden states of each observed variable output by the temporal convolution module, and the adjacency matrix represents the causal associations between the nodes. We perform graph convolution operations to aggregate node information with neighborhood information, which take the form:

Z = D^{−1} A H W, (13)

where A is the adjacency matrix obtained by equation (9), D is the degree matrix with D_ii = Σ_j A_ij, H is the hidden states output by the gated TCN module, and W is the trainable model parameter matrix.
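The graph convolution aggregation described above amounts to a few matrix products. A minimal numpy sketch follows, with Z = D^{-1} A H W as our reading of the row normalization (the paper's exact normalization variant may differ) and illustrative dimensions:

```python
import numpy as np

def gcn_layer(A, H, W):
    """One graph convolution step Z = D^{-1} A H W.

    A: (n, n) adjacency from the causal graph encoding (A_ii = 1),
    D: degree matrix with D_ii = sum_j A_ij (row normalization),
    H: (n, f_in) hidden states from the gated TCN,
    W: (f_in, f_out) trainable weights.
    """
    D_inv = np.diag(1.0 / A.sum(axis=1))
    return D_inv @ A @ H @ W
```

With the identity adjacency this reduces to a plain linear layer H W, while a dense adjacency averages neighborhood features before the linear transform, which is exactly the node-plus-neighborhood aggregation described in the text.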

Framework of STGCN:
The input observed variables first pass through a linear layer, and then through a Gated TCN to capture complex temporal dependencies. After that, the output hidden states are fed into the GCN module together with the adjacency matrix obtained from the graph learning module to extract causal association information between nodes. The output of the GCN is residual-connected with the initial input to prevent information loss during training. By stacking k layers, the STGCN is able to extract temporal dependencies at different frequencies through dilation factors of different sizes, and on this basis it can handle causal dependencies at different temporal levels. The output part of the STGCN concatenates the results of the k spatio-temporal layers and then obtains the output sequence through two linear layers.
The STGCN can simultaneously capture complex spatial and temporal dependencies and output the latent states of the systems. Automatic differentiation then yields accurate calculations of the essential derivatives. On this basis, the self and interaction dynamics in the candidate function library can be evaluated through the STGCN, and an accurate function library can be established and passed to the governing equation discovery module.

Governing equation discovery
The role of the governing equation discovery (GED) module is to reconstruct the system dynamics by selecting a few terms from the candidate function library through sparse optimization. Many studies have proposed methods for sparse coefficient identification [18,44,45]. However, the framework designed in this article connects the neural network and the governing equation discovery module through automatic differentiation. In other words, the designed architecture requires simultaneous optimization of heterogeneous variables, including the parameters of the STGCN and the coefficients of the governing equation, which cannot be directly obtained using existing regression methods.
For the CG-GED architecture proposed in this paper, the total loss function of training is defined as

L(W, ξ_F, ξ_G; u) = L_STGC + α L_GED + β L_L0, (15)

where W is the trainable parameter matrix of the STGCN and u is the input visible variable set. α and β are the relative weights of the loss terms, L_STGC is the data reconstruction loss of the STGCN, and L_GED is the physical loss of the GED module. L_L0 is the ℓ0 regularization term, which is used to promote the sparsity of the coefficient matrices ξ_F and ξ_G. These loss functions are specifically defined as:

L_STGC = (1/N) ∥û − u∥_F^2, (16)
L_GED = (1/N) ∥dû_i/dt − Φ_F(û_i) ξ_F − Σ_j A_ij Φ_G(û_i, û_j) ξ_G∥_F^2, (17)
L_L0 = ∥ξ_F∥_0 + ∥ξ_G∥_0, (18)

where N is the total number of input visible variables and û is the reconstructed state obtained by the STGCN. ∥·∥_0 and ∥·∥_F denote the ℓ0 norm and the Frobenius norm, respectively. It is worth noting that L_STGC ensures that the STGCN accurately approximates the latent solution of the system by fitting the original data, and connects to the GED module through automatic differentiation; L_GED constrains the STGCN by reconstructing the closed form of the system governing equations; and the L_L0 regularization ensures a sparse expression of the differential equations. It is very difficult to solve the optimization problem defined by equation (15) directly, as nonconvex ℓ0 optimization is an NP-hard problem. Most existing research focuses on relaxing the ℓ0 problem before solving it. A typical relaxation is to transform it into an ℓ1 problem [28,46], but the equivalence of ℓ0 and ℓ1 requires conditions such as the null-space property and the restricted isometry property [47]. Moreover, ℓ1 is more sensitive to noise, so the probability of solution failure is higher. To solve this problem, we use a cross-combination optimization strategy that splits the optimization of the total loss into a set of easily solvable subproblems and optimizes the heterogeneous parameters through alternating iterations.
In each iteration, the parameters of the STGCN are trained using the Adam optimizer with an initial learning rate of 0.001. Specifically, we stack six spatio-temporal GCNs with dilation factors {1, 2, 1, 2, 1, 2} and 16 output channels. In each spatio-temporal GCN, we use four filter sizes, that is, p = 4 in equations (11) and (12), with k_i = {2, 3, 6, 7}, to generate convolutional kernels of size 1 × 2, 1 × 3, 1 × 6, and 1 × 7 to cover different time scales. The skip connection layer has 32 output channels. Based on the kth iteration of STGCN parameter optimization, the sparse differential equation coefficients are updated at the (k+1)st iteration through STRidge [7], and the cycle continues until convergence. Note that the sparse coefficients of the differential equations are initialized by uniform sampling in [−1, 1]. The suboptimal solutions of the alternating subproblems eventually converge to the global optimum, yielding the optimal network parameters and the sparse coefficients of the governing equations simultaneously.
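For reference, the STRidge step used inside the alternating loop (sequentially thresholded ridge regression over the candidate library) can be sketched in a few lines of numpy. The threshold and ridge penalty values here are illustrative defaults, not the tuned hyperparameters of the paper:

```python
import numpy as np

def stridge(Theta, dudt, lam=1e-5, tol=0.1, iters=10):
    """Sequentially thresholded ridge regression (STRidge).

    Theta: (N, q) candidate-function library evaluated on the data,
    dudt:  (N,)  time derivatives from automatic differentiation,
    lam:   ridge penalty, tol: sparsity threshold.
    Returns a sparse coefficient vector xi with small entries zeroed.
    """
    xi = np.linalg.lstsq(Theta, dudt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < tol
        xi[small] = 0.0
        big = ~small
        if big.any():
            # Ridge regression restricted to the surviving terms
            G = Theta[:, big]
            xi[big] = np.linalg.solve(G.T @ G + lam * np.eye(big.sum()),
                                      G.T @ dudt)
    return xi
```

Given clean library evaluations and derivatives, the thresholding prunes spurious terms and the restricted ridge solve refines the surviving coefficients, which is the behavior the alternating scheme relies on at each outer iteration.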
It should be noted that selecting appropriate hyperparameters (α and β) is a key step in successfully discovering governing equations. In this study, the hyperparameter α is used to balance the loss contributions of L_STGC and L_GED, and it is usually estimated from the ratio between the visible variable u and its temporal derivative u_t, where u_t is obtained by finite difference approximation. The other hyperparameter, β, is the coefficient of the regularization term used in STRidge, which balances physical loss and equation sparsity. We solve the optimization problem defined in equation (20) based on the results of the first iteration of the STGCN, and use grid search to determine the optimal range of β.
In summary, the innovation of the CG-GED method is that it integrates the advantages of causal representation, deep learning, automatic differentiation and sparse regression. The method proceeds step by step to (i) encode the causal graph, (ii) approximate the solution of the system variables, (iii) calculate the essential derivatives, and (iv) determine the structure of the governing equations and the coefficients of the key terms. Therefore, this method can discover the underlying governing equations of nonlinear dynamic systems from observed data.

Applications and discussion
To verify the effectiveness of the proposed method, we conduct experiments on several typical nonlinear physical systems that can be described by differential equations. We also add Gaussian white noise with a fixed root-mean-square ratio to the original data to verify the robustness of the algorithm. We construct two candidate function libraries (Φ_F and Φ_G) based on the mathematical theory of differential equations, describing self and interaction dynamics respectively, and including derivatives of all orders, polynomials, trigonometric functions, exponential functions and others, as listed in table 1. The simulation experiments in this paper are conducted on a Linux workstation equipped with 80 Intel Xeon Gold 6242R CPUs and 2 Quadro RTX 5000 GPU cards.
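The noise model, Gaussian noise scaled to a fixed root-mean-square ratio of the clean signal, can be sketched as follows (the function name is illustrative; level = 0.1 corresponds to the 10% noise used in the experiments below):

```python
import numpy as np

def add_rms_noise(u, level, seed=0):
    """Add Gaussian noise whose RMS is `level` times the RMS of u."""
    rng = np.random.default_rng(seed)
    rms = np.sqrt(np.mean(u**2))
    return u + level * rms * rng.standard_normal(u.shape)
```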

Discovering the Kuramoto-Sivashinsky equation
Firstly, consider the complex dynamics of extended systems with intrinsic instabilities, controlled by the 1D Kuramoto-Sivashinsky (KS) equation [48]:

u_t + u u_x + u_xx + u_xxxx = 0, (21)

where the nonlinear term uu_x represents energy transfer, u_xx is the diffusion term, and u_xxxx is the dissipative term. The KS equation is widely used to describe unstable chaotic dynamics. We generated activity data by applying two different initial conditions (ICs) to the true equation, recording wave responses over 101 time steps for 1024 spatial grid nodes. In this case, the equation does not include interaction dynamics, so only the coefficients of the self-dynamics terms need to be learned. It is worth noting that the high-order derivative term u_xxxx in the KS equation poses a significant challenge for learning differential equations from poorly observed data. In this case, existing methods (such as the SINDy-based methods [7]) can identify the terms included in the differential equation, but the coefficients are substantially underestimated, indicating significant errors in numerical differentiation. The proposed method, however, uses automatic differentiation to obtain precise derivatives. Therefore, even when large-scale 10% noise is added to the original data, the proposed CG-GED method can still successfully identify the terms contained in the governing equation behind the data and discover coefficients very close to those of the real KS equation. As shown in figure 2, under different initial conditions, despite the addition of high noise to the sparse observation data, the full-field waves predicted by CG-GED match the accurate solution of the system. The predicted responses at different spatial and temporal coordinates are also very consistent with the actual responses. This is attributed to the automatic differentiation in the designed method, which reduces the error of the equation coefficients through accurate derivative calculation.

Discovering the Fitzhugh-Nagumo equations
We then test our method using neuronal activity data generated by FitzHugh-Nagumo (FHN) dynamics [49]. The FHN equations (24a) and (24b) comprise two coupled nonlinear differential equations describing activation-inhibition systems, which are usually used to represent the activity of biological neurons excited by external stimuli. Here u represents the membrane potential and v is a recovery variable. We set the diffusion coefficients to γ_u = 1 and γ_v = 100, where ∆ is the Laplacian operator, and ε = 1 is the coefficient of the interaction dynamics (the coupling term). The directed adjacency matrix of the interaction dynamics in equation (24a) is generated according to [20], and its average total-degree is 5.1.
Drawing on the data generation method proposed in [28], we generate three test datasets over the time period [0, T] = [7.18, 36] with different random initial conditions, and project them onto a 301 × 301 grid with dx = dy = 0.5 and dt = 0.06. Meanwhile, to verify the robustness of the algorithm, we add 10% noise to these three datasets and discover the equations from the noisy data using the proposed method. It can be seen that our method not only accurately identifies all the terms included in the equations, but also learns coefficients very close to those of the real equations. The u and v response trajectories generated by the real FHN equations and the inferred equations are displayed in figure 3. Figure 4 shows the true and inferred response snapshots at different times (t = 16.12, 29.93) under three initial conditions, as well as the prediction error maps. It is obvious that the predicted full-field waves are very close to the real system response, as the error distribution lies within a small range. Figure 5 shows the loss-iteration curve, which further illustrates that the proposed method converges quickly and obtains the optimal sparse coefficients of the governing equations.

Discovering the coupled Rössler equations
Another experiment involves a coupled nonlinear oscillator system represented by the Rössler equations [50]. The Rössler system is a simplification of the Lorenz system describing chaotic dynamics. We first generate chaotic activity data based on the true equations governing heterogeneous Rössler dynamics, with coupling strength ε = 0.15. The frequencies of the oscillators ω follow a normal distribution with a mean of 0 and a standard deviation of 0.1. As in the previous example in section 4.2, an adjacency matrix with average total-degree 5.1 is generated. Our method accurately infers the terms included in the equations describing the self and interaction dynamics, respectively. The coefficient 1.010 in equation (28a) is the average of the inferred frequencies ω̂. This example particularly demonstrates the accuracy and robustness of our method in discovering high-dimensional governing equations from highly noisy data. More specifically, the proposed causal graph learning module obtains an adjacency matrix consistent with the ground truth, as shown in figure 6. The system responses generated by the real and inferred equations under the same initial conditions are shown in figure 7. We also conducted experiments on nonlinear dynamic systems governed by homogeneous Rössler equations with coupling strength ε = 0.1, and discovered the equations from the generated data with 10% noise using our method. The results in figure 8 show that the discovered equations can indeed restore the dynamic response trajectories of the true system. It should be noted that, due to the chaoticity of the original system, the distance between the inferred and true trajectories gradually increases with response time.

Comparison with SINDy
In addition to the three experiments described in detail above, we also conduct experiments on four dynamic systems governed by typical nonlinear equations, and compare the governing equation learning performance of the proposed CG-GED method with the advanced PDE-FIND method [7] under different levels of noise. These are the Burgers', Schrödinger, Navier-Stokes and λ−ω reaction-diffusion systems. Respectively, the solution of the Burgers' equation is taken from an open dataset, the solution of the Schrödinger equation is simulated from Gaussian initial conditions, the solution of the Navier-Stokes vorticity equation is obtained using the immersed boundary projection method, and the λ−ω reaction-diffusion system response is generated at randomly selected spatial points.
The error used to evaluate the accuracy of differential equation identification is defined as the mean relative error (MRE) between the identified coefficients ξ̂ and the true coefficients ξ, that is, ∆ = (1/N) Σᵢ |ξ̂ᵢ − ξᵢ| / |ξᵢ|. NA indicates that the inferred equations contain incorrect terms. For the four systems in table 2, PDE-FIND exhibits a strong ability to discover governing equations from noiseless data, but when the data volume remains constant and the noise level increases, PDE-FIND fails. This is because PDE-FIND uses numerical differentiation to compute derivatives and therefore depends on strict data quality: when the observed data are noisy, derivative errors prevent the identified governing equations from matching the true equations. By contrast, the proposed CG-GED method correctly discovers the governing equations under different noise levels, which demonstrates the excellent robustness of CG-GED. The reason is that the causal graph encoding in CG-GED learns accurate causal correlations between system variables, and the STGCN also performs well as a function approximator; on this basis, accurate derivatives are obtained via automatic differentiation, and the cross-combinatorial optimization strategy we designed ultimately achieves accurate identification of the sparse coefficients in the governing equations.
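The MRE metric can be computed directly from the two coefficient vectors. In this small sketch, averaging only over the nonzero true coefficients is an assumption about how the mean is taken (relative error is undefined where ξᵢ = 0):

```python
import numpy as np

def mean_relative_error(xi_true, xi_hat):
    """MRE between identified and true coefficients, over nonzero true entries."""
    mask = xi_true != 0
    return np.mean(np.abs(xi_hat[mask] - xi_true[mask]) / np.abs(xi_true[mask]))
```

For example, with true coefficients (1, 2, 0, 4) and identified coefficients (1.1, 1.8, 0, 4.4), each nonzero term has a 10% relative error, so the MRE is 0.1.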

Summary and conclusions
This paper has proposed a new governing equation discovery method called CG-GED, which discovers the potential governing equations of nonlinear dynamic systems from noisy multidimensional observed data. The method combines causal graph encoding based on transfer entropy for mining causality, an STGCN for nonlinear function approximation, automatic differentiation for obtaining accurate derivatives, and sparse regression for obtaining the coefficient matrix. The adjacency matrix of causal correlations between variables obtained by the transfer-entropy-based causal graph encoding promotes sparse learning of the interaction dynamics coefficients and accelerates convergence. The spatio-temporal graph convolution obtains the latent state of the system and is connected to physics-informed sparse regression through automatic differentiation, serving as the basis for the sparse regression. The governing equations obtained from sparse regression in turn constrain and promote the parameter optimization of the spatio-temporal graph convolutional network. The global optimization therefore adopts a cross-combinatorial optimization algorithm, which obtains the coefficient matrix of the governing equations that best matches the system while optimizing the neural network parameters. The overall CG-GED framework effectively integrates data-driven and physics-informed scientific methods. The explicit governing equations discovered from observed data reflect the inherent dynamics of the systems, which also satisfies the interpretability requirements of scientific discovery. We conducted experiments on several nonlinear dynamic systems, generating test data with added noise from the true governing equations. The results all demonstrate the effectiveness and robustness of our method.
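The sparse-regression step that extracts the coefficient matrix from a candidate library can be sketched with sequential thresholded least squares in the SINDy style. This is a generic illustration, not the exact CG-GED cross-combinatorial optimizer; the threshold `lam` and iteration count are assumptions:

```python
import numpy as np

def stlsq(Theta, dXdt, lam=0.1, n_iter=10):
    """Sequential thresholded least squares: Theta (samples x terms),
    dXdt (samples x states) -> sparse coefficient matrix Xi (terms x states)."""
    Xi = np.linalg.lstsq(Theta, dXdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < lam
        Xi[small] = 0.0
        for k in range(dXdt.shape[1]):       # refit the surviving terms per state
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(Theta[:, big], dXdt[:, k], rcond=None)[0]
    return Xi

# demo: recover dx/dt = 2x - 0.5x^2 from a hypothetical library [1, x, x^2]
x = np.linspace(-1.0, 1.0, 50)
Theta = np.column_stack([np.ones_like(x), x, x ** 2])
Xi = stlsq(Theta, (2.0 * x - 0.5 * x ** 2).reshape(-1, 1), lam=0.1)
```

Thresholding and refitting drive negligible library terms to exactly zero, which is what yields an explicit, interpretable equation rather than a dense fit.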
Nonetheless, it cannot be ignored that optimizing the STGC neural network makes CG-GED computationally more costly than PDE-FIND. Fortunately, with the improvement of computer performance, the proposed method can be parallelized on a powerful GPU platform to efficiently and accurately discover the underlying governing equations of dynamic systems.
In addition, it is undeniable that when applying the proposed method to some systems, the candidate library may be incomplete. Even so, our method can still combine the terms present in the candidate library to generate governing equations. Such equations may lack the interpretive power needed to analyze the system, but they can approximate the solution of the original system numerically well. In future research, we will focus on dimensional analysis theory and attempt to combine deep learning with dimensional invariance theory [53] to discover governing equations that can explain and analyze the system.

Data availability statement
The data cannot be made publicly available upon publication because no suitable repository exists for hosting data in this field of study. The data that support the findings of this study are available upon reasonable request from the authors.