Validating a data-driven framework for vehicular traffic modeling

This study presents a data-driven framework for modeling complex systems, with a specific emphasis on traffic modeling. Traditional methods in traffic modeling often rely on assumptions regarding vehicle interactions. Our approach comprises two steps: first, utilizing information-theoretic (IT) tools to identify interaction directions and candidate variables, thus eliminating assumptions, and second, employing the sparse identification of nonlinear dynamics (SINDy) tool to establish functional relationships. We validate the framework's efficacy using synthetic data from two distinct traffic models, while considering measurement noise. Results show that IT tools can reliably detect directions of interaction as well as instances of no interaction. SINDy proves instrumental in creating precise functional relationships and determining coefficients in the tested models. The innovation of our framework lies in its ability to use a data-driven approach to model traffic dynamics without relying on assumptions, thus offering applications in various complex systems beyond traffic.


Introduction
In complex systems, group-level behaviors such as self-organization and phase transitions emerge from interactions between units. Traffic systems are examples of complex systems, where 'interaction' refers to the dynamic relationships and influences between vehicles on the road [1]. Understanding these interactions is vital for developing precise models, which have practical applications in improving traffic planning and reducing travel times, fuel consumption, pollution, and congestion [2].
Traffic systems can be modeled in a variety of ways. A popular approach is agent-based models, which create a road network, add agents to it, and define their behavior and rules of interaction [3]. Traffic flow simulation software applications have been developed based on microscopic agent-based modeling, including MovSim [4,5], SUMO [6,7], and MITSIM [8,9]. Microscopic agent-based models consider the driver and vehicle as one entity, and the movement of every driver-vehicle unit is simulated, considering car-following dynamics [9,10], lane-changing behavior [11,12], gap acceptance maneuvers [13], and movement at intersections [14]. Numerous models have been developed (Gipps' model [15], the intelligent driver model (IDM) [16], the optimal velocity model (OVM) [17]), each with its own set of rules. These models exhibit sparsity, which means they consider a small number of relevant features or coefficients. In traffic modeling, this sparsity aids in understanding vehicle movement and capturing the underlying physics. However, these models rely heavily on assumptions about driver interaction, such as each driver being influenced only by the vehicle immediately in front.
In an attempt to mitigate these assumptions, some researchers have turned to artificial intelligence (AI) models that incorporate real-world data [18][19][20][21]. Nevertheless, there remains a common concern that these AI models act as black boxes [10]. This criticism implies that neural network models can be understood solely based on their inputs and outputs, without providing insight into their internal mechanisms. For traffic systems, sparse microscopic models can be more reliable than black-box models because sparse models are based on well-defined principles that mimic human behavior by taking into account parameters like headway distance and speed, offering improved transparency and interpretability. The interpretability of sparse models facilitates a better understanding of traffic dynamics, whereas black-box models lack explicit rules for driving behavior. Moreover, the computational efficiency of these sparse models makes them effective in the design and control of traffic systems when implemented in real time [22,23].
This paper presents a two-step approach to develop sparse traffic models from data, eliminating the need for assumptions. The first step eliminates assumptions by detecting true directional relationships between vehicles based on trajectory data, without any prior knowledge of vehicle interactions. Specifically, we investigate how many preceding and following vehicles influence a subject vehicle in single-lane traffic. To achieve this, we employ information-theoretic (IT) tools, which help identify the relevant candidates to be included in the model. The final step uses sparse identification of nonlinear dynamics (SINDy) to identify functional relationships between the candidate variables, considering only vehicles that exert influence on the subject, thus completing the model identification process.
Information theory has emerged as a valuable tool for detecting directional relationships directly from data in complex system studies. Specifically, transfer entropy (TE) and conditional transfer entropy (CTE) can quantify coupling between time-series variables, and therefore identify candidate variables for a model. These metrics have found successful applications in the study of various complex systems including human brain activity [24][25][26], animal collective behavior [27][28][29][30][31], climate modeling [32], policy-making [33][34][35][36][37] and financial markets [38,39]. However, their application within vehicular traffic systems has been relatively limited [40][41][42][43]. While IT metrics can provide empirical evidence of the candidate variables required for fully describing a system's dynamics, the functional relationship between the variables still needs to be discovered. Identifying the functional relationships between the candidate variables using a sparse modeling approach involves selecting, from a larger library of candidate functions, a minimal combination of nonlinear functions that fully describes the system dynamics, without making any prior assumptions. This relationship can be detected using the SINDy framework [44][45][46][47], which has found applications in various fields such as fluid mechanics [48], plasma dynamics [49,50], optics [51], and power grids [52].
The present study employs a data-driven framework to construct sparse traffic models, combining IT tools with SINDy. Specifically, we evaluate the effectiveness of TE and CTE measures in detecting true coupling using synthetic data from two distinct car-following models in stop-and-go (jammed-flow) and free-flow traffic scenarios. The functional relationships among candidate variables are then determined using SINDy across varying levels of noise. Toy models serve as ground truth to evaluate the performance of TE and CTE and the accuracy of the dynamical equations identified by SINDy. The innovative aspect of this work lies in its validation of the proposed data-driven modeling framework and in establishing confidence in these tools before applying them to real-world scenarios. The findings from this study additionally provide insights by comparing the two IT metrics and enhancing understanding of how the results from these metrics can be interpreted.

Methods
In this section, we present the two microscopic traffic models used to generate the toy datasets, the IDM and the OVM, as well as the data-analytic tools used for analysis: TE, CTE, and SINDy. While the IDM and OVM are structurally different, both are car-following models that determine a subject vehicle's acceleration by considering only the relative speeds and/or positions between it and the vehicle immediately ahead of it.

IDM
In the IDM, the input parameters are the vehicle's speed $v$, the bumper-to-bumper distance to the leading vehicle (distance headway) $s$, and the relative speed $\Delta v$. The model outputs the acceleration of a vehicle as:
$$\frac{dv}{dt} = a\left[1 - \left(\frac{v}{v_0}\right)^{\delta} - \left(\frac{s^*(v, \Delta v)}{s}\right)^{2}\right], \qquad (1)$$
where $a$ is the maximum vehicle acceleration, $v_0$ is the desired velocity, $\delta$ is an acceleration exponent, and $s^*$ is the desired minimum headway. The first part of the equation describes free flow, in which the acceleration decreases to zero as the speed approaches $v_0$. The second part corresponds to the interaction term (braking), where the current distance headway $s$ and the desired headway $s^*$ are compared and deceleration is increased as the current headway decreases. The desired minimum headway $s^*$ is given by:
$$s^*(v, \Delta v) = s_0 + vT + \frac{v\,\Delta v}{2\sqrt{ab}}, \qquad (2)$$
where $s_0$ is the minimum gap allowed, $T$ is the time headway, and $b$ is a positive coefficient defining the rate of deceleration [16]. The ballistic method [53] is used to solve equation (1), and the speed and vehicle positions are determined as:
$$v(t + \Delta t) = v(t) + \frac{dv}{dt}\,\Delta t, \qquad x(t + \Delta t) = x(t) + v(t)\,\Delta t + \frac{1}{2}\frac{dv}{dt}\,(\Delta t)^2. \qquad (3)$$
If the front vehicle is at rest, there is a possibility of calculating both a negative acceleration and a negative velocity in the next time step. This negative velocity is prevented by implementing the following conditional statement [53]:
$$\text{if } v(t + \Delta t) < 0: \qquad x(t + \Delta t) = x(t) - \frac{1}{2}\frac{v^2(t)}{dv/dt}, \qquad v(t + \Delta t) = 0.$$
Each vehicle is simulated identically using typical model parameter values of $T = 1.5$, $a = 0.3$, $b = 3$, $\delta = 4$, $s_0 = 2$ [16]. The circumference of the circular track is set to $L = 314$ meters and $v_0 = 30$ km h$^{-1}$ based on [54], and each vehicle is given an initial speed of 30 km h$^{-1}$. Trajectory data for each vehicle are recorded at a 0.1 second interval.
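The IDM update described above can be sketched in a few lines of Python. This is a minimal illustration, not the simulation code used in the study: the function names are hypothetical, SI units (meters, seconds) are assumed for the parameter values because the text lists them without units, the relative speed is taken as the approach rate `dv = v - v_lead`, and the stopping rule is the standard form of the ballistic heuristic.

```python
import math

# Parameter values quoted in the text; SI units (m, s) are an assumption,
# since the text lists the numbers without units.
T, A_MAX, B, DELTA, S0 = 1.5, 0.3, 3.0, 4.0, 2.0
V0 = 30.0 / 3.6      # desired speed: 30 km/h expressed in m/s
DT = 0.1             # recording interval in seconds, reused here as the step


def idm_acceleration(v, s, dv):
    """IDM acceleration for speed v, gap s and approach rate dv = v - v_lead."""
    s_star = S0 + v * T + v * dv / (2.0 * math.sqrt(A_MAX * B))
    return A_MAX * (1.0 - (v / V0) ** DELTA - (s_star / s) ** 2)


def ballistic_step(x, v, acc, dt=DT):
    """Advance one vehicle by dt; stop it instead of letting v go negative."""
    v_new = v + acc * dt
    if v_new < 0.0:
        # Vehicle halts within the step: acc < 0 here, so x still moves forward.
        return x - 0.5 * v * v / acc, 0.0
    return x + v * dt + 0.5 * acc * dt * dt, v_new
```

With these values, a vehicle travelling at the desired speed with a very large gap has an acceleration close to zero, while a stationary vehicle with a large gap accelerates at almost the maximum rate.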

OVM
Different from the IDM, vehicle acceleration in the OVM depends on the difference between the vehicle's current speed $v$ and the optimal speed $V(s)$:
$$\frac{dv}{dt} = a_h\left[V(s) - v\right]. \qquad (4)$$
The parameter $a_h$ accounts for heterogeneity among vehicle types and drivers [17], which in our simulation is assumed to be constant for all vehicles for all time. The optimal velocity (OV) function $V(s)$, equation (5), is a hyperbolic tangent function of the distance headway $s$. The minimum distance headway $s_0$ along with the constants $\alpha$, $\beta$, and $v_0$ scale the hyperbolic tangent function and determine the response of the OV given the value of the distance headway $s$. Here, as $s$ approaches $s_0$, the OV reduces to zero to avoid collision. We choose the parameter values as $a_h = 1.8$, $\alpha = 5.5$, $\beta = 0.37$, $s_0 = 9.1$, and $v_0 = 4.9$ based on empirical evidence [55]. To avoid bias in synthetic data generation, simulation conditions (such as track length, initial conditions, etc) are kept the same as those used with the IDM.

IT tools
Pairwise TE and CTE (also called causation entropy) are extensions of the definition of entropy described by Shannon in 1948 [56]. TE was formalized concurrently by Schreiber [57] and Palus et al [58] to assess the information exchange between two variables ($X$ and $Y$) over time. It is a metric commonly used to detect the coupling strength and direction between time-series variables. For example, to detect coupling from $Y \to X$, it quantifies how much information $Y$ can provide about the future state of $X$ using the present states of both variables. With a first-order Markov process assumption, pairwise TE is defined as:
$$T_{Y \to X} = \left\langle \log \frac{p(x_{n+1} \mid x_n, y_n)}{p(x_{n+1} \mid x_n)} \right\rangle,$$
where $\langle \cdot \rangle$ is the average computed over all the samples, $n$ is the time index, $p(\cdot)$ denotes probability, and $p(x_{n+1} \mid x_n)$ is the probability of $x_{n+1}$ conditioned on the present state $x_n$. If there is no influence from $Y$ on $X$, then $p(x_{n+1} \mid x_n, y_n) = p(x_{n+1} \mid x_n)$, and $T_{Y \to X} = 0$. The unit for TE is determined by the base of the logarithm used, i.e. 'nats' for $\log_e$ and 'bits' for $\log_2$.
When three (or more) variables are involved ($X$, $Y$, and $Z$), pairwise TE may not distinguish indirect couplings [59]. In such scenarios, CTE can be applied. CTE evaluates the direct influence of $Y$ on $X$ while accounting for any indirect influence from $Z$, and is defined as:
$$C_{Y \to X \mid Z} = \left\langle \log \frac{p(x_{n+1} \mid x_n, z_n, y_n)}{p(x_{n+1} \mid x_n, z_n)} \right\rangle,$$
where $p(x_{n+1} \mid x_n, z_n, y_n)$ is the probability of $x_{n+1}$ conditioned on $x_n$, $y_n$ and $z_n$. When $Y$ does not influence $X$, the numerator and denominator become equal, and thus $C_{Y \to X \mid Z}$ equals zero.
Both TE and CTE are asymmetric by construction ($T_{X \to Y} \neq T_{Y \to X}$ and $C_{Y \to X \mid Z} \neq C_{X \to Y \mid Z}$), allowing the dominant direction of information flow (coupling direction) to be identified [35]. When examining TE and CTE coupling from finite empirical data, a statistical significance test can be conducted using surrogate data to determine whether the resultant value is statistically different from zero [40]. In the present work, the IT measurements and statistical significance tests are performed using the Java Information Dynamics Toolkit for Matlab [60]. The Kraskov estimation method is used for bias correction [61] when estimating the probability density functions.
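To make the difference between pairwise TE and CTE concrete, the sketch below estimates both from discrete time series by counting occurrences (a plug-in estimator). This is an assumption made for illustration only: the study itself uses the Kraskov estimator from the Java Information Dynamics Toolkit on continuous data, and the function name and the toy series are hypothetical. In the toy example, $Z$ drives $X$ with a one-step delay and $Y$ is a noisy copy of $Z$, so $Y$ appears to influence $X$ until $Z$ is conditioned on.

```python
import math
import random
from collections import Counter


def conditional_transfer_entropy(x, y, z=None):
    """Plug-in estimate (in nats) of C_{Y->X|Z}; with z=None it is T_{Y->X}."""
    if z is None:
        z = [0] * len(x)          # constant conditioning variable: no effect
    samples = list(zip(x[1:], x[:-1], y[:-1], z[:-1]))
    n = len(samples)
    c_full = Counter(samples)                                # (x1, x0, y0, z0)
    c_xyz = Counter((x0, y0, z0) for _, x0, y0, z0 in samples)
    c_x1xz = Counter((x1, x0, z0) for x1, x0, _, z0 in samples)
    c_xz = Counter((x0, z0) for _, x0, _, z0 in samples)
    total = 0.0
    for (x1, x0, y0, z0), count in c_full.items():
        p_with_y = count / c_xyz[(x0, y0, z0)]               # p(x1 | x0, z0, y0)
        p_without_y = c_x1xz[(x1, x0, z0)] / c_xz[(x0, z0)]  # p(x1 | x0, z0)
        total += count / n * math.log(p_with_y / p_without_y)
    return total


# Toy series (hypothetical): Z drives X; Y only mirrors Z with 10% bit flips.
random.seed(0)
z = [random.randint(0, 1) for _ in range(20000)]
x = [0] + z[:-1]                                # x copies z one step later
y = [b ^ (random.random() < 0.1) for b in z]    # y is a noisy copy of z
te = conditional_transfer_entropy(x, y)         # pairwise: Y looks influential
cte = conditional_transfer_entropy(x, y, z)     # conditioned on Z: vanishes
```

Here `te` is clearly positive even though $Y$ has no direct effect on $X$, whereas `cte` is zero because $Z$ already explains the next state of $X$. A plug-in estimator like this one is biased for small samples and is not suitable for continuous data without binning.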

Sparse identification of nonlinear dynamics (SINDy)
Here we present an overview of how SINDy identifies governing dynamical systems models from data. SINDy considers a dynamical system of the form:
$$\dot{x}(t) = f(x(t)),$$
where the vector $x(t) = [x_1(t), x_2(t), \ldots, x_n(t)] \in \mathbb{R}^n$ represents the state of a system at time $t$ and the function $f(x)$ describes the temporal evolution of the system's state. SINDy identifies the fewest terms that approximate the unknown $f(x)$ and establishes the model based on a library of $p$ candidate basis functions $\theta_j$:
$$f_k(x) \approx \sum_{j=1}^{p} \theta_j(x)\,\xi_{jk},$$
where the coefficients $\xi_{jk}$ are typically zero, and entries that are not zero indicate active terms in the dynamics. To find $f$, time-series measurements of $x$ and their time derivatives $\dot{x}$ (measured directly or approximated numerically) are sampled at several time steps $t_1, t_2, \ldots, t_m$ and arranged into matrices such that
$$X = [x(t_1)\;\; x(t_2)\;\; \cdots\;\; x(t_m)]^T, \qquad \dot{X} = [\dot{x}(t_1)\;\; \dot{x}(t_2)\;\; \cdots\;\; \dot{x}(t_m)]^T,$$
where $m$ is the sample size and $n$ is the number of states. The library functions are next evaluated on the data by constructing
$$\Theta(X) = [\theta_1(X)\;\; \theta_2(X)\;\; \cdots\;\; \theta_p(X)].$$
Finally, SINDy uses the sparse regression technique to approximately solve:
$$\dot{X} = \Theta(X)\,\Xi,$$
where $\Xi = [\xi_1\;\; \xi_2\;\; \cdots\;\; \xi_n]$ is a set of coefficients that determines the active terms in $f$. An extension of SINDy, referred to as SINDy-PI [46], has been developed for implicit differential equations of the form:
$$f(x, \dot{x}) = 0,$$
and then the sparse model is detected as
$$\Theta(X, \dot{X})\,\Xi = 0.$$
It is also possible to include control input data in the SINDy-PI algorithm.
The equations that describe the two traffic models we utilize for data generation are implicit. Creating a traffic model for a subject vehicle requires incorporating data from adjacent vehicles as control input. Given that SINDy-PI is capable of handling these requirements, we employ it in our study.
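The sparse regression step can be illustrated with sequentially thresholded least squares, the optimizer commonly associated with the original (explicit) SINDy formulation. The sketch below is not SINDy-PI and is not the code used in this study; the function name, threshold, toy system, and library are all assumptions chosen to keep the example small.

```python
import numpy as np


def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: sparse xi with theta @ xi ~ dxdt."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        for k in range(dxdt.shape[1]):          # refit each state separately
            keep = ~small[:, k]
            if keep.any():
                xi[keep, k] = np.linalg.lstsq(
                    theta[:, keep], dxdt[:, k], rcond=None
                )[0]
    return xi


# Toy data from a damped oscillator: dx1/dt = x2, dx2/dt = -2*x1 - 0.5*x2.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
dX = np.column_stack([X[:, 1], -2.0 * X[:, 0] - 0.5 * X[:, 1]])

# Library of candidate functions: 1, x1, x2, x1^2, x1*x2, x2^2.
x1, x2 = X[:, 0], X[:, 1]
Theta = np.column_stack([np.ones_like(x1), x1, x2, x1**2, x1 * x2, x2**2])
Xi = stlsq(Theta, dX)
```

Each column of `Xi` holds the coefficients for one state equation. With noise-free data, only the entries for `x2` in the first column and for `x1` and `x2` in the second column remain nonzero, which mirrors how active terms are read off from $\Xi$.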

Results and discussion
In this section, we first present the results of using IT tools to detect the true nature of interactions, i.e. whether vehicles react only to vehicles in front, or also to vehicles behind and further ahead. Next, we identify the rules for these interactions, essentially establishing functional relationships among the relevant variables. Throughout, we validate the effectiveness of the proposed framework using toy data with known ground truth.

Generating synthetic traffic data
We generate synthetic data from two fundamentally different car-following models, IDM and OVM, to validate the proposed data-analytic framework for identifying traffic models. The utilization of synthetic data is motivated by its known ground truth, providing a benchmark for evaluating the performance of the tools employed in this study. We generate data by simulating vehicles on a circular track with a single lane of traffic, as illustrated in figure 1. This setup allows large samples of data to be generated by tracking vehicles within a fixed arena. For IT measures, large datasets are necessary to estimate probability density functions accurately. Simulating a single lane eliminates lateral interactions (influence from adjacent lanes). For each model, we simulate jammed-flow and free-flow traffic conditions by adjusting the vehicle density within a constant track length of 314 meters. For jammed-flow, the number of vehicles ($N_v$) is set to 30, while for free-flow, $N_v$ is set to 15. The circular track is simulated by imposing a periodic boundary condition, which treats the last vehicle ($i = N_v$) separately: the position of the vehicle in front of it is taken as the position of the first vehicle ($i = 1$) plus the track length. For visualization, vehicle positions are wrapped within the track length.
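The periodic boundary condition can be sketched as follows. The function names are hypothetical, positions are assumed to be unwrapped and ordered so that vehicle $i + 1$ is directly ahead of vehicle $i$, and vehicle length is ignored for simplicity.

```python
def headways(positions, track_length=314.0):
    """Gap from each vehicle to the one ahead on a ring road.

    positions[i] is the unwrapped position of vehicle i, ordered so that
    vehicle i + 1 is directly ahead of vehicle i.
    """
    n = len(positions)
    gaps = []
    for i in range(n):
        if i == n - 1:
            # Last vehicle follows the first one, shifted by one lap.
            lead = positions[0] + track_length
        else:
            lead = positions[i + 1]
        gaps.append(lead - positions[i])
    return gaps


def wrap(position, track_length=314.0):
    """Map an unwrapped position onto the track, for plotting only."""
    return position % track_length
```

For three vehicles at 0, 100 and 200 meters, `headways` returns gaps of 100, 100 and 114 meters, which sum to the track length as expected on a closed loop.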

Detect interaction using IT tools
Using synthetic data with known ground truth, we evaluate the ability of IT tools to identify coupling direction accurately. For IT analysis, we compute the observables shown in figure 1. For the $i$th vehicle at a given instant, we compute the distance headway between vehicle $i$ and its immediate front vehicle ($D_{i+1}$), the distance from its immediate rear vehicle ($D_{i-1}$), the distance from the $(i+2)$th vehicle ($D_{i+2}$), and the distance that vehicle $i$ travels in the next $\Delta t$ time interval ($D_{\Delta t}$). Subsequently, we employ the IT measures to quantify coupling between these time-series variables to detect the influence of the adjacent vehicles on a subject vehicle. For IT analysis, the simulated vehicle trajectory data are resampled to obtain data points at one second intervals to match human driver reaction time [40]. This results in 29 000 samples per vehicle after excluding the first 1000 samples to eliminate the initial transient. Table 1 provides a summary of the variables and samples utilized in the IT and SINDy-PI analyses, along with the interpretation of the corresponding analysis.
The results of the IT analysis are presented in figure 2. Sub-figures in the left column correspond to jammed-flow ($N_v = 30$), while those in the center column correspond to free-flow ($N_v = 15$). Results of the IT analysis of OVM data are shown in the first three rows, and results of the IT analysis of IDM data are shown in the last row.

IT analysis on OVM data
As a first step, we use pairwise TE to identify the influence from the vehicle directly in front on the $i$th vehicle ($T_{i+1 \to i}$), as well as the influence from the vehicle directly behind ($T_{i-1 \to i}$). Figure 2(a) shows that for all 30 vehicles, pairwise TE accurately identifies statistically significant coupling from the front vehicle. However, it detects false coupling from the rear vehicle, since the OVM dynamics do not incorporate information from the rear vehicle. A common issue with the pairwise TE measure is its difficulty in discerning indirect coupling. This arises because when the front vehicle alters its position, it causes the $i$th vehicle to change its position as well. This, in turn, changes the distance to its immediate rear vehicle, which is the variable used in TE to detect the influence of the rear. Therefore, in the presence of unidentified confounding variables, the pairwise TE measure is often used to identify the dominant coupling direction. For instance, in this scenario, TE accurately identifies that the dominant coupling direction is always from the front vehicle, as $T_{i+1 \to i}$ is consistently greater than $T_{i-1 \to i}$. These measures also exhibit consistent values across vehicles, as data are generated for all vehicles from exactly the same model.
This homogeneity is lost in the pairwise TE analysis of free-flow data (figure 2(b)), where, for some vehicles, TE detects statistically significant influence from the front but not from the rear. The cross symbols mark measurements that are not statistically different from zero. To explain this behavior, figure 2(c) plots the OV function (equation (5)) versus distance headway $s$ from our simulation. The plot demonstrates that vehicles maintain their maximum speed if their distance headway exceeds a threshold. The OV function determines how a vehicle responds when the headway drops below the threshold, as it reduces its speed. Using the distance headway of all the vehicles measured from our simulation data, we compute the interaction regimes corresponding to free and jammed traffic. Vertical lines represent the average headway of all vehicles over the entire simulation time (dotted line for jammed-flow, dashed line for free-flow), and shaded regions indicate one standard deviation. Notably, we observe that in free-flow conditions, with distance headway exceeding the threshold, vehicles travel at the maximum desired speed. As a result, they are not influenced by their immediate front cars, resulting in the absence of interactions.
Next, we employ the CTE measure on OVM data to determine its ability to accurately discern that the coupling is solely present in jammed traffic and that the direction of influence is only from the front vehicle. The results of the CTE analysis with OVM data and the corresponding schematic illustrations are presented in figures 2(d)-(i). We conduct a thorough analysis by examining the influence of two preceding vehicles ($i + 1$ and $i + 2$) and the immediate rear vehicle ($i - 1$) on the $i$th vehicle.
In figure 2(d), CTE correctly detects that the significant coupling is from the immediate front car for each $i$th vehicle in jammed traffic. However, CTE values from the rear vehicle are also identified as statistically significant, albeit with magnitudes much smaller than those from the front vehicle. This is evident from the averages computed over 30 vehicles, with $C_{i+1 \to i \mid i-1} = 0.5120$ nats and $C_{i-1 \to i \mid i+1} = 0.0109$ nats. This finding indicates that it is necessary to account for both the significance of the test results and the actual CTE values in order to infer coupling. The near-zero value of the rear-to-target coupling indicates that, despite its statistical significance, this coupling is not present and can therefore be ignored. Similarly, in figure 2(g), the results of the CTE analysis involving two preceding vehicles indicate a negligible influence from the $(i+2)$th vehicle, which can thus be disregarded. Therefore, combining knowledge from figures 2(d) and (g), the CTE results accurately indicate that the only influence in jammed traffic originates from the immediate front car ($i + 1$).
Figures 2(e) and (h) illustrate the results of the CTE analysis on free-flow data, correctly revealing non-significant coupling from either direction. The results demonstrate that CTE can accurately detect the lack of interaction. The lack of interaction in free-flow OVM is explained above using the OV function in figure 2(c).

IT analysis on IDM data
For the IDM, we use only CTE analysis, in line with the conclusion of the IT analysis on OVM data that CTE is better suited than pairwise TE to our study because it takes indirect coupling into account. Figures 2(j) and (k) present the results of the CTE analysis of IDM data. For both jammed and free-flow, the CTE results accurately identify that the only coupling is from the immediate front vehicle. Note that, unlike in the OVM, the IDM has interaction occurring even in free-flow. This is evident from figure 2(l), which displays the interaction regions computed, as in figure 2(c), from the interaction term in equation (1). The interaction regions show that, for the simulation parameters used, there is strong coupling in jammed-flow and weak coupling in free-flow, which is also determined correctly by the CTE measure. Specifically, upon comparing the $C_{i+1 \to i \mid i-1}$ values between figures 2(j) and (k), we find that the influence from the front vehicle is more pronounced in jammed flow than in free-flow. The stronger influence in jammed traffic occurs because vehicles go through a series of stop-and-go events, requiring them to respond more frequently to the front vehicle than when traffic is flowing freely.
Figure 2. Results of the IT analysis. In the first two columns, IT units on the vertical axis are in nats and the black symbol x denotes that the empirical measurements are not statistically different from zero. In the third column: sub-figure (c) shows the regions of interaction experienced during the jammed and free flow simulation of OVM, sub-figures (f) and (i) are schematics showing the conditions being tested for their respective rows, and (l) shows the regions of interaction experienced during the jammed and free flow simulation of IDM.
In summary, the IT analysis of OVM and IDM data shows that CTE is a more effective approach than pairwise TE for distinguishing indirect influences and inferring whether coupling exists. To accurately infer the presence of coupling, both statistical significance and actual values must be considered. Finally, the CTE measure is found to be sensitive to coupling strength (e.g. for the IDM, the front vehicle exerts more influence in jammed flow than in free-flow). In this study, the CTE analysis identifies coupling only from the immediate front vehicle, thus matching the ground truth of both car-following models. This tool is thus validated on two distinct traffic models as accurately inferring directional influence from adjacent vehicles, supporting its usefulness for real-world applications. This validation is crucial because accurately identifying the range of vehicles whose variables should be incorporated into the model is essential for the development of sparse traffic models.

Model identification using SINDy-PI
The IT analysis of IDM and OVM data suggests that to create a traffic model for vehicle $i$, only variables from the immediate front vehicle ($i + 1$) need to be included, as the other vehicles have no influence. With this knowledge, we employ SINDy-PI to investigate its capability for discerning functional relationships among variables, thereby completing the model identification. We use data from jammed-flow scenarios, where interaction is present in both the OVM and IDM. When applying SINDy-PI, we randomly select a vehicle from the set of 30 vehicles. Notably, we observe that this random selection does not impact the performance of SINDy-PI.
The position and velocity of vehicle $i$ are considered as the primary states $x = [x_1, x_2]$, the position and velocity of vehicle $i + 1$ are considered as the control input $u = [u_1, u_2]$, and the velocity and acceleration of vehicle $i$ are represented as the time derivatives $\dot{x} = [dx_1, dx_2]$. We utilize 400 seconds of data for training and 200 seconds for testing, with a time resolution of $\Delta t = 0.1$ seconds. This accounts for a total of 4000 data points for training and 2000 data points for testing. The testing and training sets remain the same throughout the evaluations of SINDy-PI. In SINDy-PI, the candidate basis functions included in the library $\Theta$ dictate the form of the final model. We assume the final model to be in the form of a polynomial, which allows a direct comparison between each coefficient obtained through SINDy-PI and the model itself. Consequently, the IDM in equation (1) and the OVM in equation (4) are transformed into polynomials (details on these polynomial expansions are provided in the appendix). The OVM equation is expanded using a Padé approximation of order [1,2], where this order denotes the power of the polynomials present in the numerator and denominator, respectively, of the series approximation [62,63].
To evaluate the performance of SINDy-PI, we construct the library of candidate basis functions $\Theta$ from the compulsory terms ($\Theta_C$) derived from the relevant models, to which we add redundant terms ($\Theta_R$) to create a complete polynomial of a given degree. The compulsory terms can be regarded as the initial hypothesis of the model. The performance of SINDy-PI is assessed by its ability to accurately retain the compulsory terms, determine their correct coefficients, and reject the redundant terms. When constructing the library, we set the maximum exponent for $x$ to 4 and for $u$ to 2, resulting in terms with a highest power of 6 ($x_1^4 u_1^2$, $x_1^4 u_1 u_2$, $\ldots$) and yielding a total of 250 terms. Here $\Theta_C$ is the set of all compulsory terms required to compose either the IDM or the OVM, presented in table 2, and $\Theta_R$ contains the redundant terms.

SINDy-PI analysis of IDM and OVM
We systematically evaluate the performance of SINDy-PI, starting with only the compulsory terms in the library for each model, denoted as $\Theta_0 = \Theta_C$. Next, we incrementally introduce redundant terms by selecting from $\Theta_R$ up to a specified polynomial degree:
$$\Theta_1 = \Theta_C \cup \Theta_R \ (\text{terms up to degree 2}), \quad \ldots, \quad \Theta_5 = \Theta_C \cup \Theta_R \ (\text{terms up to degree 6}).$$
We evaluate SINDy-PI corresponding to each library $\Theta_N$ and compare these results to the true model coefficients of the IDM and OVM. The performance of SINDy-PI is quantified in terms of sensitivity, specificity, and accuracy. Terms identified by SINDy-PI are labeled as true positive (TP) if they are present in the true model and false positive (FP) if they are not. As shown in table 2, the total number of positives (active terms) for the IDM and OVM are $P = 25$ and $P = 18$ respectively, and the number of negatives, $N$, depends on the library used for testing. Sensitivity, specificity, and accuracy are then measured as follows:
$$\text{sensitivity} = \frac{TP}{P}, \qquad \text{specificity} = \frac{TN}{N}, \qquad \text{accuracy} = \frac{TP + TN}{P + N},$$
where $TN = N - FP$ is the number of true negatives, i.e. redundant terms correctly rejected. Tables 3 and 4 present the results in percentage. Additionally, we measure the error by comparing the coefficient of each term identified by SINDy-PI with the true coefficient of the respective model. We calculate the error for each FP term ($FP_e$) as the absolute difference between the estimated coefficient returned by SINDy-PI and the true value, which is zero since those terms are absent from the model. The error for each TP term ($TP_e$) is the absolute difference between the estimated and true coefficients, divided by the true coefficient. Tables 3 and 4 show the maximum values of both types of errors across all six evaluated libraries.
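The three performance measures can be computed with a small helper such as the one below. The function name is hypothetical, and the counts in the usage note are illustrative values chosen so that the percentages match those quoted for the IDM with the largest library; they are not read from table 3.

```python
def classification_metrics(tp, fp, n_positive, n_negative):
    """Return (sensitivity, specificity, accuracy) as percentages."""
    tn = n_negative - fp                      # redundant terms correctly rejected
    sensitivity = 100.0 * tp / n_positive
    specificity = 100.0 * tn / n_negative
    accuracy = 100.0 * (tp + tn) / (n_positive + n_negative)
    return sensitivity, specificity, accuracy


def tp_error(estimated, true):
    """Relative coefficient error (in percent) for a true-positive term."""
    return 100.0 * abs(estimated - true) / abs(true)


def fp_error(estimated):
    """Absolute coefficient error for a false-positive term (true value is 0)."""
    return abs(estimated)
```

For example, `classification_metrics(25, 42, 25, 117)` gives a sensitivity of 100%, a specificity of about 64.1% and an accuracy of about 70.42%. The high sensitivity alongside a lower specificity shows that all required terms are retained while some redundant terms are also kept.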
From table 3, we observe that SINDy-PI accurately identified all TP terms for IDM data across all libraries, achieving a sensitivity of 100%. Notably, the library $\Theta_5$ led to the highest number of FP, resulting in a specificity of 64.1% and an overall accuracy of 70.42%. No FP terms are identified when evaluated with $\Theta_3$, resulting in specificity and accuracy of 100%. The estimated coefficients for the TP terms fall within a range of 8% of the actual coefficients. The coefficients for the FP terms are on the order of $1 \times 10^{-10}$ or lower, markedly smaller than any coefficients found in the actual set. Consequently, when the model is not known in advance, these can be confidently disregarded. When applied to the OVM, SINDy-PI achieves a perfect accuracy of 100% across all evaluated libraries, as shown in table 4. Furthermore, the error in the TP coefficients is below 0.01%. These results indicate that SINDy-PI accurately identified the TP terms and their corresponding coefficients for the respective model. The reduced specificity observed in the IDM results is attributed to the ballistic method used in generating data. If the front vehicle is at rest, the ballistic update method used to solve the IDM (section 2.1) introduces discontinuities. SINDy-PI accommodates these discontinuities by incorporating additional terms. Moreover, we perform additional tests using varying sampling rates with CTE and SINDy-PI. The supplementary analysis produced comparable results, affirming the robustness of our findings.
Table 3. Sensitivity, specificity, accuracy, and maximum error for the coefficients identified by SINDy-PI for the IDM evaluated at each size of library $\Theta_N$. There are 25 terms required to represent the IDM that SINDy-PI identified in each evaluation for a sensitivity of 100%. Additional terms identified by SINDy-PI are considered false positives.

SINDy-PI analysis of IDM and OVM with noise
Next, we evaluate the robustness of SINDy-PI's performance in the presence of measurement noise, which is common with real-world data, including traffic systems [64]. We add Gaussian noise $\mathcal{N}(0, \sigma_i^2)$ of ten increasing magnitudes, where $\sigma_i$ ranges from $1 \times 10^{-10}$ to $1 \times 10^{-1}$. Additionally, at each $\sigma_i$ and library size $\Theta_N$, we generate ten independent datasets for the evaluation of SINDy-PI to account for the stochastic nature of the noise. A model returned by SINDy-PI is considered successful if it meets three criteria: sensitivity = 100%, $TP_e \leqslant 10\%$, and $FP_e \leqslant 1$. A summary of the results from the ten evaluations for each $\sigma_i$ and library size $\Theta_N$ is presented in figure 3.
Looking at the results with the IDM data in figure 3(a), we see that the accuracy of SINDy-PI falls sharply with added noise of $\sigma \gtrsim 1 \times 10^{-4}$ for the smaller libraries $\Theta_0$ to $\Theta_2$, whereas for the largest library $\Theta_5$ the drop already occurs at $\sigma \gtrsim 1 \times 10^{-6}$. A similar trend is observed for the OVM data shown in figure 3(b), where SINDy-PI performance deteriorates with both increased noise levels and increased library size. As the noise level increases, performance declines because the added noise magnitude becomes comparable to the smallest coefficient present in each model, as outlined in table 2. When the library size is large, this effect is more pronounced at smaller magnitudes of $\sigma$. These results are consistent with previous observations on SINDy's performance with noisy data [46,65].
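The noise protocol and the success criteria can be written down compactly. The sketch below is an illustration under assumptions: the function names are hypothetical, one noise level per decade is assumed for the ten magnitudes, and the model fitting itself is left out, so `success_rate` only aggregates results that some other routine would produce.

```python
import random

# Ten noise levels from 1e-10 to 1e-1; one per decade is an assumption.
SIGMAS = [10.0 ** -k for k in range(10, 0, -1)]


def add_noise(series, sigma, rng):
    """Return a copy of series with zero-mean Gaussian noise of std sigma."""
    return [value + rng.gauss(0.0, sigma) for value in series]


def is_successful(sensitivity, tp_error, fp_error):
    """Success test from the text: all true terms kept and errors bounded.

    sensitivity and tp_error are percentages; fp_error is an absolute value.
    """
    return sensitivity == 100.0 and tp_error <= 10.0 and fp_error <= 1.0


def success_rate(results):
    """Fraction of (sensitivity, tp_error, fp_error) triples that pass."""
    return sum(is_successful(*r) for r in results) / len(results)
```

In use, each noise level and library size would contribute ten such triples, one per independent noisy dataset, and `success_rate` would give the fraction of successful identifications for that setting.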

Conclusions and future work
The significance of the current study lies in the introduction of a novel framework for identifying sparse models of complex systems from data, eliminating the need for assumptions. This retention of sparsity is crucial as it preserves the underlying physics, rendering the discovered equations interpretable. This stands in contrast to other existing data-driven approaches such as neural networks, which often lack sparse functional relationships between variables. We demonstrate the effectiveness of our proposed data-driven framework using traffic modeling as an application, where traditional traffic models rely on assumptions. The framework operates through a two-step process. In the first step, IT metrics are used to gain insights into the range of interactions extracted from data, accurately discerning the directionality and extent of vehicle interaction. This initial step eliminates the need for assumptions commonly employed in sparse traffic models. The results obtained from the IT metrics assist in isolating the variables, narrowing them down to interactions between two adjacent vehicles for both the IDM and OVM models used in this study. Subsequently, these identified variables are used to establish a functional relationship, completing the traffic modeling process. Using synthetic data with a known ground truth, this study validates the framework. The validation of this framework holds significance in gaining insights into the anticipated behavior of these data-analytic tools and instilling confidence in their real-world application.
Beyond traffic modeling, the framework holds broader implications for modeling complex systems, as it can be adapted for various domains where variables influencing dynamics are unknown. The framework's sparsity-promoting system identification technique ensures the retention of the physics of the dynamics, facilitating the accurate discovery of relationships among variables while preventing overfitting.

The implicit form of the OVM equation is obtained as D_OVM ẋ_2 = N_OVM, aligning with the coefficients presented in table 2. These coefficients provide the basis for comparing the results from SINDy-PI.

B.1. Comparison of the OVM using hyperbolic tangent and Padé approximations
The simulation data used to assess SINDy's performance is generated using the hyperbolic tangent representation, consistent with the true OV model as visualized in figure B1. We utilize the Padé approximation of the OVM as a basis for comparison with the SINDy-PI results, which utilize a polynomial basis library.
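To illustrate why a Padé approximation pairs naturally with a polynomial library, the sketch below compares tanh with a rational approximant whose numerator and denominator are both polynomials. The order is an assumption made for illustration: the [3/2] approximant is used here, whereas the order used for the OVM in this study is not stated in this section, and the function name is hypothetical.

```python
import numpy as np


def tanh_pade_3_2(x):
    """[3/2] Pade approximant of tanh: x (15 + x^2) / (15 + 6 x^2).

    Assumed order, for illustration only. Its Taylor series matches that of
    tanh through the x^5 term.
    """
    x = np.asarray(x, dtype=float)
    return x * (15.0 + x**2) / (15.0 + 6.0 * x**2)


# Compare the approximant with tanh on a modest interval around zero.
x = np.linspace(-1.0, 1.0, 201)
max_err = np.max(np.abs(np.tanh(x) - tanh_pade_3_2(x)))
print(f"max |tanh - Pade| on [-1, 1]: {max_err:.1e}")
```

Because the approximant is a ratio of polynomials, multiplying through by the denominator yields an implicit relation of the form D ẋ = N with polynomial D and N, which is the kind of structure a polynomial basis library can represent. The approximation degrades away from the expansion point, so the comparison is only meaningful over the range of arguments visited in the simulation.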

Figure 1. (a) Visualization of the simulation setup for jammed (Nv = 30) and free (Nv = 15) traffic, and the time-series variables used in the IT analysis to detect the influence of adjacent vehicles on a subject vehicle i. We use two distinct car-following models, IDM and OVM, to generate synthetic data. (b) Upon testing the influence of the immediate front (using time series data of D i+1), next preceding (D i+2), and rear (D i−1) vehicles on the subject (D ∆t), IT analysis accurately identifies that the sole existing influence is from the (i + 1)th vehicle. (c) IT analysis thus reveals that the candidate variables for the model of the ith vehicle are solely associated with the (i + 1)th vehicle. Subsequently, SINDy-PI uncovers the functional relationships among these variables, utilizing both position and speed. We validate the proposed framework for model identification in both the presence and absence of noise.

Figure 2. Results and figures related to computing TE (a), (b) and CTE (d)-(l).Sub-figures (a), (d), (g), (j) in the left column correspond to jammed flow (Nv = 30), and sub-figures (b), (e), (h), (k) in the center column correspond to free flow (Nv = 15).In the first two columns, IT units on the vertical axis are in nats and the black symbol x denotes that the empirical measurements are not statistically different from zero.In the third column: sub-figure (c) shows the regions of interaction experienced during the jammed and free flow simulation of OVM, sub-figures (f) and (i) are schematics showing the conditions being tested for their respective rows, and (l) shows the regions of interaction experienced during the jammed and free flow simulation of IDM.

Figure 3. Success rate of SINDy-PI for identifying (a) IDM and (b) OVM models for different library sizes ΘN and measurement noise levels. The color scale represents the number of trials, out of ten, in which SINDy-PI met the three criteria.

Figure B1. Comparison of OVM data generated using the original hyperbolic tangent function and its corresponding Padé approximation for vehicles 1 and 15 from the simulation of jammed traffic.

Table 1. Summary of tools, variables, and samples used, along with their corresponding interpretation.
indicate statistically non-significant results, implying a lack of evidence for coupling. At first glance, it may seem that TE performed better for free flow data by accurately identifying coupling from the front and rejecting coupling from the rear. To investigate this further, we plot in figure 2(c) the OV function (equation (

Table 2. True model coefficients obtained from the polynomial expansion of IDM and OVM, alongside sample SINDy results for comparison. The SINDy coefficients presented correspond to library Θ3 for IDM and library Θ6 for OVM. Rows displaying 'n/a' indicate the absence of the corresponding term in the library of the respective model.

Table 4. Sensitivity, specificity, accuracy, and maximum coefficient error for the terms identified by SINDy-PI for the OVM, evaluated at each library size ΘN. The OVM requires 18 terms, all of which SINDy-PI identified in every evaluation, giving a sensitivity of 100%. Additional terms identified by SINDy-PI are considered false positives.