The High-Frequency and Rare Events Barriers to Neural Closures of Atmospheric Dynamics

Recent years have seen a surge in interest for leveraging neural networks to parameterize small-scale or fast processes in climate and turbulence models. In this short paper, we point out two fundamental issues in this endeavor. The first concerns the difficulties neural networks may experience in capturing rare events due to limitations in how data is sampled. The second arises from the inherent multiscale nature of these systems. They combine high-frequency components (like inertia-gravity waves) with slower, evolving processes (geostrophic motion). This multiscale nature creates a significant hurdle for neural network closures. To illustrate these challenges, we focus on the atmospheric 1980 Lorenz model, a simplified version of the Primitive Equations that drive climate models. This model serves as a compelling example because it captures the essence of these difficulties.


I. INTRODUCTION
Atmospheric and oceanic flows constrained by Earth's rotation satisfy an approximately geostrophic momentum balance on larger scales, associated with slow evolution on time scales of days, but they also exhibit fast inertia-gravity wave oscillations.The problems of identifying the slow component (e.g., for weather forecast initialization [1][2][3][4]) and of characterizing slow-fast interactions are central to geophysical fluid dynamics, and the former was first coined as a slow manifold problem by Leith [5].The L63 model [6] famous for its chaotic strange attractor is a paradigm for the geostrophic component, while the L80 model [7] is its paradigmatic successor both for the generalization of slow balance and for slow-fast coupling.
The explosion of machine learning (ML) methods provides an unprecedented opportunity to analyze data and accelerate scientific progress.A variety of ML methods have emerged for solving dynamical systems [8][9][10], predicting [11] or discovering [12] them from data.For larger scale problems, much effort has been devoted lately to the learning of neural subgrid-scale parameterizations in coarse-resolution climate models [13] but yet the lack of interpretability and reliability prevents a widespread adoption so far [14,15].
In parallel, the learning of stable neural parameterizations of small scales or neglected variables has progressed remarkably for the closure of fluid models in turbulent regimes such as the forced Navier-Stokes equations or quasi-geostrophic flow models on a β-plane; see [16][17][18][19][20][21][22].
While neural networks show promise for climate modeling, the full Primitive Equations (PE) remain a challenge.This * mchekroun@atmos.ucla.edustudy identifies potential hurdles in achieving efficient neural closures for PE.We leverage the L80 model, a simplified version of the PE, as a illustrative example to highlight these fundamental issues.
In that respect, the L80 model exhibits a fascinating dynamical transition.For small Rossby numbers, its solutions evolve slowly over time and remain entirely slow, dominated by large-scale Rossby waves [23].However, as the Rossby number increases, faster oscillations become superimposed on these slow background motions [24,25].This spontaneous emergence of high-frequency components, linked to inertiagravity waves (IGWs) riding on the slower geostrophic flow, significantly complicates the closure problem in atmospheric models [25,26].
Multiscale dynamics, characterized by the intricate interplay of slow and fast processes without clear separation, are not unique to the L80 model.Similar regimes have been observed in fully resolved Primitive equation (PE) models, where fronts and jets generate complex multiscale interactions [27,28] as well as in cloud-resolving models, where large-scale convectively coupled gravity waves emerge spontaneously [29].Tropical convection regions, where organized activity produces gravity waves with a broad spectrum, ranging from 10 km to over 1000 km wavelengths [30] provide another instance of such multiscale dynamics.Finally, inertia-gravity waves have also been observed in continental shallow convection, where they contribute to organized mesoscale patterns over vegetated areas [31].
Inertia-gravity waves can hold surprising amounts of energy even at large scales.For example, Rocha et al. [32] found that IGWs contribute nearly half of the near-surface kinetic energy in specific ocean regions at scales ranging from 10 to 40 km.This overlap between wave and turbulence scales in geophysical kinetic energy spectra creates a challenge: perturbation methods like Wentzel-Kramers-Brillouin (WKB) [33] become inapplicable across all scales [34].
Such regimes where slow and fast dynamics overlap were shown to constitute critical challenges for closure methods in the L80 model.Solutions in these regimes blend slow background motion with sudden bursts of IGWs carrying a significant portion of the total energy.These "high-low frequency (HLF)" solutions disrupt the expected slaving relationships satisfied at lower Rossby numbers, leading to a major breakdown in closure techniques relying on a separation between the slow and fast variables [25].
A recent study by [26] proposes a promising solution to closure problems in such HLF regimes without timescale separation and where slow Rossby variables are influenced by high-frequency waves.This approach hinges on the Balance Equation (BE) [23,35] as rooted in the works of Monin [36], Charney and Bolin [1,37], and Lorenz [38], which allows for a nonlinear separation of variables.As demonstrated in [26], the BE isolates, for large Rossby numbers, the fast, nongeostrophic component of the flow as residual dynamics off the BE manifold.Building on the BE separation, it was shown in [26] that this fast motion can be effectively parameterized using networks of nonlinear stochastic oscillators (NSOs).These NSOs are designed to match the characteristic patterns of variability observed in the fast motion, leveraging the concept of resonances discussed in [39][40][41].The resulting stochastic closure shows then high-accuracy skills in reproducing the multiscale dynamics.
This work emphasizes the limitations of (standard) neural networks (alone) for achieving such accurate closures for HLF regimes, highlighting their struggle to simultaneously capture the slow, balanced motion while restoring the high-frequency oscillations.Section II discusses the limitations of neural networks for parameterizing the L80 model's slow motion, emphasizing in particular their sensitivity to rare event statistics (Section III).Section IV highlights the fundamental challenges faced by neural networks in capturing both the slow and highfrequency content of the L80 solutions, ultimately hindering accurate closure.

II. LEARNING SLOW NEURAL CLOSURE: SENSITIVITY
The L80 model, obtained by Lorenz in [7] as a ninedimensional truncation of the PE onto three Fourier modes with low wavenumbers, can be written as: whose model parameters are described in [7,25].
In this model, the square root of the constant forcing F 1 can be interpreted as the Rossby number; see [23] and [25,Eq. (2.4)].Transitions to chaos occur as the Rossby number Ro is increased [23,25].As mentioned above, at small Rossby numbers, the solutions to the L80 model are dominated by Rossby waves and thus remain entirely slow for all time.As identified in [25], when the Rossby number is further increased beyond a critical Rossby number Ro * , fast IGW oscillations emerge spontaneously and are superimposed on the slow component of the solutions.For such regimes, the aforementioned BE manifold on which balanced solutions lie [23,25,35] is no longer able to parameterize fully the L80 dynamics since a substantial portion of it, associated with the IGWs, evolves transversally to the BE manifold [26,Fig. 3].These regimes with energetic bursts of IGWs lie beyond the parameter range explored by Lorenz in his original 1980 article [7] and beyond other regimes with exponential smallness of IGW amplitudes as studied in subsequent Lorenz 86 models [42][43][44][45] and the full primitive equations [46] at smaller Rossby numbers [47].
The HLF solutions considered in this study are obtained for such a critical parameter regime where Ro > Ro * .They correspond to those of [26,Fig. 7]; see Appendix A for details.We first analyze the ability of neural parameterizations to learn the slow motion of the L80 dynamics in the HLF regime.To do so, we preprocess the target variables x and z to be parameterized by applying a low-pass filter in order to extract the slow motion.In that respect, a simple moving average is adopted with a window size equal to T GW , the dominant period of the gravity waves.The results are shown in Figure 1A for the z 3 -variable for which we observe that the lowpass filtered solution almost coincides this way with the BE parameterization z BE (t) = G(y(t)) with y(t) denoting the y-component of the HLF solution to the L80 model.
The L80 model has an inherent structure that can be exploited for closure.Studies have shown that the BE manifold, constructed in two steps (parameterizing z as a function of y and then x as a function of y and the parameterized z), achieves excellent closure across various parameter regimes [23] (see Appendix B and [25] for details).To leverage this existing knowledge and facilitate comparison with the BE manifold, we design our neural network parameterizations with a similar structure.Specifically, we first learn a feedforward neural network (multilayer perceptron, MLP) denoted as Z θ , which takes the (unfiltered) variable y as input and predicts the filtered z-variable (Eq.( 2)).Then, we train a second MLP, X θ , that takes both y and the output of Z θ , (y, Z θ (y)), as input in order to predict the filtered x-variable.
The structure of our MLPs is standard.Each neural parameterization, e.g.z in terms of y, is sought by means of an MLP with L hidden layers of p neurons each.It boils down to find in which N in (resp.N out ) constitutes the input (resp.output) layer, while N k is a mapping from R p (the space of neurons) FIG. 1: Panel A: Illustration, for the z 3 -variable, of the BE manifold's ability in capturing the L80 model's slow motion.See [23] and Appendix B for a derivation.Panels (B) and (C): Neural parameterizations X 3 θ for the x 3 -variable, as learnt through random selection (NN 1 )/predefined selection (NN 2 ).Visualized here as mappings from (y 1 , y 2 ) onto the unit sphere in R 3 .Panels (D): Same visualization adopted for the BE manifold.Panels (E): High-frequency residual E N N 1 (t) for x 3 (black) given by ( 5) and its difference with where Ψ k is a p-dimensional elementwise function, i.e. a function that applies a (scalar) activation function to each of its inputs individually, and the W k and b k denote respectively the weight matrices and bias vectors to be learnt.In (2), the subscript θ denotes the collection of these parameters.In this work, the nonlinear activation function is a simple tanh function, and the input and output layers consist just of linear normalization and reversal operations.It turns out that NNs with one hidden layer and 5 neurons are sufficient to obtain loss functions with a small residual; see Table I.
Based on our approach paralleling the BE manifold construction, we learn our neural parameterizations for the L80 model, through the following consecutive minimizations.First, given a discrete set of time instants t j , one minimizes in which z is filtered (in time) while y is not, followed by the minimization of (4) with x filtered and where Z θ * 1 denotes the optimal parameterization obtained after minimization of (3).
We emphasize the importance of including the unfiltered y-component of the HLF solution in the training data, even though it contains rapid oscillations.This unfiltered data is indeed crucial for the network to learn a proper representation of the slow motion.If we replace the unfiltered y-component with a filtered version (like the blue curve for y 3 in Figure 2A), the resulting closure fails.It produces an unrealistic quasiperiodic behavior that does not resemble even the L80 model's quasi-periodic behaviors documented in [23] for nearby parameter settings (see red curves in Figure 2).6) is driven by Z θ and X θ that are trained using a low-pass filtered version of y(t) (blue curve in Panel (A)) unlike the closure defined in Eq. ( 6) where the slow neural closure is trained using the unfiltered y-variable.
To assess whether a neural parameterization is successful in capturing the slow motion, we evaluate also the following  4) for x, are minimized using two neural networks, NN 1 and NN 2 providing each a parameterization (Z θ , X θ ), differing only in the way the training, validation, and testing sets are selected.In each case, the aspect ratios between these sets are the same.Epochs 10 50 100 300 500 1000 NN1 loss for z (random) (×10 −3 ) 11.17 9. 26 in which the x j (t) and y(t) are both unfiltered.For an NN with small residual, E j N N (t) is typically void of slow oscillations (see Figure 1E) with mean ⟨E j N N ⟩ ≈ 0 for each 1 ≤ j ≤ 3. Figure 1 illustrates this feature with two neural networks, NN 1 and NN 2 , trained using different strategies for selecting training, validation, and testing data.Even though both networks achieve good parameterization results offline (similar to the BE manifold), their underlying structures differ visually from the BE manifold.
To explore these differences, we focus on specific components (X j θ * 2 for x j and Z j θ * 1 for z j ) of the neural parameterizations.We plot these components as level sets on a threedimensional sphere to reveal their geometric properties.This visualization is particularly useful since Z j ) on the three-dimensional sphere, y 2 1 + y 2 2 + y 2 3 = r 2 , can be visualized as a 2D surface that maps (y 1 , y 2 ) to z j (resp.x j ).Figures 1B, 1C, and 1D show these level sets for radius r = 1.
Interestingly, these visualizations reveal significant differences in the minimizers (and consequently, the parameterization formulas) of NN 1 and NN 2 , even though their loss function values differ only by 1% (Table I) and their high-frequency residuals are similar (red curve in Figure 1E).
These geometric offline differences hide more profound consequences when the neural parameterizations are used online, for closure.As explained below, the sensitivity of online pre-dictions that are tied to sampling issues is indeed observed.In that respect, recall that a common practice to train NNs is to divide the dataset into three subsets.The first subset is the training set, which is used for computing the loss function's gradient and updating the network weights and biases.
The second subset is the validation set.It corresponds to the second dataset over which the prediction skills of the fitted model are assessed.The error on the validation set is monitored during the training process to provide an unbiased evaluation while tuning the model's hyperparameters.When the network begins to overfit the data, the error on the validation set typically begins to rise after an initial decrease.The network parameters are saved at the minimum.It gives then the "final model" that is tested over the test set that is typically a holdout dataset not used as a validation nor a training set.
The parameterization NN 1 shown in Figure 1A is learnt through a random selection while NN 2 is learnt through a predefined selection.In each case, ratios for training, testing, and validation are 0.7, 0.15, and 0.15, respectively.The total length of the training is 700 days.Given the same input and target data, the minimal values of the loss functions (3)-(4) for NN 1 and NN 2 are reported in Table I, across epochs.Already after 500 epochs, one observes that the loss function evaluations differ only by 1% between the random or predefined selection protocol of the training, validation, and testing sets.
We now discuss the sensitivity issue of online predictions driven by such neural parameterizations that are close in terms of their loss function scoring.This point is illustrated in Figure 3. There, we show online prediction corresponding to a given slow NN-parameterization (X θ * 2 , Z θ * 1 ) learnt by minimization of the loss functions (Eqns.( 3) and ( 4)), namely the solution to the slow neural closure This closed equation in the y-variable is obtained by replacing the x ℓ -variables in the y-equation of the L80 model (Eq.( 1)) by their neural parameterizations, either NN 1 or NN 2 .
The attractor corresponding to the slow NN 1 -closure (with random selection) differs clearly from that of slow NN 2closure (with predefined selection) in spite of convergence and closeness of the loss functions at their respective minimal value; see Figure 3B.Both predict periodic orbits with different attributes, one self-intersecting in the (y 2 , y 3 )-plane (NN 1 ), the other without intersection point (NN 2 ).
A closer inspection at these topological differences reveals in the time domain that the slow NN 1 -closure is able to capture more accurately the low-frequency content of certain temporal patterns exhibited by the HLF solutions of the L80 model compared to the slow NN 2 -closure; blue vs red curves in Figure 3A.We argue below that such a sensitivity between online solutions takes its root in the rare events tied to the irregular transitions exhibited by the HLF solutions to the L80 model that spoils the offline learning.I), while the dynamical differences of the online predictions are substantial.
In contrast, at lower Rossby numbers, for regimes devoid of fast oscillations such as shown in Figure 4D below corresponding to F 1 = 6.97 × 10 −2 in the L80 model, neural closures of high-accuracy are easily accessible with skills comparable to those obtained with the BE manifold; see Figure 5.As explained below, the reasons for this success lie in the absence of high-frequencies in the solutions to parameterize and in the absence of rare events in the statistics of lobe transitions.

III. IRREGULAR TRANSITIONS, RARE EVENTS AND LEARNING CONSEQUENCES
The significant sensitivity observed in capturing the lowfrequency content with nearby neural parameterizations (as measured by their loss functions) requires further investigation.Since these variations in Figure 3 solely stem from how training, validation, and test sets are chosen, we conduct in this Section a statistical analysis of key features of the L80 dynamics in HLF regimes.Our focus is on the irregular lobe transitions exhibited by HLF solutions.For comparison, we also analyze lobe transitions in the slow chaotic regime of Figure 4D, where neural parameterizations perform well and learn the closure effectively.Notably, Figure 5 demonstrates that for the slow chaotic regime, high-accuracy neural closures are readily achievable, with skills comparable to those obtained using the BE manifold.
To gain a deeper understanding of lobe transition statistics in the slow chaotic and HLF regimes, we performed highresolution simulations of the L80 model for each regime.Each simulation spanned a 500-year period, integrating the L80 dynamics with a timestep of 0.75 minutes.This corresponds roughly to an interval of size 730, 000 × T GW , where T GW is the dominant period of the gravity waves in the model.
In each regime, the L80 attractor exhibits two lobes.This is shown in the (y 2 , y 3 )-projection for the HLF regime (Figure 4A) and in the (y 1 , y 3 )-projection for the slow chaos regime (Figure 4D).The latter evokes the Lorenz 63 "butterfly attractor" [6], consistent with the L80 dynamics devoid of fast motion for this Rossby number (geostrophic motion).The former attractor, more fuzzy, exemplifies the presence of fast dynamics riding the slow, geostrophic motion.
In each case, these lobes are essentially separated by the vertical line y 3 = 0. Numerical integration of the L80 model reveals that the visit of the right lobe comes with y 3 (t) getting greater than some threshold value y b , while the visit of the left lobe comes with y 3 (t) getting smaller than y a = −y b .A close inspection of the solution in the HLF case reveals that the choice of y b = 0.2 constitutes a good one to identify the sojourn of the dynamics within one lobe from the other.This choice leads furthermore to an interval (−y b , y b ) that provides a good bound of the bursts of fast oscillations crossing the vertical line y 3 = 0 in the (y 2 , y 3 )-plane ("gray" zone).
To count the transitions from one lobe to the other one thus proceeds as follows.Given our 500-yr long simulation of y 3 (t) we first find the local maxima and minima that are above y b and below y a , respectively.No transition occurs between consecutive such local maxima or minima.A transition occurs only when a local maximum above y b is immediately followed by a local minimum below y a or vice versa.If a local maximum is immediately followed by a local minimum, the intermediate time instant at which the trajectory goes below zero is identified as the transition instant, and the other way around if a local minimum is immediately followed by a local maximum.These transition times characterized this way allow us to count the sojourn times in a lobe and display the distribution of these sojourn times shown in Figures 4C and 4F.
These lobe sojourn time distributions reveal a striking difference between the HLF and slow chaotic regimes.In the HLF case, we observe indeed that the solution can stay in one lobe for a period of time that can be arbitrarily long (see solution's segment between t = 763 and t = 893 shown in blue in Figure 4B) albeit of probability of occurrence vanishing exponentially as shown in Figure 4C.As a comparison, the transitions between the attractor's lobes occur at a much more regular pace in the slow chaotic regime (see Figure 4E) in which the solutions to the L80 model are void of fast oscillations.In this case, the distribution of sojourn times drops quickly below a 60-day duration barrier (Figure 4F).These rare events, following an exponential distribution, pose a significant challenge for developing reliable slow neural closures.They introduce diversity in the temporal patterns of the time series, which contributes to the sensitivity issues observed in Figure 3.A random training set might be skewed towards one lobe duration more than a predefined set, leading to confusion in the learning process for the neural network.

IV. THE HIGH-FREQUENCY BARRIER TO NEURAL CLOSURE
Section III demonstrated that using neural networks to parameterize the slow dynamics of HLF solutions can lead to sensitivity issues in online prediction (Figure 3).This sensitivity arises from rare events associated with irregular lobe sojourn durations, as shown in Figures 4B and 4C.In this Section, we explore another challenge: the direct parameterization of the unfiltered x-components of HLF solutions.These components contain a complex mixture of both slow and fast motions, posing significant difficulties for closure with neural networks.
To illustrate this point, we learn an MLP for x(t), denoted by V θ , with (the unfiltered) y(t)-variable of the L80 model (Eq.( 1)), as input, and the unfiltered x-component, x(t), as output.Note that unlike the slow NN-parameterizations above, the parameterization V θ aims at parameterizing x(t) directly as a nonlinear mapping of y(t) without conditioning on z(t) nor filtering of any sort.The corresponding closure, called a vanilla NN-closure, consists then of Eq. ( 6) in which ) is replaced by V θ * (y), obtained after mini-mization of the following L 2 -loss function for which the target variable x(t) is unfiltered, i.e. containing a mixture of fast and slow oscillations.To address this more challenging problem we use MLPs with a larger capacity either with more neurons and/or layers.Interestingly, our experiments show that a neural network with just one hidden layer and 20 neurons achieves the best closure results.Figure 6 compares simulated time series from four different vanilla NN-closure settings.The setting with one hidden layer and 20 neurons partially captures the complexity of the HLF solution's temporal patterns (Figure 7A-B).However, it entirely misses the high-frequency content associated with IGWs, as evident from the power spectral density (PSD) comparison in Figure 8.
While increasing the complexity of a neural network (more hidden layers or neurons) can reduce the loss function during training, it does not guarantee better performance in the actual closure.For example, a vanilla neural network (V θ ) with 5 hidden layers and 20 neurons per layer predicts an unrealistic, small-amplitude periodic orbit when used online in the neural closure through time-stepping (Figure 9B).Additionally, it exaggerates high-frequency content in the solutions it generates offline (see Figure 9C and Table II).
Our results highlight the limitations of using a vanilla neural network closure to directly capture the fast dynamics of the L80 system using the "slow" variable y.This approach relies on potentially complex, non-linear functions encoded by MLPs,  7): one hidden layer with 20 neurons (thick solid line).Setting II: two hidden layers with 5 neurons in each layer (dashed line).Setting III: two hidden layers with 10 neurons in each layer (light solid line).Setting IV: two hidden layers with 20 neurons in each layer (dash-dotted line).The corresponding loss function values are given in Table II.but struggles to represent the system's multiscale dynamics accurately.This issue is similar to the spectral bias problem observed in standard neural networks for function fitting [48], where they prioritize capturing low-frequency features.However, the challenge here is more complex.The goal is to learn the neglected "fast" variables and their high-frequency content offline, so the online solution through the NN-closure can reproduce both the mixture of slow and fast motions of the original system.This includes capturing global geometric features like the attractor's shape and symmetry.As shown in Figure 7C, vanilla NN-closures often distort these features compared to the true L80 attractor.
However, as highlighted in [22], memory effects might not be crucial for achieving effcient closure of solutions in the HLF regime.Studies have shown that using the BE manifold for capturing the geostrophic motion and a network of stochastic oscillators for IGWs can achieve high accuracy without recurrent architectures like LSTMs [26].This, along with the challenges of rare events discussed earlier, raises questions about whether LSTMs or other recurrent networks are necessary to reproduce the intricate multiscale dynamics of y using a closed model (like in [26]) built with these components (LSTMs).

V. DISCUSSION
Our findings, particularly the interplay between rare events and the multiscale nature in HLF regimes, highlight the challenges that machine learning can face for accurate closure of geophysical flows in which geostrophic and ageostrophic motions interact strongly.As extreme weather events and non-Gaussian statistics become more prevalent with climate change [58][59][60][61][62], this study underscores that significant hurdles remain despite the recent advancements in neural parameterizations.Reliable parameterizations that robustly capture rare events are crucial.In this regard, incorporating rare event algorithms [63][64][65][66][67] could be beneficial.By simulating rare events offline, these algorithms could improve the sampling of distribution tails, leading to better trained neural networks.
This study contributes new insights into the challenges of closing the Lorenz 80 model using data-driven methods, particularly in high Rossby number regimes (Ro > Ro * ).Compared to other Lorenz models, like the less challenging Lorenz 96 model [68], the L80 system has received less attention for closure tasks.However, the recent stochastic closure approach by [26] for these demanding regimes provides a valuable benchmark for future research.We hope this work encourages further exploration of the L80 model as a meaningful testbed for developing and comparing closure ideas.B) for the L80 model (gray curve) and the best performing vanilla neural network closure (blue curve) from Setting I in Figure 6.While the vanilla closure captures the overall spectral background of the L80 solutions well, it misses the important peaks at frequencies f GW and f Ro (and their harmonics).These frequencies correspond to inertia-gravity waves and Rossby waves, respectively.

Appendix A: HLF solutions and the slow motion learning
The high-low frequency (HLF) solutions used in this article are those reported in [26,Fig. 7].These solutions are obtained from the parameters used in Lorenz's original paper [7] except F 1 chosen to be F 1 = 3.027 × 10 −1 as identified in [25]; see the Materials and Methods section in [26] for details.
As shown in Figure 10, for this parameter regime, the HLF solutions contain a mixture of slow and fast oscillations in each variable x, y, and z of the L80 model that causes serious difficulties for closure [26].The dominant frequency of the Rossby wave content in the HLF solutions is f Ro = 0.31 day −1 (T Ro = 3.2 days) and that of the inertia-gravity wave (IGW) content is f GW = 3.76 day −1 (T GW = 6.3 hours).
To learn a neural parameterization of the slow motion, the weights and biases of the NNs are updated according to a Levenberg-Marquardt (LM) optimization [69].The LM algorithm is known to be efficient for small or medium-scaled problems [70,Chap. 12], especially when the loss function is just a mean squared error, which is the case here.This algorithm is sufficient to obtain loss functions with small residuals; see Table I.

FIG. 2 :
FIG.2: False quasiperiodicity produced by a slow neural closure.Here, the slow neural closure Eq. (6) is driven by Z θ and X θ that are trained using a low-pass filtered version of y(t) (blue curve in Panel (A)) unlike the closure defined in Eq. (6) where the slow neural closure is trained using the unfiltered y-variable.

1 (
on three variables.For a given radius, the level sets of Z j θ * resp.X j θ * 1

FIG. 3 :
FIG. 3: Sensitivity of the slow neural closures.Here, NN 1 and NN 2 differ only in their training modalities.NN 1 is learnt from random selection of the training, validation, and testing sets, and NN 2 from a predefined selection with the same aspect ratios; see Text.The corresponding loss functions differ by 1% (see TableI), while the dynamical differences of the online predictions are substantial.

FIG. 4 :
FIG. 4: Panel A: Attractor in the HLF case.Panel B: The sojourn episodes within one particular lobe are marked by different colours.Here, the parameters are those used in Lorenz's original paper [7] except F 1 = 0.3027 in Eq. (1).Panel C: Lobe sojourn time distributions.The exponential fit is calculated over 500 yr-long simulation of Eq. (1) and is shown by the black curve f (t) = ae bt with a = 2292 and b = −6.05× 10 −2 with t in day.The inset in panel C shows a magnification of the distribution for the rare and large sojourn times.Panels E and F: Same as panels B and C except that F 1 = 6.97 × 10 −2 , corresponding to the slow chaotic regime shown in panel D in which the solutions are void of fast oscillations.In this regime, no rare event statistics emerge.

1 FIG. 6 :
FIG.6: Simulated time series from vanilla NN-closures in four different settings.Setting I (same as used for the results shown in Fig.7): one hidden layer with 20 neurons (thick solid line).Setting II: two hidden layers with 5 neurons in each layer (dashed line).Setting III: two hidden layers with 10 neurons in each layer (light solid line).Setting IV: two hidden layers with 20 neurons in each layer (dash-dotted line).The corresponding loss function values are given in TableII.

FIG. 7 : 6 .fFIG. 8 :
FIG.7: Vanilla NN-closure vs L80 dynamics.Failure to capture the high-frequency content and symmetry of the L80 attractor.Here, is used the best performing vanilla neural network (NN1) from Setting I in Figure6.

FIG. 9 :
FIG. 9: Panel A: This panel shows the neural parameterization (V θ ) with 5 layers and 20 neurons per layer (denoted as NN5) for variable x 1 .We use the same visualization style as Figures 1B-D.Notice the sharp gradients in the manifold, reflecting NN5's attempt to capture the high-frequency details of the HLF solutions.Panel B: This panel displays the corresponding solution (Y 2 , Y 3 ) obtained using the NN5 closure.Panel C: Compared to the best performing vanilla neural network (NN1) from Setting I in Figure 6, NN5 exaggerates the high-frequency content in the offline parameterization.