Detection of limit cycle signatures of El Niño in models and observations using reservoir computing

While the physics of the El Niño–Southern Oscillation (ENSO) phenomenon in the Tropical Pacific is quite well understood, there is still debate on several fundamental aspects. The focus of this paper is on one of these issues: whether ENSO variability, which within the recharge-discharge oscillator theory arises from a stochastic Hopf bifurcation, is subcritical or supercritical. Using a Reservoir Computing method, we develop a criticality index as an indicator for the presence of a limit cycle in noisy time series. The utility of this index is shown in three members of a hierarchy of ENSO models: a conceptual box model, the classical Zebiak-Cane model and a state-of-the-art Global Climate Model. Finally, the criticality index is determined from observations, leading to the result that ENSO variability appears to be subcritical.


Introduction
The sea surface temperature of the equatorial Pacific undergoes relatively large changes on an interannual time scale, associated with the El Niño-Southern Oscillation (ENSO) phenomenon [30]. Through its atmospheric teleconnections, ENSO variability causes weather anomalies far outside of the Pacific [2] and hence has been a topic of much active research in the climate science community over the last decades [37]. Of particular relevance today are the (projected) changes in ENSO variability due to global warming [2].
The extensive work done since the 1980s to understand the physics of ENSO has clearly been very beneficial for model development and the predictive capabilities of these models. ENSO is at the moment understood as a large-scale irregular internal mode of variability in the Tropical climate system [26]. The spatial pattern of ENSO is mostly controlled by several ocean-atmosphere (Bjerknes) feedbacks and its time scale of variability mostly by ocean wave dynamics [3]. This has resulted in the so-called recharge-discharge oscillator view of ENSO [16,39], which forms the basic interpretation framework for model results, including their prediction skill [19].
In one of the early deterministic models that was able to simulate ENSO in a reasonable way, the Zebiak-Cane (ZC) model [44], it has been clearly shown that ENSO variability results from a supercritical Hopf bifurcation. The bifurcation parameter in this model is the dimensionless coupling strength µ. This parameter measures the response of the atmospheric winds to sea surface temperature anomalies. If µ is supercritical, ENSO variability occurs as a sustained oscillation or limit cycle. In case the value of µ is subcritical (i.e. below the Hopf bifurcation), the ZC model indicates that the Pacific climate is stable and no ENSO variability occurs. However, in the subcritical case, ENSO variations can easily be generated in the ZC model when noise is added in the wind-stress field [9,32]. From a dynamical systems point of view, these results are consistent with a stochastic Hopf bifurcation, which is also consistent with the recharge-discharge oscillator view of ENSO [16].
Apart from the fact that the ZC model is a strong idealization of the equatorial Pacific climate system, it is difficult to extract the precise value of µ from observations or from state-of-the-art global climate models (GCMs). Hence, although accepted by a large part of the ENSO research community [37], there is still debate on whether ENSO variability is subcritical or supercritical, i.e. whether a sustained limit cycle plays a role in the variability [8]. In an alternative framework, mostly motivated by the success of linear inverse models (LIMs) in predicting properties of ENSO variability [28], ENSO is a stochastically driven linear system [23,33]. In this view, the smaller-scale processes (not resolved in the deterministic ZC model) collectively drive the large-scale, mainly linear processes (such as wave dynamics). The nonlinear Bjerknes feedbacks contribute to the variability as they add to the non-normality of the linear system, such that non-normal growth of perturbations can occur [7]. This limiting situation can be considered as a strongly subcritical case.
While resolving this issue may not be high on the agenda of the ENSO research community, it is important for ENSO prediction, in particular for assessing the predictability horizon, which is currently estimated to be about 6 months [1]. It is also crucial for the confidence in projections of ENSO changes under global warming from GCMs. One way to resolve this issue would be to demonstrate that criticality occurs in GCMs (e.g. by varying parameters that affect the strength of the Bjerknes feedbacks), but such simulations have not yet been done. However, even if criticality were demonstrated in GCMs, it would still be difficult to connect that result to the limited observational record, and one could argue that the GCMs do not capture the relevant small-scale processes adequately. Another way is to derive measures of criticality from model data and observations which could distinguish sub- or supercritical behavior. In this paper, we follow the latter approach using modern machine learning techniques.
Feedforward neural networks (FNNs) have been used for quite some time already in ENSO prediction [35,36,42], with mixed success [1,4]. It has been shown that convolutional neural networks trained on GCM model data and reanalysis data have better forecasting skill than most dynamical models and, in particular, that their forecast skill remains high up to lead times of about 17 months [12]. Predictive uncertainty was addressed in [29] using Gaussian density neural networks and quantile regression neural networks, and a skilful long-lead-time prediction could be obtained using a relatively small, but physically motivated, predictor set [29]. Recently, reservoir computing (RC) methods have also been used for ENSO prediction [13], and skilful predictions could be obtained for lead times up to 21 months. Apparently, these machine learning techniques are able to circumvent the so-called 'spring predictability barrier' found in dynamical models, such as the ZC model [5], and in observations [24].
In this paper, we build on the success of RC techniques in climate research [25] to develop measures of criticality of the Tropical Pacific climate system. In section 2, we present the observational data and the models used, i.e. an idealized conceptual model (the Jin-Timmermann model), the ZC model, and a state-of-the-art GCM. In addition, the RC techniques applied are briefly presented and the novel aspects are highlighted. Section 3 describes the results of applying the RC approach to the model data and the observational data, attempting to assess whether limit cycle signatures can be found to explain the ENSO variability. A summary and discussion of the results follows in the final section 4.

Models, observations and methods
The models and observations used will be briefly discussed in subsection 2.1 and the machine learning approach is described in more detail in subsection 2.2.

Models and Observations
The Jin-Timmermann (JT) model [38] was developed to study variations in the amplitude of ENSO variability. For convenience, the original version of the model is shown in appendix A. Here, we use the non-dimensional version described in [31], whose governing equations are given by (1). Here x is the dimensionless zonal temperature difference, y is the dimensionless temperature deviation between the western Pacific and the radiation equilibrium temperature, and z is the dimensionless western thermocline depth anomaly. The physical meaning of the parameters ρ, k, c and a is explained in appendix A.
The parameter δ is a measure of the strength of the ocean-atmospheric coupled processes and is taken as the main parameter here.
To include noise in the model, the approach described in [21] has been followed, in which the noise components D_x and D_y, with amplitude σ, are added to (1), and η_t in (1) is a white-noise component with unit variance. For σ = 0, we obtain the deterministic version of the model. The equations (1) have been integrated using the Euler-Maruyama scheme, with a dimensional time step of one month (dimensionless time step 0.29).
The ZC model is a so-called intermediate-complexity ocean-atmosphere model and one of the first models able to reasonably simulate ENSO [44]. The original version of the model was deterministic and the seasonal cycle was prescribed from observations. It consists of a set of seven two-dimensional partial differential equations for the surface temperature T, the zonal and meridional velocities in the surface ocean (u, v), those of the lower atmosphere (U, V), the ocean thermocline depth h and the atmospheric geopotential ϕ. In a subsequent variant, the mean state was also modeled and noise was introduced [9]; this is the model we use here. For convenience, the equations of the model are shown in appendix B. The main dimensionless parameter is the coupling strength µ. To numerically solve the model's equations, the technique described in [40] has been applied.
The community earth system model (CESM) version 1 is a coupled climate model consisting of different model components, including ocean, sea ice, land surface and land ice models [17]. These different components exchange fluxes through a central coupler. The atmosphere is simulated by the community atmosphere model (CAM, [20]), while the ocean is simulated by the Parallel Ocean Program model (POP, [34]). We have used results from simulations with CESM version 1.1.2, which has 30 vertical layers in the atmosphere with a 0.9° horizontal resolution, and 60 vertical layers in the ocean with a horizontal resolution of 1°. We use monthly mean data from a 100-year CESM simulation experiment under constant year-2000 forcing [41].
The observations analyzed are derived from the NOAA ERSST v5 [14], the PEODAS D20 [43] and the ICOADS [10] datasets. From the NOAA ERSST v5 dataset, the NINO3.4 index has been extracted. From the PEODAS dataset, the monthly mean thermocline anomalies have been extracted. Finally, from the ICOADS dataset, the monthly mean zonal wind-stress anomalies have been extracted. We focus on the period between January 1960 and September 2020.

Methods
The idea behind our approach is that the ML technique will be able to extract properties of the underlying dynamical system and hence can indicate whether a certain time series (characterized by parameters in a model, or by the present-day/historical climate in observations) contains limit-cycle characteristics.The RC framework is ideally suited for this, as it generates a high-dimensional dynamical system from input data, whose relevant patterns and dynamics are learned by the model during the training procedure.
Although the procedure to generate an RC has been well described elsewhere [22,27], we briefly summarize the approach here and introduce our notation. Given an input signal u(n) ∈ R^{N_u}, n = 1, ..., N_t, where N_t is the total number of time steps, and a desired output signal y_target(n) ∈ R^{N_y}, the model learns to estimate an output signal y(n) ∈ R^{N_y} as close as possible to y_target(n). To do so, an error measure E(y, y_target) is minimized during the training procedure. Different loss functions can be used; we apply the mean squared error (MSE) loss, i.e.

$$\mathrm{MSE}(y, y_{\mathrm{target}}) = \frac{1}{N_t} \sum_{n=1}^{N_t} \left\| y(n) - y_{\mathrm{target}}(n) \right\|^2,$$

which is a common choice for regression problems. Before the training procedure, the input data u(n) are non-linearly expanded into a higher-dimensional, so-called reservoir space, generating a new signal x(n) ∈ R^{N_x}. This new representation of the data also contains temporal information and is based on the following update equations:

$$\tilde{x}(n) = \tanh\!\left( W_{\mathrm{in}} [1; u(n)] + W x(n-1) \right), \qquad x(n) = (1 - \alpha)\, x(n-1) + \alpha\, \tilde{x}(n),$$

where [1; u(n)] is short for stacking (here, adding a top element 1 to the vector u(n)), which is done to introduce a bias term, and α ∈ (0, 1] is the leaking rate. The two matrices W_in ∈ R^{N_x×(N_u+1)} and W ∈ R^{N_x×N_x} are generated randomly according to chosen hyperparameters. The non-zero elements of W and W_in are sampled from a symmetrical uniform distribution over the range [−a, a]. W is usually a sparse matrix, while W_in is dense, and the tuple (W_in, W, α) defines the reservoir. The output layer is given by

$$y(n) = W_{\mathrm{out}} [1; u(n); x(n)],$$

and, during the training procedure, only the weights of the output matrix W_out ∈ R^{N_y×(N_u+N_x+1)} are estimated, minimizing E(y, y_target) through a linear regression procedure. We use ridge regression to avoid overfitting, leading to the loss function L:

$$L = \frac{1}{N_t} \sum_{n=1}^{N_t} \left\| y(n) - y_{\mathrm{target}}(n) \right\|^2 + \epsilon \left\| W_{\mathrm{out}} \right\|^2,$$

where ϵ is a regularization parameter. The hyperparameters are the dimension of the reservoir (N_x), the sparsity of W's connections (as measured by an average degree of the network), the input scaling a, the leaking rate α, the regularization parameter ϵ and the spectral radius ρ_s of W. To determine the best set of hyperparameters in each problem, a Bayesian search has been applied based on criteria specified below. We will not list the values of the best hyperparameters for each of the results below; the reader is referred to the codes and data, which are publicly available (see the Data Availability statement at the end of the paper).
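As an illustration, the reservoir construction and ridge-trained readout described above can be sketched in a few lines of NumPy. The dimensions, scalings and the sine-wave task below are hypothetical stand-ins, not the settings used in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions and hyperparameters (stand-ins, not the paper's values).
N_u, N_x, N_t = 1, 200, 1000
alpha, a, eps, rho_s = 0.5, 0.5, 1e-6, 0.9  # leaking rate, input scaling, ridge parameter, spectral radius

# Random input and reservoir matrices (section 2.2): dense W_in, sparse W.
W_in = rng.uniform(-a, a, size=(N_x, N_u + 1))
W = rng.uniform(-a, a, size=(N_x, N_x)) * (rng.random((N_x, N_x)) < 0.05)
W *= rho_s / np.max(np.abs(np.linalg.eigvals(W)))  # rescale W to spectral radius rho_s

# Toy task: one-step-ahead prediction of a sine wave.
u = np.sin(0.1 * np.arange(N_t + 1))[:, None]

# Drive the reservoir with the leaky-tanh update equations.
X = np.zeros((N_x, N_t))
x = np.zeros(N_x)
for n in range(N_t):
    x_tilde = np.tanh(W_in @ np.concatenate(([1.0], u[n])) + W @ x)
    x = (1 - alpha) * x + alpha * x_tilde
    X[:, n] = x

# Ridge regression of W_out on the extended states [1; u(n); x(n)].
Z = np.vstack([np.ones(N_t), u[:N_t].T, X])  # shape (1 + N_u + N_x, N_t)
Y = u[1:N_t + 1].T                           # targets: u(n + 1)
W_out = Y @ Z.T @ np.linalg.inv(Z @ Z.T + eps * np.eye(Z.shape[0]))

pred = W_out @ Z
print(np.mean((pred - Y) ** 2))  # training MSE
```

Only `W_out` is fitted; `W_in` and `W` stay fixed after their random initialization, exactly as in the description above.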
After having determined optimal hyperparameters, the RC can be transformed into an autonomously evolving system that can be used for prediction [18,27]. Given an input sequence u(n), we train the RC to predict the target sequence y(n) = u(n + 1), i.e. to perform single-step forecasts. After training, we can introduce feedback connections between the output at time step n and the input at the subsequent time step n + 1. This way, we generate a model that autonomously evolves in time without being driven by external inputs. This results in the following map:

$$x(n+1) = (1 - \alpha)\, x(n) + \alpha \tanh\!\left( W_{\mathrm{in}} [1; y(n)] + W x(n) \right), \quad (7a)$$
$$u(n+1) = y(n). \quad (7b)$$

Here, x(n) and x(n + 1) are the reservoir states at time steps n and n + 1, y(n) is the output at time step n, and u(n + 1) is the input at the subsequent time step n + 1, now supplied by the feedback connection.
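The feedback loop that turns a trained RC into an autonomously evolving ("self-evolving") system can be sketched as follows. For simplicity this uses the bias-free output y(n) = W_out x(n) also adopted in section 3.1, and the matrices in the usage example are small random stand-ins rather than trained weights:

```python
import numpy as np

def run_autonomous(W_in, W, W_out, alpha, x0, u0, n_steps):
    """Closed-loop reservoir: the output y(n) is fed back as the next
    input u(n + 1), so the system evolves without external forcing.
    Uses the bias-free formulation y(n) = W_out @ x(n)."""
    x, u = x0.copy(), u0.copy()
    outputs = []
    for _ in range(n_steps):
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ u + W @ x)
        u = W_out @ x  # feedback connection: y(n) becomes u(n + 1)
        outputs.append(u.copy())
    return np.array(outputs)

# Stand-in matrices with small weights; such a contraction settles on a fixed point.
rng = np.random.default_rng(1)
N_x, N_u = 16, 1
W_in = rng.uniform(-0.1, 0.1, (N_x, N_u))
W = rng.uniform(-0.1, 0.1, (N_x, N_x))
W_out = rng.uniform(-0.1, 0.1, (N_u, N_x))
traj = run_autonomous(W_in, W, W_out, alpha=0.5,
                      x0=np.zeros(N_x), u0=np.ones(N_u), n_steps=50)
print(traj[-1])  # near the fixed point at the origin
```

Whether the closed-loop trajectory settles on a fixed point or a periodic orbit is exactly the distinction exploited in section 3.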

Results
The results are naturally described within five subsections. The first one introduces and illustrates our approach for a model of the stochastic Hopf bifurcation. Sections 3.2 and 3.3 deal with the analysis of the JT and ZC models, for which the location of the deterministic Hopf bifurcation is precisely known. Section 3.4 deals with the results from the CESM simulations and section 3.5 presents the results obtained from analyzing the observations.

Illustration of the approach
Consider a simple system of two stochastic differential equations (SDEs):

$$\frac{dx}{dt} = \left( \mu - (x^2 + y^2) \right) x - \omega y + \sigma \eta_t, \qquad \frac{dy}{dt} = \omega x + \left( \mu - (x^2 + y^2) \right) y + \sigma \eta_t, \quad (8)$$

where µ and ω are parameters and η_t is a white-noise component with unit variance. The quantity σ is the noise amplitude; for σ = 0, a supercritical Hopf bifurcation occurs at µ = 0. The values of µ we consider are one far before the Hopf bifurcation (µ = −0.3) and one (µ = 0.3) far after it. Using the Euler-Maruyama method, we perform simulations of 2000 time steps of size ∆t = 0.1 for σ = 0.08 and delete the first 1000 steps before training to guarantee that a statistical equilibrium has been reached. The best set of hyperparameters of the RC has been determined using a Bayesian search algorithm. For a particular set of hyperparameters, we evaluate the RC performance for both µ = −0.3 and µ = 0.3 as follows. For each µ, the RC is trained using a noisy solution of (8). Next, the RC performance for a lead time of six time steps is evaluated using a different solution and using the root MSE (RMSE) as metric. Finally, the mean of the results obtained for the two µ values is computed. This approach allows us to estimate a single best hyperparameter set for both µ values, increasing the generalization capabilities of the model and reducing the risk of overfitting. Table 1 shows that the prediction skills of the RC for both µ = −0.3 and µ = 0.3, using the best set of hyperparameters, are very good. To account for the random nature of the RC, 100 runs have been performed (for each µ value separately), each time training and evaluating the RC with two different solutions of (8).
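The Euler-Maruyama simulations of (8) can be sketched as below; the code assumes the standard supercritical Hopf normal form with additive noise (a reconstruction), and ω = 1 is an arbitrary choice:

```python
import numpy as np

def simulate_hopf(mu, omega=1.0, sigma=0.08, dt=0.1, n_steps=2000, seed=0):
    """Euler-Maruyama integration of a stochastic Hopf normal form
    (assumed form of system (8)):
        dx = [(mu - (x^2 + y^2)) x - omega y] dt + sigma dW_x
        dy = [omega x + (mu - (x^2 + y^2)) y] dt + sigma dW_y
    """
    rng = np.random.default_rng(seed)
    xy = np.empty((n_steps, 2))
    x, y = 0.1, 0.0
    sdt = np.sqrt(dt)  # Euler-Maruyama noise scaling
    for n in range(n_steps):
        r2 = x * x + y * y
        dWx, dWy = sigma * sdt * rng.standard_normal(2)
        x, y = (x + ((mu - r2) * x - omega * y) * dt + dWx,
                y + (omega * x + (mu - r2) * y) * dt + dWy)
        xy[n] = x, y
    return xy

# Subcritical (mu = -0.3): noisy fluctuations around the origin.
# Supercritical (mu = 0.3): noisy limit cycle of radius ~ sqrt(mu).
sub = simulate_hopf(-0.3)[1000:]  # discard first 1000 steps (transient)
sup = simulate_hopf(0.3)[1000:]
print(np.hypot(*sub.T).mean(), np.hypot(*sup.T).mean())
```

The mean radius of the supercritical run is close to the deterministic limit-cycle radius, while the subcritical run stays near the origin with noise-driven excursions.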
To better analyze the behavior of the self-evolving RC, we eliminate the bias term (see section 2.2) and define the output y(n) at time step n as y(n) = W_out x(n). In this way, the self-evolving RC is given by:

$$x(n+1) = (1 - \alpha)\, x(n) + \alpha \tanh\!\left( (W_{\mathrm{in}} W_{\mathrm{out}} + W)\, x(n) \right), \quad (9)$$

where W ∈ R^{N_x×N_x}, while W_in ∈ R^{N_x×N_u} and W_out ∈ R^{N_y×N_x} (the inputs and the outputs of the self-evolving RC have the same dimension, so N_u = N_y). Using this formulation, we can compute the Jacobian matrix J of this discrete dynamical system (after the training procedure, W_out is fixed, so the self-evolving RC is deterministic) and compute the corresponding eigenvalues at the equilibrium point (where x(n + 1) = x(n)) before and after the bifurcation. The number of eigenvalues depends on the reservoir dimension N_x; here N_x = 8. Figures 1(a) and (b) show the output of the self-evolving RC for µ = −0.3 (before the bifurcation) and µ = 0.3 (after the bifurcation), respectively. Figures 1(c) and (d) show the eigenvalues of the Jacobian matrix of the mapping (9) at the equilibrium point. Before the Hopf bifurcation (µ = −0.3), all the eigenvalues are inside the unit circle in the complex plane, and the self-evolving RC output is indeed a damped oscillation. After the bifurcation (µ = 0.3), the dominant eigenvalues have crossed the unit circle, and the self-evolving RC approaches a periodic solution.
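The Jacobian analysis can be sketched numerically; here the trained matrices are replaced by random stand-ins, and the map is taken as the bias-free self-evolving form x(n+1) = (1−α)x(n) + α tanh((W_in W_out + W) x(n)):

```python
import numpy as np

rng = np.random.default_rng(2)
N_x, alpha = 8, 0.5  # reservoir dimension as in the text; alpha is a stand-in

# Random stand-ins; in the paper these come from the training procedure.
W = rng.uniform(-0.3, 0.3, (N_x, N_x))
W_in = rng.uniform(-0.3, 0.3, (N_x, 1))
W_out = rng.uniform(-0.3, 0.3, (1, N_x))

A = W_in @ W_out + W  # self-evolving map: x(n+1) = (1 - alpha) x + alpha tanh(A x)

def jacobian(x_star):
    """Jacobian of the discrete map at a point x*:
    J = (1 - alpha) I + alpha diag(1 - tanh^2(A x*)) A."""
    s = 1.0 - np.tanh(A @ x_star) ** 2
    return (1 - alpha) * np.eye(N_x) + alpha * (s[:, None] * A)

eigs = np.linalg.eigvals(jacobian(np.zeros(N_x)))
print(np.max(np.abs(eigs)))  # spectral radius of J at the origin
```

Eigenvalues inside the unit circle indicate a stable equilibrium (damped oscillation); eigenvalues outside it indicate that the self-evolving RC settles on a periodic orbit, as in figures 1(c) and (d).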
In the spectra for δ = 0.194, a peak at a frequency of around 0.16 (corresponding to a period of about 1.8 years) appears. For δ = 0.214, new frequency components appear, but the frequency components related to the corresponding deterministic periodic solution are still clearly visible.
We now use the noisy solutions of the JT model to train RCs. For the two different values of δ (0.194 and 0.214), an RC was trained using 100 years of training data and all three features [x, y, z]. Thereto we perform simulations of 300 years and delete the first 200 years before training to guarantee that a statistical equilibrium has been reached. The best set of hyperparameters for this scenario has been determined using a Bayesian search algorithm. For a particular set of hyperparameters, we evaluate the RC performance for both δ = 0.194 and δ = 0.214 as follows. For each δ, a simulation is used to train the RC. Next, a new simulation (with a new noise realization) is used to evaluate the performance of the RC for a lead time of 6 months. Finally, the mean of the loss function over the two values of δ is computed. In this way, we determine a single best set of hyperparameters for both δ values. This was done again to avoid overfitting and to increase the RC's generalization capabilities. The indicator used to evaluate the RC performance throughout the Bayesian search was the NRMSE (Normalized Root MSE), calculated taking into account all features. The RMSE has been normalized separately for each variable, dividing it by the variance of the target values. Table 2 shows the prediction skills of the RC for both δ = 0.194 and δ = 0.214 (using the best set of hyperparameters). To account for the random nature of the RC, 100 runs have been performed (for each δ value separately), and the mean NRMSE values are shown (table 2) for each of the variables [x, y, z] separately. The RC's prediction skills increase after the bifurcation (δ = 0.214), and the NRMSE values are smallest for the variable z (i.e. the variable representing the thermocline depth). When the self-evolving RC converges to a periodic solution (figure 3(d)), it has been able to identify the main frequency components of the corresponding deterministic JT solution (which is periodic for δ = 0.214). For δ = 0.194, the self-evolving RC most of the time converges to a constant value, but for a small number of runs, the self-evolving RC's spectrum is characterized by a small-amplitude frequency component around 0.16 (figure 3(c)).
Using the JT model, we can efficiently investigate the effects of the number and type of features on the RC's behavior. For each case, the same procedures described above have been used to both optimize the hyperparameters and evaluate the prediction skills of the RC. Using all three variables x, y and z as features, we first decreased the number of years used to train the model and found that the RC can distinguish subcritical and supercritical conditions down to a training data set of 50 years. When trained with 40 years of data instead of 100, the prediction skills of the RC decrease for both δ = 0.214 and δ = 0.194 (table 2). In this case, the self-evolving RC is no longer able to reconstruct the underlying deterministic behavior of the JT model. For example, for δ = 0.214, the spectrum (not shown) displays variability on a wide range of frequencies. For δ = 0.194, the RC most of the time converges to a periodic solution, instead of a damped oscillation.
The prediction skills of the RC when only the variable x is used to train it decrease for both δ = 0.214 and δ = 0.194 compared to those of the full feature set (table 2). Also when we combine the variables [x, y], the performance for both δ values decreases (table 2); the same holds when the RC is trained using the variables [x, z] (table 2). When only the input feature x and 50 years of training data are used, the self-evolving RC produces very different results compared to using the full feature vector. The same result was found for the feature vector [x, y]. However, for the feature vector [x, z], the self-evolving RC produces similar results to those using the full feature vector. So both the zonal temperature difference (x) and the thermocline depth (z) are important for the RC to generate the correct behavior. This is in correspondence with the recharge-discharge oscillator view of ENSO, in which both features are crucial [15,38].

ZC model
In the ZC model, we use the dimensionless coupling strength µ as the control parameter. For the standard values of the other parameters, a supercritical Hopf bifurcation occurs at µ = 3.0 in the deterministic model. Hence, for µ = 2.8, the Pacific climate state is stable (figure 4(a)), while for µ = 3.1, a periodic solution with a time scale of about 4 years appears (figure 4(b)). Under the wind-stress noise forcing with σ = 0.7 (see [9] for the dimensional values), the case µ = 2.8 shows substantial variability with a mean period of about 4 years (figure 4(c)). For µ = 3.1, the variability becomes more irregular due to the noise (figure 4(d)). These results are basically similar to those in [9], and are in correspondence with a stochastic Hopf bifurcation interpretation.
The ZC-model results for all the combinations of five µ (µ = [2.5, 2.6, 2.7, 2.8, 3.1]) and three σ (σ = [0.5, 0.7, 1.0]) values are used to train RCs, optimize the hyperparameters using a Bayesian search, assess the RC performance and subsequently study the behavior of the self-evolving RC. This procedure is the same as that used for the JT model described in the previous section. For the ZC model, we use a feature set consisting of the sea surface temperature in the western (T_W) and eastern (T_E) equatorial Pacific and the thermocline depth in the eastern (h_E) and western (h_W) equatorial Pacific, where the western and eastern regions are given by 120. We first consider the case where 60 years of training data are used. In this case, for each combination of µ and σ, data from 10 different stochastic ZC simulations of 100 years were generated. To avoid capturing the transient behavior, only the last 60 years of each simulation were considered. The best set of hyperparameters for this scenario has again been determined using a Bayesian search algorithm. In this case, we optimize the hyperparameters for each value of σ separately, as follows. For a specific set of hyperparameters, the RC performance is evaluated for each σ, using all values of µ = [2.5, 2.6, 2.7, 2.8, 3.1]. For each µ, the RC has been trained and evaluated using two different stochastic ZC simulations (sampled from the ten that were available). Finally, for each σ, the mean of the results over the different µ values was used to determine a best set of hyperparameters. This was done again to avoid overfitting and to generate a model that generalizes better. It was too difficult for the RC to also generalize well over all values of σ, and for this reason different best hyperparameter sets have been determined per σ. To evaluate the RC performance during the Bayesian search, a lead time of 6 months was considered, and the NRMSE was again used as the performance measure. Table 3 shows the performance of the RCs (using the best
sets of hyperparameters) for a lead time of 6 months. To account for the random nature of the reservoir, 100 runs have been performed (for each combination of σ and µ separately) and the mean of the results has been computed. Similarly to the JT model results, the RCs show better prediction skills for supercritical values of µ, where the behavior of the ZC model is more regular and more robust with respect to the noise. Moreover, for the same value of µ, the performance remains similar when increasing the noise amplitude σ. One of the reasons behind this apparent resilience to the noise is connected to the strategies applied by the RC to avoid overfitting the noisy patterns in the training data. During training, the connections W_out, which are the only ones that can be influenced by the presence of the noise in the input signal (see section 2.2), are not estimated using a simple linear regression, but using a ridge regression, which limits the weight amplitudes, reducing the overfitting. Moreover, W_out does not directly map the inputs to the desired outputs, as the input signal is first mapped into a non-linear high-dimensional space where it is easier to separate the noise components and identify the most relevant underlying patterns. The weights W_in and W (responsible for mapping the input data into the reservoir space) are fixed and not influenced by the training data. This is another strategy that enhances the generalization capabilities of the RC, reducing the overfitting and the influence of noise. As in the JT model results, the self-evolving RC solution converges either to a fixed point or to a periodic solution. To show the results in a concise way, we define the criticality index C as follows:

$$C = \frac{\sum_{n=1}^{n_T} r(n)^2}{\sum_{n=1}^{n_T} z(n)^2}, \quad (10)$$

where n_T is the number of data points considered. Here, r(n) is the self-evolving RC solution for T_E and z(n) the same variable for the random ZC model realization that has been used to train the RC. Before computing C, both r(n) and z(n) are centered around zero by subtracting the mean. Under subcritical conditions, we expect the RC solution to converge to a fixed point, so the numerator in (10) is much smaller than the denominator and C ≈ 0. Under supercritical conditions, the RC solution is expected to converge to a periodic solution, so C ≈ 1.
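A sketch of the index computation; the variance-ratio form used here is a reading of (10) from the surrounding text (a fixed point gives C ≈ 0, a sustained oscillation of comparable amplitude gives C ≈ 1), and the synthetic series are illustrative only:

```python
import numpy as np

def criticality_index(r, z):
    """Criticality index C of eq. (10), read here as the ratio of the
    centered power of the self-evolving RC output r(n) to that of the
    training series z(n). The exact form is a reconstruction from the text."""
    r = np.asarray(r, float) - np.mean(r)
    z = np.asarray(z, float) - np.mean(z)
    return np.sum(r ** 2) / np.sum(z ** 2)

# Illustrative training series: noisy oscillation with a 4-year (48-month) period.
t = np.arange(1200)
z = np.sin(2 * np.pi * t / 48) + 0.3 * np.random.default_rng(3).standard_normal(t.size)

# Subcritical-like RC output: decays to a fixed point -> C ~ 0.
r_sub = np.exp(-t / 100) * np.sin(2 * np.pi * t / 48)
# Supercritical-like RC output: sustained oscillation -> C ~ 1.
r_sup = np.sin(2 * np.pi * t / 48)

print(criticality_index(r_sub, z), criticality_index(r_sup, z))
```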
To compute the index C, we used 100-year simulations of the self-evolving RC and made sure that the amount of self-evolving RC data considered was the same as the amount of data used to train the RC. For example, for the 60-year case, the first 40 years of both the self-evolving RC and the training ZC simulation have been eliminated. The criticality index C is plotted as a function of µ for different σ values in figure 5(a). To produce this boxplot, for each combination of µ and σ, 100 simulations were performed. The index C is near zero for µ = 2.5 and µ = 2.6, for all σ values, in agreement with the subcritical conditions (see figure 5). For µ = 3.1, the value of C is near unity, in agreement with supercritical conditions. For the subcritical values µ = 2.7 and µ = 2.8, the value of C shows a large variance.
The robustness of the results to the number of training years was investigated by using 40 and 20 years of training data, instead of 60 years. To estimate the best set of hyperparameters and evaluate the RC performance after the Bayesian search, the same methodology as for the 60-year case has been followed. The RCs, when trained with either 40 or 20 years of data, show prediction skills comparable to those obtained with 60 years of data (not shown). The self-evolving reservoir results for 20 years of training data (as well as those for 40 years, not shown) are also fairly similar (figure 5(b)) to those obtained using 60 years of data; hence the self-evolving RC produces adequate results also when trained using a smaller amount of data.

CESM
For the deterministic ZC and JT models considered in the previous subsections, we know the location of the Hopf bifurcation precisely and hence can use the criticality index C to determine whether the ENSO variability is sub- or supercritical. For the 100 years of de-seasonalized CESM data (model years 1200-1300) [41], we do not know whether a limit cycle is central to the ENSO variability. For the features, we use the NINO3.4 index (figure 6(a)), the anomaly of the depth of the 20 °C isotherm over the area 120 (figure 6(b)) and the zonal wind-stress anomaly over the area 120. The depth of the 20 °C isotherm at each grid point was determined using linear interpolation from the temperature field provided by the CESM. The surface zonal wind stress and NINO3.4 are direct outputs of the CESM model.
Again, a Bayesian search has been applied to determine the best set of hyperparameters. For each set of hyperparameters, the RC performance for lead times of 3, 6, and 9 months has been evaluated. To assess this performance, a cross-validation technique has been applied over the first 80 of the 100 years of the CESM time series. The mean of the results obtained for each lead time has been taken as a measure of the RC performance, using as metric the NRMSE computed in the same way as described in section 3.2. After the validation phase, 5 different overlapping intervals of 40 years have been defined from the 80 years of validation data, using a sliding window of size 40 years and step 10 years. To evaluate the prediction skills of the RC, it has been trained on each of these periods and its prediction performance then assessed on the last 20 years of the CESM time series, which have not been considered during the validation phase. This has been done to determine whether the RC is able to perform well on completely unseen data. From these results, it was found that the RC has difficulties in predicting large-amplitude ENSO events, especially for larger lead times (not shown).
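The five overlapping 40-year training windows can be generated as follows (model years as stated in the text):

```python
# Overlapping 40-year training windows (window 40, step 10) taken from the
# first 80 model years of the CESM data, as described for the CESM analysis.
start_year, n_valid_years, window, step = 1200, 80, 40, 10

windows = [(s, s + window)
           for s in range(start_year, start_year + n_valid_years - window + 1, step)]
print(windows)  # [(1200, 1240), (1210, 1250), (1220, 1260), (1230, 1270), (1240, 1280)]
```

These are exactly the intervals 1200-1240 through 1240-1280 analyzed below; the remaining model years 1280-1300 are held out as the test interval.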
For this reason, we decided to weight the contribution of the different samples to the loss function. In particular, the weight of each sample is defined in terms of the amplitude of the NINO3.4 index, where here n_c = 4. Using this approach, the RC shows good prediction skills even for a lead time of 9 months, and is able to capture the strong El Niño events well, independently of the 40-year period used to train it (figure 7). To quantify the performance, we use the MAE (Mean Absolute Error) of the 9-month lead time predictions for the two largest El Niño (NINO3.4 ⩾ 2.5 °C) and the three largest La Niña (NINO3.4 ⩽ −2 °C) events of the evaluation period. This has been done for all the training periods, and table 4 summarizes the results obtained. Applying the weighting method, the RC is clearly able to predict strong El Niño events better. The RC still has difficulties in correctly predicting strong La Niña events. This is because our weighting scheme is based on the training data distribution, so the performance asymmetry reflects the asymmetry in the distribution of strong El Niño and La Niña events. The strongest El Niño events have, in general, a larger amplitude than the strongest La Niña events (this is the case in the CESM data as well as in real observations). In general, the number of strong-amplitude El Niño events is also larger than the number of strong-amplitude La Niña events. The weighting method intensifies this asymmetry, leading to better performance in predicting positive events and worse performance in predicting negative events. Table 4 also shows the RC prediction skills over the whole test interval (model years 1280-1300). Applying the weighting method gives a small performance decrease there. This is because, with the weights, we give more importance to strong events, so the predictions for samples with weaker temperature anomalies are less accurate.
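Since the paper's exact weight formula is not reproduced above, the following sketch shows only how generic per-sample weights enter the ridge estimate of the readout W_out; the weight vector itself is left as a user-supplied input:

```python
import numpy as np

def weighted_ridge(Z, Y, w, eps=1e-6):
    """Sample-weighted ridge regression for the readout W_out.
    Z: (n_features, n_samples) extended reservoir states,
    Y: (n_outputs, n_samples) targets, w: (n_samples,) nonnegative weights
    (e.g. larger for strong NINO3.4 anomalies; the paper's exact formula
    is not reproduced here). Minimizes
        sum_n w(n) ||W_out z(n) - y(n)||^2 + eps ||W_out||^2,
    whose solution is W_out = Y S Z^T (Z S Z^T + eps I)^{-1}, S = diag(w)."""
    ZW = Z * w  # broadcast: scales each sample (column) by its weight
    return Y @ ZW.T @ np.linalg.inv(Z @ ZW.T + eps * np.eye(Z.shape[0]))

# Tiny check on synthetic data: recover Y = A Z with uniform weights.
rng = np.random.default_rng(4)
Z = rng.standard_normal((5, 200))
A_true = rng.standard_normal((1, 5))
Y = A_true @ Z
W_out = weighted_ridge(Z, Y, np.ones(200))
print(np.max(np.abs(W_out - A_true)))  # close to zero
```

With uniform weights this reduces to the ordinary ridge regression of section 2.2; up-weighting strong-event samples shifts the fit toward reproducing those events, at the cost of the weaker ones, which matches the trade-off reported in table 4.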
For these reasons, the weighting method was also applied when generating the self-evolving RC solution. The criticality index C has been computed for the five overlapping 40-year periods used to evaluate the RC prediction performance. Figure 8 shows the distribution of the resulting values of C for the different training periods. To produce this boxplot, 300 different simulations were performed for each training period. Each time, a 100-year self-evolving RC solution was generated and the first 60 years were discarded to avoid capturing transient behavior. The RC was trained each time using the best set of hyperparameters.
For some of the intervals analyzed (figure 8(a)), the self-evolving RC produces a periodic solution, while for other intervals its output converges most of the time to a constant value. More specifically, for the intervals 1200-1240, 1210-1250 and 1230-1270, C is on average larger than 0.5, meaning that the self-evolving RC converges most of the time to a periodic solution. The intervals 1220-1260 and 1240-1280 are characterized by an average value of C ∼ 0.1.
To characterize the behavior of the self-evolving RC in more detail, we consider the change in amplitude of the peaks of the NINO3.4 values to estimate the damping rate ζ of the solution r(n), with peak values p(n), as

δ = (1/m) ln[p(n)/p(n + m)], ζ = δ/√(4π² + δ²).

Here p(n + m) is the amplitude of the peak of r after m cycles of the dominant period, with n fixed. When the time series is a sustained periodic signal, δ = ζ = 0. Under strong damping of the oscillation, δ becomes larger than zero and, consequently, so does ζ. Of course, the larger the value of m, the better the estimate of the damping ratio. Nevertheless, already for small values of m (m = 5 was used, starting from the solution at n = 0) it is possible to clearly see the difference between a sustained and a damped oscillation.
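A minimal sketch of this peak-based estimate, assuming the standard logarithmic-decrement relations δ = (1/m) ln[p(n)/p(n+m)] and ζ = δ/√(4π² + δ²) (the function below is illustrative, not the paper's code):

```python
import numpy as np

def damping_ratio(peaks, m=5):
    """Estimate the damping ratio from oscillation peak amplitudes.

    Uses the standard logarithmic-decrement relations (an assumption
    of this sketch): delta = (1/m) * ln(p(0)/p(m)),
    zeta = delta / sqrt(4*pi**2 + delta**2).
    """
    delta = np.log(peaks[0] / peaks[m]) / m
    return delta / np.sqrt(4.0 * np.pi**2 + delta**2)

# Sustained oscillation: equal peak amplitudes give zeta = 0
print(damping_ratio([1.0] * 6))                          # 0.0
# Damped oscillation: decaying peaks give zeta > 0
print(damping_ratio([np.exp(-0.3 * k) for k in range(6)]) > 0)  # True
```

Overdamped solutions have no successive peaks to compare, which is why the technique only applies to underdamped or sustained oscillations.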
In comparison with the index C, the damping ratio is less dependent on the length of the self-evolving RC's output. This advantage is significant, especially when the RC needs many steps to converge. For the intervals 1200-1240 and 1230-1270, characterized by a large C in figure 8(a), the values of ζ are close to zero (figure 8(b)), confirming that the self-evolving RC's output for these periods, on average, converges to a periodic solution. For the intervals 1220-1260 and 1240-1280, the damping rate is larger, in correspondence with the lower value of C (figure 8(b)). Values of ζ for the interval 1210-1250 are larger than those of the intervals 1200-1240 and 1230-1270, but smaller than those of 1220-1260 and 1240-1280. Without the weighting method, the self-evolving RC produces damped oscillations for all the intervals (not shown). This is due to the lower RC performance in predicting strong El Niño events at larger lead times (figure 7): the RC then seems less able to properly capture the relevant dynamics of the data. The logarithmic-decrement technique for estimating the damping ratio of the self-evolving solutions can be applied only to underdamped oscillations. When trained with the CESM data, the self-evolving RC always produces an underdamped oscillation or a sustained periodic solution (not shown), so the method can be applied. This is not the case when the RC is trained with the ZC data for small values of µ (e.g. µ = 2.5), where the self-evolving RC solutions are usually overdamped. For this reason, the behavior of ζ for the self-evolving RC trained with the ZC data was not analyzed in section 3.3.

Observations
The observational data set was linearly detrended, the seasonal cycle was removed, and then the same features as for the CESM were used, i.e. NINO3.4, the 20 °C thermocline depth anomalies over the area 120° E-280° E × 5° S-5° N and the zonal surface wind-stress anomalies over the area 120° E-200° E × 2.5° S-2.5° N. A 3-month moving average was also applied to the observational time series (figure 9).
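A minimal sketch of this preprocessing chain (linear detrend, removal of the mean seasonal cycle, 3-month running mean); the function and variable names are illustrative:

```python
import numpy as np

def preprocess(monthly, window=3):
    """Detrend linearly, remove the mean seasonal cycle, then apply a
    centered running mean. A sketch of the preprocessing described in
    the text; names and implementation details are assumptions."""
    t = np.arange(len(monthly))
    # Linear detrend: subtract the least-squares line
    a, b = np.polyfit(t, monthly, 1)
    anom = monthly - (a * t + b)
    # Remove the mean seasonal cycle (12-month climatology)
    for month in range(12):
        anom[month::12] -= anom[month::12].mean()
    # Running mean over `window` months
    return np.convolve(anom, np.ones(window) / window, mode="valid")

t = np.arange(720)
x = 0.01 * t + np.sin(2 * np.pi * t / 12)   # synthetic trend + seasonal cycle
anoms = preprocess(x)                        # length 718, anomalies near zero
```

The `mode="valid"` convolution shortens the series by `window - 1` months, which is why the smoothed record is slightly shorter than the raw one.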
The observations cover a period of approximately 60 years, from January 1960 to September 2020. The first 40 years have been used for validation and training, while the last 20 years have been used only for testing. The same hyperparameter search as described in the previous section has been applied.
In this case, we have determined a different set of best hyperparameters for each lead time by applying a separate Bayesian search. Figure 10 shows that the prediction skill of the RCs for lead times of 3 and 6 months is relatively good, while it becomes low for a lead time of 9 months. However, due to the weighting method, the large-amplitude El Niño events are better captured, especially at larger lead times. To quantify this, the MAE for the nine-month lead time predictions of the three strongest El Niño and La Niña events of the test period has been computed (table 5). With the weighting method, the RC can better predict strong El Niño events at a larger lead time. The model still has difficulties capturing the behavior of strong La Niña events, which are also poorly captured without the weighting method. Table 5 also shows the RC prediction skill over the whole test period for the different lead times. With the weighting method, the RC performance over the whole test period decreases: as in the CESM case, larger importance is given to the samples related to strong events, so the performance in capturing the behavior of weaker events decreases.
Because a separate validation procedure was performed for each lead time, the criticality index C was also computed for the best set of hyperparameters identified for each lead time (figure 11). To estimate the index C and the damping ratio ζ, the same procedure as described in the previous section has been followed, but only the period from 1960 to 2000 has been considered. Again, to avoid capturing transient behavior, 100 years of self-evolving RC data were generated before computing the criticality index C, and only the last 40 years were retained. The self-evolving RC solutions for the real observations usually converge to a fixed point in less than 40 years (not shown). This is not a problem when computing C, but it can be problematic for ζ. For this reason, the transient behavior was also included when computing ζ, as otherwise it was not possible to adequately estimate the damping rate. The value of C (figure 11(a)) is close to zero for each set of hyperparameters, suggesting that the observed ENSO variability is subcritical.

Summary and discussion
Using an RC machine learning technique, we have developed a scalar index, called the criticality index C, to detect limit cycle behavior in noisy time series.
The intuitive notion behind this method is that if a noisy time series comes from a dissipative dynamical system with no sustained oscillations, there will be times when the noise is small and hence decay will occur. When there are sustained oscillations, this will not happen. The self-evolving RC is a deterministic system, and the time series (from the dynamical system and the noise) determines the output weights of the RC. As shown in section 3.1, the RC can pick up the temporary decay under subcritical conditions from the training data and adjust its weights accordingly, resulting in a fixed point for the self-evolving reservoir. No decay occurs in the supercritical case, and the RC weights then lead to a periodic solution for the self-evolving RC. Moreover, the self-evolving RC is no longer influenced by external stochastic forcing: when we generate a self-evolving trajectory, it will only reflect the essential underlying dynamics learned during the training phase. This makes the self-evolving RC no longer useful for predictions, but well suited to our task of distinguishing the underlying deterministic dynamics.
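The two phases described above, input-driven training followed by closed-loop self-evolution, can be sketched with a minimal echo-state network. All sizes, scalings and the surrogate training signal below are illustrative assumptions, not the paper's actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reservoir setup (size, input scaling and spectral radius are assumptions)
N = 200
W_in = rng.uniform(-0.5, 0.5, (N, 1))             # input weights
W = rng.normal(0.0, 1.0, (N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9

def step(r, u):
    """One reservoir update driven by the scalar input u."""
    return np.tanh(W @ r + W_in[:, 0] * u)

# Surrogate training series: a noisy damped oscillation (stand-in for NINO3.4)
T = 2000
t = np.arange(T)
u = np.exp(-5e-4 * t) * np.sin(2 * np.pi * t / 50) + 0.01 * rng.normal(size=T)

# Input-driven phase: collect states, fit the linear readout by ridge regression
R = np.zeros((T, N))
r = np.zeros(N)
for n in range(T - 1):
    r = step(r, u[n])
    R[n + 1] = r
w_out = np.linalg.solve(R.T @ R + 1e-6 * np.eye(N), R.T @ u)

# Self-evolving (closed-loop) phase: the RC is fed its own prediction,
# so the trajectory reflects only the learned deterministic dynamics
y = u[-1]
traj = []
for n in range(500):
    r = step(r, y)
    y = float(w_out @ r)
    traj.append(y)
```

Trained on a damped, noisy signal like this one, the closed-loop trajectory typically decays toward a fixed point; trained on a sustained oscillation, it instead settles onto a periodic orbit. That contrast is what the criticality index exploits.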
The utility of the methodology was first illustrated using the JT model, for which the bifurcation diagram is known [31] and for which we can also use the full state vector as feature vector. We learned from the results of the JT model that relatively short time series are sufficient to distinguish subcritical from supercritical behavior, and that the thermocline depth and the zonal temperature difference are important features.
With a limited set of features from simulations of the stochastic ZC model, where the position of the deterministic Hopf bifurcation is also known, we found that the RC approach can distinguish subcritical from supercritical behavior when the coupling parameter µ is relatively far from the deterministic Hopf bifurcation. A criticality index C, based on the ratio of the variability of the RC output to that of the model time series, was defined to obtain a scalar quantity distinguishing subcritical (C ∼ 0) from supercritical (C ∼ 1) behavior. For subcritical values of µ close to the Hopf bifurcation, the range of C can vary substantially.
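As a minimal sketch, the variability ratio behind C might be computed as follows; the use of the standard deviation as the variability measure and the clipping to [0, 1] are assumptions of this sketch, not the paper's exact definition:

```python
import numpy as np

def criticality_index(rc_series, data_series):
    """Ratio of the variability (here: standard deviation) of the
    self-evolving RC output to that of the training data, clipped to
    [0, 1]. An illustrative sketch of the index C."""
    return float(np.clip(np.std(rc_series) / np.std(data_series), 0.0, 1.0))

t = np.linspace(0, 40 * np.pi, 2000)
data = np.sin(t)                                  # training-series variability
print(criticality_index(np.sin(t), data))         # sustained output -> 1.0
print(criticality_index(np.exp(-t) * np.sin(t), data))  # decaying -> near 0
```

A self-evolving RC that converges to a fixed point has almost no variability, giving C ∼ 0, while a sustained periodic solution with data-like amplitude gives C ∼ 1.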
Applied to the 100-year simulation data from a state-of-the-art GCM (CESM), again using limited features, the index C indicates that a limit cycle underlies the ENSO variability for most time intervals. For these data, a weighting technique was needed to obtain a good performance of the RC on large-amplitude ENSO events. The values of C are quite large, in agreement with the result that ENSO in this version of CESM is much too regular compared to observations [41]. The damping rate of the oscillatory behavior of the RC is consistent with the supercritical behavior of the CESM for most intervals. At the moment, nothing is known about the underlying bifurcation structure of the CESM, and it would be worthwhile to vary parameters in this model to demonstrate the appearance of the Hopf bifurcation.
For the case of the observations, we have quite limited data compared to the model cases considered earlier. The training and validation procedure of the RC for the observations is the same as that for the CESM, as only one 100-year CESM simulation was analyzed, but the prediction performance of the input-driven RCs is generally lower. Trained with observations, the behavior of the self-evolving RC indicates that the variability is subcritical, as both the values of C and the damping rate ζ are small. This may be caused by the limited data and the limited features, which decrease the performance of the RC and hence may not provide an adequate representation of the dynamical system. On the other hand, the result is consistent with those in [8], who conclude that ENSO is near-critical based on the properties of the Pacific climate state.
To further support the accuracy of our RC reconstructions, particularly when the location of the bifurcation point is unknown (i.e. for the CESM and the observations), we have always also evaluated the prediction performance of the input-driven RC. Comparing model data and real observations, a decrease in performance is found. This is expected for many machine learning algorithms, because real observations are characterized by far more complex dynamics than model data. The input-driven RC performance can be taken as an indicator of the ability of the RC to capture the essential dynamics present in the training data, which its self-evolving version will then reflect. This means that the RC captures model-data dynamics better than those in the real observations. Although these results may not have direct practical importance, as models will continue to provide ENSO predictions the way they did before, they are important for ENSO predictability theory. According to our results, the observed ENSO variability is dominantly driven by stochastic forcing, which reduces the predictability compared to the case where sustained variability would have been present. Our technique may also have more general application in systems where it is not known (because of limited data) whether limit cycle behavior is relevant, for example the Atlantic Multidecadal Variability [6].

Appendix A. JT model description
In the original version of the JT model [38], the variables of interest are the temperatures in the eastern and western equatorial Pacific (T_2, T_1) and the western tropical Pacific thermocline depth (h_1). In the ordinary differential equations that describe the ENSO dynamics in the original implementation of the JT model, T_r represents the climatological mean state and the eastern tropical Pacific subsurface temperature T_sub is given by a separate formula, in which T_r0 represents the mean eastern equatorial temperature at a depth of 75 m, H is an eastern thermocline reference depth and h* represents the sharpness of the thermocline. The last equation, which closes the system, involves the eastern thermocline depth h_2.
In [31], dimensionless variables were introduced, with the values of the constants S_0 = T_0, t_0 and h_0 being respectively 2.8182 °C, 104.9819 days and 62 m.

Appendix B. ZC model description
The ocean domain of the ZC model is bounded by meridional walls in the west (x = 0) and east (x = L) and consists of a well-mixed layer embedded in a shallow-water layer. The well-mixed layer has a mean depth H_1, while the shallow-water layer has a mean depth H = H_1 + H_2 and a constant density ρ. The lower boundary of the shallow-water layer is called the thermocline. Only long-wave motions above the thermocline are considered, and the deep ocean, with density ρ + ∆ρ, is taken to be at rest. The flow in the shallow-water layer is described by reduced-gravity shallow-water equations, and the surface-layer velocities u_s = (u_s, v_s) are defined by a separate set of equations. Defining the horizontal velocity components in the mixed layer as u_1 = u_s + u and v_1 = v_s + v, the vertical velocity component follows from continuity. The evolution of the sea surface temperature T is controlled by an equation in which H is a continuous approximation of the Heaviside function and T_0 is the temperature at radiative equilibrium. The subsurface temperature T_s depends on thermocline variations; in its expression, h_0 and h_1 control the offset and steepness of the T_s profile and T_S0 is a characteristic temperature. The wind response to the sea surface temperature anomalies is determined using the Gill model [11], in which (U, V) are the zonal and meridional atmospheric surface velocities, Θ is the geopotential height and a_M is a damping coefficient; T_r is a reference temperature and c_a is the speed of the atmospheric equatorial Kelvin wave. As in [40], the zonal wind stress consists of two contributions: the coupled component τ_c^x, which is proportional to the zonal atmospheric surface velocity U with a constant coefficient γ, and τ_ext^x, the easterly wind-stress component due to the Hadley circulation. The control parameter we have used to manipulate the output of the Zebiak-Cane model is the dimensionless parameter µ, in whose definition ∆T is a typical zonal temperature difference over the basin. The control parameter µ regulates the strength of the coupling between the ocean and the atmosphere.

Figure 2 .
Figure 2. Behavior of the stochastic JT model for (a) δ = 0.194 and (b) δ = 0.214 with noise amplitude σ = 0.0095. The values of the other control parameters are [a, ρ, c, k] = [8, 0.3224, 2.3952, 0.4032]. (c) Mean, 5th and 95th percentiles of the normalized spectrum of the x-component of the JT model for δ = 0.194. (d) Same as (c) but for δ = 0.214. To obtain the spectra, 100 simulations with different noise realizations were performed for each value of δ.

Figure 3 .
Figure 3. Time series of the self-evolving RC trained with 100 years of noisy JT model data, for (a) δ = 0.194 and (b) δ = 0.214. In both cases, [a, ρ, c, k, σ] = [8, 0.3224, 2.3952, 0.4032, 0.0095] and all variables [x, y, z] have been used to train the RC. The normalized spectra of the variable x are shown in (c) for δ = 0.194 and (d) for δ = 0.214, based on 150 simulations with the RC.

Figure 5 .
Figure 5. Values of the criticality index C obtained from a self-evolving reservoir, in which the RC is trained with ZC model data for (a) 60 years and (b) 20 years. The whiskers in the boxplot extend to the most extreme results within 1.5 times the interquartile range from the 25th and 75th percentiles. The black dots can be considered outliers.
4 classes I: [µ_w − ρσ_w, µ_w + ρσ_w], II: [µ_w + ρσ_w, µ_w + 2ρσ_w] and [µ_w − 2ρσ_w, µ_w − ρσ_w], III: [µ_w + 2ρσ_w, µ_w + 3ρσ_w] and [µ_w − 3ρσ_w, µ_w − 2ρσ_w], and IV: [µ_w + 3ρσ_w, ∞) and (−∞, µ_w − 3ρσ_w]. Here, µ_w and σ_w are the mean and standard deviation of the distribution of the training data and ρ is an additional hyperparameter optimized during the Bayesian search. With the number of samples belonging to each class being s_j, the weight w_j of each sample belonging to class j is determined from
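A sketch of this class-based weighting; since the weight formula itself is not reproduced here, the inverse-class-frequency weight below is an assumption of the sketch:

```python
import numpy as np

def sample_weights(x, rho=1.0):
    """Assign each sample to one of n_c = 4 classes by its distance from
    the training mean in units of rho * std, then weight each sample
    inversely to its class size. The inverse-frequency weight is an
    assumption; the paper's exact formula is not reproduced here."""
    mu, sigma = x.mean(), x.std()
    k = np.abs(x - mu) / (rho * sigma)
    cls = np.minimum(np.floor(k).astype(int), 3)   # classes I..IV -> 0..3
    counts = np.bincount(cls, minlength=4)
    w = len(x) / (4 * np.maximum(counts, 1))       # inverse-frequency weights
    return w[cls]

x = np.random.default_rng(1).normal(size=1000)
w = sample_weights(x)   # rare, extreme samples receive the largest weights
```

Because extreme NINO3.4 values are rare in the training data, this kind of weighting makes the loss function emphasize strong events.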

Figure 7 .
Figure 7. RC's 9-month lead time predictions on the test interval, obtained when training the RC with the different 40-year periods, (a) without the weights and (b) with the weights. Both panels show the mean of the results obtained over 300 simulations.

Figure 8 .
Figure 8. (a) Box plot showing the distribution of the results obtained in terms of the criticality index C, determined from the CESM data and based on NINO3.4. (b) Plot of the damping ratio for each of the periods; here the spread in the results is very small and cannot be seen.

Figure 10 .
Figure 10. Predictions of the RC when trained with observations for (a) 3 months lead time, (b) 6 months lead time and (c) 9 months lead time. All panels show the mean of the results obtained from 300 simulations.

Figure 11 .
Figure 11. (a) Box plot showing the distribution of the results for the criticality index C based on the RC trained with observations, using the best sets of hyperparameters determined for the different lead times. (b) The corresponding damping rates. Every box plot shows the distribution of the results obtained for 300 different simulations. The variable analyzed is NINO3.4.
a change of variables was introduced, i.e. S = T_2 − T_1, T = T_1 − T_r, h = h_1 + H − z_0 = h_1 + K (16). With these new variables the JT model can be rewritten in dimensionless form; the final dimensionless version of the Jin-Timmermann model is given in section 2.1. The parameters of the dimensionless version of the model are defined in the following way:

δ = rbL/(ζh*β), ρ = ϵh*β/(rbL), a = αbL/(ϵβh*), c = (T_r − T_r0)/(2S_0), k = (H − z_0)/h*, S_0 = T_0 = h*β/(bLµ), h_0 = h*, t_0 = bL/(βζh*).

For the ZC model (appendix B), u = (u, v) is the mean horizontal velocity over the mixed and shallow layers, (τ^x, τ^y) are the two components of the surface wind stress, h is the total upper-layer thickness, a_m is a linear damping coefficient and g′ = g∆ρ/ρ is the reduced gravity; c_0 = √(g′H) is the equatorial Kelvin wave speed. The two boundary conditions are ∫_{−∞}^{+∞} u(0, y, t) dy = 0 and u(L, y, t) = 0.

Table 1 .
Six-time-step lead time prediction performance of the RCs on data from the system (8), using 1000 training time steps, both the features x and y, and the best set of hyperparameters.

Table 2 .
RC's six-month lead time prediction performances on stochastic JT model data, using a different number of years of training data, different feature sets, and in each case the best set of hyperparameters.

Table 3 .
RC's six-month lead time prediction performance for T_E based on the stochastic ZC model data, using 60 years of training data and the best set of hyperparameters. All the results refer to the RC skill for the variable T_E.

Table 4 .
MAE for the 9-month lead time predictions of the RC for the strongest El Niño and La Niña events of the last 20 years of the CESM time series. The bottom table shows the MAE for the 9-month lead time predictions of the RC and the CESM's NINO3.4 index over the whole test period (model years 1280-1300).

Table 5 .
MAE of the 9-month lead time predictions of the RC for the strongest El Niño and La Niña events of the 20-year test period. The bottom table shows the MAE using the RC predictions and the observed NINO3.4 index over the whole test period for three different lead times.