Reducing reservoir computer hyperparameter dependence by external timescale tailoring

Task specific hyperparameter tuning in reservoir computing is an open issue, and is of particular relevance for hardware implemented reservoirs. We investigate the influence of directly including externally controllable task specific timescales on the performance and hyperparameter sensitivity of reservoir computing approaches. We show that the need for hyperparameter optimisation can be reduced if timescales of the reservoir are tailored to the specific task. Our results are mainly relevant for temporal tasks requiring memory of past inputs, for example chaotic timeseries prediction. We consider various methods of including task specific timescales in the reservoir computing approach and demonstrate the universality of our message by looking at both time-multiplexed and spatially-multiplexed reservoir computing.


Introduction
Reservoir computing is a machine learning method that has received a lot of research attention recently.There are several reasons for this.Firstly, the learning method is very simple; only the output layer is trained via linear regression [1,2].This makes the implementation of the reservoir computing relatively straightforward and therefore accessible for researcher form a wide range of backgrounds.Secondly, reservoir computing has been shown to perform well on predicting complex dynamics [3,4,5,6].And, thirdly, due to the simplicity of the training method it is well suited for hardware implementation [7,8].In particular, time-multiplexed reservoir computing of delaybased reservoirs has been a popular method of hardware implementation.The idea to utilize a feedback delay loop instead of a network was first published in [9] and has since then been implemented and tested for opto-electronic [9,10,11,12], optical [13,14,15,16,17,18] and electromechanical [19] realizations.We also refer the reader to recent reviews on this topic [20,21,22,8,3,23].
An issue that remains is the lack of an efficient method for optimising the parameters (hyperparameters) of the reservoir in a task specific manner.This is particularly relevant for hardware implemented reservoir computing, where parameters can be difficult to tune.There have been many studies that have looked into various approaches including grid searches, Bayesian optimisation, genetic algorithms, or new network measures that correlate well with the reservoir performance [24,25,26,27,28].However, these approaches can greatly increase the computational cost of the training phase, or are not suitable for hardware implementation.In this paper we aim to contribute to solving the issue of hyperparameter optimisation, with a particular focus on hardware implementation.Specifically, we investigate the influence of augmenting the memory of the reservoir at either the input or the output level such that task relevant timescales are present in the system.
For the case of delay-based reservoirs it is known that the feedback delay-time can be an important tuning parameter, as it directly influences the memory of the reservoir [29,30].In the delay-based reservoir computing community, the delay-time is mostly chosen to be either resonant with the time-scale of the input clock-cycle or offset from it by one input mask interval.However, depending on the task, other values of the delay-time are beneficial [31,32].We therefore also address the influence of tuning the memory of the reservoir directly via the internal delay as one possible optimization step.
The fact that task relevant timescales play an important role in timeseries prediction tasks is well established and in the context of reservoir computing is also closely related to Takens embedding [33,34].In this paper we highlight how we can externally add task relevant timescales by adding additional delay lines at the input or output layer.These approaches are universal for all reservoir computing realizations, including electrical, optical and digital implementations, and should not significantly decrease the speed or the energy efficiency in hardware implementation.We show that by externally adding the required memory to the reservoir, the performance can be improved over the hardware relevant parameter spaces, thereby reducing the need for hyperparameter optimisation.
When optimising the timescales of a reservoir for a particular task, the discretisation of the input timeseries also plays an important role.The authors of [35] have performed a comprehensive investigation of closed-loop timeseries prediction using ESNs.They consider the influence of the discretisation of the input timeseries and look at the difference between providing full and partial information.While, in [36] the authors consider the relationship between discretisation and dynamic timescales.We also discuss the influence of the discretisation here, but with a focus on the resulting memory requirements on the reservoir.
There have been several other recent studies into various methods of augmenting the reservoir computing approach.In [37] and [38] the authors concatenate the state matrix with randomly selected delayed system states and the authors of [39] use different delayed output information for improving action recognition with a frequency multiplexed reservoir computing system.In [40] trainable state delays are investigated.In [41] and [42] concatenation of the state matrix with past states is studied in the context of reservoir size reduction.The authors of [43] highlight the possibility of forgoing the reservoir by directly filling the state matrix with nonlinear transforms of past inputs, in which case the memory can be freely chosen.This nonlinear vector regression approach, referred to as "next generation reservoir computing", can produce accurate results for standard benchmark tasks at a relatively low computational cost, however the hardware implementability is greatly reduced and the optimisation of hyperparameters is traded for the selection of inputs and nonlinear functions [44].In the context of hardware implemented delay-based reservoir computing, recent efforts toward performance enhancement include using ensembles of delay-based reservoirs [45], sequentially coupled delay-based reservoirs [46,47] or delay-coupled networks [31].Our contribution is to highlight the importance of task specific timescales, and to show how these can easily and tunably be incorporated in hardware implemented reservoir computing.
The paper is structured as follows.The concepts of reservoir computing and the external memory augmentation approaches are introduced in Sec. 2 and 3, respectively.In Sec. 4 we briefly discuss timeseries descretisation.The reservoir computing models and tasks used in this study are introduced in Sec. 5.The results showing the improved performance with augmented memory and the influence of the input timeseries discretisation are presented in Sec.6, before we conclude in Sec.7.

Reservoir Computing
Reservoir computing is a scheme wherein the input response of a dynamical system (the reservoir) is sampled a number of times and these sampled system responses are then linearly combined to construct an output.During the training phase, the weights of the linear combinations of the system responses (output weights) are trained via linear regression.This is done by feeding the reservoir a sequence of K T inputs and collecting S responses of the reservoir to each input into a state matrix S ∈ R K T ×R S+1 , adding a bias term of one to each row, and then finding the vector of weights w ∈ R S+1 that minimises the difference between the target output y ∈ R K T and the output ŷ = Sw ∈ R K T .The solution to this problem is given by using the Moore-Penrose pseudoinverse.Equation (1) includes Tikhonov regularisation, with the regularisation parameter λ and identity matrix I ∈ R S+1 × R S+1 .

Spatially-multiplexed Reservoir Computing
The responses of the reservoir can either be multiplexed in space or in time (or combinations whereof).Echo-state networks are an example of spatially multiplexed reservoirs [1].In this scenario, there are a number of spatially separated dynamical nodes which evolve discretely in time.These nodes are fed with a weighted input signal and the responses of some (or all) of the nodes are collected into the state matrix.The evolution of such a reservoir can be described by an equation of the form where x (k) ∈ R M is the vector describing the state of the M nodes at time k, W int ∈ R M × R M is the matrix of internal reservoir node couplings (adjacency matrix), I (k) is the input signal at time k, and w in ∈ R M is the vector of input weights.The function f (•) is a nonlinear function which is applied element wise and commonly referred to as the activation function.In Eq. ( 2) the input is one dimensional, i.e. there is a single temporal input signal.To generalise to p input signals, I (k) becomes a vector i in (k) ∈ R p and the vector of input weigths w in becomes a matrix The k th row of the state matrix is filled by sampling S of the reservoir states x (k), with S ≤ M .For S = M the elements of the state matrix are s k,j = x j (k) with x j (k) (j ∈ [0, ..S)) being the elements of the reservoir state vector x (k).
Spatially continuous reservoirs are also possible.In such a case the reservoir evolution would be described by a suitable system of partial differential equations, and the state matrix would be filled by sampling discretely in space (and in time).

Time-multiplexed Reservoir Computing
In the case of time-multiplexed reservoir computing, typically, a masked input signal is sequentially fed into the reservoir and the S reservoir responses needed to fill the state matrix are sampled sequentially in time.The input mask is analogous to the input weights in the spatially multiplexed case, but rather than simultaneously feeding the input signal into spatially separated nodes with varying weights, the input signal is fed into the same node multiple times with varying weights (mask values).If I (k) is the task-dependent input signal at time k, then the masked reservoir input is given by where i = Sk + j, k ∈ [0, ..K T ), j ∈ [0, ..., S) and m 1 (j) are the mask values.The state matrix entry corresponding to the input J (i) is given by s k,j = x (i) = x (k, j), where x (i) is the sampled reservoir response.If the reservoir is a time-continuous system, then the discrete input sequence can be transformed into a piecewise constant function J (t) by holding each input constant for a fixed time θ: In such a case the time for one input mask sequence, θS, is typically referred to as the clockcycle T [29].In this study we consider a time discrete reservoir, see Sec. 5.1.1, in which case the dimensionless clockcycle T is equal to the number of input mask values S.
Here we have taken the number of mask values to be equal to the number of sampled reservoir responses S.This is not a necessary condition, but is the typical choice.We adhere to this convention for simplicity.We also note that, in delay-based reservoir computing the number of sampled reservoir responses S is often referred to as the number of virtual nodes.

Reservoir Computing Memory Augmentation Methods
The memory requirements of timeseries prediction tasks are very task specific and can vary greatly, even for tasks using the same input signal but trained on different target signals, e.g. for different prediction horizons.To perform well, the reservoir must have memory on the relevant timescales.This means that there must be information of past values of the input signal in the k th row of the state matrix in order to accurately predict the k th target y (k).Note that here we are using the term "memory" to refer to any influences of past inputs on the state of the reservoir, i.e. also nonlinear transforms of past inputs.In the typical reservoir computing scheme, this memory comes from the dynamics of the reservoir.In this study we compare methods of augmenting the memory of a reservoir.There are many ways in which this can be done and the choice of method comes down to the objective one aims to achieve, which means there is always a tradeoff between computational cost and absolute performance.The motivation of this study is to reduce the need for hyperparameter optimisation at a low computational cost, as well as allowing for easy hardware implementation.To this end, we investigate two simple schemes of augmenting the memory externally at the levels of the input and the output layer of the reservoir.These two methods are described below.

Delayed-input
We consider the influence of one additional time-delayed input implemented as in [48,49].The input function Eq. ( 3) is modified to where n I is the input-delay and m 2 (j) is the mask applied to the delayed version of the input sequence, I (k − n I ).This delayed input scheme is illustrated in Fig. 1b.
In terms of a photonic setup, such a delayed input could be implemented by adding a delay-loop before the reservoir, as depicted in Fig. 1a.Since n I describes the input shift via the number of k-steps, the actual length of the delay line depends on the sampling interval.

Delayed State Matrix Concatenation (Output delay)
On the level of the output layer, a method of including a task-dependent delay, is to concatenate the current and delayed versions of the reservoir responses, i.e. to concatenate the state matrix with a delayed version of itself: where n S is the output delay that described how many steps (index k) we go back to collect the input for the state matrix, see Fig. 1b.This approach has recently been investigated in [42] and [41] in the context of reducing the size of a recurrent neural network reservoir.
In order to make fair comparisons with other methods it is important to consider whether one should compare results using the same reservoir sampling dimension S (number of virtual nodes), or whether this should be halved such that the final state matrix dimensions are the same.The latter case has the advantage that the dynamical system can be smaller, or that the input clockcycle can be reduced in the case of time multiplexed reservoir computing.In this study we choose to keep the reservoir sampling dimension S the same because in the context of hardware implemented reservoirs we believe comparing the same sized physical reservoirs is fairer.
In a photonically implemented reservoir, this approach could be incorporated by including an additional delay-loop after the reservoir, as sketched in Fig. 1a.Again, the physical length of the delay line is given by the product of n S and the sampling time.

Input Discretisation
Another approach to improve the reservoir performance by incorporating task specific memory requirements is to modify the task itself.When constructing a time series model from data, the discretisation ∆t (sampling step) of the input timeseries can have a significant influence on the model and on the resulting performance [50,36].In the context of reservoir computing, the input discretisation influences the memory required of the reservoir and is related to Takens (delay) embedding [33].Recent studies have shown that under certain conditions reservoirs perform a delay embedding of the input timeseries [34,51].For this to occur, and to yield accurate results, the reservoir must have memory on the appropriate time scales, which are related to the dynamics of the task [52].This means that if the discretisation of the input timeseries is finer than is necessary, the reservoir will need to be able to remember input steps which lie further in the past in order for it to perform the delay embedding.The impact of different ∆t is discussed in Sec.6.2.

Reservoir models
We investigate the influence of the memory augmentation methods focusing mainly on a time-multiplexed delay-based reservoir.The methods are, however, not specific to such reservoirs.We therefore also show an example of a spatially multiplexed reservoir (i.e. a random recurrent network reservoir).Both reservoirs are described below.

Time-delayed Reservoir Model
The time-delayed reservoir model we consider is the following iterative map: where G (•) is the nonlinear activation function, J (i) is the input sequence given by Eq. ( 3), K is the strength with which the current iteration is coupled with a past state of the system and N is the feedback delay (see Fig. 1a).We remind the reader that the index i runs over all mask steps and input points (i = Sk + j).
Motivated by photonic reservoir implementations, we use a nonlinear function which models a semiconductor optical amplifier (SOA) for the activation function.An SOA can be described by a dynamical differential equation model which describes the lightmatter interaction and the complex carrier dynamics [53,54].In our case, only the static characteristic is important, which is why we use the simple input-response of an SOA that is given by [55]: Using this activation function, Eq. ( 6) describes the input-response of an SOA with weak time-delayed feedback in the limit where the time between mask steps is much longer than the characteristic time-scale of the carrier dynamics in the SOA, i.e. the limit that the reservoir reaches a steady state before it is sampled at the end of each mask step.That means, we do not include any dynamic response of the SOA and all the memory of the setup comes from the delay-loop.A sketch of such a setup is depicted in Fig. 1a.
For the input masks in Eqs. ( 3) and (4) we use a set of S randomly drawn values from a uniform distribution between zero and one, U (0, 1), and multiply these by a scaling factor G α , i.e. m α (j) = G α U α j for α = 1, 2 and U α j sampled from U (0, 1), where G 1 is the input scaling of the current input and G 2 the scaling factor for the delayed input in Eq.( 4).
Unless stated otherwise, the parameters for the time-delayed reservoir are as given in Tab. 1.
where Θ (•) is the Heaviside step function.The internal node coupling weights are given by W int = K r W u int where the entries of W u int are chosen randomly from a uniform distribution between zero and one, and K r is a scaling factor that determines and coupling strength and the spectral radius ρ.Analogous to the time-delayed reservoir, the input weights are also sampled from U (0, 1) and scaled by a factor G 1 .
The motivation for our choice of random network is not to achieve the best performance possible, but rather to demonstrate the universality of our results on a reservoir with very different properties to the time-delayed reservoir described above.

Tasks
We have chosen three standard benchmark tasks to demonstrate the influence of the memory augmentation methods; multi-step-ahead prediction of the Mackey-Glass chaotic timeseries, one-step-ahead prediction of the X variable of the Lorenz 63 system [56] and cross prediction of the Z variable of the Lorenz 63 system.Another method of characterising the nonlinear transform and memory capabilities of reservoir computers is to calculate the information processing capacity [57].However, since it is difficult to relate the information processing capacities to task specific performance, particularly for tasks requiring temporally correlated inputs [58], we have chosen leave this approach for future studies.The Mackey-Glass and Lorenz tasks are described below.
In this study we focus mainly on ∆s = 12 step-ahead prediction of this system with an input discretisation of ∆t = 1 (open-loop).To create the input sequence I (k), we generate a timeseries by numerically integrating Eq.( 9) using a Runge-Kutta fourth order method with Hermitian interpolation for the mid-steps of the delayed terms and with a time step of h = 10 −2 .This timeseries is then sampled with a time step of ∆t = 1.The corresponding target sequence is given by I (k + ∆s).We have chosen ∆t = 1 as our default sampling, as this is a commonly used discretisation and it is small enough that finer features of the time series can be resolved, as can be seen by the green dots in Fig. 2a.It is however not the optimal discretisation for delay-embedding [50].
The chaotic Mackey-Glass timeseries is shown in Fig. 2a, along with its corresponding auto-correlation function in Fig. 2b.The dynamics are chaotic, but exhibit clear oscillations on a scale of approximately 50 time units.For the reservoir to reconstruct the dynamics of this timeseries, it would need to have memory on the order of 0.1-0.5 of these features [52].Furthermore, the autocorrelation shows that points separated by one time unit are highly correlated.This indicates that, with an input discretisation of ∆t = 1, consecutive inputs will contain redundant information and the reservoir will need to be able to retain information from about 10 steps into the past for a delay embedding to be performed.We therefore also consider other combinations of ∆t and ∆s to compare the influence of the discretisation of the input signal on the memory requirements of the reservoir and the performance.

Lorenz 63 Time-series Prediction
The Lorenz system [56] is given by With c 1 = 10, c 2 = 28 and c 3 = 8/3 this system exhibits chaotic dynamics as can be seen for the X and Z coordinate in Fig. 3a and b.We use the X variable, sampled with a step size of ∆t = 0.1 (Fig. 3 orange dots), as the input I (k) for two time series prediction tasks.The first task is one step-ahead (∆s = 1) prediction of the X variable.
The second task is cross prediction of the Z variable (∆s = 0).For the cross prediction task we also consider a discretisation of ∆t = 0.02 (Fig. 3 green dots).We generate the Lorenz timeseries using Runge-Kutta fourth order numerical integration with a time step of h = 10 −3 .When doing timeseries prediction using an ordinary differential equation system like the Lorenz 63 system, there is a qualitative difference between the tasks where all dynamical variables are provided as input and tasks where this is not the case.Using the Lorenz 63 system, a comparitive study of these two cases was made in [35].When all variables are provided, no memory of past steps is needed to determine the evolution of the dynamics.However, if only partial information is provided, then the reservoir must perform a delay embedding [34].For our study, only the latter case is relevant as we only provide one of the three dynamical variables for the Lorenz 63 task (and not the complete history for the Mackey-Glass task).The sampling step for an optimal embedding for the Lorenz 63 system is about 0.1 [50,36].Figure 3c shows the autocorrelation of the X variable (green line) and the cross-correlation between the X and Z variables (orange line) for the chaotic dynamics (Fig. 3a,b) with the parameters given above.We will see later on in Sec.6.3 that the differences in the correlation between the input and the target for the X (t) → X (t + ∆t) and X (t) → Z (t) tasks have important consequences for the optimal discretisation.

Error Measure
We quantify the reservoir computing performance using the normalised root mean squared error (NRMSE), defined as where y k are the target values, ŷk are the outputs produced by the reservoir computer, K T is the number of testing steps (i.e. the length of the vector y) and var (y) is the variance of the target sequence.

Simulation Parameters
For all tasks we rescale the input and target sequences to the range between zero and one.Before beginning the training and testing phases for these tasks, we initialise the reservoirs by feeding in 10000 inputs I (k).In the training phase we use 10000 inputs while in the testing phase 5000 inputs are used.All results presented in the next section show the testing error averaged over Z R realisations of the random input masks (or input weights).For the NRMSE results the associated error bars are the standard deviation of the Z R realisations.

Mackey-Glass task performance comparison
We are interested in the influence of including task specific timescales in the reservoir computing scheme.The delayed-input and delayed-state matrix methods described above can be applied to all reservoir computing schemes.However, when using a delaybased reservoir, we also have the option of directly tuning the memory properties by changing the feedback delay time.For the reservoir used here (Eq.( 6)), this corresponds to the parameter N .In most delay-based reservoir computing studies, the authors typically choose N = T (delay-time τ = T in the time-continuous case) or N = T + 1 (τ = T + θ) [9,11,13,14,62,45,63,64], where T is the input clockcycle.It has been shown that resonances between the feedback delay time and clock-cycle are generally detrimental to the reservoir computing performance [32,30,29], we therefore choose the desynchronised configuration N = T + 1 (N = S + 1) as our default and compare its impact with the memory augmentation methods.The other parameters of the timedelayed reservoir are the feedback strength K, the input scaling G 1 and the parameters of SOA the nonlinearity g 0 and a.The latter two are determined by properties of the SOA and are therefore not freely tunable.Since we are motivated by hardware implementability, in the following, we only consider the influence of K and G 1 , while keeping the material specific parameters g 0 and a fixed.Figure 4a shows the NRMSE for the Mackey-Glass 12-step-ahead prediction task as a function of the feedback strength K and the input scaling G 1 for a default reservoir consisting of S = 10 virtual nodes.The parameter ranges are chosen such that the characteristics of the nonlinearity span from linear to highly nonlinear and the feedback strength covers experimentally realisable values.Over the entire parameter plane, relatively poor performance is achieved (orange colors in Fig. 4a).Increasing the number of virtual nodes to S = 500 (Fig. 4e), the minimum NRMSE is decreased from 0.37±0.05 to 0.064 ± 0.002.However, this comes at a cost of a factor 50 longer clockcycle T = S and requires sensitive tuning of the reservoir parameters.
To look at the influence of a delayed-input, as described in Sec.3.1, we select the parameters corresponding to the lowest error in Fig. 4a (Fig. 4e) and perform a grid search in the input-delay n I and input scaling G 2 .Please see Fig. A1 in Appendix A for the corresponding scans.The n I and G 2 parameters for which the lowest error is achieved (n I = 8, G 2 = 0.6) are then used for the G 1 -K scan shown in Fig. 4b (Fig. 4f).The delayed-input parameters are therefore not optimised for each pair of values in the G 1 -K parameter plane, however, we have checked that the optimal n I does not vary significantly (see Fig. A2 in Appendix A).With the delayed input included, not only is the minimum error reduced, the variation of the error in dependence of G 1 and K is also greatly reduced (mostly green colors in Fig. 4f).For the delayed-state matrix concatenation method described in Sec.3.2, we use the same approach as above, i.e. we optimise the state matrix delay n S for the optimal G 1 ,K pair from Fig. 4a (Fig. 4e) and then use this value, n S = 8, to produce Fig. 4c (Fig. 4g).The results of optimising the internal delay of the reservoir, again using the same approach as above, are shown in Fig. 4d (Fig. 4h).The optimised delay values for the Mackey-Glass 12 step ahead prediction task (∆t = 1) are n I = 8, n S = 8 and an internal delay length of about 5 times the clock cycle N = 5S + 1.
The colour distributions in the three rightmost columns of Fig. 4, show that all three methods of augmenting or changing the memory of the reservoir to optimise the performance for this Mackey-Glass task lead to a large decrease in the error over most of the depicted G 1 -K parameter plane (the minimum error achieved is always depicted in the inset of each panel).To summarise the results of Fig. 4, and similar calculations for virtual node numbers between S = 10 and S = 500, we show the minimum error as a function of the number of virtual nodes S in Fig. 5a.To quantify the sensitivity of the absolute error to the reservoir parameters G 1 and K we introduce a quantity P 0.1 , which is the percentage of the depicted G 1 -K plane for which the NRMSE is under a threshold value of 0.1.Figure .5b shows P 0.1 as a function of the number of virtual nodes S (results for a NRMSE threshold of 0.05 (P 0.05 ) are given in Fig. B1 in Appendix B). Figure 5 shows that, regardless of the method, including a task relevant time scale vastly improves the performance over a wide range of reservoir parameters.The smallest error in Fig. 5 is achieved when the internal delay N of the reservoir is optimised (for S = 500).Whereas, the least parameter sensitivity is achieved using the delayed state matrix concatenation method (dark green points in Fig. 5b).Here we would like to note that there are several ways in which one could define a parameter sensitivity.We have chosen to quantify the sensitivity with respect to an absolute error, rather than, for example, with respect to the minimum error in each case.Our motivation for this is that for real world applications one would typically have an absolute threshold for what in dependence of the virtual node number S for the default delay based setup (black), with delayed input G 2 = 0.6 and n I = 8 (orange), with delayed output n S = 8 (green) and for optimal internal delay N = 5S + 1 (grey).Parameters: Z R = 50 for S < 100 and as in Table 1.
error is tolerable.We do, however, show an example of the sensitivity defined relative to the respective minimum error in Fig. B2 in Appendix B, and similar trends are found.
It should be noted that which of the memory augmentation methods produces the best results is strongly task dependent.This is because all methods lead to a different nonlinear mixing of the input signal.An example where the delayed-input method out performs the delayed state matrix concatenation method in terms of the sensitivity measure P 0.1 is shown in Fig. B4 in Appendix B where a Mackey Glass task one step prediction with ∆t = 12, ∆s = 1 was used.The most pronounced differences compared with Fig. 5 are seen for the default case with N = S + 1.This is due to the larger discretisation ∆t = 12 enabling a much better delay embedding of the Mackey Glass timeseries and therefore the non-optimized internal delay configuration already leading to good prediction results.The next section will focus on this issue.

Relationship between optimal delays, task and reservoir
The optimal augmentation delays (n I and n S ) depend on the requirements of the task and on the memory properties of the reservoir.We demonstrate this by altering the Mackey-Glass task.Firstly, by changing ∆s for a fixed discretisation ∆t (Fig. 6) and secondly, by altering ∆s and ∆t under the constraint that ∆s × ∆t is fixed, i.e. that the time span of the prediction step is constant (Fig. 7).We then consider three implementations of the reservoir which have very different recall abilities.Results for these cases are shown in Fig. 6 and Fig. 7.In Fig. 6a and Fig. 7a the feedback strength K is set to zero.In this limit the reservoir itself is a single layer feedforward network, i.e. an extreme learning machine [65,60], and has no memory on its own.Figure 6b and Fig. 7b show results for the default configuration, where the internal delay is one time step longer than the input clockcycle.In this configuration the system behaves similar to a simple ring [29]; from one clockcycle to the next neighbouring nodes couple and the rate of information decay is determined by K.In Fig. 6c and Fig. 7c the internal delay is about five times as long as the input clockcycle, N = 5S + 1.This means that the reservoir has zero ability to recall information between one and four steps into the past.In all three cases the output delay n S of the state matrix concatenation adds additional memory.
For K = 0, Fig. 6a clearly shows that as ∆s is changed, so does the optimal input delay, (n s corresponding to the darkest green).The sum between optimal output delay and predicted step ∆s, however stays constant until ∆s = 19 leading to a z-like behaviour of the optimal output delay.For the default internal delay configuration (Fig. 6b), the optimal state matrix delay is shifted slightly to lower values compared with the K = 0 case.This is because the state matrix delay compensates for the memory that is missing in the reservoir and for the default case there is memory due to the coupling to past reservoir states.When the internal delay is N = 5S + 1 (Fig. 6c) the reservoir is already well suited for delay embedding the Mackey-Glass input signal, therefore adding the state matrix concatenation has only a very small influence.Only slight improvements can be seen for n S values that are multiples of 5.
When the input discretisation ∆t is changed then the required memory can vary substantially, even if the prediction horizon ∆s × ∆t stays constant.This can be seen most clearly in Fig. 7a by the varying position of the minimum error along the n S axis.However, n S is given in units of ∆t and the optimal value of n S in units of the original Mackey-Glass time series stays constant at about n opt S • ∆t = 9 for the extreme learning limit K = 0.For the default delay configuration in Fig. 7b additional minima are seen.Inspecting the minima of the curves in Fig. 7b, it can also be seen that different input discretisations result in different lowest errors (∆t = 3 and ∆t = 4 are the best choices here), which was not the case in the extreme learning limit in Fig. 7a.Thus, the optimal discretisation depends on the memory of the reservoir.
In Fig. 7c, an internal delay configuration of N = 5S + 1 was chosen which was optimal for the discretisation of ∆t = 1 (cyan curve).If the discretisation is changed, N is no longer optimized, and adding an output delay can now provide the missing memory.In general, a clear predictive relationship between the optimal augmentation delays, the task requirements and the properties of the reservoir are non-trivial.We thus recommend treating the delays as tuning parameters, guided by Takens embedding.

Lorenz results
So far we have used various versions of the Mackey-Glass timeseries prediction task.In this section we present a brief summary of results obtained using the Lorenz 63 system.We do this to make clear that the influence of the memory augmentation methods is not specific to the Mackey-Glass system and to show that the benefits of such methods are related to the memory requirements of the task.In Table 2 we show the minimum NRMSE, P 0.1 and P 0.05 , which we have calculated using the same optimisation procedure as in Sec.6.1.For the X(t) → X (t + ∆t) task the benefits of the memory augmentation methods are marginal.The state matrix concatenation method shows the best performance, both in terms of the absolute error and the parameter sensitivity.However, we remind the reader that the effective output dimension is doubled when the state matrix concatenation is used (see Sec. 3.2).For the cross-prediction task, X (t) → Z (t), the memory augmentation methods greatly improve the results when ∆t = 0.02, but lead to only a slight improvement for ∆t = 0.1.However, better performance is achieved for the ∆t = 0.1 case.This is because ∆t = 0.1 is the optimal delay for delay embedding the Lorenz 63 system [35,50].This means that with ∆t = 0.02 the reservoir needs to retain information from further in the past and, hence, augmenting the memory has a greater impact.
Although not shown here, in contrast to the cross-prediction task, the performance for the X(t) → X (t + ∆t) task improves when ∆t is smaller than 0.1.This can be understood by looking at the autocorrelation plot in Fig. 3c.For small lags the autocorrelation of X (t) is high, whereas the cross-correlation with Z (t) is close to zero even for small lags.This means that the X(t) → X (t + ∆t) task becomes more linear and therefore easier as ∆t approaches zero, however, for the X (t) → Z (t) task this is not the case.

Random Network
Delay based reservoirs and the more commonly used ESNs, or random recurrent networks, have very different memory properties.In the case of the ESN the memory is determined indirectly by the random topology of the reservoir and the spectral radius, and will mostly be monotonically decreasing [66].However, for a delay-based reservoir, such as the one used here, the memory can be directly tuned by changing the ratio of the delay and the input clocktime, and will not decrease monotonically if the delay is sufficiently large compared with the clocktime and the memory of the solitary nonlinear node [31,30,29].These differences, however, have no bearing on the influence of the memory augmentation methods.We demonstrate this in Fig. 8 which shows one example of the performance improvement that can be achieved with the delayed state matrix concatenation method.Compared with the time-delayed reservoir (see Fig. 4), the overall performance is worse because our chosen nonlinearity produces a binary output (Sec.5.1.2).We also note, that for Fig. 8 we did not optimise the delay n S , we simply set it to n S = 8, it is therefore possible that the results could be improved further.However, despite this a stark improvement is obtained over the entire parameter range when the delayed state matrix concatenation is included.

Conclusion
We have shown that including task specific timescales can significantly improve the performance of a reservoir computer for timeseries prediction tasks.While the timescale itself matters, the method of including it does not play such an important role for the overall performance and should be chosen according to the requirements of the specific task.Our focus was on a simple means of improving the performance, which reduces the need for hyperparameter tuning in the context of hardware implementation.There are advantages and disadvantages to all methods.The state matrix concatenation method  has the advantage that it does not influence the reservoir as it comes after the reservoir sampling step.This is likely the reason why this method showed the best performance in terms of the hyperparameter sensitivity.It also has the advantage that it only introduces one tuning parameter, the delay.However, if this method is to be implemented in hardware, the previous reservoir states need to be stored and the state vector is twice as large.The delayed input method, on the other hand, has the comparative disadvantage that it introduces two tuning parameters, the delay and the input scaling.However, this method produces very similar results to the state matrix concatenation without an increased effective output dimension.In the case of a delay-based reservoir, tuning the internal delay is also an option, but can be difficult in practise, particularly in photonic systems which can be very feedback phase sensitivity [67,68,69].Lastly, one can optimise the discretisation of the input signal, but here one must also consider the desired output discretisation.
Extensions or variations of the methods discussed here are possible and some have already been study [62,42,41,70,37,38].Whether or not to apply such methods once again comes down to the objective and a trade-off between simplicity and absolute performance.

Appendix A. Delay optimisation for augmented memory
Figure A1 shows the grid search and parameters scans performed to find the optimal internal delay N , the optimal input delay n I and the best state matrix concatenation delay n S .For the internal delay scan N was increased in steps of S starting at N = 1.   1.
now shows much better performance.This is because less redundant information is being feed into the reservoir and the memory requirements are reduced.The performance is  1.
still improved when a delay-input or delayed state matrix concatenation is included, however the impact is reduced and the optimised delays are smaller (now n I = 4 and n S = 4, compared with n I = 8 and n S = 8 for the ∆t = 1 and ∆s = 12 case).We also note that, in terms of the sensitivity measure P 0.1 , the delayed input method out performs the delayed state matrix concatenation method for large node numbers, even though the effective state matrix dimension is only half as big.This also holds for P 0.05 , see Fig. B4.  1.

Figure 1 .
Figure 1.Sketch of (a) a possible implementation for a delay-based reservoir with additional delayed input and state matrix concatenation (delayed output), and (b) an example of the input/output data processed in these external delay schemes to predict a target (orange crosses and bended arrows indicate the situation for n S = 6, n I = 2, ∆s = 1).

Figure 4 .
Figure 4. Time-delayed reservoir -NRMSE for the Mackey-Glass 12-step-ahead prediction task (∆t = 1, ∆s = 12) color coded in dependence of the feedback strength K and the input scaling G 1 for two reservoir sizes, S = 10 (upper row (a-d) with Z R = 50) and S = 500 (lower row (e-h) with Z R = 10) and 4 different delay schemes (columns): (b,f) delayed input with G 2 = 0.6, n I = 8, (c,g) delayed output with n S = 8, and (d,h) optimal internal delay N = 5S + 1.The minimum error for each subplot is given in the insets and the corresponding G 1 -K values are indicated by the red crosses.Remaining parameters as in Tab. 1.

Figure 5 .
Figure 5. Impact of reservoir size -(a) Minimum NRMSE and (b) parameter sensitivity P 0.1 for the Mackey-Glass 12-step-ahead prediction task (∆t = 1, ∆s = 12) in dependence of the virtual node number S for the default delay based setup (black), with delayed input G 2 = 0.6 and n I = 8 (orange), with delayed output n S = 8 (green)and for optimal internal delay N = 5S + 1 (grey).Parameters: Z R = 50 for S < 100 and as in Table1.

Figure 6 .
Figure 6.Impact of prediction horizon on output delay -NRMSE for the Mackey-Glass prediction task (∆t = 1) color coded as a function of the prediction step ∆s and the externally added state matrix delay n S .(a) extreme learning machine (K = 0), (b) default internal delay setup, and (c) optimal internal delay setup.Parameters: S = 50, Z R = 50 and remaining parameters as in Tab. 1.

Figure 7 .
Figure 7. Optimal output delay -Impact of discretisation on the NRMSE for the Mackey-Glass prediction task for constant prediction horizon ∆s • ∆t as a function of the state matrix delay n S .(a) extreme learning machine (K = 0), (b) default internal delay of S = N + 1, and (c) internal delay of 5S + 1. Parameters: S = 50, Z R = 50 and remaining parameters as in Tab. 1.

Figure 8 .
Figure 8. Random network -NRMSE for the Mackey-Glass 12-step-ahead prediction task in dependence of the spectral radius ρ and the input scaling G 1 .Parameters: n S = 8, S = 500 and Z R = 10.

Figure A1 .
Figure A1.Optimizing delays -NRMSE for the Mackey-Glass 12-step-ahead prediction task (∆t = 1) in dependence of (a) state matrix concatenation delay n S , (b) internal delay N (normalized to the number of virtual nodes S and chosen off-resonant), and (c) input delay n I and input scaling G 2 .For parameters see Tab. 1.

Figure B3 .
Figure B3.Time-delayed reservoir -(a) Minimum NRMSE and (b) sensitivity measure P 0.1 for the Mackey-Glass ∆s = 1 step-ahead prediction task with ∆t = 12 in dependence of the number of virtual nodes S for the default delay based setup (black), with delayed input G 2 = 0.6 and n I = 8 (orange), and with delayed output n S = 8 (green).Parameters: Z R = 50 for S < 100 and as in Table1.

Figure B4 .
Figure B4.Impact of reservoir size -Sensitivity measure P 0.05 for the Mackey-Glass ∆s = 1 step-ahead prediction task with ∆t = 12 in dependence of the number of virtual nodes S for the default delay based setup (black), with delayed input G 2 = 0.6 and n I = 8 (orange), and with delayed output n S = 8 (green).Parameters: Z R = 50 for S < 100 and as in Table1.

Table 1 .
Dimensionless time-delay reservoir and input parameters.
R 105.1.2.Random Network Reservoir ModelWe use a random network as described by Eq. (2) with