Hierarchical Architectures in Reservoir Computing Systems

Reservoir computing (RC) offers efficient temporal data processing with a low training cost by separating recurrent neural networks into a fixed network with recurrent connections and a trainable linear network. The quality of the fixed network, called the reservoir, is the most important factor determining the performance of the RC system. In this paper, we investigate the influence of hierarchical reservoir structures on the properties of the reservoir and the performance of the RC system. Analogous to deep neural networks, stacking sub-reservoirs in series is an efficient way to enhance the nonlinearity of the data transformation to high-dimensional space and to expand the diversity of temporal information captured by the reservoir. These deep reservoir systems offer better performance compared to simply increasing the size of the reservoir or the number of sub-reservoirs. Low frequency components are mainly captured by the sub-reservoirs in the later stages of the deep reservoir structure, similar to the observation that more abstract information is extracted by the later layers of deep neural networks. When the total size of the reservoir is fixed, the tradeoff between the number of sub-reservoirs and the size of each sub-reservoir needs to be carefully considered, due to the degraded ability of individual sub-reservoirs at small sizes. The improved performance of the deep reservoir structure alleviates the difficulty of implementing RC systems in hardware.


Introduction
Due to dramatically increased computing power, advances in algorithms, and abundantly available data, machine learning, especially deep learning [1], has recently been applied successfully in a wide range of artificial intelligence domains, from computer vision to natural language processing. Depending on the task, different deep learning architectures have been optimized. For instance, convolutional neural networks (CNNs) [2][3][4] are mainly used for static data processing such as image recognition, since individual input images are independent and there is no need to capture correlations between them. On the other hand, temporal data processing such as time-series prediction and speech recognition is typically performed by recurrent neural networks (RNNs) [5][6][7], since the recurrent connections allow the network to capture temporal features in the sequential input data.
Although the cyclic connections in RNNs are useful for processing the temporal dynamics of input signals, these connections make training RNNs computationally expensive and difficult. In error backpropagation through time, the standard method to train RNNs, calculating the gradients of the error with respect to the weights involves a large amount of computation because the current states depend not only on the current inputs but also on the previous states and inputs. Furthermore, when applying the chain rule to compute the gradients, repeatedly multiplying the derivatives of the current states with respect to the previous states can cause vanishing or exploding gradient problems [8] that make it difficult to find optimal weights. Long short-term memory (LSTM) [6] was developed to deal with these problems by enforcing constant error flow through the internal states of LSTM units. As the weight matrices of the internal gates are trained by gradient-based learning algorithms, LSTM networks have been applied in a broad range of applications by fine-tuning the network for a given task, but training the weight matrices nevertheless still incurs significant computational cost.
Reservoir computing (RC) [7], [9] was proposed to reduce the training cost and mitigate the difficulty of training RNNs. Since the expense and difficulty of training RNNs come from the attempt to control the recurrent connections, RC systems avoid this problem by tuning only the linear output layer (often called the readout layer) instead of training every weight in the recurrent connections of the main RNN body. In RC, the recurrent network, called the 'reservoir', is randomly initialized and fixed during the training process. The reservoir needs to offer the fading memory property and should nonlinearly map the temporal inputs into a high-dimensional feature space, represented by the states of the nodes forming the reservoir. Due to the recurrent connections in the reservoir, this nonlinear mapping also involves time. As a result, inputs that are complex in their original form can become linearly separable in the new, high-dimensional reservoir state space. Software-based RC systems have achieved state-of-the-art performance for tasks such as speech recognition [10] and have shown a superior ability to forecast large spatiotemporally chaotic systems [11]. Recently, hardware implementations of RC systems have paved the way for applying RC to real-time operating hardware systems in power-constrained environments [12][13].
With the successful implementation of RC algorithms and hardware, several methods [14][15][16][17][18] to design the reservoir have been proposed to improve the performance of RC systems. A conventional reservoir is initialized randomly with a single set of hyperparameters such as input scaling and spectral radius, such that the states of the reservoir nodes are strongly coupled and it is difficult to process data with different time scales. Decoupled echo state networks (ESNs) [15] can break the coupling between the nodes by dividing the reservoir into sub-reservoirs and introducing a lateral inhibition unit between the sub-reservoirs. Unlike the conventional random process to build the reservoir, deterministic methods can also be used to generate reservoirs, such as the simple cycle reservoir concept [16]. Deterministically constructed reservoirs not only achieve performance comparable to conventional randomized reservoirs but also allow reservoir properties to be analyzed in a more comprehensive manner. Additionally, reservoirs using a single physical nonlinear node subjected to delayed feedback [17] have been proposed to allow hardware-friendly implementation of deterministic reservoirs, similar to the simple cycle reservoir concept used in power-efficient hardware systems.
Besides the efforts to improve the single reservoir structure, several studies have applied hierarchical structures to RC systems. Early attempts [18][19] to introduce architectural modifications to the reservoir included adding feedforward or static layers to amplify the nonlinearity through additional nonlinear mapping of outputs from the recurrent part, and to overcome the tradeoff between short-term memory and nonlinearity of a single reservoir structure. With the great success of hierarchical architectures in convolutional neural networks [2][3][4], several works [20][21][22][23] have proposed deep reservoir architectures that stack several sub-reservoirs to capture multiscale dynamics and to increase the richness of the reservoir. MESM [20] and DeepESN [21], [22], which stack the sub-reservoirs directly, and DeePr-ESN [23], which alternately stacks reservoirs and encoders (requiring an extra training process), can provide robust prediction performance and effectively capture rich multiscale dynamics. Although the general properties of deep reservoir architectures have been analyzed, emphasis was normally placed on the overall performance rather than the hardware cost. In reality, simply increasing the size of the reservoirs or the depth of the deep reservoir architecture to achieve better performance may not be optimal given the constraints of the hardware system. For example, routing the devices acting as reservoir nodes will become increasingly challenging, making it difficult to significantly expand the size of the reservoirs.
In this paper, we show that deep RC systems that stack sub-reservoirs vertically without increasing the overall reservoir size can improve the system performance for a broad range of tasks, by analyzing the distribution of node states in each sub-reservoir and the frequency spectrum captured by each sub-reservoir. To focus on the effects of hierarchical architectures, three different structures based on the ESN are compared: Shallow ESN, which has a single reservoir; Wide ESN, which has independent sub-reservoirs in parallel; and Deep ESN, which stacks the sub-reservoirs in series. As the optimal hyperparameters for each structure may differ, a genetic algorithm is used to obtain the best performance of each structure by individually finding its optimal hyperparameters. Moreover, we find that there is a tradeoff between the number of sub-reservoirs and the size of the sub-reservoirs when the size of the readout network is fixed.

Architecture
A conventional ESN consists of three basic components: an input layer, a recurrent layer (i.e. the reservoir), and an output layer (called the readout layer). The reservoir state, which is the collection of the node states in the reservoir, depends on the input signal and the previous reservoir state. Without losing generality, in our study we choose the following equation to describe the reservoir state update:

$$x(t) = (1-\alpha)\, x(t-1) + \alpha \tanh\big(W_{\mathrm{in}}\, u(t) + W\, x(t-1)\big) \quad (1)$$

where $u(t) \in \mathbb{R}^{N_u}$ and $x(t) \in \mathbb{R}^{N_x}$ denote respectively the input and the reservoir state at time $t$, $W_{\mathrm{in}} \in \mathbb{R}^{N_x \times N_u}$ is the input-to-reservoir weight matrix, $W \in \mathbb{R}^{N_x \times N_x}$ is the recurrent reservoir weight matrix, $\tanh(\cdot)$ is the element-wise applied hyperbolic tangent activation function, and $\alpha \in (0,1]$ is a leaky rate. The weight values in $W_{\mathrm{in}}$ are chosen from a uniform distribution over $[-IS, IS]$, where $IS$ is an input scaling factor.
$W$ is rescaled from the randomly generated $W_0$, following:

$$W = SR \cdot \frac{W_0}{\lambda_{\max}(W_0)} \quad (2)$$

where $SR$ is the spectral radius, $W_0$ is chosen from a uniform distribution over $[-1,1]$, and $\lambda_{\max}(W_0)$ is the largest eigenvalue of matrix $W_0$. $IS$, $SR$, and $\alpha$ are the important hyperparameters affecting the reservoir properties. Typically, $IS$ determines how nonlinear the node states are: a small $IS$ makes the reservoir nodes operate around the origin, where the $\tanh(\cdot)$ activation function is virtually linear, while a more nonlinear transformation can be achieved with a large $IS$ that lets the reservoir nodes operate near 1 or -1. $SR$ affects the stability of the node states and should be less than unity to ensure the echo state property, a central characteristic of the RC system which states that the reservoir state is uniquely defined by the fading history of the input signal. $\alpha$ is related to the speed of the reservoir update dynamics: a small $\alpha$ means the reservoir nodes depend more on the previous reservoir state than on the current input, resulting in a slow update speed. In our approach, these three hyperparameters for each sub-reservoir will be optimized to minimize the error using the genetic algorithm, as discussed later. The output of the ESN is a linear combination of the node states:

$$y(t) = W_{\mathrm{out}}\, x(t) \quad (3)$$

where $y(t) \in \mathbb{R}^{N_y}$ denotes the output at time $t$, and $W_{\mathrm{out}} \in \mathbb{R}^{N_y \times N_x}$ is the reservoir-to-output weight matrix. In contrast to the fixed weight matrices $W_{\mathrm{in}}$ and $W$ in the reservoir, $W_{\mathrm{out}}$ is trained by ridge regression, also known as regression with Tikhonov regularization.

Figure 1 shows the three different ESN structures used to analyze the architectures, namely, Shallow ESN, Wide ESN, and Deep ESN. Shallow ESN is the conventional single reservoir structure, which is governed by (1)-(3). Wide ESN has independent sub-reservoirs in parallel, which receive the input signals independently and have no interaction between the sub-reservoirs. When the sub-reservoirs have different hyperparameters, each sub-reservoir will capture different temporal dynamics from the input signals, and thus Wide ESN may offer better performance than Shallow ESN. The equations of Wide ESN are the following:

$$x^{(i)}(t) = (1-\alpha^{(i)})\, x^{(i)}(t-1) + \alpha^{(i)} \tanh\big(W_{\mathrm{in}}^{(i)}\, u(t) + W^{(i)}\, x^{(i)}(t-1)\big) \quad (4)$$

$$y(t) = W_{\mathrm{out}}\, \big[x^{(1)}(t); \cdots; x^{(N)}(t)\big] \quad (5)$$

where $x^{(i)}(t) \in \mathbb{R}^{N_x}$ denotes the reservoir state of sub-reservoir $i$ at time $t$, $W_{\mathrm{in}}^{(i)} \in \mathbb{R}^{N_x \times N_u}$ is the input-to-reservoir weight matrix of sub-reservoir $i$, $W^{(i)} \in \mathbb{R}^{N_x \times N_x}$ is the recurrent reservoir weight matrix of sub-reservoir $i$, $\alpha^{(i)} \in (0,1]$ is the leaky rate of sub-reservoir $i$, $W_{\mathrm{out}} \in \mathbb{R}^{N_y \times N N_x}$ is the reservoir-to-output weight matrix ($N$ being the number of sub-reservoirs), and $[\,\cdot\,;\cdots;\,\cdot\,]$ stands for vertical vector concatenation.
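To make the update and readout concrete, the following NumPy sketch builds a Shallow ESN according to (1)-(3) and trains the readout by ridge regression. The function names, the ridge coefficient, and the assumption of a one-dimensional input are our own illustrative choices rather than details from the paper.

```python
import numpy as np

def init_reservoir(n_in, n_x, IS, SR, seed=0):
    """Randomly initialize W_in and W as in (1)-(2); IS and SR are placeholder values."""
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-IS, IS, size=(n_x, n_in))          # input scaling
    W0 = rng.uniform(-1.0, 1.0, size=(n_x, n_x))
    W = SR * W0 / np.max(np.abs(np.linalg.eigvals(W0)))    # rescale to spectral radius SR
    return W_in, W

def run_reservoir(u, W_in, W, alpha):
    """Collect reservoir states for a 1-D input sequence using the leaky update (1)."""
    x = np.zeros(W.shape[0])
    states = []
    for u_t in u:
        x = (1 - alpha) * x + alpha * np.tanh(W_in @ np.atleast_1d(u_t) + W @ x)
        states.append(x.copy())
    return np.array(states)                                 # shape (T, n_x)

def train_readout(states, target, ridge=1e-6):
    """Ridge regression (Tikhonov regularization) for the readout weights, as in (3)."""
    X = states
    return np.linalg.solve(X.T @ X + ridge * np.eye(X.shape[1]), X.T @ target)
```

Under the same assumptions, a Wide ESN can be simulated by calling run_reservoir once per independent sub-reservoir (each with its own IS, SR, and leaky rate) and concatenating the resulting state matrices column-wise before training a single readout, as in (4)-(5).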
Deep ESN is the hierarchical ESN structure that stacks the sub-reservoirs in series. Only the first sub-reservoir can see the input signals, and the subsequent sub-reservoirs receive data from the output of the previous sub-reservoir, in the form of a linear combination of the previous sub-reservoir's node states. If stacking the sub-reservoirs improves the system's capability in processing temporal information, Deep ESN will outperform both Shallow ESN and Wide ESN. The equations of Deep ESN are the following:

$$x^{(1)}(t) = (1-\alpha^{(1)})\, x^{(1)}(t-1) + \alpha^{(1)} \tanh\big(W_{\mathrm{in}}^{(1)}\, u(t) + W^{(1)}\, x^{(1)}(t-1)\big)$$
$$x^{(i)}(t) = (1-\alpha^{(i)})\, x^{(i)}(t-1) + \alpha^{(i)} \tanh\big(W_{\mathrm{inter}}^{(i)}\, x^{(i-1)}(t) + W^{(i)}\, x^{(i)}(t-1)\big), \;\; i > 1 \quad (6)$$

$$y(t) = W_{\mathrm{out}}\, \big[x^{(1)}(t); \cdots; x^{(N)}(t)\big] \quad (7)$$

where $x^{(i)}(t) \in \mathbb{R}^{N_x}$ denotes the reservoir state of sub-reservoir $i$ at time $t$, $W_{\mathrm{in}}^{(1)} \in \mathbb{R}^{N_x \times N_u}$ is the input-to-reservoir weight matrix for the first sub-reservoir, $W_{\mathrm{inter}}^{(i)} \in \mathbb{R}^{N_x \times N_x}$ is the inter-reservoir weight matrix from sub-reservoir $i-1$ to sub-reservoir $i$, $W^{(i)} \in \mathbb{R}^{N_x \times N_x}$ is the intra-reservoir weight matrix of sub-reservoir $i$, $\alpha^{(i)} \in (0,1]$ is the leaky rate of sub-reservoir $i$, $W_{\mathrm{out}} \in \mathbb{R}^{N_y \times N N_x}$ is the reservoir-to-output weight matrix, and $[\,\cdot\,;\cdots;\,\cdot\,]$ stands for vertical vector concatenation.
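A corresponding sketch of the Deep ESN update in (6)-(7) is given below; it reuses the conventions of the previous snippet, and the way the layer matrices are packaged into a list is an implementation choice of ours, not a prescription from the paper.

```python
import numpy as np

def run_deep_esn(u, layers, alphas):
    """Sketch of (6)-(7): only the first sub-reservoir is driven by the input u(t);
    each subsequent sub-reservoir is driven by the states of the previous one.
    `layers` is a list of (W_forward, W_intra) pairs, where W_forward is W_in for the
    first layer and the inter-reservoir matrix for deeper layers."""
    states = [np.zeros(W.shape[0]) for _, W in layers]
    collected = []
    for u_t in u:
        drive = np.atleast_1d(u_t)
        for i, (W_fwd, W) in enumerate(layers):
            states[i] = (1 - alphas[i]) * states[i] + alphas[i] * np.tanh(
                W_fwd @ drive + W @ states[i])
            drive = states[i]             # the next layer sees this layer's current state
        collected.append(np.concatenate(states))   # readout uses every sub-reservoir, eq. (7)
    return np.array(collected)
```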
The main function of the reservoir in an RC system is to nonlinearly map the sequential input signals onto the reservoir state, which allows certain tasks such as classification and prediction to be processed by the readout network using a linear combination of the reservoir node states. Thus, the quality of the reservoir can be measured in two aspects: how well the reservoir can nonlinearly transform the input signal, and how diverse the temporal dynamics captured by the reservoir are. As the performance of the RC system is dominantly affected by how the reservoirs are designed, the hyperparameters for building the reservoirs should be carefully selected to ensure the designed reservoirs produce the best result under the given evaluation criteria. To prevent the node states from becoming easily saturated and to ensure the echo state property, $IS$, $SR$, and $\alpha$ are set to be less than 1, as shown in Table Ⅰ. Grid search is widely used to analyze the influence of hyperparameters on RC performance, and is a simple way to design an optimal single reservoir system since the hyperparameter space of a single reservoir is not large. However, when the RC system has several sub-reservoirs, as in Wide ESN and Deep ESN, grid search is not an efficient optimization method because the computing cost to test every point in the hyperparameter space increases exponentially with the dimension of the space. Thus, in this paper, a genetic algorithm [23][24], one of the widely used evolutionary optimization methods, is used to find the optimal set of hyperparameters for a given RC architecture.
Genetic algorithms can efficiently explore the large hyperparameter space through a metaheuristic optimization process inspired by biological evolution. Specifically, a population of hyperparameter sets for a given RC architecture is randomly generated and evolves toward better reservoir structures through an iterative process, where the population at each iteration is called a generation. In each generation, two sets from the population are randomly selected, and two RC systems based on the selected sets are evaluated for a given task by the normalized root mean squared error (NRMSE):

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{t=1}^{T}\big(\hat{y}(t) - y(t)\big)^2}{\sum_{t=1}^{T}\big(\hat{y}(t) - \bar{\hat{y}}\big)^2}} \quad (8)$$

where $\hat{y}(t)$ and $y(t)$ denote the target signal and the output from the RC system at time $t$, $\bar{\hat{y}}$ is the mean of the target signal, and $T$ is the number of time steps in the validation sequence. The loser set, which has a higher error than the other (winner) set, goes through two genetic operations: crossover, which replaces part of the loser with the corresponding part of the winner based on a crossover rate, and mutation, which changes part of the loser to random values based on a mutation rate. The modified loser replaces the original loser in the population, and this process is repeated until the maximum number of generations is reached. The parameters for the genetic algorithm are summarized in Table Ⅰ.
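A minimal sketch of this tournament-style genetic algorithm is shown below. The encoding of each hyperparameter set as a vector in [0, 1], the population size, and the rates are illustrative placeholders rather than the settings in Table Ⅰ; `evaluate` stands for any function that builds an RC system from a hyperparameter vector and returns its validation NRMSE.

```python
import numpy as np

def nrmse(y_pred, y_target):
    """Normalized root mean squared error, as in (8)."""
    return np.sqrt(np.mean((y_target - y_pred) ** 2) / np.var(y_target))

def tournament_ga(evaluate, n_params, pop_size=32, generations=500,
                  crossover_rate=0.5, mutation_rate=0.1, seed=0):
    """Binary-tournament GA over hyperparameter vectors in [0, 1]."""
    rng = np.random.default_rng(seed)
    pop = rng.uniform(0.0, 1.0, size=(pop_size, n_params))
    for _ in range(generations):
        a, b = rng.choice(pop_size, size=2, replace=False)
        winner, loser = (a, b) if evaluate(pop[a]) < evaluate(pop[b]) else (b, a)
        cross = rng.random(n_params) < crossover_rate      # crossover: copy winner genes
        pop[loser, cross] = pop[winner, cross]
        mutate = rng.random(n_params) < mutation_rate      # mutation: randomize some genes
        pop[loser, mutate] = rng.uniform(0.0, 1.0, size=int(mutate.sum()))
    return min(pop, key=evaluate)                          # best hyperparameter vector found
```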
We evaluate RC systems with different architectures on three different time series: the 10th-order nonlinear autoregressive moving average (NARMA10) system [25], the Santa Fe Laser time series [26], and the Mackey-Glass time series [27]. The NARMA10 system is widely used to test ESN models since it requires both nonlinear transformation and memory, which are the main characteristics of RC systems [16][17][18], [23]. The output of the NARMA10 system is computed as follows:

$$y(t+1) = 0.3\, y(t) + 0.05\, y(t) \sum_{i=0}^{9} y(t-i) + 1.5\, u(t-9)\, u(t) + 0.1 \quad (9)$$

where $u(t)$ is a random input at time step $t$, generated from a uniform distribution over $[0, 0.5]$, and $y(t)$ is the output at time step $t$. The readout network is trained to predict $y(t+1)$ from the reservoir state and $u(t)$. The lengths of the training, validation, and test sequences are 3000, 100, and 1000, respectively. The first 100 reservoir state values, called the initial transient, are not used to train the readout network or to predict the next value in the test stage, to avoid the influence of initial states. The Santa Fe Laser task is a one-step-ahead prediction on data obtained by sampling the intensity of a far-infrared laser in a chaotic regime [16], [18]. The data are normalized to the interval $[0,1]$, and the lengths of the training, validation, test, and initial transient sequences are 3000, 1000, 1000, and 100, respectively. The Mackey-Glass time series is a standard benchmark for chaotic time series prediction [7], [12], [18], [20], [22]. The time series is defined by the following differential equation:

$$\frac{dy(t)}{dt} = \frac{0.2\, y(t-\tau)}{1 + y(t-\tau)^{10}} - 0.1\, y(t) \quad (10)$$

where $y(t)$ is the output at time $t$, and $\tau$ is the time delay. When $\tau > 16.8$, the system has a chaotic attractor. In this study, we set $\tau = 17$ and define the task as 84-step-ahead direct prediction, i.e., the readout network is trained to predict $y(t+84)$ from the reservoir state when $y(t)$ is given. The lengths of the training, validation, test, and initial transient sequences are 1000, 1000, 1000, and 100, respectively.
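For reference, the benchmark sequences might be generated as in the sketch below, following the standard forms of (9) and (10); the Euler integration step and the initial condition for Mackey-Glass are simplifications we assume here, and no washout handling is included.

```python
import numpy as np

def narma10(T, seed=0):
    """Generate an input/output pair for the NARMA10 system (9)."""
    rng = np.random.default_rng(seed)
    u = rng.uniform(0.0, 0.5, size=T)
    y = np.zeros(T)
    for t in range(9, T - 1):
        y[t + 1] = (0.3 * y[t] + 0.05 * y[t] * np.sum(y[t - 9:t + 1])
                    + 1.5 * u[t - 9] * u[t] + 0.1)
    return u, y

def mackey_glass(T, tau=17, dt=1.0, y0=1.2):
    """Integrate the Mackey-Glass delay equation (10) with a simple Euler scheme."""
    delay = int(tau / dt)
    y = np.full(T + delay, y0)
    for t in range(delay, T + delay - 1):
        y_tau = y[t - delay]
        y[t + 1] = y[t] + dt * (0.2 * y_tau / (1 + y_tau ** 10) - 0.1 * y[t])
    return y[delay:]
```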
In the genetic algorithm, the NRMSE is used to determine which hyperparameter set is the winner, as shown in Figure 2(a). Additionally, the distribution and frequency spectrum of the node states are measured to study the properties of the different reservoir structures and to explain why a specific reservoir structure performs better than others. To achieve better performance, the RC system should nonlinearly transform the input signals into a high-dimensional space and allow the readout network to linearly separate the corresponding reservoir states. Although it is easy to decide whether a transformation is linear or nonlinear, it is difficult to quantify the degree of nonlinearity in a single evaluation term. Here, the distribution of node states is analyzed, since nodes whose states lie far from the origin have gone through a more nonlinear transformation than nodes whose states remain close to the origin. After finding the optimal hyperparameter set for a given reservoir structure and task, the node states over the test sequence are recorded, and the mean and standard deviation of the node states are calculated as shown in Figure 2(a). To clearly visualize the difference between sub-reservoirs, the node states are grouped according to the sub-reservoir they belong to, and the means of the node states in the same sub-reservoir are sorted in ascending order.
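The node-state statistics described above can be computed directly from the recorded state matrix; a sketch under our own naming is given below, where `sizes` lists how many nodes belong to each sub-reservoir.

```python
import numpy as np

def sorted_node_statistics(states, sizes):
    """Per-node mean and standard deviation over the test sequence, grouped by
    sub-reservoir and sorted in ascending order of the mean, as in Figure 2(a)."""
    means, stds, start = [], [], 0
    for n in sizes:
        block = states[:, start:start + n]           # states of one sub-reservoir
        order = np.argsort(block.mean(axis=0))
        means.append(block.mean(axis=0)[order])
        stds.append(block.std(axis=0)[order])
        start += n
    return means, stds
```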
Another important property of the RC system is the ability to capture diverse temporal dynamics with different time scales. The diversity of temporal dynamics can be analyzed by examining the differences in the frequency components each sub-reservoir captures. The multiple superimposed oscillator (MSO) task [15], [28][29] is used to analyze the multiple-timescale processing ability of the different reservoir structures. The MSO12 time series, given by a sum of sinusoidal functions, is used:

$$u(t) = \sum_{i=1}^{12} \sin(\varphi_i\, t) \quad (11)$$

where $\varphi_i$ determines the frequency of the $i$-th sinusoidal function. The coefficients are set as in [28], i.e. $\varphi_1 = 0.2$, $\varphi_2 = 0.331$, $\varphi_3 = 0.42$, $\varphi_4 = 0.51$, $\varphi_5 = 0.63$, $\varphi_6 = 0.74$, $\varphi_7 = 0.85$, $\varphi_8 = 0.97$, $\varphi_9 = 1.08$, $\varphi_{10} = 1.19$, $\varphi_{11} = 1.27$, $\varphi_{12} = 1.32$. As shown in Figure 2(b), the frequency spectra of each node state after being excited by the MSO12 input are calculated by the fast Fourier transform (FFT).
The FFT values are averaged on a sub-reservoir-by-sub-reservoir basis. To emphasize which frequency components are mainly captured by a specific sub-reservoir, the peak values of the 12 different frequency components are normalized by the minimum among them.
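A sketch of this spectral analysis is given below, assuming the MSO12 input of (11) and a state matrix recorded as above; the use of NumPy's real FFT and the peak-picking by nearest frequency bin are our own choices.

```python
import numpy as np

def mso12(T, phis):
    """MSO12 input (11): a sum of 12 sinusoids with angular frequencies phis."""
    t = np.arange(T)
    return np.sum([np.sin(phi * t) for phi in phis], axis=0)

def subreservoir_spectra(states, sizes):
    """Average the magnitude spectra of the node states sub-reservoir by sub-reservoir."""
    T = states.shape[0]
    freqs = np.fft.rfftfreq(T, d=1.0)                      # cycles per time step
    spectra = np.abs(np.fft.rfft(states, axis=0))          # one spectrum per node
    out, start = [], 0
    for n in sizes:
        out.append(spectra[:, start:start + n].mean(axis=1))
        start += n
    return freqs, out

def normalized_peaks(freqs, spectrum, phis):
    """Peak value at each MSO component (frequency phi/(2*pi)), normalized by the minimum."""
    peaks = np.array([spectrum[np.argmin(np.abs(freqs - phi / (2 * np.pi)))]
                      for phi in phis])
    return peaks / peaks.min()
```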

Results
In this section, we will compare the performance of the three different reservoir structures on time series prediction tasks, and discuss the effects of stacking sub-reservoirs on the properties of the RC system.
As the performance of an RC system is strongly affected by the size of the readout network, we first fixed the size of the readout network to 300, corresponding to a single reservoir with 300 nodes (Shallow ESN), three independent sub-reservoirs with 100 nodes each (Wide ESN), and three stacked sub-reservoirs with 100 nodes each (Deep ESN). Figure 3(a), (b), and (c) show the NRMSEs of the three different reservoir structures when performing the NARMA10, Santa Fe Laser, and Mackey-Glass time series tasks, respectively. The box charts show the results from 10 different randomly generated reservoir systems with the hyperparameter set optimized by the genetic algorithm. In all cases, Deep ESN shows the best performance among the different reservoir structures. In contrast, compared to Shallow ESN, Wide ESN shows slightly better performance for the first two tasks but worse performance for the Mackey-Glass time series.

Figure 2
Schematic of the evaluation methods. (a) NRMSE and the distribution of node states. NRMSE is measured from the difference between the prediction of the test sequence and the ground truth. The temporal response (red line) of a node (e.g. represented by the red circle) is recorded for a test sequence. The mean and standard deviation of the states from each node in a sub-reservoir are plotted and sorted in ascending order by the mean value. (b) The frequency spectrum of the node states. When the MSO12 time series is applied to the reservoir, the FFT of the states from each node is calculated (e.g. red, blue lines) and averaged (black line) on a sub-reservoir-by-sub-reservoir basis. The peak values of the 12 different frequency components are normalized by the minimum among them.
To find out why Deep ESN performs better than the others, the distributions of node states are plotted in Figure 4. The nodes in the third (i.e. last) sub-reservoir of Deep ESN show a wider spectrum than not only the nodes in the first two sub-reservoirs of Deep ESN, but also the nodes in every sub-reservoir of the other tested structures. This indicates that the nodes of the third sub-reservoir in Deep ESN go through a higher degree of nonlinear transformation. Considering that the nodes in deeper sub-reservoirs of Deep ESN have a wider spectrum than the nodes in the preceding sub-reservoirs, the wide range of node states in Deep ESN can be attributed to its stacked sub-reservoir structure.
Besides the absolute magnitude of the node state range, the diversity of the ranges among the sub-reservoirs is also a useful indicator of the quality of the reservoir structure, because having different ranges between the sub-reservoirs means that the node states in each sub-reservoir go through different degrees of nonlinear transformation, so that the RC system can capture broader temporal dynamics. It can be seen from Figure 4 that Wide ESN has a slightly larger range and diversity of node states than Shallow ESN, but the enhancement is not as significant as in Deep ESN. Thus, a more efficient way to achieve a high degree of diverse, nonlinear transformation is to stack the sub-reservoirs in series rather than to build independent sub-reservoirs in parallel.
Another method to estimate the temporal dynamics captured by each sub-reservoir is the frequency spectrum of the node states, which shows which frequency components are picked up by each sub-reservoir. In our study, the simple but distinguishable frequency spectrum of the MSO12 input signal makes it easier to recognize the distinct frequency spectra among the reservoir structures, as shown in Figure 5. While Wide ESN shows similar frequency spectra among its sub-reservoirs, the sub-reservoirs in Deep ESN show significantly different frequency spectra. Specifically, the sub-reservoirs in the late stages of Deep ESN tend to capture the low frequency components rather than the high frequency components. Physically, this can be attributed to the fading memory property of the (sub-)reservoir, which essentially performs a temporal integration of the input sequence. The resulting reservoir state, reflecting the integrated input, is then used as the input to the next stage, allowing lower frequency components from the original input to be captured more efficiently in the later stages of Deep ESN. Analogous to the fact that the layers in the late stage of CNNs capture more abstract or global information, i.e. lower spatial frequency features of the input signal, the sub-reservoirs in the late stages of Deep ESN thus emphasize the low temporal frequency components, or the global (long-term) temporal information. We also note that, although CNNs and Deep ESNs assign similar roles to the layers in the late stage, they utilize the early-stage layers very differently. Typically a CNN only uses the outcome of the last layer, since its goal is to derive an abstract representation of the input signal, such as identifying the object, and the outcome of the last layer is typically enough for the task. On the other hand, Deep ESN needs to utilize the reservoir states of every sub-reservoir, since a wide range of temporal information from low to high frequency components is usually required to perform temporal data processing tasks such as predicting the next value of a time series.
The simplest way to improve the performance of an RC system is to increase the total size of the reservoir, since larger reservoirs with more diverse internal connections should capture richer temporal dynamics. If stacking the sub-reservoirs in series helps the RC system predict the time series more accurately, Deep ESN should still outperform the others when the total size of the reservoir increases. To expand the size of the reservoir, extra sub-reservoirs with 100 nodes are serially added to the output of the last sub-reservoir in Deep ESN, or connected as parallel, independent sub-reservoirs in Wide ESN. Shallow ESN still has a single reservoir, but the size of the reservoir increases to match the total size of the expanded Deep ESN or Wide ESN. Figure 6(a), (b) and (c) show the NRMSEs of the three different reservoir structures on the NARMA10, Santa Fe Laser, and Mackey-Glass time series, respectively, as the total size of the reservoir increases from 200 to 500. Again Deep ESN shows the lowest NRMSE among the tests, and Wide ESN is better than Shallow ESN in most cases, consistent with the findings for the case with 300 total nodes. In general, as the total number of nodes in the reservoir increases, the performance of the RC system improves no matter which reservoir structure is used, but the improvement in prediction error tends to saturate. Moreover, when considering both the performance measured in NRMSE and the computation cost measured by the size of the weight matrices, the performance improvement gained by simply increasing the reservoir size may not justify the increased expense needed to compute the reservoir states and train the readout network.
Instead of adding extra sub-reservoirs to Wide ESN or Deep ESN, we can instead fix the total size of the reservoir (thus keeping the total compute cost low) and vary the number and size of the sub-reservoirs. While Shallow ESN always has a single reservoir with 300 nodes, we varied the number of sub-reservoirs in Wide ESN and Deep ESN from 2 to 5 while keeping the total number of nodes constant at 300. Due to the fixed total node count, the cost to train the readout network is identical for the three different reservoir structures, but the expense to calculate the node states is reduced as the weight matrix size in Wide ESN or Deep ESN shrinks (for example, a single 300-node reservoir has 300² = 90,000 recurrent weights, whereas three 100-node sub-reservoirs require only 3 × 100² = 30,000 intra-reservoir weights). Figure 7 shows the NRMSEs of the three structures as the number of sub-reservoirs varies. For the NARMA10 and Santa Fe Laser tasks, Wide ESN and Deep ESN show a similar trend with respect to the number of sub-reservoirs, i.e. the performance improves when the reservoir consists of 2 or 3 sub-reservoirs, but becomes worse when the reservoir is divided into more than 2 or 3 sub-reservoirs. This can be explained by the tradeoff between the number of sub-reservoirs and the size of each sub-reservoir. Given the fixed total number of nodes, the size of the sub-reservoirs is reduced as the number of sub-reservoirs increases. When the reservoir is split into several sub-reservoirs, the temporal dynamics captured by the whole system become more diverse, since each sub-reservoir can have different hyperparameters leading to different temporal dynamics. Moreover, stacking the sub-reservoirs enhances the nonlinearity of the transformation from the input signal and the diversity of the temporal dynamics captured by each sub-reservoir, as shown in Figures 4 and 5. However, at the same time, the feature space of each sub-reservoir will shrink due to the reduced size of the sub-reservoirs, i.e. the ability of the sub-reservoirs to extract temporal information will be weakened.
The optimal number and size of the sub-reservoirs depend on both the reservoir structure and the task. For example, the result for the Mackey-Glass time series shown in Figure 7(c) differs from the trends observed in Figure 7(a) and (b). As the number of sub-reservoirs increases, Deep ESN performance always improves, but Wide ESN performance becomes worse. This clear task dependence originates from the intrinsic characteristics of each task. Figure 8 compares the frequency spectra of the input signals of the three tasks. As shown in (9), the input of NARMA10 is randomly generated and is not affected by the output. Thus, the frequency spectrum of the NARMA10 input signal has no major frequency dependence. In contrast, the Santa Fe Laser and Mackey-Glass time series have clear frequency peaks, since in these tasks the input is the previous output, not an independent variable. Importantly, the major peaks of the Mackey-Glass time series are located in the low frequency range rather than spreading out evenly, as is the case for the Santa Fe Laser time series. As shown in Figure 8, the Mackey-Glass task relies heavily on low frequency components, which the sub-reservoirs in the late stages of Deep ESN capture efficiently; dividing the reservoir into more sub-reservoirs therefore keeps improving Deep ESN, whereas Wide ESN, whose independent sub-reservoirs do not gain this benefit, becomes worse when having more sub-reservoirs.
In the cases of the NARMA10 and Santa Fe Laser time series, the improvement comes from having more diverse temporal dynamics rather than from extracting more information from the low frequency components, as neither task has major peaks in the low frequency range. Thus, although Deep ESN still outperforms Shallow ESN and Wide ESN, the benefit of capturing more low-frequency signals is not fully utilized, and the Deep ESN performance degrades when the sub-reservoir size becomes too small. Following similar reasoning, even small differences among the independent sub-reservoirs of Wide ESN contribute to the improvement of Wide ESN over Shallow ESN for these two tasks.
Finally, we compare our Deep ESN with other networks. For the task of predicting the Mackey-Glass time series 84 steps ahead, the Deep ESN (NRMSE = 0.034) outperforms the reported performance of feed-forward neural networks (NRMSE = 0.038) [30] and LSTM (NRMSE = 0.470) [31] because of the Deep ESN's ability to capture diverse temporal dynamics. The small number of units and trainable parameters (4 units and 113 trainable parameters, respectively) may contribute to the reported LSTM performance being worse than that of the feed-forward neural networks. However, although the performance of the LSTM network can be improved by increasing the number of units, the training cost will also increase dramatically because the weight matrix size of the internal gates increases quadratically. Moreover, the gradient-based learning algorithm is computationally more expensive than ridge regression, which is the learning rule commonly used for RC systems. Among reported methods to design RC systems, DeePr-ESN (NRMSE = 9.08E-04) [23] shows better performance than the Deep ESN analyzed here, but DeePr-ESN inserts additional encoder layers between sub-reservoirs, which requires extra training for auto-encoders based on extreme learning machines. The larger total size of DeePr-ESN (8 sub-reservoirs with 300 nodes each vs. 5 sub-reservoirs with 60 nodes each for the Deep ESN) is another factor behind the better DeePr-ESN performance.

Discussion
Hierarchical reservoir structures are investigated in this study to understand how different reservoir architectures affect the performance of the RC system. While Shallow ESN can pick up only limited temporal dynamics from the time-series input due to its single set of hyperparameters, Wide ESN and Deep ESN can capture more diverse temporal dynamics by designing each sub-reservoir with a different set of hyperparameters. The enhancement of Wide ESN is not as significant as that of Deep ESN, since the genetic algorithm has no explicit pressure to assign different hyperparameters to the sub-reservoirs of Wide ESN. Other hyperparameter optimization methods that force each sub-reservoir to have distinct hyperparameters may help improve the Wide ESN performance. In addition to the enhanced diversity in temporal dynamics, Deep ESN provides a higher degree of nonlinear transformation, as well as the ability to efficiently capture low frequency components. The genetic algorithm, which already led to good optimization results in DeePr-ESN [23], allows the large hyperparameter space of the hierarchical structures to be explored with reasonable computational cost, so that optimized hyperparameter sets for Deep ESN can be obtained.
In addition to the nonlinearity and diversity of the reservoir transformation, the memory property of a reservoir structure can be evaluated by the memory capacity (MC), which measures the ability of the reservoir to store and recall previous inputs by computing correlations between the delayed inputs and the reconstructed outputs [32][33][34][35]. MC can be affected by the reservoir's hyperparameters such as input scaling and spectral radius. It was found that the maximum MC can be obtained with a linear reservoir in the absence of numerical errors [33]. Although an ideal reservoir should have a high degree of nonlinearity and a large MC, there is a tradeoff between nonlinearity and MC [34][35] because reconstructing the delayed input is a kind of linear mapping from the original input to the delayed input, which conflicts with the nonlinearity property. Similarly, we observed the tradeoff between nonlinearity and MC in this study. For instance, Deep ESN shows a higher degree of nonlinearity compared to Wide ESN and Shallow ESN in the NARMA10 task, as shown in Figure 4, but produces a smaller MC (20.17±0.83) than Wide ESN (21.35±0.58) and Shallow ESN (23.64±0.47). However, the large hyperparameter space offered by the hierarchical structure means Deep ESNs can also provide a large MC by optimizing the hyperparameters, and an MC larger than that of Wide ESN and Shallow ESN has been reported in previous studies [21]. In other words, depending on which property (e.g. nonlinearity, MC) is more critical for a specific task, the desired property can be enhanced in Deep ESN with proper hyperparameter optimization.
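As a reference for how MC is typically estimated, the sketch below follows the standard definition (the squared correlation between each delayed input and its linear reconstruction, summed over delays); the maximum delay and ridge coefficient are assumptions on our part, and a washout period would normally be discarded before this computation.

```python
import numpy as np

def memory_capacity(states, u, max_delay=100, ridge=1e-8):
    """Sum over delays k of the squared correlation between u(t - k) and its
    reconstruction from the reservoir state x(t) via a ridge readout."""
    mc = 0.0
    n_x = states.shape[1]
    for k in range(1, max_delay + 1):
        X, target = states[k:], u[:-k]                    # pair x(t) with u(t - k)
        w = np.linalg.solve(X.T @ X + ridge * np.eye(n_x), X.T @ target)
        r = np.corrcoef(X @ w, target)[0, 1]
        mc += r ** 2
    return mc
```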
While software-based reservoir computing systems have broad design options, such as aggressively expanding the reservoir size and inserting encoder layers between sub-reservoirs, several constraints have to be considered when designing hardware-based reservoir computing systems. For instance, physically connecting the nodes is not trivial because the complexity of routing a large number of devices grows exponentially as the size of the reservoir increases. Moreover, if the device is passive, active components that control the signal flow from one device to others must also be carefully designed. Due to these physical constraints, hardware-based reservoir computing systems have not shown as fast an improvement in performance as software-based reservoir computing systems, even though several studies have demonstrated promising features of hardware-based reservoir computing systems such as power efficiency and computing speed [13]. The hierarchical reservoir structures studied here respect the hardware constraints and achieve better performance by capturing more diverse temporal dynamics and enhancing the degree of nonlinear transformation without increasing the reservoir size. In fact, the number of interconnections between devices is actually reduced due to the use of sub-reservoirs, which alleviates the complexity of routing. Furthermore, when the total reservoir size is fixed, the analysis of the tradeoff between the number of sub-reservoirs and the size of each sub-reservoir allows one to design an optimal hardware architecture by balancing the effects of increased hierarchical depth against the capacity of the individual sub-reservoirs.

Conclusion
In this paper, the effects of hierarchical reservoir structures on the performance of reservoir computing are investigated. Wide ESNs, which build independent sub-reservoirs in parallel, and Deep ESNs, which stack the sub-reservoirs in series, can capture more diverse temporal dynamics than simply increasing the reservoir size. Furthermore, Deep ESNs not only offer a stronger nonlinear transformation of the input to the reservoir states, but also capture more diverse temporal dynamics than Wide ESNs and Shallow ESNs. When the total size of the reservoir is fixed, a tradeoff may be observed between the number of sub-reservoirs and the size of each sub-reservoir. For tasks where low frequency signals are important, such as long-term forecasting of time series like the Mackey-Glass series, the strength of Deep ESN becomes more evident, as the enhanced ability of the sub-reservoirs in the late stages to capture low frequency signals overcomes the effect of the reduced feature space of the sub-reservoirs.
The analysis of hierarchical reservoir structures may provide useful insight into designing practical reservoir systems. In particular, for hardware implementations of RC systems, reducing the number of connections between the nodes by stacking sub-reservoirs offers a promising approach to building large reservoirs in hardware. The enhancement of nonlinearity and diversity from the stacked sub-reservoir structure should be considered and utilized for solving complex tasks such as modeling multivariate systems. Further studies on hyperparameter optimization algorithms can help fine-tune the reservoir design for specific tasks, and provide a more comprehensive understanding of how a specific hyperparameter set affects the overall performance of the RC system.