A comparative study of different machine learning methods for dissipative quantum dynamics

It has been recently shown that supervised machine learning (ML) algorithms can accurately and efficiently predict the long-time population dynamics of dissipative quantum systems given only short-time population dynamics. In the present article we benchmarked 22 ML models on their ability to predict the long-time dynamics of a two-level quantum system linearly coupled to a harmonic bath. The models include uni- and bidirectional recurrent, convolutional, and fully-connected feed-forward artificial neural networks (ANNs) and kernel ridge regression (KRR) with linear and the most commonly used nonlinear kernels. Our results suggest that KRR with nonlinear kernels can serve as an inexpensive yet accurate way to simulate long-time dynamics in cases where a constant length of input trajectories is appropriate. The convolutional gated recurrent unit model is found to be the most efficient ANN model.


I. INTRODUCTION
Simulation of the dynamics of quantum dissipative systems 1-3 is one of the most challenging problems in physics and chemistry. Quantum dissipation arises from the coupling of a quantum system to a thermal bath, which consists of an infinite number of degrees of freedom. This results in time-irreversible dynamics of the system.
Methods based on the Nakajima-Zwanzig generalized quantum master equation (GQME), [20][21][22][23][24][25][26] including the transfer tensor method (TTM), [27][28][29][30][31] allow one to obtain the long-time dynamics of the reduced density matrix of a quantum system at a significantly lower computational cost compared to numerically exact methods, provided the GQME memory kernels are available. However, in general, it is difficult to numerically calculate the exact memory kernel for the GQME. TTM further reduces the computational cost compared to the direct solution of the GQME, but it still requires an input set of dynamical maps generated by a numerically accurate method, rendering such approaches infeasible for systems with many degrees of freedom (DOFs). Machine learning (ML) offers an alternative route to accurate, yet greatly accelerated quantum dynamics calculations with minimal input information required. [32][33][34][35] In ML, predicting the future time evolution of quantum mechanical observables based on past values can be formulated as a time series forecasting problem. A time series is a set of data points recorded in consecutive time intervals. Forecasting is a challenging part of time series data analysis. Traditional approaches to the forecasting of time series utilize moving averages. One such approach, the Autoregressive Integrated Moving Average (ARIMA), is a linear regression-based method which, together with its variations (e.g., ARIMAX and SARIMA), 36 has become a standard tool for modeling various time series problems. 37,38 These approaches perform reasonably well for short-term forecasting, but their performance deteriorates severely for long-term predictions. Additionally, such methods require assumptions about the underlying data that have to be incorporated into the model.
Machine learning is an artificial intelligence-based data analysis technique that is data-driven, as opposed to traditional model-driven approaches. Perhaps the most widely known ML tool is the feed-forward neural network (FFNN), an artificial neural network (ANN) wherein connections between the nodes do not form a cycle. 39,40 In this kind of ANN the information moves only in one direction: from the input nodes, through the hidden nodes (if any), to the output nodes, without using any feedback from past input data. As a result, the output of any layer does not affect the training process performed on the same layer. These types of ANNs are, in general, not able to handle sequential inputs, and all their inputs (and outputs) must be independent of each other.
There are several variations of recurrent neural network (RNN)-based models differing in their capability to remember input data. The vanilla RNN, 40,85 long short-term memory (LSTM), 40,42,54,86,87 and gated recurrent unit (GRU) 40,88 networks are the most commonly used types of RNNs. The vanilla RNN was the first recurrent ANN model to be introduced. 85 Its ability to learn long-term dependencies is limited due to the vanishing and exploding gradient problems. LSTM-based models are extensions of RNNs with significantly improved performance on long-term predictions. LSTM uses a set of gates to learn which information is worth remembering. GRU is another gated variant of the RNN model, developed by Chung et al., 89 who showed that GRU outperforms LSTM in some tasks. Jozefowicz et al. 90 also reported the superior performance of GRU-based models, but noticed that the performance of LSTM can nearly match that of the GRU if proper initialization is performed.
Bidirectional RNNs (BRNNs) 91 are another extension of standard unidirectional RNNs, in which two RNNs are applied to the input data: first, an RNN is applied to the input sequence, followed by the application of an RNN to the reversed input sequence. This typically improves the accuracy of the model 79,92 at the cost of slower training. 45,79 Several variants of bidirectional RNNs exist, differing in the type of the underlying RNN cell. Bidirectional LSTMs (BLSTMs) 56 and bidirectional GRUs (BGRUs) 67 are being explored in various tasks. 51,56,61,69,79,81 In some applications, 45,72 such as speech recognition, better performance of BLSTM compared to the regular unidirectional LSTM has been reported, 56 which is not surprising given the nature of the task (text parsing and prediction of the next words in the input sentence). In general, however, it is not clear which problems benefit from bidirectional training.
CNNs have also been combined with RNNs. Such hybrid models are called convolutional recurrent neural networks (CRNNs). A CRNN is a deep ANN that contains one or more CNN layers followed by one or more RNN layers. Such architectures possess several advantages. First, 1D CNN layers learn to sub-sample the data and reduce the input vector that is passed to the RNN layer(s). This is important because GRU and LSTM layers are computationally expensive, and replacing some of them with convolutional layer(s) improves the computational scaling of the algorithm. Second, 1D CNN layers extract local information from neighboring time points and pass already detected temporal dependencies further down to the RNN layers. CRNNs are being actively explored in various tasks. 103,[106][107][108][109] It has been reported that CRNNs can outperform CNNs in some tasks. 106 CNNs have been combined with BLSTMs as well. 110 Note that these methods are different from convolutional LSTM models, 111 in which both the input transformations and the recurrent transformations are convolutional.
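The compute saving from convolutional sub-sampling can be seen in a minimal NumPy sketch (ours, not the architecture used in this work): a strided 1D convolution shortens a toy sequence before a vanilla recurrent pass, so the recurrent layer runs over far fewer time steps. All shapes and weight values are illustrative assumptions.

```python
import numpy as np

def conv1d(x, w, b, stride):
    # single-channel 1D convolutional layer with ReLU activation, used
    # here only to sub-sample the sequence before the recurrent layer
    k = len(w)
    out = [x[i:i + k] @ w + b for i in range(0, len(x) - k + 1, stride)]
    return np.maximum(np.array(out), 0.0)

def rnn_last_state(seq, w_h, w_x, b):
    # vanilla RNN pass over the (already shortened) sequence
    h = np.zeros(len(b))
    for s in seq:
        h = np.tanh(w_h @ h + w_x @ np.atleast_1d(s) + b)
    return h

rng = np.random.default_rng(7)
T, k1, s1, k_h = 200, 4, 4, 8        # illustrative sizes
x = rng.normal(size=T)
z = conv1d(x, rng.normal(size=k1), 0.0, stride=s1)
h = rnn_last_state(z, rng.normal(size=(k_h, k_h)),
                   rng.normal(size=(k_h, 1)), np.zeros(k_h))
```

The recurrent pass here processes `len(z) = (T - k1)//s1 + 1 = 50` steps instead of the original 200, which is the scaling benefit described above.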
Kernel methods represent another class of ML methods that are applied to time-series analysis. [112][113][114] Such methods employ a function (kernel) that maps input data into a high dimensional space and then perform a linear regression in that space. Kernel Ridge Regression (KRR) and Support Vector Regression (SVR) are examples of such algorithms. A crucial aspect of applying kernel methods to time series data is to find appropriate kernels to distinguish between time series. A simple way is to treat time series as static vectors, essentially as they are treated in a feed-forward ANN, ignoring the time dependence. In such cases standard kernels such as Matérn 115 or Gaussian radial basis function (RBF) 115 can be used. However, such methods are limited to input time-sequences of equal length, again similar to ANNs, yet many applications involve time sequences of varying length. To overcome this limitation, kernel functions such as autoregressive 116 kernel have been developed.
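As an illustration of the kernel approach with time series treated as static vectors, the following minimal NumPy sketch fits KRR with a Gaussian RBF kernel on toy data; the kernel width `sigma` and regularization `lam` are hypothetical hyperparameters, not values used in this work.

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    # Gaussian RBF kernel between rows of A and rows of B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def krr_fit(X, y, lam=1e-3, sigma=1.0):
    # solve (K + lam*I) alpha = y for the regression coefficients
    K = rbf_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * np.eye(len(X)), y)

def krr_predict(Xtest, Xtrain, alpha, sigma=1.0):
    return rbf_kernel(Xtest, Xtrain, sigma) @ alpha

# toy data: each row plays the role of a short-time trajectory
# treated as a static vector of fixed length
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 20))
y = np.sin(X.sum(axis=1))
alpha = krr_fit(X, y)
pred = krr_predict(X, X, alpha)
```

Because the inputs are flattened into fixed-length vectors, this sketch shares the equal-length restriction discussed above.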
Recently, many ML models have been applied to simulate the dynamics of quantum systems. [32][33][34][35][117][118][119][120][121][122][123][124][125][126] We note that ML can also be applied to quantum dynamics in a different context, namely as surrogate models for quantum chemical properties such as potential energies and forces in different electronic states as well as couplings between states, eliminating the need for expensive (excited-state) electronic structure calculations. [127][128][129] Here we apply ML to propagate a quantum system assuming potential energies are readily available. ML models are attractive because of their very low computational cost and favorable scaling with respect to the size of a quantum system and bath. RNNs were used to simulate the dynamics of the spin-boson model Hamiltonian and Landau-Zener transitions, 119,120 and to learn the convolutionless master equation. 121 Rodriguez et al. 32 used CNNs to accurately model long-time dynamics of the spin-boson system based on short-time dynamical data. Later, Ullah et al. 33 illustrated that KRR methods can also predict the long-time dynamics of the spin-boson model very accurately. Recently, CNNs were used to study the excitation energy transfer in the Fenna-Matthews-Olson light-harvesting complex. 34,35 Wu et al. 122 used a hybrid CNN/LSTM network to predict long-time semiclassical and mixed quantum-classical dynamics of the spin-boson model. Lin et al. 123,130 trained a multi-layer LSTM model to simulate the long-time dynamics of the spin-boson model and used the bootstrap method to estimate the confidence interval. We note that ML methods based on FFNNs have also been recently applied to model quantum dynamics. 117,118 The recent upsurge of applications of ML methods to dissipative quantum dynamics calls for a systematic benchmark of such methods.
In this article, we present a comprehensive comparison of 22 ML models for predicting the long-time dynamics of an open quantum system given short-time evolution data. We consider the three most used types of unidirectional RNNs (the vanilla RNN, GRU, and LSTM), the corresponding bidirectional RNNs (BRNN, BGRU, BLSTM), 1D CNNs, CRNNs, as well as KRR with 8 different kernel functions. We test the ability of these ML methods to predict the donor-acceptor population difference of the spin-boson model in the symmetric and asymmetric regimes over a broad range of temperatures, reorganization energies, and bath relaxation timescales.

A. Model system
We choose to test ML methods on the spin-boson model, which has become a paradigmatic model system in the study of open quantum systems due to the richness of its physics. 3 The spin-boson model was exploited in a wide range of applications in quantum computing, 131 quantum phase transitions, 132,133 electron transfer in biological systems, 134 and others. The spin-boson model describes a two-level quantum subsystem linearly coupled to a heat-bath environment. The bath is modeled as an ensemble of independent harmonic oscillators. The total Hamiltonian in the subsystem's basis {|+⟩, |−⟩} is given by (ℏ = 1)

Ĥ = ε σ̂_z + ∆ σ̂_x + Σ_α ω_α b̂†_α b̂_α + σ̂_z Σ_α g_α (b̂†_α + b̂_α), (1)

where b̂†_α (b̂_α) is the bosonic creation (annihilation) operator of the αth mode with frequency ω_α, σ̂_z = |+⟩⟨+| − |−⟩⟨−| and σ̂_x = |+⟩⟨−| + |−⟩⟨+| are the Pauli operators, ε is the energetic bias, ∆ is the tunneling matrix element, and g_α are the coupling coefficients.
The impact of the bath is completely determined by the spectral density

J(ω) = π Σ_α g²_α δ(ω_α − ω), (2)

which, in this work, is chosen to be of the Debye form (Ohmic spectral density with the Drude-Lorentz cut-off), 135

J(ω) = 2λ ω ω_c / (ω² + ω²_c),

where λ is the bath reorganization energy, which controls the strength of the coupling between the system and the bath, and ω_c is the cutoff frequency, which sets the primary timescale for the bath evolution, τ_c = (ω_c)⁻¹. We consider the time evolution of the expectation value of the σ̂_z Pauli operator, which is often referred to as the population difference,

⟨σ̂_z(t)⟩ = Tr_s[σ̂_z ρ̂_s(t)]. (3)

In Eq. (3) the trace is taken over the system degrees of freedom, as denoted by "s", and ρ̂_s(t) = Tr_b[ρ̂(t)] is the system's reduced density operator, where ρ̂(t) is the total system-plus-bath density operator and the trace is taken over the bath degrees of freedom. The initial state of the total system is assumed to be a product state of the following form,

ρ̂(0) = ρ̂_s(0) ⊗ e^(−βĤ_b)/Z_b, (4)

where Z_b = Tr_b[e^(−βĤ_b)] is the bath partition function, β = (k_B T)⁻¹ is the inverse temperature, and k_B is the Boltzmann constant. The initial density operator of the system is chosen to be ρ̂_s(0) = |+⟩⟨+|. These conditions correspond to situations where the initial preparation of the subsystem occurs quickly on the timescale of the bath relaxation.

B. Machine learning models
In this section we provide a detailed description of all ML models used in the present study. We specialize our discussion to modeling time-series data with the input sequence denoted as x = (x^(1), . . . , x^(T)), where T is the length of the time series. Each element x^(t) of x can be a real-valued vector itself, x^(t) ∈ R^n. In this work, the dimension n of each element of an input sequence x^(t) is 1. Consider a data set D = {(x_i, y_i)}, i = 1, . . . , N, containing N time series x_i and their associated labels y_i. In time-series forecasting problems, labels can describe the future states of the input sequence x_i,

y_i = (x_i^(T+1), . . . , x_i^(T+m)), (5)

which corresponds to the m next elements of the sequence x_i. Time-series forecasting in a supervised learning framework amounts to (i) training ML models on a subset of D called a training set and (ii) using trained ML models to make a prediction ŷ_i for a given test time series x_i. In this work we test the ability of ML models to predict a single real-valued scalar quantity, the population difference of the spin-boson model, ⟨σ_z(t)⟩, for a single time step, i.e., m = 1. Extension to multivariate time series data (n > 1) and multi-step outputs (m > 1) is possible. 35
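The supervised framing described above can be sketched as follows: a single toy trajectory (a hypothetical damped oscillation standing in for ⟨σ_z(t)⟩, not data from this work) is sliced into fixed-length input windows of length T with m = 1 next-step labels.

```python
import numpy as np

def make_dataset(traj, T, m=1):
    # slice one trajectory into (length-T input window, next-m-steps label) pairs
    X, Y = [], []
    for i in range(len(traj) - T - m + 1):
        X.append(traj[i:i + T])
        Y.append(traj[i + T:i + T + m])
    return np.array(X), np.array(Y)

# toy damped oscillation standing in for the population difference
t = np.linspace(0.0, 10.0, 201)
sigma_z = np.exp(-0.3 * t) * np.cos(2.0 * t)
X, Y = make_dataset(sigma_z, T=40, m=1)
```

Each row of `X` is one input time series x_i and the matching row of `Y` is its label y_i; a model trained on such pairs can then be applied recursively to extend the dynamics beyond the training window.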

Feedforward neural networks
Feedforward neural networks (FFNNs), or multilayer perceptrons, are the most used type of artificial neural networks. An FFNN approximates a function g(x) by defining a mapping y = F(x; θ) and determining the values of the parameters θ that best approximate g(x). These models are called feedforward because information flows through the function being evaluated from an input x, through intermediate steps, to the output ŷ. There are no feedback connections between the output and the input of the model. 40 A crucial element of the success of ANNs is the use of deep architectures. Deep ANNs are created by stacking multiple layers on top of each other, with the output of one layer forming the input for the next layer. The first layer is called the input layer, the last layer is called the output layer; all layers in between are called hidden layers.
Deep FFNNs are compositions of several functions, ŷ = F^(L)(. . . F^(2)(F^(1)(x; θ^(1)); θ^(2)) . . . ; θ^(L)), where each function F^(k) depends on its own set of parameters θ^(k). Function F^(l) connects the lth and (l + 1)th layers of the network. It takes an input x^(l) ∈ R^(k_l) and generates the output according to

x^(l+1) = F^(l)(x^(l); θ^(l)) = f^(l)(w^(l)T x^(l) + b^(l)), (6)

where f^(l) : R → R is the lth layer activation function, which is applied elementwise. As seen from Eq. (6), each layer of the network is vector-valued. Each vector element a^(l)_j is called a neuron. Its value is calculated from the layer's input and parameters,

a^(l)_j = f^(l)( Σ_i w^(l)_ij x^(l)_i + b^(l)_j ), (7)

where w^(l) ∈ R^(k_l × k_(l+1)) and b^(l) ∈ R^(k_l) are called the weights and the biases of the lth layer, respectively, which together constitute the set of trainable parameters of the lth layer, θ^(l) = {w^(l), b^(l)}. The output x^(l+1) forms the input to the (l + 1)th layer. The total number of trainable parameters (biases and weights) of a single FFNN layer is k_l(k_(l+1) + 1). It should be emphasized that FFNN models require input time sequences of a fixed length, T. This, in particular, requires a priori knowledge of the memory effects in the quantum system under study, which is a nontrivial task. Artificial neural networks based on recurrent layers, discussed below, do not impose such a restriction. The strategy of deep learning is to find the set of model parameters {θ^(1), . . . , θ^(L)} that best approximates the target function g(x). The model parameters are adjusted during the training process, which is based on the backpropagation algorithm. 40 Typically the activation function of the hidden layers is chosen to be nonlinear, e.g., the logistic sigmoid function σ(z) = (1 + e^(−z))⁻¹.
According to the universal approximation theorem, [136][137][138] an FFNN with a linear output activation function and at least one hidden layer with a nonlinear activation function can approximate any Borel measurable function (continuous on a closed and bounded subset of the real coordinate space) from one finite-dimensional space to another with any desired nonzero amount of error, provided the network contains enough neurons. The theorem guarantees that, regardless of the target function, a single-hidden-layer FFNN with enough neurons will be able to represent this function with any degree of accuracy. However, in general, there is no guarantee that the training algorithm will be able to find such a representation. 40
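The layer composition described above can be illustrated with a minimal NumPy forward pass, using a logistic-sigmoid hidden layer and a linear output; the sizes T and k_1 below are arbitrary illustrative choices, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def ffnn_forward(x, params):
    # hidden layers with logistic-sigmoid activation, linear output layer
    *hidden, (w_out, b_out) = params
    a = x
    for w, b in hidden:
        a = sigmoid(a @ w + b)
    return a @ w_out + b_out

rng = np.random.default_rng(1)
T, k1 = 40, 16                       # fixed-length input, one hidden layer
params = [(rng.normal(size=(T, k1)), np.zeros(k1)),   # hidden layer
          (rng.normal(size=(k1, 1)), np.zeros(1))]    # linear output
y_hat = ffnn_forward(rng.normal(size=(5, T)), params)  # batch of 5 inputs
```

Note that the first weight matrix is tied to the fixed input length T, which is the equal-length restriction on input trajectories discussed above.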

1D convolutional neural networks
Convolutional neural networks (CNNs) are a type of ANN that can be applied to time-series data. 40,93 CNNs are based on a mathematical operation called convolution. Let (g ∗ w) denote the result of a 1D discrete convolution; its ith element is given by

(g ∗ w)_i = Σ_j g_(i−j) w_j, (8)

where w ∈ R^k is referred to as the kernel, or weight vector, and g is a time series. CNNs exploit two key ideas: sparse connectivity and parameter sharing. 40 The former is accomplished by making the kernel size smaller than the size of the input, which allows the network to detect, for time-series data, short-time correlations, and reduces the number of parameters to be stored in memory compared to FFNNs. The latter amounts to using the same kernel parameters for all positions in the input, which further reduces the storage requirements. In the case of convolution, parameter sharing leads to equivariance: a property that causes the output to change in the same way the input changes. Specifically, in the case of time-series data, the convolution captures a timeline for different features to appear in the input. 40 In practical implementations of CNNs, the convolution operation given in Eq. (8) is often replaced by the so-called cross-correlation operation, 40 which we denote by a "⋆",

(g ⋆ w)_i = Σ_(j=1..k) g_(i+j−1) w_j. (9)

It should be noted that Eq. (9) is given for a step size of the kernel, as it is applied across the sequence, equal to 1. This step size is called the stride s (s ∈ Z+). In general, for a stride s ≥ 1, Eq. (9) is modified to

(g ⋆ w)_i = Σ_(j=1..k) g_((i−1)s+j) w_j. (10)

Let us consider a two-layer 1D CNN architecture. An input sequence x is processed through K_1 1D kernels of the first layer whose weights are denoted as w^(1) ∈ R^(k_1). The result of this operation is a tensor Z^(1) whose elements are given by

Z^(1)_ij = f^(1)( (x ⋆ w^(1)_j)_i + b^(1)_j ), (11)

where f^(1) is the activation function, b^(1) ∈ R^(K_1) are the bias parameters, and w^(1)_j, j = 1, . . . , K_1, is the jth kernel of the first layer.
The output tensor Z^(1) has dimensions M_1 × K_1 with M_1 = ⌊(T − k_1 + 2p_1)/s_1⌋ + 1, where s_1 is the stride, p_1 is the amount of zero padding, and ⌊x⌋ = max{m ∈ Z | m ≤ x} is the floor function. The zero-padding parameter p_1 denotes the number of zeros to be added to the start and the end of the input sequence. For example, for p = 0 the input sequence is not padded with zeros and, consequently, the kernel is allowed to visit only positions where it is contained entirely within the input sequence. This type of zero padding is called valid padding.
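The output-length formula M_1 = ⌊(T − k_1 + 2p_1)/s_1⌋ + 1 can be checked directly with a small NumPy implementation of strided cross-correlation; the kernel, stride, and padding values are illustrative.

```python
import numpy as np

def cross_correlate_1d(g, w, stride=1, pad=0):
    # zero-pad both ends, then slide the kernel with the given stride
    g = np.pad(g, pad)
    k = len(w)
    out = [g[i:i + k] @ w for i in range(0, len(g) - k + 1, stride)]
    return np.array(out)

T, k1, s1, p1 = 100, 5, 2, 0
x = np.random.default_rng(2).normal(size=T)
w = np.ones(k1) / k1                  # a simple moving-average kernel
z = cross_correlate_1d(x, w, stride=s1, pad=p1)
M1 = (T - k1 + 2 * p1) // s1 + 1      # floor((T - k + 2p)/s) + 1
```

With p_1 = 0 (valid padding), the kernel only visits positions fully inside the sequence, and the output length matches M_1.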
Because convolution of input data with a single kernel can extract only one kind of features, albeit at many locations, in practice, many filters are used. Each element of a kernel is a trainable parameter and the same parameters of each kernel are shared across the entire sequence. Therefore, the total number of trainable parameters of the first 1D CNN layer is K 1 (k 1 + 1).
The output of the first layer, Z^(1), is then processed through K_1 × K_2 1D kernels whose weights are denoted as w^(2) ∈ R^(k_2). The output of the second layer is a tensor Z^(2) whose elements are given by

Z^(2)_iu = f^(2)( Σ_(j=1..K_1) (Z^(1)_(:,j) ⋆ w^(2)_uj)_i + b^(2)_u ), (12)

where f^(2) is the activation function of the second layer and b^(2) are the bias parameters of the second layer. In Eq. (12) w^(2)_uj is to be understood as the uth kernel (weight vector) applied to the output of the jth kernel (also called channel) of the first layer. The dimensions of Z^(2) are M_2 × K_2, where M_2 = ⌊(M_1 − k_2 + 2p_2)/s_2⌋ + 1 with p_2 and s_2 being the zero padding and the stride of the second 1D CNN layer, respectively. It is worth emphasizing that a separate set of kernels is applied to the output of each kernel of the preceding (in this case first) layer. Thus, the total number of trainable parameters of the second 1D CNN layer is K_2(K_1 k_2 + 1).
A pooling layer is commonly found in CNN architectures. Its function is to subsample (shrink) the input. A pooling function replaces the output of a CNN layer at a certain location with neighborhood-dependent information. For example, the maximum pooling (MaxPooling) operation 139 outputs the maximum value in a neighborhood of a specified size. This helps to make the representation approximately invariant to small translations of the input and improves the computational efficiency in a deep CNN by reducing the number of trainable parameters of the next convolutional layer. A 1D MaxPooling layer with kernel size k_p and stride s_p operates independently on every depth slice of the input Z^in and resizes it by taking only the maximum value in each window,

Z^out_i = max_(j=1..k_p) Z^in_((i−1)s_p+j). (13)

This expression is written, for simplicity, for zero padding p = 0, but other kinds of padding can be used as well. MaxPooling layers do not employ any trainable parameters.
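The MaxPooling operation just described can be sketched in a few lines of NumPy; the kernel size and stride values are illustrative.

```python
import numpy as np

def maxpool_1d(z, k_p, s_p):
    # slide a window of size k_p with stride s_p and keep the maximum
    # value in each window (no padding, no trainable parameters)
    return np.array([z[i:i + k_p].max()
                     for i in range(0, len(z) - k_p + 1, s_p)])

z = np.array([1.0, 3.0, 2.0, 5.0, 4.0, 0.0, 6.0, 1.0])
pooled = maxpool_1d(z, k_p=2, s_p=2)
```

Here the 8-point slice is reduced to 4 values, each the maximum of one non-overlapping window, which is the sub-sampling effect described above.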

Recurrent neural networks
Recurrent neural networks (RNNs) are a family of neural networks specifically designed for processing sequential data. 140 Similarly to FFNNs, RNNs have the universal approximation ability. 136,141 Similarly to CNNs, RNNs are based on the idea of parameter sharing but, unlike CNNs, RNNs share parameters through recursion. Formally, given a sequence x, at each time step an RNN updates its hidden state H = (h^(1), . . . , h^(T)) recursively based on the current input x^(j) and the previous hidden state h^(j−1) as follows:

h^(j) = R(h^(j−1), x^(j)), (14)

where R is a nonlinear function. Each element of the hidden state, h^(j), is, in general, a vector itself. Recursive application of Eq. (14) results in the sharing of parameters across the ANN architecture. Training such an ANN amounts to selectively emphasizing some aspects of the past sequence inputs that are deemed more important than others. All RNN models used in this work contain two stacked recurrent layers and are schematically illustrated in Fig. 1a. The number of RNN cells in the first layer equals the length of an input vector, such that each cell processes a single time step t = 1, . . . , T and updates the corresponding hidden state vector h^(t) ∈ R^(k_1). The size of the hidden state vector, k_1, is a hyperparameter that needs to be optimized. Note that various software packages use different terminology. In this work RNN models are built with the Keras 142 software, whose argument units corresponds to the size of the hidden state vector. An RNN layer outputs a tensor H ∈ R^(T×k_1) containing all hidden states, and it serves as an input to the following RNN layer.
Each RNN cell of the second layer processes the corresponding slice of the preceding layer's output, H_Lj, where j ∈ [1, k_1] and L is the number of cells of the second layer, and transforms it into its own hidden state vector. Various RNN architectures differ by the function R in Eq. (14). A simple RNN cell, also known as the vanilla RNN, is illustrated in Fig. 1b with the hyperbolic tangent activation function. It receives an input x^(t) corresponding to the time step t and uses the state vector from the chronologically previous RNN cell, h^(t−1) ∈ R^k, to generate the hidden state h^(t) according to the following update equation:

h^(t) = tanh(w_h h^(t−1) + w_x x^(t) + b_h), (15)

where b_h ∈ R^k are the bias parameters, w_h ∈ R^(k×k) and w_x ∈ R^(k×m) are the coefficient matrices, and the tanh activation function is applied element-wise. The function R(h^(j−1), x^(j)) in the case of the vanilla RNN amounts to iterative application of Eq. (15) to an input sequence x from t = 1 to t = T. The hidden state h^(t) is then passed to the next RNN cell. The weights w_h, w_x and biases b_h in Eq. (15) are updated iteratively via the back-propagation algorithm. In a multilayer RNN architecture all hidden state vectors H are passed to the next layer. For RNNs whose output is used directly to predict the next value of a sequence, the hidden state corresponding to t = T is the desired predicted value of the next time step, ŷ = h^(T).
The total number of trainable parameters of the vanilla RNN cell is k(k + n + 1), where k is the size of the hidden state vector, which is the same for all RNN cells within a layer, and n is the dimension of each element of an input sequence x^(t), which in this work is 1 for the first RNN layer and k_1 for the second RNN layer. Note that if the RNN is not the first layer, as, e.g., in the convolutional recurrent neural networks discussed below, n would be different from 1. It should also be noted that the total number of trainable parameters of a whole RNN layer comprised of any number of vanilla RNN cells is the same, because b_h, w_h, and w_x are shared across all RNN cells within the given layer. The total number of trainable parameters in a two-layer vanilla RNN architecture is k_2(k_1 + k_2 + 1) + k_1(k_1 + 1 + 1). In spite of being a simple and powerful ANN model, the vanilla RNN described above is plagued by the vanishing gradient and exploding gradient problems when trained via backpropagation. 40,[143][144][145] The exploding gradient problem refers to a large increase in the norm of the gradient during training. The vanishing gradient problem refers to the opposite behavior, when long-term components decay exponentially fast to zero norm, limiting the model's ability to learn long-range dependencies. The exploding gradient problem can be easily addressed by gradient clipping. 146 The vanishing gradient problem is, however, much more difficult to address.
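The recurrence of Eq. (15) and the parameter count k(k + n + 1) can be verified with a small NumPy sketch of a single vanilla RNN layer; the weights are random illustrative values, not a trained model.

```python
import numpy as np

def rnn_forward(x, w_h, w_x, b):
    # vanilla RNN update: h_t = tanh(w_h h_{t-1} + w_x x_t + b)
    h = np.zeros(len(b))
    for x_t in x:
        h = np.tanh(w_h @ h + w_x @ np.atleast_1d(x_t) + b)
    return h                          # h^(T), used directly as the prediction

k, n = 8, 1                           # hidden size and input dimension
rng = np.random.default_rng(3)
w_h = rng.normal(size=(k, k))
w_x = rng.normal(size=(k, n))
b = np.zeros(k)
h_T = rnn_forward(rng.normal(size=100), w_h, w_x, b)
n_params = w_h.size + w_x.size + b.size   # k*(k + n + 1)
```

Because the same w_h, w_x, and b are reused at every time step, the parameter count is independent of the sequence length, which is the parameter-sharing property noted above.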

Long short-term memory networks
In general, learning long-term dependencies is one of the most important problems in deep learning. 40 The difficulty in dealing with long-memory sequences in RNNs arises from the exponentially smaller weights given to long-term correlations compared to short-term ones. 40,[143][144][145] This problem is particular to the simple RNN cell described above. It was shown that very deep FFNNs can avoid the vanishing and exploding gradient problems, 147 but, in order to store memories, RNNs must enter a region of parameter space where gradients vanish. 144,145 To overcome the vanishing gradient problem, various gated RNN architectures, such as long short-term memory (LSTM), were developed. Gated RNNs are based on the idea of creating paths through time such that the derivatives neither vanish nor explode. The LSTM 42,54,86,87 model was developed by Hochreiter et al. 42 and is based on the idea of using self-loops to produce paths where the gradient can flow for long durations. This is achieved by adding a cell state, denoted by C^(t) ∈ R^k, and data-dependent gates, which control the flow of information, to the standard RNN architecture. 42,86 Both the hidden state and the cell state control the memory of the network. The cell state carries relevant information throughout the processing of the sequence, such that even information from the earlier time steps can make its way to later time steps, reducing the effects of short-term memory. Information is added to or removed from the cell state via gates. The gates are themselves simple FFNNs that decide which information is allowed on the cell state. Thus, the gates can learn which information is relevant to keep or forget during training.
Although a number of variants of the LSTM cell have been produced, a large-scale analysis shows that none of them outperforms the standard LSTM architecture. 42,54,86 The block diagram of the LSTM cell is shown in Fig. 1c. It contains three gates: the forget gate, the input gate, and the output gate. All gates have a sigmoid [logistic, σ(z) = (1 + e^(−z))⁻¹] nonlinear activation function (yellow circles). The tth element(s) of the input sequence, x^(t), enters the cell and is concatenated with the hidden state vector from the chronologically previous cell, h^(t−1). The total vector is passed through the forget gate,

F^(t) = σ(W_Fh h^(t−1) + W_Fx x^(t) + b_F), (16)

where b_F ∈ R^k are the biases and W_Fh ∈ R^(k×k), W_Fx ∈ R^(k×m) are the corresponding weights. The forget gate was the crucial addition to the original LSTM cell, extending the length of sequences that can be processed. 86 The forget gate learns to reset memory blocks once their contents are out of date and hence useless. The output vector F^(t) ∈ R^k contains values between 0 and 1, emphasizing or diminishing the importance of the elements of the hidden state vector, correspondingly. In two separate branches, the current input vector and the previous hidden state vector are processed through the input gates: the external input gate and the new candidate gate,

I^(t) = σ(W_Ih h^(t−1) + W_Ix x^(t) + b_I), (17)
G^(t) = tanh(W_Gh h^(t−1) + W_Gx x^(t) + b_G), (18)

where b_I, b_G ∈ R^k and W_Ih, W_Gh ∈ R^(k×k), W_Ix ∈ R^(k×m), and W_Gx ∈ R^(k×m) are the corresponding biases and weights of the input and new candidate gates. The external input gate uses the sigmoid activation function to obtain a gating value between 0 and 1, which is then element-wise multiplied by the corresponding element of the new candidate vector G^(t), effectively deciding which elements of the input vector are the most important. Given F^(t), I^(t), and G^(t), the cell state of the previous time step, C^(t−1), is updated as follows:

C^(t) = F^(t) ⊙ C^(t−1) + I^(t) ⊙ G^(t), (19)

where ⊙ denotes the element-wise vector (Hadamard) product. The first term in Eq. (19) removes parts of the previous cell state through the forget gate, and the second term adds new information, yielding the new cell state.
Finally, the hidden state h^(t) is updated as follows:

O^(t) = σ(W_Oh h^(t−1) + W_Ox x^(t) + b_O), (20)
h^(t) = O^(t) ⊙ tanh(C^(t)), (21)

where O^(t) is known as the output gate, which uses a sigmoid activation function for gating, and W_Oh, W_Ox, and b_O are the corresponding weights and biases. The updated hidden state is then used to compute what to forget, input, and output by the cell in the next time step. Thus, while both the hidden state and the cell state control the memory of the network, the cell state carries information about the entire sequence and the hidden state encodes the information about the most recent time step. Similarly to a deep vanilla RNN architecture, in a deep LSTM NN the hidden state of each LSTM cell is passed to the next layer, as shown in Fig. 1a.
Since each of the four gates is based on a perceptron, the total number of trainable parameters of the LSTM variant shown in Fig. 1c is 4k(k + m + 1) and, since the parameters are shared across all LSTM cells, this is also the total number of trainable parameters of the first LSTM layer comprised of any number of LSTM cells. The total number of trainable parameters of a two-layer LSTM architecture is 4[k_2(k_1 + k_2 + 1) + k_1(k_1 + m + 1)]. Thus, such LSTM models contain exactly four times more trainable parameters than the vanilla RNN models with the same number of units (length of the hidden state vector).
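A NumPy sketch of one LSTM step, following the gate structure described above, confirms the 4k(k + m + 1) parameter count; the weights are random and purely illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    # W: 4 gate weight matrices of shape (k, k+m); b: 4 bias vectors of shape (k,)
    v = np.concatenate([h_prev, x_t])       # concatenated [h^(t-1), x^(t)]
    F = sigmoid(W[0] @ v + b[0])            # forget gate
    I = sigmoid(W[1] @ v + b[1])            # external input gate
    G = np.tanh(W[2] @ v + b[2])            # new candidate
    O = sigmoid(W[3] @ v + b[3])            # output gate
    c = F * c_prev + I * G                  # cell-state update, Eq. (19)
    h = O * np.tanh(c)                      # hidden-state update
    return h, c

k, m = 8, 1                                 # illustrative sizes
rng = np.random.default_rng(4)
W = rng.normal(size=(4, k, k + m))
b = np.zeros((4, k))
h, c = np.zeros(k), np.zeros(k)
for x_t in rng.normal(size=(50, m)):
    h, c = lstm_step(x_t, h, c, W, b)
n_params = W.size + b.size                  # 4k(k + m + 1)
```

Each of the four gates is a perceptron acting on the concatenated vector of size k + m, hence the factor of four relative to the vanilla RNN cell.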

Gated recurrent unit networks
LSTM has become a popular off-the-shelf architecture that effectively solves the vanishing gradient problem. However, LSTM is often criticized for its ad hoc nature. Furthermore, the purpose of its many components is not apparent, and there is no proof that LSTM is even the optimal structure. For example, Ref. 90 shows that the forget gate is a crucial element of the LSTM architecture, while the output gate is the least important one.
Gated recurrent unit (GRU) [88][89][90]148 ANNs are another example of gated RNNs. GRUs were introduced as a simplified version of LSTM that uses one less gate and, thus, has fewer trainable parameters. The main difference between the LSTM and GRU cells is that the forget and input gates that appear in the LSTM architecture are combined into a single gate in the GRU cell. Additionally, the hidden state and the cell state are combined as well. The information between GRU cells is transferred via the hidden state vector. Due to the reduction in trainable parameters and complexity, models based on GRU cells tend to converge faster.
There exist several variants of a GRU cell. Fig. 1d illustrates the GRU cell used in this work; the update equations are discussed below. An input vector x(t) is combined with the previous cell hidden state h(t−1) and passed through the update gate Z and the reset gate R, both using the sigmoid activation function σ:

Z(t) = σ(W_Zh h(t−1) + b_Zh + W_Zx x(t) + b_Zx),
R(t) = σ(W_Rh h(t−1) + b_Rh + W_Rx x(t) + b_Rx),

where W_Zh, W_Rh ∈ R k×k and W_Zx, W_Rx ∈ R k×m are the weights of the update (reset) gate and the b's are the corresponding bias vectors. The update gate controls the update of the memory state h(t). The reset gate controls the influence of the hidden state h(t−1) on G(t), introducing an additional nonlinear effect in the relationship between past and future states. The two gates can individually "ignore" parts of the hidden state vector. 40 Next, the input vector x(t) is combined with the element-wise product of the hidden state vector h(t−1) and the output of the reset gate which, after passing through the hyperbolic tangent activation function, produces the so-called proposed new candidate state

G(t) = tanh(W_Gh (R(t) ⊙ h(t−1)) + b_Gh + W_Gx x(t) + b_Gx),

where b_Gh, b_Gx ∈ R k, W_Gh ∈ R k×k, and W_Gx ∈ R k×m are the corresponding bias vectors and weights. The hidden state is updated by combining the previous cell hidden state h(t−1), weighted by the output of the update gate, with the product of the complement of the update gate output and the proposed candidate state:

h(t) = Z(t) ⊙ h(t−1) + (1 − Z(t)) ⊙ G(t).

The total number of trainable parameters of a single GRU cell described above, as implemented in the TensorFlow software library, is 3k(k + m + 2). Note the use of the two separate sets of biases b_Wh and b_Wx. The total number of trainable parameters is independent of the number of GRU cells due to parameter sharing. Two-layer GRU models contain 3[k2(k2 + k1 + 2) + k1(k1 + m + 2)] trainable parameters. Thus, given the same k1, k2, and m, GRU models contain approximately three times more parameters than the corresponding vanilla RNN models and ca. 1/4 fewer parameters than LSTM models with the same number of layers.
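The parameter counts for the vanilla RNN, LSTM, and GRU layers can be verified with a few lines of arithmetic; a minimal sketch (the layer sizes below are illustrative, not those of the paper's models):

```python
def rnn_params(k, m):
    """Vanilla RNN layer: weights for h (k x k) and x (k x m) plus one bias vector."""
    return k * (k + m + 1)

def lstm_params(k, m):
    """LSTM layer: four perceptron-based gates, each with k(k + m + 1) parameters."""
    return 4 * k * (k + m + 1)

def gru_params(k, m):
    """GRU layer (TensorFlow variant with two separate bias sets): 3k(k + m + 2)."""
    return 3 * k * (k + m + 2)

k, m = 64, 1  # hidden-state length and input size (illustrative)
print(rnn_params(k, m), lstm_params(k, m), gru_params(k, m))  # -> 4224 16896 12864
```

For k = 64 and m = 1 the GRU layer has about 24% fewer parameters than the LSTM layer, consistent with the "ca. 1/4" estimate.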

Bidirectional RNNs
In all the RNNs described above, the state at time T captures information from the past x(t), t = 1, . . . , T − 1, as well as the current input x(T), to make a prediction. Such RNNs are said to have a "causal" structure 40 and are called unidirectional RNNs. RNN architectures that combine an RNN that moves forward through time from t = 1 to t = T with another RNN that moves backward through time from t = T to t = 1 are called bidirectional RNNs. 40,56 This is realized by duplicating each recurrent layer in the network. The two resulting layers have separate forward −→h and backward ←−h hidden state vectors. The forward and backward layers are not connected to each other. An input time sequence is provided in chronological order to the first RNN layer and in reversed chronological order to the second RNN layer. Applying RNNs twice increases the amount of input information available to the network, leads to better capturing of long-term dependencies and, thus, improves the accuracy of the model. 92 Bidirectional recurrent NNs can be built from the vanilla RNN, LSTM, and GRU cells, resulting in BRNN, BLSTM, 91 and BGRU models. For the vanilla bidirectional RNN, the forward and backward hidden states are updated as

−→h(t) = f(−→W_h −→h(t−1) + −→W_x x(t) + −→b),     (26)
←−h(t) = f(←−W_h ←−h(t+1) + ←−W_x x(t) + ←−b),     (27)

where −→b, ←−b ∈ R k are the bias parameters for the forward and backward hidden states, respectively, and −→W_h, ←−W_h, −→W_x, ←−W_x are the coefficient matrices. In deep BRNN architectures, forward and backward hidden states of a previous layer are combined and passed to the following layer. In Eq. (28), F is a function that combines the two hidden state vectors; it can be a concatenating function (this work), element-wise addition, multiplication, or averaging. 142,149 Thus, every hidden RNN layer receives an input from both forward and backward layers at the preceding layer. The total number of trainable parameters of the first layer of a BRNN is exactly twice the number of trainable parameters in the first layer of a unidirectional RNN with the same length of the hidden state vector and the same type of RNN cell.
In a deep bidirectional recurrent NN architecture, the ith hidden layer with m RNN cells and hidden vectors of length k_i outputs to the next RNN layer a tensor Z ∈ R 2ki×m, assuming F is a concatenating function. The next, (i + 1)th, layer then uses 2k_{i+1}(2k_i + k_{i+1} + 1) trainable parameters, where k_{i+1} is the length of the hidden state vector of the (i + 1)th layer. Therefore, deep bidirectional recurrent NNs, in general, use more than twice as many trainable parameters as the corresponding unidirectional RNNs with the same number of cells and length of the hidden state vector.
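The doubling argument can be made concrete with a short sketch, assuming vanilla RNN cells and concatenated forward/backward hidden states:

```python
def rnn_layer_params(k, m):
    # Unidirectional vanilla RNN layer: k(k + m + 1) shared parameters
    return k * (k + m + 1)

def brnn_first_layer_params(k, m):
    # First bidirectional layer: independent forward and backward copies
    return 2 * rnn_layer_params(k, m)

def brnn_deep_layer_params(k_prev, k_next):
    # Deeper bidirectional layer: its input is the concatenated (length 2*k_prev)
    # hidden state of the previous layer, and both directions are duplicated
    return 2 * k_next * (2 * k_prev + k_next + 1)

# Exactly twice for the first layer, more than twice for deeper layers
print(brnn_first_layer_params(10, 5))   # -> 320
print(brnn_deep_layer_params(10, 10))   # -> 620
```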

Convolutional RNNs
Convolutional recurrent neural networks are deep ANNs that combine convolutional layers with recurrent layers. In this work, we build two-layer convolutional recurrent NN models by adding one recurrent layer (vanilla RNN, GRU, or LSTM) to a 1D CNN layer, resulting in three models denoted as CRNN, CGRU, and CLSTM. The update equations of each of these types of neural networks are exactly those of the corresponding individual layers described above. For example, in the CRNN architecture, the first layer (1D CNN) processes an input sequence of length T through a convolution, or rather cross-correlation, operation as given by Eqs. (10) and (11) and generates an output tensor Z(1) formed by the output of K1 kernels (filters), each of size M1 = (T − k1 + 2p1)/s1 + 1, where s1 is the stride and p1 is the zero padding. The output tensor Z(1) is then passed to an RNN layer comprised of M1 vanilla RNN cells with a hidden state vector of (user-specified) length k2. Each RNN cell receives an input x(t) ≡ Z(1)_{t,1:K1}, t ∈ 1, . . . , M1, from the preceding 1D CNN layer and processes it according to the corresponding update equations.
In this work, the recurrent layer of convolutional recurrent NN models is configured to output the whole sequence of hidden states from all RNN cells, Z(2) ∈ R M1×k2, which is then passed to fully-connected layers. The total number of trainable parameters of the two-layer architecture described above is K1(k1 + 1) + N_RNN, where k1 is the size of the kernel of the 1D CNN layer and N_RNN is the number of trainable parameters of the RNN layer.
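The output-size formula M1 = (T − k1 + 2p1)/s1 + 1 can be checked against an actual 1D convolution; a small numpy sketch (np.convolve flips the kernel, i.e., performs true convolution rather than cross-correlation, but the output length is the same):

```python
import numpy as np

def conv1d_out_len(T, k, p=0, s=1):
    # Number of outputs of one 1D kernel: (T - k + 2p) / s + 1
    return (T - k + 2 * p) // s + 1

T, k = 41, 16                           # input length and kernel size (illustrative)
x = np.random.rand(T)
w = np.random.rand(k)
out = np.convolve(x, w, mode="valid")   # stride 1, no zero padding
print(len(out), conv1d_out_len(T, k))   # -> 26 26
```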

Convolutional bidirectional RNNs
It is possible to fuse a 1D CNN layer and a bidirectional RNN layer; the resulting architectures are called convolutional bidirectional recurrent neural networks. Such models are expected to combine the benefits of convolutional recurrent NNs with the improved description of long-term dependencies pertinent to bidirectional RNNs. In this work, three convolutional bidirectional NN models are built by combining one 1D CNN layer with each of the three recurrent NN architectures discussed above: the vanilla RNN, GRU, and LSTM. The resulting models are denoted as CBRNN, CBGRU, and CBLSTM. Similarly to convolutional recurrent NNs, the update equations of convolutional bidirectional NNs can be deduced from the corresponding update equations of the 1D CNN and bidirectional recurrent NNs given above. Specifically, for CBRNN, an input sequence is first passed through a 1D CNN layer where it is transformed according to Eqs. (10) and (11) into an output tensor Z(1) formed by the output of K1 kernels (filters), each of size M1 = (T − k1 + 2p1)/s1 + 1, where s1 is the stride and p1 is the zero padding. Then, Z(1) is fed separately into the forward and backward vanilla RNN layers where the corresponding forward and backward hidden states are updated as shown in Eqs. (26) and (27). In the two-layer architecture described above, the total number of trainable parameters is simply the sum of the total numbers of trainable parameters of the 1D CNN layer and of the bidirectional recurrent layer.

Kernel ridge regression
In kernel ridge regression (KRR) 150-152 the approximating function f(x) for a vector of input values x is defined as

f(x) = Σ_{i=1}^{N} α_i k(x, x_i),

where N is the number of training points and α = {α_i} is a vector of regression coefficients. The covariance function k(x, x_i), commonly referred to as a kernel function or simply kernel, can be understood as a similarity measure between two vectors x and x_i from the input space. The kernel performs an implicit mapping to a higher-dimensional feature space. Because the input time-sequences employed in this work have the same length, we focus on standard kernels. 115 One of the most common kernel functions is the Matérn kernel, 115,153 which in the MLatom software package [154][155][156] used in this work is defined as 155

k(x, x′) = exp(−||x − x′||_2/σ) Σ_{l=0}^{n} [n!(n + l)!]/[(2n)! l!(n − l)!] (2||x − x′||_2/σ)^{n−l},

where σ is a positive hyperparameter which defines the characteristic length scale of the covariance function, n is a non-negative integer, and ||. . .||_2 is the Euclidean distance, which is taken to be the L2 norm. For n = 0 the Matérn covariance function reduces to the exponential kernel function 115,153

k(x, x′) = exp(−||x − x′||_2/σ).

Another popular choice of a covariance function is the squared exponential (Gaussian) kernel function 153

k(x, x′) = exp(−||x − x′||²_2/(2σ²)).

Because quantum dynamics often resembles a periodically-decaying time series, we also test a decaying periodic kernel (our adaptation based on Refs. 153 and 157):

k(x, x′) = exp(−||x − x′||_2/σ) exp(−2 sin²(π||x − x′||_2/p)/σ_p²),

where p is the period and σ_p is a length scale for the periodic term (both are hyperparameters).
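For reference, the n = 0 Matérn (exponential), Gaussian, and decaying periodic kernels can be sketched in a few lines of numpy; the decaying-periodic form below is an assumed combination of an exponential decay with the standard periodic kernel and may differ in detail from the adaptation used in the paper:

```python
import numpy as np

def k_exponential(x, xp, sigma):
    # Matern kernel with n = 0: exp(-||x - x'|| / sigma)
    return np.exp(-np.linalg.norm(x - xp) / sigma)

def k_gaussian(x, xp, sigma):
    # Squared exponential (Gaussian) kernel
    return np.exp(-np.linalg.norm(x - xp) ** 2 / (2 * sigma ** 2))

def k_decaying_periodic(x, xp, sigma, p, sigma_p):
    # Assumed form: exponential decay times a standard periodic term
    r = np.linalg.norm(x - xp)
    return np.exp(-r / sigma) * np.exp(-2 * np.sin(np.pi * r / p) ** 2 / sigma_p ** 2)
```

All three kernels equal 1 when x = x′ and decay toward 0 as the distance between the inputs grows.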
Given the kernel function, the regression coefficients α are found by minimizing the squared error loss function

J(α) = Σ_{i=1}^{N} [f(x_i) − y_i]² + λ αᵀKα,     (34)

where y = {y_i} is the target output vector, K ∈ R N×N is the kernel matrix with elements K_ij = k(x_i, x_j), and λ denotes a non-negative regularization hyperparameter. In Eq. (34), the second term is usually added to prevent the KRR model from assigning a large weight to a single point. The optimization of the parameters (regression coefficients α) amounts to solving a system of linear equations,

(K + λI)α = y,     (35)

for which the analytical solution is known. Here I is the identity matrix. The computational scaling of solving this system is O(N³). 153,155 KRR is a kernelized version of ridge regression, and when the linear kernel function k(x, x′) = xᵀx′ is used, KRR becomes equivalent to ridge regression, i.e., the approximating function f(x) is simply a multiple linear regression, f(x) = βᵀx with β = Σ_{i=1}^{N} α_i x_i, whose regression coefficients β are shrunk (see the second term in Eq. (34)) using regularization. As shown in the above equation, the regression coefficients β can be conveniently derived from the α coefficients and the training input data and, in fact, are printed out by MLatom.
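A minimal numpy sketch of solving the KRR linear system (K + λI)α = y and of the linear-kernel/ridge equivalence, on toy data with hypothetical sizes (MLatom is not required):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((50, 4))              # 50 toy training inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + 0.01 * rng.standard_normal(50)

lam = 1e-3
K = X @ X.T                                   # linear kernel matrix
alpha = np.linalg.solve(K + lam * np.eye(50), y)   # (K + lambda*I) alpha = y

# Linear-kernel KRR is ridge regression: beta recovered from alpha
beta = X.T @ alpha
x_new = rng.standard_normal(4)
f_kernel = alpha @ (X @ x_new)                # sum_i alpha_i k(x_new, x_i)
f_ridge = beta @ x_new                        # equals f_kernel exactly
```

The identity f_kernel = f_ridge holds by construction, since Σ_i α_i (x_iᵀx) = (Xᵀα)ᵀx.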

A. Data sets for training, validation, and testing
The data set used in this work is the same as that used in Ref. 33. 158 In our calculations we set ∆ = 1.0. The total propagation time was t_max∆ = 20 and the HEOM integration time step was set to δt∆ = 0.05. In total, 1,000 HEOM calculations, 500 for the symmetric (ε/∆ = 0) and 500 for the asymmetric (ε/∆ = 1) spin-boson Hamiltonian, were performed.
First, time-evolved reduced density matrices (RDMs) are saved every dt∆ = 0.1. Next, σz(t) are calculated from the RDMs and processed into shorter sequences of length T by window slicing. 32,33,123 Namely, for a time series x = x(1), . . . , x(L), where σz(t) is denoted by x(t) for compactness, a slice is a subset of the original time series defined as s_{i:j} = x(i), . . . , x(j), 1 ≤ i ≤ j ≤ L. For a given time series x of length L and slice length P, a set of L − P + 1 sliced time series {s_{1:P}, s_{2:P+1}, . . . , s_{L−P+1:L}} is generated. Finally, the total data set D = {(x_i, y_i)}_{i=1}^{N} containing time series x_i and their corresponding labels y_i is obtained by setting the first T = P − 1 elements of each slice to the input time series x_i and the last (Pth) element of each slice to the associated label y_i.
In general, the size of the window P − 1, or equivalently T, should be treated as a hyperparameter but, following previous work, 33 we set T = 0.2L. The window slicing is applied to all 1,000 RDMs obtained in HEOM calculations with different system and system-bath parameters. For each set of Hamiltonian parameters, the initially calculated set of time-evolved σz(t) with L = t_max/dt = 200 generates 160 data points for the data set with T = 41 (including the t∆ = 0 point).
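The window-slicing step above can be sketched in a few lines of numpy (the trajectory here is random toy data standing in for σz(t)):

```python
import numpy as np

def window_slice(x, P):
    """Split a trajectory x of length L into L - P + 1 overlapping slices of
    length P; the first P - 1 points of each slice form the input and the
    last point is the label."""
    L = len(x)
    slices = [x[i:i + P] for i in range(L - P + 1)]
    inputs = np.array([s[:-1] for s in slices])
    labels = np.array([s[-1] for s in slices])
    return inputs, labels

x = np.random.rand(200)            # toy stand-in for one trajectory, L = 200
inputs, labels = window_slice(x, P=41)
print(inputs.shape, labels.shape)  # -> (160, 40) (160,)
```

With L = 200 and P = 41 this yields the 160 input/label pairs per trajectory quoted above.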
From the raw HEOM data set of 1,000 trajectories, 100 randomly chosen trajectories are set aside as the hold-out test set, which is used for testing and generating the results presented in Sec. IV. The remaining set of 900 trajectories is transformed into 144,000 trajectories by the window slicing described above. In total, 72,000 short-time σz(t) trajectories for the symmetric and 72,000 for the asymmetric spin-boson model were generated. Each trajectory has a time-length of t∆ + dt∆ = 4.1. This data set of supervised trajectories is the training set, which is randomly partitioned into two subsets: a sub-training set containing 80% of the data and a validation set with the remaining 20%. ML models are (initially) trained on the sub-training set and the validation set is used for monitoring the performance of the models (mainly, to prevent overfitting and to optimize hyperparameters). The final NN models tested on the hold-out test set are not trained on the entire training set; they are trained only on the sub-training set to prevent overfitting. KRR models are, however, trained on the entire training set. The performance of KRR models trained only on the sub-training set is similar to that of the KRR models trained on the entire training set (Appendix A). These are the typical ways of training NN and KRR models, respectively, which differ due to differences in the formalism and training procedures of these types of models. Following previous similar works, 32,33 the input data is not normalized.
B. Artificial neural network models

Details of the models
In this work, fourteen deep ANN models are built and tested. In general, each ANN model comprises a total of two convolutional and/or recurrent layers followed by one fully-connected layer and one output layer. The exception is the FFNN model, which is comprised of two fully-connected layers followed by the output layer. The fully-connected layer in each model, except for FFNN, has 256 neurons and uses the rectified linear unit (ReLU) activation function, defined as f(z) = max(0, z). Fixing the properties of the fully-connected layer allows us to compare the performance of recurrent and convolutional layers. The output layer contains one neuron with the linear activation function f(z) = z. The details of all ANN models studied in the present Article are summarized below.
• The 1D CNN model contains two 1D CNN layers followed by a MaxPooling layer, one fully-connected layer, and the output layer. ReLU is used as the activation function in each 1D CNN and fully-connected layer. For the MaxPooling layer a pool size k_p = 2 (see Eq. (13)) is used. A stride of s = 1 and zero padding are used in both the 1D CNN and MaxPooling layers. The number of filters and the filter sizes of each layer are optimized using the Particle Swarm Optimization algorithm as described in Sec. III B 2.
• FFNN model contains two hidden fully-connected layers followed by the output layer. The ReLU activation function is used.
• Three recurrent NN models are comprised of two recurrent layers of the same type: the vanilla RNN, LSTM, or GRU. The whole sequence of hidden state vectors from all cells is passed from the first to the second recurrent layer, as well as from the second recurrent layer to the fully-connected layer.
• Three bidirectional recurrent NN models are comprised of two bidirectional recurrent layers of the same type: the vanilla BRNN (denoted simply as BRNN hereafter), BLSTM, or BGRU. The whole sequence of hidden state vectors from all cells is passed from the first bidirectional recurrent layer to the second bidirectional recurrent layer, as well as from the second bidirectional recurrent layer to the fully-connected layer.

Hyperparameter optimization
The goal of this work is to compare the performance of ML models, including ANN models with different architectures as described above. Depending on the overall objective, one can envision several approaches to comparing the performance of different ANNs. For example, the hyperparameters of each ANN model can be adjusted using grid search, random search, or other optimization techniques, including evolutionary algorithms, to achieve the best performance of each ANN model on the given data set. However, given the fundamentally different types of ANN models considered in the present Article, this approach may result in several models with approximately the same prediction accuracy but requiring drastically different resources such as CPU (or GPU) time and memory. In such a case, it might be reasonable to use ANN models that are not the most accurate but provide acceptable accuracy with shorter training and prediction times. The notion of the optimal running time, however, strongly depends on the problem under study. One can also set a desired accuracy level, systematically adjust the hyperparameters of each model to achieve it, and then analyze the resulting ANN models. This approach, however, requires specifying a fixed accuracy level, which might not be achievable for some ML models, and relies on a priori knowledge of the dynamics of the system of interest.
The following approach is adopted in the present study. We compare the performance of ANN models with approximately the same number of trainable parameters. The hyperparameters of each ANN model described above are scanned using a grid search, and models with the pre-determined number of trainable parameters are selected; their performance is reported in Sec. IV. The number of trainable parameters is chosen as follows. In Refs. 32 and 159 we illustrated that deep 1D CNN models can accurately approximate the long-time dynamics of a molecular dimer system (a spin-boson-like model) and of the Fenna-Matthews-Olson photosynthetic complex. Therefore, the 1D CNN model from Ref. 32 is taken as a base model. The hyperparameters of this model associated with the convolutional layers, specifically the number and the size of the kernels of each layer, are optimized using the Particle Swarm Optimization (PSO) algorithm. All other hyperparameters, such as the number of neurons of the fully-connected layer and the activation functions of all neurons, are fixed to 256 and ReLU, respectively. The PSO calculation is performed with three particles for 50 steps. The data set described above is used for the hyperparameter optimization. At each step of the PSO calculation, three 1D CNN models, one for each particle, are trained and validated using the above-mentioned data set. During training, the deviation between the predicted ŷ_i and reference values y_i is minimized. In this work, we use the mean squared error (MSE),

MSE = (1/N) Σ_{i=1}^{N} (ŷ_i − y_i)²,

as the loss function in the minimization. The adaptive moment estimation (Adam) algorithm 160 is used with the initial learning rate set to 1.0·10−4. The initial values of the weights are randomly sampled using Xavier initialization. 161 The biases are initialized to zero. The Keras 142 software package with the TensorFlow 149 backend was employed for the calculations. The batch size is set to N_b = 64 and the training of each model is performed for 30 epochs.
As will be shown later, this is sufficient for the MSE to drop below 10−5-10−6. For each trained 1D CNN model, the MSE for the validation set is calculated and used as the fitness value in the PSO. Then the hyperparameters of the 1D CNN models are adjusted based on the algorithmic details of the PSO, which can be found in Appendix B. After 50 steps of the PSO, the 1D CNN models stop improving and the PSO calculation is terminated. The optimized numbers of kernels of the final 1D CNN model are 235 and 125, with kernel sizes of 16 and 7, for the first and second layer, respectively. The 1D CNN model described above contains a total of 530,258 trainable parameters. This number is taken as a reference and the hyperparameters of all other 13 ANN models are selected to generate models with approximately the same number of parameters. Note, however, that according to the formulas given above, it is not possible to set the hyperparameters of all models to achieve exactly the desired number of trainable parameters. Therefore, some variation in the number of trainable parameters across the reported ANN models is to be expected.
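The PSO loop itself can be sketched generically; in the sketch below the fitness function, which in the actual workflow is the validation MSE of a freshly trained 1D CNN, is replaced by a toy quadratic, and the inertia/acceleration coefficients are conventional illustrative values rather than those used in the paper:

```python
import numpy as np

def pso_minimize(f, bounds, n_particles=3, n_steps=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    """Global-best particle swarm optimization (a generic sketch)."""
    rng = np.random.default_rng(seed)
    lo, hi = np.array(bounds).T
    dim = len(lo)
    pos = rng.uniform(lo, hi, size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([f(p) for p in pos])
    g = pbest[np.argmin(pbest_val)].copy()      # global best position
    for _ in range(n_steps):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (g - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([f(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        g = pbest[np.argmin(pbest_val)].copy()
    return g, pbest_val.min()

# Toy fitness with minimum at (16, 7), loosely mimicking a kernel-size search
best, val = pso_minimize(lambda p: (p[0] - 16) ** 2 + (p[1] - 7) ** 2,
                         bounds=[(1, 32), (1, 32)])
```

Only three particles are used because, in the actual workflow, each fitness evaluation is a full training run of a 1D CNN.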
Even though the number of trainable parameters is straightforwardly connected to the hyperparameters of each ANN architecture, a grid search is performed over the hyperparameters to understand the sensitivity of the model performance to small variations in the hyperparameters. In all models containing (bidirectional) recurrent layers, the number of units is scanned from 5 to 100 in steps of 5. Additionally, in all ANN models containing both 1D CNN and (bidirectional) recurrent layers, the number of kernels and the kernel sizes of the 1D CNN layer are taken to be the same as in the 1D CNN model, while the number of units in the recurrent layer is scanned from 5 to 100 in steps of 5. All the models are trained, validated, and tested using the data set for the symmetric spin-boson model. The Keras software package and the Adam algorithm are used for all ANN calculations. The learning rate is fixed to 1.0·10−4, the batch size is set to 128, and the training of each model is performed for 30 epochs.
Usually more than one set of hyperparameters generates ANN models with a number of trainable parameters close to the target number. In such cases, models with a number of trainable parameters within ±5% of the reference number are considered. Among the selected models, those that perform significantly differently from the average model are discarded. The number of outliers in each case is found to be small. We attribute the outliers to the finite training time and the nonuniform decay of the MSE during training, which in the case of RNNs is a manifestation of the vanishing gradient problem. We will return to this issue in Sec. IV. The model with the number of trainable parameters closest to the target number is saved and its performance is reported.

Training and validation curves
To illustrate the learning process of the ANN models, in Fig. 3 we plot the training and validation curves for several such models. Such curves are a widely used tool to examine the performance of supervised learning algorithms. In general, the learning is found to be stable in each case. The apparent noisiness of the data should be attributed to the logarithmic scale of the vertical axis. Several conclusions can be drawn by examining the curves shown in Fig. 3. Firstly, none of the models overfits, as illustrated by the lower and, in general, decaying validation MAEs (Fig. 3, right panel) compared to the training MAEs (Fig. 3, left panel). Secondly, the learning process is fast even with the learning rate of 1·10−4. It should be noted that decreasing the learning rate to 1·10−5 does not result in a noticeable improvement in accuracy for any of the ANN models studied in this work, but increasing the learning rate to 1·10−3 usually results in an increased MAE. Furthermore, the FFNN model reaches its lowest MAE rapidly, in less than 10 training epochs, while the MAE of the other models, e.g., 1D CNN and B(LSTM,GRU), continues to decay over all 30 shown training epochs. The same behavior is manifested in the validation MAEs. This suggests that the 30 training epochs used in this work are justified for the 1D CNN and other convolutional and recurrent models but seem unnecessarily many for the FFNN model. This observation further illustrates the efficiency of simple FFNN models. We note that in the case of the BRNN model the validation MAE exhibits large-magnitude changes over the course of training. This can be attributed to the vanishing gradient problem of vanilla RNNs. It should be noted that, unlike recurrent neural networks, the KRR (and FFNN) models studied in this work require a fixed-size input. To enable the comparison between the ANN and KRR models studied in this work, we focused on input σz(t) trajectories of the same length.
We note that extending the RNN models built in the present Article to variable-size input is straightforward and will be discussed elsewhere.
In KRR models, optimization amounts to finding the regression coefficients α as shown in Eq. (35). Therefore, one can interpret each regression coefficient α_i as a trainable parameter. The total number of such coefficients is equal to the number of elements of the training set, which in this work is 72,000 for each of the symmetric and asymmetric spin-boson models. This number of trainable parameters of the KRR models is far smaller than the target number of trainable parameters for the ANN models. Moreover, in KRR the number of trainable parameters is, by construction, set by the size of the training set, whereas in ANN models it is adjustable. This further makes a faithful comparison of the fundamentally different ML approaches studied in this work non-trivial. As will be shown in the following, the smaller number of trainable parameters of KRR does not adversely impact the performance of the KRR models.
KRR approaches have an analytical solution for the optimal parameters (regression coefficients α), but they still require adjusting a (small) number of hyperparameters. All KRR models used here contain the regularization hyperparameter λ, which is the only hyperparameter of KRR-L. The other KRR models have additional hyperparameters in their kernel functions: all nonlinear models have the length-scale hyperparameter σ, KRR-Mn also has the integer hyperparameter n (not optimized here), and KRR-DP has two more hyperparameters compared to KRR-G: p and σ_p. Due to the small number of hyperparameters, they are optimized on a logarithmic grid, as described elsewhere, 154 for all models except KRR-DP, which has three hyperparameters. The hyperparameters of KRR-DP are optimized using the tree-structured Parzen estimator 162 via an interface to the hyperopt package. 163 All hyperparameters of the KRR models are optimized on the data set for the symmetric spin-boson model.
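Hyperparameter optimization on a logarithmic grid can be sketched generically; the example below uses a Gaussian-kernel KRR on toy data with illustrative grid ranges and is not MLatom's implementation:

```python
import numpy as np

def krr_val_mse(X_tr, y_tr, X_val, y_val, sigma, lam):
    # Gaussian-kernel KRR: train on (X_tr, y_tr), return validation MSE
    d2 = np.sum((X_tr[:, None, :] - X_tr[None, :, :]) ** 2, axis=-1)
    K = np.exp(-d2 / (2 * sigma ** 2))
    alpha = np.linalg.solve(K + lam * np.eye(len(y_tr)), y_tr)
    d2v = np.sum((X_val[:, None, :] - X_tr[None, :, :]) ** 2, axis=-1)
    return np.mean((np.exp(-d2v / (2 * sigma ** 2)) @ alpha - y_val) ** 2)

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (120, 3))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.standard_normal(120)   # toy target
X_tr, y_tr, X_val, y_val = X[:80], y[:80], X[80:], y[80:]

# Logarithmic grids over the two hyperparameters
grid = [(s, l) for s in np.logspace(-2, 2, 9) for l in np.logspace(-8, 0, 9)]
best = min(grid, key=lambda sl: krr_val_mse(X_tr, y_tr, X_val, y_val, *sl))
```

Each grid point costs one O(N³) solve, which is why a coarse logarithmic grid suffices for the one or two kernel hyperparameters.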
The hyperparameters of each ML model are optimized only for the data set for symmetric spin-boson model as already indicated above. The optimized models are then re-trained on the asymmetric spin-boson data set keeping the hyperparameters unchanged. MLatom software package 154,155 is used for all KRR calculations.

A. Symmetric spin-boson model
The details and performance of all ANN and KRR models are summarized in Tables I and II, respectively. As seen from Table I, the two best performing ANN models on the symmetric spin-boson data set are the convolutional LSTM (CLSTM) and convolutional bidirectional LSTM (CBLSTM), while the two least accurate ANN models are the unidirectional and bidirectional vanilla RNN models. The worse performance of simple vanilla RNN layers compared to LSTM and GRU layers is an expected result. It has been observed in other applications that, in contrast to LSTM and GRU models, the performance of vanilla RNN models degrades with increasing length of the input sequence. 164,165 We note that the weak performance of the RNN models is observed despite the increased number of units, or length of the hidden state vector, compared to the other recurrent NN models. The better performance of convolutional recurrent NN models is an illustration of the power of sub-sampling the initial input data, which is then more efficiently learned by the LSTM layer. Comparing the unidirectional or bidirectional recurrent NN models, whether taken separately or as the second layer on top of a 1D CNN layer, confirms the trend: the best performing model is a (C,B,CB)LSTM model, followed by (C,B,CB)GRU and (C,B,CB)RNN. It is also worth noting that recurrent NN models based on bidirectional layers do not outperform their unidirectional counterparts. This is observed both when unidirectional recurrent models are compared to bidirectional recurrent models and when convolutional unidirectional recurrent models are compared to the corresponding convolutional bidirectional recurrent models.
Focusing on the KRR models, whose details and performance are summarized in Table II, we note that these models generally outperform the ANN models, with the exception of the KRR model with the linear kernel. Furthermore, KRR-L is the worst performing of all ML models studied in the present work. This is, perhaps, not surprising: the quantum dynamics of such a complex system as the spin-boson model is highly non-trivial and cannot be captured by simple linear regression.
The differences in accuracy between the ML models noted above are, however, not significant in the case of the symmetric spin-boson model. This is illustrated in Fig. 4, where the exact population difference σz(t) is compared to the ML-predicted σz(t). We stress that only a short σz(t) trajectory of t∆ = 4 is used as an input; the rest of the dynamics is predicted recursively as done in Refs. 32 and 33. Fig. 4 compares the performance of the 1D CNN model, whose number of trainable parameters is used as a reference for the other ANN models, to the best and worst performing ANN and KRR models. One notices that the prediction accuracy of the KRR-L model degrades slowly over time but still remains acceptable. This can only be seen in Figs. 4a and b, where the chosen system and bath parameters generate oscillatory dynamics of the RDM. In the cases of incoherent relaxation dynamics, the KRR-L results are indistinguishable from the exact HEOM dynamics.
We stress that even the worst performing ML model, KRR-L, as illustrated in Fig. 4, still provides acceptable accuracy. Therefore, we conclude that all ML models benchmarked in the present article can be trained to provide sufficient long-time prediction accuracy. The suspected drawback of the recursive propagation approach proposed in Refs. 32 and 33 and exploited in this article is that the prediction error could grow over many time steps, possibly leading to a significant accuracy loss in the long-time dynamics. As shown above, such errors can be made small enough to make long-time predictions reliable all the way until the dynamics reaches equilibrium.
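The recursive propagation scheme itself is simple to sketch; below, a trained network is replaced by a toy one-step predictor (a damped cosine obeys an exact two-term linear recurrence), so the sketch is illustrative only and the chosen decay rate, frequency, and window length are arbitrary:

```python
import numpy as np

def propagate(model, seed, n_steps):
    """Recursive propagation: predict the next point from the last T points
    of the trajectory, append it, slide the window, and repeat."""
    traj = list(seed)
    T = len(seed)
    for _ in range(n_steps):
        traj.append(model(np.asarray(traj[-T:])))
    return np.array(traj)

# Toy stand-in for a trained ML model: a damped cosine satisfies
# x(n+1) = a*x(n) + b*x(n-1), so the recursion has a closed-form check.
dt, gamma, omega = 0.1, 0.2, 2.0
a = 2 * np.exp(-gamma * dt) * np.cos(omega * dt)
b = -np.exp(-2 * gamma * dt)
model = lambda window: a * window[-1] + b * window[-2]

seed = np.exp(-gamma * dt * np.arange(41)) * np.cos(omega * dt * np.arange(41))
traj = propagate(model, seed, n_steps=160)   # 41 seed points -> 201 total
```

Because the one-step rule is exact here, the recursively generated trajectory reproduces the closed-form damped cosine; with a learned model, the per-step prediction error would instead accumulate over the recursion, which is the concern discussed above.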

B. Asymmetric spin-boson model
The results reported so far are encouraging but not discriminative. To unravel the differences between the studied ML models, we devise a more stringent test: all the ANN and KRR models are re-trained on the asymmetric spin-boson model data set without any hyperparameter adjustment. The increased difficulty of this test stems from the richer dynamics of the asymmetric spin-boson model, which might require more training parameters or even a different ANN architecture (more hidden layers, longer hidden state vectors, and longer memory T).
The proposed test clearly indicates that some models predict the long-time dynamics much more accurately than others. The performance of each model on the hold-out test set is reported in Tables I and II. There is a noticeable increase in the MAE of all studied ML models compared to the symmetric spin-boson case. Fig. 5 shows the ML-predicted dynamics of σz(t) of the most and least accurate ANN and KRR models for the asymmetric spin-boson model, as well as of the "reference" ANN model, 1D CNN. Analogously to the symmetric spin-boson model, the BRNN model is found to be the least accurate ANN model. The difference is, however, more dramatic. Clearly, the BRNN model is overall no longer an acceptable ANN model: even though it provides an accurate prediction for some parameters, see Figs. 5d and e, one can easily distinguish an unphysical oscillatory behavior building up beyond relatively short times of t∆ ≈ 10. All other ANN models provide more accurate and stable predictions without noise and unphysical artifacts. The CGRU model is the most accurate, matching the HEOM dynamics nearly exactly. The worst KRR model is, once again, the one with the linear kernel. It is, however, able to capture the dynamics qualitatively and, in some cases, fairly accurately. It seems to be less reliable in predicting the oscillatory dynamics typically observed in the low-temperature and small-reorganization-energy regimes. The most accurate KRR model is the one with the Gaussian kernel, which predicts the reference HEOM dynamics very accurately. The promise of the KRR-G model has already been pointed out in Ref. 33.

C. Training and prediction times
The accuracy and robustness of ML models are clearly very important, but our analysis would be incomplete without discussing the timings associated with all studied ML models. In large-scale applications, the choice of ML model is often a compromise between the accuracy and the complexity of the model. The latter directly affects the training as well as the prediction times, which are important especially for approaches based on the recursive application of ML models to generate the dynamics. Training times for all ML models are measured on a 2x20-core Intel Xeon Gold 6230 2.1 GHz processor with 192 GB of DDR4 memory and reported in Tables I and II (only times for training are shown; in practice, the cost is higher if hyperparameter optimization is performed). One should be mindful that different software packages are used for the ANN and KRR models, which makes a direct comparison somewhat dubious. Nonetheless, both keras and MLatom are state-of-the-art software packages widely used in the community, so we proceed with the comparison.
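The timing protocol amounts to wall-clock measurements around a single fit and around repeated one-sample predictions. The sketch below illustrates this with a scikit-learn KRR model on random data; it is not the authors' benchmark harness, and absolute numbers depend entirely on hardware.

```python
import time

import numpy as np
from sklearn.kernel_ridge import KernelRidge

# Illustrative random data; the actual benchmarks use spin-boson trajectories.
rng = np.random.default_rng(1)
X_train = rng.normal(size=(500, 20))
y_train = rng.normal(size=500)

# Training time: a single fit (hyperparameter optimization would multiply this).
t0 = time.perf_counter()
model = KernelRidge(kernel="rbf", alpha=1e-3).fit(X_train, y_train)
train_time = time.perf_counter() - t0

# Average single-step prediction time over repeated one-sample calls.
x_new = rng.normal(size=(1, 20))
n_repeats = 100
t0 = time.perf_counter()
for _ in range(n_repeats):
    model.predict(x_new)
predict_time = (time.perf_counter() - t0) / n_repeats
```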
According to Table I, the accuracy of the FFNN and CBGRU models for the asymmetric spin-boson model is the same, but the training time of the FFNN model is nearly four times shorter. Furthermore, for the symmetric spin-boson Hamiltonian, the FFNN model outperforms all ANN models comprised of only recurrent layers, shows performance similar to the convolutional recurrent and bidirectional recurrent neural networks, and does so faster: the FFNN model is three times faster to train than the best-performing CLSTM and CGRU models.
Continuing the comparison of the training times, we note that, due to their simplicity, the vanilla (unidirectional) RNN models are also fast to train. In contrast, bidirectional RNNs of all kinds are among the slowest models to train. Given that the bidirectional architecture does not reduce the error compared to the unidirectional counterparts, the increased training time of bidirectional layers makes such models disfavored for, at least, the problem of predicting the long-time dynamics of the spin-boson model. This result is interesting given that bidirectional recurrent neural networks have had some success in other domains, as described in the Introduction.
The KRR models train relatively fast compared to most of the ANN models, the exceptions being the FFNN and CRNN models.
The average single-step prediction times for the ANN and KRR models are shown in the last column of Tables I and II. It is clear that KRR models take less time to make a single time-step prediction than ANN models, although this can be partly attributed to software nuances. Expectedly, the FFNN model is the fastest ANN model while BGRU is the slowest. Similar to the training times, we observe that the (C,B)GRU models are slower than the (C,B)LSTM models, which is counterintuitive given that the GRU cell is less complex than the LSTM one. We attribute this inconsistency to the implementation details of both cells in keras. Overall, recurrent layers require more time to evaluate than other layers, as should be anticipated. Bidirectional recurrent neural networks take even more time to make a prediction than their unidirectional counterparts. Convolutional recurrent neural network models take less time to predict than their two-layer recurrent counterparts because they replace one computationally expensive recurrent layer with a cheaper 1D CNN layer.
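The relative complexity of the two cell types can be made concrete by counting trainable parameters: in the standard formulations, an LSTM cell has four gate blocks and a GRU cell three, each of size h×(d+h)+h for input dimension d and hidden size h. (keras's default `reset_after` GRU variant carries one extra bias vector per gate, which slightly changes the count but not the conclusion.) A quick sketch:

```python
def lstm_params(d, h):
    # 4 gate blocks (input, forget, cell, output); each has W: h x d,
    # U: h x h, and a bias of length h.
    return 4 * (h * d + h * h + h)

def gru_params(d, h):
    # 3 gate blocks (update, reset, candidate) in the classic formulation.
    return 3 * (h * d + h * h + h)

# Illustrative sizes: scalar population input, 64 hidden units.
d, h = 1, 64
# At equal hidden size, GRU carries exactly 3/4 of the LSTM parameter count,
# which is why its longer keras runtimes are counterintuitive.
ratio = gru_params(d, h) / lstm_params(d, h)
```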

D. Model efficiency
To illustrate the tradeoff between accuracy and training/prediction time, in Fig. 6 the training and single-step prediction times are plotted against the MAE for each ML model built in the present work. For ease of illustration, models of a similar kind are highlighted with the same color; e.g., all KRR models are depicted by green circles, magenta is used for all convolutional recurrent models, etc. Clearly, the KRR models are the most efficient, showing high accuracy while requiring the shortest time to train and make a prediction. Among the ANN models, the convolutional recurrent models are the most efficient. According to Fig. 6, the convolutional bidirectional NNs, specifically CBLSTM, constitute the second most efficient set of ANN models. In contrast, because of their longer training times and only marginally improved accuracy, the bidirectional recurrent ANN models are the least efficient.
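A plot of this kind can be reproduced from the tabulated MAEs and timings with a few lines of matplotlib. The numbers below are placeholders, not the values from Tables I and II, and only three models are shown for brevity.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so no display is required
import matplotlib.pyplot as plt

# Placeholder (MAE, training time in s) pairs -- NOT the paper's actual values.
models = {
    "KRR-G": (1e-4, 10.0),
    "CGRU": (2e-4, 300.0),
    "BRNN": (5e-3, 900.0),
}
colors = {"KRR-G": "green", "CGRU": "magenta", "BRNN": "blue"}

fig, ax = plt.subplots()
for name, (mae, t_train) in models.items():
    ax.scatter(mae, t_train, color=colors[name], label=name)
ax.set_xscale("log")   # errors and timings span orders of magnitude
ax.set_yscale("log")
ax.set_xlabel("MAE")
ax.set_ylabel("training time (s)")
ax.legend()
fig.savefig("efficiency.png", dpi=150)
```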

V. CONCLUSIONS
We performed a benchmark study of 22 supervised machine learning methods, comparing their ability to accurately forecast the long-time dynamics of a two-level quantum system linearly coupled to a harmonic bath. We illustrate that many of the studied ML methods can achieve good agreement with the exact population dynamics. Our study reveals that if the memory time of the quantum dynamical system under study is known, the models based on kernel ridge regression are the most accurate and should be preferred. The commonly employed Gaussian kernel is confirmed to be the most efficient, yielding accuracy on a par with the other nonlinear kernels while requiring somewhat less training time.
Convolutional recurrent neural networks appear to be the most promising ANN models. One scenario where such models are suitable is when the memory time of the problem is not available and flexible ML models allowing variable-size input are needed. Such an investigation is beyond the scope of this work and will be performed and reported in future studies. Based on the present study, we conclude that CGRU in particular is the most promising ANN model for long-time quantum dynamics simulations.

FIG. 6. The mean absolute single-step prediction error of σz(t) of the machine learning models studied in this work plotted against the corresponding training (a) and single-step prediction (b) times. The bidirectional recurrent neural network model is not shown because its error is much too large compared to the other models (see Table I and Fig. 5).
All ANN models are, however, less accurate than the KRR methods with nonlinear kernels and take more time to train and to make predictions. The often-quoted poor computational scaling of KRR with increasing training set size may pose technical problems for the large data sets that may be needed for larger systems, longer trajectories, and larger numbers of training trajectories. Nevertheless, many approaches have been suggested to mitigate this problem [166][167][168][169][170][171] and we are also working on the implementation of alternative approaches for large data sets.
One can reasonably anticipate that our results might depend on the strategy used to compare the ML models based on artificial neural networks. Here we chose to build and compare ANN models with approximately the same number of trainable parameters. There is, however, no proven best way to compare the performance of ANNs with fundamentally different types of layers. We believe that the strategy chosen in this work provides a faithful comparison of the ANN models. Additionally, our conclusion advocating the KRR methods holds irrespective of the strategy used to compare the ANN models, because the KRR models studied in this work are both faster and more accurate than the ANN models. Adding more layers to the ANN models would likely make them more accurate (although overfitting should be carefully checked in this case), but it would necessarily make such models more computationally expensive.
The FFNN model may seem to be a reasonable compromise between accuracy and computational cost. However, one needs to bear in mind that FFNN models require a fixed-size input. An input to an FFNN model can, of course, be padded with zeros, but this necessarily alters the representation of the underlying physics in the input data and, therefore, the performance of such models is not expected to be strong.
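The limitation can be stated concretely: if an FFNN expects windows of K points, a shorter trajectory must be padded to length K before it can be fed in, and the padded zeros are indistinguishable from genuine zero populations. A minimal numpy illustration (the window length K and the helper are hypothetical, chosen only for this sketch):

```python
import numpy as np

K = 16  # fixed input length a hypothetical FFNN was trained with

def pad_to_fixed_length(traj, K):
    """Left-pad a short trajectory with zeros to length K (truncate if longer)."""
    traj = np.asarray(traj, dtype=float)
    if len(traj) >= K:
        return traj[-K:]  # keep only the most recent K points
    return np.concatenate([np.zeros(K - len(traj)), traj])

short = np.array([0.9, 0.7, 0.5])  # only 3 recorded population values
padded = pad_to_fixed_length(short, K)
# The 13 leading zeros look like real zero populations to the network,
# which is how padding distorts the physics encoded in the input.
```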
Many popular ML methods have been studied in the present article. However, the field of artificial intelligence and machine learning is growing rapidly, making it nearly impossible to cover all recently developed algorithms. Our future work will focus on novel approaches to time-series modeling such as transformers. 172 Additionally, the CNN models employed in this work use kernels of the same size. Even though the kernel size was optimized for the given task, the use of inception modules 173 that include multiple filters of varying sizes might improve the performance of the CNNs and will also be tested.

TABLE III. Mean absolute prediction errors (MAEs), training, and average single-step prediction times of all KRR models used in this work. KRR-L, KRR-G, KRR-DP, KRR-E, and KRR-M (n = 1, 2, 3, 4) denote kernel ridge regression models with the linear, Gaussian, decaying-periodic, exponential, and Matern (n = 1, 2, 3, 4) kernels, respectively. The MAEs are shown for both the symmetric and asymmetric cases. KRR models are trained on 100% of the training data and on the sub-training data (80% of the training data).