Dimensionality Reduction of High Dimensional Time Series based on Artificial Neural Network

Molecular dynamics is a molecular simulation method which relies on Newtonian mechanics to simulate the motion of molecular system. In this method, some differential equations are integrated, and the results of integration are further processed to obtain the trajectory or momentum evolution process of some particles controlled by dynamic equations, and the technology of extracting the equilibrium state, motion process or related properties of classical particle system can be used. Through molecular dynamics simulation, we can obtain a series of properties of the system, which are widely used in experimental verification, theoretical derivation and other scenarios. Because it can obtain the dynamic state of macromolecules to make up for the limitations of these properties, it is widely used in the study of transmembrane proteins, polypeptide chains and other systems in life sciences. Through the kinetic path reduction of these systems, we can intuitively understand the characteristics of molecular folding, molecular motion and specific binding, which can play a very important role in the study of proteins and peptides. However, due to the characteristics of high-dimensional time series obtained by molecular dynamics simulation, it is difficult for us to pay attention to the collective state or characteristic process of the whole system in a non-equilibrium state or slow process. This is due to the difficulty in data processing and the difficulty in obtaining its characteristic function. This makes it very difficult to study the dynamic process of the whole system, especially the dynamic process at the intermediate non-equilibrium moment. It is difficult to solve this kind of problem by conventional methods, and only a few special simple systems can be solved by experience. Therefore, it is of great significance to find a method to obtain the characteristic function of the system through the trajectory obtained by molecular dynamics, and then reduce the molecular dynamics path. In order to solve this scientific problem, researchers focus on machine learning. In this study, machine learning method will be used to solve the overall non-equilibrium state of the system or the collective state of the slow process in molecular dynamics simulation. Firstly, we use this method to solve a simple one-dimensional four well model. By this method, we obtain a series of characteristic functions describing the motion process of the model. By sorting the eigenvalue contributions, we obtain some main characteristic functions describing the system. It includes the motion description of Markov smooth transition state and the motion description of four potential wells. At the same time, we use the traditional transition probability matrix to calculate. The difference between the characteristic function obtained by machine learning and the traditional method is very small, but the calculation method is simpler and more universal. After that, we apply the method to the actual scene. By solving the molecular dynamics simulation of alanine dipeptide structure in polymer protein molecule, the characteristic function of dihedral angle folding of alanine dipeptide structure was preliminarily calculated. The results were consistent with the traditional method.


Introduction
Dimension reduction refers to mapping the original high-dimensional data such as high-dimensional time series to low latitude data space through some method, usually a mapping method. It is difficult to avoid losing the information of original high-dimensional data space in the process of dimensionality reduction. Therefore, a successful data dimension reduction is to eliminate redundant data and noise in the original high-dimensional data space as much as possible. The principal component analysis method commonly used in statistical and instrumental analysis is the most commonly used dimension reduction method. It represents the data in high-dimensional data space in low-dimensional data space by some linear mapping projection. It is usually used to observe some information between high-dimensional data which is inconvenient to observe in high dimension, such as: it can be used to measure some cancerous biological tissue by mass spectrometry, and compare with the results of normal tissue. This is often difficult to observe in mass spectrometry data, but it can be seen directly whether the test method has the ability to distinguish samples significantly by PCA.
The result of molecular dynamics simulation is also such a high-dimensional time series. Because of the characteristics of its high-dimensional time series, it is unrealistic to analyze the whole slow process directly, so for such data processing, we need to use the means of data dimension reduction. By restoring the trajectory data of high latitude to the characteristic function describing the essence of its motion characteristics, we can intuitively analyze its overall motion.
At present, the most basic method for high-dimensional time series of molecular dynamics is the transfer operator theory. The theory of transition operator is an iterative method which depends on the properties of transition probability matrix and Markov chain, and obtains the eigenfunction and eigenvalue through the diagonalization of matrix. The numerical solutions of such complex differential equations play an important role in many fields, such as molecular dynamics, fluid dynamics, mechanical and electrical engineering and physics. These systems usually exhibit multi-scale behavior, which may be due to the coupling of subsystems with different time scales, such as fast electrical and slow mechanical components. Or because of the inherent characteristics of the system itself, such as rapid vibration and slow conformational changes. [1][2][3] However, from the perspective of computation, due to the 'curse' of high latitude data, it is usually not feasible or too difficult to use the method based on transfer operator to analyze such problems. This method can only work well in very simple models or special models. One way to avoid this is to project the dynamics of the high-dimensional system onto the low dimensional space, and then analyze the simplified system. [3] In order to solve this problem, researchers have proposed Time delay independent component analysis (TICA), Variational method of conformational dynamics (VAC) and dynamic mode decomposition (DMD). [4][5][6][7][8][9][10] In this study, the whole state of the system in the molecular dynamics simulation will be solved by artificial neural network. Firstly, we use this method to solve a simple one-dimensional four well model. By this method, we obtain a series of characteristic functions describing the motion process of the model. By sorting the eigenvalue contributions, we obtain some main characteristic functions describing the system, it includes the motion description of Markov smooth transition state and the motion description of four potential wells. At the same time, we use the transfer probability matrix to verify the calculation, which depends on the properties of Markov chain. Markov chain is a stochastic model describing a series of possible events, in which the probability of each event depends only on the state reached in the previous event. In a continuous time, this is called a Markov process. It was named after Russian  [10] the transfer probability of several steps can be calculated by the properties of Markov chain and one-step transfer matrix. The difference between the feature function obtained by machine learning and the traditional method is very small, but the calculation method is more simple and more universal.

Result and Discussion
Firstly, we choose the simple 1d-4 potential well model [9] as the research object. It is discretized into 100 points between -1 and 1, and a transformation matrix is defined so that there is a non-zero probability of moving left, moving right or staying at the current point at each point. The probability can be obtained by the following formula.
Where Vj and Vi represent the central potential energy of the discrete point, C is the normalization factor, and the synthesis of transition probability is 1. If two particles are not adjacent, the transition probability Pij = 0. In this way, we construct the transition probability matrix. In the treatment of transfer operator matrix method, we use the properties of Markov chain.
Markov chain is a stochastic model describing a series of possible events, in which the probability of each event only depends on the state reached in the previous event. In continuous time, this is called a Markov process. It is named after Russian mathematician Andrey Markov [10] through the properties of Markov chain and one-step transition matrix, the transition probability of several steps can be calculated. We use transfer operator matrix method and machine learning method to simulate trajectories with 100 time steps. The specific calculation code refers to the better encapsulated machine learning method---master library provided by reference [8].
We use the theoretical calculation of the transfer operator method and the artificial neural network with two hidden layers and 100 hidden neurons in each layer for machine learning. By comparison, we find that the state functions of the transition operator method and the machine learning method are almost the same, which shows that the characteristic functions of the machine learning method and the transfer operator method are in good agreement. This proves the accuracy of machine learning method to a certain extent, and it has the advantage that it can be applied to more systems. Compared with other machine learning methods given in the literature, machine learning method has the advantage of good spatial complexity.

Research Model
The model we used in this study is given by the following formula [9] Where V is the potential energy of the system and the domain of definition is -1 to 1. We divide the range of -1 to 1 into 100 parts and draw the following figure. The potential energy of each point represents the potential energy of the central point.

Treatment of 1d-4 potential well model by transfer operator matrix method
Firstly, we calculate the characteristic function of the model motion through the transfer operator theory. The complete calculation code is shown in Appendix 1. For the adjacent parts, the transition probability and potential energy have the following relation: The transition probability matrix P (1) can be constructed by the above formula. See Appendix 2 for details. The following one-step transition matrix can be obtained.  The first characteristic function corresponds to the smooth transition distribution and is equal to 1 for the transfer operator because of the normalized result in the calculation. The other characteristic functions represent the motion process between the potential wells, that is, when the system crosses a certain potential well, the motion state of the whole system is described. The barrier heights of these wells determine the related eigenvalues λ These four characteristic functions can better describe the overall motion characteristics of the original model. Of course, for the 50 * 50 transfer matrix, we can obtain at most 50 eigenvalues and their corresponding eigenfunctions, but the eigenfunctions ranked by eigenvalues have little contribution to the overall motion of the system, so only the first four state functions with the largest contribution are listed here.

Treatment of 1d-4 well model by machine learning method
After that, we use machine learning method to deal with the system. In the process of processing, we directly use the potential energy model and one-step transfer matrix provided in 2.1 and 2.2. The time lag parameters are identical to those in 2.2. According to the learning ideas provided by reference [8] and the artificial neural network construction method provided by keras [11], we use the artificial neural network with two hidden layers and 100 hidden neurons in each layer to learn machine learning method. The results show that we choose three characteristic functions corresponding to the unbalanced over balance in 2.2, as shown in the following figure. By comparison, we find that the state functions of the transition operator method and the machine learning method are almost the same, which shows that the characteristic functions of the machine learning method and the transfer operator method are in good agreement. This proves the accuracy of machine learning method to a certain extent, and it has the advantage that it can be applied to more systems.
After that, we compared the calculation results of the model with those of the TICA method reported in the literature [12], and the results are still highly consistent. It is worth mentioning that when I try the TICA code given in the literature on my home computer, there will be memory overflow and incalculability, while machine learning method will not. This is because the spatial complexity of TICA method is exponentially higher than that of machine learning method, which makes many models have to be further optimized and simplified to run successfully, and machine learning method has obvious advantages in this aspect.

Analysis of alanine dipeptide by machine learning method
After the machine learning method was tested on a simple test model, we selected the alanine dipeptide (N-acetyl-L-alanine-N '-methylamide) as the research object, in which alanine dipeptide was in water phase. Alanine dipeptide is a simple 22 atom peptide, and is often used in a new standard system for biomolecular simulation. Because the molecular structure has four main chain dihedral angles which determine the configuration state, the conformation of the molecule can be determined by calculating the dihedral angle of the main chain. We use the molecular simulation results given by OpenMM software [8,13]. OpenMM is a software toolkit for performing molecular simulations on a series of high performance computing architectures. It is based on a hierarchical architecture: the lower layer acts as a reusable library that can be called by any application, while the higher layer forms a complete environment for running molecular simulation.
The software is used to simulate the condition [13][14]. The alanine dipeptide in water was simulated for 200 ns under the condition of T = 300, and the force field of amber99sb ildn was used to model. In the simulation parameters, K and P = 1 bar, Lennard Jones interaction smoothly switches to zero at the 1.4 nm cutoff, and the particle mesh Ewald is used for electrostatic treatment. The actual space cutoff value is 1.4 nm, and the space grid spacing between them is 0.12 nm. Save the configuration every 2 PS to generate a track containing 1000000 calculations. The influence of solvent molecules on peptide configuration is implicitly treated. In this case, the theoretical feature function of the transfer operator is not available, so we will directly simulate the machine learning method.
The results of machine learning method are superimposed on the Ramachandran graph. The Laplace diagram shows the conformation of amino residues, and describes whether the dihedral angle of amino acid residues in protein structure is in a reasonable region, which can test the modeling quality after modeling [16] As can be seen from figure 2.3.1, the first slow mode captures the transition along ϕ torsional motion, the second slow mode captures the transition between α and (β, P / /), while the second slow mode captures the transition between αL and γ. The transition conformation between α and (β, P / /) is shown in Fig. 5 [15]. Figure 5. Conformation of alanine dipeptide α, β and P / /. Among them, yellow is the central carbon atom, gray is the carbon atom, blue is the nitrogen atom, and red is the oxygen atom.
The conformations of the molecule of αL and γ are shown in Figure 6. Through these structural diagrams, we can clearly understand the path of molecular motion among several metastable conformations. Figure 6. Conformation of alanine dipeptide α, β And P / /. Among them, yellow is the central carbon atom, gray is the carbon atom, blue is the nitrogen atom, and red is the oxygen atom.
Compared with the results of KTICA method reported in the literature, its basic characteristics are completely consistent, but its time scale is generally about 20% [17], which needs to be further explored.

Conclusion
In this study, machine learning method will be used to solve the overall non-equilibrium state of the system or the collective state of the slow process in molecular dynamics simulation. Firstly, we use this method to solve a simple one-dimensional four well model. By this method, we obtain a series of characteristic functions describing the motion process of the model. By sorting the eigenvalue contributions, we obtain some main characteristic functions describing the system, it includes the motion description of Markov smooth transition state and the motion description of four potential wells. At the same time, we use the traditional transition probability matrix to calculate. The difference between the characteristic function obtained by machine learning and the traditional method is very small, but the calculation method is simpler and more universal.
After that, we apply the method to the actual scene. By solving the molecular dynamics simulation of alanine dipeptide structure in polymer protein molecule, the characteristic function of dihedral angle folding of alanine dipeptide structure was preliminarily calculated. We use the method reported in the literature to carry out the machine learning method, and process the results of molecular simulation of alanine dipeptide. The results of machine learning method are superimposed on the Ramachandran diagram, the first slow mode captures the transition along ϕ torsional motion, the second slow mode captures the transition between α and (β, P / /), while the second slow mode captures the transition between αL and γ. Results compared with the results of KTICA method reported in the literature, its basic characteristics are completely consistent, but its time scale is generally about 20%. This needs to be further explored.