Probing reaction channels via reinforcement learning

We propose a reinforcement-learning-based method to identify important configurations that connect reactant and product states along chemical reaction paths. By shooting multiple trajectories from these configurations, we can generate an ensemble of configurations that concentrate on the transition path ensemble. This configuration ensemble can be effectively employed in a neural network-based partial differential equation solver to obtain an approximate solution of a restricted backward Kolmogorov equation, even when the dimension of the problem is very high. The resulting solution, known as the committor function, encodes mechanistic information about the reaction and can in turn be used to evaluate reaction rates.


Introduction
The study of rare reactive events is a fundamental topic within the field of chemical physics.1 These reactions are dynamical processes characterized by the transition of a molecular system from one collection of meta-stable atomic configurations, i.e., a reactant, to another, i.e., a product. This metastability arises from the features of the potential energy surface: metastable states are configurations near local minima separated by high-energy saddle points. When the saddle points lie along the typical transition paths between reactants and products, they are referred to as transition states. Direct numerical studies via molecular dynamics (MD) simulations are typically computationally prohibitive because these reactions are rare relative to the timescales of the typical thermal fluctuations of the system.2
A primary quantity of interest that has been employed to infer a mechanistic understanding of reactive events is the committor, a function that maps the phase space of the system to the probability of reacting.3,4,6,7 The high dimensionality of this problem makes its inference difficult; however, a multitude of methods developed over the past decade have made the computation of this function tractable. Some notable examples include the string method5,6, diffusion maps8-11 and neural networks12-21.
The optimization of the committor can be framed as two separate but intertwined problems: (1) finding configurations with high reactive densities, and (2) fitting a nonlinear ansatz on those configurations to compute the committor. To solve the first problem, one needs a measure of the reactive density that depends on the committor itself, and to solve the second problem, one needs access to those configurations. One way to solve the first problem is to obtain an ensemble of reactive trajectories via Transition Path Sampling3,22-24, a Monte Carlo based method that enables the generation of new reactive trajectories from old ones. While this method provides access to the ideal set of configurations on which to compute the committor, it often suffers from low acceptance rates, resulting in long decorrelation times between subsequent trajectories, and further refinements based on approximations of the committor are required to enable efficient sampling. Beyond Transition Path Sampling, a wide range of methods have been developed in the past two decades to access rare but important configurations that encode information about reactive events. Examples include, but are not limited to, metadynamics25,26, the weighted ensemble method27,28 and variational enhanced sampling29. These methods rely on applying a carefully chosen bias potential to the original Hamiltonian along some low-rank ansatz of the reaction coordinate, often referred to as an order parameter, to generate more atomic configurations near the transition state. While these methods can obtain accurate estimates of thermodynamic and kinetic observables, they hinge strongly on the overlap between the order parameter and the reaction coordinate. As such, the resulting ensembles of configurations are often limited by the choice of the order parameter.
In this paper, we use a reinforcement learning (RL) based method to obtain an ensemble of configurations with high reactive densities. This is done through a two-step method in which we first find saddle points of the reactive probability density, which we refer to as connective configurations. These configurations can be thought of as the maxima of the reactive probability along each channel of the reactive process, and the optimization procedure allows for the discovery of saddle points along each reactive channel when multiple reactive pathways exist. The optimization proceeds through an RL algorithm in which an agent moves from one state to another by taking an action in order to achieve a certain goal. Each action is associated with a reward, and which action to take depends on a policy function designed to guide the agent toward the goal. In the context of searching for connective configurations, a state is an atomic configuration, and an action simply moves the agent from one configuration to another according to a policy that is updated (or learned) over time. The optimal policy, obtained after performing several RL episodes with each episode consisting of a sequence of actions, allows us (the agent) to move from any arbitrary configuration toward a connective configuration. In our RL algorithm, the reward function, which we describe in detail in Section 3, is chosen as a proxy for the true objective function to be maximized. While such RL based methods have been previously used to identify transition states30 and perform transition path sampling31-33 or cloning34,35, the proposed method has a different objective and yields different quantities of interest. Once we have obtained these configurations, we perform shooting operations from these points to obtain a set of configurations within a relatively small subspace of the configuration space, which we refer to as reaction channels. The notion of reaction channels is similar to the concept of transition tubes introduced in Ref. 7. Both are expected to concentrate on the reactive path ensemble.
Finally, once we obtain these configurations within the reaction channels, we train a feed-forward neural network (NN) to solve the backward Kolmogorov equation (BKE). While a range of methods have been proposed to approximate a committor using an NN, the method demonstrated in this work is unique in that it solves the exact BKE rather than the variational form13-16 or the Feynman-Kac form17,18. This form of optimization is more accurate; however, like other methods, it is strongly sensitive to the configurations on which it is trained. We find that training on samples obtained from the RL based method proposed in this work outperforms optimization on samples generated from a grid-based or high-temperature sampling method by at least one order of magnitude. Beyond the accuracy of the committor, the fidelity of this method is also demonstrated through accurate estimates of reaction rates, which are often non-trivial to converge.
The rest of the paper is organized as follows. In Section 2, we introduce terminologies and concepts that are relevant to the approach we take to study chemical reactions. We outline the basic scheme in some detail in Section 3. Section 3.1 is devoted to the details of the RL algorithm we use to identify connective configurations; in particular, we discuss how to define the reward function and design an effective policy. We discuss how to use an NN to obtain the values of the committor function at selected configurations generated within reaction channels in Section 3.2. We demonstrate the effectiveness of our approach using a few examples in Section 4. We conclude the paper with some additional perspectives in Section 5. Some of the computational details and discussions are provided in the appendix.

Preliminaries
In this section, we introduce terminologies and concepts needed to present our algorithms in subsequent sections. Specifically, we define a transition path and a reactive trajectory associated with an overdamped Langevin dynamics as well as the committor function associated with transition paths, and we describe how the reaction rate constant can be calculated from the committor function. We then show how the committor function can be computed using a feed-forward neural network.

Transition path and reactive trajectory
Let x(t) = (x_1(t), x_2(t), ..., x_d(t)) ∈ R^d be the coordinates of a system that satisfy an overdamped Langevin dynamics defined by

γ_i dx_i(t)/dt = −∂V(x(t))/∂x_i + ξ_i(t), i = 1, ..., d, (1)

where V : Ω → R is a potential energy function, γ_i is the friction coefficient for x_i, ξ_i(t) is white noise with mean ⟨ξ_i(t)⟩ = 0 and covariance ⟨ξ_i(t)ξ_j(t′)⟩ = 2β^{−1}γ_i δ(t − t′)δ_ij, and β is the inverse of the product of the temperature and Boltzmann's constant.
Let A and B be two disjoint metastable regions of interest in the configuration space Ω ⊂ R^d. They correspond to regions surrounding two distinct local minima of the potential energy V(x). We are interested in trajectories that start in A and terminate in B. Along such a trajectory, x(t) must escape from one metastable region and cross a transition region Ω \ (A ∪ B) before reaching the other metastable region. Such a trajectory is often referred to as a transition path. For chemical systems, a transition region is associated with a chemical reaction. A transition path is also referred to as a reactive trajectory.
The probability density of x follows the Boltzmann-Gibbs distribution p(x) = exp(−βV(x))/Z, where Z = ∫_Ω exp(−βV(x)) dx is a normalization constant, or the partition function. The probability of observing x in a transition region relative to the probability of observing x in a metastable region is very low. As a result, a transition path or reactive trajectory is a rare event and generally requires a very long period of simulation time to observe.
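The overdamped Langevin dynamics above can be simulated with a simple Euler-Maruyama integrator. The following is a minimal sketch of that scheme; the 1D double-well potential and all parameter values are our own illustrative placeholders, not the systems studied in this paper.

```python
import math
import random

def euler_maruyama(grad_v, x0, beta, gamma=1.0, dt=5e-3, n_steps=1000, seed=0):
    """Integrate gamma * dx_i/dt = -dV/dx_i + xi_i with Euler-Maruyama.

    The per-step noise has standard deviation sqrt(2*dt/(beta*gamma)),
    consistent with <xi_i(t) xi_j(t')> = 2 beta^-1 gamma delta(t-t') delta_ij.
    """
    rng = random.Random(seed)
    x = list(x0)
    sigma = math.sqrt(2.0 * dt / (beta * gamma))
    traj = [tuple(x)]
    for _ in range(n_steps):
        g = grad_v(x)
        x = [xi - g[i] * dt / gamma + sigma * rng.gauss(0.0, 1.0)
             for i, xi in enumerate(x)]
        traj.append(tuple(x))
    return traj

# 1D double well V(x) = (x^2 - 1)^2 as a stand-in potential.
grad_v = lambda x: [4.0 * x[0] * (x[0] ** 2 - 1.0)]
traj = euler_maruyama(grad_v, [1.0], beta=6.67, n_steps=2000)
```

At this inverse temperature the trajectory stays near the well it was started in, which is the metastability described above.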
Committor function. Reactive trajectories are not unique. In fact, a chemical reaction is often characterized by an ensemble of reactive trajectories. Along each reactive trajectory, we are particularly interested in configurations that are equally likely to evolve (or commit) to either one of the metastable regions A and B. Such likelihood can be characterized by what is known as a committor function. To give a precise definition of the committor function q(x), let τ_D(x) be the first hitting time of region D when the dynamics is initiated from x, i.e., τ_D(x) represents the time it takes for the chemical system to first enter region D when the dynamics starts from x. The committor function, denoted by q(x), is defined as the probability that a trajectory x(t), starting from a point x(0) = x within a given set Ω, reaches B before it reaches A, namely, q(x) = Prob(τ_B < τ_A | x(0) = x). It is well known that q(x) satisfies the backward Kolmogorov equation (BKE)

L q(x) := Σ_{i=1}^{d} γ_i^{−1} ( −∂V(x)/∂x_i ∂q(x)/∂x_i + β^{−1} ∂²q(x)/∂x_i² ) = 0 for x ∈ Ω \ (A ∪ B), with q(x) = 0 for x ∈ A and q(x) = 1 for x ∈ B. (2)

When the dimension of Ω is low (2 or 3), this partial differential equation can be solved numerically by standard methods such as the finite difference or finite element method. However, when the dimension of Ω is high, it is not practical to use these methods to solve for q(x). We will discuss an alternative approach that uses a feed-forward neural network to solve for q(x) in Section 2.2.

Probability of being reactive. We use ρ(x) to denote the probability of observing a reactive trajectory crossing x. It follows from the Markovian property of the underlying overdamped Langevin dynamics that ρ(x) can be written as the product of 1 − q(x), which describes the probability of the trajectory arriving at x from A without going through B (under the assumption of time reversibility at statistical equilibrium36), p(x), which represents the probability of being at x, and q(x), which describes the probability of reaching B first after leaving x, i.e.,

ρ(x) = (1 − q(x)) q(x) p(x). (3)
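As noted above, in low dimensions the BKE can be solved by standard discretization. A minimal 1D finite-difference sketch of q'' − βV'(x)q' = 0 with q = 0 and q = 1 at the two boundaries, solved by the Thomas algorithm; the double-well potential here is an illustrative assumption, not a system from this paper.

```python
import math

def committor_1d(dV, a, b, beta, n=400):
    """Solve q'' - beta*V'(x)*q' = 0 on [a, b], q(a)=0, q(b)=1,
    via central finite differences and a tridiagonal (Thomas) solve."""
    h = (b - a) / (n + 1)
    xs = [a + (i + 1) * h for i in range(n)]
    lo, di, up, rhs = [], [], [], []
    for x in xs:
        c = beta * dV(x) * h / 2.0
        lo.append(1.0 + c)      # coefficient of q_{i-1}
        di.append(-2.0)         # coefficient of q_i
        up.append(1.0 - c)      # coefficient of q_{i+1}
        rhs.append(0.0)
    rhs[0] -= lo[0] * 0.0       # boundary value q(a) = 0
    rhs[-1] -= up[-1] * 1.0     # boundary value q(b) = 1
    # Thomas algorithm: forward elimination, then back substitution.
    for i in range(1, n):
        w = lo[i] / di[i - 1]
        di[i] -= w * up[i - 1]
        rhs[i] -= w * rhs[i - 1]
    q = [0.0] * n
    q[-1] = rhs[-1] / di[-1]
    for i in range(n - 2, -1, -1):
        q[i] = (rhs[i] - up[i] * q[i + 1]) / di[i]
    return xs, q

# Symmetric double well V(x) = (x^2 - 1)^2 (illustrative choice).
xs, q = committor_1d(lambda x: 4.0 * x * (x * x - 1.0), -1.5, 1.5, beta=4.0)
```

For this symmetric potential the computed q is monotone and satisfies q(−x) = 1 − q(x), so the midpoint is the q = 0.5 isocommittor.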
A configuration x that has a high reactive density ρ(x) is of particular interest because it marks a transition region along a reactive trajectory associated with the overdamped Langevin dynamics. However, the definition of ρ(x) given in (3) requires the committor function q(x) to be known in advance. Because q(x) is generally difficult to calculate for an arbitrary x, it is not easy to calculate ρ(x) in practice.

Reaction rate. The committor function is used in Ref. 7 to introduce the notion of the current or flux associated with a reactive trajectory. At a configuration x along a reactive trajectory, the flux J(x) across x is defined as

J(x) = β^{−1} Γ^{−1} exp(−βV(x)) ∇q(x)/Z, (4)

where Γ = diag(γ_1, ..., γ_d) and Z is again the partition function.
If J(x) is known for all x along a dividing surface S that separates A and B, we can use it to evaluate the reaction rate constant κ via

κ = ∫_S J(x) · n_S(x) dσ(x), (5)

where n_S(x) is the unit normal vector of S at x.
Note that the main contribution to the integral (5) comes from configurations x along the dividing surface that have a relatively large flux magnitude |J(x)|. Because these configurations typically occupy a small area S′ on S for rare events7,36, we can focus on this area and approximate (5) by

κ ≈ ∫_{S′} J(x) · n_S(x) dσ(x). (6)

The area S′ can be defined by the intersection of S and the so-called reaction channel to be defined in Sec. 3. We should note that, as long as S′ can be easily identified and J(x) can be efficiently evaluated for all x ∈ S′, the formula (6) provides a more practical way to compute the rate constant than a brute force approach in which the rate constant is computed according to the alternative definition

κ = lim_{T→∞} N_T / T, (7)

where N_T is the number of trajectory segments that leave A and enter B within the time interval [0, T]7. In the latter approach, κ can be approximated by a direct simulation, i.e., we can generate a sufficiently long Langevin trajectory from a random starting point and count N_T. However, when the system contains a high barrier, it becomes extremely rare to observe a reactive trajectory even within a long simulation.
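The brute-force definition, counting A-to-B segments along one long trajectory, can be sketched as follows. The 1D double well, the regions A and B, and all parameters are illustrative assumptions, not this paper's setup.

```python
import math
import random

def count_transitions(beta, dt=1e-2, n_steps=200_000, seed=1):
    """Estimate kappa = N_T / T by counting A -> B crossings of a 1D
    double well V(x) = (x^2 - 1)^2 along one long overdamped Langevin
    trajectory (illustrative regions A: x < -0.8, B: x > 0.8)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * dt / beta)   # noise amplitude, gamma = 1
    x, last_region, n_t = -1.0, "A", 0
    for _ in range(n_steps):
        x += -4.0 * x * (x * x - 1.0) * dt + sigma * rng.gauss(0.0, 1.0)
        if x < -0.8:
            last_region = "A"
        elif x > 0.8:
            if last_region == "A":
                n_t += 1                 # completed an A -> B segment
            last_region = "B"
    return n_t / (n_steps * dt)          # rate estimate N_T / T

kappa_direct = count_transitions(beta=2.0)
```

At this modest barrier height the estimate converges quickly; at the higher β values used later in the paper, the same counter would see almost no events, which is exactly the limitation described above.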
If the dividing surface is not accessible, an alternative is to integrate the flux over the transition region, i.e.,

κ = β^{−1} ∫_{Ω \ (A ∪ B)} ∇q(x)^T Γ^{−1} ∇q(x) p(x) dx, (8)

for the systems considered here.
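In one dimension (γ = 1), the committor has the closed form q'(x) ∝ exp(βV(x)), and the transition-region integral for κ reduces to a simple quadrature. A sketch under an assumed double-well potential; the well and box locations are our own placeholders.

```python
import math

def rate_1d(V, beta, wells=(-1.0, 1.0), box=(-2.0, 2.0), n=4000):
    """1D transition-rate estimate with gamma = 1.
    q'(x) = exp(beta*V(x)) / C on the transition region between the wells,
    and kappa = beta^-1 * int q'(x)^2 p(x) dx with p = exp(-beta*V)/Z,
    which simplifies analytically to kappa = 1 / (beta * Z * C)."""
    def quad(f, a, b):
        # Midpoint rule with n panels.
        h = (b - a) / n
        return sum(f(a + (i + 0.5) * h) for i in range(n)) * h
    Z = quad(lambda x: math.exp(-beta * V(x)), *box)    # partition function
    C = quad(lambda x: math.exp(beta * V(x)), *wells)   # normalizer of q'
    return 1.0 / (beta * Z * C)

V = lambda x: (x * x - 1.0) ** 2   # illustrative double well, barrier height 1
kappa = rate_1d(V, beta=2.0)
```

For this potential at β = 2, the result is consistent in order of magnitude with a Kramers-type estimate and with direct transition counting.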

Solving committor function via deep learning
In recent years, deep learning has demonstrated remarkable progress in various domains of scientific exploration37-39, thanks to the exceptional approximation and generalization capabilities of neural networks (NNs)40. In particular, deep learning has emerged as a powerful tool for solving a wide range of partial differential equations (PDEs), including those formulated in high-dimensional spaces where conventional solvers such as the finite difference and finite element methods suffer from the curse of dimensionality41. Furthermore, NN-based solvers can easily be adapted to solve PDEs defined on irregular domains. These two key attributes of an NN-based PDE solver make it highly advantageous for the specific problem we aim to address in this study. As we will show in Section 4, an NN-based PDE solver enables us to solve a 66-dimensional BKE associated with a 22-atom molecule within a reaction channel that consists of an ensemble of configurations neither uniformly distributed nor organized in a regular domain in the configuration space. The NN-based PDE solver overcomes these complexities and enables us to handle this challenging scenario effectively.
Typically, in an NN-based PDE solver, the approximation to the solution of the PDE is represented as an NN q(x; θ) parameterized by a set of weights and biases denoted by a vector θ. The network takes x as the input and generates q(x; θ) as the output. The NN parameters are determined in an iterative training procedure that minimizes a loss function with respect to θ for different choices of the input x. To use deep learning to solve the BKE (2) on Ω, we define the loss function as

Loss(θ) = ∥L q(·; θ)∥²_{L²(Ω \ (A ∪ B))} + ℓ ( ∥q(·; θ)∥²_{L²(A)} + ∥q(·; θ) − 1∥²_{L²(B)} ), (9)

where ℓ is a penalty coefficient used to impose the boundary constraints. In practice, the L²-norms in (9) are evaluated by summing the loss on data points sampled randomly and uniformly in Ω, and L q is computed by auto-differentiation using advanced deep learning frameworks (e.g., PyTorch42). The minimization of (9) can be conducted by stochastic gradient descent (SGD), e.g., Adam43, a variant of SGD based on momentum. Similarly, when solving the BKE (2) on K sub-domains Ω_1, ..., Ω_K ⊂ Ω, the PDE becomes L(q) = 0 for x ∈ ∪_{s=1}^K Ω_s along with the boundary conditions, which is referred to as the restricted BKE. We can define the corresponding NN optimization problem by restricting the residual norm in (9) to the sub-domains, i.e.,

min_θ ∥L q(·; θ)∥²_{L²(∪_{s=1}^K Ω_s)} + ℓ ( ∥q(·; θ)∥²_{L²(A)} + ∥q(·; θ) − 1∥²_{L²(B)} ). (10)

When we have data points sampled within the sub-domains and within A and B, the L²-norms in (10) are replaced by empirical sums over these points, yielding the discrete loss (11) that is minimized in practice.
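The structure of such a residual-plus-penalty loss can be sketched in plain Python, with central finite differences standing in for automatic differentiation (1D, γ = 1; the sample points and the penalty value are placeholders of our own).

```python
def bke_residual(q, x, dV, beta, h=1e-4):
    """Residual of the 1D BKE operator L q = -V'(x) q'(x) + beta^-1 q''(x),
    with derivatives approximated by central finite differences."""
    dq = (q(x + h) - q(x - h)) / (2.0 * h)
    d2q = (q(x + h) - 2.0 * q(x) + q(x - h)) / (h * h)
    return -dV(x) * dq + d2q / beta

def bke_loss(q, interior, pts_a, pts_b, dV, beta, ell=10.0):
    """Mean squared PDE residual plus penalized boundary terms
    q|_A = 0 and q|_B = 1, mirroring the residual/penalty split above."""
    res = sum(bke_residual(q, x, dV, beta) ** 2 for x in interior) / len(interior)
    bc = (sum(q(x) ** 2 for x in pts_a) / len(pts_a)
          + sum((q(x) - 1.0) ** 2 for x in pts_b) / len(pts_b))
    return res + ell * bc

# Sanity check: with V = 0 on [0, 1], the exact committor is q(x) = x,
# so the loss evaluated on it should be (numerically) zero.
loss = bke_loss(lambda x: x, [0.25, 0.5, 0.75], [0.0], [1.0],
                dV=lambda x: 0.0, beta=1.0)
```

A candidate that violates the PDE, e.g. q(x) = x² on the same data, produces a visibly larger loss, which is what drives the training.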

Methodology
As we indicated in Sec. 2.1, configurations x with high reactive density ρ(x) are of interest because they mark a transition region in which reactive trajectories are more likely to be observed. Intuitively, if we shoot a trajectory from a configuration x with a high reactive density ρ(x) and initiate it with a random momentum, the trajectory is likely to stay within a region through which reactive trajectories pass. Even though such a trajectory may not itself be reactive, the configurations along it occupy a small subspace that is likely to contain several reactive trajectories. We will refer to the subspace formed by these configurations as a reaction channel. Because configurations within such a subspace are likely to have high reactive flux J(x), we will focus on these configurations and solve a restricted BKE within a reaction channel using the neural network technique discussed in Sec. 2.2 to obtain an approximate committor function and its gradient at configurations within the channel. With these, one can approximately calculate the reaction rate constant by evaluating (6). We should note that the concept of a reaction channel introduced here is similar in spirit to the notion of a transition tube introduced in Ref. 7. A transition tube is defined as an ensemble of regions on non-intersecting dividing surfaces between two metastable regions A and B that have localized flux. Because a transition tube is characterized by configurations with relatively high reactive flux, which depends on the unknown committor function, it is not easy to identify directly. Although the central curve within the transition tube can be approximated by the minimum energy path, which can be computed by the string method5, defining the region of the tube is still not trivial.
Because a reaction channel is generated by shooting trajectories from a single configuration, it is relatively easy to produce as long as we can select a proper configuration to shoot from. Ideally, that configuration should be one with a high reactive density ρ(x). However, because ρ(x) is defined in terms of the committor function, configurations with high ρ(x) cannot be identified directly without solving the original BKE (2). In the following, we present a reinforcement learning based technique to identify configurations that are likely to have a high ρ(x); we refer to these configurations as connective configurations.
To create reaction channels, we start by performing a shooting procedure from connective configurations. Within each reaction channel, we can determine the committor function on each configuration by solving a restricted BKE using a neural network. Using the NN solution and its gradient, we can then calculate the reactive flux for every configuration within the reaction channels. As we will see in the next section, the reaction channel generated by shooting trajectories from connective configurations is likely to contain configurations with relatively high reactive flux. This is sufficient to provide a good estimate of statistics, such as the rate, even if not all configurations within the channel have high reactive flux.

Seeking connective configurations via reinforcement learning
In this section, we show how to use a reinforcement learning (RL) method to identify connective configurations. Our basic strategy is to treat each configuration as a state x_t and train an agent to take a sequence of actions {a_0, a_1, ..., a_{n−1}} until it ultimately reaches a desired state x_n that corresponds to a connective configuration. The sequence of state-action pairs {(x_t, a_t)}, t = 0, 1, 2, ..., n − 1, is an instance of a policy π that the agent follows, which is initially not optimal. However, over a multi-episode learning process, the policy is gradually improved via a policy gradient, based on the feedback the agent receives from the environment, i.e., from the configurations it visits. An optimal policy allows the agent to move from an arbitrary state to the desired state efficiently.
In an RL algorithm, an action a that an agent takes at a particular state x is associated with a reward r(x, a) that measures the effectiveness of that state-action pair. The policy an agent follows at a particular state x is often designed to maximize not just the reward r(x, a), but the expectation of a sequence of discounted future rewards, i.e.,

Q^π(x, a) = E_τ [ Σ_{t=0}^{∞} η^t r(x_t, a_t) | x_0 = x, a_0 = a ], (12)

where τ is an instance of a policy π that is specified by a sequence of state-action pairs τ := {(x_0, a_0), (x_1, a_1), ...}, for some discount factor 0 < η ≤ 1. Such an expectation of discounted future rewards is often referred to as a Q-value function, or Q-function in short, and denoted by Q^π(x, a). We use A(x) to denote the action that maximizes Q^π(x, a) for a given policy π, i.e.,

A(x) = argmax_a Q^π(x, a). (13)

It is well known that the optimal Q-function Q*(x, a) := max_π Q^π(x, a) satisfies the Bellman equation44

Q*(x, a) = E_{x′} [ r(x, a) + η max_{a′} Q*(x′, a′) ], (14)

where the expectation is taken with respect to the conditional probability of the new state x′ given x and a. Associated with this optimal Q-function is the optimal action

A*(x) = argmax_a Q*(x, a). (15)

Because the state space that contains x and the action space that contains a are not discrete in our problem, and because the analytical form of the optimal Q-function is generally unknown, it is difficult to solve the Bellman equation (14) or the optimization problem (15) directly. Finding the optimal Q-function is often an intractable problem, and approximate solutions are commonly used instead. Several methods, such as the deep deterministic policy gradient45 (DDPG) method and the Twin Delayed DDPG46 (TD3) method, have been developed to solve the problem approximately. In these methods, Q(x, a) and A(x) are represented by neural networks parameterized by Ψ and Φ, respectively. These neural networks are trained on a data set P that contains a collection of quadruples (x, a, r(x, a), x′), where x′ denotes the new state reached by the agent after taking the action a.
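The Bellman update underlying Q-learning is easiest to see in the tabular setting. The following toy, a 1D chain with discrete left/right actions and a reward at a goal state, is our own illustration and differs from the continuous state and action spaces used in this work.

```python
import random

def q_learning_chain(n_states=6, goal=5, episodes=2000, eta=0.9, alpha=0.5, seed=0):
    """Tabular Q-learning on a 1D chain. Each step applies the sampled
    Bellman update Q(x,a) <- Q(x,a) + alpha*(r + eta*max_a' Q(x',a') - Q(x,a))."""
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(n_states)]   # actions: 0 = left, 1 = right
    for _ in range(episodes):
        x = rng.randrange(n_states)
        for _ in range(20):
            # Epsilon-greedy action selection (20% exploration).
            if rng.random() < 0.2:
                a = rng.randrange(2)
            else:
                a = max((0, 1), key=lambda act: Q[x][act])
            x_new = max(0, min(n_states - 1, x + (1 if a == 1 else -1)))
            r = 1.0 if x_new == goal else 0.0
            Q[x][a] += alpha * (r + eta * max(Q[x_new]) - Q[x][a])
            x = x_new
            if x == goal:
                break
    return Q

Q = q_learning_chain()
# Greedy policy extracted from the learned Q-table.
policy = [max((0, 1), key=lambda act: Q[s][act]) for s in range(len(Q))]
```

After training, the greedy policy moves right from every non-goal state, the discrete analogue of an action field whose arrows all point toward the desired configuration.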
Roughly speaking, the parameters Ψ and Φ are optimized in an alternating fashion by optimizing objectives derived from (15) and the Bellman equation. We have included a diagram in Appendix A to provide a visual illustration of the Q-learning training process. To be specific, the training process solves the following optimization problems alternately:

max_Φ E_{x ∼ P} [ Q(x, A(x; Φ); Ψ) ] (16)

and

min_Ψ E_{(x, a, r, x′) ∼ P} [ ( Q(x, a; Ψ) − r(x, a) − η Q(x′, A(x′; Φ); Ψ) )² ]. (17)

Reward. Note that the solution of the second problem (17) depends on how the reward function r(x, a) is defined. Because our ultimate goal is to identify connective configurations that are expected to have a high reactive density ρ(x), ideally, we would like to use ρ(x′), where x′ = x + a, as the reward function. The difficulty is that, as we indicated earlier, ρ(x) is defined in terms of the committor function q(x), which is unknown in general. Therefore, setting r(x, a) to the exact ρ(x′) is not practical.
However, as q(x) is defined as the probability of a trajectory starting from a point x(0) = x and reaching B before reaching A, we can estimate this probability numerically by shooting trajectories from x and performing a statistical analysis of these trajectories. To be specific, we propose to use a shooting procedure to estimate q(x) by counting the number of trajectories originating from x with a random momentum and terminating in one of the metastable regions A or B within a fixed number of time steps.
We shoot N trajectories from x, each with a random momentum or force. Let N_A be the number of trajectories reaching A rather than B within a fixed number of time steps T, and N_B the number of trajectories reaching B rather than A. Typically, T is far smaller than the time scale required to observe a reactive trajectory. Clearly, N_A + N_B ≤ N, since some trajectories may hover around the starting configuration for a long time and never reach either A or B within the time interval T. We can view N_B/(N_A + N_B) as an approximation to q(x) and N_A/(N_A + N_B) as an approximation to 1 − q(x). Consequently, we can use

ρ̃(x) = [N_A N_B / (N_A + N_B)²] exp(−βV(x))/Z (18)

as a proxy for ρ(x). As a result, the reward function associated with the state-action pair (x, a) can be defined as

r(x, a) = ρ̃(x′), (19)

where x′ = x + a is the new state. In practice, we can ignore the constant Z in (19).
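The shooting-based reward can be sketched as follows. The 1D double well, the regions A and B, and the use of overdamped dynamics in place of a random-momentum shot are all illustrative assumptions of this sketch.

```python
import math
import random

def shooting_reward(x0, beta=10.0, n_traj=200, n_steps=500, dt=1e-2, seed=0):
    """Estimate q(x0) and the reward proxy (1-q)*q*exp(-beta*V(x0)) by
    shooting short overdamped Langevin trajectories from x0.
    1D double well V(x) = (x^2 - 1)^2; A: x <= -1, B: x >= 1 (illustrative)."""
    rng = random.Random(seed)
    sigma = math.sqrt(2.0 * dt / beta)
    n_a = n_b = 0
    for _ in range(n_traj):
        x = x0
        for _ in range(n_steps):
            x += -4.0 * x * (x * x - 1.0) * dt + sigma * rng.gauss(0.0, 1.0)
            if x <= -1.0:
                n_a += 1         # committed to A within the time budget
                break
            if x >= 1.0:
                n_b += 1         # committed to B within the time budget
                break
    if n_a + n_b == 0:
        return 0.0, 0.0
    q_hat = n_b / (n_a + n_b)    # estimate of q(x0); 1 - q_hat = n_a/(n_a+n_b)
    v0 = (x0 * x0 - 1.0) ** 2
    return q_hat, (1.0 - q_hat) * q_hat * math.exp(-beta * v0)

q_hat, reward = shooting_reward(0.0)   # shoot from the barrier top
```

Shooting from the barrier top of a symmetric well yields q near 0.5, where the product (1 − q)q, and hence the reward, is maximal.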
Once the neural networks are properly trained, the optimal action to be taken at each x satisfies

A(x; Φ*) = argmax_a Q(x, a; Ψ*). (20)

RL is a multi-episode iterative learning process. In each episode, a random configuration x_0 is chosen to start the learning process. A neural network that represents the action function A(x) takes the configuration as the input and generates an action a_0 as the output. Taking such an action yields a new configuration x_1, for which a reward r can be obtained by shooting several trajectories from x_1 and evaluating (19). This process can be repeated several times until we generate a sequence of state-action pairs {(x_t, a_t)}. Figure 1 gives a schematic illustration of the process of generating such a sequence within a single episode: after initialization, (2) Evolution: the parameterized policy yields the action a_t after observing the state x_t, and after taking the action the agent moves to a new configuration x_{t+1} = x_t + a_t; (3) Interaction: we shoot multiple trajectories from x_{t+1} and evaluate the reward r(x_t, a_t). The quadruple (x_t, a_t, r(x_t, a_t), x_{t+1}) is used in Q-learning to update the policy.
The generated sequence of state-action pairs, together with the rewards evaluated at those states, forms the training data set P that we use to optimize the neural network representations of the Q-function Q(x, a) and the action A(x) by solving the optimization problems (16) and (17) in an alternating fashion.
We use the TD3 method to optimize A(x) and Q(x, a). This variant of the DDPG method uses a variety of techniques to improve the stability of the training process and mitigate the risk of over-estimating the Q-value function. In particular, TD3 uses the moving average of parameters associated with multiple neural networks to solve (16) and (17). Furthermore, TD3 uses a "delayed" policy update, in which the policy network (used to determine the optimal action) is only updated after a certain number of Q-network updates. This delayed updating scheme helps stabilize the training process and reduces the likelihood of the policy network becoming stuck in an undesirable local minimum.
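The TD3 ingredients mentioned here, soft (moving-average) target updates, the clipped double-Q target built from two critics, and the delayed policy update, can be sketched schematically; all numerical values below are placeholders.

```python
def polyak_update(target, source, tau=0.005):
    """Soft (moving-average) update of target-network parameters:
    theta_target <- tau * theta + (1 - tau) * theta_target."""
    return [tau * s + (1.0 - tau) * t for s, t in zip(source, target)]

def td3_target(r, q1_next, q2_next, eta=0.99):
    """Clipped double-Q target: take the minimum of the two critic
    estimates of the next state-action value to curb over-estimation."""
    return r + eta * min(q1_next, q2_next)

# Delayed policy update: refresh the actor only every `policy_delay`
# critic updates.
policy_delay = 2
updated_steps = [step for step in range(10) if step % policy_delay == 0]

target = polyak_update([0.0, 0.0], [1.0, 2.0], tau=0.1)
y = td3_target(1.0, q1_next=2.0, q2_next=1.5, eta=0.9)
```

In a full implementation these three pieces sit inside the training loop of Algorithm 1: the target y drives the critic loss, and the Polyak step maintains the slowly moving target networks.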
The main steps of the multi-episode RL algorithm for seeking the optimal policy, which allows us to quickly identify connective configurations from an arbitrary starting configuration, are shown in Algorithm 1.

Algorithm 1 An RL algorithm for seeking connective configurations
Input: Random parameter initialization Ψ_1 and Ψ_2 of the critic networks and Φ of the actor network.
Output: Action network A(·; Φ).
1: Replay buffer P ← ∅
2: for episode from 1 to the maximal number of episodes do
3:    Sample a random initial configuration x_0
4:    for t from 0 to L − 1 do
5:        Obtain a new action a_t = A(x_t; Φ) and a new state x_{t+1} = x_t + a_t ▷ New state update rule
6:        if p(x_{t+1}) is smaller than a threshold then break
7:        Compute the reward r(x_t, a_t) using (19) ▷ Compute the reward by shooting
8:        Store the quadruple (x_t, a_t, r(x_t, a_t), x_{t+1}) in P
9:    end for
10:   for tt from 0 to t do
11:       Sample a mini-batch (x, a, r, x′) of size B from P
12:       Update Ψ_i (i = 1, 2) with the critic loss (1/B) Σ (Q(x, a; Ψ_i) − y)², where y is the TD3 target value
13:       if tt mod policy_delay = 0 then
14:           Update Φ with the loss −(1/B) Σ Q(x, A(x; Φ); Ψ_1)
15:           Update the target networks
16:       end if
17:   end for
18: end for

Generating reaction channels and computing reaction rate constant
After performing several episodes of RL using Algorithm 1, we obtain an optimal action function A(·; Φ*) parameterized by Φ*. This function yields the policy we follow at each configuration x to move quickly toward a connective configuration. Figure 2(a) shows what such a policy (represented as a vector field) looks like for a simple potential energy surface. The value of A(x), a vector with two components, is plotted as an arrow for each x uniformly sampled in [−2.0, 2.0] × [−1.5, 2.0]. The arrows point to two configurations marked by crosses, which correspond to two connective configurations.
Once we identify a connective configuration, we perform additional shooting operations to generate multiple trajectories originating from it. The configurations generated along these trajectories are considered samples within a reactive channel between A and B. If multiple connective configurations are identified, each of them can be used to generate a distinct reactive channel.
The configurations generated within reactive channels can be used to compute the committor function q(x) by solving a restricted BKE on these configurations using a deep neural network, as described in Section 2.2. The use of an NN-based PDE solver is often associated with an implicit bias toward fitting smooth functions that exhibit fast decay in the frequency domain47. Consequently, this bias can make it challenging for NN models to capture drastic changes in the committor function. However, by choosing an appropriate training dataset, one can mitigate this bias48. In Appendix D.2, we demonstrate the advantage of using configurations within reaction channels to train the NN designed to solve the restricted BKE.
The neural network returns not only q(x) for each x within all reactive channels, but also its gradient ∇q(x). This allows us to compute the reactive flux at each x within the reaction channels and, by using (6), to calculate the rate constant.

Numerical results
In this section, we present several numerical experiments that demonstrate the effectiveness of the RL method introduced in Sec. 3 for identifying connective configurations and using them to generate configurations that characterize reactive channels. Our experiments were performed on model potentials (the triple-well and rugged Müller potentials) as well as the alanine dipeptide molecule in vacuum. While we demonstrate the RL results with initializations uniformly sampled from Ω in this section, we have the flexibility to relax this constraint by considering initializations from the metastable states (see Appendix D.1).

Potential with multiple reaction channels
We consider a two-dimensional triple-well potential, shown as a color-mapped image in Figure 2(a), on the domain Ω = [−2, 2] × [−1.2, 2]. The two meta-stable regions A and B are defined as neighborhoods of the two deep minima of the potential. We can see from Figure 2(a) that there are two transition paths from A to B. The top transition path goes through the third well in the top part of the image and two transition states between the third well and A (respectively B). The bottom transition path goes directly from A to B and crosses a single transition state. Because the potential function is symmetric with respect to x_1 = 0, all configurations along {x : x_1 = 0} have an equal probability of reaching either A or B. As a result, {x : x_1 = 0} represents the half-isocommittor surface {x : q(x) = 0.5}. We experimented with both a relatively low-temperature regime, β = 6.67, and a relatively high-temperature regime, β = 1.67. We discuss the results for β = 6.67 here, which is the more challenging regime for studying rare events, and report a similar observation for the β = 1.67 case in Appendix B.1. By default, the friction coefficient is set to 1.
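For concreteness, a commonly used triple-well potential of this type (e.g., in the transition path theory literature) can be coded as follows. The exact coefficients used in this paper may differ, so this functional form is an assumption for illustration only; what it shares with the potential described above is the two deep wells, the shallow third well, and the symmetry in x_1.

```python
import math

def triple_well(x, y):
    """An assumed, commonly used triple-well potential: two deep wells
    near (+-1, 0), a shallow well near (0, 5/3), symmetric in x.
    The paper's exact coefficients may differ from these."""
    return (3.0 * math.exp(-x ** 2 - (y - 1.0 / 3.0) ** 2)
            - 3.0 * math.exp(-x ** 2 - (y - 5.0 / 3.0) ** 2)
            - 5.0 * math.exp(-(x - 1.0) ** 2 - y ** 2)
            - 5.0 * math.exp(-(x + 1.0) ** 2 - y ** 2)
            + 0.2 * x ** 4 + 0.2 * (y - 1.0 / 3.0) ** 4)
```

The symmetry V(x, y) = V(−x, y) is what places the half-isocommittor surface on the line x = 0, and the deep wells sit well below the saddle region between them.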
Identifying reaction channels. In our experiments, the initial configuration for each RL episode is randomly sampled from a uniform distribution of configurations in Ω. The reward r(x, a) in (19) is obtained by shooting N = 50 trajectories from x. The maximal number of evolution steps is set to L = 15, and the maximum number of episodes used in RL is set to 1000. While we initially set a large number of episodes for the RL training process, it is worth noting that the convergence of RL does not require an extensive number of episodes. Additional discussion can be found in Appendix B.3.
Figure 2(a) shows the learned action A(x; Φ * ) as a vector field that represents an optimal policy. Two attractors are visible in this policy field; they correspond to two connective configurations, one on each of the two transition paths.
From each of the identified connective configurations, we shoot 50 trajectories by simulating the overdamped Langevin dynamics (1) with the Euler-Maruyama scheme. We choose a uniform time step size ∆t = 5 × 10^−3 and propagate the solution from T = 0.0 to T = 2.0. The total number of configurations generated along these trajectories is 40,000. These configurations lie either in the metastable regions A and B or in the two transition regions between them, as shown in Fig. 2(b). The two transition regions correspond to the two reactive channels associated with this potential energy surface.
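The shooting step above can be sketched as follows. The Euler-Maruyama update matches the scheme and step size described in the text; the function name euler_maruyama_2d and the harmonic gradient used in the test are illustrative assumptions.

```python
import math
import random

def euler_maruyama_2d(x0, grad_V, beta, dt=5e-3, t_final=2.0, seed=0):
    """Euler-Maruyama discretization of the overdamped Langevin dynamics
    dX = -grad V(X) dt + sqrt(2/beta) dW, returning every configuration
    visited along the trajectory (including the initial one)."""
    rng = random.Random(seed)
    x, y = x0
    sigma = math.sqrt(2.0 * dt / beta)
    trajectory = [(x, y)]
    for _ in range(int(round(t_final / dt))):
        gx, gy = grad_V(x, y)
        x += -gx * dt + sigma * rng.gauss(0.0, 1.0)
        y += -gy * dt + sigma * rng.gauss(0.0, 1.0)
        trajectory.append((x, y))
    return trajectory
```

With ∆t = 5 × 10^−3 and T = 2.0, each shot takes 400 steps; 50 shots from each of the two connective configurations is consistent with the 40,000-configuration total quoted above if only the propagated steps are stored (2 × 50 × 400 = 40,000), though the paper does not spell out this bookkeeping.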
Solving BKE. We then solve the restricted BKE in the identified reaction channels using the NN-based solver discussed in Sec. 2.2. The loss function (11) is optimized with the Adam optimizer 43. The hyperparameters used in the NN are listed in Appendix C. Figure 2(d) shows contour plots of the NN solution (colored contour lines) and the potential (dotted contour lines). The 0.5-level set of the NN solution, marked in the figure, almost coincides with the true half-isocommittor surface {x : x 1 = 0}. As a reference, we also used the finite difference method (FDM) to solve the BKE on the entire domain Ω. The absolute difference between the NN and FDM solutions on a 100 × 100 uniform mesh is shown in Figure 2(e). The errors in the NN solution are small in regions that contain a large number of sampled configurations (such as the top transition path) and relatively large in regions where configurations are sparsely sampled (such as the region near (0.0, 0.5)).
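The loss (11) itself is not reproduced here. The sketch below shows one commonly used variational loss for committor learning on Boltzmann-distributed samples: a Dirichlet term over the channel configurations plus penalties enforcing the boundary conditions q = 0 on A and q = 1 on B. This may differ from the paper's exact objective; the function names and the penalty weight lam are assumptions.

```python
def bke_loss(q, grad_q, channel, region_A, region_B, lam=10.0):
    """Variational-style BKE loss: mean squared committor gradient over
    channel samples (the Boltzmann weight is implicit when the samples are
    drawn from the equilibrium distribution), plus boundary penalties."""
    dirichlet = sum(sum(g * g for g in grad_q(x)) for x in channel) / len(channel)
    penalty_A = sum(q(x) ** 2 for x in region_A) / len(region_A)
    penalty_B = sum((q(x) - 1.0) ** 2 for x in region_B) / len(region_B)
    return dirichlet + lam * (penalty_A + penalty_B)
```

As a sanity check, for the linear one-dimensional committor q(x) = (x + 1)/2 on [−1, 1] the boundary penalties vanish and the loss reduces to the constant Dirichlet term 0.25.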
Rate estimation. The NN approximation to the committor function and its gradient are used to calculate the reaction rate by integrating (5) over the dividing sub-surfaces identified with each reaction channel from the data. In Table 1, we compare the computed rate with the one obtained from a direct dynamics simulation (see Sec. 2.1) and with the one computed from the committor function obtained by solving the BKE on the entire domain Ω with the finite difference method. We list the computed rates for both β = 1.67 and β = 6.67. In the direct dynamics simulation, we generate a long trajectory by evolving the overdamped Langevin equation to T = 2 × 10^7. In the FDM calculation, the rate is computed by numerical integration of (5) over the entire line segment {x : x 1 = 0, −1.2 ≤ x 2 ≤ 2}. From these numerical experiments, we find that the generated configurations mainly cross the line segment {x : x 1 = 0, −1.0 ≤ x 2 ≤ 2.2} for β = 1.67 and the line segments {x : x 1 = 0, −0.8 ≤ x 2 ≤ 0.0} and {x : x 1 = 0, 0.8 ≤ x 2 ≤ 1.8} for β = 6.67. We then compute the rates using the NN solutions on these segments. As the table shows, the NN solution on the reaction channels gives rates comparable to those obtained by the FDM and by direct simulation. Finally, we validate the rate calculation with our NN solution using formula (8).
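Once the committor is available, each channel's rate contribution reduces to a one-dimensional quadrature over its dividing segment. The sketch below applies the trapezoid rule to a generic integrand flux_density, which stands in for the integrand of (5) (not reproduced here); the function name and interface are assumptions.

```python
def segment_rate(flux_density, x2_lo, x2_hi, n=1000):
    """Trapezoid-rule approximation of a rate contribution obtained by
    integrating a 1-D flux density over the dividing segment
    {x1 = const, x2_lo <= x2 <= x2_hi}."""
    h = (x2_hi - x2_lo) / n
    interior = sum(flux_density(x2_lo + i * h) for i in range(1, n))
    return h * (0.5 * (flux_density(x2_lo) + flux_density(x2_hi)) + interior)
```

For β = 6.67, the total rate from the NN solution would be the sum of two such segment integrals, one per reactive channel.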
Here the parameters γ and k control the roughness of the landscape; they are set to 9 and 5, respectively. The other model parameters are exactly the same as the ones used in Refs. 14,15. By default, the friction coefficient is set to 1.
Identifying reaction channels. We discuss the numerical results in the low-temperature regime β = 0.25 (Fig. 3) here and refer readers to Appendix B.2 for results obtained at β = 0.1. In this example, the reward in the RL algorithm is calculated by shooting N = 20 trajectories up to T = 0.25. The step size used in each trajectory is set to 5 × 10^−5. Each RL episode consists of L = 20 steps (actions). We ran the RL algorithm for 1000 episodes. Figure 3(a) shows that the learned policy points to a single connective configuration. From that configuration, we shoot 50 trajectories using the Euler-Maruyama scheme. These trajectories contain 100,000 configurations that lie in the metastable regions A and B as well as in the transition region between them (Fig. 3(b)). The latter is viewed as the reactive channel of this particular potential energy surface. The hyperparameter settings for the NNs used in the RL algorithm are listed in Appendix C.
Solving BKE. We use the configurations contained in the reactive channel to solve the BKE via a NN. Figures 3(c) and 3(d) show contour plots of the solution to the BKE obtained from the FDM and the NN, respectively. The difference between the two is shown in Figure 3(e). From these plots, we observe that the NN solution agrees well with the FDM solution. In particular, the NN solution captures the drastic changes near the point (−0.8, 0.6).
Rate estimation. The committor functions computed by the FDM and the NN approach are used to compute the reaction rate at different β values. The computed rates are compared with those obtained from direct numerical simulations of the corresponding overdamped Langevin dynamics in Table 2. In the direct numerical simulations, we ran a long trajectory until T = 1.5 × 10^4 with a time step size of 10^−5. When β = 0.25, no reactive trajectory is observed. When the committor function is obtained from the FDM, the rate is computed by numerical integration of (5) over the dividing surface {x 1 = 0.0, −0.5 ≤ x 2 ≤ 2.0}. When using an NN to compute the committor function within the reactive channel, the reaction rate is computed by numerical integration of (5) along the line segment {x 1 = −0.5, 0.0 ≤ x 2 ≤ 0.8} for β = 0.1; for β = 0.25, we integrate (5) along the line segment {x 1 = −0.5, 0.3 ≤ x 2 ≤ 0.65}. We observe that the rates computed by all three methods are comparable when β = 0.1. When β = 0.25, the rates obtained from the NN and FDM approximations of the committor function have the same order of magnitude.

Alanine dipeptide in vacuum
In this example, we show how the RL algorithm introduced above can be used to identify a reaction channel of an Alanine dipeptide (ADP) molecule in vacuum that corresponds to its isomerization process. In our numerical experiment, we set the temperature to 300 K. The ADP molecule contains 22 atoms (see Fig. 4(e)); the dimension of the configuration space is therefore 66. The isomerization process is principally described by the two dihedral angles (ϕ, ψ) ∈ [−180°, 180°]^2 of subsets of atoms (indexed by 4, 6, 8, 14 and by 6, 8, 14, 16, respectively). With a slight abuse of notation, we use ϕ(x) and ψ(x) to denote the mapping from a configuration x to the two specific torsion angles. Figure 4(a) shows the potential energy landscape in the (ϕ, ψ)-space. The plot is constructed as follows.
We first generate a long trajectory at a relatively high temperature (1200 K). We denote the set of configurations along this trajectory by S, and store the potential energy of each configuration in S. Next, we discretize (−180°, 180°] × (−180°, 180°] with a 100 × 100 uniform grid. For each (ϕ i, ψ j) pair on the grid, we define a neighborhood C(ϕ i, ψ j) containing the configurations in S whose torsion angles lie near (ϕ i, ψ j). If C(ϕ i, ψ j) ̸= ∅, we set the potential energy associated with (ϕ i, ψ j) to V min, the minimum potential energy over C(ϕ i, ψ j). Otherwise, the potential energy of (ϕ i, ψ j) is undefined and colored white in Figure 4. The two metastable regions A (solid box in Fig. 4(a)) and B (dotted box) are defined in the (ϕ, ψ)-space. Figure 4 also shows snapshots of one trajectory of length T = 1 × 10^8 fs initiated at a configuration near A at 300 K; no reactive trajectory is observed.
Identifying reaction channels. Our RL algorithm aims to find a connective configuration in the (ϕ, ψ)-space. Subsequently, we use this configuration to generate additional configurations that bridge the two metastable states. We should note that, for the ADP system, the actions taken in the RL algorithm are defined in the low-dimensional space specified by (ϕ, ψ), whereas the shooting procedure used to evaluate the reward takes place in the 66-dimensional configuration space. To address this disparity in dimensionality, it is necessary to establish a one-to-one mapping between each (ϕ, ψ) pair and a configuration in the phase space. To this end, we first construct a configuration set P by generating trajectories at the high temperature 1200 K. We then map a given torsion coordinate (ϕ, ψ) back to the configuration space by choosing the matching configuration in P with the lowest potential energy. We simulate the Langevin dynamics with a step size of 2 fs and friction 10 ps^−1 using the OpenMM Python API 49. The reward for each action is computed by shooting 10 trajectories of T = 2 × 10^3 fs with kinetic initialization randomly sampled from the Boltzmann-Gibbs distribution. Each RL episode consists of L = 10 steps (actions). The learned action function reveals two different connective configurations, i.e., (−10°, −62°) and (139°, −120°). Our primary interest is in the (−10°, −62°) configuration, as it has been extensively studied in the existing literature 14,20,50. We shoot 100 trajectories of T = 2 × 10^3 fs from this connective configuration to generate additional configurations that bridge the two metastable states, as shown in Fig. 4(d). We also visualize the change of molecular structures from the connective configuration to the two metastable states in Fig. 4(e).
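The back-mapping from a (ϕ, ψ) pair to a full configuration can be sketched as a lowest-energy lookup over the pre-sampled pool P. The dictionary representation of configurations, the tolerance tol, and the function name below are assumptions, and angle wrap-around at ±180° is ignored for brevity.

```python
def min_energy_representative(pool, target, energy, tol=5.0):
    """Pick from a pre-sampled configuration pool the lowest-energy
    configuration whose torsion angles (phi, psi) lie within tol degrees
    of the target pair; return None if no configuration matches."""
    phi0, psi0 = target
    best, best_energy = None, float('inf')
    for conf in pool:
        if abs(conf['phi'] - phi0) <= tol and abs(conf['psi'] - psi0) <= tol:
            e = energy(conf)
            if e < best_energy:
                best, best_energy = conf, e
    return best
```

Returning None for an empty match mirrors the white (undefined) cells in Figure 4(a); in practice the high-temperature pool P is dense enough that RL actions rarely land in such cells.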
Solving BKE. Our next objective is to solve the BKE (2) in the 66-dimensional space. The numerical solution obtained is used to generate a plot of the approximate committor function in (ϕ, ψ). This approximation is then compared with the one obtained from the diffusion map (DM) 8,51 method. Note that it is not possible to use traditional PDE solvers, such as finite difference or finite element methods, to solve (2), because they suffer from the curse of dimensionality, i.e., their computational cost increases exponentially with the dimension of the problem 41. The DM method is another sample-based method that allows the solution of the BKE to be approximated on an arbitrary set of configurations {x i}. However, it is important to note that DM may not always produce an accurate approximation of the derivatives at configurations near the boundary 52,53, which can result in less accurate DM solutions of the BKE near the boundaries of a fixed domain.
We use the dataset {x i} identified by our RL method, shown in Fig. 4(d), and apply the DM and NN methods to obtain solutions of the BKE. The solutions are presented in Fig. 5. We observe that the half-isocommittor region of the solution, i.e., the set of configurations on which the committor function value is close to 0.5, obtained from the DM method occupies a relatively large area in the (ϕ, ψ) plane, whereas the half-isocommittor region defined by the NN solution is confined to a much smaller area. Our solution is found to be more consistent with the results presented in Fig. 1, Panel 2 of Ref. 50.
Rate estimation. To estimate the reaction rate, we use formula (8) and approximate the integral using the generated configurations {x i} obtained through the shooting method. This approach yields the approximation formula (22). Evaluating (22) with the approximate solution of the BKE, we obtain a rate of 1.58 × 10^−5 ps^−1. This estimate is comparable to the approximate value 4.54 × 10^−5 ps^−1 reported in Figure 5 of Ref. 20.

Conclusion
We presented a novel RL-based approach for identifying and characterizing an ensemble of configurations where reactive trajectories are likely to be found. The optimized action function returned by the RL algorithm reveals connective configurations that have high reactive probabilities. A key element of the RL algorithm is the construction of a reward function that serves as a surrogate for the reactive probability of a configuration, which is normally defined in terms of the value of its committor function. Because the exact committor function is generally unknown in advance, we employ a randomized shooting procedure to estimate its value at an arbitrary configuration. Our results demonstrate the effectiveness of this approach across multiple model problems of different sizes. Using the identified connective configurations, we generate trajectories directed towards the metastable regions. The configurations along these trajectories are used to define reactive channels on which a restricted BKE is solved by a NN-based PDE solver. The solution yields an approximate committor function evaluated within these channels, which can then be used to estimate reaction rates.

B Numerical results for the high-temperature regime
B.1 Triple-well potential
This section presents the results obtained from running the RL algorithm to identify connective configurations for the triple-well potential at an inverse temperature of β = 1.67. The reward function defined in Equation (19) was evaluated by shooting 50 trajectories that evolve up to time T = 0.75. The maximum number of time steps taken in each trajectory is set to L = 20. The RL procedure was run for a total of 1000 episodes. Figure 7(a) shows the action function A(x; Φ) produced at the end of the RL run. We mark the two distinct connective configurations identified with crosses. They are nearly identical to the connective configurations found when the trajectories were generated with β = 6.67. In Figure 7(b), we plot the configurations generated by shooting trajectories from the two connective configurations. We observe that the configurations generated at β = 1.67 cover a wider region of the configuration space that includes the local maximum near (0.0, 0.5), as well as some configurations outside the pre-defined domain Ω = [−2, 2] × [−1.2, 2]. Figures 7(c-e) show that the neural network solution of the BKE is comparable to the solution obtained from the finite difference method (FDM).

B.2 Rugged Muller potential
Figure 8 shows the results obtained from applying the previously presented RL algorithm to the rugged Muller potential at an inverse temperature of β = 0.1. The reward function defined in Equation (19) was computed by shooting N = 10 trajectories that evolve up to time T = 0.05. The maximum number of time steps taken in each trajectory was set to L = 20. The RL algorithm was run for a total of 1000 episodes. We observe that the configurations generated by shooting trajectories from the identified connective configurations at β = 0.1 (see Fig. 8(b)) cover a wider area than those obtained from performing RL and generating trajectories at β = 0.25. The committor function appears to be more stable near the region around (−0.8, 0.6), and the error is relatively small in the region covered by the sampled configurations.

B.3 Monitoring RL progress
Figure 9 shows the reward as a function of the episode number in the RL algorithm. A maximum of 1000 episodes is allowed; however, the largest rewards are achieved around episode 500. This suggests that the RL process is efficient and reaches a high level of performance well before the limit on the number of episodes.

C Implementation details of NN solutions of the BKE
In this paper, we approximate the solution of the BKE by a fully connected neural network (FNN) ϕ(x; θ), which can be viewed as a composition of L simple nonlinear functions. The FNN is optimized by Adam 43. We use an initial learning rate of 0.001 for T iterations; in each iteration, the learning rate is scaled by a factor of 0.5(cos(πt/T) + 1), where t is the current iteration number. We set the batch size to the total number of training points. The hyperparameter settings for each numerical example are listed in Table 3.
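The learning-rate schedule can be written as a one-line function; beyond the cosine factor stated above, the function name and the default base_lr = 0.001 are the only assumptions.

```python
import math

def cosine_lr(t, total_iters, base_lr=1e-3):
    """Learning rate at iteration t: the initial rate scaled by the factor
    0.5 * (cos(pi * t / T) + 1), decaying smoothly from base_lr to 0."""
    return base_lr * 0.5 * (math.cos(math.pi * t / total_iters) + 1.0)
```

The factor equals 1 at t = 0, 0.5 at t = T/2, and 0 at t = T, so the schedule is a standard cosine annealing to zero.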

Figure 1. A schematic description of the proposed RL action-reward feedback loop. In each episode, a sequence of states (configurations) {x 0, · · ·, x t, · · ·} is generated iteratively. Each iteration consists of the following stages. (1) Initialization: a configuration x 0 is randomly drawn from the configuration set. (2) Evolution: the parameterized policy yields the action a t after observing the state x t; after taking the action, the agent moves to a new configuration x t+1 = x t + a t. (3) Interaction: we shoot multiple trajectories from x t+1 and evaluate the reward r(x t, a t). The quadruples (x t, a t, r(x t, a t), x t+1) are used in Q-learning to update the policy.

Figure 2. The results obtained from running the RL algorithm for the triple-well potential at inverse temperature β = 6.67. (a) The action function (vector field) learned by the proposed RL; it reveals two connective configurations (crosses). (b) The configurations generated by shooting trajectories initiated at the identified connective configurations. (c-e) The FDM (c) and NN (d) solutions of the BKE (2) and their difference (e).

Potential with rough landscape
In the second example, we consider the rugged Muller potential on the domain Ω = [−1.5, 1] × [−0.5, 2].

Figure 3. The results obtained from running the RL algorithm for the rugged Muller potential at the inverse temperature β = 0.25. (a) The action function (vector field) learned by the RL algorithm; it reveals one connective configuration (cross). (b) The configurations generated by shooting trajectories from the identified connective configuration. (c-e) The FDM (c) and NN (d) solutions of the BKE (2) and their difference (e).

Figure 6 gives a schematic description of how the action function and the Q-function are optimized in the RL algorithm used to identify connective configurations.

Figure 7. The results obtained from running the RL algorithm for the triple-well potential at inverse temperature β = 1.67. (a) The action function learned by the proposed RL; it reveals two connective configurations (crosses). (b) The configurations generated by shooting trajectories initiated at the identified connective configurations. (c-e) The FDM (c) and NN (d) solutions of the BKE (2) and their difference (e).

Table 1. Reaction rate of the triple-well potential under various temperatures. "N/A" means that no reactive trajectory is observed in the long trajectory of time T = 2 × 10^7.