Manifold learning in atomistic simulations: a conceptual review

Analyzing large volumes of high-dimensional data requires dimensionality reduction: finding meaningful low-dimensional structures hidden in their high-dimensional observations. Such practice is needed in atomistic simulations of complex systems where even thousands of degrees of freedom are sampled. An abundance of such data makes gaining insight into a specific physical problem strenuous. Our primary aim in this review is to focus on unsupervised machine learning methods that can be used on simulation data to find a low-dimensional manifold providing a collective and informative characterization of the studied process. Such manifolds can be used for sampling long-timescale processes and free-energy estimation. We describe methods that can work on datasets from standard and enhanced sampling atomistic simulations. Unlike recent reviews on manifold learning for atomistic simulations, we consider only methods that construct low-dimensional manifolds based on Markov transition probabilities between high-dimensional samples. We discuss these techniques from a conceptual point of view, including their underlying theoretical frameworks and possible limitations.


I. Introduction 4 I. INTRODUCTION
Atomistic simulations are extensively used to investigate complex systems in chemistry and biology [1,2]. These simulations provide detailed information about physical and chemical processes at the atomistic level of detail with spatiotemporal resolution inaccessible to experiments. However, such systems often involve hundreds of thousands of atoms, making it challenging to analyze their high-dimensional configuration space. It requires techniques for averaging over noisy variables that correspond to fast degrees of freedom while obtaining a low-dimensional description that retains the essential characteristics of the associated physical processes. A low-dimensional representation should be physically explainable and interpretable. Paraphrasing R. Coifman [3]: There is innate truth in the low-dimensional manifold of the data, and we would like to have a characterization of some latent variable that intrinsically describes the changes of states, we can intuitively understand the motivation to develop methods for finding low-dimensional representations in atomistic simulations.
Complex systems require a strict approach that ensures their informative physical characteristics are encoded in corresponding low-dimensional manifolds. In the context of atomistic simulations, encoding essential characteristics of the physical process while averaging over remaining degrees of freedom should be performed according to several requirements [7,[27][28][29]: 1. Distinguishing between relevant states of the system.
2. Including slowly varying degrees of freedom corresponding to system behavior on longer timescales.
Another difficulty in finding low-dimensional manifolds from atomistic simulations arises due to the sampling problem. Complex systems are often characterized by metastable states separated by energy barriers much higher than thermal energy k B T . This metastability leads to kinetic entrapment in a single state, making transitions between metastable states infrequent (i.e., rare). As a result, metastable systems are sampled only in a fraction of their configuration space due to the low probability of jumping across energy barriers, and the data obtained cannot represent the whole behavior of the system. To address this issue, enhanced sampling methods can be used to bias the equilibrium probability and improve the sampling of the configuration space [7,30,31]. However, this problem is often overlooked, and many methods remain unable to learn from data generated by biased sampling.
The main objective of this review is to establish a theoretical framework for manifold learning methods suitable for investigating systems through atomistic simulations, including enhanced sampling simulations. However, we deviate from a commonly taken route to reviewing unsupervised learning methods for finding low-dimensional representations of complex systems. We do not discuss standard techniques such as principal component analysis, multidimensional scaling, and their variants, as these have already been covered in many reviews. For an introduction to the methods omitted here, we refer to reviews focusing on learning from machine learning datasets [19-21, 23, 24, 32] or simulation data [10,27,[33][34][35][36][37][38][39][40].
Instead, we focus solely on a group of nonlinear techniques that construct Markov transition probabilities between high-dimensional samples. Recent development has shown that these techniques can be considered in one general framework suitable for complex systems sampled using atomistic simulations. For this reason, we consequently use the nomenclature employed in statistical physics and atomistic simulations. Apart from using such manifold learning techniques for unbiased simulations, we also introduce concepts that allow the construction of low-dimensional manifolds from enhanced sampling simulations, in which the crucial information about the manifold is sampled from biased probability distributions. Overall, we review manifold learning methods under a single unifying framework, covering methods required to understand the latest developments in the field. In general, each algorithm we review here involves the following steps: 1. Generation of high-dimensional samples from unbiased or biased atomistic simulations.
2. Construction of a Markov chain on the data with pairwise transition probabilities between samples.
3. Parametrization of a manifold using a mapping that embeds high-dimensional samples to a reduced space through eigendecomposition (i.e., spectral embeddings [41]) or divergence optimization.
This review begins with relatively standard material about atomistic simulations and a general introduction to enhanced sampling techniques (Sec. II). We cover only the concepts required to understand manifold learning in the context of standard atomistic and enhanced sampling simulations. This part, by no means exhaustive, can be supplemented by several comprehensive reviews on enhanced sampling [7,28,30,31,42,43]. Next, a general problem of finding low-dimensional manifolds for the description of complex systems is introduced (Sec. IV). Subsequently, we move to the central part of this review and focus on several manifold learning techniques that can be used for learning from atomistic simulations. Each of these frameworks is introduced from the conceptual perspective, followed by examples of applications and software implementations (Sec. V and VI). Finally, we summarize ongoing issues and provide our perspective on manifold learning in standard atomistic and enhanced sampling simulations (Sec. VII).

II. ATOMISTIC SIMULATIONS
Atomistic simulation techniques such as molecular dynamics or Monte Carlo have emerged as general methods at the intersection of theoretical and computational physics [44]. These techniques used to explore the dynamics of complex systems can be viewed as samplers for generating high-dimensional data from some underlying probability distributions.

A. Statistical Physics Representation
In statistical physics, we represent the dynamics of a complex system using its microscopic configurations. Such a representation generally involves a high number of degrees of freedom. Let us suppose that the system is represented by an n-dimensional vector of configuration variables: such as the microscopic coordinates, where n = 3N for an N -atom system. Generally, the configuration variables are functions of the microscopic coordinates under the assumption that the space spanned by these variables is high-dimensional. In machine learning, the configurational variables are referred to as features or descriptors. A dataset of K high-dimensional samples of the configuration variables: recorded at consecutive times during the dynamics can be expressed as a matrix of size n × K called a trajectory.
In the following, we limit our discussion to the canonical ensemble (N V T ), in which the configuration variables evolve according to a high-dimensional potential energy function U (x) at a temperature T .
When the system is represented by the microscopic coordinates, its equilibrium density is given by the stationary Boltzmann distribution [4]: where β = (k B T ) −1 is the inverse of the thermal energy k B T corresponding to the temperature T with the Boltzmann constant denoted by k B , and Z = dx e −βU (x) is the canonical partition function. Otherwise, the set of samples X (Eq. 2) is sampled from an unknown high-dimensional equilibrium density.

B. Collective Variables and Target Mapping
The primary assumption in statistical physics is that we can average over some properties of the highdimensional representation and obtain a macroscopic description of the system with fewer degrees 3. Include slowly varying degrees of freedom corresponding to long timescale processes.
Let us assume for now that CVs are correctly identified through some procedure. CVs are typically expressed as functions of the configuration variables (Eq. 1), meaning that finding CVs involves obtaining a set of functions that embed high-dimensional samples into a low-dimensional CV space. This set of functions is called the target mapping [58]: for d ≪ n. The target mapping ξ(x) can be linear, nonlinear, or even an identity function (i.e., this reduces the problem to selection). Eq. 4 is central to our review: each manifold learning method provides a unique functional form of the target mapping used to reduce the dimensionality of the system representation.
To define a probability density for CVs expressed by the target mapping (Eq. 4), we need to consider only a part of the configuration space. The equilibrium distribution of CVs is obtained by averaging over unused variables. This gives us a marginal density: (a) Simplified model of a system with two long-lived metastable states and its free-energy landscape F (z k , z l ), where z = (z k , z l ), with a free-energy barrier higher than the thermal energy k B T . Gray lines show selected variables for projections shown in (b-d) with remaining variables integrated out, e.g., for z = (z k , z l ), F (z k ) = − 1 β log dz l e −βF (z l ) . (b) Free energy along the z l variable indicates only one minimum, which means that z l is not an optimal CV. (c) Free energy along z k depicts two minima with a correct free-energy barrier, slightly higher than the thermal energy (see color bar). (d) Free energy along z k − z l shows two energy minima, but the free-energy barrier is lower than the correct value shown in (a).
where the multidimensional Dirac delta function is: and ⟨·⟩ denotes an unbiased ensemble average. The equilibrium distribution of CVs (Eq. 5) typically contains several disconnected states of high probability separated by regions of low probability leading to infrequent transitions between such states.

C. Free-Energy Landscape
Instead of the potential energy function U (x) characteristic for a high-dimensional representation (Eq. 3), the reduced dynamics of the system in the CV space follows the underlying free-energy landscape. We define it as the negative logarithm of the marginal distribution of CVs multiplied by the thermal energy: which is defined up to an immaterial constant. The equilibrium density of CVs can be equivalently written as ρ(z) = e −βF (z) /Z, where the partition function in the CV space is given as Z = dz e −βF (z) .
The free-energy landscape determines an effective energy landscape for CVs that consists of kinetic barriers between metastable states. The free-energy difference between states A and B can be calculated as [31]: which is defined using the ratio between the partition functions corresponding to the states Z A and Z B , respectively. Note that Eq. 8 is valid if (and only if) the CV set properly separates the two states A and B. To calculate the free-energy difference, one must integrate over the regions in CV space defining the states A and B, as simply taking the difference of F (z) at the minima of states A and B does not yield correct results [61].
As the free-energy landscape (Eq. 7) is not invariant with respect to CVs [61][62][63], the relation of the free-energy barrier to the kinetics of crossing between states [62] or activation free energies [63] is not apparent (Fig. 1).
Sampling free-energy landscapes exhaustively can be challenging, even for simple systems. On the timescales accessible for standard atomistic simulations (around milliseconds), crossings over high freeenergy barriers are rare events. As a result, the system remains kinetically trapped in a metastable state as its dynamics is restricted to sampling fast equilibration. This so-called sampling problem can be observed in many physical processes, for example, catalysis [64], ligand interactions with proteins [65][66][67][68][69] and DNA [70], glass transitions in amorphous materials [71], crystallization [46], and graphite etching [72].
As a representative example of enhanced sampling techniques, we consider methods that employ an external bias potential to enhance CV fluctuations. The first approach of this kind, called umbrella sampling, was introduced in 1977 by Torrie and Valleau [78]. The motivation for naming the method "umbrella sampling" was to highlight the versatility of the method to investigate a wide range of physical processes [44].
When the bias potential is introduced to the system, the distribution of CVs can deviate significantly from equilibrium (Eq. 5). This results in sampling according to a biased distribution: where Z V = dz e −β(F (z)+V (z)) is the biased partition function and ⟨·⟩ V denotes an ensemble average calculated under the biasing potential V (z) . By design, the biased distribution is easier to sample.
Enhanced sampling methods based on CVs vary in how they flatten or reduce free-energy barriers and how they construct the bias potential [76,78,80,81,85]. In umbrella sampling, it was proposed that the biased probability distribution of CVs should be "as wide and uniform as possible" [78], and thus sometimes called flat histogram. However, recent developments in enhanced sampling have shown that this approach may not be the most efficient for fast convergence [81,84,[86][87][88].
As an example of a non-uniform target distribution, let us consider the well-tempered distribution used in metadynamics [81]. Well-tempered metadynamics uses a history-dependent bias potential updated iteratively by depositing Gaussians centered at the current location in the CV space: where G σ (z, z k ) is a Gaussian kernel with a bandwidth set σ, z k is the center of k-th added Gaussian, and γ is a bias factor that determines how much we enhance CV fluctuations. Well-tempered metadynamics convergences to the well-tempered distribution: in which we sample an effective free-energy landscape F γ (z) = F (z)/γ with barriers reduced by a factor of γ [31,81].

E. Reweighting Biased Probability Distributions
When conducting biased simulations, the CVs sample a smoother biased CV distribution. When obtaining equilibrium properties, such as free-energy landscapes, each sample is given a statistical weight to account for the effect of the biasing. For methods employing a quasi-stationary bias potential V (z), the weight associated with a CV sample z is given as: In contrast, well-tempered metadynamics uses an adaptive bias potential (Eq. 10), and the weights are modified by adding a time-dependent constant to the bias potential [7,89]. In unbiased simulations, every sample is equally important as it is obtained from the equilibrium distribution (Fig. 2).
Standard reweighting involves employing the weights to find the stationary equilibrium distribution from the biased CV distribution, i.e., ρ(z) ∝ w(z)ρ V (z), which can be computed by histogramming or kernel density estimation, where each sample z is weighted by Eq. 12. Many simulation codes, such as PLUMED [90,91], routinely use this approach.

III. SIMULATION DATA
Here, we describe how to prepare a dataset from standard atomistic and enhanced sampling simulations for a manifold learning method. These data have unique characteristics that differ from ordinary datasets. Moreover, we demonstrate how to collect such datasets and reduce their size while preserving their density.

A. Data Representation
To prepare a simulation dataset, we collect high-dimensional samples represented by the configuration variables. This dataset can be a trajectory of the system, as shown in Eq. 2. Additionally, we can sample the configuration variables in multiple trajectories that originate from the same probability distribution and combine them. If the dataset contains high-dimensional samples from the equilibrium probability distribution, we do not require additional information about the system.
However, as explained in Sec. II E, if we sample simulation data from a biased probability density, each high-dimensional sample has a statistical weight that contains information about its importance. Therefore, we can express that dataset as: where the weights are given by (Eq. 12). Failure to consider these weights when creating a dataset from biased simulations data can affect its geometry, density, and importance [55,58,92].
A fundamental issue of simulating complex systems is the convergence and accuracy of the data. When the simulation fails to capture enough transitions between long-lived metastable states, the representation of these rare events is greatly affected. This problem is especially pronounced in systems exhibiting metastability and sampling problems.

B. Landmark Sampling
Vast amounts of data obtained from atomistic simulations often necessitate reducing the number of samples in the datasets. This can be achieved through landmark sampling, where a subset of the samples from the high-dimensional dataset is selected to create a dataset with preferably much fewer samples that retain information about the complete dataset. This approach is used both machine learning, where low-rank approximations such as the Nyström extensions and clustering are widely used [96][97][98][99][100][101][102], and atomistic simulations [49,55,[103][104][105][106][107]. Below, we focus on landmark sampling techniques that can be modified to include statistical weights from enhanced sampling simulations as a selection criterium.
When working with unbiased data, it is reasonable to choose landmarks randomly, as all samples are equally important (Eq. 13). This approach results in landmarks that are approximately distributed according to the Boltzmann equilibrium distribution. Farthest point sampling (FPS) [108] can also be used to sample landmarks from unbiased simulations. FPS relies on geometric criteria to ensure a uniform selection of landmarks, regardless of the probability distribution. Concretely, FPS selects a sample that is farthest from all previously selected landmarks. However, FPS may be computationally expensive for large datasets.
For biased trajectories, landmark samples can be chosen based on their weights w(x) (Eq. 13). In the simplest case, landmarks can be obtained by weighted random sampling, where each sample is selected with a probability proportional to its weight [109]. However, this method can result in an overpopulation of samples in the deepest free-energy basins while leaving other metastable states with a limited number of samples [55,104,110,111]. This overpopulation occurs because samples in the deepest metastable states have much higher weights than those in higher-lying states due to the exponential dependence of the statistical weights on the bias potential.
To exploit the information about the geometry and density of the configuration space, we can employ a well-tempered variant of FPS [104]. This selection process involves two stages. First, FPS selects √ KL samples, which are then divided into Voronoi polyhedra {v k }. For each polyhedron v k , we calculate a tempered weight: where l ∈ v k and α > 1 is a tempering parameter. In the second stage, weighted random sampling selects a polyhedron according to the tempered weight (Eq. 14). A landmark is then sampled from the selected Voronoi polyhedron using unmodified weights (without tempering). This process is repeated until the desired number of landmarks L is reached. In the limit of α → ∞, landmarks are uniformly distributed. For α → 1, the landmarks selection should resemble the equilibrium distribution. The procedure for well-tempered FPS is summarized in Algorithm 1.

Select
√ KL landmarks using FPS.
2. Perform the Voronoi tesselation {v k } based on landmarks selected using FPS.

Until the number of landmarks is L:
(a) Select a Voronoi polyhedra according to the accumulated weights ω k .
(b) Include a landmark based on weighted random sampling.
Recently, weight-tempered random sampling (WTRS) has been introduced as an alternative for welltempered FPS [55]. This technique involves scaling the weights of samples without using FPS. WTRS selects a landmark with a probability ∝ w 1/α , where, as in Eq. 14, α is a tempering parameter. Assuming landmarks are sampled from the well-tempered distribution (Eq. 11), the limit α → ∞ corresponds to sampling landmarks according to the biased distribution, while the limit α → 1 results in sampling landmarks from the equilibrium distribution [55].
3. Landmark Sampling with Weight-Tempered Random Sampling. Sampling landmarks according to probabilities ∝ w 1/α for different values of the tempering parameter α compared to the free-energy surface (FES) of alanine dipeptide in vacuum. Biased simulation data is generated from a well-tempered metadynamics simulation of alanine dipeptide in vacuum using a bias factor of γ = 5 and the Φ and Ψ dihedral angles as biased variables. By increasing α, landmarks gradually diverge from the unbiased distribution. Data taken from Ref. 55.

IV. MANIFOLD LEARNING
In this section, we discuss fundamental concepts of manifold learning techniques. We aim to explain how these methods can be employed to analyze simulation data and construct CVs. A non-exhaustive list of manifold learning methods that can be applied to atomistic systems includes locally linear embedding [26], Laplacian eigenmap [112][113][114], diffusion map [5,[115][116][117], spectral gap optimization of order parameters [118], sketch-map [104,119], and stochastic neighbor embedding and its variants [120][121][122].

A. Markov Transition Matrix
Manifold learning, also referred to as nonlinear dimensionality reduction, aims to simplify highdimensional data to low-dimensional manifolds. This class of techniques generalizes linear methods such as principal component analysis (PCA) or singular value decomposition (SVD). Manifold learning is commonly used to analyze machine-learning data and has recently gained attention in atomistic simulations for extracting physical properties from the dynamics of complex systems. In the following, we consider a manifold as a low-dimensional space spanned by a few CVs.
A primary assumption in manifold learning for atomistic systems is that the dynamics in a highdimensional space can be represented by a low-dimensional and smooth subspace called a manifold. This assumption is known as the manifold hypothesis. It states that the dynamics effectively evolves on the low-dimensional manifold embedded in the high-dimensional space [123,124]. The existence of such low-dimensional descriptions may be attributed to couplings between the degrees of freedom, resulting in a small number of slowly evolving variables that govern the dynamics and to which the remaining fast degrees of freedom (or their statistics) are slaved. Under this view, the fast degrees of freedom are controlled by the dynamics of the slow CVs due to fast equilibration within the metastable states, resulting in an adiabatic timescale separation. This description leads naturally to the modeling of complex systems as diffusion processes, in which a set of stochastic differential equations may be formulated in the slow variables, with the fast degrees of freedom represented as thermal noise. Thus, characterizing the system dynamics by the slow CVs leads to a negligible error.
The core of most manifold learning methods is having a notion of similarity between high-dimensional samples, usually through a distance metric [5,25,26,101,112,113,120,121,125]. The distances are then integrated into a global parameterization of the data through the construction of a discrete Markov chain, where the similarities depend on distances between the samples. For example, a common starting point is the construction of the Markov chain based on a Gaussian kernel: where ∥ · ∥ denotes Euclidean distances. Manifold learning techniques discussed in this review use a normalized Gaussian kernel with Euclidean distances to model the Markov chain. The distances can be computed between all samples in the dataset or only between nearest neighbors. The scale parameter ε > 0 depends on the dataset as it induces a length scale ∼ √ ε. It can be selected to match the distance between neighboring samples. We cover different algorithms for selecting ε when discussing each method (Sec. V and VI).
Depending on distance metrics, other Gaussian kernels can be considered [126], for instance, generalized Minkowski, Mahalanobis, or cosine distances. Moreover, Laplacian [112,113], heat kernel [127,128], weight matrix [26], graph diffusion kernel [129], or Fisher information kernel [130] can also be used in manifold learning. Any kernel G(x k , x l ) used in the construction of the Markov chain must satisfy the following properties: To convert the Gaussian kernel into a Markov transition matrix M (x k , x l ) of size K × K that describes transition probabilities p kl , we row-normalize Eq. 15 to obtain a right stochastic matrix: which describes the probability of transitioning from sample x k to sample x l in an auxiliary time step t: where, through the construction in Eq. 16, the transition probability depends only on the current sample, i.e., it is called the Markovian assumption, meaning that the system has no memory.
Target Mapping The Markov transition matrix can be used to propagate the corresponding Markov chain. For a timehomogeneous Markov chain, the k-step transition probability can be obtained as the k-th power of the transition matrix M k . A unique stationary distribution π(x) exists if the Markov chain is irreducible and aperiodic. In this case, M k converges to a rank-one matrix in which each row is the stationary distribution lim k→∞ M k = π(x). Alternatively, it can be as defined M π(x) = π(x) as the stationary distribution is unchanged by the Markov transition matrix. Consequently, the highest eigenvalue corresponding to the stationary distribution equals one.
The terminology used to describe M (x k , x l ) can vary depending on the field from which a method originates. In unsupervised learning, it is commonly referred to as the affinity, similarity, or proximity matrix [120,121,126,131]. These methods usually include an additional step when building a Markov chain, e.g., the diagonal entries of M are set to zero. This type of Markov chain is called non-lazy. In contrast, methods devised for atomistic systems do not use this assumption as the diagonal entries may contain important information about the system [116]. In such Markov chains, there is a probability that the transition does not occur.

B. Target Mapping
Here, we focus on a generalization of the target mapping from the high-dimensional configuration space (Eq. 1) to a low-dimensional manifold (or the CV space) (Eq. 4). As explained in Sec. II B, under our framework, the problem of finding CVs is equivalent to finding a parametrization of the target mapping. The target mapping performs dimensionality reduction such that the dimensionality of the manifold is much lower than that of the high-dimensional space, i.e., d ≪ n; see Fig. 4.
Manifold learning methods can be split into two categories depending on how the low-dimensional manifold is constructed by the target mapping: I. Eigendecomposition (i.e., spectral decomposition) of the Markov transition matrix: where {ψ k } and {λ k } are the corresponding eigenfunctions and eigenvalues, respectively. The solution of Eq. 18 determines the low-dimensional manifold [5]. For instance, the target mapping can be parametrized as follows: where the k-th manifold coordinate is λ k ψ k (x). The eigenvalues are sorted in non-increasing order and include only d dominant eigenvalues as each corresponds to the importance of respective coordinates spanned by eigenfunctions.
The eigenvalues decrease exponentially and can be related to the effective timescales of the studied physical process, as multiple timescales frequently characterize complex systems. As such, the dominant eigenvalues also correspond to the slowest processes. We analyze this in detail in Sec. V A. Manifold learning methods exploiting the eigendecomposition to find a low-dimensional manifold of CVs are described in Sec. V.
II. Divergence optimization where a divergence (i.e., a statistical distance between a pair of probability distributions) between the Markov transition matrix M built from high-dimensional samples and a Markov transition matrix Q(z k , z l ), constructed from low-dimensional samples, is minimized (Fig. 4). The target mapping expressed as a parametrizable embedding is: where θ θ θ = {θ k } are parameters that are varied such that the divergence between M and Q is minimized. In such methods, M is fixed while Q is estimated by the parametrized target mapping. Depending on the manifold learning method used, the minimization can be performed differently, i.e., gradient descent [121], or stochastic gradient descent if the target mapping is represented by a neural network [49,55,122]. We introduce such manifold learning methods in Sec. VI.

V. TARGET MAPPING (I): EIGENDECOMPOSITION
Here, we cover manifold learning methods that employ the eigendecomposition of the Markov transition matrix to find the target mapping ξ(x). The following manifold learning methods we discuss are diffusion map (DMAP) (Sec. V A), time-lagged independent component analysis (TICA) (Sec. V B), and spectral gap optimization (SGOOP) (Sec. V C).

A. Diffusion Map (DMAP)
The concept of DMAP was inspired mainly by Laplacian eigenmap. Laplacian eigenmap [112,113] originates from spectral graph theory [132], which is a mathematical field studying properties of the Laplacian matrix or adjacency matrix associated with a graph. As states of the system can be represented as a graph, Laplacian eigenmap is commonly used in manifold learning to reduce data dimensionality. Laplacian eigenmap has theoretical convergence guarantees as the discrete operator approaches the Laplacian on the underlying manifold assuming the data are uniformly sampled.
DMAP proposed by Coifman et al. [5] expands the concept of Laplacian eigenmap. This algorithm yields a family of embeddings and provides theoretical understanding for the resulting embedding even when the data are non-uniformly sampled. DMAP can construct an informative low-dimensional embedding of a complex system. Compared to other manifold learning methods, DMAP has a substantial theoretical background [5,115,117]. Namely, the CVs spanning the low-dimensional manifold constructed by DMAP correspond to the slowest relaxation processes given by a probability distribution evolving under a random walk over the data [116].

Anisotropic Diffusion Kernel
As a starting point for our framework, we consider the anisotropic DMAP [116]. DMAP first employs a Gaussian kernel function to estimate the similarity between high-dimensional feature samples x k and x l , G ε (x k , x l ), where ε is a scale parameter (Eq. 15). The scale parameter in DMAP ε can be chosen using several heuristics. For instance, see Ref. 137-139. Next, the Gaussian kernel is used to define a density-preserving kernel in a high-dimensional space that is called the anisotropic diffusion kernel: where ϱ(x k ) = n G ε (x k , x n ) is up to a normalization constant a pointwise kernel density estimate at x k and α ∈ [0, 1] is a normalization parameter, the anisotropic diffusion constant, based on which a family of different manifold parametrizations can be considered [5].
Then, Eq. 21 is row-normalized to represent Markov transition probabilities: so that l M (x k , x l ) = 1 for k-th row. As such, Eq. 21 is a Markov transition matrix containing information about the transition probability from x k to x l . Under this view, Eq. 22 denotes a Markov chain with the transition probability from x k to x l in an auxiliary time step (Eq. 17). It can be shown that there is a direct link between Laplacian eigenmap and DMAP [116].
The anisotropic diffusion constant (Eq. 21) is related to the importance of data density [115]. Specifically, when the microscopic coordinates are sampled according to the equilibrium density, in the limits ε → 0 and K → ∞, DMAP asymptotically converges to the stationary Boltzmann distribution of the modeled Markov chain. Based on these assumptions and depending on the normalization using the anisotropic diffusion constant, we can consider the following interesting constructions of the stationary density: 1. α = 0: we recover dynamics according to the potential 2U (x) and the density ∝ [ρ(x)] 2 for the classical normalized graph Laplacian [112,113,128].
2. α = 1: we get the graph Laplacian with data uniformly distributed on a manifold. This normalization accounts only for the data geometry, while density does not play a role.
3. α = 1 2 : we obtain dynamics according to the underlying potential U (x) and the density ∝ ρ(x) whose eigenfunctions capture the long-time asymptotics of data (i.e., correspond to slow variables).
For the anisotropic diffusion constant α = 1 2 , we asymptotically recover the long-time dynamics of the system whose microscopic coordinates are sampled from the Boltzmann distribution (Eq. 3). The related backward Fokker-Planck differential equation is given byy [115]: where µ is the friction coefficient, −∇U (x) denotes the force acting on atoms, and w is an n-dimensional Brownian motion. The infinitesimal generator of this diffusion process is: The eigenvalues and eigenvectors of L determine the kinetic information of the diffusion process and can be used to parametrize a low-dimensional manifold.
As we are interested in finding a low-dimensional representation of a system, i.e., estimating CVs, the case for α = 1 2 is crucial to model the slowest degrees of freedom, accounting for both the underlying geometry and density of the manifold.
Note that when considering configuration variables other than the microscopic coordinates, the underlying equilibrium density is not given by the Boltzmann distribution (Sec. II). However, it is reasonable to assume that there is a separation of timescales for variables other than the microscopic coordinates.

Diffusion Coordinates
The idea of using eigenfunctions of the Gaussian kernel as coordinates for Riemannian manifolds originates with Ref. 127 and in the context of data analysis with Ref. 115. The transition probability matrix M can be used to solve the eigenvalue problem: for k = 1, . . . , K. The spectrum then is synonymous with the eigenvalues {λ l }. The corresponding right eigenvectors {ψ l } can be used to embed the system in a low-dimensional representation (or CVs).
Based on this eigendecomposition, the target mapping ξ(x) (Eq. 4) can be defined as diffusion coordinates [5,115]: where the eigenvalues and eigenvectors are given by {λ l } and {ψ l }, respectively, and define reduced coordinates. In Eq. 26, each diffusion coordinate is defined as z k = λ k ψ k , where the spectrum is sorted by non-increasing value: where d is the index at which we truncate the diffusion coordinates in Eq. 26 and the dimensionality of the reduced representation. Thus, it is expected that the dominant timescales found in the dynamics of the high-dimensional system can be described only by several eigenvectors corresponding to the largest eigenvalues. As the dynamics in DMAP is represented by the transition probability matrix (Eq. 22), the eigenvalue λ 0 = 1 and the first diffusion coordinate λ 0 ψ 0 corresponds to the Boltzmann equilibrium distribution. Therefore, we exclude it from the target mapping (Eq. 26).
The truncation in Eq. 27 can be justified as follows. As atomistic systems are usually metastable (Sec. II D), it is sufficient to approximate the diffusion coordinates by a set of dominant eigenvalues (e.g., up to d in Eq. 26 and Eq. 27). This corresponds to a negligible error on the order of O(λ d /λ d−1 ). In other words, a sufficient condition for this is λ d−1 ≫ λ d as it relates to a large spectral gap. As such, the spectral gap separates slow degrees of freedom (for ≥ λ d ) and fast degrees of freedom (for < λ d ).

Diffusion and Commute Distances
In the reduced space of the diffusion coordinates, we can define the diffusion distance, a measure of proximity for the samples lying on a low-dimensional manifold. The diffusion distance is equivalent to Euclidean distance on the manifold [5]: where the definition of ξ(x) is given by Eq. 26. Alternatively, we can write that the diffusion distance between high-dimensional samples x k and x l is equivalent to Euclidean distance between CV samples z k = ξ(x k ) and z l = ξ(x l ).
The diffusion coordinates (Eq. 26) can also be defined using the relation of the effective timescales and the eigenvalues [116]. Employing the fact that the eigenvalues decay exponentially: where τ is a lag time, κ n denotes relaxation rates and κ −1 n = t n are effective timescales, we can rewrite Eq. 26 to obtain kinetic map: where τ should be selected so the samples are uncorrelated. Selecting the lag time is a known issue in modeling metastable dynamics. Namely, the target mapping in Eq. 30 strongly depends on the lag time τ . The lag time should be selected to separate fast and slow processes if there is an evident timescale separation, i.e., the lag time value is selected between fast and slow timescales. However, these timescales are often unknown before running simulations.
Relaxation timescales (Eq. 29) can be calculated only when a time-lagged construction of pairwise transition probabilities is used. For instance, this can be done using Mahalanobis distance as a distance metric which employs estimating a time covariance matrix [140], the Taken theorem (a delay embedding theorem) [141], or the von Neumann entropy [142].
A particularly interesting method that does not require selecting the lag time relies on integrating Eq. 28 over the lag time τ . Then, we use the relation between the eigenvalues and the effective timescales (Eq. 29) to arrive at commute distances [143]: Eq. 31 is approximately the average time the systems spends to commute between x k and x l , and the associated commute map is given by: where we can see the difference between the diffusion and commute distances are in the coefficients λ n and t n /2, respectively. Various methods exploit the relation of the effective timescale with eigenvalues [9,92,135,136,[143][144][145][146][147].

Diffusion Reweighting
The standard DMAP can construct CVs from unbiased atomistic simulations (Sec. IV). However, an approach incorporating statistical sample weights is necessary to learn from enhanced sampling simulations where sampling follows a biased probability distribution. Several unbiasing algorithms are 5. Diffusion Reweighting. The equilibrium diffusion coordinate λ0ψ0 calculated for an alanine dipeptide dataset at a temperature of 300 K in vacuum generated using well-tempered metadynamics with a bias factor of 5. The dataset consists of 5000 samples, with 45 features being all pairwise heavy-atom distances in the system. The standard DMAP captures a biased low-dimensional manifold that does not correspond to the equilibrium distribution, as seen by the lowered free-energy barriers and consequently boosted transitions between the metastable states. In contrast, the reweighted DMAP correctly reverts the effect of sampling from a biased probability distribution.
available for DMAP, such as target-measure DMAP [136,148] and its variant that uses Mahalanobis distances [147].
Here, we focus on recently proposed reweighted DMAP, which implements diffusion reweighting to unbias Markov transitions [58,92]. Reweighted DMAP employs a weighted Markov transition matrix: that uses a pairwise reweighting factor r(x k , x l ) to weight each Gaussian. It can be shown that the pairwise reweighting factor takes a simple form [58,92]: where ϱ(x k ) = l w(x l )G ε (x k , x l ) is up to a multiplicative constant an unbiased density estimator at sample x k . For a detailed derivation and variants of the reweighting factor used in other manifold learning techniques, we refer to Ref. 58.
A procedure for constructing reweighted DMAP can be implemented according to Algorithm 2. We show how this technique can be employed to obtain an equilibrium density from a biased simulation and a comparison to the standard DMAP in Fig. 5. DMAP and its variants have been used extensively to study the properties of complex systems. Noteworthy applications include constructing CVs and low-dimensional representations from unbiased simulations [105,116,117,124,133,134,137,149,150], estimating free-energy landscapes [151][152][153], modeling kinetics [135,143,144], constructing reweighted manifolds and CVs from biased simulations [49,58,123,136,148], and selecting high-dimensional representations [92]. DMAP is implemented in the pydiffmap package [154].

B. Time-Lagged Independent Component Analysis (TICA)
TICA was introduced by Molgedey and Schuster in 1994 to determine the mixing coefficients of linearly superimposed uncorrelated signals [155]. This method has inspired the development of novel techniques to analyze long simulation trajectories by considering them as propagating signals [17,[156][157][158][159][160]. In 2013 two research groups independently applied TICA in Markov state modeling to understand the kinetics of conformational changes [157,158]. Currently, TICA is widely used to project high-dimensional simulation data to a low-dimensional manifold, which helps to uncover slow molecular processes more effectively [17].

Markov Propagator of Stochastic Processes
As in the case of DMAP, we start from the notion that dynamics of the microscopic coordinates of the system can be explained by the stochastic differential equation given by Eq. 23. The corresponding transition probability p τ (x, y) dx = Prob x t+τ = y | x t = x describes the likelihood of the transition from x to y in a time step τ . We can describe the evolution of a time-dependent probability distribution as: which alternatively can be given by the Markov time-propagator M τ : that maps a probability distribution ρ t (x) to ρ t+τ (x). We want to emphasize that the time-propagator M τ is closely related to the semigroup K τ and infinitesimal generator L [17]. The semigroup is given as: where g is an auxiliary function. Then, the generator L is defined as Lg = lim t→0 + (K τ g − g)/t. The adjoint operator of L, denoted as L † , determines the time-evolution of ρ: which is the famous Fokker-Planck equation. From Eq. 38, we can see that the time-propagator has the form: To understand the kinetic characteristics of the system, it is necessary to analyze the eigenvalues and eigenfunctions of the generators L and L † , as well as the propagator M τ . The eigenvalues and eigenfunctions of L satisfy the equation: where the eigenvalues are sorted by increasing values 0 = λ 0 < λ 1 ≤ · · · . The eigenvalues of L † are the same as those of L, while the eigenfunctions of L † are expressed as ϕ k = πψ k , where π is the equilibrium Boltzmann distribution. We can notice that π is also an eigenfunction of L † with λ 0 = 0, which means that ψ 0 = 1. This is because λ 0 = 0 is non-degenerate and corresponds to a unique equilibrium distribution. With λ k and ϕ k , we can formally write the time-dependent probability distribution: where coefficients are determined by the initial conditions A k = dx(x)ψ k (x)π. Eq. 41 means that ρ t (x) converges to π when t → ∞. The eigenfunctions ϕ k can be viewed as different "modes" with an equilibration rate determined by λ k , where small λ k suggests a slow equilibration along ϕ k . As M τ shares the eigenfunctions with L † , while the eigenvalues of M are λ M k = e τ λ k , the equilibration timescale is t k = −τ / log λ M k .

Variational Optimization of Dominant Eigenvectors
As previously discussed, the eigenfunctions of M τ are useful in defining the target mapping for slow equilibration rates when λ M k are small. The first non-trivial eigenfunction ψ 1 is the most significant mode, which corresponds to the smallest non-zero eigenvalue. It can be estimated variationally by maximizing: with a constraint given as dx g(x)π = 0, where g is a function to be optimized. Eq. 42 can be generalized to obtain the eigenvalues and eigenfunctions of M τ [157,161]. As ψ 0 = 1, g(x) can be expressed as a linear combination: where c k are coefficients and η k (x) are basis functions. It can be shown that the maximization of Eq. 42 is equivalent to solving a generalized eigenproblem: where M kl = η i | M τ | η j , the overlap matrix S is defined as S kl = η k | η l , and χ 0 corresponds to the equilibrium distribution. Features with zero-mean are usually taken as basis functions [155,157,158,162]. However, other basis functions can also be employed. For instance, indicator functions of a partition of the configuration space [157,161] or Gaussian functions can be used, which requires fewer macrostates and shorter τ to converge [161]. Neural networks can also be used [56].
In practice, M kl depends on the lag time τ , which is often taken to be on a nanosecond timescale [163,164]. For unbiased simulation data, we can estimate M kl (τ ) as: where x k and x l are k−th and l-th features, respectively, and the time lag is given as τ = K τ ∆t, where ∆t is a time step. As x k and x l are usually not orthonormal, the overlap matrix can be estimated as: which reduces the generalized eigenproblem (Eq. 44) to M (τ )χ = λ M M (0)χ. We can see that the matrix M has a different interpretation compared to other methods reviewed here, e.g., rather than describing the probabilities between samples, it is given by autocorrelation functions.
A procedure for performing TICA is given in Algorithm 3.

TICA and Enhanced Sampling Simulations
Generating TICA directly from enhanced sampling simulation is another approach worth considering. However, this remains challenging as many enhanced sampling methods do not preserve unbiased  [157,165]. This is mainly because M (0) is often ill-conditioned thus M (0) −1 is numerically unstable.
(a) Perform principle component analysis to transformx to principle components y.
(b) Calculate normalizedỹ = Σ −1 y, where Σ is a diagonal matrix whose k−th diagonal element is the standard deviation of y k .
(d) DiaganoalzeM (τ ). Use the k-th eigenvector as coefficients to construct a linear combination ofỹ to calculate ψ k .
kinetics. CVs generated with TICA and a Markov state model can be used with metadynamics, which has improved the sampling efficiency of complex systems [164,166].
A successful example of this is constructing CVs with TICA and well-tempered metadynamics [56,167]. It has been shown that not only can configurations of the system be unbiased [168][169][170] but also an effective timescale dτ can be obtained by rescaling the simulation time [167]: where V (z) is a time-dependent bias potential acting in the CV space (Eq. 10) and c(t) is a timedependent bias drift [169]. Eq. 47 is asymptotically correct at the long-time limit, and generally, the rescaling does not correspond to the actual unbiased time.
As shown in the case of DMAP, the timescale separation between fast and slow dynamics can be used to parametrize manifolds for metastable systems. However, sometimes it is sufficient to know the difference between the slow and fast eigenvalues rather than the related eigenvectors based on which a low-dimensional manifold is constructed in DMAP. The spectral gap can be used to estimate the degree of the separation between the fast and slow processes. Consequently, the maximal spectrum gap corresponds to the most optimal construction of the slow variables. An approach for this is based on the maximum entropy framework and referred to as spectral gap optimization of order parameters (SGOOP) [118].

Maximizing the Spectral Gap
In SGOOP, we assume the dynamics of the configuration variables x = {x k } n k=1 (i.e., order parameters [118]) is Markovian. These configurational variables can be linear or nonlinear functions of the microscopic coordinates, and the space they span is high-dimensional. We then consider a target mapping ξ(x) that embeds the n-dimensional configuration space x into a d-dimensional CV space z, creating a trial CV. We consider a single CV where the target mapping is a linear combination of the configurational variables [118]: where c k are adjustable parameters. More generally, the trial CV can be multidimensional, and the target mapping given by a nonlinear mapping [118].
The CV space is then discretized in a grid. Next, the Markov transition matrix between grid bins z k and z l in the low-dimensional CV space is defined as: where K is the rate matrix, and k and l are bin indices. The transition probabilities p kl should not depend on the lag time τ when it is sufficiently small and the transition matrix is Markovian. As the Markov transition matrix is defined in the CV space, not the high-dimensional configuration space, SGOOP differs from the other methods considered in this review.
In the maximum caliber framework (a generalization of maximum entropy to dynamics), a path entropy can be defined based on the probability of micropaths for a Markovian process discrete in time and space [196,197], which for the CV space is defined as: where ρ(z k ) is the stationary probability of the grid bin z k . The path ensemble average of a timedependent quantity A kl is then calculated as: The path entropy given by Eq. 50, along with constraints based on our knowledge of certain static or dynamical quantities ⟨A (n) ⟩, and other conditions such as detailed balance, are together called caliber [118,196,197]. Then, by maximizing the caliber, we can determine that the Markov transition matrix is related to stationary probabilities: where the sum is over the known static or dynamical constraints A (n) kl for the grid bin z k and z l , weighted by Lagrange multipliers l n . It has been shown that maximizing the caliber is equivalent to being the least committal about missing information [118,196].
Refs. [118,198] consider a single dynamical constrain ⟨N ⟩ that is the average number of transitions to nearest neighbors observed in time τ (so that N kl = 1 if grid bins are nearest neighbors and zero otherwise [198]). In this case, the Markov transition matrix becomes: where the Lagrange multiplier l is calculated by inserting this equation in Eq. 51 [198]. Similar equations can be derived by considering other dynamical observables [198].
Eq. 53 is the main equation of SGOOP. Using this equation, we can calculate the Markov transition matrix M from the equilibrium CV distribution ρ(z) for the given trial CV (we can even ignore the Lagrange multiplier l as it only sets an overall timescale that is not important [118]). Furthermore, the equilibrium CV distribution ρ(z) is obtained through reweighting, so SGOOP is directly applicable to enhanced sampling simulations biasing any CVs (that can be different from the trial CV).
We can then obtain the eigenvalues {λ k } of the Markov transition matrix M for a given trial CV. Then, by varying the trial CV, for instance, by changing the parameters c k in Eq. 48, we can obtain the spectral gap λ s − λ s+1 between the slowest s processes in the system with the largest eigenvalues and other fast processes for the different trial CVs. The optimal CV is the one that maximizes the spectral gap. Note that we do not need to perform additional biased simulations to obtain the spectral gap for the trial CV, as we can employ reweighting to estimate the equilibrium CV distribution ρ(z).
The framework offered by maximum caliber and SGOOP is versatile and can be expanded in multiple ways. For instance, SGOOP can be extended to multidimensional CVs using conditional probability factorization, as shown in Ref. 199. Additionally, SGOOP can be combined with commute distances (Eq. 31) [145]. Lastly, SGOOP can be employed to assess CVs obtained through other machine-learning techniques [200].

VI. TARGET MAPPING (II): DIVERGENCE OPTIMIZATION
This section discusses manifold learning methods that use divergence optimization to find the parametric target mapping. In the following, we describe parametric variants of well-known methods such as stochastic neighbor embedding (SNE) (Sec. VI A) and uniform manifold approximation and projection (UMAP) (Sec. VI B), which enable learning CVs from unbiased atomistic simulations. Additionally, we cover multiscale reweighted stochastic embedding (MRSE) (Sec. VI C 2) and stochastic kinetic embedding (StKE) (Sec. VI C 1), which can be used to learn from enhanced sampling simulations.

A. Stochastic Neighbor Embedding (SNE)
Introduced by Hinton and Roweis [120], SNE is used in many fields where dimensionality reduction is required. Despite its widespread popularity, only a few investigations have explored the theoretical background of SNE, unlike DMAP or Laplacian eigenmap. However, a non-parametric SNE has been recently considered from a theoretical standpoint [207][208][209][210], revealing that SNE methods can effectively separate clusters of data in a low-dimensional embedding if there are well-separated clusters in a highdimensional space [207,209]. This result, however, has been suggested to be insufficient to show that SNE can accurately find a low-dimensional representation [208,210]. A recent study has also shown an interesting connection between SNE and spectral embedding techniques [209,211].

Parametric Target Mapping
We consider a more versatile approach called the parametric SNE, which employs a parametric mapping from a high-dimensional space to a low-dimensional manifold [122]. This variant uses the target mapping consisting of d functions given by (see also Eq. 20): where parameters θ θ θ = {θ k } are adjusted to correspond to a minimum of divergence. The parametric version offers more flexibility in constructing low-dimensional spaces than the standard SNE. The target mapping can be represented by a neural network, where each layer gradually decreases the dimensionality of the configuration space [216].

Markov Transitions in High-Dimensional Space
In SNE, the Markov transition matrix is constructed using a Gaussian kernel G ε (Eq. 15). Then, the corresponding transition probabilities p kl are defined by normalizing each row of the matrix created using the Gaussian kernel with an additional selection of scale parameters (bandwidths): where the diagonal elements are set to zero in order to define a non-lazy Markov chain. Each scale parameter {ε k > 0} K k=1 corresponds to a row of M . The scale parameters can be written as ε k ∝ σ 2 k , where σ k is the standard deviation of the Gaussian kernel. Finally, the Markov transition matrix is symmetrized [21].
Selecting {ε k } is the main difference compared to constructing the Markov transition matrix from the single-scale Gaussian kernel (Eq. 15) as in DMAP. To find {ε k }, the Shannon entropy H(x k ) of each row of the Markov transition matrix should be optimized to hold the following condition: where P is a measure of neighborhood at x k referred to as perplexity. Therefore, the optimization task is to adjust {ε k } so that H(x k ) is roughly log 2 P . After the optimization, the Markov transition matrix contains information on the geometric relationships between samples in the high-dimensional space. In most SNE algorithms, the optimization is performed using logarithmic search.
Perplexity is a concept that considers both local and global aspects of data. However, its value can have a complicated impact on the final result, causing confusion and potentially affecting how we interpret the manifold, particularly in complex physical systems with multiple metastable states of varying densities. Therefore, it is common to construct manifolds with different perplexity values and select a manifold that can be interpreted. For more information on choosing the appropriate perplexity and its impact on low-dimensional representations, we refer to Ref. 217.

Manifold Representation
Depending on the variant of SNE, the Markov transition matrix built from low-dimensional samples can be represented by different normalized kernels Q(z k , z l ). In SNE, the transition probabilities q kl

Random Variable
Density 6. Crowding Problem. The approach taken by t-distributed SNE to construct Markov transition probabilities from low-dimensional samples. The Markov transition matrix with probabilities p kl between high-dimensional samples is modeled using a Gaussian kernel (blue). In contrast, its equivalent for lowdimensional samples Q is given by a one-dimensional t-distribution (red between the low-dimensional samples are given by the Gaussian kernel: where the samples are determined by the parametric target mapping such that z = ξ θ θ θ (x). A single value of ε for every row of Q is chosen, which differs from estimating bandwidths to build the Markov transition matrix from high-dimensional samples.
In t-distributed SNE [121,122] (abbreviated as t-SNE), which is perhaps the most commonly used variant of SNE, Q(z k , z l ) is instead represented by a non-parametric t-distribution kernel with one degree of freedom (i.e., the Lorentz function) (Fig. 6): which does not require selecting additional parameters. Therefore, in contrast to spectral embedding methods, SNE does not employ eigendecomposition but builds the transition matrix from the lowdimensional samples estimated by the parametric target mapping [120][121][122].
Using the t-distribution kernel instead of the Gaussian kernel is motivated by the crowding problem [121,122] where clusters of samples (or metastable states) are not separated correctly. This is because the low-dimensional space available to accommodate moderately distant samples is not large enough compared to the space available to accommodate nearby samples. It is partly caused by the dimensionality curse [218,219]. As the t-distribution kernel is heavy-tailed (it is an infinite mixture of Gaussians with different variances), the separation of the clusters is more optimal (Fig. 6). The choice of a one-dimensional t-distribution is because (1 + ∥z k − z l ∥ 2 ) −1 approaches an inverse square law for large pairwise distances in the low-dimensional representation [121].

Learning Algorithm
To correctly parametrize the target mapping and CV samples, we aim to achieve a minimal difference between the high-dimensional and low-dimensional Markov transition matrices, M and Q, respectively. Many divergences allow computing statistical distances between pairwise probability distributions. SNE uses the Kullback-Leibler divergence [220] reformulated to compare Markov transition matrices row by row [221]: where the summation goes over every pair of k and l. The first term in Eq. 59 is constant and can be excluded from the minimization. The Kullback-Leibler divergence is greater than 0 and equal to 0 only when p kl = q kl for every pair (k, l). Other divergences can also be used to compare M and Q, e.g., the symmetrized Kullback-Leibler or the Jensen-Shannon divergences.
Algorithm 4 outlines the learning procedure employed in SNE. The neural network representing the parametric target mapping is trained iteratively, which can be performed using many stochastic gradient descent algorithms, e.g., Adam [222]. Then, backpropagation is used to estimate errors of the parametric target mapping ∇ θ ξ θ (x) which corrects the parameters. This results in the decrease of the  ii. Update optimization step with the loss given by the divergence for the updated parameters θ θ θ.
Kullback-Leibler divergence (or any other divergence) calculated from the Markov transition matrices M (x k , x l ) and Q(z k , z l ). Finally, the minimum value of the Kullback-Leibler divergence indicates that the geometric relations between samples in the configuration and CV space are preserved (Fig. 7).

Applications
For atomistic simulations, SNE and its extensions have been applied to constructing CVs and lowdimensional representations [55,66,223,224], clustering metastable states [215,225,226], and analyzing molecular structures [227]. t-SNE is implemented in the sklearn library [228].

B. Uniform Manifold Approximation and Projection (UMAP)
UMAP is a method for unsupervised manifold learning first introduced by McInnes et al. in 2018 [131]. Recently, a parametric variant of UMAP has been developed [229]. UMAP is now considered on par with t-SNE with applications in various disciplines. Similar to Laplacian eigenmap [112,113], UMAP builds on the assumption that a high-dimensional dataset is uniformly distributed on a low-dimensional manifold. However, this assumption has been challenged for many datasets, even those where UMAP performs well, making it difficult to understand the real reason for its success [230].
For an easy comparison to other parametric manifold learning techniques discussed in this review, we cover only the general aspects of UMAP, including the construction of transition matrices from highand low-dimensional samples and divergence minimization. For readers interested in the mathematical concepts involved in UMAP and a detailed comparison to t-SNE, we refer to Ref. 131,230,231.

Random Variable
Density q kl ∼ t-distribution (a, b) (1,1) (1, 4) (4,1) (4, 4) FIG. 8. Low-Dimensional Transition Probabilities q kl in UMAP. Procedure using a parametrizable t-distribution to represent data states in a low-dimensional manifold. This distribution with (a, b) = (1, 1) corresponds to the t-distribution used in t-SNE. UMAP finds a and b using optimization to match the distribution of pairwise distances calculated from the low-dimensional samples.

Unnormalized Transition Matrix
UMAP uses an unnormalized transition matrix to represent relations between samples in highdimensional space. The lack of normalization of the transition probabilities matrix in UMAP is motivated mainly by reducing the computational cost associated with normalization. The Markov transition probability matrix is defined in UMAP as: where the distance from each k-th sample to its first neighbor is denoted as d k , ensuring that every probability p kl ≤ 1 and divergence can be used as a loss function. Similar to SNE, the diagonal elements of M are zeroed. Although the Markov transition matrix used by UMAP is very similar to that of SNE (Eq. 16), UMAP uses a different symmetrization scheme [131] instead of symmetrizing the Markov transition matrix as in SNE.
UMAP does not employ perplexity to find the scale parameters (Eq. 60). Instead, based on the selected number nearest neighbors m, UMAP estimates the scale parameters {ε k } so that log 2 m = l M (x k , x l ). This can be employed, unlike in SNE techniques, as the Markov transition matrix in UMAP is not row-normalized and l M (x k , x l ) ̸ = 1. Consequently, this optimization procedure does not require the calculation of the Shannon entropy of each row of the Markov transition matrix (Sec. VI A 2) as is done in SNE, making UMAP less computationally demanding.

Transition Probabilities in Manifold
As in every manifold learning technique relying on divergence optimization, UMAP needs to construct the probability transition matrix from low-dimensional samples (Fig. 8). To provide more flexibility in separating states in the manifold, the probability transition matrix in UMAP is estimated by the following t-distribution kernel: where we omit the normalization factor for brevity, and a and b are parameters that can be determined through an additional optimization procedure. If both parameters are set to one, Eq. 61 simplifies to the kernel for the low-dimensional probabilities used in t-SNE (Eq. 58). In practice, UMAP uses nonlinear least-square optimization to find a and b by fitting Eq. 61 to a piecewise function: where d m is a constant setting the separation between close samples in a manifold. For instance, the fitting can be performed using the BFGS algorithm [232]. In practice, a and b are often constant or fixed after initial learning steps [131].

Divergence Minimization
Divergence optimization in UMAP can be performed using a general framework for learning parametric manifold learning techniques (Algorithm 4). To find a low-dimensional transition matrix and the corresponding target mapping into a manifold, UMAP minimizes a divergence that can be viewed as the sum of cross entropy losses for each transition probability [131]: consisting of two Kullback-Leibler divergences D KL (compare to Eq. 59) responsible for grouping samples with high transition probabilities in metastable states and separating samples with low transition probabilities into different metastable states, respectively. As in SNE, the loss function can be reduced by excluding constant terms that depend on p kl ∼ M (x k , x l ).

C. Reweighted Stochastic Embedding
Reweighted stochastic embedding is a recently developed framework designed to construct lowdimensional manifolds of CVs from standard and enhanced sampling atomistic simulations [58]. The framework expands on concepts from DMAP and parametric versions of SNE, with an additional reweighting procedure to account for constructing the Markov transition matrix based on samples yielded from biased probability distributions. To such methods, we can include stochastic kinetic embedding (StKE) [39,49] and multiscale reweighted stochastic embedding (MRSE) [55,58].

Stochastic Kinetic Embedding (StKE)
Stochastic kinetic embedding (StKE) aims to learn the target mapping by matching kinetics encoded in the high-dimensional and low-dimensional spaces [49]. The primary assumption in constructing the unbiased Markov transition matrix from high-dimensional samples is that these samples are uniformly distributed in the configuration space. To achieve this, a sparse set of high-dimensional samples is selected using landmark sampling (Sec. III B) by requiring that the distance between each pair of landmarks is larger than a predefined threshold value, ∥x k − x l ∥ > d 0 . Using the anisotropic diffusion kernel L (Eq. 21), the unbiased Markov transition matrix is: where the statistical weights can be approximated by the unbiased kernel density estimates w(x k ) ≈ ϱ(x k ) [49,58] as the distribution of landmarks virtually uniform. An interesting property that establishes a link between StKE and maximum entropy methods is that Eq. 64 can also be written as [58]: where G ε is the Gaussian kernel and we take the following approximation ϱ(x k ) ≈ n ϱ(x n )L(x k , x n ). For a detailed discussion on this approximation, see Ref. 58.
Apart from constructing the unbiased Markov transition matrix, the remaining elements in StKE are similar to other manifold learning methods that use parametric target mappings. Namely, the lowdimensional transition matrix Q(z k , z l ) is constructed using a Gaussian kernel such as in the standard version of SNE (Eq. 57). The parametric target mapping ξ θ θ θ (x) is represented as a neural network, and the algorithm learns through minimizing the Kullback-Leibler divergence (Eq. 59). StKE can be implemented according to Algorithm 5, where we additionally provide a step that divides the learning set into data batches, as is often done for training neural networks. Biased simulation data is generated using well-tempered metadynamics at a temperature of 300 K using Ψ as variables whose fluctuations are enhanced. The WTRS algorithm is used to sample landmarks. For more details, see Ref. 55. importance of high-dimensional samples. This this aim, Eq. 66 is redefined as: where the pairwise reweighting factor r(x k , x l ) needed to unbias the transition probabilities is [55]: which can be shown to be a special case of diffusion reweighting used in reweighted DMAP [58]. Using the reweighting factor as a geometric mean of two statistical weights can be justified due to the additive nature of the bias potential.
The remaining components of MRSE are similar to those of StKE. A non-parametric t-distribution represents the low-dimensional transition matrix, and the target mapping is expressed as a neural network that adjusts the parameters of the target mapping so that the Kullback-Leibler divergence is minimal.

Biasing Manifolds
A valuable advantage of the parametric target mapping is that high-dimensional samples outside the training set can be mapped to the CV samples. Furthermore, the gradient of the target mapping with respect to the microscopic coordinates x can be estimated through backpropagation. This enables using the parametric target mapping as CVs for biasing in enhanced sampling simulations. The additional biasing force acting on the microscopic coordinates at time t is: where the second partial derivative is calculated from the bias potential V acting in the CV space defined by the target mapping. The target mapping can also be recomputed during the simulation, allowing iterative improvement of the estimated CVs, which is particularly helpful in insufficiently sampled systems.

VII. CONCLUSIONS
The use of simulations is widespread for analyzing the dynamics of complex systems at the atomistic level of detail inaccessible for experiments. As a result, we often need to extract meaningful information from observations comprising thousands of configuration variables. Gaining insight into physical processes can be challenging. This problem, however, can be approached using unsupervised manifold learning techniques that can reduce the high-dimensional representations to simplify the data, uncover hidden structures, and describe the dynamics by a few CVs.
Manifold learning often requires extensive preprocessing of simulation data, including reducing the number of configuration variables to create a high-dimensional space for further dimensionality reduction and sampling landmarks to limit the number of samples in the dataset. These tasks are not trivial and should be performed so that the reduced simulation dataset accurately represents the properties of the system. While many landmark sampling algorithms can be used in standard atomistic and enhanced sampling simulations, selecting the initial high-dimensional representation to be further reduced quantitatively just recently started to gain attention [92,238].
On the other hand, it is also important that have access to well-converged simulation data, which, at this point, is a prerequisite for many manifold learning techniques [34,239]. This is critical if extensive sampling is required to reach experimental timescales where many rare processes occur. As such, developing manifold learning techniques often requires testing under idealistic sampling conditions. Several methods circumvent this problem by learning and biasing CVs iteratively at different stages of the progressing enhanced sampling simulation [49-51, 53, 56, 60, 153, 240]. This process can benefit from reweighting manifolds constructed from samples generated according to biased probability distributions [49,55,58,92].