Nonparametric Bayesian inference for meta-stable conformational dynamics

Analyses of structural dynamics of biomolecules hold great promise to deepen the understanding of and ability to construct complex molecular systems. To this end, both experimental and computational means are available, such as fluorescence quenching experiments or molecular dynamics simulations, respectively. We argue that while seemingly disparate, both fields of study have to deal with the same type of data about the same underlying phenomenon of conformational switching. Two central challenges typically arise in both contexts: (i) the amount of obtained data is large, and (ii) it is often unknown how many distinct molecular states underlie these data. In this study, we build on the established idea of Markov state modeling and propose a generative, Bayesian nonparametric hidden Markov state model that addresses these challenges. Utilizing hierarchical Dirichlet processes, we treat different meta-stable molecule conformations as distinct Markov states, the number of which we then do not have to set a priori. In contrast to existing approaches to both experimental as well as simulation data that are based on the same idea, we leverage a mean-field variational inference approach, enabling scalable inference on large amounts of data. Furthermore, we specify the model also for the important case of angular data, which however proves to be computationally intractable. Addressing this issue, we propose a computationally tractable approximation to the angular model. We demonstrate the method on synthetic ground truth data and apply it to known benchmark problems as well as electrophysiological experimental data from a conformation-switching ion channel to highlight its practical utility.


Introduction
Computational prediction of molecular structures is a long-standing challenge in structural biology. Recently, novel frameworks for protein [1] and ribonucleic acid (RNA) structure prediction [2] have been proposed that greatly enhance the prediction accuracy and the model applicability compared to existing approaches. The focus of structure prediction however lies on static structures. To obtain deeper insights into molecular processes, it is necessary to complement such approaches with frameworks that also take into account the dynamics of molecular structure.
Both experimental and computational tools exist to study such conformational dynamics: a prime example of experimental approaches to this challenge are analyses of conformational switching of ion channels via widely adopted electrophysiological techniques, such as voltage clamping or lipid bilayer measurements [3][4][5]. On the other hand, ion channel switching can also be studied computationally via, e.g., molecular dynamics (MD) simulations [6]. The MD framework more generally can also be used to investigate protein and RNA folding [7,8]. Experimentally, this can be assessed, e.g., via fluorescence quenching [9] or fluorescence resonance energy transfer measurements [10]. Since in all of these settings, both experimental and computational protocols typically yield large amounts of data, their analysis is a challenge in itself.
To understand the switching dynamics of the system under study, one needs to obtain a coarse-grained description of the continuous system dynamics (e.g., voltages or atom coordinates) in terms of comparatively long-lived, meta-stable discrete states corresponding to distinct stable structural conformations [11]. These are defined by a separation of time scales between the intra- and inter-state dynamics. In the context of MD, particularly, the framework of Markov state models (MSMs) has received much attention in recent years; besides extensive methodological research [12][13][14][15], MSMs have also been successfully applied to a wide range of use cases [16][17][18][19][20][21][22][23]. The classical MSM approach approximates the continuous simulation dynamics (mathematically expressible in terms of the propagation operator) directly by projecting them to a discretized space on which then a discrete-time Markov chain (DTMC) is constructed. Importantly, the binning of continuous data into discrete states introduces correlations and makes the resulting discrete process generally non-Markovian; the raw MD data, in contrast, are inherently Markovian, as they originate from the integration of stochastic differential equations (SDEs).
To render the resulting discrete dynamics amenable to MSM analysis, the correlations need to be reduced via temporal thinning by some lag time constant [13,24]. This results in two key parameters that need manual selection: the lag time and the state-space discretization. An explicit bound on the error between the MSM reconstruction and the true propagation operator can be derived in terms of these two parameters, showing that this error can be made arbitrarily small by either choosing a finer state-space discretization or increasing the lag time [13]. This relationship between state-space discretization and lag time results in a trade-off problem: one has to balance between (i) sufficient sampling of the discrete state space, viz., a coarse state-space discretization, and (ii) a sufficiently long lag time to render the resulting process Markovian. To address this issue, tools such as the Chapman-Kolmogorov test have been introduced [13]. These tests, however, only consider the appropriateness of the lag time; the overall reconstruction error may still be large, resulting in an MSM that does not reproduce the long-time dynamics accurately, as detailed in [25]. Also, the lag time selection itself is acknowledged to be a major challenge in practice [26].
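To make the role of these two parameters concrete, a classical MSM estimate from an already discretized trajectory can be sketched in a few lines (our own illustrative Python; the helper name and toy trajectory are not from the paper):

```python
import numpy as np

def msm_transition_matrix(dtraj, lag, n_states):
    """Row-normalized transition count matrix of a discretized trajectory,
    estimated at a given lag time (hypothetical helper, not from the paper)."""
    counts = np.zeros((n_states, n_states))
    # temporal thinning: only pairs (dtraj[t], dtraj[t + lag]) are counted
    for i, j in zip(dtraj[:-lag], dtraj[lag:]):
        counts[i, j] += 1.0
    rows = counts.sum(axis=1, keepdims=True)
    rows[rows == 0.0] = 1.0  # keep unvisited states as zero rows
    return counts / rows

# toy discretization visiting three states cyclically
dtraj = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 0, 0, 0])
print(msm_transition_matrix(dtraj, lag=3, n_states=3))  # cyclic 0 -> 1 -> 2 -> 0
```

Increasing `lag` discards short-time correlations at the cost of temporal resolution, which is exactly the trade-off discussed above.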
The identification of meta-stable conformational states of the system is then carried out after data pre-processing given some state-space discretization and lag time. Typically, this is done utilizing spectral methods such as Perron-cluster cluster analysis (PCCA) [27] or PCCA+ [28]; other approaches are however also possible, see e.g., [29,30]. In general, the result of this procedure will depend on the chosen lag time as well as the state-space discretization.
While the framework as such has seen several refinements [31,32] including recent deep learning extensions [33,34], they share the two conceptual drawbacks detailed above: (i) the data needs to be thinned with a manually specified lag time, which is merely a model artifact and deteriorates the overall temporal resolution, and (ii) the number of metastable states has to be identified manually.
Both problems can be addressed utilizing nonparametric hidden Markov model (HMM) frameworks developed in statistical machine learning [36,42,43]: on the one hand, the introduction of a latent process to an MSM abolishes the need to temporally de-correlate the discrete projection of the data via a lag time [31], which can thus be interpreted as a generalization of classical MSMs. On the other hand, nonparametric probabilistic models allow one to specify distributions on unbounded spaces, such as an infinite number of topics in topic modeling [35] or an infinite number of states in a Markov chain [36].
Nonparametric Bayesian HMMs have gained attention in recent years both in experimental settings such as analyses of ion channel switching [37,38] or single-particle tracking [39] as well as in MD studies [40,41]. Inference of the meta-stable trajectories and the system parameters was however carried out using sampling techniques such as Markov chain Monte Carlo (MCMC); while yielding accurate results, these approaches do not scale well [42,43]. Even for the relatively simple problem of one-dimensional ion channel voltage trajectories, they become computationally intractable for longer sequences or higher-dimensional systems.
To address both the conceptual shortcomings of MSMs and the computational tractability issues of conventional sampling approaches, we provide in this paper a scalable nonparametric Bayesian MSM inference framework for the analysis of conformational molecule dynamics. We emphasize that this framework is very widely applicable, including data generated, e.g., by voltage clamp experiments on ion channels as well as MD simulations, and can hence help bridge the gap between theory and experiment. We note also that in terms of modeling, the transition between ion channel experiments and MD simulations is gradual, as the measured ion current in the former can be interpreted as a one-dimensional reaction coordinate in the latter.
Our method requires manual specification neither of a lag time to re-establish Markovianity, nor of the number of meta-stable conformations. We model the switching dynamics between distinct structural conformations via a nonparametric HMM by defining a latent Markov process on a countable set of states (meta-stable protein or channel conformations), of which one obtains only noisy, continuous-valued observations, such as currents or atom positions. We specify noise models that are appropriate for our use cases: in the experimental and computational settings described above, observations typically are real-valued vectors, x ∈ R n , or rotation angles, x ∈ [0, 2π) n . For the angular case, we furthermore propose a novel approximation enabling computational tractability.
To ensure scalability to large amounts of data, we resort to variational methods for inference: instead of drawing samples from the exact posterior distribution, we approximate the latter in a computationally efficient way by distributions of known type [44][45][46]. As we pursue a Bayesian approach, we treat the model parameters as random variables, specifying appropriate prior distributions and deriving their full posterior distributions conditioned on all observed data.
In the following, we will first introduce the general modeling framework and we will in particular address the issue of adequate prior distributions. Subsequently, we show how to perform scalable inference in this setting. Finally, we present our results on synthetic ground-truth data as well as real benchmark and experimental data.

Bayesian nonparametric Markov state models
We model the conformational molecule dynamics by utilizing the well-known HMM, consisting of two joint stochastic processes {Z t , X t : t = 1, . . . , T}, where t is the time index [47]. Note that we use roman upper case letters Z t , X t to refer to random variables and the corresponding lower case letters z t , x t to refer to particular realizations throughout the paper.
The distinct meta-stable conformational states are represented by the latent Markov states Z t ∈ Z ⊆ N; the observed data (e.g., experimentally obtained channel voltages or simulated atom positions) are described by X t ∈ R n . The time evolution of this joint process is given as a DTMC on the discrete state space Z governed by a transition probability function Π : Z × Z → [0, 1]. We represent this as a matrix Π ∈ [0, 1] |Z|×|Z| , whose kth row π k := Π k· specifies the probabilities of transitions from state k ∈ Z to all possible states l ∈ Z,

π k (l) = P(Z t+1 = l | Z t = k).    (1)

The observation X t at time point t depends only on the state of the latent process at the same time. This dependency is given by the observation density

p(x t | z t = i) = p(x t | θ i ),    (2)

where {θ i : i = 1, . . . , |Z|} is a set of generic distribution parameters, one for each state i. In accordance with the MSM literature, we interpret the HMM as a generalization of MSMs [31], and thus refer to this construct synonymously as a hidden MSM.
The key drawback of hidden MSMs regarding the analysis of conformational switching is that the number of meta-stable molecule conformations |Z| needs to be specified in advance. Typically, however, this number is unknown. Quite on the contrary, it is a key quantity of interest that is to be determined from the data. This shortcoming can be addressed by utilizing a nonparametric modeling approach, which allows for countably infinite state spaces. For any finite data set, |Z| will however be finite and can hence be learned from the data. In more concrete terms, we specify a model for potentially infinitely many distinct molecular conformations; in any given observed trajectory from simulations or experiments, only a finite number of these conformations will be visited, the number of which can then be identified.
To set up a nonparametric HMM, one needs to construct prior distributions for transition matrices on countably infinite state spaces and for countably infinite observation parameters. This can be achieved via the hierarchical Dirichlet process (HDP) [48], giving rise to the HDP-HMM [36,42,43,45]. In the following, we provide mainly the relevant definitions; we provide an extended background section on Bayesian nonparametrics in the supporting information (https://stacks.iop.org/PB/19/056006/mmedia). An HDP-HMM is constructed hierarchically in a two-step fashion: first, specify a Dirichlet process (DP), which is a stochastic process taking values in the space of (discrete) probability measures,

H 1 ∼ DP(α, H 0 ),    (3)

with concentration parameter α > 0 and base probability measure H 0 over some space Θ.
A realization of H 1 is obtained by drawing independent and identically distributed (i.i.d.) samples

θ k ∼ H 0 , k = 1, 2, . . . ,    (4)

from the base measure H 0 , and assigning to each θ k ∈ Θ a probability mass σ k via a stick-breaking process:

v k ∼ Beta(1, α), σ k = v k ∏ l<k (1 − v l ),    (5)

where Beta(r, s) is the beta distribution with shape parameters r, s > 0 [47]. Equation (5) is compactly written as σ ∼ GEM(α), short for Griffiths-Engen-McCloskey process [48]. A sample H 1 ∼ DP(α, H 0 ) accordingly reads

H 1 = Σ k σ k δ θ k ,    (6)

where δ θ k denotes the Dirac or point measure at θ k [49], δ θ k (θ) = 1 if θ = θ k and 0 otherwise. This procedure results in a valid discrete probability measure, ∫ dH 1 = 1, determining a prior distribution over conformational states: each k represents one distinct conformation, with σ k its probability and θ k its associated parameterization.
In the second step, H 1 serves as base measure of another, subordinate DP: because H 1 is discrete, all samples drawn from this subordinate DP have shared support. We consider independent draws

G k ∼ DP(α + ξ, (αH 1 + ξδ θ k )/(α + ξ)), k = 1, 2, . . . ,    (7)

with the stickiness parameter ξ ≥ 0, on which we will elaborate in the next paragraph. Decomposing

G k = Σ l π k (l) δ θ l ,    (8)

and denoting with δ k the point measure at index k, this yields

π k ∼ DP(α + ξ, (ασ + ξδ k )/(α + ξ)).    (9)

As the support of all π k are the shared atoms {θ 1 , θ 2 , . . .} drawn in equation (4), each π k can be understood as a realization from a probability distribution over a row of a 'countably infinite transition matrix' Π. Each element of the set θ k ∈ {θ 1 , θ 2 , . . .} corresponds to one latent state k, that is, one molecular conformation, and parameterizes the respective observation distribution, p(x t |θ k ). In other words, the two-step HDP-HMM construction (i) defines the molecular conformations via the measure H 1 , and (ii) determines their transition dynamics via all π k . The stickiness parameter introduces a self-transition bias, that is, it extends the sojourn times within each state. For ξ = 0, the classical HDP is recovered [48]. The sticky HDP-HMM has been shown to counter-balance the sensitivity of the classical HDP-HMM to within-state variability, which results in a tendency to introduce redundant states all pertaining to the same ground-truth state; see, e.g., [42]. This is exactly the setting we are interested in, as we are aiming specifically at the analysis of meta-stable states potentially exhibiting a high level of intra-state variability. Note that in comparison to classical MSMs, the stickiness parameter can be understood as a bias towards larger time scales that is to be set by the experimenter. Importantly, our approach does not discard any information, but retains all available data points; the stickiness represents only a bias, not a strict truncation of resolvable time scales. Hence, the minimum time scale this approach is able to resolve is the native time scale of the data points.
As with all hyperparameters, we set ξ empirically; see the supporting information for details.
The above definition allows us to formulate the full model distribution. Denoting with x = {x i [1,T] : i = 1, . . . , I} the collection of all trajectories (and analogously z for the latent state sequences), we can write

p(x, z, Π, θ, σ) = p(σ) ∏ k p(π k |σ) p(θ k ) ∏ i p(z i 1 ) p(x i 1 |θ z i 1 ) ∏ t=2..T p(z i t |z i t−1 ) p(x i t |θ z i t ),    (10)

with some initial distributions p(z i 1 ), constituting a fully Bayesian nonparametric HMM.
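As an illustration of the full generative process, the following sketch samples a trajectory from a finite truncation of the sticky HDP-HMM with one-dimensional normal observations (our own illustrative Python; the truncation level, the Gaussian base measure and all parameter values are arbitrary choices, not the paper's):

```python
import numpy as np

def sample_sticky_hdp_hmm(T, K, alpha, xi, rng):
    """Generate one trajectory from a K-truncated sticky HDP-HMM with 1D
    normal observations (illustrative truncation; the model is infinite)."""
    # top-level weights sigma ~ GEM(alpha), truncated and renormalized
    v = rng.beta(1.0, alpha, size=K)
    sigma = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    sigma /= sigma.sum()
    # per-state observation parameters theta_k ~ H0 (here: normal means)
    means = rng.normal(0.0, 5.0, size=K)
    # rows pi_k, approximated by a finite Dirichlet whose concentration gets
    # a self-transition boost xi at index k (cf. the sticky construction)
    Pi = np.vstack([rng.dirichlet(alpha * sigma + xi * np.eye(K)[k])
                    for k in range(K)])
    z = np.empty(T, dtype=int)
    x = np.empty(T)
    z[0] = rng.choice(K, p=sigma)
    for t in range(T):
        if t > 0:
            z[t] = rng.choice(K, p=Pi[z[t - 1]])
        x[t] = rng.normal(means[z[t]], 0.5)
    return z, x

z, x = sample_sticky_hdp_hmm(T=500, K=10, alpha=1.0, xi=20.0,
                             rng=np.random.default_rng(2))
```

With a large stickiness ξ, the sampled state sequence exhibits the long sojourn times characteristic of meta-stable dynamics.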

Observation models and conjugate priors
To fully specify the HDP-HMM, it remains to set up the observation distributions p(x t |θ z t ) for the required spaces x ∈ R n and x ∈ [0, 2π) n as well as the corresponding prior distributions H 0 . For the purpose of inference, it is beneficial to choose priors that are conjugate to the respective likelihoods: a prior of a specific functional form f parameterized by γ, p(θ|γ) =: f (θ, γ), is said to be conjugate to a given conditional probability distribution p(x|θ) if the resulting Bayesian posterior distribution p(θ|x) = p(x|θ)p(θ)/p(x) is of the same functional form as the prior, p(θ|x) = f (θ, γ′), with updated parameters γ′. This property simplifies inference, because the computation of the posterior distributions then reduces to computing the parameter updates γ → γ′.
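As a minimal illustration of this property, consider the textbook Beta-Bernoulli pair, where the update γ → γ′ amounts to adding counts (our own toy example; the NIW case used in this paper works analogously):

```python
import numpy as np

def beta_bernoulli_update(a, b, x):
    """Conjugate update for a Bernoulli likelihood with a Beta(a, b) prior:
    the posterior is again a Beta with parameters (a + #ones, b + #zeros)."""
    x = np.asarray(x)
    return a + x.sum(), b + (1 - x).sum()

a1, b1 = beta_bernoulli_update(1.0, 1.0, [1, 1, 0, 1])  # -> Beta(4, 2)
assert (a1, b1) == (4.0, 2.0)
```

No integral ever needs to be evaluated: the functional form is preserved and only the parameters change, which is exactly what makes conjugate pairs attractive for variational updates.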
Real-valued data. A versatile model for coordinate data x ∈ R n as often obtained through MD (e.g., 3D atom positions) as well as electrophysiological experiments is the multivariate normal (MVN) distribution,

p(x|μ, Σ) = N (x|μ, Σ) = (2π) −n/2 det(Σ) −1/2 exp(−(x − μ) T Σ −1 (x − μ)/2),    (11)

with the mean vector μ ∈ R n and the covariance matrix Σ ∈ R n×n . Generally, we interpret the raw data as noisy observations of the discrete latent conformational states. The MVN is well suited for this purpose due to its unimodality, as we aim to identify well-discernible, meta-stable states. For ion channel voltage data, typically x t ∈ R, which is covered by equation (11) as the special case n = 1. Note that normal observation models are frequently used for biophysical experiments [37,38,40]. We can hence cover both MD simulation data and experimental voltage trajectory data with the same observation model.
For the MVN, a conjugate prior exists, termed the normal inverse-Wishart (NIW) distribution, which defines a joint distribution over means μ and covariances Σ:

p(μ, Σ|μ 0 , λ, Ψ, ν) = NIW(μ, Σ|μ 0 , λ, Ψ, ν) = N (μ|μ 0 , Σ/λ) IW(Σ|Ψ, ν),    (12)

with the inverse-Wishart (IW) distribution [47]. We combine the likelihood equation (11) with the conjugate prior equation (12) to specify the HDP-HMM for real-valued data.
Angular data. Another natural and widely used way in MD of specifying the spatial arrangement of complex molecules are the dihedral angles between adjacent atom or molecule planes [50]. Hence, data are often angular and constrained to the unit circle, x t ∈ [0, 2π) n . Observation models particularly suited for these spaces are von Mises (vM) type distributions [51,52]. Since it is customary to characterize amino acid chains such as proteins by sets of pairs of torsion angles (φ i , ψ i ), we focus on the two-dimensional case here; the extension to multiple pairs is then straightforward. We utilize a parameterization of the bivariate vM distribution that is well established in the context of protein modeling [53],

p(φ, ψ|ζ, ν, κ 1 , κ 2 , κ 3 ) = c(κ 1 , κ 2 , κ 3 ) −1 exp(κ 1 cos(φ − ζ) + κ 2 cos(ψ − ν) − κ 3 cos(φ − ζ − ψ + ν)),    (13)

where c(κ 1 , κ 2 , κ 3 ) = (2π) 2 [I 0 (κ 1 )I 0 (κ 2 )I 0 (κ 3 ) + 2 Σ j≥1 I j (κ 1 )I j (κ 2 )I j (κ 3 )] and I j is the modified Bessel function of the first kind and order j. The location parameters ζ and ν control the position of the mode of the distribution, as can be seen from the trigonometric terms in equation (13). The parameters κ 1 , κ 2 , κ 3 specify the spatial correlations. Note that marginalizing over φ and setting κ 1 = κ 3 = 0 recovers the conventional one-dimensional vM distribution,

p(ψ|ν, κ 2 ) = exp(κ 2 cos(ψ − ν)) / (2π I 0 (κ 2 )).    (15)

While analytical expressions for a conjugate prior exist also for the bivariate vM distribution [51], the infinite sum of Bessel functions in equation (13) renders the normalizer c intractable in a Bayesian setting. Computing the exact posterior vM is hence not possible, as this requires computing integrals over all κ parameters in equation (14).
Additionally, this distribution is not guaranteed to be unimodal; intricate conditions on the relation of the concentration parameters κ 1 , κ 2 , κ 3 exist to achieve unimodality [52]. For high concentration values in specific regimes, however, it is known that the bivariate vM distribution is well approximated by a bivariate normal distribution [51]. This is unsurprising, as in general, vM-type distributions and normal distributions are tightly linked: the former can be constructed from the latter [54]. To ensure tractability and interpretability, we utilize this circumstance and propose an approximation via

p(φ, ψ|ζ, ν, κ 1 , κ 2 , κ 3 ) ≈ N ((φ, ψ) T |μ, Σ).    (16)

The mode position (ζ, ν) roughly corresponds to the mean vector μ; the covariance Σ depends on the κ-parameters. The precise analytical expressions for these dependencies are rather involved and not relevant to our approximation; we hence refer the interested reader to [51,52] for an in-depth analysis.
As we focus on systems exhibiting distinct, separable meta-stable states, we in fact expect peaked angular distributions, for which this approximation is valid.
To gain a better intuition, see also figure 1, where we compare a one-dimensional vM with the corresponding normal distribution; as is immediately clear, for high concentration values the approximation error becomes negligible. Additionally, equation (16) allows for straightforward debugging: as long as the probability assigned to the area outside the unit circle is small, the approximation can be assumed valid; vice versa, it deteriorates if this probability becomes non-negligible. We consequently accept this error in the observation model to arrive at a tractable expression.
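The comparison in figure 1 is easy to reproduce numerically; the following sketch (our own Python, normalizing the vM density by quadrature instead of Bessel functions) shows the maximal density difference shrinking as the concentration grows:

```python
import numpy as np

def vm_pdf(x, mu, kappa):
    """One-dimensional von Mises density on [0, 2*pi), normalized numerically
    to avoid evaluating the Bessel-function normalizer."""
    grid = np.linspace(0.0, 2.0 * np.pi, 4000, endpoint=False)
    unnorm = np.exp(kappa * np.cos(grid - mu))
    norm = unnorm.sum() * (grid[1] - grid[0])
    return np.exp(kappa * np.cos(x - mu)) / norm

def normal_pdf(x, mu, kappa):
    """Matched normal approximation: mean mu, variance 1/kappa."""
    return np.sqrt(kappa / (2.0 * np.pi)) * np.exp(-0.5 * kappa * (x - mu) ** 2)

x = np.linspace(0.0, 2.0 * np.pi, 2001)
for kappa in (1.0, 10.0, 100.0):
    err = np.max(np.abs(vm_pdf(x, np.pi, kappa) - normal_pdf(x, np.pi, kappa)))
    print(f"kappa = {kappa:5.0f}: max density difference {err:.4f}")
```

For peaked distributions (large κ), the two densities become numerically indistinguishable, which is the regime the meta-stable-state analysis targets.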
Note that the gained tractability may greatly aid the practical utility of the framework, as it is otherwise also customary to resort to 3D coordinates to avoid mathematical complexity, disregarding crucial structural information about the biological problem [40].
In the following, we refer to our approximation as the approximate vM model.

Scalable variational inference
The central object of Bayesian inference is the posterior distribution over all unobserved quantities,

p(z [1,T] , Π, θ, σ|x [1,T] ) = p(x [1,T] , z [1,T] , Π, θ, σ) / p(x [1,T] ).    (17)

This distribution cannot be evaluated analytically. In principle, one can employ standard sampling techniques such as MCMC and obtain the posterior empirically [36]. In our case, however, the typically large data sets from simulations or long-duration experiments (see, e.g., the discussion in [37]) render this computationally infeasible, because every draw from the posterior requires one full pass through the data. We therefore instead seek an approximation q to the posterior from a tractable family of distributions by minimizing the Kullback-Leibler (KL) divergence:

q * = argmin q KL(q(z [1,T] , Π, θ, σ) ‖ p(z [1,T] , Π, θ, σ|x [1,T] )).    (18)
This KL divergence is still computationally intractable because of the evidence p(x [1,T] ) in the denominator of equation (17). The evidence is in principle obtained by integrating over all unobserved model components, p(x) = Σ z ∫ p(x, z, Π, θ, σ) dθ dΠ dσ, which requires a summation of |Z| T terms, each of which contains the full integrals over the parameters Π, θ and σ; it is hence impossible to compute for realistic state-space sizes and sequence lengths.
The optimization problem equation (18) can however be transformed into an equivalent, tractable problem. To do so, one re-writes the KL divergence as

KL(q(z [1,T] , Π, θ, σ) ‖ p(z [1,T] , Π, θ, σ|x [1,T] )) = −L + log p(x [1,T] ),    (19)

where

L := E q [log p(x [1,T] , z [1,T] , Π, θ, σ) − log q(z [1,T] , Π, θ, σ)],    (20)

and E q is the expectation operator, where the expectation is to be taken with respect to q. This yields a lower bound on the log evidence, L ≤ log p(x [1,T] ), since the KL divergence satisfies KL(q ‖ p) ≥ 0 for any two distributions q, p. Accordingly, the quantity L is termed the evidence lower bound (ELBO). Since the log evidence is constant with respect to the variational distribution, minimizing the KL divergence is equivalent to maximizing the ELBO; the intractable computation of the log evidence is hence not needed to evaluate L. Without further assumptions, maximization of equation (20) yields the exact posterior, q * = p(z [1,T] , Π, θ, σ|x [1,T] ), but does not provide a practical way of actually performing the optimization. To enable a practical computational scheme, we employ a standard mean-field assumption on the variational distributions [44]:

q(z [1,T] , Π, θ, σ) = q(z [1,T] ) q(σ) ∏ k q(π k ) q(θ k ).    (21)

This enables a computationally tractable, iterative coordinate-wise ascent optimization procedure [47]: one variational factor of equation (21) is optimized at a time while keeping all others fixed, and one pass through all variational factors constitutes a VI iteration. The generic distribution update for any quantity α ∈ {z [1,T] , {θ k } k , {π k } k , σ} is obtained as

q(α) ∝ exp(E q\α [ln p(x [1,T] , z [1,T] , Π, θ, σ)]),    (22)

where E q\α denotes the expectation with respect to all variational distributions except q(α). Note that while the ELBO is not convex with respect to all variational distributions jointly [44], it is convex with respect to any factor individually [55]. This coordinate-wise ascent algorithm hence converges to a local optimum which in general depends on the initialization of the variational factors.
To alleviate this initialization-dependency, we additionally utilize a multi-start approach; we run several instances of the inference algorithm until convergence and then select the one with the maximal ELBO score as the overall optimum.
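The ELBO relation underlying this scheme, KL(q ‖ p) = −L + log p(x), can be verified directly on a tiny, fully enumerable toy model (our own example; all numbers are arbitrary):

```python
import numpy as np

# toy model: z in {0, 1} with prior p(z), Gaussian likelihood p(x|z);
# everything is enumerable, so evidence and posterior are exact
def log_normal(x, mu, var):
    return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var

prior = np.array([0.7, 0.3])
x_obs = 1.5
log_joint = np.log(prior) + log_normal(x_obs, np.array([0.0, 2.0]), 1.0)
log_evidence = np.logaddexp(log_joint[0], log_joint[1])
posterior = np.exp(log_joint - log_evidence)

q = np.array([0.5, 0.5])  # an arbitrary variational distribution
elbo = np.sum(q * (log_joint - np.log(q)))
kl = np.sum(q * (np.log(q) - np.log(posterior)))

assert np.isclose(elbo, log_evidence - kl)  # the KL/ELBO identity
assert elbo <= log_evidence                 # the ELBO is a lower bound
```

The identity holds for any choice of q, so maximizing the ELBO over q is equivalent to minimizing the KL divergence without ever computing the evidence.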
Since the HDP-HMM specifies distributions over countably infinite objects, VI in this case requires an additional variational parameter. To be able to instantiate the q-distributions, it is necessary to truncate the number of variational states to some maximum number K. This number could in principle be set to the number of data points; in practice, for computational reasons one chooses some number which is large compared to the expected number of HMM states [36,42]. Note that this only affects the variational distributions; the original model equation (10) remains unchanged [56]. We choose the 'direct assignment' truncation method, setting q(z t ) = 0 for any z t > K [36]. The resulting update equations follow in closed form, as will be shown in the following. The variational model then can, but does not necessarily, utilize all clusters up to K [57]. This also allows for straightforward debugging, as it is directly apparent whether all K states are occupied. If q(z t ) > 0 for all states z t , one might incur a non-negligible truncation error, as intuitively speaking more states might be needed to explain the data, and a double-check with increased K is advisable. If however q(z t ) = 0 for some states, the variational approximation is expressive enough and will not result in a significant truncation error. Note that the direct assignment scheme can be utilized for automated search algorithms over the truncation depths [58].
We provide the detailed mathematical derivations of all updates as well as the used initializations in the supporting information and state here only the update equations.
Latent state sequence. The marginal probabilities of the sequence of meta-stable states, q(z t ), can be computed by a forward-backward message-passing algorithm [59]. The forward messages α t and the backward messages β t are computed as

α t (z t ) ∝ exp(E q [ln p(x t |θ z t )]) Σ z t−1 α t−1 (z t−1 ) exp(E q [ln π z t−1 (z t )]),    (23)
β t (z t ) ∝ Σ z t+1 β t+1 (z t+1 ) exp(E q [ln π z t (z t+1 )]) exp(E q [ln p(x t+1 |θ z t+1 )]),    (24)

and yield the marginals via q(z t ) ∝ α t (z t )β t (z t ). The expectations occurring in these expressions can be evaluated in closed form because of conjugacy between the variational distributions over Π and {θ 1 , . . . , θ K } and the corresponding likelihoods. Note that the forward and backward messages are the only part of the framework that accepts the trajectory data x to be analyzed as input; due to the mean-field assumption, the remaining variational distribution updates do not require x.
Transition distributions. The DP can be shown to be conjugate to the (infinite) categorical distributions defined by a DP draw, equation (9) [48]. Note that this is completely analogous to the finite case, where the Dirichlet distribution is a conjugate prior for the categorical distribution. As we constrain the variational posterior to a maximum of K states, this induces a partition on the base measure space Θ of equation (9). This allows one to write the prior as a finite Dirichlet distribution with K + 1 dimensions corresponding to the K states and the 'rest' of the state space where no transitions are observed (see the supporting information for details). Because of conjugacy, the updated variational transition probabilities for the ith row of Π then read

q(π i |{η i,k } k ) = Dir(η i,1 , . . . , η i,K , η i,− ),    (25)

with the posterior concentration parameters

η i,k = ασ k + ξδ i,k + Σ j Σ t q(z j t−1 = i, z j t = k), η i,− = α(1 − Σ k≤K σ k ),    (26)

where the Kronecker delta δ i,j = 1 if i = j and 0 otherwise, and j runs over the trajectories.
Observation distributions.
Due to conjugacy between the MVN observation likelihoods and the corresponding variational NIW distributions, the variational posterior of the observation model parameters of state k reads

q(μ k , Σ k ) = NIW(μ k , Σ k |μ′ 0,k , λ′ k , Ψ′ k , ν′ k ),    (27)

where, writing w j t (k) := q(z j t = k) for the variational state responsibilities and N k := Σ j Σ t w j t (k), x̄ k := N k −1 Σ j Σ t w j t (k) x j t ,

λ′ k = λ + N k , ν′ k = ν + N k , μ′ 0,k = (λμ 0 + N k x̄ k )/(λ + N k ),
Ψ′ k = Ψ + Σ j Σ t w j t (k)(x j t − x̄ k )(x j t − x̄ k ) T + (λN k /(λ + N k ))(x̄ k − μ 0 )(x̄ k − μ 0 ) T .    (28)

Recall that I is the number of trajectories and T is the number of time points; μ 0 , λ, Ψ, ν are the parameters of the prior distribution H 0 . For the approximate vM model, we deal with the periodicity by projecting the data into an interval [−π, +π] around each mean:

x̃ j t = μ′ 0,k + ((x j t − μ′ 0,k + π) mod 2π) − π.    (29)

Note that this necessarily leads to an underestimation of the covariance, as we treat the data as if it were produced by a normal distribution, where in reality, it has been generated by a vM; data outside of [0, 2π) do not occur. This is tolerable for two reasons: first, as detailed above, we assume the data to be peaked for our approximation to hold; if this assumption is valid, the probability mass outside the interval [−π, π] is negligible, cf figure 1. Second, due to the well-known mode-seeking property of VI methods, we anyway incur an underestimation of uncertainty, to which equation (29) should add little [44].
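The projection of equation (29) can be sketched in a few lines (our own Python; the exact interval convention is an assumption):

```python
import numpy as np

def wrap_around_mean(x, mu):
    """Project angles in [0, 2*pi) into the interval [mu - pi, mu + pi),
    so they can be treated as approximately normal around mu
    (a sketch of the projection idea; conventions may differ)."""
    return mu + np.mod(x - mu + np.pi, 2.0 * np.pi) - np.pi

mu = 0.3
x = np.array([0.2, 6.2, 3.0])  # 6.2 lies just below 2*pi, i.e. 'close' to 0.3
print(wrap_around_mean(x, mu))  # 6.2 is mapped to 6.2 - 2*pi
```

Data near the opposite side of the circle from the mean are the only points affected; for peaked states this happens rarely, which is why the induced covariance bias stays small.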
Top-level stick-breaking measure. Setting up the transition distributions p(π k |σ) as above (cf 'transition distributions') as (K + 1)-dimensional Dirichlet distributions results in the relation between p(π k |σ) and the stick-breaking measure σ ∼ GEM(α) being non-conjugate [36]. Hence, a closed-form update for σ is not available. It is customary to instead utilize a point estimate q(σ) = δ σ * (σ), rendering the expectation in equation (22) tractable [36]. The optimum still has no closed-form solution, however; thus, we utilize a gradient optimization scheme and update σ * ← σ * + ω∇ σ * L. To set the step size ω, we utilize a back-tracking line search algorithm [64].
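Putting the latent-state update above into code, a numerically stabilized forward-backward recursion for q(z t ) might look as follows (our own illustrative Python; it takes the expected log parameters as inputs and assumes a uniform initial state distribution):

```python
import numpy as np

def forward_backward(log_pi, log_lik):
    """Marginals q(z_t) from expected log transition probabilities (K x K)
    and expected log observation likelihoods (T x K); per-step normalization
    keeps the recursions numerically stable."""
    T, K = log_lik.shape
    A = np.exp(log_pi - log_pi.max())          # constant factors cancel below
    alpha = np.zeros((T, K))
    beta = np.ones((T, K))
    a = np.exp(log_lik[0] - log_lik[0].max())  # uniform initial distribution
    alpha[0] = a / a.sum()
    for t in range(1, T):
        a = (alpha[t - 1] @ A) * np.exp(log_lik[t] - log_lik[t].max())
        alpha[t] = a / a.sum()
    for t in range(T - 2, -1, -1):
        b = A @ (beta[t + 1] * np.exp(log_lik[t + 1] - log_lik[t + 1].max()))
        beta[t] = b / b.sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# two sticky states, observations favoring state 0 first and state 1 later
log_pi = np.log(np.array([[0.9, 0.1], [0.1, 0.9]]))
log_lik = np.log(np.array([[0.9, 0.1], [0.9, 0.1], [0.1, 0.9], [0.1, 0.9]]))
print(forward_backward(log_pi, log_lik))
```

All per-step scaling constants cancel in the final row normalization, so the returned marginals are unaffected by the stabilization.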

Results
We apply the laid-out framework to a range of different data sets. First, we demonstrate the framework on ground truth 2D HMM data to provide an intuition about its functionality. We then apply the model to synthetic continuous-valued SDE data generated from a standard three-well benchmark potential often utilized in the MSM literature [13,15,31] and demonstrate its ability to learn a readily interpretable discrete structure from continuous dynamics. Subsequently, we show that our vM approximation works well on synthetic 2D vM data and then employ this approximation on a standard MD benchmark dataset from the protein alanine dipeptide [15,33,60,61]. Lastly, we show the model's utility on a large dataset from voltage clamp experiments on the viral potassium channel Kcv PBCV−1 [62]. We provide the parameters used to generate all synthetic data in the supporting information.

Synthetic HMM data
To demonstrate the method, we set up a cyclic three-state HMM (transition probabilities provided in the supporting information). From this HMM, we generate 10 independent latent sequences z i [1,T] consisting of 1000 time points each. Each observation x i t ∈ R 2 is drawn from a normal distribution with anisotropic covariance, x i t ∼ N (μ z i t , Σ z i t ); we provide the values of all μ z i t , Σ z i t in the supporting information. As discussed above, we utilize a multi-start scheme to alleviate the problem of local optima; we run 10 randomly initialized instances until convergence and pick the one with the highest ELBO value as the optimum. The method accurately recovers three latent states, as shown in the inset of figure 2(B). We show the inferred marginals q(z i t ) of one latent state sequence in figure 2, demonstrating also the accurate recovery of the ground truth sequence. In particular, the maximum a posteriori state assignment of each data point x t , defined via ẑ t = argmax z t q(z t ), precisely matches the corresponding ground truth. Accordingly, also the inferred posterior means μ 0,z and expected covariances E[Σ z ] = Ψ z /(ν z − n − 1) (black crosses and circles in figure 2), with n = 2 the dimensionality of the system, faithfully resemble their true counterparts (diamonds and dashed ellipses in figure 2). Note that the labels of the inferred states of course do not need to correspond to the ground-truth labels; this is an interpretation to be done by the experimenter after convergence of the model. We hence show in figure 2 explicitly the trajectories of all K states, of which only 3 are significantly occupied. In all following figures, we will omit all unoccupied states.

Stochastic dynamics in a 2D potential
After validating the method, we apply it to a standard benchmark problem of Markov state modeling, which consists of stochastic particle dynamics in a 2D potential landscape with three distinct wells [13,31,32]. The dynamics are given by the Itô SDE [63] dX t = −∇U(X t ) dt + Q dW t with the potential function U : R n → R, a constant dispersion Q ∈ R n×n and the standard Brownian motion W. We provide the functional form of U in the supporting information. Using an Euler-Maruyama scheme to simulate these dynamics, we generate 10 trajectories of length T = 10 000 time points each. The potential landscape together with the inferred meta-stable state means μ 0,z and expected covariances E [Ψ z ] is shown in figure 3. The reconstruction captures the essential features: the locations of the potential wells are accurately identified, with the two deeper wells fit with higher precision than the shallow minimum at the top (shown in red in figure 3; the left and right wells are shown in blue and green, respectively).
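A minimal Euler-Maruyama simulation of such dynamics can be sketched as follows. The surrogate potential below, built from three Gaussian wells, is an assumption for illustration only; the actual functional form of U is given in the supporting information.

```python
import numpy as np

# Surrogate triple-well potential: U(x) = -log sum_k exp(-||x - c_k||^2 / (2 s2)).
centers = np.array([[0.0, 1.5], [-1.0, -0.5], [1.0, -0.5]])

def grad_U(x, s2=0.3):
    """Analytic gradient of the surrogate triple-well potential."""
    d = x - centers                              # (3, 2) offsets to the wells
    r2 = np.sum(d ** 2, axis=1)
    w = np.exp(-(r2 - r2.min()) / (2 * s2))      # numerically stabilized weights
    return (d * w[:, None]).sum(axis=0) / (s2 * w.sum())

def simulate(T=10_000, dt=0.01, q=0.6, rng=None):
    """Euler-Maruyama step: x_{t+1} = x_t - grad U(x_t) dt + Q dW_t."""
    if rng is None:
        rng = np.random.default_rng(0)
    Q = q * np.eye(2)                            # constant dispersion
    x = np.empty((T, 2))
    x[0] = centers[0]
    for t in range(1, T):
        dW = rng.normal(size=2) * np.sqrt(dt)    # Brownian increment
        x[t] = x[t - 1] - grad_U(x[t - 1]) * dt + Q @ dW
    return x

# 10 trajectories of T = 10 000 time points each, as in the experiment above.
trajectories = [simulate(rng=np.random.default_rng(i)) for i in range(10)]
```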
The total state sojourn times correspond to the well depths. The inferred sequence of meta-stable states accordingly yields highly plausible results, as can be checked by comparing the components of the true, continuous process to the inferred discrete switching process.
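Sojourn statistics of this kind can be read off directly from any discrete (e.g. MAP) state sequence. The helper below is a hypothetical illustration of this bookkeeping, not part of our inference code.

```python
import numpy as np

def sojourn_stats(z, dt=1.0):
    """Total occupation time per state and the lengths of contiguous
    visits, computed from a discrete state sequence z with time step dt."""
    z = np.asarray(z)
    total = {s: dt * np.sum(z == s) for s in np.unique(z)}
    # split z at every state change to obtain the individual sojourns
    change = np.flatnonzero(np.diff(z)) + 1
    runs = np.split(z, change)
    visits = [(run[0], dt * len(run)) for run in runs]
    return total, visits

total, visits = sojourn_stats([0, 0, 1, 1, 1, 0, 2])
# total -> {0: 3.0, 1: 3.0, 2: 1.0}
# visits -> [(0, 2.0), (1, 3.0), (0, 1.0), (2, 1.0)]
```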

Synthetic HMM data with angular observations
To demonstrate the method on angular data, we generate data from the same HMM as before, but employ a vM observation model. We define two latent states to generate independent 1D vM observations along each dimension, cf equation (15); the third state includes angular correlations and is generated from the bivariate vM distribution equation (13) following the sampling scheme of [53]. As in the non-angular case, we create overlap between the individual distributions. Additionally, we include observations that wrap around the period boundary 2π → 0.
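Generation of the two independent angular states can be sketched as follows. Means and concentrations are illustrative assumptions (the actual values are in the supporting information), and sampling of the correlated bivariate vM state, which follows [53], is not shown here.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative parameters for the two independent-vM states, cf equation (15).
mus = np.array([[0.5, 5.8],     # state 0: second dimension near the 2π -> 0 boundary
                [3.0, 1.0]])    # state 1
kappas = np.array([[8.0, 8.0],
                   [4.0, 4.0]])

def sample_vm(z, n):
    """Draw n 2D observations from state z, wrapped onto [0, 2π)."""
    # numpy draws von Mises samples on (-π, π]; the modulo wraps them
    # onto [0, 2π), producing observations across the period boundary
    x = rng.vonmises(mus[z], kappas[z], size=(n, 2))
    return np.mod(x, 2 * np.pi)
```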
As shown in figure 4, our vM approximation recovers the ground truth means with high fidelity. Since the ground truth data were generated by true vM distributions, no ground truth covariance matrices exist and hence, the inferred covariances cannot be directly compared to them. We can however assess by comparison with the plotted data that the vM approximation produces accurate estimates. In particular, we note that our projection method equation (29) enables sensible and accurate periodic continuations across the period boundaries 2π ↔ 0 (indicated by the orange arrow in figure 4).
[Caption fragment belonging to figure 5: Colored diamonds represent the variational means of the meta-stable states. Ellipses indicate the expected variational covariances. Annotations refer to known α-helix and β-sheet conformations, cf [65,69]. Right: expected transition probabilities. Each row shows the transition probabilities E [π] from one of the five found states to all others, including the 'rest' of the state space (cf equation (25)) indicated by '−'. Note that the transitions α → α R and α R → α are approximately symmetric.]

MD simulation data: alanine dipeptide
We apply the approximate vM model to MD simulation data from alanine dipeptide provided with the pyemma package [24]. Alanine dipeptide is a widely used model system in computational biology [65][66][67]. The data are taken from the cited, publicly available repository and, to the best of our knowledge, are simulated in explicit water using the TIP3P water model [60]. Notably, plain inspection of the raw data reveals that the simulated molecule exhibits meta-stable dynamics, underlining the relevance of our meta-stability assumption at the outset, see supporting information figure 2. This two-dimensional dataset describes the molecule dynamics in terms of the backbone torsion angles (φ, ψ) and consists of three independent simulation runs of length T = 250 000 ps each, with a time step of 1 ps.
The energy landscape in terms of φ and ψ exhibits an intricate fine structure of several local minima. Due to its wide adoption in the field, several computational frameworks have been applied to this dataset, yielding partitionings into between three and six different states [15,33,60,61,68]. As shown in the Ramachandran plot in figure 5, our framework identifies five different states that are in line with the aforementioned literature. By comparison to in-depth MD studies of this particular molecule [65,69], we can match the found states to known α-helix and β-sheet conformations of alanine dipeptide. Note that the transition probabilities between the two states α and α R are found to be very similar. A valid interpretation of our model results is hence that these two states could also be lumped together, which would likewise be in accordance with the literature [65,69]. Notably, the runtime was only around 120 s for one complete optimization run until convergence on a 2.5 GHz Intel i7 processor.

Electrophysiological single-molecule ion channel data
Lastly, we use our method on time course data of single-channel measurements of the viral potassium channel Kcv PBCV−1 . It is known that the wild-type channel switches between an 'open' and a 'closed' state; mutation of the last amino acid to histidine, however, leads to the appearance of sublevels between 'open' and 'closed' [70]. We utilize our method to quantify these sublevels. The data are obtained using the planar lipid bilayer technique as detailed in, e.g., [5]. The applied voltage is 160 mV at pH = 6 and data are sampled at 5 kHz over a time span of T = 60 s, half of which we discard due to apparent drift. The complete trajectory is shown in supporting information figure 3. Despite the high noise level in the measurements, the inferred latent sequence shows a highly plausible switching behavior, see figure 6: we find three different states, namely a 'closed' state and an 'open' state as well as one intermediate, subconductive state. The histidine mutation consequently gives rise to one novel channel conformation that cannot be attained by the wild-type. Importantly, one full optimization run took only ∼25 s for a sequence of 1.5 × 10^5 data points, which is orders of magnitude faster than the sampling algorithm proposed in [37] for the analysis of such trajectories. Moreover, conventional methods of trajectory segmentation [74] require both the prespecification of the number of conformational states and of their conductivity values, which our method does not.

Discussion
The nonparametric Bayesian Markov state model framework presented in this work offers a generative modeling approach for inference of global, metastable states from MD and experimental data. In particular, this allows the user to leave the number of conformational states unspecified a priori and rather learn it from data. This is beneficial as the number of states in typical computational and experimental settings is not known in advance. In contrast to the MSM approach, we (i) neither need to pre-process the data via discretization and temporal thinning to re-establish Markovianity, (ii) nor manually select the number of meta-stable states. Our method importantly does not deteriorate the temporal resolution of the data.
As we have demonstrated, the model is able to reliably identify the relevant meta-stable states of the system: their number has been sensibly established in all experiments. The application to the triple-well potential highlights the utility of this model on purely continuous data as generated, e.g., by MD. It hence achieves the central goal of modeling the complex dynamics via a finite set of readily interpretable discrete conformational states; in other words, one obtains a spatio-temporal clustering of the data. Furthermore, we presented a computationally tractable approximation to the classical vM distribution that yields accurate results. Application of this approximation to the canonical alanine dipeptide benchmark yielded results consistent with the literature. This is of special relevance to MD due to the frequent use of dihedral angles as system coordinates. We stress that this benchmark problem, albeit consisting of relatively short trajectories by MD standards, requires the use of VI methods, since MCMC-type sampling schemes would be computationally infeasible for this task. We thereby also provide a scalable alternative for inference on experimental voltage clamp data, where existing methods all resort to sampling schemes and hence require much longer runtimes than our framework. This is a significant advance: while nonparametric methods have been around for quite some time (see, e.g., [37][38][39][40]), their combination with VI is not established in the field. Furthermore, the adaptation of HDP-HMM methods to typical biophysical problems is not straightforward: existing formulations utilize a simple categorical observation model p(x t |θ z t ) on discretized data (see, e.g., [31,41,71]). We hence deem it the merit of our study to adapt the existing HDP-HMM framework to the settings commonly found in biophysical problems and to demonstrate its potential for biophysics.
Note that from a technical perspective, the proposed vM approximation is novel to the best of our knowledge.
The framework lends itself to further extensions of practical relevance. One interesting direction is provided by the fact that in many MD analysis protocols, some dimensionality reduction is employed, potentially changing the geometry of the data used for analysis [72]. We believe that a natural extension of the presented model is to include observation likelihoods parameterized by neural networks. Akin to the classical variational auto-encoder, this could achieve an efficient encoding to lower dimensions [73]. We note that in the field of Markov state modeling, first approaches to this challenge have been proposed recently [33,34]. None of these proposals however build on nonparametric formulations; the number of states hence remains to be set and tuned by the user. In addition, since not only the observation distributions but also the transition distributions are parameterized via neural networks, the necessity to specify an artificial lag or thinning time scale is likewise retained. Another approach from machine learning combines classical probabilistic models with complex likelihood functions in a modular way, however compromising the convexity of the ELBO [75].
Another promising extension is offered by semi-Markovian models, in which the transitions between different states remain Markovian, but the sojourn times within each state may be non-exponentially distributed. Similar analyses have already been performed for ion channel data and might hence help to gain a detailed understanding of more complex switching dynamics [76]. The challenge here is to obtain computationally tractable inference schemes. Notice that similar ideas are already exploited for lumping in conventional MSM settings [29] and for alleviating the lag time issue of MSMs [26].
We believe that variational nonparametric models as the one presented in this paper are a natural match for the requirements of computational and experimental data analysis in the context of structural molecular biology and hence see an untapped potential for applications to biophysical problems.

Acknowledgments
This work has been funded by the European Research Council (ERC) within the CONSYN project, Grant Agreement Number 773196, as well as within the noMAGIC project, Grant Agreement Number 695078.

Data availability statement
The data that support the findings of this study are openly available at the following URL/DOI: https://git.rwth-aachen.de/bcs/projects/lk/bnp-conf-dyn.git