This site uses cookies. By continuing to use this site you agree to our use of cookies. To find out more, see our Privacy and Cookies policy.
Paper The following article is Open access

Folding kinetics of proteins with multiple domains: inference of transition rates from extinction times

and

Published 2 July 2018 © 2018 The Author(s). Published by IOP Publishing Ltd
, , Citation Yingxiang Zhou and Pak-Wing Fok 2018 J. Phys. Commun. 2 075002 DOI 10.1088/2399-6528/aacca5

2399-6528/2/7/075002

Abstract

In this paper we are concerned with the folding kinetics of large proteins that may have multiple binding/unbinding domains. The theory underlying such systems have received much less attention compared to small, 'two-state' (folded/unfolded) proteins. To analyze this problem, we model the reaction coordinate using a birth-death chain, a well-known stochastic process. Specifically, we consider a protein with N domains where every domain that unfolds (folds) increases (decreases) the reaction coordinate by an integer so that the state-space of the reaction coordinate is the set of integers {0, 1, ..., N}. As input data, our method uses (i) the extinction time of trajectories, starting from when the protein leaves the 0 state for the first time and finishing when the protein re-enters state 0 for the first time and (ii) the maximum number of unfolded domains in the said trajectory. Since the maximum value n reached by each trajectory is known, we use the proportion of trajectories that do not exceed n and corresponding mean extinction times (ET) to recover the birth-death rates sequentially from 1 to N. In general, the initial error will propagate with the site number exponentially. However, with sufficient amount of input data, we can recover the rates with relatively small error. For instance, given 50 million ETs of an 11-site BDP, we can recover the rates with a relative error of about 3%.

Export citation and abstract BibTeX RIS

Original content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI.

1. Introduction

The prediction of the native conformation of a protein, given its amino acid sequence is one of the great open problems in biophysics [15]. Since the 1990s [6, 7], techniques such as Atomic Force Microscopy (AFM) [8] and Förster Resonance Energy Transfer (FRET) [9, 10] have allowed experimentalists to explore the relationship between macromolecular structure and folding/unfolding rates.

A computational approach to protein folding is to implement large-scale all-atom molecular dynamics (MD) simulations [11]. The folding of a macromolecule is represented by tracking the positions of every single atom in the molecule and the result of the simulation is a trajectory in a high-dimensional state-space. A more conceptual way to understand protein folding is to introduce a reaction coordinate [12] which effectively maps the high-dimensional space onto a single scalar that measures the progress of the folding. The progress along the reaction coordinate is modulated by a free energy 'landscape' [13] that may exhibit multiple minima, corresponding to multiple metastable configurations. Inferring the shape of these landscapes from quantities such as extinction times [14], rupture forces [15] or time-displacement trajectories [16] remains a challenging theoretical problem.

One of the most common ways of probing the energy landscape is through AFM [17]. In AFM experiments, one end of the molecule is tethered to a movable platform and the other is attached to a cantilever tip: see figure 1(a). Small deflections of the cantiliever are detected using a laser-photodiode setup. The AFM can operate in several ways. One protocol is 'force-ramp' mode where the platform lowers at a constant speed. As a protein domain is coercively unfolded, the cantilever deflection increases until a critical platform position is reached. Beyond this point, the cantilever quickly relaxes, corresponding to domain rupture. The resulting force-extension curve allows quantification of the 'entropic elasticity' of that particular domain. The procedure can also be performed sequentially if multiple domains are present in a large protein [18, 19]. The discrete event corresponding to rupture is interesting both physically and mathematically. Experiments show that rupture does not always occur at the same force. Furthermore, the rupture force distribution shifts towards higher values when the platform speed is larger. Both of these observations are in stark contrast to mechanical bonds that break at a single yield stress. They point towards a model of bond-breaking that is based on 'thermally activated escape,' i.e. a theory of random walks.

Figure 1.

Figure 1. (a) Experimental schematic for Atomic Force Microscopy of proteins. Deflection of a soft cantilever is detected using a laser-photodiode setup. (b) Possible time trace of reaction coordinate for a two-state protein. (c) Possible trace of reaction coordinate for a three-state protein. Extinction times τ1 and τ2 along with maximal sites 2 and 1 respectively are used in this paper for inference. a.u. = arbitrary units.

Standard image High-resolution image

Besides force-ramp mode, another protocol is to keep the AFM platform stationary and operate in 'force-clamp' mode. Under this mode of operation, one focuses on deflections of the cantilever which essentially provide the reaction coordinate as a function of time. Some possible time traces are shown in figures 1(b) and (c). The protein spends most of its time in metastable configurations (when the reaction coordinate is an integer) and very little time in-between these states. Figure 1(b) shows the trace for a simple protein with a single folding domain that can be in one of two states: folded or unfolded. The kinetics in this case are well-described by two exponential distributions [20]; one for the $1\to 0$ transition and the other for the $0\to 1$ transition. The half-lives or rate constants associated with each exponential distribution can easily be inferred from the time trace in figure 1(b). For proteins with multiple domains the traces can be more complex and could resemble figure 1(c). If one assumes exponential kinetics as before (i.e. the transition between states always follows an exponential distribution, but the parameters of the distribution could be state-dependent) the resulting stochastic process is a birth-death chain. These chains correspond to energy landscapes with multiple metastable states: see figure 2.

Figure 2.

Figure 2. (a) Two-state and (b) Four-state Markov models for protein folding dynamics. Small proteins are usually described by a two-state model. Larger proteins may have multiple metastable states giving rise to longer birth-death chains.

Standard image High-resolution image

An extinction time is the time taken for the reaction coordinate to start at 1 and reach 0 for the first time, corresponding to the state where all domains are folded. In figure 1(c), the measurement of τj starts when the reaction coordinate reaches 1 for the first time. For each excursion, one can also define the maximal site nj that is reached before extinction occurs. After suitable processing of the signal, a single time trace generates many pairs (τj, nj), j = 1, ..., M and M ≫ 1. In this paper, we show how to recover all the transition rates of the birth-death chain from {τj} and {nj}. How to detect jumps in the reaction coordinate, that define when transitions between states have occurred, is a separate issue and beyond the scope of this paper.

1.1. Energy landscape inference: existing methods

In this section, we briefly describe some of the existing methods used to interpret AFM data. One of the earliest methods is Bell's method [21] which assumes a force-dependent rate ansatz (of Arrhenius form) for the state of a domain or chemical bond. Providing that the force on the protein depends on the instantaneous deflection in the cantilever, Bell's method predicts rupture force distributions and survival probabilities for a chemical bond under the action of a force ramp. The method is used to solve the 'forward' problem in the sense that rupture distributions are predicted from a potential well whose shape is known. In contrast, Dudko and co-authors [22] and Freund [23] essentially solved the inverse problem by inferring rate constants, features of the potential well and other related parameters from rupture force distributions.

Chang and co-authors [16] utlilized a path integral method that takes trajectory data (time traces) as input rather than force distributions. Based on non-parametric Bayesian estimation, the method makes very few assumptions about the underlying energy landscape and is able to simultaneously infer the energy landscape and an effective spatially-dependent diffusivity for the reaction coordinate.

Most of the above methods are mainly concerned with the AFM operating in force-ramp mode. However, in works such as [14, 24], the authors develop methodology to extract energy landscapes from data generated by AFMs in force-clamp mode. They treat the reaction coordinate using a Smoluchowksi framework to infer features of the energy landscape; this type of analysis dates back to Kramers's classic transition state theory [25] for chemical reactions. Finally, it is worth mentioning that if the reaction coordinate is treated as a Brownian random walker on an energy landscape, estimating the parameters of the resulting stochastic process from sample paths is a classic problem in statistics and control theory [26, 27].

The method described in this paper is different because from the outset, we assume that the underlying stochastic process is a birth-death chain (and subsequently, transitions are always exponentially distributed). Estimation of parameters from sample paths of a birth-death chain is a classic problem [28, 29]. However, our goal is to estimate transition rates 'mainly' from extinction times. Unfortunately, as discussed below using only extinction times results in a severely ill-posed problem. Having access to maximal site data turns out to render the inference problem much better-posed.

The inclusion of maximal sites is important. The calculation of transition rates in a birth-death chain purely from extinction times essentially reduces to finding the best-fit coefficients and exponents to a given extinction time distribution. Such problems are highly ill-posed: a small amount of noise added to the curve can lead to a large change in the best-fit coefficients/exponents. Nevertheless, because fitting exponential modes to given data is one of the most commonly-arising inverse problems, it has a long history and has been investigated by many researchers: see for example [3033] and references within.

2. Governing equations for the birth death process

We now present a method to compute the transition rates of a birth death process from its extinction times. Our stochastic model takes the form of a random walker on a finite integer lattice: $\{0,1,\,\ldots ,\,N\}$. The birth and death rates at site n are λn and μn respectively. We assume the particle starts at site n = 1 and the process terminates (goes extinct) when it reaches site 0. For each trajectory, we record the extinction time and the maximal site reached (i.e. the largest site number it attains before reaching site 0). From this data, we wish to infer the $2N-1$ rates ${\lambda }_{1},\,\ldots ,\,{\lambda }_{N-1}$ and μ1, ..., μN.

Consider a birth-death process on a lattice with sites labeled $\{0,1,2,\,\ldots ,\,N\}$, where N is finite but unknown: see figure 3. A particle starts at site 1 and executes a random walk which we write as X(t): X(t) is a random walk on the non-negative integers. At site i, the rightward (leftward) hopping rate is λi (μi). When the particle reaches site 0, we record the time of exit. When this experiment is repeated many times, we may use the resulting data to compute the extinction time distribution W(t).

Figure 3.

Figure 3. Birth death chain with N + 1 sites, with ${\lambda }_{N}=0$. Our algorithm uses the extinction times of the process given that the positions of particles never exceed site n (always remain in the dashed box).

Standard image High-resolution image

The matrix that determines this process is defined as

Equation (2.1)

and λN = 0 . If ${P}_{k}(t)={\mathbb{P}}[X(t)=k]$, then the ${\bf{P}}={[{P}_{1},\ldots ,{P}_{N}]}^{T}$ satisfy the forward master equation

Equation (2.2)

and it is a simple matter to find W(t) from P1(t) (see section 2.1).

If N is known, finding all the transition rates ${\{{\lambda }_{i},{\mu }_{i}\}}_{1\leqslant i\leqslant N}$ from W(t) amounts to an exponential fitting problem, which is very ill-conditioned [34]. To overcome this difficulty, we assume that for every trajectory Xj(t) we also record the maximal site of the particle, nj. The maximal site for the random walker is simply the largest site number that it attains before exiting. By grouping trajectories according to their maximal site and computing the statistics of the extinction times of each group, we are able to accurately infer transition rates for birth death chains of length 8–11 from about 5 × 107 extinction times with relative error of a few percent.

To be more precise, if Xj(t) is the jth trajectory, then τj is the jth extinction time defined as $\inf \{\tau :{X}_{j}(\tau )=0\}$, ${n}_{j}={\max }_{0\leqslant t\leqslant {\tau }_{j}}{X}_{j}(t)$ is the maximal site of the jth trajectory and ${S}_{n}=\{{\tau }_{j}:{n}_{j}\leqslant n\}$ is the set of extinction times corresponding to trajectories whose maximal site does not exceed n. Then for a finite sample of trajectories, we have ${S}_{1}\subseteq {S}_{2}\subseteq \ldots \subseteq {S}_{N}$. The algorithm that we propose takes as input $| {S}_{n}| $ and ${\bar{S}}_{n}$ (cardinality and mean of Sn) to infer λn and μn for each n.

Random walkers that exit from the dashed box in figure 3 can exit at site 0 or n + 1. We informally call the times that correspond to exit at site 0 (n + 1) 'left' ('right') extinction times. Then, the distribution of extinction times for the birth-death process, conditioned on trajectories not exceeding site n is identical to the left extinction time distribution out of the sublattice {1, ..., n}. By finding finding analytical expressions for the moments of this distribution and matching them to the observed moments, we may infer the transition rates on the lattice. This forms the basis of our method.

2.1. Extinction times and probability fluxes

We assume that sites 0 and n + 1 are absorbing in the sense that if the particle reaches site 0 or n + 1 ('exits'), it stays at these sites for all time. We use the superscript n to distinguish the subproblem from the entire chain. Define the probability that the random walker is at site k at time t as

Equation (2.3)

If we take the n × n leading principal submatrix of ${A}^{(N)}$, then it follows that the conditional probabilities ${{\bf{P}}}^{(n)}={[{P}_{1}^{(n)},{P}_{2}^{(n)},\ldots ,{P}_{n}^{(n)}]}^{T}$ satisfy the forward master equations

Equation (2.4)

for 1 ≤ n ≤ N. Equation (2.4) is the starting point for the reconstruction process.

Now we introduce the two random variables

  • ${E}^{(n)}\in \{0,n+1\}$, a binary random variable which represents the exit site of the random walker:
    Equation (2.5)
    Equation (2.6)
    and ${{\rm{\Pi }}}^{(n)}+{{\rm{\Pi }}}_{* }^{(n)}=1$.
  • T(n), the extinction time of the random walker, defined to be the time at which the walker arrives either at site 0 or n + 1 for the first time. Conditioning on E(n), let the density of extinction times T(n) be ${w}_{L}^{(n)}(t)$ and ${w}_{R}^{(n)}(t)$:
    Equation (2.7)
    Equation (2.8)
    and we let ${W}_{L}^{(n)}(t)$ and ${W}_{R}^{(n)}(t)$ be the corresponding CDFs.

By equation (2.5), ${{\rm{\Pi }}}^{(n)}$ is strictly increasing with n. Now we show how extinction times are related to probability fluxes. The probability of the particle being at site 0 at time t + dt is given by

Equation (2.9)

Note that the probability of hopping left in time dt from site 1 is μ1dt and the probability of staying at site 0 in time dt is 1 since site 0 is absorbing. Equation (2.9) implies that

Equation (2.10)

Lemma 1. The flux out of site 1, ${\mu }_{1}{P}_{1}^{(n)}(t)$, and the left extinction time density ${w}_{L}^{(n)}(t)$ are related through

Equation (2.11)

where μ1 is the death rate from site 1 and Π(n) is defined in equation (2.5).

Proof. If the random walker is at site 0 at time t, it must have arrived there either at t or before. Then

Equation (2.12)

Equation (2.13)

Equation (2.14)

Equation (2.15)

Equation (2.16)

using equation (2.10). □

3. Algorithm for reconstructing transition rates

Our algorithm for transition rate reconstruction requires the following as input: for each n, the fraction of random walks that exit and whose maximal site does not exceed n; and the mean extinction time for these conditional random walks. For each n ≥ 1, Π(n) and ${\mathbb{E}}[{T}^{(n)}| {E}^{(n)}=0]\equiv {M}^{(n)}$ yield {λn, μn}: see figure 4.

Figure 4.

Figure 4. Flow chart of the algorithm presented in this paper. ${{\rm{\Pi }}}^{(n)}$ is the probability of left exit and ${M}^{(n)}$ is the mean of the extinction times, all conditioned on the particles remaining in the domain $\{1,\,\ldots ,\,n\}$ before exiting. At each site, a pair of birth and death rate at that site are recovered.

Standard image High-resolution image

3.1. Inference of μ1 and λ1

In the first step, we recover the birth and death rates at site 1. Note that ${{\bf{P}}}^{(1)}(t)$ only contains a single element, and the forward master equation can be written as

Equation (3.1)

with

Equation (3.2)

This simple ODE has solution

Equation (3.3)

Suppose we only consider left extinction times, with n = 1, generated by all trajectories that directly arrive at site 0 from site 1. These extinction times are exponentially distributed with parameter ${\lambda }_{1}+{\mu }_{1}$:

Equation (3.4)

It follows from the property of exponential distributions that

Equation (3.5)

In the next step, we use (2.11) to get

Equation (3.6)

Equation (3.7)

Equation (3.8)

Equation (3.9)

We now have obtained the approximations for μ1 and λ1.

3.2. Inference of μ2 and λ2

The forward master equations are for ${{\bf{P}}}^{(2)}(t)$ are

Equation (3.10)

where

Equation (3.11)

By (2.11), we have that

Equation (3.12)

Equation (3.13)

Equation (3.14)

Now introduce the Laplace transform ${ \mathcal L }\{P(t)\}=\tilde{P}(s)$. Then the Laplace transformed equation for ${P}_{1}^{(2)}(t)$ satisfies

Equation (3.15)

where ${\xi }_{2}^{(2)}={\lambda }_{1}+{\mu }_{1}+{\lambda }_{2}+{\mu }_{2}$, ${\xi }_{1}^{(2)}={\lambda }_{1}{\lambda }_{2}+{\mu }_{1}{\mu }_{2}+{\lambda }_{2}{\mu }_{1}$ and ${\eta }_{2}^{(2)}={\lambda }_{2}+{\mu }_{2}$. Taking derivatives with respect to s, we have

Equation (3.16)

and when s = 0,

Equation (3.17)

Equation (3.18)

Equations (3.17) and (3.18) can be rewritten as

Equation (3.19)

Equation (3.20)

a linear system for λ2, μ2 where

Equation (3.21)

Assuming λ1 and μ1 are known from the n = 1 case, solving equations (3.19) and (3.20) allows us to compute λ2 and μ2 from the conditional moments Π(2) and M(2).

3.3. Inference of μn and λn for n ≥ 3

Now we consider the n-th site after computing the birth and death rates for the first n − 1 sites. The Laplace transformed ODEs of ${\tilde{{\bf{P}}}}^{(n)}(s)={[{\tilde{P}}_{1}^{(n)}(s),\ldots ,{\tilde{P}}_{n}^{(n)}(s)]}^{T}$ can be represented in the following matrix form:

Equation (3.22)

where ${{\bf{e}}}_{1}^{(n)}={[1,0,\ldots ,0]}^{T}$ has n elements and In is the identity matrix of size n × n. Whenever s is not an eigenvalue of ${A}^{(n)}$, we have that ${\tilde{{\bf{P}}}}^{(n)}(s)={({{sI}}_{n}-{A}^{(n)})}^{-1}{{\bf{e}}}_{1}^{(n)}$. Let the characteristic polynomial of ${A}^{(n)}$ be

Equation (3.23)

where the $\{{\xi }_{i}^{(n)}\}$ are the coefficients of the characteristic polynomial of ${A}^{(n)}$, and they can be written as functions of birth and death rates:

Equation (3.24)

In addition, define

Equation (3.25)

Finally, we note that the coefficient of of sn in (3.23) is 1 and for notational convenience, define

Equation (3.26)

Equation (3.27)

Lemma 2. The Laplace-transformed probability ${\tilde{P}}_{1}^{(n)}(s)$, the first element in ${\tilde{{\bf{P}}}}^{(n)}(s)$ from equation (3.22), is a rational function in s satisfying

Equation (3.28)

for some constants $\{{\xi }_{i}^{(n)}\}{}_{i=1}^{n}$ and $\{{\eta }_{i}^{(n)}\}{}_{i=2}^{n}$.

Proof. Let ${\hat{A}}^{(k-1)}$ be the (k − 1) × (k − 1) submatrix of ${A}^{(k)}$ with the first row and first column removed, so that

Equation (3.29)

By Cramer's rule and the definition of the determinant in terms of its cofactor expansion, the first element of the solution vector ${\tilde{{\bf{P}}}}^{(n)}(s)$ can be calculated as

Equation (3.30)

Now introduce the polynomial

Equation (3.31)

for some coefficients ${c}_{n-1},{c}_{n-2},\,\ldots ,\,{c}_{1}$. Then it is clear that

Equation (3.32)

Equation (3.33)

from the definitions in equations (3.25) and the fact that the (1, 1) and (2, 1) entries of A(n) are zero when λ1 = μ1 = 0. Because $q(s=0;{\lambda }_{1}=0,{\mu }_{1}=0)=0$, it follows that ${\eta }_{1}^{(n)}=0$. Equations (3.32) and (3.33) imply that

Equation (3.34)

Comparing coefficients, we find that ${\eta }_{n}^{(n)}={c}_{n-1}$, ${\eta }_{n-1}^{(n)}={c}_{n-2},\,\ldots ,\,{\eta }_{2}^{(n)}={c}_{1}$, so that equation (3.31) becomes

Equation (3.35)

Equation (3.36)

Since the expression $\tfrac{{{\rm{\Pi }}}^{(n)}}{{\mu }_{1}}$ appears frequently in later analysis, we define the notation

Equation (3.37)

and use it in the analysis below.

Lemma 3. [Moment and Exit Probability Relations] Let ${M}^{(n)}\equiv {\mathbb{E}}[{T}^{(n)}| {E}^{(n)}=0]$, then

Equation (3.38)

Equation (3.39)

Proof. The Laplace transform of ${P}_{1}^{(n)}(t)$ is ${\tilde{P}}_{1}^{(n)}(s)={\int }_{0}^{\infty }{e}^{-{st}}{P}_{1}^{(n)}(t){dt}$. Integrating both sides of (2.11), we have that

Therefore, equation (3.38) holds:

If we differentiate the Laplace transform ${\tilde{P}}_{1}^{(n)}(s)$, then (3.39) is validated by

using equation (2.11). □

By setting s = 0 in (3.28) and its derivative equation, we find two constraints involving the exit probability and the first moment ${M}^{(n)}$ of the conditional extinction times using (3.38) (3.39):

Equation (3.40)

Equation (3.41)

Lemma 4. (Recurrence Relations.) The coefficients ${\xi }_{1}^{(n)}$, ${\xi }_{2}^{(n)}$ and ${\eta }_{2}^{(n)}$, ${\eta }_{3}^{(n)}$ are linear in ${\lambda }_{n}$ and ${\mu }_{n}$, satisfying

Equation (3.42)

Equation (3.43)

and

Equation (3.44)

Equation (3.45)

Proof. From the definition of A(n) and a cofactor expansion, we have

By equating coefficients at O(1) and O(s), we can establish a recurrence relation between the coefficients of the characteristic polynomial for the n-site subproblem and the n − 1 and n − 2 site subproblems:

Equation (3.46)

Equation (3.47)

It is clear from this recurrence that ${\xi }_{1}^{(n)}$ and ${\xi }_{2}^{(n)}$ are linear in the transition rates λn and μn since ${\xi }_{1}^{(n-1)}$ only depends on ${\lambda }_{1},\,\ldots ,\,{\lambda }_{n-1}$, ${\mu }_{1},\,\ldots ,\,{\mu }_{n-1}$ and ${\xi }_{1}^{(n-2)}$ only depends on ${\lambda }_{1},\,\ldots ,\,{\lambda }_{n-2}$, ${\mu }_{1},\,\ldots ,\,{\mu }_{n-2}$. A similar argument applied to $\det ({{sI}}_{n-1}-{\hat{A}}^{(n-1)})$ shows that $\{{\eta }_{2}^{(n)}\}$ and $\{{\eta }_{3}^{(n)}\}$ are linear in λn and μn also:

Equation (3.48)

Equation (3.49)

The way of computing (λn, μn) for the n-site subproblem is to rewrite equations (3.40), (3.41) and (3.42)–(3.45):

Equation (3.50)

where

Equation (3.51)

and σn = λn + μn. Eliminating the vector ${\left[\begin{array}{cccc}{\xi }_{1}^{(n)} & {\xi }_{2}^{(n)} & {\eta }_{2}^{(n)} & {\eta }_{3}^{(n)}\end{array}\right]}^{T}$, we find that

Equation (3.52)

If the matrix ${V}_{1}^{(n)}{V}_{2}^{(n)}$ is invertible, σn and μn are uniquely determined, and so λn and μn are uniquely determined.

Theorem 1. Given exact data $\{{{\rm{\Pi }}}^{(n)},{M}^{(n)}\},n=1,2,\,\ldots ,\,N$ generated by some underlying birth-death process, the rates $({\lambda }_{n},{\mu }_{n}),n=1,\,\ldots ,\,N$ are uniquely determined. In particular, the matrix

Equation (3.53)

from equations (3.19) and (3.20) is invertible and the 2 × 2 matrix ${V}_{1}^{(n)}{V}_{2}^{(n)}$ (where ${V}_{1}^{(n)}$ and ${V}_{2}^{(n)}$ are defined in equation (3.51)) is invertible for n ≥ 3.

Proof. Consider the case n = 2. Then, we see that

Equation (3.54)

Equation (3.55)

where we used the relations

Equation (3.56)

Equation (3.57)

Therefore (λ2, μ2) are uniquely determined. Now consider the case n ≥ 3. For reference, define

Equation (3.58)

To show that ${V}_{1}^{(n)}{V}_{2}^{(n)}$ is invertible, it is sufficient to show that $\det ({V}_{1}^{(n)}{\tilde{V}}_{2}^{(n)})\ne 0$. We split the proof into two parts. First, we find a simple expression for the determinant. Second, we show using induction that this expression is always non-zero.

Expression for Determinant. The determinant depends on r(n) and M(n), which are determined by the (perfect) data. Using the recurrence relations (3.42)–(3.45), we note that

Equation (3.59)

and

Therefore, after some algebra

Equation (3.60)

The denominator is always nonzero because ${\xi }_{1}^{(n)}={(-1)}^{n}\det ({A}^{(n)})\ne 0$. It is well-known that the eigenvalues of the infinitesimal generator matrix (and its submatrices) of a birth-death process are all negative [35], and the matrix ${A}^{(n)}$ is the transpose of a submatrix of such an infinitesimal generator. It remains to show that this expression is non-zero for $n\geqslant 3$.

Induction Proof. First we show that $\det ({V}_{1}^{(3)}{\tilde{V}}_{2}^{(3)})$ is non-zero. Because

Equation (3.61)

Equation (3.62)

Equation (3.63)

Equation (3.64)

Equation (3.65)

Now assume that $\det ({V}_{1}^{(n)}{\tilde{V}}_{2}^{(n)})$ is non-zero. Then

Equation (3.66)

It suffices to show that ${\eta }_{2}^{(n)}{\xi }_{1}^{(n-1)}-{\eta }_{2}^{(n-1)}{\xi }_{1}^{(n)}\ne 0$. Using (3.42) and (3.44),

Therefore $\det ({V}_{1}^{(n)}{\tilde{V}}_{2}^{(n)})$ is non-zero for n ≥ 3. Therefore ${V}_{1}^{(n)}{V}_{2}^{(n)}$ is invertible for n ≥ 3. From (3.52), σn and μn are uniquely determined and so (λn, μn) are also uniquely determined. □

In summary, at step n ≥ 3, we must solve the linear system

Equation (3.67)

which we may write as

Equation (3.68)

where

Equation (3.69)

The F(n) and G(n) depend on the previous transition rates ν1, ..., νn−1 and from theorem 1, F(n) is invertible. For reference, the entries of F(n) are:

Equation (3.70)

Equation (3.71)

Equation (3.72)

Equation (3.73)

Let's define the following notations

Equation (3.74)

Equation (3.75)

where the birth-death rates with asterisks stand for exact rates and ${{\rm{\Pi }}}^{(n)},{M}^{(n)}$ are exact data. In contrast, the elements in ${\boldsymbol{\delta }}{\bf{x}}$ are perturbations to the corresponding elements. Therefore the transition rates at site n depend on the rates at sites 1, 2, ..., n − 1:

Equation (3.76)

Equation (3.77)

where ${f}_{i}^{(n)}:\,{{\mathbb{R}}}^{4n-2}\to {\mathbb{R}}$ for $i=1,2$.

Theorem 2. (Error Propagation with site number.) Let ${\nu }_{n}={({\lambda }_{n},{\mu }_{n})}^{T},{D}_{n}={({{\rm{\Pi }}}^{(n)},{M}^{(n)})}^{T}$ and let ${\nu }_{n}^{* }$ be the exact transition rate at site n. Suppose that all first derivatives of ${f}_{1}^{(n)}$ and ${f}_{2}^{(n)}$ in equations (3.76), (3.77) are bounded in a small neighborhood B(x*,r) of x* in equation (3.74), i.e. there exists r, R > 0 such that for any x ∈ B(x*,r),

Equation (3.78)

for 1 ≤ n ≤ N and m = 1, 2. If a sufficiently small error δDk is introduced into the data ${\{{D}_{k}\}}_{k=1}^{n}$ at each site such that ${D}_{k}={D}_{k}^{* }+\delta {D}_{k}$ and $| | \delta {D}_{k}| {| }_{{\rm{\infty }}}\le r$ for $k=1,\,...,\,n$, then the error of birth-death rates at site n satisfies

Equation (3.79)

Proof. The analysis is fairly standard and makes use of Taylor series expansions. The exact rates satisfy

Equation (3.80)

By the mean value theorem for multivariate functions, the errors at each site satisfy

Equation (3.81)

Equation (3.82)

for some ${{\bf{z}}}_{1}^{(n)}={{\bf{x}}}^{* }+{c}_{1}^{(n)}{\boldsymbol{\delta }}{\bf{x}},{{\bf{z}}}_{2}^{(n)}={{\bf{x}}}^{* }+{c}_{2}^{(n)}{\boldsymbol{\delta }}{\bf{x}}$ with ${c}_{i}^{(n)}\in (0,1),i\,=\,1,2$, so that ${{\bf{z}}}_{1}^{(n)},{{\bf{z}}}_{2}^{(n)}\in B({{\bf{x}}}^{\ast },r)$. By equation (3.80), we have

Equation (3.83)

Equation (3.84)

Define the matrices

Equation (3.85)

and

Equation (3.86)

Therefore,

Equation (3.87)

In general, we can repeatedly substitute to get δνn in terms of ${\{\delta {D}_{k}\}}_{k=1}^{n}$ only:

Equation (3.88)

By equaton (3.78) we have $\parallel {R}_{k}^{(n)}{\parallel }_{\infty }\leqslant R$ and $\parallel {S}_{k}^{(n)}{\parallel }_{\infty }\leqslant R$ for all 1 ≤ i, j ≤ N. Then, a binomial expansion yields

Equation (3.89)

Hence the errors of data introduced at each site will propagate exponentially in its subsequent sites. In other words, at each site n, the error in the birth-death rates (λn, μn) is the result of accumulating the exponential growth of errors from sites 1 to n − 1.

Corollary 1. Define a constant D such that $D={{\max }}_{1\le j\le N}\parallel \delta {D}_{j}{\parallel }_{{\rm{\infty }}}$. Then equation (3.79) becomes

Equation (3.90)

In this special case where all errors in data are bounded by D, we also expect to observe the error in birth-death rates grow exponentially.

3.4. Algorithm details

The implementation of the algorithm starts with the inference at site 1 and site 2. In order to keep track of the recurrence relation in lemma 4, we only need to focus on the 'feature vector' defined as ${{\bf{u}}}_{n}={[{\xi }_{1}^{(n)},{\xi }_{2}^{(n)},{\eta }_{2}^{(n)},{\eta }_{3}^{(n)}]}^{T}$ at each site n. For site 1, we have

Equation (3.91)

For site 2, we have

Equation (3.92)

Then we can define the matrix

Equation (3.93)

where the j-th column is the feature vector at site j.

At each step j ≥ 3, we need to solve linear system defined by (3.68) and (3.69) to obtain [λj, μj], and update the j-th column of Z by lemma 4. Details are listed in algorithm 1.

Algorithm 1. Inference of birth and death rates up to site N in a birth death chain

Input: An array of extinction times $\{{\tau }_{j}\}$ along with maximal sites {nj} from repeated
        simulation of a birth death process (j = 1, ..., L).
Initialize: For $k=1\text{to}N$, find all τj such that nj = k. Compute number of extinctions
        as a fraction of L(k)) and the mean of the extinction times (M(k)). Initialize Z as a
        4 × N zero matrix.
  1: At site 1, $\left[\begin{array}{c}{\lambda }_{1}\\ {\mu }_{1}\end{array}\right]=\left[\begin{array}{c}(1-{{\rm{\Pi }}}^{(1)})/{M}^{(1)}\\ {{\rm{\Pi }}}^{(1)}/{M}^{(1)}\end{array}\right]$, as in (3.8), (3.9). Feature vector ${Z}_{:,1}={{\bf{u}}}_{1}$ is defined
        in (3.91).
  2: if N == 1 then return {μ1, λ1}
  3: end if
  4: At site 2, $\left[\begin{array}{c}{\lambda }_{2}\\ {\mu }_{2}\end{array}\right]={\left[\begin{array}{cc}{r}^{(2)}({\lambda }_{1}+{\mu }_{1})-1 & {{\rm{\Pi }}}^{(2)}-1\\ 1-{M}^{(2)}({\lambda }_{1}+{\mu }_{1}) & 1-{M}^{(2)}{\mu }_{1}\end{array}\right]}^{-1}\left[\begin{array}{c}0\\ 1/{r}^{(2)}-{\lambda }_{1}-{\mu }_{1}\end{array}\right]$, see
        (3.19) (3.20). Feature vector ${Z}_{:,2}={{\bf{u}}}_{2}$ is defined as (3.92).
  5: for j = 3 : N do
  6:     Compute ${V}_{1}^{(j)}$ and ${V}_{2}^{(j)}$ as in (3.51).
  7:     Form the matrices ${F}^{(j)}$ and ${G}^{(j)}$ as in (3.68) and (3.69).
  8:     Solve ${[{\lambda }_{j},{\mu }_{j}]}^{T}={({F}^{(j)})}^{-1}{G}^{(j)}$.
  9:     Update ${Z}_{:,j}={V}_{2}^{(j)}\left[\begin{array}{c}{\lambda }_{j}+{\mu }_{j}\\ {\mu }_{j}\end{array}\right]+\left[\begin{array}{c}0\\ {Z}_{1,j-1}\\ 0\\ {Z}_{3,j-1}\end{array}\right].$
 10:    if j == N then
 11:       ${\lambda }_{j}=0$
 12:    end if
 13: end for
Output: {μ1, ..., μN} and $\{{\lambda }_{1},\,\ldots ,\,{\lambda }_{N-1}\}$

4. Numerical results

Below we reconstruct the transition rates for proteins with N folding/unfolding domains. When all domains are folded, the reaction coordinate is 0 and if all domains are unfolded, the reaction coordinate is N. Every domain that unfolds (folds) increases (decreases) the reaction coordinate by an integer.

Example 1. Protein with N = 4 domains. First we present a simple result of a birth-death chain with only 5 states and some pre-determined rates. All the rates are about the same order of magnitude, see figure 5.

The extinction time data is generated by simulating the birth-death process with these given rates, and we evaluate the reconstruction in comparison to them. With about 5 × 106 extinction times, we can infer the rates to very good accuracy—about 0.11% and 0.37% relative errors in $\lambda $ and $\mu $, respectively. First, the sum of birth and death rates λ1 + μ1 follows from the mean extinction time conditioned on immediate exit through equation (3.5). The fraction of times corresponding to immediate exit then yields μ1 and λ1 separately through (3.8) and (3.9). Next, λ2 and μ2 are computed by (3.19) (3.20). Then for n = 3,4, we compute the n-th columns of matrix Z as in (3.93) and obtain {λn, μn} simultaneously. Notice that the last birth rate λN is always assumed to be zero and no inference is necessary at that site.

Figure 5.

Figure 5. Bar plots of the inference results in a 5-site birth death chain. The top subplot (a) contains rates for μk and bottom subplot (b) for λk. The bars in dark blue represent numerically approximated rates, and yellow bars stand for exact rates. On top of each bar is the value associated with it.

Standard image High-resolution image

Example 2. Protein with N = 10 domains. We now test our reconstruction algorithm on a longer, 11-site chain. In this result, 5 × 107 extinction times are simulated from an 11-site birth-death chain. Following the same steps, we have the inference results with a relative error in $\lambda $ and $\mu $ to be 3.29% and 3.71%, respectively: see figure 6.

Figure 6.

Figure 6. Bar plots of the inference results in a 11-site birth death chain. The top subplot (a) contains rates for μk and bottom subplot (b) for λk. The bars in dark blue represent numerically approximated rates, and yellow bars stand for exact rates.

Standard image High-resolution image

Example 3. Protein with N = 10 domains and a single high activation energy barrier. This birth-death chain has a 'bottleneck' between states 3 and 4 so that the transition rates between these two configurations are much smaller than the others: see figure 7. The accuracy of the transition rates of sites after the bottleneck are more affected than the accuracy before the bottleneck, due to the fact that all inferences of sites after the bottleneck depend on the extinction times generated by particles that have gone through the bottleneck. With about 5 × 107 extinction times, there are 1.17% and 1.24% relative errors in $\lambda $ and $\mu $, respectively. The effect of the bottleneck is evident from the data since we see an abrupt increase in ${M}^{(n)}$ as n goes from 3 to 4.

Figure 7.

Figure 7. Bar plots of the inference results in a 9-site birth death chain with a bottleneck at site number 4. The top subplot (a) contains rates for μk and bottom subplot (b) for λk. The bars in dark blue represent numerically approximated rates, and yellow bars stand for exact rates.

Standard image High-resolution image

Example 4. Protein with N = 8 domains and a single low activation energy barrier. This protein has a very low barrier between states 5 and 6; the transition rates between these sites are very large. From an AFM trace, it may be difficult to distinguish or separate these two configurations. However, given 5 × 107 simulated data, the birth and death rates can be inferred within a reasonable amount of accuracy, and the overall relative error is 2.21% in λk and 2.24% in μk. See figure 8.

Figure 8.

Figure 8. Bar plots of the inference results in a 9-site birth death chain with extreme rates. The top subplot (a) contains rates for μk and bottom subplot (b) for λk. The bars in dark blue represent numerically approximated rates, and yellow bars stand for exact rates.

Standard image High-resolution image

Example 5. 11-site birth death chain error propagation. In this example, we show how error propagates with site number. The birth and death rates are all equal to 1, and 1 × 108 extinction times are generated from the Monte Carlo simulation. Bootstrap is done by taking random samples of size 5 × 106 from these extinction times and errors at each site are computed. This resampling procedure is repeated 50 times, and figures 9 and 10 display the mean error and 95% confidence intervals from the above 50 samples, where it is clear that errors for both λk and μk increase exponentially with site number, as a result of theorem 2. Although this example uses noisy data, this exponential increase with site number is not surprising in light of theorem 2.

Figure 9.

Figure 9. Error plot for λk at each site, taken as the average of 50 random samples of 1 × 108 extinction times.

Standard image High-resolution image
Figure 10.

Figure 10. Error plot for μk at each site, taken as the average of 50 random samples of 1 × 108 extinction times.

Standard image High-resolution image

Example 6. Minimum number of extinction times required. Finally, we consider the minimum number of extinction times required for a birth death chain of length N + 1 such that relative error for both λk and μk is below 15%. For each N, 50 bootstrap samples of a fixed size are selected and the average relative errors are computed from these samples. This procedure is then repeated for different sample sizes, from which we can estimate the minimum number of extinction times required for the chain of length N + 1 to have error below 15%. Note that this estimation applies to the chain where all birth-death rates are about the same order of magnitude. Detail is in table 1, and figure 11. It is obvious from the plot that minimum number of extinction times required is increasing exponentially with the length of chain.

Figure 11.

Figure 11. Number of extinction times required for rates to have relative error below 15%. The best-fit exponential curve is shown in red.

Standard image High-resolution image

Table 1.  Number of extinction times required for rates to have relative error below 15%, on chains of different lengths. The birth-death rates all have the same order of magnitude.

N (Estimated) Minimum number of ETs
1 5.0 × 101
2 1.5 × 102
3 5.0 × 102
4 1.0 × 104
5 1.0 × 105
6 3.0 × 105
7 2.0 × 106
8 4.0 × 106
9 7.0 × 106

5. Conclusions

In summary, we have presented a method for extracting the kinetic rates of large proteins, with multiple folding domains, from extinction times (when all domains have folded) and 'maximal sites' (the maximum number of unfolded domains before extinction). Both of these quantities can, in principle, be computed from AFM time traces.

The inference relies on the recurrence relation specified in lemma 4 and starts with base cases when n = 1, 2, and inference of each subsequent site depends on its previous sites by solving a linear system. If the data ${{\rm{\Pi }}}^{(n)}$ and ${M}^{(n)}$ are exactly given, meaning that they correspond to the statistics of an underlying birth-death process, then the birth-death rates are uniquely determined, and we proved that a small perturbation in site 1 will propagate exponentially throughout the chain, given that the first derivatives of ${f}_{1}^{(n)}$ and ${f}_{2}^{(n)}$ in theorem 2 are bounded near the exact solution. With sufficient data (about 50 million for a chain of length 8–12), the method can compute these rates to very good accuracy and is capable of detecting bottlenecks or extreme values in the chain. In general, the number of extinction times one needs to obtain reasonable results grows exponentially with the total length of the chain.

There still remain many theoretical challenges to interpreting single-molecule AFM data. For example, inference from extinction times in the 'transmission' problem where absorption/extinction is at site N rather than site 0 is severely ill-posed. Can some aspect of time-trace data be used to better-condition this problem? Also, bifurcations or loops in the birth-death chain representing multiple pathways to a final unfolded state have not yet been explored. Finally, it remains to be seen if our method can be adapted to force-ramp data, or how to proceed if transition times between metastable configurations are not exponentially distributed.

Appendix

There is an alternative way to infer the birth death rates, instead of the recurrence relations. This method is simple, yet more computationally intensive if number of sites is large.

In light of lemma 4, we can assume the following forms:

Equation (A.1)

Equation (A.2)

where ${C}^{(n)}$ and ${\hat{C}}^{(n)}$ are coefficients that depend on ${\lambda }_{1},\,\ldots ,\,{\lambda }_{n-1}$, ${\mu }_{1},\,\ldots ,\,{\mu }_{n-1};$ we describe how to compute ${C}^{(n)}$ and ${\hat{C}}^{(n)}$ in section A.1. If we plug (A.1) and (A.2) into (3.40) and (3.41), and denote ${r}^{(n)}={{\rm{\Pi }}}^{(n)}/{\mu }_{1}$, then we get a linear system in two variables $({\lambda }_{n},{\mu }_{n})$:

Equation (A.3)

where the 2 × 2 matrix F(n) and 2-vector G(n) are defined as

Equation (A.4)

Equation (A.5)

This matrix is consistent with equations (3.70)–(3.73). The rates at current site are therefore given by ${[{\lambda }_{n},{\mu }_{n}]}^{T}={({F}^{(n)})}^{-1}{G}^{(n)}$.

A.1. Leverrier-faddeev algorithm

The method for computing the coefficient matrices ${C}^{(n)}$ and ${\hat{C}}^{(n)}$ is the Leverrier-Faddeev (L-F) algorithm [36]. The L-F algorithm computes all coefficients of the characteristic polynomial for a given n × n square matrix with time complexity ${ \mathcal O }({n}^{4})$. In this problem, it suffices to get only the last three coefficients ${\xi }_{3}^{(n)}$, ${\xi }_{2}^{(n)}$ and ${\xi }_{1}^{(n)}$. Given the expression in (A.1), we have

Equation (A.6)

for i = 1, 2. With (A.2), we also have

Equation (A.7)

for j = 2, 3. A concise and straightforward algorithm containing all steps is presented in algorithm 2.

Algorithm 2. Inference of birth and death rates up to site N with L-F algorithm

Input: An array of extinction times T along with maximal site of repeated simulation of a
        birth death process.
Initialize: Compute the conditional probabilities of left exit $\{{{\rm{\Pi }}}^{(1)},\,\ldots ,\,{{\rm{\Pi }}}^{(N)}\}$, and mean of
        conditional extinction times $\{{M}^{(1)},\,\ldots ,\,{M}^{(N)}\}$.
  1: At site 1, ${\mu }_{1}={{\rm{\Pi }}}^{(1)}/{M}^{(1)}$ and ${\lambda }_{1}=(1-{{\rm{\Pi }}}^{(1)})/{M}^{(1)}$.
  2: for j = 2 : N do
  3:     Compute the coefficients of the characteristic polynomial of ${A}^{(j)}$ to obtain C and $\hat{C}$,
     see section A.1;
  4:     Form the constraint matrices ${F}^{(j)}$ and ${G}^{(j)}$ as in (A.4) and (A.5);
  5:     Solve ${[{\lambda }_{j},{\mu }_{j}]}^{T}={({F}^{(j)})}^{-1}{G}^{(j)}$.
  6:     if j == N then
  7:        ${\lambda }_{j}=0$
  8:     end if
  9: end for
Output: {μ1, ..., μN} and {λ1, ..., λN−1}

Please wait… references are loading.
10.1088/2399-6528/aacca5